Dataset link: https://www.kaggle.com/datasets/gpreda/data-science-on-reddit

### Importing necessary libraries

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [44]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

### Importing CSV file

In [3]:
df = pd.read_csv("data_science.csv")
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
0,STEM Career Change,5,swvi7j,https://www.reddit.com/r/datascience/comments/...,6,1645341000.0,I’m currently working as a field biologist for...,2022-02-20 09:17:13
1,Comment,79,hxct3v5,,0,1645130000.0,DataScientologists,2022-02-17 22:34:46
2,Comment,1,hxcsshc,,0,1645130000.0,Sounds like you'll need some sort of fuzzy mat...,2022-02-17 22:32:44
3,Comment,2,hxcva4l,,0,1645131000.0,Best of both worlds. Work in DS without workin...,2022-02-17 22:48:40
4,Comment,1,hxcuqf2,,0,1645131000.0,"Hi u/Ok_Acanthisitta5478, I removed your submi...",2022-02-17 22:45:11


### Shape of Dataset

In [4]:
df.shape

(21095, 8)

In [5]:
len(df['title'].unique())

1868

In [6]:
df["score"].describe()

count    21095.000000
mean         9.913107
std         58.987466
min        -91.000000
25%          1.000000
50%          2.000000
75%          4.000000
max       2952.000000
Name: score, dtype: float64

In [7]:
len(df['id'].unique())

21095

### Checking null values

In [8]:
df.isnull().sum()

title            0
score            0
id               0
url          19225
comms_num        0
created          0
body           201
timestamp        0
dtype: int64

### Removing unused columns

In [9]:
df.drop(['id','url','created','timestamp'],axis=1,inplace=True)

In [10]:
df.head()

Unnamed: 0,title,score,comms_num,body
0,STEM Career Change,5,6,I’m currently working as a field biologist for...
1,Comment,79,0,DataScientologists
2,Comment,1,0,Sounds like you'll need some sort of fuzzy mat...
3,Comment,2,0,Best of both worlds. Work in DS without workin...
4,Comment,1,0,"Hi u/Ok_Acanthisitta5478, I removed your submi..."


In [11]:
df.isnull().sum()

title          0
score          0
comms_num      0
body         201
dtype: int64

### Removing null values

In [12]:
df.dropna(inplace=True)

In [13]:
df.isnull().sum()

title        0
score        0
comms_num    0
body         0
dtype: int64

In [14]:
df.shape

(20894, 4)

### Dropping duplicate values

In [15]:
df.drop_duplicates(inplace=True)

In [16]:
df.shape

(20674, 4)

In [17]:
df['body'][0]

'I’m currently working as a field biologist for fisheries research, and am looking to transfer into a more data-science oriented career field. I’ve grown tired of the field work side and love the data side, while most of my coworkers are the opposite.\n\nI have a M.S. in Environmental Science, with coursework in single and multivariate stats, although I don’t use very much complicated math in my job. I have more experience than most of my early-career coworkers with R, and do use it at work, but am light years behind the statisticians in my office. No experience in Python, SQL, or any other data science software. \n\nMy questions would be:\n\n1. What skills would I need to gain / build on before making the switch? \n\n2. What’s a reasonable entry salary? Biologists don’t make great money so almost anything would be an increase haha.  \n\n3. Are online courses / certifications worth it? The amount of marketing I see for those is insane. \n\nI luckily have access to large amounts of data

In [18]:
df['title'][0]

'STEM Career Change'

In [19]:
df.head(25)

Unnamed: 0,title,score,comms_num,body
0,STEM Career Change,5,6,I’m currently working as a field biologist for...
1,Comment,79,0,DataScientologists
2,Comment,1,0,Sounds like you'll need some sort of fuzzy mat...
3,Comment,2,0,Best of both worlds. Work in DS without workin...
4,Comment,1,0,"Hi u/Ok_Acanthisitta5478, I removed your submi..."
5,Comment,1,0,So I have the opportunity to officially enter ...
6,Comment,3,0,> You should know that nested loops has expone...
7,Comment,2,0,Though it's mainly taught in ECE these days.
8,Comment,2,0,/u/Morodin_88 read my mind. \n\nYou can be a...
9,Comment,2,0,Agreed... ITT a bunch of people trying to just...


### Creating a new dataframe containing only comments

In [20]:
# .copy() gives an independent frame, so later edits do not act on a slice of df
comment = df[df['title'] == "Comment"].copy()
comment.head()

Unnamed: 0,title,score,comms_num,body
1,Comment,79,0,DataScientologists
2,Comment,1,0,Sounds like you'll need some sort of fuzzy mat...
3,Comment,2,0,Best of both worlds. Work in DS without workin...
4,Comment,1,0,"Hi u/Ok_Acanthisitta5478, I removed your submi..."
5,Comment,1,0,So I have the opportunity to officially enter ...


In [21]:
comment.describe()

Unnamed: 0,score,comms_num
count,19005.0,19005.0
mean,5.918969,0.0
std,22.602571,0.0
min,-91.0,0.0
25%,1.0,0.0
50%,2.0,0.0
75%,3.0,0.0
max,990.0,0.0


In [22]:
comment['score'].value_counts()[:25]

 1     8486
 2     3420
 3     1578
 4      675
 5      623
 0      461
 6      451
 7      346
 8      253
 9      249
 10     193
 11     141
-1      140
 12     120
 13     111
 14      92
-2       84
 15      77
 16      73
 17      69
 18      57
 19      54
-3       50
 21      47
 20      44
Name: score, dtype: int64

### Dataframe without comments

In [23]:
no_comment = df[df['title'] != "Comment"]

In [24]:
no_comment['title'].value_counts()[:10]

Is there appetite for a separate space for experienced DS?                    2
transportation engineer to become an transportation data scientist            2
STEM Career Change                                                            1
Anyone here a risk data analyst for a Trust and Safety team?                  1
What does working with a good Product or Program Manager look like?           1
Switching from basic ML to a different domain - Recommender Systems vs NLP    1
Missing values for categorical variables                                      1
What's a good metric to compare VAR/VECM models with?                         1
Webscraper that scrapes every hour                                            1
Building out data science team. Need help.                                    1
Name: title, dtype: int64

In [25]:
comment.shape

(19005, 4)

In [26]:
no_comment.shape

(1669, 4)

We will predict the `score` of each comment from the text in its `body` column.

### Text Preprocessing

In [28]:
import warnings
warnings.filterwarnings("ignore")

In [29]:
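# Replace every non-letter character with a space; assign the result back
# instead of calling replace(..., inplace=True) on a selected column
comment['body'] = comment['body'].replace("[^a-zA-Z]", " ", regex=True)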

In [30]:
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()

In [35]:
# Build the stop-word set once rather than once per word
stop_words = set(stopwords.words('english'))

for i in range(comment.shape[0]):
    # Lowercase the comment body (column 3)
    sentence = comment.iloc[i,3].lower()
    # Split into individual words
    sentence = sentence.split()
    # Remove stop words
    sentence = [word for word in sentence if word not in stop_words]
    # Lemmatize each remaining word
    sentence = [lemma.lemmatize(word) for word in sentence]
    # Join the remaining words back into a single string
    sentence = " ".join(sentence)
    comment.iloc[i,3] = sentence

In [37]:
comment.head()

Unnamed: 0,title,score,comms_num,body
1,Comment,79,0,datascientologists
2,Comment,1,0,sound like need sort fuzzy matching get confid...
3,Comment,2,0,best world work d without working d hehe
4,Comment,1,0,hi u ok acanthisitta removed submission follow...
5,Comment,1,0,opportunity officially enter data field large ...
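The cleaning steps above (strip non-letters, lowercase, drop stop words, lemmatize, rejoin) can also be gathered into a single function. This is a minimal sketch, not the notebook's own code: `clean_text`, its `stop_words` and `lemmatize` parameters, and the tiny stop-word set are hypothetical names chosen for illustration, and lemmatization defaults to a no-op so the snippet runs without NLTK data.

```python
import re

def clean_text(text, stop_words=frozenset(), lemmatize=lambda word: word):
    """Strip non-letters, lowercase, drop stop words, lemmatize, and rejoin."""
    # Replace every non-letter character with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace
    words = text.lower().split()
    # Drop stop words, then lemmatize what is left
    words = [lemmatize(word) for word in words if word not in stop_words]
    return " ".join(words)

# Toy stop-word set (an assumption; the notebook uses NLTK's English list)
toy_stop_words = {"you", "ll", "some", "of"}
cleaned = clean_text("Sounds like you'll need some sort of fuzzy matching!", toy_stop_words)
print(cleaned)
```

With NLTK loaded, passing `set(stopwords.words('english'))` and `lemma.lemmatize` as the two optional arguments should reproduce the per-row logic of the loop, for example through `comment['body'].apply(...)`.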


### Convert Words to Vector

In [38]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#### Bag of Words

In [39]:
bow = CountVectorizer(ngram_range=(1,1))
X = bow.fit_transform(comment['body'])

### Independent Variable

In [40]:
X.shape

(19005, 19779)
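Each row of `X` corresponds to one comment and each column to one vocabulary term, so the shape above means 19005 comments and 19779 distinct terms. A small sketch on a toy corpus shows what the matrix contains; the corpus and the `toy_` names are hypothetical and not taken from the dataset. Note that `CountVectorizer`'s default token pattern only keeps tokens of two or more characters, so a leftover single letter such as `d` does not become a feature.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from the dataset)
toy_corpus = ["best world work d", "work data work"]

toy_bow = CountVectorizer(ngram_range=(1, 1))
toy_X = toy_bow.fit_transform(toy_corpus)

# vocabulary_ maps each token to its column index in the matrix
toy_vocab = {token: int(idx) for token, idx in sorted(toy_bow.vocabulary_.items())}
print(toy_vocab)

# Dense view of the sparse count matrix: one row per document
print(toy_X.toarray())
```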

### Dependent Variable

In [41]:
y = comment['score']

In [42]:
y.shape

(19005,)

### Splitting our dataset into training and testing sets

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

In [46]:
X_train.shape

(15204, 19779)

In [47]:
X_test.shape

(3801, 19779)

### Train our model using Random Forest Regressor

In [48]:
reg = RandomForestRegressor()
reg.fit(X_train, y_train)

### Model Score (R²)

In [49]:
reg.score(X_test, y_test)

-0.3384614330140856
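For a regressor, `score` returns the coefficient of determination R², not a classification accuracy. R² is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean, so a negative value means the model does worse on the test set than that constant baseline. A minimal sketch with made-up numbers (the `r2_manual` helper and the toy lists are hypothetical, not from the notebook):

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed by hand."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy scores with one large outlier (made-up values)
toy_true = [1, 1, 2, 3, 79]
toy_mean = sum(toy_true) / len(toy_true)

# Always predicting the mean gives R^2 = 0
r2_baseline = r2_manual(toy_true, [toy_mean] * len(toy_true))

# Predictions with a larger squared error than the mean give R^2 < 0
r2_worse = r2_manual(toy_true, [10, 10, 10, 10, 10])

print(r2_baseline, r2_worse)
```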

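In [50]:
from sklearn.linear_model import LinearRegression
# Note: reg is still the RandomForestRegressor from In [48]; LinearRegression is
# imported but never instantiated, so this call refits the random forest.
reg.fit(X_train, y_train)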

In [51]:
reg.score(X_test, y_test)

-0.3852470290415806
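The score above comes from a second random forest fit, since `LinearRegression` is imported in `In [50]` but never instantiated. If a linear baseline was the intent, a minimal sketch on toy data might look like this; `lin_reg` and the toy arrays are hypothetical names, and no result for the Reddit data is implied.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly (made-up values)
toy_features = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_target = np.array([1.0, 3.0, 5.0, 7.0])

# Instantiate the linear model explicitly, then fit and score it
lin_reg = LinearRegression()
lin_reg.fit(toy_features, toy_target)
toy_r2 = lin_reg.score(toy_features, toy_target)

print(lin_reg.coef_, lin_reg.intercept_, toy_r2)
```

In the notebook, the same pattern would use `X_train`, `y_train`, `X_test`, and `y_test` in place of the toy arrays.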