# Reddit Comment Score Prediction

---
Reddit is a social news platform that allows users to discuss and vote on content that other users have submitted. On average, Reddit receives 470,000 comments per day. Registered users can then upvote or downvote these comments.

Imagine you are starting a forum where users can post, comment on, and share content. You want to pick out the well-received comments and recommend them to your users.
___


Build a machine learning model that predicts which comments or content will become popular in the near future (content that receives the most upvotes counts as popular), so that you can recommend it to your users.
___

Dataset Link: https://dphi.tech/challenges/data-sprint-36-reddit-comment-score-prediction/89/data

In [16]:
import pandas as pd

# text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
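# One-time downloads of the NLTK resources used below
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')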
from string import punctuation

# feature engineering
from sklearn.feature_extraction.text import TfidfVectorizer

# model building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [17]:
df = pd.read_csv('_6_3_Train_Data.csv')
df

Unnamed: 0,text,author,controversiality,parent_text,parent_score,parent_votes,parent_author,parent_controversiality,Score
0,i must be retarded i thought it meant con lawl...,['calantus'],0,"It's quite unfair to call Hillary Clinton a ""c...",245,245,Whisper,0,-8
1,DOWNMODDED FOR IRRELEVANCE? ISN'T THAT HOW THI...,['Shadowrose'],0,upmodded for awesome kindness,32,32,b3mus3d,0,-16
2,"THAT WAS SUPPOSED TO MEAN "" BY A PLACE WHERE P...",['NExusRush'],0,"What the hell does ""because its by a golf cour...",12,12,mr_jellyneck,0,-7
3,I THOUGHT EVERYONE DID; ITS FUCKING DELICIOUS :\,['R0N_SWANS0N'],0,NICE TRY JENNIFER! I KNOW IT'S YOU AND I KNOW...,117,117,ometzo,0,67
4,"Great work, Zhesbe! I'd give you a raise but y...",['reddums'],0,"""HEY BOSS COME LOOK AT WHAT I DID!""",1933,1933,Zhesbe,0,1348
...,...,...,...,...,...,...,...,...,...
4994,"Dying words of my father: ""Son, one day a man ...",['Karmamechanic'],0,"Gather round, drinking buddies. It's that tim...",540,540,willis77,0,234
4995,CATERING TO EVERYONE AND THEIR IMPOSSIBLE SIMU...,['Schym'],0,So basically Sona players will get the authent...,560,560,sleeplessone,0,107
4996,RABBLERABBLERABBLERABBLE!,['Azurphax'],0,"**everyone, its Forthewolfx!**",370,370,KinkyTraficCone,0,193
4997,"LITTLE KNOWN FACT, ""VIOLA"" IS NOT ONLY A FRENC...",['DcGutz'],0,"Ending the comment train with ""Voila.""",4,4,Anderson0457,0,-8


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   text                     4999 non-null   object
 1   author                   4999 non-null   object
 2   controversiality         4999 non-null   int64 
 3   parent_text              4999 non-null   object
 4   parent_score             4999 non-null   int64 
 5   parent_votes             4999 non-null   int64 
 6   parent_author            4999 non-null   object
 7   parent_controversiality  4999 non-null   int64 
 8   Score                    4999 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 351.6+ KB


In [19]:
df.describe()

Unnamed: 0,controversiality,parent_score,parent_votes,parent_controversiality,Score
count,4999.0,4999.0,4999.0,4999.0,4999.0
mean,0.0006,216.714943,216.714943,0.0012,89.810962
std,0.024492,449.422467,449.422467,0.034627,200.739917
min,0.0,-8907.0,-8907.0,0.0,-1658.0
25%,0.0,13.0,13.0,0.0,-10.0
50%,0.0,67.0,67.0,0.0,66.0
75%,0.0,229.0,229.0,0.0,115.5
max,1.0,5619.0,5619.0,1.0,3133.0


In [20]:
# checking unique values in every column
df.nunique()

text                       4992
author                     4317
controversiality              2
parent_text                4992
parent_score               1001
parent_votes               1001
parent_author              4448
parent_controversiality       2
Score                       572
dtype: int64

In [21]:
# Check whether parent_score and parent_votes are identical in every row

set(df['parent_score'] == df['parent_votes'])

{True}
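
The set-based check works, but it returns a set that has to be read by eye. As a small alternative sketch, `Series.equals` gives a single boolean. The toy frame below is invented for illustration and only mimics the two column names.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "parent_score": [245, 32, 12],
    "parent_votes": [245, 32, 12],
})

# Series.equals compares the two columns element by element
# and returns one boolean for the whole comparison.
identical = toy["parent_score"].equals(toy["parent_votes"])
print(identical)
```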

# Data preprocessing

In [22]:
df['controversiality'].value_counts()

0    4996
1       3
Name: controversiality, dtype: int64

In [23]:
# Drop parent_votes: it is identical to parent_score (checked above)
df.drop('parent_votes', axis=1, inplace=True)

# Drop the author names and the near-constant controversiality flags
df.drop(['author', 'parent_author', 'controversiality', 'parent_controversiality'], axis=1, inplace=True)

In [24]:
# dropping duplicates
df.drop_duplicates(inplace = True)
df.reset_index(inplace = True, drop = True)

In [25]:
df.head()

Unnamed: 0,text,parent_text,parent_score,Score
0,i must be retarded i thought it meant con lawl...,"It's quite unfair to call Hillary Clinton a ""c...",245,-8
1,DOWNMODDED FOR IRRELEVANCE? ISN'T THAT HOW THI...,upmodded for awesome kindness,32,-16
2,"THAT WAS SUPPOSED TO MEAN "" BY A PLACE WHERE P...","What the hell does ""because its by a golf cour...",12,-7
3,I THOUGHT EVERYONE DID; ITS FUCKING DELICIOUS :\,NICE TRY JENNIFER! I KNOW IT'S YOU AND I KNOW...,117,67
4,"Great work, Zhesbe! I'd give you a raise but y...","""HEY BOSS COME LOOK AT WHAT I DID!""",1933,1348


In [26]:
# Clean the text data

lemmatizer = WordNetLemmatizer()

# Build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Replace everything except ASCII letters (digits, punctuation, symbols) with spaces
    text = re.sub('[^a-zA-Z]', ' ', text)
    
    # Convert text to lowercase
    text = text.lower()
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stop words (punctuation is already gone after the regex above)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    
    # Join the tokens back into a single string
    return ' '.join(lemmatized_tokens)

# Apply text cleaning to 'text' and 'parent_text' columns
df['text'] = df['text'].apply(clean_text)
df['parent_text'] = df['parent_text'].apply(clean_text)

# Print the preprocessed data
df.head()


Unnamed: 0,text,parent_text,parent_score,Score
0,must retarded thought meant con lawl oh well work,quite unfair call hillary clinton cunt lack de...,245,-8
1,downmodded irrelevance work,upmodded awesome kindness,32,-16
2,supposed mean place people undoubtedly snake b...,hell golf course anything think bunch rich whi...,12,-7
3,thought everyone fucking delicious,nice try jennifer know know like baba ganoush,117,67
4,great work zhesbe give raise seem handled,hey bos come look,1933,1348
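
The regex step replaces apostrophes and digits with spaces, so contractions are split into fragments before tokenization. The sketch below reproduces only the first two steps of `clean_text` with the standard library; `strip_non_letters` is a hypothetical helper name and the sample string is adapted from row 1 above.

```python
import re

def strip_non_letters(text: str) -> str:
    """Mirror the first two steps of clean_text: keep ASCII letters, then lowercase."""
    return re.sub('[^a-zA-Z]', ' ', text).lower()

sample = "ISN'T THAT HOW THIS WORKS? 100%"
tokens = strip_non_letters(sample).split()
print(tokens)  # ['isn', 't', 'that', 'how', 'this', 'works']
```

Fragments such as `isn` and `t` are then removed by the stop-word filter, which is consistent with row 1 being reduced to `downmodded irrelevance work`. Numbers and emoticons are lost entirely, which may or may not matter for predicting scores.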


In [27]:
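# TfidfVectorizer and pandas are already imported in the first cell
vectorizer = TfidfVectorizer()

# Fit the vocabulary on the comment text and reuse it for the parent text
text_features = vectorizer.fit_transform(df['text'])
vocab = vectorizer.get_feature_names_out()

# Prefix the column names so the two TF-IDF blocks stay distinguishable
# and cannot collide with existing columns (for example a token named "text")
text_df = pd.DataFrame(text_features.toarray(), columns=[f'text_tfidf_{w}' for w in vocab])

parent_text_features = vectorizer.transform(df['parent_text'])
parent_text_df = pd.DataFrame(parent_text_features.toarray(), columns=[f'parent_tfidf_{w}' for w in vocab])

df = pd.concat([df, text_df, parent_text_df], axis=1)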
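
Calling `toarray()` materializes two dense blocks of size rows × vocabulary, most of which is zeros. A minimal sketch of keeping the features sparse instead is shown below; the toy rows are invented and only mimic the cleaned columns, and `X_sparse` is a hypothetical name. scikit-learn's linear models accept SciPy sparse matrices as input.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented rows that only mimic the cleaned columns used above.
toy = pd.DataFrame({
    "text": ["great work give raise", "thought everyone delicious"],
    "parent_text": ["hey bos come look", "nice try know like"],
    "parent_score": [1933, 117],
})

vectorizer = TfidfVectorizer()
text_tfidf = vectorizer.fit_transform(toy["text"])       # sparse, one column per token
parent_tfidf = vectorizer.transform(toy["parent_text"])  # same vocabulary, same width

# Wrap the numeric column as a sparse column vector so it can be stacked
# next to the two TF-IDF blocks.
parent_score = csr_matrix(toy[["parent_score"]].to_numpy(dtype=float))

X_sparse = hstack([parent_score, text_tfidf, parent_tfidf]).tocsr()
print(X_sparse.shape)  # (2, 15): 1 numeric column + 7 tokens + 7 tokens
```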


In [31]:
# Split the data into features and target variable
X = df.drop(['Score', 'text', 'parent_text'], axis=1)
y = df['Score']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)


y_pred = model.predict(X_test)
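
Note that the vectorizer above is fitted on the full DataFrame before `train_test_split`, so vocabulary and IDF weights from the test rows leak into training. One way to avoid this, sketched under assumptions, is to split first and let a `ColumnTransformer` fit the vectorizers on the training rows only. The toy rows are loosely based on the sample output earlier in the notebook, and this sketch uses a separate vectorizer per text column rather than the shared vocabulary used above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy rows for illustration only.
toy = pd.DataFrame({
    "text": ["great work", "thought everyone delicious", "downmodded irrelevance work",
             "supposed mean place", "rabblerabble", "little known fact"],
    "parent_text": ["hey bos come look", "nice try jennifer", "upmodded awesome kindness",
                    "hell golf course", "everyone forthewolfx", "ending comment train voila"],
    "parent_score": [1933, 117, 32, 12, 370, 4],
    "Score": [1348, 67, -16, -7, 193, -8],
})

X_toy = toy.drop(columns="Score")
y_toy = toy["Score"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=2, random_state=42)

# A column given as a plain string is passed to the vectorizer as 1-D text.
features = ColumnTransformer(
    [
        ("text_tfidf", TfidfVectorizer(), "text"),
        ("parent_tfidf", TfidfVectorizer(), "parent_text"),
    ],
    remainder="passthrough",  # keeps parent_score as a numeric feature
)

pipeline = make_pipeline(features, LinearRegression())
pipeline.fit(X_tr, y_tr)        # vectorizers see training rows only
toy_pred = pipeline.predict(X_te)
print(toy_pred.shape)
```

Tokens that appear only in the test rows are simply ignored at prediction time, which is what would happen with genuinely new comments.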


In [29]:
mse = mean_squared_error(y_test, y_pred)
mse

9.846653935497957e+19
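
For scale: `df.describe()` reports a standard deviation of about 200.74 for `Score`, so its variance is roughly 4.0e4, and a model that always predicts the mean would have an MSE of about that size. An MSE near 1e20 therefore suggests that the unregularized fit is numerically unstable, probably because there are many more (and partly collinear) TF-IDF columns than training rows. The sketch below shows one way to sanity-check a model against a mean baseline and a regularized alternative, reporting RMSE so that the error is in score units. The data is synthetic, `rmse` is a hypothetical helper, and `alpha=1.0` is an arbitrary starting value rather than a tuned one.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: more features than training rows, like a wide TF-IDF matrix.
X_syn = rng.normal(size=(200, 400))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

def rmse(model):
    """Fit on the training split and return the root mean squared error on the test split."""
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

results = {
    "mean baseline": rmse(DummyRegressor(strategy="mean")),
    "linear regression": rmse(LinearRegression()),
    "ridge (alpha=1.0)": rmse(Ridge(alpha=1.0)),
}

for name, value in results.items():
    print(f"{name}: RMSE = {value:.3f}")
```

On the real data, a model is only worth keeping if it beats the mean baseline on the held-out split.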