In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# to make this notebook's output stable across runs
np.random.seed(42)

In [None]:
!pip show scikit-learn

In [None]:
# train = pd.read_csv("train.csv")
# test = pd.read_csv("test.csv")
# movie = pd.read_csv("movies.csv")

train = pd.read_csv("/kaggle/input/sentiment-prediction-on-movie-reviews/train.csv")
test = pd.read_csv("/kaggle/input/sentiment-prediction-on-movie-reviews/test.csv")
movie = pd.read_csv("/kaggle/input/sentiment-prediction-on-movie-reviews/movies.csv")

In [None]:
train.head()

In [None]:
movie.tail()

In [None]:
test.head()

In [None]:
train.info()

In [None]:
movie.info()

In [None]:
movie['rating'].unique()

There are 10 unique values for rating, plus NaN, so we can first impute the NaN values and then apply one-hot encoding.
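The impute-then-encode idea can be sketched on a toy column as follows. The rating values and the choice of the most frequent value as the fill are assumptions made for illustration, not output from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical ratings with one missing entry.
ratings = pd.DataFrame({"rating": ["PG", "R", np.nan, "PG"]})

# Step 1 fills NaN with the most frequent rating, step 2 one-hot encodes the result.
encoder = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
encoded = encoder.fit_transform(ratings).toarray()
print(encoded)
```

Here `handle_unknown="ignore"` means a rating that never appeared during fitting is encoded as an all-zero row instead of raising an error.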

In [None]:
movie['genre'].unique()

There are 2913 unique genre strings across 143258 entries in the movies dataset, which suggests that a consistent scheme was followed when filling in the genre.

The genre is written as a single string such as 'Action, Mystery & thriller', not as a collection like {'Action', 'Mystery', 'Thriller'}. If it were a collection, we could apply MultiLabelBinarizer directly.

So we either extract the individual words from each string, convert them to lower case, form a set per row, and apply MultiLabelBinarizer to the list of these sets, or we keep each value as one single string and apply CountVectorizer.
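A minimal sketch of the first option, assuming genres are separated by "," or "&" as in the example string above. The helper name `split_genres` and the sample values are hypothetical.

```python
import re
from sklearn.preprocessing import MultiLabelBinarizer

def split_genres(value):
    """Split a genre string on ',' and '&' and return a set of lower-case labels."""
    parts = re.split(r",|&", value.lower())
    return {part.strip() for part in parts if part.strip()}

# Hypothetical genre strings in the same style as the movies dataset.
genres = ["Action, Mystery & thriller", "Comedy", "Action, Comedy"]
label_sets = [split_genres(g) for g in genres]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(label_sets)
print(mlb.classes_)
print(encoded)
```

Each column of `encoded` corresponds to one entry of `mlb.classes_`, so a movie with several genres gets several ones in its row.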

In [None]:
movie['originalLanguage'].unique()

For originalLanguage there are 112 unique values plus NaN, so we use one-hot encoding.

In [None]:
len(movie['soundType'].unique())

For soundType there are 551 unique values among 15917 non-null values. As with genre, we need to fill the NaN values, extract the words from each string, and then apply an encoding.

In [None]:
movie['ratingContents'].unique()

In ratingContents there are 8,353 unique values among the 13,991 non-null values, and each value is a list of strings. After imputing, for each sample we take the list, extract its strings, merge them into one string, and use that string.

Data Type of some easily confused columns:

rating type: string

ratingContents type: **list of strings**

releaseDateTheaters type: string

releaseDateStreaming type: string

runtimeMinutes type: numpy.float64

boxOffice type: string

distributor type: string

soundType type: string

In [None]:
train_col_names = train.columns
movie_col_names = movie.columns
movie_col_names

In [None]:
print("train shape:", train.shape)
for i in train_col_names:
    print(i,"null values", train[i].isnull().sum())

print("\n")
print("movie shape:", movie.shape)
for i in movie_col_names:
    print(i,"null values", movie[i].isnull().sum())

In [None]:
print("Percentage of null values: To decide which features to keep in Second Trial.\n")
print('For train dataset')
# dividing by len(...) instead of a hard-coded row count keeps the percentages correct if the data changes
for i in train_col_names:
    print(i,"% null values", ((train[i].isnull().sum())/len(train))*100)


print("\n")
print('For movies dataset')
print("movie shape:", movie.shape)
for i in movie_col_names:
    print(i,"% null values", ((movie[i].isnull().sum())/len(movie))*100 )

**For the first trial (dropping columns on the basis of domain knowledge), use
From train: isFrequentReviewer, reviewText, and
From movie: audienceScore, rating, ratingContents, runtimeMinutes, genre, originalLanguage, director, boxOffice, soundType**

Reason: From domain knowledge, we can drop movieid and reviewerName from the train dataset, because movieid is just an identifier and a person's name has no effect on how they judge a movie. For example, if a person named 'Sam' has a positive sentiment for movie 'A', that does not mean every person named Sam will have a positive sentiment for movie 'A'.

From the movies dataset we can drop movieid, title, releaseDateTheaters, releaseDateStreaming and distributor, again from domain knowledge. movieid is just an identifier and title is the name of the movie, so neither affects whether someone's sentiment is positive or negative. The release dates in theaters and on streaming platforms should not matter either, under the assumption that the train data was collected from different websites where users can post reviews, so a person can watch the movie and review it at any time regardless of when it was released. The distributor should not matter either: even for a well-reputed distributor, a reviewer will still give a negative review if the movie was bad.

**For the second trial (using feature selection):**
Some features like rating, ratingContents, boxOffice, and soundType would have to be dropped because more than 50% of their values are null. However, instead of dropping them directly, I used the SelectFromModel feature selection technique to select the best features.
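As a rough illustration of how SelectFromModel picks features, the sketch below fits it on synthetic data. The random forest estimator, the data, and the default importance threshold are all assumptions for the example; the notebook's actual estimator may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only the first two of five features influence the label.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Features whose importance is at least the mean importance are kept.
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
selector.fit(X, y)

mask = selector.get_support()        # boolean mask of the selected features
X_selected = selector.transform(X)   # matrix restricted to the selected features
print(mask, X_selected.shape)
```

The same `selector` can be placed after the preprocessing step in a pipeline, so that sparsely populated columns are kept or discarded based on their measured importance.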

In [None]:
#audienceScore histogram
plt.figure(figsize=(8,6))
print(movie['audienceScore'].mode())
plt.title('Histogram for audienceScore')
hi = sns.histplot(movie['audienceScore'], kde = True)
plt.show()


The histogram shows that audienceScore ranges from 0 to 100. The bin 50-52 has the highest frequency, more than 2500 (with 50 being the mode).

In [None]:
#runtime Minutes
plt.figure(figsize=(8,6))
print(movie['runtimeMinutes'].mode())
plt.title('Histogram for runtimeMinutes')
sns.histplot(movie['runtimeMinutes'], kde = True)
plt.xlim(0,400)
plt.show()

The histogram for runtimeMinutes has a right tail (the complete tail is not shown here because of figure size constraints; the max value is 2700 minutes, which is an outlier). Most of the outliers lie to the right of quartile 3, i.e., above Q3 + 1.5*IQR.

In [None]:
# boxplot of audienceScore
plt.figure(figsize=(8,6))
plt.title('Box plot for audienceScore')
# passing the column once as x draws a horizontal box plot
box = sns.boxplot(x = movie['audienceScore'])
median = movie['audienceScore'].median()
box.annotate(str(median), xy = (58,0))
plt.show()

We can observe that audienceScore does not have outliers and that the median is 57.

In [None]:
# box plot of runtimeMinutes
plt.figure(figsize=(8,6))
plt.title('Box plot for runtimeMinutes')
# passing the column once as x draws a horizontal box plot
box = sns.boxplot(x = movie['runtimeMinutes'])
median = movie['runtimeMinutes'].median()

box.annotate(str(median), xy = (92,0))
print('max:', movie['runtimeMinutes'].max())
print('min:', movie['runtimeMinutes'].min())
print('median:', median)
plt.show()

We can see that some movies have very high runtimes, such as 1000 min and more than 2500 min. This shows the noise in the data: runtimeMinutes has outliers.

In [None]:
# capping of outliers on runtimeMinutes

# # alternative: upperlimit = mean+(3*standard deviation), lowerlimit = mean-(3*standard deviation)
# upperlimit = movie2['runtimeMinutes'].mean() + 3*(movie2['runtimeMinutes'].std())
# lowerlimit = movie2['runtimeMinutes'].mean() - 3*(movie2['runtimeMinutes'].std())

q3 = movie['runtimeMinutes'].quantile(0.75)
q1 = movie['runtimeMinutes'].quantile(0.25)
iqr = q3-q1
# named lower_fence/upper_fence so that the built-in min() and max() are not shadowed
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("Quartile 1:", q1)
print("Quartile 3:", q3)
print("min:", lower_fence)
print("max:", upper_fence)

print("number of outliers above max:", (movie['runtimeMinutes'] > upper_fence).sum())
print("number of outliers below min:", (movie['runtimeMinutes'] < lower_fence).sum())

# replacing the outliers with the fence values; NaN entries are left unchanged
movie['runtimeMinutes'] = movie['runtimeMinutes'].clip(lower = lower_fence, upper = upper_fence)


In [None]:
print('max:', movie['runtimeMinutes'].max())
print('min: ', movie['runtimeMinutes'].min())

plt.figure(figsize=(7,5))
plt.title('Box plot of runtimeMinutes after capping of outliers')
box = sns.boxplot(x = movie['runtimeMinutes'])

median = movie['runtimeMinutes'].median()
print('median:', median)
box.annotate(str(median), xy = (92,0))
plt.show()

We can see that after capping the outliers, the median runtimeMinutes is 92 minutes, the max is 131.5 minutes and the min is 55.5 minutes.
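The IQR capping rule can be checked on a small toy series. The runtimes below are made up, and `Series.clip` is used here as one compact way to apply the fences; it leaves NaN values untouched.

```python
import pandas as pd

# Toy runtimes in minutes; 2700 mimics the extreme value seen in the data.
runtimes = pd.Series([40.0, 84.0, 90.0, 92.0, 95.0, 103.0, 2700.0, float("nan")])

q1 = runtimes.quantile(0.25)
q3 = runtimes.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are replaced by the fence; values inside are unchanged.
capped = runtimes.clip(lower=lower_fence, upper=upper_fence)
print(lower_fence, upper_fence)
print(capped.tolist())
```

As a consistency check on the reported numbers, fences of 55.5 and 131.5 correspond to IQR = (131.5 - 55.5) / 4 = 19, that is, Q1 = 84 and Q3 = 103.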

In [None]:
#Checking the class balance
plt.figure(figsize=(8, 6))
plt.title('Pie chart for sentiment classes percentage')
train['sentiment'].value_counts().plot(kind='pie',autopct='%0.01f%%' )

We can see that the classes are imbalanced: positive sentiments make up almost 67% of the values and negative sentiments about 33%.
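With an imbalance like this, one common option is to preserve the class ratio when splitting the data. The sketch below uses the `stratify` argument of `train_test_split` on toy labels; the label strings are hypothetical, and this is shown only as an option rather than as a step the notebook takes.

```python
from sklearn.model_selection import train_test_split

# Toy labels with the same 2:1 ratio of positive to negative reviews.
labels = ["POSITIVE"] * 8 + ["NEGATIVE"] * 4
features = list(range(len(labels)))

# stratify=labels keeps the 2:1 ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)
print(y_tr.count("POSITIVE"), y_tr.count("NEGATIVE"))
print(y_te.count("POSITIVE"), y_te.count("NEGATIVE"))
```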

In [None]:
# selecting the two numeric columns by name instead of by position
dataplot = movie[['audienceScore', 'runtimeMinutes']]
plt.title('KDE plot for audienceScore and runtimeMinutes')
# bw_method replaces the deprecated bw argument
sns.kdeplot(data = dataplot, bw_method = 0.2)

A KDE plot, like a histogram, shows the distribution of numerical features using an estimated PDF, here for audienceScore and runtimeMinutes. After smoothing with a bandwidth of 0.2, we can see that audienceScore has a somewhat uniform distribution.

In [None]:
#removing any rows in which all values are null
train2 = train.dropna(how = "all")
movie2 = movie.dropna(how = "all")

In [None]:
#dropping the columns 'releaseDateTheaters', 'releaseDateStreaming', and 'distributor' because
#they are assumed not to have any impact on the movie sentiment

movie2 = movie2.drop(columns = ['releaseDateTheaters', 'releaseDateStreaming', 'distributor'] )

In [None]:
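#collapsing duplicate movieid rows in movie2 by averaging the numeric columns with groupby
#numeric_only = True is required in recent pandas versions, where mean() raises an error on string columns
movie2 = movie2.groupby(["movieid"]).mean(numeric_only = True)
movie2.head()

Since groupby(...).mean() on movie2 returns only the numerical columns, we take the categorical columns into movie3, remove the duplicates there, and merge the result with movie2.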
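The two-step deduplication described above can be sketched on a toy frame: numeric columns are averaged per movieid, categorical columns keep their first occurrence, and the two parts are merged back together. The column values are invented for the example.

```python
import pandas as pd

# Toy movies table in which movieid "a" appears twice.
movies = pd.DataFrame({
    "movieid": ["a", "a", "b"],
    "genre": ["Drama", "Drama", "Comedy"],
    "audienceScore": [60.0, 80.0, 50.0],
})

# Numeric part: one row per movieid, holding the mean of the numeric columns.
numeric = movies.groupby("movieid", as_index=False).mean(numeric_only=True)

# Categorical part: keep the first row per movieid and drop the numeric column.
categorical = movies.drop_duplicates(subset="movieid", keep="first").drop(columns=["audienceScore"])

deduped = categorical.merge(numeric, on="movieid", how="left")
print(deduped)
```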

In [None]:
movie3 = movie.drop_duplicates(subset = 'movieid', keep = 'first')
movie3 = movie3.drop(columns = ['audienceScore', 'releaseDateTheaters', 'releaseDateStreaming', 'runtimeMinutes', 'distributor'])
movie3 = pd.merge(movie3, movie2, how ='left', on = ['movieid'])
movie3.shape

In [None]:
movie3.head()

In [None]:
movie3_columns = movie3.columns
movie3_columns

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

ct2 = ColumnTransformer([('si_movieid', 'passthrough', [0]),
                        ('si_title', 'passthrough', [1]),

                        ('si_rating', SimpleImputer(strategy = 'most_frequent'), [2]),
                        ('si_ratingContents', SimpleImputer(strategy = 'constant', fill_value = 'na'), [3]),


                        ('si_genre', SimpleImputer(strategy = 'most_frequent'), [4]),
                        ('si_OL', SimpleImputer(strategy = 'most_frequent'), [5]),
                        ('si_Director', SimpleImputer(strategy = 'most_frequent'), [6]),
                        ('si_BO', SimpleImputer(strategy = 'constant', fill_value = 0 ), [7]),

                        ('si_ST', SimpleImputer(strategy = 'constant', fill_value = 'na'), [8]),
                        ('si_AS', SimpleImputer(strategy = 'median'), [9]),
                        ('si_RTM', SimpleImputer(strategy = 'median'), [10])
                       ])

movie3 = ct2.fit_transform(movie3)
movie3 = pd.DataFrame(movie3, columns = movie3_columns)
#type(movie3)
movie3.head()

In [None]:
# movie3.info()

In [None]:
# converting boxOffice values from strings such as "$1.2M" to numbers

def parse_box_office(value):
  # 0 is the placeholder written by the imputer for missing values
  if value == 0:
    return 0

  multipliers = {"K": 1000, "M": 1000000, "B": 1000000000}
  text = value.replace("$", "").replace(",", "")
  suffix = text[-1:]

  if suffix in multipliers:
    return int(float(text[:-1]) * multipliers[suffix])

  # values without a K/M/B suffix are treated as plain dollar amounts
  return int(float(text))


movie3['boxOffice'] = movie3['boxOffice'].apply(parse_box_office)
print(movie3["boxOffice"])

In [None]:
print(movie3.iloc[12,7])

In [None]:
#replacing the '0' values in boxOffice with median of int values

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

ct3 = ColumnTransformer([('si_movieid', 'passthrough', [0]),
                        ('si_title', 'passthrough', [1]),

                        ('si_rating', 'passthrough', [2]),
                        ('si_ratingContents', 'passthrough', [3]),


                        ('si_genre', 'passthrough', [4]),
                        ('si_OL', 'passthrough', [5]),
                        ('si_Director', 'passthrough', [6]),
                        ('si_BO', SimpleImputer(missing_values = 0 , strategy = 'median')  , [7]),

                        ('si_ST', 'passthrough', [8]),
                        ('si_AS', 'passthrough', [9]),
                        ('si_RTM', 'passthrough', [10])
                       ])

movie3 = ct3.fit_transform(movie3)
movie3 = pd.DataFrame(movie3, columns = movie3_columns)
print(type(movie3))
movie3.head()

# sim = SimpleImputer(missing_values = 0 , strategy = 'median')
# bo_column = sim.fit_transform(np.array(movie3['boxOffice']).reshape(-1,1))
# # print(bo_column)
# movie3['boxOffice'] = pd.Series(list(bo_column))
# movie3['boxOffice']

In [None]:
print(movie3.iloc[5,:])
print(type(movie3.iloc[5,7]))

So the boxOffice values that were previously 0 are filled with the median of the remaining values. All the values have been converted to float and are in dollars ($).
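The effect of treating 0 as the missing marker can be seen on a small array. The amounts below are made up; the imputer settings mirror the ones used for boxOffice above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy box-office amounts; 0 stands for "missing".
box_office = np.array([[0.0], [1000.0], [3000.0], [0.0], [5000.0]])

# The median is computed over the non-zero entries only and written into the zeros.
imputer = SimpleImputer(missing_values=0, strategy="median")
filled = imputer.fit_transform(box_office)
print(filled.ravel())
```

Because the zeros are excluded from the statistic, the fill value is the median of 1000, 3000 and 5000 rather than a median pulled down by the placeholders.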

In [None]:
# converting ratingContents from a list of strings to a single string
# note: the value is not actually a list of strings, it is a string written like one - "['Thematic Elements', 'Language', 'Injury']"

print(len(movie3['ratingContents'].unique()))

Before removing duplicates there were 8,353 unique values among the 13,991 non-null values. Now, after removing the duplicates and filling nulls with 'na', there are 7305 unique values (excluding 'na').
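Because each value is a string that looks like a Python list, one way to flatten it is to parse it with `ast.literal_eval` and join the items. This is a sketch of an alternative to character-level cleanup; the helper name `flatten_rating_contents` is hypothetical, and it assumes every non-'na' value is a well-formed list literal.

```python
import ast

def flatten_rating_contents(raw):
    """Turn "['A', 'B/C']" into "A B C"; keep the 'na' placeholder as it is."""
    if raw == "na":
        return raw
    items = ast.literal_eval(raw)  # safely parses the list literal into a Python list
    return " ".join(item.replace("/", " ") for item in items)

print(flatten_rating_contents("['Thematic Elements', 'Language', 'Injury']"))
print(flatten_rating_contents("['Violence/Gore']"))
print(flatten_rating_contents("na"))
```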

In [None]:
import re

def flatten_rating_contents(value):
  # leave the imputed placeholder untouched
  if value == 'na':
    return value
  # drop quotes, commas and the surrounding brackets, then turn '/' into a space
  cleaned = re.sub(r"[',\[\]]", "", value)
  return cleaned.replace('/', ' ')

movie3['ratingContents'] = movie3['ratingContents'].apply(flatten_rating_contents)
print(movie3.loc[5,:])

In [None]:
# In genre, removing ",", "'", "&" and the standalone word "and"

# \band\b matches only the whole word, so genres that merely contain "and" (e.g. "Stand-up") stay intact
movie3['genre'] = (movie3['genre']
                   .str.replace(r"[&',]", "", regex = True)
                   .str.replace(r"\band\b", "", regex = True))
# print(movie3.loc[5,:])

In [None]:
# In soundType, removing ",", "'", "&" and the standalone word "and"

# \band\b matches only the whole word, so values that merely contain "and" stay intact
movie3['soundType'] = (movie3['soundType']
                       .str.replace(r"[&',]", "", regex = True)
                       .str.replace(r"\band\b", "", regex = True))
# print(movie3.loc[5,:])

In [None]:
movie3.head()

In [None]:
train.info()

In [None]:
test.info()

In [None]:
train.head()

In [None]:
# Count for isFrequentReviewer in train dataset

counts = train['isFrequentReviewer'].value_counts()


fig, ax = plt.subplots()
# bar labels come from the value_counts() index, so they always match the counts
bar_container = ax.bar(counts.index.astype(str), counts.values)
ax.set(ylabel='counts', title='Bar plot of isFrequentReviewer')
ax.bar_label(bar_container, fmt='{:,.0f}')


We can see that out of 162758 entries, the reviewer is not a frequent reviewer in 113189 entries and is a frequent reviewer in 49569 entries (a ratio of about 2.28:1).

In [None]:
# histogram for the length of reviewText

plt.figure(figsize=(7,5))
plt.title('Histogram for reviewText length for train data')
train['reviewText'].str.len().hist()
print('modal length:', train['reviewText'].str.len().mode())
# hi = sns.histplot(train['reviewText'].str.len())
plt.show()

# lengths = list(train['reviewText'])


We calculate the length of each 'reviewText' in the 'train' DataFrame using the str.len() method, which returns the number of characters in each string of the 'reviewText' column. We then call hist() to plot the histogram of these lengths.

The resulting histogram shows the distribution of review text lengths. The x-axis represents the length of the review text, and the y-axis represents the frequency (count) of review texts with each length.

This helps us understand the variability in the length of the review texts in the train dataset. The most frequent length of reviewText in the train dataset is 136 characters (str.len() counts characters, not words).
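The difference between character counts and word counts can be seen on a few toy reviews. The texts are invented; missing reviews yield NaN in both cases.

```python
import pandas as pd

# Toy reviews, including one missing value.
reviews = pd.Series(["A fun ride", "Too long", None])

char_lengths = reviews.str.len()             # number of characters per review
word_counts = reviews.str.split().str.len()  # number of whitespace-separated words

print(char_lengths.tolist())
print(word_counts.tolist())
```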

In [None]:
# Count for isTopCritic in test dataset

counts = test['isTopCritic'].value_counts()


fig, ax = plt.subplots()
# bar labels come from the value_counts() index, so they always match the counts
bar_container = ax.bar(counts.index.astype(str), counts.values)
ax.set(ylabel='counts', title='Bar plot of isTopCritic')
ax.bar_label(bar_container, fmt='{:,.0f}')

We can see that out of 55315 entries, the reviewer is not a top critic in 38,428 entries and is a top critic in 16,887 entries (a ratio of about 2.28:1, almost the same as that of isFrequentReviewer in train).

In [None]:
# histogram for the length of reviewText in test

plt.figure(figsize=(7,5))
plt.title('Histogram for reviewText length for test data')
test['reviewText'].str.len().hist()
print('modal length:', test['reviewText'].str.len().mode())
# hi = sns.histplot(test['reviewText'].str.len())
plt.show()


In [None]:
train_col_names = train.columns
# print(train_col_names)
for i in train_col_names:
    print(i,"null values", train[i].isnull().sum())

In [None]:
# imputing null values in train dataset - as only reviewText contains null values, using fillna
train['reviewText']= train['reviewText'].fillna('na')
test['reviewText'] = test['reviewText'].fillna('na')
print(train.iloc[113,:])

In [None]:
train_unique_movieid = train['movieid'].unique()
len(train_unique_movieid)

In [None]:
train_unique_reviewer = train['reviewerName'].unique()
len(train_unique_reviewer)

In [None]:
train2 = train.copy()
dupes_on_same = train2.drop_duplicates(subset = ['movieid', 'reviewerName'])
dupes_on_same2 = train2.drop_duplicates(subset = ['movieid', 'reviewerName', 'isFrequentReviewer', 'reviewText', 'sentiment'])
print(len(dupes_on_same))
print(len(dupes_on_same2))

1) In the train dataset there are only 16,812 unique movie ids among 162,758 entries. We can infer that several people reviewed the same movie, and/or duplicate rows exist (see point 3a), and/or the same movieid and reviewer appear with different entries for isFrequentReviewer, reviewText or sentiment (see point 3b).

2) There are also only 4482 unique reviewers. This could be because the same reviewer reviewed several different movies, and/or because duplicate rows exist.

3) Dropping rows that share the same values in 'movieid' and 'reviewerName' leaves a dataframe with 161205 rows, whereas dropping rows that are duplicated across all columns leaves 161640 rows. This means:
 a) 1118 rows were exact duplicates (162758 - 161640).
 b) A further 435 rows (161640 - 161205) share movieid and reviewerName with another row but differ in 'isFrequentReviewer', 'reviewText' or 'sentiment'. That is why the number of rows differs between removing duplicates over all features and removing duplicates over movieid and reviewerName only.
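The two kinds of duplicates can be counted directly with `DataFrame.duplicated`. The toy frame below is hypothetical and only mirrors the structure of the train data.

```python
import pandas as pd

# Toy reviews: rows 0 and 1 are identical, row 2 has the same movie and reviewer but a different text.
reviews = pd.DataFrame({
    "movieid": ["m1", "m1", "m1", "m2"],
    "reviewerName": ["Sam", "Sam", "Sam", "Sam"],
    "reviewText": ["good", "good", "bad", "fine"],
})

# Rows that repeat an earlier row in every column.
exact_duplicates = int(reviews.duplicated().sum())

# Rows that repeat an earlier (movieid, reviewerName) pair, whatever the other columns say.
same_pair = int(reviews.duplicated(subset=["movieid", "reviewerName"]).sum())

# Same movie and reviewer, but conflicting content in the remaining columns.
conflicting = same_pair - exact_duplicates
print(exact_duplicates, same_pair, conflicting)
```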

In [None]:
#merging train and movie3 datasets
merged_train = pd.merge(movie3, train, how ='right', on = ['movieid'])
merged_test = pd.merge(movie3, test, how='right', on = ['movieid'])
print("shape of merged train dataframe:", merged_train.shape)
print("shape of merged test dataframe:", merged_test.shape)

In [None]:
merged_train.head()

In [None]:
#dropping columns movieid, title and reviewerName as they are assumed not to have any effect on the sentiment
merged_train = merged_train.drop(columns = ['movieid', 'title', 'reviewerName'])
merged_test = merged_test.drop(columns = ['movieid', 'title', 'reviewerName'])
print("merged_train shape:", merged_train.shape)
print("merged_test_shape:", merged_test.shape)

In [None]:
merged_train.head()

In [None]:
#Text preprocessing before giving to countVectorizer
import re

def remove_newlines_tabs(text):
    """
    This function will remove all the occurrences of newlines, tabs, and combinations like: \\n, \\.

    """

    # Replacing all the occurrences of \n,\\n,\t,\\ with a space.
    Formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t',' ').replace('\\', ' ').replace('. com', '.com')
    return Formatted_text



# Code for removing repeated characters and punctuations

def reducing_incorrect_character_repeatation(text):
    """
    This function reduces repetition to two characters
    for letters and to one character for punctuation.

    arguments:
         text: input text of type "String".

    return:
        value: Formatted text with repeated letters limited to
        two characters & repeated punctuation limited to one character.

    Example:
    Input : Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)
    Output : Reallyy, Greeaatt !?.;:)

    """
    # Pattern matching a letter (either case) followed by one or more repeats of itself
    Pattern_alpha = re.compile(r"([A-Za-z])\1{1,}", re.DOTALL)

    # Limiting every repetition of a letter to two characters.
    Formatted_text = Pattern_alpha.sub(r"\1\1", text)

    # Pattern matching a punctuation character followed by one or more repeats of itself
    Pattern_Punct = re.compile(r'([.,/#!$%^&*?;:{}=_`~()+-])\1{1,}')

    # Limiting repeated punctuation in the previously formatted string to one character.
    Combined_Formatted = Pattern_Punct.sub(r'\1', Formatted_text)

    # Replacing runs of two or more spaces with a single space.
    Final_Formatted = re.sub(' {2,}',' ', Combined_Formatted)
    return Final_Formatted

CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have",
}

# The code for expanding contraction words
def expand_contractions(text, contraction_mapping =  CONTRACTION_MAP):
    """Expand shortened words to their full form,
       e.g. don't to do not.

       arguments:
            text: input text of type "String".
            contraction_mapping: dictionary mapping each contraction to its expanded form.

       return:
            value: Text with expanded form of shortened words.

       Example:
       Input : ain't aren't can't 'cause can't've
       Output : is not are not cannot because cannot have

     """
    # Tokenizing text into tokens.
    list_Of_tokens = text.split(' ')

    # Replace every token that is a key of the mapping with its expanded form.
    # The lookup is lower-cased because all keys in the mapping are lower case;
    # tokens that are not contractions are kept unchanged.
    expanded_tokens = [contraction_mapping.get(token.lower(), token) for token in list_Of_tokens]

    # Converting list of tokens to String.
    String_Of_tokens = ' '.join(expanded_tokens)
    return String_Of_tokens




## Function to preprocess text with regular expressions and replace some special symbols

def preprocess_text(text_messages):
  # Python's str.replace treats its first argument as literal text,
  # so the regular expressions below are applied with re.sub instead.

  # Replace email addresses with 'emailaddress'
  processed = re.sub(r'^.+@[^\.].*\.[a-z]{2,}$', 'emailaddress', text_messages)

  # Replace URLs with 'webaddress'
  processed = re.sub(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'webaddress', processed)

  # Replace money symbols with 'moneysymb' (£ can be typed with ALT key + 156)
  processed = re.sub(r'£|\$', 'moneysymb', processed)

  # Replace 10 digit phone numbers (formats include parentheses, spaces, no spaces, dashes) with 'phonenumbr'
  processed = re.sub(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$', 'phonenumbr', processed)

  # Replace numbers with 'numbr'
  processed = re.sub(r'\d+(\.\d+)?', 'numbr', processed)

  # Replace punctuation and other special characters with a space
  processed = re.sub(r'[^\w\d\s]', ' ', processed)

  # Replace whitespace between terms with a single space
  processed = re.sub(r'\s+', ' ', processed)

  # Remove leading and trailing whitespace
  processed = re.sub(r'^\s+|\s+?$', '', processed)

  # Change words to lower case - Hello, HELLO, hello are all the same word
  processed = processed.lower()

  return processed



# Main function that chains all the preprocessing steps.
def text_preprocessing(text, accented_chars=True, contractions=True, newlines_tabs=True, repeatition=True,
                       mis_spell=True, remove_html=True, preprocess = True, lemma = True):
    """
    This function will preprocess input text and return
    the clean text.

    The accented_chars, mis_spell, remove_html and lemma flags are
    accepted but currently unused.
    """
    # Start from the raw text so that Data is defined even when a step is switched off.
    Data = text

    if newlines_tabs: #remove newlines & tabs.
        Data = remove_newlines_tabs(Data)

    # if remove_html == True: #remove html tags
    #     Data = strip_html_tags(Data)

    # if accented_chars == True: #remove accented characters
    #     Data = accented_characters_removal(Data)

    if repeatition: #Reduce repetitions
        Data = reducing_incorrect_character_repeatation(Data)

    if contractions: #expand contractions
        Data = expand_contractions(Data)

    if preprocess: #regex-based cleanup and lower-casing
        Data = preprocess_text(Data)


    return Data

In [None]:
#applying text preprocessing on ratingContents, genre, soundType, reviewText

merged_train['ratingContents'] = merged_train['ratingContents'].apply(text_preprocessing)
merged_train['genre'] = merged_train['genre'].apply(text_preprocessing)
merged_train['soundType'] = merged_train['soundType'].apply(text_preprocessing)
merged_train['reviewText'] = merged_train['reviewText'].apply(text_preprocessing)

merged_test['ratingContents'] = merged_test['ratingContents'].apply(text_preprocessing)
merged_test['genre'] = merged_test['genre'].apply(text_preprocessing)
merged_test['soundType'] = merged_test['soundType'].apply(text_preprocessing)
merged_test['reviewText'] = merged_test['reviewText'].apply(text_preprocessing)


In [None]:
#separating the sentiment column into labels

merged_train_labels = np.array(merged_train['sentiment'])
merged_train = merged_train.drop(columns = ['sentiment'])

print("merged train labels:", merged_train_labels)
print("merged_train shape:", merged_train.shape)

print("merged_test shape:", merged_test.shape)

In [None]:
merged_train.head()

In [None]:
#renaming isTopCritic to isFrequentReviewer so that the test columns match the train columns
merged_test = merged_test.rename(columns = {'isTopCritic': 'isFrequentReviewer'})

In [None]:
#split the data into train and test and then apply the transformations -> fit_transform on train and only transform on the validation
# make these transformations in pipeline
# use the pipeline for train, validation and test set

# transformations:
# OHE on rating, originalLanguage, director, isFrequentReviewer
# CountVectorizer on ratingContents, genre, reviewText, soundType
# MinMaxScaler on boxOffice, audienceScore and runtimeMinutes
# LabelBinarizer on y - sentiment column
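The plan above (fit on the training split, then only transform the validation split) can be sketched with a single ColumnTransformer that gives each text column its own CountVectorizer. The toy frames and the reduced column set are assumptions for illustration; this is not the exact pipeline used in the following cells.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy training and validation frames with one column of each kind.
train_df = pd.DataFrame({
    "rating": ["PG", "R", "PG"],
    "genre": ["action comedy", "drama", "comedy"],
    "reviewText": ["great fun", "slow and dull", "fun but long"],
    "runtimeMinutes": [90.0, 120.0, 100.0],
})
val_df = pd.DataFrame({
    "rating": ["NC-17"],            # category unseen during fitting
    "genre": ["drama horror"],      # "horror" is not in the fitted vocabulary
    "reviewText": ["dull"],
    "runtimeMinutes": [110.0],
})

preprocess = ColumnTransformer([
    ("ohe", OneHotEncoder(handle_unknown="ignore"), ["rating"]),
    # A text column is passed as a plain string so the vectorizer receives a 1-D series.
    ("genre_cv", CountVectorizer(), "genre"),
    ("review_cv", CountVectorizer(), "reviewText"),
    ("scale", MinMaxScaler(), ["runtimeMinutes"]),
])

X_train = preprocess.fit_transform(train_df)  # learn vocabularies, categories and scaling
X_val = preprocess.transform(val_df)          # reuse them without refitting
print(X_train.shape, X_val.shape)
```

Giving every text column a separate vectorizer keeps each fitted vocabulary available afterwards, and unseen categories or words in the validation data are simply ignored rather than changing the number of output columns.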


In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, LabelBinarizer, MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer

merged_X_train, merged_X_val, merged_y_train, merged_y_val = train_test_split(merged_train, merged_train_labels, test_size = 0.3, random_state = 42)

#there are 3 sets:
# training : merged_X_train, merged_y_train
# validation: merged_X_val, merged_y_val
# test: merged_test

In [None]:
print("merged_X_train shape:", merged_X_train.shape)
print("merged_X_val shape:", merged_X_val.shape)

In [None]:
# CountVectorizer

# from sklearn.feature_extraction.text import TfidfVectorizer
# # vectorizer = TfidfVectorizer(analyzer= 'word', tokenizer = None, max_df = 1, ngram_range = (1,2))
# tried using TfidfVectorizer but it was giving lower f1-score than that using CountVectorizer

vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None)
#initially kept stop_words = 'english'. On removing this parameter, the score of SVM increased from 0.80 to 0.81
#this could be because some stop words are important, and removing them can sometimes change the sentiment

In [None]:
type(merged_X_train)

In [None]:
merged_X_train['ratingContents']

In [None]:
merged_test

In [None]:
ohe_columns = ['rating', 'originalLanguage', 'director', 'isFrequentReviewer']
# cv_columns = ['ratingContents', 'genre', 'reviewText', 'soundType']
sc_columns = ['boxOffice', 'audienceScore', 'runtimeMinutes']
#oe_columns = ['isFrequentReviewer']

ratCon_t = vectorizer.fit_transform(merged_X_train['ratingContents'])
ratCon_v = vectorizer.transform(merged_X_val['ratingContents'])
ratCon_T = vectorizer.transform(merged_test['ratingContents'])

gen_t = vectorizer.fit_transform(merged_X_train['genre'])
gen_v = vectorizer.transform(merged_X_val['genre'])
gen_T = vectorizer.transform(merged_test['genre'])

revtext_t= vectorizer.fit_transform(merged_X_train['reviewText'])
revtext_v= vectorizer.transform(merged_X_val['reviewText'])
revtext_T= vectorizer.transform(merged_test['reviewText'])

st_t = vectorizer.fit_transform(merged_X_train['soundType'])
st_v = vectorizer.transform(merged_X_val['soundType'])
st_T = vectorizer.transform(merged_test['soundType'])

merged_X_train_dropped = merged_X_train.drop(columns = ['ratingContents', 'genre', 'reviewText', 'soundType'])
merged_X_val_dropped = merged_X_val.drop(columns = ['ratingContents', 'genre', 'reviewText', 'soundType'])
merged_test_dropped = merged_test.drop(columns = ['ratingContents', 'genre', 'reviewText', 'soundType'])

ct = ColumnTransformer([('ohe', OneHotEncoder(handle_unknown = 'ignore'), ohe_columns),
                        ('minmaxscaler', MinMaxScaler(), sc_columns),
                        ],  remainder = 'passthrough')


In [None]:
#converting merged_X_train into sparse

merged_X_train_transformed = ct.fit_transform(merged_X_train_dropped)
merged_X_val_transformed = ct.transform(merged_X_val_dropped)
merged_test_transformed = ct.transform(merged_test_dropped)

In [None]:
merged_X_train_transformed.shape

In [None]:
merged_X_train_transformed

In [None]:
from scipy.sparse import csr_matrix
merged_X_train_sparse = csr_matrix(merged_X_train_transformed)
merged_X_val_sparse = csr_matrix(merged_X_val_transformed)
merged_test_sparse = csr_matrix(merged_test_transformed)


In [None]:
merged_X_train_sparse.shape

In [None]:
print(ratCon_v.shape)
print(gen_v.shape)
print(st_v.shape)
print(revtext_v.shape)

In [None]:
#concatenating the count vectorised columns and other columns
from scipy.sparse import csr_matrix, hstack
merged_X_train_final = hstack([ratCon_t, gen_t, st_t, merged_X_train_sparse, revtext_t]) #sparse and concatenated
merged_X_val_final = hstack([ratCon_v, gen_v, st_v, merged_X_val_sparse, revtext_v])  #sparse and concatenated
merged_test_final = hstack([ratCon_T, gen_T, st_T, merged_test_sparse, revtext_T])   #sparse and concatenated
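
The manual steps above (vectorise each text column, transform the remaining columns, then `hstack`) could also be written as a single `ColumnTransformer` with one `CountVectorizer` per text column. The sketch below is only an alternative layout under stated assumptions: the toy frames and their values are invented, and the `toy_` names are hypothetical; only the column names are borrowed from this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny made-up frames that reuse a few column names from this notebook.
toy_train = pd.DataFrame({
    "genre": ["Action, Mystery & thriller", "Comedy", "Drama, Comedy"],
    "reviewText": ["great fun movie", "not good at all", "slow but moving drama"],
    "rating": ["PG", "R", "PG"],
    "runtimeMinutes": [90.0, 120.0, 105.0],
})
toy_val = pd.DataFrame({
    "genre": ["Comedy, Romance"],
    "reviewText": ["good fun"],
    "rating": ["NC-17"],          # unseen category, ignored by the encoder
    "runtimeMinutes": [100.0],
})

# A text column is passed as a plain string (not a list) so that the
# vectorizer receives a 1-D sequence of documents.
toy_ct = ColumnTransformer([
    ("genre_cv", CountVectorizer(), "genre"),
    ("review_cv", CountVectorizer(), "reviewText"),
    ("ohe", OneHotEncoder(handle_unknown="ignore"), ["rating"]),
    ("scale", MinMaxScaler(), ["runtimeMinutes"]),
])

toy_train_X = toy_ct.fit_transform(toy_train)  # fit on the training frame only
toy_val_X = toy_ct.transform(toy_val)          # reuse the fitted vocabularies
print(toy_train_X.shape, toy_val_X.shape)
```

Each vectorizer keeps its own vocabulary, so there is no need to refit one shared `vectorizer` object between columns, and the column order of the output is fixed by the transformer list.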

In [None]:
merged_X_train_final.shape

In [None]:
merged_test_final.shape

In [None]:
#y preprocessing
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer(sparse_output = True)

merged_y_train_final = lb.fit_transform(merged_y_train)
merged_y_val_final = lb.transform(merged_y_val)
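
A small sketch of what `LabelBinarizer` does to a two-class target. The label strings here are invented for illustration (the real sentiment labels are whatever `merged_y_train` holds), and the dense default is used instead of `sparse_output = True` to keep the output easy to read.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

toy_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]  # made-up label names

toy_lb = LabelBinarizer()
toy_y = toy_lb.fit_transform(toy_labels)  # 2-D column of 0/1, one row per sample
print(toy_lb.classes_, toy_y.shape)

# Most scikit-learn classifiers expect a 1-D target; ravel() flattens the column.
toy_y_1d = toy_y.ravel()

# Predictions (0/1) can be mapped back to the original strings.
toy_back = toy_lb.inverse_transform(np.array([1, 0, 1]))
print(toy_back)
```

Passing the flattened target to `fit` avoids the column-vector warning that scikit-learn raises when it receives a target of shape `(n_samples, 1)`.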

In [None]:
# merged_X_train_final, merged_y_train_final
# merged_X_val_final, merged_y_val_final
# merged_test_final

In [None]:
merged_y_train_final

**SVC:**

In [None]:
#Linear SVC

from sklearn.svm import LinearSVC

svc = LinearSVC(C = 0.01, random_state =46, max_iter = 946)
svc.fit(merged_X_train_final, merged_y_train_final.toarray())


In [None]:
y_pred_svc = svc.predict(merged_X_val_final)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), svc.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), svc.predict(merged_X_train_final))
linsvctr = f1_score(merged_y_train_final.toarray(), svc.predict(merged_X_train_final), average = 'weighted')
print(cp)
print('The confusion matrix:')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_pred_svc)
cm = confusion_matrix(merged_y_val_final.toarray(), y_pred_svc)
linsvc = f1_score(merged_y_val_final.toarray(), y_pred_svc, average = 'weighted')

print(cp)
print('The confusion matrix:')
print(cm)

For the basic LinearSVC (without HPT), the f1-score is 0.95 on train and 0.78 on validation, so the model is overfitting.

Next, tried increasing regularisation (C from 1 to 0.01) and decreasing max_iter (1000 to 800), which reduced overfitting (train: 0.85, validation: 0.80).

After performing HPT, the best values found were max_iter = 894 and C = 0.01, which gave the same f1-scores (train: 0.85, validation: 0.80).

After removing the stop_words parameter and repeating HPT, the best values were max_iter = 946 and C = 0.01, which gave f1-scores of train: 0.85 and validation: 0.81.
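
The hyperparameter search described above can be sketched on synthetic data as follows. This is not the notebook's actual run: the data comes from `make_classification`, `n_iter=5` and the `toy_` names are arbitrary choices for illustration, and only the search space mirrors the one used for `LinearSVC` here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real feature matrix.
toy_X, toy_y = make_classification(n_samples=300, n_features=20, random_state=0)

toy_param_dist = {
    "C": [0.001, 0.01, 0.1, 1],          # smaller C = stronger regularisation
    "max_iter": list(range(800, 1000)),
}

toy_search = RandomizedSearchCV(
    LinearSVC(random_state=46),
    param_distributions=toy_param_dist,
    n_iter=5,             # number of sampled parameter combinations
    scoring="f1_micro",
    cv=5,
    random_state=0,       # makes the sampled combinations repeatable
)
toy_search.fit(toy_X, toy_y)

print("best parameters:", toy_search.best_params_)
print("best CV score:", round(toy_search.best_score_, 4))
```

The best parameters are then copied into a fresh estimator and refit, which is the pattern followed for the basic `svc` model.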

In [None]:
# #Linear SVC with HPT
# from sklearn.svm import LinearSVC

# svc2 = LinearSVC(random_state = 46)
# from sklearn.model_selection import RandomizedSearchCV
# param_dist = {'C': [0.001, 0.01, 0.1, 1],
#               'max_iter': np.arange(800, 1000)}     #for LinearSVC default C=1, max_iter=1000, and loss=squared_hinge

# rs_svc = RandomizedSearchCV(svc2, param_distributions = param_dist, scoring = 'f1_micro', cv = 5, n_jobs =-1)
# rs_svc.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
# print("best score:", rs_svc.best_score_)
# print("best parameters:", rs_svc.best_params_)

The best values found by HPT to reduce overfitting are C = 0.01 and max_iter = 894 (best score: 0.8100), so the basic svc model (written before this) is trained with these hyperparameters.

Next, after removing the stop_words = 'english' parameter in CountVectorizer, the validation score improved from 0.80 to 0.81 with C = 0.01 and max_iter = 946, so the basic model was trained using these values.
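
One detail worth keeping in mind when comparing numbers: the searches in this notebook are scored with `f1_micro`, while the reported train and validation scores use `average = 'weighted'`. For single-label classification, micro-averaged F1 is the same as accuracy, so the two figures are not measuring exactly the same thing. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up, imbalanced labels and predictions for illustration only.
toy_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 1])
toy_pred = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 1])

acc = accuracy_score(toy_true, toy_pred)
f1_micro = f1_score(toy_true, toy_pred, average="micro")
f1_weighted = f1_score(toy_true, toy_pred, average="weighted")

print("accuracy   :", acc)
print("f1 micro   :", f1_micro)     # equals accuracy for single-label problems
print("f1 weighted:", f1_weighted)  # per-class F1 weighted by class support
```

If the two averages should match, the search could be run with `scoring = 'f1_weighted'` instead; whether that changes the selected parameters here is not something this sketch can tell.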

**SVM with SGDClassifier**

In [None]:
#SVM using SGDClassifier with HPT

from sklearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_selection import SelectFromModel

In [None]:
from sklearn.linear_model import SGDClassifier
sgd_svm = SGDClassifier(loss = 'hinge', random_state = 48)

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'penalty': ['elasticnet'], 'alpha': [0.00001, 0.0001,0.001,0.01], 'max_iter': [1000, 1500, 2000] }

gs_svm= GridSearchCV(sgd_svm, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1, scoring = 'f1_micro')
gs_svm.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
gs_svm.best_score_

In [None]:
gs_svm.best_params_

In [None]:
# Training another SGD svm with the best parameters on the full training set

from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier

sgd2_svm = SGDClassifier(loss = 'hinge', random_state =48, penalty = 'elasticnet', max_iter = 1000, alpha = 0.0001 )
sgd2_svm.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
y_predsgd_svm = sgd2_svm.predict(merged_X_val_final)

In [None]:
y_predsvm_test_best = sgd2_svm.predict(merged_test_final)
y_predsvm_test_best = lb.inverse_transform(y_predsvm_test_best)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), sgd2_svm.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), sgd2_svm.predict(merged_X_train_final))
svmsgdtr = f1_score(merged_y_train_final.toarray(), sgd2_svm.predict(merged_X_train_final), average ='weighted')
print('Classification Report for train data using SGD svm with best hyper parameters')

print(cp)
print('The confusion matrix')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_predsgd_svm)
cm = confusion_matrix(merged_y_val_final.toarray(), y_predsgd_svm)
svmsgd = f1_score(merged_y_val_final.toarray(), y_predsgd_svm, average = 'weighted')
print('classification report for validation data using SGD svm with best hyper parameters')
print(cp)

print('The confusion matrix:')
print(cm)

Not much overfitting for the SVM: train 0.84, validation 0.80. This is with 66,260 features.

After removing the stop_words parameter in CountVectorizer, there are 66,588 features, with train 0.85 and validation 0.81. So both train and validation scores for the SVM (with the best hyperparameter values) increased on removal of the stop_words parameter, and there is still not much overfitting.

In [None]:
#Feature selection and PIPELINE using the previously trained sgd2_svm

from sklearn.feature_selection import SelectFromModel

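# sgd2_svm is already fitted, so with prefit=True transform() can be called directly
sfm = SelectFromModel(sgd2_svm, prefit = True)
merged_X_train_fs = sfm.transform(merged_X_train_final)
merged_test_fs = sfm.transform(merged_test_final)

pipe = Pipeline([('sfm', sfm),
                     ('sgd2_svm', sgd2_svm)
                     ])

# note: this refits sgd2_svm in place, on the selected features only
pipe.fit(merged_X_train_final, merged_y_train_final.toarray())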

In [None]:
print(merged_X_train_fs.shape)
print(merged_test_fs.shape)

In [None]:
len(pipe[0].get_feature_names_out())

In [None]:
y_pred_fs_val = pipe.predict(merged_X_val_final)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), pipe.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), pipe.predict(merged_X_train_final))
svmsgdfstr = f1_score(merged_y_train_final.toarray(), pipe.predict(merged_X_train_final), average = 'weighted')
print('classification report for train data after feature selection (using SGD svm with best hyper parameters)')
print(cp)

print('The confusion matrix')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_pred_fs_val)
cm = confusion_matrix(merged_y_val_final.toarray(), y_pred_fs_val)
svmsgdfs = f1_score(merged_y_val_final.toarray(), y_pred_fs_val, average = 'weighted')
print('classification report for validation data after feature selection (using SGD svm with best hyper parameters)')
print(cp)

print('The confusion matrix:')
print(cm)

After feature selection, 9702 features remain, with train 0.84 and validation 0.80 (the same as before feature selection), using the best parameters. So there is no change in the score after feature selection. This is a good thing, because unnecessary features are eliminated and computational resources are saved while the score is maintained.

After removing the stop_words parameter in CountVectorizer and doing feature selection with the best parameters, 9319 features remain, with train 0.85 and validation 0.808. So there is a slight decrease in the validation score after feature selection.
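
The feature-selection step can be sketched on synthetic data as below. This is an illustration under assumptions, not the notebook's run: the data is generated with `make_classification`, the `toy_` names are hypothetical, and the threshold is set to `"mean"` explicitly so that the selection rule is visible (features whose absolute coefficient is at least the mean absolute coefficient are kept).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in: 40 features, only a few of them informative.
toy_X, toy_y = make_classification(
    n_samples=300, n_features=40, n_informative=5, random_state=0
)

toy_sgd = SGDClassifier(
    loss="hinge", penalty="elasticnet", alpha=0.0001,
    max_iter=1000, random_state=48,
)
toy_sgd.fit(toy_X, toy_y)

# prefit=True reuses the fitted coefficients; transform() is called directly.
toy_sfm = SelectFromModel(toy_sgd, prefit=True, threshold="mean")
toy_X_sel = toy_sfm.transform(toy_X)

print("features before:", toy_X.shape[1])
print("features after :", toy_X_sel.shape[1])
```

The reduced matrix keeps the same rows and only the columns with the largest coefficients, which is why the score can stay almost unchanged with far fewer features.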

In [None]:
#graph to compare the SVM with different steps:

scores = [round(linsvctr, 4), round(linsvc,4), round(svmsgdtr,4), round(svmsgd,4), round(svmsgdfstr,4), round(svmsgdfs, 4)]
plt.title('f1_score Comparison among SVC (all after HPT)')
plt.bar(['train LinSVC', 'val LinSVC','train SVM','val SVM','train SVM FS' ,'val SVM FS'], scores)
plt.xlabel('models')
plt.ylabel('f1_score')

for index, value in enumerate(scores):
    plt.text(index, value, str(value), ha='center', va='bottom')


plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.show()



The validation score decreased slightly from 0.813 to 0.808 after feature selection.

In [None]:
y_predsvm_test = pipe.predict(merged_test_final)
y_predsvm_test = lb.inverse_transform(y_predsvm_test)

# y_predsvm_test_best = lb.inverse_transform(sgd2_svm.predict(merged_test_final))

**Logistic Regression Model**

In [None]:
# logistic regression model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

logreg = LogisticRegression(C = 0.1, random_state = 49, max_iter = 80, class_weight = 'balanced')
logreg.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
y_pred_lr = logreg.predict(merged_X_val_final)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), logreg.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), logreg.predict(merged_X_train_final))
logisticregtr = f1_score(merged_y_train_final.toarray(), logreg.predict(merged_X_train_final), average = 'weighted')
print(cp)

print('The confusion matrix:')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_pred_lr)
cm = confusion_matrix(merged_y_val_final.toarray(), y_pred_lr)
logisticreg = f1_score(merged_y_val_final.toarray(), y_pred_lr, average = 'weighted')
print(cp)

print('The confusion matrix')
print(cm)

For the basic LogisticRegression model (without HPT), the f1-score is 0.88 on train and 0.81 on validation, so the model is overfitting.

Next, tried increasing regularisation (C from 1 to 0.01) and decreasing max_iter from 100 to 80, which reduced overfitting (train: 0.78, validation: 0.76), but the overall validation score decreased.

Next, tried C = 0.1 and max_iter = 80, which gave f1-scores of train: 0.84 and validation: 0.80, so not much overfitting.

After removing the stop_words parameter in CountVectorizer and keeping the previous C and max_iter, got f1-scores of train: 0.85 and validation: 0.80. So the validation score did not improve for logistic regression.

In [None]:
print(y_pred_lr[0])

In [None]:
#LogisticRegression with HPT

from sklearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_selection import SelectFromModel

In [None]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(loss = 'log_loss', random_state = 49)

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'penalty': ['elasticnet'], 'alpha': [0.00001, 0.0001,0.001,0.01], 'max_iter': [1000, 1500, 2000] }

gs_logreg= GridSearchCV(sgd, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1, scoring = 'f1_micro')
gs_logreg.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
gs_logreg.best_score_

In [None]:
gs_logreg.best_params_

In [None]:
# Training another SGD logreg with the best parameters on the full training set

from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier

sgd2 = SGDClassifier(loss = 'log_loss', random_state = 49, penalty = 'elasticnet', max_iter = 1000 , alpha = 0.0001 )
sgd2.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
y_predsgd_lr = sgd2.predict(merged_X_val_final)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), sgd2.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), sgd2.predict(merged_X_train_final))
lrsgdtr= f1_score(merged_y_train_final.toarray(), sgd2.predict(merged_X_train_final), average = 'weighted')

print('Classification Report for train data using SGD logreg with best hyper parameters')
print(cp)
print('The confusion matrix:')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_predsgd_lr)
cm = confusion_matrix(merged_y_val_final.toarray(), y_predsgd_lr)
lrsgd = f1_score(merged_y_val_final.toarray(), y_predsgd_lr, average = 'weighted')

print('classification report for validation data using SGD logreg with best hyper parameters')
print(cp)

print('The confusion matrix:')
print(cm)

Not much overfitting for logreg: train 0.83, validation 0.80. This is for 66,260 features.

After removing the stop_words parameter from CountVectorizer, got 66,588 features, with train 0.83 and validation 0.80. So there is no change in the train and validation scores for logistic regression (with the best hyperparameters) on removal of the stop_words parameter.


In [None]:
#Feature selection and PIPELINE using the previously trained sgd2

from sklearn.feature_selection import SelectFromModel

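# sgd2 is already fitted, so with prefit=True transform() can be called directly
sfm = SelectFromModel(sgd2, prefit = True)
merged_X_train_fs = sfm.transform(merged_X_train_final)
merged_test_fs = sfm.transform(merged_test_final)

pipe2 = Pipeline([('sfm', sfm),
                     ('sgd2', sgd2)
                     ])

# note: this refits sgd2 in place, on the selected features only
pipe2.fit(merged_X_train_final, merged_y_train_final.toarray())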

In [None]:
print(merged_X_train_fs.shape)
print(merged_test_fs.shape)

In [None]:
len(pipe2[0].get_feature_names_out())

In [None]:
y_predlr_fs_val = pipe2.predict(merged_X_val_final)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), pipe2.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), pipe2.predict(merged_X_train_final))
lrsgdfstr = f1_score(merged_y_train_final.toarray(), pipe2.predict(merged_X_train_final), average = 'weighted')

print('classification report for train data after feature selection (using SGD logreg with best hyper parameters)')
print(cp)

print('The confusion matrix')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_predlr_fs_val)
cm = confusion_matrix(merged_y_val_final.toarray(), y_predlr_fs_val)
lrsgdfs = f1_score(merged_y_val_final.toarray(), y_predlr_fs_val, average = 'weighted')

print('classification report for validation data after feature selection (using SGD logreg with best hyper parameters)')
print(cp)

print('The confusion matrix')
print(cm)

After feature selection, 6628 features remain, with train 0.83 and validation 0.80 (before feature selection it was also 0.80), using the best parameters. So there is no change in the score after feature selection. This is a good thing, because unnecessary features are eliminated and computational resources are saved while the score is maintained.

After removing the stop_words parameter in CountVectorizer and doing feature selection, 6630 features remain, with f1-scores of train 0.83 and validation 0.80. This is the same as before applying feature selection.

In [None]:
#Comparison of score among LogisticRegression

scores = [round(logisticregtr,4), round(logisticreg,4), round(lrsgdtr,4), round(lrsgd,4), round(lrsgdfstr,4), round(lrsgdfs,4)]
plt.title('f1_score comparison among Logistic regression')
plt.bar(['train-basic', 'val-basic','train-HPT', 'val-HPT', 'train-HPT & FS', 'val-HPT & FS'], scores)
plt.xlabel('models')
plt.ylabel('f1_score')

for index, value in enumerate(scores):
    plt.text(index, value, str(value), ha='center', va='bottom')


plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.show()



The score remains almost the same even after feature selection.

In [None]:
y_predlr_test = pipe2.predict(merged_test_final)
y_predlr_test = lb.inverse_transform(y_predlr_test)

**Naive Bayes model**

In [None]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
y_pred_mnb = mnb.predict(merged_X_val_final)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), mnb.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), mnb.predict(merged_X_train_final))
multinbtr = f1_score(merged_y_train_final.toarray(), mnb.predict(merged_X_train_final), average = 'weighted')
print(cp)

print('The confusion matrix:')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_pred_mnb)
cm = confusion_matrix(merged_y_val_final.toarray(), y_pred_mnb)
multinb = f1_score(merged_y_val_final.toarray(), y_pred_mnb, average = 'weighted')

print(cp)

print('The confusion matrix:')
print(cm)

For the basic MNB, the f1-score is 0.83 on train and 0.78 on validation, so there is some overfitting. Next tried HPT.

After removing the stop_words parameter in CountVectorizer, the train and validation f1-scores are the same as before.
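
Since the HPT for MultinomialNB mainly tunes `alpha`, here is a small sketch of what that parameter does. The count matrix is made up; the formula checked is the usual additive (Laplace) smoothing, where each per-class feature probability is `(count + alpha) / (total count + alpha * n_features)`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: one document per class, three vocabulary terms.
toy_X = np.array([[2, 1, 0],
                  [0, 1, 3]])
toy_y = np.array([0, 1])

toy_mnb = MultinomialNB(alpha=1.0).fit(toy_X, toy_y)

# Smoothed per-term probabilities learned for class 0.
probs_class0 = np.exp(toy_mnb.feature_log_prob_[0])

# The same quantity computed by hand with alpha = 1.
expected = (toy_X[0] + 1.0) / (toy_X[0].sum() + 1.0 * toy_X.shape[1])

print(probs_class0)
print(expected)
```

The third term never appears in class 0, yet it still gets a non-zero probability. A larger `alpha` pulls all term probabilities closer together, which acts as regularisation for this model.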

In [None]:
# Multinomial Naive Bayes with HPT
from sklearn.naive_bayes import MultinomialNB

mnb2 = MultinomialNB()

from sklearn.model_selection import RandomizedSearchCV
param_dist = {'alpha': [0.001, 0.1, 1], 'force_alpha': [True, False]}

rs_mnb = RandomizedSearchCV(mnb2, param_distributions = param_dist, scoring = 'f1_micro', cv = 5, n_jobs =-1)
rs_mnb.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
print("best score:", rs_mnb.best_score_)
print("best parameters:", rs_mnb.best_params_)

In [None]:
# Training another MNB with the best parameters on the full training set

from sklearn.metrics import classification_report
mnb3 = MultinomialNB(alpha =1, force_alpha = True)
mnb3.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
y_predmnb_val = mnb3.predict(merged_X_val_final)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), mnb3.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), mnb3.predict(merged_X_train_final))
multinbhpttr = f1_score(merged_y_train_final.toarray(), mnb3.predict(merged_X_train_final), average = 'weighted')

print('Classification Report for train data using MultinomialNB with best hyper parameters')
print(cp)

print('The confusion matrix:')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_predmnb_val)
cm = confusion_matrix(merged_y_val_final.toarray(), y_predmnb_val)
multinbhpt = f1_score(merged_y_val_final.toarray(), y_predmnb_val, average = 'weighted')

print('classification report for validation data using MultinomialNB with best hyper parameters')
print(cp)

print('The confusion matrix:')
print(cm)

Did not perform feature selection because the validation score was not improving much even after hyperparameter tuning.

In [None]:
#Comparison of score among MultinomialNB

scores = [round(multinbtr,4), round(multinb,4), round(multinbhpttr,4), round(multinbhpt,4)]
plt.title('f1_score comparison among MNB')
plt.bar(['train-basic', 'val-basic','train-HPT', 'val-HPT'], scores)
plt.xlabel('models')
plt.ylabel('f1_score')

for index, value in enumerate(scores):
    plt.text(index, value, str(value), ha='center', va='bottom')


plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.show()


In [None]:
y_predmnb_test = mnb3.predict(merged_test_final)
y_predmnb_test = lb.inverse_transform(y_predmnb_test)

**Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
y_pred_dtc = dtc.predict(merged_X_val_final)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), dtc.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), dtc.predict(merged_X_train_final))
decttr = f1_score(merged_y_train_final.toarray(), dtc.predict(merged_X_train_final), average = 'weighted')
print(cp)
print('The Confusion Matrix:')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_pred_dtc)
cm = confusion_matrix(merged_y_val_final.toarray(), y_pred_dtc)
dect = f1_score(merged_y_val_final.toarray(), y_pred_dtc, average = 'weighted')

print(cp)
print('The confusion matrix:')
print(cm)

For the basic DecisionTreeClassifier, the f1-score is 1.0 on train and 0.70 on validation, so the model is overfitting.
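
A minimal sketch of why an unconstrained tree reaches a perfect train score, using synthetic data (not this notebook's features). With no depth limit the tree keeps splitting until every training sample is classified correctly; `max_depth` caps that growth. The `toy_`, `deep` and `shallow` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

toy_X, toy_y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(toy_X, toy_y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(random_state=0, max_depth=3).fit(X_tr, y_tr)

for name, tree in [("unconstrained", deep), ("max_depth=3", shallow)]:
    print(
        name,
        "| depth:", tree.get_depth(),
        "| train acc:", round(tree.score(X_tr, y_tr), 3),
        "| val acc:", round(tree.score(X_va, y_va), 3),
    )
```

Limiting depth narrows the gap between train and validation scores, although it does not by itself guarantee a better validation score.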

In [None]:
# DecisionTreeClassifier with HPT
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

dtc2 = DecisionTreeClassifier(random_state=0)

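# note: a float min_samples_leaf is read as a fraction of the training samples (0.5 = 50%); the ints 1 and 2 are absolute counts
param_dist = {'max_depth': [2,3,5], 'min_samples_leaf': [0.5, 1,2], 'min_samples_split': [2,5,7]}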

rs_dtc = RandomizedSearchCV(dtc2, param_distributions = param_dist, scoring = 'f1_micro', cv = 5, n_jobs =-1)
rs_dtc.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
print("best score:", rs_dtc.best_score_)
print("best parameters:", rs_dtc.best_params_)

In [None]:
# Training another DTC with the best parameters on the full training set

from sklearn.metrics import classification_report
dtc3 = DecisionTreeClassifier(random_state =0 ,min_samples_split = 7, min_samples_leaf= 1, max_depth =5 )
dtc3.fit(merged_X_train_final, merged_y_train_final.toarray())

In [None]:
y_preddtc_val = dtc3.predict(merged_X_val_final)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_train_final.toarray(), dtc3.predict(merged_X_train_final))
cm = confusion_matrix(merged_y_train_final.toarray(), dtc3.predict(merged_X_train_final))
decthpttr = f1_score(merged_y_train_final.toarray(), dtc3.predict(merged_X_train_final), average = 'weighted')

print('Classification Report for train data using DecisionTreeClassifier with best hyper parameters')
print(cp)
print('The confusion matrix:')
print(cm)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
cp = classification_report(merged_y_val_final.toarray(), y_preddtc_val)
cm = confusion_matrix(merged_y_val_final.toarray(), y_preddtc_val)
decthpt = f1_score(merged_y_val_final.toarray(), y_preddtc_val, average = 'weighted')

print('classification report for validation data using DecisionTreeClassifier with best hyper parameters')
print(cp)
print('The confusion matrix:')
print(cm)

After hyperparameter tuning, the f1-score is 0.70 on train and 0.69 on validation, so overfitting is reduced. The validation score dropped from 0.70 to 0.69.

In [None]:
#Comparison of score among DecisionTree

scores = [round(decttr,4), round(dect,4), round(decthpttr,4), round(decthpt,4)]
plt.title('f1_score comparison among DecisionTreeClassifier')
plt.bar(['train-basic', 'val-basic','train-HPT', 'val-HPT'], scores)
plt.xlabel('models')
plt.ylabel('f1_score')

for index, value in enumerate(scores):
    plt.text(index, value, str(value), ha='center', va='bottom')


plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.show()

In [None]:
y_preddtc_test = dtc3.predict(merged_test_final)
y_preddtc_test = lb.inverse_transform(y_preddtc_test)

In [None]:
#Comparison of best validation score among all

scores = [round(svmsgd,4), round(lrsgdfs,4), round(multinbhpt,4), round(decthpt,4)]
plt.title('Best validation f1_score comparison among all')
plt.bar(['SVM HPT(no FS)', 'LR HPT & FS','MNB HPT', 'DT HPT'], scores)
plt.xlabel('models')
plt.ylabel('f1_score')

for index, value in enumerate(scores):
    plt.text(index, value, str(value), ha='center', va='bottom')


plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.show()

Among all models, the SVM with hyperparameter tuning and without feature selection gives the best validation score at 0.8133, followed by the LogisticRegression model with hyperparameter tuning and feature selection at 0.8047. Multinomial Naive Bayes and Decision Tree, both with hyperparameter tuning, give best scores of 0.7834 and 0.6941 respectively.

In [None]:
submission = pd.DataFrame(columns = ['id', 'sentiment'])
submission["id"] = [i for i in range(len(y_predsvm_test_best))]
submission["sentiment"] = y_predsvm_test_best
submission.to_csv("submission.csv", index = False)