### Dataset Link: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format

Sentiment analysis helps us gauge the mood and emotions of a customer or reviewer and extract useful information about the context in which they write. It is the process of analyzing text and classifying it according to the needs of the research.

In [39]:
import pandas as pd
from textblob import TextBlob
from nltk.tokenize.toktok import ToktokTokenizer
import re
tokenizer = ToktokTokenizer()
import spacy
nlp = spacy.load('en_core_web_sm', disable=['ner'])

In [40]:
TextBlob("he is very good boy").sentiment

Sentiment(polarity=0.9099999999999999, subjectivity=0.7800000000000001)

In [41]:
TextBlob("he is not a good boy").sentiment

Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)

In [42]:
TextBlob("Eerybody says this man is poor").sentiment

Sentiment(polarity=-0.4, subjectivity=0.6)

### Polarity and Subjectivity
Polarity is a float that indicates whether a sentence is positive or negative. Its value lies in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.

Subjectivity, on the other hand, measures how far a sentence expresses personal opinion, emotion, or judgment as opposed to factual information. It is also a float, in the range [0, 1]; the closer the value is to 1, the more likely the sentence is an opinion rather than a fact.
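
The three TextBlob outputs above are consistent with a simple lexicon scheme: a base score per word, a multiplier for intensifiers such as "very", and a sign flip with damping for negation. The sketch below reproduces the numbers under that reading. The constants (0.7 and 0.6 for "good", 1.3 for "very", -0.5 for "not") are assumptions inferred from the outputs, not values quoted from TextBlob's documentation, and the helper names are hypothetical.

```python
import math

# Assumed lexicon entries as (polarity, subjectivity), inferred from the outputs above.
GOOD = (0.7, 0.6)
POOR = (-0.4, 0.6)
VERY = 1.3        # assumed intensifier multiplier
NEGATION = -0.5   # assumed factor applied to polarity by "not"

def clamp(value, low, high):
    """Keep a score inside its documented range."""
    return max(low, min(high, value))

def intensify(score, factor):
    """Scale both polarity and subjectivity by an intensifier."""
    polarity, subjectivity = score
    return (clamp(polarity * factor, -1.0, 1.0),
            clamp(subjectivity * factor, 0.0, 1.0))

def negate(score):
    """Flip and dampen polarity; subjectivity is left unchanged."""
    polarity, subjectivity = score
    return (clamp(polarity * NEGATION, -1.0, 1.0), subjectivity)

very_good = intensify(GOOD, VERY)   # "he is very good boy"
not_good = negate(GOOD)             # "he is not a good boy"

print(very_good, not_good, POOR)
```

Under these assumptions "very good" comes out near (0.91, 0.78) and "not a good" near (-0.35, 0.6), which matches the cells above up to floating-point rounding.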

In [43]:
### Data Loading
train=pd.read_csv("Train.csv")
train

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1
...,...,...
39995,"""Western Union"" is something of a forgotten cl...",1
39996,This movie is an incredible piece of work. It ...,1
39997,My wife and I watched this movie because we pl...,0
39998,"When I first watched Flatliners, I was amazed....",1


In [44]:
label_0=train[train['label']==0].sample(n=5000)
label_1=train[train['label']==1].sample(n=5000)

In [45]:
train=pd.concat([label_1,label_0])
from sklearn.utils import shuffle
train = shuffle(train)
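
The sampling above draws different rows on every run because no seed is set. A minimal sketch of a reproducible variant follows, assuming a fixed `random_state`; the `balanced_sample` helper and the toy frame are hypothetical stand-ins for the notebook's `Train.csv` data.

```python
import pandas as pd

def balanced_sample(df, label_col, n_per_class, seed=42):
    """Draw n_per_class rows for each label value, then shuffle the result."""
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # sample(frac=1) returns every row in a shuffled order.
    return pd.concat(parts).sample(frac=1, random_state=seed)

# Tiny made-up frame standing in for the real training data.
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "label": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

subset = balanced_sample(toy, "label", n_per_class=3)
print(subset["label"].value_counts().to_dict())
```

With a fixed seed, rerunning the notebook yields the same rows, so the tables shown later stay comparable between runs.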

In [46]:
train

Unnamed: 0,text,label
10204,Uh oh! Another gay film. This time it's showin...,0
28500,In case you're wondering the buffoonish Loren ...,0
34313,"A must see film with great dialogues, great mu...",1
7195,"This film does not fail to engage and move, ev...",1
22891,After watching the Next Action Star reality TV...,1
...,...,...
25690,There's really no way to beat around the bush ...,0
15506,"Okay, I struggled to set aside the fact that i...",0
20273,The mind boggles at exactly what about Univers...,0
11102,I just saw this movie at the Tribeca Film Fest...,1


Here, the data has two labels, 0 and 1: 0 stands for "Negative" and 1 stands for "Positive".

### Data Preprocessing

In [47]:
train.isnull().sum()

text     0
label    0
dtype: int64

In [49]:
import numpy as np
train.replace(r'^\s*$', np.nan, regex=True,inplace=True)
train.dropna(axis = 0, how = 'any', inplace = True)

In [50]:
train.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
print('escape seq removed')

escape seq removed
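
Replacing escape sequences with an empty string joins the words on either side of a tab or newline. The sketch below shows a variant on a plain string that substitutes a space and then collapses repeated spaces; `strip_escapes` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches the literal two-character sequences backslash+t/n/r
# as well as real tab, newline and carriage-return characters.
ESCAPES = re.compile(r"\\[tnr]|[\t\n\r]")

def strip_escapes(text):
    """Replace escape sequences with a space, then collapse runs of spaces."""
    text = ESCAPES.sub(" ", text)
    return re.sub(r" {2,}", " ", text).strip()

# Contains a real tab, a literal backslash-n, and a real CR+LF pair.
sample = "great\tmovie\\nloved it\r\n"
cleaned = strip_escapes(sample)
print(repr(cleaned))
```

With an empty replacement the same input would become `greatmovieloved it`, which loses two word boundaries.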


In [51]:
# Repeat the blank-row check: rows that held only escape sequences are empty after the previous step.
train.replace(r'^\s*$', np.nan, regex=True, inplace=True)
train.dropna(axis=0, how='any', inplace=True)

In [52]:
train

Unnamed: 0,text,label
10204,Uh oh! Another gay film. This time it's showin...,0
28500,In case you're wondering the buffoonish Loren ...,0
34313,"A must see film with great dialogues, great mu...",1
7195,"This film does not fail to engage and move, ev...",1
22891,After watching the Next Action Star reality TV...,1
...,...,...
25690,There's really no way to beat around the bush ...,0
15506,"Okay, I struggled to set aside the fact that i...",0
20273,The mind boggles at exactly what about Univers...,0
11102,I just saw this movie at the Tribeca Film Fest...,1


In [53]:
train['text']=train['text'].str.encode('ascii', 'ignore').str.decode('ascii')
print('non-ascii data removed')

non-ascii data removed
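
Encoding to ASCII with `ignore` deletes accented letters outright, which can damage names and titles. A hedged alternative is to decompose characters first so that the base letter survives; `to_ascii` below is a hypothetical helper built on the standard library's `unicodedata` module.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

title = "Am\u00e9lie"  # the e carries an acute accent

dropped = title.encode("ascii", "ignore").decode("ascii")
folded = to_ascii(title)
print(dropped, folded)
```

Here the plain approach yields `Amlie` while the normalized one yields `Amelie`. Whether that matters depends on how many reviews contain accented text.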


In [54]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [55]:
def remove_punctuations(text):
    # Delete every character listed in string.punctuation in a single pass.
    return text.translate(str.maketrans('', '', string.punctuation))
train['text']=train['text'].apply(remove_punctuations)

In [56]:
train

Unnamed: 0,text,label
10204,Uh oh Another gay film This time its showing t...,0
28500,In case youre wondering the buffoonish Loren C...,0
34313,A must see film with great dialogues great mus...,1
7195,This film does not fail to engage and move eve...,1
22891,After watching the Next Action Star reality TV...,1
...,...,...
25690,Theres really no way to beat around the bush i...,0
15506,Okay I struggled to set aside the fact that in...,0
20273,The mind boggles at exactly what about Univers...,0
11102,I just saw this movie at the Tribeca Film Fest...,1


In [57]:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [58]:
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

In [59]:
def custom_remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [60]:
train['text']=train['text'].apply(custom_remove_stopwords)
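
The order of the cleaning steps affects what the stopword filter can match. Once punctuation is stripped, a contraction such as "you're" becomes "youre", which is not in the stopword list and therefore stays in the text. The sketch below illustrates this with a small stand-in stopword set and whitespace tokenization; both are simplifying assumptions rather than the notebook's NLTK list and ToktokTokenizer.

```python
import string

# Small stand-in for the NLTK English list; the real list is much longer.
STOPWORDS = {"you're", "in", "the", "is"}

def strip_punctuation(text):
    """Delete all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def drop_stopwords(text):
    """Whitespace tokenization is a simplification of the tokenizer used above."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

review = "In case you're wondering, the sequel is bad"

punct_first = drop_stopwords(strip_punctuation(review))
stops_first = strip_punctuation(drop_stopwords(review))
print(punct_first)
print(stops_first)
```

Stripping punctuation first leaves `youre` in the result; filtering stopwords first removes the contraction before its apostrophe is lost.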

In [61]:
train

Unnamed: 0,text,label
10204,Uh oh Another gay film time showing black side...,0
28500,case youre wondering buffoonish Loren Cn Crypt...,0
34313,must see film great dialogues great music grea...,1
7195,film not fail engage move even 2008 audience f...,1
22891,watching Next Action Star reality TV series pl...,1
...,...,...
25690,Theres really no way beat around bush saying L...,0
15506,Okay struggled set aside fact selling EVP real...,0
20273,mind boggles exactly Universal Soldier merited...,0
11102,saw movie Tribeca Film Festival say thought am...,1


In [62]:
def remove_special_characters(text):
    # Keep only ASCII letters, digits and whitespace.
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

In [63]:
train['text']=train['text'].apply(remove_special_characters)
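
In a character class the upper bound of the letter range matters. Writing `A-z` instead of `A-Z` spans ASCII codes 65 to 122, which also admits `[`, `\`, `]`, `^`, `_` and the backtick. The sketch below contrasts the two patterns on a made-up string.

```python
import re

text = "plot_twist [spoiler] 10^2 `quoted`"

# A-z covers the six symbols that sit between 'Z' and 'a' in ASCII.
loose = re.sub(r"[^a-zA-z0-9\s]", "", text)
# A-Z restricts the class to letters only.
strict = re.sub(r"[^a-zA-Z0-9\s]", "", text)

print(loose)
print(strict)
```

The loose pattern returns the input unchanged, while the strict one removes the underscore, brackets, caret and backticks.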

In [64]:
def remove_html(text):
    # Replace anything that looks like an HTML tag, e.g. <br />, with a space.
    html_pattern = re.compile(r'<.*?>')
    return html_pattern.sub(' ', text)

In [65]:
train['text']=train['text'].apply(remove_html)

In [66]:
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r' ',text)

In [67]:
train['text']=train['text'].apply(remove_URL)
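
The HTML and URL patterns rely on characters such as `<`, `>`, `:`, `/` and `.`, so they can only match if punctuation has not yet been removed. The sketch below compares the two orderings on a made-up review; the URL is a placeholder and the helper names are hypothetical.

```python
import re
import string

HTML_TAG = re.compile(r"<.*?>")
URL = re.compile(r"https?://\S+|www\.\S+")

def strip_markup(text):
    """Replace HTML tags and URLs with a space."""
    return URL.sub(" ", HTML_TAG.sub(" ", text))

def strip_punctuation(text):
    """Delete all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def squeeze(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

raw = "Loved it!<br />See www.example.com for more"

markup_first = squeeze(strip_punctuation(strip_markup(raw)))
punct_first = squeeze(strip_markup(strip_punctuation(raw)))
print(markup_first)
print(punct_first)
```

When punctuation goes first, the tag and URL remnants (`br`, `wwwexamplecom`) stay in the text as ordinary words, so running the markup filters before punctuation removal is the safer order.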

In [68]:
def remove_numbers(text):
    """ Removes integers """
    text = ''.join([i for i in text if not i.isdigit()])         
    return text

In [69]:
train['text']=train['text'].apply(remove_numbers)

In [70]:
def cleanse(word):
    # Return '' for any word that contains a digit, e.g. "2nd" or "mp3".
    rx = re.compile(r'\D*\d')
    if rx.match(word):
        return ''
    return word
def remove_alphanumeric(text):
    # Rejoin only the words that survive cleanse().
    return ' '.join(word for word in text.split() if cleanse(word))

In [71]:
train['text']=train['text'].apply(remove_alphanumeric)

In [72]:
train

Unnamed: 0,text,label
10204,Uh oh Another gay film time showing black side...,0
28500,case youre wondering buffoonish Loren Cn Crypt...,0
34313,must see film great dialogues great music grea...,1
7195,film not fail engage move even audience famili...,1
22891,watching Next Action Star reality TV series pl...,1
...,...,...
25690,Theres really no way beat around bush saying L...,0
15506,Okay struggled set aside fact selling EVP real...,0
20273,mind boggles exactly Universal Soldier merited...,0
11102,saw movie Tribeca Film Festival say thought am...,1


In [37]:
def lemmatize_text(text):
    # spaCy 2.x lemmatizes pronouns to the placeholder '-PRON-'; keep the original token in that case.
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

In [38]:
train['text']=train['text'].apply(lemmatize_text)

In [73]:
train['sentiment'] = train['text'].apply(lambda review: TextBlob(review).sentiment)

In [74]:
train

Unnamed: 0,text,label,sentiment
10204,Uh oh Another gay film time showing black side...,0,"(0.0831932773109244, 0.39383753501400554)"
28500,case youre wondering buffoonish Loren Cn Crypt...,0,"(-0.4166666666666667, 0.6095238095238096)"
34313,must see film great dialogues great music grea...,1,"(0.39841269841269844, 0.6341269841269841)"
7195,film not fail engage move even audience famili...,1,"(0.2032271944922547, 0.4665232358003442)"
22891,watching Next Action Star reality TV series pl...,1,"(0.20686456400742112, 0.5181199752628324)"
...,...,...,...
25690,Theres really no way beat around bush saying L...,0,"(-0.08806216931216931, 0.3828703703703704)"
15506,Okay struggled set aside fact selling EVP real...,0,"(0.09909090909090905, 0.4511111111111112)"
20273,mind boggles exactly Universal Soldier merited...,0,"(0.09144283746556475, 0.452347337006428)"
11102,saw movie Tribeca Film Festival say thought am...,1,"(0.390909090909091, 0.5636363636363636)"


In [75]:
sentiment_series = train['sentiment'].tolist()

In [76]:
columns = ['polarity', 'subjectivity']
df1 = pd.DataFrame(sentiment_series, columns=columns, index=train.index)

In [77]:
df1

Unnamed: 0,polarity,subjectivity
10204,0.083193,0.393838
28500,-0.416667,0.609524
34313,0.398413,0.634127
7195,0.203227,0.466523
22891,0.206865,0.518120
...,...,...
25690,-0.088062,0.382870
15506,0.099091,0.451111
20273,0.091443,0.452347
11102,0.390909,0.563636


In [78]:
result = pd.concat([train,df1],axis=1)

In [79]:
result.drop(['sentiment'],axis=1,inplace=True)

In [80]:
result.loc[result['polarity']>=0.3, 'Sentiment'] = "Positive"
result.loc[result['polarity']<0.3, 'Sentiment'] = "Negative"

In [81]:
result

Unnamed: 0,text,label,polarity,subjectivity,Sentiment
10204,Uh oh Another gay film time showing black side...,0,0.083193,0.393838,Negative
28500,case youre wondering buffoonish Loren Cn Crypt...,0,-0.416667,0.609524,Negative
34313,must see film great dialogues great music grea...,1,0.398413,0.634127,Positive
7195,film not fail engage move even audience famili...,1,0.203227,0.466523,Negative
22891,watching Next Action Star reality TV series pl...,1,0.206865,0.518120,Negative
...,...,...,...,...,...
25690,Theres really no way beat around bush saying L...,0,-0.088062,0.382870,Negative
15506,Okay struggled set aside fact selling EVP real...,0,0.099091,0.451111,Negative
20273,mind boggles exactly Universal Soldier merited...,0,0.091443,0.452347,Negative
11102,saw movie Tribeca Film Festival say thought am...,1,0.390909,0.563636,Positive


In [82]:
result.loc[result['label']==1, 'Sentiment_label'] = 1
result.loc[result['label']==0, 'Sentiment_label'] = 0

In [83]:
result

Unnamed: 0,text,label,polarity,subjectivity,Sentiment,Sentiment_label
10204,Uh oh Another gay film time showing black side...,0,0.083193,0.393838,Negative,0.0
28500,case youre wondering buffoonish Loren Cn Crypt...,0,-0.416667,0.609524,Negative,0.0
34313,must see film great dialogues great music grea...,1,0.398413,0.634127,Positive,1.0
7195,film not fail engage move even audience famili...,1,0.203227,0.466523,Negative,1.0
22891,watching Next Action Star reality TV series pl...,1,0.206865,0.518120,Negative,1.0
...,...,...,...,...,...,...
25690,Theres really no way beat around bush saying L...,0,-0.088062,0.382870,Negative,0.0
15506,Okay struggled set aside fact selling EVP real...,0,0.099091,0.451111,Negative,0.0
20273,mind boggles exactly Universal Soldier merited...,0,0.091443,0.452347,Negative,0.0
11102,saw movie Tribeca Film Festival say thought am...,1,0.390909,0.563636,Positive,1.0
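
With both the dataset label and the polarity-based label in one table, the two can be compared directly. The sketch below does so for the nine rows displayed above and also tries a threshold of 0.0 next to the notebook's 0.3. Nine rows are far too few to draw conclusions from; the point is only to show how the comparison works. The `accuracy` helper is hypothetical.

```python
# (label, polarity) pairs copied from the nine rows displayed above.
rows = [
    (0, 0.083193), (0, -0.416667), (1, 0.398413),
    (1, 0.203227), (1, 0.206865), (0, -0.088062),
    (0, 0.099091), (0, 0.091443), (1, 0.390909),
]

def accuracy(rows, threshold):
    """Share of rows where (polarity >= threshold) agrees with the label."""
    hits = sum(int(polarity >= threshold) == label for label, polarity in rows)
    return hits / len(rows)

acc_03 = accuracy(rows, 0.3)  # threshold used in the notebook
acc_00 = accuracy(rows, 0.0)  # alternative cut at neutral polarity

print(f"threshold 0.3: {acc_03:.2f}, threshold 0.0: {acc_00:.2f}")
```

On these rows the 0.3 threshold agrees with the label in 7 of 9 cases and the 0.0 threshold in 6 of 9. Running the same comparison over the full `result` frame would give a more meaningful estimate.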
