## Sentiment Analysis

Sentiment analysis helps us gauge the mood and emotions of a customer or reviewer and extract useful insight from the surrounding context. It is the process of analyzing text and classifying it according to the needs of the research, here as positive or negative.

In [1]:
import pandas as pd
from textblob import TextBlob
from nltk.tokenize.toktok import ToktokTokenizer
import re
tokenizer = ToktokTokenizer()
import spacy
nlp = spacy.load('en_core_web_sm', disable=['ner'])

In [2]:
TextBlob("he is very good boy").sentiment

Sentiment(polarity=0.9099999999999999, subjectivity=0.7800000000000001)

In [3]:
TextBlob("he is not a good boy").sentiment

Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)

In [4]:
TextBlob("Eerybody says this man is poor").sentiment

Sentiment(polarity=-0.4, subjectivity=0.6)

### Polarity and Subjectivity
Polarity is a float that indicates whether a sentence is positive or negative. It ranges over [-1, 1], where 1 is the most positive statement and -1 the most negative.

Subjectivity, by contrast, measures how much of a sentence is personal opinion, emotion, or judgment rather than factual information. It is also a float, in the range [0, 1]: the closer the value is to 1, the more likely the sentence expresses a personal opinion rather than a fact.
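
To make the two ranges concrete, here is a minimal sketch that turns a `(polarity, subjectivity)` pair into a short description. The helper name `describe_sentiment` and the cut-offs (the sign of the polarity, 0.5 for subjectivity) are assumptions chosen for illustration; they are not part of TextBlob.

```python
def describe_sentiment(polarity, subjectivity):
    """Describe a TextBlob-style (polarity, subjectivity) pair in words."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if not 0.0 <= subjectivity <= 1.0:
        raise ValueError("subjectivity must lie in [0, 1]")

    # Illustrative cut-off: the sign of the polarity decides the tone.
    if polarity > 0:
        tone = "positive"
    elif polarity < 0:
        tone = "negative"
    else:
        tone = "neutral"

    # Illustrative cut-off: 0.5 separates opinion from fact.
    kind = "subjective" if subjectivity >= 0.5 else "objective"
    return f"{tone}, {kind}"


# Scores rounded from the outputs of cells In [2] and In [3] above.
print(describe_sentiment(0.91, 0.78))   # "he is very good boy"
print(describe_sentiment(-0.35, 0.60))  # "he is not a good boy"
```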

### Data Loading

In [5]:
train=pd.read_csv("Train.csv")
train

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1
...,...,...
39995,"""Western Union"" is something of a forgotten cl...",1
39996,This movie is an incredible piece of work. It ...,1
39997,My wife and I watched this movie because we pl...,0
39998,"When I first watched Flatliners, I was amazed....",1


In [6]:
label_0=train[train['label']==0].sample(n=5000)
label_1=train[train['label']==1].sample(n=5000)

In [7]:
train=pd.concat([label_1,label_0])
from sklearn.utils import shuffle
train = shuffle(train)
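
Because neither `sample` nor `shuffle` is seeded above, every run draws different rows, so the outputs shown in this notebook cannot be reproduced exactly. A minimal sketch of a seeded, balanced draw follows, assuming pandas 1.1 or later (for `groupby(...).sample`). The toy frame, the seed value and the sample size are made up for illustration.

```python
import pandas as pd

# Toy stand-in for Train.csv: 6 negative and 4 positive reviews (made-up text).
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "label": [0] * 6 + [1] * 4,
})

SEED = 42  # assumed value; any fixed integer makes the draw repeatable

# Draw the same number of rows from each label, then shuffle the result.
balanced = (
    toy.groupby("label")
       .sample(n=3, random_state=SEED)
       .sample(frac=1, random_state=SEED)
)

print(balanced["label"].value_counts())
```

On the real data the same pattern would use `n=5000`, matching the two `sample` calls above.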

In [8]:
train

Unnamed: 0,text,label
15077,While it was nice to see a film about older pe...,0
19422,Same old same old about Che. It completely ign...,0
10822,Is this a good movie? It's hard to say -- but ...,1
20273,The mind boggles at exactly what about Univers...,0
25788,"This is surprisingly above average slasher, th...",1
...,...,...
7451,this is one of the funniest shows i have ever ...,1
4178,Just in time to capitalize on the long-awaited...,0
17145,This is one creepy movie. Creepier than anythi...,1
19107,This film is amazing - it's just like a nightm...,1


The data has two labels: 0 stands for "Negative" and 1 stands for "Positive".

### Data Preprocessing

In [9]:
train.isnull().sum()

text     0
label    0
dtype: int64

In [10]:
" "

' '

In [11]:
import numpy as np
train.replace(r'^\s*$', np.nan, regex=True,inplace=True)
train.dropna(axis = 0, how = 'any', inplace = True)

In [12]:
train.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
print('escape seq removed')

escape seq removed
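
The two patterns in the cell above target different things: the first matches a literal backslash followed by `t`, `n` or `r` (escape sequences that were stored as text), while the second matches the real tab, newline and carriage-return characters. The sketch below shows both on a made-up string using only the standard `re` module. It substitutes a space rather than an empty string, which is an assumption that differs from the cell above and avoids gluing neighbouring words together.

```python
import re

# Pattern 1: the two-character sequences backslash + t / n / r.
literal_escapes = re.compile(r"\\t|\\n|\\r")
# Pattern 2: the actual control characters tab, newline, carriage return.
control_chars = re.compile("\t|\n|\r")

# Made-up sample: one literal "\n", one real tab, one real newline.
sample = "good\\nmovie\tbut\nslow"

step1 = literal_escapes.sub(" ", sample)
step2 = control_chars.sub(" ", step1)
print(step2)  # good movie but slow
```

With an empty replacement, as in the cell above, the same input would collapse to `goodmoviebutslow`.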


In [13]:
# Repeat the blank-row check: stripping escape sequences can leave some reviews empty
train.replace(r'^\s*$', np.nan, regex=True, inplace=True)
train.dropna(axis=0, how='any', inplace=True)

In [14]:
train

Unnamed: 0,text,label
15077,While it was nice to see a film about older pe...,0
19422,Same old same old about Che. It completely ign...,0
10822,Is this a good movie? It's hard to say -- but ...,1
20273,The mind boggles at exactly what about Univers...,0
25788,"This is surprisingly above average slasher, th...",1
...,...,...
7451,this is one of the funniest shows i have ever ...,1
4178,Just in time to capitalize on the long-awaited...,0
17145,This is one creepy movie. Creepier than anythi...,1
19107,This film is amazing - it's just like a nightm...,1


In [15]:
train['text']=train['text'].str.encode('ascii', 'ignore').str.decode('ascii')
print('non-ascii data removed')

non-ascii data removed


In [16]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [17]:
def remove_punctuations(text):
    """Delete every character listed in string.punctuation."""
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

train['text'] = train['text'].apply(remove_punctuations)
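
The loop above calls `str.replace` once per punctuation character, which means 32 passes over each review. A single-pass alternative, sketched here with the standard `str.translate`, gives the same result. The name `remove_punctuations_fast` is hypothetical and not part of the notebook.

```python
import string

# Translation table that maps every punctuation character to "delete".
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuations_fast(text):
    """Strip ASCII punctuation in a single pass over the string."""
    return text.translate(_PUNCT_TABLE)


print(remove_punctuations_fast("Is this a good movie? It's hard to say -- but..."))
```

Note that removing `--` leaves two consecutive spaces behind; later tokenization on whitespace makes that harmless.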

In [18]:
train

Unnamed: 0,text,label
15077,While it was nice to see a film about older pe...,0
19422,Same old same old about Che It completely igno...,0
10822,Is this a good movie Its hard to say but in 1...,1
20273,The mind boggles at exactly what about Univers...,0
25788,This is surprisingly above average slasher tha...,1
...,...,...
7451,this is one of the funniest shows i have ever ...,1
4178,Just in time to capitalize on the longawaited ...,0
17145,This is one creepy movie Creepier than anythin...,1
19107,This film is amazing its just like a nightmar...,1


In [19]:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [20]:
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

In [21]:
def custom_remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [22]:
train['text']=train['text'].apply(custom_remove_stopwords)
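
Cell In [20] takes 'no' and 'not' out of the stopword list because negations flip the polarity of a sentence, as the contrast between cells In [2] and In [3] showed. The following sketch illustrates the effect with a tiny hand-written stopword set and plain whitespace splitting. Both are simplifying assumptions; the notebook itself uses NLTK's English list and `ToktokTokenizer`.

```python
# Tiny hand-written stopword set for illustration only.
demo_stopwords = {"he", "is", "a", "the", "no", "not"}
# The same set with the negations taken out, mirroring cell In [20].
kept_negations = demo_stopwords - {"no", "not"}


def drop_stopwords(text, stopwords):
    """Remove stopwords (case-insensitively) from whitespace-separated text."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)


sentence = "he is not a good boy"
print(drop_stopwords(sentence, demo_stopwords))  # good boy
print(drop_stopwords(sentence, kept_negations))  # not good boy
```

With the negation filtered out, the negative sentence becomes indistinguishable from "he is a good boy".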

In [23]:
train

Unnamed: 0,text,label
15077,nice see film older people finding falling lov...,0
19422,old old Che completely ignored really interest...,0
10822,good movie hard say 1953 many people remarkabl...,1
20273,mind boggles exactly Universal Soldier merited...,0
25788,surprisingly average slasher thats enjoyable w...,1
...,...,...
7451,one funniest shows ever seen really refreshing...,1
4178,time capitalize longawaited movie version Drea...,0
17145,one creepy movie Creepier anything David Lynch...,1
19107,film amazing like nightmare bizarre story dark...,1


In [24]:
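def remove_special_characters(text):
    # Keep letters, digits and whitespace; the range must be A-Z, since A-z also lets [ \ ] ^ _ ` through
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text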

In [25]:
train['text']=train['text'].apply(remove_special_characters)

In [26]:
def remove_html(text):
    import re
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r' ', text)

In [27]:
train['text']=train['text'].apply(remove_html)

In [28]:
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r' ',text)

In [29]:
train['text']=train['text'].apply(remove_URL)
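
A note on ordering: in the cell sequence above, `remove_punctuations` runs before `remove_html` and `remove_URL`. By then the characters `<`, `>`, `:`, `/` and `.` are already gone, so the two patterns have nothing left to match. The sketch below compares both orders on a made-up string. The function names and the whitespace-collapsing step are assumptions added for illustration.

```python
import re
import string

HTML_TAG = re.compile(r"<.*?>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_markup_then_punctuation(text):
    """Remove tags and URLs while their delimiters still exist."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())  # collapse the runs of spaces left behind


def strip_punctuation_then_markup(text):
    """Same steps with punctuation removed first, for comparison."""
    text = text.translate(PUNCT_TABLE)
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    return " ".join(text.split())


raw = "Great film!<br />See http://example.com/review for more."
print(strip_markup_then_punctuation(raw))  # Great film See for more
print(strip_punctuation_then_markup(raw))  # Great filmbr See httpexamplecomreview for more
```

If the reviews contain markup or links, applying the HTML and URL steps before punctuation removal keeps fragments such as `br` out of the cleaned text.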

In [30]:
def remove_numbers(text):
    """Remove every digit character from the text."""
    text = ''.join([ch for ch in text if not ch.isdigit()])
    return text

In [31]:
train['text']=train['text'].apply(remove_numbers)

In [32]:
def cleanse(word):
    # Drop any token that contains a digit (for example "mp3" or "2nd")
    if re.search(r'\d', word):
        return ''
    return word

def remove_alphanumeric(text):
    # Keep only the tokens that survive cleanse(); avoids shadowing the string module
    return ' '.join(word for word in text.split() if cleanse(word))

In [33]:
train['text']=train['text'].apply(remove_alphanumeric)

In [34]:
train

Unnamed: 0,text,label
15077,nice see film older people finding falling lov...,0
19422,old old Che completely ignored really interest...,0
10822,good movie hard say many people remarkably eff...,1
20273,mind boggles exactly Universal Soldier merited...,0
25788,surprisingly average slasher thats enjoyable w...,1
...,...,...
7451,one funniest shows ever seen really refreshing...,1
4178,time capitalize longawaited movie version Drea...,0
17145,one creepy movie Creepier anything David Lynch...,1
19107,film amazing like nightmare bizarre story dark...,1


In [35]:
def lemmatize_text(text):
    doc = nlp(text)
    # spaCy 2.x returns the placeholder lemma '-PRON-' for pronouns; keep the original token in that case
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])

In [36]:
train['text']=train['text'].apply(lemmatize_text)

In [37]:
train['sentiment'] = train['text'].apply(lambda review: TextBlob(review).sentiment)

In [38]:
train

Unnamed: 0,text,label,sentiment
15077,nice see film old people find fall love perfor...,0,"(0.38749999999999996, 0.6166666666666666)"
19422,old old Che completely ignore really interesti...,0,"(-0.04018595041322313, 0.5757690541781452)"
10822,good movie hard say many people remarkably eff...,1,"(0.0769607843137255, 0.6230392156862745)"
20273,mind boggle exactly Universal Soldier merit se...,0,"(0.07830086580086582, 0.4502489177489178)"
25788,surprisingly average slasher that s enjoyable ...,1,"(0.20198412698412696, 0.5997795414462082)"
...,...,...,...
7451,one funniest show ever see really refreshing w...,1,"(0.13974358974358972, 0.6035256410256411)"
4178,time capitalize longawaite movie version Dream...,0,"(0.0005408902691511402, 0.4510248447204969)"
17145,one creepy movie Creepier anything David Lynch...,1,"(-0.019191919191919194, 0.5805555555555556)"
19107,film amazing like nightmare bizarre story dark...,1,"(0.2794642857142857, 0.5214285714285715)"


In [39]:
sentiment_series = train['sentiment'].tolist()

In [40]:
columns = ['polarity', 'subjectivity']
df1 = pd.DataFrame(sentiment_series, columns=columns, index=train.index)

In [41]:
df1

Unnamed: 0,polarity,subjectivity
15077,0.387500,0.616667
19422,-0.040186,0.575769
10822,0.076961,0.623039
20273,0.078301,0.450249
25788,0.201984,0.599780
...,...,...
7451,0.139744,0.603526
4178,0.000541,0.451025
17145,-0.019192,0.580556
19107,0.279464,0.521429


In [42]:
result = pd.concat([train,df1],axis=1)

In [43]:
result.drop(['sentiment'],axis=1,inplace=True)

In [44]:
result.loc[result['polarity']>=0.3, 'Sentiment'] = "Positive"
result.loc[result['polarity']<0.3, 'Sentiment'] = "Negative"
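
The same thresholding can be written as one vectorised assignment. This is a minimal sketch using `numpy.where` on a few polarity values copied from the `df1` output above; the frame name `demo` and the constant `THRESHOLD` are hypothetical.

```python
import numpy as np
import pandas as pd

# Polarity values copied from the first rows of the df1 output above.
demo = pd.DataFrame({"polarity": [0.387500, -0.040186, 0.076961, 0.201984]})

THRESHOLD = 0.3  # same cut-off as the cell above

# One vectorised assignment instead of two .loc writes.
demo["Sentiment"] = np.where(demo["polarity"] >= THRESHOLD, "Positive", "Negative")
print(demo)
```

One difference to be aware of: a missing polarity (NaN) fails the comparison and is labelled "Negative" here, whereas the two `.loc` writes would leave such a row unlabelled.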

In [45]:
result

Unnamed: 0,text,label,polarity,subjectivity,Sentiment
15077,nice see film old people find fall love perfor...,0,0.387500,0.616667,Positive
19422,old old Che completely ignore really interesti...,0,-0.040186,0.575769,Negative
10822,good movie hard say many people remarkably eff...,1,0.076961,0.623039,Negative
20273,mind boggle exactly Universal Soldier merit se...,0,0.078301,0.450249,Negative
25788,surprisingly average slasher that s enjoyable ...,1,0.201984,0.599780,Negative
...,...,...,...,...,...
7451,one funniest show ever see really refreshing w...,1,0.139744,0.603526,Negative
4178,time capitalize longawaite movie version Dream...,0,0.000541,0.451025,Negative
17145,one creepy movie Creepier anything David Lynch...,1,-0.019192,0.580556,Negative
19107,film amazing like nightmare bizarre story dark...,1,0.279464,0.521429,Negative


In [46]:
result.loc[result['label']==1, 'Sentiment_label'] = 1
result.loc[result['label']==0, 'Sentiment_label'] = 0

In [47]:
result

Unnamed: 0,text,label,polarity,subjectivity,Sentiment,Sentiment_label
15077,nice see film old people find fall love perfor...,0,0.387500,0.616667,Positive,0.0
19422,old old Che completely ignore really interesti...,0,-0.040186,0.575769,Negative,0.0
10822,good movie hard say many people remarkably eff...,1,0.076961,0.623039,Negative,1.0
20273,mind boggle exactly Universal Soldier merit se...,0,0.078301,0.450249,Negative,0.0
25788,surprisingly average slasher that s enjoyable ...,1,0.201984,0.599780,Negative,1.0
...,...,...,...,...,...,...
7451,one funniest show ever see really refreshing w...,1,0.139744,0.603526,Negative,1.0
4178,time capitalize longawaite movie version Dream...,0,0.000541,0.451025,Negative,0.0
17145,one creepy movie Creepier anything David Lynch...,1,-0.019192,0.580556,Negative,1.0
19107,film amazing like nightmare bizarre story dark...,1,0.279464,0.521429,Negative,1.0
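
The `Sentiment_label` column above is a copy of `label`. To measure how well the polarity threshold agrees with the ground truth, the `Sentiment` string needs its own numeric encoding. The sketch below does this for the nine rows displayed in the output only; the column name `predicted` and the variable names are assumptions.

```python
import pandas as pd

# The nine rows visible in the output above (label and thresholded Sentiment).
shown = pd.DataFrame({
    "label": [0, 0, 1, 0, 1, 1, 0, 1, 1],
    "Sentiment": ["Positive"] + ["Negative"] * 8,
})

# Encode the TextBlob-based prediction with the same 0/1 scheme as `label`.
shown["predicted"] = shown["Sentiment"].map({"Negative": 0, "Positive": 1})

# Share of rows where prediction and ground truth agree.
accuracy = (shown["predicted"] == shown["label"]).mean()

# Confusion table: rows are true labels, columns are predictions.
confusion = pd.crosstab(shown["label"], shown["predicted"])
print(confusion)
print(f"accuracy on the displayed rows: {accuracy:.2f}")
```

Nine rows are far too few to judge the approach; the same three statements applied to the full `result` frame would give the figure that matters.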
