# Sentiment analysis 

The objective of this problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Given the text of a tweet, we classify whether the sentiment the user directs at a particular mobile device is positive or negative.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
path = '/content/drive/My Drive/ColabNotebooks/SNLP/LABINTERNAL/'

## Question 1

### Read the data
- read tweets.csv
- use latin encoding if it gives encoding error while loading

In [3]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
df= pd.read_csv(path + 'tweets.csv',encoding='latin')
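If the encoding is not known in advance, one option is to try UTF-8 first and fall back to Latin-1 only when decoding fails. The helper below is a minimal sketch of that idea; `read_csv_with_fallback` and the throwaway demo file are hypothetical names used for illustration and are not part of the notebook.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(csv_path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return the first DataFrame that loads."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(csv_path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demo on a throwaway file that contains a Latin-1 byte which is invalid UTF-8.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("tweet_text\ncaf\u00e9 wifi\n".encode("latin-1"))
demo = read_csv_with_fallback(tmp.name)
os.remove(tmp.name)
```

Latin-1 can decode any byte sequence, so it works as a last resort, but it may silently mis-render text that was really written in another encoding.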

In [4]:
#checking for null values
df.isnull().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

### Drop null values
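- drop the rows where tweet_text is null (emotion_in_tweet_is_directed_at is null in 5802 rows and is not used as a feature or label, so it is not part of the filter)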

In [5]:
df.dropna(subset=['tweet_text'],inplace=True)
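The `subset` argument matters here because one column is mostly empty. The toy frame below is a small sketch (made-up rows, not the real data) contrasting `dropna(subset=...)` with a plain `dropna()`.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the two kinds of missing values in the dataset.
toy = pd.DataFrame({
    "tweet_text": ["love my ipad", None, "phone keeps crashing"],
    "emotion_in_tweet_is_directed_at": ["iPad", "iPhone", np.nan],
})

# Only require the tweet text: drops the single row with no text.
only_text_required = toy.dropna(subset=["tweet_text"])

# Require every column: also drops the row with no target product.
all_columns_required = toy.dropna()
```

On the real data, a plain `dropna()` would discard every row that has no value in emotion_in_tweet_is_directed_at, which is most of the dataset according to the null counts above.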

### Print the dataframe
- print initial 5 rows of the data
- use df.head()

In [6]:
df.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


## Question 2

### Preprocess data
- convert all text to lowercase - use .lower()
- select only numbers, alphabets, and #+_ from text - use re.sub()
- strip all the text - use .strip()
    - this is for removing extra spaces

In [7]:
df['tweet_text'] = df['tweet_text'].apply(lambda s: s.lower())
df['tweet_text'] = df['tweet_text'].apply(lambda s: re.sub('[^0-9a-z #+_]','',s))
df['tweet_text'] = df['tweet_text'].apply(lambda s: s.strip())
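The three steps above can be folded into one function and checked on a single string. This is a minimal sketch; `clean_tweet` is a hypothetical helper name, and the second example string is shortened for illustration.

```python
import re


def clean_tweet(text):
    """Lowercase, keep only digits, a-z, space, '#', '+', '_', then trim."""
    text = text.lower()
    text = re.sub(r"[^0-9a-z #+_]", "", text)
    return text.strip()


cleaned = clean_tweet(".@wesley83 I have a 3G iPhone.")
# 'wesley83 i have a 3g iphone'

glued = clean_tweet("Awesome iPad/iPhone app")
# 'awesome ipadiphone app'
```

Removed characters are deleted rather than replaced with a space, so tokens separated only by punctuation are glued together. If that is undesirable, replacing with `" "` instead of `""` is one possible variant, at the cost of changing the vocabulary.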


### Print the dataframe

In [8]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweeti...,iPhone,Negative emotion
1,jessedee know about fludapp awesome ipadiphon...,iPad or iPhone App,Positive emotion
2,swonderlin can not wait for #ipad 2 also they ...,iPad,Positive emotion
3,sxsw i hope this years festival isnt as crashy...,iPad or iPhone App,Negative emotion
4,sxtxstate great stuff on fri #sxsw marissa may...,Google,Positive emotion


## Question 3

### Preprocess data
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - select only those rows where the value equals "Positive emotion" or "Negative emotion"
- find the value counts of "Positive emotion" and "Negative emotion"

In [9]:
df.is_there_an_emotion_directed_at_a_brand_or_product.unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

In [10]:
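# keep only the positive/negative rows; .copy() avoids a SettingWithCopyWarning when the label column is re-assigned later
df = df[df.is_there_an_emotion_directed_at_a_brand_or_product.isin(['Negative emotion','Positive emotion'])].copy()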

In [11]:
df.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

## Question 4

### Encode labels
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - change "Positive emotion" to 1
    - change "Negative emotion" to 0
- use map function to replace values

In [12]:
df['is_there_an_emotion_directed_at_a_brand_or_product']=df['is_there_an_emotion_directed_at_a_brand_or_product'].map(lambda x: 1 if x=='Positive emotion' else 0)
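The lambda above sends every value other than 'Positive emotion' to 0, which is safe only because the rows were filtered to two labels beforehand. As a sketch of an alternative, mapping with a dictionary turns any unexpected label into NaN so it can be detected; the `label_map` name and the toy series are illustrative assumptions.

```python
import pandas as pd

label_map = {"Positive emotion": 1, "Negative emotion": 0}

# Toy labels, including one value that should not reach this step.
labels = pd.Series(["Positive emotion", "Negative emotion", "I can't tell"])

encoded = labels.map(label_map)        # unmapped values become NaN
unmapped = int(encoded.isna().sum())   # number of labels missing from the map
```

Checking that `unmapped` is zero before training is a cheap guard against a filtering step that was skipped or changed.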

## Question 5

### Get feature and label
- get column "tweet_text" as feature
- get column "is_there_an_emotion_directed_at_a_brand_or_product" as label

In [45]:
X = df.tweet_text
y = df.is_there_an_emotion_directed_at_a_brand_or_product

### Create train and test data
- use train_test_split to get train and test set
- set a random_state
- test_size: 0.25

In [46]:
# test_size=0.25 holds out a quarter of the rows, equivalent to train_size=0.75
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
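Because the classes are imbalanced (2978 positive versus 570 negative), a stratified split is one optional variant that keeps the class ratio the same in both sets. The sketch below uses toy data; the notebook itself uses a plain random split, and switching to `stratify` would change the split and therefore the reported numbers.

```python
from sklearn.model_selection import train_test_split

# Toy data with an 80/20 class ratio.
texts = ["tweet {}".format(i) for i in range(100)]
labels = [1] * 80 + [0] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

# Share of positive labels in the held-out set.
test_positive_share = sum(y_te) / len(y_te)
```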

## Question 6

### Vectorize data
- create document-term matrix
- use CountVectorizer()
    - ngram_range: (1, 2)
    - stop_words: 'english'
    - min_df: 2   
- do fit_transform on X_train
- do transform on X_test

In [47]:
# import and instantiate CountVectorizer (unigrams and bigrams, English stop words removed, terms kept only if they occur in at least 2 documents)
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(ngram_range= (1, 2), stop_words= 'english', min_df = 2)

In [48]:
#Learn the vocabulary from the training tweets only, then reuse it to transform the test tweets
X_train = cvect.fit_transform(X_train)
X_test = cvect.transform(X_test)
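To see what these settings do, here is a small sketch on three made-up documents (not taken from the dataset). Stop words are removed before n-grams are built, and `min_df=2` keeps only terms that occur in at least two documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "getting the new ipad at the apple store",
    "new ipad flash at the apple store",
    "free android phone",
]

vec = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
dtm = vec.fit_transform(docs)  # sparse document-term matrix

# Sorting the vocabulary keys gives the terms in column order.
terms = sorted(vec.vocabulary_)
# ['apple', 'apple store', 'ipad', 'new', 'new ipad', 'store']
```

The third document shares no term with the others, so its row in `dtm` is all zeros. The same thing can happen to real tweets whose words are all rare.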

In [49]:
#Check the vocabulary size
len(cvect.vocabulary_)

6132

In [50]:
#Inspect the vocabulary (term -> column index in the document-term matrix)
cvect.vocabulary_

{'rt': 4465,
 'mention': 3324,
 'google': 1954,
 'talking': 5327,
 'search': 4561,
 'ranking': 4313,
 'im': 2404,
 'sure': 5026,
 'saw': 4523,
 'bing': 694,
 'writing': 6053,
 'notes': 3747,
 'qagb': 4204,
 'sxsw': 5054,
 'rt mention': 4467,
 'mention google': 3388,
 'im sure': 2418,
 'writing notes': 6054,
 'qagb sxsw': 4205,
 'look': 3119,
 'blue': 729,
 'hair': 2170,
 'ive': 2754,
 'got': 2059,
 'free': 1785,
 'android': 284,
 'phone': 3960,
 'info': 2464,
 'stickers': 4902,
 'allhat3': 235,
 'mention sxsw': 3485,
 'sxsw look': 5159,
 'look blue': 3120,
 'blue hair': 730,
 'hair ive': 2171,
 'ive got': 2756,
 'got free': 2062,
 'free android': 1786,
 'android phone': 301,
 'phone info': 3963,
 'info stickers': 2471,
 'stickers mention': 4903,
 'mention allhat3': 3333,
 'getting': 1894,
 'new': 3686,
 'ipad': 2533,
 'flash': 1728,
 'apple': 364,
 'store': 4911,
 'link': 3032,
 'mention getting': 3383,
 'new ipad': 3697,
 'ipad sxsw': 2630,
 'flash apple': 1729,
 'apple store': 418,
 

In [51]:
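# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
cvect.get_feature_names()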

['10',
 '10 attendees',
 '10 hot',
 '10 link',
 '100',
 '101',
 '106',
 '10x',
 '11',
 '12',
 '12 months',
 '12b',
 '12b miles',
 '136',
 '136 google',
 '1413',
 '14day',
 '14day return',
 '15',
 '15 minute',
 '15 minutes',
 '150',
 '150 million',
 '1500',
 '1500 macbook',
 '15k',
 '16gb',
 '16gb wifi',
 '1986',
 '1986quot',
 '1st',
 '1st day',
 '1st prize',
 '1st time',
 '20',
 '20 concept',
 '20 min',
 '2010',
 '2011',
 '2011 computing',
 '2011 google',
 '2011 link',
 '2011 mention',
 '2011 novelty',
 '2011 prizes',
 '2011 weekend',
 '21',
 '22',
 '22 tracks',
 '24',
 '24 hours',
 '247',
 '247 amp',
 '247 stream',
 '25',
 '250k',
 '250k new',
 '2day',
 '2nd',
 '2nd place',
 '2nd prize',
 '2not',
 '2not worry',
 '2quot',
 '2quot mention',
 '2s',
 '2s austin',
 '2s sxsw',
 '30',
 '30 android',
 '313',
 '315',
 '315 details',
 '32',
 '32gb',
 '330',
 '330pm',
 '35',
 '35 million',
 '37',
 '3d',
 '3d buildings',
 '3d xml',
 '3g',
 '3g 64gb',
 '3g 64mb',
 '3g ipad',
 '3g iphone',
 '3gs',


In [52]:
#Size of Document Term Matrix
X_train.shape

(2661, 6132)

## Question 7

### Select classifier logistic regression
- use logistic regression for predicting sentiment of the given tweet
- initialize classifier

In [56]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()

### Fit the classifier
- fit logistic regression classifier

In [58]:
logisticRegr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
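The cells above overwrite `X_train` and `X_test` with their vectorized forms. One alternative, sketched here on toy data, is to chain the vectorizer and the classifier in a scikit-learn `Pipeline` so that raw text goes in and predictions come out. The toy texts and the `model` name are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = positive, 0 = negative.
train_texts = ["love new ipad", "love apple store",
               "hate android phone", "hate flash"]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one call.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Raw strings can be passed directly; the fitted vocabulary is reused.
preds = model.predict(["love ipad", "hate flash"])
```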

## Question 8

### Select classifier naive bayes
- use naive bayes for predicting sentiment of the given tweet
- initialize classifier
- use MultinomialNB

In [64]:
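# Note: the task asks for MultinomialNB, but this run uses GaussianNB;
# the fitted model and the accuracy reported below are for GaussianNB.
from sklearn.naive_bayes import GaussianNB
naiBay = GaussianNB()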

### Fit the classifier
- fit naive bayes classifier

In [66]:
# GaussianNB does not accept sparse input, so the matrix is converted to a dense array
naiBay.fit(X_train.toarray(), y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
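For reference, the `MultinomialNB` variant named in the task could look like the sketch below. It runs on toy data, so it says nothing about the accuracy this model would reach on the tweets. Unlike `GaussianNB`, `MultinomialNB` accepts the sparse count matrix directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: 1 = positive, 0 = negative.
train_texts = ["love new ipad", "love apple store",
               "hate android phone", "hate flash"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
dtm = vec.fit_transform(train_texts)  # sparse matrix of word counts

nb = MultinomialNB()                  # default Laplace smoothing (alpha=1.0)
nb.fit(dtm, train_labels)             # no .toarray() needed

nb_preds = nb.predict(vec.transform(["love ipad", "hate phone"]))
```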

## Question 9

### Make predictions on logistic regression
- use your trained logistic regression model to make predictions on X_test

In [59]:
predictions = logisticRegr.predict(X_test)

### Make predictions on naive bayes
- use your trained naive bayes model to make predictions on X_test
- use a different variable name to store predictions so that they are kept separately

In [68]:
# the test matrix must be dense as well, matching how GaussianNB was fitted
predictionsNB = naiBay.predict(X_test.toarray())

## Question 10

### Calculate accuracy of logistic regression
- check accuracy of logistic regression classifier
- use sklearn.metrics.accuracy_score

In [61]:
# .score() returns the mean accuracy on the given test data and labels
score = logisticRegr.score(X_test, y_test)
print(score)

0.8680947012401353


In [62]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_true=y_test, y_pred=predictions)
print('Acc: {:.4f}'.format(acc))

Acc: 0.8681


### Calculate accuracy of naive bayes
- check accuracy of naive bayes classifier
- use sklearn.metrics.accuracy_score

In [69]:
acc = accuracy_score(y_true=y_test, y_pred=predictionsNB)
print('Acc: {:.4f}'.format(acc))

Acc: 0.8253


In [70]:
print('Logistic Regression achieved higher accuracy than Naive Bayes')

Logistic Regression achieved higher accuracy than Naive Bayes
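Accuracy on imbalanced classes is easier to interpret next to a majority-class baseline. The arithmetic below is a rough sketch using the class counts from Question 3 and the two accuracies reported above. It assumes the test split has about the same class ratio as the full filtered dataset, which the notebook does not verify.

```python
# Class counts after filtering (from the value_counts output in Question 3).
n_positive = 2978
n_negative = 570

# Accuracy of always predicting "positive" on data with this class ratio.
majority_baseline = n_positive / (n_positive + n_negative)

# Test accuracies reported above.
scores = {"logistic regression": 0.8681, "naive bayes": 0.8253}

# Positive margin: better than always guessing the majority class.
margin_over_baseline = {
    name: round(acc - majority_baseline, 4) for name, acc in scores.items()
}
```

Under that assumption the baseline is about 0.839, so logistic regression sits only a few points above it and the naive bayes run falls slightly below it. Per-class metrics such as precision and recall on the negative class would give a fuller picture.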
