# Sentiment analysis 

The objective of this problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we classify whether the sentiment the user expresses toward a particular mobile device is positive or negative.

## Question 1

### Read the data
- read tweets.csv
- use latin encoding if it gives encoding error while loading

In [3]:
import pandas as pd
df = pd.read_csv('tweets.csv', encoding='latin')
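
The fallback described in the bullets can be sketched as a small helper. `load_tweets` is a hypothetical name that does not appear in the notebook, and the sketch assumes the file should be tried as UTF-8 first, with Latin-1 used only when decoding fails.

```python
import pandas as pd

def load_tweets(path):
    """Read a CSV as UTF-8, falling back to Latin-1 on a decoding error."""
    try:
        return pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this read does not fail
        # on decoding, although text in another encoding may be mis-rendered.
        return pd.read_csv(path, encoding="latin-1")
```

Calling `load_tweets('tweets.csv')` would then behave like the cell above whenever the file is not valid UTF-8.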

### Drop null values
- drop all the rows with null values

In [4]:
#Looking at number of nulls in columns
df.isna().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [5]:
#Dropping rows that contain a null in any column
df.dropna(axis=0, inplace=True)
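
The null counts show that almost all missing values sit in `emotion_in_tweet_is_directed_at`, so `dropna` without a `subset` discards every tweet that lacks a target product. The toy frame below (invented rows, not taken from `tweets.csv`) contrasts that with dropping only rows whose tweet text is missing. Whether the second variant suits this exercise is an assumption; it is shown only to make the difference visible.

```python
import pandas as pd

# Toy stand-in for the tweets frame; the rows are invented for illustration.
toy = pd.DataFrame({
    "tweet_text": ["love my ipad", None, "battery died again", "new phone"],
    "emotion_in_tweet_is_directed_at": ["iPad", "iPhone", None, None],
})

# Drop a row if ANY column is null (what the notebook does).
dropped_any = toy.dropna(axis=0)

# Drop a row only if the tweet text itself is null.
dropped_text_only = toy.dropna(subset=["tweet_text"])

print(len(toy), len(dropped_any), len(dropped_text_only))  # 4 1 3
```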

### Print the dataframe
- print initial 5 rows of the data
- use df.head()

In [6]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


## Question 2

### Preprocess data
- convert all text to lowercase - use .lower()
- keep only digits, lowercase letters, spaces, and the characters #, + and _ in the text - use re.sub()
- strip all the text - use .strip()
    - this removes leading and trailing whitespace

In [7]:
import re
#Converting text to lowercase
df['tweet_text'] = df['tweet_text'].apply(lambda s: s.lower())
#Keeping only digits, lowercase letters, spaces, and the characters #, + and _
df['tweet_text'] = df['tweet_text'].apply(lambda s: re.sub(r'[^0-9a-z #+_]', '', s))
#Removing leading and trailing whitespace
df['tweet_text'] = df['tweet_text'].apply(lambda s: s.strip())
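
The three preprocessing steps can also be tried on a single string. In the sketch below, `clean_tweet` and `collapse_spaces` are hypothetical helper names and the sample text is invented. `collapse_spaces` is an optional extra step that the notebook does not perform; it is included because `.strip()` leaves runs of spaces inside the text untouched.

```python
import re

def clean_tweet(text):
    """Lowercase, keep only [0-9a-z #+_], and trim both ends."""
    text = text.lower()
    text = re.sub(r"[^0-9a-z #+_]", "", text)
    return text.strip()

def collapse_spaces(text):
    """Optional extra step (not in the notebook): squeeze runs of spaces."""
    return re.sub(r" +", " ", text)

sample = "  @jessedee Awesome iPad/iPhone app!  #SXSW "
cleaned = clean_tweet(sample)
print(repr(cleaned))                   # 'jessedee awesome ipadiphone app  #sxsw'
print(repr(collapse_spaces(cleaned)))  # 'jessedee awesome ipadiphone app #sxsw'
```

Note that removing `/` glues `ipad` and `iphone` into one token, which matches the `ipadiphon...` text visible in the cleaned dataframe.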

### Print the dataframe

In [8]:
#printing head of dataframe
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweeti...,iPhone,Negative emotion
1,jessedee know about fludapp awesome ipadiphon...,iPad or iPhone App,Positive emotion
2,swonderlin can not wait for #ipad 2 also they ...,iPad,Positive emotion
3,sxsw i hope this years festival isnt as crashy...,iPad or iPhone App,Negative emotion
4,sxtxstate great stuff on fri #sxsw marissa may...,Google,Positive emotion


## Question 3

### Preprocess data
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - select only those rows where the value equals "Positive emotion" or "Negative emotion"
- find the value counts of "Positive emotion" and "Negative emotion"

In [9]:
#Looking at unique values in the target column
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion                      2672
Negative emotion                       519
No emotion toward brand or product      91
I can't tell                             9
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [10]:
#Selecting only those rows with "Positive emotion" or "Negative emotion"
#.copy() gives an independent frame, so later column assignments do not act on a slice of the old one
df = df[df['is_there_an_emotion_directed_at_a_brand_or_product'].isin(["Positive emotion", "Negative emotion"])].copy()

In [11]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
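
The counts above are imbalanced, which matters when accuracy is interpreted later. A short calculation using only the two numbers from the output gives the share of each class and the accuracy a trivial always-positive predictor would reach on the full filtered data (the figure for any particular test split may differ slightly).

```python
# Class counts copied from the value_counts() output above.
counts = {"Positive emotion": 2672, "Negative emotion": 519}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(shares.values())

print(total)                        # 3191
print(round(majority_baseline, 3))  # 0.837
```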

## Question 4

### Encode labels
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - change "Positive emotion" to 1
    - change "Negative emotion" to 0
- use the map function to replace values

In [12]:
#Mapping labels to 1 and 0  
df['is_there_an_emotion_directed_at_a_brand_or_product'] = df['is_there_an_emotion_directed_at_a_brand_or_product'].map({"Positive emotion":1,"Negative emotion":0})

In [13]:
#Checking if labels got mapped correctly
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

1    2672
0     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
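
One property of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key of the dictionary becomes `NaN`. The sketch below uses invented labels to show how that could be checked; here it only matters if rows other than the two emotion classes were still present.

```python
import pandas as pd

label_map = {"Positive emotion": 1, "Negative emotion": 0}

# Invented labels; the last one is deliberately absent from label_map.
raw = pd.Series(["Positive emotion", "Negative emotion", "I can't tell"])
encoded = raw.map(label_map)

# Any value missing from the dictionary is mapped to NaN.
unmapped = int(encoded.isna().sum())
print(unmapped)  # 1
```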

## Question 5

### Get feature and label
- get column "tweet_text" as feature
- get column "is_there_an_emotion_directed_at_a_brand_or_product" as label

In [14]:
feature = df['tweet_text']
label = df['is_there_an_emotion_directed_at_a_brand_or_product']

### Create train and test data
- use train_test_split to get train and test set
- set a random_state
- test_size: 0.25

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(feature, label, test_size = 0.25, random_state = 1)
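
Because the two classes are imbalanced, a plain random split can leave the train and test sets with somewhat different class ratios. `train_test_split` accepts a `stratify` argument that preserves the ratio. The notebook does not use it, so the following is only a sketch of the option on invented data.

```python
from sklearn.model_selection import train_test_split

# Toy data: 32 positive and 8 negative labels (invented).
texts = [f"tweet {i}" for i in range(40)]
labels = [1] * 32 + [0] * 8

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

# With stratify, the 80/20 class ratio carries over to both splits.
print(len(y_te), sum(y_te))  # 10 8
```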

## Question 6

### Vectorize data
- create document-term matrix
- use CountVectorizer()
    - ngram_range: (1, 2)
    - stop_words: 'english'
    - min_df: 2   
- do fit_transform on X_train
- do transform on X_test

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
cv = CountVectorizer(ngram_range=(1,2), stop_words='english',min_df=2)
#do fit_transform on X_train
X_train_vector = cv.fit_transform(X_train)
#do transform on X_test
X_test_vector = cv.transform(X_test)
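
To see what these `CountVectorizer` settings do, the same configuration can be fitted on three invented sentences. English stop words are removed first, then unigrams and bigrams are built, and only terms that appear in at least two training documents survive `min_df=2`. The names `toy_cv`, `train_docs`, and `vocab` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = [
    "love the iphone battery",
    "the iphone battery died",
    "love my ipad",
]

toy_cv = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
toy_train = toy_cv.fit_transform(train_docs)

# Terms kept: unigrams and bigrams found in at least 2 training documents.
vocab = sorted(toy_cv.vocabulary_)
print(vocab)

# transform() reuses the fitted vocabulary; terms outside it ("ipad") are ignored.
toy_test = toy_cv.transform(["ipad battery"])
print(toy_test.shape)
```

This is why `fit_transform` is applied to `X_train` only: the vocabulary is learned from the training tweets, and the test tweets are projected onto it.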

## Question 7

### Select classifier logistic regression
- use logistic regression for predicting sentiment of the given tweet
- initialize classifier

In [18]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

### Fit the classifier
- fit logistic regression classifier

In [19]:
lr.fit(X_train_vector, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Question 8

### Select classifier naive bayes
- use naive bayes for predicting sentiment of the given tweet
- initialize classifier
- use MultinomialNB

In [20]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()

### Fit the classifier
- fit naive bayes classifier

In [21]:
mnb.fit(X_train_vector, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Question 9

### Make predictions on logistic regression
- use your trained logistic regression model to make predictions on X_test

In [22]:
lr_predict = lr.predict(X_test_vector)

### Make predictions on naive bayes
- use your trained naive bayes model to make predictions on X_test
- use a different variable name to store predictions so that they are kept separately

In [23]:
mnb_predict = mnb.predict(X_test_vector)
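
Both models predict on `X_test_vector`, not on raw text, so any new tweet has to pass through the same fitted vectorizer first. One way to keep the two steps together is a scikit-learn pipeline. The notebook does not do this; the sketch below uses invented training texts and default `CountVectorizer` settings purely to illustrate the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data: 1 = positive, 0 = negative.
train_texts = ["love iphone", "great ipad", "hate battery", "terrible app"]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one call.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

preds = model.predict(["love ipad", "terrible battery"])
print(preds.tolist())  # [1, 0]
```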

## Question 10

### Calculate accuracy of logistic regression
- check accuracy of logistic regression classifier
- use sklearn.metrics.accuracy_score

In [30]:
from sklearn.metrics import accuracy_score
lr_train_acc = accuracy_score(y_train, lr.predict(X_train_vector))
lr_test_acc = accuracy_score(y_test, lr_predict)
print('Training accuracy of logistic regression model = ', round(lr_train_acc,2))
print('Testing accuracy of logistic regression model = ', round(lr_test_acc,2))

Training accuracy of logistic regression model =  0.97
Testing accuracy of logistic regression model =  0.87


### Calculate accuracy of naive bayes
- check accuracy of naive bayes classifier
- use sklearn.metrics.accuracy_score

In [31]:
mnb_train_acc = accuracy_score(y_train, mnb.predict(X_train_vector))
mnb_test_acc = accuracy_score(y_test, mnb_predict)
print('Training accuracy of Multinomial NB model = ', round(mnb_train_acc,2))
print('Testing accuracy of Multinomial NB model = ', round(mnb_test_acc,2))

Training accuracy of Multinomial NB model =  0.93
Testing accuracy of Multinomial NB model =  0.86


**>> The test accuracies of the logistic regression and naive Bayes models are comparable (0.87 vs. 0.86). However, the logistic regression model overfits more than the naive Bayes model: its training accuracy exceeds its test accuracy by 10 percentage points (0.97 vs. 0.87), compared with 7 percentage points for naive Bayes (0.93 vs. 0.86).**
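
Since roughly 84% of the filtered tweets are positive (2672 of 3191), accuracy alone can hide weak performance on the negative class. A confusion matrix and per-class recall make that visible. The labels below are invented for illustration and are not the notebook's predictions; in the notebook the same calls could be applied to `y_test` and `lr_predict` or `mnb_predict`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented labels: 4 positive and 2 negative examples.
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_hat)

# Share of truly negative examples that were predicted as negative.
neg_recall = recall_score(y_true, y_hat, pos_label=0)

print(round(acc, 2), cm.tolist(), neg_recall)  # 0.83 [[1, 1], [0, 4]] 0.5
```

Here the accuracy looks respectable even though only half of the negative examples are recognized.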

----