# Sentiment analysis 

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we classify whether the user's sentiment toward a particular mobile device is positive or negative.

## Question 1

### Read the data
- read tweets.csv
- use latin-1 encoding if loading raises an encoding error

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in the local run

tweets_df = pd.read_csv("tweets.csv", encoding="latin-1")
tweets_df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |
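
The "use latin-1 if loading fails" instruction can also be written as an explicit fallback. The sketch below is only an illustration: `read_csv_with_fallback` is a hypothetical helper, not part of pandas, and the order of encodings is an assumption. Since latin-1 can decode any byte sequence, it is placed last.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return the DataFrame from the first encoding that decodes the file."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error

# Usage, assuming tweets.csv sits in the working directory:
# tweets_df = read_csv_with_fallback("tweets.csv")
```

Trying UTF-8 first keeps correctly encoded files intact, while the latin-1 fallback only applies when decoding actually fails.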


In [2]:
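tweets_df.dtypes  # not displayed: a cell only echoes its last expression
tweets_df = pd.DataFrame(tweets_df)  # redundant: read_csv already returns a DataFrame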

### Drop null values
- drop all the rows with null values

In [3]:
tweets_df.isna().sum()  ## 5802 + 1 = 5803 null cells in total

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [4]:
tweets_df.shape

(9093, 3)

In [5]:
# the defaults already drop every row that contains any null value;
# recent pandas versions raise a TypeError when both how and thresh are passed
tweets_df = tweets_df.dropna()
tweets_df.shape

(3291, 3)

In [6]:
tweets_df.isna().sum()  ## null rows dropped

tweet_text                                            0
emotion_in_tweet_is_directed_at                       0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64
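
Most of the dropped rows are missing only the product column, not the tweet text. As a small illustration on made-up data (the frame and its values are hypothetical), the sketch below contrasts dropping rows with any null against dropping only rows whose text is null, using the `subset` parameter of `dropna`.

```python
import pandas as pd

# Hypothetical toy frame: one missing tweet, two missing product labels
df = pd.DataFrame({
    "tweet_text": ["a", None, "c", "d"],
    "emotion_in_tweet_is_directed_at": ["iPhone", "iPad", None, None],
})

any_null_dropped = df.dropna()                       # drops rows with a null in any column
text_null_dropped = df.dropna(subset=["tweet_text"])  # drops rows only when the text is null

print(len(any_null_dropped), len(text_null_dropped))  # 1 3
```

The exercise asks for all null rows to be dropped, so the notebook uses the first form; the second is shown only for comparison.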

### Print the dataframe
- print initial 5 rows of the data
- use df.head()

In [7]:
tweets_df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


## Question 2

### Preprocess data
- convert all text to lowercase - use .lower()
- keep only digits, letters, and the characters #, + and _ in the text - use re.sub()
- strip all the text - use .strip()
    - this removes leading and trailing spaces

In [8]:
tweets_df["tweet_text"][0]

'.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.'

In [9]:
tweets_df["tweet_text"]=tweets_df["tweet_text"].str.lower()
tweets_df["tweet_text"][0]

'.@wesley83 i have a 3g iphone. after 3 hrs tweeting at #rise_austin, it was dead!  i need to upgrade. plugin stations at #sxsw.'

In [10]:
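import re
# Caution: str(tweets_df["tweet_text"]) is the printed summary of the whole Series, not a single tweet,
# so this line assigns the same cleaned summary string to every row.
# Per-row cleaning needs tweets_df["tweet_text"].str.replace(r'[^a-z0-9#+_]+', ' ', regex=True).
tweets_df["tweet_text"]=re.sub('[^A-Za-z0-9#+_]+', '', str(tweets_df["tweet_text"]))

In [11]:
tweets_df["tweet_text"].str.strip()  # displayed only; the result is not assigned back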

0       0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
1       0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
2       0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
3       0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
4       0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
                              ...                        
9077    0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
9079    0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
9080    0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
9085    0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
9088    0wesley83ihavea3giphoneafter3hrstwe1jessedeekn...
Name: tweet_text, Length: 3291, dtype: object
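
Every row in the output above holds the same string because the regular expression was applied to the text representation of the whole column. A minimal sketch of row-by-row cleaning with the pandas string accessor follows. It assumes that disallowed characters should become a single space (so that words stay separated and `.strip()` has something to remove); the second sample tweet is made up for illustration.

```python
import pandas as pd

sample = pd.Series([
    ".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, "
    "it was dead!  I need to upgrade. Plugin stations at #SXSW.",
    "Loving my new #iPad_2 + case!!",  # hypothetical tweet
])

cleaned = (
    sample.str.lower()                                    # lowercase each tweet
    .str.replace(r"[^a-z0-9#+_]+", " ", regex=True)       # runs of other characters become one space
    .str.strip()                                          # remove leading and trailing spaces
)

print(cleaned[0])
# wesley83 i have a 3g iphone after 3 hrs tweeting at #rise_austin it was dead i need to upgrade plugin stations at #sxsw
```

Because each method works element by element, the rows remain distinct and the vectorizer later sees real words rather than one repeated string.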

### Print the dataframe

In [12]:
tweets_df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | iPhone | Negative emotion |
| 1 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | iPad or iPhone App | Positive emotion |
| 2 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | iPad | Positive emotion |
| 3 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | iPad or iPhone App | Negative emotion |
| 4 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | Google | Positive emotion |


## Question 3

### Preprocess data
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - select only those rows where the value equals "Positive emotion" or "Negative emotion"
- find the value counts of "Positive emotion" and "Negative emotion"

In [13]:
tweets_df["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

Positive emotion                      2672
Negative emotion                       519
No emotion toward brand or product      91
I can't tell                             9
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [14]:
label_col = "is_there_an_emotion_directed_at_a_brand_or_product"
# keep only the two sentiment classes; .copy() gives an independent frame for later assignments
tweets_pos_neg_df = tweets_df[tweets_df[label_col].isin(["Positive emotion", "Negative emotion"])].copy()
tweets_pos_neg_df.count()  ## 2672 + 519 = 3191

tweet_text                                            3191
emotion_in_tweet_is_directed_at                       3191
is_there_an_emotion_directed_at_a_brand_or_product    3191
dtype: int64

In [15]:
tweets_pos_neg_df["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

## Question 4

### Encode labels
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - change "Positive emotion" to 1
    - change "Negative emotion" to 0
- use the map function to replace values

In [16]:
# convert label to a numerical variable
tweets_pos_neg_df['is_there_an_emotion_directed_at_a_brand_or_product'] = tweets_pos_neg_df.is_there_an_emotion_directed_at_a_brand_or_product.map({'Positive emotion':1, 'Negative emotion':0})

In [17]:
tweets_pos_neg_df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | iPhone | 0 |
| 1 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | iPad or iPhone App | 1 |
| 2 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | iPad | 1 |
| 3 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | iPad or iPhone App | 0 |
| 4 | 0wesley83ihavea3giphoneafter3hrstwe1jessedeekn... | Google | 1 |
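
The order of the steps matters here: `Series.map` with a dictionary returns NaN for any value that is not a key, which is why the rows with other labels were filtered out before encoding. A small sketch on made-up labels:

```python
import pandas as pd

# Hypothetical labels, including one value that the mapping does not cover
labels = pd.Series(["Positive emotion", "Negative emotion", "I can't tell"])

encoded = labels.map({"Positive emotion": 1, "Negative emotion": 0})

print(encoded.tolist())  # the unmapped third label becomes NaN
```

A quick `encoded.isna().sum()` after mapping is a cheap way to confirm that no unexpected label slipped through.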


## Question 5

### Get feature and label
- get column "tweet_text" as feature
- get column "is_there_an_emotion_directed_at_a_brand_or_product" as label

In [18]:
# define the feature X (tweet text) and the label y to use with a model
X = tweets_pos_neg_df["tweet_text"]
y = tweets_pos_neg_df["is_there_an_emotion_directed_at_a_brand_or_product"]
print(X.shape)
print(y.shape)

(3191,)
(3191,)


### Create train and test data
- use train_test_split to get train and test set
- set a random_state
- test_size: 0.25

In [19]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25, random_state=7)
print(X_train.shape,",",X_test.shape)
print(y_train.shape, ",",y_test.shape)

(2393,) , (798,)
(2393,) , (798,)
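
The classes are imbalanced (2672 positive versus 519 negative), so a random split can shift the class ratio slightly between train and test. The sketch below shows the optional `stratify` argument of `train_test_split` on made-up data. This is an assumption about what might be useful, not something the exercise asks for, and the toy labels are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 84 positive and 16 negative labels
X_toy = list(range(100))
y_toy = [1] * 84 + [0] * 16

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=7, stratify=y_toy
)

# With stratification the test set keeps the 84:16 ratio
print(len(y_te), sum(y_te))  # 25 21
```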


## Question 6

### Vectorize data
- create document-term matrix
- use CountVectorizer()
    - ngram_range: (1, 2)
    - stop_words: 'english'
    - min_df: 2   
- do fit_transform on X_train
- do transform on X_test

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer
vect = CountVectorizer(ngram_range=(1, 2),stop_words='english',min_df=2)
# combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<798x9 sparse matrix of type '<class 'numpy.int64'>'
	with 7182 stored elements in Compressed Sparse Row format>
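
To make the effect of `min_df=2` and of fitting only on the training data concrete, here is a small sketch on a made-up corpus (the documents are hypothetical). Terms that occur in fewer than two training documents are dropped, and words seen only in the test data are ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["awesome ipad app", "awesome iphone app", "battery dead"]
test_docs = ["awesome battery", "unknown words"]

vect_toy = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
train_dtm = vect_toy.fit_transform(train_docs)  # learns the vocabulary from train only
test_dtm = vect_toy.transform(test_docs)        # reuses that vocabulary

print(sorted(vect_toy.vocabulary_))  # ['app', 'awesome']
print(test_dtm.toarray())
```

Only "app" and "awesome" appear in at least two training documents, so the matrix has two columns; "battery" and the unseen test words contribute nothing.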

## Question 7

### Select classifier logistic regression
- use logistic regression for predicting sentiment of the given tweet
- initialize classifier

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
%matplotlib inline

# C is the inverse regularization strength; a very large C effectively switches the L2 penalty off
logreg = LogisticRegression(C=1e9)


### Fit the classifier
- fit logistic regression classifier

In [22]:
logreg.fit(X_train_dtm, y_train)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Question 8

### Select classifier naive bayes
- use naive bayes for predicting sentiment of the given tweet
- initialize classifier
- use MultinomialNB

In [23]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

### Fit the classifier
- fit naive bayes classifier

In [24]:
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Question 9

### Make predictions on logistic regression
- use your trained logistic regression model to make predictions on X_test

In [25]:
y_pred_class_logistic = logreg.predict(X_test_dtm)

### Make predictions on naive bayes
- use your trained naive bayes model to make predictions on X_test
- use a different variable name to store predictions so that they are kept separately

In [26]:
# make class predictions for X_test_dtm
y_pred_class_multinomial = nb.predict(X_test_dtm)
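
As an alternative to keeping the vectorizer and the classifier as separate objects, scikit-learn's `make_pipeline` can chain them so that raw text goes in and predictions come out. The sketch below uses a made-up four-tweet corpus and is not the notebook's own workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training tweets with labels (1 = positive, 0 = negative)
train_text = ["awesome app", "love it awesome", "dead battery", "battery dead again"]
train_label = [1, 1, 0, 0]

text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(train_text, train_label)  # vectorizes, then fits the classifier

predictions = text_clf.predict(["awesome", "battery dead"])
print(predictions)  # [1 0]
```

The pipeline applies `transform` with the fitted vocabulary automatically at prediction time, which removes the risk of fitting the vectorizer on test data by mistake.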

## Question 10

### Calculate accuracy of logistic regression
- check accuracy of logistic regression classifier
- use sklearn.metrics.accuracy_score

In [27]:
print("Test Accuracy ",metrics.accuracy_score(y_test, y_pred_class_logistic))

y_pred_class_train_logistic = logreg.predict(X_train_dtm)
print("Train Accuracy ",metrics.accuracy_score(y_train, y_pred_class_train_logistic))

Test Accuracy  0.8408521303258145
Train Accuracy  0.8361888842457167


### Calculate accuracy of naive bayes
- check accuracy of naive bayes classifier
- use sklearn.metrics.accuracy_score

In [28]:
print("Test Accuracy ",metrics.accuracy_score(y_test, y_pred_class_multinomial))

y_pred_class_train_multinomial = nb.predict(X_train_dtm)
print("Train Accuracy ",metrics.accuracy_score(y_train, y_pred_class_train_multinomial))

Test Accuracy  0.8408521303258145
Train Accuracy  0.8361888842457167


In [29]:
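## Both logistic regression and naive Bayes give the same accuracy.
## The scores equal the share of positive tweets in each split (671/798 test, 2001/2393 train; 671 + 2001 = 2672),
## which is what a model that predicts "positive" for every tweet would score.
## The 798x9 test matrix stores 7182 = 798 * 9 elements, so every row has all 9 features set;
## this is consistent with every tweet carrying the same text after In [10], leaving no feature that separates the classes.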
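
When two different models report identical accuracy, it helps to compare the score with a majority-class baseline. The sketch below is a plain-Python illustration; `majority_baseline_accuracy` is a hypothetical helper, and the label list is synthetic, built only to match the class counts reported in Question 3.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Synthetic labels with the same class balance as the filtered tweets
labels = [1] * 2672 + [0] * 519
baseline = majority_baseline_accuracy(labels)
print(round(baseline, 4))  # 0.8374

# The reported test accuracy corresponds to a whole number of correct predictions
print(round(0.8408521303258145 * 798))  # 671
```

A model is only informative if it beats this baseline; scores that sit at the baseline suggest the features carry no usable signal.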