# Sentiment analysis 

The objective of this problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of each tweet, we will classify whether the user's sentiment toward a particular mobile device is positive or negative.

## Question 1

### Read the data
- read tweets.csv
- use Latin-1 (ISO-8859-1) encoding if loading raises an encoding error

In [1]:
# run this cell to mount Google Drive if you are using Google Colab
from google.colab import drive
drive.mount('/content/drive')
project_path = '/content/drive/My Drive/assignments/'

Mounted at /content/drive


In [8]:
import pandas as pd
df = pd.read_csv(project_path + 'tweets.csv', encoding='ISO-8859-1')
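
The encoding fallback described above can also be written explicitly. The following is a minimal sketch, not part of the original notebook: the helper name `read_csv_with_fallback` and the demo file are hypothetical, and it assumes a failed decode surfaces as `UnicodeDecodeError`.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in turn and return the first DataFrame that loads."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            # Remember the failure and try the next encoding
            last_error = err
    raise last_error


# Demo on a throwaway file (hypothetical data) whose bytes are not valid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as fh:
        fh.write(b"tweet_text\ncaf\xe9 wifi\n")
    demo_df = read_csv_with_fallback(demo_path)
```

Trying UTF-8 first keeps correctly encoded files intact, and ISO-8859-1 works as a last resort because it can decode any byte sequence.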

In [10]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [12]:
df.shape

(9093, 3)

### Drop null values
- drop all the rows with null values

In [14]:
df.dropna(inplace=True)
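
Before dropping rows, it can help to see which columns hold the missing values. This is a small sketch on a made-up frame (the column name `target` and the rows are illustrative, not taken from tweets.csv):

```python
import pandas as pd

# Toy frame standing in for the tweet data
toy = pd.DataFrame({
    "tweet_text": ["love my ipad", None, "battery died"],
    "target": ["iPad", "iPhone", None],
})

# Count missing values per column, then drop every row that has any
missing_per_column = toy.isna().sum()
cleaned = toy.dropna()
```

Here `dropna()` returns a new frame instead of modifying `toy` in place, which makes it easy to compare row counts before and after.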

### Print the dataframe
- print initial 5 rows of the data
- use df.head()

In [16]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [17]:
df.shape

(3291, 3)

## Question 2

### Preprocess data
- convert all text to lowercase - use .lower()
- keep only digits, letters, spaces, and the characters #, +, and _ - use re.sub()
- strip all the text - use .strip()
    - this removes leading and trailing spaces

In [26]:
import re
# Lowercase, keep only digits, letters, spaces and #+_, then trim surrounding whitespace
df['tweet_text'] = df['tweet_text'].map(lambda s: re.sub('[^0-9a-z #+_]', '', s.lower()).strip())
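
To see what the three cleaning steps (lowercase, character filter, strip) do to a single tweet, here is a small sketch. The function name `clean_tweet` and the sample string are hypothetical.

```python
import re


def clean_tweet(text):
    """Lowercase, drop everything except digits, letters, spaces and #+_, then strip."""
    lowered = text.lower()
    filtered = re.sub(r"[^0-9a-z #+_]", "", lowered)
    return filtered.strip()


sample = "  .@Wesley83 I have a 3G iPhone! #SXSW  "
cleaned = clean_tweet(sample)
print(cleaned)  # wesley83 i have a 3g iphone #sxsw
```

Lowercasing must happen before the filter, because the character class only lists `a-z` and would otherwise delete every uppercase letter.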

### Print the dataframe

In [27]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweeti...,iPhone,Negative emotion
1,jessedee know about fludapp awesome ipadiphon...,iPad or iPhone App,Positive emotion
2,swonderlin can not wait for #ipad 2 also they ...,iPad,Positive emotion
3,sxsw i hope this years festival isnt as crashy...,iPad or iPhone App,Negative emotion
4,sxtxstate great stuff on fri #sxsw marissa may...,Google,Positive emotion


## Question 3

### Preprocess data
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - select only those rows where the value equals "Positive emotion" or "Negative emotion"
- find the value counts of "Positive emotion" and "Negative emotion"

In [38]:
label_col = 'is_there_an_emotion_directed_at_a_brand_or_product'
df = df[df[label_col].isin(['Positive emotion', 'Negative emotion'])]

In [39]:
df.shape

(3191, 3)
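
The task also asks for the value counts of the two classes. A minimal sketch of the filter followed by `value_counts()`, using a made-up label column (the "Other (illustrative)" value is a placeholder, not a label from the dataset):

```python
import pandas as pd

label_col = "is_there_an_emotion_directed_at_a_brand_or_product"

# Toy labels standing in for the real column
toy = pd.DataFrame({
    label_col: [
        "Positive emotion",
        "Negative emotion",
        "Positive emotion",
        "Other (illustrative)",
    ]
})

# Keep only the two sentiment classes, then count each one
kept = toy[toy[label_col].isin(["Positive emotion", "Negative emotion"])]
counts = kept[label_col].value_counts()
```

On the real data, `df[label_col].value_counts()` would show how imbalanced the two classes are, which matters when interpreting accuracy later.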

## Question 4

### Encode labels
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - change "Positive emotion" to 1
    - change "Negative emotion" to 0
- use map function to replace values

In [47]:
# Take an explicit copy first so the assignment below does not raise SettingWithCopyWarning
df = df.copy()
df['is_there_an_emotion_directed_at_a_brand_or_product'] = df['is_there_an_emotion_directed_at_a_brand_or_product'].map({'Positive emotion': 1, 'Negative emotion': 0})


In [48]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweeti...,iPhone,0
1,jessedee know about fludapp awesome ipadiphon...,iPad or iPhone App,1
2,swonderlin can not wait for #ipad 2 also they ...,iPad,1
3,sxsw i hope this years festival isnt as crashy...,iPad or iPhone App,0
4,sxtxstate great stuff on fri #sxsw marissa may...,Google,1
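
One detail of `Series.map` with a dictionary is worth knowing: values missing from the dictionary become `NaN` rather than raising an error. A short sketch on made-up labels (the "Unmapped label" string is hypothetical):

```python
import pandas as pd

mapping = {"Positive emotion": 1, "Negative emotion": 0}

labels = pd.Series(["Positive emotion", "Negative emotion", "Positive emotion"])
encoded = labels.map(mapping)

# A value absent from the mapping silently turns into NaN
unmapped = pd.Series(["Unmapped label"]).map(mapping)
```

This is why filtering to the two sentiment classes before encoding matters: any other label would end up as a missing value in the target column.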


## Question 5

### Get feature and label
- get column "tweet_text" as feature
- get column "is_there_an_emotion_directed_at_a_brand_or_product" as label

In [49]:
features = df['tweet_text']
labels = df['is_there_an_emotion_directed_at_a_brand_or_product']

### Create train and test data
- use train_test_split to get train and test set
- set a random_state
- test_size: 0.25

In [50]:
from sklearn.model_selection import train_test_split

In [52]:
X_train, X_test, y_train, y_test = train_test_split(
    features,
    labels,
    test_size=0.25, 
    random_state=42
)
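
Because positive tweets are likely to outnumber negative ones, a stratified split may be worth considering. The sketch below is an optional variation, not what the notebook does: it passes `stratify=labels` so both subsets keep the same class ratio, using made-up data.

```python
from sklearn.model_selection import train_test_split

# Toy data: 16 positive and 4 negative examples
texts = [f"tweet {i}" for i in range(20)]
labels = [1] * 16 + [0] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,
    random_state=42,
    stratify=labels,  # preserve the 4:1 class ratio in both subsets
)
```

With 20 rows and `test_size=0.25`, the test set gets 5 rows, and stratification places exactly one negative example among them.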

## Question 6

### Vectorize data
- create document-term matrix
- use CountVectorizer()
    - ngram_range: (1, 2)
    - stop_words: 'english'
    - min_df: 2   
- do fit_transform on X_train
- do transform on X_test

In [54]:
from sklearn.feature_extraction.text import CountVectorizer

In [56]:
cv = CountVectorizer(ngram_range=(1,2), stop_words='english', min_df=2)

In [57]:
X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)
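
The effect of `ngram_range`, `min_df`, and the fit/transform split is easier to see on a tiny corpus. The documents below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great ipad app", "great ipad launch", "iphone battery died"]
test_docs = ["great iphone app"]

vec = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)

# The vocabulary is learned from the training documents only
train_dtm = vec.fit_transform(train_docs)
# Test documents are projected onto that fixed vocabulary
test_dtm = vec.transform(test_docs)

vocab = sorted(vec.vocabulary_)
print(vocab)  # ['great', 'great ipad', 'ipad']
```

Only terms that occur in at least two training documents survive `min_df=2`. In the test document, "iphone" and "app" are ignored because they are not in the learned vocabulary, which is why `transform` (not `fit_transform`) is used on the test set.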

In [58]:
cv.vocabulary_

{'idea': 2055,
 'google': 1689,
 'analytics': 206,
 'tools': 4868,
 'know': 2499,
 'website': 5160,
 'doing': 1216,
 'sxsw': 4420,
 'bavcid': 547,
 'google analytics': 1692,
 'doing sxsw': 1217,
 'hey': 1964,
 'capture': 764,
 'booth': 645,
 'experience': 1382,
 'retrollect': 3856,
 'popular': 3585,
 'wins': 5229,
 'ipad': 2177,
 'link': 2649,
 'hey sxsw': 1969,
 'wins ipad': 5231,
 'ipad link': 2230,
 'great': 1805,
 'app': 246,
 'madebymany': 2809,
 'great sxsw': 1817,
 'sxsw ipad': 4504,
 'ipad app': 2184,
 'app madebymany': 270,
 'got': 1780,
 'foodspotting': 1523,
 'iphone': 2308,
 'apps': 373,
 'link iphone': 2667,
 'iphone apps': 2316,
 'theyd': 4764,
 'stupid': 4366,
 'apple': 298,
 'opening': 3356,
 'temporary': 4724,
 'store': 4299,
 'austin': 430,
 'amp': 179,
 'launch': 2534,
 'mention': 2916,
 'apple opening': 329,
 'opening temporary': 3363,
 'temporary store': 4726,
 'store austin': 4305,
 'austin sxsw': 452,
 'sxsw amp': 4426,
 'amp ipad': 190,
 'ipad launch': 2228,
 ...

In [60]:
X_train.shape

(2393, 5372)

In [61]:
X_test.shape

(798, 5372)

## Question 7

### Select classifier logistic regression
- use logistic regression for predicting sentiment of the given tweet
- initialize classifier

In [62]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

### Fit the classifier
- fit logistic regression classifier

In [63]:
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Question 8

### Select classifier naive bayes
- use naive bayes for predicting sentiment of the given tweet
- initialize classifier
- use MultinomialNB

In [64]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()

### Fit the classifier
- fit naive bayes classifier

In [65]:
mnb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Question 9

### Make predictions on logistic regression
- use your trained logistic regression model to make predictions on X_test

In [66]:
y_pred_lr = lr.predict(X_test)

### Make predictions on naive bayes
- use your trained naive bayes model to make predictions on X_test
- use a different variable name to store predictions so that they are kept separately

In [67]:
y_pred_mnb = mnb.predict(X_test)

## Question 10

### Calculate accuracy of logistic regression
- check accuracy of logistic regression classifier
- use sklearn.metrics.accuracy_score

In [68]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_lr)

0.8709273182957393

### Calculate accuracy of naive bayes
- check accuracy of naive bayes classifier
- use sklearn.metrics.accuracy_score

In [69]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_mnb)

0.8558897243107769
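
Accuracy alone can be misleading when one class dominates, so it may help to compare it with a majority-class baseline and a confusion matrix. The sketch below uses made-up labels and predictions, not the notebook's results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels: four positive, two negative
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Baseline: always predict the most frequent class
majority = max(set(y_true), key=y_true.count)
baseline_acc = accuracy_score(y_true, [majority] * len(y_true))

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
```

In this toy case the classifier is right on 5 of 6 examples, while always predicting "positive" is already right on 4 of 6. The confusion matrix shows that the single error is a negative tweet predicted as positive.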