## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets in which users comment on various mobile devices and brands.
Based on the text of a tweet, we classify whether the user's sentiment toward the device or brand it mentions is positive or negative.

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

### 1. Read the dataset (tweets.csv) and drop the rows that contain NA values

In [14]:
data = pd.read_csv('tweets.csv', encoding='latin')

In [15]:
newdata = data.dropna(axis=0, how='any')

In [16]:
newdata.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [17]:
newdata.shape

(3291, 3)

### 2. Preprocess data
1. Convert all text to lowercase with `.lower()`.
2. Keep only digits, letters, spaces, and the characters `#`, `+`, `_`, using `re.sub()`.
3. Strip leading and trailing whitespace with `.strip()`.

In [10]:
# newdata["tweet_text"] = data["tweet_text"].str.lower()  # converting the tweet text to lower case

In [18]:
import re

In [20]:
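# every remaining column holds strings, so the same cleaning is applied cell by cell
# (pandas 2.1+ deprecates applymap in favour of DataFrame.map)
newdata = newdata.applymap(lambda s: s.lower())
newdata = newdata.applymap(lambda s: re.sub("[^0-9a-zA-Z #+_]", "", s))
newdata = newdata.applymap(lambda s: s.strip())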
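
To see what the three steps do to a single tweet, here is a minimal pure-Python sketch. The helper name `clean_text` is hypothetical and not part of the notebook; it applies the same regular expression as the cell above to the truncated sample text shown in the first table.

```python
import re

def clean_text(text):
    """Lowercase, drop disallowed characters, and trim surrounding whitespace."""
    text = text.lower()
    # keep digits, letters, spaces and the characters # + _
    text = re.sub("[^0-9a-zA-Z #+_]", "", text)
    return text.strip()

sample = ".@wesley83 I have a 3G iPhone. "
cleaned = clean_text(sample)
print(cleaned)

# punctuation is deleted rather than replaced by a space, so neighbours merge
merged = clean_text("iPad/iPhone #SXSW!")
print(merged)
```

Because the text is lowercased first, the `A-Z` range in the pattern is redundant but harmless. Deleting punctuation outright also joins words that were separated only by a symbol, which is why `ipadiphon...` appears in the cleaned table further down.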

In [24]:
#newdata.head(5)
newdata['is_there_an_emotion_directed_at_a_brand_or_product'].unique()

array(['negative emotion', 'positive emotion',
       'no emotion toward brand or product', 'i cant tell'], dtype=object)

### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [28]:
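# keep only the rows labelled positive or negative; .copy() gives an independent
# dataframe, so adding a column later does not trigger SettingWithCopyWarning
label = newdata[newdata['is_there_an_emotion_directed_at_a_brand_or_product'].isin(['negative emotion', 'positive emotion'])].copy()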

In [36]:
label.head(10)
#label.shape

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweeti...,iphone,negative emotion
1,jessedee know about fludapp awesome ipadiphon...,ipad or iphone app,positive emotion
2,swonderlin can not wait for #ipad 2 also they ...,ipad,positive emotion
3,sxsw i hope this years festival isnt as crashy...,ipad or iphone app,negative emotion
4,sxtxstate great stuff on fri #sxsw marissa may...,google,positive emotion
7,#sxsw is just starting #ctia is around the cor...,android,positive emotion
8,beautifully smart and simple idea rt madebyman...,ipad or iphone app,positive emotion
9,counting down the days to #sxsw plus strong ca...,apple,positive emotion
10,excited to meet the samsungmobileus at #sxsw s...,android,positive emotion
11,find amp start impromptu parties at #sxsw with...,android app,positive emotion


### 4. Represent text as numerical data using `CountVectorizer` and get the document-term matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [46]:
x=label['tweet_text']  
y=label['is_there_an_emotion_directed_at_a_brand_or_product']

In [49]:
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=100, test_size=0.20)

In [50]:
# use CountVectorizer to create document-term matrices from x_train and x_test
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(x_train)
X_test_dtm = vect.transform(x_test)

### 5. Find number of different words in vocabulary

In [51]:
# last 50 vocabulary terms; on scikit-learn 1.2+ use vect.get_feature_names_out() instead
print(vect.get_feature_names()[-50:])

['yeah', 'year', 'years', 'yearsquot', 'yeasayer', 'yeay', 'yelp', 'yelping', 'yep', 'yer', 'yes', 'yesterday', 'yet', 'yield', 'yikes', 'yo', 'yobongo', 'yonkers', 'york', 'you', 'youd', 'youll', 'your', 'youre', 'yours', 'yourself', 'youtube', 'youve', 'yowza', 'yr', 'yrs', 'yrsday', 'yummy', 'yup', 'zaggle', 'zappos', 'zazzle', 'zazzlesxsw', 'zazzlsxsw', 'ze', 'zelda', 'zeldman', 'zero', 'zip', 'zms', 'zombies', 'zomg', 'zone', 'zoom', 'zzzs']
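
The cell above lists terms but does not print how many there are. The sketch below, on a made-up three-document corpus (`toy_docs` and `toy_vect` are illustrative names, not from the notebook), shows one way the vocabulary size can be read from a fitted `CountVectorizer`: the length of its `vocabulary_` attribute, which equals the number of columns in the document-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["great phone", "bad battery", "great battery"]  # made-up training corpus
toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index, so its length is the vocabulary size
vocab_size = len(toy_vect.vocabulary_)
print(vocab_size, toy_dtm.shape)

# a word that was not seen during fit ("screen") gets no column at transform time
new_row = toy_vect.transform(["great screen"]).toarray().tolist()
print(sorted(toy_vect.vocabulary_), new_row)
```

On the notebook's own objects, `len(vect.vocabulary_)` or `X_train_dtm.shape[1]` should give the same count.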


#### Tip: to see all attributes and methods available on an object, use `dir`

In [52]:
dir(vect)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_limit_features',
 '_sort_features',
 '_stop_words_id',
 '_validate_params',
 '_validate_vocabulary',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',
 'fit_transform',
 'fixed_vocabulary_',
 'get_feature_names',
 'get_params',
 'get_stop_words',
 'input',
 'inverse_transform',
 'lowercase',
 'max_df',
 'max_features',
 'min_df',


### 6. Find out how many positive and negative emotions there are.

Hint: Use value_counts on that column

In [53]:
label['is_there_an_emotion_directed_at_a_brand_or_product']

0       negative emotion
1       positive emotion
2       positive emotion
3       negative emotion
4       positive emotion
7       positive emotion
8       positive emotion
9       positive emotion
10      positive emotion
11      positive emotion
12      positive emotion
13      positive emotion
14      positive emotion
15      positive emotion
17      negative emotion
18      positive emotion
19      positive emotion
20      positive emotion
21      positive emotion
22      positive emotion
23      positive emotion
24      positive emotion
25      positive emotion
26      positive emotion
27      positive emotion
28      positive emotion
29      positive emotion
30      positive emotion
31      positive emotion
36      positive emotion
              ...       
9000    positive emotion
9006    positive emotion
9008    negative emotion
9009    positive emotion
9012    positive emotion
9013    positive emotion
9017    positive emotion
9018    positive emotion
9022    positive emotion
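
The cell above displays the column itself rather than the per-class counts. As a minimal sketch of what the hint describes, the example below calls `value_counts` on a made-up Series (`toy_labels` is an illustrative name and the values are not the notebook's data).

```python
import pandas as pd

# made-up labels, only to illustrate the call
toy_labels = pd.Series(
    ["positive emotion", "negative emotion", "positive emotion", "positive emotion"]
)

counts = toy_labels.value_counts()                # absolute count per class
shares = toy_labels.value_counts(normalize=True)  # fraction of rows per class
print(counts)
print(shares)
```

Applied to the notebook's dataframe, the equivalent call would be `label['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()`.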


### 7. Relabel positive and negative emotions as 1 and 0 respectively and store the result in a new column of the same dataframe (named `emotion` in the code below)

Hint: use map on that column and give labels

In [60]:
label["emotion"] = label["is_there_an_emotion_directed_at_a_brand_or_product"].map({"positive emotion": 1, "negative emotion": 0})

In [61]:
label.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,emotion
0,wesley83 i have a 3g iphone after 3 hrs tweeti...,iphone,negative emotion,0
1,jessedee know about fludapp awesome ipadiphon...,ipad or iphone app,positive emotion,1
2,swonderlin can not wait for #ipad 2 also they ...,ipad,positive emotion,1
3,sxsw i hope this years festival isnt as crashy...,ipad or iphone app,negative emotion,0
4,sxtxstate great stuff on fri #sxsw marissa may...,google,positive emotion,1
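
One detail of `map` with a dictionary is worth knowing: values that are missing from the dictionary become `NaN`. The sketch below uses a made-up dataframe (`toy` and its `sentiment` column are illustrative names) to show why restricting the rows to the two classes before mapping matters.

```python
import pandas as pd

toy = pd.DataFrame({"sentiment": ["positive emotion", "negative emotion", "i cant tell"]})
mapping = {"positive emotion": 1, "negative emotion": 0}
toy["emotion"] = toy["sentiment"].map(mapping)

# "i cant tell" is not a key of the mapping, so its row is left as NaN
unmapped = int(toy["emotion"].isna().sum())
print(toy)
print("unmapped rows:", unmapped)
```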


### 8. Define the feature set (independent variable, X) as the `tweet_text` column and the `emotion` column as the target (dependent variable, y), then divide into train and test datasets

In [63]:
x=label['tweet_text']  
y=label['emotion']
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=100, test_size=0.20)

In [64]:
# use CountVectorizer to create document-term matrices from x_train and x_test
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(x_train)
X_test_dtm = vect.transform(x_test)

### 9. Predicting the sentiment


#### Use Naive Bayes and Logistic Regression to predict the sentiment of the given text, and compare their accuracy scores

In [65]:
# use Naive Bayes to predict the sentiment label
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))

0.8748043818466353


In [68]:
# use logistic regression with all features
# C is the inverse regularization strength, so C=1e9 makes regularization negligible
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))

0.8826291079812206
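
Accuracy is easier to judge next to the score obtained by always predicting the most frequent class, especially since the label listing in step 6 is dominated by `positive emotion`. The sketch below computes both numbers by hand on made-up labels and predictions; all variable names are illustrative and none of the values come from the notebook.

```python
# made-up labels and predictions, not the notebook's data
y_true_toy = [1, 1, 1, 0, 1, 0, 1, 1]
y_pred_toy = [1, 1, 1, 1, 1, 0, 1, 0]

# accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

# null accuracy: what always predicting the most frequent class would score
majority = max(set(y_true_toy), key=y_true_toy.count)
null_accuracy = y_true_toy.count(majority) / len(y_true_toy)

print(accuracy, null_accuracy)
```

In this toy case the classifier is no better than the majority-class baseline. On the notebook's data, `y_test.value_counts(normalize=True).max()` would give the comparable baseline for the two accuracy scores above.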


### 10. Create a function called `tokenize_test` that takes a count vectorizer object as input and prints the accuracy for x (text) and y (labels)

In [69]:
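def tokenize_test(vect):
    """Fit `vect` on x_train, train Multinomial Naive Bayes, and print test accuracy.

    Relies on the x_train, x_test, y_train and y_test variables defined above.
    """
    # learn the vocabulary from the training tweets only
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    # encode the test tweets with that same vocabulary
    x_test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))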

### 11. Create a count vectorizer with `ngram_range=(1, 2)` and pass it to the `tokenize_test` function to print the accuracy score

In [70]:
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
tokenize_test(vect)

Features:  26334
Accuracy:  0.8841940532081377
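
To make the jump in feature count concrete, the sketch below fits a vectorizer with `ngram_range=(1, 2)` on a single made-up sentence and lists the resulting features. The names `toy_vect` and `terms` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_vect.fit(["not good at all"])  # made-up document

# vocabulary_ holds every unigram and bigram that became a feature
terms = sorted(toy_vect.vocabulary_)
print(terms)
```

Four words yield seven features here, because every pair of adjacent words is added as its own column. Bigrams such as `not good` keep a negation attached to the word it modifies, which is one plausible reason they can help on sentiment data.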


### 12. Create a count vectorizer with `stop_words='english'` and pass it to the `tokenize_test` function to print the accuracy score

In [71]:
# remove English stop words and only keep 100 features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)

Features:  100
Accuracy:  0.8403755868544601
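
The sketch below shows which terms `stop_words='english'` removes from a single made-up sentence, by comparing the vocabulary with and without the option. The variable names are illustrative and the sentence is not from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["this is not a good phone"]  # made-up document

plain_terms = sorted(CountVectorizer().fit(doc).vocabulary_)
filtered_terms = sorted(CountVectorizer(stop_words='english').fit(doc).vocabulary_)

print(plain_terms)     # "a" is already dropped: the default tokenizer ignores 1-letter words
print(filtered_terms)  # the built-in English stop list also removes "this", "is" and "not"
```

Note that `not` is on scikit-learn's built-in English stop list, so the negation disappears along with the filler words. That may contribute to the lower accuracy above, although the `max_features=100` cap is likely the larger factor.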


### 13. Create a count vectorizer with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_test` function to print the accuracy score

In [72]:
# remove English stop words and only keep 300 features
vect = CountVectorizer(stop_words='english', max_features=300)
tokenize_test(vect)

Features:  300
Accuracy:  0.8450704225352113


### 14. Create a count vectorizer with `ngram_range=(1, 2)` and `max_features=15000` and pass it to the `tokenize_test` function to print the accuracy score

In [73]:
# remove English stop words and keep at most 15000 features; ngram_range is left at its
# default (unigrams only) here, so the cap is never reached
vect = CountVectorizer(stop_words='english', max_features=15000)
tokenize_test(vect)

Features:  5191
Accuracy:  0.8701095461658842


### 15. Create a count vectorizer with `ngram_range=(1, 2)` that includes only terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_test` function to print the accuracy score

In [75]:
# include 1-grams and 2-grams, drop terms found in fewer than 2 tweets, and cap the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000,min_df=2)
tokenize_test(vect)

Features:  8111
Accuracy:  0.8575899843505478
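
`min_df` counts the number of documents a term occurs in, not its total number of occurrences. The sketch below illustrates the filter on a made-up three-document corpus; `toy_docs`, `all_terms` and `frequent_terms` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["great phone", "great battery", "bad battery"]  # made-up corpus

all_terms = sorted(CountVectorizer().fit(toy_docs).vocabulary_)
frequent_terms = sorted(CountVectorizer(min_df=2).fit(toy_docs).vocabulary_)

print(all_terms)       # every term seen during fit
print(frequent_terms)  # only terms present in at least 2 documents
```

Terms seen in a single document (`bad`, `phone`) are dropped. This is the same mechanism that reduces the notebook's unigram and bigram vocabulary from 26334 to 8111 features.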
