## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets in which users comment on various mobile devices and brands.
Based on the text of a tweet, we classify whether the sentiment the user directs at a particular device or brand is positive or negative.

### 1. Read the dataset (tweets.csv) and drop the rows that contain missing values (NAs)

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('tweets.csv',encoding='latin')

In [3]:
data=data.dropna()
data.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
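
Reading and dropping incomplete rows can also be chained into a single expression. The sketch below is only an illustration: it uses a small in-memory CSV with made-up rows as a stand-in for `tweets.csv`, so the column values are assumptions rather than real data.

```python
import io
import pandas as pd

# Hypothetical stand-in for tweets.csv: three rows, one with a missing value.
csv_text = io.StringIO(
    "tweet_text,emotion_in_tweet_is_directed_at\n"
    "love my new phone,iPhone\n"
    "no product mentioned here,\n"
    "tablet keeps crashing,iPad\n"
)

# Read the file and drop every row that has at least one missing value.
demo = pd.read_csv(csv_text).dropna()
print(demo.shape)
```

Note that `dropna()` keeps the original index labels, so the surviving rows are not renumbered unless `reset_index(drop=True)` is called afterwards.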


### 2. Preprocess data
1. Convert all text to lowercase using `.lower()`.
2. Keep only digits, letters, spaces, and the characters `#`, `+`, and `_`, replacing every other character with a space using `re.sub()`.
3. Remove leading and trailing whitespace using `.strip()`.

In [4]:
data=data.applymap(lambda s:s.lower())

In [5]:
import re
data=data.applymap(lambda s:re.sub("[^0-9a-z #+_]"," " ,s))

In [6]:
data=data.applymap(lambda s:s.strip())

In [7]:
data.shape

(3291, 3)

In [8]:
data.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweet...,iphone,negative emotion
1,jessedee know about fludapp awesome ipad ip...,ipad or iphone app,positive emotion
2,swonderlin can not wait for #ipad 2 also they...,ipad,positive emotion
3,sxsw i hope this year s festival isn t as cras...,ipad or iphone app,negative emotion
4,sxtxstate great stuff on fri #sxsw marissa ma...,google,positive emotion
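
The three preprocessing steps can also be bundled into one function and tried on a single string. This is a minimal sketch; the name `clean_text` is an assumption and does not appear in the notebook.

```python
import re

def clean_text(text):
    """Lowercase, replace disallowed characters with spaces, and trim the ends."""
    text = text.lower()
    # Anything that is not a digit, lowercase letter, space, #, + or _ becomes a space.
    text = re.sub(r"[^0-9a-z #+_]", " ", text)
    return text.strip()

print(clean_text(".@wesley83 I have a 3G iPhone."))
```

Because each removed character is replaced by a space, runs of spaces inside the text are kept; only the ends are trimmed. A function like this could be applied to the text column alone, for example with `data['tweet_text'].map(clean_text)`, which would not require every column to hold strings.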


### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [9]:
data_sub = data[(data.is_there_an_emotion_directed_at_a_brand_or_product=='positive emotion') | (data.is_there_an_emotion_directed_at_a_brand_or_product=='negative emotion')]

In [10]:
data_sub.shape

(3191, 3)
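
An equivalent way to express this kind of filter is `Series.isin`, which avoids repeating the long column name. The sketch below uses a made-up three-row frame; the column name `emotion` and the third label are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the tweets frame.
demo = pd.DataFrame({
    "tweet_text": ["great app", "battery died", "just landed"],
    "emotion": ["positive emotion", "negative emotion", "some other label"],
})

# Keep only the rows whose label is one of the two classes of interest.
keep = ["positive emotion", "negative emotion"]
demo_sub = demo[demo["emotion"].isin(keep)].copy()
print(demo_sub.shape)
```

The `.copy()` makes the subset an independent frame, so later column assignments on it do not trigger pandas' chained-assignment warning.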

### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [11]:
X=data_sub['tweet_text']
Y=data_sub['is_there_an_emotion_directed_at_a_brand_or_product']


In [12]:
X.shape

(3191,)

In [13]:
X.head()

0    wesley83 i have a 3g iphone  after 3 hrs tweet...
1    jessedee know about  fludapp   awesome ipad ip...
2    swonderlin can not wait for #ipad 2 also  they...
3    sxsw i hope this year s festival isn t as cras...
4    sxtxstate great stuff on fri #sxsw  marissa ma...
Name: tweet_text, dtype: object

In [14]:
Y.shape

(3191,)

In [15]:
data_sub.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweet...,iphone,negative emotion
1,jessedee know about fludapp awesome ipad ip...,ipad or iphone app,positive emotion
2,swonderlin can not wait for #ipad 2 also they...,ipad,positive emotion
3,sxsw i hope this year s festival isn t as cras...,ipad or iphone app,negative emotion
4,sxtxstate great stuff on fri #sxsw marissa ma...,google,positive emotion


In [16]:
# use CountVectorizer to create a document-term matrix from the full text column X
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vect = CountVectorizer()
X_dtm = vect.fit_transform(X)

In [17]:
type(X_dtm)

scipy.sparse.csr.csr_matrix

In [18]:
X_dtm.shape

(3191, 5610)
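
To see what a document-term matrix contains, it can help to build one from a tiny corpus and view it as a dense table. The corpus and the name `vect_demo` below are assumptions made up for this sketch.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus.
docs = ["ipad is great", "iphone battery is bad", "great great ipad"]

vect_demo = CountVectorizer()
dtm = vect_demo.fit_transform(docs)

# vocabulary_ maps each term to its column index; sort the terms by that index.
columns = sorted(vect_demo.vocabulary_, key=vect_demo.vocabulary_.get)

# One row per document, one column per term, each cell a term count.
dense = pd.DataFrame(dtm.toarray(), columns=columns)
print(dense)
```

Converting to a dense array is only reasonable for small examples; a matrix the size of the one above is best kept sparse.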

### 5. Find number of different words in vocabulary

In [19]:
print (vect.get_feature_names())
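
Printing the feature names lists the vocabulary but does not count it. One way to get the count, sketched below on a made-up two-document corpus, is the length of the fitted `vocabulary_` dictionary, which should match the number of columns of the document-term matrix. Recent scikit-learn releases also replaced `get_feature_names()` with `get_feature_names_out()`, so the call above may need adjusting depending on the installed version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; vect_demo is an illustrative name.
docs = ["sxsw ipad launch", "ipad line at sxsw"]

vect_demo = CountVectorizer()
dtm = vect_demo.fit_transform(docs)

# Number of distinct terms learned from the corpus.
n_terms = len(vect_demo.vocabulary_)
print(n_terms, dtm.shape[1])
```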



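#### Tip: `dir(obj)` lists the attributes and methods of an object (for example `dir(vect)`); called without arguments, as below, `dir()` lists the names defined in the current namespace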

In [20]:
dir()

['CountVectorizer',
 'In',
 'LogisticRegression',
 'MultinomialNB',
 'Out',
 'TfidfVectorizer',
 'X',
 'X_dtm',
 'Y',
 '_',
 '_10',
 '_12',
 '_13',
 '_14',
 '_15',
 '_17',
 '_18',
 '_3',
 '_7',
 '_8',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_dh',
 '_i',
 '_i1',
 '_i10',
 '_i11',
 '_i12',
 '_i13',
 '_i14',
 '_i15',
 '_i16',
 '_i17',
 '_i18',
 '_i19',
 '_i2',
 '_i20',
 '_i3',
 '_i4',
 '_i5',
 '_i6',
 '_i7',
 '_i8',
 '_i9',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'data',
 'data_sub',
 'exit',
 'get_ipython',
 'metrics',
 'np',
 'pd',
 'quit',
 're',
 'sp',
 'train_test_split',
 'vect',
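
For comparison, passing an object to `dir` lists that object's attributes and methods. The sketch below filters out names that start with an underscore to leave the public ones; `vect_demo` is an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

vect_demo = CountVectorizer()

# Public attributes and methods of the vectorizer object.
public = [name for name in dir(vect_demo) if not name.startswith("_")]
print(public)
```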

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [21]:
Y.value_counts()

positive emotion    2672
negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
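
The counts show a strong class imbalance, which is worth keeping in mind when reading accuracy scores later. The sketch below turns the two counts above into proportions; calling `value_counts(normalize=True)` on the column should give the same shares directly.

```python
import pandas as pd

# Class counts copied from the output above.
counts = pd.Series({"positive emotion": 2672, "negative emotion": 519})

# Share of each class in the filtered data.
shares = counts / counts.sum()
print(shares.round(3))
```

With roughly 84% positive tweets, a model that always predicts "positive" would already reach about 0.84 accuracy on the full data, so accuracy alone says little about how well negative tweets are detected.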

### 7. Change the labels for Positive and Negative emotions as 1 and 0 respectively and store in a different column in the same dataframe named 'Label'

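Hint: use `map` on that column to assign the labels. Note that the cell below uses `np.where` and overwrites the original column in place rather than creating a separate `Label` column.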

In [22]:
data_sub['is_there_an_emotion_directed_at_a_brand_or_product'] = np.where(data_sub['is_there_an_emotion_directed_at_a_brand_or_product']== 'positive emotion',1,0)

In [23]:
Y=data_sub['is_there_an_emotion_directed_at_a_brand_or_product']
Y.value_counts()

1    2672
0     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
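
A variant that follows the task wording more literally keeps the original column and stores the numeric labels in a new `Label` column via `map`. The frame and the column name `emotion` below are made up for the sketch.

```python
import pandas as pd

# Hypothetical miniature frame with the two text labels.
demo = pd.DataFrame(
    {"emotion": ["positive emotion", "negative emotion", "positive emotion"]}
)

# Map each text label to its numeric code and store it in a new column.
label_map = {"positive emotion": 1, "negative emotion": 0}
demo["Label"] = demo["emotion"].map(label_map)
print(demo)
```

Any value missing from `label_map` would become `NaN`, which makes unexpected labels easy to spot.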

### 8. Define the feature set (independent variable, X) as the tweet text column and the labels as the target (dependent variable, Y), then split into train and test datasets

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=1,test_size=0.20 )
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
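
The vectorizer is fitted on the training text only and then reused to transform the test text, so the test matrix has the same columns as the training matrix. The sketch below, on made-up documents, shows that a word seen only in the test text is ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical train and test documents.
train_docs = ["ipad is great", "iphone is slow"]
test_docs = ["android is great"]

vect_demo = CountVectorizer()
train_dtm = vect_demo.fit_transform(train_docs)  # learns the vocabulary
test_dtm = vect_demo.transform(test_docs)        # reuses that vocabulary

print(train_dtm.shape, test_dtm.shape)
print("android" in vect_demo.vocabulary_)
```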

### 9. Predicting the sentiment

#### Use Naive Bayes and Logistic Regression to predict the sentiment of the given text and compare their accuracy scores

In [25]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [26]:
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [27]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [28]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.8341158059467919

In [29]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[ 27,  97],
       [  9, 506]], dtype=int64)
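
In scikit-learn's confusion matrix, rows are the true classes and columns the predicted classes, ordered 0 (negative) then 1 (positive). The sketch below recomputes accuracy and per-class recall from the four numbers above.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[27, 97],
               [9, 506]])

# Flattened row by row: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
recall_negative = tn / (tn + fp)  # share of negative tweets found
recall_positive = tp / (tp + fn)  # share of positive tweets found
print(round(accuracy, 4), round(recall_negative, 4), round(recall_positive, 4))
```

The accuracy matches the score printed earlier, but only 27 of the 124 negative tweets in the test set are recognised, which the single accuracy figure hides.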

In [30]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# train the model using X_train_dtm
logreg.fit(X_train_dtm, y_train)
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)



In [31]:
# calculate predicted probabilities for X_test_dtm
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([0.88086353, 0.16445505, 0.80569119, 0.99683904, 0.56391117,
       0.94415846, 0.80857336, 0.95257023, 0.95473345, 0.98180946,
       0.1523031 , 0.71425767, 0.92608863, 0.9247352 , 0.99416808,
       0.83391794, 0.82722427, 0.98787224, 0.87704846, 0.96769315,
       0.89688456, 0.320912  , 0.93920317, 0.99539115, 0.2164588 ,
       0.87025237, 0.92027933, 0.96177027, 0.80916845, 0.0964679 ,
       0.98972899, 0.99684767, 0.48739581, 0.398789  , 0.97895555,
       0.8414055 , 0.99614091, 0.91603146, 0.98605036, 0.98392807,
       0.96998442, 0.51318549, 0.44718055, 0.86170004, 0.91974865,
       0.6621502 , 0.35069192, 0.81474316, 0.3503181 , 0.92468539,
       0.94244047, 0.80723852, 0.99073075, 0.27801829, 0.67286296,
       0.96582946, 0.93548391, 0.98917811, 0.90013033, 0.48819013,
       0.96851673, 0.61291556, 0.96053913, 0.98028497, 0.94773202,
       0.7375954 , 0.84484381, 0.43203571, 0.9533004 , 0.98862458,
       0.93530761, 0.95628788, 0.972246  , 0.98441913, 0.98193

In [32]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)

0.8528951486697965
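
The class predictions and the predicted probabilities of a binary logistic regression are linked: `predict` returns class 1 when the probability of class 1 exceeds 0.5. The sketch below checks this on a made-up one-feature data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one numeric feature, two classes.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X_demo, y_demo)

# Probability of class 1, then a manual 0.5 threshold.
prob = model.predict_proba(X_demo)[:, 1]
pred_from_prob = (prob > 0.5).astype(int)

print(pred_from_prob, model.predict(X_demo))
```

Keeping the probabilities makes it possible to try a threshold other than 0.5, which can be useful when one class is much rarer than the other.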

### 10. Create functions that take a CountVectorizer object as input and print the number of features and the test accuracy: `tokenize_test_nb` (Naive Bayes) and `tokenize_test_lr` (Logistic Regression)

In [33]:
def tokenize_test_nb(vect):
    x_train_dtm = vect.fit_transform(X_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [34]:
def tokenize_test_lr(vect):
    x_train_dtm = vect.fit_transform(X_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(X_test)
    logreg = LogisticRegression()
    logreg.fit(x_train_dtm, y_train)
    y_pred_class = logreg.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
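
The two functions above differ only in the model they fit. A single helper that takes the model as an argument is one way to remove the duplication. The sketch below is an assumption, not the notebook's code: the name `tokenize_predict`, its signature, and the toy data are all made up for illustration.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

def tokenize_predict(vect, model, X_train, X_test, y_train, y_test):
    """Vectorize the text with vect, fit model, then print and return test accuracy."""
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    print('Features: ', X_train_dtm.shape[1])
    model.fit(X_train_dtm, y_train)
    accuracy = metrics.accuracy_score(y_test, model.predict(X_test_dtm))
    print('Accuracy: ', accuracy)
    return accuracy

# Hypothetical toy data standing in for the tweet text and labels.
train_text = ["great phone", "love this ipad", "terrible battery", "awful screen"]
train_y = [1, 1, 0, 0]
test_text = ["great ipad", "terrible screen"]
test_y = [1, 0]

for model in (MultinomialNB(), LogisticRegression()):
    tokenize_predict(CountVectorizer(), model, train_text, test_text, train_y, test_y)
```

Passing the data explicitly, rather than reading `X_train` and `y_train` from the notebook's global scope, makes the helper usable with any split.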

### 11. Create a CountVectorizer with `ngram_range=(1, 2)` and pass it to `tokenize_test_nb` and `tokenize_test_lr` to print the accuracy score

In [35]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test_nb(vect)

Features:  25706
Accuracy:  0.838810641627543


In [36]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test_lr(vect)

Features:  25706
Accuracy:  0.8544600938967136
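
The jump in feature count comes from the added word pairs. The sketch below compares the vocabulary learned with and without 2-grams on a made-up two-document corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus where word order carries the sentiment.
docs = ["not good", "very good"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

With 2-grams, "not good" and "very good" become separate features, so a model can tell them apart even though both contain "good".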


### 12. Create a CountVectorizer with `stop_words='english'` and pass it to `tokenize_test_nb` and `tokenize_test_lr` to print the accuracy score

In [37]:
vect = CountVectorizer(stop_words='english')
tokenize_test_nb(vect)

Features:  4761
Accuracy:  0.8403755868544601


In [38]:
vect = CountVectorizer(stop_words='english')
tokenize_test_lr(vect)

Features:  4761
Accuracy:  0.8544600938967136
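
Stop-word removal drops very common words before counting. The sketch below, on a made-up sentence, shows which words survive and checks whether the negation "not" is on scikit-learn's built-in English list, since removing negations may matter for sentiment tasks.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical single-document corpus.
docs = ["the battery is not good"]

vect_demo = CountVectorizer(stop_words="english").fit(docs)

# Terms that remain after stop-word removal.
print(sorted(vect_demo.vocabulary_))

# Whether the negation "not" is treated as a stop word.
print("not" in ENGLISH_STOP_WORDS)
```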


### 13. Create a CountVectorizer with `stop_words='english'` and `max_features=300` and pass it to `tokenize_test_nb` and `tokenize_test_lr` to print the accuracy score

In [39]:
vect = CountVectorizer(stop_words='english', max_features=300)
tokenize_test_nb(vect)

Features:  300
Accuracy:  0.8028169014084507


In [40]:
vect = CountVectorizer(stop_words='english', max_features=300)
tokenize_test_lr(vect)

Features:  300
Accuracy:  0.8200312989045383
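
`max_features` keeps only the terms with the highest total count across the corpus. The sketch below uses a made-up corpus and a limit of 2 to make the effect visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "ipad" occurs 5 times, "sxsw" twice, the rest once.
docs = ["ipad ipad ipad sxsw", "ipad sxsw launch", "ipad queue"]

vect_demo = CountVectorizer(max_features=2).fit(docs)
print(sorted(vect_demo.vocabulary_))
```

A small limit discards rare terms, which shrinks the matrix but can also remove words that carry sentiment.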


### 14. Create a CountVectorizer with `ngram_range=(1, 2)` and `max_features=15000` and pass it to `tokenize_test_nb` and `tokenize_test_lr` to print the accuracy score

In [41]:
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=15000)
tokenize_test_nb(vect)

Features:  15000
Accuracy:  0.837245696400626


In [42]:
vect = CountVectorizer(ngram_range=(1, 2), max_features=15000)
tokenize_test_lr(vect)

Features:  15000
Accuracy:  0.8591549295774648


### 15. Create a CountVectorizer with `ngram_range=(1, 2)` that keeps only terms appearing in at least 2 documents (`min_df=2`) and pass it to `tokenize_test_nb` and `tokenize_test_lr` to print the accuracy score

In [43]:
# include 1-grams and 2-grams, and keep only terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test_nb(vect)

Features:  8275
Accuracy:  0.8450704225352113


In [44]:
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test_lr(vect)

Features:  8275
Accuracy:  0.8622848200312989
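
`min_df` counts documents, not occurrences: a term repeated many times within a single document still has a document frequency of 1. The sketch below shows this on a made-up corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "ipad" occurs 3 times but in one document only,
# while "iphone" occurs once in each of two documents.
docs = ["ipad ipad ipad", "iphone great", "iphone slow"]

vect_demo = CountVectorizer(min_df=2).fit(docs)
print(sorted(vect_demo.vocabulary_))
```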
