## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of each tweet, we will classify whether the sentiment the user directs at a particular mobile device is positive or negative.

In [0]:
# only the Drive helper is needed here; the dataset is read from Google Drive
from google.colab import drive

In [2]:
drive.mount('/content/drive')

Mounted at /content/drive


### 1. Read the dataset (tweets.csv) and drop the NA's while reading the dataset

In [0]:
import pandas as pd
import numpy as np

In [0]:
missing_values = ["n/a", "na", "--", "?" ]
tweets = pd.read_csv("/content/drive/My Drive/NLP/tweets.csv",encoding='latin',na_values = missing_values)

In [5]:
tweets.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [0]:
# making new data frame with dropped NA values 
tweets = tweets.dropna(axis = 0, how ='any')

In [7]:
tweets.shape

(3291, 3)
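To see how `na_values` and `dropna` interact, here is a minimal sketch on a tiny in-memory CSV. The column names and rows are made up for illustration; only the `missing_values` list comes from the cell above.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for tweets.csv
csv_text = """tweet_text,brand,label
good phone,iphone,Positive emotion
bad phone,?,Negative emotion
meh,--,n/a
"""

missing_values = ["n/a", "na", "--", "?"]

# Strings listed in na_values are parsed as NaN while the file is read
demo = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)
rows_before = len(demo)

# how='any' drops a row as soon as one of its cells is NaN
demo = demo.dropna(axis=0, how="any")
rows_after = len(demo)

print(rows_before, rows_after)
```

Only the first row survives, because each of the other two contains at least one placeholder that was read as NaN.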

### 2. Preprocess data
1. Convert all text to lowercase - use .lower()
2. Keep only digits, lowercase letters, spaces, and the characters #, + and _ in the text - use re.sub() to replace everything else with a space
3. Strip the text - use .strip() - this removes leading and trailing spaces
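The three steps can be sketched on a single string as follows. `clean_text` is a hypothetical helper name used only for this illustration; the notebook itself applies the same operations to the whole dataframe.

```python
import re

def clean_text(text):
    """Lowercase, replace disallowed characters with spaces, trim the ends."""
    text = text.lower()
    # anything that is not a digit, a lowercase letter, a space, '#', '+' or '_' becomes a space
    text = re.sub(r"[^0-9a-z #+_]", " ", text)
    return text.strip()

cleaned = clean_text(".@wesley83 I have a 3G iPhone.")
print(cleaned)
```

Runs of spaces inside the text are not collapsed by `.strip()`. That is harmless here because the vectorizer used later splits on word boundaries.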

In [0]:
import re

In [9]:
tweets = tweets.astype(str).apply(lambda x: x.str.lower())
tweets.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 i have a 3g iphone. after 3 hrs twe...,iphone,negative emotion
1,@jessedee know about @fludapp ? awesome ipad/i...,ipad or iphone app,positive emotion
2,@swonderlin can not wait for #ipad 2 also. the...,ipad,positive emotion
3,@sxsw i hope this year's festival isn't as cra...,ipad or iphone app,negative emotion
4,@sxtxstate great stuff on fri #sxsw: marissa m...,google,positive emotion


In [10]:
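# lowercase every cell (a no-op here, since the previous cell already lowercased the data)
tweets = tweets.applymap(lambda s:s.lower())
# replace every character other than digits, lowercase letters, space, #, + and _ with a space
tweets = tweets.applymap(lambda s:re.sub("[^0-9a-z #+_]", " ",s))
# remove leading and trailing whitespace
tweets = tweets.applymap(lambda s:s.strip())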

tweets.head(10)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweet...,iphone,negative emotion
1,jessedee know about fludapp awesome ipad ip...,ipad or iphone app,positive emotion
2,swonderlin can not wait for #ipad 2 also they...,ipad,positive emotion
3,sxsw i hope this year s festival isn t as cras...,ipad or iphone app,negative emotion
4,sxtxstate great stuff on fri #sxsw marissa ma...,google,positive emotion
7,#sxsw is just starting #ctia is around the co...,android,positive emotion
8,beautifully smart and simple idea rt madebyma...,ipad or iphone app,positive emotion
9,counting down the days to #sxsw plus strong ca...,apple,positive emotion
10,excited to meet the samsungmobileus at #sxsw ...,android,positive emotion
11,find amp start impromptu parties at #sxsw wi...,android app,positive emotion


### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [11]:
tweets['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

positive emotion                      2672
negative emotion                       519
no emotion toward brand or product      91
i can t tell                             9
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [12]:
# Keep only the rows labelled positive or negative
label_col = 'is_there_an_emotion_directed_at_a_brand_or_product'
tweets_f = tweets[tweets[label_col].isin(['positive emotion', 'negative emotion'])]


# Print the shape of the dataframe 
print(tweets_f.shape) 

# Print the new dataframe 
tweets_f.head(15)

(3191, 3)


Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweet...,iphone,negative emotion
1,jessedee know about fludapp awesome ipad ip...,ipad or iphone app,positive emotion
2,swonderlin can not wait for #ipad 2 also they...,ipad,positive emotion
3,sxsw i hope this year s festival isn t as cras...,ipad or iphone app,negative emotion
4,sxtxstate great stuff on fri #sxsw marissa ma...,google,positive emotion
7,#sxsw is just starting #ctia is around the co...,android,positive emotion
8,beautifully smart and simple idea rt madebyma...,ipad or iphone app,positive emotion
9,counting down the days to #sxsw plus strong ca...,apple,positive emotion
10,excited to meet the samsungmobileus at #sxsw ...,android,positive emotion
11,find amp start impromptu parties at #sxsw wi...,android app,positive emotion


In [13]:
tweets_f['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

positive emotion    2672
negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [0]:
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [0]:
# define X and y

X = tweets_f['tweet_text'] #independent variable
y = tweets_f['is_there_an_emotion_directed_at_a_brand_or_product'] #target

# split the new DataFrame into training and testing sets [Default test size = 25%]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [0]:
# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [17]:
X_train_dtm.shape
X_test_dtm.shape

(798, 4885)
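The test matrix has as many columns as the training vocabulary because `transform` reuses the vocabulary learned by `fit_transform` on the training text; words that occur only in the test text are ignored. A small sketch with made-up sentences (not taken from tweets.csv):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, not taken from tweets.csv
train_docs = ["great phone", "bad battery"]
test_docs = ["great battery life"]  # 'life' never appears in the training text

vect_demo = CountVectorizer()
train_dtm = vect_demo.fit_transform(train_docs)  # learns the vocabulary and counts terms
test_dtm = vect_demo.transform(test_docs)        # counts terms using that same vocabulary

print(train_dtm.shape, test_dtm.shape)
print(sorted(vect_demo.vocabulary_))
```

Fitting on the training text only keeps information from the test set out of the features.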

In [18]:
# show vectorizer options
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

### 5. Find number of different words in vocabulary

In [19]:
# last 50 features
print(vect.get_feature_names()[-50:])

['xperia', 'xwave', 'ya', 'yall', 'yawn', 'yay', 'yea', 'yeah', 'year', 'years', 'yeay', 'yellow', 'yelp', 'yelping', 'yep', 'yes', 'yesterday', 'yet', 'yobongo', 'yonkers', 'york', 'you', 'youneedthis', 'your', 'yours', 'yourself', 'youtube', 'yr', 'yrs', 'yummy', 'zaarly', 'zaarlyiscoming', 'zagg', 'zaggle', 'zappos', 'zazzle', 'zazzlesxsw', 'zazzlsxsw', 'ze', 'zelda', 'zeldman', 'zero', 'zip', 'zite', 'zms', 'zombies', 'zomg', 'zone', 'zoom', 'zzzs']
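The number of different words is the length of the fitted vocabulary, which equals the number of columns in the document-term matrix (4885 above). The sketch below shows this on made-up sentences. The `hasattr` check is an assumption about the installed version: recent scikit-learn releases provide `get_feature_names_out` in place of `get_feature_names`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["yay for the new ipad", "the new ipad is yummy"]  # made-up sentences

vect_demo = CountVectorizer()
dtm = vect_demo.fit_transform(docs)

# vocabulary_ maps each term to its column index, so its length is the vocabulary size
vocab_size = len(vect_demo.vocabulary_)

# pick whichever accessor this scikit-learn version offers
if hasattr(vect_demo, "get_feature_names_out"):
    names = list(vect_demo.get_feature_names_out())
else:
    names = list(vect_demo.get_feature_names())

print(vocab_size, dtm.shape[1], names[-3:])
```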



#### Tip: To see all available attributes and methods of an object, use `dir`

In [39]:
dir(vect)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_sort_features',
 '_stop_words_id',
 '_validate_custom_analyzer',
 '_validate_params',
 '_validate_vocabulary',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',
 'fit_transform',
 'fixed_vocabulary_',
 'get_feature_names',
 'get_params',
 'get_stop_words',
 'input',
 'inverse_transf

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [23]:
tweets_f['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

positive emotion    2672
negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### 7. Change the labels for Positive and Negative emotions as 1 and 0 respectively and store in a different column in the same dataframe named 'Label'

Hint: use map on that column and give labels

In [0]:
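# map the text labels to '1' (positive) and '0' (negative); the new column is named 'emotions' and holds strings
tweets_f['emotions'] = tweets_f['is_there_an_emotion_directed_at_a_brand_or_product'].map({'positive emotion': '1', 'negative emotion':'0'})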

In [0]:
Label = tweets_f.copy()

In [26]:
Label.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,emotions
0,wesley83 i have a 3g iphone after 3 hrs tweet...,iphone,negative emotion,0
1,jessedee know about fludapp awesome ipad ip...,ipad or iphone app,positive emotion,1
2,swonderlin can not wait for #ipad 2 also they...,ipad,positive emotion,1
3,sxsw i hope this year s festival isn t as cras...,ipad or iphone app,negative emotion,0
4,sxtxstate great stuff on fri #sxsw marissa ma...,google,positive emotion,1
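A minimal sketch of the `map` hint on a made-up Series. Unlike the cell above, it maps to the integers 1 and 0 rather than the strings '1' and '0'. Either form works as a class label for scikit-learn classifiers, but integers are easier to sum or average.

```python
import pandas as pd

# Made-up label column with the two remaining classes
labels = pd.Series(["positive emotion", "negative emotion", "positive emotion"])

label_map = {"positive emotion": 1, "negative emotion": 0}
encoded = labels.map(label_map)

# map() returns NaN for values missing from the dictionary
unmapped = int(encoded.isna().sum())

print(encoded.tolist(), unmapped)
```

A non-zero `unmapped` count would indicate that the filtering step left a label other than the two expected ones.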


### 8. Define the feature set (independent variable or X) to be the `tweet_text` column and the `emotions` labels as the target (dependent variable or y), then split into train and test datasets

In [0]:
# define X and y

X = Label['tweet_text'] #independent variable
y = Label['emotions'] #target

# split the new DataFrame into training and testing sets [Default test size = 25%]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
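Because positive tweets outnumber negative ones by roughly five to one (2672 vs 519), a plain random split can give the test set a slightly different class mix than the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. A sketch on made-up labels:

```python
from sklearn.model_selection import train_test_split

# Made-up data: 80 positive (1) and 20 negative (0) examples
X_demo = [f"tweet {i}" for i in range(100)]
y_demo = [1] * 80 + [0] * 20

# stratify=y_demo keeps the 80/20 class ratio in both parts of the split
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1
)

print(len(yd_test), sum(yd_test))
```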

In [0]:
# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

### 9. Predicting the sentiment


#### Use Naive Bayes and Logistic Regression to predict the sentiment of the given text and compare their accuracy scores

In [29]:
# use Naive Bayes to predict the sentiment label
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy
print (metrics.accuracy_score(y_test, y_pred_class))

0.8483709273182958


In [30]:
# use logistic regression with all features
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))

0.8483709273182958
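With classes this imbalanced, accuracy is easier to judge against a majority-class baseline: always predicting 'positive' on the full filtered dataset would already score 2672 / 3191 ≈ 0.837. The sketch below, on made-up labels, uses scikit-learn's `DummyClassifier` to compute such a baseline and prints a confusion matrix. Neither is part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Made-up labels: 8 positive ('1') and 2 negative ('0'), as strings like the 'emotions' column
y_demo = np.array(["1"] * 8 + ["0"] * 2)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

baseline_acc = metrics.accuracy_score(y_demo, y_pred)
# rows are true classes, columns are predicted classes, in the order given by labels
cm = metrics.confusion_matrix(y_demo, y_pred, labels=["0", "1"])

print(baseline_acc)
print(cm)
```

The confusion matrix shows how many negative tweets are recognised as negative, which the accuracy score alone does not reveal.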


### 10. Create a function called `tokenize_test` which takes a count vectorizer object as input and prints the number of features and the accuracy for X (text) and y (labels)

In [0]:
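def tokenize_test(vect):
    """Fit `vect` on X_train, train Naive Bayes, and print feature count and test accuracy.

    Relies on the global X_train, X_test, y_train and y_test defined above.
    """
    # learn the vocabulary on the training text only
    x_train_dtm = vect.fit_transform(X_train)
    print('Features: ', x_train_dtm.shape[1])
    # reuse that vocabulary for the test text
    x_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))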

### 11. Create a count vectorizer with ngram_range = (1, 2) and pass it to the `tokenize_test` function to print the accuracy score

In [34]:
# include 1-grams and 2-grams, keep only terms that appear in at least 2 and at most 100 tweets, and record presence/absence (binary=True) instead of counts
vect = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=100, binary=True)
tokenize_test(vect)

Features:  7677
Accuracy:  0.8546365914786967
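With `ngram_range=(1, 2)` each tweet contributes its single words plus every pair of adjacent words, so a phrase such as "not good" can become a feature of its own. A quick way to inspect this is the vectorizer's analyzer, shown here on a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses to turn text into terms
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

terms = analyze("battery life is not good")
print(terms)
```

Five words yield five 1-grams and four 2-grams, which is why the feature count grows quickly unless it is limited with `min_df`, `max_df` or `max_features`.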


### 12. Create a count vectorizer with stop_words = 'english' and pass it to the `tokenize_test` function to print the accuracy score

In [35]:
# remove English stop words and report the resulting number of features
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

Features:  4647
Accuracy:  0.8571428571428571
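To see what `stop_words='english'` removes, the analyzers of two vectorizers can be compared on the same made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "the battery of this phone is amazing"  # made-up sentence

plain = CountVectorizer().build_analyzer()(text)
no_stop = CountVectorizer(stop_words="english").build_analyzer()(text)

print(plain)
print(no_stop)
```

The built-in list is generic, so it is worth inspecting it with `get_stop_words()` before relying on it for sentiment analysis, where short function words can carry meaning.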


### 13. Create a count vectorizer with stop_words = 'english' and max_features = 300 and pass it to the `tokenize_test` function to print the accuracy score

In [36]:
# remove English stop words and keep only the 300 most frequent terms
vect = CountVectorizer(stop_words='english',max_features =300 )
tokenize_test(vect)

Features:  300
Accuracy:  0.8095238095238095
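`max_features` keeps only the terms with the highest total count in the training text and discards the rest. The drop in accuracy above (from about 0.857 to 0.810) suggests that some of the discarded, rarer terms carried useful signal. A sketch on made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents: 'apple' occurs 4 times, 'banana' twice, 'cherry' once
docs = ["apple apple banana", "apple cherry", "banana apple"]

vect_demo = CountVectorizer(max_features=2)
dtm = vect_demo.fit_transform(docs)

kept = sorted(vect_demo.vocabulary_)
print(kept, dtm.shape)
```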


### 14. Create a count vectorizer with ngram_range = (1, 2) and max_features = 15000 and pass it to the `tokenize_test` function to print the accuracy score

In [37]:
# 1-grams and 2-grams with English stop words removed; note that this run caps the vocabulary at 1500 features, not 15000
vect = CountVectorizer(stop_words='english',max_features =1500,ngram_range=(1, 2) )
tokenize_test(vect)

Features:  1500
Accuracy:  0.8270676691729323


### 15. Create a count vectorizer with ngram_range = (1, 2) that includes only terms appearing in at least 2 tweets (min_df = 2) and pass it to the `tokenize_test` function to print the accuracy score

In [38]:
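# same settings as the previous cell (stop words removed, 1-grams and 2-grams, at most 1500 features), plus min_df=2 to drop terms that appear in only one tweet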
vect = CountVectorizer(stop_words='english',max_features =1500,ngram_range=(1, 2), min_df=2 )
tokenize_test(vect)

Features:  1500
Accuracy:  0.8308270676691729
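`min_df` counts documents, not occurrences: a term repeated several times inside a single tweet is still dropped when `min_df=2`. In the run above the feature count stays at 1500, which suggests the `max_features` cap is still the binding limit. A sketch on made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents: 'apple' is in 3 documents, 'banana' in 2, the rest in 1
docs = ["apple banana", "apple cherry cherry", "apple banana durian"]

vect_demo = CountVectorizer(min_df=2)
vect_demo.fit(docs)

kept = sorted(vect_demo.vocabulary_)
print(kept)
```

Here 'cherry' occurs twice but only in one document, so it is removed along with 'durian'.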
