## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we classify whether the user's sentiment toward a particular mobile device is positive or negative.

In [1]:
# importing libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings('ignore')

### 1. Read the dataset (tweets.csv) and drop the rows with missing values (NAs)

In [2]:
data = pd.read_csv("tweets.csv",encoding = 'unicode_escape')

In [3]:
data = data.dropna()
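As a minimal sketch of the read-and-drop step, the two cells above can be chained into one expression. The three-row CSV below is a hypothetical stand-in for `tweets.csv`, and `toy` is an illustrative name, not part of the notebook.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for tweets.csv; the second row has no product label.
csv_text = (
    "tweet_text,emotion_in_tweet_is_directed_at,"
    "is_there_an_emotion_directed_at_a_brand_or_product\n"
    "Love my new iPad,iPad,Positive emotion\n"
    "Heading to the keynote,,Made-up neutral label\n"
    "Battery died again,iPhone,Negative emotion\n"
)

# Read the file and drop incomplete rows in one chained expression.
toy = pd.read_csv(io.StringIO(csv_text)).dropna()
print(toy.shape)
print(list(toy.index))
```

`dropna()` keeps the original row labels, which is why the index of `data` has gaps (for example, it jumps from 4 to 7 in the `data.text` output further down).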

In [4]:
data["tweet_text"][0]

'.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.'

In [5]:
data.shape

(3291, 3)

In [6]:
data.sample(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
1355,@mention - #Apple is the classiest fascist com...,Apple,Negative emotion
6218,RT @mention Just saw someone take a picture wi...,iPad,Negative emotion
2967,Pop up @mention store in Austin. Brilliant. ...,Apple,Positive emotion
4003,Can't. Take. Hands. Off. iPhone. Even when it...,iPhone,Positive emotion
4042,New Social Network may launch 2day! &quot;Goog...,Other Google product or service,Positive emotion


### 2. Preprocess the text and store the result in a new column named `text` in the dataframe.

In [7]:
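import re
def preprocess(text):
    # Replace every character that is not a letter or digit with a space.
    try:
        return re.sub("[^a-zA-Z0-9]", " ", text)
    except TypeError:
        # Non-string input (for example a missing value) becomes an empty string.
        return ""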

In [8]:
data['text'] = [preprocess(text) for text in data.tweet_text]
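To make the effect of the regular expression concrete, here is a small sketch that applies the same substitution to a single made-up tweet. The sample string is an assumption for illustration, and the whitespace-collapsing line is an optional extra that the notebook does not perform.

```python
import re

def preprocess(text):
    # Replace every character that is not a letter or digit with a space.
    return re.sub("[^a-zA-Z0-9]", " ", text)

sample = "@sxsw I hope this year's festival isn't as crashy!"  # made-up example
cleaned = preprocess(sample)
print(repr(cleaned))

# Optional extra step (not in the notebook): collapse runs of spaces.
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```

Note that apostrophes are replaced too, so contractions such as "isn't" are split into "isn" and "t".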

In [9]:
data.shape

(3291, 4)

In [10]:
data.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,wesley83 I have a 3G iPhone After 3 hrs twe...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,jessedee Know about fludapp Awesome iPad i...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,swonderlin Can not wait for iPad 2 also The...
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,sxsw I hope this year s festival isn t as cra...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,sxtxstate great stuff on Fri SXSW Marissa M...


In [11]:
data.text

0         wesley83 I have a 3G iPhone  After 3 hrs twe...
1        jessedee Know about  fludapp   Awesome iPad i...
2        swonderlin Can not wait for  iPad 2 also  The...
3        sxsw I hope this year s festival isn t as cra...
4        sxtxstate great stuff on Fri  SXSW  Marissa M...
7        SXSW is just starting   CTIA is around the co...
8       Beautifully smart and simple idea RT  madebyma...
9       Counting down the days to  sxsw plus strong Ca...
10      Excited to meet the  samsungmobileus at  sxsw ...
11      Find  amp  Start Impromptu Parties at  SXSW Wi...
12      Foursquare ups the game  just in time for  SXS...
13      Gotta love this  SXSW Google Calendar featurin...
14      Great  sxsw ipad app from  madebymany  http   ...
15      haha  awesomely rad iPad app by  madebymany ht...
17      I just noticed DST is coming this weekend  How...
18      Just added my  SXSW flights to  planely  Match...
19      Must have  SXSW app  RT  malbonster  Lovely re...
20      Need t

### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [12]:
data = data[data["is_there_an_emotion_directed_at_a_brand_or_product"].isin(["Positive emotion","Negative emotion"])]
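The filter above can be sketched on a toy frame as follows. The third label is a made-up placeholder for any category other than the two of interest, and adding `.copy()` is an optional choice: it produces an independent frame, so later column assignments cannot trigger pandas' `SettingWithCopyWarning` (which this notebook would hide anyway, since warnings are suppressed).

```python
import pandas as pd

col = "is_there_an_emotion_directed_at_a_brand_or_product"
toy = pd.DataFrame({
    "text": ["great app", "it crashed", "not sure"],
    # The third label is a made-up placeholder for any non-binary category.
    col: ["Positive emotion", "Negative emotion", "Some other label"],
})

keep = ["Positive emotion", "Negative emotion"]
# Boolean mask from isin(), then an explicit copy of the selected rows.
filtered = toy[toy[col].isin(keep)].copy()
print(filtered[col].unique())
```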

In [13]:
data.shape

(3191, 4)

In [14]:
data.sample(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
6912,RT @mention We've got some exciting things tha...,iPhone,Positive emotion,RT mention We ve got some exciting things tha...
4433,My Verizon iPhone is kicking hairy butts at SX...,iPhone,Positive emotion,My Verizon iPhone is kicking hairy butts at SX...
4284,@mention massive lines at #sxsw apple store......,Apple,Positive emotion,mention massive lines at sxsw apple store ...
7622,Temperature going up :) RT @mention @mention ...,Google,Positive emotion,Temperature going up RT mention mention ...
7947,Robot wars. Steampunk time machines. Mystery b...,Google,Positive emotion,Robot wars Steampunk time machines Mystery b...


### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
# create the vectorizer
vect = CountVectorizer()
# tokenize, build the vocabulary, and encode the documents in one step
vector = vect.fit_transform(data.text)
# summarize the vocabulary (token -> column index)
print(vect.vocabulary_)
# summarize the encoded matrix (uncomment to inspect)
#print(vector.shape)
#print(type(vector))
#print(vector.toarray())
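A minimal sketch of what `fit_transform` returns, using a three-document toy corpus (the documents and the names `toy_vect` and `dtm` are assumptions for illustration). With default settings the vectorizer lowercases the text and keeps tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["iPad app is great", "great great iPhone", "iPad battery died"]
toy_vect = CountVectorizer()
dtm = toy_vect.fit_transform(docs)

# vocabulary_ maps each lowercased token to its column index (alphabetical order).
print(sorted(toy_vect.vocabulary_.items(), key=lambda kv: kv[1]))
# Rows are documents, columns are vocabulary terms, values are counts.
print(dtm.toarray())
```

The second row has a 2 in the `great` column because that word occurs twice in the second document.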



### 5. Find number of different words in vocabulary

In [16]:
print(dir(vect.vocabulary_))

['__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']


In [17]:
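vocab = vect.get_feature_names()  # use get_feature_names_out() on scikit-learn 1.2 and later
print(len(vocab))  # number of different words in the vocabulary
print(vocab)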






#### Tip: To see all available attributes and methods of an object, use `dir`

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [18]:
data["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
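The counts show a strong class imbalance. As a rough sketch, the share of each class and the accuracy of a trivial "always predict the majority class" baseline can be computed from the counts printed above; this puts the accuracy scores later in the notebook in context. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"Positive emotion": 2672, "Negative emotion": 519})

share = counts / counts.sum()
# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = share.max()
print(share.round(3))
print(round(baseline_accuracy, 3))
```

On the notebook's own dataframe, `value_counts(normalize=True)` gives the same shares directly.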

### 7. Change the labels for Positive and Negative emotions as 1 and 0 respectively and store in a different column in the same dataframe named 'Label'

Hint: use map on that column and give labels

In [19]:
data["is_there_an_emotion_directed_at_a_brand_or_product"] = data["is_there_an_emotion_directed_at_a_brand_or_product"].replace(['Positive emotion','Negative emotion'],[1,0])
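The cell above overwrites the original column using `replace`. A sketch of the alternative suggested by the hint, writing the 1/0 values to a separate `Label` column with `map`, is shown below on a toy frame. The rest of the notebook keeps using the original column, so this is an illustration rather than a required change.

```python
import pandas as pd

col = "is_there_an_emotion_directed_at_a_brand_or_product"
toy = pd.DataFrame({col: ["Positive emotion", "Negative emotion", "Positive emotion"]})

# map() with a dict leaves the original column intact and writes 1/0 to a new one.
toy["Label"] = toy[col].map({"Positive emotion": 1, "Negative emotion": 0})
print(toy["Label"].tolist())
```

Values missing from the dictionary would become `NaN` under `map`, which makes unexpected categories easy to spot.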

### 8. Define the feature set (independent variable, X) to be the `text` column and the labels as the target (dependent variable, Y), then divide into train and test datasets

In [20]:
X = data['text']
Y = data['is_there_an_emotion_directed_at_a_brand_or_product']

In [21]:
X.shape
Y.shape

(3191,)

(3191,)

In [22]:
X = vector
Y = data['is_there_an_emotion_directed_at_a_brand_or_product']
X.shape
Y.shape
vector

(3191, 5600)

(3191,)

<3191x5600 sparse matrix of type '<class 'numpy.int64'>'
	with 53151 stored elements in Compressed Sparse Row format>

### 9. Predicting the sentiment


#### Use Naive Bayes and Logistic Regression, and compare their accuracy scores, to predict the sentiment of the given text

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
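Because the classes are imbalanced, a plain random split can leave the test set with a different class ratio than the full data. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical labels (16 positives, 4 negatives) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 16 positives, 4 negatives.
y = np.array([1] * 16 + [0] * 4)
X = np.arange(20).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
# Both splits keep the 80/20 class ratio of the full label vector.
print(y_tr.mean(), y_te.mean())
```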

In [25]:
X_train=X_train.toarray()
X_test=X_test.toarray()
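The dense conversion above is needed for `GaussianNB`, which does not accept sparse input. Estimators such as `LogisticRegression` and `MultinomialNB` can be fit on the sparse matrix directly, which avoids building a 3191 x 5600 dense array. A minimal sketch on made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up documents and labels, for illustration only.
docs = ["love this ipad", "hate this phone", "love the app", "hate the battery"]
labels = [1, 0, 1, 0]

dtm = CountVectorizer().fit_transform(docs)   # scipy sparse matrix
clf = LogisticRegression().fit(dtm, labels)   # no .toarray() needed here
pred = clf.predict(dtm)
print(pred)
```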

In [26]:
# Naive Bayes
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, Y_train)
model.score(X_train, Y_train)  # accuracy on the training set
model.score(X_test, Y_test)    # accuracy on the test set
test_pred = model.predict(X_test)
print(metrics.classification_report(Y_test, test_pred))
print(metrics.confusion_matrix(Y_test, test_pred))

GaussianNB(priors=None, var_smoothing=1e-09)

0.9520824003582624

0.767223382045929

              precision    recall  f1-score   support

           0       0.39      0.44      0.41       179
           1       0.87      0.84      0.85       779

   micro avg       0.77      0.77      0.77       958
   macro avg       0.63      0.64      0.63       958
weighted avg       0.78      0.77      0.77       958

[[ 78 101]
 [122 657]]
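The numbers in the classification report can be recomputed from the confusion matrix printed above. The following sketch does the arithmetic with NumPy; in scikit-learn's convention, rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes (0, 1), columns are predictions.
cm = np.array([[78, 101],
               [122, 657]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / column totals
recall = np.diag(cm) / cm.sum(axis=1)     # correct predictions / row totals

print(round(accuracy, 4))
print(precision.round(2), recall.round(2))
```

The row totals (179 and 779) match the `support` column of the report, and the accuracy equals the test score of about 0.767.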


In [27]:
# Logistic regression
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression()
model_LR.fit(X_train, Y_train)

model_LR.score(X_train, Y_train)  # accuracy on the training set
model_LR.score(X_test, Y_test)    # accuracy on the test set

test_pred = model_LR.predict(X_test)
print(metrics.classification_report(Y_test, test_pred))
print(metrics.confusion_matrix(Y_test, test_pred))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

0.9793999104343932

0.8549060542797495

              precision    recall  f1-score   support

           0       0.73      0.35      0.48       179
           1       0.87      0.97      0.92       779

   micro avg       0.85      0.85      0.85       958
   macro avg       0.80      0.66      0.70       958
weighted avg       0.84      0.85      0.83       958

[[ 63 116]
 [ 23 756]]


### 10. Create a function called `tokenize_predict` that takes a count vectorizer object as input and prints the accuracy for x (text) and y (labels)

In [28]:
X = data.text
Y = data['is_there_an_emotion_directed_at_a_brand_or_product']

In [29]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

In [30]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
def tokenize_test(vect):
    # learn the vocabulary from the training text only and encode it
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    # encode the test text with the vocabulary learned above
    x_test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [31]:
def tokenize_predict(vect):
    # thin wrapper so the function name matches the task statement
    return tokenize_test(vect)

In [32]:
tokenize_predict(vect)

Features:  4707
Accuracy:  0.8423799582463466
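The function above reads `x_train`, `x_test`, `y_train` and `y_test` from the notebook's global scope. A self-contained variant that takes the splits as arguments might look like the sketch below; the name `tokenize_predict_xy` and the toy data are assumptions for illustration, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def tokenize_predict_xy(vect, x_train, x_test, y_train, y_test):
    """Hypothetical variant that takes the data splits as arguments."""
    x_train_dtm = vect.fit_transform(x_train)  # learn vocabulary on training text only
    x_test_dtm = vect.transform(x_test)        # reuse that vocabulary for test text
    nb = MultinomialNB().fit(x_train_dtm, y_train)
    acc = accuracy_score(y_test, nb.predict(x_test_dtm))
    print("Features: ", x_train_dtm.shape[1])
    print("Accuracy: ", acc)
    return acc

# Made-up data, small enough to check by hand.
train_text = ["love the ipad", "great app", "hate the battery", "awful crash"]
train_y = [1, 1, 0, 0]
test_text = ["great ipad", "awful battery"]
test_y = [1, 0]
acc = tokenize_predict_xy(CountVectorizer(), train_text, test_text, train_y, test_y)
```

Fitting the vectorizer on the training text only, as here and in the notebook's function, keeps test-set words out of the vocabulary.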


### 11. Create a count vectorizer with `ngram_range=(1, 2)` and pass it to the `tokenize_predict` function to print the accuracy score

In [33]:

vect1 = CountVectorizer(ngram_range=(1,2))
tokenize_predict(vect1)

Features:  23605
Accuracy:  0.8496868475991649
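The jump in feature count comes from adding every pair of adjacent tokens to the vocabulary. A small sketch on one made-up sentence shows which terms `ngram_range=(1, 2)` produces (`bigram_vect` is an illustrative name):

```python
from sklearn.feature_extraction.text import CountVectorizer

bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_vect.fit(["not a good phone"])

# Single-character tokens such as "a" are dropped by the default token pattern,
# so bigrams are built from the remaining tokens: not, good, phone.
terms = sorted(bigram_vect.vocabulary_)
print(terms)
```

Bigrams such as "not good" can carry sentiment that the individual words do not.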


### 12. Create a count vectorizer with `stop_words='english'` and pass it to the `tokenize_predict` function to print the accuracy score

In [34]:
vect2 = CountVectorizer(stop_words='english')
tokenize_predict(vect2)

Features:  4471
Accuracy:  0.8423799582463466
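To see what the stop-word filter removes, the sketch below compares the vocabulary with and without `stop_words='english'` on one made-up sentence. scikit-learn's built-in English list includes negations such as "not", which may matter for sentiment tasks.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["the ipad is not good"]  # made-up example
plain = sorted(CountVectorizer().fit(doc).vocabulary_)
no_stop = sorted(CountVectorizer(stop_words="english").fit(doc).vocabulary_)
print(plain)
print(no_stop)
```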


### 13. Create a count vectorizer with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_predict` function to print the accuracy score

In [35]:
vect3 = CountVectorizer(stop_words='english', max_features=300)
tokenize_predict(vect3)

Features:  300
Accuracy:  0.8100208768267223
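`max_features` keeps only the terms with the highest total count across the corpus. A minimal sketch with made-up documents and `max_features=2`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Total counts: ipad 3, launch 2, queue 1.
docs = ["ipad ipad launch", "ipad launch", "queue"]
top2 = CountVectorizer(max_features=2).fit(docs)
kept = sorted(top2.vocabulary_)
print(kept)
```

The least frequent term, "queue", is dropped from the vocabulary.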


### 14. Create a count vectorizer with `ngram_range=(1, 2)` and `max_features=15000` and pass it to the `tokenize_predict` function to print the accuracy score

In [36]:
vect4 = CountVectorizer(ngram_range=(1,2), max_features=15000)
tokenize_predict(vect4)

Features:  15000
Accuracy:  0.8465553235908142


### 15. Create a count vectorizer with `ngram_range=(1, 2)` that keeps only terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_predict` function to print the accuracy score

In [37]:
vect5 = CountVectorizer(ngram_range=(1,2), min_df=2)
tokenize_predict(vect5)

Features:  7230
Accuracy:  0.8507306889352818
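`min_df` is a document-frequency threshold: an integer value counts the number of documents a term appears in, not its total number of occurrences. The sketch below, on made-up documents, shows a term that occurs twice in a single document still being dropped by `min_df=2`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# "battery" occurs twice, but only in one document, so its document frequency is 1.
docs = ["ipad launch today", "ipad launch queue", "ipad ipad battery battery"]
vect_min2 = CountVectorizer(min_df=2).fit(docs)
kept = sorted(vect_min2.vocabulary_)
print(kept)
```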
