<h2 align="center">A Sentiment Analysis Case Study</h2>

### Introduction
___

- Twitter reviews dataset
- Contains 22275 reviews (positive and negative)
- Contains at most reviews per movie
- 70/30 train/test split
- Evaluation metric: accuracy

<b>Model: Logistic regression</b>
- $p(y = 1|x) = \sigma(w^{T}x)$
- Linear classification model
- Can handle sparse data
- Fast to train
- Weights can be interpreted
<img src="https://i.imgur.com/VieM41f.png" align="center" width=500 height=500>
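
The formula above turns a weighted sum of features into a probability. A minimal pure-Python sketch of that computation follows. The weights and the feature vector are made-up illustrative values, not weights learned in this notebook, and the bias term is left out to match the formula as written.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """p(y = 1 | x) = sigmoid(w^T x) for a single feature vector x."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return sigmoid(score)

# Hypothetical weights for a 3-word vocabulary and one count vector
w = [1.5, -2.0, 0.0]
x = [1, 0, 3]

p = predict_proba(w, x)
label = int(p >= 0.5)   # threshold the probability at 0.5 to get a class
print(p, label)
```

A word with a positive weight raises the probability of class 1 each time it occurs, a negative weight lowers it, and a zero weight leaves it unchanged, which is what makes the weights readable.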

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk 
import re 
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps = PorterStemmer()

## Importing Dataset

In [2]:
df = pd.read_csv('tweet.csv')

In [3]:
df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [4]:
df.shape

(22275, 3)

<h2 align="center">Bag of words / Bag of N-grams model</h2>

### Transforming documents into feature vectors

Below, we will call the fit_transform method on CountVectorizer. This will construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two
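
To make the transformation concrete, here is a rough pure-Python sketch of what a bag-of-words vectorizer does with these three sentences. It assumes lowercasing and tokens of two or more word characters, which mirrors CountVectorizer's default settings; it is an illustration of the idea, not the library's implementation.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

# Tokens are runs of two or more word characters, after lowercasing
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN.findall(doc.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(vocab)}

def vectorize(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocab]

bag = [vectorize(doc) for doc in docs]
print(index)
for row in bag:
    print(row)
```

Each row has one column per vocabulary word, and word order within the sentence is discarded. Only the counts survive.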

### Example of CountVectorizer

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()

docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [6]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [7]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
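
The example above counts single words only. For a bag of n-grams, CountVectorizer accepts an `ngram_range` parameter. The sketch below uses two of the sentences and asks for unigrams plus bigrams; the variable names are illustrative and this setting is not used elsewhere in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['The sun is shining',
        'The weather is sweet']

# ngram_range=(1, 2) keeps single words and adds pairs of adjacent words
bigram_count = CountVectorizer(ngram_range=(1, 2))
bigram_bag = bigram_count.fit_transform(docs)

# List the terms in column order
terms = sorted(bigram_count.vocabulary_, key=bigram_count.vocabulary_.get)
print(terms)
print(bigram_bag.toarray())
```

Bigrams such as `the sun` keep a little local word order, at the cost of a larger vocabulary.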


## Data Preparation (for first review)

In [8]:
text = df['tweet'][0]

In [9]:
text

' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run'

### Cleaning first review

In [10]:
text = re.sub('[^a-zA-Z]',' ',text)

In [11]:
text

'  user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction     run'

In [12]:
text = text.lower()
text = text.split()

In [13]:
text

['user',
 'when',
 'a',
 'father',
 'is',
 'dysfunctional',
 'and',
 'is',
 'so',
 'selfish',
 'he',
 'drags',
 'his',
 'kids',
 'into',
 'his',
 'dysfunction',
 'run']

In [14]:
stop_words = set(stopwords.words('english'))  # build the stop-word set once
text = [ps.stem(word) for word in text if word not in stop_words]

In [15]:
text

['user', 'father', 'dysfunct', 'selfish', 'drag', 'kid', 'dysfunct', 'run']

In [16]:
text = ' '.join(text)

In [17]:
text

'user father dysfunct selfish drag kid dysfunct run'
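
The cleaning steps shown for the first review can be collected into one helper. The sketch below is one possible way to do that. The function name `clean_tweet`, its parameters, and the tiny stop-word set are assumptions made so the example runs without NLTK data, and stemming is skipped by default.

```python
import re

def clean_tweet(text, stop_words, stem=lambda word: word):
    """Apply the cleaning steps to one tweet and return a single string."""
    text = re.sub('[^a-zA-Z]', ' ', text)   # replace non-letters with spaces
    tokens = text.lower().split()           # lowercase, split on whitespace
    tokens = [stem(tok) for tok in tokens if tok not in stop_words]
    return ' '.join(tokens)

# Tiny hypothetical stop-word set, only for this demonstration
demo_stop_words = {'when', 'a', 'is', 'and', 'so', 'he', 'his', 'into'}

raw = (' @user when a father is dysfunctional and is so selfish '
       'he drags his kids into his dysfunction.   #run')
cleaned = clean_tweet(raw, demo_stop_words)
print(cleaned)
```

In the notebook itself, a call such as `clean_tweet(raw, set(stopwords.words('english')), ps.stem)` would apply the NLTK stop-word list and the Porter stemmer used in the cells above.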

# Data Preparation  (for all reviews)

In [18]:
df.iloc[:,1:3]

Unnamed: 0,label,tweet
0,0,@user when a father is dysfunctional and is s...
1,0,@user @user thanks for #lyft credit i can't us...
2,0,bihday your majesty
3,0,#model i love u take with u all the time in ...
4,0,factsguide: society now #motivation
...,...,...
22270,0,ascot times with this babe â¤ï¸â¤ï¸ #ascot...
22271,0,happy monday #positivity #monday
22272,1,you're running out of #hater bitches to #whit...
22273,0,@user @user babe wtf i thot we were gonna do t...


In [19]:
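stop_words = set(stopwords.words('english'))  # build once, not once per word
clean_review = []
for tweet in df['tweet']:                     # iterate over every tweet instead of a hard-coded count
    txt = re.sub('[^a-zA-Z]', ' ', tweet)     # keep letters only
    txt = txt.lower().split()                 # lowercase and split on whitespace
    txt = [ps.stem(word) for word in txt if word not in stop_words]  # drop stop words, stem the rest
    clean_review.append(' '.join(txt))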

### Using CountVectorizer

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=500)

In [21]:
X = cv.fit_transform(clean_review)

In [22]:
y = df['label'].values

In [23]:
y

array([0, 0, 0, ..., 1, 0, 0], dtype=int64)

# Splitting dataset into train and test sets

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [25]:
print('Shape of X_train:', X_train.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_test:', y_test.shape)

Shape of X_train: (15592, 500)
Shape of y_train: (15592,)
Shape of X_test: (6683, 500)
Shape of y_test: (6683,)
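
The split above is not seeded, so the exact rows in each part (and every score that follows) can change from run to run. A possible variant, sketched here on toy data, fixes `random_state` and passes `stratify` so both parts keep the same class ratio. Neither argument is used in the notebook, and the toy arrays are stand-ins for the real `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 samples, 10 of them labelled 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    random_state=42,    # fixed seed: the same split on every run
    stratify=y_demo,    # keep the class ratio equal in both parts
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```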


# Document Classification using LogisticRegression

In [26]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [27]:
logreg = logreg.fit(X_train, y_train)
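
The introduction notes that logistic regression weights can be interpreted. One way to look at them is to pair each vocabulary term with its coefficient and sort. The sketch below does this on a small made-up corpus; the documents, labels, and variable names are hypothetical and chosen only so the example is self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (hypothetical); label 1 marks the 'hate'/'angry' documents
toy_docs = ['love happy', 'happy day', 'love day',
            'hate angry', 'angry day', 'hate day']
toy_labels = [0, 0, 0, 1, 1, 1]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_docs)
toy_model = LogisticRegression().fit(toy_X, toy_labels)

# Terms in column order, paired with their learned weights
terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
weights = dict(zip(terms, toy_model.coef_[0]))

# The largest positive weights push a prediction toward label 1
top_terms = sorted(weights, key=weights.get, reverse=True)[:2]
print(top_terms)
```

The same pattern should carry over to the fitted objects in this notebook by substituting `cv` and `logreg` for the toy vectorizer and model.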

## Predicting Values

In [28]:
y_pred_test = logreg.predict(X_test)
y_pred_train = logreg.predict(X_train)

In [29]:
y_pred_train

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [30]:
y_pred_test

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

### Accuracy Score on Whole Dataset

In [31]:
logreg.score(X,y)

0.9471604938271605

### Accuracy Score on Train Dataset

In [32]:
logreg.score(X_train,y_train)

0.9493971267316572

### Accuracy Score on Test Dataset

In [33]:
logreg.score(X_test,y_test)

0.9419422415083046

## Model Performance

In [34]:
from sklearn.metrics import confusion_matrix

In [35]:
confusion_matrix(y_test, y_pred_test)

array([[6145,   55],
       [ 333,  150]], dtype=int64)
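
The confusion matrix can be read directly: rows are true labels and columns are predicted labels. The short sketch below recomputes the test accuracy from those four counts and compares it with the accuracy of always predicting label 0, which is a useful reference point because label 1 is rare in the test set.

```python
# Confusion matrix from the cell above: rows = true class, columns = predicted class
cm = [[6145, 55],
      [333, 150]]

total = sum(sum(row) for row in cm)

# Correct predictions sit on the diagonal
accuracy = (cm[0][0] + cm[1][1]) / total

# Accuracy of always predicting the majority class (label 0)
baseline = sum(cm[0]) / total

# Share of true label-1 tweets that the model actually finds
recall_pos = cm[1][1] / sum(cm[1])

print(round(accuracy, 4), round(baseline, 4), round(recall_pos, 4))
```

The accuracy is only slightly above the majority-class baseline, and most label-1 tweets are missed, so accuracy alone gives an optimistic picture on this imbalanced data.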

In [36]:
from sklearn.metrics import precision_score, recall_score, f1_score

### Precision, Recall, and F1 (micro average)

In [37]:
p = precision_score(y_test, y_pred_test, average='micro')
r = recall_score(y_test, y_pred_test, average='micro') 
f = f1_score(y_test, y_pred_test, average='micro')
print(p, r, f)

0.9419422415083046 0.9419422415083046 0.9419422415083046


### Precision, Recall, and F1 (macro average)

In [38]:
p1 = precision_score(y_test, y_pred_test, average='macro')
r1 = recall_score(y_test, y_pred_test, average='macro') 
f1 = f1_score(y_test, y_pred_test, average='macro')
print(p1, r1, f1)

0.8401512812596481 0.6508440192346223 0.7027211576912211
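
The gap between the micro and macro numbers follows from how each average is formed. The sketch below recomputes both from the confusion matrix reported earlier, in plain Python; the helper name `per_class_scores` is illustrative.

```python
cm = [[6145, 55],
      [333, 150]]   # rows = true class, columns = predicted class

def per_class_scores(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum
    actual_k = sum(cm[k])                     # row sum
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

scores = [per_class_scores(cm, k) for k in range(len(cm))]

# Macro average: unweighted mean of the per-class values
macro_p, macro_r, macro_f1 = (sum(col) / len(scores) for col in zip(*scores))

# Micro average: pool every decision; with one label per tweet this equals accuracy
micro = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

print(scores)
print(macro_p, macro_r, macro_f1, micro)
```

Micro averaging is dominated by the large label-0 class, which is why precision, recall, and F1 all match the test accuracy. Macro averaging gives both classes equal weight, so the weak recall on label 1 pulls the averages down.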


# Highest Accuracy: 94.559%