<a href="https://colab.research.google.com/github/walraven/ETL-project/blob/master/TFIDF_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TF-IDF - BAG OF WORDS

A bag-of-words model treats a document as if all of its words were dumped into a bag and then counted. One major limitation of this method is that word order is discarded, so nuance, or any meaning implied by phrases and idiomatic expressions, is difficult to infer. Consider an example document written in natural language with slightly negative, but mostly ambiguous sentiment:
> *I can't say that I'm very satisfied with this item. The best thing about it was the price. The quality is not so good, but the color is beautiful. It doesn't have all the bells and whistles, but I know it will get you by!*

We can perform a rudimentary tokenization on the document:
>*whistles, doesn't, know, so, not, best, quality, can't, beautiful, but, item, bells, good, color, say, very, price, satisfied*

Given just the tokens, it is very difficult to judge intuitively whether the sentiment is positive or negative. Which words do "can't," "doesn't," and "not" apply to? The phrase "get you by" has been lost to stopword removal.
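
The tokenization above can be reproduced with a short sketch. The regular expression and the `STOPWORDS` set are assumptions made for illustration: the stopword list is hand-picked to match the token list shown and is not taken from any library.

```python
import re

review = (
    "I can't say that I'm very satisfied with this item. "
    "The best thing about it was the price. "
    "The quality is not so good, but the color is beautiful. "
    "It doesn't have all the bells and whistles, but I know it will get you by!"
)

# Hypothetical stopword list, hand-picked for this example only.
STOPWORDS = {
    "i", "i'm", "that", "with", "this", "the", "thing", "about", "it",
    "was", "is", "have", "all", "and", "will", "get", "you", "by",
}

# Lowercase the text, split it into words (keeping apostrophes), then drop
# stopwords and duplicates by collecting the remaining words into a set.
words = re.findall(r"[a-z']+", review.lower())
tokens = {w for w in words if w not in STOPWORDS}

print(sorted(tokens))
```

Because "get", "you", and "by" are all in the stopword set, the idiom "get you by" leaves no trace in the result.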

Here is another review with the same tokens:

>*So, I'm satisfied with this item. About the quality: it was very good. But the thing is, I know by the price you will not get all the bells and whistles. But, it doesn't have the best color, I can't say it is that beautiful.*

All the same words, but the sentiment is markedly different! By training a machine-learning model, however, we can predict sentiment with a fair amount of accuracy.
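
As a quick check of the claim that both reviews contain the same words, the sketch below counts the words of each review and compares the counts. The tokenizing regular expression and the helper name `bag_of_words` are assumptions for illustration.

```python
import re
from collections import Counter

review_a = (
    "I can't say that I'm very satisfied with this item. "
    "The best thing about it was the price. "
    "The quality is not so good, but the color is beautiful. "
    "It doesn't have all the bells and whistles, but I know it will get you by!"
)
review_b = (
    "So, I'm satisfied with this item. About the quality: it was very good. "
    "But the thing is, I know by the price you will not get all the bells "
    "and whistles. But, it doesn't have the best color, I can't say it is "
    "that beautiful."
)

def bag_of_words(text):
    """Count lowercase word occurrences, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

bag_a = bag_of_words(review_a)
bag_b = bag_of_words(review_b)

# The bags are compared as word counts, so word order plays no role.
print(bag_a == bag_b)
print(len(bag_a), "distinct words,", sum(bag_a.values()), "words in total")
```

A model that sees only these counts receives identical input for both reviews, which is why word order and phrasing cannot inform its prediction.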

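After tokenizing all documents, we compute the term frequency (tf) and inverse document frequency (idf) of each token. Each token is a feature, and its tf-idf weight is that feature's value in a document's vector. The feature vectors of the training dataset are used to train various machine-learning models to predict sentiment, and the test dataset is used to evaluate them.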
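
To make the weighting concrete, here is a minimal pure-Python sketch of tf-idf on a three-document toy corpus. The corpus and helper names are invented for illustration. The idf formula is the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default, followed by L2 normalization. The tokenization here is a plain whitespace split, not the vectorizer's own.

```python
import math
from collections import Counter

# Toy corpus, invented for this example.
docs = [
    "the quality is good",
    "the quality is not good",
    "the color is beautiful",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each token.
df = Counter(tok for doc in tokenized for tok in set(doc))

def idf(token):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + df[token])) + 1

def tfidf_vector(doc):
    # Raw term counts times idf, scaled so the vector has unit L2 norm.
    tf = Counter(doc)
    weights = {tok: count * idf(tok) for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {tok: round(w, 3) for tok, w in vec.items()})
```

Tokens that occur in every document, such as "the", get the minimum idf of 1, while a token that occurs in a single document, such as "not", is weighted more heavily.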

## Imports and Data Fetching

In [0]:
# imports
import pandas as pd
import numpy as np
import sklearn
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from scipy.stats import loguniform  # available in SciPy 1.4+; replaces the older sklearn.utils.fixes backport
from sklearn import metrics
from google.colab import drive

In [3]:
#get data
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
#decompress data
!tar -xvf '/content/drive/My Drive/Data-Camp/project3/amazon_review_polarity_csv.tar.gz'

amazon_review_polarity_csv/
amazon_review_polarity_csv/test.csv
amazon_review_polarity_csv/train.csv
amazon_review_polarity_csv/readme.txt


## Preprocess Train and Test Data

In [0]:
train_data = pd.read_csv('amazon_review_polarity_csv/train.csv', names=['label', 'title', 'review'])
test_data = pd.read_csv('amazon_review_polarity_csv/test.csv', names = ['label', 'title', 'review'])
label_names = ['negative', 'positive']

In [0]:
#necessary for logistic regression and random search below
tf_idf_vectorizer = TfidfVectorizer()
X_train_tf_idf = tf_idf_vectorizer.fit_transform(train_data.review)
X_test_tf_idf = tf_idf_vectorizer.transform(test_data.review)

In [7]:
#save this for use later
vectorizer_filename = 'fitted_vectorizer.sav'
joblib.dump(tf_idf_vectorizer, vectorizer_filename)

['fitted_vectorizer.sav']

## Model: LOGISTIC REGRESSION

In [0]:
#create model
lr_classifier_model = LogisticRegression(solver='sag', verbose=2, n_jobs=-1)

In [0]:
#train model with train data
lr_classifier_model.fit(X_train_tf_idf, train_data.label)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 40 concurrent workers.


convergence after 22 epochs took 198 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  3.3min finished


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=-1, penalty='l2',
                   random_state=None, solver='sag', tol=0.0001, verbose=2,
                   warm_start=False)

In [0]:
#predict with model, using test data
lr_predicted = lr_classifier_model.predict(X_test_tf_idf)

In [0]:
#view the results
print(f'accuracy : {np.mean(lr_predicted == test_data.label)}')
print(metrics.classification_report(test_data.label , lr_predicted, target_names=label_names))

accuracy : 0.8893175
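
The columns that `classification_report` prints (precision, recall, f1-score) can also be computed by hand. The sketch below does so for one class on a small set of made-up labels. The helper name `precision_recall_f1` and the toy data are assumptions for illustration, with 1 and 2 standing in for the negative and positive classes.

```python
def precision_recall_f1(actual, predicted, positive_label):
    """Compute precision, recall and F1 for one class from paired labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive_label and p == positive_label)
    fp = sum(1 for a, p in pairs if a != positive_label and p == positive_label)
    fn = sum(1 for a, p in pairs if a == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels for illustration: 1 = negative, 2 = positive.
actual    = [1, 1, 1, 1, 2, 2, 2, 2]
predicted = [1, 1, 1, 2, 2, 2, 1, 1]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision, recall, f1 = precision_recall_f1(actual, predicted, positive_label=2)

print(f'accuracy : {accuracy}')
print(f'positive : precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

Accuracy alone hides how errors are split between the classes. Precision and recall per class show whether a model tends to over-predict or under-predict a given label.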


## Model: NAIVE BAYES

In [0]:
from sklearn.naive_bayes import MultinomialNB
nb_classifier_model = Pipeline([('tfidf-vect', TfidfVectorizer()),
                     ('clf', MultinomialNB())],
                     verbose=True)
nb_classifier_model.fit(train_data.review, train_data.label)
nb_predicted = nb_classifier_model.predict(test_data.review)

[Pipeline] .............. (step 1 of 3) Processing vect, total= 3.3min
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=  26.6s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   2.6s


In [0]:
#count correct predictions and report accuracy
num_reviews = len(nb_predicted)
num_correct = int(np.sum(nb_predicted == test_data.label))
print(f'accuracy: {num_correct/num_reviews} :: ({num_correct}/{num_reviews})')
print(metrics.classification_report(test_data.label, nb_predicted, target_names=label_names))

accuracy: 0.82834
accuracy: 0.82909
accuracy: 0.8288666666666666
accuracy: 0.828535
accuracy: 0.827396
accuracy: 0.82657
accuracy: 0.8261057142857143
accuracy: 0.8256575
accuracy: 0.8256575 :: (330263/400000)


## Model: SUPPORT VECTOR MACHINE

In [0]:
#SGDClassifier with its default hinge loss fits a linear SVM
svm_classifier_model = Pipeline([('tfidf-vect', TfidfVectorizer()),
                     ('clf', SGDClassifier(n_jobs=-1))],
                     verbose=True)
svm_classifier_model.fit(train_data.review, train_data.label)
svm_predicted = svm_classifier_model.predict(test_data.review)
print(f'accuracy : {np.mean(svm_predicted == test_data.label)}')
print(metrics.classification_report(test_data.label, svm_predicted, target_names=label_names))

[Pipeline] ........ (step 1 of 2) Processing tfidf-vect, total= 3.6min
[Pipeline] ............... (step 2 of 2) Processing clf, total=  22.0s
accuracy : 0.86563
              precision    recall  f1-score   support

    negative       0.86      0.87      0.87    200000
    positive       0.87      0.86      0.87    200000

    accuracy                           0.87    400000
   macro avg       0.87      0.87      0.87    400000
weighted avg       0.87      0.87      0.87    400000



## Model: LOGISTIC REGRESSION - *Manual Tuning*

In [0]:
lr_lbfgs_model = LogisticRegression(solver='lbfgs', tol=0.0005, verbose=2, n_jobs=-1)
lr_lbfgs_model.fit(X_train_tf_idf, train_data.label)
lr_lbfgs_predicted = lr_lbfgs_model.predict(X_test_tf_idf)
print(f'accuracy : {np.mean(lr_lbfgs_predicted == test_data.label)}')
print(metrics.classification_report(test_data.label , lr_lbfgs_predicted, target_names=label_names))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  4.0min finished


accuracy : 0.8872875
              precision    recall  f1-score   support

    negative       0.89      0.89      0.89    200000
    positive       0.89      0.89      0.89    200000

    accuracy                           0.89    400000
   macro avg       0.89      0.89      0.89    400000
weighted avg       0.89      0.89      0.89    400000



This is close to the accuracy of the first logistic regression model (0.8873 versus 0.8893). Rather than tuning by hand, there is a faster way to find a better model: randomized search.

## Model Selection: RANDOMIZED SEARCH

In [0]:
#with many params, use randomized search
lr_model = LogisticRegression(n_jobs=-1)
lr_params = dict(C=uniform(loc=0,scale=3),
                 tol=loguniform(1e-5,1e-3),
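                 #note: uniform samples floats; newer scikit-learn releases may require an integer max_iter (e.g. scipy.stats.randint(100, 500))
                 max_iter=uniform(loc=100,scale=400),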
                 solver=['sag', 'lbfgs'])
lr_grid = RandomizedSearchCV(lr_model, lr_params,verbose=3, n_iter=10, n_jobs=8)
lr_search = lr_grid.fit(X_train_tf_idf, train_data.label)
print(lr_search.best_params_)
print(lr_search.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:  6.9min
[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed: 23.4min finished


{'C': 1.4896280135486135,
 'max_iter': 462.52287211168,
 'solver': 'sag',
 'tol': 0.0005473468337312529}

In [0]:
#make a slight tune to try to further increase accuracy
ideal_lr_model = LogisticRegression(n_jobs=-1, C=1.48963, max_iter = 463, solver='sag', tol=0.00055)
ideal_lr_model.fit(X_train_tf_idf, train_data.label)
ideal_lr_predictions = ideal_lr_model.predict(X_test_tf_idf)
print(f'accuracy : {np.mean(ideal_lr_predictions == test_data.label)}')
print(metrics.classification_report(test_data.label , ideal_lr_predictions, target_names=label_names))

accuracy : 0.88943
              precision    recall  f1-score   support

    negative       0.89      0.89      0.89    200000
    positive       0.89      0.89      0.89    200000

    accuracy                           0.89    400000
   macro avg       0.89      0.89      0.89    400000
weighted avg       0.89      0.89      0.89    400000



In [0]:
#this model performed even better! so we'll save it as our best log reg model
lr_filename = 'top_lr_model.sav'
joblib.dump(ideal_lr_model, lr_filename)

['top_lr_model.sav']

## Model Selection: GRID SEARCH

In [0]:
#svm: alpha, tol, max_iter
from sklearn.model_selection import GridSearchCV
svm_model = SGDClassifier(n_jobs=-1, early_stopping=True)
svm_params = dict(alpha=[1e-5, 5e-4, 1e-4, 5e-3, 1e-3],
               tol=[1e-4, 5e-3, 1e-3, 5e-2, 1e-2],
               max_iter=[1000, 1500, 2000, 2500])
svm_grid = GridSearchCV(estimator=svm_model, param_grid=svm_params, verbose=3, n_jobs=6)
svm_search = svm_grid.fit(X_train_tf_idf, train_data.label)
print(svm_search.best_params_)
print(svm_search.best_score_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  20 tasks      | elapsed:  1.2min
[Parallel(n_jobs=6)]: Done 116 tasks      | elapsed:  6.1min
[Parallel(n_jobs=6)]: Done 276 tasks      | elapsed: 14.3min
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed: 26.0min finished


{'alpha': 1e-05, 'max_iter': 2000, 'tol': 0.05}
0.8827786111111111
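
The "100 candidates, 500 fits" in the log follows directly from the size of the grid. A minimal sketch that enumerates the same parameter grid with `itertools.product` is shown below. The fold count of 5 is taken from the log line above, and the variable names are for illustration only.

```python
from itertools import product

svm_params = dict(alpha=[1e-5, 5e-4, 1e-4, 5e-3, 1e-3],
                  tol=[1e-4, 5e-3, 1e-3, 5e-2, 1e-2],
                  max_iter=[1000, 1500, 2000, 2500])

# A grid search tries every combination of the listed values.
names = list(svm_params)
candidates = [dict(zip(names, values))
              for values in product(*svm_params.values())]

n_folds = 5  # number of cross-validation folds reported in the log
total_fits = len(candidates) * n_folds

print(f'{len(candidates)} candidates, {total_fits} fits')
print(candidates[0])
```

Each added value multiplies the number of candidates, so the cost of an exhaustive grid grows quickly with the number of parameters being tuned.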


In [0]:
#try again with a slightly different grid
svm_2_params=dict(alpha=[1e-6, 5e-5, 1e-5, 5e-4, 1e-4],
                  tol=[5e-3, 1e-3, 5e-2, 1e-2,5e-1],
                  max_iter=[1500, 2000])
svm_2_grid = GridSearchCV(estimator=svm_model, param_grid=svm_2_params, verbose=3, n_jobs=6)
svm_2_search = svm_2_grid.fit(X_train_tf_idf, train_data.label)
print(svm_2_search.best_params_)
print(svm_2_search.best_score_)

Fitting 5 folds for each of 75 candidates, totalling 375 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  20 tasks      | elapsed:  1.2min
[Parallel(n_jobs=6)]: Done 116 tasks      | elapsed:  6.0min
[Parallel(n_jobs=6)]: Done 276 tasks      | elapsed: 14.0min
[Parallel(n_jobs=6)]: Done 375 out of 375 | elapsed: 19.0min finished


{'alpha': 1e-06, 'max_iter': 1500, 'tol': 0.5}
0.8875144444444445


In [0]:
#save this model as it is the best performing svm model
svm_grid_model_fn = 'ideal_svm_grid.sav'
joblib.dump(svm_2_grid, svm_grid_model_fn)

['ideal_svm_grid.sav']

## Model Selection: RANDOMIZED SEARCH with SGDClassifier

In [0]:
svm_rand_model = SGDClassifier(n_jobs=-1, early_stopping=True)
svm_rand_params = dict(alpha=loguniform(1e-7,1e-3),
                 tol=loguniform(1e-5,1e-1),
                 #note: uniform samples floats; newer scikit-learn releases may require an integer max_iter
                 max_iter=uniform(loc=1250,scale=1750),
                 )
svm_rand_grid = RandomizedSearchCV(svm_rand_model, svm_rand_params, n_iter=10, n_jobs=1)
svm_rand_search = svm_rand_grid.fit(X_train_tf_idf, train_data.label)
print(svm_rand_search.best_params_)
print(svm_rand_search.best_score_)

This model did not outperform the one produced by grid search. That is understandable: a randomized search evaluates only a random sample of parameter combinations (here, 10) rather than every candidate in a grid.
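
To illustrate what a randomized search samples, here is a minimal pure-Python sketch that draws 10 candidates from ranges matching the ones above. The helper `sample_loguniform`, the fixed seed, and the integer draw for `max_iter` are assumptions for illustration. This is not how `RandomizedSearchCV` is implemented internally.

```python
import math
import random

def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the draws are repeatable
n_iter = 10             # number of candidates, as in the search above

candidates = [
    {
        "alpha": sample_loguniform(rng, 1e-7, 1e-3),
        "tol": sample_loguniform(rng, 1e-5, 1e-1),
        # integer draw over 1250..3000, the range of uniform(loc=1250, scale=1750)
        "max_iter": rng.randint(1250, 3000),
    }
    for _ in range(n_iter)
]

for params in candidates:
    print(params)
```

A log-uniform distribution spreads the draws evenly across orders of magnitude, which suits parameters such as `alpha` and `tol`. With only 10 draws, however, large parts of the space are never visited.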