# **<font color=green>Accenture - Sentiment Analysis</font>**
    

## <font color=purple>Installing scikit-learn</font>


**Verify installation:** *(run on the terminal)*<br>
python -m pip show scikit-learn &nbsp; <font color=green>*# to see which version and where scikit-learn is installed*</font><br>
python -m pip freeze &nbsp; <font color=green>*#to see all packages installed in the active virtualenv*</font><br>
python -c "import sklearn; sklearn.show_versions()"
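
The same check can also be done from inside Python. The snippet below is only a sketch: it relies on the standard-library `importlib.metadata` module (Python 3.8+), and `installed_version` is a hypothetical helper name introduced here for illustration, not part of any of the libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as used by pip, not the import names
for name in ["scikit-learn", "pandas", "matplotlib"]:
    print(name, installed_version(name) or "not installed")
```

A package reported as "not installed" can then be installed with the `pip install` cells below.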

In [None]:
pip install --upgrade pip

In [None]:
pip install -U scikit-learn

In [None]:
print('Hello World!')

**Verify whether SciPy is already installed** -- *(python -m pip show scipy)* -- **if not, then:**

In [None]:
pip install -U scipy

In [None]:
pip install -U matplotlib

## <font color=purple>Download and Load Dataset</font>
A dataset of 50,000 movie reviews taken from IMDb.


**Download:** https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews <br>
**1.** Create a folder in the project root named "DataSet"<br>
**2.** Move the file into the folder

**Pandas DataFrame:** https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html <br>

In [None]:
import pandas as pd
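# Show the full review text instead of truncating long columns
# (None means "no limit"; the old value -1 is rejected by recent pandas versions)
pd.set_option('display.max_colwidth', None)

dataset_path = 'DataSet/'
df = pd.read_csv(dataset_path + 'IMDB Dataset.csv', sep=',', header=0)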
df.head(2)

## <font color=purple>Clean and Preprocess</font>
**Remove special characters**<br>
We define the special characters to strip and replace each of them with either an empty string or a space.

In [None]:
import re

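# Punctuation that is simply deleted (raw strings avoid invalid-escape warnings)
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
# Double HTML line breaks, hyphens and slashes that become a single space
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

# Passing a compiled pattern makes DataFrame.replace do a regex substitution
df = df.replace(REPLACE_NO_SPACE, '')
df = df.replace(REPLACE_WITH_SPACE, ' ')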
df.head(1)
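
To see what the two patterns do, it helps to apply them to a single string. The sketch below uses only the standard library; the sample review is made up, and `clean_review` is a hypothetical helper name used for illustration.

```python
import re

# Same patterns as in the cell above, written as raw strings
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def clean_review(text):
    """Delete punctuation first, then turn breaks, hyphens and slashes into spaces."""
    text = REPLACE_NO_SPACE.sub("", text)
    return REPLACE_WITH_SPACE.sub(" ", text)


sample = "Great movie!<br /><br />A must-see, really."  # made-up review
print(clean_review(sample))  # Great movie A must see really
```

Note that only a double `<br /><br />` is matched as a whole; a single `<br />` would merely lose its slash.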

Next, we divide the data into a **Train** and a **Test** set. The split is **stratified** on the sentiment label, so both sets keep the same proportion of positive and negative reviews.

In [None]:
from sklearn.model_selection import train_test_split

# 50/50 split, stratified on the label column
TRAIN, TEST = train_test_split(df, test_size=0.5, stratify=df['sentiment'])
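
A quick way to convince ourselves of what `stratify` does is to split a small, deliberately imbalanced toy set and count the labels on each side. The data below is made up, and `random_state=0` is only there to make the sketch repeatable.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy labels: 6 positive and 4 negative reviews
reviews = [f"review {i}" for i in range(10)]
labels = ["positive"] * 6 + ["negative"] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    reviews, labels, test_size=0.5, stratify=labels, random_state=0
)

# Each half keeps the 60/40 ratio of the full data
print(Counter(y_tr))
print(Counter(y_te))
```

On the real data, the same check would be a label count on `TRAIN['sentiment']` and `TEST['sentiment']`.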

Split each set into **X** and **Y**

In [None]:
Xtrain = TRAIN['review']
Xtest = TEST['review']

Ytrain = TRAIN['sentiment']
Ytest = TEST['sentiment']
Xtrain.head(1)

## <font color=purple>Vectorization</font>

In order to apply our machine learning algorithms and build a classifier, we must convert each review to a **numeric representation**. This process is called ***vectorization***. 

For this process we'll use ***CountVectorizer*** (**bag of words**), which creates a large sparse matrix with one column for every unique word in our corpus and one row per review. 

First, we'll set ***binary=True***. Each review is then transformed into a row of 0s and 1s, where 1 means that the word assigned to a column appears in that review. We will also set the ***stop_words*** parameter to filter out stop words, i.e. common words that add little contextual meaning.


**Note** that the output is a sparse matrix (mostly zeros).
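
The sketch below shows the idea on a made-up mini corpus of two "reviews", small enough to print as a dense array. The variable names (`docs`, `cv_demo`, `matrix`) are ours and chosen so they do not clash with the notebook's variables.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus of two "reviews"
docs = ["The movie was great great", "The movie was boring"]

cv_demo = CountVectorizer(binary=True, stop_words='english')
matrix = cv_demo.fit_transform(docs)  # sparse matrix, one row per review

# vocabulary_ maps each kept word to its column index;
# "the" and "was" are dropped as stop words
columns = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
print(columns)  # ['boring', 'great', 'movie']

# Dense view is fine for a toy example only
print(matrix.toarray())
# [[0 1 1]
#  [1 0 1]]
```

Even though "great" appears twice in the first review, its entry is 1 because of `binary=True`.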

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

Xtrain = TRAIN['review']
Xtest = TEST['review']
Ytrain = TRAIN['sentiment']
Ytest = TEST['sentiment']

cv = CountVectorizer(binary=True, stop_words='english')
cv.fit(Xtrain)
Xtrain = cv.transform(Xtrain)
Xtest = cv.transform(Xtest)

print(Xtrain.shape)

In [None]:
print(Xtrain[0])

## <font color=purple>Designing Phase</font>

In this phase, the training data (**Xtrain**, **Ytrain**) is subdivided into another *train* and *test* set, so that we can tune the parameters of our classifier and decide which setting we consider to be **optimal** for its **performance**, without using the final Test set.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

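# Hold out 20% of the training data to compare the candidate values of C
X_train, X_test, Y_train, Y_test = train_test_split(Xtrain, Ytrain, train_size=0.8)

# C is the inverse regularization strength: smaller values regularize more
for c in [0.01, 0.1, 0.5, 1, 1.5, 2]:
    lr = LogisticRegression(C=c, max_iter=200)
    lr.fit(X_train, Y_train)
    print('Accuracy for C=%s: %s' % (c, accuracy_score(Y_test, lr.predict(X_test))))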

In our run, ***C = 0.1*** gave the model the best accuracy, so we take it as the optimal value for the parameter *C*. Since the split is random, the exact accuracies can vary slightly from run to run.
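
Instead of reading the best value off the printed lines, the scores can be collected and the best *C* picked programmatically. The sketch below is self-contained and therefore runs on synthetic data from `make_classification`, which is an assumption made for illustration; in the notebook, the vectorized reviews would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the notebook uses the vectorized reviews)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.8, random_state=0)

scores = {}
for c in [0.01, 0.1, 0.5, 1, 1.5, 2]:
    model = LogisticRegression(C=c, max_iter=200)
    model.fit(X_tr, y_tr)
    scores[c] = accuracy_score(y_val, model.predict(X_val))

# Pick the C with the highest held-out accuracy (the first one wins on ties)
best_c = max(scores, key=scores.get)
print(best_c, scores[best_c])
```

The value found this way depends on the data and on the split, so it should not be expected to match the one reported above.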

## <font color=purple>Train The Final Classifier</font>
Now that we know the optimal parameters for the design of our classifier, we must train a **new classifier** using the entire Train set (**Xtrain**) and then evaluate its performance with the Test set (**Xtest,Ytest**) based on the accuracy.

In [None]:
final_classifier = LogisticRegression(C=0.1, max_iter=300)
final_classifier.fit(Xtrain,Ytrain)
print('Final Accuracy: %s' % accuracy_score(Ytest, final_classifier.predict(Xtest)))

# <font color=red>-----------------------------------------------------------------------------------------------------------------</font> <br>
# <font color=red>-----------------------------------------------------------------------------------------------------------------</font>

## <font color=green>Count Vectorizer using '*binary=False*'</font>

Now, instead of being filled with 0s and 1s, each element of our sparse matrix represents the number of times the respective word appears in each review.
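
The difference between the two settings is easiest to see side by side. The two-document corpus below is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great great movie", "boring movie"]  # made-up mini corpus

binary_rows = CountVectorizer(binary=True).fit_transform(docs).toarray()
count_rows = CountVectorizer(binary=False).fit_transform(docs).toarray()

# Columns are ordered alphabetically: boring, great, movie
print(binary_rows[0])  # [0 1 1]  -> "great" is present
print(count_rows[0])   # [0 2 1]  -> "great" occurs twice
```

Reviews in which no word repeats get identical rows under both settings.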

In [None]:
Xtrain = TRAIN['review']
Xtest = TEST['review']
Ytrain = TRAIN['sentiment']
Ytest = TEST['sentiment']

cv = CountVectorizer(binary=False, stop_words='english')
cv.fit(Xtrain)
Xtrain = cv.transform(Xtrain)
Xtest = cv.transform(Xtest)

X_train, X_test, Y_train, Y_test = train_test_split(Xtrain, Ytrain, train_size = 0.8)

for c in [0.01, 0.1, 0.5, 1, 1.5, 2]:
    lr = LogisticRegression(C=c, max_iter=300)
    lr.fit(X_train, Y_train)
    print('Accuracy for C=%s: %s' % (c, accuracy_score(Y_test, lr.predict(X_test))))

In [None]:
final_classifier = LogisticRegression(C=0.1, max_iter=300)
final_classifier.fit(Xtrain,Ytrain)
print('Final Accuracy: %s' % accuracy_score(Ytest, final_classifier.predict(Xtest)))

## <font color=green>Count Vectorizer using '*ngrams*'</font>

In the field of **NLP**, an n-gram is a contiguous sequence of n items from a given sample of text or speech. <br>
This means that the columns of our sparse matrix no longer correspond only to single words: sequences of ***n*** consecutive words also get a column of their own.

Example:<br>
Single words: "I did not love the movie" --> {[I], [did], [not], [love], [the], [movie]}<br>
With the 2-gram "not love": "I did not love the movie" --> {[I], [did], [not love], [the], [movie]}
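
The example sentence can be run through `CountVectorizer` to see which columns `ngram_range=(1, 2)` actually produces. This is only a sketch on a single sentence. Two details are worth noting: the default tokenizer drops one-character tokens such as "I", and scikit-learn's built-in English stop-word list contains "not" and "the", so with `stop_words='english'` the 2-gram "not love" never appears.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["I did not love the movie"]

# Single words and 2-grams, no stop-word filtering
plain = CountVectorizer(ngram_range=(1, 2)).fit(sentence)
print(sorted(plain.vocabulary_))
# includes 'did not', 'not love', 'love the', 'the movie'

# Same range, but with the built-in English stop-word list
filtered = CountVectorizer(ngram_range=(1, 2), stop_words='english').fit(sentence)
print(sorted(filtered.vocabulary_))
# ['did', 'did love', 'love', 'love movie', 'movie']
```

Since negations can carry sentiment, it may be worth comparing the results with and without the stop-word filter.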

In [None]:
Xtrain = TRAIN['review']
Xtest = TEST['review']
Ytrain = TRAIN['sentiment']
Ytest = TEST['sentiment']

cv = CountVectorizer(binary=False, ngram_range=(1, 2), stop_words='english')
cv.fit(Xtrain)
Xtrain = cv.transform(Xtrain)
Xtest = cv.transform(Xtest)

X_train, X_test, Y_train, Y_test = train_test_split(Xtrain, Ytrain, train_size = 0.8)

for c in [0.01, 0.1, 0.5, 1, 1.5, 2]:
    lr = LogisticRegression(C=c, max_iter=300)
    lr.fit(X_train, Y_train)
    print('Accuracy for C=%s: %s' % (c, accuracy_score(Y_test, lr.predict(X_test))))

In [None]:
final_classifier = LogisticRegression(C=0.1, max_iter=300)
final_classifier.fit(Xtrain,Ytrain)
print('Final Accuracy: %s' % accuracy_score(Ytest, final_classifier.predict(Xtest)))

## <font color=green>Use of Support Vector Machines (SVM) -- LinearSVC</font>



In [None]:
from sklearn.svm import LinearSVC

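# X_train, X_test, Y_train and Y_test still hold the split created in the previous (n-gram) section
for c in [0.01, 0.1, 0.5, 1, 1.5, 2]:
    svc = LinearSVC(C=c)
    svc.fit(X_train, Y_train)
    print('Accuracy for C=%s: %s' % (c, accuracy_score(Y_test, svc.predict(X_test))))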

In [None]:
final_classifier = LinearSVC(C=0.01)
final_classifier.fit(Xtrain,Ytrain)
print('Final Accuracy: %s' % accuracy_score(Ytest, final_classifier.predict(Xtest)))