# **<font color=green>Accenture - Sentiment Analysis</font>**


## <font color=purple>Load Dataset</font>

In [None]:
import pandas as pd
# Show the full review text; None disables truncation (older pandas versions used -1).
pd.set_option('display.max_colwidth', None)

dataset_path = 'DataSet/'
df=pd.read_csv(dataset_path+'IMDB Dataset.csv', sep=',',header=0)
df.head(2)

## <font color=purple>Clean and Preprocess</font>
**Remove special characters**<br>
Define two regular expressions: one for punctuation that is deleted outright, and one for HTML line breaks, hyphens, and slashes that are replaced by a space.

In [None]:
import re

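# Punctuation that is deleted outright (replaced by an empty string).
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
# Double HTML line breaks, hyphens, and slashes are replaced by a single space.
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

# A compiled pattern passed to DataFrame.replace is applied as a regex to every string cell.
df = df.replace(REPLACE_NO_SPACE, '')
df = df.replace(REPLACE_WITH_SPACE, ' ')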
df.head(1)
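
To see what the two patterns do, here is a minimal sketch that applies them to a single made-up review string with `re.sub`. The helper name `clean_review` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

# Same patterns as in the cell above, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def clean_review(text):
    """Hypothetical helper: delete punctuation, then turn breaks, hyphens and slashes into spaces."""
    text = REPLACE_NO_SPACE.sub('', text)
    return REPLACE_WITH_SPACE.sub(' ', text)

sample = 'Great movie!<br /><br />Well-acted, 10/10.'
cleaned = clean_review(sample)
print(cleaned)  # Great movie Well acted 10 10
```

Note that a single `<br />` does not match the double-break pattern. Only its slash is replaced, so a tag remnant (`<br  >`) stays in the text.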

Next, split the data into a **Train** and a **Test** set. The split is **stratified** on the `sentiment` column, so both sets keep the same proportion of positive and negative reviews.

In [None]:
from sklearn.model_selection import train_test_split

# Half of the reviews go to each set; stratify keeps the label proportions equal.
TRAIN, TEST = train_test_split(df, test_size=0.5, stratify=df['sentiment'])
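
As a rough illustration of what `stratify` guarantees, the sketch below splits a made-up list of eight labelled reviews in half and counts the labels on each side. The toy data and `random_state=0` are assumptions added for reproducibility. The notebook itself does not fix a seed.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 4 positive and 4 negative reviews.
reviews = ['good %d' % i for i in range(4)] + ['bad %d' % i for i in range(4)]
labels = ['positive'] * 4 + ['negative'] * 4

# random_state is an added assumption so that the split is reproducible.
X_a, X_b, y_a, y_b = train_test_split(
    reviews, labels, test_size=0.5, stratify=labels, random_state=0
)

# Each half should contain the same number of positive and negative labels.
train_counts = Counter(y_a)
test_counts = Counter(y_b)
print(train_counts, test_counts)
```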

Split each set into features (**X**, the review text) and labels (**Y**, the sentiment).

In [None]:
Xtrain = TRAIN['review']
Xtest = TEST['review']

Ytrain = TRAIN['sentiment']
Ytest = TEST['sentiment']
Xtrain.head(1)

## <font color=purple>Vectorization</font>

In order to apply our machine learning algorithms and build a classifier, we must convert each review to a **numeric representation**. This process is called ***vectorization***.

There are several ways to compute tf-idf, but in a nutshell it weighs the number of times a given word appears in a document (a movie review in our case) against the number of documents in the corpus that contain the word. Words that appear in many documents receive a lower weight, and words that appear in fewer documents receive a higher weight. With scikit-learn's default L2 normalization, every weight lies between 0 and 1.
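
The idea can be worked through by hand on a tiny corpus. The sketch below is plain Python, not the notebook's method. It uses the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization. The corpus and the names `df_counts`, `idf`, and `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Tiny made-up corpus; each string stands in for one review.
corpus = [
    'good movie',
    'bad movie',
    'good plot good cast',
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df_counts = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed idf, the variant scikit-learn documents for smooth_idf=True.
    return math.log((1 + n_docs) / (1 + df_counts[word])) + 1

def tfidf_vector(doc):
    """Return {word: weight} with L2 normalization, so each weight lies in [0, 1]."""
    tf = Counter(doc)
    raw = {w: tf[w] * idf(w) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf_vector(docs[1])  # 'bad movie'
print(weights)
```

In this corpus "bad" occurs in one document and "movie" in two, so "bad" receives the larger weight. The notebook passes `binary=True`, which would cap each term count at 1 before the idf weighting. The sketch uses raw counts instead.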

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

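# Re-select the raw columns so this cell can be re-run after Xtrain/Xtest are overwritten below.
Xtrain = TRAIN['review']
Xtest = TEST['review']
Ytrain = TRAIN['sentiment']
Ytest = TEST['sentiment']

# Unigrams and bigrams, English stop words removed; binary=True counts each term at most once per review.
tfidf = TfidfVectorizer(binary=True, ngram_range=(1,2), stop_words='english')
# Fit the vocabulary and idf weights on the training reviews only, then reuse them for the test reviews.
Xtrain = tfidf.fit_transform(Xtrain)
Xtest = tfidf.transform(Xtest)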

print(Xtrain.shape)

In [None]:
print(Xtrain[0])

## <font color=purple>Designing Phase</font>

In this phase, the **Xtrain** set is subdivided into a smaller *train* set and a held-out *validation* set (named `X_test`, `Y_test` in the code), in order to tune the regularization parameter `C` of our classifier and decide which value gives the **optimal** **performance**.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Hold out 20% of the training data for validation; the real Test set stays untouched.
X_train, X_test, Y_train, Y_test = train_test_split(Xtrain, Ytrain, train_size = 0.8)

for c in [0.01, 0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4]:
    svc = LinearSVC(C=c)
    svc.fit(X_train, Y_train)
    print('Accuracy for C=%s: %s' % (c, accuracy_score(Y_test, svc.predict(X_test))))
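
The loop above prints one accuracy per candidate. A small variation keeps the scores in a dictionary and picks the best value programmatically. The sketch below runs on synthetic data from `make_classification`, an assumption standing in for the tf-idf matrix. The names `scores` and `best_c`, the larger `max_iter`, and the fixed `random_state` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorized reviews (assumption, not the IMDB data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

candidates = [0.01, 0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4]
scores = {}
for c in candidates:
    svc = LinearSVC(C=c, max_iter=5000)
    svc.fit(X_fit, y_fit)
    scores[c] = accuracy_score(y_val, svc.predict(X_val))

# Highest validation accuracy wins; ties go to the first (smallest) C.
best_c = max(scores, key=scores.get)
print(best_c, scores[best_c])
```

A single held-out split can be noisy. Cross-validation, for example with `sklearn.model_selection.cross_val_score`, is a common alternative when the differences between candidates are small.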

## <font color=purple>Train The Final Classifier</font>
Now that we have chosen a value for `C` (here `C=3.5`), we train a **new classifier** on the entire Train set (**Xtrain**, **Ytrain**) and then evaluate its accuracy on the Test set (**Xtest**, **Ytest**).

In [None]:
final_classifier = LinearSVC(C=3.5, max_iter=300)
final_classifier.fit(Xtrain,Ytrain)
print('Final Accuracy: %s' % accuracy_score(Ytest, final_classifier.predict(Xtest)))
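
Once trained, the classifier can label unseen text, provided the text goes through the same fitted vectorizer (and, in the notebook, the same regex cleaning). The following is a minimal, self-contained sketch on a made-up four-review training set. The names `vectorizer`, `clf`, and `new_review` are illustrative and the data is not from the IMDB dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up training set, standing in for the cleaned IMDB reviews.
train_texts = [
    'great wonderful film loved it',
    'wonderful acting great story',
    'terrible boring film hated it',
    'boring plot terrible acting',
]
train_labels = ['positive', 'positive', 'negative', 'negative']

# Same vectorizer settings as in the notebook.
vectorizer = TfidfVectorizer(binary=True, ngram_range=(1, 2), stop_words='english')
X_toy = vectorizer.fit_transform(train_texts)

clf = LinearSVC(C=3.5, max_iter=300)
clf.fit(X_toy, train_labels)

# New text must go through transform (not fit) so it uses the training vocabulary.
new_review = ['a great and wonderful story']
prediction = clf.predict(vectorizer.transform(new_review))[0]
print(prediction)
```

Words in the new review that were never seen during fitting are simply ignored by `transform`, which is why the vectorizer must not be refitted on new data.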