# spaCy & Sentiment Analysis

<font color='steelblue'>

<span style="font-family:verdana; font-size:1.4em;">

<b>Use spaCy to prepare text for sentiment analysis, then train a classifier and make predictions<br></b>
<ul>
    <li>The data comes from the University of California, Irvine</li>
    <li>The data consists of sentences extracted from reviews of restaurants (Yelp), movies (IMDB), and products (Amazon), each labelled with positive or negative sentiment</li>
    <li>For each website, there are 500 positive and 500 negative sentences, selected randomly from larger datasets of reviews</li>
    <li>Sentences were chosen to have a clearly positive or negative connotation; the goal was to include no neutral sentences</li>
</ul>

</span>

</font>
<br><br>
<font color='grey'>
<span style="font-family:verdana; font-size:1.2em;">
    <b>Code includes:
    <ul>
        <li>Loading the datasets</li>
        <li>Processing the messages using spaCy</li>
        <li>Building a pipeline for preprocessing and model training</li>
        <li>Training the model</li>
        <li>Evaluating the model</li>
     </ul>
    </b>
</span>
</font>

# Install spaCy<br>
<font color='tomato'>
<span style="font-family:verdana; font-size:1.4em;">
    There are a couple of options for installing spaCy:
    <ul>
        <li>conda install -c conda-forge spacy</li>
        <li>pip install spacy</li>
    </ul><br>
    In the Anaconda terminal, use the following commands to download the small or large English model:
    <ul>
        <li>python -m spacy download en_core_web_sm</li>
        <li>python -m spacy download en_core_web_lg</li>
    </ul>
        
</span>
</font>

In [None]:
import spacy
import pandas as pd

## Load datasets<br>
<font color='gray'>

<span style="font-family:verdana; font-size:1.2em;">
    <ul>
        <li>There are three text files, each with the message and its sentiment label separated by a tab</li>
        <li>These files are from Yelp, IMDB and Amazon</li>
        <li>For each website there are 500 positive and 500 negative reviews (selected from larger datasets)</li>
    </ul>
</span>
</font>
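
As a minimal sketch of the file layout described above, the snippet below parses a small in-memory sample with pandas. The sample sentences and the name `df_sample` are made up for illustration, and the sketch assumes the files have no header row, in which case `header=None` keeps the first review from being read as column names.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same layout: sentence<TAB>label
sample = "Great phone, works as described.\t1\nThe battery died after a week.\t0\n"

# header=None: treat the first line as data, not as column names
df_sample = pd.read_table(io.StringIO(sample), header=None, names=["Message", "Target"])

print(df_sample.shape)
print(df_sample["Target"].tolist())
```

Passing `names` at read time is one option; assigning `columns` afterwards works just as well once the header row is handled.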

In [None]:
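# The files have no header row, so header=None keeps the first review as data
dfYelp = pd.read_table('../datasets/yelp_labelled.txt', header=None)
dfImdb = pd.read_table('../datasets/imdb_labelled.txt', header=None)
dfAmz = pd.read_table('../datasets/amazon_cells_labelled.txt', header=None)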

In [None]:
# Collect the tables in a list (they are concatenated further below)
tables = [dfYelp, dfImdb, dfAmz]

In [None]:
# Set column header
for table in tables:
    table.columns = ['Message', 'Target']

In [None]:
for table in tables:
    print(table.columns)

## Create dataframe and explore data

In [None]:
# Assign a key to each source so rows can be traced back to their website
keys = ['Yelp','IMDB','Amazon']

In [None]:
# Create a dataframe by concatenating the tables; keys become the outer index level
df = pd.concat(tables, keys=keys)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.isnull().sum()

In [None]:
df.describe()
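
To make the effect of `keys` more concrete, here is a small sketch on made-up tables (the names `yelp`, `imdb`, `demo`, and `counts` are hypothetical and only used for illustration). It shows that `pd.concat` with `keys` adds an outer index level, which can then be used to select one source or to check the label balance per source.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the review sources
yelp = pd.DataFrame({"Message": ["Loved it", "Awful service"], "Target": [1, 0]})
imdb = pd.DataFrame({"Message": ["Boring plot"], "Target": [0]})

# keys adds an outer index level naming the source of each row
demo = pd.concat([yelp, imdb], keys=["Yelp", "IMDB"])

# Select all rows from one source via the outer index level
print(demo.loc["Yelp"])

# Count positive and negative labels per source
counts = demo.groupby(level=0)["Target"].value_counts()
print(counts)
```

On the full dataframe, the same `groupby(level=0)` call would be one way to confirm the 500/500 split per website.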

## Process the Text using spaCy<br>
<font color='gray'>
<span style="font-family:verdana; font-size:1.2em;">
    Apply the following processing steps:<br>
    <ul>
        <li>Remove stop words and punctuation</li>
        <li>Convert the text to lower case and strip leading and trailing spaces</li>
        <li>Tokenize the text into words</li>
    </ul>
</span>
</font>

In [None]:
# text processing libraries
from spacy.lang.en.stop_words import STOP_WORDS          # stop words
import string                                            # for punctuations
from spacy.lang.en import English                        # english parser

In [None]:
# Machine Learning libraries
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score 
from sklearn.base import TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

## Helper functions

In [None]:
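punctuations = string.punctuation   # string of ASCII punctuation characters
parser = English()                  # blank English pipeline (tokenizer only, no statistical model)
stopwords = set(STOP_WORDS)         # set gives fast membership tests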

In [None]:
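# Define a tokenizer using spaCy
def spacyTokenizer(sentence):
    doc = parser(sentence)
    # Use the lowercased lemma; fall back to the lowercased text when no lemma is
    # available (blank pipeline) or when spaCy v2 returns the "-PRON-" placeholder
    tokens = [word.lemma_.lower().strip() if word.lemma_ not in ("", "-PRON-") else word.lower_.strip() for word in doc]
    # Drop empty strings, stop words and punctuation
    tokens = [word for word in tokens if word and word not in stopwords and word not in punctuations]
    return tokens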

In [None]:
# Strip the text of leading and trailing spaces & convert it to lower case
def cleanText(text):
    return text.strip().lower()

In [None]:
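# Create a custom transformer that cleans the text before vectorization
from sklearn.base import BaseEstimator

class Predictors(TransformerMixin, BaseEstimator):
    
    def transform(self, data, **transform_params):
        # Clean each message; no state is learned
        return [cleanText(text) for text in data]
    
    def fit(self, data, y = None, **fit_params):
        # Nothing to fit; return self so the transformer works in a Pipeline
        return self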

## Build Pipeline<br>
<font color='gray'>


<span style="font-family:verdana; font-size:1.2em;">
    Build a pipeline to do the following:
    <ul>
        <li>Preprocess the text (using functions defined above)</li>
        <li>Tokenize the text and convert it to TF-IDF features</li>
        <li>Train the classifier on those features</li>
    </ul>
</span>
</font>

### Click the following link for more info on TF-IDF
<a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" target="_blank">TF-IDF</a>
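
As a rough illustration of the idea behind TF-IDF, the sketch below computes the textbook weight (term frequency times the log of the inverse document frequency) on a made-up corpus. The corpus and the helper `tf_idf` are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this simplified version.

```python
import math
from collections import Counter

# Hypothetical tokenized reviews (already lowercased, stop words removed)
docs = [["great", "movie"], ["poor", "movie"], ["great", "great", "food"]]

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF; assumes the term occurs in at least one document."""
    tf = Counter(doc)[term] / len(doc)            # share of the document taken by the term
    df = sum(1 for d in corpus if term in d)      # number of documents containing the term
    return tf * math.log(len(corpus) / df)

print(round(tf_idf("great", docs[2], docs), 3))
print(round(tf_idf("food", docs[2], docs), 3))
```

In the third document "food" ends up with a higher weight than "great", even though "great" occurs twice there, because "food" appears in fewer documents overall.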

In [None]:
# TF-IDF vectorizer
tfvectorizer = TfidfVectorizer(tokenizer = spacyTokenizer)

In [None]:
# Classifier
classifier = LogisticRegression(verbose = 2)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df['Message']
y = df['Target']

In [None]:
# create training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2345)

In [None]:
# Create pipeline
pipe = Pipeline([("cleaner", Predictors()),
                ("vectorizer", tfvectorizer),
                ("classifier", classifier)
                ])

In [None]:
# train the model
pipe.fit(X_train, y_train)

In [None]:
predictions = pipe.predict(X_test)

## Model Performance<br>
<font color='gray'>

<span style="font-family:verdana; font-size:1.2em;">
    Evaluate the model performance with the following:
    <ul>
        <li>Print the accuracy on the training and test sets</li>
        <li>Print the classification report</li>
        <li>Plot the confusion matrix</li>
    </ul>
</span>
</font>
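
To show how these metrics relate to each other, here is a small worked example in plain Python. The labels in `y_true` and `y_pred` are made up for illustration (1 = positive, 0 = negative); the counts are arranged in the same layout scikit-learn uses for a binary confusion matrix, with true labels as rows and predicted labels as columns.

```python
# Hypothetical labels for illustration: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # positives predicted positive
tn = sum(t == 0 and p == 0 for t, p in pairs)  # negatives predicted negative
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives predicted positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives predicted negative

# Rows are true labels, columns are predicted labels: [[tn, fp], [fn, tp]]
matrix = [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(matrix)
print(accuracy, precision, recall, f1)
```

The classification report lists precision, recall and F1 per class; the values above correspond to the row for the positive class.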

In [None]:
# print the first 10 predictions
for sample, pred in list(zip(X_test, predictions))[:10]:
    print(sample, 'Prediction --> ', pred)

In [None]:
# training model accuracy
print("Train Accuracy: ", pipe.score(X_train, y_train))

In [None]:
print("Test Accuracy: ", pipe.score(X_test, y_test))

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
print(classification_report(y_test, predictions))

In [None]:
cm = confusion_matrix(y_test, predictions)

In [None]:
cseg = ["Negative", "Positive"]
cm_df = pd.DataFrame(cm, index = cseg, columns = cseg)

In [None]:
# Plot the confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Set the seaborn theme before drawing so that it applies to this figure
sns.set(font_scale=0.5)
plt.figure(figsize = (10, 6))
sns.heatmap(cm_df, annot=True, cmap=plt.cm.Blues, fmt = 'g', annot_kws={"size": 16})
plt.title('Confusion Matrix\n', fontsize = 18)
plt.ylabel('True label', fontsize = 16)
plt.xlabel('Predicted label', fontsize = 16)
plt.show()

In [None]:
examples = ["It was a great movie",
            "I do enjoy my job",
            "What a poor product! I will have to get a new one",
            "It was an amazing feeling!"]

In [None]:
pipe.predict(examples)
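
Besides hard labels, a logistic regression pipeline can also report class probabilities, which helps judge how confident a prediction is. The sketch below is self-contained and trains a toy pipeline on a handful of made-up sentences; `toy_pipe`, `texts`, and `labels` are hypothetical names, and the default `TfidfVectorizer` tokenizer is used here instead of the spaCy tokenizer to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy training data; the notebook itself trains on the review datasets
texts = ["great movie", "loved the food", "amazing phone",
         "poor quality", "terrible service", "bad movie"]
labels = [1, 1, 1, 0, 0, 0]

toy_pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("classifier", LogisticRegression())])
toy_pipe.fit(texts, labels)

# Each row holds one probability per class, ordered as in toy_pipe.classes_
proba = toy_pipe.predict_proba(["great food", "poor service"])
print(toy_pipe.classes_)
print(proba.round(2))
```

With the notebook's fitted pipeline, the equivalent call would be `pipe.predict_proba(examples)`.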