# Introduction
These notebooks demonstrate:
1. A very simple sentiment analysis application that uses [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) features and a Logistic Regression classifier,
1. Sentiment classification that is language agnostic,
1. Deployment and serving of an ML model on [PythonAnywhere](https://www.pythonanywhere.com), and
1. Use of [Swagger UI](https://swagger.io/tools/swagger-ui/) to document and interact with the API.

You can test out the application [here](https://mybinder.org/v2/gh/zerafachris/playGround/master?filepath=published%2FsentimentAnalysisApp%2F04_interactive_sentiment_analysis.ipynb) by using [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/zerafachris/playGround/master?filepath=published%2FsentimentAnalysisApp%2F04_interactive_sentiment_analysis.ipynb)


# Sentiment Analysis Application
The ML model built here uses Term Frequency – Inverse Document Frequency (TF-IDF) features to predict the sentiment of reviews left by visitors to my GitHub page.

## Data - [01_create_data.ipynb](https://github.com/zerafachris/playGround/blob/master/published/sentimentAnalysisApp/01_create_data.ipynb)
I did not have sentiment data for page reviews, but page reviews can be considered similar to *movie* reviews. I therefore used the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), a set of 50,000 movie reviews with binary sentiment labels. The data is downloaded and prepared in [01_create_data.ipynb](https://github.com/zerafachris/playGround/blob/master/published/sentimentAnalysisApp/01_create_data.ipynb). Below is a data sample:

    import pandas as pd

    # Load the first two rows of the test split and print each review with its label
    df = pd.read_csv('./data/test.csv').head(2)
    for sent, rev in zip(df['sentiment'], df['review']):
        print('''
        Sentiment: {}
        Review: {}\n'''.format(sent, rev))

Output:

    Sentiment: pos
    Review:  Years ago when I first read John Irving s The World According to Garp I was astounded that most of the younger adults with whom I had contact didn t like the book when I loved it I began to understand that it was an age and experience thing I experienced somewhat of a d j vu when reading some of the comments on this site that were clearly written by younger viewers Fully enjoying Separate Lies is surely an age and experience thing br br In this film the viewer sees a seemingly happy upper middle class couple he a successful lawyer she the perfect wife of a successful lawyer They have a townhouse in London and a home in the country All s well until there enters the villain in the guise of the son of the richest man in the village This guy appears to be a cad from the word Go He is disdainful of everyone and everything including his own children In the traditional form of nice guys finishing last the lawyer s wife engages in an affair with the bounder You

## ML modelling - [02_TF_IDF.ipynb](https://github.com/zerafachris/playGround/blob/master/published/sentimentAnalysisApp/02_TF_IDF.ipynb)
### Pre-Processing
The raw data is cleaned through the following pre-processing steps:
- Removing HTML tags,
- Removing non-letters,
- Converting to lower case,
- Tokenizing into individual words,
- Stemming.
    - Tokenization and stemming were simplified by using the Natural Language Toolkit ([NLTK](https://www.nltk.org/)).
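
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code from the notebook: `preprocess` and `naive_stem` are hypothetical names, tokenization is a simple whitespace split, and the toy suffix stripper stands in for a real stemmer such as the ones NLTK provides.

```python
import re


def naive_stem(token):
    """Toy suffix stripper; a stand-in for a real stemmer (assumption, not NLTK)."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    """Apply the cleaning steps in order and return a list of stemmed tokens."""
    text = re.sub(r"<[^>]+>", " ", review)       # remove HTML tags
    text = re.sub(r"[^A-Za-z]+", " ", text)      # remove non-letters
    tokens = text.lower().split()                # lower-case, then tokenize on whitespace
    return [naive_stem(tok) for tok in tokens]   # stem each token


tokens = preprocess("I <b>loved</b> it!<br /><br />Great acting, 10/10.")
print(tokens)
```

A real stemmer handles many more cases than this suffix list, so the tokens it produces will generally differ from the toy output.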

### TF-IDF application
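As a rough illustration of what TF-IDF computes, the sketch below weights each term by its frequency within a review multiplied by the log of the inverse fraction of reviews that contain it. It uses the textbook formula in plain Python; the function name `tfidf` and the toy documents are assumptions for illustration, and library implementations typically add smoothing and normalization, so their numbers will differ. The vectorizer settings used in the notebook are not shown here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # term frequency in this document * log inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["great", "film", "great", "cast"],
    ["bad", "film"],
    ["great", "fun"],
]
weights = tfidf(docs)
for doc_weights in weights:
    print(doc_weights)
```

Terms that appear in few reviews (such as `bad` here) receive a higher weight than terms shared across many reviews (such as `film`), which is what makes the representation useful for classification.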

### ML-modelling
I tried several classification techniques. Their results are:

| Model | Score (**roc_auc**) |
|-----|:------:|
| Support Vector Machine - Linear | 0.969465 |
| Naive Bayes Classifier - Bernoulli | 0.941231 |
| Multi-Layer Perceptron (MLP) | 0.968113 |
| Logistic Regression (LR) | 0.969760 |

LR was the quickest to train and produced the best **roc_auc** score, so I decided to proceed with it. An ensemble of models would have been an alternative, but I feel this goes beyond the scope of this notebook.
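
For readers unfamiliar with the metric, **roc_auc** can be read as the probability that a randomly chosen positive review is scored higher than a randomly chosen negative one. The sketch below computes it directly from that definition in plain Python. The function name `roc_auc` and the toy labels and scores are made up for illustration; they are not the notebook's data, and how the notebook computed its scores is not shown here.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: three positive and two negative reviews with model scores.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.35, 0.4, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

This pairwise version is quadratic in the number of reviews, so it is only suitable for small examples; it is meant to explain the metric rather than replace a library implementation.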


## Productionization of App - [03_TF_IDF_prod_lr.ipynb](https://github.com/zerafachris/playGround/blob/master/published/sentimentAnalysisApp/03_TF_IDF_prod_lr.ipynb)
The next notebook prepares pipelines for productionization.

# App deployment
## PythonAnywhere
