## External Applications

<hr style="height: 1px; background-color: #808080">
## 1. FastText overview
### What is it?

FastText is a linear model with a rank constraint and a fast loss approximation.<br>
It reaches accuracy comparable to deep learning classifiers.<br>

But it is much faster:
- FastText can train on more than 200M words in less than five minutes using a standard multicore CPU
- It can classify nearly 150K reviews in less than a minute

### Architecture

<img src="https://raw.githubusercontent.com/akruchko/test/master/1_model_architecture_of_fastText.PNG">
The model architecture of fastText for a sentence with $N$ ngram features $x_1, \ldots, x_N$.<br> The features are embedded and averaged to form the hidden variable.$^1$

<hr style="height: 1px; width: 100px; background-color: #808080" align="left"> <br>
$^1$ https://arxiv.org/pdf/1607.01759.pdf

### Algorithm

FastText uses the softmax function $f$ to compute the probability distribution over the predefined classes. For a set of $N$ documents, this leads to minimizing the negative log-likelihood over the classes:


\begin{align}
-\frac{1}{N} \sum_{n=1}^N y_n \log(f(BAx_n))
\end{align}
$x_n$ - the normalized bag of features of the n-th document, <br>
$y_n$ - the label, <br>
$A, B$ - weight matrices

Optimization is performed using stochastic gradient descent with a linearly decaying learning rate.
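
As a rough illustration of the loss above, here is a minimal pure-Python sketch. It assumes one-hot labels (so $y_n \log f(\cdot)$ reduces to the log-probability of the true class), and the toy matrices `A` and `B`, the helper names and the example documents are all made up for the illustration; this is not fastText's actual implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def nll(docs, labels, A, B):
    """Average negative log-likelihood: -1/N * sum_n log f(B A x_n)[y_n]."""
    total = 0.0
    for x, y in zip(docs, labels):
        hidden = matvec(A, x)               # A x_n: the hidden variable
        probs = softmax(matvec(B, hidden))  # f(B A x_n): class probabilities
        total += math.log(probs[y])         # log-probability of the true class
    return -total / len(docs)

# Toy setup (hypothetical values): 3 features, hidden size 2, 2 classes.
A = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
B = [[1.0, -1.0],
     [-1.0, 1.0]]
docs = [[0.5, 0.5, 0.0],   # normalized bag of features of document 1
        [0.0, 0.0, 1.0]]   # normalized bag of features of document 2
labels = [1, 0]

loss = nll(docs, labels, A, B)
print("toy loss:", round(loss, 4))
```

With all weights set to zero the model predicts a uniform distribution, so the loss equals $\log 2$ for two classes, which is a handy sanity check.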

<hr style="height: 1px; background-color: #808080">
### Data preprocessing for fastText
- remove nonprintable characters
- expand contractions such as `n't`, `'re`, `'s`
- remove punctuation and digits
- Porter stemming

In [1]:
import pandas as pd

from itertools import product

from string import punctuation, digits
from nltk.stem import PorterStemmer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [2]:
from fasttext import supervised

In [3]:
df_train = pd.read_csv('../data/movie_reviews.csv')
df_test = pd.read_csv('../data/test.csv')

In [4]:
# translation table that deletes all punctuation and digits
translator = str.maketrans('', '', punctuation + digits)

# basic preprocessing: adds a `clean_text` column built from column `col`
def clean_data(df, col):
    # drop newlines, carriage returns and tabs
    df['clean_text'] = df[col].str.replace('\n', '').str.replace('\r', '').str.replace('\t', '')
    # expand common contractions before the apostrophes are stripped
    df.clean_text = df.clean_text.str.replace("n't", " not").str.replace("'re", " are").str.replace("'s", " s")
    df.clean_text = df.clean_text.str.replace("'ve", " have").str.replace("'ll", " will").str.replace("'d", " d")
    # remove punctuation and digits, trim whitespace, lowercase
    df.clean_text = df.clean_text.str.translate(translator).str.strip().str.lower()
    return df

In [5]:
df_train = clean_data(df_train, 'text')
df_test = clean_data(df_test, 'text')

In [6]:
print('before:\n', df_train.text[0])
print('after:\n', df_train.clean_text[0])

before:
 To an entire generation of filmgoers, it just might represent the most significant leap in storytelling that they will ever see...
after:
 to an entire generation of filmgoers it just might represent the most significant leap in storytelling that they will ever see


In [7]:
# split into train and validation sets
df_train2, df_val = train_test_split(df_train[['label', 'clean_text']], test_size=0.2, random_state=42)

In [8]:
# fastText is trained from a text file in which every line starts with its label, marked by a prefix.
# The default prefix is `__label__`, but a custom one can be used.
df_train2['ft_label'] = df_train2['label'].apply(lambda x: '__label__1 ' if x == 1 else '__label__0 ')
df_train2[['ft_label', 'clean_text']].to_csv('../data/train_fastText.csv', index=False, header=False)
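
To make the expected file layout concrete, here is a small sketch of the line format fastText reads: the prefixed label, a space, then the text, one example per line. The helper `to_fasttext_line` and the two sample rows are hypothetical and only illustrate the format; they are not part of the notebook's pipeline.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as '<prefix><label> <text>' on a single line."""
    # Collapse all whitespace (including newlines) so each example stays on one line.
    return f"{prefix}{label} {' '.join(text.split())}"

# Hypothetical sample rows: (label, cleaned text)
rows = [(1, "a gripping and funny film"),
        (0, "dull  and overlong\n")]

lines = [to_fasttext_line(label, text) for label, text in rows]
print("\n".join(lines))
```

Writing these lines to a plain text file gives a training file with no separator characters between the label and the first word.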

<hr style="height: 1px; background-color: #808080">
## 3. Parameters overview and the model training

### Parameters overview

- `lr` - learning rate. Default: **0.1**.
- `dim` - dimension of the word vectors (the size of the hidden layer). Default: **100**. Use a smaller value for small datasets and for a small number of labels.
- `epoch` - number of epochs. Default: **5**. Use more epochs with small learning rates.
- `min_count` - minimal number of word occurrences. Default: **1**. Use 5 or higher to avoid overfitting.
- `word_ngrams` - max length of word ngram. Default: **1**. Higher-order ngrams can lead to overfitting on small datasets. If the value is greater than 1, the learning rate and the number of epochs should be revisited.
- `bucket` - number of buckets. Default: **2000000**. The developers recommend lower values for small datasets (e.g. 100K).
- `minn` - min length of char ngram. Default: **0**.
- `maxn` - max length of char ngram. Default: **0**.
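
Since several of these parameters interact, one common approach is to try a small grid of values. The sketch below only enumerates the combinations with `itertools.product`; the grid values and the helper `iter_param_combinations` are assumptions chosen for illustration, and the training and scoring step is deliberately left out.

```python
from itertools import product

# Hypothetical grid: a few values for some of the parameters listed above.
param_grid = {
    "lr": [0.01, 0.05, 0.1],
    "epoch": [5, 10],
    "word_ngrams": [1, 2],
}

def iter_param_combinations(grid):
    """Yield one dict per combination of the values in `grid`."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

combinations = list(iter_param_combinations(param_grid))
print(len(combinations), "combinations, first:", combinations[0])
```

Each dictionary could then be passed as keyword arguments to the training call and scored on a held-out validation set, keeping the test set for the final evaluation only.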

### The model training

In [9]:
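def get_score(df, clf, label='label', text='clean_text', model_name='../data/fastText'):
    """Return [accuracy, F1] of `clf` on `df`. F1 is computed for class 0; `model_name` is unused."""
    # predict_proba returns, for each text, a list of (label, probability) pairs; keep the top label
    prediction = clf.predict_proba(list(df[text]))
    prediction = [int(item[0][0]) for item in prediction]

    return [round(accuracy_score(list(df[label]), prediction), 4),
            round(f1_score(list(df[label]), prediction, pos_label=0), 4)]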

In [10]:
%%time
# Let's train the model with the predefined parameters
clf = supervised('../data/train_fastText.csv', '../data/fastText', label_prefix='__label__', 
                 lr=0.01, epoch=10, min_count=5)

CPU times: user 32.2 s, sys: 1.01 s, total: 33.2 s
Wall time: 5.82 s


In [11]:
acc, f1 = get_score(df_test, clf)
print('test accuracy:', round(acc, 4), 'test f1:', round(f1, 4))

test accuracy: 0.7824 test f1: 0.761


## Conclusions

- Really fast: sentences are represented as a bag of words and a bag of n-grams using the hashing trick, and a hierarchical softmax speeds up the output layer.
- It was developed mainly for large datasets (e.g. 1 billion words). For small datasets, either tune the hyperparameters carefully to avoid overfitting or get more data.
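
The hashing trick mentioned above can be sketched as follows: every n-gram is mapped to one of a fixed number of buckets, so the model never needs an explicit n-gram vocabulary. This is a simplified illustration, not fastText's code: `zlib.crc32` stands in for fastText's own hash function, every n-gram (including single words) is hashed for simplicity, and the helper names are made up.

```python
import zlib

def word_ngrams(tokens, max_n):
    """Return all word n-grams of length 1..max_n, shortest first."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def bucket_index(ngram, num_buckets):
    """Map an n-gram to a bucket; crc32 is only a stand-in hash function."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets

tokens = "the plot was good".split()
ngrams = word_ngrams(tokens, max_n=2)
indices = [bucket_index(g, num_buckets=100_000) for g in ngrams]

for ngram, index in zip(ngrams, indices):
    print(f"{ngram!r} -> bucket {index}")
```

Different n-grams can collide in the same bucket. That is the price paid for a fixed memory footprint, and it is why fewer buckets are suggested for small datasets.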

## 2. Vader. No ML needed.

In [12]:
!pip install vaderSentiment



In [13]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    #note: depending on how you installed (e.g., using source code download versus pip install), you may need to import like this:
    #from vaderSentiment import SentimentIntensityAnalyzer

# --- examples -------
sentences = ["VADER is smart, handsome, and funny.",      # positive sentence example
            "VADER is not smart, handsome, nor funny.",   # negation sentence example
            "VADER is smart, handsome, and funny!",       # punctuation emphasis handled correctly (sentiment intensity adjusted)
            "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
            "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
            "VADER is VERY SMART, handsome, and FUNNY!!!",# combination of signals - VADER appropriately adjusts intensity
            "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!",# booster words & punctuation make this close to ceiling for score
            "The book was good.",                                     # positive sentence
            "The book was kind of good.",                 # qualified positive sentence is handled correctly (intensity adjusted)
            "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
            "At least it isn't a horrible book.",         # negated negative sentence with contraction
            "Make sure you :) or :D today!",              # emoticons handled
            "Today SUX!",                                 # negative slang with capitalization emphasis
            "Today only kinda sux! But I'll get by, lol"  # mixed sentiment example with slang and contrastive conjunction "but"
             ]

analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))

VADER is smart, handsome, and funny.----------------------------- {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
VADER is not smart, handsome, nor funny.------------------------- {'neg': 0.646, 'neu': 0.354, 'pos': 0.0, 'compound': -0.7424}
VADER is smart, handsome, and funny!----------------------------- {'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439}
VADER is very smart, handsome, and funny.------------------------ {'neg': 0.0, 'neu': 0.299, 'pos': 0.701, 'compound': 0.8545}
VADER is VERY SMART, handsome, and FUNNY.------------------------ {'neg': 0.0, 'neu': 0.246, 'pos': 0.754, 'compound': 0.9227}
VADER is VERY SMART, handsome, and FUNNY!!!---------------------- {'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.9342}
VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!--------- {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}
The book was good.----------------------------------------------- {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'co

In [14]:
text_test = list(df_test['clean_text'])
label_test = df_test['label']

In [15]:
analyzer = SentimentIntensityAnalyzer()
df_test['vader'] = df_test['clean_text'].apply(lambda x: analyzer.polarity_scores(x)['compound'])

In [17]:
df_test.head()

| Unnamed: 0 | label | text | clean_text | vader |
|---|---|---|---|---|
| 0 | 0 | it's so laddish and juvenile , only teenage bo... | it s so laddish and juvenile only teenage boy... | 0.4404 |
| 1 | 0 | exploitative and largely devoid of the depth o... | exploitative and largely devoid of the depth o... | 0.0 |
| 2 | 0 | [garbus] discards the potential for pathologic... | garbus discards the potential for pathological... | -0.25 |
| 3 | 0 | a visually flashy but narratively opaque and e... | a visually flashy but narratively opaque and e... | 0.0 |
| 4 | 0 | the story is also as unoriginal as they come ,... | the story is also as unoriginal as they come ... | 0.5367 |


In [18]:
# keep only reviews with a non-zero compound score; .copy() makes the column assignment below safe
df_test_new = df_test[df_test['vader'] != 0].copy()

In [19]:
df_test_new['predict'] = df_test_new['vader'].apply(lambda x: 1 if x > 0 else 0)


In [20]:
df_test_new.head()

| Unnamed: 0 | label | text | clean_text | vader | predict |
|---|---|---|---|---|---|
| 0 | 0 | it's so laddish and juvenile , only teenage bo... | it s so laddish and juvenile only teenage boy... | 0.4404 | 1 |
| 2 | 0 | [garbus] discards the potential for pathologic... | garbus discards the potential for pathological... | -0.25 | 0 |
| 4 | 0 | the story is also as unoriginal as they come ,... | the story is also as unoriginal as they come ... | 0.5367 | 1 |
| 7 | 0 | unfortunately the story and the actors are ser... | unfortunately the story and the actors are ser... | -0.34 | 0 |
| 8 | 0 | all the more disquieting for its relatively go... | all the more disquieting for its relatively go... | -0.3612 | 0 |


In [21]:
from sklearn.metrics import classification_report
print(classification_report(df_test_new['label'], df_test_new['predict']))

             precision    recall  f1-score   support

          0       0.69      0.49      0.57      4394
          1       0.62      0.80      0.70      4651

avg / total       0.66      0.65      0.64      9045
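
The decision rule used in this section (positive if the compound score is above 0, reviews with a score of exactly 0 dropped) can be written as a small function with an adjustable neutral band. The function name and the sample scores are hypothetical. The VADER README suggests treating compound scores of 0.05 and above as positive and -0.05 and below as negative; the sketch uses strict inequalities so that a threshold of 0 reproduces the rule above, which differs from that suggestion only at the exact boundary values.

```python
def compound_to_label(compound, threshold=0.0):
    """Return 1 (positive), 0 (negative) or None (inside the neutral band)."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return 0
    return None

# Hypothetical compound scores, including one weakly positive value.
scores = [0.4404, 0.0, -0.25, 0.03]

notebook_rule = [compound_to_label(s) for s in scores]
banded_rule = [compound_to_label(s, threshold=0.05) for s in scores]

print("threshold 0.00:", notebook_rule)
print("threshold 0.05:", banded_rule)
```

A wider neutral band leaves more reviews unclassified, so any accuracy computed this way covers only the reviews on which VADER commits to a polarity.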



## 3. OpenAI Sentiment Neuron.

See more at [https://blog.openai.com/unsupervised-sentiment-neuron/](https://blog.openai.com/unsupervised-sentiment-neuron/)

The code is also available on [GitHub](https://github.com/openai/generating-reviews-discovering-sentiment).