<a href="https://colab.research.google.com/github/yala/introML_chem/blob/master/lab1/sentiment_analysis_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Machine Learning Packages

In this tutorial, we'll take you through developing models to classify text in `sklearn` from start to finish. We'll go through preprocessing, feature engineering, and experimentation. 

Let's get started!

In [0]:
import pickle
import re
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier, Perceptron
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import warnings
warnings.filterwarnings("ignore")


# The Task: Beer Sentiment Analysis

Given long, detailed beer reviews, we want to predict whether the reviewer rated the beer as bad, okay, or good.


## Step 1: Preprocessing the data
To start off, we're going to load the data from some pickle files and do some simple preprocessing. We'll throw away non-alphanumeric characters and lowercase everything.

For example:
```"Best Beer ever!!!" -> "best beer ever"```
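
As a minimal sketch, the same normalization can be tried on a single string. It assumes the `\W+` pattern is what does the cleaning; `normalize` is a hypothetical helper name used only for illustration.

```python
import re

def normalize(text):
    # Hypothetical helper: collapse each run of non-word characters into one space,
    # then lowercase and trim leading/trailing whitespace.
    return re.sub(r'\W+', ' ', text).lower().strip()

print(normalize("Best Beer ever!!!"))  # best beer ever
```

Note that `\W` treats digits and underscores as word characters, so those are kept.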

To sanity check the data, we'll look at a few examples.


In [0]:
!apt-get install wget
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_train.p
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_dev.p
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_test.p

train_path = "overall_train.p"
dev_path   = "overall_dev.p"
test_path  = "overall_test.p"

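# Load each split; `with` closes the file handle once the data is read
with open(train_path, 'rb') as f:
    train_set = pickle.load(f)
with open(dev_path, 'rb') as f:
    dev_set = pickle.load(f)
with open(test_path, 'rb') as f:
    test_set = pickle.load(f)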



def preprocess_data(data):
    for indx, sample in enumerate(data):
        text, label = sample['text'], sample['y']
        # Replace each run of non-word characters with one space, then lowercase
        text = re.sub(r'\W+', ' ', text).lower().strip()
        data[indx] = text, label
    return data

train_set = preprocess_data(train_set)
dev_set = preprocess_data(dev_set)
test_set =  preprocess_data(test_set)

In [0]:
print("Num Train: {}".format(len(train_set)))
print("Num Dev: {}".format(len(dev_set)))
print("Num Test: {}".format(len(test_set)))
print("Example Reviews:")
print(train_set[0])
print()
print(train_set[1])


## Step 2: Feature Engineering 

How do we represent a review? We're going to use a simple bag-of-words representation: each review becomes a vector with one entry per vocabulary word, and the whole set of reviews becomes a large matrix.

For example, suppose our vocabulary is ```[best, ever, beer, cat, good, dog]```.
The bag-of-words representation for
```"best beer ever"``` is ```[1, 1, 1, 0, 0, 0]```,
where 1 indicates that the vocabulary word appears in the review and 0 indicates that it does not.

With sklearn, we can do this very easily with ```sklearn.feature_extraction.text.CountVectorizer```
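
The toy example above can be reproduced as a rough sketch. It assumes we pass the six-word vocabulary in explicitly, which fixes the column order; `toy_vec`, `toy_X`, and `toy_rows` are illustrative names, not part of the tutorial's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vocabulary from the example; passing it explicitly fixes the column order.
vocab = ["best", "ever", "beer", "cat", "good", "dog"]
toy_vec = CountVectorizer(vocabulary=vocab)

# Each row is one review, each column is the count of one vocabulary word.
toy_X = toy_vec.fit_transform(["best beer ever", "good dog good dog"])
toy_rows = toy_X.toarray().tolist()
print(toy_rows[0])  # [1, 1, 1, 0, 0, 0]
print(toy_rows[1])  # [0, 0, 0, 0, 2, 2]
```

`CountVectorizer` stores counts rather than 0/1 indicators, which is why the second row contains 2s. Passing `binary=True` would record presence only.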

<img src="https://github.com/yala/MLCodeLab/blob/master/lab1/vectorizer.png?raw=true">


In [0]:
# Extract reviews and labels into two lists
trainText = [t[0] for t in train_set]
trainY = [t[1] for t in train_set]

devText = [t[0] for t in dev_set]
devY = [t[1] for t in dev_set]


testText = [t[0] for t in test_set]
testY = [t[1] for t in test_set]

# Keep only words that appear in at least 5 training reviews (min_df counts documents)
min_df = 5
# Cap the vocabulary at the 1000 most frequent remaining words
max_features = 1000
countVec = CountVectorizer(min_df=min_df, max_features=max_features)
# Learn vocabulary from train set
countVec.fit(trainText)

# Transform each list of reviews into a matrix of bag-of-words vectors
trainX = countVec.transform(trainText)
devX = countVec.transform(devText)
testX = countVec.transform(testText)

In [0]:
print("Shape of Train X {}\n".format(trainX.shape))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("Sample of the vocab:\n {}".format(np.random.choice(countVec.get_feature_names_out(), 20)))


## Step 3: Pick a model and experiment

Here we'll explore several linear models, namely Logistic Regression, Passive Aggressive, and Perceptron. It's very straightforward
to fit a new classifier and get preliminary results.
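
Because every sklearn classifier exposes the same `fit` and `score` methods, the comparison can also be written as a loop. The sketch below assumes synthetic 3-class data from `make_classification` as a stand-in for the review matrices; the `toy_*` names and `candidates` dictionary are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier, Perceptron

# Synthetic 3-class data standing in for the bag-of-words matrices (an assumption
# for illustration; it is not the beer review data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
toy_trainX, toy_trainY = X[:200], y[:200]
toy_devX, toy_devY = X[200:], y[200:]

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Passive Aggressive": PassiveAggressiveClassifier(random_state=0),
    "Perceptron": Perceptron(random_state=0),
}

toy_scores = {}
for name, model in candidates.items():
    model.fit(toy_trainX, toy_trainY)
    # score() returns mean accuracy on the given data
    toy_scores[name] = (model.score(toy_trainX, toy_trainY),
                        model.score(toy_devX, toy_devY))
    print("{} Train: {:.3f}  Dev: {:.3f}".format(name, *toy_scores[name]))
```

Reporting train and dev accuracy side by side makes any gap between the two easy to spot.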


In [0]:
lr = LogisticRegression()
passAgg    = PassiveAggressiveClassifier()
perceptron = Perceptron()

In [0]:
lr.fit(trainX, trainY)


print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))
print("--")

In [0]:
passAgg.fit(trainX, trainY) 
print("Passive Aggressive Train:", passAgg.score(trainX, trainY))
print("Passive Aggressive Dev:", passAgg.score(devX, devY))
print("--")

In [0]:
perceptron.fit(trainX, trainY) 
print("Perceptron Train:", perceptron.score(trainX, trainY))
print("Perceptron Dev:", perceptron.score(devX, devY))
print("--")

## Step 4: Analysis, Debugging the Model
To understand how to make the model better, it's important to understand what the model is learning and what it's getting wrong.

To do this, we can inspect the highest-weighted features of our best LR model and look at some examples the model got wrong on the development set.
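
A complementary view, not part of the original walkthrough, is a confusion matrix, which summarizes which classes get mistaken for which. The sketch below uses made-up labels (`toy_true`, `toy_pred`) with placeholder class ids 0, 1, 2; it does not use the beer data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for six reviews (0, 1, 2 are placeholder class ids)
toy_true = [0, 1, 2, 2, 1, 0]
toy_pred = [0, 2, 2, 2, 1, 1]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2])
print(cm)
# [[1 1 0]
#  [0 1 1]
#  [0 0 2]]
```

Off-diagonal entries count the errors; here, for instance, one class-1 review was predicted as class 2.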


In [0]:
lr = LogisticRegression()
lr.fit(trainX, trainY)
print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))
print("--")

In [0]:
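print("Interpreting LR")
# Vocabulary terms, aligned with the columns of lr.coef_
# (get_feature_names() was removed in scikit-learn 1.2)
vocab = np.array(countVec.get_feature_names_out())
num_features = 10
for label in range(3):
    coefs = lr.coef_[label]

    # Indices of the num_features largest coefficients (unordered)
    top = np.argpartition(coefs, -num_features)[-num_features:]
    # Sort those indices by weight, ascending
    top = top[np.argsort(coefs[top])]
    s_coef = coefs[top]
    scored_vocab = list(zip(vocab[top], s_coef))
    print("Top weighted features for label {}:\n \n {}\n -- \n".format(label, scored_vocab))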

In [0]:
# Collect the dev examples the model misclassified
devPred = lr.predict(devX)
errors = []
for indx in range(len(devText)):
    if devPred[indx] != devY[indx]:
        error = "Review: \n {} \n Predicted: {} \n Correct: {} \n ---".format(
            devText[indx],
            devPred[indx],
            devY[indx])
        errors.append(error)

np.random.seed(2)
print("Random dev error: \n {} \n \n {} \n \n{}".format(
        np.random.choice(errors,1),
        np.random.choice(errors,1),
        np.random.choice(errors,1))
     )

## Step 5: Play with regularization

We can see that LogisticRegression works best so far, but it is overfitting heavily: it does much better on the training set than on the development set. A common strategy for dealing with this is to add an extra penalty for model complexity, such as the sum of the squared model weights. We call this idea regularization.

In sklearn, it is very easy to test out various regularization amounts and tune the model. The smaller the parameter `C`, the stronger the regularization cost.
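
One way to organize this tuning is a loop that tries several values of `C` and keeps the one with the highest dev accuracy. This is a sketch on synthetic stand-in data, not the beer reviews; the `sweep_*` names, `best_C`, and `best_dev` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration, not the beer reviews)
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_classes=3, random_state=0)
sweep_trainX, sweep_trainY = X[:200], y[:200]
sweep_devX, sweep_devY = X[200:], y[200:]

best_C, best_dev = None, -1.0
for C in [1, .5, .1, .01]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(sweep_trainX, sweep_trainY)
    train_acc = model.score(sweep_trainX, sweep_trainY)
    dev_acc = model.score(sweep_devX, sweep_devY)
    print("C={}: train {:.3f}, dev {:.3f}".format(C, train_acc, dev_acc))
    # Keep the setting with the highest dev accuracy (ties keep the earlier, larger C)
    if dev_acc > best_dev:
        best_C, best_dev = C, dev_acc

print("Best C on dev:", best_C)
```

Selecting on dev accuracy rather than train accuracy matters here, since train accuracy tends to favor the least regularized model.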

In [0]:
lr = LogisticRegression(C=.5)
lr.fit(trainX, trainY)


print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))
print("--")

lr = LogisticRegression(C=.1)
lr.fit(trainX, trainY)


print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))
print("--")

In [0]:
lr = LogisticRegression(C=.01)
lr.fit(trainX, trainY)


print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))
print("--")

## Step 6: Adding in Ngrams

How does our model distinguish between a review that says:
```"great flavor and too bad there isn't more."```
versus
```"bad flavor and too great there isn't more."```

In our bag-of-words model, both have the same vector. To capture some of this ordering dependency, we generalize the bag-of-words model to include "n-grams": sequences of n consecutive words that occur in the training set. A "bi-gram" is a pair of words, a "tri-gram" is a triple, and so on.
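
The two phrases can be checked directly. This small sketch fits a throwaway `CountVectorizer` on just those two strings (an illustrative setup, separate from the vectorizers used for the real data) and compares the resulting rows.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

phrases = ["great flavor and too bad there isn't more.",
           "bad flavor and too great there isn't more."]

# Unigrams only: both phrases contain the same words, so the rows match
uni = CountVectorizer(ngram_range=(1, 1)).fit_transform(phrases).toarray()
print(np.array_equal(uni[0], uni[1]))  # True

# Unigrams + bigrams: "great flavor" and "bad flavor" become separate features
bi_vec = CountVectorizer(ngram_range=(1, 2))
bi = bi_vec.fit_transform(phrases).toarray()
print(np.array_equal(bi[0], bi[1]))  # False
print("great flavor" in bi_vec.vocabulary_)  # True
```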

Let's see how this improves our model.


In [0]:
# Keep only n-grams that appear in at least 5 training reviews (min_df counts documents)
min_df = 5
ngram_range = (1,3)
max_features = 5000
countVecNgram = CountVectorizer(min_df = min_df, ngram_range = ngram_range, max_features=max_features)
# Learn vocabulary from train set
countVecNgram.fit(trainText)

# Transform each list of reviews into a matrix of n-gram count vectors
trainXNgram = countVecNgram.transform(trainText)
devXNgram = countVecNgram.transform(devText)
testXNgram = countVecNgram.transform(testText)

In [0]:
lrNgram = LogisticRegression(C=1)
lrNgram.fit(trainXNgram, trainY)
print("Logistic Regression Train:", lrNgram.score(trainXNgram, trainY))
print("Logistic Regression Dev:", lrNgram.score(devXNgram, devY))
print("--")

lrNgram = LogisticRegression(C=.5)
lrNgram.fit(trainXNgram, trainY)
print("Logistic Regression Train:", lrNgram.score(trainXNgram, trainY))
print("Logistic Regression Dev:", lrNgram.score(devXNgram, devY))
print("--")

lrNgram = LogisticRegression(C=.1)
lrNgram.fit(trainXNgram, trainY)
print("Logistic Regression Train:", lrNgram.score(trainXNgram, trainY))
print("Logistic Regression Dev:", lrNgram.score(devXNgram, devY))
print("--")

lrNgram = LogisticRegression(C=.01)
lrNgram.fit(trainXNgram, trainY)
print("Logistic Regression Train:", lrNgram.score(trainXNgram, trainY))
print("Logistic Regression Dev:", lrNgram.score(devXNgram, devY))
print("--")

## Step 7: Take best model, and report results on Test

In [0]:
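# Note: lrNgram is the last model fit above (C=.01). If a different C scored best
# on dev, refit lrNgram with that value before reporting test accuracy.
print("Logistic Regression Test:", lrNgram.score(testXNgram, testY))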


## Next Step: Doing it on your own