# Lab 5: Sentiment classification

In this week's lab, we perform supervised text classifcation. You will see:
 * How to perform text classification using Sklearn, specifically sentiment classification
 * How to evaluate a classifier with key metrics including Precision, Recall, F1, and confusion matrix
 * The impact of imbalanced class distributions
 * How to use FeatureUnion and Pipelines to incorporate multiple features
 * How to incorporate different types of features (sparse and dense)
 * How to use GridSearchCV to tune parameters



Our dataset this week comes from reviews for Android apps on the Play Store. 


In [3]:
local_file = "reviews_Apps_for_Android_5.json.gz"
#!curl -o  $local_file https://storage.googleapis.com/tad2018/reviews_Apps_for_Android_5.json.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 91.0M  100 91.0M    0     0  1048k      0  0:01:28  0:01:28 --:--:-- 1459k00:19 1458k


We will limit the number of reviews to make it smaller so that the lab is faster to complete. You may remove the limit if you desire.

In [4]:
import gzip
import json
import pandas as pd

review_list = list()

# Construct a dataframe, by opening the JSON file line-by-line
with gzip.open(local_file) as jsonfile:
  for i, line in enumerate(jsonfile):
    review = json.loads(line)
    #print(review)
    #if (i >= review_limit): break
    # asin is the product number, overall is the number of stars awarded by the user for that product
    review_list.append( (review['asin'], review['reviewerID'], review['reviewText'], review['summary'], review['overall']))
                   
print("We have %d reviews in our dataset"  % len(review_list))

collabels = ['productId', 'reviewerID', 'reviewText', 'summary', 'overall']
reviews = pd.DataFrame(review_list, columns=collabels)

We have 752937 reviews in our dataset


Let's explore the data before we jump into classification.

In [5]:
reviews.head(20)

Unnamed: 0,productId,reviewerID,reviewText,summary,overall
0,B004A9SDD8,A1N4O8VOJZTDVB,"Loves the song, so he really couldn't wait to ...",Really cute,3.0
1,B004A9SDD8,A2HQWU6HUKIEC7,"Oh, how my little grandson loves this app. He'...",2-year-old loves it,5.0
2,B004A9SDD8,A1SXASF6GYG96I,I found this at a perfect time since my daught...,Fun game,5.0
3,B004A9SDD8,A2B54P9ZDYH167,My 1 year old goes back to this game over and ...,We love our Monkeys!,5.0
4,B004A9SDD8,AFOFZDTX5UC6D,There are three different versions of the song...,This is my granddaughters favorite app on my K...,5.0
5,B004A9SDD8,A331GYAT4ESYI3,THis is just so cute and a great app for littl...,so cute,5.0
6,B004A9SDD8,A2YEHF8T823TDC,I watch my great grandson 4 days a week and it...,Terrific!,5.0
7,B004A9SDD8,A3699WHISXX94Z,This app is wild and crazy. Little ones love ...,Five Little Monkeys,5.0
8,B004A9SDD8,A2BXV49EIES2TB,love love love this app. I was going through d...,love but to quite,5.0
9,B004A9SDD8,A37HM5TMCMHJES,"Very cute, with alot of items to move about. ...",Cute,5.0


Create a histogram of the scores ('overall') rating. 

In [6]:
reviews.hist('overall')

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f85a9e82908>]],
      dtype=object)

It seems that most reviews are positive. For this exercise, we'll be looking at the task of binary classification. We will separate the reviews into two classes for the purposes of binary sentiment classification -- Like vs Not like. 

Beyond binary classification, there are other ways to formulate this task. For example, we could also try to predict each of the rating classes. What type of classification would this be instead?  


### Create the (class) labels

#### Exercise
*  Create a function, create_label that outputs a vector of class labels (Y vector). Reviews with a score strictly greater than *3* should be assigned a positive label (`1`), the label should be, `0` otherwise. 
*   Apply the function to the *overall* and create a new data column in the reviews dataframe, `reviews['Class']`



In [7]:
#@title
##Solution

# Alternate solution:
#reviews['Class'] = 1 * (reviews['overall'] > 3)

def create_label(x):
    if x > 3:
        return 1 # 'positive' 
    return 0 # 'negative'
  
reviews['Class'] = reviews.overall.apply(create_label)

#### Exercise  
* Print the class prior probabilities, P(c)

In [8]:
print(reviews['Class'])

0         0
1         1
2         1
3         1
4         1
5         1
6         1
7         1
8         1
9         1
10        1
11        0
12        1
13        0
14        1
15        1
16        1
17        0
18        1
19        1
20        1
21        1
22        1
23        1
24        0
25        1
26        0
27        1
28        1
29        1
         ..
752907    1
752908    1
752909    1
752910    1
752911    1
752912    1
752913    1
752914    1
752915    1
752916    1
752917    1
752918    1
752919    1
752920    1
752921    1
752922    1
752923    1
752924    1
752925    1
752926    1
752927    1
752928    1
752929    1
752930    1
752931    1
752932    1
752933    0
752934    1
752935    1
752936    1
Name: Class, Length: 752937, dtype: int64


You should see that most (~72% of the labels are positive overall in the dataset). This means we have an issue of class imbalance.

### Train/Validation/Test sets

Next, lets split the reviews dataframe in to train, validation, and test sets.  Recall that training data is used to train our model.  Validation data is used to develop our model (develop new features, etc...) and tune parameters.  The final result should be reported on the test data (that we haven't looked at throughout). 

1. Split your data (all instances) into *training* and *testing* (80/20 is a reasonable starting point)
2. Split the **training data** into *training* and *validation* (again, 80/20 is a typical split).

Note that as the size of the dataset increases, the ratio of train/validation/test may vary in other datasets.  For example, if you have millions (or billions) of instances the splits could be more like 99%/0.5%/0.5% for train/validation/test.  This is because you are still using a large number of instances to evaluate the model.

We shuffle the data randomly to avoid potential ordering bias.  Note that because we are all using different random splits of the data every time we run the notebook the results will differ (very) slightly due to differences in the random splits. 

In [9]:
# shuffle the data randomly to avoid possible bias.
random_reviews = reviews.sample(frac=1)

# You may change this, but it's set to not be "too big".
review_limit = min(200000, len(random_reviews))
random_reviews = random_reviews.iloc[:review_limit, :]

# 1. Split the data 80/20 train/test
train_split = int(len(random_reviews) * 0.8)
tmp_train = random_reviews.iloc[:train_split,:]
test_data = random_reviews.iloc[train_split:,:]

# 2. Split the train data into a train/validation split that's 80% train, 20% developemnt 
validation_split = int(train_split * 0.8)
train_data = tmp_train.iloc[:validation_split,:]
validation_data = tmp_train.iloc[validation_split:,:]


Lets see some statistics of our resulting datasets.

In [10]:
print('Training set contains {:d} reviews.'.format(len(train_data)))
print('Vadlidation set contains {:d} reviews.'.format(len(validation_data)))
print('Test set contains {:d} reviews.'.format(len(test_data)))

number_positive_train = sum(train_data['Class'] == 1)
number_positive_validation = sum(validation_data['Class'] == 1)
number_positive_test = sum(test_data['Class'] == 1)

print('Training set contains %0.0f%% positive reviews' % (100*number_positive_train/len(train_data)))
print('Validation set contains %0.0f%% positive reviews' % (100*number_positive_validation/len(validation_data)))
print('Test set contains %0.0f%% positive reviews' % (100*number_positive_test/len(test_data)))

Training set contains 128000 reviews.
Vadlidation set contains 32000 reviews.
Test set contains 40000 reviews.
Training set contains 72% positive reviews
Validation set contains 72% positive reviews
Test set contains 72% positive reviews


We can see that both the train/dev/test data all have about 72% positive labels.  This is the same percentage as the overall collection. This is a good sanity check to make sure the data isn't biased.



Now we have our instances with labels.  Now, we need to create a text representation.  We will start by processing the reviews with spaCy to tokenize and normalize the text. 


In [13]:
import spacy

# Load the medium english model. 
# We will use this model to get embedding features for tokens later.
#!python -m spacy download en_core_web_md

nlp = spacy.load('/usr/lib/python3.7/site-packages/en_core_web_md/en_core_web_md-2.0.0', disable=['ner'])
nlp.remove_pipe('tagger')
nlp.remove_pipe('parser')

# Download a stopword list
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/stuart/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [14]:
#@Tokenize
def spacy_tokenize(string):
  tokens = list()
  doc = nlp(string)
  for token in doc:
    tokens.append(token)
  return tokens

#@Normalize
def normalize(tokens):
  normalized_tokens = list()
  for token in tokens:
    normalized = token.text.lower().strip()
    if ((token.is_alpha or token.is_digit)):
      normalized_tokens.append(normalized)
  return normalized_tokens
  return normalized_tokens

#@Tokenize and normalize
def tokenize_normalize(string):
  return normalize(spacy_tokenize(string))

Test to make sure the normalization is working.  Note that for this simple example we aren't keeping punctuation.  It might be something that could be useful as an option to test later.

In [15]:
tokenize_normalize("the app is fun. very happy.")

['the', 'app', 'is', 'fun', 'very', 'happy']

The result should be [the, app, is, fun, very, happy].

Now we are ready to create the features. As this is for text classification, our features are terms. We'll start with a simple one-hot encoding.

#### Exercise
* Create a `CountVectorizer` initliazed with `tokenize_normalize` function for the tokenizer parameter. Also use `binary=True` to create a one-hot encoding.
* Fit the vectorizer model on the `train_data`, specifically the `reviewText` 
* Transform each of the three datasets using the vectorizer: `train_data`, `validation_data`, and `test_data`. Assign the result to correspondingly named 'features' variables, `train_features`, `validation_features`, `test_features`

**Hint**: Recall that `fit` builds a vocabulary and does pre-processing. `transform()`and then transforms the each review into a sparse vector.  


In [17]:
from sklearn.feature_extraction.text import CountVectorizer 
one_hot_vectorizer = CountVectorizer(tokenizer=tokenize_normalize, binary=True)
one_hot_vectorizer.fit(train_data["reviewText"])
train_features = one_hot_vectorizer.transform(train_data["reviewText"])
validation_features = one_hot_vectorizer.transform(validation_data["reviewText"])
test_features = one_hot_vectorizer.transform(test_data["reviewText"])

### Exercise
* Extract the labels for each of the datasets and put them into variables (for convenience)
* Recall the labels are in `['Class']` of the corresponding dataframe.
* Assign the labels to the corresponding variable, for example `train` should be assigned to `train_labels`.  Follow the same convention for the other two sets of labels as well.



In [18]:
train_labels = train_data['Class']
validation_labels = validation_data['Class']
test_labels = test_data['Class']

Now lets train a classifier using the training data.

We will train a [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) classifier. Recall from lecture that the first step for NB is to assume a distribution of the feature data. A "bernoulli distribution" is for binary values, whether a value is present or not.  We use the
[BernoulliNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html) for features with a one-hot encoding.

#### Exercise
* Import and create a  `BernoulliNB` classifier for the data with default arguments. 
* `Fit` (train) the classifier model using the `train_features` and `train_labels`
* Assign the output of training your classifier to a `nb_model` variable.

**Question:** What would the appropriate Naive Bayes classifier be if we used token frequency counts instead of a one-hot encoding?


In [19]:
#@title
#Solution

from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB()
nb_model = classifier.fit(train_features, train_labels)

Can we see how well our classifier is doing? The `score()` function on classifiers calculates the classifier's accuracy.

In [20]:
nb_model.score(train_features, train_labels)

0.8254609375

On the training data we get a accuracy of about 83%. 

#### Exercise
* Score the validation dataset using the trained classifier 

In [21]:
validation_score = nb_model.score(validation_features, validation_labels)
print(validation_score)

0.8064375


This gives us an accuracy of around 81.5%.  Pretty similar to the training data effectiveness, but slightly lower.

Now we have our classifier learned, lets try it out on some example reviews. 
- The `predict()` function just returns the MAP estimate (highest probability class)
- The `predict_proba()` returns the normalized probability. 


In [22]:
print(nb_model.predict(one_hot_vectorizer.transform(["the app is awful. total spam."])))
print(nb_model.predict_proba(one_hot_vectorizer.transform(["the app is awful. total spam."])))

print(nb_model.predict(one_hot_vectorizer.transform(["the app is fun. very happy"])))
print(nb_model.predict_proba(one_hot_vectorizer.transform(["the app is fun. very happy"])))

[1]
[[0.3780106 0.6219894]]
[1]
[[9.24458822e-04 9.99075541e-01]]


We should expect the  first to be negative, the second positive. However, due to variation in the random subsets of data used (and the limited size, this doesn't always work).  This is why it's important to inspect the model outputs.   

* Why might the model perform better on the positive case rather than the negative?

Below is a function that prints out a evaluation summary with key metrics we used in class as well as a 'classification report'. 

In [23]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import fbeta_score

def evaluation_summary(description, predictions, true_labels):
  print("Evaluation for: " + description)
  precision = precision_score(predictions, true_labels)
  recall = recall_score(predictions, true_labels)
  accuracy = accuracy_score(predictions, true_labels)
  f1 = fbeta_score(predictions, true_labels, 1) #1 means f_1 measure
  print("Classifier '%s' has Acc=%0.3f P=%0.3f R=%0.3f F1=%0.3f" % (description,accuracy,precision,recall,f1))
  print(classification_report(predictions, true_labels, digits=3))
  print('\nConfusion matrix:\n',confusion_matrix(true_labels, predictions)) # Note the order here is true, predicted, odd.



The sklearn.metrics package includes score functions, including the key [classification metrics](https://scikit-learn.org/stable/modules/classes.html#classification-metrics) discussed in lecture. 

In [24]:
validation_predicted_labels = nb_model.predict(validation_features)
evaluation_summary("One-hot NB",  validation_predicted_labels, validation_labels)

Evaluation for: One-hot NB
Classifier 'One-hot NB' has Acc=0.806 P=0.894 R=0.847 F1=0.870
              precision    recall  f1-score   support

           0      0.578     0.676     0.623      7576
           1      0.894     0.847     0.870     24424

   micro avg      0.806     0.806     0.806     32000
   macro avg      0.736     0.761     0.746     32000
weighted avg      0.819     0.806     0.811     32000


Confusion matrix:
 [[ 5119  3737]
 [ 2457 20687]]


See the documentation for [confusion_matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html) for how to make it look 'pretty'. 

It is important in practice to always have a *baseline* - you need to know that you are better than randomly guessing the class. As discussed in lecture, Sklearn provides instances of [DummyClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) for this purpose.  A dummy classifier has the same interface as all other Estimators (classifiers) in SKLearn.  We `fit` (train) it on the training data and `predict` on the test data.

#### Exercise: 
* Create and train two DummyClassifier instances with `strategy=stratified` (which we sometimes will call 'Random') and `strategy=most_frequent` (MF) which assigns the most common (majority class) label.
* Print an evaluation summary of each on the validation data.
* Study the output. How do these baselines compare with the one-hot encoding classifier?

In [25]:
from sklearn.dummy import DummyClassifier

random = DummyClassifier(strategy="stratified") 
mf = DummyClassifier(strategy="most_frequent")
random.fit(test_features, test_labels)
mf.fit(test_features, test_labels)

random_predict = random.predict(validation_features)
mf_predict = mf.predict(validation_features)

evaluation_summary("Random", random_predict, validation_labels)
evaluation_summary("Most Frequent", mf_predict, validation_labels)

Evaluation for: Random
Classifier 'Random' has Acc=0.601 P=0.725 R=0.724 F1=0.724
              precision    recall  f1-score   support

           0      0.277     0.278     0.278      8822
           1      0.725     0.724     0.724     23178

   micro avg      0.601     0.601     0.601     32000
   macro avg      0.501     0.501     0.501     32000
weighted avg      0.601     0.601     0.601     32000


Confusion matrix:
 [[ 2454  6402]
 [ 6368 16776]]
Evaluation for: Most Frequent
Classifier 'Most Frequent' has Acc=0.723 P=1.000 R=0.723 F1=0.839
              precision    recall  f1-score   support

           0      0.000     0.000     0.000         0
           1      1.000     0.723     0.839     32000

   micro avg      0.723     0.723     0.723     32000
   macro avg      0.500     0.362     0.420     32000
weighted avg      1.000     0.723     0.839     32000


Confusion matrix:
 [[    0  8856]
 [    0 23144]]


  'recall', 'true', average, warn_for)


Ok, comparing the results with the dummy classifiers, our classifier's performance can be best described as "maybe OK".  The  class prior 'random' classifier gets an accuracy of about 60% and F1 of about 0.72.  The majority class has an F1 of 0.84 and an accuracy 0f 0.725.  This shouldn't be suprisingly; that's the percentage of the data that is positive. F1 is a useful summary measure combining both precision and recall, you will use this in the coursework as well.  Let's see if we can improve the effectiveness.

####Exercise: 
* Now train and evaluate a `LogisticRegression` classifier. 
* Specify a `solver=saga`, a variation of stochastic gradient decent
* Evaluate the effectiveness on the validation data. 

This could take a minute or two.

In [26]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(solver="saga")
logistic.fit(test_features, test_labels)
predict = logistic.predict(validation_features)
evaluation_summary("Logisitic", predict, validation_labels)

Evaluation for: Logisitic
Classifier 'Logisitic' has Acc=0.849 P=0.918 R=0.878 F1=0.898
              precision    recall  f1-score   support

           0      0.667     0.758     0.709      7793
           1      0.918     0.878     0.898     24207

   micro avg      0.849     0.849     0.849     32000
   macro avg      0.793     0.818     0.804     32000
weighted avg      0.857     0.849     0.852     32000


Confusion matrix:
 [[ 5906  2950]
 [ 1887 21257]]




LR should provide a (small) improvement on F1 to approximately 0.9, which is better than a BernoulliNB for this data.

Now, let's experiment with different features for LogisticRegression beyond a one-hot encoding.

## Adding Features

Recall that we want to be able to incorporate features from external sources. 
We will use a simple lexicon dictionary of 'good' and 'bad' words as features. 

SKLearn has multiple ways of combining features.   We'll look at both of them in this lab. 

For part I, we will do this using SKLearn [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). The first (and easiest) ways is to use a [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) as part of a pipeline.There's a nice post on how to be a [Kaggle Pro using FeatureUnion and Pipelines](https://www.kaggle.com/metadist/work-like-a-pro-with-pipelines-and-feature-unions). 

A pipeline is useful because it allows us to pass in a dataset and extract out different fields that we can process and vectorize separately. 

Recall that there is also a "summary" field in our data. Let's use it to crate a separate set of features. 

Unfortunately, Pandas does not play nice with SKLearn pipelines natively.  As result, to access the columns in the Panda data we need to use a transformer to select a column from the data.  You will use this as one of the steps in your pipeline. 

In [27]:
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.    """

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

#### Exercise
* Create a pipeline that creates a FeatureUnion between two features from the `reviewText` and `summary` fields.
* Inside of the FeatureUnion, (for each field separately) use the ItemSelector to select the field, then create a vectorization step with the CountVectorizer that creates a one-hot encoding of each field. 

Unlike in the Kaggle example, do not include the classifier as part of your pipeline (yet).  

* Apply fit / transform on the three pandas data frames to run your pipeline (train, validation, test).
* Separately, train and evaluate a  logistic regression model on the resulting features.


In [42]:
from sklearn.pipeline import Pipeline, FeatureUnion 
pipeline = Pipeline([("union", FeatureUnion([
    ("reviewText", Pipeline([
        ("select", ItemSelector("reviewText")),
        ("vec", one_hot_vectorizer)
    ])),
    ("summary", Pipeline([
        ("select", ItemSelector("summary")),
        ("vec", one_hot_vectorizer)
    ]))
]))])

In [29]:
union_train_features = pipeline.fit_transform(train_data)
union_validation_features = pipeline.transform(validation_data)
union_test_features = pipeline.transform(test_data)

logistic_reg = LogisticRegression(solver="saga")
logistic_reg.fit(union_train_features, train_labels)
predict = logistic_reg.predict(union_validation_features)
evaluation_summary("Pipeline", predict, validation_labels)

Evaluation for: Pipeline
Classifier 'Pipeline' has Acc=0.858 P=0.926 R=0.883 F1=0.904
              precision    recall  f1-score   support

           0      0.678     0.779     0.725      7708
           1      0.926     0.883     0.904     24292

   micro avg      0.858     0.858     0.858     32000
   macro avg      0.802     0.831     0.814     32000
weighted avg      0.866     0.858     0.861     32000


Confusion matrix:
 [[ 6002  2854]
 [ 1706 21438]]


This provides a significant boost in effectiveness up to around 0.925.  Having good features matters! 

## Combining sparse and dense features ##

We can build upon last week's lab, to provide more features for sentiment analysis - in particular, we can represent each review by the average of its constituent word vectors.  

The medium and large spaCy models include word vector representations built into them.  We will use these instead of the glove embeddings we used in last lab.


In [30]:
tokens = spacy_tokenize("Great fun. afskfsd")
for token in tokens:
  print(token.text, token.has_vector, token.vector_norm, token.is_oov, token.vector)

Great True 5.4395933 False [-9.3846e-02  5.8296e-01 -1.9271e-02 -7.0072e-02  1.8095e-01  1.5343e-01
  1.7444e-01 -1.8207e-01 -6.6300e-02  2.3681e+00 -1.2753e-01  1.7784e-02
  1.0581e-01  1.9629e-01 -2.5103e-01 -2.7987e-01 -2.9529e-01  1.1575e+00
 -2.0997e-01  8.3031e-02 -2.6101e-02 -2.3911e-01  2.7443e-01 -2.2339e-01
 -4.9437e-02  1.9215e-01  1.2176e-01  2.2273e-01 -1.2051e-01  1.9972e-01
  2.1834e-01  3.0302e-01 -1.7650e-02  6.6369e-02  1.5469e-01 -2.7746e-01
  2.9550e-01 -3.5517e-01 -3.6803e-01 -2.1441e-01 -1.6825e-02  3.2859e-01
 -1.6417e-01 -4.3756e-02  3.2168e-01  4.7823e-01 -3.0072e-01  3.5865e-01
  1.8450e-01 -1.1995e-01 -4.8905e-02  3.7055e-01  4.4224e-01  1.7276e-01
  1.8705e-01  2.3734e-01  5.5195e-03  1.5334e-01 -8.0614e-02 -9.8517e-04
 -1.3972e-01 -5.1074e-01 -1.0340e-01  4.5437e-01  6.5120e-02 -1.9894e-01
  2.0476e-01  2.5925e-01  1.5235e-01  6.9943e-02  2.5109e-01  1.1591e-01
 -1.1138e-01  1.0800e-01  2.0717e-01 -1.2912e-01 -9.8970e-02 -8.5548e-02
 -3.3701e-01  3.1039e-01

The classifier below is a slightly modified version from Lab 4.  The key change is that it performs tokenization and uses the token vector built in to spaCy. 


In [31]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AverageEmbeddingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.dimension = 300
        
    def fit(self, X, y):
        return self
      
    def transform(self, X):  
      # Skip OOV terms. 
      # Return 0 for all dimensions if no words are in the vocabulary.
      dense_matrix =  np.array([ 
          np.mean([token.vector for token in self.tokenizer(doc) if not token.is_oov]
                or [np.zeros(self.dimension)], axis=0)
          for doc in X
      ])
      return dense_matrix

In [32]:
# Note: We don't call fit here because it doesn't do anything.
embedding_vectorizer = AverageEmbeddingTransformer(spacy_tokenize)
train_embedding_features = embedding_vectorizer.transform(train_data['reviewText'])
validation_embedding_features = embedding_vectorizer.transform(validation_data['reviewText'])
test_embedding_features = embedding_vectorizer.transform(test_data['reviewText'])

#### Exercise
* Train and evaluate a logistic regression model using the embedding features.


In [33]:
log_reg = LogisticRegression(solver="saga")
log_reg.fit(train_embedding_features, train_labels)
predict = log_reg.predict(validation_embedding_features)
evaluation_summary("AvgEmbedding", predict, validation_labels)

Evaluation for: AvgEmbedding
Classifier 'AvgEmbedding' has Acc=0.837 P=0.923 R=0.861 F1=0.891
              precision    recall  f1-score   support

           0      0.611     0.752     0.674      7201
           1      0.923     0.861     0.891     24799

   micro avg      0.837     0.837     0.837     32000
   macro avg      0.767     0.807     0.783     32000
weighted avg      0.853     0.837     0.842     32000


Confusion matrix:
 [[ 5415  3441]
 [ 1786 21358]]


You should see that these are not as effective as words.  The score of embeddings on their own is probably around 0.89, but this is not bad considering we have just 300 features per post!  This is actually very compact and powerful. 

But, we can do better if we combine both types of representations. The features from embeddings are dense.  The word features are a sparse vector.  How do we combine them?  They need to be consistent.  We will convert the dense embeddings to sparse matrices.  *Question*: Why do we not convert the sparse word features to dense features instead?

In [34]:
print(type(train_embedding_features))
# Below should be changed to the name of the variable that is the output of your pipeline. 
print(type(union_train_features)) 

<class 'numpy.ndarray'>
<class 'scipy.sparse.csr.csr_matrix'>


In [35]:

from scipy.sparse import csr_matrix
train_sparse_embeddings = csr_matrix(train_embedding_features)
validation_sparse_embeddings = csr_matrix(validation_embedding_features)
type(validation_sparse_embeddings)

scipy.sparse.csr.csr_matrix

We now have sparse features.  We can now combine them together.

#### Exercise

1. Combine the embedding feature matrix (e.g. train_embedding_features) with the one-hot encoding matrix (e.g. one_hot_train_features) for both the training and validation feature data.
2. Train a and evaluate a logistic regression model on the combined features.

**Hint: ** We need to "stack" the matrices to merge them.  The common functions are "hstack" (horizontal) stack adds columns and "vstack" (vertical stack) that adds rows.  Numpy has built-in support for dense arrays / matrices.  However, the one-hot encoding features are sparse features. We converted the dense embedding matrix to a sparse matrix.  We can now  use the  [sparse library](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html) in scipy to concatenate the matrices by 'horizontally stacking' them.  



In [36]:
from scipy.sparse import hstack
stack_train = hstack([train_embedding_features, union_train_features])
stack_validation = hstack([validation_embedding_features, union_validation_features])

In [37]:
log_regression = LogisticRegression(solver="saga")
log_regression.fit(stack_train, train_labels)
predict = log_regression.predict(stack_validation)
evaluation_summary("Embedding", predict, validation_labels)

Evaluation for: Embedding
Classifier 'Embedding' has Acc=0.862 P=0.927 R=0.887 F1=0.907
              precision    recall  f1-score   support

           0      0.690     0.784     0.734      7795
           1      0.927     0.887     0.907     24205

   micro avg      0.862     0.862     0.862     32000
   macro avg      0.809     0.836     0.820     32000
weighted avg      0.870     0.862     0.865     32000


Confusion matrix:
 [[ 6114  2742]
 [ 1681 21463]]




Putting them together should provide a little boost, possibly close to 0.930.

Additional parameter tuning (or better features!) could improve this model further.

## Tuning parameters: combining Pipelines + Grid Search

#### Exercise: Create a simple pipeline model with a classifier
* Create a pipeline model for a VERY simple  model.
* Select the `reviewText` column (only)
* Create a one-hot encoding with CountVectorizer
* Use a LogisticRegression model classifier with solver=saga
* Train the model (training data) and evaluate the model (validation data).

**Hint**: The pipeline should have three steps in it. 

In [38]:
pipe = Pipeline([
    ("select", ItemSelector("reviewText")),
    ("vec", one_hot_vectorizer),
    ("class", LogisticRegression(solver="saga"))
])

pipe.fit(train_data, train_labels)
predict = pipe.predict(validation_data)
evaluation_summary("Pipeline", predict, validation_labels)

Evaluation for: Pipeline
Classifier 'Pipeline' has Acc=0.858 P=0.927 R=0.882 F1=0.904
              precision    recall  f1-score   support

           0      0.675     0.781     0.724      7659
           1      0.927     0.882     0.904     24341

   micro avg      0.858     0.858     0.858     32000
   macro avg      0.801     0.831     0.814     32000
weighted avg      0.867     0.858     0.861     32000


Confusion matrix:
 [[ 5979  2877]
 [ 1680 21464]]


#### Exercise
* Use GridSearchCV to try one-hot vs bag-of-words (binary=True/False) one the pipeline.
* Specify cv=2 to only perform two-fold cross validation (the default is 3, which is slower)
* Specify the scoring= parameter.  Make it use f1_macro instead of the default (accuracy).

**Hint:** You can see an example of this to tune parameters for text classification in the [20 newsgroups example](https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py). 

This trains multiple models and selects the best set of model parameters. This can be expensive with large numbers of parameters.


In [40]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

grid = GridSearchCV(pipe, {}, cv=2, scoring=make_scorer(f1_score))
grid.fit(train_data, train_labels)
predict = grid.predict(validation_data)
evaluation_summary("Grid one-hot", predict, validation_labels)



Evaluation for: Grid one-hot
Classifier 'Grid one-hot' has Acc=0.858 P=0.927 R=0.882 F1=0.904
              precision    recall  f1-score   support

           0      0.675     0.781     0.724      7663
           1      0.927     0.882     0.904     24337

   micro avg      0.858     0.858     0.858     32000
   macro avg      0.801     0.831     0.814     32000
weighted avg      0.867     0.858     0.861     32000


Confusion matrix:
 [[ 5981  2875]
 [ 1682 21462]]


We should find that using counts doesn't help this classification task.

## FInal evaluation on test data

We developed our model and saw how well it performed on validation data so far.  The final step is to evaluate on held out test data that has not been used at all in our development process.  We should not develop our features / model based on the test results.  It's only used for reporting the final numbers.  This is like a submission to a leaderboard where you don't know the test data.

#### Exercise: **Final evaluation**
* The final step is to report classification numbers and confusion matrix on the held out test set 
* Evaluate the best of each of the different classifiers models (Dummy classifiers, Naive Bayes, One-Hot LR (multiple fields), and the  final combined model with embeddings) on the test data. 

In [45]:
evaluation_summary('MF', mf.predict(test_features), test_labels)
evaluation_summary('NB', nb_model.predict(test_features), test_labels)
evaluation_summary('One-Hot', pipe.predict(test_data), test_labels)
evaluation_summary('Combined', grid.predict(test_data), test_labels)

Evaluation for: MF
Classifier 'MF' has Acc=0.723 P=1.000 R=0.723 F1=0.839
              precision    recall  f1-score   support

           0      0.000     0.000     0.000         0
           1      1.000     0.723     0.839     40000

   micro avg      0.723     0.723     0.723     40000
   macro avg      0.500     0.361     0.420     40000
weighted avg      1.000     0.723     0.839     40000


Confusion matrix:
 [[    0 11081]
 [    0 28919]]
Evaluation for: NB
Classifier 'NB' has Acc=0.804 P=0.892 R=0.845 F1=0.868
              precision    recall  f1-score   support

           0      0.574     0.671     0.619      9482
           1      0.892     0.845     0.868     30518

   micro avg      0.804     0.804     0.804     40000
   macro avg      0.733     0.758     0.743     40000
weighted avg      0.817     0.804     0.809     40000


Confusion matrix:
 [[ 6361  4720]
 [ 3121 25798]]
Evaluation for: One-Hot
Classifier 'One-Hot' has Acc=0.861 P=0.928 R=0.885 F1=0.906
            

# Feedback quiz

Now that you have complete the lab, please take the time to complete the [lab feedback quiz](https://moodle.gla.ac.uk/mod/feedback/view.php?id=1120741) on Moodle. These quizzes are important, as they help us to calibrate the lab to class progress, and consider how to improve the class for next year.



# Additional exercises

## Add external features

### Sentiment lexicon
You could add external features from sentiment lexicons and libraries. One widely used library for sentiment classification is VADER, which was developed by manually labeling data and manually assigning word scores for sentiment and polarity.

VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text by C.J. Hutto and Eric Gilbert
Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

There is a python library [VADER sentiment library](https://github.com/cjhutto/vaderSentiment). Below is some sample code to see how it's used. 
* How well does it predict the labels on its own as a classifier?
* You might apply it as a new feature and retrain LR classifier. 


In [None]:

!pip install vaderSentiment

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Example code:
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))
    


### Paragraph vectors

You could also try using paragraph vectors (doc2vec) instead of averaging word embeddings from the previous lab.



## Deeper evaluation ##
Do you trust this classifier? Look at it's failures.

When was a post predicted to be negative, when it was predicted strongly positive? You can use to obtain a `predict_proba()` which predicts the probability of the post in the class.


