<a href="https://colab.research.google.com/github/steve-wilson/ds32019/blob/master/06_Machine_Learning_DS3Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamentals of Text Analysis for User Generated Content @ [DS3](https://www.ds3-datascience-polytechnique.fr/)

# Part 6: Machine Learning

## Currently Under Construction!
## If you are seeing this, you aren't viewing the final version of this notebook.
## Please check back later

[<- Previous: Text Embeddings](https://colab.research.google.com/drive/1kXX_ifuY5cnHqUt9Xt0cmzAp_NiVho9n)

Dates: June 27-28, 2019

Facilitator: [Steve Wilson](https://steverw.com)

---



## Initial Setup

- **Run "Setup" below first.**

    - This will load libraries and download some resources that we'll use throughout the tutorial.

    - You will see a message reading "Done with setup!" when this process completes.



In [0]:
#@title Setup (click the "run" button to the left) {display-mode: "form"}

## Setup ##

# imports

# built-in Python libraries
# -------------------------

# counting and data management
import collections
# for dealing with reddit data
import json
# operating system utils
import os
# randomization functions
import random
# regular expressions
import re
# additional string functions
import string
# system utilities
import sys
# request() will be used to load web content
import urllib.request


# 3rd party libraries
# -------------------

# Natural Language Toolkit (https://www.nltk.org/)
import nltk

# download punctuation related NLTK functions
# (needed for sent_tokenize())
nltk.download('punkt')
# download NLTK part-of-speech tagger
# (needed for pos_tag())
nltk.download('averaged_perceptron_tagger')
# download wordnet
# (needed for lemmatization)
nltk.download('wordnet')
# download stopword lists
# (needed for stopword removal)
nltk.download('stopwords')
# dictionary of English words
nltk.download('words')

# numpy: matrix library for Python
import numpy as np

# scipy: scientific operations
# works with numpy objects
import scipy
# sparse matrix formats (used by docs2matrix_sparse() below)
import scipy.sparse

# matplotlib (and pyplot) for visualizations
import matplotlib
import matplotlib.pyplot as plt

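# sklearn for basic machine learning operations
import sklearn
import sklearn.manifold
import sklearn.cluster
# submodules used for the classifiers, regressors, and metrics below
import sklearn.linear_model
import sklearn.metrics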

# wordcloud tool
!pip install wordcloud
from wordcloud import WordCloud

# for checking object memory usage
!pip install pympler
from pympler import asizeof

# for spelling correction
!pip install pyspellchecker
import spellchecker

!pip install spacy
import spacy
NLP = spacy.load('en',disable=['ner','parser'])

!pip install fasttext
import fasttext

# re-defining some (slightly modified) functions from earlier
# -----------------------------------------------------------
def text_to_lemma_frequencies(text):    
    doc = NLP(text)
    words = [token.lemma_ for token in doc if token.is_stop != True and token.is_punct != True]
    return collections.Counter(words)

# modified version to save memory usage
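def docs2matrix_sparse(document_list, vocab2index=None, allow_vocab_update=True):
    
    # use None instead of a mutable default argument ({}), which would be
    # shared, and silently reused, across separate calls to this function
    if vocab2index is None:
        vocab2index = {}
    latest_index = 0
    vocab_was_set = False
    if vocab2index:
        latest_index = max(vocab2index.values()) + 1
        vocab_was_set = True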
    
    # make coordinates
    data = []
    rows = []
    cols = []
    for row,doc in enumerate(document_list):
        lf = text_to_lemma_frequencies(doc)
        for token, count in lf.items():
            if token not in vocab2index and allow_vocab_update:
                vocab2index[token] = latest_index
                latest_index += 1
            if token in vocab2index:
                col = vocab2index[token]
                data.append(count)
                rows.append(row)
                cols.append(col)
    
    # create a coordinate format sparse matrix with an explicit shape, so that
    # documents with no in-vocabulary tokens still get a (zero) row
    corpus_matrix = scipy.sparse.coo_matrix( (data, (rows,cols)), shape=(len(document_list), latest_index))
    return corpus_matrix, vocab2index

# Downloading data
# ----------------
if not os.path.exists("aclImdb"):
    !wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    !tar -xzf aclImdb_v1.tar.gz
    
if not os.path.exists("crawl-300d-2M-subword.zip"):
    !wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
else:
    print("Fasttext embeddings already downloaded")
if not os.path.exists("crawl-300d-2M-subword.vec"):
    print("Extracting Fasttext embeddings. This may take several minutes...")
    !unzip crawl-300d-2M-subword.zip
else:
    print("Fasttext already extracted.")
if not os.path.exists("Stsbenchmark.tar.gz"):
    !wget http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz
    !tar -xzf Stsbenchmark.tar.gz
# get sample reddit data
if not os.path.exists("reddit_2019_05_5K.json"):
    !wget https://raw.githubusercontent.com/steve-wilson/ds32019/master/data/reddit_2019_05_5K.json

print()
print("Done with setup!")
print("If you'd like, you can click the (X) button to the left to clear this output.")

---
## Supervised Learning with Text Data

- Yes, we are finally going to do some machine learning!
- But, before that, we need to set up our data...

### Setting up the data

- Load some sample text data - let's use a portion of the IMDB Reviews that we looked at in Part 1.
- We will be using the IMDB *test* data as our *development* data for this example notebook. 
    - Note that in practice, you should always keep a held-out test set which is only used for reporting your final results.
        - Do not update your model or try any new hyperparameter settings after seeing these results.
        - Read more about this idea [here](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) if you aren't familiar with machine learning practices.
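
- As a minimal sketch of the train / development / test idea above (not part of this notebook's pipeline), the toy arrays and the variable names below are made up for illustration, and `train_test_split` from sklearn is just one convenient way to carve out the splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 100 examples with 3 features each and balanced binary labels
X_toy = np.arange(300).reshape(100, 3)
y_toy = np.array([0, 1] * 50)

# first hold out 20% as the final test set...
X_rest, X_heldout, y_rest, y_heldout = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)
# ...then carve a development set out of what remains (25% of 80% = 20% overall)
X_tr, X_dev, y_tr, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(X_tr.shape, X_dev.shape, X_heldout.shape)
```

- You would tune models against the development split and only look at the held-out split once, at the very end.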

In [0]:
def read_files(data_dir, amount):
    data = []
    for file_name in os.listdir(data_dir)[:int(amount/2)]:
        with open(os.path.join(data_dir, file_name), encoding="utf-8") as file_handle:
            file_data = file_handle.read()
            data.append(file_data)
    return data

# setup directories
base_dir = "aclImdb/"
labels = ['pos','neg']
# set the amount of files to use for train and test
# need to be divisible by 2 to make balanced classes
splits_and_amounts = {'train':5000,'test':500}
data = {}
for split_name, amount in splits_and_amounts.items():
    data[split_name] = {}
    for label in labels:
        data_dir = base_dir + split_name + os.sep + label
        data[split_name][label] = read_files(data_dir, amount)

- Now we will load these into *sparse* matrices (`scipy.sparse.coo_matrix`)
    - remember from notebook 01 that this will save quite a bit of RAM.
- This will take some time now that we're working with a bit more data.
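
- To get a rough feel for why the sparse format saves RAM, here is a small sketch on made-up data. The sizes and the sparsity level are assumptions chosen for illustration, not measurements from the IMDB data:

```python
import numpy as np
import scipy.sparse

# toy document-term counts: 1000 documents, 5000 vocabulary items,
# with only 5000 stored entries (text data is typically very sparse)
rng = np.random.RandomState(0)
n_docs, n_vocab, n_nonzero = 1000, 5000, 5000
rows = rng.randint(0, n_docs, size=n_nonzero)
cols = rng.randint(0, n_vocab, size=n_nonzero)
vals = np.ones(n_nonzero, dtype=np.int64)

sparse_counts = scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n_docs, n_vocab))

# a coo matrix stores three arrays: the values plus their row and column indices
sparse_bytes = sparse_counts.data.nbytes + sparse_counts.row.nbytes + sparse_counts.col.nbytes
# a dense int64 matrix stores every cell, zero or not
dense_bytes = n_docs * n_vocab * np.dtype(np.int64).itemsize

print("sparse bytes:", sparse_bytes)
print("dense bytes:", dense_bytes)
```

- The gap grows with the vocabulary size, since a dense matrix pays for every (document, word) pair while the sparse one only pays for the pairs that actually occur.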

In [0]:
# set up train and test data
# labels: 0 = positive review, 1 = negative review
X_train, vocab = docs2matrix_sparse(data['train']['pos'] + data['train']['neg'])
X_test, vocab = docs2matrix_sparse(data['test']['pos'] + data['test']['neg'], vocab, False)
y_train = np.array([0]*len(data['train']['pos']) + [1]*len(data['train']['neg']))
y_test = np.array([0]*len(data['test']['pos']) + [1]*len(data['test']['neg']))
index2vocab = {v:k for k,v in vocab.items()}

print("X_train shape:",X_train.shape)
print("X_test shape:",X_test.shape)
print("y_train shape:",y_train.shape)
print("y_test shape:",y_test.shape)

- Next, we will shuffle the data so that the model doesn't see *all* of the positive examples before any of the negative ones.
    - Since coo matrices don't support the same type of slicing that dense matrices do, we can use our own function to shuffle the rows.
- The shapes of all of the matrices should stay the same after shuffling.
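
- The key point when shuffling is that the rows and the labels must be reordered in exactly the same way. One possible alternative to a custom function (a sketch on toy data, not the approach used in this notebook) is to convert to CSR format, which does support row indexing:

```python
import numpy as np
import scipy.sparse

# toy data: 4 documents, where row i contains only the value i + 1
X_small = scipy.sparse.csr_matrix(np.diag([1, 2, 3, 4]))
y_small = np.array([10, 20, 30, 40])

# draw one random ordering and apply it to both the rows and the labels
rng = np.random.RandomState(0)
order = rng.permutation(X_small.shape[0])
X_shuffled = X_small[order, :]
y_shuffled = y_small[order]

# each shuffled row still lines up with its original label
row_sums = np.asarray(X_shuffled.sum(axis=1)).ravel()
print(row_sums * 10)
print(y_shuffled)
```

- The two printed arrays should match, which confirms that every row kept its own label.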

In [0]:
def coo_sort(order, coo_matrix):
    order_lookup = {o:i for i,o in enumerate(order)}
    coo_matrix.row = np.array([order_lookup[r] for r in coo_matrix.row])
    return coo_matrix

# shuffle_data
train_reorder = list(range(X_train.shape[0]))
test_reorder = list(range(X_test.shape[0]))
random.shuffle(train_reorder)
random.shuffle(test_reorder)

X_train = coo_sort(train_reorder, X_train)
X_test = coo_sort(test_reorder, X_test)
y_train = y_train[train_reorder]
y_test = y_test[test_reorder]

print("X_train shape:",X_train.shape)
print("X_test shape:",X_test.shape)
print("y_train shape:",y_train.shape)
print("y_test shape:",y_test.shape)

### Logistic Regression for Movie Review Sentiment Analysis

- Let's start with a simple logistic regression classifier, using the word counts as our features:

In [0]:
# Using the sklearn LogisticRegression classifier
model = sklearn.linear_model.LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train, y_train)
train_preds = model.predict(X_train)
test_preds = model.predict(X_test)

train_accuracy = sklearn.metrics.accuracy_score(y_train,train_preds)
test_accuracy = sklearn.metrics.accuracy_score(y_test,test_preds)
print("Training accuracy:",train_accuracy,"Testing accuracy:",test_accuracy)

- Let's try improving this by doing 2 things we've looked at before:
    - Removing rare words (let's say, those that appear < 5 times in the training data).
    - Using tf-idf scores instead of counts.

In [0]:
# need to include X_test so we can remove the same columns from it
# let's convert to csc matrix to make this removal easier
def remove_rare_words(X_train, X_test, index2vocab, threshold = 5):
    word_counts = X_train.sum(axis=0)
    # keep words that appear at least `threshold` times in the training data
    cols_to_keep = [idx for idx in range(word_counts.shape[1]) if word_counts[0, idx] >= threshold]
    return X_train[:,cols_to_keep], X_test[:,cols_to_keep], update_vocab(index2vocab, cols_to_keep)
    
def update_vocab(i2v, keep):
    keep = set(keep)
    list_i2v = [i2v[i] for i in range(len(i2v)) if i in keep]
    return {j:v for j,v in enumerate(list_i2v)}
    
X_train = scipy.sparse.csc_matrix(X_train)
X_test = scipy.sparse.csc_matrix(X_test)
no_rare_X_train, no_rare_X_test, no_rare_index2vocab = remove_rare_words(X_train, X_test, index2vocab)
print("X_train shape:",no_rare_X_train.shape)
print("X_test shape:",no_rare_X_test.shape)

- This reduced our number of features drastically. Let's see how this affects our predictions.

In [0]:
model = sklearn.linear_model.LogisticRegression(solver='lbfgs', max_iter=1000).fit(no_rare_X_train, y_train)
train_preds = model.predict(no_rare_X_train)
test_preds = model.predict(no_rare_X_test)

train_accuracy = sklearn.metrics.accuracy_score(y_train,train_preds)
test_accuracy = sklearn.metrics.accuracy_score(y_test,test_preds)
print("Training accuracy:",train_accuracy,"Testing accuracy:",test_accuracy)

- This doesn't seem to have much of an effect on accuracy, but it shows that we can do just as well without all of those rare words.
- Let's add tf-idf to this (this should actually give us some accuracy boost):

In [0]:
# we should only use the IDF scores from X_train to transform both X_train and X_test, 
# since we can't assume knowledge of X_test while training our model.
def tfidf_transform(X_train, X_test):
    
    # compute IDF scores for each word given X_train:
    # idf = log(number of training documents / number of training documents using the word)
    # bincount keeps the counts aligned with the column indices
    docs_using_terms = np.bincount(X_train.nonzero()[1], minlength=X_train.shape[1])
    # avoid dividing by zero for any word that never occurs in X_train
    docs_using_terms = np.maximum(docs_using_terms, 1)
    idf_scores = np.log(X_train.shape[0]/docs_using_terms)
    
    # replace 0's with 1's to avoid divide by zero error later
    X_train_sum = X_train.sum(axis=1)
    X_test_sum = X_test.sum(axis=1)
    X_train_sum[np.where(X_train_sum==0)] = 1
    X_test_sum[np.where(X_test_sum==0)] = 1
    
    # these row-level operations will take a performance hit if we are using coo or csc
    normalized_X_train = X_train / X_train_sum
    normalized_X_test = X_test / X_test_sum
    
    # compute tfidf scores, converting to plain numpy arrays for sklearn
    tfidf_X_train = np.asarray(np.multiply(normalized_X_train,idf_scores))
    tfidf_X_test = np.asarray(np.multiply(normalized_X_test,idf_scores))
    return tfidf_X_train, tfidf_X_test

tfidf_X_train, tfidf_X_test = tfidf_transform(no_rare_X_train, no_rare_X_test)
print("X_train shape:",tfidf_X_train.shape)
print("X_test shape:",tfidf_X_test.shape)
print("X_train sample elements:",tfidf_X_train[0,:5])
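
- As a sanity check on tf-idf weighting in general, it can help to work through a tiny example by hand. The sketch below uses a made-up 3x3 count matrix and assumes the common definition idf = log(N / df), where N is the number of documents and df is the number of documents that use the term:

```python
import numpy as np

# toy counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [3, 0, 0]])

# term frequency: each count divided by the length of its document
tf = counts / counts.sum(axis=1, keepdims=True)

# document frequency: number of documents that use each term
df = (counts > 0).sum(axis=0)
# inverse document frequency: log(number of documents / document frequency)
idf = np.log(counts.shape[0] / df)

tfidf = tf * idf
print(np.round(tfidf, 3))
```

- Notice that the first term appears in every document, so its idf is log(3/3) = 0 and it receives no weight at all, no matter how often it is used.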

In [0]:
model = sklearn.linear_model.LogisticRegression(solver='lbfgs', max_iter=1000).fit(tfidf_X_train, y_train)
train_preds = model.predict(tfidf_X_train)
test_preds = model.predict(tfidf_X_test)

train_accuracy = sklearn.metrics.accuracy_score(y_train,train_preds)
test_accuracy = sklearn.metrics.accuracy_score(y_test,test_preds)
print("Training accuracy:",train_accuracy,"Testing accuracy:",test_accuracy)

- What kinds of words contribute to positive or negative movie review classifications?

In [0]:
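# sort coefficients and store their index in the feature list
# labels are 0 = positive and 1 = negative, so the most negative coefficients
# push predictions toward "positive" and the most positive ones toward "negative"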
feats_and_coefs = sorted(list(enumerate(model.coef_[0])),key=lambda x:x[1])
pos_features = [fc[0] for fc in feats_and_coefs[:15]]
neg_features = [fc[0] for fc in feats_and_coefs[-15:]]

print("Features predictive of positive reviews:")
print(" ".join([no_rare_index2vocab[idx] for idx in pos_features]))
      
print("Features predictive of negative reviews:")
print(" ".join([no_rare_index2vocab[idx] for idx in neg_features]))

- This looks reasonable, but can we do better? 
- Maybe we've reached our potential using just BOW features?
    - Feel free to experiment with this more to see if you can boost the performance significantly using other methods!
- Let's see what we can do with a simple bag-of-embeddings type of approach using mean-pooling.
    - First, we will need to load our embeddings, just like in the previous notebook:

In [0]:
emb_path = "crawl-300d-2M-subword.bin"
fasttext_model = fasttext.load_model(emb_path)
print("Loaded fasttext embeddings.")

- Now, we can use mean-pooling to create a new d-dimensional vector for each document.
    - d is the dimension of our word embeddings (300 in this case).
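
- To make the mean-pooling idea concrete before applying it to the full dataset, here is a minimal sketch. The 3-dimensional vectors and the `mean_pool` helper are hypothetical and invented for illustration (real fastText vectors have 300 dimensions):

```python
import numpy as np

# hypothetical 3-dimensional embeddings, invented for illustration
toy_embeddings = {
    "good":  np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 2.0, 0.0]),
}

def mean_pool(word_counts, embeddings):
    """Average the word vectors of a document, weighting each word by its count."""
    total = sum(word_counts.values())
    weighted_sum = sum(embeddings[word] * count for word, count in word_counts.items())
    return weighted_sum / total

# a document containing "good" three times and "movie" once
doc_vector = mean_pool({"good": 3, "movie": 1}, toy_embeddings)
print(doc_vector)
```

- However long the document is, the result always has the same dimension as a single word vector, which is what lets us feed it to a standard classifier.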

In [0]:
def convert_to_embeddings(X, i2v, emb):
    # make it coo matrix if it isn't already
    X = scipy.sparse.coo_matrix(X)
    row2embedding_sum = {}
    row2embedding_count = collections.defaultdict(int)
    embeddings = []
    # iterate through nonzero elements of the coo matrix
    # there will be a row, col, and data list that can be indexed with i
    for i in range(X.getnnz()):
        if X.row[i] not in row2embedding_sum:
            row2embedding_sum[X.row[i]] = emb[i2v[X.col[i]]] * X.data[i]
        else:
            row2embedding_sum[X.row[i]] += emb[i2v[X.col[i]]] * X.data[i]
        row2embedding_count[X.row[i]] += X.data[i]
    for row, embedding_sum in sorted(row2embedding_sum.items()):
        embeddings.append(embedding_sum / row2embedding_count[row])
    return np.array(embeddings)

emb_X_train = convert_to_embeddings(X_train, index2vocab, fasttext_model)
emb_X_test = convert_to_embeddings(X_test, index2vocab, fasttext_model)

print("X train shape:", emb_X_train.shape)
print("X test shape:", emb_X_test.shape)

- And checking the results...

In [0]:
model = sklearn.linear_model.LogisticRegression(solver='lbfgs', max_iter=1000).fit(emb_X_train, y_train)
train_preds = model.predict(emb_X_train)
test_preds = model.predict(emb_X_test)

train_accuracy = sklearn.metrics.accuracy_score(y_train,train_preds)
test_accuracy = sklearn.metrics.accuracy_score(y_test,test_preds)
print("Training accuracy:",train_accuracy,"Testing accuracy:",test_accuracy)

- Here, bag-of-words beats averaged bag-of-vectors.
- We can also easily combine the features into a single model, e.g.:

In [0]:
comb_X_train = np.hstack( (tfidf_X_train, emb_X_train))
comb_X_test = np.hstack( (tfidf_X_test, emb_X_test))
print("X train shape", comb_X_train.shape)
print("X test shape", comb_X_test.shape)

- Then check to see if this outperforms the original feature sets (in this case, not so much):

In [0]:
model = sklearn.linear_model.LogisticRegression(solver='lbfgs', max_iter=1000).fit(comb_X_train, y_train)
train_preds = model.predict(comb_X_train)
test_preds = model.predict(comb_X_test)

train_accuracy = sklearn.metrics.accuracy_score(y_train,train_preds)
test_accuracy = sklearn.metrics.accuracy_score(y_test,test_preds)
print("Training accuracy:",train_accuracy,"Testing accuracy:",test_accuracy)

- Of course, we could also change our classifier to something else:
    - Feel free to try out different models from sklearn: [supervised learning](https://scikit-learn.org/stable/supervised_learning.html).
        - You won't have to change the code much! Just swap `linear_model.LogisticRegression()` for some other model to see what happens.
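
- As a rough sketch of how little needs to change when swapping models, the loop below fits three sklearn classifiers on a tiny made-up dataset. The toy features and the choice of `MultinomialNB` and `LinearSVC` are just examples, not recommendations for this task:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# toy count features: class 0 documents mostly use term 0, class 1 documents term 1
X_demo = np.array([[3, 0], [2, 1], [4, 0], [0, 3], [1, 2], [0, 4]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

candidates = {
    "logistic regression": LogisticRegression(solver='lbfgs', max_iter=1000),
    "naive Bayes": MultinomialNB(),
    "linear SVM": LinearSVC(),
}

# every sklearn classifier exposes the same fit / predict / score interface
demo_scores = {}
for name, clf in candidates.items():
    clf.fit(X_demo, y_demo)
    demo_scores[name] = clf.score(X_demo, y_demo)
    print(name, "training accuracy:", demo_scores[name])
```

- With the real data, you would pass `tfidf_X_train` (or any of the other feature matrices) in place of the toy features and compare accuracy on the development data rather than on the training data.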

### Putting it together: Upvote prediction on Reddit

- Let's load a small sample of reddit data like we did before, but also keep track of the scores and use these for supervision.
    - The `score` is computed as $u - d$ where $u=$ *total upvotes* and $d=$ *total downvotes*.

In [0]:
# use a context manager so that the file is closed after reading
with open("reddit_2019_05_5K.json",'r') as reddit_file:
    sample_reddit_posts_raw = reddit_file.readlines()
reddit_json = [json.loads(post) for post in sample_reddit_posts_raw]

texts = []
y = []

for post in reddit_json:
    if len(post['selftext'].strip()) > 100 and post['selftext'] not in ["[removed]","[deleted]"]:
        text = post['selftext']
        score = post['score']
        texts.append(text)
        y.append(float(score))
        
X, vocab = docs2matrix_sparse(texts)
y = np.array(y)

print("X shape:",X.shape)
print("y shape:",y.shape)

- Let's split our data into train and test:

In [0]:
X = scipy.sparse.csr_matrix(X)

datasize = 800
X_train, X_test = X[:datasize,:], X[datasize:,:]
y_train, y_test = y[:datasize], y[datasize:]

print("X_train shape:",X_train.shape)
print("X_test shape:",X_test.shape)
print("y_train shape:",y_train.shape)
print("y_test shape:",y_test.shape)

**Exercise 8:** Upvote prediction

- Build a **regression model** to predict reddit upvotes based on text features.
    - You won't be able to use classification models for this task
        - Check out sklearn's regression models: search for "regression" [here](https://scikit-learn.org/stable/supervised_learning.html).
            - Hint: you could use the `LinearRegression()` constructor to build a regression model.
    - You can use whichever features you like.
        - Consider features outside of just the text. What other post metadata might be helpful?
    - Evaluate on the test set.
    - Feel free to load in *even more* Reddit data if you like and see what kind of performance you can get.
    - You should evaluate using [mean absolute error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error).
        - Note that lower scores are better in this case.
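
- To see what mean absolute error measures, here is a small sketch with made-up numbers. The scores and predictions are hypothetical, and the median baseline is just one simple reference point you might compare a model against:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# hypothetical post scores, invented for illustration
true_scores = np.array([1.0, 3.0, 2.0, 10.0])
predicted_scores = np.array([2.0, 3.0, 4.0, 6.0])

# mean absolute error: the average of |true - predicted|
mae = mean_absolute_error(true_scores, predicted_scores)

# a simple baseline to compare against: always predict the median score
baseline = np.full_like(true_scores, np.median(true_scores))
baseline_mae = mean_absolute_error(true_scores, baseline)

print("model MAE:", mae)
print("median baseline MAE:", baseline_mae)
```

- In the exercise itself, the baseline median should be computed from `y_train` only and then evaluated on `y_test`, so that no information from the test set leaks into the prediction.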


In [0]:
# Use pieces from previous steps to build a dataset and regression model



In [0]:
#@title Sample Solution (double-click to view) {display-mode: "form"}

# nothing fancy, but here is something to give some basic results:
model = sklearn.linear_model.LinearRegression().fit(X_train, y_train)
train_preds = model.predict(X_train)
test_preds = model.predict(X_test)
train_error = sklearn.metrics.mean_absolute_error(y_train,train_preds)
test_error = sklearn.metrics.mean_absolute_error(y_test,test_preds)

print("Training error:",train_error,"Testing error:",test_error)

## Some resources on deep learning for NLP

- There is much, much more about deep learning for NLP that has already been covered very nicely elsewhere, and is definitely worth checking out. Just a few pointers to get you started:
    - [Basics of Deep Learning](https://colah.github.io/) from Chris Olah's Blog
    - [Deep Learning for NLP tutorial](https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html) by Robert Guthrie
    - [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html) by Alexander Rush
    - [Transfer Learning in NLP tutorial @ NAACL 2019](https://docs.google.com/presentation/d/1fIhGikFPnb7G5kr58OvYC3GN4io7MznnM0aAgadvJfc/edit) by Ruder, Peters, Swayamdipta, and Wolf
- That's it for the **Fundamentals of Text Analysis** Tutorial! Thanks for following along!
    