## Text data, deep learning, word embeddings.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

First, just for fun...

```conda install -c conda-forge wordcloud```

For later...

In [None]:
conda install -c conda-forge keras


In [2]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)


In [None]:
bigabe=["Four score and seven years ago our fathers brought forth upon this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth."]

In [None]:
print(bigabe)

In [None]:
len(bigabe)

In [None]:
bigabe[0]

In [None]:
import re

Grab that big long string -- get rid of punctuation, and split into separate words. 

In [None]:
s="Working with text can be fun, but can be frustrating too."

In [None]:
print(s.split())

In [None]:
s="!Punctuation!can.be.pretty,;: superfluous."

In [None]:
re.sub(r"[^\w]", " ",  s)

The "\w" means "any word character" -- alphanumeric (letters, numbers, regardless of case) or an underscore.

In [None]:
re.sub(r"[^\w]", " ",  s).split()
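
As a small aside, here is a self-contained sketch of the same idea. It assumes nothing beyond the standard library and shows that `[^\w]` and `\W` describe the same set of characters, and that `re.findall` can collect the words directly. The variable names are just for illustration.

```python
import re

s = "!Punctuation!can.be.pretty,;: superfluous."

# [^\w] and \W both mean "any character that is not a word character".
# Raw strings (r"...") keep Python from treating the backslash as a string escape.
tokens_a = re.sub(r"[^\w]", " ", s).split()
tokens_b = re.sub(r"\W", " ", s).split()

# re.findall collects the runs of word characters directly.
tokens_c = re.findall(r"\w+", s)

print(tokens_a)                          # ['Punctuation', 'can', 'be', 'pretty', 'superfluous']
print(tokens_a == tokens_b == tokens_c)  # True
```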

In [None]:
wordList = re.sub(r"[^\w]", " ",  bigabe[0]).split()

In [None]:
print(wordList)

In [None]:
wordcloud = WordCloud().generate(bigabe[0])

In [None]:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Or...

In [None]:
wordcloud = WordCloud(background_color='grey').generate(bigabe[0])
plt.figure( figsize=(15,10) )
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Another example -- let's go get some text from the web.

In [None]:
import requests
from urllib.request import urlopen

In [None]:
url = "https://serc.carleton.edu/download/files/193011/plain_text_version_declaration_inde.txt"

In [None]:
response = urlopen(url).read()

In [None]:
print(response)

In [None]:
type(response)

In [None]:
response[2]

We need to convert the ```bytes``` object to a string, but wordcloud will handle the regular-expression processing itself (it recognizes that we don't want punctuation or symbols in the word cloud).

In [None]:
str(response)[2]
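
One thing to be aware of: `str()` on a `bytes` object returns its printable representation (the `b'...'` wrapper and escape sequences such as `\r\n` become literal characters), while `.decode()` returns the text itself. The sketch below uses a short made-up byte string as a stand-in for the downloaded file and assumes UTF-8 encoding, which may or may not match the real file.

```python
import re

raw = b"When in the Course\r\nof human events"  # stand-in for urlopen(url).read()

as_repr = str(raw)             # printable representation, including b'...' and escapes
as_text = raw.decode("utf-8")  # the decoded text (UTF-8 is an assumption here)

print(as_repr[:4])  # b'Wh
print(as_text[:4])  # When

# The escaped line break survives in as_repr as the four characters \ r \ n,
# so a word split picks up stray pieces like 'r' and 'nof'.
print(re.findall(r"\w+", as_repr))
print(re.findall(r"\w+", as_text))
```

If stray tokens like these show up in a word cloud, decoding instead of calling `str()` is one possible remedy.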

We can do this manually as above...

In [None]:
tst = re.sub(r"[^\w]", " ", str(response))

In [None]:
tst

Or just directly...

In [None]:
wordcloud = WordCloud(background_color='white',random_state=0).generate(str(response))
plt.figure( figsize=(15,10))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
wordcloud = WordCloud(random_state=0).generate(str(response))
# Display the generated image:
# the matplotlib way:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
from PIL import Image

We can use shapes (masks) to generate word clouds within a particular shape.

In [None]:
map_mask = np.array(Image.open('us_map.png'))
trooper=np.array(Image.open('troop.png'))

In [None]:
wordcloud = WordCloud(background_color='white',mask=map_mask,random_state=0).generate(str(response))
plt.figure( figsize=(18,10))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis("off")
plt.show()


In [None]:
wordcloud = WordCloud(background_color='white',mask=trooper,contour_width=2,contour_color='black',random_state=0).generate(str(response))
plt.figure( figsize=(15,10))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis("off")
plt.show()

More involved... Sometimes we can actually use text as features in a model. A natural application of this is ***sentiment analysis***.

Can we predict whether a movie review is positive or negative? 

This would be relatively easy (if time consuming) for a human. Seems pretty hard for a machine.

In [None]:
from sklearn.datasets import load_files

Download and unzip: [IMDB movie reviews](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)

Then, delete the "unsup" folder inside train so you only have "pos" and "neg" folders inside the "test" and "train" folders.

Finally, copy the filepath to the "train" folder:

In [None]:
import time

And load the training data. Note: it took a little under a minute to load on my machine. 

In [None]:
t0=time.time()
rev_train = load_files("/Users/smdevlin/Dropbox/Teaching/CertA/data/aclImdb/train/")
t1=time.time()
print(t1-t0)

In [None]:
text_train,y_train = rev_train.data, rev_train.target

In [None]:
print(len(text_train),len(y_train))

In [None]:
y_train[0:5]

In [None]:
text_train[3]

The prefix 'b' is telling us that this is a ``bytes`` object.

Let's remove the HTML code in there...

In [None]:
text_train = [doc.replace(b"<br />",b"") for doc in text_train]

In [None]:
text_train[3]

A lot of other cleaning, stripping, etc. can be done as above. For example, replace non-word characters with spaces.

In [None]:
re.sub(r"[^\w]", " ",  str(text_train[3]))

In [None]:
y_train[0:10]

So, ```y_train``` is telling us if the review is positive (1) or negative (0). Clearly, ```y_train[3]=0```.

In [None]:
print(text_train[9],":",y_train[9])

The data also contains a test set. Read in using ```load_files``` and point to the test folder:

In [None]:
rev_test=load_files("/Users/smdevlin/Dropbox/Teaching/CertA/data/aclImdb/test/")

In [None]:
text_test,y_test = rev_test.data, rev_test.target

Get rid of the HTML:

In [None]:
text_test=[doc.replace(b"<br />",b"") for doc in text_test]

In [None]:
len(text_test)

Now, the plan is to generate features from the text that we can use to predict whether a review is positive or negative. 

First here's a small example:

In [None]:
abe=["Four score and seven years ago, our fathers brought forth apon this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.",
     "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure."]

In [None]:
print(abe)

In [None]:
len(abe)

In [None]:
abe[0]

In [None]:
abe[1]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vect=CountVectorizer()
vect.fit(abe)

In [None]:
print(vect.vocabulary_)

CountVectorizer creates a vocabulary -- all lower case, no punctuation, and it eliminates single-character tokens like 'a'. (Note: the numbers here are positions when the words are sorted alphabetically: 'four' is in position 18.)

In [None]:
abe_vect=vect.transform(abe)

The words become features (columns), and each 'sentence' (could be a sentence or a whole movie review) is a vector (row) whose entries tell us how many times that feature appears in that row.

This is stored in a matrix with rows = number of sentences and cols = number of features. 

In [None]:
repr(abe_vect)

There were two sentences and 42 distinct words, so this is a 2 by 42 matrix. Sparse means there are a LOT of 0s. The computer doesn't store all the zeros -- too wasteful. 

For a small example like ours we can look at the matrix:

In [None]:
print(abe_vect.toarray())

The words are in alphabetical order (ago, all, and...); and "*and*" appears twice.

Note, this simply looks at which words appear in which reviews (or sentences). There is no context, no relationships between words, etc. That's not like real language: this is called a "*bag of words*" approach.
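
To make the bag-of-words idea concrete, here is a hand-rolled sketch using only the standard library. It is not how CountVectorizer is implemented; it just mimics the default behaviour described above (lower-casing, dropping punctuation and one-letter tokens) on two made-up sentences.

```python
import re
from collections import Counter

docs = ["The movie was good, really good.",
        "The movie was not good."]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (so one-letter tokens such as 'a' are dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary word.
counts = [Counter(tokenize(doc)) for doc in docs]
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)   # ['good', 'movie', 'not', 'really', 'the', 'was']
for row in matrix:
    print(row)
# [2, 1, 0, 1, 1, 1]
# [1, 1, 1, 0, 1, 1]
```

Notice that the two rows differ mainly in the 'not' column: the counts say nothing about which word 'not' was attached to.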

OK -- back to movie reviews.

In [None]:
vect=CountVectorizer().fit(text_train)
X_train=vect.transform(text_train)

In [None]:
repr(X_train)

25000 reviews (rows), 75,911 features = words (columns)

We can access the features to explore them a little...

In [None]:
feats=vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
len(feats)

In [None]:
print(feats[0:30])

Hmmm -- maybe not all the features are particularly useful. We could (should) do some cleaning. We'll do some a bit later.

In [None]:
print(feats[19010:19040])

We also get lots of variants on a single word which we may or may not want to keep.  

Also, we may not want to just throw away all numeric features. Maybe 007 comes from James Bond movies? Let's check... pull out some reviews containing "007":

In [None]:
keep=[]
for i in range(len(text_train)):
    if "007" in str(text_train[i]):
        keep.append(i)

In [None]:
print(keep[0:20])

Sometimes yes (note also this seems to be a video game review):

In [None]:
text_train[3082]

Sometimes no.

In [None]:
text_train[131]

In [None]:
X_train.shape

In [None]:
type(X_train)

In [None]:
X_train[0,0]

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

Let's see what we get by training a Logistic Regression model on the training data. There are a LOT of features, clearly... almost 76k. More features than observations in fact!

We will just run 5-fold cross validation and keep track of the scores to see how the accuracy looks on the hold-out fold (not tuning any parameters here).

In [None]:
scores = cross_val_score(LogisticRegression(),X_train,y_train,cv=5)

We get some convergence warnings -- typically we dealt with this by scaling the data but it may not be a good idea to do so here. 
 * given the number of features and observations it's not terribly surprising.
 * our data is sparse and destroying that sparsity (which scaling will do) might not be a great idea.
 * turns out it's not a big problem here. (Features are not on different scales.) 

In [None]:
scores

That's not too bad!!

In [None]:
np.round(np.mean(scores),4)

In [None]:
X_test = vect.transform(text_test)

In [None]:
X_test.shape

In [None]:
y_test.shape

Re-train on the full training set (we're going to get a convergence warning again)... 

``sklearn`` works nicely with the sparse matrix data structure (X_train) 

In [None]:
lrm = LogisticRegression().fit(X_train,y_train)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test,lrm.predict(X_test))

In [None]:
pd.crosstab(y_test,lrm.predict(X_test),colnames=['Prediction'],rownames=['Actual'])

Let's peek at a review where we missed:

In [None]:
np.where((lrm.predict(X_test)==1)&(y_test==0))

So we predicted positive and they were negative:

In [None]:
text_test[25]

Understandable? A negative review of one episode of the Twilight Zone with positive embedded comments about the series. Context might help -- words that appear near one another.

In [None]:
text_test[32]

That one is a less excusable mistake, but it does have positive comments in it: "acting was as good as it gets...".

Note: Different from statsmodels, sklearn's logistic regression automatically uses an L-2 penalty, but it's flipped -- larger C means smaller penalty. 
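
To see why larger C means a smaller penalty, it can help to write the objective down. The following is one common way to write L2-penalized logistic regression, assuming labels $y_i\in\{-1,+1\}$ and an intercept $b$; the exact scaling sklearn uses internally may differ between versions.

$$
\min_{w,\,b}\; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\log\left(1+e^{-y_i(w^Tx_i+b)}\right)
$$

Here C multiplies the data-fit term rather than the penalty, so increasing C makes the penalty relatively less important.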

Tuning it might be able to help a little...

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid={'C':[0.001,0.01,0.10,1,10]}
grid = GridSearchCV(LogisticRegression(),param_grid,cv=5)

Lots of warnings will get thrown here...

In [None]:
grid.fit(X_train,y_train)

In [None]:
grid.best_params_

Note: A nice feature of GridSearchCV (that we haven't used in the past) is that it automatically refits the model with optimal parameters on the full training set. We can access the score on the test set as follows: 

In [None]:
np.round(grid.score(X_test,y_test),4)

rounds to 0.88 so slight improvement! (previously 0.8668)

Not bad at all. We can refine (or at least clean up) the model by only using features that appear in at least $n$ training documents ($n>1$), since it seems unlikely that a word in only 1 doc will show up in the test set too.

Easy: just use the min_df (df=document frequency) setting in CountVectorizer
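
As a rough illustration of what a document-frequency cutoff does, here is a standard-library sketch on a made-up, already-tokenized corpus. The names `docs`, `doc_freq`, and `kept` are hypothetical and not part of sklearn.

```python
from collections import Counter

docs = [["great", "movie"],
        ["great", "plot"],
        ["terrible", "movie"],
        ["great", "zzyzx"]]
min_df = 2

# Document frequency: in how many documents does each word appear at least once?
doc_freq = Counter(word for doc in docs for word in set(doc))

# Keep only the words that reach the cutoff.
kept = sorted(word for word, df in doc_freq.items() if df >= min_df)

print(doc_freq["great"])  # 3
print(kept)               # ['great', 'movie']
```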

In [None]:
vect = CountVectorizer(min_df=10).fit(text_train)
X_train=vect.transform(text_train)

In [None]:
repr(X_train)

Now we have 18,515 features instead of 75,911 

In [None]:
feature_names=vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(feature_names[0:50])

Though looks like we could still clean up a bit.

In [None]:
print(feature_names[9100:9150])

In [None]:
grid = GridSearchCV(LogisticRegression(),param_grid,cv=5)
grid.fit(X_train,y_train)

In [None]:
grid.best_score_

Similar to before, but with far fewer features.

## Rescaling with tf-idf 

Rather than simply counting the number of times a word appears in a document, we try to weigh the features so that more important (revealing) features have higher weight. 

One way to do this is using ***tf-idf*** : term frequency - inverse document frequency

Give more weight to words that appear often in a particular document (review) but infrequently across the whole set of documents. Those words may be particularly revealing as relates to the content of the document. 

$$
tfidf(w,d) = tf \cdot \left(\log\left(\frac{N+1}{N_w+1}\right)+1\right)
$$

* $N$ = number of documents.
* $N_w$ = number of documents containing word $w$.
* $tf$ = number of times $w$ appears in document $d$.
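
Here is a small worked example of the weighting, reading the formula as $tf$ times (log term plus 1) with the natural log. The corpus is made up, and the function name `tfidf` is just for illustration. sklearn's TfidfVectorizer also rescales each document's vector to unit length by default, which this sketch leaves out.

```python
import math

# Three tiny, already-tokenized "documents"
docs = [["good", "movie", "good"],
        ["bad", "movie"],
        ["movie", "movie", "plot"]]

def tfidf(word, doc, docs):
    tf = doc.count(word)                             # times the word appears in this doc
    n_docs = len(docs)                               # N
    n_with_word = sum(1 for d in docs if word in d)  # N_w
    idf = math.log((n_docs + 1) / (n_with_word + 1)) + 1
    return tf * idf

# 'good' appears twice in docs[0] and in no other document: high weight.
print(round(tfidf("good", docs[0], docs), 4))   # 3.3863
# 'movie' appears in every document: the log term is zero, so the weight is just tf.
print(round(tfidf("movie", docs[0], docs), 4))  # 1.0
```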

Let's say we want to perform tfidf scaling and cross validation (on the logistic regression parameter C).

We run into a subtle issue worth pointing out.

<img src="Leakage.png">

How to fix? For each split, fit the re-scaling on the training folds only, then apply that same fitted re-scaling to the held-out fold before testing on it. (Simulate what new data will look like to the model.)
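
The per-fold logic can be sketched by hand. In the toy example below, the "fit" step is simply learning a vocabulary, standing in for any preprocessing that learns something from data; the corpus, the fold layout, and the helper names are all made up for illustration.

```python
docs = ["good movie", "bad movie", "good plot", "bad plot", "great movie", "awful plot"]
k = 3
folds = [list(range(i, len(docs), k)) for i in range(k)]  # [[0, 3], [1, 4], [2, 5]]

def fit_vocabulary(train_docs):
    # "Fit": learn the vocabulary from the training documents only.
    return sorted({w for d in train_docs for w in d.split()})

def transform(doc, vocab):
    # "Transform": count only the words the fitted vocabulary knows about.
    words = doc.split()
    return [words.count(w) for w in vocab]

results = []
for held_out in folds:
    train_docs = [docs[i] for i in range(len(docs)) if i not in held_out]
    vocab = fit_vocabulary(train_docs)                          # never sees the held-out docs
    X_train_fold = [transform(d, vocab) for d in train_docs]
    X_val_fold = [transform(docs[i], vocab) for i in held_out]  # reuse the fitted vocab
    # A model would be trained on X_train_fold and scored on X_val_fold here.
    results.append((vocab, X_val_fold))

# In the second split, 'great' only occurs in a held-out document,
# so the fitted vocabulary does not contain it.
print(results[1][0])  # ['awful', 'bad', 'good', 'movie', 'plot']
print(results[1][1])  # [[0, 1, 0, 1, 0], [0, 0, 0, 1, 0]]
```

That is the situation a real model faces with new data: words it never saw during fitting simply do not exist for it.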

```sklearn``` provides a method for doing this.

In [None]:
from sklearn.pipeline import Pipeline

Tell sklearn what we want to do to our data and in what order: for us, first scale, then logistic regression.

We want tfidf then LR.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

We have to name each operation:

In [None]:
pipe = Pipeline([("scaler",TfidfVectorizer(min_df=5)),("logisticregression",LogisticRegression())])

Now we tell sklearn which parameters to search over, and which part of the pipeline they belong to-- using our names above and double underscore: __

In [None]:
new_param_grid={'logisticregression__C':[0.001,0.01,0.10,1,10]}
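
If you are unsure what a parameter is called inside a pipeline, `get_params()` lists every name that can be searched over or set. A minimal sketch, rebuilding the same two-step pipeline with default settings so it runs on its own:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([("scaler", TfidfVectorizer()),
                 ("logisticregression", LogisticRegression())])

# Every tunable parameter shows up as <step name>__<parameter name>.
params = pipe.get_params()
print("logisticregression__C" in params)  # True
print("scaler__ngram_range" in params)    # True

# The same names work for setting values directly.
pipe.set_params(logisticregression__C=0.1, scaler__ngram_range=(1, 2))
print(pipe.named_steps["logisticregression"].C)  # 0.1
```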

In [None]:
new_grid=GridSearchCV(pipe, param_grid=new_param_grid,cv=5)

In [None]:
new_grid.fit(text_train,y_train)

Looks OK  -- just the usual LR convergence warnings.

In [None]:
new_grid.best_score_

Barely detectable difference: before we had 0.88832

Let's look at which features tf-idf thought were important:

The pipeline syntax makes it a bit of a hassle to get at the various functions we've used:

In [None]:
vectorizer=new_grid.best_estimator_.named_steps['scaler']

Apply to the text_train data...

In [None]:
X_train_tf = vectorizer.transform(text_train)

In [None]:
max_val = X_train_tf.max(axis=0).toarray().ravel()

In [None]:
max_val[0:10]

In [None]:
sorted_by_tfidf = max_val.argsort()

In [None]:
feature_names = np.array(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

Highest tfidf:

In [None]:
print(feature_names[sorted_by_tfidf[-30:]])

Kinda sorta makes sense --  but it's not really helping us zoom in on pos vs. neg reviews. It's more picking out specific movies or shows. Maybe that's why it didn't really help much.

Let's pull out the coefficients of the features in the logistic regression model:

In [None]:
coefs = new_grid.best_estimator_.named_steps['logisticregression'].coef_[0]

In [None]:
coefs

And see which ones were most important: both + and -

In [None]:
top_pos=np.argsort(coefs)[-10:]
top_neg=np.argsort(coefs)[:10]
most_infl=np.hstack([top_neg,top_pos])

In [None]:
plt.figure(figsize=(10, 5))
plt.bar(np.arange(20), coefs[most_infl])
feature_names = np.array(feature_names)
plt.subplots_adjust(bottom=0.3)
plt.xticks(np.arange(0,20),feature_names[most_infl], rotation=60,ha="right")
plt.ylabel("Coefficient magnitude")
plt.xlabel("Feature")
plt.show()

## n-Grams

The bag of words model discards word order and association, both of which are certainly relevant!

Phrases like "not too bad!" and "just too bad!" are very similar without the context given by the immediate neighbors of "bad".

One approach is to include n-grams : groups of $n$ words that come together.

In [None]:
print(abe)

In [None]:
cv=CountVectorizer(ngram_range=(1,2)).fit(abe)

In [None]:
print(cv.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

```ngram_range=(1,2)``` uses words and pairs of words. That can help but will also blow up the number of features. 
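
To see what the extra features look like, here is a hand-written n-gram generator applied to the two phrases from the start of this section. It is only a sketch of the idea, not the CountVectorizer implementation, and `ngrams` is a made-up helper name.

```python
def ngrams(tokens, n_min, n_max):
    """All runs of n consecutive tokens, for n from n_min to n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

a = ngrams("not too bad".split(), 1, 2)
b = ngrams("just too bad".split(), 1, 2)

print(a)                        # ['not', 'too', 'bad', 'not too', 'too bad']
print(sorted(set(a) - set(b)))  # ['not', 'not too']
```

With single words only, the two phrases differ by one feature; with bigrams the pair 'not too' becomes a feature of its own. A phrase of $m$ words yields $m-1$ bigrams on top of its $m$ unigrams, which is where the growth in feature count comes from.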

Let's take a swing at the movie reviews using trigrams:

In [None]:
pipe = Pipeline([("scaler",TfidfVectorizer(min_df=5)),("logisticregression",LogisticRegression())])

In [None]:
tri_param_grid={'logisticregression__C':[0.001,0.01,0.10,1,10],"scaler__ngram_range":[(1,3)]}

In [None]:
ggrid=GridSearchCV(pipe, param_grid=tri_param_grid,cv=5)

### WARNING: Slow! Takes almost 10 minutes to run!

In [None]:
#t0=time.time()
#ggrid.fit(text_train,y_train)
#t1=time.time()
#print(t1-t0)

In [None]:
ggrid.best_score_

Helped a little!

In [None]:
ggrid.best_params_

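Now, re-scale the test data using TfidfVectorizer with min_df=5 and ngram_range=(1,3) and apply the model to the test set. Since the vectorizer is part of the fitted pipeline, passing the raw test text to ```ggrid``` applies both steps.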

## Deep Learning

One of the most important advances in ML in the last decade. BUT... it has limitations and difficulties that make all the other techniques we've studied important and (still) relevant.

In particular, Deep Learning models have many parameters that need to be tuned carefully. It can be difficult, time consuming, and computationally expensive to do so.

Natural applications often involve very large data sets.

We'll consider a simple example of a feed-forward Neural Network. 

*Image from ISL: James, Witten, Hastie, Tibshirani.*

<img src="SingleLayerNNpic.png" width=600, height=600/>

Basic idea: 

 * Model takes inputs $X_1,X_2,\ldots X_p$ (Input Layer).
 * Each input node feeds into the ***hidden layer units***. The hidden layer consists of $K$ ***activations***: $A_1,\ldots A_K$.
 * The $K$ activations feed into the ***output*** layer.

Mathematically, the output layer is a linear regression model in the activations:

$$
f(X)= \beta_0 + \sum_{k=1}^K\beta_kA_k
$$

And where the activations are ***nonlinear*** functions of the inputs:

For $k=1,2,\ldots K$,
$$
A_k = h_k(X)= g(w_{k0}+\sum_{j=1}^pw_{kj}X_j)
$$

and where $g$ is a nonlinear activation function: usually a *rectified linear unit* or ReLU, which is zero below a threshold and linear above it. 

<img src="relu_pic.png" width=300, height=300/>

All together:
$$
f(X) = \beta_0 + \sum_{k=1}^K\beta_k\cdot g(w_{k0}+\sum_{j=1}^pw_{kj}X_j)
$$
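
The formula translates almost line for line into code. Below is a minimal sketch of the forward pass for a single hidden layer with ReLU activations, in plain Python. The weights are made-up numbers chosen so the arithmetic is easy to follow, with $p=3$ inputs and $K=2$ hidden units.

```python
def relu(z):
    # Zero below the threshold (0), linear above it
    return max(0.0, z)

def forward(x, W, w0, beta, beta0):
    # Hidden layer: A_k = g(w_k0 + sum_j w_kj * x_j)
    activations = [relu(w0[k] + sum(W[k][j] * x[j] for j in range(len(x))))
                   for k in range(len(W))]
    # Output layer: f(X) = beta_0 + sum_k beta_k * A_k
    output = beta0 + sum(b * a for b, a in zip(beta, activations))
    return output, activations

x = [1.0, 2.0, 3.0]          # p = 3 inputs
W = [[0.5, -1.0, 1.0],       # weights into hidden unit 1
     [1.0, 1.0, -1.0]]       # weights into hidden unit 2
w0 = [0.5, -1.0]             # hidden-layer intercepts
beta = [2.0, -3.0]           # output-layer weights
beta0 = 0.1                  # output-layer intercept

output, activations = forward(x, W, w0, beta, beta0)
print(activations)       # [2.0, 0.0]
print(round(output, 4))  # 4.1
```

The second hidden unit's linear combination is negative, so ReLU switches it off and it contributes nothing to the output.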

Even the simple picture above has 25 parameters that need to be fit. Compare to RF or boosting.

The nonlinearity is essential -- without it the DL model is just a big linear model. It also allows for extremely complex nonlinear functions of the features to enter the model.

DL is supervised! 

In a regression model we fit parameters to minimize SSE:
$$
\sum (y_i - f(X_i))^2
$$

For classification we use a different loss function (cross entropy) involving the probability of observations being in class $i$.
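
For a feel of how the two losses behave, here is a small sketch with made-up numbers. The cross-entropy shown is the two-class (binary) version, which is a simplification of the general multi-class case.

```python
import math

def sse(y, preds):
    # Sum of squared errors for regression
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds))

def binary_cross_entropy(y, probs):
    # y holds 0/1 labels; probs holds the predicted probability of class 1
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, probs))

print(sse([1.0, 2.0], [1.5, 1.0]))  # 1.25

labels = [1, 0]
confident_right = binary_cross_entropy(labels, [0.9, 0.2])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.8])
print(round(confident_right, 4))  # 0.3285
print(round(confident_wrong, 4))  # 3.912
```

Cross-entropy punishes confident wrong predictions much more heavily than hesitant ones.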

We can use:

   * many hidden layers (many more parameters)
   * many different task appropriate architectures.

<img src="multi_layer_NN_pic.png" width=600, height=600/>

Examples of different architectures:

   * Convolutional Neural Networks : Image classification
   * Recurrent Neural Networks : Text data

We'll consider a few examples... first, a pretrained CNN for image classification: ResNet50

#pip install tensorflow

In [None]:
from tensorflow import keras
from tensorflow.keras.applications import ResNet50

In [None]:
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.applications.imagenet_utils import decode_predictions

In [None]:
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import img_to_array

In [None]:
plt.figure(figsize=(8,8))
img = image.load_img("hawk.jpg", target_size = (224, 224))
plt.imshow(img)
plt.show()

Preprocessing for ResNet: 50 layers pretrained on over 1 million images from the ImageNet database. The pretrained model distinguishes 1000 categories (classes).  

ResNet50 has over ***23 MILLION*** parameters to train!!

In [None]:
img = image.img_to_array(img)
img = np.expand_dims(img, axis=0)
img = preprocess_input(img)

In [None]:
model = ResNet50(weights="imagenet")

In [None]:
preds = model.predict(img)
print("Predicted:", decode_predictions(preds, top=1)[0])

Wrong already! Or is it? (At least it was not confident...)

In [None]:
plt.figure(figsize=(8,8))
img_cr = image.load_img("hawk_cropped.jpg", target_size = (224, 224))
plt.imshow(img_cr)
plt.show()

In [None]:
img_cr = image.img_to_array(img_cr)
img_cr = np.expand_dims(img_cr, axis=0)
img_cr = preprocess_input(img_cr)

In [None]:
preds = model.predict(img_cr)
print("Predicted:", decode_predictions(preds, top=1)[0])

That's not bad: a kite is a bird of prey.

In [None]:
plt.figure(figsize=(8,8))
img_f = image.load_img("flamingo.jpg", target_size = (224, 224))
plt.imshow(img_f)
plt.show()

In [None]:
img_f = image.img_to_array(img_f)
img_f = np.expand_dims(img_f, axis=0)
img_f = preprocess_input(img_f)

In [None]:
preds = model.predict(img_f)
print("Predicted:", decode_predictions(preds, top=1)[0])

## Word embeddings

While the trigrams (and $n$-grams in general) above capture some context, they don't capture any relationships between words that don't appear near one another. We're unlikely to learn that 'beautiful' and 'pulchritudinous', or 'hairy' and 'hirsute', are close relatives. 

Word embeddings attempt to capture this by embedding words in Euclidean space in such a way that the geometry reflects the semantic meanings of the words: similar words are nearby in space. 

Train a NN to predict a word from the words around it (or the reverse)... then throw away the predictions and use the learned weights to represent words in a vector space. 

These are usually high-dimensional embeddings (100 dim or more) and are "learned" by training a neural network.
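
"Nearby in space" is usually measured with cosine similarity (the cosine of the angle between two vectors), which is also what gensim's `most_similar` reports. Here is a sketch with made-up 3-dimensional vectors; real embeddings have many more dimensions and are learned, not typed in by hand.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, for illustration only
toy = {"hairy":     [0.9, 0.1, 0.0],
       "hirsute":   [0.8, 0.2, 0.1],
       "spaceship": [0.0, 0.1, 0.9]}

close = cosine_similarity(toy["hairy"], toy["hirsute"])
far = cosine_similarity(toy["hairy"], toy["spaceship"])
print(round(close, 3), round(far, 3))
```

Vectors pointing in nearly the same direction score close to 1; unrelated directions score close to 0.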

In [None]:
#pip install gensim

In [None]:
from gensim.models.word2vec import Word2Vec

Use our IMDB reviews as a corpus... there are pretrained models available (trained on Twitter, Wikipedia, etc.)

In [None]:
texts=[ re.sub(r"[^\w]", " ", str(review)).lower().split() for review in text_train] 

In [None]:
len(texts)

Let's create a 50 dim embedding. (Note: This should be interesting/fun but isn't big enough and doesn't have enough training data to be truly impressive.)

In [None]:
t0=time.time()
model = Word2Vec(texts, vector_size=50, window=10, min_count=5,sample=1e-3, workers=2)
t1=time.time()
print(t1-t0)

In [None]:
model.wv.most_similar(positive='great')

In [None]:
model.wv.most_similar(positive='horror')

In [None]:
model.wv.most_similar(positive='fun')

In [None]:
model.wv.most_similar(negative='fun')

In [None]:
model.wv.most_similar(positive='kids')

In [None]:
vec = model.wv['scary'].reshape((1, 50)) + model.wv['blood'].reshape((1, 50))

In [None]:
model.wv.most_similar(positive=vec)

In [None]:
mwvec = model.wv['oscar'].reshape((1, 50)) + model.wv['loser'].reshape((1, 50))

In [None]:
model.wv.most_similar(positive=mwvec)

In [None]:
model.wv.most_similar(positive='vader')

In [None]:
model.wv.most_similar(positive='jabba')

In [None]:
model.wv.most_similar(positive='superman')

In [None]:
model.wv.most_similar(positive='scorsese')

In [None]:
model.wv.most_similar(positive='godfather')

In [None]:
diff= model.wv['funny'].reshape((1, 50)) - model.wv['important'].reshape((1, 50))

In [None]:
model.wv.most_similar(positive=diff)

In [None]:
model.wv.most_similar(positive='hamlet')

In [None]:
model.wv.most_similar(positive='cry')

In [None]:
model.wv.most_similar(positive='oscar')

In [None]:
model.wv.similarity('darth', 'vader')