# Import libraries

To allow easier navigation please go to

[edit]$\rightarrow$[nbextensions config]

This will open a new window. In this new window select/tick

* Collapsible Headings
* Table of Contents (2)

You can then simply close that window and refresh this page.

In the toolbar above you can now click the following icon to get a contents menu.

<img src="Images/ContentsButton.png" align='middle'>

Please now run the following code to make the notebook just a little more beautiful.

In [None]:
from IPython.display import HTML
# Load the custom stylesheet that ships alongside this notebook.
css_file_path = 'custom.css'
with open(css_file_path, 'r') as css_file:
    styles = css_file.read()
HTML(styles)

We now install and then import the following libraries that you will use in this lab.

In [None]:
!pip install gensim matplotlib pandas

In [None]:
!pip install --upgrade pandas

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt

import gensim

import sklearn.feature_extraction.text as sklearntext
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Import the IMDB data

We will be investigating data from the IMDB website. The dataset is ready to be imported, but it can also be found [here](https://www.kaggle.com/PromptCloudHQ/imdb-data).

Begin by loading the dataset into a pandas dataframe:

In [None]:
data_path = 'IMDB-Movie-Data.csv'
df_all = pd.read_csv(data_path, sep=',')
df_all.head()

We will be trying to predict if a movie is a drama given its description.

There is a lot of extra information in the dataframe, `df_all`, so we'll begin by just extracting the description and the genre.

We call the new dataframe `df`.

In [None]:
df = df_all[['Genre', 'Description']].copy()

The only standardisation of the text we will do is to make the text lower case.

In [None]:
df['Description'] = df['Description'].str.lower()

It will be useful later on to have the length (number of words) of each movie description, so let's add that as a column to our dataframe.

<div class="alert alert-block alert-info">
To get the word counts we could use scikit-learn's [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), but we will introduce this tool in a later section.
</div>

In [None]:
df['NumWords'] = df['Description'].apply(lambda x: len(x.split()))

We will also need a column indicating if the genre is *drama* or *non-drama* (these will be our labels, `y`). We will have
* 0 $\rightarrow$ non-drama
* 1 $\rightarrow$ drama

In [None]:
df['Drama'] = df['Genre'].str.contains('Drama').astype(int)

Our dataframe now has the form:

In [None]:
df.sample(n=10)

# Exploratory data analysis


One simple way of thinking about EDA is to ask which problems we *might* run into with this dataset. For each possible problem we need to understand what the data looks like so that we can deal with it. So let's look into
  * Is there a wide variance in description length, or examples with no description at all?
  * Is our data skewed (that is, do we have a strong imbalance between dramas and non-dramas)?
  * The total number of words, and how many words occur only once.

## Description lengths

Let's begin with the description lengths. We want to check the mean and standard deviation of the number of words in the input texts. Let's first plot the distributions, and then use pandas to calculate the mean and standard deviation.

In [None]:
df.hist(column='NumWords', by='Drama', layout=(1,2), bins=np.linspace(0, 80, 81), figsize=(20,10))

In [None]:
# Select the numeric column first so the statistics are not attempted on text columns.
df.groupby('Drama')['NumWords'].agg(['mean', 'median', 'sum', 'std'])

Do these distributions look OK to work with?

Does it matter that the mean isn't identical between the two categories?

## Skewed data set

Pandas' `.groupby` also lets us see very easily whether we have a skewed dataset. By skewed we mean that the number of examples for one category is much smaller than for the other. Below we find this is not the case.

In [None]:
df.groupby('Drama').agg(['count'])['NumWords']
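
As a quick cross-check that needs no pandas, the same class balance can be computed by counting the labels directly. This is only a sketch: the labels below are made up for illustration and are not taken from the IMDB data.

```python
from collections import Counter

# Hypothetical labels (1 = drama, 0 = non-drama), for illustration only.
toy_labels = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

label_counts = Counter(toy_labels)
# Fraction of examples that belong to the positive (drama) class.
drama_fraction = label_counts[1] / len(toy_labels)

print(label_counts)
print(f'Fraction of dramas: {drama_fraction:.2f}')
```

A fraction close to 0.5 indicates a balanced dataset; a fraction close to 0 or 1 would indicate a skewed one.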

Finally, we want two `np.array` objects which store the inputs and labels for our ML model. These can now be easily taken from our dataframe `df`.

In [None]:
X = np.array(df.loc[:,'Description'])
y = np.array(df.loc[:,'Drama'])

To be clear, we now have two numpy arrays: one, `X`, with a list of movie descriptions, and one, `y`, with labels 0 and 1 indicating whether that movie is a drama or not. Let's have a look at a "random" sample:

In [None]:
print(f'{X[42]} -> {y[42]}')

## Word counts

What about the total **number of words**, and the total **number of unique words**? To find these we will use scikit-learn's [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which does a bag of words embedding. This takes a little learning, but it will be useful later, so let's take a brief tour through the `CountVectorizer()` object and its output, `X_sparse`.

<div class="alert alert-block alert-info">
This sparse matrix `X_sparse` *is* a bag of words embedding, but for now we are using it as a convenient way of counting the number of words and unique words.
</div>

Let's make a `CountVectorizer`, called `count_vect`, fit it to our input, `X`, and then transform our input texts, `X`, into a bag of words embedding called `X_sparse`.

In [None]:
count_vect = CountVectorizer(lowercase=True)
count_vect = count_vect.fit(X)
X_sparse = count_vect.transform(X)
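
To make the idea concrete, here is a minimal from-scratch sketch of a bag of words embedding in plain Python. It is not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary` and `transform_counts` and the toy documents are assumptions made for illustration. The regular expression mirrors the default `token_pattern` of `CountVectorizer`, which keeps only tokens of two or more characters.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lower-case the text and return its tokens of 2+ characters."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform_counts(docs, vocabulary):
    """Return one dense count vector per document; unknown tokens are skipped."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Hypothetical documents, for illustration only.
toy_docs = ['a heist goes wrong', 'a wrong turn, a wrong town']
toy_vocab = fit_vocabulary(toy_docs)
toy_bow = transform_counts(toy_docs, toy_vocab)

print(toy_vocab)
print(toy_bow)
```

Each row of `toy_bow` has one entry per vocabulary word. The real `X_sparse` stores the same information, but keeps only the non-zero entries.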

The bag of words embedding uses a Python dictionary (like a hashmap) to convert words to numbers. In Python, dictionaries work in one direction, so let's also make a dictionary to convert our numbers back to words (an 'inverse hashmap'). Note that in Python, accessing values in a dictionary is done like:

```python
value = dictionary[key]
```

In [None]:
vocab = count_vect.vocabulary_
vocab_inverse = {}
for word, int_embedding in vocab.items(): # In Python 2: .iteritems()
    vocab_inverse[int_embedding] = word
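
The same inversion can also be written as a dictionary comprehension. The sketch below uses a small made-up vocabulary rather than `count_vect.vocabulary_`, and relies on every index being unique, which holds for a vocabulary mapping.

```python
# Hypothetical word -> index mapping, for illustration only.
small_vocab = {'heist': 0, 'town': 1, 'wrong': 2}

# Swap keys and values; valid because every index appears exactly once.
small_vocab_inverse = {index: word for word, index in small_vocab.items()}

print(small_vocab_inverse[2])
```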

We now have two dictionaries for our embeddings (these are what scikit-learn is using to convert between `X` and `X_sparse`):

`vocab`: $\:\:\textrm{word} \rightarrow \textrm{number}$

`vocab_inverse`: $\:\:\textrm{number} \rightarrow \textrm{word}$

If you want to, print out either of `vocab` or `vocab_inverse` to see what these dictionaries look like.

In [None]:
print(vocab)
print(vocab_inverse)

By investigating an example we should see what is contained within `X` and `X_sparse`. Note that `X` is a list of *all* the input data, so `X[0]` should be the description of the 'first' movie.

In [None]:
X[0]

`X_sparse` is a sparse matrix with all the input data encoded using bag of words. Here `X_sparse[42]` is the encoding of the movie at index 42, the same sample we printed earlier.

In [None]:
print(X_sparse[42])

Finally, we can check the encoding of any word. For example, number 5288 should correspond to a word appearing twice in the description:

In [None]:
print(vocab_inverse[5288])

In movie `X[20]` they mention a geologist. Which number is `geologist` encoded as?

In [None]:
#### Your code here



### Back to the word count problem

OK, back to the original problem, counting the number of words and unique words. Now that we have our sparse matrix `X_sparse` this should be very easy. We can just use the `.sum()` method of sparse matrices.

In [None]:
total_words = X_sparse.sum()
total_words_unique = len(count_vect.vocabulary_)

print('total_words: {}'.format(total_words))
print('total_words_unique: {}'.format(total_words_unique))
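
The list at the start of this section also asked how many words occur only once. One way to sketch that count without scikit-learn is with `collections.Counter`. The documents here are invented for illustration, and whitespace tokenisation is a simplifying assumption, so the numbers would differ slightly from what `CountVectorizer` reports.

```python
from collections import Counter

# Hypothetical lower-cased descriptions, for illustration only.
toy_descriptions = [
    'a detective hunts a thief',
    'the thief hunts treasure',
]

# Count every whitespace-separated token across all documents.
word_counts = Counter(word for text in toy_descriptions for word in text.split())

n_total = sum(word_counts.values())                      # all tokens, with repeats
n_unique = len(word_counts)                              # distinct tokens
n_once = sum(1 for c in word_counts.values() if c == 1)  # tokens seen exactly once

print(n_total, n_unique, n_once)
```

With the notebook's own `X_sparse`, an analogous count should be obtainable from the column sums, for example `(np.asarray(X_sparse.sum(axis=0)).ravel() == 1).sum()`.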

Why do we care about the total number of words? And why did we check the number of unique words?

# Splitting data into test and training sets

We will just use scikit-learn's built-in function for this. Notice that `sklearn.model_selection.train_test_split` shuffles the dataset, and then splits it. This means that `X[0]` and `X_train[0]` (or `X_test[0]`) most likely won't represent the same movie description.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape)
print(y_train.shape)
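
The shuffling behaviour can be illustrated with a simplified, plain-Python sketch of a shuffle-then-split routine. This is not how `train_test_split` is implemented internally; `simple_split` is a hypothetical helper that only demonstrates why positions change after the split while inputs and labels stay aligned.

```python
import random

def simple_split(items, labels, test_fraction=0.3, seed=42):
    """Shuffle the indices, then cut them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        # Select the elements of `seq` at the given positions.
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical data: ten "movies" with alternating labels.
toy_X = [f'movie {i}' for i in range(10)]
toy_y = [i % 2 for i in range(10)]
toy_X_train, toy_X_test, toy_y_train, toy_y_test = simple_split(toy_X, toy_y)

print(len(toy_X_train), len(toy_X_test))
print(toy_X_train[0])  # not necessarily 'movie 0', because of the shuffle
```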

# Creating the word embeddings

In this section we are going to create 6 matrices, which we will use as input to our machine learning models in the next section. The 6 matrices will be the following:

* `X_train_bow` and  `X_test_bow`
* `X_train_tfidf` and `X_test_tfidf`
* `X_train_wtv` and `X_test_wtv`

That is a training and a test matrix for each of the three embeddings we are using: bag of words (bow), tf-idf (tfidf) and Word2Vec (wtv).

## Bag of Words (bow) and Term-frequency Inverse-document-frequency (tfidf)

We have actually already created a bow embedding using scikit-learn's `CountVectorizer`, however we now need to make this for the train and test sets. We will also use scikit-learn's [`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for the tfidf embedding.

Previously we used the `.fit()` and `.transform()` methods. Here we will take a closer look at what they are doing.

<div class="alert alert-block alert-info">
It is also common to see the `.fit_transform()` method, which does both steps at once, although it is a little less clear what is happening in that case.
</div>

First we essentially initialise a `CountVectorizer()` object, this is the tool we use to make our embedding. This object doesn't know about our dataset yet.

In [None]:
count_vect = CountVectorizer(lowercase=True)

Then we need to 'fit' that vectorizer to our dataset (using the `.fit()` method). You can think of this as the vectorizer learning which set of words it needs to be *able* to encode.

In [None]:
count_vect = count_vect.fit(X)

Now that we have a vectorizer (called `count_vect` in this case), we can use that to transform the dataset (i.e. do the bag of words embedding). We do this with the `.transform()` method.

In [None]:
X_train_bow = count_vect.transform(X_train)
X_test_bow = count_vect.transform(X_test)

Now we will go through the process again, but instead using a tf-idf vectorizer.

In [None]:
tfidf_vect = TfidfVectorizer(lowercase=True, max_df=1.0)
tfidf_vect = tfidf_vect.fit(X)
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf = tfidf_vect.transform(X_test)
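
For reference, the weighting that `TfidfVectorizer` applies can be sketched by hand. The code below assumes the defaults documented by scikit-learn (`smooth_idf=True`, `norm='l2'`, raw term counts), for which $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$. The function name `tfidf_rows` and the toy count matrix are made up for illustration.

```python
import math

def tfidf_rows(count_rows):
    """Turn rows of term counts into L2-normalised tf-idf rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term appears.
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency.
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # Guard against all-zero rows to avoid dividing by zero.
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

# Toy counts: 2 documents x 3 terms; the first term appears in both documents.
toy_counts = [[1, 1, 0],
              [1, 0, 2]]
toy_tfidf = tfidf_rows(toy_counts)
print(toy_tfidf)
```

In the first row both terms occur once, but the term that appears in every document receives the smaller weight. That is the point of the idf factor.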

You should print out some sample encodings from `X_train_bow` and `X_train_tfidf` to check you understand what these represent. You can also print the corresponding sample from `X_train`.

<div class="alert alert-block alert-info">
`X_train_bow` and `X_train_tfidf` are just SciPy *sparse matrices* (as returned by scikit-learn), the same as `X_sparse` was, so the methods used in the word counts section should work here too.
</div>

In [None]:
### Your code here
print(X_train_bow[42])
print(X_train_tfidf[42])



## Dense (Word2Vec) word embeddings

Before we do the actual word embeddings we will investigate how these embeddings work.

### Creating dense word embedding using Gensim

Using [gensim](https://radimrehurek.com/gensim/) one can create their own word2vec word embeddings.

We will start by **building** a simple word2vec model following [this](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/) blog post. Here we train on the sentences given in the list `sentences`. This is obviously a *very* small 'dataset', and thus the results just show what an embedding can look like.

If you wish, you can change either `min_count` or `vector_size` to see what they do. But be sure to change them back to `vector_size=100, min_count=1` before moving on.

In [None]:
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

# Input dataset:
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

s1 = "President Trump confirms CIA chief secretly visited North Korean leader last week".split(" ")
s2 = "Barbara Bush former US first lady literacy campaigner has died".split(" ")
s3 = "One person killed after US passenger jet with engine failure made emergency landing in Philadelphia".split(" ")
s4 = "The prime minister personally apologized Caribbean leaders".split(" ")

# A second toy dataset; this replaces the list defined above.
sentences = [s1, s2, s3, s4]


# Train the model.
# Note: gensim >= 4.0 uses `vector_size` (called `size` in gensim 3.x).
super_simple_model = Word2Vec(sentences, vector_size=100, min_count=1)

print(super_simple_model)
# print(super_simple_model.wv.index_to_key)

After training the model we reduce the word embedding dimension from 100 to 2 using [t-distributed Stochastic Neighbor Embedding](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), then plot the words.

In [None]:
# Reduce the embeddings to 2D with t-SNE so we can plot them.
# Note: gensim >= 4.0 exposes the vocabulary as `wv.index_to_key`
# (gensim 3.x used `wv.vocab`).
words = list(super_simple_model.wv.index_to_key)
model_embeddings = super_simple_model.wv[words]
tsne = TSNE(n_components=2)
result = tsne.fit_transform(model_embeddings)

plt.figure(1, figsize=(16, 8))
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()

With so few words in our dataset, and since there is no clear meaning to the axes, it's hard to assess the embedding, but the words have some structure.

### Exploring dense word embeddings

For English there are many large, high-quality embeddings that already exist (pre-built embeddings also exist for many other languages).

We will import a word2vec embedding created from a dataset of roughly 100 billion words from Google News. The embedding covers 3 million words, but to save space, we have cut it down to 200,000 words. Each word is encoded as a 300-dimensional vector, hence the binary file downloaded below is around 250 MB.

Please run the next code section. If the download takes a while just move on to the next step; we won't need the file for a while.

In [None]:
###!wget -O gW2V.bin https://www.dropbox.com/s/tu5045ajmj20ih9/googleNewsCut.bin?dl=1

Let's now load the Google News embedding into a gensim model object (this will take ~2 mins).

In [None]:
model = gensim.models.KeyedVectors.load_word2vec_format('gW2V.bin', binary=True)

Gensim's `model` object has a range of useful methods for investigating the word2vec model. The rest of this section (down to **Word2Vec embedding for our IMDB data**) is now left completely open for you to play with.

Let's see what an embedding looks like. You can print the whole thing if you like (just remove `[:20]`).

In [None]:
word = 'house'
print('Word: {}'.format(word))
print('First 20 values of embedding:\n{}'.format(model[word][:20]))





A classic example to demonstrate the usefulness of word2vec, is solving

$X - woman = king - man$

for $X$. Note that you can think of this as asking the question "what is to woman as king is to man?"

If you want a challenge use the fact that London is the capital of England to find the capital of Norway. What other capitals does it know? Does it know capitals at all?

In [None]:
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))

print(model.most_similar(positive=['France', 'Viking'], negative=['Norway'], topn=3))

print(model.most_similar(positive=['Tennis', 'Ronaldo'], negative=['Soccer'], topn=3))
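
What `most_similar` does with `positive` and `negative` words can be approximated by plain vector arithmetic followed by a nearest-neighbour search under cosine similarity. The sketch below uses hand-made 2-dimensional vectors (not real word2vec values) and a hypothetical helper `nearest_word`; gensim's actual implementation differs in details such as normalising the input vectors first.

```python
import math

# Hand-made toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    'king':  [0.9, 0.9],
    'queen': [0.9, -0.9],
    'man':   [0.1, 0.9],
    'woman': [0.1, -0.9],
    'apple': [-0.8, 0.05],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_word(target, vectors, exclude):
    """Return the word whose vector is most similar to `target`."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# Solve X - woman = king - man, i.e. X = king - man + woman.
target = [k - m + w for k, m, w in zip(toy_vectors['king'],
                                        toy_vectors['man'],
                                        toy_vectors['woman'])]
answer = nearest_word(target, toy_vectors, exclude={'king', 'man', 'woman'})
print(answer)
```

The input words are excluded from the search because otherwise one of them is often the nearest neighbour of the result.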


The method `.doesnt_match()` finds the word which is furthest away from the other words in the embedding space.

In [None]:
# .split() simply splits the sentence into a list of words
print(model.doesnt_match("breakfast cereal dinner lunch".split()))






Word embeddings are built by analysing which words are used in conjunction with other words in the data set. Thus biases within the dataset will be present in the embeddings.

`.similarity()` calculates how similar two words are using the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).

In [None]:
print(model.similarity('man', 'nurse') - model.similarity('man', 'doctor'))
print(model.similarity('woman', 'nurse') - model.similarity('woman', 'doctor'))





### Word2Vec embedding for our IMDB data

We will just use the Google News embeddings to convert our movie descriptions to word2vec vectors. It will be useful to [wrap the embeddings into a class](http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/) with fit and transform methods so that we can use them easily with scikit-learn. Note that we will not actually need to fit this, because the embeddings are already 'fitted'.

We will use the simplest approach to encoding the description, which is to simply take the mean of the embeddings of all the words in the movie description.

In [None]:
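class WordVecVectorizer(object):
    """Encode each text as the mean of its words' word2vec vectors."""

    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = 300  # dimension of the Google News embeddings

    def fit(self, X, y=None):
        # Nothing to learn here: the embeddings are pre-trained.
        return self

    def transform(self, X):
        # Average the vectors of the known words in each text;
        # fall back to a zero vector if no word is in the vocabulary.
        return np.array([
            np.mean([self.word2vec[w] for w in text.split() if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for text in X
        ])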

In [None]:
wtv_vect = WordVecVectorizer(model)
X_train_wtv = wtv_vect.transform(X_train)
X_test_wtv = wtv_vect.transform(X_test)

print(X_train_wtv.shape) # Should be 700 (num of example) by 300 (dim. of embedding)
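
The averaging step is easiest to see on a tiny example. Below, a hypothetical 3-dimensional lookup table stands in for the 300-dimensional Google News vectors. `mean_embedding` is an illustrative helper that mirrors the logic of `transform` above, including the zero-vector fallback when no word is known.

```python
# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    'space': [1.0, 0.0, 2.0],
    'crew':  [3.0, 2.0, 0.0],
}

def mean_embedding(text, embeddings, dim=3):
    """Average the vectors of the known words; zeros if none are known."""
    known = [embeddings[w] for w in text.split() if w in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(column) / len(known) for column in zip(*known)]

print(mean_embedding('a space crew drifts', toy_embeddings))
print(mean_embedding('unknown words only', toy_embeddings))
```

Words without an embedding are simply skipped, so two descriptions that share their known words receive the same vector regardless of word order.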

Fantastic, now we have both test and training data sets with bow, tf-idf and word2vec encoding. These are stored in the matrices

* `X_train_bow` and  `X_test_bow`
* `X_train_tfidf` and `X_test_tfidf`
* `X_train_wtv` and `X_test_wtv`


# Training and predicting

We will now try logistic regression, an SVM, a decision tree, and a random forest to classify the movie genres for the three embeddings.

The basic method is
 1. Create a classifier object like `clf = LogisticRegression()`
 2. Then fit this object to the training set using `.fit()`
 3. Use the model to predict the labels for the test set using `.predict()`
 4. Evaluate the model using one of the metrics from [`sklearn.metrics`](http://scikit-learn.org/stable/modules/model_evaluation.html)

First let's load the models from scikit-learn.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

OK, step 1, create the classifier object.

In [None]:
clf_lg = LogisticRegression(random_state=42)

Now fit this classifier. We will just use the tf-idf encodings, and of course we have to give it the labels as well.

In [None]:
clf_lg.fit(X_train_tfidf, y_train)

And then evaluate the model.

In [None]:
predictions = clf_lg.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, predictions)
print('LR accuracy with tfidf: {}\n'.format(accuracy))


To understand the output, let's just look at a single example. Note that `clf_lg` is now our *fitted* logistic regression model. Reading each line you can hopefully work out what the `.predict()` and `.predict_proba()` methods do. If not you can ask, or [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) is the documentation.

In [None]:
i = 42
print(X_train[i])
print(clf_lg.predict(X_train_tfidf[i]))
print(clf_lg.predict_proba(X_train_tfidf[i]))
print(y_train[i]) # 0=non Drama, 1=Drama
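
For a binary logistic regression with default settings, the two outputs are linked: `.predict_proba()` applies the logistic (sigmoid) function to a linear score, and `.predict()` picks the class with the larger probability, which for two classes amounts to thresholding at 0.5. The sketch below illustrates this relationship with made-up scores; it does not use the fitted `clf_lg`.

```python
import math

def sigmoid(score):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Made-up linear scores (w . x + b) for three hypothetical descriptions.
toy_scores = [-2.0, 0.3, 1.5]

for score in toy_scores:
    p_drama = sigmoid(score)
    predicted = int(p_drama > 0.5)   # 1 = drama, 0 = non-drama
    print(f'score={score:+.1f}  P(non-drama)={1 - p_drama:.3f}  '
          f'P(drama)={p_drama:.3f}  predicted={predicted}')

toy_predictions = [int(sigmoid(s) > 0.5) for s in toy_scores]
```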

Let's now find the accuracy of a *support vector machine* with tf-idf encoding.

In [None]:
# With its default hinge loss, SGDClassifier fits a linear SVM:
clf_svm = SGDClassifier(random_state=42)
clf_svm.fit(X_train_tfidf, y_train)

predictions = clf_svm.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, predictions)
print('SVM accuracy with tfidf: {}\n'.format(accuracy))


Below we have named the models, as these names will be used later. By setting the `random_state` variable we simply give the seed of the random number generator a fixed value. You are free to play with this, or with the number of trees, `n_estimators`, in the random forest.

In [None]:
clf_dt = DecisionTreeClassifier(random_state=42)

### Your code here




In [None]:
clf_rf = RandomForestClassifier(n_estimators=200, random_state=42)

### Your code here




The following simply loops over the models and encodings and looks at accuracy to judge which combination is best. We have not fine-tuned the models at all, so this just provides an overview of what we have done and indicates which methods are most promising.

In [None]:
encodings = [X_train_bow, X_train_tfidf, X_train_wtv]
encodings_t = [X_test_bow, X_test_tfidf, X_test_wtv]
models = [clf_lg, clf_svm, clf_rf, clf_dt]

I = pd.Index(["Bow", "Tf-idf", "Word2Vec"], name="Encodings:")
C = pd.Index(["LR", "SVM", "RF", 'DT'], name="Models:")
# Use a separate name so the movie dataframe `df` is not overwritten.
df_results = pd.DataFrame(index=I, columns=C)
for i, (e, e_t) in enumerate(zip(encodings, encodings_t)):
    for j, m in enumerate(models):
        m.fit(e, y_train)
        predictions = m.predict(e_t)
        df_results.iloc[i, j] = accuracy_score(y_test, predictions)
df_results

# Adding n-grams

Thus far, in all of our word encodings, we have only dealt with single words and have not taken into account the order of the words in the sentences. This can be done by using n-grams (sets of n consecutive words which we treat as a single object).

Although it is conceptually possible to create n-grams for the word2vec model this is not generally done and we will just investigate n-grams for the bag of words and tf-idf models.

Using n-grams is part of the embedding, thus we will have to return to our scikit-learn vectorizers. These have a built-in parameter, used like `ngram_range=(1, 3)`, where in this case we would generate all single words, pairs, and triplets. This is most easily explained through an example.

<div class="alert alert-block alert-info">
The `token_pattern` below will pick out all words (the default ignores words with one letter).
You may include this in other pieces of code, above or below, if you want to see the results.
</div>

In [None]:
sentence =  ['this sentence is a short sentence']

ngrammer = CountVectorizer(lowercase=True, 
                           ngram_range=(2, 2),
                           token_pattern=u"(?u)\\b\\w+\\b",
                           max_df=1.0)
ngrammer = ngrammer.fit(sentence)
encoded = ngrammer.transform(sentence)

print(ngrammer.vocabulary_)
print(encoded)

If you try different values for the `ngram_range` in the code above then you should find that the values are the lower and upper bounds of the n-gram range. For example with `ngram_range=(2, 2)` we only get bi-grams. With `ngram_range=(1, 3)` we would get words, bi-grams and tri-grams.
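
The n-gram construction itself is simple enough to sketch in a few lines of plain Python. `word_ngrams` is a hypothetical helper (not part of scikit-learn) that joins every run of `n` consecutive tokens, for each `n` in the inclusive range, which is the same meaning `ngram_range` has.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams with min_n <= n <= max_n as space-joined strings."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

toy_tokens = 'this sentence is a short sentence'.split()

bigrams = word_ngrams(toy_tokens, ngram_range=(2, 2))
print(bigrams)
print(len(word_ngrams(toy_tokens, ngram_range=(1, 3))))
```

The number of n-grams grows quickly with the upper bound, and so does the number of columns in the resulting embedding.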

You can now check if adding n-grams helps the overall model. Below we have copy pasted the tf-idf encoding using the logistic regression model (renaming the variables).

Also, the parameter `max_df` can cut out words which occur too often in your texts. For example, if the word `the` occurs in 94% of all the texts, it might actually be making things harder for our classifier. You may thus want to play with this parameter. You will also have to think about how overfitting might affect the results with a high value for the upper n-gram bound.

Finally, the default value for `ngram_range` is `(1, 1)`, so using that should give the results you already achieved above (with `max_df=1.0`).

In [None]:
### Test how changing the ngram_range affects results:

fancy_vect = TfidfVectorizer(lowercase=True, ngram_range=(1, 1), max_df=1.0)
fancy_vect = fancy_vect.fit(X)
X_train_fancy = fancy_vect.transform(X_train)
X_test_fancy = fancy_vect.transform(X_test)
    
# Fitting the model:
clf_lg = LogisticRegression()
clf_lg.fit(X_train_fancy, y_train)

# Now evaluating our model:
predictions = clf_lg.predict(X_test_fancy)
accuracy = accuracy_score(y_test, predictions)
print('Log. Reg. accuracy with tfidf: {}\n'.format(accuracy))
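
As an illustration of the `max_df` parameter discussed above, the sketch below computes each word's document frequency on a made-up corpus and drops words that appear in more than a chosen fraction of the documents. The helper `filter_by_max_df` is hypothetical and uses whitespace tokenisation as a simplification. For a float `max_df`, scikit-learn documents the same idea of ignoring terms whose document frequency is strictly higher than the threshold.

```python
from collections import Counter

def filter_by_max_df(docs, max_df=1.0):
    """Keep words that occur in at most `max_df` (a fraction) of the documents."""
    # Count each word once per document it occurs in.
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    limit = max_df * len(docs)
    return sorted(word for word, df in doc_freq.items() if df <= limit)

# Hypothetical corpus, for illustration only.
toy_corpus = [
    'the heist begins',
    'the crew escapes',
    'the heist fails',
    'a quiet town',
]

kept_all = filter_by_max_df(toy_corpus, max_df=1.0)   # keeps every word
kept_half = filter_by_max_df(toy_corpus, max_df=0.5)  # drops 'the' (in 3 of 4 documents)
print(kept_all)
print(kept_half)
```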

# Extra - Building a pipeline and configuring the Tfidf and Bow vectorizers

We have now gone through quite a lot of options and ways to build our movie classifier. We now want the option that will classify new movies as drama or not as accurately as possible.

There are roughly equal numbers of drama and non-drama movies, and we don't especially care about getting one category correct at the cost of the other, so accuracy on the test set will be good enough to measure model quality.

But first we will take a short detour over a point we skimmed over in our word embeddings.

How many words does the description of `X[20]` have? Is this consistent with the number of words in the bow embedding (`X_sparse[20]`)? You can use the `.sum()` method for sparse matrices, and `len(string.split())` to split the string by whitespace and count the items in the resulting list.

In [None]:
### Your code here




You should have found a difference of 4. Hopefully by printing the description you can work out why. The answer is also in the help section at the bottom of this document.

## Pipelines

Scikit-learn has objects called [pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), which can essentially string together various parts of our model. These will be helpful in testing different models quickly. For example:

In [None]:
bow_lr = Pipeline([
    ('bow', CountVectorizer(lowercase=True)),
    ('lr', LogisticRegression()), ])
bow_lr.fit(X_train, y_train)
print(np.mean(bow_lr.predict(X_test) == y_test))
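
Conceptually, a pipeline passes the data through each intermediate step's `fit` and `transform`, and hands the result to the final estimator. The sketch below is a deliberately simplified stand-in, not scikit-learn's `Pipeline`: `MiniPipeline`, `WordCounter` and `ThresholdClassifier` are hypothetical classes written only to show the flow of data.

```python
class MiniPipeline:
    """Chain transformers and a final estimator (illustrative only)."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        # Fit and apply every transformer, then fit the final estimator.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Apply the already-fitted transformers, then predict.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


class WordCounter:
    """Toy transformer: represents each text by its number of words."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [len(text.split()) for text in X]


class ThresholdClassifier:
    """Toy estimator: predicts 1 when the feature exceeds the training mean."""

    def fit(self, X, y):
        self.threshold_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [int(x > self.threshold_) for x in X]


toy_pipe = MiniPipeline([('count', WordCounter()), ('clf', ThresholdClassifier())])
toy_pipe.fit(['short text', 'a much longer piece of text'], [0, 1])
toy_pred = toy_pipe.predict(['tiny', 'this one has quite a few more words'])
print(toy_pred)
```

The practical benefit is that raw text goes in at one end and predictions come out at the other, so swapping a vectorizer or a classifier is a one-line change.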

We are using almost entirely the default settings of the models and embeddings which we got from scikit-learn. To give some idea of how customizable this is, we could instead change our models to something like:

<div class="alert alert-block alert-info">
`stop_words='english'` tells the vectorizer to ignore scikit-learn's built-in list of common English words (also available as `sklearntext.ENGLISH_STOP_WORDS`).
</div>

In [None]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_df=0.3,
                              stop_words='english',
                              lowercase=False,
                              ngram_range=(1, 3))),
    ('log_reg', LogisticRegression()) ])
pipe.fit(X_train, y_train)
print(np.mean(pipe.predict(X_test) == y_test))

Suddenly we have many different parameters for a single model, and we must use a combination of actually testing different combinations, and considering which options we think are most likely to work well.

### Changing `random_state`

We have quite a lot of options and model outcomes, and seem ready to generate a lot of new results. But if you play with the `random_state` variable you will see that we are at the limit of how much we can tune our model given the small size of our dataset. The last thing you want to do is investigate a model for days and begin to see structure which really isn't there, when you would get roughly similar results simply using the best we have found thus far (the default logistic regression model with a word2vec encoding).

However, you are encouraged to play with the model and try to improve its accuracy. In the help section at the bottom of this document I have made a function which searches for the optimal regularization value, and I get an accuracy of roughly 72% on the test set.

In [None]:
# Feel free to use any combination of the tools above to create
# a model with decent accuracy. Check either the pipelines above,
# or even try encapsulating a model in a function so you can scan
# over it more easily, as is done in the help section below.

### Your code here





# Extra reading and acknowledgements

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

http://www.davidsbatista.net/blog/2017/04/01/document_classification/

https://github.com/alexandres/lexvec#pre-trained-vectors

http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html

https://stats.stackexchange.com/questions/267169/how-to-use-pre-trained-word2vec-model

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/


# Help with exercises

## Length of strings and bow representation

You should get a string of 31 words, of which only 27 are encoded in the bow model. This is because `CountVectorizer()` ignores all words of length 1 by default, and there are 4 instances of 'a' in the description.
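
The discrepancy can be reproduced on any sentence that contains one-letter words. In this sketch the example sentence is invented (it is not `X[20]`), and the regular expression is the default `token_pattern` of `CountVectorizer`.

```python
import re

# Hypothetical description, for illustration only.
toy_description = 'a geologist finds a cave under a lake'

n_split = len(toy_description.split())
# Default CountVectorizer token pattern: words of two or more characters.
n_tokens = len(re.findall(r"(?u)\b\w\w+\b", toy_description))

print(n_split, n_tokens, n_split - n_tokens)
```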

## Checking for a skewed data set

In [None]:
print('Number of dramas / number of non-dramas: {} / {}'.format(sum(y), len(y) - sum(y)))

## Optimising the regularization in the best 'out of the box' model

In [None]:
def model_search(c_val):
    pipe = Pipeline([('wtv', WordVecVectorizer(model)),
                     ('log_reg', LogisticRegression(C=c_val, random_state=42)) ])
    pipe.fit(X_train, y_train)
    return np.mean(pipe.predict(X_test) == y_test)

a_s = np.random.rand(100,)*8 # random float number in [0,8)
data = []
for a in a_s:
    data.append([a, model_search(a)])

In [None]:
df_op = pd.DataFrame(data, columns=['C', 'Accuracy'])
df_op.plot(x='C', y='Accuracy', kind='scatter')