In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

In [2]:
from static_grader import grader

# NLP: Analyzing Review Text


Unstructured data makes up the vast majority of data.  This is a basic intro to handling unstructured data.  Our objective is to be able to extract the sentiment (positive or negative) from review text.  We will do this using Yelp review data.

The first three questions task you to build models, of increasing complexity, to predict the rating of a review from its text.  These models will be assessed based on the root mean squared error of the number of stars predicted.  There is a reference solution (which should not be too hard to beat) that defines the score of 1.
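
Since the models are assessed by root mean squared error, it can help to compute the same kind of metric locally on held-out reviews. The following is a minimal sketch using only the standard library; the function name `rmse` and the toy numbers are illustrative, and the grader's exact scoring formula is not shown in this notebook.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example: three reviews with true and predicted star ratings
true_stars = [5, 3, 1]
predicted_stars = [4.0, 3.0, 3.0]
print(rmse(true_stars, predicted_stars))
```

A lower value is better; a model that always predicts the mean rating gives a useful baseline to compare against.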

The final question asks only for the result of a calculation, and your results will be compared directly to those of a reference solution.


## A note on scoring

It **is** possible to score >1 on these questions. This indicates that you've beaten our reference model - we compare our model's score on a test set to your score on a test set. See how high you can go!


## Download and parse the data


To start, let's download the data set from Amazon S3:

In [6]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_review_reduced.json.gz'

download: s3://dataincubator-course/mldata/yelp_train_academic_dataset_review_reduced.json.gz to ./yelp_train_academic_dataset_review_reduced.json.gz


The training data are a series of JSON objects, in a Gzipped file. Python supports Gzipped files natively: [`gzip.open`](https://docs.python.org/2/library/gzip.html) has the same interface as `open`, but handles `.gz` files automatically.

The built-in `json` package has a `loads()` function that converts a JSON string into a Python dictionary.  We could call that once for each row of the file. [`ujson`](http://docs.micropython.org/en/latest/library/ujson.html) has the same interface as the built-in `json` library, but is *substantially* faster (at the cost of less robust handling of malformed JSON).  We will use that inside a list comprehension to get a list of dictionaries:

In [3]:
import gzip
import ujson as json

with gzip.open('yelp_train_academic_dataset_review_reduced.json.gz') as f:
    data = [json.loads(line) for line in f]

Scikit Learn will want the labels in a separate data structure, so let's pull those out now.

In [None]:
#data=pd.read_json('yelp_train_academic_dataset_review_reduced.json.gz',lines=True)

In [None]:
stars = [row['stars'] for row in data]

### Notes:

1. [Pandas](http://pandas.pydata.org/) is able to read JSON text directly.  Use the `read_json()` function with the `lines=True` keyword argument.  While the rest of this notebook will assume you are using a list of dictionaries, you can complete it with dataframes, if you so desire. Some of the example code will need to be modified in this case.

2. There are obvious mistakes in the data.  There is no need to try to correct them.


## Building models


For the first three questions, you will need to build and train an estimator to predict the star rating from the text of a review.  We recommend building a pipeline out of transformers and estimators provided by Scikit Learn.  You can decide whether these pipelines should take full review objects or just their text as input to the `fit()` and `predict()` methods, but it does pay to be consistent.

You may find it useful to serialize the trained models to disk.  This will allow you to reload the models after restarting the notebook, without needing to retrain them.  We recommend using the [`dill` library](https://pypi.python.org/pypi/dill) for this (although the [`joblib` library](http://scikit-learn.org/stable/modules/model_persistence.html) also works).  Use
```python
dill.dump(estimator, open('estimator.dill', 'wb'))
```
to serialize the object `estimator` to the file `estimator.dill`.  If you have trouble with this, try setting the `recurse=True` keyword argument in the call of `dill.dump()`.  The estimator can be deserialized with
```python
estimator = dill.load(open('estimator.dill', 'rb'))
```

You may run into trouble with the size of your models and Digital Ocean's memory limit. This is a major concern in real-world applications. Your production environment will likely not be that different from Digital Ocean and being able to deploy there is important. Think about what information the different stages of your pipeline need and how you can reduce the memory footprint.

Additionally, you may notice that your serialized models are very large and take a long time to load.  Some hints to reduce their size:

- If you are using `GridSearchCV` to find the optimal values of hyperparameters (and you should be), the resultant object will contain many copies of the estimator that aren't needed any more.  Instead of serializing the whole `GridSearchCV`, serialize just the estimator with the correct hyperparameters.  This can be accessed through the `.best_estimator_` attribute of the `GridSearchCV` object.  Alternatively, the `.best_params_` attribute gives the best values of the hyperparameters.

- The `CountVectorizer` keeps track of all words that were excluded from vectorization in its `.stop_words_` attribute.  This can be interesting to examine, but isn't needed for predictions.  Set this attribute to the empty list before serializing it to save disk space.

# Questions


Each of the "model" questions asks you to create a function that models the number of stars given in a review from the review text.  It will be passed a list of dictionaries.  Each of these will have the same format as the JSON objects you've just read in.  This function should return a list of numbers of the same length, giving the predicted star ratings.

This function is passed to the `score()` function, which will receive input from the grader, run your function with that input, report the results back to the grader, and print out the score the grader returned.  Depending on how you constructed your estimator, you may be able to pass the predict method directly to the `score()` function.  If not, you will need to write a small wrapper function to mediate the data types.
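
If your estimator expects plain text rather than full review records, the wrapper mentioned above can be very small. The sketch below uses a stand-in estimator (`DummyTextEstimator`, which is hypothetical) so that it runs on its own; in practice you would pass your fitted pipeline instead.

```python
class DummyTextEstimator:
    """Hypothetical stand-in for a fitted model that takes a list of strings."""

    def predict(self, texts):
        # Pretend longer reviews are more positive, capped at 5 stars
        return [min(5.0, 1.0 + len(t.split())) for t in texts]


def make_record_predictor(estimator, key='text'):
    """Wrap estimator.predict so it accepts a list of review dictionaries."""
    def predict_records(records):
        texts = [record[key] for record in records]
        # Convert to a plain list of floats, whatever the estimator returns
        return [float(p) for p in estimator.predict(texts)]
    return predict_records


predict_records = make_record_predictor(DummyTextEstimator())
sample = [{'text': 'great food', 'stars': 5}, {'text': 'bad', 'stars': 1}]
print(predict_records(sample))
```

The resulting `predict_records` function takes the same input format the grader supplies, so it could be handed to `score()` in place of a bare `predict` method.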


## bag_of_words_model

Build a linear model predicting the star rating based on the count of the words in each document (bag-of-words model).  Use a [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) or [`HashingVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer) to produce a feature matrix giving the counts of each word in each review.  Feed this into a linear model, such as `Ridge` or `SGDRegressor`, to predict the number of stars from each review.

**Hints**:
1. Don't forget to use tokenization!  This is important for good performance, but it is also the most expensive step.  Try vectorizing once as an initial step, and then run grid search and cross-validation only on this pre-processed data.  `CountVectorizer` has to memorize the mapping between each word and the column index to which it is assigned, so its memory use is linear in the size of the vocabulary.  The `HashingVectorizer` does not have to remember this mapping and will lead to much smaller models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

text = [row['text'] for row in data]
X = CountVectorizer().fit_transform(text)

# Now, models with many different hyperparameters can be fit
# without needing to retrain the vectorizer:
model = Ridge(alpha=1.0)  # alpha=1.0 is only an example value
model.fit(X, stars)
```

2. Try choosing different values for `min_df` (minimum document frequency cutoff) and `max_df` in `CountVectorizer`.  Setting `min_df` to zero admits rare words which might only appear once in the entire corpus.  This is both prone to overfitting and makes your data unmanageably large.  Don't forget to use cross-validation to select the right value.  Notice that `HashingVectorizer` doesn't support `min_df` and `max_df`.  However, it's not hard to roll your own transformer that applies these cutoffs.

3. Try using [`LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) or [`RidgeCV`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV).  If the memory footprint is too big, try switching to [Stochastic Gradient Descent](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor).  You might find that even ordinary linear regression fails due to the data size.  Don't forget to use [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) to determine the regularization parameter!  How do the regularization parameter `alpha` and the values of `min_df` and `max_df` from `CountVectorizer` change the answer?

4. You will likely pick up several hyperparameters between the tokenization step and the regularization of the estimator.  While it is more strictly correct to do a grid search over all of them at once, this can take a long time. Quite often, doing a grid search over a single hyperparameter at a time can produce similar results.  Alternatively, the grid search may be done over a smaller subset of the data, as long as it is representative of the whole.

5. Finally, assemble a pipeline that will transform the data from records all the way to predictions.  This will allow you to submit its predict method to the grader for scoring.
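
Putting the hints together, one possible shape for such a pipeline is sketched here. It is not the reference solution: the tiny inline corpus, the `TextSelector` helper, and the `alpha` grid are all made up for illustration, and real values should come from cross-validation on the Yelp data.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class TextSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: pull the 'text' field out of review records."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [record['text'] for record in X]


# Made-up records standing in for the Yelp reviews
records = [{'text': t} for t in [
    'great food and great service', 'terrible food', 'great place',
    'terrible service', 'good food', 'bad place and bad food',
    'great great great', 'terrible terrible', 'good service', 'bad food',
]]
stars = [5, 1, 5, 1, 4, 2, 5, 1, 4, 2]

pipeline = Pipeline([
    ('select', TextSelector()),
    ('vectorizer', CountVectorizer()),
    ('ridge', Ridge()),
])

# Search over the regularization strength only; the grid is illustrative
search = GridSearchCV(pipeline, {'ridge__alpha': [0.1, 1.0, 10.0]},
                      cv=2, scoring='neg_mean_squared_error')
search.fit(records, stars)

model = search.best_estimator_  # refit on all data with the best alpha
predictions = model.predict([{'text': 'great food'}, {'text': 'terrible place'}])
print(search.best_params_, predictions)
```

Because the pipeline accepts full records, `model.predict` can be handed to the grader directly, and only `best_estimator_` needs to be serialized.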

In [203]:
import pandas as pd
import numpy as np
import re
from sklearn import base
from sklearn.model_selection import train_test_split,GridSearchCV
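# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the sklearn.feature_extraction.stop_words module is gone in newer releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as ENG_stopwords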
import string
from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer,HashingVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
punctuations = string.punctuation
import pickle
import scipy.sparse
import collections
import ujson as json

In [130]:
#!python -m spacy download en_core_web_lg

In [5]:
import spacy
nlp=spacy.load('en')

In [None]:
df=pd.read_json('yelp_train_academic_dataset_review_reduced.json.gz',lines=True)
df.head()

In [8]:
# df_train, df_test = train_test_split(df[['text','stars']],\
#                                      stratify=df['stars'],\
#                                      test_size=0.2,\
#                                      random_state=18)

# df_train_y=df_train.stars
# df_train_X=df_train.drop('stars',axis=1)
# df_test_X=df_test.drop('stars',axis=1)
# df_test_y=df_test.stars

In [164]:
# df_train_y.to_pickle('df_train_y')

# df_train_X.to_pickle('df_train_X')

# df_test_X.to_pickle('df_test_X')

# df_test_y.to_pickle('df_test_y')

In [74]:
df_train_y=pd.read_pickle('df_train_y')
df_train_X=pd.read_pickle('df_train_X')
df_test_X=pd.read_pickle('df_test_X')
df_test_y=pd.read_pickle('df_test_y')

In [75]:
class ColumnSelectTransformer(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self, col_names):
        self.col_names = col_names  # We will need these in transform()
    
    def fit(self, X, y=None):
        # This transformer doesn't need to learn anything about the data,
        # so it can just return self without any further processing
        return self
    
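    def transform(self, X):
        # Return the selected column as a flat list with one entry per row of X
        # (a one-element list of column names yields one-element rows, which
        # are flattened here)
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        alist = X[self.col_names].values.tolist()
        if alist and isinstance(alist[0], list):
            # Flatten in linear time; sum(alist, []) is quadratic in len(alist)
            return [item for row in alist for item in row]
        return alist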

In [76]:
class PreprocessorTransformer(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self, col_names='text'):
        self.col_names = col_names  # We will need these in transform()
    
    def fit(self, X, y=None):
        # This transformer doesn't need to learn anything about the data,
        # so it can just return self without any further processing
        return self
    
    def transform(self, X):
        results=[]
        for i in X:
            
            cleaned=preprocessor(i).split()
            removed=remove_stop_pun(cleaned)
            results.append(removed)
            
        
        return results
        

In [95]:
def clean_text(text):     
    return text.strip().lower()
import nltk  # the stopwords corpus must have been fetched with nltk.download("stopwords")
nltk_stopwords_mod=set(nltk.corpus.stopwords.words('english'))
def remove_stop_pun(text):
#     results=[]
    return [tok for tok in text if (tok not in punctuations and tok not in nltk_stopwords_mod)]
#     for tok in text:
#         if not bool(re.search(r'\d', tok)):
#             if tok not in punctuations:
#                 if tok not in nltk.corpus.stopwords.words('english'):
#                     results.append(tok)
            
#     return results

In [117]:
with open('yelp_stopwords.txt','rb') as file:
    yelp_stopwords=file.readlines()[0].decode('utf-8').split('\r')

In [120]:
nltk_stopwords_mod.update(yelp_stopwords)
nltk_stopwords_mod.update(['bla','well'])

In [87]:
def pre_tokenizer(text):
    if not isinstance(text, str):
        text=str(text)
    cleaned=preprocessor(text).split()
    removed=remove_stop_pun(cleaned)
    return removed

In [104]:
#!wget -O yelp_stopwords.txt https://raw.githubusercontent.com/Jigar24/Yelp-Topic-Modelling/master/stopwords.txt

--2018-10-11 14:59:26--  https://raw.githubusercontent.com/Jigar24/Yelp-Topic-Modelling/master/stopwords.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.20.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3622 (3.5K) [text/plain]
Saving to: 'yelp_stopwords.txt'


2018-10-11 14:59:26 (75.5 MB/s) - 'yelp_stopwords.txt' saved [3622/3622]



In [100]:
pre_tokenizer('ezzyujdouig4p gyb3pv_a this is a test bla bla well')

['test']

In [7]:
#remove_stop_pun(preprocessor(fst_stp[0]).split())

In [56]:
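def preprocessor(astring):
    # Strip HTML tags
    astring = re.sub(r'<[^>]*>', '', astring)
    # Drop any whitespace-delimited token that contains a digit
    astring = re.sub(r"\S*\d\S*", "", astring).strip()
    # Remember emoticons such as :-) or ;P before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', astring)
    # Lowercase, collapse non-word characters to spaces, re-append emoticons
    astring = re.sub(r'[\W]+', ' ', astring.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return astring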

In [64]:
col_select=ColumnSelectTransformer('text')

In [65]:
fst_stp=col_select.fit_transform(df_temp)

In [71]:
process=PreprocessorTransformer('text')

In [72]:
snd_stp=process.fit_transform(fst_stp)

In [103]:
cv = CountVectorizer(tokenizer=pre_tokenizer)
#cv1 = CountVectorizer(stop_words=stopwords)

In [104]:
tokens=cv.fit_transform(fst_stp)

In [113]:
rigid=linear_model.Ridge()

In [114]:
rigid.fit(tokens,df_train_y[:4])

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [106]:
print(tokens.toarray())

[[1 1 0 ... 1 0 1]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 [0 1 1 ... 0 0 0]]


In [38]:
t=PreprocessorTransformer()

In [39]:
t.fit_transform(df_temp)

NameError: name 'df_temp' is not defined

In [26]:
nlp=spacy.load('en_core_web_lg',disable=['ner', 'parser'])

In [32]:
def spacy_tokenizer(sentence):
    # nlp() parses one string into a Doc of tokens; nlp.pipe() expects an
    # iterable of strings and would treat each character as its own document
    doc = nlp(sentence)
    # Lemmatize and lowercase; spaCy 2.x lemmatizes pronouns to "-PRON-",
    # so fall back to the lowercased surface form for those
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in doc]
    # Drop stop words (the NLTK-based set built earlier) and punctuation
    tokens = [tok for tok in tokens if (tok not in nltk_stopwords_mod and tok not in punctuations)]
    return tokens

In [121]:
class SpacyTransformer(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self, col_names='text'):
        self.col_names = col_names  # We will need these in transform()
    
    def fit(self, X, y=None):
        # This transformer doesn't need to learn anything about the data,
        # so it can just return self without any further processing
        return self
    
    def transform(self, X):
        if isinstance(X,pd.DataFrame):
            X=X[self.col_names].values.tolist()
        #print(X)
        return [spacy_tokenizer(i) for i in X]
        

In [None]:
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1)) 
classifier = LinearSVC()

In [143]:
import nltk

In [153]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/vagrant/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [291]:
from sklearn.pipeline import Pipeline
#from nltk.corpus import stopwords, words
bag_of_words_est3 = Pipeline([
    ('text',ColumnSelectTransformer(['text'])),
    #('strip_space',PreprocessorTransformer()),
    #('vectorizer',CountVectorizer()),
    ('hvect', HashingVectorizer(norm='l2',tokenizer = pre_tokenizer,stop_words=nltk_stopwords_mod)),
    #('hvect', HashingVectorizer(norm='l2',stop_words=nltk.corpus.stopwords.words('english'))),
    #('vectorizer',CountVectorizer(tokenizer = pre_tokenizer)),
    ('Ridge', linear_model.Ridge(alpha = 0.2))
])
#bag_of_words_est.fit(data, stars)

In [173]:
# import pickle
# pkl_filename = "hash_model.pkl"  
# with open(pkl_filename, 'wb') as file:  
#     pickle.dump(bag_of_words_est2, file)

In [156]:
with open(pkl_filename, 'rb') as file:  
    bag_of_words_est = pickle.load(file)

In [292]:
bag_of_words_est3.fit(df_train_X[:1000],df_train_y[:1000])

Pipeline(memory=None,
     steps=[('text', ColumnSelectTransformer(col_names=['text'])), ('hvect', HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

In [168]:
y_pred=bag_of_words_est2.predict(df_test_X)

In [203]:
parameters = {
    'vectorizer__max_df': (0.5, 0.75, 1.0),
    'vectorizer__min_df': (0.01,0.05),
    'vectorizer__max_features': (None, 5000, 10000),
    #'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    #'clf__max_iter': (5,),
    'Ridge__alpha': (0.00001, 0.3)
    #'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

In [None]:
# bag_of_words_est3 = Pipeline([
#     ('text',ColumnSelectTransformer(['text'])),
#     #('strip_space',PreprocessorTransformer()),
#     #('vectorizer',CountVectorizer()),
#     #('hvect', HashingVectorizer(norm='l2',stop_words=nltk.corpus.stopwords.words('english'))),
#     ('vectorizer',CountVectorizer(tokenizer = pre_tokenizer)),
#     ('Ridge', linear_model.Ridge(alpha = 0.1))
# ])

In [293]:
bag_of_words_est3.fit(df_train_X,df_train_y)

Pipeline(memory=None,
     steps=[('text', ColumnSelectTransformer(col_names=['text'])), ('hvect', HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

In [None]:
import pickle
pkl_filename = "hash_model2.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(bag_of_words_est3, file)

In [201]:
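# The grid defined earlier is named `parameters`; its `vectorizer__*` keys assume
# the pipeline has a step named 'vectorizer' (the CountVectorizer variant)
grid_search = GridSearchCV(bag_of_words_est3, param_grid=parameters, cv=5)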

In [170]:
y_pred

array([3.27827595, 3.6560669 , 3.65202657, 4.41642968, 3.45992641,
       4.19996179, 3.89467914, 3.63441835, 3.7703331 , 4.64428548,
       3.90513529, 3.48517155, 4.4345296 , 4.36572501, 3.16345258,
       3.99171452, 3.60752442, 3.87664112, 3.62246333, 4.23079934,
       3.8147538 , 3.81146937, 3.75912095, 3.71261672, 4.23989826,
       2.94474994, 3.35020055, 3.17176146, 3.92071311, 3.63209517,
       3.5885292 , 3.70242953, 3.69104267, 4.09912584, 4.07969628,
       3.96249576, 3.11019962, 3.52979469, 3.51629881, 4.21057077,
       3.600051  , 2.89419729, 3.51314514, 3.90138743, 4.53044089,
       4.01160075, 3.33016271, 4.27964687, 3.4369498 , 4.26310209,
       4.27375108, 4.03590278, 3.65633296, 3.55806772, 3.98209199,
       4.35543886, 3.96320675, 3.89328294, 4.05817187, 3.00605404,
       3.17682462, 3.95521873, 3.93447817, 3.19445001, 3.76652267,
       3.04702488, 3.1389399 , 4.13334395, 2.72894043, 3.366292  ,
       3.86964989, 4.15000206, 2.93155745, 3.07105869, 4.55786

In [197]:
accuracy_score(df_test_y[:1000].values,np.round(bag_of_words_est3.predict(df_test_X[:1000]),0))

0.363

In [None]:
# pkl_filename = "hash_model.pkl"  
# with open(pkl_filename, 'wb') as file:  
#     pickle.dump(bag_of_words_est, file)

In [None]:
cat_pipe = Pipeline([('cst',ColumnSelectTransformer(['categories'])),
                     ('DictEn',DictEncoder()),
                     ('DictVec',DictVectorizer(sparse=False)),
                     ('Ridge', linear_model.Ridge(alpha = 0.1))
                     
        # ColumnSelectTransformer
        # KNeighborsRegressor
    ])

In [14]:
grader.score('nlp__bag_of_words_model', bag_of_words_est3.predict)

Your score:  1.3194332860293514


In [289]:
grader.score('nlp__bag_of_words_model', bag_of_words_est3.predict)

Your score:  1.2619672911699913


## normalized_model

Normalization is key for good linear regression. Previously, we used the count as the normalization scheme.  Add in a normalization transformer to your pipeline to improve the score.  Try some of these:

1. You can use "is this word present in this document" as a normalization scheme, which means the values are always 1 or 0.  This gives no additional weight to a word that appears multiple times.

2. Try using the log of the number of counts (or more precisely, $\log(x+1)$). This is often used because we want the repeated presence of a word to count for more, but we want that effect to taper off.

3. [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common normalization scheme used in text processing.  Use the [`TfidfTransformer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer). There are options for using `idf` and taking the logarithm of `tf`.  Do these significantly affect the result?

Finally, if you can't decide which one is better, don't forget that you can combine models with a linear regression.
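
To make the three schemes concrete, here is a small standard-library sketch applied to a toy count matrix. The numbers are made up, and the TF-IDF shown is the textbook `tf * log(N / df)` form; scikit-learn's `TfidfTransformer` uses a smoothed idf and normalizes each row by default, so its values will differ. In scikit-learn, the first two schemes roughly correspond to `CountVectorizer(binary=True)` and `TfidfTransformer(sublinear_tf=True)`.

```python
import math

# Toy term-count matrix: rows are documents, columns are vocabulary terms
counts = [
    [3, 0, 1],
    [0, 2, 1],
    [1, 0, 1],
]

# 1. Binary presence: 1 if the term occurs at all, otherwise 0
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

# 2. Log scaling: log(1 + count) grows with repeats, but ever more slowly
log_scaled = [[math.log1p(c) for c in row] for row in counts]

# 3. TF-IDF with idf = log(N / df); a term found in every document
#    gets weight zero
n_docs = len(counts)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
idf = [math.log(n_docs / df) for df in doc_freq]
tfidf = [[c * w for c, w in zip(row, idf)] for row in counts]

print(binary)
print(log_scaled)
print(tfidf)
```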

In [15]:
grader.score('nlp__normalized_model', bag_of_words_est3.predict)

Your score:  1.2183051415982085


In [290]:
grader.score('nlp__normalized_model', bag_of_words_est3.predict)

Your score:  1.1652436357642129


In [None]:
grader.score('nlp__normalized_model', normalized_est.predict)

## bigram_model

In a bigram model, we'll consider both single words and pairs of consecutive words that appear.  This is going to be a much higher dimensional problem (large $p$) so you should be careful about overfitting.

Sometimes, reducing the dimension can be useful.  Because we are dealing with a sparse matrix, we have to use [`TruncatedSVD`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD).  If we reduce the dimensions, we can use more sophisticated models than linear ones.

As before, memory problems can crop up due to the engineering constraints. Playing with the number of features, using the `HashingVectorizer`, incorporating `min_df` and `max_df` limits, and handling stop-words in some way are all methods of addressing this issue. If you are using `CountVectorizer`, it is possible to run it with a fixed vocabulary (based on a training run, for instance). Check the documentation.
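
As an illustration of the fixed-vocabulary idea, the sketch below learns unigrams and bigrams on a few made-up sentences, keeps only terms that pass a `min_df` cutoff, and then reuses that vocabulary in a second `CountVectorizer` that never needs fitting. The documents and the cutoff are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for review text
train_docs = [
    'the ice cream was great',
    'great ice cream and great coffee',
    'the coffee was cold',
]

# Learn unigrams and bigrams, keeping only terms seen in at least 2 documents
learner = CountVectorizer(ngram_range=(1, 2), min_df=2)
learner.fit(train_docs)
vocabulary = sorted(learner.vocabulary_)

# A second vectorizer with the vocabulary fixed up front needs no fitting
fixed = CountVectorizer(ngram_range=(1, 2), vocabulary=vocabulary)
X_new = fixed.transform(['ice cream and more ice cream'])

print(vocabulary)
print(X_new.toarray())
```

Terms outside the fixed vocabulary (here "and" and "more") are simply ignored, which keeps the feature matrix the same width no matter what text arrives later.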

**A side note on multi-stage model evaluation:** When your model consists of a pipeline with several stages, it can be worthwhile to evaluate which parts of the pipeline have the greatest impact on the overall accuracy (or other metric) of the model. This allows you to focus your efforts on improving the important algorithms, while leaving the rest "good enough".

One way to accomplish this is through ceiling analysis, which can be useful when you have a training set with ground truth values at each stage. Let's say you're training a model to extract image captions from websites and return a list of names that were in the caption. Your overall accuracy at some point reaches 70%. You can try manually giving the model what you know are the correct image captions from the training set, and see how the accuracy improves (maybe up to 75%). Alternatively, giving the model the perfect name parsing for each caption increases accuracy to 90%. This indicates that the name parsing is a much more promising target for further work, and the caption extraction is a relatively smaller factor in the overall performance.

If you don't know the right answers at different stages of the pipeline, you can still evaluate how important different parts of the model are to its performance by changing or removing certain steps while keeping everything else constant. You might try this kind of analysis to determine how important adding stopwords and stemming to your NLP model actually is, and how that importance changes with parameters like the number of features.

In [175]:
grader.score('nlp__bigram_model', bag_of_words_est2.predict)

Your score:  1.2149417045378152


In [16]:
bag_of_words_est4 = Pipeline([
    ('text',ColumnSelectTransformer(['text'])),
    #('strip_space',PreprocessorTransformer()),
    #('vectorizer',CountVectorizer()),
    ('hvect', HashingVectorizer(norm='l2',ngram_range=(1,2),tokenizer = pre_tokenizer)),
    #('hvect', HashingVectorizer(norm='l2',stop_words=nltk.corpus.stopwords.words('english'))),
    #('vectorizer',CountVectorizer(tokenizer = pre_tokenizer)),
    ('Ridge', linear_model.Ridge(alpha = 0.2))
])

In [17]:
bag_of_words_est4.fit(df_train_X,df_train_y)

Pipeline(memory=None,
     steps=[('text', ColumnSelectTransformer(col_names=['text'])), ('hvect', HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

In [20]:
import pickle
pkl_filename = "bigram_model.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(bag_of_words_est4, file)

In [18]:
grader.score('nlp__bigram_model', bag_of_words_est4.predict)

Your score:  1.2534451191274427


In [None]:
grader.score('nlp__bigram_model', bigram_est.predict)

## food_bigrams

Look over all reviews of restaurants.  You can determine which businesses are restaurants by looking in the `yelp_train_academic_dataset_business.json.gz` file from the ml project or downloaded below.

In [21]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_business.json.gz'

download: s3://dataincubator-course/mldata/yelp_train_academic_dataset_business.json.gz to ./yelp_train_academic_dataset_business.json.gz


In [79]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_review.json.gz'

download: s3://dataincubator-course/mldata/yelp_train_academic_dataset_review.json.gz to ./yelp_train_academic_dataset_review.json.gz


In [80]:
import gzip
import ujson as json

# Load the business file (one JSON object per business)
with gzip.open('yelp_train_academic_dataset_business.json.gz') as f:
    business_data = [json.loads(line) for line in f]

Each row of this file corresponds to a single business.  The `categories` key gives a list of categories for each; take all where "Restaurants" appears.

In [8]:
business_data=pd.read_json('yelp_train_academic_dataset_business.json.gz',lines=True)

In [7]:
review_data=pd.read_json('yelp_train_academic_dataset_review.json.gz',lines=True)

In [9]:
len(review_data)

1012913

In [12]:
len(df)

NameError: name 'df' is not defined

In [82]:
business_data.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
0,vcNAWiLM4dR7D2nwwJ7nCA,2007-05-17,15SdjuK7DmYqUAj6rjGowg,5,dr. goldberg offers everything i look for in a...,review,Xqd0DzHaiyRqVH3WRG7hzg,"{'funny': 0, 'useful': 2, 'cool': 1}"
1,vcNAWiLM4dR7D2nwwJ7nCA,2010-03-22,RF6UnRTtG7tWMcrO2GEoAg,2,"Unfortunately, the frustration of being Dr. Go...",review,H1kH6QZV7Le4zqTRNxoZow,"{'funny': 0, 'useful': 2, 'cool': 0}"
2,vcNAWiLM4dR7D2nwwJ7nCA,2012-02-14,-TsVN230RCkLYKBeLsuz7A,4,Dr. Goldberg has been my doctor for years and ...,review,zvJCcrpm2yOZrxKffwGQLA,"{'funny': 0, 'useful': 1, 'cool': 1}"
3,vcNAWiLM4dR7D2nwwJ7nCA,2012-03-02,dNocEAyUucjT371NNND41Q,4,Been going to Dr. Goldberg for over 10 years. ...,review,KBLW4wJA_fwoWmMhiHRVOA,"{'funny': 0, 'useful': 0, 'cool': 0}"
4,vcNAWiLM4dR7D2nwwJ7nCA,2012-05-15,ebcN2aqmNUuYNoyvQErgnA,4,Got a letter in the mail last week that said D...,review,zvJCcrpm2yOZrxKffwGQLA,"{'funny': 0, 'useful': 2, 'cool': 1}"


In [93]:
len(business_data)

37938

In [10]:
business_data_mod=business_data[business_data['categories'].apply(str).str.contains("Restaurants")]

In [11]:
restaurant_ids = business_data_mod['business_id'].values.tolist()

In [12]:
assert len(business_data_mod) == 12876

The "business_id" here is the same as in the review data.  Use this to extract the review text for all reviews of restaurants.

In [121]:
# Keep only the reviews whose business_id belongs to a restaurant
restaurant_reviews = review_data[review_data.business_id.isin(restaurant_ids)].copy()

In [122]:
len(restaurant_reviews)

574278

In [22]:
assert len(restaurant_reviews) == 143361

AssertionError: 

We want to find collocations --- that is, bigrams that are "special" and appear more often than you'd expect from chance. We can think of the corpus as defining an empirical distribution over all *n*-grams.  We can find word pairs that are unlikely to occur consecutively based on the underlying probability of their words. Mathematically, if $p(w)$ is the probability of a word $w$ and $p(w_1 w_2)$ is the probability of the bigram $w_1 w_2$, then we want to look at word pairs $w_1 w_2$ where the statistic

  $$ \frac{p(w_1 w_2)}{p(w_1) p(w_2)} $$

is high.  Return the top 100 (mostly food) bigrams with this statistic with the 'right' prior factor (see below).

Estimating the probabilities is simply a matter of counting, and there are a number of approaches that will work.  One is to use one of the tokenizers to count up how many times each word and each bigram appears in each review, and then sum those up over all reviews.  You might want to know that the `CountVectorizer` has a `.get_feature_names()` method (`.get_feature_names_out()` in scikit-learn 1.0 and later) which gives the string associated with each column.  (Question for thought: Why doesn't the `HashingVectorizer` have a similar method?)

*Questions:* This statistic is a ratio and problematic when the denominator is small.  We can fix this by applying Bayesian smoothing to $p(w)$ (i.e. mixing the empirical distribution with the uniform distribution over the vocabulary).

1. How does changing this smoothing parameter affect the word pairs you get qualitatively?

2. We can interpret the smoothing parameter as adding a constant number of occurrences of each word to our distribution.  Does this help you determine a reasonable value for this 'prior factor'?

3. For fun: also check out [Amazon's Statistically Improbable Phrases](http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases).
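
One way to explore these questions is to add a constant pseudo-count to each word count before forming the ratio. The sketch below uses invented counts and a hypothetical helper `smoothed_score`. It works with raw counts rather than normalized probabilities, which changes the scores by a constant factor but not the ranking.

```python
from collections import Counter

# Invented counts for illustration: one rare pair and one common pair.
unigram_counts = Counter({"rula": 2, "bula": 2, "happy": 500, "hour": 400})
bigram_counts = Counter({"rula bula": 2, "happy hour": 300})

def smoothed_score(bigram, alpha):
    """count(w1 w2) / ((count(w1) + alpha) * (count(w2) + alpha))."""
    w1, w2 = bigram.split(" ")
    denom = (unigram_counts[w1] + alpha) * (unigram_counts[w2] + alpha)
    return bigram_counts[bigram] / denom

# Rank the two bigrams for a small, a moderate, and a very large alpha.
for alpha in (0, 30, 1000):
    ranked = sorted(
        bigram_counts, key=lambda b: smoothed_score(b, alpha), reverse=True
    )
    print(alpha, ranked)
```

With these made-up numbers the rare pair leads for small `alpha`, and the frequent pair overtakes it once `alpha` dwarfs the rare words' counts.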

*Implementation note:*
As you adjust the size of the Bayesian smoothing parameter, you will notice first nonsense phrases being removed and then legitimate bigrams being removed, leaving you with only generic bigrams.  The goal is to find a value of the smoothing parameter between these two transitions.

The reference solution is not an aggressive filterer: it errs in favor of leaving apparently nonsensical words. On further consideration, many of these are actually somewhat meaningful. The smoothing parameter chosen in the reference solution is equivalent to giving each word 30 previous appearances prior to considering this data.  This was chosen by generating a list of bigrams for a range of smoothing parameters and seeing how many of the bigrams were shared between neighboring values.  When the shared fraction reached 95%, we judged the solution to have converged.  Note that `min_df` should not be set so high that it excludes these borderline words.
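
The convergence check described above could be approximated as follows. The exact procedure used by the reference solution is not given, so this is only one plausible reading; `top_k`, `shared_fraction`, and the scores are invented for illustration.

```python
def top_k(scores, k):
    """Return the k highest-scoring keys as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def shared_fraction(scores_a, scores_b, k):
    """Fraction of the top-k keys that two score dictionaries have in common."""
    return len(top_k(scores_a, k) & top_k(scores_b, k)) / k

# Invented scores for two neighboring smoothing values.
scores_alpha_25 = {"a b": 0.9, "c d": 0.8, "e f": 0.7, "g h": 0.1}
scores_alpha_30 = {"a b": 0.8, "c d": 0.75, "g h": 0.6, "e f": 0.2}

print(shared_fraction(scores_alpha_25, scores_alpha_30, k=3))
```

In practice one would compute this for each pair of neighboring smoothing values with `k=100` and look for the point where the fraction levels off.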

In [15]:
restaurant_reviews.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
8,JwUE5GmEO-sH1FuwJgKBlQ,2009-05-03,9uHZyOu5CTCDl1L6cfvOCA,4,Good truck stop dining at the right price. We ...,review,p4ySEi8PEli0auZGBsy6gA,"{'funny': 0, 'useful': 0, 'cool': 0}"
9,JwUE5GmEO-sH1FuwJgKBlQ,2009-05-04,ow1c4Lcl3ObWxDC2yurwjQ,4,"If you like lot lizards, you'll love the Pine ...",review,ZYaumz29bl9qHpu-KVtMGA,"{'funny': 6, 'useful': 0, 'cool': 0}"
10,JwUE5GmEO-sH1FuwJgKBlQ,2010-10-30,FRTCszJWkJonDAZx3yr8FA,4,Enjoyable experience for the whole family. The...,review,SvS7NXWG2B2kFoaHaWdGfg,"{'funny': 0, 'useful': 0, 'cool': 0}"
11,JwUE5GmEO-sH1FuwJgKBlQ,2011-02-06,qQIvtbqUujvvnJDzPSfmFA,4,One of my favorite truck stop diners with soli...,review,qOYI9O0ecMJ9VaqcM9phNw,"{'funny': 0, 'useful': 0, 'cool': 0}"
12,JwUE5GmEO-sH1FuwJgKBlQ,2011-03-31,4iPPOQIo5Mr1NAUPUgCUrQ,4,Only went here once about a year and a half ag...,review,EEYwj6_t1OT5WQGypqEPNg,"{'funny': 0, 'useful': 0, 'cool': 0}"


In [123]:
restaurant_reviews['cleaned_text']=restaurant_reviews.text.apply(preprocessor)

In [21]:
restaurant_reviews['cleaned_text'].head()

8     good truck stop dining at the right price we l...
9     if you like lot lizards you ll love the pine c...
10    enjoyable experience for the whole family the ...
11    one of my favorite truck stop diners with soli...
12    only went here once about a year and a half ag...
Name: cleaned_text, dtype: object

In [20]:
import nltk

In [157]:
monogram=CountVectorizer(ngram_range=(1,1),stop_words=nltk.corpus.stopwords.words('english'),min_df=20)
bigram=CountVectorizer(ngram_range=(2,2),stop_words=nltk.corpus.stopwords.words('english'),min_df=20)

In [139]:
monogram=CountVectorizer(ngram_range=(1,1),tokenizer=pre_tokenizer,min_df=20)
bigram=CountVectorizer(ngram_range=(2,2),tokenizer=pre_tokenizer,min_df=20)

In [180]:
monogram=CountVectorizer(ngram_range=(1,1),stop_words=nltk_stopwords_mod,min_df=20)
bigram=CountVectorizer(ngram_range=(2,2),stop_words=nltk_stopwords_mod,min_df=20)

In [209]:
monogram_tf=TfidfVectorizer(ngram_range=(1,1),stop_words=nltk_stopwords_mod,min_df=20)
bigram_tf=TfidfVectorizer(ngram_range=(2,2),stop_words=nltk_stopwords_mod,min_df=20)

In [137]:
restaurant_reviews['cleaned_text'][:10]

8     good truck stop dining at the right price we l...
9     if you like lot lizards you ll love the pine c...
10    enjoyable experience for the whole family the ...
11    one of my favorite truck stop diners with soli...
12    only went here once about a year and a half ag...
13    great truck stop restaurant i ve had breakfast...
14    yeah thats right a five freakin star rating fi...
15    ate a saturday morning breakfast at the pine c...
16    attention fans of david lynch do stop by this ...
17    with a recent addition of a truck driver for a...
Name: cleaned_text, dtype: object

In [158]:
mono_tokens=monogram.fit_transform(restaurant_reviews['cleaned_text'])

In [181]:
mono_tokens=monogram.fit_transform(restaurant_reviews['cleaned_text'])

In [210]:
mono_tokens_tf=monogram_tf.fit_transform(restaurant_reviews['cleaned_text'])

In [159]:
bigram_tokens=bigram.fit_transform(restaurant_reviews['cleaned_text']) 

In [211]:
bigram_tokens_tf=bigram_tf.fit_transform(restaurant_reviews['cleaned_text']) 

In [160]:
len(monogram.get_feature_names())

25822

In [212]:
len(monogram_tf.get_feature_names())

25468

In [None]:
len(mono_tokens.toarray().sum(axis=0))

In [175]:
# pickle.load needs an open binary file object, not a path string.
with open('mono_tokens.pkl','rb') as file:
    temp=pickle.load(file)

In [33]:
type(monogram)

sklearn.feature_extraction.text.CountVectorizer

In [25]:

scipy.sparse.save_npz('mono_tokens_pre_token.npz',mono_tokens)

In [26]:
scipy.sparse.save_npz('bigram_tokens_pre_token.npz',bigram_tokens)

In [161]:
monogram.get_feature_names()[:40]

['__',
 '___',
 '____',
 '_____',
 'aa',
 'aaa',
 'aaaaand',
 'aaaand',
 'aaah',
 'aah',
 'aaron',
 'ab',
 'aback',
 'abacus',
 'abalone',
 'abandon',
 'abandoned',
 'abandoning',
 'abbey',
 'abbreviated',
 'abby',
 'abc',
 'abd',
 'abe',
 'aber',
 'aberdeen',
 'aberration',
 'abhor',
 'abhorrent',
 'abide',
 'abilities',
 'ability',
 'abit',
 'abita',
 'able',
 'abnormal',
 'abnormally',
 'abode',
 'abominable',
 'abomination']

In [143]:
monogram.get_feature_names()[:40]

[':(',
 ':)',
 ';(',
 ';)',
 '=(',
 '=)',
 '__',
 '___',
 '____',
 '_____',
 '______',
 '_______',
 'aa',
 'aaa',
 'aaaaand',
 'aaaah',
 'aaaand',
 'aaah',
 'aaahhh',
 'aaand',
 'aah',
 'aahh',
 'aahhh',
 'aahing',
 'aand',
 'aaron',
 'aarp',
 'ab',
 'aback',
 'abacus',
 'abalone',
 'abandon',
 'abandoned',
 'abandoning',
 'abbey',
 'abbreviated',
 'abby',
 'abc',
 'abd',
 'abdominal']

In [133]:
temp=monogram.fit_transform(restaurant_reviews['cleaned_text'].values[:4].tolist())

In [136]:
restaurant_reviews['cleaned_text'].values[:4].tolist()

['good truck stop dining at the right price we love coming here on the weekends when we don t feel like cooking ',
 'if you like lot lizards you ll love the pine cone ',
 'enjoyable experience for the whole family the wait staff was courteous and friendly the food was reasonably priced and a good value a word of advice leave room for dessert the deserters are great but huge plan to bring some home',
 'one of my favorite truck stop diners with solid food and friendly quick service my god those desserts are huge i can t imagine eating that giant cream puff all the food we had was delicious and i love how they leave a carafe of coffee on the table love this place would definitely be back if i was in the area ']

In [40]:
import pickle
# with open('bigram.pkl', 'wb') as file:  
#     pickle.dump(bigram,file)

In [43]:
bigram.get_feature_names()[:10]

['00 00',
 '00 10',
 '00 11',
 '00 12',
 '00 15',
 '00 16',
 '00 20',
 '00 25',
 '00 30',
 '00 50']

In [34]:
import pickle
# with open('bigram_pre_token.pkl','wb') as file:
#     pickle.dump(bigram,file)

In [34]:
with open('bigram.pkl','rb') as file:
    bigram=pickle.load(file)

In [3]:
mono_tokens=scipy.sparse.load_npz('mono_tokens.npz')

In [26]:
bigram_tokens=scipy.sparse.load_npz('bigram_tokens.npz')

In [4]:
mono_tokens.shape

(574278, 36031)

In [162]:
mono_count=mono_tokens.sum(axis=0).tolist()[0]

In [163]:
bi_count=bigram_tokens.sum(axis=0).tolist()[0]

In [213]:
mono_count_tf=mono_tokens_tf.sum(axis=0).tolist()[0]

bi_count_tf=bigram_tokens_tf.sum(axis=0).tolist()[0]

In [164]:
bi_count[:10]

[37, 43, 23, 23, 44, 26, 27, 212, 42, 23]

In [None]:
bi_dict=collections.Counter(dict(zip(bigram.get_feature_names(),bi_count)))

In [None]:
mono_dict=collections.Counter(dict(zip(monogram.get_feature_names(),mono_count)))

In [214]:
bi_dict_tf=collections.Counter(dict(zip(bigram_tf.get_feature_names(),bi_count_tf)))

mono_dict_tf=collections.Counter(dict(zip(monogram_tf.get_feature_names(),mono_count_tf)))

In [75]:
with open('bi_dict.pkl','wb') as file:
    pickle.dump(bi_dict,file)

In [70]:
# with open('mono_dict.pkl','rb') as file:
#     mn_d=pickle.load(file)

In [35]:
mono_keys=monogram.get_feature_names()

In [167]:
bi_dict.most_common(10)

[('go back', 33030),
 ('happy hour', 29059),
 ('really good', 28536),
 ('pretty good', 27814),
 ('food good', 24912),
 ('first time', 24033),
 ('las vegas', 22697),
 ('next time', 22326),
 ('come back', 21760),
 ('good food', 19204)]

In [215]:
bi_dict_tf.most_common(10)

[('food good', 3351.783143853856),
 ('great food', 2914.3366070386523),
 ('happy hour', 2810.6719081596725),
 ('good food', 2605.024382340969),
 ('food great', 2592.3777364948446),
 ('pretty good', 2543.954824402212),
 ('great service', 2377.1107304746224),
 ('love place', 2237.890852905853),
 ('las vegas', 2148.1662178591664),
 ('service great', 2123.73771882867)]

In [149]:
mono_dict.most_common(100)

[('food', 472694),
 ('good', 437679),
 ('place', 376964),
 ('great', 281531),
 ('service', 228799),
 ('time', 197303),
 ('back', 177252),
 ('ordered', 143858),
 ('restaurant', 142027),
 ('chicken', 131181),
 ('order', 126268),
 ('menu', 124803),
 ('nice', 119697),
 ('love', 112091),
 ('pretty', 103697),
 ('delicious', 101981),
 ('eat', 97732),
 ('pizza', 97391),
 ('vegas', 95980),
 ('sauce', 92883),
 ('cheese', 91614),
 ('bar', 88361),
 ('lunch', 86279),
 ('salad', 85845),
 ('fresh', 83571),
 ('meal', 82411),
 ('people', 82321),
 ('made', 79967),
 ('dinner', 79810),
 ('table', 79105),
 ('night', 79064),
 ('friendly', 77780),
 ('make', 77632),
 ('wait', 76882),
 ('amazing', 72467),
 ('burger', 71590),
 ('staff', 71082),
 ('bit', 66942),
 ('experience', 66400),
 ('sushi', 64262),
 ('fries', 63871),
 ('bad', 63502),
 ('side', 61566),
 ('hot', 59524),
 ('give', 59208),
 ('meat', 58177),
 ('price', 57454),
 ('happy', 57195),
 ('times', 57007),
 ('thing', 56386),
 ('drinks', 56300),
 ('small

In [14]:
t=np.array(t)

In [16]:
len(t[0])

36031

In [216]:
def get_prob(mono_dict,bi_dict,alpha):
    # Score each bigram as count(w1 w2) / ((count(w1)+alpha) * (count(w2)+alpha)),
    # where alpha is the smoothing 'prior factor' added to every unigram count.
    results=collections.Counter()
    for b in bi_dict:
        m=b.split(" ")
        results[b]=bi_dict[b]/((mono_dict[m[0]]+alpha)*(mono_dict[m[1]]+alpha))
    return results
    

In [282]:
final_results=get_prob(mono_dict_tf,bi_dict_tf,25)
final_results.most_common(5)

[('rula bula', 0.009376564929761026),
 ('riff raff', 0.00920401941606817),
 ('reina pepiada', 0.009194342739019753),
 ('knick knacks', 0.009051696104908873),
 ('itty bitty', 0.008999222794068928)]

In [283]:
final_results.most_common(100)

[('rula bula', 0.009376564929761026),
 ('riff raff', 0.00920401941606817),
 ('reina pepiada', 0.009194342739019753),
 ('knick knacks', 0.009051696104908873),
 ('itty bitty', 0.008999222794068928),
 ('pel meni', 0.00898740267601008),
 ('baskin robbins', 0.008979195391923137),
 ('ropa vieja', 0.008819885640172246),
 ('himal chuli', 0.008804720727354491),
 ('dac biet', 0.008762665220077365),
 ('krispy kreme', 0.00868209424149129),
 ('gulab jamun', 0.008536320951759414),
 ('khai hoan', 0.00845208462123658),
 ('uuu uuu', 0.008434733255585624),
 ('hoity toity', 0.008275640752286385),
 ('cien agaves', 0.008249026370630842),
 ('hodge podge', 0.008154213414545737),
 ('tammie coe', 0.00803201016088791),
 ('puerto rican', 0.00787556959447633),
 ('tutti santi', 0.007839300247338232),
 ('roka akor', 0.007819867333295032),
 ('feng shui', 0.007622866704267403),
 ('wal mart', 0.00756750504648341),
 ('leaps bounds', 0.007542607946011882),
 ('patatas bravas', 0.0075254578380124565),
 ('hu tieu', 0.00732

In [284]:
# Unused filter, kept for reference: if bi_dict[word] > 30
final_list=[word for word, word_count in final_results.most_common(100)]

In [285]:
grader.score('nlp__food_bigrams', lambda: final_list)

Your score:  1.0


In [227]:
final_list_tf=[word for word, word_count in final_results.most_common(1000) if bi_dict_tf[word] > 30] 

In [171]:
final_results.most_common(100)

[('hodge podge', 0.002586206896551724),
 ('himal chuli', 0.002516074923119933),
 ('hoity toity', 0.002511989038593286),
 ('roka akor', 0.0024509803921568627),
 ('knick knacks', 0.0024346353339684554),
 ('reina pepiada', 0.002420520231213873),
 ('cien agaves', 0.0024174327545114062),
 ('baskin robbins', 0.0023941343707915607),
 ('itty bitty', 0.0023631762762197544),
 ('khai hoan', 0.00234192037470726),
 ('riff raff', 0.0023106123122627496),
 ('grana padano', 0.002250768555116381),
 ('tutti santi', 0.0022428526814865287),
 ('ropa vieja', 0.00219705659036203),
 ('gulab jamun', 0.0021843145412939464),
 ('pel meni', 0.0021634615384615386),
 ('ore ida', 0.0021608643457382954),
 ('laan xang', 0.0021397177885727502),
 ('dac biet', 0.0021314848108538913),
 ('rula bula', 0.002129735389301352),
 ('hu tieu', 0.002099306316173786),
 ('innis gunn', 0.0020891747759634945),
 ('bandeja paisa', 0.0020710059171597634),
 ('tammie coe', 0.002054612937433722),
 ('chicha morada', 0.002046007520460075),
 ('al

In [228]:
len(final_list_tf)

336

In [241]:
grader.score('nlp__food_bigrams', lambda: final_list)

Your score:  0.9600000000000006


In [222]:
final_results.most_common(100)

[('amuse bouche', 0.002449538880231712),
 ('panna cotta', 0.0024366664262408114),
 ('hong kong', 0.002368348656065774),
 ('dac biet', 0.0022203418262186697),
 ('kool aid', 0.00221921927720082),
 ('hush puppies', 0.002174993577926386),
 ('pei wei', 0.002159885646362734),
 ('rula bula', 0.0021597104711979357),
 ('wi fi', 0.0021574838154616743),
 ('http www', 0.0021246549830347475),
 ('joel robuchon', 0.002071413355549124),
 ('osso bucco', 0.0020663521320564973),
 ('pina colada', 0.0020560820705300357),
 ('bok choy', 0.002038502953195484),
 ('tres leches', 0.002027480482770849),
 ('valle luna', 0.0019444043001599273),
 ('croque madame', 0.0019429611874709279),
 ('prix fixe', 0.0019142608363247845),
 ('wal mart', 0.0019129014109983777),
 ('ami gabi', 0.0019120632510543846),
 ('hustle bustle', 0.0018992244982861065),
 ('krispy kreme', 0.0018890288611304914),
 ('coca cola', 0.0018856716025204975),
 ('beaten path', 0.0018605632783578686),
 ('kilt lifter', 0.0018466222309688651),
 ('huevos ran

In [277]:
len(final_list[1:5] +final_list[6:102])

100

In [91]:
tt + collections.Counter(dict.fromkeys(tt, 5))

Counter({'and': 6,
         'document': 9,
         'first': 7,
         'is': 9,
         'one': 6,
         'second': 6,
         'the': 9,
         'third': 6,
         'this': 9})

In [196]:
corpus = [
     'This is the first document. Solve this interesting project.',
     'This document is the second document and third one and another third one. Time-series analysis is interesting.',
     'And this is the third one. Homework needs to be done. It is quite an interesting project.',
     'Is this the first document? Which movie are we watching? Can we do Time-series homework now?',
 ]

In [197]:
vectorizer = CountVectorizer(stop_words=yelp_stopwords)
vectorizer_bi = CountVectorizer(ngram_range=(2,2),stop_words=yelp_stopwords)

In [206]:
tfidf=TfidfVectorizer(stop_words=yelp_stopwords)
x_tf=tfidf.fit_transform(corpus)

In [207]:
print(tfidf.get_feature_names(),end=" ")

['analysis', 'document', 'homework', 'interesting', 'movie', 'project', 'series', 'solve', 'time', 'watching'] 

In [208]:
x_tf.sum(axis=0).tolist()[0]

[0.4833548558354761,
 1.334769978906423,
 0.9951080723719671,
 1.2142573236714682,
 0.4838099584718287,
 1.1187668445441918,
 0.762523848576601,
 0.6406554311067799,
 0.762523848576601,
 0.4838099584718287]

In [198]:
x=vectorizer.fit_transform(corpus)
x_bi=vectorizer_bi.fit_transform(corpus)

In [199]:
x.sum(axis=0).tolist()[0]

[1, 4, 2, 3, 1, 2, 2, 1, 2, 1]

In [200]:
x_bi.sum(axis=0).tolist()[0]

[1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1]

In [201]:
vectorizer.get_feature_names()

['analysis',
 'document',
 'homework',
 'interesting',
 'movie',
 'project',
 'series',
 'solve',
 'time',
 'watching']

In [202]:
print(vectorizer_bi.get_feature_names(),end=" ")

['analysis interesting', 'document document', 'document movie', 'document solve', 'document time', 'homework interesting', 'interesting project', 'movie watching', 'series analysis', 'series homework', 'solve interesting', 'time series', 'watching time'] 

In [99]:
tt_mono=collections.Counter(dict(list(zip(vectorizer.get_feature_names(),x.sum(axis=0).tolist()[0]))))

In [100]:
tt_bi=collections.Counter(dict(list(zip(vectorizer_bi.get_feature_names(),x_bi.sum(axis=0).tolist()[0]))))

In [101]:
tt_mono

Counter({'and': 1,
         'document': 4,
         'first': 2,
         'is': 4,
         'one': 1,
         'second': 1,
         'the': 4,
         'third': 1,
         'this': 4})

In [105]:
tt_bi 

Counter({'and this': 1,
         'document is': 1,
         'first document': 2,
         'is the': 3,
         'is this': 1,
         'second document': 1,
         'the first': 2,
         'the second': 1,
         'the third': 1,
         'third one': 1,
         'this document': 1,
         'this is': 2,
         'this the': 1})

In [109]:
'and this'.split(" ")

['and', 'this']

In [147]:
for ab,val in tt_bi.items():
    a,b=ab.split(" ")
    print (ab,val/(tt_mono[a]*tt_mono[b])*1.0)

and this 0.25
document is 0.0625
first document 0.25
is the 0.1875
is this 0.0625
second document 0.25
the first 0.25
the second 0.25
the third 0.25
third one 1.0
this document 0.0625
this is 0.125
this the 0.0625


In [None]:
bigram_prob = [biprobs[b]/(monoprobs[s[0]]*monoprobs[s[1]]) for b,s in zip(unique_biwords,bi_keys_split)]

In [138]:
azip=([*tt_bi],list(map(lambda x: x.split(" "),[*tt_bi])))

In [142]:
for b,s in zip([*tt_bi],list(map(lambda x: x.split(" "),[*tt_bi]))):
    print(b,s[0],s[1])

and this and this
document is document is
first document first document
is the is the
is this is this
second document second document
the first the first
the second the second
the third the third
third one third one
this document this document
this is this is
this the this the


In [173]:
temp_dict=collections.Counter()
for b,s in zip([*tt_bi],list(map(lambda x: x.split(" "),[*tt_bi]))):
    temp_dict[b]= tt_bi[b]/(tt_mono[s[0]]*tt_mono[s[1]]*1.0)

In [171]:
temp_c=collections.Counter(temp_dict)

In [177]:
temp_c.most_common(8)

[('third one', 1.0),
 ('and this', 0.25),
 ('first document', 0.25),
 ('second document', 0.25),
 ('the first', 0.25),
 ('the second', 0.25),
 ('the third', 0.25),
 ('is the', 0.1875)]

In [153]:
[(tt_bi[b],tt_mono[s[0]],tt_mono[s[1]],\
         tt_bi[b]/(tt_mono[s[0]]*tt_mono[s[1]]*1.0)) \
 for b,s in zip([*tt_bi],list(map(lambda x: x.split(" "),[*tt_bi])))]
   

[(1, 1, 4, 0.25),
 (1, 4, 4, 0.0625),
 (2, 2, 4, 0.25),
 (3, 4, 4, 0.1875),
 (1, 4, 4, 0.0625),
 (1, 1, 4, 0.25),
 (2, 4, 2, 0.25),
 (1, 4, 1, 0.25),
 (1, 4, 1, 0.25),
 (1, 1, 1, 1.0),
 (1, 4, 4, 0.0625),
 (2, 4, 4, 0.125),
 (1, 4, 4, 0.0625)]

In [156]:
[tt_bi[b]/(tt_mono[s[0]]*tt_mono[s[1]]*1.0) \
 for b,s in zip([*tt_bi],list(map(lambda x: x.split(" "),[*tt_bi])))]
   

[0.25,
 0.0625,
 0.25,
 0.1875,
 0.0625,
 0.25,
 0.25,
 0.25,
 0.25,
 1.0,
 0.0625,
 0.125,
 0.0625]

In [155]:
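# Note: this variant multiplies the first word's count by itself (s[0] twice),
# so it is not the bigram statistic, which divides by the counts of both words.
[tt_bi[b]/(tt_mono[s[0]]*tt_mono[s[0]]*1.0) \
 for b,s in zip([*tt_bi],list(map(lambda x: x.split(" "),[*tt_bi])))]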
    

[1.0,
 0.0625,
 0.5,
 0.1875,
 0.0625,
 1.0,
 0.125,
 0.0625,
 0.0625,
 1.0,
 0.0625,
 0.125,
 0.0625]

In [130]:
[*tt_mono.items()]

[('and', 1),
 ('document', 4),
 ('first', 2),
 ('is', 4),
 ('one', 1),
 ('second', 1),
 ('the', 4),
 ('third', 1),
 ('this', 4)]

In [161]:
[*tt_bi.items()]

[('and this', 1),
 ('document is', 1),
 ('first document', 2),
 ('is the', 3),
 ('is this', 1),
 ('second document', 1),
 ('the first', 2),
 ('the second', 1),
 ('the third', 1),
 ('third one', 1),
 ('this document', 1),
 ('this is', 2),
 ('this the', 1)]

In [46]:
print(final_list[:100],end="")

['hodge podge', 'himal chuli', 'hoity toity', 'roka akor', 'knick knacks', 'reina pepiada', 'cien agaves', 'baskin robbins', 'itty bitty', 'khai hoan', 'riff raff', 'grana padano', 'tutti santi', 'ropa vieja', 'gulab jamun', 'pel meni', 'ore ida', 'laan xang', 'dac biet', 'rula bula', 'hu tieu', 'innis gunn', 'bandeja paisa', 'tammie coe', 'chicha morada', 'alain ducasse', 'feng shui', 'pièce résistance', 'leaps bounds', 'dol sot', 'itsy bitsy', 'mille feuille', 'marche bacchus', 'uuu uuu', 'nooks crannies', 'celine dion', 'nanay gloria', 'doon varna', 'luc lac', 'krispy kreme', 'woonam jung', 'perrier jouet', 'deja vu', 'molecular gastronomy', 'puerto rican', 'vice versa', 'patatas bravas', 'sais quoi', 'cullen skink', 'lloyd wright', 'pura vida', 'lomo saltado', 'valle luna', 'nuoc mam', 'wal mart', 'dueling pianos', 'bradley ogden', 'barnes noble', 'avant garde', 'honky tonk', 'haricot vert', 'kao tod', 'irn bru', 'ak yelpcdn', 'porta alba', 'lis doon', 'khao soi', 'malai kofta', 'a

In [160]:
tt_bi+collections.Counter(dict.fromkeys(tt_bi,5))

Counter({'and this': 6,
         'document is': 6,
         'first document': 7,
         'is the': 8,
         'is this': 6,
         'second document': 6,
         'the first': 7,
         'the second': 6,
         'the third': 6,
         'third one': 6,
         'this document': 6,
         'this is': 7,
         'this the': 6})
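
The cell above adds a constant to every count by summing two `Counter` objects. A minimal sketch of that idiom follows, with invented counts and an arbitrary `prior` value.

```python
from collections import Counter

counts = Counter({"third one": 1, "is the": 3})
prior = 5  # pseudo-count, value chosen only for illustration

# dict.fromkeys builds {key: prior} for every existing key; Counter addition
# then sums the two counts key by key.
smoothed = counts + Counter(dict.fromkeys(counts, prior))
print(smoothed)

# Keys absent from the original Counter receive no pseudo-count this way;
# looking one up simply returns 0.
print(smoothed["never seen"])
```

Because only existing keys are incremented, a function that needs the prior for unseen words has to add it at lookup time instead.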

In [None]:

top100 = ['haricot vert'] * 100

In [None]:
grader.score('nlp__food_bigrams', lambda: top100)

In [248]:
grader.score('nlp__food_bigrams', lambda: final_list[:100])

['pel meni', 'f_5_unx wrafcxuakbzrdw', 'bandeja paisa', 'laan xang', 'roka akor', 'mille feuille', 'grana padano', 'innis gunn', 'chicha morada', 'dol sot', 'hodge podge', 'sais quoi', 'cullen skink', 'himal chuli', 'hoity toity', 'woonam jung', 'celine dion', 'perrier jouet', 'riff raff', 'luc lac', 'ore ida', 'baskin robbins', 'reina pepiada', 'rustler rooste', 'alain ducasse', 'ezzyujdouig4p gyb3pv_a', 'cien agaves', 'dueling pianos', 'deja vu', 'nanay gloria', 'homer simpson', 'khai hoan', 'hon machi'] is too short

Failed validating 'minItems' in schema:
    {'items': {'type': 'string'},
     'maxItems': 100,
     'minItems': 100,
     'type': 'array'}

On instance:
    ['pel meni',
     'f_5_unx wrafcxuakbzrdw',
     'bandeja paisa',
     'laan xang',
     'roka akor',
     'mille feuille',
     'grana padano',
     'innis gunn',
     'chicha morada',
     'dol sot',
     'hodge podge',
     'sais quoi',
     'cullen skink',
     'himal chuli',
     'hoity toity',
     'woonam ju
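
The schema in the error message requires an array of exactly 100 strings. A small pre-check along these lines can catch a list of the wrong length before submitting. `check_submission` is a hypothetical helper, not part of the grader.

```python
def check_submission(items, required_length=100):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    if not isinstance(items, list):
        problems.append("submission is not a list")
        return problems
    if len(items) != required_length:
        problems.append(f"expected {required_length} items, got {len(items)}")
    if not all(isinstance(item, str) for item in items):
        problems.append("every item must be a string")
    return problems

print(check_submission(["pel meni"] * 33))
print(check_submission(["haricot vert"] * 100))
```

A list can come out short when a filter is applied after taking the top entries, so it helps to filter first and slice to 100 afterwards.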

In [233]:
# A Counter cannot be sliced; most_common(n) returns the n highest-scoring items.
final_results.most_common(10)


*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*