<a href="https://colab.research.google.com/github/shilpasy/Projects_partof_DataScienceFellowship_Python/blob/main/NLP_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import seaborn as sns
sns.set()
import pandas as pd

In [None]:
from static_grader import grader
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import gzip
import ujson as json

# NLP: Analyzing Review Text


Unstructured data makes up the vast majority of data.  This is a basic intro to handling unstructured data.  Our objective is to be able to extract the sentiment (positive or negative) and gain insight from review text.  We will do this from Yelp review data.

## Metrics and scoring

The first two questions ask you to build models of increasing complexity to predict the rating of a review from its text. The grader uses a test set to evaluate your model's performance against our reference solution, using the $R^2$ score. It **is** possible to receive a score greater than one, indicating that you've beaten our reference model. We compare our model's score on a test set to your score on the same test set. See how high you can go!
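
For reference, the $R^2$ score of predictions $\hat{y}_i$ against true ratings $y_i$ with mean $\bar{y}$ is usually defined as shown here. How the grader combines the two scores is not spelled out in this notebook; if it reports your $R^2$ divided by the reference $R^2$ (an assumption), a value above one means your model explains more of the variance on the test set than the reference does.

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$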

The final two questions ask only for the result of a calculation, and your results will be compared directly to those of a reference solution.

## Download and parse the data


To start, let's download the data set from Amazon S3:

In [None]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_review_reduced.json.gz'

The training data are a series of JSON objects, in a Gzipped file. Python supports Gzipped files natively: [`gzip.open`](https://docs.python.org/3/library/gzip.html) has the same interface as `open`, but handles `.gz` files automatically.

The built-in `json` package has a `loads` function that converts a JSON string into a Python dictionary. We could call that once for each row of the file. [`ujson`](http://docs.micropython.org/en/latest/library/ujson.html) has the same interface as the built-in `json` package, but is *substantially* faster (at the cost of non-robust handling of malformed JSON). We will use that inside a list comprehension to get a list of dictionaries:

In [None]:
with gzip.open('yelp_train_academic_dataset_review_reduced.json.gz') as f:
    data = [json.loads(line) for line in f]
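
The same read pattern can be tried without the real data set. The sketch below writes two made-up records to a temporary gzipped file and reads them back, one JSON object per line. It uses the built-in `json` module instead of `ujson`, and the file name and records are hypothetical.

```python
import gzip
import json
import os
import tempfile

# Two made-up records standing in for Yelp reviews (hypothetical data).
records = [
    {"stars": 5, "text": "Great pancakes!"},
    {"stars": 1, "text": "Never again."},
]

# Write one JSON object per line to a gzipped file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_reviews.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back: gzip.open mirrors open, and json.loads parses each line.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["text"])
```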

In [None]:
data[0]

{'votes': {'funny': 0, 'useful': 0, 'cool': 0},
 'user_id': 'Qrs3EICADUKNFoUq2iHStA',
 'review_id': '_ePLBPrkrf4bhyiKWEn4Qg',
 'stars': 1,
 'date': '2013-04-19',
 'text': "I don't know what Dr. Goldberg was like before  moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. I was going to Dr. Johnson before he left and Goldberg took over when Johnson left. He is not a caring doctor. He is only interested in the co-pay and having you come in for medication refills every month. He will not give refills and could less about patients's financial situations. Trying to get your 90 days mail away pharmacy prescriptions through this guy is a joke. And to make matters even worse, his office staff is incompetent. 90% of the time when you call the office, they'll put you through to a voice mail, that NO ONE ever answers or returns your call. Both my adult children and husband have decided to leave this practice after experiencing such frustration. The entire office ha

The scikit-learn API requires that we keep labels (in this case, the star ratings) and features in separate data structures.

In [None]:
stars = [row['stars'] for row in data]

In [None]:
mainDF = pd.DataFrame(data=data)  # data frame view of the reviews; reused in later questions
mainDF.head()

| | votes | user_id | review_id | stars | date | text | type | business_id |
|---|---|---|---|---|---|---|---|---|
| 0 | {'funny': 0, 'useful': 0, 'cool': 0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before... | review | vcNAWiLM4dR7D2nwwJ7nCA |
| 1 | {'funny': 6, 'useful': 0, 'cool': 0} | ZYaumz29bl9qHpu-KVtMGA | ow1c4Lcl3ObWxDC2yurwjQ | 4 | 2009-05-04 | If you like lot lizards, you'll love the Pine ... | review | JwUE5GmEO-sH1FuwJgKBlQ |
| 2 | {'funny': 0, 'useful': 0, 'cool': 0} | EEYwj6_t1OT5WQGypqEPNg | 4iPPOQIo5Mr1NAUPUgCUrQ | 4 | 2011-03-31 | Only went here once about a year and a half ag... | review | JwUE5GmEO-sH1FuwJgKBlQ |
| 3 | {'funny': 0, 'useful': 1, 'cool': 0} | MnXcXwr0keJpkIiwuPsOKg | _utPYHIdXeq8CqQ4iYD1bw | 3 | 2012-01-08 | Ate a Saturday morning breakfast at the Pine C... | review | JwUE5GmEO-sH1FuwJgKBlQ |
| 4 | {'funny': 0, 'useful': 1, 'cool': 0} | wC8r-m6KHifL6R2i8ok8yg | gksnzyc9jQ9hNXESjvTrQw | 3 | 2012-08-26 | This is definitely not your usual truck stop. ... | review | JwUE5GmEO-sH1FuwJgKBlQ |


In [None]:
len(stars)

253272

# Questions


## Question 1: bag_of_words_model

Build a linear model predicting the star rating based on the text reviews. Apply the bag-of-words model using the [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) to produce a feature matrix giving the counts of each word in each review.

**Hints**:
1. You will need to extract the review text from the raw input data, a list of dictionaries. You can take an approach similar to the one you took in the `ml` miniproject by first converting the data into a pandas data frame and then using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer), or you can build a custom transformer to extract the text. Either way, remember that the `CountVectorizer` accepts as input to its `transform` method a 1D array of text.

1. Try choosing different values for `min_df` (minimum document frequency cutoff) and `max_df` in `CountVectorizer`. Setting `min_df` to zero admits rare words which might only appear once in the entire corpus.  This is both prone to overfitting and makes your data unmanageably large.  Don't forget to use cross-validation to select the right value.

1. Try using [`LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) or [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?highlight=ridge#sklearn.linear_model.Ridge). There is also [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html?highlight=ridge#sklearn.linear_model.RidgeCV), which has built-in leave-one-out cross-validation. If the memory footprint is too big, try switching to [Stochastic Gradient Descent](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor). Don't forget to search for the optimal value of the regularization parameter. How do the regularization parameter `alpha` and the values of `min_df` and `max_df` from `CountVectorizer` change the answer?

1. You will likely pick up several hyperparameters between the vectorization step and the regularization of the predictor. While it is more strictly correct to do a grid search over all of them at once, this can take a long time. Quite often, doing a grid search over a single hyperparameter at a time can produce similar results.  Alternatively, the grid search may be done over a smaller subset of the data, as long as it is representative of the whole.

1. Finally, assemble a pipeline that will transform the data from list of dictionaries all the way to predictions.  This will allow you to submit the model's `predict` method to the grader for scoring as the test set used by the grader is a list of dictionaries.
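
The hints above can be sketched end to end on a deliberately tiny, made-up corpus. This is an illustration under stated assumptions, not the reference solution: `TextExtractor`, the toy reviews, and the grid values are all hypothetical, and the real data needs far larger `min_df` values and more careful tuning.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class TextExtractor(TransformerMixin, BaseEstimator):
    """Hypothetical helper: pull the 'text' field out of a list of dicts."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return [row["text"] for row in X]


# Made-up corpus, alternating five-star and one-star reviews.
toy_texts = [
    "good food good service", "bad food bad service",
    "good friendly staff", "bad rude staff",
    "friendly service good food", "rude service bad food",
    "tasty and good", "cold and bad",
    "good place friendly staff", "bad place rude staff",
    "tasty food friendly staff", "cold food rude staff",
]
toy_stars = [5, 1] * 6
toy_data = [{"text": text} for text in toy_texts]

pipeline = Pipeline([
    ("extract", TextExtractor()),
    ("vectorizer", CountVectorizer()),
    ("regressor", Ridge()),
])

# Search the vectorizer cutoff and the regularization strength together.
param_grid = {
    "vectorizer__min_df": [1, 2],
    "regressor__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(toy_data, toy_stars)

print(search.best_params_)
# The fitted search accepts a list of dictionaries, as the grader supplies.
prediction = search.predict([{"text": "good friendly service"}])
print(prediction)
```

Because the extraction step lives inside the pipeline, the `predict` method can be handed raw dictionaries directly.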

In [None]:
from sklearn.base import TransformerMixin, BaseEstimator

class DFtext(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # This transformer doesn't need to learn anything about the data, so it can just return self without any further processing
        return self

    def transform(self, X):
        # Build a pandas data frame from X and return only its 'text' column
        df_data = pd.DataFrame(X)
        return df_data['text']

In [None]:
review_text = DFtext().fit_transform(data)
print(review_text[2])

Only went here once about a year and a half ago, but they had great pancakes! My only problem with it at the time was that they allowed smoking, so I left smelling like a cigarette. With the change in law, I'm sure the atmosphere has improved!


In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
#for revtxt in review_text:
#    for s in (nlp(revtxt).sents):
 #       print(s)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(mainDF['text'], stars, test_size=0.25)#, random_state=42)

In [None]:
pipelineQ1 = Pipeline ([
    ('review_text',  DFtext()),
    ('bag_of_words_vectorizer', CountVectorizer()),
    #('regressor', Ridge())
    #('regressor', LinearRegression())
    ('regressor', SGDRegressor())
])

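# max_df values are floats (fractions of documents); an integer 1 would mean
# "appears in at most one document" and conflict with min_df >= 2.
param_grid = {'bag_of_words_vectorizer__max_df': (0.6, 0.7, 0.8, 0.9, 1.0),
              'bag_of_words_vectorizer__min_df': (2, 4, 6),
              #'bag_of_words_vectorizer__ngram_range': ((2,2), (2,3)),
              'regressor__alpha': (2.5, 6)
             }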

grid_search = GridSearchCV(pipelineQ1, param_grid, verbose=10)
#######**********grid_search.fit(X_test, y_test) #########******PAUSED THIS STEP FOR KERNEL RESTART

In [None]:
#grid_search.best_params_

In [None]:
bag_of_words_model = Pipeline ([
    ('review_text',  DFtext()),
    ('bag_of_words_vectorizer', CountVectorizer(max_df=0.9, min_df=4)),
    ('regressor', Ridge(alpha = 10))
    #('regressor', LinearRegression())
])

In [None]:
##bag_of_words_model = ...

bag_of_words_model.fit(data, stars)

Pipeline(steps=[('review_text', DFtext()),
                ('bag_of_words_vectorizer',
                 CountVectorizer(max_df=0.9, min_df=4)),
                ('regressor', Ridge(alpha=10))])

In [None]:
grader.score('nlp__bag_of_words_model', bag_of_words_model.predict)

Your score: 1.0062



## Question 2: bigram_model

In a bigram model, we'll consider both single words and pairs of consecutive words that appear. This is going to be a much higher-dimensional problem so you should be careful about overfitting. You should also use a vectorizer that applies some sort of normalization, e.g., the `TfidfVectorizer` or a word count vectorizer combined with `TfidfTransformer`.

Sometimes, reducing the dimension can be useful. If you're using the `TfidfVectorizer`, you can change the `max_features` hyperparameter to reduce the size of the resulting vocabulary. For `HashingVectorizer`, you can adjust the size of the feature matrix through `n_features`.
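
As a small sketch of these options on made-up sentences (the documents and parameter values are illustrative assumptions), the following compares the one-step `TfidfVectorizer` with a `CountVectorizer` followed by `TfidfTransformer`, then shows how `max_features` and `n_features` bound the number of columns.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Made-up documents for illustration only.
docs = [
    "the food was great",
    "the service was slow",
    "great food and great service",
]

# Unigrams plus bigrams with TF-IDF weights, in one step ...
one_step = TfidfVectorizer(ngram_range=(1, 2))
# ... or as raw counts followed by a TF-IDF transform.
two_step = make_pipeline(CountVectorizer(ngram_range=(1, 2)), TfidfTransformer())

X_one = one_step.fit_transform(docs)
X_two = two_step.fit_transform(docs)
same_weights = np.allclose(X_one.toarray(), X_two.toarray())

# max_features keeps only the most frequent terms across the corpus.
capped = TfidfVectorizer(ngram_range=(1, 2), max_features=5).fit(docs)

# HashingVectorizer stores no vocabulary; n_features fixes the column count.
hashed = HashingVectorizer(ngram_range=(1, 2), n_features=2**10).transform(docs)

print(same_weights, X_one.shape, len(capped.vocabulary_), hashed.shape)
```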

**A side note on multi-stage model evaluation:** When your model consists of a pipeline with several stages, it can be worthwhile to evaluate which parts of the pipeline have the greatest impact on the overall accuracy (or other metric) of the model. This allows you to focus your efforts on improving the important algorithms, and leaving the rest "good enough".

One way to accomplish this is through ceiling analysis, which can be useful when you have a training set with ground truth values at each stage. Let's say you're training a model to extract image captions from websites and return a list of names that were in the caption. Your overall accuracy at some point reaches 70%. You can try manually giving the model what you know are the correct image captions from the training set, and see how the accuracy improves (maybe up to 75%). Alternatively, giving the model the perfect name parsing for each caption increases accuracy to 90%. This indicates that the name parsing is a much more promising target for further work, and the caption extraction is a relatively smaller factor in the overall performance.
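
The arithmetic behind that comparison is simple enough to script. The sketch below uses only the illustrative accuracies from the caption example above; the variable names are hypothetical.

```python
# Accuracies from the hypothetical caption example in the text.
baseline_accuracy = 0.70
accuracy_with_perfect_stage = {
    "caption extraction": 0.75,
    "name parsing": 0.90,
}

# Gain available if one stage were made perfect, everything else unchanged.
potential_gain = {
    stage: accuracy - baseline_accuracy
    for stage, accuracy in accuracy_with_perfect_stage.items()
}

# Stages sorted from most to least promising.
ranked_stages = sorted(potential_gain, key=potential_gain.get, reverse=True)
for stage in ranked_stages:
    print(f"{stage}: +{potential_gain[stage]:.0%}")
```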

If you don't know the right answers at different stages of the pipeline, you can still evaluate how important different parts of the model are to its performance by changing or removing certain steps while keeping everything else constant. You might try this kind of analysis to determine how important stop-word removal and stemming actually are to your NLP model, and how that importance changes with parameters like the number of features.

In [None]:
#ngram = 2 for this bigram model

#using `TfidfVectorizer` - hyperparm - max_features

#where am I using the HashingVectorizer? n_feature> ??????????????????

#ng_tfidf = TfidfVectorizer(max_features=300)
#ng_tfidf.fit(X_test, y_test)
#print(ng_tfidf.get_feature_names_out()[:10])

In [None]:
#print(ng_tfidf.transform(X_test))

In [None]:
bigram_model_previous = Pipeline ([
    ('review_text', DFtext()),
    ('ng_tfidf', TfidfVectorizer(max_df=0.9, min_df=4, ngram_range = (2,2), stop_words='english')),
    #('regressor', Ridge(alpha = 10))
    ("regressor", SGDRegressor())
                                ])

In [None]:
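bigram_model = Pipeline ([
        ('review_text', DFtext()),
        # NOTE: ngram_range is left at its default of (1, 1), so this vectorizer
        # counts single words only; pass ngram_range=(1, 2) to add bigrams.
        ('bag_of_words_vectorizer', CountVectorizer(max_df=0.9, min_df=4)),
        ("tfidf", TfidfTransformer()),
        ("regressor", SGDRegressor())
        # ('regressor', Ridge(alpha = 10))
    ])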

In [None]:
bigram_model.get_params()

{'memory': None,
 'steps': [('review_text', DFtext()),
  ('bag_of_words_vectorizer', CountVectorizer(max_df=0.9, min_df=4)),
  ('tfidf', TfidfTransformer()),
  ('regressor', SGDRegressor())],
 'verbose': False,
 'review_text': DFtext(),
 'bag_of_words_vectorizer': CountVectorizer(max_df=0.9, min_df=4),
 'tfidf': TfidfTransformer(),
 'regressor': SGDRegressor(),
 'bag_of_words_vectorizer__analyzer': 'word',
 'bag_of_words_vectorizer__binary': False,
 'bag_of_words_vectorizer__decode_error': 'strict',
 'bag_of_words_vectorizer__dtype': numpy.int64,
 'bag_of_words_vectorizer__encoding': 'utf-8',
 'bag_of_words_vectorizer__input': 'content',
 'bag_of_words_vectorizer__lowercase': True,
 'bag_of_words_vectorizer__max_df': 0.9,
 'bag_of_words_vectorizer__max_features': None,
 'bag_of_words_vectorizer__min_df': 4,
 'bag_of_words_vectorizer__ngram_range': (1, 1),
 'bag_of_words_vectorizer__preprocessor': None,
 'bag_of_words_vectorizer__stop_words': None,
 'bag_of_words_vectorizer__strip_accen

In [None]:
#bigram_model = ...

bigram_model.fit(data, stars)

Pipeline(steps=[('review_text', DFtext()),
                ('bag_of_words_vectorizer',
                 CountVectorizer(max_df=0.9, min_df=4)),
                ('tfidf', TfidfTransformer()), ('regressor', SGDRegressor())])

In [None]:
grader.score('nlp__bigram_model', bigram_model.predict)

Your score: 0.9442


## Question 3: word_polarity

Let's consider a different approach and try to derive some insight from our analysis.  

We want to determine the most "polarizing words" in the corpus of reviews.  In other words, we want to identify words that strongly signal a review is either positive or negative.  For example, we understand that a word like "terrible" will most likely appear in negative rather than positive reviews.  

During training, the [naive Bayes model](https://scikit-learn.org/stable/modules/naive_bayes.html#) calculates probabilities such as $Pr(\textrm{terrible}\ |\ \textrm{negative}),$ the probability that the word "terrible" appears in the review text, given that the review is negative.  Using these probabilities, we can define a **polarity score** for each word $w$,

$$\textrm{polarity}(w) = \log\left(\frac{Pr(w\ |\ \textrm{positive})}{Pr(w\ |\ \textrm{negative})}\right).$$

Polarity analysis is an example where a simpler model (naive Bayes) offers more explicability than more complicated models.  Aside from this, naive Bayes models are easy to train, the training process is parallelizable, and these models lend themselves well to online learning.  Given enough training data, naive Bayes models have performed well in NLP applications such as spam filtering.  

For this problem, you are asked to determine the top 25 most positive polar words and the 25 most negative polar words.  For this analysis, you should:

1.  **Filter** the collection of reviews you were using above to **only keep** the one-star and five-star reviews. Since these are the "most polar" reviews, it should give us the most polarizing words.   
1.  Use the naive Bayes model, `MultinomialNB`.  
1.  Use TF-IDF weighting.
1.  Remove stop words.
1.  As mentioned, generate a (Python) list with most positive (25 words) and most negative (25 words) polar words.  

A naive Bayes model (after training) stores the log of the probabilities in an attribute of the model.  It is a `numpy` array of shape (number of classes, number of features).  You will need the mapping between feature indices to words to find the most polarizing words.  

In [None]:
filteredDF = mainDF.query('stars == 5 or stars == 1')

In [None]:
len(filteredDF)

116576

In [None]:
#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(filteredDF['text'], filteredDF['stars'], test_size=0.25)#, random_state=42)

In [None]:
#from spacy.lang.en import STOP_WORDS

In [None]:
pipelineQ3 = Pipeline([
        #('review_text', DFtext()),
        ('bag_of_words_vectorizer', CountVectorizer(max_df=0.9, min_df=4, lowercase=False, stop_words='english')),
        #('bag_of_words_vectorizer', CountVectorizer(lowercase=False, stop_words='english')),
        ("tfidf", TfidfTransformer()),
       # ("mnb", MultinomialNB()) #hyper-parameter tuning on alpha of mnb?
        ])

In [None]:
transformedX = pipelineQ3.fit_transform(filteredDF['text'])

In [None]:
mnb_model = MultinomialNB()
mnb_model.fit(transformedX, filteredDF['stars'])

MultinomialNB()

In [None]:
feature_log_prob = mnb_model.feature_log_prob_
feature_log_prob

array([[ -7.47180714,  -9.27268587, -12.10690552, ..., -12.02531467,
        -12.05577606, -12.20419714],
       [ -8.4483143 ,  -9.86516854, -12.42135734, ..., -12.68864697,
        -12.263729  , -12.52119609]])

In [None]:
# classes_ is sorted, so row 0 is the one-star (negative) class and row 1 the
# five-star (positive) class. This difference is therefore
# log P(w | negative) - log P(w | positive), the negative of the polarity defined above.
polarity = feature_log_prob[0, :] - feature_log_prob[1, :]
# get feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = pipelineQ3['bag_of_words_vectorizer'].get_feature_names_out()
# map feature names to their polarity values
mapped_polarity_feat = list(zip(polarity, feature_names))

# sort ascending: most positive words come first, most negative words last
mapped_polarity_feat_sorted = sorted(mapped_polarity_feat)

# 25 most positive polar words followed by the 25 most negative polar words
polar_words_score = mapped_polarity_feat_sorted[:25] + mapped_polarity_feat_sorted[-25:]

polar_words = [str(word) for _, word in polar_words_score]



In [None]:
polar_data = filteredDF

In [None]:
# We're only keeping the one and five star reviews
grader.check(len(polar_data) == 116576)

True

In [None]:
#polar_words = ['perfection'] * 50

In [None]:
polar_words

['Excellent',
 'Delicious',
 'Amazing',
 'Love',
 'perfection',
 'Highly',
 'Loved',
 'Great',
 'Fantastic',
 'Awesome',
 'AMAZING',
 'gem',
 'Outstanding',
 'delicious',
 'fantastic',
 'Wonderful',
 'yummy',
 'BEST',
 'refreshing',
 'delish',
 'impeccable',
 'perfect',
 'notch',
 'amazing',
 'YUM',
 'Overpriced',
 'Waste',
 'disrespectful',
 'Poor',
 'Disgusting',
 'refund',
 'AWFUL',
 'Gross',
 'blamed',
 'tasteless',
 'poisoning',
 'unhelpful',
 'rudely',
 'incompetent',
 'worst',
 'HORRIBLE',
 'RUDE',
 'unacceptable',
 'unprofessional',
 'Rude',
 'Terrible',
 'Awful',
 'WORST',
 'Worst',
 'Horrible']

In [None]:
grader.score('nlp__word_polarity', polar_words)

Your score: 0.6600


In [None]:
pipelineQ3_test2 = Pipeline ([
    ('ng_tfidf', TfidfVectorizer(stop_words='english')),
    ("mnb", MultinomialNB())
                            ])

pipelineQ3_test2.fit( filteredDF['text'],  filteredDF['stars'])
#pipelineQ3_test2.predict(y_test)

Pipeline(steps=[('ng_tfidf', TfidfVectorizer(stop_words='english')),
                ('mnb', MultinomialNB())])

In [None]:
feature_log_prob = pipelineQ3_test2['mnb'].feature_log_prob_

# Row 0 is the one-star class and row 1 the five-star class, so this is
# log P(w | negative) - log P(w | positive).
polarity = feature_log_prob[0, :] - feature_log_prob[1, :]
# get feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = pipelineQ3_test2['ng_tfidf'].get_feature_names_out()
# map feature names to their polarity values
mapped_polarity_feat = list(zip(polarity, feature_names))

# sort ascending: most positive words come first, most negative words last
mapped_polarity_feat_sorted = sorted(mapped_polarity_feat)

# 25 most positive polar words followed by the 25 most negative polar words
polar_words_score = mapped_polarity_feat_sorted[:25] + mapped_polarity_feat_sorted[-25:]

polar_words = [str(word) for _, word in polar_words_score]



In [None]:
grader.score('nlp__word_polarity', polar_words)

Your score: 1.0000


## Question 4: food_bigrams

Look over all reviews of restaurants.  You can determine which businesses are restaurants by looking in the `yelp_train_academic_dataset_business.json.gz` file from the ml project or downloaded below.

In [None]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_business.json.gz'

In [None]:
with gzip.open('yelp_train_academic_dataset_business.json.gz') as f:
    business_data = [json.loads(line) for line in f]

In [None]:
#business_data[0]

In [None]:
mainDF_Q4 = pd.DataFrame(data=business_data) #just for testing
#mainDF_Q4.head(3)

Each row of this file corresponds to a single business.  The `categories` key gives a list of categories for each; take all where "Restaurants" appears.

In [None]:
mainDF_Q4_restaurants = mainDF_Q4[mainDF_Q4['categories'].apply(lambda x: "Restaurants" in x)]
restaurant_ids= mainDF_Q4_restaurants['business_id']

In [None]:
# Look at the categories to check for spelling and capitalization
grader.check(len(restaurant_ids) == 12876)

True

The "business_id" here is the same as in the review data.  Use this to extract the review text for all reviews of restaurants.
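
One way to picture this join, sketched with plain Python on made-up records (the IDs, categories, and texts are hypothetical): collect the restaurant IDs into a set, then keep the text of every review whose `business_id` is in that set.

```python
# Made-up records that mimic the two files.
businesses = [
    {"business_id": "b1", "categories": ["Restaurants", "Mexican"]},
    {"business_id": "b2", "categories": ["Doctors"]},
    {"business_id": "b3", "categories": ["Bars", "Restaurants"]},
]
reviews = [
    {"business_id": "b1", "text": "Great tacos."},
    {"business_id": "b2", "text": "Long wait."},
    {"business_id": "b3", "text": "Good burgers."},
    {"business_id": "b1", "text": "Slow service."},
]

# A set gives constant-time membership checks while scanning the reviews.
restaurant_id_set = {
    b["business_id"] for b in businesses if "Restaurants" in b["categories"]
}
restaurant_texts = [
    r["text"] for r in reviews if r["business_id"] in restaurant_id_set
]
print(restaurant_texts)
```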

In [None]:
restaurant_reviews= mainDF.merge(mainDF_Q4_restaurants, on='business_id')['text']

In [None]:
# Just reviews of restaurants
# restaurant_ids is helpful here
grader.check(len(restaurant_reviews) == 143361)

True

In [None]:
restaurant_reviewst_DF = mainDF.merge(mainDF_Q4_restaurants, how = 'inner', on='business_id')
restaurant_reviewst_DF.drop(columns=['stars_y', 'type_y'],  inplace=True)
restaurant_reviewst_DF.rename(columns={"stars_x": "stars", "type_x": "type"}, inplace=True)

In [None]:
#restaurant_reviewst_DF.head()

We want to find collocations, that is, bigrams that are "special" and appear more often than you'd expect from chance. We can think of the corpus as defining an empirical distribution over all *n*-grams.  We can then find word pairs that occur consecutively far more often than the individual probabilities of their words would predict. Mathematically, if $p(w)$ is the probability of a word $w$ and $p(w_1 w_2)$ is the probability of the bigram $w_1 w_2$, then we want to look at word pairs $w_1 w_2$ where the statistic

  $$ \frac{p(w_1 w_2)}{p(w_1) p(w_2)} $$

is high.  Return the top 100 (mostly food) bigrams with this statistic with the 'right' prior factor (see below).

Estimating the probabilities is simply a matter of counting, and there are a number of approaches that will work.  One is to use one of the tokenizers to count up how many times each word and each bigram appears in each review, and then sum those up over all reviews.  You might want to know that the `CountVectorizer` has a `.get_feature_names_out()` method which gives the string associated with each column.  (Question for thought: Why doesn't the `HashingVectorizer` have a similar method?)

*Questions:* This statistic is a ratio and problematic when the denominator is small.  We can fix this by applying Bayesian smoothing to $p(w)$ (i.e. mixing the empirical distribution with the uniform distribution over the vocabulary).

1. How does changing this smoothing parameter affect the word pairs you get qualitatively?

2. We can interpret the smoothing parameter as adding a constant number of occurrences of each word to our distribution.  Does this help you determine a reasonable value for this 'prior factor'?

3. For fun: also check out [Amazon's Statistically Improbable Phrases](http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases).
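
One common way to write this smoothing, offered as a sketch rather than the reference solution's exact formula: let $c(w)$ be the number of occurrences of $w$, $N$ the total number of words, $V$ the vocabulary size, and $\alpha$ the prior factor (these symbols are notation introduced here). Adding $\alpha$ pseudo-occurrences per word gives

$$\tilde{p}(w) = \frac{c(w) + \alpha}{N + \alpha V} = \lambda\,\frac{c(w)}{N} + (1 - \lambda)\,\frac{1}{V}, \qquad \lambda = \frac{N}{N + \alpha V},$$

which is a mixture of the empirical distribution and the uniform distribution over the vocabulary. The smoothed statistic is then

$$\frac{p(w_1 w_2)}{\tilde{p}(w_1)\,\tilde{p}(w_2)}.$$

With $\alpha = 30$, a word seen only twice is treated as if it had 32 occurrences, so a bigram built from two such words no longer receives an enormous ratio.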

*Implementation note:*
As you adjust the size of the Bayesian smoothing parameter, you will notice first nonsense phrases being removed and then legitimate bigrams being removed, leaving you with only generic bigrams.  The goal is to find a value of the smoothing parameter between these two transitions.

The reference solution is not an aggressive filterer: it errs on the side of leaving apparently nonsensical words in. On further consideration, many of these are actually somewhat meaningful. The smoothing parameter chosen in the reference solution is equivalent to giving each word 30 previous appearances prior to considering this data.  This was chosen by generating a list of bigrams for a range of smoothing parameters and seeing how many of the bigrams were shared between neighboring values.  When the shared fraction reached 95%, we judged the solution to have converged.
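
The convergence check described above can be sketched as a small helper. The function name and the two top lists are hypothetical; a real run would compare top-100 lists produced for neighboring prior factors.

```python
def shared_fraction(previous_top, current_top):
    """Fraction of bigrams in current_top that also appear in previous_top."""
    current = set(current_top)
    if not current:
        return 0.0
    return len(current & set(previous_top)) / len(current)


# Hypothetical top-4 lists for two neighboring prior factors.
top_for_prior_20 = ["ropa vieja", "feng shui", "asdf qwer", "gulab jamun"]
top_for_prior_30 = ["ropa vieja", "feng shui", "gulab jamun", "hu tieu"]

overlap = shared_fraction(top_for_prior_20, top_for_prior_30)
converged = overlap >= 0.95  # threshold quoted in the text
print(overlap, converged)
```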

There are a few reviews that include the same nonsense strings multiple times.  To keep these from showing up in our results, we set `min_df=10`, to ensure that a bigram occurs in at least 10 reviews before we consider it.
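
To make the counting concrete, here is a minimal sketch of the smoothed statistic using `collections.Counter` and whitespace tokenization instead of `CountVectorizer`. The function name, the toy reviews, and the choice to add `prior * vocabulary size` to the denominator are assumptions made for illustration; that denominator only rescales every score by the same constant.

```python
from collections import Counter


def collocation_scores(reviews, prior=30, min_df=1):
    """Score each bigram by p(w1 w2) / (p(w1) * p(w2)) with smoothed p(w)."""
    word_counts = Counter()
    bigram_counts = Counter()
    bigram_doc_freq = Counter()
    for review in reviews:
        tokens = review.lower().split()
        pairs = list(zip(tokens, tokens[1:]))
        word_counts.update(tokens)
        bigram_counts.update(pairs)
        bigram_doc_freq.update(set(pairs))  # count each review once

    total_bigrams = sum(bigram_counts.values())
    # Each vocabulary word receives `prior` extra occurrences.
    smoothed_total = sum(word_counts.values()) + prior * len(word_counts)

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if bigram_doc_freq[(w1, w2)] < min_df:
            continue  # skip bigrams that appear in too few reviews
        p_bigram = count / total_bigrams
        p_w1 = (word_counts[w1] + prior) / smoothed_total
        p_w2 = (word_counts[w2] + prior) / smoothed_total
        scores[f"{w1} {w2}"] = p_bigram / (p_w1 * p_w2)
    return scores


# Made-up reviews; a small prior suits a corpus this tiny.
toy_reviews = [
    "the ropa vieja was great",
    "the food was great",
    "the service was slow",
    "ropa vieja again",
]
scores = collocation_scores(toy_reviews, prior=1)
top_bigram = max(scores, key=scores.get)
print(top_bigram)
```

Re-running with `prior=0` shows the problem the smoothing addresses: bigrams built from words seen only once or twice score as highly as genuine collocations.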

In [None]:
#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(restaurant_reviewst_DF['text'], restaurant_reviewst_DF['stars'], test_size=0.1)#, random_state=42)

In [None]:
X = restaurant_reviewst_DF['text']
y = restaurant_reviewst_DF['stars']

In [None]:
####In this cell you get the total number of Words (monograms) and bigrams

#test_monogram = CountVectorizer(min_df=10, ngram_range = (1,1) , stop_words = 'english')
#test_bigram =  CountVectorizer(min_df=10, ngram_range = (2,2) , stop_words = 'english')

test_monogram = CountVectorizer(ngram_range = (1,1))
test_bigram =  CountVectorizer(min_df=10, ngram_range = (2,2))

TM = test_monogram.fit_transform(X, y)
TB = test_bigram.fit_transform(X, y)

TM_sum = np.squeeze(np.asarray(TM.sum(axis=0)))
TB_sum = np.squeeze(np.asarray(TB.sum(axis=0)))

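# NOTE: the two names below are swapped. total_bigrams holds the unigram total
# and total_words holds the bigram total. The print labels match the actual
# values, and the swap only rescales the later ratio by a constant factor,
# so the ranking of bigrams is unaffected.
total_bigrams = np.sum(TM_sum)
total_words = np.sum(TB_sum)

print(f'total monograms {total_bigrams} total bigrams {total_words} and the total is {total_bigrams + total_words}')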


total monograms 17077355 total bigrams 13487390 and the total is 30564745


In [None]:
#bigram_model_again = CountVectorizer(min_df=10, ngram_range = (1,2), stop_words='english')
bigram_model_again = CountVectorizer(min_df=10, ngram_range = (1,2))
BGA = bigram_model_again.fit_transform(X, y)

print(BGA.shape)
print(type(BGA))

(143361, 175260)
<class 'scipy.sparse.csr.csr_matrix'>


In [None]:
all_sums = np.squeeze(np.asarray(BGA.sum(axis=0)))
print(type(all_sums))
print(all_sums.shape)
print(np.sum(all_sums))

<class 'numpy.ndarray'>
(175260,)
30408426


In [None]:
#print(X.getcol(0).sum())

In [None]:
all_sums.shape

(175260,)

In [None]:
all_sums[0:10]

array([2852,   17,   11,   75,  176,   18,   12,   10,   16,   27])

In [None]:
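# get_feature_names() was removed in scikit-learn 1.2; tolist() keeps list methods such as .index() available
features = bigram_model_again.get_feature_names_out().tolist()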
len(features)

175260

In [None]:
features.index('malai kofta')

85877

In [None]:
features_dict = {}
for i, f in enumerate(features):
    features_dict[f] = i

In [None]:
features_dict['malai kofta']

85877

In [None]:
##### My initial attempt to find the total which took forever to run
"""
total_bigrams = 0 #################
total_words = 0

for i, f in enumerate(features):
    words = f.split(' ')
    if len(words) == 2:
        total_bigrams += all_sums[features.index(f)]
    else:
        total_words += all_sums[features.index(words[0])]

"""

"\ntotal_bigrams = 0 #################\ntotal_words = 0\n\nfor i, f in enumerate(features):\n    words = f.split(' ')\n    if len(words) == 2:\n        total_bigrams += all_sums[features.index(f)]\n    else:\n        total_words += all_sums[features.index(words[0])]\n\n"

In [None]:
# stat_frac_list =[]

# for i, f in enumerate(features):
#     words = f.split(' ') # get w1, w2 from w1w2
#     if len(words) == 2:
#         w1w2_count = all_sums[features.index(f)] #finding index of only bigrams & looking at its corresponding sum i.e. wordlen ==2
#         w1_count = all_sums[features.index(words[0])] #looking for only those words which are part of the bigram
#         w2_count = all_sums[features.index(words[1])]
#         w1_count = w1_count + 30
#         w2_count = w2_count + 30
#         num_stat_frac = w1w2_count/total_bigrams #this is p(w1w2)
#         deno_stat_frac = (w1_count/total_words) * (w2_count/total_words) #this is p(w1)*p(w2)
#         stat_frac = num_stat_frac/deno_stat_frac
#         stat_frac_list.append((stat_frac, f))

In [None]:
###### rewriting the above code to make it faster ###############
stat_frac_list = []

for f in features:
    words = f.split(' ')  # split a bigram feature into w1 and w2
    if len(words) == 2:  # skip single-word features
        w1w2_count = all_sums[features_dict[f]]  # occurrences of the bigram
        w1_count = all_sums[features_dict[words[0]]]  # occurrences of each component word
        w2_count = all_sums[features_dict[words[1]]]
        # prior factor: give every word 30 extra occurrences to damp rare-word ratios
        w1_count = w1_count + 30
        w2_count = w2_count + 30
        num_stat_frac = w1w2_count/total_bigrams  # p(w1 w2), up to a constant factor
        deno_stat_frac = (w1_count/total_words) * (w2_count/total_words)  # p(w1) * p(w2), up to a constant factor
        stat_frac = num_stat_frac/deno_stat_frac
        stat_frac_list.append((stat_frac, f))

In [None]:
stat_frac_list.sort(reverse=True)
stat_frac_list[:100]

[(80551.26856497522, 'knick knacks'),
 (79050.83306146436, 'rula bula'),
 (78831.45054603013, 'himal chuli'),
 (77572.39146332706, 'feng shui'),
 (76898.39747826221, 'cien agaves'),
 (76326.95293384594, 'ropa vieja'),
 (76315.94847119688, 'tammie coe'),
 (75368.63034220981, 'riff raff'),
 (75317.87705578409, 'roka akor'),
 (74341.67326542997, 'itty bitty'),
 (73972.91496550223, 'khai hoan'),
 (72463.26363967566, 'hoity toity'),
 (72364.8081184261, 'baskin robbins'),
 (72133.68542421321, 'dac biet'),
 (72112.86101085712, 'reina pepiada'),
 (72043.18674901087, 'chicha morada'),
 (71384.13834347797, 'gulab jamun'),
 (71013.99836688215, 'nanay gloria'),
 (70778.07146200878, 'hodge podge'),
 (69169.4789287813, 'luc lac'),
 (68437.52677080479, 'dueling pianos'),
 (66994.33808196429, 'haricot vert'),
 (65839.26328744767, 'tutti santi'),
 (64820.0796452271, 'patatas bravas'),
 (63384.39523655595, 'nuoc mam'),
 (63154.74163062641, 'hu tieu'),
 (62923.796021287984, 'puerto rican'),
 (62111.36883

In [None]:
ansQ4 = []
for frac, bigram in stat_frac_list[:100]:
    ansQ4.append(bigram)

In [None]:
ansQ4

['knick knacks',
 'rula bula',
 'himal chuli',
 'feng shui',
 'cien agaves',
 'ropa vieja',
 'tammie coe',
 'riff raff',
 'roka akor',
 'itty bitty',
 'khai hoan',
 'hoity toity',
 'baskin robbins',
 'dac biet',
 'reina pepiada',
 'chicha morada',
 'gulab jamun',
 'nanay gloria',
 'hodge podge',
 'luc lac',
 'dueling pianos',
 'haricot vert',
 'tutti santi',
 'patatas bravas',
 'nuoc mam',
 'hu tieu',
 'puerto rican',
 'porta alba',
 'alain ducasse',
 'ore ida',
 'wal mart',
 'celine dion',
 'bradley ogden',
 'lomo saltado',
 'krispy kreme',
 'vice versa',
 'holyrood 9a',
 'pura vida',
 'kao tod',
 'valle luna',
 'deja vu',
 'chino bandido',
 'sous vide',
 'lloyd wright',
 'artery clogging',
 'har gow',
 'hors oeuvres',
 'pina colada',
 'molecular gastronomy',
 'harry potter',
 'malai kofta',
 'aguas frescas',
 'ping pang',
 'ama ebi',
 'yada yada',
 'yadda yadda',
 'duct tape',
 'casey moore',
 'pin kaow',
 'womp womp',
 'cochinita pibil',
 'lindo michoacan',
 'scantily clad',
 'demi 

In [None]:
#matrix_sum = {}
#for i, f in enumerate(features):
#    matrix_sum[f] = np.sum(X[:][i].toarray())

#######********** I want to know why this is not working???? what column row indices does X use. Both [:][n] and [n][:] gave me the same thing.
#### and [a:b] gives me something else and no clear way to check this.


In [None]:
#import time
#for i in range(1000):
#     time.sleep(5)
#print(i)

In [None]:
top100 = ansQ4
#top100 = ['haricot vert'] * 100

In [None]:
grader.score('nlp__food_bigrams', top100)

Your score: 1.0000


*Copyright &copy; 2021 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*