# Overview

Unstructured data makes up the vast majority of data.  This notebook is a basic introduction to handling it.  My objective is to extract the
sentiment (positive or negative) from Yelp review text and to predict Yelp review ratings.


In [None]:
import re
import numpy as np
import pandas as pd
import dill
import ujson
import gzip
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

In [None]:
# Data Sample
test_json = [
    {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "WsGQfLLy3YlP_S9jBE3j1w", "review_id": "kzFlI35hkmYA_vPSsMcNoQ", "stars": 5, "date": "2012-11-03", "text": "Love it!!!!! Love it!!!!!! love it!!!!!!!   Who doesn't love Culver's!", "type": "review", "business_id": "LRKJF43s9-3jG9Lgx4zODg"},
    {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "Veue6umxTpA3o1eEydowZg", "review_id": "Tfn4EfjyWInS-4ZtGAFNNw", "stars": 3, "date": "2013-12-30", "text": "Everything was great except for the burgers they are greasy and very charred compared to other stores.", "type": "review", "business_id": "LRKJF43s9-3jG9Lgx4zODg"},
    {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "u5xcw6LCnnMhddoxkRIgUA", "review_id": "ZYaS2P5EmK9DANxGTV48Tw", "stars": 5, "date": "2010-12-04", "text": "I really like both Chinese restaurants in town.  This one has outstanding crab rangoon.  Love the chicken with snow peas and mushrooms and General Tso Chicken.  Food is always ready in 10 minutes which is accurate.  Good place and they give you free pop.", "type": "review", "business_id": "RgDg-k9S5YD_BaxMckifkg"},
    {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "kj18hvJRPLepZPNL7ySKpg", "review_id": "uOLM0vvnFdp468ofLnszTA", "stars": 3, "date": "2011-06-02", "text": "Above average takeout with friendly staff. The sauce on the pan fried noodle is tasty. Dumplings are quite good.", "type": "review", "business_id": "RgDg-k9S5YD_BaxMckifkg"},
    {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "L5kqM35IZggaPTpQJqcgwg", "review_id": "b3u1RHmZTNRc0thlFmj2oQ", "stars": 4, "date": "2012-05-28", "text": "We order from Chang Jiang often and have never been disappointed.  The menu is huge, and can accomodate anyone's taste buds.  The service is quick, usually ready in 10 minutes.", "type": "review", "business_id": "RgDg-k9S5YD_BaxMckifkg"}
]

# Load Data

In [None]:
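with gzip.open('yelp_train_academic_dataset_review.json.gz', 'rb') as f:
    # decode the raw bytes so the split below works on text
    file_content = f.read().decode('utf-8')
# the file holds one JSON record per line
b = file_content.split('\n')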

In [None]:
dict_list = []
for record in b:
    if not record.strip():
        continue  # skip blank lines (e.g. after the trailing newline)
    item = ujson.loads(record)  # parse each record once
    dict_list.append({"stars": item['stars'], "text": item['text']})



## bag_of_words_model
Build a linear model based on the count of the words in each document
(bag-of-words model).

**Hints**:
1. Don't forget to use tokenization!  This is important for good performance
   but it is also the most expensive step.  Try vectorizing as a first
   step:
   ```Python
       X = feature_extraction.text \
                             .CountVectorizer() \
                             .fit_transform(text)
       y = scores
   ``` 
   and then running grid search and cross-validation only on this
   pre-processed data.  `CountVectorizer` has to memorize the mapping between
   words and the index to which each is assigned, so its memory use is linear
   in the size of the vocabulary.  The `HashingVectorizer` does not have to
   remember this mapping and will lead to much smaller models.

2. Try choosing different values for `min_df` (minimum document frequency
   cutoff) and `max_df` in `CountVectorizer`.  Setting `min_df` to zero admits
   rare words which might only appear once in the entire corpus.  This is both
   prone to overfitting and makes your data unmanageably large.  Don't forget
   to use cross-validation to select the right value.  Notice that
   `HashingVectorizer` doesn't support `min_df` and `max_df`.  However, it's
   not hard to roll your own transformer that applies these cutoffs.

3. Try using `LinearRegression` or `RidgeCV`.  If the memory footprint is too
   big, try switching to Stochastic Gradient Descent
   (`sklearn.linear_model.SGDRegressor`).  You might find that even ordinary
   linear regression fails due to the data size.  Don't forget to use
   `GridSearchCV` to determine the regularization parameter!  How do the
   regularization parameter `alpha` and the values of `min_df` and `max_df`
   from `CountVectorizer` change the answer?
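
As a rough illustration of hints 1 and 3, the sketch below vectorizes once and then grid-searches only the regularization strength on the precomputed matrix. The toy texts, the star values, the `alpha` grid, and the choice of `Ridge` are all assumptions made for the example, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-ins for the review texts and star ratings.
toy_texts = [
    "love it great food", "great service love the staff",
    "terrible food and rude staff", "awful service terrible burgers",
    "good food friendly staff", "bad burgers awful place",
    "great place love the burgers", "rude staff bad service",
]
toy_stars = [5, 5, 1, 1, 4, 2, 5, 1]

# Vectorize once, outside the search, so tokenization is not repeated per fold.
count_vect = CountVectorizer(min_df=1)
X_counts = count_vect.fit_transform(toy_texts)

# HashingVectorizer stores no vocabulary; the output width is fixed up front.
hash_vect = HashingVectorizer(n_features=2 ** 10)
X_hashed = hash_vect.transform(toy_texts)

# Search over the regularization strength only, on the precomputed matrix.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=2)
search.fit(X_counts, toy_stars)
best_alpha = search.best_params_["alpha"]
```

On the real data the same pattern applies with `text_list` and `Y`, and `SGDRegressor` can be swapped in for `Ridge` if memory becomes a problem.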

In [None]:
# build a text list
text_list = [item['text'] for item in dict_list]

In [None]:
# create a star list
Y = [item['stars'] for item in dict_list]

In [None]:
# Build an estimator
class q1estimator(BaseEstimator, RegressorMixin):
    def __init__(self):
        self.clf = linear_model.SGDRegressor()

    def fit(self, texts, y):
        # HashingVectorizer is stateless, so transform() needs no prior fit
        self.count_vect = HashingVectorizer()
        X_train = self.count_vect.transform(texts)
        self.clf.fit(X_train, y)
        return self

    def predict(self, record):
        # record is a single review dict with a "text" field
        transformed_feature = self.count_vect.transform([record["text"]])
        return self.clf.predict(transformed_feature)[0]


In [None]:
q1_estimator = q1estimator().fit(text_list, Y)

# pickles are binary, so the file must be opened in "wb" mode
with open("NLPQ1", "wb") as f:
    dill.dump(q1_estimator, f)
predicted_value = q1_estimator.predict(test_json[1])
print(predicted_value)

## normalized_model
Normalization is key for good linear regression. Previously, we used the count
as the normalization scheme.  Try some of these alternative vectorizations:

1. You can use "is this word present in this document" as a normalization
   scheme, which means the values are always 1 or 0.  So we give no additional
   weight to the presence of the word multiple times.

2. Try using the log of the number of counts (or more precisely, $\log(x+1)$).
   This is often used because we want the repeated presence of a word to count
   for more, but we want that effect to taper off.

3. [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common
   normalization scheme used in text processing.  Use the `TfidfTransformer`.
   There are options for using `idf` (`use_idf`) and taking the logarithm of
   `tf` (`sublinear_tf`).  Do these significantly affect the result?

Finally, if you can't decide which one is better, don't forget that you can
combine models with a linear regression.
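
The three schemes above could be sketched as follows. The three-document corpus is made up for illustration, and the variable names are placeholders; `binary=True` and `sublinear_tf=True` are the scikit-learn options that correspond to schemes 1 and 3.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus.
docs = ["love love love it", "love it", "greasy burgers"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)  # raw counts, the baseline scheme

# 1. Presence/absence: every nonzero count becomes 1.
binary = CountVectorizer(binary=True).fit_transform(docs)

# 2. log(x + 1) of the counts; zeros stay zero, so the matrix stays sparse.
log_counts = counts.astype(np.float64).log1p()

# 3. TF-IDF with a sublinear (1 + log tf) term frequency and L2-normalized rows.
tfidf = TfidfTransformer(use_idf=True, sublinear_tf=True).fit_transform(counts)
```

Each of these matrices can be fed to the same regressor, which makes it straightforward to compare the schemes under cross-validation.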

In [None]:
## Question 2 Normalized Model
from sklearn.feature_extraction.text import TfidfTransformer

class q2estimator(BaseEstimator, RegressorMixin):
    def __init__(self):
        self.clf = linear_model.SGDRegressor()

    def fit(self, texts, y):
        self.count_vect = HashingVectorizer()
        X_train = self.count_vect.transform(texts)
        # with use_idf=False the transformer only L2-normalizes each row
        self.tf_transformer = TfidfTransformer(use_idf=False).fit(X_train)
        X_train_tf = self.tf_transformer.transform(X_train)
        self.clf.fit(X_train_tf, y)
        return self

    def predict(self, record):
        # record is a single review dict with a "text" field
        transformed_feature = self.count_vect.transform([record["text"]])
        X_tf = self.tf_transformer.transform(transformed_feature)
        return self.clf.predict(X_tf)[0]

In [None]:
q2_estimator = q2estimator().fit(text_list, Y)
# pickles are binary, so the file must be opened in "wb" mode
with open("NLPQ2", "wb") as f:
    dill.dump(q2_estimator, f)
q2_estimator.predict(test_json[1])


## bigram_model
In a bigram model, let's consider both single words and pairs of consecutive
words that appear.  This is going to be a much higher dimensional problem
(large $p$) so you should be careful about overfitting.

Sometimes, reducing the dimension can be useful.  Because we are dealing with a
sparse matrix, we use `TruncatedSVD`, which works on sparse input without
densifying it.  If we reduce the dimensions, we can use more sophisticated
models than linear ones.
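
A minimal sketch of that idea, assuming a made-up six-review corpus and a random forest as the "more sophisticated" model (both are illustrative choices, not what this notebook uses):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews and ratings.
docs = [
    "great crab rangoon", "greasy charred burgers",
    "friendly staff great dumplings", "slow service greasy noodles",
    "outstanding crab rangoon", "charred burgers slow service",
]
stars = [5, 2, 5, 1, 5, 1]

# Unigrams plus bigrams give a wide sparse matrix.
bigram_vect = CountVectorizer(ngram_range=(1, 2))
X_sparse = bigram_vect.fit_transform(docs)

# TruncatedSVD accepts the sparse matrix directly and returns a dense array.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

# With only a few dense columns, a non-linear model becomes affordable.
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_reduced, stars)
predictions = model.predict(X_reduced)
```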

As before, memory problems can crop up due to the engineering constraints.
Playing with the number of features, using the `HashingVectorizer`,
incorporating `min_df` and `max_df` limits, and handling stop-words in some way
are all methods of addressing this issue. If you are using `CountVectorizer`,
it is possible to run it with a fixed vocabulary (based on a training run, for
instance). Check the documentation.
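
For the fixed-vocabulary option, one possible pattern is to learn the vocabulary in a training run and pass it to a second `CountVectorizer` through its `vocabulary` parameter. The documents and the `min_df=2` cutoff below are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training documents and unseen documents.
train_docs = [
    "crab rangoon is outstanding",
    "crab rangoon and snow peas",
    "snow peas are fine",
]
new_docs = ["the crab rangoon was cold", "unseen words only"]

# Training run: learn a vocabulary with a document-frequency cutoff.
learner = CountVectorizer(ngram_range=(1, 2), min_df=2)
learner.fit(train_docs)

# Later run: reuse that vocabulary so the feature columns stay fixed.
fixed = CountVectorizer(ngram_range=(1, 2), vocabulary=learner.vocabulary_)
X_new = fixed.transform(new_docs)  # no fit needed when a vocabulary is supplied
```

Terms outside the fixed vocabulary are simply ignored, so the matrix width never grows with new data.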

**A side note on multi-stage model evaluation:** When your model consists of a
pipeline with several stages, it can be worthwhile to evaluate which parts of
the pipeline have the greatest impact on the overall accuracy (or other metric)
of the model. This allows you to focus your efforts on improving the important
algorithms and leave the rest "good enough".

One way to accomplish this is through ceiling analysis, which can be useful
when you have a training set with ground truth values at each stage. Let's say
you're training a model to extract image captions from websites and return a
list of names that were in the caption. Your overall accuracy at some point
reaches 70%. You can try manually giving the model what you know are the
correct image captions from the training set, and see how the accuracy improves
(maybe up to 75%). Alternatively, giving the model the perfect name parsing for
each caption increases accuracy to 90%. This indicates that the name parsing is
a much more promising target for further work, and the caption extraction is a
relatively smaller factor in the overall performance.

If you don't know the right answers at different stages of the pipeline, you
can still evaluate how important different parts of the model are to its
performance by changing or removing certain steps while keeping everything
else constant. You might try this kind of analysis to determine how important
stop-word removal and stemming actually are to your NLP model, and how that
importance changes with parameters like the number of features.
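
One way to run that kind of comparison is to build the same pipeline twice, changing only one step, and compare cross-validated scores. The sketch below toggles stop-word removal; the toy reviews, the `Ridge` estimator, and the helper name `build_pipeline` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews and ratings.
texts = [
    "the burgers were great and the staff was friendly",
    "the noodles were greasy and the service was slow",
    "great dumplings and a friendly owner",
    "slow service and greasy burgers",
    "the staff was friendly and the noodles were great",
    "the owner was rude and the dumplings were greasy",
    "great service and great burgers",
    "rude staff and slow service",
]
stars = [5, 2, 5, 1, 5, 1, 5, 1]

def build_pipeline(stop_words):
    """Same pipeline every time; only the stop-word setting varies."""
    return Pipeline([
        ("vectorize", CountVectorizer(stop_words=stop_words)),
        ("estimator", Ridge(alpha=1.0)),
    ])

# Mean cross-validated negative MSE for each variant (closer to 0 is better).
scores = {}
for label, stop_words in [("stop words kept", None), ("stop words removed", "english")]:
    cv_scores = cross_val_score(
        build_pipeline(stop_words), texts, stars,
        cv=2, scoring="neg_mean_squared_error",
    )
    scores[label] = cv_scores.mean()
```

The difference between the two entries of `scores` is a rough measure of how much that single step matters; on a corpus this small the numbers are not meaningful in themselves.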

In [None]:
## Question 3 Bigram Model
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

bigram_pipeline = Pipeline([
    # hash unigrams and bigrams into a fixed number of columns
    ('vectorize', HashingVectorizer(n_features=100000, ngram_range=(1, 2), stop_words="english")),
    # project the sparse features onto 10 components (n_iter is not used by the 'arpack' solver)
    ('dimensionality_reduce', TruncatedSVD(n_components=10, algorithm='arpack')),
    ('estimator', linear_model.SGDRegressor()),
])
bigram_pipeline.fit(text_list, Y)




In [None]:
# pickles are binary, so the file must be opened in "wb" mode
with open("NLPQ3", "wb") as f:
    dill.dump(bigram_pipeline, f)
value = bigram_pipeline.predict([test_json[4]['text']])   # test


## food_bigrams
Look over all reviews of restaurants (you may need to look at the dataset from
`ml.py` to figure out which ones correspond to restaurants). We want to find
collocations --- that is, bigrams that are "special" and appear more often than
you'd expect from chance.  We can think of the corpus as defining an empirical
distribution over all ngrams.  We can find word pairs that occur consecutively
more often than the underlying probabilities of their words would suggest.
Mathematically, if $p(w)$ is the probability of a word $w$ and $p(w_1 w_2)$ is
the probability of the bigram $w_1 w_2$, then we want to look at word pairs
$w_1 w_2$ where the statistic

  $$ \frac{p(w_1 w_2)}{p(w_1)\, p(w_2)} $$

is high.  Return the top 100 (mostly food) bigrams with this statistic with
the 'right' prior factor (see below).

*Questions:* (to think about: they are not a part of the answer).  This
statistic is a ratio and problematic when the denominator is small.  We can fix
this by applying Bayesian smoothing to $p(w)$ (i.e. mixing the empirical
distribution with the uniform distribution over the vocabulary).

1. How does changing this smoothing parameter affect the word pairs you get
   qualitatively?

2. We can interpret the smoothing parameter as adding a constant number of
   occurrences of each word to our distribution.  Does this help you
   determine a reasonable value for this 'prior factor'?

3. For fun: also check out [Amazon's Statistically Improbable
   Phrases](http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases).

*Implementation notes:*
- The reference solution is not an aggressive filterer. Although there are
  definitely artifacts in the bigrams you'll find, many of the seeming nonsense
  words are actually somewhat meaningful, so using smoothing parameters in
  the thousands or a high `min_df` might give you different results.
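
To make the smoothing concrete, here is a small sketch of the statistic with a prior factor added to every word count, i.e. $p(w) = (c(w) + \alpha) / (N + \alpha V)$ with $c(w)$ the count of $w$, $N$ the total number of words, and $V$ the vocabulary size. The function name `collocation_scores`, the parameter name `prior`, and the toy documents are assumptions; only $p(w)$ is smoothed, as in the description above.

```python
from collections import Counter

def collocation_scores(tokenized_docs, prior=1.0):
    """Return {(w1, w2): p(w1 w2) / (p(w1) * p(w2))} with smoothed p(w)."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # consecutive pairs within a doc

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    vocab_size = len(unigrams)

    def p_word(word):
        # Adding `prior` pseudo-occurrences of every word mixes the empirical
        # distribution with the uniform distribution over the vocabulary.
        return (unigrams[word] + prior) / (n_uni + prior * vocab_size)

    return {
        pair: (count / n_bi) / (p_word(pair[0]) * p_word(pair[1]))
        for pair, count in bigrams.items()
    }

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["crab", "rangoon", "is", "good"],
    ["the", "crab", "rangoon", "is", "good"],
    ["huevos", "rancheros"],
    ["the", "food", "is", "good"],
]
unsmoothed = collocation_scores(toy_docs, prior=0.0)
smoothed = collocation_scores(toy_docs, prior=5.0)
top_unsmoothed = max(unsmoothed, key=unsmoothed.get)
```

In this toy corpus a pair seen only once tops the unsmoothed ranking, while a larger `prior` shifts weight toward pairs backed by more occurrences, which is the qualitative effect question 1 asks about.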

In [None]:
# load the business data (the dataset used in ml.py)
with gzip.open('yelp_train_academic_dataset_business.json.gz', 'rb') as f:
    file_content_ml = f.read().decode('utf-8')
# the file holds one JSON object per line, so parse each non-empty line
Data_ml = [ujson.loads(line) for line in file_content_ml.split('\n') if line.strip()]


In [None]:
dict_list_new = []
for record in b:
    if not record.strip():
        continue  # skip blank lines
    item = ujson.loads(record)  # parse each record once
    dict_list_new.append({"text": item['text'], "business_id": item['business_id']})

In [None]:
df_review = pd.DataFrame(dict_list_new)
df_business = pd.DataFrame(Data_ml)
# keep only the businesses tagged as restaurants
new_list = [item for item in Data_ml if 'Restaurants' in item['categories']]
df_Restaurants = pd.DataFrame(new_list)
df_restaurants_crop = df_Restaurants[['categories','stars','business_id']]

In [None]:
df_merge = pd.merge(df_restaurants_crop, df_review, how = 'inner', on = 'business_id')
new_text_list = df_merge['text'].values.tolist()
Y_list = df_merge['stars'].values.tolist()

In [None]:
# Create Unigram Model: convert the collection of text documents to a matrix of token counts
q4_uni_gram = CountVectorizer(ngram_range=(1,1),stop_words="english",min_df=0.00001,max_features = 100000,strip_accents = "ascii")
X_train_q4_unigram = q4_uni_gram.fit_transform(new_text_list)

# Create Bigram Model
q4_bi_gram = CountVectorizer(ngram_range=(2,2),stop_words="english", min_df=0.00001,max_features = 100000,strip_accents = "ascii")
X_train_q4_bigram = q4_bi_gram.fit_transform(new_text_list)


In [None]:
bicount_list = X_train_q4_bigram.sum(axis=0).tolist()[0]
total_count_biword = X_train_q4_bigram.sum()
total_count_uniword = X_train_q4_unigram.sum()
unicount_list = X_train_q4_unigram.sum(axis = 0).tolist()[0]
uniword_list = sorted(q4_uni_gram.vocabulary_, key=q4_uni_gram.vocabulary_.get)
biword_list = sorted(q4_bi_gram.vocabulary_, key=q4_bi_gram.vocabulary_.get)

In [None]:
# build dictionaries of counts for bigrams and unigrams
l = []
biword_dict = dict(zip(biword_list, bicount_list))
uniword_dict = dict(zip(uniword_list, unicount_list))
for word_pair, counts in biword_dict.items():
    w1, w2 = word_pair.split(' ')
    # skip bigrams whose words were cut from the unigram vocabulary (max_features / min_df)
    if w1 not in uniword_dict or w2 not in uniword_dict:
        continue
    l.append((w1, w2, uniword_dict[w1], uniword_dict[w2], counts))


In [None]:
df_ratio = pd.DataFrame(l)
df_ratio.columns = ['w1','w2','count1','count2','counts']
df_ratio['ratio'] = (df_ratio.counts)/((df_ratio.count1)*(df_ratio.count2))

In [None]:
# Create a List of Biwords
# DataFrame.sort was removed from pandas; sort_values is its replacement
result = df_ratio.sort_values('ratio', ascending=False)
q4_result = result[['w1','w2']].values.tolist()
final_l = [w1 + " " + w2 for w1, w2 in q4_result]

In [None]:
# drop bigrams that contain digits or underscores
RE_D = re.compile(r'\d|_')
final_l_str = [item for item in final_l if not RE_D.search(item)]

In [None]:
final_l_str[0:100]