# Unpopular Category Selection

### Description
#### Data Mining :-
   In this report we address two aspects of text information by mining the Yelp dataset. These can be summarized as
   1. Finding dissimilarities between cuisine categories by vectorizing the words of their reviews and using those vectors to find the least correlated cuisine categories.
   2. Rating restaurants for a given category and state. We use review counts to find the most popular restaurants and review sentiment to find the most highly rated ones.

Finally, using the information derived above, we prepare the final dataset to be consumed by the web application.

#### Try It Out -  Web Application :-
   In the web app we let users try a cuisine they might otherwise never try. We ask users for the cuisine they like the most. Using the reviews for all cuisine categories, we find the category that is `Least Correlated` with the chosen one. Users also choose the state they are in, and for the least correlated category we show the top three ```Highly Rated``` and ```Popular``` restaurants in that state.
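
The lookup the web app performs can be sketched in plain Python as follows. This is a minimal sketch, not the application's actual code: the function name `recommend` and the restaurant entries are hypothetical, while the two category mappings are copied from the `Least_Correlated` output later in this notebook.

```python
from typing import Dict, List

def recommend(liked: str, state: str,
              least_correlated: Dict[str, str],
              restaurants: List[dict], top_n: int = 3) -> List[dict]:
    """Return up to top_n restaurants from the category least correlated with `liked`."""
    target = least_correlated[liked]
    # The list is assumed to be pre-sorted by rank, as in the exported JSON files
    matches = [r for r in restaurants
               if r["categories"] == target and r["state"] == state]
    return matches[:top_n]

# Mapping entries taken from the notebook output; restaurant names are made up
least_correlated = {"Italian": "Indian", "Indian": "American (Traditional)"}
popular = [
    {"categories": "Indian", "state": "AZ", "name": "Example Curry House"},
    {"categories": "Indian", "state": "NV", "name": "Example Tandoor"},
    {"categories": "Italian", "state": "AZ", "name": "Example Trattoria"},
]

picks = recommend("Italian", "AZ", least_correlated, popular)
print([r["name"] for r in picks])  # ['Example Curry House']
```

The same function can be called once with the popularity list and once with the sentiment list to fill both sections of the page.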


   
### Setup
1. The application uses Pipenv to manage its dependencies.
2. To install dependencies, run `pipenv install`.
3. To view the notebook, run `pipenv run jupyter notebook`.
4. Once Jupyter is running, open and run the `unpopular.ipynb` notebook.

### Libraries used
1. pandas - for dataframe manipulation
2. nltk - for lemmatization
3. spacy - for the stop word list used during cleanup
4. gensim - for phrase detection and word2vec modelling
5. numpy - for numerical array operations (mean pooling of word vectors)

In [40]:
import warnings
import multiprocessing as mp
warnings.filterwarnings("ignore")

### Fetch dataset
1. Create a dataset of businesses and their associated reviews.
2. The final dataset has reviews for all restaurants.
3. Keep only businesses located in US states.
4. Convert star ratings (for both restaurants and individual reviews) to binary labels using the following rule:
```
Review >=3.5 Then 1
Else -1
```
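
The rule above can be written as a small helper. This is only an illustrative sketch in plain Python; the name `to_label` is hypothetical and the notebook itself applies the rule row by row with `DataFrame.apply`.

```python
def to_label(stars: float, threshold: float = 3.5) -> int:
    """Map a star rating to +1 (positive) or -1 (negative)."""
    return 1 if stars >= threshold else -1

labels = [to_label(s) for s in (1.0, 3.0, 3.5, 5.0)]
print(labels)  # [-1, -1, 1, 1]
```

Note that the boundary value 3.5 counts as positive because the comparison is `>=`.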

In [41]:
import pandas as pd

business_ds = pd.read_json('yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json',lines=True)
all_review_ds = pd.read_json('yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json',lines=True).sample(500000)

In [42]:
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

categories=['Turkish','Indian','Mexican','Italian','Chinese', \
            'Mediterranean','American (New)','American (Traditional)', \
            'Lebanese','Moroccan','Afghan', \
           'Bangladeshi','Japanese']

In [43]:
restaurant_business_ds=business_ds.loc[business_ds['categories'].apply(lambda x:'Restaurants' in list(x) )] \
.loc[business_ds['state'].apply(lambda x:x in states)]
all_business_reviews = restaurant_business_ds.set_index('business_id').join(all_review_ds.set_index('business_id'),how='inner',lsuffix='_bus', rsuffix='_rev') \
[['name','text','categories','state','review_count','stars_bus','stars_rev','full_address','longitude','latitude']]

In [44]:

all_business_reviews['rest_rating'] = all_business_reviews.apply(lambda row :1 if row['stars_bus']>=3.5 else -1 , axis = 1)
all_business_reviews['rev_rating'] = all_business_reviews.apply(lambda row :1 if row['stars_rev']>=3.5 else -1 , axis = 1) 

### Dataset cleanup
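1. Remove stop words (the code uses spaCy's `STOP_WORDS`; the nltk stop word list is downloaded but its use is commented out).
2. Lemmatize the remaining words using nltk's `WordNetLemmatizer`.
3. Use gensim's `simple_preprocess` to lowercase the text and remove punctuation, accents and unnecessary characters.
4. Split sentences into tokens.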
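
To make the cleanup steps concrete without requiring nltk, spaCy or gensim, here is a toy stand-in written with the standard library only. The stop word set and the lemma dictionary are tiny hypothetical substitutes, and the regular expression only approximates what `simple_preprocess` does; the real pipeline is the `preprocessor` function in the notebook.

```python
import re
import unicodedata

STOP = {"the", "a", "was", "and", "were", "is"}    # tiny stand-in stop word list
LEMMAS = {"noodles": "noodle", "dishes": "dish"}   # toy stand-in for a lemmatizer

def strip_accents(text):
    """Remove accent marks, similar in spirit to deacc=True."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def toy_preprocess(sentence):
    text = strip_accents(str(sentence)).lower()
    tokens = re.findall(r"[a-z]+", text)                 # drops punctuation and digits
    tokens = [t for t in tokens if 2 <= len(t) <= 15]    # token length limits
    # Filter stop words first, then lemmatize, in the same order as the notebook
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP]

# "cr\u00e8me br\u00fbl\u00e9e" is the accented spelling of "creme brulee"
review = "The noodles were great, and the cr\u00e8me br\u00fbl\u00e9e was amazing!"
print(toy_preprocess(review))
# ['noodle', 'great', 'creme', 'brulee', 'amazing']
```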

In [45]:
import nltk
import gensim
from nltk.stem import WordNetLemmatizer 
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
# stop_words = stopwords.words('english')
from spacy.lang.en.stop_words import STOP_WORDS

[nltk_data] Downloading package stopwords to /home/zztop/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/zztop/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [46]:
import re
lemmatizer = WordNetLemmatizer()
def preprocessor(sentence):
    return [lemmatizer.lemmatize(word) for word in simple_preprocess(str(sentence),deacc=True) if word not in STOP_WORDS]

# def preprocess_sentenses(sentences):
#     for sentence in sentences:
#         yield(preprocessor(sentence))   

In [47]:

def get_reviews_all_categories(revs):
    for cat in revs['categories']:
        if cat not in categories:
            continue
        return {'categories':cat,'text':revs['text'],'name':revs['name'],'state':revs['state'], \
                'rest_rating':revs['rest_rating'],'rev_rating':revs['rev_rating'],'review_count':revs['review_count'] \
               ,'full_address':revs['full_address'],'longitude':revs['longitude'],'latitude':revs['latitude']}
reviews=[]
   
# business_reviews.iloc[:1]['categories']
with mp.Pool(processes = 6) as p:
    reviews[:] = p.map(get_reviews_all_categories, (revs for _, revs in all_business_reviews.iterrows()))
reviews = pd.DataFrame(filter(lambda x: x is not None, reviews))

In [48]:
reviews["words"]=reviews.apply(lambda rev: preprocessor(rev['text']),axis=1)
reviews=reviews.drop(["text"], axis=1).dropna()


### Create Trigram models
1. We used gensim's `Phrases` and `Phraser` classes to create trigram models using word collocation.
2. The parameters used are
    1. `min_count=3` - a word or phrase must appear at least three times to be considered.
    2. Behind the scenes, gensim scores candidate phrases with a collocation measure. Its default scorer is the formula from Mikolov et al.; Normalized Pointwise Mutual Information (NPMI) is available through `scoring='npmi'`, which the code below does not set.
    3. A phrase with word a followed by word b is accepted if its score is above a threshold.
    4. We set the threshold to 7 (gensim's default is 10.0).
3. Concatenate all the trigram-tokenized reviews per category.
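
The two phrase scores mentioned above can be sketched as follows. This is a hedged illustration of the formulas as we understand them from the gensim documentation (`scoring='default'` and `scoring='npmi'`), computed on a hypothetical toy corpus; the function names are ours and the real computation happens inside `Phrases`.

```python
import math
from collections import Counter

def default_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Mikolov et al. style collocation score (our reading of scoring='default')."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

def npmi_score(count_a, count_b, count_ab, total_words):
    """Normalized pointwise mutual information, in the range [-1, 1]."""
    pa = count_a / total_words
    pb = count_b / total_words
    pab = count_ab / total_words
    return math.log(pab / (pa * pb)) / -math.log(pab)

# Hypothetical toy corpus of tokenized reviews
docs = [["ice", "cream", "good"], ["ice", "cream", "cold"],
        ["mango", "ice", "cream"], ["good", "service"]]
unigrams = Counter(w for d in docs for w in d)
bigrams = Counter(pair for d in docs for pair in zip(d, d[1:]))
total = sum(unigrams.values())

for (a, b), n_ab in bigrams.items():
    score = default_score(unigrams[a], unigrams[b], n_ab, len(unigrams), min_count=1)
    npmi = npmi_score(unigrams[a], unigrams[b], n_ab, total)
    print(f"{a}_{b}: default={score:.3f} npmi={npmi:.3f}")
```

Because NPMI is bounded by 1, a threshold such as 7 only makes sense with the default scorer.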

In [49]:
from gensim.models.phrases import Phrases, Phraser

bigram = Phrases(reviews['words'].values, min_count=3, threshold=7)
trigram = Phrases(bigram[reviews['words'].values], min_count=3,threshold=7)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]
data_words_trigrams = make_trigrams(reviews['words'].values)

In [50]:
reviews['trigram'] =data_words_trigrams
reviews=reviews.drop(["words"], axis=1)

In [51]:
# import dask.dataframe as dd
# reviews_dask_df = dd.from_pandas(reviews, npartitions=6)

# categories_all_reviews_dask_ds=reviews_dask_df.groupby(["categories"]).agg({'trigram': sum})

In [52]:
# categories_all_reviews_dask_ds = categories_all_reviews_dask_ds.reset_index()
# categories_all_reviews_dask_ds=categories_all_reviews_dask_ds.set_index(categories_all_reviews_dask_ds['categories'])
# categories_all_reviews_dask_ds=categories_all_reviews_dask_ds.drop(["categories"], axis=1)

In [53]:
# categories_all_reviews_ds=categories_all_reviews_dask_ds.compute()

In [54]:
categories_all_reviews_ds =reviews.groupby(["categories"]).agg({'trigram': sum})
# categories_all_reviews_ds = categories_all_reviews_ds.reset_index()
# categories_all_reviews_ds.set_index(categories_all_reviews_ds['categories'],inplace=True)
# categories_all_reviews_ds=categories_all_reviews_ds.drop(["categories"], axis=1)

In [55]:
categories_all_reviews_ds = categories_all_reviews_ds.reset_index()
categories_all_reviews_ds=categories_all_reviews_ds.set_index(categories_all_reviews_ds['categories'])
categories_all_reviews_ds=categories_all_reviews_ds.drop(["categories"], axis=1)

### Create Vectorized word model using Word2Vec
1. Both Word2Vec and FastText use a neural network that exploits the context of a word, so that words with similar meanings get similar representations.
2. Traditional approaches represent words as one-hot vectors, that is, a vector with a single non-zero entry for one word/phrase.
    1. The length of the vector is equal to the number of unique words in the corpus vocabulary.
    2. However, with this representation it is not easy to infer the relationship between two given words.
    3. Hence words with similar meanings are treated as two unrelated vectors in a one-hot representation.
3. Word2Vec uses the context of a word, essentially the words in its neighborhood, to create the word representation.
    1. It uses a neural network whose hidden layer encodes the word representation.
    2. It comes in two types, Skip-gram and CBOW. In the former, the input to the network is the target word and the output is its surrounding words, whereas in the latter the input and output are swapped.
    3. The main difference is therefore how the word vectors are trained.
    4. Generally the performance of both models is similar. We go with Skip-gram (`sg=1`), which reportedly does a better job with rare words.
4. FastText is an extension of Word2Vec. It treats each word as a set of character n-grams when feeding it to the neural network.
    1. It is often better than Word2Vec for rare words, as breaking a rare word into n-grams increases the chances of finding more neighborhood context words.
    2. It also handles out-of-vocabulary words, since a vector can be built from the n-grams of a word that was never seen in training.
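
The difference between Skip-gram and CBOW, and the character n-grams FastText works with, can be illustrated by generating the training examples by hand. This is a plain-Python sketch for intuition only; the function names are hypothetical and none of this is how gensim is implemented internally. The `<` and `>` boundary markers follow the FastText paper.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: (input = target word, output = one context word)."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: (input = all context words, output = target word)."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

tokens = ["butter", "chicken", "naan"]
print(skipgram_pairs(tokens, window=1))
# [('butter', 'chicken'), ('chicken', 'butter'), ('chicken', 'naan'), ('naan', 'chicken')]
print(cbow_pairs(tokens, window=1))
# [(['chicken'], 'butter'), (['butter', 'naan'], 'chicken'), (['chicken'], 'naan')]
print(char_ngrams("naan", 3, 3))
# ['<na', 'naa', 'aan', 'an>']
```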


In [56]:
from gensim.models import Word2Vec
word2vec_model = Word2Vec(data_words_trigrams, min_count=10,seed=1,sg=1)
# fasttext_model = FastText(categories_all_reviews_ds.trigrams, min_count=10,seed=1,sg=1)

### Find Dissimilarities between Cuisine Categories
1. Using the Word2Vec model from above, we aggregate all word vectors of a category using mean pooling.
    1. With mean pooling we convert the multiset of word vectors for a category into a single vector of constant length. More can be read [here](https://www.cs.tau.ac.il/~wolf/papers/qagg.pdf)
2. Using the mean-pooled Word2Vec dataframe we find the correlation between all the categories.
3. We used Pearson's r to calculate correlation, as suggested [here](https://www.aclweb.org/anthology/D19-1008.pdf)
4. From the resulting correlation matrix, given a category we can find its least correlated category.
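
The steps above can be sketched end to end on toy data. This is a minimal plain-Python illustration; the category names and the three-dimensional vectors are hypothetical, and the notebook performs the same steps with numpy and `DataFrame.corr`.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length word vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pearson(x, y):
    """Pearson's r between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def least_correlated(pooled):
    """For every category, pick the other category with the lowest r."""
    result = {}
    for cat, vec in pooled.items():
        others = {o: pearson(vec, v) for o, v in pooled.items() if o != cat}
        result[cat] = min(others, key=others.get)
    return result

# Hypothetical word vectors per category
word_vectors = {
    "CuisineA": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    "CuisineB": [[2.0, 4.0, 7.0]],
    "CuisineC": [[3.0, 1.0, 0.0], [5.0, 3.0, 0.0]],
}
pooled = {cat: mean_pool(vs) for cat, vs in word_vectors.items()}
print(pooled["CuisineA"])        # [2.0, 3.0, 4.0]
print(least_correlated(pooled))
# {'CuisineA': 'CuisineC', 'CuisineB': 'CuisineC', 'CuisineC': 'CuisineA'}
```

Unlike the notebook, this sketch excludes a category's correlation with itself explicitly; in the notebook the diagonal value of 1 is simply never the minimum.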

In [57]:
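# Create features from Word Vectors - https://www.kaggle.com/sakshat/word2vec-xgboost
import numpy as np
def makeFeatureVec(words,model,num_features):
    # Mean-pool the vectors of all in-vocabulary words into one feature vector
    feature_vec=np.zeros((num_features,),dtype="float32")
    nwords=0
    # index2word is the gensim 3.x attribute name (index_to_key in gensim 4.x)
    index2word_set=set(model.wv.index2word)
    for word in words:
        if word in index2word_set: 
            nwords=nwords+1
            # look vectors up through model.wv; model[word] is deprecated
            feature_vec=np.add(feature_vec,model.wv[word])
    # avoid dividing by zero when no word is in the vocabulary
    if nwords>0:
        feature_vec=np.divide(feature_vec,nwords)
    return feature_vec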

In [58]:
word2vec_mean_df = pd.DataFrame([makeFeatureVec(words,word2vec_model,word2vec_model.wv.vector_size)  for words in categories_all_reviews_ds.trigram])

In [59]:
word2vec_mean_df.set_index(categories_all_reviews_ds.index,inplace=True)

In [60]:
word2vec_mean_corr_df = word2vec_mean_df.T.corr(method='pearson')

In [61]:
word2vec_mean_corr_df['Least_Correlated']=word2vec_mean_corr_df \
.apply(lambda cuisine:cuisine.idxmin() )

In [72]:
word2vec_mean_corr_df[['Least_Correlated']].to_json('Least_Correlated.json')
word2vec_mean_corr_df['Least_Correlated']

categories
Afghan                                  Japanese
American (New)                            Indian
American (Traditional)                  Moroccan
Chinese                                 Moroccan
Indian                    American (Traditional)
Italian                                   Indian
Japanese                                Moroccan
Lebanese                                Japanese
Mediterranean                           Japanese
Mexican                                 Moroccan
Moroccan                                 Mexican
Turkish                                 Japanese
Name: Least_Correlated, dtype: object

In [63]:
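# NOTE: 'popularity.json' is written again by the Popularity cell below, which overwrites this output
word2vec_mean_corr_df.to_json('popularity.json',orient='records')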

### Rating Restaurants
We rate a restaurant in two ways:
1. Popularity
2. Highly Rated

#### Popularity
1. To find the most popular restaurants, we take the three restaurants with the highest number of reviews per category per state.
2. Save the results as JSON to be consumed by the web application.
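
The selection described above can be sketched without pandas. This is a hypothetical plain-Python version for illustration; the restaurant names `R1` to `R5` and their counts are made up, and the notebook does the equivalent with `sort_values` followed by `groupby(...).nth(...)`.

```python
from collections import defaultdict

def top_n_by_reviews(rows, n=3):
    """Group rows by (category, state) and keep the n most reviewed in each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["categories"], row["state"])].append(row)
    return {key: sorted(members, key=lambda r: r["review_count"], reverse=True)[:n]
            for key, members in groups.items()}

# Made-up rows with the same field names as the notebook's dataframe
rows = [
    {"categories": "Chinese", "state": "AZ", "name": "R1", "review_count": 40},
    {"categories": "Chinese", "state": "AZ", "name": "R2", "review_count": 240},
    {"categories": "Chinese", "state": "AZ", "name": "R3", "review_count": 10},
    {"categories": "Chinese", "state": "AZ", "name": "R4", "review_count": 224},
    {"categories": "Chinese", "state": "NV", "name": "R5", "review_count": 5},
]
top = top_n_by_reviews(rows)
print([r["name"] for r in top[("Chinese", "AZ")]])  # ['R2', 'R4', 'R1']
```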

In [64]:
popularity_ds =reviews[['categories','name','state','full_address','review_count','latitude','longitude']].drop_duplicates() \
.sort_values(['categories','review_count'],ascending=False) \
.groupby(['categories','state']).nth((0,1,2)).reset_index()

popularity_ds.to_json('popularity.json',orient='records')

popularity_ds.loc[popularity_ds['categories']=='Chinese'].loc[popularity_ds['state']=='AZ']

|    | categories | state | name | full_address | review_count | latitude | longitude |
|----|------------|-------|------|--------------|--------------|----------|-----------|
| 26 | Chinese | AZ | China Magic Noodle House | 2015 N Dobson Rd\nChandler, AZ 85224 | 240 | 33.336409 | -111.876379 |
| 27 | Chinese | AZ | China Chili | 302 E Flower St\nPhoenix, AZ 85012 | 224 | 33.48585 | -112.069181 |
| 28 | Chinese | AZ | Snoh Ice Shavery | 914 E Camelback Rd\nUnit 4B\nPhoenix, AZ 85014 | 222 | 33.509529 | -112.060958 |


#### Highly Rated
1. To find the most highly rated restaurants, we calculate user sentiment as the fraction of positive ratings, rather than taking the average user rating.
```
User Sentiment of a restaurant = (Number of Positive Reviews)/(Total Number of Reviews)
```
2. Using the sentiments, find the three restaurants with the highest sentiment per category, per state.
3. Save the results as JSON to be consumed by the web application.
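
The sentiment formula can be sketched in plain Python as below. The function name and the toy reviews are hypothetical. One assumption differs from the notebook: this sketch divides by the number of reviews it actually sees, whereas the notebook divides the positive count from the sampled reviews by the business's overall `review_count`.

```python
from collections import defaultdict

def sentiment_scores(reviews):
    """reviews: iterable of (restaurant, rev_rating) with rev_rating in {1, -1}."""
    positive = defaultdict(int)
    total = defaultdict(int)
    for name, rating in reviews:
        total[name] += 1
        if rating == 1:
            positive[name] += 1
    # Fraction of positive reviews per restaurant
    return {name: positive[name] / total[name] for name in total}

# Made-up binary review labels
toy_reviews = [("R1", 1), ("R1", 1), ("R1", -1),
               ("R2", 1), ("R2", -1), ("R2", -1), ("R2", -1)]
scores = sentiment_scores(toy_reviews)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['R1', 'R2']
```

Ranking by the fraction of positive reviews keeps a handful of extreme ratings from dominating, which is one reason to prefer it over a plain average.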

In [65]:
review_count_df =reviews[['categories','name','state','full_address','rev_rating','review_count','latitude','longitude']].loc[reviews['rev_rating']==1] \
.groupby(['categories','name','full_address','latitude','longitude','state','review_count']) \
.agg( postive_review_count=pd.NamedAgg(column='rev_rating', aggfunc='count')) \
.reset_index() 

review_count_df['sentiment']=review_count_df['postive_review_count']/review_count_df['review_count']

sentiment_df = review_count_df.sort_values(['categories','sentiment'],ascending=False).reset_index() \
.groupby(['categories','state']).nth((0,1,2)).sort_values(['categories','state','sentiment'],ascending=False).reset_index()
sentiment_df=sentiment_df.drop(columns=['review_count','postive_review_count'])
sentiment_df.to_json('sentiment.json',orient='records')

sentiment_df.loc[sentiment_df['categories']=='Chinese'].loc[sentiment_df['state']=='AZ']

|    | categories | state | index | name | full_address | latitude | longitude | sentiment |
|----|------------|-------|-------|------|--------------|----------|-----------|-----------|
| 65 | Chinese | AZ | 2595 | Jade Palace | 8120 N Hayden Rd\nScottsdale, AZ 85258 | 33.555735 | -111.898815 | 0.714286 |
| 66 | Chinese | AZ | 2749 | Panda Express | 1747 E Florence Blvd\nCasa Grande, AZ 85122 | 32.879219 | -111.712176 | 0.666667 |
| 67 | Chinese | AZ | 2967 | Wongs To Go | 5219 S 7th St\nPhoenix, AZ 85040 | 33.398942 | -112.064389 | 0.666667 |


In [66]:
sentiment_df.state.unique(),popularity_ds.state.unique()

(array(['NV', 'AZ', 'WI', 'GA'], dtype=object),
 array(['AZ', 'NV', 'WI', 'GA'], dtype=object))

### Results
1. From the above scatter plot we can see clusters of words/phrases representing several Indian cuisines.
2. These word associations can be used by users to explore more Indian dishes, e.g.
    1. If you like tandoori chicken, you may also like to try butter chicken.
3. Using just one dessert as input we got several other desserts, e.g.
    1. With rice_pudding we discovered mango ice cream, kheer and mango pudding.
    

### Improvements
1. There was no model evaluation. We did not score models against input labels.
2. Use SegPhrase to find input labels and feed them into FastText to find similar words.
3. Evaluate the accuracy of the generated model.
4. We used the top 5 most similar phrases. However, we should put a threshold on similarity and keep all labels above it. This would weed out weak labels and possibly give a broader feature base.
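
The thresholding idea in the last item could look like the sketch below. It assumes the candidates arrive as `(phrase, similarity)` tuples, which is the shape gensim's `most_similar` returns; the helper name, the threshold of 0.6 and the similarity values are all hypothetical.

```python
def above_threshold(similar, threshold=0.6):
    """Keep every (phrase, similarity) pair at or above the threshold."""
    return [(phrase, score) for phrase, score in similar if score >= threshold]

# Made-up similarity scores for illustration
candidates = [("kheer", 0.82), ("mango_pudding", 0.74), ("rice", 0.41)]
kept = above_threshold(candidates)
print(kept)  # [('kheer', 0.82), ('mango_pudding', 0.74)]
```

Compared with a fixed top 5, the number of labels kept now varies with how many strong neighbours a phrase has.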

### References
1. https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c