# HW3

Submit via Slack. Due on **Tuesday, April 12th, 2022, 6:29pm PST**. You may work with one other person.

> This homework is co-completed by **Siqin Yang** (7374355500) and **Ningxi Wang** (3605565772) :)

In [1]:
import numpy as np
import pandas as pd
import re
from collections import Counter
from tqdm import tqdm

import nltk
from nltk.corpus import stopwords
from textacy.preprocessing.replace import urls, numbers, emojis, currency_symbols
from textacy.preprocessing.remove import punctuation
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity



## TF-IDF (5pts)

You are an analyst working for Amazon's product team, and charged with identifying areas for improvement for the toy reviews.

Using the **amazon-fine-foods.csv** dataset, pick 2-4 products, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?
- which words to collocate together? (skip)

Finally, generate a TF-IDF report that explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the features your analysis showed that customers cited as reasons for a good review (i.e. 4/5)
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?

In [2]:
amz = pd.read_csv('../datasets/amazon_fine_foods.csv')
amz.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,20983,B002QWP89S,A21U4DR8M6I9QN,"K. M Merrill ""justine""",1,1,5,1318896000,addictive! but works for night coughing in dogs,my 12 year old sheltie has chronic brochotitis...
1,20984,B002QWP89S,A17TDUBB4Z1PEC,jaded_green,1,1,5,1318550400,genuine Greenies best price,"These are genuine Greenies product, not a knoc..."
2,20985,B002QWP89S,ABQH3WAWMSMBH,tenisbrat87,1,1,5,1317168000,Perfect for our little doggies,"Our dogs love Greenies, but of course, which d..."
3,20986,B002QWP89S,AVTY5M74VA1BJ,tarotqueen,1,1,5,1316822400,dogs love greenies,"What can I say, dogs love greenies. They begg ..."
4,20987,B002QWP89S,A13TNN54ZEAUB1,dcz2221,1,1,5,1316736000,Greenies review,This review is for a box of Greenies Lite for ...


### data preprocess

delete duplicated reviews

In [3]:
amz.drop_duplicates('Text', inplace=True)

In [4]:
amz.shape

(4931, 10)

> However, this does not drop the following pair of duplicates, because the two texts differ only in spacing (single versus double spaces):

Our dogs love Greenies, but of course, which doggies don't?  I bought this for my dashchund and minpin, and it's perfect!  A great price for a great product.  Who could ask for more.

----
Our dogs love Greenies, but of course, which doggies don't? I bought this for my dashchund and minpin, and it's perfect! A great price for a great product. Who could ask for more.

In [5]:
amz.loc[amz['Id'].isin([20985,21300]), 'Text']

2      Our dogs love Greenies, but of course, which d...
317    Our dogs love Greenies, but of course, which d...
Name: Text, dtype: object

In [6]:
amz.loc[amz['Id']==20985, 'Text'].values == amz.loc[amz['Id']==21300, 'Text'].values

array([False])

In [7]:
amz['Text'] = amz['Text'].str.replace("  ", " ", regex=True)

In [8]:
amz.drop_duplicates('Text', inplace=True)
amz.shape

(4930, 10)
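Replacing a double space with a single space handles the pair above, but runs of three or more spaces, tabs, or newlines could still hide duplicates. A minimal sketch of a more general normalization, using only the standard library; `normalize_whitespace` is a hypothetical helper and is not part of the cells above:

```python
import re

def normalize_whitespace(text: str) -> str:
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    return re.sub(r"\s+", " ", text).strip()

a = "which doggies don't?  I bought this"   # double space
b = "which doggies don't? I bought this"    # single space
print(a == b)
print(normalize_whitespace(a) == normalize_whitespace(b))
```

Applied to a pandas column, the same pattern could be passed to `Series.str.replace` with `regex=True` before calling `drop_duplicates`.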

pick certain products for analysis

In [9]:
products = amz.groupby('ProductId').count()['Id'].sort_values(ascending=False).head(5).index
products = list(products)
products

['B007JFMH8M', 'B002QWP89S', 'B003B3OOPA', 'B001EO5Q64', 'B0013NUGDE']

In [10]:
amz_prd = amz.loc[amz['ProductId'].isin(products), ['Id', 'ProductId', 'Score', 'Text']]
amz_prd.columns = ['id', 'productid', 'score', 'text']
amz_prd.head()

Unnamed: 0,id,productid,score,text
0,20983,B002QWP89S,5,my 12 year old sheltie has chronic brochotitis...
1,20984,B002QWP89S,5,"These are genuine Greenies product, not a knoc..."
2,20985,B002QWP89S,5,"Our dogs love Greenies, but of course, which d..."
3,20986,B002QWP89S,5,"What can I say, dogs love greenies. They begg ..."
4,20987,B002QWP89S,5,This review is for a box of Greenies Lite for ...


### text preprocess

####  stopwords

- Q: why remove/keep stopwords?
- A: we remove stopwords because they occur in almost every review and carry little information about what distinguishes one review from another.

In [11]:
stpw = set(stopwords.words('english'))

In [12]:
stpw |= set(['amz', 'amazon'])

In [13]:
stpw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'amazon',
 'amz',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 

In [14]:
def count_words(lines, delimiter=" "):
    words = Counter() # instantiate a Counter object called words
    for line in lines:
        for word in line.split(delimiter):
            word = word.lower()
            if word in stpw: continue
            words[word] += 1 # increment count for word
    return words

In [15]:
counts = count_words(amz_prd['text'])
counts.most_common(150)

[('coconut', 1558),
 ('oil', 1384),
 ('like', 1184),
 ('love', 1098),
 ('/><br', 1024),
 ('great', 1017),
 ('use', 965),
 ('good', 827),
 ('one', 814),
 ('product', 697),
 ('taste', 689),
 ('soft', 644),
 ('would', 623),
 ('really', 615),
 ('hair', 589),
 ('cookie', 585),
 ('cookies', 575),
 ('also', 543),
 ('get', 506),
 ('oatmeal', 469),
 ('chips', 462),
 ('skin', 447),
 ('flavor', 417),
 ('little', 416),
 ('try', 407),
 ('buy', 403),
 ('even', 397),
 ('much', 395),
 ('it.', 395),
 ("i've", 384),
 ('using', 384),
 ('used', 381),
 ("i'm", 373),
 ('dog', 366),
 ('tried', 354),
 ('-', 329),
 ('eat', 320),
 ('best', 317),
 ('greenies', 315),
 ('them.', 295),
 ('baked', 294),
 ('bought', 293),
 ('quaker', 293),
 ('dogs', 292),
 ('loves', 291),
 ('better', 290),
 ('first', 289),
 ('/>i', 285),
 ('loved', 282),
 ('potato', 280),
 ('definitely', 275),
 ('sweet', 273),
 ('price', 272),
 ('time', 266),
 ('got', 265),
 ('make', 265),
 ('go', 260),
 ('since', 258),
 ('many', 254),
 ('received', 

- Q: which stopwords to remove? / adding in custom stopwords?

- A: nltk.corpus.stopwords + customized ones below

> We compared the most common words above with the default stopword list and added the uninformative ones as custom stopwords.

In [16]:
stpw |= set(['product', 'would', 'really', 'also', 'even', 'much', 'definitely', 'many',
            'every', 'still', 'always', 'bit', 'way', 'something', 'whole', 'thing',
            'absolutely', 'lot', 'almost', 'enough', 'food', 'foods'])
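The word counts above include tokens such as `it.`, `them.` and `/><br` because splitting on a single space keeps punctuation attached to words. A small sketch of counting with a regex tokenizer instead; the stopword set and the `count_tokens` helper are illustrative assumptions, not the NLTK list or the function used in this notebook:

```python
import re
from collections import Counter

STOPWORDS = {"i", "it", "the", "and", "them"}  # tiny illustrative set, not NLTK's list

def count_tokens(lines, stopwords=STOPWORDS):
    counts = Counter()
    for line in lines:
        # Match runs of letters/digits, optionally followed by an apostrophe part
        # (keeps "i've" together) and drops surrounding punctuation
        for token in re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", line.lower()):
            if token not in stopwords:
                counts[token] += 1
    return counts

counts = count_tokens(["I love it.", "Love them. Great price - love it!"])
print(counts.most_common(3))
```

With this tokenizer `it.` and `it` are counted as the same word, so the stopword filter catches both.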

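#### regex cleaning
- Q: regex cleaning and substitution?
- A: we strip `<br />` tags, `-` and `&`, and punctuation, then replace URLs, numbers, currency symbols and emojis with placeholder tokens.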

In [17]:
def standardize_word(doc, word_orig, word_std):
    doc = doc.str.replace(word_orig, word_std,
                          flags=re.IGNORECASE, regex=True)
    return doc  # str.replace returns a new Series, so the caller must assign the result

In [18]:
amz_prd['text_std'] = amz_prd['text']

`<br />` cleaning 

In [19]:
word_orig, word_std = r'(<br />)', ''
amz_prd['text_std'] = standardize_word(amz_prd['text_std'], word_orig, word_std)

`('-', 329)`, `('&', 239)` cleaning

In [20]:
word_orig, word_std = r'(-|&)', ''
amz_prd['text_std'] = standardize_word(amz_prd['text_std'], word_orig, word_std)

punctuation

In [21]:
word_orig, word_std = r'[.!?\-\"\\]', ' '
amz_prd['text_std'] = standardize_word(amz_prd['text_std'], word_orig, word_std)

In [22]:
amz_prd['text_std'] = amz_prd['text_std'].apply(punctuation)

numbers, currency_symbols, emojis

In [23]:
amz_prd['text_std'] = amz_prd['text_std'].\
                        apply(urls).\
                        apply(numbers).\
                        apply(currency_symbols).\
                        apply(emojis)
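The cleaning steps can be sketched on a single string with the standard `re` module. This is an illustration under stated assumptions, not the textacy implementation: the URL pattern is deliberately simplified, `clean_review` is a hypothetical helper, and URLs are replaced before punctuation is stripped, since stripping punctuation first may split a URL into tokens such as `www` and `com`.

```python
import re

def clean_review(text: str) -> str:
    """Illustrative cleaner; the patterns are simplified assumptions."""
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)  # drop HTML line breaks
    text = re.sub(r"https?://\S+", "_URL_", text)                # replace URLs while they are still intact
    text = re.sub(r'[.!?\-"\\&,]', " ", text)                    # turn punctuation into spaces
    return re.sub(r"\s+", " ", text).strip()                     # collapse leftover whitespace

cleaned = clean_review(
    "Great oil!<br /><br />See http://www.amazon.com/gp/product/X - loved it."
)
print(cleaned)
```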

#### stemming

- Q: stemming versus lemmatization?
- A: after lemmatization, the TF-IDF vectorizer still treated `oatmeal cookie`, `oatmeal cookies`, and `oatmeal cooky` as three different tokens, so we chose stemming, which collapses them into one.

In [24]:
def stemming_sentence(sentence):
    stemmed_sentence = []
    for word in sentence.split(' '):      
        stemmed_sentence.append(stemmer.stem(word))
    return " ".join(stemmed_sentence)

In [25]:
stemmer = PorterStemmer()
for i in tqdm(amz_prd.index):
    amz_prd.loc[i, 'text_std'] = stemming_sentence(amz_prd.loc[i, 'text_std'])

100%|██████████| 3291/3291 [00:03<00:00, 882.91it/s] 


#### more stopwords

Added after inspecting preliminary vectorization results:

In [26]:
stpw |= set(['_cur_', '_number_', 'star', 'stars', 'able', 'actually', 'ago', 'already',
            'although', 'another', 'anyone', 'www', 'com', 'gp'])

Add the stemmed forms of stopwords:

In [27]:
stpw |= set(['thi', 'absolut', 'wa', 'ver'])

### text analysis

#### n-gram vectorize

- Q: what n for your n-grams?
- A: we use `ngram_range=(2,3)`. The reviews are fairly long, and two- and three-word phrases (for example `oatmeal cooki` or `dog love greeni`) capture both the product and the customer's opinion better than single words do.
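To make the choice concrete, here is a minimal sketch of what a `(2,3)` n-gram range produces for one stemmed review fragment. The `ngrams` helper is hypothetical and only mimics the idea; the vectorizer additionally applies stopword removal and `min_df` filtering.

```python
def ngrams(tokens, n_min=2, n_max=3):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

grams = ngrams("love soft bake oatmeal cooki".split())
for g in grams:
    print(g)
```

Five tokens yield four bigrams and three trigrams, which is why the feature space grows quickly once `min_df` is removed.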

In [28]:
vectorizer = CountVectorizer(ngram_range=(2,3), stop_words=list(stpw), binary=True, min_df=0.01)  # list, not set: newer scikit-learn versions validate the type
X = vectorizer.fit_transform(amz_prd['text_std'])

vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")
vectorized_df.head()

Shape of dataframe is (3291, 125)
Total number of occurrences: 8541


Unnamed: 0,bake cooki,bake oatmeal,bake oatmeal cooki,best price,coconut flavor,coconut oil,coconut oil ha,coconut oil use,coconut tast,cook oil,...,use oil,use skin,veri good,veri happi,veri soft,veri tasti,virgin coconut,virgin coconut oil,work great,year old
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
for i in vectorized_df.columns:
    print(i)

bake cooki
bake oatmeal
bake oatmeal cooki
best price
coconut flavor
coconut oil
coconut oil ha
coconut oil use
coconut tast
cook oil
cook use
cooki delici
cooki good
cooki great
cooki influenst
cooki raisin
cooki soft
cooki tast
cooki veri
definit buy
dog love
dog love greeni
dog teeth
dri skin
everi day
extra virgin
extra virgin coconut
feel like
first time
get one
give one
give tri
go buy
good price
good tast
great price
great snack
great tast
hair skin
health benefit
help keep
hi teeth
highli recommend
href http
individu wrap
influenst mom
kid love
like coconut
look forward
love coconut
love coconut oil
love cooki
love greeni
love love
love oatmeal
love soft
love tast
love use
mani use
mom voxbox
nutiva organ
oatmeal cooki
oatmeal raisin
oatmeal raisin cooki
oil cook
oil great
oil ha
oil use
oliv oil
one day
organ coconut
organ coconut oil
organ extra
organ extra virgin
pet store
pleasantli surpris
pop chip
potato chip
quaker soft
quaker soft bake
raisin cooki
realli enjoy
realli g

#### TF-IDF report

##### poor review features
the features your analysis showed that customers cited as reasons for a poor review

In [57]:
rev_poor = amz_prd.loc[amz_prd['score']<3]

In [66]:
tfidf_vec = TfidfVectorizer(ngram_range=(2,3), stop_words=list(stpw), binary=True, min_df=0.01)

X = tfidf_vec.fit_transform(rev_poor['text_std'])
terms = tfidf_vec.get_feature_names_out()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

##### most common issues of customer dissatisfaction
the most common issues identified from your analysis that generated customer dissatisfaction.

*You should explain what the pain points are and cite specific examples/action items from the corpus for management to consider*

In [67]:
score = tf_idf.sum(axis=1)
score = pd.DataFrame(score, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)

In [68]:
score.head(15)

Unnamed: 0,score
potato chip,4.260429
coconut oil,4.19757
salt vinegar,4.025182
veri disappoint,2.30344
salt pepper,2.123748
pack materi,2.051615
oatmeal cooki,2.016355
greeni dog,1.983555
tri one,1.940954
tast like,1.908749


> most common issues of customer dissatisfaction

**coconut oil**:
Reviews such as
`This is not cocnut oil at all. It is more like dalda(vanaspati)in cosistency and has no coconut flavor/smell at all.`,
`It's not USDA approved and the scent smelled artifical. The texture was heavier then most unrefined coconut oil.`, and
`there is cocunut oil all over the 2 jars, bubble wrap and all over the box!`
show that product quality and packaging drive the dissatisfaction. Management should consider sourcing from reputable brands and paying extra attention to how oil-based products are packaged for shipping.
 
**taste like**:
Reviews such as `They really do taste like dissolving cardboard`, `made the weak coffee taste like apple peelings`,
`it does not taste like Cappuccino at all`, and `It does not really taste like a cup of cappuchino` show that taste is a key driver of customer dissatisfaction. Management should therefore raise the quality bar for these products, especially for taste.

##### good review features
the features your analysis showed that customers cited as reasons for a good review (i.e. 4/5)

In [34]:
rev_good = amz_prd.loc[amz_prd['score']>=4]

In [35]:
tfidf_vec = TfidfVectorizer(ngram_range=(2,3), stop_words=list(stpw), binary=True, min_df=0.01)

X = tfidf_vec.fit_transform(rev_good['text_std'])
terms = tfidf_vec.get_feature_names_out()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

In [36]:
score = tf_idf.sum(axis=1)
score = pd.DataFrame(score, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
score.head(15)

Unnamed: 0,score
coconut oil,229.715145
dog love,99.742412
oatmeal cooki,73.247019
potato chip,68.680427
highli recommend,65.091544
tast like,62.16767
love cooki,58.59507
use cook,57.601552
use hair,56.038366
use coconut,55.808612


> pros and cons
- Q: Explain to what degree the TF-IDF findings make sense - what are its limitations?
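- A: TF-IDF quantifies how important a term is to a document relative to the rest of the corpus, which helps extract the most descriptive terms. The findings are plausible in that they surface product names and recognizable complaints (`veri disappoint`, `pack materi`). However, TF-IDF carries no semantic meaning: it treats semantically similar words as unrelated terms. In addition, high-scoring n-grams such as `coconut oil`, `potato chip`, and `tast like` appear in both the poor and the good reviews, so the original reviews still have to be read to interpret them.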
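The semantic limitation can be illustrated with a toy corpus. The sketch below computes TF-IDF by hand, assuming the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalization (the scikit-learn defaults); the three documents are invented for illustration and do not come from the dataset.

```python
import math
from collections import Counter

# Toy stemmed documents (invented for illustration)
docs = [["cooki", "tasti"], ["cooki", "delici"], ["oil", "greasi"]]
vocab = sorted({t for d in docs for t in d})
n = len(docs)
df = Counter(t for d in docs for t in set(d))                  # document frequency per term
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}  # smoothed IDF

def tfidf(doc):
    tf = Counter(doc)
    vec = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]                             # L2-normalized row

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))                    # rows are already unit length

v0, v1, v2 = (tfidf(d) for d in docs)
print(round(cosine(v0, v1), 3))  # overlap comes only from the shared term "cooki"
print(cosine(v0, v2))            # no shared terms
```

A reader would call `tasti` and `delici` near-synonyms, but they occupy separate dimensions, so they contribute nothing to the similarity between the first two documents.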

## Similarity and Word Embeddings (2 pts)

Using
* `TfIdfVectorizer`

Identify the most similar pair of reviews from the `amazon-fine-foods.csv` dataset using both **Euclidean distance** and **cosine similarity**.

In [37]:
def find_most_similar_pair(data: pd.DataFrame, metric='euclidean_distance') -> pd.DataFrame:
    # `data` is a document-term DataFrame indexed by review id
    if metric == 'euclidean_distance':
        metric_df = pd.DataFrame(euclidean_distances(data),
                                 index=data.index, columns=data.index)
    elif metric == 'cosine_similarity':
        metric_df = pd.DataFrame(cosine_similarity(data),
                                 index=data.index, columns=data.index)
        # Clip floating-point overshoot so that no similarity exceeds 1
        metric_df = metric_df.where(metric_df < 1., 1.)
    else:
        raise ValueError(f"unknown metric: {metric}")

    # Keep the strict upper triangle: each pair appears once, self-pairs are dropped
    metric_df = metric_df.where(np.triu(np.ones(metric_df.shape), k=1).astype(bool))
    optimum = metric_df.min().min() if metric == 'euclidean_distance' else metric_df.max().max()

    pair = metric_df.stack().rename_axis(('doc1_id', 'doc2_id')).reset_index(name='value')
    pair = pair.loc[pair['value'] == optimum].copy()
    # Attach the raw review text for each id (amz_prd is the notebook-level DataFrame)
    text_by_id = amz_prd.set_index('id')['text']
    pair['text1'] = pair['doc1_id'].map(text_by_id)
    pair['text2'] = pair['doc2_id'].map(text_by_id)

    return pair

### original text

Remove restrictions such as `min_df=0.01` to increase the dimensionality of the vectors and thus the reliability of the distance/similarity measures.

In [38]:
tfidf_vec = TfidfVectorizer(ngram_range=(2,3), stop_words=list(stpw), binary=True)

X = tfidf_vec.fit_transform(amz_prd['text'])
terms = tfidf_vec.get_feature_names_out()
tf_idf = pd.DataFrame(X.toarray(), columns=terms, index=amz_prd['id'])
tf_idf.shape

(3291, 166666)

#### Euclidean distance

In [39]:
pair_eucl = find_most_similar_pair(tf_idf, metric='euclidean_distance')
pair_eucl

Unnamed: 0,doc1_id,doc2_id,value,text1,text2
3335967,189401,189415,0.267027,I am happy to say that I found the texture in ...,I received a mini popcorn machine like you wou...


#### cosine similarity

In [40]:
pair_cos = find_most_similar_pair(tf_idf, metric='cosine_similarity')
pair_cos

Unnamed: 0,doc1_id,doc2_id,value,text1,text2
3335967,189401,189415,0.964348,I am happy to say that I found the texture in ...,I received a mini popcorn machine like you wou...


### processed text

In [41]:
tfidf_vec = TfidfVectorizer(ngram_range=(2,3), stop_words=list(stpw), binary=True)

X = tfidf_vec.fit_transform(amz_prd['text_std'])
terms = tfidf_vec.get_feature_names_out()
tf_idf = pd.DataFrame(X.toarray(), columns=terms, index=amz_prd['id'])
tf_idf.shape

(3291, 161232)

#### Euclidean distance

In [42]:
pair_eucl = find_most_similar_pair(tf_idf, metric='euclidean_distance')
pair_eucl

Unnamed: 0,doc1_id,doc2_id,value,text1,text2
3335967,189401,189415,0.24347,I am happy to say that I found the texture in ...,I received a mini popcorn machine like you wou...


#### cosine similarity

In [43]:
pair_cos = find_most_similar_pair(tf_idf, metric='cosine_similarity')
pair_cos

Unnamed: 0,doc1_id,doc2_id,value,text1,text2
3335967,189401,189415,0.970361,I am happy to say that I found the texture in ...,I received a mini popcorn machine like you wou...


In all four evaluations, the following reviews are the most similar pair. They contain nearly the same sentences, with the paragraphs in a different order:
    
    I am happy to say that I found the texture in the jar to be harder then I thought (though it may have been a bit frozen from outside in the box). It definitely smells like coconut, like a mounds bar. I was so concerned about the popcorn tasting like nothing but coconut. I was VERY surprised buy the wonderful taste it gives the popcorn, I can't believe I never realized at the movies that the reason the popcorn taste so good is that bit of coconut flavor to it. It also left my popcorn machine very clean, when the other oil coated the whole machine in a greasy gunk. Combine with Eden Organics Organic popcorn kernels from Amazon for a total organic popcorn. I received a mini popcorn machine like you would find in a theater for Christmas. After going through the sample packs of yellow gunk oil that came with it for popping I searched for a more healthier and natural way to make popcorn. I read a lot about different oils people use, but one thing was always the same, if you want it to taste like at the movies use coconut oil.
    
    ----
    
    I received a mini popcorn machine like you would find in a theater for Christmas.  After going through the sample packs of yellow gunk oil that came with it for popping I searched for a more healthier and natural way to make popcorn.  I read a lot about different oils people use, but one thing was always the same, if you want it to taste like at the movies use coconut oil.<br /><br />I found the texture in the jar to be harder then I thought (though it may have been a bit frozen from outside in the box).  It definitely smells like coconut, like a mounds bar.  I was so concerned about the popcorn tasting like nothing but coconut.  I was VERY surprised buy the wonderful taste it gives the popcorn, I can't believe I never realized at the movies that the reason the popcorn taste so good is that bit of coconut flavor to it.  It also left my popcorn machine very clean, when the other oil coated the whole machine in a greasy gunk. Combine with Eden Organics Organic popcorn kernels from Amazon for a total organic popcorn.
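That both metrics select the same pair is expected rather than a coincidence. `TfidfVectorizer` L2-normalizes each row by default, and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine similarity`, so the pair with the highest similarity is also the pair with the smallest distance. A short check against the values reported above (`euclidean_from_cosine` is a hypothetical helper name):

```python
import math

def euclidean_from_cosine(cos_sim: float) -> float:
    # For unit-length vectors u, v: ||u - v||^2 = 2 - 2 * (u . v)
    return math.sqrt(2.0 - 2.0 * cos_sim)

# Cosine similarities reported above for the original and the processed text
for cos_sim in (0.964348, 0.970361):
    print(round(euclidean_from_cosine(cos_sim), 4))
```

The results agree with the reported Euclidean distances (0.267027 and 0.24347) up to rounding.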

## Naive Bayes (3pts)

You are an NLP data scientist working at Fandango. You observe the following dataset in your review comments:

**Intent to Buy Tickets:**
1.	Love this movie. Can’t wait!
2.	I want to see this movie so bad.
3.	This movie looks amazing.

**No Intent to Buy Tickets:**
1.	Looks bad.
2.	Hard pass to see this bad movie.
3.	So boring!

You can consider the following stopwords for removal: `to`, `this`.

Is the following review an `Intent to Buy` or `No Intent to Buy`? Show your work for each computation.
> This looks so bad.

You'll need to compute:
* Prior
* Likelihood
* Posterior

### handwritten

#### text preprocess

***remove stopwords & stemming***:

**yes $:=$ Intent to Buy Tickets:**
1.  love movie cannot wait
2.  I want see movie so bad
3.  movie look amazing

**no $:=$ No Intent to Buy Tickets:**
1.	look bad
2.	hard pass see bad movie
3.	so boring

**test text:**
look so bad

#### Prior

$$\begin{align}
P(y=yes) = \frac{3}{3+3} = 0.5 \\
P(y=no) = \frac{3}{3+3} = 0.5
\end{align}$$

#### Likelihood

$$\begin{align}
P(X|y=yes) &= P(x=look|y=yes) * P(x=so|y=yes) * P(x=bad|y=yes) \\
& = \frac{1}{3} * \frac{1}{3} * \frac{1}{3} \\
& = \frac{1}{27}
\end{align}$$

$$\begin{align}
P(X|y=no) &= P(x=look|y=no) * P(x=so|y=no) * P(x=bad|y=no) \\
& = \frac{1}{3} * \frac{1}{3} * \frac{2}{3} \\
& = \frac{2}{27}
\end{align}$$

#### Posterior

$$\begin{align}
P(y=yes|X) &\propto P(X|y=yes) * P(y=yes) \\
& = \frac{1}{27} * \frac{1}{2} \\
& = \frac{1}{54}
\end{align}$$

$$\begin{align}
P(y=no|X) &\propto P(X|y=no) * P(y=no) \\
& = \frac{2}{27} * \frac{1}{2} \\
& = \frac{1}{27}
\end{align}$$

As $P(y=no|X) > P(y=yes|X)$, this review is classified as `No Intent to Buy`.

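Dividing by the evidence $P(X) = P(X|y=yes) * P(y=yes) + P(X|y=no) * P(y=no) = \frac{1}{54} + \frac{1}{27} = \frac{1}{18}$ gives the normalized posterior: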

$$\begin{align}
P(y=yes|X) &= \frac{P(X|y=yes) * P(y=yes)}{P(X)} \\
& = \frac{\frac{1}{54}}{\frac{1}{18}} \\
& = \frac{1}{3}
\end{align}$$

$$\begin{align}
P(y=no|X) &= \frac{P(X|y=no) * P(y=no)}{P(X)} \\
& = \frac{\frac{1}{27}}{\frac{1}{18}} \\
& = \frac{2}{3}
\end{align}$$

### coding

In [44]:
def remove(text):
    text=text.lower()
    sw=['to','this']
    words = text.split()
    sentence = []
    for i in words:
        if i in sw:
            continue
        sentence.append(i)
    return ' '.join(sentence)

In [45]:
documents = [
    (remove("Love this movie. Can't wait"), "BUY"),
    (remove("I want to see this movie so bad"), "BUY"),
    (remove("This movie looks amazing"), "BUY"),  
    (remove("Looks bad"), "NOT_BUY"),
    (remove("Hard pass to see this bad movie"), "NOT_BUY"),
    (remove("So boring"), "NOT_BUY")
]
documents

[("love movie. can't wait", 'BUY'),
 ('i want see movie so bad', 'BUY'),
 ('movie looks amazing', 'BUY'),
 ('looks bad', 'NOT_BUY'),
 ('hard pass see bad movie', 'NOT_BUY'),
 ('so boring', 'NOT_BUY')]

In [46]:
corpus = set()

In [47]:
# Build corpus
for document in documents:
    text = document[0]
    class_value = document[1]
    for word in text.split():
        corpus.add(word)

corpus

{'amazing',
 'bad',
 'boring',
 "can't",
 'hard',
 'i',
 'looks',
 'love',
 'movie',
 'movie.',
 'pass',
 'see',
 'so',
 'wait',
 'want'}

In [48]:
conditional_probabilities = pd.DataFrame(index=list(corpus), 
                                         columns=["likelihood_given_buy", "likelihood_given_not_buy"])

In [49]:
buy_documents = 0
not_buy_documents = 0
for document in documents:
    if document[1] == "BUY":
        buy_documents += 1
    else:
        not_buy_documents += 1

In [50]:
p_buy = buy_documents / (buy_documents + not_buy_documents)
p_not_buy = not_buy_documents / (buy_documents + not_buy_documents)
p_buy, p_not_buy

(0.5, 0.5)

In [51]:
for word in corpus:
    buy_documents_with_word = 0
    not_buy_documents_with_word = 0
    
    for document in documents:
        document_class = document[1]
        if word in document[0].split():
            if document[1] == "BUY":
                buy_documents_with_word += 1
            else:
                not_buy_documents_with_word += 1
    
    conditional_probabilities.loc[word, "likelihood_given_buy"] = buy_documents_with_word * 1.0 / buy_documents
    conditional_probabilities.loc[word, "likelihood_given_not_buy"] = not_buy_documents_with_word * 1.0 / not_buy_documents

In [52]:
test_document = remove("This looks so bad")

In [53]:
def get_likelihood(test_document, conditional_probabilities):
    likelihood_buy = 1
    likelihood_not_buy = 1
    for word in test_document.split():
        likelihood_buy = likelihood_buy * conditional_probabilities.loc[word, "likelihood_given_buy"]
        likelihood_not_buy = likelihood_not_buy * conditional_probabilities.loc[word, "likelihood_given_not_buy"]
    
    return likelihood_buy, likelihood_not_buy

In [54]:
likelihood_buy, likelihood_not_buy = get_likelihood(test_document, conditional_probabilities)
likelihood_buy, likelihood_not_buy

(0.037037037037037035, 0.07407407407407407)

In [55]:
def get_posterior(likelihood_buy, likelihood_not_buy, p_buy, p_not_buy):
    posterior_buy = likelihood_buy * p_buy / (likelihood_buy * p_buy + likelihood_not_buy * p_not_buy)
    posterior_not_buy = likelihood_not_buy * p_not_buy / (likelihood_buy * p_buy + likelihood_not_buy * p_not_buy)
    return posterior_buy, posterior_not_buy

In [56]:
get_posterior(likelihood_buy, likelihood_not_buy, p_buy, p_not_buy)

(0.3333333333333333, 0.6666666666666666)