# Problem 1. Mushroom Weight and Height

In the class we build a naive bayes classifier which classify whether a mushroom is poisonous or edible. In the class, all the feature are of categorical type; it has a discrete number of outcome as opposed to real number.

In this problem we want to explore two ways to deal with real number features for Naive Bayes classifier.

## The Data
The data given to you is very similar to mushroom data. It has three features: cap-color(with the same dictionary we did in class), mushroom-weight(made up by me), mushroom-height(also made up by me).

The data is given in
`mushrooms_homework_test.csv`
and
`mushrooms_homework_train.csv`

## Task 1.1) Simplest way. Just bin it.

The simplest way for dealing with continuous value feature is to discretize it. The simplest way to discretize is just to bin it. For example if your data looks like

`data = (0.9, 1.1, 1.2, 2.1, 2.2, 4.2, 5.3, 5.1)`

We count bin it with bin edges of 

`bins = (1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5)`

the bin number of a data point $x$ is the index $i$ where `bins[i-1] < x < bins[i]`. If $x$ is less than the minimum of bin edge then the bin number is 0. If $x$ is more than the maximum of bin edge then it's `len(bins)`.

Ex the data points above would be turned into
`binno = (0, 1, 1, 3, 3, 7, 9, 9)`

Once we discretize the value we can just use Bayes Classifier we did in class.

**Your task is to build a naive bayes classifier with binned height and binned weight. Pick appropriate bin edges somehow.
Then test your algorithm on the test data set. Report how many you got right and wrong in test data**


In [28]:
import pandas as pd
import numpy as np

In [29]:
df = pd.read_csv('data/mushrooms_homework_train.csv')
df

Unnamed: 0,xclass,cap_color,weight,height
0,e,y,6.122458,7.143689
1,e,w,4.709259,7.398728
2,p,w,2.341551,4.733059
3,e,g,3.954025,4.040427
4,e,y,3.456429,6.422466
...,...,...,...,...
6502,p,n,3.103014,3.211495
6503,e,n,5.162033,6.841161
6504,e,n,4.754516,5.347342
6505,e,n,3.272145,8.031833


In [113]:
def make_bin_column(col_data):
    bins = np.linspace(np.min(col_data), np.max(col_data), 3)
    return np.digitize(col_data, bins=bins)

df['dheight'] = make_bin_column(df.height)
df['dweight'] = make_bin_column(df.weight)

In [114]:
features = df[['cap_color', 'dweight', 'dheight']]
xclasses = df['xclass']

In [115]:
from typing import Dict
def max_value_key(d: Dict[str, float])->str:
    best_key = None
    best_value = 0.
    first = True
    for k, v in d.items():
        if first or v>best_value:
            best_key = k
            best_value = v
            first=False
    return best_key
    

In [116]:
from dataclasses import dataclass

@dataclass(frozen=True)
class ProbKey:
    fname: str
    value: str
    cls: str
    

class NaiveBayes:
    def __init__(self):
        # self.prob_dict (feature_name, value, classname) -> 
        self.prob_dict = {}
        self.fnames = []
        self.all_classes = []
        self.prior = {}
        
    def _cal_prob(self, features, xclasses, fname, value, cls):
        value_mask = features[fname] == value
        cls_mask = xclasses == cls
        n_right = np.sum(value_mask & cls_mask)
        n_cls = sum(cls_mask)
        prob = n_right / n_cls
        return prob 
        
    
    def train(self, features, xclasses):
        all_classes = xclasses.unique()
        self.fnames = features.columns
        self.all_classes = all_classes
        prob_dict = {}
        self.prior = {cls:sum(xclasses==cls)/len(xclasses) for cls in all_classes}
        
        for fname in features.columns:
            for value in features[fname].unique():
                for cls in all_classes:
                    key = ProbKey(fname=fname, value=value, cls=cls)
                    prob = self._cal_prob(features, xclasses, fname, value, cls)
                    prob_dict[key] = prob
        self.prob_dict = prob_dict
    
    def predict_top_one(self, data, cls: str):
        p = self.prior[cls]
        for fname in self.fnames:
            value = getattr(data, fname)
            key = ProbKey(fname, value, cls)
            this_p = self.prob_dict[key]
            p *= this_p
        return p #just the prior*Prod P(x|cls)
            
    def predict_prob(self, data):
        top_dict = {cls: self.predict_top_one(data, cls) for cls in self.all_classes}
        evidence = sum(v for k,v in top_dict.items())
        return {cls: v/evidence for cls, v in top_dict.items()}
    
    def predict_class(self, data):
        #return cls with highest prob
        probs = self.predict_prob(data)
        best_cls = max_value_key(probs)
        return best_cls


In [117]:
nb = NaiveBayes()
nb.train(features, xclasses)
# for data, xclass in zip(features.itertuples(), xclasses):
#     print(nb.predict_prob(data), xclass)

In [118]:
test_df = pd.read_csv('data/mushrooms_homework_test.csv')
test_df['dheight'] = make_bin_column(test_df.height)
test_df['dweight'] = make_bin_column(test_df.weight)

test_features = test_df[['cap_color', 'dweight', 'dheight']]
test_xclasses = test_df['xclass']

In [119]:
correct = 0
total = 0
for data, xclass in zip(features.itertuples(), xclasses):
    if nb.predict_class(data) == xclass:
        correct+=1
    total += 1
print(correct, total, correct/total)

4449 6507 0.6837252189949286


## Task 1.2) Gaussian Naive Bayes.

We could assume a certain probability distribution function(pdf) for the conditional probability. One popular choice is normal distrubution/gaussian distribution.

Let us understand this assumption intuitively. The idea is that the *weight* of *poisonous* mushroom is normally distributed around some mean with some width while the *weight* for edible mushroom is hopefully normally distributed around some other mean.

Let us say that the weight of poisonous mushroom is normally distributed around $2\pm1$ gram(I made up this number)  while the edible mushroom is normally distributed at $5\pm 1$ gram. If we found a mushroom of 2.5 gram. It's probably the poisonous one since edible one should weight around 5 gram.

<img src="gaussian.png" width="400"/>

Concretely, we assume that the probability distribution of feature $X$ given that it is of class $y$ to be

$$
\displaystyle
pdf(X=x|y) = \frac{1}{\sqrt{2\pi}} \exp\left[\frac{-(x-\mu_y)^2}{2\sigma_y^2}\right]
$$

 - $\mu_y$ is the mean of feature $X$ given that it is of class $y$. Ex: mean weight($X$) of poisonous mushroom (eg: 2 gram).

 - $\sigma_y$ is the standard deviation of feature $X$ given that it is of class $y$. Ex std. dev. of weight of poisonous mushroom(eg: \pm 1 gram)
 
Recall the relatio between pdf and probability.
$$
    P(X \in (x, x+\delta x)|y) = \int^{x+\delta x}_x pdf(X=x) \;\text{d}x
$$

for small enough $\delta x$ it becomes
$$
    P(X \in (x, x+\delta x)|y) = pdf(X=x) \delta x
$$

Now here is the important part. From the above $P(X=x|y)$ and $P(X=x|\sim y)$ has a factor of $\delta x$(I drop of the range for brevity). This means that the factor of $\delta x$ appear in both the numerator and denominator of the formula we use for calculating probability. Thus the $\delta x$ cancels out nicely. This means that **we can just use pdf(X=x|y) as P(X=x|y)** in naive bayes formula we got and every thing will just work out. We don't have to worry about the $\delta x$ part

**Your task: build a classifier similar to what you did in 1.1. Except now you use gaussian distribution assumption for the continous features. Measure your performance against the test data**

In [120]:
from typing import List
def gaussian(x, mu, sigma):
    return 1/np.sqrt(2*np.pi)*np.exp(-(x-mu)**2/2/sigma**2)

In [121]:
@dataclass(frozen=True)
class GaussianParam:
    mu: float
    sigma: float

@dataclass(frozen=True)
class GaussianKey:
    fname: str
    cls: str
    
class NaiveBayesGaussian:
    def __init__(self):
        # self.prob_dict (feature_name, value, classname) ->
        self.prob_dict = {}
        self.all_classes = []
        self.prior = {}
        self.categorical_features = []  # list of categorical features
        self.continuous_features = []  # list of continuous features
        self.gaussian_param = {}  # dict from continuous featurename -> Gaussian param

    def _cal_prob(self, features, xclasses, fname, value, cls):
        value_mask = features[fname] == value
        cls_mask = xclasses == cls
        n_right = np.sum(value_mask & cls_mask)
        n_cls = sum(cls_mask)
        prob = n_right / n_cls
        return prob

    def train(self, features, xclasses, continuous_features: List[str] = None):
        if continuous_features is None:
            continuous_features = []

        self.categorical_features = [
            col for col in features.columns if col not in continuous_features]
        self.continuous_features = continuous_features
        all_classes = xclasses.unique()
        self.all_classes = all_classes
        prob_dict = {}
        self.prior = {cls: sum(xclasses == cls)/len(xclasses)
                      for cls in all_classes}

        for fname in self.categorical_features:
            for value in features[fname].unique():
                for cls in all_classes:
                    key = ProbKey(fname=fname, value=value, cls=cls)
                    prob = self._cal_prob(
                        features, xclasses, fname, value, cls)
                    prob_dict[key] = prob
        self.prob_dict = prob_dict

        # for continuous one we use the gaussian
        def get_gaussian_features(col, cls):
            feature = features[col][xclasses==cls]
            return GaussianParam(mu=np.mean(feature), sigma=np.std(feature))
        self.gaussian_param = {GaussianKey(fname=col, cls=cls):get_gaussian_features(col, cls) 
                               for col in self.continuous_features
                               for cls in self.all_classes}
        

    def predict_top_one(self, data, cls: str):
        p = self.prior[cls]
        for fname in self.categorical_features:
            value = getattr(data, fname)
            key = ProbKey(fname, value, cls)
            this_p = self.prob_dict[key]
            p *= this_p
        
        for fname in self.continuous_features:
            gp = self.gaussian_param[GaussianKey(fname=fname, cls=cls)]
            value = getattr(data, fname)
            p *= gaussian(value, gp.mu, gp.sigma)
        return p  # just the prior*Prod P(x|cls)

    def predict_prob(self, data):
        top_dict = {cls: self.predict_top_one(
            data, cls) for cls in self.all_classes}
        evidence = sum(v for k, v in top_dict.items())
        return {cls: v/evidence for cls, v in top_dict.items()}

    def predict_class(self, data):
        # return cls with highest prob
        probs = self.predict_prob(data)
        best_cls = max_value_key(probs)
        return best_cls

In [122]:
df = pd.read_csv('data/mushrooms_homework_train.csv')
df

Unnamed: 0,xclass,cap_color,weight,height
0,e,y,6.122458,7.143689
1,e,w,4.709259,7.398728
2,p,w,2.341551,4.733059
3,e,g,3.954025,4.040427
4,e,y,3.456429,6.422466
...,...,...,...,...
6502,p,n,3.103014,3.211495
6503,e,n,5.162033,6.841161
6504,e,n,4.754516,5.347342
6505,e,n,3.272145,8.031833


In [123]:
features = df[['cap_color', 'weight', 'height']]
xclasses = df.xclass
nb = NaiveBayesGaussian()
nb.train(features, xclasses, continuous_features = ['weight', 'height'])

In [124]:
nb.gaussian_param

{GaussianKey(fname='weight', cls='e'): GaussianParam(mu=4.310016666234858, sigma=1.2925952634849398),
 GaussianKey(fname='weight', cls='p'): GaussianParam(mu=3.2141575420054664, sigma=0.9972401361343372),
 GaussianKey(fname='height', cls='e'): GaussianParam(mu=5.995639636920748, sigma=1.9003781732257048),
 GaussianKey(fname='height', cls='p'): GaussianParam(mu=5.00334707527546, sigma=1.4240001555328947)}

In [125]:
df.weight[df.xclass=='p'].mean()

3.2141575420054664

In [126]:
correct = 0
total = 0
for data, xclass in zip(features.itertuples(), xclasses):
    if nb.predict_class(data) == xclass:
        correct+=1
    total += 1
print(correct, total, correct/total)

4691 6507 0.7209159366835716


# Problem 2. Product Reviews

Naive Bayes is quite powerful given its simplicity. Typically the usefulness of a Machine learning Algorithm is limited only by your imgination on what to ask. If you ask an interesting question, you got a useful system. If you ask a dump question, you got a useless system.

In this problem we will explore a text mining application using Naive Bayes.

The goal of this problem is to make a system that can read customer review and tells whether the customer recommend the product to others or not.

The data is stolen from https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews
I splitted it into train(`clothing_reviews_train.csv`) and test(`clothing_reviews_test.csv`) for you.

The two columns that is relavant to this problem are.
- Recommended IND
- Title
- Review Text

**Do not use any other column**

You could do challenging version (No extra point except for bragging rights)
Use the data from http://jmcauley.ucsd.edu/data/amazon/ and do similar thing --> the book review is hugeeeeee


## Your task
Build a classifier which you can give it a reviewtext and review title and it can split out whether the reviewer recommend the product or not. Measure the performance on test data `clothing_reviews_test.csv`.

## Some Guide for you.

- Be sure to normalize your text. Example [here](https://machinelearningmastery.com/clean-text-machine-learning-python/). This includes
    - lowercase everything
    - clean out stop words
    - get rid of punctuations
    - stem the word
    - Yes you may use nltk only for cleaning up part.
- Be careful as you are multiplying a whole bunch of small numbers together. You are better off adding the log and exponentiate it back.
- Read up lecture notes on spam filtering. Especially on how to deal with missing words. You can read up [old version of excercise 1 from sam](https://github.com/KongsakTi/PatternReg/tree/master/Exercise%201)
- **Do not** get stuck on this alone. Find help/Consult your friends if you are stuck. Collaboration is allowed but you must understand what you send in.

In [140]:
import keyword
df = pd.read_csv('data/clothing_reviews_train.csv')
df.columns = df.columns \
    .str.strip() \
    .str.lower() \
    .str.replace(' ', '_') \
    .str.replace('(', '') \
    .str.replace(')', '') \
    .str.replace('-','_') \
    .map(lambda x: 'x'+x if x in keyword.kwlist else x )
# df.title[df.title.isna()] = ""
# df.review_text[df.review_text.isna()] = ""

In [141]:
df.columns

Index(['unnamed:_0', 'clothing_id', 'age', 'title', 'review_text', 'rating',
       'recommended_ind', 'positive_feedback_count', 'division_name',
       'department_name', 'class_name'],
      dtype='object')

In [165]:
mini_df = df[['title', 'review_text', 'recommended_ind']]
mini_df = mini_df.dropna()

mini_df['fulltext'] = mini_df.title + ' ' + mini_df.review_text

In [247]:
from nltk.tokenize import word_tokenize
# remove punctuation from each word
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))
# stop_words.remove('not')
porter = PorterStemmer()

#txt = ' '.join(mini_df.fulltext)[:2000]
def clean_text(txt: str) -> List[str]:
    tokens = word_tokenize(txt)
    # convert to lower case
    tokens = [w.lower() for w in tokens]

    stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words    
    words = [w for w in words if not w in stop_words]
    stemmed = [porter.stem(word) for word in words]
    return stemmed

In [248]:
mini_df

Unnamed: 0,title,review_text,recommended_ind,fulltext
1,Some major design flaws,I had such high hopes for this dress and reall...,0,Some major design flaws I had such high hopes ...
2,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",1,"My favorite buy! I love, love, love this jumps..."
3,Flattering shirt,This shirt is very flattering to all due to th...,1,Flattering shirt This shirt is very flattering...
4,Not for the very petite,"I love tracy reese dresses, but this one is no...",0,Not for the very petite I love tracy reese dre...
5,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,1,Cagrcoal shimmer fun I aded this in my basket ...
...,...,...,...,...
18839,Entrancing,I'm so impressed with the beautiful color comb...,1,Entrancing I'm so impressed with the beautiful...
18840,What a fun piece!,So i wasn't sure about ordering this skirt bec...,1,What a fun piece! So i wasn't sure about order...
18842,Great dress for many occasions,I was very happy to snag this dress at such a ...,1,Great dress for many occasions I was very happ...
18843,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,1,"Very cute dress, perfect for summer parties an..."


In [249]:
from typing import Tuple
tmp_clean_texts = mini_df.fulltext.map(clean_text)
class TextBasedNaiveBayes:
    def __init__(self):
        self.prob_dict = {} # (word, class) -> log prob
        self.prior = {} # (cls) -> log prob
        self.total = {}
        self.love_count = {}
        self.hate_count = {}
        
    def get_word_counts(self, clean_texts) -> Tuple[Dict[str, int], int]:
        word_counts = {}
        total_count = 0
        for tokens in clean_texts:
            for word in tokens:
                word_counts[word] = word_counts.get(word, 0) + 1
                total_count += 1
        return word_counts, total_count
    
    def train(self, texts: 'pd.Serie[str]', xclasses: 'pd.Serie[int]') -> None:
        clean_texts = texts.map(clean_text)
        recommended_mask = xclasses == 1
        love_texts = clean_texts[recommended_mask]
        hate_texts = clean_texts[~recommended_mask]
        
        love_wcs, total_love = self.get_word_counts(love_texts)
        hate_wcs, total_hate = self.get_word_counts(hate_texts)
        
        self.love_count = love_wcs
        self.hate_count = hate_wcs
        
        self.total['love'] = total_love
        self.total['hate'] = total_hate
        
        for word, count in love_wcs.items():
            self.prob_dict[(word, 'love')] = np.log(count/total_love)
        
        for word, count in hate_wcs.items():
            self.prob_dict[(word, 'hate')] = np.log(count/total_hate)
        
        self.prior['love'] = np.log(sum(recommended_mask) / len(texts))
        self.prior['hate'] = np.log(sum(~recommended_mask) / len(texts))
    
    def predict(self, texts: str):
        clean_texts = clean_text(texts)

        # prior * Prod P(word|love)
        log_love_prob = self.prior['love'] + sum(self.prob_dict.get((word, 'love'), np.log(0.5/self.total['love'])) for word in clean_texts)
        log_hate_prob = self.prior['hate'] + sum(self.prob_dict.get((word, 'hate'), np.log(0.5/self.total['hate'])) for word in clean_texts)

        love_prob = np.exp(log_love_prob)
        hate_prob = np.exp(log_hate_prob)

        return love_prob/(love_prob + hate_prob)

In [250]:
tbnb = TextBasedNaiveBayes()
tbnb.train(mini_df.fulltext, mini_df.recommended_ind)

In [251]:
correct = 0
total = 0

for text, rec in zip(mini_df.fulltext, mini_df.recommended_ind):
    guess = tbnb.predict(text)
    if guess > 0.5 and rec == 1:
        correct+=1
    elif guess <=0.5 and rec==0:
        correct+=1
    total+=1
#     print(tbnb.predict(text), rec, text)
#     print('-'*20)

print(correct, total, correct/total)

14232 15808 0.9003036437246964


In [252]:
import keyword
df = pd.read_csv('data/clothing_reviews_test.csv')
df.columns = df.columns \
    .str.strip() \
    .str.lower() \
    .str.replace(' ', '_') \
    .str.replace('(', '') \
    .str.replace(')', '') \
    .str.replace('-','_') \
    .map(lambda x: 'x'+x if x in keyword.kwlist else x )
# df.title[df.title.isna()] = ""
# df.review_text[df.review_text.isna()] = ""

df.columns

mini_df_test = df[['title', 'review_text', 'recommended_ind']]
mini_df_test = mini_df_test.dropna()

mini_df_test['fulltext'] = mini_df_test.title + ' ' + mini_df_test.review_text

In [253]:
correct = 0
total = 0

for text, rec in zip(mini_df_test.fulltext, mini_df_test.recommended_ind):
    guess = tbnb.predict(text)
    if guess > 0.5 and rec == 1:
        correct+=1
    elif guess <=0.5 and rec==0:
        correct+=1
    total+=1
#     print(tbnb.predict(text), rec, text)
#     print('-'*20)

print(correct, total, correct/total)

3378 3867 0.873545384018619


In [222]:
tbnb.prob_dict

{('favorit', 'love'): -6.654262970637098,
 ('buy', 'love'): -6.255806492683411,
 ('love', 'love'): -3.852924835111331,
 ('jumpsuit', 'love'): -7.681219270398773,
 ('fun', 'love'): -6.288297979791273,
 ('flirti', 'love'): -8.640995114212666,
 ('fabul', 'love'): -7.811715759328141,
 ('everi', 'love'): -6.882179800703966,
 ('time', 'love'): -6.156088464424666,
 ('wear', 'love'): -4.372074967311267,
 ('get', 'love'): -5.6404135065972385,
 ('noth', 'love'): -7.9272286464499855,
 ('great', 'love'): -4.359710049340317,
 ('compliment', 'love'): -6.279093444062202,
 ('flatter', 'love'): -5.04673163647358,
 ('shirt', 'love'): -5.456595718874439,
 ('due', 'love'): -7.670637161068236,
 ('adjust', 'love'): -8.172729104865471,
 ('front', 'love'): -6.2764790634881304,
 ('tie', 'love'): -7.103530701403655,
 ('perfect', 'love'): -4.810242421508748,
 ('length', 'love'): -5.491112160831418,
 ('leg', 'love'): -6.066476305734979,
 ('sleeveless', 'love'): -8.966417514647295,
 ('pair', 'love'): -6.1020212431

In [210]:
def predict(self, texts: str):
    clean_texts = clean_text(texts)
    
    # prior * Prod P(word|love)
    log_love_prob = self.prior['love'] + sum(self.prob_dict.get((word, 'love'), 0.1/self.total['love']) for word in clean_texts)
    log_hate_prob = self.prior['hate'] + sum(self.prob_dict.get((word, 'hate'), 0.1/self.total['hate']) for word in clean_texts)
    
    love_prob = np.exp(log_love_prob)
    hate_prob = np.exp(log_hate_prob)
    
    return love_prob/(love_prob + hate_prob)
    
predict(tbnb, 'I love this shirt.')

0.7060512235965029

In [223]:
clean_texts = mini_df.fulltext.map(clean_text)

In [224]:
recommended_mask = mini_df.recommended_ind == 1
recommend_texts = clean_texts[recommended_mask]
hate_texts = clean_texts[~recommended_mask]

In [243]:
for text in mini_df.fulltext[~recommended_mask][:100]:
    print(text)
    print('-'*20)

Some major design flaws I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
--------------------
Not for the very petite I love tracy reese dresses, but this one is not for the very petite. i am just under 5 feet tall and usually wear a 0p in this brand. this dress was very pretty out of the package but its a lot of dress. the skirt is long and very full so it overwhelmed my small frame. not a stranger to alterations, shortening and narrowing the skirt would take away from the embellishment of the garment. i love the color and the idea of the st

In [228]:
word_counts = {}
total_count = 0
for tokens in hate_texts:
    for word in tokens:
        word_counts[word] = word_counts.get(word, 0) + 1
        total_count += 1

In [230]:
def top_n(wcs, n=10):
    tmp = [(k,v) for k, v in wcs.items()]
    tmp.sort(key=lambda x: x[1], reverse=True)
    return tmp[:n]

top_n(word_counts, 40)

[('look', 1786),
 ('dress', 1679),
 ('nt', 1536),
 ('fit', 1431),
 ('like', 1430),
 ('top', 1267),
 ('size', 1212),
 ('love', 1189),
 ('would', 1047),
 ('fabric', 1043),
 ('color', 832),
 ('back', 754),
 ('order', 728),
 ('wear', 720),
 ('small', 714),
 ('realli', 661),
 ('return', 659),
 ('tri', 597),
 ('much', 558),
 ('materi', 548),
 ('cute', 536),
 ('want', 532),
 ('way', 516),
 ('shirt', 504),
 ('qualiti', 496),
 ('one', 492),
 ('work', 485),
 ('beauti', 480),
 ('could', 469),
 ('disappoint', 469),
 ('also', 462),
 ('great', 458),
 ('even', 450),
 ('go', 439),
 ('larg', 439),
 ('short', 420),
 ('pretti', 407),
 ('nice', 402),
 ('sweater', 390),
 ('flatter', 389)]

In [198]:
word_counts = {}
total_count = 0
for tokens in hate_texts:
    for word in tokens:
        word_counts[word] = word_counts.get(word, 0) + 1
        total_count += 1

def top_n(wcs, n=10):
    tmp = [(k,v) for k, v in wcs.items()]
    tmp.sort(key=lambda x: x[1], reverse=True)
    return tmp[:n]

top_n(word_counts)

[('look', 1786),
 ('dress', 1679),
 ('nt', 1536),
 ('fit', 1431),
 ('like', 1430),
 ('top', 1267),
 ('size', 1212),
 ('love', 1189),
 ('would', 1047),
 ('fabric', 1043)]

In [158]:
txt = ''.join(mini_df.fulltext)[:2000]

In [159]:
txt

"Some major design flawsI had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it cMy favorite buy!I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!Flattering shirtThis shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!Not for the very petiteI love tracy reese dresses, but this one is not for the very petite. i am just under 5 feet tall and usua

In [160]:
from nltk import sent_tokenize
txt = ' '.join(mini_df.fulltext)[:2000]
txt = sent_tokenize(txt)
print(txt)

['Some major design flawsI had such high hopes for this dress and really wanted it to work for me.', 'i initially ordered the petite small (my usual size) but i found this to be outrageously small.', 'so small in fact that i could not zip it up!', 'i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers.', 'imo, a major design flaw was the net over layer sewn directly into the zipper - it cMy favorite buy!I love, love, love this jumpsuit.', "it's fun, flirty, and fabulous!", 'every time i wear it, i get nothing but great compliments!Flattering shirtThis shirt is very flattering to all due to the adjustable front tie.', 'it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan.', 'love this shirt!!', '!Not for the very petiteI love tracy reese dresses, but this one is not for the very petite.', 'i am

In [180]:



print(mini_df.fulltext.loc[1])

print(clean_text(mini_df.fulltext.loc[1]))

Some major design flaws I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
['major', 'design', 'flaw', 'high', 'hope', 'dress', 'realli', 'want', 'work', 'initi', 'order', 'petit', 'small', 'usual', 'size', 'found', 'outrag', 'small', 'small', 'fact', 'could', 'zip', 'reorder', 'petit', 'medium', 'ok', 'overal', 'top', 'half', 'comfort', 'fit', 'nice', 'bottom', 'half', 'tight', 'layer', 'sever', 'somewhat', 'cheap', 'net', 'layer', 'imo', 'major', 'design', 'flaw', 'net', 'layer', 'sewn', 'directli', 'zipper', 'c']


In [184]:
aaa = [1,2,3,4,5,6]
def add_one(x):
    return x+1
list(map(add_one, aaa))

[2, 3, 4, 5, 6, 7]

In [None]:
aaa.sort()