## Take the first 10000 review texts. Perform only these steps as part of pre-processing: lowercasing and removing punctuation. Compute IDF of all words in these reviews. Report the top 20 words and bottom 20 words, based on IDF, with their IDF scores.

In [11]:
import pandas as pd
import numpy as np
import string
import re
pd.set_option('display.max_colwidth', 200)
np.random.seed(22)

In [12]:
df=pd.read_json("dataset.txt", lines=True)
# .copy() avoids SettingWithCopyWarning when columns of df1 are reassigned below
df1=df[["reviewText","overall"]].head(10000).copy()

## Preprocessing: Lowercasing

In [13]:
df1['reviewText'] = df1['reviewText'].str.lower()
df1

index,reviewText,overall
0,they look good and stick good! i just don't like the rounded shape because i was always bumping it and siri kept popping up and it was irritating. i just won't buy a product like this again,4
1,these stickers work like the review says they do. they stick on great and they stay on the phone. they are super stylish and i can share them with my sister. :),5
2,these are awesome and make my phone look so stylish! i have only used one so far and have had it on for almost a year! can you believe that! one year!! great quality!,5
3,"item arrived in great time and was in perfect condition. however, i ordered these buttons because they were a great deal and included a free screen protector. i never received one. though its not ...",4
4,"awesome! stays on, and looks great. can be used on multiple apple products. especially having nails, it helps to have an elevated key.",5
...,...,...
9995,"what comes in the boxcasemicro usb cableinstructionshow i use the iphonei am constantly surfing the web, playing games, or listening to music which really gives the battery a run for its money. i...",4
9996,"no wonder the price was so low, this battery pack is so bad that with the iphone battery fully charged and this pack fully charged, i cannot get not even 5 hours of use out of the phonei noticed i...",1
9997,"i bought this almost 2 years ago honestly, and it still charges great! i totally recommend this to everyone interested! it is interesting that it took about 4 hours to give a full charge so that i...",5
9998,"i bought this originally for the outrageous price of $90, but thought it would be worth it if it got me through a full day of heavy use on my iphone. the first case worked for about 3 weeks and th...",1


## Preprocessing: Removing Punctuation

In [14]:
translator = str.maketrans("", "", string.punctuation)
df1['reviewText'] = df1['reviewText'].str.translate(translator)
df1

index,reviewText,overall
0,they look good and stick good i just dont like the rounded shape because i was always bumping it and siri kept popping up and it was irritating i just wont buy a product like this again,4
1,these stickers work like the review says they do they stick on great and they stay on the phone they are super stylish and i can share them with my sister,5
2,these are awesome and make my phone look so stylish i have only used one so far and have had it on for almost a year can you believe that one year great quality,5
3,item arrived in great time and was in perfect condition however i ordered these buttons because they were a great deal and included a free screen protector i never received one though its not a bi...,4
4,awesome stays on and looks great can be used on multiple apple products especially having nails it helps to have an elevated key,5
...,...,...
9995,what comes in the boxcasemicro usb cableinstructionshow i use the iphonei am constantly surfing the web playing games or listening to music which really gives the battery a run for its money i am...,4
9996,no wonder the price was so low this battery pack is so bad that with the iphone battery fully charged and this pack fully charged i cannot get not even 5 hours of use out of the phonei noticed it ...,1
9997,i bought this almost 2 years ago honestly and it still charges great i totally recommend this to everyone interested it is interesting that it took about 4 hours to give a full charge so that is s...,5
9998,i bought this originally for the outrageous price of 90 but thought it would be worth it if it got me through a full day of heavy use on my iphone the first case worked for about 3 weeks and then ...,1
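
The two preprocessing steps can be checked on a couple of made-up strings before trusting them on the full column. This is a minimal sketch; the `sample` reviews are hypothetical and not taken from the dataset.

```python
import string

import pandas as pd

# Hypothetical sample reviews, not taken from the dataset.
sample = pd.Series(["They look good, and STICK good!", "Don't buy... it's bad."])

# Translation table that deletes every ASCII punctuation character.
translator = str.maketrans("", "", string.punctuation)

# Lowercase first, then strip punctuation, mirroring the two cells above.
cleaned = sample.str.lower().str.translate(translator)
print(cleaned.tolist())
```

Because punctuation is deleted rather than replaced with a space, contractions collapse ("don't" becomes "dont") and words separated only by punctuation would be fused into one token.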


## Computing IDF for all the words in reviewText
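
For reference, the IDF computed in this notebook uses a base-10 logarithm with no smoothing, where $N$ is the number of reviews and $\mathrm{df}(t)$ is the number of reviews that contain token $t$ at least once:

$$
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}
$$

With $N = 10000$, a token that appears in exactly one review gets $\log_{10}(10000/1) = 4.0$, the largest possible value, and a token that appears in every review gets $\log_{10}(1) = 0$.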

### Tokenization

In [15]:
# Split each review into word tokens (Series.apply instead of the deprecated DataFrame.applymap)
df1['tokens']=df1['reviewText'].apply(lambda x:re.findall(r'\w+',x))
df2=df1.reset_index()
df2=df2[["index","tokens"]]

In [16]:
df3=df2.explode("tokens")

### Calculating the document frequency of each token

In [17]:
df_idf=df3.groupby(["tokens"]).agg(df=('index','nunique'))    
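
The explode-then-`nunique` pattern counts documents, not occurrences. The sketch below shows this on a hypothetical three-document corpus (the `docs` frame and `toy_idf` name are illustrative only): "great" occurs twice in the first document but still contributes a single document to its frequency.

```python
import numpy as np
import pandas as pd

# Hypothetical corpus, already lowercased and stripped of punctuation.
docs = pd.DataFrame(
    {"reviewText": ["great phone great case", "great battery", "bad battery"]}
)

# One row per (document index, token) pair.
tokens = (
    docs["reviewText"].str.findall(r"\w+").explode().rename("tokens").reset_index()
)

# Document frequency: number of distinct documents that contain each token.
toy_idf = tokens.groupby("tokens").agg(df=("index", "nunique"))

# Base-10 IDF with N equal to the number of documents.
toy_idf["idf"] = np.log10(len(docs) / toy_idf["df"])
print(toy_idf.sort_values("idf", ascending=False))
```

Using `count` instead of `nunique` would give "great" a frequency of 3 here, which is a term count rather than a document frequency.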

### Top 20 words with respect to IDF

In [21]:
df_idf["N"]=len(df1)  # number of reviews (documents); 10000 here
df_idf["idf"]=np.log10(df_idf["N"]/df_idf["df"])
df_idf[["idf"]].sort_values("idf",ascending=False).head(20)

tokens,idf
zx80,4.0
summarizes,4.0
itplus,4.0
itperhaps,4.0
itpairing,4.0
itoy,4.0
summarized,4.0
itouches,4.0
itouch4,4.0
summarizepros,4.0
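
Every token listed above has an IDF of 4.0, meaning each appears in exactly one review, so which tied tokens make the cut depends on sort order rather than on any difference in score. A minimal sketch, assuming hypothetical scores in `idf`, of counting the ties and breaking them alphabetically so the listing is reproducible:

```python
import pandas as pd

# Hypothetical IDF scores; three tokens tie at the maximum.
idf = pd.Series(
    {"zx80": 4.0, "itoy": 4.0, "summarized": 4.0, "flimsy": 3.0, "phone": 0.41}
)

# How many tokens share the top score?
max_idf = idf.max()
n_tied = int((idf == max_idf).sum())
print(f"{n_tied} tokens share the maximum IDF of {max_idf}")

# Sort by IDF (descending), then by token (ascending) to break ties deterministically.
ranked = (
    idf.rename_axis("tokens")
    .reset_index(name="idf")
    .sort_values(["idf", "tokens"], ascending=[False, True])
)
print(ranked.head(3))
```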


### Bottom 20 words with respect to IDF

In [10]:
df_idf[["idf"]].sort_values("idf",ascending=False).tail(20)

tokens,idf
you,0.446967
phone,0.407379
not,0.380594
have,0.374688
on,0.354381
but,0.348819
that,0.334044
with,0.318126
in,0.315334
of,0.292174
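
An IDF score can be turned back into a document count by inverting the formula, which makes the low scores easier to interpret. The sketch below does this for "of", taking the reported score and N = 10000 from the notebook; the variable names are illustrative.

```python
# Values read off the output above; N = 10000 as set in the notebook.
N = 10000
idf_of = 0.292174  # reported IDF of "of"

# Invert idf = log10(N / df) to obtain df = N / 10**idf.
df_of = N / 10 ** idf_of
print(round(df_of))
```

Under these assumptions "of" appears in roughly 5,100 of the 10,000 reviews, so even the most common word is absent from nearly half of them.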
