## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
import spacy 
nlp = spacy.load("en_core_web_sm")

### Read reviews data

In [4]:
# Load the Samsung.txt dataset
with open("../data/Samsung.txt", "r", encoding="utf-8") as f:
    samsung_reviews = f.read()

In [5]:
len(samsung_reviews.split("\n"))

46355

### The dataset is a text file with one review per line

In [30]:
samsung_reviews.split("\n")[0:4]

["I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn't want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!",
 'nice phone, nice up grade from my pantach revue. Very clean set up and easy set up. never had an android phone but they are fantastic to say the least. perfect size for surfing and social media. great phone samsung',
 'Very pleased',
 'It works good but it goes slow sometimes but its a very good phone I love it']
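
A small aside on the line count: `str.split("\n")` keeps a trailing empty string when the text ends with a newline, so the count above may include an empty "review". The sketch below uses a made-up two-review string (`sample_text` is hypothetical, not the Samsung data) to show the difference and one way to drop blank lines.

```python
# Toy text standing in for the file contents (hypothetical, not the real data)
sample_text = "Very pleased\nGreat phone\n"

by_split = sample_text.split("\n")        # keeps the empty string after the final newline
by_splitlines = sample_text.splitlines()  # no trailing empty string

# Keep only lines with visible content
non_empty = [line for line in by_split if line.strip()]

print(len(by_split), len(by_splitlines), len(non_empty))
```

Whether the real file ends with a newline is not shown here, so treat this as something to check rather than a known problem.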

### Will our hypothesis hold on real-world data? `Product features---POS_NOUN`

In [14]:
review1=samsung_reviews.split("\n")[0]

### Let's parse one review from our dataset with spaCy

In [16]:
review1 = nlp(review1)
print(review1.text)
for token in review1:
    print(f"{token.text} -- {token.pos_}")
print()

I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn't want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!
I -- PRON
feel -- VERB
so -- ADV
LUCKY -- ADJ
to -- PART
have -- AUX
found -- VERB
this -- DET
used -- VERB
( -- PUNCT
phone -- NOUN
to -- ADP
us -- PRON
& -- CCONJ
not -- PART
used -- VERB
hard -- ADV
at -- ADV
all -- ADV
) -- PUNCT
, -- PUNCT
phone -- NOUN
on -- ADP
line -- NOUN
from -- ADP
someone -- PRON
who -- PRON
upgraded -- VERB
and -- CCONJ
sold -- VERB
this -- DET
one -- NOUN
. -- PUNCT
My -- PRON
Son -- PROPN
liked -- VERB
his -- PRON
old -- ADJ
one -- NOUN
that -- PRON
finally -- ADV
fell -- VERB
apart -- ADV
after -- ADP
2.5 -- NUM
+ -- NUM
years -- NOUN
and -- CCONJ
did -- AUX
n't -- PART
want -- VERB
an

#### Real-world data is usually messy: observe the tags assigned to the words `found` and `used`

In [25]:
pos = [word.pos_ for word in review1]
lemma = [word.lemma_ for word in review1]
text = [word.text for word in review1]
df = pd.DataFrame({'Text':text, 'Lemma':lemma, 'Pos':pos})
#print(df['Pos'].value_counts())
df[df['Pos']=='NOUN'].groupby(by='Lemma').agg({'Lemma':'count'}).rename(columns={'Lemma':'count'}).sort_values(by='count', ascending=False)


Lemma,count
phone,3
one,2
honesty,1
line,1
seller,1
upgrade,1
year,1


In [71]:
from tqdm import tqdm

# Collect the lemma of every NOUN token in the first 1000 reviews
reviews = samsung_reviews.split("\n")[:1000]
print(len(reviews))
lemma = []
for review in tqdm(reviews):
    doc = nlp(review)
    for word in doc:
        if word.pos_ == 'NOUN':
            lemma.append(word.lemma_)


1000


100%|██████████| 1000/1000 [00:07<00:00, 133.70it/s]






In [75]:
lemma_ = pd.Series(data=lemma)
print(len(lemma_))
print(lemma_.value_counts()[:5])

6871
phone      1196
battery      90
time         90
screen       87
price        86
dtype: int64


#### It seems that if we extract all the nouns from the reviews and look at the five most frequent lemmatised forms, we can identify what people are talking about

### Let's repeat this experiment on a larger set of reviews
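
One way to make the experiment easy to rerun on any number of reviews is to wrap the counting step in a small helper. The sketch below is an assumption about how that might look, not the notebook's own code: `top_noun_lemmas` and `make_token` are hypothetical names, and the toy tokens only mimic the `.pos_` and `.lemma_` attributes of spaCy tokens so the block runs without a model.

```python
from collections import Counter
from types import SimpleNamespace


def top_noun_lemmas(docs, k=5):
    """Return the k most frequent noun lemmas across an iterable of parsed docs.

    Each doc is assumed to be an iterable of tokens exposing `.pos_` and
    `.lemma_`, which is what a spaCy Doc provides.
    """
    counts = Counter()
    for doc in docs:
        counts.update(token.lemma_ for token in doc if token.pos_ == "NOUN")
    return counts.most_common(k)


# Stand-in tokens so the sketch runs without a spaCy model (hypothetical data)
def make_token(lemma, pos):
    return SimpleNamespace(lemma_=lemma, pos_=pos)


toy_docs = [
    [make_token("phone", "NOUN"), make_token("be", "AUX"), make_token("great", "ADJ")],
    [make_token("battery", "NOUN"), make_token("phone", "NOUN")],
]
print(top_noun_lemmas(toy_docs, k=2))
```

With the notebook's objects, the same helper could be called as `top_noun_lemmas((nlp(r) for r in samsung_reviews.split("\n")[:5000]), k=5)`, where the slice size is whatever sample you want to try.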

### Let's add a way to keep track of time
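
Besides the `tqdm` progress bar, a plain wall-clock measurement is a simple option. This is a minimal sketch using `time.perf_counter` from the standard library. The helper `timed` and the stand-in workload `count_words` are hypothetical names introduced for illustration.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical stand-in for the review-processing loop
def count_words(texts):
    return sum(len(t.split()) for t in texts)


total, seconds = timed(count_words, ["Very pleased", "great phone samsung"])
print(f"{total} words in {seconds:.6f} s")
```

In the notebook, the function passed to `timed` would be whatever wraps the `nlp(review)` loop.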

### Did you notice anything? How long do you think it will take to process all the reviews?
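
A rough answer can be extrapolated from the numbers already shown: the progress bar reports about 133.70 reviews per second, and the file has 46355 lines. The sketch below assumes throughput stays roughly constant over the whole file, which may not hold if later reviews are much longer or shorter.

```python
# Figures taken from the tqdm output and the line count above
measured_rate = 133.70   # reviews per second on the first 1000 reviews
total_reviews = 46355    # number of lines in the file

# Assumption: throughput stays roughly constant across the whole file
estimated_seconds = total_reviews / measured_rate
print(f"~{estimated_seconds:.0f} s (~{estimated_seconds / 60:.1f} min)")
```

Under that assumption, a full pass takes somewhere close to six minutes on the same machine.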

## Summary
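- The POS-tag-based rule (product features are nouns) seems to work well on the first 1000 reviews
- Processing is slow (about 7 seconds for 1000 reviews), so we need a way to reduce the time taken to process all the reviews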
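
One possible direction for the second point, sketched under assumptions: spaCy's `Language.pipe` processes texts in batches and accepts a `disable` list, so components the noun rule does not need can be skipped. The helper name `iter_docs`, the batch size, and the choice of `"parser"` and `"ner"` as the components to skip are assumptions about a spaCy v3 `en_core_web_sm` pipeline. No speed-up has been measured here.

```python
def iter_docs(nlp, texts, batch_size=256):
    """Return an iterator of parsed docs, processed in batches.

    `nlp` is assumed to be a loaded spaCy v3 pipeline such as en_core_web_sm.
    The component names "parser" and "ner" are assumptions about that pipeline;
    POS tags and lemmas do not depend on them in the default English models.
    """
    return nlp.pipe(texts, batch_size=batch_size, disable=["parser", "ner"])


# Usage with the notebook's objects (not executed here):
# for doc in iter_docs(nlp, samsung_reviews.split("\n")):
#     nouns = [token.lemma_ for token in doc if token.pos_ == "NOUN"]
```

Timing this against the plain `nlp(review)` loop on the same 1000 reviews would show whether the change is worthwhile.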