## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import os
import spacy 
from tqdm import tqdm

### Read reviews data

In [2]:
# A context manager closes the file even if reading fails
with open("../Dataset/Samsung.txt", 'r', encoding="utf-8") as con:
    samsung_reviews = con.read()

### Can we reduce the time taken?
[Pipelines (spaCy)](https://spacy.io/usage/processing-pipelines)


<img src='./images/spacy_pipeline.png'>

In [3]:
# Speed up processing by disabling the pipeline components we don't need (parser, NER)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [4]:
nouns = []
for review in tqdm(samsung_reviews.split("\n")[0:1000]):
    doc = nlp(review)
    for tok in doc:
        if tok.pos_=="NOUN":
            nouns.append(tok.lemma_.lower())

100%|██████████| 1000/1000 [00:05<00:00, 188.20it/s]


In [5]:
len(samsung_reviews.split("\n"))

46355

In [6]:
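# Estimated seconds for all reviews, assuming ~6 s per 1,000 reviews (rounded up from the sample run above)
(46355/1000)*6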

278.13

In [7]:
# Convert the estimate from seconds to minutes
278/60

4.633333333333334
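
The extrapolation in the two cells above can be wrapped in a small helper. This is only a sketch: `estimate_total_seconds` is a hypothetical name, and it assumes processing time grows linearly with the number of reviews, which the sample run does not guarantee.

```python
def estimate_total_seconds(sample_seconds, sample_size, total_size):
    """Linearly extrapolate the runtime of a sample to the full dataset."""
    return sample_seconds * total_size / sample_size


# ~6 s for 1,000 reviews, 46,355 reviews in total (figures from the cells above)
est = estimate_total_seconds(6, 1000, 46355)
print(f"{est:.0f} s = {est / 60:.1f} min")  # 278 s = 4.6 min
```

An estimate like this is an upper-bound guess rather than a measurement; throughput can change once the run is past its warm-up phase.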

### Let's process all the reviews now and see if the time taken is less

In [8]:
nouns = []
for review in tqdm(samsung_reviews.split("\n")):
    doc = nlp(review)
    for tok in doc:
        if tok.pos_=="NOUN":
            nouns.append(tok.lemma_.lower())

100%|██████████| 46355/46355 [03:03<00:00, 252.37it/s]
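
A possible further speed-up, not measured in this notebook, is to stream the reviews through `nlp.pipe`, which processes texts in batches instead of one call per review. The sketch below assumes the helper name `noun_lemmas` and the batch size of 256; both are illustrative choices, not values taken from the notebook.

```python
from typing import Iterable, List


def noun_lemmas(nlp, texts: Iterable[str], batch_size: int = 256) -> List[str]:
    """Collect lower-cased lemmas of all NOUN tokens in `texts`.

    `nlp` is expected to be a loaded spaCy pipeline (any object with a
    compatible `.pipe(texts, batch_size=...)` method works).
    """
    lemmas = []
    # nlp.pipe yields one Doc per input text, processing them in batches
    for doc in nlp.pipe(texts, batch_size=batch_size):
        for tok in doc:
            if tok.pos_ == "NOUN":
                lemmas.append(tok.lemma_.lower())
    return lemmas
```

With the pipeline loaded earlier, this would be called as `nouns = noun_lemmas(nlp, samsung_reviews.split("\n"))`. Whether batching actually beats the loop above on this dataset would need to be timed.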


### Does the hypothesis of nouns capturing `product features` hold?

In [9]:
nouns=pd.Series(nouns)
nouns.value_counts().head(5)

phone      43512
battery     4344
product     3958
time        3833
screen      3789
Name: count, dtype: int64

In [10]:
nouns.value_counts().head(10)

phone      43512
battery     4344
product     3958
time        3833
screen      3789
card        3401
price       3156
problem     3141
camera      2934
app         2593
Name: count, dtype: int64
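
The same frequency table can be produced without pandas. A minimal sketch using the standard library's `collections.Counter`; `top_terms` and the sample list are made up for illustration and are not part of the notebook.

```python
from collections import Counter


def top_terms(terms, n=10):
    """Return the n most frequent terms as (term, count) pairs, most frequent first."""
    return Counter(terms).most_common(n)


# Tiny illustrative sample, not the real review data
sample = ["phone", "battery", "phone", "screen", "phone", "battery"]
print(top_terms(sample, n=2))  # [('phone', 3), ('battery', 2)]
```

`pd.Series.value_counts()` remains more convenient when the counts feed further pandas analysis.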

### We now know that people mention `battery`, `product`, `screen`, etc., but we still don't know in what context they mention these keywords

### Summary:
 - The most frequent lemmatised nouns tell us which product features people talk about in the reviews
 - To process the review data faster, spaCy lets us disable the pipeline components we don't need via the `disable` parameter of `spacy.load()`

In [11]:
import datetime
import pytz
print("Current Time in IST:", datetime.datetime.now(pytz.utc).astimezone(pytz.timezone('Asia/Kolkata')).strftime('%Y-%m-%d %H:%M:%S'))

Current Time in IST: 2025-02-12 16:06:25
