## Importing libraries

In [2]:
pip install spacy


In [1]:
import pandas as pd
import numpy as np
import os
import spacy 
from tqdm import tqdm

### Read reviews data

In [2]:
# Read the whole reviews file; the context manager closes it automatically
with open("../Dataset/Samsung.txt", "r", encoding="utf-8") as con:
    samsung_reviews = con.read()

### Can we reduce the time taken?
[Pipelines (Spacy)](https://spacy.io/usage/processing-pipelines)


<img src='./images/spacy_pipeline.png'>

In [3]:
# Shorten pipeline loading by disabling the components we do not need
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
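
The `E050` error above means the `en_core_web_sm` model package is not installed in the current environment: `pip install spacy` installs the library but not the trained models. The usual fix is to run `python -m spacy download en_core_web_sm` once. The sketch below does the same from Python, assuming network access. The helper name `load_or_download` and its `loader`/`downloader` parameters are hypothetical and exist only for illustration; in some environments the kernel may need a restart after the download before the model can be loaded.

```python
def load_or_download(name, disable=(), loader=None, downloader=None):
    """Load a spaCy pipeline, downloading the model package once if it is missing.

    `loader` and `downloader` are hypothetical hooks (defaulting to
    spacy.load and spacy.cli.download) so the retry logic can be tried
    without spaCy installed.
    """
    if loader is None or downloader is None:
        # Imported lazily so the helper can be defined without spaCy
        import spacy
        from spacy.cli import download

        loader = loader or spacy.load
        downloader = downloader or download

    try:
        return loader(name, disable=list(disable))
    except OSError:
        # E050: the model package is not installed, so fetch it and retry once
        downloader(name)
        return loader(name, disable=list(disable))


# Usage in the notebook (not executed here):
#   nlp = load_or_download('en_core_web_sm', disable=['parser', 'ner'])
```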

In [None]:
nouns = []
for review in tqdm(samsung_reviews.split("\n")[0:1000]):
    doc = nlp(review)
    for tok in doc:
        if tok.pos_=="NOUN":
            nouns.append(tok.lemma_.lower())

100%|██████████| 1000/1000 [00:06<00:00, 148.24it/s]


In [None]:
len(samsung_reviews.split("\n"))

In [None]:
# 1,000 reviews took about 6 s; scale up to all 46,355 reviews (in seconds)
(46355/1000)*6

278.13

In [None]:
# Convert the estimate from seconds to minutes
278/60

4.633333333333334
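
The two cells above extrapolate linearly from the 1,000-review sample to the full file. A minimal sketch of the same arithmetic as a reusable function is shown below. The function name `estimate_total_seconds` is hypothetical, and the estimate assumes the remaining reviews take about as long per review as the sample did.

```python
def estimate_total_seconds(total_items, sample_items, sample_seconds):
    """Linearly extrapolate processing time from a timed sample."""
    return (total_items / sample_items) * sample_seconds


# Figures taken from the cells above: 1,000 reviews in about 6 s, 46,355 in total
total = estimate_total_seconds(46355, 1000, 6)
print(f"{total:.2f} s, about {total / 60:.1f} min")
```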

### Let's process all the reviews now and see if the time taken is less!

In [None]:
nouns = []
for review in tqdm(samsung_reviews.split("\n")):
    doc = nlp(review)
    for tok in doc:
        if tok.pos_=="NOUN":
            nouns.append(tok.lemma_.lower())

100%|██████████| 46355/46355 [04:27<00:00, 173.42it/s]
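
The noun-extraction loop appears twice in this notebook with only the input slice changed. One way to avoid the repetition is a small helper that accepts any iterable of parsed documents; the sketch below uses the hypothetical name `collect_noun_lemmas` and stand-in tokens so it can be tried without a model. The `nlp.pipe` call in the final comment is spaCy's batched alternative to calling `nlp` per review. It is not used in this notebook, `batch_size=256` is an arbitrary value, and any speed-up on this dataset is unmeasured.

```python
from types import SimpleNamespace


def collect_noun_lemmas(docs):
    """Return the lower-cased lemma of every token tagged NOUN across docs."""
    nouns = []
    for doc in docs:
        for tok in doc:
            if tok.pos_ == "NOUN":
                nouns.append(tok.lemma_.lower())
    return nouns


# Stand-in tokens (hypothetical data) that mimic the two attributes the loop reads
fake_doc = [
    SimpleNamespace(pos_="DET", lemma_="the"),
    SimpleNamespace(pos_="NOUN", lemma_="Battery"),
    SimpleNamespace(pos_="VERB", lemma_="last"),
    SimpleNamespace(pos_="NOUN", lemma_="day"),
]
print(collect_noun_lemmas([fake_doc]))

# With the notebook's objects (not executed here), either form could be used:
#   nouns = collect_noun_lemmas(nlp(r) for r in tqdm(samsung_reviews.split("\n")))
#   nouns = collect_noun_lemmas(nlp.pipe(samsung_reviews.split("\n"), batch_size=256))
```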


### Does the hypothesis of nouns capturing `product features` hold?

In [None]:
nouns=pd.Series(nouns)
nouns.value_counts().head(5)

phone      43237
battery     4350
product     3907
time        3825
screen      3746
dtype: int64

In [None]:
nouns.value_counts().head(10)

phone      43237
battery     4350
product     3907
time        3825
screen      3746
card        3399
price       3148
problem     3120
camera      2773
app         2606
dtype: int64

### We now know that people mention `battery`, `product`, `screen`, etc., but we still don't know in what context they mention these keywords

### Summary:
 - The most frequent lemmatised nouns tell us which product features people talk about in their reviews
 - To process the review data faster, spaCy lets us run only part of the model inference pipeline by passing the `disable` parameter to `spacy.load()`