# Customer review categorization case study

If you have shopped on Amazon, you have probably used the feature that lets you browse a product's reviews by category. For the iPhone 12, for example, the reviews are grouped into features derived from the review text submitted by users. These features can be ‘battery life’, ‘value for money’, ‘screen size’ and so on.

So, given customer review data for Samsung, build a model that generates the top categories from it.

In [None]:
# import necessary libraries

import pandas as pd
import numpy as np
import os
import spacy
from tqdm import tqdm
import re
import nltk
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# load reviews (the context manager closes the file even if reading fails)

with open('/content/sample_data/Samsung.txt', 'r') as con:
  review_data = con.read()
reviews = review_data.split('\n')
len(reviews)

46355

The given dataset contains 46355 reviews.
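Splitting on `'\n'` counts every line, including any empty ones (for example, after a trailing newline), so the count can include blank entries. A minimal sketch of a stricter split, assuming one review per line; the helper name `split_reviews` and the sample text are made up for illustration:

```python
def split_reviews(raw_text):
    """Split raw text into one review per line, skipping blank lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]

# Made-up sample text with a blank line and a trailing newline.
sample_text = "Great phone\n\nBattery life is poor\n"
sample_reviews = split_reviews(sample_text)

print(len(sample_text.split("\n")))  # naive split also counts the empty entries
print(len(sample_reviews))           # only the non-empty reviews
```

Whether this changes the count for the Samsung file depends on the file itself.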

To identify the product categories, we perform the following two steps:

- Step 1: Identify the most frequent nouns in the reviews, since product features are usually nouns.

- Step 2: Identify the most frequent prefixes and suffixes (the words immediately before and after) of the nouns identified in Step 1.

## Step 1: Identify the most frequent nouns

In [None]:
# load the spaCy pipeline (parser and NER are disabled; only POS tags and lemmas are used)

nlp_model = spacy.load('en_core_web_sm', disable=['parser','ner'])

In [None]:
# identify nouns

nouns = []

for review in tqdm(reviews):
  tokens = nlp_model(review)
  for token in tokens:
    if token.pos_ == 'NOUN':
      nouns.append(token.lemma_.lower())

100%|██████████| 46355/46355 [04:15<00:00, 181.13it/s]


In [None]:
# create result df

categorization_df = pd.DataFrame({'Categories': nouns})

In [None]:
# top 5 product categories

categorization_df.value_counts().head(5)

| Categories | count |
|---|---|
| phone | 43507 |
| battery | 4334 |
| product | 3992 |
| screen | 3838 |
| time | 3810 |
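The same frequency count can also be done without a DataFrame, using `collections.Counter` from the standard library. This is a minimal sketch; `sample_nouns` is a made-up stand-in for the `nouns` list built above:

```python
from collections import Counter

# Hypothetical stand-in for the `nouns` list produced by the spaCy loop.
sample_nouns = ["phone", "battery", "phone", "screen", "battery", "phone"]

noun_counts = Counter(sample_nouns)

# most_common(n) returns the n most frequent items with their counts.
top_two = noun_counts.most_common(2)
print(top_two)  # [('phone', 3), ('battery', 2)]
```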


## Step 2: Identify the top prefixes and suffixes of the categories

To perform this step, we use a regular expression of the form `prefix category suffix`: one word, the category keyword, and one more word, each separated by a single whitespace character.
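To make the `prefix category suffix` expression concrete, here is a small sketch on a made-up sentence. The raw f-string and the `re.escape` call are choices made for this sketch (they keep the backslashes and any special characters in the keyword literal); the sample sentence is invented:

```python
import re

keyword = "battery"

# One word, one whitespace character, the keyword, one whitespace character, one word.
pattern = re.compile(rf"\w+\s{re.escape(keyword)}\s\w+")

sample = "The phone has good battery life but the battery drains fast."
matches = pattern.findall(sample)
print(matches)  # ['good battery life', 'the battery drains']

# The first and last word of each match are the prefix and the suffix.
pairs = [(m.split()[0], m.split()[-1]) for m in matches]
print(pairs)  # [('good', 'life'), ('the', 'drains')]
```

The second match shows why stop words such as "the" need to be filtered out of the prefixes afterwards.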

In [None]:
# identify all prefixes and suffixes

def get_context(keyword):
  # raw f-string keeps \w and \s intact; re.escape treats the keyword literally
  pattern = re.compile(rf'\w+\s{re.escape(keyword)}\s\w+')
  stop_words = set(stopwords.words('english'))  # build once for fast lookups

  prefix = []
  suffix = []
  for txt in re.findall(pattern, review_data):
    l = txt.split()  # \s can match any whitespace, not only a space
    prefix.append(l[0].lower())
    suffix.append(l[-1].lower())

  # drop stop words such as 'the' or 'is'
  prefix = [pre for pre in prefix if pre not in stop_words]
  suffix = [suff for suff in suffix if suff not in stop_words]
  return prefix, suffix

In [None]:
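# implementation #2: returns the top 5 prefixes and suffixes as a DataFrame

def get_context2(review_text, keyword):
    # raw f-string keeps \w and \s intact; re.escape treats the keyword literally
    pattern = re.compile(rf"\w+\s{re.escape(keyword)}\s\w+")
    stop_words = set(stopwords.words('english'))  # build once for fast lookups
    prefixes = []
    suffixes = []
    for p in re.findall(pattern, review_text):
        l = p.split()  # \s can match any whitespace, not only a space
        prefixes.append(l[0].lower())
        suffixes.append(l[-1].lower())
    prefixes = [p for p in prefixes if p not in stop_words]
    suffixes = [s for s in suffixes if s not in stop_words]
    # prefixes and suffixes are ranked independently of each other;
    # the DataFrame below needs both lists to have the same length
    prefixes = pd.Series(prefixes).value_counts().head(5).index
    suffixes = pd.Series(suffixes).value_counts().head(5).index
    return pd.DataFrame({'prefixes': prefixes, 'keyword': [keyword] * len(prefixes), 'suffixes': suffixes})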

In [None]:
# get prefix, suffix for battery keyword

get_context2(review_data, 'battery')

|   | prefixes | keyword | suffixes |
|---|---|---|---|
| 0 | good | battery | life |
| 1 | great | battery | lasts |
| 2 | long | battery | last |
| 3 | new | battery | runs |
| 4 | removable | battery | drains |
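The prefix and suffix columns are ranked independently, so a row such as `good battery life` is not necessarily a phrase that occurs in the reviews. If phrase-level labels such as 'battery life' are wanted, one option is to count the keyword-plus-suffix pairs directly. The sketch below is only an illustration: the helper `top_phrases`, the small hand-written `STOP_WORDS` set (standing in for NLTK's list), and the sample text are all assumptions, not part of the original notebook.

```python
import re
from collections import Counter

# Small stand-in for NLTK's English stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "is", "a", "and", "but", "it"}

def top_phrases(text, keyword, n=3):
    """Count 'keyword suffix' phrases, skipping stop-word suffixes."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\s(\w+)", flags=re.IGNORECASE)
    suffixes = (s.lower() for s in pattern.findall(text))
    counts = Counter(f"{keyword} {s}" for s in suffixes if s not in STOP_WORDS)
    return counts.most_common(n)

# Made-up sample text.
sample = (
    "Good battery life. The battery life is great.\n"
    "My battery drains fast and the battery is old."
)
result = top_phrases(sample, "battery")
print(result)  # [('battery life', 2), ('battery drains', 1)]
```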
