# Sentiment Analysis

In this notebook, our goal is to perform some basic analysis on the data and derive useful insights. We will follow a very basic data science pipeline.

Here are the goals of this notebook:

* Load and understand the data at hand.
* Set up a basic baseline before moving on to state-of-the-art models.
* Set up performance metrics before deploying the model.

# Load Essential Libraries

In [1]:
import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import json
import ast

%matplotlib inline

# Load dataset from GDrive

In [2]:
dataset = pd.read_csv("/content/drive/MyDrive/Projects/Moodboard/dataset.csv")

In [3]:
dataset.shape

(923, 15)

In [4]:
dataset

Unnamed: 0.1,Unnamed: 0,_id,Sector,Main Category,Retailer,productlink,Brand,title,star,Details,Ingredients,reviews,createdAtdate,createdAt,source
0,0,5feac0cca137251264c4341d,Beauty & Skincare,makeup,ulta,https://www.ulta.com/fill-fluff-eyebrow-pomade...,NYX Cosmetics,Fill & Fluff Eyebrow Pomade Pencil,4.3 out of 5 stars,The boyfriend jean of the eyebrow world has ar...,Microcrystalline Wax/Cera Microcristallina/Cir...,"{'reviewer': ['By\xa0Franny', 'By\xa0Lilly', '...",29/12/2020,2020-12-29 05:38:19.937,ecommerce
1,1,5feac0cca137251264c43425,Beauty & Skincare,makeup,ulta,https://www.ulta.com/diamonds-ice-please-epic-...,NYX Cosmetics,"Diamonds & Ice, Please! Epic Wear Liner Kit",4.4 out of 5 stars,Ready for an epic lineup? The limited edition ...,,"{'reviewer': ['By\xa0kjh', 'By\xa0AIM RN', 'By...",29/12/2020,2020-12-29 05:38:19.939,ecommerce
2,2,5feac0cca137251264c43427,Beauty & Skincare,makeup,ulta,https://www.ulta.com/slim-lip-pencil?productId...,NYX Cosmetics,Slim Lip Pencil,4.4 out of 5 stars,"Slim, trim, but never prim. NYX Professional M...","Sorbitan Isostearate, Isocetyl Stearate,Phenyl...","{'reviewer': ['By\xa0CeCe', 'By\xa0Kelsey', 'B...",29/12/2020,2020-12-29 05:38:19.941,ecommerce
3,3,5feac0cca137251264c43422,Beauty & Skincare,makeup,ulta,https://www.ulta.com/lip-lingerie-glitter?prod...,NYX Cosmetics,Lip Lingerie Glitter,3.7 out of 5 stars,Reveal gorgeous nude lips and knockout shine w...,"Hydrogenated Polyisobutene, Polybutene, Diisos...","{'reviewer': ['By\xa0J . Galeano', 'By\xa0Sara...",29/12/2020,2020-12-29 05:38:19.939,ecommerce
4,4,5feac0cca137251264c43423,Beauty & Skincare,makeup,ulta,https://www.ulta.com/lip-lingerie-shimmer?prod...,NYX Cosmetics,Lip Lingerie Shimmer,4.0 out of 5 stars,Reveal gorgeous nude lips and knockout shine w...,"Hydrogenated Polyisobutene, Polybutene, Diisos...","{'reviewer': ['By\xa0Becky', 'By\xa0Savanah', ...",29/12/2020,2020-12-29 05:38:19.939,ecommerce
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
918,918,5ff54992b1cebb35b49d15af,Beauty & Skincare,makeup,ulta,https://www.ulta.com/facial-spray-with-aloe-ch...,Mario Badescu,"Facial Spray with Aloe, Chamomile and Lavender",4.2 out of 5 stars,Replenish skin with an infusion of calming bot...,"Aqua (Water, Eau), Lavandula Angustifolia (Lav...","{'reviewer': ['By\xa0Sarbear', 'By\xa0Dizzy', ...",06/01/2021,2021-01-06 05:24:31.941,ecommerce
919,919,5ff54992b1cebb35b49d15b0,Beauty & Skincare,makeup,ulta,https://www.ulta.com/travel-size-facial-spray-...,Mario Badescu,"Travel Size Facial Spray With Aloe, Herbs and ...",4.6 out of 5 stars,Whether spritzed for a hydrating boost or midd...,"Aqua (Water, Eau), Propylene Glycol, Aloe Barb...","{'reviewer': ['By\xa0Bianca', 'By\xa0LW', 'By\...",06/01/2021,2021-01-06 05:24:31.941,ecommerce
920,920,5ff54992b1cebb35b49d15b1,Beauty & Skincare,makeup,ulta,https://www.ulta.com/facial-spray-with-aloe-sa...,Mario Badescu,"Facial Spray with Aloe, Sage and Orange Blossom",4.6 out of 5 stars,Hydrate and refresh with Mario Badescu's Facia...,"Aqua (Water, Eau), Propylene Glycol, Salvia Of...","{'reviewer': ['By\xa0Theresa12', 'By\xa0Sophia...",06/01/2021,2021-01-06 05:24:31.941,ecommerce
921,921,5ff54992b1cebb35b49d15b2,Beauty & Skincare,makeup,ulta,https://www.ulta.com/travel-size-facial-spray-...,Mario Badescu,"Travel Size Facial Spray with Aloe, Cucumber a...",4.6 out of 5 stars,Mario Badescu's Travel Size Facial Spray With ...,"Aqua (Water, Eau), Propylene Glycol, Mentha Pi...",{'reviewer': ['By\xa0Aaliyah\U0001f90d🥞\U0001f...,06/01/2021,2021-01-06 05:24:31.942,ecommerce


In [5]:
dataset['Sector'].value_counts()

Beauty & Skincare    923
Name: Sector, dtype: int64

In [6]:
dataset['Brand'].value_counts()

NYX Cosmetics    205
Clinique         131
Bareminerals     112
Mario Badescu    110
Estee Lauder      71
Juice beauty      61
Colourpop         54
Murad             50
BH Cosmetics      42
neutrogena        39
the ordinary      35
Sally Beauty      13
Name: Brand, dtype: int64

This could be an interesting feature, since brand name alone is a powerful driver for customers. For now, I will keep it as an insight for future improvements and will not use it for modelling.

In [7]:
dataset['star'].value_counts()

4.5 out of 5 stars    125
4.3 out of 5 stars    121
4.4 out of 5 stars    103
4.2 out of 5 stars     89
4.6 out of 5 stars     75
4.1 out of 5 stars     65
4.0 out of 5 stars     59
4.7 out of 5 stars     56
3.9 out of 5 stars     36
5.0 out of 5 stars     30
3.8 out of 5 stars     27
3.7 out of 5 stars     25
4.8 out of 5 stars     21
3.6 out of 5 stars     16
4.9 out of 5 stars     12
3.4 out of 5 stars     11
3.5 out of 5 stars     10
3.2 out of 5 stars      9
3.1 out of 5 stars      8
3.3 out of 5 stars      8
2.8 out of 5 stars      3
3.0 out of 5 stars      3
2.5 out of 5 stars      3
2.9 out of 5 stars      2
2.6 out of 5 stars      2
2.4 out of 5 stars      2
1.1 out of 5 stars      1
2.3 out of 5 stars      1
Name: star, dtype: int64

This is another interesting feature that is likely to be strongly tied to sentiment. Let's keep tabs on it as well.
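
If the product-level rating is used later, the `star` strings would first need to be turned into numbers. Here is a minimal sketch, assuming every value follows the `<number> out of 5 stars` pattern seen in the output above. The `stars` Series is a small hand-made stand-in for `dataset['star']`, and `star_value` is a hypothetical name.

```python
import pandas as pd

# Stand-in for dataset['star']; the values are copied from the output above
stars = pd.Series(["4.5 out of 5 stars", "3.7 out of 5 stars", "1.1 out of 5 stars"])

# Capture the leading number; expand=False returns a Series for a single group
star_value = stars.str.extract(r"^(\d+(?:\.\d+)?)", expand=False).astype(float)
print(star_value.tolist())
```

Values that do not match the pattern would come back as `NaN`, so it may be worth checking `star_value.isna().sum()` on the real column.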

Now let's examine the product details and then the reviews left by our customers.

In [8]:
dataset['Details'][100]

'Love the look of a lash lift but not the price? Enter NYX Professional Makeup On The Rise Liftsacara. This ultra-high pigmented, vegan formula, matte black mascara is the first ever lifting and volumizing liftascara. Delivering quick-charge lift and volume in just a few strokes. Featuring an innovative applicator that¿s part rounded, part hourglass to lift you lashes up to new heights! The "curling" effect results in a far-reaching fringe so that eyes look bigger and lashes look thicker, making it our most dramatic mascara to date. Also cool: it works for all lash types. In a clinical study 95% of participants saw instant volume and 94% saw instant lift! Elevate your expectations with this high drama black volume that lifts eyelashes so freakin\' high, you just won\'t believe it! Featuring a vegan formula with no animal-derived ingredients or by-products. All NYX Professional Makeup products are proudly cruelty-free and PETA certified.Benefits:Volumizing mascara with lash-lifting abil

The Details column usually holds the company-provided description of the product. For sentiment analysis of customer reviews, it is unlikely to help much. It could, however, be useful in a different task: relating product details to reviews to figure out which aspects of a product attract which customers. We can keep that in mind when building further models in the future.

In [9]:
sample_data = dataset['reviews'][10]

Wow, it looks like we have some messy data at hand. From an initial look, we have both the review text and the ratings (our labels), which will help us prepare the dataset for the final model-building task.

In [10]:
sample_data

'{\'reviewer\': [\'By\\xa0Aina\', \'By\\xa0nessa\', \'By\\xa0Lish\', \'By\\xa0Sam\', \'By\\xa0Pineapple\', \'By\\xa0LJ\', \'By\\xa0Grandmother\', \'By\\xa0Sarahlynn\', \'By\\xa0Bel\', \'By\\xa0Noel\', \'By\\xa0Hazel\', \'By\\xa0Cassie\', \'By\\xa0Auj\', \'By\\xa0deedeeg\', \'By\\xa0Dee\', \'By\\xa0Kay5\', \'By\\xa0adri\', \'By\\xa0Barbie secret agent\', \'By\\xa0Amy\', \'By\\xa0Kim\', \'By\\xa0Paola\', \'By\\xa0Lindaw3\', \'By\\xa0Keelee\', \'By\\xa0Becksstein\', \'By\\xa0Miss Michelle\', \'By\\xa0B\', \'By\\xa0Yang\', \'By\\xa0Kacie\', \'By\\xa0Mary\', \'By\\xa0Randy\', \'By\\xa0Carly\', \'By\\xa0Bear\', \'By\\xa0Tomadeupjazz\', \'By\\xa0Janna\', \'By\\xa0Milleloves\', \'By\\xa0Extragirl\', \'By\\xa0Jane21\', \'By\\xa0JKCRUSH\', \'By\\xa0Debs3\', \'By\\xa0Donna\', \'By\\xa0Tammy\', \'By\\xa0s\', \'By\\xa0Glamadiva\', \'By\\xa0Dani\', \'By\\xa0LoisAnn\', \'By\\xa0Megan K.\', \'By\\xa0Glamapuss\', \'By\\xa0LTT\', \'By\\xa0Marie\', \'By\\xa0Tiff\', \'By\\xa0Pauline\', \'By\\xa0Jill\', \'
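
The value above is a string holding a Python dict literal, so it can be converted back into a dict with `ast.literal_eval`. The sketch below uses a short hand-written string instead of the real cell. The key names `reviewer`, `review` and `rrat` are the ones used in this notebook, but the review texts and ratings are invented for illustration, and the rating type (integers here) is an assumption.

```python
import ast

# Hand-written stand-in for one cell of dataset['reviews'];
# the review texts and ratings are invented, only the key names are real
raw = "{'reviewer': ['By\\xa0Aina', 'By\\xa0nessa'], 'review': ['Love it', 'Too dry'], 'rrat': [5, 2]}"

parsed = ast.literal_eval(raw)  # evaluates literals only, unlike eval()
print(type(parsed).__name__, list(parsed.keys()))

# Strip the "By" prefix and the non-breaking space from reviewer names
names = [r.replace("By\xa0", "", 1) for r in parsed["reviewer"]]
print(names)
```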

In [11]:
data = {}
final_df = pd.DataFrame(data,columns = ["review","label"])

In [12]:
final_df

Unnamed: 0,review,label


In [13]:
def extract_data_from_df(sample_data):
  """Parse one product's stringified review dict and append its rows to final_df."""
  global final_df
  try:
    sample_data_exp = ast.literal_eval(sample_data)
    reviews = sample_data_exp['review']
    ratings = sample_data_exp['rrat']
  except (ValueError, SyntaxError, KeyError, TypeError):
    # Unparseable cell or missing 'review'/'rrat' key
    return "COLUMNS DO NOT MATCH"
  # Reviews and ratings must pair up one-to-one
  if len(reviews) != len(ratings):
    return "BAD DATAFRAME"

  new_df = pd.DataFrame({"review": reviews, "label": ratings})
  # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
  final_df = pd.concat([final_df, new_df], ignore_index=True)
  return len(reviews)

In [14]:
# Extract data: parse every product's reviews and append them to final_df.
# The returned Series holds the number of reviews found per product.
dataset['reviews'].map(extract_data_from_df)

0      300
1       11
2      300
3       77
4       77
      ... 
918    300
919    300
920    242
921    300
922    118
Name: reviews, Length: 923, dtype: object
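
Growing a global DataFrame one product at a time copies the accumulated data on every call. One alternative, sketched below under assumptions, is to collect the per-product frames in a list and concatenate once at the end. `parse_reviews`, `toy`, `frames` and `combined` are hypothetical names, and the toy strings are invented stand-ins for `dataset['reviews']`.

```python
import ast
import pandas as pd

def parse_reviews(raw):
  """Return a (review, label) frame for one product, or None if unusable."""
  try:
    parsed = ast.literal_eval(raw)
    reviews, ratings = parsed["review"], parsed["rrat"]
  except (ValueError, SyntaxError, KeyError, TypeError):
    return None
  if len(reviews) != len(ratings):
    return None
  return pd.DataFrame({"review": reviews, "label": ratings})

# Toy stand-in for dataset['reviews']; the second entry has mismatched lengths
toy = pd.Series([
  "{'review': ['Love it', 'Too dry'], 'rrat': [5, 2]}",
  "{'review': ['Okay'], 'rrat': []}",
  "{'review': ['Nice shade'], 'rrat': [4]}",
])

# Keep only the products that parsed cleanly, then concatenate once
frames = [f for f in map(parse_reviews, toy) if f is not None]
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

This version has no side effects, so it can be re-run without first resetting `final_df`.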

In [15]:
final_df['label'] = final_df.label.astype(int)

So this is the final dataset we will use to train our model. It will be fun.

In [16]:
def convert_to_labels(rating):
  if rating<=2:
    return "NEGATIVE"
  elif rating == 3:
    return "NEUTRAL"
  else:
    return "POSITIVE"

In [17]:
final_df['label'] = final_df['label'].map(lambda x: convert_to_labels(x))
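
With well over a hundred thousand rows, the same mapping could also be done in one vectorized step. The sketch below uses `np.select` with the same thresholds as `convert_to_labels` on a toy Series; `ratings` and `labels` are hypothetical names and the values are not taken from the real data.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1, 2, 3, 4, 5])  # toy ratings, not the real data

# Conditions are checked in order; anything that matches neither is POSITIVE
labels = pd.Series(
  np.select([ratings <= 2, ratings == 3], ["NEGATIVE", "NEUTRAL"], default="POSITIVE"),
  index=ratings.index,
)
print(labels.tolist())
```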

In [20]:
final_df.to_csv("/content/drive/MyDrive/Projects/Moodboard/sentiment.csv",index= False)

In [19]:
final_df['label'].value_counts()

POSITIVE    104482
NEGATIVE     18490
NEUTRAL      11037
Name: label, dtype: int64
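
The classes are clearly imbalanced, which matters when choosing performance metrics. As a rough sketch, the counts above can be turned into class shares and into the accuracy of a trivial baseline that always predicts the majority class. The numbers are copied from the output above; `counts`, `shares` and `majority_accuracy` are hypothetical names.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"POSITIVE": 104482, "NEGATIVE": 18490, "NEUTRAL": 11037})

shares = counts / counts.sum()
majority_accuracy = shares.max()  # accuracy of always predicting POSITIVE
print(shares.round(3).to_dict())
print(round(majority_accuracy, 3))
```

Because always answering POSITIVE would already be right roughly 78% of the time, accuracy alone says little here; a per-class or macro-averaged metric such as F1 is probably a more informative yardstick for any model built on this data.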