Suggestions
Use votes as weights in a weighted average
KNN for classification

In [118]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import json
from pathlib import Path
import os 
import statsmodels.api as sm
import plotly.express as px
import re
import gzip
from urllib.request import urlopen

### Handling nulls

#### Cleaning Review Data set 

In [119]:
#directing to the right file path
os.chdir("/Users/mac/Desktop/Data/CAPSTONE")
cwd = os.getcwd() 

In [120]:
#Opening the reviews in chunks, resource: https://towardsdatascience.com/load-yelp-reviews-or-other-huge-json-files-with-ease-ad804c2f1537
review_chunks = []
#dtype hints; vote, reviewTime and reviewerID hold strings, so they stay as object
r_dtypes = {"overall": np.float16, 
            "verified": np.int32, 
            "vote": object,
            "reviewTime": object,
            "reviewerID": object,
            "asin": object,
            "reviewerName": object,
            "reviewText": object, 
            "summary": object,     
            "style": object, 
            "image": object, 
           }
with open("Luxury_Beauty.json", "r") as f:
    reader = pd.read_json(f, orient="records", lines=True, 
                          dtype=r_dtypes, chunksize=1000)
        
    for chunk in reader:
        #dropping unixReviewTime from each chunk before storing it
        reduced_chunk = chunk.drop(columns=["unixReviewTime"])
        review_chunks.append(reduced_chunk)
    
review_df = pd.concat(review_chunks, ignore_index=True)

In [121]:
#checking review_df dataset
review_df.head(10)

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,style,image
0,2.0,3.0,1,"06 15, 2010",A1Q6MUU0B2ZDQG,B00004U9V2,D. Poston,"I bought two of these 8.5 fl oz hand cream, an...",dispensers don't work,,
1,5.0,14.0,1,"01 7, 2010",A3HO2SQDCZIE9S,B00004U9V2,chandra,"Believe me, over the years I have tried many, ...",Best hand cream ever.,,
2,5.0,,1,"04 18, 2018",A2EM03F99X3RJZ,B00004U9V2,Maureen G,Great hand lotion,Five Stars,{'Size:': ' 3.5 oz.'},
3,5.0,,1,"04 18, 2018",A3Z74TDRGD0HU,B00004U9V2,Terry K,This is the best for the severely dry skin on ...,Five Stars,{'Size:': ' 3.5 oz.'},
4,5.0,,1,"04 17, 2018",A2UXFNW9RTL4VM,B00004U9V2,Patricia Wood,The best non- oily hand cream ever. It heals o...,I always have a backup ready.,{'Size:': ' 3.5 oz.'},
5,5.0,,1,"04 14, 2018",AXX5G4LFF12R6,B00004U9V2,Ralla,Ive used this lotion for many years. I try oth...,Ive used this lotion for many years. I try ...,{'Size:': ' 250 g'},
6,5.0,,1,"04 11, 2018",A7GUKMOJT2NR6,B00004U9V2,Lydia Speight,Works great for dry hands.,Five Stars,{'Size:': ' 3.5 oz.'},
7,5.0,,1,"04 11, 2018",A3FU4L59BHA9FY,B00004U9V2,Allen Semer,The best hand cream ever.,Made in the USA,{'Size:': ' 3.5 oz.'},
8,5.0,,1,"04 7, 2018",A1AMNMIPQMXH9M,B00004U9V2,Vets park,LOVE THIS SCENT!! But Crabtree and Evelyn mak...,Moistens and smells good,{'Size:': ' 3.5 oz.'},
9,5.0,,1,"04 6, 2018",A3DMBDTA8VGWSX,B00004U9V2,Cynthia P. Irving,Its a great moisturizer especially for gardners,Five Stars,{'Size:': ' 3.5 oz.'},


#### Initial Data Dictionary

**Review Data** 
- `overall`: Rating given by the user, out of 5.0 (numeric)
- `verified`: Whether the purchase was verified (numeric) 
- `vote`: Number of users that have liked the review (numeric)
- `reviewTime`: Date of the review, stored as a string such as "06 15, 2010" (object)
- `reviewerID`: Unique reviewer ID (object)
- `asin`: Unique product ID (object)
- `reviewerName`: Name of the reviewer (object) 
- `reviewText`: Body of the user review (object) 
- `summary`: Title of the user review (object)
- `style`: Dictionary of details on the product reviewed, such as size or color (dictionary)
- `image`: Images of the product uploaded by the user (object) 

**Meta Data** 
- `category`: object
- `description`: object
- `fit`: object
- `title`: initial listing of the product (object)
- `also_buy`: related ASINs that purchasers have also bought (list)
- `tech2`: first technical detail table of the product (object)
- `brand`: brand name (object)
- `feature`: object
- `rank`: ranking within the category at the time of the data extraction (object)
- `also_view`: related ASINs that have also been viewed (list)
- `details`: object
- `Shipping Weight`: object
- `International Shipping`: object
- `ASIN`: object
- `Item model number`: object
- `main_cat`: object
- `similar_item`: object
- `date`: object
- `price`: listed price, stored as a string such as "$30.00" (object)
- `asin`: object
- `imageURL`: URL of the product image (list) 
- `imageURLHighRes`: URL of the high-resolution product image (list)
          

In [122]:
#checking amount of nulls   
nan_count = review_df.isna().sum()
print(nan_count)

overall              0
vote            470939
verified             0
reviewTime           0
reviewerID           0
asin                 0
reviewerName        31
reviewText         400
summary            183
style           323615
image           567210
dtype: int64


`image`, `vote`, and `style` all have high levels of null values (roughly 99%, 82%, and 56% of the 574,628 rows, respectively). We still want to keep the `vote` column and fill its nulls with 0, since votes indicate that other users agreed with the review. Before moving on, we should explore the `style` column, which stores its data as a dictionary, to see whether it holds any useful information.
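
As a rough sketch of how the null share per column can be read off directly, the snippet below uses `isna().mean()` on a small made-up frame. The frame `toy` and its values are placeholders for illustration, not rows from the review data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for review_df (values are made up for illustration).
toy = pd.DataFrame({
    "overall": [5.0, 4.0, 1.0, 5.0],
    "vote": ["3", np.nan, np.nan, np.nan],
    "image": [np.nan, np.nan, np.nan, np.nan],
})

# isna().mean() gives the fraction of nulls per column; scale it to percent.
null_pct = toy.isna().mean().mul(100).round(2)
print(null_pct.sort_values(ascending=False))
```

Sorting the result puts the sparsest columns first, which makes it easier to decide which ones to fill and which ones to drop.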

In [123]:
#stripping thousands separators from vote, converting to float and filling nulls with 0

review_df["vote"] = review_df["vote"].str.replace(",", "").astype("float32")
review_df["vote"] = review_df["vote"].fillna(0)

In [124]:
#checking dtypes after converting the vote column to float
print(review_df.dtypes)

overall         float16
vote            float32
verified          int32
reviewTime       object
reviewerID       object
asin             object
reviewerName     object
reviewText       object
summary          object
style            object
image            object
dtype: object


In [125]:
review_df.iloc[22620:22625]

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,style,image
22620,5.0,5.0,1,"11 16, 2015",A14OAB0YBN176K,B0002ZW5UQ,P Nitty,"It's funny, I tried this same product about 15...","If you have thinning hair, this is perfect. I...","{'Size:': ' 0.42 oz.', 'Color:': ' Black'}",
22621,4.0,0.0,1,"11 15, 2015",A1MB9LANC755GE,B0002ZW5UQ,kamran,"Very nice. Thanks,",Four Stars,"{'Size:': ' 0.42 oz.', 'Color:': ' Black'}",
22622,5.0,0.0,1,"11 14, 2015",AHYZUR6W4B3EL,B0002ZW5UQ,Zachary Spencer,I honestly love this stuff! It does wonders. E...,I love it!!!!,"{'Size:': ' 0.97 oz.', 'Color:': ' Dark Brown'}",
22623,4.0,2995.0,1,"11 14, 2015",A2J0S1IC4PU9G8,B0002ZW5UQ,Big Stink,"So, if you're bothering to read reviews about ...","Not bad, Toppik, but let's be clear about a fe...","{'Size:': ' 0.42 oz.', 'Color:': ' Dark Brown'}",
22624,5.0,0.0,1,"11 13, 2015",A2N5487XPK9L8E,B0002ZW5UQ,Yaneida Gutierrez,"Love it, does the work well, and I love that I...","Go on, buy it!","{'Size:': ' 0.97 oz.', 'Color:': ' Dark Brown'}",


In [126]:
review_df.describe()

Unnamed: 0,overall,vote,verified
count,574628.0,574628.0,574628.0
mean,,1.335126,0.878032
std,0.0,12.092627,0.327249
min,1.0,0.0,0.0
25%,4.0,0.0,1.0
50%,5.0,0.0,1.0
75%,5.0,0.0,1.0
max,5.0,2995.0,1.0
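
In the summary above, `overall` shows a missing mean and a standard deviation of 0.0. A likely cause (an assumption, not something verified against this dataset) is the `float16` dtype, whose largest representable value is far smaller than the sum of several hundred thousand ratings. The sketch below shows that limit and one possible workaround, casting to `float32` before aggregating. The `ratings` values are made up.

```python
import numpy as np
import pandas as pd

# float16 tops out at 65504, so a running sum of several hundred thousand
# ratings cannot be represented in it.
print(np.finfo(np.float16).max)

# Toy ratings (made up); casting to float32 before aggregating sidesteps the limit.
ratings = pd.Series([2.0, 5.0, 5.0, 4.0], dtype=np.float16)
mean_rating = ratings.astype(np.float32).mean()
print(mean_rating)
```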


In [127]:
#sanity check
review_df["vote"].isna().sum()

0

In [128]:
#creating a new df for style
style_df = review_df["style"].apply(pd.Series)
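
`apply(pd.Series)` builds one Series per row, which can be slow on several hundred thousand rows. A possible alternative, sketched here on a made-up `style` Series, is to swap missing entries for empty dicts and build the frame in a single constructor call. The names `style`, `records` and `style_toy` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for review_df["style"]: dicts for some rows, NaN for others.
style = pd.Series([
    np.nan,
    {"Size:": " 3.5 oz."},
    {"Size:": " 0.42 oz.", "Color:": " Black"},
])

# Replace missing entries with empty dicts, then build the frame in one call.
records = [d if isinstance(d, dict) else {} for d in style]
style_toy = pd.DataFrame(records, index=style.index)

# Fraction of nulls per expanded column.
print(style_toy.isna().mean())
```

Passing `index=style.index` keeps the expanded frame aligned with the original rows, so null shares can still be compared against the full dataset.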


In [140]:
#checking for null values in the expanded style columns
style_df.isnull().sum()
p = (style_df["Size:"].isna().sum()/review_df.shape[0])*100
print(f"Even the column with the fewest nulls, Size:, is {round(p,2)}% null, which is quite a lot, so we can drop the style column altogether")

Even the column with the fewest nulls, Size:, is 75.67% null, which is quite a lot, so we can drop the style column altogether


In [141]:
#Alternative: keep only columns with fewer than 50% nulls, adapted from https://stackoverflow.com/questions/43311555/how-to-drop-column-according-to-nan-percentage-for-dataframe 
#style_df = style_df.loc[:, style_df.isnull().mean() < .5]

In [142]:
review_df

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,style,image
0,2.0,3.0,1,"06 15, 2010",A1Q6MUU0B2ZDQG,B00004U9V2,D. Poston,"I bought two of these 8.5 fl oz hand cream, an...",dispensers don't work,,
1,5.0,14.0,1,"01 7, 2010",A3HO2SQDCZIE9S,B00004U9V2,chandra,"Believe me, over the years I have tried many, ...",Best hand cream ever.,,
2,5.0,0.0,1,"04 18, 2018",A2EM03F99X3RJZ,B00004U9V2,Maureen G,Great hand lotion,Five Stars,{'Size:': ' 3.5 oz.'},
3,5.0,0.0,1,"04 18, 2018",A3Z74TDRGD0HU,B00004U9V2,Terry K,This is the best for the severely dry skin on ...,Five Stars,{'Size:': ' 3.5 oz.'},
4,5.0,0.0,1,"04 17, 2018",A2UXFNW9RTL4VM,B00004U9V2,Patricia Wood,The best non- oily hand cream ever. It heals o...,I always have a backup ready.,{'Size:': ' 3.5 oz.'},
...,...,...,...,...,...,...,...,...,...,...,...
574623,5.0,0.0,1,"03 20, 2017",AHYJ78MVF4UQO,B01HIQEOLO,Lori Fox,Great color and I prefer shellac over gel,Five Stars,,
574624,5.0,0.0,1,"10 26, 2016",A1L2RT7KBNK02K,B01HIQEOLO,Elena,Best shellac I have ever used. It doesn't tak...,Best shellac I have ever used,,
574625,5.0,0.0,1,"09 30, 2016",A36MLXQX9WPPW9,B01HIQEOLO,Donna D. Harris,Great polish and beautiful color!!,Great polish!,,
574626,1.0,2.0,1,"12 5, 2016",A23DRCOMC2RIXF,B01HJ2UY0W,Y.Y. Chen,"The perfume is good, but the spray head broke ...",Spray head broke off within a month,"{'Size:': ' 1.7 Fluid Ounce', 'Color:': ' Multi'}",


In [143]:
#Removing the style and image columns
review_df = review_df.drop(columns=["style", "image"])


In [144]:
nan_count = review_df.isna().sum()
print(nan_count)

overall           0
vote              0
verified          0
reviewTime        0
reviewerID        0
asin              0
reviewerName     31
reviewText      400
summary         183
dtype: int64


In [145]:
review_df["reviewerName"]

0               D. Poston
1                 chandra
2               Maureen G
3                 Terry K
4           Patricia Wood
               ...       
574623           Lori Fox
574624              Elena
574625    Donna D. Harris
574626          Y.Y. Chen
574627         ML Shelton
Name: reviewerName, Length: 574628, dtype: object

In [148]:
#replacing null reviewer names with the reviewer ID 
review_df["reviewerName"] = review_df["reviewerName"].fillna(review_df["reviewerID"])

In [154]:
nan_count = review_df.isna().sum()
print(nan_count)

overall         0
vote            0
verified        0
reviewTime      0
reviewerID      0
asin            0
reviewerName    0
reviewText      0
summary         0
dtype: int64


In [153]:
# % of null values in reviewText
pct_reviewText = review_df["reviewText"].isna().sum()/review_df.shape[0]*100
print(f"Null values in reviewText make up {round(pct_reviewText,2)}% of the dataset; this is small, so we can remove these rows")

Null values in reviewText make up 0.0% of the dataset; this is small, so we can remove these rows


In [171]:
#removing remaining rows with null values and checking row counts, saved to df1
review_df1 = review_df.dropna().copy()
review_df1.shape

(574053, 9)

In [172]:
#sanity check
review_df1.isna().sum().sum()

0

In [173]:
review_df1

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary
0,2.0,3.0,1,"06 15, 2010",A1Q6MUU0B2ZDQG,B00004U9V2,D. Poston,"I bought two of these 8.5 fl oz hand cream, an...",dispensers don't work
1,5.0,14.0,1,"01 7, 2010",A3HO2SQDCZIE9S,B00004U9V2,chandra,"Believe me, over the years I have tried many, ...",Best hand cream ever.
2,5.0,0.0,1,"04 18, 2018",A2EM03F99X3RJZ,B00004U9V2,Maureen G,Great hand lotion,Five Stars
3,5.0,0.0,1,"04 18, 2018",A3Z74TDRGD0HU,B00004U9V2,Terry K,This is the best for the severely dry skin on ...,Five Stars
4,5.0,0.0,1,"04 17, 2018",A2UXFNW9RTL4VM,B00004U9V2,Patricia Wood,The best non- oily hand cream ever. It heals o...,I always have a backup ready.
...,...,...,...,...,...,...,...,...,...
574623,5.0,0.0,1,"03 20, 2017",AHYJ78MVF4UQO,B01HIQEOLO,Lori Fox,Great color and I prefer shellac over gel,Five Stars
574624,5.0,0.0,1,"10 26, 2016",A1L2RT7KBNK02K,B01HIQEOLO,Elena,Best shellac I have ever used. It doesn't tak...,Best shellac I have ever used
574625,5.0,0.0,1,"09 30, 2016",A36MLXQX9WPPW9,B01HIQEOLO,Donna D. Harris,Great polish and beautiful color!!,Great polish!
574626,1.0,2.0,1,"12 5, 2016",A23DRCOMC2RIXF,B01HJ2UY0W,Y.Y. Chen,"The perfume is good, but the spray head broke ...",Spray head broke off within a month


### Row Duplicates

Now that we have dealt with the null values, we can move on to checking for duplicate rows. 

In [174]:
#checking count of duplicated rows 
review_df1.duplicated().sum()

34893

In [175]:
# % of duplicated rows in review_df1
pct_duplicates = review_df1.duplicated().sum()/review_df1.shape[0]*100
print(f"Duplicate rows make up {round(pct_duplicates,2)} % of dataset, which is fairly large")

Duplicate rows make up 6.08 % of dataset, which is fairly large


In [176]:
#looking at duplicates
review_df1[review_df1.duplicated()].sample(5)

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary
209489,5.0,25.0,1,"03 21, 2014",A2H6LHCM3AR3YV,B001MF3FMW,Twitch,"OK IF YOU HATE SHAVING?!?! Get this , then go ...",Best EVER!!!!!!
171958,4.0,0.0,1,"03 28, 2016",A2FGTVN1MM8GJU,B0016LXK6I,Amazon Customer,great product and price,Four Stars
168516,5.0,8.0,1,"08 15, 2015",A1SPRS3KBPPXTU,B001543FF2,AmandaSizzle2009,The coverage of this product is fantastic. One...,Great concealer for this light skinned blonde
183207,1.0,5.0,1,"09 23, 2016",AUTFC3NFRIVRQ,B0017SWIU4,Amazon Customer,Did not work for me. It's a strange product. D...,Don't recommend. I'd rather use a generic neut...
165049,5.0,0.0,1,"01 8, 2017",A8FNU03HKL1A3,B0013U0EYI,Kindle Customer,"There you go...picture says it all! No hair, h...",There you go... picture says it all ...


In [177]:
#share of extra copies among all rows that belong to a duplicate group
dup_ratio = review_df1.duplicated().sum() / review_df1.duplicated(keep=False).sum()
print(f"Extra copies make up {round(dup_ratio,2)} of all rows involved in duplication, so each duplicated row appears roughly twice; this was likely a data recording issue.")

Extra copies make up 0.51 of all rows involved in duplication, so each duplicated row appears roughly twice; this was likely a data recording issue.
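
To make the reasoning behind this ratio concrete, here is a small sketch on made-up rows. `duplicated()` flags every repeat after the first occurrence, while `duplicated(keep=False)` flags every row that belongs to a duplicate group. If each duplicated row appears exactly twice, the ratio of the two counts is 0.5; a row appearing three times would push it towards 2/3.

```python
import pandas as pd

# Toy rows (made up): two pairs of exact duplicates and one unique row.
toy = pd.DataFrame({
    "reviewerID": ["A1", "A1", "A2", "A3", "A3"],
    "reviewText": ["great", "great", "ok", "bad", "bad"],
})

extra_copies = toy.duplicated().sum()            # repeats after the first occurrence
all_involved = toy.duplicated(keep=False).sum()  # every row in a duplicate group
ratio = extra_copies / all_involved
print(extra_copies, all_involved, ratio)
```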


In [178]:
#removing duplicates
review_df1 = review_df1.drop_duplicates()
review_df1.shape

(539160, 9)

In [179]:
#creating clean_review_df as a copy of the null-free, de-duplicated review_df1
clean_review_df = review_df1.copy()

In [180]:
clean_review_df.isna().sum()

overall         0
vote            0
verified        0
reviewTime      0
reviewerID      0
asin            0
reviewerName    0
reviewText      0
summary         0
dtype: int64

In [181]:
clean_review_df

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary
0,2.0,3.0,1,"06 15, 2010",A1Q6MUU0B2ZDQG,B00004U9V2,D. Poston,"I bought two of these 8.5 fl oz hand cream, an...",dispensers don't work
1,5.0,14.0,1,"01 7, 2010",A3HO2SQDCZIE9S,B00004U9V2,chandra,"Believe me, over the years I have tried many, ...",Best hand cream ever.
2,5.0,0.0,1,"04 18, 2018",A2EM03F99X3RJZ,B00004U9V2,Maureen G,Great hand lotion,Five Stars
3,5.0,0.0,1,"04 18, 2018",A3Z74TDRGD0HU,B00004U9V2,Terry K,This is the best for the severely dry skin on ...,Five Stars
4,5.0,0.0,1,"04 17, 2018",A2UXFNW9RTL4VM,B00004U9V2,Patricia Wood,The best non- oily hand cream ever. It heals o...,I always have a backup ready.
...,...,...,...,...,...,...,...,...,...
574623,5.0,0.0,1,"03 20, 2017",AHYJ78MVF4UQO,B01HIQEOLO,Lori Fox,Great color and I prefer shellac over gel,Five Stars
574624,5.0,0.0,1,"10 26, 2016",A1L2RT7KBNK02K,B01HIQEOLO,Elena,Best shellac I have ever used. It doesn't tak...,Best shellac I have ever used
574625,5.0,0.0,1,"09 30, 2016",A36MLXQX9WPPW9,B01HIQEOLO,Donna D. Harris,Great polish and beautiful color!!,Great polish!
574626,1.0,2.0,1,"12 5, 2016",A23DRCOMC2RIXF,B01HJ2UY0W,Y.Y. Chen,"The perfume is good, but the spray head broke ...",Spray head broke off within a month


#### Cleaning Metadata Dataset

In [240]:
#Opening the metadata in chunks
metadata_chunks = []
#dtype hints; list-valued columns and the "$"-prefixed price are kept as object
r_dtypes = {"category": object,
            "tech1": object,
            "description": object,
            "fit": object,
            "title": object,
            "also_buy": object,
            "tech2": object, 
            "brand": object,
            "feature": object,
            "rank": object,
            "also_view": object, 
            "details": object,
            "Shipping Weight": object,
            "International Shipping": object,
            "ASIN": object, 
            "Item model number": object,
            "main_cat": object,
            "similar_item": object,
            "date": object,
            "price": object,
            "asin": object, 
            "imageURL": object, 
            "imageURLHighRes": object,
           }
with open("meta_Luxury_Beauty.json", "r") as f:
    reader = pd.read_json(f, orient="records", lines=True, 
                          dtype=r_dtypes, chunksize=1000)
        
    for chunk in reader:
        #dropping tech1 from each chunk before storing it
        reduced_chunk = chunk.drop(columns=["tech1"])
        metadata_chunks.append(reduced_chunk)
    
metadata_df = pd.concat(metadata_chunks, ignore_index=True)

In [241]:
#Row and column count 
metadata_df.shape

(12299, 18)

In [242]:
metadata_df.head(10)

Unnamed: 0,category,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,[],[After a long day of handling thorny situation...,,Crabtree &amp; Evelyn - Gardener's Ultra-Moist...,"[B00GHX7H0A, B00FRERO7G, B00R68QXCS, B000Z65AZ...",,,[],"4,324 in Beauty & Personal Care (","[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN...",{'  Product Dimensions: ': '2.2 x 2.2 ...,Luxury Beauty,,NaT,$30.00,B00004U9V2,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
1,[],[If you haven't experienced the pleasures of b...,,AHAVA Bath Salts,[],,,[],"1,633,549 in Beauty & Personal Care (",[],{'  Product Dimensions: ': '3 x 3.5 x ...,Luxury Beauty,,NaT,,B0000531EN,[],[]
2,[],"[Rich, black mineral mud, harvested from the b...",,"AHAVA Dead Sea Mineral Mud, 8.5 oz, Pack of 4",[],,,[],"1,806,710 in Beauty &amp; Personal Care (",[],{'  Product Dimensions: ': '5.1 x 3 x ...,Luxury Beauty,,NaT,,B0000532JH,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
3,[],[This liquid soap with convenient pump dispens...,,"Crabtree &amp; Evelyn Hand Soap, Gardeners, 10...",[],,,[],[],"[B00004U9V2, B00GHX7H0A, B00FRERO7G, B00R68QXC...",{'  Product Dimensions: ': '2.6 x 2.6 ...,Luxury Beauty,,NaT,$15.99,B00005A77F,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
4,[],[Remember why you love your favorite blanket? ...,,Soy Milk Hand Crme,"[B000NZT6KM, B001BY229Q, B008J724QY, B0009YGKJ...",,,[],"42,464 in Beauty &amp; Personal Care (",[],{'  Product Dimensions: ': '7.2 x 2.2 ...,Luxury Beauty,,NaT,$18.00,B00005NDTD,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
5,[],"[Winter, summer, spring or fall, this soothing...",,"AHAVA Dermud Enriched Intensive Foot Cream, 4....",[],,,[],"1,527,650 in Beauty & Personal Care (",[],{'  Product Dimensions: ': '2.5 x 2.3 ...,Luxury Beauty,,NaT,,B00005R7ZZ,[],[]
6,[],[Highly concentrated formula created to rejuve...,,"AHAVA Dermud Intensive Nourishing Hand Cream, ...",[],,,[],"1,538,330 in Beauty &amp; Personal Care (",[],{'  Product Dimensions: ': '2.5 x 2.3 ...,Luxury Beauty,,NaT,,B00005R7ZY,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
7,[],[<P><STRONG>Please note: Due to product improv...,,Supersmile Powdered Mouthrinse,"[B0010Y3M2S, B00005V50B, B00NNZWXEK, B001AC3VI...",,,[],"122,723 in Beauty & Personal Care (","[B07CHTPD6W, B07D72B2VX, B07CLR4T96, B07B9XZ3K...",{'  Product Dimensions: ': '5.8 x 2.8 ...,Luxury Beauty,,NaT,$21.73,B00005V50C,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
8,[],"[Created by Dr. Irwin Smigel, world-renowned ""...",,Supersmile Professional Teeth Whitening Toothp...,"[B00NNZWXEK, B0057MMSWY, B00TZJDY4Q, B001ABYRZ...",,,[],"5,522 in Beauty &amp; Personal Care (","[B00TZJDY4Q, B07CHTPD6W, B076GZSV93, B00KAC7LE...",{'  Product Dimensions: ': '1.8 x 1.4 ...,Luxury Beauty,,NaT,$23.00,B00005V50B,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
9,[],[Naturally stimulating essential oils make our...,,"Archipelago Morning Mint Body Lotion ,18 Fl Oz","[B001IJOYJA, B008J720A4, B001IJQR68, B008J722A...",,,[],"20,146 in Beauty &amp; Personal Care (","[B001JB55SQ, B00J0A448K, B001IJQR68, B008J722A...",{'  Product Dimensions: ': '2.6 x 2.6 ...,Luxury Beauty,,NaT,$25.00,B000066SYB,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


In [243]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12299 entries, 0 to 12298
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   category         12299 non-null  object
 1   description      12299 non-null  object
 2   fit              12299 non-null  object
 3   title            12299 non-null  object
 4   also_buy         12299 non-null  object
 5   tech2            12299 non-null  object
 6   brand            12299 non-null  object
 7   feature          12299 non-null  object
 8   rank             12299 non-null  object
 9   also_view        12299 non-null  object
 10  details          12299 non-null  object
 11  main_cat         12299 non-null  object
 12  similar_item     12299 non-null  object
 13  date             0 non-null      object
 14  price            12299 non-null  object
 15  asin             12299 non-null  object
 16  imageURL         12299 non-null  object
 17  imageURLHighRes  12299 non-null  object

Many columns in the dataset carry no usable information: they are entirely empty, contain only empty lists, or repeat the same value throughout. These are `category`, `fit`, `tech2`, `brand`, `feature`, `main_cat`, and `date` (`tech1` was already dropped while loading). 

Since we won't need the image URLs, we can remove those columns as well.

In [244]:
#removing columns 
columns= ["category", "fit", "tech2", "brand", "feature", "main_cat", "date", "imageURL","imageURLHighRes"]

metadata_df.drop(columns,axis=1, inplace=True)

In [245]:
metadata_df.head(5)

Unnamed: 0,description,title,also_buy,rank,also_view,details,similar_item,price,asin
0,[After a long day of handling thorny situation...,Crabtree &amp; Evelyn - Gardener's Ultra-Moist...,"[B00GHX7H0A, B00FRERO7G, B00R68QXCS, B000Z65AZ...","4,324 in Beauty & Personal Care (","[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN...",{'  Product Dimensions: ': '2.2 x 2.2 ...,,$30.00,B00004U9V2
1,[If you haven't experienced the pleasures of b...,AHAVA Bath Salts,[],"1,633,549 in Beauty & Personal Care (",[],{'  Product Dimensions: ': '3 x 3.5 x ...,,,B0000531EN
2,"[Rich, black mineral mud, harvested from the b...","AHAVA Dead Sea Mineral Mud, 8.5 oz, Pack of 4",[],"1,806,710 in Beauty &amp; Personal Care (",[],{'  Product Dimensions: ': '5.1 x 3 x ...,,,B0000532JH
3,[This liquid soap with convenient pump dispens...,"Crabtree &amp; Evelyn Hand Soap, Gardeners, 10...",[],[],"[B00004U9V2, B00GHX7H0A, B00FRERO7G, B00R68QXC...",{'  Product Dimensions: ': '2.6 x 2.6 ...,,$15.99,B00005A77F
4,[Remember why you love your favorite blanket? ...,Soy Milk Hand Crme,"[B000NZT6KM, B001BY229Q, B008J724QY, B0009YGKJ...","42,464 in Beauty &amp; Personal Care (",[],{'  Product Dimensions: ': '7.2 x 2.2 ...,,$18.00,B00005NDTD


In [246]:
#replacing empty lists, empty strings and None with NaN
metadata_df = metadata_df.where(~metadata_df.applymap(lambda x: x == [] or x is None or x == ''))
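
The masking step can be sketched on a small made-up frame as follows. This version builds the boolean mask column by column with `Series.map`, which is one way to avoid depending on `applymap` (deprecated in recent pandas releases). The helper `is_empty` and the frame `toy` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame (made up) with an empty list, an empty string and a None.
toy = pd.DataFrame({
    "also_buy": [["B1", "B2"], [], ["B3"]],
    "price": ["$30.00", "", None],
})

def is_empty(x):
    # True for None, empty lists and empty strings.
    return x is None or (isinstance(x, (list, str)) and len(x) == 0)

# Build the boolean mask column by column, then blank out the flagged cells.
mask = toy.apply(lambda col: col.map(is_empty))
cleaned = toy.mask(mask)
print(cleaned.isna().sum())
```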

In [247]:
#null value count
null_metadata = metadata_df.isna().sum()
null_metadata["price"]

5260

In [248]:
#splitting rank into the leading number ("ranking") and the remaining category text ("remove")
metadata_df[['ranking','remove']] = metadata_df["rank"].str.split(" ", n=1,expand = True)

In [249]:
#convert ranking column to integer; products without a rank get 0
metadata_df["ranking"] = metadata_df["ranking"].str.replace(",", "", regex=False)
metadata_df["ranking"] = metadata_df["ranking"].fillna(0).astype("int32")
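
Filling missing ranks with 0 makes an unranked product look like it outranks every ranked one, which may matter if `ranking` is later used as a feature. One possible alternative, sketched below on made-up `rank` strings, extracts the leading number with a regular expression and stores it as a nullable integer so that missing ranks stay missing. Whether this suits the downstream model is an open question, not something the notebook establishes.

```python
import pandas as pd

# Toy rank strings (made up) in the same shape as the metadata column.
rank = pd.Series([
    "4,324 in Beauty & Personal Care (",
    "1,633,549 in Beauty &amp; Personal Care (",
    None,
])

# Capture the leading run of digits and commas, then strip the commas.
digits = rank.str.extract(r"^\s*([\d,]+)", expand=False).str.replace(",", "", regex=False)

# Nullable Int64 keeps a missing rank as <NA> instead of 0.
ranking = pd.to_numeric(digits).astype("Int64")
print(ranking)
```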

In [250]:
#remove unnecessary columns
metadata_df.drop(['remove','rank',"similar_item"],axis=1,inplace=True )

In [251]:
#joining the description list into a single string 
metadata_df['product_description'] = metadata_df['description'].str.join(', ')

In [252]:
#taking a look at details column
d = metadata_df["details"].iloc[0]
d.keys()

dict_keys(['\n    Product Dimensions: \n    ', 'Shipping Weight:', 'Domestic Shipping: ', 'International Shipping: ', 'ASIN:', 'Item model number:'])

The keys in the `details` column (product dimensions, shipping weight, shipping availability, ASIN, and item model number) describe logistics rather than the product itself, so we can drop the column. The ASIN would be worth keeping if it were not already available in the `asin` column. 

In [253]:
metadata_df.drop(["details","description"],axis=1, inplace=True)

`price` is currently stored as an object (string), so it needs to be converted to a numeric value. 

In [254]:
metadata_df["asin"].unique()

array(['B00004U9V2', 'B0000531EN', 'B0000532JH', ..., 'B01HIQEOLO',
       'B01HJ2UY0W', 'B01HJ2UY1G'], dtype=object)

In [255]:
#removing dollar sign from price and filling missing prices with 0 
metadata_df['price'] = metadata_df["price"].str.replace("$", "", regex=False)
metadata_df["price"] = metadata_df["price"].fillna(0)
metadata_df

Unnamed: 0,title,also_buy,also_view,price,asin,ranking,product_description
0,Crabtree &amp; Evelyn - Gardener's Ultra-Moist...,"[B00GHX7H0A, B00FRERO7G, B00R68QXCS, B000Z65AZ...","[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN...",30.00,B00004U9V2,4324,After a long day of handling thorny situations...
1,AHAVA Bath Salts,,,0,B0000531EN,1633549,If you haven't experienced the pleasures of ba...
2,"AHAVA Dead Sea Mineral Mud, 8.5 oz, Pack of 4",,,0,B0000532JH,1806710,"Rich, black mineral mud, harvested from the ba..."
3,"Crabtree &amp; Evelyn Hand Soap, Gardeners, 10...",,"[B00004U9V2, B00GHX7H0A, B00FRERO7G, B00R68QXC...",15.99,B00005A77F,0,This liquid soap with convenient pump dispense...
4,Soy Milk Hand Crme,"[B000NZT6KM, B001BY229Q, B008J724QY, B0009YGKJ...",,18.00,B00005NDTD,42464,Remember why you love your favorite blanket? T...
...,...,...,...,...,...,...,...
12294,"CND Shellac Power Polish, Patina Buckle","[B003ONLAXQ, B00YDEZ9T6, B074KHRD13, B00R3PZK1...","[B00D2VMUA2, B074KJZJYW, B074KHRD13, B073SB9JW...",15.95,B01HIQIEYC,88740,", CND Craft Culture Collection: Patina Buckle,..."
12295,CND Shellac power polish denim patch,"[B003ONLAXQ, B003OH0KBA, B004LEMWGG, B01MT91G4...","[B00D2VMUA2, B01L0EV8X2, B004LEMWGG, B00EFGDYZ...",15.95,B01HIQHQU0,122331,CND Shellac was designed to be used as a syste...
12296,"CND Shellac, Leather Satchel","[B003ONLAXQ, B003OH0KBA, B004LEMWGG, B01MT91G4...","[B00D2VMUA2, B01L0EV8X2, B004LEMWGG, B00EFGDYZ...",15.95,B01HIQEOLO,168028,CND Shellac was designed to be used as a syste...
12297,"Juicy Couture I Love Juicy Couture, 1.7 fl. Oz...",,"[B0757439SY, B01HJ2UY1G, B01KX3TK7C, B01LX71LJ...",76.00,B01HJ2UY0W,490755,The I AM JUICY COUTURE girl is once again taki...


In [256]:
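#stripping whitespace; the 0 placeholders are not strings, so .str.strip() turns them back into NaN
metadata_df["price"] = metadata_df["price"].str.strip()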

In [257]:
#extracting the first number with two decimal places from price and converting it to float
metadata_df['price_USD']= metadata_df['price'].str.extract(r'(\d*?\.\d{2})', expand=False)
metadata_df['price_USD'] = metadata_df['price_USD'].astype(float)
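
The pattern `(\d*?\.\d{2})` stops at a thousands separator, so a value such as "1,299.00" would be read as 299.00. A slightly more tolerant sketch is shown below. The input strings are invented, and whether the real column contains separators, price ranges or stray markup is an assumption.

```python
import pandas as pd

# Invented price strings covering a few shapes the column might contain.
price = pd.Series(["$30.00", " $1,299.00", "$15.99 - $25.00", None, ".a-box{display:block}"])

# Take the first number, allowing thousands separators and optional decimals.
first_amount = price.str.extract(r"(\d[\d,]*(?:\.\d+)?)", expand=False)

# Strip the separators and convert; anything unparseable becomes NaN.
price_usd = pd.to_numeric(first_amount.str.replace(",", "", regex=False), errors="coerce")
print(price_usd)
```

For a range such as "$15.99 - $25.00" this keeps the lower bound, which is a design choice rather than the only reasonable reading.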

In [258]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12299 entries, 0 to 12298
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   title                12299 non-null  object 
 1   also_buy             7724 non-null   object 
 2   also_view            9046 non-null   object 
 3   price                7039 non-null   object 
 4   asin                 12299 non-null  object 
 5   ranking              12299 non-null  int32  
 6   product_description  12149 non-null  object 
 7   price_USD            6958 non-null   float64
dtypes: float64(1), int32(1), object(6)
memory usage: 720.8+ KB


In [259]:
#for loop to get also_buy counts, also_view counts 
columns = ["also_buy","also_view"]

for column in columns: 
    name =  column + "_" + "counts"
    
    #Using explode function to count all of the times each asin appears
    x = metadata_df.explode(column)[column]
    
    my_count = x.value_counts()
    
    #creating a dictionary
    dict_counts = my_count.to_dict()
    
    #assigning the values using the dictionary created and creating a new column
    metadata_df[name] = metadata_df["asin"].map(dict_counts)
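
The counting logic in the loop above can be traced on a tiny made-up frame: `explode` produces one row per referenced ASIN, `value_counts` tallies how often each ASIN is referenced by other products, and `map` looks up each product's own ASIN in that tally. The frame `toy` and its ASINs are placeholders.

```python
import pandas as pd

# Toy metadata (made up): P1 references P2 and P3, P2 references P3.
toy = pd.DataFrame({
    "asin": ["P1", "P2", "P3"],
    "also_buy": [["P2", "P3"], ["P3"], None],
})

# One entry per referenced asin, then count how often each asin is referenced.
counts = toy["also_buy"].explode().value_counts()

# Look up each product's own asin; products that are never referenced get 0.
toy["also_buy_counts"] = toy["asin"].map(counts).fillna(0).astype(int)
print(toy)
```

Passing the `value_counts` Series straight to `map` works because `map` accepts a Series and looks values up by its index, so the intermediate dictionary is optional.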


In [260]:
metadata_df

Unnamed: 0,title,also_buy,also_view,price,asin,ranking,product_description,price_USD,also_buy_counts,also_view_counts
0,Crabtree &amp; Evelyn - Gardener's Ultra-Moist...,"[B00GHX7H0A, B00FRERO7G, B00R68QXCS, B000Z65AZ...","[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN...",30.00,B00004U9V2,4324,After a long day of handling thorny situations...,30.00,56.0,48.0
1,AHAVA Bath Salts,,,,B0000531EN,1633549,If you haven't experienced the pleasures of ba...,,,
2,"AHAVA Dead Sea Mineral Mud, 8.5 oz, Pack of 4",,,,B0000532JH,1806710,"Rich, black mineral mud, harvested from the ba...",,,
3,"Crabtree &amp; Evelyn Hand Soap, Gardeners, 10...",,"[B00004U9V2, B00GHX7H0A, B00FRERO7G, B00R68QXC...",15.99,B00005A77F,0,This liquid soap with convenient pump dispense...,15.99,,
4,Soy Milk Hand Crme,"[B000NZT6KM, B001BY229Q, B008J724QY, B0009YGKJ...",,18.00,B00005NDTD,42464,Remember why you love your favorite blanket? T...,18.00,26.0,15.0
...,...,...,...,...,...,...,...,...,...,...
12294,"CND Shellac Power Polish, Patina Buckle","[B003ONLAXQ, B00YDEZ9T6, B074KHRD13, B00R3PZK1...","[B00D2VMUA2, B074KJZJYW, B074KHRD13, B073SB9JW...",15.95,B01HIQIEYC,88740,", CND Craft Culture Collection: Patina Buckle,...",15.95,47.0,37.0
12295,CND Shellac power polish denim patch,"[B003ONLAXQ, B003OH0KBA, B004LEMWGG, B01MT91G4...","[B00D2VMUA2, B01L0EV8X2, B004LEMWGG, B00EFGDYZ...",15.95,B01HIQHQU0,122331,CND Shellac was designed to be used as a syste...,15.95,2.0,
12296,"CND Shellac, Leather Satchel","[B003ONLAXQ, B003OH0KBA, B004LEMWGG, B01MT91G4...","[B00D2VMUA2, B01L0EV8X2, B004LEMWGG, B00EFGDYZ...",15.95,B01HIQEOLO,168028,CND Shellac was designed to be used as a syste...,15.95,,
12297,"Juicy Couture I Love Juicy Couture, 1.7 fl. Oz...",,"[B0757439SY, B01HJ2UY1G, B01KX3TK7C, B01LX71LJ...",76.00,B01HJ2UY0W,490755,The I AM JUICY COUTURE girl is once again taki...,76.00,,6.0


In [261]:
#share of products with a missing price
pct_null_price = null_metadata["price"]/metadata_df.shape[0]
print(f"Price is missing from {pct_null_price:.2%} of products, which could be an issue.") 

Price is missing from 42.77% of products, which could be an issue.


In [262]:
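#products that never appear in another product's list get a count of 0
metadata_df["also_buy_counts"] = metadata_df["also_buy_counts"].fillna(0)
metadata_df["also_view_counts"] = metadata_df["also_view_counts"].fillna(0)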

In [263]:

pct_price = (metadata_df["price_USD"].isna().sum()/metadata_df.shape[0])*100
print(f"Price is still an issue with null values that are fairly high taking up {round(pct_price, 2)}% of the data set.  ")


Price is still an issue with null values that are fairly high taking up 43.43% of the data set.  


In [264]:
#dropping the raw list columns and both price columns
metadata_df = metadata_df.drop(columns=["also_buy", "also_view", "price", "price_USD"])


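`price` still has a high share of null values (43.43% of the dataset), so the price columns are dropped above, together with the raw `also_buy` and `also_view` lists, which are now summarised by their count columns.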

In [265]:
#checking for null values 
metadata_df.isna().sum()

title                    0
asin                     0
ranking                  0
product_description    150
also_buy_counts          0
also_view_counts         0
dtype: int64

Rows with a null `product_description` (150 of 12,299) are removed. 


In [266]:
metadata_df = metadata_df[metadata_df['product_description'].notna()]

In [267]:
metadata_df.isna().sum()

title                  0
asin                   0
ranking                0
product_description    0
also_buy_counts        0
also_view_counts       0
dtype: int64

In [268]:
metadata_df_clean = metadata_df.copy()

In [269]:
metadata_df_clean.head(10)

Unnamed: 0,title,asin,ranking,product_description,also_buy_counts,also_view_counts
0,Crabtree &amp; Evelyn - Gardener's Ultra-Moist...,B00004U9V2,4324,After a long day of handling thorny situations...,56.0,48.0
1,AHAVA Bath Salts,B0000531EN,1633549,If you haven't experienced the pleasures of ba...,0.0,0.0
2,"AHAVA Dead Sea Mineral Mud, 8.5 oz, Pack of 4",B0000532JH,1806710,"Rich, black mineral mud, harvested from the ba...",0.0,0.0
3,"Crabtree &amp; Evelyn Hand Soap, Gardeners, 10...",B00005A77F,0,This liquid soap with convenient pump dispense...,0.0,0.0
4,Soy Milk Hand Crme,B00005NDTD,42464,Remember why you love your favorite blanket? T...,26.0,15.0
5,"AHAVA Dermud Enriched Intensive Foot Cream, 4....",B00005R7ZZ,1527650,"Winter, summer, spring or fall, this soothing ...",0.0,0.0
6,"AHAVA Dermud Intensive Nourishing Hand Cream, ...",B00005R7ZY,1538330,Highly concentrated formula created to rejuven...,0.0,0.0
7,Supersmile Powdered Mouthrinse,B00005V50C,122723,<P><STRONG>Please note: Due to product improve...,6.0,0.0
8,Supersmile Professional Teeth Whitening Toothp...,B00005V50B,5522,"Created by Dr. Irwin Smigel, world-renowned ""F...",20.0,0.0
9,"Archipelago Morning Mint Body Lotion ,18 Fl Oz",B000066SYB,20146,Naturally stimulating essential oils make our ...,28.0,18.0


In [270]:
#matching several keywords in the title at once (str.contains is called only once)
#metadata_df_clean["title"].str.contains('|'.join(['Good', 'East']))

In [None]:
metadata_df_clean["title"][11]

'AHAVA Time Line Age Defying Continual Eye Treatment, .5 oz'

In [271]:
from collections import defaultdict
title_list = metadata_df_clean["title"].to_list()

temp = defaultdict(int)
 
# memorizing count
for title in title_list:
    for wrd in title.split():
        wrd = wrd.lower()
        
        clean_wrd = ""
        #searching through each letter and only accepting those in the alphabet
        for letter in wrd: 
            if letter >= 'a' and letter <= 'z': 
                clean_wrd = clean_wrd + letter
        temp[clean_wrd] += 1
 
# getting max frequency
res = max(temp, key=temp.get)
 
# printing result
print("Word with maximum frequency : " + str(res))

Word with maximum frequency : 
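
The most frequent "word" prints as blank because tokens with no letters (for example `8.5` or `4`) are reduced to an empty string and still counted. A minimal sketch of the same count with `collections.Counter`, skipping empty tokens, is below. The three titles are copied from the preview above; the names `word_counts` and `most_common_word` are illustrative.

```python
import re
from collections import Counter

# Sample titles copied from the metadata preview above.
titles = [
    "AHAVA Bath Salts",
    "AHAVA Dead Sea Mineral Mud, 8.5 oz, Pack of 4",
    "Soy Milk Hand Crme",
]

word_counts = Counter()
for title in titles:
    for raw in title.lower().split():
        # Keep only the letters a-z, as in the loop above.
        word = re.sub(r"[^a-z]", "", raw)
        if word:  # tokens such as "8.5" or "4" become "" and are skipped
            word_counts[word] += 1

most_common_word, most_common_count = word_counts.most_common(1)[0]
print(most_common_word, most_common_count)
```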


In [272]:
sorted_dict = sorted(temp.items(),key = lambda x:x[1], reverse = True)

In [276]:
sorted_dict_alpha = sorted(temp.items())



In [274]:

df_dict = pd.DataFrame(sorted_dict_alpha)
df_dict.to_csv("myfile.csv",index=False)


In [275]:
print(sorted_dict)

[('', 9037), ('oz', 5771), ('fl', 3483), ('cream', 1081), ('for', 1022), ('hair', 852), ('amp', 834), ('de', 793), ('eau', 770), ('and', 754), ('body', 748), ('spray', 715), ('skin', 681), ('ounce', 674), ('shampoo', 608), ('eye', 499), ('nail', 495), ('with', 447), ('oil', 433), ('gel', 432), ('conditioner', 432), ('lotion', 410), ('serum', 405), ('color', 392), ('set', 389), ('spf', 375), ('brush', 360), ('loccitane', 352), ('lip', 347), ('parfum', 340), ('toilette', 334), ('face', 327), ('the', 327), ('beauty', 324), ('of', 316), ('treatment', 301), ('black', 291), ('polish', 281), ('hand', 277), ('cleansing', 277), ('cleanser', 274), ('kit', 270), ('mask', 264), ('men', 264), ('butter', 259), ('facial', 252), ('free', 247), ('moisturizer', 241), ('dry', 232), ('balm', 221), ('wash', 209), ('in', 196), ('ahava', 195), ('care', 193), ('collection', 190), ('evelyn', 189), ('stila', 189), ('bliss', 189), ('professional', 188), ('crabtree', 187), ('glo', 186), ('london', 183), ('sunscre

In [None]:
lower_case = list((map(lambda x: x.lower(), metadata_df_clean["title"])))

In [None]:
df_title = pd.DataFrame(lower_case)

In [None]:
df_title.shape

(12149, 1)

In [385]:
regex_list = [r'\skin\B|moisturi[z|s](er|ing)|\bcleans.*(ing|r)\b|serum|mask|\bface\b(?!\w*sur)|facial|\bmasque\b|\bcreme\b|\btoner\b|acne|(?!hair spray)\b(face|facial) spray\b|butter|badescu|elemis|antiaging|makeup remover|body lotion|towelettes|occitane|\b(?<!sun )lotion\b|skincare|nuxe|ahava|crabtree|moisterizing|arden|jurlique|retinol|strivectin|\bremover\b|cream|bliss|\b(?<!hair )oil\b|scrub|acid|derm|skin|body|greyl|pureology|aloe|perricone|wipe|vichy|anti(?!per|oxid|friz)|age|borghese|hyd*|exfol|iredale|bisse|neova|squalane|ceut|neck|red flower' #skincare
              ,r'hair|condition|\b(moroccanoil|marulaoil|exquisiteoil|oilhair|shampoo|shampooing|hairspray)\b|(?!face|facial)\b(hair|detangling|texture|volumizing) spray\b|protectant|treatment|styling|lipstick|brightening|brow(?!n)|paul mitchell|\b(?<!mist )spray\b|mousse|hair oil|curl|texture|wax|oribe|amika|text*|mist|\bgel\b|rusk|sebastian|chi|rsquo|nioxin' #haircare 
              ,r'\w*(?<!s)nail|\bpolish\b(?!\w*ed|\w*er|\w*es|\w*ing\b)|cuticle|manicure|pedicure|crme|\bhand(s)?\b(?!bag)|\b(?<!lip )lacquer\b|\bopi\b|zoya|foot cream|cnd|coat' #nailcare
              ,r'soap(?!holder)|bath|shower(?!cap)|\bli(ps)?\b(?!philip)|\bwash(?!mouth)|salt|\bmud\b|rituals' #bath&shower
              ,r'shadow|beauty|\bli(ps)?\b(?!philip)|\bli(p)?\b(?!philip)|liner(?!clineral)|shadow(?!s)|balm|makeup|lip lacquer|foundation|cosmetics|pencil|\bgloss\b|\bpalette\b|concealer|\bprimer\b|mascara|rodial|\bmakeup\b(?! remover)|bronzing|bronzer|blush|powder|stila|cosmetic|lorac|\beye\b(?! makeup remover)|dermablend|jouer|highlighter|geller|tint|lash|bronze|contour|base|matte|berry|tonymoly|korres|lip' #makeup
              ,r'eau|toilette|parfum|perfume|fragrance|\bkenneth\b' #fragance
              ,r'\b(aftershave|shave)\b(?!r)|men|shaving|shave|deodorant|pomade|razor|perspirant|billy|klein' #menscare
              ,r'tweezer|brush|showercap|capsule|clip(s)|philip|iron|japonesque|candle|tool(s)|vitamin|\bdryer\b|ceramic|mouth|tooth|teeth|diffuser|babyliss|\btea\b|comb|temptu|white|mirror|bag|curler|ghd|stand|supersmile|ionic'#accessories 
              ,r'sun|spf|sunscreen|\btan\b|\btanning\b|\btanner\b|sun lotion' #suncare
              ,r'\bgift\b|\bset\b|\btravel\b|kit' #other
             ]
                
column_names = ['c_skincare','c_haircare','c_nailandhands','c_bath&shower','c_makeup','c_fragance','c_menscare','c_accessories','c_suncare','c_other']


In [386]:
#for loop to create columns 
for column,regex_search in zip(column_names, regex_list):
    
    metadata_df_clean[column] = metadata_df_clean["title"].str.contains(
    regex_search
    , regex = True
    , case=False )
    counts = metadata_df_clean[column].sum()
    print(f"{column} column created with {counts} matches")
    

  metadata_df_clean[column] = metadata_df_clean["title"].str.contains(


c_skincare column created with 6351 matches
c_haircare column created with 4092 matches
c_nailandhands column created with 1024 matches
c_bath&shower column created with 771 matches
c_makeup column created with 3205 matches
c_fragance column created with 1370 matches
c_menscare column created with 1445 matches
c_accessories column created with 1586 matches
c_suncare column created with 590 matches
c_other column created with 766 matches
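
Two regex details are worth checking in patterns like the ones above. Inside a character class, `|` is a literal character, so `[z|s]` also matches `|`, and `[zs]` is enough. pandas can also emit a `UserWarning` about match groups when a pattern passed to `str.contains` has capturing parentheses; non-capturing groups `(?:...)` avoid that. The sketch below uses made-up titles and a single simplified pattern, not the full lists above.

```python
import warnings

import pandas as pd

# Made-up titles for illustration.
titles = pd.Series([
    "Hydrating Face Moisturizer",
    "Moisturising Hand Cream",
    "Nail Clips",
])

# [zs] matches "z" or "s"; (?:...) groups alternatives without capturing.
pattern = r"moisturi[zs](?:er|ing)"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    is_moisturizer = titles.str.contains(pattern, regex=True, case=False)

print(is_moisturizer.tolist())
print([str(w.message) for w in caught])
```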


In [387]:
#creating new column for category 
#every c_ column needs an entry here, otherwise map() returns NaN for that category
mapping_dict = {False: 'none'
                , 'c_skincare': 'skincare'
                , 'c_haircare': "haircare"
                , 'c_nailandhands': "nails and hands"
                ,'c_bath&shower' : "bath and shower"
                ,'c_makeup' : "makeup"
                ,'c_fragance' : "fragance"
                ,'c_menscare' : "menscare"
                ,"c_accessories" : "accessories"
                , "c_suncare" : "suncare"
                , "c_other" : "other"
               }


df_cat = metadata_df_clean.filter(like='c_')
metadata_df_clean['category'] = (df_cat.idxmax(axis=1).map(mapping_dict)
                    .where(df_cat.any(axis=1), mapping_dict[False]))
print (metadata_df_clean['category'])

0               skincare
1               skincare
2               skincare
3               skincare
4        nails and hands
              ...       
12294    nails and hands
12295    nails and hands
12296    nails and hands
12297           fragance
12298           fragance
Name: category, Length: 12149, dtype: object
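
The category assignment relies on two behaviours that are easy to miss: `idxmax(axis=1)` returns the first `True` column in column order, so a product that matches several patterns gets the left-most category, and `map` returns `NaN` for any column name missing from the dictionary. A minimal sketch on a made-up three-row frame (the `flags` and `labels` names are illustrative):

```python
import pandas as pd

# Made-up boolean flags: row 0 matches two categories, row 2 matches none.
flags = pd.DataFrame({
    "c_skincare": [True, False, False],
    "c_haircare": [True, False, False],
    "c_menscare": [False, True, False],
})

# c_menscare is deliberately left out to show the effect of a missing key.
labels = {"c_skincare": "skincare", "c_haircare": "haircare"}

# First True column per row (falls back to the first column if none is True).
first_match = flags.idxmax(axis=1)

# Rows with no True flag are overwritten with "none".
category = first_match.map(labels).where(flags.any(axis=1), "none")
print(category)
```

Row 0 becomes `skincare` even though it also matches haircare, row 1 becomes `NaN` because its column has no label, and row 2 becomes `none`.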


In [388]:
metadata_null = metadata_df_clean[metadata_df_clean['category']=='none']
metadata_null

Unnamed: 0,title,asin,ranking,product_description,also_buy_counts,also_view_counts,c_skincare,c_haircare,c_nailandhands,c_bath&shower,c_makeup,c_fragance,c_menscare,c_accessories,c_suncare,category,c_other
675,"Replenix CF Antioxidant Formula with Caffeine,...",B000BYVBIU,93820,Replenix CF Serum is composed of 90% polypheno...,16.0,0.0,False,False,False,False,False,False,False,False,False,none,False
842,"L'ANZA Healing Volume Final Effects, 10.6 oz.",B000GI3TZO,29224,Bamboo Bodifying Complex and Keratin Healing S...,41.0,34.0,False,False,False,False,False,False,False,False,False,none,False
884,"MOP Lemongrass Lift, Citrus, 8.45 Fl Oz",B000H30EV0,682264,Voluptuous and radiant hair is achievable with...,0.0,0.0,False,False,False,False,False,False,False,False,False,none,False
925,"Cellex-C Enhancer G.L.A. Extra Moist, 2 oz",B000HX2KQW,101215,"G.L.A. Extra Moist is a rich, moisturizing cre...",23.0,8.0,False,False,False,False,False,False,False,False,False,none,False
933,"Cellex-C Betaplex Line Smoother, 1 oz",B000HX6K18,572652,"A crystal clear, oil-free gel containing a pow...",13.0,3.0,False,False,False,False,False,False,False,False,False,none,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12047,"18.21 Man Made Sweet Tobacco Spirits, 3.4 Fl Oz",B01ELWU4NU,37767,", Inspired by the roaring spirit of swanky Pro...",13.0,31.0,False,False,False,False,False,False,False,False,False,none,False
12055,Aromatherapy Associates Deep Relax Roller Ball,B01EM9XD6W,356441,"Let the restful powers of Vetivert, Chamomile ...",8.0,6.0,False,False,False,False,False,False,False,False,False,none,False
12062,Malibu C Blondes Enhancing Collection,B01EQM8TWS,45753,Malibu c Blondes Wellness Kit Shampoo and Cond...,10.0,11.0,False,False,False,False,False,False,False,False,False,none,False
12117,NEUMA neuMoisture Instant Fix,B01FL2GIM0,63192,", Replenishing hair spray for smooth, healthy,...",5.0,0.0,False,False,False,False,False,False,False,False,False,none,False


In [370]:
from collections import defaultdict
title_list = metadata_null["title"].to_list()

temp1 = defaultdict(int)
 
# memorizing count
for title in title_list:
    for wrd in title.split():
        wrd = wrd.lower()
        
        clean_wrd = ""
        #searching through each letter and only accepting those in the alphabet
        for letter in wrd: 
            if letter >= 'a' and letter <= 'z': 
                clean_wrd = clean_wrd + letter
        temp1[clean_wrd] += 1
 
# getting max frequency
res = max(temp1, key=temp1.get)
 
# printing result
print("Word with maximum frequency : " + str(res))

Word with maximum frequency : 


In [371]:
sorted_dict_alpha1 = sorted(temp1.items())
df_dict = pd.DataFrame(sorted_dict_alpha1)
df_dict.to_csv("myfile1.csv",index=False)


In [309]:
metadata_df_clean

Unnamed: 0,title,asin,ranking,product_description,also_buy_counts,also_view_counts,c_skincare,c_haircare,c_nailandhands,c_bath&shower,c_makeup,c_fragance,c_menscare,c_accessories,c_suncare,category
0,Crabtree &amp; Evelyn - Gardener's Ultra-Moist...,B00004U9V2,4324,After a long day of handling thorny situations...,56.0,48.0,False,False,True,False,False,False,False,False,False,nails and hands
1,AHAVA Bath Salts,B0000531EN,1633549,If you haven't experienced the pleasures of ba...,0.0,0.0,False,False,False,True,False,False,False,False,False,bath and shower
2,"AHAVA Dead Sea Mineral Mud, 8.5 oz, Pack of 4",B0000532JH,1806710,"Rich, black mineral mud, harvested from the ba...",0.0,0.0,False,False,False,False,False,False,False,False,False,none
3,"Crabtree &amp; Evelyn Hand Soap, Gardeners, 10...",B00005A77F,0,This liquid soap with convenient pump dispense...,0.0,0.0,False,False,True,True,False,False,False,False,False,nails and hands
4,Soy Milk Hand Crme,B00005NDTD,42464,Remember why you love your favorite blanket? T...,26.0,15.0,False,False,True,False,False,False,False,False,False,nails and hands
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12294,"CND Shellac Power Polish, Patina Buckle",B01HIQIEYC,88740,", CND Craft Culture Collection: Patina Buckle,...",47.0,37.0,False,False,True,False,False,False,False,False,False,nails and hands
12295,CND Shellac power polish denim patch,B01HIQHQU0,122331,CND Shellac was designed to be used as a syste...,2.0,0.0,False,False,True,False,False,False,False,False,False,nails and hands
12296,"CND Shellac, Leather Satchel",B01HIQEOLO,168028,CND Shellac was designed to be used as a syste...,0.0,0.0,False,False,False,False,False,False,False,False,False,none
12297,"Juicy Couture I Love Juicy Couture, 1.7 fl. Oz...",B01HJ2UY0W,490755,The I AM JUICY COUTURE girl is once again taki...,0.0,6.0,False,False,False,False,False,False,True,False,False,


In [None]:
#remove the c columns in one call
column_names = ['c_skincare','c_haircare','c_nailandhands','c_bath&shower','c_makeup','c_fragance','c_menscare','c_accessories','c_suncare']

metadata_df_clean = metadata_df_clean.drop(columns=column_names)

In [286]:
metadata_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12149 entries, 0 to 12298
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   title                12149 non-null  object 
 1   asin                 12149 non-null  object 
 2   ranking              12149 non-null  int32  
 3   product_description  12149 non-null  object 
 4   also_buy_counts      12149 non-null  float64
 5   also_view_counts     12149 non-null  float64
 6   c_skincare           12149 non-null  bool   
 7   c_haircare           12149 non-null  bool   
 8   c_nailandhands       12149 non-null  bool   
 9   c_bath&shower        12149 non-null  bool   
 10  c_makeup             12149 non-null  bool   
 11  c_fragance           12149 non-null  bool   
 12  c_menscare           12149 non-null  bool   
 13  c_accessories        12149 non-null  bool   
 14  c_suncare            12149 non-null  bool   
dtypes: bool(9), float64(2), int32(1), ob

In [298]:
metadata_df_clean

Unnamed: 0,title,asin,ranking,product_description,also_buy_counts,also_view_counts,c_skincare,c_haircare,c_nailandhands,c_bath&shower,c_makeup,c_fragance,c_menscare,c_accessories,c_suncare,category
0,Crabtree &amp; Evelyn - Gardener's Ultra-Moist...,B00004U9V2,4324,After a long day of handling thorny situations...,56.0,48.0,False,False,True,False,False,False,False,False,True,nails and hands
1,AHAVA Bath Salts,B0000531EN,1633549,If you haven't experienced the pleasures of ba...,0.0,0.0,False,False,False,True,False,False,False,False,True,bath and shower
2,"AHAVA Dead Sea Mineral Mud, 8.5 oz, Pack of 4",B0000532JH,1806710,"Rich, black mineral mud, harvested from the ba...",0.0,0.0,False,False,False,False,False,False,False,False,True,suncare
3,"Crabtree &amp; Evelyn Hand Soap, Gardeners, 10...",B00005A77F,0,This liquid soap with convenient pump dispense...,0.0,0.0,False,False,True,True,False,False,False,False,True,nails and hands
4,Soy Milk Hand Crme,B00005NDTD,42464,Remember why you love your favorite blanket? T...,26.0,15.0,False,False,True,False,False,False,False,False,True,nails and hands
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12294,"CND Shellac Power Polish, Patina Buckle",B01HIQIEYC,88740,", CND Craft Culture Collection: Patina Buckle,...",47.0,37.0,False,False,True,False,False,False,False,False,True,nails and hands
12295,CND Shellac power polish denim patch,B01HIQHQU0,122331,CND Shellac was designed to be used as a syste...,2.0,0.0,False,False,True,False,False,False,False,False,True,nails and hands
12296,"CND Shellac, Leather Satchel",B01HIQEOLO,168028,CND Shellac was designed to be used as a syste...,0.0,0.0,False,False,False,False,False,False,False,False,True,suncare
12297,"Juicy Couture I Love Juicy Couture, 1.7 fl. Oz...",B01HJ2UY0W,490755,The I AM JUICY COUTURE girl is once again taki...,0.0,6.0,False,False,False,False,False,False,True,False,True,


In [None]:
metadata_df_clean.isna().sum()

NameError: name 'metadata_df_clean' is not defined

lambda - 


#### To-do
- rank: only keep the number (done)
- description: take out of list (done)
- details: exploring (done)
- price: remove the $ and , and convert to float (done)
- remove duplicate rows
- figure out what to do with price null values
- work out the None error from inplace = True
- concat tables
- categorise
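
On the `inplace = True` item: a likely cause of a `None` error is assigning the result of a call that uses `inplace=True`, because such calls modify the frame and return `None`. A minimal sketch on a made-up frame (all names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"asin": ["A1", "A2"], "price": [1.0, 2.0]})

# With inplace=True the method changes df itself and returns None,
# so writing `df = df.drop(..., inplace=True)` would replace df with None.
result = df.drop(columns=["price"], inplace=True)
print(result)            # None
print(list(df.columns))  # ['asin']

# Without inplace, drop returns a new frame and leaves the original untouched.
df2 = pd.DataFrame({"asin": ["A1", "A2"], "price": [1.0, 2.0]})
dropped = df2.drop(columns=["price"])
print(list(dropped.columns))
```

Using either the assignment style or `inplace=True`, but not both on the same line, avoids the problem.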

Next: group by product, order the reviews by time, and see what that looks like (possibly with a window function).
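
A minimal sketch of what that could look like, assuming the `reviewTime` strings follow the `MM D, YYYY` pattern shown in the review preview. The toy rows and the `review_date`, `review_number` and `running_mean_rating` columns are made up for illustration.

```python
import pandas as pd

# Toy reviews; dates use the same "MM D, YYYY" layout as reviewTime.
reviews = pd.DataFrame({
    "asin": ["A1", "A1", "A2", "A1", "A2"],
    "reviewTime": ["06 15, 2010", "01 7, 2010", "04 18, 2018",
                   "03 2, 2011", "04 1, 2018"],
    "overall": [2.0, 5.0, 5.0, 4.0, 3.0],
})

# Parse the date, then order each product's reviews by time.
reviews["review_date"] = pd.to_datetime(reviews["reviewTime"], format="%m %d, %Y")
reviews = reviews.sort_values(["asin", "review_date"]).reset_index(drop=True)

# Window-style columns: position of the review within its product,
# and the running average rating up to and including that review.
reviews["review_number"] = reviews.groupby("asin").cumcount() + 1
reviews["running_mean_rating"] = (
    reviews.groupby("asin")["overall"].transform(lambda s: s.expanding().mean())
)
print(reviews)
```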

### Analysis 

In [None]:
#splitting out month, day and year from reviewTime (format is "MM DD, YYYY", e.g. "06 15, 2010")
clean_review_df['reviewTime'] = review_df['reviewTime'].str.replace(",","")
clean_review_df[['Month','Day',"Year"]] = clean_review_df["reviewTime"].str.split(" ", expand = True)
clean_review_df

In [None]:
#Looking at number of reviews over the years 
plt.figure(figsize=(10,8))
clean_review_df['Year'].value_counts().sort_index().plot()
plt.show()

In [None]:
#number of reviews per year, in chronological order
clean_review_df["Year"].value_counts().sort_index()

In [None]:
#unique product values 
unique_asin = clean_review_df["asin"].nunique()
print(unique_asin)

There are 12,120 unique asins in the data set. 

Next: use value_counts and look at the percentiles, e.g. 90% of the products have at most x reviews.
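
A minimal sketch of that percentile check on a made-up `asin` column (the data and the names `reviews_per_product` and `p90` are illustrative):

```python
import pandas as pd

# Made-up review rows: A1 has 5 reviews, A2 has 2, A3 and A4 have 1 each.
asins = pd.Series(["A1"] * 5 + ["A2"] * 2 + ["A3"] + ["A4"], name="asin")

# Number of reviews per product.
reviews_per_product = asins.value_counts()

# 90th percentile of the per-product review counts (linear interpolation).
p90 = reviews_per_product.quantile(0.9)

# Share of products whose review count is at or below that percentile.
share_at_or_below = (reviews_per_product <= p90).mean()
print(p90, share_at_or_below)
```

On the real data the same three lines would run on `clean_review_df["asin"]`.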

In [None]:
clean_review_df.info()

In [None]:
clean_review_df["asin"].value_counts()

In [None]:
#histogram of reviews 
binwidth = 50

plt.figure(figsize=(10,8))
asin_data = clean_review_df["asin"].value_counts()
plt.hist(asin_data , bins=np.arange(0,3500, binwidth))

plt.xlabel(f'Review count. Bin Width: {binwidth}')
plt.ylabel('Frequency')
plt.title('Distribution of Review counts in dataset')
plt.show()

- Consider removing products with fewer than a chosen minimum number of reviews.
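
One way to apply such a cut-off is sketched below. The threshold `min_reviews = 2` and the toy rows are hypothetical; the right threshold would come from the distribution above.

```python
import pandas as pd

# Made-up reviews: A1 has 3 reviews, A2 has 1, A3 has 2.
reviews = pd.DataFrame({
    "asin": ["A1"] * 3 + ["A2"] + ["A3"] * 2,
    "overall": [5, 4, 5, 1, 3, 4],
})

min_reviews = 2  # hypothetical threshold

# Count reviews per product and keep the products that reach the threshold.
counts = reviews["asin"].value_counts()
keep = counts[counts >= min_reviews].index
filtered = reviews[reviews["asin"].isin(keep)]
print(filtered)
```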

In [None]:
clean_review_df.info()

In [None]:
plot = clean_review_df.groupby("overall").count().reset_index()
plot = plot.rename(columns= {"overall":"Review Rating","verified":"Counts of Reviews"})

#plot structure
fig = px.bar(plot, 
             x = "Review Rating",
             y = "Counts of Reviews",
             title = "Amazon users are generous when they review, 65% of the dataset gave out a 5 star review",
             color = "Review Rating",
             color_continuous_scale="darkmint"
             
             )

fig.update_layout(coloraxis_showscale=False)

#plot 
fig.show()

In [None]:
357973/clean_review_df.shape[0]

NameError: name 'clean_review_df' is not defined

In [None]:
clean_review_df.shape

(539120, 11)

In [None]:
#calculating % of duplicates
clean_review_df.duplicated().sum() / clean_review_df.shape[0] *100

{'category': [], 'tech1': '', 'description': ['After a long day of handling thorny situations, our new hand therapy pump is just the help you need. It contains shea butter as well as extracts of yarrow, clover and calendula to help soothe and condition work-roughened hands.', 'By Crabtree & Evelyn', 'The aromatic benefits of herbs are varied and far-reaching, so we combined a whole bunch of them into one restoratively fragrant line-up straight from the garden.', 'We&#039;ve formulated our Gardeners Hand Therapy with Myrrh Extract to help condition nails and cuticles as well as skin super hydrators macadamia seed oil and shea butter to help replenish lost moisture. Rich in herbal extracts like cooling cucumber and rosemary leaf  a favourite for antioxidants  to help protect hands against daily urban and environmental stresses while the hydrating power of Vitamin E, Hyaluronic Acid and Ceramides contribute to improve the skins natural moisture barrier with this garden-inspired treatment.

In [None]:
#pd.concat only accepts join="inner" or "outer"; a left join on asin needs merge
df_combined = clean_review_df.merge(metadata_df_clean, on="asin", how="left")

NameError: name 'clean_review_df' is not defined

(12299, 19)
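
When joining reviews to product metadata on `asin`, it can help to check that each product appears only once in the metadata and to see which reviews found no match. A minimal sketch on made-up frames, using the `validate` and `indicator` options of `DataFrame.merge` (the frames and the `unmatched` name are illustrative):

```python
import pandas as pd

# Made-up reviews and product metadata; A9 has no metadata row.
reviews = pd.DataFrame({
    "asin": ["A1", "A1", "A2", "A9"],
    "overall": [5.0, 4.0, 3.0, 1.0],
})
products = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "category": ["skincare", "haircare", "makeup"],
})

# validate raises if asin is duplicated on the product side;
# indicator adds a _merge column showing where each row came from.
combined = reviews.merge(
    products, on="asin", how="left", validate="many_to_one", indicator=True
)

# Reviews whose product is missing from the metadata.
unmatched = combined[combined["_merge"] == "left_only"]
print(combined)
```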