# Amazon Fashion Apparel Recommendation With NLP and Deep Learning

In [276]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the dataset

In [277]:
data = pd.read_json("shirts_data.json")

In [278]:
print("No of Data Points : ",data.shape[0])
print("No of Features : ",data.shape[1])

No of Data Points :  183138
No of Features :  19


The dataset has 183,138 data points and 19 features.

## Overview of the Dataset

In [279]:
data.head()

Unnamed: 0,sku,asin,product_type_name,formatted_price,author,color,brand,publisher,availability,reviews,large_image_url,availability_type,small_image_url,editorial_review,title,model,medium_image_url,manufacturer,editorial_reivew
0,,B016I2TS4W,SHIRT,,,,FNC7C,,,"[False, https://www.amazon.com/reviews/iframe?...",https://images-na.ssl-images-amazon.com/images...,,https://images-na.ssl-images-amazon.com/images...,Minions Como Superheroes Ironman Women's O Nec...,Minions Como Superheroes Ironman Long Sleeve R...,,https://images-na.ssl-images-amazon.com/images...,,
1,,B01N49AI08,SHIRT,,,,FIG Clothing,,,"[False, https://www.amazon.com/reviews/iframe?...",https://images-na.ssl-images-amazon.com/images...,,https://images-na.ssl-images-amazon.com/images...,Sizing runs on the small side. FIG® recommends...,FIG Clothing Womens Izo Tunic,,https://images-na.ssl-images-amazon.com/images...,,
2,,B01JDPCOHO,SHIRT,,,,FIG Clothing,,,"[False, https://www.amazon.com/reviews/iframe?...",https://images-na.ssl-images-amazon.com/images...,,https://images-na.ssl-images-amazon.com/images...,Sizing runs on the small side. FIG® recommends...,FIG Clothing Womens Won Top,,https://images-na.ssl-images-amazon.com/images...,,
3,,B01N19U5H5,SHIRT,,,,Focal18,,,"[True, https://www.amazon.com/reviews/iframe?a...",https://images-na.ssl-images-amazon.com/images...,,https://images-na.ssl-images-amazon.com/images...,100% Brand New & Fashion<br> Quantity: 1 Piece...,Focal18 Sailor Collar Bubble Sleeve Blouse Shi...,,https://images-na.ssl-images-amazon.com/images...,,
4,,B004GSI2OS,SHIRT,$26.26,,Onyx Black/ Stone,FeatherLite,,Usually ships in 6-10 business days,"[False, https://www.amazon.com/reviews/iframe?...",https://images-na.ssl-images-amazon.com/images...,now,https://images-na.ssl-images-amazon.com/images...,,Featherlite Ladies' Long Sleeve Stain Resistan...,,https://images-na.ssl-images-amazon.com/images...,,


The preview shows all 19 features; several of them (for example `sku`, `author`, and `publisher`) are empty in these first five rows.

In [280]:
data.columns

Index(['sku', 'asin', 'product_type_name', 'formatted_price', 'author',
       'color', 'brand', 'publisher', 'availability', 'reviews',
       'large_image_url', 'availability_type', 'small_image_url',
       'editorial_review', 'title', 'model', 'medium_image_url',
       'manufacturer', 'editorial_reivew'],
      dtype='object')

## Viewing and selecting data

### - Which features are useful for our problem statement?

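We keep seven features: `asin` (Amazon Standard Identification Number), `brand`, `color`, `medium_image_url`, `product_type_name`, `title`, and `formatted_price`.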

In [281]:
data = data[['asin', 'brand', 'color', 'medium_image_url', 'product_type_name', 'title', 'formatted_price']]

In [282]:
data.head()

Unnamed: 0,asin,brand,color,medium_image_url,product_type_name,title,formatted_price
0,B016I2TS4W,FNC7C,,https://images-na.ssl-images-amazon.com/images...,SHIRT,Minions Como Superheroes Ironman Long Sleeve R...,
1,B01N49AI08,FIG Clothing,,https://images-na.ssl-images-amazon.com/images...,SHIRT,FIG Clothing Womens Izo Tunic,
2,B01JDPCOHO,FIG Clothing,,https://images-na.ssl-images-amazon.com/images...,SHIRT,FIG Clothing Womens Won Top,
3,B01N19U5H5,Focal18,,https://images-na.ssl-images-amazon.com/images...,SHIRT,Focal18 Sailor Collar Bubble Sleeve Blouse Shi...,
4,B004GSI2OS,FeatherLite,Onyx Black/ Stone,https://images-na.ssl-images-amazon.com/images...,SHIRT,Featherlite Ladies' Long Sleeve Stain Resistan...,$26.26


### - General core insights of the data

In [283]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183138 entries, 0 to 183137
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   asin               183138 non-null  object
 1   brand              182987 non-null  object
 2   color              64956 non-null   object
 3   medium_image_url   183138 non-null  object
 4   product_type_name  183138 non-null  object
 5   title              183138 non-null  object
 6   formatted_price    28395 non-null   object
dtypes: object(7)
memory usage: 9.8+ MB
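
The non-null counts above show that `color` and `formatted_price` are mostly empty. A minimal sketch of how the missing share per column could be computed is given below; the `toy` DataFrame is a made-up stand-in for `data`, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A4"],
    "color": ["Black", None, None, "White"],
    "formatted_price": [None, None, None, "$19.99"],
})

# isna() marks missing cells; the mean of those booleans is the missing fraction.
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)
```

Applied to `data`, the same expression would give one percentage per selected feature.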


### - The data type information of each column

In [284]:
data.dtypes

asin                 object
brand                object
color                object
medium_image_url     object
product_type_name    object
title                object
formatted_price      object
dtype: object

## Describing data

### 1. How many product types are there in total?

In [285]:
data['product_type_name'].describe()

count     183138
unique        72
top        SHIRT
freq      167794
Name: product_type_name, dtype: object

There are 72 unique product types in the dataset. SHIRT is the most frequent, with 167,794 of the 183,138 records (about 91.6%).

### The unique product types in the dataset are:

In [286]:
data['product_type_name'].unique()

array(['SHIRT', 'SWEATER', 'APPAREL', 'OUTDOOR_RECREATION_PRODUCT',
       'BOOKS_1973_AND_LATER', 'PANTS', 'HAT', 'SPORTING_GOODS', 'DRESS',
       'UNDERWEAR', 'SKIRT', 'OUTERWEAR', 'BRA', 'ACCESSORY',
       'ART_SUPPLIES', 'SLEEPWEAR', 'ORCA_SHIRT', 'HANDBAG',
       'PET_SUPPLIES', 'SHOES', 'KITCHEN', 'ADULT_COSTUME',
       'HOME_BED_AND_BATH', 'MISC_OTHER', 'BLAZER',
       'HEALTH_PERSONAL_CARE', 'TOYS_AND_GAMES', 'SWIMWEAR',
       'CONSUMER_ELECTRONICS', 'SHORTS', 'HOME', 'AUTO_PART',
       'OFFICE_PRODUCTS', 'ETHNIC_WEAR', 'BEAUTY',
       'INSTRUMENT_PARTS_AND_ACCESSORIES', 'POWERSPORTS_PROTECTIVE_GEAR',
       'SHIRTS', 'ABIS_APPAREL', 'AUTO_ACCESSORY', 'NONAPPARELMISC',
       'TOOLS', 'BABY_PRODUCT', 'SOCKSHOSIERY',
       'POWERSPORTS_RIDING_SHIRT', 'EYEWEAR', 'SUIT', 'OUTDOOR_LIVING',
       'POWERSPORTS_RIDING_JACKET', 'HARDWARE', 'SAFETY_SUPPLY',
       'ABIS_DVD', 'VIDEO_DVD', 'GOLF_CLUB', 'MUSIC_POPULAR_VINYL',
       'HOME_FURNITURE_AND_DECOR', 'TABLET_COMPUTER',

### What are the 10 most frequent product types?

In [287]:
from collections import Counter

In [288]:
n = 10
dfFrequent = data['product_type_name'].value_counts()[:n].index.tolist()

In [289]:
dfFrequent

['SHIRT',
 'APPAREL',
 'BOOKS_1973_AND_LATER',
 'DRESS',
 'SPORTING_GOODS',
 'SWEATER',
 'OUTERWEAR',
 'OUTDOOR_RECREATION_PRODUCT',
 'ACCESSORY',
 'UNDERWEAR']

The same ranking, with counts, can be obtained with `collections.Counter`:

In [290]:
product_count = Counter(list(data['product_type_name']))
product_count.most_common(10)

[('SHIRT', 167794),
 ('APPAREL', 3549),
 ('BOOKS_1973_AND_LATER', 3336),
 ('DRESS', 1584),
 ('SPORTING_GOODS', 1281),
 ('SWEATER', 837),
 ('OUTERWEAR', 796),
 ('OUTDOOR_RECREATION_PRODUCT', 729),
 ('ACCESSORY', 636),
 ('UNDERWEAR', 425)]
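
Both approaches used above should agree. The sketch below checks this on a small made-up Series (`toy_types` is a hypothetical stand-in for `data['product_type_name']`), assuming there are no ties among the top entries.

```python
from collections import Counter

import pandas as pd

# Made-up product types; SHIRT dominates, as in the real data.
toy_types = pd.Series(["SHIRT", "SHIRT", "SHIRT", "APPAREL", "APPAREL", "DRESS"])

# pandas: value_counts() sorts by frequency in descending order.
top_vc = toy_types.value_counts().head(2)

# Standard library: Counter.most_common(n) returns (value, count) pairs.
top_counter = Counter(toy_types).most_common(2)

# Convert the pandas result to plain (value, count) pairs for comparison.
vc_pairs = [(name, int(count)) for name, count in top_vc.items()]
print(vc_pairs == top_counter)
```

`value_counts()` keeps the result as a Series, which is convenient for further pandas operations, while `Counter` returns plain Python tuples.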

### 2.  What are the unique colors in the dataset?

In [291]:
data['color'].describe()

count     64956
unique     7380
top       Black
freq      13207
Name: color, dtype: object

- Only 64,956 of the 183,138 products have color information.
- The most frequent color is Black, with 13,207 records.

### What are the top 5 colors from the dataset?

In [292]:
color_count = Counter(list(data['color']))
type(color_count)

collections.Counter

In [293]:
color_counter = color_count.most_common(5)

In [294]:
color_counter[1:]

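[('Black', 13207), ('White', 8616), ('Blue', 3570), ('Red', 2289)]

The most common entry returned by `most_common(5)` is `None` (the missing values), so the `[1:]` slice leaves only four actual colors.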
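
To list five actual colors, missing values can be excluded before counting. A minimal sketch, using a made-up Series (`toy_colors`) in place of `data['color']`:

```python
import pandas as pd

# Made-up colors with several missing entries.
toy_colors = pd.Series(
    ["Black", None, "White", None, "Black", None, "Blue", "Black", "White"]
)

# value_counts() ignores missing values by default (dropna=True),
# so the result contains only real colors.
top_colors = toy_colors.value_counts().head(2)
print(top_colors)
```

On the real data, `data['color'].value_counts().head(5)` would be the corresponding call.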

### What percentage of data is missing from the colors feature?

118,182 of the 183,138 records (about 64.5%) have no color value:

In [295]:
color_counter[0]

(None, 118182)
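
The percentage follows directly from the two counts reported in this notebook. A short worked example:

```python
# Counts taken from the outputs above.
missing_colors = 118_182  # records where color is None
total_rows = 183_138      # data.shape[0]

# Share of records without a color value, as a percentage.
missing_pct = 100 * missing_colors / total_rows
print(f"{missing_pct:.1f}% of records have no color")
```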

### 3. Description of the price attribute in the dataset

In [296]:
print(data['formatted_price'])

0           None
1           None
2           None
3           None
4         $26.26
           ...  
183133    $14.58
183134      None
183135      None
183136    $44.99
183137      None
Name: formatted_price, Length: 183138, dtype: object


In [297]:
print(data['formatted_price'].describe())

count      28395
unique      3135
top       $19.99
freq         945
Name: formatted_price, dtype: object


Only 28,395 of the 183,138 products have price information; the remaining entries are null.

The most frequent price is $19.99, which occurs 945 times.
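
Because `formatted_price` is stored as strings, `describe()` reports only counts and the most frequent value. One possible way to get numeric statistics is to strip the currency symbol first. The sketch below runs on a made-up Series (`toy_price`); handling a thousands separator is an assumption about the format, not something shown in the data above.

```python
import pandas as pd

# Toy stand-in for data['formatted_price']; the values are illustrative only.
toy_price = pd.Series([None, "$26.26", "$19.99", "$19.99"])

# Remove "$" and "," (the comma is an assumed thousands separator), then
# convert to float; missing or unparseable entries become NaN.
price_num = pd.to_numeric(
    toy_price.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

print(price_num.describe())
```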

### 4. How many unique brands are there in the dataset?

In [298]:
print(data['brand'])

0                 FNC7C
1          FIG Clothing
2          FIG Clothing
3               Focal18
4           FeatherLite
              ...      
183133        TOOGOO(R)
183134       VOGUE CODE
183135         Wrangler
183136    susana monaco
183137         Sexybaby
Name: brand, Length: 183138, dtype: object


In [299]:
print (data['brand'].describe())

count     182987
unique     10577
top         Zago
freq         223
Name: brand, dtype: object


Of the 183,138 data points, 182,987 have a brand name (151 are missing).

There are 10,577 unique brands.

The most frequent brand is 'Zago', with 223 products.


### What are the 10 most common brands in the dataset?

In [300]:
brand_count = Counter(list(data['brand']))
brand_count.most_common(10)

[('Zago', 223),
 ('XQS', 222),
 ('Yayun', 215),
 ('YUNY', 198),
 ('XiaoTianXin-women clothes', 193),
 ('Generic', 192),
 ('Boohoo', 190),
 ('Alion', 188),
 ('Abetteric', 187),
 ('TheMogan', 187)]

The top 10 brands are close in size, ranging from 187 to 223 products each.

### 5.  General description of title attribute

In [301]:
print(data['title'])

0         Minions Como Superheroes Ironman Long Sleeve R...
1                             FIG Clothing Womens Izo Tunic
2                               FIG Clothing Womens Won Top
3         Focal18 Sailor Collar Bubble Sleeve Blouse Shi...
4         Featherlite Ladies' Long Sleeve Stain Resistan...
                                ...                        
183133    TOOGOO(R) Women's Tops Spring Autumn Casual Pu...
183134    VOGUE CODE Vintage V Neck Plaid Shirt Sleevele...
183135    Wrangler George Strait For Her Long Sleeve Pin...
183136    Susana Monaco Womens Susana Monoco Sleeveless ...
183137    SexyBaby Women's Mesh Splive Flounced Sleeve C...
Name: title, Length: 183138, dtype: object


In [302]:
print(data['title'].describe())

count                                                183138
unique                                               175985
top       Nakoda Cotton Self Print Straight Kurti For Women
freq                                                     77
Name: title, dtype: object


All 183,138 products have a title, so this feature has no missing values. Only 175,985 of the titles are unique, which means some titles are shared by several products.

The most common title is 'Nakoda Cotton Self Print Straight Kurti For Women', which appears 77 times.
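
The gap between the total and unique counts (183,138 - 175,985 = 7,153) is the number of rows whose title already appeared earlier. A minimal sketch of how such repeats could be counted, on a made-up DataFrame (`toy_titles` is hypothetical):

```python
import pandas as pd

# Made-up titles; "kurti a" appears three times.
toy_titles = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A4"],
    "title": ["kurti a", "kurti a", "tunic b", "kurti a"],
})

# duplicated() flags every occurrence after the first one.
n_repeats = int(toy_titles["title"].duplicated().sum())
n_unique = toy_titles["title"].nunique()

print(n_repeats, n_unique)
```

The number of repeats always equals the row count minus the number of unique titles, which matches the arithmetic above.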

## Manipulating data

### Converting title to lower case

In [304]:
data['title'] = data['title'].str.lower()

In [305]:
data

Unnamed: 0,asin,brand,color,medium_image_url,product_type_name,title,formatted_price
0,B016I2TS4W,FNC7C,,https://images-na.ssl-images-amazon.com/images...,SHIRT,minions como superheroes ironman long sleeve r...,
1,B01N49AI08,FIG Clothing,,https://images-na.ssl-images-amazon.com/images...,SHIRT,fig clothing womens izo tunic,
2,B01JDPCOHO,FIG Clothing,,https://images-na.ssl-images-amazon.com/images...,SHIRT,fig clothing womens won top,
3,B01N19U5H5,Focal18,,https://images-na.ssl-images-amazon.com/images...,SHIRT,focal18 sailor collar bubble sleeve blouse shi...,
4,B004GSI2OS,FeatherLite,Onyx Black/ Stone,https://images-na.ssl-images-amazon.com/images...,SHIRT,featherlite ladies' long sleeve stain resistan...,$26.26
...,...,...,...,...,...,...,...
183133,B01MSALTSO,TOOGOO(R),Black,https://images-na.ssl-images-amazon.com/images...,OUTERWEAR,toogoo(r) women's tops spring autumn casual pu...,$14.58
183134,B015W98YQK,VOGUE CODE,Monochrome Plaid,https://images-na.ssl-images-amazon.com/images...,SHIRT,vogue code vintage v neck plaid shirt sleevele...,
183135,B075756PGC,Wrangler,Pink,https://images-na.ssl-images-amazon.com/images...,SHIRT,wrangler george strait for her long sleeve pin...,
183136,B074L8FVTT,susana monaco,Rose,https://images-na.ssl-images-amazon.com/images...,SHIRT,susana monaco womens susana monoco sleeveless ...,$44.99


### Dropping records with null values of price and color