Python for ML
ML libraries built for Python:

1. NumPy: efficiently loads and holds large datasets in memory
2. SciPy: numerical routines and optimization, useful for computing the features that determine which products to recommend to users
3. pandas: represents data as a spreadsheet-like table that can be edited and used in calculations

In [1]:
# Import Libraries
import os
import pandas as pd
import numpy as np

In [2]:
# Get the current working directory and locate the datasets folder
project_root = os.getcwd()
datasets = os.path.join(project_root, "datasets")

# Read column 1 as str to avoid mixed types
dtype_dict = {1: str}

# Load the data
product_info = pd.read_csv(os.path.join(datasets, "product_info.csv"), index_col="product_id")
#product_info = pd.read_csv(os.path.join(datasets, "product_info.csv"))
reviews_250 = pd.read_csv(os.path.join(datasets, "reviews_0-250.csv"), dtype=dtype_dict)
reviews_500 = pd.read_csv(os.path.join(datasets, "reviews_250-500.csv"), dtype=dtype_dict)
reviews_750 = pd.read_csv(os.path.join(datasets, "reviews_500-750.csv"), dtype=dtype_dict)
reviews_1250 = pd.read_csv(os.path.join(datasets, "reviews_750-1250.csv"), dtype=dtype_dict)
reviews_end = pd.read_csv(os.path.join(datasets, "reviews_1250-end.csv"), dtype=dtype_dict)

# Review Contents
print("product_info shape:", product_info.shape)
print("reviews_250 shape:", reviews_250.shape)
print("reviews_500 shape:", reviews_500.shape)
print("reviews_750 shape:", reviews_750.shape)
print("reviews_1250 shape:", reviews_1250.shape)
print("reviews_end shape:", reviews_end.shape)

# Combine the five review files into one DataFrame with a fresh 0..n-1 index
all_reviews = pd.concat([
    reviews_250, reviews_500, reviews_750,
    reviews_1250, reviews_end], ignore_index=True)

product_info shape: (8494, 26)
reviews_250 shape: (602130, 19)
reviews_500 shape: (206725, 19)
reviews_750 shape: (116262, 19)
reviews_1250 shape: (119317, 19)
reviews_end shape: (49977, 19)
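
The five review files share a naming pattern, so the loading step could also be written as a loop. The following is a minimal sketch, not the approach used in this notebook: `load_review_parts` is a hypothetical helper, and the `reviews_*.csv` pattern is an assumption based on the file names above.

```python
import glob
import os

import pandas as pd


def load_review_parts(directory, pattern="reviews_*.csv", dtype=None):
    """Read every CSV in `directory` matching `pattern` and stack the rows.

    Hypothetical helper for illustration; files are read in sorted
    (lexical) path order, which may differ from the order used above.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory!r}")
    frames = [pd.read_csv(path, dtype=dtype) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

With the variables from the cell above, this would be called as `load_review_parts(datasets, dtype=dtype_dict)`.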


In [3]:
unique_count = product_info.index.nunique()
print(unique_count)

8494


In [4]:
unique_products = product_info['product_name'].nunique()
print(unique_products)

8415


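The product_info dataset has 8,494 rows and 8,494 unique values in the product_id index, so there are no duplicate product IDs. The product_name column has only 8,415 unique values, which means some names are shared by more than one product_id.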
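
The uniqueness check can also be made explicit instead of comparing counts by eye. The sketch below uses a small made-up frame (`toy_products` is not the real data) with `Index.is_unique` and `Series.duplicated`.

```python
import pandas as pd

# Toy stand-in for product_info; the values are invented for illustration.
toy_products = pd.DataFrame(
    {"product_name": ["Lip Balm", "Lip Balm", "Face Mist"]},
    index=pd.Index(["P1", "P2", "P3"], name="product_id"),
)

# True when no product_id appears more than once in the index
ids_are_unique = toy_products.index.is_unique

# Rows whose product_name is shared with at least one other product_id
shared_names = toy_products[toy_products["product_name"].duplicated(keep=False)]

print(ids_are_unique)
print(shared_names)
```

Applied to the real `product_info`, the second expression would list the products that share a name.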

In [5]:
# Check null values
print("Product Info Missing Values:\n", product_info.isnull().sum())
print("\nAll Reviews Missing Values:\n", all_reviews.isnull().sum())

Product Info Missing Values:
 product_name             0
brand_id                 0
brand_name               0
loves_count              0
rating                 278
reviews                278
size                  1631
variation_type        1444
variation_value       1598
variation_desc        7244
ingredients            945
price_usd                0
value_price_usd       8043
sale_price_usd        8224
limited_edition          0
new                      0
online_only              0
out_of_stock             0
sephora_exclusive        0
highlights            2207
primary_category         0
secondary_category       8
tertiary_category      990
child_count              0
child_max_price       5740
child_min_price       5740
dtype: int64

All Reviews Missing Values:
 Unnamed: 0                       0
author_id                        0
rating                           0
is_recommended              167988
helpfulness                 561592
total_feedback_count             0
total_neg_feedb
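
Raw missing counts are easier to compare when shown as a share of all rows. A minimal sketch, assuming a hypothetical helper named `missing_summary` and a made-up `toy` frame:

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing counts and percentages per column, largest first.

    Hypothetical helper for illustration; columns without missing
    values are left out of the result.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "rating": [5, 4, np.nan, 3],
    "size": [np.nan, np.nan, np.nan, "1 oz"],
    "price_usd": [10.0, 12.5, 8.0, 30.0],
})
print(missing_summary(toy))
```

The same function could be called on `product_info` or `all_reviews` to rank their columns by how incomplete they are.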

In [6]:
pd.set_option('display.max_columns', None)

In [7]:
product_info.head(1)

product_id,product_name,brand_id,brand_name,loves_count,rating,reviews,size,variation_type,variation_value,variation_desc,ingredients,price_usd,value_price_usd,sale_price_usd,limited_edition,new,online_only,out_of_stock,sephora_exclusive,highlights,primary_category,secondary_category,tertiary_category,child_count,child_max_price,child_min_price
P473671,Fragrance Discovery Set,6342,19-69,6320,3.6364,11.0,,,,,"['Capri Eau de Parfum:', 'Alcohol Denat. (SD A...",35.0,,,0,0,1,0,0,"['Unisex/ Genderless Scent', 'Warm &Spicy Scen...",Fragrance,Value & Gift Sets,Perfume Gift Sets,0,,


In [8]:
all_reviews.head(1)

,Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,0,1741593524,5,1.0,1.0,2,0,2,2023-02-01,I use this with the Nudestix “Citrus Clean Bal...,Taught me how to double cleanse!,,brown,dry,black,P504322,Gentle Hydra-Gel Face Cleanser,NUDESTIX,19.0


We will work with the following variables to build the recommendation model and improve its predictive power. These columns help the recommendation system understand user preferences and product characteristics, and none of them has missing values. The two datasets can be joined on `product_id`.

#### Product Info
    product_id
    product_name
    primary_category


#### Reviews
    author_id
    product_id
    rating
    brand_name
    product_name
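
Because `product_id` appears in both datasets, review rows can be enriched with product attributes through a join. The sketch below uses made-up frames (`toy_products`, `toy_reviews`) that mirror the layout above, with `product_id` as the index on the product side.

```python
import pandas as pd

# Toy stand-ins for product_info and all_reviews; values are invented.
toy_products = pd.DataFrame(
    {
        "product_name": ["Cleanser", "Perfume"],
        "primary_category": ["Skincare", "Fragrance"],
    },
    index=pd.Index(["P1", "P2"], name="product_id"),
)
toy_reviews = pd.DataFrame({
    "author_id": ["a1", "a2", "a1"],
    "product_id": ["P1", "P1", "P2"],
    "rating": [5, 4, 3],
})

# A left join keeps every review. right_index=True matches the reviews'
# product_id column against the product_id index of the product table.
# validate raises an error if a product_id is duplicated on the product side.
merged = toy_reviews.merge(
    toy_products[["primary_category"]],
    how="left",
    left_on="product_id",
    right_index=True,
    validate="many_to_one",
)
print(merged)
```

Only `primary_category` is pulled in here because the review data already carries its own `product_name` column.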

In [9]:
product_info[['product_name', 'primary_category']]

product_id,product_name,primary_category
P473671,Fragrance Discovery Set,Fragrance
P473668,La Habana Eau de Parfum,Fragrance
P473662,Rainbow Bar Eau de Parfum,Fragrance
P473660,Kasbah Eau de Parfum,Fragrance
P473658,Purple Haze Eau de Parfum,Fragrance
...,...,...
P467659,Couture Clutch Eyeshadow Palette,Makeup
P500874,L'Homme Eau de Parfum,Fragrance
P504428,Mon Paris Eau de Parfum Gift Set,Fragrance
P504448,Y Eau de Parfum Gift Set,Fragrance


In [10]:
all_reviews[['author_id', 'product_id', 'rating']]

,author_id,product_id,rating
0,1741593524,P504322,5
1,31423088263,P420652,1
2,5061282401,P420652,5
3,6083038851,P420652,5
4,47056667835,P420652,5
...,...,...,...
1094406,2276253200,P505392,5
1094407,28013163278,P505392,5
1094408,1539813076,P505392,5
1094409,5595682861,P505392,5
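
These (author_id, product_id, rating) triples are the usual raw input for a recommendation model. One possible next step, shown here only as a sketch on made-up data and not as this notebook's method, is to pivot them into a user-by-product rating table:

```python
import pandas as pd

# Toy stand-in for all_reviews; the values are invented for illustration.
toy_reviews = pd.DataFrame({
    "author_id": ["a1", "a2", "a1", "a3"],
    "product_id": ["P1", "P1", "P2", "P2"],
    "rating": [5, 4, 3, 1],
})

# One row per user, one column per product. Repeated reviews of the same
# product by the same user are averaged; unrated pairs become NaN.
user_item = toy_reviews.pivot_table(
    index="author_id",
    columns="product_id",
    values="rating",
    aggfunc="mean",
)
print(user_item)
```

A dense table like this is only practical for small samples. With over a million reviews, a sparse representation would likely be needed instead.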
