# Notebook 01: Problem Statement and Data Cleaning

This notebook details the problem statement and cleans/saves the datasets.

## Problem Statement

[CTE Watch Company](https://www.cte-watches.com/who-we-are) is a South Florida based  company that distributes fashion and household items to smaller retailers in the Caribbean islands, Cruise ships, Duty-free/Travel Retail, Latin America & North American markets.

Success is determined by how well CTE manages inventory to optimize warehouse space and minimize the purchase of unsuccessful products. CTE typically purchases directly from the brands on a quarterly basis, which they then go on to sell to the smaller retailers. They must  predict how much they can sell to the smaller retailers in order to place quarterly brand orders. Typically, CTE makes their predictions by calculating a 6-month rolling average and placing an order to make sure they have enough inventory for 3 months. If it is a new model, they must make a guess based on the attributes of the watch. This is a very manual process and requires the 30-year expertise of the owner to be done well. Additionally, there are many different types of products within each brand (ex. 300 different watch models for 1 brand), and there are approximately 40 brands.

In this project, I aim to build a model that accurately and more quickly forecasts order quantities for both new and existing models for one watch brand. The model will predict watch sales based on historical sales (if available) and the individual model attributes (style, gender, color, price, material, etc). Different types of models will be explored to adequately and appropriately accomplish this goal (forecasting, regression, classification). This predictions will be validated against the owner's purchase order for the next quarter during a meeting in early June.

In [353]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

## Data Cleaning

### Watch Flow Chart

This dataset contains CTE's records of each model's sales for the brand since 2015 (when they started distributing the brand).

In [354]:
df_watch_chart = pd.read_excel('../data/WATCH FLOW CHART 0416.xlsx',header=1)

# Drop unnecessary/blank columns
df_watch_chart.drop(columns=['IMAGE','MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC','Unnamed: 16','Unnamed: 29'],index=0,inplace=True)

# Standardize column names
df_watch_chart.columns = df_watch_chart.columns.astype(str).str.strip().str.lower().str.replace(' ','_')
df_watch_chart.columns = ['style_id', 'retail_price', 'upc_num', 'description','collection', 'case', 'gender',
       'status', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022-01',
       '2022-02', '2022-03', '2022-04', 'qt_sales_order', 'total_sales_2022', 'qt_on_hand',
       'monthly_avg_2022']

# Drop null row
df_watch_chart.dropna(subset='style_id', inplace=True)

# Convert UPC number to str (categorical)
df_watch_chart.loc[:,'upc_num'] = df_watch_chart.loc[:,'upc_num'].astype(int).astype(str)

# Identify and deal with null values
# Retail price
df_watch_chart.dropna(subset='retail_price', inplace=True)
# Collection
df_watch_chart.loc[646,'collection'] = 'FORRESTER CHRONO'
df_watch_chart.loc[786,'collection'] = 'JACQUELINE'
# Case
df_watch_chart.loc[646,'case'] = '46MM'
df_watch_chart.loc[786,'case'] = '36MM'
# Sales years
df_watch_chart[['2015','2016','2017','2018','2019','2020','2021','2022-01',
       '2022-02', '2022-03', '2022-04']] = df_watch_chart[['2015','2016','2017','2018','2019','2020','2021','2022-01',
       '2022-02', '2022-03', '2022-04']].fillna(0)
# Status
df_watch_chart.fillna('No status reported',inplace=True)

In [355]:
df_watch_chart.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1588 entries, 1 to 1590
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   style_id          1588 non-null   object 
 1   retail_price      1588 non-null   float64
 2   upc_num           1588 non-null   object 
 3   description       1588 non-null   object 
 4   collection        1588 non-null   object 
 5   case              1588 non-null   object 
 6   gender            1588 non-null   object 
 7   status            1588 non-null   object 
 8   2015              1588 non-null   float64
 9   2016              1588 non-null   float64
 10  2017              1588 non-null   float64
 11  2018              1588 non-null   float64
 12  2019              1588 non-null   float64
 13  2020              1588 non-null   float64
 14  2021              1588 non-null   float64
 15  2022-01           1588 non-null   float64
 16  2022-02           1588 non-null   float64


In [356]:
# Save to csv
df_watch_chart.to_csv('../cleaned_datasets/watch_flow_chart.csv',index=False)

### Brand Proposal

This dataset contain's the brand's current collection and CTE's order on April 16, 2022.

In [357]:
df_proposal = pd.read_excel('../data/PROPOSAL APRIL 16.xlsx')

# Drop unnecessary/blank columns
df_proposal.drop(columns=['IMAGE','Unnamed: 12','Unnamed: 14','Unnamed: 17'],inplace=True)

# Drop rows with total sheet calculations
df_proposal.dropna(subset='Material',inplace=True)

# Standardize column names
df_proposal.columns = df_proposal.columns.astype(str).str.strip().str.lower().str.replace(' ','_')
df_proposal.columns = ['style_id', 'upc_num', 'priority', 'retail_price', 'gender', 'collection', 'case',
       'availability', 'qty_on_hand', 'qty_on_order', 'qty_total', 'qty_sales_order', 'qty_sold_last_6m',
       'qty_avg/mo', 'months_of_supply', 'wholesale_price', 'cte_cost', 'qty_cte_cost']

# Convert UPC number to str (categorical)
df_proposal.loc[:,'upc_num'] = df_proposal.loc[:,'upc_num'].astype(int).astype(str)

In [358]:
# Data quality checks
df_proposal['qty_total_dup'] = df_proposal['qty_on_hand'] + df_proposal['qty_on_order']
(df_proposal['qty_total_dup'] == df_proposal['qty_total']).value_counts()

True    300
dtype: int64

In [359]:
df_proposal['qty_avg/mo_dup'] = df_proposal['qty_sold_last_6m']/6
(df_proposal['qty_avg/mo_dup'] == df_proposal['qty_avg/mo']).value_counts()

True    300
dtype: int64

In [360]:
df_proposal['months_of_supply_dup'] = [(df_proposal.loc[x,'qty_total_dup'] + df_proposal.loc[x,'qty_sales_order'])/df_proposal.loc[x,'qty_avg/mo_dup'] if df_proposal.loc[x,'qty_avg/mo_dup'] != 0 else np.nan for x in df_proposal.index]
(df_proposal['months_of_supply_dup'] == df_proposal['months_of_supply']).value_counts()

True     216
False     84
dtype: int64

In [361]:
(df_proposal['months_of_supply_dup'].isna() == df_proposal['months_of_supply'].isna()).value_counts()

True    300
dtype: int64

In [362]:
df_proposal['cte_cost_dup'] = df_proposal['wholesale_price']*0.7
(df_proposal['cte_cost_dup'] == df_proposal['cte_cost']).value_counts()

True    300
dtype: int64

In [363]:
# Drop all duplicated columns
df_proposal.drop(columns=['qty_total_dup','qty_avg/mo_dup','months_of_supply_dup','cte_cost_dup'],inplace=True)

In [364]:
df_proposal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300 entries, 0 to 299
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   style_id          300 non-null    object 
 1   upc_num           300 non-null    object 
 2   priority          300 non-null    object 
 3   retail_price      300 non-null    float64
 4   gender            300 non-null    object 
 5   collection        300 non-null    object 
 6   case              300 non-null    object 
 7   availability      300 non-null    object 
 8   qty_on_hand       300 non-null    float64
 9   qty_on_order      300 non-null    object 
 10  qty_total         300 non-null    float64
 11  qty_sales_order   300 non-null    float64
 12  qty_sold_last_6m  300 non-null    object 
 13  qty_avg/mo        300 non-null    float64
 14  months_of_supply  216 non-null    object 
 15  wholesale_price   300 non-null    float64
 16  cte_cost          300 non-null    float64
 1

In [365]:
# Save to csv
df_proposal.to_csv('../cleaned_datasets/proposal.csv',index=False)

### All Watches from Brand Website

This dataset contains all watches from the brand's website, and includes more attributes.

In [366]:
df_website = pd.read_excel('../data/ALL ITEMS FROM WEBSITE.xlsx')

# Drop unnecessary columns
df_website.drop(columns=['u_brand','u_category','name'],inplace=True)

# Standardize column names
df_website.columns = df_website.columns.astype(str).str.strip().str.lower().str.replace(' ','_').str.replace('u_','').str.replace('ws_','')
df_website.columns = ['style_id', 'upc_num', 'gender', 'collection', 'size', 'description','wholesale_price', 'retail_price',
       'product_type', 'product_websites', 'weight', 'color',
       'country_of_origin', 'warranty', 'band_color', 'band_material',
       'case_material', 'clasp_type', 'crystal_type', 'dial_color',
       'movement_type', 'water_resistant', 'max_cart_qty']

# Convert UPC number to str (categorical)
df_website.loc[:,'upc_num'] = df_website.loc[:,'upc_num'].astype(int).astype(str)

# Identify and deal with null values
# band color
df_website['band_color'] = df_website['band_color'].fillna('No color reported')
# clasp type
df_website['clasp_type'] = df_website['clasp_type'].fillna('No clasp type reported')
# movement type
df_website['movement_type'] = df_website['movement_type'].fillna('No movement type reported')
# water resistance, fill with 10000 since most likelye
df_website['max_cart_qty'] = df_website['max_cart_qty'].fillna(10000)

# Dropping the following columns because their values are all the same
df_website.drop(columns=['product_type','product_websites','weight','crystal_type'],inplace=True)

In [367]:
df_website.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1195 entries, 0 to 1194
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   style_id           1195 non-null   object 
 1   upc_num            1195 non-null   object 
 2   gender             1195 non-null   object 
 3   collection         1195 non-null   object 
 4   size               1195 non-null   int64  
 5   description        1195 non-null   object 
 6   wholesale_price    1195 non-null   float64
 7   retail_price       1195 non-null   int64  
 8   color              1195 non-null   object 
 9   country_of_origin  1195 non-null   object 
 10  warranty           1195 non-null   object 
 11  band_color         1195 non-null   object 
 12  band_material      1195 non-null   object 
 13  case_material      1195 non-null   object 
 14  clasp_type         1195 non-null   object 
 15  dial_color         1195 non-null   object 
 16  movement_type      1195 

In [368]:
# Save to csv
df_website.to_csv('../cleaned_datasets/website.csv',index=False)

### New Models Q3

This dataset contains the new models that will debut in Q3 2022.

In [369]:
df_q3_new = pd.read_excel('../data/new models H3.xlsx',header = 1)

# Drop empty rows (NaN)
df_q3_new.dropna(how = 'all',inplace=True)

# Drop unnecessary columns
df_q3_new.drop(columns=['STATUS'],inplace=True)

# Standardize column names
df_q3_new.columns = df_q3_new.columns.astype(str).str.strip().str.lower().str.replace(' ','_')
df_q3_new.columns = ['style_id', 'collection', 'movement_type', 'retail_price', 'color', 'band_color', 'case',
       'lug_width', 'gender']

In [370]:
df_q3_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47 entries, 1 to 48
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   style_id       47 non-null     object 
 1   collection     47 non-null     object 
 2   movement_type  47 non-null     object 
 3   retail_price   47 non-null     float64
 4   color          47 non-null     object 
 5   band_color     47 non-null     object 
 6   case           47 non-null     object 
 7   lug_width      47 non-null     object 
 8   gender         47 non-null     object 
dtypes: float64(1), object(8)
memory usage: 3.7+ KB


In [371]:
# save to csv
df_q3_new.to_csv('../cleaned_datasets/q3_new_models.csv', index=False)

### New Models Q4

This dataset contains the new models that will debut in Q4 2022.

In [372]:
df_q4_new = pd.read_excel('../data/new models H4.xlsx')

# Drop empty rows (NaN)
df_q4_new.dropna(how = 'all',inplace=True)

# Drop unnecessary columns
df_q4_new.drop(columns=['Unnamed: 9'],inplace=True)

# Infer column names
df_q4_new.columns = ['style_id', 'collection', 'movement_type', 'retail_price', 'color', 'band_color', 'case',
       'lug_width', 'gender']

# Identify and deal with null values
# price - will deal with later
# lug width
df_q4_new['lug_width'] = df_q4_new['lug_width'].fillna('No lug width reported')

In [373]:
df_q4_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36 entries, 0 to 36
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   style_id       36 non-null     object 
 1   collection     36 non-null     object 
 2   movement_type  36 non-null     object 
 3   retail_price   33 non-null     float64
 4   color          36 non-null     object 
 5   band_color     36 non-null     object 
 6   case           36 non-null     object 
 7   lug_width      36 non-null     object 
 8   gender         36 non-null     object 
dtypes: float64(1), object(8)
memory usage: 2.8+ KB


In [374]:
# save to csv
df_q4_new.to_csv('../cleaned_datasets/q4_new_models.csv', index=False)

### Holiday items - new items Q4

This dataset contains the new models that will debut in Q4 2022. It was determined to be a repeat of New Models Q4.

In [375]:
# df_holiday_new = pd.read_excel('../data/HOLIDAY NEW ITEMS 05-04-2022.xlsx')

# # Drop unnecessary columns
# df_holiday_new.drop(columns=['DESCRIPTION','SEASON'],inplace=True)

# # Standardize column names
# df_holiday_new.columns = df_holiday_new.columns.astype(str).str.strip().str.lower().str.replace(' ','_').str.replace('u_','').str.replace('ws_','')
# df_holiday_new.columns = ['style_id', 'collection', 'gender', 'wholesale_price', 'retail_price', 'movement_type',
#        'color', 'band_color', 'case']

# # deal with retail_price later

# df_holiday_new.info()

# # save to csv
# df_holiday_new.to_csv('../cleaned_datasets//holiday_q4_new_models.csv',index=False)

## Merging Datasets

### Merge old/current models from watch flow chart and website

In [376]:
df_watch_chart.shape

(1588, 23)

In [377]:
df_proposal.shape

(300, 18)

In [378]:
df_website.shape

(1195, 19)

In [379]:
df_models = pd.merge(left = df_watch_chart, right = df_website, how = 'outer', on = 'style_id', sort=True)

#### Validation of data

In [380]:
df_models.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1724 entries, 0 to 1723
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   style_id           1724 non-null   object 
 1   retail_price_x     1588 non-null   float64
 2   upc_num_x          1588 non-null   object 
 3   description_x      1588 non-null   object 
 4   collection_x       1588 non-null   object 
 5   case               1588 non-null   object 
 6   gender_x           1588 non-null   object 
 7   status             1588 non-null   object 
 8   2015               1588 non-null   float64
 9   2016               1588 non-null   float64
 10  2017               1588 non-null   float64
 11  2018               1588 non-null   float64
 12  2019               1588 non-null   float64
 13  2020               1588 non-null   float64
 14  2021               1588 non-null   float64
 15  2022-01            1588 non-null   float64
 16  2022-02            1588 

In [381]:
# check the status of models where there was no match on the website
df_models[df_models['retail_price_y'].isna()]['status'].value_counts()

DISCD                 478
No status reported     50
2018H3                  1
Name: status, dtype: int64

In [382]:
# check the 2021 sales of models where there was no match on the website
df_models[df_models['retail_price_y'].isna()]['2021'].value_counts()

0.0    529
Name: 2021, dtype: int64

It seems that the 2018H3 is now discontinued and the other watches had no sales in 2021

In [383]:
# check that status is only past watches
df_models['status'].value_counts()
# it is not obvious that the "No status reported" wawtches are old, current, or new models

DISCD                 690
No status reported    330
2019H4                 68
2020H1                 60
2020H3                 56
2020H2                 49
2019H3                 49
2021H3                 46
2021H4                 45
2019H2                 38
2022H1                 31
2021H1                 29
2021H2                 27
2018H4                 25
2019H1                 21
2018H3                  8
2020H4                  6
2022H2                  6
2018                    3
0                       1
Name: status, dtype: int64

#### Check where the retail prices differ between datasets

In [384]:
price_discrepancy = df_models[(df_models['retail_price_x'] != df_models['retail_price_y'])][['style_id','upc_num_x','status','retail_price_x','retail_price_y']].dropna()
price_discrepancy

Unnamed: 0,style_id,upc_num_x,status,retail_price_x,retail_price_y
0,AM4141,691464216092,No status reported,129.0,110.0
9,AM4532,796483064850,DISCD,125.0,149.0
17,BQ1010,796483055803,No status reported,125.0,129.0
18,BQ1130,796483065345,DISCD,135.0,139.0
19,BQ3115,796483250796,No status reported,135.0,139.0
...,...,...,...,...,...
1713,ME3208,796483555082,2021H4,219.0,240.0
1714,ME3209,796483555112,2021H4,249.0,260.0
1715,ME3210,796483555099,2021H4,249.0,250.0
1716,ME3211,796483555549,2021H4,189.0,210.0


In [385]:
# This is a tough decision... however for now, let's replace the prices with the website retail prices
# Rationale: If we use the price in the model to predict sales, the new model prices will not be scaled to the
# previous model prices

df_models['retail_price'] = [x if str(y) == 'nan' else y for x, y in zip(df_models['retail_price_x'],df_models['retail_price_y'])]

In [386]:
df_models[['retail_price_x','retail_price_y','retail_price']]

Unnamed: 0,retail_price_x,retail_price_y,retail_price
0,129.0,110.0,110.0
1,129.0,129.0,129.0
2,149.0,,149.0
3,165.0,,165.0
4,169.0,,169.0
...,...,...,...
1719,210.0,210.0,210.0
1720,,280.0,280.0
1721,,280.0,280.0
1722,,260.0,260.0


In [387]:
df_models['retail_price'].isna().value_counts()

False    1724
Name: retail_price, dtype: int64

In [388]:
df_models.drop(columns=['retail_price_x','retail_price_y'],inplace=True)

#### Check where the UPC num differ between datasets

In [389]:
upc_discrepancy = df_models[(df_models['upc_num_x'] != df_models['upc_num_y'])][['style_id','upc_num_x','upc_num_y','status']].dropna()
upc_discrepancy

Unnamed: 0,style_id,upc_num_x,upc_num_y,status
495,ES4393,757697665530,796483387720,No status reported
496,ES4394,757697665547,796483387737,No status reported
497,ES4396,757697665523,796483387713,DISCD
500,ES4403,757697665820,796483388727,DISCD
503,ES4408,757697667459,796483396975,DISCD
504,ES4409,757697667442,796483396968,DISCD
509,ES4414,757697666704,796483396821,DISCD
513,ES4422,79648339644,796483396944,DISCD
534,ES4446,757697670503,796483415676,2018H4
545,ES4468,757697672897,796483419629,DISCD


In [390]:
df_models[['upc_num_x','upc_num_y']]

Unnamed: 0,upc_num_x,upc_num_y
0,691464216092,691464216092
1,691464267100,691464267100
2,796483009837,
3,796483009844,
4,796483009851,
...,...,...
1719,796483561380,796483561380
1720,,796483561878
1721,,796483561861
1722,,796483561885


In [391]:
# check if the UPC number for models in the flow chart can be found in the website for a different watch
for upc_num_x in upc_discrepancy['upc_num_x']:
    if upc_num_x in df_models['upc_num_y'].values:
        print(upc_num_x)

796483033290


In [392]:
# After examining the upc 796483033290, it seems like the flow chart upc num can be replaced.
#df_models[(df_models['upc_num_x'] == '796483033290') | (df_models['upc_num_y'] == '796483033290')].T

In [393]:
# I don't want to keep the upc number for now, so will drop them. However, if this changes in the future,
# I would want to replace the watch flow chart UPC numbers with the website #s and fill in the rest with the website #s.
df_models.drop(columns=['upc_num_x','upc_num_y'], inplace=True)

#### Check where the Collection differ between datasets

In [394]:
# Lots of discrepancies because of capitalization. Let's capitalize all values and reevaluate
df_models['collection_x'] = df_models['collection_x'].astype(str).apply(str.capitalize).str.strip().replace('Nan',np.nan)
df_models['collection_y'] = df_models['collection_y'].astype(str).apply(str.capitalize).str.strip().replace('Nan',np.nan)

In [395]:
collection_discrepancy = df_models[(df_models['collection_x'] != df_models['collection_y'])][['style_id','collection_x','collection_y','status']].dropna()
collection_discrepancy

Unnamed: 0,style_id,collection_x,collection_y,status
0,AM4141,Colleague,Serena,No status reported
5,AM4508,Colleague,Serena,DISCD
17,BQ1010,Madeline,Rhett,No status reported
18,BQ1130,Madeline,Flynn,DISCD
19,BQ3115,Madeline,Suitor,No status reported
...,...,...,...,...
1697,ME3189,Carlie mini,Carlie mini me,2020H3
1698,ME3190,Fb,Fb - 01 automatic,2020H3
1699,ME3191,Fb-01,Fb - 01 automatic,2020H3
1700,ME3195,Neutra,Neutra automatic,No status reported


Lots of errors or small differences in naming on the CTE dataset. Will totally replace the CTE collections with those in the website.

In [396]:
df_models['collection'] = [x if str(y) == 'nan' else y for x, y in zip(df_models['collection_x'],df_models['collection_y'])]

In [397]:
df_models[['collection_x','collection_y','collection']]

Unnamed: 0,collection_x,collection_y,collection
0,Colleague,Serena,Serena
1,Colleague,Colleague,Colleague
2,Cecile,,Cecile
3,Cecile,,Cecile
4,Cecile,,Cecile
...,...,...,...
1719,Stella,Stella,Stella
1720,,Bronson,Bronson
1721,,Bronson,Bronson
1722,,Bronson,Bronson


In [398]:
df_models['collection'].isna().value_counts()

False    1724
Name: collection, dtype: int64

In [399]:
df_models.drop(columns=['collection_x','collection_y'],inplace=True)

In [400]:
df_models[df_models['collection'].str.contains('Georgia')]

Unnamed: 0,style_id,description_x,case,gender_x,status,2015,2016,2017,2018,2019,...,band_color,band_material,case_material,clasp_type,dial_color,movement_type,water_resistant,max_cart_qty,retail_price,collection
138,ES2830,FOSSIL / LADIES / GEORGIA,32MM,LADIES,No status reported,19.0,52.0,37.0,114.0,101.0,...,Brown,Leather,Stainless Steel,No clasp type reported,Brown,Quartz,50m - 160ft - 5atm,10000.0,120.0,Georgia
140,ES3060,FOSSIL / LADIES / GEORGIA,32MM,LADIES,No status reported,20.0,53.0,73.0,120.0,114.0,...,Brown,Leather,Stainless Steel,No clasp type reported,White,Quartz,50m - 160ft - 5atm,10000.0,120.0,Georgia
141,ES3077,FOSSIL / LADIES / GEORGIA,32MM,LADIES,No status reported,32.0,69.0,132.0,208.0,68.0,...,Grey,Leather,Stainless Steel,No clasp type reported,Black,Quartz,50m - 160ft - 5atm,10000.0,140.0,Georgia
142,ES3199,FOSSIL / LADIES / GEORGIA,GEORGIA,LADIES,DISCD,19.0,41.0,9.0,0.0,0.0,...,,,,,,,,,85.0,Georgia
146,ES3225,FOSSIL / LADIES / GEORGIA,32MM,LADIES,DISCD,39.0,66.0,32.0,2.0,0.0,...,,,,,,,,,125.0,Georgia
147,ES3226,FOSSIL / LADIES / GEORGIA,32MM,LADIES,No status reported,29.0,61.0,45.0,119.0,56.0,...,Rose Gold,Stainless Steel,Stainless Steel,No clasp type reported,Rose Gold,Quartz,50m - 160ft - 5atm,10000.0,149.0,Georgia
148,ES3262,FOSSIL / LADIES / GEORGIA,26MM,LADIES,DISCD,23.0,40.0,67.0,66.0,39.0,...,Brown,Leather,Stainless Steel,No clasp type reported,Rose Gold,Quartz,50m - 160ft - 5atm,10000.0,125.0,Georgia
149,ES3264,FOSSIL / LADIES / GEORGIA,GEORGIA,LADIES,DISCD,24.0,32.0,9.0,0.0,0.0,...,,,,,,,,,105.0,Georgia
150,ES3268,FOSSIL / LADIES / GEORGIA,26MM,LADIES,DISCD,39.0,59.0,38.0,81.0,20.0,...,Rose Gold,Stainless Steel,Stainless Steel,Deployment,Rose Gold,Quartz,30m - 100ft - 3atm,10000.0,129.0,Georgia
151,ES3269,FOSSIL / LADIES / GEORGIA,26MM,LADIES,DISCD,44.0,81.0,22.0,41.0,50.0,...,Silver,Stainless Steel,Stainless Steel,Fold-Over,Silver,Quartz,50m - 160ft - 5atm,10000.0,105.0,Georgia


In [401]:
# Check if there are collections that could be combined
collections = df_models['collection'].unique()
collections.sort()
collections

array(['44mm townsman', '48mm townsman', 'Abilene', 'Annette', 'Arc - 01',
       'Arc-02', 'Arc-03', 'Atwater', 'Avondale', 'Barstow',
       'Barstow automatic', 'Batman heritage le', 'Batman le', 'Belmar',
       'Belmar multifunction', 'Blake', 'Blane', 'Blythe',
       'Bowman chrono', 'Bronson', 'Bronson twist', 'Caiden', 'Camile',
       'Carbon series', 'Carlie', 'Carlie mini', 'Carlie mini me',
       'Carlie mini v-day', 'Carlie set', 'Cecile', 'Chapman', 'Chase',
       'Chase automatic', 'Chase timer', 'Chelsey',
       'Classic minute glitz', 'Classic minute w gli', 'Classics',
       'Coachman', 'Colleague', 'Color undertones', 'Commuter',
       'Copeland', 'Crewmaster', 'Daily', 'Daisy', 'Daisy 3 hand',
       'Dayliner', 'Deam', 'Dean', 'Decker', 'Del rey', 'Df-01',
       'Dillinger', 'Drifter', 'Earth day watch', 'Everett',
       'Everett 3 hand', 'Everett 3h', 'Everett chronograph',
       'Everett solar digital', 'Everyday muse', 'Fb - 01',
       'Fb - 01 automat

In [402]:
collections_replace = {'44mm townsman':'Townsman','48mm townsman':'Townsman','Classic minute w gli':'Classic minute glitz',
                       'Everett 3 hand':'Everett 3h', 'Forrester auto':'Forrester automatic', 'Fb-01 chrono':'Fb - 01 chrono',
                       'Fb-01':'Fb - 01', 'Georgia 26mm':'Georgia small','Deam':'Dean',
                       'Commuter':'The commuter','Minimalist':"The minimalist",'Minimalist chrono':'The minimalist chrono',
                       'Modern persuit':'Modern pursuit','Obf':'Original boyfriend','Mens other':'Other - mens watch',
                       'Tailor 35mm':'Tailor','The commuter 3 hand/date':'The commuter 3h date','mens other':'Other - mens watch',
                       'The commuter auto':'The commuter automatic'
                      }
fixed_collections = df_models['collection'].replace(collections_replace).unique()
fixed_collections.sort()
fixed_collections

array(['Abilene', 'Annette', 'Arc - 01', 'Arc-02', 'Arc-03', 'Atwater',
       'Avondale', 'Barstow', 'Barstow automatic', 'Batman heritage le',
       'Batman le', 'Belmar', 'Belmar multifunction', 'Blake', 'Blane',
       'Blythe', 'Bowman chrono', 'Bronson', 'Bronson twist', 'Caiden',
       'Camile', 'Carbon series', 'Carlie', 'Carlie mini',
       'Carlie mini me', 'Carlie mini v-day', 'Carlie set', 'Cecile',
       'Chapman', 'Chase', 'Chase automatic', 'Chase timer', 'Chelsey',
       'Classic minute glitz', 'Classics', 'Coachman', 'Colleague',
       'Color undertones', 'Copeland', 'Crewmaster', 'Daily', 'Daisy',
       'Daisy 3 hand', 'Dayliner', 'Dean', 'Decker', 'Del rey', 'Df-01',
       'Dillinger', 'Drifter', 'Earth day watch', 'Everett', 'Everett 3h',
       'Everett chronograph', 'Everett solar digital', 'Everyday muse',
       'Fb - 01', 'Fb - 01 automatic', 'Fb - 01 chrono', 'Fb - 02',
       'Fb - 03', 'Fb-01 automatic', 'Fb-01 mini', 'Fb-adventure',
       'Flynn', 

##### Previous scratchwork

In [403]:

# collection_name_dict = dict(set(zip(collection_discrepancy['collection_x'],collection_discrepancy['collection_y'])))

# # Major discrepancy at index 32. I will replace the collection and gender with that of the website
# # df_models[(df_models['collection_x'] == 'The minimalist 3h') & (df_models['collection_y'] == 'Karli')].T
# df_models.loc[32,:] = ['BQ3440', 'Karli', '34MM', 'Ladies', 'DISCD', 0.0, 0.0,
#        0.0, 0.0, 0.0, 25.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
#        'Ladies', 'Karli', 34.0, 69.5, 'MOP', 'Japan - JP',
#        '2 - Year International Limited Warranty', 'Black',
#        'Stainless Steel', 'Stainless Steel', 'Fold-Over',
#        'Mother Of Pearl', 'Quartz', '50m - 160ft - 5atm', 10000.0, 139.0]

# # Major discrepancy at index 29. I will replace the collection and gender with that of the website
# #df_models[(df_models['collection_x'] == 'Dean') & (df_models['collection_y'] == 'Suitor')].T
# df_models.loc[29,:] = ['BQ3423', 'Suitor', '36MM', 'Ladies', 'No status reported', 0.0, 0.0,
#        0.0, 0.0, 0.0, 48.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
#        'Ladies', 'Suitor', 36.0, 79.5, 'MOP', 'Japan - JP',
#        '2 - Year International Limited Warranty', 'Rose Gold',
#        'Stainless Steel', 'Stainless Steel', 'Fold-Over',
#        'Mother Of Pearl', 'Quartz', '50m - 160ft - 5atm', 10000.0, 159.0]

# # Major discrepancy at index 28. I will replace the collection and gender with that of the website
# #df_models[(df_models['collection_x'] == 'Grant') & (df_models['collection_y'] == 'Karli')].T
# df_models.loc[28,:] = ['BQ3422', 'Karli', '34MM', 'Ladies', 'No status reported', 0.0, 0.0,
#        0.0, 0.0, 0.0, 50.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
#        'Ladies', 'Karli', 34.0, 64.5, 'MOP', 'Japan - JP',
#        '2 - Year International Limited Warranty', 'Rose Gold',
#        'Stainless Steel', 'Stainless Steel', 'Fold-Over',
#        'Mother Of Pearl', 'Quartz', '50m - 160ft - 5atm', 10000.0, 129.0]

# # Major discrepancy at index 26 I will replace the collection and gender with that of the website
# #df_models[(df_models['collection_x'] == 'Machine') & (df_models['collection_y'] == 'Suitor mini')].T
# df_models.loc[26,:] = ['BQ3334', 'Suitor mini', '26MM', 'Ladies', 'No status reported', 0.0, 0.0,
#        0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
#        'Ladies', 'Suitor mini', 26.0, 69.5, 'MOP', 'Japan - JP',
#        '2 - Year International Limited Warranty', 'Gold',
#        'Stainless Steel', 'Stainless Steel', 'Fold-Over',
#        'Mother Of Pearl', 'Quartz', '50m - 160ft - 5atm', 10000.0, 139.0]

# # Major discrepancy at index 18 I will replace the collection and gender with that of the website
# # df_models[(df_models['collection_x'] == 'Madeline') & (df_models['collection_y'] == 'Flynn')].T
# df_models.loc[26,:] = ['BQ1130', 'Flynn', '48MM', 'Mens', 'DISCD', 0.0, 0.0, 0.0,
#        0.0, 0.0, 49.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
#        'Mens', 'Flynn', 48.0, 69.5, 'Black', 'Japan - JP',
#        '2 - Year International Limited Warranty', 'Black', 'Leather',
#        'Stainless Steel', 'Buckle', 'Black', 'Quartz',
#        '50m - 160ft - 5atm', 10000.0, 139.0]

# collection_replace_dict = {'Fb-01': 'Stella', # confirmed typo
#  'Daisy 3hand': 'Carlie', # confirmed typo
#  'Carlie': 'Carlie mini',
#  'Fb - 02': 'Garrett', # confirmed typo
#  'Townsman box set': 'Townsman',
#  'Jacqueline': 'Stella', # confirmed typo
#  'Fb -05': 'Fb - 01',
#  'Townsman':'Townsman', #'Townsman': 'Luxe leisure', # typo on the website
#  'Mini': 'Carlie mini',
#  'Micro': 'Scarlette micro',
#  'The minimalist moonphase': 'The minimalist',
#  'Everett 3 hand': 'Everett',
#  'Fb -03': 'Fb - 01',
#  'Colleague': 'Serena',
#  'The': 'The andy and addison set',
#  'neutra auto': 'Neutra automatic',
#  'Copeland 42mm': 'Copeland',
#  'Arc': 'Arc-03',
#  'Retro': 'Retro pilot',
#  'Tailor': 'Tailor 35mm',
#  'The minimalist 3h': 'Karli', # major discrepancy - replaced collection and gender of the watch
#  'Grant sport automatic': 'Grant sport',
#  'The minimalist':'The minimalist', #'The minimalist': 'Color undertones', # seems unneccessarily, will keep the same
#  'Neutra chrono':'Neutra chrono', #'Neutra chrono': 'Luxe leisure', # typo on the website
#  'Dean': 'Suitor', # major discrepancy - replaced collection and gender of the watch
#  'Limited': 'Forrester',# more specific
#  'chase timer': 'Chase',
#  'Fb-03': 'Fb - 03',
#  'Fb -02': 'Fb - 01',
#  'Carlie mini': 'Carlie mini v-day',
#  'H date 42mm': 'The commuter', # confirmed typo
#  'Georgia\xa026mm': 'Georgia',
#  'The commuter': 'The commuter 3h',
#  'Fb': 'Fb-adventure',
#  'Stella': 'Kalya', # confirmed typo
#  'The commuter 3h':'The commuter 3h', #'The commuter 3h': 'The commuter 3 hand/date', # seems to be same category
#  'Scarlette': 'Scarlette micro',
#  'Neutra 3': 'Neutra 3h',
#  'Mathis 3h': 'Mathis',
#  'Tailor mini': 'Tailor',
#  'the essentialist': 'The essentialist',
#  'Garrett': 'Fb - 02', # confirmed typo
#  'The essentialist': 'Luxe leisure', # typo on the website
#  'Chase timer': 'Chase',
#  'Bronson': 'Bronson twist',
#  'Tailor me': 'Tailor',
#  'Original boyfriend':'Original boyfriend', # 'Original boyfriend': 'Obf', # not consistent
#  'Grant': 'Karli', # major discrepancy - replaced collection and gender of the watch
#  'Machine': 'Suitor mini', # major discrepancy - replaced collection and gender of the watch
#  'Forrester chrono': 'Fb - 02', # confirmed typo
#  'Lyric': 'Suitor mini', # major discrepancy - replaced collection
#  'Kalya': 'Blythe', # major discrepancy - replaced collection 
#  'Jacqueline box set': 'Jacqueline set',
#  'the minimalist solar': 'The minimalist solar',
#  'Mega': 'Mega machine',
#  'Forrester auto': 'Forrester automatic',
#  'jacqueline': 'Jacqueline',
#  'Belmar multifunction': 'Belmar',
#  'grant 44mm': 'Grant',
#  'Izzy': 'Daisy 3 hand', # confirmed typo
#  'Madeline': 'Flynn', # major discrepancy - replaced collection and gender of the watch
#  'Neely': 'Typographer', # major discrepancy - replaced collection
#  'machine': 'Machine',
#  'Arc-01': 'Arc - 01',
#  'Neutra': 'Neutra automatic',
#  'timer 42mm': 'Chase timer',
#  'Fb -01': 'Fb - 01',
#  'goodwin chrono': 'Goodwin',
#  'Forrester': 'Forrester chrono',
#  'Daisy': 'Daisy 3 hand',
#  'Earth': 'Earth day watch',
#  'the minimalist 3h': 'The minimalist',
#  'Sadie': 'Sadie multifunction',
#  'Georgia': 'Georgia small',
#  'Fb -04': 'Fb - 01'# confirmed typo
#                           }

# collection_discrepancy.replace(collection_replace_dict)

# collection_discrepancy

# collection_replace_dict['Colleague']

# # for i in collection_discrepancy.index:
# #     key = df_models.loc[i,'collection_x']
# #     df_models.loc[i,'collection_x'] = collection_replace_dict[key]

#### Check where Gender differs between datasets

In [404]:
# Lots of discrepancies because of capitalization. Let's capitalize all values and reevaluate
df_models['gender_x'] = df_models['gender_x'].astype(str).apply(str.capitalize).str.strip().replace('Nan',np.nan)
df_models['gender_y'] = df_models['gender_y'].astype(str).apply(str.capitalize).str.strip().replace('Nan',np.nan)

In [405]:
gender_discrepancy = df_models[(df_models['gender_x'] != df_models['gender_y'])][['style_id','gender_x','gender_y','status']].dropna()
gender_discrepancy

Unnamed: 0,style_id,gender_x,gender_y,status
17,BQ1010,Ladies,Mens,No status reported
18,BQ1130,Ladies,Mens,DISCD
26,BQ3334,Men,Ladies,No status reported
27,BQ3407,Men,Ladies,No status reported
28,BQ3422,Men,Ladies,No status reported
...,...,...,...,...
1711,ME3206,Gents,Mens,2021H4
1712,ME3207,Gents,Mens,2021H4
1713,ME3208,Gents,Mens,2021H4
1714,ME3209,Gents,Mens,2021H4


In [406]:
gender_discrepancy_list = list(set(zip(gender_discrepancy['gender_x'],gender_discrepancy['gender_y'])))
gender_discrepancy_list

[('Ladies', 'Mens'),
 ('Gents', 'Mens'),
 ('Men', 'Ladies'),
 ('Unisex', 'Mens'),
 ('Men', 'Mens')]

In [407]:
collection_and_gender_discrepancy = gender_discrepancy.merge(collection_discrepancy,left_index = True, right_index=True)
collection_and_gender_discrepancy

Unnamed: 0,style_id_x,gender_x,gender_y,status_x,style_id_y,collection_x,collection_y,status_y
17,BQ1010,Ladies,Mens,No status reported,BQ1010,Madeline,Rhett,No status reported
18,BQ1130,Ladies,Mens,DISCD,BQ1130,Madeline,Flynn,DISCD
26,BQ3334,Men,Ladies,No status reported,BQ3334,Machine,Suitor mini,No status reported
27,BQ3407,Men,Ladies,No status reported,BQ3407,Machine,Suitor,No status reported
28,BQ3422,Men,Ladies,No status reported,BQ3422,Grant,Karli,No status reported
...,...,...,...,...,...,...,...,...
1522,FS5929,Gents,Mens,2022H1,FS5929,Fb-01,Fb - 01,2022H1
1569,LE1132,Gents,Mens,No status reported,LE1132,Retro,Retro pilot,No status reported
1658,ME3140,Men,Mens,No status reported,ME3140,Grant sport automatic,Grant sport,No status reported
1667,ME3154,Men,Mens,No status reported,ME3154,Townsman,48mm townsman,No status reported


In [408]:
collection_and_gender_discrepancy[0:35]

Unnamed: 0,style_id_x,gender_x,gender_y,status_x,style_id_y,collection_x,collection_y,status_y
17,BQ1010,Ladies,Mens,No status reported,BQ1010,Madeline,Rhett,No status reported
18,BQ1130,Ladies,Mens,DISCD,BQ1130,Madeline,Flynn,DISCD
26,BQ3334,Men,Ladies,No status reported,BQ3334,Machine,Suitor mini,No status reported
27,BQ3407,Men,Ladies,No status reported,BQ3407,Machine,Suitor,No status reported
28,BQ3422,Men,Ladies,No status reported,BQ3422,Grant,Karli,No status reported
29,BQ3423,Men,Ladies,No status reported,BQ3423,Dean,Suitor,No status reported
30,BQ3424,Men,Ladies,No status reported,BQ3424,Machine,Suitor,No status reported
31,BQ3438,Men,Ladies,DISCD,BQ3438,Machine,Suitor,DISCD
32,BQ3440,Men,Ladies,DISCD,BQ3440,The minimalist 3h,Karli,DISCD
33,BQ3442,Men,Ladies,No status reported,BQ3442,The minimalist 3h,Modern sophisticate,No status reported


In [409]:
collection_and_gender_discrepancy[35:75]

Unnamed: 0,style_id_x,gender_x,gender_y,status_x,style_id_y,collection_x,collection_y,status_y
1141,FS5398,Men,Mens,DISCD,FS5398,The minimalist,Other - mens watch,DISCD
1142,FS5399,Men,Mens,DISCD,FS5399,The commuter,Commuter,DISCD
1143,FS5400,Men,Mens,DISCD,FS5400,The commuter,Commuter,DISCD
1144,FS5401,Men,Mens,No status reported,FS5401,The commuter,Commuter,No status reported
1145,FS5402,Men,Mens,No status reported,FS5402,The commuter,Commuter,No status reported
1146,FS5403,Men,Mens,No status reported,FS5403,The commuter,Commuter,No status reported
1147,FS5404,Men,Mens,DISCD,FS5404,The commuter,Commuter,DISCD
1148,FS5406,Men,Mens,No status reported,FS5406,The commuter,Commuter,No status reported
1149,FS5407,Men,Mens,DISCD,FS5407,Townsman,44mm townsman,DISCD
1159,FS5417,Men,Mens,DISCD,FS5417,The commuter 3h,The commuter 3h date,DISCD


In [410]:
gender_only_discrepancy = gender_discrepancy.drop(index = collection_and_gender_discrepancy.index)
gender_only_discrepancy_list = list(set(zip(gender_only_discrepancy['gender_x'],gender_only_discrepancy['gender_y'])))
gender_only_discrepancy_list

[('Ladies', 'Mens'), ('Gents', 'Mens'), ('Men', 'Mens'), ('Unisex', 'Mens')]

In [411]:
gender_only_discrepancy[gender_only_discrepancy['gender_x'] == 'Ladies']

Unnamed: 0,style_id,gender_x,gender_y,status
983,FS5068IE,Ladies,Mens,DISCD


In [412]:
df_models[df_models['style_id'] == 'FS5068IE'] # this seems to be a mistake

Unnamed: 0,style_id,description_x,case,gender_x,status,2015,2016,2017,2018,2019,...,band_color,band_material,case_material,clasp_type,dial_color,movement_type,water_resistant,max_cart_qty,retail_price,collection
983,FS5068IE,FOSSIL / MENS / GRANT 44MM,44MM,Ladies,DISCD,0.0,0.0,0.0,134.0,150.0,...,Brown,Leather,Stainless Steel,No clasp type reported,Blue,Quartz,50m - 160ft - 5atm,10000.0,145.0,Grant


Upon inspection, it looks to be ok to replace all conflicts with the gender from the website.

In [413]:
df_models['gender'] = [x if str(y) == 'nan' else y for x, y in zip(df_models['gender_x'],df_models['gender_y'])]

In [414]:
df_models[['gender_x','gender_y','gender']]

Unnamed: 0,gender_x,gender_y,gender
0,Ladies,Ladies,Ladies
1,Ladies,Ladies,Ladies
2,Ladies,,Ladies
3,Ladies,,Ladies
4,Ladies,,Ladies
...,...,...,...
1719,Ladies,Ladies,Ladies
1720,,Mens,Mens
1721,,Mens,Mens
1722,,Mens,Mens


In [415]:
df_models['gender'].isna().value_counts()

False    1724
Name: gender, dtype: int64

In [416]:
df_models.drop(columns=['gender_x','gender_y'],inplace=True)

In [441]:
# Check if any groups can be combined
df_models['gender'].value_counts()

Ladies    902
Mens      571
Men       250
Unisex      1
Name: gender, dtype: int64

In [442]:
df_models['gender'] = df_models['gender'].replace('Men','Mens')

In [444]:
df_models[df_models['gender'] == 'Unisex']

Unnamed: 0,style_id,description_x,status,2015,2016,2017,2018,2019,2020,2021,...,case_material,clasp_type,dial_color,movement_type,water_resistant,max_cart_qty,retail_price,collection,gender,case_size
1172,FS5430,FOSSIL / UNISEX / BLAKE 40MM,No status reported,0.0,0.0,0.0,40.0,0.0,0.0,0.0,...,,,,,,,75.0,Blake,Unisex,40.0


In [445]:
df_models[df_models['collection']=='Blake']

Unnamed: 0,style_id,description_x,status,2015,2016,2017,2018,2019,2020,2021,...,case_material,clasp_type,dial_color,movement_type,water_resistant,max_cart_qty,retail_price,collection,gender,case_size
1169,FS5427,FOSSIL / UNISEX / BLAKE 40MM,No status reported,0.0,0.0,0.0,95.0,2.0,0.0,0.0,...,Alloy,No clasp type reported,White,Quartz,50m - 160ft - 5atm,10000.0,75.0,Blake,Mens,40.0
1170,FS5428,FOSSIL / MENS / BLAKE 40MM,No status reported,0.0,0.0,0.0,98.0,2.0,0.0,0.0,...,Alloy,No clasp type reported,White,Quartz,50m - 160ft - 5atm,10000.0,75.0,Blake,Mens,40.0
1171,FS5429,FOSSIL / UNISEX / BLAKE 40MM,No status reported,0.0,0.0,0.0,98.0,1.0,0.0,0.0,...,Alloy,No clasp type reported,White,Quartz,50m - 160ft - 5atm,10000.0,75.0,Blake,Mens,40.0
1172,FS5430,FOSSIL / UNISEX / BLAKE 40MM,No status reported,0.0,0.0,0.0,40.0,0.0,0.0,0.0,...,,,,,,,75.0,Blake,Unisex,40.0
1173,FS5431,FOSSIL / UNISEX / BLAKE 40MM,No status reported,0.0,0.0,0.0,38.0,0.0,0.0,0.0,...,Alloy,No clasp type reported,White,Quartz,50m - 160ft - 5atm,10000.0,75.0,Blake,Mens,40.0


In [446]:
# Seems safe to fill in gender with mens
df_models['gender'] = df_models['gender'].replace('Unisex','Mens')

In [448]:
df_models['gender'].value_counts()

Ladies    902
Mens      822
Name: gender, dtype: int64

#### Check where Case/Size differs between datasets

In [417]:
case_discrepancy = df_models[(df_models['case'] != df_models['size'])][['style_id','case','size','status','description_x','description_y']].dropna()
case_discrepancy

Unnamed: 0,style_id,case,size,status,description_x,description_y
0,AM4141,28MM,28.0,No status reported,FOSSIL / LADIES / COLLEAGUE,FOSSIL / LADIES / COLLEAGUE
1,AM4183,28MM,28.0,DISCD,FOSSIL / LADIES / COLLEAGUE,FOSSIL / LADIES / COLLEAGUE
5,AM4508,28MM,28.0,DISCD,FOSSIL / LADIES / COLLEAGUE,FOSSIL / LADIES / COLLEAGUE
9,AM4532,CECILE,40.0,DISCD,FOSSIL / LADIES / CECILE,FOSSIL / LADIES / CECILE
17,BQ1010,42MM,42.0,No status reported,FOSSIL / LADIES / MADELINE 42MM,FOSSIL / MENS / RHETT 42MM
...,...,...,...,...,...,...
1714,ME3209,44MM,44.0,2021H4,FOSSIL / GENTS / NEUTRA 44MM,FOSSIL / GENTS / NEUTRA 44
1715,ME3210,44MM,44.0,2021H4,FOSSIL / GENTS / TOWNSMAN 44MM,FOSSIL / GENTS / TOWNSMAN 44
1716,ME3211,34MM,34.0,2021H4,FOSSIL / LADIES / STELLA 34MM,FOSSIL / LADIES / STELLA 34
1717,ME3212,34MM,34.0,2021H4,FOSSIL / LADIES / STELLA 34MM,FOSSIL / LADIES / STELLA 34


It seems that case is the size with the units. Let's try stripping the MM and converting to float. There is clearly one example where this will fail, so let's replace with size in that instance.

In [418]:
case_size = []
count_except = 0
for i in case_discrepancy.index:
    try:
        case_size.append(float(case_discrepancy.loc[i,'case'].strip('MM')))
    except:
        case_size.append(case_discrepancy.loc[i,'size'])
        count_except +=1
print(count_except)

10


In [419]:
case_discrepancy['case_size'] = case_size
case_discrepancy

Unnamed: 0,style_id,case,size,status,description_x,description_y,case_size
0,AM4141,28MM,28.0,No status reported,FOSSIL / LADIES / COLLEAGUE,FOSSIL / LADIES / COLLEAGUE,28.0
1,AM4183,28MM,28.0,DISCD,FOSSIL / LADIES / COLLEAGUE,FOSSIL / LADIES / COLLEAGUE,28.0
5,AM4508,28MM,28.0,DISCD,FOSSIL / LADIES / COLLEAGUE,FOSSIL / LADIES / COLLEAGUE,28.0
9,AM4532,CECILE,40.0,DISCD,FOSSIL / LADIES / CECILE,FOSSIL / LADIES / CECILE,40.0
17,BQ1010,42MM,42.0,No status reported,FOSSIL / LADIES / MADELINE 42MM,FOSSIL / MENS / RHETT 42MM,42.0
...,...,...,...,...,...,...,...
1714,ME3209,44MM,44.0,2021H4,FOSSIL / GENTS / NEUTRA 44MM,FOSSIL / GENTS / NEUTRA 44,44.0
1715,ME3210,44MM,44.0,2021H4,FOSSIL / GENTS / TOWNSMAN 44MM,FOSSIL / GENTS / TOWNSMAN 44,44.0
1716,ME3211,34MM,34.0,2021H4,FOSSIL / LADIES / STELLA 34MM,FOSSIL / LADIES / STELLA 34,34.0
1717,ME3212,34MM,34.0,2021H4,FOSSIL / LADIES / STELLA 34MM,FOSSIL / LADIES / STELLA 34,34.0


In [420]:
case_discrepancy_actual = case_discrepancy[(case_discrepancy['size'] != case_discrepancy['case_size'])]

There are bizarre discrepancies on both datasets. I will have to manually replace issues.

In [421]:
case_discrepancy_actual['case_size'].values
case_discrepancy_actual['case_size'] = [38., 44., 45., 35., 34., 34., 32., 32., 32., 32., 32., 27., 36.,
       36., 36., 38., 38., 29., 29., 29., 35., 35., 29., 36., 36., 38.,
       38., 38., 36., 36., 36., 28., 28., 28., 28., 28., 35., 34., 34.,
       34., 35., 35., 44., 41., 44., 44., 44., 44., 48., 48., 48., 48.,
       48., 44., 46., 46., 36., 27., 35.]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  case_discrepancy_actual['case_size'] = [38., 44., 45., 35., 34., 34., 32., 32., 32., 32., 32., 27., 36.,


In [422]:
case_discrepancy_actual

Unnamed: 0,style_id,case,size,status,description_x,description_y,case_size
40,CE1102,38MM,18.0,2020H1,FOSSIL / LADIES / CARLIE 38MM,FOSSIL / LADIES / CARLIE 38MM,38.0
74,CH2601IE,44MM,42.0,No status reported,FOSSIL / MENS / DECKER 44MM,FOSSIL / MENS / DECKER 44MM,44.0
78,CH2993,46MM,45.0,DISCD,FOSSIL / MENS / DEL REY,FOSSIL / MENS / DEL REY,45.0
384,ES4245,34MM,35.0,No status reported,FOSSIL / LADIES / ATWATER,FOSSIL / LADIES / ATWATER,35.0
418,ES4287,35MM,34.0,DISCD,FOSSIL / LADIES / NEELY,FOSSIL / LADIES / NEELY,34.0
420,ES4289,35MM,34.0,DISCD,FOSSIL / LADIES / NEELY,FOSSIL / LADIES / NEELY,34.0
453,ES4336SET,32MM,31.0,DISCD,FOSSIL / LADIES / BLANE 32MM,FOSSIL / LADIES / BLANE 32MM,32.0
454,ES4337SET,32MM,31.0,DISCD,FOSSIL / LADIES / BLANE 32MM,FOSSIL / LADIES / BLANE 32MM,32.0
478,ES4363,32MM,34.0,No status reported,FOSSIL / LADIES / SCARLETTE 32MM,FOSSIL / LADIES / SCARLETTE 32MM,32.0
485,ES4372,32MM,34.0,DISCD,FOSSIL / LADIES / SCARLETTE 32MM,FOSSIL / LADIES / SCARLETTE 32MM,32.0


Confirmed that this new case_size column is correct. Strategy is now to create a new column in df_models where the values are case_size, and if there is none, use the size. If there is no size, use the stripped & float version of case. Finally, check for any null values afterwards.

In [423]:
new_case_size = []
for i in df_models.index:
    if i in case_discrepancy_actual:
        x = case_discrepancy_actual.loc[i,'case_size']
    elif str(df_models.loc[i,'size']) != 'nan':
        x = df_models.loc[i,'size']
    else:
        try:
            x = float(df_models.loc[i,'case'].strip('MM'))
        except:
            x = np.nan
    new_case_size.append(x)
df_models['case_size'] = new_case_size
df_models

Unnamed: 0,style_id,description_x,case,status,2015,2016,2017,2018,2019,2020,...,case_material,clasp_type,dial_color,movement_type,water_resistant,max_cart_qty,retail_price,collection,gender,case_size
0,AM4141,FOSSIL / LADIES / COLLEAGUE,28MM,No status reported,54.0,135.0,204.0,201.0,142.0,0.0,...,Stainless Steel,No clasp type reported,Mother Of Pearl,Quartz,50m - 160ft - 5atm,10000.0,110.0,Serena,Ladies,28.0
1,AM4183,FOSSIL / LADIES / COLLEAGUE,28MM,DISCD,70.0,170.0,174.0,131.0,0.0,0.0,...,Stainless Steel,Deployment,Mother Of Pearl,Quartz,100m - 330ft - 10atm,10000.0,129.0,Colleague,Ladies,28.0
2,AM4481,FOSSIL / LADIES / CECILE,40MM,No status reported,72.0,133.0,251.0,33.0,0.0,0.0,...,,,,,,,149.0,Cecile,Ladies,40.0
3,AM4482,FOSSIL / LADIES / CECILE,CECILE,DISCD,0.0,164.0,138.0,0.0,1.0,0.0,...,,,,,,,165.0,Cecile,Ladies,
4,AM4483,FOSSIL / LADIES / CECILE,40MM,No status reported,49.0,106.0,147.0,0.0,0.0,0.0,...,,,,,,,169.0,Cecile,Ladies,40.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1719,ME3214,FOSSIL / LADIES / STELLA 34MM,34MM,2022H1,0.0,0.0,0.0,0.0,0.0,0.0,...,Stainless Steel,Deployment,Silver,Automatic,50m - 160ft - 5atm,100.0,210.0,Stella,Ladies,34.0
1720,ME3217,,,,,,,,,,...,Stainless Steel,Fold-Over,Black,Automatic,50m - 160ft - 5atm,100.0,280.0,Bronson,Mens,44.0
1721,ME3218,,,,,,,,,,...,Stainless Steel,Fold-Over,Black,Automatic,50m - 160ft - 5atm,100.0,280.0,Bronson,Mens,44.0
1722,ME3219,,,,,,,,,,...,Stainless Steel,Buckle,Black,Automatic,50m - 160ft - 5atm,100.0,260.0,Bronson,Mens,44.0


In [424]:
df_models[df_models['case_size'].isna()]

Unnamed: 0,style_id,description_x,case,status,2015,2016,2017,2018,2019,2020,...,case_material,clasp_type,dial_color,movement_type,water_resistant,max_cart_qty,retail_price,collection,gender,case_size
3,AM4482,FOSSIL / LADIES / CECILE,CECILE,DISCD,0.0,164.0,138.0,0.0,1.0,0.0,...,,,,,,,165.0,Cecile,Ladies,
6,AM4509,FOSSIL / LADIES / CECILE,CECILE,DISCD,79.0,153.0,2.0,0.0,0.0,0.0,...,,,,,,,125.0,Cecile,Ladies,
7,AM4511,FOSSIL / LADIES / CECILE,CECILE,DISCD,62.0,115.0,5.0,0.0,0.0,0.0,...,,,,,,,145.0,Cecile,Ladies,
8,AM4522,FOSSIL / LADIES / CECILE,CECILE,DISCD,27.0,68.0,1.0,0.0,0.0,0.0,...,,,,,,,165.0,Cecile,Ladies,
10,AM4576,FOSSIL / LADIES / CECILE,CECILE,DISCD,35.0,72.0,16.0,0.0,0.0,0.0,...,,,,,,,125.0,Cecile,Ladies,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1651,ME3130,FOSSIL / MENS / DEAN,DEAN,DISCD,0.0,15.0,5.0,0.0,0.0,0.0,...,,,,,,,245.0,Dean,Men,
1652,ME3133,FOSSIL / MENS / MODERN MACHINE,MACHINE,DISCD,0.0,1.0,36.0,2.0,0.0,0.0,...,,,,,,,265.0,Modern machine,Men,
1653,ME3134,FOSSIL / MENS / MODERN MACHINE,MACHINE,DISCD,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,245.0,Modern machine,Men,
1656,ME3137,FOSSIL / LADIES / VINTAGE MUSE,MUSE,DISCD,0.0,2.0,7.0,0.0,0.0,0.0,...,,,,,,,225.0,Vintage muse,Ladies,


There are lots of models where the case size is missing. Let's see if there is a good strategy to impute the case size.

In [425]:
df_models.groupby('collection')['case_size'].std().sort_values(ascending = False)[0:20]

collection
Fb - 02                     11.710801
Daisy 3 hand                10.000000
Izzy                         9.811558
Forrester chrono             9.276014
Carlie mini v-day            8.000000
Garrett                      6.936639
The andy and addison set     6.000000
Retro pilot                  5.656854
Stella                       5.220539
Carlie mini                  5.032733
Fb-01                        4.783212
Carlie                       4.779468
The commuter 3h              4.381780
Tailor                       4.331437
Copeland                     3.938928
Jacqueline                   3.770258
Atwater                      3.577709
The commuter 3h date         3.113996
Georgia                      3.081316
Bronson                      3.070598
Name: case_size, dtype: float64

It is not obvious what the case size should be. For now, I will replace with "No case size reported"

In [426]:
df_models['case_size'] = df_models['case_size'].fillna('No case size reported')

In [427]:
df_models.drop(columns=['case','size'],inplace=True)

#### Fill empty sales with 0.

In [428]:
df_models.columns

Index(['style_id', 'description_x', 'status', '2015', '2016', '2017', '2018',
       '2019', '2020', '2021', '2022-01', '2022-02', '2022-03', '2022-04',
       'qt_sales_order', 'total_sales_2022', 'qt_on_hand', 'monthly_avg_2022',
       'description_y', 'wholesale_price', 'color', 'country_of_origin',
       'warranty', 'band_color', 'band_material', 'case_material',
       'clasp_type', 'dial_color', 'movement_type', 'water_resistant',
       'max_cart_qty', 'retail_price', 'collection', 'gender', 'case_size'],
      dtype='object')

In [429]:
df_models[['2015', '2016', '2017', '2018',
       '2019', '2020', '2021', '2022-01', '2022-02', '2022-03', '2022-04',
       'qt_sales_order', 'total_sales_2022', 'qt_on_hand', 'monthly_avg_2022']] = df_models[['2015', '2016', '2017', '2018',
       '2019', '2020', '2021', '2022-01', '2022-02', '2022-03', '2022-04',
       'qt_sales_order', 'total_sales_2022', 'qt_on_hand', 'monthly_avg_2022']].fillna(0)

#### Fill empty and 0 status with "No status reported"

In [430]:
df_models['status'] = df_models['status'].fillna("No status reported")

In [431]:
df_models['status'].replace(0,"No status reported",inplace=True)

In [432]:
df_models['status'].value_counts()

DISCD                 690
No status reported    467
2019H4                 68
2020H1                 60
2020H3                 56
2020H2                 49
2019H3                 49
2021H3                 46
2021H4                 45
2019H2                 38
2022H1                 31
2021H1                 29
2021H2                 27
2018H4                 25
2019H1                 21
2018H3                  8
2020H4                  6
2022H2                  6
2018                    3
Name: status, dtype: int64

#### Recalculate wholesale price based on retail price

In [433]:
df_models['wholesale_price'] = df_models['retail_price']*0.7

#### Examine color

There are discrepancies based on capitalization and different words. Let's standardize.

In [434]:
# Lots of discrepancies because of capitalization. Let's capitalize all values and reevaluate
df_models['color'] = df_models['color'].astype(str).apply(str.capitalize).str.strip().replace('Nan',np.nan)

df_models['color'] = df_models['color'].replace('Mop','Mother of pearl').replace('Grey','Gray').replace('Multi','Multicolor')

df_models['color'] = df_models['color'].fillna("No color reported")

df_models['color'].value_counts()

No color reported    529
Black                253
Silver               208
White                133
Blue                 130
Rose gold            113
Gold                  80
Mother of pearl       49
Gray                  42
Green                 39
Brown                 28
Cream                 17
Pink                  16
Multicolor            13
Smoke                 10
Skeleton              10
Red                    8
Champagne              8
Gunmetal               7
Two-tone               6
Purple                 6
Caramel                6
Clear                  5
Burgundy               2
Tan                    2
Turquoise              2
Yellow                 1
Nude                   1
Name: color, dtype: int64

#### Examine country of origin

In [451]:
df_models['country_of_origin'].value_counts()

Japan - JP       1141
China - CN         49
Thailand - TH       5
Name: country_of_origin, dtype: int64

In [452]:
df_models['country_of_origin'] = df_models['country_of_origin'].fillna("No country reported")

#### Examine warranty

In [454]:
df_models['warranty'].value_counts()

2 - Year International Limited Warranty    1192
1 - Year International Limited Warranty       3
Name: warranty, dtype: int64

In [455]:
df_models['warranty'] = df_models['warranty'].fillna("No warranty reported")

#### Examine band color

In [458]:
df_models['band_color'].replace('Multi','Multicolor',inplace=True)

df_models['band_color'] = df_models['band_color'].fillna("No color reported")

In [459]:
df_models['band_color'].value_counts()

No color reported    535
Brown                256
Silver               198
Black                183
Rose Gold            116
Blue                  91
Gold                  70
Grey                  47
Multicolor            41
Two-Tone              40
Green                 30
Pink                  29
White                 24
Nude                  22
Red                   12
Purple                 7
Gunmetal               5
Tan                    3
Cream                  3
Burgundy               3
Rose                   2
Champagne              2
Yellow                 2
Tortoise               1
Turquoise              1
Orange                 1
Name: band_color, dtype: int64

#### Examine band material

In [462]:
df_models['band_material'] = df_models['band_material'].fillna("No band material reported")

In [463]:
df_models['band_material'].value_counts()

Stainless Steel              561
No band material reported    529
Leather                      526
Silicone                      50
Ceramic                       27
Mesh                           9
Alloy                          5
Plastic                        5
Acetate                        3
Nylon                          3
Fabric                         3
Polyurethane                   2
Canvas                         1
Name: band_material, dtype: int64

#### Examine case material

In [466]:
df_models['case_material'] = df_models['case_material'].fillna("No case material reported")

In [467]:
df_models['case_material'].value_counts()

Stainless Steel              1137
No case material reported     529
Ceramic                        27
Alloy                          12
Resin                           9
Nylon                           7
Acetate                         1
Polyurethane                    1
Carbon                          1
Name: case_material, dtype: int64

#### Examine clasp type

In [470]:
df_models['clasp_type'] = df_models['clasp_type'].fillna("No clasp type reported")

In [471]:
df_models['clasp_type'].value_counts()

No clasp type reported    830
Buckle                    447
Fold-Over                 306
Deployment                112
Clasp                      13
613                         6
Safety                      4
Velcro                      4
Tang                        2
Name: clasp_type, dtype: int64

#### Examine dial color

In [474]:
df_models['dial_color'] = df_models['dial_color'].fillna("No dial color reported")

In [475]:
df_models['dial_color'].value_counts()

No dial color reported    529
Black                     285
Blue                      156
White                     148
Silver                    148
Rose Gold                  76
Mother Of Pearl            62
Green                      60
Grey                       44
Gold                       41
Brown                      31
Cream                      26
Pink                       23
Digital                    20
Red                        20
Multicolor                 13
Purple                     10
Skeleton                   10
Champagne                  10
Tan                         4
Turquoise                   2
Clear                       2
Burgundy                    2
Yellow                      1
Beige                       1
Name: dial_color, dtype: int64

#### Examine movement type

In [478]:
df_models['movement_type'] = df_models['movement_type'].fillna("No movement type reported")

In [479]:
df_models['movement_type'].value_counts()

Quartz                       1074
No movement type reported     536
Automatic                      73
Digital                        18
Mechanical                     15
Solar                           8
Name: movement_type, dtype: int64

#### Examine water resistance

In [482]:
df_models['water_resistant'] = df_models['water_resistant'].fillna("No water resistance reported")

In [483]:
df_models['water_resistant'].value_counts()

50m - 160ft - 5atm              932
No water resistance reported    530
30m - 100ft - 3atm              146
100m - 330ft - 10atm            116
Name: water_resistant, dtype: int64

#### Examine band material

In [487]:
df_models['max_cart_qty'] = df_models['max_cart_qty'].fillna("No max quantity reported")

In [488]:
df_models['max_cart_qty'].value_counts()

10000.0                     848
No max quantity reported    529
100.0                       347
Name: max_cart_qty, dtype: int64

#### Drop descriptions

In [490]:
df_models.drop(columns=['description_x','description_y'],inplace=True)

In [495]:
df_models.isna().sum().sum()

0

### Combine new models as "test" set

In [437]:
df_q3_new['quarter'] = '2022Q3'
df_q4_new['quarter'] = '2022Q4'

In [438]:
df_new_models = pd.concat([df_q3_new,df_q4_new],ignore_index=True)

In [439]:
df_new_models.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83 entries, 0 to 82
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   style_id       83 non-null     object 
 1   collection     83 non-null     object 
 2   movement_type  83 non-null     object 
 3   retail_price   80 non-null     float64
 4   color          83 non-null     object 
 5   band_color     83 non-null     object 
 6   case           83 non-null     object 
 7   lug_width      83 non-null     object 
 8   gender         83 non-null     object 
 9   quarter        83 non-null     object 
dtypes: float64(1), object(9)
memory usage: 6.6+ KB


In [440]:
# Verify each style is only in the dataset once
df_new_models['style_id'].duplicated().value_counts()

False    83
Name: style_id, dtype: int64