# Notebook 01: Problem Statement and Data Cleaning

This notebook details the problem statement and cleans/saves the datasets.

## Problem Statement

[CTE Watch Company](https://www.cte-watches.com/who-we-are) is a South Florida-based company that distributes fashion and household items to smaller retailers in the Caribbean islands, cruise ships, duty-free/travel retail, and Latin American and North American markets.

Success is determined by how well CTE manages inventory to optimize warehouse space and minimize purchases of unsuccessful products. CTE typically purchases directly from the brands on a quarterly basis and then sells to the smaller retailers, so it must predict how much it can sell in order to place each quarterly brand order. Typically, CTE makes these predictions by calculating a 6-month rolling average of sales and ordering enough to hold 3 months of inventory. For a new model, it must guess based on the attributes of the watch. This process is very manual and relies on the owner's 30 years of expertise to be done well. Additionally, each brand contains many different products (e.g., 300 different watch models for one brand), and there are approximately 40 brands.
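
The ordering heuristic described above can be sketched in a few lines. This is only an illustration of the arithmetic: the function name, its parameters, and the choice to subtract stock already on hand or on order are assumptions, not CTE's actual procedure, and the numbers are made up.

```python
def suggested_order_qty(monthly_sales, on_hand, on_order=0, window=6, months_of_cover=3):
    """Hypothetical helper: order enough to cover `months_of_cover` months
    at the trailing `window`-month average sales rate."""
    recent = monthly_sales[-window:]                  # most recent `window` months
    avg_per_month = sum(recent) / len(recent) if recent else 0.0
    target_stock = avg_per_month * months_of_cover    # stock needed for the cover period
    # Assumption: existing stock counts toward the target; never suggest a negative order
    return max(0.0, target_stock - on_hand - on_order)


# Illustrative numbers only (not CTE data)
sales_last_six_months = [10, 12, 8, 14, 16, 12]
qty = suggested_order_qty(sales_last_six_months, on_hand=20)
print(qty)
```

With an average of 12 units per month, three months of cover is 36 units, so 20 units on hand leaves 16 to order.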

In this project, I aim to build a model that forecasts order quantities accurately and more quickly for both new and existing models of one watch brand. The model will predict watch sales based on historical sales (if available) and the individual model attributes (style, gender, color, price, material, etc.). Different types of models (forecasting, regression, classification) will be explored to accomplish this goal. These predictions will be validated against the owner's purchase order for the next quarter during a meeting in early June.

In [187]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

## Data Cleaning

### Watch Flow Chart

This dataset contains CTE's records of each model's sales for the brand since 2015 (when they started distributing the brand).

In [329]:
df_watch_chart = pd.read_excel('../data/WATCH FLOW CHART 0416.xlsx',header=1)

# Drop unnecessary/blank columns
df_watch_chart.drop(columns=['IMAGE','MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC','Unnamed: 16','Unnamed: 29','DESCRIPTION'],index=0,inplace=True)

# Standardize column names
df_watch_chart.columns = df_watch_chart.columns.astype(str).str.strip().str.lower().str.replace(' ','_')
df_watch_chart.columns = ['style_id', 'retail_price', 'upc_num', 'collection', 'case', 'gender',
       'status', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022-01',
       '2022-02', '2022-03', '2022-04', 'qt_sales_order', 'total_sales_2022', 'qt_on_hand',
       'monthly_avg_2022']

# Drop null row
df_watch_chart.dropna(subset='style_id', inplace=True)

# Convert UPC number to str (categorical)
df_watch_chart.loc[:,'upc_num'] = df_watch_chart.loc[:,'upc_num'].astype(int).astype(str)

# Identify and deal with null values
# Retail price
df_watch_chart.dropna(subset='retail_price', inplace=True)
# Collection
df_watch_chart.loc[646,'collection'] = 'FORRESTER CHRONO'
df_watch_chart.loc[786,'collection'] = 'JACQUELINE'
# Case
df_watch_chart.loc[646,'case'] = '46MM'
df_watch_chart.loc[786,'case'] = '36MM'
# Sales years
df_watch_chart[['2015','2016','2017','2018','2019','2020','2021','2022-01',
       '2022-02', '2022-03', '2022-04']] = df_watch_chart[['2015','2016','2017','2018','2019','2020','2021','2022-01',
       '2022-02', '2022-03', '2022-04']].fillna(0)
# Status (note: this call fills every remaining null in the DataFrame, not only `status`)
df_watch_chart.fillna('No status reported',inplace=True)

In [330]:
df_watch_chart.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1588 entries, 1 to 1590
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   style_id          1588 non-null   object 
 1   retail_price      1588 non-null   float64
 2   upc_num           1588 non-null   object 
 3   collection        1588 non-null   object 
 4   case              1588 non-null   object 
 5   gender            1588 non-null   object 
 6   status            1588 non-null   object 
 7   2015              1588 non-null   float64
 8   2016              1588 non-null   float64
 9   2017              1588 non-null   float64
 10  2018              1588 non-null   float64
 11  2019              1588 non-null   float64
 12  2020              1588 non-null   float64
 13  2021              1588 non-null   float64
 14  2022-01           1588 non-null   float64
 15  2022-02           1588 non-null   float64
 16  2022-03           1588 non-null   float64


In [332]:
# Save to csv
df_watch_chart.to_csv('../cleaned_datasets/watch_flow_chart.csv',index=False)

### Brand Proposal

This dataset contains the brand's current collection and CTE's order on April 16, 2022.

In [191]:
df_proposal = pd.read_excel('../data/PROPOSAL APRIL 16.xlsx')

# Drop unnecessary/blank columns
df_proposal.drop(columns=['IMAGE','Unnamed: 12','Unnamed: 14','Unnamed: 17'],inplace=True)

# Drop rows with total sheet calculations
df_proposal.dropna(subset='Material',inplace=True)

# Standardize column names
df_proposal.columns = df_proposal.columns.astype(str).str.strip().str.lower().str.replace(' ','_')
df_proposal.columns = ['style_id', 'upc_num', 'priority', 'retail_price', 'gender', 'collection', 'case',
       'availability', 'qty_on_hand', 'qty_on_order', 'qty_total', 'qty_sales_order', 'qty_sold_last_6m',
       'qty_avg/mo', 'months_of_supply', 'wholesale_price', 'cte_cost', 'qty_cte_cost']

# Convert UPC number to str (categorical)
df_proposal.loc[:,'upc_num'] = df_proposal.loc[:,'upc_num'].astype(int).astype(str)

In [192]:
# Data quality checks
df_proposal['qty_total_dup'] = df_proposal['qty_on_hand'] + df_proposal['qty_on_order']
(df_proposal['qty_total_dup'] == df_proposal['qty_total']).value_counts()

True    300
dtype: int64

In [193]:
df_proposal['qty_avg/mo_dup'] = df_proposal['qty_sold_last_6m']/6
(df_proposal['qty_avg/mo_dup'] == df_proposal['qty_avg/mo']).value_counts()

True    300
dtype: int64

In [194]:
df_proposal['months_of_supply_dup'] = [(df_proposal.loc[x,'qty_total_dup'] + df_proposal.loc[x,'qty_sales_order'])/df_proposal.loc[x,'qty_avg/mo_dup'] if df_proposal.loc[x,'qty_avg/mo_dup'] != 0 else np.nan for x in df_proposal.index]
(df_proposal['months_of_supply_dup'] == df_proposal['months_of_supply']).value_counts()

True     216
False     84
dtype: int64

In [195]:
(df_proposal['months_of_supply_dup'].isna() == df_proposal['months_of_supply'].isna()).value_counts()

True    300
dtype: int64
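
The 84 `False` results above line up with the 84 rows where `months_of_supply` is null (216 of 300 are non-null), because `NaN == NaN` evaluates to `False`. A minimal sketch of a comparison that treats paired nulls as equal and also tolerates floating-point rounding, using toy values rather than the proposal data:

```python
import numpy as np
import pandas as pd

# Toy values (not the proposal data): an exact match, a pair of nulls, and a
# float that differs from its recomputed value only by rounding error
reported = pd.Series([4.5, np.nan, 0.1 + 0.2])
recomputed = pd.Series([4.5, np.nan, 0.3])

# Element-wise == treats NaN as unequal to everything, including NaN
exact = (reported == recomputed).tolist()

# np.isclose allows a small tolerance and, with equal_nan=True, matches NaN pairs
tolerant = np.isclose(reported, recomputed, equal_nan=True).tolist()

print(exact)
print(tolerant)
```

The same idea would apply to the `qty_avg/mo` and `cte_cost` checks, where exact float equality happens to hold here but is not guaranteed in general.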

In [196]:
df_proposal['cte_cost_dup'] = df_proposal['wholesale_price']*0.7
(df_proposal['cte_cost_dup'] == df_proposal['cte_cost']).value_counts()

True    300
dtype: int64

In [197]:
# Drop all duplicated columns
df_proposal.drop(columns=['qty_total_dup','qty_avg/mo_dup','months_of_supply_dup','cte_cost_dup'],inplace=True)

In [198]:
df_proposal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300 entries, 0 to 299
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   style_id          300 non-null    object 
 1   upc_num           300 non-null    object 
 2   priority          300 non-null    object 
 3   retail_price      300 non-null    float64
 4   gender            300 non-null    object 
 5   collection        300 non-null    object 
 6   case              300 non-null    object 
 7   availability      300 non-null    object 
 8   qty_on_hand       300 non-null    float64
 9   qty_on_order      300 non-null    object 
 10  qty_total         300 non-null    float64
 11  qty_sales_order   300 non-null    float64
 12  qty_sold_last_6m  300 non-null    object 
 13  qty_avg/mo        300 non-null    float64
 14  months_of_supply  216 non-null    object 
 15  wholesale_price   300 non-null    float64
 16  cte_cost          300 non-null    float64
 1

In [199]:
# Save to csv
df_proposal.to_csv('../cleaned_datasets/proposal.csv',index=False)

### All Watches from Brand Website

This dataset contains all watches from the brand's website, and includes more attributes.

In [200]:
df_website = pd.read_excel('../data/ALL ITEMS FROM WEBSITE.xlsx')

# Drop unnecessary columns
df_website.drop(columns=['u_brand','u_category','description','name'],inplace=True)

# Standardize column names
df_website.columns = df_website.columns.astype(str).str.strip().str.lower().str.replace(' ','_').str.replace('u_','').str.replace('ws_','')
df_website.columns = ['style_id', 'upc_num', 'gender', 'collection', 'size', 'wholesale_price', 'retail_price',
       'product_type', 'product_websites', 'weight', 'color',
       'country_of_origin', 'warranty', 'band_color', 'band_material',
       'case_material', 'clasp_type', 'crystal_type', 'dial_color',
       'movement_type', 'water_resistant', 'max_cart_qty']

# Convert UPC number to str (categorical)
df_website.loc[:,'upc_num'] = df_website.loc[:,'upc_num'].astype(int).astype(str)

# Identify and deal with null values
# band color
df_website['band_color'] = df_website['band_color'].fillna('No color reported')
# clasp type
df_website['clasp_type'] = df_website['clasp_type'].fillna('No clasp type reported')
# movement type
df_website['movement_type'] = df_website['movement_type'].fillna('No movement type reported')
# max cart qty: fill with 10000, since that is the most likely value
df_website['max_cart_qty'] = df_website['max_cart_qty'].fillna(10000)

# Dropping the following columns because their values are all the same
df_website.drop(columns=['product_type','product_websites','weight','crystal_type'],inplace=True)

In [201]:
df_website.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1195 entries, 0 to 1194
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   style_id           1195 non-null   object 
 1   upc_num            1195 non-null   object 
 2   gender             1195 non-null   object 
 3   collection         1195 non-null   object 
 4   size               1195 non-null   int64  
 5   wholesale_price    1195 non-null   float64
 6   retail_price       1195 non-null   int64  
 7   color              1195 non-null   object 
 8   country_of_origin  1195 non-null   object 
 9   warranty           1195 non-null   object 
 10  band_color         1195 non-null   object 
 11  band_material      1195 non-null   object 
 12  case_material      1195 non-null   object 
 13  clasp_type         1195 non-null   object 
 14  dial_color         1195 non-null   object 
 15  movement_type      1195 non-null   object 
 16  water_resistant    1194 

In [202]:
# Save to csv
df_website.to_csv('../cleaned_datasets/website.csv',index=False)

### New Models Q3

This dataset contains the new models that will debut in Q3 2022.

In [239]:
df_q3_new = pd.read_excel('../data/new models H3.xlsx',header = 1)

# Drop empty rows (NaN)
df_q3_new.dropna(how = 'all',inplace=True)

# Drop unnecessary columns
df_q3_new.drop(columns=['STATUS'],inplace=True)

# Standardize column names
df_q3_new.columns = df_q3_new.columns.astype(str).str.strip().str.lower().str.replace(' ','_')
df_q3_new.columns = ['style_id', 'collection', 'movement_type', 'retail_price', 'color', 'band_color', 'case',
       'lug_width', 'gender']

In [240]:
df_q3_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47 entries, 1 to 48
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   style_id       47 non-null     object 
 1   collection     47 non-null     object 
 2   movement_type  47 non-null     object 
 3   retail_price   47 non-null     float64
 4   color          47 non-null     object 
 5   band_color     47 non-null     object 
 6   case           47 non-null     object 
 7   lug_width      47 non-null     object 
 8   gender         47 non-null     object 
dtypes: float64(1), object(8)
memory usage: 3.7+ KB


In [241]:
# save to csv
df_q3_new.to_csv('../cleaned_datasets/q3_new_models.csv', index=False)

### New Models Q4

This dataset contains the new models that will debut in Q4 2022.

In [253]:
df_q4_new = pd.read_excel('../data/new models H4.xlsx')

# Drop empty rows (NaN)
df_q4_new.dropna(how = 'all',inplace=True)

# Drop unnecessary columns
df_q4_new.drop(columns=['Unnamed: 9'],inplace=True)

# Infer column names
df_q4_new.columns = ['style_id', 'collection', 'movement_type', 'retail_price', 'color', 'band_color', 'case',
       'lug_width', 'gender']

# Identify and deal with null values
# price - will deal with later
# lug width
df_q4_new['lug_width'] = df_q4_new['lug_width'].fillna('No lug width reported')

In [254]:
df_q4_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36 entries, 0 to 36
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   style_id       36 non-null     object 
 1   collection     36 non-null     object 
 2   movement_type  36 non-null     object 
 3   retail_price   33 non-null     float64
 4   color          36 non-null     object 
 5   band_color     36 non-null     object 
 6   case           36 non-null     object 
 7   lug_width      36 non-null     object 
 8   gender         36 non-null     object 
dtypes: float64(1), object(8)
memory usage: 2.8+ KB


In [255]:
# save to csv
df_q4_new.to_csv('../cleaned_datasets/q4_new_models.csv', index=False)

### Holiday items - new items Q4

This dataset contains the new models that will debut in Q4 2022. It was determined to be a repeat of New Models Q4.

In [275]:
# df_holiday_new = pd.read_excel('../data/HOLIDAY NEW ITEMS 05-04-2022.xlsx')

# # Drop unnecessary columns
# df_holiday_new.drop(columns=['DESCRIPTION','SEASON'],inplace=True)

# # Standardize column names
# df_holiday_new.columns = df_holiday_new.columns.astype(str).str.strip().str.lower().str.replace(' ','_').str.replace('u_','').str.replace('ws_','')
# df_holiday_new.columns = ['style_id', 'collection', 'gender', 'wholesale_price', 'retail_price', 'movement_type',
#        'color', 'band_color', 'case']

# # deal with retail_price later

# df_holiday_new.info()

# # save to csv
# df_holiday_new.to_csv('../cleaned_datasets/holiday_q4_new_models.csv',index=False)

## Merging Datasets

### Merge old/current models from watch flow chart and website

In [333]:
df_watch_chart.shape

(1588, 22)

In [334]:
df_proposal.shape

(300, 18)

In [335]:
df_website.shape

(1195, 18)

In [336]:
df_models = pd.merge(left = df_watch_chart, right = df_website, how = 'outer', on = 'style_id', sort=True)

#### Validation of data

In [337]:
df_models.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1724 entries, 0 to 1723
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   style_id           1724 non-null   object 
 1   retail_price_x     1588 non-null   float64
 2   upc_num_x          1588 non-null   object 
 3   collection_x       1588 non-null   object 
 4   case               1588 non-null   object 
 5   gender_x           1588 non-null   object 
 6   status             1588 non-null   object 
 7   2015               1588 non-null   float64
 8   2016               1588 non-null   float64
 9   2017               1588 non-null   float64
 10  2018               1588 non-null   float64
 11  2019               1588 non-null   float64
 12  2020               1588 non-null   float64
 13  2021               1588 non-null   float64
 14  2022-01            1588 non-null   float64
 15  2022-02            1588 non-null   float64
 16  2022-03            1588 

In [342]:
# check the status of models where there was no match on the website
df_models[df_models['retail_price_y'].isna()]['status'].value_counts()

DISCD                 478
No status reported     50
2018H3                  1
Name: status, dtype: int64

In [340]:
# check the 2021 sales of models where there was no match on the website
df_models[df_models['retail_price_y'].isna()]['2021'].value_counts()

0.0    529
Name: 2021, dtype: int64

It seems that the one 2018H3 model is now discontinued, and none of the 529 watches without a website match had any sales in 2021.
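
The checks above identify unmatched rows through a null `retail_price_y`. As an alternative sketch, `pd.merge` accepts `indicator=True`, which adds a `_merge` column recording where each row came from. The two small frames below are toy stand-ins for `df_watch_chart` and `df_website`, not the real data.

```python
import pandas as pd

# Toy frames standing in for the flow chart and website datasets
chart = pd.DataFrame({'style_id': ['A1', 'A2', 'A3'],
                      'status': ['DISCD', '2021H4', 'DISCD']})
site = pd.DataFrame({'style_id': ['A2', 'A4'],
                     'retail_price': [129.0, 149.0]})

# indicator=True adds '_merge' with values 'left_only', 'right_only', or 'both'
merged = pd.merge(chart, site, how='outer', on='style_id', sort=True, indicator=True)

# Rows that exist only in the flow chart (no website match)
chart_only = merged[merged['_merge'] == 'left_only']

print(merged['_merge'].value_counts())
print(chart_only[['style_id', 'status']])
```

This avoids relying on a particular column being null, which could misclassify a matched row whose website price happens to be missing.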

In [344]:
# check that status is only past watches
df_models['status'].value_counts()
# it is not obvious whether the "No status reported" watches are old, current, or new models

DISCD                 690
No status reported    330
2019H4                 68
2020H1                 60
2020H3                 56
2020H2                 49
2019H3                 49
2021H3                 46
2021H4                 45
2019H2                 38
2022H1                 31
2021H1                 29
2021H2                 27
2018H4                 25
2019H1                 21
2018H3                  8
2020H4                  6
2022H2                  6
2018                    3
0                       1
Name: status, dtype: int64

#### Check where the retail prices differ between datasets

In [358]:
price_discrepancy = df_models[(df_models['retail_price_x'] != df_models['retail_price_y'])][['style_id','upc_num_x','status','retail_price_x','retail_price_y']].dropna()
price_discrepancy

| | style_id | upc_num_x | status | retail_price_x | retail_price_y |
|---|---|---|---|---|---|
| 0 | AM4141 | 691464216092 | No status reported | 129.0 | 110.0 |
| 9 | AM4532 | 796483064850 | DISCD | 125.0 | 149.0 |
| 17 | BQ1010 | 796483055803 | No status reported | 125.0 | 129.0 |
| 18 | BQ1130 | 796483065345 | DISCD | 135.0 | 139.0 |
| 19 | BQ3115 | 796483250796 | No status reported | 135.0 | 139.0 |
| ... | ... | ... | ... | ... | ... |
| 1713 | ME3208 | 796483555082 | 2021H4 | 219.0 | 240.0 |
| 1714 | ME3209 | 796483555112 | 2021H4 | 249.0 | 260.0 |
| 1715 | ME3210 | 796483555099 | 2021H4 | 249.0 | 250.0 |
| 1716 | ME3211 | 796483555549 | 2021H4 | 189.0 | 210.0 |


In [396]:
# This is a tough decision. For now, use the website retail price and fall back to the
# flow chart price when a model is not on the website.
# Rationale: if price is used as a feature to predict sales, keeping the older flow chart prices
# would leave the new models' prices on a different scale from the existing models' prices.

df_models['retail_price'] = df_models['retail_price_y'].fillna(df_models['retail_price_x'])

In [399]:
df_models[['retail_price_x','retail_price_y','retail_price']]

| | retail_price_x | retail_price_y | retail_price |
|---|---|---|---|
| 0 | 129.0 | 110.0 | 110.0 |
| 1 | 129.0 | 129.0 | 129.0 |
| 2 | 149.0 | | 149.0 |
| 3 | 165.0 | | 165.0 |
| 4 | 169.0 | | 169.0 |
| ... | ... | ... | ... |
| 1719 | 210.0 | 210.0 | 210.0 |
| 1720 | | 280.0 | 280.0 |
| 1721 | | 280.0 | 280.0 |
| 1722 | | 260.0 | 260.0 |


In [401]:
df_models['retail_price'].isna().value_counts()

False    1724
Name: retail_price, dtype: int64

In [402]:
df_models.drop(columns=['retail_price_x','retail_price_y'],inplace=True)

#### Check where the UPC numbers differ between datasets

In [404]:
upc_discrepancy = df_models[(df_models['upc_num_x'] != df_models['upc_num_y'])][['style_id','upc_num_x','upc_num_y','status']].dropna()
upc_discrepancy

| | style_id | upc_num_x | upc_num_y | status |
|---|---|---|---|---|
| 495 | ES4393 | 757697665530 | 796483387720 | No status reported |
| 496 | ES4394 | 757697665547 | 796483387737 | No status reported |
| 497 | ES4396 | 757697665523 | 796483387713 | DISCD |
| 500 | ES4403 | 757697665820 | 796483388727 | DISCD |
| 503 | ES4408 | 757697667459 | 796483396975 | DISCD |
| 504 | ES4409 | 757697667442 | 796483396968 | DISCD |
| 509 | ES4414 | 757697666704 | 796483396821 | DISCD |
| 513 | ES4422 | 79648339644 | 796483396944 | DISCD |
| 534 | ES4446 | 757697670503 | 796483415676 | 2018H4 |
| 545 | ES4468 | 757697672897 | 796483419629 | DISCD |


In [417]:
df_models[['upc_num_x','upc_num_y']]

| | upc_num_x | upc_num_y |
|---|---|---|
| 0 | 691464216092 | 691464216092 |
| 1 | 691464267100 | 691464267100 |
| 2 | 796483009837 | |
| 3 | 796483009844 | |
| 4 | 796483009851 | |
| ... | ... | ... |
| 1719 | 796483561380 | 796483561380 |
| 1720 | | 796483561878 |
| 1721 | | 796483561861 |
| 1722 | | 796483561885 |


In [427]:
# check if the UPC number for models in the flow chart can be found in the website for a different watch
for upc_num_x in upc_discrepancy['upc_num_x']:
    if upc_num_x in df_models['upc_num_y'].values:
        print(upc_num_x)

796483033290


In [430]:
# After examining the upc 796483033290, it seems like the flow chart upc num can be replaced.
#df_models[(df_models['upc_num_x'] == '796483033290') | (df_models['upc_num_y'] == '796483033290')].T

In [432]:
# I don't want to keep the upc number for now, so will drop them. However, if this changes in the future,
# I would want to replace the watch flow chart UPC numbers with the website #s and fill in the rest with the website #s.
df_models.drop(columns=['upc_num_x','upc_num_y'], inplace=True)
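
If the UPC numbers are needed later, the approach described in the comment above (prefer the website number, fall back to the flow chart number) could look like the sketch below. The three rows reuse UPC values from the outputs above purely for illustration, and the `upc_num` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Toy rows: a mismatch, a flow-chart-only model, and a website-only model
upc = pd.DataFrame({
    'upc_num_x': ['757697665530', '796483009837', np.nan],   # flow chart
    'upc_num_y': ['796483387720', np.nan, '796483561878'],   # website
})

# Prefer the website value; use the flow chart value only where the website has none
upc['upc_num'] = upc['upc_num_y'].fillna(upc['upc_num_x'])

print(upc)
```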

#### Check where the collections differ between datasets

In [451]:
# Lots of discrepancies because of capitalization. Let's capitalize all values and reevaluate
df_models['collection_x'] = df_models['collection_x'].astype(str).apply(str.capitalize).str.strip().replace('Nan',np.nan)
df_models['collection_y'] = df_models['collection_y'].astype(str).apply(str.capitalize).str.strip().replace('Nan',np.nan)
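
One detail of the normalization above is the order of operations: capitalizing before stripping leaves the first letter lowercase whenever the raw value starts with a space, which may be why lowercase keys such as `'jacqueline'` show up in the discrepancy mapping later. A hedged sketch on toy values comparing that order with a strip-first variant that uses the `.str` accessor (which leaves nulls untouched, so no round trip through the string `'Nan'` is needed):

```python
import numpy as np
import pandas as pd

# Toy collection names, including leading/trailing whitespace and a null
raw = pd.Series(['FORRESTER CHRONO', ' jacqueline', 'Carlie Mini ', np.nan])

# Same order as the cell above: capitalize, then strip
capitalize_first = (raw.astype(str).apply(str.capitalize)
                       .str.strip().replace('Nan', np.nan))

# Alternative: strip, then capitalize; NaN is passed through by the .str methods
strip_first = raw.str.strip().str.capitalize()

print(capitalize_first.tolist())
print(strip_first.tolist())
```

Switching the order in the notebook would change which rows appear as discrepancies, so the mapping further down would need to be rechecked.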

In [452]:
collection_discrepancy = df_models[(df_models['collection_x'] != df_models['collection_y'])][['style_id','collection_x','collection_y','status']].dropna()
collection_discrepancy

| | style_id | collection_x | collection_y | status |
|---|---|---|---|---|
| 0 | AM4141 | Colleague | Serena | No status reported |
| 5 | AM4508 | Colleague | Serena | DISCD |
| 17 | BQ1010 | Madeline | Rhett | No status reported |
| 18 | BQ1130 | Madeline | Flynn | DISCD |
| 19 | BQ3115 | Madeline | Suitor | No status reported |
| ... | ... | ... | ... | ... |
| 1697 | ME3189 | Carlie mini | Carlie mini me | 2020H3 |
| 1698 | ME3190 | Fb | Fb - 01 automatic | 2020H3 |
| 1699 | ME3191 | Fb-01 | Fb - 01 automatic | 2020H3 |
| 1700 | ME3195 | Neutra | Neutra automatic | No status reported |


In [512]:
collection_name_dict = dict(set(zip(collection_discrepancy['collection_x'],collection_discrepancy['collection_y'])))

In [518]:
collection_replace_dict = {'Fb-01': 'Stella', # confirmed typo
 'Daisy 3hand': 'Carlie', # confirmed typo
 'Carlie': 'Carlie mini',
 'Fb - 02': 'Garrett', # confirmed typo
 'Townsman box set': 'Townsman',
 'Jacqueline': 'Stella', # confirmed typo
 'Fb -05': 'Fb - 01',
 'Townsman':'Townsman', #'Townsman': 'Luxe leisure', # typo on the website
 'Mini': 'Carlie mini',
 'Micro': 'Scarlette micro',
 'The minimalist moonphase': 'The minimalist',
 'Everett 3 hand': 'Everett',
 'Fb -03': 'Fb - 01',
 'Colleague': 'Serena',
 'The': 'The andy and addison set',
 'neutra auto': 'Neutra automatic',
 'Copeland 42mm': 'Copeland',
 'Arc': 'Arc-03',
 'Retro': 'Retro pilot',
 'Tailor': 'Tailor 35mm',
 'The minimalist 3h': 'Karli', # major discrepancy - replaced collection and gender of the watch
 'Grant sport automatic': 'Grant sport',
 'The minimalist':'The minimalist', #'The minimalist': 'Color undertones', # seems unnecessary, will keep the same
 'Neutra chrono':'Neutra chrono', #'Neutra chrono': 'Luxe leisure', # typo on the website
 'Dean': 'Suitor', # major discrepancy - replaced collection and gender of the watch
 'Limited': 'Forrester',# more specific
 'chase timer': 'Chase',
 'Fb-03': 'Fb - 03',
 'Fb -02': 'Fb - 01',
 'Carlie mini': 'Carlie mini v-day',
 'H date 42mm': 'The commuter', # confirmed typo
 'Georgia\xa026mm': 'Georgia',
 'The commuter': 'The commuter 3h',
 'Fb': 'Fb-adventure',
 'Stella': 'Kalya', # confirmed typo
 'The commuter 3h':'The commuter 3h', #'The commuter 3h': 'The commuter 3 hand/date', # seems to be same category
 'Scarlette': 'Scarlette micro',
 'Neutra 3': 'Neutra 3h',
 'Mathis 3h': 'Mathis',
 'Tailor mini': 'Tailor',
 'the essentialist': 'The essentialist',
 'Garrett': 'Fb - 02', # confirmed typo
 'The essentialist': 'Luxe leisure', # typo on the website
 'Chase timer': 'Chase',
 'Bronson': 'Bronson twist',
 'Tailor me': 'Tailor',
 'Original boyfriend':'Original boyfriend', # 'Original boyfriend': 'Obf', # not consistent
 'Grant': 'Karli', # major discrepancy - replaced collection and gender of the watch
 'Machine': 'Suitor mini', # major discrepancy - replaced collection and gender of the watch
 'Forrester chrono': 'Fb - 02', # confirmed typo
 'Lyric': 'Suitor mini', # major discrepancy - replaced collection
 'Kalya': 'Blythe', # major discrepancy - replaced collection 
 'Jacqueline box set': 'Jacqueline set',
 'the minimalist solar': 'The minimalist solar',
 'Mega': 'Mega machine',
 'Forrester auto': 'Forrester automatic',
 'jacqueline': 'Jacqueline',
 'Belmar multifunction': 'Belmar',
 'grant 44mm': 'Grant',
 'Izzy': 'Daisy 3 hand', # confirmed typo
 'Madeline': 'Flynn', # major discrepancy - replaced collection and gender of the watch
 'Neely': 'Typographer', # major discrepancy - replaced collection
 'machine': 'Machine',
 'Arc-01': 'Arc - 01',
 'Neutra': 'Neutra automatic',
 'timer 42mm': 'Chase timer',
 'Fb -01': 'Fb - 01',
 'goodwin chrono': 'Goodwin',
 'Forrester': 'Forrester chrono',
 'Daisy': 'Daisy 3 hand',
 'Earth': 'Earth day watch',
 'the minimalist 3h': 'The minimalist',
 'Sadie': 'Sadie multifunction',
 'Georgia': 'Georgia small',
 'Fb -04': 'Fb - 01'# confirmed typo
                          }
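
Several names in the dictionary above appear both as a key and as a value (for example `'Carlie'` maps to `'Carlie mini'`, which itself maps to `'Carlie mini v-day'`), so it matters whether each name is looked up once or repeatedly. A minimal sketch of a single-pass lookup with `Series.map`, using three entries copied from the dictionary and a toy Series of names:

```python
import pandas as pd

# Three entries copied from collection_replace_dict
mapping = {'Carlie': 'Carlie mini',
           'Carlie mini': 'Carlie mini v-day',
           'Mini': 'Carlie mini'}

# Toy input; 'Serena' has no entry in the mapping
names = pd.Series(['Carlie', 'Carlie mini', 'Mini', 'Serena'])

# map() looks each original value up exactly once; names without an entry
# become NaN, so fall back to the original name for those
single_pass = names.map(mapping).fillna(names)

print(single_pass.tolist())
```

Here `'Carlie'` ends up as `'Carlie mini'` and is not pushed on to `'Carlie mini v-day'`, which is presumably the intent of a one-to-one renaming table.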

In [507]:
# Major discrepancy at index 32. I will replace the collection and gender with that of the website
# df_models[(df_models['collection_x'] == 'The minimalist 3h') & (df_models['collection_y'] == 'Karli')].T
df_models.loc[32,:] = ['BQ3440', 'Karli', '34MM', 'Ladies', 'DISCD', 0.0, 0.0,
       0.0, 0.0, 0.0, 25.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       'Ladies', 'Karli', 34.0, 69.5, 'MOP', 'Japan - JP',
       '2 - Year International Limited Warranty', 'Black',
       'Stainless Steel', 'Stainless Steel', 'Fold-Over',
       'Mother Of Pearl', 'Quartz', '50m - 160ft - 5atm', 10000.0, 139.0]

# Major discrepancy at index 29. I will replace the collection and gender with that of the website
#df_models[(df_models['collection_x'] == 'Dean') & (df_models['collection_y'] == 'Suitor')].T
df_models.loc[29,:] = ['BQ3423', 'Suitor', '36MM', 'Ladies', 'No status reported', 0.0, 0.0,
       0.0, 0.0, 0.0, 48.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       'Ladies', 'Suitor', 36.0, 79.5, 'MOP', 'Japan - JP',
       '2 - Year International Limited Warranty', 'Rose Gold',
       'Stainless Steel', 'Stainless Steel', 'Fold-Over',
       'Mother Of Pearl', 'Quartz', '50m - 160ft - 5atm', 10000.0, 159.0]

# Major discrepancy at index 28. I will replace the collection and gender with that of the website
#df_models[(df_models['collection_x'] == 'Grant') & (df_models['collection_y'] == 'Karli')].T
df_models.loc[28,:] = ['BQ3422', 'Karli', '34MM', 'Ladies', 'No status reported', 0.0, 0.0,
       0.0, 0.0, 0.0, 50.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       'Ladies', 'Karli', 34.0, 64.5, 'MOP', 'Japan - JP',
       '2 - Year International Limited Warranty', 'Rose Gold',
       'Stainless Steel', 'Stainless Steel', 'Fold-Over',
       'Mother Of Pearl', 'Quartz', '50m - 160ft - 5atm', 10000.0, 129.0]

# Major discrepancy at index 26. I will replace the collection and gender with that of the website
#df_models[(df_models['collection_x'] == 'Machine') & (df_models['collection_y'] == 'Suitor mini')].T
df_models.loc[26,:] = ['BQ3334', 'Suitor mini', '26MM', 'Ladies', 'No status reported', 0.0, 0.0,
       0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       'Ladies', 'Suitor mini', 26.0, 69.5, 'MOP', 'Japan - JP',
       '2 - Year International Limited Warranty', 'Gold',
       'Stainless Steel', 'Stainless Steel', 'Fold-Over',
       'Mother Of Pearl', 'Quartz', '50m - 160ft - 5atm', 10000.0, 139.0]

# Major discrepancy at index 18. I will replace the collection and gender with that of the website
# df_models[(df_models['collection_x'] == 'Madeline') & (df_models['collection_y'] == 'Flynn')].T
df_models.loc[18,:] = ['BQ1130', 'Flynn', '48MM', 'Mens', 'DISCD', 0.0, 0.0, 0.0,
       0.0, 0.0, 49.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       'Mens', 'Flynn', 48.0, 69.5, 'Black', 'Japan - JP',
       '2 - Year International Limited Warranty', 'Black', 'Leather',
       'Stainless Steel', 'Buckle', 'Black', 'Quartz',
       '50m - 160ft - 5atm', 10000.0, 139.0]

In [525]:
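# Note: the second positional argument of DataFrame.replace is `value`, so passing 1 here makes
# pandas read the dict keys as column names; no column matches, which is why the output
# below is unchanged. To apply the mapping to the flow chart names, a nested dict such as
# {'collection_x': collection_replace_dict} could be passed instead.
collection_discrepancy.replace(collection_replace_dict,1)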

| | style_id | collection_x | collection_y | status |
|---|---|---|---|---|
| 0 | AM4141 | Colleague | Serena | No status reported |
| 5 | AM4508 | Colleague | Serena | DISCD |
| 17 | BQ1010 | Madeline | Rhett | No status reported |
| 18 | BQ1130 | Madeline | Flynn | DISCD |
| 19 | BQ3115 | Madeline | Suitor | No status reported |
| ... | ... | ... | ... | ... |
| 1697 | ME3189 | Carlie mini | Carlie mini me | 2020H3 |
| 1698 | ME3190 | Fb | Fb - 01 automatic | 2020H3 |
| 1699 | ME3191 | Fb-01 | Fb - 01 automatic | 2020H3 |
| 1700 | ME3195 | Neutra | Neutra automatic | No status reported |


### Combine new models as "test" set

In [281]:
df_q3_new['quarter'] = '2022Q3'
df_q4_new['quarter'] = '2022Q4'

In [283]:
df_new_models = pd.concat([df_q3_new,df_q4_new],ignore_index=True)

In [285]:
df_new_models.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83 entries, 0 to 82
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   style_id       83 non-null     object 
 1   collection     83 non-null     object 
 2   movement_type  83 non-null     object 
 3   retail_price   80 non-null     float64
 4   color          83 non-null     object 
 5   band_color     83 non-null     object 
 6   case           83 non-null     object 
 7   lug_width      83 non-null     object 
 8   gender         83 non-null     object 
 9   quarter        83 non-null     object 
dtypes: float64(1), object(9)
memory usage: 6.6+ KB


In [292]:
# Verify each style is only in the dataset once
df_new_models['style_id'].duplicated().value_counts()

False    83
Name: style_id, dtype: int64