# Table of Content
- [Imports](#imports)
- [Resources](#resources)
- [Load Data Sets](#load-data-sets)
	- [Orders (Wrangled)](#orders-(wrangled))
	- [Products](#products)
- [Consistency Checks - Products Data](#consistency-checks---products-data)
- [Consistency Checks - Orders Data](#consistency-checks---oders-data)
- [Export Cleaned Data Sets](#export-cleaned-data-sets)


## Imports [#](#table-of-content)

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd

import da_helper as da

## Resources [#](#table-of-content)

In [2]:
# project folder
project_folder = Path(r"C:\Users\vynde\Desktop\CareerFoundry Data Analytics\Data Immersion - 4 Python Fundamentals for Data Analysts\Instacart_Basket_Analysis")

# resource folders
original_data_folder = project_folder / "02_Data" / "Original_Data"
prepared_data_folder = project_folder / "02_Data" / "Prepared_Data"

# input files
products_data_file = original_data_folder / "products.csv"
oders_data_file = prepared_data_folder / "orders_wrangled.csv"

# output files
cleaned_oders_data_file = prepared_data_folder / "orders_cleaned.csv"
cleaned_products_data_file = prepared_data_folder / "products_cleaned.csv"

## Load Data Sets [#](#table-of-content)

##### Orders (Wrangled) [#](#table-of-content)

In [3]:
df_ords = pd.read_csv(oders_data_file)
df_ords.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


##### Products [#](#table-of-content)

In [4]:
df_prods = pd.read_csv(products_data_file, index_col=False)
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


## Consistency Checks - Products Data [#](#table-of-content)

In [5]:
# create a copy of the original dataframe to work with
df_prods_cleaned = df_prods.copy()

Basic stats

In [6]:
df_prods_cleaned.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


Check for mixed types

In [7]:
da.check_types(df_prods_cleaned)

Columns with mixed types:
  product_id: no
  product_name: YES
  aisle_id: no
  department_id: no
  prices: no
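
`da.check_types` comes from the author's own helper module, and its implementation is not shown here. As a rough idea of what such a check could look like, the following sketch uses a hypothetical function `find_mixed_type_columns` and made-up toy data. It flags a column when its values have more than one Python type. One plausible reason for `product_name` being flagged is that missing names are stored as float `NaN` next to string values.

```python
import numpy as np
import pandas as pd


def find_mixed_type_columns(df: pd.DataFrame) -> list[str]:
    """Return the names of columns whose values have more than one Python type.

    Hypothetical stand-in for a mixed-type check, not the da_helper code.
    """
    mixed = []
    for col in df.columns:
        # collect the distinct Python types that occur in this column
        value_types = set(map(type, df[col]))
        if len(value_types) > 1:
            mixed.append(col)
    return mixed


# toy data (assumption): a missing name is a float NaN among str values
toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": pd.Series(["Salt", np.nan, "Tea"], dtype=object),
})

mixed_columns = find_mixed_type_columns(toy)
print(mixed_columns)
```

If the missing values are the only cause, dropping or filling them should make the column homogeneous again.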


Check for missing values

In [8]:
df_nan = da.check_missing(df_prods_cleaned)

Missing values:
  product_id: 0
  product_name: 16
  aisle_id: 0
  department_id: 0
  prices: 0


Handle missing values

In [9]:
# drop records with nan product_name
df_prods_cleaned.dropna(subset=["product_name"], inplace=True)

Check for duplicates

In [10]:
da.check_duplicates(df_prods_cleaned)

Number of true duplicates: 5
Duplicate values:
  product_id: 7
  product_name: 5
  aisle_id: 49543
  department_id: 49656
  prices: 49435


Handle duplicates

In [11]:
# drop duplicates
df_prods_cleaned.drop_duplicates(inplace=True)
da.check_duplicates(df_prods_cleaned)

Number of true duplicates: 0
Duplicate values:
  product_id: 2
  product_name: 0
  aisle_id: 49538
  department_id: 49651
  prices: 49430


Inspect duplicated product_id's 

In [12]:
# get list of duplicated ids
dupl_ids = df_prods_cleaned[df_prods_cleaned.duplicated(subset=["product_id"])].product_id.tolist()
dupl_ids

[6800, 26520]

In [13]:
# show records
df_prods_cleaned[df_prods_cleaned.product_id.isin(dupl_ids)]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
6799,6800,Revive Zero Vitamin Water,64,7,6.4
6800,6800,Sprouted Quinoa Flakes Baby Cereal,92,18,14.0
26520,26520,Clinical Advanced Solid Ultimate Fresh Anti-Pe...,80,11,10.6
26521,26520,Cheese Shredded Sharp Cheddar Reduced Fat 2%,21,16,2.9


>Two product_ids (6800 and 26520) are each assigned to two different products, which can lead to problems later on, for example when joining on product_id. However, product_name is unique.
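
The two steps above (collect the duplicated ids, then filter with `isin`) can also be done in one step. This is a minimal sketch on made-up toy data: `duplicated(..., keep=False)` marks every row that shares its `product_id` with another row, not only the second and later occurrences.

```python
import pandas as pd

# toy data (assumption): product_id 6800 is assigned to two different products
toy = pd.DataFrame({
    "product_id": [6799, 6800, 6800, 6801],
    "product_name": ["A", "B", "C", "D"],
})

# keep=False flags all rows of a duplicated id, including the first occurrence
conflicts = toy[toy.duplicated(subset=["product_id"], keep=False)]
print(conflicts)
```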

## Consistency Checks - Oders Data [#](#table-of-content)

In [14]:
# create copy to work with
df_ords_cleaned = df_ords.copy()

Basic stats

In [15]:
df_ords_cleaned.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


>- order_id: reasonable. <br>
  min, max and count are consistent with each other
>- user_id: ok
>- order_number: min 1, max 100, ok
>- orders_day_of_week: min 0, max 6
>- order_hour_of_day: min 0, max 23, ok
>- days_since_prior_order: min 0, max 30

Check for mixed types

In [16]:
df_ords_cleaned.dtypes

Unnamed: 0                  int64
order_id                    int64
user_id                     int64
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

In [17]:
# checking for mixed data types
da.check_types(df_ords_cleaned)

Columns with mixed types:
  Unnamed: 0: no
  order_id: no
  user_id: no
  order_number: no
  orders_day_of_week: no
  order_hour_of_day: no
  days_since_prior_order: no


Check for missing values

In [18]:
da.check_missing(df_ords_cleaned);

Missing values:
  Unnamed: 0: 0
  order_id: 0
  user_id: 0
  order_number: 0
  orders_day_of_week: 0
  order_hour_of_day: 0
  days_since_prior_order: 206209


Handle missing values

In [19]:
# total number of unique users
df_ords_cleaned.user_id.nunique()

206209

In [20]:
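# replace nan values with 0
# (assign the result back; inplace fillna on a selected column may not update the dataframe in newer pandas versions)
df_ords_cleaned["days_since_prior_order"] = df_ords_cleaned["days_since_prior_order"].fillna(0)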

>The number of missing values (206209) is exactly the max value of user_id.
I double-checked that no user_ids are missing by counting the unique users, which also gives 206209, so there is one missing value per user.
Another check would have been to confirm that the missing values occur only where order_number == 1, because these orders have no prior order and thus no days_since_prior_order.
In this case we can replace these values with 0.
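
The alternative check mentioned above could look like the following sketch. It uses made-up toy data and has to run before the missing values are filled. It tests whether `days_since_prior_order` is missing exactly on the rows where `order_number == 1`.

```python
import numpy as np
import pandas as pd

# toy data (assumption): two users, each with a first order and later orders
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "order_number": [1, 2, 1, 2, 3],
    "days_since_prior_order": [np.nan, 15.0, np.nan, 7.0, 30.0],
})

is_missing = toy["days_since_prior_order"].isna()
is_first_order = toy["order_number"] == 1

# True only if every NaN sits on a first order and every first order has a NaN
nan_only_on_first_orders = bool((is_missing == is_first_order).all())
print(nan_only_on_first_orders)
```

On the real data, the same comparison would use `df_ords_cleaned` instead of `toy`.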

Check for duplicates

In [21]:
da.check_duplicates(df_ords_cleaned, subset=["user_id", "order_number"])

Number of true duplicates: 0
Duplicate values:
  Unnamed: 0: 0
  order_id: 0
  user_id: 3214874
  order_number: 3420983
  orders_day_of_week: 3421076
  order_hour_of_day: 3421059
  days_since_prior_order: 3421052
Number of duplicates for subset ['user_id', 'order_number']: 0


>There are no duplicate records: neither true duplicates nor duplicates for the subset "user_id" and "order_number".
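
`da.check_duplicates` is the author's helper and its code is not shown. As an illustration only, the two kinds of duplicates can be counted with plain pandas. The toy data below is made up and deliberately contains one repeated `(user_id, order_number)` pair, so the two counts differ.

```python
import pandas as pd

# toy data (assumption): the last two rows share user_id and order_number
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "order_number": [1, 2, 1, 1],
    "order_hour_of_day": [8, 7, 8, 9],
})

# true duplicates: rows that are identical in every column
n_true_duplicates = int(toy.duplicated().sum())

# key duplicates: rows that repeat an earlier (user_id, order_number) pair
n_key_duplicates = int(toy.duplicated(subset=["user_id", "order_number"]).sum())

print(n_true_duplicates, n_key_duplicates)
```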

## Export Cleaned Data Sets [#](#table-of-content)

Cleaned products data

In [22]:
# index=False avoids writing the row index as an extra unnamed column
df_prods_cleaned.to_csv(cleaned_products_data_file, index=False)

Cleaned orders data

In [23]:
# index=False avoids writing the row index as an extra unnamed column
df_ords_cleaned.to_csv(cleaned_oders_data_file, index=False)
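
A note on the index when exporting: the `Unnamed: 0` column in the wrangled orders data is most likely a row index written by an earlier `to_csv` call. The sketch below uses made-up toy data and an in-memory buffer instead of a file to show the difference between the default export and `index=False`.

```python
import io

import pandas as pd

# toy data (assumption)
toy = pd.DataFrame({"order_id": [10, 11], "user_id": [1, 1]})

# default: the row index is written as an extra first column without a name
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```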