# Exercise 4.5 Data Consistency Checks

## This script contains the following:
1. Importing Libraries and Data Files
2. Consistency Checks on products.csv Data
3. Consistency Checks on orders.csv Data
4. Export Dataframes

# 1. Importing Libraries and Data Files

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Create a string of the path for the main project folder
path = r'C:\Users\Ryan\Documents\07-17-2023 Instacart Basket Analysis'

In [3]:
# Import the "products.csv" data set using the os library
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'))

In [4]:
# Import the "orders_wrangled.csv" data set using the os library
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

# 2. Consistency Checks on products.csv Data

#### If you haven’t performed the consistency checks covered in this Exercise on your df_prods dataframe, do so now.

In [5]:
# Check the df_prods output
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [6]:
df_prods.shape

(49693, 5)

In [7]:
# Find missing values in df_prods
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [8]:
# Show the rows where 'product_name' is missing
df_prods.loc[df_prods['product_name'].isnull()]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [9]:
# Create a dataframe excluding missing values from 'product_name' column
df_prods = df_prods.loc[df_prods['product_name'].notnull()]

In [10]:
# Check the dimensions
df_prods.shape

(49677, 5)

The dataframe now has 16 fewer rows than before.
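The counting and filtering steps above can be reproduced on a small frame. This is a minimal sketch; `df_demo` and its values are made up for illustration and are not taken from products.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four product names are missing
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Tea', np.nan, 'Salt', None],
})

# Count missing values per column
missing_counts = df_demo.isnull().sum()

# Keep only the rows that have a product name
df_clean = df_demo[df_demo['product_name'].notnull()]

# The number of rows removed should equal the number of missing names
rows_removed = len(df_demo) - len(df_clean)
print(missing_counts['product_name'], rows_removed)
```

Comparing the row difference against the missing-value count is the same sanity check as the shape comparison above.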

In [11]:
# Check the data types
df_prods.dtypes

product_id         int64
product_name      object
aisle_id           int64
department_id      int64
prices           float64
dtype: object

In [12]:
# Check for mixed type data
for col in df_prods.columns.tolist():
    # Flag rows whose value type differs from the type of the column's first value
    weird = (df_prods[[col]].applymap(type) != df_prods[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_prods[weird]) > 0:
        print(col)

No mixed data types found.
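The loop above prints nothing when every column is consistent, so it can be useful to see it catch a real case. The sketch below is an alternative formulation, not the check used in this notebook: it counts the distinct Python types per column. The helper name `find_mixed_columns` and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame: 'mixed' holds both int and str objects
df_demo = pd.DataFrame({
    'clean': [1, 2, 3],
    'mixed': [1, 'two', 3],
})

def find_mixed_columns(df):
    """Return the names of columns whose values are not all one Python type."""
    mixed = []
    for col in df.columns:
        # Series.map(type) gives the Python type of each cell
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

mixed_cols = find_mixed_columns(df_demo)
print(mixed_cols)
```

Note that a text column containing NaN would also be flagged by this approach, because NaN is a float.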

In [13]:
# Check for duplicates
df_prods[df_prods.duplicated()]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


Five duplicates found.

In [14]:
# Remove duplicates from df_prods
df_prods = df_prods.drop_duplicates()

In [15]:
# Check the dimensions
df_prods.shape

(49672, 5)

The dataframe now has 5 fewer rows than before.
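As a small illustration of what duplicated() and drop_duplicates() do with their default settings, the sketch below uses a made-up frame (`df_demo` is hypothetical). By default only the second and later copies of a fully identical row are flagged, and the first copy is kept.

```python
import pandas as pd

# Hypothetical toy data: the third row is an exact copy of the second
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Tea', 'Salt', 'Salt', 'Oats'],
})

# True for every row that repeats an earlier row in full
dup_mask = df_demo.duplicated()

# Drop the flagged rows, keeping the first occurrence of each
df_dedup = df_demo.drop_duplicates()

print(dup_mask.tolist(), df_dedup.shape)
```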

# 3. Consistency Checks on orders.csv Data

#### Run the df.describe() function on your df_ords dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.

In [16]:
# Check df_ords dataframe
df_ords.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


In [17]:
# Drop the 'Unnamed: 0' column from df_ords dataframe
df_ords = df_ords.drop(columns = ['Unnamed: 0'])

In [18]:
# Check the output
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [19]:
# Obtain numerical statistics on df_ords
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


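The 'order_id', 'user_id', and 'order_number' columns are identifiers rather than measures, so their summary statistics are not meaningful. pandas most likely inferred an integer data type for them when reading the CSV file, which is why describe() includes them. The min and max values of the other columns look plausible: 'orders_day_of_week' runs from 0 to 6, 'order_hour_of_day' from 0 to 23, and 'days_since_prior_order' from 0 to 30. The count for 'days_since_prior_order' is lower than for the other columns, which points to missing values.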
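One possible way to keep identifier columns out of the summary is to cast them to strings before calling describe(). This is a sketch of that option only; the notebook itself leaves the data types unchanged, and `df_demo` is a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy data with two identifier columns and one measure
df_demo = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747],
    'user_id': [1, 1, 1],
    'order_hour_of_day': [8, 7, 12],
})

# Cast the identifiers to strings so they are no longer treated as numeric
df_demo = df_demo.astype({'order_id': str, 'user_id': str})

# With mixed column types, describe() summarises only the numeric columns
summary = df_demo.describe()
print(summary.columns.tolist())
```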

#### Check for mixed-type data in your df_ords dataframe.

In [20]:
# Check for mixed types in df_ords dataframe
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len (df_ords[weird]) > 0:
        print (col)

There are no mixed data types in the df_ords dataframe.

#### If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

There are no mixed-type data to fix.

#### Run a check for missing values in your df_ords dataframe. In a markdown cell, report your findings and propose an explanation for any missing values you find.

In [21]:
# Find missing values in df_ords dataframe
df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

There are 206,209 missing values in the 'days_since_prior_order' column. The most likely explanation is that these rows are first-time orders, for which no prior order exists and so no day count can be calculated.

#### Address the missing values using an appropriate method. In a markdown cell, explain why you used your method of choice.

In [22]:
# Create a dataframe containing the missing values from 'days_since_prior_order' column
df_nan = df_ords[df_ords['days_since_prior_order'].isnull()]
df_nan.head(5)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,


In [23]:
df_nan.tail(5)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,
3421069,3154581,206209,1,3,11,


As expected, the rows with a missing 'days_since_prior_order' value are each user's first order (order_number = 1). Since a first order has no prior order, leaving the value blank is correct, so no action is taken to address the missing values.
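The head and tail only sample the affected rows. A fuller check, sketched here on a hypothetical toy frame (`df_demo`) rather than on df_ords, compares the missing-value mask with the first-order mask row by row and confirms there is one missing value per user.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two users, each with a first order and follow-ups
df_demo = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 10.0],
})

# Rows where the day count is missing
is_missing = df_demo['days_since_prior_order'].isnull()

# Rows that are a user's first order
is_first = df_demo['order_number'] == 1

# True only if the two masks agree on every row
all_match = bool((is_missing == is_first).all())

# One missing value per user is expected if the explanation holds
n_missing = int(is_missing.sum())
n_users = df_demo['user_id'].nunique()
print(all_match, n_missing, n_users)
```

On the real data, the 206,209 missing values match the maximum 'user_id' of 206,209 in the describe() output, which is consistent with one first order per user, assuming user IDs run consecutively from 1.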

#### Run a check for duplicate values in your df_ords data. In a markdown cell, report your findings and propose an explanation for any duplicate values you find.

In [24]:
# Check for duplicates from df_ords
df_ords[df_ords.duplicated()]

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


There are no full duplicates in the df_ords dataframe.

#### Address the duplicates using an appropriate method. In a markdown cell, explain why you used your method of choice.

There are no duplicates to address.

In [25]:
# Check the dimensions
df_ords.shape

(3421083, 6)

# 4. Export Dataframes

#### Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names.

In [26]:
# Export df_prods dataframe as "products_checked.csv"
df_prods.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))

In [27]:
# Export df_ords dataframe as "orders_checked.csv"
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'))
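By default, to_csv() also writes the dataframe index, and that index comes back as an 'Unnamed: 0' column on the next import, like the one dropped from df_ords earlier. The sketch below shows the difference using an in-memory buffer and a hypothetical toy frame (`df_demo`). Passing `index=False` is an optional variation, not what the export cells above do.

```python
import io
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({'order_id': [10, 20], 'user_id': [1, 2]})

# Default export: the index is written as an unnamed first column
buf_with = io.StringIO()
df_demo.to_csv(buf_with)
buf_with.seek(0)
cols_with_index = pd.read_csv(buf_with).columns.tolist()

# Export with index=False: only the data columns are written
buf_without = io.StringIO()
df_demo.to_csv(buf_without, index=False)
buf_without.seek(0)
cols_without_index = pd.read_csv(buf_without).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```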