# Data Consistency Checks

1. Importing libraries and datasets
2. Investigating mixed-type columns
3. Duplicates
4. Additional exploratory analysis of dataset
    1. Describe
    2. Mixed-type data
    3. Missing values
    4. Duplicate values
5. Export dataframes

# 1. Importing libraries and datasets

In [1]:
#import libraries
import pandas as pd
import numpy as np
import os

In [2]:
#folder path into usable string
path = r'C:\Users\rutha\CareerFoundry\01-23_Instacart_Basket_Analysis'

In [26]:
#import products csv
df_prods = pd.read_csv(os.path.join(path, '02_data', 'Original_data', 'products.csv'), index_col = False)

In [27]:
df_prods

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [29]:
#import orders_wrangled csv
df_ords = pd.read_csv(os.path.join(path, '02_data', 'Prepared_data', 'orders_wrangled.csv'), index_col = False)

In [30]:
df_ords

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...,...
3421078,3421078,2266710,206209,10,5,18,29.0
3421079,3421079,1854736,206209,11,4,10,30.0
3421080,3421080,626363,206209,12,1,12,18.0
3421081,3421081,2977660,206209,13,1,12,7.0


In [31]:
#dropping unnamed column
df_ords.drop(columns = ['Unnamed: 0'])

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0


In [32]:
#redefining df_ords without the 'Unnamed: 0' column
df_ords = df_ords.drop(columns = ['Unnamed: 0'])

In [33]:
#checking output
df_ords

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0
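
The two cells above differ in one important way: the first only displays the result of `drop()`, while the second assigns it back to `df_ords`. Here is a minimal sketch of that behaviour on a small made-up dataframe (`df_demo` and its values are hypothetical stand-ins, not the real orders data):

```python
import pandas as pd

# Hypothetical miniature of the orders table, including the stray index column
df_demo = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'order_id': [2539329, 2398795, 473747],
    'user_id': [1, 1, 1],
})

# drop() returns a new dataframe and leaves the original untouched
dropped = df_demo.drop(columns=['Unnamed: 0'])
print(list(df_demo.columns))   # the original still contains 'Unnamed: 0'
print(list(dropped.columns))   # the returned copy does not

# Assigning the result back is what makes the change stick
df_demo = df_demo.drop(columns=['Unnamed: 0'])
print(list(df_demo.columns))
```

This is why the column only disappears from `df_ords` after the reassignment cell.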


# 2. Investigating mixed-type columns

Mixed-type columns contain more than one data type within the same column, e.g., string and integer. It's important that we clean mixed-type data *before* we conduct our analysis, as it will interfere with our code.

*Since our dataset doesn't contain any mixed-type columns, we need to create one for the purposes of this exercise.*

In [34]:
#create a dataframe
df_test = pd.DataFrame()

In [35]:
#create a mixed type column
df_test['mix'] = ['a', 'b', 1, True]

In [7]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True
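
As a rough sketch of what could be done with a column like this, the example below finds mixed-type columns by counting the distinct Python types in each column, then converts the offending column to strings. The dataframe `df_mixed` and its `num` column are made up for illustration, and converting to string is only one possible fix; the right target type depends on what the column is meant to hold.

```python
import pandas as pd

# Hypothetical test dataframe: one mixed-type column and one clean numeric column
df_mixed = pd.DataFrame({'mix': ['a', 'b', 1, True], 'num': [1, 2, 3, 4]})

# A column is mixed-type if its values have more than one Python type
mixed_cols = [col for col in df_mixed.columns
              if df_mixed[col].map(type).nunique() > 1]
print(mixed_cols)

# Convert the mixed column to a single type (string, in this sketch)
df_mixed['mix'] = df_mixed['mix'].astype('str')

# Re-run the check to confirm nothing is flagged any more
still_mixed = [col for col in df_mixed.columns
               if df_mixed[col].map(type).nunique() > 1]
print(still_mixed)
```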


In [36]:
#calculating the number of null values in your df
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [37]:
#creating df of just null values  
df_nan = df_prods[df_prods['product_name'].isnull()]

In [38]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [39]:
df_prods.shape

(49693, 5)

In [40]:
#create new clean df without missing values
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [41]:
#check the shape attribute again to compare the rows and columns in your new df
df_prods_clean.shape

(49677, 5)
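
The row count falls from 49,693 to 49,677, a difference of 16, which matches the 16 missing product names counted earlier. The same steps can be sketched on a small made-up dataframe (`df_p` and its values are hypothetical); `dropna(subset=...)` is shown as an alternative that I would expect to give the same result as the boolean filter.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the products table with two missing names
df_p = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Cookies', np.nan, 'Tea', np.nan],
    'prices': [5.8, 9.3, 4.5, 10.5],
})

# Count missing values per column
missing_counts = df_p.isnull().sum()
print(missing_counts)

# Keep only rows with a product name (boolean filter)
df_p_clean = df_p[df_p['product_name'].notnull()]

# Alternative: drop rows where product_name is missing
df_p_dropna = df_p.dropna(subset=['product_name'])

# The number of rows removed should equal the number of missing names
rows_removed = df_p.shape[0] - df_p_clean.shape[0]
print(rows_removed)
```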

# 3. Duplicates

We need to check for duplicates. In this instance we're checking for fully duplicated rows rather than duplicate values within a column, since we would expect repeated values in, say, the department_id column.

In [42]:
#looking for full duplicates by creating a new df
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [43]:
#returning all duplicate rows
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


**Observations**

Five duplicates were found, so we will remove these from our dataset. 

In [44]:
#checking the shape of my df so I can compare once duplicates are removed
df_prods_clean.shape

(49677, 5)

In [45]:
#create a df that doesn't include duplicates
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [46]:
#checking output
df_prods_clean_no_dups.shape

(49672, 5)
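
The shape drops from 49,677 to 49,672 rows, which accounts for the five duplicates found above. To illustrate the difference between full-row duplicates and repeated values in a single column, here is a small sketch on made-up data (`df_d` is hypothetical):

```python
import pandas as pd

# Hypothetical products: rows 1 and 2 are identical in every column,
# while rows 0 and 3 share a name and department but not a product_id
df_d = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Ranger IPA', 'Body Wash', 'Body Wash', 'Ranger IPA'],
    'department_id': [5, 11, 11, 5],
})

# duplicated() flags the second and later copies of fully identical rows
full_dups = df_d[df_d.duplicated()]
print(full_dups)

# drop_duplicates() keeps the first copy of each fully identical row
no_dups = df_d.drop_duplicates()
print(no_dups.shape)

# subset= narrows the comparison to chosen columns, which flags more rows
name_dups = df_d[df_d.duplicated(subset=['product_name'])]
print(name_dups)
```

Checking whole rows is the safer default here, because comparing on a single column would also flag legitimate repeats.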

# 4. Additional exploratory analysis of dataset

I want to run some additional basic statistical analysis to further understand our dataset. Initially we're looking at whether the minimum, maximum, and mean values in our dataset are consistent with our expectations. 

### A. Describe function

In [47]:
#run the df.describe function on df_ords
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


**Observations**

Broadly speaking, the values are what I would expect. The maximum value in order_id is 3,421,083 (3.421083e+06), which matches the number of rows in our dataset, as does the count of the order_number column.
For order_hour_of_day, our minimum value is 0 and our maximum is 23, which aligns with there being 24 hours in the day.
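
These checks by eye could also be written as code. The sketch below compares the minimum and maximum from `describe()` against expected ranges; the dataframe `df_o` is a made-up sample, and the ranges (0 to 6 for the day of the week, 0 to 23 for the hour) are my assumptions based on the output above.

```python
import pandas as pd

# Hypothetical miniature of the orders table
df_o = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747],
    'orders_day_of_week': [2, 3, 3],
    'order_hour_of_day': [8, 7, 12],
    'days_since_prior_order': [None, 15.0, 21.0],
})

summary = df_o.describe()

# Assumed valid ranges for each column: (lowest allowed, highest allowed)
expected_ranges = {
    'orders_day_of_week': (0, 6),
    'order_hour_of_day': (0, 23),
}

# Collect any column whose observed min or max falls outside its range
out_of_range = [
    col for col, (low, high) in expected_ranges.items()
    if summary.loc['min', col] < low or summary.loc['max', col] > high
]
print(out_of_range)

# count ignores missing values, so it is lower for columns containing nulls
print(summary.loc['count', 'days_since_prior_order'])
```

The last line shows why the count for days_since_prior_order in the real output is lower than for the other columns.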

### B. Mixed-type data

In [48]:
#Q3 Check for mixed-type data in your dataframe
for col in df_ords.columns.tolist():
    #flag rows whose type differs from the type of the first value in the column
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[weird]) > 0:
        print(col)

**Observations**

No mixed data returned

### C. Missing values

In [22]:
#Q5 run a check for missing values in your dataframe
##firstly to see True/False
df_ords.isnull()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
3421078,False,False,False,False,False,False,False
3421079,False,False,False,False,False,False,False
3421080,False,False,False,False,False,False,False
3421081,False,False,False,False,False,False,False


In [23]:
##secondly as a count per column
df_ords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

**Observations**

There are 206,209 null values in a dataframe of 3,421,083 rows, which is about 6% of the column values. My initial thought is that these are customers' first orders: a null value in the days_since_prior_order column would be expected when there is no earlier order to measure from. The count also equals the maximum user_id in the describe output (206,209), which is consistent with one null per user.

My response to the 206,209 missing values in the days_since_prior_order column would be to leave them as is.
Firstly, 6% is a reasonably small share of the column, so leaving these values as null is unlikely to distort the analysis. I don't think imputing values based on the mean would bring much benefit, since a first order has no prior-order gap to estimate.
Secondly, I think it is an interesting data point which the client may benefit from understanding. We could create a flag which highlights orders with no prior order, i.e., a customer's first order. Assuming my reading of the null values is correct, first orders make up only about 6% of all orders in the time period selected, which suggests that most of this client's orders come from repeat purchasers. Note that this 6% is a share of orders, not of customers, so judging whether the client should be doing more to bring in new customers would need a customer-level view.
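
The flag idea could look something like the sketch below. It uses a small made-up dataframe (`df_o`, with hypothetical values) and a hypothetical column name, `first_order`. It also tests the assumption behind the flag: that nulls appear only where order_number is 1, and once per user.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the orders table: two users, five orders
df_o = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 10.0],
})

# Flag orders that have no prior order to measure from
df_o['first_order'] = df_o['days_since_prior_order'].isnull()

# Share of all orders that are flagged (a share of rows, not of customers)
share_first = df_o['first_order'].mean()
print(share_first)

# Check the assumption: a null occurs exactly when order_number is 1
nulls_match_first = (df_o['first_order'] == (df_o['order_number'] == 1)).all()
print(nulls_match_first)

# Each user should then have exactly one flagged order
nulls_per_user = df_o.groupby('user_id')['first_order'].sum()
print(nulls_per_user)
```

If the second check came back False on the real data, the nulls would need another explanation before being treated as first orders.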

### D. Duplicate values

In [49]:
#run a check for duplicate values in your df_ords data
df_ords_dups = df_ords[df_ords.duplicated()]

In [50]:
df_ords_dups

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


In [51]:
#double checking using the shape attribute
df_ords_dups.shape

(0, 6)

**Observations**

There are no duplicate rows, but if there were, I would have created a new dataframe without them: df_ords_no_dups = df_ords.drop_duplicates()

# 5. Export dataframes

In [52]:
#export cleaned df_prods as csv
df_prods_clean_no_dups.to_csv(os.path.join(path, '02_data', 'Prepared_data', 'products_clean.csv'))

In [53]:
#export cleaned df_ords as csv
df_ords.to_csv(os.path.join(path, '02_data', 'Prepared_data', 'orders_clean.csv'))
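
One thing to be aware of when exporting: by default, `to_csv()` writes the dataframe index as an unnamed first column, which shows up as 'Unnamed: 0' when the file is read back in. My guess is that this is where the 'Unnamed: 0' column in orders_wrangled.csv came from, although that is an assumption. The sketch below uses a small made-up dataframe (`df_small`) and an in-memory buffer instead of a real file to show the difference that `index=False` makes.

```python
import io
import pandas as pd

# Hypothetical two-row sample
df_small = pd.DataFrame({'order_id': [2539329, 2398795], 'user_id': [1, 1]})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
df_small.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index, index_col=False)
print(list(reloaded_with_index.columns))

# Export with index=False: only the real columns are written
without_index = io.StringIO()
df_small.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
print(list(reloaded_without_index.columns))
```

Passing `index=False` at export would avoid having to drop the extra column in a later notebook, provided nothing downstream relies on it.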