# Exercise - Data Consistency Checks
1. Consistency checks covered in this Exercise on your df_prods dataframe

Importing libraries and df_prods and df_ords data sets

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
#Define path
path = r'/Users/renataherrera/Documents/CF RH 2023-2024/CF DATA IMMERSION/CF RH A4 PYTHON/RH_PYTHON_Instacart Basket Analysis'

In [3]:
#Check
path

'/Users/renataherrera/Documents/CF RH 2023-2024/CF DATA IMMERSION/CF RH A4 PYTHON/RH_PYTHON_Instacart Basket Analysis'

In [4]:
# import products.csv data set from "Original Data" folder
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [7]:
#check
df_prods

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [6]:
# import orders_wrangled.csv data set from "Prepared Data" folder
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

In [8]:
#check
df_ords

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order
0,0,2539329,1,prior,1,2,8,
1,1,2398795,1,prior,2,3,7,15.0
2,2,473747,1,prior,3,3,12,21.0
3,3,2254736,1,prior,4,4,7,29.0
4,4,431534,1,prior,5,4,15,28.0
...,...,...,...,...,...,...,...,...
3421078,3421078,2266710,206209,prior,10,5,18,29.0
3421079,3421079,1854736,206209,prior,11,4,10,30.0
3421080,3421080,626363,206209,prior,12,1,12,18.0
3421081,3421081,2977660,206209,prior,13,1,12,7.0


Mixed-type data

In [10]:
#create a practice "test" dataframe called df_test
df_test = pd.DataFrame()

In [12]:
#create a mixed-type column within df_test filled with string, numeric, and boolean values
df_test['mix'] = ['a', 'b', 1, True]

In [13]:
#check for two string values, a numeric value, and a boolean value
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [14]:
#advanced Python syntax: a for-loop (not a custom function)
#that searches your df for mixed-type columns
#note: pandas 2.1+ deprecates applymap() in favor of DataFrame.map()
for col in df_test.columns.tolist():
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_test[weird]) > 0:
    print(col)

mix


The “for” in for-loop stands for “for these elements, do this,” and the “loop” describes how the structure works: looping over and over again as it performs the procedures detailed by the “for.” Here, the for-loop is looping through each column in the dataframe and executing the same block of code each time.

Within the for-loop, a new variable is created: weird. Assigned to it is a test that checks whether the data types within the column are consistent: it compares the type of every value with the type of the first value in the column. The result is a boolean Series holding one True or False per row. True means that the value’s type differs from the type of the first value. If every entry is False, the column contains only one data type. Boolean values can also be represented by numbers: 0 as False and 1 as True.

Here comes the “if” statement. An if statement checks whether some condition is met and, if it is, executes a block of code. If the condition isn’t met, the code isn’t executed. Here, the if statement filters the dataframe with weird and checks whether any rows remain. If len(df_test[weird]) is greater than 0, at least one value has a different type from the first value, so the column is mixed-type. In that case, the command print(col) is executed, which prints the problematic column for you to see. Because of the for-loop, this check is run on every column in your dataframe, printing every mixed-type column it finds.
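The same check can be sketched as a small reusable function. This is only an illustration, not the exercise's own code: the name find_mixed_type_columns and the df_example dataframe are made up here, and it uses Series.map(type) to collect the distinct Python types in each column instead of comparing against the first row.

```python
import pandas as pd

def find_mixed_type_columns(df):
    """Return the names of columns that hold more than one Python type."""
    mixed = []
    for col in df.columns:
        # Collect the distinct Python types present in this column
        types_found = set(df[col].map(type))
        if len(types_found) > 1:
            mixed.append(col)
    return mixed

# Hypothetical example data: one mixed-type column, one consistent column
df_example = pd.DataFrame({'mix': ['a', 'b', 1, True], 'nums': [1, 2, 3, 4]})
mixed_cols = find_mixed_type_columns(df_example)
print(mixed_cols)
```

Returning a list instead of printing makes it easier to act on the result afterwards, for example by looping over the flagged columns and converting them.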

# Now that you know how to find mixed-type columns within your df, you need to know how to fix them. 


The first step is deciding what single data type the column in question should be. If your column contained mostly names, for instance, it should be a string. If it contained mostly order numbers, it should be a numeric value of some sort.

In [15]:
#Converting a column's data type from numeric to string
#or vice versa; update the str within the astype() function to int64 
#or whichever numeric data type you want to use.
df_test['mix'] = df_test['mix'].astype('str')
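To see what the conversion does, here is a minimal sketch on a throwaway dataframe (df_example is a made-up name). It assumes the same four values as df_test and checks which Python types remain after astype('str').

```python
import pandas as pd

# Hypothetical copy of the practice column
df_example = pd.DataFrame({'mix': ['a', 'b', 1, True]})

# Convert every value to a string
df_example['mix'] = df_example['mix'].astype('str')

# Inspect the Python types and the values after the conversion
types_after = set(df_example['mix'].map(type))
values_after = df_example['mix'].tolist()
print(types_after)
print(values_after)
```

Note that the number 1 and the boolean True become the text values '1' and 'True', so the column is consistent but no longer usable for arithmetic.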

Missing Values - these occur for two reasons: data corruption and/or values that were never recorded in the first place

In [16]:
#to find missing values
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [19]:
#check
df_prods

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


The df_prods.isnull().sum() command applies the isnull() method to the df_prods dataframe, then sums the result with the attached sum() method. The isnull() method is used to find missing observations, with “observations” here referring to entries in your dataframe. Think of them like cells in Excel.

If you were to use the isnull() method by itself, it would return a dataframe of True and False values, one for each entry, which, by itself, isn’t very helpful. You need to know how many total missing observations there are, which is where the sum() method comes in. As you learned previously, True values can also be interpreted numerically as 1, and False values can also be interpreted numerically as 0. If every missing observation is equal to 1, then you can simply add them up using the sum() method to obtain the total number of missing observations in each column.
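The two steps can be looked at separately on a tiny example. The dataframe below is invented for illustration. It only mirrors the column names of df_prods.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of a products table
df_example = pd.DataFrame({
    'product_name': ['Cookies', np.nan, 'Tea', None],
    'prices': [5.8, 9.3, 4.5, 10.5],
})

# Step 1: isnull() alone gives one True/False per entry
null_mask = df_example.isnull()
print(null_mask)

# Step 2: sum() adds up the True values (counted as 1) per column
null_counts = null_mask.sum()
print(null_counts)
```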

# To actually view these 16 values, you can create a subset of the dataframe containing only the values in question

In [17]:
#create a new df (a subset)
#containing only those values within the "product_name" column that meet the condition isnull() = True
df_nan = df_prods[df_prods['product_name'].isnull() == True]

In [18]:
#check
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


Addressing Missing Values

There are a few ways to deal with missing data:

1. Create a new variable that acts like a flag based on the missing value.
2. Impute the value with the mean or median of the column (if the variable is numeric).
3. Remove or filter out the missing data.
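The three options above can be sketched side by side. Everything in this example is assumed for illustration: the small dataframe, the name_missing flag column, and the choice of the median for imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing price
df_example = pd.DataFrame({
    'product_name': ['Cookies', None, 'Tea'],
    'prices': [5.8, np.nan, 4.5],
})

# 1. Flag: a new boolean column marking rows with a missing name
df_example['name_missing'] = df_example['product_name'].isnull()

# 2. Impute: fill the missing numeric value with the column median
df_imputed = df_example.copy()
median_price = df_imputed['prices'].median()
df_imputed['prices'] = df_imputed['prices'].fillna(median_price)

# 3. Filter: keep only the rows where the name is present
df_filtered = df_example[df_example['product_name'].notnull()]
```

Which option fits depends on the column: imputation only makes sense for numeric variables, which is why the exercise filters out the missing product names instead.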

Before making this new dataframe, let’s check the df_prods.shape attribute so you can compare the number of rows in your current dataframe with the number in your subset once the missing rows have been removed.

In [20]:
#shape attribute returning the number of rows and columns in a df
df_prods.shape

(49693, 5)

Next, create a new dataframe called df_prods_clean, this time setting the isnull() condition to False instead of True (you want non-missing values in your new dataframe as opposed to missing values).

In [21]:
#creating a new df for non-missing values = False
df_prods_clean = df_prods[df_prods['product_name'].isnull() == False]

Finally, run df_prods_clean.shape again to check that the number of rows has decreased. This confirms that your operation was a success.

In [31]:
#check if no. of rows has decreased
df_prods_clean.shape

(49677, 5)

The new dataframe should have exactly 16 fewer rows than the original dataframe (the same as the number of missing values): 49,693 - 16 = 49,677.
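Instead of comparing the two shapes by eye, the row-count check can be written as code. This is a sketch on invented data (df_example and df_example_clean are hypothetical names). It also uses notnull(), which selects the same rows as isnull() == False.

```python
import pandas as pd

# Hypothetical data with two missing product names
df_example = pd.DataFrame({
    'product_name': ['Cookies', None, 'Tea', None],
    'prices': [5.8, 9.3, 4.5, 10.5],
})

# Number of missing names before filtering
n_missing = df_example['product_name'].isnull().sum()

# Keep only rows with a product name
df_example_clean = df_example[df_example['product_name'].notnull()]

# The number of removed rows should equal the number of missing values
rows_removed = df_example.shape[0] - df_example_clean.shape[0]
print(rows_removed == n_missing)
```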

# Overwriting can be risky. Unless you’re absolutely sure it’s safe to drop the values in question, you should create a new dataframe instead.

In both of the dropna() commands below, rather than creating an entirely new dataframe, you’re overwriting df_prods with a new version of df_prods that doesn’t contain the missing values. This is done by way of the inplace = True argument, which modifies the original dataframe. If you don’t specify an inplace argument in your code, the function will take the default setting, which is inplace = False. With inplace = False, the command returns a new dataframe with the rows dropped, leaving the original dataframe untouched.

# Another way you can drop all missing values is via the following command:
df_prods.dropna(inplace = True)

# If you wanted to use this command to drop only the NaNs from a particular column, the code would look like this:
df_prods.dropna(subset = ['product_name'], inplace = True)
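A safer variant, sketched here on made-up data, leaves out inplace and assigns the result to a new name. It also shows the difference between dropping rows with any missing value and dropping only rows where one particular column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing name, one missing aisle_id
df_example = pd.DataFrame({
    'product_name': ['Cookies', None, 'Tea'],
    'aisle_id': [61, 104, np.nan],
})

# Drop only rows where product_name is missing (default inplace = False)
df_subset = df_example.dropna(subset=['product_name'])

# Drop rows with a missing value in any column
df_all = df_example.dropna()

# The original dataframe is unchanged in both cases
rows_original = len(df_example)
print(rows_original, len(df_subset), len(df_all))
```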

Duplicates

In [32]:
# command to look for full duplicates within your df
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [33]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


This code creates a new subset of df_prods_clean—df_dups—containing only rows that are duplicates. The duplicated() function is what identifies duplicate rows. It’s run on the df_prods_clean dataframe. Any duplicate rows that it finds are saved within the new df_dups dataframe.
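It can help to look at what duplicated() actually returns. In this sketch (the data is invented), the default setting marks only the second and later copies of a row, while keep=False marks every copy, which is handy when you want to inspect the duplicates next to their originals.

```python
import pandas as pd

# Hypothetical data where the second and third rows are identical
df_example = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Cookies', 'Salt', 'Salt', 'Tea'],
})

# Default: the first occurrence is not marked as a duplicate
dup_mask = df_example.duplicated()

# keep=False: every row that has an identical twin is marked
dup_mask_all = df_example.duplicated(keep=False)

print(dup_mask.tolist())
print(dup_mask_all.tolist())
```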

# You’ve now located your duplicate rows. Great! All that’s left is to address them.

Addressing duplicates - pandas has a handy function for just this purpose—df.drop_duplicates().

In [34]:
#Creating a new df that contains only the unique rows from df_prods_clean
#and that doesn't include the duplicates you just identified, using drop_duplicates()
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [35]:
#check the shape of this new df
df_prods_clean_no_dups.shape

(49672, 5)

# You now have 49,672 rows in your dataframe. The five duplicates have been successfully deleted!

Tidying up and Exporting Changes

In [39]:
#Exporting the new, clean df (df_prods_clean_no_dups) as products_checked
# Re-named and reflective of the data set produced after your consistency checks
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'))
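One side effect of to_csv() is worth knowing: by default it also writes the dataframe index, and when the file is read back that index shows up as a column named "Unnamed: 0". The sketch below demonstrates this with an in-memory buffer instead of a real file. The data and variable names are made up, and whether you pass index=False in your own export is a choice this exercise leaves open.

```python
import io
import pandas as pd

# Hypothetical two-row dataframe
df_example = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# Export with the default settings (index is written), then read back
buffer_with_index = io.StringIO()
df_example.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

# Export with index=False, then read back
buffer_no_index = io.StringIO()
df_example.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
cols_no_index = pd.read_csv(buffer_no_index).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```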

# 2. Run the df.describe() function on your df_ords dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.
Tip: Keep an eye on min and max values!


In [38]:
#checking data
df_ords.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


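# 2 A The days_since_previous_order column should be investigated further. Its count (3,214,874) is lower than the count of every other column (3,421,083), which points to missing values, and its values are unevenly spread between the min (0) and the max (30): the 50% value is only 7.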
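The count row of describe() can be turned into a quick missing-value check, since count only includes non-missing entries. This is a sketch on invented data. For the real output above, the same subtraction gives 3,421,083 - 3,214,874 = 206,209.

```python
import numpy as np
import pandas as pd

# Hypothetical orders data with two missing values
df_example = pd.DataFrame({
    'order_number': [1, 2, 3, 1],
    'days_since_previous_order': [np.nan, 15.0, 21.0, np.nan],
})

summary = df_example.describe()

# Rows in the dataframe minus non-missing entries per column
missing_per_column = len(df_example) - summary.loc['count']
print(missing_per_column)
```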

# 3. Check for mixed-type data in your df_ords dataframe.

In [40]:
#checking df_ords df
df_ords

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order
0,0,2539329,1,prior,1,2,8,
1,1,2398795,1,prior,2,3,7,15.0
2,2,473747,1,prior,3,3,12,21.0
3,3,2254736,1,prior,4,4,7,29.0
4,4,431534,1,prior,5,4,15,28.0
...,...,...,...,...,...,...,...,...
3421078,3421078,2266710,206209,prior,10,5,18,29.0
3421079,3421079,1854736,206209,prior,11,4,10,30.0
3421080,3421080,626363,206209,prior,12,1,12,18.0
3421081,3421081,2977660,206209,prior,13,1,12,7.0


In [41]:
#checking for mixed-type data
for col in df_ords.columns.tolist():
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_ords[weird]) > 0:
    print(col)

# 3 A The loop printed no column names, so no mixed-type columns were found in df_ords.

# 4. If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

# 4 A No mixed-type data in the df_ords df found.

# 5. Run a check for missing values in your df_ords dataframe. In a markdown cell, report your findings and propose an explanation for any missing values you find.

In [42]:
#check for missing values in your df_ords df
df_ords.isnull().sum()

Unnamed: 0                        0
order_id                          0
user_id                           0
eval_set                          0
order_number                      0
orders_day_of_week                0
order_hour_of_day                 0
days_since_previous_order    206209
dtype: int64

In [43]:
#to actually view 206209 values, 
# create a subset of the df containing only the values in question
df_ords_nan = df_ords[df_ords['days_since_previous_order'].isnull() == True]

In [44]:
#check and missing values subset is displayed
df_ords_nan

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order
0,0,2539329,1,prior,1,2,8,
11,11,2168274,2,prior,1,2,11,
26,26,1374495,3,prior,1,1,14,
39,39,3343014,4,prior,1,6,11,
45,45,2717275,5,prior,1,3,12,
...,...,...,...,...,...,...,...,...
3420930,3420930,969311,206205,prior,1,4,12,
3420934,3420934,3189322,206206,prior,1,3,18,
3421002,3421002,2166133,206207,prior,1,6,19,
3421019,3421019,2227043,206208,prior,1,1,15,


In [45]:
#check the user_id of any two random ids with order_number = 1 
# to verify NaN values in corresponding days_since_previous_order column 
df_ords[df_ords['user_id'] == 3]

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order
26,26,1374495,3,prior,1,1,14,
27,27,444309,3,prior,2,3,19,9.0
28,28,3002854,3,prior,3,3,16,21.0
29,29,2037211,3,prior,4,2,18,20.0
30,30,2710558,3,prior,5,0,17,12.0
31,31,1972919,3,prior,6,0,16,7.0
32,32,1839752,3,prior,7,0,15,7.0
33,33,3225766,3,prior,8,0,17,7.0
34,34,3160850,3,prior,9,0,16,7.0
35,35,676467,3,prior,10,3,16,17.0


In [46]:
#check user_id
df_ords[df_ords['user_id'] == 206205]

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order
3420930,3420930,969311,206205,prior,1,4,12,
3420931,3420931,2658896,206205,prior,2,2,15,30.0
3420932,3420932,414137,206205,prior,3,5,16,10.0
3420933,3420933,1716008,206205,train,4,1,16,10.0


# 5 A The check found 206,209 missing values, all of them in the days_since_previous_order column. This total matches the highest user_id value (206209), which suggests one missing value per user.
In the rows with missing values, order_number is always 1. A customer's first order has no prior order, so there is nothing to record in days_since_previous_order and the value is NaN.
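The explanation above can be tested rather than checked by eye on two users. The sketch below uses invented data and variable names. On the real data, the same two checks would be run against df_ords.

```python
import numpy as np
import pandas as pd

# Hypothetical orders for two users
df_example = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'order_number': [1, 2, 1, 2, 3],
    'days_since_previous_order': [np.nan, 15.0, np.nan, 9.0, 21.0],
})

# Subset of rows with a missing days_since_previous_order
nan_rows = df_example[df_example['days_since_previous_order'].isnull()]

# Check 1: every row with a missing value is a first order
all_first_orders = (nan_rows['order_number'] == 1).all()

# Check 2: there is exactly one such row per user
one_per_user = len(nan_rows) == df_example['user_id'].nunique()

print(all_first_orders, one_per_user)
```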

Note: the "Unnamed: 0" column is left as is (neither dropped nor overwritten) because there is no further information about it and changing it is unnecessary.

# 6. Address the missing values using an appropriate method. In a markdown cell, explain why you used your method of choice.

# 6 A The missing values are left as is. Based on the initial assumption that each one represents a customer’s first order, they may be useful for further investigation.
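One practical point in favor of leaving the NaNs in place: pandas skips missing values by default when it computes statistics such as the mean, so they do not distort the summary figures. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical values: one first order (NaN) and two later orders
days = pd.Series([np.nan, 15.0, 21.0], name='days_since_previous_order')

# NaN is skipped by default (skipna=True), so the mean uses 15 and 21 only
mean_days = days.mean()

# count() includes only the non-missing values
count_days = days.count()

print(mean_days, count_days)
```

Replacing the NaNs with 0 instead would pull the mean down and would wrongly suggest that first orders were placed on the same day as a previous one.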


# 7. Run a check for duplicate values in your df_ords data. In a markdown cell, report your findings and propose an explanation for any duplicate values you find.

In [47]:
#check for duplicate values
df_ords_dups = df_ords[df_ords.duplicated()]

In [48]:
df_ords_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order


# 7 A There are no duplicate values in df_ords

# 8. Address the duplicates using an appropriate method. In a markdown cell, explain why you used your method of choice.

# 8 A No duplicates were found, so nothing needs to be addressed. If duplicates had been found, you could drop them with drop_duplicates().

# 9. Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names.

In [49]:
# Export df_ords as orders_checked
df_ords.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_checked.csv')) 

In [51]:
# Export the cleaned products df (df_prods_clean_no_dups) as products_checked
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'))