In [1]:
import pandas as pd
import os

In [2]:
path = r'/Users/polusa/Library/Mobile Documents/com~apple~CloudDocs/my_DA_2024/CareerFoundry_Data_Analytics_Bootcamp/4-Python_Fundamentals_for_DA/04-2024_Instacart_Basket_Analysis/02-Data'

In [3]:
df_prods = pd.read_csv(os.path.join(path,'01-Raw_Data/products.csv'), index_col=False)
df_ords = pd.read_csv(os.path.join(path,'02-Prepared_Data/orders_wrangled.csv'),index_col=[0])

In [5]:
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


### Data Quality (consistency checks)

Throughout this Exercise, you’ll be walking through some of the most common checks you’ll want to perform on data to confirm its consistency.  

These include:

- Finding and addressing mixed data types
- Finding and addressing missing values
- Finding and addressing duplicate records

### Mixed Data Types  

A mixed-type column is a column that includes both string values and numeric values. This could happen, for instance, if you had a column of names (string format), where missing values were marked with a “0” (numeric format).  

Since all columns in our dataset already have consistent data types, let's practice by creating a small test dataframe.

In [6]:
# Create a dataframe

df_test = pd.DataFrame()

In [7]:
# Create a mixed type column

df_test['mix'] = ['a', 'b', 1, True]

In [8]:
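# Print every column whose values do not all share the type of the column's first value
# (DataFrame.map requires pandas >= 2.1; older versions use applymap)
for col in df_test.columns.tolist():
    weird = (df_test[[col]].map(type) != df_test[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_test[weird]) > 0:
        print(col)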

mix


In [9]:
# another method to check for mixed types

mixed_types = df_test.apply(lambda x: len(set(map(type, x))) > 1)
mixed_types

mix    True
dtype: bool

In [10]:
df_test['mix'] = df_test['mix'].astype('str')

In [11]:
df_test.dtypes

mix    object
dtype: object
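
Both checks above only report whether a column is mixed. To see which types are involved, and how often, one option is to count the type names directly. This is a minimal sketch on a hypothetical toy frame (`df_demo` is an illustrative name, not part of the project data):

```python
import pandas as pd

# Hypothetical toy column mirroring the 'mix' column created above
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True]})

# Map every value to the name of its Python type, then count each name
type_counts = df_demo['mix'].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

A column with a single entry in `type_counts` is not mixed; more than one entry tells you which types need to be reconciled before converting.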

### Missing Values  

Missing values can occur for two reasons: 1) data corruption, or 2) they were never recorded in the first place. It’s important that you investigate and address any missing values in your data when conducting an analysis in Python.  

#### IMPACT OF MISSING VALUE: EXAMPLE
Imagine you’re working on an analysis project with financial data. You have two columns that contain amounts, such as `income` and `debt`. Using such specific amounts, especially in finance, isn’t a good idea because they’re likely to change over time (e.g., due to inflation). If you wanted to use these columns, you could, instead, derive a ratio (`debt/income`) in a new column. This would be a more stable characteristic as it would show the proportion of debt compared to income. If you had a missing value in the `income` column, however, the ratio for that row would be missing too, because arithmetic with `NaN` returns `NaN` in pandas. If the missing income had instead been recorded as `0`, the division would yield infinity rather than a meaningful ratio. You’d need to address this missing value before creating the new column!
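
A minimal sketch of how this plays out in pandas, using made-up numbers (the frame `df_fin` and its values are assumptions for illustration only). Note that pandas does not raise an exception here; it silently produces `NaN` or `inf`, which is why the problem is easy to miss:

```python
import numpy as np
import pandas as pd

# Hypothetical financial records: one missing income, one income recorded as 0
df_fin = pd.DataFrame({
    'income': [50000.0, np.nan, 0.0],
    'debt': [10000.0, 5000.0, 2000.0],
})

# Derive the ratio column; no error is raised for the problematic rows
df_fin['debt_to_income'] = df_fin['debt'] / df_fin['income']
print(df_fin)
```

The first row gets a usable ratio, the row with the missing income gets `NaN`, and the row with an income of `0` gets `inf`.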

#### Finding Missing Values  

The `isnull()` method is used to find missing observations, with “observations” here referring to entries in your dataframe. Think of them like cells in Excel. It returns an object of the same shape in which each entry is `True` if the value is missing and `False` otherwise.  



In [12]:
df_prods.head(3)

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5


In [13]:
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

The columns in your dataframe are listed on the left, while a count (the output of the `sum()` function) is displayed next to them on the right. With a quick glance, you can see that the only column with missing values is the "`product_name`" column, and it’s missing 16 values.

In [14]:
df_nan = df_prods[ df_prods['product_name'].isnull() ] 

In [15]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


#### Addressing Missing Values  

There are a few ways to deal with missing data:

1. Create a new variable that acts like a flag based on the missing value.
2. Impute the value with the mean or median of the column (if the variable is numeric).
3. Remove or filter out the missing data.

#### Method 1 
##### Create a new variable that acts like a flag based on the missing value  

Remember that you don’t always want to simply remove missing values—the fact that they’re missing, in itself, could be important.  
One of the most important variables in a bank’s database is the “delinquency” of those who take out loans. It keeps track of whether there has ever been a time when the debtors have failed to pay an installment. The bank can then use these profiles to decide whether or not to give a new customer a loan.  

If a customer never took out a loan or was never late paying off their loan, they’d never appear in the delinquency column, making for what would likely be many missing values in the `delinquency` column. These missing values actually hold just as much importance as the non-missing values.  

Instead of removing these rows, create a new column containing the string values “Delinquent” and “Not delinquent” to flag the delinquent status of each profile.  
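
As a rough sketch of the flagging idea, assuming a hypothetical `df_loans` frame in which a missing `delinquency` value means no late installment was ever recorded (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical loan records: NaN means no late installment was ever recorded
df_loans = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'delinquency': [2.0, np.nan, np.nan],
})

# Build the flag from the missingness itself instead of dropping the rows
df_loans['delinquency_flag'] = np.where(
    df_loans['delinquency'].isnull(), 'Not delinquent', 'Delinquent'
)
print(df_loans)
```

The original column is left untouched, so the information carried by the missing values is kept alongside the new flag.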


#### Method 2  
##### Impute the value with the mean or median of the column (numeric columns only)  
Imputation rests on the assumption that, had the value not been missing, it would probably be close to the column's mean or median.  
Do be prudent when deciding whether to use the mean or median. The mean is a statistical measure that can be greatly influenced by extreme values, which could potentially lead to inaccurate imputations; the median is more robust in that case.  

`df['column with missings'] = df['column with missings'].fillna(mean_value)`  
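
To make the mean-versus-median caution concrete, here is a small sketch on an invented `prices` column containing one extreme value (`df_prices` and its numbers are assumptions, not project data). The imputed column is created by assigning the result of `fillna()` back to the dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value and one extreme value
df_prices = pd.DataFrame({'prices': [1.0, 2.0, np.nan, 3.0, 100.0]})

mean_value = df_prices['prices'].mean()      # pulled upwards by the extreme value
median_value = df_prices['prices'].median()  # stays close to the typical values

# Impute with the median and keep the original column for comparison
df_prices['prices_imputed'] = df_prices['prices'].fillna(median_value)
print(mean_value, median_value)
print(df_prices)
```

Here the mean is far from most of the observed prices, so filling the gap with it would introduce an implausible value, while the median stays within the bulk of the data.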


#### Method 3  
##### Remove or filter out the missing data  

Looking at the rows in `df_nan`, it quickly becomes clear that there isn’t much you can do in terms of imputation, as the column with the missing values (`product_name`) is of `string` type.  

String values can’t be imputed like numeric values, leaving you in a bit of a pickle. You can either remove the rows with missing values entirely or filter the non-missing rows into a subset dataframe and continue your analysis with this new dataframe.  



In [16]:
df_prods.shape 

# (row, columns)

(49693, 5)

In [17]:
# ~ is the negation in pandas ( df_prods['product_name'].isnull() == False )

df_prods_clean = df_prods[ ~df_prods['product_name'].isnull()]

In [18]:
# 16 rows were removed

df_prods_clean.shape

(49677, 5)

Another way you can drop all missing values is via the following command:  

`df_prods.dropna(inplace = True)`  

If you wanted to use this command to drop __only the NaNs from a particular column__, the code would look like this:  

`df_prods.dropna(subset = ['product_name'], inplace = True)`
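
The difference between the two commands is easiest to see on a tiny example. The sketch below uses a hypothetical `df_small` frame and the non-inplace form, so the original dataframe stays available for comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing name in one row and a missing price in another
df_small = pd.DataFrame({
    'product_name': ['Tea', None, 'Salt'],
    'prices': [4.5, 9.3, np.nan],
})

# Drops every row that has a missing value in any column
dropped_any = df_small.dropna()

# Drops only the rows where 'product_name' is missing
dropped_name = df_small.dropna(subset=['product_name'])

print(dropped_any.shape, dropped_name.shape, df_small.shape)
```

Without `subset`, the row with the missing price is lost as well, which may remove more data than intended.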

#### Duplicates  

It is important to think carefully about what should count as a duplicate. Rows that are identical in every column are clearly duplicates, but sometimes a combination of a few variables is enough to identify a duplicate record, even when other values differ between the rows.  
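
As a sketch of the second case, `duplicated()` accepts a `subset` of columns to compare, and `keep=False` marks every copy rather than only the repeats. The frame below is invented for illustration, and the choice of `product_name` plus `aisle_id` as the identifying combination is an assumption:

```python
import pandas as pd

# Hypothetical product rows: same name and aisle, but one price differs
df_small = pd.DataFrame({
    'product_name': ['Ranger IPA', 'Ranger IPA', 'Ranger IPA'],
    'aisle_id': [27, 27, 27],
    'prices': [9.2, 9.2, 9.9],
})

# Full-row comparison: only the exact copy is flagged
full_row_dups = df_small.duplicated().sum()

# Comparison on the chosen key columns: both repeats are flagged
key_dups = df_small.duplicated(subset=['product_name', 'aisle_id']).sum()

# keep=False returns all rows involved, which helps when inspecting them side by side
all_copies = df_small[df_small.duplicated(subset=['product_name', 'aisle_id'], keep=False)]

print(full_row_dups, key_dups, len(all_copies))
```

Which columns define a duplicate is a judgment about the data, so the subset should be chosen and documented deliberately.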

##### Finding Duplicates

In [19]:
# return a boolean mask: True for every row that is an exact copy of an earlier row
df_prods_clean.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
49688    False
49689    False
49690    False
49691    False
49692    False
Length: 49677, dtype: bool

In [20]:
# there are 5 rows duplicated
df_prods_clean.duplicated().sum()

5

We can create a new DataFrame containing the duplicated rows:

In [21]:
df_dups = df_prods_clean[ df_prods_clean.duplicated()]

In [22]:
# as previously verified, there are 5 duplicated rows
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


##### Addressing Duplicates  

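Duplicates need to be removed. By default, `drop_duplicates()` keeps the first occurrence of each row and drops the repeats.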

In [23]:
df_prods_clean.shape

(49677, 5)

In [24]:
df_prods_clean.drop_duplicates().shape

(49672, 5)

In [25]:
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [26]:
df_prods_clean_no_dups.shape

(49672, 5)

##### Export cleaned dataset 

In [27]:
df_prods_clean_no_dups.to_csv(os.path.join(path, '02-Prepared_Data/products_checked.csv'))

# Task 4.5 - Data Consistency Check

##### 2) Run the `df.describe()` function on your `df_ords` dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.

In [28]:
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


The column `days_since_last_order` needs further investigation.  
Its count is lower than that of all the other columns, which indicates missing (NaN) values. The `min` value of `0` should also be understood better. It may be that `0` is recorded for a first order, or perhaps when a second order is placed on the same day. While the value `0` doesn't seem incorrect, its meaning in this context should be investigated.  

The column `order_number` should also be understood. While there seem to be no missing values, its values range from `1` to `100`. It is not clear whether this refers to some quantity in the order, or whether something is wrong with how the order number is recorded (it might refer to an invoice number or some other document needed for the order).  

The remaining columns seem to have consistent and plausible statistics in this context.

In [29]:
# no missing values in the 'order_number' column

df_ords['order_number'].isnull().sum()

0

In [30]:
# range of values in the 'order_number' column

df_ords['order_number'].unique()

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

##### 3) Check for mixed-type data in your df_ords dataframe

In [31]:
mixed_types = df_ords.apply(lambda x: len(set(map(type, x))) > 1)

In [32]:
mixed_types

order_id                 False
user_id                  False
order_number             False
orders_day_of_week       False
order_hour_of_day        False
days_since_last_order    False
dtype: bool

There seem to be no mixed types in the `df_ords` dataframe.

##### 5) Run a check for missing values in your df_ords dataframe

In [33]:
df_ords.isnull().sum()

order_id                      0
user_id                       0
order_number                  0
orders_day_of_week            0
order_hour_of_day             0
days_since_last_order    206209
dtype: int64

The only missing values are in the column `days_since_last_order`, with 206,209 missing entries. Let's check the unique values in this column:

In [34]:
df_ords['days_since_last_order'].unique()

array([nan, 15., 21., 29., 28., 19., 20., 14.,  0., 30., 10.,  3.,  8.,
       13., 27.,  6.,  9., 12.,  7., 17., 11., 22.,  4.,  5.,  2., 23.,
       26., 25., 16.,  1., 18., 24.])

As we can see, `0` is a legitimate value in the data, so for now I would rule out that a missing value stands for `0`, pending further investigation. A missing value may instead indicate the first order ever recorded for that particular user.

In [35]:
# number of duplicates of the 'user_id' in the rows with the missing value in the 'days_since_last_order'
df_ords[ df_ords['days_since_last_order'].isnull()]['user_id'].duplicated().sum()

0

In [36]:
# number of duplicates of the 'user_id' in the rows with NO missing value in the 'days_since_last_order'
df_ords[ ~df_ords['days_since_last_order'].isnull()]['user_id'].duplicated().sum()

3008665

In [37]:
df_ords[ df_ords['days_since_last_order'].isnull()][['order_number','days_since_last_order']]


Unnamed: 0,order_number,days_since_last_order
0,1,
11,1,
26,1,
39,1,
45,1,
...,...,...
3420930,1,
3420934,1,
3421002,1,
3421019,1,


Upon a quick investigation, the subset of rows with a missing value in the `days_since_last_order` column has no duplicated `user_id` values, while the rows without missing values contain many repeated `user_id` values.  

Moreover, every missing value coincides with a value of `1` in the `order_number` column, which also helps to clarify what the latter column means: `order_number` seems to indicate how many orders a user has placed with us so far.

Based on this, the missing values in the `days_since_last_order` column most likely mark a user's first recorded order. Consequently, a value of `0` in `days_since_last_order` would indicate an order placed less than 24 hours after the previous one.
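
The table above shows only a truncated view, so the claim can also be checked for every row at once. The following is a sketch on a small invented frame (`df_demo` stands in for `df_ords`); the same expressions should apply to the real dataframe, though that has not been run here:

```python
import numpy as np
import pandas as pd

# Hypothetical orders for two users, mimicking the structure of df_ords
df_demo = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_last_order': [np.nan, 15.0, 0.0, np.nan, 7.0],
})

is_missing = df_demo['days_since_last_order'].isnull()
is_first = df_demo['order_number'] == 1

# True only if the value is missing exactly on first orders and nowhere else
missing_iff_first = (is_missing == is_first).all()

# One missing value per user is what the "first order" interpretation predicts
one_per_user = is_missing.sum() == df_demo['user_id'].nunique()

print(missing_iff_first, one_per_user)
```

In the real data, the 206,209 missing entries match the maximum `user_id` reported by `describe()`, which is consistent with one first order per user if the IDs run without gaps.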

##### 6) Address the missing values using an appropriate method  

I will impute the value `-1` in place of the missing values to stay consistent with the data type. This must be documented so that, whenever statistics are calculated on this column, the value `-1` is excluded to avoid misleading results.

In [38]:
filled_column = df_ords['days_since_last_order'].fillna(-1)
df_ords_clean_no_missing = df_ords.assign(days_since_last_order = filled_column)


In [39]:
df_ords_clean_no_missing.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,1,2,8,-1.0
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
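
Because `-1` is a placeholder rather than a real measurement, any statistic on this column has to leave it out. A minimal sketch of two ways to do that, on an invented `orders` frame (the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical column after imputation: -1 marks a first order
orders = pd.DataFrame({'days_since_last_order': [-1.0, 15.0, 21.0, -1.0, 0.0]})

# Naive mean: the placeholder drags the result down
naive_mean = orders['days_since_last_order'].mean()

# Option 1: filter the placeholder out before computing the statistic
valid = orders.loc[orders['days_since_last_order'] != -1, 'days_since_last_order']
clean_mean = valid.mean()

# Option 2: turn the placeholder back into NaN, which mean() skips by default
restored = orders['days_since_last_order'].replace(-1, np.nan)

print(naive_mean, clean_mean, restored.mean())
```

Both options give the same result; the second is convenient when several statistics are needed, since pandas aggregations ignore `NaN` by default.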


##### 7) Run a check for duplicate values in your df_ords data

In [40]:
df_ords_clean_no_missing.duplicated().sum()

0

There are no duplicated rows in the orders dataset.

##### 9) Export your final, cleaned df_prods and df_ords data as “.csv”

In [41]:
# df_ords_clean_no_missing.to_csv(os.path.join(path, '02-Prepared_Data/orders_checked.csv'))

In [44]:
df_ords_clean_no_missing.shape

(3421083, 6)