# Data Consistency Checks

1. Importing libraries and data set 
2. Mixed-type columns
3. Duplicates
4. Exercise 4.5

In [1]:
#import libraries
import pandas as pd
import numpy as np
import os

In [2]:
#folder path into usable string
path = r'C:\Users\rutha\CareerFoundry\01-23_Instacart_Basket_Analysis'

In [3]:
#import products csv
df_prods = pd.read_csv(os.path.join(path, '02_data', 'Original_data', 'products.csv'))

In [4]:
#import orders_wrangled csv
df_ords = pd.read_csv(os.path.join(path, '02_data', 'Prepared_data', 'orders_wrangled.csv'))

In [5]:
#create a dataframe
df_test = pd.DataFrame()

#### 2. Mixed-type columns

In [6]:
#create a mixed type column
df_test['mix'] = ['a', 'b', 1, True]

In [7]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True
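
To make the mixed-type problem concrete, here is a minimal sketch that rebuilds the same toy column and counts the Python types it holds. Casting to string at the end is only one possible fix and is an assumption about the intended type; the right target type depends on the column.

```python
import pandas as pd

# Rebuild the toy frame from the cells above
df_test = pd.DataFrame({'mix': ['a', 'b', 1, True]})

# A column holding several Python types is stored with dtype 'object'
print(df_test['mix'].dtype)

# Count how many values of each Python type the column holds
type_counts = df_test['mix'].map(type).value_counts()
print(type_counts)

# One possible fix (assumed intent): cast every value to a string
df_test['mix'] = df_test['mix'].astype(str)
```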


In [8]:
#calculating the number of null values in your df
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [9]:
#create a df of only the rows where product_name is null
df_nan = df_prods[df_prods['product_name'].isnull()]

In [10]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [11]:
df_prods.shape

(49693, 5)

In [12]:
#create new clean df without the rows that are missing a product_name
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [13]:
#check the shape attribute again to compare the rows and columns in your new df
df_prods_clean.shape

(49677, 5)
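
The row count falls from 49,693 to 49,677, a difference of 16, which matches the 16 nulls reported for product_name. As a rough sketch of the same step on made-up data (`df_demo` and its values are hypothetical stand-ins, not rows from products.csv), the boolean filter and `dropna(subset=...)` should give the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_prods
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Cookies', np.nan, 'Tea', None],
    'prices': [5.8, 12.2, 4.5, 11.8],
})

# Boolean-mask approach, as in the cell above
clean_mask = df_demo[df_demo['product_name'].notnull()]

# dropna restricted to one column keeps the same rows
clean_dropna = df_demo.dropna(subset=['product_name'])

# Number of rows removed should equal the number of nulls in the column
rows_removed = len(df_demo) - len(clean_dropna)
print(rows_removed)
```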

#### 3. Duplicates

In [14]:
#looking for full duplicates by creating a new df
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [15]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [16]:
#checking the shape of my df so I can compare once duplicates are removed
df_prods_clean.shape

(49677, 5)

In [17]:
#create a df that doesn't include duplicates
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [18]:
df_prods_clean_no_dups.shape

(49672, 5)
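
The shape drops by 5 rows (49,677 to 49,672), which agrees with the five rows listed in df_dups. The sketch below illustrates the same check on a small made-up frame (`df_demo` is hypothetical). It also shows `keep=False`, which flags every copy in a duplicate group and can be handy for inspecting duplicates side by side.

```python
import pandas as pd

# Hypothetical stand-in for df_prods_clean
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Tea', 'Ranger IPA', 'Ranger IPA', 'Coffee'],
    'prices': [4.5, 9.2, 9.2, 7.1],
})

# duplicated() flags the second and later copies of a row, not the first
dup_rows = df_demo[df_demo.duplicated()]

# keep=False flags every member of a duplicate group
dup_groups = df_demo[df_demo.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each row
no_dups = df_demo.drop_duplicates()

# Rows removed should equal the number of rows flagged by duplicated()
rows_removed = len(df_demo) - len(no_dups)
print(rows_removed, len(dup_rows))
```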

#### 4. Exercise 4.5

In [19]:
#Q2 run the df.describe function on df_ords, write down anything about the data that looks off
df_ords.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


**Q2 answer:**
My initial thought was that there might be an issue with order_hour_of_day, since there are 24 hours in a day but the maximum value is 23. However, the minimum is 0, so the values 0 to 23 do cover all 24 hours and the column may well be fine.
I also appear to have created a new column out of the index (Unnamed: 0), which I would propose dropping.
I accidentally started off by using my newly created df_prods_clean_no_dups dataframe and have included my findings below, as I would want to investigate some of those values.
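
As a sketch of how the extra index column can arise and be handled: when a CSV is written with its index and read back without `index_col`, pandas labels the unnamed first column `Unnamed: 0`. The CSV text and ID values below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text that was saved with the index included
csv_text = ",order_id,user_id\n0,101,1\n1,102,1\n"

# Read naively: the unlabeled first column comes back as 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Option 1: drop the column after loading
dropped = raw.drop(columns=['Unnamed: 0'])

# Option 2: tell read_csv to use the first column as the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
```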

In [20]:
df_prods_clean_no_dups.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,9.993282
std,14340.705287,38.315784,5.850779,453.615536
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,99999.0


**Q2 Answer:**
The first thing I noticed is in the prices column. The mean is 9.993282 but the max value is 99999, and the standard deviation (453.615536) is far larger than the mean. Large round numbers like this are sometimes used as placeholders for missing or incorrect data. This is supported by the quartiles (25%, 50%, 75%), which are 4.1, 7.1, and 11.1, so 99999 would be an extreme outlier in this dataframe.
I would say that everything else looks broadly ok, with evenly spaced quartiles and means that are close to the median (the 50% row). The counts for each variable are also all equal.
One small point I would want to confirm with the client is whether a product_id could be a decimal value.
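
One way to follow up on the suspicious price is to list rows above a plausibility threshold before deciding what to do with them. This is only a sketch: `df_demo` is made-up data and `PRICE_CEILING` is a hypothetical cut-off that would need to be agreed with the client.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the prices column
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'prices': [4.1, 7.1, 11.1, 99999.0],
})

PRICE_CEILING = 100  # hypothetical threshold, not taken from the data

# Inspect the rows that exceed the threshold
suspect = df_demo[df_demo['prices'] > PRICE_CEILING]
print(suspect)

# If they are confirmed as placeholders, mark them as missing on a copy
cleaned = df_demo.copy()
cleaned.loc[cleaned['prices'] > PRICE_CEILING, 'prices'] = np.nan
print(cleaned['prices'].max())
```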

In [21]:
#Q3 Check for mixed-type data in your dataframe
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[weird]) > 0:
        print(col)

No mixed data returned
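
The check above relies on `applymap`, which is deprecated in pandas 2.1 and later in favour of `DataFrame.map`. A possible alternative, sketched here on made-up data, counts the distinct Python types in each column. The helper name `mixed_type_columns` is hypothetical. Note that a text column containing NaN would also be flagged, because NaN is a float.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns holding more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Hypothetical frame with one clean and one mixed column
df_demo = pd.DataFrame({
    'all_ints': [1, 2, 3],
    'mixed': ['a', 1, True],
})

print(mixed_type_columns(df_demo))
```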

**Question 5** Run a check to see if there are any missing values in your dataframe

In [22]:
#Q5 run a check for missing values in your dataframe
##firstly to see True/False
df_ords.isnull()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
3421078,False,False,False,False,False,False,False
3421079,False,False,False,False,False,False,False
3421080,False,False,False,False,False,False,False
3421081,False,False,False,False,False,False,False


In [23]:
##secondly, summed per column
df_ords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

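**Q5 Answer** There are 206,209 null values in a dataframe of 3,421,083 rows, all in the days_since_prior_order column, which is about 6% of that column's values. My initial thought is that these could be first-time purchasers: a null value in the days_since_prior_order column would be expected for a customer's first order, because they have never ordered before. The count also matches the maximum user_id in the describe() output above (206209), which would be consistent with one null per customer if user IDs run from 1 without gaps.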
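
The 6% figure and the first-order hypothesis can both be checked directly. The sketch below computes the share from the counts reported above, then shows on made-up data (`df_demo` is hypothetical) how one might test whether the nulls line up exactly with `order_number == 1`.

```python
import numpy as np
import pandas as pd

# Share of null values, using the counts reported above
null_share = 206209 / 3421083
print(f"{null_share:.2%}")

# Hypothetical stand-in for df_ords
df_demo = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 10.0],
})

# If the hypothesis holds, a row is null exactly when it is a first order
is_null = df_demo['days_since_prior_order'].isnull()
is_first = df_demo['order_number'] == 1
nulls_are_first_orders = bool((is_null == is_first).all())
print(nulls_are_first_orders)
```

Running the same comparison on df_ords would confirm or refute the assumption before any decision about imputation is made.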

**Q6 Address the missing values using an appropriate method:**
My response to the 206,209 missing values in the days_since_prior_order column would be to leave them as is.
Firstly, 6% is a reasonably small share of the column to be missing, so keeping these values as null should not materially distort the analysis. I don't think imputing values based on the mean would bring much benefit to my analysis.
Secondly, I think it is an interesting datapoint which the client may benefit from understanding. If my assumption about the reason behind the null values is correct, first orders make up only about 6% of all orders in the time period selected, which suggests a lot of repeat purchasing. The client may still want to check whether enough new customers are coming in.

**Q7 Run a check for duplicate values in your dataframe**

In [24]:
#Q7 run a check for duplicate values in your df_ords data
df_ords_dups = df_ords[df_ords.duplicated()]

In [25]:
df_ords_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


In [26]:
#double checking using the shape attribute
df_ords_dups.shape

(0, 7)

**Q7 answer:** there are no duplicate rows but if there were, I would have created a new dataframe: df_ords_no_dups = df_ords.drop_duplicates()

In [30]:
#Q9 export cleaned df_prods as csv
df_prods_clean_no_dups.to_csv(os.path.join(path, '02_Data', 'Prepared_data', 'products_clean.csv'))

In [33]:
#Q9 export cleaned df_ords as csv
df_ords.to_csv(os.path.join(path, '02_Data', 'Prepared_data', 'orders_clean.csv'))

TypeError: join() got an unexpected keyword argument 'index_col'
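
A TypeError like the one above is what `os.path.join` raises when it is handed a keyword argument such as `index_col`, which it does not accept. `index_col` is a `read_csv` parameter; the `to_csv` option for leaving the index out of the file is `index=False`. A minimal sketch, writing made-up data to a temporary folder rather than the project path:

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for df_ords
df_demo = pd.DataFrame({'order_id': [101, 102], 'user_id': [1, 2]})

# Build the path first; os.path.join only takes path components
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'orders_clean.csv')

# index=False is passed to to_csv, so no extra index column is written
df_demo.to_csv(out_path, index=False)

# Reading the file back should give the original columns only
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```

Exporting this way also avoids creating another `Unnamed: 0` column the next time the file is imported.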