# Table of Contents
## 1. Import Data and Checks
## 2. Check for Nulls
## 3. Check for Duplicates
## 4. Check for Mixed Data Types
## 5. Export Data 

# 1. Import Data and Checks

In [3]:
import pandas as pd
import numpy as np
import os

In [4]:
# create path
path = r'C:\Users\18602\Documents\Data Analytics\Data Immersion\Month 4\Instacart Basket Analysis'

In [5]:
# importing dataset products 
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [6]:
# importing dataset orders 
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'), index_col = False)

In [7]:
df_ords.describe()

        order_id    user_id  order_number  order_dow  order_hour_of_day  days_since_prior_order
count  3421083.0  3421083.0     3421083.0  3421083.0          3421083.0               3214874.0
mean   1710542.0   102978.2      17.15486   2.776219           13.45202                11.11484
std     987581.7   59533.72      17.73316   2.046829           4.226088                9.206737
min          1.0        1.0           1.0        0.0                0.0                     0.0
25%     855271.5    51394.0           5.0        1.0               10.0                     4.0
50%    1710542.0   102689.0          11.0        3.0               13.0                     7.0
75%    2565812.0   154385.0          23.0        5.0               16.0                    15.0
max    3421083.0   206209.0         100.0        6.0               23.0                    30.0


# 2. Check for Nulls

In [13]:
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [14]:
df_nan = df_prods[df_prods['product_name'].isnull()==True]

In [15]:
df_prods.shape

(49693, 5)

In [16]:
df_prods_clean = df_prods[df_prods['product_name'].isnull() == False]

In [17]:
df_prods_clean.shape

(49677, 5)

# 3. Check for Duplicates

In [18]:
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [19]:
df_prods_clean.shape

(49677, 5)

In [20]:
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [21]:
df_dups.shape

(5, 5)

In [22]:
df_prods_clean_no_dups.shape

(49672, 5)

In [23]:
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))

In [25]:
df_prods.describe()

         product_id   aisle_id  department_id      prices
count       49693.0    49693.0        49693.0     49693.0
mean   24844.345139  67.770249      11.728433    9.994136
std    14343.717401  38.316774       5.850282  453.519686
min             1.0        1.0            1.0         1.0
25%         12423.0       35.0            7.0         4.1
50%         24845.0       69.0           13.0         7.1
75%         37265.0      100.0           17.0        11.2
max         49688.0      134.0           21.0     99999.0


The descriptive statistics for the products dataframe cover four columns:

product_id is an identifier, so it behaves more like a string than a number, and its mean and standard deviation are not meaningful. The max (49,688) does not match the count (49,693): the count is five higher than the largest ID, so at least five rows must repeat an existing product_id. Some IDs could also be missing, which describe() alone cannot show.

aisle_id also behaves like a string. The min/max values (1 and 134) are plausible, with no negatives or outlandishly high numbers.

department_id behaves like a string as well, and its min/max make sense in that they range from 1 to 21.

prices is where we see a red flag. The max price is 99999. We can assume that this is an error or a placeholder of some kind, since it isn't reasonable for a grocery item to cost $99,999. It also appears to be inflating the standard deviation (about 453.5 against a mean of about 9.99).
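
The two concerns above (repeated product_id values and the implausible max price) can be checked directly. This is a minimal sketch on a small made-up frame rather than the real products.csv; the helper name `summarize_product_flags` and the price cutoff of 100 are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df_prods (values are made up for illustration)
sample_prods = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Apples', 'Milk', 'Milk', 'Mystery Item'],
    'prices': [4.1, 7.1, 7.1, 99999.0],
})

def summarize_product_flags(df, price_cutoff=100):
    """Return rows with a repeated product_id and rows priced above the cutoff."""
    # keep=False marks every row that shares its product_id with another row
    repeated_ids = df[df['product_id'].duplicated(keep=False)]
    # price_cutoff is an assumed threshold for "implausibly expensive"
    high_prices = df[df['prices'] > price_cutoff]
    return repeated_ids, high_prices

repeated_ids, high_prices = summarize_product_flags(sample_prods)
print(repeated_ids)
print(high_prices)
```

On the real data the same two filters could be run against df_prods to see which rows are behind the count/max mismatch and the 99999 price.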

# 4. Check for Mixed Data Types

In [29]:
# Check for mixed types 
for col in df_prods.columns.tolist():
  weird = (df_prods[[col]].applymap(type) != df_prods[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_prods[weird]) > 0:
    print (col) 

product_name


In [30]:
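# Change product name mixed types to str
# Note: on an object column, astype('str') turns NaN into the string 'nan',
# so those rows will no longer be counted by isnull()
df_prods['product_name'] = df_prods['product_name'].astype('str')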

Question: in the lesson the answer "mix" came up, but in this test 'product_name' came up, which is the name of a column. I assumed this indicated that the product_name column had mixed types, so I fixed it, and when I ran the check again neither it nor any other column came up. So I just wanted to confirm:
- The check lists any and all columns that have mixed data types
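
To illustrate the behaviour being asked about, here is a small sketch of a mixed-type check that returns the name of every column holding more than one Python type. It is a variant of the loop above, not the lesson's code: it uses `Series.map(type)` instead of `applymap`, and the function name `mixed_type_columns` and the sample data are made up for the example.

```python
import pandas as pd
import numpy as np

# Toy frame: product_name mixes str values with a float NaN
sample = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': pd.Series(['Apples', np.nan, 'Milk'], dtype=object),
    'prices': [4.1, 7.1, 11.2],
})

def mixed_type_columns(df):
    """List the columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

print(mixed_type_columns(sample))  # → ['product_name']
```

Under this sketch, a column name appears in the output only if that column is mixed, and every mixed column is listed.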

In [32]:
# Check for missing values in df_ords
df_ords.isnull().sum()

order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

The only missing values are in days_since_prior_order, which has 206,209 missing values, meaning it is null for about 6% of the rows (206,209 of 3,421,083). Most likely these rows come from customers who have only placed one order and so have no prior order to measure from. I'm not sure how to accomplish this in Python, but the way to check would be to see whether all of the nulls have a unique user_id. To be extra certain, we'd want to see whether every returning customer has a days_since_prior_order entry. One thing to note: the null count equals the max user_id in the describe() output above (206,209), which would also fit one null per customer, so the nulls may instead mark each customer's first order.
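
The checks described above could be sketched roughly as follows. This runs on a tiny made-up orders frame, so the printed results say nothing about the real data; the variable names are illustrative, and the third check (whether the nulls all sit on order_number 1) is an additional assumption about what the nulls might represent.

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_ords (values are made up for illustration)
sample_ords = pd.DataFrame({
    'order_id': [10, 11, 12, 13, 14],
    'user_id': [1, 1, 2, 3, 3],
    'order_number': [1, 2, 1, 1, 2],
    'days_since_prior_order': [np.nan, 7.0, np.nan, np.nan, 30.0],
})

null_rows = sample_ords[sample_ords['days_since_prior_order'].isnull()]

# 1) Does each user appear at most once among the null rows?
one_null_per_user = null_rows['user_id'].is_unique

# 2) Did every user with a null place exactly one order in total?
orders_per_user = sample_ords.groupby('user_id')['order_id'].count()
null_users = orders_per_user[orders_per_user.index.isin(null_rows['user_id'])]
all_single_order = (null_users == 1).all()

# 3) Are all null rows a user's first order (order_number == 1)?
nulls_are_first_orders = (null_rows['order_number'] == 1).all()

print(one_null_per_user, all_single_order, nulls_are_first_orders)
```

If check 2 came back False while check 3 came back True on the real data, the nulls would be first orders in general rather than single-purchase customers.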

Depending on how we want to use this information, we could do a couple of things. If we are trying to gather information on the frequency habits of our shoppers, then we should leave these rows as null or fill them with a string such as 'first_ord'. I know having mixed data types or nulls isn't ideal, but depending on how the data is used, it may be the best option.

Deleting the rows altogether does not make sense, since we still want the order information and the information on single-purchase customers is valuable. However, if we want to study returning customer behavior, we can create a df that removes the nulls.

If my rationale for why the data is missing is correct, putting in another value would be misleading. Even though imputing the mean wouldn't change the current average, it could affect future statistics as more orders are added, and it does not accurately reflect the shopping habits of these customers. It is helpful to know that 6% have not repeated their service, if that is the case.

For now, I've created a new df for returning customers that removes the nulls. I've left the original data set as is because I would want confirmation before changing the nulls to 'first_ord' although that would be my inclination as to how to make the data set more usable.
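
One possible alternative to writing 'first_ord' into a numeric column is to keep the nulls and add a separate boolean flag, which avoids mixing strings and floats. This is only a sketch on made-up data; the column name `no_prior_order` is hypothetical and not part of the original data set.

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_ords (values are made up for illustration)
sample_ords = pd.DataFrame({
    'order_id': [10, 11, 12],
    'user_id': [1, 1, 2],
    'days_since_prior_order': [np.nan, 7.0, np.nan],
})

flagged = sample_ords.copy()
# Hypothetical flag column: True where there is no prior order to measure from
flagged['no_prior_order'] = flagged['days_since_prior_order'].isnull()

# The numeric column stays float, and the subset with prior orders is easy to get
with_prior = flagged[~flagged['no_prior_order']]
print(flagged.dtypes)
print(with_prior)
```

The original column stays numeric, so means and other statistics still work, while the flag keeps the information about which rows had no prior order.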


In [49]:
# Create df with only the orders that have a days_since_prior_order value
df_ords_freq = df_ords.dropna(inplace = False)

In [51]:
df_ords_freq.shape

(3214874, 7)

In [52]:
# Create df for duplicates
df_dups = df_ords[df_ords.duplicated()]

In [55]:
# Search for duplicates came up empty
df_dups.shape

(0, 7)

There are no fully duplicated rows in this data set, which suggests that each row is a unique order.
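
A full-row duplicate check and a check on the order ID alone can give different answers, so it may be worth running both. The sketch below uses a made-up frame in which no row is fully duplicated but one order_id is repeated; the variable names are illustrative.

```python
import pandas as pd

# Toy frame: rows differ overall, but order_id 11 appears twice
sample_ords = pd.DataFrame({
    'order_id': [10, 11, 11],
    'user_id': [1, 1, 1],
    'order_number': [1, 2, 3],
})

# Rows that match another row in every column
full_row_dups = sample_ords.duplicated().sum()

# Rows whose order_id has already appeared earlier in the frame
repeated_order_ids = sample_ords['order_id'].duplicated().sum()

print(full_row_dups, repeated_order_ids)  # → 0 1
```

On the real data, `df_ords['order_id'].is_unique` would be a quick way to confirm that every order ID appears only once.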

# 5. Export Data

In [56]:
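df_ords.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'df_ords_clean.csv'))

In [57]:
# Export products under its own file name so it does not overwrite the orders export
# (the file name 'df_prods_clean.csv' is an assumed fix; the original reused 'df_ords_clean.csv')
df_prods.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'df_prods_clean.csv'))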