## Table of Contents
1. [Import Libraries](#import-libraries)
2. [Load Source and Prepared Data](#load-source-and-prepared-data)
3. [Create Test Data for Mixed Types](#create-test-data-for-mixed-types)
4. [Detect Mixed Data Types](#detect-mixed-data-types)
5. [Convert Mixed-Type Columns](#convert-mixed-type-columns)
6. [Check for Missing Values](#check-for-missing-values)
7. [Handling Missing Values (Discussion)](#handling-missing-values-discussion)
8. [Check for Duplicate Records](#check-for-duplicate-records)
9. [Handling Duplicate Records (Discussion)](#handling-duplicate-records-discussion)
10. [Export Checked Orders Data](#export-checked-orders-data)

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
df_prods = pd.read_csv(r'C:\Users\rbaue\Desktop\Instacart\Original Data\products.csv', index_col = False)

In [3]:
df_ords = pd.read_csv(r'C:\Users\rbaue\Desktop\Instacart\Prepared Data\orders_wrangled.csv', index_col = False)

In [6]:
# Creating a test dataframe
df_test = pd.DataFrame()

In [7]:
# Creating a mixed type column
df_test['mix'] = ['a', 'b', 1, True]

In [8]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [11]:
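# Checking for mixed types: flag rows whose cell type differs from the column's first cell
# (DataFrame.map needs pandas 2.1+; older versions use applymap)
for col in df_test.columns.tolist():
    weird = (df_test[[col]].map(type) != df_test[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_test[weird]) > 0:
        print(col)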

mix


In [15]:
df_test['mix'] = df_test['mix'].astype('str')

In [16]:
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [18]:
df_nan = df_prods[df_prods['product_name'].isnull()]

In [19]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [20]:
df_prods.shape

(49693, 5)

In [21]:
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [22]:
df_prods_clean.shape

(49677, 5)

In [23]:
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [24]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [25]:
df_dups.shape

(5, 5)

In [26]:
df_prods_clean

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [28]:
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [30]:
df_prods_clean_no_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [32]:
path = r'C:\Users\rbaue\Desktop\Instacart'

In [33]:
df_prods_clean_no_dups.to_csv(os.path.join(path, 'Prepared Data', 'prods_wrangled.csv'))

In [36]:
# 2.

In [34]:
df_ords.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


In [37]:
# 3.

In [38]:
# Checking for mixed types
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].map(type) != df_ords[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_ords[weird]) > 0:
        print(col)

In [39]:
# col is left over from the loop, so this prints the last column name whether or not it was flagged
print(col)

days_since_prior_order
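The loop in the previous cell printed nothing, so the name shown here is only the loop variable's final value, not a detected mixed-type column. A minimal sketch of a variant that collects flagged columns into a list, so that an empty result is explicit. `find_mixed_columns` and the toy dataframe are hypothetical, and the helper counts distinct cell types per column rather than comparing against the first row:

```python
import pandas as pd

def find_mixed_columns(df):
    """Return names of columns whose cells do not all share one Python type."""
    mixed = []
    for col in df.columns:
        # Count the distinct Python types found among this column's cells
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Toy data: 'mix' holds str, int and bool; 'num' is a float column with a NaN
toy = pd.DataFrame({
    'mix': ['a', 'b', 1, True],
    'num': [1.0, 2.0, float('nan'), 4.0],
})

mixed_cols = find_mixed_columns(toy)
print(mixed_cols)
```

A float column with NaN is not flagged here, because NaN is itself a float.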


In [40]:
# 4.

In [41]:
# Caution: the check above flagged nothing, and this cast hides missing values from isnull()
df_ords['days_since_prior_order'] = df_ords['days_since_prior_order'].astype('str')

In [45]:
# 5.

In [46]:
df_ords.isnull().sum()

Unnamed: 0                0
order_id                  0
user_id                   0
order_number              0
orders_day_of_week        0
order_hour_of_day         0
days_since_prior_order    0
dtype: int64

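At first glance there are no missing values in this dataframe, but the zero for `days_since_prior_order` is an artifact of the `astype('str')` conversion, which stored each NaN as the text 'nan'. The earlier `describe()` output counts 3,214,874 non-null values in that column against 3,421,083 rows, so 206,209 values are actually missing. That equals the maximum `user_id`, which suggests each user's first order simply has no prior order to count from.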
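One caveat when reading a null count taken after a string cast: `astype('str')` on a float column can turn NaN into ordinary text that `isnull()` no longer counts. A minimal sketch on toy data (not the Instacart data), assuming the column was numeric before the cast; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a float column with two missing entries
days = pd.Series([np.nan, 7.0, 15.0, np.nan])
missing_before = int(days.isnull().sum())

# Same cast as in the notebook; depending on the pandas version, the NaN
# entries may now be plain text that isnull() does not count
as_text = days.astype('str')
missing_after_cast = int(as_text.isnull().sum())

# Parsing back to numbers makes the gaps visible again in either case
recovered = pd.to_numeric(as_text, errors='coerce')
missing_recovered = int(recovered.isnull().sum())

print(missing_before, missing_after_cast, missing_recovered)
```

Comparing the count before and after the cast is a quick way to tell whether a zero is real.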

In [47]:
# 6.

Missing values can be handled in three ways: flag them with a new indicator variable, impute them using the mean or median, or remove the affected rows.

Removing them is usually not the best route, since it discards the rest of each affected row. I'd investigate the data first to decide which of the other approaches makes the most sense.
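The three options can be sketched on toy data as follows. The values and the new column names (`days_missing_flag`, `days_imputed`) are made up for illustration, and which option fits depends on what a missing value means in the real data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'days_since_prior_order': [np.nan, 7.0, 15.0, np.nan, 30.0]})

# Option 1: flag missing values in a new indicator column
toy['days_missing_flag'] = toy['days_since_prior_order'].isnull()

# Option 2: impute with the median of the observed values
median_days = toy['days_since_prior_order'].median()
toy['days_imputed'] = toy['days_since_prior_order'].fillna(median_days)

# Option 3: drop the rows where the value is missing
toy_dropped = toy.dropna(subset=['days_since_prior_order'])

print(toy)
print(toy_dropped.shape)
```

Flagging keeps every row and records where the gaps were, so it can be combined with imputation.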

In [48]:
# 7.

In [49]:
df_ords_dups = df_ords[df_ords.duplicated()]

In [50]:
df_ords_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


There are no full-row duplicates in this dataframe.

In [52]:
# 8.

If I had found any full duplicates, where an entire row repeats another, I'd use `drop_duplicates()` to create a new dataframe without the duplicate rows.
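One thing to watch: `df_ords` carries an `Unnamed: 0` column that looks like a saved row index, and a column like that makes every row unique, so a full-row check may never match. A minimal sketch on toy data, where `row_id` is a hypothetical stand-in for such a column:

```python
import pandas as pd

toy = pd.DataFrame({
    'row_id': [0, 1, 2],       # stands in for a saved index column
    'order_id': [10, 11, 11],
    'user_id': [1, 2, 2],
})

# Full-row check: the unique row_id hides the repeated order
full_dups = toy[toy.duplicated()]

# Compare only the content columns instead
content_cols = [c for c in toy.columns if c != 'row_id']
content_dups = toy[toy.duplicated(subset=content_cols)]

# Keep the first occurrence of each repeated row
toy_no_dups = toy.drop_duplicates(subset=content_cols)

print(len(full_dups), len(content_dups), len(toy_no_dups))
```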

In [53]:
# 9.

In [54]:
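# index = False keeps the row index from being written out as another unnamed column
df_ords.to_csv(os.path.join(path, 'Prepared Data', 'ords_checked.csv'), index = False)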
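The `Unnamed: 0` column seen in the orders outputs is what a CSV round trip typically produces when the row index is written to the file and then read back as data. A minimal sketch on toy data, using an in-memory buffer in place of a file:

```python
import io
import pandas as pd

toy = pd.DataFrame({'order_id': [1, 2]})

# Default export writes the index as a first, unnamed column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```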