## Table of Contents

### 1. Importing libraries
### 2. Importing data
### 3. Data consistency checks - products
##### 1. Mixed-type values
##### 2. Missing values
##### 3. Duplicate values
##### 4. Check products dataframe for irregularities
### 4. Data consistency checks - orders
##### 1. Mixed-type values
##### 2. Missing values
##### 3. Duplicate values
### 5. Exporting data

# 01. Importing libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

# 02. Importing data

In [2]:
# Create path shortcut
path = r'/Users/taraperrigeold/Documents/Documents - Tara Perrige’s MacBook Pro/CareerFoundry/Instacart Basket Analysis'

In [3]:
# Check output
path

'/Users/taraperrigeold/Documents/Documents - Tara Perrige’s MacBook Pro/CareerFoundry/Instacart Basket Analysis'

In [4]:
# Import products.csv
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [5]:
# Check output
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [6]:
# Check shape
df_prods.shape

(49693, 5)

In [7]:
# Import orders_wrangled.csv
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

In [8]:
# Check output
df_ords.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


In [9]:
# Check shape
df_ords.shape

(3421083, 7)

# 03. Data consistency checks - products

## 01. Mixed-type values

In [10]:
# Create a test dataframe
df_test = pd.DataFrame()

In [11]:
# Create a mixed-type column
df_test['mix'] = ['a', 'b', 1, True]

In [12]:
# Check output
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [13]:
# Check for mixed types
for col in df_test.columns.tolist():
    weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_test[weird]) > 0:
        print(col)

mix
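
The check above relies on `DataFrame.applymap`, which recent pandas releases deprecate in favor of `DataFrame.map`. A minimal sketch of the same idea that avoids both, using `Series.map(type)`, might look like the following. The helper name `mixed_type_columns` and the `df_demo` frame are hypothetical and only illustrate the technique.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return names of columns whose values have more than one Python type."""
    mixed = []
    for col in df.columns:
        # Series.map(type) gives the Python type of every value in the column.
        # Note: a text column containing NaN is also flagged (str and float).
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Hypothetical demo frame mirroring the df_test example
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True], 'nums': [1, 2, 3, 4]})
print(mixed_type_columns(df_demo))
```

Counting distinct types per column, rather than comparing every row to the first row, gives the same flagging behavior for these cases and keeps the logic in one line.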


In [14]:
# Change data type of mix column to string
df_test['mix'] = df_test['mix'].astype('str')

In [15]:
# Check data type was changed
df_test['mix'].dtype

dtype('O')

## 02. Missing values

In [16]:
# Find missing values in df_prods
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [17]:
# Create subset containing only those 16 missing values in product_name
df_nan = df_prods[df_prods['product_name'].isnull()]

In [18]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [19]:
# Check number of rows in df_prods to compare later
df_prods.shape

(49693, 5)

In [20]:
# Create new dataframe that excludes the missing values
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [21]:
# Check that there are 16 fewer rows in df_prods_clean than df_prods
df_prods_clean.shape

(49677, 5)
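
The row count drops from 49,693 to 49,677, which matches the 16 missing product names. As an alternative sketch, `dropna` with a `subset` argument removes the same kind of rows in one call. The `df_demo` frame below is a made-up stand-in for `df_prods`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_prods with one missing product name
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Tea', np.nan, 'Salt'],
})

# Drop only the rows where product_name is missing
df_demo_clean = df_demo.dropna(subset=['product_name'])

# The difference in row counts should equal the number of missing names
rows_removed = len(df_demo) - len(df_demo_clean)
print(rows_removed)
```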

## 03. Duplicate values

In [22]:
# Look for full duplicates by creating a subset containing only rows that are duplicates
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [23]:
# Check output
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [24]:
# Check number of rows in df_prods_clean to compare later
df_prods_clean.shape

(49677, 5)

In [25]:
# Create new dataframe that does not include duplicates just identified
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [26]:
# Check that there are 5 fewer rows in df_prods_clean_no_dups than df_prods_clean
df_prods_clean_no_dups.shape

(49672, 5)
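
By default, `duplicated()` flags only the second and later copies of a row, which is why the subset above contains 5 rows and `drop_duplicates()` removes exactly 5. A small sketch on invented data shows how `keep=False` can be used to view every copy side by side before dropping anything.

```python
import pandas as pd

# Hypothetical demo frame with one fully duplicated row
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Tea', 'Ranger IPA', 'Ranger IPA', 'Salt'],
})

# Default: only the repeat is flagged, not the first occurrence
repeats_only = df_demo[df_demo.duplicated()]

# keep=False: flag every row that has at least one identical twin
all_copies = df_demo[df_demo.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row
df_demo_no_dups = df_demo.drop_duplicates()
print(len(repeats_only), len(all_copies), len(df_demo_no_dups))
```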

## 04. Check products dataframe for irregularities

In [27]:
# Look at descriptive stats of products
df_prods_clean_no_dups.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,9.993282
std,14340.705287,38.315784,5.850779,453.615536
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,99999.0


The maximum price (99,999.0) is extremely high compared to the 3rd quartile (11.1), and the standard deviation (about 453.6) dwarfs the mean (about 9.99). Genuine outliers could be skewing the data, but it's more likely that something was entered incorrectly. This needs to be investigated further.
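
To illustrate why a single extreme value is suspicious, here is a toy sketch (the four prices are illustrative values, not the real column) comparing the mean and median with and without the extreme entry. The mean moves dramatically while the median barely changes.

```python
import pandas as pd

# Illustrative prices only: three ordinary values plus one extreme entry
prices_demo = pd.Series([4.1, 7.1, 11.1, 99999.0])

mean_with = prices_demo.mean()
median_with = prices_demo.median()

# Same statistics after excluding anything above an assumed cutoff of 100
trimmed = prices_demo[prices_demo <= 100]
mean_without = trimmed.mean()
median_without = trimmed.median()

print(mean_with, median_with)
print(mean_without, median_without)
```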

In [28]:
# Check average of prices
df_prods_clean_no_dups['prices'].mean()

9.993281929457261

In [29]:
# Check median of prices
df_prods_clean_no_dups['prices'].median()

7.1

In [30]:
# Check max value of prices
df_prods_clean_no_dups['prices'].max()

99999.0

In [31]:
# Check for outliers in prices
df_prods_clean_no_dups.loc[df_prods_clean_no_dups['prices'] > 100]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [32]:
# Replace outlier prices with NaN
df_prods_clean_no_dups.loc[df_prods_clean_no_dups['prices'] > 100, 'prices'] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)
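
The warning above presumably appears because `df_prods_clean_no_dups` was derived from a filtered slice of another dataframe, so pandas cannot tell whether the assignment is meant to affect the original. The assignment still took effect here, as the next cell shows. One common way to avoid the warning is to take an explicit `.copy()` when creating the derived dataframe. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the products data
df_demo = pd.DataFrame({
    'product_name': ['Milk', 'Tea', 'Tea'],
    'prices': [99999.0, 4.5, 4.5],
})

# .copy() makes the derived frame independent of df_demo
df_demo_no_dups = df_demo.drop_duplicates().copy()

# Replace implausible prices with NaN on the independent copy
df_demo_no_dups.loc[df_demo_no_dups['prices'] > 100, 'prices'] = np.nan

print(df_demo_no_dups['prices'].max())
```

The original `df_demo` is left untouched, which makes it clear which dataframe the change applies to.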


In [33]:
# Check new max value of prices
df_prods_clean_no_dups['prices'].max()

25.0

# 04. Data consistency checks - orders

## 01. Mixed-type values

In [34]:
# Check for mixed types
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[weird]) > 0:
        print(col)

No mixed-type columns found.

## 02. Missing values

In [35]:
# Find missing values in df_ords
df_ords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

The days_since_prior_order column has 206,209 missing values. It makes sense that this value would be missing for orders with no prior order, meaning the customer's first order on Instacart.
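
This explanation can be tested rather than assumed. The sketch below, on a made-up frame with the same column names, checks that the value is missing exactly where `order_number` is 1, and that the number of missing values equals the number of distinct users.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_ords: two users with two orders each
df_demo = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'order_number': [1, 2, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, np.nan, 7.0],
})

missing = df_demo['days_since_prior_order'].isnull()
first_order = df_demo['order_number'] == 1

# True only if NaN appears on first orders and nowhere else
only_first_orders_missing = bool((missing == first_order).all())

# One missing value per user would support the same conclusion
one_missing_per_user = missing.sum() == df_demo['user_id'].nunique()
print(only_first_orders_missing, one_missing_per_user)
```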

In [36]:
# Check shape of df_ords
df_ords.shape

(3421083, 7)

## 03. Duplicate values

In [37]:
# Check for full duplicates by creating a subset of only rows that are duplicates
df_ords_dups = df_ords[df_ords.duplicated()]

No full duplicates found; however, there is a redundant index column (see below).

In [38]:
df_ords.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


There are two index columns. At first I thought that index_col = False meant that pandas would not duplicate the index column, so I was confused about why pandas did not recognize the Unnamed: 0 column as an index column. I thought maybe it had something to do with the column name. After some research, it seems that index_col = False only tells pandas not to use the first column as the index; it does not drop that column. I can use a function to set a new index column after creating a dataframe: df.set_index('column_name', inplace = True). Going forward in other notebooks I can just import the csv file but exclude that column.
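
A small in-memory round trip shows the behavior. This is a sketch on invented data: `to_csv` writes the index by default, `index_col=False` keeps it as an ordinary `Unnamed: 0` column on re-import, and `index_col=0` uses it as the index instead.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the orders data
df_demo = pd.DataFrame({'order_id': [2539329, 2398795], 'user_id': [1, 1]})

# to_csv writes the index by default, producing a leading unnamed column
csv_text = df_demo.to_csv()

# index_col=False: the saved index comes back as a regular column
df_false = pd.read_csv(io.StringIO(csv_text), index_col=False)

# index_col=0: the first column is used as the index instead
df_zero = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_false.columns.tolist())  # ['Unnamed: 0', 'order_id', 'user_id']
print(df_zero.columns.tolist())   # ['order_id', 'user_id']
```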

In [39]:
# Create new dataframe that drops the Unnamed: 0 column
df_ords_clean = df_ords.drop(columns = ['Unnamed: 0'])

In [40]:
df_ords_clean.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [41]:
# Check new shape
df_ords_clean.shape

(3421083, 6)

I decided to drop the column so that I can have a clean export. Now I know not to include that column when importing the data set next time.

# 05. Exporting data

In [33]:
# Export clean products data
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'))

In [34]:
# Export clean orders data
df_ords_clean.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_checked.csv'))
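
Note that `to_csv` writes the dataframe index as the first column unless told otherwise, so the exports above will carry an unnamed index column into the next notebook. If that is not wanted, passing `index=False` is one option. A minimal sketch using an in-memory buffer instead of the real file paths:

```python
import io

import pandas as pd

# Hypothetical stand-in for df_ords_clean
df_demo = pd.DataFrame({'order_id': [2539329, 2398795], 'user_id': [1, 1]})

# index=False leaves the index out of the written file
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Reading it back yields only the real data columns
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back.columns.tolist())  # ['order_id', 'user_id']
```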