# Introduction to Data Wrangling 

## Table of Contents
* [01. Importing Libraries](#01.-Importing-Libraries)
* [02. Importing CSV Files](#02.-Importing-CSV-Files)
* [03. Data Wrangling Procedures](#03.-Data-Wrangling-Procedures)
    * [Dropping Columns](#Dropping-Columns)
    * [Renaming Columns](#Renaming-Columns)
    * [Changing Data Types](#Changing-Data-Types)
    * [Transposing Data](#Transposing-Data)
    * [Creating Data Dictionary](#Creating-Data-Dictionary)
    * [Creating subset of data frame](#Creating-subset-of-data-frame)
    * [Exporting Data Frame](#Exporting-Data-Frame)
* [Exercise 4.4 Questions](#Exercise-4.4-Questions)
    * [Changing data type of identifier variable](#Changing-data-type-of-identifier-variable)
    * [Changing name of column](#Changing-name-of-column)
    * [Determining busiest hour for placing orders](#Determining-busiest-hour-for-placing-orders)
    * [Retrieving value from data dictionary](#Retrieving-value-from-data-dictionary)
    * [Creating subset for breakfast](#Creating-subset-for-breakfast)
    * [Creating subset for dinner](#Creating-subset-for-dinner)
    * [Checking dimensions of data frame](#Checking-dimensions-of-data-frame)
    * [Extracting information about specific user](#Extracting-information-about-specific-user)
        * [Descriptive stats for user](#Descriptive-stats-for-user)
    * [Exporting data frames](#Exporting-data-frames)

# 01. Importing Libraries

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import os

# 02. Importing CSV Files

In [2]:
# Importing Orders CSV File
df_ords = pd.read_csv(r'/Users/suzandiab/Documents/Instacart Basket Analysis/02 Data/Original Data/orders.csv', 
                      index_col = False)

In [3]:
# Assigning the project folder path to a variable
path = r'/Users/suzandiab/Documents/Instacart Basket Analysis'

In [4]:
# Importing Product CSV File
df_prods = pd.read_csv(r'/Users/suzandiab/Documents/Instacart Basket Analysis/02 Data/Original Data/products.csv', 
                      index_col = False)
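
The two imports above repeat the full folder string even though `path` already holds the project folder. A minimal sketch of building the import path from a folder variable with `os.path.join` follows. The temporary folder, `demo_path`, `demo_file`, and `df_demo` are hypothetical stand-ins so the example runs without the real Instacart files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the project folder (the notebook's `path`).
demo_path = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_path, '02 Data', 'Original Data'))

# Write a tiny stand-in CSV so the example is runnable on its own.
demo_file = os.path.join(demo_path, '02 Data', 'Original Data', 'orders.csv')
pd.DataFrame({'order_id': [2539329, 2398795], 'user_id': [1, 1]}).to_csv(demo_file, index=False)

# Build the import path from the folder variable instead of retyping it.
df_demo = pd.read_csv(os.path.join(demo_path, '02 Data', 'Original Data', 'orders.csv'),
                      index_col=False)
```

With this pattern, moving the project only requires changing the one folder variable.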

# 03. Data Wrangling Procedures

## Dropping Columns

In [5]:
# Dropping eval set column
df_ords.drop(columns = ['eval_set'])

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0


In [6]:
# Overwriting data frame for order CSV file without eval set column
df_ords = df_ords.drop(columns = ['eval_set'])
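
Cell 5 only previews the result, while cell 6 saves it. The difference can be sketched on a toy data frame (the values in `df_demo` are made up for illustration):

```python
import pandas as pd

# Toy data frame standing in for the orders data.
df_demo = pd.DataFrame({'order_id': [1, 2], 'eval_set': ['prior', 'prior']})

# drop() returns a new data frame; the original still has the column.
preview = df_demo.drop(columns=['eval_set'])
still_there = 'eval_set' in df_demo.columns

# Assigning the result back is what actually removes the column.
df_demo = df_demo.drop(columns=['eval_set'])
```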

In [7]:
# Counting all values in the days since prior order column, including missing values
df_ords['days_since_prior_order'].value_counts(dropna = False)

days_since_prior_order
30.0    369323
7.0     320608
6.0     240013
4.0     221696
3.0     217005
5.0     214503
NaN     206209
2.0     193206
8.0     181717
1.0     145247
9.0     118188
14.0    100230
10.0     95186
13.0     83214
11.0     80970
12.0     76146
0.0      67755
15.0     66579
16.0     46941
21.0     45470
17.0     39245
20.0     38527
18.0     35881
19.0     34384
22.0     32012
28.0     26777
23.0     23885
27.0     22013
24.0     20712
25.0     19234
29.0     19191
26.0     19016
Name: count, dtype: int64
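
In the output above, the `NaN` count (206209) equals the largest `user_id` shown earlier, which suggests one missing value per customer's first order, though the notebook does not confirm this. The effect of `dropna = False` can be sketched on a small made-up series:

```python
import numpy as np
import pandas as pd

# Toy series with one missing value.
days = pd.Series([7.0, 7.0, 30.0, np.nan], name='days_since_prior_order')

with_missing = days.value_counts(dropna=False)   # NaN gets its own row
without_missing = days.value_counts()            # NaN is left out by default
n_missing = int(days.isna().sum())               # direct count of missing values
```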

## Renaming Columns

In [8]:
# Renaming orders dow column
df_ords.rename(columns = {'order_dow' : 'orders_day_of_week'}, inplace = True)

In [9]:
# Displaying table to check column was renamed
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


## Changing Data Types

In [10]:
# Changing data type of order id to string so the describe function ignores it
df_ords['order_id'] = df_ords['order_id'].astype('str')

In [11]:
# Checking data type of order id column, string
df_ords['order_id'].dtype

dtype('O')
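
The reason for the conversion is that `describe()` summarises only numeric columns by default, so an identifier stored as text is left out. A minimal sketch with made-up rows in `df_demo`:

```python
import pandas as pd

# Toy data frame with an identifier column and a numeric column.
df_demo = pd.DataFrame({'order_id': [2539329, 2398795, 473747],
                        'order_hour_of_day': [8, 7, 12]})

# Store the identifier as text so it is not treated as a quantity.
df_demo['order_id'] = df_demo['order_id'].astype('str')

# By default describe() keeps only the numeric columns.
summary = df_demo.describe()
```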

## Transposing Data 

In [12]:
# Importing Department CSV File
df_dep = pd.read_csv(r'/Users/suzandiab/Documents/Instacart Basket Analysis/02 Data/Original Data/departments.csv', 
                      index_col = False)

In [13]:
# Viewing department CSV file
df_dep.head()

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


In [14]:
# Switching from wide to long format (transposing)
df_dep.T

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


In [15]:
# Saving the transposed data as a new data frame
df_dep_t = df_dep.T

In [16]:
# Previewing the data frame with a reset index (the result is not saved)
df_dep_t.reset_index()

Unnamed: 0,index,0
0,department_id,department
1,1,frozen
2,2,other
3,3,bakery
4,4,produce
5,5,alcohol
6,6,international
7,7,beverages
8,8,pets
9,9,dry goods pasta


In [17]:
# Creating a new header
new_header = df_dep_t.iloc[0]

In [18]:
# Checking new header 
new_header

0    department
Name: department_id, dtype: object

In [19]:
# Creating a new dataframe that only copies over rows beyond the first row
df_dep_t_new = df_dep_t[1:]

In [20]:
# Displaying new version of data frame
df_dep_t_new

Unnamed: 0,0
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk


In [21]:
# Setting variable as new header
df_dep_t_new.columns = new_header

In [22]:
# Displaying table with the new header 
df_dep_t_new

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk
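
The transposing steps above can be condensed into a short sketch. The toy frame `df_wide` mimics the wide layout of the departments file with only three departments. The `.copy()` call is an addition that the notebook does not use; it makes the sliced frame independent of `df_long` before its columns are renamed.

```python
import pandas as pd

# Toy wide-format frame shaped like the departments file.
df_wide = pd.DataFrame([['department', 'frozen', 'other', 'bakery']],
                       columns=['department_id', '1', '2', '3'])

df_long = df_wide.T              # rows become columns and vice versa
header = df_long.iloc[0]         # the first row holds the real column name
df_long = df_long[1:].copy()     # keep only the rows after the header row
df_long.columns = header         # promote the saved row to column names
```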


## Creating Data Dictionary

In [23]:
# Turning data frame into a dictionary
data_dict = df_dep_t_new.to_dict('index')

In [24]:
# Displaying data dictionary
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

In [25]:
# Displaying first 5 rows of product file
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [26]:
# Displaying what 19 stands for in the data dictionary
print(data_dict.get('19'))

{'department': 'snacks'}
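
Because `to_dict('index')` produces a dictionary of dictionaries, the department name itself sits one level deeper, and the keys are strings here rather than integers. A small sketch with a two-row stand-in (`df_demo` and `demo_dict` are illustrative names):

```python
import pandas as pd

# Two-row stand-in for the transposed departments data frame.
df_demo = pd.DataFrame({'department': ['snacks', 'deli']}, index=['19', '20'])

# One inner dictionary per index label.
demo_dict = df_demo.to_dict('index')

name = demo_dict['19']['department']   # nested lookup returns just the name
missing = demo_dict.get(19)            # integer 19 is not the string key '19'
```

`get` returns `None` for a key that is absent, which is why looking up the integer gives no result.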


## Creating subset of data frame

In [27]:
# Creating a subset of the product data frame that only contains data from the snacks department
df_prods[df_prods['department_id']==19]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0


In [28]:
# Running part of subset code
df_prods['department_id']==19

0         True
1        False
2        False
3        False
4        False
         ...  
49688    False
49689    False
49690    False
49691    False
49692    False
Name: department_id, Length: 49693, dtype: bool

In [29]:
# Displaying all instances of department id being 19 in product file
df_prods[df_prods['department_id']==19]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0


In [30]:
# Save as new data frame 
df_snacks =  df_prods[df_prods['department_id']==19]

In [31]:
# Displaying first 5 rows of snack data frame to double-check
df_snacks.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5


In [32]:
# Alternative way of creating the snacks subset, using loc
df_snacks_2 = df_prods.loc[df_prods['department_id'] == 19]

In [33]:
# Using isin() allows multiple department ids to be passed in at once
df_snacks_3 = df_prods.loc[df_prods['department_id'].isin([19])]
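
The three subsetting styles used in this section select the same rows. A minimal sketch on a made-up product table (`df_demo`) shows this and adds a multi-value `isin()` call:

```python
import pandas as pd

# Toy product table with two snack rows (department_id 19).
df_demo = pd.DataFrame({'product_id': [1, 2, 3, 4],
                        'department_id': [19, 13, 7, 19]})

mask = df_demo['department_id'] == 19   # boolean Series, one value per row

subset_a = df_demo[mask]                                      # bracket indexing
subset_b = df_demo.loc[mask]                                  # loc with the same mask
subset_c = df_demo.loc[df_demo['department_id'].isin([19])]   # isin with one value

# isin() also accepts several ids at once.
subset_multi = df_demo.loc[df_demo['department_id'].isin([7, 19])]
```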

## Exporting Data Frame

In [34]:
# Exporting the data frame to Original Data then manually moved to Prepared Data in file explorer
df_ords.to_csv(os.path.join(path, '02 Data','Original Data', 'orders_wrangled.csv'))
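
Two side effects of a CSV export are worth knowing. First, `to_csv` writes the index as an extra unnamed column unless `index=False` is passed. Second, a column converted to string is inferred as numeric again on re-import. The sketch below assumes a temporary output folder and a made-up `df_demo`. Whether the index should be kept in the exported file is a choice the notebook does not state.

```python
import os
import tempfile

import pandas as pd

# Toy data frame with the identifier stored as text.
df_demo = pd.DataFrame({'order_id': ['2539329', '2398795'], 'user_id': [1, 1]})
out_dir = tempfile.mkdtemp()   # hypothetical output folder

with_index = os.path.join(out_dir, 'with_index.csv')
no_index = os.path.join(out_dir, 'no_index.csv')
df_demo.to_csv(with_index)               # index written as first column
df_demo.to_csv(no_index, index=False)    # index left out

cols_with_index = list(pd.read_csv(with_index).columns)
reread = pd.read_csv(no_index)
cols_no_index = list(reread.columns)

# The text identifier is read back as an integer column.
id_is_integer = pd.api.types.is_integer_dtype(reread['order_id'])
```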

# Exercise 4.4 Questions

## Changing data type of identifier variable

In [35]:
# Changing data type of user id column to string
df_ords['user_id'] = df_ords['user_id'].astype('str')

In [36]:
# Checking new data type 
df_ords['user_id'].dtype

dtype('O')

## Changing name of column

In [37]:
# Renaming column in order file
df_ords.rename(columns = {'days_since_prior_order' : 'days_since_last_order'}, inplace = True)

In [38]:
# Checking column name was changed
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


## Determining busiest hour for placing orders

In [39]:
# Checking frequency of the order hour of day column
df_ords['order_hour_of_day'].value_counts()

Busiest hour for placing orders, as recorded in this notebook: hour 12. `value_counts()` sorts by frequency in descending order, so the busiest hour is the first row of its output.
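
`value_counts` has to be called with parentheses; without them pandas returns the bound method itself rather than a frequency table. A minimal sketch of reading the busiest hour from the table follows. The `hours` series holds made-up values, so its result says nothing about the real orders data.

```python
import pandas as pd

# Made-up order hours for illustration only.
hours = pd.Series([8, 7, 12, 7, 15, 7, 12], name='order_hour_of_day')

counts = hours.value_counts()     # frequency table, most frequent first
busiest_hour = counts.idxmax()    # index label with the highest count
top_count = counts.iloc[0]        # count of the most frequent hour

# Without parentheses the attribute is just a method object.
not_called = callable(hours.value_counts)
```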

## Retrieving value from data dictionary

In [40]:
# Displaying data dictionary
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

In [41]:
# Displaying what 4 represents in department id column
print(data_dict.get('4'))

{'department': 'produce'}


## Creating subset for breakfast

In [42]:
# Creating a subset of the product data frame that only contains data from the breakfast department
df_prods[df_prods['department_id']==14]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
27,28,Wheat Chex Cereal,121,14,10.1
33,34,,121,14,12.2
67,68,"Pancake Mix, Buttermilk",130,14,13.7
89,90,Smorz Cereal,121,14,3.9
210,211,Gluten Free Organic Cereal Coconut Maple Vanilla,130,14,3.6
...,...,...,...,...,...
49330,49326,Cereal Variety Fun Pack,121,14,9.1
49395,49391,Light and Fluffy Buttermilk Pancake Mix,130,14,2.0
49547,49543,Chocolate Cheerios Cereal,121,14,10.8
49637,49633,Shake 'N Pour Buttermilk Pancake Mix,130,14,14.2


In [43]:
# Saving subset as new breakfast data frame
df_breakfast =  df_prods[df_prods['department_id']==14]

In [44]:
# Checking new subset
df_breakfast.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
27,28,Wheat Chex Cereal,121,14,10.1
33,34,,121,14,12.2
67,68,"Pancake Mix, Buttermilk",130,14,13.7
89,90,Smorz Cereal,121,14,3.9
210,211,Gluten Free Organic Cereal Coconut Maple Vanilla,130,14,3.6


## Creating subset for dinner 

In [45]:
# Creating a subset for the alcohol (5), beverages (7), meat seafood (12), and deli (20) departments
df_dinner = df_prods.loc[df_prods['department_id'].isin([5,7,12,20])]

## Checking dimensions of data frame

In [46]:
# Displaying the number of rows and columns in the dinner data frame
df_dinner.shape

(7650, 5)

## Extracting information about specific user

In [47]:
# Creating subset for customer with user id 1
df_userone = df_ords[df_ords['user_id']== '1']

In [48]:
# Displaying records for customer with user id 1
df_userone

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
5,3367565,1,6,2,7,19.0
6,550135,1,7,1,9,20.0
7,3108588,1,8,1,14,14.0
8,2295261,1,9,1,16,0.0
9,2550362,1,10,4,8,30.0


### Descriptive stats for user 

In [49]:
# Descriptive statistics for customer with user id 1
df_userone.describe()

Unnamed: 0,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
count,11.0,11.0,11.0,10.0
mean,6.0,2.636364,10.090909,19.0
std,3.316625,1.286291,3.477198,9.030811
min,1.0,1.0,7.0,0.0
25%,3.5,1.5,7.5,14.25
50%,6.0,3.0,8.0,19.5
75%,8.5,4.0,13.0,26.25
max,11.0,4.0,16.0,30.0


## Exporting data frames

In [50]:
# Exporting order file to original data and manually placing in prepared data in file explorer
df_ords.to_csv(os.path.join(path, '02 Data','Original Data', 'orders_wrangled.csv'))

In [51]:
# Exporting departments file to original data and manually placing in prepared data in file explorer
df_dep_t_new.to_csv(os.path.join(path, '02 Data','Original Data', 'departments_wrangled.csv'))