# 4.9 INTRO TO DATA VISUALIZATION WITH PYTHON
 
### PART 1

This part covers data preparation and the techniques for combining an additional dataframe with the project data, before moving on to generating visualizations for the analysis. The senior Instacart officers have provided a new data set of customer information to go along with the product and order data.

**SCRIPT CONTENTS:**

1. Importing Libraries & Files
2. Data Wrangling
    - Renaming columns
    - Data consistency checks
    - Exploring Data
    - Missing values
3. Merge Updated DataFrame: **df_merged_final.pkl**


****
#### 1. IMPORTING LIBRARIES & FILES

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
# Document File Location

path = r'C:\Users\G\12-2022 Instacart Basket Analysis'

In [3]:
# Import Customer DataFrame

customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'))

****
#### 2. DATA WRANGLING

Q4. Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis.

In [4]:
# Check Customer DataFrame

customers

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
...,...,...,...,...,...,...,...,...,...,...
206204,168073,Lisa,Case,Female,North Carolina,44,4/1/2020,1,married,148828
206205,49635,Jeremy,Robbins,Male,Hawaii,62,4/1/2020,3,married,168639
206206,135902,Doris,Richmond,Female,Missouri,66,4/1/2020,2,married,53374
206207,81095,Rose,Rollins,Female,California,27,4/1/2020,1,married,99799


_- Renaming Columns_

In [5]:
# RENAMING COLUMNS (single call, old name : new name):

customers.rename(columns = {
    'user_id' : 'customer_id',
    'First Name' : 'Name',
    'Surnam' : 'Surname',
    'STATE' : 'State',
    'date_joined' : 'Member_Since',
    'n_dependants' : 'No_of_Dependants',
    'fam_status' : 'Marital_Status',
    'income' : 'Income'
}, inplace = True)

# Check updated output
customers.head()

Unnamed: 0,customer_id,Name,Surname,Gender,State,Age,Member_Since,No_of_Dependants,Marital_Status,Income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [6]:
# Checkpoint - Dataframe Dimension

customers.shape

(206209, 10)

Q5. Complete the fundamental data quality and consistency checks you’ve learned throughout this Achievement; for example, check for and address missing values and duplicates, and convert any mixed-type data.

In [7]:
# Check Customer statistical description

customers.describe()

Unnamed: 0,customer_id,Age,No_of_Dependants,Income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


In [8]:
# Search for duplicates

customers[customers.duplicated()]

Unnamed: 0,customer_id,Name,Surname,Gender,State,Age,Member_Since,No_of_Dependants,Marital_Status,Income


**NO DUPLICATES FOUND**
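The check above only flags rows that are identical in every column. As a minimal sketch (the sample values below are made up, not taken from the Instacart file), the same method can also be pointed at the key column alone, which may be worth doing before a merge on `customer_id`:

```python
import pandas as pd

# Hypothetical miniature of the customers table (values are invented)
sample = pd.DataFrame({
    'customer_id': [1, 2, 2, 3],
    'Surname': ['Hart', 'Farley', 'Farley', 'Hicks'],
    'Income': [59285, 99568, 99570, 42049],
})

# Fully identical rows: none here, because the two rows for customer 2 differ in Income
full_dupes = sample[sample.duplicated()]

# Rows sharing a customer_id; keep=False flags every member of a duplicated group
key_dupes = sample[sample.duplicated(subset='customer_id', keep=False)]

print(len(full_dupes))
print(key_dupes)
```

A repeated key with differing details would slip past the full-row check, so the two checks answer slightly different questions.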

In [9]:
# Checking null values

customers.isnull().sum()

customer_id             0
Name                11259
Surname                 0
Gender                  0
State                   0
Age                     0
Member_Since            0
No_of_Dependants        0
Marital_Status          0
Income                  0
dtype: int64

**OBSERVATION: _11259 First Names are missing_**, which should **not** cause any data discrepancies for this analysis: the gaps can be treated as a matter of privacy compliance and left as they are.
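For scale, 11259 missing names out of 206209 rows is roughly 5.5% of the customers. A small sketch of how the count and the share could be read side by side, using an invented four-row sample rather than the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing first names (values are invented)
sample = pd.DataFrame({
    'Name': ['Deborah', np.nan, 'Kenneth', np.nan],
    'Surname': ['Esquivel', 'Hart', 'Farley', 'Hicks'],
})

# Number of missing values per column
missing_count = sample.isnull().sum()

# Share of missing values per column (the mean of a boolean mask is a proportion)
missing_share = sample.isnull().mean()

print(missing_count)
print(missing_share)
```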

In [10]:
# Verifying DataFrame data types

customers.dtypes

customer_id          int64
Name                object
Surname             object
Gender              object
State               object
Age                  int64
Member_Since        object
No_of_Dependants     int64
Marital_Status      object
Income               int64
dtype: object

In [11]:
# Check mixed data types: print any column holding values whose type
# differs from the type of the column's first value

for col in customers.columns.tolist():
  weird = (customers[[col]].applymap(type) != customers[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(customers[weird]) > 0:
    print(col)

Name
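The loop above compares every value's type with the type of the first value in its column. An equivalent idea, sketched here on an invented sample, is to count the distinct Python types per column; the helper name `mixed_type_columns` is hypothetical and not part of the original script:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 'Name' mixes str with a float NaN, 'Age' is all integers
sample = pd.DataFrame({
    'Name': ['Deborah', np.nan, 'Kenneth'],
    'Age': [48, 36, 35],
})

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

print(mixed_type_columns(sample))
```

In this sample only `Name` is reported, because a missing value is stored as a float while the names themselves are strings.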


In [12]:
# Check data type of 'Name'

customers['Name'].dtype

dtype('O')

In [13]:
# Convert the mixed-type "Name" column to string (str); the missing values (NaN) are converted to text as well

customers['Name'] = customers['Name'].astype('str')

In [14]:
# Checkpoint 

for col in customers.columns.tolist():
  weird = (customers[[col]].applymap(type) != customers[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (customers[weird]) > 0:
    print (col)

**NO MIXED-TYPE DATA FOUND** - Update was successful.
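The conversion clears the check because the missing first names are no longer float `NaN` values. In the run above they appear to have become text, so a later `isnull()` count on `Name` would likely no longer report them (how `astype('str')` treats missing values can differ between pandas versions). One possible alternative, shown only as a sketch on invented data, is to label the gaps explicitly; the placeholder `'Unknown'` is an assumption, not a value from the Instacart data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing first name among real ones
names = pd.Series(['Deborah', np.nan, 'Kenneth'], dtype='object')

# Count missing values before any change
missing_before = names.isnull().sum()

# Replace the gaps with an explicit, assumed placeholder label
filled = names.fillna('Unknown')

# Every value now shares one Python type, so a mixed-type check would pass
types_after = filled.map(type).nunique()

print(missing_before, types_after)
print(filled.tolist())
```

With an explicit label the missing names stay easy to find and count later, which a silent conversion to text does not guarantee.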

****
#### 3. MERGE DATAFRAME

Q6. Combine your customer data with the rest of your prepared Instacart data. (Hint: Make sure the key columns are the same data type!)

In [15]:
# Import Merge Pickle Format File

ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_update2.pkl'))

In [16]:
# Check Merge dataframe shape

ords_prods_merge.shape

(32404854, 25)

In [17]:
# Checking ords_prods_merge column names

ords_prods_merge.head()

Unnamed: 0,order_id,customer_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,...,price_range,busiest_day,busiest days,busiest_period_of_day,max_order,loyalty_flag,average_price,spender_flag,median_days_since_prior_order,order_frequency_flag
0,2539329,1,1,2,8,,True,196,1,0,...,Mid-range product,Regularly busy,Regularly busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,False,196,1,1,...,Mid-range product,Regularly busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,False,196,1,1,...,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,False,196,1,1,...,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,False,196,1,1,...,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


In [18]:
# Merging customers dataframe to ords_prods_merge

ords_prods_cstmrs = ords_prods_merge.merge(customers, on = 'customer_id')

In [19]:
# Check Merge dataframe shape

ords_prods_cstmrs.shape

(32404854, 34)
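The shape is consistent with a clean merge: the row count is unchanged at 32404854, and 25 + 10 columns minus the one shared key gives 34. As a hedged sketch of how the hint in Q6 could be checked more directly, the example below compares the key dtypes and uses the `validate` and `indicator` options of `merge` on two invented frames (`orders_sample` and `customers_sample` are hypothetical stand-ins, not the project data):

```python
import pandas as pd

# Hypothetical stand-ins for the orders/products and customers dataframes
orders_sample = pd.DataFrame({'order_id': [10, 11, 12], 'customer_id': [1, 1, 2]})
customers_sample = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'State': ['Missouri', 'Idaho', 'Iowa'],
})

# The key columns should share a data type before merging
same_dtype = orders_sample['customer_id'].dtype == customers_sample['customer_id'].dtype

# validate raises an error if a customer_id repeats on the customers side;
# indicator adds a '_merge' column showing where each row came from
merged = orders_sample.merge(
    customers_sample,
    on='customer_id',
    how='outer',
    validate='many_to_one',
    indicator=True,
)

print(same_dtype)
print(merged['_merge'].value_counts())
```

An outer join is used here only to expose unmatched rows; once every row is confirmed as `both`, the default inner join used in the script gives the same result.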

****
#### EXPORTING UPDATED DATAFRAME

In [20]:
# Export Merge File as Pickle

ords_prods_cstmrs.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'df_merged_final.pkl'))