# 4.9 Intro to Data Visualization with Python

## Part 1

### (1.) Import your analysis libraries, as well as your new customer data set as a dataframe.

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [3]:
# Import data set
path = r'D:\02.2022_Instacart Basket Analysis' 

In [4]:
df_customer = pd.read_csv(os.path.join(path,'02 Data','Original Data','customers.csv'), index_col = False)

In [5]:
# Check the data set
df_customer.shape

(206209, 10)

### (2.) Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis.

In [7]:
# Check df_customer
df_customer.columns

Index(['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age',
       'date_joined', 'n_dependants', 'fam_status', 'income'],
      dtype='object')

In [8]:
df_customer.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [9]:
df_customer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB


#### Dropping columns

In [11]:
# Remove unnecessary column
df_customer = df_customer.drop(columns = ['date_joined'])

In [12]:
df_customer.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1,married,40374


#### Renaming columns

In [14]:
# change to uniform column names
df_customer.rename(columns = {'First Name':'first_name','Surnam' : 'last_name','STATE': 'state','fam_status':'family_status','Gender':'gender','Age' : 'age'}, inplace = True)

In [15]:
df_customer.head()

Unnamed: 0,user_id,first_name,last_name,gender,state,age,n_dependants,family_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1,married,40374


#### Changing Datatypes

In [17]:
# Check the current data types before converting
df_customer.dtypes

user_id           int64
first_name       object
last_name        object
gender           object
state            object
age               int64
n_dependants      int64
family_status    object
income            int64
dtype: object

In [18]:
df_customer['user_id'] = df_customer['user_id'].astype('int32')

In [19]:
df_customer['age'] = df_customer['age'].astype('int32')

In [20]:
df_customer['n_dependants'] = df_customer['n_dependants'].astype('int32')

In [21]:
df_customer.dtypes

user_id           int32
first_name       object
last_name        object
gender           object
state            object
age               int32
n_dependants      int32
family_status    object
income            int64
dtype: object
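
Downcasting from int64 to int32 is only safe when every value fits into the int32 range. The following is a minimal sketch of that check, run on a made-up toy frame rather than the real `df_customer`; the names `toy`, `downcast_cols`, and `fits` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_customer (values copied from the first rows shown above).
toy = pd.DataFrame({
    'user_id': [26711, 33890, 65803],
    'age': [48, 36, 35],
    'n_dependants': [3, 0, 2],
})

int32_limits = np.iinfo(np.int32)
downcast_cols = ['user_id', 'age', 'n_dependants']

# Confirm that every value fits into int32 before converting.
fits = all(
    toy[col].between(int32_limits.min, int32_limits.max).all()
    for col in downcast_cols
)

if fits:
    # astype accepts a column-to-dtype mapping, so one call covers all three columns.
    toy = toy.astype({col: 'int32' for col in downcast_cols})

print(toy.dtypes)
```

Passing a dictionary to `astype` is an alternative to the three separate cells above; the result should be the same.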

### (3.) Complete the fundamental data quality and consistency checks you’ve learned throughout this Achievement; for example, check for and address missing values and duplicates, and convert any mixed-type data.

#### Check data accuracy

In [24]:
df_customer.describe()

Unnamed: 0,user_id,age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


Compared with the 25%, 50%, and 75% income values, the maximum of 593901 looks too high. It is possible that some customers really do have such a high income, so the income values need to be investigated further.

In [25]:
df_customer[df_customer['income'] > 154941]

Unnamed: 0,user_id,first_name,last_name,gender,state,age,n_dependants,family_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,3,married,165665
8,69965,Jeremy,Vang,Male,Texas,47,1,married,162432
27,83910,Ruth,Rice,Female,Indiana,70,1,married,165147
34,117740,Lisa,Sparks,Female,Oregon,55,1,married,292759
39,150670,Terry,Wilkerson,Male,Iowa,55,2,married,166288
...,...,...,...,...,...,...,...,...,...
206181,192077,Susan,Nash,Female,Georgia,54,3,married,160664
206182,180181,Carl,Bridges,Male,West Virginia,77,2,married,167232
206189,193828,Russell,Travis,Male,North Carolina,46,1,married,160483
206199,179673,Adam,Villanueva,Male,Wyoming,77,0,divorced/widowed,162239


No mismatched entries were found. There are 17880 entries with a very high income. Because the 25%, 50%, and 75% values only describe the bulk of the data, I extrapolated a "100%" reference value of 154941 (the 75% value plus the gap between the 50% and 75% values, 124244 + 30697) and checked how many entries lie above it. The result is plausible, since only a small share of people can be expected to earn that much. However, we should still look at the correlation chart later to see whether anything unusual in the income column was missed.
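
The cutoff of 154941 used in the filter above matches the 75% income value plus the gap between the 50% and 75% values from the `describe()` output. The sketch below reproduces that arithmetic and applies the same rule to a small made-up series; the helper `upper_reference` and the toy `income` values are assumptions for illustration only.

```python
import pandas as pd


def upper_reference(q50, q75):
    """Extrapolate a reference value one quartile step above the 75% value."""
    return q75 + (q75 - q50)


# Quartiles taken from the describe() output above.
doc_threshold = upper_reference(93547, 124244)
print(doc_threshold)

# Toy income values (made up); the real column is df_customer['income'].
income = pd.Series([25903, 59874, 93547, 124244, 165665, 593901], name='income')

threshold = upper_reference(income.quantile(0.50), income.quantile(0.75))

# Rows above the reference value and their share of all rows.
high = income[income > threshold]
share = len(high) / len(income)
print(len(high), round(share, 3))
```

Computing the cutoff from the quartiles instead of typing it in keeps the filter consistent if the data set changes.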

#### Check missing values

In [27]:
df_customer.isnull().sum()

user_id              0
first_name       11259
last_name            0
gender               0
state                0
age                  0
n_dependants         0
family_status        0
income               0
dtype: int64

In [28]:
df_customer['first_name'].value_counts(dropna = False)

NaN        11259
Marilyn     2213
Barbara     2154
Todd        2113
Jeremy      2104
           ...  
Merry        197
Eugene       197
Garry        191
Ned          186
David        186
Name: first_name, Length: 208, dtype: int64

In [29]:
df_customer_NaN = df_customer[df_customer['first_name'].isnull()]

In [30]:
df_customer_NaN.head()

Unnamed: 0,user_id,first_name,last_name,gender,state,age,n_dependants,family_status,income
53,76659,,Gilbert,Male,Colorado,26,2,married,41709
73,13738,,Frost,Female,Louisiana,39,0,single,82518
82,89996,,Dawson,Female,Oregon,52,3,married,117099
99,96166,,Oconnor,Male,Oklahoma,51,1,married,155673
105,29778,,Dawson,Female,Utah,63,3,married,151819


There are 11259 missing values, all of them in first_name. We could ask the responsible department to look up the customers' first names and fill in the NaNs. However, the missing values make up only about 5% of the data and do not interfere with the analysis at this point. Therefore, I chose to filter out the rows with NaNs and run the remaining consistency checks on the filtered data.
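
The share of missing values can be computed directly instead of estimated by eye. This is a minimal sketch on a made-up toy frame; `toy`, `missing_share`, and `toy_no_nan` are hypothetical names, and only the two counts in `doc_share` come from the outputs above.

```python
import numpy as np
import pandas as pd

# Share of missing first names, using the counts reported above.
doc_share = 11259 / 206209
print(f'{doc_share:.2%}')

# Toy frame standing in for df_customer (values are made up).
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'first_name': ['Deborah', np.nan, 'Kenneth', 'Ann'],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)

# Keep only the rows that have a first name; copy() makes the result an independent frame.
toy_no_nan = toy[toy['first_name'].notnull()].copy()
print(toy_no_nan.shape)
```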

In [31]:
df_customer_noNaN = df_customer[df_customer['first_name'].notnull()]

In [32]:
df_customer_noNaN.head()

Unnamed: 0,user_id,first_name,last_name,gender,state,age,n_dependants,family_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1,married,40374


In [33]:
df_customer_noNaN.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 194950 entries, 0 to 206208
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   user_id        194950 non-null  int32 
 1   first_name     194950 non-null  object
 2   last_name      194950 non-null  object
 3   gender         194950 non-null  object
 4   state          194950 non-null  object
 5   age            194950 non-null  int32 
 6   n_dependants   194950 non-null  int32 
 7   family_status  194950 non-null  object
 8   income         194950 non-null  int64 
dtypes: int32(3), int64(1), object(5)
memory usage: 12.6+ MB


In [34]:
df_customer_noNaN.shape

(194950, 9)

#### Check duplicates

In [36]:
df_customer_noNaN[df_customer_noNaN.duplicated()]

Unnamed: 0,user_id,first_name,last_name,gender,state,age,n_dependants,family_status,income


No duplicates were found.
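
`duplicated()` without arguments only flags rows that are identical in every column. If the same customer could appear twice with slightly different attributes, a check on the key column alone may be a useful complement. The sketch below assumes `user_id` is meant to be unique and uses a made-up toy frame; `full_row_dupes` and `key_dupes` are illustrative names.

```python
import pandas as pd

# Toy frame (made up): user_id 2 appears twice with different incomes.
toy = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'income': [50000, 60000, 61000, 70000],
})

# Full-row check, as in the cell above: finds nothing because the rows differ.
full_row_dupes = toy[toy.duplicated()]

# Key-only check: keep=False marks every row that shares a user_id with another row.
key_dupes = toy[toy.duplicated(subset=['user_id'], keep=False)]

print(len(full_row_dupes), len(key_dupes))
```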

In [37]:
df_customer_noNaN.shape

(194950, 9)

#### Check mixed data type

In [39]:
# Print every column in which some value has a different Python type than the value in the first row
for col in df_customer_noNaN.columns.tolist():
    weird = (df_customer_noNaN[[col]].applymap(type) != df_customer_noNaN[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_customer_noNaN[weird]) > 0:
        print(col)

The loop prints nothing, so no mixed-type columns were found.
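
An alternative way to express the same check is to count the distinct Python types in each column. This is a sketch on a made-up toy frame, not the notebook's method; `toy` and `mixed_cols` are hypothetical names.

```python
import pandas as pd

# Toy frame (made up): 'age' holds one string among integers, so it is mixed.
toy = pd.DataFrame({
    'state': ['Iowa', 'Idaho', 'Texas'],
    'age': [40, '35', 47],
})

# A column is mixed when its values have more than one distinct Python type.
mixed_cols = [col for col in toy.columns if toy[col].map(type).nunique() > 1]
print(mixed_cols)
```

Note that NaN is a float, so an object column with missing values would also be flagged by this approach. In this notebook the rows with missing first names were filtered out before the check.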

In [40]:
df_customer_noNaN.shape

(194950, 9)

In [41]:
df_customer_noNaN.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 194950 entries, 0 to 206208
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   user_id        194950 non-null  int32 
 1   first_name     194950 non-null  object
 2   last_name      194950 non-null  object
 3   gender         194950 non-null  object
 4   state          194950 non-null  object
 5   age            194950 non-null  int32 
 6   n_dependants   194950 non-null  int32 
 7   family_status  194950 non-null  object
 8   income         194950 non-null  int64 
dtypes: int32(3), int64(1), object(5)
memory usage: 12.6+ MB


In [42]:
# Export the checked data set
df_customer_noNaN.to_csv(os.path.join(path,'02 Data','Prepared Data','customer_check.csv'))