## Cleaning Customer Address Data from the Raw Data Excel Sheet

In [1]:
import pandas as pd
import numpy as np

In [3]:
cust_add = pd.read_excel('Raw_data.xlsx', sheet_name='CustomerAddress')

In [4]:
cust_add.head()

Unnamed: 0,customer_id,address,postcode,state,country,property_valuation
0,1,060 Morning Avenue,2016,New South Wales,Australia,10
1,2,6 Meadow Vale Court,2153,New South Wales,Australia,10
2,4,0 Holy Cross Court,4211,QLD,Australia,9
3,5,17979 Del Mar Point,2448,New South Wales,Australia,4
4,6,9 Oakridge Court,3216,VIC,Australia,9


In [5]:
cust_add.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         3999 non-null   int64 
 1   address             3999 non-null   object
 2   postcode            3999 non-null   int64 
 3   state               3999 non-null   object
 4   country             3999 non-null   object
 5   property_valuation  3999 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 187.6+ KB


The column data types look good. Let's check the data quality and apply cleaning wherever applicable before performing any analysis.

### Total Records

In [6]:
print(f"Total Records in dataset: {len(cust_add)}")
print(f"Total features in the dataset: {cust_add.shape[1]}")

Total Records in dataset: 3999
Total features in the dataset: 6


### Numeric Columns and Non Numeric Columns

In [7]:
# select numeric columns

df_numeric = cust_add.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print(f"Numeric columns: {numeric_cols}")

Numeric columns: ['customer_id' 'postcode' 'property_valuation']


In [8]:
# select non numeric columns

df_non_numeric = cust_add.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print(f"Non Numeric columns: {non_numeric_cols}")

Non Numeric columns: ['address' 'state' 'country']


## Data Quality Checks
### 1. Missing Values
Check for missing values in the dataset. If a feature has missing values, then depending on the situation it may either be dropped (when a large share of its data is missing) or have an appropriate value imputed for the missing entries.

In [9]:
# total number of missing values for each feature

cust_add.isnull().sum()

customer_id           0
address               0
postcode              0
state                 0
country               0
property_valuation    0
dtype: int64

`In this dataset there are no missing values`

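Had any feature contained missing values, the drop-or-impute decision described above could be sketched roughly as follows. This is a minimal illustration, not part of the original notebook: the helper name `handle_missing`, the 50% threshold, the median/mode imputation choices, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd

def handle_missing(df, drop_threshold=0.5):
    """Hypothetical helper: drop columns that are mostly missing, impute the rest."""
    out = df.copy()
    # Fraction of missing values per column
    missing_frac = out.isnull().mean()
    # Drop columns whose missing fraction exceeds the (assumed) threshold
    out = out.drop(columns=missing_frac[missing_frac > drop_threshold].index)
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                # Numeric column: impute with the median
                out[col] = out[col].fillna(out[col].median())
            else:
                # Non-numeric column: impute with the most frequent value
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy data for illustration only (not taken from Raw_data.xlsx)
demo = pd.DataFrame({
    "property_valuation": [10, np.nan, 9, 4],
    "state": ["NSW", "NSW", None, "VIC"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})
cleaned = handle_missing(demo)
print(cleaned)
```

The right threshold and imputation strategy depend on the feature and on the downstream analysis, so the values above should be treated as placeholders.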
### 2. Inconsistency Check in Data
We will check whether any inconsistent values or typos are present in the categorical columns.
The columns to be checked are `'address'`, `'postcode'`, `'state'`, `'country'`.

#### 2.1 State

In [10]:
cust_add['state'].value_counts()

NSW                2054
VIC                 939
QLD                 838
New South Wales      86
Victoria             82
Name: state, dtype: int64

The state column is inconsistent: `New South Wales` and `Victoria` each appear under two values, the full name and the abbreviation. The state names should be standardized, so rows with state `New South Wales` will be replaced by `NSW` and rows with state `Victoria` will be replaced by `VIC`.

In [11]:
# function to replace state names with their abbreviations

def state_abbrev(x):
    if x == 'New South Wales':
        return 'NSW'
    elif x == 'Victoria':
        return 'VIC'
    else:
        return x


# Applying the above function to the state column
cust_add['state'] = cust_add['state'].apply(state_abbrev)

In [12]:
cust_add['state'].value_counts()

NSW    2140
VIC    1021
QLD     838
Name: state, dtype: int64

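The same standardization can also be expressed without a custom function, using `Series.replace` with a mapping. The sketch below is an alternative to the `state_abbrev` function above, shown on a small made-up sample; the name `STATE_ABBREVIATIONS` is an assumption introduced for illustration.

```python
import pandas as pd

# Mapping from full state names to abbreviations (hypothetical constant name)
STATE_ABBREVIATIONS = {"New South Wales": "NSW", "Victoria": "VIC"}

# Small made-up sample mixing full names and abbreviations
sample = pd.DataFrame({"state": ["New South Wales", "NSW", "Victoria", "VIC", "QLD"]})

# Values found in the mapping are replaced; all others are left unchanged
sample["state"] = sample["state"].replace(STATE_ABBREVIATIONS)
print(sample["state"].value_counts())
```

A mapping keeps the full-name-to-abbreviation pairs in one place, which may be easier to extend if further variants turn up.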
#### 2.2 Country

In [13]:
cust_add['country'].value_counts()

Australia    3999
Name: country, dtype: int64

The Country column contains a single value, `Australia`, so there is no inconsistency in it.

#### 2.3 Postcode
The Postcode column shows no obvious inconsistency or typo: the values displayed below are all four-digit integers.

In [14]:
cust_add['postcode'].value_counts()

2170    31
2155    30
2145    30
2153    29
3977    26
        ..
3808     1
3114     1
4721     1
4799     1
3089     1
Name: postcode, Length: 873, dtype: int64

In [15]:
cust_add[['state', 'country', 'postcode', 'address']].drop_duplicates()

Unnamed: 0,state,country,postcode,address
0,NSW,Australia,2016,060 Morning Avenue
1,NSW,Australia,2153,6 Meadow Vale Court
2,QLD,Australia,4211,0 Holy Cross Court
3,NSW,Australia,2448,17979 Del Mar Point
4,VIC,Australia,3216,9 Oakridge Court
...,...,...,...,...
3994,VIC,Australia,3064,1482 Hauk Trail
3995,QLD,Australia,4511,57042 Village Green Point
3996,NSW,Australia,2756,87 Crescent Oaks Alley
3997,QLD,Australia,4032,8194 Lien Street

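Eyeballing `value_counts()` only surfaces the most and least frequent postcodes. A more systematic postcode check could look like the sketch below. It assumes that every postcode in this dataset should have four digits and that its first digit should match the state (2 for NSW, 3 for VIC, 4 for QLD). That mapping is a rough heuristic rather than an official rule, and `flag_suspect_postcodes`, `EXPECTED_FIRST_DIGIT`, and the sample rows are hypothetical.

```python
import pandas as pd

# Assumed first digit per state; a rough heuristic, not an official rule
EXPECTED_FIRST_DIGIT = {"NSW": "2", "VIC": "3", "QLD": "4"}

def flag_suspect_postcodes(df):
    """Hypothetical helper: return rows whose postcode looks inconsistent."""
    postcode_str = df["postcode"].astype(str)
    # True where the postcode consists of exactly four digits
    four_digits = postcode_str.str.fullmatch(r"\d{4}")
    # True where the first digit matches the digit assumed for the row's state
    expected = df["state"].map(EXPECTED_FIRST_DIGIT)
    matches_state = postcode_str.str[0] == expected
    return df[~(four_digits & matches_state)]

# Made-up rows: the last two are deliberately inconsistent
sample = pd.DataFrame({
    "state": ["NSW", "VIC", "QLD", "NSW", "VIC"],
    "postcode": [2016, 3216, 4211, 421, 2153],
})
suspects = flag_suspect_postcodes(sample)
print(suspects)
```

Rows flagged this way are candidates for manual review rather than confirmed errors, since real postcode ranges have exceptions.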

### 3. Duplication Checks
We need to ensure that there is no duplication of records in the dataset. Duplicates may lead to discrepancies in the analysis because of the poor quality of the data used. If there are duplicate rows, we need to drop them.
To check for duplicate records, we first remove the primary key column of the dataset and then apply the drop_duplicates() function.

In [16]:
# Dropping primary key i.e. customer_id and storing into a temporary dataframe

cust_add_temp = cust_add.drop('customer_id', axis=1).drop_duplicates()
print(f"Number of records after removing customer_id: {len(cust_add_temp)}")
print(f"Number of records in original dataset: {len(cust_add)}")

Number of records after removing customer_id: 3999
Number of records in original dataset: 3999


`Since both numbers are the same, there are no duplicate records in the dataset.`

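Comparing row counts shows whether duplicates exist but not which rows they are. As a complementary sketch, `DataFrame.duplicated` with a `subset` that excludes the primary key can list the offending rows directly. The sample rows below are made up for illustration, and the variable names are assumptions.

```python
import pandas as pd

# Made-up rows: customers 1 and 3 share identical address details
sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "address": ["6 Meadow Vale Court", "9 Oakridge Court", "6 Meadow Vale Court"],
    "postcode": [2153, 3216, 2153],
    "state": ["NSW", "VIC", "NSW"],
})

# Compare every column except the primary key
non_key_cols = list(sample.columns.difference(["customer_id"]))

# keep=False marks every member of a duplicate group, not just the repeats
dupe_mask = sample.duplicated(subset=non_key_cols, keep=False)
print(sample[dupe_mask])
```

Keeping `customer_id` in the result makes it possible to decide which of the duplicated records to retain.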
### 4. Exporting the Cleaned Customer Address data to a CSV file
The Customer Address data is now clean, so we can export it to a CSV file and continue the Customer Segmentation analysis by joining it to other tables.

In [17]:
cust_add.to_csv('CustomerAddress_cleaned.csv', index=False)

### 5. Checking for Master Detail record counts
Compare against the master table (`CustomerDemographic_cleaned.csv`), which contains the full customer data, to find the customer IDs that would be dropped from the Customer Address dataset when the two are joined.
* `Basically, these are the customers who have an address but are not yet part of the Demographics dataset.`

In [19]:
cust_demo_detail = pd.read_csv('CustomerDemographic_cleaned.csv')

In [20]:
cust_demo_detail.head()

Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,Age
0,1,Laraine,Medendorp,Female,93,1953-10-12,Executive Secretary,Health,Mass Customer,N,Yes,11.0,70
1,2,Eli,Bockman,Male,81,1980-12-16,Administrative Officer,Financial Services,Mass Customer,N,Yes,16.0,43
2,3,Arlin,Dearle,Male,61,1954-01-20,Recruiting Manager,Property,Mass Customer,N,Yes,15.0,69
3,4,Talbot,,Male,33,1961-10-03,,IT,Mass Customer,N,No,7.0,62
4,5,Sheila-kathryn,Calton,Female,56,1977-05-13,Senior Editor,,Affluent Customer,N,Yes,8.0,46


In [24]:
print(f'Total Number of records in CustomerDemographic dataset: {cust_demo_detail.shape[0]}')
print(f'Total Number of records in CustomerAddress dataset: {cust_add.shape[0]}')

print(f'In Demographic Dataset, {cust_add.shape[0] - cust_demo_detail.shape[0]} records are getting dropped due to data cleaning process in the Demographic dataset')

Total Number of records in CustomerDemographic dataset: 3912
Total Number of records in CustomerAddress dataset: 3999
In Demographic Dataset, 87 records are getting dropped due to data cleaning process in the Demographic dataset


### Customer IDs in the Address table getting dropped:

In [25]:
cust_drop = cust_add.merge(cust_demo_detail, on='customer_id', how='left', indicator=True)

cust_drop.head()

Unnamed: 0,customer_id,address,postcode,state,country,property_valuation,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,Age,_merge
0,1,060 Morning Avenue,2016,NSW,Australia,10,Laraine,Medendorp,Female,93.0,1953-10-12,Executive Secretary,Health,Mass Customer,N,Yes,11.0,70.0,both
1,2,6 Meadow Vale Court,2153,NSW,Australia,10,Eli,Bockman,Male,81.0,1980-12-16,Administrative Officer,Financial Services,Mass Customer,N,Yes,16.0,43.0,both
2,4,0 Holy Cross Court,4211,QLD,Australia,9,Talbot,,Male,33.0,1961-10-03,,IT,Mass Customer,N,No,7.0,62.0,both
3,5,17979 Del Mar Point,2448,NSW,Australia,4,Sheila-kathryn,Calton,Female,56.0,1977-05-13,Senior Editor,,Affluent Customer,N,Yes,8.0,46.0,both
4,6,9 Oakridge Court,3216,VIC,Australia,9,Curr,Duckhouse,Male,35.0,1966-09-16,,Retail,High Net Worth,N,Yes,13.0,57.0,both

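With `indicator=True`, the `_merge` column marks each row as `both` or `left_only`, so the address records without a Demographic match can be isolated by filtering on it. The sketch below illustrates this step on two small made-up tables; the frame and variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Made-up stand-ins for the Address and Demographic tables
address = pd.DataFrame({
    "customer_id": [1, 2, 4, 5],
    "state": ["NSW", "NSW", "QLD", "NSW"],
})
demographic = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "gender": ["Female", "Male", "Male", "Male"],
})

# Left join keeps every address row; _merge records where each row was found
merged = address.merge(demographic, on="customer_id", how="left", indicator=True)

# Rows present only in the Address table have no Demographic match
address_only = merged[merged["_merge"] == "left_only"]
address_only_ids = address_only["customer_id"].tolist()
print(address_only_ids)
```

Counting these rows is likely more reliable than subtracting the two table sizes, because the Demographic table can also contain IDs that are absent from the Address table (customer_id 3 in the outputs above appears to be one such case).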