## Gather data: 
We are going to use [credit card data](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) from UCI's Machine Learning Repository for this project. Let us gather required data into a folder, `datasets`. 

Let's first load the python modules needed for gather credit card data.

In [5]:
# Modules required to gather data
import wget
import os 
import pandas as pd
import numpy as np

In [88]:
# Load helper functions from src/utils-*-*.py to run on jupyter-notebook
%run ../src/utils-gather-assess.py

In here, I am reading the weblinks into variables url_data and url_names.  

In [7]:
# Weblink directing to credit card dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening'

# Names of files in credit card dataset
data_file = 'crx.data'
names_file = 'crx.names'

Organizing data from the beginning will save time for productive data analysis in the later stages of the project. Downloaded (2) data files are given the same name as specified above.    

In [8]:
# Local directory to save data
path_data = '../datasets'

OK. We are all set to gather the data we need for this project. I am going to write a very simple python function to check and download the data needed. 

Now, let us call the check_download function twice on two datafiles needed.

In [9]:
# Call function on data_file
check_download(url, data_file, path_data)

 Folder exists: ../datasets
 Datafile already present: ../datasets/crx.data


In [10]:
# Call function on names_file
check_download(url, names_file, path_data)

 Folder exists: ../datasets
 Datafile already present: ../datasets/crx.names


Required datafiles for this project are now downloaded into `datasets` folder. How about a quick sneak-peak into the `datasets` folder? 

In [89]:
show_files_datasets(path_data)

 datasets/
       - crx.data
       - crx.names
       - crx.data_named.csv
       - crx.data_clean.csv


***

## Assess data:
In here, we load the credit card data and assess the data for `Quality` and `Tidiness`.

In [15]:
path_datafile = path_data + '/' + data_file

In [16]:
# Load credit card data into a DataFrame - cc_df  
cc_df = load_csv_df(path_datafile)
cc_df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [17]:
cc_df.shape

(690, 16)

No. of instances and features exactly match what is given in the details file. So, we can move further into our assessment. 

As column names provided no additional information in this dataset, we load the dataframe with no header. This means, our column names are simply numbers starting from 0 to 16. A quick google search directed me to a [Credit Card Analysis Page by Ryan Kuhn](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html), where the author assigned names to the variables. First 15 variables are the credit application features/attributes. Final, 16th, approved variable is the credit approval status. We are going to use these names as our column names.     

In [20]:
# Old column names
cc_df.columns

Index(['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel',
       'Ethnicity', 'YearsEmployed', 'PriorDefaulter', 'Employed',
       'CreditScore', 'DriversLicense', 'Citizen', 'ZipCode', 'Income',
       'Approved'],
      dtype='object')

In [21]:
# Define a list with new column names
new_column_names = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel', 'Ethnicity', 'YearsEmployed', 'PriorDefaulter', 'Employed','CreditScore', 'DriversLicense','Citizen','ZipCode','Income','Approved']
cc_df.columns = new_column_names

Print information to show connection between column id and column names.

In [22]:
cc_df.head(5)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefaulter,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [23]:
# Saved the names DataFrame into a file
cc_df.to_csv('../datasets/crx.data_named.csv')

In [26]:
# Show Features and their Data types 
show_features_datatypes(cc_df)

Column id:   0 	Name: Gender       	DataType: object
Column id:   1 	Name: Age          	DataType: object
Column id:   2 	Name: Debt         	DataType: float64
Column id:   3 	Name: Married      	DataType: object
Column id:   4 	Name: BankCustomer 	DataType: object
Column id:   5 	Name: EducationLevel 	DataType: object
Column id:   6 	Name: Ethnicity    	DataType: object
Column id:   7 	Name: YearsEmployed 	DataType: float64
Column id:   8 	Name: PriorDefaulter 	DataType: object
Column id:   9 	Name: Employed     	DataType: object
Column id:  10 	Name: CreditScore  	DataType: int64
Column id:  11 	Name: DriversLicense 	DataType: object
Column id:  12 	Name: Citizen      	DataType: object
Column id:  13 	Name: ZipCode      	DataType: object
Column id:  14 	Name: Income       	DataType: int64
Column id:  15 	Name: Approved     	DataType: object


In [27]:
drop_duplicate_rows(cc_df)

There are 0 duplicated rows in the dataset.


Let's get an overview of the complete dataframe. 

In [28]:
cc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          690 non-null    object 
 1   Age             690 non-null    object 
 2   Debt            690 non-null    float64
 3   Married         690 non-null    object 
 4   BankCustomer    690 non-null    object 
 5   EducationLevel  690 non-null    object 
 6   Ethnicity       690 non-null    object 
 7   YearsEmployed   690 non-null    float64
 8   PriorDefaulter  690 non-null    object 
 9   Employed        690 non-null    object 
 10  CreditScore     690 non-null    int64  
 11  DriversLicense  690 non-null    object 
 12  Citizen         690 non-null    object 
 13  ZipCode         690 non-null    object 
 14  Income          690 non-null    int64  
 15  Approved        690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


Now, there are no missing values in any attribute of the dataframe. Features are distributed as below:
- 2 Float datatypes
- 2 Integer datatypes
- 12 Object datatypes

Now, let's check each attribute in detail. Details on `Quality` or `Tidiness` issues are documented here. 

***

- **Gender** of any human is generally identified as either Male, Female or Neutral. In here, we only consider two options for Gender (either Male or Female).

In [37]:
show_feature_summary(cc_df, 'Gender')

 Details of feature: Gender
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['b', 'a']
         - cnt.vals: [480, 210]


Since, gender is a categorical variable. Missing values ('?') in the Gender feature are replaced with the max. occurance value. 

In [38]:
# Replace '?' with max.occurance of Gender - call to replace_feature_missingvalues
replace_feature_missingvalues(cc_df, 'Gender', '?')

 Details of column: Gender
        - >>>Error: ?


In [45]:
show_feature_summary(cc_df, 'Gender')

 Details of feature: Gender
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['b', 'a']
         - cnt.vals: [480, 210]


- Common sense suggests that **Age** attribute should be an float datatype. But, looks like it is an Object datatype instead. In Pandas, missing values such as '?' can alter the data type to an object. 

In [46]:
show_feature_summary(cc_df, 'Age')

 Details of feature: Age
         - datatype: float64
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: [31.019159420289856, 22.67, 20.42, 24.5, 22.5, 25.0, 18.83, 20.67, 23.58, 19.17, 27.67, 27.83, 33.17, 23.25, 23.08, 23.0, 25.17, 35.17, 34.17, 41.17, 28.58, 26.17, 32.33, 27.25, 24.58, 26.67, 29.5, 23.5, 20.75, 25.67, 20.0, 22.92, 24.75, 22.08, 36.75, 20.83, 29.58, 18.58, 19.42, 32.25, 21.83, 36.17, 17.92, 16.33, 22.17, 34.83, 23.92, 28.17, 39.17, 19.58, 20.5, 28.25, 36.67, 25.75, 21.5, 34.0, 28.75, 37.5, 22.75, 31.25, 20.17, 25.25, 27.58, 26.75, 19.5, 22.25, 23.75, 30.17, 18.17, 40.58, 19.67, 37.33, 28.67, 21.17, 21.92, 18.08, 40.92, 21.08, 29.83, 33.58, 23.42, 24.08, 34.08, 39.92, 17.08, 48.17, 20.08, 33.67, 22.58, 23.17, 31.67, 39.08, 39.5, 25.08, 48.58, 31.08, 45.0, 16.25, 25.58, 36.33, 21.75, 31.83, 47.42, 21.25, 29.67, 27.0, 48.08, 27.42, 17.58, 32.0, 28.0, 16.5, 51.83, 18.25, 34.92, 44.33, 24.42, 24.83, 17.67, 25.33, 38.92, 16.08, 30.67, 22.0, 32.67, 21.0

Let's replace the missing value '?' and set the datatype of Age feature to float.

In [48]:
# Replace '?' with mean age of applicants in the dataset & change datatype to integer
cc_df.Age.replace('?', 0.0, inplace=True)
change_feature_datatype(cc_df, 'Age', float)
replace_feature_missingvalues(cc_df, 'Age', 0.0)

 Details of column: Age
        - dtype(o): float64
        - dtype(n): float64


In [50]:
show_feature_summary(cc_df,'Age')

 Details of feature: Age
         - datatype: float64
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: [31.019159420289856, 22.67, 20.42, 24.5, 22.5, 25.0, 18.83, 20.67, 23.58, 19.17, 27.67, 27.83, 33.17, 23.25, 23.08, 23.0, 25.17, 35.17, 34.17, 41.17, 28.58, 26.17, 32.33, 27.25, 24.58, 26.67, 29.5, 23.5, 20.75, 25.67, 20.0, 22.92, 24.75, 22.08, 36.75, 20.83, 29.58, 18.58, 19.42, 32.25, 21.83, 36.17, 17.92, 16.33, 22.17, 34.83, 23.92, 28.17, 39.17, 19.58, 20.5, 28.25, 36.67, 25.75, 21.5, 34.0, 28.75, 37.5, 22.75, 31.25, 20.17, 25.25, 27.58, 26.75, 19.5, 22.25, 23.75, 30.17, 18.17, 40.58, 19.67, 37.33, 28.67, 21.17, 21.92, 18.08, 40.92, 21.08, 29.83, 33.58, 23.42, 24.08, 34.08, 39.92, 17.08, 48.17, 20.08, 33.67, 22.58, 23.17, 31.67, 39.08, 39.5, 25.08, 48.58, 31.08, 45.0, 16.25, 25.58, 36.33, 21.75, 31.83, 47.42, 21.25, 29.67, 27.0, 48.08, 27.42, 17.58, 32.0, 28.0, 16.5, 51.83, 18.25, 34.92, 44.33, 24.42, 24.83, 17.67, 25.33, 38.92, 16.08, 30.67, 22.0, 32.67, 21.0

- **Debt** should also be of float datatype. Let's see if there are any missing values in that column.

In [52]:
show_feature_summary(cc_df, 'Debt')

 Details of feature: Debt
         - datatype: float64
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: [1.5, 0.0, 3.0, 2.5, 1.25, 0.75, 0.5, 5.0, 4.0, 1.75, 6.5, 2.0, 10.0, 1.0, 0.585, 0.375, 11.0, 3.5, 0.54, 7.0, 12.5, 0.835, 0.165, 11.5, 5.5, 9.0, 2.75, 4.25, 15.0, 1.54, 6.0, 0.335, 0.875, 0.29, 0.25, 1.04, 2.04, 4.5, 10.5, 12.0, 0.125, 8.5, 9.5, 0.04, 3.75, 2.25, 2.54, 1.125, 1.835, 0.415, 0.46, 0.79, 1.665, 0.665, 3.165, 7.5, 2.71, 1.585, 3.25, 4.46, 4.415, 1.335, 0.205, 14.5, 1.085, 2.335, 9.25, 13.0, 2.415, 1.625, 13.5, 11.25, 0.625, 1.71, 0.83, 0.21, 8.0, 5.085, 0.42, 5.875, 14.0, 2.165, 7.625, 2.875, 3.125, 10.125, 10.25, 5.125, 1.165, 4.625, 19.5, 11.75, 6.75, 19.0, 12.75, 15.5, 1.375, 16.5, 9.54, 2.29, 5.29, 2.085, 5.04, 3.29, 4.085, 3.54, 0.58, 3.04, 4.04, 6.21, 4.71, 12.54, 16.165, 4.79, 1.21, 9.75, 6.04, 25.085, 0.17, 2.46, 4.165, 13.665, 14.79, 12.25, 0.67, 8.585, 6.665, 11.585, 11.665, 2.835, 26.335, 25.21, 9.96, 11.045, 5.625, 9.335, 3.085, 22.29

Looks like there are no missing values. But, wait, why are the debt values so low? The debt must be in different units. Let's apply describe method on Debt column and gain some statistical information. Looks like the mean debt of applicants in around 4.75, with max debt of 28. Definitely, this value can't be in dollars. It must be in thousands of dollars. 

In [54]:
cc_df.Debt.describe()

count    690.000000
mean       4.758725
std        4.978163
min        0.000000
25%        1.000000
50%        2.750000
75%        7.207500
max       28.000000
Name: Debt, dtype: float64

- **Marriage** attribute should reflect the marital status of each applicant. Since the details are masked, we can't make a firm decision on missing values. Here, since the column is Catergorical, we replace the missing values with most.occurance value.  

In [55]:
show_feature_summary(cc_df, 'Married')

 Details of feature: Married
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['u', 'y', '?', 'l']
         - cnt.vals: [519, 163, 6, 2]


In [56]:
# Change Married datatype from int to object and fix ?
change_feature_datatype(cc_df, 'Married', object)
replace_feature_missingvalues(cc_df, 'Married', '?')

 Details of column: Married
        - dtype(o): object
        - dtype(n): object
 Details of column: Married
      - uniqval(o): ['u', 'y', '?', 'l']
      - cnt.val(o): [519, 163, 6, 2]
      - uniqval(n): ['u', 'y', 'l']
      - cnt.val(n): [525, 163, 2]


In [58]:
show_feature_summary(cc_df, 'Married')

 Details of feature: Married
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['u', 'y', 'l']
         - cnt.vals: [525, 163, 2]


- **BankCustomer** attribute should be a simple yes, no or maybe type answer. Let's see.

In [59]:
show_feature_summary(cc_df, 'BankCustomer')

 Details of feature: BankCustomer
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['g', 'p', '?', 'gg']
         - cnt.vals: [519, 163, 6, 2]


In [61]:
# fix '?' missing data in BankCustomer column
replace_feature_missingvalues(cc_df, 'BankCustomer', '?')

 Details of column: BankCustomer
      - uniqval(o): ['g', 'p', '?', 'gg']
      - cnt.val(o): [519, 163, 6, 2]
      - uniqval(n): ['g', 'p', 'gg']
      - cnt.val(n): [525, 163, 2]


In [62]:
show_feature_summary(cc_df, 'BankCustomer')

 Details of feature: BankCustomer
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['g', 'p', 'gg']
         - cnt.vals: [525, 163, 2]


- **EducationLevel** attribute is definitely a categorical variable. But, let's see how many missing values it carries. 

In [63]:
show_feature_summary(cc_df, 'EducationLevel')

 Details of feature: EducationLevel
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['c', 'q', 'w', 'i', 'aa', 'ff', 'k', 'cc', 'm', 'x', 'd', 'e', 'j', '?', 'r']
         - cnt.vals: [137, 78, 64, 59, 54, 53, 51, 41, 38, 38, 30, 25, 10, 9, 3]


In [64]:
replace_feature_missingvalues(cc_df, 'EducationLevel', '?')

 Details of column: EducationLevel
      - uniqval(o): ['c', 'q', 'w', 'i', 'aa', 'ff', 'k', 'cc', 'm', 'x', 'd', 'e', 'j', '?', 'r']
      - cnt.val(o): [137, 78, 64, 59, 54, 53, 51, 41, 38, 38, 30, 25, 10, 9, 3]
      - uniqval(n): ['c', 'q', 'w', 'i', 'aa', 'ff', 'k', 'cc', 'm', 'x', 'd', 'e', 'j', 'r']
      - cnt.val(n): [146, 78, 64, 59, 54, 53, 51, 41, 38, 38, 30, 25, 10, 3]


In [65]:
show_feature_summary(cc_df, 'EducationLevel')

 Details of feature: EducationLevel
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['c', 'q', 'w', 'i', 'aa', 'ff', 'k', 'cc', 'm', 'x', 'd', 'e', 'j', 'r']
         - cnt.vals: [146, 78, 64, 59, 54, 53, 51, 41, 38, 38, 30, 25, 10, 3]


- **Ethnicity** is the next column with object datatype. That's fine! Any missing values? If so, we replace them accordingly.

In [66]:
show_feature_summary(cc_df, 'Ethnicity')

 Details of feature: Ethnicity
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['v', 'h', 'bb', 'ff', '?', 'z', 'j', 'dd', 'n', 'o']
         - cnt.vals: [399, 138, 59, 57, 9, 8, 8, 6, 4, 2]


In [67]:
replace_feature_missingvalues(cc_df, 'Ethnicity', '?')

 Details of column: Ethnicity
      - uniqval(o): ['v', 'h', 'bb', 'ff', '?', 'z', 'j', 'dd', 'n', 'o']
      - cnt.val(o): [399, 138, 59, 57, 9, 8, 8, 6, 4, 2]
      - uniqval(n): ['v', 'h', 'bb', 'ff', 'z', 'j', 'dd', 'n', 'o']
      - cnt.val(n): [408, 138, 59, 57, 8, 8, 6, 4, 2]


In [68]:
show_feature_summary(cc_df, 'Ethnicity')

 Details of feature: Ethnicity
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['v', 'h', 'bb', 'ff', 'z', 'j', 'dd', 'n', 'o']
         - cnt.vals: [408, 138, 59, 57, 8, 8, 6, 4, 2]


- **YearsEmployed** should indicate the no. of years an applicant was employed. So, it should be a float. Luckily, with no missing values.

In [70]:
show_feature_summary(cc_df, 'YearsEmployed')

 Details of feature: YearsEmployed
         - datatype: float64
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: [0.0, 0.25, 0.04, 1.0, 0.125, 0.5, 0.085, 1.5, 0.165, 2.5, 2.0, 1.75, 5.0, 3.5, 0.29, 0.75, 3.0, 2.25, 1.25, 0.415, 4.0, 5.5, 0.375, 0.665, 1.085, 0.54, 4.5, 6.5, 0.21, 0.585, 1.585, 0.335, 0.835, 8.5, 10.0, 0.875, 1.625, 7.0, 5.75, 4.25, 1.165, 15.0, 3.25, 7.5, 14.0, 1.415, 2.29, 3.75, 6.0, 2.75, 1.29, 3.085, 1.665, 20.0, 0.625, 1.21, 4.75, 13.875, 5.25, 1.835, 0.96, 2.625, 2.085, 11.0, 8.0, 2.375, 2.585, 12.75, 3.165, 1.375, 2.415, 5.085, 1.335, 12.5, 1.46, 3.04, 11.5, 13.5, 7.875, 14.415, 10.75, 13.0, 7.375, 28.5, 8.625, 0.79, 9.46, 0.46, 7.415, 1.71, 1.54, 2.54, 6.04, 7.585, 8.665, 5.335, 6.29, 0.455, 9.0, 2.71, 5.04, 2.165, 2.79, 5.665, 3.17, 4.58, 8.29, 4.29, 0.795, 2.335, 0.71, 7.96, 5.125, 15.5, 18.0, 4.625, 6.75, 1.875, 16.0, 3.335, 2.46, 2.125, 17.5, 3.125, 4.335, 5.165, 1.96, 3.96, 2.04, 1.04, 4.165, 5.375]
         - cnt.vals: [70, 35, 33, 

- **PriorDefaulter** is an interesting attribute. Let's venture if the data values are corresponding to data type. No missing data here !

In [71]:
show_feature_summary(cc_df, 'PriorDefaulter')

 Details of feature: PriorDefaulter
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['t', 'f']
         - cnt.vals: [361, 329]


- **Employed** attribute provides information on if the applicant is currently employed or not? No missing values here too!

In [72]:
show_feature_summary(cc_df, 'Employed')

 Details of feature: Employed
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['f', 't']
         - cnt.vals: [395, 295]


- **CreditScore** attribute is an feature on every applicant. It typically is in the range 0-800 in North America. Scoring scheme may be different in different countries. So, we don't worry about the overall range. Luckily, there are no missing values....that's good!

In [73]:
show_feature_summary(cc_df, 'CreditScore')

 Details of feature: CreditScore
         - datatype: int64
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: [0, 1, 2, 3, 6, 11, 5, 7, 4, 8, 9, 14, 12, 10, 15, 16, 17, 20, 40, 13, 19, 23, 67]
         - cnt.vals: [395, 71, 45, 28, 23, 19, 18, 16, 15, 10, 10, 8, 8, 8, 4, 3, 2, 2, 1, 1, 1, 1, 1]


- **DriversLicense** is just a personal identification document that is provided for any kind of application. 

In [74]:
show_feature_summary(cc_df, 'DriversLicense')

 Details of feature: DriversLicense
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['f', 't']
         - cnt.vals: [374, 316]


- **Citizen** is just an additional personal information that typically is a character object. No missing values here too!

In [75]:
show_feature_summary(cc_df, 'Citizen')

 Details of feature: Citizen
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['g', 's', 'p']
         - cnt.vals: [625, 57, 8]


- **ZipCode** provides the information about where a person lives in a city. No missing values here too! I can live with the fact that ZipCode is a string and not a number. At this point, we aren't sure if we can replace missing values '?' with the most. occurance values. But, we will go ahead and replace it for our first analysis. 

In [76]:
show_feature_summary(cc_df, 'ZipCode')

 Details of feature: ZipCode
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['00000', '00200', '00120', '00160', '00100', '00080', '00280', '00180', '00140', '00320', '00240', '?', '00300', '00260', '00060', '00400', '00220', '00360', '00340', '00380', '00144', '00440', '00070', '00132', '00420', '00040', '00232', '00108', '00520', '00128', '00272', '00150', '00096', '00216', '00480', '00460', '00181', '00164', '00290', '00176', '00350', '00370', '00210', '00399', '00050', '00312', '00136', '00168', '00396', '00225', '00110', '00020', '00112', '00092', '00073', '00129', '00500', '00145', '00228', '00130', '00352', '00154', '00088', '00720', '00252', '00560', '00330', '00094', '00550', '00030', '00171', '00487', '00348', '00263', '00239', '00268', '00329', '00443', '00156', '00163', '02000', '00383', '00174', '00086', '00170', '00117', '00276', '00640', '00454', '00099', '00076', '00303', '00510', '00375', '00230', '00356', '00204', '

In [524]:
replace_missingvalues(ccdata, 'ZipCode', '?')

 Details of column: ZipCode
      - uniqval(o): [132, 35, 35, 34, 30, 30, 22, 18, 16, 14, 14, 13, 13, 11, 9, 9, 9, 7, 7, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      - cnt.val(o): ['00000', '00200', '00120', '00160', '00080', '00100', '00280', '00180', '00140', '00320', '00240', '00300', '?', '00260', '00400', '00220', '00060', '00340', '00360', '00380', '00144', '00108', '00440', '00070', '00420', '00132', '00520', '00232', '00040', '00128', '00150', '00460', '00272', '00181', '00096', '00176', '00216', '00480', '00290', '00164', '00020', '00129', '00092', '00252', '00145', '

In [78]:
show_feature_summary(cc_df, 'ZipCode')

 Details of feature: ZipCode
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['00000', '00200', '00120', '00160', '00100', '00080', '00280', '00180', '00140', '00320', '00240', '?', '00300', '00260', '00060', '00400', '00220', '00360', '00340', '00380', '00144', '00440', '00070', '00132', '00420', '00040', '00232', '00108', '00520', '00128', '00272', '00150', '00096', '00216', '00480', '00460', '00181', '00164', '00290', '00176', '00350', '00370', '00210', '00399', '00050', '00312', '00136', '00168', '00396', '00225', '00110', '00020', '00112', '00092', '00073', '00129', '00500', '00145', '00228', '00130', '00352', '00154', '00088', '00720', '00252', '00560', '00330', '00094', '00550', '00030', '00171', '00487', '00348', '00263', '00239', '00268', '00329', '00443', '00156', '00163', '02000', '00383', '00174', '00086', '00170', '00117', '00276', '00640', '00454', '00099', '00076', '00303', '00510', '00375', '00230', '00356', '00204', '

- **Income** of an applicant is a very relevant information for a credit card application. Income is mostly provided as a number. No missing value!

In [80]:
show_feature_summary(cc_df, 'Income')

 Details of feature: Income
         - datatype: int64
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: [0, 1, 1000, 500, 2, 5, 300, 6, 3, 200, 100, 4, 50, 150, 7, 10, 3000, 20, 400, 560, 18, 600, 2000, 5000, 4000, 108, 17, 44, 204, 8, 1210, 67, 284, 68, 350, 351, 1200, 13, 375, 40, 16, 35, 27, 19, 15, 21, 456, 11, 809, 3065, 99, 28, 444, 540, 158, 154, 147, 2200, 146, 160, 162, 2197, 1187, 9, 1208, 168, 210, 234, 2279, 228, 225, 221, 3290, 1236, 196, 173, 195, 13212, 1212, 3257, 140, 179, 5298, 141, 41, 2184, 134, 1704, 60, 59, 58, 55, 53, 2100, 14, 42, 1065, 38, 1058, 33, 32, 2079, 25, 2072, 23, 22, 4159, 70, 1097, 4208, 130, 2206, 126, 122, 120, 117, 12, 113, 109, 5200, 105, 26726, 1260, 98, 90, 87, 1110, 246, 2283, 112, 237, 690, 15108, 769, 768, 2803, 750, 742, 5860, 80, 730, 722, 713, 687, 10000, 2732, 5800, 678, 5124, 100000, 15000, 8851, 5777, 184, 2690, 639, 1004, 790, 245, 51100, 2028, 4071, 990, 1062, 2010, 1400, 11202, 960, 948, 11177, 1950, 918, 800,

- **Approved** column is the holy-grail in this Credit Card Approval Project.  

In [81]:
show_feature_summary(cc_df, 'Approved')

 Details of feature: Approved
         - datatype: object
         - col.size: (690,)
         - NaN.vals: 0
         - uniqvals: ['-', '+']
         - cnt.vals: [383, 307]


In [83]:
cc_df.head(5)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefaulter,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [85]:
cc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          690 non-null    object 
 1   Age             690 non-null    float64
 2   Debt            690 non-null    float64
 3   Married         690 non-null    object 
 4   BankCustomer    690 non-null    object 
 5   EducationLevel  690 non-null    object 
 6   Ethnicity       690 non-null    object 
 7   YearsEmployed   690 non-null    float64
 8   PriorDefaulter  690 non-null    object 
 9   Employed        690 non-null    object 
 10  CreditScore     690 non-null    int64  
 11  DriversLicense  690 non-null    object 
 12  Citizen         690 non-null    object 
 13  ZipCode         690 non-null    object 
 14  Income          690 non-null    int64  
 15  Approved        690 non-null    object 
dtypes: float64(3), int64(2), object(11)
memory usage: 86.4+ KB


We are going use the data version after cleaning information for `Quality`. Datatypes of all features are now categorized as below:
- 3 Float datatypes
- 2 Integer datatypes
- 11 Object datatypes

Let's now save it into a csv file. 

In [87]:
cc_df.to_csv('../datasets/crx.data_clean.csv')

In [90]:
show_files_datasets(path_data)

 datasets/
       - crx.data
       - crx.names
       - crx.data_named.csv
       - crx.data_clean.csv
