### Gather data: 
We are going to use [credit card data](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) from UCI's Machine Learning Repository for this project. Let us gather required data into a folder, `datasets`. 

Let's first load the python modules needed for gather credit card data.

In [58]:
# Modules required to gather data
import wget
import os 
import pandas as pd
import numpy as np

In [101]:
# Load helper functions from utils.py to run on jupyter-notebook
%run ../src/utils-gather-assess.py

In here, I am reading the weblinks into variables url_data and url_names.  

In [60]:
# Weblink directing to credit card dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening'

# Names of files in credit card dataset
data_file = 'crx.data'
names_file = 'crx.names'

Organizing data from the beginning will save time for productive data analysis in the later stages of the project. Downloaded (2) data files are given the same name as specified above.    

In [61]:
# Local directory to save data
path_data = '../datasets'

OK. We are all set to gather the data we need for this project. I am going to write a very simple python function to check and download the data needed. 

Now, let us call the check_download function twice on two datafiles needed.

In [62]:
# Call function on data_file
check_download(url, data_file, path_data)

 Folder exists: ../datasets
 Datafile already present: ../datasets/crx.data


In [63]:
# Call function on names_file
check_download(url, names_file, path_data)

 Folder exists: ../datasets
 Datafile already present: ../datasets/crx.names


Required datafiles for this project are now downloaded into `datasets` folder. How about a quick sneak-peak into the `datasets` folder? 

In [64]:
show_files_in_datasets(path_data)

 datasets/
       - crx.data
       - crx.names
       - crx.data_named.csv


***

### Assess data:
In here, we load the credit card data and assess the data for `Quality` and `Tidiness`.

In [65]:
path_datafile = path_data + '/' + data_file

In [79]:
# Load credit card data into a DataFrame - cc_df  
cc_df = load_csv_df(path_datafile)
cc_df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [80]:
cc_df.shape

(690, 16)

No. of instances and features/attributes exactly match what is given in the details file. So, we can move further into our assessment. 

As column names provided no additional information in this dataset, we load the dataframe with no header. This means, our column names are simply numbers starting from 0 to 16. A quick google search directed me to a [Credit Card Analysis Page by Ryan Kuhn](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html), where the author assigned names to the variables. The first 15 variables are the credit application features/attributes. Final, 16th, approved variable is the credit approval status. We are going to use these names as our column names.     

In [81]:
# Old column names
cc_df.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64')

In [91]:
# Define a list with new column names
new_column_names = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel', 'Ethnicity', 'YearsEmployed', 'PriorDefaulter', 'Employed','CreditScore', 'DriversLicense','Citizen','ZipCode','Income','Approved']
cc_df.columns = new_column_names

Print information to show connection between column id and column names.

In [92]:
cc_df.head(5)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefaulter,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [93]:
# Saved the names DataFrame into a file
cc_df.to_csv('../datasets/crx.data_named.csv')

In [94]:
# Show Features and their Data types 
show_features_datatypes(cc_df)

Column id:   0 	Name: Gender       	DataType: object
Column id:   1 	Name: Age          	DataType: object
Column id:   2 	Name: Debt         	DataType: float64
Column id:   3 	Name: Married      	DataType: object
Column id:   4 	Name: BankCustomer 	DataType: object
Column id:   5 	Name: EducationLevel 	DataType: object
Column id:   6 	Name: Ethnicity    	DataType: object
Column id:   7 	Name: YearsEmployed 	DataType: float64
Column id:   8 	Name: PriorDefaulter 	DataType: object
Column id:   9 	Name: Employed     	DataType: object
Column id:  10 	Name: CreditScore  	DataType: int64
Column id:  11 	Name: DriversLicense 	DataType: object
Column id:  12 	Name: Citizen      	DataType: object
Column id:  13 	Name: ZipCode      	DataType: object
Column id:  14 	Name: Income       	DataType: int64
Column id:  15 	Name: Approved     	DataType: object


In [102]:
drop_duplicate_rows(cc_df)

There are 0 duplicated rows in the dataset.


Let's get an overview of the complete dataframe. 

In [103]:
cc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
Gender            690 non-null object
Age               690 non-null object
Debt              690 non-null float64
Married           690 non-null object
BankCustomer      690 non-null object
EducationLevel    690 non-null object
Ethnicity         690 non-null object
YearsEmployed     690 non-null float64
PriorDefaulter    690 non-null object
Employed          690 non-null object
CreditScore       690 non-null int64
DriversLicense    690 non-null object
Citizen           690 non-null object
ZipCode           690 non-null object
Income            690 non-null int64
Approved          690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


There are no missing values in any attribute of the dataframe. Features are distributed as below:
- 2 Float datatypes
- 2 Integer datatypes
- 12 Object datatypes

Now, let's check each attribute in detail. Details on `Quality` or `Tidiness` issues are documented here. 

In [1]:
def get_uniquevalues(df, colname):
    """
        Returns a list with all unique values in the column of datafram df
    """
    return list(dict(df[colname].value_counts(ascending=False, dropna=False)).keys())


def get_uniquecounts(df, colname):
    """
        Returns a list with all counts of unique values in the column of datafram df
    """
    return list(dict(df[colname].value_counts(ascending=False, dropna=False)).values())

In [490]:
# Function to get brief of individual data attribute
def show_details(df, colname):
    """
        Function to display all necessary information to fix missing data
    """
    print(' Details of column:',colname)
    print('        - datatype:',df[colname].dtypes)
    print('        - col.size:',df[colname].shape)
    print('        - NaN.vals:',df[colname].isnull().sum())
    print('        - uniqvals:',get_uniquevalues(df, colname))
    print('        - cnt.vals:',get_uniquecounts(df, colname))

In [491]:
# Function to change datatype of a column
def change_datatype(df, colname, dtype_new):
    """
        Function to modify data type of a column 
    """
    if dtype_new in [object, int, float]:
        print(' Details of column:',colname)
        print('        - dtype(o):',df[colname].dtypes)
        # Change of data type is done here!
        df[colname] = df[colname].astype(dtype_new)
        print('        - dtype(n):',df[colname].dtypes)
    else:
        print(' Details of column:',colname)
        print('        - >>>Error:',dtype_new) 

In [502]:
# Function to replace one value with other (according to datatype) in a column 
def replace_missingvalues(df, colname, old_val):
    """
        Function to replace old_val with new_val (according to datatype) in a column of DataFrame
    """
    if (old_val in df[colname].unique()):
        print(' Details of column:',colname)
        print('      - uniqval(o):',get_uniquecounts(df, colname))
        print('      - cnt.val(o):',get_uniquevalues(df, colname))
        
        # Replace old_val with new_val in df[colname]
        if (df[colname].dtype == object):
            new_val = df[colname].value_counts(ascending=False).index[0]
        else:
            new_val = df[colname].mean()
        df[colname].replace(old_val, new_val,inplace=True)       
        print('      - uniqval(n):',get_uniquecounts(df, colname))
        print('      - cnt.val(n):',get_uniquevalues(df, colname))
        
    else:
        print(' Details of column:',colname)
        print('        - >>>Error:',old_val) 


- **Gender** attribute generally spans in Male, Female or Neutral. We can simplify the column data by only considering two options for Gender (either Male of Female). Futher, simplification to numeric is also possible. But, we will leave that task to sklearn.encoders. 

In [476]:
show_details(ccdata, 'Gender')

 Details of column: Gender
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['b', 'a', '?']
        - cnt.vals: [468, 210, 12]


In [477]:
# Replace '?' with max.occurance of Gender - call to replace_missingvalues
replace_missingvalues(ccdata, 'Gender', '?')

 Details of column: Gender
      - uniqval(o): [468, 210, 12]
      - cnt.val(o): ['b', 'a', '?']
      - uniqval(o): [480, 210]
      - cnt.val(o): ['b', 'a']


In [478]:
show_details(ccdata, 'Gender')

 Details of column: Gender
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['b', 'a']
        - cnt.vals: [480, 210]


- Common sense says that **Age** attribute should be an float datatype. But, looks like it is an Object datatype instead. In Pandas, missing values such as '?' can alter the data type to an object. 

In [493]:
detail_datacolumn(ccdata, 'Age')

 Details of column: Age
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['?', '22.67', '20.42', '23.58', '22.50', '25.00', '24.50', '18.83', '19.17', '20.67', '27.67', '23.25', '23.00', '27.83', '33.17', '23.08', '32.33', '23.50', '20.00', '26.17', '24.75', '22.92', '27.25', '25.17', '20.75', '34.17', '24.58', '22.08', '25.67', '28.58', '41.17', '29.50', '35.17', '26.67', '19.50', '20.50', '22.17', '21.92', '48.17', '28.17', '29.58', '36.67', '18.17', '28.25', '23.42', '21.08', '23.92', '28.75', '22.25', '25.75', '37.50', '40.92', '24.08', '34.00', '21.17', '28.67', '34.83', '37.33', '17.08', '39.92', '19.67', '20.17', '36.75', '17.92', '39.17', '27.58', '25.25', '34.08', '33.67', '16.33', '36.17', '40.58', '31.25', '20.08', '19.42', '26.75', '19.58', '32.25', '22.75', '29.83', '23.17', '20.83', '18.08', '30.17', '22.58', '21.50', '33.58', '18.58', '23.75', '21.83', '32.00', '64.08', '42.83', '16.92', '44.33', '22.83', '21.25', '19.33', '

In [495]:
# Replace '?' with mean age of applicants in the dataset & change datatype to integer
ccdata.Age.replace('?', 0.0, inplace=True)
modify_datatype(ccdata, 'Age', float)
replace_missingvalues(ccdata, 'Age', 0.0)

 Details of column: Age
        - dtype(o): object
        - dtype(n): float64
 Details of column: Age
      - uniqval(o): [12, 9, 7, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [496]:
show_details(ccdata, 'Age')

 Details of column: Age
        - datatype: float64
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: [31.019159420289856, 22.67, 20.42, 24.5, 22.5, 25.0, 18.83, 20.67, 23.58, 19.17, 27.67, 27.83, 33.17, 23.25, 23.08, 23.0, 25.17, 35.17, 34.17, 41.17, 28.58, 26.17, 32.33, 27.25, 24.58, 26.67, 29.5, 23.5, 20.75, 25.67, 20.0, 22.92, 24.75, 22.08, 36.75, 20.83, 29.58, 18.58, 19.42, 32.25, 21.83, 36.17, 17.92, 16.33, 22.17, 34.83, 23.92, 28.17, 39.17, 19.58, 20.5, 28.25, 36.67, 25.75, 21.5, 34.0, 28.75, 37.5, 22.75, 31.25, 20.17, 25.25, 27.58, 26.75, 19.5, 22.25, 23.75, 30.17, 18.17, 40.58, 19.67, 37.33, 28.67, 21.17, 21.92, 18.08, 40.92, 21.08, 29.83, 33.58, 23.42, 24.08, 34.08, 39.92, 17.08, 48.17, 20.08, 33.67, 22.58, 23.17, 31.67, 39.08, 39.5, 25.08, 48.58, 31.08, 45.0, 16.25, 25.58, 36.33, 21.75, 31.83, 47.42, 21.25, 29.67, 27.0, 48.08, 27.42, 17.58, 32.0, 28.0, 16.5, 51.83, 18.25, 34.92, 44.33, 24.42, 24.83, 17.67, 25.33, 38.92, 16.08, 30.67, 22.0, 32.67, 21.0, 34.

- **Debt** should also be a float. Let's see if there are any missing values in that column.

In [497]:
show_details(ccdata, 'Debt')

 Details of column: Debt
        - datatype: float64
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: [1.5, 0.0, 3.0, 2.5, 1.25, 0.75, 0.5, 5.0, 4.0, 1.75, 6.5, 2.0, 10.0, 1.0, 0.585, 0.375, 11.0, 3.5, 0.54, 7.0, 12.5, 0.835, 0.165, 11.5, 5.5, 9.0, 2.75, 4.25, 15.0, 1.54, 6.0, 0.335, 0.875, 0.29, 0.25, 1.04, 2.04, 4.5, 10.5, 12.0, 0.125, 8.5, 9.5, 0.04, 3.75, 2.25, 2.54, 1.125, 1.835, 0.415, 0.46, 0.79, 1.665, 0.665, 3.165, 7.5, 2.71, 1.585, 3.25, 4.46, 4.415, 1.335, 0.205, 14.5, 1.085, 2.335, 9.25, 13.0, 2.415, 1.625, 13.5, 11.25, 0.625, 1.71, 0.83, 0.21, 8.0, 5.085, 0.42, 5.875, 14.0, 2.165, 7.625, 2.875, 3.125, 10.125, 10.25, 5.125, 1.165, 4.625, 19.5, 11.75, 6.75, 19.0, 12.75, 15.5, 1.375, 16.5, 9.54, 2.29, 5.29, 2.085, 5.04, 3.29, 4.085, 3.54, 0.58, 3.04, 4.04, 6.21, 4.71, 12.54, 16.165, 4.79, 1.21, 9.75, 6.04, 25.085, 0.17, 2.46, 4.165, 13.665, 14.79, 12.25, 0.67, 8.585, 6.665, 11.585, 11.665, 2.835, 26.335, 25.21, 9.96, 11.045, 5.625, 9.335, 3.085, 22.29, 11.

Looks like there are no missing values. Running a quick description tells us that the debt information must have been in the order of thousands. 

In [498]:
ccdata.Debt.describe()

count    690.000000
mean       4.758725
std        4.978163
min        0.000000
25%        1.000000
50%        2.750000
75%        7.207500
max       28.000000
Name: Debt, dtype: float64

- **Marriage** attribute should reflect the marital status of the applicant.

In [499]:
show_details(ccdata, 'Married')

 Details of column: Married
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['u', 'y', '?', 'l']
        - cnt.vals: [519, 163, 6, 2]


In [500]:
# Change Married datatype from int to object and fix ?
modify_datatype(ccdata, 'Married', object)
replace_missingvalues(ccdata, 'Married', '?')

 Details of column: Married
        - dtype(o): object
        - dtype(n): object
 Details of column: Married
      - uniqval(o): [519, 163, 6, 2]
      - cnt.val(o): ['u', 'y', '?', 'l']
      - uniqval(o): [525, 163, 2]
      - cnt.val(o): ['u', 'y', 'l']


In [501]:
show_details(ccdata, 'Married')

 Details of column: Married
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['u', 'y', 'l']
        - cnt.vals: [525, 163, 2]


- **BankCustomer** attribute should be a simple yes, no or maybe type answer. Let's see.

In [503]:
show_details(ccdata, 'BankCustomer')

 Details of column: BankCustomer
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['g', 'p', '?', 'gg']
        - cnt.vals: [519, 163, 6, 2]


In [504]:
# fix '?' missing data in BankCustomer column
replace_missingvalues(ccdata, 'BankCustomer', '?')

 Details of column: BankCustomer
      - uniqval(o): [519, 163, 6, 2]
      - cnt.val(o): ['g', 'p', '?', 'gg']
      - uniqval(n): [525, 163, 2]
      - cnt.val(n): ['g', 'p', 'gg']


In [505]:
show_details(ccdata, 'BankCustomer')

 Details of column: BankCustomer
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['g', 'p', 'gg']
        - cnt.vals: [525, 163, 2]


- **EducationLevel** attribute is definitely a categorical variable. But, let's see how many missing values it carries. 

In [506]:
show_details(ccdata, 'EducationLevel')

 Details of column: EducationLevel
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['c', 'q', 'w', 'i', 'aa', 'ff', 'k', 'cc', 'x', 'm', 'd', 'e', 'j', '?', 'r']
        - cnt.vals: [137, 78, 64, 59, 54, 53, 51, 41, 38, 38, 30, 25, 10, 9, 3]


In [507]:
replace_missingvalues(ccdata, 'EducationLevel', '?')

 Details of column: EducationLevel
      - uniqval(o): [137, 78, 64, 59, 54, 53, 51, 41, 38, 38, 30, 25, 10, 9, 3]
      - cnt.val(o): ['c', 'q', 'w', 'i', 'aa', 'ff', 'k', 'cc', 'x', 'm', 'd', 'e', 'j', '?', 'r']
      - uniqval(n): [146, 78, 64, 59, 54, 53, 51, 41, 38, 38, 30, 25, 10, 3]
      - cnt.val(n): ['c', 'q', 'w', 'i', 'aa', 'ff', 'k', 'cc', 'x', 'm', 'd', 'e', 'j', 'r']


In [508]:
show_details(ccdata, 'EducationLevel')

 Details of column: EducationLevel
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['c', 'q', 'w', 'i', 'aa', 'ff', 'k', 'cc', 'x', 'm', 'd', 'e', 'j', 'r']
        - cnt.vals: [146, 78, 64, 59, 54, 53, 51, 41, 38, 38, 30, 25, 10, 3]


- **Ethnicity** is the next column with object datatype. That's fine! Any missing values? If so, we replace them accordingly.

In [509]:
show_details(ccdata, 'Ethnicity')

 Details of column: Ethnicity
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['v', 'h', 'bb', 'ff', '?', 'j', 'z', 'dd', 'n', 'o']
        - cnt.vals: [399, 138, 59, 57, 9, 8, 8, 6, 4, 2]


In [511]:
replace_missingvalues(ccdata, 'Ethnicity', '?')

 Details of column: Ethnicity
      - uniqval(o): [399, 138, 59, 57, 9, 8, 8, 6, 4, 2]
      - cnt.val(o): ['v', 'h', 'bb', 'ff', '?', 'j', 'z', 'dd', 'n', 'o']
      - uniqval(n): [408, 138, 59, 57, 8, 8, 6, 4, 2]
      - cnt.val(n): ['v', 'h', 'bb', 'ff', 'j', 'z', 'dd', 'n', 'o']


In [512]:
show_details(ccdata, 'Ethnicity')

 Details of column: Ethnicity
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['v', 'h', 'bb', 'ff', 'j', 'z', 'dd', 'n', 'o']
        - cnt.vals: [408, 138, 59, 57, 8, 8, 6, 4, 2]


- **YearsEmployed** should indicate the no. of years an applicant was employed. So, it should be a float. Luckily, with no missing values.

In [513]:
display_column(ccdata, 'YearsEmployed')

 Details of column: YearsEmployed
        - datatype: float64
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: [ 1.25   3.04   1.5    3.75   1.71   2.5    6.5    0.04   3.96   3.165
  2.165  4.335  1.     5.     0.25   0.96   3.17   0.665  0.75   0.835
  7.875  3.085  0.5    5.165 15.     7.     5.04   7.96   7.585  0.415
  2.     1.835 14.415  4.5    5.335  8.625 28.5    2.625  0.125  6.04
  3.5    0.165  0.875  1.75   0.     7.415  0.085  5.75   6.     3.
  1.585  4.29   1.54   1.46   1.625 12.5   13.5   10.75   0.375  0.585
  0.455  4.     8.5    9.46   2.25  10.     0.795  1.375  1.29  11.5
  6.29  14.     0.335  1.21   7.375  7.5    3.25  13.     5.5    4.25
  0.625  5.085  2.75   2.375  8.     1.085  2.54   4.165  1.665 11.
  9.     1.335  1.415  1.96   2.585  5.125 15.5    0.71   5.665 18.
  5.25   8.665  2.29  20.     2.46  13.875  2.085  4.58   2.71   2.04
  0.29   4.75   0.46   0.21   0.54   3.335  2.335  1.165  2.415  2.79
  4.625  1.04   6.75   1.875 16.

- **PriorDefaulter** is an interesting attribute. Let's venture if the data values are corresponding to data type. No missing data here !

In [514]:
show_details(ccdata, 'PriorDefaulter')

 Details of column: PriorDefaulter
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['t', 'f']
        - cnt.vals: [361, 329]


- **Employed** attribute provides information on if the applicant is currently employed or not? No missing values here too!

In [515]:
show_details(ccdata, 'Employed')

 Details of column: Employed
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['f', 't']
        - cnt.vals: [395, 295]


- **CreditScore** attribute is an important indicator for every applicant. It typically is in the range 0-800 in North America. Scoring scheme may be different in different countries. So, we don't worry about the overall range. No missing values....that's good!

In [517]:
show_details(ccdata, 'CreditScore')

 Details of column: CreditScore
        - datatype: int64
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: [0, 1, 2, 3, 6, 11, 5, 7, 4, 8, 9, 14, 12, 10, 15, 16, 17, 20, 40, 13, 19, 23, 67]
        - cnt.vals: [395, 71, 45, 28, 23, 19, 18, 16, 15, 10, 10, 8, 8, 8, 4, 3, 2, 2, 1, 1, 1, 1, 1]


- **DriversLicense** is just a personal identification document that is provided for any kind of application. 

In [519]:
show_details(ccdata, 'DriversLicense')

 Details of column: DriversLicense
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['f', 't']
        - cnt.vals: [374, 316]


- **Citizen** is just an additional personal information that typically is a character object. No missing values here too!

In [520]:
show_details(ccdata, 'Citizen')

 Details of column: Citizen
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['g', 's', 'p']
        - cnt.vals: [625, 57, 8]


- **ZipCode** provides the information about where a person lives in a city. No missing values here too! I am live with the fact that ZipCode is a string and not a number. At this point, we aren't sure if we can replace missing values '?' with the most. occurance values. But, we will go ahead and replace it for our first analysis. 

In [521]:
show_details(ccdata, 'ZipCode')

 Details of column: ZipCode
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['00000', '00200', '00120', '00160', '00080', '00100', '00280', '00180', '00140', '00320', '00240', '00300', '?', '00260', '00400', '00220', '00060', '00340', '00360', '00380', '00144', '00108', '00440', '00070', '00420', '00132', '00520', '00232', '00040', '00128', '00150', '00460', '00272', '00181', '00096', '00176', '00216', '00480', '00290', '00164', '00020', '00129', '00092', '00252', '00145', '00210', '00399', '00350', '00168', '00352', '00073', '00225', '00228', '00370', '00330', '00130', '00560', '00112', '00396', '00720', '00136', '00088', '00312', '00050', '00500', '00154', '00110', '00022', '00368', '00208', '00076', '00224', '00840', '00443', '00311', '00202', '00680', '00093', '00276', '00372', '00375', '00171', '00017', '00231', '00432', '00640', '00062', '00450', '00268', '00117', '00253', '00288', '00408', '00024', '00381', '00052', '00204', '00510

In [524]:
replace_missingvalues(ccdata, 'ZipCode', '?')

 Details of column: ZipCode
      - uniqval(o): [132, 35, 35, 34, 30, 30, 22, 18, 16, 14, 14, 13, 13, 11, 9, 9, 9, 7, 7, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      - cnt.val(o): ['00000', '00200', '00120', '00160', '00080', '00100', '00280', '00180', '00140', '00320', '00240', '00300', '?', '00260', '00400', '00220', '00060', '00340', '00360', '00380', '00144', '00108', '00440', '00070', '00420', '00132', '00520', '00232', '00040', '00128', '00150', '00460', '00272', '00181', '00096', '00176', '00216', '00480', '00290', '00164', '00020', '00129', '00092', '00252', '00145', '

In [525]:
show_details(ccdata, 'ZipCode')

 Details of column: ZipCode
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['00000', '00200', '00120', '00160', '00080', '00100', '00280', '00180', '00140', '00320', '00240', '00300', '00260', '00400', '00060', '00220', '00340', '00360', '00380', '00420', '00132', '00232', '00108', '00520', '00070', '00440', '00144', '00040', '00128', '00164', '00150', '00096', '00480', '00460', '00290', '00181', '00216', '00176', '00272', '00020', '00073', '00129', '00396', '00225', '00228', '00145', '00210', '00130', '00330', '00370', '00352', '00112', '00168', '00092', '00399', '00720', '00136', '00252', '00350', '00154', '00312', '00088', '00110', '00560', '00050', '00500', '00372', '00840', '00224', '00076', '00202', '00443', '00311', '00680', '00093', '00022', '00375', '00268', '00017', '00356', '00208', '00368', '00640', '00303', '00211', '00432', '00195', '00171', '00250', '00408', '00024', '00381', '00510', '01160', '00465', '00455', '00348', '0

- **Income** of an applicant is a very relevant information for a credit card application. Income is mostly provided as a number. No missing value!

In [522]:
show_details(ccdata, 'Income')

 Details of column: Income
        - datatype: int64
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: [0, 1, 1000, 500, 2, 5, 300, 6, 3, 200, 100, 4, 50, 150, 7, 10, 3000, 20, 400, 560, 18, 600, 2000, 5000, 4000, 108, 17, 44, 204, 8, 1210, 67, 284, 68, 350, 351, 1200, 13, 375, 40, 16, 35, 27, 19, 15, 21, 456, 11, 809, 3065, 99, 28, 444, 540, 158, 154, 147, 2200, 146, 160, 162, 2197, 1187, 9, 1208, 168, 210, 234, 2279, 228, 225, 221, 3290, 1236, 196, 173, 195, 13212, 1212, 3257, 140, 179, 5298, 141, 41, 2184, 134, 1704, 60, 59, 58, 55, 53, 2100, 14, 42, 1065, 38, 1058, 33, 32, 2079, 25, 2072, 23, 22, 4159, 70, 1097, 4208, 130, 2206, 126, 122, 120, 117, 12, 113, 109, 5200, 105, 26726, 1260, 98, 90, 87, 1110, 246, 2283, 112, 237, 690, 15108, 769, 768, 2803, 750, 742, 5860, 80, 730, 722, 713, 687, 10000, 2732, 5800, 678, 5124, 100000, 15000, 8851, 5777, 184, 2690, 639, 1004, 790, 245, 51100, 2028, 4071, 990, 1062, 2010, 1400, 11202, 960, 948, 11177, 1950, 918, 800, 7059

- **Approved** column is the holy-grail in this Credit Card Approval Project.  

In [523]:
show_details(ccdata, 'Approved')

 Details of column: Approved
        - datatype: object
        - col.size: (690,)
        - NaN.vals: 0
        - uniqvals: ['-', '+']
        - cnt.vals: [383, 307]


In [526]:
ccdata.head(5)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefaulter,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [527]:
ccdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
Gender            690 non-null object
Age               690 non-null float64
Debt              690 non-null float64
Married           690 non-null object
BankCustomer      690 non-null object
EducationLevel    690 non-null object
Ethnicity         690 non-null object
YearsEmployed     690 non-null float64
PriorDefaulter    690 non-null object
Employed          690 non-null object
CreditScore       690 non-null int64
DriversLicense    690 non-null object
Citizen           690 non-null object
ZipCode           690 non-null object
Income            690 non-null int64
Approved          690 non-null object
dtypes: float64(3), int64(2), object(11)
memory usage: 86.4+ KB


We are going use this version after changing information about the `Quality` of the data. Attributes datatypes are categorized as below:
- 3 Float datatypes
- 2 Integer datatypes
- 11 Object datatypes

Let's now save it into a csv file. 

In [529]:
ccdata.to_csv('../datasets/crx.data_clean.csv')

In [530]:
show_files_in_datasets(path_data)

 datasets/
       - crx.data
       - crx.names
       - crx.data_clean.csv
