# Sneak Peek into Credit Card Dataset

#### Here, we perform two simple but essential steps in any Data Science project.
- Firstly, we are going to download [credit card data](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) from UCI's Machine Learning Repository. 
- Secondly, we are going to dig into different data types that are at our disposal in the credit card dataset.  

### 1. Download credit card data from UCI ML Repository

In [5]:
#!pip install wget

In [54]:
# Modules to import
import wget
import os 
import pandas as pd

In [50]:
# Weblinks directing to the data to be downloaded
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening'
data_file = 'crx.data'
names_file = 'crx.names'
url_data  = url+'/'+data_file
url_names = url+'/'+names_file

In [51]:
# Download the data files if they are not already present
path_raw_data = '../data/raw_data'
os.makedirs(path_raw_data, exist_ok=True)  # make sure the target directory exists
if os.path.isfile(os.path.join(path_raw_data, data_file)):
    print(" Data already present.")
else:
    print(" Downloading data... <START>")
    crxdata  = wget.download(url_data, path_raw_data)
    print(" - {}".format(data_file))
    crxnames = wget.download(url_names, path_raw_data)
    print(" - {}".format(names_file))
    print(" Downloading data... <FINISH>")

 Downloading data... <START>
 - crx.data
 - crx.names
 Downloading data... <FINISH>
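
The check above only tests for `crx.data`, so a missing `crx.names` would go unnoticed. As a minimal sketch (the helper name `missing_files` is hypothetical and not part of the original notebook), the files still to be fetched could be listed like this:

```python
import os

def missing_files(directory, filenames):
    """Return the names in `filenames` that do not yet exist in `directory`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(directory, name))]

# Hypothetical usage with the names defined earlier in this notebook:
# for name in missing_files(path_raw_data, [data_file, names_file]):
#     wget.download(url + '/' + name, path_raw_data)
```

With this shape, each file is downloaded only when it is actually absent, rather than all or nothing.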


***

### 2. Load data using Pandas and investigate datatypes

Now that we have successfully downloaded the data, let's investigate the datatypes in the dataset.

In [58]:
# The file is comma-separated and has no header row
ccard_df = pd.read_csv(os.path.join(path_raw_data, data_file), header=None)
print(ccard_df.head())

  0      1      2  3  4  5  6     7  8  9   10 11 12     13   14 15
0  b  30.83  0.000  u  g  w  v  1.25  t  t   1  f  g  00202    0  +
1  a  58.67  4.460  u  g  q  h  3.04  t  t   6  f  g  00043  560  +
2  a  24.50  0.500  u  g  q  h  1.50  t  f   0  f  g  00280  824  +
3  b  27.83  1.540  u  g  w  v  3.75  t  t   5  t  g  00100    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  f  s  00120    0  +


It looks like, because of the proprietary nature of credit card information, all the header details have been omitted from the dataset. With a little effort, though, this information can be recovered: a quick Google search turned up all the header information I needed about the dataset on this [website](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html).

### Column data information and the datatypes they need to be:
- Col:0 - **Gender** - Categorical
- Col:1 - **Age**    - Numerical
- Col:2 - **Debt**   - Numerical
- Col:3 - **Marital status** - Categorical
- Col:4 - **Bank Customer**  - Categorical
- Col:5 - **Education level** - Categorical
- Col:6 - **Ethnicity** - Categorical
- Col:7 - **YearsEmployed** - Numerical
- Col:8 - **PriorDefault** - Categorical (binary: t/f)
- Col:9 - **Employed** - Categorical (binary: t/f)
- Col:10 - **Credit Score** - Numerical 
- Col:11 - **Drivers License** - Categorical
- Col:12 - **Citizenship** - Categorical
- Col:13 - **Zipcode** - Categorical
- Col:14 - **Income** - Numerical
- Col:15 - **Approved** - Categorical
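
The names in the list above can be attached at load time. The sketch below is an illustration only: the two rows are reconstructed from the `head()` output shown earlier, and the exact identifier spellings (for example `MaritalStatus`) are my own assumption. Passing `dtype={'Zipcode': str}` keeps leading zeros such as `00202`, which would otherwise be parsed as the integer 202 in a sample that has no non-numeric entries.

```python
import io
import pandas as pd

# Column names taken from the list above; spellings are an assumption
column_names = [
    'Gender', 'Age', 'Debt', 'MaritalStatus', 'BankCustomer',
    'EducationLevel', 'Ethnicity', 'YearsEmployed', 'PriorDefault',
    'Employed', 'CreditScore', 'DriversLicense', 'Citizenship',
    'Zipcode', 'Income', 'Approved',
]

# Two rows reconstructed from the head() output, standing in for crx.data
sample = io.StringIO(
    "b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,00202,0,+\n"
    "a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,00043,560,+\n"
)

# names= labels the columns; dtype= keeps Zipcode as text
named_df = pd.read_csv(sample, header=None, names=column_names,
                       dtype={'Zipcode': str})
print(named_df[['Gender', 'Age', 'Zipcode', 'Approved']])
```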

The data types we have are:

In [63]:
ccard_df.dtypes

0      object
1      object
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13     object
14      int64
15     object
dtype: object

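Most of the inferred data types agree with the list above. The clear mismatch is Col:1 - Age, which should be numerical but is read as `object`, a sign that it contains non-numeric entries. Let's look into the issues we may face when tidying the data in the 1_tidy_data.ipynb notebook.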

In [71]:
ccard_df.shape

(690, 16)

In [70]:
ccard_df.describe(include='all')

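| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 690 | 690 | 690.0 | 690 | 690 | 690 | 690 | 690.0 | 690 | 690 | 690.0 | 690 | 690 | 690.0 | 690.0 | 690 |
| unique | 3 | 350 | NaN | 4 | 4 | 15 | 10 | NaN | 2 | 2 | NaN | 2 | 3 | 171.0 | NaN | 2 |
| top | b | ? | NaN | u | g | c | v | NaN | t | f | NaN | f | g | 0.0 | NaN | - |
| freq | 468 | 12 | NaN | 519 | 519 | 137 | 399 | NaN | 361 | 395 | NaN | 374 | 625 | 132.0 | NaN | 383 |
| mean | NaN | NaN | 4.758725 | NaN | NaN | NaN | NaN | 2.223406 | NaN | NaN | 2.4 | NaN | NaN | NaN | 1017.385507 | NaN |
| std | NaN | NaN | 4.978163 | NaN | NaN | NaN | NaN | 3.346513 | NaN | NaN | 4.86294 | NaN | NaN | NaN | 5210.102598 | NaN |
| min | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN |
| 25% | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 0.165 | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN |
| 50% | NaN | NaN | 2.75 | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | 5.0 | NaN |
| 75% | NaN | NaN | 7.2075 | NaN | NaN | NaN | NaN | 2.625 | NaN | NaN | 3.0 | NaN | NaN | NaN | 395.5 | NaN |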
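
The summary shows '?' as the most frequent value in Col:1, which explains why pandas did not read Age as a number. The sketch below uses a small made-up column (not the real data) to show two common ways of handling such a placeholder: coercing after loading with `pd.to_numeric`, or declaring the marker at read time with `na_values`.

```python
import io
import pandas as pd

# Made-up single-column sample; '?' marks a missing age
sample = io.StringIO("30.83\n?\n24.50\n")

# Read as-is: the '?' prevents the column from being parsed as numeric
raw = pd.read_csv(sample, header=None)

# Option 1: coerce after loading; entries that cannot be parsed become NaN
age = pd.to_numeric(raw[0], errors='coerce')

# Option 2: tell the reader which marker means "missing"
sample.seek(0)
clean = pd.read_csv(sample, header=None, na_values='?')

print(age.dtype, clean[0].dtype)
```

Either option yields a floating-point column with one missing value, which can then be imputed.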


***

### List of tasks to tidy the data
1. Replace '?' in **Col:0** with 'b', the most frequent category.
2. Replace '?' in **Col:1** with the mean age of the customers in the dataset.
3. Replace '?' in **Col:3** with the most frequent category in that column.
4. Replace '?' in **Col:4** with the most frequent category in that column.
5. Replace '?' in **Col:5** with the most frequent category in that column.
6. Replace '?' in **Col:6** with the most frequent category in that column.
7. Replace '?' in **Col:13** with the most frequent category in that column.
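
The tasks above boil down to two imputation rules: the column mean for numerical columns and the most frequent category for categorical ones. Here is a minimal sketch on a made-up toy frame (not the real data, and not necessarily how the tidy notebook does it); the column labels 0, 1 and 3 simply mirror the positions used in the list.

```python
import pandas as pd

# Made-up toy frame; '?' marks missing values as in crx.data
toy = pd.DataFrame({
    0: ['b', 'a', '?', 'b'],          # categorical
    1: ['30.0', '?', '20.0', '40.0'], # numerical, read as text
    3: ['u', 'u', 'y', '?'],          # categorical
})

tidy = toy.copy()

# Numerical column: convert to float, then fill gaps with the mean
age = pd.to_numeric(toy[1], errors='coerce')
tidy[1] = age.fillna(age.mean())

# Categorical columns: blank out '?', then fill with the most frequent value
for col in [0, 3]:
    values = toy[col].mask(toy[col] == '?')
    tidy[col] = values.fillna(values.mode()[0])

print(tidy)
```

Note that `mode()` returns every value tied for most frequent in sorted order, so `[0]` picks the first of them when there is a tie.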