# HW3: Exploratory Analysis
For this homework, we will apply the concepts from today's class to the [Credit Approval dataset](http://archive.ics.uci.edu/ml/datasets/Credit+Approval) from the UCI Machine Learning Repository.

This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. 

This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.

I've created a list of arbitrary column names below, with the assumption that A16 represents whether or not the credit application was approved. 

## 1. Load the dataset
Download the [dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data) to your computer and update the file location in the pd.read_csv() function below: 

In [45]:
import pandas as pd
import numpy as np

# read in the data
names = [("A" + str(x+1)) for x in range(0,16)] # generate list of column names
df = (pd.read_csv('/Users/summerrae/Downloads/crx.data', 
                 header=None,
                 names=names)
        .replace({'?': np.nan}))  # replace the question marks that mark missing values with np.nan
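
One thing to watch for: when `?` is replaced after loading, any column that contained a `?` has probably already been read as strings, so it may still need `pd.to_numeric()`. A minimal sketch of an alternative is to pass `na_values="?"` to `pd.read_csv()` so the conversion happens during parsing. The two sample rows and the names `sample` and `sample_df` below are only for illustration; they follow the layout of crx.data but are not a substitute for the real file.

```python
import io
import pandas as pd

# Two illustrative rows in the same 16-column layout as crx.data
sample = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,?,3.5,u,g,d,v,3.0,t,f,0,t,g,?,0,-\n"
)
names = [f"A{i}" for i in range(1, 17)]  # A1 ... A16

# na_values tells the parser to treat '?' as missing,
# so columns such as A2 and A14 are parsed as floats
sample_df = pd.read_csv(sample, header=None, names=names, na_values="?")

print(sample_df[["A2", "A14"]].dtypes)
print(sample_df.isnull().sum().sum())
```

Checking `df.dtypes` on the full dataset is a quick way to see whether any numeric column was read as `object`.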

In [46]:
df.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


## 2. Find and replace null values
The easiest way to check each column for null values is to use the following statement: 

In [47]:
df.isnull().sum()

A1     12
A2     12
A3      0
A4      6
A5      6
A6      9
A7      9
A8      0
A9      0
A10     0
A11     0
A12     0
A13     0
A14    13
A15     0
A16     0
dtype: int64

We can use the following syntax to pull out rows that contain any null values: 

In [48]:
df[df.isnull().any(axis=1)]

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
71,b,34.83,4.0,u,g,d,bb,12.5,t,f,0,t,g,,0,-
83,a,,3.5,u,g,d,v,3.0,t,f,0,t,g,300.0,0,-
86,b,,0.375,u,g,d,v,0.875,t,f,0,t,s,928.0,0,-
92,b,,5.0,y,p,aa,v,8.5,t,f,0,f,g,0.0,0,-
97,b,,0.5,u,g,c,bb,0.835,t,f,0,t,s,320.0,0,-
202,b,24.83,2.75,u,g,c,v,2.25,t,t,6,f,g,,600,+
206,a,71.58,0.0,,,,,0.0,f,f,0,f,p,,0,+
243,a,18.75,7.5,u,g,q,v,2.71,t,t,5,f,g,,26726,+
248,,24.5,12.75,u,g,c,bb,4.75,t,t,2,f,g,73.0,444,+
254,b,,0.625,u,g,k,v,0.25,f,f,0,f,g,380.0,2010,-


Examine the rows and remove the null values from each column in an appropriate way. 

For example, let's look at the unique values in A4. There appear to be three categorical values ('u', 'y', and 'l') in addition to the nulls.

In [52]:
df.A4.unique()

array(['u', 'y', nan, 'l'], dtype=object)

The simplest way to replace null values in categorical data is to fillna() with the column's most frequent value, which keeps the most common category the same and changes the column's frequency distribution only slightly when few values are missing:

In [53]:
df.A4.value_counts()

u    519
y    163
l      2
Name: A4, dtype: int64

In [54]:
df["A4"] = df.A4.fillna("u")
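
The same idea can be applied column by column. The sketch below is one possible approach, not the required answer: it fills non-numeric columns with their most frequent value and numeric columns with their median. The `toy` DataFrame and its column names are made up for illustration, and the median is an assumption; another statistic may suit a given column better.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one categorical and one numeric column
toy = pd.DataFrame({
    "cat": ["u", "u", "y", np.nan],
    "num": [1.0, 2.0, np.nan, 9.0],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # numeric column: the median is robust to outliers
        fill_value = toy[col].median()
    else:
        # categorical column: use the most frequent value
        fill_value = toy[col].mode().iloc[0]
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

Before filling, it is worth looking at the affected rows, since a row with many missing values may be better dropped than filled.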

In [None]:
# replace nulls from each column

## 3. One-hot encode the A16 column
A16 contains the class labels, which are the target for the dataframe. Because there are only two classes, the encoding reduces to a single binary column, so that:
- "-" = 0
- "+" = 1

This allows us to use the labels as the target in a regression or another machine learning model.
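
As a minimal sketch of this mapping, assuming the labels are the strings "+" and "-", `Series.map()` produces the single 0/1 column directly, while `pd.get_dummies()` produces a full one-hot encoding with one column per label. The `labels` Series below is made up for illustration.

```python
import pandas as pd

# Made-up labels standing in for the A16 column
labels = pd.Series(["+", "-", "+", "-"], name="A16")

# Binary encoding: one column, "-" becomes 0 and "+" becomes 1
encoded = labels.map({"-": 0, "+": 1})

# Full one-hot encoding: one indicator column per label
dummies = pd.get_dummies(labels)

print(encoded.tolist())
print(dummies)
```

With only two classes, the two one-hot columns carry the same information, so the single binary column is usually enough.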

## 4. Standardize the Dataset
We can see that several of the columns are on different scales. 

In [None]:
# create a density plot to view the distribution of each numeric column

In [None]:
# standardize each numeric column using either z-score or min-max scaling, according to the column's distribution
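
For reference, both scalings can be written with plain pandas arithmetic. This is a sketch on a made-up `toy` DataFrame (the values are illustrative, not taken from the dataset). Z-score scaling subtracts the mean and divides by the standard deviation; min-max scaling maps each column onto the range 0 to 1.

```python
import pandas as pd

# Made-up numeric columns on very different scales
toy = pd.DataFrame({
    "A2": [20.0, 30.0, 40.0],
    "A15": [0.0, 500.0, 1000.0],
})

# Z-score: mean 0 and standard deviation 1 per column
# (ddof=0 uses the population standard deviation)
zscore = (toy - toy.mean()) / toy.std(ddof=0)

# Min-max: smallest value becomes 0, largest becomes 1
minmax = (toy - toy.min()) / (toy.max() - toy.min())

print(zscore)
print(minmax)
```

Z-score scaling is a common choice for roughly bell-shaped columns, while min-max scaling is often used when a bounded range matters. Min-max is sensitive to outliers, so it is worth checking the density plots first.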

## 5. Apply these methods to your own dataset, as appropriate
Where appropriate, apply each of these methods on your own dataset and plot the results. 

## 6. For two points, let us know what you would like feedback on, or ask any questions you have!