<div style="background: red; color: white; padding: 20px;">
    <h1>Example project</h1>
Use the red boxes as a guide to the main sections of your project.
<br><br>
This project should be divided into the following sections, which you have to complete:
    <b>
    <ol>
        <li>Project presentation</li>
        <li>Data exploration and cleaning</li>
        <li>Data visualization</li>
        <li>Feature engineering</li>
        <li>Predictive modeling</li>
        <li>Present results</li>
    </ol>
    </b>
</div>

---
---

<div style="background: red; color: white; padding: 20px;">
    <h1>1. Project presentation</h1>
    <ul>
        <li>1.1. Project objectives</li>
        <li>1.2. Form hypotheses about your defined problem and visually analyze the data</li>
        <li>1.3. Dataset info, source of the data, columns explanation</li>
    </ul>
</div>

# 1. Project presentation
### Credit card applications

<img src="img/creditcard.png"
    style="width:250px; float: right; margin: 0 40px 40px 40px;"></img>

In this project you will create a model to predict whether a credit card application should be approved or not.


---
## 1.1. Project objectives

- Practice classification models
- Practice spot-checking algorithms

---
## 1.2. Form hypotheses about your defined problem and visually analyze the data

TO DO

---
## 1.3. Dataset info

To train the model, you will use the [Credit Card Approval dataset](https://archive.ics.uci.edu/ml/datasets/credit+approval) from the UCI Machine Learning Repository.

This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.

Here are the possible values for each variable:
- A1: b, a.
- A2: continuous.
- A3: continuous.
- A4: u, y, l, t.
- A5: g, p, gg.
- A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
- A7: v, h, bb, j, n, z, dd, ff, o.
- A8: continuous.
- A9: t, f.
- A10: t, f.
- A11: continuous.
- A12: t, f.
- A13: g, p, s.
- A14: continuous.
- A15: continuous.
- A16: +,- (class attribute)

### Hands on! 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

---
---

<div style="background: red; color: white; padding: 20px;">
    <h1>2. Data exploration and cleaning</h1>
    <ul>
        <li>2.1. Gather data</li>
        <li>2.2. Fix inconsistencies and handle missing values</li>
        <li>2.3. Drop unused columns</li>
    </ul>
</div>

# 2. Data exploration and cleaning


---
## 2.1. Gather data

Load the `data/credit_approval.csv` file and store it in the `applications_df` DataFrame.

Invalid observations have already been removed from this file, and the dataset is balanced.

In [None]:
applications_df = pd.read_csv('data/credit_approval.csv', header=None)

applications_df.head()

> According to this [blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) the probable feature names could be `Gender`, `Age`, `Debt`, `Married`, `BankCustomer`, `EducationLevel`, `Ethnicity`, `YearsEmployed`, `PriorDefault`, `Employed`, `CreditScore`, `DriversLicense`, `Citizen`, `ZipCode`, `Income` and `ApprovalStatus`.

In [None]:
cols = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel',
        'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore',
        'DriversLicence', 'Citizen', 'ZipCode', 'Income', 'ApprovalStatus']

applications_df.columns = cols

#### Show the shape of the resulting `applications_df`.

In [None]:
applications_df.shape

#### Data exploration

Let's first see a quick summary of the DataFrame and some descriptive statistics of the data.

In [None]:
print(applications_df.info())

applications_df.describe()

> The dataset contains both numeric and non-numeric data.

---
## 2.2. Fix inconsistencies and handle missing values

### Detecting missing values

Check each column for missing values.

In [None]:
applications_df.isna().sum()

### Detecting incorrect values

Although `isna()` reports no missing values, some of the values are probably incorrect.

Let's check the unique values per column:

In [None]:
for col in applications_df.columns:
    print(applications_df[col].unique())

### Labeled missing values

Many missing values are labeled with a `?` character.

Let's replace these question marks with `NaN` values.

In [None]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
applications_df.replace('?', np.nan, inplace=True)
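
As a small illustration, the sketch below shows why `isna()` reports nothing until the `?` placeholders are converted. The toy frame and its values are made up and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) that mimics the '?' placeholders.
toy_df = pd.DataFrame({
    'Age': ['30.83', '?', '24.50'],
    'Gender': ['b', 'a', '?'],
})

# '?' is an ordinary string, so pandas does not count it as missing.
missing_before = toy_df.isna().sum().sum()

# After the replacement the placeholders become real NaN values.
toy_clean = toy_df.replace('?', np.nan)
missing_after = toy_clean.isna().sum()

print(missing_before)
print(missing_after)
```

Note that the toy `Age` column still holds strings after the replacement, which is why an explicit conversion to `float` is needed afterwards.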

### Wrong column type

`Age` column should be of type `float`, fix it.

In [None]:
applications_df = applications_df.astype({'Age': 'float'})

### Handling missing values

If we simply dropped the rows with missing values now, the model could lose information that is useful for training. In addition, many models cannot handle missing values on their own.

To avoid this problem, we are going to **impute the missing values with a mean imputation** strategy.

In [None]:
# numeric_only=True skips the non-numeric columns, which have no mean
applications_df.fillna(applications_df.mean(numeric_only=True), inplace=True)

But this mean imputation strategy only works on numeric data. So what about the non-numeric columns?

We are going to impute each non-numeric column with its own **most frequent value**.

In [None]:
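for col in applications_df.columns:
    if applications_df[col].dtypes == 'object':
        # Fill only this column, using its own most frequent value
        most_frequent = applications_df[col].value_counts().index[0]
        applications_df[col] = applications_df[col].fillna(most_frequent)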
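
To make the two strategies concrete, here is a minimal sketch on a made-up frame. The `toy_df` name and its values are assumptions for illustration only. Each column is filled with its own statistic: the mean for numeric columns and the most frequent value for the rest.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap per column.
toy_df = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Married': ['u', 'u', np.nan],
    'Citizen': ['g', np.nan, 'g'],
})

for col in toy_df.columns:
    if pd.api.types.is_numeric_dtype(toy_df[col]):
        # Numeric column: mean of the observed values.
        fill_value = toy_df[col].mean()
    else:
        # Non-numeric column: most frequent observed value.
        fill_value = toy_df[col].mode()[0]
    toy_df[col] = toy_df[col].fillna(fill_value)

print(toy_df)
```

Filling column by column matters: calling `fillna` on the whole DataFrame with a single value would write that value into every column that still has gaps.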

Finally, verify the number of `NaN`s again.

In [None]:
applications_df.isna().sum()

---
## 2.3. Drop unused columns

The `DriversLicence` and `ZipCode` columns (spelled here as they are named in `applications_df`) are less important than the other features for our goal of predicting whether to approve an application.

Let's remove them.

In [None]:
applications_df.drop(['DriversLicence', 'ZipCode'], axis=1, inplace=True)

---
---

<div style="background: red; color: white; padding: 20px;">
    <h1>3. Data visualization</h1>
    <ul>
        <li>3.1. Numeric variables analysis</li>
        <li>3.2. Non-numeric variables analysis</li>
        <li>3.3. Other charts to show relationships between columns</li>
    </ul>
</div>

# 3. Data visualization

---
## 3.1. Numeric variables analysis

Let's plot histograms for each numeric variable.

First define a `plot_hist` function that receives a column name as a parameter and plots a histogram of that column:

In [None]:
def plot_hist(col):
    applications_df.loc[:,col].plot(kind='hist', title=col)
    plt.show()

Now use the function above to show a histogram for each numeric column.

In [None]:
numeric_cols = ['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income']

for col in numeric_cols:
    plot_hist(col)

Now create a scatter matrix to see if there is any important relationship.


In [None]:
from pandas.plotting import scatter_matrix

ax = scatter_matrix(applications_df[['Age', 'Debt', 'YearsEmployed',
                                     'CreditScore', 'Income']],
                    figsize=(12,12))

Finally, create a correlation matrix for all the numeric variables.

In [None]:
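# numeric_only=True is required because the DataFrame still has non-numeric columns
corr_metrics = applications_df.corr(numeric_only=True)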

corr_metrics.style.background_gradient(cmap="bwr")

These numeric columns are not strongly correlated with each other.

The highest correlation is between `Age` and `YearsEmployed`: older applicants tend to have been employed for longer, which makes sense up to a point.
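
For reference, `DataFrame.corr` computes the Pearson correlation coefficient by default. The sketch below recomputes it by hand on made-up numbers (the toy values are assumptions, not taken from the dataset) to show what each cell of the matrix measures.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
toy_df = pd.DataFrame({
    'Age': [22.0, 35.0, 41.0, 58.0, 63.0],
    'YearsEmployed': [1.0, 4.0, 10.0, 12.0, 30.0],
})

# Value reported by pandas.
r_pandas = toy_df.corr().loc['Age', 'YearsEmployed']

# Same value from the definition: covariance of the centered columns
# divided by the product of their spreads.
age_c = toy_df['Age'] - toy_df['Age'].mean()
yrs_c = toy_df['YearsEmployed'] - toy_df['YearsEmployed'].mean()
r_manual = (age_c * yrs_c).sum() / np.sqrt((age_c ** 2).sum() * (yrs_c ** 2).sum())

print(r_pandas, r_manual)
```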

---
## 3.2. Non-numeric variables analysis

Let's plot bar plots for each non-numeric variable.

First define a `plot_bar` function that receives a column name as a parameter and plots a bar plot of that column:


In [None]:
def plot_bar(col):
    applications_df.loc[:,col].value_counts().plot(kind='bar', title=col)
    plt.show()

Now use the function above to show a bar plot for each non-numeric column.

In [None]:
non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
                    'Ethnicity', 'PriorDefault', 'Employed', 'Citizen',
                    'ApprovalStatus']

for col in non_numeric_cols:
    plot_bar(col)

---
## 3.3. Other charts to show relationships between columns

TO DO

---
---

<div style="background: red; color: white; padding: 20px;">
    <h1>4. Feature engineering</h1>
    <ul>
        <li>4.1. Select features you will use</li>
        <li>4.2. Parse variables to correct data type</li>
        <li>4.3. Scale/standardize variables</li>
        <li>4.4. Construct meaningful variables using the data you have</li>
    </ul>
</div>

# 4. Feature engineering

---
## 4.1. Select features you will use

**Create features $X$ and labels $y$**

Separate features and labels into different $X$ and $y$ variables.

In [None]:
X = applications_df.drop(['ApprovalStatus'], axis=1)
y = applications_df['ApprovalStatus']

---
## 4.2. Parse variables to correct data type

#### Convert non-numeric data into numeric

Let's use `OrdinalEncoder` to encode categorical features ($X$) into integer values.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
                    'Ethnicity', 'PriorDefault', 'Employed', 'Citizen']

enc = OrdinalEncoder().fit(X[non_numeric_cols])

new_values = enc.transform(X[non_numeric_cols])

X.loc[:, non_numeric_cols] = new_values

X.head()
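
To see what the encoder does, here is a minimal sketch on a made-up frame (`toy_X` and its values are assumptions). The categories of each column are sorted, and each value is replaced by its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical values).
toy_X = pd.DataFrame({
    'Gender': ['b', 'a', 'b'],
    'PriorDefault': ['t', 'f', 't'],
})

toy_enc = OrdinalEncoder().fit(toy_X)
encoded = toy_enc.transform(toy_X)

# One sorted array of categories per column; the index is the code.
print(toy_enc.categories_)
print(encoded)

# The mapping is reversible.
decoded = toy_enc.inverse_transform(encoded)
```

Keep in mind that the integer codes suggest an order that nominal categories do not really have. Distance-based models such as KNN will treat the codes as distances, so `OneHotEncoder` is a common alternative worth comparing.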

---
## 4.3. Scale/standardize variables

Let's use `StandardScaler` to rescale the features so that they'll have the properties of a standard normal distribution with $\mu=0$ and $\sigma=1$, where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean.


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)

X = scaler.transform(X)

X
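
The transformation applied to each column is $z = (x - \mu) / \sigma$. The sketch below checks this on a made-up array (`toy_X` is an assumption), using the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with two columns on very different scales.
toy_X = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaled = StandardScaler().fit_transform(toy_X)

# Same result from the formula; np.std defaults to the population
# standard deviation (ddof=0).
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates distance computations just because of its units.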

#### Target variable analysis

The `ApprovalStatus` is our target variable (label). It has two possible values:


In [None]:
y.values[0:100]

In [None]:
plot_bar('ApprovalStatus')

Let's use `LabelEncoder` to encode its values so that they contain only 0 and 1.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder().fit(y)

y = label_enc.transform(y)

y[0:100]
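
Which symbol gets which code depends on the sorted order of the classes. A minimal sketch on a made-up label list (`toy_y` is an assumption) shows how to check the mapping through the `classes_` attribute:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical values) using the same two symbols as the target.
toy_y = ['+', '-', '-', '+']

toy_label_enc = LabelEncoder().fit(toy_y)

# classes_ is sorted; the position of each class is its integer code.
classes = list(toy_label_enc.classes_)
codes = toy_label_enc.transform(toy_y).tolist()

print(classes)
print(codes)
```

Here `+` sorts before `-`, so `+` becomes 0 and `-` becomes 1.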

---
## 4.4. Construct meaningful variables using the data you have


In [None]:
# your code goes here


---
---

<div style="background: red; color: white; padding: 20px;">
    <h1>5. Predictive modeling</h1>
    <ul>
        <li>5.1. Train ML models</li>
        <li>5.2. Find best performing model</li>
        <li>5.3. Evaluate model</li>
        <li>5.4. Use them to make predictions</li>
    </ul>
</div>

# 5. Predictive modeling

---
## 5.1. Train ML models

Create a `get_cv_scores` function that receives a `model` parameter with a scikit-learn model and returns the CV scores of that model.

You should use a `StratifiedKFold` cross-validator with 5 splits and a `random_state` seed so that you always get the same partitions.

5 scores should be returned.


In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

def get_cv_scores(model):
    return cross_val_score(model, X, y,
                           cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=10))
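
To illustrate what the stratified splitter guarantees, here is a minimal sketch on made-up labels (`toy_X` and `toy_y` are assumptions). Every test fold keeps the class ratio of the full data, and a fixed `random_state` reproduces the same partitions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 30 samples of class 0 and 20 of class 1.
toy_y = np.array([0] * 30 + [1] * 20)
toy_X = np.zeros((len(toy_y), 1))  # features are irrelevant for splitting

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)

fold_sizes = []
fold_positive_share = []
for train_idx, test_idx in cv.split(toy_X, toy_y):
    fold_sizes.append(len(test_idx))
    # Share of class 1 in this test fold.
    fold_positive_share.append(toy_y[test_idx].mean())

# Splitting again with the same integer seed gives the same folds.
first = [test.tolist() for _, test in cv.split(toy_X, toy_y)]
second = [test.tolist() for _, test in cv.split(toy_X, toy_y)]

print(fold_sizes)
print(fold_positive_share)
```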

---
### Spot-check algorithms

Create each of the following models and call the `get_cv_scores` function using each model to get its CV scores.

Save the resulting scores in the `results_df` to compare them at the end.

In [None]:
results_df = pd.DataFrame()

#### K Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

results_df['KNN'] = get_cv_scores(model)

#### Support Vector Machines

In [None]:
from sklearn import svm

model = svm.SVC(gamma='auto',
                random_state=10)

results_df['SVM'] = get_cv_scores(model)

#### Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

results_df['Naive Bayes'] = get_cv_scores(model)

#### Gradient Boost Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=10)

results_df['GBC'] = get_cv_scores(model)

#### AdaBoost Classifier (Adaptive Boosting)

In [None]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state=10)

results_df['AdaBoost'] = get_cv_scores(model)

---
## 5.2. Find best performing model

Show a boxplot per algorithm using the data you saved in `results_df`.

Which one performs the best? And the worst?

In [None]:
results_df.boxplot(figsize=(14,6), grid=False)

Let's see if we can do better. We can select the best model and perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.


#### Cross validation

Train several `KNeighborsClassifier` models with different `k` values and calculate the accuracy of each model.

Keep using a `KNeighborsClassifier` estimator and a `StratifiedKFold` cross-validator with 5 splits.

Test the following `k` values:

In [None]:
def get_kneighbors_score(k):
    model = KNeighborsClassifier(n_neighbors=k)
    # For a classifier, cv=5 uses a StratifiedKFold with 5 splits
    scores = cross_val_score(model, X, y, cv=5)
    return scores.mean()

ACC_dev = []
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

for k in parameters:
    scores = get_kneighbors_score(k)
    ACC_dev.append(scores)

#### Getting the best parameters

In [None]:
# This is one possible solution
ACC_dev = pd.DataFrame(ACC_dev)
ACC_dev.rename(columns={0: 'Accuracy'}, inplace=True)
ACC_dev['parameters'] = parameters

ACC_dev.loc[ACC_dev['Accuracy'] == ACC_dev['Accuracy'].max()]
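
The same search can also be written with scikit-learn's `GridSearchCV`, which runs the loop and keeps track of the best parameter for you. This is only a sketch on synthetic data: `toy_X`, `toy_y` and the shortened list of `k` values are assumptions, not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (hypothetical stand-in for X and y).
toy_X, toy_y = make_classification(n_samples=200, n_features=5,
                                   random_state=10)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [1, 3, 5, 8, 10]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
    scoring='accuracy',
)
search.fit(toy_X, toy_y)

# Best k and its mean cross-validated accuracy.
best_k = search.best_params_['n_neighbors']
best_score = search.best_score_

print(best_k, best_score)
```

In the notebook you would pass `X`, `y` and the full `parameters` list instead of the toy values.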

---
## 5.3. Evaluate model

Create the final model with the tuned parameter.

In [None]:
model = KNeighborsClassifier(n_neighbors=8)

#### Get model CV predictions

Generate cross-validated estimates for each input data point.

Use a `StratifiedKFold` cross-validator with 5 splits and a random_state seed.


In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_predict

y_pred = cross_val_predict(model, X, y,
                           cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=10))

#### Classification report

Show a `classification_report` using the `y_pred` predictions.

Remember that our labels were encoded as follows:

| type  | code |
|-------|------|
|   +   |   0  |
|   -   |   1  |

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y, y_pred))

---
---

<div style="background: red; color: white; padding: 20px;">
    <h1>6. Present results</h1>
    <ul>
        <li>6.1. Communicate the findings using visualizations</li>
        <li>6.2. Final conclusions</li>
    </ul>
</div>

# 6. Present results

---
## 6.1. Communicate the findings using visualizations

TO DO

---
## 6.2. Final conclusions

TO DO

#### Confusion matrix

Show a `confusion_matrix` using the `y_pred` predictions.

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y, y_pred, labels=[0, 1])

> The first element of the first row of the confusion matrix denotes the **true positives**, meaning the number of positive instances (approved applications, encoded as `0`) that the model predicted correctly.

> The last element of the second row of the confusion matrix denotes the **true negatives**, meaning the number of negative instances (denied applications, encoded as `1`) that the model predicted correctly.
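
To make the layout concrete, here is a minimal sketch with made-up labels (`toy_true` and `toy_pred` are assumptions). Rows are the true classes and columns are the predicted classes, both in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 0 = approved (+), 1 = denied (-).
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])

approved_correct = cm[0, 0]    # approved, predicted approved
approved_missed = cm[0, 1]     # approved, predicted denied
denied_as_approved = cm[1, 0]  # denied, predicted approved
denied_correct = cm[1, 1]      # denied, predicted denied

# Precision and recall for the approved class, derived from the matrix.
precision_approved = approved_correct / (approved_correct + denied_as_approved)
recall_approved = approved_correct / (approved_correct + approved_missed)

print(cm)
```

These two ratios are the same quantities that `classification_report` lists in the row for class `0`.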


---