<a href="https://colab.research.google.com/github/ragavkumarv/xyz/blob/master/code_along1_loan_status.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised Learning


## Predicting values of a *target variable* given a set of *features*

* For example, predicting if a customer will buy a product *(target)* based on their location and last five purchases *(features)*.

### Regression

* Predicting the values of a continuous variable e.g., house price.

### Classification

* Predicting a categorical outcome, often a binary one e.g., customer churn.

# Data Dictionary

The data has the following fields:

|Column name | Description |
|------------|-------------|
| `loan_id`  | Unique loan id |
| `gender`   | Gender - `Male` / `Female` |
| `married`  | Marital status - `Yes` / `No` |
| `dependents` | Number of dependents |
| `education` | Education - `Graduate` / `Not Graduate` |
| `self_employed` | Self-employment status - `Yes` / `No` |
| `applicant_income` | Applicant's income |
| `coapplicant_income` | Coapplicant's income |
| `loan_amount` | Loan amount (thousands) |
| `loan_amount_term` | Term of loan (months) |
| `credit_history` | Credit history meets guidelines - `1` / `0` |
| `property_area` | Area of the property - `Urban` / `Semi Urban` / `Rural` |
| `loan_status` | Loan approval status (target) - `1` / `0` |

## Step 0: Get the place and things ready

In [None]:
# Import required libraries


In [None]:
# Read in the dataset


# Preview the data
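
The notebook does not name the data file, so the following is only a sketch of the import, read and preview steps. It reads a tiny in-memory CSV whose three rows are invented for illustration; the column names follow the data dictionary above. With the real dataset, the file's path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: three invented rows that follow
# the data dictionary. Replace the StringIO object with the dataset's path.
csv_text = """loan_id,gender,married,dependents,education,self_employed,applicant_income,coapplicant_income,loan_amount,loan_amount_term,credit_history,property_area,loan_status
LP001,Male,No,0,Graduate,No,5849,0,128,360,1,Urban,1
LP002,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,0
LP003,Female,No,0,Not Graduate,Yes,3000,0,66,360,1,Urban,1
"""

# Read in the dataset
loans = pd.read_csv(io.StringIO(csv_text))

# Preview the first rows
print(loans.head())
```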


# Exploratory Data Analysis

We can't just dive straight into machine learning!
We need to understand and format our data for modeling.
What are we looking for?



## Step 1: Cleanliness

* Are columns set to the correct data type?
* Do we have missing data?

In [None]:
# Remove the loan_id to avoid accidentally using it as a feature


In [None]:
# Counts and data types per column
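
One possible way to run the cleanliness checks with pandas is sketched below. The small DataFrame is invented (including its single missing `loan_amount`) so the block runs on its own; the name `loans` is an assumption, not the notebook's variable.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with one missing loan_amount
loans = pd.DataFrame({
    "loan_id": ["LP001", "LP002", "LP003", "LP004"],
    "applicant_income": [5849, 4583, 3000, 2583],
    "loan_amount": [128.0, np.nan, 66.0, 120.0],
    "loan_status": [1, 0, 1, 1],
})

# Remove the identifier so it cannot be used as a feature by accident
loans = loans.drop(columns="loan_id")

# Non-null counts and data types per column
loans.info()

# Number of missing values per column
missing = loans.isna().sum()
print(missing)
```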


## Step 2: Distributions

* Some models and statistical techniques work best when features are roughly normally distributed, so heavily skewed features may need transforming.
* Do we have outliers (extreme values)?

In [None]:
# Target frequency


In [None]:
# Class frequency by loan_status
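
A minimal sketch of these frequency checks, assuming a DataFrame called `loans` with invented values: `value_counts` gives the target balance, `pd.crosstab` counts a categorical column by `loan_status`, and `describe` gives a first look at skew and extreme values.

```python
import pandas as pd

# Hypothetical data; the last applicant_income is deliberately extreme
loans = pd.DataFrame({
    "property_area": ["Urban", "Rural", "Urban", "Semi Urban", "Rural", "Urban"],
    "applicant_income": [5849, 4583, 3000, 2583, 6000, 81000],
    "loan_status": [1, 0, 1, 1, 0, 1],
})

# Target frequency as proportions
target_freq = loans["loan_status"].value_counts(normalize=True)
print(target_freq)

# Class frequency by loan_status for one categorical column
area_by_status = pd.crosstab(loans["property_area"], loans["loan_status"])
print(area_by_status)

# Summary statistics help spot skew and outliers
print(loans["applicant_income"].describe())
```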


## Step 3: Relationships

* If a feature is strongly correlated with the target variable, it might be a good feature for predictions!

In [None]:
# Correlation between variables
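
As a sketch, pairwise correlations can be computed with `DataFrame.corr` on the numeric columns and then sorted by their correlation with the target. The values below are invented, and Pearson correlation (the pandas default) is an assumption about what the notebook intends.

```python
import pandas as pd

# Hypothetical numeric columns from the data dictionary
loans = pd.DataFrame({
    "applicant_income": [5849, 4583, 3000, 2583, 6000, 9560],
    "loan_amount": [128, 128, 66, 120, 141, 191],
    "credit_history": [1, 0, 1, 1, 1, 1],
    "loan_status": [1, 0, 1, 1, 0, 1],
})

# Pairwise Pearson correlation between numeric variables
corr = loans.select_dtypes(include="number").corr()

# Correlation of each feature with the target, strongest positive first
target_corr = corr["loan_status"].drop("loan_status").sort_values(ascending=False)
print(target_corr)
```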

## Skipping Step 4: Feature Engineering

* Using domain knowledge to select and extract predictive features from raw data
* Transforming predictive features into a format usable by the model


## Step 5: Modeling

In [None]:
# First model using loan_amount

# Split into training and test sets

# Previewing the training set

In [None]:
# Instantiate a logistic regression model

# Fit to the training data

# Predict test set values

# Check the model's first five predictions
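
The two modeling cells above could be filled in roughly as follows with scikit-learn. This is a sketch on invented data; the 25% test size, `random_state=42` and stratified split are assumptions rather than settings taken from the notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: loan_amount is the only feature
loans = pd.DataFrame({
    "loan_amount": [128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 109, 200],
    "loan_status": [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0],
})

# First model using loan_amount
X = loans[["loan_amount"]]
y = loans["loan_status"]

# Split into training and test sets (assumed 75/25, stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Preview the training set
print(X_train.head())

# Instantiate a logistic regression model and fit it to the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict test set values and check the first five predictions
y_pred = clf.predict(X_test)
print(y_pred[:5])
```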


## Step 6: Testing

# Classification Metrics

&nbsp;

## Accuracy
$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Observations}}$

&nbsp;

## Confusion Matrix

**True Positive (TP)** = number of observations correctly predicted as positive

**True Negative (TN)** = number of observations correctly predicted as negative

**False Positive (FP)** = number of observations incorrectly predicted as positive (actually negative)

**False Negative (FN)** = number of observations incorrectly predicted as negative (actually positive)

&nbsp;

|        | **Predicted: Negative** | **Predicted: Positive** |
|--------|---------------------|---------------------|
|**Actual: Negative** | True Negative | False Positive |
|**Actual: Positive** | False Negative | True Positive |

&nbsp;

### Confusion Matrix Metrics

$\text{Precision} = \frac{TP}{TP + FP}$

$\text{Recall} = \frac{TP}{TP + FN}$
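
To make the formulas concrete, here is a small worked example in plain Python. The eight labels are invented; the code counts the four confusion-matrix cells and then applies the accuracy, precision and recall definitions above.

```python
# Invented labels: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))

# Count each cell of the confusion matrix
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)               # 4 2 1 1
print(accuracy, precision, recall)  # 0.75 0.8 0.8
```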

In [None]:
# Accuracy

In [None]:
# Confusion matrix
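
With scikit-learn, both metrics are available in `sklearn.metrics`. The sketch below uses invented labels in place of the notebook's `y_test` and predictions. `confusion_matrix` puts actual classes in rows and predicted classes in columns, which matches the table layout above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels standing in for the test targets and model predictions
y_test = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

# Accuracy: share of predictions that match the actual labels
acc = accuracy_score(y_test, y_pred)
print(acc)

# Confusion matrix: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# For a binary problem the cells unpack in this order
tn, fp, fn, tp = cm.ravel()
```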


## Trying with Step 4: Feature Engineering

In [None]:
# Convert categorical features to binary

# Previewing the new DataFrame
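
One common way to convert categorical features to binary columns is `pd.get_dummies`. The sketch below assumes that approach (the notebook does not say which encoder it uses) and runs on invented rows. `drop_first=True` drops one level per category so the remaining columns are not redundant.

```python
import pandas as pd

# Hypothetical rows with three categorical columns
loans = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "married": ["No", "Yes", "Yes"],
    "property_area": ["Urban", "Rural", "Semi Urban"],
    "loan_amount": [128, 66, 120],
    "loan_status": [1, 0, 1],
})

# Convert categorical features to 0/1 indicator columns
loans_encoded = pd.get_dummies(loans, drop_first=True, dtype=int)

# Preview the new DataFrame
print(loans_encoded.head())
```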

In [None]:
# Resplit into features and targets

# Split into training and test sets
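
A sketch of the resplit, assuming the encoded data lives in a DataFrame named `loans_encoded` (a hypothetical name) in which every column except `loan_status` is a feature. The values, test size and `random_state` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical encoded data
loans_encoded = pd.DataFrame({
    "loan_amount": [128, 66, 120, 141, 267, 95, 158, 168],
    "credit_history": [1, 1, 1, 1, 0, 1, 0, 1],
    "married_Yes": [0, 1, 1, 0, 1, 1, 0, 1],
    "loan_status": [1, 1, 1, 1, 0, 1, 0, 1],
})

# Resplit into features and target
X = loans_encoded.drop(columns="loan_status")
y = loans_encoded["loan_status"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```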

## Step 5: Modeling again!

In [None]:
# Instantiate logistic regression model

# Fit to the training data

# Predict test set values

## Step 6: Testing again!

In [None]:
# Accuracy

In [None]:
# Confusion matrix


## Trying with Steps 4, 5 and 6 again: Feature Engineering, Modeling, Testing

In [None]:
# Finding the importance of features


In [None]:
# Illustrate feature importance
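
The notebook does not say which importance measure it uses. One option that fits a logistic regression model is to standardize the features and compare the absolute size of the fitted coefficients, as sketched below on invented data. This is an assumption, not necessarily the method intended here.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data
loans = pd.DataFrame({
    "applicant_income": [5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006, 12841],
    "loan_amount": [128, 128, 66, 120, 141, 267, 95, 158, 168, 349],
    "credit_history": [1, 1, 1, 1, 1, 0, 1, 0, 0, 1],
    "loan_status": [1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
})

X = loans.drop(columns="loan_status")
y = loans["loan_status"]

# Standardize so coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_scaled, y)

# Absolute coefficient per feature, largest first
importance = (
    pd.Series(clf.coef_[0], index=X.columns).abs().sort_values(ascending=False)
)
print(importance)

# With matplotlib installed, importance.plot.barh() draws this as a bar chart
```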



# Split into training and test sets


# Instantiate logistic regression model


# Fit to the training data


# Predict test set values


# Accuracy


# Confusion matrix


## Step 7: Experiment / Deployment

## How might we improve model performance?

* Further preprocessing:
	- Log transformations for skewed distributions.
	- Scale feature values.
	- Remove outliers e.g., high earners.
* Try a different model e.g., a decision tree.
* Gather more data.
	- Train new models on incorrect predictions (may need more data and/or a holdout set).
* Further feature engineering.
* Hyperparameter tuning.
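
Several of the ideas above can be combined in a single scikit-learn pipeline. The sketch below applies a log transformation and scaling to one skewed feature, then tunes the regularization strength `C` of a logistic regression with a cross-validated grid search. The data is synthetic, and the grid values and the choice of `C` as the tuned hyperparameter are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)

# Synthetic, right-skewed "income" feature and a label loosely tied to it
income = rng.lognormal(mean=8.5, sigma=0.6, size=200)
noise = rng.normal(scale=0.5, size=200)
y = (np.log(income) + noise > 8.5).astype(int)
X = income.reshape(-1, 1)

# Log-transform, scale, then fit the classifier
pipe = make_pipeline(
    FunctionTransformer(np.log1p),
    StandardScaler(),
    LogisticRegression(),
)

# Hyperparameter tuning: try three values of C with 5-fold cross-validation
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means each cross-validation fold fits the scaler on its own training portion only, which avoids leaking information from the validation fold.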