# Homework: Building a Classifier
## IEOR 135/290, Data-X: Applied Data Ventures
**Author:** Sudarshan Gopalakrishnan | UC Berkeley, B.S. EECS'21 (in collaboration with Ikhlaq Sidhu)

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import seaborn as sns
%matplotlib inline


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

import xgboost as xgb

# Objective
The purpose of this homework is to walk you through the data science cycle towards building a classical machine learning model.

In this homework, you will do the following:
* Explore the provided dataset
* Develop visualizations
* Formulate the hypothesis
* Clean and reformat the dataset to help you build a predictive model
* Develop features
* Build a predictive model

You are working with CalBank to build a classifier that will help them determine whether a loan profile is likely to be Fully Paid or Charged Off (i.e., written off on the assumption that it will not be paid back).

In [None]:
data = pd.read_csv("m160-hw-dataset.csv")
data.head()

In [None]:
data.columns

# Part 1: Data Cleaning
In this part, we will guide you through cleaning the dataset and preparing it for data exploration.

### **Question 1**: What are some potential problems with the above dataset? 

*Enter your answer here*

### **Question 2**: Each column has missing data. Discuss how you would deal with missing data in each column, implement the changes, and save the result in the variable 'cleaned_data'.

* **Loan ID:** *Enter your answer here*
* **Customer ID:** *Enter your answer here*
* **Loan Status:** *Enter your answer here*
* **Current Loan Amount:** *Enter your answer here*
* **Term:** *Enter your answer here*
* **Credit Score:** *Enter your answer here*
* **Annual Income:** *Enter your answer here*
* **Years in current job:** *Enter your answer here*
* **Home Ownership:** *Enter your answer here*
* **Purpose:** *Enter your answer here*
* **Monthly Debt:** *Enter your answer here*
* **Years of Credit History:** *Enter your answer here*
* **Number of Credit Problems:** *Enter your answer here*
* **Current Credit Balance:** *Enter your answer here*
* **Maximum Open Credit:** *Enter your answer here*
* **Bankruptcies:** *Enter your answer here*
* **Tax Liens:** *Enter your answer here*
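
As a starting point, here is a minimal sketch of three common ways to handle missing values, shown on a small made-up frame. The values (including the category names) are invented for illustration and are not taken from the homework dataset; only the column names come from the list above. Whether dropping, median filling, or mode filling is appropriate for a given column is a judgment you should justify in your answer.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; values are invented, not from the homework dataset.
toy = pd.DataFrame({
    "Loan Status": ["Fully Paid", "Charged Off", None, "Fully Paid"],
    "Credit Score": [710.0, np.nan, 650.0, 730.0],
    "Home Ownership": ["Rent", None, "Own Home", "Rent"],
})

# Rows without a label cannot be used for supervised training, so drop them.
toy_clean = toy.dropna(subset=["Loan Status"]).copy()

# Numeric column: fill gaps with the median, which is less sensitive to outliers than the mean.
toy_clean["Credit Score"] = toy_clean["Credit Score"].fillna(toy_clean["Credit Score"].median())

# Categorical column: fill gaps with the most frequent value.
toy_clean["Home Ownership"] = toy_clean["Home Ownership"].fillna(toy_clean["Home Ownership"].mode()[0])

print(toy_clean)
```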


In [None]:
# Do your data cleaning here

In [None]:
cleaned_data = data

In [None]:
cleaned_data.head()

In [None]:
assert len(cleaned_data.columns[cleaned_data.isnull().any()]) == 0, "cleaned_data still has NaN values"

# Part 2: Data Exploration

### **Question 3:** Create three interesting visualizations that will help you build a hypothesis.

Feel free to create cells to analyse the data and build your understanding. Use the questions below to help you get started:
* What are some numerical indicators that point towards a charged-off loan?
* How does the purpose of the loan impact the loan application? 
* How does the home ownership status impact the loan application?

Once you analyse the dataset, create your visualizations.
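
One way to approach the questions above is to compute the share of charged-off loans within each category before plotting it. The sketch below uses an invented toy frame (the values are assumptions, not the homework data) and only illustrates the groupby pattern.

```python
import pandas as pd

# Toy data for illustration only; values are invented, not from the homework dataset.
toy = pd.DataFrame({
    "Loan Status": ["Fully Paid", "Charged Off", "Fully Paid",
                    "Charged Off", "Fully Paid", "Fully Paid"],
    "Home Ownership": ["Rent", "Rent", "Own Home", "Rent", "Own Home", "Rent"],
})

# Boolean flag per loan, then the mean of that flag per category gives the charge-off rate.
charged_off = toy["Loan Status"].eq("Charged Off")
charge_off_rate = charged_off.groupby(toy["Home Ownership"]).mean()

print(charge_off_rate)
# In the notebook, charge_off_rate.plot(kind="bar") would draw the same numbers as a bar chart.
```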

In [None]:
# Write the code for your data analysis. Press Esc+B to create new cells below.

**Visualization 1**

In [None]:
# Write the code for your visualization here

**What insights does this visualization offer?**

*Explain the insights this visualization offers*

-------------------

**Visualization 2**

In [None]:
# Write the code for your visualization here

**What insights does this visualization offer?**

*Explain the insights this visualization offers*

------------------------

**Visualization 3**

In [None]:
# Write the code for your visualization here

**What insights does this visualization offer?**

*Explain the insights this visualization offers*

### **Question 4:** What is your hypothesis?

*Enter your hypothesis here*

### **Question 5:** Analyse the dataset and make an inference about your hypothesis.
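
As an illustration of the kind of check this question asks for, the sketch below compares a numeric column across the two loan outcomes. The toy values are invented, and the hypothesis it probes ("charged-off loans have lower credit scores") is only an example, not the expected answer. A gap between group medians is descriptive evidence, not a significance test.

```python
import pandas as pd

# Toy data for illustration only; values are invented, not from the homework dataset.
toy = pd.DataFrame({
    "Loan Status": ["Fully Paid", "Fully Paid", "Charged Off", "Charged Off", "Fully Paid"],
    "Credit Score": [740.0, 720.0, 650.0, 670.0, 730.0],
})

# Median credit score for each loan outcome.
median_by_status = toy.groupby("Loan Status")["Credit Score"].median()

# Difference between the two groups; a positive gap is consistent with the example hypothesis.
gap = median_by_status["Fully Paid"] - median_by_status["Charged Off"]

print(median_by_status)
print("Median gap:", gap)
```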

In [None]:
# Write the code for your analysis here

*Enter your conclusion here*

# Part 3: Preprocessing and Feature Engineering

Your objective is to build a classifier that predicts whether a loan will be fully paid or charged off, using some or all of the following pieces of data:
* Loan ID
* Customer ID
* Loan Status
* Current Loan Amount
* Term
* Credit Score
* Annual Income
* Years in current job
* Home Ownership
* Purpose
* Monthly Debt
* Years of Credit History
* Months since last delinquent
* Number of Open Accounts
* Number of Credit Problems
* Current Credit Balance
* Maximum Open Credit
* Bankruptcies
* Tax Liens

**Setting Up The Dataset For Model Building**

### **Question 6:** Which metrics do you think are good indicators of loan status? Create a new DataFrame preprocessed_data with all the columns that you determine to be good indicators of loan status.

In [None]:
# Write your code for Question 6 here

In [None]:
preprocessed_data = ...

### **Question 7:** Are the numerical features useful in their current form? For every column that you believe needs to be changed, do so and save your results in preprocessed2_data.
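
Two transformations that are often considered for numeric features are a log transform for heavily skewed columns and standardization to zero mean and unit variance. The sketch below applies both to an invented toy frame; which columns (if any) need them in the homework dataset is for you to decide.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only; values are invented, not from the homework dataset.
toy = pd.DataFrame({
    "Annual Income": [40_000.0, 55_000.0, 62_000.0, 1_200_000.0],
    "Credit Score": [650.0, 700.0, 720.0, 740.0],
})

# log1p compresses the long right tail created by the one very large income.
logged = toy.copy()
logged["Annual Income"] = np.log1p(logged["Annual Income"])

# Standardize every column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(logged),
    columns=logged.columns,
    index=logged.index,
)

print(scaled.round(2))
```

In a real pipeline the scaler is usually fitted on the training split only and then applied to the validation and test splits, so that no information from held-out data leaks into training.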

In [None]:
# Write your code for Question 7 here

In [None]:
preprocessed2_data = ...

### **Question 8:** Convert all your non-numerical features to a numerical form that you believe would be useful for your model. Save your results in preprocessed3_data.
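
A minimal sketch of two common encodings follows: an explicit mapping for two-valued columns and one-hot encoding for a multi-valued column. The category names below are assumptions made up for the example; check the actual values in your data (for instance with value_counts()) before writing a mapping.

```python
import pandas as pd

# Toy data for illustration only; category names are assumed, not read from the homework dataset.
toy = pd.DataFrame({
    "Loan Status": ["Fully Paid", "Charged Off", "Fully Paid"],
    "Term": ["Short Term", "Long Term", "Short Term"],
    "Purpose": ["Debt Consolidation", "Home Improvements", "Debt Consolidation"],
})

encoded = toy.copy()

# Two-valued columns: map each label to 0 or 1 explicitly so the meaning stays documented.
encoded["Loan Status"] = encoded["Loan Status"].map({"Fully Paid": 0, "Charged Off": 1})
encoded["Term"] = encoded["Term"].map({"Short Term": 0, "Long Term": 1})

# Multi-valued column with no natural order: one indicator column per category.
encoded = pd.get_dummies(encoded, columns=["Purpose"], dtype=int)

print(encoded)
```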

In [None]:
# Write your code for Question 8 here

In [None]:
preprocessed3_data = ...

# Part 4: Model Building

In [None]:
y = preprocessed3_data["Loan Status"]
X = preprocessed3_data.drop(columns=["Loan Status"])

assert len(X) == len(y), "The number of input rows and outputs do not match."

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.34, random_state=42)

### **Question 9:** Implement the following classification models and determine the best model for this use case.
1. Logistic Regression
2. K-Nearest Neighbors (KNN) 
3. Support Vector Machines (SVM)
4. Perceptron
5. XGBoost
6. Random Forest

Be sure to use:
* X_train, y_train to train the models in Questions 9a, 9b, ..., 9f
* X_validation, y_validation to validate the models in Questions 9a, 9b, ..., 9f
* X_test, y_test to test the best model in Question 10
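
The train/validation/test workflow described above can be sketched as follows. This example runs on synthetic data from make_classification rather than preprocessed3_data, uses separate variable names (X_demo, X_tr, and so on, all hypothetical) so it does not overwrite the notebook's splits, and covers only three of the six models with arbitrary hyperparameters. Treat it as a pattern, not a reference solution.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features and labels.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

# Same two-stage split as the cell above: hold out a test set, then carve out a validation set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, test_size=0.34, random_state=42)

# Hyperparameters here are arbitrary choices for the sketch.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the training split and record its accuracy on the validation split.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_scores[name] = model.score(X_val, y_val)

# Pick the model by validation accuracy, then touch the test split exactly once.
best_name = max(val_scores, key=val_scores.get)
test_accuracy = candidates[best_name].score(X_te, y_te)

print(val_scores)
print("Best on validation:", best_name, "| test accuracy:", test_accuracy)
```

Selecting on the validation split and reporting on the test split keeps the final accuracy an honest estimate, because the test data played no part in choosing the model.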

#### Question 9a: Logistic Regression

In [None]:
# Write your code here

logreg = ...

#### Question 9b: KNN

In [None]:
# Write your code here

knn = ...

#### Question 9c: SVM

In [None]:
# Write your code here

svm = ...

#### Question 9d: Perceptron

In [None]:
# Write your code here

perceptron = ...

#### Question 9e: XGBoost

In [None]:
# Write your code here

gradboost = ...

#### Question 9f: Random Forest

In [None]:
# Write your code here

random_forest = ...

### **Question 10:** Which model has the best performance?

In [None]:
best_model = ...

In [None]:
print("The accuracy of your best model is {}".format(best_model.score(X_test, y_test)))