# Lecture 11

Here we use the E-car dataset on car loan applications.

Run the following cell only if you are working on Google Colab:

In [None]:
# !! Run this on Google Colab only.
from google.colab import drive
drive.mount('/content/drive')

Import the required modules:

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns  # Seaborn is a plotting library built on top of matplotlib



Load the dataset:

In [None]:
DATA_FILEPATH = "drive/MyDrive/e_car_data.csv"
df = pd.read_csv(DATA_FILEPATH)

Show summary statistics:

In [None]:
df.describe()

Prepare data for multi-class logistic regression, with three features:

In [None]:
x1 = df[['tier', 'amount', 'prime']].values

x_scaled = preprocessing.scale(x1)

apr = df['apr'].values
accept = df['accept'].values
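To make the scaling step concrete, here is a minimal sketch of what `preprocessing.scale` does: each column is shifted to zero mean and divided by its standard deviation. The `toy` array and its values are made up for illustration and are not taken from the E-car dataset.

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix (made-up values); the columns play the role of
# tier, amount and prime.
toy = np.array([[1.0, 10000.0, 3.25],
                [2.0, 25000.0, 3.50],
                [3.0, 40000.0, 4.00],
                [4.0, 15000.0, 3.75]])

toy_scaled = preprocessing.scale(toy)

# Same result by hand: subtract the column mean, divide by the column std.
by_hand = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

Note that scaling the full dataset before splitting lets the test rows contribute to the means and standard deviations. Fitting a `StandardScaler` on the training set only is a common alternative.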

Place loans into 8 unordered classes (bins), depending on whether they were accepted (bins 0-3 for denied, 4-7 for accepted) and on their Annual Percentage Rate (APR, split into 4 ranges within each group: below 4, 4 to below 6, 6 to below 8, and 8 or above):

In [None]:
def loan_to_bin(accept_value, apr_value):
  """Return a bin from 0 to 7: 0-3 for denied loans, 4-7 for accepted ones."""
  offset = 0
  if accept_value:
    offset = 4

  if apr_value < 4:
    return offset

  if apr_value < 6:
    return 1 + offset

  if apr_value < 8:
    return 2 + offset

  return 3 + offset

Create new labels for the dataset:

In [None]:
y = []
for i in range(len(apr)):
    accept_value = accept[i]
    apr_value = apr[i]
    loan_bin = loan_to_bin(accept_value, apr_value)  # avoid shadowing the built-in bin()
    y.append(loan_bin)

# One-line alternative with a list comprehension (but less readable):
# y = [loan_to_bin(accept[i], apr[i]) for i in range(len(apr))]
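As a possible third alternative, the same binning can be sketched without a Python loop, using `np.digitize` with the thresholds 4, 6 and 8 from `loan_to_bin`. The arrays `accept_toy` and `apr_toy` are made-up stand-ins for `accept` and `apr`.

```python
import numpy as np

# Made-up values standing in for df['accept'] and df['apr'].
accept_toy = np.array([0, 0, 1, 1, 1, 0])
apr_toy = np.array([3.5, 4.0, 5.9, 6.0, 9.2, 8.0])

# np.digitize returns 0 for apr < 4, 1 for 4 <= apr < 6,
# 2 for 6 <= apr < 8, and 3 for apr >= 8.
apr_bin = np.digitize(apr_toy, bins=[4, 6, 8])

# Accepted loans are shifted by 4, as in loan_to_bin.
y_toy = apr_bin + 4 * (accept_toy != 0)

print(y_toy)  # [0 1 5 6 7 3]
```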

Split the data into train and test (90/10 split):


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.1)
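The split above is random, so the accuracy reported later changes from run to run. A minimal sketch of a reproducible variant follows, on synthetic data: `random_state` fixes the shuffle, and `stratify` keeps the class proportions similar in both sets, which may matter when some of the 8 bins are rare. The arrays `x_toy` and `y_toy` and the class sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 features, 4 imbalanced classes.
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(200, 3))
y_toy = np.repeat([0, 1, 2, 3], [120, 50, 20, 10])

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy,
    test_size=0.1,
    random_state=0,    # same split on every run
    stratify=y_toy)    # preserve class proportions in train and test

# Number of test rows per class.
print(np.bincount(y_te, minlength=4))
```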

Train the model. The argument `multi_class` is now redundant: when the outcome has three or more classes, scikit-learn fits a multinomial model by default, and recent versions show a warning when the argument is passed. I prefer to keep the argument despite the warning, because it makes explicit in the code that we want multinomial regression.

In [None]:
model = LogisticRegression(multi_class='multinomial').fit(x_train, y_train)
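To illustrate what "multinomial" means here, the sketch below checks on synthetic data that the class probabilities are the softmax of the linear scores. It assumes the default solver, which fits a multinomial model for three or more classes in recent scikit-learn versions. The `multi_class` argument is left out so that the snippet does not depend on it. The data from `make_classification` is a made-up stand-in for the loan data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 4-class problem with 3 features (not the E-car data).
x_toy, y_toy = make_classification(n_samples=300, n_features=3,
                                   n_informative=3, n_redundant=0,
                                   n_classes=4, n_clusters_per_class=1,
                                   random_state=0)

clf = LogisticRegression(max_iter=1000).fit(x_toy, y_toy)

# One linear score per class: shape (n_samples, n_classes).
scores = clf.decision_function(x_toy)

# Softmax, with the row maximum subtracted for numerical stability.
shifted = scores - scores.max(axis=1, keepdims=True)
softmax = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(np.allclose(softmax, clf.predict_proba(x_toy)))
```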

Predict the values in the test set, using the built-in function `.predict()`, and evaluate accuracy:

In [None]:
y_test_predict = model.predict(x_test)
accuracy = metrics.accuracy_score(y_test, y_test_predict)
print("Multi-class logistic regression Accuracy Score on Test Set:", accuracy)

Show the confusion matrix (predicted labels versus true labels), here computed on the training set:

In [None]:
predicted = model.predict(x_train)
mat = metrics.confusion_matrix(y_train, predicted)
sns.heatmap(mat.T,
            square = True,
            annot=True,
            fmt = "d",
            )
plt.xlabel("true label")
plt.ylabel("predicted label")
plt.show()
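When classes have very different sizes, raw counts can be hard to compare. The sketch below uses the `normalize="true"` option of `confusion_matrix`, which divides each row by the number of true examples in that class, so the diagonal holds the per-class recall. In scikit-learn's output the rows are true labels and the columns are predicted labels, which is why the heatmap above plots the transpose. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for a 3-class problem.
y_true_toy = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 1])

# Raw counts: rows are true labels, columns are predicted labels.
counts = metrics.confusion_matrix(y_true_toy, y_pred_toy)

# Each row divided by its total, so every row sums to 1.
rates = metrics.confusion_matrix(y_true_toy, y_pred_toy, normalize="true")

print(counts)
print(np.round(rates, 2))
```

The same call can be applied to `y_test` and `y_test_predict` to inspect the test set rather than the training set.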

# TODO: run the previous cells again, without scaling

Go back to the cell that prepares the data and remove or comment out the line that scales the features (then use the unscaled `x1` in the split). How does the multi-class logistic regression behave now? Why?

# Perceptron

Now let's fit a perceptron on the same dataset, using a single binary outcome: whether the loan was accepted or not.

Import the Perceptron model from scikit-learn:

In [None]:
from sklearn.linear_model import Perceptron

Define the outcome and the features:

In [None]:
y = df['accept'].values
columns = ['tier',
           'amount',
           'apr',
           'prime',
           'fico',
           'competition apr',
           'partner bin']
x2 = df[columns].values

Standardize the features, to help with convergence, and split the data into a training set and a test set:

In [None]:
x2_scaled = preprocessing.scale(x2)
x_train, x_test, y_train, y_test = train_test_split(x2_scaled, y, test_size=0.1)

Train a perceptron with a learning rate of 0.1 (`eta0`) and evaluate its accuracy on the test set:

In [None]:
pmodel = Perceptron(eta0=0.1)
pmodel.fit(x_train, y_train)
y_test_predict = pmodel.predict(x_test)
accuracy = metrics.accuracy_score(y_test, y_test_predict)
print("Perceptron Accuracy Score on Test Set: %.5f" % accuracy)
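For intuition, here is a minimal sketch of the textbook perceptron learning rule in NumPy. It is not scikit-learn's implementation, which differs in details such as shuffling and stopping criteria. The helper `train_perceptron` and the toy data are hypothetical and only illustrate the idea: the weights change only when a point is misclassified, and `eta` plays the role of `eta0`.

```python
import numpy as np

def train_perceptron(x, y, eta=0.1, epochs=20):
    """Textbook perceptron rule for labels in {0, 1} (illustrative helper)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            pred = int(xi @ w + b > 0)
            # (yi - pred) is 0 for a correct prediction, so the weights
            # move only on mistakes, by eta times the input.
            w += eta * (yi - pred) * xi
            b += eta * (yi - pred)
    return w, b

# Linearly separable toy data (made up).
x_toy = np.array([[2.0, 1.0], [1.5, 2.0], [3.0, 2.5],
                  [-1.0, -0.5], [-2.0, -1.5], [-0.5, -2.0]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

w, b = train_perceptron(x_toy, y_toy)
pred = (x_toy @ w + b > 0).astype(int)
print("training accuracy:", (pred == y_toy).mean())
```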

# Multi-layer perceptron

We'll use a model from scikit-learn, which already has all we need (e.g., a cross-entropy loss function). We define two hidden layers, with 64 and 32 neurons, and fit the model to the data. Notice that the accuracy has improved to around 80%.

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(64, 32),
                    activation="logistic",
                    max_iter=1000,
                    random_state=42)

mlp.fit(x_train, y_train)

# Make predictions on the test data
y_pred = mlp.predict(x_test)

# Calculate the accuracy of the model
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy of multi-layer perceptron on test data: %.5f" %
      accuracy)
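To see what the fitted network contains, the sketch below trains the same architecture on synthetic data with 7 features and inspects the weight matrices and the training loss. The data from `make_classification` is a made-up stand-in, so the numbers it prints say nothing about the loan data. The attributes `coefs_` and `loss_curve_` are available on the fitted `MLPClassifier` with the default solver.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary problem with 7 features (not the E-car data).
x_toy, y_toy = make_classification(n_samples=300, n_features=7,
                                   random_state=0)

mlp_toy = MLPClassifier(hidden_layer_sizes=(64, 32),
                        activation="logistic",
                        max_iter=1000,
                        random_state=42).fit(x_toy, y_toy)

# One weight matrix per layer: input -> 64 -> 32 -> 1 output unit.
shapes = [w.shape for w in mlp_toy.coefs_]
print(shapes)  # [(7, 64), (64, 32), (32, 1)]

# Cross-entropy loss recorded after each training iteration.
print("iterations:", len(mlp_toy.loss_curve_))
print("final training loss: %.5f" % mlp_toy.loss_curve_[-1])
```

On the notebook's own model, `mlp.coefs_` and `mlp.loss_curve_` can be inspected in the same way after `mlp.fit(x_train, y_train)`.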