Data Mining Assignment

This assignment represents 100% of the Data Mining module’s mark. It is composed of Part 1, which is worth 40 marks, and Part 2, which is worth 60 marks. You can work in a team of 2 students for this assignment. Each team chooses one student as team leader, who is in charge of coordinating the team’s work and of submitting the assignment from their VLE account on behalf of the whole team.

PART 1:

This task is based on the real Sonar data seen previously in class. Several objects, which can be rocks or metal cylinders, are scanned with sonar signals at different angles and under different conditions. 60 measurements are recorded for each object, one per column (one record per object); these are the predictors, called A1, A2, …, A60. The label associated with each record contains the letter "R" if the object is a rock and "M" if it is a metal cylinder; this is the outcome variable, called Class.

Two datasets are provided to you: a training dataset in the sonar_train.csv file, and a test dataset in the sonar_test.csv file.

a) You are required to write Python code implementing the simplest Nearest Neighbour algorithm (that is, using just 1 neighbour) with the Minkowski distance, both discussed in the lecture of week 1. Your code will read the power q appearing in the Minkowski distance, and will classify each record from the test dataset based on the training dataset. Remember, to classify a record from the test set you need to find its nearest neighbour in the training set (this is the training record which minimizes the distance to the test set record), and take the class of that nearest neighbour as the predicted class for the test set record. After classifying all the records in the test set, your code needs to calculate and display the accuracy, recall, precision, and F1 measure of the predictions on the test dataset, with respect to the class "M" (which is assumed to be the positive class). Run your code to produce results first for the Manhattan distance and then for the Euclidean distance, which are particular cases of the Minkowski distance (q=1 and q=2, see lecture week 1).
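The steps described in (a) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the function names (`minkowski_distance`, `predict_1nn`, `binary_metrics`) are hypothetical, and the tiny two-feature arrays are made up so the block runs without the Sonar files.

```python
import numpy as np


def minkowski_distance(a, b, q):
    """Minkowski distance of order q between two 1-D arrays."""
    return np.sum(np.abs(a - b) ** q) ** (1.0 / q)


def predict_1nn(X_train, y_train, X_test, q):
    """Label each test row with the class of its nearest training row."""
    predictions = []
    for row in X_test:
        distances = [minkowski_distance(row, train_row, q) for train_row in X_train]
        predictions.append(y_train[int(np.argmin(distances))])
    return np.array(predictions)


def binary_metrics(y_true, y_pred, positive="M"):
    """Accuracy, recall, precision and F1 with `positive` as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    accuracy = float(np.mean(y_pred == y_true))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Tiny made-up example (two features instead of 60), for illustration only.
X_tr = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
y_tr = np.array(["R", "R", "M"])
X_te = np.array([[0.2, 0.1], [3.5, 4.2]])
y_te = np.array(["R", "M"])

for q in (1, 2):  # q=1 is Manhattan, q=2 is Euclidean
    y_hat = predict_1nn(X_tr, y_tr, X_te, q)
    print(q, binary_metrics(y_te, y_hat))
```

With the real files, the arrays would presumably come from `pd.read_csv("sonar_train.csv")` and `pd.read_csv("sonar_test.csv")`, taking columns A1 to A60 as features and Class as the label; the exact column layout of the CSV files is an assumption here.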

b) Run your code for each integer value of the power q from 1 to 20, and display the accuracy, recall, precision, and F1 measure on the test set in a chart. Which value of q leads to the best accuracy on the test set?
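One possible way to organise the sweep over q in (b) is sketched below. The helper names (`predict_1nn`, `sweep_q`, `plot_scores`) and the small arrays are hypothetical; only accuracy is tracked here, and the other three measures could be collected the same way. The chart function imports matplotlib only when called.

```python
import numpy as np


def predict_1nn(X_train, y_train, X_test, q):
    """1-NN prediction using the Minkowski distance of order q."""
    # diffs has shape (n_test, n_train, n_features) via broadcasting.
    diffs = np.abs(X_test[:, None, :] - X_train[None, :, :])
    # The 1/q root is monotonic, so it can be skipped when only ranking distances.
    nearest = np.argmin(np.sum(diffs ** q, axis=2), axis=1)
    return y_train[nearest]


def sweep_q(X_train, y_train, X_test, y_test, q_values):
    """Return a dict mapping each q to the test accuracy obtained with it."""
    return {
        q: float(np.mean(predict_1nn(X_train, y_train, X_test, q) == y_test))
        for q in q_values
    }


def plot_scores(scores, metric_name="accuracy"):
    """Line chart of one metric against q."""
    import matplotlib.pyplot as plt  # imported lazily so the sweep runs without it

    qs = sorted(scores)
    plt.plot(qs, [scores[q] for q in qs], marker="o", label=metric_name)
    plt.xlabel("q (power in the Minkowski distance)")
    plt.ylabel(metric_name)
    plt.xticks(qs)
    plt.legend()
    plt.show()


# Made-up data in [0, 1], for illustration only.
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
y_tr = np.array(["R", "R", "M", "M"])
X_te = np.array([[0.02, 0.05], [0.97, 0.95]])
y_te = np.array(["R", "M"])

scores = sweep_q(X_tr, y_tr, X_te, y_te, range(1, 21))
best_q = max(scores, key=scores.get)  # ties resolve to the smallest q
print(scores, best_q)
```

Calling `plot_scores(scores)` in the notebook would draw the chart. Raising differences to the power 20 can underflow or overflow for data on a very different scale from [0, 1], so the unrooted shortcut is an assumption that suits small, bounded measurements.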

The code, comments, explanations and results will be provided in a Jupyter notebook called Part1.

Note that in this task you must not use a library implementation of the nearest neighbour algorithm; you are required to compute the distances, find the nearest neighbour, and so code this simple algorithm yourself.

PART 2:

This task is based on real credit risk data. The goal is to predict a response variable Y, which represents a credit card default payment (Yes = 1, No = 0), using the following 23 input variables:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6-X11: History of past payment. The past monthly payment records (from April to September, 2005) are tracked as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .; X23 = amount paid in April, 2005.

Two datasets are provided to you: a training dataset in the creditdefault_train.csv file, and a test dataset in the creditdefault_test.csv file.

Using Python and any relevant libraries, you are required to build the best predictive model by tuning models using cross validation on the training dataset with each of the following algorithms discussed in this module: k-Nearest Neighbours, Decision Trees, Random Forest, Bagging, AdaBoost, and SVM. Out of the models tuned with the above algorithms, select the best model, clearly justify your choice, and evaluate its performance on the test set.
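Tuning with cross validation on the training data could look like the sketch below, shown for k-Nearest Neighbours only. It assumes scikit-learn is available and uses a synthetic dataset from `make_classification` as a stand-in for the credit data; the grid values are arbitrary. Placing the scaler inside a `Pipeline` means each cross-validation fold is scaled using only its own training portion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data: 23 features, binary target.
X, y = make_classification(n_samples=300, n_features=23, random_state=0)

# Scaling and the classifier are tuned together as one estimator.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Step-prefixed names ("knn__...") route each parameter to the right step.
param_grid = {"knn__n_neighbors": [3, 5, 11]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Repeating the same pattern for each algorithm gives one `best_score_` per tuned model, and comparing these cross-validated scores is one defensible basis for choosing the model to evaluate on the test set.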

The code, comments and explanations will be provided in your Python Jupyter notebook called Part2, which should also include the results. Moreover, for each algorithm mentioned above, include 1 chart in the notebook illustrating how the accuracy of the models varies when you vary the values of one numeric hyperparameter only (of your choice).
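A chart of accuracy against a single numeric hyperparameter can be built from the cross-validation results of a grid search, as in this sketch. It assumes scikit-learn, uses synthetic data in place of the credit data, and picks `max_depth` of a decision tree purely as an example; `plot_accuracy` is a hypothetical helper that imports matplotlib only when called.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data: 23 features, binary target.
X, y = make_classification(n_samples=300, n_features=23, random_state=0)

depths = [1, 2, 4, 8, 16]
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": depths},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# Read the tried values and their mean cross-validated accuracy from cv_results_.
tried = [params["max_depth"] for params in search.cv_results_["params"]]
mean_acc = list(search.cv_results_["mean_test_score"])


def plot_accuracy(values, scores, name):
    """Line chart of mean cross-validated accuracy against one hyperparameter."""
    import matplotlib.pyplot as plt  # imported lazily so the search runs without it

    plt.plot(values, scores, marker="o")
    plt.xlabel(name)
    plt.ylabel("mean CV accuracy")
    plt.title("Accuracy vs {}".format(name))
    plt.show()


print(list(zip(tried, mean_acc)))
```

Calling `plot_accuracy(tried, mean_acc, "max_depth")` in the notebook would draw the chart for this one algorithm.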

Note regarding working in a team or individually, and what you need to submit:

    You can work and submit in a team of 2 students, in which case you should choose a team leader. As a team you should work on all the tasks. Include the names and student numbers of both team members at the top of each of the notebooks Part1 and Part2, and indicate who is the team leader. The team leader must perform the submission from their account (hence only once) for both students.

    Or you can work and submit alone for this coursework. In this case you must tackle only point (a) in Part 1, and only 3 out of the 6 algorithms mentioned in Part 2 (of your choice, but choose 3 only). Include your name and student number at the top of the notebooks Part1 and Part2, followed by the mention "I worked and submitted alone".

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the data
train_data = pd.read_csv("creditdefault_train.csv")
test_data = pd.read_csv("creditdefault_test.csv")

# Explore the data (print explicitly: a notebook cell only displays its last expression)
print(train_data.head())
print(train_data.shape)
train_data.info()

# Check for missing values
print(train_data.isnull().sum())

# Separate the data into features and labels
X = train_data.iloc[:,:-1].values
y = train_data.iloc[:,-1].values

# Hold out 20% of the training data as a validation set
# (the real test set, creditdefault_test.csv, is only used for the final evaluation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# k-Nearest Neighbours
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_pred)
print("k-Nearest Neighbours accuracy: {}".format(knn_accuracy))

# Decision Trees
dt_model = DecisionTreeClassifier(random_state=0)  # fixed seed for reproducible results
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)
print("Decision Trees accuracy: {}".format(dt_accuracy))

# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)  # fixed seed for reproducible results
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Random Forest accuracy: {}".format(rf_accuracy))

# Bagging
bg_model = BaggingClassifier(n_estimators=100, random_state=0)  # fixed seed for reproducible results
bg_model.fit(X_train, y_train)
bg_pred = bg_model.predict(X_test)
bg_accuracy = accuracy_score(y_test, bg_pred)
print("Bagging accuracy: {}".format(bg_accuracy))

# AdaBoost
ab_model = AdaBoostClassifier(n_estimators=100, random_state=0)  # fixed seed for reproducible results
ab_model.fit(X_train, y_train)
ab_pred = ab_model.predict(X_test)
ab_accuracy = accuracy_score(y_test, ab_pred)
print("AdaBoost accuracy: {}".format(ab_accuracy))

# SVM
svm_model = SVC(kernel='rbf',random_state=0,gamma=0.01,C=10.0)
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_pred)
print("SVM accuracy: {}".format(svm_accuracy))

# Tune the models using Grid Search
# k-Nearest Neighbours
knn_params = {'n_neighbors':[5,10,15,20,25,30]}
knn_gs = GridSearchCV(knn_model, knn_params, cv=5)
knn_gs.fit(X_train, y_train)
print("k-Nearest Neighbours best params: {} (CV accuracy: {})".format(knn_gs.best_params_, knn_gs.best_score_))

# Decision Trees
dt_params = {'criterion':('gini','entropy'), 'max_depth':[5,10,15,20,25,30]}
dt_gs = GridSearchCV(dt_model, dt_params, cv=5)
dt_gs.fit(X_train, y_train)
print("Decision Trees best params: {} (CV accuracy: {})".format(dt_gs.best_params_, dt_gs.best_score_))

# Random Forest
rf_params = {'n_estimators':[100,200,300,400,500]}
rf_gs = GridSearchCV(rf_model, rf_params, cv=5)
rf_gs.fit(X_train, y_train)
print("Random Forest best params: {} (CV accuracy: {})".format(rf_gs.best_params_, rf_gs.best_score_))

# Bagging
bg_params = {'n_estimators':[100,200,300,400,500]}
bg_gs = GridSearchCV(bg_model, bg_params, cv=5)
bg_gs.fit(X_train, y_train)
print("Bagging best params: {} (CV accuracy: {})".format(bg_gs.best_params_, bg_gs.best_score_))

# AdaBoost
ab_params = {'n_estimators':[100,200,300,400,500]}
ab_gs = GridSearchCV(ab_model, ab_params, cv=5)
ab_gs.fit(X_train, y_train)
print("AdaBoost best params: {} (CV accuracy: {})".format(ab_gs.best_params_, ab_gs.best_score_))

# SVM
svm_params = {'C':[1,10,100,1000],'gamma':[0.001,0.01,0.1,1]}
svm_gs = GridSearchCV(svm_model, svm_params, cv=5)
svm_gs.fit(X_train, y_train)
print("SVM best params: {} (CV accuracy: {})".format(svm_gs.best_params_, svm_gs.best_score_))

# Evaluate the model on the test set
# Transform the test set
X_test = scaler.transform(test_data.iloc[:,:-1].values)
y_test = test_data.iloc[:,-1].values

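# Make predictions with the tuned Random Forest rather than the untuned rf_model
# (GridSearchCV refits the best configuration on X_train by default)
rf_pred = rf_gs.best_estimator_.predict(X_test)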

# Evaluate the model
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Random Forest accuracy: {}".format(rf_accuracy))
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred))