# Data Mining Assignment

This assignment accounts for 100% of the Data Mining module's mark. It consists of Part 1, worth 40 marks, and Part 2, worth 60 marks. You may work in a team of two students. Each team chooses one member as team leader, who coordinates the team's work and submits the assignment through their VLE account on behalf of the whole team.

# PART 2:

This task is based on real credit risk data. The goal is to predict a response variable Y, which represents a credit card default payment (Yes = 1, No = 0), using the following 23 input variables:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6-X11: History of past payment. The past monthly payment records (April to September 2005) are tracked as follows: X6 = the repayment status in September 2005; X7 = the repayment status in August 2005; ...; X11 = the repayment status in April 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September 2005; X13 = amount of bill statement in August 2005; ...; X17 = amount of bill statement in April 2005.
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September 2005; X19 = amount paid in August 2005; ...; X23 = amount paid in April 2005.

Two datasets are provided to you: a training dataset in the creditdefault_train.csv file, and a test dataset in the creditdefault_test.csv file.
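Before modelling, it can help to confirm that the column used as the target really is the binary Y described above. The following is a minimal sketch, assuming Y is stored in the last column of each CSV file (the brief does not state the column layout); the helper name `split_features_target` and the tiny demo frame are hypothetical.

```python
import pandas as pd


def split_features_target(df, target_col=None):
    """Split a DataFrame into (X, y) and check that y is a 0/1 variable.

    Assumption: when target_col is None, the target is the LAST column.
    """
    if target_col is None:
        target_col = df.columns[-1]
    y = df[target_col]
    X = df.drop(columns=[target_col])
    # Y is described as binary: 1 = default, 0 = no default.
    unexpected = set(y.unique()) - {0, 1}
    if unexpected:
        raise ValueError(
            f"Column {target_col!r} is not binary; found {sorted(unexpected)}"
        )
    return X, y


# Tiny made-up frame standing in for creditdefault_train.csv.
demo = pd.DataFrame({"X1": [20000, 50000, 90000], "X5": [24, 37, 51], "Y": [1, 0, 0]})
X_demo, y_demo = split_features_target(demo)
print(X_demo.shape, y_demo.value_counts().to_dict())
```

With the real files, the same call would be applied to `pd.read_csv('creditdefault_train.csv')`; a `ValueError` here would suggest the target sits in a different column than assumed.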

Using Python and any relevant libraries, you are required to build the best predictive model by tuning models using cross-validation on the training dataset with each of the following algorithms discussed in this module: k-Nearest Neighbours, Decision Trees, Random Forest, Bagging, AdaBoost, and SVM. From the models tuned with these algorithms, select the best one, clearly justify your choice, and evaluate its performance on the test set.
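One way to follow this procedure is sketched below: each algorithm is tuned with `GridSearchCV` on the training data only, and the model with the highest mean cross-validated accuracy is selected. The search grids, the use of `StandardScaler` for kNN and SVM, the helper names `tune_models` and `select_best`, and the synthetic demo data are all assumptions made for illustration, not part of the brief.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative (assumed) grids: one numeric hyperparameter per algorithm.
CANDIDATES = {
    "kNN": (make_pipeline(StandardScaler(), KNeighborsClassifier()),
            {"kneighborsclassifier__n_neighbors": [3, 7, 11]}),
    "Decision Tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 6]}),
    "Random Forest": (RandomForestClassifier(random_state=0), {"n_estimators": [25, 50]}),
    "Bagging": (BaggingClassifier(random_state=0), {"n_estimators": [10, 20]}),
    "AdaBoost": (AdaBoostClassifier(random_state=0), {"n_estimators": [25, 50]}),
    "SVM": (make_pipeline(StandardScaler(), SVC()), {"svc__C": [0.1, 1, 10]}),
}


def tune_models(X, y, candidates=CANDIDATES, cv=5):
    """Run a cross-validated grid search per algorithm; return the fitted searches."""
    searches = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
        searches[name] = search.fit(X, y)
    return searches


def select_best(searches):
    """Name of the algorithm with the highest mean cross-validated accuracy."""
    return max(searches, key=lambda name: searches[name].best_score_)


# Synthetic stand-in for the training data, so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)
searches = tune_models(X_demo, y_demo, cv=3)
for name, search in searches.items():
    print(f"{name}: CV accuracy {search.best_score_:.3f} with {search.best_params_}")
best_name = select_best(searches)
print("Selected:", best_name)
```

With the real data, `X_train` and `y_train` would replace the synthetic arrays, and the test set would be touched only once at the end, for example via `searches[best_name].score(X_test, y_test)`. Keeping the test set out of the tuning loop is what makes that final score an honest estimate.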

Provide the code, comments and explanations in a Python Jupyter notebook called Part2, which should also include the results. In addition, for each algorithm mentioned above, include one chart in the notebook illustrating how the accuracy of the models varies as you vary the values of a single numeric hyperparameter of your choice.
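A chart of this kind could be produced with scikit-learn's `validation_curve`, which cross-validates an estimator over a range of values for one hyperparameter. The sketch below is one possible approach under stated assumptions: the helper name `plot_cv_accuracy`, the choice of `max_depth` for a decision tree, and the synthetic data are illustrative only.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier


def plot_cv_accuracy(estimator, X, y, param_name, param_range, cv=5, ax=None):
    """Plot mean training/validation accuracy against one hyperparameter."""
    values = list(param_range)
    train_scores, val_scores = validation_curve(
        estimator, X, y,
        param_name=param_name, param_range=values,
        cv=cv, scoring="accuracy",
    )
    mean_val = val_scores.mean(axis=1)  # average over the CV folds
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(values, train_scores.mean(axis=1), marker="o", label="Training folds")
    ax.plot(values, mean_val, marker="o", label="Validation folds")
    ax.set_xlabel(param_name)
    ax.set_ylabel("Accuracy")
    ax.set_title(f"{type(estimator).__name__}: accuracy vs. {param_name}")
    ax.legend()
    return ax, mean_val


# Synthetic stand-in for the training data, so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)
ax, mean_acc = plot_cv_accuracy(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=range(1, 6), cv=3,
)
```

In a notebook the figure renders inline; in a script, call `plt.show()` afterwards. The same helper could be reused for the other algorithms by changing the estimator and `param_name`, for example `n_neighbors` for `KNeighborsClassifier`.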

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load the training and test datasets
train_data = pd.read_csv('creditdefault_train.csv')
test_data = pd.read_csv('creditdefault_test.csv')

# Split the training data into features and target
X_train = train_data.iloc[:, :-1]
y_train = train_data.iloc[:, -1]

# Split the test data into features and target
X_test = test_data.iloc[:, :-1]
y_test = test_data.iloc[:, -1]

# Define a function to train and evaluate a KNN classifier
def train_and_evaluate_knn(k):
    # Create a KNN classifier with the specified number of neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    # Train the classifier on the training data
    knn.fit(X_train, y_train)
    # Make predictions on the test data
    y_pred = knn.predict(X_test)
    # Calculate the accuracy of the classifier
    acc = accuracy_score(y_test, y_pred)
    return acc

# Define a function to train and evaluate a decision tree classifier
def train_and_evaluate_tree(max_depth):
    # Create a decision tree classifier with the specified maximum depth
    tree = DecisionTreeClassifier(max_depth=max_depth)
    # Train the classifier on the training data
    tree.fit(X_train, y_train)
    # Make predictions on the test data
    y_pred = tree.predict(X_test)
    # Calculate the accuracy of the classifier
    acc = accuracy_score(y_test, y_pred)
    return acc

# Define a function to train and evaluate a random forest classifier
def train_and_evaluate_forest(n_estimators):
    # Create a random forest classifier with the specified number of estimators
    forest = RandomForestClassifier(n_estimators=n_estimators)
    # Train the classifier on the training data
    forest.fit(X_train, y_train)
    # Make predictions on the test data
    y_pred = forest.predict(X_test)
    # Calculate the accuracy of the classifier
    acc = accuracy_score(y_test, y_pred)
    return acc

# Define a list of hyperparameters to try for each algorithm
k_values = range(1, 21)
depth_values = range(1, 11)
n_estimator_values = range(1, 51, 5)

# Train and evaluate KNN classifiers with different values of k
knn_accuracies = []
for k in k_values:
    acc = train_and_evaluate_knn(k)
    knn_accuracies.append(acc)
    print(f"KNN classifier accuracy with {k} neighbors: {acc:.3f}")

# Train and evaluate decision tree classifiers with different values of max_depth
tree_accuracies = []
for max_depth in depth_values:
    acc = train_and_evaluate_tree(max_depth)
    tree_accuracies.append(acc)
    print(f"Decision tree classifier accuracy with max depth {max_depth}: {acc:.3f}")
    
# Train and evaluate random forest classifiers with different values of n_estimators
forest_accuracies = []
for n_estimators in n_estimator_values:
    acc = train_and_evaluate_forest(n_estimators)
    forest_accuracies.append(acc)
    print(f"Random forest classifier accuracy with {n_estimators} estimators: {acc:.3f}")

# Plot the accuracy of the KNN classifiers as a function of k
plt.plot(k_values, knn_accuracies)
plt.xlabel('Number of neighbors (k)')
plt.ylabel('Accuracy')
plt.title('KNN Classifier Accuracy vs. Number of Neighbors')
plt.show()

# Plot the accuracy of the decision tree classifiers as a function of max_depth
plt.plot(depth_values, tree_accuracies)
plt.xlabel('Maximum tree depth')
plt.ylabel('Accuracy')
plt.title('Decision Tree Classifier Accuracy vs. Maximum Tree Depth')
plt.show()

# Plot the accuracy of the random forest classifiers as a function of n_estimators
plt.plot(n_estimator_values, forest_accuracies)
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('Random Forest Classifier Accuracy vs. Number of Estimators')
plt.show()

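# Overlay the three accuracy curves on one chart for comparison
# (the shared x-axis mixes k, max_depth and n_estimators values)
plt.plot(k_values, knn_accuracies, label='KNN')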
plt.plot(depth_values, tree_accuracies, label='Decision Tree')
plt.plot(n_estimator_values, forest_accuracies, label='Random Forest')
plt.xlabel('Hyperparameter value')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

KNN classifier accuracy with 1 neighbors: 0.147
KNN classifier accuracy with 2 neighbors: 0.176
KNN classifier accuracy with 3 neighbors: 0.193
KNN classifier accuracy with 4 neighbors: 0.207
KNN classifier accuracy with 5 neighbors: 0.216
KNN classifier accuracy with 6 neighbors: 0.224
KNN classifier accuracy with 7 neighbors: 0.231
KNN classifier accuracy with 8 neighbors: 0.237
KNN classifier accuracy with 9 neighbors: 0.240
KNN classifier accuracy with 10 neighbors: 0.243
KNN classifier accuracy with 11 neighbors: 0.248
KNN classifier accuracy with 12 neighbors: 0.249
KNN classifier accuracy with 13 neighbors: 0.252
KNN classifier accuracy with 14 neighbors: 0.255
KNN classifier accuracy with 15 neighbors: 0.255
KNN classifier accuracy with 16 neighbors: 0.255
KNN classifier accuracy with 17 neighbors: 0.258
KNN classifier accuracy with 18 neighbors: 0.258
KNN classifier accuracy with 19 neighbors: 0.258
KNN classifier accuracy with 20 neighbors: 0.257
Decision tree classifier accu