# Introduction

The purpose of this project is to inspect and visualise data, formulate a model, and develop machine learning algorithms using data science tools in Python. The project is designed to test the following skills:


- Understanding data using basic statistical operations and Natural Language Processing tools
- Separating features and labels in a data set
- Normalising the data set
- Splitting the data into training and testing data
- Using an SVM model to perform multi-class classification
- Predicting on the testing data and communicating the analysis using Quadratic Weighted Kappa
- Providing a concise conclusion



# Structure of content

1. Read and Describe contents of file
2. Supervised Machine Learning
3. Feature, Label Separation
4. Train | Test Data Split
5. Binary vs Multi-class Model
6. Data Normalisation
7. Feature Column Selection
8. Support Vector Machine
9. Building a predictive model
10. Predicting from the model built
11. Confusion Matrix
12. QWK Score
13. Kaggle Submission Prediction

# Data Description and Exploration

The aim of this project is to utilize Python as a data science tool to investigate and visualize data, as well as build machine learning models. The project involves analyzing data and creating an intelligent system that classifies individuals into credit score ranges based on a combination of banking details and credit-related data gathered over time. Taking the role of the company's data scientist, I am responsible for analysing and preprocessing the data before building machine learning models with suitable algorithms to predict each customer's credit score. Automating this procedure lessens the manual labour needed to go through enormous volumes of data. At the same time, it offers a more precise and efficient way to determine credit scores.

### Importing necessary libraries

In [1]:
import csv
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif

# read the dataset CSV file into a pandas DataFrame object called 'dataset'
dataset = pd.read_csv('Credit_Scores_Dataset.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'Credit_Scores_Dataset.csv'

### Basic Descriptive Statistics of Data

The dataset contains the basic bank details and the credit-related information of the customers of a global finance company.

The examples below show some basic descriptive statistics of the dataset.

In [None]:
# The head() method in pandas returns the first rows of the DataFrame (5 by default). This gives a basic idea of the columns in the data set and the values in each column
dataset.head()

Unnamed: 0,ID,Month,Age,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,...,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
0,106981,8,41,2,14619.585,1005.29875,7,7,19,1,...,2,125.33,38.883189,222,1,11.620889,32.84625,202,279.724565,2
1,108774,1,28,12,70883.44,5663.953333,4,4,10,3,...,2,604.77,31.131854,356,0,97.133997,39.858686,201,526.033197,2
2,111896,3,29,12,14395.83,1027.6525,8,8,28,7,...,1,2841.0,37.587389,27,1,74.795382,31.947738,201,258.713002,1
3,32731,2,25,1,11189.065,1159.422083,6,3,15,3,...,2,761.18,33.980973,126,0,18.439801,16.806258,201,324.2841,2
4,128760,7,37,3,78956.73,6523.7275,7,3,14,2,...,2,436.82,27.684657,265,1,128.558654,70.788144,102,669.025667,2
5,151390,5,50,11,21167.555,1829.962917,2,7,9,3,...,3,1286.57,33.708627,361,0,40.335282,22.819491,202,277.370503,1
6,69108,7,43,14,44964.22,3898.018333,6,7,15,3,...,2,210.15,28.666651,400,0,75.40797,46.850154,201,306.094503,2
7,26275,7,50,13,140390.32,11888.19333,5,2,4,3,...,3,1423.23,32.966273,374,0,182.160424,133.213034,103,1020.195699,3
8,139554,1,29,6,54284.94,4673.745,7,8,30,7,...,1,4237.27,38.040801,59,1,190.641975,73.850532,101,392.593676,2
9,99702,1,18,3,19375.76,1633.646667,4,3,6,1,...,2,1053.72,31.459343,325,1,10.511619,26.628272,103,335.390534,2


In [None]:
dataset.shape

(2100, 24)

The shape attribute gives the number of rows and columns in the dataset. Here, the data set has 2100 rows and 24 columns.

In [None]:
dataset.dtypes

ID                            int64
Month                         int64
Age                           int64
Occupation                    int64
Annual_Income               float64
Monthly_Inhand_Salary       float64
Num_Bank_Accounts             int64
Num_Credit_Card               int64
Interest_Rate                 int64
Num_of_Loan                   int64
Delay_from_due_date           int64
Num_of_Delayed_Payment        int64
Changed_Credit_Limit        float64
Num_Credit_Inquiries          int64
Credit_Mix                    int64
Outstanding_Debt            float64
Credit_Utilization_Ratio    float64
Credit_History_Age            int64
Payment_of_Min_Amount         int64
Total_EMI_per_month         float64
Amount_invested_monthly     float64
Payment_Behaviour             int64
Monthly_Balance             float64
Credit_Score                  int64
dtype: object

The dtypes attribute shows the data type of each column in the data set, which tells us how the data can be manipulated. Here, all of the columns are numerical, holding either integer or float values.

In [None]:
dataset.describe()

From the above summary statistics, we can see that each column has a count of 2100 values, which is the same as the number of rows. This tells us that every column is fully populated and there are no NaN values in the data set.

The min and max values of Credit_Score are 1 and 3 respectively, so the data in this column lies between 1 and 3. Since the dtypes attribute above showed that Credit_Score has the int64 data type, the values in this column can only be 1, 2 or 3.

The median age is 33 and the mean age is 33.197143. Because the two are close, the age distribution is roughly centred on 33, which suggests that the customers are mostly working adults, who are likely to have substantial financial commitments.
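
The checks described above can also be run directly rather than read off the summary table. The sketch below is a minimal illustration on a toy DataFrame rebuilt from three columns of the first six rows shown by `head()`; it is not the full dataset, so the numbers it prints differ from the ones quoted in the text.

```python
import pandas as pd

# Toy stand-in built from the first six rows printed earlier (three columns only).
toy = pd.DataFrame({
    "Age": [41, 28, 29, 25, 37, 50],
    "Annual_Income": [14619.585, 70883.44, 14395.83, 11189.065, 78956.73, 21167.555],
    "Credit_Score": [2, 2, 1, 2, 2, 1],
})

# Count missing values per column; all zeros means there are no NaN values.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# List the distinct label values and how often each one occurs.
label_counts = toy["Credit_Score"].value_counts().sort_index()
print(label_counts)

# Compare the mean and the median age, as done in the text.
print(toy["Age"].mean(), toy["Age"].median())
```

On the real data, the same three calls would be applied to `dataset` instead of `toy`.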

# Supervised learning

## What exactly is Supervised learning?

- Supervised learning is defined as the process of training an algorithm on labelled data.

## What is the notion of labelled data?

- Labelled data means the data has already been sorted or categorised. The aim of supervised learning is to build a model that can predict the category or output of brand-new, unseen data based on the input features of the data.

## What are the training and test datasets?

- The training dataset refers to the portion of the dataset used in training the ML model. The training dataset is used to optimize the model parameters in order for the model to make accurate predictions on new data.

- The test dataset refers to the portion of the dataset used in testing the ML model. The testing dataset is used to evaluate the performance of the already trained model. This dataset is separate and different from the training dataset and is not used to train the model. It is, however, used to test the model's ability to generalize to new data.

### Sources:
- https://www.ibm.com/topics/supervised-learning
- https://www.techtarget.com/searchenterpriseai/definition/supervised-learning
- https://www.analyticssteps.com/blogs/binary-and-multiclass-classification-machine-learning


___________________________________________________________________________________________________________________________________________________________________________________

### Separating the Features and the Labels

In [None]:
# selecting all rows, and selected relevant columns
features = dataset.iloc[:,[4,5,6,7,8,9,10,11,12,13,14,15,17,18,22]].values

# selecting all the rows and the last column only
label = dataset.iloc[:,-1].values

# .values (an attribute, not a method) converts the selection into a NumPy array, because many ML libraries require the input to be NumPy arrays.
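
Selecting columns by position works, but it breaks silently if the column order ever changes. As a hedged alternative, the sketch below selects by column name on a toy frame that reuses a few column names and values from the rows shown earlier; the two-column `feature_columns` list is an illustrative subset, not the fifteen columns used in this project.

```python
import pandas as pd

# Toy frame with a few of the dataset's columns (first three rows shown earlier).
toy = pd.DataFrame({
    "ID": [106981, 108774, 111896],
    "Annual_Income": [14619.585, 70883.44, 14395.83],
    "Num_Bank_Accounts": [7, 4, 8],
    "Credit_Score": [2, 2, 1],
})

# Illustrative subset of feature names (assumption: not the project's full list).
feature_columns = ["Annual_Income", "Num_Bank_Accounts"]

# .to_numpy() returns the selection as a NumPy array, like .values does.
features_by_name = toy[feature_columns].to_numpy()
label_by_name = toy["Credit_Score"].to_numpy()

print(features_by_name.shape, label_by_name.shape)
```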

### Splitting the data for training and testing

In [None]:
# 'train_test_split' function is used to split a dataset into the training and testing sets
from sklearn.model_selection import train_test_split

features_train, features_test, label_train, label_test = train_test_split(features, label, test_size = 0.21, random_state = 0)

# test_size = 0.21 means that 21% of the data is used for testing
# random_state sets the random seed for the split.
# This ensures that the split is reproducible, meaning that if the code is run again with the same seed, the split will be the same.

# Splitting the training and testing dataset

We split our data into training and test sets to measure and analyze the performance of our machine learning models. The training set is used to fit the model, which learns the patterns and relationships between features and labels, while the test set estimates how well the model will generalize to new, unseen data.

The split also matters for detecting overfitting, where a model fits the training data so closely that it fails to recognise the patterns in new data. A large gap between training performance and test performance is a sign of overfitting. Hyperparameters should be tuned on the training data only (for example with cross-validation or a separate validation set), so that the test set stays untouched and gives an honest estimate of how well the final model generalises to fresh data.

A training-test split between 80|20 and 70|30 is generally considered a good choice, as it leaves sufficient data both for training the model and for checking its performance. This project uses 79|21.
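
When the classes are imbalanced, a purely random split can leave the test set with a different class mix from the training set. The sketch below shows one possible refinement, the `stratify` argument of `train_test_split`, on synthetic data. The sample sizes and class proportions are invented for illustration, and the project's own split above does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 samples, 3 features, imbalanced labels 1/2/3.
X = rng.normal(size=(1000, 3))
y = np.repeat([1, 2, 3], [300, 550, 150])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# With stratify=y, each class keeps roughly the same share in both subsets.
for cls in (1, 2, 3):
    print(cls, (y_train == cls).mean(), (y_test == cls).mean())
```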

# Classification

### Explaining the difference between binary and multi-class classification
Both binary and multi-class classification are common types of supervised ML tasks that involve the process of assigning a label or category to a given input. However, there are differences between binary and multi-class classification.

Binary classification involves classifying given data into one of two classes (or you could say it assigns one of two possible labels to an input). For example, determining whether an email is spam or not is a binary classification problem, as there are only two possible outcomes.

Multi-class classification involves classifying elements into different classes (or you could say it assigns one of three or more possible labels to an input). For example, determining the colour of a ball from an image of it is a multi-class classification problem, as there are multiple possible outcomes (e.g. blue, green, red).

The main difference between binary and multi-class classification is the number of possible outcomes: exactly 2 in binary classification and more than 2 in multi-class classification, with no upper limit on the number of classes.
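
An SVM is at heart a binary classifier, so a multi-class problem has to be broken into binary ones. scikit-learn's `SVC` does this internally with a one-vs-one scheme. The sketch below makes that visible on synthetic four-class data (the data and the linear kernel are chosen only for illustration) by comparing the shape of the decision values under the two `decision_function_shape` settings.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic data with 4 classes (illustrative only).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# One-vs-one output: one decision value per pair of classes, 4 * 3 / 2 = 6.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovo_shape = ovo.decision_function(X).shape
print(ovo_shape)

# One-vs-rest shaped output: one decision value per class.
ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
ovr_shape = ovr.decision_function(X).shape
print(ovr_shape)
```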


### Sources:
- https://www.analyticssteps.com/blogs/binary-and-multiclass-classification-machine-learning
- https://vitalflux.com/difference-binary-multi-class-multi-label-classification/#:~:text=So%2C%20what's%20the%20difference%20between,from%20more%20than%20two%20classes.




### Describing what I understand from the need to normalise data
Data normalisation is an important step in the pre-processing stage to prepare data for analysis or machine learning models.

The goal is to bring the numeric columns in the dataset onto a common scale, because variables measured on larger scales may otherwise dominate the others. Features on very different scales also make the optimization process slower and less stable. Once data is normalized, the optimization process becomes more efficient.

Another reason is to improve the accuracy and stability of the models. Many ML algorithms, including SVMs with an RBF kernel, are sensitive to the scale of the input features and tend to work best when each feature is centred around 0 with a standard deviation of about 1. Standardising rescales each variable to mean = 0 and standard deviation = 1 (it does not change the shape of its distribution), which generally improves the overall performance of the model.
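
To make the rescaling concrete, the sketch below standardises two features by hand with z = (x - mean) / std and checks the result against scikit-learn's `StandardScaler`. The four rows of values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (values invented for illustration).
X = np.array([[14000.0, 7.0],
              [70000.0, 4.0],
              [21000.0, 2.0],
              [44000.0, 6.0]])

# Manual standardisation: z = (x - mean) / std, computed per column.
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler applies the same formula (population standard deviation).
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0).round(6), z_sklearn.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.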

### Sources:
- https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

# Normalisation

In [None]:
from sklearn.preprocessing import StandardScaler

# using the StandardScaler class to scale the data
sc = StandardScaler()

# scaling the training dataset using the fit_transform function. Each feature is scaled to have mean = 0 and standard deviation = 1
features_train = sc.fit_transform(features_train)

# the transform() function applies the mean and standard deviation learned from the training set to the testing dataset
features_test = sc.transform(features_test)

### Describing SVM (in relation to Linear Regression)

Support Vector Machine (abbreviated as SVM) is a type of supervised learning algorithm used for both classification and regression tasks. For classification, the goal of SVM is to find the best hyperplane (the decision boundary) that separates the data points into different classes. The best hyperplane is the one that maximizes the margin, that is, the distance between the boundary and the closest points from each class.

The goal of linear regression is to find the line that best fits the data points. The line of best fit is chosen by the criterion that the sum of the squared distances between the actual data points and the predicted values on the line is minimized.

Both of these algorithms can fit a linear function of the input features. SVM, however, is more flexible than linear regression, as kernels allow it to handle both linear and non-linear relationships between the input features and the output variable. The two also differ in what they optimise: linear regression minimizes the sum of squared distances between the predicted and actual values, while SVM maximizes the margin between the classes.
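
The two objectives can be written side by side. The notation is introduced here for illustration (weights $w$, bias $b$, inputs $x_i$, targets $y_i$, $n$ samples), and the SVM is shown in its textbook hard-margin form for two classes, which is a simplification of what a soft-margin, kernelised classifier actually solves.

```latex
% Linear regression (ordinary least squares): minimise the squared residuals
\min_{w,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (w^\top x_i + b) \bigr)^2

% Hard-margin linear SVM with labels y_i \in \{-1, +1\}:
% maximising the margin 2 / \lVert w \rVert is equivalent to
\min_{w,\,b} \; \tfrac{1}{2} \lVert w \rVert^2
\quad \text{subject to} \quad
y_i \, (w^\top x_i + b) \ge 1 \quad \text{for all } i = 1, \dots, n
```

In the first problem every data point contributes to the loss, whereas in the second only the points closest to the boundary (the support vectors) end up determining the solution.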

### Sources:
- https://towardsdatascience.com/unlocking-the-true-power-of-support-vector-regression-847fd123a4a0
- https://www.analyticssteps.com/blogs/how-does-support-vector-machine-algorithm-works-machine-learning

### What is the kernel in SVM?
In SVM, the kernel is a mathematical function that measures the similarity between pairs of data points. Its purpose is to let the SVM work as if the input data had been transformed into a higher-dimensional space, where it can more easily be separated into different classes. The kernel function returns the dot product that a pair of data points would have in that higher-dimensional feature space, without computing the mapping explicitly (the so-called kernel trick).

Different SVMs use different types of kernel functions. Choosing the correct kernel is important as it determines the shape of the decision boundary or hyperplane that separates the data points. Examples of different types of kernels include:

#### - Linear
This calculates the dot product between the input feature vectors in the original space. This kernel is useful for linearly separable data.

#### - Polynomial
This uses a polynomial function to map the input data to a higher-dimensional feature space. The polynomial function can handle non-linearly separable data.

#### - RBF
This kernel maps the input data to an infinite-dimensional feature space using a Gaussian function. It is commonly used for non-linearly separable data.

#### - Sigmoid
Similar to the polynomial kernel, but this kernel uses a hyperbolic tangent function instead of a polynomial function.

The choice of kernel being used can impact the performance of the SVM greatly, so it is important to pick the appropriate kernel based on the nature of the data and problem at hand.
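
As a concrete example of a kernel function, the sketch below computes the RBF kernel by hand, using K(x, x') = exp(-gamma * ||x - x'||^2), and compares it with scikit-learn's `rbf_kernel`. The three points and the value of `gamma` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.5  # illustrative value

# Manual RBF kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

# Library implementation of the same kernel.
K_sklearn = rbf_kernel(X, gamma=gamma)

print(np.allclose(K_manual, K_sklearn))
print(K_manual.round(4))
```

Each entry lies between 0 and 1, with 1 on the diagonal because every point is at distance 0 from itself; the further apart two points are, the closer their kernel value is to 0.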

### Sources:
- https://data-flair.training/blogs/svm-kernel-functions/
- https://www.geeksforgeeks.org/major-kernel-functions-in-support-vector-machine-svm/

### Building the model from training dataset

In [None]:
from sklearn.svm import SVC

# here we initialize a SVC object called "classifier" that uses the kernel "rbf"
classifier = SVC(kernel = 'rbf')

# training the model
# features_train is the input feature data & label_train is the corresponding output or targeted data
classifier.fit(features_train, label_train)

SVC()
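
The model above uses the default values of `C` and `gamma`. One common way to choose them is a cross-validated grid search on the training data, which is what the `GridSearchCV` import at the top of this notebook is for. The sketch below is a minimal illustration on synthetic data; the grid values, the data and the 5-fold setting are assumptions for the example, not settings that were tuned for this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the credit score features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Scaling inside the pipeline is re-fitted on each cross-validation fold.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative grid; the "svc__" prefix targets the SVC step of the pipeline.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(test_accuracy)
```

Because the search only ever sees the training split, the test score remains an unbiased check of the selected model.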

### Prediction for credit score label

In [None]:
# predicting & displaying the test set results
label_prediction = classifier.predict(features_test)
label_prediction

array([2, 3, 1, 2, 2, 2, 3, 2, 1, 2, 2, 1, 2, 3, 2, 2, 2, 3, 2, 2, 2, 3,
       2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 3, 2, 2, 3, 1, 1,
       3, 2, 2, 2, 2, 1, 3, 3, 3, 1, 1, 3, 1, 1, 3, 2, 2, 1, 2, 2, 1, 2,
       2, 3, 3, 2, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, 2, 2, 1,
       3, 2, 1, 2, 2, 2, 1, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 1, 2, 3, 2, 2,
       1, 1, 3, 3, 3, 3, 1, 3, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 3, 2, 2, 2,
       1, 2, 2, 2, 2, 2, 3, 1, 1, 1, 3, 1, 3, 2, 2, 2, 1, 2, 1, 2, 2, 1,
       3, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 2, 2, 1, 2, 3, 2, 2, 2, 2, 1, 2,
       2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 3, 2,
       2, 3, 1, 1, 2, 2, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2,
       1, 2, 2, 1, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 1, 2, 2, 3, 2, 2, 2, 1,
       2, 1, 2, 2, 3, 2, 2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 3, 2, 2, 2, 2, 2,
       2, 2, 1, 2, 1, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2,

### Displaying confusion matrix

In [None]:
# importing confusion matrix library
from sklearn.metrics import confusion_matrix

# making and displaying the confusion matrix
cmatrix = confusion_matrix(label_test, label_prediction)
cmatrix

array([[ 65,  49,  12],
       [ 27, 180,  34],
       [  0,  26,  48]])

#### Explanation for confusion matrix

The output above is a 3x3 confusion matrix. In scikit-learn's confusion_matrix, the rows represent the actual classes and the columns represent the predicted classes. Each number in the matrix is the count of instances with that combination of actual and predicted class.

More specifically, the confusion matrix can be interpreted as follows:

- The first row represents the instances that were actually in the first class. Out of these instances, 65 were correctly classified as belonging to the first class, 49 were incorrectly classified as belonging to the second class, and 12 were incorrectly classified as belonging to the third class.

- The second row represents the instances that were actually in the second class. Out of these instances, 27 were incorrectly classified as belonging to the first class, 180 were correctly classified as belonging to the second class, and 34 were incorrectly classified as belonging to the third class.

- The third row represents the instances that were actually in the third class. Out of these instances, none were incorrectly classified as belonging to the first class, 26 were incorrectly classified as belonging to the second class, and 48 were correctly classified as belonging to the third class.

Overall, this confusion matrix can be used to calculate various performance metrics for the classification model, such as accuracy, precision, and recall.
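
As a worked example, the sketch below derives accuracy and per-class precision and recall from the matrix printed above, treating rows as actual classes and columns as predicted classes (the convention used by scikit-learn's `confusion_matrix`).

```python
import numpy as np

# Confusion matrix from above: rows are actual classes, columns are predicted.
cm = np.array([[65, 49, 12],
               [27, 180, 34],
               [0, 26, 48]])

correct = np.diag(cm)
accuracy = correct.sum() / cm.sum()
recall = correct / cm.sum(axis=1)      # share of each actual class found
precision = correct / cm.sum(axis=0)   # share of each predicted class that is right

print(f"accuracy: {accuracy:.4f}")
for cls, (p, r) in enumerate(zip(precision, recall), start=1):
    print(f"class {cls}: precision={p:.3f}, recall={r:.3f}")
```

The accuracy works out to 293 correct predictions out of 441 test instances, or roughly 66%.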






### Quadratic Weighted Kappa (QWK)

Quadratic Weighted Kappa (QWK) is a statistical measure that is commonly used to evaluate the agreement between two raters, for example two different ML algorithms. The QWK ranges from -1 to 1. -1 indicates complete disagreement, 0 indicates the level of agreement expected by chance, and 1 indicates complete agreement.

We calculate the QWK by comparing the observed agreement between the raters with the expected agreement, weighting each disagreement by the squared distance between the two ratings.
- The observed agreement is the proportion of cases where the raters agree on the rating.
- The expected agreement is the proportion of cases where the raters would be expected to agree by chance, based on the distribution of ratings.

QWK is very useful when evaluating the performance of ML algorithms. This is because it can capture the agreement between the algorithm's predictions and the actual labels of the data, even when there are multiple possible categories. For example, in a multi-class classification task, QWK can be used to evaluate the agreement between the predicted class and the actual class, taking into account the degree of disagreement between the different classes.
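
To show what the metric does internally, the sketch below computes the QWK by hand from the confusion matrix reported earlier, using the standard formula QWK = 1 - sum(W * O) / sum(W * E). Here O holds the observed counts, E the counts expected by chance, and W the quadratic weights. The variable names are chosen for this example only.

```python
import numpy as np

# Confusion matrix reported earlier (rows: actual, columns: predicted).
observed = np.array([[65, 49, 12],
                     [27, 180, 34],
                     [0, 26, 48]], dtype=float)
n_classes = observed.shape[0]
total = observed.sum()

# Quadratic disagreement weights: 0 on the diagonal, growing with distance.
idx = np.arange(n_classes)
weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

# Expected counts if actual and predicted labels were independent.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total

qwk = 1.0 - (weights * observed).sum() / (weights * expected).sum()
print(f"QWK: {qwk:.4f}")
```

Confusing class 1 with class 3 carries a weight of 1, while confusing neighbouring classes carries only 0.25, which is how the metric accounts for the degree of disagreement.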

### Sources:
- https://www.kaggle.com/code/carlolepelaars/understanding-the-metric-quadratic-weighted-kappa
- https://verosssr.com/cacefc41e57e499a9f4c8cdce878df5d
- https://medium.com/x8-the-ai-community/kappa-coefficient-for-dummies-84d98b6f13ee

### Using the sklearn.metrics library to code and obtain the QWK score

In [None]:
# importing cohen kappa score library
from sklearn.metrics import cohen_kappa_score

# obtain the true labels for the test set
label_true = label_test

# compute the QWK score for the predicted and true labels
qwk_score = cohen_kappa_score(label_true, label_prediction, weights = 'quadratic')
print("QWK score: {:.4f}".format(qwk_score))

QWK score: 0.5239


#### Explanation for the QWK score:

The QWK score of 0.5239 indicates a moderate level of agreement between the predicted and true labels. This means that my model's predictions are better than random guessing, but there is still room for improvement.

# Kaggle submission – Credit Score Evaluation

In [None]:
import csv

# we are using the same model that we've created earlier but using a different dataset
dataset2 = pd.read_csv('Credit-Scores-Submission.csv')

# selecting the same feature columns that were used for training (this file has no labels to separate)
feature = dataset2.iloc[:,[4,5,6,7,8,9,10,11,12,13,14,15,17,18,22]].values

# scaling
feature = sc.transform(feature)

# predicting credit scores for the test set
label_pred = classifier.predict(feature)
print(label_pred, len(label_pred))

# for this part, two files were utilised: one for input, one for output
with open('32845650-WongJunWei-v1.csv', 'r') as csv_file, open('32845650-Wong_Jun_Wei-v1.csv', 'w', newline = '') as output_file:
    csvreader = csv.reader(csv_file)
    csvwriter = csv.writer(output_file)

    # iterate over the rows of the input csv file, update the credit score in each row, and write the updated rows to the output csv file
    for i, row in enumerate(csvreader):
        # for the first row (which is the header), write it to the output CSV file
        if i == 0:
            csvwriter.writerow(row)
        else:
            # row[1] refers to the credit score, therefore this updates the credit score of that row
            row[1] = label_pred[i - 1]
            # write the updated row to the output CSV file
            csvwriter.writerow(row)

[1 1 2 2 2 3 3 2 2 2 2 2 3 1 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 3 1 3 1 1 1
 1 2 3 3 2 2 1 1 2 2 2 2 1 2 2 2 3 2 2 2 3 2 1 2 3 3 1 2 2 1 2 2 3 3 2 2 1
 2 2 2 2 2 2 2 3 1 1 1 1 1 2 2 2 2 2 3 1 1 2 2 2 2 3 2 3 3 2 2 2 1 2 2 3 2
 1 2 1 3 2 1 2 3 2 3 2 2 2 2 1 3 3 1 1 2 1 2 2 1 1 1 2 1 1 2 2 3 3 2 1 2 1
 2 3 2 1 1 2 2 2 2 1 2 1 2 3 2 1 1 1 1 3 2 2 3 2 3 2 1 2 2 2 1 2 1 2 2 1 2
 1 2 2 1 2 3 2 1 1 3 3 2 3 2 3 1 2 2 2 2 1 2 2 2 2 3 2 1 2 3 1 1 2 2 2 3 2
 3 2 2 3 2 3 2 2 1 3 1 2 2 2 2 1 2 2 1 1 1 1 2 1 2 2 1 1 2 2 3 2 1 3 2 3 1
 2 2 2 3 3 2 3 1 3 3 2 3 2 1 1 2 1 2 2 3 2 1 1 3 2 1 3 2 2 3 3 2 2 3 2 2 2
 2 3 2 1 2 2 3 2 2 1 2 1 1 3 3 3 3 3 2 2 1 3 2 2 2 2 2 2 2 1 3 2 2 2 3 2 1
 1 1 2 2 2 2 2 1 2 2 2 3 2 3 2 3 3 2 2 1 1 3 1 2 2 1 2 3 2 1 2 2 2 2 2 1 2
 1 1 2 1 2 2 2 1 1 2 3 2 3 2 2 3 2 1 3 2 3 2 2 1 1 1 2 2 2 2 1 1 2 1 2 1 2
 3 2 3 3 2 3 2 2 1 2 2 2 2 2 1 1 2 1 2 1 3 1 2 2 3 3 3 2 3 3 3 1 1 2 2 1 2
 2 1 3 1 2 2 2 2 1 2 3 2 2 2 2 2 1 3 3 2 3 1 2 3 3 2 3 2 2 3 3 2 2 1 3 3 1
 2 2 1 2 2 1 2 2 2 2 2 3 
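
The row-by-row loop above can also be expressed with pandas. The sketch below is a hedged alternative that works on an in-memory CSV so it can run on its own. The column names `ID` and `Credit_Score`, the three template rows and the stand-in predictions are all hypothetical, since the layout of the real submission file is not shown here.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical submission template: header row, credit score in the 2nd column.
template_csv = io.StringIO("ID,Credit_Score\n101,0\n102,0\n103,0\n")
submission = pd.read_csv(template_csv)

# Stand-in for the array returned by classifier.predict(...).
predicted_scores = np.array([1, 2, 3])

# Guard against a length mismatch before overwriting the column.
if len(predicted_scores) != len(submission):
    raise ValueError("number of predictions does not match number of rows")

# Overwrite the second column, mirroring row[1] in the loop above.
score_column = submission.columns[1]
submission[score_column] = predicted_scores

# Write to an in-memory buffer; a file path would be used in practice.
output = io.StringIO()
submission.to_csv(output, index=False)
print(output.getvalue())
```

The explicit length check makes a mismatch between the template and the predictions fail loudly instead of raising an IndexError partway through the file.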

# Conclusion

From this project, I have learnt that the Support Vector Machine (SVM) is a powerful algorithm that can be applied to a wide range of prediction problems. I have also learnt that SVM can be used not only for classification (as attempted in this project) but also for regression problems. I have learnt the importance of normalizing or scaling the data in the pre-processing stage before building a predictive model, and the importance of choosing and tuning appropriate hyperparameters in an SVM model to obtain good predictive performance. I also learnt how to evaluate the model using metrics such as the confusion matrix and QWK to check how accurate it is and how well it generalizes to new data. Overall, I hope the knowledge gained from this project will help me in whichever field of study I pursue in the future.