# PS5: Classification of Consumer Credit Score
In this problem we will construct, train and evaluate a feedforward neural network to classify consumer credit risk.

## Task 1: Setup, Data, and Prerequisites
In this task, we'll set up the computational environment, load the necessary packages, and load and prepare the consumer credit dataset. We will also define any constants we use throughout the problem set.

We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, the packages are loaded. Otherwise, the packages are downloaded and then loaded.

In [1]:
include("Include.jl"); # This will load necessary packages and functions

### Data
Next, let's load the data. The dataset is a CSV file containing information about consumer credit risk, and is [available on Kaggle](https://www.kaggle.com/datasets/sudhanshu2198/processed-data-credit-score?resource=download). 
* _What's in the dataset?_ The dataset contains approximately 100k records, where each record contains feature variables related to consumer credit, such as income, age, and other relevant attributes that may influence credit risk. The label variable for each record is the consumer's credit score, which takes one of three values: `(Poor | Standard | Good)`.

Let's load the raw dataset, do some data wrangling, and split the data into training and test sets. We'll save the raw data in the `raw_data::DataFrame` variable.

In [2]:
raw_data = CSV.read(joinpath(_PATH_TO_DATA, "credit_score_dataset.csv"), DataFrame)

Row,Delay_from_due_date,Num_of_Delayed_Payment,Num_Credit_Inquiries,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Amount_invested_monthly,Monthly_Balance,Credit_Score,Credit_Mix,Payment_Behaviour,Age,Annual_Income,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Monthly_Inhand_Salary,Changed_Credit_Limit,Outstanding_Debt,Total_EMI_per_month
,Float64,Float64,Float64,Float64,Float64,String3,Float64,Float64,String15,String15,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,3.0,7.0,4.0,26.8226,265.0,No,80.4153,312.494,Good,Good,High_spent_Medium_value_payments,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
2,3.0,7.0,4.0,31.945,265.0,No,118.28,284.629,Good,Good,High_spent_Medium_value_payments,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
3,3.0,7.0,4.0,28.6094,267.0,No,81.6995,331.21,Good,Good,High_spent_Medium_value_payments,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
4,5.0,4.0,4.0,31.3779,268.0,No,199.458,223.451,Good,Good,High_spent_Medium_value_payments,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
5,6.0,4.0,4.0,24.7973,269.0,No,41.4202,341.489,Good,Good,High_spent_Medium_value_payments,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
6,8.0,4.0,4.0,27.2623,270.0,No,62.4302,340.479,Good,Good,High_spent_Medium_value_payments,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
7,3.0,8.0,4.0,22.5376,271.0,No,178.344,244.565,Good,Good,High_spent_Medium_value_payments,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
8,3.0,6.0,4.0,23.9338,271.0,No,24.7852,358.124,Standard,Good,High_spent_Medium_value_payments,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
9,3.0,4.0,2.0,24.464,319.0,No,104.292,470.691,Standard,Good,High_spent_Large_value_payments,28.0,34847.8,2.0,4.0,6.0,1.0,3037.99,5.42,605.03,18.8162
10,7.0,1.0,2.0,38.5508,320.0,No,40.3912,484.591,Good,Good,High_spent_Large_value_payments,28.0,34847.8,2.0,4.0,6.0,1.0,3037.99,5.42,605.03,18.8162


In [3]:
names(raw_data)

21-element Vector{String}:
 "Delay_from_due_date"
 "Num_of_Delayed_Payment"
 "Num_Credit_Inquiries"
 "Credit_Utilization_Ratio"
 "Credit_History_Age"
 "Payment_of_Min_Amount"
 "Amount_invested_monthly"
 "Monthly_Balance"
 "Credit_Score"
 "Credit_Mix"
 ⋮
 "Annual_Income"
 "Num_Bank_Accounts"
 "Num_Credit_Card"
 "Interest_Rate"
 "Num_of_Loan"
 "Monthly_Inhand_Salary"
 "Changed_Credit_Limit"
 "Outstanding_Debt"
 "Total_EMI_per_month"

Next, let's do some data wrangling. In particular, we will convert the categorical variables to numerical variables:
* The `Payment_of_Min_Amount` variable is a categorical variable with levels: `No` and `Yes`. Let's convert it to a numerical variable with levels: `No` $\rightarrow$ `-1` and `Yes` $\rightarrow$ `1`.
* The `Credit_Score` variable is a categorical variable with levels: `Poor`, `Standard`, and `Good`. Let's convert it to a numerical variable with levels: `Poor` $\rightarrow$ `1`, `Standard` $\rightarrow$ `2`, and `Good` $\rightarrow$ `3`.
* The `Credit_Mix` variable is a categorical variable with levels: `Bad`, `Standard`, and `Good`. Let's convert it to a numerical variable with levels: `Bad` $\rightarrow$ `1`, `Standard` $\rightarrow$ `2`, and `Good` $\rightarrow$ `3`.
* The `Payment_Behaviour` variable is a categorical variable with multiple levels. Let's convert these to a numerical variable with levels: `Low_spent_Small_value_payments` $\rightarrow$ `1`, `Low_spent_Medium_value_payments` $\rightarrow$ `2`, `Low_spent_Large_value_payments` $\rightarrow$ `3`, `High_spent_Small_value_payments` $\rightarrow$ `4`, `High_spent_Medium_value_payments` $\rightarrow$ `5`, and `High_spent_Large_value_payments` $\rightarrow$ `6`.
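The converter functions used in this notebook (`convertcreditscore`, `convertcreditmix`, and `convertcreditbehavior`) are loaded by `Include.jl` and are not shown here. As a minimal sketch of how such a mapping could work, assuming a simple dictionary lookup, consider the following. The names `CREDIT_SCORE_LEVELS`, `PAYMENT_BEHAVIOUR_LEVELS`, and `encode_level` are hypothetical and only for illustration; the actual implementations may differ.

```julia
# Hypothetical lookup tables (assumption: the real converters in Include.jl may be written differently)
const CREDIT_SCORE_LEVELS = Dict("Poor" => 1, "Standard" => 2, "Good" => 3)
const PAYMENT_BEHAVIOUR_LEVELS = Dict(
    "Low_spent_Small_value_payments" => 1,
    "Low_spent_Medium_value_payments" => 2,
    "Low_spent_Large_value_payments" => 3,
    "High_spent_Small_value_payments" => 4,
    "High_spent_Medium_value_payments" => 5,
    "High_spent_Large_value_payments" => 6,
)

# Map a categorical level to its integer code; unknown levels raise an error instead of passing silently
function encode_level(table::Dict{String,Int}, level::AbstractString)::Int
    key = String(level) # normalize inline string types (e.g., String15) to String
    haskey(table, key) || throw(ArgumentError("unknown level: $key"))
    return table[key]
end

encoded = [encode_level(CREDIT_SCORE_LEVELS, s) for s in ["Good", "Standard", "Poor"]]
```

Raising an error on an unknown level is a design choice: a silent default would hide typos or unexpected categories in the raw data.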

In [4]:
let
    tmp = Set{String}()
    for record in eachrow(raw_data)
        push!(tmp, record[:Credit_Mix])
    end
    tmp
end

Set{String} with 3 elements:
  "Good"
  "Bad"
  "Standard"

In [5]:
dataset = let
    
    dataset = copy(raw_data); # make a copy of the raw data, so we can keep the original intact
    transform!(dataset, :Payment_of_Min_Amount => ByRow(x -> (x=="No" ? -1 : 1)) => :Payment_of_Min_Amount); # maps Payment_of_Min_Amount to -1,1
    transform!(dataset, :Credit_Score => ByRow(s -> convertcreditscore(s))  => :Credit_Score); # maps Credit_Score to 1,2,3
    transform!(dataset, :Credit_Mix => ByRow(s -> convertcreditmix(s))  => :Credit_Mix); # maps Credit_Mix to 1,2,3
    transform!(dataset, :Payment_Behaviour => ByRow(s -> convertcreditbehavior(s))  => :Payment_Behaviour); # maps Payment_Behaviour to 1,2,3,4,5,6
    dataset;
end

Row,Delay_from_due_date,Num_of_Delayed_Payment,Num_Credit_Inquiries,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Amount_invested_monthly,Monthly_Balance,Credit_Score,Credit_Mix,Payment_Behaviour,Age,Annual_Income,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Monthly_Inhand_Salary,Changed_Credit_Limit,Outstanding_Debt,Total_EMI_per_month
,Float64,Float64,Float64,Float64,Float64,Int64,Float64,Float64,Int64,Int64,Int64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,3.0,7.0,4.0,26.8226,265.0,-1,80.4153,312.494,3,3,5,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
2,3.0,7.0,4.0,31.945,265.0,-1,118.28,284.629,3,3,5,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
3,3.0,7.0,4.0,28.6094,267.0,-1,81.6995,331.21,3,3,5,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
4,5.0,4.0,4.0,31.3779,268.0,-1,199.458,223.451,3,3,5,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
5,6.0,4.0,4.0,24.7973,269.0,-1,41.4202,341.489,3,3,5,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
6,8.0,4.0,4.0,27.2623,270.0,-1,62.4302,340.479,3,3,5,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
7,3.0,8.0,4.0,22.5376,271.0,-1,178.344,244.565,3,3,5,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
8,3.0,6.0,4.0,23.9338,271.0,-1,24.7852,358.124,2,3,5,23.0,19114.1,3.0,4.0,3.0,4.0,1824.84,11.27,809.98,49.5749
9,3.0,4.0,2.0,24.464,319.0,-1,104.292,470.691,2,3,6,28.0,34847.8,2.0,4.0,6.0,1.0,3037.99,5.42,605.03,18.8162
10,7.0,1.0,2.0,38.5508,320.0,-1,40.3912,484.591,3,3,6,28.0,34847.8,2.0,4.0,6.0,1.0,3037.99,5.42,605.03,18.8162


Next, let's package the full dataset into a `Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` data structure, i.e., one `(features, label)` tuple per record.

In [6]:
converted_dataset, features = let
    converted_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}();
    number_digit_array = [1,2,3]; # the class labels: 1 = Poor, 2 = Standard, 3 = Good

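    # which cols do we want to use as features - let's use all *but* the label -
    # names(dataset) returns Strings, so we compare against the String "Credit_Score";
    # a comparison against the Symbol :Credit_Score is never equal, which would leave the label in the features
    all_cols = names(dataset);
    features = Array{String,1}();
    for col in all_cols
        if col != "Credit_Score"
            push!(features, col);
        end
    end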
    features = features |> sort;
    number_of_features = length(features);
    
    # build record tuples -
    for record ∈ eachrow(dataset)

        # convert the label to a one-hot vector
        label = record[:Credit_Score]; # this is the label we want to predict 
        Y = onehot(label, number_digit_array); # convert the label to a one-hot vector
        X = Vector{Float32}();

        for i ∈ eachindex(features)
            feature = features[i]; # get the feature name
            value = record[feature] |> Float32; # get the value of the feature
            push!(X, value); # add the value to the feature vector
        end

        data_tuple = (X, Y); # create a tuple of the feature vector and the label
        push!(converted_dataset, data_tuple); # add the tuple to the dataset
    end

    converted_dataset, features;
end;
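To make the label encoding concrete, here is a minimal sketch of one-hot encoding in plain Julia. The function `onehot_sketch` is a hypothetical stand-in for the `onehot(...)` call used above; it returns a Boolean vector rather than a `OneHotVector`, so it illustrates the idea and is not a drop-in replacement.

```julia
# Hypothetical stand-in for onehot(label, labels): a Boolean vector with a single true entry
function onehot_sketch(label, labels::AbstractVector)
    idx = findfirst(==(label), labels) # position of the label in the list of classes
    idx === nothing && throw(ArgumentError("label $label is not in $labels"))
    v = falses(length(labels))
    v[idx] = true
    return v
end

labels = [1, 2, 3]           # 1 = Poor, 2 = Standard, 3 = Good
y = onehot_sketch(3, labels) # encode the label "Good"
decoded = labels[argmax(y)]  # recover the class label from the one-hot vector
```

The `argmax` call in the last line is the same decoding step applied later to the model output, where the position of the largest entry picks the predicted class.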

### Test and training data

In [None]:
θ = 0.80; # what fraction of the data to use for training
number_of_training_samples = round(Int64, θ * length(converted_dataset)); # 80% of the data will be used for training (rounded, so a non-integer product does not throw an InexactError)
number_of_test_samples = length(converted_dataset) - number_of_training_samples; # the rest will be used for testing
numner_of_features = length(features); # the number of features
number_of_classes = 3;
number_of_epochs = 100; # how many epochs do we want to train for?
number_of_hidden_nodes = 2^10;

In [8]:
training_dataset, test_dataset = let
    
    # initialize -
    training_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}();
    test_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}();
    number_of_samples = length(converted_dataset);
    all_index_set = range(1, stop=number_of_samples, step=1) |> Set{Int64};

    # generate a set of random indices for training and testing -
    random_training_index_set = Set{Int64}();
    while length(random_training_index_set) < number_of_training_samples # stop at exactly number_of_training_samples unique indices
        random_index = rand(1:number_of_samples);
        push!(random_training_index_set, random_index);
    end
    random_test_index_set = setdiff(all_index_set, random_training_index_set); # the rest of the indices will be used for testing
    
    # populate the training set -
    random_training_index_vector = random_training_index_set |> collect;
    for i ∈ eachindex(random_training_index_vector)
        index = random_training_index_vector[i];
        push!(training_dataset, converted_dataset[index]);
    end
    
    # populate the test set -
    random_test_index_vector = random_test_index_set |> collect;
    for i ∈ eachindex(random_test_index_vector)
        index = random_test_index_vector[i];
        push!(test_dataset, converted_dataset[index]);
    end

    training_dataset, test_dataset;
end;
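An alternative way to build the same kind of random split is to shuffle the indices once and cut the permutation at the training fraction. The sketch below assumes only the `Random` standard library; the function name `split_sketch` and the toy data are hypothetical.

```julia
using Random

# Split a vector of samples into (training, test) parts using one shuffled index permutation
function split_sketch(samples::AbstractVector, θ::Float64; rng = Random.default_rng())
    n = length(samples)
    n_train = round(Int, θ * n) # number of training samples
    perm = randperm(rng, n)     # random ordering of 1:n, each index appears exactly once
    return samples[perm[1:n_train]], samples[perm[(n_train + 1):end]]
end

toy_samples = collect(1:10)
train_part, test_part = split_sketch(toy_samples, 0.80; rng = MersenneTwister(1234))
```

Because a permutation contains each index exactly once, the two parts are disjoint and together cover the full dataset, with no rejection loop required. Passing a seeded `rng` makes the split reproducible.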

## Task 2: Set up the model structure and training
In this task, we'll construct and train a feedforward model, i.e., learn the model parameters, using the examples encoded in the `training_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` variable.

First, we build an empty model with default (random) parameter values but a fixed structure. The number and dimension of the layers and the activation functions for each layer are specified when we build the model (but we'll update the parameters during training).
* _Library_: We use [the `Flux.jl` machine learning library](https://github.com/FluxML/Flux.jl) to construct the neural network model. The model will have three layers: the input layer maps the feature vector (one entry per feature) to `number_of_hidden_nodes = 2^10` hidden nodes with `tanh_fast` activation functions (a fast approximation of the hyperbolic tangent), the hidden layer maps the `number_of_hidden_nodes` hidden nodes to `number_of_classes = 3` outputs (also with `tanh_fast`), and the output layer is the [softmax function](https://en.wikipedia.org/wiki/Softmax_function).
* _Syntax_: The [`Flux.jl` package](https://github.com/FluxML/Flux.jl) uses some next level syntax. The model is built using [the `Chain` function](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Chain), which takes a list of layers as input. Each layer is defined using the [`Dense` type](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Dense) (in this case), which takes the number of input and output neurons as arguments. The activation function is an additional argument to [the `Dense` type](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Dense). The final layer uses [the `softmax(...)` method exported by the `NNlib.jl` package](https://fluxml.ai/NNlib.jl/dev/reference/#Softmax) to produce a probability distribution over the classes.
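To see what the layers compute, here is a minimal sketch of the forward pass written in plain Julia, without `Flux.jl`. The toy sizes, the random weights, and the names `softmax_sketch` and `forward` are assumptions for illustration only; the actual model uses the `Dense` layers described above.

```julia
# Toy sizes (assumptions for illustration): 4 features, 5 hidden nodes, 3 classes
d_in, d_hidden, d_out = 4, 5, 3

# A dense layer computes σ.(W*x .+ b); here the weights are small random numbers and the biases are zero
W1, b1 = 0.1f0 .* randn(Float32, d_hidden, d_in), zeros(Float32, d_hidden)
W2, b2 = 0.1f0 .* randn(Float32, d_out, d_hidden), zeros(Float32, d_out)

# Numerically stable softmax: subtract the maximum before exponentiating
softmax_sketch(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))

# Two tanh layers followed by softmax, mirroring the structure of the Chain
forward(x) = softmax_sketch(tanh.(W2 * tanh.(W1 * x .+ b1) .+ b2))

x = rand(Float32, d_in) # a toy feature vector
p = forward(x)          # class probabilities, non-negative and summing to one
n_params = length(W1) + length(b1) + length(W2) + length(b2) # (4*5 + 5) + (5*3 + 3)
```

The parameter count follows the same rule as the model summary that `Flux.jl` prints: each dense layer contributes (inputs × outputs) weights plus one bias per output.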

In [None]:
# TODO: Uncomment the code below to build the model!
Flux.@layer MyFluxNeuralNetworkModel  trainable=(input, hidden); # register the type with Flux and mark which fields are trainable
MyModel() = MyFluxNeuralNetworkModel( # convenience constructor that wraps the Chain
    Chain(
        input = Dense(numner_of_features, number_of_hidden_nodes, tanh_fast),  # layer 1
        hidden = Dense(number_of_hidden_nodes, number_of_classes, tanh_fast), # layer 2
        output = NNlib.softmax) # layer 3 (output layer)
);
model = MyModel().chain;

__Loss function__: Next, specify the `loss` function we will minimize to estimate the model parameters. We choose a loss function that is appropriate for a _multiclass classification problem_, namely a [logit cross-entropy loss function](https://fluxml.ai/Flux.jl/stable/reference/models/losses/#Flux.Losses.logitcrossentropy):
$$
\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\log(p_{ij}(\theta))
$$
where the outer summation is over all $N$ training examples, and the inner summation is over the $C$ possible classes. The $y_{ij}$ is the one-hot encoded label for the $i$th training example, and $p_{ij}$ is the predicted probability of the $i$th training example being in class $j$. 

In [10]:
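# TODO: Uncomment below to setup the loss function -
# agg = mean averages the per-sample losses, i.e., it supplies the 1/N factor in the expression above.
# Note: logitcrossentropy expects raw scores (logits) and applies a softmax internally; for a model that
# already ends in softmax, Flux.Losses.crossentropy is the matching choice.
loss(ŷ, y) = Flux.Losses.logitcrossentropy(ŷ, y; agg = mean); # loss for training multiclass classifiers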
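As a worked example of the cross-entropy expression, the sketch below evaluates it by hand for two toy samples in plain Julia. The helper names `softmax_sketch` and `crossentropy_sketch` and the toy values are hypothetical; the library functions handle batching and numerical details that this sketch ignores.

```julia
# Numerically stable softmax: converts raw scores (logits) into class probabilities
softmax_sketch(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))

# Cross-entropy of one sample: -Σ_j y_j log(p_j), where y is one-hot and p is a probability vector
crossentropy_sketch(p, y) = -sum(y .* log.(p))

# Two toy samples: raw scores and one-hot labels
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [[1, 0, 0], [0, 1, 0]]

per_sample = [crossentropy_sketch(softmax_sketch(z), y) for (z, y) in zip(logits, labels)]
L = sum(per_sample) / length(per_sample) # the 1/N average over the samples
```

The second sample has equal scores for all three classes, so its predicted probabilities are uniform and its loss is $\log(3)$, the value expected from a classifier that has learned nothing about that sample. The first sample puts most of its probability on the correct class, so its loss is smaller.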

We'll use [Gradient descent with momentum](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum) where the `λ` parameter denotes the `learning rate` and `β` denotes the momentum parameter. We save information about the optimizer in the `opt_state` variable, which will eventually get passed to the training method.

In [None]:
λ = 0.20; # learning rate (default: 0.01)
β = 0.10; # momentum parameter (default: 0.90)
opt_state = Flux.setup(Momentum(λ, β), model);
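To build intuition for the two optimizer parameters, here is a minimal sketch of gradient descent with momentum on a one-dimensional quadratic, written in plain Julia. It uses the classical velocity form of the update; the exact internal form used by the library may differ in detail. The objective and the names `grad_f` and `momentum_descent` are hypothetical.

```julia
# Gradient of the toy objective f(θ) = (θ - 3)^2, which has its minimum at θ = 3
grad_f(θ) = 2 * (θ - 3)

function momentum_descent(θ0; λ = 0.20, β = 0.10, steps = 50)
    θ, v = θ0, 0.0
    for _ in 1:steps
        v = β * v + λ * grad_f(θ) # velocity: decayed history plus the scaled current gradient
        θ = θ - v                 # step against the accumulated velocity
    end
    return θ
end

θ_final = momentum_descent(0.0) # start at θ = 0 and move toward the minimum
```

With `β = 0` this reduces to plain gradient descent. Larger values of `β` carry more of the previous steps forward, which can speed up progress along consistent gradient directions but can also cause overshoot.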

We are now ready to train the model. If the `should_we_train` flag is `true`, we use [Gradient descent with momentum](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum) to minimize the [logit cross-entropy loss function](https://fluxml.ai/Flux.jl/stable/reference/models/losses/#Flux.Losses.logitcrossentropy).
* _Epochs_: We do `number_of_epochs` passes through the training data. In each pass, every training example gets a forward pass for prediction and a backpropagation step for a parameter update. The parameters are initialized once, when the model is built, and each epoch continues from the parameters left by the previous one; the library does not restart from new initial guesses between epochs. Because the error landscape is non-convex, the final parameters can depend on that random initialization, so different runs may converge to different local minima.
* _Training takes a long time_. For each complete pass through the data, i.e., for each `epoch,` we save a `tmp` file holding the network state... just in case of `BOOOOOOOOM.`  We also have some pre-trained models to load if the `should_we_train` flag is false.

In [None]:
let

    should_we_train = true # TODO: set this flag to {true | false}
    if (should_we_train == true)
        for i = 1:number_of_epochs
            
            # train the model -
            Flux.train!(model, training_dataset, opt_state) do m, x, y
                loss(m(x), y)
            end
        
            # let the user know how we are doing -
            if (rem(i,2) == 0)
                println("Epoch $i of $number_of_epochs completed") # print the epoch number
            end
        
            # save the state of the model, in case something happens. We can reload from this state
            jldsave(joinpath(_PATH_TO_DATA, "tmp-model-training-checkpoint.jld2"), model_state = Flux.state(model))    
        end
    else
        # if we don't train: load up a previous model
        model_state = JLD2.load(joinpath(_PATH_TO_DATA, "tmp-model-training-checkpoint.jld2"), "model_state");
        Flux.loadmodel!(model, model_state);
    end
end

Chain(
  Dense(21 => 512, relu),               # 11_264 parameters
  Dense(512 => 3, relu),                # 1_539 parameters
  NNlib.softmax,
)                   # Total: 4 arrays, 12_803 parameters, 50.215 KiB.

## Task 3: How well does the model predict unseen versus observed data?
In this task, we'll check the network's generalization, i.e., how well it does on data it has not seen. One of the challenges with [Neural Networks](https://en.wikipedia.org/wiki/Neural_network_(machine_learning)) is the lack of generalizability, i.e., they _may not_ perform well on data the model has not seen.

Let's explore this question:
* First, compute the fraction of the `training data` that is correctly classified. This will help us understand how many of the `N` training samples we get correct and how many we get wrong. We expect to be _mostly correct_ on the training data.
* Next, we'll do the same thing but with the `test data,` i.e., data the model has never seen. We expect the correct prediction fraction on the test data to be less than or, at best, equal to the equivalent training data value.

### Correct prediction `training` dataset
In the code block below, we pass the feature vector for each record into the `model` instance, compute the predicted label `ŷ`, and compare the predicted and actual labels for the `training` dataset.
* _Logic_: If the prediction and the actual label agree, we update the `S_training` variable (a running count of the number of correct predictions). Finally, we compute the fraction of _correct_ classifications by dividing the number of correct predictions by the total number of samples in the `training` dataset.

In [17]:
let 
    S_training = 0;
    number_digit_array = [1,2,3]; # the class labels: 1 = Poor, 2 = Standard, 3 = Good
    for i ∈ eachindex(training_dataset)
    
        x = training_dataset[i][1];
        y = training_dataset[i][2];
        ŷ = model(x) |> z-> argmax(z) |> z-> number_digit_array[z] |> z-> onehot(z,number_digit_array)
        y == ŷ ? S_training +=1 : nothing
    end
    correct_prediction_training = (S_training/length(training_dataset))*100;
    println("Correct prediction % on the training data: $(correct_prediction_training)%");
end

Correct prediction % on the training data: 53.156848278708004%
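The counting logic above can be sketched compactly in plain Julia by comparing the position of the largest score with the true class index. The function name `accuracy_sketch` and the toy scores are hypothetical, and the sketch assumes labels are stored as integer class indices rather than one-hot vectors.

```julia
# Fraction of samples whose predicted class (argmax of the scores) matches the true class index
function accuracy_sketch(scores::Vector{<:AbstractVector}, truth::Vector{Int})
    correct = count(argmax(s) == t for (s, t) in zip(scores, truth))
    return correct / length(truth)
end

# Four toy samples: the third one is misclassified (predicted class 2, true class 1)
toy_scores = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1], [0.2, 0.2, 0.6]]
toy_truth = [1, 3, 1, 3]
acc = accuracy_sketch(toy_scores, toy_truth)
```

For a three-class problem, a useful point of comparison is the fraction obtained by always predicting the most common class in the data. A model whose accuracy is close to that baseline has learned little beyond the class frequencies.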
