# Lab 8d: Building a Linear Regression Sarchasm Classifier
In this lab, students will create a two-class linear classifier to predict whether a news headline is sarcastic or not. We'll estimate the parameters of this classifier using the ordinary least squares with and without regularization.  

### Learning objectives and tasks
* __Prerequisites__: To save some time, we'll load the saved file from the `SarcasmSamplesTokenizer` example from `week-4` using [the `load(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). We'll then use this data in the subsequent calculations.
* __Task 1__: Compute the expected value of the parameters $\beta$ without regularization. In this task, we estimate the parameters of a linear regression model that maps the text token sequences to the headline labels without regularization.
* __Task 2__: Compute the expected value of the parameters $\beta$ with regularization. In this task, we'll explore the effect of regularization on our ability to classify news headlines as sarcastic or not sarcastic.

## Setup
We set up the computational environment by including [the `Include. jl` file](Include.jl) using [the `include(...)` method](https://docs.julialang.org/en/v1/base/base/#Base.include). The [`Include.jl` file](Include.jl) loads external packages and functions we will use in these examples. 
* For additional information on functions and types used in this example, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

In [3]:
include("Include.jl");

## Prerequisites
To save some time, we'll load the saved file from the `SarcasmSamplesTokenizer` example using [the `load(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). To load the `jld2` (binary) saved file, we pass the path to the file we want to load the [`load(...)` function](https://github.com/JuliaIO/FileIO.jl). This call returns the data as a [Julia `Dict` type](https://docs.julialang.org/en/v1/base/collections/#Base.Dict). 
* Let's set the path to the save file in the `path_to_save_file::String` variable.

In [5]:
path_to_save_file = joinpath(_PATH_TO_DATA, "L4a-SarcasmSamplesTokenizer-SavedData.jld2");

Then we load the `jld2` file using [the `load(...)` method](https://juliaio.github.io/FileIO.jl/stable/reference/#FileIO.load), where the contents of the file are stored in the `saved_data_dictionary::Dict{String, Any}` variable. 
* We saved the `corpusmodel::MySarcasmRecordCorpusModel` instance, which holds the other interesting data, e.g., the `tokendictionary.` Thus, we can get (most) of everything we need from the `corpusmodel.`

In [7]:
saved_data_dictionary = load(path_to_save_file);

# pull data from the saved_data_dictionary -
corpusmodel = saved_data_dictionary["corpus"];

tokendictionary = corpusmodel.tokens;
inversetokendictionary = corpusmodel.inverse;
number_of_records = saved_data_dictionary["number_of_records"];

# compute some stuff need for later -
number_of_tokens = tokendictionary |> length; # size of the token dictionary

### Compute the maximum pad length
Not every headline has the same length, but we want the token vectors to have the same size. Thus, we'll find the longest vectors in the dataset and pad the token vectors to that length. To do that, let's iterate through each headline, compute its size, and then save this length if it is longer than we've seen before.

In [9]:
max_pad_length = 0; # initialize: we have 0 length
for i ∈ 1:number_of_records
    test_record_length = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens) |> length; # tokenize, and calc the number of tokens
    if (test_record_length > max_pad_length)
        max_pad_length = test_record_length; # we've found a new longest headline!
    end
end
max_pad_length

151

### Compute the vector representation of all headline samples
Now that we have found the `max_pad_length::Int64`, we can tokenize all records using the `max_pad_length::Int64` value as the `pad` value in [the `tokenize(...)` method](src/Compute.jl). 
* We'll use `right-padding` and will store the tokenized records for each headline in the `token_record_dictionary::Dict{Int64, Array{Int64,1}}` dictionary, where the keys of this dictionary are the record indexes and the values of the tokenized records (which are of type `Array{Int64,1}.`)

In [11]:
token_record_dictionary = Dict{Int64, Array{Int64,1}}();
for i ∈ 1:number_of_records
    
    v = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens, 
            pad = max_pad_length); 
    token_record_dictionary[i] = v;
end
token_record_dictionary[1]

151-element Vector{Int64}:
 26617
 23295
 27980
  8295
  5553
 18533
 12047
 15828
   913
   913
   913
   913
   913
     ⋮
   913
   913
   913
   913
   913
   913
   913
   913
   913
   913
   913
   913

### Compute the vector representation of all labels
Finally, let's compute the label dictionary. We'll store the labels in the `label_record_dictionary::Dict{Int64, Int64}` dictionary, where the keys of this dictionary are the record indexes, and the values are the labels: `sarcastic = 1`, while `not sarcastic = -1`.

In [13]:
label_record_dictionary = Dict{Int64, Int64}();
for i ∈ 1:number_of_records
    label = corpusmodel.records[i].issarcastic;
    if (label == true)
        label_record_dictionary[i] = 1;
    else
        label_record_dictionary[i] = -1;
    end
end
label_record_dictionary;

### Partition the data matrix $\mathbf{X}$ and the output $\mathbf{Y}$
We can now put the token vectors and the labels into a data matrix $\mathbf{X}$ and an output vector $\mathbf{Y}$.
* The data matrix $\mathbf{X}$ will be a `number_of_records` $\times$ `max_pad_length` array holding `Int64` values (the token indexes). On the other hand, the output (label) vector will be a `number_of_records`$\times$`1` column vector, holding the label for each record.

In [15]:
X,Y = let

    X = Array{Int64,2}(undef, number_of_records, max_pad_length);
    Y = Array{Int64,1}(undef, number_of_records);
    
    for i ∈ 1:number_of_records
        Y[i] = label_record_dictionary[i]; # get the label (output value)

        tokens = token_record_dictionary[i];
        for j ∈ 1:max_pad_length
            X[i,j] = tokens[j];
        end
    end
    X,Y
end;

Finally, let's partition the data into a `training` and `testing` set so that we can determine how well the model can predict unseen data, i.e., how well the model `generalizes.`

In [17]:
fraction = 0.80
(X_train, X_test, y_train, y_test) = partition(X, Y; trainfraction = fraction); # this is a *random* split

## Task 1: Compute the expected value of the parameters $\beta$ without regularization
In this task, we estimate the parameters of a linear regression model that maps the text token sequences to the headline labels. 
We know that the `data matrix` $\mathbf{X}$ is `overdetermined,` i.e., $m>n$ (more equations than unknowns). Thus, we are solving the minimization problem for an unknown parameter estimates $\hat{\beta}$:
$$
\begin{equation*}
\hat{\mathbf{\beta}} = \arg\min_{\mathbf{\beta}} ||~\mathbf{y} - \mathbf{X}\cdot\mathbf{\beta}~||^{2}_{2}
\end{equation*}
$$
where $||\star||^{2}_{2}$ is the square of the p = 2 vector norm. Then, the value of the unknown parameter vector $\mathbf{\beta}$ that minimizes the sum of the squares loss function for an overdetermined system is given by:
\begin{equation*}
\hat{\mathbf{\beta}} = \left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{y} - \left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{\epsilon}
\end{equation*}
The matrix $\mathbf{X}^{T}\mathbf{X}$ is called the normal matrix, while $\mathbf{X}^{T}\mathbf{y}$ is called the moment matrix. The __expectation__ removes the error term.

### Check: Does the inverse of the normal matrix exist?
Before we estimate the model parameters $\hat{\beta}$, let's check if the normal matrix $\mathbf{X}^{T}\mathbf{X}$ is invertible by calling [the `rank(...)` function exported by LinearAlgebra.jl](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.rank).
* If the normal matrix $\mathbf{X}^{T}\mathbf{X}$ has full rank, we can continue. However, if it does not, we must consider an alternative approach. Do this test [using the @assert macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert).

In [20]:
normal_matrix = transpose(X)*X;
@assert rank(normal_matrix) == max_pad_length

LoadError: AssertionError: rank(normal_matrix) == max_pad_length

### Singular Value Decomposition
Alternatively, we can compute the expected value of the parameters using  [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition). Let the [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) of the $n\times{p}$ data matrix $\mathbf{X}$ be given by:
$$
\begin{equation}
\mathbf{X} = \mathbf{U}\cdot\mathbf{\Sigma}\cdot\mathbf{V}^{T}
\end{equation}
$$
where $\mathbf{U}$ is an orthogonal matrix, $\mathbf{\Sigma}$ is a diagonal singular value matrix,
and $\mathbf{V}$ is an orthogonal matrix. Then, the regularized least-squares estimate of the unknown parameter vector $\mathbf{\theta}$ is given by:
$$
\begin{equation}
\hat{\mathbf{\theta}}_{\lambda} = \Bigl[\left(\mathbf{\Sigma}^{T}\mathbf{\Sigma}+\lambda\mathbf{I}\right)\mathbf{V}^{T}\Bigr]^{-1}\cdot\mathbf{\Sigma^{T}}\cdot\mathbf{U}^{T}\cdot\mathbf{y}
\end{equation}
$$

In [22]:
β̂ = let
    
    (U,d,V) = svd(X_train);
    Σ = diagm(d);
    IM = diagm(ones(max_pad_length)); # we have max_pad_length parameters -

    # compute θ̂ -
    M = (transpose(Σ)*Σ)*transpose(V);
    β̂ = inv(M)*transpose(Σ)*transpose(U)*y_train;
end;

In [23]:
β̂

151-element Vector{Float64}:
 -5.560106434815856e-6
 -6.0371704317631966e-6
 -1.0937902426433418e-6
 -1.7285330950072338e-6
  9.898804545293222e-7
  1.2146227706635256e-6
 -2.428556872274477e-6
 -1.7354398127010959e-6
 -4.2780050388636935e-6
 -1.9861752296580188e-6
 -2.622158153985228e-6
  1.9679296468266063e-6
  6.288858502536041e-6
  ⋮
 -1.3372930761239743e9
 -4.26605938164181e9
  2.9019561832030244e9
 -1.7213787695795991e9
 -3.129596583136175e7
 -9.878952070559653e8
  4.5146078035189104e8
 -4.799696932635876e9
 -2.1830751465770736e9
  2.7095469942950583e9
 -4.975883629941494e9
  3.434164501223366e9

### Compute the fraction correctly classified
The parameters $\hat{\beta}$ can now be used to compute what the model says the labels should be, i.e., whether a story is sarcastic or not. Let's calculate the `y_model_train::Array{Float64,1}` array, which holds the model estimated labels for the training data:

In [25]:
y_model_train = X_train*β̂;

Hmmmm. If a bunch of floating point values. Let's propose a mapping between these values and the sarcasm labels:
* if $\hat{y}_{i}\leq{0}$: the news headline is __not sarcastic__. Otherwise, for $\hat{y}_{i}>0$ we classify the news headline as sarcastic.

Let's implement this idea and compute the fraction of correctly labeled headlines for the training and testing datasets. We store the fraction of correct labels for the training data in the `f_train::Float64` variable and the fraction of correct labels for the testing data in the `f_test::Float64` variable:

In [27]:
f_train, f_test = let

    # compute what the model says the labels should be
    y_model_train = X_train*β̂;
    y_model_test = X_test*β̂;
    number_of_training_samples = length(y_model_train);
    number_of_testing_samples = length(y_model_test);
    f_train = 0.0;
    f_test = 0.0;

    # -- TRAINING --------------------------------------------- #
    N₊ = 0;
    for i ∈ 1:number_of_training_samples
        
        label = 1; # default sarcasm
        if (y_model_train[i] ≤ 0)
            label = 0;
        end
        
        if (label == y_train[i])
           N₊ += 1;
        end
    end
    f_train = N₊/number_of_training_samples;
    # --------------------------------------------------------- #

    # -- TESTING ---------------------------------------------- #
    N₊ = 0;
    for i ∈ 1:number_of_testing_samples
        
        label = 1; # default sarcasm
        if (y_model_test[i] ≤ 0)
            label = 0;
        end
        
        if (label == y_test[i])
           N₊ += 1;
        end
    end
    f_test = N₊/number_of_testing_samples;
    # --------------------------------------------------------- #
    
    # return -
    f_train, f_test
end

(0.14976415094339623, 0.14223309453084046)

## Task 2: Compute the expected value of the parameters $\beta$ with regularization
In this task, we'll explore the effect of regularization on our ability to classify news headlines as sarcastic or not sarcastic. 
Let's see what happens when we add a regularization parameter. If we use `ridge` regularization, i.e., we add a $||\,\beta\,||_{2}^{2}$ term to the objective function:
$$
\begin{equation*}
\hat{\mathbf{\beta}} = \arg\min_{\mathbf{\beta}} ||~\mathbf{y} - \mathbf{X}\cdot\mathbf{\beta}~||^{2}_{2} + \lambda\cdot{||\,\beta\,||_{2}^{2}}
\end{equation*}
$$

where $\lambda\geq{0}$ is called a `regularization` parameter. This problem has an analytical solution of the form:
\begin{equation*}
\hat{\mathbf{\beta}} = \left(\mathbf{X}^{T}\mathbf{X}+\lambda\cdot\mathbf{I}\right)^{-1}\mathbf{X}^{T}\mathbf{y} - \left(\mathbf{X}^{T}\mathbf{X}+\lambda\cdot\mathbf{I}\right)^{-1}\mathbf{X}^{T}\mathbf{\epsilon}
\end{equation*}
The __expectation__ removes the error term. Let's set a value for the regularization parameter $\lambda\geq{0}$

In [29]:
λ = 100.0; # select a value

### Check: Does regularization fix the rank issue?
Before we estimate the model parameters $\hat{\beta}$, let's check if the regularized normal matrix $\mathbf{X}^{T}\mathbf{X}+\lambda\cdot\mathbf{I}$ is invertible by calling [the `rank(...)` function exported by LinearAlgebra.jl](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.rank).
* If the regularized normal matrix $\mathbf{X}^{T}\mathbf{X}+\lambda\cdot\mathbf{I}$ has full rank, we can continue. However, if it does not, we must consider an alternative approach. Do this test [using the @assert macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert).

In [31]:
IM = diagm(ones(max_pad_length));
regularized_normal_matrix = transpose(X_train)*X_train + λ*IM;
@assert rank(regularized_normal_matrix) == max_pad_length

Wow! That is cool. Let's compute the expected value of the regression parameters $\hat{\beta}$ by inverting the regularized normal matrix:

In [33]:
β̂_reg = let
    M = transpose(X_train)*X_train + λ*IM;
    β̂_reg = inv(M)*transpose(X_train)*y_train
end;

In [34]:
β̂_reg

151-element Vector{Float64}:
 -5.93592616986123e-6
 -5.353346459971599e-6
 -5.05217221697683e-7
 -1.496124563166532e-6
  8.700359051415233e-7
  5.099607899736908e-7
 -1.2619829702993616e-6
 -2.0105457554749383e-6
 -4.409290668348094e-6
 -2.194810986647522e-6
 -2.2422691095262753e-6
  2.030717171782424e-6
  6.206132842742174e-6
  ⋮
  8.483468587594203e-6
  9.153058872127902e-6
  8.256795549318286e-6
  7.413361259661839e-6
  6.349108596135993e-6
  5.233346953255753e-6
  6.5147984636412055e-6
  5.852345743404349e-6
  5.725657585814404e-6
  5.227943753587794e-6
  5.7853367176147685e-6
  6.5710384212248896e-6

### Compute the fraction correctly classified for the regularized case
The parameters $\hat{\beta}$ can now be used to compute what the model says the labels should be, i.e., whether a story is sarcastic or not. Let's calculate the `y_model_train::Array{Float64,1}` array, which holds the model estimated labels for the training data for the regularized parameters.
* We store the fraction of correct labels for the training data in the `f_train_reg::Float64` variable and the fraction of correct labels for the testing data in the `f_test_reg::Float64` variable:

In [36]:
f_train_reg, f_test_reg = let

    # compute what the model says the labels should be
    y_model_train = X_train*β̂_reg;
    y_model_test = X_test*β̂_reg;
    number_of_training_samples = length(y_model_train);
    number_of_testing_samples = length(y_model_test);
    f_train = 0.0;
    f_test = 0.0;

    # -- TRAINING --------------------------------------------- #
    N₊ = 0;
    for i ∈ 1:number_of_training_samples
        
        label = 1; # default sarcasm
        if (y_model_train[i] ≤ 0)
            label = 0;
        end
        
        if (label == y_train[i])
           N₊ += 1;
        end
    end
    f_train = N₊/number_of_training_samples;
    # --------------------------------------------------------- #

    # -- TESTING ---------------------------------------------- #
    N₊ = 0;
    for i ∈ 1:number_of_testing_samples
        
        label = 1; # default sarcasm
        if (y_model_test[i] ≤ 0)
            label = 0;
        end
        
        if (label == y_test[i])
           N₊ += 1;
        end
    end
    f_test = N₊/number_of_testing_samples;
    # --------------------------------------------------------- #
    
    # return -
    f_train, f_test
end

(0.16024633123689727, 0.1493971693167919)

## Hmmm. Both of these approaches give lousy performance. 