# 1. Healthcare Quality

In [18]:
using DataFrames, CSV
df = CSV.read("Quizhpc_2_Health Care Quality Assessment.csv", DataFrame)

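# Features as a (features × samples) matrix: every column except the last
xs = permutedims(Matrix{Float64}(df[:, 1:end-1]))
# Labels (PoorCare) as a 1 × samples row
ys = permutedims(df.PoorCare)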

# Normalization: standardize each feature (row) to zero mean and unit variance
using Statistics: mean, std
xs .-= mean(xs, dims=2)
xs ./= std(xs, dims=2)

13×131 Matrix{Float64}:
 -1.71231    -1.68596   -1.65962    …   1.65962      1.68596     1.71231
 -0.532913   -0.336813  -0.532913       2.40859     -0.532913    5.35009
 -0.69028    -0.228919  -0.69028        4.38469      0.232441   -0.228919
  0.525609   -0.796403  -0.90657       -0.90657      0.0849384   0.96628
 -0.367785   -0.367785  -0.161888       0.455802    -0.367785   -0.161888
  0.911436   -0.253186   0.911436   …  -1.09041     -0.959389   -0.103968
 -0.272174   -0.761938  -0.272174       3.89082     -0.517056    1.09916
  0.0453112  -0.767808  -1.01174        1.10237     -0.117313    2.89123
 -0.218337    0.220571  -0.584094       2.34196      0.14742    -0.291489
  1.72245    -0.839293  -0.562348      -0.00845635  -0.0776928   2.06864
  0.863663   -0.304077   0.0560669  …   1.33294     -0.0530678   0.503519
 -0.218251   -0.218251  -0.218251      -0.218251    -0.218251   -0.218251
 -0.385867   -0.24267    0.330119       0.0437243   -0.24267     1.4757
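
As a quick sanity check on the normalization step, the sketch below applies the same two broadcasts to a small made-up matrix (the values in `demo` are hypothetical, not taken from the dataset) and confirms that each row ends up with mean 0 and standard deviation 1.

```julia
using Statistics: mean, std

# Hypothetical 2×4 matrix: rows are features, columns are samples,
# matching the layout of `xs` above.
demo = [1.0 2.0 3.0 4.0;
        10.0 20.0 30.0 40.0]

demo .-= mean(demo, dims=2)   # subtract each row's mean
demo ./= std(demo, dims=2)    # divide by each row's sample standard deviation

row_means = vec(mean(demo, dims=2))
row_stds = vec(std(demo, dims=2))
println(row_means)
println(row_stds)
```

Note that a feature with zero variance would be divided by 0 here and turn into `NaN`, so constant columns would need to be dropped or handled separately.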

In [19]:
using Flux
# Logistic regression: a single dense layer with a sigmoid output
model = Flux.Dense(size(xs, 1) => 1, σ)
loss(x, y) = Flux.binarycrossentropy(model(x), y)
optimiser = Flux.Descent()
parameters = Flux.params(model)
# Full-batch gradient descent for 10000 steps
for _ in 1:10000
    Flux.train!(loss, parameters, [(xs, ys)], optimiser)
end

In [20]:
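# Compute confusion-matrix counts and precision/recall/F1 from Boolean
# predictions and 0/1 labels (both of the same shape).
function confusion_matrix(predictions, ys)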
    predicted_positives = sum(predictions)
    predicted_negatives = length(predictions) - predicted_positives
    actual_positives = sum(ys)
    actual_negatives = length(ys) - actual_positives

    true_positives = sum(predictions .& ys)
    false_positives = predicted_positives - true_positives
    true_negatives = sum(.~predictions .& .~ys)
    false_negatives = predicted_negatives - true_negatives

    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    f1score = 2 * precision * recall / (precision + recall)

    return (true_positives, true_negatives, false_positives, false_negatives,
        precision, recall, f1score)
end

predictions = model(xs) .> 0.5
(true_positives, true_negatives, false_positives, false_negatives,
    precision, recall, f1score) = confusion_matrix(predictions, ys)

println("True Positives = $true_positives\nTrue Negatives = $true_negatives\n"
    * "False Positives = $false_positives\nFalse Negatives = $false_negatives\n")
println("Precision = $precision\n"
    * "Recall = $recall\n"
    * "F1 Score = $f1score")

True Positives = 16
True Negatives = 91
False Positives = 7
False Negatives = 17

Precision = 0.6956521739130435
Recall = 0.48484848484848486
F1 Score = 0.5714285714285715
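
The reported metrics can be reproduced directly from the four counts above. The following sketch recomputes them and also derives the overall accuracy, which the cell does not print; the short variable names are chosen here for illustration only.

```julia
# Counts copied from the output above.
tp, tn, fp, fn = 16, 91, 7, 17

prec     = tp / (tp + fp)                     # 16 / 23
rec      = tp / (tp + fn)                     # 16 / 33
f1score  = 2 * prec * rec / (prec + rec)
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 107 / 131

println("Precision = $prec")
println("Recall = $rec")
println("F1 Score = $f1score")
println("Accuracy = $accuracy")
```

Because only about half of the actual positives are recovered, the accuracy is noticeably more flattering than the recall, which is why precision, recall, and F1 are reported for this imbalanced dataset.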


# 2. Logistic Regression Models
## a.
It does not matter for model 1: with no bias term, the logit $\mathbf{w}^T x$ at point 3 is 0 for every choice of weights, so the model's output there is fixed at $P(Y=1) = \sigma(0) = 0.5$ and the label of point 3 cannot influence the learned weights. In model 2, however, changing the bias changes the prediction for point 3, so the labeling does matter for model 2.
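
To illustrate the argument, the sketch below assumes that point 3 sits at the origin (an assumption, since the problem statement is not reproduced here) and evaluates a bias-free model and a model with bias for a few arbitrary weights. The function names `model1` and `model2` are hypothetical stand-ins for the two models.

```julia
# Logistic (sigmoid) function.
sigmoid(z) = 1 / (1 + exp(-z))

# ASSUMPTION: point 3 is the origin; its coordinates are not given in this answer.
x3 = [0.0, 0.0]

# Model 1 (no bias) and model 2 (with bias w0), written as plain functions.
model1(w, x) = sigmoid(w' * x)
model2(w0, w, x) = sigmoid(w0 + w' * x)

# Arbitrary weights, chosen only for illustration.
weights = [[1.0, -2.0], [50.0, 3.0], [-7.5, 0.1]]

p1 = [model1(w, x3) for w in weights]                          # same value for every w
p2 = [model2(w0, weights[1], x3) for w0 in (-2.0, 0.0, 2.0)]   # varies with the bias

println(p1)
println(p2)
```

Under that assumption, the bias-free output at point 3 never moves away from 0.5, while the bias alone can push the second model's output to either side of 0.5.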
## b.
If we define $\mathcal{L}$ as the cost function,
$$\mathcal{L}(\mathbf{w}, \lambda; x^{(i)}, y^{(i)}) \approx \sum_i y^{(i)} \mathbf{w}^T x^{(i)}
-\frac{\lambda}{2}\|\mathbf{w}\|^2.$$
Since the cost function is the same for every component of $\mathbf{w}$,
$$\frac{\partial\mathcal{L}}{\partial\mathbf{w}} = \sum_i y^{(i)} x^{(i)} - \lambda\mathbf{w}$$
(keep in mind that each $x^{(i)}$ is also a vector). Setting the derivative to zero yields the optimal $\mathbf{w}$ (strictly, a regularized estimate rather than the plain MLE, because of the $\lambda$ penalty):
$$\mathbf{w} = \frac{\sum_i y^{(i)} x^{(i)}}{\lambda}$$
Therefore, the magnitude of $\mathbf{w}$ shrinks as $\lambda$ increases; it scales as $1/\lambda$.
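
A small numeric sketch of this closed form, using made-up data (the points in `X`, the labels in `y`, and the $\lambda$ values are all hypothetical), shows the norm of $\mathbf{w}$ shrinking as $\lambda$ grows.

```julia
using LinearAlgebra: norm

# Hypothetical data: each column of X is one point x⁽ⁱ⁾; labels are ±1.
X = [1.0 0.0 -1.0;
     2.0 1.0  0.5]
y = [1.0, 1.0, -1.0]

# Closed-form maximizer of the approximate objective: w = (Σᵢ y⁽ⁱ⁾ x⁽ⁱ⁾) / λ.
w_opt(λ) = (X * y) / λ

lambdas = [0.1, 1.0, 10.0]
norms = [norm(w_opt(λ)) for λ in lambdas]
println(norms)
```

Each tenfold increase in $\lambda$ divides the norm by ten, consistent with the $1/\lambda$ dependence in the formula.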

# Short Answer Questions
## 3.
FALSE; logistic regression is a method used for classification, not regression.
## 4.
TRUE; the logistic regression loss function can be used to train a neural network with sigmoid function nodes.
## 5.
b. Maximum Likelihood; least squares is used in linear regression, and Jaccard distance is completely irrelevant here.
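
To make the link to maximum likelihood concrete, the sketch below (with hypothetical predicted probabilities and 0/1 labels) checks that the binary cross-entropy summed over the samples equals the negative log of the Bernoulli likelihood, so minimizing one is the same as maximizing the other.

```julia
# Hypothetical predicted probabilities P(Y=1 | x) and 0/1 labels.
p = [0.9, 0.2, 0.7, 0.4]
y = [1, 0, 1, 1]

# Bernoulli likelihood of the labels under the predicted probabilities.
likelihood = prod(p .^ y .* (1 .- p) .^ (1 .- y))

# Binary cross-entropy summed over the samples.
cross_entropy = -sum(y .* log.(p) .+ (1 .- y) .* log.(1 .- p))

println(-log(likelihood))
println(cross_entropy)
```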