## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Write your code in the **Code cells** and your answers in the **Markdown cells** of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to render the **.ipynb** file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The assignment is worth 100 points, and is due on **17th October 2025 at 11:59 pm**. 

5. **Five points are for properly formatting the assignment**. The breakdown is as follows:
    - Must be an HTML file rendered using Quarto **(1 point)**. *If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.* 
    - No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission.  **(1 point)**
    - There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) **(1 point)**
    - Final answers to each question are written in the Markdown cells. **(1 point)**
    - There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. **(1 point)**

6.  The maximum possible score on the assignment is 100 + 5 = 105 out of 100.

## Objective

- **Reinforce Knowledge of Gradient Descent:** Apply your understanding of gradient descent to classification problems by implementing logistic regression and softmax regression from scratch.
- **Hands-on Implementation:** Build classification models manually to gain deeper insights into their mathematical foundations and working principles.
- **Explore Customization Options:** Learn how implementing models from scratch allows you to:
  - Adjust and optimize model parameters for specific requirements.
  - Add features or constraints that might not be possible with standard libraries.
- **Compare with Pre-built Models:** Use scikit-learn’s logistic regression as a baseline to evaluate the performance and efficiency of your custom implementation. This will help you understand when to use custom models and when to leverage pre-built ones.
- **Prepare for Real-world Scenarios:** Understand the scenarios where off-the-shelf models are not sufficient, allowing you to confidently tackle complex machine learning problems and create novel solutions.


```python
import numpy as np
import matplotlib.pyplot as plt
```

## Sigmoid or Logistic Function
<img align="left" src="https://media.licdn.com/dms/image/D4D12AQGIXdSG7IJCNw/article-cover_image-shrink_600_2000/0/1694183259537?e=2147483647&v=beta&t=OtnfeqwCtKTSVrdKZdyOzNYECyLLZuEUIxkTfTQ0dS0"     style=" width:300px; padding: 10px; " >

As you learned in the earlier course of this sequence, for a classification task we can start from our linear regression model,

 $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot  \mathbf{x}^{(i)} + b$, to predict $y$ given $x$. 

- However, we would like the predictions of our classification model to be between 0 and 1 since our output variable $y$ is either 0 or 1. 
- This can be accomplished by using a "sigmoid function" which maps all input values to values between 0 and 1. 


## Formula for Sigmoid function

The formula for the sigmoid function is:

$$g(z) = \frac{1}{1 + e^{-z}} \tag{1}$$


In the case of logistic regression, $z$ (the input to the sigmoid function) is the output of a linear regression model. 
- In the case of a single example, $z$ is a scalar.
- In the case of multiple examples, $z$ may be a vector consisting of $m$ values, one for each example.
- NumPy has a function called [`exp()`](https://numpy.org/doc/stable/reference/generated/numpy.exp.html), which offers a convenient way to calculate the exponential ($e^{z}$) of all elements in the input array (`z`).
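
The sigmoid formula above can be sketched with `np.exp` as follows. The function name `sigmoid` is an illustrative choice, not something the assignment prescribes.

```python
import numpy as np

def sigmoid(z):
    """Compute g(z) = 1 / (1 + exp(-z)) element-wise for a scalar or array z."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                         # scalar input
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # vector input, one value per example
```

For very negative `z`, `np.exp(-z)` can overflow and emit a runtime warning, although the result still rounds to 0.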

## Logistic Regression
<img align="left" src="./images/C1_W3_LogisticRegression_right.png"     style=" width:300px; padding: 10px; " > A logistic regression model applies the sigmoid to the familiar linear regression model as shown below:

$$ 
f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot \mathbf{x}^{(i)} + b ) \tag{2} 
$$ 

  where

  $$
  g(z) = \frac{1}{1+e^{-z}}\tag{3}
  $$
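
As a small sketch of equation (2), the model output for a whole batch can be computed with one matrix-vector product. The helper name `predict_proba` and the toy values of `X`, `w`, and `b` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Return f_wb(x) = g(w . x + b) for every row of X."""
    return sigmoid(X @ w + b)

# Toy inputs (illustrative only): m = 3 examples, n = 2 features
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5]])
w = np.array([1.0, 1.0])
b = -3.0

f = predict_proba(X, w, b)
print(f)  # one value strictly between 0 and 1 per example
```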

## Logistic Loss Function

Logistic Regression uses a loss function more suited to the task of categorization where the target is 0 or 1 rather than any number. 

>**Definition Note:**   In this course, these definitions are used:  
**Loss** is a measure of the difference between the prediction for a single example and its target value, while the  
**Cost** is a measure of the losses over the whole training set.


The loss is defined as follows: 
* $loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$ is the loss for a single data point, which is:

\begin{equation}
  loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = \begin{cases}
    - \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) & \text{if $y^{(i)}=1$}\\
    - \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) & \text{if $y^{(i)}=0$}
  \end{cases}
\end{equation}


*  $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target value.

*  $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot\mathbf{x}^{(i)}+b)$ where function $g$ is the sigmoid function.

The defining feature of this loss function is that it uses two separate curves: one for the case when the target is zero ($y=0$) and another for when the target is one ($y=1$). Combined, these curves provide the behavior useful for a loss function, namely, being zero when the prediction matches the target and rapidly increasing in value as the prediction differs from the target. Consider the curves below:

<div style="text-align: center;">
    <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTIyEcDUIGx9kLDohkOyjq8X2OkQZbxLoVW3JyEefVtog&s" alt="Logistic loss curves for y = 1 and y = 0" width="400"/>
</div>

Combined, the curves are similar to the quadratic curve of the squared error loss. Note, the x-axis is $f_{\mathbf{w},b}$ which is the output of a sigmoid. The sigmoid output is strictly between 0 and 1.

The loss function above can be rewritten to be easier to implement.
    $$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)$$
  
This is a rather formidable-looking equation. It is less daunting when you consider $y^{(i)}$ can have only two values, 0 and 1. One can then consider the equation in two pieces:  
when $y^{(i)} = 0$, the left-hand term is eliminated:
$$
\begin{align}
loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), 0) &= -(0) \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - 0\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \\
&= -\log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)
\end{align}
$$
and when $y^{(i)} = 1$, the right-hand term is eliminated:
$$
\begin{align}
  loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), 1) &=  -(1) \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - 1\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\\
  &=  -\log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)
\end{align}
$$
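
The equivalence of the piecewise and single-expression forms can be checked numerically. The sketch below uses hypothetical helper names (`loss_piecewise`, `loss_combined`) and a few hand-picked prediction values.

```python
import numpy as np

def loss_piecewise(f, y):
    """Piecewise loss: -log(f) if y == 1, otherwise -log(1 - f)."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

def loss_combined(f, y):
    """Single-expression loss: -y*log(f) - (1 - y)*log(1 - f)."""
    return -y * np.log(f) - (1 - y) * np.log(1.0 - f)

# f is a sigmoid output, so it is strictly between 0 and 1
for f in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(loss_piecewise(f, y), loss_combined(f, y))
        print(f"f={f}, y={y}, loss={loss_combined(f, y):.4f}")
```

The printed values show the loss shrinking as the prediction approaches the target and growing as it moves away.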

With this logistic loss function, a cost function can be produced that incorporates the loss from all the examples. It is described in the next section.

## Cost function

Recall that loss is defined for a single example. Here you combine the losses to form the **cost**, which covers all the examples.


For logistic regression, the cost function is of the form 

$$ J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) \right] \tag{1}$$

where
* $loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$ is the loss for a single data point, which is:

    $$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \tag{2}$$
    
*  $m$ is the number of training examples in the data set and:
$$
\begin{align}
  f_{\mathbf{w},b}(\mathbf{x}^{(i)}) &= g(z^{(i)})\tag{3} \\
  z^{(i)} &= \mathbf{w} \cdot \mathbf{x}^{(i)}+ b\tag{4} \\
  g(z^{(i)}) &= \frac{1}{1+e^{-z^{(i)}}}\tag{5} 
\end{align}
$$
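
A vectorized sketch of the unregularized cost defined above is shown here. The name `compute_cost_logistic` and the toy data are illustrative assumptions, and the regularization term required later in Task 1 is intentionally left out.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_logistic(X, y, w, b):
    """Average logistic loss over all m examples (no regularization)."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                              # predictions, shape (m,)
    losses = -y * np.log(f) - (1 - y) * np.log(1 - f)   # per-example loss
    return losses.sum() / m

# Toy data (made up for illustration): m = 4 examples, n = 2 features
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]])
y = np.array([0, 0, 1, 1])

cost_at_zero = compute_cost_logistic(X, y, np.zeros(2), 0.0)
print(cost_at_zero)
```

With all parameters at zero, every prediction is 0.5, so the cost equals $\log 2 \approx 0.693$ regardless of the data. This makes a convenient sanity check for an implementation.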
 

## Task 1: Implementing L2 Logistic Regression with Vectorized Gradient Descent (10 points)
- Write a function that takes eight inputs: **X_train**, **y_train**, **X_test**, **y_test**, **w_in**, **alpha** (learning rate), **num_iters**, and **lambda_reg**. 
- It should return the optimal parameters and the cost history on both the training and test sets.

Hints:
* Implement a `compute_cost_logistic_ridge` function to calculate the cost.
* Derive the gradient for L2 logistic regression.
* Implement a `compute_gradient_logistic_ridge` function to calculate the gradient.
* Implement a `gradient_descent_logistic_ridge` function.

Note that you don't have to strictly follow these steps.
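
The text above does not spell out the regularized cost. One common convention, stated here as an assumption rather than the required form, adds a squared penalty on the weights (but not the bias) to the cost $J(\mathbf{w},b)$ defined earlier, with $\lambda$ standing for `lambda_reg` and $n$ for the number of features:

$$ J_{\text{ridge}}(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) \right] + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2 $$

Other scalings of the penalty are also in use (for example $\lambda \sum_j w_j^2$ without the $1/(2m)$ factor), so check which one your course materials adopt before deriving the gradient.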


## Task 2: Apply your implementation on a data set (10 points)

*  Read the **heart_disease_classification.csv** file into a pandas DataFrame. Use **random_state=0** to shuffle the data to eliminate any inherent order or bias that may be present.
*  Split the features and the target column into different variables. **(3 points)**
*  Create binary columns from these three categorical columns: `cp`, `thal`, and `slope`. **(4 points)**
*  Use **random_state=42** to split the data into training and test datasets with an 80-20 split. Then, scale the features of both datasets. **(4 points)**
*  Choose initial values for `w_in`, `alpha`, `num_iters`, and `lambda_reg` that you consider reasonable, as long as the model converges.
*  Plot the learning curve of gradient descent on the training and test sets.
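
The preprocessing steps above might look roughly like the sketch below. It runs on a small synthetic DataFrame rather than the real CSV. The column values and the target column name `target` are assumptions made for illustration, so adapt them to the actual file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CSV; values and the "target" column name are made up.
df = pd.DataFrame({
    "age":    [63, 37, 41, 56, 57, 44, 52, 54, 48, 49],
    "cp":     [3, 2, 1, 1, 0, 1, 2, 0, 2, 1],
    "thal":   [1, 2, 2, 2, 2, 3, 3, 2, 2, 3],
    "slope":  [0, 0, 2, 2, 2, 1, 1, 2, 1, 2],
    "target": [1, 1, 1, 1, 0, 0, 1, 0, 1, 0],
})

# Shuffle all rows reproducibly
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Separate features and target
X = df.drop(columns="target")
y = df["target"]

# One binary column per category level of cp, thal, and slope
X = pd.get_dummies(X, columns=["cp", "thal", "slope"], dtype=int)

# 80-20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training data alone is one design choice. It keeps test-set statistics from leaking into training.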

## Task 3: Explore the impact of `lambda_reg` on the dataset and identify the optimal value that provides the best performance (10 points)

`lambda_reg` controls the strength of the regularization applied to the model. When `lambda_reg` is set to zero, regularization is effectively turned off. As `lambda_reg` increases, the penalty for large weights becomes more significant, helping to reduce overfitting. In this task:

* Experiment with different values of `lambda_reg` in the set [0.0, 0.01, 0.03, 0.1, 0.3]. 
* Plot the learning curves for both the training and test sets on the same figure to visualize the impact of each value.
* Determine which value of `lambda_reg` yields the best performance on this dataset.
* Output the performance in terms of `accuracy`, `precision`, and `recall`.

Please use a learning rate of 0.005 (`alpha=0.005`) and 1200 iterations (`num_iters=1200`) for this task.
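
For reference, the three metrics can be computed directly from predicted and true labels. The sketch below uses made-up probabilities, a hypothetical helper `classification_metrics`, and assumes a decision threshold of 0.5, which the assignment does not specify.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# Made-up model outputs, thresholded at 0.5 (assumed threshold)
probs = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1])
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = (probs >= 0.5).astype(int)

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```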

## Task 4: Compare your implementation with the `LogisticRegression` model from `sklearn` to re-evaluate the dataset (10 points)

* Use the `LogisticRegression` model from `sklearn` to re-evaluate the dataset while maintaining the same (or similar) hyperparameter settings for a fair comparison. 
* Report the performance using the same evaluation metrics as previously used.
* Compare the results to your custom implementation. Analyze whether `sklearn`'s built-in logistic regression achieves similar, better, or worse performance.
* Try to explain the potential reasons for any differences observed.  (15 points)
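
A possible starting point for this comparison is sketched below on synthetic data from `make_classification`, not the heart disease file. In scikit-learn, `C` is the inverse of the regularization strength. The mapping `C = 1 / lambda_reg` used here is an assumption that only holds for some conventions of the penalty term, so verify it against your own cost function. The value `lambda_reg = 0.1` is likewise just an example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for the real dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

lambda_reg = 0.1  # hypothetical value; use the one chosen in Task 3
# Assumed mapping between lambda_reg and sklearn's inverse strength C
clf = LogisticRegression(C=1.0 / lambda_reg, max_iter=1200)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
print(f"accuracy={acc:.3f}, precision={prec:.3f}, recall={rec:.3f}")
```

Note that `max_iter` caps the iterations of scikit-learn's solver, which is not plain gradient descent by default, so matching iteration counts does not make the two optimizers equivalent.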

## Task 5: Get to know the `tol` in Sklearn `LogisticRegression` (10 points)

Use data visualization to explore the impact of the `tol` (tolerance) parameter in the logistic regression model. Explain how it affects the model's overall performance.

Hint: Feel free to experiment with different values to see how this hyperparameter affects the model's performance.
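
One way to set up such an exploration is sketched below, again on synthetic data. The grid of `tol` values is an arbitrary choice, and the two plotted quantities (solver iterations from the fitted model's `n_iter_` attribute, and test accuracy) are only one possible pair of things to look at.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the real dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tols = [1e-6, 1e-4, 1e-2, 1e-1, 1.0]  # arbitrary grid for illustration
n_iters, test_accs = [], []
for tol in tols:
    clf = LogisticRegression(tol=tol, max_iter=1000).fit(X_train, y_train)
    n_iters.append(int(clf.n_iter_[0]))          # iterations the solver ran
    test_accs.append(clf.score(X_test, y_test))  # mean accuracy on the test set

fig, (ax_iter, ax_acc) = plt.subplots(1, 2, figsize=(9, 3.5))
ax_iter.semilogx(tols, n_iters, marker="o")
ax_iter.set_xlabel("tol")
ax_iter.set_ylabel("solver iterations")
ax_acc.semilogx(tols, test_accs, marker="o")
ax_acc.set_xlabel("tol")
ax_acc.set_ylabel("test accuracy")
fig.tight_layout()
plt.show()
```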

## Task 6: Summarize your findings below (10 points)