![](deep.jpg)

It's time to dive into previously explored topics a bit more **deeply**. Time to try to understand what's actually going on.

These posts help me recall the thought process I had at the time of learning the topic. They **may not be easily digestible for other people 🗺️**, but feel free to read on.

Up till now, I've speed-run topics without understanding many of the details.

I like to get the code running at least and produce results to expectation before getting in the weeds. At least I know the code works so I don't end up heading towards doom 🌚. 

In these deeper dives:    

- Let's hope to do these concepts justice (I'll try to get feedback from professionals at some point). Feel free to correct my understanding! (I'm sure some parts aren't on the ball 👎🍙).  

- The concepts and math are brand new to me (or were last done a decade ago). These posts help me express things in my own way, which is usually the very lay way. They are pretty raw thoughts at each step of the concepts and ideas I'm trying to understand; Con-steps? Step-deas? (Sorry ❄️).  

- Also, without writing them down, I'll end up forgetting most of how and what I thought to conceive these concepts (I need to improve my vocabulary), i.e. it will be as if these thoughts had never happened (lost and irretrievable from the black hole that is my mind ⚫).  

## 1. Introduction  
### 1.1 The Mission

![](mission.jpg)

**[Mission]**:  

- Create a `model` that takes in some `inputs` and provides a reasonable `output` (prediction).

**[Method]**:  


- A `mathematical function` is a form of a model:
    - takes in inputs `x1, x2, ...` and 
    - `transforms` it into a single output `y = F(x1, x2, ...)`. 

This seems like a good approach to tackle the mission.

### 1.2 The Mathematical Model
`Create a Linear Model` (or mathematical function) given some `(real world) data` (data that represents something we want to predict given similar input) in the form $$y = a_1x_1 + a_2x_2 + ... + a_nx_n$$
It is called `Linear` because each input `x` appears with degree 1.  

- Each `input` (`x1, x2, ..., xn`) is multiplied by its corresponding `coefficient` or `parameter` (`a1, a2, ..., an`), i.e.: $$a_1x_1 + a_2x_2 + ...$$ 
    - The `input` variables (the `x`'s) are like features or characteristics of our model.
    - The `coefficients` (known as `parameters` in machine learning talk) scale the features/inputs. Scaling the inputs is like finding out which input/feature matters more or less in determining the output. 
- The output `y` (known as the `prediction`) is calculated by `summing all the scaled inputs` together: `a1x1 + a2x2 + ...`: $$F(x) = y = a_1x_1 + a_2x_2 + ... + a_nx_n$$
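
To make the "scale each input, then add everything up" idea concrete, here is a minimal sketch in plain Python. The function name `linear_model` and the example numbers are made up purely for illustration; they are not from any dataset.

```python
def linear_model(coefficients, inputs):
    """Return y = a_1*x_1 + a_2*x_2 + ... + a_n*x_n."""
    if len(coefficients) != len(inputs):
        raise ValueError("need exactly one coefficient per input")
    # scale each input by its coefficient, then sum the scaled inputs
    return sum(a * x for a, x in zip(coefficients, inputs))

# Hypothetical values, chosen only to keep the arithmetic easy to follow
coefficients = [2.0, -1.0, 0.5]   # a1, a2, a3
inputs = [3.0, 4.0, 10.0]         # x1, x2, x3

prediction = linear_model(coefficients, inputs)
print(prediction)  # 2*3 + (-1)*4 + 0.5*10 = 7.0
```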

### 1.2.1 Layman's terms (mathematical model):  
- If a coefficient $a_1, a_2 ...$ is **almost zero**, it's like saying its input does not impact the prediction value (remember we are adding up scaled versions of inputs to calculate the output / prediction). That's like saying the input attached to this particular parameter is perhaps an unimportant variable (or feature)! 
    - Perhaps we can get rid of it altogether? 
    - Fewer variables, fewer calculations, less overhead (and less work!) and a simpler model without losing significant predictive power.
- If it has a large coefficient, then it would impact the overall sum, and hence the prediction. We probably shouldn't disregard this parameter.
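
As a rough sanity check of the "almost zero coefficient" idea, the sketch below compares a prediction with and without a feature whose coefficient is tiny. All numbers are hypothetical, and it quietly assumes the inputs are on comparable scales (a tiny coefficient on a huge input could still matter).

```python
def weighted_sum(coefficients, inputs):
    # same idea as the linear model: scale each input, then add up
    return sum(a * x for a, x in zip(coefficients, inputs))

# Hypothetical model where the second coefficient is almost zero
coefficients = [2.0, 0.001, -1.5]
inputs = [3.0, 4.0, 2.0]

full = weighted_sum(coefficients, inputs)

# Same model with the near-zero feature dropped entirely
reduced = weighted_sum([2.0, -1.5], [3.0, 2.0])

change = abs(full - reduced)
print(full, reduced, change)  # the prediction barely moves
```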

### 1.3 Simple Examples (of mathematical models)
Two simple examples (single input `x` and a single output `y`) can be visualised on the `2D-plane`:

- A `straight line`, where the input `x` is on the horizontal axis and the output `F(x)` (or `y`) is on the vertical axis, has the formula we all know and love: $$ F(x) = mx + b $$

And similarly,  

- A `quadratic curve` (a parabola) is: $$F(x) = ax^2 + bx + c$$

**Note**: ***Our model should `generalise a mathematical function` (or find out the core characteristics and relationships) given our set of data, so that once we know that, we can predict unseen data with similar traits.***

For a quadratic, what combination of $a$, $b$ and $c$ will help us closely predict the value we want for some random $x$, and then for another $x$? In fact, how about for all future values of $x$, so that we have good predictions on average across all of them? That's what we are trying to do!

## 2 Steps  

### 2.1 [Step 1]: Choose (Create) the General Quadratic Equation/Model  

Assume a quadratic equation: `F(x) = ax^2 + bx + c` can help us model some real world phenomena (e.g. throwing a ball or driving then stopping).  

![](quad_real_world.jpg)

### 2.2 [Step 2]: Make Predictions!

Okay, I wish it was that easy, ***but what does making predictions mean and how do you do it***?

Actually, in this context, it's as simple as plugging the different input values and coefficients (`a,b,c,x`) into our quadratic equation to see what output value (`y`) we get.    

In other words, calculate the `F(x)` (predictions) by using different combinations of our inputs `x` and parameters `a,b,c` with the equation `F(x) = ax^2 + bx + c`. Simple!
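
Here is a small sketch of what "plugging in" looks like for two different parameter combinations over the same handful of x-values. The name `quad` and the chosen x-values are just for illustration.

```python
def quad(a, b, c, x):
    """F(x) = a*x^2 + b*x + c"""
    return a * x**2 + b * x + c

xs = [-2, -1, 0, 1, 2]  # a few hand-picked inputs

# Same inputs, two different guesses for (a, b, c)
preds_111 = [quad(1, 1, 1, x) for x in xs]
preds_321 = [quad(3, 2, 1, x) for x in xs]

print(preds_111)  # [3, 1, 1, 3, 7]
print(preds_321)  # [9, 2, 1, 6, 17]
```

Same `x`, different parameters, different predictions: which set is "better" depends on the data we compare against.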

Wait, no, but how do we know which `input variables x` and which combinations of `coefficients a,b,c` to use? Surely not just anything random? Kinda yes, but also no.

So in order to make useful predictions, we should have a **systematic way** to decide what ***`starting`*** coefficients and variables to use.

#### 2.2.1 **LONG layman's take** on using good ***starting*** coefficients and variables  

A `Good` set of inputs: 

- Say we want to produce a model that guesses whether a passenger on the Titanic **Survived** or **Didn't Survive**.  
    - A pretty good set of input variables would be data on actual passengers of the Titanic and on whether they survived or not.   
    - A model could learn the characteristics of these passengers and then adjust the parameters, say for `gender` or `age`, to see if the prediction gets better or worse, since the survival of each passenger is already known.

A `Bad` set of inputs:

- would perhaps be characteristics of passengers that probably wouldn't impact their survivability, like:   
    - `their favourite colour` or   
    - `their dominant hand` or  
    - `ear, finger, arm, or toe lengths` (within each group of adults and kids)

An `even worse` set of inputs (moving towards the bad `validation set` territory, more on this later).  

- would be getting data from unrelated sources (and sorry to state the obvious), such as:   
    - the cohort of 2015 Applied Finance Graduates from Macquarie University, Sydney,  
    - a group of people picked at random off the streets of Binh Thanh, Saigon, Vietnam,  
    - a group of online matches from a popular dating app obtained by a user.  

These groups technically survived the Titanic, so the model would conclude that every one of their characteristics is useful, which isn't a great predictor (it'll predict that being born after the Titanic sank gives you a 100% chance of surviving the Titanic 🤭).

For coefficients, I've heard it's ***more of an art than a science*** to choose good starting parameter values (although I assume there are some cool algorithms that will give us a hand later on).    

- Logically, it would make sense that initially **inputs dominate parameters** (input values are large relative to their coefficients) and then they're scaled by a **parameter** according to importance through the neural network.  
- This means: start with **relatively low-value parameters** (relative to their input variables) and let these coefficients adjust (`learn` through `gradient descent`) to larger values, whether positive or negative.  
- In other words, start with coefficients **about zero**.   
- Okay, that kinda makes sense, but let's talk through an example:   
    - Let's assume we have a parameter that should be relatively large and positive (e.g. being a female or a child in first class on the Titanic are probably two good features for survivability).  
    - If the parameter starts off large and negative, the model takes a lot of iterations to learn and adjust the parameter to be positive. And vice versa.  
    - So if we set them all near zero, it will simply take `less time on average` for each parameter to get to where it needs to go (`optimisation?`). We don't know which parameters matter, so from our point of view each one could end up positive or negative (***assuming there is no super obvious characteristic that would dominate a model, in which case maybe it's better to start with a larger value? Perhaps that isn't a data science way to do things though, I'm not sure yet***).  
        - From my understanding, when the starting coefficients come from a model that someone else has already trained (rather than being set by us), and we then adjust that model to our own data, this is called `learning with a pre-trained model`. We are teaching the model about our data on top of what it already knows.  
        - Most of what I'll be doing initially is `Building a Learner from a Pre-trained Model`. In the future, I hope to learn to build the pre-trained models from scratch, if it's a useful endeavour?     
        - A popular example of a pre-trained model is `ResNet50`, a 50-layer deep neural network trained on (millions of) pictures.  
- Another example to drive home why we start between `-0.5 and 0.5`:  
    - Say we play a game where you choose a starting value and count by 1 towards my guess, and you want to choose the starting value that minimises the count on average.   
    - And I tell you my guess is always random and between `-50 and 50`.   
    - You'd probably pick zero because it's common sense! (I hope it is)  
    - It's kind of the same in our model, except we don't pick zero itself because it ruins the math (explained later).
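
The counting game above is easy to simulate. This is only a sketch under the stated assumptions (the guess is a uniformly random integer between -50 and 50); the function name `average_steps` and the trial count are my own choices.

```python
import random

def average_steps(start, n_trials=10_000, low=-50, high=50, seed=0):
    """Average number of +/-1 steps needed to walk from `start` to a random target."""
    rng = random.Random(seed)  # same seed => same targets for every start value
    total = 0
    for _ in range(n_trials):
        target = rng.randint(low, high)  # the random 'guess'
        total += abs(target - start)     # counting by 1 takes |target - start| steps
    return total / n_trials

avg_from_zero = average_steps(0)
avg_from_40 = average_steps(40)
avg_from_minus_40 = average_steps(-40)

print(avg_from_zero, avg_from_40, avg_from_minus_40)
```

Starting in the middle of the range gives the smallest average count, which is the intuition behind starting coefficients near zero when we have no idea whether they should end up positive or negative.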

***That sounds well and good but:***

1. Doesn't a coefficient's **relative size** depend on the actual value of its **corresponding input variable**? So shouldn't the starting coefficient be set **relative to its corresponding input variable**, and not just to an `arbitrary range of -0.5 to 0.5`? And also:
2. Input variables can **range drastically**, both **within themselves** and **relative to other variables**! For example:
    - `age` could range `from 1 to 130` years old   
    - `Amount of money fundraised or donated` by people could range from `0 to billions`  
    - `Length of the petals of a plum flower` could range from `0.1 mm to 40 mm`.  
    - `x-values` of any line function could range from negative $\infty$ to positive $\infty$ (ngl I wanted to use the $\infty$ sign for once in my life ♾️)  

***How does it make sense to just arbitrarily choose -0.5 to 0.5 then?***  

` insert explanation of normalisation of data `  
` insert explanation of dummy variables `  
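
Until the proper explanation is written, here is one possible illustration of why a small fixed range can still make sense. It assumes min-max scaling is the kind of normalisation meant here (that is my assumption, not something settled in these notes), and the helper name `min_max_scale` and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all values identical: nothing to spread out, so map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw features on wildly different scales
ages = [1, 30, 65, 130]
donations = [0, 500, 20_000, 1_000_000_000]

scaled_ages = min_max_scale(ages)
scaled_donations = min_max_scale(donations)

print(scaled_ages)
print(scaled_donations)
```

Once every input lives on a similar scale, starting every coefficient in the same small range is no longer comparing apples with billions of oranges.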

Thus, I'll use randomly generated values between -0.5 and +0.5 (-1.0 and +1.0 could also work)


### 2.3 The Real Step 2: 2a Set the Parameters `a, b, and c` 
- **[Action]**: Set our parameters to be random numbers.
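
A minimal sketch of this action, using Python's standard `random` module. The seed value is arbitrary and only there so the run is repeatable.

```python
import random

rng = random.Random(42)  # arbitrary seed, just for reproducibility

# draw three starting parameters uniformly between -0.5 and 0.5
a, b, c = (rng.uniform(-0.5, 0.5) for _ in range(3))

print(a, b, c)
```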

### 2.3 Step 2b: Input `x` 
Set a range of input values.
   
- **Note**: In the real world, and here, there is always a set of real-world data to start from. 
- Unfortunately, computers can't really interpret photos, audio or text in their aesthetic form like humans can. Fortunately, these different modes can be represented with numbers, which computers are very good at handling.
    - A **picture** is made up of pixels. Each pixel is a combination of 3 input values (R,G,B) that create the colour.
    - A piece of **text** can be decomposed into words and letters, which can be labelled with integers too.
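
As a toy illustration of both bullets: a tiny 2x2 "picture" stored as RGB triples, and a sentence turned into integer labels. The image, the sentence and the way words are numbered (in order of first appearance) are all made up for this sketch; real libraries use more elaborate schemes.

```python
# A 2x2 'image': each pixel is an (R, G, B) triple with values from 0 to 255
image = [
    [(255, 0, 0), (0, 255, 0)],      # red, green
    [(0, 0, 255), (255, 255, 255)],  # blue, white
]

# Flatten it into one long list of numbers
flat_pixels = [channel for row in image for pixel in row for channel in pixel]
print(len(flat_pixels))  # 2 * 2 pixels * 3 channels = 12

# Text: give each distinct word an integer id in order of first appearance
text = "the cat sat on the mat"
words = text.split()
vocab = {}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab)

token_ids = [vocab[word] for word in words]
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```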

For our model, given that we have some `real y-values` mapped to some `input x-values`, we should at least make predictions at those x-values to see how our predicted y-values are doing (known as `accuracy`, more later). Calculate each `F(x1), F(x2), etc.` by going through each combination of `a, b, c and x`:

- `F(x1) = a1x1^2 + b1x1 + c1` (prediction 1, written `y1`, `F(x1)` or `F(x1,a1,b1,c1)`, given this specific input variable `x1` and parameters `a1, b1, c1`)
- `F(x2) = a2x2^2 + b2x2 + c2` (prediction 2), etc.  

**Note**:
 
- These coefficients start off `random` because we have to start somewhere. 
- Starting off with random parameters is generally good practice; the art is then to incrementally update our coefficients (known as `gradient descent`) to make our model (`neural network`) do better (`improving accuracy` or `decreasing our loss function`) at each iteration (a full pass over the data is known as an `epoch`).

### 2.4 TBA STEP Accuracy
- `measure accuracy` of our predictions, e.g. with the `Mean Absolute Error (MAE)`. 
    - For each actual output/value `y`, what is our `predicted y`?
    - Since predictions can be `above or below` the actual value, we take the absolute value of the difference to measure the `loss`. That is, the MAE serves as our `Loss Function`. An alternative popular loss function is the `Mean Squared Error (MSE)`.
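
To see the two loss functions side by side, here is a plain-Python sketch with a handful of made-up values. The function names `mae` and `mse` and the numbers are only for illustration.

```python
def mae(actual, preds):
    """Mean Absolute Error: average of |prediction - actual|."""
    return sum(abs(p - y) for y, p in zip(actual, preds)) / len(actual)

def mse(actual, preds):
    """Mean Squared Error: average of (prediction - actual)^2."""
    return sum((p - y) ** 2 for y, p in zip(actual, preds)) / len(actual)

# Hypothetical actual values and predictions
actual = [1.0, 2.0, 3.0, 4.0]
preds = [1.5, 1.5, 3.0, 6.0]   # one above, one below, one exact, one far off

mae_value = mae(actual, preds)
mse_value = mse(actual, preds)

print(mae_value)  # (0.5 + 0.5 + 0 + 2) / 4 = 0.75
print(mse_value)  # (0.25 + 0.25 + 0 + 4) / 4 = 1.125
```

Notice how the single far-off prediction dominates the MSE much more than the MAE, because squaring punishes large misses harder.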

### 2.4 TBA STEP UPDATE PARAMETERS
- `updating parameters` to 
    - `improve accuracy` of our predictions (i.e. decrease the difference between our predictions and the data)
    - `do it automatically`: the art of improving our model, or `learning`, is called `gradient descent`.  
- `neural network`: Once the model reaches an accuracy we are happy with, this is our `neural network`.
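
Here is a rough sketch of what "updating parameters automatically" could look like for our quadratic with the MAE loss. It is written in plain Python with a hand-derived gradient rather than an autograd library, and the noise level, learning rate and step count are arbitrary assumptions of mine, so treat it as an illustration of the idea rather than how it must be done.

```python
import random

def quad(a, b, c, x):
    return a * x**2 + b * x + c

def mae(params, xs, ys):
    a, b, c = params
    return sum(abs(quad(a, b, c, x) - y) for x, y in zip(xs, ys)) / len(xs)

def sign(v):
    # +1 if v > 0, -1 if v < 0, 0 if v == 0
    return (v > 0) - (v < 0)

def mae_gradient(params, xs, ys):
    """Slope of the MAE with respect to a, b and c (derived by hand)."""
    a, b, c = params
    n = len(xs)
    grad_a = grad_b = grad_c = 0.0
    for x, y in zip(xs, ys):
        s = sign(quad(a, b, c, x) - y)  # is the prediction above or below?
        grad_a += s * x**2 / n
        grad_b += s * x / n
        grad_c += s / n
    return grad_a, grad_b, grad_c

# Simulated noisy data around 3x^2 + 2x + 1 (noise level is an assumption)
rng = random.Random(42)
xs = [-2 + 4 * i / 19 for i in range(20)]
ys = [3 * x**2 + 2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

params = (1.0, 1.0, 1.0)   # starting guess for a, b, c
learning_rate = 0.05       # how big each nudge is (arbitrary choice)

loss_before = mae(params, xs, ys)
for _ in range(200):
    grads = mae_gradient(params, xs, ys)
    # nudge each parameter a little in the direction that lowers the loss
    params = tuple(p - learning_rate * g for p, g in zip(params, grads))
loss_after = mae(params, xs, ys)

print(loss_before, loss_after, params)
```

Each pass measures which way the loss slopes for `a`, `b` and `c`, then steps the other way; repeat enough times and the loss ends up much lower than where it started.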

Quite simple really?

## 2. Simulate Data
Simulate content

In [2]:
import numpy as np, torch

np.random.seed(42)
def noise(x, scale): return np.random.normal(scale=scale, size=x.shape)
def add_noise(x, mult, add): return x * (1+noise(x,mult)) + noise(x,add)

actual_x_values_tsr = torch.linspace(-2, 2, steps=20)[:,None] # simulate 20 actual x-values + shape to 2D tensor

def actual_function(x): return 3*x**2 + 2*x + 1 
# In reality we don't have a real function like this to use;
# however, we use it and add noise to simulate real data,
# then we try to model this noisy data
actual_y_values_unrealistic_tsr = actual_function(actual_x_values_tsr)
# exact y-values of the function we are trying to find

# But again, these values are not realistic because they come from the exact function, something that doesn't exist in real life.
# It's like having the exact function that determines whether a photo is a cat or not:
# we can only approximate functions and their parameters.
# In the real world, data has noise, so we
# introduce noise to these unrealistic exact values to produce realistic actual values.

# Okay, so why don't we need to add noise to actual_x_values? Because any input is realistic/real world:
# any photo can be asked 'is it a cat?'
# any passenger with any characteristics can be asked 'did the passenger survive?'

actual_y_values_realistic_tsr = add_noise(actual_y_values_unrealistic_tsr, 0.15, 1.5) # use actual y-values + add noise to create simulated real data

In [2]:
#now that we have a set of (actual realistic) 'data'
# 1. try 'model' it with a quadratic equation 
# 2. create loss function - mean absolute error - difference between each point of actual y vs predicted y at each x, find difference and then absolute

In [3]:
# actual_x_values[:5],actual_y_values_realistic[:5]

# 1. Create quad function with parameters
def quad_fn(a,b,c,x): return a*x**2 + b*x + c
y_a1_b1_c1_x1 = quad_fn(1,1,1,1)
y_a1_b1_c1_x1  # 3 = (1*1^2) + (1*1) + (1) = 1+1+1
y_a1_b1_c1_x2 = quad_fn(1,1,1,2)
y_a1_b1_c1_x2  # 7 = (1*2^2) + (1*2) + (1) = 4+2+1

# Its quite cumbersome to put a new x-value through the function, to get a corresponding predicted y-value
# ideally we can provide a list of xs to get a list of predicted ys (x-tensor -> f -> y-tensor)
# and also the coefficients are parameterised
from functools import partial
def mk_quad_fn(a,b,c): return partial(quad_fn,a,b,c)
quad111 = mk_quad_fn(1,1,1)
# quad111(actual_x_values)
predicted_quad111_y_values_tsr = quad111(actual_x_values_tsr) # remember is a 2d tensor due to added dimension with: [:,None]

# we have actual-y and predicted-y data, let's compare them

def mae(actual, preds): return abs(preds-actual).mean()

mae(actual_y_values_realistic_tsr,predicted_quad111_y_values_tsr)



In [4]:
import matplotlib.pyplot as plt
# plt.plot(actual_y_values_realistic_tsr,predicted_quad111_y_values_tsr)
plt.scatter(actual_x_values_tsr, actual_y_values_realistic_tsr)

# plot predictions
# for predictions, graphically it will look better to plot a line
# rather than just a corresponding y-prediction for each actual y

# let's do just the corresponding ones first to see what it looks like

plt.scatter(actual_x_values_tsr, predicted_quad111_y_values_tsr)



In [5]:
import matplotlib.pyplot as plt
plt.scatter(actual_x_values_tsr, actual_y_values_realistic_tsr)
# plt.scatter(actual_x_values_tsr, predicted_quad111_y_values_tsr)

# plot y-line prediction

plt.plot(actual_x_values_tsr, predicted_quad111_y_values_tsr, 'r')


In [6]:
import matplotlib.pyplot as plt
from ipywidgets import interact

@interact(a=1,b=1,c=1)
def plot_both(a,b,c):
    # interactive_predicted_quad_y_values_tsr = custom_quad_fn(actual_x_values_tsr)

    plt.scatter(actual_x_values_tsr, actual_y_values_realistic_tsr)

    actual_x_values_for_plotting_tsr = torch.linspace(-2.1,2.1,100)[:,None]
    custom_quad_fn = mk_quad_fn(a,b,c)
    interactive_predicted_quad_y_values_tsr = custom_quad_fn(actual_x_values_for_plotting_tsr)
    plt.ylim((-3,13))
    plt.plot(actual_x_values_for_plotting_tsr, interactive_predicted_quad_y_values_tsr, 'r')


In [1]:
from ipywidgets import interact
from fastai.basics import *
import numpy as np


plt.rc('figure', dpi=90)

def plot_function(f, title=None, min=-2.1, max=2.1, color='r', ylim=None):
    x = torch.linspace(min,max, 100)[:,None]
    if ylim: plt.ylim(ylim)
    plt.plot(x, f(x), color)
    if title is not None: plt.title(title)

def f(x): return 3*x**2 + 2*x + 1

# plot_function(f, "$3x^2 + 2x + 1$")

def quad(a, b, c, x): return a*x**2 + b*x + c

def mk_quad(a,b,c): return partial(quad, a,b,c)


# f2 = mk_quad(3,2,1)
# plot_function(f2)

def noise(x, scale): return np.random.normal(scale=scale, size=x.shape)

def add_noise(x, mult, add): return x * (1+noise(x,mult)) + noise(x,add)




np.random.seed(42)

x = torch.linspace(-2, 2, steps=20)[:,None]
y = add_noise(f(x), 0.15, 1.5)
@interact(a=1.1, b=1.1, c=1.1)
def plot_quad(a, b, c):
    plt.scatter(x,y)
    plot_function(mk_quad(a,b,c), ylim=(-3,13))




In [None]:
def rectified_linear(m,b,x):
    y = m*x+b
    return torch.clip(y, 0.)

In [None]:
plot_function(partial(rectified_linear, 1,1))

In [None]:
import torch.nn.functional as F
def rectified_linear2(m,b,x): return F.relu(m*x+b)
plot_function(partial(rectified_linear2, 1,1))

In [None]:
@interact(m=1.5, b=1.5)
def plot_relu(m, b):
    plot_function(partial(rectified_linear, m,b), ylim=(-1,4))

In [None]:
def double_relu(m1,b1,m2,b2,x):
    return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x)

@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
    plot_function(partial(double_relu, m1,b1,m2,b2), ylim=(-1,6))