<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Warm-up-with-sklearn-linear-regression" data-toc-modified-id="Warm-up-with-sklearn-linear-regression-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Warm-up with sklearn linear regression</a></span></li><li><span><a href="#Notations:-Model-parameters,-hyper-parameters,-regularization----Linear-regression-example" data-toc-modified-id="Notations:-Model-parameters,-hyper-parameters,-regularization----Linear-regression-example-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Notations: Model parameters, hyper-parameters, regularization -- Linear regression example</a></span></li><li><span><a href="#Challenges" data-toc-modified-id="Challenges-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Challenges</a></span></li><li><span><a href="#Dealing-with-underfitting-and-overfitting" data-toc-modified-id="Dealing-with-underfitting-and-overfitting-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Dealing with underfitting and overfitting</a></span></li></ul></div>

# Warm-up with sklearn linear regression

In [3]:
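# Scikit-Learn ≥0.20 is required
import re
import sklearn
# Compare numeric (major, minor) parts; comparing version strings is lexicographic and unreliable
major, minor = map(int, re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())
assert (major, minor) >= (0, 20)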
import sklearn.linear_model

In [4]:
help(sklearn.linear_model)

Help on package sklearn.linear_model in sklearn:

NAME
    sklearn.linear_model - The :mod:`sklearn.linear_model` module implements a variety of linear models.

PACKAGE CONTENTS
    _base
    _bayes
    _cd_fast
    _coordinate_descent
    _glm (package)
    _huber
    _least_angle
    _logistic
    _omp
    _passive_aggressive
    _perceptron
    _ransac
    _ridge
    _sag
    _sag_fast
    _sgd_fast
    _stochastic_gradient
    _theil_sen
    setup
    tests (package)

CLASSES
    sklearn.base.BaseEstimator(builtins.object)
        sklearn.linear_model._huber.HuberRegressor(sklearn.linear_model._base.LinearModel, sklearn.base.RegressorMixin, sklearn.base.BaseEstimator)
        sklearn.linear_model._logistic.LogisticRegression(sklearn.linear_model._base.LinearClassifierMixin, sklearn.linear_model._base.SparseCoefMixin, sklearn.base.BaseEstimator)
            sklearn.linear_model._logistic.LogisticRegressionCV(sklearn.linear_model._logistic.LogisticRegression, sklearn.linear_model._ba

# Notations: Model parameters, hyper-parameters, regularization -- Linear regression example

<img style="float: center" src="./images/introduction_to_linear_regression.png" alt="drawing" Hight="600" width="600"/>


In the [above figure](https://algorithmia.com/blog/introduction-to-loss-functions):
1. The parameters of the model are $w_{0},w_{1}$: $w_{0}$ is the intercept and $w_{1}$ is the slope, i.e. the weight (a weight vector when there are several features) in the weighted sum of the features $IF$.
2. We aim to estimate $Y$ by $\hat{Y}=f(X;w)$ such that the residual $\hat{Y}-Y$ is as small as possible. There are many types of <b> loss functions </b>, i.e. functions that measure the residuals (the difference between $\hat{Y}$ and $Y$) and that we minimize to fit the model.
3. The settings of the linear regression algorithm (e.g., $\alpha_{1}$), i.e. the settings of the algorithm that minimizes the residuals and estimates the model parameters, are called the hyper-parameters ([Extra Reading](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/)).
4. Regularization techniques constrain the model so that it does not fit the training set too closely, which reduces the error on unseen data.
5. We also use a validation subset to select the best hyper-parameters.
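
The notation above can be illustrated with a minimal sketch. It assumes a ridge regression (`sklearn.linear_model.Ridge`) as the regularized linear model, with its `alpha` argument playing the role of the hyper-parameter; the synthetic data and the list of candidate `alpha` values are made up for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.3, size=200)

# Hold out a validation subset that is used only to compare hyper-parameters
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:  # candidate hyper-parameter values
    # alpha is set by us before training (hyper-parameter, regularization strength)
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    # w0 (intercept_) and w1 (coef_) are learned from the data (model parameters)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    results[alpha] = (val_mse, model.intercept_, model.coef_[0])

# Keep the hyper-parameter with the lowest validation loss
best_alpha = min(results, key=lambda a: results[a][0])
best_mse, w0, w1 = results[best_alpha]
print(f"best alpha={best_alpha}: w0={w0:.2f}, w1={w1:.2f}, validation MSE={best_mse:.3f}")
```

Here the mean squared error is the loss function, `alpha` controls how strongly the weights are shrunk towards zero, and the validation subset decides which `alpha` to keep.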

# Challenges

Besides the challenges specific to each type of machine learning task, we have the following common challenges: 
1. Insufficient quantity of training data:
    - Many algorithms need a lot of data to learn accurately.
    - In practice, the data set is often small or medium-sized. 
    - Some recommend generating synthetic data, but this is difficult to do well.

<b> See the following figure (Figure 1.20 in the book): </b>

<img style="float: center" src="./images/curve-insufficient-data.png" alt="drawing" Hight="600" width="600"/>

2. Insufficient quality of training data:
    - There is a lack of representative examples for each label.
    - The sampling of examples is biased such that it excludes important population characteristics.
    - Some examples have missing values and/or are rare, noisy, or outliers. 
        - Do we need to handle these cases? How?
        - Do we need to apply methods to preprocess the data? Usually, yes!
    - Irrelevant features/attributes.
        - Do we need to remove or combine some of them? 
    - Data could have some non-linearity or fluctuation trends. 
        - How to identify these trends?
3. Model complexity, assumptions, and parameters: If we select model-based learning, then 
    - We need to keep the following theorem in mind:
        - __No Free Lunch (NFL) theorem:__ If we make absolutely no assumption about the data, then there is no reason to prefer one model over any other. There is no model that is a priori guaranteed to work better. We therefore need to select the best model for each data set where possible; otherwise, we may try to generalize the model.
    - We need to:
        - Identify potential data assumptions that could handle the mentioned deficiencies in data quantity and quality. 
        - Select the most adequate learning algorithm to handle these assumptions:
            - Optimize the settings of the algorithm parameters (i.e., hyper-parameters).
            - Develop the best performing model (i.e., the one minimizing the loss function or maximizing the accuracy) with its parameters’ values.
<img style="float: center" src="./images/overfitting&underfitting.png" alt="drawing" Hight="600" width="600"/>    
    - We need to choose between a simple and a complex model in order to minimize:
        - __Under-fitting:__ The model is too simple to discover important hidden relationships among the features in the training data, so its performance is low even on the training data and the bias is high.
            - <b>Bias</b> is the prediction error caused by the overly simple assumptions of the model. It is the difference between the average prediction of our model and the correct value which we are trying to predict. 
        - __Over-fitting:__ The model developed on the training data cannot generalize what it learned to recognize new cases in the test data correctly. The model may be so complex that it memorized <u> every training case relationship and data noise </u> rather than drawing generalizations from the training data. The model performance is high on the training data but low on testing or unseen data, since it makes many generalization errors and the variance is high.
            - <b>Variance</b> is the prediction error caused by the model's sensitivity to the particular training set. It is the variability of the model prediction for a given data point when the model is trained on different training sets.
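
The definitions of bias and variance above can be checked numerically. The following is a minimal sketch under stated assumptions: the target function `true_fn`, the noise level, the polynomial degrees 1, 4, and 15, and the helper `sample_training_set` are all hypothetical choices made for illustration. Each model is retrained on many freshly sampled training sets, and the squared bias and the variance of its predictions are estimated at fixed test points.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

def true_fn(x):
    # Hypothetical ground-truth relationship (assumption for illustration)
    return np.sin(2 * np.pi * x)

def sample_training_set(n=30, noise=0.3):
    # Draw a fresh noisy training set from the same distribution
    x = rng.uniform(0, 1, size=n)
    y = true_fn(x) + rng.normal(scale=noise, size=n)
    return x.reshape(-1, 1), y

x_test = np.linspace(0.05, 0.95, 50).reshape(-1, 1)
y_true = true_fn(x_test[:, 0])

summary = {}
for degree in (1, 4, 15):  # too simple, reasonable, too complex
    preds, train_mse = [], []
    for _ in range(100):  # 100 independent training sets
        X_tr, y_tr = sample_training_set()
        model = make_pipeline(
            PolynomialFeatures(degree, include_bias=False), LinearRegression()
        )
        model.fit(X_tr, y_tr)
        preds.append(model.predict(x_test))
        train_mse.append(np.mean((model.predict(X_tr) - y_tr) ** 2))
    preds = np.array(preds)  # shape: (100 training sets, 50 test points)
    # Bias^2: squared gap between the average prediction and the true value
    bias_sq = np.mean((preds.mean(axis=0) - y_true) ** 2)
    # Variance: spread of the predictions across training sets
    variance = np.mean(preds.var(axis=0))
    summary[degree] = {
        "train_mse": float(np.mean(train_mse)),
        "bias_sq": float(bias_sq),
        "variance": float(variance),
    }
    print(degree, summary[degree])
```

In this setup the degree-1 model should show the largest squared bias (under-fitting), while the degree-15 model should show a low training error together with the largest variance (over-fitting).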

# Dealing with underfitting and overfitting


- We need to tackle under-fitting and over-fitting by balancing bias and variance (see the figure below), as described in [the link](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229).
    
<img style="float: center" src="./images/bias&variance.png" alt="drawing" Hight="600" width="600"/>    

- A flowchart for tackling under-fitting and over-fitting is given in
    [the Extra Reading](https://learnopencv.com/bias-variance-tradeoff-in-machine-learning/):
    
<img style="float: center" src="./images/algorithm_to_overfitting&underfitting.png" alt="drawing" Hight="600" width="600"/> 

- We need to plot the training and testing error curves and check that they look like the figure below: 

<img style="float: center" src="./images/curves_training&testing.png" alt="drawing" Hight="600" width="600"/> 
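
One possible way to obtain such curves is sketched below. It assumes that the horizontal axis is the training-set size (the figure may instead use model complexity or training iterations), and it uses `sklearn.model_selection.learning_curve` with a `Ridge` model on synthetic data; the data, the model, and the `alpha` value are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic regression data (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
true_w = rng.normal(size=10)
y = X @ true_w + rng.normal(scale=1.0, size=300)

# Train on growing subsets of the data, with 5-fold cross-validation
sizes, train_scores, test_scores = learning_curve(
    Ridge(alpha=1.0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 6),
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE values, so flip the sign to get errors
train_err = -train_scores.mean(axis=1)
test_err = -test_scores.mean(axis=1)

for n, tr, te in zip(sizes, train_err, test_err):
    print(f"n={n:4d}  training MSE={tr:.3f}  testing MSE={te:.3f}")
```

The arrays `sizes`, `train_err`, and `test_err` can then be passed to a plotting library to draw the two curves. A large, persistent gap between them suggests over-fitting, while two curves that are both high suggests under-fitting.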
