# Machine Learning Basics

If you are familiar with the following concepts, you can move on to the next notebook:

> Reference : http://www.deeplearningbook.org/contents/ml.html

    1. Learning Algorithms
    2. Capacity, Overfitting and Underfitting
    3. Hyperparameters and Validation Sets
    4. Estimators, Bias and Variance
    5. Maximum Likelihood Estimation
    6. Bayesian Statistics
    7. Supervised Learning Algorithms
    8. Unsupervised Learning Algorithms
    9. Stochastic Gradient Descent
    10. Building a Machine Learning Algorithm
    11. Challenges Motivating Deep Learning

In [None]:
# library imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Deep learning is a specific kind of machine learning. To understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This tutorial provides a brief course in the most important general principles that are applied throughout the rest of the book.

### 1. Learning Algorithms

*A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.*

#### 1.1 Task, *T*

The objective of learning is to perform a task well, ideally better and faster than hand-written programs or humans can. A task is any particular function we want our ML program to be able to perform.

A few common ML tasks include:
   - **Classification**: The computer program is asked to specify which of k categories some input belongs to.
   
   - **Classification with missing inputs**: Often some of the inputs are not known, and the program must classify the example without them. This kind of problem commonly arises in medical diagnosis.
   
   - **Regression**: The computer program is asked to predict a numerical value given some input.
   
   - **Transcription**: The machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form. For example, optical character recognition.
   
   - **Machine Translation**: The input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.
   
   - **Structured output**: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements.
   
   - **Anomaly Detection**: The computer program sifts through a set of events or objects and flags some of them as being unusual or atypical.
   
   - **Synthesis and sampling**: The machine learning algorithm is asked to generate new examples that are similar to those in the training data.
   
   - **Imputation of missing values**: The program is asked to predict the values of the missing entries of a given example.
   
   - **Denoising**: The program is asked to recover the original (clean) input from a corrupted version of it.
   
   - **Density Estimation**: The program is asked to learn a function that can be used as a probability density function.

In [None]:
# run this cell
# classification example in general programing function:
def classify(a): # classifies the even and odd numbers
    if (a%2 == 0):
        return "Class Even"
    elif(a%2 == 1):
        return "Class Odd"
    else:
        return "Undefined Class"
    
    

data = 0.02
try:
    data = int(input("Enter a number: "))
except ValueError:
    pass
    
print("The number Entered is in: ",classify(data))

#### 1.2 Performance Measure: *P*

To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance.
Usually this performance measure P is speciﬁc to the task T being carried out by the system.

Many tasks, such as classification, can be measured in terms of **accuracy** or **error rate**, and regression in terms of an error such as the mean squared error used later in this notebook. Not all tasks can be measured this way: for a task like density estimation, an accuracy calculation would be meaningless. It is usually easy to measure a particular quantity. The difficult part is deciding what should be measured so that the measurement reflects the behaviour we actually want.
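
As a small illustration of accuracy and error rate, the sketch below scores a set of predicted labels against the true labels. The arrays `y_true` and `y_pred` are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical labels, made up for illustration only
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# accuracy: fraction of examples for which the predicted label is correct
accuracy = np.mean(y_true == y_pred)
# error rate: fraction of examples that are misclassified
error_rate = 1.0 - accuracy

print("accuracy:", accuracy, "error rate:", error_rate)
```

Six of the eight predictions match, so the accuracy is 0.75 and the error rate is 0.25.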


#### 1.3 Experience: *E*

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.

***Unsupervised Learning***: Each experience (training sample) is a set of features, and the algorithm learns useful properties of the data from them.

***Supervised Learning***: Each experience is a set of features associated with a label.

Machine learning resembles human learning, which motivates this categorization.
Unsupervised learning is like a person observing many things and forming a conclusion, theory or picture from the observations alone, whereas supervised learning is like being taught at school that this is a cat and this is a knife, where "cat" and "knife" are the labels.

#### 1.4 Example Linear Regression:

As the name indicates it is used to perform regression. It is a supervised learning problem.

**Objective**:
To build a system that can take a vector x ∈ R<sup>n</sup> as input and predict the value of a scalar y ∈ R as its output. The output of linear regression is a linear function of the input. Let y' be the value that our model predicts y should take on. We deﬁne the output to be a linear equation as follows:

$$y' = w^T.x + b\\
\Longrightarrow y' = [b,\ w]^T.[1,\ x]\\
\Longrightarrow y' = w_{new}^T.x_{new}
$$

where w is the vector of **parameters**. <br> Now we have our ***task: to predict y'***
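
The rewriting of the bias as an extra weight can be checked numerically. The sketch below uses randomly drawn, purely illustrative values for `x`, `w` and `b`, and compares the explicit form with the absorbed form.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # a single input vector in R^3 (illustrative)
w = rng.normal(size=3)        # illustrative weights
b = 0.5                       # illustrative bias

# explicit form: y' = w^T x + b
y_explicit = w @ x + b

# absorbed form: prepend 1 to x and b to w, then take a single dot product
x_new = np.concatenate(([1.0], x))   # [1, x]
w_new = np.concatenate(([b], w))     # [b, w]
y_absorbed = w_new @ x_new

print(y_explicit, y_absorbed)
```

Both expressions give the same prediction, which is why the regression code can treat the bias as just another weight on a constant feature.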

We will define two sets 
1. train set (to train our model)
2. test set (to test our model)

We now have to define a performance measure. Here we can use the mean squared error:

$$MSE={1\over m}\sum_i(y' - y)^2_i.$$

Notice that the MSE approaches 0 as y' approaches y.
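
A minimal sketch of the mean squared error as a function follows. The helper name `mse` and the sample values are assumptions made for illustration.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error between predictions and targets."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
print(mse([1.0, 2.0, 3.0], y_true))   # perfect predictions
print(mse([2.0, 2.0, 5.0], y_true))   # errors of 1, 0 and 2
```

The first call returns 0 because the predictions equal the targets. The second returns (1 + 0 + 4) / 3.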

One way of solving such a problem is gradient descent, but here we take an analytical approach that gives the result in one step. Taking the model to be y' = x.w (with the bias absorbed into w as above, x the matrix whose rows are the training inputs and y the vector of training targets), we minimize the training MSE by setting its gradient with respect to w to zero:

$$\nabla_w MSE_{train} = 0\\
\Longrightarrow{1\over m}\nabla_w\|y'-y\|^2_2=0\\
\Longrightarrow\nabla_w(x.w-y)^T(x.w-y)=0\\
\Longrightarrow w = (x^Tx)^{-1}x^T.y$$


The system of equations whose solution is given by this expression is known as the normal equations.
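
As a sanity check on the derivation, the sketch below fits synthetic data with NumPy's own least-squares solver and then verifies that the gradient of the MSE vanishes at that solution and that the normal equations are satisfied. The data (slope 3, intercept 1, small noise) is an assumption chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
x = rng.normal(size=(m, 1))
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=(m, 1))   # synthetic data

X = np.hstack([np.ones((m, 1)), x])    # column of ones absorbs the bias b

# least-squares solution from NumPy's solver
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# gradient of the MSE with respect to w: (2/m) X^T (X w - y)
grad = (2.0 / m) * X.T @ (X @ w - y)
# residual of the normal equations X^T X w = X^T y
residual = X.T @ X @ w - X.T @ y

print("w (bias, slope):", w.ravel())
print("max |gradient|:", np.abs(grad).max())
```

The gradient is zero up to floating-point error, so the least-squares solution is exactly the point where the derivation sets the gradient to zero.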

Let's try it out:


In [None]:
# Linear Regression:
# we are going to find the line that minimizes the error between y' and y for a given distribution
np.random.seed(1)
def regress(x,y):
    xnew = np.append(np.ones(x.shape),x,axis=1)
    
    ## YOUR CODE HERE
    xtx = None # compute x^T.x of the normal equation, use --> np.dot(xnew.T,xnew)
    # invert xtx; fall back to the pseudo-inverse if the matrix is singular
    try:
        xtx_inv = np.linalg.inv(xtx)
    except np.linalg.LinAlgError:
        xtx_inv = np.linalg.pinv(xtx)
        
    xty = None # compute x^T.y of the normal equation, use --> np.dot(xnew.T,y)
    wnew = None # compute the parameters, use --> np.dot(xtx_inv,xty)
    #END
    
    w = wnew[1]
    b = wnew[0]
    return w, b

x = np.random.randn(100,1)
y = np.random.randn(100,1)**3
w, b = regress(x,y)
yd = w*x + b

print("The figure denotes a regression line approximating the y with x")
plt.plot(x,yd,'-r')
plt.plot(x,y,'.')
plt.show()

> Expected output :<br>
    The figure denotes a regression line approximating the y with x
 <img src="mlb/regression.png">

### 2. Capacity, Overfitting and Underfitting

The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs and not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called ***generalization***.

*training set* : dataset on which the learning model is trained.

*test set*: dataset on which model is tested.

We generally train a model by minimizing the training error, whereas what we actually care about is the test error.

It is important that the test data come from the same distribution as the training data. Otherwise the model will not be applicable to the test examples and you may end up with very poor performance.

The performance of a learning algorithm is determined by its ability to:
    1. Make the training error small.
    2. Make the gap between training error and test error small.

These two factors correspond to the two central challenges in machine learning:

   - *underfitting*: occurs when the model is not able to obtain a sufficiently low error value on the training set.
   - *overfitting*: occurs when the gap between the training error and test error is too large.
   
We can control whether a model is more likely to overfit or underfit by altering its **capacity.** Informally, a model's capacity is its ability to fit a wide variety of functions.

Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution.

> Example: the linear regression algorithm has the set of all linear functions of its input as its hypothesis space.

Say we want to extend the linear regression model to larger hypothesis spaces, such as quadratic, cubic or other polynomial functions.

For a quadratic function, 
we can add x<sup>2</sup> as a new input feature with its own weight, so the model stays linear in its parameters.
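
The idea of adding powers of x as extra input dimensions can be sketched as a small helper. The name `poly_features` is hypothetical and used only for this illustration.

```python
import numpy as np

def poly_features(x, n):
    """Return the columns [1, x, x^2, ..., x^n] for a vector of inputs x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([x ** i for i in range(n + 1)])

x = np.array([1.0, 2.0, 3.0])
print(poly_features(x, 2))
```

Each row now holds the constant 1, the input and its square, so an ordinary linear regression on these columns fits a quadratic in the original x.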

>Let's see this in the example below:
   - Try changing the value of n in the polynomial regression.

In [None]:
# Run this cell..............................................
# Regression analysis
np.random.seed(1)
# read this function it is similar to the linear regression we discussed earlier.......
def poly_regress(x,y,n):
    xnew = x**0
    #In below 2 lines we are basically increasing the dimension of input from {x}-->y to {x,x^2,x^3,...,x^n}-->y
    for i in range(1,n+1):
        xnew = np.append(xnew,x**i,axis=1)
    
    # rest of the code is same.........................
    xtx = np.dot(xnew.T,xnew)
    try:
        xtx_inv = np.linalg.inv(xtx)
    except np.linalg.LinAlgError:
        xtx_inv = np.linalg.pinv(xtx)  # pseudo-inverse if xtx is singular
        
    xty = np.dot(xnew.T,y)
    wnew = np.dot(xtx_inv,xty)
    
    return wnew

def calc_output(x,w,n):
    yd = 0
    for i in range(n+1):
        yd = yd + (w[i,:]*(x**i))
    return yd
        

x = np.arange(0,15,0.15).reshape(100,1)
x_train = x[0:70]
y = x**2 + 5*np.random.randn(100,1)
y_train = y[0:70]

# Linear 
w_l = poly_regress(x_train,y_train,1)
yd_l = calc_output(x,w_l,1)

# Quadratic
w_q = poly_regress(x_train,y_train,2)
yd_q = calc_output(x,w_q,2)

# Polynomial
w_p = poly_regress(x_train,y_train,5)
yd_p = calc_output(x,w_p,5) # Try different values of n to see behaviour

print("The figure denotes a regression line approximating the y with x")

f, ax = plt.subplots(1,3,figsize=[16,4])
ax[0].axvspan(0, x[70], color='0.1', alpha=0.1)
ax[0].plot(x,yd_l,'k.')
ax[0].plot(x,y,'.')
ax[0].set_title("n = 1 (Linear) Underfitting")
ax[0].annotate('train', xy=(2, 150) )
ax[0].annotate('test', xy=(12, 0) )
ax[1].axvspan(0, x[70], color='0.1', alpha=0.1)
ax[1].plot(x,yd_q,'k.')
ax[1].plot(x,y,'.')
ax[1].set_title("n = 2 (Quadratic) Optimal fit")
ax[1].annotate('train', xy=(2, 150) )
ax[1].annotate('test', xy=(12, 0) )
ax[2].axvspan(0, x[70], color='0.1', alpha=0.1)
ax[2].plot(x,yd_p,'k.')
ax[2].plot(x,y,'.')
ax[2].set_title("n = 5 Overfitting")
ax[2].annotate('train', xy=(2, 150) )
ax[2].annotate('test', xy=(12, 0) )
plt.show()
print("The shaded part represents the training set.")

>**Conclusion: The linear function cannot capture the curvature, so it underfits the data. The quadratic function is the best match for this dataset. As the polynomial degree increases further, the extra capacity lets the model fit spurious curvature in the training data, so it overfits the training set and fails on the test set.**

**Parametric models**: Models that learn a function described by a parameter vector whose size is finite and fixed before any data is observed.

**Non-parametric models**: Models whose size is not fixed before the dataset is observed.

**Bayes error**: The error incurred by making predictions from the true distribution is called the Bayes error. It acts like a noise floor: the lowest error any algorithm can be expected to achieve on unseen data. If an algorithm achieves a training error below it, the model is overfitting the data.

For example, let's see how the errors of the regression models from the previous example change with the size of the training set. We can compute the Bayes error here because we know the polynomial degree and the noise that generated the data.

In [None]:
# Before running this cell make sure you run above cell.
# run this cell to see variation of errors with respect to training size.
np.random.seed(1)
x = np.arange(0,15,0.15).reshape(100,1)
y = x**2 + 5*np.random.randn(100,1)
bayes_err = np.sum(abs(x**2-y))/x.shape[0]
print("The bayes error is :",bayes_err)

elst = []
for i in range(50,92):
    n1 = 2
    n2 = 3
    
    
    x_tr = x[0:i]
    x_te = x[i:]
    y_tr = y[0:i]
    y_te = y[i:]
    w_q = poly_regress(x_tr,y_tr,n1)
    w_p = poly_regress(x_tr,y_tr,n2)
    
    yd_q_tr = calc_output(x_tr,w_q,n1)
    yd_q_te = calc_output(x_te,w_q,n1)
    yd_p_tr = calc_output(x_tr,w_p,n2)
    yd_p_te = calc_output(x_te,w_p,n2)
    
    err_q_tr = np.sum(abs(yd_q_tr-y_tr))/y_tr.shape[0]
    err_q_te = np.sum(abs(yd_q_te-y_te))/y_te.shape[0]
    err_p_tr = np.sum(abs(yd_p_tr-y_tr))/y_tr.shape[0]
    err_p_te = np.sum(abs(yd_p_te-y_te))/y_te.shape[0]
    
    err = np.array([err_q_tr,err_q_te,err_p_tr,err_p_te])
    elst.append(err)

xaxis = np.arange(50,92)
edata = np.array(elst)
bayes = (np.ones((edata.shape[0],1)))*bayes_err
f, ax = plt.subplots(1,2,figsize=[16,4])
ax[0].plot(xaxis,edata[:,0],'-g', label='quadratic-n1')
ax[1].plot(xaxis,edata[:,1],'-g', label='quadratic-n1')
ax[0].plot(xaxis,edata[:,2],'-b', label='n2-poly')
ax[1].plot(xaxis,edata[:,3],'-b', label='n2-poly')
ax[0].set_xlabel("Training size")
ax[1].set_xlabel("Training size")
ax[0].set_ylabel("Training Error")
ax[1].set_ylabel("Test Error")
ax[0].plot(xaxis,bayes,'--y', label='bayes err')
ax[1].plot(xaxis,bayes,'--y', label='bayes err')
ax[0].legend(loc="upper right")
ax[1].legend(loc="upper right")
plt.show()


>You will notice that the test error (the generalization error) decreases as the number of training samples increases.

#### 2.1 The No Free Lunch Theorem

The no free lunch theorem for machine learning states that, averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.

Fortunately, these results hold only when we average over all possible data-generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.


#### 2.2  Regularization
In practice we do not know what the true data-generating distribution looks like, so we need a way to avoid overfitting even when the hypothesis space has a higher degree than the data requires. This is done using regularization. For example:

When solving a regression model with gradient descent and weight decay (discussed in the previous tutorial), we minimize a cost function that consists of two terms:
    - Mean squared error over the training set.
    - Regularization term.

For a polynomial regression of degree n there are n weights (w's) in the parameter list. By controlling the magnitude of these weights we can reduce the spurious curvature that overfitting introduces into the model.

The cost function used in this approach is:
    
$$J(w) = MSE_{train} + \lambda w^Tw$$

By controlling ***lambda*** we can move the model between overfitting and underfitting. When the cost function is ***optimized***, the regularization term pushes the optimizer to ***reduce the magnitude of the w's***, and the strength of that reduction is set by lambda. If ***lambda is very high, the model is pushed into severe underfitting, also called the high bias regime.***

The term ***high bias*** here refers to the statistical bias of the model (a systematic error caused by a model that is too restricted, see section 4.2), not to the bias term ***b*** in our model ***y' = w<sup>T</sup>x + b***.

Note that we need not regularize the bias term b, since it affects only the offset of the curve and not its curvature.
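
The shrinking effect of lambda can also be seen without gradient descent. The sketch below minimizes the same kind of cost, (1/m)||Xw - y||^2 + lambda w^T w, in closed form and prints the norm of the weights for several lambda values. The data, the helper name `ridge_fit` and the omission of the bias term (to keep the example short) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 5
x = np.linspace(0.0, 1.5, m).reshape(-1, 1)
y = x ** 2 + 0.1 * rng.normal(size=(m, 1))     # synthetic quadratic data

# polynomial features x, x^2, ..., x^n (no bias column in this simplified sketch)
X = np.hstack([x ** i for i in range(1, n + 1)])

def ridge_fit(X, y, lam):
    """Minimize (1/m)||Xw - y||^2 + lam * w^T w in closed form."""
    m, d = X.shape
    # setting the gradient to zero gives (X^T X + m*lam*I) w = X^T y
    return np.linalg.solve(X.T @ X + m * lam * np.eye(d), X.T @ y)

lams = [0.0, 0.01, 0.1, 1.0, 10.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lams]
for lam, nrm in zip(lams, norms):
    print(f"lambda = {lam:<5} ||w|| = {nrm:.4f}")
```

The weight norm decreases as lambda grows, which is the mechanism by which a large lambda flattens the fitted curve.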

Let's run the gradient descent model with different regularization parameters.

> Please read the functions used below to get an understanding of how gradient descent works in regression.

In [None]:
# Regularization run this cell
np.random.seed(5)

# The cost function
def cost_f(yd,y,w,ld):
    mse = 0.5*np.dot((yd-y),(yd-y).T)/y.shape[1]
    reg = ld*np.dot(w.T,w)
    c = mse + reg
    return c

# The differential of cost function with respect to w's and b's
def d_cost_f(yd,y,x,w,ld):
    m = y.shape[1]
    dfdw = (1/m)*np.dot((yd-y),x.T).T + 2*ld*w
    dfdb = (1/m)*np.sum((yd-y))
    
    return dfdw, dfdb

# The Regression model.....
def g_regression(x1,y,learning_rate,ld,n):
    x = x1
    #In below 2 lines we are basically increasing the dimension of input from {x}-->y to {x,x^2,x^3,...,x^n}-->y
    for i in range(2,n+1):
        x = np.append(x,x1**i,axis=0)
    
    w = np.random.randn(x.shape[0],1)
    b = np.random.randn(1,1)
    yd = np.dot(w.T,x) + b
    dcost = 10
    ocost = 0
    while(abs(dcost) > 0.00000001):
        dw, db = d_cost_f(yd,y,x,w,ld)
        w = w - learning_rate*dw
        b = b - learning_rate*db
        yd = np.dot(w.T,x) + b
        cost = cost_f(yd,y,w,ld)
        dcost = cost - ocost
        ocost = cost
        
    return w, b

# The prediction over testing set......
def calc(x1,y,w,b,n):
    x = x1
    #In below 2 lines we are basically increasing the dimension of input from {x}-->y to {x,x^2,x^3,...,x^n}-->y
    for i in range(2,n+1):
        x = np.append(x,x1**i,axis=0)
    yd = np.dot(w.T,x) + b
    return yd

n = 5 # hypothesis space / polynomial degree
ts = 60 # training size
x = (np.arange(0,1.5,0.015)).reshape(1,100) # input random variable
x_train = x[0,0:ts].reshape(1,ts) # training input
y = (x**2 + 0.1*np.random.randn(1,100)).reshape(1,100) # output random variable
y_train = y[0,0:ts].reshape(1,ts) # training output

ys = x**2 # initialization of predicted values with true distribution.
cost = []

lds = np.array([0,0,0.005,0.012,0.02,0.05,0.1,0.2,0.5,0.6,0.7,0.8,1,5,10])  # lambda values to test

# iterating over lambdas
for ld in lds[1:]:
    print("computing for lambda:", ld,"...")
    w, b = g_regression(x_train,y_train,0.0005,ld,n)
    yd = calc(x,y,w,b,n)
    ys = np.append(ys,yd,axis=0)
    cost.append(cost_f(yd,y,w,ld))

npt = ys.shape[0]//3
f, ax = plt.subplots(npt,3,figsize=[16,20])

# plotting the predictions/.....................................
for i in range(npt):
    for j in range(3):
        ax[i,j].plot(ys[min(i*3 + j,ys.shape[0]-1),:].T,'.k')
        ax[i,j].plot(y.T,'.g')
        ax[i,j].set_title("n = "+str(min(n,2+(i+j)*20))+",lambda = "+str(lds[i*3 + j]))
        ax[i,j].annotate('train', xy=(20, 2) )
        ax[i,j].annotate('test', xy=(80, 0) )
        ax[i,j].axvspan(0, ts, color='0.1', alpha=0.1)
plt.show()
     

> You will notice that as we increase lambda the model goes from overfitting to underfitting, and somewhere in the middle we obtain the regularized optimum solution.

### 3. Hyperparameters and Validation Sets

Most machine learning algorithms have hyperparameters, settings that we can use to control the algorithm's behavior. Typically these are values that the learning algorithm does not adapt by itself.

>Example: learning_rate, lambda (regularization), hypothesis space etc.

We could let an algorithm optimize the hyperparameters, but optimizing them on the training set would always favor the highest possible capacity, which results in overfitting and a larger generalization gap (the gap between training error and test error).

To solve this problem we introduce a new set called:<br>
***Validation set***: A part of the training set that is not used for learning the parameters, but for estimating the generalization error and tuning the hyperparameters. It plays the role of a test set during development: it is used to optimize the model's hyperparameters, not to test the final optimized model.

Considering all sets, we now have three sets of data: training set, validation set and test set. The share of the whole dataset typically given to them is about 80%-96%, 2%-10% and 2%-10% respectively.
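
A three-way split can be sketched with a shuffled index array. The 80/10/10 proportions below are one choice within the ranges quoted above, and the random data is a placeholder for a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                                  # illustrative dataset size
X = rng.normal(size=(m, 3))               # placeholder features
y = rng.normal(size=(m,))                 # placeholder targets

idx = rng.permutation(m)                  # shuffle before splitting
n_train, n_val = int(0.8 * m), int(0.1 * m)

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(train_idx), len(val_idx), len(test_idx))
```

Parameters are learned on the training part, hyperparameters are compared on the validation part, and the test part is touched only once at the end.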

#### 3.1 Cross - Validation

In this method the dataset is divided into k non-overlapping subsets. On the i<sup>th</sup> trial, the i<sup>th</sup> subset is used as the test set and the remaining subsets as the training set. The test error is estimated by averaging the test errors over all k trials. This is generally used when the dataset is small and data is scarce. It is known as ***k-fold cross-validation.***
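
The procedure can be sketched in a few lines. The helper `k_fold_mse`, the use of `np.polyfit` as the model, and the synthetic quadratic data are all assumptions made for illustration.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Average test MSE of a polynomial fit of the given degree over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # k non-overlapping subsets
    errors = []
    for i in range(k):
        test_idx = folds[i]                               # i-th subset is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[test_idx])
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 3.0, 40)
y = x ** 2 + 0.3 * rng.normal(size=x.shape)     # synthetic quadratic data

for degree in (1, 2, 5):
    print("degree", degree, "cv mse:", k_fold_mse(x, y, degree))
```

Because every example is used for testing exactly once, the averaged error uses all of the data, which is what makes the method attractive for small datasets.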

### 4.  Estimators, Bias and Variance

#### 4.1 Point Estimation

Point estimation is the attempt to provide the single “best” prediction of some quantity of interest. In general the quantity of interest can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example, but it can also be a whole function.

For example, say we have a set of m independent and identically distributed data points {x<sup>(1)</sup>, x<sup>(2)</sup>, ..., x<sup>(m)</sup>}.

A point estimator or statistic is any function of the data:

$$\theta_m = g(x^{(1)},x^{(2)},\dots,x^{(m)})$$

While any function can be considered as an estimator, a good estimator is a function whose output is close to the true underlying θ that generated the training data.

Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as *function estimators.*

#### 4.2 Bias

The bias of an estimator is defined as:

$$bias(\theta_m) = E(\theta_m) - \theta,$$
where E(θ<sub>m</sub>) is the expected value of the estimator, taken over the data, and θ is the true underlying value.
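
The bias of an estimator can be estimated by simulation. The sketch below compares two estimators of a Gaussian variance: the one that divides by m, whose bias is known to be -σ²/m, and the one that divides by m - 1, which is unbiased. The sample size, true variance and number of trials are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0          # variance of the assumed data-generating Gaussian
m = 5                   # a small sample size makes the bias visible
trials = 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, m))

var_biased = samples.var(axis=1, ddof=0)     # divides by m
var_unbiased = samples.var(axis=1, ddof=1)   # divides by m - 1

# bias = E(estimator) - true value, with E approximated by the average over trials
bias_biased = var_biased.mean() - true_var       # theory: -true_var / m = -0.8
bias_unbiased = var_unbiased.mean() - true_var   # theory: 0

print("bias (1/m estimator):    ", bias_biased)
print("bias (1/(m-1) estimator):", bias_unbiased)
```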

#### 4.3 Variance and Standard Error

The variance of an estimator is simply its variance, Var(θ<sub>m</sub>), computed over different samplings of the data. The square root of this variance is called the standard error.

For example, the standard error of the sample mean (θ = mean) is given by:

$$SE(\mu_m) =\sqrt{Var\left({1\over m}\sum^m_{i=1}x^{(i)}\right)}={\sigma\over\sqrt{m}}$$
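
The σ/√m formula can be checked empirically by drawing many datasets of size m and measuring how much their sample means vary. The values of `sigma`, `m` and `trials` below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m, trials = 2.0, 25, 100_000

# draw many datasets of size m and compute the sample mean of each
means = rng.normal(loc=0.0, scale=sigma, size=(trials, m)).mean(axis=1)

empirical_se = means.std()               # spread of the estimator across datasets
theoretical_se = sigma / np.sqrt(m)      # sigma / sqrt(m) = 0.4

print("empirical SE:  ", empirical_se)
print("theoretical SE:", theoretical_se)
```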

#### 4.4 Trading off Bias and Variance to Minimize Mean Squared Error

Bias and variance measure two diﬀerent sources of error in an estimator. 
<br>
Bias measures the expected deviation from the true value of the function or parameter.
<br>
Variance on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them?

Let's answer the question with an example plot for the case solved above:

In [None]:
# Example Bias Variance trade off.....
# Make sure you run the above code cell before this one since we are going to use values of above cell here...
np.random.seed(7)
# defining the test region
te = 60
tr = 80
# function to return the squared bias of the prediction
def biasmean(ys,y):
    return np.mean(y[0,te:tr] - ys[te:tr])**2

# function to return the variance of the prediction
def biasvar(ys,y):
    return np.std(ys[te:tr])**2


bm = []
bv = []
ms = []

# calculating values for already calculated values........................
print("Plotting the variation of bias, variance and MSE...")
for i in range(1,ys.shape[0]):
    bm.append(biasmean(ys[i,:],ys[0,:].reshape(1,ys.shape[1])))
    bv.append(biasvar(ys[i,:],ys[0,:].reshape(1,ys.shape[1])))

    
plt.figure(figsize=[16,8])
plt.plot(np.flip(np.array(bm),axis=0)[:13],'-r',label="bias-square")
plt.plot(np.flip(np.array(bv),axis=0)[:13],'-k',label="variance")
plt.plot(np.flip(np.array(bv)+np.array(bm),axis=0)[:13],'-b',label="generalization gap")
plt.plot(np.flip(np.array(cost),axis=0)[:13].reshape(13,1),'-y',label="loss function")
plt.legend(loc="upper right")
plt.axvline(x=9)
plt.annotate('optimum capacity', xy=(9-1,0.6) )
plt.annotate('overfitting\nhigh variance', xy=(10,0.4) )
plt.annotate('underfitting\nhigh bias', xy=(9-4,0.5) )
plt.title("Relation between bias variance and generalization gap")
plt.xlabel("Capacity (lambda descending)")
plt.show()

> You will notice that the optimum is directly related to bias and variance: it is obtained where neither of the two is very high. When you have to trade off bias against variance, choose the operating point using this kind of analysis.

#### 4.5 Consistency
So far we have discussed the properties of various estimators for a training set of ﬁxed size. Usually, we are also concerned with the behavior of an estimator as the amount of training data grows.
Consistency means that, as the number of examples grows, the estimate converges to the true value of the quantity being estimated.
<br>
Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows.
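
The diminishing bias can be illustrated with the variance estimator that divides by m. Its bias is -σ²/m, so it should shrink towards zero as m grows. The true variance, sample sizes and trial count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
trials = 20_000

biases = []
for m in (2, 10, 100):
    samples = rng.normal(scale=np.sqrt(true_var), size=(trials, m))
    estimates = samples.var(axis=1)          # 1/m estimator (ddof=0), biased for finite m
    biases.append(estimates.mean() - true_var)
    print(f"m = {m:>3}  estimated bias = {biases[-1]:+.3f}  theory = {-true_var / m:+.3f}")
```

The estimated bias follows the theoretical value -σ²/m and becomes negligible for large m.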

-----------------------------

### 5. Maximum Likelihood Estimation

This raises the question of what makes a good estimator. Rather than guessing, we derive estimators from general principles. One of the most commonly used is the maximum likelihood principle.

Consider a set of m examples ***X = {x(1), . . . , x(m)}*** drawn independently from the true but unknown data-generating distribution ***pdata(x)***.

Let ***pmodel(x;θ)*** be a parametric family of probability distributions over the same space, indexed by ***θ***.

The maximum likelihood estimator for θ is then deﬁned as:

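$$\theta_{ML} = \arg\max_\theta\ p_{model}(X; \theta)\\
= \arg\max_\theta\ \prod^m_{i=1}p_{model}(x^{(i)}; \theta).$$

Taking the logarithm and rescaling by 1/m do not change the arg max, so we get:

$$\theta_{ML}= \arg\max_\theta{1\over m}\sum^m_{i=1}\log p_{model}(x^{(i)}; \theta)\\
= \arg\max_\theta\ E_{x\sim p_{data}}\log p_{model}(x; \theta)$$

Here the expectation is taken with respect to the empirical distribution defined by the training data.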
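
As a concrete case, the sketch below finds the maximum likelihood estimate of the mean of a Gaussian with known standard deviation by evaluating the average log-likelihood on a grid of candidate values. The synthetic data, the grid and the helper name `avg_log_likelihood` are assumptions made for illustration. For this model the maximizer is known to be the sample mean, which gives a simple check.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=200)   # synthetic sample

def avg_log_likelihood(mu, x, sigma=1.0):
    """(1/m) * sum_i log p_model(x_i; mu) for a Gaussian with known sigma."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2 * sigma ** 2))

grid = np.linspace(0.0, 3.0, 3001)                 # candidate values of mu
scores = np.array([avg_log_likelihood(mu, data) for mu in grid])
mu_ml = grid[np.argmax(scores)]

print("grid-search MLE:", mu_ml)
print("sample mean:    ", data.mean())
```

The two values agree up to the grid resolution of 0.001.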

The degree of dissimilarity between the empirical distribution and the model distribution can be measured by the KL divergence:

$$D_{KL}(p_{data}\|p_{model}) = E_{x\sim p_{data}}[\log\ p_{data}(x) - \log\ p_{model}(x)]$$

Minimizing this dissimilarity is the same as maximizing the likelihood, because the first term does not depend on the model.

***cross-entropy***: Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
> For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.
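
The link between KL divergence and cross-entropy can be checked on a small discrete example. The two distributions below are made up for illustration.

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])     # illustrative data distribution
p_model = np.array([0.4, 0.4, 0.2])    # illustrative model distribution

entropy = -np.sum(p_data * np.log(p_data))
cross_entropy = -np.sum(p_data * np.log(p_model))
kl = np.sum(p_data * (np.log(p_data) - np.log(p_model)))

print("KL divergence:          ", kl)
print("cross-entropy - entropy:", cross_entropy - entropy)
```

The KL divergence equals the cross-entropy minus the entropy of the data distribution. Since that entropy does not depend on the model, minimizing the cross-entropy and minimizing the KL divergence select the same model.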

#### 5.1 Conditional Log-Likelihood and Mean Squared Error

If X represents all our inputs and Y all our observed targets, then the conditional maximum likelihood estimator is:
$$\theta_{ML}= \arg\max_\theta\ P (Y | X; \theta).$$

If the examples are independent and identically distributed (i.i.d.), this decomposes into:

$$\theta_{ML}= \arg\max_\theta\ \sum^m_{i=1}\log\ P (y^{(i)}|\ x^{(i)}; \theta).$$

#### 5.2 Properties of Maximum Likelihood

Under appropriate conditions, the maximum likelihood estimator has the property of ***consistency***. These conditions are:
   - The true distribution pdata must lie within the model family pmodel.
   - The true distribution pdata must correspond to exactly one value of θ.
   
The parametric mean squared error (the expected squared difference between the estimated and true parameter values) decreases as m (the size of the training data) increases. For large m, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower MSE than the maximum likelihood estimator.

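#### Example: Linear Regression as a maximum likelihood problem

Since the examples are assumed to be i.i.d., the conditional log-likelihood is given by:

$$
\sum_{i=1}^m\log\ p(y^{(i)}|\ x^{(i)}; \theta)
=
- m \log(\sigma) -{m\over2}\log(2\pi) -\sum^m_{i=1}{\|{y'}^{(i)} - y^{(i)}\|^2\over{2\sigma^2}}
$$

where y'<sup>(i)</sup> is the prediction for the i<sup>th</sup> example and σ is the fixed standard deviation of the Gaussian noise.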

let's try to maximize it...

In [None]:
# Example Linear Regression as maximum likely hood problem.
np.random.seed(1)
def cost(yd,y):
    m = y.shape[0]
    sigma = np.std(y)
    
    # YOUR CODE HERE 
    clld = None # calculate the conditional log-likelihood from the equation above as: 
    #-m*np.log(sigma) - (m/2)*np.log(2*np.pi) - np.dot((yd-y).T,(yd-y))/(2*(sigma**2))
    # End
    
    return clld

def d_cost(yd,y,x):
    m = y.shape[0]
    sigma = np.std(y)
    dfdw = None # calculate the derivative of the log-likelihood w.r.t. w as -->  -(1/sigma**2)*np.dot((yd-y).T,x)
    dfdb = None # calculate the derivative of the log-likelihood w.r.t. b as --> -(1/sigma**2)*np.sum((yd-y))
    
    return dfdw, dfdb
    
def regress(x,y,learning_rate):
    w = np.random.randn(1,1)
    b = np.random.randn(1,1)
    yd = w*x + b
    dcost = 10
    ocost = 0
    while(abs(dcost) > 0.00000001):
        
        dw, db = d_cost(yd,y,x)
        
        # YOUR CODE HERE 
        w = None # Update w to maximize the cost  use ---- > w + learning_rate*dw
        b = None # Update b to maximize the cost  use ---- > b + learning_rate*db
        # NOTE : we are using + instead of - to go up and maximize on the training surface
        # End
        
        yd = w.T*x + b
        ncost = cost(yd,y)
        dcost = ncost - ocost
        ocost = ncost
        
    return w, b

x = np.random.randn(100,1)
y = np.random.randn(100,1)**3
w, b = regress(x,y,0.01)
yd = w*x + b

print("The figure denotes a regression line approximating the y with x")
plt.plot(x,yd,'-r')
plt.plot(x,y,'.')
plt.show()

> Expected output :<br> The figure denotes a regression line approximating the y with x <img src = "mlb/regression.png"> NOTE: You get the same answer as in the previous linear regression example, which confirms the relation between maximum likelihood and mean squared error.

### 6. Bayesian Statistics

So far we have discussed **frequentist statistics** and approaches based on estimating a single value of θ. Another approach is to consider all possible values of θ when making a prediction. The latter is the domain of **Bayesian statistics.**

Unlike the frequentist view, in which θ is fixed but unknown, the Bayesian view uses probability to reflect degrees of certainty in states of knowledge.

***Prior probability distribution***: Before observing the data, we represent our knowledge of θ using the prior probability distribution p(θ), which is generally chosen to have quite high entropy (quite broad, reflecting a high degree of uncertainty).

We then observe the dataset and update the distribution accordingly, which concentrates it around the parameter values that are most consistent with the data.

Consider that we have data samples {x1,x2,...,xm}. We can recover the effect of the data on our belief about θ by combining the data likelihood p(x1, . . . , xm| θ) with the prior via ***Bayes' rule:***

$$
p(\theta\ |\ x_1, \dots , x_m) ={p(x_1, \dots , x_m\ |\ \theta)\,p(\theta)\over p(x_1, \dots , x_m)}
$$

In bayesian method after observing m examples, the predicted distribution over the next data sample, x(m+1), is given by:

$$
p(x_{m+1}\ |\ x_1, \dots , x_m) =\int p(x_{m+1}\ |\ \theta)\ p(\theta\ |\ x_1, \dots , x_m)\ d\theta
$$
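
Both formulas can be evaluated numerically for a one-parameter model. The sketch below assumes coin-flip data with unknown probability θ of observing 1, a uniform prior, and a grid approximation of the integral. All of these are choices made for illustration.

```python
import numpy as np

# Illustrative setup: coin flips x_i in {0, 1} with unknown P(x = 1) = theta
data = np.array([1, 1, 0, 1, 1, 0, 1, 1])
heads, tails = data.sum(), len(data) - data.sum()

theta = np.linspace(0.0, 1.0, 10_001)          # grid over the parameter
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)                     # broad (uniform) prior p(theta)
likelihood = theta ** heads * (1 - theta) ** tails
posterior = prior * likelihood
posterior /= posterior.sum() * dtheta           # normalize by p(x_1, ..., x_m)

# p(x_{m+1} = 1 | data) = integral of p(x_{m+1} = 1 | theta) p(theta | data) dtheta
predictive = np.sum(theta * posterior) * dtheta

print("Bayesian predictive P(next = 1):", predictive)
print("maximum likelihood estimate:    ", heads / len(data))
```

With 6 ones out of 8 flips the maximum likelihood estimate is 0.75, while the Bayesian prediction is pulled towards the prior and comes out at about 0.7.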

Benefits of the Bayesian approach:
   - Unlike the maximum likelihood approach, which makes predictions using a point estimate of θ, the Bayesian approach makes predictions using a full distribution over θ.
   - The integral tends to protect well against overfitting, because the uncertainty in θ is carried into the prediction. In the single point estimate (SPE) approach this uncertainty is addressed by evaluating the variance of the estimator.

There are also drawbacks. The prior has an influence by shifting probability mass density towards regions of the parameter space that are preferred a priori. Critics of the Bayesian approach identify the prior as a source of subjective human judgment affecting the predictions. <br>Bayesian methods typically generalize much better when limited training data is available, but typically suffer from high computational cost when the number of training examples is large.


#### 6.1 Maximum a Posteriori (MAP) Estimation

Single point estimates are more tractable than full Bayesian posteriors. To get some of the benefits of the Bayesian approach while keeping a tractable model, we allow the prior to influence the choice of the point estimate. One rational way to do this is to choose the **maximum a posteriori (MAP)** point estimate:

$$
θ_{MAP}= \arg\max_θ\ p(θ\ |\ x) = \arg\max_θ\ \left[\log p(x\ |\ θ) + \log p(θ)\right]
$$

>As an example, consider a linear regression model with a Gaussian prior on the weights **w**. If this prior is given by **N(w | 0, (1/λ)I)** *(a Gaussian distribution with zero mean and variance 1/λ on each weight)*, then the log-prior term is proportional to the familiar **λw<sup>T</sup>w** weight decay penalty, plus a term that does not depend on w and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.

One can say that MAP is a regularized maximum likelihood estimate.
> Read more about it here : https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation
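
The correspondence between MAP estimation and weight decay can be checked numerically. The sketch below assumes unit noise variance and a prior N(w | 0, (1/λ)I); the synthetic data, `lam` and the helper `neg_log_posterior_grad` are illustrative names, not part of the original notebook. Under these assumptions the negative log-posterior is 0.5·||y − Xw||² + 0.5·λ·wᵀw (up to a constant), and its minimizer is the ridge regression solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([[1.0], [-2.0], [0.5]])
y = X @ w_true + rng.normal(size=(50, 1))
lam = 5.0  # precision of the Gaussian prior (assumed value)

def neg_log_posterior_grad(w):
    # Gradient of 0.5*||y - Xw||^2 + 0.5*lam*w^T w with respect to w
    return X.T @ (X @ w - y) + lam * w

# Maximum likelihood: ordinary least squares (normal equations)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP: the prior adds lam*I to X^T X, exactly as weight decay does
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print("MLE weights:", w_mle.ravel())
print("MAP weights:", w_map.ravel())
print("Gradient of the negative log-posterior at w_map:",
      neg_log_posterior_grad(w_map).ravel())
```

The MAP weights have a smaller norm than the maximum likelihood weights, which is the shrinkage effect of the prior.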

#### Example: Bayesian Linear Regression

The prediction is parametrized as:
$$y = w^Tx$$

Given a set of m training samples (X<sub>(train)</sub>, y<sub>(train)</sub>), we can express the prediction of y over the entire training set as:

$$y_{(train)}= X_{(train)}w$$

Expressed as a Gaussian conditional distribution on y<sub>(train)</sub> with unit noise variance, we have:

$$p(y_{(train)}| X_{(train)}, w) = N(y_{(train)}| X_{(train)}w, I)\\∝ \exp\left({-1\over2}(y_{(train)}- X_{(train)}w)^T(y_{(train)}- X_{(train)}w)\right)$$ 

We then need to define a prior distribution over w, typically a broad (high-entropy) Gaussian:

$$
p(w) = N(w\ |\ µ_0, Λ_0) ∝ \exp\left({-1\over2}(w - µ_0)^TΛ^{-1}_0(w - µ_0)\right)
$$
where µ<sub>0</sub> and Λ<sub>0</sub> are the prior distribution mean vector and covariance matrix respectively.

>NOTE: Unless there is a reason to use a particular covariance structure, we typically assume a diagonal covariance matrix Λ<sub>0</sub>= diag(λ<sub>0</sub>).


We now proceed to determine the posterior distribution over the model parameters:

$$
p(w\ |\ X, y) ∝ p(y\ |\ X, w)\ p(w)\\
∝ \exp\left({-1\over2}(w - µ_0)^TΛ^{-1}_0(w - µ_0)\right)\exp\left({-1\over2}(y- Xw)^T(y- Xw)\right)\\
∝ \exp\left({-1\over2}\left(-2y^TXw + w^TX^TXw + w^TΛ^{-1}_0w - 2µ^T_0Λ^{-1}_0w\right)\right)
$$
We now define
$$
Λ_m=\left(X^TX + Λ^{-1}_0\right)^{-1}; \quad µ_m= Λ_m\left(X^Ty + Λ^{-1}_0µ_0\right)
$$
Using these new variables, the posterior can be rewritten as a Gaussian distribution:
$$
p(w\ |\ X, y) ∝ \exp\left({-1\over2}(w - µ_m)^TΛ^{-1}_m(w - µ_m) +{1\over2}µ^T_mΛ^{-1}_mµ_m\right)\\
∝ \exp\left({-1\over2}(w - µ_m)^TΛ^{-1}_m(w - µ_m)\right)
$$
All terms that do not include w have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1.

Examining this posterior distribution enables us to gain some intuition for the effect of Bayesian inference. In most situations, we set µ<sub>0</sub> to 0. If we set Λ<sub>0</sub> = (1/α)I, then µ<sub>m</sub> gives the same estimate of w as does frequentist linear regression with a weight decay penalty of αw<sup>T</sup>w. One difference is that the Bayesian estimate is undefined if α is set to zero: we are not allowed to begin the Bayesian learning process with an infinitely wide prior on w. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of w are, rather than providing only the estimate µ<sub>m</sub>.
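
The role of the posterior covariance is easiest to see by watching it shrink as more data arrives. The following is a small sketch on synthetic data; the helper `posterior`, the value of `alpha` and the data itself are assumptions made for illustration. It also checks that the posterior mean matches the weight-decay (ridge) solution described above.

```python
import numpy as np

def posterior(X, y, mu0, Lambda0):
    """Return (mu_m, Lambda_m) for a Gaussian prior and unit-variance noise."""
    Lambda0_inv = np.linalg.inv(Lambda0)
    Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)
    mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)
    return mu_m, Lambda_m

rng = np.random.default_rng(0)
alpha = 0.5                       # prior precision, so Lambda0 = (1/alpha) I
mu0 = np.zeros((2, 1))
Lambda0 = np.eye(2) / alpha

# Synthetic data: y = 1 + 2*x + noise, with a column of ones for the bias
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([[1.0], [2.0]]) + rng.normal(size=(200, 1))

mu_10, L_10 = posterior(X[:10], y[:10], mu0, Lambda0)
mu_200, L_200 = posterior(X, y, mu0, Lambda0)

# Frequentist counterpart: minimize 0.5*||y - Xw||^2 + 0.5*alpha*w^T w
ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

std_10 = np.sqrt(np.diag(L_10))    # posterior std of each weight, 10 samples
std_200 = np.sqrt(np.diag(L_200))  # posterior std of each weight, 200 samples
print("posterior mean (200 samples):", mu_200.ravel())
print("ridge solution              :", ridge.ravel())
print("posterior std, 10 samples   :", std_10)
print("posterior std, 200 samples  :", std_200)
```

A point estimate alone would report only the mean; the diagonal of Λ<sub>m</sub> additionally tells us how uncertain each weight still is.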

Let's try it out.


In [None]:
# Example Linear Regression as bayesian estimate problem.
np.random.seed(1)
xi = np.random.randn(100,1)
y = np.random.randn(100,1)**3
x = np.ones(xi.shape)
x = np.append(x,xi,axis=1)

def bayes_regress(x,y,alpha):
    # alpha is equal to 1/variance
    n = x.shape[1] # Rd d dimension of x
    
    # YOUR CODE HERE
    d0 = None # make a covariance matrix with alpha as 1/variance  use --> np.diag(np.ones((n))/alpha)
    u0 = None # make a initial mean vector for prior  use -->  np.zeros((n,1)) 
    # END
    
    w = np.random.normal(0, np.sqrt(1/alpha), n) # a sample of w from the Gaussian prior (not used below)
    dm = inv(np.dot(x.T,x)+inv(d0)) # posterior covariance matrix, dimension nxn
    um = np.dot(dm,(np.dot(x.T,y)+np.dot(inv(d0),u0))) # posterior mean of the weights
    print("The weights by bayesian method are:\n",um)
    
    # Verifying the found result with the normal equations method(maximum likelihood estimation)
    wmle = np.dot(inv(np.dot(x.T,x)),np.dot(x.T,y))
    print("The weights by normal equation(MLE) method are:\n",wmle)
    
    return um
    
# A function to inverse nd.matrix    
def inv(c):
    m = c.shape[0]
    if (m < 2):
        return 1/c
    else:
        return np.linalg.inv(c)
    
# start of model computation...........    
w = bayes_regress(x,y,0.01)     
yd = np.dot(x,w)
print("The figure denotes a regression line approximating the y with x")
plt.plot(xi,yd,'-r')
plt.plot(xi,y,'.')
plt.show()

> Expected output:
```
The weights by bayesian method are:
 [[ 0.24069274]
 [ 0.42435956]]
The weights by normal equation(MLE) method are:
 [[ 0.24071364]
 [ 0.42441186]]
The figure denotes a regression line approximating the y with x
```
<img src = "mlb/regression.png">

### 7. Supervised Learning Algorithms

Supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y.

#### 7.1 Probabilistic Supervised Learning

Most supervised learning algorithms in this book are based on estimating a probability distribution p(y | x). Linear regression parameterizes the mean of a distribution over real values. For classification we instead need a distribution over classes, for example whether a sample belongs to class A or class B (0 or 1). This model is called logistic regression because it uses the logistic sigmoid function to squash the output of the linear function into the (0, 1) range, where it can be interpreted as a probability. Other squashing functions could also be used, for example a tanh rescaled to the (0, 1) range.

#### Logistic Regression:
$$y' = \sigma(w^Tx+b), \qquad \sigma(z) = {1\over 1+e^{-z}} $$

With the sigmoid in the model, the mean squared error is no longer convex in the parameters, so gradient descent is not guaranteed to reach the global optimum. Since the classes are binary, we instead define the cost using the Bernoulli distribution (its negative log-likelihood):

$$
cost(y') = {\begin{cases}-\log(y')&{\text{for }}y=1\\-\log(1-y')&{\text{for }}y=0\end{cases}}
$$
The net cost over m examples can be written as:
$$
J_{net-cost} = -{1\over m}\sum_{i=1}^m\left[y_i\log y'_i + (1-y_i)\log(1-y'_i)\right]
$$

Here J is a convex function of w and b, so we can simply use gradient descent on it.

>NOTE: During prediction, if y' (the probability that y = 1) is greater than 0.5 the sample is assigned to class y = 1, otherwise to class y = 0.
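
A useful habit when deriving gradients by hand is to compare them with finite differences. The sketch below does this for the logistic cost above; the random data and the helper names `cross_entropy` and `analytic_grads` are assumptions made for illustration. The analytic gradient used here is Xᵀ(y' − y)/m for w and mean(y' − y) for b.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, X, y):
    # Mean Bernoulli negative log-likelihood
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grads(w, b, X, y):
    # Gradient of the cost above with respect to w and b
    p = sigmoid(X @ w + b)
    m = X.shape[0]
    return X.T @ (p - y) / m, float(np.mean(p - y))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (rng.random((20, 1)) > 0.5).astype(float)
w = 0.1 * rng.normal(size=(2, 1))
b = 0.05

dw, db = analytic_grads(w, b, X, y)

# Central finite differences, one parameter at a time
eps = 1e-6
dw_num = np.zeros_like(w)
for k in range(w.shape[0]):
    step = np.zeros_like(w)
    step[k] = eps
    dw_num[k] = (cross_entropy(w + step, b, X, y)
                 - cross_entropy(w - step, b, X, y)) / (2 * eps)
db_num = (cross_entropy(w, b + eps, X, y)
          - cross_entropy(w, b - eps, X, y)) / (2 * eps)

print("analytic dw:", dw.ravel(), " numeric dw:", dw_num.ravel())
print("analytic db:", db, " numeric db:", db_num)
```

If the two sets of numbers disagree beyond a small tolerance, the hand-derived gradient (or the cost) contains a mistake.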

Let's give it a try:

In [None]:
# defining problem run this cell...
np.random.seed(12)
x1 = np.abs(np.random.randn(100,1))
x2_1 = 10*np.abs(np.random.randn(100,1))
x2_2 = (x2_1) + 20 + np.random.randn(100,1)

x_1 = np.append(x1,x2_1,axis=1)
y_1 = np.ones((x_1.shape[0],1))
x_2 = np.append(x1,x2_2,axis=1)
y_2 = np.zeros((x_2.shape[0],1))

x = np.append(x_1,x_2,axis=0)
y = np.append(y_1,y_2,axis=0)

print("yellow ones have class y = 0 and green ones have class y = 1")
plt.plot(x1,x2_1,'go')
plt.plot(x1,x2_2,'yo')
plt.show()


In [None]:
# Logistic regression model: Complete this part and run.......
# defining the cost function
np.random.seed(1)
f, ax = plt.subplots(1,3,figsize=[16,4])
def cost(y,yd):
    m = y.shape[0]
    
    ## YOUR CODE HERE
    J = None # define cost as explained use --> (-1/m)*np.sum(y*np.log(yd) + (1-y)*np.log(1-yd))
    # END
    
    return J

def grads(y,yd,x):
    m = y.shape[0]
    
    ## YOUR CODE HERE
    dw = None # calc diff of cost w.r.t w as --> (-1/m)*(np.dot(x.T,y*(1-yd)) - np.dot(x.T,(1-y)*yd))
    db = None # calc diff of cost w.r.t b as --> (-1/m)*np.sum(y*(1-yd) - (1-y)*yd)
    # End
    
    return dw, db

def sigmoid(x):
    return 1/(1 + np.exp(-x)) # sigmoid function

def logistic_reg(y,x,lr):
    # similar to linear regression but with different cost and grads
    [m, n] = x.shape
    mse = []
    los = []
    
    # weights initialization
    w = 0.01*np.random.randn(n,1)
    b = 0.01*np.random.randn(1,1)
    wmse = w # weight for mse error calc
    bmse = b # bias for mse error calc

    z = np.dot(x,w) + b
    yd = sigmoid(z)
    nc = cost(y,yd)
    oc = nc
    dc = 10
    
    while(abs(dc) > 0.0000001):
        dw, db = grads(y,yd,x)
        w = w - lr*dw
        b = b - lr*db
        yd = sigmoid(np.dot(x,w) + b)
        nc = cost(y,yd)
        dc = nc-oc
        oc = nc
        
        los.append(nc)
        
        # For comparison: the same model trained with mean squared loss on the sigmoid output
        try:
            ydmse = sigmoid(np.dot(x,wmse) + bmse)  # use the MSE model's own bias
            dwmse = np.dot(x.T,(ydmse-y)*ydmse*(1-ydmse))/m  # averaged over samples, like dbmse
            dbmse = np.sum((ydmse-y)*ydmse*(1-ydmse))/m
            wmse = wmse - lr*dwmse
            bmse = bmse - lr*dbmse
            mse.append(0.5*np.dot((ydmse-y).T,ydmse-y)/m)
        except Exception:  # skip this MSE step if a numerical error occurs
            pass
            
        
    # plotting the error variation with number of epochs/ iterations
    ax[0].plot(np.squeeze(mse).T)
    ax[0].set_title("MSE error plot")
    ax[0].set_xlabel("number of iterations")
    ax[1].plot(np.squeeze(los).T)
    ax[1].set_title("Logistic loss plot")
    ax[1].set_xlabel("number of iterations")
    
    return w, b

def predict(w,b,x):
    return(sigmoid(np.dot(x,w) + b)>0.5)


print("Training...")    
w, b = logistic_reg(y,x,0.05)
yd  = predict(w,b,x)
error = (np.sum(yd != y)/y.shape[0])*100
print("Obtained Accuracy :", 100 - error,"%")
xerr1 = x[:,0].reshape(200,1)
xerr2 = x[:,1].reshape(200,1)
ax[2].plot(x1,x2_1,'go')
ax[2].plot(x1,x2_2,'yo')
ax[2].plot(xerr1[yd!=y],xerr2[yd!=y],'rs',alpha=0.7,label="wrong predictions")
ax[2].legend()
plt.show()

> Expected output:<br>
    Obtained Accuracy : 97.0 %
    <img src = "mlb/logreg.png">

#### 7.2 Support Vector Machines

One of the most influential approaches to supervised learning is the support vector machine (SVM).<br>
For linear classification, we look for a hyperplane that separates the two classes:
> For example, with a linear kernel the separating hyperplane looks something like this:<img src="mlb/svm.jpg"> The hyperplane is the set of points where **w.x + b = 0**.
<br>
*The main idea is to identify the optimal separating hyperplane which maximizes the margin of the training data.*
SVM prediction:<br>
    - f(x) = w.x + b
    - if f(x) >= 1 --> y = 1
    - if f(x) <= -1 --> y = -1
    -
    - both conditions combine into y.f(x) >= 1
We want to maximize the width of the margin between the two classes on either side of the separating hyperplane. The margin is the distance between the two hyperplanes f(x) = 1 and f(x) = -1, and computing the point-to-plane distances gives:
<br>
$$margin(m) = {2\over\|w\|}$$
<br>
so we need to minimize ||w|| (equivalently ||w||<sup>2</sup>/2) subject to the constraint y.f(x) >= 1.
    <br>
In many cases the data is not perfectly separable, so we introduce slack variables ζ<sub>i</sub>:
<br>
$$y_i f(x_i)\geq 1-\zeta_i \mbox{ for all } 1\leq i \leq n, \quad \zeta_i\geq 0\\\longrightarrow \min\ {\|w\|^2\over2} + C\sum_i{\zeta_i}$$
<br>This is called the primal form. Like λ in regularization, C controls the amount of regularization, but in the opposite direction: a larger C penalizes margin violations more heavily and therefore regularizes less.
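
As a small worked example of the soft-margin primal objective, the sketch below evaluates the margin, the slack variables and the objective ||w||²/2 + C·Σζ for a fixed, hand-picked hyperplane. The five points, `w`, `b` and `C` are made up for illustration, and labels are assumed to be in {-1, +1}. For a given (w, b) the smallest feasible slack is ζ<sub>i</sub> = max(0, 1 − y<sub>i</sub>f(x<sub>i</sub>)).

```python
import numpy as np

# Hand-picked toy data and hyperplane (assumed values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0], [1.2, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])
b = -3.0
C = 1.0

f = X @ w + b                          # f(x) = w.x + b for every sample
slack = np.maximum(0.0, 1.0 - y * f)   # zero when y*f(x) >= 1 holds
margin = 2.0 / np.linalg.norm(w)       # distance between f(x) = 1 and f(x) = -1
objective = 0.5 * (w @ w) + C * slack.sum()

print("f(x)     :", f)
print("slack    :", slack)
print("margin   :", margin)
print("objective:", objective)
```

Only the last point lies inside the margin (y·f(x) is about 0.8), so it is the only one with nonzero slack.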

For non-linear SVMs we work with the **dual form** instead. The idea is to map the inputs into a higher-dimensional feature space in which the classes become linearly separable, and to find a linear separator there.
<br>Recall that logistic regression applies the sigmoid to a linear function, which can be rewritten in terms of dot products with the training examples:

$$w^Tx + b = b +\sum^m_{i=1}α_ix^Tx_{(i)}$$
to give the probability of a sample being in a class.

The SVM replaces the dot product with a kernel function k:

$$f(x) = b +\sum_{i}α_ik(x, x_{(i)})\\k(x, x_{(i)}) = \phi(x)·\phi(x_{(i)})$$<br>
and classifies a sample directly from the sign of the output, without producing a probability.<br>The most commonly used kernel is the ***Gaussian kernel***, also known as the ***radial basis function (RBF)*** kernel since it decreases along lines in v space radiating outward from u.

$$k(u, v) = N(u − v; 0, σ^2.I)$$
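
The kernel matrix over a whole dataset can be computed without explicit loops. The sketch below uses the unnormalized form exp(−||u − v||² / (2σ²)), which drops the constant factor of the Gaussian density, as the `gkernel` function later in this notebook also does. The function name `rbf_gram` and the random data are assumptions made for illustration.

```python
import numpy as np

def rbf_gram(A, B, sigma):
    """Matrix of exp(-||a - b||^2 / (2 sigma^2)) for all rows a of A and b of B."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
K = rbf_gram(X, X, sigma=1.0)

print("K is symmetric      :", np.allclose(K, K.T))
print("diagonal of K       :", np.diag(K))
print("smallest eigenvalue :", np.linalg.eigvalsh(K).min())
```

A valid kernel always produces a symmetric positive semidefinite matrix, which is why the smallest eigenvalue is not negative (up to rounding).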

Many other methods can use kernels in the same way; they are called kernel methods or ***kernel machines***. A major drawback of kernel methods is their high computational cost, since evaluating the decision function is linear in the number of training examples. Neural networks, which we will discuss later, can outperform kernel methods on many tasks.

> Read more on kernel methods: https://en.wikipedia.org/wiki/Kernel_method

Classification in an SVM depends only on the training vectors whose coefficient α<sub>i</sub> is nonzero; these are called ***support vectors***.<br>
To solve the dual SVM problem we generally use a specialized optimization algorithm called ***sequential minimal optimization (SMO)***, since plain gradient descent becomes very slow on larger datasets.
> Read about sequential minimal optimization here: https://en.wikipedia.org/wiki/Sequential_minimal_optimization and https://en.wikipedia.org/wiki/Support_vector_machine

We have our final dual form as:
$$
\max _{\alpha }\sum _{i=1}^{n}\alpha _{i}-{\frac {1}{2}}\sum _{i=1}^{n}\sum _{j=1}^{n}y_{i}y_{j}K(x_{i},x_{j})\alpha _{i}\alpha _{j},\\
\text{subject to:}\\
0\leq \alpha _{i}\leq C,\quad {\mbox{ for }}i=1,2,\ldots ,n,\\
\sum _{i=1}^{n}y_{i}\alpha _{i}=0
$$

> Duality or lagrange duality reference : https://en.wikipedia.org/wiki/Duality_(optimization)

Let's try this....

In [None]:
# Defining problem... run this cell
np.random.seed(1)
cx = (np.arange(0,100)-50)/10
cy = np.sqrt(25 - cx**2)
a = 0.8
b = 1.5
x1_1 = (a*cx - (np.random.randn(100))).reshape(100,1)
x2_1 = (a*cy - (np.random.randn(100))).reshape(100,1)
x1_2 = (b*cx + (np.random.randn(100))).reshape(100,1)
x2_2 = (b*cy + (np.random.randn(100))).reshape(100,1)
x_1 = np.append(x1_1,x2_1,axis=1)
y_1 = np.ones((x_1.shape[0],1))
x_2 = np.append(x1_2,x2_2,axis=1)
y_2 = -np.ones((x_2.shape[0],1))
x = np.append(x_1,x_2,axis=0)
y = np.append(y_1,y_2,axis=0)
print("yellow ones have class y = -1 and green ones have class y = 1")
plt.plot(x1_1,x2_1,'go')
plt.plot(x1_2,x2_2,'yo')
plt.show()

In [None]:
# Support Vector Machine
np.random.seed(1)
def gkernel(x1, x2, sigma):
    var = x1-x2
    sim = np.exp(-(np.linalg.norm(var)**2)/(2*sigma**2))
    return sim
    
def svm_train(X, Y, C,tol = 0.0001, max_passes=5):
    [m, n] = X.shape
    sigma = 0.1
    
    
    # Variables
    alpha = np.zeros((m,1))
    b = 0
    E = np.zeros((m, 1))
    passes = 0
    K = np.zeros((m,m))
    eta = 0
    L = 0
    H = 0
    
    # The following can be slow due to the lack of vectorization
    for i in range(m):
        for j in range(m):
            K[i,j] = gkernel(X[i,:], X[j,:],sigma)
   


    # Train
    print('\nTraining ...')
    while (passes < max_passes):
        num_changed_alphas = 0
        for i in range(m):
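            E[i] = b + np.sum(alpha*Y*K[:,[i]]) - Y[i]  # K[:,[i]] has shape (m,1), so the product stays elementwise
            
            if ((Y[i]*E[i] < -tol and alpha[i] < C) or (Y[i]*E[i] > tol and alpha[i] > 0)):
                j = i
                while(j==i):
                    j = np.random.randint(m)
                
                E[j] = b + np.sum(alpha*Y*K[:,[j]]) - Y[j]
                
                alpha_i_old = alpha[i].copy()  # copy: alpha[i] is a view that changes when alpha is updated
                alpha_j_old = alpha[j].copy()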
                
                # L and H are the lower and upper bounds that keep alpha[j] in [0, C]
                if (Y[i] == Y[j]): # same labels: alpha[i] + alpha[j] stays constant
                    L = max(0, alpha[j] + alpha[i] - C) 
                    H = min(C, alpha[j] + alpha[i])
                else: # different labels: alpha[j] - alpha[i] stays constant
                    L = max(0, alpha[j] - alpha[i])
                    H = min(C, C + alpha[j] - alpha[i])
                
                if (L == H): # no room to move alpha[j]
                    continue
                    
                eta = 2 * K[i,j] - K[i,i] - K[j,j] # curvature of the objective along the update direction
                if (eta >= 0): # no strict maximum along this direction (e.g. identical points)
                    continue
                
                alpha[j] = alpha[j] - (Y[j] * (E[i] - E[j])) / eta # update of alpha[j]
                
                # contraints  alpha > 0 and < C
                alpha[j] = min (H, alpha[j])
                alpha[j] = max (L, alpha[j])
                
                # check if change in alpha is significant:
                if (abs(alpha[j] - alpha_j_old) < tol):
                    alpha[j] = alpha_j_old
                    continue
                    
                # Determine value for alpha i  
                alpha[i] = alpha[i] + Y[i]*Y[j]*(alpha_j_old - alpha[j])
                
                # computing b
                b1 = b - E[i] - Y[i]*(alpha[i]-alpha_i_old)*K[i,i]-Y[j]*(alpha[j]-alpha_j_old)*K[i,j]
                b2 = b - E[j] - Y[i]*(alpha[i]-alpha_i_old)*K[i,j]-Y[j]*(alpha[j]-alpha_j_old)*K[j,j]    
                
                if(0<alpha[i] and alpha[i]<C):
                    b = b1
                elif(0<alpha[j] and alpha[j]<C):
                    b = b2
                else:
                    b = (b1 + b2)/2
                
                num_changed_alphas = num_changed_alphas + 1
                
        
        if(num_changed_alphas==0):
            passes = passes+1
        else:
            passes = 0
        
    print("Done")
    
    return K, b, alpha

def predict_svm(K,b,alpha,X,Y):   
    # prediction
    [m, n] = X.shape
    Yd = Y*0
    for i in range(m):
        Pr = 0
        for j in range(m):
            Pr = Pr + alpha[j]*Y[j]*K[i,j]
        Yd[i] = Pr + b
    
    Yd[Yd>=0] = 1
    Yd[Yd<0] = -1
    
    return Yd
    
                
                
                
C = 50
K, b, alpha = svm_train(x,y,C)
yd = predict_svm(K, b, alpha, x, y)

accuracy = 100*(np.sum(yd==y)/yd.shape[0])
print("Accuracy:",accuracy,"%")
xm1 = x[:,0].reshape(200,1)
xm2 = x[:,1].reshape(200,1)
x1 = xm1[yd!=y]
x2 = xm2[yd!=y]

plt.figure(figsize=[16,8])
plt.plot(x1_1,x2_1,'bo')
plt.plot(x1_2,x2_2,'ko')
plt.plot(x1,x2,'ro',label="misclassifications")
plt.legend()


plt.show()



> You can read about the implementation of SMO in detail here: http://pages.cs.wisc.edu/~dpage/cs760/SMOlecture.pdf

#### 7.3 Other Simple Supervised Learning Algorithms

There are also some non probabilistic algorithms like:

**K-Nearest Neighbours**: a family of techniques that can be used for regression or classification. As a nonparametric learning algorithm, k-nearest neighbours is not restricted to a fixed number of parameters. When we want to produce an output y for a new test input x, we find the k nearest neighbours to x in the training data X. We then return the average of the corresponding y values in the training set. This works for essentially any kind of supervised learning where we can define an average over y values.

> Algorithm:
    - Calculate the distance of the test vector from all training vectors.
    - Sort the distances.
    - Pick the k nearest training vectors (neighbours).
    - Assign the test vector to the majority class among the k neighbours.
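
The four steps above can be sketched in vectorized form as follows. This is a minimal illustration for binary labels 0/1, where the majority vote reduces to checking whether the mean label of the neighbours exceeds 0.5; the function name `knn_predict` and the tiny dataset are assumptions, not part of the original notebook.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    # Step 1: Euclidean distance from every test point to every training point
    diff = X_test[:, None, :] - X_train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))        # shape (n_test, n_train)
    # Steps 2 and 3: sort the distances and keep the indices of the k nearest
    nearest = np.argsort(dist, axis=1)[:, :k]
    # Step 4: majority vote among the neighbours (labels are 0 or 1)
    votes = y_train[nearest]                       # shape (n_test, k)
    return (votes.mean(axis=1) > 0.5).astype(int)

# Two well-separated groups of training points (made-up values)
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5]])

pred = knn_predict(X_train, y_train, X_test, k=3)
print("predicted classes:", pred)
```

Using an odd k avoids ties in the binary vote.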

In [None]:
# defining problem run this cell...
np.random.seed(12)
x1 = np.abs(np.random.randn(100,1))
x2_1 = 10*np.abs(np.random.randn(100,1))
x2_2 = (x2_1) + 16 + np.random.randn(100,1)
x_1 = np.append(x1,x2_1,axis=1)
y_1 = np.ones((x_1.shape[0],1))
x_2 = np.append(x1,x2_2,axis=1)
y_2 = np.zeros((x_2.shape[0],1))
x = np.append(x_1,x_2,axis=0)
y = np.append(y_1,y_2,axis=0)
print("yellow ones have class y = 0 and green ones have class y = 1")
plt.plot(x1,x2_1,'go')
plt.plot(x1,x2_2,'yo')
plt.show()

In [None]:
# K- nearest neighbours: Classification
def findClass(x,y,point,k):
    [m, n] = x.shape
    dist = y*0
    for i in range(m):
        d = 0
        for j in range(n):
            d = d + (x[i,j] - point[j])**2
        
        dist[i] = np.sqrt(d)
    
    indx = np.argsort(dist,axis=0)
    
    dist = dist[indx]
    y = y[indx]
    
    c = 0
    for i in range(k):
        c = c + y[i]
        
    c = c/k
    c = 1*(c>0.5)

    return c

def classify(x,y,test,k):
    
    m = test.shape[0]
    yd = np.zeros((m,1))
    for i in range(m):
        yd[i] = findClass(x,y,test[i,:],k)
        
    return yd
    
    
    
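percent = 0.8
# shuffle before splitting: the data is ordered by class, so an unshuffled
# split would leave only class-0 samples in the test set
perm = np.random.permutation(y.shape[0])
x, y = x[perm], y[perm]
trs = int(percent*y.shape[0])
x_tr = x[0:trs] 
y_tr = y[0:trs]
x_te = x[trs:]
y_te = y[trs:]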

K = 3 # try changing this value

yd = classify(x_tr,y_tr,x_te,K)
A = (yd==y_te)
acc = 100*np.sum(A)/A.shape[0]
print("Accuracy :",acc,"%")

plt.plot(x1,x2_1,'go')
plt.plot(x1,x2_2,'yo')
m1 = x[trs:,0].reshape(y.shape[0]-trs,1)[yd!=y_te]
m2 = x[trs:,1].reshape(y.shape[0]-trs,1)[yd!=y_te]
plt.plot(m1,m2,'r^',label = 'misclassified')
plt.legend()
plt.show()

>There are other nonparametric methods, such as decision trees, which you can read about here if you want: https://en.wikipedia.org/wiki/Decision_tree_learning

### 8. Unsupervised Learning Algorithms

As already discussed, unsupervised learning is learning in which no output labels are provided, so everything is learned from the input data alone. The distinction between supervised and unsupervised algorithms is not formally and rigidly defined. In general, unsupervised algorithms are used to find the best representation of the available data.

There are multiple ways of defining a simpler representation:
   - lower-dimensional representations
   - sparse representations
   - independent representations
   
There are several popular methods for doing this:

#### 8.1 PCA (principal component analysis)
As we already discussed in Linear Algebra, PCA is used for dimensionality reduction, i.e. for learning a lower-dimensional representation.

#### 8.2 K-means Clustering
The k-means clustering algorithm divides the training set into k different clusters of examples that are near each other.
We can thus think of the algorithm as providing a k-dimensional one-hot code vector h representing an input x: if x belongs to cluster i, then h<sub>i</sub> = 1 and all other entries of h are zero.

> One-hot Code : https://en.wikipedia.org/wiki/One-hot

The one-hot code provided by k-means clustering is an example of a sparse representation, because the majority of its entries are zero for every input. Other algorithms learn more flexible sparse representations, where more than one entry can be nonzero for each input x. One-hot codes are an extreme example of sparse representations that lose many of the benefits of a distributed representation. The one-hot code still confers some statistical advantages, and it confers the computational advantage that the entire representation may be captured by a single integer.
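
The relation between the cluster index (a single integer) and the one-hot code h can be sketched in a few lines. The centroids and points below are made-up values, assumed to come from an already trained k-means model.

```python
import numpy as np

# Hypothetical centroids of a trained k-means model with k = 3
centroids = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 5.0]])
x = np.array([[0.5, -0.2], [3.8, 4.1], [0.3, 4.6], [4.5, 3.0]])

# Squared distance from every point to every centroid, shape (n_points, k)
d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

index = d2.argmin(axis=1)                            # one integer per input
one_hot = np.eye(len(centroids), dtype=int)[index]   # k-dimensional code h

print("cluster index:", index)
print("one-hot codes:")
print(one_hot)
```

Each row of `one_hot` has exactly one nonzero entry, so storing `index` alone loses no information.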

Algorithm:

   1. Given a set of inputs {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>}, each with d dimensions, and a number k defining the number of clusters to form.
   2. Initialize k centroids in the d-dimensional space, near the inputs and ideally far apart from each other.
   3. Calculate the distance of each sample from each centroid and assign the sample to the cluster of the nearest centroid.
   4. Update each centroid to the mean of the samples assigned to its cluster.
   5. Repeat steps 3 and 4 until convergence.
   

In [None]:
# Representing initial data
np.random.seed(50)
x = np.random.rand(300,2)
x[0:150,0] = x[0:150,0] + 1 + np.random.rand(150,)
x[0:150,1] = x[0:150,1] + 1 - np.random.rand(150,)
x[150:200,0] = x[150:200,0]*1 + 2*np.random.rand(50,)
x[150:200,1] = x[150:200,1]*2 + np.random.rand(50,)
plt.plot(x[:,0],x[:,1],'o')
plt.show()

In [None]:
# K - means Clustering
np.random.seed(12)
plt.figure(figsize=[16,8])
def cluster(x,k,max_pass=5):
    [m, n] = x.shape
    C = np.random.rand(k, n)
    d = np.zeros((m,k))
    
    for i in range(k):
        for j in range(n):
            d[:,i] = d[:,i] + (x[:,j] - C[i,j])**2
            
    oh = np.argmin(d,axis=1)
    oldoh = oh*0
    d = np.zeros((m,k))
    ghi = 0
    cp = C.reshape(C.shape[0],C.shape[1],1)
    passes = 0
    while(passes < max_pass):
        for z in range(k):
            mask = (oh==z).reshape(m,1)*1
            div = np.sum(mask)
            xd = x*mask
            if (div!=0):
                C[z,:] = np.sum(xd,axis=0)/div
        
        for i in range(k):
            for j in range(n):
                d[:,i] = d[:,i] + (x[:,j] - C[i,j])**2
        
        oldoh = oh
        oh = np.argmin(d,axis=1)
        d = np.zeros((m,k))
        ghi = ghi+1
        
        if(np.sum(oldoh != oh)<=1):
            passes = passes+1
        else:
            passes = 0
        
        cp= np.append(cp,C.reshape(k,n,1),axis = 2)
        
    for i in range(k):
        if (i == 0):
            plt.plot(cp[i,0,:],cp[i,1,:],'-kx',label="path of centroids")
        else:
            plt.plot(cp[i,0,:],cp[i,1,:],'-kx')
    
    
    return C, oh.reshape(m)

        
C, h = cluster(x,2)    
plt.scatter(x[:,0],x[:,1],c = h)
plt.plot(C[:,0],C[:,1],'ro',label="centroid")
plt.legend()
plt.show()

### 9. Stochastic Gradient Descent

As the dataset grows, the computation time for one iteration of gradient descent (and of many other optimization methods) grows with it, so the optimization process becomes very slow.

***Stochastic means random***: we shuffle the training data and divide it into small ***minibatches*** that are fast to compute on, then take one gradient step per minibatch. Because each step sees only part of the data, it does not go directly down the slope of the actual cost function, but averaged over a few iterations the path moves towards the optimum. One problem can arise near convergence: the parameters may keep fluctuating slightly around the optimum without settling, so we have to stop iterating at that point.

The convergence path over a cost contour plot will look something like this: <img src="mlb/stochastic.png">

For more details about stochastic gradient descent, see: https://en.wikipedia.org/wiki/Stochastic_gradient_descent
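
A minimal sketch of minibatch SGD on a linear regression problem is shown below. The synthetic data, learning rate, batch size and number of epochs are all assumed values chosen for illustration; in practice they need tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # bias column + one feature
w_true = np.array([2.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=m)

w = np.zeros(2)
lr, batch_size, epochs = 0.05, 32, 20

for epoch in range(epochs):
    order = rng.permutation(m)                 # reshuffle the data every epoch
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]  # indices of one minibatch
        # Gradient of the mean squared error, estimated on the minibatch only
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / idx.size
        w -= lr * grad

print("true weights     :", w_true)
print("SGD estimate of w:", w)
```

Each update touches only 32 samples instead of all 1000, and the estimate still ends up close to the true weights, with a small residual fluctuation caused by the minibatch noise.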

-----------------

### 10. Building a Machine Learning Algorithm

Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.

> Example: Linear Regression 
``` 
- specification of data --> {X,y}
- cost function --> J(w, b) = -E_{x,y ~ p̂_data} log p_model(y | x)
- model specification --> p_model(y | x) = N(y | x'.w + b, 1)
```
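
The linear regression recipe above can be written out directly. In this sketch the dataset is synthetic, the cost is the mean negative log-likelihood under N(y | x'.w + b, 1), and the optimization procedure is assumed to be the normal equations (one common choice for this model); the names `cost`, `Xa` and `theta` are illustrative.

```python
import numpy as np

# Dataset specification: {X, y} (synthetic, assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w_true, b_true = np.array([1.5, -0.5]), 0.3
y = X @ w_true + b_true + rng.normal(size=100)

# Cost function: mean negative log-likelihood under N(y | x'.w + b, 1)
def cost(w, b):
    residual = y - (X @ w + b)
    return 0.5 * np.mean(residual ** 2) + 0.5 * np.log(2 * np.pi)

# Optimization procedure: solve for zero gradient with the normal equations,
# using a column of ones so that the bias is estimated together with w
Xa = np.column_stack([X, np.ones(len(X))])
theta = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)
w_hat, b_hat = theta[:2], theta[2]

print("estimated w, b :", w_hat, b_hat)
print("cost at optimum:", cost(w_hat, b_hat))
```

Swapping any one ingredient, for example replacing the normal equations with gradient descent, gives a different algorithm built from the same recipe.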

By realizing that we can replace any of these components mostly independently from the others, we can obtain a wide range of algorithms.

The recipe for constructing a learning algorithm by combining models, costs, and optimization algorithms supports both supervised and unsupervised learning.

Unsupervised learning can be supported by defining a dataset that contains only X and providing an appropriate unsupervised cost and model:
>Example: PCA can be modeled as:
<br>
    - specification of data --> {X}<br>
    - cost function --> J(w) = E_{x ∼ p_data} ||x − r(x; w)||^2<br>
    - model specification --> r(x) = (w.T).x.(w)
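
The PCA cost can be evaluated numerically. The sketch below generalizes the single direction w to a matrix W with orthonormal columns, so that the reconstruction becomes r(x) = W Wᵀ x; this matrix form, the synthetic data and the use of the SVD to obtain the principal directions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data with very different spread along each axis
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
X = X - X.mean(axis=0)                      # PCA assumes centred data

# Principal directions = right singular vectors of the centred data
_, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:2].T                                # top-2 directions, shape (3, 2)

def reconstruct(X, W):
    # Encode with W^T, decode with W (rows of X are the samples)
    return X @ W @ W.T

def cost(X, W):
    # J(W): mean squared reconstruction error
    return np.mean(np.sum((X - reconstruct(X, W)) ** 2, axis=1))

# Compare with an arbitrary orthonormal basis of the same size
Q, _ = np.linalg.qr(rng.normal(size=(3, 2)))

J_pca, J_rand = cost(X, W), cost(X, Q)
print("reconstruction cost with PCA directions   :", J_pca)
print("reconstruction cost with random directions:", J_rand)
```

The PCA directions give the lowest reconstruction cost among all orthonormal bases of that size, and the cost equals the discarded squared singular value divided by the number of samples.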

In some cases, the cost function may be a function that we cannot actually evaluate, for computational reasons. In these cases, we can still approximately minimize it using iterative numerical optimization, as long as we have some way of approximating its gradients.

> Please read section 5.10 here: http://www.deeplearningbook.org/contents/ml.html

--------------

### 11. Challenges Motivating Deep Learning

> Please read section 5.11 here: http://www.deeplearningbook.org/contents/ml.html

**# Congratulations #** on completing this large tutorial! You have learned a lot of useful machine learning approaches and techniques. There are a lot more, but those are also related to what you have already learned. Pat yourself on the back and move forward to part 2 of this series.


## PART 1 COMPLETE