# Overview

- Machine learning builds statistical models that mimic intelligent decision making.
- It finds patterns in complex, scattered data and presents them as useful information.
- It is used to build smart applications and devices.

- Machine learning is a subfield of AI concerned with algorithms that allow computers to learn.
- In most cases, an algorithm is given a set of data and infers information about the properties of that data.
- This information allows it to make predictions about other data that it might see in the future.

<img src="1/hierarchy.png" width=500 height=500 />

- Examples of smart applications: photo-based applications, chatbots, voice recognition, pattern analysis, digital security.

- Other examples are robotics, game AI, recommendation systems, and pseudo-creative AI (for example, Prisma).
- Prisma uses a trained model to repaint a photo, such as a face, in the style of an artist.

### Others are also using it
- Google page ranking
- Netflix suggestions
- Tinder, for us to "chill"
- Uber and Tesla self-driving cars running in California
- Political campaigns and advertisement campaigns
- Spam filtering
- Google AdSense
- Bioinformatics
- Google "Allo", Amazon "Alexa"
- Facebook photo tagging

### A few important terms
- Input features: $x=(x_{1},x_{2},x_{3},x_{4},\ldots)$
- Training set: a list of examples, each with its input features and its known output: $[(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), (x^{(3)},y^{(3)}), \ldots]$
- Test set: new data that is not part of the training set, given as inputs only: $[x^{(1)}, x^{(2)}, x^{(3)}, \ldots]$.
  Once the machine has completed its learning on the training set, it predicts the output for the test set.
- Hypothesis: the function that maps input (X) to output (Y)

- $x^{(i)}$   : ith example of x
- $x_{j}^{(i)}$ : jth feature of ith example of x
- each example is a set of features, x = {$x_{1}$, $x_{2}$, ...}

- Assume $x_{1}$ is area and $x_{2}$ is floor.
- Then x = [area, floor] = {$x_{1}$, $x_{2}$} is one example, i.e., $x^{(i)}$.
- Area or floor is one feature of this example.
- $y^{(i)}$ is the output for this ith example.
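
The indexing notation above can be tried out on a small array. This is only an illustrative sketch: the areas and prices are borrowed from the table used later in these notes, while the floor values are made up for the example.

```python
import numpy as np

# Each row is one example [area, floor]; the floor values are hypothetical
X = np.array([
    [400.0, 1.0],
    [600.0, 2.0],
    [1000.0, 3.0],
])
# One output (price) per example
y = np.array([800.0, 1150.0, 2100.0])

i, j = 2, 1  # the math notation counts from 1, NumPy counts from 0

x_i = X[i - 1]          # x^(i): all features of the i-th example
x_ij = X[i - 1, j - 1]  # x_j^(i): the j-th feature of the i-th example
y_i = y[i - 1]          # y^(i): the output of the i-th example

print(x_i, x_ij, y_i)
```

Here `x_i` is the whole second example, `x_ij` is its area, and `y_i` is its price.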


## Supervised Learning

- The training set contains both x and y, i.e., the data is labeled, so we can learn to predict the output for the test set.

Two types
- Regression: we predict continuous-valued outputs, e.g., a housing price or a stock price based on certain features. The output y = f(x) is a real number, where x is the input feature set.
- Classification: the output is one of a set of discrete classes, e.g., apple, guava, mango. When a new fruit is given, we try to map it to one of these classes.
  The classes are distinguished by features such as shape.

### Linear Regression

- Assume we have been given a training set with labeled data, i.e., area and price.

In [8]:

    
import pandas as pd
data = [[400,800], [600,1150], [1000,2100]]
pd.DataFrame(data, columns=["Area", "Price"])



|   | Area | Price |
|---|------|-------|
| 0 | 400  | 800   |
| 1 | 600  | 1150  |
| 2 | 1000 | 2100  |


- Now if a new area is given, say 700, what can we predict about its price?
- We need y = f(x), learned from the training set above.
- The price can be predicted to be roughly 1400.
- i.e., if we plot the table above, the points lie close to a line, and the price for an area of 700 will be near this line.
- We can write the line as y = mx + c or y = $\theta_{1}$x + $\theta_{0}$.
- This line is the hypothesis; learning means finding the values of $\theta_{1}$ and $\theta_{0}$.
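
As a quick check of this estimate, the line can be fitted with NumPy's built-in least-squares helper `np.polyfit`. This is a shortcut used only for illustration, not the gradient descent method developed later in these notes.

```python
import numpy as np

# Training data from the table above
area = np.array([400.0, 600.0, 1000.0])
price = np.array([800.0, 1150.0, 2100.0])

# With deg=1, np.polyfit returns [slope, intercept] of the least-squares line
theta_1, theta_0 = np.polyfit(area, price, deg=1)

# Predict the price for a new area of 700
predicted_price = theta_1 * 700 + theta_0
print(theta_1, theta_0, predicted_price)
```

For these three points the fitted line predicts a price of about 1423 for an area of 700.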

### Generalized hypothesis

\begin{equation*}
h_\theta(x)=\sum_{i=0}^n \theta_i x_i, \quad x_{0}=1
\end{equation*}
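
The sum above is a dot product between $\theta$ and the feature vector once $x_{0}=1$ is prepended. A minimal sketch, where the `theta` values and the example `x` are hypothetical numbers chosen only to make the arithmetic easy to follow:

```python
import numpy as np

def hypothesis(theta, x):
    """Return sum_i theta_i * x_i after prepending x_0 = 1 to x."""
    x_with_bias = np.concatenate(([1.0], x))
    return float(np.dot(theta, x_with_bias))

theta = np.array([50.0, 2.0, 10.0])  # hypothetical theta_0, theta_1, theta_2
x = np.array([400.0, 3.0])           # hypothetical example [area, floor]

# 50*1 + 2*400 + 10*3 = 880
print(hypothesis(theta, x))
```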



When many different lines can be drawn through the training set, which line is best?
- The one with the minimum error.

<img src="1/error.png" width=500 height=500 />

How to calculate this error?

\begin{equation*}
    \text{total error}=\sum_{i=1}^m |y^{(i)}-h_\theta(x^{(i)})|
\end{equation*}

 where m is the number of training examples
- but the absolute value function is not differentiable at zero
- so we make a small change and square each term instead

\begin{equation*}
    \text{total squared error}=\sum_{i=1}^m (y^{(i)}-h_\theta(x^{(i)}))^2
\end{equation*}

- this is differentiable everywhere
- it has a nice probabilistic interpretation (discussed later).
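
To make the two error measures concrete, the sketch below evaluates both for two candidate lines on the area and price data from the table earlier in these notes. The two candidate lines are hypothetical choices, picked only for comparison.

```python
import numpy as np

area = np.array([400.0, 600.0, 1000.0])
price = np.array([800.0, 1150.0, 2100.0])

def total_absolute_error(y, y_hat):
    return float(np.sum(np.abs(y - y_hat)))

def total_squared_error(y, y_hat):
    return float(np.sum((y - y_hat) ** 2))

# Hypothetical candidate lines given as (theta_0, theta_1)
candidates = {"y = 2x": (0.0, 2.0), "y = 1.5x": (0.0, 1.5)}

results = {}
for name, (theta_0, theta_1) in candidates.items():
    y_hat = theta_0 + theta_1 * area
    results[name] = (total_absolute_error(price, y_hat),
                     total_squared_error(price, y_hat))
    print(name, results[name])
```

Both measures rank `y = 2x` as the better line here. The squared error penalizes the large misses of `y = 1.5x` much more heavily.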

## Gradient Descent

#### Most machine learning algorithms are about optimization, i.e., minimizing the error over all training examples

- Here the error function is a polynomial of degree 2.
- The error is a function of $\theta$.

- In linear regression, the error depends on the chosen line.
- The closer the line is to the training examples, the smaller the error.
- Therefore, for the quadratic error, we look for the $\theta$ at which the error reaches its minimum.
- Our error is a convex function of $\theta$.

What is a convex function?
- Whenever we draw a line segment between any two points on its graph, the function lies on or below that segment.
- Every local minimum of a convex function is also a global minimum.
- At a minimum we need

\begin{equation*}
\left|\frac{\partial f(\theta)}{\partial \theta}\right|=0
\end{equation*}

<img src="1/gradientdescenterror.png" width=500 height=500 />

### Update Rule

\begin{equation*}
\theta=\theta- \alpha\frac{\partial f(\theta)}{\partial \theta}
\end{equation*}

- Here $\alpha$ is the learning rate, which should be small. Why?
- Because the learning rate sets the size of each step we take from the current $\theta$ toward the minimum; if it is too large, a step can overshoot the minimum.
- The update rule stops changing $\theta$ when the derivative is zero, i.e., once we reach the minimum:
- $\theta$=$\theta$-$\alpha$*0
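
The effect of the learning rate can be seen on a toy problem. The sketch below applies the update rule to the convex function $f(\theta)=(\theta-3)^2$, a hypothetical stand-in for the regression error, with its minimum at $\theta=3$. The two learning rates are illustrative values.

```python
def gradient_descent_1d(theta, learning_rate, steps):
    """Minimize f(theta) = (theta - 3)**2 with the update rule."""
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)            # derivative of f at theta
        theta = theta - learning_rate * grad  # update rule
    return theta

# A small learning rate moves steadily toward the minimum at theta = 3
theta_small = gradient_descent_1d(theta=0.0, learning_rate=0.1, steps=100)

# A learning rate that is too large overshoots further on every step
theta_large = gradient_descent_1d(theta=0.0, learning_rate=1.1, steps=100)

print(theta_small, theta_large)
```

With the small rate the result lands very close to 3. With the large rate each step jumps past the minimum and the value moves further away.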


<img src="1/update.png" width=500 height=500 />

For the line $h_\theta(x)=\theta_{0}+\theta_{1}x$, the total squared error can be written out in terms of $\theta_{0}$ and $\theta_{1}$, with one update rule per parameter:

\begin{equation*}
    \text{total squared error}=\sum_{i=1}^m (y^{(i)}-h_\theta(x^{(i)}))^2
\end{equation*}

\begin{equation*}
    f(\theta_{0},\theta_{1})=\sum_{i=1}^m (y^{(i)}-(\theta_{0}+\theta_{1}x^{(i)}))^2
\end{equation*}

\begin{equation*}
    \theta_{0}=\theta_{0}-\alpha\frac{\partial f(\theta_{0},\theta_{1})}{\partial \theta_{0}}
\end{equation*}

\begin{equation*}
    \theta_{1}=\theta_{1}-\alpha\frac{\partial f(\theta_{0},\theta_{1})}{\partial \theta_{1}}
\end{equation*}
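
As a worked step, assume the error is scaled by a factor of 1/2 and call it $J$. This is the scaling that the `error` function in the code later in these notes uses, and it does not change where the minimum lies. The partial derivatives needed by the two update rules then come out as:

\begin{equation*}
    J(\theta_{0},\theta_{1})=\frac{1}{2}\sum_{i=1}^m (y^{(i)}-(\theta_{0}+\theta_{1}x^{(i)}))^2
\end{equation*}

\begin{equation*}
    \frac{\partial J}{\partial \theta_{0}}=-\sum_{i=1}^m (y^{(i)}-(\theta_{0}+\theta_{1}x^{(i)}))
\end{equation*}

\begin{equation*}
    \frac{\partial J}{\partial \theta_{1}}=-\sum_{i=1}^m (y^{(i)}-(\theta_{0}+\theta_{1}x^{(i)}))\,x^{(i)}
\end{equation*}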

<img src="1/a.png" width=500 height=500 />

### Gradient Descent algorithm

<img src="1/1.png" width=700 height=600 />

<img src="1/2.png" width=700 height=600 />

<img src="1/3.png" width=700 height=600 />

### Linear Regression Algorithm

- Fit $\theta$ to get the minimum error, i.e., find $\theta_{0}$ and $\theta_{1}$.
- After that, to predict the output, use the formula $\theta^{T}x$,
- where x is a point from the test set (with $x_{0}=1$ prepended) and $\theta$ is the array of fitted parameters.
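
The prediction step for a whole test set can be sketched as one matrix product. The `theta` values and test areas below are hypothetical; in practice `theta` would come from the fitting step.

```python
import numpy as np

theta = np.array([-100.0, 2.0])           # hypothetical fitted [theta_0, theta_1]
x_test = np.array([500.0, 700.0, 900.0])  # hypothetical test-set areas

# Prepend a column of ones so that each row is [x_0, x_1] with x_0 = 1
X_test = np.column_stack((np.ones_like(x_test), x_test))

# Each entry of X_test @ theta is theta^T x for one test point
predictions = X_test @ theta
print(predictions)
```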

#### Code for the linear regression algorithm

- matplotlib is used for plotting graphs
- numpy for maths
- pandas for reading and writing CSV files

### Data: (X) milk acidity, (Y) density of milk

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# df.values converts the DataFrame into a NumPy array
def readData(filename):
    df=pd.read_csv(filename)
    return df.values

x=readData('./X.csv')
y=readData('./Y.csv')
print(x)
# shape gives the array dimensions (number of rows, number of columns)
print(x.shape)

# reshape(-1) flattens the column array into a 1-D array
print(x.reshape(-1))
print(y.reshape(-1))

# plot joins consecutive points with line segments
plt.plot(x,y)

# scatter draws only the individual points
plt.scatter(x,y)
plt.show()
      

## Normalization step (optional)

- Generally we make mean = 0. Why?
So that features lying far from the origin are shifted to be centered on the origin.
- and standard deviation = 1. Why?
So that all features have a comparable spread; note that this does not confine the values to the 0 to 1 range.
- Normalization doesn't change the shape of the graph, only the scale of the axis.

In [None]:
# subtract the mean first, then divide by the standard deviation
x=(x-x.mean())/x.std()
plt.scatter(x,y)
plt.show()
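
A small check of what this step does, using the three area values from the earlier table as a stand-in for the CSV data (an assumption made only for this sketch):

```python
import numpy as np

values = np.array([400.0, 600.0, 1000.0])

# Standardize: subtract the mean, then divide by the standard deviation
z = (values - values.mean()) / values.std()

print(z)
print(z.mean(), z.std())  # mean is approximately 0, standard deviation approximately 1
print(z.min(), z.max())   # the values are not limited to the 0 to 1 range
```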


#### Algorithm- Linear Regression

In [None]:
def hypothesis(theta,x):
    return theta[0]+theta[1]*x

def error(X,Y,theta):
    total_error=0
    
    # number of training examples
    m=X.shape[0]
    
    for i in range(m):
        total_error+= (Y[i]-hypothesis(theta,X[i]))**2
    
    return 0.5*total_error 

def gradient(X,Y,theta):
    grad=np.array([0.0,0.0])
    m=X.shape[0]
    
    # accumulate the partial derivatives of the error J(theta)
    # with respect to theta[0] and theta[1] over all training examples
    for i in range(m):
        grad[0]+=-1*(Y[i]-hypothesis(theta,X[i]))
        grad[1]+=-1*(Y[i]-hypothesis(theta,X[i]))*X[i]
    return grad
    

def gradient_descent(X,Y,learning_rate,maxitr):
    
    # we need the partial derivative of the error with respect to theta 0
    # and with respect to theta 1, so grad has two entries
    # similarly theta holds theta 0 and theta 1
    
    grad=np.array([0.0,0.0])
    theta=np.array([0.0,0.0])
    
    # e stores the error at each iteration
    e=[]
    
    for i in range(maxitr):
        grad=gradient(X,Y,theta)
        ce=error(X,Y,theta)
        theta[0]=theta[0]-learning_rate*grad[0]
        theta[1]=theta[1]-learning_rate*grad[1]
        e.append(ce)
    
    return theta,e
    
    
# maxitr is the maximum number of iterations, taken as 100 for now
# the error graph shows that the error flattens out after about 40 iterations
# the learning rate is deliberately chosen to be small

# X and Y are the 1-D versions of the x and y arrays loaded above
X=x.reshape(-1)
Y=y.reshape(-1)

theta,e = gradient_descent(X,Y,learning_rate=0.001,maxitr=100)
print(theta[0],theta[1])

plt.scatter(X,Y)
plt.plot(X,hypothesis(theta,X),color='r')
plt.show()


# plot the error recorded at each iteration
plt.plot(e)
plt.show()

# the error after 50 iterations and after 99 iterations is almost the same,
# so about 50 iterations are already enough
print(e[50])
print(e[99])
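
The per-example loops above can also be written with NumPy array operations. The sketch below is one possible vectorized variant, not the version used in these notes. Because `X.csv` and `Y.csv` are not available here, it runs on synthetic data, and the slope, intercept, and noise level are made-up values.

```python
import numpy as np

def gradient_descent_vectorized(X, Y, learning_rate, maxitr):
    m = X.shape[0]
    A = np.column_stack((np.ones(m), X))  # rows are [1, x], so x_0 = 1
    theta = np.zeros(2)
    errors = []
    for _ in range(maxitr):
        residual = Y - A @ theta                    # y - h(x) for every example
        errors.append(0.5 * np.sum(residual ** 2))  # same error as error()
        grad = -A.T @ residual                      # both partial derivatives at once
        theta = theta - learning_rate * grad
    return theta, errors

# Synthetic stand-in for the CSV data (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(size=99)
X = (X - X.mean()) / X.std()
Y = 1.0 + 0.5 * X + rng.normal(scale=0.01, size=99)

theta, errors = gradient_descent_vectorized(X, Y, learning_rate=0.001, maxitr=100)
print(theta)
```

On this synthetic data the result should agree closely with a direct least-squares fit such as `np.polyfit(X, Y, deg=1)`.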



    
    

### * Note: Linear regression is a regression algorithm (not classification) and a form of supervised learning