# Neural Networks


We started thinking about machine learning with the basic idea that
our target variable ($y_i$) is related to the features $\mathbf{x}_i$
by some function (for sample $i$):

$$ y_i =f(\mathbf{x}_i)$$

But we don't know that function exactly, so we assume a type (a decision
tree, a boundary for SVM, a probability distribution) that has some parameters
$\theta$ and then use a machine
learning algorithm $\mathcal{A}$ to estimate the parameters for $f$. In the
decision tree the parameters are the thresholds to compare to, in GaussianNB the parameters are the means and variances, and in SVM they are the support vectors that define the margin.

$$\theta = \mathcal{A}(X,y) $$

We can then use the estimated parameters to make predictions on our test data:

$$ \hat{y}_i = f(\mathbf{x}_i;\theta) $$
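
As a minimal sketch of this fit-then-predict pattern, the example below uses GaussianNB, where fitting plays the role of $\mathcal{A}$ and the learned per-class means end up in the `theta_` attribute. The iris dataset and the `random_state=0` split are assumptions chosen only for illustration.

```python
from sklearn import datasets, model_selection
from sklearn.naive_bayes import GaussianNB

# Illustrative data only: the iris dataset stands in for (X, y)
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=0
)

# theta = A(X, y): fitting estimates the parameters of the assumed form
gnb = GaussianNB().fit(X_train, y_train)

# For GaussianNB, the learned per-class feature means are stored in theta_
print(gnb.theta_.shape)  # (n_classes, n_features)

# y_hat_i = f(x_i; theta): predict on held-out samples
y_pred = gnb.predict(X_test)
print(y_pred[:5])
```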

A neural net allows us to avoid assuming a specific form for $f$ up front: it works as a
universal function approximator. For one hidden layer and a binary classification problem:


$$f(x) = W_2g(W_1^T x +b_1) + b_2 $$

where the function $g$ is called the activation function. So we approximate some
unknown, complicated function $f$ by taking a weighted sum of all of the inputs
and passing it through another, known function.
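
The formula above can be sketched directly in NumPy. This is only an illustration of the computation: the sizes (4 features, 3 hidden units), the random weights, and the choice of `tanh` for $g$ are all assumptions, not values from a trained model.

```python
import numpy as np


def forward(x, W1, b1, W2, b2, g=np.tanh):
    """Compute f(x) = W2 g(W1^T x + b1) + b2 for one hidden layer."""
    hidden = g(W1.T @ x + b1)  # weighted sums of the inputs, passed through g
    return W2 @ hidden + b2    # weighted sum of the hidden activations


# Made-up sizes and random weights, for illustration only
rng = np.random.default_rng(0)
n_features, n_hidden = 4, 3
W1 = rng.normal(size=(n_features, n_hidden))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(1, n_hidden))
b2 = rng.normal(size=1)
x = rng.normal(size=n_features)

score = forward(x, W1, b1, W2, b2)
print(score.shape)  # one output value for the binary problem
```

Training a network means choosing `W1`, `b1`, `W2`, and `b2` so that this output matches the labels; the form of `forward` itself never changes.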

In [1]:
from sklearn.neural_network import MLPClassifier
from sklearn import svm
import pandas as pd
import sklearn

from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn import model_selection

We're going to use the digits dataset again.

In [2]:
digits = datasets.load_digits()
digits_X = digits.data
digits_y = digits.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(digits_X,digits_y)

In [3]:
digits.images[0]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
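
Each 8x8 image like the one above corresponds to one 64-feature row of `digits.data`. A short check of that relationship, as a sketch:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
print(digits.images.shape)  # (n_samples, 8, 8)
print(digits.data.shape)    # (n_samples, 64)

# Each row of data is the corresponding image flattened row by row
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```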

Sklearn provides an estimator for the Multi-Layer Perceptron (MLP). We can start with one
hidden layer.

In [4]:
mlp = MLPClassifier(
  hidden_layer_sizes=(16,),  # one hidden layer with 16 neurons
  max_iter=100,
  alpha=1e-4,
  solver="lbfgs",
  verbose=10,
  random_state=1,
  learning_rate_init=0.1,  # only used by the "sgd" and "adam" solvers
)

In [5]:
mlp.fit(X_train,y_train).score(X_test,y_test)

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =         1210     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.40544D+00    |proj g|=  7.12918D+00

At iterate    1    f=  7.78456D+00    |proj g|=  6.43829D+00

At iterate    2    f=  3.04431D+00    |proj g|=  1.78432D+00

At iterate    3    f=  2.36091D+00    |proj g|=  3.95617D-01

At iterate    4    f=  2.26682D+00    |proj g|=  2.32293D-01

At iterate    5    f=  2.13681D+00    |proj g|=  2.72235D-01

At iterate    6    f=  1.99002D+00    |proj g|=  4.52193D-01

At iterate    7    f=  1.77195D+00    |proj g|=  2.84157D-01

At iterate    8    f=  1.64468D+00    |proj g|=  4.24260D-01

At iterate    9    f=  1.54488D+00    |proj g|=  3.95837D-01

At iterate   10    f=  1.43066D+00    |proj g|=  2.95591D-01

At iterate   11    f=  1.33032D+00    |proj g|=  3.64713D-01

At iterate   12    f=  1.19376D+00    |proj g|=  5.41581D-01

At iterate   13    f=  1.1

 This problem is unconstrained.



At iterate   90    f=  3.75672D-01    |proj g|=  1.34794D-01

At iterate   91    f=  3.73913D-01    |proj g|=  1.38448D-01

At iterate   92    f=  3.72390D-01    |proj g|=  4.31761D-01

At iterate   93    f=  3.68939D-01    |proj g|=  1.70690D-01

At iterate   94    f=  3.66016D-01    |proj g|=  1.60594D-01

At iterate   95    f=  3.63716D-01    |proj g|=  2.01862D-01

At iterate   96    f=  3.61407D-01    |proj g|=  1.62327D-01

At iterate   97    f=  3.59867D-01    |proj g|=  2.98699D-01

At iterate   98    f=  3.58583D-01    |proj g|=  2.49273D-01

At iterate   99    f=  3.57685D-01    |proj g|=  8.08819D-02

At iterate  100    f=  3.56395D-01    |proj g|=  1.09317D-01

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F   

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


0.86

We can compare it to SVM:

In [6]:
svm_clf = svm.SVC(gamma=0.001)
svm_clf.fit(X_train, y_train)
svm_clf.score(X_test,y_test)

0.9911111111111112

The SVM performed better here, but this is a simple problem.
We can also compare the models based on how much they store; the number of parameters
is related to the complexity.

In [7]:
import numpy as np

In [8]:
np.prod(list(svm_clf.support_vectors_.shape))

44672

In [9]:
np.sum([np.prod(list(c.shape)) for c in mlp.coefs_])

1184
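
Note that `coefs_` holds only the weight matrices. The bias vectors live in `intercepts_`, and counting both should reproduce the `N = 1210` reported in the L-BFGS output above. The sketch below refits a small stand-in model so that it runs on its own; the low `max_iter` is an assumption made for speed, since only the array shapes matter here.

```python
from sklearn import datasets
from sklearn.neural_network import MLPClassifier

X, y = datasets.load_digits(return_X_y=True)

# Stand-in model with the same architecture; scikit-learn may warn that
# the optimizer did not converge, which does not affect the shapes
clf = MLPClassifier(
    hidden_layer_sizes=(16,), solver="lbfgs", max_iter=20, random_state=1
)
clf.fit(X, y)

n_weights = sum(w.size for w in clf.coefs_)      # 64*16 + 16*10
n_biases = sum(b.size for b in clf.intercepts_)  # 16 + 10
print(n_weights, n_biases, n_weights + n_biases)
```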

In [10]:
mlp.coefs_

[array([[-0.04544852,  0.12067565, -0.27379625, ...,  0.20710165,
         -0.25885823,  0.09336809],
        [-0.04608944,  0.02952021, -0.18212756, ...,  0.19947011,
         -0.22921377, -0.04477983],
        [ 0.20612104, -0.04715072, -0.15870153, ..., -0.14984838,
          0.09368352, -0.14259509],
        ...,
        [-0.16138772, -0.06240637, -0.42747726, ...,  0.06122934,
          0.03883386, -0.14096137],
        [ 0.00390279, -0.10173871, -0.2841304 , ..., -0.2217815 ,
         -0.30078005, -0.10153926],
        [-0.25381589,  0.10546604, -0.02823113, ...,  0.05135395,
         -0.0472014 ,  0.19055818]]),
 array([[-0.41449777, -0.00621594, -0.25024708,  0.43932948,  0.36069896,
          0.12108234,  0.45749176, -0.14882827,  0.38248806, -0.04756596],
        [-0.23575882,  0.43646879,  0.17899597, -0.4034062 ,  0.27328131,
          0.19180292,  0.10153274,  0.23311177, -0.22428804,  0.11003831],
        [ 1.74141937, -0.44207095, -0.33440943, -0.80912163,  0.91921671,
 

In [11]:
mlp64 = MLPClassifier(
  hidden_layer_sizes=(64,),  # one hidden layer with 64 neurons
  max_iter=100,
  alpha=1e-4,
  solver="lbfgs",
  verbose=10,
  random_state=1,
  learning_rate_init=0.1,  # only used by the "sgd" and "adam" solvers
)

In [12]:
mlp64.fit(X_train,y_train).score(X_test,y_test)

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =         4810     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.04667D+01    |proj g|=  8.34164D+00

At iterate    1    f=  9.80320D+00    |proj g|=  3.84885D+00

At iterate    2    f=  8.46625D+00    |proj g|=  4.05251D+00

At iterate    3    f=  6.89808D+00    |proj g|=  3.25829D+00

At iterate    4    f=  4.90419D+00    |proj g|=  2.43816D+00

At iterate    5    f=  3.45341D+00    |proj g|=  2.47844D+00

At iterate    6    f=  2.04351D+00    |proj g|=  1.03218D+00

At iterate    7    f=  1.65725D+00    |proj g|=  1.99426D+00

At iterate    8    f=  1.24612D+00    |proj g|=  6.47738D-01

At iterate    9    f=  8.64751D-01    |proj g|=  3.81657D-01

At iterate   10    f=  6.70462D-01    |proj g|=  2.08785D-01

At iterate   11    f=  4.98056D-01    |proj g|=  1.91445D-01

At iterate   12    f=  3.66935D-01    |proj g|=  2.80683D-01

At iterate   13    f=  2.9

 This problem is unconstrained.



At iterate   50    f=  7.69289D-05    |proj g|=  1.84987D-04

At iterate   51    f=  6.02931D-05    |proj g|=  2.43414D-04

At iterate   52    f=  4.67583D-05    |proj g|=  1.49272D-04

At iterate   53    f=  3.56409D-05    |proj g|=  8.44064D-05

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 4810     53     55      1     0     0   8.441D-05   3.564D-05
  F =   3.5640856299961905E-005

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            


0.9733333333333334

## Questions After Class

### Roughly, how does the model know to use certain functions as the fitting becomes more complex (e.g. sin(x), ln(x), e^x)?

It does not learn an analytical form; it only approximates the function numerically, through the values of its weights.
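
As an illustration, the sketch below fits scikit-learn's `MLPRegressor` to samples of $\sin(x)$. The target function, the 32-unit `tanh` hidden layer, and the other settings are assumptions chosen for the example. After fitting, the model contains only weight matrices and bias vectors; nothing in it names a sine.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative target: sin(x) sampled on [-pi, pi]
X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel()

reg = MLPRegressor(
    hidden_layer_sizes=(32,),
    activation="tanh",
    solver="lbfgs",
    max_iter=2000,
    random_state=0,
)
reg.fit(X, y)

mse = np.mean((reg.predict(X) - y) ** 2)
print(f"training MSE: {mse:.4f}")

# The fitted model is just arrays of numbers
print([w.shape for w in reg.coefs_])
```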

### When doing the .score on the mlp, does the limit vary or does it have a set limit on its own?



### What is TensorFlow used for that scikit-learn can't do?

TensorFlow can build more types of networks and has more options for training. Most importantly, it has code optimizations so that you can use more complex hardware, such as GPUs, directly.

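### When you say weight, what does that mean?

Weights are the coefficients in each weighted sum: each weight says how much one input (a feature, or the output of a neuron in the previous layer) contributes to a neuron's value.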


### What is an artificial neuron?

An artificial neuron is one "unit" of calculation. A neuron takes a weighted sum of all of its inputs (plus a bias term) and passes it through an "activation function". Some activation functions, such as the logistic sigmoid, squash the output into [0,1]; others, such as ReLU, do not.
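
A single neuron can be written in a few lines of NumPy. The inputs, weights, and bias below are made-up numbers for illustration, and the two activation functions shown are common choices rather than the only ones.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    """Rectified linear activation: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, z)


def neuron(x, w, b, activation=sigmoid):
    """One artificial neuron: weighted sum of inputs plus bias, then activation."""
    return activation(np.dot(w, x) + b)


x = np.array([1.0, 2.0, 3.0])    # made-up inputs
w = np.array([0.5, -0.25, 0.0])  # made-up weights
b = 0.0

print(neuron(x, w, b))              # sigmoid(0.5 - 0.5 + 0.0) = sigmoid(0) = 0.5
print(neuron(x, w, b + 2.0, relu))  # relu(0 + 2.0) = 2.0
```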

### What real-life problems require TensorFlow?

Most modern deep learning applications are built with TensorFlow, PyTorch, or a similar library.

### What do the hidden layers of the neural network represent?

We do not specify exactly what they represent up front; we can use model explanation techniques and visualization tools to examine them after the fact and try to interpret them if needed.



### What is the best way to optimize a neural net? Would it be just adding more layers?

Adding layers is one option, but you could also specify candidate values for some of the parameters and use `GridSearchCV` to choose among them. There are different types of layers as well. We will see that later.
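
A minimal sketch of such a search on the digits data follows. The grid values, `cv=3`, and the `random_state` settings are illustrative assumptions, not recommendations.

```python
from sklearn import datasets, model_selection
from sklearn.neural_network import MLPClassifier

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=0
)

# Hypothetical grid: candidate layer sizes and regularization strengths
param_grid = {
    "hidden_layer_sizes": [(16,), (64,)],
    "alpha": [1e-4, 1e-2],
}

search = model_selection.GridSearchCV(
    MLPClassifier(solver="lbfgs", max_iter=200, random_state=1),
    param_grid,
    cv=3,
)
search.fit(X_train, y_train)  # may emit convergence warnings

print(search.best_params_)
print(search.score(X_test, y_test))
```

Each combination in the grid is trained and cross-validated, so the cost grows quickly as more parameters or values are added.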



### Are the weights given to the hidden layers initially random?

Typically yes: they are initialized randomly and then they are learned.




### I've heard that cleaning data is generally the majority of a data scientist's work. Is this generally true?

### What does it mean to "translate a jupyter notebook into python scripts"? What exactly are scripts?

A script is a file that can be run non-interactively. That is, it can be run straight through without relying on any user input.
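
For example, the SVM cells from this notebook could be collected into a file like the sketch below. The file name `digits_svm.py` and the `main` function are hypothetical; the point is that running `python digits_svm.py` executes everything from top to bottom with no input.

```python
"""digits_svm.py: a hypothetical script version of the SVM cells above."""
from sklearn import datasets, model_selection, svm


def main():
    # Load the data and hold out a test set
    X, y = datasets.load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, random_state=0
    )
    # Fit and score without any user interaction
    clf = svm.SVC(gamma=0.001).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print(f"test accuracy: {score:.3f}")
    return score


if __name__ == "__main__":
    main()
```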

### Does Jupyter notebook have to be used for data science, or can we use other types of languages?

You can use other languages and even use Python with a script or interactively in another IDE.

### How are issues of privacy handled for people like Cass? Some of the models they spoke about required a lot of personal data.

They do not release the data to just anyone, but they do use a lot of personal data. Mostly, they release anonymized, aggregated data so that it is not possible to find an individual. There are privacy and security procedures to protect the linked data and limit who has access to it.


<!--
### on tensorflow playground, if we increase the weight is that increasing the amount we are feeding within the hidden layer?

### Do the neurons' layers have to be specified in the models we are going to use, or they are already specified for each model?

### Are hidden layers just a number of masks that help the function determine what the overall classification should be?
-->