<a href="https://colab.research.google.com/github/riblidezso/DeepLearningCourse/blob/master/notebooks/01/machine_learning_model_zoo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model zoo of machine learning 

----

### It's not just neural networks!

So far you have learnt about the elements of neural networks. While neural networks are the focus of this course, they are just one of the tools in machine learning (surely the most exciting one recently). So before you go deeper into deep neural networks, we want to introduce other models too.

Neural networks (too often referred to as 'AI' nowadays) are **NOT** always the best tool to solve a problem! We really want to avoid the situation where we give you a (truly powerful) hammer and everything starts to look like a nail. Some datasets are not like nails, and require different solutions.

Please always tailor your models to the data, not to the 'AI' hype.



### Neural networks vs other models

Neural networks have become immensely popular in recent years because they perform superbly in the following tasks: computer vision (image recognition/detection/segmentation), natural language processing (speech recognition, language understanding, etc.) and some well-known Google projects: the game of Go, translation.

But they are usually inferior (or not clearly superior) to other solutions in many, many problems:
* predicting customer behavior based on company databases: who is unsatisfied and will leave the company, which product you should recommend in a webshop, etc.
* predicting failed parts in manufacturing based on data measured on the assembly line
* predicting stock prices
* [playing poker better than humans](https://www.cmu.edu/news/stories/archives/2017/january/AI-beats-poker-pros.html)
* etc. (look at Kaggle competitions where the data is not image or sound: those problems are usually best solved with something that is **not** a neural net)


So when should you use neural networks? An oversimplified (and slightly wrong) answer: use deep learning for computer vision and natural language processing, and probably not for other datasets.

Deeper:
* Non-neural-network models use the input data as it is, and try to guess the output based on the intuition that what is close in X should be close in Y.
* The layers of neural networks (helped by the humans behind them) are basically learning better and better representations of the input data (gradually lower dimensional and more correlated with the output).
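
To make the "close in X, close in Y" intuition concrete, here is a minimal nearest-neighbour sketch in plain numpy. The function name `knn_predict` and the toy data are made up for illustration; this is not how scikit-learn implements it, just the bare idea.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    # Euclidean distance from every query point to every training point
    dists = np.linalg.norm(x_query[:, None, :] - x_train[None, :, :], axis=2)
    # indices of the k closest training points for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # majority vote over the neighbours' labels (labels are assumed to be 0 or 1)
    votes = y_train[nearest]
    return (votes.mean(axis=1) > 0.5).astype(int)

# toy data (hypothetical): two small clusters with different labels
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_query = np.array([[0.05, 0.05], [1.05, 1.0]])

pred = knn_predict(x_train, y_train, x_query, k=3)
print(pred)
```

Each query point simply inherits the label of the training points around it. Nothing is learned about the inputs themselves, which is exactly the contrast with the representation learning of neural networks.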

The distinction then will be:
* If the data has a nice and meaningful, unstructured, low dimensional representation, then neural networks are probably not the best solution.
* If the data has some strong structure (image, voice), is high dimensional (a 1MP image has 1 million dimensions!), and does not have a meaningful low dimensional representation, neural networks will probably work better than other solutions.



### Other models used here:

- K Nearest Neighbors
- Decision trees (random forests)
- Support vector machines
- There are many more useful models. The purpose of this notebook is to open the window.




## Some additional thoughts:

----

### Traditional machine learning models are easier to track than deep learning.


- Deep learning advances super fast; the best solutions are usually models that are 0-2 years old.
- The other models shown here are mostly decades old, and still relevant today (like quantum mechanics :) )
- You are mostly expected to use them, not to improve them. You tweak the parameters but rarely the mechanism. Deep learning models often require deeper knowledge of the model, and more radical modifications.
- Therefore we will only give a brief overview of how they work, unlike the detailed overview of the elements of neural networks in the last weeks.



### Superhuman performances / Artificial Intelligence

You can read way too often that "AI" does this or that as well or better than humans. It is important to put these results into some context.


#### Hard problems for humans, easier for machines 

Traditional machine learning solves problems which are hard for humans (**similarly to calculators**). In doing so, it almost always reaches "superhuman" performance. Personally, I think it is unfortunate to call that Artificial Intelligence, because it is statistical modelling, which does not have much to do with what "Natural Intelligence" does.


#### Easy problems for humans, harder for machines  

Deep learning usually tries to solve a special subset of 'hard' problems, which humans are able to solve easily, but which computers were unable to solve for decades. So it is more reasonable to call these systems Artificial Intelligence.

But still, these problems are just a well defined subset of the tasks humans are able to do:

* Visual and audio perception (translation, text understanding): "Artificial Perception". You are not "intelligent" just because you can recognize a dog, right? They don't show dogs on IQ tests.

(More generally, deep learning can do things which you can do instinctively, without complex reasoning, in a split second.)

---

##### Games  

A special class of problems is **games**, which humans do not solve instinctively:

Go is considered to require intelligence from a human, so is AlphaGo really intelligent? Or is it more like a calculator? Do you think you are intelligent because you can integrate? Mathematica does it better, so is it AI?

A calculator is clearly not intelligent, and neither is Mathematica. What about AlphaGo? Does using neural nets make it more "intelligent"? Stockfish is not considered intelligent, but it is *almost* (?) as good at chess.

Board games have a limited world (a chess board has 64 squares) with few rules, and a good mechanistic basis for the solution (tree search, heuristics (e.g. with a neural net), Monte Carlo tree search). And they can be **simulated**.


##### A fully functional agent in the real world

The full world needs a far larger representation, has a much wider range of possible steps, and cannot really be simulated. (Except if we are living in a simulation, right?)


An article related to this topic: https://www.quantamagazine.org/why-self-taught-artificial-intelligence-has-trouble-with-the-real-world-20180221/

----

This notebook was created by Dezso Ribli, if you have any remarks or questions, please contact me.

---

In [0]:
%pylab inline  
# This 'magic' pulls numpy and matplotlib names (np, mean, std, plot, ...) into the namespace; ignore it for now

In [0]:
import pandas as pd  # data handling toolbox

---

### [Scikit learn](http://scikit-learn.org/stable/index.html): the python machine learning toolbox

This is the essential Python library for machine learning. It contains models, metrics, and various helper functions, which make life so much easier.



Load some models. See more [models and their details here](http://scikit-learn.org/stable/supervised_learning.html).

In [0]:
from sklearn import linear_model
from sklearn import neighbors
from sklearn import tree
from sklearn import ensemble
from sklearn import svm
from sklearn import neural_network

Load a cross validation wrapper, and a normalizer

In [0]:
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

Let's wrap the wrapper. 

In [0]:
def testClass(model,x,y):
    """Test a model."""
    # get CV scores
    acc = cross_val_score(model, x, y, cv=5)
    # eval them
    print('Accuracy: %.2f +/- %.2f' % (100*mean(acc),100*std(acc)))


---

### A classification dataset: finding the Higgs boson



The task is to identify the Higgs boson in simulated events.
* Inputs are various physical properties of events. 
* The labels indicate whether the event corresponds to the Higgs boson or not.


The dataset is hosted [here](http://opendata.cern.ch/record/328), and it was used in a [kaggle challenge](https://www.kaggle.com/c/Higgs-boson).

In [0]:
!wget http://opendata.cern.ch/record/328/files/atlas-higgs-challenge-2014-v2.csv.gz

##### Let's load the data and have a look at it:

It has 800k rows (data points), and 32 columns (variables).

In [0]:
cls_data = pd.read_csv('atlas-higgs-challenge-2014-v2.csv.gz')  # load it
print(cls_data.shape)
cls_data.head()  # peek

Create X,Y matrices.

In [0]:
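# 'Weight' is dropped too (if present): event weights depend on the label and would leak it
x = cls_data.head(10000).drop(['EventId','Weight','KaggleSet','KaggleWeight','Label'],axis=1,errors='ignore').values
y = cls_data.head(10000).Label.values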

Normalize input values

In [0]:
x = StandardScaler().fit(x).transform(x)
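
A side note on the normalization step: fitting the scaler on all rows before cross validation lets a little information from the validation folds slip into the training folds. For standard scaling the effect is usually tiny, but a cleaner pattern is to put the scaler and the model into one pipeline, so the scaler is re-fitted inside every fold. The sketch below uses a synthetic dataset from `make_classification` as a stand-in (an assumption, it is not the Higgs table), and the names `x_demo`, `y_demo` are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption: not the real Higgs dataset)
x_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# scaler + model as one estimator: the scaler only ever sees the training folds
model = make_pipeline(StandardScaler(), LogisticRegression())
acc = cross_val_score(model, x_demo, y_demo, cv=5)
print('Accuracy: %.2f +/- %.2f' % (100 * np.mean(acc), 100 * np.std(acc)))
```

The pipeline object can be passed to `testClass` just like a bare model, since it exposes the same `fit`/`predict` interface.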

### OK, let's try some models!

In [0]:
%%time
testClass(linear_model.LogisticRegression(), x, y)

In [0]:
%%time
testClass(neural_network.MLPClassifier(), x, y)

In [0]:
%%time
testClass(neighbors.KNeighborsClassifier(), x, y)

In [0]:
%%time
testClass(tree.DecisionTreeClassifier(), x, y)

In [0]:
%%time
testClass(ensemble.RandomForestClassifier(), x, y)

In [0]:
%%time
testClass(svm.SVC(), x, y)

----


## A regression dataset: age and DNA methylation

Interestingly there is no simple way to guess the age of a person. We are not trees, there are no rings inside us. This can lead to some [interesting situations](https://www.theguardian.com/world/2016/apr/21/canadian-high-school-basketball-star-jonathan-nicola-refugee).

There are various methods using bones (featured in CSI) or teeth, but here we will show another method, a simple and surprisingly accurate blood test.

The dataset consists of some 'epigenetic' markers, called 'methylation', measured on the DNA of each person. Each variable represents a specific position in the genome, and its value corresponds to the level of 'methylation' at that position. Methylation is a simple chemical modification of the DNA naturally applied by all cells.

![src](http://helicase.pbworks.com/f/DNAmeth.jpg)

The task will be: Predict the age of people based on their 'methylation' values.


[Data source](https://www.ncbi.nlm.nih.gov/pubmed/23177740?dopt=Abstract)

##### Let's load the data and have a look at it:

It has 656 rows (data points/individuals), and 6000 columns (variables).

**Note, a lot more variables than data points!**

In [0]:
!wget dkrib.web.elte.hu/deeplearning/data/meta_small.csv

In [0]:
reg_data = pd.read_csv('meta_small.csv')
print(reg_data.shape)
reg_data.head()

Create X,Y matrices.

In [0]:
x = reg_data.drop(['age'],axis=1).values
y = reg_data.age.values

----

Let's load a different wrapper this time, which gives us back the individual predictions, not just scores. This way we score the models ourselves the way we like it, and we can plot results.

In [0]:
from sklearn.model_selection import cross_val_predict

Wrap it!

In [0]:
def testReg(model, x, y):
    """Test a model."""
    y_pred = cross_val_predict(model, x, y, cv=5)  
        
    plot(y,y_pred,'.')  # plot data points
    plot([15,90],[15,90])  # y = x line for reference
    xlabel('age')  # label, labels, labels
    ylabel('predicted age') # and labels

    print('RMSE:',((y-y_pred)**2).mean()**0.5)  # primary metric
    print('Pearson corr:',np.corrcoef(y,y_pred)[0,1])  # another one
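
The two numbers printed by `testReg` are easy to check by hand on a tiny example. The ages and predictions below are made up purely for illustration.

```python
import numpy as np

y_true = np.array([20.0, 40.0, 60.0])  # made-up ages
y_hat = np.array([22.0, 38.0, 63.0])   # made-up predictions

# root mean squared error: the typical size of the error, in years
# errors are -2, 2, -3, so this is sqrt((4 + 4 + 9) / 3)
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Pearson correlation: 1.0 would mean a perfect linear relationship
pearson = np.corrcoef(y_true, y_hat)[0, 1]

print('RMSE:', rmse)
print('Pearson corr:', pearson)
```

Note that the two metrics answer different questions: a model that predicts every age 10 years too high has a perfect correlation but a poor RMSE.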

### Let's model!

In [0]:
# It should not work, right? There are a lot more variables than data points
# It's some kind of magic!
testReg(linear_model.LinearRegression(), x, y)

In [0]:
# It actually fails if the number of data points ~ the number of params
testReg(linear_model.LinearRegression(), x[:,:520], y)

In [0]:
# The failure was not due to lack of info in the inputs
# It works with a lot fewer params too
testReg(linear_model.LinearRegression(), x[:,:50], y)

In [0]:
# With more variables than data points we should use
# explicit regularisation!
testReg(linear_model.Ridge(), x, y)
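
Why does the plain linear fit not blow up with far more variables than data points? One plausible explanation, stated here as our assumption and not something measured on the methylation data: in the underdetermined case there are infinitely many coefficient vectors that fit the training data exactly, and the least squares solver behind `LinearRegression` picks the one with the smallest norm, which acts like an implicit regulariser. Ridge makes the penalty on the norm explicit. The sketch below illustrates this on synthetic data; all names and numbers in it are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
n_samples, n_features = 50, 200  # more variables than data points
x_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = 1.0  # toy assumption: only 5 variables matter
y_demo = x_demo @ true_coef + 0.1 * rng.normal(size=n_samples)

ols = LinearRegression().fit(x_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(x_demo, y_demo)

# largest error on the training points themselves
ols_train_err = np.abs(ols.predict(x_demo) - y_demo).max()
ridge_train_err = np.abs(ridge.predict(x_demo) - y_demo).max()
print('max train error  OLS: %.2e  Ridge: %.2e' % (ols_train_err, ridge_train_err))

# size of the coefficient vectors
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print('coef norm        OLS: %.3f  Ridge: %.3f' % (ols_norm, ridge_norm))
```

The unregularised fit reproduces the training targets essentially exactly (noise included), while Ridge gives up a little training accuracy in exchange for smaller coefficients. Whether that trade pays off on held-out data is what the cross validated plots above are for.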

In [0]:
testReg(neural_network.MLPRegressor(), x, y)  

In [0]:
testReg(neighbors.KNeighborsRegressor(), x, y)

In [0]:
testReg(tree.DecisionTreeRegressor(), x, y)

In [0]:
testReg(ensemble.RandomForestRegressor(n_jobs=-1), x, y)

In [0]:
testReg(svm.SVR(), x, y)  # just doesn't work ...

In [0]:
# Can it be that it's underdetermined again?
testReg(svm.SVR(), x[:,:50], y)
# Nope

In [0]:
testReg(svm.SVR(C=1000.), x, y)  
# Now it works, but it just looks like the linear fit...

----


# Messages:


* Scikit-learn is your best friend.
* There is no single best model. Different models work well on different problems. (Even SVM can be excellent ;) )

* The models have settings!

You have to develop an intuition in order to avoid extensive model/hyperparameter search!
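
When intuition runs out, scikit-learn can at least automate the brute force part. Below is a minimal sketch of a cross validated search over the `C` setting of `SVR` with `GridSearchCV`. The data is synthetic and the grid values are arbitrary choices for illustration, not recommendations for the methylation dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# synthetic regression data (assumption: targets on an age-like scale)
rng = np.random.RandomState(0)
x_demo = rng.uniform(-1, 1, size=(120, 5))
y_demo = 50 + 20 * x_demo[:, 0] + rng.normal(scale=1.0, size=120)

# try a handful of C values, each scored with 5-fold cross validation
param_grid = {'C': [0.1, 1.0, 10.0, 100.0, 1000.0]}
search = GridSearchCV(SVR(), param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(x_demo, y_demo)

print('best C:', search.best_params_['C'])
print('best CV RMSE:', (-search.best_score_) ** 0.5)
```

The cost grows quickly: every extra setting multiplies the number of fits, which is why a good first guess for the range is worth a lot.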

---

# More

* There are [plenty of other models too in sklearn](http://scikit-learn.org/stable/supervised_learning.html)
* There are a lot of models outside sklearn. Most notably:
    * A Gradient Boosted Trees implementation called [XGBoost](http://xgboost.readthedocs.io/en/latest/)
    * Neural networks ( an army of frameworks, but for beginners: [Keras](https://keras.io) )