# Machine Learning Introduction

Vincent Vandenbussche

- PhD in Physics defended in 2014
- Experience in large companies: GE Healthcare, Renault, L'Oréal
- Experience in academic research: CNRS, CEA
- Experience in startups: Easyrecrue, Suricog, Adventure Conseil...
- Founded a startup: Vivadata, a data science bootcamp

## Course overview

Introduction to Machine Learning:
- ML Intro and classification reminders
- Data preparation and Model Evaluation
- Regularization & Optimization
- Decision Trees and Random Forest
- Boosting and Gradient Boosting

If we have enough time:
- Imbalanced Datasets and anomaly detection
- Dimensionality Reduction and clustering
- More classification with SVMs, k-NNs, Naive Bayes

# What is ML?

## ML or not ML?

Let's have a quiz: ML or not ML?
- Google Translate
- Netflix movie recommendation
- Snapchat/TikTok filters
- Uber driver selection
- Face recognition
- Siri / Google Home / Amazon Alexa...
- Chatbots
- Google Search
- Autocorrect
- Google Maps
- Flight booking
- Fraud detection

## A.I vs Machine Learning vs Deep Learning

![](ai_ml_dl.png)

## A brief history of Machine Learning

A non-exhaustive timeline:

### Alan Turing

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=16I3OtO2dgcy7zHW0nvcRYQNyLWewAWXm">
</p>

Turing is widely considered to be the **father of theoretical computer science and artificial intelligence**.

In 1950 Alan Turing published a landmark paper in which he speculated about the possibility of creating machines that think. If a machine could carry on a conversation that was indistinguishable from a conversation with a human being, then it was reasonable to say that the machine was "thinking".

The [Turing Test](https://en.wikipedia.org/wiki/Turing_test) was the first serious proposal in the **philosophy of artificial intelligence**. 

> “I do not see why it (the machine) should not enter any one of the fields normally covered by the human intellect, and eventually compete on equal terms. I do not think you even draw the line about sonnets, though the comparison is perhaps a little bit unfair because a sonnet written by a machine will be better appreciated by another machine.”
> 
> Alan Turing, _The London Times_, 1949

### Walter Pitts & Warren McCulloch

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1bD4ITkBz0y69JXppGhCZ42lgxDp-oxo5" width="400">
</p>

Inspired by work in neurology, **Walter Pitts** and **Warren McCulloch** analyzed networks of idealized artificial neurons and showed how they might perform simple logical functions.

They were the first to describe what later researchers would call a **neural network**.
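
To make "simple logical functions" concrete, here is a minimal sketch of a threshold unit in the spirit of the McCulloch-Pitts neuron. The function names, weights and thresholds are illustrative assumptions, and modelling inhibition with a negative weight is a simplification of the original formulation.

```python
def mp_neuron(inputs, weights, threshold):
    """Threshold unit: fire (1) if the weighted sum of inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def logical_and(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return mp_neuron([a, b], weights=[1, 1], threshold=2)


def logical_or(a, b):
    # A single active input is enough to reach a threshold of 1.
    return mp_neuron([a, b], weights=[1, 1], threshold=1)


def logical_not(a):
    # Assumption: inhibition is modelled with a negative weight.
    return mp_neuron([a], weights=[-1], threshold=0)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

The same unit computes different logical functions purely by changing its weights and threshold, which is the idea later neural networks build on.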

### Marvin Minsky

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1_4Yg4RPJqqfNIIy7hfSeWH1De1TgYUZ8" width="200">
</p>

One of the students inspired by Pitts and McCulloch was a young **Marvin Minsky**, then a 24-year-old graduate student.

In 1951 (with Dean Edmonds) he built the first **neural net machine**, the SNARC. With the SNARC, Minsky built a first ANN (Artificial Neural Network) that could simulate a rat finding its way through a maze.

At that time, Minsky’s work and theory went largely unnoticed by the public and were rejected by most AI researchers until computing power reached a level where results could be clearly demonstrated.

> *Within our lifetime, machines may surpass us in general intelligence.*
>
> Marvin Minsky, 1951

But also many more names like:
- Arthur Samuel
- Geoffrey Hinton
- Frank Rosenblatt
- ...

### A brief timeline of the A.I. field

A.I. has a long history behind it, with a series of successes and of failures and disappointments (see [A.I. winter](https://en.wikipedia.org/wiki/AI_winter) for example). 

> 📚 **Resources**: The history behind A.I. is long and fascinating. Check [History of AI on Wikipedia](https://en.wikipedia.org/wiki/History_of_artificial_intelligence) to dig deeper.

![](https://drive.google.com/uc?export=view&id=1kUh0wO82soPDiXzGo8TEevfgzSkGZlto)

> Most ideas are pretty old, so why is everybody using ML now?

## The current AI hype

There are probably several main reasons:
- Increase of computing power
- Exponential increase of the data available

![](data_amount_evolution.png)

- Active research field with an exponential [increase in papers](https://web.media.mit.edu/~mrfrank/papers/nmi2019.pdf)
![](ai_paper_evol.png)

- Better and wider access to knowledge and tools



## Some examples of ML applications

### 🚙 Autonomous driving

![](https://drive.google.com/uc?export=view&id=1lDltwLk02Am5bsINiqk4u33oABQfawcA)

### 🚙 Autonomous driving

![](https://drive.google.com/uc?export=view&id=1gdJKYzGMtA3KSOD2JS8s53Wp3pzfDwVr)

### 🌆 Smart City

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1D7JgBRhPLk-83ozMeDVh30mxWlsV3RnT">
</p>

### 🔬 Health

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1hn0rFBFMpE1PqpjTrl61xMnbbGmQ8hss" width="70%">
</p>

### 🏪 Amazon cashierless stores

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1RQ1rBiICTtyRTvsAHMcj_0Zu-xb4CKrL">
</p>

### 🎙 Home Assistants

![](https://drive.google.com/uc?export=view&id=1RpkMLUsAyIgzJvdGnUi7d5UUEcajEY4G)

...

___

# The ML paradigm

## Think like an ML algorithm

Let's play a game and think like an ML algorithm:

![](Famous-Logos.jpg)

Using Machine Learning is a change of paradigm:

![](ml_paradigm.png)

## How to use a Machine Learning model

![](https://drive.google.com/uc?export=view&id=1EMSRYv8KcKuUSIZEtUl8b0pN41amLIBx)

## There are many Machine Learning problems 

### There are *different types* of Machine Learning problems

Machine Learning problems can be grouped into different categories.

In general, a machine learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several **attributes** or **features**.
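
As a minimal sketch of what "samples" and "features" look like in code, the toy dataset below stores one sample per row and one feature per column. The feature names and values are made up for illustration.

```python
# Hypothetical toy dataset: each row is one sample, each column is one feature.
feature_names = ["age_years", "weight_kg"]
samples = [
    [1.0, 0.8],
    [2.0, 1.9],
    [3.0, 3.1],
    [4.0, 4.2],
]

n_samples = len(samples)        # number of rows
n_features = len(samples[0])    # number of columns

# A single feature (column) across all samples.
ages = [row[0] for row in samples]

print(f"{n_samples} samples, {n_features} features: {feature_names}")
```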

Learning problems fall into a few categories:

- **Supervised learning**: the data comes with additional attributes (called **labels**) that we want to predict. A supervised learning problem can be either:

  - **Classification**: samples belong to two or more classes and we want to learn from already labelled data how to predict the class of unlabelled data.

    > An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is trying to label them with the correct category or class.

  - **Regression**: if the desired output consists of one or more continuous variables, then the task is called regression.

    > An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.

- **Unsupervised Learning**: here, the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called **clustering**, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization, called **dimensionality reduction**.
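
The three problem types above can be sketched with tiny, hand-written algorithms. This is only an illustration using the Python standard library: the data, the function names and the choice of algorithms (1-nearest-neighbour, a least-squares line, a two-cluster k-means in one dimension) are assumptions made for the example; in practice a library such as scikit-learn would normally be used.

```python
import math

# --- Classification: predict a discrete class (1-nearest-neighbour) ---
# Hypothetical labelled samples: (features, label).
labelled = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]


def classify(point):
    """Return the label of the closest labelled sample."""
    _, label = min(labelled, key=lambda item: math.dist(item[0], point))
    return label


# --- Regression: predict a continuous value (least-squares line) ---
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# --- Clustering: no labels, group similar values (1-D k-means with k=2) ---
def two_means(values, n_iter=10):
    """Split values into two groups around two iteratively updated centres."""
    low, high = min(values), max(values)  # assumes at least two distinct values
    for _ in range(n_iter):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)


predicted_class = classify((1.1, 0.9))
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
clusters = two_means([1.0, 1.5, 0.8, 8.0, 8.3, 7.9])
print(predicted_class, slope, intercept, clusters)
```

Note the difference in inputs: the first two functions need targets (labels or continuous values), while the clustering function only receives the raw values.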


![](https://drive.google.com/uc?export=view&id=1YW1fZEV-a3n4EavCf4EwjQiEaqoG5IZX)

> 📚 **Resources**: What about **Reinforcement Learning**?
>
> An additional branch of machine learning is **reinforcement learning (RL)**. Reinforcement learning differs from other types of machine learning: in RL you don't collect examples with labels. Imagine you want to teach a machine to play a very basic video game and never lose. You set up the model (often called an **agent** in RL) with the game, and you tell it not to get a "game over" screen. During training, the agent receives a **reward** when it achieves this; the rule that assigns rewards is called the **reward function**. With reinforcement learning, an agent can sometimes learn to outperform humans at such games.
>
> The lack of a data requirement makes RL a tempting approach. However, designing a good reward function is difficult, and RL models are less stable and predictable than supervised approaches. Additionally, you need to provide a way for the agent to interact with the game to produce data, which means either building a physical agent that can interact with the real world or a virtual agent and a virtual world, either of which is a big challenge.
>
> Reinforcement learning is an active field of ML research, but in this course we'll focus on the problems detailed above because they are better understood, more stable, and result in simpler systems.
>
> For comprehensive information on RL, check out [Reinforcement Learning: An Introduction by Sutton and Barto](http://incompleteideas.net/book/RLbook2018.pdf).
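
To illustrate the agent / reward idea, here is a minimal sketch of tabular Q-learning on a made-up "game": a corridor of five positions where the agent wins by reaching the right end. The environment, the reward function and all hyperparameter values are assumptions chosen for the example, not part of the course material.

```python
import random

N_STATES = 5           # positions 0..4 in a corridor
GOAL = N_STATES - 1    # reaching the last position wins the game
ACTIONS = (-1, +1)     # move left or move right


def reward_function(state):
    """Hypothetical reward: +1 for reaching the goal, 0 otherwise."""
    return 1.0 if state == GOAL else 0.0


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values Q(state, action) by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit: pick a best-valued action, breaking ties at random.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state = min(max(state + action, 0), GOAL)  # walls at both ends
            reward = reward_function(next_state)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the estimate towards reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()
# Greedy policy: the preferred action in each non-goal position.
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(GOAL)]
print(policy)
```

No labelled examples are provided anywhere: the agent generates its own data by interacting with the environment, and the reward function is the only feedback it gets.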

## How to run a ML Project

A machine learning project - or product - is quite different from a regular software project.

There are several necessary steps; let's review them now.

### 1. Define objective and constraints

- Which question do we want to answer?
- Which hypotheses can we formulate?
- What data is required? Labeled or not?
- Do we have the data?

**Example**: 
- We want to build a SPAM detector
- We can assume the emails come without attachments
- We want it to classify at least 85 % of the received emails correctly
- We need it to be fast (real time)
- We do not have any labelled data

### 2. Collect data

Data is what brings power to models. If you don't have good data, don't expect to build great ML models.

There are many ways to collect data:
- Open data: [Kaggle datasets](https://www.kaggle.com/datasets), [Google Dataset Search](https://datasetsearch.research.google.com)...
- APIs 
- Scraping (with tools such as Beautiful Soup, Scrapy, etc.)
- Internal databases (SQL, NoSQL...)
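
As one example of the options above, here is a minimal sketch of pulling data out of an internal SQL database. It uses an in-memory SQLite database from the Python standard library as a stand-in; the table name, columns and rows are hypothetical.

```python
import sqlite3

# Hypothetical in-memory database standing in for an internal SQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (id INTEGER PRIMARY KEY, subject TEXT, n_links INTEGER)"
)
conn.executemany(
    "INSERT INTO emails (subject, n_links) VALUES (?, ?)",
    [("Win a free prize now", 5), ("Meeting at 10am", 0), ("Cheap loans, click here", 7)],
)
conn.commit()

# Pull the rows into plain Python structures for the preparation step.
rows = conn.execute("SELECT subject, n_links FROM emails ORDER BY id").fetchall()
conn.close()

subjects = [subject for subject, _ in rows]
n_links = [links for _, links in rows]
print(len(rows), "rows loaded")
```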

### 3. Understand and prepare data

- How to deal with incorrect or missing data?
- What are the relevant features of your data?
- Are your features correlated?
- Can you create new features?
- ...
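
A minimal sketch of three of these questions (missing values, correlation, new features), using only the standard library. The two feature columns are made up, and mean imputation is just one possible strategy for missing data, chosen here for simplicity.

```python
import math

# Hypothetical raw feature columns; None marks a missing value.
raw_links = [5, 0, None, 7, 1]
raw_exclamations = [4, 0, 1, 6, None]


def impute_mean(column):
    """Replace missing values with the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


links = impute_mean(raw_links)
exclamations = impute_mean(raw_exclamations)

# Are the two features correlated?
correlation = pearson(links, exclamations)

# A new feature built from existing ones.
total_markers = [a + b for a, b in zip(links, exclamations)]

print(links, exclamations, round(correlation, 3))
```

A strong correlation between two features suggests they carry partly redundant information, which can guide feature selection.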

### 4. Train and optimize your ML model

- Train several models
- Optimize them
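
As a sketch of "train several models, then optimize", the example below compares a k-nearest-neighbours classifier for several values of k on a validation set and keeps the best one. The data, the choice of k-NN and the candidate values of k are assumptions made for illustration; two training labels are deliberately noisy so that the choice of k matters.

```python
# Hypothetical 1-D data: (feature value, label). Two training labels are noisy on purpose.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (3.0, 0),
         (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 0), (8.0, 1)]
validation = [(2.4, 0), (2.6, 0), (3.9, 0), (5.6, 1), (7.4, 1), (7.6, 1)]


def knn_predict(x, k):
    """Majority vote among the k training samples closest to x."""
    neighbours = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0


def accuracy(k, dataset):
    """Fraction of samples in dataset that the k-NN model classifies correctly."""
    correct = sum(knn_predict(x, k) == label for x, label in dataset)
    return correct / len(dataset)


# "Train several models": one candidate model per value of the hyperparameter k.
scores = {k: accuracy(k, validation) for k in (1, 3, 5)}

# "Optimize them": keep the hyperparameter with the best validation score.
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

With k=1 the model copies the noisy labels, while a larger k averages them out; the validation set is what reveals the difference.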

### 5. Evaluate your model

- Evaluate the models and pick the best one that satisfies your constraints
- If the evaluation is not good enough, iterate!
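
A minimal sketch of this evaluation loop, reusing the 85 % accuracy target from the SPAM example in step 1. The true labels and predictions below are made up; in a real project they would come from a held-out test set and a trained model.

```python
TARGET_ACCURACY = 0.85  # constraint taken from the SPAM example in step 1

# Hypothetical held-out test labels and model predictions (1 = spam, 0 = not spam).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

n_correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = n_correct / len(y_true)
meets_target = accuracy >= TARGET_ACCURACY

if meets_target:
    print(f"accuracy = {accuracy:.2f}: good enough, move on to deployment")
else:
    print(f"accuracy = {accuracy:.2f}: below {TARGET_ACCURACY}, iterate")
```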

### 6. Deploy your model

Deploy your model for your customers. There are many ways to deploy a model:
- Locally
- On servers
- Serverless architecture
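
Whatever the target, deployment usually means saving the trained model and loading it again wherever predictions are served. Here is a minimal sketch of that round trip, assuming a deliberately trivial "model" (a threshold on one feature) stored as JSON; the model format, file name and function names are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical trained "model": flag an email as spam above a link-count threshold.
model = {"feature": "n_links", "threshold": 3}


def save_model(model, path):
    """Write the model parameters to disk at the end of training."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)


def load_model(path):
    """Read the model parameters back where predictions are served."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def predict(model, sample):
    """Return 1 (spam) if the feature value exceeds the threshold, else 0."""
    return int(sample[model["feature"]] > model["threshold"])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    save_model(model, path)
    served_model = load_model(path)

print(predict(served_model, {"n_links": 5}))
```

The same `load_model` / `predict` pair could sit behind a local script, a web server or a serverless function; only the code that calls it changes.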

## How much time does it take?

In your opinion, how much time is spent on each task?
- defining the problem
- collecting data
- cleaning/preparing data
- training and optimizing model
- model evaluation
- model deployment

Estimated share of time spent on each task:
- defining the problem ~ 15 %
- collecting data ~ 25 %
- cleaning/preparing data ~ 25 %
- training model ~ 15 %
- model evaluation ~ 5 %
- model deployment ~ 5 %

- Bonus: explaining what you do ~ 10 %

## Current status of Machine Learning and limits

ML is pretty much everywhere... but almost never alone!

What ML can do:
- Outperform humans on very specific tasks (with enough data)
- Automate many painful, repetitive tasks
- Learn the rules for a given problem from data
- ...

What ML cannot do:
- Generalize well outside the data it was trained on
- Learn from only a few examples
- Explain the reasons for its decisions (interpretability)
- Handle a wide variety of tasks with a single model
- ...