# Machine Learning Introduction

![](images/wordcloud.png)

## Your teacher

Vincent Vandenbussche: vandenbussche.vincent@gmail.com

- PhD in Physics defended in 2014
- Experience in large companies: GE Healthcare, Renault, L'Oréal
- Experience in academic research: CNRS, CEA
- Experience in startups: Easyrecrue, Suricog, Adventure Conseil...
- In charge of the Deep Learning course at Mines ParisTech engineering school
- Executive trainings: Orange, LCL, EDF...
- Founded a startup: Vivadata, a data science bootcamp

## Course overview

Introduction to Machine Learning:
- Global introduction
- Python and Scientific computing basics
- Data viz & preparation
- Logistic Regression & other classification models overview
- Regression & regularization
- Model evaluation & optimization
- Trees and Random Forest
- Boosting methods

If we have enough time:
- Imbalanced Datasets & anomaly detection

Remote work:
- Practical exercises: classification and regression

### Pedagogical choices

This course has a **very hands-on approach**. It comes with pros and cons.

Pros:
- You will be able to reuse all the algorithms we see
- You will see most of the parts of a machine learning project
- You will better understand what a machine learning project is
- You will be able to communicate more easily with data scientists/analysts

Cons:
- We will have to begin with some coding basics in Python
- We will spend time talking about painful aspects of a machine learning project
- You will need to code a bit

## Evaluation


Two short in-class multiple-choice exams for continuous assessment:
- November 26th
- December 14th

One final exam:
- December 18th (TBC)

# I. What is Machine Learning (ML)?

## Let's have a quiz: ML or not ML?

- Google Translate
![](images/translate.png)

- Netflix movie recommendation
![](images/Netflix.jpg)

- Uber driver selection
![](images/uber.png)

- Face recognition
![](images/faceID.jpg)

- Siri / Google Home / Amazon Alexa...
![](images/echo-home.jpg)

- Chatbots
![](images/chatbot.jpg)

## A.I. vs Machine Learning vs Deep Learning

![](images/ai_ml_dl.png)

# II. A brief history of Machine Learning

A non-exhaustive timeline

### Alan Turing

<center>
<img src="https://drive.google.com/uc?export=view&id=16I3OtO2dgcy7zHW0nvcRYQNyLWewAWXm">
</center>

Turing is widely considered to be the **father of theoretical computer science and artificial intelligence**.

In 1950 Alan Turing published a landmark paper in which he speculated about the possibility of creating machines that think. 

If a machine could carry on a conversation that was indistinguishable from a conversation with a human being, then it was reasonable to say that the machine was "thinking".

The [Turing Test](https://en.wikipedia.org/wiki/Turing_test) was the first serious proposal in the **philosophy of artificial intelligence**. 


### Walter Pitts & Warren McCulloch

<center>
<img src="https://drive.google.com/uc?export=view&id=1bD4ITkBz0y69JXppGhCZ42lgxDp-oxo5" width="400">
</center>

Inspired by work in neurology, **Walter Pitts** and **Warren McCulloch** analyzed networks of idealized artificial neurons and showed how they might perform simple logical functions.

They were the first to describe what later researchers would call a **neural network**.

### Marvin Minsky

<center>
<img src="https://drive.google.com/uc?export=view&id=1_4Yg4RPJqqfNIIy7hfSeWH1De1TgYUZ8" width="200">
</center>

One of the students inspired by Pitts and McCulloch was a young **Marvin Minsky**, then a 24-year-old graduate student.

In 1951 (with Dean Edmonds) he built the first **neural net machine**, the SNARC. With the SNARC, Minsky built a first ANN (Artificial Neural Network) that could simulate a rat finding its way through a maze.

At that time, Minsky’s work and theory went largely unnoticed by the public and was rejected by most AI researchers until computing power had reached a level where results could be clearly demonstrated.

But also many more names like:
- Arthur Samuel
- Geoffrey Hinton
- Frank Rosenblatt
- ...

### A brief timeline of the A.I. field

A.I. has a long history behind it, with a series of successes, failures and disappointments (see [A.I. winter](https://en.wikipedia.org/wiki/AI_winter) for example).

![](https://drive.google.com/uc?export=view&id=1kUh0wO82soPDiXzGo8TEevfgzSkGZlto)


> The history behind A.I. is long and fascinating. Check [History of AI on Wikipedia](https://en.wikipedia.org/wiki/History_of_artificial_intelligence) to dig deeper.


Most ideas are pretty old, so why is everybody using ML now?

## The current AI hype

There are probably several main reasons:
- Increase in computing power
- Exponential increase in the amount of available data

![](images/data_amount_evolution.png)

- Active research field with exponential [increase of papers](https://web.media.mit.edu/~mrfrank/papers/nmi2019.pdf)
![](images/ai_paper_evol.png)

- Better and wider access to knowledge and tools (scikit-learn, tensorflow, MOOCs, bootcamps...)

## Some examples of ML applications

### 🚙 Autonomous driving

![](https://drive.google.com/uc?export=view&id=1lDltwLk02Am5bsINiqk4u33oABQfawcA)

### 🚙 Autonomous driving

![](https://drive.google.com/uc?export=view&id=1gdJKYzGMtA3KSOD2JS8s53Wp3pzfDwVr)

### 🌆 Smart City

<center>
<img src="https://drive.google.com/uc?export=view&id=1D7JgBRhPLk-83ozMeDVh30mxWlsV3RnT">
</center>

### 🔬 Health

<center>
<img src="https://drive.google.com/uc?export=view&id=1hn0rFBFMpE1PqpjTrl61xMnbbGmQ8hss" width="70%">
</center>

### 🏪 Amazon cashierless stores

<center>
<img src="https://drive.google.com/uc?export=view&id=1RQ1rBiICTtyRTvsAHMcj_0Zu-xb4CKrL">
</center>

### 🎙 Home Assistants

![](https://drive.google.com/uc?export=view&id=1RpkMLUsAyIgzJvdGnUi7d5UUEcajEY4G)

...

___

# III. The ML paradigm

## Think like an ML algorithm

Let's play a game and think like an ML algorithm.

Here is a bunch of logos:

![](images/Famous-Logos.jpg)

Using some labeled data (`Yes` ✅ or `No` ❌), let's guess the rule:
![](images/logos_half_classified.png)

Here is the fully labeled dataset:
![](images/logos_classified.png)

Using Machine Learning is a change of paradigm:

![](images/ml_paradigm.png)
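
To make the change of paradigm concrete, here is a minimal sketch in plain Python. The first function is a rule written by hand; the second part learns the rule (a simple threshold) from labeled examples. The data, the "number of links" feature and the function names are all hypothetical, chosen only for illustration.

```python
# Classical programming: a human writes the rule
def is_spam_handwritten(n_links):
    return n_links > 3  # threshold chosen by the developer


# Machine Learning: the rule (here, a threshold) is learned from labeled examples
def learn_threshold(values, labels):
    """Pick the threshold that classifies the most training examples correctly."""
    candidates = sorted(set(values))

    def n_correct(th):
        return sum((v > th) == label for v, label in zip(values, labels))

    return max(candidates, key=n_correct)


# Hypothetical labeled data: number of links in an email, and whether it is spam
n_links = [0, 1, 2, 5, 7, 8]
is_spam = [False, False, False, True, True, True]

threshold = learn_threshold(n_links, is_spam)


def is_spam_learned(n):
    return n > threshold


print("Learned threshold:", threshold)
print("Email with 6 links is spam?", is_spam_learned(6))
```

With different labeled data, the learned threshold changes without touching the code: this is the key difference with a hand-written rule.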

## How to use a Machine Learning model

<center>
    <img src="images/ml-logic.png" width="600">
</center>

## There are many Machine Learning problems 

Machine Learning problems can be grouped into different categories.

In general, a machine learning problem considers a set of `m` samples of data and then tries to predict properties of unknown data. If each sample is more than a single number (for instance, a multi-dimensional entry, also known as multivariate data), it is said to have several **features**.
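
As a small illustration, a dataset can be seen as a table with one row per sample and one column per feature. The values and the feature names below (surface and number of rooms) are assumptions made up for this sketch.

```python
# Hypothetical toy dataset: each row is one sample, each column one feature.
# Assumed features: surface in m², number of rooms.
samples = [
    [30.0, 1],
    [55.0, 2],
    [80.0, 3],
    [120.0, 5],
]

m = len(samples)               # number of samples
n_features = len(samples[0])   # number of features per sample

print(f"{m} samples, {n_features} features each")
```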

Learning problems fall mainly into a few categories:
- Supervised learning
- Unsupervised learning
- Reinforcement learning

### Supervised learning

The data comes with additional attributes (called **labels**) that we want to predict. A supervised learning problem can be either:
 
 
 - **Classification**: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data.
 
 > An example of a classification problem would be handwritten character recognition. Classification can be seen as a discrete form of supervised learning.
 
 - **Regression**: if the output consists of one or more continuous variables, then the task is called regression.
 
 > An example of a regression problem would be the prediction of the price of a house as a function of surface and location.
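
The difference between the two tasks can be sketched in a few lines of plain Python. The classifier below is a 1-nearest-neighbor and the regressor is a least-squares line; both are only simple stand-ins, and in practice a library such as scikit-learn would typically be used. All data values are hypothetical.

```python
import math


def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training sample closest to x (1-nearest-neighbor)."""
    distances = [math.dist(row, x) for row in train_X]
    return train_y[distances.index(min(distances))]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Classification: the label is a discrete class.
# Hypothetical features: [height in cm, weight in kg]
animals_X = [[25, 4], [30, 5], [60, 25], [70, 30]]
animals_y = ["cat", "cat", "dog", "dog"]
predicted_class = nearest_neighbor_predict(animals_X, animals_y, [28, 4.5])

# Regression: the label is a continuous value.
surfaces = [20, 40, 60, 80]     # m²
prices = [100, 200, 300, 400]   # k€ (made-up values)
slope, intercept = fit_line(surfaces, prices)
predicted_price = slope * 50 + intercept

print("Predicted class:", predicted_class)
print("Predicted price for 50 m²:", predicted_price)
```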

### Unsupervised learning
 
The training data consists of a set of input vectors without any corresponding labels.

There are several applications in unsupervised learning, among which:
- **Clustering**: to discover groups of similar examples within the data
> For example in market segmentation

- **Dimensionality reduction** to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization, interpretation, or further processing
> For example in financial trading, to analyse stock market behavior

- But also: anomaly detection, association learning, data compression, latent variables, denoising...
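
To give an idea of what clustering looks like without labels, here is a minimal k-means sketch in plain Python. The initialization (taking the first `k` points) is a naive assumption to keep the example short and deterministic, and the customer data is hypothetical.

```python
import math


def kmeans(points, k, n_iter=10):
    """Very small k-means: returns (centroids, cluster index of each point)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its closest centroid
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical customers: [visits per month, average basket in tens of euros]
customers = [[1, 1], [9, 9], [1.5, 2], [8, 8], [1, 0.5], [9, 10]]
centroids, assignments = kmeans(customers, k=2)

print("Cluster of each customer:", assignments)
print("Centroids:", centroids)
```

Note that no label was given to the algorithm: the two groups come only from the similarity between samples.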

![](https://drive.google.com/uc?export=view&id=1YW1fZEV-a3n4EavCf4EwjQiEaqoG5IZX)

> In this course we will focus on supervised learning

# IV. How to run an ML Project

A machine learning problem - or product - is quite different from a regular one.

There are several necessary steps. Let's review them now.

### 1. Define goals and constraints

- Which question do we want to answer?
- Which hypothesis can we formulate?
- What data is required? Labeled or not?
- Do we have the data?

**Example**: 
- We want to build a SPAM detector
- We assume the emails come without attachments
- We want it to be accurate for 85 % of the received emails
- We need it to be fast (real time)
- We do not have any labelled data

### 2. Collect data

Data is what brings power to models. If you don't have good data, don't expect to build great ML models.

There are many ways to collect data:
- Open data: [Kaggle datasets](https://www.kaggle.com/datasets), [Google Dataset Search](https://datasetsearch.research.google.com)...
- APIs
- Scraping (with tools such as Beautiful Soup, Scrapy, etc.)
- Internal databases (SQL, NoSQL...)
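
As a rough sketch of two of these sources, the code below reads a CSV file (as one would get from an open data portal) and queries a SQL database. To stay self-contained, the CSV content is an in-memory string and the database is an in-memory SQLite one; the table name, column names and values are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content, standing in for a downloaded open-data file
csv_text = "surface,rooms,price\n30,1,150\n55,2,260\n80,3,390\n"
rows_from_csv = [
    {key: float(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# Hypothetical internal database, created in memory for the example
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE houses (surface REAL, rooms INTEGER, price REAL)")
connection.executemany(
    "INSERT INTO houses VALUES (?, ?, ?)",
    [(120.0, 5, 600.0), (45.0, 2, 210.0)],
)
rows_from_sql = connection.execute(
    "SELECT surface, rooms, price FROM houses ORDER BY surface"
).fetchall()
connection.close()

print(len(rows_from_csv), "rows from CSV,", len(rows_from_sql), "rows from SQL")
```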

### 3. Understand and prepare data

- How to deal with incorrect or missing data?
- What are the relevant features of your data?
- Is your data correlated?
- Can you create new features?
- ...
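
Here is a minimal sketch of what some of these questions can look like in code: filling a missing value, creating a new feature, and measuring a correlation. Mean imputation is only one possible strategy among many, and the data is made up for the example.

```python
import math

# Hypothetical raw data: None marks a missing value
surfaces = [30.0, 55.0, None, 120.0]
rooms = [1, 2, 3, 5]
prices = [150.0, 260.0, 390.0, 600.0]

# Missing data: replace missing surfaces by the mean of the known ones
known = [s for s in surfaces if s is not None]
mean_surface = sum(known) / len(known)
surfaces_filled = [mean_surface if s is None else s for s in surfaces]

# New feature: average surface per room
surface_per_room = [s / r for s, r in zip(surfaces_filled, rooms)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two lists, in [-1, 1]."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlation between a feature and the target
correlation = pearson(rooms, prices)

print("Filled surfaces:", surfaces_filled)
print("Correlation rooms/price:", round(correlation, 3))
```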

### 4. Train and optimize your ML model

- Train several models
- Optimize them
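
One possible way to "train several models and optimize them" is to try several values of a hyperparameter and keep the one that works best on data not used for training. The sketch below does this for the number of neighbors `k` of a small k-nearest-neighbors classifier written in plain Python. The dataset is hypothetical, and one training label is noisy on purpose.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training samples closest to x."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, X, y, k):
    """Fraction of samples in (X, y) that are correctly predicted."""
    predictions = [knn_predict(train_X, train_y, x, k) for x in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Hypothetical 1-feature dataset; the label of [3] is noise
train_X = [[1], [2], [3], [4], [6], [7], [8], [9]]
train_y = [0, 0, 1, 0, 1, 1, 1, 1]
valid_X = [[2.6], [3.4], [6.5], [8.5]]
valid_y = [0, 0, 1, 1]

# Try several candidate models and keep the best one on the validation data
scores = {k: accuracy(train_X, train_y, valid_X, valid_y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)

print("Validation accuracy per k:", scores)
print("Best k:", best_k)
```

With `k = 1` the model follows the noisy label too closely, while a larger `k` smooths it out.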

### 5. Evaluate your model

- Evaluate your models and pick the best one within your constraints
- If the evaluation is not good enough, iterate!
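
As an illustration, the sketch below compares two candidate models on a test set and checks them against the 85 % accuracy goal of the SPAM example. The labels and predictions are made up, and accuracy, precision and recall are only some of the possible metrics.

```python
def evaluate(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical test labels (1 = spam) and predictions of two candidate models
y_test = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
model_a = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
model_b = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]

TARGET_ACCURACY = 0.85  # constraint taken from the SPAM example
results = {
    "model_a": evaluate(y_test, model_a),
    "model_b": evaluate(y_test, model_b),
}
acceptable = [name for name, r in results.items() if r["accuracy"] >= TARGET_ACCURACY]

for name, metrics in results.items():
    print(name, metrics)
print("Models meeting the target:", acceptable)
```

Here `model_b` catches every spam (recall of 1) but flags too many legitimate emails, so it misses the accuracy target: which metric matters depends on the goals defined in step 1.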

### 6. Deploy your model

Deploy your model for your customers. There are many ways to deploy a model:
- Locally
- On servers
- Serverless architecture
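
Whatever the target, deployment usually starts by saving the trained model so that another program can load it and make predictions. Here is a minimal local sketch, assuming a model simple enough to be stored as a few parameters in a JSON file; the file name and parameter values are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical trained model: a linear model reduced to its parameters
model = {"slope": 5.0, "intercept": 0.0}


def predict(params, surface):
    """Apply the linear model to one input value."""
    return params["slope"] * surface + params["intercept"]


# "Training side": persist the model to a file
path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w") as f:
    json.dump(model, f)

# "Serving side": load the file and answer a prediction request
with open(path) as f:
    loaded = json.load(f)
prediction = predict(loaded, 50)

print("Prediction for 50 m²:", prediction)
```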

## How much time does it take?

In your opinion, how much time is spent on each task?
- defining the problem
- collecting data
- cleaning/preparing data
- training and optimizing model
- model evaluation
- model deployment

Estimated time spent on each task (which is highly dependent on the project!):
- defining the problem ~ 15 %
- collecting data ~ 25 %
- exploring & preparing data ~ 25 %
- training & optimizing model ~ 15 %
- model evaluation ~ 5 %
- explaining the solution ~ 10 %
- model deployment ~ 5 %

Bonus:
- model maintenance & monitoring

> The idea here is: you do **not** spend much time modeling data and training fancy models!

# V. Current status of Machine Learning and limits

ML is pretty much everywhere... but its capabilities are often overestimated.

What ML **can do**:
- Outperform humans on very specific tasks (with a huge amount of data)
- Automate many painful, repetitive tasks
- Define rules for a given problem
- ...

What ML **cannot** do:
- Generalize well
- Learn from only a few examples
- Explain the reasons for its decisions (interpretability)
- Work for various, different tasks
- ...