# Machine Learning
# for Structured and Unstructured Data

https://github.com/roshammar/ml-course/

Kristoffer Röshammar

*Tenfifty AB*

***

<table>
    <tr>
        <td><img src="presentation_resources/kid2.jpeg" width="300"></td>
        <td><img src="presentation_resources/cat.jpeg" width="300"></td>
        <td><img src="presentation_resources/waifu.jpg" width="300"></td>
    </tr>
</table>

https://www.linkedin.com/in/kristoffer-röshammar-7b69bb16a/


## Before we Start 1 -- Schedule for Today


| Start | End | Content |
| --- | --- | --- |
| 09:00 | 10:30 | Lecture |
| 10:30 | 10:45 | Coffee break |
| 10:45 | 12:00 | Interactive example |
| 12:00 | 13:15 | Lunch break |
| 13:15 | 14:30 | Lecture |
| 14:30 | 14:45 | Break |
| 14:45 | 15:30 | Lab |
| 15:30 | 16:00 | Wrap-up |


## Before we Start 2 -- A Quick Poll
How many of you have experience with
* Programming?
* Python?
* Jupyter notebooks?
* Machine learning?

How many of you do ML at work?

How many of you know
* Why you'd want a *test data set* in addition to a *validation set*?
* The difference between *supervised* and *unsupervised* ML?

## Before we Start 3 -- About this Course

The big picture

## What is Machine Learning?

**Artificial intelligence** is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. Computer science defines AI research as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.

Fields include knowledge reasoning, planning, natural language processing, computer vision, and **machine learning**.

**Machine learning** is a subfield of artificial intelligence that deals with letting a computer learn on its own, without being explicitly programmed. It uses algorithms and statistical models to discover patterns in data.

## Types of ML
* **Supervised** learning
	* Classification
	* Regression
* **Unsupervised** learning
	* Clustering
	* Dimensionality reduction
	* Recommendation systems
* **Semi-supervised** learning
* **Reinforcement** learning

### Supervised learning
The computer is provided with **training data** containing both input and the corresponding expected output (the correct answer). Hopefully it then **generalises** and can correctly handle examples it has never seen before, with just the input given.

It is a case of learning by example.

Two main tasks:
* Classification
* Regression


#### Classification
In **classification** you want to split the data into different groups, such as *spam* and *not spam*.

The data can be a list of texts, and for each text a label that says if it is spam or not spam.

Another example is a list of images, each with a label that says what the image depicts (a dog, a chair, a tree, etc).


![](presentation_resources/hotdog.png)

#### Regression
**Regression** means that you are trying to predict a continuous variable.

For example, you could try to predict the selling *price* of houses. You would then need to have access to historical data for a number of houses. For each house you would provide details about number of rooms, location, the year it was built, etc, and finally the selling price.

### Unsupervised learning
In **unsupervised** learning, you don't have examples of expected output, but instead want the computer to automatically find patterns in the data.

One example of this is **clustering**, where items that are similar get grouped together, without you telling the computer what the different groups should be.

This leaves the interpretation of the output up to you.

### Semi-supervised learning
* Only have a few labeled examples, and (comparatively) lots of unlabeled data
* Computer can still learn patterns from the unlabeled data
* Model with low degrees of freedom 
* Active learning: computer can help you label remaining examples via a continuous feedback loop

### Reinforcement learning
* Agent that maximises reward
* Observes effect of actions
* Robotics, game play, resource management, scheduling, planning, chemistry, recommendations, bidding, advertising, ...

## Structured vs. Unstructured Data
* Structured data
	* Like a spreadsheet, or matrix
	* CSV file format
	* Each row is an observation, each column provides a specific kind of information
* Unstructured data
	* Text (documents, email, log files, HTML articles, chat messages), images, sound, video
	* You need to find a way to represent this as numbers that our algorithms can work with

## Supervised Learning with Structured Data
* Prepare data -- get the data ready
    * Actually a big step: data scientist
* Training -- let the computer learn from the data
	* ML engineer
	* This step tends to be the focus, but the others are equally important!
* Evaluation -- see how well the model performed

### Preparing data

#### How to load a data set
* CSV files
* Python `csv` module vs Pandas
	* Row-oriented vs. column-oriented
	* Lots of convenience functions
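
As a rough illustration of the row-oriented versus column-oriented views, here is a minimal sketch that uses only the standard library. The column names and values are made up for the example, and the CSV text is read from a string instead of a real file.

```python
import csv
import io

# Hypothetical CSV content; in practice you would open a file instead.
raw = io.StringIO(
    "rooms,year_built,price\n"
    "3,1975,250000\n"
    "5,2003,410000\n"
)

# Row oriented: csv.DictReader yields one dict per observation.
rows = list(csv.DictReader(raw))

# Column oriented: one list per feature, converted from strings to numbers.
columns = {name: [float(row[name]) for row in rows] for name in rows[0]}

print(rows[0]["price"])                                # still a string here
print(sum(columns["price"]) / len(columns["price"]))   # mean price
```

With pandas installed, `pandas.read_csv` gives you the column-oriented view directly as a `DataFrame`, with numeric columns already converted.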



### Preparing data

#### What does a dataset look like *schematically*?

![](presentation_resources/dataset.svg)

### Preparing data

#### Features
* Continuous variables
* Categorical
    * Convert to numbers
    * Order may matter
    * One-hot encoding
* Dates
* Missing values
* Re-scaling
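
Three of the steps above (one-hot encoding, missing values, re-scaling) can be sketched in plain Python. The helper names and the example values are assumptions made for illustration, and filling gaps with the mean is only one of several possible strategies.

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator columns."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def fill_missing(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Re-scale linearly to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

colours = ["red", "green", "red", "blue"]   # hypothetical categorical feature
areas = [50.0, None, 100.0, 150.0]          # hypothetical feature with a gap

encoded = one_hot(colours)        # columns are ordered blue, green, red
filled = fill_missing(areas)      # the gap becomes the mean, 100.0
scaled = min_max_scale(filled)
```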


### Preparing data

#### Split the data

* X vs. y
* Train, validation, test sets
* Remove y from the test set  
* Be aware of time series data!      
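
One possible way to do the split is sketched below. The function name, the fractions and the toy data are assumptions for the example. Shuffling is switched off for time series so that validation and test rows come after the training rows in time.

```python
import random

def split_rows(rows, val_frac=0.2, test_frac=0.2, shuffle=True, seed=0):
    """Split rows into train, validation and test lists.

    Use shuffle=False for time series that are already sorted by time,
    so the model is validated and tested on the most recent rows.
    """
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    n_train = len(rows) - n_val - n_test
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

# Hypothetical data: (features, target) pairs, sorted by time.
data = [([i, i * 2], i * 10) for i in range(10)]

train, val, test = split_rows(data, shuffle=False)

# X vs. y: separate the inputs from the expected output.
X_train = [x for x, _ in train]
y_train = [y for _, y in train]
```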

### Training

#### Selecting a model

* Lots of models to choose from
* A handful of models will be all you need in 90% of cases
* Deep learning is cool, but don't forget classical ML!
* We will come back to these soon

### Training

#### Training phase

* The algorithm typically loops through the training dataset a number of times; each pass is called an *epoch*
* Each iteration, changes are made to the model
* The training stops when the model does not improve any more
* How do we measure how good the model is, and how do we measure when it stops improving?
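
To make the loop concrete, here is a minimal sketch that fits a straight line with gradient descent and stops when the validation loss no longer improves. The model, learning rate, patience value and toy data are all assumptions chosen for the example, not a recipe for real problems.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, val_xs, val_ys, lr=0.1, max_epochs=1000, patience=10):
    w, b = 0.0, 0.0
    best_val, best_params, bad_epochs = float("inf"), (w, b), 0
    for epoch in range(max_epochs):
        # One epoch: one pass over the training data, then a model update.
        n = len(xs)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
        # Measure progress on data the model was not trained on.
        val_loss = mse(w, b, val_xs, val_ys)
        if val_loss < best_val - 1e-9:
            best_val, best_params, bad_epochs = val_loss, (w, b), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break   # no improvement for `patience` epochs in a row
    return best_params, epoch + 1

# Toy data generated from y = 2x + 1.
(w, b), epochs_run = train([-2, -1, 0, 1, 2], [-3, -1, 1, 3, 5],
                           [-1.5, 1.5], [-2, 4])
print(round(w, 3), round(b, 3), epochs_run)
```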

### Evaluation

#### How do we measure results?

We could just count if the prediction was correct or not.

But there are, in fact, four different outcomes for *binary classification*, and which one you get matters.

* True Positive
* True Negative
* False Positive
* False Negative
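
Counting the four outcomes could look like the sketch below. It assumes labels are encoded as 1 for positive and 0 for negative; the example labels are invented.

```python
def count_outcomes(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            counts["FN" if actual == 1 else "TN"] += 1
    return counts

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # what actually happened
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # what the model said
counts = count_outcomes(y_true, y_pred)
print(counts)
```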

### Evaluation

#### How do we measure results?

![](presentation_resources/precision_recall.png)

### Evaluation

#### How do we measure results?

$$ recall = \frac{TP}{P} = \frac{TP}{TP + FN} $$


$$ specificity = \frac{TN}{N} = \frac{TN}{TN + FP} $$


$$ precision = \frac{TP}{TP + FP} $$


$$ accuracy = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + TN + FP + FN} $$


$$ F1 = \frac{2 TP}{2 TP + FP + FN} $$
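
The formulas translate directly into code. The sketch below takes the four counts as input; the numbers in the call are invented, and the function does not guard against division by zero.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the measures above from the four outcome counts."""
    p = tp + fn   # all actual positives
    n = tn + fp   # all actual negatives
    return {
        "recall": tp / p,
        "specificity": tn / n,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (p + n),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```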

### Evaluation

#### How do we measure results?

No measure is always correct. It is up to you to decide what is important in your case.

When would a **FN** be the most problematic outcome?

When would a **FP** be the most problematic outcome?

**A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.**

*The incident occurred on the downtown train line, which runs from Covington and Ashland stations.*

*In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief.*

*“The theft of this nuclear material will have significant negative consequences on public and environmental health, our workforce and the economy of our nation,” said Tom Hicks, the U.S. Energy Secretary, in a statement. “Our top priority is to secure the theft and ensure it doesn’t happen again.”*

### Evaluation

#### How do we measure results?


A more useful classification algorithm does not only output 1 or 0, but rather a probability.

The measures above then only apply after you pick a threshold, but you can score the probabilities directly with something called cross-entropy (also known as log loss).
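
A minimal sketch of binary cross-entropy, assuming labels of 0 or 1 and predicted probabilities for the positive class. The clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1]
confident_right = binary_cross_entropy(y_true, [0.9, 0.1, 0.9])
unsure = binary_cross_entropy(y_true, [0.5, 0.5, 0.5])
confident_wrong = binary_cross_entropy(y_true, [0.1, 0.9, 0.1])
print(confident_right, unsure, confident_wrong)
```

Lower is better: confident and correct predictions score close to 0, while confident and wrong predictions are punished hard.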

### Evaluation

#### How do we measure results?

For multi-class classification, it is important to see if some classes are harder to predict than others.

Then, a confusion matrix could be used.

![](presentation_resources/confusion_matrix.png)
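
A confusion matrix is just a table of counts, as this small sketch shows. The class names and predictions are invented; rows are the actual class and columns the predicted class.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are the actual class, columns the predicted class."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(actual, predicted)] for predicted in labels]
            for actual in labels]

labels = ["cat", "dog", "tree"]
y_true = ["cat", "cat", "dog", "dog", "tree", "tree", "tree"]
y_pred = ["cat", "dog", "dog", "dog", "tree", "tree", "cat"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Correct predictions end up on the diagonal; everything off the diagonal tells you which classes get mixed up.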


### Evaluation

#### How do we measure results?

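The ROC (receiver operating characteristic) curve plots the true positive rate against the false positive rate as the decision threshold varies.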

![](presentation_resources/roc.png)
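
As a sketch of how such a curve can be computed, the code below sweeps a threshold over the predicted scores and records the false positive rate and true positive rate at each step. The scores are invented, and the area under the curve is estimated with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    p = sum(y_true)
    n = len(y_true) - p
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # predicted probability of class 1

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)
```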

### Evaluation

#### How do we measure results?

For regression, you need something else.

##### Mean Squared Error

$$ MSE = \frac{1}{N}\sum_{i=1}^{N}{(y_i-\hat{y}_i)^2} $$


##### Root Mean Squared Log Error (RMSLE)


$$ RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}{(\log(y_i + 1) - \log(\hat{y}_i + 1))^2}} $$
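
Both measures are short to write out by hand. In this sketch the prices and predictions are invented; RMSLE is computed as the square root of the MSE of the log-transformed values.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def rmsle(y_true, y_pred):
    """Root mean squared log error: the root of the MSE on log(value + 1)."""
    log_true = [math.log(y + 1) for y in y_true]
    log_pred = [math.log(y_hat + 1) for y_hat in y_pred]
    return math.sqrt(mse(log_true, log_pred))

prices = [100_000, 200_000, 300_000]      # hypothetical selling prices
predicted = [110_000, 190_000, 330_000]   # hypothetical model output

price_mse = mse(prices, predicted)
price_rmsle = rmsle(prices, predicted)
print(price_mse, price_rmsle)
```

MSE is dominated by the largest absolute errors, while RMSLE looks at relative errors, which often suits prices better.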



### Evaluation

#### Overfitting

* So you train and, *wow*, you get a perfect training error of 0! Done?
* Training and validation set
* Model complexity and data size
* Regularization
* Bagging

### Evaluation

* Hyperparameters
* Test set

## Time for an interactive example!

Why Python?

## Unstructured data


* Text (documents, email, log files, HTML articles, chat messages), images, sound, video
* You need to find a way to represent this as numbers that our algorithms can work with
* Manual feature extraction vs deep learning


### Text

#### Classic NLP pipeline

* Tokenization
* Lower case
* Lemmatize
* Stop words
* Synonyms - both expand and reduce
* Spelling
* Ngrams
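
A few of these steps (tokenization, lowercasing, stop words, n-grams) can be sketched with the standard library alone. The stop word list is a tiny made-up one. Lemmatization, synonym handling and spelling correction need language resources and are left out here.

```python
import re

STOP_WORDS = {"the", "a", "is", "on", "and"}   # tiny hypothetical stop list

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stop_words(tokenize("The cat is on the mat, and the dog barks."))
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```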



#### Then what? How to get to numbers?

* Bag-of-words model
* What about word order?
* Tf-idf
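
The two representations can be sketched as follows. The documents are invented, and the tf-idf variant shown (term frequency times the logarithm of inverse document frequency, without smoothing) is only one of several in common use.

```python
import math
from collections import Counter

docs = ["free money now", "meeting now", "free free offer"]   # toy corpus
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """One count per vocabulary word; word order is thrown away."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """Down-weight words that occur in many documents."""
    counts = Counter(tokens)
    vector = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)
        vector.append(tf * math.log(len(tokenized) / df))
    return vector

print(vocab)
print(bag_of_words(tokenized[2]))
print([round(v, 3) for v in tf_idf(tokenized[0])])
```

In the first document, "money" gets a higher weight than "free" or "now" because it appears in no other document.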


#### Or, the modern approach

* Embeddings
* Language models

### Images
* Deep neural networks
* Conv nets

### Time series

* Regression - translate dates into other features such as `is_weekend`
* Casino - translate into features such as moving averages
* Prophet - for seasonality
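
The first two ideas, deriving features such as `is_weekend` from dates and computing moving averages, can be sketched like this. The dates, the sales numbers and the window size are assumptions for the example.

```python
from datetime import date

def is_weekend(day):
    """Monday is 0 in weekday(), so 5 and 6 are Saturday and Sunday."""
    return day.weekday() >= 5

def moving_average(values, window):
    """Average of the last `window` values, for each position where one exists."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

days = [date(2024, 1, d) for d in range(5, 9)]   # Friday to Monday
weekend_flags = [is_weekend(d) for d in days]

sales = [1, 2, 3, 4, 5]                          # hypothetical daily values
smoothed = moving_average(sales, window=3)

print(weekend_flags)
print(smoothed)
```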



## Lab time!