 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# Machine Learning basics

* Machine Learning is the branch of computer science that studies the ability of computers to learn without being explicitly programmed.

* has been around for a while; used for spam detection since the 1990s.

* A 2006 paper by Geoffrey Hinton showed that **deep neural networks can be trained to recognize handwritten digits with an accuracy of over 98%**
    * this was previously thought to be impossible

* Since then, there's been a renewed interest in Machine Learning

### Two branches 

* `Classic Machine Learning`

* `Deep Learning`

### `Classic Machine Learning` models

* collection of parameters that describe an object able to recognize patterns.

* algorithms that rely more on statistics and math

**Most common learning algorithms:**
    <br>
    
 * `Linear and Logistic Regression`
    <br>
    
 * `Naïve Bayes`
    <br>
    
 * `Support Vector Machines`
    <br>
    
 * `Decision Trees`
    <br>
    
 * `Random Forests`
    <br>
    
 * `Boosting algorithms`
    
    
    

**We will focus on:**
    <br>
    
   * `Support Vector Machines`
   <br>
    
   * `Random Forests`
   <br>
    
   * `XGBoost`

## Advantages

### Requires fewer resources

* often **outperforms** Deep Learning on smaller data sets

* **computationally inexpensive**; can iterate faster

* much cheaper to run than Deep Learning (often good enough for tabular data)

### Interpretability

* easier to interpret

* easier to explain to people not familiar with Machine Learning or Deep Learning

## Disadvantages

### Requires __feature engineering__
 

* a feature is an individually measurable property or characteristic (in statistics, it's also called a **variable**)

**Types of features:**
   <br> 
   
   * **categorical features -**  non-numeric values from a finite number of categories
   <br> 
   
   * **discrete features -**  numeric values from a set of integer values
   <br> 
   
   * **continuous features -** either numeric or date/time (values are usually real numbers or timestamps)
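
A minimal sketch of how the three feature types might look in a `pandas` DataFrame. The column names and values are made up for illustration, and one-hot encoding with `pd.get_dummies` is only one of several possible ways to turn a categorical feature into numbers:

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for this example.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],   # categorical feature
    "num_rooms": [2, 3, 3, 4],                    # discrete feature
    "price": [120.5, 210.0, 199.9, 305.25],       # continuous feature (numeric)
    "sold_at": pd.to_datetime(                    # continuous feature (date/time)
        ["2021-01-05", "2021-02-11", "2021-02-20", "2021-03-02"]
    ),
})

# Tell pandas explicitly that "color" takes values from a finite set of categories.
df["color"] = df["color"].astype("category")
print(df.dtypes)

# A simple feature engineering step: one-hot encode the categorical column
# so that models expecting numeric input can use it.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())
```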

### Limited functionality

* performs poorly on complex and abstract problems:
   <br> 
   
   * image recognition
   <br> 
   
   * playing games
   <br> 
   
   * speech recognition
    <br> 
   
   * etc.

# Parts of a typical ML algorithm

**Key parts of a typical algorithm:**
    <br>
    
   * task
    <br>
    
   * experience
    <br>
    
   * performance

## Task

* problem we want to solve using an algorithm

* different tasks require different algorithms - the **"no free lunch" theorem** tells us there is no single "best algorithm" for every problem

**Different types of tasks:**
    <br>
    
   * `Classification`
    <br>
    
   * `Regression`
    <br>
    
   * `Density estimation`
    <br>
    
   * `Anomaly detection`
    <br>
    
   * `Synthesis and sampling`
    <br>
    
   * etc.

## Experience

* encompasses the examples we give the algorithm to observe

* each of those examples is called a **sample** or a **training instance**

* a dataset made up of training instances is called a **training set**

* **special cases:**
    <br>
    
    * active learning
    <br>
    
    * reinforcement learning
    <br>
    
    * etc.

## Performance

* we need to measure how well our algorithm is working - **quantitative measure of performance**

* performance is measured differently depending on the task

* standard ways of measuring performance:
    <br>
    
    * **accuracy** for **classification** (even though there are better alternatives we will mention later)
    <br>
    
    * **mean squared error** for **regression**
    

* by tracking performance we can improve how our model performs on unseen data
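
As a small illustration of the two standard measures above, the sketch below computes accuracy and mean squared error with `sklearn.metrics`. The label and prediction values are made up for the example:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: accuracy is the fraction of predictions that match the true labels.
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]   # one mistake out of five
accuracy = accuracy_score(y_true_cls, y_pred_cls)

# Regression: mean squared error is the average of the squared differences.
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.0, 5.0, 4.0]   # errors of -1, 0 and +2
mse = mean_squared_error(y_true_reg, y_pred_reg)

print(f"accuracy: {accuracy:.2f}")
print(f"mean squared error: {mse:.3f}")
```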

# Training with and without supervision

**Different types of training:**
    <br>
    
   * `Supervised learning`
    <br>
    
   * `Unsupervised learning`
    <br>
    
   * `Semisupervised learning`
    <br>
    
   * `Reinforcement learning`

* we will mostly focus on supervised learning (and a little bit on unsupervised learning), and will skip going in-depth about semisupervised learning and reinforcement learning

## `Supervised learning`

* training data **includes labels**

* each example is a pair that consists of an **input object** and a **desired output value**

* by learning how to connect the input values with the desired output values, the algorithm learns how to predict the output value given an input value

**Supervised learning algorithms:**
    <br>
    
   * `Linear Regression`
    <br>
    
   * `Logistic Regression`
    <br>
    
   * `Support Vector Machines`
    <br>
    
   * `Decision Trees`
    <br>
    
   * `Random Forests`
    <br>
    
   * `Boosting algorithms`
    <br>
    
   * `k-Nearest Neighbors`
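
A minimal sketch of supervised learning in `Scikit-Learn`: the model is fitted on labeled pairs (input, desired output) and then asked to predict labels for inputs it has not seen. The Iris dataset and `LogisticRegression` are chosen here only as a convenient example, and the split size and `random_state` are arbitrary assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X holds the input objects (flower measurements), y the desired outputs (species labels).
X, y = load_iris(return_X_y=True)

# Hold out part of the labeled data so we can check predictions on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # learn the input -> output mapping

predictions = model.predict(X_test)    # predict labels for unseen inputs
test_accuracy = model.score(X_test, y_test)
print(f"accuracy on held-out data: {test_accuracy:.2f}")
```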

## `Unsupervised learning`

* training data is **not labeled**

* requires **A LOT** of data

* we will not use unsupervised classic Machine Learning methods, but we will go over a few of them in the Deep Learning chapter

# Typical Steps of a Machine Learning Project

## Think about the problem


* describe exactly what problem you are trying to solve

* understand how your solution will be used (e.g. scheduled to run automatically, an ad-hoc script, etc.)

* figure out how to measure performance

* figure out what performance threshold you need to pass to have a useful solution

## Get the data



* understand what kind of data you need and where you can find it

* create a workspace for your training needs:
    <br>
    
  * make sure you either have enough space to hold the data OR you build data pipelines to get data from your data lake, as needed
      <br>
    
  * install required software and packages 

**Problems we run into when working with data:**
   
   <br>
   
   * **insufficient quantity of training data** - training models to solve complex problems usually requires datasets with millions of examples (for simple problems we are usually talking about thousands of examples)
       <br>
    
   * **nonrepresentative training data** - your data must adequately represent the problem you want to solve and must not be biased
       <br>
    
   * **poor-quality data** - don't use data that contains too many errors, outliers, and noise (cleaning data helps only up to a certain point)

 

**Build data pipelines:**

   * data pipelines can be used to manipulate and transform data

   * when building data pipelines aim for the simplest possible solution
        * the more complicated the system, the more you risk failure
        * if possible, opt for hosted or cloud solutions


   * something to remember: 
        * feeding data to classic Machine Learning models is simple
        * however feeding data to Deep Learning models is more complex

## Explore and visualize some of the data to get insights

* explore the data to understand:
    <br>
    
  * what are your data attributes?
      <br>
    
  * what are your data types?
      <br>
    
  * do you have gaps in the data? how large are they?
      <br>
    
  * is your data noisy? (e.g. lots of outliers, rounding errors, etc.)
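
The questions above can be answered with a few `pandas` calls. This is only a sketch on a made-up DataFrame (the column names and values are hypothetical), and the 1.5 x IQR rule used to flag outliers is one common heuristic, not the only option:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps and one suspicious income value.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52, 38],
    "income": [32000.0, 41000.0, 39000.0, np.nan, 1_000_000.0, 45000.0],
    "city": ["Paris", "Rome", "Rome", None, "Paris", "Oslo"],
})

# What are the attributes and their data types?
print(df.dtypes)

# Are there gaps in the data, and how large are they?
missing = df.isna().sum()
print(missing)

# Summary statistics give a first impression of noise and value ranges.
print(df.describe())

# Flag outliers in one column with the 1.5 x IQR heuristic.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```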

## Clean up the data and prepare it for ML algorithms



* some advice:
    <br>
    
  * it is generally good practice not to modify the original data - you may need it in the future!
      <br>
    
  * rather than ad-hoc code, prefer to write functions for transformations (more modular, more reusable)

* clean the data:
    <br>
    
  * remove outliers, if needed
      <br>
    
  * fill in or remove missing values
      <br>
    
  * remove duplicates

* analyze and modify data:

    <br>
    
    * **feature selection** - choose which features you want to use for training
     
     <br>
    
    * **feature engineering** - create new features, e.g. by combining or transforming existing ones
     
     <br>
    
    * **feature scaling** - perform standardization or normalization (necessary for only some models)


* separate data that you will use for training from data that you will use for testing
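
The steps above could be sketched roughly as follows. The DataFrame, its column names, and the helper functions `clean` and `add_features` are hypothetical and exist only for this example. Filling gaps with the median is one possible choice among several. Note that the scaler is fitted on the training data only, so no information from the test data leaks into training:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: one duplicated row and one missing value.
raw = pd.DataFrame({
    "size": [50.0, 60.0, 60.0, np.nan, 80.0, 90.0, 100.0, 110.0],
    "rooms": [2, 3, 3, 3, 4, 4, 5, 5],
    "price": [100.0, 120.0, 120.0, 130.0, 160.0, 180.0, 200.0, 220.0],
})


def clean(df):
    """Return a cleaned copy; the original DataFrame is left untouched."""
    out = df.drop_duplicates().copy()
    out["size"] = out["size"].fillna(out["size"].median())
    return out


def add_features(df):
    """Feature engineering: derive a new feature from two existing ones."""
    out = df.copy()
    out["size_per_room"] = out["size"] / out["rooms"]
    return out


data = add_features(clean(raw))

# Feature selection: pick the columns used as inputs; "price" is the target.
X = data[["size", "rooms", "size_per_room"]]
y = data["price"]

# Separate training data from testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Feature scaling (standardization): fit on the training set, apply to both.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Writing the transformations as functions keeps them reusable: the same `clean` and `add_features` can later be applied to new data.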

## Select a few training models and train them



* if you have lots of data: **sample smaller training sets so you can test many different models faster**
    <br>
    
  * note that this may put neural networks or random forests at a disadvantage (they need more data)

* train many models from different categories (e.g. linear, tree-based etc.)
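
One way this could look in practice: take a smaller sample of the data, then train several models from different families and compare them with cross-validation. The dataset, the sample size of 200, and the three candidate models are assumptions made for this sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Sample a smaller training set so that many models can be tried quickly.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=200, random_state=0, stratify=y
)

# Candidate models from different categories (linear, kernel-based, tree-based).
# The linear model and the SVM get a scaling step; the forest does not need one.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each candidate.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X_small, y_small, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```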

## Compare the models and pick one model to use in production



* measure and compare the performance of the various trained models

* find the most significant variables for each algorithm

* analyze the errors made - is there any data that you could feed the system to train it to avoid these errors?

* select the top couple of models that look promising
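
A rough sketch of two of the steps above for a single model: ranking the most significant variables and looking at the errors it makes. It assumes a random forest, whose `feature_importances_` attribute is specific to tree-based models (other algorithms need other techniques), and it uses a held-out validation split chosen arbitrarily for the example:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0, stratify=data.target
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Most significant variables: sort features by importance, highest first.
order = np.argsort(forest.feature_importances_)[::-1]
top_features = [str(data.feature_names[i]) for i in order[:5]]
print("top features:", top_features)

# Error analysis: which kinds of mistakes does the model make, and on which samples?
val_predictions = forest.predict(X_val)
cm = confusion_matrix(y_val, val_predictions)   # rows: true class, columns: predicted class
misclassified = np.flatnonzero(val_predictions != y_val)
print(cm)
print("indices of misclassified validation samples:", misclassified)
```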

## Fine-tune the final model(s)



* fine-tune hyperparameters
    * hyperparameters control the **learning process**
    * by fine-tuning them, you can improve your model training

* treat data transformations as hyperparameters that you can switch on / off or whose values you can change
    <br>
    
    * e.g. one hyperparameter can be the way you handle missing data: fill in with zero, fill in with average, drop completely etc.

* try ensemble methods

* finally, measure the performance of the model on a test dataset - estimate the generalization error
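
The idea of treating a data transformation as a hyperparameter can be sketched with a `Pipeline` and `GridSearchCV`. Here the strategy for filling in missing values is searched together with two model hyperparameters. The dataset, the artificially removed values, and the grid values are all assumptions made for this example:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Artificially remove about 5% of the values so there is something to impute.
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipeline = Pipeline([
    ("impute", SimpleImputer()),
    ("model", RandomForestClassifier(random_state=0)),
])

param_grid = {
    # Data transformation as a hyperparameter: fill with the mean, the median,
    # or a constant (0 by default for numeric data).
    "impute__strategy": ["mean", "median", "constant"],
    # Ordinary model hyperparameters.
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 5],
}

# Try every combination with 3-fold cross-validation on the training data only.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

# Finally, estimate the generalization error on the untouched test set.
test_accuracy = search.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```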

## Deploy, monitor, maintain the model in production

* automate your solution as much as possible

* automate the deployment

* create a system that will monitor the performance of your model in production
    <br>
    
  * **over time, models can experience degradation in performance, as new data starts to differ from the data they were trained on**
      <br>
    
  * need to alert when the performance goes below a certain threshold
      <br>
    
  * at that point, you can re-train the model
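
A very small sketch of the monitoring check described above. The function name `needs_retraining`, the accuracy metric, and the threshold of 0.9 are hypothetical choices. A real system would also need a way to collect true labels for production predictions and to send the alert somewhere:

```python
from sklearn.metrics import accuracy_score


def needs_retraining(y_true, y_pred, threshold=0.9):
    """Return (accuracy, alert) for a batch of production predictions.

    alert is True when accuracy has dropped below the threshold.
    """
    accuracy = accuracy_score(y_true, y_pred)
    return accuracy, bool(accuracy < threshold)


# Made-up batch where the model is still doing well.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
accuracy_ok, alert_ok = needs_retraining(labels, [1, 0, 1, 1, 0, 1, 0, 0, 1, 1])

# Made-up later batch where performance has degraded (4 mistakes out of 10).
accuracy_bad, alert_bad = needs_retraining(labels, [1, 1, 1, 0, 0, 1, 0, 1, 1, 0])

if alert_bad:
    print(f"accuracy dropped to {accuracy_bad:.2f}: consider re-training the model")
```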

# Machine Learning in Python

* Python has excellent packages for Machine Learning

* some are used for classic Machine Learning, some are used for Deep Learning (even though there is a bit of overlap)

* in this chapter, we will focus on those that are important for classic Machine Learning (the ones we use for Deep Learning will be covered in the next chapter)

* by far the most popular library for classic Machine Learning is **`Scikit-Learn`**
    * there are other libraries such as **`CatBoost`**
    * however, **`Scikit-Learn`** still enjoys much more support than the newer libraries
