## A03. ML Glossary

### Terms

**Machine Learning**

Machine Learning (ML) is the art and science of programming computers to learn from data. There are a number of definitions for ML, two of which are as follows:  
<br/><br/>
*"Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."*  
Arthur Samuel, 1959
<br/><br/>
*"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."*  
Tom Mitchell, 1997

**Supervised Learning**  

Supervised learning is a type of system in which both input and desired output data are provided. Input and output data are labelled for classification to provide a learning basis for future data processing. In supervised learning, the training data that you feed into the algorithm includes the desired solutions. These are called **labels**. A typical supervised learning task is **classification**, where the system is trained on labelled examples (e.g. spam or not-spam emails) in order to classify new data.
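
As a rough illustration of learning from labelled examples, the sketch below scores a message by how often its words appeared under each label in the training data. The example messages, the `"spam"` / `"ham"` label names and the word-count scoring rule are all assumptions made for this toy example; it is not a production spam filter.

```python
from collections import Counter


def train(examples):
    """Count how often each word appears under each label."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def predict(counts, text):
    """Return the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {label: sum(seen[w] for w in words) for label, seen in counts.items()}
    return max(scores, key=scores.get)


# Hypothetical labelled training data: (input, desired output).
training_data = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch tomorrow at noon", "ham"),
]

model = train(training_data)
print(predict(model, "claim your free money"))
print(predict(model, "noon meeting tomorrow"))
```

The labels are what make this supervised: the training step is told the desired answer for every example.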

**Unsupervised Learning**  

In Unsupervised Learning, the training data is unlabelled and the system tries to find patterns in the data without being taught.  
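
One common way to find patterns without labels is clustering. The sketch below is a minimal one-dimensional k-means, assuming `k >= 2`, a fixed number of iterations and a simple "spread across the sorted data" initialisation; the function name and sample points are illustrative only.

```python
def k_means_1d(points, k=2, iterations=10):
    """Group 1-D points into k clusters; returns (labels, centroids)."""
    ordered = sorted(points)
    # Assumed initialisation: pick k values spread evenly over the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# No labels are supplied; the grouping is discovered from the values alone.
labels, centroids = k_means_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])
print(labels, centroids)
```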

**Semisupervised learning**  

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning.

**Reinforcement Learning**  

In Reinforcement Learning, the learning system, called an **agent** in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in Figure 1-12). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
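
The agent / action / reward / policy loop can be sketched in a few lines. This is a deliberately tiny setting, assuming a single situation with three actions and fixed rewards; the reward table, the action names and the "explore on every tenth step" schedule are hypothetical simplifications (a deterministic stand-in for the random exploration that is more commonly used).

```python
def run_agent(rewards, steps=100, explore_every=10):
    """Learn action values by trial and error; returns (best_action, values)."""
    actions = list(rewards)
    value = {a: 0.0 for a in actions}   # running average reward per action
    count = {a: 0 for a in actions}
    for step in range(steps):
        if step % explore_every == 0:
            # Explore: cycle through the actions so each one gets tried.
            action = actions[(step // explore_every) % len(actions)]
        else:
            # Exploit: follow the current policy (highest estimated value).
            action = max(actions, key=value.get)
        reward = rewards[action]        # the environment's response
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
    return max(actions, key=value.get), value


# Hypothetical environment: a penalty, a neutral action and a reward.
best_action, values = run_agent({"left": -1.0, "stay": 0.0, "right": 1.0})
print(best_action, values)
```

Here the learned policy is simply "pick the action with the highest estimated value"; with more than one situation, the policy would map each situation to an action.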

**Batch and Online Learning**  

**Batch Learning**  
Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data. In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data.  

**Online Learning**  
This is where you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
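
A minimal sketch of online learning, assuming a straight-line model trained by gradient descent on squared error, one mini-batch at a time. The `stream` generator is a hypothetical stand-in for real incoming data, and the learning rate is an arbitrary choice.

```python
def sgd_step(weights, batch, learning_rate=0.1):
    """Nudge (slope, intercept) using one mini-batch of (x, y) pairs."""
    slope, intercept = weights
    n = len(batch)
    # Gradients of the mean squared error over this mini-batch only.
    grad_slope = sum(2 * (slope * x + intercept - y) * x for x, y in batch) / n
    grad_intercept = sum(2 * (slope * x + intercept - y) for x, y in batch) / n
    return (slope - learning_rate * grad_slope,
            intercept - learning_rate * grad_intercept)


def stream(n_batches, batch_size=4):
    """Hypothetical data stream: noiseless points on y = 2x + 1."""
    for i in range(n_batches):
        xs = [((i * batch_size + j) % 10) / 10 for j in range(batch_size)]
        yield [(x, 2 * x + 1) for x in xs]


weights = (0.0, 0.0)
for batch in stream(5000):
    # Each batch is used once and can then be discarded.
    weights = sgd_step(weights, batch)
print(weights)
```

A batch learner would instead need every point available at once and would be retrained from scratch whenever new data arrived.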

**Instance-Based Learning & Model-Based Learning**  

**Instance-Based Learning**  
This is where the system learns the examples by heart, then generalizes to new cases using a similarity measure.  

**Model-Based Learning**  
This is where you build a model that generalizes from a set of examples, then use it to make predictions.
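
The two approaches can be contrasted on the same data. In this sketch the instance-based learner is a k-nearest-neighbours average (absolute difference as the similarity measure) and the model-based learner is a least-squares line; both choices, and the sample data, are assumptions made for illustration.

```python
def knn_predict(examples, x_new, k=3):
    """Instance-based: keep all examples, average the k most similar ones."""
    nearest = sorted(examples, key=lambda ex: abs(ex[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


def fit_line(examples):
    """Model-based: summarise the examples as (slope, intercept)."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in examples)
             / sum((x - mean_x) ** 2 for x, _ in examples))
    return slope, mean_y - slope * mean_x


examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
slope, intercept = fit_line(examples)

# Inside the training range the two agree; outside it they do not.
print(knn_predict(examples, 3.0), slope * 3.0 + intercept)
print(knn_predict(examples, 6.0), slope * 6.0 + intercept)
```

At `x = 6.0`, outside the training range, the nearest-neighbour estimate can only reuse stored labels, while the fitted line extrapolates.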


**Attributes, Features & Labels**  

Briefly, feature is input; label is output.

In Machine Learning an **attribute** is a data type (e.g., “Mileage”), while a **feature** has several meanings depending on the context, but generally means an **attribute** plus its value (e.g., “Mileage = 15,000”) and broadly equates to one column of the dataset. Many people use the words attribute and feature interchangeably. For instance, if you're trying to predict the type of pet someone will choose, your input **features** might include age, home region, family income, etc. The **label** is the final choice, such as dog, fish, iguana, rock, etc.

Once you've trained your model, you will give it sets of new input containing those **features**; it will return the predicted **label** (pet type) for that person.

**Overfitting**  

This means that the model performs well on the training data, but it does not generalize well to unseen data. Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself. Obviously these patterns will not **generalize** to new instances.

**Underfitting**  

As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.

**Regularization**

Constraining a model to make it simpler and reduce the risk of overfitting is called **Regularization**. This is usually accomplished by restricting the **degrees of freedom** the learning algorithm has to adapt the model to the training data, for example by forcing a model parameter to stay small.

**Hyperparameter**  
A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. If you set the regularization hyperparameter of a linear model to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution. Tuning hyperparameters is an important part of building a Machine Learning system.
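
To make the effect of a regularization hyperparameter concrete, the sketch below fits a straight line with an L2 penalty on the slope (ridge regression with an unpenalised intercept). The name `alpha` for the hyperparameter and the sample data are assumptions; the closed form `slope = Sxy / (Sxx + alpha)` holds for this one-feature case.

```python
def ridge_line(xs, ys, alpha):
    """Fit y = slope * x + intercept, penalising slope**2 by alpha."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # alpha = 0 gives ordinary least squares; larger alpha shrinks the slope.
    slope = sxy / (sxx + alpha)
    return slope, mean_y - slope * mean_x


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# alpha is chosen before training and is not changed by the fit itself.
for alpha in (0.0, 10.0, 1000.0):
    slope, intercept = ridge_line(xs, ys, alpha)
    print(alpha, round(slope, 4), round(intercept, 4))
```

With a very large `alpha` the slope approaches zero, which is the almost flat model described above.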

**Train / Test**  

The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to put your model in production and monitor how well it performs. This works well, but if your model is horribly bad, your users will complain — not the best idea.

A better option is to split your data into two sets: the training set and the test set. This is called the Train/Test Split. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the Generalization Error (or out-of-sample error), and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.
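
A train/test split can be sketched with the standard library alone. The 20% test ratio and the fixed seed are assumptions for this example; the function name mirrors a common convention but this is not any particular library's implementation.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = train_test_split(range(100))
print(len(train_set), len(test_set))
```

Keeping the seed fixed matters: if the split changes on every run, the model will eventually have seen all of the data and the test set no longer measures generalization.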

**Cross-Validation**  
To avoid “wasting” too much training data in validation sets, a common technique is to use **cross-validation**. This is where the training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts. Once the model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.
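
The splitting scheme can be sketched as k-fold cross-validation. In this sketch the folds are contiguous and unshuffled, and the "model" is just the mean of the training labels scored by mean absolute error; both are simplifying assumptions, and the function names are illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (training, validation) index lists for each of k folds."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds (sizes differ by at most 1).
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


def cross_validate(ys, k):
    """Average validation error of a 'predict the training mean' model."""
    errors = []
    for training, validation in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in training) / len(training)
        fold_error = sum(abs(ys[i] - prediction) for i in validation) / len(validation)
        errors.append(fold_error)
    return sum(errors) / len(errors)


labels = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0]
print(cross_validate(labels, k=3))
```

Every sample is used for validation exactly once and for training in the other folds, so no data is permanently set aside.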

**Feature Scaling**  

One of the most important transformations you need to apply to your data is **feature scaling**. With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required. There are two common ways to get all attributes to have the same scale: min-max scaling (rescaling values to a fixed range, typically 0 to 1) and standardization (subtracting the mean and dividing by the standard deviation).
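
Both scaling methods fit in a few lines. This sketch assumes a non-constant column (otherwise the divisions would fail) and uses the population standard deviation; apart from the two endpoints quoted above, the sample values are made up.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]


# Hypothetical room counts spanning the range mentioned in the text.
total_rooms = [6.0, 1500.0, 2600.0, 39320.0]
print(min_max_scale(total_rooms))
print(standardize(total_rooms))
```

Min-max scaling is bounded but sensitive to outliers such as the 39,320 value, which squeezes the other values towards 0; standardization has no fixed range but is less affected by a single extreme value.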

### Resources

#### ML Project Checklist

**Pre Project**

1. What's the business issue? Is it a Machine Learning problem?  
2. What does the current solution look like, if indeed there is one?
3. What does the data look like? Some ideas:
    * How was it gathered?  
    * Is it a sample or a full population?  
    * What pre-processing, if any, has the data undergone? Are there any other variables missing?  
    * If it's currently used, what is it used for?  
    * Are there any schema or descriptions available?  
4. How do you anticipate solving the problem? Should you use supervised, unsupervised, or reinforcement learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?  
5. What Performance Measure will you use?
6. What assumptions have been made, either by you or others?



**Data Preparation** 

1. Perform some basic exploratory analysis of the data (missing values, column names, cardinality, correlations, categorical / numeric variables, overall cleanliness, etc.).
2. Visualise the data (spread and distribution, patterns, pre-processing needs, etc.).
3. Eyeball a sample of the data. Are there any glaring issues (e.g. user error, sample bias, inconsistencies)?
4. Consider simplifying the dataset: averaging or combining variables, reducing dimensionality, or creating categorical variables.

**Pipelining**  

1. Deal with missing values (e.g. scikit-learn's `SimpleImputer`, called `Imputer` in older releases)
2. Encode categorical variables (e.g. scikit-learn's `LabelEncoder` / `OneHotEncoder`)
3. Consider Feature Scaling if the variables have vastly different scales  
4. Consider adding hyperparameter options based upon the variables in the dataset.
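
The first two steps above can be sketched without any library, to show what the corresponding scikit-learn transformers do conceptually. The function names, the use of `None` for a missing value and the sample columns are assumptions for this illustration; this is not scikit-learn's API.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


def one_hot(column):
    """Encode a categorical column as one 0/1 indicator per category."""
    categories = sorted(set(column))
    rows = [[1 if value == cat else 0 for cat in categories] for value in column]
    return categories, rows


# Hypothetical raw columns.
income = [30.0, None, 50.0]
pet = ["dog", "fish", "dog"]

print(impute_mean(income))
print(one_hot(pet))
```

In a real pipeline the mean and the category list would be learned from the training set only and then reused unchanged on the test set.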

**Building / Running the model**

1. What's the error?
2. Is the model overfitting? 
3. Is the model underfitting?
4. Consider trying different model types.

**Fine Tuning**  

1. Consider Grid Search, Randomised Search or Ensemble Methods
2. Analyse the best models and their errors. What variables can we remove?
3. Evaluate on the test set.
