# Machine Learning

+ **Machine Learning** is the science (and art) of programming computers so they can learn from data.

+ **Machine Learning** is the field of study that gives computers the ability to learn without being explicitly programmed.


+ A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.



**This is what a typical Machine Learning project looks like:**

+ *You studied the data*

+ *You selected a model*

+ *You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function)*

+ *Finally, you applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well*
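The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the data points are made up, the model is a one-feature linear model, the cost function is the mean squared error, and the learning algorithm is batch gradient descent. The names `cost`, `train`, and `predict` are hypothetical helpers, not part of any library.

```python
# Toy data, made up for illustration: y is roughly 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

def cost(theta0, theta1):
    """Mean squared error of the linear model theta0 + theta1 * x."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(learning_rate=0.05, n_steps=5000):
    """Batch gradient descent: repeatedly step against the gradient of the cost."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = 2 / n * sum(errors)
        grad1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

# Training: search for the parameter values that minimize the cost.
theta0, theta1 = train()

def predict(x):
    """Inference: apply the trained model to a new case."""
    return theta0 + theta1 * x

prediction = predict(5.0)
print(f"theta0={theta0:.3f}, theta1={theta1:.3f}, cost={cost(theta0, theta1):.4f}")
print(f"prediction for x=5: {prediction:.2f}")
```

In practice a library such as Scikit-Learn would handle the training loop; the sketch only makes the "minimize a cost function" step visible.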

# Types

+ Supervised

+ Unsupervised

+ Semisupervised

+ Reinforcement Learning

# Supervised

In supervised learning, the training data you feed to the algorithm includes the desired solutions, called
labels.
A typical supervised learning task is classification. The spam filter is a good example of this: it is trained
with many example emails along with their class (spam or ham), and it must learn how to classify new
emails.
![image.png](attachment:image.png)
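As a rough sketch of classification, here is a tiny k-Nearest Neighbors spam classifier. The two features (number of links, number of exclamation marks) and the six training emails are invented for illustration; a real spam filter would use far richer features and far more data.

```python
from collections import Counter
import math

# Hypothetical labeled training set: (num_links, num_exclamation_marks) -> class.
training_emails = [
    ((8, 5), "spam"), ((6, 7), "spam"), ((9, 4), "spam"),
    ((0, 1), "ham"), ((1, 0), "ham"), ((2, 1), "ham"),
]

def classify(features, k=3):
    """Predict the majority class among the k closest training emails."""
    neighbors = sorted(
        training_emails, key=lambda item: math.dist(item[0], features)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(classify((7, 6)))  # many links and exclamation marks
print(classify((1, 1)))  # few of either
```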

Another typical task is to predict a target numeric value, such as the price of a car, given a set of features
(mileage, age, brand, etc.) called predictors. This sort of task is called regression.
To train the system, you need to give it many examples of cars, including both their predictors and their labels
(i.e., their prices).
![image.png](attachment:image.png)

*Note that some regression algorithms can be used for classification as well, and vice versa. For example,
Logistic Regression is commonly used for classification, as it can output a value that corresponds to the
probability of belonging to a given class.*
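To make the Logistic Regression remark concrete, the sketch below shows how a linear score is squashed into a probability by the logistic (sigmoid) function and then thresholded into a class. The weights, bias, and feature meanings are assumptions chosen by hand for illustration; in a real model they would be learned from the training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, hand-picked parameters (a trained model would learn these).
WEIGHTS = [0.8, 0.5]   # weights for (num_links, num_exclamation_marks)
BIAS = -4.0

def predict_proba(features):
    """Estimated probability that the email belongs to the spam class."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return sigmoid(z)

def predict_class(features, threshold=0.5):
    """Turn the regression-style output into a classification."""
    return "spam" if predict_proba(features) >= threshold else "ham"

for email in [(8, 5), (0, 1)]:
    print(email, round(predict_proba(email), 3), predict_class(email))
```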

Here are some of the most important supervised learning algorithms:

+ k-Nearest Neighbors
+ Linear Regression
+ Logistic Regression
+ Support Vector Machines (SVMs)
+ Decision Trees and Random Forests
+ Neural networks


# Unsupervised learning

In unsupervised learning the training data is unlabeled. The system tries
to learn without a teacher. 
For example, say you have a lot of data about your blog’s visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors. At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help. For example, it might notice that 40% of your visitors are males who love comic books and generally read your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends, and so on.



![image.png](attachment:image.png)

After running the clustering algorithm:
![image.png](attachment:image.png)
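The clustering idea can be sketched with a bare-bones k-Means loop (Lloyd's algorithm). The visitor features (hour of visit, pages per visit), the six data points, and the choice of starting centroids are all assumptions made for this example; note that no group labels are ever given to the algorithm.

```python
import math

def k_means(points, initial_centroids, n_iterations=10):
    """Alternate between assigning points to centroids and moving centroids."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return assignments, centroids

# Hypothetical unlabeled visitors: (hour_of_visit, pages_per_visit).
visitors = [(20, 3), (21, 4), (22, 3), (9, 12), (10, 11), (8, 13)]
assignments, centroids = k_means(visitors, [visitors[0], visitors[3]])
print(assignments)
print(centroids)
```

Real implementations usually pick the starting centroids randomly and rerun several times, since k-Means can settle in a poor local optimum.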

Clustering
 + k-Means
 + Hierarchical Cluster Analysis (HCA)
 + Expectation Maximization

Visualization and dimensionality reduction
 + Principal Component Analysis (PCA)
 + Kernel PCA


A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.
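A minimal sketch of this kind of feature extraction, assuming made-up mileage and age values: both features are standardized, then projected onto the first principal component of their 2x2 covariance matrix (the direction of largest variance). The closed-form angle used here only works for two features; it is a stand-in for a general PCA routine.

```python
import math
import statistics

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def first_principal_component(xs, ys):
    """Unit vector along the direction of largest variance of 2D data."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a = sum((x - mx) ** 2 for x in xs) / n                    # var(x)
    c = sum((y - my) ** 2 for y in ys) / n                    # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n  # cov(x, y)
    theta = 0.5 * math.atan2(2 * b, a - c)
    return math.cos(theta), math.sin(theta)

# Hypothetical cars: mileage and age are strongly correlated.
mileage_km = [20_000, 45_000, 80_000, 120_000, 160_000]
age_years = [1, 3, 5, 8, 10]

z_mileage = standardize(mileage_km)
z_age = standardize(age_years)
direction = first_principal_component(z_mileage, z_age)

# One merged feature per car, replacing the two original ones.
wear_and_tear = [
    direction[0] * m + direction[1] * a for m, a in zip(z_mileage, z_age)
]
print(direction)
print(wear_and_tear)
```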

<font size=6, color = 'red'> IMPORTANT</font>

*It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another Machine Learning algorithm (such as a supervised learning algorithm). It will run much faster, the data will take up less disk and memory space, and in some cases it may also perform better.*

# Semisupervised learning

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little
bit of labeled data. This is called semisupervised learning!  
Some photo-hosting services, such as **Google Photos**, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the
algorithm (clustering). Now all the system needs is for you to tell it who these people are. Just one label per person,
and it is able to name everyone in every photo, which is useful for searching photos.

# Reinforcement Learning

Reinforcement Learning is a very different beast. The learning system, called an agent in this context,
can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.

![image.png](attachment:image.png)
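To illustrate the agent, reward, and policy vocabulary, here is a small sketch using tabular Q-learning, one common Reinforcement Learning algorithm (the text above does not prescribe a specific one). The environment is an invented five-cell corridor: the agent starts in cell 0, gets a reward of +1 for reaching cell 4, and a small penalty of -0.01 for every other step. All hyperparameter values are assumptions.

```python
import random

N_STATES = 5  # corridor cells 0..4; cell 4 is the goal
ACTIONS = {"left": -1, "right": +1}

def step(state, action):
    """Environment: move, stay inside the corridor, return (state, reward, done)."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True
    return next_state, -0.01, False  # small penalty (negative reward)

def train(n_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values Q(state, action) from observed rewards."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(n_episodes):
        state = 0
        for _step in range(1000):  # safety cap on episode length
            if rng.random() < epsilon:   # explore: random action
                action = rng.choice(list(ACTIONS))
            else:                        # exploit: best known action
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
            if done:
                break
    return q

q_table = train()
# The learned policy: which action to choose in each non-terminal state.
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```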

# Main Challenges of ML

In short, since your main task is to select a learning algorithm and train it on some data, the two things that
can go wrong are: 

+ ***Bad Data*** 
+ ***Bad Algorithm*** 

***Bad Data***

+ **Insufficient Quantity of Training Data**

*Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples* 

+ **Nonrepresentative Training Data**

*In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.*

![image.png](attachment:image.png)

It is crucial to use a training set that is representative of the cases you want to generalize to. If the sample is too small, you will have *sampling noise* (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called *sampling bias*.

<font size=4, color = 'red'> EXAMPLE</font>

***The most famous example of sampling bias happened during the US presidential election in 1936, which pitted Landon against Roosevelt: the Literary Digest conducted a very large poll, sending mail to about 10 million people. It got 2.4 million answers, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes. The flaw was in the Literary Digest’s sampling method:***

 + ***First, to obtain the addresses to send the polls to, the Literary Digest used telephone directories, lists of magazine subscribers, club membership lists, and the like. All of these lists tend to favor wealthier people, who are more likely to vote Republican (hence Landon)***

 + ***Second, less than 25% of the people who received the poll answered. Again, this introduces a sampling bias, by ruling out people who don’t care much about politics, people who don’t like the Literary Digest, and other key groups. This is a special type of sampling bias called nonresponse bias***
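The difference between sampling noise and sampling bias can be simulated. The sketch below does not use the real 1936 figures: the electorate, the share of wealthy voters, and the support rates are invented. It compares a small random sample (noisy), a large random sample (accurate), and a large sample drawn only from wealthy voters (biased, no matter how large).

```python
import random

rng = random.Random(0)

# Hypothetical electorate: 30% of voters are wealthy and support candidate A
# with probability 0.70; everyone else supports A with probability 0.35.
def make_voter():
    wealthy = rng.random() < 0.30
    supports_a = rng.random() < (0.70 if wealthy else 0.35)
    return wealthy, supports_a

population = [make_voter() for _ in range(200_000)]

def support_for_a(voters):
    """Fraction of the given voters who support candidate A."""
    return sum(supports for _, supports in voters) / len(voters)

true_support = support_for_a(population)

# Small random sample: unbiased but subject to sampling noise.
small_random_estimate = support_for_a(rng.sample(population, 50))
# Large random sample: unbiased and precise.
large_random_estimate = support_for_a(rng.sample(population, 20_000))
# Large sample drawn from a flawed list (wealthy voters only): sampling bias.
wealthy_only = [voter for voter in population if voter[0]]
large_biased_estimate = support_for_a(rng.sample(wealthy_only, 20_000))

print(f"true support:        {true_support:.3f}")
print(f"small random sample: {small_random_estimate:.3f}")
print(f"large random sample: {large_random_estimate:.3f}")
print(f"large biased sample: {large_biased_estimate:.3f}")
```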

+ **Poor-Quality Data**

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data.
The truth is, most data scientists spend a significant part of their time doing just that. For example:

+ If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.

+ If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on.
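One of the options above, filling in missing values with the median, might look like this minimal sketch. The ages are made up, and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical customer ages; None marks a customer who did not specify an age.
ages = [34, None, 27, 45, None, 52, 38]

def fill_missing_with_median(values):
    """Replace missing entries with the median of the known entries."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    filled = [median if v is None else v for v in values]
    return filled, median

filled_ages, median_age = fill_missing_with_median(ages)
print(median_age)
print(filled_ages)
```

Returning the median as well is a deliberate choice: the same value can then be reused to fill gaps in new data, rather than being recomputed from it.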

+ **Irrelevant Features**

As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on.
This process, called *feature engineering*, involves:

+ Feature selection: selecting the most useful features to train on among existing features.

+ Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).

+ Creating new features by gathering new data.

***Bad Algorithm***

+ **Overfitting**

The model performs well on the training data, but it does not generalize well. 
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible
solutions are:

+ *To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data or by constraining the model*

+ *To gather more training data*
+ *To reduce the noise in the training data (e.g., fix data errors and remove outliers)*

![image.png](attachment:image.png)

Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself. Obviously these patterns will not generalize to new instances. For example, say you feed your life satisfaction model many more attributes, including uninformative ones such as the country’s name. In that case, a complex model may detect patterns like the fact that all countries in the training data with a w in their name have a life satisfaction greater than 7: New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5). How confident are you that the W-satisfaction rule generalizes to Rwanda or
Zimbabwe?
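Overfitting can be reproduced on a tiny scale. In the sketch below (made-up data: a straight line 2x + 1 plus hand-picked noise), a simple linear model is compared with a degree-5 polynomial that passes exactly through all six training points. The interpolating polynomial is built in Lagrange form purely as a convenient stand-in for "a model that is too complex for the data"; the new cases used for evaluation are noise-free points on the underlying line.

```python
def fit_line(xs, ys):
    """Least-squares straight line (two parameters)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree len(xs) - 1 through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Noisy training data around y = 2x + 1, and new cases on the true line.
train_xs = [0, 1, 2, 3, 4, 5]
train_ys = [1.3, 2.6, 5.4, 6.7, 9.5, 10.6]
new_xs = [0.5, 2.5, 4.5]
new_ys = [2.0, 6.0, 10.0]

line = fit_line(train_xs, train_ys)
poly = fit_interpolating_polynomial(train_xs, train_ys)

line_train, line_new = mse(line, train_xs, train_ys), mse(line, new_xs, new_ys)
poly_train, poly_new = mse(poly, train_xs, train_ys), mse(poly, new_xs, new_ys)
print(f"linear model: training MSE={line_train:.4f}, new-case MSE={line_new:.4f}")
print(f"polynomial:   training MSE={poly_train:.4f}, new-case MSE={poly_new:.4f}")
```

The polynomial fits the training set perfectly because it has memorized the noise, and it does worse than the line on the new cases.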


+ **Underfitting**

Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.
The main options to fix this problem are:

+ *Selecting a more powerful model, with more parameters*
+ *Feeding better features to the learning algorithm (feature engineering)*
+ *Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)*
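The effect of the regularization hyperparameter can be seen in a one-feature ridge regression, sketched below with made-up data. The closed form `slope = Sxy / (Sxx + alpha)` assumes an L2 penalty on the slope only (the intercept is not penalized); `alpha` is the hypothetical name used here for the regularization hyperparameter.

```python
# Toy data, made up for illustration: y is roughly 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

def ridge_slope(alpha):
    """Slope minimizing sum((y - b - w*x)^2) + alpha * w^2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + alpha)

# A larger alpha constrains the model more and flattens the line, which
# pushes it toward underfitting; reducing alpha relaxes the constraint.
slopes = {alpha: ridge_slope(alpha) for alpha in (0.0, 10.0, 1000.0)}
for alpha, slope in slopes.items():
    print(f"alpha={alpha:>6}: slope={slope:.4f}")
```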

<font size=6, color = 'red'> IMPORTANT </font>

***Theorem: No Free Lunch (NFL)***

**If you make absolutely no assumption about the data, then there is no reason to prefer one model over any other.** 

*For some datasets the best model is a linear model, while for other datasets it is a neural network. There is no model that is a priori guaranteed to work better (hence the name of the theorem). The only way to know for sure which model is best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and you evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization, and for a complex problem you may evaluate various neural networks.*

# Testing and Validating

+ The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to put your model in production and monitor how well it performs. This works well, but if your model is horribly bad, your users will complain — not the best idea.


+ A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimation of this error. This value tells you how well your model will perform on instances it has never seen before. If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data.


+ So evaluating a model is simple enough: just use a test set. Now suppose you are hesitating between two models (say a linear model and a polynomial model): how can you decide? One option is to train both and compare how well they generalize using the test set.
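The training set / test set split described above can be sketched as follows. The 20% test ratio and the fixed seed are assumptions chosen for the example, and `train_test_split` here is a hand-written helper, not the Scikit-Learn function of the same name.

```python
import random

def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data, then hold out a fraction as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))  # stand-in for 100 instances
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```

Fixing the seed matters: if the split changed on every run, the model would eventually see the whole dataset, and the test set would no longer contain truly unseen instances.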

*Now suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting. The question is: how do you choose the value of the regularization hyperparameter? One option is to train 100 different models using 100 different values for this hyperparameter. Suppose you find the best hyperparameter value that produces a model with the lowest generalization error, say just 5% error.*

*So you launch this model into production, but unfortunately it does not perform as well as expected and produces 15% errors. What just happened?*

**The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that set. This means that the model is unlikely to perform as well on new data.**
*A common solution to this problem is to have a second holdout set called the* **validation set**. *You train multiple models with various hyperparameters using the* **training set**, *you select the model and hyperparameters that perform best on the* **validation set**, *and when you’re happy with your model you run a single final test against the test set to get an estimate of the generalization error.*
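A sketch of this procedure, under several assumptions: the data is synthetic (a noisy straight line), the model is a one-feature ridge regression whose hyperparameter is called `alpha` here, and the 60/20/20 split is arbitrary. Each candidate is trained on the training set, the best is chosen on the validation set, and the test set is touched exactly once at the end.

```python
import random

rng = random.Random(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]
pairs = list(zip(xs, ys))

# The data is already in random order: 60 train, 20 validation, 20 test.
train, validation, test = pairs[:60], pairs[60:80], pairs[80:]

def fit_ridge(data, alpha):
    """One-feature ridge regression; alpha penalizes the slope only."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / (sxx + alpha)
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(model, data):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

ALPHAS = [0.0, 1.0, 10.0, 100.0, 1000.0]

# 1. Train one model per hyperparameter value on the training set.
models = {alpha: fit_ridge(train, alpha) for alpha in ALPHAS}
# 2. Select the hyperparameter that performs best on the validation set.
best_alpha = min(ALPHAS, key=lambda alpha: mse(models[alpha], validation))
# 3. Run a single final evaluation on the test set.
test_error = mse(models[best_alpha], test)

print(f"best alpha: {best_alpha}, estimated generalization MSE: {test_error:.3f}")
```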

*To avoid “wasting” too much training data in validation sets, a common technique is to use* ***cross-validation***: *the training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts. Once the model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.*
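The splitting scheme can be sketched as k-fold cross-validation, one common variant. Everything here is illustrative: the helper names are hypothetical, the data is a noise-free line, and the two candidate models (a constant mean predictor and a least-squares line) are deliberately simple. The folds are taken in order, so real data should be shuffled first.

```python
def k_fold_indices(n_instances, n_folds):
    """Yield (train_indices, validation_indices), one pair per fold."""
    indices = list(range(n_instances))
    base, extra = divmod(n_instances, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_val_mse(xs, ys, fit, n_folds=5):
    """Average validation MSE of a fitting procedure over all folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), n_folds):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)

def fit_mean(xs, ys):
    """Baseline model: always predict the mean training label."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

# Hypothetical training set lying exactly on y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]

mean_score = cross_val_mse(xs, ys, fit_mean)
line_score = cross_val_mse(xs, ys, fit_line)
print(f"mean predictor CV MSE: {mean_score:.3f}")
print(f"linear model CV MSE:   {line_score:.3f}")
```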