# Chapter 1: The Machine Learning Landscape

## whether or not the system can learn incrementally

- batch learning: If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset
- out-of-core learning: train systems on huge datasets that cannot fit in one machine’s main memory
- online learning: 
	- train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches.
	- need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance.
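
As a rough illustration of online learning, the sketch below feeds a linear model one mini-batch at a time using scikit-learn's `partial_fit`. The synthetic data, the batch size of 50, and the learning-rate settings are assumptions made for this example, not values from the notes.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic stream (assumed for illustration): y = 3*x0 - 2*x1 + 1 + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=1000)

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=42)

batch_size = 50
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # Each call updates the existing weights instead of retraining from scratch.
    model.partial_fit(X_batch, y_batch)

print(model.coef_, model.intercept_)
```

In a real deployment the loop body would run whenever new data arrives, and the monitoring step mentioned above would sit next to it (for example, evaluating each batch before learning from it).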

## how models generalize

- instance-based learning: the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them), using a similarity measure.
- model-based learning: build a model of these examples, then use that model to make predictions. 
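
The two approaches can be contrasted on a toy problem. This is only a sketch: the four training points are made up, the instance-based predictor is a 1-nearest-neighbor lookup, and the model-based predictor is a straight line fitted with `np.polyfit`.

```python
import numpy as np

# Toy training set (made-up values): y is roughly 2 * x.
X_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([2.1, 3.9, 6.2, 7.8])

def predict_instance_based(x_new):
    """1-nearest neighbor: return the target of the most similar stored example."""
    distances = np.abs(X_train - x_new)  # similarity measure: absolute distance
    return y_train[np.argmin(distances)]

# Model-based: summarize the examples as a line y = slope * x + intercept.
slope, intercept = np.polyfit(X_train, y_train, deg=1)

def predict_model_based(x_new):
    """Predict from the fitted line; the training examples are no longer needed."""
    return slope * x_new + intercept

print(predict_instance_based(2.4))  # target of the closest stored example (x = 2.0)
print(predict_model_based(2.4))     # value read off the fitted line
```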

## validate models
- it is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the dataset: if it contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances: that’s probably more than enough to get a good estimate of the generalization error.
- holdout validation: you simply hold out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the validation set.
- the most important rule to remember is that the validation set and the test set must be **as representative as possible** of the data you expect to use in production 
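
A minimal sketch of carving a dataset into training, validation, and test indices. The helper name `split_train_val_test` and the 60/20/20 ratios are assumptions for illustration; as noted above, the right ratios depend on the dataset size.

```python
import numpy as np

def split_train_val_test(n_samples, val_ratio=0.2, test_ratio=0.2, seed=42):
    """Shuffle the indices once, then slice them into three disjoint sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(n_samples * test_ratio)
    n_val = int(n_samples * val_ratio)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_train_val_test(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

A purely random split like this one assumes the data is already representative; with skewed data a stratified split is usually the safer choice.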

## challenges of machine learning
- Insufficient Quantity of Training Data
- Nonrepresentative Training Data
- Poor-Quality Data
- Irrelevant Features
- Overfitting the Training Data
- Underfitting the Training Data



# Chapter 2: End-to-End Machine Learning Project

## transformers and pipelines
- it is important to fit the scalers to the training data only, not to the full dataset
- All but the last estimator must be transformers

e.g.

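```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# OldDataFrameSelector and CombinedAttributesAdder are custom transformers
# defined elsewhere; num_attribs and cat_attribs are lists of column names.
old_num_pipeline = Pipeline([
        ('selector', OldDataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),  # add new attributes
        ('std_scaler', StandardScaler()),
    ])

old_cat_pipeline = Pipeline([
        ('selector', OldDataFrameSelector(cat_attribs)),
        # this argument was named `sparse` before scikit-learn 1.2
        ('cat_encoder', OneHotEncoder(sparse_output=False)),
    ])

old_full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", old_num_pipeline),
        ("cat_pipeline", old_cat_pipeline),
    ])
```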
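
The rule about fitting scalers on the training data only can be sketched as follows. The synthetic array and the 160/40 split are assumptions for this example; the point is that `fit_transform` is called on the training set and only `transform` on the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data with a few missing values (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
X[::25, 0] = np.nan
X_train, X_test = X[:160], X[160:]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Medians, means and standard deviations are learned from the training set only...
X_train_prepared = num_pipeline.fit_transform(X_train)
# ...and then reused, unchanged, on the test set.
X_test_prepared = num_pipeline.transform(X_test)

print(X_train_prepared.mean(axis=0))  # close to 0 for each column
print(X_test_prepared.mean(axis=0))   # generally not exactly 0, and that is expected
```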

## save models

```python
import joblib  # older scikit-learn releases exposed this as sklearn.externals.joblib

joblib.dump(my_model, "my_model.pkl")  # and later...
my_model_loaded = joblib.load("my_model.pkl")
```

## Fine tuning the model
- If GridSearchCV is initialized with refit=True (which is the default), then once it finds the best estimator using cross-validation, it retrains it on the whole training set. This is usually a good idea since feeding it more data will likely improve its performance.
- Performance on the test set will usually be slightly worse than what you measured using cross-validation if you did a lot of hyperparameter tuning (because your system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on unknown datasets).
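
A small sketch of the refit behavior, assuming a Ridge model and a made-up grid of `alpha` values on synthetic data. After `fit`, `best_estimator_` has already been retrained on the full training set passed in.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5,
                           scoring="neg_mean_squared_error")  # refit=True by default
grid_search.fit(X, y)

print(grid_search.best_params_)
best_model = grid_search.best_estimator_  # refit on all of X, y with the best alpha
print(best_model.coef_)
```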



# Chapter 3: Classification

## Performance measures
- Accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others)
- confusion matrix: Each row in a confusion matrix represents an actual class, while each column represents a predicted class
- recall can only go down when the threshold is increased
- ROC curve plots sensitivity (recall) versus 1 – specificity (specificity = true negative rate, which is the ratio of negative instances that are correctly classified as negative).
- Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise.
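
The confusion-matrix layout and the effect of the decision threshold can be checked on a handful of made-up scores. The labels, scores, and thresholds below are invented for illustration; rows of the matrix are actual classes and columns are predicted classes, as described above.

```python
import numpy as np

# Made-up ground truth and classifier scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

def confusion_counts(y_true, y_pred):
    """Rows = actual class (0, 1), columns = predicted class (0, 1)."""
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return np.array([[tn, fp], [fn, tp]])

def precision_recall(threshold):
    y_pred = (scores >= threshold).astype(int)
    (tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

for threshold in [0.2, 0.5, 0.75, 0.85]:
    print(threshold, precision_recall(threshold))
```

On this data recall never increases as the threshold rises, while precision goes up, then down, then up again, which is why the precision curve tends to be bumpier than the recall curve.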

## Multiclass Classification

- The one-versus-all (OvA) strategy trains one binary classifier per class, each deciding whether an instance belongs to that class or not. Another strategy is to train a binary classifier for every pair of classes (for digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on). This is called the **one-versus-one** (OvO) strategy. If there are N classes, you need to train N × (N – 1) / 2 classifiers. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.
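
The classifier counts for the two strategies can be verified by enumerating the class pairs. The ten digit classes are used here only as an example.

```python
from itertools import combinations

classes = list(range(10))  # e.g. the digits 0 to 9

# OvO: one binary classifier per unordered pair of classes.
ovo_pairs = list(combinations(classes, 2))
n_ovo = len(ovo_pairs)

# OvA: one binary classifier per class.
n_ova = len(classes)

print(n_ovo)  # 45, i.e. 10 * 9 / 2
print(n_ova)  # 10
```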



# Chapter 4: Training Models

## Gradient descent
- When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.
### Batch gradient descent
- When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), Batch Gradient Descent with a fixed learning rate will eventually converge to the optimal solution, but you may have to wait a while: it can take $O(1/\epsilon)$ iterations to reach the optimum within a range of $\epsilon$, depending on the shape of the cost function.
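
A minimal NumPy sketch of Batch Gradient Descent on the MSE cost of a linear model. The synthetic data, the learning rate `eta = 0.1`, and the 1,000 iterations are assumptions chosen for this example; the result is compared against the closed-form least-squares solution.

```python
import numpy as np

# Synthetic data (assumed): y = 4 + 3x + noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=m)
X_b = np.c_[np.ones(m), X]  # add the bias feature x0 = 1

eta = 0.1            # fixed learning rate
n_iterations = 1000
theta = np.zeros(2)

for _ in range(n_iterations):
    # Gradient of the MSE computed over the full training set at every step.
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients

theta_exact = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(theta, theta_exact)
```

Because every step uses all m instances, the path to the minimum is smooth, but each iteration becomes expensive on large training sets.
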
### Stochastic gradient descent
- randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum.
- The function that determines the learning rate at each iteration is called the **learning schedule**. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.
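
Stochastic Gradient Descent with a simple learning schedule might look like the sketch below. The schedule `t0 / (t + t1)` and the values `t0 = 5`, `t1 = 50` are assumptions for illustration, as are the synthetic data and the 50 epochs.

```python
import numpy as np

# Synthetic data (assumed): y = 4 + 3x + noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.1, size=m)
X_b = np.c_[np.ones(m), X]  # add the bias feature x0 = 1

t0, t1 = 5, 50  # learning schedule hyperparameters (illustrative values)

def learning_schedule(t):
    """Learning rate that starts at t0 / t1 and shrinks as t grows."""
    return t0 / (t + t1)

n_epochs = 50
theta = rng.normal(size=2)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)              # pick one random instance
        xi, yi = X_b[idx], y[idx]
        gradient = 2 * xi * (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta -= eta * gradient

print(theta)  # should end up near the true parameters (4, 3)
```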



- The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
- By convention we iterate by rounds of m iterations, where m is the number of training instances; each round is called an **epoch**.

## Learning curves
- bias: this part of the generalization error is due to wrong assumptions
- variance: This part is due to the model’s excessive sensitivity to small variations in the training data.
- irreducible error: this part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data

## Regularized linear models
- ridge regression: it is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.
- lasso regression: an important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features
- **Elastic Net** is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression: $$J(\theta)=MSE(\theta)+r\alpha\sum_{i=1}^n|\theta_i|+\frac{1-r}{2}\alpha\sum_{i=1}^n\theta_i^2$$
- **Early stopping**: with Stochastic and Mini-batch Gradient Descent, the curves are not so smooth, and it may be hard to know whether you have reached the minimum or not. One solution is to stop only after the validation error has been above the minimum for some time (when you are confident that the model will not do any better), then roll back the model parameters to the point where the validation error was at a minimum.
   `sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None, learning_rate="constant", eta0=0.0005)`
   With warm_start=True, when the fit() method is called, it just continues training where it left off instead of restarting from scratch. Setting tol=None disables the built-in stopping criterion, so each call to fit() runs exactly one epoch.
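
The early stopping procedure described above can be sketched as a loop around a warm-started `SGDRegressor`. The synthetic data, the `patience` of 20 epochs, and the 1,000-epoch cap are assumptions for this example; `deepcopy` is used so that the saved model keeps its fitted weights.

```python
from copy import deepcopy

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (assumed for illustration), split into train and validation.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)

best_val_error = float("inf")
best_epoch, best_model = None, None
patience = 20                     # epochs to wait after the last improvement
epochs_without_improvement = 0

for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)  # one more epoch, continuing from the current weights
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_val_error:
        best_val_error, best_epoch = val_error, epoch
        best_model = deepcopy(sgd_reg)  # snapshot to roll back to
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break

print(best_epoch, best_val_error)
```

After the loop, `best_model` holds the parameters from the epoch with the lowest validation error, which is the rollback step described above.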



# Chapter 5: Support Vector Machines

## Linear SVM Classification
- think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called **large margin classification**.
- instances located on the edge of the street are called the **support vectors**
- SVMs are sensitive to the feature scales
- two main issues with hard margin classification:
   - it only works if the data is linearly separable
   - it is quite sensitive to outliers. 
- **soft margin classification**: find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side).
- If your SVM model is overfitting, you can try regularizing it by reducing C. With a low C value the margin is quite large, but many instances end up on the street; with a high C value the classifier makes fewer margin violations but ends up with a smaller margin.
- Besides the LinearSVC class, you could use the SVC class, using `SVC(kernel="linear", C=1)`, but it is much slower, especially with large training sets, so it is not recommended. Another option is to use the SGDClassifier class, with `SGDClassifier(loss="hinge", alpha=1/(m*C))`. This applies regular Stochastic Gradient Descent (see Chapter 4) to train a linear SVM classifier. It does not converge as fast as the LinearSVC class, but it can be useful to handle huge datasets that do not fit in memory (out-of-core training), or to handle online classification tasks.
- The LinearSVC class **regularizes the bias term**, so you should center the training set first by subtracting its mean. This is automatic if you scale the data using the StandardScaler. Moreover, make sure you set the loss hyperparameter to "hinge", as it is not the default value. Finally, for better performance you should set the dual hyperparameter to False, unless there are more features than training instances. Note that Scikit-Learn only accepts dual=False together with the default loss="squared_hinge"; with loss="hinge" the dual formulation must be used.
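
Putting the points above together, a scaled linear SVM might be set up as in this sketch. The choice of the iris dataset, the two petal features, and `C=1` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris()
X = iris.data[:, 2:]                # petal length, petal width
y = (iris.target == 2).astype(int)  # 1 if Iris virginica, else 0

svm_clf = Pipeline([
    ("scaler", StandardScaler()),   # also centers the data, which matters for LinearSVC
    ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
])
svm_clf.fit(X, y)

print(svm_clf.predict([[5.5, 1.7]]))  # class prediction for one new flower
print(svm_clf.score(X, y))            # training accuracy
```

Unlike Logistic Regression classifiers, this model outputs a class directly rather than a probability for each class.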

## Nonlinear SVM Classification
- One approach to handling nonlinear datasets is to add more features, such as polynomial features
- Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.
- Parameter in SVC: the **hyperparameter coef0** controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.
- A common approach to find the right hyperparameter values is to use **grid search**. It is often faster to first do a very coarse grid search, then a finer grid search around the best values found.

### Adding Similarity Features
- **similarity function**: measures how much each instance resembles a particular landmark.
- Gaussian radial basis function: $\phi_{\gamma}(x, l)=\exp(-\gamma||x-l||^2)$
- how to choose the landmarks? The simplest approach is to **create a landmark at the location of each and every instance in the dataset**. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features). If your training set is very large, you end up with an equally large number of features.
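
The similarity features can be computed directly from the Gaussian RBF formula above. The instance `x = -1`, the landmarks at -2 and 1, and `gamma = 0.3` are example values; the second half uses a small random dataset to show the m × m shape obtained when every instance becomes a landmark.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """phi_gamma(x, l) = exp(-gamma * ||x - l||^2), computed along the last axis."""
    return np.exp(-gamma * np.sum((x - landmark) ** 2, axis=-1))

# One 1D instance compared against two landmarks.
x = np.array([-1.0])
landmarks = np.array([[-2.0], [1.0]])
features = gaussian_rbf(x, landmarks, gamma=0.3)
print(features)  # approximately 0.74 and 0.30

# One landmark per training instance: m instances, n features -> m x m features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
X_transformed = np.stack([gaussian_rbf(X, l, gamma=0.3) for l in X], axis=1)
print(X_transformed.shape)  # (5, 5)
```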



- Gaussian RBF kernel: increasing gamma makes the bell-shaped curve narrower, and as a result each instance’s range of influence is smaller: the decision boundary ends up being more irregular, wiggling around individual instances. Conversely, a small gamma value makes the bell-shaped curve wider, so instances have a larger range of influence, and the decision boundary ends up smoother.
- $\gamma$ acts like a regularization hyperparameter: if your model is overfitting, you should reduce it, and if it is underfitting, you should increase it (similar to the $C$ hyperparameter).
- how can you decide which kernel to use? 
   - As a rule of thumb, you should always try the linear kernel first (remember that LinearSVC is much faster than SVC(kernel="linear")), especially if the training set is very large or if it has plenty of features. 
   - If the training set is not too large, you should try the Gaussian RBF kernel as well; it works well in most cases. 
   - Then if you have spare time and computing power, you can also experiment with a few other kernels using cross-validation and grid search, especially if there are kernels specialized for your training set’s data structure.

|Class|Time complexity|Out-of-core support|Scaling required|Kernel trick|
| ---- | ---- | ---- | ---- | ---- |
|LinearSVC|O(m × n)|no|yes|no|
|SGDClassifier|O(m × n)|yes|yes|no|
|SVC|O(m² × n) to O(m³ × n)|no|yes|yes|

## SVM Regression
- instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to **fit as many instances as possible** on the street while limiting margin violations
- Adding more training instances within the margin does not affect the model’s predictions; thus, the model is said to be ϵ-insensitive.
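
The ϵ-insensitivity can be made concrete with the loss usually associated with SVM Regression, `max(0, |y - ŷ| - ϵ)`. The notes do not state this formula, so treat it as an assumption of this sketch, along with the made-up targets and predictions.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero for residuals inside the street of half-width epsilon, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.0, 2.0, 4.5])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)  # only the third instance (residual 1.0) lies off the street
```

Instances with zero loss do not pull on the model, which is why adding more of them inside the margin leaves the predictions unchanged.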

## Under the hood
- one way to train a hard margin linear SVM classifier is just to use an off-the-shelf Quadratic Programming (QP) solver
- The dual problem is faster to solve than the primal when the number of training instances is smaller than the number of features.

### Kernel trick
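- the essence of the kernel trick: a kernel computes the dot product $\phi(a)^T\phi(b)$ directly from the original vectors $a$ and $b$, without having to compute (or even know) the transformation $\phi$, which makes the whole process much more computationally efficient.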
- According to **Mercer’s theorem**, if a function $K(a, b)$ respects a few mathematical conditions called Mercer’s conditions ($K$ must be continuous, symmetric in its arguments so $K(a, b) = K(b, a)$, etc.), then there exists a function $\phi$ that maps a and b into another space (possibly with much higher dimensions) such that $K(a, b) = \phi(a)^T\phi(b)$. So you can use K as a kernel since you know $\phi$ exists, even if you don’t know what $\phi$ is.
- Note that some frequently used kernels (such as the Sigmoid kernel) don’t respect all of Mercer’s conditions, yet they generally work well in practice.
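
A quick numerical check of the identity $K(a, b) = \phi(a)^T\phi(b)$, using a second-degree polynomial kernel on 2D vectors as the example. The mapping `phi` below is the standard one for this kernel; the vectors `a` and `b` are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit second-degree polynomial mapping for a 2D vector."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Same dot product, computed in the original 2D space: K(a, b) = (a . b)^2."""
    return (a @ b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = phi(a) @ phi(b)   # transform first, then take the dot product
via_kernel = poly_kernel(a, b)  # no transformation needed
print(explicit, via_kernel)  # both approximately 121
```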





# Chapter 6: Decision Trees


