## 1.1.1 Ordinary Least Squares
  
$\underset{w}{\min}\, \|Xw - y\|_2^2$  
  
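OLS is easy to compute, but its coefficient estimates rely on the independence of the features. When features are correlated (multicollinearity), the estimates become highly sensitive to random errors in the observed targets.  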
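
As a rough illustration of the point above, the sketch below fits `LinearRegression` on synthetic data and then again after adding a nearly duplicated column. The data, coefficients, and noise levels are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: y = 1 + 2*x1 - 3*x2 plus a little noise.
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.01 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)
print("independent features:", ols.intercept_, ols.coef_)

# Append a near-copy of the first column so two features are almost collinear.
X_collinear = np.column_stack([X, X[:, 0] + 1e-6 * rng.normal(size=200)])
ols_collinear = LinearRegression().fit(X_collinear, y)

# The individual weights of the two collinear columns are poorly determined;
# only their sum is pinned down by the data.
print("collinear features:", ols_collinear.coef_)
print("sum of collinear weights:", ols_collinear.coef_[0] + ols_collinear.coef_[2])
```

With independent features the recovered weights stay close to the true ones. In the collinear case only the sum of the two tied weights is stable, which is the sensitivity the note refers to.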
  
## 1.1.2 Ridge Regression  
  
$\underset{w}{\min}\, \|Xw - y\|_2^2 + \alpha \|w\|_2^2$  
  
Ridge regression is linear regression with an L2 penalty on the coefficients, weighted by $\alpha$. It is often used to reduce overfitting.  
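
A small sketch of how the penalty strength affects the fit, using synthetic data and arbitrarily chosen `alpha` values (both are assumptions for illustration). It compares the norm of the coefficient vector for plain least squares and for `Ridge` at increasing `alpha`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Hypothetical small, noisy data set: only the first feature matters.
X = rng.normal(size=(30, 10))
y = X[:, 0] + 0.5 * rng.normal(size=30)

ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)
print("OLS   ||w|| =", ols_norm)

# Larger alpha means a stronger L2 penalty and smaller coefficients.
ridge_norms = {}
for alpha in [0.1, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    ridge_norms[alpha] = np.linalg.norm(ridge.coef_)
    print(f"Ridge alpha={alpha:<7} ||w|| =", ridge_norms[alpha])
```

The coefficient norm shrinks as `alpha` grows, which is the mechanism by which Ridge limits overfitting.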
  
## 1.1.3 Lasso   
  
$\underset{w}{\min}\, \frac{1}{2n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1$  
  
The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution is dependent.  
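
To make the sparsity concrete, here is a minimal sketch on synthetic data in which only 3 of 20 features carry signal. The data and the choice `alpha=0.1` are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical data: 20 features, but only the first three influence y.
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)

# The L1 penalty drives many coefficients to exactly zero.
n_nonzero = np.count_nonzero(lasso.coef_)
print(n_nonzero, "non-zero coefficients out of", X.shape[1])
print("indices kept:", np.flatnonzero(lasso.coef_))
```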
  
  
**What is the difference between Ridge and Lasso?**  
Source: https://qiita.com/nanairoGlasses/items/57515340a1bc24ffe445  
  
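> | Name | Regularization term | Characteristics |
> |---|---|---|
> | Lasso regression | L1 norm | Can remove unnecessary parameters (dimensions, features) |
> | Ridge regression | L2 norm | Can suppress overfitting |
> | Linear regression | None | Prone to overfitting |
>
> To find out which features influence the graph, Lasso is the better choice; to find correlations while suppressing overfitting (compared with plain least squares), Ridge is the better choice.  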
  
## 1.1.5 Elastic Net    
  
$\underset{w}{\min}\, \frac{1}{2n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \rho \|w\|_1 + \frac{\alpha(1-\rho)}{2} \|w\|_2^2$
  
ElasticNet is a linear regression model trained with both L1- and L2-norm regularization of the coefficients, with $\rho$ controlling the mix between the two. This combination allows for learning a sparse model where few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge.  
  
A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge’s stability under rotation.  
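
A short sketch of the L1/L2 trade-off, assuming synthetic data and arbitrary parameter values. In scikit-learn's `ElasticNet`, the mixing parameter $\rho$ from the formula above is called `l1_ratio`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Hypothetical data: 20 features, only the first three influence y.
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.normal(size=100)

# l1_ratio close to 1 behaves like Lasso, close to 0 like Ridge.
nonzero_counts = {}
for l1_ratio in [0.1, 0.5, 0.9]:
    enet = ElasticNet(alpha=0.1, l1_ratio=l1_ratio).fit(X, y)
    nonzero_counts[l1_ratio] = np.count_nonzero(enet.coef_)
    print(f"l1_ratio={l1_ratio}: {nonzero_counts[l1_ratio]} non-zero coefficients")
```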
  
## 1.1.7 Least Angle Regression  
  
Least-angle regression (LARS) is a regression algorithm for high-dimensional data. LARS is similar to forward stepwise regression. At each step, it finds the predictor most correlated with the response. When there are multiple predictors having equal correlation, instead of continuing along the same predictor, it proceeds in a direction equiangular between the predictors.  
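
The forward-selection behaviour can be seen with scikit-learn's `Lars` estimator. This is a sketch on made-up data, stopping the path after three predictors via `n_nonzero_coefs=3` (an arbitrary choice for the example).

```python
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(0)

# Hypothetical data: 20 features, only the first three influence y.
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [5.0, -4.0, 3.0]
y = X @ true_w + 0.1 * rng.normal(size=200)

# Stop the LARS path once three predictors have entered the model.
lars = Lars(n_nonzero_coefs=3).fit(X, y)

# active_ lists the predictors in the order they entered the model.
print("predictors in order of entry:", list(lars.active_))
print("non-zero coefficient indices:", np.flatnonzero(lars.coef_))
```

Predictors enter one at a time, starting with the one most correlated with the response, so the three informative features should be selected before any of the noise features.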
  
## 1.1.9 Orthogonal Matching Pursuit (OMP)  
  
## 1.1.10 Bayesian Regression  
  
## 1.1.11. Logistic regression  
  
Logistic regression, despite its name, is a linear model for classification rather than regression. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.  
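
A minimal sketch of the "linear model plus logistic function" idea on a toy two-class problem (the data is synthetic). It checks that, in the binary case, the class-1 probability returned by `predict_proba` matches the logistic function applied to the linear score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical two-class data: two Gaussian blobs in 2-D.
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# Linear score z = w.x + b, squashed to a probability by the logistic function.
z = X @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

proba = clf.predict_proba(X)  # columns: P(class 0), P(class 1)
accuracy = clf.score(X, y)
print("max difference from manual sigmoid:", np.max(np.abs(proba[:, 1] - manual_proba)))
print("training accuracy:", accuracy)
```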
  
## 1.1.14. Passive Aggressive Algorithms  
  
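Passive Aggressive algorithms are a family of online algorithms for large-scale learning. Like the Perceptron they need no learning rate, but unlike it they include a regularization parameter `C`. They are often used for text classification.  
[Super Introduction to Machine Learning II: Learn the PA method, also used in Gmail's Priority Inbox, in 30 minutes!](http://d.hatena.ne.jp/echizen_tm/20110120/1295547335)  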
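
For intuition, below is a hand-written NumPy sketch of one passive-aggressive variant, the PA-I update for binary classification with labels in {-1, +1}. It is not the scikit-learn implementation; the function name `pa1_fit`, the toy data, and the values of `C` and `epochs` are assumptions for illustration.

```python
import numpy as np

def pa1_fit(X, y, C=1.0, epochs=5):
    """Hypothetical helper: PA-I updates for a linear classifier without bias."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Hinge loss on the current example.
            loss = max(0.0, 1.0 - y_i * (w @ x_i))
            sq_norm = x_i @ x_i
            if loss > 0.0 and sq_norm > 0.0:
                # "Aggressive": move just enough to fix the margin,
                # but never by more than C.
                tau = min(C, loss / sq_norm)
                w += tau * y_i * x_i
            # "Passive": if the margin is already satisfied, w is unchanged.
    return w

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y = np.array([-1] * 50 + [1] * 50)

w = pa1_fit(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
print("weights:", w, "training accuracy:", accuracy)
```

Each example is processed once per pass and then discarded, which is why this family suits streaming and large-scale text problems.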
  
## 1.1.15. Robustness regression: outliers and modeling errors
    
Robust regression aims to fit a regression model in the presence of corrupt data: either outliers, or error in the model.  
  
Brief explanation: [Robust Regression: Reducing Outlier Effects](https://jp.mathworks.com/help/stats/robust-regression-reduce-outlier-effects.html)  
  
For more information about robust regression in scikit-learn, read the [User Guide](http://scikit-learn.org/stable/modules/linear_model.html#different-scenario-and-useful-concepts).  
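
As a sketch of the effect of outliers, the example below compares ordinary least squares with `HuberRegressor`, one of scikit-learn's robust estimators, on synthetic data where 10% of the targets are corrupted. The data and the size of the corruption are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data on the line y = 2x + 1 with small noise.
X = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + 0.1 * rng.normal(size=100)

# Corrupt the last 10 targets with a large positive offset (outliers).
y[-10:] += 50.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=1000).fit(X, y)

ols_slope = ols.coef_[0]
huber_slope = huber.coef_[0]
print("true slope: 2.0")
print("OLS slope:  ", ols_slope)
print("Huber slope:", huber_slope)
```

The squared loss lets the corrupted points pull the OLS line strongly, while the Huber loss grows only linearly for large residuals and so limits their influence.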
  
## 1.1.16. Polynomial regression: extending linear models with basis functions
  
One common pattern within machine learning is to use linear models trained on nonlinear functions of the data. This approach maintains the generally fast performance of linear methods, while allowing them to fit a much wider range of data.  
  
For example, a simple linear regression can be extended by constructing polynomial features from the input features. In the standard linear regression case, you might have a model that looks like this for two-dimensional data:  
  
$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2$
  
If we want to fit a paraboloid to the data instead of a plane, we can combine the features in second-order polynomials, so that the model looks like this:  
  
$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$
  
The (sometimes surprising) observation is that this is still a linear model: to see this, imagine creating a new variable  
  
$z = [x_1, x_2, x_1 x_2, x_1^2, x_2^2]$
  
With this re-labeling of the data, our problem can be written  
  
$\hat{y}(w, x) = w_0 + w_1 z_1 + w_2 z_2 + w_3 z_3 + w_4 z_4 + w_5 z_5$
  
We see that the resulting polynomial regression is in the same class of linear models we’d considered above (i.e. the model is linear in w) and can be solved by the same techniques. By considering linear fits within a higher-dimensional space built with these basis functions, the model has the flexibility to fit a much broader range of data.    
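
The construction above can be sketched with `PolynomialFeatures` followed by `LinearRegression`. The target function and its coefficients are made up for the example. Note that `PolynomialFeatures` orders the degree-2 columns as $[x_1, x_2, x_1^2, x_1 x_2, x_2^2]$, which differs from the ordering of $z$ used in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical noiseless paraboloid in two variables.
X = rng.uniform(-1.0, 1.0, size=(200, 2))
x1, x2 = X[:, 0], X[:, 1]
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + 3.0 * x1**2 - 2.0 * x2**2

# Expand the inputs to degree-2 features, then fit an ordinary linear model.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X, y)

linear = model.named_steps["linearregression"]
# Coefficients follow the column order [x1, x2, x1^2, x1*x2, x2^2].
print("intercept:", linear.intercept_)
print("coefficients:", linear.coef_)
```

The model is still fitted by ordinary least squares; only the feature matrix has changed.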
  
**My question is: "What is the difference between Polynomial Regression and a Neural Network?"**  
  
[What is the difference between polynomial fitting and neural network training?](https://www.quora.com/What-is-the-difference-between-polynomial-fitting-and-neural-network-training)  
  
>Both approaches try to fit a non-linear function to the data. However, fitting a polynomial is typically much easier since polynomials have a much simpler form than neural networks:  
  
>In contrast, learning a neural network is a non-convex problem that can be much harder to solve, and typically requires good initialization and tuning. But the class of functions that can be learned effectively with a neural network (the “inductive bias”) is richer than polynomials, and many of the successes of neural network/deep learning are due to the rich hierarchical representations that these models are able to learn.  
  
>You could argue that polynomials can approach any function just like neural networks, however the invariances that neural networks learn to encode thanks to their hierarchical architecture make them much more effective for generalization in the context of statistical learning when the data has structure (e.g. in images/audio/text).    
  

## 1.5. Stochastic Gradient Descent
  
**Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression.** Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning.  
  
**SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing.**  
  
The advantages of Stochastic Gradient Descent are:  
  
- Efficiency.
- Ease of implementation (lots of opportunities for code tuning).
  
**The major advantage of SGD is its efficiency, which is basically linear in the number of training examples.**  
  
The disadvantages of Stochastic Gradient Descent include:  
  
- SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
- SGD is sensitive to feature scaling.

**SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000). For other problems we recommend Ridge, Lasso, or ElasticNet.**  
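
A minimal sketch of `SGDRegressor` on a larger synthetic problem. Because SGD is sensitive to feature scaling, the features are standardized first with `StandardScaler`. The data, feature scales, and hyperparameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 20,000 samples whose features live on very different scales.
scales = np.array([1.0, 100.0, 0.01])
X = rng.normal(size=(20000, 3)) * scales
true_w = np.array([2.0, 0.03, 150.0])
y = X @ true_w + 0.1 * rng.normal(size=20000)

# Standardize the features, then fit a linear model with SGD.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X, y)

r2 = model.score(X, y)
print("R^2 on the training data:", r2)
```

The number of iterations (`max_iter`), the stopping tolerance (`tol`), and the regularization strength (`alpha`, left at its default here) are the kind of hyperparameters listed among the disadvantages above.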