# How to prevent Overfitting:

- Regularization (Lasso & Ridge): regularization means adding a penalty (an extra error term) to the loss. It prevents overfitting by increasing bias slightly and reducing variance.
- Training on more data, and validating with cross-validation
- Reducing the number of parameters

# REGULARIZATION : penalized regression

**Regularization is a technique used in machine learning to prevent overfitting, which occurs when a model becomes too complex and captures noise in the data rather than the underlying patterns**. Regularization technique **adds a penalty term to the model's loss function**, which discourages the model from fitting the noise in the data and encourages it to focus on the most important patterns.

Regularization can be used with linear regression, logistic regression, or neural network training.

By "constraining" or "regularizing" the model, regularization helps to keep the model simple and prevent overfitting the data. 

It allows you to train a model that is simpler, more interpretable and generalizes better to new unseen data. This can help to improve the model's performance, and make it more robust to changes in the data.

There are three main types of regularization : 

- L1 Regularization - Lasso (Laplace)
- L2 Regularization - Ridge (Gauss)
- L1 and L2 Combined - Elastic Net

**Lasso** adds a penalty equal to the sum of the absolute values of the coefficients, which limits the size of the coefficients in the regression equation. This can yield sparse models in which some coefficients become exactly zero. If we have too many features, L1 is the better choice because it can eliminate some of them. (We add lambda times the absolute value of the slope to the existing error, which lowers the variance.)

**Ridge** also adds a penalty term, but here the penalty equals the sum of the squares of the coefficients. All coefficients are shrunk toward zero, but none is necessarily eliminated entirely. If we think all the features are important and want to keep them, we use Ridge (Gauss, L2) regularization. (We add lambda times the squared slope, i.e. the squared feature coefficients, to the existing error in order to reduce the variance.)

**Elastic net** combines L1 and L2 and adds an alpha parameter that sets the ratio between them. Alpha must lie between 0 and 1: when alpha equals 1, Elastic Net is identical to Lasso, and when alpha equals 0, it is identical to Ridge. (In Scikit-Learn this mixing ratio is the `l1_ratio` parameter of `ElasticNet`; `alpha` there denotes the overall penalty strength.)
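
The difference between the three penalties can be seen in a small experiment. The sketch below is only an illustration on synthetic data (the data, the seed, and the penalty strength of 0.5 are assumptions, not values from these notes). It fits all three models and counts how many coefficients each one sets exactly to zero.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data (assumed for illustration): 10 features,
# only the first 3 actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=1.0, size=200)

models = {
    "lasso": Lasso(alpha=0.5),                           # L1 penalty
    "ridge": Ridge(alpha=0.5),                           # L2 penalty
    "elastic_net": ElasticNet(alpha=0.5, l1_ratio=0.5),  # blend of L1 and L2
}

zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    # Count coefficients that were shrunk exactly to zero.
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(name, np.round(model.coef_, 2), "zeros:", zero_counts[name])
```

On data like this, Lasso typically zeroes out the uninformative features, while Ridge keeps a small nonzero weight on every feature.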

# Multicollinearity

**Multicollinearity occurs when multiple predictor variables are highly correlated with one another; it is a common issue in regression analysis**. Because it becomes difficult for the model to determine the individual contribution of each predictor to the outcome, it causes problems for some algorithms. Methods such as linear regression and logistic regression, which assume that the predictors are independent of one another, are vulnerable to multicollinearity and may produce misleading results.

**Effect of Multicollinearity:**

A small amount of multicollinearity may not be a problem. Severe multicollinearity, however, inflates the variance (standard error) of the regression coefficients and makes them unstable. Coefficients may change sign (making the results implausible), or predictors may appear insignificant even though they should be significant.

Although multicollinearity usually does not have a major impact on the model's overall accuracy, it does affect the variance of the estimates and the quality of any analysis of the individual independent variables. In other words, the estimated effect of each feature on the model becomes unreliable.

With multicollinearity, the R2 score can be high even though none of the individual beta weights is statistically significant. (For example, despite a high R2, the coefficients of TV and radio are low, and neither comes close to explaining the dependent variable on its own.)

Here is how it looks on a heatmap:

![image.png](attachment:3f487806-f9bb-46a2-9ba0-10d8dc0e3d2d.png)

Feature coefficients before regularization or drop:

![image.png](attachment:e53c6e98-d2be-4317-a49a-39ab17e58e44.png)

Feature coefficients after regularization or drop:

![image.png](attachment:812db4ee-b459-4c63-88b3-740d57d9f236.png)

**What we can do about multicollinearity**

- check whether one of our predictor variables is a duplicate
- remove a redundant variable
- aggregate similar variables
- increase sample size
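
The first two remedies can be sketched with a plain correlation matrix, which holds the same numbers a heatmap displays. This is a minimal illustration: the synthetic `tv`, `radio`, and `newspaper` columns and the 0.9 cut-off are assumptions chosen for the example, not values taken from the dataset in these notes.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
tv = rng.normal(size=n)
radio = tv + rng.normal(scale=0.1, size=n)  # nearly a copy of tv
newspaper = rng.normal(size=n)              # unrelated to the others
names = ["tv", "radio", "newspaper"]
X = np.column_stack([tv, radio, newspaper])

# Absolute pairwise correlations between the columns.
corr = np.abs(np.corrcoef(X, rowvar=False))

threshold = 0.9  # arbitrary cut-off chosen for this sketch
to_drop = []
for j, name in enumerate(names):
    # Compare column j with every earlier column that is still kept.
    kept_before = [i for i in range(j) if names[i] not in to_drop]
    if any(corr[i, j] > threshold for i in kept_before):
        to_drop.append(name)

keep_idx = [j for j, name in enumerate(names) if name not in to_drop]
X_reduced = X[:, keep_idx]
print("dropped:", to_drop, "remaining shape:", X_reduced.shape)
```

Which column of a correlated pair to drop is a judgment call; this sketch simply keeps the one that appears first.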

# Feature Scaling

Feature Scaling is a technique used to standardize the range of independent variables or features of a dataset. It is performed during data pre-processing to handle variations in the magnitude or units of the data. **Without feature scaling, a machine learning algorithm may treat larger values as more important and smaller values as less important, regardless of the unit of measurement.**

Performing feature scaling can improve the performance and convergence of machine learning algorithms such as gradient descent. By standardizing the scale of the features, it ensures that the optimization process is not unduly influenced by features with larger ranges, and the algorithm can converge faster.

Additionally, some machine learning models that rely on distance metrics, like K-Nearest Neighbors, benefit from feature scaling to perform well. Even if it's not strictly necessary, feature scaling is a best practice, as it's relatively fast to compute and doesn't harm other algorithms.

Why scaling matters:

- Gradient descent: scaling is important for the gradient descent algorithm to run efficiently and quickly; it saves time.
- Model coefficients: if the features are scaled, we can reasonably say that the feature with the largest coefficient (in absolute value) has the strongest effect. Without scaling, this cannot be stated clearly.
- Distance-based algorithms: scaling is essential for algorithms that rely on distances to work well.

Warning:

- If the model was trained on scaled data, any test or unseen data must be scaled in the same way before it is used for prediction.
- If we are not sure whether scaling will help, we should still apply it.

**Techniques to perform Feature Scaling**

**The two most important ones:**

- Min-Max Normalization (min-max scaler): this technique re-scales each feature so that its values lie between 0 and 1.
- Standardization: a very effective technique that re-scales a feature so that it has a mean of 0 and a variance of 1 (equivalently, a standard deviation of 1). With this z-score normalization, about 99.7% of the values of an approximately normally distributed feature fall in the range -3 to +3.

There is also the Robust scaler, which is generally preferred when the data contains outliers. That said, the presence of outliers does not automatically mean it must be used.

We decide which one to choose by trial and error, trying all of them.
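
As a rough illustration of how the three scalers differ, the sketch below applies each of them to a single made-up feature that contains one outlier. The toy values are assumptions chosen so the effect is easy to see.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy feature with one outlier (values are made up for illustration).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_minmax = MinMaxScaler().fit_transform(X)      # rescaled into [0, 1]
X_standard = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1
X_robust = RobustScaler().fit_transform(X)      # centered on the median, scaled by the IQR

for name, arr in [("min-max", X_minmax), ("standard", X_standard), ("robust", X_robust)]:
    print(name, np.round(arr.ravel(), 3))
```

With min-max scaling the outlier pushes the four ordinary values close to 0, whereas the robust scaler keeps them spread around the median.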

![image.png](attachment:62bcce4e-1606-47ee-8912-c4869c976deb.png)

# Data Leakage

**What is data leakage?**

Data leakage occurs when a model is trained with information that will not be available when it is used for making predictions. This can cause the model to perform poorly on new data, resulting in inaccurate or misleading predictions. To avoid data leakage, it is essential to take a careful approach to preprocessing the training data, and ensure that the model is trained only on information that will be present when the model is used in practice.

Leakage can occur from various sources, from data collection and feature engineering to data partitioning and model validation. Even experienced data scientists can inadvertently introduce leaks, which can cause them to overestimate the performance of the model they deploy. Careful monitoring and attention to detail are key to avoiding data leakage, thus ensuring that the model will perform well on unseen data.

**Attention: In Scikit-learn, when we call the .fit() method for feature scaling, we must fit only on the training data. If we fit on both the training and test data, there will be data leakage.**

![image.png](attachment:78588c33-bdc8-48c9-9e28-19246502f197.png)

**What to do to prevent data leakage:**

- Normalization should be done on the training set
- The normalization parameters from the training set should be used on the validation and test sets
- Never normalize the test set using its own normalization parameters
- do not re-fit the scaler on the test data the way it is fitted on the training data (use `fit_transform` on the training set, but only `transform` on the test set)
- never normalize the whole dataset before splitting it into train and test sets
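
The rules above can be sketched as follows. This is a minimal example on random data (the data, the split size, and the seeds are assumptions): the split happens first, the scaler learns its mean and standard deviation from the training set only, and the test set is transformed with those same statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data standing in for a real dataset (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = rng.normal(size=100)

# 1) Split first, before any scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2) Fit the scaler on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3) Reuse the training statistics on the test set; never call fit here.
X_test_scaled = scaler.transform(X_test)

print("train means used for scaling:", np.round(scaler.mean_, 2))
```

Wrapping the scaler and the model in a `sklearn.pipeline.Pipeline` is another common way to get the same behavior, especially inside cross-validation.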

# Cross-Validation and Grid Search

# Cross Validation

Is there a way to achieve both goals: training on all of the data and evaluating on all of the data? We can achieve this with cross-validation.

Machine learning algorithms can be easily implemented with ready-made libraries. However, even though algorithms are easily implemented and models can be created, we might not get the expected results without verifying the model. **It's especially important if we have a noisy dataset or small dataset.**

**Cross-validation is one of the most effective methods used to validate a model. It splits the dataset into parts that are used in turn to train and to evaluate the model**. It gives us a more robust estimate of a model's performance **by partitioning the data into multiple subsets, called 'folds',** and training and evaluating the model on different combinations of them. The process is repeated once per fold, each time with a different fold held out for validation. This helps to detect overfitting and gives a more reliable estimate of performance on new, unseen data.

![image.png](attachment:155faf12-654b-4455-a4b1-3d25b6230052.png)

In the Cross-Validation process, we implement the following steps:

- We separate a sample dataset as a validation set.
- We train the model using the rest of the dataset (training set).
- With the validation set, we measure the performance of the model.
- We can do this process k times. As seen in the picture above, the model will do this with different validation (test) data each time. 
- We collect the performance value from each round, then take their average.
- If the algorithm gives a satisfactory result with the test set, we can now use this algorithm. Otherwise, try cross-validation on another algorithm.

**Tips:** Cross-validation has to train and test the model over and over again (k times), so additional data processing load and time are required. Although this is not a problem for small and medium volumes of data with short training and testing, it can cause time difficulties in large volume datasets.
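
The steps above map onto Scikit-Learn's `KFold` and `cross_val_score`. The sketch below is a minimal example on synthetic regression data (the data, k = 5, and the R2 metric are assumptions chosen for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=150)

# k = 5 folds: each fold serves once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One score per fold; the model is re-trained from scratch each time.
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
mean_score = scores.mean()

print("R2 per fold:", np.round(scores, 3))
print("mean R2:", round(mean_score, 3))
```

A large spread between the fold scores is itself useful information: it suggests the performance estimate depends heavily on which rows end up in the validation fold.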

# Grid Search

Grid search is the process of tuning the model, that is, hyperparameter optimization: determining which hyperparameter values work best, finding the ideal lambda, and keeping the model from drifting into underfitting.

Grid search is an optimization technique for finding the best set of hyperparameters for a machine learning model. You provide a list of candidate values for each parameter, and the algorithm tests every possible combination to find the best one. It automates the process of experimenting with different parameter values, so you do not have to try each combination manually. The best combination is the one that gives the best cross-validated score (for example, accuracy) for the model. It is like trying different recipes for a cake and tasting which one is the most delicious: grid search takes the guesswork out of finding a good set of parameters and helps you get the best performance available within the grid.

![image.png](attachment:15f61a60-e166-4d73-8d43-8ea368fe14df.png)

There are **two types of hyperparameter search in the Scikit-Learn library:**

- Exhaustive Grid Search
- Randomized Parameter Optimization

We will focus on Exhaustive Grid Search. However, when there are too many hyperparameter combinations to run and we lack the time and resources, we can use Randomized Parameter Optimization instead.
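
Both search types are available as `GridSearchCV` and `RandomizedSearchCV`. The sketch below compares them on a Ridge model; the synthetic data, the candidate values in `param_grid`, and `n_iter=4` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=1.0, size=120)

# Candidate values: 5 alphas x 2 intercept settings = 10 combinations.
param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0, 100.0],
    "fit_intercept": [True, False],
}

# Exhaustive search: every combination is cross-validated.
grid = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
grid.fit(X, y)

# Randomized search: only n_iter sampled combinations are tried.
random_search = RandomizedSearchCV(
    Ridge(), param_grid, n_iter=4, cv=5,
    scoring="neg_root_mean_squared_error", random_state=0,
)
random_search.fit(X, y)

print("grid search best:", grid.best_params_)
print("randomized search best:", random_search.best_params_)
```

The randomized search fits fewer models, so it is cheaper, but it may miss the combination the exhaustive search would have found.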

# Ridge Regression

Ridge regression is a widely used technique in machine learning for improving the stability and accuracy of a model by preventing overfitting. The basic idea is to add a penalty term to the quantity that ordinary least squares regression minimizes, the residual sum of squares (RSS). The penalty term, also known as the L2 regularization term, is lambda times the sum of the squared coefficients. By adding this term, the model is forced to find a balance between fitting the training data well and keeping the coefficients small, which reduces the risk of overfitting. This method of regularization has been in use since the 1970s and is a trusted technique for improving the performance of machine learning models.

(The main idea behind Ridge regression is to find a new line that doesn't fit the training data as closely as the overfitting line does. In other words, we introduce a small amount of bias into how the new line is fit to the data, and in return we get a significant drop in variance (overfitting means high variance). When sample sizes are relatively small, Ridge regression can improve predictions on new data (i.e. reduce variance) by making the predictions less sensitive to the training data. This is done by adding the Ridge regression penalty to the quantity being minimized.)

We will implement ridge regression with Scikit-Learn. However, we have to keep in mind a few important things when coding with Scikit-Learn:

- The first important note is that the Scikit-Learn class for ridge regression refers to lambda as alpha. (We write alpha instead of lambda because `lambda` is a reserved keyword in Python, used for anonymous functions.)
- Another important note is that Scikit-Learn uses a "scorer object" (based on a metric such as root mean squared error (RMSE) or mean absolute error (MAE)) to select the best parameter value with cross-validation.
- Normally, for the error metrics of a regression task, lower is better. However, all scorer objects prefixed with "neg", such as 'neg_mean_absolute_error' and 'neg_root_mean_squared_error', return the negated metric, so they follow the convention that higher return values are better than lower ones.
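
The three notes above can be seen together in a short sketch. It assumes synthetic data and an arbitrary grid of candidate alphas: `cross_val_score` returns negated RMSE values, and `RidgeCV` picks the alpha (lambda) with the highest, i.e. least negative, score.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=100)

# `alpha` is Scikit-Learn's name for lambda.
neg_rmse = cross_val_score(
    Ridge(alpha=1.0), X, y, cv=5, scoring="neg_root_mean_squared_error"
)
rmse = -neg_rmse  # flip the sign to read the values as ordinary RMSE

# Let cross-validation choose alpha from a grid of candidates.
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas, cv=5, scoring="neg_root_mean_squared_error")
ridge_cv.fit(X, y)
best_alpha = ridge_cv.alpha_

print("negated RMSE per fold:", np.round(neg_rmse, 3))
print("RMSE per fold:", np.round(rmse, 3))
print("selected alpha:", best_alpha)
```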

![image.png](attachment:c9855cc4-2661-428b-9b9b-6b940eccbc77.png)

Lambda is a hyperparameter. Parameters that we can set from the outside are hyperparameters: the values the model finds on its own are ordinary parameters (the coefficients), while the values we set ourselves are hyperparameters. With lambda we control how much penalty (extra error) we add to the model.

![image.png](attachment:5924f2b5-6e73-40c3-bd02-9c4e5ce8264e.png)

When lambda is 0, the model is ordinary linear regression; a value of 0 changes nothing.
As lambda rises to 1 and beyond, regularization increases and the fitted line changes. Raising it involves a trade-off, so it has to be increased in a balanced way to reach the optimum. The best lambda is found by hyperparameter tuning.

As we add penalty to the model, the coefficients fall, some a little and some a lot. All the slopes shrink, yet every feature still keeps a coefficient: Ridge does not set any coefficient to exactly zero.
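
This shrinking behavior can be checked numerically. The sketch below uses synthetic data and an arbitrary list of alpha values (both are assumptions). It uses plain `LinearRegression` as the lambda = 0 baseline and tracks the overall size (L2 norm) of the Ridge coefficients as alpha grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=50)

# Baseline: ordinary least squares, i.e. no penalty at all.
ols_coef = LinearRegression().fit(X, y).coef_
print("OLS  ", np.round(ols_coef, 3))

alphas = [1.0, 10.0, 100.0, 1000.0]
coefs = {}
norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    coefs[alpha] = coef
    norms.append(float(np.linalg.norm(coef)))  # overall coefficient size
    print(f"alpha={alpha:<7}", np.round(coef, 3))
```

The coefficient vector gets smaller with every increase in alpha, but even at the largest value every feature keeps a nonzero weight.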

![image.png](attachment:d581a419-8660-4c63-8fb0-8f33136b0c52.png)

# Lasso Regression

**Lasso (Least Absolute Shrinkage and Selection Operator) regression is a method for linear modeling that incorporates feature selection and regularization to prevent overfitting.**

Like traditional linear regression, Lasso regression minimizes the residual sum of squares, but **it adds a penalty term, lambda times the sum of the absolute values of the coefficients, that reduces the magnitude of the coefficients associated with less important features**. This leads to a sparse model in which many coefficients can be set exactly to zero, indicating that those features are not important for the prediction.

Lasso regression is particularly **useful in cases of high multicollinearity or when there are many correlated features in the dataset, as it allows identifying the most relevant features while also reducing the complexity of the model**.

The best model is selected by cross-validation. (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html?highlight=lassocv)

As in Ridge regression, lambda can take any value from 0 to infinity and is determined by cross-validation. Like Ridge, Lasso results in a bit more bias but less variance than least squares, so it reduces overfitting.

Since Lasso regression can exclude useless variables from the equation, it is a little better than Ridge regression at reducing the variance in models that contain many useless variables. In contrast, Ridge regression tends to do a little better when most variables are useful.
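
A minimal sketch of this feature selection with `LassoCV`, which chooses lambda (called `alpha` in Scikit-Learn) by cross-validation. The synthetic data, in which only features 0, 3, and 7 matter, is an assumption made for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data (assumed for illustration): 8 features,
# only features 0, 3 and 7 influence the target.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 8))
true_coef = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.5])
y = X @ true_coef + rng.normal(scale=1.0, size=200)

# LassoCV builds its own grid of alphas and cross-validates each one.
lasso_cv = LassoCV(cv=5, random_state=0).fit(X, y)

selected = np.flatnonzero(lasso_cv.coef_)  # indices of nonzero coefficients
print("chosen alpha:", lasso_cv.alpha_)
print("coefficients:", np.round(lasso_cv.coef_, 3))
print("selected features:", selected)
```

The informative features are kept. Whether every uninformative one is removed depends on the alpha that cross-validation selects, so a few small nonzero weights may remain.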

![image.png](attachment:48d57155-0129-4cfd-bc08-cebc8009c7e4.png)

# Elastic-Net Regression

Elastic-Net stands between Ridge Regression and Lasso Regression. Its regularization term is a simple blend of the Ridge and Lasso regularization terms, and you control the blend proportion with alpha. When alpha = 0, Elastic Net behaves the same as Ridge Regression, and when alpha = 1, it is identical to Lasso Regression.

![image.png](attachment:f4059be5-29c4-4f54-9a59-4dc4eb64cb10.png)

**So which one should you choose: plain Linear Regression, Ridge, Lasso, or Elastic-Net? It is almost always preferable to have at least a little regularization, so in general you should avoid plain Linear Regression. Ridge is a good default. However, if you suspect that only a few features are useful, you should prefer Lasso or Elastic Net, since they tend to reduce the weights of the useless features down to zero. In general, Elastic-Net is preferred over Lasso because Lasso may behave erratically when the number of features exceeds the number of training instances or when several features are strongly correlated.**

For example, some models have millions of parameters, and it is very hard to tell which are unnecessary and which are important. In such a case, regularization is done with Elastic Net.
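
A minimal sketch with `ElasticNetCV`, which cross-validates both the penalty strength and the blend. Note the naming: in Scikit-Learn the blend proportion described above is the `l1_ratio` parameter, while `alpha` is the overall penalty strength (lambda). The synthetic data with one strongly correlated feature pair and the candidate `l1_ratios` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data (assumed for illustration) with a strongly correlated pair.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=200)  # feature 1 ~ feature 0
y = (2.0 * X[:, 0] + 2.0 * X[:, 1] - 1.0 * X[:, 2]
     + rng.normal(scale=1.0, size=200))

# Candidate blends: values near 0 lean toward Ridge, 1.0 is pure Lasso.
l1_ratios = [0.1, 0.5, 0.9, 1.0]
enet_cv = ElasticNetCV(l1_ratio=l1_ratios, cv=5, random_state=0).fit(X, y)

print("chosen l1_ratio:", enet_cv.l1_ratio_)
print("chosen alpha (lambda):", enet_cv.alpha_)
print("coefficients:", np.round(enet_cv.coef_, 3))
```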

Scikit official: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html?highlight=lassocv

Conclusion: "Regularization forces the learning algorithm to build a less complex model. In practice, that often leads to slightly higher bias but significantly reduces the variance."