# Data Splits and Polynomial Regression

There are a few best practices for avoiding overfitting in regression models. One is to split your data into training and test sets. Another is to use cross-validation. This module also introduces polynomial features, which let a linear model capture nonlinear effects at the cost of added model complexity. The module walks you through the theoretical framework and a few hands-on examples of these practices.

## Learning Objectives

- Realize the importance of having a test set to avoid overfitting
- Practice using the `train_test_split` function to split your data into training and testing sets
- Recognize the trade-off between model complexity and prediction error
- Assess whether introducing polynomial features improves the error metrics of your linear regression

## Training and Test Splits

### Learning Goals

In this section, we will cover:
- Splitting data into training and testing samples
- Cross-validation approaches
- Model complexity vs. error

![Fitting Training and Test Data](./images/07_FittingTrainingAndTestData.jpg "Fitting Training and Test Data")



```python
# Import the train and test split function
from sklearn.model_selection import train_test_split

# Split the data and put 30% into the test set
train, test = train_test_split(data, test_size=0.3)

# Another way to split data: repeated random splits
from sklearn.model_selection import ShuffleSplit
```


### class `sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)`

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
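As a minimal sketch of the formula above, the following example fits a `StandardScaler` on a tiny hand-made training array (the arrays are assumptions for illustration) and checks the result against a manual computation. The scaler is fitted on the training rows only and then reused on the test rows, so the test set does not influence `u` or `s`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny illustrative arrays (assumed data, two features)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

# Learn u and s from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)  # reuses the training mean and std

# Manual z = (x - u) / s; StandardScaler uses the population std (ddof=0)
u = X_train.mean(axis=0)
s = X_train.std(axis=0)
manual = (X_train - u) / s
print(np.allclose(X_train_z, manual))
print(X_test_z)
```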


### class `sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)`

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by:

```python
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
```

where `min, max = feature_range`.

This transformation is often used as an alternative to zero mean, unit variance scaling.
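The min-max formula can be checked against `MinMaxScaler` directly. This is a small sketch on an assumed array, using a `feature_range` of `(-1, 1)` to show that the target range is configurable; the names `lo` and `hi` are hypothetical stand-ins for `min` and `max`, chosen to avoid shadowing the Python built-ins.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data (assumed): two features on different scales
X = np.array([[1.0, -5.0], [2.0, 0.0], [4.0, 5.0]])

lo, hi = -1, 1  # the requested feature_range
scaler = MinMaxScaler(feature_range=(lo, hi))
X_scaled = scaler.fit_transform(X)

# Manual version of the same transformation
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
manual = X_std * (hi - lo) + lo

print(X_scaled)
print(np.allclose(X_scaled, manual))
```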



### class `sklearn.preprocessing.MaxAbsScaler(*, copy=True)`

Scale each feature by its maximum absolute value.

This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
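The sparsity claim can be illustrated with a short sketch. The matrix below is an assumption for illustration; it is stored in SciPy's CSR sparse format so that the number of stored nonzero entries can be compared before and after scaling.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Illustrative sparse matrix (assumed data): zeros are not stored
X = sparse.csr_matrix(np.array([[0.0, -4.0], [2.0, 0.0], [-8.0, 2.0]]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

# Each column is divided by its largest absolute value
print(scaler.max_abs_)
print(X_scaled.toarray())

# Zeros stay zero because nothing is subtracted from the data
print(X_scaled.nnz == X.nnz)
```

Because the data are only divided and never shifted, every zero entry remains zero, which is why this scaler is a common choice for sparse inputs.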



## Polynomial Regression

### Learning Goals

In this section, we will cover:
- Extending linear regression
- Using polynomial features to capture nonlinear effects
- Other models that can be used for regression and classification


### Addition of Polynomial Features

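Capture higher-order features of the data by adding polynomial features.

The model is still a "linear regression" because the prediction is a linear combination of features with coefficients $\beta_j$; here the features are powers of $x$:

$y_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$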

Can also include variable interactions:

$y_\beta(x) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2$

How is the correct functional form chosen?
- Check the relationship of each variable with the outcome.
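The two equations above can be sketched by building the extra columns by hand and fitting an ordinary `LinearRegression`. The noise-free target functions and their coefficients are assumptions chosen for illustration; with no noise, the fit should recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed cubic target: y = 1 + 2x - 3x^2 + 0.5x^3
x = np.linspace(-2, 2, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.5 * x**3

# Treat x, x^2 and x^3 as three separate features of a linear model
X_cubic = np.column_stack([x, x**2, x**3])
cubic_model = LinearRegression().fit(X_cubic, y)
print(cubic_model.intercept_, cubic_model.coef_)

# Assumed interaction target: y = 1 + 2*x1 - x2 + 4*x1*x2
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=50)
x2 = rng.uniform(-1, 1, size=50)
y_inter = 1.0 + 2.0 * x1 - 1.0 * x2 + 4.0 * x1 * x2

# The product x1*x2 is just one more feature column
X_inter = np.column_stack([x1, x2, x1 * x2])
inter_model = LinearRegression().fit(X_inter, y_inter)
print(inter_model.intercept_, inter_model.coef_)
```

In both cases the model is nonlinear in the original variables but linear in the coefficients, so the usual least-squares machinery applies unchanged.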

### Enhancing the Linear Model

Adjusting the standard linear approach to regression by adding polynomial features
is one of many approaches to dealing with the fundamental problems:
- prediction
- interpretation

As we move into model evaluation, keep in mind that the same tools are useful for
evaluating a wide variety of regression and classification problems.

In addition to polynomial features, we will examine several other models,
many of which can be used for both regression and classification.

Some examples include:
- Logistic Regression
- K-Nearest Neighbors
- Decision Trees
- Support Vector Machines
- Random Forests
- Ensemble Methods
- Deep Learning Approaches

### Polynomial Features: The Syntax

Import the class containing the transformation method
```python
from sklearn.preprocessing import PolynomialFeatures
```

Create an instance of the class
```python
polyFeat = PolynomialFeatures(degree=2)
```

Fit the instance on the data, then transform the data to create the polynomial features
```python
polyFeat = polyFeat.fit(X_data)
X_poly = polyFeat.transform(X_data)
```
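To connect this syntax with the trade-off between model complexity and error, here is a hedged sketch that chains `PolynomialFeatures` and `LinearRegression` in a pipeline and compares several degrees on a held-out test set. The quadratic synthetic data, the chosen degrees, and the `random_state` values are assumptions for illustration; `include_bias=False` is used because `LinearRegression` already fits an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed data: a quadratic trend plus noise
rng = np.random.default_rng(0)
X_data = rng.uniform(-3, 3, size=(200, 1))
y_data = 0.5 * X_data[:, 0] ** 2 + X_data[:, 0] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.3, random_state=0
)

# Map each degree to its (train MSE, test MSE) pair
results = {}
for degree in (1, 2, 10):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)  # the expansion is learned on training rows only
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The test error is the number to watch: it typically drops once the model can represent the true curve and may rise again when extra terms start fitting noise.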