# Polynomial regression

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered a special case of multiple linear regression.
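
The "linear in the parameters" point can be made concrete with a small sketch. The data below is synthetic and noise-free (an assumption made purely for illustration, not part of the real estate dataset): a quadratic is recovered by ordinary least squares on a design matrix whose columns are 1, x and x².

```python
import numpy as np

# Synthetic, noise-free data (illustrative assumption): y = 1 + 2x - 0.5x^2
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2

# Design matrix with columns [1, x, x^2]. The model is nonlinear in x
# but linear in the coefficients that multiply each column.
A = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares, the same estimator used for multiple linear regression
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # intercept, linear and quadratic coefficients
```

The fitted coefficients match the ones used to generate the data, which is why a plain linear regression solver is enough once the polynomial columns have been built.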

The explanatory (independent) variables resulting from the polynomial expansion of the "baseline" variables are known as higher-degree terms. Such variables are also used in classification settings.[[1]](https://en.wikipedia.org/wiki/Polynomial_regression)

**Although polynomial regression is technically a special case of multiple linear regression, the interpretation of a fitted polynomial regression model requires a somewhat different perspective. It is often difficult to interpret the individual coefficients in a polynomial regression fit, since the underlying monomials can be highly correlated.**
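
As a rough illustration of why the monomials can be highly correlated, the sketch below uses a hypothetical positive-valued feature (not taken from the dataset) and compares the correlation between x and x² before and after centering x. Centering is shown only as one common way to reduce this correlation; it is not a step used in this notebook.

```python
import numpy as np

# Hypothetical positive-valued feature on an evenly spaced grid
x = np.linspace(1, 10, 100)

# Correlation between the raw feature and its square
corr_raw = np.corrcoef(x, x**2)[0, 1]

# Correlation after centering the feature around its mean
xc = x - x.mean()
corr_centered = np.corrcoef(xc, xc**2)[0, 1]

print(f"raw: {corr_raw:.3f}, centered: {corr_centered:.3f}")
```

On this symmetric grid the centered correlation is essentially zero, while the raw one is close to 1. With nearly collinear columns, individual coefficients can change a lot without changing the fitted curve much, which is what makes them hard to interpret.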

#### Import Libraries:

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Adding Dataset:

The dataset has 8 columns: a row index (`No`), six features (`X1` to `X6`), and the target `Y house price of unit area`. The features `X1` to `X6` are used to predict the house price per unit area.

In [None]:
df = pd.read_csv('../input/real-estate-price-prediction/Real estate.csv')

#### Data Overview:

In [None]:
df.head()

In [None]:
df.info()

#### EDA:

In [None]:
# Pairwise scatter plots of all features and the target
# (passing `vars` is equivalent to passing the same list as x_vars and y_vars)
sns.pairplot(data=df,
             vars=["X1 transaction date",
                   "X2 house age",
                   "X3 distance to the nearest MRT station",
                   "X4 number of convenience stores",
                   "X5 latitude",
                   "X6 longitude",
                   "Y house price of unit area"])

#### Determine Features And Label

In [None]:
# Features: drop the target and the row index 'No', which carries no predictive information
X = df.drop(['Y house price of unit area', 'No'], axis=1)
# Label:
y = df['Y house price of unit area']

#### Preprocessing:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
polynomial_converter = PolynomialFeatures(degree = 2, include_bias=False)

In [None]:
poly_features = polynomial_converter.fit_transform(X)

In [None]:
poly_features.shape

In [None]:
X.shape
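
The two shapes above differ because a degree-d expansion of n input columns produces every monomial of total degree 1 through d. A minimal sketch of the count, using only the standard library (the helper name `n_poly_features` is hypothetical, introduced here for illustration):

```python
from math import comb

def n_poly_features(n_inputs, degree, include_bias=False):
    """Number of monomials of total degree <= `degree` in `n_inputs` variables."""
    total = comb(n_inputs + degree, degree)  # includes the constant (bias) term
    return total if include_bias else total - 1

# Six input features (X1 to X6), degree 2, no bias column
print(n_poly_features(6, 2))  # 27
# The same six features at degree 9
print(n_poly_features(6, 9))  # 5004
```

With six inputs and `degree = 2`, the expansion consists of 6 linear terms, 6 squares and 15 pairwise products, for 27 columns in total. The count grows quickly with the degree, which is worth keeping in mind when higher degrees are tried later.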

#### Split Data to Train And Test:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

#### Train the Model:

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
polymodel = LinearRegression()

In [None]:
polymodel.fit(X_train, y_train)

#### Predicting Test Data:

In [None]:
y_pred = polymodel.predict(X_test)

#### Compare Test Values and Predictions

In [None]:
pd.DataFrame({'Y_Test':y_test,'Y_Pred': y_pred, 'Residuals':(y_test - y_pred)}).head()

#### Evaluating Model Performance:

In [None]:
from sklearn import metrics

In [None]:
MAE_Poly = metrics.mean_absolute_error(y_test, y_pred)
MSE_Poly = metrics.mean_squared_error(y_test, y_pred)
RMSE_Poly = np.sqrt(MSE_Poly)

In [None]:
pd.DataFrame([MAE_Poly,MSE_Poly,RMSE_Poly], index = ['MAE','MSE','RMSE'], columns = ['metrics'])
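
For reference, the three metrics can be computed by hand from the residuals. The sketch below uses a tiny made-up pair of arrays (`y_true` and `y_hat` are illustrative values, not results from the model above):

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

residuals = y_true - y_hat           # [1, 0, -2, -1]

mae = np.mean(np.abs(residuals))     # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(residuals**2)          # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                  # square root of the MSE

print(mae, mse, rmse)
```

MAE and RMSE are in the same units as the target, while MSE is in squared units. RMSE penalizes large residuals more heavily than MAE does.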

#### Adjusting Model Hyperparameters

In [None]:
# Train List of RMSE per degree
train_RMSE_List = []
# Test List of RMSE per degree
test_RMSE_List = []

for d in range(1, 10):
    # preprocessing
    # Create poly data set for degree d
    polynomial_converter = PolynomialFeatures(degree=d, include_bias=False)
    poly_features = polynomial_converter.fit_transform(X)
    
    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)
    
    # Train the Model 
    polymodel = LinearRegression()
    polymodel.fit(X_train,y_train)
    
    # Predicting
    y_train_pred = polymodel.predict(X_train)
    y_test_pred = polymodel.predict(X_test)
    
    # Evaluating
    # RMSE of Train set:
    train_RMSE = np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))
    # RMSE of Test set:
    test_RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))
    
    # Append the RMSE to the Train and Test List
    train_RMSE_List.append(train_RMSE)
    test_RMSE_List.append(test_RMSE)

In [None]:
train_RMSE_List

In [None]:
test_RMSE_List

In [None]:
# Plot only degrees 1 to 4 (the first four entries of each list)
plt.plot(range(1, 5), train_RMSE_List[:4], label='Train RMSE')
plt.plot(range(1, 5), test_RMSE_List[:4], label='Test RMSE')

plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE')

plt.legend()

##### According to the chart, the best polynomial degree is 2 or 3. The degree must be an integer, so intermediate values such as 2.5 are not valid choices.
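
The comparison above relies on a single train/test split, so the chosen degree can depend on how the rows happened to be divided. One possible alternative, sketched here on synthetic data (`X_demo` and `y_demo` are hypothetical, not the real estate dataset), is to score each degree with k-fold cross-validation using a scikit-learn pipeline. This is a sketch of a different technique, not the method used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a truly quadratic relationship (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(200, 2))
y_demo = (1 + X_demo[:, 0] ** 2 - X_demo[:, 0] * X_demo[:, 1]
          + rng.normal(scale=0.1, size=200))

cv = KFold(n_splits=5, shuffle=True, random_state=101)
cv_rmse = {}

for d in range(1, 6):
    # The pipeline rebuilds the polynomial features inside each fold
    model = make_pipeline(PolynomialFeatures(degree=d, include_bias=False),
                          LinearRegression())
    scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[d] = -scores.mean()  # scores are negated RMSE values

best_degree = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse, best_degree)
```

Averaging the RMSE over several folds gives a less noisy estimate per degree than one split does. To try this on the notebook's data, `X_demo` and `y_demo` would be replaced with `X` and `y`.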