#### Applying polynomial features
This project compares the generalisation performance of two linear models (LinearRegression and Ridge) on the Friedman #1 dataset generated with scikit-learn's `make_friedman1` function. Polynomial features are then applied to the chosen model to see how much its performance improves.

In [13]:
#Importing relevant libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

#Importing the make_friedman1 function from scikit-learn
from sklearn.datasets import make_friedman1

In [14]:
#Creating the dataset
X_F1, y_F1 = make_friedman1(n_samples = 12000, n_features = 25, random_state=0)
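
For context, the scikit-learn documentation gives the Friedman #1 target as the function below, with every input drawn uniformly from [0, 1]. Only the first five of the 25 features enter the formula; the remaining 20 are independent of the target. Because `noise` is left at its default of 0.0 in the call above, the noise term vanishes here.

$$
y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \text{noise} \cdot \mathcal{N}(0, 1)
$$

The sine interaction and the squared term are nonlinear in the inputs, so a model that is linear in the raw features can only approximate them.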

In [15]:
# Splitting the dataset into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X_F1, y_F1, random_state = 0)

#### Linear Regression

In [16]:
#Importing the Linear regression model
from sklearn.linear_model import LinearRegression

# model fitting
linreg = LinearRegression().fit(X_train,y_train)

# model evaluation using the R2 score
# (variables are named train_r2/test_r2 to avoid clashing with sklearn.metrics.r2_score)
# Training set
train_r2 = linreg.score(X_train,y_train)

# Test set
test_r2 = linreg.score(X_test,y_test)

print("R2 Score\n")
print("Training set: {:.2f} ".format(train_r2))
print("Test set: {:.2f}".format(test_r2))

R2 Score

Training set: 0.75 
Test set: 0.76


#### Ridge

In [17]:
# Importing the ridge regression model
from sklearn.linear_model import Ridge

# model fitting with a single, fixed regularisation strength (alpha = 5)

linrig = Ridge(alpha = 5).fit(X_train,y_train)

# model evaluation 
#Training set
scoree = linrig.score(X_train,y_train)

#Test set
scoree1 = linrig.score(X_test,y_test)

print("Training set score: {:.2f} ".format(scoree))
print("Test set score: {:.2f}".format(scoree1))

Training set score: 0.75 
Test set score: 0.76


Both models show similar generalisation performance (a test R2 of 0.76 each), with Ridge fitted at alpha = 5. Ridge will be used going forward because its alpha parameter provides a way to control model complexity.
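
The Ridge model above is fitted with one value, alpha = 5. A minimal sketch of how several values could be compared with cross-validation on the training split is shown below. The candidate grid `alphas` and the 5-fold setting are assumptions chosen for illustration, not values taken from the original analysis.

```python
from sklearn.datasets import make_friedman1
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Same data and split as in the cells above
X_F1, y_F1 = make_friedman1(n_samples=12000, n_features=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_F1, y_F1, random_state=0)

# Hypothetical grid of regularisation strengths (an assumption for this sketch)
alphas = [0.01, 0.1, 1, 5, 10, 100]

# Mean 5-fold cross-validated R2 on the training set for each alpha
cv_scores = {}
for alpha in alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5)
    cv_scores[alpha] = scores.mean()
    print("alpha = {:>6}: mean CV R2 = {:.3f}".format(alpha, cv_scores[alpha]))

# Pick the alpha with the highest mean cross-validated score
best_alpha = max(cv_scores, key=cv_scores.get)
print("Best alpha by CV: {}".format(best_alpha))
```

Selecting alpha on cross-validation folds of the training data keeps the test set untouched until the final evaluation.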

In [18]:
# Applying polynomial features

from sklearn.preprocessing import PolynomialFeatures

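# Expand the 25 original features into all polynomial terms up to degree 3
poly = PolynomialFeatures(degree = 3)
X_poly = poly.fit_transform(X_F1)

# Re-split with the same random_state so the rows match the earlier split
X_train,X_test,y_train,y_test = train_test_split(X_poly,y_F1, random_state = 0)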
    
linridge = Ridge(alpha = 5).fit(X_train,y_train)

# model evaluation
#Training set
score = linridge.score(X_train,y_train)

#Test set
score1 = linridge.score(X_test,y_test)
    
print("Training set score: {:.2f} ".format(score))
print("Test set score: {:.2f}".format(score1))

Training set score: 0.97 
Test set score: 0.96


As the scores show, adding degree-3 polynomial features raised the Ridge model's test R2 from 0.76 to 0.96, with a training R2 of 0.97.
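
To make the expansion concrete, the sketch below first prints the degree-2 terms for a single toy sample, then chains `PolynomialFeatures` and `Ridge` in a scikit-learn pipeline so that both steps are fitted on the training split only. The toy sample, the name `poly_ridge`, and the reduced `n_samples=3000` (used to keep the sketch quick to run) are assumptions for illustration, so the scores it prints will not match the ones reported above.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy illustration: the degree-2 expansion of one sample [a, b] = [2, 3]
# produces the columns [1, a, b, a^2, a*b, b^2]
toy = PolynomialFeatures(degree=2).fit_transform(np.array([[2, 3]]))
print(toy)

# Smaller sample than the notebook's 12000 rows (an assumption, to keep this light)
X_small, y_small = make_friedman1(n_samples=3000, n_features=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_small, y_small, random_state=0)

# Chain the expansion and the Ridge model; both are fitted on the training split
poly_ridge = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=5))
poly_ridge.fit(X_train, y_train)

# Number of columns after expanding 25 features to degree 3 (bias column included)
n_poly = poly_ridge.named_steps["polynomialfeatures"].n_output_features_
print("Number of polynomial features: {}".format(n_poly))

train_score = poly_ridge.score(X_train, y_train)
test_score = poly_ridge.score(X_test, y_test)
print("Training set score: {:.2f}".format(train_score))
print("Test set score: {:.2f}".format(test_score))
```

With 25 inputs, a degree-3 expansion yields 3276 columns, which is why a regularised model such as Ridge is a sensible companion to it.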