Steps to create a model using sklearn
1. Get the dataset.
2. Split the dataset into features (data) and target.
3. Split the data into training and test sets using the train_test_split function.
4. Choose an algorithm, or several algorithms, based on the problem statement.
    In this first stage I have chosen LinearRegression.
5. Create an instance of it and train the model using fit().
6. Generate predictions on the test set using predict().
7. Score the model by comparing the predictions with the true targets using r2_score.
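
To make the last step less of a black box, here is a small sketch of what r2_score computes: R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. The numbers below are made up for illustration and are not from the housing data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares = 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20.0
r2_manual = 1 - ss_res / ss_tot                 # 1 - 1.5 / 20 = 0.925

print("manual :", r2_manual)
print("sklearn:", r2_score(y_true, y_pred))
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the targets.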

In [None]:
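from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Step 1: get the dataset
housing = datasets.fetch_california_housing()

# Step 2: split into features and target
x = housing.data
y = housing.target

# Step 3: hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=456)

# Steps 4 and 5: create a LinearRegression instance and train it
LR = LinearRegression()
LR.fit(x_train, y_train)

# Step 6: predict on the test set
y_pred = LR.predict(x_test)

# Step 7: score the predictions against the true targets
r2 = r2_score(y_test, y_pred)

print('r2', r2)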


Second stage of development: the first model gave an R2 score of about 0.60. Steps to optimize the score:
1. One method to optimize the model is PolynomialFeatures, which expands the number of features available to the model by adding squares and pairwise products of the original features.

In [12]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

housing = datasets.fetch_california_housing()

x = housing.data
y = housing.target
print(x.shape)

poly = PolynomialFeatures()
x = poly.fit_transform(x)

print(x.shape)

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=456)

LR = LinearRegression()
LR.fit(x_train,y_train)

y_pred = LR.predict(x_test)

r2 = r2_score(y_test, y_pred)

print('r2',r2)


(20640, 8)
(20640, 45)
r2 0.6541630009746205


1. After PolynomialFeatures the r2_score increased to about 0.65.
2. But we don't know whether the algorithm we have chosen is the right one. The best way to find out is to try different algorithms and compare the r2_score of each of them.

In [21]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import (RandomForestRegressor , HistGradientBoostingRegressor )

housing = datasets.fetch_california_housing()

x = housing.data
y = housing.target
print(x.shape)

poly = PolynomialFeatures()
x = poly.fit_transform(x)

print(x.shape)

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=456)

LR = LinearRegression()
HGB = HistGradientBoostingRegressor()
RFR = RandomForestRegressor(n_jobs=-1)

# Fit each candidate on the same split and compare the test-set R2
for model in [LR, HGB, RFR]:
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    r2 = r2_score(y_test, y_pred)
    print(model, r2)


(20640, 8)
(20640, 45)
LinearRegression() 0.6541630009746205
HistGradientBoostingRegressor() 0.8450594887665172
RandomForestRegressor(n_jobs=-1) 0.8173506840557321


1. The experiment above shows that HistGradientBoostingRegressor produces the best R2 score (about 0.845, against 0.817 for RandomForestRegressor and 0.654 for LinearRegression).
2. So of the three models tried, HistGradientBoostingRegressor is the best for our data; choose that one.
3. We can optimize it further by tuning its hyperparameters.

In [23]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import ( HistGradientBoostingRegressor )

housing = datasets.fetch_california_housing()

x = housing.data
y = housing.target
print(x.shape)

poly = PolynomialFeatures()
x = poly.fit_transform(x)

print(x.shape)

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=456)

# Try every combination of learning rate (j) and max_iter (i)
for j in [0.1, 0.05, 0.001]:
    for i in [100, 200, 300, 400, 500]:
        # max_iter is the maximum number of boosting iterations (trees)
        model = HistGradientBoostingRegressor(max_iter=i, learning_rate=j)
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        r2 = r2_score(y_test, y_pred)
        print("NUMBER OF TREES:", i)
        print("LEARNING RATE", j)
        print("R2 SCORE:", r2)
        print("-------------")


(20640, 8)
(20640, 45)
NUMBER OF TREES: 100
LEARNING RATE 0.1
R2 SCORE: 0.8475818066000932
-------------
NUMBER OF TREES: 200
LEARNING RATE 0.1
R2 SCORE: 0.8517961318544394
-------------
NUMBER OF TREES: 300
LEARNING RATE 0.1
R2 SCORE: 0.8541217644972916
-------------
NUMBER OF TREES: 400
LEARNING RATE 0.1
R2 SCORE: 0.8563061802062409
-------------
NUMBER OF TREES: 500
LEARNING RATE 0.1
R2 SCORE: 0.8546555425280261
-------------
NUMBER OF TREES: 100
LEARNING RATE 0.05
R2 SCORE: 0.833887487118536
-------------
NUMBER OF TREES: 200
LEARNING RATE 0.05
R2 SCORE: 0.8454051242532314
-------------
NUMBER OF TREES: 300
LEARNING RATE 0.05
R2 SCORE: 0.8517398154345279
-------------
NUMBER OF TREES: 400
LEARNING RATE 0.05
R2 SCORE: 0.8513485479257914
-------------
NUMBER OF TREES: 500
LEARNING RATE 0.05
R2 SCORE: 0.8560085075589842
-------------
NUMBER OF TREES: 100
LEARNING RATE 0.001
R2 SCORE: 0.12081388399827919
-------------
NUMBER OF TREES: 200
LEARNING RATE 0.001
R2 SCORE: 0.221772542725115

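1. In the output shown above, no. of trees = 400 with learning rate = 0.1 gives the highest R2 score (0.8563); no. of trees = 300 at the same learning rate is close behind (0.8541). A learning rate of 0.001 is far too small for this number of trees.
2. The next cell uses no. of trees = 300 and learning rate = 0.1, trains the model with only these parameters and saves it using joblib.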

In [24]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import ( HistGradientBoostingRegressor )
import joblib

housing = datasets.fetch_california_housing()

x = housing.data
y = housing.target
print(x.shape)

poly = PolynomialFeatures()
x = poly.fit_transform(x)

print(x.shape)

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=456)


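# Train a single model with the chosen hyperparameters
model = HistGradientBoostingRegressor(max_iter=300, learning_rate=0.1)
model.fit(x_train, y_train)

# Save the fitted model (the PolynomialFeatures step is not part of this file)
joblib.dump(model, 'my_model.joblib')

# Check the score of the saved model on the test set
y_pred = model.predict(x_test)
r2 = r2_score(y_test, y_pred)
print("R2 SCORE:", r2)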


(20640, 8)
(20640, 45)
R2 SCORE: 0.8471389415234265
