This is a very basic notebook that walks through the following steps:
* EDA
* Using linear_models
* cross_val_score
* GridSearchCV
* Polynomial Feature
* Feature Selection
* Pipeline

The goal of this notebook is to see how these steps are implemented in a data science project.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Basic Imports

In [None]:
# Libraries for Data Analysis
import numpy as np
import pandas as pd
# Libraries for Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# For avoiding warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
housing_data = pd.read_csv('/kaggle/input/usa-housing/USA_Housing.csv')

# Quick Look at Data

In [None]:
housing_data.head(3)

All features are numeric except one string feature, 'Address'.

In [None]:
housing_data.info()

There are no missing values and, apart from the free-text 'Address' column, no categorical data, so this dataset doesn't need any cleaning or preprocessing.
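The observation above can also be checked programmatically instead of reading it off `info()`. A minimal sketch, assuming a tiny made-up frame (`demo`, with invented values) standing in for `housing_data`:

```python
import pandas as pd

# Hypothetical stand-in for housing_data; the values are made up for illustration
demo = pd.DataFrame({
    'Avg. Area Income': [60000.0, 70000.0, 80000.0],
    'Price': [1000000.0, 1200000.0, 1400000.0],
    'Address': ['addr 1', 'addr 2', 'addr 3'],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()

# Columns that are not numeric (candidates for encoding or dropping)
non_numeric_columns = demo.select_dtypes(exclude='number').columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

On the real data the same two expressions would be applied to `housing_data` directly.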

In [None]:
housing_data.describe()

# EDA

In [None]:
# Making the copy of the dataset
e_df = housing_data.copy()

In [None]:
e_df.hist(bins=30, edgecolor='black', figsize=(10,8))
plt.show()

Most of the features look approximately normally distributed, apart from 'Avg. Area Number of Bedrooms'. Roughly symmetric, well-behaved features are convenient for linear models.

In [None]:
sns.pairplot(e_df)
plt.show()

There is no significant correlation among the features themselves, but there is some correlation between the features and the target.
The strongest correlation with the target can be seen for 'Avg. Area Income'.

In [None]:
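# Mask the upper triangle so only the values below the diagonal of the matrix are shown
corr = e_df.select_dtypes('number').corr()  # 'Address' is a string column, so it is excluded
matrix = np.triu(corr)
# A heatmap is a great way to show correlation
sns.heatmap(corr, mask=matrix, annot=True)
plt.show()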
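To make the masking step concrete, here is a small sketch of which cells an upper-triangle mask hides. It uses a made-up three-column frame (`demo`) rather than the housing data, and an explicit boolean mask instead of the correlation values themselves:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'c' decreases exactly as 'a' increases
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.5],
    'c': [4.0, 3.0, 2.0, 1.0],
})
corr = demo.corr()

# True on and above the diagonal: these are the cells a heatmap mask would hide
mask = np.triu(np.ones_like(corr, dtype=bool))

# Replace the hidden cells with NaN to see what remains visible
visible = corr.mask(mask)
print(visible)
```

Only the three cells strictly below the diagonal survive, which is exactly the non-redundant part of a symmetric correlation matrix.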

The 'Address' feature is of no use to our model, so we drop the 'Address' column.

In [None]:
# Copying the data into the df dataframe for further steps
df = e_df.drop('Address', axis=1)

# Splitting the Dataset

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:-1], df.Price, test_size=0.20, random_state=42)

# Building Baseline Models

The following linear models will be used:
* LinearRegression
* Ridge
* Lasso

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso

scores = []
for model in [LinearRegression(), Ridge(), Lasso()]:
    score = cross_val_score(model, X_train, y_train, cv=5)
    scores.append(np.mean(score))
    
baseline_models = pd.DataFrame({'model':['LinearRegression','Ridge','Lasso'], 'score':scores}).set_index('model')
baseline_models

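All the baseline models score almost exactly the same. Ridge and Lasso both use their default alpha=1 here, and at that strength the regularization penalty is most likely negligible relative to the scale of this dataset, so they behave almost like plain LinearRegression.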
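It helps to know what the score actually is: for regressors, `cross_val_score` reports R² by default, and an error in the units of the target can be requested through the `scoring` argument. A minimal sketch on synthetic data (`X_demo` and `y_demo` are stand-ins for `X_train` and `y_train`, not the housing data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, y_train)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

model = LinearRegression()

# Default scoring for a regressor is R^2 (higher is better, at most 1.0)
r2_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# scikit-learn maximizes scores, so RMSE is exposed as a negative number; flip the sign
rmse_scores = -cross_val_score(model, X_demo, y_demo, cv=5,
                               scoring='neg_root_mean_squared_error')

print('mean R^2 :', np.mean(r2_scores))
print('mean RMSE:', np.mean(rmse_scores))
```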

# Model Parameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha':[0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Ridge(), param_grid, cv=5).fit(X_train, y_train)
print(grid.best_estimator_)
print('best_score: {}'.format(grid.best_score_))
results = pd.DataFrame(grid.cv_results_)

sns.set_style('whitegrid')
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x=results.param_alpha.astype(str), y=results.mean_test_score*100)
plt.yticks(np.arange(0,101,10))
plt.show()

In [None]:
param_grid = {'alpha':[0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Lasso(), param_grid, cv=5).fit(X_train, y_train)
print(grid.best_estimator_)
print(grid.best_score_)
results = pd.DataFrame(grid.cv_results_)

# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x=results.param_alpha.astype(str), y=results.mean_test_score*100)
plt.yticks(np.arange(0,101,10))
plt.show()


Parameter tuning didn't noticeably affect the cross-validated score.
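When the bars all look the same height, a table of `cv_results_` makes small differences easier to see than a plot. A sketch on synthetic data (the data and the `summary` and `spread` names are illustrative assumptions, not part of the original analysis):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (X_train, y_train)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Ridge(), param_grid, cv=5).fit(X_demo, y_demo)

# Keep only the columns needed to compare the candidates
summary = pd.DataFrame(grid.cv_results_)[
    ['param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(summary.sort_values('rank_test_score'))

# Difference between the best and worst mean score across the grid
spread = summary['mean_test_score'].max() - summary['mean_test_score'].min()
print('spread of mean scores:', spread)
```

If `spread` is small compared with `std_test_score`, the choice of `alpha` is not making a meaningful difference.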

### Coefficients assigned to each feature by the linear models

In [None]:
lr = LinearRegression().fit(X_train, y_train)
ridge = Ridge().fit(X_train, y_train)
lasso = Lasso().fit(X_train, y_train)

coef = pd.DataFrame(data = [lr.coef_, ridge.coef_, lasso.coef_], columns=X_train.columns,
                    index=['linear_regression','ridge','lasso'])
coef

# Applying StandardScaler

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pipe = make_pipeline(StandardScaler(),LinearRegression())
np.mean(cross_val_score(pipe, X_train, y_train, cv=5))
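A pipeline can also be tuned as a whole. `make_pipeline` names each step after its lowercased class name, and `GridSearchCV` addresses a step's parameter as `<step>__<parameter>`. A minimal sketch with Ridge on synthetic data (the data and the `scaled_ridge` name are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X_train, y_train)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Steps are named 'standardscaler' and 'ridge' automatically
scaled_ridge = make_pipeline(StandardScaler(), Ridge())

# The scaler is refit on the training part of every fold, so no fold leaks into another
param_grid = {'ridge__alpha': [0.01, 1, 100]}
search = GridSearchCV(scaled_ridge, param_grid, cv=5).fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```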

# Applying Polynomial features

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2).fit(X_train)
X_poly = poly.transform(X_train)
np.mean(cross_val_score(LinearRegression(), X_poly, y_train, cv=5))

Performance is slightly worse than before.

# Applying PCA

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2).fit(X_train)
X_pca = pca.transform(X_train)
np.mean(cross_val_score(LinearRegression(), X_pca, y_train, cv=5))

PCA with two components actually performed very poorly.
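One likely contributor is that PCA is sensitive to feature scale, and it was applied here to unscaled features: columns with large variance dominate the leading components regardless of how useful they are for predicting price. Keeping only two of five components also discards information. The sketch below illustrates the scale effect on two made-up, independent features; it is an illustration, not a diagnosis of the housing data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical, independent features on very different scales
X_demo = np.column_stack([
    rng.normal(0, 10000, size=500),  # large-scale column (income-like)
    rng.normal(0, 1, size=500),      # small-scale column (rooms-like)
])

# Share of variance captured by the first component, without and with scaling
raw_ratio = PCA(n_components=1).fit(X_demo).explained_variance_ratio_[0]
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=1).fit(X_scaled).explained_variance_ratio_[0]

print('unscaled:', raw_ratio)
print('scaled  :', scaled_ratio)
```

On unscaled data the first component is essentially the large-scale column alone; after scaling, both columns contribute. In practice this suggests `make_pipeline(StandardScaler(), PCA(...), LinearRegression())`.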

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(10,5))
axs[0].scatter(X_pca[:,0], y_train, edgecolor='black')
axs[0].set_xlabel('component_1')
axs[0].set_ylabel('price')

axs[1].scatter(X_pca[:,1], y_train, edgecolor='black')
axs[1].set_xlabel('component_2')
axs[1].set_ylabel('price')
plt.show()

# Feature Selection

### Univariate Statistics

In [None]:
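from sklearn.feature_selection import SelectPercentile, f_regression
# The default score_func is f_classif, which is meant for classification; use f_regression for a continuous target
select = SelectPercentile(score_func=f_regression, percentile=50).fit(X_train, y_train)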
selected_X = select.transform(X_train)
print(select.get_support())
np.mean(cross_val_score(LinearRegression(), selected_X, y_train, cv=5))
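`SelectPercentile` ranks features by a univariate score function. Its default, `f_classif`, targets classification, so for a continuous target such as price `f_regression` is the appropriate choice. A sketch on synthetic data where only the first two of four columns influence the target (all names and values here are made up):

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))

# Only columns 0 and 1 drive the hypothetical target; columns 2 and 3 are pure noise
y_demo = 5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

select = SelectPercentile(score_func=f_regression, percentile=50).fit(X_demo, y_demo)

print(select.scores_)         # F-statistic per feature
print(select.get_support())   # boolean mask of the kept features
```

Printing `scores_` alongside `get_support()` shows how far apart the kept and dropped features are, not just which ones were kept.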

### Model Based Selection

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

select = SelectFromModel(RandomForestRegressor(), threshold='median').fit(X_train, y_train)
selected_X = select.transform(X_train)
print(select.get_support())
np.mean(cross_val_score(LinearRegression(), selected_X, y_train, cv=5))

### Iterative Selection

In [None]:
from sklearn.feature_selection import RFE

select = RFE(RandomForestRegressor(), n_features_to_select=3).fit(X_train, y_train)
selected_X = select.transform(X_train)
print(select.get_support())
np.mean(cross_val_score(LinearRegression(), selected_X, y_train, cv=5))

So far the best model is the baseline model without any preprocessing of the data.
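One caveat about the comparisons above: the selectors were fit on all of `X_train` before cross-validation, so each held-out fold had already influenced which features were kept. Putting the selector inside a pipeline refits it on the training part of every fold. A minimal sketch on synthetic data (the data and the choice of selector are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for (X_train, y_train): 6 features, 3 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                 noise=10.0, random_state=42)

# Selection and regression are fit together inside each cross-validation fold
pipe = make_pipeline(
    SelectPercentile(score_func=f_regression, percentile=50),
    LinearRegression(),
)
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(np.mean(scores))
```

The same pattern applies to `SelectFromModel` and `RFE`, since both are transformers that can be used as pipeline steps.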

# Model Evaluation

In [None]:
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MAE :', mean_absolute_error(y_test, y_pred))
print('MSE :', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
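The cross-validation scores earlier are R² values, so reporting R² on the test set as well makes the two directly comparable. The sketch below works through all four metrics on four made-up predictions, small enough to verify by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions; the errors are 1, 0, -1, 0
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

mae = mean_absolute_error(y_true, y_hat)   # (1 + 0 + 1 + 0) / 4
mse = mean_squared_error(y_true, y_hat)    # (1 + 0 + 1 + 0) / 4
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_hat)               # 1 - SS_res / SS_tot = 1 - 2 / 20

print(mae, mse, rmse, r2)
```

For the notebook's model the equivalent call would be `r2_score(y_test, y_pred)`.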

In [None]:
plt.figure(figsize=(10,5))
# Coefficients of the final LinearRegression model, one bar per feature
plt.barh(X_train.columns, lr.coef_)
plt.show()

In [None]:
plt.scatter(y_test, y_pred, edgecolor='black')
plt.show()

In [None]:
plt.hist(y_test-y_pred, bins=30)
plt.show()