<h1>1. Business Understanding</h1>

This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.

In this scenario, let's pretend we are a real estate agency in Boston, MA, and we are interested in purchasing some houses. We would like to know which houses are undervalued, to help us narrow down the list and put in an accurate bid on a house.

<b>Objective:</b> Identify what makes a property valuable and what a fair price for a house is.

Load Library

In [None]:
#import libraries for data handling
import os
import pandas as pd
import numpy as np

#import for visualization
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

#import for Linear regression
from sklearn.linear_model import LinearRegression

#import for Visualization
import plotly.express as px


### Load Data into Pandas Dataframe

In [None]:
import pandas as pd
df = pd.read_csv('/kaggle/input/boston-house-prices/housing.csv')
df.head()

In [None]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = pd.read_csv('/kaggle/input/boston-house-prices/housing.csv', header=None, delimiter=r"\s+", names=column_names)
data.head()

Here we see the first 5 rows. The data loaded successfully!

### Dataset : Boston
### Goal      : Predict medv column in Test Dataset!
<ol>
<li>	<b>	crim	:	</b>	per capita crime rate by town.
<li>	<b>	zn	:	</b>	proportion of residential land zoned for lots over 25,000 sq.ft.
<li>	<b>	indus	:	</b>	proportion of non-retail business acres per town.
<li>	<b>	chas	:	</b>	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
<li>	<b>	nox	:	</b>	nitrogen oxides concentration (parts per 10 million).
<li>	<b>	rm	:	</b>	average number of rooms per dwelling.
<li>	<b>	age	:	</b>	proportion of owner-occupied units built prior to 1940.
<li>	<b>	dis	:	</b>	weighted mean of distances to five Boston employment centres.
<li>	<b>	rad	:	</b>	index of accessibility to radial highways.
<li>	<b>	tax	:	</b>	full-value property-tax rate per &#36;10,000.
<li>	<b>	ptratio	:	</b>	pupil-teacher ratio by town.
<li>	<b>	black	:	</b>	1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
<li>	<b>	lstat	:	</b>	lower status of the population (percent).
<li>	<b>	medv	:	</b>	median value of owner-occupied homes in &#36;1000s.
</ol>

## 2. Data Understanding (EDA)

### Print a concise summary of a DataFrame.
This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

In [None]:
# Information on the Dataframe
data.info()  # info() prints the summary itself and returns None, so it needs no print() wrapper

In [None]:
data.describe()

**Understand the data**

Before cleaning data, there are a couple of things we would like to know, for example, the dimension of a dataset, the data type of each variable, the first few rows and the last few rows of the data and name of each variable, etc.

In [None]:
print("Dimension of Dataset is " + str(data.shape))  # check out the dimension of the dataset
print("\nlook at the data types for each column ")
print(data.dtypes)  # look at the data types for each column

#print(data.head())  # read the first five rows
#print(data.tail())  # read the last five rows

#print(data.columns.values)  # return an array of column names
print("\nlist of column names ")
print(data.columns.values.tolist())  # return a list of column names

**Check Missing Values**

Next, we would like to check if there are any missing values. To check this, we can use the function dataframe.isnull() in pandas. It will return True for missing components and False for non-missing cells. However, when the dimension of a dataset is large, it could be difficult to figure out the existence of missing values. In general, we may just want to know if there are any missing values first. The function dataframe.isnull().values.any() returns True when there is at least one missing value occurring in the data. The function dataframe.isnull().sum().sum() returns the number of missing values in the data set. 

In [None]:
# In a notebook only the last expression of a cell is displayed, so print each check explicitly
print(data.isnull())  # True where a value is missing
print(data.notnull())  # True where a value is present
print(data.isnull().values.any())  # is there at least one missing value?
print(data.notnull().sum())  # number of non-missing values for each variable

print(data.isnull().sum().sum())  # total number of missing values in the data
print(data["MEDV"].isnull().values.any())  # are there any missing values in MEDV?
print(data["MEDV"].isnull().sum())  # number of missing values in MEDV

In [None]:
data.isnull().sum()  # knowing how many missing values in the data


Reference
* Missing Values : [link](https://miamioh.instructure.com/courses/38817/pages/data-cleaning#:~:text=The%20function%20dataframe.-,isnull().,values%20in%20the%20data%20set.&text=A%20simple%20way%20to%20deal,missing%20values%20in%20the%20dataset.)

In [None]:
import math 

# Sturges' formula
print(1 + 3.322*(math.log10(len(data))))

# round off the bin size
bin_size = round((1 + 3.322*(math.log10(len(data)))))
print(bin_size)

In [None]:
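# number of bins calculated with Sturges' rule: K = 1 + 3.322 * log10(N)
# sns.distplot is deprecated; histplot (density scale, with a KDE curve) plus rugplot draws the same elements
sns.histplot(data.MEDV, bins=bin_size, kde=True, stat="density")
sns.rugplot(data.MEDV)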
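As a cross-check on the bin count, NumPy has its own Sturges estimator (`bins="sturges"` in `np.histogram_bin_edges`), which rounds up with `ceil(log2(n) + 1)` rather than rounding to the nearest integer. The sketch below compares the two on a synthetic stand-in column; the sample size of 506 and the distribution parameters are assumptions chosen for illustration, not values taken from the notebook's output.

```python
import math
import numpy as np

# Synthetic stand-in for a price column; 506 rows is an assumed sample size.
rng = np.random.default_rng(0)
values = rng.normal(loc=22.5, scale=9.0, size=506)

n = len(values)
# Sturges' rule in the base-10 form used in the cell above, rounded to nearest.
k_rounded = round(1 + 3.322 * math.log10(n))
# NumPy's built-in Sturges estimator rounds up instead.
edges = np.histogram_bin_edges(values, bins="sturges")
k_numpy = len(edges) - 1

print(k_rounded, k_numpy)
```

The two conventions can differ by one bin for some sample sizes, so it is worth stating which one a plot uses.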

<b>Linear Regression Assumptions </b>
<ol>
<li>Linear relationship between target and features
<li>No outliers
<li>No high-leverage points
<li>Homoscedasticity of error terms
<li>Uncorrelated error terms
<li>Independent features
</ol>
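The last assumption (independent features) can be screened numerically with variance inflation factors: regress each feature on all the others and compute 1 / (1 - R²). The sketch below is a NumPy-only illustration on synthetic columns; the helper name `vif` and the three columns are hypothetical, and the common "VIF above 10" cutoff is a rule of thumb rather than a fixed standard.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)            # independent of a
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a
vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))
```

Columns `a` and `c` get large factors because each is almost a linear function of the other, while `b` stays close to 1.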

#1 Linear Relationship Between Target & Features

In [None]:
# Plotting the heatmap of correlation between features
corr = data.corr()

plt.figure(figsize=(20,10))
sns.heatmap(corr, cbar=True, square= True, fmt='.1f', annot=True, annot_kws={'size':10})

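**Highly correlated**
<li>better access to radial highways (RAD) goes with a higher property-tax rate (TAX)

**Moderately correlated**
<li>higher crime rate (CRIM), lower price (MEDV)
<li>more pollution (NOX), lower price
<li>higher tax rate (TAX), lower price
<li>higher pupil-teacher ratio (PTRATIO), lower price
<li>larger share of lower-status population (LSTAT), lower price
<li>greater distance to employment centres (DIS), less industry (INDUS)
<li>more industry (INDUS), more pollution (NOX)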
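Reading pairs off a heatmap is easy to get wrong, so it can help to rank features by the absolute value of their correlation with the target. The sketch below does this on synthetic data; the three feature names and their coefficients are hypothetical and only illustrate the ranking step.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
# Hypothetical features: one strongly negative, one moderately positive, one unrelated.
strong_neg = rng.normal(size=n)
moderate_pos = rng.normal(size=n)
noise_only = rng.normal(size=n)
target = -2.0 * strong_neg + 1.0 * moderate_pos + rng.normal(size=n)

features = {"strong_neg": strong_neg, "moderate_pos": moderate_pos, "noise_only": noise_only}
# Pearson correlation of each feature with the target.
corr_with_target = {name: float(np.corrcoef(col, target)[0, 1]) for name, col in features.items()}
# Rank by absolute value so strong negative relationships are not buried.
ranking = sorted(corr_with_target, key=lambda name: abs(corr_with_target[name]), reverse=True)
for name in ranking:
    print(f"{name:>12}: {corr_with_target[name]:+.2f}")
```

On the notebook's DataFrame the same ranking should be obtainable with `data.corr()['MEDV'].drop('MEDV').sort_values(key=abs, ascending=False)`.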


## 3. Data Preparation

In [None]:
data.head()

In [None]:
#Import for Models
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

from math import sqrt
import statsmodels.api as sm
from sklearn import metrics

#importing plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

#import warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)

##  Split Data into Test & Train

##### Benefits of splitting a dataset into training and testing subsets for a learning algorithm

<ul>

<li> <b>Motivation:</b> we need a way to choose between machine learning models, and our goal is to estimate the likely performance of a model on out-of-sample data.
<li> <b>Initial idea:</b> we can train and test on the same data. However, this rewards overfitting: a model that memorizes the training set scores well without generalizing, and the problem grows as the number of features in the dataset increases.
<li><b>Alternative idea:</b> we can use a train/test split. We split the dataset into two pieces so that the model is trained and tested on different data.
Then the testing error is a better estimate of out-of-sample performance than the training error.
</ul>
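The gap between training and testing error is easiest to see with a model that memorizes its training data. The sketch below uses a hand-written 1-nearest-neighbour predictor on synthetic data (an illustrative choice, not one of the models used in this notebook): it reproduces every training point exactly, yet still makes errors on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(0, 2.0, size=200)   # linear signal plus noise

# 80/20 split by shuffling the row indices.
idx = rng.permutation(200)
train_idx, test_idx = idx[:160], idx[160:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

def predict_1nn(x_query):
    """Return the target of the closest training point for each query."""
    nearest = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

def mse(y_true, y_hat):
    return float(np.mean((y_true - y_hat) ** 2))

train_mse = mse(y_train, predict_1nn(x_train))   # each point is its own nearest neighbour
test_mse = mse(y_test, predict_1nn(x_test))
print(train_mse, test_mse)
```

A training error of zero says nothing about how the model behaves on new houses; only the held-out error does.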

In [None]:
# Machine Learning 
predictor = data.drop(columns=['MEDV'])  # every column except the target
target = data['MEDV']

## 4. Model

## lm1 : Raw data only

In [None]:
# Split data to 80% training data and 20% of test to check the accuracy of our model
X_train, X_test, y_train, y_test = train_test_split(predictor, target, test_size=0.20, random_state=0)
print("Training and testing split by 80/20 was successful")

#Model object
lm1 = LinearRegression(fit_intercept=True)  # the normalize argument was removed in scikit-learn 1.2
#Model Training
lm1.fit(X_train,y_train)
# Predict
y_pred = lm1.predict(X_test)

# Evaluate
# 1. root-mean-square error (RMSE) for the Model
# 2. R-squared for the Model

#Calculate root-mean-square error (RMSE):
print ("root-mean-square error (RMSE) for the model is : {}".format(round(sqrt(mean_squared_error(y_test,y_pred)),2)))

#Calculate R-squared for the Model:
print ("R-Squared for the above model : {}".format(round(r2_score(y_test,y_pred)*100,2)),"%")

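#OLS model summary (statsmodels does not add an intercept by default, so add the constant explicitly)
model1 = sm.OLS(y_train, sm.add_constant(X_train)).fit()
model1.summary()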

Random Forest Regressor

In [None]:
# Import Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor
reg = RandomForestRegressor()

# Train the model using the training sets 
reg.fit(X_train, y_train)

# Model prediction on train data
y_pred = reg.predict(X_train)

# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
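The metrics printed in these evaluation cells can be reproduced from their definitions, which makes the adjusted R² expression easier to read: it shrinks R² by the factor (n - 1) / (n - p - 1), where n is the number of rows and p the number of features. The sketch below is a from-scratch illustration; the function name `regression_report` and the five sample values are hypothetical.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """R^2, adjusted R^2 and RMSE computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    rmse = np.sqrt(ss_res / n)
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}

report = regression_report(
    y_true=[3.0, 5.0, 7.0, 9.0, 11.0],
    y_pred=[2.5, 5.5, 7.0, 8.0, 11.5],
    n_features=2,
)
print(report)
```

Adjusted R² is always at most R² when the fit has at least one feature and R² is below 1, so a large gap between the two suggests that some features add little.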



XGBoost Regressor

In [None]:
# Import XGBoost Regressor
from xgboost import XGBRegressor

#Create a XGBoost Regressor
reg = XGBRegressor()

# Train the model using the training sets 
reg.fit(X_train, y_train)

# Model prediction on train data
y_pred = reg.predict(X_train)

# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs Predicted prices")
plt.show()

In [None]:
# Import XGBoost Regressor
from xgboost import XGBRegressor

#Create a XGBoost Regressor
reg = XGBRegressor()

# Train the model using the training sets 
reg.fit(X_train, y_train)

# Model prediction on test data
y_pred = reg.predict(X_test)

# Model Evaluation
print('R^2:',metrics.r2_score(y_test, y_pred))
#print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
#print('MAE:',metrics.mean_absolute_error(y_test, y_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


In [None]:
# Checking residuals
plt.scatter(y_pred, y_test - y_pred)  # y_pred holds the test-set predictions from the previous cell
plt.title("Predicted vs residuals")
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.show()

**SVM Regressor**

In [None]:
# Import SVM Regressor
from sklearn import svm

# Create a SVM Regressor
reg = svm.SVR()

# Train the model using the training sets 
reg.fit(X_train, y_train)

# Model prediction on train data
y_pred = reg.predict(X_train)

# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

In [None]:
from sklearn.linear_model import LinearRegression
# Import Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
# Import XGBoost Regressor
from xgboost import XGBRegressor
# Import SVM Regressor
from sklearn import svm
#from sklearn import cross_validation
from sklearn.model_selection import KFold, cross_val_score, train_test_split
import time

# Create Model Array
models = []
models.append(("LR",LinearRegression(fit_intercept=True)))  # normalize was removed in scikit-learn 1.2
models.append(("RFM",RandomForestRegressor()))
models.append(("XGB",XGBRegressor()))
models.append(("SVM",svm.SVR()))

result = []

#measure the MSE (scikit-learn reports it negated, so values closer to 0 are better)
for name,model in models:
    
    start_time = time.time()
    
    cv_result = cross_val_score(model,X_train,y_train, cv = 10, scoring = "neg_mean_squared_error")
    print(name, cv_result)
    print("-"*3,name, " Mean MSE of cross-validation: ", round(cv_result.mean(),2))
    execution_time = (time.time() - start_time)
    
    # keep the score numeric so that sorting compares numbers rather than strings
    result.append((name, round(cv_result.mean(),2), execution_time))
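What `cross_val_score` does with `cv = 10` can be sketched by hand: split the rows into ten folds, fit on nine, score on the one held out, and repeat. The version below is a simplified NumPy-only illustration with an ordinary least-squares fit on synthetic data; the helper `kfold_mse` is hypothetical, and unlike scikit-learn's default K-fold it shuffles the rows before splitting.

```python
import numpy as np

def kfold_mse(X, y, k=10, seed=0):
    """Mean squared error of a least-squares fit on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the other k-1 folds (intercept column added by hand).
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold.
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        scores.append(np.mean((y[val_idx] - A_val @ coef) ** 2))
    return np.array(scores)

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=300)

fold_mse = kfold_mse(X_demo, y_demo)
print(fold_mse.mean())    # ordinary (positive) MSE
print(-fold_mse.mean())   # sign convention of scoring="neg_mean_squared_error"
```

scikit-learn negates the error so that every scorer follows "higher is better"; the underlying MSE is the same number with the sign flipped.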




In [None]:
df = pd.DataFrame(result, columns =['Algo', 'MSE', 'Execution_time']) 

# sort df by the MSE column
pd_df = df.sort_values(['MSE']).reset_index(drop=True)
pd_df





In [None]:
pd_df[['Algo', 'MSE']].groupby(['Algo']).sum().iplot(kind='bar',xTitle='Models', yTitle='MSE', title = 'Model Comparison')

In [None]:
pd_df[['Algo', 'Execution_time']].groupby(['Algo']).sum().iplot(kind='bar',xTitle='Models', yTitle='Execution time (s)', title = 'Model Comparison')

In [None]:
# Machine Learning 
predictor = data.drop(columns=['MEDV'])  # every column except the target
target = data['MEDV']

# Split data to 80% training data and 20% of test to check the accuracy of our model
X_train, X_test, y_train, y_test = train_test_split(predictor, target, test_size=0.20, random_state=0)

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score, train_test_split
import time

# Create Model Array
models = []
models.append(("LR",LinearRegression(fit_intercept=True)))  # normalize was removed in scikit-learn 1.2
models.append(("RFM",RandomForestRegressor()))
models.append(("XGB",XGBRegressor()))
models.append(("SVM",svm.SVR()))

result = []

#measure the MSE (scikit-learn reports it negated, so values closer to 0 are better)
for name,model in models:
    
    start_time = time.time()
    
    cv_result = cross_val_score(model,X_train,y_train, cv = 10, scoring = "neg_mean_squared_error")
    #print(name, cv_result)
    print(name, " Mean MSE of cross-validation: ", round(cv_result.mean(),2))
    execution_time = (time.time() - start_time)
    
    # keep the score numeric so that sorting compares numbers rather than strings
    result.append((name, round(cv_result.mean(),2), execution_time))


df = pd.DataFrame(result, columns =['Algo', 'MSE', 'Execution_time']) 

# sort df by the MSE column
pd_df = df.sort_values(['MSE'])
print(pd_df)


df[['Algo', 'MSE']].groupby(['Algo']).sum().iplot(kind='bar',xTitle='Models', yTitle='MSE', title = 'Model Comparison')

In [None]:
# Machine Learning 
predictor = data.drop(columns=['MEDV'])  # every column except the target
target = data['MEDV']

# Split data to 80% training data and 20% of test to check the accuracy of our model
X_train, X_test, y_train, y_test = train_test_split(predictor, target, test_size=0.20, random_state=0)

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score, train_test_split
import time

# Create Model Array
models = []
models.append(("LR",LinearRegression(fit_intercept=True)))  # normalize was removed in scikit-learn 1.2
models.append(("RFM",RandomForestRegressor()))
models.append(("XGB",XGBRegressor()))
models.append(("SVM",svm.SVR()))

result = []

#measure the MSE (scikit-learn reports it negated, so values closer to 0 are better)
for name,model in models:
    
    start_time = time.time()
    
    cv_result = cross_val_score(model,X_train,y_train, cv = 10, scoring = "neg_mean_squared_error")
    #print(name, cv_result)
    print("-"*3,name, " Mean MSE of cross-validation: ", round(cv_result.mean(),2))
    execution_time = (time.time() - start_time)
    
    # keep the score numeric so that sorting compares numbers rather than strings
    result.append((name, round(cv_result.mean(),2), execution_time))


df = pd.DataFrame(result, columns =['Algo', 'MSE', 'Execution_time']) 

# sort df by the MSE column
pd_df = df.sort_values(['MSE'])
print(pd_df)


The mean squared error differs with the train/test split ratio:

**Test size 40%**

| Algo | MSE | Execution_time |
|------|-----|----------------|
| LR | -23.72 | 0.086060 |
| RFM | -12.48 | 3.461339 |
| XGB | -9.32 | 0.572908 |
| SVM | -69.05 | 0.205350 |

**Test size 30%**

| Algo | MSE | Execution_time |
|------|-----|----------------|
| LR | -22.83 | 0.084721 |
| RFM | -14.75 | 3.695503 |
| XGB | -10.48 | 0.812710 |
| SVM | -67.22 | 0.213389 |

**Test size 20%**

| Algo | MSE | Execution_time |
|------|-----|----------------|
| LR | -21.4 | 0.080288 |
| RFM | -10.57 | 3.938169 |
| XGB | -7.83 | 0.668272 |
| SVM | -64.68 | 0.239741 |

XGB has the lowest error in all three cases (the scores are negated MSE, so the value closest to zero is best).
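Because MEDV is recorded in $1000s, the negated MSE values are easier to interpret once they are converted to RMSE, which is in the same units as the price. The sketch below does this for the scores of the run labelled 20 above; the numbers are copied from that listing and nothing is re-run.

```python
import math

# Cross-validation scores copied from the run labelled 20 (negated MSE).
neg_mse = {"LR": -21.4, "RFM": -10.57, "XGB": -7.83, "SVM": -64.68}

# Flip the sign, then take the square root to get RMSE in the target's units ($1000s).
rmse = {name: math.sqrt(-score) for name, score in neg_mse.items()}
best = min(rmse, key=rmse.get)

for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
print("best:", best)
```

Read this way, the best model's cross-validated error is a little under three thousand dollars on a typical home, while the untuned SVM is off by roughly eight thousand.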

## TensorFlow

In [None]:
# Data Processing
X = data.iloc[:, 0:13].values
y = data.iloc[:, 13].values

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
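The scaler is fitted on the training split only and then reused on the test split, so no information from the test rows leaks into preprocessing. The NumPy sketch below spells out what that amounts to; the arrays `train` and `test` are synthetic stand-ins, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(80, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), the convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuses the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(2), test_scaled.std(axis=0).round(2))
```

The training columns end up with mean 0 and standard deviation 1 by construction; the test columns are only approximately standardized, which is expected.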

In [None]:
#To train the model, let's import the TensorFlow 2.0 classes. 
from tensorflow.keras.layers import Input, Dense, Activation,Dropout
from tensorflow.keras.models import Model

In [None]:
input_layer = Input(shape=(X.shape[1],))
dense_layer_1 = Dense(100, activation='relu')(input_layer)
dense_layer_2 = Dense(50, activation='relu')(dense_layer_1)
dense_layer_3 = Dense(25, activation='relu')(dense_layer_2)
output = Dense(1)(dense_layer_3)

model = Model(inputs=input_layer, outputs=output)
model.compile(loss="mean_squared_error" , optimizer="adam", metrics=["mean_squared_error"])
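A quick sanity check on this architecture is to count its trainable parameters by hand: a Dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The sketch below applies that to the layer sizes defined above, assuming the 13 input features selected by `data.iloc[:, 0:13]`; the helper `dense_params` is a hypothetical name.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

# 13 input features, then the four Dense layers (100, 50, 25 and 1 units).
layer_sizes = [13, 100, 50, 25, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```

The total should agree with what `model.summary()` or `model.count_params()` reports for the compiled model.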

In [None]:
history = model.fit(X_train, y_train, batch_size=2, epochs=10, verbose=1, validation_split=0.2)

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

pred_train = model.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train)))

pred = model.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred)))

Reference:
* https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
* https://towardsdatascience.com/explained-deep-learning-in-tensorflow-chapter-1-9ab389fe90a1