# Project: Predicting Median House Price in Boston


In this project, I model the median home value (MEDV) across U.S. Census tracts in the Boston area. The target is a continuous, numeric output (price) predicted from a combination of numeric features, most of them continuous. Thus, this is a regression problem.

In [1]:
# Let us import some libraries
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
import numpy as np
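# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook needs an older scikit-learn version to run as written.
from sklearn.datasets import load_boston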

In [None]:
# loading the data
boston = load_boston()

X = pd.DataFrame(boston.data,
                 columns=boston.feature_names)
# Keep the target as a 1-D Series so estimators do not warn about a column-vector y
y = pd.Series(boston.target, name='MEDV')

print(boston['DESCR'])

## 1. Clean Up Data and Perform Exploratory Data Analysis


The Boston data comes from the scikit-learn datasets library, so it should already be clean. Nevertheless, we should always explore the data before modelling it in detail.



In [None]:
# Exploratory data analysis:
# Include: total nulls, data types, shape, summary statistics, and the number of unique values for each column in the dataset.
# You may use plots to describe any relationships between different columns.

# Answer

# For EDA purposes I use X (the features), because y is the response variable.
X.info()                  # non-null counts and data types per column
print(X.isnull().sum())   # total nulls per column
print(X.dtypes)           # data types
print(X.shape)            # (rows, columns)
print(X.nunique())        # number of unique values per column
print(X.describe())       # summary statistics

# Plots
import seaborn as sns

# Comparison between explanatory variables: each feature plotted against the row index
X.plot()

# Combine X and y column-wise to plot the relationship between
# each explanatory variable and the response variable.
concatenated = pd.concat([X, y], axis=1)
sns.set(style="white")
sns.pairplot(concatenated, x_vars=list(X.columns), y_vars=['MEDV'])
# For example, tracts with a high crime rate (CRIM) tend to fall at the lower end of MEDV.
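
The per-column checks asked for in this cell (nulls, data types, unique values) can also be collected into a single table. The following is a minimal sketch, assuming a hypothetical helper named `summarize_columns` and a small made-up DataFrame standing in for `X`; it is not part of the original notebook.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, null count, and unique-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isnull().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a unique value
    })


# Toy stand-in for X (made-up values, not the Boston data)
toy = pd.DataFrame({
    "CRIM": [0.1, 0.2, None, 0.2],
    "CHAS": [0, 1, 0, 0],
})
summary = summarize_columns(toy)
print(summary)
```

In the notebook, `summarize_columns(X)` would give the same overview for the real features.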




## 2. Build a Lasso regression model using Scikit-Learn

Use the toolset in Scikit-Learn to build a Lasso regression model to predict our target variable, MEDV. Use 10-fold cross-validation to evaluate your model. Score your predictions. What do these results tell us?


In [None]:
# Lasso regression model with 10-fold cross-validation
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = Lasso(alpha=0.1)
# evaluate model
scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
# cross_val_score uses the regressor's default scorer, which is R^2 (not accuracy)
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


R^2: 0.71 (+/- 0.25)
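
Because `cross_val_score` reports R² by default for regressors, it can help to request the metrics explicitly and add an error measure in the target's own units. The sketch below uses `cross_validate` with the `r2` and `neg_mean_squared_error` scorers on a standardized Lasso pipeline. It runs on synthetic data from `make_regression` as a stand-in for `X` and `y` (an assumption made so the example is self-contained), so its numbers say nothing about the Boston results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston features and target (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=1)

cv = KFold(n_splits=10, shuffle=True, random_state=1)
# Standardize inside the pipeline so the scaler is fit on each training fold only
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))

results = cross_validate(model, X_demo, y_demo, cv=cv,
                         scoring=("r2", "neg_mean_squared_error"))

r2 = results["test_r2"]
# The scorer returns negated MSE, so flip the sign before taking the root
rmse = np.sqrt(-results["test_neg_mean_squared_error"])
print("R^2:  %0.2f (+/- %0.2f)" % (r2.mean(), r2.std() * 2))
print("RMSE: %0.2f (+/- %0.2f)" % (rmse.mean(), rmse.std() * 2))
```

Replacing `X_demo` and `y_demo` with the notebook's `X` and `y` would report the same two metrics for the Boston data.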


## 3. Build a Random forest regression model using Scikit-Learn

Use the toolset in Scikit-Learn to build a random forest regression model to predict our target variable, MEDV. Use 10-fold cross-validation to evaluate your model. Score your predictions. What do these results tell us?

In [None]:
# Random forest regression model with 10-fold cross-validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = RandomForestRegressor(n_estimators=1000, random_state=0)
# evaluate model
scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
# cross_val_score uses the regressor's default scorer, which is R^2 (not accuracy)
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


R^2: 0.86 (+/- 0.18)
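
Beyond a score, a fitted random forest offers a ranking of features by impurity-based importance. The following is a minimal sketch on synthetic `make_regression` data standing in for the Boston features (the column names `f0` to `f5` are hypothetical), where only the first two columns carry signal.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: with shuffle=False the informative features are the first two columns
X_arr, y_demo = make_regression(n_samples=300, n_features=6, n_informative=2,
                                noise=1.0, shuffle=False, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f%d" % i for i in range(6)])

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` is a reasonable cross-check.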


## 4. Build a Support Vector Machine (SVM) model using Scikit-Learn

Use the toolset in Scikit-Learn to build a Support Vector Machine (SVM) regression model to predict our target variable, MEDV. Use 10-fold cross-validation to evaluate your model. Score your predictions. What do these results tell us?

In [None]:
# Support Vector Machine (SVM) regression model with 10-fold cross-validation
from sklearn.svm import SVR
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = SVR(kernel='rbf')
# evaluate model
scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
# cross_val_score uses the regressor's default scorer, which is R^2 (not accuracy)
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


R^2: 0.21 (+/- 0.33)
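
An RBF-kernel SVR is sensitive to feature scale, and the Boston features span very different ranges, which may partly explain the low score above. A common remedy is to standardize the features inside a pipeline. The sketch below compares the two setups on synthetic data whose columns are deliberately rescaled (an assumption standing in for `X` and `y`); `C=100` is an illustrative choice, not a tuned value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in with columns on very different scales (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=5.0, random_state=1)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

cv = KFold(n_splits=10, shuffle=True, random_state=1)

# Same SVR settings, with and without standardization
raw_svr = SVR(kernel="rbf", C=100)
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100))

raw_scores = cross_val_score(raw_svr, X_demo, y_demo, cv=cv, scoring="r2")
scaled_scores = cross_val_score(scaled_svr, X_demo, y_demo, cv=cv, scoring="r2")

print("R^2 without scaling: %0.2f" % raw_scores.mean())
print("R^2 with scaling:    %0.2f" % scaled_scores.mean())
```

On data like this the scaled pipeline would be expected to score higher, but the size of the gap depends on the data and on `C`, `gamma`, and `epsilon`, so the same comparison should be rerun on the real `X` and `y` before drawing conclusions.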
