___________________
##### By.
##### A h M e D _ H e f N a w Y
___________________
#### Boston House Price Regression problem!
___________________

##### Understanding the Problem Attributes!
###### 01. CRIM: per capita crime rate by town
###### 02. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
###### 03. INDUS: proportion of non-retail business acres per town
###### 04. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
###### 05. NOX: nitric oxides concentration (parts per 10 million)
###### 06. RM: average number of rooms per dwelling
###### 07. AGE: proportion of owner-occupied units built prior to 1940
###### 08. DIS: weighted distances to five Boston employment centers
###### 09. RAD: index of accessibility to radial highways
###### 10. TAX: full-value property-tax rate per $10,000

###### 11. PTRATIO : pupil-teacher ratio by town

###### 12. B: 1000(Bk − 0.63)^2 where Bk is the proportion of blacks by town

###### 13. LSTAT: % lower status of the population

###### 14. MEDV: Median value of owner-occupied homes in $1000s

In [None]:
from IPython.display import Image
# raw string so the backslashes in the Windows path are not read as escape sequences
Image(filename=r'F:\Careers\Machine Learning\work shop\Projects\Boston House Price\DataSet\DataEx.JPG',width=800,height=100)

In [None]:
# Load libraries
import numpy
from numpy import arange
from matplotlib import pyplot as plt
import pandas as pd
from pandas import set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

#### Load DataSet

In [None]:
# Load dataset
filename = '../input/boston-house-prices/housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO',
'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
DF = pd.read_csv(filename, sep=r'\s+', names=names)
DF

#### Analyze Data

In [None]:
DF.shape

In [None]:
DF.dtypes

In [None]:
DF.head(21)

In [None]:
# descriptions
DF.describe()

> Correlation between all of the numeric attributes.

In [None]:
DF.corr()

Several attribute pairs are strongly correlated:

- NOX and INDUS: 0.77
- DIS and INDUS: -0.71
- TAX and INDUS: 0.72
- AGE and NOX: 0.73
- DIS and NOX: -0.78
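
The pairs above were read off the correlation table by eye. As a rough sketch, the same scan could be automated. The helper name `strong_pairs`, the 0.7 threshold, and the synthetic frame are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, threshold=0.7):
    """Return (col_a, col_b, r) for each column pair with |r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    # strongest relationships first
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Synthetic stand-in data (hypothetical columns, not the Boston values)
rng = np.random.default_rng(7)
a = rng.normal(size=300)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.3, size=300),   # strongly positive with a
    'c': -a + rng.normal(scale=0.3, size=300),  # strongly negative with a
    'd': rng.normal(size=300),                  # unrelated noise
})
for col_a, col_b, r in strong_pairs(demo):
    print(col_a, col_b, r)
```

On the notebook's data the call would be `strong_pairs(DF)`.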

### Visualizations time!

> Univariate Data Visualizations

In [None]:
DF.hist(figsize=(20,20), bins=10)
plt.show()

CRIM: exponential distribution

ZN: exponential distribution

AGE: exponential distribution

B: exponential distribution

RAD: bimodal distribution

TAX: bimodal distribution

NOX: skewed Gaussian distribution

RM: skewed Gaussian distribution

LSTAT: skewed Gaussian distribution
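
The shapes above are judged from the histograms. A minimal sketch of how skew could be measured, and how a log transform can reduce right skew, is given below. The synthetic columns and the choice of `numpy.log1p` are assumptions for illustration; the notebook itself does not apply this transform.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in columns (hypothetical, not the Boston values)
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    'right_skewed': rng.lognormal(mean=0.0, sigma=1.0, size=500),
    'symmetric': rng.normal(size=500),
})

# Sample skewness per column: near 0 for symmetric data, positive for a long right tail
skew_before = demo.skew()
print(skew_before)

# log1p(x) = log(1 + x); only defined for x > -1, so it suits non-negative columns
transformed = np.log1p(demo['right_skewed'])
skew_after = transformed.skew()
print(skew_after)
```

On the notebook's data, `DF.skew()` would give the same per-attribute summary.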

In [None]:
# density chart
DF.plot(kind='density',subplots=True, layout=(4,4), sharex=False, sharey=False, legend=False, fontsize=10, figsize=(20,20))
plt.show()

In [None]:
# box and whisker plots
DF.plot(kind='box', subplots=True, layout=(4,4), sharex=False, sharey=False,fontsize=10 , figsize=(20,20))
plt.show()

> Multivariate Data Visualizations!

In [None]:
# scatter plot matrix
scatter_matrix(DF, figsize=(25,25),ax=None,grid=True,diagonal='hist',marker='*', range_padding=0.05)
plt.show()

Many of the relationships in the scatter plot matrix are not linear, but they follow nice, predictable curves. :)

In [None]:
# correlation matrix
fig = plt.figure(figsize=(20,15))
ax = fig.add_subplot(111)
cax = ax.matshow(DF.corr(), vmin=-1, vmax=1, interpolation='none')
fig.colorbar(cax)
ticks = numpy.arange(0,14,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

_____________
### Validation Step!!

There is a lot of structure in this dataset, so I need to consider transforms that could improve modeling accuracy. First, I hold back 20% of the data as a validation set.

In [None]:
# Split-out validation dataset 80% -- 20%
array = DF.values
X = array[:,0:13] # all 13 input features
Y = array[:,13] # target (MEDV)
validation_size = 0.20 # fraction of the data held out to estimate final accuracy
seed = 7 # fixed seed so the split is reproducible
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)

### Now it's time to Evaluate Algorithms!!

In [None]:
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring = 'neg_mean_squared_error'

In [None]:
# Spot-Check Algorithms
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))

In [None]:
# evaluate each model in turn
results = []
names = []
for name, model in models:
    # shuffle=True is required when random_state is set
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("{}:   {} ({}) \n".format(name, cv_results.mean(), cv_results.std()))

In [None]:
# Compare algorithms with a box plot
fig = plt.figure(figsize=(10,7),facecolor='grey')
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

The differing scales of the attributes are probably hurting the skill of several of the algorithms, especially distance- and kernel-based ones such as KNN and SVR.

I will run the same algorithms on a standardized copy of the data.
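
As a small illustration of why scale matters for a distance-based model, the sketch below compares KNN with and without a `StandardScaler` step. The data is synthetic and built so that a large-scale noise feature dominates the raw distances; it is an assumed toy setup, not the Boston data.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature on a small scale,
# one pure-noise feature on a much larger scale
rng = np.random.default_rng(7)
informative = rng.normal(size=200)
noise = rng.normal(scale=1000.0, size=200)
X_demo = np.column_stack([informative, noise])
y_demo = 3.0 * informative + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=7)
raw_knn = KNeighborsRegressor()
scaled_knn = Pipeline([('Scaler', StandardScaler()), ('KNN', KNeighborsRegressor())])

# Scores are negated MSE, so values closer to zero are better
raw_score = cross_val_score(raw_knn, X_demo, y_demo, cv=cv,
                            scoring='neg_mean_squared_error').mean()
scaled_score = cross_val_score(scaled_knn, X_demo, y_demo, cv=cv,
                               scoring='neg_mean_squared_error').mean()
print(raw_score, scaled_score)
```

Because the scaler sits inside the `Pipeline`, it is refit on the training part of every fold, so no statistics from the held-out fold leak into the model.
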
### So, Evaluate Algorithms: Standardization TimE !!

In [None]:
# Standardize the dataset
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()),('LR', LinearRegression())])))
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()),('LASSO',Lasso())])))
pipelines.append(('ScaledEN', Pipeline([('Scaler', StandardScaler()),('EN', ElasticNet())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()),('KNN', KNeighborsRegressor())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()),('CART', DecisionTreeRegressor())])))
pipelines.append(('ScaledSVR', Pipeline([('Scaler', StandardScaler()),('SVR', SVR())])))
results = []
names = []
for name, model in pipelines:
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("{}:    {} ({}) \n".format(name, cv_results.mean(), cv_results.std()))

In [None]:
# Compare updated Algorithms
fig = plt.figure(figsize=(10,7),facecolor='y')
fig.suptitle('Scaled Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Scaled KNN has both a tight distribution of error and the lowest error (its negated MSE score is the closest to zero).
##### So, it's time to improve the results of KNN
### Tuning Step!

I will use a grid search over different numbers of neighbors to see whether the score of the KNN algorithm can be improved.
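
One possible variation, sketched here on synthetic data, is to grid-search a `Pipeline` rather than a pre-scaled array, so the scaler is refit inside every fold. The step names `scaler` and `knn`, the shortened grid, and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Boston values)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=150)

pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsRegressor())])
# '<step name>__<parameter>' addresses a parameter of a pipeline step
k_grid = {'knn__n_neighbors': [1, 3, 5, 7, 9]}
cv = KFold(n_splits=5, shuffle=True, random_state=7)

search = GridSearchCV(pipe, param_grid=k_grid,
                      scoring='neg_mean_squared_error', cv=cv)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

The trade-off is a little extra computation per fold in exchange for scores that do not depend on scaling statistics computed from the held-out fold.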

In [None]:
# KNN Algorithm tuning
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
k_values = numpy.array([1,3,5,7,9,11,13,15,17,19,21])
param_grid = dict(n_neighbors=k_values)
model = KNeighborsRegressor()
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, Y_train)

> Display the mean and standard deviation of the scores for each k, as well as the best-performing value of k, below.
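
The `neg_mean_squared_error` scorer reports the negated MSE, so the printed values are negative and the one closest to zero is best. A short sketch of converting such scores to RMSE, which is in the same units as MEDV, follows. The three fold scores are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical negated-MSE fold scores, for illustration only
neg_mse_scores = np.array([-16.0, -25.0, -9.0])

mse = -neg_mse_scores   # flip the sign to recover the mean squared error
rmse = np.sqrt(mse)     # same units as the target (MEDV is in $1000s)
print(rmse)             # [4. 5. 3.]
print(rmse.mean())      # 4.0
```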

In [None]:
print("Best: %f using %s\n" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

I will also try ensemble methods to see whether they improve the accuracy any further.

In [None]:
# ensembles
ensembles = []
ensembles.append(('ScaledAB', Pipeline([('Scaler', StandardScaler()),('AB', AdaBoostRegressor())])))
ensembles.append(('ScaledGBM', Pipeline([('Scaler', StandardScaler()),('GBM', GradientBoostingRegressor())])))
ensembles.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF', RandomForestRegressor())])))
ensembles.append(('ScaledET', Pipeline([('Scaler', StandardScaler()),('ET', ExtraTreesRegressor())])))
results = []
names = []
for name, model in ensembles:
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [None]:
# Compare Algorithms
fig = plt.figure(figsize=(10,7),facecolor='grey')
fig.suptitle('Scaled Ensemble Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

> > > Tune Ensemble Methods
- The default number of boosting stages to perform (`n_estimators`) is 100.
- A larger number of boosting stages often improves performance, up to a point, but it also lengthens the training time.
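
To see how the error changes with the number of stages without refitting for each value, one option is `staged_predict`, which yields predictions after every boosting stage of a fitted model. The sketch below uses synthetic data and an assumed hold-out split; it is not the tuning procedure used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the Boston values)
rng = np.random.default_rng(7)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

gbm = GradientBoostingRegressor(n_estimators=200, random_state=7)
gbm.fit(X_tr, y_tr)

# One hold-out MSE value per boosting stage
stage_mse = [mean_squared_error(y_te, pred) for pred in gbm.staged_predict(X_te)]
best_stage = int(np.argmin(stage_mse)) + 1  # stages are counted from 1
print(best_stage, stage_mse[best_stage - 1])
```

If the best stage falls well short of the total, the extra stages cost training time without improving the hold-out error.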


In [None]:
# Tune scaled GBM
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
param_grid = dict(n_estimators=numpy.array([50,100,150,200,250,300,350,400]))
model = GradientBoostingRegressor(random_state=seed)
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, Y_train)

In [None]:
# Display results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r \n" % (mean, stdev, param))

### Finalize Model

In [None]:
# prepare the model
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
model = GradientBoostingRegressor(random_state=seed, n_estimators=400)
model.fit(rescaledX, Y_train)

In [None]:
# transform the validation dataset
rescaledValidationX = scaler.transform(X_validation)
predictions = model.predict(rescaledValidationX)
print(mean_squared_error(Y_validation, predictions))



----------------

End of the Boston House Price problem :)

I am still trying to improve the accuracy as much as I can.
    
### AhMeD HefNawY