**Welcome to my kernel**

In this dataset we have to predict the sale price of houses in King County, Washington, which includes Seattle. It covers homes sold between May 2014 and May 2015.
Before doing anything else, we should get to know the dataset: what it contains, what its features are, and how the data is structured.

The dataset contains **20** house features plus the price, along with **21613** observations.

The 21 columns (the price plus the 20 features) are described below: <br>

**1. id** : A unique numeric identifier assigned to each house sold. <br>
**2. date** : The date on which the house was sold. <br>
**3. price** : The sale price of the house. This is the target variable we have to predict; the remaining columns are our features. <br>
**4. bedrooms** : Number of bedrooms in the house. <br>
**5. bathrooms** : Number of bathrooms in the house. <br>
**6. sqft_living** : Living area of the house in square feet. <br>
**7. sqft_lot** : Area of the lot in square feet. <br>
**8. floors** : Total number of floors (levels) in the house. <br>
**9. waterfront** : Whether the house has a waterfront view (0 = no, 1 = yes). <br>
**10. view** : A rating of the quality of the view from the house, on a scale of 0 to 4. <br>
**11. condition** : It determines the overall condition of a house on a scale of 1 to 5. <br>
**12. grade** : The overall grade given to the housing unit, based on the King County grading system, on a scale of 1 to 13. <br>
**13. sqft_above** : Square footage of the house excluding the basement. <br>
**14. sqft_basement** : Square footage of the basement. <br>
**15. yr_built** : The year in which the house was built. <br>
**16. yr_renovated** : The year in which the house was renovated. <br>
**17. zipcode** : The zipcode of the location of the house. <br>
**18. lat** : The latitude of the location of the house. <br>
**19. long** : The longitude of the location of the house. <br>
**20. sqft_living15** : Living area in 2015 (may reflect some renovations). <br>
**21. sqft_lot15** : Lot area in 2015 (may reflect some renovations). <br>
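A few of the column relationships described above can be checked directly. The sketch below uses a tiny hand-made frame (hypothetical values, not rows taken from the dataset) and rests on three assumptions: that `date` is stored as a compact string such as `20141013T000000`, that `sqft_living` equals `sqft_above` plus `sqft_basement`, and that a `yr_renovated` of 0 means the house was never renovated. Verify these against the real file before relying on them.

```python
import pandas as pd

# Hypothetical rows that mimic the column layout described above (not real records).
sample = pd.DataFrame({
    "date": ["20141013T000000", "20150225T000000"],
    "sqft_living": [1180, 2570],
    "sqft_above": [1180, 2170],
    "sqft_basement": [0, 400],
    "yr_renovated": [0, 1991],
})

# Assumption: the date is a compact string; parse it into a proper timestamp.
sample["date"] = pd.to_datetime(sample["date"], format="%Y%m%dT%H%M%S")

# Assumption: living area is the sum of above-ground and basement area.
sample["sqft_consistent"] = (
    sample["sqft_above"] + sample["sqft_basement"] == sample["sqft_living"]
)

# Assumption: yr_renovated == 0 means the house was never renovated.
sample["was_renovated"] = sample["yr_renovated"] > 0

print(sample[["date", "sqft_consistent", "was_renovated"]])
```

Running the same three checks on the full dataframe is a cheap way to confirm that the column descriptions match the data.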

Now that we know the overall structure of the dataset, let's go through the steps we generally follow when applying machine learning models.


**STEP 1:   IMPORTING LIBRARIES**

In [None]:
import os
import math
from time import time

import numpy as np
import pandas as pd
import xgboost
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from sklearn import tree, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, r2_score
# sklearn.cross_validation and sklearn.learning_curve were removed in scikit-learn 0.20;
# their functionality now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split, ShuffleSplit, validation_curve

# List the files available in the Kaggle input directory
print(os.listdir("../input"))


In [None]:
data = pd.read_csv('../input/kc_house_data.csv')


In [None]:
# Copy the data into another dataframe, df_train, for convenience, so that the original dataframe remains intact.

df_train=data.copy()
df_train.rename(columns ={'price': 'SalePrice'}, inplace =True)

In [None]:
# Now let's see the first five rows of the data
data.head()

**STEP 2:  DATA CLEANING AND PREPROCESSING**

In this step we check whether the data contains null or missing values, how large the dataset is, what the datatype of each column is, what the unique values of the categorical variables are, and so on.

In [None]:
# Check the number of data points in the data set
print(len(data))
# Check the number of features in the data set
print(len(data.columns))
# Check the data types of each column
print(data.dtypes)

In [None]:
# Count the columns that contain at least one NaN or missing value
print(data.isnull().any().sum(), ' / ', len(data.columns))
# Count the data points (rows) that contain at least one NaN
print(data.isnull().any(axis=1).sum(), ' / ', len(data))
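To make the two missing-value counts above concrete, here is a minimal sketch on a toy frame (hypothetical values with one NaN planted on purpose). It also shows `nunique()`, one possible way to look at the unique values of categorical columns mentioned in this step.

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing price (hypothetical values).
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, np.nan, 604000.0],
    "bedrooms": [3, 3, 2, 4],
    "waterfront": [0, 0, 1, 0],
})

cols_with_nan = toy.isnull().any().sum()        # columns containing at least one NaN
rows_with_nan = toy.isnull().any(axis=1).sum()  # rows containing at least one NaN
unique_counts = toy.nunique()                   # distinct non-null values per column

print(cols_with_nan, "/", len(toy.columns))
print(rows_with_nan, "/", len(toy))
print(unique_counts)
```

A column with very few distinct values, such as `waterfront` here, is usually better treated as categorical than as a continuous measurement.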

**STEP 3 : FINDING CORRELATION**

In this step we compute the correlation of every feature with the target variable, price, to see whether each one is positively or negatively correlated and how strongly. This indicates whether a feature is likely to help the model. It is one of the most important steps, and it also calls for domain knowledge: you cannot simply remove a feature from the prediction process just because it is negatively correlated, since it may still contribute to the prediction. When in doubt, consult someone with domain expertise.

In [None]:
# The id and date columns are not important for predicting price, so we discard them when finding correlations
features = data.iloc[:,3:].columns.tolist()
target = data.iloc[:,2].name

In [None]:
# Find the correlation of price with the other variables to see how many of them are strongly correlated with price
correlations = {}
for f in features:
    data_temp = data[[f,target]]
    x1 = data_temp[f].values
    x2 = data_temp[target].values
    key = f + ' vs ' + target
    correlations[key] = pearsonr(x1,x2)[0]

In [None]:
# Show the correlation of each feature with price (the target variable), sorted by absolute value
data_correlations = pd.DataFrame(correlations, index=['Value']).T
data_correlations.loc[data_correlations['Value'].abs().sort_values(ascending=False).index]

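zipcode is only very weakly correlated with the sale price (its coefficient is close to zero), and it is a location label rather than a numeric quantity, so we can discard it for sale price prediction. The negative sign alone would not be a reason to drop it.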
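The sketch below illustrates why the strength of a correlation matters more than its sign. It uses synthetic data with hypothetical column names (`size`, `age`, `code`), not the King County data: `age` is negatively correlated with the price yet clearly informative, while the label-like `code` column carries no signal at all. It also shows `DataFrame.corr` as a shorter alternative to the explicit loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
size = rng.normal(2000, 500, n)        # hypothetical "living area"
age = rng.normal(40, 15, n)            # hypothetical "age of the house"
code = rng.integers(98001, 98200, n)   # hypothetical label-like column, unrelated to price

# Price rises with size and falls with age; code plays no role.
price = 200 * size - 3000 * age + rng.normal(0, 20000, n)

toy = pd.DataFrame({"size": size, "age": age, "code": code, "price": price})

# Pearson correlation of every feature with the target, ranked by absolute value.
corr = toy.corr()["price"].drop("price")
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Ranking by absolute value puts `age` ahead of `code` even though its coefficient is negative, which is the behaviour we want when screening features.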

**STEP 4 : EDA or DATA VISUALIZATION**

This is also a very important step in the prediction process, as it helps you become aware of existing patterns in the data and of how the features relate to the target variable.

In [None]:
var = 'sqft_living15'
# Build each plotting frame from df_train under its own name so the original `data` is not overwritten
plot_data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
plot_data.plot.scatter(x=var, y='SalePrice', ylim=(3,9000000));

In [None]:
var = 'bedrooms'
plot_data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(14, 6))
sns.violinplot(x=var, y="SalePrice", data=plot_data, ax=ax)
ax.set_ylim(0, 3500000);

In [None]:
var = 'grade'
plot_data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(14, 6))
sns.violinplot(x=var, y="SalePrice", data=plot_data, ax=ax)
ax.set_ylim(0, 3500000);

In [None]:
var = 'bathrooms'
plot_data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(20, 20))
sns.boxplot(x=var, y="SalePrice", data=plot_data, ax=ax)
ax.set_ylim(0, 3500000);

In [None]:
var = 'sqft_living'
plot_data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
plot_data.plot.scatter(x=var, y='SalePrice', ylim=(3,8000000));

In [None]:
var = 'floors'
plot_data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(20, 20))
sns.boxplot(x=var, y="SalePrice", data=plot_data, ax=ax)
ax.set_ylim(0, 3500000);

In [None]:
#Pairplots to visualize strong correlation
sns.set()
cols = ['SalePrice', 'sqft_living', 'grade', 'sqft_above', 'view', 'bathrooms','bedrooms','sqft_basement']
sns.pairplot(df_train[cols], height = 3.5)
plt.show();

In [None]:
new_data = df_train[['sqft_living','grade', 'sqft_above', 'sqft_living15','bathrooms','view','sqft_basement','waterfront','yr_built','lat','bedrooms','long']]

In [None]:
X = new_data.values
y = df_train.SalePrice.values

**STEP 5 : SPLITTING DATA INTO TRAINING AND TESTING SET**

The training and test datasets must be similar: they share the same predictors (variables) and differ only in the observations and the specific values those variables take. Fitting the model on the training dataset minimizes the error on that data, so the fitted model usually predicts the training data well. We then evaluate the model on the test dataset. If it also predicts well there, we can have more confidence in it, because the test data is similar to the training data but was never seen by the model. In other words, the model has learned something that generalizes rather than memorizing the training set.

So, by splitting the dataset into training and testing subsets, we can measure the trained model fairly, since it never sees the testing data during training. This makes it possible to detect overfitting.

Here I split the dataset so that 20% is used as test data and the remaining 80% is used for training the model.
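As a minimal sketch of the 80/20 split described above, the snippet below runs `train_test_split` on a small synthetic array (the names `X_demo` and `y_demo` are hypothetical). Passing `random_state` is an optional choice made here so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 synthetic observations with 2 features each; y_demo doubles as a row identifier.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)

# No observation ends up in both subsets.
no_overlap = set(y_tr).isdisjoint(set(y_te))
print(no_overlap)
```

Because the two subsets are disjoint, a score computed on the test part reflects performance on observations the model has not seen.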

In [None]:
# train_test_split comes from sklearn.model_selection (the old cross_validation module no longer exists)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

**STEP 6: APPLYING MACHINE LEARNING MODEL**

In [None]:
from sklearn.ensemble import RandomForestRegressor
rand_regr = RandomForestRegressor(n_estimators=400,random_state=0)
start = time()
rand_regr.fit(X_train, y_train)
end=time()
train_time_rand=end-start
random=rand_regr.score(X_test,y_test)  # R^2 on the test set
predictions = rand_regr.predict(X_test)
# explained_variance_score expects (y_true, y_pred)
exp_rand = explained_variance_score(y_test, predictions)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
start = time()
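# The loss argument is left at its default, squared error ('ls' in older scikit-learn releases, 'squared_error' in newer ones)
est=GradientBoostingRegressor(n_estimators=400, max_depth=5, min_samples_split=2, learning_rate=0.1).fit(X_train, y_train)
end=time()
train_time_g=end-start
gradient=est.score(X_test,y_test)  # R^2 on the test set

pred = est.predict(X_test)
# explained_variance_score expects (y_true, y_pred)
exp_est = explained_variance_score(y_test, pred)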

In [None]:
from sklearn.ensemble import AdaBoostRegressor
start = time()
ada=AdaBoostRegressor(n_estimators=50, learning_rate=0.2,loss='exponential').fit(X_train, y_train)
end=time()
train_time_ada=end-start
adab=ada.score(X_test,y_test)  # R^2 on the test set
predict = ada.predict(X_test)
# explained_variance_score expects (y_true, y_pred)
exp_ada = explained_variance_score(y_test, predict)

In [None]:
from sklearn.tree  import DecisionTreeRegressor
decision=DecisionTreeRegressor()
start = time()
decision.fit(X_train, y_train)
end=time()
train_time_dec=end-start
decc=decision.score(X_test,y_test)  # R^2 on the test set
decpredict = decision.predict(X_test)
# explained_variance_score expects (y_true, y_pred)
exp_dec = explained_variance_score(y_test, decpredict)

In [None]:
# Compare the models on the basis of their R^2 score and explained variance score on the test set
models_cross = pd.DataFrame({
    'Model': ['Gradient Boosting','AdaBoost','Random Forest','Decision Tree'],
    'Score': [gradient,adab,random,decc],
    'Variance Score': [exp_est,exp_ada,exp_rand,exp_dec]})

models_cross.sort_values(by='Score', ascending=False)

**STEP 7 : ANALYZING TRAINING TIME EACH MODEL HAS TAKEN**

In [None]:
import matplotlib.pyplot as plt
import numpy as np
model = ['Adaboost', 'GBOOST', 'Random forest', 'Decision Tree']
Train_Time = [
    train_time_ada,
    train_time_g,
    train_time_rand,
    train_time_dec
    
]
index = np.arange(len(model))
plt.bar(index, Train_Time)
plt.xlabel('Machine Learning Models', fontsize=15)
plt.ylabel('Training Time (seconds)', fontsize=15)
plt.xticks(index, model, fontsize=10)
plt.title('Comparison of Training Time of all ML models')
plt.show()

From the figure above, we can infer that the decision tree takes a negligible amount of time to train, whereas the random forest takes the most. This is expected: training time grows with the number of trees (400 in this case). We should therefore look for the model that offers the best trade-off between score and training time.
In this case gradient boosting is the best choice: its score is the highest, and its training time is reasonable for that score.
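The fit-then-score pattern repeated for each model can be wrapped in a small helper. The sketch below is one possible way to do it: `fit_and_score` is a hypothetical name, the data is synthetic (`make_regression`), and the tree counts are reduced so the example runs quickly. The timings it prints say nothing about the King County results.

```python
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Fit the model, then return its test R^2 and the time spent in fit()."""
    start = perf_counter()
    model.fit(X_train, y_train)
    elapsed = perf_counter() - start
    return {"r2": model.score(X_test, y_test), "seconds": elapsed}


# Synthetic regression problem standing in for the housing data.
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

candidates = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

results = {
    name: fit_and_score(model, X_tr, X_te, y_tr, y_te)
    for name, model in candidates.items()
}

for name, res in results.items():
    print(f"{name}: R^2={res['r2']:.3f}, fit time={res['seconds']:.3f}s")
```

Keeping score and training time side by side in one structure makes the trade-off between the two easier to read than separate variables per model.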

**STEP 8 : PLOTTING OF COMPLEXITY CURVE**

The following code cell produces a graph for a Gradient Boosting model that has been trained and validated on the training data using different maximum depths. The graph shows two complexity curves, one for training and one for validation. The shaded regions around both curves denote the uncertainty in those curves (one standard deviation across the cross-validation splits), and the model is scored on both the training and validation sets using the R² score.

In [None]:
from sklearn.model_selection import ShuffleSplit, validation_curve

def ModelComplexity(X, y):
    """ Calculates the performance of the model as model complexity increases.
        The training and validation scores are then plotted. """
    
    # Create 10 random cross-validation splits (80% training, 20% validation)
    cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 0)

    # Vary the max_depth parameter from 1 to 10
    max_depth = np.arange(1,11)

    # Calculate the training and validation scores
    train_scores, test_scores = validation_curve(GradientBoostingRegressor(), X, y,
        param_name = "max_depth", param_range = max_depth, cv = cv, scoring = 'r2')

    # Find the mean and standard deviation for smoothing
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)

    # Plot the validation curve
    plt.figure(figsize=(7, 5))
    plt.title('Gradient Boosting Regressor Complexity Performance')
    plt.plot(max_depth, train_mean, 'o-', color = 'r', label = 'Training Score')
    plt.plot(max_depth, test_mean, 'o-', color = 'g', label = 'Validation Score')
    plt.fill_between(max_depth, train_mean - train_std, \
        train_mean + train_std, alpha = 0.15, color = 'r')
    plt.fill_between(max_depth, test_mean - test_std, \
        test_mean + test_std, alpha = 0.15, color = 'g')
    
    # Visual aesthetics
    plt.legend(loc = 'lower right')
    plt.xlabel('Maximum Depth')
    plt.ylabel('Score')
    plt.ylim([-0.05,1.05])
    plt.show()

In [None]:
ModelComplexity(X_train, y_train)

**Interpretation of the Curve**

At a maximum depth of 1, the model suffers from high bias. The two scores are quite close, but both are too far from an acceptable level, so I think it's a high bias problem. In other words, the model is underfitting. The relationship between the features and the price is more complex than trees of depth 1 can capture, so the model is biased: it assumes the data behaves in a simpler way than it actually does. A key point is that there is nothing wrong with the training itself; this is about the best fit that such shallow trees can achieve. The problem lies with the model, which is not complex enough to represent our data.

At a maximum depth of 10, the model suffers from high variance, since the training score is 1.0 but the validation score is about 0.7. In other words, the model is overfitting: it is flexible enough to get as close as possible to every training point, noise included. Consequently, a model with high variance has very low bias because it makes little to no assumption about the data. In fact, it adapts too much to the data.

As we see from the curve, a max depth of 5 generalizes best to unseen data. As max depth increases, bias becomes lower and variance becomes higher, and we should keep a balance between the two. Just after depth 5 the training score keeps increasing while the validation score starts to go down, so the model begins to overfit. That is why 5 should be a good choice.
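The reading of the curve above can also be done numerically. The sketch below uses hypothetical mean scores shaped like the curve just described (they are illustrative numbers, not values measured in this notebook) and picks the depth with the highest validation score, while the gap between the two curves serves as a rough indicator of variance.

```python
import numpy as np

depths = np.arange(1, 11)

# Hypothetical mean scores per depth, for illustration only (not measured values).
train_mean = np.array([0.60, 0.75, 0.84, 0.89, 0.93, 0.96, 0.98, 0.99, 1.00, 1.00])
valid_mean = np.array([0.58, 0.72, 0.80, 0.84, 0.86, 0.85, 0.82, 0.78, 0.74, 0.70])

# The depth with the best validation score is the one that generalizes best.
best_depth = int(depths[np.argmax(valid_mean)])

# A small gap with low scores suggests bias; a large gap suggests variance.
gap = train_mean - valid_mean

print("best depth:", best_depth)
print("gap at depth 1:", round(float(gap[0]), 2))
print("gap at depth 10:", round(float(gap[-1]), 2))
```

With the real `train_mean` and `test_mean` arrays from `ModelComplexity`, the same two lines would give the depth to use and a quick check for overfitting.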

**Conclusion**

So, we have seen that gradient boosting achieves an **R² score** of about **0.8928** (the value returned by `score`, which for regressors is R² rather than a classification accuracy) and a decent **explained variance score** of **0.87**, which is close to 1.
Therefore, **Gradient Boosting** appears to be a suitable model for this dataset.

Further, we can optimize the model by using GridSearch to fine-tune the hyperparameters and improve the score.
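A minimal sketch of such a search with scikit-learn's `GridSearchCV` is shown below. The data is synthetic and the grid is deliberately tiny; the parameter values are placeholders chosen to keep the example fast, not recommendations for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the housing data.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Placeholder grid: every combination is evaluated with 3-fold cross-validation.
param_grid = {
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, the search would be fitted on `X_train` and `y_train` only, and `search.best_estimator_` would then be scored once on the held-out test set.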

Apart from that, your valuable suggestions for further improvement and optimization are always welcome. Do comment!