# EDA and Price Prediction
### Competition Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
### Acknowledgments

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 
<font color = "green" >

#### Content:   
    
1. [Loading and Checking Data Set](#1)
1. [Variable Analysis and Visualization](#2) 
    * [Visualization of Categorical Variables](#3)
    * [Visualization of Numerical Variables](#4)
1. [Data Analysis](#5)
    * [Handling Outliers](#7)
    * [Handling Missing Values and Feature Engineering](#6)
    * [Relationship Between Some Variables](#8)
1. [Modelling](#19)
    * [Hyperparameter Tuning - Grid Search - Cross Validation](#20)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings("ignore")

In [None]:
d_train = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
d_test = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

In [None]:
d_train.info()

* We have 1460 total entries and 81 columns.
* We have null values in some columns.

In [None]:
print("First 5 rows of data:")
d_train.head()

In [None]:
print("Last 5 rows of data:")
d_train.tail()

In [None]:
d_train.describe().T

<a id = "2"></a>
## Variable Analysis

#### Here's a brief version of what you'll find in the data description file.

1. SalePrice : the property's sale price in dollars. This is the target variable that you're trying to predict.
2. MSSubClass: The building class
3. MSZoning: The general zoning classification
4. LotFrontage: Linear feet of street connected to property
5. LotArea: Lot size in square feet
6. Street: Type of road access
7. Alley: Type of alley access
8. LotShape: General shape of property
9. LandContour: Flatness of the property
10. Utilities: Type of utilities available
11. LotConfig: Lot configuration
12. LandSlope: Slope of property
13. Neighborhood: Physical locations within Ames city limits
14. Condition1: Proximity to main road or railroad
15. Condition2: Proximity to main road or railroad (if a second is present)
16. BldgType: Type of dwelling
17. HouseStyle: Style of dwelling
18. OverallQual: Overall material and finish quality
19. OverallCond: Overall condition rating
20. YearBuilt: Original construction date
21. YearRemodAdd: Remodel date
22. RoofStyle: Type of roof
23. RoofMatl: Roof material
24. Exterior1st: Exterior covering on house
25. Exterior2nd: Exterior covering on house (if more than one material)
26. MasVnrType: Masonry veneer type
27. MasVnrArea: Masonry veneer area in square feet
28. ExterQual: Exterior material quality
29. ExterCond: Present condition of the material on the exterior
30. Foundation: Type of foundation
31. BsmtQual: Height of the basement
32. BsmtCond: General condition of the basement
33. BsmtExposure: Walkout or garden level basement walls
34. BsmtFinType1: Quality of basement finished area
35. BsmtFinSF1: Type 1 finished square feet
36. BsmtFinType2: Quality of second finished area (if present)
37. BsmtFinSF2: Type 2 finished square feet
38. BsmtUnfSF: Unfinished square feet of basement area
39. TotalBsmtSF: Total square feet of basement area
40. Heating: Type of heating
41. HeatingQC: Heating quality and condition
42. CentralAir: Central air conditioning
43. Electrical: Electrical system
44. 1stFlrSF: First Floor square feet
45. 2ndFlrSF: Second floor square feet
46. LowQualFinSF: Low quality finished square feet (all floors)
47. GrLivArea: Above grade (ground) living area square feet
48. BsmtFullBath: Basement full bathrooms
49. BsmtHalfBath: Basement half bathrooms
50. FullBath: Full bathrooms above grade
51. HalfBath: Half baths above grade
52. Bedroom: Number of bedrooms above basement level
53. Kitchen: Number of kitchens
54. KitchenQual: Kitchen quality
55. TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
56. Functional: Home functionality rating
57. Fireplaces: Number of fireplaces
58. FireplaceQu: Fireplace quality
59. GarageType: Garage location
60. GarageYrBlt: Year garage was built
61. GarageFinish: Interior finish of the garage
62. GarageCars: Size of garage in car capacity
63. GarageArea: Size of garage in square feet
64. GarageQual: Garage quality
65. GarageCond: Garage condition
66. PavedDrive: Paved driveway
67. WoodDeckSF: Wood deck area in square feet
68. OpenPorchSF: Open porch area in square feet
69. EnclosedPorch: Enclosed porch area in square feet
70. 3SsnPorch: Three season porch area in square feet
71. ScreenPorch: Screen porch area in square feet
72. PoolArea: Pool area in square feet
73. PoolQC: Pool quality
74. Fence: Fence quality
75. MiscFeature: Miscellaneous feature not covered in other categories
76. MiscVal: "$" Value of miscellaneous feature
77. MoSold: Month Sold
78. YrSold: Year Sold
79. SaleType: Type of sale
80. SaleCondition: Condition of sale


<font color = "green" >
dtypes:
    
<font color = "black" >
    
* object : 43
* int64 : 35
* float64 : 3



<font color = "green" >


We have 57 categorical and 24 numerical variables.

<font color = "black" >

* **Categorical variables:** 'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'

* **Numerical variables:** 'Id', 'LotFrontage', 'LotArea', 'Neighborhood', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageYrBlt', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'MiscVal', 'SalePrice'

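Our train data has 81 columns, so finding the type of each variable by hand would take time. To decide whether a variable is categorical or numerical, I will write a few lines of code.

I set the threshold to 20: a column with fewer than 20 unique values is treated as categorical. This is only a heuristic; for example, Neighborhood lands in the numerical list above because it has at least 20 distinct values.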

In [None]:
categorical_features = []
threshold = 20
for each in d_train.columns:
    if d_train[each].nunique() < threshold:
        categorical_features.append(each)
    
numerical_features = []
for each in d_train.columns:
    if each not in categorical_features:
        numerical_features.append(each)
        
print("Categorical Variables:\n\n",categorical_features,"\n\n")        
print("Numerical Variables:\n\n",numerical_features )

In [None]:
numerical_features

<a id = "3"></a>
### Visualization of Categorical Variables

Now I will visualize the categorical variables as bar plots, which makes the frequency of each category easier to see. Train data is drawn in red and test data in green.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(30,70))

for i,var in enumerate(categorical_features):
    
    plt.subplot(15,4,i+1)
    sns.countplot(data = d_train, x = var, alpha = 0.3, color="red")
    sns.countplot(data = d_test, x = var, alpha = 0.5, color = "green")

<a id = "4"></a>
### Visualization of Numerical Variables

Now we will visualize the numerical variables as histograms, which makes their distributions easier to see.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
    
plt.figure(figsize=(30,40))

for i,var in enumerate(numerical_features[1:]):
    if var == "SalePrice":
        break    
    else:
        
        plt.subplot(6,4,i+1)
    
        plt.hist(d_train[var], bins=50, color = "red", alpha = 0.5, label= "Train Data")
        plt.xlabel(var)
        plt.ylabel("Count")
        plt.legend()
    


        plt.hist(d_test[var], bins=50, color = "green", alpha = 0.5, label= "Test Data")
        plt.legend()

<a id = "5"></a>
## Data Analysis

<a id = "7"></a>


### Handling Outliers

Wikipedia definition: In statistics, an **outlier** is an observation point that is distant from other observations.

Outliers can have many causes, such as a mistake during data collection, or they can simply be a sign of variance in your data.

    1st quartile (Q1): 25%
    2nd quartile (Q2): Median value
    3rd quartile (Q3): 75%

    IQR = Q3 - Q1

    Lower Outlier Limit = Q1 - (1.5 * IQR)
    Higher Outlier Limit = Q3 + (1.5 * IQR)
    
    Values lower than the lower outlier limit or higher than the higher outlier limit are our outliers.



Box plots are a good way to see outliers.
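
As a small worked example of the IQR limits defined above, here is a sketch on a handful of made-up numbers. The values are hypothetical and do not come from the Ames data; 100 is planted as the obvious outlier.

```python
import pandas as pd

# Hypothetical toy values (not from the Ames data); 100 is the planted outlier.
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Keep only the values that fall outside the two limits
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("Limits:", lower_limit, upper_limit)
print("Outliers:", outliers.tolist())
```

Only the planted value falls outside the limits; everything between them would sit inside the whiskers of a box plot.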




In [None]:
outlier_indexes = []
def outlier_plotting(feature):
    """Draw a box plot for `feature`, print its IQR limits and record Z-score outliers."""
    outlier = [] 
    # Plotting section
    plt.figure(figsize=(6,3))
    sns.boxplot(x=d_train[feature], palette="Set3")
    plt.title("{}'s Outlier Box Plot".format(feature), weight = "bold")
    plt.xlabel(feature, weight = "bold")
    plt.show()
    
    # IQR limits (printed for reference only; they are not used for the detection below)
    Q1 = d_train[feature].quantile(0.25)
    Q3 = d_train[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_outlier_limit = Q1 - (1.5 * IQR)
    higher_outlier_limit = Q3 + (1.5 * IQR)
    
    print("Values lower than {} and higher than {} are outliers for {}.\n".format(lower_outlier_limit,higher_outlier_limit,feature))

    # There are different ways to detect outliers. Detection here uses the Z-score method:
    # a value more than `threshold` standard deviations above the mean is flagged (high side only).
    threshold = 3
    # Compute the mean and standard deviation once instead of on every iteration
    mean = d_train[feature].mean()
    std = d_train[feature].std()

    for i in d_train[feature]:
        z = (i - mean) / std
        if z > threshold: 
            outlier.append(i)
            # index[0] is the first row holding this value, so repeated
            # outlier values all map to the same row
            index = d_train[d_train[feature] == i].index[0]
            outlier_indexes.append(index)      
    if outlier == []:
        print("No outliers for {}.".format(feature))
    else:
        print("There are {} outliers for {}:".format(len(outlier),feature), outlier)

In [None]:
# Object-type columns are excluded, because outlier analysis needs numeric data
for i in [col for col in d_train.columns if d_train[col].dtype != 'O']:
    if i != "Id":
        outlier_plotting(i)

In [None]:
d_train.loc[outlier_indexes]

* We can see the full rows that contain outliers.
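
The detection loop above looks up the first row that matches each outlier value. As a possible alternative, the Z-score test can be written in vectorized form so that every flagged row is returned directly. This is only a sketch: the helper name `zscore_outlier_index` and the toy frame are hypothetical, and this is not what the notebook runs.

```python
import pandas as pd

def zscore_outlier_index(frame, feature, threshold=3.0):
    """Return the index labels of rows whose z-score is above `threshold`."""
    col = frame[feature]
    z = (col - col.mean()) / col.std()
    return frame.index[z > threshold]

# Hypothetical toy frame: 40 ordinary values and the same extreme value twice
toy = pd.DataFrame({"area": [50] * 20 + [55] * 20 + [500, 500]})

idx = zscore_outlier_index(toy, "area")
print(list(idx))
```

Both rows holding the extreme value are returned, even though the value is repeated.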

Ready to drop:

In [None]:
d_train = d_train.drop(outlier_indexes,axis = 0).reset_index(drop = True)
# The rows flagged as outliers have been removed.
d_train.info()

<a id = "6"></a>


### Handling Missing Values and Feature Engineering

##### We must combine the train and test datasets, because the following steps must be applied to both together.

In [None]:
alldata = pd.concat([d_train,d_test],axis=0,sort=False)
alldata["SalePrice"].head()

In [None]:
alldata["SalePrice"].tail()

In [None]:
pd.set_option('display.max_rows', 100)
info_count = pd.DataFrame(alldata.isnull().sum(),columns=['Count of NaN'])
dtype = pd.DataFrame(alldata.dtypes,columns=['DataTypes'])
info = pd.concat([info_count,dtype],axis=1)
info


* Now we can see how many NaN values each column contains.

* Columns with fewer than 400 NaN values will be filled with their most common value.

* Columns with 400 or more NaN values will be passed through the Sklearn Label Encoder, so NaN becomes a category of its own. In this way, some of the object-type data is also transformed into numeric data.
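
To make the three strategies concrete, here is a minimal sketch on a toy frame. The column names are borrowed from the dataset, but the values are made up. `pd.factorize` is used here as a stand-in for the Label Encoder so the example needs only pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the real dataset
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, np.nan, 100.0],
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA", "SBrkr"],
    "Alley": [None, "Grvl", None, None, "Pave"],
})

# 1) Numeric gaps: linear interpolation between the neighbouring values
toy["LotFrontage"] = toy["LotFrontage"].interpolate(method="linear")

# 2) Few missing values: fill with the most common value
most_common = toy["Electrical"].value_counts().index[0]
toy["Electrical"] = toy["Electrical"].fillna(most_common)

# 3) Mostly missing: treat "missing" as a category and encode as integers
codes, labels = pd.factorize(toy["Alley"].fillna("Missing"), sort=True)
toy["Alley"] = codes

print(toy)
print(list(labels))
```

After these steps the toy frame has no NaN values left, and the Alley column holds integer codes whose meaning is stored in `labels`.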

In [None]:
from sklearn.preprocessing import LabelEncoder

# Filling 433 LotFrontage values. I will use linear interpolation to fill these NaN values.
# Assigning the result back avoids relying on chained in-place modification.
alldata['LotFrontage'] = alldata['LotFrontage'].interpolate(method='linear')

# Filling other NaNs
for col, nan_count in info['Count of NaN'].items():
    if col in ("Id", "SalePrice", "LotFrontage") or nan_count == 0:
        continue
    elif nan_count < 400:
        # Few missing values: fill with the most common value of the column
        alldata[col] = alldata[col].fillna(alldata[col].value_counts().index[0])
    else:
        # Many missing values: label-encode the column; casting to str first
        # turns NaN into its own "nan" category
        lbl_enc = LabelEncoder()
        alldata[col] = lbl_enc.fit_transform(alldata[col].astype(str))
            

In [None]:
alldata.isna().any().value_counts()

* As you can see, only one column still contains NaN values. That column is SalePrice, and its NaN values come from the test dataset, so everything looks fine.

In [None]:
pd.set_option('display.max_columns', 81)
alldata.head()

* We still have columns of object datatype. We need to convert these columns to numeric form, because we will use machine learning algorithms.
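
One common way to convert object columns is one-hot encoding with `pd.get_dummies`, which replaces a column with one indicator column per category. A minimal sketch on made-up rows (the column names are from the dataset, the values are illustrative):

```python
import pandas as pd

# Hypothetical rows; "Street" is the object column to encode
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# dtype=int gives 0/1 columns instead of booleans
encoded = pd.get_dummies(toy, columns=["Street"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The original column disappears and the new indicator columns are named after the column and the category, such as Street_Grvl and Street_Pave.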

In [None]:
list_ = ["MSZoning", "Street", "LotShape", "LandContour", "Utilities", "LotConfig",
        "LandSlope", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle",
        "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd",
        "MasVnrType", "ExterQual", "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure", 
        "BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC", "CentralAir", "Electrical", "KitchenQual",
        "Functional", "GarageType", "GarageFinish", "GarageQual", "GarageCond", "PavedDrive", "SaleType",
        "SaleCondition"]

for feature in list_:
    alldata[feature]= alldata[feature].astype("category")
    alldata = pd.get_dummies(alldata, columns=[feature])

In [None]:
pd.set_option('display.max_columns', 500)
alldata.head()

In [None]:
pd.set_option('display.max_columns', 500)
alldata.tail()

* Now we can separate train and test data.

In [None]:
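# The train rows come first in alldata, so split at the length of the train set
n_train = len(d_train)
train = alldata[:n_train]
test = alldata[n_train:]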

* Let's take a look at our train data: the correlations and relationships between some variables.
* Correlation is defined as a mutual relationship or connection between two or more things. A negative, or inverse, correlation between two variables indicates that one variable increases while the other decreases. A positive correlation is a relationship in which both variables move in the same direction.
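
A tiny illustration of positive and negative correlation, using made-up numbers (the column names `quality`, `age` and `price` are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical data: price rises with quality and falls with age
toy = pd.DataFrame({
    "quality": [1, 2, 3, 4, 5],
    "age": [50, 40, 30, 20, 10],
    "price": [100, 150, 210, 240, 300],
})

# Pearson correlation of every column with price, highest first
corr_with_price = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)
```

Here `quality` comes out close to +1 and `age` close to -1, which is the same kind of ranking the heatmaps in the next section show for SalePrice.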

Let's look at our data.

<a id = "8"></a>


### Relationship Between Some Variables

In [None]:
import seaborn as sns
corr_new_train=train.corr()

plt.figure(figsize=(3,15))
sns.heatmap(corr_new_train[['SalePrice']].sort_values(by=['SalePrice'],ascending=False).head(30),
            annot_kws={"size": 16, "color": "black"},vmin=-1, cmap='PiYG', annot=True)
plt.title("Positive Correlation Sorting",fontweight="bold", fontsize = 20)
plt.show()

plt.figure(figsize=(3,15))
sns.heatmap(corr_new_train[['SalePrice']].sort_values(by=['SalePrice'],ascending=False).tail(30),
            annot_kws={"size": 16, "color": "black"},vmin=-1, cmap='PiYG', annot=True)
plt.title("Negative Correlation Sorting",fontweight="bold", fontsize = 20)


plt.show()

* OverallQual and GrLivArea have the strongest positive correlation with the sale price.

* ExterQual_TA and GarageFinish_Unf have the strongest negative correlation with the sale price.


Now let's look at the relationship between some features and sale price.

In [None]:
list_ = ["MSZoning", "Street", "LotShape", "LandContour", "Utilities", "LotConfig",
        "LandSlope", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle",
        "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd",
        "MasVnrType", "ExterQual", "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure", 
        "BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC", "CentralAir", "Electrical", "KitchenQual",
        "Functional", "GarageType", "GarageFinish", "GarageQual", "GarageCond", "PavedDrive", "SaleType",
        "SaleCondition"]

numerical_feature = []
for i in train.columns:
    if i not in list_:
        if i == "Id":
            continue
        else:
            numerical_feature.append(i)

        
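# "seaborn-white" was renamed "seaborn-v0_8-white" in matplotlib 3.6; use whichever exists
style_name = "seaborn-v0_8-white" if "seaborn-v0_8-white" in plt.style.available else "seaborn-white"
plt.style.use(style_name)
fig, axes = plt.subplots(18, 2,figsize=(20,80))
fig.subplots_adjust(hspace=0.6)
colors=[plt.cm.prism_r(each) for each in np.linspace(0, 1, len(numerical_feature))]

# zip stops after the 36 available axes, so only the first 36 features are plotted
for i,ax,color in zip(numerical_feature,axes.flatten(),colors):
    
    sns.regplot(x=train[i], y=train["SalePrice"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.8},color=color,ax=ax)
    # Label the subplot being drawn (plt.xlabel would only touch the current axes)
    ax.set_xlabel(i,fontsize=12)
    ax.set_ylabel('SalePrice',fontsize=12)
    ax.set_yticks(np.arange(0,900001,100000))
    ax.set_title('SalePrice'+' - '+str(i),color=color,fontweight='bold',size=20)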



<a id = "19"></a>
## Modelling

First of all, we need to prepare our test and train data.

Our test data contains a SalePrice column that holds only NaN values, so we have to drop this column.
The test data also contains the Id column, which has to be dropped.

In [None]:
test = test.drop("SalePrice", axis=1)
test = test.drop("Id", axis=1)

We will use the train data for modelling, and it must not contain SalePrice or the uninformative Id column. So I separate SalePrice from the train data and store it in "y", applying a log transform (np.log1p) so that the models are fit on log prices. Predictions are converted back to dollars at the end. The remaining columns go into "x".
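
As a quick illustration of the log transform applied to the target, the sketch below uses three made-up prices (hypothetical values, not rows from the dataset) to show that `np.log1p` compresses the range and `np.expm1` undoes it.

```python
import numpy as np

# Hypothetical sale prices in dollars
prices = np.array([34900.0, 163000.0, 755000.0])

log_prices = np.log1p(prices)      # log(1 + price)
recovered = np.expm1(log_prices)   # exp(x) - 1, the inverse of log1p

print(log_prices)
print(recovered)
```

The cheapest and most expensive house differ by a factor of more than 20 in dollars, but by only about 3 units on the log scale.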

In [None]:
y = np.log1p(train['SalePrice'])
x = train.drop(["Id", "SalePrice"], axis=1)

Now I will create some regression models with default parameter values and calculate the RMSE for each.
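
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. The sketch below computes it by hand on made-up numbers, once in dollars and once on the log1p scale. The helper name `rmse` and the values are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical true and predicted prices in dollars
actual = np.array([100000.0, 200000.0, 300000.0])
predicted = np.array([110000.0, 180000.0, 300000.0])

rmse_dollars = rmse(actual, predicted)
rmse_log = rmse(np.log1p(actual), np.log1p(predicted))

print(round(rmse_dollars, 2), round(rmse_log, 4))
```

Because the models in this notebook are trained on log prices, the RMSE values they report are on the log scale, where an error reflects a relative rather than an absolute difference in price.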

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 42)

In [None]:
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV, cross_val_score,StratifiedKFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn import metrics
import random as rd
models = [('LR', LinearRegression()),
          ("Ridge", Ridge()),
          ("Lasso", Lasso()),
          ("ElasticNet", ElasticNet()),
          ('KNN', KNeighborsRegressor()),
          ('CART', DecisionTreeRegressor()),
          ('RF', RandomForestRegressor()),
          ('SVR', SVR()),
          ('GBM', GradientBoostingRegressor()),
          ("XGBoost", XGBRegressor(objective='reg:squarederror')),
          ("LightGBM", LGBMRegressor()),
          ("CatBoost", CatBoostRegressor(verbose=False))]

In [None]:
for name, regressor in models:
    model = regressor
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    MSE = metrics.mean_squared_error(y_test,y_pred)
    RMSE = np.sqrt(MSE)
    print(f"RMSE: {round(RMSE, 4)} ({name})")

<a id = "20"></a>
### Hyperparameter Tuning - Grid Search - Cross Validation

Now it is time for hyperparameter tuning of the models with the lowest root mean squared error among the machine learning algorithms above.
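
Before the full search, here is a minimal sketch of how a grid search with cross-validation works, on synthetic data. The data, the `alpha` grid and the variable names are assumptions made for illustration only; they are not the grids used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: y is a linear function of X plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Every value in the grid is evaluated with 5-fold cross-validation
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 1.0, 100.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
# best_score_ is the mean negative MSE over the folds, so flip the sign before the root
cv_rmse = float(np.sqrt(-search.best_score_))

print(best_alpha, round(cv_rmse, 4))
```

scikit-learn maximizes the score, which is why the MSE is reported as a negative number and has to be negated before taking the square root.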

In [None]:
random_state = 42
classifier = [Ridge(random_state = random_state),
              DecisionTreeRegressor(random_state = random_state),
             LGBMRegressor(random_state = random_state),
              GradientBoostingRegressor(random_state = random_state),
             CatBoostRegressor(verbose=False, random_state = random_state)]

# The "normalize" option was removed from Ridge in scikit-learn 1.2, so only the solver is searched
ridge_param_grid = {"solver" : ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"]}

dtr_param_grid = {"min_samples_split" : range(10,500,20),
                "max_depth": range(1,20,2),
                 "splitter": ["best", "random"]}

lgbmr_param_grid = {"learning_rate": [0.001, 0.01, 0.05],
               "n_estimators": [200, 500, 750],
               "max_depth": [-1, 2, 5],
               "colsample_bytree": [1, 0.50, 0.75]}

# "ls" was renamed "squared_error" (the old name was removed in scikit-learn 1.2)
gbr_param_grid = {"loss": ["squared_error", "huber", "quantile"],
                 "n_estimators":[100,300],
                 "min_samples_split" : range(10,400,50)}

# Start above 0: a learning rate of 0 would not update the model
catboost_param_grid = {"learning_rate": [0.05, 0.1, 0.15, 0.2],
                 "n_estimators":[100, 200, 300]}


classifier_param = [ridge_param_grid,
                   dtr_param_grid,
                   lgbmr_param_grid,
                   gbr_param_grid,
                   catboost_param_grid]

error = []
estimator = []
for i in range(len(classifier)):
    
    model = GridSearchCV(classifier[i],
                            classifier_param[i],
                            cv=10,
                            n_jobs=-1, 
                            scoring = "neg_mean_squared_error",
                            verbose=True).fit(X_train, y_train)
    # Score the tuned estimator; passing the GridSearchCV object itself would
    # repeat the whole grid search inside every fold
    rmse = np.mean(np.sqrt(-cross_val_score(model.best_estimator_, X_train, y_train, cv=5, scoring="neg_mean_squared_error")))
    error.append(rmse)
    estimator.append(str(classifier[i]))

In [None]:
cv_results = pd.DataFrame({"Cross Validation Errors":error, "ML Models":["Ridge", "DecisionTreeRegressor",
                                                                         "LGBMRegressor","GradientBoostingRegressor",
                                                                        "CatBoostRegressor"]})

# x and y must be passed as keywords in recent seaborn versions
g = sns.barplot(x="Cross Validation Errors", y="ML Models", data = cv_results)
g.set_xlabel("RMSE")
g.set_title("Cross Validation Scores")

In [None]:
error_results = pd.DataFrame({"ML Models":["Ridge", "DecisionTreeRegressor",
                                            "LGBMRegressor","GradientBoostingRegressor",
                                          "CatBoostRegressor"], 
                                              'RMSE':error})
error_results

As you can see, CatBoostRegressor has the lowest error. Let's look at which parameters are best for this algorithm.

In [None]:
# Start above 0: a learning rate of 0 would not update the model
catboost_param_grid = {"learning_rate": [0.05, 0.1, 0.15, 0.2],
                 "n_estimators":[100, 200, 300],
                      "max_depth": [3,4,5],
                      "silent": [True]}

CatBoostRegressor_model = GridSearchCV(CatBoostRegressor(random_state = random_state),
                            catboost_param_grid,
                            cv=10,
                            n_jobs=-1,
                            verbose=True).fit(X_train, y_train)

In [None]:
CatBoostRegressor_model.best_params_

Now we can create the final model with CatBoostRegressor and its best parameters.

In [None]:
params = {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300, "silent" : True}

final_model = CatBoostRegressor(**params)

In [None]:
final_model.fit(X_train,y_train)

In [None]:
rmse = np.mean(np.sqrt(-cross_val_score(final_model, X_train, y_train, cv=20, scoring="neg_mean_squared_error")))
rmse

Finally we can predict SalePrice of test data.

In [None]:
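submission_absolute_prices = pd.read_csv("../input/house-prices-advanced-regression-techniques/sample_submission.csv")
y_pred = final_model.predict(test)
# Undo the log1p transform to get prices in dollars
y_pred = np.expm1(y_pred)
# Note: the sample submission holds benchmark values, not the true sale prices,
# so the "Actual" column is only a reference
df = pd.DataFrame({'Actual':submission_absolute_prices["SalePrice"], 'Predicted':y_pred})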

In [None]:
submission_df = pd.DataFrame()
submission_df["Id"] = d_test["Id"] 
submission_df['SalePrice'] = df["Predicted"]
submission_df

In [None]:
submission_df.to_csv('submission.csv', index=False)

## Please upvote and leave a comment if you like my notebook.
# Thanks!