## Housing Prices Advanced Regression

The dataset contains:
* train.csv and test.csv, with about 1,460 rows each
* data_description.txt, which holds the feature descriptions

### Feature Description

In [None]:
# Print every line of the description file that contains a colon
with open('../input/house-prices-advanced-regression-techniques/data_description.txt') as description:
    data = [line for line in description if ":" in line]

for line in data:
    print(line)

## Imports

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings

In [None]:
# set matplotlib defaults
plt.style.use("seaborn-darkgrid")
plt.rc("figure", autolayout = True)
plt.rc("axes", 
      labelweight = "bold",
      labelsize = "large",
      titleweight = "bold",
      titlesize = 14, 
      titlepad = 10)

warnings.filterwarnings('ignore')

In [None]:
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
print("Train Data has ",len(train_data), " rows")

In [None]:
print("Test Data has ", len(test_data), " rows")

In [None]:
train_data.head()

In [None]:
print(train_data.shape)

In [None]:
train_data.info()

In [None]:
s1 = train_data.dtypes

s1.groupby(s1).count()

## Missing Values

In [None]:
train_data.shape

In [None]:
train_data.isnull().sum().sort_values(ascending = False).head(20)

The features PoolQC, MiscFeature, Alley, and Fence are missing for most rows. <br>
We'll drop these features from consideration.
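
To make the "mostly missing" judgment explicit, the share of missing values per column can be compared against a cutoff. The sketch below uses a small made-up frame in place of `train_data`; the names `toy`, `THRESHOLD`, and `to_drop` and the 50% cutoff are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up for illustration).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Fraction of missing values per column, largest first.
missing_frac = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cutoff: drop any column that is more than half empty.
THRESHOLD = 0.5
to_drop = missing_frac[missing_frac > THRESHOLD].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_frac)
print("Dropped:", to_drop)
```

Working with fractions rather than raw counts keeps the same cutoff usable on the train and test sets, which differ slightly in length.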

In [None]:
test_data.head()

Data Type - Test Data

In [None]:
print(test_data.shape)

In [None]:
test_data.info()

In [None]:
test_data.isnull().sum().sort_values(ascending = False)

In [None]:
"""con_data = train_data.copy()


for col in cat_var:
    con_data = con_data.drop(col, axis = 1)
    
    
training_corr = con_data.corr(method = 'spearman')


mask = np.zeros_like(training_corr)
"""

In [None]:
train_data.drop(['Id'], axis = 1, inplace = True)

In [None]:
train_data.head()

In [None]:
train_data.columns

In [None]:
train_data['Alley']

In [None]:
train_data['PoolQC']

In [None]:
train_data['MiscFeature']

In [None]:
train_data['Fence']

#### Most of the values are null, so we drop all four of these variables.

In [None]:
train_data.drop(['Alley', 'PoolQC', 'MiscFeature', 'Fence'], axis = 1, inplace = True)

In [None]:
train_data.head()

In [None]:
test_data.drop(['Id', 'Alley', 'PoolQC', 'MiscFeature', 'Fence'], axis = 1, inplace = True)

In [None]:
test_data.head()

Separating Variables

In [None]:
## Continuous Variables

con_var = train_data.dtypes[train_data.dtypes.values != 'object'].index

con_var

In [None]:
# Categorical Variables

cat_var = train_data.dtypes[train_data.dtypes.values == 'object'].index

cat_var

Correlation Heatmap plot

In [None]:
con_data = train_data.copy()

In [None]:
con_data.drop(cat_var, axis = 1, inplace = True)

In [None]:
mask = np.zeros_like(con_data.corr(method = 'spearman'))
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize = (20,20))
sns.heatmap(con_data.corr(method = "spearman"), cmap = "YlGnBu", linewidths = 0.5, mask = mask)

EnclosedPorch, KitchenAbvGr -> **least correlated** with SalePrice <br>
GarageCars, OverallQual, GrLivArea, FullBath -> **highly correlated** with SalePrice

Top 10 Highly Correlated Continuous Features

In [None]:
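corrs = con_data.corr('spearman')['SalePrice']

# Rank by absolute value so that strong negative correlations are not mistaken for weak ones
corrs_abs = corrs.abs().sort_values(ascending = False)
print(corrs_abs.head(11))  # the first entry is SalePrice itself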

Top 10 Least Correlated Continuous Features

In [None]:
print(corrs_abs.tail(10))
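
A ranking of "most" and "least" correlated features is only meaningful when it is ordered by the magnitude of the correlation, because a strongly negative coefficient is just as informative as a strongly positive one. The sketch below illustrates this on made-up data; `toy`, `target`, and the columns `a`, `b`, `c` are hypothetical names.

```python
import pandas as pd

# Toy data: "b" falls as the target rises, "c" is only loosely related to it.
toy = pd.DataFrame({
    "target": [1, 2, 3, 4, 5, 6],
    "a": [1, 2, 3, 4, 6, 5],
    "b": [6, 5, 4, 3, 2, 1],
    "c": [3, 1, 4, 6, 2, 5],
})

# Spearman correlation of every feature with the target.
corrs = toy.corr(method="spearman")["target"].drop("target")

# Sort by magnitude so strong negative correlations rank at the top.
ranked = corrs.abs().sort_values(ascending=False)
print(ranked)
```

Here `b` has a correlation of -1 with the target and still ranks first, while `c` ends up last.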

## Exploratory Data Analysis

A good exploratory data analysis makes the preprocessing steps smoother and faster, and leads to better models.

I. Let's begin with the **highly correlated feature variables**

In [None]:
# Overall Quality

train_data['OverallQual']

In [None]:
sns.countplot(x = train_data['OverallQual'])

Inference: Most houses are of medium quality (5, 6, 7). Very few are of poor quality, and a sizeable number are of good quality (8, 9, 10).

In [None]:
train_data['GrLivArea']

In [None]:
sns.distplot(train_data['GrLivArea'])

Inference: Most of the values lie between 1000 and 2000 square feet.
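
A claim like this, read off a plot, can be backed by a number. The sketch below computes the share of values inside a range with `Series.between`; the series `area` is made up and merely stands in for `train_data['GrLivArea']`.

```python
import pandas as pd

# Made-up living areas in square feet, standing in for train_data['GrLivArea'].
area = pd.Series([856, 1262, 1710, 1786, 2198, 1694, 2090, 1077, 3222, 1040])

# between() is inclusive on both ends by default.
in_range = area.between(1000, 2000)

# The mean of a boolean series is the fraction of True values.
share = in_range.mean()
print(f"{share:.0%} of the values lie between 1000 and 2000")
```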

In [None]:
train_data['GarageCars']

In [None]:
sns.countplot(x = train_data['GarageCars'])

Inference: Most houses have a garage that holds 1 or 2 cars.

In [None]:
train_data['YearBuilt']

In [None]:
sns.distplot(train_data['YearBuilt'])

Inference: This one is interesting. There are 3 peaks: a surge in the mid-1920s, another in the 1960s, and a third at the start of the 2000s.

In [None]:
train_data['GarageArea']

In [None]:
sns.distplot(train_data['GarageArea'])

Inference: We have 2 major peaks and 2 minor peaks. The first major peak is at 0 GarageArea, meaning no garage. The other major peak is around 500. The two minor peaks, around 250 and 750, indicate that a smaller number of houses have garages of roughly those sizes.

In [None]:
train_data['FullBath']

In [None]:
sns.countplot(x = train_data['FullBath'])

Inference: Houses usually have 1 or 2 full bathrooms. A few have 3.

In [None]:
train_data['TotalBsmtSF']

In [None]:
sns.histplot(train_data['TotalBsmtSF'])

In [None]:
sns.distplot(train_data['TotalBsmtSF'])

Inference: A major chunk of the houses have TotalBsmtSF value in the range of 750 to 1250

In [None]:
train_data['GarageYrBlt']

In [None]:
sns.distplot(train_data['GarageYrBlt'])

Inference: Garage construction rises steadily over time, with a surge in the 1960s and another in the early 2000s.

The graph follows the pattern of the House's year of construction. Let's confirm our hypothesis by plotting the same.

In [None]:
sns.distplot(train_data['GarageYrBlt'], color = 'r');
sns.distplot(train_data['YearBuilt'], color = 'k');


The two distributions largely overlap, which supports our hypothesis. It also suggests that the two features are mostly redundant, so one of them can be dropped. We can drop "GarageYrBlt" and keep "YearBuilt".
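
Overlapping distributions are a hint rather than proof of redundancy, since two columns can share a distribution without agreeing row by row. A row-wise check is sketched below on made-up years; the frame `toy` is hypothetical and only mimics the two columns discussed here.

```python
import pandas as pd

# Made-up years: most garages date from the same year as the house.
toy = pd.DataFrame({
    "YearBuilt":   [1925, 1960, 1965, 1999, 2003, 2005, 1976, 2001],
    "GarageYrBlt": [1925, 1960, 1998, 1999, 2003, 2005, 1976, 2001],
})

# Pearson correlation between the two columns (row-wise agreement).
corr = toy["YearBuilt"].corr(toy["GarageYrBlt"])

# Share of rows where the garage was built in the same year as the house.
same_year = (toy["YearBuilt"] == toy["GarageYrBlt"]).mean()

print(f"correlation: {corr:.2f}, same year: {same_year:.0%}")
```

A high correlation together with a high share of identical years would support dropping one of the two columns.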

In [None]:
train_data['1stFlrSF']

In [None]:
sns.regplot(x = train_data['1stFlrSF'], y = train_data['SalePrice'])

Inference: 1stFlrSF is positively correlated with the SalePrice of the house.

In [None]:
train_data['YearRemodAdd']

In [None]:
sns.distplot(train_data['YearRemodAdd'])

Inference: We can observe a peak in the early 2000s.

II. Coming to least correlated feature variables

In [None]:
train_data['EnclosedPorch']

In [None]:
sns.regplot(x = train_data['EnclosedPorch'], y = train_data['SalePrice'])

Inference: We can confirm the correlation obtained. EnclosedPorch hardly affects the SalePrice

In [None]:
train_data['KitchenAbvGr']

In [None]:
sns.boxplot(x = train_data['KitchenAbvGr'], y = train_data['SalePrice'])

Inference: Heavy class imbalance, with a large number of outliers. The central values may be similar across classes, but the outliers are spread very differently.
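
Class imbalance can be quantified with relative frequencies instead of being judged from the plot alone. The sketch below does this for a made-up series standing in for `train_data['KitchenAbvGr']`; the variable names are hypothetical.

```python
import pandas as pd

# Made-up kitchen counts, standing in for train_data['KitchenAbvGr'].
kitchens = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Relative frequency of each value, most common first.
shares = kitchens.value_counts(normalize=True)

dominant_value = shares.index[0]
dominant_share = shares.iloc[0]
print(f"value {dominant_value} covers {dominant_share:.0%} of the rows")
```

When one value covers nearly all rows, the boxes for the remaining values rest on very few observations and should be read with caution.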

In [None]:
train_data['OverallCond']

In [None]:
sns.boxplot(x = train_data['OverallCond'], y = train_data['SalePrice'])

Inference: There's no clear relationship between OverallCond and SalePrice, which is consistent with the low correlation.

In [None]:
train_data['LowQualFinSF']

In [None]:
sns.scatterplot(x = train_data['LowQualFinSF'], y = train_data['SalePrice'])

Inference: No direct relation.

In [None]:
train_data['MiscVal']

In [None]:
sns.regplot(x = train_data['MiscVal'], y = train_data['SalePrice'])

Inference: No significant relationship between MiscVal and SalePrice.

In [None]:
train_data['BsmtFinSF2']

In [None]:
sns.regplot(x = train_data['BsmtFinSF2'], y = train_data['SalePrice'])

Inference: It is quite evident that BsmtFinSF2 doesn't affect the SalePrice

In [None]:
train_data['YrSold']

In [None]:
sns.pointplot(x = train_data['YrSold'], y = train_data["SalePrice"])

Inference: The year of sale by itself has little influence. However, the drop after 2007 may reflect the housing market crash of 2008.
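
The point plot shows one summary value per year; a grouped table gives the underlying numbers, including how many sales each year rests on. The sketch below uses made-up prices, so the figures say nothing about the real dataset; `toy` and `by_year` are hypothetical names.

```python
import pandas as pd

# Made-up sales, two per year, standing in for train_data.
toy = pd.DataFrame({
    "YrSold": [2006, 2006, 2007, 2007, 2008, 2008, 2009, 2009],
    "SalePrice": [180000, 200000, 185000, 205000, 170000, 190000, 175000, 185000],
})

# Number of sales plus mean and median price for each year.
by_year = toy.groupby("YrSold")["SalePrice"].agg(["count", "mean", "median"])
print(by_year)
```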

In [None]:
train_data['BsmtHalfBath']

In [None]:
sns.countplot(x = train_data['BsmtHalfBath'])

Inference: The vast majority of BsmtHalfBath values are 0. With so little variation, the feature has hardly any impact on SalePrice.

In [None]:
train_data['MSSubClass']

In [None]:
plt.figure(figsize = (20,20))
sns.boxplot(x = train_data['MSSubClass'], y = train_data['SalePrice'])

Inference: MSSubClass and SalePrice show no direct correlation. However, houses with an MSSubClass value of 60 tend to sell at the higher end.

## Notebook in making

### Preprocessing Steps

<br>
1. Loading <br>
2. Cleaning <br>
3. Encoding <br>
4. Imputing <br>
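
One way to chain these steps is to stack the train and test frames, preprocess them together, and split them again afterwards. The sketch below shows that pattern on two made-up frames. Passing `keys` to `pd.concat` is an assumption about how the splits could be kept apart; it matters because frames read with default settings share the same row indices.

```python
import pandas as pd

# Toy splits with overlapping default indices, as read_csv would produce.
train = pd.DataFrame({"LotArea": [8450.0, None], "SalePrice": [208500, 181500]})
test = pd.DataFrame({"LotArea": [None, 11250.0]})

# keys= adds an outer index level, so rows stay attributable to their split.
combined = pd.concat([train, test], keys=["train", "test"])

# Stand-in preprocessing step (hypothetical): fill numeric gaps with 0.
combined["LotArea"] = combined["LotArea"].fillna(0)

# Recover the splits; the test part has no target, so drop the empty column.
train_out = combined.loc["train"]
test_out = combined.loc["test"].drop(columns="SalePrice")

print(train_out)
print(test_out)
```

Preprocessing both splits in one frame keeps the columns and encodings consistent between them.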

In [None]:
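def load_data(df_train, df_test):
    # The raw frames are passed in, since df_train/df_test are not defined elsewhere.
    # The keys keep the splits separable even though their row indices overlap.
    df = pd.concat([df_train, df_test], keys = ['train', 'test'])
    # preprocessing
    df = clean(df)
    df = encode(df)
    df = impute(df)
    
    # reform splits 
    df_train = df.loc['train']
    df_test = df.loc['test']
    return df_train, df_test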

In [None]:
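def clean(df):
    # No cleaning steps yet; return the frame unchanged so the pipeline still chains
    return df


def encode(df):
    # nominal 
    # NOTE: features_nom is not defined in this notebook yet; it should list the nominal columns
    for name in features_nom:
        # Assumed intent: store nominal features as unordered categoricals
        df[name] = df[name].astype('category')
        # 'None' must be a known category before impute() can fill with it
        if 'None' not in df[name].cat.categories:
            df[name] = df[name].cat.add_categories('None')
    return df


def impute(df):
    for name in df.select_dtypes('category'):
        df[name] = df[name].fillna('None')
    
    for name in df.select_dtypes('number'):
        df[name] = df[name].fillna(0)
    
    return df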
        

In [None]:
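def pre_process(train_data, test_data, fillna_dict = None, drop_list = None, convert_list = None, log_list = None, regroup_dict = None):
    # None defaults avoid sharing one mutable dict or list between calls
    # TODO: the body has not been written yet
    pass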

In [None]:
fillna_dict = {
    "Alley" : 'NA',
    "PoolQC" : "NA",
    "LotFrontage" : train_data['LotFrontage'].mean(),
    "MasVnrArea" : 0.0,
    "GarageYrBlt" : 0.0,
    "BsmtFinSF1" : 0.0,
    "BsmtFinSF2" : 0.0,
    "BsmtUnfSF" : 0.0,
    "TotalBsmtSF" : 0.0,
    "BsmtFullBath" : 0.0,
    "BsmtHalfBath" : 0.0,
    "GarageCars" : 0.0,
    "GarageArea" : 0.0,
    "MiscFeature" : "NA",
    "Fence" : "NA",
    "FireplaceQu" : "NA",
    'GarageFinish' : "NA",
    "GarageQual" : "NA",
    "GarageCond" : "NA",
    "GarageType" : "NA",
    "BsmtCond" : "NA",
    "BsmtQual" : "NA",
    "BsmtExposure" : "NA",
    "BsmtFinType1" : "NA",
    "BsmtFinType2" : "NA",
    "MasVnrType" : "None",
    "MSZoning" : train_data['MSZoning'].mode()[0]
}
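
A dictionary like this can be passed directly to `DataFrame.fillna`, which fills each listed column with its own value. The sketch below shows this on a made-up frame; `toy` and `toy_fill` are hypothetical miniatures, not the real data. Keys that name columns absent from the frame are skipped, which is relevant here because some of the listed columns were dropped earlier.

```python
import numpy as np
import pandas as pd

# Made-up frame with one fully missing row.
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageArea": [548.0, np.nan, 642.0],
    "LotFrontage": [65.0, np.nan, 85.0],
})

# Hypothetical miniature of fillna_dict: one fill value per column.
toy_fill = {
    "GarageType": "NA",
    "GarageArea": 0.0,
    "LotFrontage": toy["LotFrontage"].mean(),
    "PoolQC": "NA",  # not a column of toy, so fillna ignores this entry
}

filled = toy.fillna(value=toy_fill)
print(filled)
```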

In [None]:
convert_to_str_list = ['MSSubClass']


log_list = ['BsmtUnfSF', 'LotFrontage', 'LotArea','1stFlrSF', 'GrLivArea', 'TotalBsmtSF', 'GarageArea']
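
Assuming `log_list` names the columns that are meant to be log-transformed, `np.log1p` is a common choice because it stays finite at zero, which several of these area columns contain. The sketch below applies it to a made-up column and maps the values back with `np.expm1`; the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed areas; the zero mimics a house without a garage.
toy = pd.DataFrame({"GarageArea": [0, 250, 480, 500, 520, 540, 1400]})

# log1p computes log(1 + x), so a zero maps to zero instead of -inf.
logged = np.log1p(toy["GarageArea"])

# expm1 is the inverse of log1p and maps the values back.
restored = np.expm1(logged)

print(logged.round(3).tolist())
```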




# Exploratory Data Analysis + Feature Engineering

## Modelling

Regression models considered:
* Ridge
* Lasso
* Elastic Net
* SVR
* Random Forest Regressor
* Gradient Boosting Regressor
* Stacking Regressor
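
As a rough illustration of how the last entry combines the others, the sketch below stacks three of the listed model types with scikit-learn's `StackingRegressor` and scores the result with cross-validated RMSE. The synthetic data, the choice of base models, and all hyperparameters are assumptions for illustration, not this notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the housing features.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Base models produce out-of-fold predictions; the final estimator learns to blend them.
stack_demo = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("lasso", Lasso(alpha=0.1)),
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated RMSE so that a higher score is always better.
neg_rmse = cross_val_score(stack_demo, X_demo, y_demo, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse = -neg_rmse
print(f"RMSE per fold: {np.round(rmse, 2)}")
```

The sign flip is the reason a tuning study built on this scorer has to maximize rather than minimize.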

In [None]:
import optuna
from sklearn.model_selection import KFold, cross_val_score

seed = 21

kfolds = KFold(n_splits = 10, shuffle = True, random_state = seed)


def tune(objective):
    # The objectives return negated RMSE, so a higher score is better
    study = optuna.create_study(direction = 'maximize')
    study.optimize(objective, n_trials = 100)
    
    params = study.best_params
    best_score = study.best_value
    print(f"Best score : {best_score} \nOptimized parameters: {params}")
    return params

In [None]:
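from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def ridge_objective(trial):
    # The search range for alpha is an assumption; the original call left it unspecified
    _alpha = trial.suggest_float("alpha", 0.1, 20)
    
    ridge = Ridge(alpha = _alpha, random_state = seed)
    
    score = cross_val_score(ridge, X, y, cv = kfolds, scoring = 'neg_root_mean_squared_error').mean()
    
    return score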
    

In [None]:
ridge_params = {'alpha' : 7.4910616}

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def randomforest_objective(trial):
    _n_estimators = trial.suggest_int("n_estimators", 50, 200)
    _max_depth = trial.suggest_int("max_depth", 5, 20)
    _min_samp_split = trial.suggest_int('min_samples_split', 2, 10)
    _min_samples_leaf = trial.suggest_int("min_samples_leaf", 2, 10)
    _max_features = trial.suggest_int("max_features", 10, 50)
    
    # Pass the suggested values (underscore-prefixed names) to the model
    rf = RandomForestRegressor(
        max_depth = _max_depth, 
        min_samples_split = _min_samp_split, 
        min_samples_leaf = _min_samples_leaf,
        max_features= _max_features,
        n_estimators = _n_estimators,
        n_jobs = -1,
        random_state = seed
    )
    
    score = cross_val_score(rf, X, y, cv = kfolds, scoring = "neg_root_mean_squared_error").mean()
    
    return score


randomforest_params = tune(randomforest_objective)

In [None]:
corrs = 


Model Performance Comparison

In [None]:
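def cv_rmse(model):
    # cross_val_score returns negated RMSE per fold; flip the sign to get RMSE
    rmse = -cross_val_score(model, X, y, cv = kfolds, scoring = 'neg_root_mean_squared_error')
    return rmse

In [None]:
def compare_models():
    models = {
        'Ridge' : ridge,
        'Lasso' : lasso,
        "Elastic Net" : elasticnet,
        'Gradient Boosting' : gbr,
        "XGBoost" : xgbr, 
        'LightGBM' : lgbr,
        "Stacking" : stack
    }
    
    rows = []
    
    for name, model in models.items():
        score = cv_rmse(model)
        print(f"{name}: {score.mean():.4f} ({score.std():.4f})")
        # one row per fold so the boxplot can show the spread
        rows.extend({'score' : s, 'model' : name} for s in score)
    
    scores = pd.DataFrame(rows, columns = ['score', 'model'])
        
    plt.figure(figsize = (20,10))
    sns.boxplot(data = scores, x = 'model', y = 'score')
    plt.show()
    
compare_models()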

### Submission

In [None]:
print("Predict Submission --> \n")

submission = pd.read_csv("../input/house-prices-advanced-regression-techniques/sample_submission.csv")


submission.iloc[:, 1] = np.expm1(stack.predict(X_test))

submission.to_csv('my_submission.csv', index= False)
