**Hi all, this is my first Kaggle notebook**

### Nomenclature used in this notebook:

#### train - train data from Kaggle
#### test - test data from Kaggle
#### X - independent variables (columns) from train data
#### y - dependent variable (column) from train data
#### data - combination of train and test data
#### X_ - train data after treating missing values
#### test_ - test data after treating missing values
#### X_scaled - scaled data from X_
#### x_train_90, x_test_10, y_train_90, y_test_10 - train test split data with test_size=0.10 from X_
#### x_train_75, x_test_25, y_train_75, y_test_25 - train test split data with test_size=0.25 from X_
#### x_train_scaled_90, x_test_scaled_10, y_train_scaled_90, y_test_scaled_10 - train test split where test_size=0.10 data from X_scaled
#### x_train_scaled_75, x_test_scaled_25, y_train_scaled_75, y_test_scaled_25 - train test split where test_size=0.25 data from X_scaled
#### score_test - list holding the scores of all models on their held-out test splits
#### model - name of models/algorithms
#### best_model - name of the best model
#### y_predict_best - predictions of the best model, used for the submission

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
    

In [None]:
#Let's load our data
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
train.shape, test.shape

In [None]:
train.head()

In [None]:
test.head()

##### OK, so there are 81 columns in total, which is a lot of variables.

In [None]:
train.count()

##### `count()` returns the number of non-null entries per column. Several columns have fewer entries than the total number of rows, which means they contain missing values.

> ## Starting preliminary analysis of data

In [None]:
train.dtypes

##### There are int, float and object data types.

In [None]:
train.describe(include = "all")

In [None]:
train.info()

#### This gives us a quick overview of the dataset

> ## Identifying and dealing with Missing values

In [None]:
train.isnull()

##### There are many True values in the output, so the dataset has plenty of missing values.

In [None]:
#Count the total number of missing values across all rows and columns
train.isnull().sum().sum()

In [None]:
test.isnull().sum().sum()

##### **So there are 6965 null values in train and 7000 in test.**
##### Let's look at the null values in each column

In [None]:
train.isnull().sum()

##### As there are 81 columns, it's difficult to display all the columns, let's try to display only those columns which have null values.
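
One way to show only the columns that contain nulls is to filter the per-column counts. The following is a minimal sketch on a made-up toy frame (the `toy` data and the variable names are assumptions for illustration, not part of the Kaggle data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kaggle data (values are made up)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Alley": [np.nan, "Grvl", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0],
})

# Number of missing values per column
null_counts = toy.isnull().sum()

# Keep only the columns with at least one missing value, largest count first
columns_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
print(columns_with_nulls)
```

The same two lines should work on `train`, `test` or the combined data, since they only rely on `isnull().sum()`.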

### Before going further, let's combine the train and test data so that we can treat them together.
##### Before combining them, we add a column named 'Type' to distinguish train rows from test rows.

In [None]:
#Split train into X and y so that SalePrice can be kept separate for training later
y = train.SalePrice
X = train.drop(columns=["SalePrice"])

In [None]:
y.shape, X.shape, test.shape

In [None]:
X['Type'] = 'train'
test['Type'] = 'test'
#test['SalePrice'] = -1
data = pd.concat([X, test])  # DataFrame.append was removed in pandas 2.0

In [None]:
data.isnull().sum().sum()

##### **So now there are 6965 + 7000 = 13965 null values in total.**
##### As mentioned above, let's find the specific columns that have null values.

In [None]:
columns_having_null_values = data[data.columns[data.isnull().sum()>0]]
columns_having_null_values

##### **So 34 of the 81 columns have null values, which narrows the task down considerably.**

## Now comes the most tedious part: dealing with the missing values

In [None]:
# We have to check what values are there in the table so that we can fill values according to real world scenario.
data['Electrical'].value_counts()

##### In the values above, we can see that "SBrkr" is the most common 'Electrical' category. We can't put "None" in the null values here, because a house must have some electrical system. So we will fill the null values in this column with "SBrkr" (spelled exactly as in the data, otherwise a new category would be created).

In [None]:
data['Electrical'] = data['Electrical'].fillna("SBrkr")

##### Now we have to repeat this task for every column with null/NaN values (that's why I called it the tedious part).
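
For columns where a missing value means "unknown" rather than "feature absent", the most frequent category can also be looked up programmatically instead of being typed by hand. This is only a sketch: `fill_with_mode` is a hypothetical helper and `toy` is made-up data, and columns where NaN means "no garage", "no pool" and so on still need a hand-picked label.

```python
import numpy as np
import pandas as pd

def fill_with_mode(df, columns):
    """Hypothetical helper: fill each listed column with its most frequent value."""
    filled = df.copy()
    for col in columns:
        # mode() ignores NaN by default; take the first value if there are ties
        most_common = filled[col].mode().iloc[0]
        filled[col] = filled[col].fillna(most_common)
    return filled

# Made-up data with one missing value per column
toy = pd.DataFrame({
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
    "KitchenQual": ["TA", np.nan, "TA", "Gd"],
})

filled = fill_with_mode(toy, ["Electrical", "KitchenQual"])
print(filled)
```

Looking the value up this way also avoids spelling mistakes in category names.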

In [None]:
data['MSZoning'].value_counts()

#Fill value for every remaining column with missing entries.
#Categorical columns get either a frequent category or a label meaning "feature absent";
#numeric columns get 0, except LotFrontage, which gets the column mean.
fill_values = {
    'MSZoning': "RL",
    'LotFrontage': data['LotFrontage'].mean(),
    'Alley': "Nothing",
    'Utilities': "AllPub",
    'Exterior1st': "VinylSd",
    'Exterior2nd': "VinylSd",
    'MasVnrArea': 0,
    'MasVnrType': "None",
    'BsmtCond': "No",
    'BsmtExposure': "NB",
    'BsmtFinType1': "NB",
    'BsmtFinSF1': 0.0,
    'BsmtFinSF2': 0.0,
    'BsmtUnfSF': 0.0,
    'TotalBsmtSF': 0.0,
    'BsmtFullBath': 0.0,
    'BsmtHalfBath': 0.0,
    'KitchenQual': "TA",
    'Functional': "Typ",
    'FireplaceQu': "None",
    'GarageType': "No",
    'GarageYrBlt': 0,
    'GarageFinish': "No",
    'GarageCars': 0,
    'GarageArea': 0,
    'GarageQual': "No",
    'GarageCond': "No",
    'PoolQC': "No",
    'Fence': "No",
    'MiscFeature': "No",
    'SaleType': "Con",
    'SaleCondition': "None",
    'BsmtQual': "TA",
    'BsmtFinType2': "Unf",
}

#fillna with a dict fills each column with its own value;
#assigning the result avoids chained inplace calls, which newer pandas versions warn about
data = data.fillna(value=fill_values)

##### Now let's check how many null values remain.

In [None]:
data.isnull().sum().sum()

> ## Hola, we have treated all the null values.

##### Let's deal with different types of data types in the dataset

In [None]:
int_columns = data[data.columns[data.dtypes=='int']]
int_columns.columns

In [None]:
data['MSZoning'].unique()

In [None]:
object_columnns = data[data.columns[data.dtypes=='object']]
object_columnns.columns

In [None]:
float_columns = data[data.columns[data.dtypes=='float']]
float_columns.columns

> ## Data preprocessing

In [None]:
data.var(numeric_only=True)  # variance is only defined for the numeric columns

In [None]:
corr_matrix = data.corr(numeric_only=True)  # object columns are excluded
corr_matrix

##### The correlation matrix is symmetric and its diagonal elements are all 1, so the upper triangular part (excluding the diagonal) is enough.

In [None]:
upper_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))  # np.bool was removed from NumPy
upper_matrix

In [None]:
#Dropping columns with high correlation
drop_columns = [col for col in upper_matrix.columns if any(upper_matrix[col] > 0.85)]
drop_columns

In [None]:
data.drop(columns=drop_columns, inplace=True)
data.head()
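
The upper-triangle trick is easier to follow on a small example. The sketch below uses made-up data (`toy`, `a`, `a_copy` and `b` are assumptions for illustration) in which one column is nearly a copy of another, and applies the same 0.85 threshold as above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),  # almost perfectly correlated with "a"
    "b": rng.normal(size=200),                            # independent of "a"
})

corr = toy.corr()

# k=1 keeps only the entries strictly above the diagonal,
# so every pair of columns appears exactly once and the diagonal of 1s is ignored
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates above the threshold with any EARLIER column
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Because only the later column of each pair is flagged, one representative of every highly correlated pair is kept. Note that the `> 0.85` test catches only strong positive correlations; comparing `upper.abs()` instead would also catch strong negative ones.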

> ## Label Encoding the categorical variables

In [None]:
from sklearn.preprocessing import LabelEncoder
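#Encode every object column as integers.
#LabelEncoder sorts the classes alphabetically, so in 'Type' the value 'test' becomes 0 and 'train' becomes 1.
for i in object_columnns:
    label = LabelEncoder()
    data[i] = label.fit_transform(data[i].values)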

In [None]:
object_columnns = data[data.columns[data.dtypes=='object']]
object_columnns.columns

In [None]:
int_columns = data[data.columns[data.dtypes=='int']]
int_columns.columns

In [None]:
data.head()

##### Now all the object columns have been converted to integers.
#### Let's split the data back into train and test

In [None]:
X_ = data[data.Type==1]
X_ = X_.drop(["Type"], axis=1)

test_ = data[data.Type==0]
test_ = test_.drop(["Type"], axis=1)
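
The `Type==1` and `Type==0` filters rely on how `LabelEncoder` numbers its classes: it sorts them, so 'test' comes before 'train'. A minimal sketch that checks this mapping (the `mapping` dictionary is just an illustration):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["train", "test", "train", "test"])

# classes_ holds the sorted original labels; the position of each label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Inspecting `classes_` like this is a quick way to confirm which code belongs to which label before filtering on it.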

In [None]:
X_.shape, y.shape, test_.shape 

## Scaling
##### Scaling is needed because the columns in the dataset differ greatly in magnitude. Without scaling, columns with large values can have a disproportionate impact on models that are sensitive to feature scale.

In [None]:
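from sklearn import preprocessing
names = X_.columns
#Note: normalize() rescales each row (sample) to unit L2 norm by default; it does not scale column by column
prepro = preprocessing.normalize(X_)
X_scaled = pd.DataFrame(prepro, columns=names)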
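
It may help to see how `normalize` differs from a column-wise scaler such as `MinMaxScaler`. The sketch below uses a made-up two-row array (`toy` is an assumption for illustration) where the second house has exactly twice the values of the first:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Two made-up samples; the second row is exactly twice the first
toy = np.array([[3.0, 4000.0],
                [6.0, 8000.0]])

# normalize() works per ROW: every sample is divided by its own L2 norm
row_normalized = normalize(toy)
row_norms = np.linalg.norm(row_normalized, axis=1)

# MinMaxScaler works per COLUMN: every feature is mapped to the range [0, 1]
col_scaled = MinMaxScaler().fit_transform(toy)

print(row_normalized)
print(col_scaled)
```

After row-wise normalization the two rows become identical, so the size difference between the two houses is lost, while the column-wise scaler keeps it. If the goal is to put columns on a comparable scale, a column-wise scaler is probably the closer match.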

In [None]:
X_scaled.head()

In [None]:
#from sklearn.preprocessing import MinMaxScaler
#minmaxscaler = MinMaxScaler()
#x_scaled = minmaxscaler.fit_transform(X_)

In [None]:
#We can do Scaling directly with formula shown below but we have pre-defined libraries so we will use them.. 
#x_scaled_formula = X_.copy()
#for cols in x_scaled_formula.columns:
#    x_scaled_formula[cols] = x_scaled_formula[cols] / x_scaled_formula[cols].abs().max()

In [None]:
#x_scaled_formula.head()

> ## Data visualization

In [None]:
#Scatterplot

import seaborn as sns
import matplotlib.pyplot as plt

sns.set()
cols = ['OverallQual', 'TotalBsmtSF', 'YearBuilt']
sns.pairplot(X_[cols], height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
plt.show()


In [None]:
#Correlation matrix

corrmatrix = X_.corr()
f, ax = plt.subplots(figsize=(15, 9))
sns.heatmap(corrmatrix, vmax=.8, square=True);

> ## Modelling aka ML
##### There are many regression algorithms to choose from, so we try several of them and pick the one that scores best.

In [None]:
#Creating lists to collect all the model names and their scores together

score_test = []
#score_train = []
model = []


#### Let's do the train_test_split first

### In version 28 of this notebook, I am extending the train_test_split step to compare results across split sizes.

In [None]:
from sklearn.model_selection import train_test_split
x_train_90, x_test_10, y_train_90, y_test_10 = train_test_split(X_, y, test_size=0.10, random_state=1)

x_train_75, x_test_25, y_train_75, y_test_25 = train_test_split(X_, y, test_size=0.25, random_state=1)

In [None]:
from sklearn.model_selection import train_test_split
x_train_scaled_90, x_test_scaled_10, y_train_scaled_90, y_test_scaled_10 = train_test_split(X_scaled, y, test_size=0.10, random_state=1)

x_train_scaled_75, x_test_scaled_25, y_train_scaled_75, y_test_scaled_25 = train_test_split(X_scaled, y, test_size=0.25, random_state=1)

### *  Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
model_randomforest_train90 = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=13)
model_randomforest_train90.fit(x_train_90, y_train_90)

model_randomforest_train75 = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=13)
model_randomforest_train75.fit(x_train_75, y_train_75)

model_randomforest_scaled_train90 = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=13)
model_randomforest_scaled_train90.fit(x_train_scaled_90, y_train_scaled_90)

model_randomforest_scaled_train75 = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=13)
model_randomforest_scaled_train75.fit(x_train_scaled_75, y_train_scaled_75)

In [None]:
score_test.append(model_randomforest_train90.score(x_test_10, y_test_10))
model.append("model_randomforest_train90")

score_test.append(model_randomforest_train75.score(x_test_25, y_test_25))
model.append("model_randomforest_train75")

score_test.append(model_randomforest_scaled_train90.score(x_test_scaled_10, y_test_scaled_10))
model.append("model_randomforest_scaled_train90")

score_test.append(model_randomforest_scaled_train75.score(x_test_scaled_25, y_test_scaled_25))
model.append("model_randomforest_scaled_train75")
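
For scikit-learn regressors, `.score()` returns the coefficient of determination (R²) on the data passed in, so higher is better and 1.0 is a perfect fit. A minimal sketch on made-up data (`X_toy`, `y_toy` and `reg` are assumptions for illustration) that compares it with `r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up regression data: a linear signal plus a little noise
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Fit on the first 80 rows, evaluate on the last 20
reg = LinearRegression().fit(X_toy[:80], y_toy[:80])

score = reg.score(X_toy[80:], y_toy[80:])
manual_r2 = r2_score(y_toy[80:], reg.predict(X_toy[80:]))
print(score, manual_r2)
```

Both numbers agree, which is why the values collected in `score_test` can be compared directly across models evaluated on the same split.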

### * XGBoost

In [None]:
import xgboost as xgb

#Shared hyperparameters for all four XGBoost models.
#verbosity=0 and n_jobs=-1 replace the deprecated silent=1 and nthread=-1.
xgb_params = dict(colsample_bytree=0.4603, gamma=0.0468,
                  learning_rate=0.05, max_depth=3,
                  min_child_weight=1.7817, n_estimators=2200,
                  reg_alpha=0.4640, reg_lambda=0.8571,
                  subsample=0.5213, verbosity=0,
                  random_state=7, n_jobs=-1)

model_xgboost_train90 = xgb.XGBRegressor(**xgb_params)
model_xgboost_train90.fit(x_train_90, y_train_90)

model_xgboost_train75 = xgb.XGBRegressor(**xgb_params)
model_xgboost_train75.fit(x_train_75, y_train_75)

model_xgboost_scaled_train90 = xgb.XGBRegressor(**xgb_params)
model_xgboost_scaled_train90.fit(x_train_scaled_90, y_train_scaled_90)

model_xgboost_scaled_train75 = xgb.XGBRegressor(**xgb_params)
model_xgboost_scaled_train75.fit(x_train_scaled_75, y_train_scaled_75)

In [None]:
#score_train.append(model_xgboost.score(x_train, y_train))
#model.append("model_xgboost")

In [None]:
score_test.append(model_xgboost_train90.score(x_test_10, y_test_10))
model.append("model_xgboost_train90")

score_test.append(model_xgboost_train75.score(x_test_25, y_test_25))
model.append("model_xgboost_train75")

score_test.append(model_xgboost_scaled_train90.score(x_test_scaled_10, y_test_scaled_10))
model.append("model_xgboost_scaled_train90")

score_test.append(model_xgboost_scaled_train75.score(x_test_scaled_25, y_test_scaled_25))
model.append("model_xgboost_scaled_train75")

### * Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
model_decisiontree_train90 = DecisionTreeRegressor(random_state=0)
model_decisiontree_train90.fit(x_train_90, y_train_90)

model_decisiontree_train75 = DecisionTreeRegressor(random_state=0)
model_decisiontree_train75.fit(x_train_75, y_train_75)

model_decisiontree_scaled_train90 = DecisionTreeRegressor(random_state=0)
model_decisiontree_scaled_train90.fit(x_train_scaled_90, y_train_scaled_90)

model_decisiontree_scaled_train75 = DecisionTreeRegressor(random_state=0)
model_decisiontree_scaled_train75.fit(x_train_scaled_75, y_train_scaled_75)

In [None]:
score_test.append(model_decisiontree_train90.score(x_test_10, y_test_10))
model.append("model_decisiontree_train90")

score_test.append(model_decisiontree_train75.score(x_test_25, y_test_25))
model.append("model_decisiontree_train75")

#The models trained on scaled data must be scored on the scaled test splits
score_test.append(model_decisiontree_scaled_train90.score(x_test_scaled_10, y_test_scaled_10))
model.append("model_decisiontree_scaled_train90")

score_test.append(model_decisiontree_scaled_train75.score(x_test_scaled_25, y_test_scaled_25))
model.append("model_decisiontree_scaled_train75")

### * LASSO 

In [None]:
from sklearn.linear_model import Lasso
model_lasso_train90 = Lasso(alpha=0.0005)
model_lasso_train90.fit(x_train_90, y_train_90)

model_lasso_train75 = Lasso(alpha=0.0005)
model_lasso_train75.fit(x_train_75, y_train_75)

In [None]:
score_test.append(model_lasso_train90.score(x_test_10, y_test_10))
model.append("model_lasso_train90")

score_test.append(model_lasso_train75.score(x_test_25, y_test_25))
model.append("model_lasso_train75")

### Collect all the models and scores together

In [None]:
final_scores = pd.DataFrame()
final_scores['model_name'] = model
final_scores['score_test'] = score_test
final_scores

### Let's find out which model scored the best

In [None]:
best_index = score_test.index(max(score_test))
best_model = final_scores['model_name'][best_index]
best_model

### And now predict the test data with best model

In [None]:
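#best_model holds only the model's name (a string), so the fitted estimator is referenced directly here.
#Update this line if a different model scores best.
y_predict_best = model_xgboost_train90.predict(test_)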

In [None]:
#Working on this statement
#y_predict_bestmodel = best_model.name.predict(test_)
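
The commented-out statement cannot work as written because `best_model` is a string, not a fitted estimator. One possible approach is to keep the fitted models in a dictionary keyed by name and look the best one up. This is only a sketch on made-up data: `fitted_models`, `scores`, `best_name` and the toy arrays are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up linear data, split into a fitting part and a hold-out part
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(120, 2))
y_toy = 3.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.05, size=120)
X_fit, X_holdout = X_toy[:90], X_toy[90:]
y_fit, y_holdout = y_toy[:90], y_toy[90:]

# Hypothetical registry: model name -> fitted estimator
fitted_models = {
    "model_linear": LinearRegression().fit(X_fit, y_fit),
    "model_decisiontree": DecisionTreeRegressor(random_state=0).fit(X_fit, y_fit),
}

# Score every model on the hold-out data and pick the name with the highest score
scores = pd.Series({name: est.score(X_holdout, y_holdout)
                    for name, est in fitted_models.items()})
best_name = scores.idxmax()

# The name now leads straight back to the estimator, so predict() can be called on it
best_estimator = fitted_models[best_name]
predictions = best_estimator.predict(X_holdout)
print(best_name)
```

In this notebook, the same idea would mean storing each fitted model under the name that is appended to `model`. A model trained on scaled data would also need the test data scaled in the same way before predicting.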

> ### Submission

In [None]:
result = pd.DataFrame()
result['Id'] = test['Id']
result['SalePrice'] = y_predict_best

In [None]:
result.head()

In [None]:
result.to_csv('submission.csv', index=False)