## *Housing Prices Competition for Kaggle Learn Users*

# Important Libraries:

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings("ignore")

# First we analyze the train and test Data.

In [2]:
train_data = pd.read_csv('../input/home-data-for-ml-course/train.csv')
test_data = pd.read_csv('../input/home-data-for-ml-course/test.csv')

* **We check how many columns there are and what types they have, to learn about the newly loaded data.**
* **We also see how many rows in each column are null or non-null.**

In [None]:
train_data.info()

* **To make the data easier to analyze, we split train_data (train.csv) into two parts:**
* **one with the int and float columns and one with the object columns.**
* **"Parts" here means that the table's columns are split into two groups.**

In [3]:
num_val = train_data.select_dtypes(exclude=['object']).copy()
cat_val = train_data.select_dtypes(include=['object']).copy()

* **Checking num_val, which contains the numerical values:**

In [4]:
num_val.head()

* **Next we check how many NaN values each column of num_val contains. Plotting the counts shows that only a few columns have missing values.**
* **We then use the fillna function to fill the missing cells of a column with the mean, median, mode, or any other value we choose.**

**The columns LotFrontage, MasVnrArea, and GarageYrBlt contain NaN values.**

In [5]:
plt.figure(figsize=(12,6))
plt.title('Null Values Count')
num_val.isnull().sum().plot(kind='bar',legend=True)

In [6]:
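# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which newer pandas versions may apply to a temporary copy only.
num_val['LotFrontage'] = num_val['LotFrontage'].fillna(num_val['LotFrontage'].median())
num_val['MasVnrArea'] = num_val['MasVnrArea'].fillna(num_val['MasVnrArea'].median())
num_val['GarageYrBlt'] = num_val['GarageYrBlt'].fillna(num_val['GarageYrBlt'].mode()[0])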

* **Plot histograms to check the skewness of the features.**
* **Skewed columns: LowQualFinSF, 3SsnPorch, PoolArea, MiscVal, LotArea**
* **We apply a log transformation only to columns with continuous, strictly positive values. Columns that contain zeros cannot be log-transformed directly, because log(0) is -infinity and the column would fill up with infinite or NaN values, which we don't want.**
* **So we log-transform LotArea, which is continuous, and drop the other 4 columns.**
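
The point about infinity can be seen on a toy column. This is a minimal sketch, assuming a feature that is mostly zeros (the values below are made up, not taken from train.csv). It also shows `np.log1p`, which is an alternative to dropping such columns and is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy column that mimics a mostly-zero feature (values are made up).
mostly_zero = pd.Series([0.0, 0.0, 0.0, 512.0])

# np.log(0) evaluates to -inf; errstate only silences the divide warning.
with np.errstate(divide="ignore"):
    plain_log = np.log(mostly_zero)

# np.log1p computes log(1 + x), so zeros map to 0 and stay finite.
shifted_log = np.log1p(mostly_zero)

print(plain_log.tolist())
print(shifted_log.tolist())
```

LotArea has no zero entries in a real lot, which is why the plain `np.log` is safe for it.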

In [7]:
fig = plt.figure(figsize=(22,24))
for ind,col in enumerate(num_val):
    plt.subplot(6,7,ind+1)
    sns.histplot(data=num_val.loc[:,col].dropna(),kde=False,color='red')

fig.tight_layout(pad=1.5)

In [8]:
# Log Transformation
num_val['LotArea'] = np.log(num_val['LotArea'])

In [9]:
skewed_cols = ['LowQualFinSF','3SsnPorch','PoolArea','MiscVal']
num_val.drop(skewed_cols,axis=1,inplace=True)

# Now we check the variance of each column, because very high variance can make our model overfit.

* **No column has excessively high variance.**

In [10]:
num_val.var().sort_values(ascending=False)

* **Our data can contain outliers, so we first compute a few statistics.**
* **We calculate the first quartile, the third quartile, and the IQR to identify the outliers.**

In [11]:
# Drop rows that lie above the upper fence (Q3 + 3*IQR) in any column
for col in num_val.columns:
    
    first_quartile = num_val[col].quantile(0.25) 
    third_quartile = num_val[col].quantile(0.75)

    IQR = third_quartile - first_quartile
    out = third_quartile + 3*IQR 
    num_val.drop(num_val[num_val[col] > out].index,axis=0,inplace=True)
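
The rule used in the loop above keeps a value only if it is at most Q3 + 3 × IQR. Note that only the upper side is filtered, and that a row is removed from the whole frame as soon as one of its columns exceeds the fence. A small worked example on made-up numbers:

```python
import pandas as pd

# Made-up values with one obvious outlier at the end.
values = pd.Series([100, 110, 120, 130, 140, 150, 1000])

q1 = values.quantile(0.25)      # 115.0
q3 = values.quantile(0.75)      # 145.0
iqr = q3 - q1                   # 30.0
upper_fence = q3 + 3 * iqr      # 235.0

# Keep everything at or below the fence; 1000 is removed.
kept = values[values <= upper_fence]
print(upper_fence, kept.tolist())
```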


In [14]:
fig = plt.figure(figsize=(22,24))
for ind,col in enumerate(num_val):
    plt.subplot(6,7,ind+1)
    sns.boxplot(x=num_val.loc[:,col],data=num_val)

* **Check the correlation between the columns of our data.**
* **Columns whose correlation with another column is too high (above 0.75) are candidates to drop.**

In [15]:
plt.figure(figsize=(22,24))
sns.heatmap(num_val.corr() > 0.75,annot=True)

In [16]:
num_val.drop(['GarageCars','Id'],axis=1,inplace=True)
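
A boolean heatmap of this size can be hard to read. As a rough alternative, the highly correlated pairs can be listed directly. The sketch below uses a made-up three-column frame (`a`, `b`, `c` are hypothetical names); the 0.75 threshold is the one used above.

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' is almost a copy of 'a', 'c' is unrelated (values are made up).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
})

corr = df.corr().abs()

# Keep only the part above the diagonal so each pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Comparisons with NaN are False, so masked cells never pass the filter.
high_pairs = pairs[pairs > 0.75]
print(high_pairs)
```

On the real data, `df` would be replaced by `num_val`.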

# Now we Check our Categorical Data

* **First we check which columns contain NaN values.**
* **Some columns have so many NaN values that filling them would not help, so we drop them.**
* **Dropped columns: Alley, PoolQC, MiscFeature, Fence**

In [18]:
plt.figure(figsize=(22,24))
cat_val.isnull().sum().plot(kind='bar',legend=True,color='forestgreen')

In [19]:
drop_col = ['Alley','PoolQC','MiscFeature','Fence']
cat_val.drop(drop_col,axis=1,inplace=True)
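
The columns above were picked by reading the bar chart. The same choice can be sketched in code by computing the share of missing values per column. The frame and the 0.5 cut-off below are assumptions for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

# Toy categorical frame (made-up rows).
df = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Fraction of missing cells per column: Alley 0.75, Electrical 0.25, Street 0.0.
missing_share = df.isnull().mean()

threshold = 0.5  # hypothetical cut-off
sparse_cols = missing_share[missing_share > threshold].index.tolist()
print(sparse_cols)
```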

**We fill the remaining columns with the mode. The mean and median are not defined for categorical data, so the mode is the natural choice for categorical columns.**

In [20]:
# Fill every remaining categorical column with its most frequent value
fill_cols = ['MasVnrType','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1',
             'BsmtFinType2','FireplaceQu','GarageType','GarageFinish',
             'GarageQual','GarageCond','Electrical']
for col in fill_cols:
    cat_val[col] = cat_val[col].fillna(cat_val[col].mode()[0])

**Now that we have analyzed the data on the two split copies, we apply the same steps to train_data and test_data.**

In [21]:
columns = ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
       'ScreenPorch', 'MoSold', 'YrSold']

# In this cell we apply all of the steps above to make train_data cleaner and more informative

In [None]:
# Numerical :

train_data['LotFrontage'] = train_data['LotFrontage'].fillna(train_data['LotFrontage'].median())
train_data['MasVnrArea'] = train_data['MasVnrArea'].fillna(train_data['MasVnrArea'].median())

# Log transformation: applied to the skewed, continuous LotArea column

train_data['LotArea'] = np.log(train_data['LotArea'])

# Drop the skewed columns; GarageYrBlt is dropped here as well instead of being imputed

skewed_cols = ['LowQualFinSF','3SsnPorch','PoolArea','MiscVal','GarageYrBlt']
train_data.drop(skewed_cols,axis=1,inplace=True)

# Finding Outliers 
for col in columns:
    
    first_quartile = train_data[col].quantile(0.25) 
    third_quartile = train_data[col].quantile(0.75)

    IQR = third_quartile - first_quartile
    out = third_quartile + 3*IQR 
    train_data.drop(train_data[train_data[col] > out].index,axis=0,inplace=True)



train_data.drop(['Id','GarageCars'],axis=1,inplace=True)

# Categorical : 


drop_col = ['Alley','PoolQC','MiscFeature','Fence']
train_data.drop(drop_col,axis=1,inplace=True)

# Fill the remaining categorical columns with their most frequent value
fill_cols = ['MasVnrType','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1',
             'BsmtFinType2','FireplaceQu','GarageType','GarageFinish',
             'GarageQual','GarageCond','Electrical']
for col in fill_cols:
    train_data[col] = train_data[col].fillna(train_data[col].mode()[0])


* **We now apply the last preprocessing step, LabelEncoder.**
* **LabelEncoder encodes the categorical data as numbers.**
* **We encode because our machine learning model cannot work with string data; it only accepts numerical data.**

In [None]:
cat_columns = ['MSZoning','LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
       'BldgType', 'HouseStyle', 'RoofStyle','Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'HeatingQC', 'CentralAir','KitchenQual','Functional', 'FireplaceQu', 
        'GarageType', 'GarageFinish', 'GarageQual','Condition2',
       'GarageCond', 'PavedDrive', 'Electrical','Street','RoofMatl', 'Heating',
       'SaleType', 'SaleCondition']


In [None]:
le = LabelEncoder()
for col in cat_columns:
    train_data[col] = le.fit_transform(train_data[col])

# The same preprocessing for test_data (without dropping outlier rows)

In [None]:
# Numerical :

test_data['LotFrontage'] = test_data['LotFrontage'].fillna(test_data['LotFrontage'].median())
test_data['MasVnrArea'] = test_data['MasVnrArea'].fillna(test_data['MasVnrArea'].median())


# Log transformation: applied to the skewed, continuous LotArea column

test_data['LotArea'] = np.log(test_data['LotArea'])

# Drop the skewed columns; GarageYrBlt is dropped here as well instead of being imputed

skewed_cols = ['LowQualFinSF','3SsnPorch','PoolArea','MiscVal','GarageYrBlt']
test_data.drop(skewed_cols,axis=1,inplace=True)

test_data.drop(['Id','GarageCars'],axis=1,inplace=True)

# Categorical : 


drop_col = ['Alley','PoolQC','MiscFeature','Fence']
test_data.drop(drop_col,axis=1,inplace=True)

# Fill the remaining categorical columns with their most frequent value
fill_cols = ['MasVnrType','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1',
             'BsmtFinType2','FireplaceQu','GarageType','GarageFinish',
             'GarageQual','GarageCond','Electrical']
for col in fill_cols:
    test_data[col] = test_data[col].fillna(test_data[col].mode()[0])

In [None]:
le = LabelEncoder()
for col in cat_columns:
    test_data[col] = le.fit_transform(test_data[col])
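
One caveat with the two encoding cells above: `fit_transform` is called separately on train_data and test_data, so the same category can receive a different integer in each frame whenever the two frames do not contain exactly the same set of values. The sketch below illustrates this on made-up data and shows one possible remedy, fitting a single encoder on the values of both frames. The frames `train` and `test` here are toy stand-ins, not the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames: the test frame happens to contain only one of the two categories.
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test = pd.DataFrame({"Street": ["Pave", "Pave"]})

# Separate fits: 'Pave' becomes 1 in train (classes Grvl, Pave) but 0 in test.
separate_train = LabelEncoder().fit_transform(train["Street"])
separate_test = LabelEncoder().fit_transform(test["Street"])

# Shared fit: one encoder learns the categories of both frames,
# so 'Pave' maps to the same integer everywhere.
shared = LabelEncoder().fit(pd.concat([train["Street"], test["Street"]]))
shared_train = shared.transform(train["Street"])
shared_test = shared.transform(test["Street"])

print(separate_train, separate_test)
print(shared_train, shared_test)
```

Applying this to the notebook would mean fitting one encoder per column in `cat_columns` before either frame is transformed.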

**Some numerical features of test_data still contain NaN values, so we fill them with fillna using the median.**

In [None]:
# Fill the remaining numerical gaps in test_data with each column's median
for col in ['BsmtFinSF1','BsmtUnfSF','TotalBsmtSF','BsmtFullBath','GarageArea']:
    test_data[col] = test_data[col].fillna(test_data[col].median())

In [None]:
train_data

In [None]:
# 'features' is used instead of 'input' so the Python built-in input() is not shadowed
features = train_data.drop(['SalePrice'],axis=1)
target = train_data.SalePrice

# Split the data into train and test parts so we can train and evaluate our model

**We use an 80/20 split, which I chose after several trials.**

In [None]:
x_train,x_test,y_train,y_test = train_test_split(features,target,test_size=0.2)

**I used the XGBRegressor model, which is a powerful one, and applied hyperparameter tuning.**

In [None]:
para = {
    'n_estimators': [300,200,100,400,500,600,700,900],
    'max_depth' :[6,3,7,8,9,5,4],
    'learning_rate' : [0.1,0.2,0.01,0.001,0.0001,0.02,0.002,0.5],
    'subsample': [0.1,0.2,0.5,0.9,0.8,0.05],
    'alpha': [10,14,20,22,30],
    'booster' : ['gbtree','gblinear'],
    'min_child_weight': [2,3,4,5,6,7,8,9],
    'colsample_bytree' : [0.5,0.6,0.55,0.85,0.68,0.9,1,0.7]
}

xg = xgb.XGBRegressor()

**RandomizedSearchCV is used to choose the best hyperparameters for XGBRegressor.**

In [None]:
mod = RandomizedSearchCV(estimator=xg,param_distributions=para,cv = 5 , n_iter = 20)
mod.fit(x_train,y_train)

In [None]:
mod.score(x_test,y_test)
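
For a regressor, the `score` call above returns the coefficient of determination (R²), not an error in price units. The first cell imports `mean_absolute_error` but never uses it; the sketch below shows how the two metrics compare on made-up prices. On the real split, the inputs would be `y_test` and `mod.predict(x_test)`.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up sale prices and predictions, for illustration only.
y_true = [200000, 150000, 300000]
y_pred = [210000, 140000, 330000]

# Average absolute miss: (10000 + 10000 + 30000) / 3
mae = mean_absolute_error(y_true, y_pred)

# Share of the variance in y_true that the predictions explain.
r2 = r2_score(y_true, y_pred)

print(round(mae, 2), round(r2, 3))
```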

**These are the best parameters found for XGBRegressor on our data, within the parameter ranges we supplied.**

In [None]:
mod.best_params_

In [None]:
pred = mod.predict(test_data)
sub = pd.read_csv('../input/home-data-for-ml-course/sample_submission.csv')
sub['SalePrice'] = pred
sub.to_csv('submission.csv',index=False)