# House Prices - Advanced Regression Techniques
### Predict sales prices and practice feature engineering, RFs, and gradient boosting

### Goal
Predict the sales price for each house. For each Id in the test set, predict the value of the SalePrice variable. 

### Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
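
As a rough illustration of the metric described above, the sketch below computes the RMSE between the logarithms of predicted and observed prices. The helper name `rmsle` and the toy prices are assumptions made for this example; they are not part of the competition files.

```python
import math

def rmsle(y_true, y_pred):
    """Root-mean-squared error between log(y_pred) and log(y_true)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_log_errors = [
        (math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Toy prices (hypothetical): a 10% overestimate on a cheap house
# and on an expensive house contributes the same error.
cheap = rmsle([100_000], [110_000])
expensive = rmsle([500_000], [550_000])
print(cheap, expensive)
```

Because only the ratio between prediction and observation enters the error, both toy cases score the same, which is the point of taking logs.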

### File descriptions
- train.csv - the training set
- test.csv - the test set
- data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
- sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

### Data fields
- SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: $ Value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale

In [None]:
# Import the tools for analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
pd.options.display.max_rows = 4000

In [None]:
train_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
answer_id = test_df['Id']

In [None]:
train_df.shape,test_df.shape

In [None]:
train_df.dtypes

- There are 80 features in total (excluding the label).
- The training set and the test set are very similar in size.

In [None]:
# Return the names of the columns that contain missing values, most-missing first
def print_missing(df):
    # Count the missing values in every column
    missing_columns = pd.DataFrame(df.isnull().sum())
    missing_columns.columns = ['missing']
    missing_columns = missing_columns[missing_columns.missing>0]
    missing_columns = missing_columns.sort_values(by=['missing'],ascending=False)
    return missing_columns.index.tolist()

def missing_detail(col):
    print('Missing value:', col.isnull().sum())
    print('Data type:',col.dtype)
    if col.dtype == 'object':
        print(col.value_counts())
    else:
        print(col.median())

# Filling missing values and filtering outliers

#### Combine the training set and the test set to fill missing values

In [None]:
comp_df = pd.concat([train_df, test_df])

In [None]:
comp_df.reset_index(drop=True,inplace=True)

#### Drop the columns with too many missing values

In [None]:
# Show the percentage of missing values for every column that has any
miss_list = print_missing(comp_df)
print('Missing Rate:\n')
for col in miss_list:
    print('{} : {:.2f} %.'.format(col, comp_df[col].isnull().sum()/len(comp_df)*100))

We will drop the columns with more than 50% missing values. There is one marginal case: FireplaceQu.

In [None]:
comp_df.FireplaceQu.value_counts()
# Select SalePrice before aggregating so that non-numeric columns are not aggregated
comp_df.groupby('FireplaceQu')['SalePrice'].median().plot.bar()

It is difficult to fill these missing values, so I will drop this column as well.

In [None]:
drop_list = ['PoolQC','MiscFeature','Alley','Fence','FireplaceQu']
comp_df.drop(drop_list,axis=1,inplace=True)
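
The drop list above is written by hand. As a minimal sketch, the same 50% rule could be applied programmatically; the helper name `high_missing_columns` and the toy frame are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df, threshold=0.5):
    """Return the columns whose fraction of missing values exceeds `threshold`."""
    missing_rate = df.isnull().mean()
    return missing_rate[missing_rate > threshold].index.tolist()

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],      # 75% missing
    'FireplaceQu': [np.nan, 'TA', np.nan, 'Gd'],   # exactly 50% missing
    'LotArea': [8450, 9600, 11250, 9550],          # complete
})
to_drop = high_missing_columns(toy, threshold=0.5)
print(to_drop)
```

A strict `>` comparison keeps a column that sits exactly on the threshold, so marginal cases such as FireplaceQu still need an explicit decision.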

#### Correlation

In [None]:
# Look at the correlation between the numeric columns
plt.figure(figsize=(15,13))
sns.heatmap(comp_df.select_dtypes('number').corr(),cmap='Reds', cbar=True)

In [None]:
# Fill numeric columns with the median and categorical columns with the mode.
# The constants below are hard-coded fill values (medians for numeric columns, modes for categorical ones).
# Assigning the result back avoids Series-level inplace=True, which may not update the DataFrame in recent pandas.
comp_df['LotFrontage'] = comp_df['LotFrontage'].fillna(comp_df['LotFrontage'].median())

fill_values = {
    # Garage
    'GarageCond': 'TA', 'GarageYrBlt': 1979, 'GarageFinish': 'Unf', 'GarageQual': 'TA',
    'GarageType': 'Attchd', 'GarageCars': 2, 'GarageArea': 480,
    # Basement
    'BsmtExposure': 'No', 'BsmtCond': 'TA', 'BsmtQual': 'TA', 'BsmtFinType2': 'Unf',
    'BsmtFinType1': 'Unf',  # marginal case for filling
    'TotalBsmtSF': 989.5, 'BsmtUnfSF': 467, 'BsmtFinSF2': 0, 'BsmtFinSF1': 368.5,
    'BsmtHalfBath': 0, 'BsmtFullBath': 0,
    # Masonry veneer, zoning, functionality
    'MasVnrType': 'None', 'MasVnrArea': 0, 'MSZoning': 'RL', 'Functional': 'Typ',
    # Utilities and kitchen
    'Utilities': 'AllPub', 'KitchenQual': 'TA',
    # Exterior, sale type, electrical
    'Exterior2nd': 'VinylSd', 'Exterior1st': 'VinylSd', 'SaleType': 'WD', 'Electrical': 'SBrkr',
}
comp_df = comp_df.fillna(value=fill_values)

In [None]:
comp_df.isnull().sum()
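
The fill values above are typed in by hand. A possible alternative, sketched here under the assumption that median/mode filling is what is wanted for every remaining column, computes them from the data. The helper `fill_median_mode` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_median_mode(df, exclude=()):
    """Fill numeric columns with the median and other columns with the mode."""
    filled = df.copy()
    for col in filled.columns:
        if col in exclude or not filled[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(filled[col]):
            fill_value = filled[col].median()
        else:
            fill_value = filled[col].mode().iloc[0]
        filled[col] = filled[col].fillna(fill_value)
    return filled

# Toy data (hypothetical values); SalePrice is missing for test rows and must stay missing
toy = pd.DataFrame({
    'GarageArea': [400.0, np.nan, 480.0, 600.0],
    'GarageQual': ['TA', 'TA', None, 'Fa'],
    'SalePrice': [200000.0, 150000.0, np.nan, np.nan],
})
filled = fill_median_mode(toy, exclude=('SalePrice',))
print(filled)
```

Excluding `SalePrice` matters because the later train/test split relies on the target being null for test rows.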

# Data EDA and Features engineering

#### Check the distribution of SalePrice

In [None]:
fig, (ax1,ax2) = plt.subplots(2,1,figsize=(10,5))
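# distplot is deprecated in recent seaborn; histplot with kde=True gives a similar plot
sns.histplot(x=comp_df.SalePrice,kde=True,ax=ax1)
sns.boxplot(x=comp_df.SalePrice,ax=ax2,width=0.3)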

#### There are many outliers. I will drop them after formatting all the features.

#### Above ground area

In [None]:
# Rows where first floor + second floor + low-quality area differs from the above-ground living area
comp_df[comp_df['1stFlrSF'] + comp_df['2ndFlrSF']+comp_df['LowQualFinSF']!= comp_df['GrLivArea']].head()

- First floor + second floor + low-quality finished area = above-grade living area.
- I will drop the first-floor and second-floor columns.

In [None]:
comp_df.drop(['1stFlrSF','2ndFlrSF'],axis=1, inplace=True)

#### Relation between living area and sale price.

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(13,3))
# Recent seaborn versions require x and y as keyword arguments
sns.boxplot(x=comp_df.GrLivArea,ax=ax1)
sns.regplot(x='GrLivArea',y='SalePrice',data=comp_df,ax=ax2)

#### Drop the outliers

In [None]:
drop_index = comp_df[(comp_df.GrLivArea>4000)&(comp_df.SalePrice.notnull())].index.tolist()
comp_df.drop(drop_index,inplace=True)
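
The same pattern (drop extreme rows, but only those that have a SalePrice) recurs for several columns in this notebook. A minimal sketch of a reusable helper follows; the name `drop_train_outliers` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def drop_train_outliers(df, column, upper, target='SalePrice'):
    """Drop rows where `column` > `upper`, but only rows that have a target value."""
    is_outlier = (df[column] > upper) & df[target].notnull()
    return df[~is_outlier]

# Toy data (hypothetical values); NaN in SalePrice marks test rows
toy = pd.DataFrame({
    'GrLivArea': [1500, 4500, 5000, 2000],
    'SalePrice': [180000.0, 160000.0, np.nan, np.nan],
})
cleaned = drop_train_outliers(toy, 'GrLivArea', upper=4000)
print(cleaned)
```

Test rows are kept even when they are extreme, because a prediction is required for every Id in the test set.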

In [None]:
# Change low qual fin sf from square feet to ratio of above ground area
comp_df['LowQualFinSF'] = comp_df['LowQualFinSF']/comp_df['GrLivArea']

#### Porch Area

- There are 4 porch-area columns in the data:
- open porch,
- enclosed porch,
- three-season porch,
- screen porch.

In [None]:
porch_col = ['OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch']

In [None]:
comp_df[porch_col].head()

- Merge the three-season porch and screen porch areas into the enclosed porch area

In [None]:
comp_df['EnclosedPorch'] = comp_df['EnclosedPorch'] + comp_df['3SsnPorch'] + comp_df['ScreenPorch']
comp_df.drop(['3SsnPorch','ScreenPorch'],axis=1,inplace=True)

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(13,3))
sns.regplot(x='EnclosedPorch',y='SalePrice',data=comp_df,ax=ax1)
sns.regplot(x='OpenPorchSF',y='SalePrice',data=comp_df,ax=ax2)

#### Garage

- garage type, the garage location, some are attached to house
- garageyrblt, the year garage was built
- garage finish, the interior finish of garage
- garage quality: 6 levels
- garage cond: 6 levels
- paved Driveway: Y:Paved, P:partial , N: gravel

In [None]:
garag_col = ['GarageType','GarageYrBlt','GarageFinish','GarageArea','GarageQual','GarageCond','PavedDrive']
comp_df[garag_col].head()

In [None]:
fig, (ax1,ax2) = plt.subplots(2,1,figsize=(7,6))
sns.boxplot(x=comp_df.GarageArea,ax=ax1)
sns.histplot(x=comp_df.GarageYrBlt,kde=True,ax=ax2, color='orange')

In [None]:
# filter out the outlier
out_index = comp_df[(comp_df.GarageArea>1200)&(comp_df.SalePrice.notnull())].index.tolist()
comp_df.drop(out_index, inplace=True)

In [None]:
# Relationship between garage area, garage build year, and sale price
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,3))
sns.regplot(x='GarageArea',y='SalePrice',data=comp_df,ax=ax1)
sns.regplot(x='GarageYrBlt',y='SalePrice',data=comp_df,ax=ax2)

In [None]:
## Are garage condition and garage quality the same?
(comp_df.GarageCond == comp_df.GarageQual).value_counts()

In [None]:
## Are garage quality and condition strongly related to sale price?
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,3))
comp_df.groupby(['GarageQual'])['SalePrice'].mean().plot.bar(ax=ax1)
comp_df.groupby(['GarageCond'])['SalePrice'].mean().plot.bar(ax=ax2)

In [None]:
# After filling the missing values with 'Attchd', every house has a garage type.
comp_df.GarageType.value_counts()

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,3))
# Relation between garage finish and sale price
comp_df.groupby(['GarageFinish'])['SalePrice'].mean().plot.bar(ax=ax1)

# Relation between paved driveway and sale price
comp_df.groupby(['PavedDrive'])['SalePrice'].mean().plot.bar(ax=ax2)

#### Bathroom
- BsmtFullBath, Basement full bathrooms
- BsmtHalfBath, Basement half bathrooms
- FullBath, full bathrooms above grade
- HalfBath, half bathrooms above grade

In [None]:
baths = ['BsmtFullBath', 'BsmtHalfBath','FullBath','HalfBath']
comp_df[baths].head()

#### Group all bathroom columns into one

In [None]:
comp_df['FullBath'] = comp_df['FullBath'] + comp_df['BsmtFullBath']
comp_df.drop('BsmtFullBath',axis=1,inplace=True)

comp_df['HalfBath'] = comp_df['HalfBath'] + comp_df['BsmtHalfBath']
comp_df.drop('BsmtHalfBath',axis=1,inplace=True)

comp_df['BathNum'] = comp_df['FullBath'] + (comp_df['HalfBath']/2)
comp_df.drop(['FullBath','HalfBath'],axis=1, inplace=True)

In [None]:
baths = ['BathNum']
comp_df[baths].head()
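
To make the arithmetic behind `BathNum` explicit, here is a small worked example on toy rows (hypothetical values): every full bath counts as 1 and every half bath as 0.5, above grade and in the basement alike.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'FullBath': [2, 1],
    'HalfBath': [1, 0],
    'BsmtFullBath': [1, 0],
    'BsmtHalfBath': [0, 1],
})

# Full baths count as 1, half baths as 0.5
bath_num = (toy['FullBath'] + toy['BsmtFullBath']
            + (toy['HalfBath'] + toy['BsmtHalfBath']) / 2)
print(bath_num)
```

This is the same quantity the three-step cell above produces, written as one expression.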

#### Basement

- BsmtQual
- BsmtCond
- BsmtExposure
- BsmtFinType1
- BsmtFinSF1
- BsmtFinType2
- BsmtFinSF2
- BsmtUnfSF
- TotalBsmtSF

In [None]:
base_cols = ['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinSF1','BsmtFinType2','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF']
comp_df[base_cols].head()

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,3))
comp_df.groupby(['BsmtQual'])['SalePrice'].mean().plot.bar(ax=ax1)
comp_df.groupby(['BsmtCond'])['SalePrice'].mean().plot.bar(ax=ax2)

In [None]:
comp_df.BsmtExposure.value_counts()

In [None]:
comp_df.BsmtFinType2.value_counts()

In [None]:
# Area of basement
base_area = ['BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF']
condition = (comp_df['TotalBsmtSF'] == comp_df['BsmtFinSF1'] + comp_df['BsmtFinSF2'] + comp_df['BsmtUnfSF'])
comp_df[~condition][base_area]

- The total basement area equals the type 1 finished area + the type 2 finished area + the unfinished area.
- There is one exception, caused by the fillna() at the beginning.

In [None]:
# We will drop the type 1 and type 2 finished basement areas
comp_df.drop(['BsmtFinSF1','BsmtFinSF2'],axis=1, inplace=True)

# Change the unfinished area to a ratio of the total basement area
comp_df['BsmtUnfSF'] = comp_df['BsmtUnfSF'] / comp_df['TotalBsmtSF']
# Houses without a basement give 0/0 = NaN; treat their ratio as 0
comp_df['BsmtUnfSF'] = comp_df['BsmtUnfSF'].fillna(0)
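
The ratio step needs the final fill because a house with no basement has both areas equal to zero. The toy example below (hypothetical values) shows that pandas returns NaN for 0/0 instead of raising an error.

```python
import pandas as pd

# Toy data (hypothetical values); the second house has no basement
toy = pd.DataFrame({
    'BsmtUnfSF': [150.0, 0.0, 400.0],
    'TotalBsmtSF': [600.0, 0.0, 400.0],
})

ratio = toy['BsmtUnfSF'] / toy['TotalBsmtSF']   # 0/0 yields NaN, not an exception
n_undefined = int(ratio.isnull().sum())
ratio = ratio.fillna(0)                          # define the ratio as 0 when there is no basement
print(n_undefined, ratio.tolist())
```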

In [None]:
plt.figure(figsize=(12,2))
sns.boxplot(x=comp_df.TotalBsmtSF)

In [None]:
# Drop the outliers
drop_index = comp_df[(comp_df.TotalBsmtSF>3000)&(comp_df.SalePrice.notnull())].index.tolist()
comp_df.drop(drop_index, inplace=True)

#### Heating and electrical systems
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system

In [None]:
devices = ['Heating','HeatingQC','CentralAir','Electrical']
comp_df[devices].head()

#### Land

- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property


In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(15,3))
sns.boxplot(x=comp_df.LotArea,ax=ax1)
sns.regplot(x=comp_df.LotArea,y=comp_df.SalePrice,ax=ax2)

In [None]:
# drop the outliers 
drop_index = comp_df[(comp_df.LotArea>50000)&(comp_df.SalePrice.notnull())].index.tolist()
comp_df.drop(drop_index,inplace=True)

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(13,2))
sns.countplot(x=comp_df.LotShape,ax=ax1)
comp_df.groupby(['LotShape'])['SalePrice'].mean().plot.bar(ax=ax2)

In [None]:
# Merge the rare 'IR3' shape into 'IR2'
comp_df['LotShape'] = comp_df['LotShape'].replace('IR3','IR2')

#### Other rooms, places, and functionality rating
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces


In [None]:
comp_df[['BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd']].head()

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(13,2))
comp_df.TotRmsAbvGrd.plot.hist(ax=ax1)
sns.regplot(x=comp_df.TotRmsAbvGrd,y=comp_df.SalePrice,ax=ax2)

#### Roof and exterior
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation

In [None]:
comp_df.RoofMatl.value_counts()

In [None]:
# Group the rare roof materials into 'WdShngl' to reduce variance
only_one = ['ClyTile','Roll','Membran','Metal']
comp_df['RoofMatl'] = comp_df['RoofMatl'].replace(only_one,'WdShngl')

In [None]:
comp_df.Exterior1st.value_counts()
comp_df['Exterior1st'] = comp_df['Exterior1st'].replace('ImStucc','CBlock')

In [None]:
comp_df.Exterior2nd.value_counts()
comp_df['Exterior2nd'] = comp_df['Exterior2nd'].replace('Other','CBlock')
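
The cells above merge rare categories by listing them by hand. As a sketch, the rare values could instead be found from `value_counts()`; the helper `group_rare`, its `min_count` parameter, and the toy series are assumptions made for this example.

```python
import pandas as pd

def group_rare(series, min_count, other_label):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Toy data (hypothetical counts)
roof = pd.Series(['CompShg'] * 5 + ['WdShngl'] * 2 + ['Roll', 'Metal'])
grouped = group_rare(roof, min_count=2, other_label='WdShngl')
print(grouped.value_counts())
```

Which label the rare values are merged into is still a judgment call; the helper only automates finding them.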

#### Date

- MoSold: Month Sold
- YrSold: Year Sold
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)


In [None]:
import datetime

In [None]:
date_col = ['MoSold','YrSold','YearBuilt','YearRemodAdd']
comp_df[date_col].head()

#### Create two new columns : house age, and remodel age.

In [None]:
current_year = datetime.datetime.today().year
comp_df['HouseAge'] = current_year - comp_df['YearBuilt']
comp_df['RemodelAge'] = current_year - comp_df['YearRemodAdd']
# A remodel year equal to the construction year means no remodeling: set RemodelAge to 0
comp_df['RemodelAge'] = comp_df['RemodelAge'].where(comp_df['HouseAge'] != comp_df['RemodelAge'],0)
comp_df.drop(['YearBuilt','YearRemodAdd'],axis=1,inplace=True)

In [None]:
sns.boxplot(x=comp_df.YrSold, y=comp_df.SalePrice)

In [None]:
# Create the Sold_hist column: number of days elapsed since the month of sale
comp_df['Sold_date'] = comp_df['YrSold'].astype(str)+'-'+comp_df['MoSold'].astype(str)
comp_df['Sold_hist'] = datetime.datetime.now() - pd.to_datetime(comp_df['Sold_date'])
comp_df['Sold_hist'] = comp_df['Sold_hist'].dt.days
comp_df.drop(['Sold_date'],axis=1,inplace=True)
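
Because `datetime.datetime.now()` changes from run to run, the raw values of the date features above depend on when the notebook is executed. A minimal sketch of a reproducible variant measures elapsed days against a fixed date instead. The constant `REFERENCE_DATE` and its value are assumptions chosen for illustration, not something the notebook defines.

```python
import pandas as pd

# Assumed fixed cutoff (hypothetical); any date after the last sale would do
REFERENCE_DATE = pd.Timestamp('2011-01-01')

# Toy data (hypothetical values)
toy = pd.DataFrame({'YrSold': [2008, 2010], 'MoSold': [2, 12]})

# Build the first day of the sale month from the year and month columns
sold_date = pd.to_datetime(
    pd.DataFrame({'year': toy['YrSold'], 'month': toy['MoSold'], 'day': 1})
)
sold_hist = (REFERENCE_DATE - sold_date).dt.days
print(sold_hist.tolist())
```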

#### Other columns

- SaleType: Type of sale
- SaleCondition: Condition of sale
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- MSSubClass: The building class
- MSZoning: The general zoning classification

In [None]:
comp_df.Condition2.value_counts()

In [None]:
# Merge the rare 'RRAe' and 'RRAn' values into 'RRNn'
comp_df['Condition2'] = comp_df['Condition2'].replace(['RRAe','RRAn'],'RRNn')

#### Features encoding

In [None]:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

#### Normalize the numeric features

In [None]:
set(comp_df.dtypes)

In [None]:
numeric_list = comp_df.select_dtypes(include=['int64','float64']).columns.tolist()
numeric_list.remove('SalePrice')
numeric_list.remove('Id')

In [None]:
comp_df[numeric_list] = MinMaxScaler().fit_transform(comp_df[numeric_list])
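
The cell above fits the scaler on the combined train and test rows. A common alternative, sketched here on toy data, learns the minimum and maximum from the training rows only and then applies them to every row. Whether this changes the results for this dataset is not something the notebook tests; the toy frame and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values); NaN in SalePrice marks test rows
toy = pd.DataFrame({
    'LotArea': [8000.0, 12000.0, 10000.0, 14000.0],
    'SalePrice': [150000.0, 250000.0, np.nan, np.nan],
})
is_train = toy['SalePrice'].notnull()
features = ['LotArea']

# Learn min/max from the training rows only, then transform all rows
scaler = MinMaxScaler().fit(toy.loc[is_train, features])
scaled = toy.copy()
scaled[features] = scaler.transform(toy[features])
print(scaled['LotArea'].tolist())
```

With this approach, test values outside the training range fall outside [0, 1], which tree-based models such as LightGBM generally tolerate.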

#### Features of grading
ExterCond,
ExterQual,
BsmtCond,
BsmtQual,
HeatingQC,
KitchenQual,
GarageCond,
GarageQual

In [None]:
comp_df.ExterCond = comp_df.ExterCond.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
comp_df.BsmtCond = comp_df.BsmtCond.map({'Po':0,'Fa':1,'TA':2,'Gd':3})  # TA (typical) ranks below Gd (good)
comp_df.BsmtQual = comp_df.BsmtQual.map({'Fa':0,'TA':1,'Gd':2,'Ex':3})
comp_df.HeatingQC = comp_df.HeatingQC.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
comp_df.KitchenQual = comp_df.KitchenQual.map({'Fa':0,'TA':1,'Gd':2,'Ex':3})
comp_df.GarageCond = comp_df.GarageCond.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
comp_df.ExterQual = comp_df.ExterQual.map({'Fa':0,'TA':1,'Gd':2,'Ex':3})
comp_df.GarageQual = comp_df.GarageQual.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
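
`Series.map` with a dictionary silently turns any grade missing from the dictionary into NaN. The sketch below uses one shared five-level scale (unlike the per-column dictionaries above, which start the four-level columns at 0 for 'Fa') and raises an error when an unexpected grade appears. `QUALITY_ORDER`, `QUALITY_MAP`, and `encode_quality` are hypothetical names introduced for this example.

```python
import pandas as pd

QUALITY_ORDER = ['Po', 'Fa', 'TA', 'Gd', 'Ex']   # worst to best
QUALITY_MAP = {grade: rank for rank, grade in enumerate(QUALITY_ORDER)}

def encode_quality(series):
    """Map quality grades to 0..4 and fail loudly on unexpected grades."""
    encoded = series.map(QUALITY_MAP)
    # Values that were present in the input but have no entry in the mapping
    unmapped = series[encoded.isnull() & series.notnull()].unique()
    if len(unmapped) > 0:
        raise ValueError(f'Unmapped grades: {list(unmapped)}')
    return encoded

# Toy data (hypothetical values)
kitchen = pd.Series(['TA', 'Gd', 'Ex', 'Fa'])
encoded = encode_quality(kitchen)
print(encoded.tolist())
```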

#### Label encoding

In [None]:
object_list = comp_df.select_dtypes(include='object').columns.tolist()

In [None]:
for col in object_list :
    comp_df[col] = LabelEncoder().fit_transform(comp_df[col])

In [None]:
comp_df.drop('Id',axis=1,inplace=True)

In [None]:
comp_df.shape

In [None]:
comp_df.head().T

# Split into training set and testing set

In [None]:
train_df = comp_df[comp_df.SalePrice.notnull()]
# Drop the (empty) target column without modifying a slice of comp_df in place
test_df = comp_df[comp_df.SalePrice.isnull()].drop('SalePrice',axis=1)

### Filter out the outliers of SalePrice

In [None]:
plt.figure(figsize=(13,2))
sns.boxplot(x=comp_df.SalePrice)

In [None]:
train_df  = train_df[train_df.SalePrice <= 400000]

In [None]:
x_train = train_df.drop('SalePrice',axis=1)
y_train =train_df['SalePrice']

In [None]:
# Check the shape
train_df.shape, test_df.shape

In [None]:
x_train.shape, y_train.shape

# Training Models

In [None]:
from lightgbm import LGBMRegressor

from sklearn.model_selection import GridSearchCV, cross_validate, RepeatedKFold

In [None]:
cv = RepeatedKFold(n_repeats=3, n_splits=10, random_state=42)
model = LGBMRegressor(random_state=42)
scores = cross_validate(model, x_train, y_train, cv=cv,scoring=['r2','neg_mean_squared_log_error'], verbose=1, n_jobs=-1)

In [None]:
print(scores['test_r2'].mean())
print(np.sqrt(np.abs(scores['test_neg_mean_squared_log_error'].mean())))

### Tuning hyperparameters

In [None]:
params = {
    'n_estimators':[50,100,200],
    'max_depth':[0,3,5,7],
    'learning_rate':[0.0001,0.001,0.01,0.1,1],
    'boosting_type':['gbdt','dart','goss'],
    'subsample':[0.3,0.5,0.7,1],
    'colsample_bytree':[0.3,0.5,0.7,1]
}
cv = RepeatedKFold(n_repeats=1, n_splits=5, random_state=42)
lbm_grid = GridSearchCV(LGBMRegressor(random_state=42),params, cv=cv,verbose=1, n_jobs=-1, scoring='r2')
lbm_grid.fit(x_train, y_train)
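
To get a feel for how expensive this exhaustive search is, the sketch below counts the parameter combinations in the grid and multiplies by the number of cross-validation folds. It copies the grid from the cell above and uses only the standard library.

```python
from itertools import product

params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [0, 3, 5, 7],
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1],
    'boosting_type': ['gbdt', 'dart', 'goss'],
    'subsample': [0.3, 0.5, 0.7, 1],
    'colsample_bytree': [0.3, 0.5, 0.7, 1],
}

# Every combination of one value per parameter
n_combinations = sum(1 for _ in product(*params.values()))
n_splits = 5   # RepeatedKFold(n_splits=5, n_repeats=1)
n_fits = n_combinations * n_splits
print(n_combinations, n_fits)
```

GridSearchCV also refits the best combination once on the full training data by default, so the total is one fit more than `n_fits`.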

In [None]:
print(lbm_grid.best_score_,lbm_grid.best_estimator_)

# Prediction on test data

In [None]:
y_pred = pd.DataFrame(lbm_grid.predict(test_df))
y_pred['Id'] = answer_id
y_pred.set_index('Id',inplace=True)
y_pred.columns = ['SalePrice']

In [None]:
y_pred.to_csv('answer.csv')
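
Before uploading, it can help to check that the submission has exactly the two expected columns and no missing predictions. The helper `make_submission` and the toy Ids and prices below are assumptions made for illustration.

```python
import pandas as pd

def make_submission(ids, predictions):
    """Build a two-column frame (Id, SalePrice) and check that it is complete."""
    submission = pd.DataFrame({
        'Id': list(ids),
        'SalePrice': list(predictions),
    })
    if submission['SalePrice'].isnull().any():
        raise ValueError('Submission contains missing predictions')
    if submission['Id'].duplicated().any():
        raise ValueError('Submission contains duplicate Ids')
    return submission

# Toy values (hypothetical)
submission = make_submission([1461, 1462, 1463], [120000.0, 155000.0, 180000.0])
csv_text = submission.to_csv(index=False)
print(csv_text)
```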

Thank you