# **House Sales Regression**

## **Objective**
#### In this analysis, we use the King County (Washington, US) house sales data to find patterns in the data, identify the factors that affect a house's sale price based on the information recorded for each house, and predict prices.

## **Importing Required Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Plotly
from plotly.offline import init_notebook_mode,iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
import xgboost
from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings('ignore')

## **Import Dataset**

In [None]:
df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')

In [None]:
# Let's check the first five rows of data
df.head().style.bar(color='#606ff2').background_gradient(cmap='plasma')

In [None]:
# Checking Missing Values 
df.isnull().sum()

**We can see that there are no missing values in our dataset.**

In [None]:
# Let's check the data type of each feature
df.dtypes

## **Data Description**

In [None]:
df.info()

### <p>The descriptions below are based on my own observations of the data and its data dictionary; the variable types are my own classification.</p>

#### 1. **id**: Integer variable. A unique ID for each home sold.
#### 2. **date**: Date on which the house was sold, between May 2014 and May 2015.
#### 3. **price**: Numeric continuous variable. The sale price of each individual house.
#### 4. **bedrooms**: Integer variable. The number of bedrooms in the house.
#### 5. **bathrooms**: Numeric variable. The number of bathrooms in the house. Values are fractional and are counted as follows: full bathroom (room with bathtub, toilet, sink and shower) = 1, half bathroom (room with a toilet and sink but no shower) = 0.5, room with toilet, sink and shower = 0.75.
#### 6. **sqft_living**: Numeric continuous variable. The total interior living space of the house in square feet.
#### 7. **sqft_lot**: Numeric continuous variable. The total area of the plot (land space), including the garden and grounds of the property, in square feet.
#### 8. **floors**: Categorical variable. The number of floors in the house, which can also take half values (1.5, 2.5, etc.).
#### 9. **waterfront**: Categorical variable. It is 0 if the house does not have a waterfront view and 1 if it does.
#### 10. **view**: Categorical variable. An index from 0 to 4 of how good the view from the house is.
#### 11. **condition**: Categorical variable assigned by the King County system that describes the condition of the house. It ranges from 1 to 5, where 1 indicates very bad condition and 5 indicates very good condition.
#### 12. **grade**: Categorical variable. An index from 1 to 13, where 1-3 means a very poor level of construction and design, 7 is an average level, and 11-13 is a very high level.
#### 13. **sqft_above**: Numeric continuous variable. The square footage of the interior space that is above ground level.
#### 14. **sqft_basement**: Numeric continuous variable. The square footage of the interior space that is below ground level (basement).
#### 15. **yr_built**: Year (stored as an integer) in which the house was originally built.
#### 16. **yr_renovated**: Year (stored as an integer) in which the house was last renovated. A value of 0 means the house has never been renovated, so the variable can also be read as a renovated / not renovated indicator.
#### 17. **zipcode**: Categorical variable. The ZIP code area in which the property is located.
#### 18. **lat**: Numeric continuous variable. The latitude of the house.
#### 19. **long**: Numeric continuous variable. The longitude of the house.
#### 20. **sqft_living15**: Numeric continuous variable. In the commonly cited data dictionary for this dataset, this is the interior living space (square feet) of the nearest 15 neighboring houses, not a 2015 re-measurement of the same house.
#### 21. **sqft_lot15**: Numeric continuous variable. In the same data dictionary, this is the lot size (square feet) of the nearest 15 neighboring houses.
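
Since `yr_renovated` holds 0 for houses that were never renovated, a binary flag can be derived from it if a categorical reading is wanted. The following is a minimal sketch on a tiny made-up frame; the helper `add_renovated_flag` and the column name `renovated` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import pandas as pd

def add_renovated_flag(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 0/1 'renovated' column derived from 'yr_renovated'."""
    out = frame.copy()
    # Assumption: yr_renovated == 0 means "never renovated"
    out["renovated"] = (out["yr_renovated"] > 0).astype(int)
    return out

# Made-up rows for illustration only
toy = pd.DataFrame({"yr_built": [1955, 1987, 2001], "yr_renovated": [0, 1991, 0]})
flagged = add_renovated_flag(toy)
print(flagged)
```

On the real data the same call would be `add_renovated_flag(df)`.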

## **Summary Statistics of Data**

In [None]:
df.describe().T.style.bar(
    subset=['mean'],
    color='#606ff2').background_gradient(
    subset=['std'], cmap='PuBu').background_gradient(subset=['50%'], cmap='PuBu')

**Summary Description for the Target Variable (price) -**
##### 1. Prices in our dataset range from 78,000 dollars to 7,700,000 dollars.
##### 2. The 1st quartile is about 320,000 dollars: 25% of the houses sold for less than this price.
##### 3. The mean house price (540,088.14 dollars) is greater than the median (450,000 dollars), and the median is closer to the 1st quartile than to the 3rd quartile, which shows that the distribution is positively skewed.
##### 4. The 3rd quartile is 645,000 dollars: 75% of the houses sold for less than this price.
##### 5. The mean price is far below the maximum price, so only a few observations lie near the maximum.

##### The average home price is about 540,088.14 dollars, and the most expensive home sold for 7,700,000 dollars. The maximum number of bedrooms is 33, which may be a data-entry error, since the average number of bedrooms is about 3. The means of the continuous variables may be skewed by outliers.
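
The mean-versus-median comparison used above can be backed by a sample skewness value. A small sketch on made-up prices (the numbers are illustrative, not taken from the dataset); on the real data the equivalent call would be `df['price'].skew()`.

```python
import pandas as pd

# Made-up prices with one very expensive sale, for illustration only
prices = pd.Series([300_000, 320_000, 450_000, 500_000, 645_000, 7_700_000])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_price:,.0f}  median={median_price:,.0f}  skew={skewness:.2f}")
```

A single extreme value is enough to pull the mean well above the median, which is the pattern described for `price`.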

## **Exploratory Data Analysis**

### Correlation Analysis

In [None]:
#Correlation plot
plt.figure(figsize=(14,10))
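# 'date' is a string column, so restrict the correlation to numeric columns
cor = df.select_dtypes(include='number').corr()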
sns.heatmap(cor,annot=True,cmap='plasma')
plt.show()

#### Correlation Plot Description -
* The variable most strongly correlated with price is 'sqft_living', with a correlation of +0.70.
* sqft_above and price are moderately positively correlated: as the living space above ground level increases, price also increases. This is expected, since sqft_above is a component of sqft_living.
* There is a weak positive correlation between sqft_basement and price. The summary statistics show a median sqft_basement of 0, so at least half of the houses have no living space below ground level.
* There is a weak positive correlation between latitude (lat) and price.
* sqft_living15 and price are moderately positively correlated: as sqft_living15 increases, price also increases.
* sqft_living is highly correlated with bathrooms, grade and sqft_above.
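
Reading individual values off a large heatmap is error-prone, so it can help to list the correlations with the target in sorted order. A minimal sketch on synthetic data; `rank_by_correlation` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with `target`, highest first."""
    corr = frame.select_dtypes(include="number").corr()[target]
    return corr.drop(target).sort_values(ascending=False)

# Synthetic data for illustration only
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
toy = pd.DataFrame({
    "sqft_living": sqft,
    "noise": rng.normal(size=200),
    "price": 200 * sqft + rng.normal(scale=50_000, size=200),
})
ranking = rank_by_correlation(toy, "price")
print(ranking)
```

On the real data, `rank_by_correlation(df, 'price')` would give the same numbers as the `price` row of the heatmap.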

### **Univariate Distribution**

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Variable Price
</div>

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(121)
plt.title('Price Distribution')
sns.distplot(df['price'])

plt.subplot(122)
# Sorted prices plotted against their rank show how steeply the upper tail rises
plt.scatter(range(df.shape[0]), np.sort(df.price.values))
plt.title("Price Curve Distribution", fontsize=15)
plt.xlabel("")
plt.ylabel("Amount (US$)", fontsize=12)

plt.subplots_adjust(wspace=0.3, hspace=0.5, top=0.9)
plt.show()

* We can see that most of the houses are priced under 1 million dollars.
* The tail of the distribution is longer on the right-hand side than on the left, which shows that the distribution is positively skewed.
* A transformation is required to normalize our target variable, price. Let's try a very common one, the log transform, and see if it helps here.

In [None]:
# Let's check the distribution of the target variable (price) again, after a log transform
df1 = df.copy()
df1.price = np.log(df1.price)

# Create a single figure and axis (a separate plt.figure() call would leave an empty figure)
f, ax = plt.subplots(figsize=(15, 6))
mean_price = df1['price'].mean()
median_price = df1['price'].median()
mode_price = df1['price'].mode().values[0]

sns.distplot(df1['price'], ax=ax)
ax.axvline(mean_price, color='r', linestyle='--', label="Mean")
ax.axvline(median_price, color='g', linestyle='-', label="Median")
ax.axvline(mode_price, color='b', linestyle='-', label="Mode")

ax.legend()
plt.show()

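* With the log transform, the distribution appears close to normal, so we can use this transformation for our target variable and move ahead.
* A model trained on the log of price returns predictions in log values, so we need to take the antilog (`np.exp`) to get the actual price.
* Note that the models in the Model Building section are fitted on `df`, that is, on the untransformed price.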
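
One way to train on log price and still get predictions in dollars is scikit-learn's `TransformedTargetRegressor`, which applies the transform before fitting and the inverse after predicting. This is a sketch under assumptions: it uses synthetic data and a plain `LinearRegression`, and it is not the approach the models later in this notebook take.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: log(price) is exactly linear in x
x = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
price = np.exp(12.0 + 1.5 * x.ravel())

# Fits LinearRegression on log(price); predict() applies exp() automatically
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(x, price)
pred = model.predict(x)  # already back on the original price scale
print(pred[:3])
```

The wrapper keeps the antilog step in one place, so evaluation code can compare `pred` with the raw prices directly.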

In [None]:
v1 = [go.Box(y=df.price,name="Price",marker=dict(color="rgba(64,64,64,0.9)"),hoverinfo="name+y")]

layout1 = go.Layout(title="Price")

fig1 = go.Figure(data=v1,layout=layout1)
iplot(fig1)

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Variable Bedrooms
</div>

In [None]:
plt.figure(figsize = (12, 6))
total = float(len(df))
ax = sns.countplot(x='bedrooms',data=df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 1,
            '{:1.2f}%'.format((height/total) * 100),
            ha="center",fontsize=10) 
plt.show()

* The graph above shows that most houses have 3 bedrooms, followed by 4 bedrooms.
* There is also one house with 33 bedrooms.
* Very few houses have 0-1 bedrooms or 6-33 bedrooms.
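
The percentages annotated on the bars can also be obtained as a table with `value_counts(normalize=True)`. A small sketch with made-up bedroom counts; `category_shares` is a hypothetical helper name.

```python
import pandas as pd

def category_shares(values: pd.Series) -> pd.Series:
    """Percentage of observations in each category, sorted by category value."""
    return values.value_counts(normalize=True).mul(100).round(2).sort_index()

# Made-up bedroom counts for illustration only
bedrooms = pd.Series([3, 3, 3, 4, 4, 2, 3, 5, 4, 3])
shares = category_shares(bedrooms)
print(shares)
```

The same helper would apply to the other count plots in this section, for example `category_shares(df['floors'])`.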

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Variable sqft_living
</div>

In [None]:
v1 = [go.Box(y=df.sqft_living,name="Sqft_living",marker=dict(color="rgba(64,64,64,0.9)"),hoverinfo="name+y")]

layout1 = go.Layout(title="Sqft_living")

fig1 = go.Figure(data=v1,layout=layout1)
iplot(fig1)

* Living space (sqft_living) in our dataset ranges from 290 to 13,540 square feet.
* The 1st quartile is 1,427 square feet: 25% of the houses have less living space than this.
* The 3rd quartile is 2,550 square feet: 75% of the houses have less living space than this.
* The median living space is 1,910 square feet.
* All values above the upper fence (4,230 sqft) are considered outliers.
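
The upper fence follows the usual box-plot rule Q3 + 1.5 × IQR. With the quartiles quoted above this gives 2550 + 1.5 × (2550 − 1427) = 4234.5; the 4230 shown on the plot is presumably the largest observation at or below that limit, which is an assumption about how Plotly labels the whisker. A minimal sketch of the rule, with `tukey_fences` as a hypothetical helper:

```python
import pandas as pd

def tukey_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Toy values for illustration only: Q1 = 3, Q3 = 7, IQR = 4
values = pd.Series(range(1, 10))
low, high = tukey_fences(values)
print(low, high)
```

On the real data, `df[df['sqft_living'] > tukey_fences(df['sqft_living'])[1]]` would list the houses flagged as outliers.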

In [None]:
fig,ax = plt.subplots(figsize = (12, 6))
mean=df['sqft_living'].mean()
median=df['sqft_living'].median()
mode=df['sqft_living'].mode()[0]


sns.distplot(df['sqft_living'],kde=True,hist_kws={'edgecolor': 'black'},bins=30,ax=ax)
ax.axvline(mean, color='r', linestyle='--')
ax.axvline(median, color='y', linestyle='-')
ax.axvline(mode, color='g', linestyle='-')
plt.show()

* The graph above shows that most houses have between roughly 500 and 4,000 sqft of living space.
* The red line shows the mean living space, the yellow line the median and the green line the mode. The mean is greater than the median, and the tail of the distribution is longer on the right-hand side than on the left, which shows that the distribution is slightly positively skewed.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Variable Floor
</div>

In [None]:
plt.figure(figsize = (12, 6))
total = float(len(df))
ax = sns.countplot(x='floors',data=df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 1,
            '{:1.2f}%'.format((height/total) * 100),
            ha="center",fontsize=10) 
plt.show()

* The graph above shows that most houses have 1 or 2 floors.
* Single-floor houses have the highest share, 49.41%, which is plausible because single-story houses are common.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Variable Waterfront
</div>

In [None]:
plt.figure(figsize = (12, 6))
total = float(len(df))
ax = sns.countplot(x='waterfront',data=df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 1,
            '{:1.2f}%'.format((height/total) * 100),
            ha="center",fontsize=10) 
plt.show()

The count plot of waterfront above indicates that 99.25% of houses do not have a waterfront view; only 0.75% do.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Variable Condition
</div>

In [None]:
plt.figure(figsize = (12, 6))
total = float(len(df))
ax = sns.countplot(x='condition',data=df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 1,
            '{:1.2f}%'.format((height/total) * 100),
            ha="center",fontsize=10) 
plt.show()

* The plot above shows that 64.92% of houses are in moderate condition (index 3).
* The second-highest share is index 4, with 26.28%, which represents houses in good condition.
* Only 0.14% of houses are in very bad condition.
* From the plot above we can say that very few houses are in bad or very bad condition.

In [None]:
v21 = [go.Box(y=df.bedrooms,name="Bedrooms",marker=dict(color="rgba(51,0,0,0.9)"),hoverinfo="name+y")]
v22 = [go.Box(y=df.condition,name="Condition",marker=dict(color="rgba(0,102,102,0.9)"),hoverinfo="name+y")]
v23 = [go.Box(y=df.floors,name="Floors",marker=dict(color="rgba(204,0,102,0.9)"),hoverinfo="name+y")]

layout2 = go.Layout(title="Boxplot of Bedrooms,Condition and Floors",yaxis=dict(range=[0,33]))

fig2 = go.Figure(data=v21+v22+v23,layout=layout2)
iplot(fig2)

Outliers are present in the Condition and Bedrooms variables.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Variable Grade
</div>

In [None]:
#Create Grade Frame
gradeframe = pd.DataFrame({"Grades":df.grade.value_counts().index,"House_Grade":df.grade.value_counts().values})
gradeframe["Grades"] = gradeframe["Grades"].apply(lambda x : "Grade " + str(x))
gradeframe.set_index("Grades",inplace=True)
p1 = [go.Pie(labels = gradeframe.index,values = gradeframe.House_Grade,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]

layout4 = go.Layout(title="Grade Pie Chart")

fig4 = go.Figure(data=p1,layout=layout4)

iplot(fig4)

* The graph above shows that 41.6% of houses have a grade 7 rating, which represents an average grade of construction and design.
* 28.1% of houses have a grade 8 rating, which means just above average construction and design.
* 12.1% of houses have grade 9, which represents better architectural design with extra interior and exterior design and quality.
* 9.43% of houses have grade 6, which means low-quality materials and simple designs.
* Grade 10 was given to 5.25% of houses, which means the finish work is better and more design quality is seen in the floor plans.
* All the other grades were given to very small percentages of houses.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Variable sqft_above
</div>

In [None]:
fig,ax = plt.subplots(figsize = (12, 6))
mean=df['sqft_above'].mean()
median=df['sqft_above'].median()
mode=df['sqft_above'].mode()[0]


sns.distplot(df['sqft_above'],kde=True,hist_kws={'edgecolor': 'black'},bins=30,ax=ax)
ax.axvline(mean, color='r', linestyle='--')
ax.axvline(median, color='y', linestyle='-')
ax.axvline(mode, color='g', linestyle='-')
plt.show()

* The plot above shows that for most houses the living space above ground level is below 4,000 sqft.
* The tail of the distribution is longer on the right-hand side, which means the distribution is slightly positively skewed.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Variable sqft_basement
</div>

In [None]:
fig,ax = plt.subplots(figsize = (12, 6))
mean=df['sqft_basement'].mean()
median=df['sqft_basement'].median()
mode=df['sqft_basement'].mode()[0]


sns.distplot(df['sqft_basement'],kde=False,hist_kws={'edgecolor': 'black'},bins=30,ax=ax)
ax.axvline(mean, color='r', linestyle='--')
ax.axvline(median, color='y', linestyle='-')
ax.axvline(mode, color='g', linestyle='-')
plt.show()

* The majority of houses have no basement.
* Very few houses have between about 1,800 and 4,900 sqft of living space below ground level.

### Conclusion from Analysis of Univariate Distribution

* Most house prices are below 1 million dollars and cluster around the median (450,000 dollars), with a long right tail.
* Most of the houses do not have a waterfront view.
* Single-floor houses are the most common.
* Houses with 3 bedrooms have the most observations.
* A large number of houses have living space between 1,500 and 2,100 sqft.
* The most common grade is 7, which stands for average construction and design.

### **Bivariate Distribution**

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and Bedrooms
</div>

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot( x=df['bedrooms'], y=df['price'] )

plt.title('Statistical Distribution of Bedrooms versus Price')
plt.show()

* It is clear from the figure above that houses with 6 bedrooms are relatively more expensive than the rest.
* The increasing trend in median prices indicates that price tends to rise with the number of bedrooms.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of log-Price and sqft_living
</div>

In [None]:
sns.set_style('whitegrid')
# Recent seaborn versions require x and y as keyword arguments
sns.jointplot(x='sqft_living', y='price', data=df1, kind='reg')
plt.show()

* The top histogram corresponds to the x values and the one on the right to the y values. The scatter and regression line show a roughly linear relationship between x and y: as the living space of a house increases, its (log) price also increases.
* Most houses have between 500 and 6,000 square feet of living space and a log price below about 14.2, which corresponds to roughly 1.5 million dollars.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and Floors
</div>

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot( x=df['floors'], y=df['price'] )

plt.title('Statistical Distribution of Floors versus Price')
plt.show()

* The individual points beyond the whiskers are outliers: data points lying more than 1.5 times the interquartile range beyond the quartiles (seaborn's default whisker length).
* The median price tends to increase as the number of floors increases from 1 to 2.5.
* This trend does not hold for 3 and 3.5 floors; other factors may be affecting the price there.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and Waterfront
</div>

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot( x=df['waterfront'], y=df['price'] )

plt.title('Statistical Distribution of Waterfront versus Price')
plt.show()

It is clearly visible from the output that the houses with waterfront (orange box) are far more expensive than those without the waterfront (blue box). This shows that waterfront can really be a useful feature to predict the house price.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and Condition
</div>

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot( x=df['condition'], y=df['price'] )

plt.title('Statistical Distribution of Condition versus Price')
plt.show()

The boxplot above shows the relation between house condition and price. It is clear from the figure that houses in very good condition are more expensive.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and Grade
</div>

In [None]:
plt.figure(figsize=(11,6))
sns.lineplot(x='grade', y='price', data=df)
plt.show()

The price and grade of the house have a clear positive correlation.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and yr_built
</div>

In [None]:
plt.figure(figsize=(11,6))
sns.lineplot(x='yr_built', y='price', data=df)
plt.show()

* The relation between the year a house was built and its price is interesting. Very old houses are expensive, perhaps because of their historical value, and relatively new houses are expensive too.
* Houses that are neither very old nor new have lower prices, since they have neither historical value nor the appeal of a new build.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of log-Price and sqft_lot
</div>

In [None]:
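sns.set_style('whitegrid')
# Log-transform the lot size as well, since it is also heavily right-skewed
df1.sqft_lot = np.log(df1.sqft_lot)
# Recent seaborn versions require x and y as keyword arguments
sns.jointplot(x='sqft_lot', y='price', data=df1, kind='reg')
plt.show()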

Even with the log-transformed sqft_lot feature, there is only a very limited linear relationship between sqft_lot and the target variable.

### Conclusion from Analysis of Bivariate Distribution

**We wanted to identify the house features that affect price. Through visualization, we gathered the following information about the data -**
* Price depends on the number of bedrooms: as the number of bedrooms in a house increases, its price tends to increase.
* Price increases with grade and condition.
* Price increases with living space area (sqft_living).
* Houses with a waterfront are associated with higher prices than houses without one.
* Price depends on the number of floors in the house.

### **Multivariate Distribution**

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of sqft_living vs Price by Condition
</div>

In [None]:
plt.figure(figsize=(12,6))
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e"]
sns.scatterplot(x='sqft_living',y='price',hue='condition',data=df,palette = flatui)
plt.show()

* The graph above shows that most houses are in moderate condition, with living space from roughly 500 to 5,000 sqft and prices below 5 million dollars.
* Houses in good condition tend to have higher prices, but condition has little relation to living space area.
* Very few houses are in very good condition, and their living space is limited, which may be why their prices are not especially high.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Condition vs Price by Waterfront
</div>

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x='condition',y='price',hue='waterfront',data=df)
plt.show()

Homes with a waterfront view are much more expensive than homes without one at every condition level, especially when they are also in better condition.

### Conclusion Multivariate Distribution

* In this analysis of house sales across variables such as sqft_living, bedrooms, condition and waterfront, we found that the sample is unevenly distributed.
* Houses with condition 3-5 (moderate to very good) are more expensive than the rest, and living space area does not appear to differ much between these conditions.

## **Data Preprocessing**

In [None]:
def preprocess_input(df):
    """Add date features, trim the most expensive 1% of sales and return a train/test split."""
    df = df.copy()

    # Create "year", "month" and "day" columns from the "date" column,
    # then drop "id" and "date" since the useful information has been extracted.
    df['date'] = pd.to_datetime(df['date'])
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day

    df.drop(["id", "date"], axis=1, inplace=True)

    # Price is positively skewed, so drop the 1% most expensive houses
    # to reduce the distortion from the long right tail.
    df = df.sort_values(['price'], ascending=False).iloc[int(len(df)*0.01):]

    # Separate the dependent (y) and independent (X) variables
    y = df['price'].values
    X = df.drop('price', axis=1).values

    # Train-test split (70% train, 30% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

    return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = preprocess_input(df)
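
Because `preprocess_input` removes the most expensive 1% of houses before splitting, the test set excludes them too, so the scores below describe performance on trimmed data only. One possible alternative, sketched here as an assumption rather than as this notebook's method, is to split first and trim only the training part. The helper `trim_top_prices` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def trim_top_prices(train: pd.DataFrame, frac: float = 0.01) -> pd.DataFrame:
    """Drop the `frac` most expensive rows, mirroring the trimming rule above."""
    n_drop = int(len(train) * frac)
    return train.sort_values("price", ascending=False).iloc[n_drop:]

# Synthetic frame for illustration only
toy = pd.DataFrame({"sqft_living": np.arange(1, 201), "price": np.arange(1, 201) * 1000})

# Split first (a simple random 70/30 split), then trim only the training part
train = toy.sample(frac=0.7, random_state=50)
test = toy.drop(train.index)
train_trimmed = trim_top_prices(train)
print(len(train), len(train_trimmed), len(test))
```

With this ordering the test set keeps its expensive houses, so the score reflects how the model would behave on unfiltered sales.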

## **Model Building**

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Linear Regression
</div>

In [None]:
reg = LinearRegression()
reg.fit(X_train,y_train)

print(f'R² score: {r2_score(y_test, reg.predict(X_test))*100}')
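
For reference, `r2_score` computes the coefficient of determination, 1 − SS_res / SS_tot, and the cell above multiplies it by 100, so the printed value is a percentage. A minimal NumPy sketch of the same formula on toy numbers (`r2_manual` is a hypothetical name):

```python
import numpy as np

def r2_manual(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared deviations from the mean
    return float(1.0 - ss_res / ss_tot)

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_manual(y_true, y_true)        # predictions equal to the targets
baseline = r2_manual(y_true, [6.0] * 4)    # always predicting the mean of y_true
print(perfect, baseline)
```

A score near 100 in the notebook's output therefore means the model explains almost all of the variance in test prices, while a score near 0 means it does no better than predicting the mean.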

#### Let's try some different models.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
SVR
</div>

In [None]:
# scaling data
sc_x = StandardScaler()
Xtrain_scaled = sc_x.fit_transform(X_train)
Xtest_scaled = sc_x.transform(X_test)

sc_y = StandardScaler()
ytrain_scaled = np.ravel(sc_y.fit_transform(y_train.reshape(-1,1)))
ytest_scaled = np.ravel(sc_y.transform(y_test.reshape(-1,1)))
svr_reg = SVR(kernel='rbf')
svr_reg.fit(Xtrain_scaled,ytrain_scaled)

print(f'R² score: {r2_score(ytest_scaled, svr_reg.predict(Xtest_scaled))*100}')
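
The SVR above is trained on a standardized target, so its raw predictions are in standard-deviation units rather than dollars. R² is unaffected by this linear rescaling, but to report prices the predictions need to be mapped back with the target scaler's `inverse_transform`. A small sketch with made-up prices; `scaled_pred` stands in for the output of `svr_reg.predict(...)`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices for illustration only
y_train = np.array([300_000.0, 450_000.0, 540_000.0, 645_000.0])

sc_y = StandardScaler()
y_scaled = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()

# Stand-in for svr_reg.predict(...): any array on the scaled target axis
scaled_pred = y_scaled.copy()

# Map scaled predictions back to dollars
pred_dollars = sc_y.inverse_transform(scaled_pred.reshape(-1, 1)).ravel()
print(pred_dollars)
```

In the notebook's own variables this would be `sc_y.inverse_transform(svr_reg.predict(Xtest_scaled).reshape(-1, 1))`.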

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Random Forest
</div>

In [None]:
rf = RandomForestRegressor(n_estimators = 50,random_state=50)
rf.fit(X_train,y_train.ravel())

print(f'R² score: {r2_score(y_test, rf.predict(X_test))*100}')

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
XGBoost Regressor
</div>

In [None]:
xgb = xgboost.XGBRegressor(n_estimators = 300,learning_rate=0.05,max_depth=5,random_state=50)
xgb.fit(X_train,y_train.ravel())

print(f'R² score: {r2_score(y_test, xgb.predict(X_test))*100}')
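
To compare models side by side, the scores can be collected in a dictionary and printed in sorted order instead of being read from separate cells. This is a sketch on synthetic data with two of the model types used above; the data, the variable names and the choice of models are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data for illustration only
rng = np.random.default_rng(50)
X = rng.uniform(0, 1, size=(300, 3))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=50)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=50),
}
# Fit each model and store its test-set R^2
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te)) for name, m in models.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {score * 100:.2f}")
```

The fitted SVR and XGBoost models from this notebook could be added to the same dictionary, keeping in mind that the SVR needs the scaled inputs.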

### Result Visualization

In [None]:
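plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
plt.scatter(y_test, rf.predict(X_test), color='b')
plt.plot(y_test,y_test, color='r')  # red line: perfect predictions (predicted = actual)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.title('Random Forest', color='r')

plt.subplot(1,2,2)
plt.scatter(y_test, xgb.predict(X_test), color='b')
plt.plot(y_test,y_test, color='r')  # red line: perfect predictions (predicted = actual)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.title('XGBoost', color='r')
plt.show()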

## Conclusion

<p>In conclusion, we first prepared and visualized the data, then tried several different models. Based on the R² score, we chose XGBoost as the best model. Thank you, I hope it's useful.</p>
