# Importing Packages

In [None]:
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
import math
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Data Analysis

In [None]:
dataset_df = pd.read_csv('../input/real-estate/Real estate.csv', index_col=False)    #read csv (without index column)
dataset_df = dataset_df.drop('No', axis=1)    #drop first column

In [None]:
dataset_df

## Dataset Summary

None of the columns contain null values.

In [None]:
dataset_df.info()
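
The absence of nulls can also be checked explicitly rather than read off the `info()` output. A minimal sketch, using a small made-up frame (`toy_df` and its values are illustrative, not the real dataset) in place of `dataset_df`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for dataset_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    'house_age': [32.0, 19.5, np.nan],
    'no_of_convenience_stores': [10, 9, 5],
})

# Count missing values per column; a column with no nulls reports 0.
null_counts = toy_df.isnull().sum()
print(null_counts)

# True only if the whole frame is free of nulls.
has_no_nulls = not toy_df.isnull().values.any()
print('No null values:', has_no_nulls)
```

On the real data the same two checks would be run on `dataset_df` instead of `toy_df`.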

## Statistics of Data

In [None]:
dataset_df.describe()

## Problem Statement And Dataset Explanations
Build a linear regression model to predict the price per unit area of houses, given the following features.

##### Features
1) <b>Transaction Date</b> - Date on which the house was sold<br>
2) <b>House Age</b> - Age of the house (in years)<br>
3) <b>Distance to nearest MRT Station</b> - Distance between the house and the nearest MRT station (in metres)<br>
4) <b>No. of Convenience Stores</b> - Number of convenience stores in the neighbourhood<br>
5) <b>Latitude, Longitude</b> - Coordinates of the house<br>
6) <b>House price of unit area</b> - Price per unit area of the house (currency) <br>

<b>House price of unit area</b> is the dependent variable,
while the others are independent variables.

### 1) Transaction Date

In [None]:
dataset_df.transaction_date.describe()

In [None]:
fig = px.histogram(dataset_df, x='transaction_date', title='Transaction Date Distribution', marginal='box', color_discrete_sequence=['red'])
fig.update_layout(bargap=0.1)
fig.show()

More houses were sold in 2013 - 2013.09 and 2013.4 - 2013.59 than in other periods.<br>
The surge may be due to companies buying land for development or to government tenders,<br>
which would have made owners more interested in selling.

### 2) House Age

In [None]:
dataset_df.house_age.describe()

In [None]:
fig = px.histogram(dataset_df, x='house_age', nbins=44, title='House-Age Distribution', marginal='box', color_discrete_sequence=['blue'])
fig.update_layout(bargap = 0.1)
fig.show()

Many of the houses sold are between 12.5 and 18.5 years old.

### 3) Distance to the Nearest MRT Station

In [None]:
dataset_df.distance_to_the_nearest_MRT_station.describe()

In [None]:
fig = px.histogram(dataset_df, x='distance_to_the_nearest_MRT_station', title='Distance To Nearest MRT Station Distribution', marginal='box', color_discrete_sequence=['black'])
fig.update_layout(bargap=0.1)
fig.show()

Many houses are within roughly 2.5 - 3 km (2,500 - 3,000 metres) of an MRT station.<br>
Around 21 houses are 4 - 5 km from an MRT station.<br>
Only around 45 houses are 3 - 6.5 km away.<br>

It may be that the shorter the distance to an MRT station, the more houses were sold.

### 4) Number of Convenience Stores

In [None]:
dataset_df.no_of_convenience_stores.describe()

In [None]:
fig = px.histogram(dataset_df, x='no_of_convenience_stores', title='Number Of Convenience Stores Distribution', nbins=11, marginal='box', color_discrete_sequence=['green'])
fig.update_layout(bargap=0.1)
fig.show()

Half of the houses sold have fewer than 5 convenience stores nearby.<br>
67 of the houses sold have no convenience stores in the neighbourhood.<br>
67 of the houses sold have 5 convenience stores in the neighbourhood.<br>

<b>Note :</b> This may only mean that houses with few or no convenience stores are more common in the dataset,<br>
not that they sell more often than houses with more than 4 convenience stores in the neighbourhood.

### 5) Latitude

In [None]:
dataset_df.latitude.describe()

In [None]:
fig = px.histogram(dataset_df, x='latitude', title='Latitude Distribution', color_discrete_sequence=['orange'], marginal='box')
fig.update_layout(bargap=0.1)
fig.show()

One degree of latitude spans about 110 km.<br>
Latitude appears to be roughly normally distributed.<br>
There are very few houses at latitudes 24.99 - 25.01 and 24.93 - 24.94.<br>

The largest number of houses (80) lies at latitudes 24.97 - 24.97499.<br>
This is probably because it is the most important area, where houses command higher prices<br>
and more houses were sold.

### 6) Longitude

In [None]:
dataset_df.longitude.describe()

In [None]:
fig = px.histogram(dataset_df, x='longitude', title='Longitude Distribution', color_discrete_sequence=['violet'], marginal='box')
fig.update_layout(bargap=0.1)
fig.show()

One degree of longitude spans about 101 km at this latitude (about 25°N).<br>
Many of the houses sold are at longitudes 121.54 - 121.545.<br>
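
A quick way to sanity-check kilometres-per-degree figures is to compute them. The sketch below assumes a spherical Earth and a round figure of 111.32 km per degree at the equator; `km_per_degree_longitude` is a helper written for this illustration, not part of any library.

```python
import math

# Approximate length of one degree of latitude / longitude at the equator.
# This is a spherical-Earth approximation, good enough for a sanity check.
KM_PER_DEGREE_AT_EQUATOR = 111.32


def km_per_degree_longitude(latitude_deg):
    """Approximate east-west distance (km) covered by one degree of longitude."""
    return KM_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(latitude_deg))


# The houses in this dataset sit near latitude 24.97 (see the latitude histogram).
print(round(km_per_degree_longitude(24.97), 1))
```

The length of a degree of longitude shrinks with the cosine of the latitude, while a degree of latitude stays close to constant.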

In [None]:
dataset_df.describe()

# Scatter & Violin Plots

## 1) Violin Plot : Transaction Date - House Price

In [None]:
fig = px.violin(dataset_df, x='transaction_date', y='house_price_of_unit_area', title='Transaction Date Vs House Price')
fig.update_traces(marker_size=5)
fig.show()

In [None]:
print('Correlation between Transaction Date and House Price =',dataset_df.house_price_of_unit_area.corr(dataset_df.transaction_date))

From the plot, house price does not appear to depend on transaction date.<br>
There is only a slight dependency, visible as a small increase in height towards the right.<br>
The correlation is also very low.

## 2) Scatter Plot : House Age - House Price

In [None]:
fig = px.scatter(dataset_df, x='house_age', y='house_price_of_unit_area', title='House Age Vs House Price', opacity=0.8)
fig.update_traces(marker_size=5)
fig.show()

In [None]:
print('Correlation between House Age and House Price =',dataset_df.house_price_of_unit_area.corr(dataset_df.house_age))

From the plot, price decreases as age increases (from 0 to 20 years).<br>
Price stays roughly constant as age increases (from 25 to 45 years).<br>
There are some outliers (price > 60 and age > 30).<br>

House price and age are probably linearly dependent for age < 25.<br>

The correlation between age and price is also significant.<br>
It is negative because the price of a house depreciates as its age increases.

## 3) Scatter Plot : Distance To Nearest MRT Station - House Price

In [None]:
fig = px.scatter(dataset_df, x='distance_to_the_nearest_MRT_station', y='house_price_of_unit_area', title='Distance To Nearest MRT Station Vs House Price', opacity=0.8)
fig.update_traces(marker_size=5)
fig.show()

In [None]:
print('Correlation between Distance to Nearest MRT Station and House Price =',dataset_df.house_price_of_unit_area.corr(dataset_df.distance_to_the_nearest_MRT_station))

From the plot,<br>
house price decreases linearly as the distance from the house to the nearest MRT station increases.<br>
The correlation is also stronger than for house age.<br>
It is negative because price decreases as the distance from MRT stations increases.<br>

Therefore, distance and price are strongly dependent.

## 4) Violin Plot : Number of Convenience Stores - House Price

In [None]:
fig = px.violin(dataset_df, x='no_of_convenience_stores', y='house_price_of_unit_area', title='Number of Convenience Stores Vs House Price')
fig.update_traces(marker_size=5)
fig.show()

In [None]:
print('Correlation between Number of Convenience Stores And House Price =',dataset_df.house_price_of_unit_area.corr(dataset_df.no_of_convenience_stores))

From the plot,<br>
house price increases with the number of convenience stores near the house.<br>
The correlation between these two variables is also significant.<br>

Therefore, the number of convenience stores and house price are linearly dependent.

## 5) Scatter Plot : Latitude - House Price

In [None]:
fig = px.scatter(dataset_df, x='latitude', y='house_price_of_unit_area', title='Latitude Vs House Price', opacity=0.8)
fig.update_traces(marker_size=5)
fig.show()

In [None]:
print('Correlation between Latitude And House Price =',dataset_df.house_price_of_unit_area.corr(dataset_df.latitude))

From the plot,<br>
house price increases as latitude increases towards 25.<br>
The correlation between these two variables is also significant.<br>

Therefore, latitude and house price are linearly dependent.

## 6) Scatter Plot : Longitude - House Price

In [None]:
fig = px.scatter(dataset_df, x='longitude', y='house_price_of_unit_area', title='Longitude Vs House Price', opacity=0.8)
fig.update_traces(marker_size=5)
fig.show()

In [None]:
print('Correlation between Longitude And House Price =',dataset_df.house_price_of_unit_area.corr(dataset_df.longitude))

From the plot,<br>
house price increases as longitude increases towards 121.54.<br>
The correlation between these two variables is also significant.<br>

Therefore, longitude and house price are linearly dependent.

## Correlation

In [None]:
dataset_df.corr()

### Heatmap

In [None]:
sns.heatmap(dataset_df.corr(), annot=True, cmap='Reds');
plt.title('Correlation Matrix');

## Result

<b>Features House Price Depends On :</b> (in decreasing order of dependence)<br>
    1) Distance To Nearest MRT Station<br>
    2) Number of Convenience Stores<br>
    3) Latitude<br>
    4) Longitude<br>
    5) House Age<br>
    
<b>Features House Price Does Not Depend On :</b><br>
    1) Transaction Date
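
An ordering like the one above can be read off the correlation matrix programmatically by sorting the absolute correlations with the target. A minimal sketch on synthetic data (`toy_df`, its values and the `unrelated_feature` column are invented for illustration; on the real data `dataset_df` would take its place):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dataset_df: the numbers are invented for illustration.
rng = np.random.default_rng(0)
n = 300
distance = rng.uniform(0, 6000, n)
stores = rng.integers(0, 11, n).astype(float)
noise_feature = rng.normal(0, 1, n)          # unrelated to price by construction
price = 50 - 0.006 * distance + 2.0 * stores + rng.normal(0, 3, n)

toy_df = pd.DataFrame({
    'distance_to_the_nearest_MRT_station': distance,
    'no_of_convenience_stores': stores,
    'unrelated_feature': noise_feature,      # hypothetical column
    'house_price_of_unit_area': price,
})

target = 'house_price_of_unit_area'
# Correlation of every feature with the target, ranked by absolute strength.
ranking = (
    toy_df.corr()[target]
    .drop(target)
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Taking the absolute value matters here because a strong negative correlation (as with distance) indicates as much linear dependence as a strong positive one.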

# 2) Linear Regression - Single Feature
### 1) Transaction Date Vs House Price
#### Extract Data From Dataset

In [None]:
X = np.array(dataset_df['transaction_date'])
X = X.reshape((-1,1))
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Visualize

In [None]:
sns.scatterplot(data=dataset_df,x='transaction_date', y='house_price_of_unit_area', alpha=0.7, s=15)
plt.plot(X, Y_pred, color='r')
plt.title('Transaction Date Vs House Price');

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))
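
Because the cell above takes the square root of the mean squared error, the number it prints is the root mean squared error (RMSE), which is in the same units as the house price. The sketch below shows the difference on made-up numbers; `mse_by_hand` and `rmse_by_hand` are helpers written for this illustration only.

```python
import math


def mse_by_hand(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_by_hand(y_true, y_pred):
    """Square root of the MSE; expressed in the same units as the target."""
    return math.sqrt(mse_by_hand(y_true, y_pred))


# Made-up prices and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mse = mse_by_hand(y_true, y_pred)
rmse = rmse_by_hand(y_true, y_pred)
print('MSE  =', mse)
print('RMSE =', rmse)
```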

### 2) House Age Vs House Price
#### Extract Data From Dataset

In [None]:
X = np.array(dataset_df['house_age'])
X = X.reshape((-1,1))
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Visualize

In [None]:
sns.scatterplot(data=dataset_df,x='house_age', y='house_price_of_unit_area', alpha=0.7, s=15)
plt.plot(X, Y_pred, color='r')
plt.title('House Age Vs House Price');

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### 3) Distance To Nearest MRT Station Vs House Price
#### Extract Data From Dataset

In [None]:
X = np.array(dataset_df['distance_to_the_nearest_MRT_station'])
X = X.reshape((-1,1))
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Visualize

In [None]:
sns.scatterplot(data=dataset_df,x='distance_to_the_nearest_MRT_station', y='house_price_of_unit_area', alpha=0.7, s=15)
plt.plot(X, Y_pred, color='r')
plt.title('Distance to Nearest MRT Station Vs House Price');

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### 4) Number of Convenience Stores Vs House Price
#### Extract Data From Dataset

In [None]:
X = np.array(dataset_df['no_of_convenience_stores'])
X = X.reshape((-1,1))
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Visualize

In [None]:
sns.scatterplot(data=dataset_df,x='no_of_convenience_stores', y='house_price_of_unit_area', alpha=0.7, s=15)
plt.plot(X, Y_pred, color='r')
plt.title('Number of Convenience Stores Vs House Price');

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### 5) Latitude Vs House Price
#### Extract Data From Dataset

In [None]:
X = np.array(dataset_df['latitude'])
X = X.reshape((-1,1))
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Visualize

In [None]:
sns.scatterplot(data=dataset_df,x='latitude', y='house_price_of_unit_area', alpha=0.7, s=15)
plt.plot(X, Y_pred, color='r')
plt.title('Latitude Vs House Price');

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### 6) Longitude Vs House Price
#### Extract Data From Dataset

In [None]:
X = np.array(dataset_df['longitude'])
X = X.reshape((-1,1))
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Visualize

In [None]:
sns.scatterplot(data=dataset_df,x='longitude', y='house_price_of_unit_area', alpha=0.7, s=15)
plt.plot(X, Y_pred, color='r')
plt.title('Longitude Vs House Price');

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

#### Error Analysis (RMSE, Increasing Order)
<b>1) Distance To Nearest MRT Station Vs House Price</b> - 10.044189842789505<br>
<b>2) Number of Convenience Stores Vs House Price</b> - 11.156701668848857<br>
<b>3) Latitude Vs House Price</b> - 11.382820998812367<br>
<b>4) Longitude Vs House Price</b> - 11.580849251674602<br>
<b>5) House Age Vs House Price</b> - 13.285348095978287<br>
<b>6) Transaction Date Vs House Price</b> - 13.537931667653718<br>

Of these, only transaction_date is effectively independent of House Price.
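
The six single-feature fits above repeat the same steps, so the comparison can also be produced in one loop. A sketch, assuming synthetic data (`toy_df`, with invented values and only two features) in place of `dataset_df`:

```python
import math

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for dataset_df; values are invented for illustration.
rng = np.random.default_rng(1)
n = 300
toy_df = pd.DataFrame({
    'distance_to_the_nearest_MRT_station': rng.uniform(0, 6000, n),
    'house_age': rng.uniform(0, 45, n),
})
toy_df['house_price_of_unit_area'] = (
    50
    - 0.006 * toy_df['distance_to_the_nearest_MRT_station']
    - 0.2 * toy_df['house_age']
    + rng.normal(0, 3, n)
)

target = 'house_price_of_unit_area'
Y = toy_df[target].to_numpy()

rmse_by_feature = {}
for feature in toy_df.columns.drop(target):
    X = toy_df[[feature]].to_numpy()          # 2-D array, shape (n, 1)
    model = LinearRegression().fit(X, Y)
    Y_pred = model.predict(X)
    rmse_by_feature[feature] = math.sqrt(metrics.mean_squared_error(Y, Y_pred))

# Print features from lowest to highest RMSE.
for feature, rmse in sorted(rmse_by_feature.items(), key=lambda item: item[1]):
    print(f'{feature}: {rmse:.3f}')
```

Selecting the column with double brackets (`toy_df[[feature]]`) keeps the input two-dimensional, which avoids the separate `reshape((-1,1))` step.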

# 3) Linear Regression - Multiple Features
### 1) Distance To Nearest MRT Station and Number of Convenience Stores Vs House Price
#### Extract Data from Dataset

In [None]:
X = np.array(dataset_df[['distance_to_the_nearest_MRT_station','no_of_convenience_stores']])
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### Error Reduction
Error between Distance To Nearest MRT Station Vs House Price - 10.044189842789505<br>
Error between Distance To Nearest MRT Station and Number of Convenience Stores Vs House Price - 9.64253326787197<br>

The error is reduced, so keep the variable 'Number of Convenience Stores'.

### 2) Distance To Nearest MRT Station, Number of Convenience Stores and Latitude Vs House Price
#### Extract Data from Dataset

In [None]:
X = np.array(dataset_df[['distance_to_the_nearest_MRT_station','no_of_convenience_stores', 'latitude']])
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### Error Reduction
Error between Distance To Nearest MRT Station Vs House Price - 10.044189842789505<br>
Error between Distance To Nearest MRT Station and Number of Convenience Stores Vs House Price - 9.64253326787197<br>
Error between Distance To Nearest MRT Station, Number of Convenience Stores and Latitude Vs House Price - 9.403946229570993

The error is reduced, so keep the variable 'Latitude'.

### 3) Distance To Nearest MRT Station, Number of Convenience Stores, Latitude and Longitude Vs House Price
#### Extract Data from Dataset

In [None]:
X = np.array(dataset_df[['distance_to_the_nearest_MRT_station','no_of_convenience_stores', 'latitude', 'longitude']])
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### Error Reduction
Error between Distance To Nearest MRT Station Vs House Price - 10.044189842789505<br>
Error between Distance To Nearest MRT Station and Number of Convenience Stores Vs House Price - 9.64253326787197<br>
Error between Distance To Nearest MRT Station, Number of Convenience Stores and Latitude Vs House Price - 9.403946229570993<br>
Error between Distance To Nearest MRT Station, Number of Convenience Stores, Latitude and Longitude Vs House Price - 9.403907424013934

The error is reduced only very slightly, so it is <b>not mandatory</b> to keep the variable 'Longitude'.<br>
Let us keep it anyway.

<b>Note :</b><br> Error with neither Latitude nor Longitude - 9.64253326787197<br>
Error with only Latitude - 9.403946229570993<br>
Error with only Longitude - 9.639589308289922<br>
Error with both Latitude and Longitude - 9.403907424013934<br>

The correlation between House Price and longitude is close to that of latitude, yet adding longitude gives no significant reduction in error.<br>

Before explaining why, the next section gives some background.

### Correlation Vs Causation

<b>Correlation</b> is a statistical measure that tells us how strongly a pair of variables are linearly related and change together. It does not tell us the why and how behind the relationship; it <b>just says the relationship exists.</b><br>
<b>Example:</b> Correlation between ice cream sales and sunglasses sales. As ice cream sales increase, so do sunglasses sales. Ice cream is not causing more sunglasses to be sold; both are driven by a common factor, summer.

<b>Causation</b> goes a step further than correlation. It says that any change in the value of one variable will cause a change in the value of another variable, which means <b>one variable makes the other happen</b>.<br>
<b>Example:</b> When a person exercises, the number of calories burned goes up every minute. The former is <b>causing</b> the latter to happen.<br>

So <b>“Correlation does not imply causation!”.</b> Just because two things are linked doesn't mean that one causes the other.

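So, although the correlation between House Price and longitude is close to that of latitude, adding longitude gives no significant reduction in error,<br> because longitude and House Price are correlated but longitude does not cause House Price to rise. A related possibility is that longitude largely repeats information already carried by the other features in the model.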
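
One way a feature can be well correlated with the target yet barely reduce the error is redundancy: it tracks a feature that is already in the model. The sketch below illustrates this on invented data and is an assumption about the mechanism, not a result from this dataset; `longitude_like` is a hypothetical feature, and NumPy's least-squares solver stands in for `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Invented data: 'distance' drives the price; 'longitude_like' merely tracks
# 'distance' and has no effect of its own on the price.
distance = rng.normal(0, 1, n)
longitude_like = -0.8 * distance + 0.6 * rng.normal(0, 1, n)   # hypothetical feature
price = -2.0 * distance + rng.normal(0, 1, n)


def training_rmse(*columns):
    """Ordinary least squares with an intercept; returns the training RMSE."""
    X = np.column_stack([np.ones(n), *columns])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    return float(np.sqrt(np.mean((price - X @ coef) ** 2)))


corr_with_price = float(np.corrcoef(longitude_like, price)[0, 1])
rmse_distance_only = training_rmse(distance)
rmse_with_both = training_rmse(distance, longitude_like)

print('corr(longitude_like, price) =', round(corr_with_price, 3))
print('RMSE, distance only         =', round(rmse_distance_only, 4))
print('RMSE, distance + longitude  =', round(rmse_with_both, 4))
```

Whether this applies to the real data can be checked in the correlation matrix by looking at how strongly longitude correlates with the features already included.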


### 4) Distance To Nearest MRT Station, Number of Convenience Stores, Latitude, Longitude and House Age Vs House Price
#### Extract Data from Dataset

In [None]:
X = np.array(dataset_df[['distance_to_the_nearest_MRT_station','no_of_convenience_stores', 'latitude', 'longitude', 'house_age']])
Y = np.array(dataset_df['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### Error Reduction
Error between Distance To Nearest MRT Station Vs House Price - 10.044189842789505<br>
Error between Distance To Nearest MRT Station and Number of Convenience Stores Vs House Price - 9.64253326787197<br>
Error between Distance To Nearest MRT Station, Number of Convenience Stores and Latitude Vs House Price - 9.403946229570993<br>
Error between Distance To Nearest MRT Station, Number of Convenience Stores, Latitude and Longitude Vs House Price - 9.403907424013934<br>
Error between Distance To Nearest MRT Station, Number of Convenience Stores, Latitude, Longitude and House Age Vs House Price - 8.899542229357065<br>

The error is reduced, so keep the variable 'House Age'.
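
The step-by-step comparison above can be written as a loop that adds one feature at a time and records the error. A sketch on invented data (the columns `distance`, `stores` and `pure_noise` are hypothetical), using NumPy's least-squares solver rather than `LinearRegression`. It also shows a caveat: the training error of an ordinary least-squares model can never increase when a feature is added, so a tiny reduction on its own is weak evidence that the feature helps.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
# Invented features; the last one is pure noise, unrelated to the target.
features = {
    'distance': rng.uniform(0, 6, n),
    'stores': rng.integers(0, 11, n).astype(float),
    'pure_noise': rng.normal(0, 1, n),      # hypothetical column
}
y = 50 - 6 * features['distance'] + 1.5 * features['stores'] + rng.normal(0, 3, n)


def training_rmse(columns):
    """Fit ordinary least squares with an intercept and return the training RMSE."""
    X = np.column_stack([np.ones(n)] + [features[c] for c in columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))


selected = []
rmse_history = []
for name in features:                        # add features one at a time, in order
    selected.append(name)
    rmse_history.append(training_rmse(selected))
    print(selected, round(rmse_history[-1], 4))
```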

# 4) Separate Linear Regression For House Age >= 25 and House Age < 25?

In [None]:
fig = px.scatter(dataset_df, x='house_age', y='house_price_of_unit_area', title='House Age Vs House Price', opacity=0.8)
fig.update_traces(marker_size=5)
fig.show()

From the plot,<br>
House Price seems to decrease as House Age increases (0 - 25)<br>
and to increase as House Age increases (25 - 45).

### 1) House Age <  25
#### Extract Data

In [None]:
younger_house = dataset_df[dataset_df.house_age<25]
X = np.array(younger_house[['distance_to_the_nearest_MRT_station','no_of_convenience_stores', 'latitude', 'longitude', 'house_age']])
Y = np.array(younger_house['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### 2) House Age >= 25
#### Extract Data

In [None]:
older_house = dataset_df[dataset_df.house_age>=25]
X = np.array(older_house[['distance_to_the_nearest_MRT_station','no_of_convenience_stores', 'latitude', 'longitude', 'house_age']])
Y = np.array(older_house['house_price_of_unit_area'])

#### Train

In [None]:
model = LinearRegression().fit(X, Y)

#### Predict

In [None]:
Y_pred = model.predict(X)

#### Error

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### Result & Conclusions
Error of the Single Linear Regression Model = 8.899542229357065<br>
Average Error of the Two Linear Regression Models = $(8.415035493436214 + 8.601557705002556)/2 = 8.508296599219385$

Difference in Error = $8.899542229357065 - 8.508296599219385 = 0.3912456301376803 \approx 0.4$

Therefore, using two separate linear regression models (for House Age < 25 and House Age >= 25) gives a slightly lower error than the single model.
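
A plain average of the two RMSE values treats both groups as equally large. If the groups differ in size, a count-weighted (pooled) RMSE is a fairer number to compare against the single-model RMSE. A sketch, where `pooled_rmse` is a helper written for this illustration and the row counts are placeholders, because the group sizes are not reported in this notebook:

```python
import math


def pooled_rmse(rmse_and_counts):
    """Combine per-group RMSEs into one RMSE over all rows.

    Each squared RMSE is weighted by the number of rows in its group.
    """
    total_rows = sum(count for _, count in rmse_and_counts)
    total_squared_error = sum(rmse ** 2 * count for rmse, count in rmse_and_counts)
    return math.sqrt(total_squared_error / total_rows)


# RMSE values are the two reported above, in the order listed there; the row
# counts are hypothetical placeholders and should be replaced by
# len(younger_house) and len(older_house).
groups = [
    (8.415035493436214, 250),   # hypothetical count
    (8.601557705002556, 164),   # hypothetical count
]
combined = pooled_rmse(groups)
print('Pooled RMSE =', combined)
```

The pooled value always lies between the two group RMSEs and moves towards the larger group's value.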

# 6) Model Improvements - Feature Scaling

In [None]:
feature_cols = ['distance_to_the_nearest_MRT_station','no_of_convenience_stores', 'latitude', 'longitude', 'house_age']
scaler = StandardScaler()
scaled_X = scaler.fit_transform(dataset_df[feature_cols])    #standardize each feature (zero mean, unit variance)
Y = np.array(dataset_df['house_price_of_unit_area'])

In [None]:
model = LinearRegression().fit(scaled_X, Y)

In [None]:
Y_pred = model.predict(scaled_X)

In [None]:
print('Root Mean Squared Error =',math.sqrt(metrics.mean_squared_error(Y, Y_pred)))

### Result & Conclusion
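Feature scaling does not reduce the error of the linear regression.<br>
An ordinary least-squares fit with an intercept gives the same predictions after any linear rescaling of the features; only the coefficients change.<br>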
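
The reason scaling leaves the error unchanged can be checked numerically: an ordinary least-squares fit with an intercept produces the same fitted values whether or not the features are standardized. A sketch on invented data, using NumPy's least-squares solver and standardizing by hand (zero mean, unit variance per column, which is what `StandardScaler` does by default):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Invented features on very different scales.
X = np.column_stack([
    rng.uniform(0, 6000, n),     # e.g. a distance in metres
    rng.uniform(24.9, 25.0, n),  # e.g. a latitude in degrees
])
y = 40 - 0.005 * X[:, 0] + 100 * (X[:, 1] - 24.95) + rng.normal(0, 2, n)


def ols_predictions(features):
    """Least-squares fit with an intercept; returns the fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef


# Standardize by hand: zero mean and unit variance per column.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = ols_predictions(X)
pred_scaled = ols_predictions(X_scaled)

rmse_raw = float(np.sqrt(np.mean((y - pred_raw) ** 2)))
rmse_scaled = float(np.sqrt(np.mean((y - pred_scaled) ** 2)))
print(rmse_raw, rmse_scaled)
```

Scaling still matters for models that are sensitive to feature magnitudes, such as regularized regression or gradient-based solvers, which are outside the scope of this notebook.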


# 7) Creating Test Data

Hold out 20% of the data as test data.

In [None]:
X = np.array(dataset_df[['distance_to_the_nearest_MRT_station','no_of_convenience_stores', 'latitude', 'longitude', 'house_age']])
Y = np.array(dataset_df['house_price_of_unit_area'])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.2)

In [None]:
model = LinearRegression().fit(X_train, Y_train)

In [None]:
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)

In [None]:
print('Root Mean Squared Training Error =',math.sqrt(metrics.mean_squared_error(Y_train, Y_train_pred)))
print('Root Mean Squared Testing Error =',math.sqrt(metrics.mean_squared_error(Y_test, Y_test_pred)))
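
Without a fixed seed, `train_test_split` shuffles differently on every run, so the two errors above change each time the notebook is executed. A sketch of a repeatable split on invented data (the array shapes and coefficients are assumptions for illustration; `random_state=42` is an arbitrary choice):

```python
import math

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Invented data standing in for the five features and the price.
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 5))
Y = X @ np.array([-3.0, 1.5, 2.0, 0.0, -1.0]) + rng.normal(0, 2, 400)

# A fixed random_state makes the split, and therefore the errors, repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, Y_train)
train_rmse = math.sqrt(metrics.mean_squared_error(Y_train, model.predict(X_train)))
test_rmse = math.sqrt(metrics.mean_squared_error(Y_test, model.predict(X_test)))
print('Train RMSE =', train_rmse)
print('Test RMSE  =', test_rmse)
```

Comparing the two values gives a rough check for overfitting: a test error far above the training error suggests the model does not generalize well.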