# Linear regression:
* In this notebook I use linear regression to build a model that predicts the price of a house from the parameters of that house.
* The libraries I use are numpy and pandas for working with the data, matplotlib and seaborn for visualizing it, and scikit-learn for building the model. For more information about scikit-learn's linear models, see the [scikit-learn website](https://scikit-learn.org/stable/modules/linear_model.html).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# The data set:
#### The dataset I worked on describes house prices based on these 6 parameters:
* 1-transaction date
* 2-house age
* 3-distance to the nearest MRT station
* 4-number of convenience stores
* 5-latitude
* 6-longitude
#### In this notebook I try to build a model that predicts the 'house price of unit area' from these parameters.

In [None]:
df = pd.read_csv('../input/real-estate-price-prediction/Real estate.csv')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.info()

# Exploratory data analysis:

#### The chart below shows the distribution of 'house price of unit area'. Based on this chart, the mean of 'house price of unit area' is about 40 and the maximum price is about 120.

In [None]:
sns.displot(df['Y house price of unit area'], kde=True, aspect=2, color='purple')
plt.show()

## Correlation:
#### To check the correlation between each parameter and the house price, I drew 6 scatter plots, one per parameter.
* chart 1: there is no noticeable correlation between transaction date and house price.
* chart 2: there is a small negative correlation between house age and house price.
* chart 3: there is a negative correlation between distance to the nearest MRT station and house price. This means that as the 'distance to the nearest MRT station' increases, the house price decreases.
* chart 4: there is a positive correlation. This means that the more convenience stores there are, the higher the house price.
* chart 5 and 6: latitude and longitude both show a positive correlation with house price.

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15))
fig.subplots_adjust(wspace=0.3, hspace=0.3)

# Skip the first column (row number) and the last column (target);
# axes.flat walks the 3x2 grid row by row, so chart 1 is top-left.
for i, ax in zip(range(1, df.shape[1] - 1), axes.flat):
    ax.set_title(f'chart {i}', fontsize=20)
    sns.scatterplot(data=df, x=df.columns[i], y='Y house price of unit area', ax=ax)

#### To better understand the correlations, look at the last row of the heatmap below. As mentioned, 'house age' and 'distance to the nearest MRT station' have a negative correlation with house price, while the 'number of convenience stores' and the geographical location (latitude and longitude) have a positive correlation with it.
#### Note: with this colormap, darker green marks a higher (more positive) correlation, and lighter shades toward white mark lower values, including the negative correlations.

In [None]:
fig = plt.figure(figsize=(10,5))
sns.heatmap(df.iloc[:, 1:].corr(), annot=True, cmap='Greens')
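
The numbers annotated in the heatmap come from `DataFrame.corr()`, which computes the Pearson correlation by default. As a minimal sketch of what one of those numbers means, the block below computes a Pearson coefficient by hand with NumPy and compares it with `np.corrcoef`. The two arrays are made-up illustrative values, not rows from the dataset, and `pearson` is a hypothetical helper name.

```python
import numpy as np

# Made-up illustrative values, NOT rows from the real-estate dataset.
distance = np.array([100.0, 300.0, 500.0, 900.0, 1500.0])
price = np.array([50.0, 45.0, 40.0, 30.0, 20.0])

def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r_manual = pearson(distance, price)
r_numpy = np.corrcoef(distance, price)[0, 1]

# Both values agree; the sign is negative because price falls as distance grows.
print(round(r_manual, 3), round(r_numpy, 3))
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates that a straight line describes the relationship poorly.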

# Split the dataset into train & test:
#### To split the data into training and test sets, I used the scikit-learn ('sklearn') library.

In [None]:
X = df.drop(['Y house price of unit area', 'No'],axis=1)
y = df['Y house price of unit area']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
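
To make the idea behind the split concrete, here is a rough NumPy sketch of a shuffled 70/30 hold-out split. It illustrates the concept only and is not scikit-learn's implementation; `simple_split` is a hypothetical helper, and the sample count of 10 is chosen just for the example.

```python
import numpy as np

def simple_split(n_samples, test_size=0.3, seed=42):
    """Shuffle row indices reproducibly and hold out a fraction for testing."""
    rng = np.random.default_rng(seed)       # a fixed seed makes the shuffle repeatable
    indices = rng.permutation(n_samples)    # random ordering of 0..n_samples-1
    n_test = int(round(test_size * n_samples))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(n_samples=10)

# Every row lands in exactly one of the two sets.
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state=42` above: rerunning the notebook gives the same rows in each set, so the evaluation metrics stay comparable between runs.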

# Create and train the model:

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

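#### This table shows the coefficient of each parameter: the change in the predicted house price for a one-unit increase in that parameter, with the other parameters held fixed. Because the parameters are measured in different units, the raw coefficient sizes are not directly comparable with each other.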

In [None]:
pd.DataFrame(model.coef_, index=X.columns, columns=['Coefficient'])
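
A linear regression coefficient is expressed per unit of its feature, so a feature measured in large units (such as a distance in metres) tends to get a numerically small coefficient even when it matters a lot. The sketch below illustrates this on synthetic data (all values are made up, and the least-squares fit uses `np.linalg.lstsq` rather than the notebook's `LinearRegression`): multiplying each coefficient by the standard deviation of its feature gives the effect of a typical change in that feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Made-up features on very different scales (illustrative only).
distance_m = rng.uniform(0, 5000, size=n)             # large numeric range
n_stores = rng.integers(0, 11, size=n).astype(float)  # small counts, 0..10

# Synthetic target in which both features matter.
price = 40 - 0.005 * distance_m + 1.0 * n_stores + rng.normal(0, 1, size=n)

# Ordinary least squares with an explicit intercept column.
X_design = np.column_stack([np.ones(n), distance_m, n_stores])
coef, *_ = np.linalg.lstsq(X_design, price, rcond=None)
raw = coef[1:]  # drop the intercept

# Effect of a one-standard-deviation change in each feature.
scaled = raw * np.array([distance_m.std(), n_stores.std()])

print('raw coefficients:   ', raw)
print('per-std-dev effects:', scaled)
```

In this synthetic example the raw coefficient of the distance feature is much smaller in magnitude than that of the store count, yet its per-standard-deviation effect is larger.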

# Predict the house price:
#### Now we can predict the house prices of the test set with the model.

In [None]:
y_pred=model.predict(X_test)

# Evaluating the model:
#### In this part we evaluate the model with MAE, MSE and RMSE.
* Mean Absolute Error (MAE): MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.
![title](https://miro.medium.com/max/591/1*OVlFLnMwHDx08PHzqlBDag.gif)

* Mean squared error (MSE): To calculate the MSE, you take the difference between your model’s predictions and the ground truth, square it, and average it out across the whole dataset.

![title](https://miro.medium.com/max/884/1*-e1QGatrODWpJkEwqP4Jyg.png)

* Root mean squared error (RMSE): RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation.

![title](https://miro.medium.com/max/613/1*9hQVcasuwx5ddq_s3MFCyw.gif)
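
The three definitions above can be sketched directly in NumPy. The four values below are made up so the arithmetic is easy to follow by hand; they are not predictions from this notebook's model.

```python
import numpy as np

# Made-up values chosen so the arithmetic can be checked by hand.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat               # [0.5, 0.0, -2.0, -1.0]

mae = np.mean(np.abs(errors))         # (0.5 + 0 + 2 + 1) / 4
mse = np.mean(errors ** 2)            # (0.25 + 0 + 4 + 1) / 4
rmse = np.sqrt(mse)                   # square root of the MSE

print(mae)             # 0.875
print(mse)             # 1.3125
print(round(rmse, 4))  # 1.1456
```

Note that the single largest error (2.0) contributes 4 of the 5.25 total squared error, which is why MSE and RMSE penalize large misses more heavily than MAE does.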

In [None]:
print(f'mean of house price is {df["Y house price of unit area"].mean()}')

from sklearn import metrics

MAE= metrics.mean_absolute_error(y_test, y_pred)
MSE= metrics.mean_squared_error(y_test, y_pred)
RMSE=np.sqrt(MSE)

pd.DataFrame([MAE, MSE, RMSE], index=['MAE', 'MSE', 'RMSE'], columns=['Metrics'])

### What are good RMSE values?
* It is important to recall that RMSE has the same unit as the dependent variable (DV). This means there is no absolute threshold for good or bad; you have to define it relative to your DV. For a variable that ranges from 0 to 1000, an RMSE of 0.7 is small, but if the range is 0 to 1, it is not small anymore. Although a smaller RMSE is always better, you can only judge a given level of RMSE by knowing what is expected of your DV in your field of research. Keep in mind that you can always normalize the RMSE.
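
One common way to normalize the RMSE is to divide it by the range of the dependent variable; dividing by the mean is another option. The sketch below uses the range and reproduces the 0-to-1000 versus 0-to-1 comparison from the paragraph above. The arrays are made up for illustration, and `normalized_rmse` is a hypothetical helper name.

```python
import numpy as np

def normalized_rmse(y_true, y_hat):
    """RMSE divided by the range (max - min) of the true values."""
    rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
    return rmse / (y_true.max() - y_true.min())

# The same absolute errors of +/-0.7 applied on two very different scales.
errors = np.array([0.7, -0.7, 0.7, -0.7, 0.7])

y_wide = np.array([0.0, 250.0, 500.0, 750.0, 1000.0])   # range 0..1000
y_narrow = np.array([0.0, 0.25, 0.5, 0.75, 1.0])        # range 0..1

nrmse_wide = normalized_rmse(y_wide, y_wide + errors)
nrmse_narrow = normalized_rmse(y_narrow, y_narrow + errors)

# An RMSE of 0.7 is 0.07% of the wide range but 70% of the narrow one.
print(nrmse_wide, nrmse_narrow)
```

The normalized value is unitless, which makes it easier to judge whether an error is large relative to the spread of the data.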

### How to compare models with different datasets using RMSE?
* You may use RMSE to compare two models trained on different datasets, provided that the DV is the same in both models. Here, the smaller the better, but remember that small differences between the RMSE values may not be relevant or even significant.

In [None]:
fig = plt.figure(figsize=(5,5))
plt.scatter(y_test, y_pred)
plt.xlabel('Y-Test')
plt.ylabel('Y-Pred')

In [None]:
test_residuals = y_test - y_pred

In [None]:
sns.scatterplot(x=y_test, y=test_residuals)
plt.axhline(y=0, color='r', ls='--')

In [None]:
sns.displot(test_residuals, bins=25, kde=True, aspect=2, color='purple')