# Tehran House Price Visualization & LinearRegression

In this notebook, I work with a real [dataset](https://www.kaggle.com/mokar2001/house-price-tehran-iran) collected by [Mohamadreza Kariminejad](https://www.kaggle.com/mokar2001), which contains house prices in Tehran (the capital of Iran).

We will explore house prices in Tehran with some plots and then build a simple linear regression model for them.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv("/kaggle/input/house-price-tehran-iran/housePrice.csv")

In [None]:
df.head()

In [None]:
df.info()

**According to info(), several rows have a null Address.**

**The number of missing values is not large, but they may belong to expensive houses, so I decided to fill them with the word "Unknown" instead of dropping the rows.**

In [None]:
df = df.fillna("Unknown")
missing_data = df.isnull().sum()
print(missing_data)
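
A side note on the fill step: `fillna("Unknown")` on the whole frame would also write the string into any numeric column that happened to have gaps. Here only Address is affected, but a more defensive variant fills just that column. The sketch below uses a made-up toy frame (the region names and prices are placeholders, not values from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up)
toy = pd.DataFrame({
    "Address": ["Region A", None, "Region B"],
    "Room": [1, 2, 3],
    "Price": [1.0e9, 2.5e9, np.nan],
})

# Count missing values per column before deciding how to fill them
missing_before = toy.isnull().sum()

# Fill only the text column; numeric gaps are left for a separate decision
toy["Address"] = toy["Address"].fillna("Unknown")
missing_after = toy.isnull().sum()

print(missing_before)
print(missing_after)
```

With this approach a missing Price would stay `NaN` and keep its numeric dtype instead of silently turning the column into text.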

***We don't have missing data now. Let's draw some plots :)***

**These five regions of Tehran have the most houses in the dataset. (Note: this statistic may be inaccurate because the dataset is limited.)**

In [None]:
df['Address'].value_counts().nlargest(5)

**Five regions of Tehran with the highest average house prices:**

In [None]:
# Price: Tomans
# Select the column before averaging so non-numeric columns are not averaged
df.groupby('Address')['Price'].mean().nlargest(5).reset_index()

In [None]:
# Select the column before averaging so non-numeric columns are not averaged
addressLPIR = df.groupby('Address')['Price'].mean().nlargest(5).reset_index()
sns.barplot(x="Address",
           y="Price",
           data=addressLPIR)

The same ranking with prices in USD:

In [None]:
# Price: USD
df.groupby('Address')['Price(USD)'].mean().nlargest(5).reset_index()

In [None]:
addressLP = df.groupby('Address')['Price(USD)'].mean().nlargest(5).reset_index()
sns.barplot(x="Address",
           y="Price(USD)",
           data=addressLP)

**Which house size (number of bedrooms) is most common?**

In [None]:
sns.countplot(x="Room",data=df)

**The relation between the number of rooms and the price:**

In [None]:
sns.lineplot(data = df, x = 'Room', y ='Price' )

**Let's take a look at a box plot of it:**

In [None]:
ax = sns.boxplot(x="Room", y="Price", data=df)

The result is interesting: you may find a house in Tehran with five rooms at the same price as a house with two or three rooms.

In [None]:
sns.jointplot(x = "Room", y = "Price", kind = "kde", data = df)

In [None]:
sns.jointplot(data=df,x='Room', y='Price', hue='Elevator')

*Strange and interesting: according to the chart above, most houses with 5 rooms have no elevator. I am not sure why; maybe they were built as villas.*

In [None]:
sns.jointplot(data=df,x='Room', y='Price', hue='Parking')

*Most houses that have at least one room also have parking.*

#### Exploratory Data Analysis

In [None]:
df.Parking = df.Parking.astype(int)
df.Warehouse = df.Warehouse.astype(int)
df.Elevator = df.Elevator.astype(int)
df.Area = df.Area.str.replace(',' , '').astype(int)

**Drop outlier data from Area**

According to the author of the dataset, the Area column contains outliers ([here](https://www.kaggle.com/mokar2001/house-price-tehran-iran/discussion/270637)).

Let's drop them.

First, let me show you which rows in Area are outliers:

In [None]:
df.nlargest(5,'Area')

Because of the large gap between the fifth row and the first four, we can see that the first four rows are outliers.

In [None]:
df.drop( df[df['Area'] >= 2000000].index , inplace=True)
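
The cutoff above was chosen by eye. As a rough alternative, the usual interquartile-range (IQR) rule can flag extreme values automatically. This is only a sketch on made-up numbers, not the method used in this notebook; the series `areas` and the 1.5 multiplier are illustrative assumptions:

```python
import pandas as pd

# Made-up areas: five ordinary values and one implausible entry
areas = pd.Series([60, 75, 90, 110, 150, 3_000_000], name="Area")

q1 = areas.quantile(0.25)
q3 = areas.quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = areas[(areas < lower) | (areas > upper)]
print(outliers)
```

On real data this rule may flag far more rows than a hand-picked threshold, so it is worth inspecting the flagged rows before dropping anything.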

In [None]:
df.info()

**Encode the "Address" column as integers (from a nominal scale to an ordinal scale)**

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Create an instance of LabelEncoder() and store it in labelencoder variable/object
labelencoder = LabelEncoder()
# Assigning Numerical Values and Storing it in "Address_n" Column
df["Address_n"] = labelencoder.fit_transform(df["Address"])
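
One thing to keep in mind: `LabelEncoder` numbers the labels in sorted order, so the integers carry no information about price or location, yet a linear model will treat them as an ordered quantity. The sketch below shows the mapping on made-up labels ("Zone A" and so on are placeholders) and, as one possible alternative that this notebook does not use, one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder labels, not real addresses from the dataset
addresses = pd.Series(["Zone C", "Zone A", "Zone B", "Zone A"], name="Address")

encoder = LabelEncoder()
codes = encoder.fit_transform(addresses)

# classes_ lists the labels in sorted order; the position of each is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# One-hot alternative: one indicator column per label, no implied order
one_hot = pd.get_dummies(addresses, prefix="Address")
print(one_hot.columns.tolist())
```

One-hot encoding adds one column per address, which can make the feature matrix wide when there are many distinct regions.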

In [None]:
df.info()

In [None]:
df.head()

In [None]:
sns.pairplot(data = df)

In [None]:
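# Correlation is only defined for numeric columns, so exclude the text column
df.select_dtypes(include='number').corr()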

In [None]:
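# Use numeric columns only; "Address" is text and cannot be correlated
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='Greens')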

#### Determine features and label:

In [None]:
# Features:
X = df.drop(['Price(USD)'  , 'Price', 'Address'] , axis = 1)
# Label:
y = df['Price(USD)']

#### Split the dataset into train and test data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
X_train.shape , X_test.shape

### Train the model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
model.coef_

In [None]:
pd.DataFrame(model.coef_ , X.columns, columns=['Coefficient'])

#### Predicting test data

In [None]:
y_pred = model.predict(X_test)

### Evaluating the model

In [None]:
from sklearn import metrics
MAE = metrics.mean_absolute_error(y_test, y_pred)
MSE = metrics.mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

In [None]:
pd.DataFrame([MAE, MSE, RMSE], index=['MAE', 'MSE', 'RMSE'], columns=['Metrics'])
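
To make these metrics concrete, here is how they can be computed by hand with NumPy on four made-up values (`y_true` and `y_hat` are illustrative, not taken from the model above). I also include the coefficient of determination R², which is not computed elsewhere in this notebook but is a common companion metric:

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_hat              # residuals: -10, 10, -30, 0

mae = np.mean(np.abs(errors))        # mean absolute error
mse = np.mean(errors ** 2)           # mean squared error
rmse = np.sqrt(mse)                  # same units as the target

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

Because RMSE is in the same units as the target, it is the easiest of the three to compare against a typical price.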

In [None]:
df['Price(USD)'].mean()

In [None]:
test_residuals = y_test - y_pred

In [None]:
sns.displot(test_residuals, bins = 25 , kde = True)

In [None]:
sns.scatterplot(x = y_test, y= test_residuals)
plt.axhline(y = 0, color = '#000' , ls = '--' )

In [None]:
# Actual prices on the x-axis, model predictions on the y-axis
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel('Actual Price(USD)')
plt.ylabel('Predicted Price(USD)')