# **Estimation of House Value in California**

### **Importing Libraries and Dataset**

In [None]:
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')

In [None]:
housing_data = pd.read_csv("../input/california-housing-prices/housing.csv")
housing_data

Here **median_house_value** is the dependent attribute and **longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households and median_income** are independent attributes.

**ocean_proximity** is categorical and must be encoded before the model can be trained; this is handled later, in the One Hot Encoding step.

In [None]:
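# Inspect the blocks whose target equals 500001.0, the maximum value in this dataset (values appear to be capped there).
housing_data[housing_data['median_house_value']==500001.0]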

### **EDA**

**Data Cleaning**

In [None]:
housing_data.info()

In [None]:
housing_data[housing_data['total_bedrooms'].isna()]

**total_bedrooms** has 207 null values.

In [None]:
# Fill the missing values with the column mean (mean() ignores NaN by default).
housing_data['total_bedrooms']=housing_data['total_bedrooms'].fillna(housing_data['total_bedrooms'].mean())
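
The fill above replaces every missing entry with the column mean. As a small illustration on made-up numbers (the frame `toy` is hypothetical and not part of the dataset), the sketch below shows that `Series.mean()` skips NaN by default, so no separate `dropna()` is needed before it. It also compares the mean with the median, which may be a more robust fill value when a column is skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value and one large value.
toy = pd.DataFrame({"total_bedrooms": [100.0, 200.0, np.nan, 300.0, 2000.0]})

# Both statistics skip NaN by default (skipna=True).
mean_value = toy["total_bedrooms"].mean()      # (100 + 200 + 300 + 2000) / 4
median_value = toy["total_bedrooms"].median()  # midpoint of 200 and 300

filled_mean = toy["total_bedrooms"].fillna(mean_value)
filled_median = toy["total_bedrooms"].fillna(median_value)

print(mean_value, median_value)
print(filled_mean.isna().sum(), filled_median.isna().sum())
```

Here the single large value pulls the mean well above the median, which is the kind of effect worth checking before choosing a fill value.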

In [None]:
housing_data.info()

In [None]:
housing_data

In [None]:
housing_data[housing_data.duplicated()==True].count()

No duplicates.

In [None]:
housing_data.describe()

**Plots**

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(18, 10))
sns.histplot(housing_data.median_house_value, bins = 45, color='red')
plt.xlabel('Housing Prices in $')
plt.ylabel('Number of Houses')
plt.title('Median Price of Housing in a Block', fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(18,10))
plt.scatter(housing_data['latitude'],housing_data['longitude'],c=housing_data['population'], cmap='Spectral')
plt.colorbar().set_label("Population")
plt.title('Population Magnitude in Different Areas')
plt.xlabel('Latitude')
plt.ylabel('Longitude')
plt.show()

It can be inferred that most of the places have **Population<5000**.

There are very few places with **Population>15000**.

In [None]:
plt.figure(figsize=(18,10))
housing_data['housing_median_age'].value_counts().plot(kind='bar',color='orange')
plt.xlabel("Age")
plt.ylabel("No. of Houses")
plt.title("Median Age of Housing")

The most frequent value of **housing_median_age** is **52** years.

In [None]:
plt.figure(figsize=(9,5))
housing_data['ocean_proximity'].value_counts().plot(kind='pie',autopct='%1.2f%%')
plt.title('Preferred Proximity from Ocean')

This shows that the largest share of blocks is in the **<1H OCEAN** category (less than an hour from the ocean).

**ISLAND** has the fewest blocks.

In [None]:
sns.heatmap(housing_data.iloc[:,2:9].corr(),annot=True)
plt.title("Correlation Matrix")

Many features have negligible correlation between them.

Features like **population, total_bedrooms, total_rooms and households** are highly correlated with each other.

Maximum correlation is seen between **households and total_bedrooms**.

**median_house_value** has maximum correlation with **median_income**.
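
Reading the strongest predictor off a heatmap can also be done numerically by sorting one column of the correlation matrix. The sketch below uses synthetic data (the frame `toy`, the seed, and the coefficients are all made up for illustration) in which the target is built mostly from income, so income should rank first.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends on income plus noise, and not on age.
rng = np.random.default_rng(0)
n = 500
income = rng.normal(4.0, 1.5, n)
age = rng.uniform(1, 52, n)
value = 40000 * income + rng.normal(0, 20000, n)

toy = pd.DataFrame({
    "median_income": income,
    "housing_median_age": age,
    "median_house_value": value,
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["median_house_value"]
    .drop("median_house_value")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Applied to the notebook's data, the same pattern would be `housing_data.iloc[:,2:9].corr()['median_house_value']`, assuming the column layout used for the heatmap.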

In [None]:
plt.figure(figsize=(16, 10))
plt.scatter(housing_data['median_house_value'],housing_data['median_income'], alpha=0.2,color='white')
plt.xlabel('Median House Value')
plt.ylabel('Median Income')
plt.title('Median Price vs Median Income')
plt.show()

In [None]:
sns.pairplot(housing_data.iloc[:,2:9])

The pair plot shows the pairwise relationships between these attributes.

It suggests that house prices do not vary in a simple linear way with any single attribute.

### **Feature Engineering**

**Modifying the Features to get a Better Model**

In [None]:
housing_data['rooms_per_household']=housing_data['total_rooms']/housing_data['households']
housing_data['bedrooms_per_room']=housing_data['total_bedrooms']/housing_data['total_rooms']
housing_data['bedrooms_per_household']=housing_data['total_bedrooms']/housing_data['households']
housing_data['household_per_population']=housing_data['households']/housing_data['population']
housing_data['population_per_room']=housing_data['population']/housing_data['total_rooms']
housing_data['population_per_bedroom']=housing_data['population']/housing_data['total_bedrooms']
housing_data=housing_data.drop(columns=['households'])
housing_data=housing_data.drop(columns=['total_bedrooms'])

Added six ratio features and dropped the raw **households** and **total_bedrooms** columns.
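
To see why ratios can carry information that raw totals do not, consider two hypothetical blocks with the same number of rooms but different household counts. The helper `add_ratio_features` and the toy numbers below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def add_ratio_features(df):
    """Return a copy of df with a few per-household and per-room ratios added."""
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_room"] = out["population"] / out["total_rooms"]
    return out

# Two made-up blocks with identical total_rooms.
toy = pd.DataFrame({
    "total_rooms": [1000.0, 1000.0],
    "total_bedrooms": [200.0, 250.0],
    "population": [500.0, 2000.0],
    "households": [200.0, 500.0],
})

engineered = add_ratio_features(toy)
print(engineered[["rooms_per_household", "bedrooms_per_room", "population_per_room"]])
```

The raw `total_rooms` column cannot tell these blocks apart, while `rooms_per_household` (5 versus 2) and `population_per_room` (0.5 versus 2) separate them clearly.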

In [None]:
list(housing_data.columns.values)

In [None]:
housing_data=housing_data[[
    'longitude',
    'latitude',
    'housing_median_age',
    'total_rooms',
    'population',
    'median_income',
    'rooms_per_household',
    'bedrooms_per_room',
    'bedrooms_per_household',
    'household_per_population',
    'population_per_room',
    'population_per_bedroom',
    'ocean_proximity',
    'median_house_value'
]]

In [None]:
housing_data

**One Hot Encoding**

As mentioned earlier, **ocean_proximity** needs to be encoded because it is a categorical attribute.

In [None]:
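from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Features are every column except the last; the target is the last column.
X=housing_data.iloc[:,:-1].values
y=housing_data.iloc[:,-1].values
# Column 12 is ocean_proximity; its encoded columns are placed first in the output.
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[12])],remainder='passthrough')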

In [None]:
X=np.array(ct.fit_transform(X))
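
One detail of `ColumnTransformer` matters for the later steps: the columns produced by the listed transformers come first, and the passthrough columns follow. The sketch below shows this on a hypothetical two-column array (`toy_X` and `toy_ct` are illustrative names). With the five categories of **ocean_proximity**, the same behavior puts the encoded columns at positions 0 to 4 of `X`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows: [median_income, ocean_proximity]
toy_X = np.array(
    [[3.5, "INLAND"], [8.1, "NEAR BAY"], [2.2, "INLAND"]],
    dtype=object,
)

toy_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# Encoded categories (sorted alphabetically) come first, then the passthrough column.
toy_encoded = np.array(toy_ct.fit_transform(toy_X)).astype(float)
print(toy_encoded)
```

Although the categorical column was second in the input, its one-hot columns appear first in the result and the numeric column moves to the end.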

**Splitting into Train set and Test set**

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
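
The split above draws a new random partition on every run, so the metrics computed later will differ slightly from run to run. A minimal sketch, assuming repeatable results are wanted: passing `random_state` fixes the partition. The seed value 42 and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# The same seed yields the same partition on every call.
split_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
split_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# train_test_split returns X_train, X_test, y_train, y_test in that order.
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows, split_a[0].shape, split_a[1].shape)
```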

**Scaling the Features**

In [None]:
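from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
# Columns 0-4 hold the one-hot ocean_proximity categories, so only columns 5 onward are scaled.
# The scaler is fitted on the training set only and then reused on the test set.
X_train[:,5:]=scaler.fit_transform(X_train[:,5:])
X_test[:,5:]=scaler.transform(X_test[:,5:])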
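
The scaler is fitted on the training set and only applied to the test set, so no information from the test data leaks into the preprocessing. The sketch below illustrates this on a made-up single feature (all names and numbers are hypothetical): the test values are standardized with the training mean and standard deviation, not with their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[2.0], [5.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean 2 and std sqrt(2/3)
test_scaled = toy_scaler.transform(toy_test)        # reuses the training statistics

print(toy_scaler.mean_, toy_scaler.scale_)
print(test_scaled.ravel())
```

The test value 2.0 maps to 0 because it equals the training mean, while 5.0 lands well outside the range seen during training.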

### **Fitting Random Forest Regression Model**

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor=RandomForestRegressor(n_estimators=50)
regressor.fit(X_train,y_train)

### **Predicting Test Set Results**

In [None]:
y_pred=regressor.predict(X_test)
np.set_printoptions(precision=2)
# Predictions in the first column, actual values in the second.
compare=np.column_stack((y_pred,y_test))

In [None]:
compare

This is the comparison between the predicted (left) and the original (right) values for the test set.

Some values are very close while some have a significant difference.

### **Measuring the Accuracy**

Using the **R-Squared** metric.

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

This score is the proportion of the variance in the housing prices that the model explains on the test set; values closer to 1 indicate a better fit.
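
For reference, R-squared is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand on made-up numbers (`toy_true` and `toy_pred` are hypothetical) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

toy_true = np.array([100.0, 200.0, 300.0, 400.0])
toy_pred = np.array([110.0, 190.0, 310.0, 390.0])

ss_res = np.sum((toy_true - toy_pred) ** 2)         # four errors of 10: 400
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)  # 150^2 + 50^2 + 50^2 + 150^2 = 50000
manual_r2 = 1.0 - ss_res / ss_tot

library_r2 = r2_score(toy_true, toy_pred)
print(manual_r2, library_r2)
```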

Using the **Mean Absolute Error** metric.

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

Hence, on average, the model's predictions are off by this amount, in the same units as **median_house_value**.

Using the **Root Mean Squared Error** metric.

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
print(sqrt(mean_squared_error(y_test, y_pred)))

The RMSE metric squares each error before averaging, so larger errors receive a higher weight. As a result, outliers have a much stronger effect on RMSE than on the mean absolute error.
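
A small made-up example shows the difference between the two metrics. Both prediction sets below have the same mean absolute error, but one spreads the error evenly while the other concentrates it in a single large miss. The helper names `mae` and `rmse` and the numbers are illustrative only.

```python
import numpy as np

toy_true = np.array([100.0, 100.0, 100.0, 100.0])
even_pred = np.array([110.0, 90.0, 110.0, 90.0])       # four errors of 10
outlier_pred = np.array([100.0, 100.0, 100.0, 140.0])  # one error of 40

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(mae(toy_true, even_pred), rmse(toy_true, even_pred))
print(mae(toy_true, outlier_pred), rmse(toy_true, outlier_pred))
```

The mean absolute error is 10 in both cases, while the RMSE doubles from 10 to 20 when the error sits in one prediction.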

### **Saving model to disk**

In [None]:
pickle.dump(regressor,open('model.pkl','wb'))

### **Loading model to compare results**

In [None]:
pickle.load(open('model.pkl','rb'))
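
A quick way to confirm that a pickled model survives the round trip is to compare predictions before and after reloading. The sketch below does this on a small synthetic model; `toy_model`, the data, and the temporary file path are all assumptions made for illustration. It uses `with` blocks so the file handles are closed explicitly.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 3))
toy_y = toy_X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

toy_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(toy_X, toy_y)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
predictions_match = np.array_equal(toy_model.predict(toy_X), restored.predict(toy_X))
print(predictions_match)
```

Pickle files should only be loaded from trusted sources, and a model is generally expected to be reloaded with the same scikit-learn version that saved it.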