
# House Price Prediction

In [145]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# DataSet
from sklearn.datasets import fetch_california_housing

In [146]:
# fetching Data
data=fetch_california_housing()

In [147]:
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [148]:
df=pd.DataFrame(data=data.data,columns=data.feature_names)
df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [149]:
# Dependent variable (house prices)
df['Target']=data.target

## EDA

In [150]:
!pip install sweetviz



In [151]:
import sweetviz as sv
report=sv.analyze(df)
report.show_html("./report2.html")


## Data Preprocessing

In [152]:
# Feature Engineering
from geopy.geocoders import Nominatim

geolocater=Nominatim(user_agent="House price prediction")
# Sanity check: reverse-geocode the coordinates of the first row
sample_address=geolocater.reverse('37.88'+","+'-122.23',timeout=None).raw['address']

In [153]:
def location(gcord):
  """Reverse-geocode a (latitude, longitude) pair and record its county and road."""
  latitude=str(gcord[0])
  longitude=str(gcord[1])
  address=geolocater.reverse(latitude+","+longitude,timeout=None).raw['address']

  # dict.get returns None when the address has no such key
  upd_loc['county'].append(address.get('county'))
  upd_loc['road'].append(address.get('road'))


In [154]:
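import pickle
upd_loc={
    "county":[],
    "road":[]
}
# Select the coordinate columns by name rather than by position
for cord in df[['Latitude','Longitude']].values:
  location(cord)
  # Checkpoint after every lookup so an interrupted run keeps its partial results
  with open('upd_update.pickle','wb') as f:
    pickle.dump(upd_loc,f)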
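The loop above sends one reverse-geocoding request per row, 20,640 in total. Some block groups share coordinates (rows 3 and 4 of `df.head()` are both at 37.85, -122.25), so caching results per distinct coordinate pair can reduce the number of requests. The following is a minimal sketch of that idea, not the notebook's method: `geocode_unique` and `fake_reverse_lookup` are hypothetical names, and the fake lookup returns made-up addresses in place of `geolocater.reverse(...).raw['address']`.

```python
def geocode_unique(coords, reverse_lookup):
    """Call reverse_lookup once per distinct (lat, lon) pair and reuse the result."""
    cache = {}
    results = {"county": [], "road": []}
    for lat, lon in coords:
        key = (lat, lon)
        if key not in cache:
            cache[key] = reverse_lookup(lat, lon)
        address = cache[key]
        # dict.get returns None when the key is absent, as in the notebook's handling
        results["county"].append(address.get("county"))
        results["road"].append(address.get("road"))
    return results, len(cache)

# Stand-in for the real geocoder (assumption): records each call, returns fake data
calls = []
def fake_reverse_lookup(lat, lon):
    calls.append((lat, lon))
    return {"county": "Example County"} if lat > 37.85 else {"road": "Example Road"}

coords = [(37.88, -122.23), (37.85, -122.25), (37.85, -122.25)]
results, n_requests = geocode_unique(coords, fake_reverse_lookup)
print(results)
print(n_requests)  # number of lookups actually performed
```

With a real geocoder, the second argument would be a small wrapper around `geolocater.reverse` that returns the `raw['address']` dictionary.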

In [155]:
# Read from the same relative path the results were written to
with open('upd_update.pickle','rb') as f:
  loc_update=pickle.load(f)
loc_update.keys()

dict_keys(['county', 'road'])

In [156]:
loc=pd.DataFrame(loc_update)
loc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   county  16725 non-null  object
 1   road    19659 non-null  object
dtypes: object(2)
memory usage: 322.6+ KB


In [157]:
# add the new features to our dataframe
for i in loc_update.keys():
  df[i]=loc_update[i]
# shuffle the rows; each row keeps its original index label
df=df.sample(axis=0,frac=1)

In [158]:
# drop latitude and longitude
df=df.drop(labels=["Latitude","Longitude"],axis=1)

In [159]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20640 entries, 6250 to 16548
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Target      20640 non-null  float64
 7   county      16725 non-null  object 
 8   road        19659 non-null  object 
dtypes: float64(7), object(2)
memory usage: 1.6+ MB


In [160]:
# Index labels of the rows whose road could not be reverse-geocoded
road_missing=df['road'].isna()
missing_idx=df.index[road_missing]
features=['MedInc','AveRooms','AveBedrms']

# Independent parameters
missing_road_x_train=df.loc[~road_missing,features].values
# Dependent parameters
missing_road_y_train=df.loc[~road_missing,'road'].values

# Rows to predict, in the same order as missing_idx
missing_road_x_test=df.loc[missing_idx,features].values


In [161]:
from sklearn.linear_model import SGDClassifier
# model initialization
model_1= SGDClassifier()
# model training
model_1.fit(missing_road_x_train,missing_road_y_train)

missing_road_y_pred =model_1.predict(missing_road_x_test)


In [162]:
# add the predictions back to the data frame
# (.loc writes to df directly and avoids the chained-assignment SettingWithCopyWarning)
df.loc[missing_idx,'road']=missing_road_y_pred

In [163]:

from sklearn.preprocessing import LabelEncoder

le= LabelEncoder()
df['road']=le.fit_transform(df['road'])
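`LabelEncoder` replaces each distinct value with an integer given by its position among the sorted distinct values. The sketch below reproduces that mapping with the standard library only. `encode_labels` is a hypothetical helper and the street names are made up. The resulting codes carry no meaningful order, yet a linear model will still treat them as numeric magnitudes.

```python
def encode_labels(values):
    """Map each distinct value to its position in the sorted list of distinct values."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes

codes, classes = encode_labels(["Oak Street", "Main Street", "Oak Street", "Elm Avenue"])
print(codes)    # [2, 1, 2, 0]
print(classes)  # ['Elm Avenue', 'Main Street', 'Oak Street']
```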

In [164]:
df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Target,county,road
6250,3.4722,34.0,5.114458,1.078313,2108.0,4.232932,1.546,,4606
9976,5.3698,20.0,5.340206,1.041237,589.0,3.036082,3.033,Napa County,7871
4231,3.5806,52.0,5.174917,1.113861,1213.0,2.00165,3.919,Los Angeles County,4643
17158,11.9666,28.0,7.139241,0.911392,191.0,2.417722,5.00001,San Mateo County,6993
200,3.0257,52.0,4.046948,1.00939,994.0,4.666667,0.808,Alameda County,3939


In [165]:
# Index labels of the rows whose county could not be reverse-geocoded
county_missing=df['county'].isna()
missing_idx=df.index[county_missing]
features=['MedInc','AveRooms','AveBedrms']

# Independent parameters (the missing_road_* names are reused from the road step)
missing_road_x_train=df.loc[~county_missing,features].values
# Dependent parameters
missing_road_y_train=df.loc[~county_missing,'county'].values

# Rows to predict, in the same order as missing_idx
missing_road_x_test=df.loc[missing_idx,features].values




In [166]:
from sklearn.linear_model import SGDClassifier
# model initialization
model_2= SGDClassifier()
# model training
model_2.fit(missing_road_x_train,missing_road_y_train)

missing_road_y_pred =model_2.predict(missing_road_x_test)

In [167]:
# add the predictions back to the data frame
# (.loc writes to df directly and avoids the chained-assignment SettingWithCopyWarning)
df.loc[missing_idx,'county']=missing_road_y_pred

from sklearn.preprocessing import LabelEncoder

# a separate encoder, so the one fitted on 'road' is not overwritten
le_county= LabelEncoder()
df['county']=le_county.fit_transform(df['county'])

In [168]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20640 entries, 6250 to 16548
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Target      20640 non-null  float64
 7   county      20640 non-null  int64  
 8   road        20640 non-null  int64  
dtypes: float64(7), int64(2)
memory usage: 2.1 MB


In [169]:
# Dependent values (selected by name rather than by position)
y=df['Target'].values
df.drop(labels=['Target'],axis=1,inplace=True)
# Independent values: every remaining column
x=df.values

In [170]:
from sklearn.model_selection import train_test_split

In [171]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

In [172]:
from sklearn.linear_model import LinearRegression
model1=LinearRegression()
model1.fit(x_train,y_train)

In [173]:
y_pred1=model1.predict(x_test)

In [174]:
# Linear regression R² score, as a percentage (R² is not a classification accuracy)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred1)*100

46.947659332729195
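`r2_score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. A value of 1 means perfect predictions, 0 matches always predicting the mean of the true values, and negative values are worse than that baseline. The notebook multiplies it by 100. The sketch below computes the same quantity by hand on toy numbers. `r2` is a hypothetical helper and the data is illustrative only.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r2(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r2(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than predicting the mean
```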

In [175]:
from sklearn.ensemble import RandomForestRegressor
model2=RandomForestRegressor()
model2.fit(x_train,y_train)

In [176]:
# Model Prediction
y_pred2=model2.predict(x_test)

In [177]:
# Random forest R² score, as a percentage
from sklearn.metrics import r2_score
r2_score(y_test,y_pred2)*100

72.93161741224088

## Function to predict house prices

In [178]:
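def predict(arr):
  """Predict the median house value (in units of $100,000) for one feature row."""
  x=arr.reshape((1,-1))
  return model2.predict(x)

# Feature order: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, county, road
arr=np.array([3.0214, 19.0, 5.265107, 1.122807, 1245.0, 2.184, 39, 5928])

result=predict(arr)
print(result)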

[1.64601]
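The array passed to `predict` has to list the features in the same order as the training columns. After `Latitude`, `Longitude` and `Target` are dropped, that order is `MedInc`, `HouseAge`, `AveRooms`, `AveBedrms`, `Population`, `AveOccup`, `county`, `road`, with the last two given as label-encoded integers. The sketch below builds the row from named fields to guard against ordering mistakes and converts a prediction into dollars, since the target is expressed in units of $100,000. `FEATURE_ORDER`, `to_feature_row` and `to_dollars` are hypothetical names, not part of the notebook.

```python
# Column order of the training matrix, as shown by df.info() minus Target
FEATURE_ORDER = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "county", "road"]

def to_feature_row(record):
    """Return the feature values in training order; reject missing or unknown fields."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    extra = [name for name in record if name not in FEATURE_ORDER]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return [float(record[name]) for name in FEATURE_ORDER]

def to_dollars(prediction):
    """Scale a model output from units of $100,000 to dollars."""
    return prediction * 100_000

record = {"road": 5928, "county": 39, "MedInc": 3.0214, "HouseAge": 19.0,
          "AveRooms": 5.265107, "AveBedrms": 1.122807,
          "Population": 1245.0, "AveOccup": 2.184}
row = to_feature_row(record)
print(row)
print(round(to_dollars(1.64601)))  # the prediction printed above, in dollars
```

The resulting `row` can be wrapped in `np.array(row)` before it is passed to `predict`.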
