### House Price Prediction ###

In [3]:
# Generic Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
# Data set
from sklearn.datasets import fetch_california_housing

In [5]:
# Load data set
data = fetch_california_housing()

In [4]:
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [8]:
# Load the features (independent variables) into a DataFrame
df = pd.DataFrame(data=data.data[:2500], columns=data.feature_names)  # keep the first 2500 rows
df.head()

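|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |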


In [9]:
data.target

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

In [10]:
# Adding target column to the dataframe (Dependent Data)
df['Target'] = data.target[:2500]

In [11]:
df.head()

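|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |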
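
Per the dataset description printed earlier, the target is expressed in units of $100,000. A minimal sketch of converting the five target values shown above into dollars; `UNIT_DOLLARS` and `to_dollars` are illustrative names and are not part of the notebook:

```python
# Target values from the df.head() output above, in units of $100,000
targets = [4.526, 3.585, 3.521, 3.413, 3.422]

UNIT_DOLLARS = 100_000  # hypothetical constant name; the unit comes from the dataset description

def to_dollars(value):
    """Convert a target value in $100,000 units to whole dollars."""
    return round(value * UNIT_DOLLARS)

dollars = [to_dollars(t) for t in targets]
print(dollars)  # [452600, 358500, 352100, 341300, 342200]
```

So the first block group's median house value of 4.526 corresponds to $452,600.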


#### Exploratory Data Analysis ####

**Exploratory Data Analysis (EDA) is the initial investigation of a dataset: using summary statistics and graphical representations to discover trends and patterns, spot anomalies, test hypotheses, and check assumptions.**
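
Before reaching for an automated report, the summary-statistics side of EDA can be sketched with plain pandas. The example below is only an illustration: it rebuilds a small frame (named `sample` here, an assumption) from the five rows displayed earlier so that it runs on its own. Five rows are far too few for meaningful statistics; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# First five rows from the df.head() output shown earlier (subset of columns)
sample = pd.DataFrame({
    "MedInc":   [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "HouseAge": [41.0, 21.0, 52.0, 52.0, 52.0],
    "AveRooms": [6.984127, 6.238137, 8.288136, 5.817352, 6.281853],
    "Target":   [4.526, 3.585, 3.521, 3.413, 3.422],
})

# Summary statistics: count, mean, std, min, quartiles, max per column
summary = sample.describe()
print(summary)

# Missing values per column (a simple anomaly check)
missing = sample.isna().sum()
print(missing)

# Pearson correlation of each feature with the target
corr_with_target = sample.corr()["Target"].drop("Target")
print(corr_with_target)
```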

In [12]:
import sweetviz as sv

In [13]:
report = sv.analyze(df)
report.show_html("./report.html")

Report ./report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


#### Data Pre-Processing ####

In [14]:
data.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [15]:
# Feature Engineering
from geopy.geocoders import Nominatim

In [16]:
geolocator = Nominatim(user_agent='geoapiExercises')

In [17]:
# test - my code
# location_list = []
# for i in range(df.shape[0]-20600):
#     location_list.append(geolocator.reverse(str(df['Latitude'][i])+','+str(df['Longitude'][i]))[0])
# print(location_list)

In [18]:
def locationfinder(cord):
    # cord holds (Latitude, Longitude) for one block group
    Lat = str(cord[0])
    Lon = str(cord[1])

    # reverse() returns None when no address is found for the coordinates
    result = geolocator.reverse(Lat+','+Lon)
    location = result.raw.get('address', {}) if result is not None else {}  # address dictionary

    # Nominatim address keys are lowercase ('county', 'road');
    # dict.get() yields None when a key is missing
    updated_location['County'].append(location.get('county'))
    updated_location['Road'].append(location.get('road'))
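
The lookup of optional address fields can be illustrated without a network call. Nominatim address dictionaries typically use lowercase keys such as `county` and `road`, and not every key is present for every coordinate. The sketch below uses a hand-written stand-in dictionary (`sample_address`, not a real response) and a hypothetical helper `pick_fields`:

```python
# Hand-written stand-in for an address dictionary; real responses vary by location
sample_address = {"county": "Alameda County", "state": "California"}

def pick_fields(address, keys=("county", "road")):
    """Return the requested address fields, using None for any that are absent."""
    return {key: address.get(key) for key in keys}

fields = pick_fields(sample_address)
print(fields)  # {'county': 'Alameda County', 'road': None}
```

Using `dict.get` avoids a `KeyError` and keeps the `County` and `Road` lists the same length as the DataFrame.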

In [20]:
import pickle
updated_location = {"County":[],"Road":[]}

# Reverse-geocode each (Latitude, Longitude) pair
for i,cord in enumerate(df[['Latitude','Longitude']].values):
    locationfinder(cord)

    # save progress after every lookup so an interrupted run keeps earlier results
    with open('updated_location.pickle','wb') as f:
        pickle.dump(updated_location, f)

    if i%500 == 0:
        print(i)

0
500
1000
1500
2000
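
The loop above sends one request per row, although rows can share coordinates (rows 3 and 4 in the `df.head()` output both sit at 37.85, -122.25). One possible refinement, sketched here under that assumption, is to cache results by coordinate pair so each distinct pair is looked up only once. `fake_reverse` is a hypothetical stand-in for the reverse-geocoding request so the example runs offline:

```python
# Coordinates from the df.head() output shown earlier; rows 3 and 4 repeat
coords = [(37.88, -122.23), (37.86, -122.22), (37.85, -122.24),
          (37.85, -122.25), (37.85, -122.25)]

calls = []  # records every simulated request

def fake_reverse(lat, lon):
    """Hypothetical stand-in for a reverse-geocoding request (no network)."""
    calls.append((lat, lon))
    return {"county": None, "road": None}

cache = {}    # maps (lat, lon) -> address fields
results = []  # one entry per input row, in the original order

for lat, lon in coords:
    key = (lat, lon)
    if key not in cache:
        cache[key] = fake_reverse(lat, lon)  # only new pairs trigger a request
    results.append(cache[key])

print(len(coords), "rows,", len(calls), "requests")  # 5 rows, 4 requests
```

The results list still has one entry per row, so it can be attached to the DataFrame exactly as before.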
