# Predicting the prices of houses in California using Linear Regression.

## Content 
The data describe the houses found in a given California district, along with summary statistics about them based on the 1990 census. Be warned: the data are not cleaned, so some preprocessing steps are required. The columns are listed below; their names are largely self-explanatory:

<b>longitude</b> : A measure of how far west a house is; values are negative in California, so a more negative value is farther west

<b>latitude</b> : A measure of how far north a house is; a higher value is farther north

<b>housing_median_age</b> : Median age of a house within a block; a lower number is a newer building

<b>total_rooms</b> : Total number of rooms within a block

<b>total_bedrooms</b> : Total number of bedrooms within a block

<b>population</b> : Total number of people residing within a block

<b>households</b> : Total number of households, a group of people residing within a home unit, for a block

<b>median_income</b> : Median income for households within a block of houses (measured in tens of thousands of US Dollars)

<b>median_house_value</b> : Median house value for households within a block (measured in US Dollars)

<b>ocean_proximity</b> : Location of the house relative to the ocean/sea (a categorical text column)

## Acknowledgements
This data was initially featured in the following paper:
Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

and I encountered it in 'Hands-On Machine Learning with Scikit-Learn and TensorFlow' by Aurélien Géron, who wrote:
This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto).

#### Importing dependencies

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn import linear_model
from sklearn.utils import shuffle
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
import seaborn as sns
from pandas.plotting import scatter_matrix

In [None]:
cali = pd.read_csv('housing.csv')

In [None]:
cali.head()

In [None]:
cali.describe().T

#### Data Wrangling

In [None]:
sns.set()
cali.isna().sum().sort_values(ascending = True).plot(kind ='barh',figsize = (10,7))

In [None]:
#Dropping the rows with missing values (with inplace=True, dropna modifies cali and returns None)
cali.dropna(inplace=True)
#Checking that the data has been cleaned: count and share of missing values per column
total_null = cali.isna().sum().sort_values(ascending=False)
percent = (cali.isna().sum()/len(cali)).sort_values(ascending=False)
missing_data = pd.concat([total_null, percent], axis=1, keys=['Total', 'Percent'])
missing_data
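Dropping rows is the simplest way to handle the missing values. As a hedged illustration of the trade-off, the sketch below compares it with median imputation on a small made-up frame; the values are invented, and the `fillna` route is an alternative for comparison, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2 (alternative): fill the gaps with the column median,
# so that no rows are lost.
bedroom_median = toy["total_bedrooms"].median()
filled = toy.fillna({"total_bedrooms": bedroom_median})

print(len(toy), "rows originally")
print(len(dropped), "rows after dropna")
print(filled["total_bedrooms"].tolist())
```

Dropping keeps only fully observed rows, while imputation keeps every row at the cost of inserting estimated values.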

In [None]:
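#Dropping the categorical column; the result must be assigned back, otherwise cali is unchanged
cali = cali.drop(['ocean_proximity'],axis = 1)
cali.head()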

#### Inspecting our data (Exploratory Data Analysis)

In [None]:
#Plotting histograms
cali.hist(bins = 50, figsize = (20,15))
plt.show();

##### Histograms show the distribution of each numeric attribute: how many blocks fall into each range of values.

In [None]:
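#Plotting a scatterplot: longitude on x, latitude on y, color = median house value, marker size = population
plt.figure(figsize = (10,8))
plt.scatter(cali.longitude,cali.latitude,alpha = 0.2,c = cali.median_house_value, s = cali.population/100)
plt.colorbar(label = 'median_house_value')
plt.xlabel('longitude')
plt.ylabel('latitude')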

##### The scatterplot resembles the shape of California; the color of each point represents the median house value and the size of each point represents the population at that latitude and longitude.

In [None]:
sns.pairplot(cali[["total_bedrooms","population","median_income","median_house_value"]],diag_kind = "kde")

###### A pairplot shows the pairwise relationships between the selected columns, with a kernel density estimate of each column on the diagonal.


#### Data Pre-processing

In [None]:
y = cali['median_house_value']
x = cali[['median_income','total_rooms','housing_median_age']]

In [None]:
#Holding out 20% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
#Recombining training features and target so correlations are computed on training data only
train = x_train.join(y_train)
corr_mat = train.corr()

###### The correlation matrix gives us an idea of how strongly the features are linearly related to each other and to the target.

In [None]:
corr_mat['median_house_value'].sort_values(ascending = False)
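The values returned by `corr()` are Pearson correlation coefficients by default. As a minimal sketch of what that number measures, the snippet below computes it by hand on made-up values (the arrays are hypothetical, not taken from the dataset) and compares the result with `np.corrcoef`.

```python
import numpy as np

# Made-up values standing in for two columns of the training set.
income = np.array([1.5, 2.0, 3.5, 5.0, 8.0])
value = np.array([90_000.0, 120_000.0, 180_000.0, 260_000.0, 400_000.0])

def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denom = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float((a_centered * b_centered).sum() / denom)

r_manual = pearson(income, value)
r_numpy = float(np.corrcoef(income, value)[0, 1])
print(r_manual, r_numpy)
```

A coefficient near 1 or -1 indicates a strong linear relationship, and a coefficient near 0 indicates little or no linear relationship.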

In [None]:
#plt.subplots returns a (figure, axes) pair; draw the heatmap on the created axes
fig, ax = plt.subplots(figsize = (20,10))
sns.heatmap(train.corr(), annot = True, ax = ax)

#### ML technique used: Linear Regression model
Based on our correlation matrix, we can see that the median house value is most strongly correlated with the median income, hence we will try to model the relationship using a linear regression model.
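To illustrate what fitting a line to one correlated feature means, here is a small sketch on synthetic data. The "true" slope and intercept (40 and 45, in thousands of dollars) are hypothetical numbers chosen for the example, not estimates from the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for median_income (tens of thousands of dollars).
x = rng.uniform(0.5, 15.0, size=200)
# Hypothetical relationship plus noise; target in thousands of dollars.
y = 45.0 + 40.0 * x + rng.normal(0.0, 1.0, size=200)

# Least-squares solution for one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# np.polyfit with degree 1 solves the same least-squares problem.
ref_slope, ref_intercept = np.polyfit(x, y, 1)

print(slope, intercept)
```

With several features the same idea applies, but the slope becomes a vector with one coefficient per feature.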

##### Performing Multiple Linear Regression: predicting the median house value using features such as house age, income and total rooms

In [None]:
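linear_regression = LinearRegression()
#fitting on the training split only, so that the test split stays unseen for evaluation
linear_regression.fit(x_train,y_train)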

In [None]:
y_pred = linear_regression.predict(x)
y_pred

#### Measuring Accuracy of the model.
###### Metrics used:
<ol>
    <li>R² score (coefficient of determination), which is what <code>score</code> returns for a regressor</li>
    <li>Mean Squared Error</li>
    <li>Mean Absolute Error</li>
   </ol>
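For a regression model, scikit-learn's `score` method returns R² (the coefficient of determination) rather than a classification accuracy. The sketch below computes R², MSE, RMSE and MAE by hand with NumPy on four made-up predictions, to show how the metrics are defined; the numbers are hypothetical.

```python
import numpy as np

# Made-up true values and predictions, in US dollars.
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 90_000.0])
y_hat = np.array([210_000.0, 140_000.0, 300_000.0, 100_000.0])

errors = y_true - y_hat

mse = np.mean(errors ** 2)      # mean squared error, in dollars squared
rmse = np.sqrt(mse)             # root of the MSE, back in dollars
mae = np.mean(np.abs(errors))   # mean absolute error, already in dollars

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

Only the MSE needs a square root to return to the units of the target; the MAE is already in dollars.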

In [None]:
#score() returns the coefficient of determination (R²) on the test set, not a classification accuracy
r2 = linear_regression.score(x_test, y_test)
r2_percentage = r2*100
r2_percentage

##### Our model has an R² of around 0.52 on the test set, i.e. it explains roughly 52% of the variance in the median house value.

In [None]:
#measuring the error on the test set with the RMSE (square root of the MSE)
y_test_pred = linear_regression.predict(x_test)
mse = mean_squared_error(y_test, y_test_pred)
np.sqrt(mse)

In [None]:
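#the MAE is already in the units of the target (US dollars), so no square root is taken
mae = mean_absolute_error(y_test, linear_regression.predict(x_test))
mae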

In [None]:
#in a multiple linear regression y = b0 + b1*x1 + b2*x2 + b3*x3; coef_ holds b1..b3 (one per feature) and intercept_ holds b0
print('Coefficients: \n', linear_regression.coef_)
print('Intercept: \n', linear_regression.intercept_)
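To show how fitted coefficients turn into a prediction, the sketch below fits a three-feature linear model with NumPy's least-squares solver on synthetic data and then reproduces one prediction by hand. The feature ranges and the "true" coefficients are hypothetical values chosen for the illustration, and the target is expressed in thousands of dollars.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-ins for median_income, total_rooms, housing_median_age.
X = np.column_stack([
    rng.uniform(0.5, 15.0, n),
    rng.uniform(100.0, 5000.0, n),
    rng.uniform(1.0, 52.0, n),
])

# Hypothetical true model plus noise.
true_intercept = 20.0
true_coef = np.array([40.0, 0.01, 1.5])
y = true_intercept + X @ true_coef + rng.normal(0.0, 1.0, n)

# Add a column of ones so the intercept is estimated with the slopes.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = beta[0], beta[1:]

# A prediction is the intercept plus the coefficient-weighted features.
x_new = np.array([3.0, 2000.0, 30.0])
pred = intercept + x_new @ coef

print(coef, intercept, pred)
```

Each coefficient is the change in the predicted target for a one-unit change in its feature with the other features held fixed, so coefficients of features on very different scales are not directly comparable in size.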