## Predicting California housing prices
This is one of the full end-to-end projects I worked on in my free time to build my profile as a Data Scientist. The project is part of the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow", which gives a lot of information in a concise form. Here I describe my learnings and findings from this interesting read, and apply my years of experience in academic data analysis.



## Data Scientist Checklist 
Every project needs a checklist. A common one is given below; the details change with the scope of the problem, but the minimal version stays the same.

1. Frame the problem and get a grasp of the big picture.
2. Get the data.
3. Explore the data and gain insights.
4. Prepare the data to extract the features hidden in it (feature engineering).
5. Shortlist and explore the models that can be tried on the given data, based on the knowledge gained from the insights.
6. Fine-tune these models and combine them into the best, most reliable solution.
7. Present the solution.
8. Launch, monitor and maintain the system; it will evolve with time as more data arrives.


### 1. Frame the problem 
As stated above, we have the housing dataset for California and we want to build a model that can predict the price of a house given the other features of the dataset. Broadly, this is a supervised learning problem in which we predict the price of a house, so we need a regression algorithm. The dataset has many features, such as location, age of the house, rooms, population and income. All of these influence the price, so this is a multiple regression problem: several input features are used to predict a single value.

We are given that an existing solution already predicts housing prices, but it relies on a manual process that is expensive in both time and budget and comes with an error of 15%.

#### Performance metric 
##### 1. RMSE (Root Mean Square Error) 

One of the most commonly used metrics for regression problems is RMSE (Root Mean Square Error).

$RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(h(x^{i}) - y^{i}\right)^{2}}$

where, 
- m: number of instances (rows) on which the performance (RMSE) is measured
- $x^{i}$: Vector of all the input feature values (excluding the output) of the $i^{th}$ instance in the data, and $y^{i}$ is the label of this vector, i.e. the true value of the feature to be predicted (the housing price in this case).
- $h(x^{i})$: Predicted value given the input features $x^{i}$

[RMSE using sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error)  
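
As a minimal sketch of the formula above, RMSE can be computed directly with NumPy. The helper name `rmse` and the three house values are made up for illustration; scikit-learn's `mean_squared_error` (linked above) returns the squared quantity by default, so taking its square root should give the same number.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between true labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square each error, average over the m instances, then take the root
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Made-up house values in US dollars, for illustration only
y_true = [200000.0, 300000.0, 400000.0]
y_pred = [210000.0, 290000.0, 430000.0]

print(f"RMSE: {rmse(y_true, y_pred):.2f}")
```

Because the errors are squared before averaging, the single error of 30000 contributes nine times as much as each error of 10000.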

##### 2. MAE (Mean Absolute Error) 
MAE is another metric used for regression problems, especially when the dataset has known outliers. 
RMSE squares each error, so a few large errors dominate it; MAE averages the absolute errors and is therefore less sensitive to outliers: 

$MAE = \frac{1}{m} \sum_{i=1}^{m} \left| h(x^{i}) - y^{i} \right|$

[MAE using sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-absolute-error)
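
The difference in outlier sensitivity can be seen on a small example. This is a sketch with made-up numbers; the helper names `mae` and `rmse` are illustrative, and scikit-learn's `mean_absolute_error` (linked above) computes the same MAE.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error between true labels and predictions."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))


def rmse(y_true, y_pred):
    """Root mean square error, included only for comparison."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


y_true = [100.0, 100.0, 100.0, 100.0]
pred_clean = [110.0, 90.0, 110.0, 90.0]     # every error has size 10
pred_outlier = [110.0, 90.0, 110.0, 200.0]  # last prediction is off by 100

for name, pred in [("clean", pred_clean), ("outlier", pred_outlier)]:
    print(f"{name}: MAE={mae(y_true, pred):.2f}  RMSE={rmse(y_true, pred):.2f}")
```

With equal-sized errors the two metrics agree; with one large error RMSE grows much faster than MAE.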

##### 3. $R^{2}$ score
Another important metric for regression is the $R^{2}$ score, also called the coefficient of determination. It represents the proportion of the variance of y that is explained by the independent variables in the model. 
It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance. 

$R^{2} = 1 - \frac{\sum_{i=1}^{m} \left(y^{i} - h(x^{i})\right)^{2}}{\sum_{i=1}^{m} \left(y^{i} - \bar{y}\right)^{2}}$
where $\bar{y} = \frac{1}{m} \sum_{i=1}^{m} y^{i}$ is the mean of the true values of y (not of the predictions $h(x^{i})$). 

[$R^{2}$ score in sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination)
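
A minimal sketch of the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The function name `r2_manual` and the numbers are made up for illustration; scikit-learn's `r2_score` (linked above) should return the same value.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination computed from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]

print(r2_manual(y_true, y_pred))          # a good but imperfect fit
print(r2_manual(y_true, y_true))          # perfect predictions
print(r2_manual(y_true, [2.5] * 4))       # always predicting the mean
```

A perfect model scores 1, and a model that always predicts the mean of the true values scores 0.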

So depending on the type of problem one should choose the metric appropriately. 

### 2. Get the data 

The original dataset mentioned in the book is from *StatLib repository*. 

A similar dataset is freely available at the [Kaggle dataset](https://www.kaggle.com/camnugent/california-housing-prices) repository and has been used for this modelling. 



#### Import libraries 

In [2]:
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 
import numpy as np 
import sklearn 


#### Load the dataset 

In [3]:
df = pd.read_csv("housing.csv")
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


#### Features 
The data has 10 columns: 

1. longitude: A measure of how far west a house is; the values are negative in California, so a more negative value is farther west

2. latitude: A measure of how far north a house is; a higher value is farther north

3. housing_median_age: Median age of a house within a block; a lower number is a newer building

4. total_rooms: Total number of rooms within a block

5. total_bedrooms: Total number of bedrooms within a block

6. population: Total number of people residing within a block

7. households: Total number of households (a household is a group of people residing within a home unit) for a block

8. median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

9. median_house_value: Median house value for households within a block (measured in US Dollars)

10. ocean_proximity: Location of the house with respect to the ocean/sea
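
In the notation used for the metrics, `median_house_value` is the label $y$ and the remaining nine columns form the feature vector $x$. A minimal sketch of that split, using the first three rows of the `df.head()` output as a stand-in for the full dataframe (the names `sample`, `target`, `X` and `y` are illustrative and not part of the original notebook):

```python
import pandas as pd

# First three rows of the df.head() output, retyped as a small stand-in for df
sample = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24],
    "latitude": [37.88, 37.86, 37.85],
    "housing_median_age": [41.0, 21.0, 52.0],
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, 1106.0, 190.0],
    "population": [322.0, 2401.0, 496.0],
    "households": [126.0, 1138.0, 177.0],
    "median_income": [8.3252, 8.3014, 7.2574],
    "median_house_value": [452600.0, 358500.0, 352100.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "NEAR BAY"],
})

target = "median_house_value"
X = sample.drop(columns=[target])  # the nine input features
y = sample[target]                 # the label to predict

print(X.shape)
print(y.tolist())
```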




### 3. Exploratory data analysis and insights

In [4]:
df.info()  # shows the dtype and non-null count of each column in the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


The output above shows that the total_bedrooms column has only 20433 non-null values out of 20640, i.e. 207 values are missing. This has to be taken care of before feeding the data to any machine learning algorithm. There are two options: 

1. Remove those 207 entries, keeping the column itself because it is an important feature for predicting the price of the house.
2. Fill in the missing values instead: total_bedrooms is correlated with total_rooms, which is always available, so each missing value can be set to the median total_bedrooms of the blocks with the same total_rooms.

Let's deal with this a bit later in the notebook. 
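
Both options can be sketched on a toy frame. The values below are made up; only the column names match the dataset. The fallback to the overall median, for a `total_rooms` value whose bedrooms are all missing, is an added assumption and not something prescribed above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (made-up values, real column names)
toy = pd.DataFrame({
    "total_rooms":    [880.0, 880.0, 880.0, 1467.0, 1467.0, 1627.0],
    "total_bedrooms": [129.0, 135.0, np.nan, 190.0, np.nan, np.nan],
})

n_missing = int(toy["total_bedrooms"].isna().sum())
print(f"missing total_bedrooms: {n_missing}")

# Option 1: drop the rows where total_bedrooms is missing
dropped = toy.dropna(subset=["total_bedrooms"])

# Option 2: fill each gap with the median total_bedrooms of the rows
# that share the same total_rooms value
group_median = toy.groupby("total_rooms")["total_bedrooms"].transform("median")
imputed = toy.copy()
imputed["total_bedrooms"] = (
    imputed["total_bedrooms"]
    .fillna(group_median)                    # per-group median
    .fillna(toy["total_bedrooms"].median())  # assumed fallback: overall median
)

print(dropped)
print(imputed)
```

On the real data many `total_rooms` values are likely to be unique, so the fallback (or a coarser grouping) would matter more than it does in this toy example.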