## Learn About the California Housing Data

### The dataset contains three key elements: data, feature_names, and target
* **data** - all the input features
* **feature_names** - the names of the features
* **target** - the output values we want to predict; here, the median house value

In [7]:
# import california dataset from sklearn
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

housing_dataset = fetch_california_housing()

# The dataset contains three key elements: data, feature_names, and target

housing_dataframe = pd.DataFrame(data=housing_dataset.data, columns=housing_dataset.feature_names)

housing_dataframe.head()


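|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |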


In [8]:
housing_dataframe['MedianHouseValue'] = housing_dataset.target

housing_dataframe.head()

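|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedianHouseValue |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|------------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |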
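
The target values look small because, according to the scikit-learn dataset description, the median house value is expressed in units of $100,000. A minimal sketch of converting the first five values to dollars follows; the name `median_house_value_usd` is a hypothetical helper, not part of the dataset.

```python
import pandas as pd

# First five target values from the table above (units of $100,000).
median_house_value = pd.Series(
    [4.526, 3.585, 3.521, 3.413, 3.422], name="MedianHouseValue"
)

# Hypothetical helper: the same values expressed in dollars.
# round(0) removes floating-point noise left by the multiplication.
median_house_value_usd = (median_house_value * 100_000).round(0)

print(median_house_value_usd.tolist())
```

Keeping the original units in the DataFrame and converting only for display avoids changing the scale of the target used later for prediction.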


In [9]:
# Check if we have any missing values

print(housing_dataframe.isnull().sum())


MedInc              0
HouseAge            0
AveRooms            0
AveBedrms           0
Population          0
AveOccup            0
Latitude            0
Longitude           0
MedianHouseValue    0
dtype: int64
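
Every count is zero, so this dataset needs no imputation. For comparison, here is a small sketch of what the same check reports when a value is missing. The toy frame reuses three rows from above with one `HouseAge` entry deliberately replaced by `NaN`; that gap is hypothetical and does not exist in the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: three MedInc/HouseAge rows from above, with one HouseAge
# value deliberately set to NaN (hypothetical, for illustration only).
toy = pd.DataFrame({
    "MedInc": [8.3252, 8.3014, 7.2574],
    "HouseAge": [41.0, np.nan, 52.0],
})

# Count of missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```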


In [10]:
# Print the shape (rows, columns) of the DataFrame
print(housing_dataframe.shape)

(20640, 9)


In [11]:
# Statistical Summary
print(housing_dataframe.describe())

             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   

           AveOccup      Latitude     Longitude  MedianHouseValue  
count  20640.000000  20640.000000  20640.000000      20640.000000  
mean       3.070655     35.631861   -119.569704          2.068558  
std       10.386050      2.135952      2.003532          1.153956  
min        0.692

In [15]:
# Define Income Category
housing_dataframe['income_cat'] = pd.cut(housing_dataframe['MedInc'], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])

# Group by income category and calculate the average population
average_population = housing_dataframe.groupby('income_cat', observed=True).apply(lambda x: x['Population'].mean(), include_groups=False)

print(average_population)

income_cat
1    1105.806569
2    1418.232336
3    1448.062465
4    1488.974718
5    1389.890347
dtype: float64
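
To make the binning concrete, the sketch below applies the same bin edges and labels to the first five rows shown earlier. It also groups by selecting the `Population` column before aggregating, which is an alternative to the `apply` with a lambda used above; the names `sample` and `mean_population` are illustrative.

```python
import numpy as np
import pandas as pd

# First five rows of MedInc and Population from the table above.
sample = pd.DataFrame({
    "MedInc": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "Population": [322.0, 2401.0, 496.0, 558.0, 565.0],
})

# Same bin edges and labels as the cell above. Intervals are right-closed,
# so label 3 covers 3.0 < MedInc <= 4.5, label 4 covers 4.5 < MedInc <= 6.0.
sample["income_cat"] = pd.cut(
    sample["MedInc"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)
print(sample["income_cat"].tolist())

# Selecting the column before aggregating gives the per-group mean
# without a lambda.
mean_population = sample.groupby("income_cat", observed=True)["Population"].mean()
print(mean_population)
```

With `observed=True`, categories that have no rows in the sample (1 and 2 here) are left out of the result.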


In [18]:
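# Add a RoomsPerHousehold column: AveRooms / AveOccup
# (rooms per household member, since AveRooms is already a per-household average)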
housing_dataframe['RoomsPerHousehold'] = housing_dataframe['AveRooms'] / housing_dataframe['AveOccup']

In [19]:
# Group by RoomsPerHousehold and calculate the average 'MedInc' of each group.
# Note: RoomsPerHousehold is continuous, so nearly every row forms its own group.
avg_income_by_rooms = housing_dataframe.groupby('RoomsPerHousehold', observed=True).apply(lambda x: x['MedInc'].mean(), include_groups=False)

print(avg_income_by_rooms)

RoomsPerHousehold
0.002547     10.2264
0.008576      5.5179
0.018065      4.2639
0.035955      6.1359
0.061605      4.2391
              ...   
33.843373     1.6154
34.214286     2.6250
41.333333     2.5893
52.033333     1.8750
55.222222     4.6250
Length: 20352, dtype: float64
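
Grouping on a continuous column yields 20,352 groups for 20,640 rows, so most of these means cover a single row. One possible alternative, sketched here on the five rows shown earlier, is to bin the ratio with `pd.qcut` before grouping. The choice of two bins and the `low`/`high` labels are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# First five rows of the relevant columns from the table above.
sample = pd.DataFrame({
    "MedInc": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "AveRooms": [6.984127, 6.238137, 8.288136, 5.817352, 6.281853],
    "AveOccup": [2.555556, 2.109842, 2.80226, 2.547945, 2.181467],
})
sample["RoomsPerHousehold"] = sample["AveRooms"] / sample["AveOccup"]

# Split the ratio into two equal-frequency bins (q=2 is a hypothetical choice).
sample["rooms_bin"] = pd.qcut(
    sample["RoomsPerHousehold"], q=2, labels=["low", "high"]
)

# Mean MedInc per bin instead of per unique ratio value.
mean_income_by_bin = sample.groupby("rooms_bin", observed=True)["MedInc"].mean()
print(mean_income_by_bin)
```

On the full dataset a larger `q` (for example quartiles or deciles) would likely be more informative; the right number of bins depends on the question being asked.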
