<a href="https://colab.research.google.com/github/tjp1992/ML-Jupyter/blob/main/Projects/03_Housing_Oreiley/Oreiley_Housing_ERD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# California Housing Prediction (Regression)

This notebook is intended for personal study.

---

## Framing the Problem

Realtors in California would like a way to predict housing prices in the state's different districts, which would make it more efficient to serve customers who have a specific budget in mind.

The task is therefore to predict the housing price of each California district from the dataset described below.

## Dataset
The dataset was retrieved from the GitHub account of [ageron](https://github.com/ageron/handson-ml2), the author of [Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/).

In [6]:
# Import the libraries used throughout the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Libraries needed to download and unpack the dataset
import os
import tarfile
import urllib.request  # a bare `import urllib` does not reliably expose urllib.request

In [11]:
DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/ageron/handson-ml2/master'
HOUSING_PATH = os.path.join('datasets','housing')
# DOWNLOAD_ROOT has no trailing slash, so the separator is added here
HOUSING_URL = DOWNLOAD_ROOT + '/datasets/housing/housing.tgz'

def fetch_housing_data(housing_url = HOUSING_URL, housing_path = HOUSING_PATH):
    """ Download the tgz file from the github repository and unpacking the dataset in to local path
    """
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, 'housing.tgz')
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()


def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df


def load_housing_data(housing_path = HOUSING_PATH):
    """create a dataframe and optimize its memory usage"""
    df = pd.read_csv(os.path.join(DOWNLOAD_ROOT, housing_path, "housing.csv"), parse_dates=True, keep_date_col=True)
    df = reduce_mem_usage(df)
    return df


In [12]:
housing_df = load_housing_data()
housing_df.head()

Memory usage of dataframe is 1.57 MB
Memory usage after optimization is: 0.41 MB
Decreased by 73.7%


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.25 | 37.875 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.328125 | 452600.0 | NEAR BAY |
| 1 | -122.25 | 37.875 | 21.0 | 7100.0 | 1106.0 | 2400.0 | 1138.0 | 8.304688 | 358500.0 | NEAR BAY |
| 2 | -122.25 | 37.84375 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.257812 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.84375 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.644531 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.84375 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.845703 | 342200.0 | NEAR BAY |
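
The preview hints at what the memory savings cost: values such as `37.875` and `8.328125` are exact multiples of a power of two, which is what rounding to `float16` would produce. The sketch below uses illustrative values (assumed, not read from the dataset) to show how `float16` rounds numbers of this magnitude, and how much smaller the error is with `float32`.

```python
import numpy as np

# Illustrative values (assumed for this sketch, not read from the dataset)
values = np.array([7099.0, 37.88, -122.23])

as_f16 = values.astype(np.float16)
as_f32 = values.astype(np.float32)

# float16 has an 11-bit significand, so the gap between representable
# numbers grows with magnitude (it is 4 between 4096 and 8192)
print(as_f16.astype(np.float64))  # [7100.  37.875  -122.25]

# Largest absolute rounding error for each dtype
f16_error = np.abs(as_f16.astype(np.float64) - values).max()
f32_error = np.abs(as_f32.astype(np.float64) - values).max()
print(f16_error, f32_error)
```

If exact coordinates or counts matter for later modelling, one option would be to treat `float32` as the smallest float type in `reduce_mem_usage`; this is a suggestion, not something the notebook does.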


In [20]:
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   longitude           20640 non-null  float16 
 1   latitude            20640 non-null  float16 
 2   housing_median_age  20640 non-null  float16 
 3   total_rooms         20640 non-null  float16 
 4   total_bedrooms      20433 non-null  float16 
 5   population          20640 non-null  float16 
 6   households          20640 non-null  float16 
 7   median_income       20640 non-null  float16 
 8   median_house_value  20640 non-null  float32 
 9   ocean_proximity     20640 non-null  category
dtypes: category(1), float16(8), float32(1)
memory usage: 423.6 KB


`housing_df.info()` shows that `ocean_proximity`, which pandas reads from the CSV as an `object` column, has been converted to the `category` dtype by the memory reduction function. It also shows that `total_bedrooms` has only 20433 non-null values out of 20640, so 207 entries are missing.
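
As a rough illustration of why the `category` conversion saves memory, here is a minimal sketch on a synthetic column. The labels mimic those in `ocean_proximity`, but the column length is arbitrary and chosen only for this example. A `category` column stores each distinct label once, plus a small integer code per row.

```python
import pandas as pd

# Synthetic column: a few distinct labels repeated many times (arbitrary size)
labels = pd.Series(['NEAR BAY', 'INLAND', 'NEAR OCEAN'] * 5000, dtype=object)

as_category = labels.astype('category')

# deep=True also counts the bytes of the Python string objects themselves
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f'object:   {object_bytes} bytes')
print(f'category: {category_bytes} bytes')
print(list(as_category.cat.categories))
```

Note that `reduce_mem_usage` calls `memory_usage()` without `deep=True`, so the starting figure it prints likely understates the true size of `object` columns.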

In [19]:
housing_df.ocean_proximity.value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64