<a href="https://colab.research.google.com/github/w4bo/AA2425-unibo-bigdataandcloudplatforms/blob/main/slides/lab-01-Metadata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The `California Housing Pricing` case study

This notebook runs on Google Colab.

- Colab provides a serverless Jupyter notebook environment for interactive development.
- As of 2024, Google Colab is free to use, like other G Suite products.

In this laboratory we will build a simple data pipeline to get acquainted with the "main" steps necessary to transform your data.

- The data contains information from the 1990 California census. It provides an accessible introductory dataset for teaching the basics of machine learning.

From the book [Hands-On Machine Learning](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/):

> This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short. The goal is to build a model to predict the median housing price in any district, given all the other metrics.

# Setup (& library versioning)

First of all, we need to set up the Python environment by installing and importing the necessary dependencies.

In [None]:
!pip install prov pydot
import pandas as pd
import sklearn as sk
import numpy as np
import seaborn as sns
import prov

print(pd.__version__)
print(sk.__version__)
print(np.__version__)
print(sns.__version__)
print(prov.__version__)

Why should we track the libraries imported in the coding environment?
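
One possible answer: results can change between library versions, so storing the versions next to the results makes a run easier to reproduce. A minimal sketch, using only the standard library; the helper name `snapshot_versions` is hypothetical and not part of the original notebook.

```python
import json
import platform
from importlib import metadata


def snapshot_versions(packages):
    """Map each package name to its installed version (None if it is not installed)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is missing from this environment
    return versions


# Names as used by pip (note: `scikit-learn`, not `sklearn`)
snapshot = snapshot_versions(["pandas", "scikit-learn", "numpy", "seaborn", "prov"])
print(json.dumps(snapshot, indent=2))
```

The resulting dictionary can be saved together with the experiment results or attached to the provenance metadata.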

# Data collection

Import the dataset. In this case, there is no need for ETL/integration since the dataset is ready for processing.

In [None]:
df = pd.read_csv("https://w4bo.github.io/AA2324-unibo-bigdataandcloudplatforms/housing.csv")
df

# Profiling: Schema

In [None]:
df.columns

Schema description

1. `longitude`: Longitude of the block; values are negative (west of the prime meridian), so a lower value is farther west
2. `latitude`: Latitude of the block; a higher value is farther north
3. `housing_median_age`: Median age of a house within a block; a lower number is a newer building
4. `total_rooms`: Total number of rooms within a block
5. `total_bedrooms`: Total number of bedrooms within a block
6. `population`: Total number of people residing within a block
7. `households`: Total number of households (a household is a group of people residing within a home unit) for a block
8. `median_income`: Median income for households within a block (measured in tens of thousands of US Dollars)
9. `median_house_value`: Median house value for households within a block (measured in US Dollars)
10. `ocean_proximity`: Location of the block with respect to the ocean/sea

# Profiling: Schema

In [None]:
df.info()

# Profiling: Distribution and statistics

In [None]:
df.describe(include='all')

# Profiling: Distribution

In [None]:
import matplotlib.pyplot as plt
df.hist(bins=50, figsize=(16, 9))
plt.show()

# Profiling: are there relationships between variables?

In [None]:
tmp = df[["median_income", "housing_median_age", "median_house_value", "households", "population", "total_rooms"]]
sns.pairplot(tmp.sample(n=1000, random_state=42), markers='o') # hue="median_house_value",
plt.show()

# Compression: Memory usage

What if I change float64 to float32?

In [None]:
dff = df.copy(deep=True) # copy the dataframe
for x in df.columns: # iterate over the columns
    if dff[x].dtype == 'float64': dff[x] = dff[x].astype('float32') # ... change it to `float32`
dff.info() # show some statistics on the dataframe
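
To quantify the saving, the byte counts can be compared directly. A minimal sketch on a synthetic dataframe (the column names `a`, `b`, `c` are placeholders, not the housing columns), using `DataFrame.memory_usage`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (placeholder column names)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])

bytes_64 = demo.memory_usage(index=False).sum()  # 8 bytes per float64 value
bytes_32 = demo.astype("float32").memory_usage(index=False).sum()  # 4 bytes per float32 value

print(f"float64: {bytes_64} bytes, float32: {bytes_32} bytes")
```

On the real dataframe the ratio is smaller than 2x, since non-float columns such as `ocean_proximity` are not converted.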

# Compression: Memory usage

What if I change float64 to float16?

In [None]:
dff = df.copy(deep=True) # copy the dataframe
for x in df.columns: # iterate over the columns
    if dff[x].dtype == 'float64': dff[x] = dff[x].astype('float16') # ... change it to `float16`
dff.info() # show some statistics on the dataframe

# Compression: What happens to the statistics?

In [None]:
dff.describe()
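
A hint for reading the statistics: `float16` cannot represent values above 65504, so larger numbers overflow to `inf`, and smaller ones lose precision. A minimal sketch with illustrative values (two numbers in the range of house prices and one in the range of incomes):

```python
import numpy as np

print(np.finfo(np.float16).max)  # largest finite float16 value

values = np.array([452600.0, 500001.0, 8.3252])  # illustrative values
with np.errstate(over="ignore"):  # silence the overflow warning for this demo
    as_f16 = values.astype(np.float16)

print(as_f16)            # the first two entries overflow to inf
print(np.isinf(as_f16))  # flags which entries were lost
```

Compression is therefore a trade-off: the narrower the type, the more values fall outside its range or precision.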

# Data preprocessing

# Missing values

There are some missing values in the column `total_bedrooms`. What can we do?

Most Machine Learning algorithms cannot work with missing features. We have three options:

- Get rid of the corresponding districts (i.e., drop the rows)
- Get rid of the whole attribute (i.e., drop the columns)
- Set the values to some value (zero, the mean, the median, etc.)
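
The three options can be sketched with pandas as follows. The toy dataframe is illustrative, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (illustrative data)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

option_1 = toy.dropna(subset=["total_bedrooms"])   # drop the rows with missing values
option_2 = toy.drop(columns=["total_bedrooms"])    # drop the whole column
median = toy["total_bedrooms"].median()            # the median ignores NaN
option_3 = toy.fillna({"total_bedrooms": median})  # impute with the median

print(option_1.shape, option_2.shape, option_3.shape)
```

Dropping rows loses districts, dropping the column loses a feature, and imputing keeps both at the cost of introducing estimated values.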


# Non-numeric attributes

`ocean_proximity` is a text attribute so we cannot compute its median. Some options:

- Get rid of the whole attribute. (`df.drop("ocean_proximity", axis=1)`)
- Change from categorical to ordinal (e.g., `NEAR BAY` = 0, `INLAND` = 1)
- Change from categorical to one-hot encoding
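
A minimal sketch of the two encodings on a toy column. The integer mapping is an arbitrary choice made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "NEAR OCEAN"]})

# Ordinal encoding: map each category to an integer (this order is arbitrary)
mapping = {"NEAR BAY": 0, "INLAND": 1, "NEAR OCEAN": 2}
ordinal = toy["ocean_proximity"].map(mapping)

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(toy["ocean_proximity"], prefix="ocean_proximity")

print(ordinal.tolist())
print(one_hot.columns.tolist())
```

Ordinal encoding imposes an order that a model may read as a magnitude; one-hot encoding avoids this but adds one column per category.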

# Scaling attributes

Attributes have very different scales.

Should we scale them?

- Min-max normalization
- Standardization
- Robust scaling
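
The three techniques can be compared on a single feature with an outlier. A minimal sketch with scikit-learn scalers and illustrative values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an outlier (illustrative values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

min_max = MinMaxScaler().fit_transform(x)     # (x - min) / (max - min), range [0, 1]
standard = StandardScaler().fit_transform(x)  # (x - mean) / std, zero mean and unit variance
robust = RobustScaler().fit_transform(x)      # (x - median) / IQR, less sensitive to outliers

print(min_max.ravel())
print(standard.ravel())
print(robust.ravel())
```

With min-max normalization the outlier squeezes the other values close to 0, while robust scaling keeps them spread around the median.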

# Machine learning

Our machine learning pipeline can be composed of alternative solutions.

If we consider the default parameters for each algorithm, we have

- 3 options for imputation
- ... x 2 options for encoding
- ... x 3 options for normalization
- ... x 3 algorithms
- = 54 alternatives!
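
The count above can be checked by enumerating the combinations. The labels below are illustrative and only mirror the number of options per step.

```python
from itertools import product

# Illustrative labels: only the number of options per step matters
imputations = ["drop_rows", "drop_column", "fill_value"]
encodings = ["ordinal", "one_hot"]
scalers = ["min_max", "standard", "robust"]
algorithms = ["lr", "dt_3", "dt_5"]

pipelines = list(product(imputations, encodings, scalers, algorithms))
print(len(pipelines))  # 3 * 2 * 3 * 3 = 54
print(pipelines[0])
```

Adding hyper-parameter values to any step multiplies this number further.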

# Alternative pre-processing pipelines

In [None]:
from sklearn.preprocessing import StandardScaler

if "ocean_proximity" in df.columns:  # For now we simply drop "ocean_proximity"
  df = df.drop("ocean_proximity", axis=1)

# Let's create some dataset variations
# dataset1: drop the rows containing the null values and the columns `latitude` and `longitude`
dataset_v1 = df.copy(deep=True).dropna().drop(["longitude", "latitude"], axis=1)
# dataset2: impute missing values with the average number of bedrooms
dataset_v2 = df.copy(deep=True)
dataset_v2["total_bedrooms"] = dataset_v2["total_bedrooms"].fillna(dataset_v2["total_bedrooms"].mean())
# dataset3: also standardize dataset_v2
numerical_features = dataset_v2.select_dtypes(include=np.number)  # select numerical features
scaler = StandardScaler()  # Create a StandardScaler object
scaled_features = scaler.fit_transform(numerical_features)  # Fit and transform the numerical features
dataset_v3 = pd.DataFrame(scaled_features, columns=numerical_features.columns)  # Convert the scaled features back to a DataFrame

# Create the list of datasets
datasets = [(dataset_v1, "dataset_v1"), (dataset_v2, "dataset_v2"), (dataset_v3, "dataset_v3")]

# Alternative machine learning algorithms

In [None]:
# Let's import some machine learning models (here we are not addressing hyper-parameter tuning)
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso
dt_3 = DecisionTreeRegressor(random_state=0, max_depth=3)  # initialize decision tree regressor model
dt_5 = DecisionTreeRegressor(random_state=0, max_depth=5)  # initialize decision tree regressor model
lr = LinearRegression()  # initialize a linear regressor model
# Create the list of algorithms
ml_algorithms = [(lr, "lr"), (dt_3, "dt_3"), (dt_5, "dt_5")]

# Train the models

What are the insights?

In [None]:
from sklearn.model_selection import cross_val_score
instances, i = [], 0
for dataset, dataset_version in datasets:  # For each dataset version...
  X = dataset.drop(columns=["median_house_value"]).to_numpy()  # Get the feature matrix
  y = dataset["median_house_value"].to_numpy()  # Get the label array
  for ml_algorithm, ml_algorithm_version in ml_algorithms:  # For each machine learning algorithm...
    instance = {}  # Describe this pipeline instance (dataset version + algorithm)
    instance["id"] = i  # store the id of the instance
    instance["dataset"] = dataset_version  # store the version of the dataset
    instance["algorithm"] = ml_algorithm_version  # store the version of the ml algorithm
    instance["score"] = cross_val_score(ml_algorithm, X, y, cv=10).mean()  # mean R^2 over 10-fold cross-validation
    instances.append(instance)
    i += 1
result = pd.DataFrame(instances)  # Collect the results
sns.catplot(x = "dataset", y = "score", hue = "algorithm", data = result, kind = "bar")

# How do we track all these changes?

In [None]:
#| echo: false
#| output: false

!apt update -y
!apt install graphviz -y

# Creating and plotting the provenance graph

In [None]:
from prov.model import ProvDocument
from prov.dot import prov_to_dot
from IPython.display import Image
d1 = ProvDocument()  # Create an empty provenance document
d1.add_namespace('unibo', 'https://www.unibo.it')  # add the namespace
d1.add_namespace('sk', 'https://scikit-learn.org/stable/')  # add the namespace
agent = d1.agent('unibo:mfrancia')  # add an agent
d1.wasDerivedFrom("unibo:dataset_v3", "unibo:dataset_v2")
for dataset, dataset_version in datasets:  # For each dataset version...
  original_dataset = d1.entity('unibo:' + dataset_version)  # register the dataset
  d1.wasAttributedTo(original_dataset, agent)  # attribute the dataset to the agent who created it
  for ml_algorithm, ml_algorithm_version in ml_algorithms:  # For each machine learning algorithm...
    algo = d1.activity('sk:' + ml_algorithm_version)  # register the algorithm as a (processing) activity
    processed_dataset = d1.entity('unibo:' + ml_algorithm_version + "_" + dataset_version, {'sk:cv-score': '...'})  # create an entity representing the processed dataset
    d1.used(algo, original_dataset)  # the activity used the dataset as input
    d1.wasGeneratedBy(processed_dataset, algo)  # the processed dataset has been created by the algorithm
    d1.wasDerivedFrom(processed_dataset, original_dataset)  # the processed dataset has been derived from the original one
dot = prov_to_dot(d1)  # visualize the graph
dot.write_png('prov.png')
Image('prov.png')

# Can you build and track a better model?

In [None]:
# Try your scikit-learn model here