<a href="https://colab.research.google.com/github/sprince0031/ICT-Python-ML/blob/sprince/Week%204/Notebooks/Week4_reference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python & ML Foundations: Session 4
## Data Visualization, Classification & Regression

Welcome to the session 4 reference notebook! This week, we move from Python fundamentals to the core workflow of a machine learning practitioner. We'll learn how to visualize data to gain insights, build our first classification models, and finish with the concept of regression.

**Libraries for this week:**
- `pandas`: For loading and manipulating our data.
- `matplotlib` & `seaborn`: For data visualization.
- `scikit-learn`: For building and evaluating our machine learning models.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")  # set_theme is the current name for sns.set; see https://seaborn.pydata.org/tutorial/aesthetics.html

# Load dataset
df = pd.read_csv('./sample_data/california_housing_train.csv')
df.head()

---

## 1. Data Visualisation

In this section, we'll see why visualisation is a critical first step in any data science project. We'll use the popular libraries Matplotlib and Seaborn to explore the California Housing dataset.

### 4.1 - Histogram

In [None]:
plt.figure(figsize=(12, 7)) # width, height
plt.hist(df['median_house_value'], bins=50, edgecolor='black')
plt.title('Distribution of Median House Values')
plt.xlabel('Median House Value ($)')
plt.ylabel('Frequency (Number of Districts)')
plt.show()

In [None]:
# Inspect the districts valued above $500,000, the far-right end of the histogram above
df[df['median_house_value'] > 500000].describe()

In [None]:
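# .copy() makes df_filtered an independent DataFrame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning
df_filtered = df[df['median_house_value'] <= 500000].copy()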

In [None]:
plt.figure(figsize=(12, 7))
plt.hist(df_filtered['median_house_value'], bins=50, edgecolor='black')
plt.title('Distribution of Median House Values (up to $500,000)')
plt.xlabel('Median House Value ($)')
plt.ylabel('Frequency (Number of Districts)')
plt.show()

### 4.2 - Scatterplot

In [None]:
plt.figure(figsize=(12, 7))
plt.scatter(df_filtered['median_income'], df_filtered['median_house_value'], alpha=0.1)
plt.title('Median Income vs. Median House Value')
plt.xlabel('Median Income (in tens of thousands)')
plt.ylabel('Median House Value ($)')
plt.show()

### Some quick data engineering
Let's create a new feature that categorises each district by its `housing_median_age`. We want 3 buckets: `New` (under 15 years), `Established` (15 to 35 years), and `Historic` (over 35 years).

In [None]:
def classify_age(age):
    if age < 15:
        return 'New'
    elif age <= 35:
        return 'Established'
    else:
        return 'Historic'

df_filtered['age_category'] = df_filtered['housing_median_age'].apply(classify_age)

print("Our new feature in action:")
df_filtered[['housing_median_age', 'age_category']].head()
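
The same bucketing can also be done without a row-by-row `apply`. The sketch below is one possible vectorised alternative using `np.select`, not the approach used in the cell above. The `ages` values are hypothetical, chosen only to hit both bucket boundaries (15 and 35), and `classify_age` is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd


def classify_age(age):
    # Same rule as the notebook cell above, repeated so this block is self-contained
    if age < 15:
        return 'New'
    elif age <= 35:
        return 'Established'
    else:
        return 'Historic'


# Hypothetical ages (not taken from the dataset) covering both boundaries
ages = pd.Series([5, 14, 15, 35, 36, 52], name='housing_median_age')

values = ages.to_numpy()
# np.select picks the choice for the first condition that is True for each
# element, and falls back to `default` when none is True
age_category = pd.Series(
    np.select([values < 15, values <= 35], ['New', 'Established'], default='Historic'),
    index=ages.index,
    name='age_category',
)

print(age_category.tolist())
# Check that the vectorised version agrees with classify_age on every row
print((age_category == ages.apply(classify_age)).all())
```

On a dataset of this size either version is fine; the vectorised form mainly pays off on much larger tables.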

### 4.3 - Bar plot

In [None]:
plt.figure(figsize=(12, 7))
sns.barplot(x='age_category', y='median_house_value', data=df_filtered, order=['New', 'Established', 'Historic'])
plt.title('Average House Value by Age Category')
plt.show()

### 4.4 - Box plot

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='age_category', y='median_house_value', data=df_filtered, order=['New', 'Established', 'Historic'])
plt.title('Distribution of House Values by Age Category')
plt.show()

### 4.5 - Heatmap

In [None]:
# Get Correlation Matrix from DataFrame
corr_matrix = df_filtered.corr(numeric_only=True)

# Plot the correlation matrix as an annotated heatmap
plt.figure(figsize=(12, 7))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of All Numerical Features')
plt.show()

---

## 2. Classical ML

Now we'll build our first model. To demonstrate why accuracy alone can mislead, we will turn a 10-class handwritten-digit problem into a **binary, imbalanced** one: deciding whether a digit is a '5' or not.

### 4.2.1 - Creating an Imbalanced Dataset

We will define our features (X) as the pixel values and our target (y) as a boolean that is `True` only if the digit's label is 5. We will see that this creates a dataset where about 90% of the data belongs to one class ('Not 5').
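
A minimal sketch of this step follows. It assumes scikit-learn's bundled 8x8 digits dataset (`load_digits`) as a stand-in, since the notebook text does not say which digit dataset is used. The variable `positive_share` is introduced here purely for illustration.

```python
from sklearn.datasets import load_digits

# Assumption: load_digits stands in for the digit data used in the session
digits = load_digits()

X = digits.data            # one row of flattened pixel values per image
y = (digits.target == 5)   # boolean target: True only where the label is 5

# The mean of a boolean array is the fraction of True values
positive_share = y.mean()
print(f"Samples: {len(y)}, features per sample: {X.shape[1]}")
print(f"Share of '5':     {positive_share:.1%}")
print(f"Share of 'Not 5': {1 - positive_share:.1%}")
```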

### 4.2.2 - The Train-Test Split

This is one of the most important rules in machine learning: always split your data into a training set (which the model learns from) and a testing set (which is held back to evaluate performance on unseen data).
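
One way to do the split is scikit-learn's `train_test_split`, sketched below. The dataset (`load_digits`), the 80/20 ratio, and `random_state=42` are assumptions made for illustration. `stratify=y` is an optional extra that keeps the share of '5's roughly equal in both sets, which matters when one class is rare.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Assumption: same stand-in digits data, with a binary "is it a 5?" target
digits = load_digits()
X = digits.data
y = (digits.target == 5)

# Hold back 20% of the samples for testing; the model never sees them in training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}, testing samples: {len(X_test)}")
print(f"Share of '5' in train: {y_train.mean():.1%}, in test: {y_test.mean():.1%}")
```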

### 4.2.3 - The Accuracy Trap: Dumb vs. Real Model

Let's show that accuracy is a misleading metric here. We will build a `DummyClassifier` that always predicts the majority class ('Not 5'), then a real `DecisionTreeClassifier`, and compare their accuracies. Because roughly 90% of the samples are 'Not 5', the dummy model already scores about 90%, so the two results will look surprisingly similar. That is the trap of relying only on accuracy.
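
The comparison might look like the sketch below. As before, `load_digits`, the 80/20 split, and `random_state=42` are assumptions, not details taken from the session. The recall figure is an extra added here to make the trap visible: it is the fraction of real '5's that each model actually finds.

```python
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: stand-in digits data with a binary "is it a 5?" target
digits = load_digits()
X = digits.data
y = (digits.target == 5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# "Dumb" model: ignores the pixels and always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

# "Real" model: a decision tree that learns from the pixel values
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)

dummy_acc = accuracy_score(y_test, dummy_pred)
tree_acc = accuracy_score(y_test, tree_pred)
dummy_recall = recall_score(y_test, dummy_pred)
tree_recall = recall_score(y_test, tree_pred)

print(f"Dummy accuracy: {dummy_acc:.3f}, recall on '5': {dummy_recall:.3f}")
print(f"Tree accuracy:  {tree_acc:.3f}, recall on '5': {tree_recall:.3f}")
```

The dummy model never predicts '5', so its recall is zero even though its accuracy looks respectable.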

---

## 3. Regression

Regression is the other main type of supervised ML: instead of assigning an input to a class, we predict a value on a continuous scale, such as a district's median house value.
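
As a minimal illustration, the sketch below fits a straight line with scikit-learn's `LinearRegression`. The data is synthetic, generated from a made-up rule (`y = 3x + 2` plus noise), and linear regression is just one possible choice of model; neither comes from the session material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3x + 2 plus random noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

# scikit-learn expects a 2D feature matrix, so reshape x into a single column
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)

slope = model.coef_[0]
intercept = model.intercept_
# Unlike a classifier, the model outputs a continuous number
prediction = model.predict([[4.0]])[0]

print(f"Learned slope: {slope:.2f}, intercept: {intercept:.2f}")
print(f"Predicted y for x = 4.0: {prediction:.2f}")
```

The learned slope and intercept should land close to the 3 and 2 used to generate the data.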