**<h3>Python Script Imports</h3>**

In [2]:
import os
import sys
import importlib    
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
#Paths
housing_path = r"E:/Future Plans/Post-Graduation-Projects/Hands-on Machine Learning/first-project/datasets/housing.csv"
python_scripts_path = r"E:/Future Plans/Post-Graduation-Projects/Hands-on Machine Learning/first-project/python-scripts"

#Path initialization so that Python can see where our script lies.
script_dir = os.path.abspath(python_scripts_path) 
sys.path.append(script_dir)

#Imports
from sklearn.model_selection import StratifiedShuffleSplit

from test_set_check import test_set_check

from split_train_test_by_id import split_train_test_by_id
from load_housing_data import load_housing_data

housing = load_housing_data(housing_path)

**<h3>Prepare clean dataset</h3>**
**<h5>Step 1: Create a stratified dataset</h5>**

**Idea:** Split the entire housing dataset into 5 categories of `median_income` This prevents sampling bias (favoring 1 type of `median_income` over the rest.

In [3]:
#limits number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5) 
#merge "larger than 5" categories into category 5
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=False) 
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]


**<h5>Step 2: Separate predictors and labels </h5>**

**Definition**

- **Predictors**: Data that the machine learning model learns from to predict future ones.
- **Labels**: The answer letting the machine know if it gives a correct/incorrect prediction.

**Example: `drop()` function**

In [4]:
#1. Creates a new dataset without the answer (median_house_value) 
# so that the ML model can learn from.
housing = strat_train_set.drop("median_house_value", axis=1) 

#2. Creates a table of labels (a.k.a answers), letting the model know
# if it's doing well or not.
housing_labels = strat_train_set["median_house_value"].copy()

print(housing)
print(housing_labels)

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
17606    -121.89     37.29                38.0       1568.0           351.0   
15698    -122.46     37.79                52.0        899.0            96.0   
14650    -117.20     32.77                31.0       1952.0           471.0   
3230     -119.61     36.31                25.0       1847.0           371.0   
3555     -118.59     34.23                17.0       6592.0          1525.0   
...          ...       ...                 ...          ...             ...   
6563     -118.13     34.20                46.0       1271.0           236.0   
12053    -117.56     33.88                40.0       1196.0           294.0   
13908    -116.40     34.09                 9.0       4855.0           872.0   
11159    -118.01     33.82                31.0       1960.0           380.0   
15775    -122.45     37.77                52.0       3095.0           682.0   

       population  households  median_income ocean_

**<h5>Step 3: Cleans the data</h5>** 

**Idea:** "fix" missing cells in a row (district), like this one (missing total_bedrooms value):
![image.png](attachment:b2f48d82-479d-476b-a6e8-52e61f6f6dc2.png)

**How to fix:**
- Remove corresponding districts (delete the entire row 292) 

`housing.dropna(subset=["total_bedrooms"])` 

- Remove the entire column (delete the total_bedrooms column) 

`housing.drop("total_bedrooms", axis=1) # option 2` 

- Set that cell to some value (zero/mean/median....) 

`median = housing["total_bedrooms"].median() # option 3`
`housing["total_bedrooms"].fillna(median, inplace=True) `

Note: If option 3 is chosen, **median** value of total_bedrooms must be calculated.

  - **What is Median?** The number in the middle of the "total_bedrooms" list. Here's an example:
    
    + Case 1: `total_bedrooms = [2, 3, 4, 5, 6]` => Median = `4`
    + Case 2: `total_bedrooms = [2, 3, 4, 5, 6, 7]` => Median = `(4+5)/2 = 4.5`



**<h3>Data cleanup with `Imputer()` function</h3>**
**Example**
- Create a SimpleImputer instance
- For each column (e.g. total_bedrooms), do the following:
  + Calculate the median value
  + Fill that median value to the empty cells.

In [5]:
imputer = SimpleImputer(strategy="median")

**Caveats**
- The Imputer only works on **numerical values**. This means **`ocean_proximity`** should be removed when the dataset is copied.
- The `fit()` function **learns the dataset** to find the median value.
  + We will **fill in the empty cells** of the dataset with this **median value** later.
  + According to the book: The imputer has simply computed the median of each attribute and stored the result in its statistics_ instance variable.

In [6]:
# 1. Creates a copy of the housing data, WITHOUT the string column (ocean_proximity)
housing_num = housing.drop("ocean_proximity", axis=1) 

# 2. Use the fit() function to learn from the housing_num dataset, then creates 
# the value to fill in the empty cells.
imputer.fit(housing_num)

#3. Fill the empty cells with the "transform" function. 
# We have X - a dataset with all empty cells filled with the median.
X = imputer.transform(housing_num)

**<h3>Scikit-learn Design Definitions</h3>**
- **Estimators (learners):**
  + A thing/Things that **predict/estimate some parameters** based on the dataset. Simply put, it **learns the current data**, to help predict the future data.
  + Common examples: **`fit()`**  
- **Transformers (do-ers):**
  + A thing/function/object that **transforms/modifies** a dataset, using the learned information. Learned information can be obtained from the **`fit()`** function, for example.
- **Predictors:**
  + Predicts the future data based on a given dataset.
- **Inspection:**
  + A way for us to know **what values are being held by**, for example, the **predictors**.
  + It's the same as attributes in an Object (Object-Oriented Programming).
  + Examples:
    +  **`imputer.strategy`**: Gets the used strategy from the **`imputer`** object.
    +  **`imputer.statistics`**: Gets the statistics from the **`imputer`** object.

**<h3>Handling Text/Categorical values </h3>**


**Method 1: Convert Text to Number using  Panda's **`factorize()`** method**

**Steps**
- Extract the **`ocean_proximity`**  column from the original dataset, and store the result in **`housing_cat`** variable (which is now a dataset with 1 column of **`ocean_proximity`**)
- Use **`factorize()`** to convert them to numbers. Factorize does the following:
  + Convert whatever (e.g. string, object) to numbers
  + Lists out **distinct** values only. 

In [7]:
housing_cat = housing["ocean_proximity"]
print("Housing categories before conversion: \n"+ str(housing_cat[:10]))
housing_cat_encoded, housing_categories = housing_cat.factorize()
print("Housing categories in integer:" + str(housing_cat_encoded[:10]))
print("Housing categories:" +str(housing_categories))

Housing categories before conversion: 
17606     <1H OCEAN
15698      NEAR BAY
14650    NEAR OCEAN
3230         INLAND
3555      <1H OCEAN
19480        INLAND
9026     NEAR OCEAN
13685        INLAND
4937      <1H OCEAN
4861      <1H OCEAN
Name: ocean_proximity, dtype: object
Housing categories in integer:[0 1 2 3 0 3 2 3 0 0]
Housing categories:Index(['<1H OCEAN', 'NEAR BAY', 'NEAR OCEAN', 'INLAND', 'ISLAND'], dtype='object')


**Issues**: 
- Given this category list `[0 1 2 3 0 3 2 3 0 0]`, the ML model tends to think that **2 nearby values** (e.g. `0 (position 1)`, `1 (position 2)`) **is more similar** than 2 distant values (e.g. `0 (position 1 "<1H OCEAN")`, `0 (position 4 "<1H OCEAN")`) (but they are similar!)

**<h3>Solution - One-hot encoding</h3>**
- For 1 category, create a binary attribute (1/0).
  + If the category `<1H OCEAN` is **present** => set the binary column to 1
  + If the category `<1H OCEAN` is **absent** => set the binary column to 0

**<h5>SciKit OneHotEncoder</h5>**

**Explanations**
- **`encoder.fit_transform()`**: Applies **one-hot encoding** to a list of numerical categories (`housing_cat_encoded`)
- **`encoder.reshape()`**:
  + Reshapes the `housing_cat_encoded` from **1 Direction** to **2 Direction**.
  + Parameters (Inputs):
    + (-1): The number of elements in the housing_cat_encoded. Leaving it `-1` means: Hey Scikit, calculate it yourself.
    + (1): Number of columns in the result.

![image.png](attachment:9feca7aa-5a89-4f85-abd8-77b2d2df7eca.png)
![image.png](attachment:1b0f535f-423b-4bdd-a159-51568563e540.png)

- **`encoder.toarray()`**: Converts the 2-D array to the 1-D array.


In [8]:
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 16512 stored elements and shape (16512, 5)>

**Sparse Matrix:** A matrix (2-D array) that is filled with mostly 0. This speeeds up calculation, as the model only focuses on non-zero values.

In [9]:
housing_cat_1hot.toarray()

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]], shape=(16512, 5))

**<h3>Custom Transformers</h3>**
**<h5>Definition</h5>**
- A Transformer that we **write on our own**
- It **extends/overrides the existing transformer(s) of Scikit**, for example
  + BaseEstimator
  + TransormerMixin 

**<h5>Example</h5>**
![image.png](attachment:293edbd9-132e-4af8-b523-ecbc9ca3d82b.png)

**<h5>Important</h5>**
This snippet is already created at **`python-scripts/custom-transformers/combined_attributes_adder.py`** 

**<h3>Feature Scaling</h3>**

**<h5>Problem</h5>**
- Machine Learning models sucks when working with numbers **having different scales**. Examples:
  + number_of_rooms range (**`6`** to **`39320`**) (large min-max difference (**`39320 - 6 = 39314`**) versus:
  + median_income range (**`0`** to **`15`**(**`15 - 0 = 15`**)) (small min-max difference)

**<h5>What to do</h5>**
- Convert ranges (of whatever difference) to a common range (usually from **`0 to 1`**)

**<h5>Why is it called Feature Scaling?</h5>**
Because, we are **scaling** each **feature** (e.g. number_of_rooms, median_income) to the **same range**. Example:
  + Instead of **6 to 39320**, it's now from **0 to 1**.
  + Instead of **0 to 15**, it's now from **0 to 1**.
  + Here, **0 to 1** is the common range. 

**<h5>How-to</h5>**
- **Min-Max scaling**: Make sure that the new range goes from 0 to 1. Here's the formula
![image.png](attachment:d51a6f4d-7481-455f-b939-186455c25b1b.png)

  + Where:
    + **XOriginal**: each data point (e.g. number_of_rooms in a district = 880) ![image.png](attachment:c4764d85-9f13-4428-ac4d-8b3812e9ff1f.png)
    + **XMin**: the minimum value (**`6`**)
    + **XMax**: the maximum value (**`39320`**)
  + After calculation, we have: the result = $\frac{880-6}{39320-6} = 0.0222312662155975$ (between 0 and 1)
- **Standardization**: ![image.png](attachment:546b9826-2ad0-4317-b98b-6df6670f9b68.png)