# Chapter 2: End-to-End Machine Learning Project

### Our approach to this problem follows these main steps:

1. Look at the bigger picture
2. Get the data
3. Explore & visualise the data to gain insights
4. Prepare the data for machine learning algorithms
5. Select a model & train it
6. Fine-tune your model
7. Present your solution
8. Launch, monitor & maintain your system

In [None]:
# import statements
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.metrics.pairwise import rbf_kernel
import numpy as np
from pandas.plotting import scatter_matrix
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
%matplotlib inline

## 2. Get the Data

In [None]:
data_path = os.path.join(os.getcwd(), "housing.csv")

In [None]:
data_path

In [None]:
housing_df = pd.read_csv(data_path)
housing_df.head(5)

In [None]:
housing_df.info()

Notice that the 'total_bedrooms' feature has only 20433 non-null values out of 20640 entries in the dataset, meaning 207 districts do not have a valid value for this feature.

### Exploring 'ocean_proximity' as a categorical feature: it takes 5 distinct values, counted below

In [None]:
housing_df['ocean_proximity'].value_counts()

### Exploring the distribution of numerical attributes
- Numerical features are quite skewed, so we may need to normalise them
- A few features, such as housing_median_age & median_house_value, are capped at different levels

In [None]:
housing_df.hist(figsize=(15,8), bins=40) 
# figsize = width and height of the figure, in inches
# bins = number of intervals each feature's range is divided into; more bins give a finer-grained histogram
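
The skew noted above can be quantified, and one common way to reduce it is a log transform. The sketch below is only an illustration on synthetic data: the notebook itself does not apply this transform, and `toy_population` is a made-up heavy-tailed feature, not a column of housing.csv.

```python
import numpy as np
import pandas as pd

toy_rng = np.random.default_rng(0)
# Synthetic heavy-tailed feature (assumption: a stand-in for a skewed column such as population)
toy_population = pd.Series(toy_rng.lognormal(mean=7.0, sigma=1.0, size=5000))

# Series.skew() is close to 0 for a symmetric distribution and positive for a long right tail
raw_skew = toy_population.skew()
log_skew = np.log(toy_population).skew()

print(f"skew before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A log transform only makes sense for strictly positive features, which is the case for this synthetic column.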

### Creating a Test Set

- Selecting the test set purely at random is not ideal because it can introduce sampling bias. The test set should be representative of the whole dataset to produce reliable & accurate results.
- Assuming the 'median_income' feature is very important for predicting 'median_house_value', the test set should be representative of this feature.
- Most 'median_income' values fall between 1.5 and 6. Let's create a new feature 'income_cat' to represent categories of this continuous variable. There shouldn't be too many categories, and each category should contain enough districts.
- The pd.cut() function provides a way to create these categories
- The train_test_split() function accepts a stratify argument (the column to stratify on), which produces a test set whose category proportions match those of the full dataset

In [None]:
housing_df["income_cat"] = pd.cut(housing_df["median_income"],
                                 bins=[0.,1.5,3.,4.5,6., np.inf],
                                 labels=[1,2,3,4,5]) # 6 bin edges define 5 bins, one per label; np.inf leaves the last bin open-ended
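
To see how the bin edges map to labels, here is a minimal sketch on a handful of made-up income values (`toy_incomes` is hypothetical, not taken from the dataset). By default pd.cut() uses right-closed intervals, so a value equal to an edge falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Hypothetical income values, chosen to hit bin edges and both extremes
toy_incomes = pd.Series([0.5, 1.5, 2.0, 4.5, 5.9, 12.0])

# Six edges define five right-closed intervals: (0, 1.5], (1.5, 3], (3, 4.5], (4.5, 6], (6, inf]
toy_income_cat = pd.cut(toy_incomes,
                        bins=[0., 1.5, 3., 4.5, 6., np.inf],
                        labels=[1, 2, 3, 4, 5])

print(toy_income_cat.tolist())  # [1, 1, 2, 3, 4, 5]
```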

In [None]:
housing_df["income_cat"].value_counts().sort_index().plot.bar(rot=0, grid=True) # rot=0 keeps the x-axis tick labels horizontal
plt.xlabel("Income category")
plt.ylabel("Number of districts")

In [None]:
# unstratified data
unstrat_train_set, unstrat_test_set = train_test_split(housing_df, test_size=0.2, random_state=42)

In [None]:
# stratified sampling
strat_train_set, strat_test_set = train_test_split(housing_df, test_size=0.2, random_state=42, stratify=housing_df["income_cat"])

In [None]:
strat_test_set["income_cat"].value_counts()/len(strat_test_set) # total strat test set = 4128
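
To check how much stratification helps, the category proportions of both kinds of test set can be compared with the proportions in the full data. The sketch below does this on synthetic incomes so that it runs without housing.csv; `demo_df`, `income_cat_proportions` and the log-normal income distribution are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: not the real dataset)
demo_rng = np.random.default_rng(42)
demo_df = pd.DataFrame({"median_income": demo_rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
demo_df["income_cat"] = pd.cut(demo_df["median_income"],
                               bins=[0., 1.5, 3., 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

def income_cat_proportions(data):
    """Share of rows falling in each income category."""
    return data["income_cat"].value_counts(normalize=True).sort_index()

_, demo_random_test = train_test_split(demo_df, test_size=0.2, random_state=42)
_, demo_strat_test = train_test_split(demo_df, test_size=0.2, random_state=42,
                                      stratify=demo_df["income_cat"])

compare = pd.DataFrame({
    "overall": income_cat_proportions(demo_df),
    "random": income_cat_proportions(demo_random_test),
    "stratified": income_cat_proportions(demo_strat_test),
})
# Relative error of each test set against the overall proportions, in percent
compare["random_err_%"] = (compare["random"] / compare["overall"] - 1) * 100
compare["strat_err_%"] = (compare["stratified"] / compare["overall"] - 1) * 100
print(compare.round(3))
```

The stratified column should track the overall column almost exactly, while the random column can drift, especially for the smaller categories.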


In [None]:
# we won't use the income_cat feature again & also don't want it to influence our predictions, so we will remove it
for feat_set in (strat_train_set, strat_test_set):
    feat_set.drop("income_cat", axis=1, inplace=True)

## 3. Explore & Visualise the Data to Gain Insights

In [None]:
# Let's copy the training set so we can analyse/manipulate it without affecting the original

housing = strat_train_set.copy()
housing.head(5)

In [None]:
# As we are dealing with geographical data, it makes sense to plot a scatterplot of the districts

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2,
            grid=True, s=housing["population"]/100, 
              c="median_house_value",
            figsize=(10,8)) # a low alpha makes high-density areas stand out

### Correlations of attributes

In [None]:
housing.head()

In [None]:
corr_matrix = housing.corr(numeric_only=True) # corr() uses Pearson's correlation coefficient by default

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)
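
Pearson's coefficient only measures linear relationships, which is worth keeping in mind when reading the ranking above. The sketch below uses toy data (not the housing set) and a hypothetical helper `pearson_r` to compute the coefficient by hand and compare it with `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): y_linear depends linearly on x, y_quadratic depends on x non-linearly
toy_x = np.linspace(-3, 3, 61)
toy = pd.DataFrame({"x": toy_x, "y_linear": 2 * toy_x + 1, "y_quadratic": toy_x ** 2})

def pearson_r(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())

manual_r = pearson_r(toy["x"], toy["y_linear"])
toy_corr = toy.corr()["x"]

print(manual_r)   # matches toy_corr["y_linear"]
print(toy_corr)
```

Here y_quadratic is fully determined by x, yet its coefficient is approximately 0 because the relationship is symmetric rather than linear. A low correlation therefore does not rule out a useful feature.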

In [None]:
# Another way to look at the correlations between numerical attributes is a scatter matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12,8))
plt.show()
# median_income looks like the only promising attribute for predicting median_house_value

### Experiment with Attribute Combinations

In [None]:
housing["rooms_per_house"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["people_per_house"] = housing["population"]/housing["households"]

In [None]:
housing.info()

In [None]:
corr_matrix = housing.corr(numeric_only=True)

In [None]:
# bedrooms_ratio is negatively correlated with median_house_value
corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
# let's zoom in on median_income
housing.plot(kind="scatter",x="median_income", y="median_house_value",alpha=0.2)

In [None]:
# separating the predictors from the labels (the target, median_house_value)
housing = strat_train_set.drop("median_house_value", axis=1) # axis=1 drops a column rather than a row
housing_labels = strat_train_set["median_house_value"].copy()

# adding the combined numeric features again, since housing was re-created from strat_train_set
housing["rooms_per_house"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["people_per_house"] = housing["population"]/housing["households"]

### Clean the Data
- Most ML algorithms can't work with missing values, so we need to handle them
- It's a good idea to use an imputer from sklearn
    - Fit it on the training set only, then apply the same transformation to the training, validation & test sets
    - It stores the learned value (for example, the median) of every feature, so it can also fill gaps in incoming data
    - The imputer saves the learned values in its 'statistics_' attribute
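
A minimal sketch of this fit-then-reuse behaviour, on two tiny made-up frames (`toy_train` and `toy_new` are hypothetical; the values are not from housing.csv):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for a training set and for data that arrives later
toy_train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0],
                          "households": [90.0, 110.0, 280.0, 450.0]})
toy_new = pd.DataFrame({"total_bedrooms": [np.nan, 250.0],
                        "households": [np.nan, 200.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_train)         # medians are learned from the training data only

print(toy_imputer.statistics_)     # [300. 195.]

# The stored training medians fill the gaps in the new data
toy_filled = pd.DataFrame(toy_imputer.transform(toy_new),
                          columns=toy_new.columns, index=toy_new.index)
print(toy_filled)
```

Note that 'households' has no missing values in the training frame, yet its median is still stored and used when a gap appears in the new data.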

In [None]:
# Median can only work with numeric data features
imputer = SimpleImputer(strategy="median")

In [None]:
# dataframe with numeric only features
housing_num = housing.select_dtypes(include=[np.number])

In [None]:
housing_num.info() # total_bedrooms (and the derived bedrooms_ratio) have null values

In [None]:
imputer.fit(housing_num)

In [None]:
# the imputer saves the result in its 'statistics_' attribute: the median of each numeric feature
imputer.statistics_

In [None]:
# now using this trained imputer to transform the values in training set
housing_imputed = imputer.transform(housing_num)

In [None]:
print(housing_imputed, type(housing_imputed))

In [None]:
# creating a dataframe from the numpy array returned by the transformer
housing_tr = pd.DataFrame(housing_imputed, columns=housing_num.columns, index=housing_num.index)

In [None]:
# There are no null values left
housing_tr.info()

### Handling categorical attributes
- Currently we only have the 'ocean_proximity' attribute as categorical
- A simple way is to use OrdinalEncoder, but it implies that nearby values are more similar than distant ones. Integer codes such as 1, 2, 3 work for ordered categories like good, better, excellent, but not for ocean_proximity.
- A better and more common solution is one-hot encoding, where each category becomes its own attribute with a value of 1 (hot) or 0 (cold)
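
The difference between the two encodings is easiest to see on a few rows. This is a small sketch on a toy column (`toy_cat_df` is hypothetical), not the notebook's actual data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column; dtype=object keeps the strings as plain Python objects
toy_cat_df = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]},
                          dtype=object)

# Ordinal codes follow the sorted category order, so they suggest an ordering that is not real
toy_ordinal = OrdinalEncoder().fit_transform(toy_cat_df)
print(toy_ordinal.ravel())          # [0. 2. 0. 1.]

# One-hot encoding gives each category its own 0/1 column instead
toy_onehot_encoder = OneHotEncoder()
toy_onehot = toy_onehot_encoder.fit_transform(toy_cat_df)   # sparse by default
print(toy_onehot_encoder.categories_)
print(toy_onehot.toarray())
```

With the ordinal codes, a model could treat INLAND (0) as closer to ISLAND (1) than to NEAR BAY (2), which has no basis in the data. The one-hot columns carry no such ordering.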

In [None]:
housing_cat = housing[["ocean_proximity"]] # housing["ocean_proximity"] -> Series, housing[["ocean_proximity"]] -> DataFrame

housing_cat.head(5)

In [None]:
cat_encoder = OneHotEncoder()

In [None]:
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

In [None]:
cat_encoder.categories_

### Feature Scaling & transformation

- Machine learning algorithms don't perform well if the features have very different scales. For example, median_income ranges from 0 to 15 while the total number of rooms ranges from 6 to 39320. Without scaling, the algorithm will be biased towards the 'total_rooms' feature.
- sklearn provides MinMaxScaler & StandardScaler to scale numeric features. MinMaxScaler squeezes feature values into the range 0 to 1, while StandardScaler subtracts the mean & divides by the standard deviation
- An RBF (radial basis function) kernel measures how similar each value is to a fixed point; below, similarity to a housing_median_age of 35 is exp(-gamma * (age - 35)^2), which equals 1 at age 35 and decays as the age moves away from 35
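
The three transformations can be written out by hand and checked against sklearn. The sketch below uses a toy column of ages (`toy_ages` is hypothetical) rather than the housing data.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import MinMaxScaler, StandardScaler

toy_ages = np.array([[5.0], [20.0], [35.0], [50.0]])  # hypothetical housing_median_age values

# Min-max scaling: (x - min) / (max - min)
manual_minmax = (toy_ages - toy_ages.min()) / (toy_ages.max() - toy_ages.min())
# Standardisation: (x - mean) / std, with the population standard deviation (ddof=0)
manual_std = (toy_ages - toy_ages.mean()) / toy_ages.std()
# RBF similarity to age 35: exp(-gamma * (x - 35)^2), with gamma=0.1
manual_rbf = np.exp(-0.1 * (toy_ages - 35.0) ** 2)

sk_minmax = MinMaxScaler().fit_transform(toy_ages)
sk_std = StandardScaler().fit_transform(toy_ages)
sk_rbf = rbf_kernel(toy_ages, [[35.0]], gamma=0.1)

print(np.allclose(manual_minmax, sk_minmax),
      np.allclose(manual_std, sk_std),
      np.allclose(manual_rbf, sk_rbf))
```

A larger gamma makes the similarity fall off faster around 35, and a smaller gamma makes the bump wider.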

In [None]:
minmax_scalar = MinMaxScaler()
housing_minmax = minmax_scalar.fit_transform(housing_num)
housing_minmax

In [None]:
std_scalar = StandardScaler()
housing_std_scalar = std_scalar.fit_transform(housing_num)
housing_std_scalar

In [None]:
age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)
age_simil_35

### Transformation Pipeline

- sklearn provides pipelines to chain several transformers together
- A pipeline hooks multiple preprocessing steps into one: the output of each step is fed as the input to the next
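
As a quick check of this chaining, a pipeline should give the same result as running its steps one after another by hand. The sketch below verifies that on a tiny made-up array (`toy_X` is hypothetical).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])

# The pipeline: impute first, then scale the imputed output
toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_piped = toy_pipeline.fit_transform(toy_X)

# The same two steps applied manually, one after the other
toy_imputed = SimpleImputer(strategy="median").fit_transform(toy_X)
toy_manual = StandardScaler().fit_transform(toy_imputed)

print(np.allclose(toy_piped, toy_manual))
# make_pipeline names each step after its class, in lowercase
print([name for name, _ in toy_pipeline.steps])
```

The practical benefit is that the fitted pipeline remembers the medians, means and standard deviations of the training data, so a single `transform()` call applies all steps consistently to new data.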

In [None]:
# let's create a pipeline which handles numeric features - imputation & scaling
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

In [None]:
housing_num_prepared = num_pipeline.fit_transform(housing_num)

In [None]:
# the output is a NumPy ndarray
housing_num_prepared[:2].round(2)

In [None]:
# names of the columns
num_attributes = list(num_pipeline.get_feature_names_out())
num_attributes

In [None]:
df_housing_num_prepared = pd.DataFrame(housing_num_prepared, columns=num_pipeline.get_feature_names_out(), index=housing_num.index)

In [None]:
df_housing_num_prepared.head(5)

In [None]:
housing_cat.value_counts()

In [None]:
cat_attributes = list(housing_cat.columns)
cat_attributes

In [None]:
# similarly create a pipeline which handles categorical values - imputation & onehotencoding
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),OneHotEncoder(handle_unknown="ignore"))

In [None]:
# let's combine the numeric & categorical pipelines into one transformer

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attributes),
    ("cat", cat_pipeline, cat_attributes)
])

# instead of the above approach, make_column_transformer can select the columns by dtype;
# it takes one (transformer, columns) tuple per group of columns
# preprocessing = make_column_transformer(
#     (num_pipeline, make_column_selector(dtype_include=np.number)),
#     (cat_pipeline, make_column_selector(dtype_include=object)))

In [None]:

housing_prepared = preprocessing.fit_transform(housing)
housing_prepared[:5].round(2)

In [None]:
housing_prepared.shape

In [None]:
preprocessing.get_feature_names_out()

#### You can combine all preprocessing steps into one column transformer

In [None]:
preprocessing_df = pd.DataFrame(housing_prepared, columns=preprocessing.get_feature_names_out(), index=housing.index) # keep the original row index

In [None]:
preprocessing_df.hist(figsize=(15,8), bins=40);

## Select and Train Model