<font size="+3"><strong>Machine Learning: Data Pre-Processing and Production</strong></font>

In [None]:
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

# What's scikit-learn?

[scikit-learn](https://scikit-learn.org/) is a Python library that implements many common machine learning algorithms behind a consistent interface, which makes it easy to experiment with different models.  In this section, we'll look at **linear regression** (which you'll use to predict price based on the area of a property) and **K-nearest neighbors**, which you'll use to classify the neighborhood a property is in.

# Data Preprocessing

## Standardization

**Standardization** is a widely used scaling technique for transforming features before fitting a model. Feature scaling puts all of a dataset's continuous features on a more consistent range of values. Specifically, we subtract the mean from each data point and then divide by the standard deviation:

$$ \hat{X} = \frac{X-\mu}{\sigma} $$

where $\mu$ is the feature's mean and $\sigma$ is its standard deviation. The goal of standardization is to improve model performance by putting all continuous features on the same scale. It's useful in at least two circumstances:

1. For machine learning algorithms that use Euclidean distance (k-means and k-nearest neighbors), different scales can distort the calculation of distance and hurt model performance.
1. For dimensionality reduction (principal component analysis), it can improve the model's ability to find the combinations of features that have the most variance.
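
To make the formula concrete, here is a minimal sketch that applies it by hand and compares the result with `StandardScaler`. The surface areas are made-up values for illustration, not taken from the course data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up surface areas in square meters (illustrative values only)
areas = np.array([[50.0], [80.0], [120.0], [200.0]])

# Apply the formula by hand: subtract the mean, divide by the standard deviation
mu = areas.mean(axis=0)
sigma = areas.std(axis=0)  # NumPy's default is the population standard deviation (ddof=0)
manual = (areas - mu) / sigma

# StandardScaler performs the same computation
scaled = StandardScaler().fit_transform(areas)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, the column has a mean of (approximately) zero and a standard deviation of one, whatever units it started in.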

Let's look at an example where we standardize one of the columns in the following DataFrame:

In [None]:
import pandas as pd

# Read CSV into DataFrame
df = pd.read_csv("./data/mexico-city-test-features.csv").dropna()

df.head()

The feature we'll standardize is the `"surface_covered_in_m2"` column. Let's first check the maximum and minimum of this column before standardization:

In [None]:
print("Maximum before standardization is:", df["surface_covered_in_m2"].max())
print("Minimum before standardization is:", df["surface_covered_in_m2"].min())

We can perform the transformation by first instantiating the scaler and assigning the feature to a variable name. Then we fit the scaler and transform the data:

In [None]:
from sklearn.preprocessing import StandardScaler

# Name the scaler and targeted features
scaler = StandardScaler()
X_train = df[["surface_covered_in_m2"]]

In [None]:
# Fit the scaler to feature
scaler.fit(X_train)

In [None]:
# Use the fitted scaler to transform the feature
X_transformed = scaler.transform(X_train)
X_transformed

Now you can see the transformed data range is much smaller after standardization:

In [None]:
print("Maximum after standardization is:", X_transformed.max())
print("Minimum after standardization is:", X_transformed.min())

We can also combine the fit and transform process together into one step:

In [None]:
X_transformed = scaler.fit_transform(X_train)
X_transformed
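
As a quick check on the one-step shortcut, the sketch below uses made-up numbers to confirm that `fit_transform` matches `fit` followed by `transform`. It also shows that a fitted scaler stores the mean and scale it learned, so the same statistics can be reused on data it has not seen. The variable names here are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[30.0], [60.0], [90.0]])  # made-up training values
X_new = np.array([[60.0], [120.0]])  # made-up unseen values

# Two-step version: fit, then transform
two_step = StandardScaler().fit(X_demo).transform(X_demo)

# One-step version
one_step = StandardScaler().fit_transform(X_demo)
print(np.allclose(two_step, one_step))

# A fitted scaler keeps the training mean and scale and applies them to new data
scaler_demo = StandardScaler().fit(X_demo)
print(scaler_demo.mean_, scaler_demo.scale_)
print(scaler_demo.transform(X_new))
```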

<font size="+1">Practice</font>  

Standardize the price column in `"mexico-city-real-estate-1.csv"`:

In [None]:
df1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
df1.head()

In [None]:
scaler = ...
X_train = ...
X_transformed = ...
X_transformed

## One-Hot Encoding

A property's district is **categorical data**, or data which can be divided into groups.  For many machine learning algorithms, it's common to create a column in a DataFrame to indicate if the feature is present or absent, instead of using the category's name. First you create a column for each district name. Then, for each observation, you put a 1 or a 0 in each column to indicate whether the property is located in that district. Let's take a look at the `mexico-city-test-features.csv` dataset for properties which include the district.

In [None]:
import pandas as pd

# Read CSV into DataFrame
df = pd.read_csv("./data/mexico-city-test-features.csv").dropna()

df.head()

You can do one-hot encoding using the pandas [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function, but we'll use the [Category Encoders](https://contrib.scikit-learn.org/category_encoders/) library since it allows us to integrate the one-hot encoder as a transformer in a scikit-learn Pipeline.

In [None]:
from category_encoders import OneHotEncoder

# Instantiate transformer
ohe = OneHotEncoder(use_cat_names=True)

# Fit transformer to data
ohe.fit(df)

# Transform data
df_ohe = ohe.transform(df)

df_ohe.head()
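
For comparison, here is a minimal sketch of the pandas `get_dummies` route mentioned above. The DataFrame is a hypothetical stand-in, and the district names are placeholders rather than values from the dataset.

```python
import pandas as pd

# Hypothetical mini-dataset; the district names are placeholders
toy = pd.DataFrame(
    {
        "district": ["Norte", "Sur", "Norte", "Centro"],
        "surface_covered_in_m2": [60, 85, 72, 40],
    }
)

# One new 0/1 column per category; the original text column is dropped
toy_ohe = pd.get_dummies(toy, columns=["district"], dtype=int)
print(toy_ohe)
```

Each row has exactly one 1 across the new district columns, which is where the name "one-hot" comes from.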

<font size="+1">Practice</font>  

Create a DataFrame which one-hot encodes the `property_type` column in `mexico-city-real-estate-1.csv`.  The DataFrame you create should have extra columns for apartments, houses, and stores.

In [None]:
mexico_city1 = pd.read_csv(
    "./data/mexico-city-real-estate-1.csv", usecols=["property_type"]
)
ohe = ...
mexico_city1_ohe = ...
mexico_city1_ohe.head()

## Ordinal Encoding

For many machine learning algorithms, it's common to use one-hot encoding. This works well if there are only a few categories, but as the number of categories grows, the number of additional columns also grows.

Having a large number of columns (and consequently a large number of features in your model) can lead to a number of issues often referred to as the **curse of dimensionality**. Two primary issues that can arise are computational complexity (operations performed on larger datasets may take longer) and overfitting (the model may not generalize to new data). In these scenarios, ordinal encoding is a popular choice for encoding the categorical variable. Instead of creating new columns, ordinal encoding simply replaces the categories in a categorical variable with integers.

One potential risk of ordinal encoding is that some machine learning algorithms assume the integer values imply an ordering among the categories. This is important in logistic regression, where a relationship is defined between increases or decreases in the features and the target. Tree-based techniques such as decision trees work well with ordinal encoding because they generate splits: rather than assuming any ordering between the numeric values, the splits occur between the numeric values and effectively separate them. You can use the `OrdinalEncoder` transformer to perform ordinal encoding:

In [None]:
from category_encoders import OrdinalEncoder

# Instantiate transformer
oe = OrdinalEncoder()

# Fit transformer to data
oe.fit(df)

# Transform data
X_train_oe = oe.transform(df)

X_train_oe.head()
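
To see what ordinal encoding does to a single column, here is a sketch in plain pandas. It mimics the idea only: the integers are assigned in order of first appearance, which is an assumption of this example and may not match what `OrdinalEncoder` assigns. The property types are hypothetical.

```python
import pandas as pd

# Hypothetical property types, for illustration only
types = pd.Series(["apartment", "house", "store", "house", "apartment"])

# Assign integers in order of first appearance, starting at 1
mapping = {cat: i for i, cat in enumerate(types.unique(), start=1)}
encoded = types.map(mapping)

print(mapping)
print(encoded.tolist())
```

Notice that the encoding makes `store` (3) look "larger" than `apartment` (1), even though the categories have no natural order. That is the risk described above for models that interpret feature magnitudes.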

<font size="+1">Practice</font>  

Create a DataFrame which ordinal encodes the `property_type` column in `mexico-city-real-estate-1.csv`.  The DataFrame you create should have integers replacing the values for apartments, houses, and stores.

In [None]:
mexico_city1 = pd.read_csv(
    "./data/mexico-city-real-estate-1.csv", usecols=["property_type"]
)

oe = ...
mexico_city1_oe = ...
mexico_city1_oe.head()

## Imputation

Let's take a look at `mexico-city-real-estate-1.csv` and impute some of the missing values. First, we'll load the dataset, limiting ourselves to the `"surface_covered_in_m2"` and `"price_aprox_usd"` columns.

In [None]:
columns = ["surface_covered_in_m2", "price_aprox_usd"]
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)
mexico_city1.info()

When you need to build a model using features that contain missing values, one helpful tool is the scikit-learn transformer [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html). In order to use it, we need to start by instantiating the transformer. 

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()

Next, we train the imputer using the data. At this step it will calculate the mean value for each column.

In [None]:
imputer.fit(mexico_city1)

Last, we transform the data using the imputer.

In [None]:
mexico_city1_imputed = imputer.transform(mexico_city1)

Since the imputer doesn't return a DataFrame, let's transform it into one. 

In [None]:
mexico_city1_imputed = pd.DataFrame(mexico_city1_imputed, columns=columns)
mexico_city1_imputed.info()

Now there are no missing values!
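
The same steps on a tiny, made-up DataFrame make it easier to see what gets filled in. After fitting, the imputer's `statistics_` attribute holds the per-column means it will use as replacements.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values with one gap in each column
toy = pd.DataFrame(
    {
        "surface_covered_in_m2": [50.0, np.nan, 70.0],
        "price_aprox_usd": [100_000.0, 150_000.0, np.nan],
    }
)

imputer_demo = SimpleImputer()  # the default strategy is "mean"
filled = pd.DataFrame(imputer_demo.fit_transform(toy), columns=toy.columns)

print(imputer_demo.statistics_)  # per-column means learned during fit
print(filled)
```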

<font size="+1">Practice</font> 

Read `mexico-city-real-estate-1.csv` into a DataFrame and impute the missing values for `"surface_covered_in_m2"` and `"price_aprox_usd"`.

In [None]:
# Import data
columns = ["surface_covered_in_m2", "price_aprox_usd"]
mexico_city2 = ...

# Instantiate transformer
imputer = ...

# Fit transformer to data


# Transform data
mexico_city2_imputed = ...

# Create DataFrame
mexico_city2_imputed = pd.DataFrame(mexico_city2_imputed, columns=columns)

mexico_city2_imputed.info()

## Data Leakage

Let's consider the `mexico-city-real-estate-1.csv` dataset and fit a regression model using `surface_covered_in_m2` and `price_aprox_local_currency` to estimate `price_aprox_usd`.

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Import data
columns = [
    "price",
    "price_aprox_local_currency",
    "price_aprox_usd",
    "surface_total_in_m2",
    "surface_covered_in_m2",
    "price_per_m2",
]

mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)

# Drop rows with missing values
mexico_city1.dropna(inplace=True)

lr = LinearRegression()
lr.fit(
    mexico_city1[["surface_covered_in_m2", "price_aprox_local_currency"]],
    mexico_city1["price_aprox_usd"],
)

Now let's calculate the mean absolute error in our training data.

In [None]:
price_pred = lr.predict(
    mexico_city1[["surface_covered_in_m2", "price_aprox_local_currency"]]
)
mean_absolute_error(mexico_city1["price_aprox_usd"], price_pred)

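When you see a mean absolute error that's so close to zero (especially when the mean apartment price is so much larger), chances are there is leakage in your model! Here, `price_aprox_local_currency` is essentially the target itself expressed in another currency, so the model is being handed the answer as a feature.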
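
The effect is easy to reproduce on synthetic data. In the sketch below, every number is invented, including the exchange rate of 20. We fit one model on area alone and another on area plus a converted copy of the target, then compare the training errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic data: price depends loosely on area
area = rng.uniform(40, 200, size=200)
price_usd = 1_000 * area + rng.normal(0, 20_000, size=200)

# Hypothetical exchange rate, used only to build a leaky feature
price_local = price_usd * 20.0

# Model 1: area only
X_area = area.reshape(-1, 1)
pred_area = LinearRegression().fit(X_area, price_usd).predict(X_area)
mae_area = mean_absolute_error(price_usd, pred_area)

# Model 2: area plus the converted price (the target in disguise)
X_leaky = np.column_stack([area, price_local])
pred_leaky = LinearRegression().fit(X_leaky, price_usd).predict(X_leaky)
mae_leaky = mean_absolute_error(price_usd, pred_leaky)

print(mae_area, mae_leaky)
```

The second error collapses to practically zero. This is not because the model is good, but because one of its features would not be available when predicting the price of a property that hasn't been priced yet.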

# Imbalanced Data

When dealing with classification problems, we would ideally like the training data to be evenly spread across the classes. When the number of observations differs substantially between classes, we have imbalanced data. The class that represents the majority of observations is called the **majority class**, while the class with few observations is called the **minority class**. Imbalanced data limits the training data available for certain classes. In addition, when one class accounts for most of the data, a model can achieve high accuracy simply by always predicting the majority class. Thus, prior to training a model, it is essential either to balance the data, by under-sampling the majority class or over-sampling the minority class, or to evaluate the model with metrics other than accuracy, such as **recall** or **precision**.
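
Here is a small sketch of why accuracy alone can mislead. The labels are synthetic (95 negatives and 5 positives), and the "model" is a stand-in that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic labels: 95 negatives (majority) and 5 positives (minority)
y_true = np.array([0] * 95 + [1] * 5)

# A stand-in "model" that always predicts the majority class
y_always_majority = np.zeros_like(y_true)

print("accuracy:", accuracy_score(y_true, y_always_majority))
print("recall:", recall_score(y_true, y_always_majority))
# zero_division=0 avoids a warning when no positive predictions are made
print("precision:", precision_score(y_true, y_always_majority, zero_division=0))
```

The accuracy is 0.95 even though the model never identifies a single positive case, which is exactly what a recall of 0 reveals.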

## Under-sampling

When the classes are imbalanced, one way to balance them is to reduce the number of observations in the majority class. This is called **under-sampling**. We can under-sample by randomly deleting some observations in the majority class. [imbalanced-learn](https://imbalanced-learn.org/stable/) (imported as `imblearn`) is an open-source library that works with `scikit-learn` and provides tools for dealing with imbalanced classes. Here's an example of randomly deleting observations from the majority class using Poland bankruptcy data from 2008.

In [None]:
import gzip
import json

with gzip.open("data/poland-bankruptcy-data-2008.json.gz", "r") as f:
    poland_data_gz = json.load(f)

df = pd.DataFrame.from_dict(poland_data_gz["data"])

df["bankrupt"].value_counts()

The data is clearly imbalanced as there are many more observations in non-bankruptcy compared to bankruptcy.

In [None]:
from imblearn.under_sampling import RandomUnderSampler

X, y = RandomUnderSampler().fit_resample(df[["company_id"]], df[["bankrupt"]])
y["bankrupt"].value_counts()

Now we have reduced the non-bankruptcy class to the same size as the bankruptcy class.
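
To see what random under-sampling amounts to, here is a sketch of the same idea in plain pandas on a synthetic table. It illustrates the concept and is not a description of how `RandomUnderSampler` is implemented.

```python
import pandas as pd

# Synthetic stand-in for the bankruptcy data: 8 solvent firms, 2 bankrupt
toy = pd.DataFrame({"company_id": range(10), "bankrupt": [False] * 8 + [True] * 2})

# Size of the smallest class
n_minority = toy["bankrupt"].value_counts().min()

# Keep a random sample of n_minority rows from every class
under = toy.groupby("bankrupt").sample(n=n_minority, random_state=42)
print(under["bankrupt"].value_counts())
```

The price of balance here is information: six of the eight majority-class rows are thrown away.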

## Over-sampling

**Over-sampling** is the opposite of under-sampling. Instead of reducing the majority class, over-sampling increases the number of observations in the minority class by randomly making copies of the existing observations. Here is an example of making random copies from the minority class using the Poland bankruptcy data and `imblearn`.

In [None]:
from imblearn.over_sampling import RandomOverSampler

X, y = RandomOverSampler().fit_resample(df[["company_id"]], df[["bankrupt"]])
y["bankrupt"].value_counts()

Now we have increased the bankruptcy class to the size of the non-bankruptcy class.
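
The same idea can be sketched in plain pandas on synthetic data: draw extra rows from the minority class with replacement until both classes are the same size. Again, this shows the concept rather than the library's internals.

```python
import pandas as pd

# Synthetic stand-in for the bankruptcy data: 8 solvent firms, 2 bankrupt
toy = pd.DataFrame({"company_id": range(10), "bankrupt": [False] * 8 + [True] * 2})

majority = toy[~toy["bankrupt"]]
minority = toy[toy["bankrupt"]]

# Draw the missing number of rows from the minority class, with replacement
n_extra = len(majority) - len(minority)
extra = minority.sample(n=n_extra, replace=True, random_state=42)

over = pd.concat([majority, minority, extra], ignore_index=True)
print(over["bankrupt"].value_counts())
print(over.loc[over["bankrupt"], "company_id"].nunique())
```

The minority class now has eight rows, but still only two distinct companies. Over-sampling adds copies, not new information.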

### Practice

Now that you've seen an example of imbalanced data and how to under- or over-sample it prior to model training, let's get some practice with the Poland bankruptcy data from 2007.

In [None]:
with gzip.open("data/poland-bankruptcy-data-2007.json.gz", "r") as f:
    poland_data_gz_2007 = json.load(f)

df_2007 = pd.DataFrame.from_dict(poland_data_gz_2007["data"])

First, check whether this data is imbalanced.

Next, do under-sampling.

In [None]:
X, y = ...

Finally, check whether the data is balanced.

Great work! Now try over-sampling.

In [None]:
X, y = ...

And check whether the data is balanced.

# scikit-learn in Production

The previous examples have built models and made predictions one step at a time.  Many machine learning applications will require you to run the same steps many times, usually with new or updated data.  scikit-learn allows you to define a set of steps to process data for machine learning in a reproducible manner using a pipeline. 

## Creating a Pipeline in scikit-learn

First, we create a pipeline to do linear regression on the transformed data set.

In [None]:
import pandas as pd
from sklearn import linear_model
from sklearn.pipeline import Pipeline

# construct pipeline
lin_reg = linear_model.LinearRegression()

pipe = Pipeline([("regressor", lin_reg)])

We can check the steps in the pipeline, but right now, there's only one.

In [None]:
pipe.named_steps
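
Pipelines become more useful when they chain preprocessing steps in front of the model. Below is a sketch with three steps, using the transformers introduced earlier in this notebook. The data and the step names are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic feature with one missing value, and a made-up target
X_demo = np.array([[50.0], [np.nan], [90.0], [120.0]])
y_demo = np.array([100.0, 140.0, 180.0, 240.0])

pipe_demo = Pipeline(
    [
        ("imputer", SimpleImputer()),  # fill missing values with the column mean
        ("scaler", StandardScaler()),  # standardize the feature
        ("regressor", LinearRegression()),  # fit the model on the processed data
    ]
)

pipe_demo.fit(X_demo, y_demo)
print(list(pipe_demo.named_steps))
print(pipe_demo.predict(np.array([[100.0]])))
```

Calling `fit` runs each transformer in order before training the regressor, and `predict` applies the same fitted transformations to new data.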

Then we fit a linear regression model to our data.

In [None]:
# Load and clean the data, then fit the model and predict prices
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1 = mexico_city1.drop(
    [
        "floor",
        "price_usd_per_m2",
        "expenses",
        "rooms",
        "price_per_m2",
        "price",
        "surface_total_in_m2",
    ],
    axis=1,
)
mexico_city1 = mexico_city1.dropna(axis=0)
mexico_city1["surface_covered_in_m2"] = mexico_city1["surface_covered_in_m2"].astype(
    float
)

y = mexico_city1["price_aprox_usd"]
X = mexico_city1.surface_covered_in_m2.values.reshape(-1, 1)
pipe.fit(X, y)
y_pred = pd.DataFrame(pipe.predict(X))

In [None]:
print(y_pred.head())

<font size="+1">Practice</font> 

Try this on the `price_aprox_usd` column in the `mexico-city-real-estate-1.csv` dataset.

In [None]:
y = ...
X = ...
pipe.fit(..., ...)
y_pred = ...
print(y_pred.head())

Let's use the `make_pipeline` function to create a pipeline to fit a linear regression model for the `mexico-city-real-estate-1.csv` dataset.

In [None]:
from sklearn.pipeline import make_pipeline

y = mexico_city1["price_aprox_usd"]
X = mexico_city1.surface_covered_in_m2.values.reshape(-1, 1)
model_lr = make_pipeline(linear_model.LinearRegression())
model_lr.fit(X, y)
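
One practical difference from `Pipeline` is that `make_pipeline` names the steps for you, using the lowercased class name of each object. A short sketch follows; the choice of a scaler plus a regressor is only an example.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model_demo = make_pipeline(StandardScaler(), LinearRegression())

# Step names are generated automatically from the class names
step_names = [name for name, _ in model_demo.steps]
print(step_names)
```

These generated names are the keys you would use with `named_steps` to reach an individual step.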

Let's try to predict `price_aprox_usd` in the `mexico-city-test-features.csv` dataset.

In [None]:
mexico_city_features = pd.read_csv("./data/mexico-city-test-features.csv")
mexico_city_labels = pd.read_csv("./data/mexico-city-test-labels.csv")
X = mexico_city_features.surface_covered_in_m2.values.reshape(-1, 1)
model_lr.predict(X)

## Accessing an Object in a Pipeline

Let's figure out the regression coefficients.

In [None]:
pipe.named_steps["regressor"].coef_

<font size="+1">Practice</font>

Now obtain the intercept.

In [None]:

# INCLUDE pipe.named_steps[...].intercept_

*References & Further Reading*
- [One-Hot Encoding with the Category Encoder Package](https://contrib.scikit-learn.org/category_encoders/onehot.html)
- [Example of Using One-Hot Encoding](https://scikit-learn.org/stable/auto_examples/linear_model/plot_tweedie_regression_insurance_claims.html#sphx-glr-auto-examples-linear-model-plot-tweedie-regression-insurance-claims-py)
- [Online Example of Using One-Hot Encoding](https://stackabuse.com/one-hot-encoding-in-python-with-pandas-and-scikit-learn/)
- [Official pandas Documentation on Get Dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
- [Online Tutorial on Pipelines for Linear Regression](https://mahmoudyusof.github.io/general/scikit-learn-pipelines/)
- [scikit-learn Pipeline Documentation](https://scikit-learn.org/stable/modules/compose.html#combining-estimators)
- [Wikipedia Article on the Curse of Dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality#Machine_Learning)
- [Wikipedia Article on Leakage in Machine Learning](https://en.wikipedia.org/wiki/Leakage_(machine_learning))
- [Official pandas Documentation on Missing Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
- [Wikipedia Article on Imputation](https://en.wikipedia.org/wiki/Imputation_(statistics))
- [Online Tutorial on Removing Rows with Missing Data](https://datatofish.com/rows-with-nan-pandas-dataframe/)
- [scikit-learn Documentation on `SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
- [imbalanced-learn Documentation](https://imbalanced-learn.org/stable/)

---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
