<h1 style="text-align:center;">Model Deployment</h1>


#### Overview
In this notebook focused on XGBoost, we integrate several concepts to develop a machine learning model suitable for industrial applications. Unlike academic or competition settings, industrial models prioritize automation due to the frequent influx of new data. The process involves more structured procedures, placing less emphasis on minor performance enhancements through model tweaking.

#### Key Learning Areas
- **One-Hot Encoding and Sparse Matrices**: Essential techniques for handling categorical data.
- **Customizing Scikit-learn Transformers**: Enhancing automated workflows in machine learning pipelines.
- **Finalizing an XGBoost Model**: Preparing the model for real-world applications.
- **Building a Machine Learning Pipeline**: Constructing an end-to-end process for handling incoming data, accommodating both categorical and numerical columns.

#### Encoding Mixed Data: Case Study
Imagine working for an EdTech company with the task of predicting student grades to tailor tech skill-building services. The initial step involves loading and processing mixed data (numerical and categorical) related to student grades using pandas.

---

**Explanation of Key Terms:**
- **One-Hot Encoding**: A process to convert categorical data into a format that can be fed to machine learning algorithms. It creates new columns indicating the presence of each possible value from the original data.
- **Sparse Matrices**: Efficient storage format for matrices with a lot of zeros. Useful in handling large, one-hot encoded data.
- **Scikit-learn Transformers**: Tools in scikit-learn for transforming data before feeding it into a model. Customizing these allows for more tailored data preprocessing.
- **XGBoost**: A powerful machine learning algorithm known for its speed and performance, particularly with structured data.
- **Machine Learning Pipeline**: An automated process that includes steps like data preprocessing, model training, and making predictions. It ensures consistency and efficiency in model deployment.

In [68]:
import pandas as pd

import os
from pathlib import Path

import numpy as np

from xgboost import XGBRegressor

from sklearn.model_selection import cross_val_score, KFold, GridSearchCV,  train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

from helper_file import *
from scipy.sparse import csr_matrix, hstack
from sklearn.base import TransformerMixin

import warnings
warnings.filterwarnings('ignore')


In [6]:
data_dir = Path('data')

# Create a data folder if it doesn't exist
if not data_dir.exists():
    data_dir.mkdir()

# File path for saving, using the Path object
file_path = data_dir / 'student-por.txt'

The file is separated by semicolons rather than commas. The `sep` parameter of `pd.read_csv` sets the separator, so passing `sep=';'` reads the file correctly:

In [8]:
df = pd.read_csv(file_path, sep=';')

df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,,18.0,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15.0,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15.0,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16.0,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13


### Clearing null values

In [11]:
total_nulls(df)

3

In [12]:
show_nulls(df)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,,18.0,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11


In [15]:
# to see all the columns
pd.options.display.max_columns = None

In [14]:
show_nulls(df)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,,18.0,U,GT3,A,4,4,at_home,teacher,course,,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,4,0,11,11
1,GP,F,,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,2,9,11,11


From the output above, we can see null values in three columns: `sex`, `age`, and `guardian`.

Before filling them, let us walk through the approach for the numerical column:

1. **Numerical Null Values**: These are missing or undefined entries in a numerical column. pandas typically represents them as `NaN` (Not a Number).

2. **Setting Null Values to -999.0**: Before feeding data into a machine learning model, it's often necessary to handle these missing values. One approach is to replace them with a distinct numerical value that does not naturally occur in the dataset. In this case, `-999.0` is suggested as a replacement. The idea is to use a value that clearly stands out from valid data, ensuring that the model recognizes it as different from other, meaningful values.

3. **XGBoost and the `missing` Hyperparameter**: XGBoost is a popular gradient boosting framework for machine learning. It has a feature to handle missing values efficiently. The `missing` hyperparameter in XGBoost allows you to specify the placeholder value you've used for missing data (in this case, `-999.0`). When XGBoost encounters this value during training, it treats it as a missing value.

4. **How XGBoost Handles These Values**: XGBoost, during its training process, tries to find the best way to handle these specified missing values. It determines how to split nodes in its trees considering the missing values, effectively learning whether to group them with certain values or treat them separately. This is part of the model's training process to make the best possible decision at each stage of the boosting process.

In summary, this technique of setting numerical null values to a distinct value like `-999.0` and informing XGBoost through the `missing` hyperparameter allows the model to treat missing values appropriately during training. This can be particularly useful when you cannot impute missing values with more conventional methods or when the presence of missing data itself might be informative for the model.

In [21]:
def tweak_data(df):
    return (df
            .assign (age = lambda x:  x['age'].fillna(-999.0))
           )

df_stud = tweak_data(df)
show_nulls(df_stud)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,,18.0,U,GT3,A,4,4,at_home,teacher,course,,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,4,0,11,11


That takes care of the null value in `age`; two more remain, both in categorical columns.

For categorical columns, we can fill missing values using the mode, which is the most frequently occurring value in a column. While using the mode can sometimes alter the column's original distribution, this is typically only a concern when there are a significant number of null values. In our case, with only two missing values, the impact on the distribution is negligible. An alternative approach is to substitute categorical null values with a label such as 'unknown'. This label can be transformed into a distinct column during one-hot encoding.

The code below demonstrates how to replace missing values in the 'sex' and 'guardian' columns with their respective modes:


In [22]:
def tweak_data(df):
    return (df
            .assign(age=lambda x: x['age'].fillna(-999.0),
                    # mode() returns a Series, so take its first entry as a scalar
                    sex=lambda x: x['sex'].fillna(x['sex'].mode()[0]),
                    guardian=lambda x: x['guardian'].fillna(x['guardian'].mode()[0]))
           )

df_stud = tweak_data(df)
show_nulls(df_stud)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3


The empty output above confirms that no null values remain.
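
The two options for categorical nulls discussed earlier (the mode versus an `'unknown'` label) can be compared on a toy column. This is a small sketch with made-up values. It also shows a pitfall worth knowing: `Series.mode()` returns a Series, and passing that Series straight to `fillna` aligns on the index, so only rows whose index label appears in it get filled. In the student data the missing `sex` and `guardian` entries both sit in row 0, which is why that form can appear to work there.

```python
import numpy as np
import pandas as pd

# Toy column, assumed for illustration only.
toy = pd.DataFrame({"guardian": ["mother", np.nan, "father", "mother", np.nan]})

# Option 1: fill with the most frequent value (scalar taken from mode()).
most_common = toy["guardian"].mode()[0]
filled_with_mode = toy["guardian"].fillna(most_common)

# Option 2: keep missingness visible as its own category.
filled_with_label = toy["guardian"].fillna("unknown")

# Pitfall: fillna with the Series from mode() aligns on index labels,
# so the nulls at rows 1 and 4 are left untouched.
partially_filled = toy["guardian"].fillna(toy["guardian"].mode())

print(filled_with_mode.tolist())
print(filled_with_label.tolist())
print(partially_filled.isna().sum())
```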

### One Hot Encoding

#### Limitations of `pd.get_dummies`
In previous work, we used `pd.get_dummies` from pandas to convert categorical variables into numerical form, where `0` represents absence and `1` represents presence. This method, while functional, has two key drawbacks:
1. **Computational Load**: You might have noticed that `pd.get_dummies` can be quite slow, especially with large datasets.
2. **Integration with Scikit-learn Pipelines**: The method doesn’t integrate smoothly with scikit-learn’s pipeline framework, which we will discuss shortly.
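
The pipeline limitation shows up as soon as new data contains a different set of categories. The sketch below uses two tiny made-up frames: `pd.get_dummies` encodes each one independently, so the columns no longer line up and have to be aligned by hand.

```python
import pandas as pd

# Made-up training data and a later batch of incoming data.
train = pd.DataFrame({"Mjob": ["teacher", "health", "other"]})
new_data = pd.DataFrame({"Mjob": ["teacher", "services"]})

train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new_data)

# The two encodings disagree on which columns exist.
print(train_dummies.columns.tolist())
print(new_dummies.columns.tolist())

# Manual repair: force the new batch onto the training columns.
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned.shape)
```

A transformer that remembers the categories it saw during fitting avoids this manual step.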

#### The Advantage of Scikit-learn’s `OneHotEncoder`
As an alternative, scikit-learn’s `OneHotEncoder` offers a more efficient solution:
- **Efficient Representation**: It also converts categorical values into a 0/1 format but does so using a sparse matrix instead of a dense matrix. This results in significant space and time savings.
- **Sparse Matrix Efficiency**: Sparse matrices are efficient because they store only non-zero elements, preserving the same information with less memory.
- **Compatibility with Pipelines**: `OneHotEncoder` is a transformer in scikit-learn, designed to seamlessly fit into machine learning pipelines.

#### Evolution of `OneHotEncoder`
- **Historical Context**: In earlier versions of scikit-learn, `OneHotEncoder` required numerical inputs. To accommodate this, `LabelEncoder` was used as an intermediate step to numerically encode categorical data before one-hot encoding.
- **Current Capability**: Now, `OneHotEncoder` can directly handle categorical inputs, streamlining the preprocessing step in machine learning pipelines.


One-hot encoding isn't a strict requirement for XGBoost; its necessity and efficacy depend on the characteristics of your dataset and the problem you're tackling. It's usually worth testing both approaches (with and without one-hot encoding) on your own data.

Let us select the categorical columns step by step. Stay with me on this please.

First we check the columns with `Dtype` as object using the `.info` method of the pandas DataFrame.

In [27]:
df_stud.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   school      649 non-null    object 
 1   sex         649 non-null    object 
 2   age         649 non-null    float64
 3   address     649 non-null    object 
 4   famsize     649 non-null    object 
 5   Pstatus     649 non-null    object 
 6   Medu        649 non-null    int64  
 7   Fedu        649 non-null    int64  
 8   Mjob        649 non-null    object 
 9   Fjob        649 non-null    object 
 10  reason      649 non-null    object 
 11  guardian    649 non-null    object 
 12  traveltime  649 non-null    int64  
 13  studytime   649 non-null    int64  
 14  failures    649 non-null    int64  
 15  schoolsup   649 non-null    object 
 16  famsup      649 non-null    object 
 17  paid        649 non-null    object 
 18  activities  649 non-null    object 
 19  nursery     649 non-null    o

We now create a boolean mask that selects those columns with `dtypes=object`

In [28]:
df_stud.dtypes==object

school         True
sex            True
age           False
address        True
famsize        True
Pstatus        True
Medu          False
Fedu          False
Mjob           True
Fjob           True
reason         True
guardian       True
traveltime    False
studytime     False
failures      False
schoolsup      True
famsup         True
paid           True
activities     True
nursery        True
higher         True
internet       True
romantic       True
famrel        False
freetime      False
goout         False
Dalc          False
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool

We now use the mask to select the names of those columns, convert them to a list, and save the list to a variable.

In [24]:
df_stud.columns[df_stud.dtypes==object]

Index(['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
       'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
       'nursery', 'higher', 'internet', 'romantic'],
      dtype='object')

In [31]:
cat_cols = df_stud.columns[df_stud.dtypes==object].tolist()
cat_cols

['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']

In [34]:
#  initialize OneHotEncoder
ohe = OneHotEncoder()

In [38]:
# Use the fit_transform method on the columns
hot = ohe.fit_transform(df_stud[cat_cols])

print(type(hot))

<class 'pandas.core.frame.DataFrame'>


In [40]:
hot.sample(n=5, random_state=43)

Unnamed: 0,school_1,school_2,sex_1,sex_2,address_1,address_2,famsize_1,famsize_2,Pstatus_1,Pstatus_2,Mjob_1,Mjob_2,Mjob_3,Mjob_4,Mjob_5,Fjob_1,Fjob_2,Fjob_3,Fjob_4,Fjob_5,reason_1,reason_2,reason_3,reason_4,guardian_1,guardian_2,guardian_3,schoolsup_1,schoolsup_2,famsup_1,famsup_2,paid_1,paid_2,activities_1,activities_2,nursery_1,nursery_2,higher_1,higher_2,internet_1,internet_2,romantic_1,romantic_2
592,0,1,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0
622,0,1,0,1,0,1,1,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,1,0,1,1,0,1,0,1,0,1,0,1,0
361,1,0,0,1,1,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,1,1,0,0,1,1,0,1,0,0,1,0,1
642,0,1,1,0,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,1,1,0,1,0,1,0,1,0,1,0,0,1,1,0
483,0,1,1,0,0,1,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,1,1,0,1,0,1,0,0,1,0,1,1,0


Let's isolate the numerical columns. This may be done with the `exclude=["object"]` parameter as input for `df.select_dtypes`.

In [43]:
cold_df = df_stud.select_dtypes(exclude=["object"])

cold_df.sample(n=5, random_state=43)

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
592,17.0,3,3,2,1,0,4,4,3,1,1,4,0,11,12,13
622,18.0,1,3,2,2,0,3,3,4,2,4,3,0,8,10,9
361,19.0,4,2,2,2,0,5,4,4,1,1,1,9,11,10,10
642,17.0,4,3,2,2,0,5,5,4,1,1,1,0,6,9,11
483,16.0,2,2,3,2,0,3,4,5,1,2,1,1,9,10,11


In [44]:
cold = csr_matrix(cold_df)
hot = csr_matrix(hot)

In [47]:
final_sparse_matrix = hstack((hot, cold))

In [48]:
final_df = pd.DataFrame(final_sparse_matrix.toarray())

final_df.sample(n=5, random_state=43)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58
592,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,17.0,3.0,3.0,2.0,1.0,0.0,4.0,4.0,3.0,1.0,1.0,4.0,0.0,11.0,12.0,13.0
622,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,18.0,1.0,3.0,2.0,2.0,0.0,3.0,3.0,4.0,2.0,4.0,3.0,0.0,8.0,10.0,9.0
361,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,19.0,4.0,2.0,2.0,2.0,0.0,5.0,4.0,4.0,1.0,1.0,1.0,9.0,11.0,10.0,10.0
642,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,17.0,4.0,3.0,2.0,2.0,0.0,5.0,5.0,4.0,1.0,1.0,1.0,0.0,6.0,9.0,11.0
483,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,16.0,2.0,2.0,3.0,2.0,0.0,3.0,4.0,5.0,1.0,2.0,1.0,1.0,9.0,10.0,11.0


### Customizing Scikit-learn Transformers for Streamlined Data Processing

Scikit-learn's transformers are crucial for preparing data for machine learning. They provide methods like `fit`, which computes parameters for a model, and `transform`, which applies these parameters to data. Conveniently, `fit_transform` combines these steps, streamlining the process.

In a typical workflow, multiple transformers, including machine learning models, can be seamlessly integrated into a single pipeline. This setup allows for efficient handling of incoming data, where it is fit and transformed in the pipeline to produce the required format.

Scikit-learn offers a variety of built-in transformers like `StandardScaler` for standardization, `Normalizer` for normalization, and `SimpleImputer` for handling missing values. However, when dealing with datasets that feature a mix of categorical and numerical columns, these standard options might not always fit the bill. In such scenarios, crafting custom transformers is a smart move.
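
For reference, here is a minimal sketch of built-in transformers chained in a pipeline, using a made-up numerical array. It works well for purely numerical input, which is the case the built-in options handle comfortably.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numerical data with one missing entry.
X = np.array([[15.0, 4.0],
              [np.nan, 2.0],
              [17.0, 0.0]])

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scale", StandardScaler()),                   # zero mean, unit variance
])

X_scaled = numeric_pipeline.fit_transform(X)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```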

#### Crafting Custom Transformers

To build custom transformers tailored to your specific needs, inherit from Scikit-learn's `TransformerMixin`. This is the cornerstone for creating transformers that align perfectly with your data processing requirements, ensuring greater control and efficiency in your machine learning pipelines.

Here is a general code outline to create a customized transformer in scikit-learn:

```python
class YourClass(TransformerMixin):

    def __init__(self):

        pass

    def fit(self, X, y=None):

        return self

    def transform(self, X, y=None):

        # insert code to transform X

        return X
```

As you can see, you don't have to initialize anything, and fit can always return self. Simply put, you may place all your code for transforming the data under the transform method.
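
As a concrete instance of the outline, here is a small sketch of a transformer that drops named columns. `ColumnDropper` and its `columns` parameter are hypothetical names chosen for this example. Unlike the bare outline, it stores one setting in `__init__` and also inherits from `BaseEstimator`, which is optional but provides `get_params` and `set_params`. The `fit_transform` method comes for free from `TransformerMixin`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that drops the named columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, X, y=None):
        # drop() returns a new frame, so the input is left unchanged.
        return X.drop(columns=self.columns)

toy = pd.DataFrame({"G1": [0, 9], "G2": [11, 11], "age": [18.0, 15.0]})
result = ColumnDropper(columns=["G1", "G2"]).fit_transform(toy)
print(result.columns.tolist())
```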

Now that you see how customization works generally, let's create a customized transformer to handle different kinds of null values.

#### Creating a Custom Scikit-learn Transformer for Null Value Imputation

To tailor your data processing needs, especially for handling null values in a mixed-type DataFrame, you can create a custom transformer in Scikit-learn. Here's a step-by-step guide:

1. **Start by Importing `TransformerMixin`**:
   - This is essential for creating a custom transformer.
   ```python
   from sklearn.base import TransformerMixin
   ```

2. **Define Your Custom Transformer Class**:
   - Derive your class from `TransformerMixin`.
   ```python
   class NullValueImputer(TransformerMixin):
   ```

3. **Initialize the Class**:
   - The `__init__` method sets up the class. It's standard to have it do nothing for simple transformers.
   ```python
   def __init__(self):
       pass
   ```

4. **Implement the `fit` Method**:
   - This method prepares the transformer. For this use case, it just returns `self`, as no fitting is required for null value imputation.
   ```python
   def fit(self, X, y=None):
       return self
   ```

5. **Craft the `transform` Method**:
   - This is where the data transformation logic resides. 
   - The method iterates through columns, filling null values differently based on the column's data type.
   ```python
   def transform(self, X, y=None):
       for column in X.columns:
           if X[column].dtype == object:  # Handling string (object) columns
               X[column] = X[column].fillna(X[column].mode()[0])  # Using mode for categorical data
           else:
               X[column] = X[column].fillna(-999.0)  # Using -999.0 for numerical data
       return X
   ```

In the `transform` method, we handle null values differently based on the column type. For categorical (string) columns, we use the mode, while for numerical columns, we fill with a placeholder value (-999.0).

The `y=None` parameter in `fit` and `transform` is a convention in Scikit-learn, especially for compatibility with the pipeline mechanism. It allows these methods to handle scenarios where a target variable (`y`) might or might not be present, especially useful when incorporating machine learning models into pipelines.

In [51]:
class NullValueImputer(TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        for column in X.columns.tolist():
            if column in X.columns[X.dtypes==object].tolist():
                # mode() returns a Series; [0] takes the most frequent value as a scalar
                X[column] = X[column].fillna(X[column].mode()[0])
            else:
                X[column]=X[column].fillna(-999.0)
        return X

In [56]:
df = pd.read_csv(file_path, sep=';')
nvi = NullValueImputer().fit_transform(df)
show_nulls(nvi)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
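
The transformer above computes the mode from whatever data it is asked to transform. A possible variation, sketched below under the hypothetical name `FittedNullImputer`, learns the fill values in `fit` and reuses them in `transform`, so incoming data is filled with the modes seen during training. It also returns a copy instead of modifying the input frame. This is an alternative design, not the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FittedNullImputer(TransformerMixin, BaseEstimator):
    """Hypothetical variant: learn fill values in fit, apply them in transform."""

    def __init__(self, numeric_fill=-999.0):
        self.numeric_fill = numeric_fill

    def fit(self, X, y=None):
        self.fill_values_ = {}
        for column in X.columns:
            if pd.api.types.is_numeric_dtype(X[column]):
                self.fill_values_[column] = self.numeric_fill
            else:
                # Most frequent value seen in the training data.
                self.fill_values_[column] = X[column].mode()[0]
        return self

    def transform(self, X, y=None):
        # fillna with a dict fills each column with its own value
        # and returns a new DataFrame.
        return X.fillna(value=self.fill_values_)

train = pd.DataFrame({"sex": ["F", "F", "M", np.nan],
                      "age": [15.0, np.nan, 16.0, 17.0]})
new = pd.DataFrame({"sex": [np.nan, "M"], "age": [np.nan, 18.0]})

imputer = FittedNullImputer().fit(train)
filled = imputer.transform(new)
print(filled)
```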


#### One-Hot Encoding for Mixed Data with Custom Transformer

To handle a dataset with mixed data types, we can create a custom transformer for one-hot encoding the categorical columns and then combine them with numerical columns. Here's how to do it using Scikit-learn's `TransformerMixin`:

1. **Define Your Custom Transformer Class**:
   - Begin by extending `TransformerMixin`.
   ```python
   from sklearn.base import TransformerMixin
   from sklearn.preprocessing import OneHotEncoder
   from scipy.sparse import csr_matrix, hstack

   class MixedDataEncoder(TransformerMixin):
   ```

2. **Initialize the Class**:
   - The `__init__` method is standard, typically doing nothing in this context.
   ```python
   def __init__(self):
       pass
   ```

3. **Implement the `fit` Method**:
   - This method prepares the transformer and returns `self`. No action is needed for this step.
   ```python
   def fit(self, X, y=None):
       return self
   ```

4. **Craft the `transform` Method**:
   - This method will handle the transformation of mixed data types.
   - Steps include identifying categorical columns, applying one-hot encoding, and combining the results with numerical columns.
   ```python
   def transform(self, X, y=None):
       # a) Identify categorical columns
       categorical_columns = X.columns[X.dtypes == object].tolist()

       # b) Initialize OneHotEncoder
       ohe = OneHotEncoder()

       # c) Apply OneHotEncoder to categorical columns
       hot = ohe.fit_transform(X[categorical_columns])

       # d) Extract numerical columns, excluding strings
       cold_df = X.select_dtypes(exclude=["object"])

       # e) Convert the numerical DataFrame to a sparse matrix
       cold = csr_matrix(cold_df)

       # f) Combine the one-hot encoded and numerical data
       final_sparse_matrix = hstack([hot, cold])

       # g) Convert to CSR format for compatibility with certain algorithms like XGBoost
       final_csr_matrix = final_sparse_matrix.tocsr()

       return final_csr_matrix
   ```

With this custom transformer you can process datasets containing both categorical and numerical data and hand the result to models like XGBoost. The output is a CSR matrix, a sparse format that XGBoost and many scikit-learn estimators accept directly. The notebook cell below implements the same logic under the class name `SparseMatrix`.

In [57]:
class SparseMatrix(TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        categorical_columns= X.columns[X.dtypes==object].tolist()
        ohe = OneHotEncoder() 
        hot = ohe.fit_transform(X[categorical_columns])
        cold_df = X.select_dtypes(exclude=["object"])
        cold = csr_matrix(cold_df)
        final_sparse_matrix = hstack((hot, cold))
        final_csr_matrix = final_sparse_matrix.tocsr()
        return final_csr_matrix

In [58]:
sm = SparseMatrix().fit_transform(nvi)
print(sm)

  (0, 0)	1.0
  (0, 2)	1.0
  (0, 4)	1.0
  (0, 6)	1.0
  (0, 8)	1.0
  (0, 10)	1.0
  (0, 15)	1.0
  (0, 20)	1.0
  (0, 24)	1.0
  (0, 27)	1.0
  (0, 29)	1.0
  (0, 31)	1.0
  (0, 33)	1.0
  (0, 35)	1.0
  (0, 37)	1.0
  (0, 39)	1.0
  (0, 41)	1.0
  (0, 43)	18.0
  (0, 44)	4.0
  (0, 45)	4.0
  (0, 46)	2.0
  (0, 47)	2.0
  (0, 49)	4.0
  (0, 50)	3.0
  (0, 51)	4.0
  :	:
  (648, 20)	1.0
  (648, 24)	1.0
  (648, 28)	1.0
  (648, 29)	1.0
  (648, 31)	1.0
  (648, 33)	1.0
  (648, 36)	1.0
  (648, 37)	1.0
  (648, 40)	1.0
  (648, 41)	1.0
  (648, 43)	18.0
  (648, 44)	3.0
  (648, 45)	2.0
  (648, 46)	3.0
  (648, 47)	1.0
  (648, 49)	4.0
  (648, 50)	4.0
  (648, 51)	1.0
  (648, 52)	3.0
  (648, 53)	4.0
  (648, 54)	5.0
  (648, 55)	4.0
  (648, 56)	10.0
  (648, 57)	11.0
  (648, 58)	11.0


In [59]:
sm_df = pd.DataFrame(sm.toarray())
sm_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58
0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,18.0,4.0,4.0,2.0,2.0,0.0,4.0,3.0,4.0,1.0,1.0,3.0,4.0,0.0,11.0,11.0
1,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,-999.0,1.0,1.0,1.0,2.0,0.0,5.0,3.0,3.0,1.0,1.0,3.0,2.0,9.0,11.0,11.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,15.0,1.0,1.0,1.0,2.0,0.0,4.0,3.0,2.0,2.0,3.0,3.0,6.0,12.0,13.0,12.0
3,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,15.0,4.0,2.0,1.0,3.0,0.0,3.0,2.0,2.0,1.0,1.0,5.0,0.0,14.0,14.0,14.0
4,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,16.0,3.0,3.0,1.0,2.0,0.0,4.0,3.0,2.0,1.0,2.0,5.0,0.0,11.0,13.0,13.0
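
Because the transformer above fits its `OneHotEncoder` inside `transform`, the number of output columns depends on the categories present in each batch. A possible variation, sketched below as the hypothetical `FittedMixedEncoder`, fits the encoder once in `fit` and uses `handle_unknown="ignore"` so that unseen categories become all-zero rows instead of new columns. The toy frames are made up, and this is an alternative design rather than the one used in the remaining cells.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class FittedMixedEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical variant that learns the categories in fit."""

    def fit(self, X, y=None):
        self.cat_cols_ = [c for c in X.columns
                          if not pd.api.types.is_numeric_dtype(X[c])]
        self.num_cols_ = [c for c in X.columns if c not in self.cat_cols_]
        # Unseen categories at transform time are encoded as all zeros.
        self.encoder_ = OneHotEncoder(handle_unknown="ignore")
        self.encoder_.fit(X[self.cat_cols_])
        return self

    def transform(self, X, y=None):
        hot = self.encoder_.transform(X[self.cat_cols_])
        cold = csr_matrix(X[self.num_cols_].to_numpy(dtype=float))
        return hstack([hot, cold], format="csr")

train = pd.DataFrame({"Mjob": ["teacher", "health", "other"],
                      "age": [15.0, 16.0, 17.0]})
new = pd.DataFrame({"Mjob": ["services", "teacher"], "age": [18.0, 15.0]})

encoder = FittedMixedEncoder().fit(train)
train_matrix = encoder.transform(train)
new_matrix = encoder.transform(new)

# Both matrices have the same number of columns.
print(train_matrix.shape, new_matrix.shape)
print(new_matrix.toarray())
```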


### Preprocessing pipeline

When developing machine learning models, a critical step is to preprocess your data effectively. This usually involves setting up a pipeline for transforming your features (predictors) while leaving the target variable intact. Additionally, to evaluate your model effectively, it's essential to split your dataset into training and testing sets. 

Begin by dividing your dataset into features (X) and the target (y). This separation is crucial as it allows you to process and transform your features without affecting the target variable.

In [60]:
df = pd.read_csv(file_path, sep=';')

When choosing X and y for the Student Performance dataset, it's important to note that the last three columns all include student grades. Two potential studies are of value here:

a) Including previous grades as predictor columns

b) Not including previous grades as predictor columns

Assume that your EdTech company wants to make predictions based on socioeconomic variables, not on previous grades earned, so ignore the first two grade columns indexed as -2 and -3.

Select the last column as y, and all columns except for the last three as X

In [61]:
y = df.iloc[:, -1]

X = df.iloc[:, :-3]

In [63]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

In [65]:
data_pipeline = Pipeline([('null_imputer', NullValueImputer()), 
                          ('sparse', SparseMatrix())]
                        )

In [66]:
X_train_transformed = data_pipeline.fit_transform(X_train)

### Finalizing an XGBoost model
It's time to build a robust XGBoost model to add to the pipeline.

In [69]:
y_train.value_counts()

G3
11    78
10    68
13    63
12    51
14    51
15    36
8     31
16    29
9     25
17    20
18    12
0     11
7      6
6      2
19     2
5      1
Name: count, dtype: int64

In [71]:
kfold = KFold(n_splits=5, shuffle=True, random_state=43)

In [72]:
def cross_val(model):

    scores = cross_val_score(model, X_train_transformed, y_train, scoring='neg_root_mean_squared_error', cv=kfold)

    rmse = (-scores.mean())

    return rmse

In [73]:
cross_val(XGBRegressor(missing=-999.0))

2.8759082084755994

In [74]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train_transformed, y_train, random_state=43)

In [75]:
def n_estimators(model):
    eval_set = [(X_test_2, y_test_2)]
    # Since XGBoost 1.6 these are estimator parameters; passing them to fit() was removed in 2.0
    model.set_params(eval_metric="rmse", early_stopping_rounds=100)
    model.fit(X_train_2, y_train_2, eval_set=eval_set)
    y_pred = model.predict(X_test_2)
    rmse = MSE(y_test_2, y_pred)**0.5
    return rmse

In [76]:
n_estimators(XGBRegressor(n_estimators=5000, missing=-999.0))

[0]	validation_0-rmse:8.25434
[1]	validation_0-rmse:6.04610
[2]	validation_0-rmse:4.63637
[3]	validation_0-rmse:3.76622
[4]	validation_0-rmse:3.31774
[5]	validation_0-rmse:3.12871
[6]	validation_0-rmse:3.08479
[7]	validation_0-rmse:3.00243
[8]	validation_0-rmse:2.96403
[9]	validation_0-rmse:2.93577
[10]	validation_0-rmse:2.87623
[11]	validation_0-rmse:2.87114
[12]	validation_0-rmse:2.85060
[13]	validation_0-rmse:2.85854
[14]	validation_0-rmse:2.86017
[15]	validation_0-rmse:2.87909
[16]	validation_0-rmse:2.89732
[17]	validation_0-rmse:2.89801
[18]	validation_0-rmse:2.88963
[19]	validation_0-rmse:2.89084
[20]	validation_0-rmse:2.89787
[21]	validation_0-rmse:2.89785
[22]	validation_0-rmse:2.90117
[23]	validation_0-rmse:2.89989
[24]	validation_0-rmse:2.90361
[25]	validation_0-rmse:2.91006
[26]	validation_0-rmse:2.90656
[27]	validation_0-rmse:2.90807
[28]	validation_0-rmse:2.91110
[29]	validation_0-rmse:2.90573
[30]	validation_0-rmse:2.91392
[31]	validation_0-rmse:2.91448
[32]	validation_0-

2.8506019849126356

In [77]:
def grid_search(params, reg=XGBRegressor(missing=-999.0)):
    grid_reg = GridSearchCV(reg, params, scoring='neg_mean_squared_error', cv=kfold)
    grid_reg.fit(X_train_transformed, y_train)
    best_params = grid_reg.best_params_
    print(f"Best params:{best_params}")
    best_score = np.sqrt(-grid_reg.best_score_)
    print(f"Best score: {best_score}")

In [78]:
grid_search(params={'max_depth':[1, 2, 3, 4, 6, 7, 8], 
                    'n_estimators':[31]})

Best params:{'max_depth': 1, 'n_estimators': 31}
Best score: 2.6995781876875076


In [79]:
grid_search(params={'max_depth':[1, 2], 
                    'min_child_weight':[1,2,3,4,5], 
                    'n_estimators':[31]})

Best params:{'max_depth': 2, 'min_child_weight': 4, 'n_estimators': 31}
Best score: 2.685269482586703


In [84]:
grid_search(params={'max_depth':[2],
                    'min_child_weight':[2,3, 4],
                    'subsample':[0.5, 0.6, 0.7, 0.8, 0.9],
                   'n_estimators':[31, 50]})

Best params:{'max_depth': 2, 'min_child_weight': 4, 'n_estimators': 50, 'subsample': 0.8}
Best score: 2.678420215321727


In [85]:
grid_search(params={'max_depth':[2],
                    'min_child_weight':[3,4,5], 
                    'subsample':[0.8, 0.9, 1], 
                    'colsample_bytree':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
                   'n_estimators':[50]})

Best params:{'colsample_bytree': 0.7, 'max_depth': 2, 'min_child_weight': 3, 'n_estimators': 50, 'subsample': 0.9}
Best score: 2.6615511570768837


In [86]:
grid_search(params={'max_depth':[2],
                    'min_child_weight':[3], 
                    'subsample':[.9], 
                    'colsample_bytree':[0.7],
                    'colsample_bylevel':[0.6, 0.7, 0.8, 0.9, 1],
                    'colsample_bynode':[0.6, 0.7, 0.8, 0.9, 1],
                    'n_estimators':[50]})

Best params:{'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.7, 'max_depth': 2, 'min_child_weight': 3, 'n_estimators': 50, 'subsample': 0.9}
Best score: 2.6615511570768837


In [87]:
cross_val(XGBRegressor(max_depth=2,
                       min_child_weight=3,
                       subsample=0.9, 
                       colsample_bytree=0.7, 
                       colsample_bylevel=1.0,
                       colsample_bynode=1.0, 
                       missing=-999.0,
                       booster='dart',
                       one_drop=True))

2.6547125392710713

In [88]:
# Reuse the pipeline already fitted on the training data; do not refit on the test set
X_test_transformed = data_pipeline.transform(X_test)

In [89]:
type(y_train)

pandas.core.series.Series

In [91]:
model = XGBRegressor(max_depth=2,
                     min_child_weight=3,
                     subsample=0.9,
                     colsample_bytree=0.7,
                     colsample_bylevel=1.0,
                     colsample_bynode=1.0,
                     n_estimators=50,
                     missing=-999.0)
model.fit(X_train_transformed, y_train)
y_pred = model.predict(X_test_transformed)
# MSE expects (y_true, y_pred); the value is unchanged because squared error is symmetric
rmse = MSE(y_test, y_pred)**0.5
rmse

3.145871582757146

In [92]:
model = XGBRegressor(max_depth=2,
                     min_child_weight=5,
                     subsample=0.6,
                     colsample_bytree=0.9,
                     colsample_bylevel=0.9,
                     colsample_bynode=0.8,
                     n_estimators=50,
                     missing=-999.0)
model.fit(X_train_transformed, y_train)
y_pred = model.predict(X_test_transformed)
# MSE expects (y_true, y_pred); the value is unchanged because squared error is symmetric
rmse = MSE(y_test, y_pred)**0.5
rmse

3.2177922264217833

#### Building a machine learning pipeline

In [93]:
full_pipeline = Pipeline([('null_imputer', NullValueImputer()), 
                          ('sparse', SparseMatrix()), 
                          ('xgb', XGBRegressor(max_depth=2,
                                               min_child_weight=3,
                                               subsample=0.9, 
                                               colsample_bytree=0.9, 
                                               colsample_bylevel=0.9,
                                               colsample_bynode=0.8, 
                                               missing=-999.0))])

In [94]:
full_pipeline.fit(X, y)

In [95]:
new_data = X_test
full_pipeline.predict(new_data)

array([13.438138 , 10.877554 , 10.332089 , 13.583207 ,  8.323381 ,
        8.641584 , 11.806151 ,  8.323174 , 10.84991  ,  9.507316 ,
       13.3785095, 10.609498 , 11.464302 , 13.332452 ,  6.5241427,
       12.291414 ,  8.485447 , 12.476763 ,  9.387565 , 11.207339 ,
        9.424469 , 11.871625 ,  7.666484 , 12.4861355,  8.729989 ,
        9.376352 ,  9.383465 , 11.994037 ,  8.125768 , 12.351852 ,
       11.570734 , 11.633892 , 11.641348 , 14.577013 , 11.7222595,
       13.286118 , 13.434792 ,  8.953255 , 11.459575 , 10.527292 ,
       11.271511 , 12.72165  , 10.577722 ,  6.1928163, 12.027768 ,
       11.150083 , 11.3822   ,  8.45433  , 10.173357 , 10.533948 ,
        8.505165 ,  9.017391 ,  7.7787123, 12.646953 , 10.67798  ,
       12.386895 ,  9.727485 ,  7.3675466, 10.130199 ,  9.060929 ,
       10.857667 ,  9.084782 , 11.2500515, 11.436306 , 10.409196 ,
        7.722747 ,  8.883924 ,  9.706495 , 10.367291 , 11.169585 ,
        9.953716 , 10.706625 ,  8.798611 ,  9.947971 , 10.1963

In [96]:
np.round(full_pipeline.predict(new_data))

array([13., 11., 10., 14.,  8.,  9., 12.,  8., 11., 10., 13., 11., 11.,
       13.,  7., 12.,  8., 12.,  9., 11.,  9., 12.,  8., 12.,  9.,  9.,
        9., 12.,  8., 12., 12., 12., 12., 15., 12., 13., 13.,  9., 11.,
       11., 11., 13., 11.,  6., 12., 11., 11.,  8., 10., 11.,  9.,  9.,
        8., 13., 11., 12., 10.,  7., 10.,  9., 11.,  9., 11., 11., 10.,
        8.,  9., 10., 10., 11., 10., 11.,  9., 10., 10., 13., 11., 13.,
       10.,  9., 11., 13.,  9., 13., 12.,  7., 14., 11., 11.,  9.,  8.,
       13., 13., 10.,  9., 12., 12.,  6.,  8., 12., 13., 11., 11.,  8.,
        6., 13., 11., 11., 11., 10., 13., 13., 10., 11., 11., 11., 10.,
       14., 12., 12., 11., 12., 11.,  9., 10.,  8.,  9., 12.,  7., 10.,
       11., 11., 13., 13., 12., 12., 11., 12., 11.,  8., 10., 11., 14.,
       10., 12., 10., 11.,  8.,  8., 11., 12.,  7.,  7.,  9., 11., 10.,
       11., 13.,  9., 13., 11., 10., 10.], dtype=float32)

In [97]:
# Reload the full dataset and refit the whole pipeline on it
new_df = pd.read_csv(file_path, sep=';')
new_X = new_df.iloc[:, :-3]
new_y = new_df.iloc[:, -1]
new_model = full_pipeline.fit(new_X, new_y)

Now, this model may be used to make predictions on new data, as shown in the following code:

In [98]:
more_new_data = X_test[:25]
np.round(new_model.predict(more_new_data))

array([13., 11., 10., 14.,  8.,  9., 12.,  8., 11., 10., 13., 11., 11.,
       13.,  7., 12.,  8., 12.,  9., 11.,  9., 12.,  8., 12.,  9.],
      dtype=float32)

Predicting on a single row of data is a challenge when the pipeline includes one-hot encoding. The encoder only creates columns for the categories present in the data it is given, so the sparse matrix built from one row has fewer columns than the matrix the model was trained on. The result is a dimension mismatch error at prediction time.

A practical workaround is to concatenate the new row with enough existing rows that every categorical value appears at least once in the transformed sparse matrix. In this example, appending the first 25 rows of `X_test` to the single row avoids the error, whereas using 20 or fewer rows may still produce a mismatch.

To predict a single row, then, place it first, append the first 25 rows of `X_test`, and keep only the first prediction:


In [99]:
single_row = X_test[:1]
single_row_plus = pd.concat([single_row, X_test[:25]])
print(np.round(new_model.predict(single_row_plus))[:1])

[13.]
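
The padding workaround only succeeds when the extra rows happen to contain every category. As a minimal sketch of an alternative, the cell below fits scikit-learn's `OneHotEncoder` once and reuses it, so the column layout stays fixed regardless of how many rows arrive later. The toy DataFrame and its column values are hypothetical stand-ins for the student data, and this is not the notebook's `SparseMatrix` transformer; whether that transformer can be adapted this way depends on its definition, which is not shown in this section.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the categorical student columns
train = pd.DataFrame({"school": ["GP", "MS", "GP", "MS"],
                      "sex": ["F", "M", "M", "F"]})
single_row = train.iloc[:1]

# Encoding the row by itself only sees one category per column,
# so the result has fewer columns than the training matrix
alone = OneHotEncoder().fit_transform(single_row)
print(alone.shape)

# Fitting once on the training data fixes the column layout;
# handle_unknown="ignore" encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
full = encoder.transform(train)
one = encoder.transform(single_row)
print(full.shape, one.shape)
```

With the encoder fitted once, the single row is encoded with the same number of columns as the training matrix, so no padding rows are needed.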


Kudos on completing this walkthrough with me. Our learning odyssey began with fundamental machine learning concepts and pandas, and culminated in building custom transformers, pipelines, and functions. These skills enable us to deploy finely tuned XGBoost models in real-world industry scenarios, using sparse matrices to make predictions on new data.

Our journey encompassed an in-depth exploration of XGBoost's evolution: starting from basic decision trees, advancing through random forests and gradient boosting, to unraveling the complex mathematics that underpins XGBoost's effectiveness. We've observed XGBoost's superior performance compared to other algorithms and honed our skills in optimizing its extensive hyperparameters, like `n_estimators`, `max_depth`, `gamma`, `colsample_bylevel`, `missing`, and `scale_pos_weight`.

We worked through pivotal case studies in physics and astronomy, which deepened our understanding of XGBoost's versatility, especially in handling imbalanced datasets and employing alternative base learners. Insights from Kaggle competitions gave us advanced techniques in feature engineering, building non-correlated ensembles, and stacking. Finally, we covered automation techniques for industrial applications.

Now, with advanced knowledge of XGBoost, we're equipped to efficiently and effectively address diverse machine learning challenges. While XGBoost excels in numerous areas, particularly with structured data, it's important to remember its limitations with unstructured data, where neural networks might be more suitable.

For those of us eager to delve deeper into XGBoost, participating in Kaggle competitions is highly recommended. Competing against seasoned practitioners will sharpen our skills, and Kaggle's collaborative environment, with its shared notebooks and discussions, will enrich our learning experience. This platform is where XGBoost built its reputation, particularly in the Higgs boson competition highlighted in this walkthrough.

With this knowledge, we're well-prepared to venture into the realm of big data, leveraging XGBoost to push the boundaries of research, excel in competitions, and develop robust, production-ready machine learning models.

Thank you for hanging in there with me.

---
