# End-to-end workflow with Modin and Intel® Extension for Scikit-learn*

In this example, we run an end-to-end machine learning workload on US census data from 1970 to 2010.<br>
The optimized run uses **Modin with Ray** as the backend compute engine for ETL (Extract, Transform, Load), and the **Random Forest regression algorithm from the Intel® Extension for Scikit-learn library** to train a model on the relationship between US total income and education levels and run predictions with it.<br>
You can use the default kernel <mark>"Python 3 (Intel® oneAPI 2023.0)"</mark> for this notebook.

Let's start by downloading the census data to your local disk, unpacking it, and keeping the first 1,000,000 lines as a smaller subset.

In [None]:
!wget https://storage.googleapis.com/intel-optimized-tensorflow/datasets/ipums_education2income_1970-2010.csv.gz
!gunzip ipums_education2income_1970-2010.csv.gz
!head -1000000 ipums_education2income_1970-2010.csv > ipums_education2income_1970-2010_subset.csv
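`head -1000000` writes the first 1,000,000 lines of the file, that is, the header row plus 999,999 data rows. For environments without the `head` utility, a rough Python equivalent might look like the sketch below. The helper name `write_first_lines` is hypothetical and not part of the original notebook.

```python
from itertools import islice


def write_first_lines(src_path, dst_path, n_lines):
    """Copy at most the first n_lines lines of src_path to dst_path (like `head -n`)."""
    copied = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n_lines lines without reading the rest of the file
        for line in islice(src, n_lines):
            dst.write(line)
            copied += 1
    return copied


# Possible usage with the file names from this notebook:
# write_first_lines("ipums_education2income_1970-2010.csv",
#                   "ipums_education2income_1970-2010_subset.csv",
#                   1_000_000)
```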

### Install the required packages (optional)

In [None]:
%pip install --user --upgrade modin[ray]
%pip install --user --upgrade scikit-learn
%pip install --user --upgrade scikit-learn-intelex==2023.0.0

##### Restart the kernel to load the installed packages

In [None]:
import os
# Terminate the kernel process; Jupyter normally restarts it automatically
os._exit(0)

Import basic Python modules and suppress warnings to keep the output uncluttered.

In [None]:
import os
import numpy as np
import warnings

warnings.filterwarnings("ignore")

The flag below switches the Intel optimizations in this workflow on or off.

In [None]:
enable_intel_optimizations = True

In [None]:
if enable_intel_optimizations:
    print("Running optimized")
    import ray
    # Restart Ray with explicit resource limits (all sizes are in bytes)
    ray.shutdown()
    ray.init(
        num_cpus=4,
        _memory=16000 * 1024 * 1024,
        object_store_memory=500 * 1024 * 1024,
        _driver_object_store_memory=500 * 1024 * 1024,
    )
    import modin.pandas as pd
    # Patch scikit-learn before importing its estimators
    from sklearnex import patch_sklearn
    patch_sklearn()
else:
    print("Running stock")
    import pandas as pd
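The sizes passed to `ray.init` are given in bytes, so `16000 * 1024 * 1024` is 16,000 MiB and `500 * 1024 * 1024` is 500 MiB. A small helper can make that arithmetic explicit. The name `mib_to_bytes` is hypothetical and not part of the original notebook.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def mib_to_bytes(mib):
    """Convert a size in MiB to bytes."""
    return mib * MIB


# Values matching the ray.init call in this notebook
ray_memory_bytes = mib_to_bytes(16000)
object_store_bytes = mib_to_bytes(500)

print(ray_memory_bytes)    # 16777216000
print(object_store_bytes)  # 524288000
```

These limits are specific to the environment the notebook was written for; on a machine with more or fewer resources, you would likely adjust `num_cpus` and the memory sizes accordingly.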

In [None]:
from sklearn import config_context
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import time

Read the subset CSV file created above into a dataframe.

In [None]:
time_start = time.time()
df = pd.read_csv('ipums_education2income_1970-2010_subset.csv')
csv_load_time = round(time.time()-time_start)

In [None]:
df.head()

Run ETL (Extract, Transform, Load) operations to transform the ingested dataset into a form that the Random Forest regression algorithm can consume directly.<br>
Keep the relevant columns, remove samples with invalid income or education values, and normalize income to account for yearly inflation.

In [None]:
time_start = time.time()
# clean up features
keep_cols = [
    "YEAR", "DATANUM", "SERIAL", "CBSERIAL", "HHWT",
    "CPI99", "GQ", "PERNUM", "SEX", "AGE",
    "INCTOT", "EDUC", "EDUCD", "EDUC_HEAD", "EDUC_POP",
    "EDUC_MOM", "EDUCD_MOM2", "EDUCD_POP2", "INCTOT_MOM", "INCTOT_POP",
    "INCTOT_MOM2", "INCTOT_POP2", "INCTOT_HEAD", "SEX_HEAD",
]
df = df[keep_cols]

# clean up samples with invalid income, education, etc.
df = df[df["INCTOT"] != 9999999]
df = df[df["EDUC"] != -1]
df = df[df["EDUCD"] != -1]

# normalize income for inflation
df["INCTOT"] = df["INCTOT"] * df["CPI99"]

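# fill missing values with -1 and convert every column to float64
for column in keep_cols:
    df[column] = df[column].fillna(-1)
    df[column] = df[column].astype("float64")

# target: education level; features: all remaining columns except CPI99
y = df["EDUC"]
X = df.drop(columns=["EDUC", "CPI99"])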
etl_time = round(time.time()-time_start)
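To make the effect of the cleanup steps concrete, here is a dependency-free sketch that applies the same rules to three made-up rows. The rows and the helper name `clean_rows` are illustrative assumptions; the notebook itself performs these steps on a dataframe.

```python
INVALID_INCOME = 9999999  # value filtered out of INCTOT in this notebook
INVALID_EDUC = -1         # value filtered out of EDUC and EDUCD in this notebook


def clean_rows(rows):
    """Drop rows with invalid values and scale INCTOT by CPI99."""
    cleaned = []
    for row in rows:
        if row["INCTOT"] == INVALID_INCOME:
            continue
        if row["EDUC"] == INVALID_EDUC or row["EDUCD"] == INVALID_EDUC:
            continue
        new_row = dict(row)
        # normalize income for inflation
        new_row["INCTOT"] = float(row["INCTOT"]) * row["CPI99"]
        cleaned.append(new_row)
    return cleaned


# Made-up sample rows (not taken from the census data)
rows = [
    {"INCTOT": 10000, "CPI99": 2.0, "EDUC": 6, "EDUCD": 60},
    {"INCTOT": 9999999, "CPI99": 1.0, "EDUC": 6, "EDUCD": 60},  # invalid income
    {"INCTOT": 20000, "CPI99": 1.5, "EDUC": -1, "EDUCD": -1},   # invalid education
]

cleaned_rows = clean_rows(rows)
print(cleaned_rows)
```

Only the first row survives, with its `INCTOT` scaled from 10000 to 20000.0.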

Train the model and run prediction.<br> Repeat this 50 times, each time with a different random train/test split, so that the reported metrics do not depend on one particular split.<br> Averaging over many splits reduces the chance that a single favorable split makes the model look better than it is.

In [None]:
time_start = time.time()
params = {
    'n_estimators': 50,
    'random_state': 44,
    'n_jobs': -1
}

# ML - training and inference
clf = RandomForestRegressor(**params)
mse_values, cod_values = [], []
N_RUNS = 50
TRAIN_SIZE = 0.9
random_state = 777

X = np.ascontiguousarray(X, dtype=np.float64)
y = np.ascontiguousarray(y, dtype=np.float64)

# repeated random train/test splits (a new seed for every run)
for i in range(N_RUNS):
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=TRAIN_SIZE,
                                                        random_state=random_state)
    random_state += 777

    # training
    with config_context(assume_finite=True):
        model = clf.fit(X_train, y_train)

    # inference
    y_pred = model.predict(X_test)

    mse_values.append(mean_squared_error(y_test, y_pred))
    cod_values.append(r2_score(y_test, y_pred))

model_build_time = round(time.time()-time_start)
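The loop above is a form of repeated random sub-sampling: each run draws a fresh 90/10 split with a new seed (777, 1554, 2331, and so on). The sketch below mimics that seed schedule with the standard library only. `split_indices` is a hypothetical stand-in for `train_test_split`, and it does not reproduce scikit-learn's exact shuffling.

```python
import random


def split_indices(n_samples, train_size, seed):
    """Shuffle indices 0..n_samples-1 with the given seed and split them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_size)
    return indices[:n_train], indices[n_train:]


N_RUNS = 3          # the notebook uses 50
TRAIN_SIZE = 0.9
random_state = 777

seeds = []
for _ in range(N_RUNS):
    train_idx, test_idx = split_indices(100, TRAIN_SIZE, random_state)
    seeds.append(random_state)
    print(random_state, len(train_idx), len(test_idx))
    random_state += 777  # a different seed for the next run
```

Because the seeds are fixed, the whole sequence of splits is reproducible from run to run, which keeps the optimized and stock timings comparable.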

Check the regression results by computing the mean squared error (MSE) and the R² score (coefficient of determination, COD) of the predictions, averaged over all runs.
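For reference, these are the standard definitions of the two metrics for $n$ test samples with true values $y_i$, predictions $\hat{y}_i$, and mean true value $\bar{y}$:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

A lower MSE and an $R^2$ closer to 1 both indicate a better fit.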

In [None]:
mean_mse = sum(mse_values)/len(mse_values)
mean_cod = sum(cod_values)/len(cod_values)
mse_dev = pow(sum([(mse_value - mean_mse)**2 for mse_value in mse_values])/(len(mse_values) - 1), 0.5)
cod_dev = pow(sum([(cod_value - mean_cod)**2 for cod_value in cod_values])/(len(cod_values) - 1), 0.5)
print("mean MSE ± deviation: {:.9f} ± {:.9f}".format(mean_mse, mse_dev))
print("mean COD ± deviation: {:.9f} ± {:.9f}".format(mean_cod, cod_dev))
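The deviation computed above is the sample standard deviation, with `len(...) - 1` in the denominator. As a cross-check, the standard library's `statistics` module gives the same quantity. The values below are made up for illustration and do not come from the notebook's results.

```python
import statistics

# Made-up values standing in for the per-run MSE results
sample_values = [1.0, 2.0, 3.0]

sample_mean = sum(sample_values) / len(sample_values)
manual_dev = pow(
    sum((v - sample_mean) ** 2 for v in sample_values) / (len(sample_values) - 1),
    0.5,
)

# statistics.stdev also uses the n - 1 denominator
library_dev = statistics.stdev(sample_values)

print(sample_mean, manual_dev, library_dev)
```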

#### Time breakdown

In [None]:
print(f"Data load time : {csv_load_time} seconds")
print(f"ETL time : {etl_time} seconds")
print(f"Model build time : {model_build_time} seconds")
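The timings above are rounded to whole seconds with `round`, so short stages may report 0. If finer resolution is useful, one option is a small context manager around `time.perf_counter`. The name `timed` and the placeholder workload are hypothetical and not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds for the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("example stage", timings):
    total = sum(range(100_000))  # placeholder workload

print(f"example stage : {timings['example stage']:.3f} seconds")
```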

In [None]:
if enable_intel_optimizations:
    ray.shutdown()

<br>

### Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.<br>
Cost reduction scenarios described including recommendations are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.<br>
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. <br>
Any forecasts of goods and services needed for Intel’s operations are provided for discussion purposes only. Intel will have no liability to make any purchase in connection with forecasts published in this document.<br>
Intel technologies may require enabled hardware, software or service activation.<br>
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  <br>
Performance tests, are measured using specific computer systems, components, software, operations and functions.  Any change to any of those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.   For more complete information visit www.intel.com/benchmarks.<br>

\* Other names and brands may be claimed as the property of others. <br>

Your costs and results may vary. <br>
© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.<br>
Copyright 2023 Intel Corporation. 