In Kaggle competitions and real-world projects, we tend to use complex models: ensembles of several algorithms, neural networks, gradient boosting variants, and much more. However, to get started quickly in machine learning and big data, it is enough to know how to use classical machine learning algorithms. Random Forest is one example. Using this algorithm alone may already give you a fairly good solution.

For classical machine learning algorithms, we often use the most popular Python library, Scikit-learn. With Scikit-learn you can fit models and search for optimal parameters, but this can sometimes take hours. Anyone who uses Scikit-learn would be interested in speeding this process up.

I want to show you how to use the Scikit-learn library and get results faster without changing your code. To do this, we will make use of another Python library, [**Intel® Extension for Scikit-learn***](https://github.com/intel/scikit-learn-intelex). It accelerates Scikit-learn and does not require you to change the code written for Scikit-learn.

I will show you how to speed up your kernel from **25 minutes to 14 minutes** without changing your code!

In [None]:
import numpy as np
import pandas as pd
import re
import optuna

import sklearn
from sklearn.model_selection import train_test_split

In [None]:
train = pd.read_csv("../input/tabular-playground-series-apr-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-apr-2021/test.csv")
sample_submission = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv')

train.head(5)

# Preprocessing

Most of the preprocessing was taken from Anisotropic's [Introduction to Ensembling/Stacking in Python](https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python) notebook. I have also added several new features that are based on the existing ones. While this lets the model take previously unseen patterns into account, it may also introduce strongly correlated features and, consequently, overfitting. We need to find a balance.
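One way to keep an eye on that balance is to look for feature pairs that are almost perfectly correlated. The snippet below is a minimal sketch on a made-up toy frame; the helper `highly_correlated_pairs`, the `Pclass_x2` column and the 0.95 threshold are my own illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.95):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data (hypothetical values) that mimics the product features built below
toy = pd.DataFrame({
    "Pclass": [10, 20, 30, 10, 30, 20],
    "Sex": [1, 2, 1, 2, 2, 1],
})
toy["Pclass_Sex"] = toy["Pclass"] * toy["Sex"]
toy["Pclass_x2"] = toy["Pclass"] * 2  # deliberately redundant feature

redundant = highly_correlated_pairs(toy)
print(redundant)
```

A pair that shows up here carries almost the same information twice, so dropping one of the two columns is usually a cheap experiment to try.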



In [None]:
PassengerId = test['PassengerId']
full_data = [train, test]

# Feature engineering steps taken from Anisotropic
# Create a new feature with the length of a name 
train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)
# Create a new feature that tells whether a passenger had a cabin on the Titanic
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# Feature engineering steps taken from Sina
# Create a new feature FamilySize as a combination of SibSp and Parch
for dataset in full_data:
    dataset['FamilySize'] = (dataset['SibSp'] + dataset['Parch'] + 1)*100
    dataset['Pclass'] = dataset['Pclass']*10
# Create a new feature IsAlone from FamilySize
for dataset in full_data:
    dataset['IsAlone'] = -1
    dataset.loc[dataset['FamilySize'] == 100, 'IsAlone'] = 1
# Fill missing values in the Embarked column with 'S'
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# Fill missing values in the Fare column with the train median and create a new feature CategoricalFare
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)

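# Fill missing Age values with the mean and create a new feature CategoricalAge
for dataset in full_data:
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())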
    dataset['Age'] = dataset['Age'].astype(int)
train['CategoricalAge'] = pd.cut(train['Age'], 5)
# Define a function to extract titles from passenger names
def get_title(name):
    # Raw string so that "\." is passed to the regex engine unchanged
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
# Create a new feature Title, containing the titles of passenger names
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
# Group all non-common titles into one single grouping "Rare"
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

for dataset in full_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 2, 'male': 1} ).astype(int)
    
    # Mapping titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    
    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
    # Mapping Fare
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare']                               = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare']                                  = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    
    # Mapping Age
    dataset.loc[ dataset['Age'] <= 16, 'Age']                          = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age']                           = 4
    
for dataset in full_data:
    # Some features of my own: products of the most valuable existing features
    dataset['Pclass_Sex'] = (dataset['Pclass']*dataset['Sex'])
    dataset['Pclass_FS'] = (dataset['Pclass']*dataset['FamilySize'])
    dataset['Pclass_IsAlone'] = (dataset['Pclass']*dataset['IsAlone'])
    dataset['Sex_FS'] = (dataset['Sex']*dataset['FamilySize'])
    dataset['Sex_IsAlone'] = (dataset['Sex']*dataset['IsAlone'])
    dataset['FS_IsAlone'] = (dataset['FamilySize']*dataset['IsAlone'])
    dataset['Pclass_Sex_FS'] = (dataset['Pclass']*dataset['Sex']*dataset['FamilySize'])
    dataset['Pclass_Sex_IsAlone'] = (dataset['Pclass']*dataset['Sex']*dataset['IsAlone'])
    dataset['Pclass_Sex_FS_IsAlone'] = (dataset['Pclass']*dataset['Sex']*dataset['FamilySize']*dataset['IsAlone'])
    
# Feature selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis = 1)
train = train.drop(['CategoricalAge'], axis = 1)
train = train.drop(['CategoricalFare'], axis = 1)
test  = test.drop(drop_elements, axis = 1)

y = train["Survived"]
train.drop(columns=["Survived"], inplace=True)

train.head(5)

**Split the data into two parts: for training and validation**

In [None]:
x_train, x_val, y_train, y_val = train_test_split(train, y, test_size=0.2)
x_train.shape, x_val.shape

# Random Forest

Random Forest is an ensemble of Decision Trees. You can think of it as a decision made collectively by a committee of experts. First, each expert (decision tree) expresses an opinion. The opinions are then aggregated into the final decision; in Scikit-learn's classifier, this is done by averaging the trees' predicted class probabilities and choosing the most probable class.
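To make the committee picture concrete, here is a small sketch that rebuilds the forest's decision by hand from its individual trees. It uses stock Scikit-learn and synthetic data from `make_classification`; the variable names are my own, and the data is an assumption for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each expert (tree) gives its opinion as class probabilities
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# The committee averages the opinions and picks the most probable class
committee_proba = tree_probas.mean(axis=0)
committee_vote = forest.classes_[committee_proba.argmax(axis=1)]

print(np.allclose(committee_proba, forest.predict_proba(X_demo)))
print(np.array_equal(committee_vote, forest.predict(X_demo)))
```

Both checks compare the hand-made aggregation with what the forest itself returns from `predict_proba` and `predict`.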

Let's select some of the hyperparameters available for Random Forest: the number of trees in the forest (`n_estimators`), the maximum depth of each tree (`max_depth`), the minimum number of samples required at a leaf node (`min_samples_leaf`), and the number of features considered when looking for the best split (`max_features`). More information about the parameters can be found in the [**Scikit-learn library documentation**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
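The following sketch shows how these hyperparameters end up in the fitted trees. It runs on synthetic data with 64 features (an arbitrary choice of mine so that `sqrt` and `log2` give different results) and uses stock Scikit-learn; the helper `fitted_forest` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 64 features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=64, n_informative=10, random_state=0
)


def fitted_forest(max_features):
    """Fit a small forest that differs only in the max_features setting."""
    return RandomForestClassifier(
        n_estimators=10,
        max_depth=5,
        min_samples_leaf=3,
        max_features=max_features,
        random_state=0,
    ).fit(X_demo, y_demo)


sqrt_forest = fitted_forest("sqrt")
log2_forest = fitted_forest("log2")

# Number of candidate features each tree examines at every split
features_per_split_sqrt = sqrt_forest.estimators_[0].max_features_
features_per_split_log2 = log2_forest.estimators_[0].max_features_

# No tree is allowed to grow deeper than max_depth
deepest_tree = max(tree.get_depth() for tree in sqrt_forest.estimators_)

print(features_per_split_sqrt, features_per_split_log2, deepest_tree)
```

With 64 features, `sqrt` resolves to 8 candidate features per split and `log2` to 6, which is why the two options can lead to noticeably different trees on wide datasets.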

# Intel® Extension for Scikit-learn

As mentioned earlier, we will use a library that accelerates Scikit-learn. Patch Scikit-learn to enable the optimizations before importing the estimator, so that we can later compare training times with stock Scikit-learn:

In [None]:
!pip install scikit-learn-intelex --progress-bar off >> /tmp/pip_sklearnex.log
from sklearnex import patch_sklearn
patch_sklearn()

# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# Find optimal parameters using Optuna

It's time to adjust the hyperparameters of the algorithm to our data. To search for the optimal values of the hyperparameters, let's use [**Optuna**](https://optuna.readthedocs.io/en/stable/index.html), a hyperparameter optimization framework.

In [None]:
def objective(trial):
    params = {
        # Pass step by keyword; newer Optuna versions expect it as a keyword argument
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000, step=100),
        'max_depth': trial.suggest_int('max_depth', 5, 20, step=5),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 5, step=1),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2'])
    }

    rf = RandomForestClassifier(**params, random_state=777, n_jobs=-1)
    rf.fit(x_train, y_train)
    return rf.score(x_val, y_val)

In [None]:
%%time

search_space = {'n_estimators': [500, 1000, 2000],
                'max_depth': [10, 15, 20],
                'min_samples_leaf': [1, 2, 3],
                'max_features': ['sqrt', 'log2']}
study = optuna.create_study(sampler=optuna.samplers.GridSampler(search_space),
                            direction="maximize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, show_progress_bar=True)

In [None]:
print(f"Best Value: {study.best_trial.value}")
print(f"Best Params: {study.best_params}")

**Train the final model**

Now let's train the final model using the best parameters.

In [None]:
%%time

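# Use the same settings as in the stock Scikit-learn run so that the timings are comparable
cl = RandomForestClassifier(**study.best_params, random_state=777, n_jobs=-1)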
cl.fit(train, y)

**Predict using test data**

In [None]:
%%time

predict = cl.predict(test)

**Save the result**

In [None]:
sample_submission["Survived"] = predict
sample_submission.head()
sample_submission.to_csv('sklearnex.csv', index=False)

# Call stock Scikit-learn

Let’s run the same Scikit-learn code without the patching offered by Intel® Extension for Scikit-learn and compare its execution time with the execution time of the patched Scikit-learn.
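The notebook relies on the `%%time` cell magic for its measurements. If you want to collect the timings programmatically, for example to compare the two runs side by side, a sketch like the following may help. The `timed` context manager and the `timings` dictionary are hypothetical helpers of mine, and the data is synthetic.

```python
import time
from contextlib import contextmanager

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

timings = {}


@contextmanager
def timed(label):
    """Store the wall-clock duration of the enclosed block under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

with timed("fit"):
    model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)
with timed("predict"):
    model.predict(X_demo)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Running the same block once with the patch applied and once without it gives two dictionaries whose entries can be divided to get the speedup per step.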

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
def objective(trial):
    params = {
        # Pass step by keyword; newer Optuna versions expect it as a keyword argument
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000, step=100),
        'max_depth': trial.suggest_int('max_depth', 5, 20, step=5),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 5, step=1),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2'])
    }

    rf = RandomForestClassifier(**params, random_state=777, n_jobs=-1)
    rf.fit(x_train, y_train)
    return rf.score(x_val, y_val)

In [None]:
%%time

search_space = {'n_estimators': [500, 1000, 2000],
                'max_depth': [10, 15, 20],
                'min_samples_leaf': [1, 2, 3],
                'max_features': ['sqrt', 'log2']}
study = optuna.create_study(sampler=optuna.samplers.GridSampler(search_space),
                            direction="maximize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, show_progress_bar=True)

In [None]:
print(f"Best Value: {study.best_trial.value}")
print(f"Best Params: {study.best_params}")

In [None]:
%%time

cl = RandomForestClassifier(**study.best_params, random_state=777, n_jobs=-1)
cl.fit(train, y)

In [None]:
%%time

predict = cl.predict(test)

# Conclusions

We can see that using only one classical machine learning algorithm may give you a pretty high accuracy score. We also used the well-known libraries [Scikit-learn](https://scikit-learn.org/stable/) and [Optuna](https://optuna.readthedocs.io/en/stable/index.html), as well as the increasingly popular library [**Intel® Extension for Scikit-learn**](https://github.com/intel/scikit-learn-intelex). Note that Intel® Extension for Scikit-learn lets you:

* Use your Scikit-learn code for training and inference without modification.
* Train Scikit-learn models and use them for prediction roughly 1.7 to 2 times faster.
* Get predictions of similar quality to those of stock Scikit-learn.
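As a quick sanity check, the speedup factor can be recomputed from the total kernel times quoted in the introduction (25 minutes with stock Scikit-learn, 14 minutes with the patch). The variable names are mine; the two numbers come from the text above.

```python
# Total kernel run times quoted in the introduction, in minutes
stock_minutes = 25
patched_minutes = 14

speedup = stock_minutes / patched_minutes
time_saved_pct = 100 * (1 - patched_minutes / stock_minutes)

print(f"{speedup:.2f}x faster, {time_saved_pct:.0f}% less wall-clock time")
```

This gives a factor of about 1.79, which sits inside the 1.7 to 2 range mentioned in the list.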

*Please upvote if you liked it.*