# Problem Statement

In this notebook, we <b>predict whether or not an email is spam</b>. This is a binary (two-class) classification problem. The training dataset has 600,000 rows and 102 attributes: id, f0 to f99, and target. The test dataset has 540,000 rows and 101 attributes: id and f0 to f99. The submission file has 540,000 rows and 2 attributes: id and target. The id attribute is an integer and f0 to f99 are floats. The target is boolean: 0 indicates not spam and 1 indicates spam. There are no missing values, and the training dataset appears to be balanced.

We are going to cover the following steps:
1. Load data
2. Exploratory Data Analysis
3. Prepare Validation Dataset
4. Model 1 using Ftrl
5. Feature Importance for Model 1
6. Submission using Model 1
7. Model 2
8. Feature Importance for Model 2
9. Submission using Model 2
10. Ensemble of Model 1 and Model 2
11. References

Let's get started.

# Load Data

### Install Libraries

In [None]:
!pip install datatable

### Load Libraries

Let's start off by loading the libraries required.

In [None]:
# Load libraries
import datatable as dt
from datatable.models import Ftrl
print(dt.__version__)
import time
from pathlib import Path
import numpy as np
import pandas as pd

# to print all outputs of a cell
from IPython.core.interactiveshell import InteractiveShell  
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings('ignore')

### Load data

In [None]:
## Data Table Reading
start = time.time()
data_dir = Path('../input/tabular-playground-series-nov-2021/')
dt_train = dt.fread(data_dir / "train.csv")
dt_test = dt.fread(data_dir / "test.csv")
dt_submission = dt.fread(data_dir / "sample_submission.csv")
end = time.time()
print(end - start)

Let's take a peek at the first 5 rows.

In [None]:
dt_train.head(5)

Let's find out the dimensions of the training dataset.

In [None]:
# number of rows and columns in training dataset
dt_train.shape

- We have 600,000 rows and 102 attributes in the training dataset.
- 102 attributes include one id attribute of type integer, 100 attributes (f0 to f99) of type float and the target of type bool.

Now, let's find out the dimensions of the test dataset.

In [None]:
# number of rows and columns in test dataset
dt_test.shape

- We have 540,000 rows and 101 attributes in the test dataset.
- 101 attributes include one id attribute of type integer and 100 attributes (f0 to f99) of type float.

Let's have a look at the data types of all the attributes in the training dataset.

In [None]:
for i in range(len(dt_train.names)):
    print(dt_train.names[i], ":", dt_train.stypes[i])

In [None]:
dt_submission.head(5)

Let's have a look at the dimensions of the submission file.

In [None]:
dt_submission.shape

We can see that the submission file has 540,000 rows and two columns, which are 'id' and 'target'.

# Exploratory Data Analysis

Let's find out the mean, maximum, minimum, standard deviation and the number of missing values in the training dataset.

In [None]:
# mean
dt_train.mean()

In [None]:
# max
dt_train.max()

In [None]:
# min
dt_train.min()

In [None]:
# standard deviation
dt_train.sd()

In [None]:
dt_train.countna()

We don't have any missing values, hence imputation is not required.

In [None]:
class_counts = dt_train.to_pandas().groupby('target').size()
print(class_counts)

- The training dataset appears to be balanced.
- We have 296,394 / 600,000 = about 49.4% of cases in which the email is not spam.
- We have 303,606 / 600,000 = about 50.6% of cases in which the email is spam.
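
The class shares follow directly from the two counts reported above. A minimal check in plain Python, using only those counts:

```python
# Class counts reported above (0 = not spam, 1 = spam)
class_counts = {0: 296_394, 1: 303_606}
total = sum(class_counts.values())

# Share of each class as a percentage of all training rows
class_share = {label: 100 * n / total for label, n in class_counts.items()}
for label, share in class_share.items():
    print(f"target={label}: {share:.1f}%")
```

Both shares are close to 50%, which is why plain accuracy or log loss can be used without class weighting.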

Let's have a look at the mean, minimum and maximum values for cases in which the email is spam.

In [None]:
dt_train[dt.f.target == 1, :].mean()

In [None]:
dt_train[dt.f.target == 1, :].min()

In [None]:
dt_train[dt.f.target == 1, :].max()

Now, let's have a look at the mean, minimum and maximum values for cases in which the email is not spam.

In [None]:
dt_train[dt.f.target == 0, :].mean()

In [None]:
dt_train[dt.f.target == 0, :].min()

In [None]:
dt_train[dt.f.target == 0, :].max()

### Correlation

In [None]:
start = time.time()

# Pairwise Pearson correlations
correlations = dt_train.to_pandas().corr(method='pearson')
print(correlations)

end = time.time()
print(end - start)

Let's have a look at the skewness and kurtosis of all the attributes in the training dataset.

In [None]:
dt_train.skew()

In [None]:
dt_train.kurt()

In [None]:
dt_train.nunique()

The number of unique values is not very informative here, since attributes f0 to f99 are all continuous floats.

# Prepare Validation Dataset

In [None]:
from sklearn.model_selection import train_test_split

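# Features: every column except the target (note that this keeps the id column)
X = dt_train[:, [col for col in dt_train.names if col != 'target']]
# Select the target by name rather than by position
y = dt_train[:, 'target']

# scikit-learn works on NumPy arrays
X = X.to_numpy()
y = y.to_numpy()

# Hold out 30% of the rows for validation
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.3)

# Convert back to datatable Frames; column names are lost in the round trip,
# so the columns are now named C0, C1, ...
X_train = dt.Frame(X_train)
X_validation = dt.Frame(X_validation)
y_train = dt.Frame(y_train)
y_validation = dt.Frame(y_validation)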
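
The split above is not seeded, so the validation rows (and therefore the validation log loss) change from run to run; scikit-learn's `train_test_split` accepts a `random_state` argument to fix this. The idea behind a seeded 70/30 holdout split can be sketched in plain Python. Note that `holdout_split` is a hypothetical helper written for illustration, not part of scikit-learn or datatable, and its rounding may differ from scikit-learn's:

```python
import random

def holdout_split(n_rows, test_size=0.3, seed=42):
    """Return (train_idx, validation_idx) for a shuffled holdout split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_validation = int(round(n_rows * test_size))
    return indices[n_validation:], indices[:n_validation]

train_idx, validation_idx = holdout_split(10, test_size=0.3, seed=42)
print(len(train_idx), len(validation_idx))
```

Calling the helper twice with the same seed returns the same row indices, which makes validation scores comparable between models.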

# Model 1

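In this section, we use datatable's `Ftrl` model, which implements the 'Follow The Regularized Leader' (FTRL) online learning algorithm; Keras provides the same algorithm as an [Ftrl optimizer](https://keras.io/api/optimizers/ftrl/).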

In [None]:
from datatable.models import Ftrl

model_ftrl_1 = Ftrl()
model_ftrl_1.fit(X_train, y_train)
model_ftrl_1
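
To give a feel for what an FTRL model does during `fit`, here is a rough sketch of the published per-coordinate FTRL-Proximal update for logistic regression on dense inputs. It is not datatable's implementation (which, among other things, hashes features into bins); the class name `FtrlProximalSketch`, the toy data, and the hyperparameter values are assumptions made for illustration.

```python
import math

class FtrlProximalSketch:
    """Per-coordinate FTRL-Proximal for logistic regression (illustrative only)."""

    def __init__(self, n_features, alpha=0.1, beta=1.0, lambda1=0.0, lambda2=0.0):
        self.alpha, self.beta = alpha, beta
        self.lambda1, self.lambda2 = lambda1, lambda2
        self.z = [0.0] * n_features  # accumulated adjusted gradients
        self.n = [0.0] * n_features  # accumulated squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.lambda1:  # L1 threshold keeps the weight at exactly zero
            return 0.0
        step = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.lambda2
        return -(z - math.copysign(self.lambda1, z)) / step

    def predict_one(self, x):
        score = sum(self._weight(i) * xi for i, xi in enumerate(x))
        score = max(min(score, 35.0), -35.0)  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def fit_one(self, x, y):
        p = self.predict_one(x)
        for i, xi in enumerate(x):
            g = (p - y) * xi  # gradient of the log loss for coordinate i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g

# Toy data: first input is a constant bias term, label is 1 when the second is positive
toy_rows = [([1.0, 2.0], 1), ([1.0, -2.0], 0), ([1.0, 1.5], 1), ([1.0, -1.0], 0)]
sketch_model = FtrlProximalSketch(n_features=2)
for _ in range(20):  # several passes over the toy data
    for x, y in toy_rows:
        sketch_model.fit_one(x, y)

print(sketch_model.predict_one([1.0, 3.0]), sketch_model.predict_one([1.0, -3.0]))
```

Each row updates the model once and is then discarded, which is what makes this family of algorithms suitable for large datasets such as the 600,000-row training set here.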

In [None]:
prediction_validation_1 = model_ftrl_1.predict(X_validation)
prediction_validation_1.head()

In [None]:
X_test = dt_test[:,:]
X_test = X_test.to_numpy()
X_test = dt.Frame(X_test)

prediction_test_1 = model_ftrl_1.predict(X_test)
prediction_test_1.head()

# Feature Importance for Model 1

In [None]:
# Display the feature importances of model_ftrl_1 in descending order
model_ftrl_1.feature_importances[:, :, dt.sort(-dt.f.feature_importance)]

In [None]:
preds = dt.cbind(y_validation, prediction_validation_1)
# Tip: print(preds) to check the column names. The target column is named C0,
# and the predicted probabilities are in the 'False' and 'True' columns.
preds[:, -dt.mean(dt.f.C0 * dt.math.log(dt.f['True']) + (1-dt.f.C0) * dt.math.log(dt.f['False']))][0, 0]
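
The expression above is the binary log loss, i.e. the mean of `-(y * log(p) + (1 - y) * log(1 - p))`, where `p` is the predicted spam probability (the `True` column) and `1 - p` is the `False` column. A small plain-Python sketch of the same metric, with clipping added as an extra safeguard against `log(0)`; `binary_log_loss` is a hypothetical helper name and the probabilities are made up:

```python
import math

def binary_log_loss(y_true, p_spam, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_spam):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

print(binary_log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))
print(binary_log_loss([1, 0], [0.5, 0.5]))  # uninformative predictions give log(2)
```

A validation log loss noticeably below log(2), roughly 0.693, indicates that the model is doing better than always predicting 0.5.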

In [None]:
submission_ids = dt_submission['id']
print(submission_ids)

# Submission using Model 1

In [None]:
# Create submission_1 in the submission format of the competition, write it as submission_1.csv and submit it on Kaggle
submission_1 = dt.Frame(id=submission_ids, target=prediction_test_1['True'])
submission_1.to_csv('submission_1.csv')
submission_1.head()

# Model 2

We use the Ftrl model again, but with nepochs=3, nbins=10**8.

In [None]:
# Train a second FTRL model, model_ftrl_2, with nepochs=3 and nbins=10**8.
# The following cells display its feature importances, evaluate its log loss
# on the validation data, and write its test predictions to submission_2.
model_ftrl_2 = Ftrl(nepochs=3, nbins=10**8)
model_ftrl_2.fit(X_train, y_train)
model_ftrl_2
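
Of the two hyperparameters changed above, `nepochs` is the number of passes over the training data and `nbins` is the number of bins used for the hashing trick, in which feature values are hashed into a fixed number of weight slots. The sketch below illustrates the general idea only: more bins means fewer collisions between unrelated values. The hash function (MD5 of a `name=value` string) and the helper names are assumptions for illustration and are not what datatable uses internally.

```python
import hashlib

def bin_index(feature_name, value, nbins):
    """Map a (feature, value) pair to one of nbins weight slots."""
    key = f"{feature_name}={value}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "little") % nbins

def count_collisions(pairs, nbins):
    """Number of pairs that share a bin with some other pair counted before them."""
    used = {bin_index(name, value, nbins) for name, value in pairs}
    return len(pairs) - len(used)

# 100 features with 50 distinct values each: 5,000 distinct (feature, value) pairs
pairs = [(f"f{j}", v / 10) for j in range(100) for v in range(50)]
for nbins in (10**3, 10**6, 10**8):
    print(nbins, count_collisions(pairs, nbins))
```

With only 1,000 bins, most of the 5,000 pairs are forced to share a slot; with 10**8 bins, collisions become rare, at the cost of a larger weight table.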

# Feature Importance for Model 2

In [None]:
model_ftrl_2.feature_importances[:, :, dt.sort(-dt.f.feature_importance)]

In [None]:
prediction_validation_2 = model_ftrl_2.predict(X_validation)
prediction_validation_2.head()

In [None]:
prediction_test_2 = model_ftrl_2.predict(X_test)
prediction_test_2.head()

In [None]:
preds = dt.cbind(y_validation, prediction_validation_2)
preds[:, -dt.mean(dt.f.C0 * dt.math.log(dt.f['True']) + (1-dt.f.C0) * dt.math.log(dt.f['False']))][0, 0]

# Submission using Model 2

In [None]:
submission_2 = dt.Frame(id=submission_ids, target=prediction_test_2['True'])
submission_2.to_csv('submission_2.csv')
submission_2.head()

# Ensemble of Model 1 and Model 2

In [None]:
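# Submit an ensemble of model_ftrl_1 and model_ftrl_2 by averaging their predictions as submission_ensemble
submission_ensemble = dt.cbind(submission_1, submission_2)
# cbind renames the duplicate columns from submission_2 to 'id.0' and 'target.0',
# so the average must combine 'target' (Model 1) with 'target.0' (Model 2)
submission_ensemble[:, dt.update(target = 0.5 * dt.f.target + 0.5 * dt.f['target.0'])]
del submission_ensemble[:, ['id.0', 'target.0']]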
submission_ensemble.to_csv('submission_ensemble.csv')
submission_ensemble.head()
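
An averaging ensemble combines one prediction column from each model, row by row; averaging a column with itself simply returns that column. A minimal plain-Python sketch of the idea, where `average_predictions` is a hypothetical helper and the probabilities are made up for illustration:

```python
def average_predictions(preds_a, preds_b, weight_a=0.5):
    """Element-wise weighted average of two equally long prediction lists."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(preds_a, preds_b)]

model_1_preds = [0.20, 0.90, 0.40]  # made-up probabilities for illustration
model_2_preds = [0.40, 0.70, 0.60]
ensemble = average_predictions(model_1_preds, model_2_preds)
print(ensemble)
```

The `weight_a` argument makes it easy to favour whichever model has the lower validation log loss instead of using equal weights.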

# References

Thank you to vopani for these [datatable exercises](https://github.com/vopani/datatableton#set-04--frame-operations--beginner--exercises-31-40).