In this notebook, you will learn how to make your first submission to the [Tabular Playground Series - Mar 2021 competition](https://www.kaggle.com/c/tabular-playground-series-mar-2021).

# Make the most of this notebook!

You can use the "Copy and Edit" button in the upper right of the page to create your own copy of this notebook and experiment with different models. You can run it as is and then see if you can make improvements.

In [None]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import lightgbm as lgb
import optuna

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# List the files available in the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

input_path = Path('/kaggle/input/tabular-playground-series-mar-2021/')

# Read in the data files

In [None]:
train = pd.read_csv(input_path / 'train.csv', index_col='id')
display(train.head())

In [None]:
test = pd.read_csv(input_path / 'test.csv', index_col='id')
display(test.head())

In [None]:
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='id')
display(submission.head())

## We need to encode the categoricals.

There are different strategies to accomplish this, and each approach performs differently depending on the algorithm. You may decide to encode features with high cardinality (i.e., many distinct values) differently than features with low cardinality. For this starter notebook, we'll use simple label encoding.
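As a minimal sketch of treating features differently by cardinality, the example below one-hot encodes a column with few distinct values and frequency-encodes a column with many. The toy frame, its column names, and the cutoff of 3 are assumptions made up for illustration; they are not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "low_card": ["A", "B", "A", "B", "A", "B"],
    "high_card": ["u1", "u2", "u3", "u4", "u1", "u5"],
})

MAX_ONE_HOT = 3  # assumed cutoff; tune it for your own data

encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    if df[col].nunique() <= MAX_ONE_HOT:
        # Few distinct values: one indicator column per category.
        dummies = pd.get_dummies(df[col], prefix=col, dtype=int)
        encoded = encoded.join(dummies)
    else:
        # Many distinct values: replace each category with its relative frequency.
        freq = df[col].value_counts(normalize=True)
        encoded[col + "_freq"] = df[col].map(freq)

print(encoded)
```

Tree-based models often cope well with plain label encoding, while linear models tend to benefit more from one-hot columns, so it may be worth comparing both on your validation split.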

In [None]:
for c in train.columns:
    if train[c].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(train[c].values)
        test[c] = lbl.transform(test[c].values)
        
display(train.head())

## Pull out the target, and make a validation split

In [None]:
target = train.pop('target')
X_train, X_test, y_train, y_test = train_test_split(train, target, train_size=0.60)

# Simple Random Forest

In previous Tabular Playground Series competitions, where the target was continuous, we created a "naive" dummy model that simply predicted the mean of the target. That approach is less useful when the scoring metric is AUC, because any constant prediction scores 0.5. We'll skip it this time and simply note that a model must score above 0.5 to be considered better than a naive or random one.
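To see why a constant prediction lands on 0.5, it helps to read AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting as half. The helper `pairwise_auc` below is a hypothetical, illustrative implementation of that definition (not the scikit-learn function), and the labels and scores are made up.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win, which is the usual convention.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, made up for illustration.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]

constant = [0.5] * len(y_true)  # always predict the target mean
auc_constant = pairwise_auc(y_true, constant)

informative = [0.1, 0.3, 0.8, 0.2, 0.7, 0.4, 0.5, 0.9]
auc_informative = pairwise_auc(y_true, informative)

print(auc_constant, auc_informative)
```

With a constant score every positive/negative pair is a tie, so each pair contributes one half regardless of which constant is used.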

In [None]:
clf = RandomForestClassifier(n_estimators=200, max_depth=7, n_jobs=-1)
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)[:, 1] # This grabs the positive class prediction
score = roc_auc_score(y_test, y_pred)
print(f'{score:0.5f}') # 0.87323 shows we're doing better than a dummy model

## Let's take a look at how the model predicted the various classes

The graph below shows that the model does well with most of the negative observations, but struggles with many of the positive observations.

In [None]:
plt.figure(figsize=(8,4))
plt.hist(y_pred[np.where(y_test == 0)], bins=100, alpha=0.75, label='neg class')
plt.hist(y_pred[np.where(y_test == 1)], bins=100, alpha=0.75, label='pos class')
plt.legend()
plt.show()

# Let's train it on all the data and make a submission!

In [None]:
clf = RandomForestClassifier(n_estimators=200, max_depth=7, n_jobs=-1)
clf.fit(train, target)
submission['target'] = clf.predict_proba(test)[:, 1]
submission.to_csv('random_forest.csv')

## Now you should save your Notebook (blue button in the upper right), and then when that's complete go to the notebook viewer and make a submission to the competition. :-)

## There's lots of room for improvement. What things can you try to get a better score?

# LightGBM tuned with Optuna

In [None]:
input_path = Path('/kaggle/input/tabular-playground-series-mar-2021/')

train = pd.read_csv(input_path / 'train.csv', index_col='id')

In [None]:
test = pd.read_csv(input_path / 'test.csv', index_col='id')

In [None]:
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='id')

In [None]:
for c in train.columns:
    if train[c].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(train[c].values)
        test[c] = lbl.transform(test[c].values)

In [None]:
y_train = train['target']
X_train = train.drop('target',axis=1)

In [None]:
X_train,X_valid,y_train,y_valid = train_test_split(X_train,y_train,test_size=0.3,random_state=0,stratify=y_train)

In [None]:
# Only the cat* columns (cat0 to cat18) are categorical; the cont* columns are continuous and must not be listed here.
categorical_features = [f'cat{i}' for i in range(19)]

In [None]:
def objective(trial):
    params = {
        'objective': 'binary',
        'max_bin': trial.suggest_int('max_bin', 255, 500),
        'learning_rate': 0.05,
        'num_leaves': trial.suggest_int('num_leaves', 32, 128),
    }
    
    lgb_train = lgb.Dataset(X_train,y_train,categorical_feature=categorical_features)
    lgb_eval = lgb.Dataset(X_valid,y_valid,reference=lgb_train,categorical_feature=categorical_features)

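    # Early stopping and logging are passed as callbacks (available in LightGBM 3.3+);
    # the verbose_eval and early_stopping_rounds arguments were removed in LightGBM 4.0.
    model = lgb.train(
        params, lgb_train,
        valid_sets=[lgb_train, lgb_eval],
        num_boost_round=1000,
        callbacks=[lgb.early_stopping(stopping_rounds=10), lgb.log_evaluation(period=10)]
    )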

    y_pred_valid = model.predict(X_valid, num_iteration=model.best_iteration)
    score = log_loss(y_valid, y_pred_valid)
    return score

In [None]:
study = optuna.create_study(sampler=optuna.samplers.RandomSampler(seed=0))
study.optimize(objective, n_trials=40)

In [None]:
study.best_params

In [None]:
params = {
    'objective':'binary',
    'max_bin': study.best_params['max_bin'],
    'learning_rate': 0.05,
    'num_leaves': study.best_params['num_leaves']
}

lgb_train = lgb.Dataset(X_train,y_train,categorical_feature=categorical_features)
lgb_eval = lgb.Dataset(X_valid,y_valid,reference=lgb_train,categorical_feature=categorical_features)

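# Early stopping and logging are passed as callbacks (available in LightGBM 3.3+);
# the verbose_eval and early_stopping_rounds arguments were removed in LightGBM 4.0.
model = lgb.train(
    params, lgb_train,
    valid_sets=[lgb_train, lgb_eval],
    num_boost_round=1000,
    callbacks=[lgb.early_stopping(stopping_rounds=10), lgb.log_evaluation(period=10)]
)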

y_pred = model.predict(test, num_iteration=model.best_iteration)

In [None]:
# Convert the LightGBM probabilities to hard 0/1 labels
y_pred = (y_pred > 0.5).astype(int)
y_pred[:10]

**Logistic regression and ensembling**

In [None]:
input_path = Path('/kaggle/input/tabular-playground-series-mar-2021/')

train = pd.read_csv(input_path / 'train.csv', index_col='id')

In [None]:
test = pd.read_csv(input_path / 'test.csv', index_col='id')

In [None]:
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='id')

In [None]:
for c in train.columns:
    if train[c].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(train[c].values)
        test[c] = lbl.transform(test[c].values)

In [None]:
y_train = train['target']
X_train = train.drop('target',axis=1)

In [None]:
clf = LogisticRegression(penalty='l2', solver="sag", random_state=0)

In [None]:
clf.fit(X_train, y_train)

In [None]:
y_pred2 = clf.predict(test)

In [None]:
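# Combine the models: LightGBM hard labels (y_pred), logistic regression
# probabilities, and logistic regression hard labels (y_pred2).
# Note: clf now refers to the logistic regression, not the earlier random forest.
submission['target'] = y_pred + clf.predict_proba(test)[:, 1] + y_pred2
# Predict 1 only when the combined score reaches 2
submission['target'] = (submission['target'] >= 2).astype(int)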
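Because the competition is scored on AUC, which depends only on how the predictions are ranked, one alternative worth trying is to average the models' probabilities instead of combining hard 0/1 labels. The sketch below compares the two on made-up scores for two hypothetical models (`p_model_a` and `p_model_b` are illustrative names, not variables from this notebook).

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model probabilities, made up for illustration.
y_true = [0, 0, 0, 1, 1, 1]
p_model_a = [0.2, 0.6, 0.4, 0.7, 0.45, 0.9]
p_model_b = [0.3, 0.4, 0.55, 0.6, 0.8, 0.7]

# Hard vote: predict 1 only when both models' thresholded labels are 1.
hard = [int(a > 0.5 and b > 0.5) for a, b in zip(p_model_a, p_model_b)]

# Soft average: keep the probabilities so the ranking information survives.
soft = [(a + b) / 2 for a, b in zip(p_model_a, p_model_b)]

auc_hard = roc_auc_score(y_true, hard)
auc_soft = roc_auc_score(y_true, soft)
print(f'hard vote: {auc_hard:0.4f}, soft average: {auc_soft:0.4f}')
```

On this toy data the averaged probabilities rank every positive above every negative, while the hard vote ties one positive with the negatives. That outcome is not guaranteed in general, so check it against your own validation split.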

In [None]:
# Note: this overwrites the random forest submission file written earlier
submission.to_csv('random_forest.csv')
submission.head()