<a id="Using LightGBM in Google Brain- Ventilator Pressure Competition"></a>
# Using LightGBM in Google Brain - Ventilator Pressure Competition

This notebook is a straightforward implementation of LightGBM to predict ventilator pressure in the Google Brain - Ventilator Pressure Competition [1]. The train dataset has no missing values, so its columns can be passed to the LightGBM model directly, with no imputation or encoding step.

This notebook includes hyperparameter tuning of the LightGBM model using scikit-optimize. 

[1] https://www.kaggle.com/c/ventilator-pressure-prediction



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

<a id="Load Data"></a>
# Load Data

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')
train.head()

In [None]:
train.shape

In [None]:
train.nunique()

In [None]:
train.isnull().sum()

<a id="Plot the target variable"></a>
# Plot the target variable

In [None]:
# kde expects a boolean; the string 'True' only worked because it is truthy
sns.histplot(train['pressure'], stat='probability', bins=30, kde=True)

We can see that the target (pressure) is skewed to the right.
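
The skew can also be quantified rather than read off the plot. The sketch below is a minimal illustration on a synthetic log-normal sample (an assumption standing in for `train['pressure']`, so that it runs without the competition files); `sample`, `skewness`, and `mean_minus_median` are illustrative names that are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for train['pressure'] (assumption).
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=10_000))

# A positive sample skewness indicates a longer right tail.
skewness = sample.skew()

# For a right-skewed distribution the mean typically sits above the median.
mean_minus_median = sample.mean() - sample.median()

print(f"skewness={skewness:.2f}, mean - median={mean_minus_median:.2f}")
```

On the real data, replacing `sample` with `train['pressure']` should give the corresponding figures for the target.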

In [None]:
#Save id column to train_id
train_id = train['id']
# Drop id column from train dataset
train.drop(['id'], axis=1, inplace=True)

In [None]:
#Assign the pressure column as target
target = train['pressure']
#Drop 'pressure' from the train dataset
train.drop(['pressure'], axis=1, inplace = True)

In [None]:
test = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')

In [None]:
test.head()

In [None]:
test.shape

The test dataset has fewer rows than the train dataset.

In [None]:
test.isnull().sum()

In [None]:
#Save test id column to test_id
test_id = test['id']
test.drop(['id'], axis=1, inplace=True)

<a id="Train LightGBM Model"></a>
# Train LightGBM Model

In [None]:
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn import metrics

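# Hold out 10% of the rows for validation; a fixed seed makes the score reproducible
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.1, random_state=42)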

reg_lgb = lgb.LGBMRegressor(n_estimators = 2000)
reg_lgb.fit(X_train,y_train)
preds = reg_lgb.predict(X_test)

lgb_score = metrics.mean_absolute_error(y_test, preds)
lgb_score
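
`train_test_split` shuffles individual rows. If the rows belong to groups, rows from one group can land on both sides of the split and make the hold-out MAE look better than it would on unseen groups. The competition data appears to group time steps by a `breath_id` column, which is an assumption here since the notebook does not discuss it. The following is a minimal sketch of a group-aware split on synthetic data, using scikit-learn's `GroupShuffleSplit`; the frame `demo` and its `feature` column are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the training frame: 50 groups of 8 rows each.
# 'breath_id' is assumed to be the grouping column; 'feature' is hypothetical.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "breath_id": np.repeat(np.arange(50), 8),
    "feature": rng.uniform(0, 100, size=400),
})
demo_target = pd.Series(rng.normal(size=400))

# Split by group so that every row of a given breath_id stays on one side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, valid_idx = next(
    splitter.split(demo, demo_target, groups=demo["breath_id"])
)

X_tr, X_va = demo.iloc[train_idx], demo.iloc[valid_idx]
y_tr, y_va = demo_target.iloc[train_idx], demo_target.iloc[valid_idx]

# No group should appear in both the training and validation parts.
shared = set(X_tr["breath_id"]) & set(X_va["breath_id"])
print(len(shared))
```

The resulting `X_tr`, `X_va`, `y_tr`, and `y_va` could be used in place of the row-level split when fitting and scoring the model.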

<a id="Hyperparameter Tuning"></a>
# Hyperparameter Tuning

In [None]:
%pip install scikit-optimize

Uncomment the next three cells to run hyperparameter tuning.

In [None]:
# def optimize(params, param_names, x, y):
#     """Return the mean 5-fold MAE for one hyperparameter setting."""
#     # gp_minimize passes the values positionally; map them back to their names
#     params = dict(zip(param_names, params))
#     model = lgb.LGBMRegressor(**params)
#     kf = model_selection.KFold(n_splits=5)
#     fold_errors = []
#     for train_idx, test_idx in kf.split(X=x, y=y):
#         xtrain, ytrain = x.iloc[train_idx], y.iloc[train_idx]
#         xtest, ytest = x.iloc[test_idx], y.iloc[test_idx]
#
#         model.fit(xtrain, ytrain)
#         preds = model.predict(xtest)
#         # MAE is an error (lower is better), so it can be minimized directly
#         fold_errors.append(metrics.mean_absolute_error(ytest, preds))
#
#     return np.mean(fold_errors)

In [None]:
# from functools import partial
# from skopt import space
# from skopt import gp_minimize
# from sklearn import model_selection
# from skopt.callbacks import CheckpointSaver

# checkpoint_saver = CheckpointSaver("./checkpoint.pkl", compress = 9)

# param_space = [
#                space.Categorical(['regression'], name = 'objective'),
#                space.Integer(100,5000, name = 'n_estimators'),
#                space.Integer(2,100, name = 'num_leaves'),
#                space.Integer(1,50, name = 'min_data_in_leaf'),
#                space.Integer(1,25, name = 'max_depth'),
#                space.Real(0.0001, 0.1, name = 'learning_rate'),
#                space.Real(0.01, 0.99, name = 'bagging_fraction'),
#                space.Integer(1,20, name = 'bagging_freq'),
#                space.Integer(1,10, name = 'bagging_seed'),
#                space.Integer(2,100, name = 'max_bin'),
#                space.Real(0.01, 0.99, name = 'feature_fraction'),
#                space.Integer(1, 10, name = 'feature_fraction_seed'),
#                space.Integer(1, 20, name = 'min_sum_hessian_in_leaf')
# ]

# param_names = [
#                'objective',
#                'n_estimators',
#                'num_leaves',
#                'min_data_in_leaf',
#                'max_depth',
#                'learning_rate',
#                'bagging_fraction',
#                'bagging_freq',
#                'bagging_seed',
#                'max_bin',
#                'feature_fraction',
#                'feature_fraction_seed',
#                'min_sum_hessian_in_leaf'
# ]

# optimization_function = partial(
#     optimize,
#     param_names = param_names,
#     x=train,
#     y=target
# )

# result = gp_minimize(
#     optimization_function,
#     dimensions = param_space,
#     n_calls = 50,
#     n_random_starts = 10,
#     n_jobs = -1,
#     callback = [checkpoint_saver],
#     random_state = 123,
#     verbose = True,
# )
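
The optimizer calls its objective with a single positional list of values, so the cells above rely on two small tricks: `functools.partial` to fix the extra arguments, and `dict(zip(...))` to turn the positional list back into named parameters. The following is a minimal, dependency-free sketch of that pattern; the `objective` function, its `scale` argument, and the quadratic toy loss are hypothetical stand-ins for the cross-validated MAE, and the target values 31 and 0.1 are arbitrary.

```python
from functools import partial


def objective(params, param_names, scale):
    # Rebuild keyword-style parameters from the positional list.
    kwargs = dict(zip(param_names, params))
    # Toy loss standing in for the cross-validated MAE (assumption).
    return scale * (
        (kwargs["num_leaves"] - 31) ** 2 + (kwargs["learning_rate"] - 0.1) ** 2
    )


# The order of the names must match the order of the search dimensions.
param_names = ["num_leaves", "learning_rate"]

# partial fixes everything except the first argument, leaving f(params).
toy_function = partial(objective, param_names=param_names, scale=1.0)

print(toy_function([31, 0.1]))  # 0.0
print(toy_function([33, 0.1]))  # 4.0
```

Because the mapping is purely positional, `param_names` and the list of search dimensions have to be kept in the same order, which is why the notebook lists both explicitly.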

In [None]:
# from skopt import load

# result = load('./checkpoint.pkl')

# print("""Best parameters:
# objective=%s,
# n_estimators = %d,
# num_leaves=%d,
# min_data_in_leaf=%d,
# max_depth=%d,
# learning_rate=%.6f,
# bagging_fraction=%f,
# bagging_freq=%d,
# bagging_seed=%d,
# max_bin=%d,
# feature_fraction=%f,
# feature_fraction_seed=%d,
# min_sum_hessian_in_leaf=%d
# """ % (result.x[0], result.x[1],result.x[2], result.x[3],result.x[4],result.x[5],result.x[6],
#         result.x[7],result.x[8],result.x[9],result.x[10],result.x[11],result.x[12])
# )

In [None]:
# Train on the full train dataset with the chosen hyperparameters
reg_lgb = lgb.LGBMRegressor(
    objective='regression',
    n_estimators=2200,
    num_leaves=70,
    min_data_in_leaf=36,
    max_depth=13,
    learning_rate=0.078025,
    bagging_fraction=0.412706,
    bagging_freq=12,
    bagging_seed=2,
    max_bin=41,
    feature_fraction=0.624771,
    feature_fraction_seed=4,
    min_sum_hessian_in_leaf=6,
)
reg_lgb.fit(train, target)

<a id="Make Predictions"></a>
# Make Predictions

In [None]:
preds = reg_lgb.predict(test)
output = pd.DataFrame({'id': test_id, 'pressure': preds})
output.to_csv('submission.csv', index=False)