# Light GBM

### Importing Libraries

In [1]:
# Importing required libraries
import pandas as pd 
import numpy as np

import warnings
warnings.filterwarnings("ignore")

### Loading the dataset

In [5]:
# reading the data
data = pd.read_csv('data_cleaned.csv')

In [6]:
# first five rows of the data
data.head()

Unnamed: 0,Survived,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,SibSp_0,SibSp_1,...,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Embarked_C,Embarked_Q,Embarked_S
0,0,22.0,7.25,0,0,1,0,1,0,1,...,1,0,0,0,0,0,0,0,0,1
1,1,38.0,71.2833,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,1,0,0
2,1,26.0,7.925,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
3,1,35.0,53.1,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,0,0,1
4,0,35.0,8.05,0,0,1,0,1,1,0,...,1,0,0,0,0,0,0,0,0,1


### Separating independent and dependent variables.

In [7]:
# independent variables
x = data.drop(['Survived'], axis=1)

# dependent variable
y = data['Survived']

### Creating the train and test dataset

In [8]:
# import the train-test split
from sklearn.model_selection import train_test_split

In [9]:
# divide into train and test sets
train_x,test_x,train_y,test_y = train_test_split(x,y, random_state = 101, stratify=y)

## Building an LGBM Model

In [10]:
# Importing LGBM 
import lightgbm as lgb

In [11]:
train_data = lgb.Dataset(train_x, label=train_y)

In [12]:
# define parameters
params = {'learning_rate':0.001}

In [13]:
model = lgb.train(params, train_data, 100)
y_pred = model.predict(test_x)

In [14]:
for i in range(0,185):
    if y_pred[i]>=0.5:
        y_pred[i]=1
        
    else:
        y_pred[i]=0

In [17]:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_pred, test_y)**0.5
rmse

0.6024038094272935

# Hyperparameter Tuning

**num_iterations:**
  - It defines the number of boosting iterations to be performed.
  
  
**num_leaves:**
  - This parameter is used to set the number of leaves to be formed in tree.
  
  - Incase of **Light GBM**, since splitting takes place leaf-wise rather than depth-wise, **num_leaves** must be smaller than **2^(max_depth)**, otherwise it may lead to **overfitting.**


**min_data_in_leaf:**
  - A very small value may cause over-fitting.
  
  - It is also one of the most important parameters in dealing with over-fitting.


**max_depth:**
  - It specifies the maximum depth or level upto which a tree can grow.
  
  - A very high value for this parameter can cause overfitting.
  

**bagging_fraction:**
  - It is used to specify the fraction of data to be used for each iteration.
  
  - This parameter is generally used to speed up the training.


**max_bin:**
  - Defines the maximum of bins that feature values will be bucketed in.
  
  - A smaller value of max_bin can save a lot of time as it buckets the feature values in discrete bins which is computationally inexpensive.