## Bayesian methods of hyperparameter optimization

In addition to random search and grid search, we can use Bayesian methods to select the optimal hyperparameters for an algorithm.

In this case study, we will be using the BayesianOptimization library to perform hyperparameter tuning. This library has very good documentation, which you can find here: https://github.com/fmfn/BayesianOptimization

You will need to install the Bayesian optimization module. Running a cell with an exclamation point in the beginning of the command will run it as a shell command — please do this to install this module from our notebook in the cell below.

In [2]:
#! pip install bayesian-optimization

#HAIN?!?! install it from YOUR notebook??? what'll that do??? why not install it ourSELVES?!?
##################################################################################################################

In [3]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
################################################
import numpy as np
import pandas as pd
import lightgbm
################################################
from bayes_opt import BayesianOptimization
from catboost import CatBoostClassifier, cv, Pool

In [4]:
import os
os.listdir()

#'os' is operating system, checked, but why is that the thing for where these files are? this is just this folder
#/location, aka like 'cd' or just '/' auto?
#so this is just 'listing' everything that's in the/my (current) 'directory'

['flight_delays_test.csv.zip',
 '.DS_Store',
 'Bayesian Optimization Case Study - Final Submission.ipynb',
 'flight_delays_train.csv.zip',
 'Bayesian Optimization Case Study - WORKING COPY.ipynb',
 '.ipynb_checkpoints',
 'Bayesian Optimization Case Study - Original Copy.ipynb']

## How does Bayesian optimization work?

Bayesian optimization works by constructing a posterior distribution of functions (Gaussian process) that best describes the function you want to optimize. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not, as seen in the picture below.

<img src="https://github.com/fmfn/BayesianOptimization/blob/master/examples/bo_example.png?raw=true" />
As you iterate, the algorithm balances exploration and exploitation, taking into account what it knows about the target function. At each step, a Gaussian process is fitted to the known samples (points previously explored), and the posterior distribution, combined with an exploration strategy such as UCB (Upper Confidence Bound) or EI (Expected Improvement), is used to determine the next point that should be explored (see the gif below).
<img src="https://github.com/fmfn/BayesianOptimization/raw/master/examples/bayesian_optimization.gif" />
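To make the loop described above concrete, here is a minimal pure-Python sketch in one dimension. It is not the library's implementation: the `target` function, the squared-exponential kernel, the zero prior mean, the `kappa` weight and the grid search over candidates are all assumptions chosen for illustration. It fits a Gaussian process to the points seen so far and picks the next point by maximizing the UCB score (posterior mean plus `kappa` times posterior standard deviation).

```python
import math
import random

def target(x):
    # Hypothetical black-box function to maximize; its true peak is at x = 2.
    return -(x - 2.0) ** 2

def rbf(x1, x2, length=1.0):
    # Squared-exponential kernel: nearby inputs are assumed to give similar outputs.
    return math.exp(-0.5 * ((x1 - x2) / length) ** 2)

def solve(A, b):
    # Solve the linear system A z = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def posterior(xs, ys, x_new, jitter=1e-4):
    # Zero-mean GP posterior mean and standard deviation at x_new, given samples (xs, ys).
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, math.sqrt(max(var, 0.0))

def optimize(init_points=3, n_iter=5, bounds=(0.0, 4.0), kappa=2.0, seed=0):
    rng = random.Random(seed)
    # Random exploration: sample init_points inputs uniformly inside the bounds.
    xs = [rng.uniform(*bounds) for _ in range(init_points)]
    ys = [target(x) for x in xs]
    grid = [bounds[0] + i * (bounds[1] - bounds[0]) / 200 for i in range(201)]
    for _ in range(n_iter):
        def ucb(x):
            mean, std = posterior(xs, ys, x)
            return mean + kappa * std  # high mean = exploit, high std = explore
        x_next = max(grid, key=ucb)
        xs.append(x_next)
        ys.append(target(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)

best_x, best_y, n_evals = optimize()
print(best_x, best_y, n_evals)
```

The number of evaluations is always `init_points + n_iter`, which is why the optimizer tables in this notebook have one row per random point plus one row per guided step.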

## Let's look at a simple example

The first step is to create an optimizer. It uses two items:
* function to optimize
* bounds of parameters

The function is the procedure that computes a quality metric for our model. The important thing is that the optimizer maximizes the value of this function. So if smaller values of your metric are better (as with an error or a loss), return the negative of the metric.
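As a small illustration of the negative-metric hint, the sketch below wraps an error metric so that a maximizer ends up minimizing it. The toy model `y = slope * x`, the three data points and the function names are made up for this example.

```python
def mean_squared_error(y_true, y_pred):
    # Average squared difference; smaller is better.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def objective(slope):
    # Hypothetical toy model y = slope * x, scored on three made-up points.
    xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
    preds = [slope * x for x in xs]
    # Negate the error: the smallest error becomes the largest objective value.
    return -mean_squared_error(ys, preds)

for slope in (0.0, 1.0, 2.0):
    print(slope, objective(slope))
```

The best slope (2.0, zero error) gets the highest objective value, so a maximizer would be pulled toward it.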

Here we define our simple function we want to optimize.

In [5]:
def simple_func(a, b):
    return a + b

Now, we define our bounds of the parameters to optimize, within the Bayesian optimizer.

In [6]:
optimizer = BayesianOptimization(f=simple_func, pbounds={'a': (1, 3), 'b': (4, 7)})
#so it goes, BO(func,bounds_dict)
#REPRESENTING THE ONE-TWO-STEPS! ;P

These are the main parameters of the optimizer's `maximize` method:

* **n_iter:** This is how many steps of Bayesian optimization you want to perform. The more steps, the more likely you are to find a good maximum.

* **init_points:** This is how many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.

In [None]:
#hmmm okay don't remember if we played w/ these 2 settings before?
#but this does sound like the stuff we know about - like init_points must be the number of 'pre'/exploration/
#CALIBRATION steps!!!!!!?
#i remember we saw that, but don't know if we explicitly programmed or accepted default
#and then n_iter is after that, once you hone in on the areas to concentrate on/cluster(s/range) of highest density,
#and that i think i remember we did specify like 10,000? or that was the default again
#have to go back n check

Let's run the optimizer to find the values of a and b that maximize the target. The two arguments are `init_points` = 3 (random exploration steps) followed by `n_iter` = 2 (Bayesian optimization steps), so there are 5 evaluations in total.

In [16]:
optimizer.maximize(3,2)
#okay so "find the best values to maximize the target value for a & b given the inputs of 3 & 2"...

#okay so if you look at the docs, the 3,2 in order should be
#init_points & n_iter...
#so 3 initial calibration steps and 2 actual / production lol / 'forreal' steps
#OHHHHHHH >>> SO IS THAT WHY THERE'S 5 TOTAL STEPS HERE?!?!?
#but i didn't think they included/combined the calibration & production ones
#together?!?! but also - why does it need ANY calibration
#if you can literally just go straight to the max values of each??
#and what is it even learning from the calibration lol
#if so few points?? oh, well, remem it's a RANGE!
#so can be ANY INFINITE NUMBER OF DECIMALS BETWEEN!
#but like, if you look at it - like how did it even
#conclude on what range to focus on / flesh out
#in calibration?! like how could it do that in just 2 steps?
#spreads out? and how did it suddenly figure out to try
#exactly the max values lol?
####################################################
#NICE!!! so aH we were right about/confirmed what 3,2 are by manually specifying, below
#AND ALSO! our theory at least lines up w/ later example here of optimizer/maximizer
#in that that's 10,2 and has a total of *12* rows!!!!!!
########################################################

#ok so remem we specified the range of a & b above! 1->3 & 4->7, resp
#note, 3,2 aren't a,b! this isn't the FUNCTION we made! this is a method call! we ALREADY
#specified a,b - that WENT *INTO* THE OPTIMIZER!!!!
#plus, wouldn't make sense anyway cuz 2 is out of range for b....

#okay, so, if we look below, the winner is 'target = 10, from a=3 + b=7',....
#okay so - that's LITERALLY simply maximizing the function given the ranges, cuz 3 & 7 are RANGE TOPPERS resp!!!!
########################################################
#like it's saying find the values that maximize the "target value"
#THE 'TARGET VALUE' SIMPLY REFERS TO THE THING WE'RE TRYNA PREDICT, AKA THE *Y*!!!!!!
#AKA 'TARGET *FEATURE*!!!!! like here it's y = a + b!!!!
#SO YEAH IF a GOES UP TO MAX 3 & b GOES UP TO MAX 7, THEN MAX(A+B)=MAX(A)+MAX(B)=3+7=10!!!
#so like if we plotted this, it'd literally be a small little tilted rectangular patch of a PLANE
#in 3D SPACE!!!!!! (a plane, not a line, cuz there are TWO inputs a & b)
#'maximizing' the target value literally means just making it the maximum possible lolll
#i was confused first at the terminology 'target value' cuz i'm thinking that means like
#'there's a certain literally *TARGET* value we're tryna reach!!!!

#obvy in this case we're not tryna "predict" y (right?) but rather it's just our f(x)
#just like how we had/used the function for the Normal Distribution Equation
#there, we were tryna choose the best hyperparameter value, which we used as the x
#in that equation, and since we had 2 hyper/params, we had to use that equation for EACH!
#but it's interesting!!!!! those two hyperparams were for their OWN equation where
#they both coexisted, aka *THE BEST FIT SLOPE LINE Y=MX+B!!!!!!
#but rather than solving m & b analytically like using the equation for OLS,
#we do it like here BAYESIAN PROBABILISTICALLY/DISTRIBUTIONALISTICALLY!!!!!!
#so yeah DON'T GET CONFUSED BY THE EQUATION *OF* THE HYPERPARAMS WE'RE TRYNA OPTIMIZE
#&THE EQUATION WE'RE *USING* TO OPTIMIZE THE HYPERPARAMS!!!!!!!!!!!!
#*****BUT NOTE***** in this VERY VERY simplified sample example, we made things super
#easy - a & b first of all ARE our hyperparameters, and so the SAME equation y=a+b is
#both the equation OF our hyperparameters and the equation TO/FOR *MAXIMIZE* OUR
#HYPERPARAMETERS!!!!!!*******************
###############################################################################################
#NOTE! in y=mx+b Bayesian we were predicting in the sense of that is what we’re AIMING to do with the y=mx+b equation ultimately!!!!! and like when we’re optimizing, we’re picking / testing m/b’s to come up with PREDICTED Y’S!!!!!! that we use for our yp’s in the NormDistEqn!!!!!
#In this case study as we’ll see, we ARE gonna use Bayes for prediction but it’ll be CLASSIFICATION and will look very different than what we saw w/ NUTS! you’ll see about that below!

#so yeah here in this simple example - we're not tryna predict! the y of the function we choose to optimize
#our parameters simply tells us what parameter value to choose since we're MAXIMIZING
#the function aka literally getting the MAX F(X) AKA *MAX Y* and using the X-value
#aka HYPERPARAM VALUE THAT GIVES US THAT MAXIMUM!!!!!!
#in Normal Dist Equation example, that's the optimizer function and so the units
#are *Probability Density* aka that's our METRIC!!!! but it REPRESENTS/IS *INVERSELY* PROPORTIONAL TO (right?)
#SQUARED ERROR!!!!!! so - really we're doing the same thing in OLS as we are in Bayesian - the objective is
#the same, we're tryna MINIMIZE THE SQUARED ERROR!
#it's just that in OLS, we're doing it EXACTLY, aka "analytically", w/ an equation for the slope
#of the best fit line (the (IN)famous/notorious one we well know!) - that works out the be the slope of the line
#that minimizes the squared error/distance to the points overall
#then of course we can just plug & chug into y=mx+b, having m and using the CENTER x,y as the x,y
#to put in here, since we KNOW the line will run thru the CENTER point, and thus can solve/get b!
#but NOTE in OLS we're not doing it 'DIRECTLY', as i previously thought of it as, cuz if you think about it,
#that would mean we're LITERALLY using an equation for the squared error of all the proposed predicted points,
#but that's NOT what we're doing, so you could say actually that OLS is *INDIRECTLY* minimizing the error!!!!!
#cuz again we're just using the actuals ALONE! aka JUST the actual X,y's and the center/their centers, basically
#"spreading" the "area to center"!!!!!!! and what results *IS* the slope/intercept/line that minimizes the
#squared error to the points!!!!!!
#Now, w/ Bayesian probability estimation, we actually *ARE* focused DIRECTLY on squared error,
#w/ just (yp-yi)^2, where we can't actually SOLVE it for the minimum, but rather, can just do *"TRIAL & ERROR"*
#testing out diff lines aka diff m/b aka SLOPE/INTERCEPT COMBOS for diff y=mx+b equations to GIVE US
#DIFFERENT SETS OF PREDICTED POINTS AND FOR EACH SET FOR EACH M/B AKA Y=MX+B COMBO/STRAIGHT EDGE/LINE/EQUATION
#WE CALCULATE THE SQUARED ERROR AND THEORETICALLY THERE'S *INFINITE* COMBOS TO TRY SO WE'RE JUST TRYNA *NARROW
#IT DOWN*!!!!!! HONE IN TILL WE REALISTICALLY GET THE EFFECTIVE OPTIMAL VALUE PAIR/COMBO OF M/B!//SLOPE/INTERCEPT!!!!!!
#but the thing is, we don't use that equation/measure *DIRECTLY* lol so this is also INDIRECT in this sense,
#that we use the NORMAL DISTRIBUTION EQUATION because we can hack it to use it for SQUARED DIFFERENCE also
#since that's apart of the equation, but also, this "transforms" the values too in our favor where the
#smaller the squared error, the bigger the f(x)/value of the equation, which is perfect since we wanna find
#the MAX of f(x)//y, thus, the *MINIMUM SQUARED ERROR IS THE MAXIMUM Y//F(X)!!!!!!!!!!!!!!!!!!!!!!!!!!!*
#and i forget the details but we do do this separately each for slope & intercept? so even though we need BOTH
#in order to come up with an equation/line to get our y's...., but yeah like there's a certain range of i guess
#optimal / best range / target m's/slopes and same for intercepts and so we're looking for the combo that
#TOGETHER gives us maximum of helper function aka MINIMUM SQUARED ERROR!!!!!
#or you know what? it might actually just be TWO DIFFERENT *VIEWS* OF THE *SAME THING*!!!!!!!!
#just like if you have more than 2 dimensions but are only PLOTTING ON 2 dimensions, you can be looking at the
#SAME THING, just from different angles - cuz yeah like you DO need both slope & intercept to come up w/ the
#y's to try out to calculate that squared error/f(x) - so i think it's like you're really doing ONE simulation,
#where you're testing out BOTH values simultaneously cuz they're a COMBO aka **IN TANDEM**, so yeah they're both
#showing the same thing, just different angles >> but wait... they are doing independent calculations tho aren't
#they? so then how are they different? if they both needa test both? are they like each attacking it from a
#different angle and just simply ARRIVING AT the SAME CONCLUSION/DESTINATION?!?!??!?!!?!?
#but like i don't know how you really have different approaches/focuses if you need both cuz like even if one
#is 'focusing on the intercept' let's say... well it still needs to go thru the range of intercepts and pair
#it w/ the range of slopes. i mean yeah i guess you could literally just do the ORDER backwards of like, you go
#thru one intercept at a time, and try all slopes for it, while slope-focused one goes thru each slope,
#trying each intercept in the range w/ it, but then that kinda seems like a waste???/redundant
#unless they were both tryna like each independently focus on figuring out the optimum range for its piece/param,
#but that doesn't seem to be the case - again, both use both, and plus, seems like that's what CALIBRATION PERIOD
#IS FOR!!!!!!

#but yeah so, keep in mind that it's not like OLS where you can JUST find slope
#and then cuz we can just solve intercept from y=mx+b then - doesn't work like that here cuz we CAN'T
#SOLVE SLOPE & INTERCEPT INDEPENDENTLY!!!!!! WE SOLVE THEM *SIMULTANEOUSLY* THRU *TRIAL&ERROR*!!!!!!!!!!
#gotta guess at both to make combos!!!!!!
#but again, don't get confused - even tho we're running TWO equations in Bayes, it's not like each one is
#JUST FOR IT/THAT PARAM!!!!!!!!!!!

###############################################################################################

#so yeah, HERE IN BAYESIAN PROBABILITY/ESTIMATION WE'RE *STILL* MINIMIZING SQUARED ERROR!!!!!!! BUT USING
#A DIFFERENT EQUATION AND DOING SO *PROBABILISTICALLY*!!!!!!! lol "ballistically" ;P
#so because of the nature of the equation, it allows us to have this perfect inverse
#relationship that works out for us to give us the MAXIMUM OF OUR HELPER/HYPERPARAM
#OPTIMIZER FUNCTION @ THE *MINIMUM* OF OUR METRIC OF SQUARED ERROR!!!!!!
#or actually, is the 'metric' considered the/this y of the optimizer function???
###############################################################################################



|   iter    |  target   |     a     |     b     |
-------------------------------------------------
| 1         | 8.467     | 2.727     | 5.741     |
| 2         | 5.952     | 1.934     | 4.018     |
| 3         | 9.42      | 2.69      | 6.73      |
| 4         | 8.0       | 1.0       | 7.0       |
| 5         | 10.0      | 3.0       | 7.0       |


In [7]:
optimizer.maximize(n_iter=3,init_points=2)

|   iter    |  target   |     a     |     b     |
-------------------------------------------------
| 1         | 6.146     | 1.113     | 5.033     |
| 2         | 7.668     | 1.643     | 6.025     |
| 3         | 8.089     | 1.974     | 6.114     |
| 4         | 9.386     | 2.386     | 7.0       |
| 5         | 10.0      | 3.0       | 7.0       |


In [None]:
#NOTICE YOU'LL GET DIFF RESULTS EVERY TIME, as far as the SPECIFICS, aka the JOURNEY to getting to the max,
#cuz the init_points are sampled RANDOMLY (passing random_state=<some int> to BayesianOptimization(...) makes a run repeatable),
#but it'll still settle on/converge on/ARRIVE AT THE *SAME MAX* OF THE TARGET VALUE!!!!!!!
#their results are yet ANOTHER variation! but again, same max result/conclusion/solution!
#A ROW GETS HIGHLIGHTED PINK WHEN/IF IT'S THE REIGNING/INCUMBENT MAX (minus/except for the 1st),
#aka KING'S COURT RULES!!!!!Ruler lol - Reign until you get dethroned! but this shows the progression/journey!
#############################################################################################################

Great, now let's print the best parameters and the associated maximized target.

In [17]:
print(optimizer.max['params']);optimizer.max['target']

{'a': 3.0, 'b': 7.0}


10.0

In [8]:
#(if remove 'print()')
optimizer.max['params'];optimizer.max['target']

10.0

## Test it on real data using the Light GBM

The dataset we will be working with is the famous flight departures dataset. Our modeling goal will be to predict if a flight departure is going to be delayed by 15 minutes based on the other attributes in our dataset. As part of this modeling exercise, we will use Bayesian hyperparameter optimization to identify the best parameters for our model.

**<font color='teal'> You can load the zipped csv files just as you would regular csv files using Pandas read_csv. In the next cell load the train and test data into two separate dataframes. </font>**


In [10]:
train_df = pd.read_csv('flight_delays_train.csv.zip')
test_df = pd.read_csv('flight_delays_test.csv.zip')

**<font color='teal'> Print the top five rows of the train dataframe and review the columns in the data. </font>**

In [19]:
train_df.head()

       Month  DayofMonth  DayOfWeek  DepTime  UniqueCarrier  Origin  Dest  Distance  dep_delayed_15min
    0    c-8        c-21        c-7     1934             AA     ATL   DFW       732                  N
    1    c-4        c-20        c-3     1548             US     PIT   MCO       834                  N
    2    c-9         c-2        c-5     1422             XE     RDU   CLE       416                  N
    3   c-11        c-25        c-6     1015             OO     DEN   MEM       872                  N
    4   c-10         c-7        c-6     1828             WN     MDW   OMA       423                  Y


In [None]:
#so based on month, day of month, day of WEEK, time of day, carrier, origin, destination & distance
#then predictor/target feature is BINARY yes/no for whether delayed 15 mins or not (AT LEAST 15 mins, obv,
#i'm assuming). So note, basically the definition for 'delay' is that it was AT LEAST 15 minutes! if less than that,
#then grace period lol, off the hook

**<font color='teal'> Use the describe function to review the numeric columns in the train dataframe. </font>**

In [11]:
train_df.describe()

          DepTime    Distance
count    100000.0    100000.0
mean   1341.52388   729.39716
std    476.378445   574.61686
min           1.0        30.0
25%         931.0       317.0
50%        1330.0       575.0
75%        1733.0       957.0
max        2534.0      4962.0


In [12]:
train_df.describe().round(1)

        DepTime  Distance
count  100000.0  100000.0
mean     1341.5     729.4
std       476.4     574.6
min         1.0      30.0
25%       931.0     317.0
50%      1330.0     575.0
75%      1733.0     957.0
max      2534.0    4962.0


In [13]:
train_df

       Month  DayofMonth  DayOfWeek  DepTime  UniqueCarrier  Origin  Dest  Distance  dep_delayed_15min
    0    c-8        c-21        c-7     1934             AA     ATL   DFW       732                  N
    1    c-4        c-20        c-3     1548             US     PIT   MCO       834                  N
    2    c-9         c-2        c-5     1422             XE     RDU   CLE       416                  N
    3   c-11        c-25        c-6     1015             OO     DEN   MEM       872                  N
    4   c-10         c-7        c-6     1828             WN     MDW   OMA       423                  Y
  ...    ...         ...        ...      ...            ...     ...   ...       ...                ...
99995    c-5         c-4        c-3     1618             OO     SFO   RDD       199                  N
99996    c-1        c-18        c-3      804             CO     EWR   DAB       884                  N
99997    c-1        c-24        c-2     1901             NW     DTW   IAH      1076                  N
99998    c-4        c-27        c-4     1515             MQ     DFW   GGG       140                  N


In [None]:
#100,THOUSAND ENTRIES!!!!!

In [None]:
#ohhh right, so automatically ONLY NUMERICAL FEATURES ARE INCLUDED!!!!

In [14]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Month              100000 non-null  object
 1   DayofMonth         100000 non-null  object
 2   DayOfWeek          100000 non-null  object
 3   DepTime            100000 non-null  int64 
 4   UniqueCarrier      100000 non-null  object
 5   Origin             100000 non-null  object
 6   Dest               100000 non-null  object
 7   Distance           100000 non-null  int64 
 8   dep_delayed_15min  100000 non-null  object
dtypes: int64(2), object(7)
memory usage: 6.9+ MB


Notice that `DepTime` is the departure time stored as a number in 24-hour HHMM format (for example, 1934 means 19:34).

In [None]:
#hmm, is it appropriate for us to use TIME tho, since it's CIRCULAR? that's interesting right,
#almost like its OWN SEPARATE CATEGORY! from discrete & continuous (and not 'discrete style continuous either lol)
#cuz like 2359 (11:59PM) is basically the SAME as 0000 (12AM MIDNIGHT!)... but they're MAXIMALLY SEPARATED
#ON THIS SCALE!!!!!!!! SO HOW DOES THAT WORK?!?!
######################################################################################################
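One common answer to the circular-time question above is to place the time on a circle with a sine and a cosine, which is what the `make_harmonic_features_sin` and `make_harmonic_features_cos` helpers defined later in this notebook do. The sketch below uses only the standard library; the `harmonic` and `distance` names are made up for illustration.

```python
import math

def harmonic(dep_time, period=2400):
    # Map an HHMM value to a point on the unit circle (same formula as the notebook helpers).
    angle = 2 * math.pi * dep_time / period
    return math.sin(angle), math.cos(angle)

def distance(p, q):
    # Straight-line distance between two points on the circle.
    return math.hypot(p[0] - q[0], p[1] - q[1])

late, midnight, noon = harmonic(2359), harmonic(0), harmonic(1200)
print(distance(late, midnight))  # small: 23:59 and 00:00 are neighbours on the circle
print(distance(noon, midnight))  # largest possible: opposite sides of the circle
```

On the raw 0 to 2400 scale, 2359 and 0 are maximally separated, while in the (sin, cos) pair they sit next to each other. Both coordinates are needed, because a single sine or cosine value is shared by two different times of day.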

**<font color='teal'>The response variable is 'dep_delayed_15min', a categorical column, so we need to map its values Y (yes) and N (no) to 1 and 0. Run the code in the next cell to do this.</font>**

In [22]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Month              100000 non-null  object
 1   DayofMonth         100000 non-null  object
 2   DayOfWeek          100000 non-null  object
 3   DepTime            100000 non-null  int64 
 4   UniqueCarrier      100000 non-null  object
 5   Origin             100000 non-null  object
 6   Dest               100000 non-null  object
 7   Distance           100000 non-null  int64 
 8   dep_delayed_15min  100000 non-null  object
dtypes: int64(2), object(7)
memory usage: 6.9+ MB


In [None]:
#nice no at least APPARENT/LITERAL BLANKS!!!!! doesn't guarantee no MISSINGS or BALONEYS/BOGEYS!!!!!/BOGUESS! :P

In [21]:
train_df.dep_delayed_15min.unique()

array(['N', 'Y'], dtype=object)

In [42]:
#train_df = train_df[train_df.DepTime <= 2400].copy()
#NOTE^^This was just here, already commented out, they don't say anything about this, but is this
#to filter bogus values? OH! yeah if we look @/to the describe table, we see the max time *IS* >2400!!!!!
#SO THERE MAY BE OTHERS!!!!
#BUT THEN WHY WOULDN'T THEY RUN THIS?!?! i guess maybe they're simplifying it for us for now but reminding us of
# / giving us an idea of some kind of stuff/examples for preprocessing / cleanup we should be looking for normally
#on the job / in the field.... Lol or maybe commenting it out was legit an accident :P
#########################################################################################################
y_train = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
#note gotta convert to NUMBERS ofc!
#SO REMEM! .MAP() IS A SERIOUS TOOL! YOU HAVE TO USE IT VERY CAREFULLY/BE VERY SURE!
#compared to .replace(manual_mapping_dict), like we usually do, .map() will RENDER ANYTHING THAT'S *NOT*
#A MATCH AKA *NOT* CONTAINED IN THE DICT AS A *NAN*!!!!!!!!!
#but we can check w/ a .unique, esp w/ binaries it's very easy - as done above!^
#and maybe it's that the only non-specified values ARE nans, so those won't be affected lol, but we shoulda
#started w/ confirming there aren't any of those/removing any if there are!
#checked above^ and we're good!
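The check described in the comments above (make sure every value is covered by the dict before calling `.map()`) can be written as a tiny guard. This is a sketch on a plain Python list standing in for the column; `unmapped_values` is a hypothetical helper name, not a pandas function.

```python
def unmapped_values(values, mapping):
    # Values that a dict-based .map() would turn into NaN because they are not keys.
    return set(values) - set(mapping)

mapping = {'Y': 1, 'N': 0}
labels = ['N', 'Y', 'N', 'Y']  # stand-in for train_df['dep_delayed_15min']

print(unmapped_values(labels, mapping))          # empty set: safe to map
print(unmapped_values(labels + ['?'], mapping))  # the '?' entry would become NaN
```

With the real data, the same idea would be applied to the column's unique values before building `y_train`.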

In [None]:
#ohhhh okay, sA, just realized this is different than what we usually do cuz
#we're working on / treating / cleaning & doing EDA *ONLY* on the TRAINING
#portion/split?!?!?! WHY'S THAT?!? why didn't we START with the FULL??
#is this like SUPER 'data leakage' / cybersecurity paranoia?! lol
########################################################################

## Feature Engineering
Use these defined functions to create additional features for the model. Run the cell to add the functions to your workspace.

In [33]:
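#integer-encodes a column: each distinct value gets an integer code (0..n_classes-1, in sorted order)
def label_enc(df_column):
    df_column = LabelEncoder().fit_transform(df_column)
    return df_column

#these two put a value on a circle: the angle is value/period of a full turn, so values near 0
#and values near `period` (2400 here, for HHMM times) end up close together
def make_harmonic_features_sin(value, period=2400):
    value = value * (2 * np.pi / period) #no in-place *=, so an array passed in never gets modified
    return np.sin(value)

def make_harmonic_features_cos(value, period=2400):
    value = value * (2 * np.pi / period) #same angle as above, cosine coordinate
    return np.cos(value)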

#'MANUFACTURED FEATURES' AKA *FEATURE ENGINEERING* INDEED!!!!!
def feature_eng(df):
    #THESE GIVE US *ADDITIONAL GROUPINGS* FOR FLIGHTS SO THAT WE CAN SEE WHAT OTHER FLIGHTS WERE THAT SAME COMBO!
    df['flight'] = df['Origin']+df['Dest'] #i guess if there are REPEAT routes in here then we can see if there are any patterns/consistencies w/ those! note that ORDER MATTERS!!!! like CHI-SEA is diff than SEA-CHI!!!!! MAKES SENSE THO!!!!!
    df['flightUC'] = df['flight']+df['UniqueCarrier'] #this one builds off what we just made!
    df['DestUC'] = df['Dest']+df['UniqueCarrier'] #so like MDW-SOUTHWEST
    df['OriginUC'] = df['Origin']+df['UniqueCarrier'] #also could be MDW-SOUTHWEST! but since these are DIFF columns, they would be totally separate! like it wouldn't make sense for the origin and destination city to be the same!

    #these ones just clean up the formatting, like removing the weird 'c-' prefix from these date monikers
    df['Month'] = df.Month.map(lambda x: x.split('-')[-1]).astype('int32')
    df['DayofMonth'] = df.DayofMonth.map(lambda x: x.split('-')[-1]).astype('uint8')
    df['DayOfWeek'] = df.DayOfWeek.map(lambda x: x.split('-')[-1]).astype('uint8')

    #pulls out the HOUR from the HHMM time stamp: divide by 100, then the int cast truncates, so 1934 -> 19
    #(so it's NOT a decimal 'colon' version of the time, just the hour as an integer)
    #note the max DepTime of 2534 we saw in describe() gives hour 25, which isn't a real hour
    df['hour'] = df.DepTime.map(lambda x: x/100).astype('int32')

    #these ones create categories / labels / groups! but rather than CUTTING which REPLACES specifics/numbers
    #w/ broader general categories, this is in ADDITION TO... But i guess we could achieve the same thing
    #w/ cutting if we just created a COPIED COLUMN?!?!
    ########################################################################
    #labels for time of month
    df['begin_of_month'] = (df['DayofMonth'] < 10).astype('uint8')
    df['midddle_of_month'] = ((df['DayofMonth'] >= 10)&(df['DayofMonth'] < 20)).astype('uint8')
    df['end_of_month'] = (df['DayofMonth'] >= 20).astype('uint8')
    #labels for time of day
    df['morning'] = df['hour'].map(lambda x: 1 if (x <= 11)& (x >= 7) else 0).astype('uint8')
    df['day'] = df['hour'].map(lambda x: 1 if (x >= 12) & (x <= 18) else 0).astype('uint8')
    df['evening'] = df['hour'].map(lambda x: 1 if (x >= 19) & (x <= 23) else 0).astype('uint8')
    df['night'] = df['hour'].map(lambda x: 1 if (x >= 0) & (x <= 6) else 0).astype('int32')
    #labels for time of YEAR!
    df['winter'] = df['Month'].map(lambda x: x in [12, 1, 2]).astype('int32')
    df['spring'] = df['Month'].map(lambda x: x in [3, 4, 5]).astype('int32')
    df['summer'] = df['Month'].map(lambda x: x in [6, 7, 8]).astype('int32')
    df['autumn'] = df['Month'].map(lambda x: x in [9, 10, 11]).astype('int32')

    #labels for Weekend vs. Weekday... though, it says 'Holiday'... but that's misleading, though a GOOD
    #idea to have a THIRD if possible for ACTUAL like federal US holidays!!!
    #also NOTE: >= 5 flags days 5, 6 AND 7, so that's THREE days, not just a 2-day weekend
    ########################################################################
    df['holiday'] = (df['DayOfWeek'] >= 5).astype(int) 
    df['weekday'] = (df['DayOfWeek'] < 5).astype(int)

    #These are like counts/metrics pertaining to one piece of that row/record
    #wait these all do *.TRANSFORM* - isn't that only for MODELS that have been FIT?!
    ########################################################################
    #we can make sense of everything OUTSIDE of the .transform(), aka the GROUPBY, but don't understand the .transform()...
    #so i'll at least comment on the groupbys counts
    df['airport_dest_per_month'] = df.groupby(['Dest', 'Month'])['Dest'].transform('count') #number of flights (rows) INTO that dest airport in that month
    df['airport_origin_per_month'] = df.groupby(['Origin', 'Month'])['Origin'].transform('count') #number of flights OUT OF that origin airport in that month
    df['airport_dest_count'] = df.groupby(['Dest'])['Dest'].transform('count') #number of flights into that dest airport overall
    df['airport_origin_count'] = df.groupby(['Origin'])['Origin'].transform('count') #number of flights out of that origin airport overall (NOT the number of unique origins, so len(df.col.unique()) wouldn't give this)
    df['carrier_count'] = df.groupby(['UniqueCarrier'])['Dest'].transform('count') #number of flights that carrier/airline had overall ('Dest' is just the column whose rows get counted)
    df['carrier_count_per month'] = df.groupby(['UniqueCarrier', 'Month'])['Dest'].transform('count') #number of flights that airline had in that month
    
    #OHHHHHHHHHHHHHHHHHHHHHHHHHHHH!
    #okay now aH i think i get it for what .transform is doing
    #so, this is something i've done in my own excel analytics gymnastics acrobatics lol
    #it's basically putting some summary value for some grouping related to that row, for *EVERY* grouping in that row!!!
    #so, we start off w/ each of these as making a new column right
    #so the purpose of that column is to have the value of the grouping that pertains to THAT ROW!!!!
    #so EVERY member/row belonging to a grouping has the *SAME* exact value for that column!!!!!!!!!!
    #so like, let's pick a simple one - airport_origin_count - so that's ONLY grouped on every ORIGIN,
    #thus, remem w/ each row being a *FLIGHT*, ALL the flights of the SAME origin will show the SAME NUMBER/VALUE
    #FOR THAT COLUMN OF THE TOTAL NUMBER OF (SUM) FLIGHTS WHERE THAT ORIGIN AIRPORT WAS THE ORIGIN AIRPORT!
    #SO AKA HOW MANY FLIGHTS IN THIS DATAFRAME ARE FROM THAT ORIGIN!!!!!!
    #SO SAY/TAKE CHICAGO MIDWAY (MDW) - SAY 100 OF THE FLIGHTS IN THIS DATAFRAME ARE MIDWAY ORIGIN FLIGHTS;
    #THEN, FOR THIS 'AIRPORT_ORIGIN_COUNT' FIELD, FOR *EVERY* MIDWAY FLIGHT, *IT'S GONNA SAY/BE *100*!!!!!!
    ####################################################################################################

    
    #cyclical encoding of departure time: sin & cos put the HHMM value on a circle, so 2359 & 0000 land close together
    df['deptime_cos'] = df['DepTime'].map(make_harmonic_features_cos)
    df['deptime_sin'] = df['DepTime'].map(make_harmonic_features_sin)


    return df.drop('DepTime', axis=1)
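A quick way to sanity-check the time-of-day flags is to replay the same hour extraction and ranges on single values. The sketch below copies the thresholds from `feature_eng`; the `time_of_day_flags` helper itself is hypothetical and only exists for this check.

```python
def time_of_day_flags(dep_time):
    # Same hour extraction and hour ranges as in feature_eng.
    hour = int(dep_time / 100)
    return {
        'morning': int(7 <= hour <= 11),
        'day': int(12 <= hour <= 18),
        'evening': int(19 <= hour <= 23),
        'night': int(0 <= hour <= 6),
    }

print(time_of_day_flags(1934))  # hour 19: only 'evening' is set
print(time_of_day_flags(2534))  # hour 25: no flag is set at all
```

Since `describe()` shows a maximum `DepTime` of 2534, at least one row ends up with all four flags at zero, which is one reason the commented-out `DepTime <= 2400` filter may be worth considering.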

In [31]:
train_df.groupby(['UniqueCarrier', 'Month'])['Dest'].count()

UniqueCarrier  Month
AA             c-1      797
               c-10     770
               c-11     756
               c-12     785
               c-2      708
                       ... 
YV             c-5      170
               c-6      190
               c-7      198
               c-8      194
               c-9      179
Name: Dest, Length: 261, dtype: int64

In [26]:
#okay so if we take this as an example:
#train_df.groupby(['Dest', 'Month'])['Dest'].transform('count')

#well, let's take out the .transform for now and see

#okay so this does what we expect aH - groups each destination airport & month combo
#and simply counts how many times each combo comes up

train_df.groupby(['Dest', 'Month'])['Dest'].count()

Dest  Month
ABE   c-1      7
      c-10     3
      c-11     7
      c-12     4
      c-2      8
              ..
YUM   c-5      5
      c-6      3
      c-7      3
      c-8      2
      c-9      2
Name: Dest, Length: 3037, dtype: int64

In [24]:
#okay so then... what the heck is .transform doing?
train_df.groupby(['Dest', 'Month'])['Dest'].transform('count')

0        373
1        168
2        104
3         46
4         21
        ... 
99995      2
99996      4
99997    266
99998      2
99999     68
Name: Dest, Length: 100000, dtype: int64

In [None]:
##################################################
#hmmm - yeah not sure... it's got one number for every ROW in the df... but what
#would that number represent? but also, it's a transform on the GROUPBY/pivot...
#which ISN'T all the diff original indiv rows.... So not sure how it's coming up w/ that??

#>>>>>>>>>>>>>NOW FIGURED OUT! PUT ABOVE!
##################################################
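
The difference between `.count()` and `.transform('count')` is easiest to see on a tiny made-up frame. The `toy` data below is invented purely for illustration; only the two groupby calls mirror the cells above.

```python
import pandas as pd

# Invented five-row example with the same two columns used above.
toy = pd.DataFrame({
    "Dest":  ["MDW", "MDW", "ORD", "MDW", "ORD"],
    "Month": ["c-1", "c-1", "c-1", "c-2", "c-1"],
})

# .count() returns ONE value per (Dest, Month) group.
per_group = toy.groupby(["Dest", "Month"])["Dest"].count()

# .transform('count') computes the same group counts, then broadcasts each
# count back to every original row that belongs to that group.
per_row = toy.groupby(["Dest", "Month"])["Dest"].transform("count")

print(per_group)
print(per_row.tolist())
```

So the result of `transform` has one number per row of the original dataframe, and that number is the size of the group the row falls into, which is why it can be assigned straight back as a new column.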

In [34]:
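# help(pd.core.groupby.SeriesGroupBy.transform)   # a bare help(transform) fails: 'transform' is a groupby method, not a standalone name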

Concatenate the training and testing dataframes.


In [None]:
#lol reverse of how we usually do it... but kinda like serious segregation

In [15]:
test_df

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-7,c-25,c-3,615,YV,MRY,PHX,598
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377
4,c-6,c-6,c-3,1505,UA,ORD,STL,258
...,...,...,...,...,...,...,...,...
99995,c-6,c-5,c-2,852,WN,CRP,HOU,187
99996,c-11,c-24,c-6,1446,UA,ORD,LAS,1515
99997,c-1,c-30,c-2,1509,OO,ORD,SGF,438
99998,c-1,c-5,c-5,804,DL,LGA,ATL,761


In [36]:
#DROP THE THING WE'RE TRYNA PREDICT! like making the groom leave the room for the bride lollll
full_df = pd.concat([train_df.drop('dep_delayed_15min', axis=1), test_df])
full_df = feature_eng(full_df)

#hmmm okay, speaking of groom: are we now 'grooming' the FULL dataset when only the training portion was treated?
#no, not really: we never did any 'treating', we only EXPLORED the training data (EDA). we ALMOST cleaned out
#the DepTime values >2400 but didn't. so now we COMBINE the train & test dfs. *NOTE*: the *TEST DF* ALREADY
#HAS THE TARGET y DROPPED, so we just drop it from TRAIN, concat the two, and the columns are ALIGNED.
#why do it this way? most likely so the group counts and label encodings are computed over train+test
#together and stay consistent between the two. with the target gone, what's left is our X_FEATURES matrix,
#and we apply the functions we just made to this COMBINED matrix to feature-engineer train+test in one go.

#NOTE: in the y=mx+b Bayesian example we were predicting y from the line: while optimizing, we tested
#different m/b values to get PREDICTED y's, which we used as the yp's in the NormDistEqn.
#here we're predicting whether there's a delay or not from all the other info, so this is classification
#rather than regression, done with a decision tree ensemble, which is what Light Gradient Boosting Machine
#builds. what 'equation' is it using? with objective 'binary', LightGBM minimizes log loss (not squared error)
#while training, and each hyperparameter combo is then SCORED with AUC, which we want to MAXIMIZE.
#it can be graded because we give it the 'right answers' (y_train).

#remem: we did save the target y's (dep_delayed_15min) from TRAINING as y_train, we just hadn't dropped the column until now.
#that makes sense: we need the TRAINING y's to train, and test y's would only be used to grade final performance.
#but the test file has no y's at all, so this notebook never evaluates the model on the test set.


Apply the earlier defined feature engineering functions to the full dataframe.

In [37]:
for column in ['UniqueCarrier', 'Origin', 'Dest','flight',  'flightUC', 'DestUC', 'OriginUC']:
    full_df[column] = label_enc(full_df[column])


Split the new full dataframe into X_train and X_test. 

In [38]:
#so... just splitting it BACK to train & test, except this time WITHOUT the y, and now feature_engineered features
#just using the SHAPE of original train & test
X_train = full_df[:train_df.shape[0]]
X_test = full_df[train_df.shape[0]:]
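
The concat, engineer, split-back pattern can be sketched on toy data. Everything named `toy_*` is invented for illustration, and `ignore_index=True` is an addition of this sketch (the notebook's `pd.concat` call does not pass it); the positional slicing works the same either way.

```python
import pandas as pd

toy_train = pd.DataFrame({"Dest": ["MDW", "ORD", "MDW"], "target": ["N", "Y", "N"]})
toy_test = pd.DataFrame({"Dest": ["ORD", "MDW"]})          # no target column, like test_df

toy_y = toy_train["target"]                                # keep the labels aside
toy_full = pd.concat([toy_train.drop("target", axis=1), toy_test], ignore_index=True)

# Any count feature is now computed over train+test together.
toy_full["dest_count"] = toy_full.groupby("Dest")["Dest"].transform("count")

# Split back by position: the first n_train rows are the training rows.
n_train = toy_train.shape[0]
toy_X_train = toy_full[:n_train]
toy_X_test = toy_full[n_train:]

print(toy_X_train)
print(toy_X_test)
```

Because `pd.concat` keeps the rows in order, the row count of the original training frame is enough to undo the concatenation.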

Create a list of the categorical features.

In [39]:
categorical_features = ['Month',  'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest','flight',  'flightUC', 'DestUC', 'OriginUC']

Let's build a LightGBM model to test the Bayesian optimizer.

In [None]:
#grape. GradientBoosting was the one thing i didn't really learn

### [LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:

* Faster training speed and higher efficiency.
* Lower memory usage.
* Better accuracy.
* Support of parallel and GPU learning.
* Capable of handling large-scale data.

First, we define the function we want to maximize: it computes the cross-validated metric of LightGBM for a given set of parameters.

Some parameters, such as num_leaves, max_depth, min_child_samples, and min_data_in_leaf, must be integers, so the function casts them with int().
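
The casting is needed because the optimizer proposes floating-point values for every parameter. A minimal sketch of that step in isolation is below; `cast_int_params` is a hypothetical helper, not part of LightGBM or the BayesianOptimization library, and it truncates with `int()` the same way the function in the next cell does.

```python
def cast_int_params(params, int_keys):
    """Return a copy of params with the keys listed in int_keys truncated to int."""
    return {key: int(value) if key in int_keys else value
            for key, value in params.items()}

# Example values of the kind the optimizer proposes (floats for everything).
proposed = {"num_leaves": 678.6, "max_depth": 49.67, "lambda_l1": 0.002822}
cleaned = cast_int_params(proposed, int_keys={"num_leaves", "max_depth"})

print(cleaned)
```

Note that `int()` truncates rather than rounds, so 49.67 becomes 49.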

In [None]:
#okay so they're saying like, just like we saw above simple & w/ Bayes NUTS NormDistEqn, we gotta have a
#FUNCTION TO MAXIMIZE!!!! so that's what we're doing below, w/ LightGBM, which will utilize decision trees
#for this classification prediction!

In [40]:
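#so we gotta pass in ALL of these, and none of them have defaults: that's fine, because
#BayesianOptimization calls this function with every argument filled in from the bounds we give it
########################################################################
def lgb_eval(num_leaves,max_depth,lambda_l2,lambda_l1,min_child_samples, min_data_in_leaf):
    params = {
        "objective" : "binary",
        "metric" : "auc",
        'is_unbalance': True,
        # the optimizer proposes floats, so the integer-valued params are cast with int()
        "num_leaves" : int(num_leaves),
        "max_depth" : int(max_depth),
        "lambda_l2" : lambda_l2,
        "lambda_l1" : lambda_l1,
        "num_threads" : 20,
        # min_child_samples is a LightGBM alias of min_data_in_leaf, so only one of the two takes effect
        "min_child_samples" : int(min_child_samples),
        'min_data_in_leaf': int(min_data_in_leaf),
        "learning_rate" : 0.03,
        "subsample_freq" : 5,
        "bagging_seed" : 42,
        "verbosity" : -1
    }
    lgtrain = lightgbm.Dataset(X_train, y_train,categorical_feature=categorical_features)
    # 3-fold stratified CV, up to 1000 boosting rounds, stopping after 100 rounds without improvement
    # (early_stopping_rounds as a cv() argument works on LightGBM < 4.0; newer versions expect
    # callbacks=[lightgbm.early_stopping(100)] instead, and the result key may differ)
    cv_result = lightgbm.cv(params,
                       lgtrain,
                       1000,
                       early_stopping_rounds=100,
                       stratified=True,
                       nfold=3)
    # mean AUC across the folds at the last boosting round kept
    return cv_result['auc-mean'][-1]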

Apply the Bayesian optimizer to the function we created in the previous step to identify the best hyperparameters. We will run 10 iterations and set init_points = 2.


In [43]:
#init_points=2 means the first TWO evaluations are at RANDOMLY sampled points, to give the optimizer something to start from!
#n_iter=10 is then the number of guided (Bayesian) steps that follow, so 12 evaluations of lgb_eval in total
#they're called 'points' because each one is a POINT in hyperparameter space that gets evaluated

#leaves! max_depth! so they were right! this is decision tree stuff!!!
#Here's wiki!:
#"Gradient boosting is a machine learning technique used in regression and classification tasks, among others.
#It gives a prediction model in the form of an ensemble of weak prediction models, which are typically
#decision trees.[1][2] When a decision tree is the weak learner, the resulting algorithm is called
#gradient-boosted trees; it usually outperforms random forest.[1][2][3] A gradient-boosted trees model is built
#in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing
#optimization of an arbitrary differentiable loss function." >>>THAT EXPLAINS IT!!!!!!

lgbBO = BayesianOptimization(lgb_eval, {'num_leaves': (25, 4000),
                                                'max_depth': (5, 63),
                                                'lambda_l2': (0.0, 0.05),
                                                'lambda_l1': (0.0, 0.05),
                                                'min_child_samples': (50, 10000),
                                                'min_data_in_leaf': (100, 2000)
                                                })

#ahhhh, okay: lgb_eval, which we just defined above, is our OBJECTIVE function (the thing being maximized),
#and it takes all those arguments, as shown in the cell above. the {} DICT here does NOT give their values:
#it gives the (min, max) BOUNDS to search within, and the optimizer picks the actual values on each iteration!
#that's why no defaults are needed: every argument gets filled in by the optimizer.
#normally we'd call a custom function directly, like lgb_eval(num_leaves=25, ...etc etc),
#BUT REMEM: THIS IS HOW WE USED THE BAYESIANOPTIMIZATION CALL IN OUR SIMPLE SAMPLE EXAMPLE ABOVE!
#aka 'simple_func' was simply a+b:
# def simple_func(a, b):
#     return a + b

#so w/ BO, we did, remem:

# optimizer = BayesianOptimization(simple_func,{'a': (1, 3),'b': (4, 7)})

#aka we SPECIFIED THE *RANGES* THAT a & b COULD BE SEARCHED OVER (not single values)!
#########################################################################################################
#terminology a little confusing tho cuz lgb_eval() func^^ has that WHOLE LIST of 'params' as it calls them,
#but then what we'd normally call 'args' of the functions are the hyperparams of this function we're tryna optimize!!!!!!
#remem - params are not variables but *CONSTANTS*!!!!! like in y=mx+b, m&b: constants/params, y/x: VARIABLES!!!!!!
#########################################################################################################

lgbBO.maximize(n_iter=10, init_points=2)

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
| 1         | 0.7193    | 0.002822  | 0.03023   | 49.67     | 6.078e+03 | 444.6     | 678.6     |
| 2         | 0.7194    | 0.04816   | 0.01211   | 9.398     | 6.357e+03 | 308.9     | 1.189e+03 |
| 3         | 0.7437    | 0.04154   | 0.01364   | 9.394     | 5.024e+03 | 1.975e+03 | 2.909e+03 |
| 4         | 0.7436    | 0.04682   | 0.01212   | 31.43     | 3.594e+03 | 1.994e+03 | 3.903e+03 |
| 5         | 0.728     | 0.00967   | 0.01261   | 60.63     | 2.466e+03 | 878.2     

In [None]:
#OHHHHHHH, got it - so just like we had columns of target | a | b in our simple sample, HERE WE HAVE
#COLUMNS FOR THE VALUES OF *EACH OF OUR HYPERPARAMETERS!!!!!!!!!!!!!!!!!!!!!!!!
#AGAIN, THESE HYPERPARAMS ARE SIMPLY THE *ARGS* OF OUR CUSTOM DEFINED LGB_EVAL() FUNCTION!!!!!!!
#AND THE "TARGET" IS THE F(X) VALUE OF OUR OPTIMIZER FUNCTION!!!!!! AKA REMEM WE'RE TRYNA GET THE *MAXIMUM*
#OF THAT!!!!!! GIVEN THE COMBINATION OF HYPERPARAM VALUES!!!!!
#SO LIKE IN BEST FIT LINE / NORMALDISTEQN NUTS, THE 'TARGET' WAS THE F(X) OF THE NORMALDISTEQN WE WERE TRYNA
#GET MAX OF WHICH REPRESENTED *MINIMUM SQUARED ERROR*!!!!!!
#SO IT'S GONNA BE ONE OF THE PINK ROWS OFC!!!!!
#HERE INTERESTINGLY IT'S A TIE AS FAR AS THE DECIMALS WE CAN SEE!!! 2nd & 6th iter!!!
#the table only prints 4 significant digits, so to break the tie look at lgbBO.max, which has full precision

#BUT AGAIN - WHAT IS THE ACTUAL LIKE "EQUATION" WE'RE USING/OPTIMIZING?! LIKE THAT WE COULD PLOT?!
#AND ALSO, *WHERE IS THE ***y_TEST*** FOR JUDGING THE BEST PERFORMER?!
#OR ARE WE *ONLY* DOING IT ON TRAIN??
#BUT IF WE'RE ONLY DOING ON TRAIN, WHAT WAS THE POINT OF SPLITTING INTO TEST?!
#OR IS THIS JUST LIKE GETTING US STARTED?! and leaving out the rest for simplicity?!
#but yeah, WONDERING LIKE WHAT DO THESE TARGET VALUES REPRESENT?! LIKE IS IT *ACCURACY*?!? which would be cool cuz,
#unlike squared error, here we *ALREADY* WANT THE MAX!!!!! THE HIGHER THE BETTER!!!!!


#OHHHHHHHHHHHHHHHHH! okay just saw / looked closer now! So - YES - we ARE only doing on TRAIN! look at this piece
#from above - the BOTTOM piece/SECOND half! >> 'lgTRAIN'!!!!!!:

    # lgtrain = lightgbm.Dataset(X_train, y_train,categorical_feature=categorical_features)
    # cv_result = lightgbm.cv(params,
    #                    lgtrain,
    #                    1000,
    #                    early_stopping_rounds=100,
    #                    stratified=True,
    #                    nfold=3)
    # return cv_result['auc-mean'][-1]

#so yeah lgtrain IS just for TRAIN, and as you see the next line is the *CROSS-VALIDATION RESULT*, which
#USES LGTRAIN! AND THE VERY LAST LINE SHOWS US THE SCORING METRIC, WHICH IS NONE OTHER THAN
#'AUC': the mean across the 3 folds, to be exact!
#AUC is the area under the ROC curve, a standard metric for binary classification: it summarizes the
#trade-off between true positive rate and false positive rate across ALL decision thresholds (0.5 = random, 1.0 = perfect)
####################################################################################################################
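
What an AUC value means can be shown without any library: it equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counting half. The sketch below is a plain-Python illustration of that definition, not how LightGBM computes it internally; `pairwise_auc` and the toy labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Four flights: two on time (0), two delayed (1), with predicted delay scores.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
perfect_auc = pairwise_auc([0, 1], [0.2, 0.9])

print(toy_auc, perfect_auc)
```

Read this way, a target of about 0.74 says that a randomly chosen delayed flight is scored above a randomly chosen on-time flight roughly 74% of the time.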

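#Okay, but - if we only used train, and we're doing DECISION trees esp, THEN SHOULDN'T THE
#MAX HAVE BEEN A *PERFECT* OVERFIT SCORE OF 1?!
#no: lightgbm.cv scores each model on the held-out FOLD (nfold=3), i.e. on rows it did NOT train on,
#so the AUC is an out-of-sample estimate and doesn't go to 1 just from memorizing the training rows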
####################################################################################################################

In [None]:
#ohhh, so if you watch it in real-time come up w/ this list / go thru these iterations,
#it makes pink/purple the CURRENT max/best up till then?? but then it shoulda made the VERY FIRST purp then
#no? maybe it excludes that. AHHHH, yeah, so i think that's right! that adds up! is consistent with what we see!
#cuz if you look at the TARGET column, which is what we're after the max of, like back in our run
#in the beginning when we were confused on why there were 2 purple rows, it's cuz TWICE the current run's target
#was the NEW HIGH! and yeah, the very last one ended up being the HIGHEST and MAX HIGH POSSIBLE?!
#and if you look at their pre-loaded version, the 4th was the overall max, and it worked out that yeah - that's
#correct, NO OTHER ROW SHOULD'VE LIT UP/BEEN HIGHLIGHTED CUZ NO ONE EVER DETHRONED THE CURRENT MAX (lol like the name!
# like, who's the MAX?!)/ KING'S COURT! 7.8, 7.1, 7.4 [10.0], 9.4!!!!!!!
#############################################################################################################
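
The highlighting rule described in this note can be written down directly. This is a sketch of the rule as observed in the printed log (a row lights up when its target beats every earlier target, and the first row is not highlighted), not the library's logging code; `new_high_iterations` is a hypothetical helper and the targets are the five visible rows of the table above.

```python
def new_high_iterations(targets):
    """Return the 1-based iteration numbers whose target beats all earlier targets."""
    highlighted = []
    best_so_far = targets[0]          # the first row only sets the initial best
    for iteration, target in enumerate(targets[1:], start=2):
        if target > best_so_far:
            best_so_far = target
            highlighted.append(iteration)
    return highlighted

visible_targets = [0.7193, 0.7194, 0.7437, 0.7436, 0.728]
highlighted_rows = new_high_iterations(visible_targets)

print(highlighted_rows)
```

For the five visible rows this gives iterations 2 and 3: each of them set a new best target, while 4 and 5 did not.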

In [None]:
#SO WHAT WAS THAT WHOLE *GIF* THING ABOUT RELATIVE TO THIS?!? HOW DOES THAT APPLY HERE / WHAT DOES IT LOOK
#LIKE FOR HERE?! maybe gotta look at full documentation example - took a quick look n didn't see, nor about
#yTEST!!!!!!
#############################################################################################################

In [None]:
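#what's the warning about?:
#[Warning] min_data_in_leaf is set=444, min_child_samples=6077 will be ignored.
#Current value: min_data_in_leaf=444
#>>> min_child_samples is just an ALIAS of min_data_in_leaf in LightGBM, so we passed the same
#setting twice; LightGBM keeps the min_data_in_leaf value and ignores the min_child_samples one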
#############################################################################################################

Print the best result by using the '.max' property.

In [44]:
lgbBO.max

{'target': 0.743677184606366,
 'params': {'lambda_l1': 0.04368878406260319,
  'lambda_l2': 0.04998807104235295,
  'max_depth': 52.535726432546326,
  'min_child_samples': 6161.725956471824,
  'min_data_in_leaf': 1960.867045207072,
  'num_leaves': 3998.716330891756}}

Review an individual step by indexing the '.res' list; for example, '.res[0]' is the first evaluation.

In [45]:
lgbBO.res[0]

{'target': 0.7192540779609389,
 'params': {'lambda_l1': 0.002822101387856496,
  'lambda_l2': 0.030229580059482154,
  'max_depth': 49.66714825748048,
  'min_child_samples': 6077.71394513601,
  'min_data_in_leaf': 444.62104547614024,
  'num_leaves': 678.6063438891952}}

In [None]:
#ohh okay, so lgbBO ('light gradient boosting bayesian optimizer'?) MAX is the MAX target run/RES,
#aka [RES] is AN INDIVIDUAL RUN! so this one, [0] is simply the first one above!
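
Since `.res` is a plain list of dicts with `'target'` and `'params'` keys, ordinary Python is enough to rank the runs. The sketch below copies the two results printed above into a stand-in list called `results` (the variable name is hypothetical) so that it runs without re-doing the optimization; with the real optimizer the same lines would apply to `lgbBO.res`.

```python
# Stand-in for lgbBO.res, using the two results printed above.
results = [
    {"target": 0.7192540779609389,
     "params": {"lambda_l1": 0.002822101387856496, "lambda_l2": 0.030229580059482154,
                "max_depth": 49.66714825748048, "min_child_samples": 6077.71394513601,
                "min_data_in_leaf": 444.62104547614024, "num_leaves": 678.6063438891952}},
    {"target": 0.743677184606366,
     "params": {"lambda_l1": 0.04368878406260319, "lambda_l2": 0.04998807104235295,
                "max_depth": 52.535726432546326, "min_child_samples": 6161.725956471824,
                "min_data_in_leaf": 1960.867045207072, "num_leaves": 3998.716330891756}},
]

# Rank the runs from best to worst target; the first entry plays the role of .max.
ranked = sorted(results, key=lambda run: run["target"], reverse=True)
best_run = ranked[0]

for run in ranked:
    print(round(run["target"], 4), round(run["params"]["num_leaves"], 1))
```

The parameter values come back as floats, so the integer-valued ones still need the same `int()` cast used inside `lgb_eval` before they are passed to a final model.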

In [None]:
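#oh WOW, wait - did we write / comment on ALL those feature engineering functions and then only use the LABEL ENCODER?!
#no: feature_eng(full_df) was applied back in In [36], so the engineered columns (e.g. deptime_cos/deptime_sin and the
#flight / flightUC / DestUC / OriginUC columns that get label-encoded in In [37]) ARE in X_train; label_enc was just the last step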
#####################################################################################################################