<h1>TMDB Features for CatBoost and CatBoost Optimization</h1>
<h2>What is CatBoost?</h2>

<p>A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.</p>

<p>CatBoost is developed by Yandex. Loosely speaking, it is a Russian counterpart to TensorFlow, but focused on gradient boosting instead of neural networks. It also offers extensive GPU support.</p>


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
print(os.listdir("./"))
# Any results you write to the current directory are saved as output.

['train.csv', 'sample_submission.csv', 'test.csv']
['.ipynb_checkpoints', '__notebook_source__.ipynb']


<h3>Imports and Setup</h3>

In [2]:
import numpy as np
import pandas as pd
import os

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing, model_selection, neighbors, svm
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, MultiLabelBinarizer
from sklearn.preprocessing import StandardScaler, RobustScaler, MaxAbsScaler
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.impute import SimpleImputer

from catboost import CatBoostRegressor, Pool

from tqdm import tqdm
import json
import ast

from datetime import datetime

<h3>Paths and Definitions</h3>

In [3]:
TRAIN_DATA_PATH = "../input/train.csv"
TEST_DATA_PATH = "../input/test.csv"
SUBMISSON_PATH = "../input/sample_submission.csv"
LABEL_COL_NAME = "revenue"

<h3>Functions and Feature Generation</h3>
<p>Here we parse the JSON-like columns (cast, crew, genres, keywords, production companies, and so on) and distribute their values over the pandas DataFrame. For each such column, the sorted IDs or names of a movie are written to positional columns such as <code>cast_0</code>, <code>cast_1</code>, and so on; unused positions are filled with 0.</p>
<p>Along the way we also derive some extra features that may turn out to be useful:</p>
<ul>
    <li>Whether the title differs from the original title</li>
    <li>Number of cast members</li>
    <li>Number of crew members</li>
    <li>Number of cast members per gender</li>
    <li>Number of crew members per gender</li>
    <li>Whether the movie has a home page</li>
    <li>Whether the movie is released</li>
    <li>Number of keywords</li>
    <li>Number of production companies and countries</li>
    <li>Release weekday, month, and year as separate features</li>
    <li>Lengths of the title and the original title</li>
</ul>
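<p>As a rough illustration of the features listed above, the sketch below parses one JSON-like string and derives the count, the per-gender counts, and the positional ID columns. The sample string and the helper names <code>parse_list</code> and <code>summarize</code> are hypothetical and are not part of the notebook's own code.</p>

```python
import ast

# Hypothetical sample in the same string format as the `cast` column.
raw_cast = (
    "[{'cast_id': 7, 'gender': 2, 'name': 'A'}, "
    "{'cast_id': 3, 'gender': 1, 'name': 'B'}, "
    "{'cast_id': 5, 'gender': 2, 'name': 'C'}]"
)

def parse_list(raw):
    """Return the parsed list of dicts, or an empty list for a missing value."""
    if not isinstance(raw, str):
        return []
    return ast.literal_eval(raw)

def summarize(raw, id_key):
    """Build count, gender-count, and positional ID features for one row."""
    members = parse_list(raw)
    features = {"num": len(members)}
    # Gender codes in the data: 0 unknown, 1 female, 2 male.
    for code in (0, 1, 2):
        features[f"genders_{code}"] = sum(1 for m in members if m.get("gender") == code)
    # Sorted IDs go into positional columns: <id_key>_0, <id_key>_1, ...
    for position, value in enumerate(sorted(m[id_key] for m in members)):
        features[f"{id_key}_{position}"] = value
    return features

cast_features = summarize(raw_cast, "cast_id")
print(cast_features)
```

<p>A missing value (NaN) is not a string, so it yields a count of zero and no positional columns in this sketch.</p>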

<p>Budget and revenue are heavily skewed, which is a poor fit for most machine learning models, so we use their logarithms (<code>log1p</code>) instead.</p>
<p>In addition, we impute missing (zero) budgets with the median, which may improve the score.</p>
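<p>The two steps can be sketched with the standard library alone. The budget values below are made up for illustration, and a budget of 0 is assumed to mean "missing", as in the imputer call used later in the notebook.</p>

```python
import math
import statistics

# Hypothetical budgets; 0 marks a missing value.
budgets = [14_000_000, 0, 3_300_000, 0, 40_000_000]

# Median over the known budgets only.
known = [b for b in budgets if b != 0]
median_budget = statistics.median(known)

# Replace missing budgets with the median, then compress the scale with log1p.
imputed = [b if b != 0 else median_budget for b in budgets]
log_budgets = [math.log1p(b) for b in imputed]

# expm1 is the inverse of log1p, which is useful for turning
# log-scale predictions back into currency amounts.
restored = [math.expm1(v) for v in log_budgets]

print(median_budget)
print([round(v, 3) for v in log_budgets])
```

<p>Because the label is also log-transformed, predictions come out on the log scale and would need <code>expm1</code> before being reported as revenue.</p>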

In [4]:
def date(x):
    x=str(x)
    year=x.split('/')[2]
    if int(year)<19:
        return x[:-2]+'20'+year
    else:
        return x[:-2]+'19'+year

def isNaN(x):
    # 1e400 overflows to inf and inf * 0 is nan, so this compares str(x) with 'nan'
    return str(x) == str(1e400 * 0)

def getIsoListFormJson(data, isoKey='id', forceInt=False):
    datas = data.values.flatten()
    ids = []
    for c in (datas):    
        ccc = []
        if isNaN(c) == False:
            c = json.dumps(ast.literal_eval(c))        
            c = json.loads(c)            
            for cc in c:
                if forceInt:
                    ccStr = int(cc[isoKey])
                else:
                    ccStr = str(cc[isoKey])
                ccc.append(ccStr)
        else:
            if forceInt:
                ccc.append(0)
            else:
                ccc.append('0')
        ids.append(ccc)    
    return np.array(ids)

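def distributeIdsOverData(data, colName, isoKey='id', forceInt=True):
    # Expand a list-valued column into positional columns colName_0, colName_1, ...
    # holding the row's sorted values; the original column is dropped afterwards.
    arr = getIsoListFormJson(data[colName], isoKey, forceInt)    

    # gsi is a 0-based counter used as the row label in data.loc below,
    # so it only lines up with the rows when the index runs from 0 to n-1.
    gsi = -1
    for gs in tqdm(arr):
        gsi += 1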
        gs.sort()
        for g in gs:
            gi = gs.index(g)
            try:
                data.loc[gsi, f"{colName}_{gi}"] = float(g)                
            except :
                data.loc[gsi, f"{colName}_{gi}"] = g                
            
    data.drop(colName, axis=1, inplace=True)
    print(f"{colName} distributed over data, cols: {len(data.columns)}")

def imput_title(df):
    for index, row in df.iterrows():
        if row['title'] == "none":
            df.at[index,'title'] = df.loc[index]['original_title']
    return df    
    
def prepareData(data):    
    data = imput_title(data)

    data["different_title"] = data["original_title"] != data["title"]

    data.drop("overview", axis=1, inplace=True)
    data.drop("poster_path", axis=1, inplace=True)
    data.drop('imdb_id', axis=1, inplace=True)    

    data["belongs_to_collection"] = getIsoListFormJson(data["belongs_to_collection"])

    cast = data['cast'].fillna('none')
    cast = cast.apply(lambda x: {} if x == 'none' else ast.literal_eval(x))
    data['num_cast'] = cast.apply(lambda x: len(x) if x != {} else 0)
    # Get the sum of each of the cast genders in a film: 0 `unknown`, 1 `female`, 2 `male`
    data['genders_0_cast'] = cast.apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    data['genders_1_cast'] = cast.apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    data['genders_2_cast'] = cast.apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
    distributeIdsOverData(data,'cast','cast_id')

    crew = data['crew'].fillna('none')
    crew = crew.apply(lambda x: {} if x == 'none' else ast.literal_eval(x))    
    data['num_crew'] = crew.apply(lambda x: len(x) if x != {} else 0)    
    # Get the sum of each of the crew genders in a film: 0 `unknown`, 1 `female`, 2 `male`
    data['genders_0_crew'] = crew.apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    data['genders_1_crew'] = crew.apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    data['genders_2_crew'] = crew.apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
    distributeIdsOverData(data,'crew','name',False) 

    distributeIdsOverData(data,'genres')
    
    keywords = data['Keywords'].fillna('none')
    keywords = keywords.apply(lambda x: {} if x == 'none' else ast.literal_eval(x))
    data['num_keywords'] = keywords.apply(lambda x: len(x) if x != {} else 0)
    distributeIdsOverData(data,'Keywords')

    data["Has_HomePage"] = list(map(lambda c: float(c is not np.nan), data["homepage"]))
    data.drop('homepage', axis=1, inplace=True)

    data["IsReleased"] = list(map(lambda c: float(c == "Released"), data["status"]))
    data.drop("status", axis=1, inplace=True)
  
    data["original_title_len"] = list(map(lambda c: float(len(str(c))), data["original_title"]))
    data.drop("original_title", axis=1, inplace=True)
    
    production_companies = data['production_companies'].fillna('none')
    production_companies = production_companies.apply(lambda x: {} if x == 'none' else ast.literal_eval(x))
    data['num_production_companies'] = production_companies.apply(lambda x: len(x) if x != {} else 0)
    distributeIdsOverData(data,'production_companies')    

    production_countries = data['production_countries'].fillna('none')
    production_countries = production_countries.apply(lambda x: {} if x == 'none' else ast.literal_eval(x))
    data['num_production_countries'] = production_countries.apply(lambda x: len(x) if x != {} else 0)    
    distributeIdsOverData(data,'production_countries','iso_3166_1',False)

    data['release_date']=data['release_date'].fillna('1/1/90').apply(lambda x: date(x))
    data['release_date']=data['release_date'].apply(lambda x: datetime.strptime(x,'%m/%d/%Y'))
    data['release_day']=data['release_date'].apply(lambda x:x.weekday())
    data['release_month']=data['release_date'].apply(lambda x:x.month)
    data['release_year']=data['release_date'].apply(lambda x:x.year)
    data.drop('release_date', axis=1, inplace=True)
    
    spoken_languages = data['spoken_languages'].fillna('none')
    spoken_languages = spoken_languages.apply(lambda x: {} if x == 'none' else ast.literal_eval(x))
    data['num_spoken_languages'] = spoken_languages.apply(lambda x: len(x) if x != {} else 0)
    distributeIdsOverData(data,'spoken_languages','iso_639_1',False)

    data["tagline_len"] = list(map(lambda c: float(len(str(c))), data["tagline"]))
    data.drop("tagline", axis=1, inplace=True)

    data["title_len"] = list(map(lambda c: float(len(str(c))), data["title"]))
    data.drop("title", axis=1, inplace=True)    

    data.fillna(0, inplace=True)
    data["budget"] = np.log1p(SimpleImputer(missing_values=0, strategy="median", verbose=1).fit_transform(data["budget"].values.reshape(-1,1)))
    #data["budget"] = np.log1p(data["budget"])

    data[LABEL_COL_NAME] = np.log1p(data[LABEL_COL_NAME])
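<p>The release-date handling above can be tried in isolation. The sketch below mirrors the idea of the <code>date</code> helper (two-digit years below 19 are read as 20yy, everything else as 19yy) and then extracts the weekday, month, and year. The names <code>expand_year</code> and <code>date_features</code> and the <code>pivot</code> parameter are illustrative and do not appear in the notebook.</p>

```python
from datetime import datetime

def expand_year(raw, pivot=19):
    """Turn 'm/d/yy' into 'm/d/yyyy'; years below `pivot` are read as 20yy."""
    month, day, year = raw.split("/")
    century = "20" if int(year) < pivot else "19"
    return f"{month}/{day}/{century}{year}"

def date_features(raw):
    """Return weekday (Monday = 0), month, and year for an 'm/d/yy' string."""
    parsed = datetime.strptime(expand_year(raw), "%m/%d/%Y")
    return {
        "release_day": parsed.weekday(),
        "release_month": parsed.month,
        "release_year": parsed.year,
    }

print(date_features("2/20/15"))
print(date_features("5/15/58"))
```

<p>Note that <code>release_day</code> is the day of the week, not the day of the month, and that the fixed pivot assumes no movie in the data was released before 1919.</p>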

<h3>Loading the test and train data</h3>

In [5]:
train = pd.read_csv(TRAIN_DATA_PATH, index_col='id')
print("Train Data Loaded")
test = pd.read_csv(TEST_DATA_PATH, index_col = 'id')
print("Test Data Loaded")

Train Data Loaded
Test Data Loaded


Some budget, revenue, and runtime values are missing or wrong in the train and test data. Where the correct values are known from IMDb, we fill them in manually.

In [6]:
if not os.path.exists("all_data.pickle"):   
    ## Fill in known budget, revenue, and runtime values
    train.loc[16,'revenue'] = 192864          # Skinning
    train.loc[90,'budget'] = 30000000         # Sommersby          
    train.loc[118,'budget'] = 60000000        # Wild Hogs
    train.loc[149,'budget'] = 18000000        # Beethoven
    train.loc[313,'revenue'] = 12000000       # The Cookout 
    train.loc[451,'revenue'] = 12000000       # Chasing Liberty
    train.loc[464,'budget'] = 20000000        # Parenthood
    train.loc[470,'budget'] = 13000000        # The Karate Kid, Part II
    train.loc[513,'budget'] = 930000          # From Prada to Nada
    train.loc[797,'budget'] = 8000000         # Welcome to Dongmakgol
    train.loc[819,'budget'] = 90000000        # Alvin and the Chipmunks: The Road Chip
    train.loc[850,'budget'] = 90000000        # Modern Times
    train.loc[1112,'budget'] = 7500000        # An Officer and a Gentleman
    train.loc[1131,'budget'] = 4300000        # Smokey and the Bandit   
    train.loc[1359,'budget'] = 10000000       # Stir Crazy 
    train.loc[1542,'budget'] = 1              # All at Once
    train.loc[1542,'budget'] = 15800000       # Crocodile Dundee II
    train.loc[1571,'budget'] = 4000000        # Lady and the Tramp
    train.loc[1714,'budget'] = 46000000       # The Recruit
    train.loc[1721,'budget'] = 17500000       # Cocoon
    train.loc[1865,'revenue'] = 25000000      # Scooby-Doo 2: Monsters Unleashed
    train.loc[2268,'budget'] = 17500000       # Madea Goes to Jail budget
    train.loc[2491,'revenue'] = 6800000       # Never Talk to Strangers
    train.loc[2602,'budget'] = 31000000       # Mr. Holland's Opus
    train.loc[2612,'budget'] = 15000000       # Field of Dreams
    train.loc[2696,'budget'] = 10000000       # Nurse 3-D
    train.loc[2801,'budget'] = 10000000       # Fracture

    test.loc[3889,'budget'] = 15000000       # Colossal
    test.loc[6733,'budget'] = 5000000        # The Big Sick
    test.loc[3197,'budget'] = 8000000        # High-Rise
    test.loc[6683,'budget'] = 50000000       # The Pink Panther 2
    test.loc[5704,'budget'] = 4300000        # French Connection II
    test.loc[6109,'budget'] = 281756         # Dogtooth
    test.loc[7242,'budget'] = 10000000       # Addams Family Values
    test.loc[7021,'budget'] = 17540562       #  Two Is a Family
    test.loc[5591,'budget'] = 4000000        # The Orphanage
    test.loc[4282,'budget'] = 20000000       # Big Top Pee-wee

    train.loc[391,'runtime'] = 86 #Il peor natagle de la meva vida
    train.loc[592,'runtime'] = 90 #А поутру они проснулись
    train.loc[925,'runtime'] = 95 #¿Quién mató a Bambi?
    train.loc[978,'runtime'] = 93 #La peggior settimana della mia vita
    train.loc[1256,'runtime'] = 92 #Cipolla Colt
    train.loc[1542,'runtime'] = 93 #Все и сразу
    train.loc[1875,'runtime'] = 86 #Vermist
    train.loc[2151,'runtime'] = 108 #Mechenosets
    train.loc[2499,'runtime'] = 108 #Na Igre 2. Novyy Uroven
    train.loc[2646,'runtime'] = 98 #同桌的妳
    train.loc[2786,'runtime'] = 111 #Revelation
    train.loc[2866,'runtime'] = 96 #Tutto tutto niente niente
    
    test.loc[4074,'runtime'] = 103 #Shikshanachya Aaicha Gho
    test.loc[4222,'runtime'] = 93 #Street Knight
    test.loc[4431,'runtime'] = 100 #Плюс один
    test.loc[5520,'runtime'] = 86 #Glukhar v kino
    test.loc[5845,'runtime'] = 83 #Frau Müller muss weg!
    test.loc[5849,'runtime'] = 140 #Shabd
    test.loc[6210,'runtime'] = 104 #Le dernier souffle
    test.loc[6804,'runtime'] = 145 #Chaahat Ek Nasha..
    test.loc[7321,'runtime'] = 87 #El truco del manco

    all_data = pd.concat([train, test], sort=True)
    print("Preparing All Data")
    prepareData(all_data)    
    all_data.to_pickle("all_data.pickle")
    print("saved all data")
else: 
    all_data = pd.read_pickle("all_data.pickle")
    print("loaded all data")

Preparing All Data


100%|██████████| 7398/7398 [01:26<00:00, 85.72it/s] 
cast distributed over data, cols: 188
100%|██████████| 7399/7399 [03:22<00:00, 36.47it/s]
crew distributed over data, cols: 385
100%|██████████| 7399/7399 [00:10<00:00, 714.90it/s]
genres distributed over data, cols: 392
100%|██████████| 7399/7399 [00:32<00:00, 226.34it/s]
Keywords distributed over data, cols: 541
100%|██████████| 7399/7399 [00:12<00:00, 616.04it/s]
production_companies distributed over data, cols: 567
100%|██████████| 7399/7399 [00:11<00:00, 634.78it/s]
production_countries distributed over data, cols: 579
100%|██████████| 7399/7399 [00:12<00:00, 596.65it/s]
spoken_languages distributed over data, cols: 590
saved all data
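The compute-once, load-afterwards pattern used in the cell above can be wrapped in a small helper. This is only a sketch using the standard library's pickle module instead of the pandas pickle functions; the helper name `load_or_build` and the stand-in build function are hypothetical.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the cached object at `path`, building and saving it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = build()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result

calls = []

def expensive_build():
    # Stand-in for the slow feature preparation step.
    calls.append(1)
    return {"rows": 3}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "all_data.pickle")
    first = load_or_build(cache_path, expensive_build)   # builds and saves
    second = load_or_build(cache_path, expensive_build)  # loads from disk

print(first == second, len(calls))
```

A cache like this must be deleted by hand whenever the preparation code changes, otherwise stale features are loaded silently.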


In [7]:
train = all_data[:len(train)]
test = all_data[len(train):]

In [8]:
train.head()

id,belongs_to_collection,budget,original_language,popularity,revenue,runtime,different_title,num_cast,genders_0_cast,genders_1_cast,genders_2_cast,cast_0,cast_1,cast_2,cast_3,cast_4,cast_5,cast_6,cast_7,cast_8,cast_9,cast_10,cast_11,cast_12,cast_13,cast_14,cast_15,cast_16,cast_17,cast_18,cast_19,cast_20,cast_21,cast_22,cast_23,cast_24,cast_25,cast_26,cast_27,cast_28,...,production_companies_14,production_companies_15,production_companies_16,production_companies_17,production_companies_18,production_companies_19,production_companies_20,production_companies_21,production_companies_22,production_companies_23,production_companies_24,production_companies_25,num_production_countries,production_countries_0,production_countries_1,production_countries_2,production_countries_3,production_countries_4,production_countries_5,production_countries_6,production_countries_7,production_countries_8,production_countries_9,production_countries_10,production_countries_11,release_day,release_month,release_year,num_spoken_languages,spoken_languages_0,spoken_languages_1,spoken_languages_2,spoken_languages_3,spoken_languages_4,spoken_languages_5,spoken_languages_6,spoken_languages_7,spoken_languages_8,tagline_len,title_len
1,313576,16.454568,en,6.575393,16.3263,93.0,False,24.0,6.0,8.0,10.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,17.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,26.0,27.0,28.0,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,US,0,0,0,0,0,0,0,0,0,0,0,4,2,2015,1,en,0,0,0,0,0,0,0,0,52.0,22.0
2,107674,17.50439,en,8.248895,18.370959,113.0,False,20.0,0.0,10.0,10.0,5.0,6.0,11.0,12.0,13.0,14.0,15.0,22.0,23.0,24.0,25.0,26.0,27.0,29.0,31.0,32.0,33.0,34.0,35.0,36.0,37.0,38.0,39.0,40.0,41.0,42.0,43.0,44.0,45.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,US,0,0,0,0,0,0,0,0,0,0,0,4,8,2004,1,en,0,0,0,0,0,0,0,0,60.0,40.0
3,0,15.009433,en,64.29999,16.387512,105.0,False,51.0,31.0,7.0,13.0,1.0,5.0,6.0,7.0,8.0,9.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,IN,0,0,0,0,0,0,0,0,0,0,0,4,10,2014,1,en,hi,0,0,0,0,0,0,0,47.0,8.0
4,0,13.997833,hi,3.174936,16.588099,122.0,False,7.0,4.0,1.0,2.0,3.0,4.0,5.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,KR,0,0,0,0,0,0,0,0,0,0,0,4,3,2012,2,ko,0,0,0,0,0,0,0,0,3.0,7.0
5,0,16.648724,ko,1.14807,15.182615,118.0,True,4.0,0.0,0.0,4.0,6.0,7.0,8.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0,0,0,0,0,3,2,2009,1,en,0,0,0,0,0,0,0,0,3.0,10.0


In [9]:
train.describe()

Unnamed: 0,budget,popularity,revenue,runtime,num_cast,genders_0_cast,genders_1_cast,genders_2_cast,cast_0,cast_1,cast_2,cast_3,cast_4,cast_5,cast_6,cast_7,cast_8,cast_9,cast_10,cast_11,cast_12,cast_13,cast_14,cast_15,cast_16,cast_17,cast_18,cast_19,cast_20,cast_21,cast_22,cast_23,cast_24,cast_25,cast_26,cast_27,cast_28,cast_29,cast_30,cast_31,...,Keywords_146,Keywords_147,Keywords_148,Has_HomePage,IsReleased,original_title_len,num_production_companies,production_companies_0,production_companies_1,production_companies_2,production_companies_3,production_companies_4,production_companies_5,production_companies_6,production_companies_7,production_companies_8,production_companies_9,production_companies_10,production_companies_11,production_companies_12,production_companies_13,production_companies_14,production_companies_15,production_companies_16,production_companies_17,production_companies_18,production_companies_19,production_companies_20,production_companies_21,production_companies_22,production_companies_23,production_companies_24,production_companies_25,num_production_countries,release_day,release_month,release_year,num_spoken_languages,tagline_len,title_len
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,...,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,16.40808,8.463274,15.97737,108.17,20.603667,6.776333,4.511,9.316333,18.262667,18.846,19.797,20.856333,21.951,22.441333,22.921,23.796333,23.844667,23.992333,24.056667,23.247,22.761667,22.753,22.639667,21.173333,19.801,18.873,17.957667,17.054,15.960333,15.623333,14.595667,13.836667,13.437333,13.126,12.752333,12.355,11.986333,11.469,10.802333,10.293333,...,71.692,71.730333,73.724,0.315333,0.998667,14.802,2.698333,4094.469,6856.011,7345.976333,4844.554,3584.525667,2771.314333,1748.981333,1089.431667,714.049667,285.320333,277.132667,168.727667,124.855667,98.038333,87.162,57.416667,24.199667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.326333,3.269333,6.775333,1999.713,1.452333,36.358,15.159
std,1.649808,12.104,3.024962,21.198842,16.629635,9.508441,4.712248,7.374804,117.499051,114.711478,113.324575,111.94995,109.160197,104.567216,99.832391,98.41565,93.585226,90.232471,88.725202,81.291698,77.198532,75.722084,73.927841,65.04494,60.072165,57.558398,54.961084,52.046614,48.761891,49.054034,45.406388,45.284466,45.514788,45.90299,46.007592,46.178463,46.048499,45.99751,41.63491,41.536141,...,3926.732559,3928.832162,4038.029783,0.464726,0.036497,8.310264,2.014121,10387.60635,13728.56336,15828.172697,13364.062956,12606.312034,11834.314487,9712.495928,7610.761385,6591.862661,4072.133572,4350.607382,3469.924494,3068.378447,2605.128146,2536.487924,2091.424359,1325.470332,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.752349,1.30615,3.409115,15.423313,0.887688,28.321474,8.329196
min,0.693147,1e-06,0.693147,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1921.0,0.0,3.0,1.0
25%,16.012735,4.018053,14.691963,94.0,11.0,1.0,2.0,5.0,1.0,2.0,3.0,4.0,6.0,7.0,8.0,9.0,10.0,11.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,9.0,1.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,4.0,1993.0,1.0,17.0,10.0
50%,16.648724,7.374861,16.63731,104.0,16.0,4.0,3.0,8.0,2.0,3.0,5.0,6.0,8.0,10.0,11.0,13.0,14.0,15.0,16.0,17.0,17.0,18.0,19.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,13.0,2.0,491.0,1173.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,7.0,2004.0,1.0,32.0,13.0
75%,17.216708,10.890983,18.046365,118.0,24.0,8.0,6.0,12.0,5.0,7.0,8.0,11.0,13.0,15.0,17.0,19.0,20.0,21.25,23.0,24.0,25.0,26.0,27.0,29.0,30.0,30.0,30.0,30.0,29.0,29.0,28.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,18.0,4.0,3644.25,7495.5,7503.0,192.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,10.0,2011.0,2.0,50.0,19.0
max,19.755682,294.337037,21.141685,338.0,156.0,114.0,87.0,84.0,1021.0,1022.0,1023.0,1024.0,1028.0,1029.0,1030.0,1037.0,1038.0,1035.0,1044.0,1071.0,1038.0,1068.0,1069.0,1070.0,1071.0,1072.0,1073.0,1074.0,1075.0,1076.0,1077.0,1078.0,1079.0,1080.0,1081.0,1082.0,1083.0,1084.0,1049.0,1050.0,...,215076.0,215191.0,221172.0,1.0,1.0,62.0,17.0,94930.0,94301.0,94198.0,95595.0,95018.0,95406.0,95342.0,95295.0,95296.0,92199.0,92231.0,92232.0,92233.0,87850.0,87851.0,87852.0,72599.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,6.0,12.0,2017.0,9.0,232.0,62.0


In [10]:
test.head()

id,belongs_to_collection,budget,original_language,popularity,revenue,runtime,different_title,num_cast,genders_0_cast,genders_1_cast,genders_2_cast,cast_0,cast_1,cast_2,cast_3,cast_4,cast_5,cast_6,cast_7,cast_8,cast_9,cast_10,cast_11,cast_12,cast_13,cast_14,cast_15,cast_16,cast_17,cast_18,cast_19,cast_20,cast_21,cast_22,cast_23,cast_24,cast_25,cast_26,cast_27,cast_28,...,production_companies_14,production_companies_15,production_companies_16,production_companies_17,production_companies_18,production_companies_19,production_companies_20,production_companies_21,production_companies_22,production_companies_23,production_companies_24,production_companies_25,num_production_countries,production_countries_0,production_countries_1,production_countries_2,production_countries_3,production_countries_4,production_countries_5,production_countries_6,production_countries_7,production_countries_8,production_countries_9,production_countries_10,production_countries_11,release_day,release_month,release_year,num_spoken_languages,spoken_languages_0,spoken_languages_1,spoken_languages_2,spoken_languages_3,spoken_languages_4,spoken_languages_5,spoken_languages_6,spoken_languages_7,spoken_languages_8,tagline_len,title_len
3001,34055,16.648724,ja,3.851534,0.0,90.0,True,7.0,4.0,3.0,0.0,2.0,3.0,4.0,5.0,6.0,8.0,9.0,10.0,11.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,US,0,0,0,0,0,0,0,0,0,0,0,5,7,2007,2,en,0,0,0,0,0,0,0,0,51.0,28.0
3002,0,11.385103,en,3.559789,0.0,65.0,False,10.0,6.0,2.0,2.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,US,0,0,0,0,0,0,0,0,0,0,0,0,5,1958,1,en,0,0,0,0,0,0,0,0,96.0,27.0
3003,0,16.648724,en,8.085194,0.0,100.0,False,9.0,0.0,4.0,5.0,6.0,7.0,8.0,9.0,13.0,14.0,17.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,26.0,88.0,89.0,90.0,91.0,92.0,93.0,94.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,CA,FR,0,0,0,0,0,0,0,0,0,0,4,5,1997,1,ar,en,fr,0,0,0,0,0,0,41.0,16.0
3004,0,15.732433,fr,8.596012,0.0,130.0,False,23.0,15.0,3.0,5.0,1.0,2.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,US,0,0,0,0,0,0,0,0,0,0,0,5,9,2010,3,en,0,0,0,0,0,0,0,0,55.0,9.0
3005,0,14.508658,en,3.21768,0.0,92.0,False,4.0,0.0,0.0,4.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,12.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,US,0,0,0,0,0,0,0,0,0,0,0,4,2,2005,1,en,0,0,0,0,0,0,0,0,221.0,18.0


In [11]:
test.describe()

Unnamed: 0,budget,popularity,revenue,runtime,num_cast,genders_0_cast,genders_1_cast,genders_2_cast,cast_0,cast_1,cast_2,cast_3,cast_4,cast_5,cast_6,cast_7,cast_8,cast_9,cast_10,cast_11,cast_12,cast_13,cast_14,cast_15,cast_16,cast_17,cast_18,cast_19,cast_20,cast_21,cast_22,cast_23,cast_24,cast_25,cast_26,cast_27,cast_28,cast_29,cast_30,cast_31,...,Keywords_146,Keywords_147,Keywords_148,Has_HomePage,IsReleased,original_title_len,num_production_companies,production_companies_0,production_companies_1,production_companies_2,production_companies_3,production_companies_4,production_companies_5,production_companies_6,production_companies_7,production_companies_8,production_companies_9,production_companies_10,production_companies_11,production_companies_12,production_companies_13,production_companies_14,production_companies_15,production_companies_16,production_companies_17,production_companies_18,production_companies_19,production_companies_20,production_companies_21,production_companies_22,production_companies_23,production_companies_24,production_companies_25,num_production_countries,release_day,release_month,release_year,num_spoken_languages,tagline_len,title_len
count,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,...,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0,4399.0
mean,16.363691,8.548286,0.0,107.713799,21.191862,7.204137,4.650602,9.337122,25.846101,26.218459,27.319618,28.01273,28.726756,29.004774,29.940441,29.956581,28.968857,28.933621,27.755399,26.565583,25.675835,25.305297,24.613549,23.084792,22.189589,21.112071,20.117072,19.14185,17.951353,17.542851,16.697204,15.875199,14.450102,13.522846,13.276199,12.929984,12.331212,11.929757,11.546261,10.9609,...,0.0,0.0,0.0,0.322801,0.997727,14.821778,2.775404,3959.911343,7029.519436,7224.09002,5573.641055,3839.711525,2659.363719,1623.482155,1101.272107,867.653785,459.159127,406.545579,225.39509,189.354853,156.375313,140.325756,148.367811,113.940441,116.981132,98.165038,66.877472,55.300296,32.700159,14.754944,14.755399,12.2005,12.200727,1.336895,3.226642,6.886338,1999.670834,1.441464,36.651512,15.116845
std,1.803768,12.208307,0.0,20.81757,17.981498,10.682749,4.616389,7.390047,145.386772,142.37999,141.624291,139.352724,135.403558,130.495478,128.857468,124.648013,117.536894,113.892978,105.181922,97.965496,92.489098,90.445506,84.365382,79.219822,76.735035,74.021072,69.64077,66.517682,64.851212,65.077059,63.419169,61.523973,52.94332,46.049604,46.350886,46.446462,43.780106,43.786304,41.159558,38.181007,...,0.0,0.0,0.0,0.4676,0.04763,8.33742,2.296967,9868.395794,14674.804112,16028.797132,15051.383047,13100.112056,11251.828769,8904.257387,7517.042627,6926.744211,4885.615114,4890.954131,3314.296663,3001.888252,2584.117096,2510.281626,2728.659758,2570.825674,2688.117559,2489.663162,2061.3045,1839.797677,1448.975499,826.682644,826.700489,809.197653,809.21273,0.815483,1.343558,3.371805,15.286348,0.899026,28.88225,8.394436
min,0.693147,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1922.0,0.0,3.0,1.0
25%,16.012735,3.888453,0.0,94.0,11.0,1.0,2.0,5.0,1.0,2.0,3.0,4.0,6.0,7.0,8.0,9.0,10.0,11.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,9.0,1.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,4.0,1992.0,1.0,18.0,9.0
50%,16.648724,7.481524,0.0,104.0,16.0,4.0,4.0,8.0,2.0,3.0,5.0,6.0,8.0,10.0,11.0,13.0,14.0,15.0,16.0,17.0,17.0,18.0,18.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,13.0,2.0,441.0,1088.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,7.0,2004.0,1.0,33.0,13.0
75%,17.147715,10.936597,0.0,118.0,24.0,8.0,6.0,12.0,5.0,7.0,9.0,10.0,13.0,15.0,17.0,19.0,20.0,22.0,23.0,25.0,26.0,27.0,28.0,29.0,30.0,30.0,30.0,30.0,30.0,29.0,28.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,18.0,4.0,3688.5,7364.0,7295.0,932.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,10.0,2011.0,2.0,50.0,19.0
max,19.376192,547.488298,0.0,320.0,165.0,143.0,60.0,85.0,1055.0,1056.0,1057.0,1059.0,1060.0,1061.0,1062.0,1063.0,1064.0,1065.0,1066.0,1092.0,1093.0,1094.0,1095.0,1096.0,1101.0,1102.0,1103.0,1104.0,1105.0,1106.0,1107.0,1108.0,1060.0,1061.0,1062.0,1063.0,1064.0,1065.0,1066.0,1067.0,...,0.0,0.0,0.0,1.0,1.0,104.0,26.0,95102.0,96035.0,96043.0,95674.0,95983.0,95345.0,93119.0,89163.0,95289.0,95290.0,93466.0,92187.0,79077.0,76994.0,77813.0,88099.0,88100.0,88101.0,88102.0,79438.0,79439.0,78943.0,53668.0,53669.0,53670.0,53671.0,12.0,6.0,12.0,2018.0,9.0,252.0,104.0


Now that the train and test data are ready, we define which columns are categorical. CatBoost can handle string-valued categorical features directly once they are declared, so no label encoding is needed.

In [12]:
X_tr = train.drop(LABEL_COL_NAME, axis = 1)
y_tr = train[LABEL_COL_NAME]
numerical_features = ['budget',
                      'popularity', 
                      'runtime', 
                      'title_len', 
                      'original_title_len', 
                      'tagline_len',
                      'num_crew',
                      'num_cast',
                      'num_keywords',
                      'num_production_companies',
                      'num_production_countries',
                      'num_spoken_languages']
#every column not listed above is treated as categorical; CatBoost takes their column indices
cat_features = [i for i, c in enumerate(X_tr.columns) if c not in numerical_features]
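
The index lookup above can be sketched on a toy column list. This is a minimal illustration, not the notebook's data: the column names in `columns` and the helper `categorical_indices` are made up for the example, and it assumes (as the cell above does) that every column not listed as numerical is categorical.

```python
# Hypothetical column layout; the real X_tr has many more columns.
columns = ['budget', 'original_language', 'popularity', 'status', 'runtime']
numerical = ['budget', 'popularity', 'runtime']

def categorical_indices(all_columns, numerical_columns):
    """Return positions of the columns that are not numerical, in column order."""
    numerical_set = set(numerical_columns)
    return [i for i, name in enumerate(all_columns) if name not in numerical_set]

cat_idx = categorical_indices(columns, numerical)
print(cat_idx)  # [1, 3]
```

Iterating in column order keeps the result reproducible between runs, whereas iterating over a `set` does not guarantee any particular order.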

<h3>Cross-Validation Functions and Hyperparameter Optimization</h3>

In [13]:
#import required packages
import lightgbm as lgb
import xgboost as xgb
import catboost as cb
import gc
from hyperopt import hp, tpe, Trials, STATUS_OK
from hyperopt.fmin import fmin
from hyperopt.pyll.stochastic import sample
#optional but advised
import warnings
warnings.filterwarnings('ignore')

#GLOBAL HYPEROPT PARAMETERS
NUM_EVALS = 100 #number of hyperopt evaluation rounds
N_FOLDS = 3 #number of cross-validation folds on data in each evaluation round

#LIGHTGBM PARAMETERS
LGBM_MAX_LEAVES = 2**11 #maximum number of leaves per tree for LightGBM
LGBM_MAX_DEPTH = 25 #maximum tree depth for LightGBM
EVAL_METRIC_LGBM_REG = 'mae' #LightGBM regression metric. Note that 'rmse' is more commonly used 
EVAL_METRIC_LGBM_CLASS = 'auc'#LightGBM classification metric

#XGBOOST PARAMETERS
XGB_MAX_LEAVES = 2**12 #maximum number of leaves when using histogram splitting
XGB_MAX_DEPTH = 25 #maximum tree depth for XGBoost
EVAL_METRIC_XGB_REG = 'mae' #XGBoost regression metric
EVAL_METRIC_XGB_CLASS = 'auc' #XGBoost classification metric

#CATBOOST PARAMETERS
CB_MAX_DEPTH = 6 #maximum tree depth in CatBoost
OBJECTIVE_CB_REG = 'RMSE' #CatBoost regression metric
OBJECTIVE_CB_CLASS = 'Logloss' #CatBoost classification metric

def quick_hyperopt(data, labels, package='lgbm', num_evals=NUM_EVALS, diagnostic=False, cat_features=[]):
    
    #==========
    #LightGBM
    #==========
    
    if package=='lgbm':
        
        print('Running {} rounds of LightGBM parameter optimisation:'.format(num_evals))
        #clear space
        gc.collect()
        
        integer_params = ['max_depth',
                         'num_leaves',
                          'max_bin',
                         'min_data_in_leaf',
                         'min_data_in_bin']
        
        def objective(space_params):
            
            #cast integer params from float to int
            for param in integer_params:
                space_params[param] = int(space_params[param])
            
            #extract nested conditional parameters
            if space_params['boosting']['boosting'] == 'goss':
                top_rate = space_params['boosting'].get('top_rate')
                other_rate = space_params['boosting'].get('other_rate')
                #0 <= top_rate + other_rate <= 1
                top_rate = max(top_rate, 0)
                top_rate = min(top_rate, 0.5)
                other_rate = max(other_rate, 0)
                other_rate = min(other_rate, 0.5)
                space_params['top_rate'] = top_rate
                space_params['other_rate'] = other_rate
            
            subsample = space_params['boosting'].get('subsample', 1.0)
            space_params['boosting'] = space_params['boosting']['boosting']
            space_params['subsample'] = subsample
            
            #for classification, set stratified=True and metrics=EVAL_METRIC_LGBM_CLASS
            cv_results = lgb.cv(space_params, train, nfold = N_FOLDS, stratified=False,
                                early_stopping_rounds=100, metrics=EVAL_METRIC_LGBM_REG, seed=42)
            
            best_loss = cv_results['l1-mean'][-1] #'l2-mean' for rmse
            #for classification, comment out the line above and uncomment the line below:
            #best_loss = 1 - cv_results['auc-mean'][-1]
            #if necessary, replace 'auc-mean' with '[your-preferred-metric]-mean'
            return{'loss':best_loss, 'status': STATUS_OK }
        
        train = lgb.Dataset(data, labels)
                
        #integer and string parameters, used with hp.choice()
        boosting_list = [{'boosting': 'gbdt',
                          'subsample': hp.uniform('subsample', 0.5, 1)},
                         {'boosting': 'goss',
                          'subsample': 1.0,
                         'top_rate': hp.uniform('top_rate', 0, 0.5),
                         'other_rate': hp.uniform('other_rate', 0, 0.5)}] #if including 'dart', make sure to set 'n_estimators'
        metric_list = ['MAE', 'RMSE'] 
        #for classification comment out the line above and uncomment the line below
        #modify as required for other classification metrics
        #metric_list = ['auc']
        objective_list_reg = ['huber', 'gamma', 'fair', 'tweedie']
        objective_list_class = ['logloss', 'cross_entropy']
        #for classification set objective_list = objective_list_class
        objective_list = objective_list_reg

        space ={'boosting' : hp.choice('boosting', boosting_list),
                'num_leaves' : hp.quniform('num_leaves', 2, LGBM_MAX_LEAVES, 1),
                'max_depth': hp.quniform('max_depth', 2, LGBM_MAX_DEPTH, 1),
                'max_bin': hp.quniform('max_bin', 32, 255, 1),
                'min_data_in_leaf': hp.quniform('min_data_in_leaf', 1, 256, 1),
                'min_data_in_bin': hp.quniform('min_data_in_bin', 1, 256, 1),
                'lambda_l1' : hp.uniform('lambda_l1', 0, 5),
                'lambda_l2' : hp.uniform('lambda_l2', 0, 5),
                'learning_rate' : hp.loguniform('learning_rate', np.log(0.005), np.log(0.2)),
                'metric' : hp.choice('metric', metric_list),
                'objective' : hp.choice('objective', objective_list),
                'feature_fraction' : hp.quniform('feature_fraction', 0.5, 1, 0.01),
                'bagging_fraction' : hp.quniform('bagging_fraction', 0.5, 1, 0.01)
            }
        
        #optional: activate GPU for LightGBM
        #follow compilation steps here:
        #https://www.kaggle.com/vinhnguyen/gpu-acceleration-for-lightgbm/
        #then uncomment lines below:
        #space['device'] = 'gpu'
        #space['gpu_platform_id'] = 0
        #space['gpu_device_id'] =  0

        trials = Trials()
        best = fmin(fn=objective,
                    space=space,
                    algo=tpe.suggest,
                    max_evals=num_evals, 
                    trials=trials)
                
        #fmin() will return the index of values chosen from the lists/arrays in 'space'
        #to obtain actual values, index values are used to subset the original lists/arrays
        best['boosting'] = boosting_list[best['boosting']]['boosting']#nested dict, index twice
        best['metric'] = metric_list[best['metric']]
        best['objective'] = objective_list[best['objective']]
        
        #cast floats of integer params to int
        for param in integer_params:
            best[param] = int(best[param])
            
        print('{' + '\n'.join('{}: {}'.format(k, v) for k, v in best.items()) + '}')
        if diagnostic:
            return(best, trials)
        else:
            return(best)
    
    #==========
    #XGBoost
    #==========
    
    if package=='xgb':
        
        print('Running {} rounds of XGBoost parameter optimisation:'.format(num_evals))
        #clear space
        gc.collect()
        
        integer_params = ['max_depth']
        
        def objective(space_params):
            
            for param in integer_params:
                space_params[param] = int(space_params[param])
                
            #extract multiple nested tree_method conditional parameters
            #save yourself from hell (the nesting below is deep)
            if space_params['tree_method']['tree_method'] == 'hist':
                max_bin = space_params['tree_method'].get('max_bin')
                space_params['max_bin'] = int(max_bin)
                if space_params['tree_method']['grow_policy']['grow_policy']['grow_policy'] == 'depthwise':
                    grow_policy = space_params['tree_method'].get('grow_policy').get('grow_policy').get('grow_policy')
                    space_params['grow_policy'] = grow_policy
                    space_params['tree_method'] = 'hist'
                else:
                    max_leaves = space_params['tree_method']['grow_policy']['grow_policy'].get('max_leaves')
                    space_params['grow_policy'] = 'lossguide'
                    space_params['max_leaves'] = int(max_leaves)
                    space_params['tree_method'] = 'hist'
            else:
                space_params['tree_method'] = space_params['tree_method'].get('tree_method')
                
            #for classification replace EVAL_METRIC_XGB_REG with EVAL_METRIC_XGB_CLASS
            cv_results = xgb.cv(space_params, train, nfold=N_FOLDS, metrics=[EVAL_METRIC_XGB_REG],
                             early_stopping_rounds=100, stratified=False, seed=42)
            
            best_loss = cv_results['test-mae-mean'].iloc[-1] #or 'test-rmse-mean' if using RMSE
            #for classification, comment out the line above and uncomment the line below:
            #best_loss = 1 - cv_results['test-auc-mean'].iloc[-1]
            #if necessary, replace 'test-auc-mean' with 'test-[your-preferred-metric]-mean'
            return{'loss':best_loss, 'status': STATUS_OK }
        
        train = xgb.DMatrix(data, labels)
        
        #integer and string parameters, used with hp.choice()
        boosting_list = ['gbtree', 'gblinear'] #if including 'dart', make sure to set 'n_estimators'
        metric_list = ['MAE', 'RMSE'] 
        #for classification comment out the line above and uncomment the line below
        #metric_list = ['auc']
        #modify as required for other classification metrics
        
        tree_method = [{'tree_method' : 'exact'},
               {'tree_method' : 'approx'},
               {'tree_method' : 'hist',
                'max_bin': hp.quniform('max_bin', 2**3, 2**7, 1),
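                #NOTE: the inner dict below repeats the key 'grow_policy', so Python keeps only
                #the last entry ('lossguide'); 'depthwise' is never sampled as written
                'grow_policy' : {'grow_policy': {'grow_policy':'depthwise'},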
                                'grow_policy' : {'grow_policy':'lossguide',
                                                  'max_leaves': hp.quniform('max_leaves', 32, XGB_MAX_LEAVES, 1)}}}]
        
        #if using GPU, replace 'exact' with 'gpu_exact' and 'hist' with
        #'gpu_hist' in the nested dictionary above
        
        objective_list_reg = ['reg:linear', 'reg:gamma', 'reg:tweedie']
        objective_list_class = ['reg:logistic', 'binary:logistic']
        #for classification change line below to 'objective_list = objective_list_class'
        objective_list = objective_list_reg
        
        space ={'boosting' : hp.choice('boosting', boosting_list),
                'tree_method' : hp.choice('tree_method', tree_method),
                'max_depth': hp.quniform('max_depth', 2, XGB_MAX_DEPTH, 1),
                'reg_alpha' : hp.uniform('reg_alpha', 0, 5),
                'reg_lambda' : hp.uniform('reg_lambda', 0, 5),
                'min_child_weight' : hp.uniform('min_child_weight', 0, 5),
                'gamma' : hp.uniform('gamma', 0, 5),
                'learning_rate' : hp.loguniform('learning_rate', np.log(0.005), np.log(0.2)),
                'eval_metric' : hp.choice('eval_metric', metric_list),
                'objective' : hp.choice('objective', objective_list),
                'colsample_bytree' : hp.quniform('colsample_bytree', 0.1, 1, 0.01),
                'colsample_bynode' : hp.quniform('colsample_bynode', 0.1, 1, 0.01),
                'colsample_bylevel' : hp.quniform('colsample_bylevel', 0.1, 1, 0.01),
                'subsample' : hp.quniform('subsample', 0.5, 1, 0.05),
                'nthread' : -1
            }
        
        #optional: activate GPU for XGBoost
        #uncomment line below
        #space['tree_method'] = 'gpu_hist'
        
        trials = Trials()
        best = fmin(fn=objective,
                    space=space,
                    algo=tpe.suggest,
                    max_evals=num_evals, 
                    trials=trials)
        
        best['tree_method'] = tree_method[best['tree_method']]['tree_method']
        best['boosting'] = boosting_list[best['boosting']]
        best['eval_metric'] = metric_list[best['eval_metric']]
        best['objective'] = objective_list[best['objective']]
        
        #cast floats of integer params to int
        for param in integer_params:
            best[param] = int(best[param])
        if 'max_leaves' in best:
            best['max_leaves'] = int(best['max_leaves'])
        if 'max_bin' in best:
            best['max_bin'] = int(best['max_bin'])
        
        print('{' + '\n'.join('{}: {}'.format(k, v) for k, v in best.items()) + '}')
        
        if diagnostic:
            return(best, trials)
        else:
            return(best)
    
    #==========
    #CatBoost
    #==========
    
    if package=='cb':
        
        print('Running {} rounds of CatBoost parameter optimisation:'.format(num_evals))
        
        #clear memory 
        gc.collect()
            
        integer_params = ['depth',
                          'one_hot_max_size', #for categorical data
                          'min_data_in_leaf',
                          'max_bin']
        
        def objective(space_params):
                        
            #cast integer params from float to int
            for param in integer_params:
                space_params[param] = int(space_params[param])
                
            #extract nested conditional parameters
            if space_params['bootstrap_type']['bootstrap_type'] == 'Bayesian':
                bagging_temp = space_params['bootstrap_type'].get('bagging_temperature')
                space_params['bagging_temperature'] = bagging_temp
                
            #must match the spelling used in the grow_policy list below ('Lossguide')
            if space_params['grow_policy']['grow_policy'] == 'Lossguide':
                max_leaves = space_params['grow_policy'].get('max_leaves')
                space_params['max_leaves'] = int(max_leaves)
                
            space_params['bootstrap_type'] = space_params['bootstrap_type']['bootstrap_type']
            space_params['grow_policy'] = space_params['grow_policy']['grow_policy']
                           
            #random_strength cannot be < 0
            space_params['random_strength'] = max(space_params['random_strength'], 0)
            #fold_len_multiplier cannot be < 1
            space_params['fold_len_multiplier'] = max(space_params['fold_len_multiplier'], 1)
                       
            #for classification set stratified=True
            cv_results = cb.cv(train, space_params, fold_count=N_FOLDS, 
                             early_stopping_rounds=25, stratified=False, partition_random_seed=42)
           
            #best_loss = cv_results['test-MAE-mean'].iloc[-1] 
            best_loss = cv_results['test-RMSE-mean'].iloc[-1] 
            
            #for classification, comment out the line above and uncomment the line below:
            #best_loss = cv_results['test-Logloss-mean'].iloc[-1]
            #if necessary, replace 'test-Logloss-mean' with 'test-[your-preferred-metric]-mean'
            
            return{'loss':best_loss, 'status': STATUS_OK}
        
        train = cb.Pool(data, labels.astype('float32'), cat_features=cat_features)
        
        #integer and string parameters, used with hp.choice()
        bootstrap_type = [
                          {'bootstrap_type':'Poisson'}, 
                          {'bootstrap_type':'Bayesian', 'bagging_temperature' : hp.loguniform('bagging_temperature', np.log(1), np.log(50))},
                          {'bootstrap_type':'Bernoulli'}] 
        LEB = ['No', 'AnyImprovement', 'Armijo'] #remove 'Armijo' if not using GPU
        #score_function = ['Correlation', 'L2', 'NewtonCorrelation', 'NewtonL2']
        grow_policy = [{'grow_policy':'SymmetricTree'},
                       {'grow_policy':'Depthwise'},
                       {'grow_policy':'Lossguide',
                        'max_leaves': hp.quniform('max_leaves', 2, 32, 1)}]
        eval_metric_list_reg = ['MAE', 'RMSE', 'Poisson']
        eval_metric_list_class = ['Logloss', 'AUC', 'F1']
        #for classification change line below to 'eval_metric_list = eval_metric_list_class'
        eval_metric_list = eval_metric_list_reg
                
        space ={'depth': hp.quniform('depth', 2, CB_MAX_DEPTH, 1),
                'max_bin' : hp.quniform('max_bin', 1, 32, 1), #if using CPU just set this to 254
                #'max_bin': 254,
                'l2_leaf_reg' : hp.uniform('l2_leaf_reg', 0, 5),
                'min_data_in_leaf' : hp.quniform('min_data_in_leaf', 1, 50, 1),
                'random_strength' : hp.loguniform('random_strength', np.log(0.005), np.log(5)),
                'one_hot_max_size' : hp.quniform('one_hot_max_size', 2, 16, 1), #only relevant for categorical features
                'bootstrap_type' : hp.choice('bootstrap_type', bootstrap_type),
                'learning_rate' : hp.uniform('learning_rate', 0.05, 0.25),
                'eval_metric' : hp.choice('eval_metric', eval_metric_list),
                'objective' : OBJECTIVE_CB_REG,
                #'score_function' : hp.choice('score_function', score_function), #crashes kernel - reason unknown
                'leaf_estimation_backtracking' : hp.choice('leaf_estimation_backtracking', LEB),
                'grow_policy': hp.choice('grow_policy', grow_policy),
                #'colsample_bylevel' : hp.quniform('colsample_bylevel', 0.1, 1, 0.01),# CPU only
                'fold_len_multiplier' : hp.loguniform('fold_len_multiplier', np.log(1.01), np.log(2.5)),
                'od_type' : 'Iter',
                'od_wait' : 25,
                'task_type' : 'GPU',
                'verbose' : 0,
                'cat_features': cat_features
            }
        
        #optional: run CatBoost without GPU
        #uncomment line below
        #space['task_type'] = 'CPU'
            
        trials = Trials()
        best = fmin(fn=objective,
                    space=space,
                    algo=tpe.suggest,
                    max_evals=num_evals, 
                    trials=trials)
        
        #unpack nested dicts first
        best['bootstrap_type'] = bootstrap_type[best['bootstrap_type']]['bootstrap_type']
        best['grow_policy'] = grow_policy[best['grow_policy']]['grow_policy']
        best['eval_metric'] = eval_metric_list[best['eval_metric']]
        
        #best['score_function'] = score_function[best['score_function']] 
        #best['leaf_estimation_method'] = LEM[best['leaf_estimation_method']] #CPU only
        best['leaf_estimation_backtracking'] = LEB[best['leaf_estimation_backtracking']]        
        
        #cast floats of integer params to int
        for param in integer_params:
            best[param] = int(best[param])
        if 'max_leaves' in best:
            best['max_leaves'] = int(best['max_leaves'])
        
        print('{' + '\n'.join('{}: {}'.format(k, v) for k, v in best.items()) + '}')
        
        if diagnostic:
            return(best, trials)
        else:
            return(best)
    
    else:
        print('Package not recognised. Please use "lgbm" for LightGBM, "xgb" for XGBoost or "cb" for CatBoost.')     
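
One detail of the function above is easy to miss: for parameters defined with `hp.choice`, `fmin()` reports the index of the chosen option rather than the option itself, which is why each branch looks the values up again before returning. The sketch below replays that step in plain Python. The `best` dict is hand-written to look like an `fmin()` result, and `decode_choices` is a hypothetical helper, not part of hyperopt.

```python
# Option lists, shortened from the CatBoost branch above.
bootstrap_type = [{'bootstrap_type': 'Poisson'},
                  {'bootstrap_type': 'Bayesian'},
                  {'bootstrap_type': 'Bernoulli'}]
eval_metric_list = ['MAE', 'RMSE', 'Poisson']

def decode_choices(best, plain_lists, nested_lists, integer_params):
    """Turn choice indices into values and cast float-valued integer params."""
    decoded = dict(best)
    for name, options in plain_lists.items():
        decoded[name] = options[decoded[name]]
    for name, options in nested_lists.items():
        # nested options are dicts keyed by the parameter name itself
        decoded[name] = options[decoded[name]][name]
    for name in integer_params:
        decoded[name] = int(decoded[name])
    return decoded

# Hand-written stand-in for what fmin() might return (an assumption).
best = {'bootstrap_type': 2, 'eval_metric': 0, 'depth': 6.0, 'learning_rate': 0.065}

params = decode_choices(best,
                        plain_lists={'eval_metric': eval_metric_list},
                        nested_lists={'bootstrap_type': bootstrap_type},
                        integer_params=['depth'])
print(params)
```

hyperopt also ships `space_eval(space, best)` for the same purpose, which may be simpler when the search space object is still in scope.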

In [14]:
cb_params = quick_hyperopt(X_tr, y_tr, 'cb', 15, cat_features=cat_features)
#np.save pickles the dict; reload it with np.load('cb_params.npy', allow_pickle=True).item()
np.save('cb_params.npy', cb_params)
print(cb_params)

Running 15 rounds of CatBoost parameter optimisation:
100%|██████████| 15/15 [27:13<00:00, 91.19s/it, best loss: 2.1659739060428613] 
{bootstrap_type: Bernoulli
depth: 6
eval_metric: MAE
fold_len_multiplier: 1.8090889285313607
grow_policy: Depthwise
l2_leaf_reg: 2.8807709614379338
leaf_estimation_backtracking: Armijo
learning_rate: 0.06499349671407013
max_bin: 32
min_data_in_leaf: 28
one_hot_max_size: 5
random_strength: 0.28920711927167503}
{'bootstrap_type': 'Bernoulli', 'depth': 6, 'eval_metric': 'MAE', 'fold_len_multiplier': 1.8090889285313607, 'grow_policy': 'Depthwise', 'l2_leaf_reg': 2.8807709614379338, 'leaf_estimation_backtracking': 'Armijo', 'learning_rate': 0.06499349671407013, 'max_bin': 32, 'min_data_in_leaf': 28, 'one_hot_max_size': 5, 'random_strength': 0.28920711927167503}
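
Since `cb_params` is a plain dict of strings and numbers, it can also be stored as JSON, which stays human-readable and does not rely on pickling. This is a minimal sketch using only the standard library. The file name `cb_params.json` is an assumption, and the values are copied (rounded) from the output above.

```python
import json
import os
import tempfile

# Values taken from the printed result above, rounded for brevity.
cb_params = {'bootstrap_type': 'Bernoulli', 'depth': 6, 'eval_metric': 'MAE',
             'grow_policy': 'Depthwise', 'l2_leaf_reg': 2.8808,
             'learning_rate': 0.065, 'max_bin': 32, 'min_data_in_leaf': 28}

# Written to a temporary directory here so the example leaves nothing behind.
path = os.path.join(tempfile.mkdtemp(), 'cb_params.json')
with open(path, 'w') as f:
    json.dump(cb_params, f, indent=2)

with open(path) as f:
    restored = json.load(f)

print(restored == cb_params)
```

If a value turns out to be a NumPy scalar type that `json` rejects, converting it with `float()` or `int()` first should be enough.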


In [15]:
try:
    model = CatBoostRegressor(**cb_params, task_type='GPU')
    model.fit(X_tr, y_tr, cat_features=cat_features)
except Exception as err:
    #the tuned grow_policy can be rejected on GPU; retry without it
    print("GPU grow_policy error ({}), retrying without grow_policy".format(err))
    cb_params.pop('grow_policy', None)
    model = CatBoostRegressor(**cb_params, task_type='GPU')
    model.fit(X_tr, y_tr, cat_features=cat_features)

0:	learn: 14.9390690	total: 20.4ms	remaining: 20.4s
1:	learn: 13.9697474	total: 39.9ms	remaining: 19.9s
2:	learn: 13.0677435	total: 57ms	remaining: 18.9s
3:	learn: 12.2277031	total: 74.4ms	remaining: 18.5s
4:	learn: 11.4445313	total: 90.9ms	remaining: 18.1s
5:	learn: 10.7143965	total: 107ms	remaining: 17.8s
6:	learn: 10.0334518	total: 123ms	remaining: 17.5s
7:	learn: 9.3984492	total: 139ms	remaining: 17.3s
8:	learn: 8.8058600	total: 153ms	remaining: 16.9s
9:	learn: 8.2522637	total: 167ms	remaining: 16.5s
10:	learn: 7.7355313	total: 180ms	remaining: 16.1s
11:	learn: 7.2533906	total: 192ms	remaining: 15.8s
12:	learn: 6.8029635	total: 205ms	remaining: 15.6s
13:	learn: 6.3818034	total: 217ms	remaining: 15.3s
14:	learn: 5.9910872	total: 230ms	remaining: 15.1s
15:	learn: 5.6273665	total: 242ms	remaining: 14.9s
16:	learn: 5.2897643	total: 254ms	remaining: 14.7s
17:	learn: 4.9758577	total: 267ms	remaining: 14.5s
18:	learn: 4.6850430	total: 278ms	remaining: 14.4s
19:	learn: 4.4162474	total: 290

170:	learn: 1.1107217	total: 2.03s	remaining: 9.85s
171:	learn: 1.1082000	total: 2.04s	remaining: 9.84s
172:	learn: 1.1075855	total: 2.06s	remaining: 9.82s
173:	learn: 1.1063364	total: 2.07s	remaining: 9.81s
174:	learn: 1.1062466	total: 2.08s	remaining: 9.79s
175:	learn: 1.1053781	total: 2.09s	remaining: 9.78s
176:	learn: 1.1040479	total: 2.1s	remaining: 9.77s
177:	learn: 1.1031024	total: 2.11s	remaining: 9.75s
178:	learn: 1.1027451	total: 2.12s	remaining: 9.73s
179:	learn: 1.1019338	total: 2.13s	remaining: 9.72s
180:	learn: 1.1009847	total: 2.15s	remaining: 9.71s
181:	learn: 1.1000177	total: 2.15s	remaining: 9.69s
182:	learn: 1.0994211	total: 2.17s	remaining: 9.67s
183:	learn: 1.0966113	total: 2.18s	remaining: 9.66s
184:	learn: 1.0962386	total: 2.19s	remaining: 9.64s
185:	learn: 1.0958319	total: 2.2s	remaining: 9.62s
186:	learn: 1.0956440	total: 2.21s	remaining: 9.6s
187:	learn: 1.0952132	total: 2.22s	remaining: 9.58s
188:	learn: 1.0950298	total: 2.23s	remaining: 9.57s
189:	learn: 1.0

330:	learn: 0.9838386	total: 3.85s	remaining: 7.79s
331:	learn: 0.9836050	total: 3.87s	remaining: 7.78s
332:	learn: 0.9822715	total: 3.88s	remaining: 7.76s
333:	learn: 0.9821781	total: 3.89s	remaining: 7.75s
334:	learn: 0.9811847	total: 3.9s	remaining: 7.74s
335:	learn: 0.9803892	total: 3.91s	remaining: 7.73s
336:	learn: 0.9793239	total: 3.92s	remaining: 7.71s
337:	learn: 0.9790115	total: 3.93s	remaining: 7.7s
338:	learn: 0.9779425	total: 3.94s	remaining: 7.69s
339:	learn: 0.9772667	total: 3.96s	remaining: 7.68s
340:	learn: 0.9768381	total: 3.97s	remaining: 7.67s
341:	learn: 0.9760866	total: 3.98s	remaining: 7.65s
342:	learn: 0.9753236	total: 3.99s	remaining: 7.64s
343:	learn: 0.9742718	total: 4s	remaining: 7.63s
344:	learn: 0.9740565	total: 4.01s	remaining: 7.62s
345:	learn: 0.9734099	total: 4.02s	remaining: 7.6s
346:	learn: 0.9721651	total: 4.03s	remaining: 7.59s
347:	learn: 0.9715697	total: 4.05s	remaining: 7.58s
348:	learn: 0.9710143	total: 4.06s	remaining: 7.57s
349:	learn: 0.9700

490:	learn: 0.8772027	total: 5.68s	remaining: 5.89s
491:	learn: 0.8763219	total: 5.7s	remaining: 5.88s
492:	learn: 0.8757806	total: 5.71s	remaining: 5.87s
493:	learn: 0.8755321	total: 5.72s	remaining: 5.86s
494:	learn: 0.8754364	total: 5.73s	remaining: 5.84s
495:	learn: 0.8746247	total: 5.74s	remaining: 5.83s
496:	learn: 0.8736840	total: 5.75s	remaining: 5.82s
497:	learn: 0.8729630	total: 5.76s	remaining: 5.81s
498:	learn: 0.8711711	total: 5.78s	remaining: 5.8s
499:	learn: 0.8708353	total: 5.79s	remaining: 5.79s
500:	learn: 0.8701146	total: 5.8s	remaining: 5.78s
501:	learn: 0.8695938	total: 5.82s	remaining: 5.77s
502:	learn: 0.8687146	total: 5.83s	remaining: 5.76s
503:	learn: 0.8684126	total: 5.84s	remaining: 5.75s
504:	learn: 0.8665877	total: 5.85s	remaining: 5.74s
505:	learn: 0.8665072	total: 5.86s	remaining: 5.72s
506:	learn: 0.8661143	total: 5.87s	remaining: 5.71s
507:	learn: 0.8652985	total: 5.89s	remaining: 5.7s
508:	learn: 0.8645328	total: 5.9s	remaining: 5.69s
509:	learn: 0.864

666:	learn: 0.7762463	total: 7.75s	remaining: 3.87s
667:	learn: 0.7758167	total: 7.76s	remaining: 3.86s
668:	learn: 0.7756099	total: 7.77s	remaining: 3.85s
669:	learn: 0.7747450	total: 7.78s	remaining: 3.83s
670:	learn: 0.7739166	total: 7.8s	remaining: 3.82s
671:	learn: 0.7735563	total: 7.81s	remaining: 3.81s
672:	learn: 0.7734185	total: 7.82s	remaining: 3.8s
673:	learn: 0.7732432	total: 7.83s	remaining: 3.79s
674:	learn: 0.7723215	total: 7.85s	remaining: 3.78s
675:	learn: 0.7720518	total: 7.86s	remaining: 3.77s
676:	learn: 0.7717111	total: 7.87s	remaining: 3.75s
677:	learn: 0.7712239	total: 7.88s	remaining: 3.74s
678:	learn: 0.7709723	total: 7.89s	remaining: 3.73s
679:	learn: 0.7707190	total: 7.9s	remaining: 3.72s
680:	learn: 0.7703640	total: 7.92s	remaining: 3.71s
681:	learn: 0.7700815	total: 7.92s	remaining: 3.69s
682:	learn: 0.7695950	total: 7.94s	remaining: 3.68s
683:	learn: 0.7684272	total: 7.95s	remaining: 3.67s
684:	learn: 0.7677002	total: 7.96s	remaining: 3.66s
685:	learn: 0.7

833:	learn: 0.6877500	total: 10s	remaining: 1.99s
834:	learn: 0.6874496	total: 10s	remaining: 1.98s
835:	learn: 0.6872695	total: 10s	remaining: 1.97s
836:	learn: 0.6867301	total: 10s	remaining: 1.96s
837:	learn: 0.6854186	total: 10.1s	remaining: 1.95s
838:	learn: 0.6843673	total: 10.1s	remaining: 1.93s
839:	learn: 0.6839549	total: 10.1s	remaining: 1.92s
840:	learn: 0.6832650	total: 10.1s	remaining: 1.91s
841:	learn: 0.6826971	total: 10.1s	remaining: 1.9s
842:	learn: 0.6822694	total: 10.1s	remaining: 1.89s
843:	learn: 0.6814019	total: 10.1s	remaining: 1.88s
844:	learn: 0.6802561	total: 10.2s	remaining: 1.86s
845:	learn: 0.6801510	total: 10.2s	remaining: 1.85s
846:	learn: 0.6796740	total: 10.2s	remaining: 1.84s
847:	learn: 0.6792148	total: 10.2s	remaining: 1.83s
848:	learn: 0.6789285	total: 10.2s	remaining: 1.82s
849:	learn: 0.6786059	total: 10.2s	remaining: 1.8s
850:	learn: 0.6782533	total: 10.2s	remaining: 1.79s
851:	learn: 0.6767786	total: 10.2s	remaining: 1.78s
852:	learn: 0.6767048	

In [16]:
test = test.drop(LABEL_COL_NAME, axis = 1)
y_test = np.expm1(model.predict(test))
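
The call to `np.expm1` undoes a `log1p` transform, which suggests the label was log-transformed earlier in the notebook (an assumption here, since that cell is not shown in this section). Under that assumption, the RMSE computed on the transformed label equals the RMSLE on the raw revenue. The sketch below checks both facts with made-up numbers, using `math` instead of NumPy to stay dependency-free. `rmse` and `rmsle` are illustrative helpers.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw (non-negative) values."""
    return rmse([math.log1p(t) for t in y_true], [math.log1p(p) for p in y_pred])

# Made-up revenues and predictions on the raw scale.
revenue_true = [1_000_000.0, 25_000_000.0, 300_000.0]
revenue_pred = [1_500_000.0, 20_000_000.0, 250_000.0]

# What a model trained on log1p(revenue) would see and predict.
log_true = [math.log1p(v) for v in revenue_true]
log_pred = [math.log1p(v) for v in revenue_pred]

# expm1 maps log-scale predictions back to revenue.
recovered = [math.expm1(v) for v in log_pred]

print(rmse(log_true, log_pred), rmsle(revenue_true, recovered))
```

The two printed numbers should agree up to floating-point error, which is why a model can be tuned on plain RMSE when the competition metric is RMSLE.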

In [17]:
submission = pd.read_csv(SUBMISSON_PATH, index_col='id')
submission[LABEL_COL_NAME] = y_test[:-1]
submission.to_csv('submission.csv')
print(submission)

           revenue
id                
3001  6.325281e+05
3002  2.362490e+05
3003  6.607285e+06
3004  1.597385e+07
3005  1.886249e+06
3006  4.377182e+06
3007  2.625564e+06
3008  3.261047e+07
3009  2.550607e+07
3010  4.095561e+08
3011  1.053723e+06
3012  3.220981e+05
3013  2.413199e+07
3014  1.177211e+06
3015  2.020941e+07
3016  5.809602e+05
3017  4.610647e+07
3018  1.195674e+08
3019  1.314391e+07
3020  2.242504e+08
3021  5.084288e+07
3022  3.264951e+07
3023  4.653937e+05
3024  1.573245e+07
3025  1.047255e+06
3026  1.327498e+08
3027  2.252228e+06
3028  8.503923e+07
3029  4.909475e+05
3030  8.915429e+07
...            ...
7369  1.044012e+07
7370  8.661743e+07
7371  6.015235e+05
7372  6.897314e+07
7373  2.506530e+08
7374  1.942182e+07
7375  2.313075e+07
7376  6.284405e+06
7377  2.359627e+07
7378  1.522728e+07
7379  2.716889e+07
7380  9.132610e+05
7381  7.569085e+05
7382  3.681252e+06
7383  1.317653e+05
7384  1.542667e+07
7385  4.446349e+07
7386  2.947730e+07
7387  8.466639e+06
7388  1.7777