# Overview
- **Tweedie distribution**
 - I noticed that the distribution of scores for this competition looks like a reversed Tweedie distribution.
 - There is a large spike of 100s and a roughly normal curve otherwise, so if you flip the scores around you get something close to a Tweedie distribution.
- **LightGBM**
 - LightGBM handles Tweedie-distributed targets well if you simply set its objective to `tweedie`.
 - I also wanted to get some experience tuning a LightGBM model.
- **Tabular data only**
 - This notebook works with the tabular data only.
 - The intention is to get a working model and df here, then do some image recognition in another notebook and add onto the df and model created here.


h/t to this notebook, which I forked. It already did a lot of the basic work of loading the data into a dataframe and running EDA, which was nice to leverage and skip over.
https://www.kaggle.com/carlolepelaars/petfinder2021-eda-baseline

In [None]:
import os, sys, gc, time, warnings, pickle, random
import random as rn
from glob import glob

import numpy as np
import pandas as pd
import psutil
import matplotlib.pyplot as plt
import lightgbm as lgb_no_tune  # plain LightGBM; the name `lgb` is used for the Optuna tuner later
from sklearn.model_selection import train_test_split

# Path variables
BASE_PATH = "../input/petfinder-pawpularity-score/"
TRAIN_PATH = BASE_PATH + "train.csv"
TEST_PATH = BASE_PATH + "test.csv"
TRAIN_IMAGES = glob(BASE_PATH + "train/*.jpg")
TEST_IMAGES = glob(BASE_PATH + "test/*.jpg")

# We are trying to predict this "Pawpularity" variable
TARGET = "Pawpularity"

# Seed for reproducibility
seed = 1234
rn.seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)

In [None]:
from sklearn.metrics import mean_squared_error
def rmse(y_true, y_pred):
    """Numpy RMSE"""
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [None]:
df = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)


## EDA

The features given in the CSV are additional binary descriptive features. 

Our target is the "Pawpularity" score which ranges between 1 and 100.

In [None]:
print(df.shape)
df.head()

In [None]:
df[TARGET].plot(kind='hist', bins=100, figsize=(15, 6));
plt.title("Target distribution", weight='bold', fontsize=16);

# Backwards Tweedie
This distribution looks a lot like a Tweedie/Poisson distribution, but in reverse: there is a large spike of 100s and what looks like a roughly normal curve otherwise. This matters because LightGBM (like other boosted tree libraries) lets you pick an objective that matches the kind of distribution you expect the target to follow, and that tends to help the fit.

So we're going to flip the distribution around, make predictions on the flipped version, and then flip the predictions back.

More about Tweedie:
https://en.wikipedia.org/wiki/Tweedie_distribution
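
To make the flip concrete, here is a minimal sketch on synthetic scores (a made-up stand-in, not the competition data). It assumes a bell-shaped bulk plus a spike of exact 100s, and shows that `100 - score` turns the spike into a point mass at zero and that flipping twice gets the original scores back.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real scores (NOT the competition data):
# a bell-shaped bulk limited to 1..99 plus a spike of exact 100s.
bulk = np.clip(rng.normal(loc=35, scale=15, size=900).round(), 1, 99)
spike = np.full(100, 100.0)
scores = np.concatenate([bulk, spike])

flipped = 100 - scores    # the spike at 100 becomes a point mass at 0
restored = 100 - flipped  # flipping a second time recovers the scores

zero_share = float(np.mean(flipped == 0))
print(f"share of exact zeros after flipping: {zero_share:.2f}")
print("round trip exact:", bool(np.array_equal(restored, scores)))
```

A pile of exact zeros next to a continuous positive part is the shape a Tweedie objective is meant for, which is the whole reason for the flip.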

In [None]:
# Simply... 100 - original score = flipped version
df['Pawpularity_reverse'] = 100-df['Pawpularity']
df['Pawpularity_reverse'].head()

In [None]:
TARGET = "Pawpularity_reverse" #changing the target to the reverse
#looking to make sure it was done right. each row should add up to 100
df[['Pawpularity_reverse', 'Pawpularity']].head() 

In [None]:
#Voila.. this looks basically like a tweedie distribution 
#with the exception of the little bump at the end...
df[TARGET].plot(kind='hist', bins=100, figsize=(15, 6));
plt.title("Reversed target distribution", weight='bold', fontsize=16);

# LightGBM setup
First, an out-of-the-box LightGBM fit. The only parameter changed from the defaults that affects training is the objective, which is set to `tweedie`.
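
For intuition about what a Tweedie objective rewards, here is a rough sketch of the unit Tweedie deviance for a variance power `p` between 1 and 2. The helper `tweedie_deviance` is a hypothetical illustration written for this notebook, not LightGBM's internal code, and `p = 1.5` is assumed because that is the default of LightGBM's `tweedie_variance_power` parameter.

```python
import numpy as np

def tweedie_deviance(y, mu, p=1.5):
    """Unit Tweedie deviance for 1 < p < 2, with y >= 0 and mu > 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return 2.0 * (
        y ** (2 - p) / ((1 - p) * (2 - p))
        - y * mu ** (1 - p) / (1 - p)
        + mu ** (2 - p) / (2 - p)
    )

# Candidate predictions from 1 to 100 in steps of 0.1
grid = np.linspace(1.0, 100.0, 991)

# For a single observation, the deviance is smallest when mu equals y
y_obs = 40.0
best_mu = float(grid[np.argmin(tweedie_deviance(y_obs, grid))])

# Exact zeros are valid observations: the deviance stays finite there
dev_at_zero = float(tweedie_deviance(0.0, 25.0))

print(best_mu, dev_at_zero)
```

The point of the sketch is that zeros (the flipped 100s) are handled naturally, whereas the prediction `mu` itself must stay positive.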

In [None]:
# Remove the original score from the df so it is not used as a feature
Pawpularity = df.pop('Pawpularity')  # pop() returns the column, so it is saved for later
FEATURES = [col for col in df.columns if col not in ['Id', TARGET]]

In [None]:
%%time
X_train, X_test, y_train, y_test = train_test_split(df[FEATURES], df[TARGET], test_size=0.2, random_state=seed)
model = lgb_no_tune.LGBMRegressor(objective="tweedie", metric="rmse")
#model = lgb_no_tune.LGBMRegressor()  # default objective, for comparison
model.fit(X_train, y_train)

In [None]:
print(f"Train RMSE: {rmse(y_train, model.predict(X_train)).round(4)}")
print(f"Test RMSE: {rmse(y_test, model.predict(X_test)).round(4)}")

## Scores
Out-of-the-box LightGBM (default objective):
* Train RMSE: 20.4106 
* Test RMSE: 20.5170

Tweedie objective:
* Train RMSE: 20.4153
* Test RMSE: 20.5247
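
A handy reference point for numbers like these is the RMSE of always predicting the training mean. The sketch below uses synthetic uniform scores as a stand-in (an assumption, not the competition data), so the printed value says nothing about this dataset. It only shows how to compute the baseline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in scores in 1..100 (NOT the competition data)
y_train = rng.integers(1, 101, size=800).astype(float)
y_valid = rng.integers(1, 101, size=200).astype(float)

# Constant prediction: the mean of the training target
constant = y_train.mean()
baseline_rmse = float(np.sqrt(np.mean((y_valid - constant) ** 2)))

print(f"constant-prediction RMSE: {baseline_rmse:.4f}")
```

If a model's validation RMSE is barely below this baseline, the features are adding very little signal.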


The simple decision tree in the notebook below scored 20.4999 on test, so we're doing slightly worse!
https://www.kaggle.com/carlolepelaars/petfinder2021-eda-baseline
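
Comparing against a score computed on the original target is fair, because RMSE does not change when both the target and the predictions are flipped: the errors only change sign. A quick check with made-up numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error with plain NumPy."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up scores and predictions, purely for illustration
y_true = np.array([100, 35, 42, 7, 63], dtype=float)
y_pred = np.array([80, 40, 38, 20, 55], dtype=float)

rmse_original = rmse(y_true, y_pred)
rmse_flipped = rmse(100 - y_true, 100 - y_pred)

print(rmse_original, rmse_flipped)
```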

Let's see if tuning it will help at all...

In [None]:
import optuna 
import optuna.integration.lightgbm as lgb

In [None]:
%%time
#ts = time.time()

dtrain = lgb.Dataset(X_train, label=y_train)
eval_data = lgb.Dataset(X_test, label=y_test)


param = {
        'objective': 'tweedie',
        'metric': 'rmse',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'seed': 42}

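# The Optuna integration wraps lgb.train and tunes the parameters listed below stepwise.
# Note: LightGBM 4.0 removed `early_stopping_rounds` from train(); on newer versions
# pass callbacks=[lgb_no_tune.early_stopping(100)] instead.
best = lgb.train(param,
                 dtrain,
                 valid_sets=eval_data,
                 early_stopping_rounds=100)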

#time.time() - ts

# time: 2945.9576

"""
###Parameters that are actually tuned###

param = {
        'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
        'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
"""

In [None]:
best.params

In [None]:
best.best_iteration

In [None]:
best.best_score
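
The tuning cell above lists `lambda_l1` and `lambda_l2` with a log-uniform range of 1e-8 to 10.0. As a rough illustration of what that search space means, the sketch below draws values uniformly in the log domain with NumPy. This is not Optuna's implementation, only the underlying idea.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-8, 10.0  # range taken from the tuned-parameter listing

# Log-uniform: sample uniformly between log(low) and log(high), then exponentiate
samples = np.exp(rng.uniform(np.log(low), np.log(high), size=10_000))

# Plain uniform sampling over the same range, for contrast
linear = rng.uniform(low, high, size=10_000)

share_tiny = float(np.mean(samples < 1e-4))
share_tiny_linear = float(np.mean(linear < 1e-4))

print(f"below 1e-4, log-uniform: {share_tiny:.3f}")
print(f"below 1e-4, uniform:     {share_tiny_linear:.3f}")
```

Log-domain sampling spends a similar effort on every order of magnitude, which suits regularisation strengths where 1e-6 and 1e-2 are both plausible.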

In [None]:
%%time
# Dataset objects are only needed for the lgb.train API (commented out below)
train_data = lgb_no_tune.Dataset(X_train, label=y_train)
valid_data = lgb_no_tune.Dataset(X_test, label=y_test, reference=train_data)

# Hyperparameters taken from the tuning run above
model = lgb_no_tune.LGBMRegressor(
    objective='tweedie',
    metric='rmse',
    verbosity=-1,
    boosting_type='gbdt',
    seed=42,
    feature_pre_filter=False,
    lambda_l1=0.006059620618914152,
    lambda_l2=0.007561768153951137,
    num_leaves=221,
    feature_fraction=1.0,
    bagging_fraction=1.0,
    bagging_freq=0,
    min_child_samples=20,
    num_iterations=1000,
    # valid_sets=[valid_data],
    # early_stopping_round=100,
)
model.fit(X_train, y_train)

# lgbfit = lgb.train(best_params,
#                    dtrain,
#                    valid_sets=eval_data,
#                    early_stopping_rounds=100)

In [None]:
#train on the whole dataset
model.fit(df[FEATURES], df[TARGET])

# Make a submission file

In [None]:
# Train final model on all training data
model.fit(df[FEATURES], df[TARGET])

## Submission

In [None]:
test.head(2)

In [None]:
%%time
test[TARGET] = model.predict(test[FEATURES])
sub = test[['Id', TARGET]].copy()  # copy so later column assignments don't write to a view of `test`

In [None]:
sub.head()

In [None]:
# Flip the predictions on the reversed target back to the original score scale
sub['Pawpularity'] = 100 - sub['Pawpularity_reverse']
sub.head()
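
One optional safeguard, not part of the original pipeline: since the target is stated to range between 1 and 100, the flipped-back predictions can be clipped to that range before submitting. The values below are made up, including some out-of-range ones, purely to exercise the clip.

```python
import numpy as np

# Made-up predictions on the reversed scale, for illustration only
pred_reverse = np.array([-3.2, 0.0, 61.5, 99.7, 104.1])

# Flip back to the original scale, then keep scores inside 1..100
pawpularity = np.clip(100 - pred_reverse, 1, 100)

print(pawpularity)
```

With a dataframe like `sub`, the same idea would be `sub['Pawpularity'].clip(1, 100)`.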

In [None]:
sub = sub.drop(columns=['Pawpularity_reverse'])
sub.head()

In [None]:
sub.to_csv("submission.csv", index=False)

In [None]:
sub['Pawpularity'].plot(kind='hist', bins=15, title='Prediction distribution');