# Table of Contents

#### [1 What is TrueFoundry?](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#1-What-is-TrueFoundry?)
#### [2 Tabular Playground Series](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#-2-Tabular-Playground-Series)
#### [3 Preparation](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#-3-Preparation)

*  [3.1 Essential Packages](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#3.1-Essential-Packages)

#### [4 Data Loading and Preprocessing](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#4-Data-Loading-and-Preprocessing)

*  [4.1 Filling missing values](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#4.1-Filling-missing-values)
*  [4.2 Encoding](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#4.2-Encoding)

#### [5 Models](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#5-Models)

* [5.1 LGBM Classifier](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#5.1-LGBM-Classifier)
* [5.2 XGBoost Classifier](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#5.2-XGBoost-Classifier)
* [5.3 Ensemble](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#5.3-Ensemble)
* [5.4 Submission](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#5.4-Submission)

#### [6 TrueFoundry Platform](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#6-TrueFoundry-Platform)
* [6.1 Projects](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#6.1-Projects)
* [6.2 Runs](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#6.2-Runs)

  * [6.2.1 Overview](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#6.2.1-Overview)
  * [6.2.2 Run Metrics](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#6.2.2-Run-Metrics)
  * [6.2.3 Data & Feature Metrics](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#6.2.3-Data-&-Feature-Metrics)
  * [6.2.4 General Artifact](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#6.2.4-General-Artifact)

* [6.3 Models Comparison](https://www.kaggle.com/code/khotijahs1/ml-experiment-tracking-with-truefoundry-platform/notebook#6.3-Models-Comparison)




## 1 What is [TrueFoundry](https://app.truefoundry.com/signin)?





[TrueFoundry](https://app.truefoundry.com/mlfoundry) aims to provide the different components of a machine learning stack, wound together so that they talk to each other seamlessly and teams don't have to spend time gluing pieces together. While the pieces are tightly knit, the components are also designed so that they can be integrated with other tools in the future. TrueFoundry can be accessed at https://app.truefoundry.com/mlfoundry.

TrueFoundry comprises the following pieces:

1. ```MlFoundry```: used during model training to log your model artifacts, parameters, data and code, so that you can collaborate with your team and reproduce machine learning experiments.
2. ```ServiceFoundry```: a single API that containerizes and deploys your model to a managed Kubernetes cluster. This also generates a Grafana cluster with complete visibility of your Service Health, System Logs, and Kubernetes Workspace.
3. ```Monitoring```: model input-output monitoring, data drift charts, and root-cause analysis when things break. Coming soon!

## 2 Tabular Playground Series

The dataset used for this competition is synthetic, but it is based on a real dataset (in this case, the actual Titanic data) and was generated using a CTGAN. The statistical properties of this dataset are very similar to those of the original Titanic dataset, but there's no way to "cheat" by using public labels for predictions.

## 3 Preparation

Prepare the packages and data that will be used in the analysis. We will use [TrueFoundry](https://app.truefoundry.com/signin) to track our experiments. The essential packages loaded below are mainly for data manipulation, data visualization, and modeling.



### 3.1 Essential Packages

In [None]:
import pandas as pd
import numpy as np
import random
import os

from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold

import lightgbm as lgb
import catboost as ctb
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier 
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz

import graphviz
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

import warnings
warnings.simplefilter('ignore')


In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("api_key")

In [None]:
!pip install mlfoundry
import mlfoundry as mlf

Log in to the TrueFoundry platform and get an API key. Create a new API key if you don't have one, or copy a previously generated key. API keys can be found at https://app.truefoundry.com/settings.

In [None]:
Image('../input/truefoundry-2022/API_Keys.png')



We log in using ```mlf.get_client```, passing our API key as the ```api_key``` argument. We also set a project name.

In [None]:
client = mlf.get_client(api_key=secret_value_0)
project_name = 'synthanic'

## 4 Data Loading and Preprocessing



In [None]:
TARGET = 'Survived'

N_ESTIMATORS = 2000
N_SPLITS = 5
SEED = 2021
EARLY_STOPPING_ROUNDS = 100
VERBOSE = 100


def set_seed(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    
set_seed(SEED)

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')
submission = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv')
test_df[TARGET] = pd.read_csv("../input/local-tps-apr/pseudo_label.csv")[TARGET]

all_df = pd.concat([train_df, test_df]).reset_index(drop=True)

### 4.1 Filling missing values


In [None]:
# Age, fillna with the overall mean age
all_df['Age'] = all_df['Age'].fillna(all_df['Age'].mean())

# Cabin, fillna with 'X' and take first letter
all_df['Cabin'] = all_df['Cabin'].fillna('X').map(lambda x: x[0].strip())

# Ticket, fillna with 'X'; keep the first token when the ticket has several tokens, else 'X'
all_df['Ticket'] = all_df['Ticket'].fillna('X').map(lambda x:str(x).split()[0] if len(str(x).split()) > 1 else 'X')

# Fare, fillna with the median fare of the passenger's Pclass, then log-transform
fare_map = all_df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()
all_df['Fare'] = all_df['Fare'].fillna(all_df['Pclass'].map(fare_map['Fare']))
all_df['Fare'] = np.log1p(all_df['Fare'])

# Embarked, fillna with 'X' value
all_df['Embarked'] = all_df['Embarked'].fillna('X')

# Name, take only surnames
all_df['Name'] = all_df['Name'].map(lambda x: x.split(',')[0])

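The class-wise fare imputation above can be seen on a small example. This is a minimal sketch on made-up values: the `toy` frame is hypothetical, and only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two passenger classes, one missing fare in each.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Fare':   [80.0, 100.0, np.nan, 7.0, 9.0, np.nan],
})

# Median fare per class, computed from the rows where Fare is known.
fare_by_class = toy.dropna(subset=['Fare']).groupby('Pclass')['Fare'].median()

# Map each row's class to its median and use it only where Fare is missing.
toy['Fare_filled'] = toy['Fare'].fillna(toy['Pclass'].map(fare_by_class))

# Same log transform as in the notebook, applied after imputation.
toy['Fare_log'] = np.log1p(toy['Fare_filled'])
print(toy)
```

Filling per class rather than with one global value keeps a missing first-class fare from being replaced by a number dominated by third-class tickets.
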
### 4.2 Encoding

In [None]:
label_cols = ['Name', 'Ticket', 'Sex']
onehot_cols = ['Cabin', 'Embarked']
numerical_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

In [None]:
def label_encoder(c):
    le = LabelEncoder()
    return le.fit_transform(c)

scaler = StandardScaler()

onehot_encoded_df = pd.get_dummies(all_df[onehot_cols])
label_encoded_df = all_df[label_cols].apply(label_encoder)
numerical_df = pd.DataFrame(scaler.fit_transform(all_df[numerical_cols]), columns=numerical_cols)
target_df = all_df[TARGET]

all_df = pd.concat([numerical_df, label_encoded_df, onehot_encoded_df, target_df], axis=1)

## 5 Models

We are going to use two models, an LGBM classifier and an XGBoost classifier, each trained with 5-fold stratified cross-validation.

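Because the folds are drawn from the concatenated train and pseudo-labelled test rows, each fold's validation indices have to be split into two groups before predictions are stored. The sketch below illustrates that bookkeeping with made-up sizes and indices; `n_train`, `n_test`, `valid_idx` and `valid_pred` are hypothetical stand-ins for the notebook's objects.

```python
import numpy as np

n_train, n_test = 6, 4   # hypothetical sizes; rows 0..5 are train, 6..9 are test

# Hypothetical validation indices of one fold, and the model's predictions for them.
valid_idx = np.array([1, 4, 7, 9])
valid_pred = np.array([1, 0, 1, 1])

is_train_row = valid_idx < n_train

oof = np.zeros(n_train)    # out-of-fold predictions for the train rows
preds = np.zeros(n_test)   # predictions for the test rows

# Train rows in the validation fold become out-of-fold predictions.
oof[valid_idx[is_train_row]] = valid_pred[is_train_row]

# Test rows are shifted by n_train so that row 7 lands at position 1 of `preds`.
preds[valid_idx[~is_train_row] - n_train] = valid_pred[~is_train_row]

print(oof, preds)
```

Since every row belongs to the validation part of exactly one fold, both arrays are completely filled once all folds have run.
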
### TrueFoundry Experiment Tracking

We will track our experiments using TrueFoundry. Below are some explanations of the TrueFoundry-related lines in the code. We use the XGB classifier as the example; the same pattern is used for the other model:

- Call ```run = client.create_run(project_name=project_name, run_name="XGB")``` to start logging an experiment under a project name and a run name. Here the project is ```synthanic```, which was set up earlier, and the run is named ```XGB``` for the ```XGBoost Classifier``` model.
- We can track a dataset together with its actual and predicted targets using ```run.log_dataset(features=train_df[features], dataset_name="full", actuals=train_df[TARGET], predictions=train_oof)```. Logging the actual and predicted targets lets us compare them on the TrueFoundry platform. This call is placed at the end of the code because it has to wait until the predictions of every fold are finished.
- We do the same for each fold's dataset with ```run.log_dataset(dataset_name="fold_"+str(fold), features=X_valid, actuals=y_valid, predictions=temp_oof)```, called right after that fold's predictions are finished.
- To log the model's hyperparameters, we use ```run.log_params(model.get_xgb_params())```. Only one set of hyperparameters can be logged per run, which is why this call is placed at the end of the code.
- We can also log our validation accuracy over the boosting rounds using the code below:
    ```results = model.evals_result()```
    ```epochs = len(results['validation_1']['error'])```
    ```accuracy_fold = [1-err for err in results['validation_1']['error']]```
    ```for global_step in range(epochs):```
    ```run.log_metrics(metric_dict={f'Accuracy_fold_{fold}':accuracy_fold[global_step]}, step=global_step)```
    Here ```validation_1``` is the second entry of ```eval_set```, i.e. the validation fold, and the snippet assumes the classification ```error``` metric is being evaluated, so that ```1 - error``` is accuracy. With ```logloss```, ```1 - logloss``` is only a rough proxy and not an accuracy.
    We can then compare the ```accuracy``` of each ```fold``` and the ```OOF``` accuracy across all models over time.
- Last but not least, we need to end our run using ```run.end()```.

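The per-round logging pattern described above can be tried without a TrueFoundry account. The following is a minimal sketch: `InMemoryRun` is a hypothetical stand-in written for this illustration and is not part of `mlfoundry`; it only mirrors the `log_metrics(metric_dict=..., step=...)` call shape used in this notebook, and the error values are made up.

```python
class InMemoryRun:
    """Hypothetical stand-in for a tracking run; keeps logged metrics in a list."""

    def __init__(self):
        self.logged = []

    def log_metrics(self, metric_dict, step=0):
        self.logged.append((step, dict(metric_dict)))


def log_accuracy_curve(run, errors, fold):
    """Convert a per-round error curve to accuracy and log one value per step."""
    for step, err in enumerate(errors):
        run.log_metrics(metric_dict={f'Accuracy_fold_{fold}': 1 - err}, step=step)


run = InMemoryRun()
validation_error = [0.25, 0.5, 0.125]   # made-up per-round classification error
log_accuracy_curve(run, validation_error, fold=0)
print(run.logged)
```

Using one metric key per fold, as the notebook does, is what lets the platform draw a separate line for each fold.
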


### 5.1 LGBM Classifier

LGBM (Light Gradient Boosting Machine) is a gradient boosting framework based on decision trees that aims to increase training efficiency and reduce memory usage. It uses two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to address the limitations of the histogram-based algorithm used in most GBDT (Gradient Boosting Decision Tree) frameworks. GOSS keeps the rows with large gradients and samples from the rest, while EFB bundles features that are rarely non-zero at the same time. Together they characterize the LightGBM algorithm and give it an efficiency edge over other GBDT frameworks.

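To make the GOSS idea mentioned above concrete, here is a minimal NumPy sketch of one sampling step as described in the LightGBM paper. It is an illustration only, not LightGBM's internal implementation; the function name `goss_sample` and its default rates are assumptions chosen for the example.

```python
import numpy as np


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (row indices, row weights) for one GOSS-style sampling step."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Sort rows by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]      # large-gradient rows are always kept
    rest_idx = order[n_top:]     # small-gradient rows are randomly subsampled
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Up-weight the sampled small-gradient rows by (1 - a) / b so that their
    # total contribution stays roughly what it was before subsampling.
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, sampled_idx]), weights


grads = np.linspace(-1.0, 1.0, 100)   # made-up gradients
idx, w = goss_sample(grads)
print(len(idx), w[:3], w[-3:])
```

With these rates only 30 of the 100 rows are used to evaluate splits, which is where the speed-up comes from.
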
In [None]:
params = {
    'metric': 'binary_logloss',
    'n_estimators': N_ESTIMATORS,
    'objective': 'binary',
    'random_state': SEED,
    'learning_rate': 0.01,
    'min_child_samples': 150,
    'reg_alpha': 3e-5,
    'reg_lambda': 9e-2,
    'num_leaves': 20,
    'max_depth': 16,
    'colsample_bytree': 0.8,
    'subsample': 0.8,
    'subsample_freq': 2,
    'max_bin': 240,
}


In [None]:
run = client.create_run(project_name=project_name, run_name="LGBMClassifier") #TrueFoundry
lgb_oof  = np.zeros(train_df.shape[0])
lgb_preds = np.zeros(test_df.shape[0])

skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)

for fold, (train_idx, valid_idx) in enumerate(skf.split(all_df, all_df[TARGET])):
    print(f"===== FOLD {fold} =====")
    oof_idx = np.array([idx for idx in valid_idx if idx < train_df.shape[0]])
    preds_idx = np.array([idx for idx in valid_idx if idx >= train_df.shape[0]])

    X_train, y_train = all_df.iloc[train_idx].drop(TARGET, axis=1), all_df.iloc[train_idx][TARGET]
    X_valid, y_valid = all_df.iloc[oof_idx].drop(TARGET, axis=1), all_df.iloc[oof_idx][TARGET]
    X_test = all_df.iloc[preds_idx].drop(TARGET, axis=1)
    
    pre_model = lgb.LGBMClassifier(**params)
    pre_model.fit(
        X_train, y_train,
        eval_set=[(X_train, y_train),(X_valid, y_valid)],
        early_stopping_rounds=EARLY_STOPPING_ROUNDS,
        verbose=VERBOSE
    )

    params2 = params.copy()
    params2['learning_rate'] = params['learning_rate'] * 0.1
    model = lgb.LGBMClassifier(**params2)
    model.fit(
        X_train, y_train,
        eval_set=[(X_train, y_train),(X_valid, y_valid)],
        early_stopping_rounds=EARLY_STOPPING_ROUNDS,
        verbose=VERBOSE,
        init_model=pre_model
    )
        
    temp_oof= model.predict(X_valid)
    lgb_oof[oof_idx]=temp_oof
    lgb_preds[preds_idx-train_df.shape[0]] = model.predict(X_test)
    
    print(f'Fold {fold} Accuracy: ', accuracy_score(y_valid, temp_oof))
    results = model.evals_result_
    epochs = len(results['valid_1']['binary_logloss'])
    # NOTE: 1 - logloss is only a rough proxy; it is not classification accuracy
    accuracy_fold = [1-err for err in results['valid_1']['binary_logloss']]
    for global_step in range(epochs):
        run.log_metrics(metric_dict={f'Accuracy_fold_{fold}':accuracy_fold[global_step]}, step=global_step) #TrueFoundry
    run.log_dataset(dataset_name="fold_"+str(fold), features=X_valid, actuals=y_valid, predictions=temp_oof) #TrueFoundry

    
accuracy_final = accuracy_score(train_df['Survived'], lgb_oof)    
print('OOF Accuracy: ', accuracy_final)

run.log_params(model.get_params()) #TrueFoundry
run.log_metrics(metric_dict={'Accuracy_OOF':accuracy_final}) #TrueFoundry
run.end() #TrueFoundry
 
    

### 5.2 XGBoost Classifier

XGBoost, which stands for Extreme Gradient Boosting, is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is widely used for regression, classification, and ranking problems.

XGBoost is a scalable and accurate implementation of gradient boosting, built with model performance and computational speed in mind. Trees are still added sequentially, as in any GBDT; the parallelism lies inside the construction of each tree, where candidate splits are evaluated across features in parallel. It follows a level-wise strategy, scanning across the gradient values and using their partial sums to evaluate the quality of every candidate split in the training set.

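The use of partial sums to score splits can be sketched in a few lines. The example below applies the split-gain formula from the XGBoost paper to rows that are assumed to be already sorted by one feature; `best_split` is a hypothetical helper written for this illustration, not an XGBoost API, and the gradients and Hessians are made up.

```python
import numpy as np


def best_split(grad, hess, reg_lambda=1.0, gamma=0.0):
    """Scan all split positions of a feature-sorted node.

    Returns (number of rows sent to the left child, gain of that split).
    """
    G, H = grad.sum(), hess.sum()
    GL = np.cumsum(grad)[:-1]   # gradient sums left of each candidate split
    HL = np.cumsum(hess)[:-1]   # Hessian sums left of each candidate split
    GR, HR = G - GL, H - HL     # right-hand sums follow from the totals

    def score(g, h):
        return g ** 2 / (h + reg_lambda)

    gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
    pos = int(np.argmax(gain))
    return pos + 1, float(gain[pos])


# Made-up statistics: the first three rows pull one way, the last three the other.
grad = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
hess = np.ones(6)
n_left, gain = best_split(grad, hess)
print(n_left, gain)
```

One cumulative sum gives the left-hand statistics of every candidate at once, so the whole scan is linear in the number of rows.
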
In [None]:
params = {
'max_depth':7,
'n_estimators':5000,
'objective':'binary:logistic',
'booster':'gbtree',
'n_jobs':1,
'min_child_weight':1,
'colsample_bytree':0.7,
}




In [None]:
run = client.create_run(project_name=project_name, run_name="XGB") #TrueFoundry

xgb_oof  = np.zeros(train_df.shape[0])
xgb_preds = np.zeros(test_df.shape[0])

skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)

for fold, (train_idx, valid_idx) in enumerate(skf.split(all_df, all_df[TARGET])):
    print(f"===== FOLD {fold} =====")
    oof_idx = np.array([idx for idx in valid_idx if idx < train_df.shape[0]])
    preds_idx = np.array([idx for idx in valid_idx if idx >= train_df.shape[0]])

    X_train, y_train = all_df.iloc[train_idx].drop(TARGET, axis=1), all_df.iloc[train_idx][TARGET]
    X_valid, y_valid = all_df.iloc[oof_idx].drop(TARGET, axis=1), all_df.iloc[oof_idx][TARGET]
    X_test = all_df.iloc[preds_idx].drop(TARGET, axis=1)
    
    model = XGBClassifier(**params)
    model.fit(
        X_train, y_train,
        eval_set=[(X_train, y_train),(X_valid, y_valid)],
        early_stopping_rounds=EARLY_STOPPING_ROUNDS,
        verbose=VERBOSE
    )

            
    temp_oof= model.predict(X_valid)
    xgb_oof[oof_idx]=temp_oof
    xgb_preds[preds_idx-train_df.shape[0]] = model.predict(X_test)
    
    print(f'Fold {fold} Accuracy: ', accuracy_score(y_valid, temp_oof))
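    results = model.evals_result()
    # eval_set is [(train), (valid)], so 'validation_1' holds the validation-fold curve
    epochs = len(results['validation_1']['logloss'])
    # NOTE: 1 - logloss is only a rough proxy; it is not classification accuracy
    accuracy_fold = [1-err for err in results['validation_1']['logloss']]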
    for global_step in range(epochs):
        run.log_metrics(metric_dict={f'Accuracy_fold_{fold}':accuracy_fold[global_step]}, step=global_step) #TrueFoundry
    run.log_dataset(dataset_name="fold_"+str(fold), features=X_valid, actuals=y_valid, predictions=temp_oof) #TrueFoundry

    
accuracy_final = accuracy_score(train_df['Survived'], xgb_oof)    
print('OOF Accuracy: ', accuracy_final)

run.log_params(model.get_params()) #TrueFoundry
run.log_metrics(metric_dict={'Accuracy_OOF':accuracy_final}) #TrueFoundry
run.end() #TrueFoundry
 
    

### 5.3 Ensemble

This is only for competition purposes (to boost the score).

In [None]:
submission['submit_lgb'] = np.where(lgb_preds>0.5, 1, 0)
submission['submit_xgb'] = np.where(xgb_preds>0.5, 1, 0)


In [None]:
submission[[col for col in submission.columns if col.startswith('submit_')]].sum(axis = 1).value_counts()

In [None]:
submission[TARGET] = (submission[[col for col in submission.columns if col.startswith('submit_')]].sum(axis=1) >= 2).astype(int)
submission.drop([col for col in submission.columns if col.startswith('submit_')], axis=1, inplace=True)

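The effect of the `>= 2` threshold depends on how many `submit_` columns are summed. The sketch below uses made-up votes to show it; `vote` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np


def vote(votes, min_votes=2):
    """Predict 1 for a row when at least `min_votes` voters predicted 1."""
    return (votes.sum(axis=1) >= min_votes).astype(int)


# Made-up votes, one row per passenger and one column per model.
votes_two = np.array([[1, 1], [1, 0], [0, 0]])
votes_three = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1]])

print(vote(votes_two))     # two voters: both must agree on 1
print(vote(votes_three))   # three voters: simple majority
```

With the two model columns used here, the rule is effectively a logical AND; with the three columns combined in section 5.4, the same threshold becomes a majority vote.
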
### 5.4 Submission

In [None]:
submission['submit_1'] = submission[TARGET].copy()
submission['submit_2'] = pd.read_csv("../input/local-tps-apr/voting_submission.csv")[TARGET]
submission['submit_3'] = pd.read_csv("../input/local-tps-apr/dae.csv")[TARGET]

In [None]:
submission[[col for col in submission.columns if col.startswith('submit_')]].sum(axis = 1).value_counts()

In [None]:
submission[TARGET] = (submission[[col for col in submission.columns if col.startswith('submit_')]].sum(axis=1) >= 2).astype(int)


In [None]:
submission[['PassengerId', TARGET]].to_csv("voting_submission.csv", index = False)

## 6 TrueFoundry Platform

In this section we will see how our ```dataset```, ```hyperparameters``` and ```metrics``` have been logged in TrueFoundry. We will look at two areas: Projects and Runs.

### 6.1 Projects

We can check all of our projects in the ML Foundry section. In this case we are looking for our ```synthanic``` project.

In [None]:
Image('../input/truefoundry-2022/all_project.jpg')



Choose the project and we will see the model runs that have been logged, which are ``LGBMClassifier`` and ``XGB``.


In [None]:
Image('../input/truefoundry-2022/model.png')

### 6.2 Runs

Let's check our XGB run. At the top, we can see our ``Run Name``, ``Run Id``, ``Author``, ``Status``, ``Last Updated On``, ``Tags`` and ``Run Duration``. We can also add ``tags`` and ``notes``.

#### 6.2.1 Overview


* On the left side, we can see our logged ``Key Metrics``, which is ``accuracy``. It covers ``fold_0`` through ``fold_4`` as well as the ``Accuracy_OOF`` metric.
* On the right side, we can see the ``hyperparameters`` that have been logged. They are the latest hyperparameters and are the same across the folds.



In [None]:
Image('../input/truefoundry-2022/xgb_overview.png')

#### 6.2.2 Run Metrics

We can see the validation accuracy of each ``fold (0 through 4)`` as a line graph, which shows how the model performs at each step.

In [None]:
Image('../input/truefoundry-2022/xgb_result.png')

#### 6.2.3 Data & Feature Metrics

We can see our fold datasets, in this case ``fold_0`` through ``fold_4`` and the ``full dataset``, as logged before. We can also see more details on each feature by clicking Details.

In [None]:
Image('../input/truefoundry-2022/xgb_feature.png')



We can also see the comparison between our predictions and the actual values for each ``fold (0 to 4)`` and for the ``full dataset``.


In [None]:
Image('../input/truefoundry-2022/distribution_label.png')

#### 6.2.4 General Artifact
Here we can see our datasets, which have been stored in CSV format and can be downloaded again.

In [None]:
Image('../input/truefoundry-2022/general_artifact.png')

### 6.3 Models Comparison

We can compare models by selecting all the runs that we want to compare.

In [None]:
Image('../input/truefoundry-2022/model_comparison1.png')



We can see the comparison of our models (``LGBMClassifier`` and ``XGB``). The graphs are based on the validation accuracy metric for each fold.


In [None]:
Image('../input/truefoundry-2022/model_comparison2.png')