# Author 

Sakdipat Ontoum

https://www.linkedin.com/in/sakdipat-ontoum-256bb0209/

# Introduction
![](https://storage.googleapis.com/kaggle-media/competitions/Tabular%20Playground/1.jpeg)

Rocket League is a video game that plays like soccer with cars. Up to eight players, split into two teams, drive rocket-powered vehicles to strike a ball into their opponent's goal and score points over the course of a match. The Tabular Playground Series - Oct 2022 competition turns this game into a challenge: predict the probability of each team scoring within the next 10 seconds of the game.

# Objective

To create a model that can predict the probability of each team scoring within the next 10 seconds of the game.

# Approach

Possibly Logistic Regression for prediction, with Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimensionality of the data. I will experiment with these as much as possible. The code in this notebook ends up training a LightGBM classifier on PCA components.


# Dataset

I used the data from [Tabular Playground Series - Oct 2022](https://www.kaggle.com/competitions/tabular-playground-series-oct-2022/). It is made up of sequences of snapshots of the current state of a Rocket League match, including the position and velocity of all players and the ball.

# Performance Measure

I will use the metric specified by [Tabular Playground Series - Oct 2022](https://www.kaggle.com/competitions/tabular-playground-series-oct-2022/): the log loss, averaged over the observations and the targets, as defined below.

$$\large score = -\frac{1}{M}\sum_{m=1}^{M}\frac{1}{N}\sum_{i=1}^{N}\left[y_{i,m}\log (\hat{y_{i, m}})+(1-y_{i, m})\log (1-\hat{y_{i, m}})\right] $$

where:

* $N$ is the number of id observations in the test data
* $M$ is the number of scored targets (here $M = 2$, one for each team)
* $\hat{y_{i,m}}$ is the predicted scoring probability of team $m$ (Team A or Team B in the dataset) for observation $i$
* ${y_{i,m}}$ is the ground truth for team $m$ in observation $i$: 1 for a goal within 10 seconds, 0 otherwise
* $\log ()$ is the natural (base e) logarithm

 
Note: the actual submitted predicted probabilities are replaced with $\max (\min (p,1-10^{-15}), 10^{-15})$.  A smaller log loss is better.
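
As a rough illustration of the metric, the sketch below averages the binary log loss over every observation and every target, clipping the predictions as in the note above. The function name `mean_columnwise_log_loss` and the toy arrays are assumptions made for this example; they are not part of the competition code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over N rows and M target columns (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    # Clip so that log(0) never occurs, mirroring max(min(p, 1 - 1e-15), 1e-15).
    p = np.clip(y_pred, eps, 1 - eps)
    per_cell = y_true * np.log(p) + (1 - y_true) * np.log(1 - p)
    # The mean over all cells equals averaging over rows, then over targets.
    return -per_cell.mean()

# Made-up labels and predictions: three snapshots, two targets (Team A, Team B).
y_true = np.array([[1, 0], [0, 0], [0, 1]])
y_pred = np.array([[0.8, 0.1], [0.2, 0.1], [0.1, 0.6]])
score = mean_columnwise_log_loss(y_true, y_pred)
print(round(score, 4))
```

Because both targets have the same number of rows, one mean over the whole array is equivalent to averaging per target and then across targets.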

# 1. Configurations

## 1.1 Install and Import Libraries

All library installs and imports go here.

In [1]:
import os
import gc
import numpy as np
import pandas as pd
import lightgbm as lgb

from pathlib import Path

from sklearn.decomposition import PCA, KernelPCA
from sklearn.model_selection import KFold, cross_val_score, cross_validate

import warnings
# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

## 1.2 Global Settings

These variables will be used across the project.

In [2]:
#PCA Setting
AMOUNT_COMPONENTS = 10

#Model Setting
params = {
      'objective': 'binary',
      'metric': 'binary_logloss',
      'num_iterations': 500
     }

## 1.3 Data Importation

In [3]:
# List every file under the input directory
run_this = False

if run_this:
    for dirname, _ ,filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

The Tabular Playground Series - Oct 2022 dataset contains 10 training files, 1 testing file, and 1 sample_submission file.

In [4]:
%%time
# Convert the CSV files to gzip-compressed Parquet, since the raw data is large.
run_this = False

if run_this:
    # training set
    for index in range(10):
        train_dataframe = pd.read_csv(f'../input/tabular-playground-series-oct-2022/train_{index}.csv')
        train_dataframe.to_parquet(f'train_{index}.parquet.gzip', compression='gzip')
        print('Done with train file', index)

    # testing set
    test_dataframe = pd.read_csv('../input/tabular-playground-series-oct-2022/test.csv')
    test_dataframe.to_parquet('test.parquet.gzip', compression='gzip')

CPU times: user 0 ns, sys: 5 µs, total: 5 µs
Wall time: 10 µs


In [5]:
features = [
    'ball_pos_x', 'ball_pos_y','ball_pos_z', 'ball_vel_x', 'ball_vel_y', 'ball_vel_z', 
    'p0_pos_x', 'p0_pos_y', 'p0_pos_z', 'p0_vel_x', 'p0_vel_y', 'p0_vel_z', 'p0_boost', 'p0_na',
    'p1_pos_x', 'p1_pos_y', 'p1_pos_z', 'p1_vel_x', 'p1_vel_y', 'p1_vel_z', 'p1_boost', 'p1_na',
    'p2_pos_x', 'p2_pos_y', 'p2_pos_z', 'p2_vel_x', 'p2_vel_y', 'p2_vel_z', 'p2_boost', 'p2_na',
    'p3_pos_x', 'p3_pos_y', 'p3_pos_z', 'p3_vel_x', 'p3_vel_y', 'p3_vel_z', 'p3_boost', 'p3_na',
    'p4_pos_x', 'p4_pos_y', 'p4_pos_z', 'p4_vel_x', 'p4_vel_y', 'p4_vel_z', 'p4_boost', 'p4_na',
    'p5_pos_x', 'p5_pos_y', 'p5_pos_z', 'p5_vel_x', 'p5_vel_y', 'p5_vel_z', 'p5_boost', 'p5_na',
    'boost0_timer', 'boost1_timer', 'boost2_timer', 'boost3_timer',
    'boost4_timer', 'boost5_timer']

In [6]:
targets = [
    'team_A_scoring_within_10sec',
    'team_B_scoring_within_10sec']

In [7]:
DEBUG = False
input_path = Path('../input/fast-loading-high-compression-with-feather/feather_data')

def fe(x):
    """Scale the raw features in place and cast them to float16 to save memory."""
#     # indicators for respawns...
#     x['p0_na'] = x['p0_pos_x'].isna().astype('int8')
#     x['p1_na'] = x['p1_pos_x'].isna().astype('int8')
#     x['p2_na'] = x['p2_pos_x'].isna().astype('int8')
#     x['p3_na'] = x['p3_pos_x'].isna().astype('int8')
#     x['p4_na'] = x['p4_pos_x'].isna().astype('int8')
#     x['p5_na'] = x['p5_pos_x'].isna().astype('int8')
    for feature in features:
        if feature.endswith('_na'):
            continue  # indicator columns are not scaled
        # Positions, velocities and boost: divide by a fixed scale, fill missing values with 0.
        if feature.endswith('_x'):
            x[feature] = (x[feature] / 82).fillna(0).astype('float16')
        elif feature.endswith('_y'):
            x[feature] = (x[feature] / 120).fillna(0).astype('float16')
        elif feature.endswith('_z'):
            x[feature] = (x[feature] / 40).fillna(0).astype('float16')
        elif feature.endswith('_boost'):
            x[feature] = (x[feature] / 100).fillna(0).astype('float16')
        elif feature.endswith('_timer'):
            # Timers are negated and scaled; missing values are left as they are.
            x[feature] = (-x[feature] / 100).astype('float16')
    return x

def read_train():
    dfs = []
    for i in range(10):
        dfs.append(fe(pd.read_feather(input_path / f'train_{i}_compressed.ftr')))
    result = pd.concat(dfs)
    if DEBUG:
        result = result.sample(frac=0.05)
    return result

def read_test():
    return fe(pd.read_feather(input_path / 'test_compressed.ftr'))

train_dataframe = read_train()
gc.collect()
test_dataframe = read_test()
gc.collect()

0
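
The scaling in `fe` keeps values small before they are cast to `float16`. The sketch below uses made-up coordinates (assumed here to lie within ±82, matching the divisor used for `_x` columns) to show the memory saving and the rounding error that the cast introduces. The array and its value range are assumptions, not measurements from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw x-coordinates, assumed to lie within [-82, 82].
raw_x = rng.uniform(-82, 82, size=100_000)      # float64 by default

scaled = (raw_x / 82).astype(np.float16)        # same transform as the '_x' branch

# float64 uses 8 bytes per value, float16 uses 2.
memory_ratio = raw_x.nbytes // scaled.nbytes
print(memory_ratio)                             # 4

# Largest absolute rounding error introduced by the float16 cast.
max_error = np.abs(scaled.astype(np.float64) - raw_x / 82).max()
print(max_error)
```

The trade-off is precision: `float16` carries roughly three significant decimal digits, which may or may not matter for a tree-based model.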

In [8]:
%%time
#Read some of training set and testing set that already converted from .csv to parquet
# train_dataframe = pd.read_parquet('../input/tps-oct-2022-compressed-parquet-files/train_0.parquet.gzip')
# train_dataframe

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.2 µs


In [9]:
# test_dataframe = pd.read_parquet('../input/tps-oct-2022-compressed-parquet-files/test.parquet.gzip')
# test_dataframe

In [10]:
train_dataframe.head()

| | game_num | event_id | event_time | ball_pos_x | ball_pos_y | ball_pos_z | ball_vel_x | ball_vel_y | ball_vel_z | p0_pos_x | ... | boost0_timer | boost1_timer | boost2_timer | boost3_timer | boost4_timer | boost5_timer | player_scoring_next | team_scoring_next | team_A_scoring_within_10sec | team_B_scoring_within_10sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1002 | -33.3125 | -0.0 | 0.0 | 0.046356 | -0.0 | 0.0 | 0.0 | 0.509766 | ... | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | 3 | B | 0 | 0 |
| 1 | 1 | 1002 | -33.21875 | -0.0 | 0.0 | 0.046356 | -0.0 | 0.0 | 0.0 | 0.515137 | ... | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | 3 | B | 0 | 0 |
| 2 | 1 | 1002 | -33.09375 | -0.0 | 0.0 | 0.046356 | -0.0 | 0.0 | 0.0 | 0.526855 | ... | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | 3 | B | 0 | 0 |
| 3 | 1 | 1002 | -33.0 | -0.0 | 0.0 | 0.046356 | -0.0 | 0.0 | 0.0 | 0.535645 | ... | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | 3 | B | 0 | 0 |
| 4 | 1 | 1002 | -32.875 | -0.0 | 0.0 | 0.046356 | -0.0 | 0.0 | 0.0 | 0.54834 | ... | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | 3 | B | 0 | 0 |


# 2. Data Preprocessing

In [11]:
def euclidian_norm(x):
    return np.linalg.norm(x, axis=1)

vel_groups = {
    f"{el}_vel": [f'{el}_vel_x', f'{el}_vel_y', f'{el}_vel_z']
    for el in ['ball'] + [f'p{i}' for i in range(6)]
}
pos_groups = {
    f"{el}_pos": [f'{el}_pos_x', f'{el}_pos_y', f'{el}_pos_z']
    for el in ['ball'] + [f'p{i}' for i in range(6)]
}

for col, vec in vel_groups.items():
    train_dataframe[col] = euclidian_norm(train_dataframe[vec])
    test_dataframe[col] = euclidian_norm(test_dataframe[vec])
    
for col, vec in pos_groups.items():
    train_dataframe[col + "_ball_dist"] = euclidian_norm(train_dataframe[vec].values - train_dataframe[pos_groups["ball_pos"]].values)
    test_dataframe[col + "_ball_dist"] = euclidian_norm(test_dataframe[vec].values - test_dataframe[pos_groups["ball_pos"]].values)
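
The cell above adds two kinds of features: the speed of each object (the Euclidean norm of its velocity vector) and each object's distance to the ball. A tiny worked example with made-up numbers shows what `np.linalg.norm(..., axis=1)` computes per row.

```python
import numpy as np

# Toy data: two snapshots of one player and the ball (values are made up).
player_vel = np.array([[3.0, 4.0, 0.0],
                       [0.0, 0.0, 2.0]])
player_pos = np.array([[1.0, 2.0, 2.0],
                       [0.0, 0.0, 0.0]])
ball_pos = np.array([[0.0, 0.0, 0.0],
                     [0.0, 3.0, 4.0]])

# axis=1 takes the norm across x, y, z for each row (snapshot).
speed = np.linalg.norm(player_vel, axis=1)
dist_to_ball = np.linalg.norm(player_pos - ball_pos, axis=1)

print(speed)          # [5. 2.]
print(dist_to_ball)   # [3. 5.]
```

Because the loop over `pos_groups` also visits `ball_pos`, the resulting `ball_pos_ball_dist` column is the ball's distance to itself and is zero in every row.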

In [12]:
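# Drop columns: keep the 48 ball/player columns and the 14 engineered ones.
# This leaves out game_num, event_id, event_time, the six boost timers,
# and the player_scoring_next / team_scoring_next columns.
features = list(train_dataframe.columns[3:-24]) + list(train_dataframe.columns[-14:])
test_dataframe = test_dataframe[features].copy()
# The two targets are kept in the training frame only.
features.append('team_A_scoring_within_10sec')
features.append('team_B_scoring_within_10sec')
train_dataframe = train_dataframe[features].copy()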

In [13]:
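# Deal with missing values
train_dataframe = train_dataframe.dropna()
# Compute the column means in float32, since sums over float16 columns can overflow.
test_dataframe = test_dataframe.fillna(value=test_dataframe.astype('float32').mean())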

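
Since `fe` stores the features as `float16`, reductions such as sums and means deserve some care: the largest finite `float16` value is 65504, so a sum over many rows can overflow to infinity. The NumPy sketch below illustrates the effect on a made-up array. Whether `DataFrame.mean` hits this path depends on the pandas version, so treat the link to this notebook as an assumption.

```python
import numpy as np

values = np.ones(100_000, dtype=np.float16)

with np.errstate(over="ignore"):           # silence the overflow RuntimeWarning
    naive_total = values.sum()             # result is stored as float16
safe_total = values.sum(dtype=np.float32)  # accumulate in a wider type

print(float(np.finfo(np.float16).max))     # 65504.0
print(naive_total, safe_total)
```

Casting to `float32` before reducing, or passing `dtype=np.float32` to the reduction, avoids the overflow at the cost of a temporary increase in memory.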


In [14]:
# Dimensionality reduction with PCA
features = list(train_dataframe.columns)
len(features)

64

In [15]:
features.remove('team_A_scoring_within_10sec')
features.remove('team_B_scoring_within_10sec')

In [16]:
dataframe = pd.concat([train_dataframe[features], test_dataframe])

In [None]:
number = AMOUNT_COMPONENTS
pca = PCA(n_components=number)
rotatedData = pca.fit_transform(dataframe)


new_features = []
for index in range(number):
    new_features.append(f'X{index}')

PCA_dataframe = pd.DataFrame(data=rotatedData, columns=new_features)
PCA_dataframe.head()
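
One way to judge whether a given number of components is reasonable is to look at the cumulative explained variance. The sketch below runs scikit-learn's `PCA` on synthetic data (a made-up matrix with three underlying factors, not the competition data) and names the output columns `X0`, `X1`, ... in the same way as the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the stacked feature matrix: 12 columns driven by 3 factors.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 12))
data = latent @ mixing + 0.01 * rng.normal(size=(500, 12))

n_components = 5
pca = PCA(n_components=n_components)
rotated = pca.fit_transform(data)

columns = [f'X{i}' for i in range(n_components)]
pca_frame = pd.DataFrame(rotated, columns=columns)

# Cumulative share of the total variance kept by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative.round(3))
```

On this toy matrix the first three components capture almost all of the variance; on the real data, the same curve could guide the choice of `AMOUNT_COMPONENTS`.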

In [None]:
len(train_dataframe)

In [None]:
X_train = PCA_dataframe.iloc[:len(train_dataframe)].copy()
X_test = PCA_dataframe.iloc[len(train_dataframe):].copy()

y_train_team_A = train_dataframe['team_A_scoring_within_10sec'].copy()
y_train_team_B = train_dataframe['team_B_scoring_within_10sec'].copy()


# 3. Model Creation

In [None]:
def cv_score_team_A(model):
    # Pass the KFold object itself; get_n_splits() would only return the integer 5.
    k_fold = KFold(5, shuffle=True, random_state=0)
    # 'neg_log_loss' yields negated log loss values (closer to 0 is better).
    scores = cross_val_score(model, X_train.values, y_train_team_A, scoring='neg_log_loss', cv=k_fold)
    return scores

def cv_score_team_B(model):
    k_fold = KFold(5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X_train.values, y_train_team_B, scoring='neg_log_loss', cv=k_fold)
    return scores
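
A detail worth checking when building a cross-validation splitter: `KFold.get_n_splits` returns only the number of folds as an integer, while the `KFold` object itself carries the shuffle and seed settings. When `cross_val_score` receives an integer for `cv` with a classifier, scikit-learn uses stratified folds without shuffling. The toy array below is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

splitter = KFold(n_splits=5, shuffle=True, random_state=0)
n_splits = splitter.get_n_splits(X)
print(type(n_splits).__name__, n_splits)   # int 5

# Iterating the splitter itself yields the shuffled train/validation indices.
val_folds = [val_idx for _, val_idx in splitter.split(X)]
print(val_folds)
```

Passing `splitter` (not `n_splits`) as the `cv` argument is what makes the shuffle and `random_state` take effect.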

In [None]:
lgbm = lgb.LGBMClassifier(**params)

# 4. Evaluation 

In [None]:
# cross_val_score returns the negated log loss, so flip the sign before reporting.
lgbm_score_team_A = -cv_score_team_A(lgbm)
print("LightGBM log loss of Team A: {:.4f} ({:.4f})".format(lgbm_score_team_A.mean(), lgbm_score_team_A.std()))

lgbm_score_team_B = -cv_score_team_B(lgbm)
print("LightGBM log loss of Team B: {:.4f} ({:.4f})".format(lgbm_score_team_B.mean(), lgbm_score_team_B.std()))

In [None]:
lgbm_team_A = lgb.LGBMClassifier(**params)
lgbm_team_B = lgb.LGBMClassifier(**params)

lgbm_team_A.fit(X_train, y_train_team_A)
lgbm_team_B.fit(X_train, y_train_team_B)

y_pred_team_A_lgbm = lgbm_team_A.predict_proba(X_test)[:,1]
y_pred_team_B_lgbm = lgbm_team_B.predict_proba(X_test)[:,1]
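
The `[:, 1]` slice above picks the probability of the positive class. The sketch below shows the same convention with scikit-learn's `LogisticRegression` on synthetic data. That model is swapped in here only to keep the example light; it is not the LightGBM model used in this notebook, and the data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Made-up binary problem: the label depends mostly on the first feature.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:5])   # shape (5, 2): one column per class

print(clf.classes_)                # [0 1]
p_positive = proba[:, 1]           # column order follows clf.classes_
print(p_positive)
```

Each row of `proba` sums to 1, so column 1 is the estimated probability that the label equals 1, which here corresponds to "scores within 10 seconds".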

In [None]:
submission = pd.read_csv('../input/tabular-playground-series-oct-2022/sample_submission.csv')
submission['team_A_scoring_within_10sec'] = y_pred_team_A_lgbm
submission['team_B_scoring_within_10sec'] = y_pred_team_B_lgbm
submission.head()

In [None]:
submission.max()

In [None]:
submission.to_csv('TBS-submission-10062022-13.csv', index=False)
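
Before uploading, it can help to verify that both target columns exist and hold valid probabilities. The helper `check_submission` below is a hypothetical sketch written for this purpose, demonstrated on a made-up two-row frame and not on the real submission file.

```python
import numpy as np
import pandas as pd

TARGETS = ['team_A_scoring_within_10sec', 'team_B_scoring_within_10sec']

def check_submission(frame, targets=TARGETS):
    """Hypothetical helper: raise if a target column is missing or not a probability."""
    for col in targets:
        if col not in frame.columns:
            raise ValueError(f'missing column: {col}')
        values = frame[col].to_numpy(dtype=float)
        if np.isnan(values).any():
            raise ValueError(f'{col} contains NaN')
        if values.min() < 0 or values.max() > 1:
            raise ValueError(f'{col} has values outside [0, 1]')
    return True

# Made-up predictions for two rows.
demo = pd.DataFrame({
    'team_A_scoring_within_10sec': [0.02, 0.10],
    'team_B_scoring_within_10sec': [0.05, 0.01],
})
ok = check_submission(demo)
print(ok)   # True
```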

# Reference

https://www.kaggle.com/code/ryanluoli2/a-simple-lightgbm-baseline-with-pca

https://www.kaggle.com/code/reymaster/compress-files-parquet-7x-loading-speedup

https://www.kaggle.com/datasets/reymaster/tps-oct-2022-compressed-parquet-files

https://en.wikipedia.org/wiki/Rocket_League

https://www.kaggle.com/competitions/tabular-playground-series-oct-2022/

https://www.kaggle.com/code/chazzer/rocket-league-xgboost-feat-engineering-cv/notebook?scriptVersionId=107113700

# Thank you 😆😆

![](https://wallpaperaccess.com/full/5914023.png)