# Tabular-playground-series-nov-2022

<br>

## Details of competition
#### URL: https://www.kaggle.com/competitions/tabular-playground-series-nov-2022/overview

<br>

#### Topic
Provide a fun and approachable-for-anyone tabular dataset to model.

<br>

#### Background

<br>

##### About the Tabular Playground Series
Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, we've launched many Playground competitions that are more approachable than our Featured competitions and thus, more beginner-friendly.

<br>

**The goal of these competitions is to provide a fun and approachable-for-anyone tabular dataset to model.** These competitions are a great choice for people looking for something in between the Titanic Getting Started competition and the Featured competitions. If you're an established competitions master or grandmaster, these probably won't be much of a challenge for you; thus, we encourage you to avoid saturating the leaderboard.

<br>

For each monthly competition, we'll be offering Kaggle Merchandise for the top three teams. And finally, because we want these competitions to be more about learning, we're limiting team sizes to 3 individuals.


## One-line summary!!

<br>

- **A ready-made second half + a first half we work on (ensembling + blending) >> improve the model >> submit the results**

## Files

### submission_files/

- A collection of prediction files from binary classification models; the first half of the rows can be checked against ground truth

<br>

### train_labels.csv

- Ground-truth labels for the first half of the rows in the submission files


<br>


### sample_submission.csv

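- A sample submission file that contains valid data only for the second half of the rows

- The format (ids) is correct

- To improve the score, do the extra work (blending the predictions) on the first half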





## Understanding the data

<br>

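### submission_files (folder)
- Stores predictions for a binary classification task

- Each file name is a log loss (logloss) score

- The score in the file name is the log loss computed on the half of the prediction rows covered by the ground-truth label file (train_labels.csv)

- This makes for an excellent training set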
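
To make the file-name convention concrete, here is a minimal sketch of how the score of one file could be recomputed from the labeled half. The arrays are synthetic stand-ins, not the competition files, and the clipping margin `eps` is an assumption (the exact value used to name the files is not stated in these notes).

```python
import numpy as np

def binary_log_loss(labels, preds, eps=1e-15):
    """Mean binary cross-entropy; predictions are clipped away from 0 and 1."""
    y = np.asarray(labels, dtype=float)
    p = np.clip(np.asarray(preds, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Synthetic stand-ins (hypothetical data, not read from the Kaggle folder)
rng = np.random.default_rng(0)
n_labeled = 5                                  # the real data has 20000 labeled rows
labels = np.array([0, 1, 1, 1, 0])             # plays the role of train_labels["label"]
preds = rng.uniform(0.05, 0.95, size=2 * n_labeled)  # one file's "pred" column

# Only the first half of the rows has ground truth, so only that half is scored.
score = binary_log_loss(labels, preds[:n_labeled])
print(f"log loss on the labeled half: {score:.10f}")
```

With the real files, `preds` would be the `pred` column of one CSV and `labels` the `label` column of `train_labels.csv`; the printed value should then be close to the number in the file name.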

<br>

### train_labels.csv

- Provides the ground-truth values for the rows of the files in submission_files


<br>


### sample_submission.csv

- This part is mostly understood!

- Submit predictions for the remaining second half (ids 20000~39999)!!


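## Our goal

- Use ensembling!

- Improve predictive performance by ensembling the various submission files!!

- Because the submission_files come with ground truth for half of the rows, we can tell whether blending (combining several sets of predictions) gives a better result or not!!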

<br>

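### Summary

1. By combining the two kinds of files (predictions and labels), we can check the score of the predictions on the labeled half!!

2. We can keep improving that score!!

3. After improving the score (to the limit), we can submit the remaining rows to the leaderboard!! 🥸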
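
The loop in the summary (blend, score on the labeled half, repeat) can be sketched as follows. This is an illustration on synthetic predictions, not the method of the referenced notebook; `blend` and the two toy "models" are hypothetical names introduced here.

```python
import numpy as np

def binary_log_loss(labels, preds, eps=1e-15):
    """Mean binary cross-entropy with clipping to avoid log(0)."""
    y = np.asarray(labels, dtype=float)
    p = np.clip(np.asarray(preds, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def blend(pred_matrix, weights=None):
    """Weighted arithmetic mean across columns (one column per submission file)."""
    return np.average(pred_matrix, axis=1, weights=weights)

# Synthetic labeled half: two noisy models that share the same underlying signal
rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, size=n)
signal = np.where(labels == 1, 0.7, 0.3)
model_a = np.clip(signal + rng.normal(0, 0.2, n), 0.01, 0.99)
model_b = np.clip(signal + rng.normal(0, 0.2, n), 0.01, 0.99)

blended = blend(np.column_stack([model_a, model_b]))

loss_a = binary_log_loss(labels, model_a)
loss_b = binary_log_loss(labels, model_b)
loss_blend = binary_log_loss(labels, blended)
print(f"a: {loss_a:.4f}  b: {loss_b:.4f}  blend: {loss_blend:.4f}")
```

Because log loss is convex in the prediction, an equal-weight average never scores worse than the mean of the individual log losses, although it can still be worse than the best single file. That is why each blend is checked on the labeled half before the second half is submitted.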



<br>

In [4]:
# import data
# !ls
!ls ./tabular-playground-series-nov-2022/

sample_submission.csv  submission_files  train_labels.csv


In [5]:
# import data
sample_submission = pd.read_csv('./tabular-playground-series-nov-2022/sample_submission.csv')
sample_submission

Unnamed: 0,id,pred
0,20000,0.640707
1,20001,0.636904
2,20002,0.392496
3,20003,0.588658
4,20004,0.783603
...,...,...
19995,39995,0.382515
19996,39996,0.352498
19997,39997,0.577554
19998,39998,0.712353


In [6]:
# import data
train_labels = pd.read_csv('./tabular-playground-series-nov-2022/train_labels.csv')
train_labels

Unnamed: 0,id,label
0,0,0
1,1,1
2,2,1
3,3,1
4,4,0
...,...,...
19995,19995,1
19996,19996,1
19997,19997,0
19998,19998,0


In [9]:
# import data
_6222863195 = pd.read_csv('./tabular-playground-series-nov-2022/submission_files/0.6222863195.csv')
_6222863195

Unnamed: 0,id,pred
0,0,0.709336
1,1,0.452988
2,2,0.675462
3,3,0.481046
4,4,0.957339
...,...,...
39995,39995,0.382515
39996,39996,0.352498
39997,39997,0.577554
39998,39998,0.712353



## Cloning

<br>

### Reference: 
- https://www.kaggle.com/code/hasanbasriakcay/tpsnov22-pseudo-labels-lgbm-xgb-lb-0-514

<br>

## Introduction

<br>

In [4]:
# import

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import warnings
import datetime as dt
import math
from tqdm.auto import tqdm
import glob

from scipy.stats.mstats import gmean
from scipy.stats import hmean
from scipy.stats import spearmanr

np.random.seed(0)
warnings.simplefilter("ignore")

VAR_TH = 1.0e-03
PSEUDO_NFOLDS = 5
PSEUDO_TH = 0.94
CLIP = True
VAR_SMOOTHING = 1818

In [2]:
!ls

[1m[36mtabular-playground-series-nov-2022[m[m
tabular-playground-series-nov-2022.zip
tabular_playground_series_nov_2022.ipynb
train.ftr
train.ftr.zip


In [9]:
data = pd.read_feather('./train.ftr')
sub = pd.read_csv('./tabular-playground-series-nov-2022/sample_submission.csv')
y = pd.read_csv('./tabular-playground-series-nov-2022/train_labels.csv')

train = data.loc[:19999, :]
test = data.loc[20000:, :]

display(train.head())
display(sub.head())

Unnamed: 0,sub_0,sub_1,sub_2,sub_3,sub_4,sub_5,sub_6,sub_7,sub_8,sub_9,...,sub_4990,sub_4991,sub_4992,sub_4993,sub_4994,sub_4995,sub_4996,sub_4997,sub_4998,sub_4999
0,0.709336,0.799007,0.851891,0.537158,0.62393,0.70597,0.503437,0.633185,0.64155,0.666604,...,0.769207,0.75025,0.66337,0.739333,0.822384,0.749498,0.7298,0.867847,0.745888,0.787
1,0.452988,0.364453,0.567582,0.354468,0.513818,0.584119,0.454809,0.238501,0.472171,0.522314,...,0.640052,0.794052,0.721298,0.804369,0.620626,0.733606,0.816942,0.814229,0.598331,0.547
2,0.675462,0.84226,0.800013,0.525229,0.692071,0.715418,0.651008,0.609124,0.691198,0.609994,...,0.812841,0.779859,0.865657,0.828493,0.76301,0.802883,0.806891,0.896058,0.855776,0.667
3,0.481046,0.577118,0.683032,0.541356,0.630088,0.664514,0.413373,0.50821,0.52614,0.584565,...,0.824703,0.799698,0.80013,0.716604,0.603779,0.708499,0.844837,0.853057,0.850657,0.622
4,0.957339,0.910337,0.917322,0.874487,0.787595,0.854273,0.843846,0.876749,0.821128,0.913054,...,0.934803,0.90015,0.960911,0.906037,0.96124,0.935608,0.889757,0.978505,0.953681,0.934


Unnamed: 0,id,pred
0,20000,0.640707
1,20001,0.636904
2,20002,0.392496
3,20003,0.588658
4,20004,0.783603


In [10]:
print("train shape:", train.shape)
print("test shape:", test.shape)
print("sub shape:", sub.shape)

train shape: (20000, 5000)
test shape: (20000, 5000)
sub shape: (20000, 2)


In [11]:
print("train nan value sum:", train.isna().sum().sum())
print("test nan value sum:", test.isna().sum().sum())

train nan value sum: 0
test nan value sum: 0


In [12]:
print("train duplicated value sum:", train.duplicated().sum().sum())
print("test duplicated value sum:", test.duplicated().sum().sum())

train duplicated value sum: 0
test duplicated value sum: 0


<br>

## Preprocessing

<br>

In [13]:
normalized_train = (train - train.mean()) / train.std()
normalized_train.var().sort_values()

sub_1825    4.930627e-32
sub_2356    1.000000e+00
sub_4963    1.000000e+00
sub_1691    1.000000e+00
sub_1685    1.000000e+00
                ...     
sub_3226    1.000000e+00
sub_2618    1.000000e+00
sub_684     1.000000e+00
sub_1148    1.000000e+00
sub_1824             NaN
Length: 5000, dtype: float64

In [14]:
for feature in ["sub_1824", "sub_1825"]:
    print(f"train {feature} unique : {train[feature].unique()}")
    print(f"test {feature} unique : {test[feature].unique()}")

train sub_1824 unique : [0.491104]
test sub_1824 unique : [0.491104]
train sub_1825 unique : [0.51]
test sub_1825 unique : [0.51]
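
The two columns above hold a single value each, so their standard deviation is zero (or floating-point noise) and standardizing them yields NaN or a meaningless variance. A small sketch of how such columns could be found and dropped in general; `constant_columns` and the toy frame are hypothetical, not part of the original notebook.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    nunique = df.nunique(dropna=False)
    return nunique[nunique <= 1].index.tolist()

toy = pd.DataFrame({
    "sub_a": [0.2, 0.7, 0.4],
    "sub_b": [0.51, 0.51, 0.51],   # constant, like sub_1825 above
    "sub_c": [0.9, 0.1, 0.3],
})

const_cols = constant_columns(toy)
reduced = toy.drop(columns=const_cols)
print(const_cols, list(reduced.columns))
```

A constant prediction carries no information about the label, so removing these columns loses nothing for a downstream model.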


In [15]:
train.max().max(), train.min().min()

(1.356611, -0.336521)

In [16]:
if CLIP:
    train = train.clip(0,1)
    test = test.clip(0,1)
    print(train.max().max(), train.min().min())

1.0 -0.0
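
Clipping matters because log loss takes `log(p)` and `log(1 - p)`, which are undefined for predictions outside [0, 1] such as the extremes 1.356611 and -0.336521 found above. The sketch below shows the failure and two ways to clip; the margin `eps` is a hypothetical choice, not something the notebook uses.

```python
import numpy as np

# Includes the extreme values reported by the cell above
raw = np.array([-0.336521, 0.25, 0.8, 1.356611])

with np.errstate(invalid="ignore", divide="ignore"):
    log_raw = np.log(raw)              # log of a negative number is NaN
print("any NaN before clipping:", np.isnan(log_raw).any())

clipped = np.clip(raw, 0.0, 1.0)       # same bounds as train.clip(0, 1)

eps = 1e-6                             # hypothetical safety margin
safe = np.clip(raw, eps, 1 - eps)      # keeps both log(p) and log(1 - p) finite
print(clipped, safe)
```

Clipping to exactly 0 and 1, as in the cell above, makes the values valid probabilities, but a prediction of exactly 0 or 1 still gives an infinite log loss when it is wrong, which is the reason some people prefer a small margin.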


<br>

## Feature Selection

<br>

In [None]:
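sub_file_names = sorted(glob.glob('./tabular-playground-series-nov-2022/submission_files/*.csv'))
# Each file is named after its log loss on the labeled half, e.g. 0.6222863195.csv
scores_all = [float(file_name.split('/')[-1][:-4]) for file_name in sub_file_names]
print("sub_file_names len:", len(sub_file_names))

# ref: https://www.kaggle.com/code/takanashihumbert/drop-4000-features-lb-0-51530
best = 'sub_0'
# Assumes column sub_i in train.ftr was built from the i-th sorted file.
# Scores are looked up by column name: zipping else_cols (which skips sub_0)
# with scores_all would pair every column with the previous file's score.
score_by_col = dict(zip(data.columns, scores_all))
else_cols = [f for f in data.columns if f != best]
drop_cols = []
for col in tqdm(else_cols):
    logloss = score_by_col[col]
    # Drop near-duplicates of the best file (rank correlation) and weak files (log loss)
    r, _ = spearmanr(data[best].values, data[col].values)
    if r > 0.98 or logloss > 0.680:
        drop_cols.append(col)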

del data
gc.collect()
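
The filter above keeps `sub_0` as the reference and drops any column that is either almost rank-identical to it (Spearman r > 0.98) or weak on its own (log loss > 0.680). A self-contained sketch of the same rule on synthetic columns; `select_drop_cols`, the toy frame and the toy scores are hypothetical, and scores are passed as a dict so each column is matched with its own score.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def select_drop_cols(df, scores, best, corr_th=0.98, loss_th=0.680):
    """Columns to drop: near-duplicates of `best` by rank correlation, or weak by log loss.

    `scores` maps column name -> log loss, so the pairing does not depend on order.
    """
    drop = []
    for col in df.columns:
        if col == best:
            continue
        r, _ = spearmanr(df[best].values, df[col].values)
        if r > corr_th or scores[col] > loss_th:
            drop.append(col)
    return drop

rng = np.random.default_rng(0)
base = rng.uniform(0, 1, 500)
toy = pd.DataFrame({
    "sub_0": base,
    "sub_1": base ** 2,                 # monotone transform of sub_0: same ranks
    "sub_2": rng.uniform(0, 1, 500),    # unrelated to sub_0, acceptable score
    "sub_3": rng.uniform(0, 1, 500),    # unrelated to sub_0, but a weak score
})
scores = {"sub_0": 0.62, "sub_1": 0.63, "sub_2": 0.64, "sub_3": 0.69}

drop_cols = select_drop_cols(toy, scores, best="sub_0")
print(drop_cols)
```

Spearman correlation compares ranks, so a column that is only a monotone rescaling of the reference is treated as redundant even when its raw values differ.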