# Hands-On with the ML Workflow

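In this project, our team (Team 6) focused on experiencing the ML workflow end to end.  
Understanding how a model works and improving its performance matter too, but we could not pick up all of that knowledge in a short time, so we spent most of our time understanding the overall flow and writing the code ourselves.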

## Table of Contents
- Project Overview
- EDA
- Data Preprocessing
- Feature Engineering
- Machine Learning(Train)
- Test

### Project Overview
This project predicts each player's final placement from the data they generate while playing a single match.

### Data fields
- **DBNOs** - Number of enemy players knocked.
- **assists** - Number of enemy players this player damaged that were killed by teammates.
- **boosts** - Number of boost items used.
- **damageDealt** - Total damage dealt. Note: Self inflicted damage is subtracted.
- **headshotKills** - Number of enemy players killed with headshots.
- **heals** - Number of healing items used.
- **Id** - Player’s Id
- **killPlace** - Ranking in match of number of enemy players killed.
- **killPoints** - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
- **killStreaks** - Max number of enemy players killed in a short amount of time.
- **kills** - Number of enemy players killed.
- **longestKill** - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
- **matchDuration** - Duration of match in seconds.
- **matchId** - ID to identify match. There are no matches that are in both the training and testing set.
- **matchType** - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
- **rankPoints** - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
- **revives** - Number of times this player revived teammates.
- **rideDistance** - Total distance traveled in vehicles measured in meters.
- **roadKills** - Number of kills while in a vehicle.
- **swimDistance** - Total distance traveled by swimming measured in meters.
- **teamKills** - Number of times this player killed a teammate.
- **vehicleDestroys** - Number of vehicles destroyed.
- **walkDistance** - Total distance traveled on foot measured in meters.
- **weaponsAcquired** - Number of weapons picked up.
- **winPoints** - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
- **groupId** - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
- **numGroups** - Number of groups we have data for in the match.
- **maxPlace** - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
- **winPlacePerc** - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
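
The sentinel rules in the field descriptions above (`-1` in `rankPoints`, and `0` in `killPoints` or `winPoints` whenever `rankPoints` is not `-1`, all meaning "None") can be made explicit before modelling. The following is a minimal sketch on a toy frame; the helper name `mask_rating_sentinels` is hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd

def mask_rating_sentinels(df):
    """Return a copy where the documented 'None' sentinels become NaN."""
    out = df.copy()
    has_rank = out["rankPoints"] != -1
    # A 0 in killPoints/winPoints means "None" whenever rankPoints is set.
    for col in ["killPoints", "winPoints"]:
        is_none = has_rank & (out[col] == 0)
        out[col] = out[col].astype("float64").mask(is_none)
    # A -1 in rankPoints stands for "None".
    out["rankPoints"] = out["rankPoints"].astype("float64").mask(~has_rank)
    return out

toy = pd.DataFrame({
    "rankPoints": [-1, 1500, -1],
    "killPoints": [1200, 0, 0],
    "winPoints": [1400, 0, 1000],
})
masked = mask_rating_sentinels(toy)
print(masked)
```

Tree-based models such as LightGBM can usually consume the resulting `NaN` values directly, whereas linear models would need an imputation step first.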

## EDA  

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
base_path = '/kaggle/input/pubg-finish-placement-prediction/'

df_train = pd.read_csv(base_path + 'train_V2.csv')
df_test = pd.read_csv(base_path +'test_V2.csv')
submission = pd.read_csv(base_path +'sample_submission_V2.csv')

In [None]:
pd.set_option('display.max_columns', None)

#### Checking for missing values

In [None]:
df_train.head()

In [None]:
df_train[df_train.isnull().any(axis=1)]

In [None]:
df_test[df_test.isnull().any(axis=1)]

#### Checking correlations with a heatmap

In [None]:
f,ax = plt.subplots(figsize=(30, 24))
colormap = plt.cm.RdBu
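# numeric_only=True leaves out the string columns (Id, groupId, matchId, matchType); needs pandas >= 1.5
sns.heatmap(df_train.corr(numeric_only=True), linewidths=0.1, vmax=1.0,
           square=True, cmap=colormap, linecolor='white', annot=True, annot_kws={"size": 16})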
plt.show()

We found that the correlations become higher after a groupby on `groupId`.  
-> For feature engineering, consider adding group-mean columns for the features whose correlation changes the most.
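
The comparison described above can be sketched as follows: compute a feature's correlation with `winPlacePerc` at player level, then again after replacing the feature with its `groupId` mean. The synthetic data is an assumption made so that the block runs on its own; it is built so that the target is shared by the team and depends on the team average rather than on the individual.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_groups, team_size = 500, 4

# Synthetic stand-in for df_train: one row per player, four players per group.
toy = pd.DataFrame({
    "groupId": np.repeat(np.arange(n_groups), team_size),
    "kills": rng.poisson(2.0, n_groups * team_size),
})

# Group mean of the feature, broadcast back to every player row.
team_kills = toy.groupby("groupId")["kills"].transform("mean")

# Team-level noise, then a percentile rank so the target lies in (0, 1].
noise = np.repeat(rng.normal(0, 0.5, n_groups), team_size)
toy["winPlacePerc"] = (team_kills + noise).rank(pct=True)

player_corr = toy["kills"].corr(toy["winPlacePerc"])
group_corr = team_kills.corr(toy["winPlacePerc"])
print(f"player-level: {player_corr:.3f}, group-mean: {group_corr:.3f}")
```

On the real data the same two lines would be applied per column, and the columns with the largest gap are the candidates for group-mean features.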

#### Looking for outliers

In [None]:
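# Rows that have road kills although rideDistance is 0
# (matchType still holds strings at this point, so the `!= 4` test is always True and filters nothing)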
df_train[(df_train['roadKills'] != 0) & (df_train['matchType'] != 4) & (df_train['rideDistance'] == 0.0000)]

In [None]:
# Rows that have kills although damageDealt is 0
df_train[(df_train['kills'] != 0) & (df_train['damageDealt'] == 0) & (df_train['matchType'] != 4)]

In [None]:
# All players in a match share the same matchDuration, so it should not be used as a measure of an individual's play.
df_train.loc[(df_train['matchDuration'] > 0) & (df_train['matchId'] == 'a10357fd1a4a91'), ['matchId','matchDuration']]
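
The claim in the last cell, that every player in a match shares one `matchDuration`, can be checked across all matches rather than for a single `matchId`. A minimal sketch on a toy frame follows; the helper name `constant_within_match` is hypothetical.

```python
import pandas as pd

def constant_within_match(df, col):
    """True if `col` takes exactly one distinct value inside every matchId."""
    return bool((df.groupby("matchId")[col].nunique() == 1).all())

toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m2", "m2"],
    "matchDuration": [1800, 1800, 1400, 1400],
    "kills": [0, 3, 1, 1],
})
print(constant_within_match(toy, "matchDuration"))  # True
print(constant_within_match(toy, "kills"))          # False
```

With the notebook's data this would be called as `constant_within_match(df_train, 'matchDuration')`.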

#### Checking multicollinearity with VIF

In [None]:
temp = df_train.drop(columns = ['Id', 'groupId', 'matchId', 'killPlace',\
    'matchType','winPlacePerc'])

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
pd.DataFrame({
    "VIF Factor": [variance_inflation_factor(temp.values, idx) for idx in range(temp.shape[1])],
    "features": temp.columns,
})
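
For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The sketch below computes it with NumPy on synthetic data. `vif_manual` is a hypothetical helper, and it adds an intercept to each auxiliary regression, which is an assumption of this sketch; the statsmodels call above uses the matrix as passed, so the two sets of values need not match.

```python
import numpy as np

def vif_manual(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Remaining columns plus an intercept column.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + b + rng.normal(scale=0.1, size=1000)  # nearly a linear combination of a and b
print(vif_manual(np.column_stack([a, b, c])))
```

A common rule of thumb treats values above roughly 10 as a sign of strong multicollinearity; in this toy example all three columns exceed it because `c` is almost fully determined by `a` and `b`.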

## Data Preprocessing

In [None]:
import numpy as np

def reduce_mem_usage(df): # reduce the DataFrame's memory usage
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    #start_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    #end_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

def rm_MissingValue(df): # drop rows with missing values
    new_df = df.dropna(axis=0).copy()
    return new_df
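
pandas also ships a built-in downcast that covers the same idea as `reduce_mem_usage`. The sketch below is an alternative, not the function used in this project: `pd.to_numeric(..., downcast=...)` picks the smallest integer type that holds the values, and for floats it goes no lower than `float32`, which sidesteps the rounding that `float16` applies to large values such as distances in meters. `downcast_numeric` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy with integer and float columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

toy = pd.DataFrame({
    "kills": np.array([0, 3, 12], dtype="int64"),
    "walkDistance": np.array([0.0, 2049.5, 25780.0], dtype="float64"),
    "matchType": ["solo", "duo", "squad"],
})
small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Non-numeric columns such as `matchType` are left untouched, and the input frame is not modified.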

## Feature Engineering

In [None]:
# Feature Engineering
import pandas as pd
import numpy as np

# Number of players on the team
def team_player(df):
    df['team_player']=df.groupId.map(df.groupId.value_counts())
    return df['team_player']

# Total number of players in the match
def player(df):
    df['player']=df.matchId.map(df.matchId.value_counts())
    return df['player']

# Total distance traveled
def total_distance(df):
    df['total_distance']=df['rideDistance']+df['swimDistance']+df['walkDistance']
    return df['total_distance']

def scaling(df, scaler, col_list): # pass the scaler and the columns to scale
    temp = scaler.fit_transform(df.loc[:, col_list])
    for i in range(len(col_list)):
        df[col_list[i]] = temp[:,i]
    return df

# Rank each column within its match (adds '<column>Place')
def columns_place(cols, X, df):
    for i in cols:
        X[i + 'Place'] = df.groupby('matchId')[i].rank(method='max', ascending = False)

    return X

# Classify matchType into event / solo / duo / squad
def matchType_classify(df):
    def classify(x):
        if 'flare' in x or 'crash' in x or 'normal' in x:
            return 'event'
        elif 'solo' in x:
            return 'solo'
        elif 'duo' in x:            
            return 'duo'
        else:
            return 'squad'
    
    new_df = df
    new_df['matchType'] = df['matchType'].apply(classify)
    return new_df

# matchType encoding
def matchType_encoding(df):
    df_OHE = pd.get_dummies(df, columns=['matchType'])
    return df_OHE


# Group by groupId and add each column's group mean.
def columns_grouped_mean(cols, X, df):
    for i in cols:
        X['group'+i] = df.groupby('groupId')[i].transform('mean')

    return X

def average_weaponsAcquired(df): # weapons picked up per minute
    df['average_weaponsAcquired'] = df.weaponsAcquired / (df.matchDuration / 60)
    return df['average_weaponsAcquired']

def average_damage(df): # damage dealt per minute
    df['average_damage'] = df.damageDealt / (df.matchDuration / 60)
    return df['average_damage']

# Heals + boosts per kill involvement (assists + kills)
def healboost_per_kill(df):
    df['healboost_per_kill'] = (df['heals']+df['boosts'])/(df['assists']+df['kills'])
    return df['healboost_per_kill']

# Distance traveled per second of match time (requires total_distance(df) to have run first)
def dist_per_game(df):
    df['dist_per_game'] = df['total_distance']/df['matchDuration']
    return df['dist_per_game']

# Damage per kill involvement (assists + kills)
def damage_ratio(df):
    df['damage_ratio'] = df['damageDealt']/(df['assists']+df['kills'])
    return df['damage_ratio']

# Average placement (killPlace / maxPlace)
def ave_place(df):
    df['ave_maxplace'] = df['killPlace'] / df['maxPlace']
#     df['ave_maxplace'].fillna(0, inplace=True)
#     df['ave_maxplace'].replace(np.inf, 0, inplace=True)
    return df['ave_maxplace']

# Walking distance per kill
def walk_kills(df):
    df['walk_kills'] = df['walkDistance'] / df['kills']
#     df['walk_kills'].fillna(0, inplace=True)
#     df['walk_kills'].replace(np.inf, 0, inplace=True)
    return df['walk_kills']

# Total support = assists + teammate revives
def support(df):
    df['support'] = df['assists'] + df['revives']
    return df['support']

# Solo average kill (killPlace / number of players in the match; requires player(df) to have run first)
def solo_avg_kill(df):
    df['solo_avg_kill'] = df['killPlace']/df['player']
    return df['solo_avg_kill']
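
Several of the ratio features above divide by columns such as `kills`, `assists`, or `maxPlace` that are often 0, which produces `inf` or `NaN`; the project handles this later with a single `replace` on the whole feature matrix. A possible per-feature alternative is sketched below. `safe_divide` is a hypothetical helper, and filling with 0 is an assumption that mirrors the commented-out `fillna(0)` lines.

```python
import numpy as np
import pandas as pd

def safe_divide(numerator, denominator, fill=0.0):
    """Element-wise division that returns `fill` where the result is inf or NaN."""
    ratio = numerator.astype("float64") / denominator.astype("float64")
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(fill)

toy = pd.DataFrame({
    "walkDistance": [1200.0, 300.0, 0.0],
    "kills": [3, 0, 0],
})
toy["walk_kills"] = safe_divide(toy["walkDistance"], toy["kills"])
print(toy["walk_kills"].tolist())  # [400.0, 0.0, 0.0]
```

Handling the zero denominators where each feature is defined keeps the fill value a per-feature decision; for example, a large value might describe "walked far without a kill" better than 0.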

## Machine Learning(Train)

In [None]:
from sklearn.metrics import mean_absolute_error

def training(model, X, y):
    reg = model
    reg.fit(X, y)
    pred_train = reg.predict(X)
    mae_train = mean_absolute_error(y, pred_train)
    return [mae_train ,reg]
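
`training` reports the MAE on the same rows the model was fitted on, which tends to look better than the error on unseen matches. A minimal sketch of a hold-out evaluation follows, assuming scikit-learn's `GroupShuffleSplit` with `matchId` as the group key so that players from one match never land on both sides of the split. The function name `training_with_holdout`, the 80/20 ratio, and the synthetic data are illustrative assumptions, not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupShuffleSplit

def training_with_holdout(model, X, y, groups, test_size=0.2, seed=0):
    """Fit on most groups and return (train MAE, validation MAE, fitted model)."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, valid_idx = next(splitter.split(X, y, groups=groups))
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    mae_train = mean_absolute_error(y.iloc[train_idx], model.predict(X.iloc[train_idx]))
    mae_valid = mean_absolute_error(y.iloc[valid_idx], model.predict(X.iloc[valid_idx]))
    return mae_train, mae_valid, model

# Synthetic stand-in for the engineered features: 40 matches with 10 players each.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "kills": rng.poisson(2, 400),
    "walkDistance": rng.uniform(0, 3000, 400),
})
y = (0.1 * X["kills"] + X["walkDistance"] / 3000).clip(0, 1)
match_ids = pd.Series(np.repeat(np.arange(40), 10))

mae_train, mae_valid, reg = training_with_holdout(LinearRegression(), X, y, match_ids)
print(f"train MAE: {mae_train:.4f}, validation MAE: {mae_valid:.4f}")
```

Any estimator with `fit` and `predict`, including the `LGBMRegressor` used later in the notebook, could be passed in place of `LinearRegression`.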

`main.py`, which calls the functions from each module

In [None]:
#%%
import numpy as np
import pandas as pd
import pickle
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression   # 1. Linear Regression 
from sklearn.linear_model import Lasso              # 2. Lasso
from sklearn.linear_model import Ridge              # 3. Ridge
from xgboost.sklearn import XGBRegressor            # 4. XGBoost
from lightgbm.sklearn import LGBMRegressor          # 5. LightGBM
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# from src.FE import columns_place, columns_grouped_mean, matchType_classify, matchType_encoding, team_player,\
#     scaling, player, total_distance, average_weaponsAcquired, average_damage ,healboost_per_kill,dist_per_game,damage_ratio,ave_place,walk_kills,support,solo_avg_kill

# from src.load_data import load_data
# from src.preprocess import feature_drop, reduce_mem_usage, rm_MissingValue
# from src.model import training

## 1. Preprocessing.
# Reduce the DataFrame's memory usage
train_prep = reduce_mem_usage(df_train)

# Handle missing values
train_prep = rm_MissingValue(train_prep)

# feature selection
train_prep = train_prep.drop(columns = ['Id','matchId','groupId','killPlace',\
                                       'killPoints','numGroups','rankPoints',\
                                       'teamKills', 'winPoints'])

## 2. Feature engineering
train_FE = train_prep
X = train_FE.drop(columns=['winPlacePerc','matchType'])
X_matchType = train_FE.matchType
y = train_FE.winPlacePerc

# Create new feature
X['average_weaponsAcquired'] = average_weaponsAcquired(df_train)
X['average_damage'] = average_damage(df_train)

X['totalDistance'] = total_distance(df_train)
X['team_player'] = team_player(df_train)
X['player']= player(df_train)
X['headshotKillsPerc'] = df_train.headshotKills / df_train.kills
X['kills_per_distance'] = df_train.kills / X.totalDistance
X['knocked_per_distance'] = df_train.DBNOs / X.totalDistance
X['damage_per_distance'] = df_train.damageDealt / X.totalDistance
X['killStreaks_rate'] = df_train.killStreaks / df_train.kills
X = columns_place(['assists','damageDealt','DBNOs','headshotKills','longestKill'], X, df_train)
X = columns_grouped_mean(['kills', 'assists', 'killStreaks', 'walkDistance'], X, df_train)

X['healboost_per_kill'] = healboost_per_kill(df_train)
X['dist_per_game'] = dist_per_game(df_train)
X['damage_ratio'] = damage_ratio(df_train)
X['ave_place'] = ave_place(df_train)
X['walk_kills'] = walk_kills(df_train)
X['support'] = support(df_train)
X['solo_avg_kill'] = solo_avg_kill(df_train)

X = X.replace((np.inf, -np.inf, np.nan), 0)

# Normalization(scaling)
X_scaled = scaling(X, StandardScaler(), ['damageDealt','longestKill','walkDistance',\
                                       'swimDistance','rideDistance'])

# Categorical feature encoding
X = pd.concat([X_scaled, X_matchType], axis=1)
X_OHE = matchType_classify(X)
X = matchType_encoding(X_OHE)

## 3. Train
# Sampling
df = pd.concat([X,y], axis=1)

def sampling(df,n):
    idx = sorted(np.random.permutation(len(df))[:n])
    return df.iloc[idx].copy()

n = len(df) // 5
df_sample = sampling(df, n)
X = df_sample.drop(columns = 'winPlacePerc')
y = df_sample.winPlacePerc

# Hyper-parameter tuning
mae, reg = training(LGBMRegressor(max_depth=1), X, y)

print("LGBMregressor      : %.4f" % mae)

The test data goes through the same preprocessing as the training data.

In [None]:
## 1. Preprocessing.
# Reduce the DataFrame's memory usage
test_prep = reduce_mem_usage(df_test)

# Handle missing values (skipped for the test set)
# test_prep = rm_MissingValue(test_prep)

# feature selection
test_prep = test_prep.drop(columns = ['Id','matchId','groupId','killPlace',\
                                       'killPoints','numGroups','rankPoints',\
                                       'teamKills', 'winPoints'])

## 2. Feature engineering
test_FE = test_prep
X = test_FE.drop(columns='matchType')
X_matchType = test_FE.matchType
#y = test_FE.winPlacePerc

# Create new feature
X['average_weaponsAcquired'] = average_weaponsAcquired(df_test)
X['average_damage'] = average_damage(df_test)

X['totalDistance'] = total_distance(df_test)
X['team_player'] = team_player(df_test)
X['player']= player(df_test)
X['headshotKillsPerc'] = df_test.headshotKills / df_test.kills
X['kills_per_distance'] = df_test.kills / X.totalDistance
X['knocked_per_distance'] = df_test.DBNOs / X.totalDistance
X['damage_per_distance'] = df_test.damageDealt / X.totalDistance
X['killStreaks_rate'] = df_test.killStreaks / df_test.kills
X = columns_place(['assists','damageDealt','DBNOs','headshotKills','longestKill'], X, df_test)
X = columns_grouped_mean(['kills', 'assists', 'killStreaks', 'walkDistance'], X, df_test)

X['healboost_per_kill'] = healboost_per_kill(df_test)
X['dist_per_game'] = dist_per_game(df_test)
X['damage_ratio'] = damage_ratio(df_test)
X['ave_place'] = ave_place(df_test)
X['walk_kills'] = walk_kills(df_test)
X['support'] = support(df_test)
X['solo_avg_kill'] = solo_avg_kill(df_test)

X = X.replace((np.inf, -np.inf, np.nan), 0)

# Normalization(scaling)
X_scaled = scaling(X, StandardScaler(), ['damageDealt','longestKill','walkDistance',\
                                       'swimDistance','rideDistance'])

# Categorical feature encoding
X = pd.concat([X_scaled, X_matchType], axis=1)
X_OHE = matchType_classify(X)
X_test = matchType_encoding(X_OHE)

## 3. test
# mae, reg = training(XGBRegressor(max_depth=15), X, y)

result = reg.predict(X_test)
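
The test cell above fits a new `StandardScaler` on the test data, so the scaled columns are not on exactly the same scale the model saw during training. A common alternative is to fit the scaler once on the training features and reuse it, and to align the one-hot columns to the training layout. The sketch below shows this on toy frames; `fit_scaler` and `apply_scaler` are hypothetical helpers rather than the project's `scaling` function.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def fit_scaler(X_train, cols):
    """Fit a StandardScaler on the training columns only."""
    return StandardScaler().fit(X_train[cols])

def apply_scaler(X, scaler, cols, train_columns=None):
    """Scale `cols` with an already fitted scaler; optionally align columns."""
    out = X.copy()
    out[cols] = scaler.transform(out[cols])
    if train_columns is not None:
        # Columns missing in X (e.g. an unseen one-hot level) are filled with 0.
        out = out.reindex(columns=train_columns, fill_value=0)
    return out

cols = ["damageDealt", "walkDistance"]
X_train = pd.DataFrame({
    "damageDealt": [0.0, 100.0, 200.0],
    "walkDistance": [0.0, 1000.0, 2000.0],
    "matchType_solo": [1, 0, 0],
    "matchType_duo": [0, 1, 1],
})
X_test = pd.DataFrame({
    "damageDealt": [100.0, 300.0],
    "walkDistance": [1000.0, 500.0],
    "matchType_solo": [1, 1],
})

scaler = fit_scaler(X_train, cols)
X_train_scaled = apply_scaler(X_train, scaler, cols)
X_test_scaled = apply_scaler(X_test, scaler, cols, train_columns=X_train.columns)
print(X_test_scaled)
```

Here a test `damageDealt` of 100 maps to 0 because 100 is the training mean, which would not be the case if the scaler were refitted on the two test rows.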

In [None]:
# Submit the results
submission['winPlacePerc'] = result
submission.to_csv('submission.csv', index=False)
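
Because `winPlacePerc` is a percentile between 0 and 1, a regressor's raw output can fall slightly outside the valid range. One optional post-processing step, which is not part of the original pipeline, is to clip the predictions before writing the submission. The array below is made-up data for illustration.

```python
import numpy as np

raw_pred = np.array([-0.03, 0.25, 0.98, 1.07])  # illustrative predictions
clipped = np.clip(raw_pred, 0.0, 1.0)
print(clipped.tolist())  # [0.0, 0.25, 0.98, 1.0]
```

In the notebook this would amount to assigning `np.clip(result, 0, 1)` to `submission['winPlacePerc']`.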