# MLB Digital Engagement Forecasting 
### Introduction 
Inspired by the excellent starter notebook [Getting Started with MLB Player Digital Engagement Forecasting](https://www.kaggle.com/ryanholbrook/getting-started-with-mlb-player-digital-engagement), I restructure some of its cells and add a few new ideas. The **blue blocks** hold supplementary explanations (originally written for native Chinese speakers) that spell out what specific features mean.

I'm not very familiar with baseball terminology, so if you spot a mistake in this notebook, please feel free to correct me and start a discussion!

### Competition Goal 
Forecast four measures of fan engagement (`target1`-`target4`) for the next day: for each `date` d, you predict the engagement on day d+1.
<div class="alert alert-block alert-info" style="font-size: 15px; font-family: DFKai-sb;">
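    <h3>Competition Goal</h3>
    <p>Using player data, game records, award records, and any features that participants extract themselves, predict fan engagement (target1 - target4) at the next time step. Values range from 0 to 100, and the time series is at daily resolution.</p>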
</div>
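
To make the one-day offset concrete, here is a minimal sketch on a hypothetical toy frame. The `engagementDate` column and all values are invented for illustration and are not taken from the competition files.

```python
import pandas as pd

# Hypothetical toy rows; column names and values are illustrative only.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-04-01", "2021-04-02"]),
    "playerId": [1, 1],
    "target1": [10.0, 25.0],
})

# A row dated d carries the engagement that is observed on d + 1.
toy["engagementDate"] = toy["date"] + pd.Timedelta(days=1)
print(toy)
```

Any feature built from past targets therefore has to respect this shift, otherwise information from the day being predicted leaks into the inputs.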

### Data
#### Static
<div class="alert alert-block alert-info" style="font-size: 15px; font-family: DFKai-sb;">
    <h4>Static data</h4>
    <ul>
        <li><code>players.csv</code> - MLB player data</li>
        <li><code>teams.csv</code> - MLB team data</li>
        <li><code>seasons.csv</code> - start and end dates of each season</li>
        <li><code>awards.csv</code> - player award records from before 2018</li>
    </ul>
</div>

#### Time-dependent (Daily)
<div class="alert alert-block alert-info" style="font-size: 15px; font-family: DFKai-sb;">
    <h4>Time series data</h4>
    <ul>
        <li><code>train.csv</code> - time series data at daily resolution (see the analysis below for details)</li>
    </ul>
</div>

In [None]:
# Import packages
import os
import warnings 
import gc
from pprint import pprint
from datetime import datetime 

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import plotly.express as px 
import plotly.graph_objects as go
import seaborn as sns

# Configurations
# warnings.simplefilter("ignore")
pd.set_option('display.max_columns', 100)   # Show every column as long as the dataframe has at most 100 of them

In [None]:
# Variable definitions
DATA_PATH = "../input/mlb-player-digital-engagement-forecasting" 
DATA_FILES = ["players.csv", "teams.csv", "seasons.csv", 
              "awards.csv", "train.csv"]

In [None]:
# Utility functions 
def describe(df, stats):
    '''Describe the basic information of the raw dataframe.
    
    Parameters:
        df: pd.DataFrame, raw dataframe to be analyzed
        stats: boolean, whether to get descriptive statistics 
    
    Return:
        None
    '''
    df_ = df.copy(deep=True)   # Copy of the raw dataframe
    n_features = df_.shape[1]
    if n_features > pd.get_option("display.max_columns"):
        # The dataframe has more columns than pandas is configured to display
        warnings.warn("Please raise the display.max_columns option "
                      "to enable the complete display.", 
                      UserWarning) 
    print("=====Basic information=====")
    df_.info()   # info() prints directly and returns None, so it needs no display() wrapper
    get_nan_ratios(df_)
    if stats:
        print("=====Description=====")
        numeric_col_num = df_.select_dtypes(include=np.number).shape[1]   # Number of cols in numeric type
        if numeric_col_num != 0:
            display(df_.describe())
        else:
            print("There's no description of numeric data to display!")
    del df_
    gc.collect()

def get_nan_ratios(df):
    '''Get NaN ratios of columns with NaN values.
    
    Parameters:
        df: pd.DataFrame, raw dataframe to be analyzed
        
    Return:
        None
    '''
    df_ = df.copy()   # Copy of the raw dataframe
    nan_ratios = df_.isnull().sum() / df_.shape[0] * 100   # Ratios of value nan in each column
    nan_ratios = pd.DataFrame([df_.columns, nan_ratios]).T   # Take transpose 
    nan_ratios.columns = ["Columns", "NaN ratios"]
    nan_ratios = nan_ratios[nan_ratios["NaN ratios"] != 0.0]
    print("=====NaN ratios of columns with NaN values=====")
    if len(nan_ratios) == 0:
        print("There isn't any NaN value in the dataset!")
    else:
        display(nan_ratios)
    del df_
    gc.collect() 
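
As a quick sanity check of the NaN-ratio logic, the sketch below applies the same idea to a hypothetical toy dataframe (the column names and values are invented). It uses `isna().mean()`, which is an equivalent shortcut for dividing the NaN count by the row count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in two of three columns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Share of missing values per column, in percent.
nan_pct = toy.isna().mean() * 100
# Keep only the columns that actually contain missing values.
nan_pct = nan_pct[nan_pct > 0]
print(nan_pct)
```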

# **1. Basic Description of Data**
Let's take a peek at the data files used in this competition! 

In [None]:
# Get basic information about train.csv 
train = pd.read_csv(os.path.join(DATA_PATH, "train.csv"))
train['date'] = pd.to_datetime(train['date'], format="%Y%m%d")
describe(train, True)

In [None]:
# Get basic information about static files
sup_files = DATA_FILES[:-1]
for file in sup_files:
    sup_df = pd.read_csv(os.path.join(DATA_PATH, file))
    print(f"====={file[:-4]}=====")
    display(sup_df.head())
    print(f"Shape: {sup_df.shape}")
    describe(sup_df, False)
    globals()[file[:-4]] = sup_df   # Expose the dataframe as a global variable named after the file (e.g. players)

# **2. Train DataFrame Unpacking**
Now, let's unpack `train.csv` to enable the further analysis.

In [None]:
# Utility functions
from io import StringIO

def unpack_json(json_str):
    '''Convert a JSON string found in the daily dataframe to a pandas object.
    
    Parameters:
        json_str: str, data entry in the "train" dataframe, formatted as a JSON string
    
    Return:
        json_obj: np.nan if the entry is originally NaN; otherwise the converted pandas object
    '''
    # Wrap the string in StringIO, since recent pandas versions deprecate passing literal JSON to read_json
    return np.nan if pd.isna(json_str) else pd.read_json(StringIO(json_str))
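
To illustrate what unpacking does, here is a minimal sketch on a hypothetical two-row frame that mimics the nested layout of `train.csv`. The `games` payload and its values are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for train.csv: one JSON string per date, or a missing value.
toy_train = pd.DataFrame({
    "date": pd.to_datetime(["2021-04-01", "2021-04-02"]),
    "games": [
        '[{"gamePk": 1, "homeScore": 3}, {"gamePk": 2, "homeScore": 5}]',
        None,
    ],
})

frames = []
for row in toy_train.dropna(subset=["games"]).itertuples(index=False):
    frame = pd.read_json(StringIO(row.games))    # one row per JSON record
    frame.insert(0, "dailyDataDate", row.date)   # remember which day the records came from
    frames.append(frame)

toy_games = pd.concat(frames, ignore_index=True)
print(toy_games)
```

Each JSON string expands into several rows, so the unpacked table is longer than the daily frame it came from.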

In [None]:
# Create mapping relationship between column names of train dataframe and the "unpacked" table
daily_unpacked_dfs = pd.DataFrame(train.columns[1:], columns=["dfName"])
daily_unpacked_dfs["df"] = [pd.DataFrame() for _ in range(len(daily_unpacked_dfs))]
daily_unpacked_dfs

In [None]:
# Unpack each table residing in train dataframe
for df_idx, sub_df in daily_unpacked_dfs.iterrows():
    table_name = sub_df["dfName"]
    table = train.loc[:, ["date", table_name]]
    table = (table[~pd.isna(table[table_name])].reset_index(drop=True))   # Get samples having no nan entries 
    
    daily_unpacked_samples = [] 
    for daily_idx, daily_sample in table.iterrows():
        daily_unpacked_sample = unpack_json(daily_sample[table_name])
        daily_unpacked_sample["dailyDataDate"] = daily_sample["date"]
        daily_unpacked_samples.append(daily_unpacked_sample)   # Append in place instead of rebuilding the list each time
    unpacked_table = pd.concat(daily_unpacked_samples, ignore_index=True).set_index("dailyDataDate").reset_index()

    globals()[table_name] = unpacked_table   # Assign unpacked table into globally accessible dict
    daily_unpacked_dfs["df"][df_idx] = unpacked_table   # Assign unpacked table to create the mapping relationship 
                                                        # (dfName (table name) <--> unpacked table)

# Free up the memory
del train, table, daily_unpacked_samples
gc.collect()

In [None]:
# Dump dataframe to csv
daily_unpacked_dfs.to_csv("./daily_unpacked_dfs.csv", index=False)

In [None]:
# Check unpacked table by showing one single sample from each table 
for table_name in daily_unpacked_dfs["dfName"]:
    print(f"=====Table: {table_name}=====")
    table = globals()[table_name]
    display(table.head(1))
    print(f"Shape: {table.shape}\n")

# **3. Data Merging**
In this part, we dig deeper into each table unpacked above and try to understand the basic meaning of each feature. We also run some **feature extraction steps** aimed at finding potentially useful features.

## *3.1 Feature Extraction for Dates and Seasons*
### 3.1.1 Features related to `dates`
* Decompositions of the date (e.g. year and month)
* Weekday of the date 

In [None]:
dates = pd.DataFrame(nextDayPlayerEngagement['dailyDataDate'].unique(), columns=['dailyDataDate'])
dates['date'] = pd.to_datetime(dates['dailyDataDate'].astype(str))
dates['year'] = dates['date'].dt.year
dates['month'] = dates['date'].dt.month
dates['weekday'] = dates['date'].dt.weekday   # Retrieve weekday for each date (Monday=0, ..., Sunday=6)

print("=====DataFrame: dates=====")
display(dates.head())
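
A small sketch of the same date decomposition on two hypothetical dates, using the vectorized `.dt` accessor throughout:

```python
import pandas as pd

# Two hypothetical dates: a Thursday and a Sunday.
toy_dates = pd.DataFrame({"date": pd.to_datetime(["2021-04-01", "2021-04-04"])})
toy_dates["year"] = toy_dates["date"].dt.year
toy_dates["month"] = toy_dates["date"].dt.month
toy_dates["weekday"] = toy_dates["date"].dt.weekday   # Monday=0, ..., Sunday=6
print(toy_dates)
```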

### 3.1.2 Features related to `seasons`:
* Whether the date is in season or not
* Season categories
<div class="alert alert-block alert-info" style="font-size: 15px; font-family: DFKai-sb;">
    <ul>
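        <li><code>Offseason</code> - offseason (before the preseason starts)</li>
        <li><code>Preseason</code> - spring training games</li>
        <li><code>Reg Season 1st Half</code> - regular season before the All-Star Game</li>
        <li><code>All-Star Break</code> - All-Star break</li>
        <li><code>Reg Season 2nd Half</code> - regular season after the All-Star Game</li>
        <li><code>Between Reg and Postseason</code> - gap between the end of the regular season and the start of the postseason</li>
        <li><code>Postseason</code> - postseason</li>
        <li><code>Offseason</code> - offseason (after the postseason ends)</li>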
    </ul>
</div>

In [None]:
dates_with_seasons = pd.merge(dates, seasons, left_on='year', right_on='seasonId')   # Join two dfs with key 'seasonId' (i.e. year)

# Determine whether the date is in season or not
dates_with_seasons['inSeason'] = dates_with_seasons['date'].between(
    dates_with_seasons['regularSeasonStartDate'],
    dates_with_seasons['postSeasonEndDate']
)   # Both boundaries are included by default

# Categorize different game seasons
dates_with_seasons['seasonPart'] = np.select(
    [
        dates_with_seasons['date'] < dates_with_seasons['preSeasonStartDate'], 
        dates_with_seasons['date'] < dates_with_seasons['regularSeasonStartDate'],
        dates_with_seasons['date'] <= dates_with_seasons['lastDate1stHalf'],
        dates_with_seasons['date'] < dates_with_seasons['firstDate2ndHalf'],
        dates_with_seasons['date'] <= dates_with_seasons['regularSeasonEndDate'],
        dates_with_seasons['date'] < dates_with_seasons['postSeasonStartDate'],
        dates_with_seasons['date'] <= dates_with_seasons['postSeasonEndDate'],
        dates_with_seasons['date'] > dates_with_seasons['postSeasonEndDate']
    ], 
    [
        'Offseason',
        'Preseason',
        'Reg Season 1st Half',
        'All-Star Break',
        'Reg Season 2nd Half',
        'Between Reg and Postseason',
        'Postseason',
        'Offseason'
    ], 
    default='Unknown'   # String fallback shares a dtype with the string labels above
)    

print("=====DataFrame: dates_with_seasons=====")
display(dates_with_seasons.head())

# Plot pie chart of season categories
season_counts = dates_with_seasons['seasonPart'].value_counts()
fig = go.Figure()
fig.add_trace(go.Pie(
    labels=season_counts.index,
    values=season_counts
))
fig.update_layout(
    title='Pie Chart of Season Categories'
)
fig.show()
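
The season categorization relies on `np.select` returning the label of the **first** condition that holds, which is why the conditions can be written as a chain of increasing cutoffs. A minimal sketch with hypothetical season boundaries (the dates and the shortened label set are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical season boundaries, invented for illustration only.
pre_start = pd.Timestamp("2021-02-28")
reg_start = pd.Timestamp("2021-04-01")
reg_end = pd.Timestamp("2021-10-03")

toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-15", "2021-03-10", "2021-06-01", "2021-11-20"])
})

# Conditions are checked in order; the first one that is True decides the label.
toy["seasonPart"] = np.select(
    [
        toy["date"] < pre_start,
        toy["date"] < reg_start,
        toy["date"] <= reg_end,
    ],
    ["Offseason", "Preseason", "Regular Season"],
    default="Offseason",   # dates after the last cutoff
)
print(toy)
```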

## *3.2 Feature Extraction for Game Stats at **Player Game** Level*
<div class="alert alert-block alert-info" style="font-size: 15px; font-family: DFKai-sb;">
    <p>Explanations of all features in the <code>playerBoxScores</code> table follow; expand the cell below to see them <i class="fas fa-arrow-circle-down"></i></p>
</div>

<div class="alert alert-block alert-info" style="font-size:15px; font-family: DFKai-sb;">
    <ul>
        <li><code>home</code> - whether the team is the home team</li>
        <li><code>gamePk</code> - game identifier</li>
        <li><code>gameDate</code> - game date</li>
        <li><code>gameTimeUTC</code> - time of the first pitch (UTC)</li>
        <li><code>teamId</code> - team identifier</li>
        <li><code>teamName</code> - team name</li>
        <li><code>playerId</code> - player identifier</li>
        <li><code>playerName</code> - player name</li>
        <li><code>jerseyNum</code> - jersey number</li>
        <li><code>positionCode</code> - fielding position code</li>
        <li><code>positionName</code> - fielding position name</li>
        <li><code>positionType</code> - fielding position type</li>
        <li><code>battingOrder</code> - batting order. The first digit is the lineup slot; as far as I understand the data description, the last two digits give the order in which players occupied that slot rather than a plate-appearance count. For example, 300 would be the starting third batter and 301 the first player to replace him.</li>
        <li><code>gamesPlayedBatting</code> - 1 if the player appeared as a batter, runner, or fielder</li>
        <li><code>flyOuts</code> - fly outs</li>
        <li><code>groundOuts</code> - ground outs</li>
        <li><code>runsScored</code> - runs scored</li>
        <li><code>doubles</code> - doubles</li>
        <li><code>triples</code> - triples</li>
        <li><code>homeRuns</code> - home runs</li>
        <li><code>strikeOuts</code> - strikeouts (as a batter)</li>
        <li><code>baseOnBalls</code> - walks drawn</li>
        <li><code>intentionalWalks</code> - intentional walks drawn</li>
        <li><code>hits</code> - hits</li>
        <li><code>hitByPitch</code> - times hit by a pitch</li>
        <li><code>atBats</code> - <a href="https://zh.wikipedia.org/wiki/%E6%89%93%E6%95%B8">at bats</a></li>
        <li><code>caughtStealing</code> - times caught stealing</li>
        <li><code>stolenBases</code> - stolen bases</li>
        <li><code>groundIntoDoublePlay</code> - times grounded into a double play</li>
        <li><code>groundIntoTriplePlay</code> - times grounded into a triple play</li>
        <li><code>plateAppearances</code> - <a href="https://zh.wikipedia.org/wiki/%E6%89%93%E5%B8%AD%E6%95%B8">plate appearances</a></li>
        <li><code>totalBases</code> - <a href="http://twbsball.dils.tku.edu.tw/wiki/index.php/%E5%A3%98%E6%89%93%E6%95%B8">total bases</a></li>
        <li><code>rbi</code> - <a href="https://zh.wikipedia.org/wiki/%E6%89%93%E9%BB%9E">runs batted in</a></li>
        <li><code>leftOnBase</code> - runners left on base</li>
        <li><code>sacBunts</code> - sacrifice bunts</li>
        <li><code>sacFlies</code> - sacrifice flies</li>
        <li><code>catchersInterference</code> - times reaching base on catcher's interference</li>
        <li><code>pickoffs</code> - pickoffs</li>
        <li><code>gamesPlayedPitching</code> - binary, 1 if the player appeared as a pitcher</li>
        <li><code>gamesStartedPitching</code> - binary, 1 if the player was the starting pitcher</li>
        <li><code>completeGamesPitching</code> - binary, 1 if the pitcher threw a complete game</li>
        <li><code>shutoutsPitching</code> - binary, 1 if the pitcher threw a shutout</li>
        <li><code>winsPitching</code> - binary, 1 if the pitcher earned the win</li>
        <li><code>lossesPitching</code> - binary, 1 if the pitcher took the loss</li>
        <li><code>flyOutsPitching</code> - outfield fly outs induced</li>
        <li><code>airOutsPitching</code> - fly outs induced (outfield + infield)</li>
        <li><code>groundOutsPitching</code> - ground outs induced</li>
        <li><code>runsPitching</code> - total runs allowed</li>
        <li><code>doublesPitching</code> - doubles allowed</li>
        <li><code>triplesPitching</code> - triples allowed</li>
        <li><code>homeRunsPitching</code> - home runs allowed</li>
        <li><code>strikeOutsPitching</code> - strikeouts recorded</li>
        <li><code>baseOnBallsPitching</code> - walks issued</li>
        <li><code>intentionalWalksPitching</code> - intentional walks issued</li>
        <li><code>hitsPitching</code> - hits allowed</li>
        <li><code>hitByPitchPitching</code> - batters hit by a pitch</li>
        <li><code>atBatsPitching</code> - at bats against the pitcher</li>
        <li><code>caughtStealingPitching</code> - runners caught stealing while this pitcher was pitching</li>
        <li><code>stolenBasesPitching</code> - stolen bases allowed</li>
        <li><code>inningsPitched</code> - innings pitched</li>
        <li><code>saveOpportunities</code> - binary, 1 if the pitcher had a save opportunity</li>
        <li><code>earnedRuns</code> - earned runs</li>
        <li><code>battersFaced</code> - number of batters the pitcher faced</li>
        <li><code>outsPitching</code> - outs recorded by the pitcher</li>
        <li><code>pitchesThrown</code> - pitches thrown</li>
        <li><code>balls</code> - total balls thrown</li>
        <li><code>strikes</code> - total strikes thrown</li>
        <li><code>hitBatsmen</code> - total batters hit by the pitcher's pitches</li>
        <li><code>balks</code> - balks</li>
        <li><code>wildPitches</code> - wild pitches</li>
        <li><code>pickoffsPitching</code> - pickoffs</li>
        <li><code>rbiPitching</code> - total RBIs allowed by the pitcher</li>
        <li><code>inheritedRunners</code> - runners already on base when the relief pitcher entered</li>
        <li><code>inheritedRunnersScored</code> - <a href="http://twbsball.dils.tku.edu.tw/wiki/index.php/Inherited_runners-scored">inherited runners who scored</a></li>
        <li><code>catchersInterferencePitching</code> - catcher's interference calls committed by the battery</li>
        <li><code>sacBuntsPitching</code> - sacrifice bunts allowed</li>
        <li><code>sacFliesPitching</code> - sacrifice flies allowed</li>
        <li><code>saves</code> - binary, 1 if the pitcher earned a save</li>
        <li><code>holds</code> - binary, 1 if the pitcher earned a hold</li>
        <li><code>blownSaves</code> - binary, 1 if the pitcher blew a save</li>
        <br>
        <p>For detailed explanations of the terms below, see <a href="http://twbsball.dils.tku.edu.tw/wiki/index.php/%E5%AE%88%E5%82%99%E6%A9%9F%E6%9C%83">fielding chances</a></p>
        <li><code>assists</code> - assists</li>
        <li><code>putOuts</code> - putouts</li>
        <li><code>errors</code> - errors</li>
        <li><code>chances</code> - fielding chances</li>
    </ul>
</div>

### 3.2.1 Features related to `game stats`
* Fractional representation of the innings a pitcher pitches in a game. `inningsPitched` uses baseball notation, where the digit after the decimal point counts outs (5.1 means five innings plus one out), so it is converted to true thirds of an inning.
    <div class="alert alert-block alert-info" style="font-size:15px; font-family: DFKai-sb;">
        <ul>
            <li><code>inningsPitched</code> - Game total innings pitched.</li>
        </ul>
    </div>
* [Tom Tango pitching game score](https://www.mlb.com/glossary/advanced-stats/game-score) 
* Whether the game is a no-hitter
<div class="alert alert-block alert-info" style="font-size:15px; font-family: DFKai-sb;">
    <ul>
        <li><code>noHitter</code> - A pitcher allows no hits over the entire course of a game that lasts at least nine innings.</li>
    </ul> 
</div>

    * Because the feature is at **player game** level, a no-hitter discussed here is one completed by **a single pitcher**.

In [None]:
# Rename columns for clarity, indicating that they do not come from 'roster'
player_game_stats = playerBoxScores.copy().rename(
    columns={
        'teamId': 'gameTeamId', 
        'teamName': 'gameTeamName'
    }
)

# Add in fractional representation of inningsPitched
player_game_stats['inningsPitchedAsFrac'] = np.where(
    pd.isna(player_game_stats['inningsPitched']),
    np.nan,   
    (np.floor(player_game_stats['inningsPitched']) +
    (player_game_stats['inningsPitched'] -
    np.floor(player_game_stats['inningsPitched'])) * 10/3)
)

# Add in Tom Tango pitching game score 
player_game_stats['pitchingGameScore'] = (
    40 + 
    2 * player_game_stats['outsPitching'] +   # Game total outs recorded 
    1 * player_game_stats['strikeOutsPitching'] -
    2 * player_game_stats['baseOnBallsPitching'] -
    2 * player_game_stats['hitsPitching'] -
    3 * player_game_stats['runsPitching'] -
    6 * player_game_stats['homeRunsPitching']
)

# Add in criteria for no-hitter game completed by a single pitcher 
player_game_stats['noHitter'] = np.where(
    (player_game_stats['gamesStartedPitching'] == 1) &
    (player_game_stats['inningsPitched'] >= 9) &
    (player_game_stats['hitsPitching'] == 0),
    1, 
    0
)

print("=====DataFrame: player_game_stats=====")
display(player_game_stats.head())
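
As a worked example of the two derived pitching features, the sketch below applies the same formulas to three hypothetical pitching lines (all numbers are invented). An `inningsPitched` of 5.1 becomes 5 + 1/3, and 6.2 becomes 6 + 2/3.

```python
import numpy as np
import pandas as pd

# Three hypothetical pitching lines; the values are illustrative only.
toy = pd.DataFrame({
    "inningsPitched": [5.1, 6.2, 9.0],
    "outsPitching": [16, 20, 27],
    "strikeOutsPitching": [4, 7, 10],
    "baseOnBallsPitching": [2, 1, 0],
    "hitsPitching": [6, 3, 0],
    "runsPitching": [3, 1, 0],
    "homeRunsPitching": [1, 0, 0],
})

# Baseball notation -> true fraction: the decimal digit counts outs, so scale it by 10/3.
whole = np.floor(toy["inningsPitched"])
toy["inningsPitchedAsFrac"] = whole + (toy["inningsPitched"] - whole) * 10 / 3

# Tom Tango game score, using the same weights as the cell above.
toy["pitchingGameScore"] = (
    40
    + 2 * toy["outsPitching"]
    + 1 * toy["strikeOutsPitching"]
    - 2 * toy["baseOnBallsPitching"]
    - 2 * toy["hitsPitching"]
    - 3 * toy["runsPitching"]
    - 6 * toy["homeRunsPitching"]
)
print(toy[["inningsPitched", "inningsPitchedAsFrac", "pitchingGameScore"]])
```

In this toy data the fractional innings also equal `outsPitching / 3`, which can serve as a cross-check of the conversion.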

### 3.2.2 Features related to results of applying aggregate functions on `game stats` 
All the aggregate functions are applied at **date-player** level, meaning that rows are grouped by the compound key **(dailyDataDate, playerId)**.
* Number of games per player per day
* Number of teams per player per day
    * This should be **one team per player per day**, but playerId **518617 (Jake Diekman)** had 2 games for different teams marked as played on 5/19/19, because a game was resumed after he was traded.
* Team identifier for each (dailyDataDate, playerId) pair
    * There should be only **one team for almost all (dailyDataDate, playerId) pairs**.
* Sum of selected player game stats

In [None]:
# Apply aggregate functions on specific features
player_date_stats_agg = pd.merge(
    player_game_stats.groupby(['dailyDataDate', 'playerId'], as_index=False).agg(
        numGames=('gamePk', 'nunique'),  
        numTeams=('gameTeamId', 'nunique'),
        gameTeamId=('gameTeamId', 'min')   # Take min to simplify the extraction 
    ), 
    player_game_stats.groupby(['dailyDataDate', 'playerId'], as_index = False)[
        ['runsScored', 'homeRuns', 'strikeOuts', 'baseOnBalls', 
         'hits', 'hitByPitch', 'atBats', 'caughtStealing', 
         'stolenBases', 'groundIntoDoublePlay', 'groundIntoTriplePlay', 'plateAppearances',
         'totalBases', 'rbi', 'leftOnBase', 'sacBunts', 
         'sacFlies', 'gamesStartedPitching', 'runsPitching', 'homeRunsPitching', 
         'strikeOutsPitching', 'baseOnBallsPitching', 'hitsPitching', 'inningsPitchedAsFrac', 
         'earnedRuns', 'battersFaced', 'saves', 'blownSaves', 
         'pitchingGameScore', 'noHitter'
        ]
    ].sum(),   
    on=['dailyDataDate', 'playerId'],
    how='inner'
)

print("=====DataFrame: player_date_stats_agg=====")
display(player_date_stats_agg.head())
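
The same kind of date-player aggregation can also be written as a single named-aggregation call. The sketch below shows this alternative on a hypothetical toy table (values invented); it is not the approach used in the cell above, which merges two separate group-bys.

```python
import pandas as pd

# Hypothetical player-game rows: player 1 plays a doubleheader, player 2 plays once.
toy = pd.DataFrame({
    "dailyDataDate": pd.to_datetime(["2021-04-01"] * 3),
    "playerId": [1, 1, 2],
    "gamePk": [100, 101, 100],
    "gameTeamId": [10, 10, 20],
    "hits": [1, 2, 0],
})

agg = toy.groupby(["dailyDataDate", "playerId"], as_index=False).agg(
    numGames=("gamePk", "nunique"),       # distinct games that day
    numTeams=("gameTeamId", "nunique"),   # distinct teams that day
    gameTeamId=("gameTeamId", "min"),     # pick one team id per pair
    hits=("hits", "sum"),                 # daily total
)
print(agg)
```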

## *3.3 Feature Extraction for Game Stats at **Team Game** Level*
<div class="alert alert-block alert-info" style="font-size: 15px; font-family: DFKai-sb;">
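    <p>The features in <code>teamBoxScores</code> mirror those in <code>playerBoxScores</code> above, so they are not explained separately. The two tables differ only in the level of the statistics: <code>playerBoxScores</code> is computed per individual player, while <code>teamBoxScores</code> summarizes the performance of the whole team in a game.</p>
    <p>Explanations of all features in <code>games</code> follow; expand the cell below to see them <i class="fas fa-arrow-circle-down"></i></p>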
</div>

<div class="alert alert-block alert-info" style="font-size:15px; font-family: DFKai-sb;">
    <ul>
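        <li><code>gamePk</code> - game identifier</li>
        <li><code>gameType</code> - game type (spring training, regular season, and so on); the values are abbreviations of seriesDescription below.</li>
        <li><code>season</code> - season (given as a year)</li>
        <li><code>gameDate</code> - game date</li>
        <li><code>gameTimeUTC</code> - time of the first pitch (UTC)</li>
        <li><code>resumeDate</code> - time at which a suspended game was resumed (null if the game was not suspended)</li>
        <li><code>resumedFrom</code> - time of the original game that was suspended (null if the game was not suspended). From what I observed, whenever this field has a value it equals gameTimeUTC.</li>
        <li><code>codedGameState</code> - game state code</li>
        <li><code>detailedGameState</code> - game state description</li>
        <li><code>isTie</code> - boolean, true if the game ended in a tie</li>
        <li><code>gameNumber</code> - flag that helps identify games of a <a href="http://twbsball.dils.tku.edu.tw/wiki/index.php/%E9%9B%99%E9%87%8D%E8%B3%BD">doubleheader</a>; its value is 1 or 2.</li>
        <li><code>doubleHeader</code> - Y for a doubleheader, N for a single game, S for <font style="color: red;">split-ticket</font></li>
        <li><code>dayNight</code> - whether the game starts during the day or at night</li>
        <li><code>scheduledInnings</code> - number of scheduled innings</li>
        <li><code>gamesInSeries</code> - number of games in the current series</li>
        <li><code>seriesDescription</code> - series type (spring training, regular season, and so on); the values are the full names of gameType above.</li>
        <li><code>homeId</code> - home team identifier</li>
        <li><code>homeName</code> - home team name</li>
        <li><code>homeAbbrev</code> - home team abbreviation</li>
        <li><code>homeWins</code> - home team's wins so far this season</li>
        <li><code>homeLosses</code> - home team's losses so far this season</li>
        <li><code>homeWinPct</code> - home team's winning percentage so far this season</li>
        <li><code>homeWinner</code> - boolean, true if the home team won this game.</li>
        <li><code>homeScore</code> - runs scored by the home team</li>
        <li><code>awayId</code> - away team identifier</li>
        <li><code>awayName</code> - away team name</li>
        <li><code>awayAbbrev</code> - away team abbreviation</li>
        <li><code>awayWins</code> - away team's wins so far this season</li>
        <li><code>awayLosses</code> - away team's losses so far this season</li>
        <li><code>awayWinPct</code> - away team's winning percentage so far this season</li>
        <li><code>awayWinner</code> - boolean, true if the away team won this game.</li>
        <li><code>awayScore</code> - runs scored by the away team</li>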
    </ul>
</div>

### 3.3.1 Games table reconstruction
To convert the `games` table into the format of **one row per team-game**, the following processing is done:
* Extract specific game types and ensure the validity of scores.
* Recreate two games tables from different perspectives (**home team** and **away team**).
* Concatenate two games tables into final one.

In [None]:
# Extract games played in regular or post-season with valid scores (those without NaN)
games_for_stats = games[
    np.isin(games['gameType'], ['R', 'F', 'D', 'L', 'W', 'C', 'P']) &
    ~pd.isna(games['homeScore']) &
    ~pd.isna(games['awayScore'])
]

# Get games table from home team perspective
games_home_perspective = games_for_stats.copy()
games_home_perspective.columns = [
    col_value.replace('home', 'team').replace('away', 'opp') for 
    col_value in games_home_perspective.columns.values
]   # Change column names so that "team" is "home", "opp" is "away"
games_home_perspective['isHomeTeam'] = 1

# Get games table from away team perspective
games_away_perspective = games_for_stats.copy()
games_away_perspective.columns = [
    col_value.replace('home', 'opp').replace('away', 'team') for 
    col_value in games_away_perspective.columns.values
]   # Change column names so that "opp" is "home", "team" is "away"
games_away_perspective['isHomeTeam'] = 0

# Put together games tables from home/away perspective
team_games = (pd.concat(
    [
        games_home_perspective,
        games_away_perspective
    ],
    ignore_index=True)
)

print("=====DataFrame: team_games=====")
display(team_games.head())
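
The home/away reshaping can be sketched on a single hypothetical game as follows. The helper `as_perspective` is an illustrative name that does not appear in the notebook; it simply wraps the same column-renaming idea used above.

```python
import pandas as pd

# One hypothetical game: team 10 hosts team 20 and wins 5-3.
toy_games = pd.DataFrame({
    "gamePk": [1],
    "homeId": [10], "awayId": [20],
    "homeScore": [5], "awayScore": [3],
})

def as_perspective(df, team_prefix, opp_prefix, is_home):
    """Rename columns so team_prefix becomes 'team' and opp_prefix becomes 'opp'."""
    renamed = df.rename(
        columns=lambda col: col.replace(team_prefix, "team").replace(opp_prefix, "opp")
    )
    return renamed.assign(isHomeTeam=is_home)

# Stack both perspectives: one row per team-game instead of one row per game.
toy_team_games = pd.concat(
    [
        as_perspective(toy_games, "home", "away", 1),
        as_perspective(toy_games, "away", "home", 0),
    ],
    ignore_index=True,
)
print(toy_team_games)
```

`pd.concat` aligns the two frames by column name, so the differing column order of the two perspectives does not matter.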

### 3.3.2 Features related to results of applying aggregate functions on `game stats`
First, the columns of the `team_game_stats` (`teamBoxScores`) table are renamed; then the `team_games` table is merged with `team_game_stats`. The final extracted features are as follows:
* Number of games per team per day
* Number of winning games per team per day
* Number of losing games per team per day
* Total runs scored per team per day
* Total runs allowed per team per day

In [None]:
# Copy over team box scores data (team-game level)
team_game_stats = teamBoxScores.copy()

# Add suffix 'Team' to column names to reflect these stats are at team-game level,
# helping differentiate from individual player stats (player-game level) when joining
team_game_stats.columns = [
    (col_value + 'Team') 
    if (col_value not in ['dailyDataDate', 'home', 'teamId', 
                          'gamePk','gameDate', 'gameTimeUTC'])
    else col_value for 
    col_value in team_game_stats.columns.values
]

# Merge games table with team_game_stats table
team_games_with_stats = pd.merge(
    team_games,
    # Drop columns already present in team_games table
    team_game_stats.drop(['home', 'gameDate', 'gameTimeUTC'], axis = 1),
    on = ['dailyDataDate', 'gamePk', 'teamId'],
    # Doing this as 'inner' join excludes spring training games, postponed games,
    # etc. from original games table, but this may be fine for purposes here 
    how = 'inner'
)

print("=====DataFrame: team_games_with_stats=====")
display(team_games_with_stats.head())

# Apply aggregate functions on specific features
team_date_stats_agg = (
    team_games_with_stats.groupby(['dailyDataDate', 'teamId', 'gameType', 
                                   'oppId', 'oppName'], as_index = False).agg(
        numGamesTeam = ('gamePk', 'nunique'),
        winsTeam = ('teamWinner', 'sum'),
        lossesTeam = ('oppWinner', 'sum'),
        runsScoredTeam = ('teamScore', 'sum'),
        runsAllowedTeam = ('oppScore', 'sum')
    )
)

print("=====DataFrame: team_date_stats_agg=====")
display(team_date_stats_agg.head())
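
Because the inner join silently drops team-games that have no box score, it can be useful to check what falls out before aggregating. A minimal sketch on hypothetical tables (all values invented), using the `indicator` option of `merge`:

```python
import pandas as pd

# Hypothetical team-game rows: team 10 plays a doubleheader, team 20 plays once.
toy_team_games = pd.DataFrame({
    "dailyDataDate": pd.to_datetime(["2021-04-01"] * 3),
    "gamePk": [1, 2, 3],
    "teamId": [10, 10, 20],
    "teamWinner": [True, False, True],
    "teamScore": [5, 2, 4],
})
# Hypothetical box scores: game 3 has none (e.g. it was postponed).
toy_box = pd.DataFrame({
    "dailyDataDate": pd.to_datetime(["2021-04-01"] * 2),
    "gamePk": [1, 2],
    "teamId": [10, 10],
    "hitsTeam": [9, 6],
})

keys = ["dailyDataDate", "gamePk", "teamId"]
checked = toy_team_games.merge(toy_box, on=keys, how="left", indicator=True)
dropped = checked[checked["_merge"] == "left_only"]   # rows an inner join would lose
kept = checked[checked["_merge"] == "both"]           # rows an inner join would keep

daily = kept.groupby(["dailyDataDate", "teamId"], as_index=False).agg(
    numGamesTeam=("gamePk", "nunique"),
    winsTeam=("teamWinner", "sum"),
    runsScoredTeam=("teamScore", "sum"),
)
print(dropped[keys])
print(daily)
```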

## *3.4 Feature Extraction for `Standings`*
In this part, only certain features of interest are picked from the `standings` table.
### 3.4.1 Features related to `streakCode`
* Length of the current winning streak
* Length of the current losing streak

In [None]:
# Select features of interest from 'standings' table
standings_selected_fields = (
    standings[['dailyDataDate', 'teamId', 'streakCode', 
               'divisionRank', 'leagueRank', 'wildCardRank', 
               'pct']].rename(columns = {'pct': 'winPct'})
)

# Add suffix 'Team' to column names to reflect these features are at team level,
# helping differentiate from those at player level when joining
standings_selected_fields.columns = [
    (col_value + 'Team') 
    if (col_value not in ['dailyDataDate', 'teamId'])
    else col_value for 
    col_value in standings_selected_fields.columns.values
]

# Process the streak (win/lose) information
# Add fields to separate winning and losing streak from streak code
standings_selected_fields['streakLengthTeam'] = (
    standings_selected_fields['streakCodeTeam'].
    str.replace('W', '').
    str.replace('L', '').
    astype(float)
)   # Extract magnitude of streak

# Process scenario of winning 
standings_selected_fields['winStreakTeam'] = np.where(
    standings_selected_fields['streakCodeTeam'].str[0] == 'W',
    standings_selected_fields['streakLengthTeam'],
    np.nan
)

# Process scenario of losing 
standings_selected_fields['lossStreakTeam'] = np.where(
    standings_selected_fields['streakCodeTeam'].str[0] == 'L',
    standings_selected_fields['streakLengthTeam'],
    np.nan
)

standings_for_digital_engagement_merge = (
    pd.merge(
        standings_selected_fields,
        dates_with_seasons[['dailyDataDate', 'inSeason']],
        on=['dailyDataDate'],
        how='left'
    ).
    # Limit down standings to only in season version
    query("inSeason").
    # Drop fields (features) no longer necessary
    drop(['streakCodeTeam', 'streakLengthTeam', 'inSeason'], axis=1).
    reset_index(drop=True)
)

print("=====DataFrame: standings_for_digital_engagement_merge=====")
display(standings_for_digital_engagement_merge.head())
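
As an alternative way to split the streak code, the sketch below uses a regular expression with `str.extract` on a few hypothetical codes. It assumes that every code is a single `W` or `L` followed by digits, which matches how the cell above treats the field.

```python
import numpy as np
import pandas as pd

# Hypothetical streak codes, including a missing value.
codes = pd.Series(["W3", "L2", "W10", None], name="streakCode")

# Capture the streak type and its length in one pass.
parts = codes.str.extract(r"^(?P<kind>[WL])(?P<length>\d+)$")
length = parts["length"].astype(float)

streaks = pd.DataFrame({
    "winStreakTeam": length.where(parts["kind"] == "W"),    # NaN unless a winning streak
    "lossStreakTeam": length.where(parts["kind"] == "L"),   # NaN unless a losing streak
})
print(streaks)
```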

<div class="alert alert-blocks alert-info" style="text-align: center">
    <h3>Work in Progress...</h3>
    <h3>More Chinese translations and explanations coming soon~<i class="fas fa-baseball-ball"></i>~<i class="fas fa-baseball-ball"></i></h3>
    <h3>Thanks for your attention!!</h3>
</div>



