<a id="top"></a>

<h2 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">MLB Player Digital Engagement Forecasting</h2>

<img src='https://media.istockphoto.com/vectors/several-baseball-players-in-different-positions-vector-id134085440?k=6&m=134085440&s=612x612&w=0&h=5_Qznbnajho0p04xvUVcVkb9vLPAs_TlGspOqnkI-d8='></img>

* [0. Players](#0)
* [1. Seasons](#1)
* [2. Teams](#2)
* [3. Train](#3)
    * [3.1 nextDayPlayerEngagement](#3.1)
        * [3.1.1 PlayerId[628317] Example](#3.1.1)
        * [3.1.2 Targets Vs primaryPosition](#3.1.2)
        * [3.1.3 Targets Vs BMI, Height, Weight, Age](#3.1.3)
    * [3.2 rosters](#3.2)
        * [3.2.1 isActive Feature](#3.2.1)
        * [3.2.2 Illness Feature](#3.2.2)
        * [3.2.3 Bereavement Feature](#3.2.3)
        * [3.2.4 Deceased Feature](#3.2.4)
        * [3.2.5 Family Medical Emergency Feature](#3.2.5)
        * [3.2.6 Paternity & Paternity List Feature](#3.2.6)
        * [3.2.7 Reassigned to Major Features](#3.2.7)
        * [3.2.8 Reassigned  to Minor Features](#3.2.8)
        * [3.2.9 Reserve List (Minors) Features](#3.2.9)
        * [3.2.10 Suspended Features](#3.2.10)

In [None]:
!pip install -q jupyter-dash

import os
import json
import datetime
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import dash_core_components as dcc
import dash_html_components as html


from jupyter_dash import JupyterDash
from dash.dependencies import Input, Output
from dateutil import parser
from tqdm.notebook import tqdm

<a id="0"></a>
## 0. Players

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Baseball_positions.svg/1200px-Baseball_positions.svg.png'></img>

* `playerId` - Unique identifier for a player.
* `playerName`
* `DOB` - Player’s date of birth.
* `mlbDebutDate`
* `birthCity`
* `birthStateProvince`
* `birthCountry`
* `heightInches`
* `weight`
* `primaryPositionCode` - Player’s primary position code, details are [here](https://statsapi.mlb.com/api/v1/positions).
* `primaryPositionName` - Player’s primary position, details are [here](https://statsapi.mlb.com/api/v1/positions).
* `playerForTestSetAndFuturePreds` - Boolean, true if player is among those for whom predictions are to be made in test data

In [None]:
players_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/players.csv')

In [None]:
players_df.head()

In [None]:
players_df['density'] = 1 / len(players_df)
players_df = players_df.fillna('NaN')

fig = px.sunburst(players_df, path=['playerForTestSetAndFuturePreds', 'primaryPositionName', 'birthCountry', 'birthStateProvince', 'birthCity'], values='density')
fig.show()

So about 57.5% of the players are used at the test stage. Most players are pitchers, and California (USA) is the most common birthplace.
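
A quick way to double-check figures like these is to compute the shares directly instead of reading them off the chart. The sketch below uses a made-up five-row frame (`toy_players` and its values are illustrative, not taken from `players.csv`); only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for players.csv (values are invented for illustration).
toy_players = pd.DataFrame({
    'playerId': [1, 2, 3, 4, 5],
    'playerForTestSetAndFuturePreds': [True, True, False, True, None],
    'primaryPositionName': ['Pitcher', 'Pitcher', 'Catcher', 'Shortstop', 'Pitcher'],
})

# Share of players flagged for the test set; a missing flag counts as "not flagged".
test_share = (toy_players['playerForTestSetAndFuturePreds'] == True).mean()

# Most frequent primary position.
top_position = toy_players['primaryPositionName'].value_counts().idxmax()

print(test_share, top_position)
```

On the real data the same two lines would be run against `players_df` before the `fillna('NaN')` call, so that missing flags are still real missing values.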

In [None]:
# Take an explicit copy so the new columns below are not written to a view of players_df.
_players_df = players_df.loc[players_df.playerForTestSetAndFuturePreds != 'NaN', :].copy()
_players_df["heightInches / 100"] = _players_df["heightInches"] / 100

fig = px.violin(_players_df, y="heightInches / 100", x="playerForTestSetAndFuturePreds", color="primaryPositionName", box=False)
fig.show()

The tallest players are at these positions:
* outfielder
* first base
* pitcher

In [None]:
_players_df["weight / 100"] = _players_df["weight"] / 100
fig = px.violin(_players_df, y="weight / 100", x="playerForTestSetAndFuturePreds", color="primaryPositionName", box=False,)
fig.show()

The heaviest players are at these positions:
* catcher
* first base
* pitcher

Let's calculate [BMI](https://en.wikipedia.org/wiki/Body_mass_index) from weight in pounds and height in inches:
$$ BMI = \frac{weight}{height^2} \times 703 $$
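
As a sanity check of the formula, here is a minimal sketch that evaluates it for a single player. The helper name `bmi_imperial` and the sample weight and height are assumptions made for illustration; the factor 703 is what converts pounds and inches to the usual kg/m² scale.

```python
def bmi_imperial(weight_lb: float, height_in: float) -> float:
    """BMI from weight in pounds and height in inches."""
    return 703.0 * weight_lb / height_in ** 2

# A 200 lb player who is 72 inches (6 ft) tall.
print(round(bmi_imperial(200, 72), 1))
```

Note that the denominator is the height squared, not the weight.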

In [None]:
# BMI divides weight (lb) by height (in) squared; 703 rescales the result to kg/m^2.
_players_df["BMI"] = _players_df["weight"] / (_players_df["heightInches"] * _players_df["heightInches"]) * 703
fig = px.violin(_players_df, y="BMI", x="playerForTestSetAndFuturePreds", color="primaryPositionName", box=False,)
fig.show()

Positions with the lowest BMI:
* Designated Hitter
* First Base

Positions with the highest BMI:
* Shortstop

In [None]:
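def calculateAge(birthDate):
    """Return the age in whole years, as of today, for a date-of-birth string."""
    birthDate = parser.parse(birthDate)
    today = datetime.date.today()
    # The tuple comparison is True (counted as 1) when this year's birthday has not happened yet.
    age = (today.year - birthDate.year -
         ((today.month, today.day) <
         (birthDate.month, birthDate.day)))
    return age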

In [None]:
_players_df['age'] = _players_df['DOB'].apply(lambda x: calculateAge(x))
_players_df['age / 100'] = _players_df['age']/100
fig = px.violin(_players_df, y="age / 100", x="playerForTestSetAndFuturePreds", color="primaryPositionName", box=False, points='all')
fig.show()

It looks like the players in the test group are younger. Let's calculate the average age of the two groups.

In [None]:
print('Average age of players in the test group:', _players_df.loc[_players_df.playerForTestSetAndFuturePreds == True, 'age'].mean())
print('Average age of players in the train-only group:', _players_df.loc[_players_df.playerForTestSetAndFuturePreds == False, 'age'].mean())

In [None]:
fig = px.scatter(_players_df, x="age", y="BMI", color="playerForTestSetAndFuturePreds", opacity=0.25, trendline='ols')
fig.show()

Player BMI decreases with age.

<a id="1"></a>
## 1. Seasons
* `seasonId`
* `seasonStartDate`
* `seasonEndDate`
* `preSeasonStartDate`
* `preSeasonEndDate`
* `regularSeasonStartDate`
* `regularSeasonEndDate`
* `lastDate1stHalf`
* `allStarDate`
* `firstDate2ndHalf`
* `postSeasonStartDate`
* `postSeasonEndDate`
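
These fields make it possible to label any calendar date with its phase of the season. A minimal sketch, assuming the date columns have been parsed into `datetime.date` values; `season_phase` and `toy_season` are hypothetical names, and the dates in `toy_season` are placeholders rather than real MLB dates.

```python
from datetime import date

# Placeholder season record that reuses the field names listed above.
toy_season = {
    'preSeasonStartDate': date(2021, 2, 28), 'preSeasonEndDate': date(2021, 3, 30),
    'regularSeasonStartDate': date(2021, 4, 1), 'regularSeasonEndDate': date(2021, 10, 3),
    'postSeasonStartDate': date(2021, 10, 5), 'postSeasonEndDate': date(2021, 11, 2),
}

def season_phase(day, season):
    """Return the phase whose start/end interval contains `day`, else 'offSeason'."""
    for phase in ('preSeason', 'regularSeason', 'postSeason'):
        if season[f'{phase}StartDate'] <= day <= season[f'{phase}EndDate']:
            return phase
    return 'offSeason'

print(season_phase(date(2021, 7, 4), toy_season))
print(season_phase(date(2021, 12, 25), toy_season))
```

A label like this could serve as a categorical feature, since engagement is likely to differ between the regular season and the off-season.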

In [None]:
seasons_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/seasons.csv')

In [None]:
seasons_df

In [None]:
start = []
end = []
ttype = []

for i, row in seasons_df.iterrows():
    start.append(row.seasonStartDate)
    end.append(row.seasonEndDate)
    ttype.append('Season')
    
    start.append(row.preSeasonStartDate)
    end.append(row.preSeasonEndDate)
    ttype.append('Pre Season')   
    
    start.append(row.regularSeasonStartDate)
    end.append(row.regularSeasonEndDate)
    ttype.append('Regular Season')
    
    start.append(row.postSeasonStartDate)
    end.append(row.postSeasonEndDate)
    ttype.append('Post Season')
    
season_df_timeline = pd.DataFrame({'Start': start, 'End': end, 'Type': ttype})

In [None]:
fig = px.timeline(season_df_timeline, x_start="Start", x_end="End", y="Type", color="Type")

for i, row in seasons_df.iterrows():
    fig.add_shape(type='line',
                yref="y",
                xref="x",
                x0=row.lastDate1stHalf,
                x1=row.lastDate1stHalf,
                y0=-1,
                y1=4,
                line=dict(color='green', width=1))
    fig.add_annotation(
                x=row.lastDate1stHalf,
                y=1.06,
                yref='paper',
                showarrow=False,
                text=f'lastDate1stHalf {row.seasonId}')
    
    fig.add_shape(type='line',
                yref="y",
                xref="x",
                x0=row.firstDate2ndHalf,
                x1=row.firstDate2ndHalf,
                y0=-1,
                y1=4,
                line=dict(color='red', width=1))
    fig.add_annotation(
                x=row.firstDate2ndHalf,
                y=-0.12,
                yref='paper',
                showarrow=False,
                text=f'firstDate2ndHalf {row.seasonId}')
    
    if isinstance(row.allStarDate, str):
        fig.add_shape(type='line',
                yref="y",
                xref="x",
                x0=row.allStarDate,
                x1=row.allStarDate,
                y0=-1,
                y1=4,
                line=dict(color='blue', width=1))
        fig.add_annotation(
                x=row.allStarDate,
                y=1.10,
                yref='paper',
                showarrow=False,
                textangle = 0, 
                text=f'allStarDate {row.seasonId}')
fig.update_yaxes(autorange="reversed")
fig.show()

<a id="2"></a>
## 2. Teams
* `id` - teamId
* `name`
* `teamName`
* `teamCode`
* `shortName`
* `abbreviation`
* `locationName`
* `leagueId`
* `leagueName`
* `divisionId`
* `divisionName`
* `venueId`
* `venueName`

In [None]:
teams_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/teams.csv')

In [None]:
teams_df.head()

In [None]:
teams_df['density'] = 1 / len(teams_df)
teams_df = teams_df.fillna('NaN')

fig = px.sunburst(teams_df, path=['leagueName', 'divisionName', 'name'], values='density')
fig.show()

<a id="3"></a>
## 3. Train

This contains data on MLB players active at some point since 2018. Predictions are only scored for those players active in 2021 (see above), but previous seasons’ players are included here to provide more data for exploration and modeling purposes.

^ indicates that a more complete walkthrough is below.

   * `date` - Integer formatted date, which is a primary index of the CSV.
   * `nextDayPlayerEngagement^` - Nested JSON containing all modeling targets from the following day.
   * `games^` - Nested JSON containing all game information for a given day. Includes spring training and exhibition games along with regular season, Postseason, and All-Star games.
   * `rosters^` - Nested JSON containing all roster information for a given day. Includes in-season and offseason team rosters.
   * `playerBoxScores^` - Nested JSON containing game stats aggregated at the player game level for a given day. Includes regular season, Postseason, and All-Star games.
   * `teamBoxScores^` - Nested JSON containing game stats aggregated at the team game level for a given day. Includes regular season, Postseason, and All-Star games.
   * `transactions^` - Nested JSON containing all transaction information involving MLB teams for a given day.
   * `standings^` - Nested JSON containing all standings information involving MLB teams for a given day.
   * `awards^` - Nested JSON containing all awards or honors handed out on a given day.
   * `events^` - Nested JSON containing all on-field game events for a given day. Includes regular season and Postseason games.
   * `playerTwitterFollowers^` - Nested JSON containing some players’ number of Twitter followers on that day.
   * `teamTwitterFollowers^` - Nested JSON containing each team’s number of Twitter followers on that day.
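
Each of the nested columns holds one JSON string per day. A minimal sketch of unpacking such a column into flat records follows. It assumes that a day without data holds a missing value rather than a string; `unpack_json_cell` and `toy_rows` are hypothetical names, and the rows are made up.

```python
import json

def unpack_json_cell(cell):
    """Return the list of records stored in one nested-JSON cell, or [] if the cell is empty."""
    if not isinstance(cell, str):
        # Missing cells (for example NaN) carry no records.
        return []
    return json.loads(cell)

# Two made-up days: one with roster records, one with a missing cell.
toy_rows = [
    {'date': 20180329,
     'rosters': '[{"playerId": 1, "status": "Active"}, {"playerId": 2, "status": "10-day IL"}]'},
    {'date': 20180330, 'rosters': float('nan')},
]

# Flatten to one record per player and day, carrying the row's date along.
records = [dict(rec, date=row['date'])
           for row in toy_rows
           for rec in unpack_json_cell(row['rosters'])]
print(records)
```

The resulting list can be passed to `pd.DataFrame.from_records`, as the cells below do for `nextDayPlayerEngagement`.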


In [None]:
train_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/train.csv')

In [None]:
train_df.head()

<a id="3.1"></a>
## 3.1 nextDayPlayerEngagement
   * `engagementMetricsDate` - date of player engagement metrics, based on US Pacific Time (aligns with previous day’s games, rosters, on-field statistics, transactions, awards, etc.).
   * `playerId`
   * `target1`
   * `target2`
   * `target3`
   * `target4`

target1-target4 are each daily indexes of digital engagement on a 0-100 scale.

<a id="3.1.1"></a>
### 3.1.1 PlayerId[628317] Example

In [None]:
records = []
for nextDayPlayerEngagement in train_df.nextDayPlayerEngagement.values:
    records.extend(filter(lambda x:  x['playerId'] == 628317, json.loads(nextDayPlayerEngagement)))
playerTarget = pd.DataFrame.from_records(records)

In [None]:
fig = px.line(playerTarget, x="engagementMetricsDate", y=["target1", "target2", "target3", "target4"])
fig.update_layout(
    title={
        'text': "PlayerId: 628317",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name in ['target2', 'target3', 'target4'] else ())
fig.show()

<a id="3.1.2"></a>
### 3.1.2 Targets Vs primaryPosition

In [None]:
# Parse each day's JSON once and reuse it for every position.
dailyEngagement = [json.loads(e) for e in tqdm(train_df.nextDayPlayerEngagement.values)]

records = []
for pN in tqdm(players_df.primaryPositionName.unique()):
    # A set gives constant-time membership tests.
    pids = set(players_df.loc[players_df.primaryPositionName == pN, 'playerId'].tolist())
    for engagement in dailyEngagement:
        filtered = [x for x in engagement if x['playerId'] in pids]
        if not filtered:
            # Skip days on which no player at this position has targets.
            continue
        records.append({
            'engagementMetricsDate': filtered[0]['engagementMetricsDate'],
            'target1': np.mean([f['target1'] for f in filtered]),
            'target2': np.mean([f['target2'] for f in filtered]),
            'target3': np.mean([f['target3'] for f in filtered]),
            'target4': np.mean([f['target4'] for f in filtered]),
            'primaryPositionName': pN
        })

In [None]:
targetStatByPosition = pd.DataFrame.from_records(records)

In [None]:
fig = px.line(targetStatByPosition, x="engagementMetricsDate", y="target1", color='primaryPositionName')
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['Pitcher'] else ())
fig.show()

In [None]:
fig = px.violin(targetStatByPosition, y="target1", x="primaryPositionName", color="primaryPositionName", box=False)
fig.show()

In [None]:
fig = px.line(targetStatByPosition, x="engagementMetricsDate", y="target2", color='primaryPositionName')
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['Pitcher'] else ())
fig.show()

In [None]:
fig = px.violin(targetStatByPosition, y="target2", x="primaryPositionName", color="primaryPositionName", box=False)
fig.show()

In [None]:
fig = px.line(targetStatByPosition, x="engagementMetricsDate", y="target3", color='primaryPositionName')
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['Pitcher'] else ())
fig.show()

In [None]:
fig = px.violin(targetStatByPosition, y="target3", x="primaryPositionName", color="primaryPositionName", box=False)
fig.show()

In [None]:
fig = px.line(targetStatByPosition, x="engagementMetricsDate", y="target4", color='primaryPositionName')
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['Pitcher'] else ())
fig.show()

In [None]:
fig = px.violin(targetStatByPosition, y="target4", x="primaryPositionName", color="primaryPositionName", box=False)
fig.show()

So `primaryPositionName` is a categorical feature whose relationship to the target distributions is nonlinear.
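
The per-position daily averages used in this section can also be computed with a merge and a `groupby` instead of nested loops. This is a sketch on made-up data: `toy_train`, `toy_players`, `long_targets` and `by_position` are hypothetical names, and only `target1` is shown.

```python
import json
import pandas as pd

# One made-up day of nextDayPlayerEngagement for three players.
toy_train = pd.DataFrame({'nextDayPlayerEngagement': [json.dumps([
    {'engagementMetricsDate': '2018-01-02', 'playerId': 1, 'target1': 2.0},
    {'engagementMetricsDate': '2018-01-02', 'playerId': 2, 'target1': 4.0},
    {'engagementMetricsDate': '2018-01-02', 'playerId': 3, 'target1': 10.0},
])]})
toy_players = pd.DataFrame({
    'playerId': [1, 2, 3],
    'primaryPositionName': ['Pitcher', 'Pitcher', 'Catcher'],
})

# Flatten the nested JSON into one row per player and day.
long_targets = pd.DataFrame(
    [rec for cell in toy_train['nextDayPlayerEngagement'] for rec in json.loads(cell)]
)

# Attach each player's position, then average per day and position.
by_position = (
    long_targets.merge(toy_players, on='playerId', how='inner')
    .groupby(['engagementMetricsDate', 'primaryPositionName'], as_index=False)['target1']
    .mean()
)
print(by_position)
```

The long format also makes it easy to pass several target columns to `.mean()` at once.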

<a id="3.1.3"></a>
### 3.1.3 Targets Vs BMI, Height, Weight, Age

In [None]:
playerStat = {}
for nextDayPlayerEngagement in tqdm(train_df.nextDayPlayerEngagement.values, total=len(train_df.nextDayPlayerEngagement.values)):
    nextDayPlayerEngagement = json.loads(nextDayPlayerEngagement)
    for player in nextDayPlayerEngagement:
        if player['playerId'] in playerStat:
            playerStat[player['playerId']] += np.array([
                    float(player['target1']), float(player['target2']), float(player['target3']), float(player['target4']),
                    1., 1., 1., 1.
            ])
        else:
            playerStat[player['playerId']] = np.array([
                    float(player['target1']), float(player['target2']), float(player['target3']), float(player['target4']),
                    1., 1., 1., 1.
            ])
            
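for i in range(1, 5):
    # Float default so the per-player means can be written without a dtype change.
    _players_df[f'target{i}Mean'] = 0.0
    
for pid, v in playerStat.items():
    # Build the mask from _players_df itself so it matches the frame being assigned to.
    _players_df.loc[_players_df.playerId == pid, ['target1Mean', 'target2Mean', 'target3Mean', 'target4Mean']] = np.array([
        v[0 + i]/v[4 + i] for i in range(4)
    ])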

In [None]:
fig = px.scatter(_players_df, x="BMI", y=['target1Mean', 'target2Mean', 'target3Mean', 'target4Mean'], opacity=0.25, trendline='ols')
fig.show()

In [None]:
fig = px.scatter(_players_df, x="heightInches", y=['target1Mean', 'target2Mean', 'target3Mean', 'target4Mean'], opacity=0.25, trendline='ols')
fig.show()

In [None]:
fig = px.scatter(_players_df, x="weight", y=['target1Mean', 'target2Mean', 'target3Mean', 'target4Mean'], opacity=0.25, trendline='ols')
fig.show()

In [None]:
fig = px.scatter(_players_df, x="age", y=['target1Mean', 'target2Mean', 'target3Mean', 'target4Mean'], opacity=0.25, trendline='ols')
fig.show()

The targets trend upward with age, weight and height, and downward with BMI.

<a id="3.2"></a>
## 3.2 rosters
   * `playerId` - Unique identifier for a player.
   * `gameDate`
   * `teamId` - ID of the team the player is on at that date.
   * `statusCode` - Roster status abbreviation.
   * `status` - Descriptive roster status.


<a id="3.2.1"></a>
### 3.2.1 isActive Feature

In [None]:
playerActivity = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
playerTarget = {pid: np.zeros((len(train_df), 4)) for pid in players_df.playerId}

for i, nextDayPlayerEngagement in tqdm(enumerate(train_df.nextDayPlayerEngagement.values)):
    nextDayPlayerEngagement = json.loads(nextDayPlayerEngagement)
    for ndpe in nextDayPlayerEngagement:
        if ndpe['playerId'] in playerTarget:
            playerTarget[ndpe['playerId']][i] = [ndpe[f'target{j}'] for j in range(1, 5)]

status = set()
for i, roster in tqdm(enumerate(train_df.rosters)):
    for r in json.loads(roster):
        if r['playerId'] in playerActivity:
            status.add(r['status'])
            playerActivity[r['playerId']][i] += int(r['status'] == 'Active')
train_df.date = train_df.date.apply(lambda x: parser.parse(str(x)))

In [None]:
pid = 628317
_ex = pd.DataFrame({'date': train_df.date, 'Activity': playerActivity[pid] * 100, 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['Activity', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['Activity', 'Target1'] else ())
fig.update_layout(
    title={
        'text': "PlayerId: 628317",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
onActive = np.zeros((4,))
onInActive = np.zeros((4,))
onActiveCount = 0
onInActiveCount = 0
for pid in playerTarget:
    target = playerTarget[pid]
    activity = playerActivity[pid]
    
    active = target[activity > 0]
    inactive = target[activity == 0]
        
    onActiveCount += 1 if len(active) > 0 else 0
    onInActiveCount += 1 if len(inactive) > 0 else 0
    for j in range(0, 4):
        onActive[j] += np.mean(active[:, j]) if len(active) > 0 else 0
        onInActive[j] += np.mean(inactive[:, j]) if len(inactive) > 0 else 0
onActive /= onActiveCount
onInActive /= onInActiveCount

In [None]:
_ex = pd.DataFrame({'isActive': [True, False, True, False, True, False, True, False], 
                    'targetMean': [onActive[0], onInActive[0], onActive[1], onInActive[1], onActive[2], onInActive[2], onActive[3], onInActive[3]],
                    'targetType': ['target1', 'target1', 'target2', 'target2', 'target3', 'target3', 'target4', 'target4']
                   })
fig = px.bar(_ex, x='targetType', y='targetMean', color='isActive', barmode='group')
fig.show()

So on days when a player is on the active roster, the target values (`target1`, `target2`, `target4`) are higher than on days when the player is inactive.
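
The comparison above averages each player's own mean and then averages over players, so every player counts equally however many active days they have. Pooling all player-days is a different statistic and can give a different picture. The sketch below contrasts the two on made-up arrays; both function names are hypothetical.

```python
import numpy as np

def mean_of_player_means(targets_by_player, flags_by_player):
    """Average each player's flagged-day mean, then average over players with flagged days."""
    per_player = [t[f > 0].mean()
                  for t, f in zip(targets_by_player, flags_by_player)
                  if (f > 0).any()]
    return float(np.mean(per_player)) if per_player else float('nan')

def pooled_mean(targets_by_player, flags_by_player):
    """Average over all flagged player-days at once."""
    flagged = np.concatenate([t[f > 0] for t, f in zip(targets_by_player, flags_by_player)])
    return float(flagged.mean()) if flagged.size else float('nan')

# Player A is active on four quiet days; player B is active on one very busy day.
targets = [np.array([10., 10., 10., 10.]), np.array([50., 0., 0., 0.])]
active = [np.array([1, 1, 1, 1]), np.array([1, 0, 0, 0])]

print(mean_of_player_means(targets, active))  # each player weighted equally
print(pooled_mean(targets, active))           # each player-day weighted equally
```

Here the player-weighted value is 30 and the pooled value is 18, so it is worth stating which of the two a chart shows.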

<a id="3.2.2"></a>
### 3.2.2 Illness Feature

In [None]:
playerIL = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
lastDate = None
for i, roster in tqdm(enumerate(train_df.rosters)):
    updatedPid = set()
    for r in json.loads(roster):
        if r['playerId'] in playerIL:
            if r['status'] == '10-day IL':
                playerIL[r['playerId']][i] += 10
                updatedPid.add(r['playerId'])
            elif r['status'] == '60-day IL':
                playerIL[r['playerId']][i] += 60
                updatedPid.add(r['playerId'])
                
    if lastDate is not None:
        # Days elapsed since the previous row, so a gap in the dates shortens the countdown accordingly.
        day = (train_df.date[i] - lastDate).days
        for pid in playerIL:
            if pid not in updatedPid:
                playerIL[pid][i] += max(0, playerIL[pid][i-1] - day)
    lastDate = train_df.date[i]

In [None]:
pid = 622554
_ex = pd.DataFrame({'date': train_df.date, 'IL Days': playerIL[pid], 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['IL Days', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['IL Days', 'Target1'] else ())
fig.update_layout(
    title={
        'text': f"PlayerId: {pid}",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
illness = np.concatenate([v for v in playerIL.values()], axis=0)
targets = np.concatenate([v[:, 0] for v in playerTarget.values()], axis=0)
bins = np.array(list(range(0, 100)))
value = np.array([np.sum(illness[(targets < b + 1) & (targets >= b)]) for b in bins])/np.sum(illness)
fig = px.bar(x=bins, y=value, color=value, labels={'x':'target1', 'y':'illness'})
fig.show()

In [None]:
targets = np.concatenate([v[:, 1] for v in playerTarget.values()], axis=0)
value = np.array([np.sum(illness[(targets < b + 1) & (targets >= b)]) for b in bins])/np.sum(illness)
fig = px.bar(x=bins, y=value, color=value, labels={'x':'target2', 'y':'illness'})
fig.show()

In [None]:
targets = np.concatenate([v[:, 2] for v in playerTarget.values()], axis=0)
value = np.array([np.sum(illness[(targets < b + 1) & (targets >= b)]) for b in bins])/np.sum(illness)
fig = px.bar(x=bins, y=value, color=value, labels={'x':'target3', 'y':'illness'})
fig.show()

In [None]:
targets = np.concatenate([v[:, 3] for v in playerTarget.values()], axis=0)
value = np.array([np.sum(illness[(targets < b + 1) & (targets >= b)]) for b in bins])/np.sum(illness)
fig = px.bar(x=bins, y=value, color=value, labels={'x':'target4', 'y':'illness'})
fig.show()

This suggests that a player on the IL (10-day or 60-day) has low target values, which is logical.
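
The per-bin shares plotted above can also be computed with `np.histogram` and its `weights` argument. The sketch below checks this against the loop formulation on made-up arrays (`il_days` and `target1` are illustrative values, not competition data).

```python
import numpy as np

# Made-up player-days: remaining IL days and the target1 value on the same day.
il_days = np.array([10., 9., 0., 60., 0.])
target1 = np.array([0.5, 0.2, 40.0, 1.5, 99.5])

# 100 unit-wide bins; each sample adds its IL days to the bin its target falls into.
edges = np.arange(0, 101)
weighted, _ = np.histogram(target1, bins=edges, weights=il_days)
share = weighted / il_days.sum()

# Loop formulation with half-open bins [b, b + 1).
loop_share = np.array([
    il_days[(target1 >= b) & (target1 < b + 1)].sum() for b in range(100)
]) / il_days.sum()

print(np.allclose(share, loop_share))
```

One difference to keep in mind: `np.histogram` closes the last bin on the right, so a target of exactly 100 lands in bin 99, whereas the half-open loop leaves it out.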

<a id="3.2.3"></a>
### 3.2.3 Bereavement Feature

In [None]:
playerBereavement = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
for i, roster in tqdm(enumerate(train_df.rosters)):
    for r in json.loads(roster):
        if r['playerId'] in playerBereavement:
            playerBereavement[r['playerId']][i] += int(r['status'] == 'Bereavement List')

In [None]:
onBereavement = np.zeros((4,))
onUnbereavement = np.zeros((4,))
onBereavementCount = 0
onUnBereavementCount = 0
for pid in playerTarget:
    target = playerTarget[pid]
    bereavement = playerBereavement[pid]
    
    tunbereavement = target[bereavement == 0]
    tbereavement = target[bereavement > 0]
    
    onBereavementCount += 1 if len(tbereavement) > 0 else 0
    onUnBereavementCount += 1 if len(tunbereavement) > 0 else 0
    for j in range(0, 4):
        onBereavement[j] += np.mean(tbereavement[:, j]) if len(tbereavement) > 0 else 0
        onUnbereavement[j] += np.mean(tunbereavement[:, j]) if len(tunbereavement) > 0 else 0
onBereavement /= onBereavementCount
onUnbereavement /= onUnBereavementCount

_ex = pd.DataFrame({'isBereavement': [True, False, True, False, True, False, True, False], 
                    'targetMean': [onBereavement[0], onUnbereavement[0], onBereavement[1], onUnbereavement[1], onBereavement[2], onUnbereavement[2], onBereavement[3], onUnbereavement[3]],
                    'targetType': ['target1', 'target1', 'target2', 'target2', 'target3', 'target3', 'target4', 'target4']
                   })
fig = px.bar(_ex, x='targetType', y='targetMean', color='isBereavement', barmode='group')
fig.show()

Hm... Some target values are higher for players on the bereavement list than for players who are not.
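
Before reading much into a comparison like this, it may help to check how many players and player-days actually carry the flag, since a mean over a handful of days is noisy. A minimal sketch with a made-up flag dictionary shaped like `playerBereavement` (the IDs and values are invented):

```python
import numpy as np

# Made-up flags: one array per player, one entry per day.
flags_by_player = {
    101: np.array([0, 0, 1, 1, 0]),
    102: np.array([0, 0, 0, 0, 0]),
    103: np.array([1, 0, 0, 0, 0]),
}

# Number of players with at least one flagged day, and total flagged player-days.
players_with_flag = sum(int((f > 0).any()) for f in flags_by_player.values())
flagged_days = int(sum((f > 0).sum() for f in flags_by_player.values()))

print(players_with_flag, flagged_days)
```

Reporting these two counts next to the bar chart shows how much data stands behind the flagged bars.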

<a id="3.2.4"></a>
### 3.2.4 Deceased Feature

In [None]:
playerDeceased = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
for i, roster in tqdm(enumerate(train_df.rosters)):
    for r in json.loads(roster):
        if r['playerId'] in playerDeceased:
            if r['status'] == 'Deceased':
                # Flag this day and all later days; assignment keeps the flag at 1 even if the status repeats.
                playerDeceased[r['playerId']][i:] = 1
                print('Deceased Player:', r['playerId'], players_df.loc[players_df.playerId == r['playerId'], 'playerName'].values[0])

In [None]:
pid = 572140
_ex = pd.DataFrame({'date': train_df.date, 'isDeceased': playerDeceased[pid] * 100, 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['isDeceased', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['isDeceased', 'Target1'] else ())
fig.update_layout(
    title={
        'text': f"PlayerId: {pid}; Player Name: {players_df.loc[players_df.playerId == pid, 'playerName'].values[0]}",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

Wow! What do target1, target2, target3 and target4 actually measure?

[Tyler Wayne Skaggs](https://en.wikipedia.org/wiki/Tyler_Skaggs) (July 13, 1991 – July 1, 2019) was an American left-handed professional baseball starting pitcher who played in Major League Baseball (MLB) for the Arizona Diamondbacks and Los Angeles Angels of Anaheim.

In [None]:
onDeceased = np.zeros((4,))
onUnDeceased = np.zeros((4,))
onDeceasedCount = 0
onUnDeceasedCount = 0
for pid in playerTarget:
    target = playerTarget[pid]
    deceased = playerDeceased[pid]
    tundeceased = target[deceased == 0]
    tdeceased = target[deceased > 0]
    
    onDeceasedCount += 1 if len(tdeceased) > 0 else 0
    onUnDeceasedCount += 1 if len(tundeceased) > 0 else 0
    for j in range(0, 4):
        onDeceased[j] += np.mean(tdeceased[:, j]) if len(tdeceased) > 0 else 0
        onUnDeceased[j] += np.mean(tundeceased[:, j]) if len(tundeceased) > 0 else 0
onDeceased /= onDeceasedCount
onUnDeceased /= onUnDeceasedCount

_ex = pd.DataFrame({'isDeceased': [True, False, True, False, True, False, True, False], 
                    'targetMean': [onDeceased[0], onUnDeceased[0], onDeceased[1], onUnDeceased[1], onDeceased[2], onUnDeceased[2], onDeceased[3], onUnDeceased[3]],
                    'targetType': ['target1', 'target1', 'target2', 'target2', 'target3', 'target3', 'target4', 'target4']
                   })
fig = px.bar(_ex, x='targetType', y='targetMean', color='isDeceased', barmode='group')
fig.show()

<a id="3.2.5"></a>
### 3.2.5 Injured Feature

In [None]:
playerInjured = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
lastDate = None
for i, roster in tqdm(enumerate(train_df.rosters)):
    updatedPid = set()
    for r in json.loads(roster):
        if r['playerId'] in playerInjured:
            if r['status'] == 'Injured 7-Day':
                playerInjured[r['playerId']][i] += 7
                updatedPid.add(r['playerId'])
            elif r['status'] == 'Injured 10-Day':
                playerInjured[r['playerId']][i] += 10
                updatedPid.add(r['playerId'])
            elif r['status'] == 'Injured 60-Day':
                playerInjured[r['playerId']][i] += 60
                updatedPid.add(r['playerId'])
                
    if lastDate is not None:
        # Days elapsed since the previous row, so a gap in the dates shortens the countdown accordingly.
        day = (train_df.date[i] - lastDate).days
        for pid in playerInjured:
            if pid not in updatedPid:
                playerInjured[pid][i] += max(0, playerInjured[pid][i-1] - day)
    lastDate = train_df.date[i]

In [None]:
pid = 640449
_ex = pd.DataFrame({'date': train_df.date, 'Injured Days': playerInjured[pid], 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['Injured Days', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['Injured Days', 'Target1'] else ())
fig.update_layout(
    title={
        'text': f"PlayerId: {pid}",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
injured = np.concatenate([v for v in playerInjured.values()], axis=0)
targets = np.concatenate([v[:, 0] for v in playerTarget.values()], axis=0)
bins = np.array(list(range(0, 100)))
value = np.array([np.sum(injured[(targets < b + 1) & (targets >= b)]) for b in bins])/np.sum(injured)
fig = px.bar(x=bins, y=value, color=value, labels={'x':'target1', 'y':'injured'})
fig.show()

In [None]:
targets = np.concatenate([v[:, 1] for v in playerTarget.values()], axis=0)
value = np.array([np.sum(injured[(targets < b + 1) & (targets >= b)]) for b in bins])/np.sum(injured)
fig = px.bar(x=bins, y=value, color=value, labels={'x':'target2', 'y':'injured'})
fig.show()

In [None]:
targets = np.concatenate([v[:, 2] for v in playerTarget.values()], axis=0)
value = np.array([np.sum(injured[(targets < b + 1) & (targets >= b)]) for b in bins])/np.sum(injured)
fig = px.bar(x=bins, y=value, color=value, labels={'x':'target3', 'y':'injured'})
fig.show()

In [None]:
targets = np.concatenate([v[:, 3] for v in playerTarget.values()], axis=0)
value = np.array([np.sum(injured[(targets < b + 1) & (targets >= b)]) for b in bins])/np.sum(injured)
fig = px.bar(x=bins, y=value, color=value, labels={'x':'target4', 'y':'injured'})
fig.show()

The statistics for the injured feature look the same as for the illness feature.
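
Both features rest on the same countdown idea: while a player is listed, the value is the length of the list (7, 10 or 60), and after that it decays by one per day until it reaches zero. A minimal pure-Python sketch, assuming one entry per consecutive calendar day; `countdown` is a hypothetical helper, not part of the notebook.

```python
def countdown(listed_days):
    """listed_days[i] is the list length if the player is listed on day i, else 0."""
    remaining = []
    for i, days in enumerate(listed_days):
        if days > 0:
            # Listed today: take the list length as the value.
            remaining.append(days)
        else:
            # Not listed: decay yesterday's value by one day, never below zero.
            previous = remaining[i - 1] if i > 0 else 0
            remaining.append(max(0, previous - 1))
    return remaining

print(countdown([0, 10, 10, 0, 0, 0]))
```

For this input the result is `[0, 10, 10, 9, 8, 7]`: the value stays at 10 while the player is listed and then counts down.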

<a id="3.2.5"></a>
### 3.2.5 Family Medical Emergency Feature

In [None]:
playerFME = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
for i, roster in tqdm(enumerate(train_df.rosters)):
    for r in json.loads(roster):
        if r['playerId'] in playerFME:
            playerFME[r['playerId']][i] += int(r['status'] == 'Family Medical Emergency')
            if r['status'] == 'Family Medical Emergency':
                print('Family Medical Emergency Player:', r['playerId'], players_df.loc[players_df.playerId == r['playerId'], 'playerName'].values[0])

In [None]:
pid = 456078
_ex = pd.DataFrame({'date': train_df.date, 'isFME': playerFME[pid] * 10, 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['isFME', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['isFME', 'Target1'] else ())
fig.update_layout(
    title={
        'text': f"PlayerId: {pid}; Player Name: {players_df.loc[players_df.playerId == pid, 'playerName'].values[0]}",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
onFME = np.zeros((4,))
onUnFME = np.zeros((4,))
onFMECount = 0
onUnFMECount = 0
for pid in playerTarget:
    target = playerTarget[pid]
    fme = playerFME[pid]
    tunfme = target[fme == 0]
    tfme = target[fme > 0]
    
    onFMECount += 1 if len(tfme) > 0 else 0
    onUnFMECount += 1 if len(tunfme) > 0 else 0
    for j in range(0, 4):
        onFME[j] += np.mean(tfme[:, j]) if len(tfme) > 0 else 0
        onUnFME[j] += np.mean(tunfme[:, j]) if len(tunfme) > 0 else 0
onFME /= onFMECount
onUnFME /= onUnFMECount

_ex = pd.DataFrame({'isFME': [True, False, True, False, True, False, True, False], 
                    'targetMean': [onFME[0], onUnFME[0], onFME[1], onUnFME[1], onFME[2], onUnFME[2], onFME[3], onUnFME[3]],
                    'targetType': ['target1', 'target1', 'target2', 'target2', 'target3', 'target3', 'target4', 'target4']
                   })
fig = px.bar(_ex, x='targetType', y='targetMean', color='isFME', barmode='group')
fig.show()

Hm... The average target values are higher during the periods when a player has the Family Medical Emergency flag.

<a id="3.2.6"></a>
### 3.2.6 Paternity & Paternity List Feature

In [None]:
playerPaternity = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
for i, roster in tqdm(enumerate(train_df.rosters)):
    for r in json.loads(roster):
        if r['playerId'] in playerPaternity:
            playerPaternity[r['playerId']][i] += int(r['status'] == 'Paternity' or r['status'] == 'Paternity List')

In [None]:
pid = 628317
_ex = pd.DataFrame({'date': train_df.date, 'isPaternity': playerPaternity[pid] * 100, 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['isPaternity', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['isPaternity', 'Target1'] else ())
fig.update_layout(
    title={
        'text': f"PlayerId: {pid}; Player Name: {players_df.loc[players_df.playerId == pid, 'playerName'].values[0]}",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
onPaternity = np.zeros((4,))
onUnPaternity = np.zeros((4,))
onPaternityCount = 0
onUnPaternityCount = 0
for pid in playerTarget:
    target = playerTarget[pid]
    paternity = playerPaternity[pid]
    tunpaternity = target[paternity == 0]
    tpaternity = target[paternity > 0]
    
    onPaternityCount += 1 if len(tpaternity) > 0 else 0
    onUnPaternityCount += 1 if len(tunpaternity) > 0 else 0
    for j in range(0, 4):
        onPaternity[j] += np.mean(tpaternity[:, j]) if len(tpaternity) > 0 else 0
        onUnPaternity[j] += np.mean(tunpaternity[:, j]) if len(tunpaternity) > 0 else 0
onPaternity /= onPaternityCount
onUnPaternity /= onUnPaternityCount

_ex = pd.DataFrame({'isPaternity': [True, False, True, False, True, False, True, False], 
                    'targetMean': [onPaternity[0], onUnPaternity[0], onPaternity[1], onUnPaternity[1], onPaternity[2], onUnPaternity[2], onPaternity[3], onUnPaternity[3]],
                    'targetType': ['target1', 'target1', 'target2', 'target2', 'target3', 'target3', 'target4', 'target4']
                   })
fig = px.bar(_ex, x='targetType', y='targetMean', color='isPaternity', barmode='group')
fig.show()

The average target values are higher on player-days that carry the Paternity or Paternity List flag.
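
A status like this is likely to be set on only a small share of player-days, so the flagged bar may rest on few observations. A minimal sketch for checking that support before reading much into the difference. `flag_support` and the toy arrays are hypothetical and stand in for `playerPaternity`:

```python
import numpy as np

def flag_support(flags):
    """Return (players with at least one flagged day, total flagged days)."""
    n_players = sum(1 for f in flags.values() if (f > 0).any())
    n_days = int(sum((f > 0).sum() for f in flags.values()))
    return n_players, n_days

# Toy data (assumed for illustration): three players, four days each.
toy_flags = {
    1: np.array([0, 1, 1, 0]),
    2: np.array([0, 0, 0, 0]),
    3: np.array([1, 0, 0, 0]),
}
n_players, n_days = flag_support(toy_flags)
```

On the real data this would be called as `flag_support(playerPaternity)`. Small counts would suggest treating the bar chart as a hint rather than a firm effect.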

<a id="3.2.7"></a>
### 3.2.7 Reassigned to Major Features

In [None]:
playerReassigned = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
for i, roster in tqdm(enumerate(train_df.rosters)):
    roster = json.loads(roster)
    for r in roster:
        if r['playerId'] in playerReassigned:
            playerReassigned[r['playerId']][i] += int(r['status'] == 'Reassigned' and r['status'] != 'Reassigned to Minors')
            playerReassigned[r['playerId']][i] -= int(r['status'] == 'Reassigned to Minors')
    for r in roster:
        if r['playerId'] in playerReassigned:
            playerReassigned[r['playerId']][i] = max(0., playerReassigned[r['playerId']][i])

In [None]:
pid = 650619
_ex = pd.DataFrame({'date': train_df.date, 'isReassignedToMajor': playerReassigned[pid] * 100, 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['isReassignedToMajor', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['isReassignedToMajor', 'Target1'] else ())
fig.update_layout(
    title={
        'text': f"PlayerId: {pid}; Player Name: {players_df.loc[players_df.playerId == pid, 'playerName'].values[0]}",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
onReassigned = np.zeros((4,))
onUnReassigned = np.zeros((4,))
onReassignedCount = 0
onUnReassignedCount = 0
for pid in playerTarget:
    target = playerTarget[pid]
    reassigned = playerReassigned[pid]
    tunreassigned = target[reassigned == 0]
    treassigned = target[reassigned > 0]
    
    onReassignedCount += 1 if len(treassigned) > 0 else 0
    onUnReassignedCount += 1 if len(tunreassigned) > 0 else 0
    for j in range(0, 4):
        onReassigned[j] += np.mean(treassigned[:, j]) if len(treassigned) > 0 else 0
        onUnReassigned[j] += np.mean(tunreassigned[:, j]) if len(tunreassigned) > 0 else 0
# Normalize by the reassigned counters, not the paternity ones from section 3.2.6
onReassigned /= onReassignedCount
onUnReassigned /= onUnReassignedCount

_ex = pd.DataFrame({'isReassignedToMajor': [True, False, True, False, True, False, True, False], 
                    'targetMean': [onReassigned[0], onUnReassigned[0], onReassigned[1], onUnReassigned[1], onReassigned[2], onUnReassigned[2], onReassigned[3], onUnReassigned[3]],
                    'targetType': ['target1', 'target1', 'target2', 'target2', 'target3', 'target3', 'target4', 'target4']
                   })
fig = px.bar(_ex, x='targetType', y='targetMean', color='isReassignedToMajor', barmode='group')
fig.show()

<a id="3.2.8"></a>
### 3.2.8 Reassigned to Minor Features

In [None]:
playerReassignedToMinor = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
for i, roster in tqdm(enumerate(train_df.rosters)):
    roster = json.loads(roster)
    for r in roster:
        if r['playerId'] in playerReassignedToMinor:
            playerReassignedToMinor[r['playerId']][i] += int(r['status'] == 'Reassigned to Minors')

In [None]:
pid = 650619
_ex = pd.DataFrame({'date': train_df.date, 'isReassignedToMinor': playerReassignedToMinor[pid] * 100, 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['isReassignedToMinor', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['isReassignedToMinor', 'Target1'] else ())
fig.update_layout(
    title={
        'text': f"PlayerId: {pid}; Player Name: {players_df.loc[players_df.playerId == pid, 'playerName'].values[0]}",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
onReassigned = np.zeros((4,))
onUnReassigned = np.zeros((4,))
onReassignedCount = 0
onUnReassignedCount = 0
for pid in playerTarget:
    target = playerTarget[pid]
    reassigned = playerReassignedToMinor[pid]
    tunreassigned = target[reassigned == 0]
    treassigned = target[reassigned > 0]
    
    onReassignedCount += 1 if len(treassigned) > 0 else 0
    onUnReassignedCount += 1 if len(tunreassigned) > 0 else 0
    for j in range(0, 4):
        onReassigned[j] += np.mean(treassigned[:, j]) if len(treassigned) > 0 else 0
        onUnReassigned[j] += np.mean(tunreassigned[:, j]) if len(tunreassigned) > 0 else 0
# Normalize by the reassigned counters, not the paternity ones from section 3.2.6
onReassigned /= onReassignedCount
onUnReassigned /= onUnReassignedCount

_ex = pd.DataFrame({'isReassignedToMinor': [True, False, True, False, True, False, True, False], 
                    'targetMean': [onReassigned[0], onUnReassigned[0], onReassigned[1], onUnReassigned[1], onReassigned[2], onUnReassigned[2], onReassigned[3], onUnReassigned[3]],
                    'targetType': ['target1', 'target1', 'target2', 'target2', 'target3', 'target3', 'target4', 'target4']
                   })
fig = px.bar(_ex, x='targetType', y='targetMean', color='isReassignedToMinor', barmode='group')
fig.show()

<a id="3.2.9"></a>
### 3.2.9 Reserve List (Minors) Features

In [None]:
playerReserve = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
for i, roster in tqdm(enumerate(train_df.rosters)):
    roster = json.loads(roster)
    for r in roster:
        if r['playerId'] in playerReserve:
            playerReserve[r['playerId']][i] += int(r['status'] == 'Reserve List (Minors)')

In [None]:
pid = 656887
_ex = pd.DataFrame({'date': train_df.date, 'isReserve': playerReserve[pid] * 100, 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['isReserve', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['isReserve', 'Target1'] else ())
fig.update_layout(
    title={
        'text': f"PlayerId: {pid}; Player Name: {players_df.loc[players_df.playerId == pid, 'playerName'].values[0]}",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
onReserve = np.zeros((4,))
onUnReserve = np.zeros((4,))
onReserveCount = 0
onUnReserveCount = 0
for pid in playerTarget:
    target = playerTarget[pid]
    reserve = playerReserve[pid]
    tunreserve = target[reserve == 0]
    treserve = target[reserve > 0]
    
    onReserveCount += 1 if len(treserve) > 0 else 0
    onUnReserveCount += 1 if len(tunreserve) > 0 else 0
    for j in range(0, 4):
        onReserve[j] += np.mean(treserve[:, j]) if len(treserve) > 0 else 0
        onUnReserve[j] += np.mean(tunreserve[:, j]) if len(tunreserve) > 0 else 0
onReserve /= onReserveCount
onUnReserve /= onUnReserveCount

_ex = pd.DataFrame({'isReserve': [True, False, True, False, True, False, True, False], 
                    'targetMean': [onReserve[0], onUnReserve[0], onReserve[1], onUnReserve[1], onReserve[2], onUnReserve[2], onReserve[3], onUnReserve[3]],
                    'targetType': ['target1', 'target1', 'target2', 'target2', 'target3', 'target3', 'target4', 'target4']
                   })
fig = px.bar(_ex, x='targetType', y='targetMean', color='isReserve', barmode='group')
fig.show()

<a id="3.2.10"></a>
### 3.2.10 Suspended Features

In [None]:
playerSuspended = {pid: np.zeros((len(train_df),)) for pid in players_df.playerId}
for i, roster in tqdm(enumerate(train_df.rosters)):
    roster = json.loads(roster)
    for r in roster:
        if r['playerId'] in playerSuspended:
            playerSuspended[r['playerId']][i] += int(r['status'] == 'Suspended' or r['status'] == 'Suspended # days')

In [None]:
pid = 592206
_ex = pd.DataFrame({'date': train_df.date, 'isSuspended': playerSuspended[pid] * 100, 
                    'Target1': playerTarget[pid][:, 0], 'Target2': playerTarget[pid][:, 1],
                    'Target3': playerTarget[pid][:, 2], 'Target4': playerTarget[pid][:, 3]})
fig = px.line(_ex, x='date', y=['isSuspended', 'Target1', 'Target2', 'Target3', 'Target4'])
fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in ['isSuspended', 'Target1'] else ())
fig.update_layout(
    title={
        'text': f"PlayerId: {pid}; Player Name: {players_df.loc[players_df.playerId == pid, 'playerName'].values[0]}",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
onSuspended = np.zeros((4,))
onUnSuspended = np.zeros((4,))
onSuspendedCount = 0
onUnSuspendedCount = 0
for pid in playerTarget:
    target = playerTarget[pid]
    suspended = playerSuspended[pid]
    tunsuspended = target[suspended == 0]
    tsuspended = target[suspended > 0]
    
    onSuspendedCount += 1 if len(tsuspended) > 0 else 0
    onUnSuspendedCount += 1 if len(tunsuspended) > 0 else 0
    for j in range(0, 4):
        onSuspended[j] += np.mean(tsuspended[:, j]) if len(tsuspended) > 0 else 0
        onUnSuspended[j] += np.mean(tunsuspended[:, j]) if len(tunsuspended) > 0 else 0
onSuspended /= onSuspendedCount
onUnSuspended /= onUnSuspendedCount

_ex = pd.DataFrame({'isSuspended': [True, False, True, False, True, False, True, False], 
                    'targetMean': [onSuspended[0], onUnSuspended[0], onSuspended[1], onUnSuspended[1], onSuspended[2], onUnSuspended[2], onSuspended[3], onUnSuspended[3]],
                    'targetType': ['target1', 'target1', 'target2', 'target2', 'target3', 'target3', 'target4', 'target4']
                   })
fig = px.bar(_ex, x='targetType', y='targetMean', color='isSuspended', barmode='group')
fig.show()

### to be continued...