# MLB Player Digital Engagement Forecasting #

This notebook builds on the [starter notebook](https://www.kaggle.com/ryanholbrook/getting-started-with-mlb-player-digital-engagement) provided by the competition organisers.

Table of Contents 
* [Introduction](#intro)
* [Initial Questions on Data](#que)
* [Importing the Datasets](#import)
* [Exploratory Analysis](#exp)
* [Feature Engineering](#feature)
* [Model Development & Hyperparameter Tuning](#model)
* [Final Submission](#submit)


## Introduction<a name="intro"></a>
The aim of this competition is to understand the factors that drive the rise & fall of fan engagement with individual players. We've been provided with a number of datasets covering game score statistics, player records, awards & events within a season. In this notebook we'll explore these datasets and examine how factors like scores, teams & time of the year contribute to variations in player engagement.

If, like me, you don't have much domain knowledge about baseball, this discussion [thread](https://www.kaggle.com/c/mlb-player-digital-engagement-forecasting/discussion/245457) is a good place to learn the format of the tournament, the rules of the game & the statistics measured in baseball.


## Datasets Provided
* **train.csv** - One row per date. Each column holds nested JSON for one of the following - ```games, rosters, playerBoxScores, teamBoxScores, awards, events, nextDayPlayerEngagement``` etc. Once unpacked, the engagement data is at a playerId, date level. The 4 engagement variables ```target1, target2, target3, target4``` are each daily indexes of digital engagement on a 0-100 scale. Data is provided from 2018 - 2021.
* **teams.csv** - Dataset containing info for teams like ```teamName, location, league & division``` that they play in.
* **seasons.csv** - Contains dates for various phases of a season - ```preSeason, regularSeason, playoffs, postseason``` etc. from 2017 - 2021.
* **players.csv** - Dataset containing info for players - ```DOB, birthCountry, mlbDebutDate, position``` etc.


## Initial Questions on Data<a name="que"></a>

* How do the four target engagement variables vary from one year to the other and how are they related to each other?
* How does the engagement score decay with time after a baseball game?
* How are engagement score values autocorrelated with each other?
* How do engagement scores vary with time of the year & progress of the season? Is there seasonality associated with these scores?
* Do successful teams enjoy higher engagement scores compared to the remaining teams? Do higher ranked teams see more engagement from fans?
* How does starting position of a player play a role in driving engagement scores? Are players in certain positions likely to drive higher engagement?
* How do a player's demographic attributes play a role in determining their engagement scores?
* How do run scoring & pitching stats impact engagement scores? Do both good & bad performances lead to higher engagement from fans?

## Importing the datasets<a name="import"></a>

The code in this section is borrowed from the notebook provided by the competition organisers. It unpacks the JSON held in the various columns of ```train.csv``` and creates tidy tabular datasets for us to run further analyses on.
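To make the unpacking step concrete, here is a minimal sketch on a made-up two-row stand-in for ```train.csv```. The names `toy_train` and `unpack_column` are hypothetical and the values are invented; it also uses `json.loads` rather than the `pd.read_json` call in the starter code, purely to keep the example small.

```python
import json

import pandas as pd

# Toy stand-in for train.csv: one row per date, one JSON string per nested table.
# Column names mirror the competition data; the values are made up.
toy_train = pd.DataFrame({
    "date": [20180101, 20180102],
    "nextDayPlayerEngagement": [
        '[{"playerId": 1, "target1": 0.5}, {"playerId": 2, "target1": 1.5}]',
        '[{"playerId": 1, "target1": 2.0}]',
    ],
})


def unpack_column(df, column):
    """Expand one JSON-string column into a long table with one row per record."""
    frames = []
    for date, json_str in zip(df["date"], df[column]):
        if pd.isna(json_str):
            continue  # no records of this type on that date
        daily = pd.DataFrame(json.loads(json_str))
        # Tag every record with the date of the row it came from
        daily.insert(0, "dailyDataDate", pd.to_datetime(str(date), format="%Y%m%d"))
        frames.append(daily)
    return pd.concat(frames, ignore_index=True)


engagement = unpack_column(toy_train, "nextDayPlayerEngagement")
print(engagement)
```

Two date-level rows become three player-date rows here, which is the same mechanism that turns the date-level training file into millions of player-date records.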




In [None]:
#!pip install raceplotly

In [None]:
#Importing packages
import gc
import sys
import warnings
from pathlib import Path

import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from statsmodels.tsa.deterministic import (CalendarFourier,
                                           CalendarSeasonality,
                                           CalendarTimeTrend,
                                           DeterministicProcess)

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

#from raceplotly.plots import barplot

warnings.simplefilter("ignore")

# Set Matplotlib defaults
# Note: Matplotlib 3.6+ renames this style to "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 5))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)

# Load and Join Data #

In [None]:
data_dir = Path('../input/mlb-player-digital-engagement-forecasting/')
#training = pd.read_csv(data_dir / 'train.csv')
players = pd.read_csv(data_dir / 'players.csv')


# Convert training data date field to datetime type
#training['date'] = pd.to_datetime(training['date'], format="%Y%m%d")

#display(training.info())

In addition, there are a number of supplementary files. See the [data documentation](https://www.kaggle.com/c/mlb-player-digital-engagement-forecasting/data) on the competition page for more information.

In [None]:
#### Look at data from each of the input dfs read in ####
df_names = ['seasons', 'teams', 'players', 'awards']

for name in df_names:
    globals()[name] = pd.read_csv(data_dir / f"{name}.csv")

kaggle_data_tabs = widgets.Tab()
# Add Output widgets for each pandas DF as tabs' children
kaggle_data_tabs.children = list([widgets.Output() for df_name in df_names])

for index in range(0, len(df_names)):
    # Rename tab bar titles to df names
    kaggle_data_tabs.set_title(index, df_names[index])
    
    # Display corresponding table output for this tab name
    with kaggle_data_tabs.children[index]:
        display(eval(df_names[index]))

display(kaggle_data_tabs)

In [None]:
# # Helper function to unpack json found in daily data
# def unpack_json(json_str):
#     return np.nan if pd.isna(json_str) else pd.read_json(json_str)

# ### Unnest various nested data within training (daily) data ####
# daily_data_unnested_dfs = pd.DataFrame(data = {
#   'dfName': training.drop('date', axis = 1).columns.values.tolist()
#   })

# daily_data_unnested_dfs['df'] = [pd.DataFrame() for row in 
#   daily_data_unnested_dfs.iterrows()]

# for df_index, df_row in daily_data_unnested_dfs.iterrows():
#     nestedTableName = str(df_row['dfName'])
    
#     date_nested_table = training[['date', nestedTableName]]
    
#     date_nested_table = (date_nested_table[
#       ~pd.isna(date_nested_table[nestedTableName])
#       ].
#       reset_index(drop = True)
#       )
    
#     daily_dfs_collection = []
    
#     for date_index, date_row in date_nested_table.iterrows():
#         daily_df = unpack_json(date_row[nestedTableName])
        
#         daily_df['dailyDataDate'] = date_row['date']
        
#         daily_dfs_collection = daily_dfs_collection + [daily_df]

#     unnested_table = pd.concat(daily_dfs_collection,
#       ignore_index = True).set_index('dailyDataDate').reset_index()

#     # Creates 1 pandas df per unnested df from daily data read in, with same name
#     globals()[df_row['dfName']] = unnested_table    
    
#     daily_data_unnested_dfs['df'][df_index] = unnested_table

# del training
# gc.collect()



# #### Get some information on each date in daily data (using season dates of interest) ####
# dates = pd.DataFrame(data = 
#   {'dailyDataDate': nextDayPlayerEngagement['dailyDataDate'].unique()})

# dates['date'] = pd.to_datetime(dates['dailyDataDate'].astype(str))

# dates['year'] = dates['date'].dt.year
# dates['month'] = dates['date'].dt.month

# dates_with_info = pd.merge(
#   dates,
#   seasons,
#   left_on = 'year',
#   right_on = 'seasonId'
#   )

# dates_with_info['inSeason'] = (
#   dates_with_info['date'].between(
#     dates_with_info['regularSeasonStartDate'],
#     dates_with_info['postSeasonEndDate'],
#     inclusive = True
#     )
#   )

# dates_with_info['seasonPart'] = np.select(
#   [
#     dates_with_info['date'] < dates_with_info['preSeasonStartDate'], 
#     dates_with_info['date'] < dates_with_info['regularSeasonStartDate'],
#     dates_with_info['date'] <= dates_with_info['lastDate1stHalf'],
#     dates_with_info['date'] < dates_with_info['firstDate2ndHalf'],
#     dates_with_info['date'] <= dates_with_info['regularSeasonEndDate'],
#     dates_with_info['date'] < dates_with_info['postSeasonStartDate'],
#     dates_with_info['date'] <= dates_with_info['postSeasonEndDate'],
#     dates_with_info['date'] > dates_with_info['postSeasonEndDate']
#   ], 
#   [
#     'Offseason',
#     'Preseason',
#     'Reg Season 1st Half',
#     'All-Star Break',
#     'Reg Season 2nd Half',
#     'Between Reg and Postseason',
#     'Postseason',
#     'Offseason'
#   ], 
#   default = np.nan
#   )

# #### Add some pitching stats/pieces of info to player game level stats ####

# player_game_stats = (playerBoxScores.copy().
#   # Change team Id/name to reflect these come from player game, not roster
#   rename(columns = {'teamId': 'gameTeamId', 'teamName': 'gameTeamName'})
#   )

# # Adds in field for innings pitched as fraction (better for aggregation)
# player_game_stats['inningsPitchedAsFrac'] = np.where(
#   pd.isna(player_game_stats['inningsPitched']),
#   np.nan,
#   np.floor(player_game_stats['inningsPitched']) +
#     (player_game_stats['inningsPitched'] -
#       np.floor(player_game_stats['inningsPitched'])) * 10/3
#   )

# # Add in Tom Tango pitching game score (https://www.mlb.com/glossary/advanced-stats/game-score)
# player_game_stats['pitchingGameScore'] = (40
# #     + 2 * player_game_stats['outs']
#     + 1 * player_game_stats['strikeOutsPitching']
#     - 2 * player_game_stats['baseOnBallsPitching']
#     - 2 * player_game_stats['hitsPitching']
#     - 3 * player_game_stats['runsPitching']
#     - 6 * player_game_stats['homeRunsPitching']
#     )

# # Add in criteria for no-hitter by pitcher (individual, not multiple pitchers)
# player_game_stats['noHitter'] = np.where(
#   (player_game_stats['gamesStartedPitching'] == 1) &
#   (player_game_stats['inningsPitched'] >= 9) &
#   (player_game_stats['hitsPitching'] == 0),
#   1, 0
#   )

# player_date_stats_agg = pd.merge(
#   (player_game_stats.
#     groupby(['dailyDataDate', 'playerId'], as_index = False).
#     # Some aggregations that are not simple sums
#     agg(
#       numGames = ('gamePk', 'nunique'),
#       # Should be 1 team per player per day, but adding here for 1 exception:
#       # playerId 518617 (Jake Diekman) had 2 games for different teams marked
#       # as played on 5/19/19, due to resumption of game after he was traded
#       numTeams = ('gameTeamId', 'nunique'),
#       # Should be only 1 team for almost all player-dates, taking min to simplify
#       gameTeamId = ('gameTeamId', 'min')
#       )
#     ),
#   # Merge with a bunch of player stats that can be summed at date/player level
#   (player_game_stats.
#     groupby(['dailyDataDate', 'playerId'], as_index = False)
#     [['runsScored', 'homeRuns', 'strikeOuts', 'baseOnBalls', 'hits',
#       'hitByPitch', 'atBats', 'caughtStealing', 'stolenBases',
#       'groundIntoDoublePlay', 'groundIntoTriplePlay', 'plateAppearances',
#       'totalBases', 'rbi', 'leftOnBase', 'sacBunts', 'sacFlies',
#       'gamesStartedPitching', 'runsPitching', 'homeRunsPitching', 
#       'strikeOutsPitching', 'baseOnBallsPitching', 'hitsPitching',
#       'inningsPitchedAsFrac', 'earnedRuns', 
#       'battersFaced','saves', 'blownSaves', 'pitchingGameScore', 
#       'noHitter'
#       ]].
#     sum()
#     ),
#   on = ['dailyDataDate', 'playerId'],
#   how = 'inner'
#   )

# #### Turn games table into 1 row per team-game, then merge with team box scores ####
# # Filter to regular or Postseason games w/ valid scores for this part
# games_for_stats = games[
#   np.isin(games['gameType'], ['R', 'F', 'D', 'L', 'W', 'C', 'P']) &
#   ~pd.isna(games['homeScore']) &
#   ~pd.isna(games['awayScore'])
#   ]

# # Get games table from home team perspective
# games_home_perspective = games_for_stats.copy()

# # Change column names so that "team" is "home", "opp" is "away"
# games_home_perspective.columns = [
#   col_value.replace('home', 'team').replace('away', 'opp') for 
#     col_value in games_home_perspective.columns.values]

# games_home_perspective['isHomeTeam'] = 1

# # Get games table from away team perspective
# games_away_perspective = games_for_stats.copy()

# # Change column names so that "opp" is "home", "team" is "away"
# games_away_perspective.columns = [
#   col_value.replace('home', 'opp').replace('away', 'team') for 
#     col_value in games_away_perspective.columns.values]

# games_away_perspective['isHomeTeam'] = 0

# # Put together games from home/away perspective to get df w/ 1 row per team game
# team_games = (pd.concat([
#   games_home_perspective,
#   games_away_perspective
#   ],
#   ignore_index = True)
#   )

# # Copy over team box scores data to modify
# team_game_stats = teamBoxScores.copy()

# # Change column names to reflect these are all "team" stats - helps 
# # to differentiate from individual player stats if/when joining later
# team_game_stats.columns = [
#   (col_value + 'Team') 
#   if (col_value not in ['dailyDataDate', 'home', 'teamId', 'gamePk',
#     'gameDate', 'gameTimeUTC'])
#     else col_value
#   for col_value in team_game_stats.columns.values
#   ]

# # Merge games table with team game stats
# team_games_with_stats = pd.merge(
#   team_games,
#   team_game_stats.
#     # Drop some fields that are already present in team_games table
#     drop(['home', 'gameDate', 'gameTimeUTC'], axis = 1),
#   on = ['dailyDataDate', 'gamePk', 'teamId'],
#   # Doing this as 'inner' join excludes spring training games, postponed games,
#   # etc. from original games table, but this may be fine for purposes here 
#   how = 'inner'
#   )

# team_date_stats_agg = (team_games_with_stats.
#   groupby(['dailyDataDate', 'teamId', 'gameType', 'oppId', 'oppName'], 
#     as_index = False).
#   agg(
#     numGamesTeam = ('gamePk', 'nunique'),
#     winsTeam = ('teamWinner', 'sum'),
#     lossesTeam = ('oppWinner', 'sum'),
#     runsScoredTeam = ('teamScore', 'sum'),
#     runsAllowedTeam = ('oppScore', 'sum')
#     )
#    )

# # Prepare standings table for merge w/ player digital engagement data
# # Pick only certain fields of interest from standings for merge
# standings_selected_fields = (standings[['dailyDataDate', 'teamId', 
#   'streakCode', 'divisionRank', 'leagueRank', 'wildCardRank', 'pct'
#   ]].
#   rename(columns = {'pct': 'winPct'})
#   )

# # Change column names to reflect these are all "team" standings - helps 
# # to differentiate from player-related fields if/when joining later
# standings_selected_fields.columns = [
#   (col_value + 'Team') 
#   if (col_value not in ['dailyDataDate', 'teamId'])
#     else col_value
#   for col_value in standings_selected_fields.columns.values
#   ]

# standings_selected_fields['streakLengthTeam'] = (
#   standings_selected_fields['streakCodeTeam'].
#     str.replace('W', '').
#     str.replace('L', '').
#     astype(float)
#     )

# # Add fields to separate winning and losing streak from streak code
# standings_selected_fields['winStreakTeam'] = np.where(
#   standings_selected_fields['streakCodeTeam'].str[0] == 'W',
#   standings_selected_fields['streakLengthTeam'],
#   np.nan
#   )

# standings_selected_fields['lossStreakTeam'] = np.where(
#   standings_selected_fields['streakCodeTeam'].str[0] == 'L',
#   standings_selected_fields['streakLengthTeam'],
#   np.nan
#   )

# standings_for_digital_engagement_merge = (pd.merge(
#   standings_selected_fields,
#   dates_with_info[['dailyDataDate', 'inSeason']],
#   on = ['dailyDataDate'],
#   how = 'left'
#   ).
#   # Limit down standings to only in season version
#   query("inSeason").
#   # Drop fields no longer necessary (in derived values, etc.)
#   drop(['streakCodeTeam', 'streakLengthTeam', 'inSeason'], axis = 1).
#   reset_index(drop = True)
#   )

# #### Merge together various data frames to add date, player, roster, and team info ####
# # Copy over player engagement df to add various pieces to it
# player_engagement_with_info = nextDayPlayerEngagement.copy()

# # Take "row mean" across targets to add (helps with studying all 4 targets at once)
# player_engagement_with_info['targetAvg'] = np.mean(
#   player_engagement_with_info[['target1', 'target2', 'target3', 'target4']],
#   axis = 1)

# # Merge in date information
# player_engagement_with_info = pd.merge(
#   player_engagement_with_info,
#   dates_with_info[['dailyDataDate', 'date', 'year', 'month', 'inSeason',
#     'seasonPart']],
#   on = ['dailyDataDate'],
#   how = 'left'
#   )

# # Merge in some player information
# player_engagement_with_info = pd.merge(
#   player_engagement_with_info,
#   players[['playerId', 'playerName', 'DOB', 'mlbDebutDate', 'birthCity',
#     'birthStateProvince', 'birthCountry', 'primaryPositionName']],
#    on = ['playerId'],
#    how = 'left'
#    )

# # Merge in some player roster information by date
# player_engagement_with_info = pd.merge(
#   player_engagement_with_info,
#   (rosters[['dailyDataDate', 'playerId', 'statusCode', 'status', 'teamId']].
#     rename(columns = {
#       'statusCode': 'rosterStatusCode',
#       'status': 'rosterStatus',
#       'teamId': 'rosterTeamId'
#       })
#     ),
#   on = ['dailyDataDate', 'playerId'],
#   how = 'left'
#   )
    
# # Merge in team name from player's roster team
# player_engagement_with_info = pd.merge(
#   player_engagement_with_info,
#   (teams[['id', 'teamName']].
#     rename(columns = {
#       'id': 'rosterTeamId',
#       'teamName': 'rosterTeamName'
#       })
#     ),
#   on = ['rosterTeamId'],
#   how = 'left'
#   )

# # Merge in some player game stats (previously aggregated) from that date
# player_engagement_with_info = pd.merge(
#   player_engagement_with_info,
#   player_date_stats_agg,
#   on = ['dailyDataDate', 'playerId'],
#   how = 'left'
#   )

# # Merge in team name from player's game team
# player_engagement_with_info = pd.merge(
#   player_engagement_with_info,
#   (teams[['id', 'teamName']].
#     rename(columns = {
#       'id': 'gameTeamId',
#       'teamName': 'gameTeamName'
#       })
#     ),
#   on = ['gameTeamId'],
#   how = 'left'
#   )

# # Merge in some team game stats/results (previously aggregated) from that date
# player_engagement_with_info = pd.merge(
#   player_engagement_with_info,
#   team_date_stats_agg.rename(columns = {'teamId': 'gameTeamId'}),
#   on = ['dailyDataDate', 'gameTeamId'],
#   how = 'left'
#   )

# # Merge in player transactions of note on that date
    
# # Merge in some pieces of team standings (previously filter/processed) from that date
# player_engagement_with_info = pd.merge(
#   player_engagement_with_info,
#   standings_for_digital_engagement_merge.
#     rename(columns = {'teamId': 'gameTeamId'}),
#   on = ['dailyDataDate', 'gameTeamId'],
#   how = 'left'
#   )

# #display(player_engagement_with_info)
# player_engagement_with_info.head(5)


## Exploratory Analysis<a name="exp"></a>

As you can see above, the starter code converts the initial ```train.csv``` dataset, with 1216 rows & several packed JSONs, into a dataset with 2.5M+ rows. The processed dataframe ```player_engagement_with_info``` holds the data in a much simpler format for analysis. Let's start exploring the dataset and answering the questions highlighted at the start of the notebook.



In [None]:
import pickle

# Pkl_Filename = "player_engagement_with_info.pkl"  

#player_engagement_with_info.to_pickle('player_engagement_with_info.pkl')


In [None]:

player_engagement_with_info = pd.read_pickle("../input/mlb-player-digital-engagement-merged-data/player_engagement_with_info.pkl")

In [None]:
player_engagement_with_info

Some of the columns present in the dataset above are - 

#### Player Info - 
* DOB
* MLB Debut Date
* Birth Country
* Position 
* Roster Status

#### Player Batting Scores - 
* Runs Scored
* Home Runs
* Strike Outs 
* RBI - Runs Batted In

#### Player Pitching Scores - 
* Runs Pitching
* Strike Outs Pitching
* Hits Pitching

#### Team Stats - 
* Team & Opposition Info
* Runs Scored & Allowed
* Wins & Loss Record

#### Target Columns - Target1 - Target 4

Let's first start by taking a look at Target Columns - 

### Target Columns 


In [None]:
import plotly.express as px
time_target_gp=player_engagement_with_info.groupby(['dailyDataDate']).agg({'target1':'mean','target2':'mean','target3':'mean','target4':'mean','targetAvg':'mean'}).reset_index()

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=3, cols=2)

fig.add_trace(
    go.Scatter(x=time_target_gp['dailyDataDate'], y=time_target_gp['target1'],name='Target1'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=time_target_gp['dailyDataDate'], y=time_target_gp['target2'],name='Target2'),
    row=1, col=2
)
fig.add_trace(
    go.Scatter(x=time_target_gp['dailyDataDate'], y=time_target_gp['target3'],name='Target3'),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(x=time_target_gp['dailyDataDate'], y=time_target_gp['target4'],name='Target4'),
    row=2, col=2
)
fig.add_trace(
    go.Scatter(x=time_target_gp['dailyDataDate'], y=time_target_gp['targetAvg'],name='targetAvg'),
    row=3, col=1
)


fig.update_layout(height=600, width=800, title_text="Daily Average Target Variable Values")
fig.show()


In [None]:
player_engagement_with_info['quarter']=player_engagement_with_info['dailyDataDate'].dt.quarter
player_engagement_with_info['month']=player_engagement_with_info['dailyDataDate'].dt.month
player_engagement_with_info['year']=player_engagement_with_info['dailyDataDate'].dt.year

player_engagement_with_info

month_gp=player_engagement_with_info.groupby('month').agg({'target1':'mean','target2':'mean','target3':'mean','target4':'mean'}).reset_index()

fig=go.Figure()

fig.add_trace(go.Scatter(x=month_gp['month'], y=month_gp['target1'],
                    mode='lines',
                    name='target1'))
fig.add_trace(go.Scatter(x=month_gp['month'], y=month_gp['target2'],
                    mode='lines',
                    name='target2'))
fig.add_trace(go.Scatter(x=month_gp['month'], y=month_gp['target3'],
                    mode='lines',
                    name='target3'))

fig.add_trace(go.Scatter(x=month_gp['month'], y=month_gp['target4'],
                    mode='lines',
                    name='target4'))

fig.update_layout(title='Variation in Average Target Values with Month')
fig.show()

In [None]:
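#Time Series Correlation between the daily mean targets
#Restrict to the numeric target columns so the date column is not passed to corr()
target_cols=['target1','target2','target3','target4','targetAvg']
corr=time_target_gp[target_cols].corr()
corr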

In [None]:
daily=player_engagement_with_info.groupby('dailyDataDate').agg({'target1':'mean','target2':'mean','target3':'mean','target4':'mean'}).reset_index()

fig = make_subplots(
    rows=3, cols=3,shared_yaxes=True,
    subplot_titles=("Target1 vs Target2", "Target1 vs Target3", "Target1 vs Target4","Target2 vs Target3", "Target2 vs Target4","", "Target3 vs Target4"))
fig.add_trace(
    go.Scatter(y=daily['target2'], x=daily['target1'],mode='markers'),
    row=1, col=1
    )
fig.add_trace(
    go.Scatter(y=daily['target3'], x=daily['target1'],mode='markers'),
    row=1, col=2
    )
fig.add_trace(
    go.Scatter(y=daily['target4'], x=daily['target1'],mode='markers'),
    row=1, col=3
    )
fig.add_trace(
    go.Scatter(y=daily['target3'], x=daily['target2'],mode='markers'),
    row=2, col=1
    )
fig.add_trace(
    go.Scatter(y=daily['target4'], x=daily['target2'],mode='markers'),
    row=2, col=2
    )
fig.add_trace(
    go.Scatter(y=daily['target4'], x=daily['target3'],mode='markers'),
    row=3, col=1
    )
fig.update_layout(showlegend=False)

fig.show()

In [None]:
player_mean=player_engagement_with_info.groupby('playerId').agg({'target1':'mean','target2':'mean','target3':'mean','target4':'mean'}).reset_index()
player_mean


fig=go.Figure()
# fig.add_trace(go.Histogram(x=player_mean['target1'],histnorm='probability',name='Target1'))
# fig.add_trace(go.Histogram(x=player_mean['target2'],histnorm='probability',name='Target2'))
# fig.add_trace(go.Histogram(x=player_mean['target3'],histnorm='probability',name='Target3'))
# fig.add_trace(go.Histogram(x=player_mean['target4'],histnorm='probability',name='Target4'))
# fig.update_layout(barmode='overlay',title='Distribution of Average Player Target Scores')
# fig.update_traces(opacity=0.75)

fig.add_trace(go.Violin(y=player_mean['target1'],name='Target1',box_visible=True,meanline_visible=True))
fig.add_trace(go.Violin(y=player_mean['target2'],name='Target2',box_visible=True,meanline_visible=True))
fig.add_trace(go.Violin(y=player_mean['target3'],name='Target3',box_visible=True,meanline_visible=True))
fig.add_trace(go.Violin(y=player_mean['target4'],name='Target4',box_visible=True,meanline_visible=True))
fig.update_layout(title='Distribution of Average Player Target Scores')
# fig.update_traces(opacity=0.75)

fig.show()



Some of the things that are visible from the charts above - 
* The target variables appear to be cyclical, with a similar pattern repeating every year
* **The target variables don't seem to have a strong Pearson correlation with each other**. However, some pairs do seem to be related - for example, Target 2 shows no relation to Target 3 but seems to vary along with Target 4
* The month-wise plot of average target values shows significant seasonality. **Target values seem to peak in March & then dip in October**. A similar trend is observable in all the target variables
* The per-player mean target values show that the **mean value for Target2 is much higher than for the other target variables**. 
* Each of these target variables has a high number of outlier values, suggesting that a small group of players attracts far more engagement than the vast majority
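Pearson correlation only measures linear association, so two series can move together and still score low on it. The sketch below is an illustration on synthetic data, not a result from this dataset: the column names `target_a` and `target_b` are made up, and it simply compares `DataFrame.corr` with `method="pearson"` and `method="spearman"` on a monotonic but non-linear relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)

# Synthetic pair: target_b rises with target_a, but exponentially rather than linearly
toy_daily = pd.DataFrame({
    "target_a": x,
    "target_b": np.exp(x) + rng.normal(0, 0.1, size=200),
})

pearson = toy_daily.corr(method="pearson").loc["target_a", "target_b"]
spearman = toy_daily.corr(method="spearman").loc["target_a", "target_b"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the rank-based figure is noticeably higher than the Pearson figure for a pair of targets, the pair may be related in a non-linear way that the correlation table above understates.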

If we look at sample records in the ```player_engagement_with_info``` table, we can see a lot of blank values for game stats. This is because we are required to predict engagement for players even on days when they are not playing, so on non-game days these values are missing. Let's take a look at the breakdown of game days vs non-game days in the dataset, and at the average engagement values on each - 

In [None]:
#create a flag for gamedays and non-gamedays for players
player_engagement_with_info['gameday']=~player_engagement_with_info.gameTeamId.isna()
player_engagement_with_info['flag']=0
player_engagement_with_info.loc[player_engagement_with_info.gameTeamId.isna(),'flag']=-1

player_engagement_with_info.gameday.value_counts()


In [None]:
gameday_gp=player_engagement_with_info.groupby(['gameday']).agg({'targetAvg':'mean'}).reset_index()
px.bar(gameday_gp,x='gameday',y='targetAvg',title='Difference between targetAvg on MatchDays vs non MatchDays')

As we can see above, non-gameday records are almost 10x more numerous than gameday records in the dataset. Also, as expected, engagement scores are much higher on game days than on non-game days. Let's now take a deeper look at the data to understand how the engagement score decays on non-playing days. 


In [None]:
player_engagement_with_info.sort_values(by=['playerId','dailyDataDate'],inplace=True,ascending=True)
player_engagement_with_info=player_engagement_with_info.reset_index(drop=True)
#player_engagement_with_info.loc[player_engagement_with_info['gameday']==True,'daysSinceLastGame']=0

player_engagement_with_info['daysSinceLastGame']=0

# chk=player_engagement_with_info.head(10000)
# chk.to_csv('chk.csv')

In [None]:
player_engagement_with_info.sort_values(by=['playerId','dailyDataDate'],inplace=True,ascending=True)

In [None]:
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist() for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df

player_engagement_with_info=count_consecutive_items_n_cols(player_engagement_with_info,['playerId','gameday'],'daysSinceLastGame')
player_engagement_with_info.loc[player_engagement_with_info['gameday']==True,'daysSinceLastGame']=0


In [None]:
#Average targetAvg and number of distinct players for each value of daysSinceLastGame
lag_gp=player_engagement_with_info.groupby(['daysSinceLastGame']).agg({'targetAvg':'mean','playerId':'nunique'}).reset_index()
px.bar(lag_gp,x='daysSinceLastGame',y='targetAvg',title='Variation in targetAvg with daysSinceLastGame')

fig1 = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig1.add_trace(
    go.Bar(x=lag_gp['daysSinceLastGame'], y=lag_gp['targetAvg'], name="targetAvg"),
    secondary_y=True,
)

fig1.add_trace(
    go.Scatter(x=lag_gp['daysSinceLastGame'], y=lag_gp['playerId'], name="No. of Players"),
    secondary_y=False,
)
fig1.update_layout(title='Variation in targetAvg by daysSinceLastGame',xaxis_title='daysSinceLastGame')
fig1.show()

immediate_drop=player_engagement_with_info.loc[player_engagement_with_info['daysSinceLastGame']<=10]
immediate_drop

lag_gp=immediate_drop.groupby(['daysSinceLastGame']).agg({'targetAvg':'mean','playerId':'nunique'}).reset_index()
lag_gp

fig2=px.line(lag_gp,x='daysSinceLastGame',y='targetAvg',title='Drop in targetAvg in immediate days after a game')
fig2.show()

In [None]:
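#Take a copy so that the assignment below modifies high_target and not a view of the full dataframe
high_target=player_engagement_with_info.loc[player_engagement_with_info['daysSinceLastGame']>=1000].copy()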
high_target.loc[high_target.rosterTeamName.isna(),'rosterTeamName']='No Team'

high_target_players=high_target.groupby(['playerName']).agg({'targetAvg':'mean'}).reset_index()

high_target_players=high_target_players.sort_values(by='targetAvg',ascending=False).head(10)
px.bar(high_target_players,x='playerName',y='targetAvg',title='Non-Active players with highest targetAvg')


We calculated a column ```daysSinceLastGame``` that keeps track of the number of days since a player last played a game. Some of these players were inactive for up to 2-3 years before they even played a game. We can see the following things above -
* Engagement is highest on game day and then declines significantly with time
* The drop in engagement is not steady. There are several peaks observable around the 200, 800 and 1000 day marks. 
* Looking at the days immediately after a game, engagement initially goes down, picks up again around the 3-5 day period and then goes down again
* The rise in engagement for players after remaining inactive for 1000 days was a bit perplexing, so we filtered the data to identify these players. 22-year-old Brailyn Marquez & 23-year-old Alex Kirilloff had the highest engagement scores
* A quick Google search suggested that player transfers & players returning from injury after a long break are among the reasons for the increase in engagement. 
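The run-length logic behind ```daysSinceLastGame``` is easier to follow on a tiny example. This is a simplified sketch of the same idea as `count_consecutive_items_n_cols`, not the function itself, and the `toy` frame is made up. Rows are assumed to be sorted by player and date.

```python
import pandas as pd

# Made-up data, already sorted by playerId and date (one row per player-day)
toy = pd.DataFrame({
    "playerId": [1, 1, 1, 1, 1, 2, 2, 2],
    "gameday": [False, True, False, False, True, False, False, True],
})

# A new run starts whenever the player or the gameday flag changes
run_start = (
    (toy["playerId"] != toy["playerId"].shift())
    | (toy["gameday"] != toy["gameday"].shift())
)
run_id = run_start.cumsum()

# Position within each run (1, 2, 3, ...), then reset to 0 on game days
toy["daysSinceLastGame"] = toy.groupby(run_id).cumcount() + 1
toy.loc[toy["gameday"], "daysSinceLastGame"] = 0

print(toy)
```

Note that a run of non-game days before a player's first game is counted from the start of the data rather than from an actual game, which is one way very large values of ```daysSinceLastGame``` can arise.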

In [None]:
avg_gp_between_games=immediate_drop.groupby(['daysSinceLastGame']).agg({'playerId':'count'}).reset_index()
#avg_gp_between_games

In [None]:
# team_gp=player_engagement_with_info.groupby('year').rosterTeamName.nunique()
# team_gp

### Top Players

Let's try to understand who the most successful players are in the MLB over the years 2018-2021 and which positions they occupy on the field. 

In [None]:
player_high=player_engagement_with_info.groupby('playerId').agg({'targetAvg':'mean'}).reset_index()
player_high.sort_values(by='targetAvg',inplace=True,ascending=False)
player_high3=player_high.head(3)
player_high3

top3_players=player_engagement_with_info.loc[player_engagement_with_info['playerId'].isin(player_high3.playerId)]
top3_players=top3_players.groupby(['playerId','year','month']).agg({'targetAvg':'mean'}).reset_index()
top3_players['month_year'] = top3_players.year.astype(str) + '_'+top3_players.month.astype(str)

players_info=players[['playerId','playerName','primaryPositionName']]

top3_players=pd.merge(left=top3_players,right=players_info,left_on='playerId',right_on='playerId',how='inner')


px.line(top3_players,x='month_year',y='targetAvg',color='playerName',title='targetAvg Engagement patterns for Top 3 Players')

In [None]:
player_high100=player_high.head(100)

top100_players=pd.merge(left=player_high100,right=players_info,left_on='playerId',right_on='playerId',how='inner')
top100_players

fig=px.scatter(top100_players,y='targetAvg',color='primaryPositionName',title='targetAvg scores for top 100 players')
fig.show()

fig1 = px.pie(top100_players, values='targetAvg', names='primaryPositionName', title='Breakdown of top 100 players by position')

fig1.show()

Findings from the charts above - 
* **Mike Trout, Bryce Harper & Aaron Judge stand out as the top 3 players during the period of 2018-2021**. Each of them had one outstanding year where their targetAvg values were much higher than in other years. 
* As we noticed in the overall monthly targetAvg chart, the targetAvg values for these top players were highest in March & dipped to their lowest in October. 
* I also took a look at the 100 players with the highest targetAvg values. **The top 4 players all turn out to be Designated Hitters**. A Designated Hitter is a player who bats in place of the pitcher. The targetAvg values for these top 4 players are significantly higher than for the rest of the group
* I also looked at the breakdown of the top 100 players by position. Note that the pie chart is drawn with targetAvg as its values, so each slice is a position's share of the summed targetAvg rather than a head count. **Designated Hitters make up about 30% & pitchers approximately another 30%**, which suggests that these players are much likelier to score higher on the engagement metric than players in other positions. 
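
A pie chart weighted by targetAvg and a plain head count by position can give different percentages. The sketch below shows both on toy data. The frame `toy_top` is a hypothetical stand-in for `top100_players` and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for top100_players.
toy_top = pd.DataFrame({
    "playerId": [1, 2, 3, 4, 5],
    "targetAvg": [40.0, 30.0, 10.0, 10.0, 10.0],
    "primaryPositionName": ["Designated Hitter", "Pitcher", "Pitcher",
                            "Outfielder", "Outfielder"],
})

# Share of players per position (head count).
count_share = toy_top["primaryPositionName"].value_counts(normalize=True)

# Share of summed targetAvg per position (what a pie with values='targetAvg' draws).
weighted_share = (
    toy_top.groupby("primaryPositionName")["targetAvg"].sum()
    / toy_top["targetAvg"].sum()
)
```

In this toy example the single Designated Hitter is 20% of the players but 40% of the summed targetAvg, so it is worth checking which of the two a chart is showing.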

Having taken a look at the top 100 players, let's take a look at all the players in the dataset and average scores by position. The image below shows positioning of different players on the baseball field -  

![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Baseball_positions.svg/1200px-Baseball_positions.svg.png)

In [None]:
position_avg=player_engagement_with_info.groupby(['primaryPositionName']).agg({'targetAvg':'mean','playerId':'nunique'}).reset_index()
position_avg.sort_values(by='targetAvg',inplace=True,ascending=False)

fig=px.bar(position_avg,x='primaryPositionName',y='targetAvg',title='Target Avg values by Position')
fig.show()
position_avg.sort_values(by='playerId',inplace=True,ascending=False)

fig1=px.bar(position_avg,x='primaryPositionName',y='playerId',title='Count of Players by Position')
fig1.show()


player_high_join=pd.merge(left=player_high,right=players,how='inner')
#player_high_join=player_high_join.loc[player_high_join['primaryPositionName']=='Pitcher']
player_high_join

px.violin(player_high_join,y='targetAvg',box=True,x='primaryPositionName',title='Distribution of targetAvg by primaryPosition')

Here again we see that the average target scores for **Designated Hitters are higher than for the rest of the positions**. One surprise, though, is that the average score for pitchers is quite low and well behind several other positions. This shows that **while some pitchers are among the most engaging players, there are many pitchers with low engagement over the season, bringing down the average of the entire group**. 

We also see that the distribution of players in the dataset is heavily skewed towards pitchers, which does look a bit odd. The violin plot above shows that although there are a lot of outliers in the pitchers group, the overall average for the group is quite low. 
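
When a group has many low values and a few large outliers, it can help to compare the mean with the median and the group size. This is a small sketch on made-up numbers. `toy_players` is hypothetical and only mimics the shape of `player_high_join`.

```python
import pandas as pd

# Hypothetical per-player averages: many low-engagement pitchers plus one outlier.
toy_players = pd.DataFrame({
    "primaryPositionName": ["Pitcher"] * 5 + ["Catcher"] * 3,
    "targetAvg": [1.0, 1.0, 1.0, 2.0, 45.0, 4.0, 5.0, 6.0],
})

# Player count, mean and median per position, sorted by the outlier-robust median.
summary = (
    toy_players.groupby("primaryPositionName")["targetAvg"]
    .agg(players="count", mean="mean", median="median")
    .sort_values("median", ascending=False)
)
```

Here the pitchers have the higher mean because of one outlier, but the lower median, which is the kind of pattern the violin plot hints at.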

### Player Demographics



In [None]:
player_data=player_engagement_with_info[['playerId','dailyDataDate','targetAvg']]
#player_data

player_demog_data=pd.merge(left=player_data,right=players,left_on='playerId',right_on='playerId',how='inner')

player_demog_data['DOB'] = pd.to_datetime(player_demog_data['DOB'], format="%Y-%m-%d")
player_demog_data['mlbDebutDate'] = pd.to_datetime(player_demog_data['mlbDebutDate'], format="%Y-%m-%d")

player_demog_data['age'] = (player_demog_data['dailyDataDate'] - player_demog_data['DOB']).dt.days
player_demog_data['age']=(player_demog_data['age']/365).astype(int)

player_demog_data['yearsSinceDebut'] = (player_demog_data['dailyDataDate'] - player_demog_data['mlbDebutDate']).dt.days
player_demog_data['yearsSinceDebut']=player_demog_data['yearsSinceDebut']/365
player_demog_data['yearsSinceDebut']=player_demog_data['yearsSinceDebut'].fillna(-1)
player_demog_data['yearsSinceDebut']=player_demog_data['yearsSinceDebut'].astype(int)
#player_demog_data

def plot_viz(colnames):
    # Subplot titles are assigned row by row: (1,1), (1,2), (2,1), (2,2)
    tup=('yearsSinceDebut vs targetAvg','Country of Birth vs targetAvg','Age vs targetAvg','Weight vs targetAvg')
    fig = make_subplots(rows=2, cols=2,subplot_titles=tup)
    for i, colname in enumerate(colnames):
        gp=player_demog_data.groupby(colname).agg({'targetAvg':'mean'}).reset_index()
        # Fill the 2x2 grid row by row so each trace lines up with its title
        row, col = divmod(i, 2)
        fig.add_trace(go.Bar(x=gp[colname],y=gp['targetAvg']),row=row+1,col=col+1)
    fig.update_layout(showlegend=False)
    fig.show()


# Same order as the subplot titles inside plot_viz
colnames=['yearsSinceDebut','birthCountry','age','weight']
plot_viz(colnames)


#gp=player_demog_data.groupby(colnames[0]).agg({'targetAvg':'mean'}).reset_index()
#gp.iloc[:,0]

* Player engagement generally tends to increase with age in MLB, which is a bit odd when you compare this to other sports. Players above 40 have very high engagement scores, but this could be due to a small number of players in that group driving the average up
* New & young players generally don't do as well as their more experienced counterparts. The average engagement scores tend to increase with a player's level of experience. 
* Overseas players generally tend to do better than American players on average. This could be because an overseas player has to be really good to become a pro MLB player, whereas in the USA it might be a bit easier to get noticed, and the high volume of American players lowers the average. 
* There is no clear relation between a player's weight & their targetAvg score
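
To check whether a high group average (like the one for players above 40) rests on only a handful of players, the group size can be reported next to the mean. This is a minimal sketch on toy data. `toy_demog` and the `MIN_PLAYERS` threshold are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical player-day rows shaped like player_demog_data.
toy_demog = pd.DataFrame({
    "playerId":  [1, 1, 2, 2, 3, 3, 4],
    "age":       [25, 25, 25, 25, 25, 25, 41],
    "targetAvg": [2.0, 4.0, 3.0, 3.0, 1.0, 5.0, 20.0],
})

# Mean engagement and number of distinct players per age.
age_summary = (
    toy_demog.groupby("age")
    .agg(targetAvg_mean=("targetAvg", "mean"), n_players=("playerId", "nunique"))
    .reset_index()
)

MIN_PLAYERS = 3  # arbitrary illustrative threshold
reliable = age_summary[age_summary["n_players"] >= MIN_PLAYERS]
```

In the toy data the age 41 group has the highest mean but contains a single player, so it is filtered out of `reliable`.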


### Team Performances
Let's now look at the average scores for the teams involved over the years. 

In [None]:
team_gp=player_engagement_with_info.groupby(['rosterTeamName','year']).agg({'targetAvg':'mean','winsTeam':'sum','numGamesTeam':'sum'}).reset_index()
team_gp['pct_wins']=team_gp['winsTeam']/team_gp['numGamesTeam']
team_gp['year'] = team_gp['year'].apply(str)

team_gp.sort_values(by='targetAvg',inplace=True,ascending=False)
team_gp["team_year"] = team_gp["rosterTeamName"].astype(str) + '_'+team_gp["year"].astype(str)

#team_gp

In [None]:
px.scatter(team_gp,x='pct_wins',y='targetAvg',text='team_year',title='Variation of targetAvg with Team WinPercent over Seasons')

In [None]:
team_month_yr_gp=player_engagement_with_info.groupby(['rosterTeamName','year','month']).agg({'targetAvg':'mean'}).reset_index()
team_month_yr_gp=team_month_yr_gp.sort_values(by=['year','month'],ascending=True)
team_month_yr_gp['month_yr']=team_month_yr_gp['month'].astype(str)+"_"+team_month_yr_gp['year'].astype(str)
team_month_yr_gp['month_yr1'] = pd.to_datetime(team_month_yr_gp[['year', 'month']].assign(DAY=1))

team_month_yr_gp

# my_raceplot = barplot(team_month_yr_gp,  item_column='rosterTeamName', value_column='targetAvg', time_column='month_yr')
# my_raceplot.plot(item_label = 'team name', value_label = 'number of followers', frame_duration = 800,title="Top 10 teams with highest targetAvg score from 2018-2021")


In [None]:
# my_raceplot = barplot(team_month_yr_gp,  item_column='rosterTeamName', value_column='targetAvg', time_column='month_yr1')
# my_raceplot.plot(item_label = 'team name', value_label = 'targetAvgScore', frame_duration = 800,title="Top 10 teams with highest targetAvg score from 2018-2021")


In [None]:
rankscores=player_engagement_with_info.groupby('divisionRankTeam').agg({'targetAvg':'mean'}).reset_index()
rankscores

leaguerankscores=player_engagement_with_info.groupby('leagueRankTeam').agg({'targetAvg':'mean'}).reset_index()
leaguerankscores

px.line(leaguerankscores,x='leagueRankTeam',y='targetAvg',title='Variation in TargetAvg with League Rank for Team')

In [None]:
team_player_gp=player_engagement_with_info.groupby(['rosterTeamName','playerName']).agg({'targetAvg':'mean'}).reset_index()
#team_player_gp.sort_values(['rosterTeamName','targetAvg'],ascending=['True','False'], inplace=True)

team_player_sum=team_player_gp.groupby('rosterTeamName').agg({'targetAvg':'sum'}).reset_index()
team_player_sum.columns=['rosterTeamName','targetAvgSum']

team_player_gp["rank"] = team_player_gp.groupby(['rosterTeamName'])["targetAvg"].rank("dense", ascending=False)
team_player_gp

team_player_gp_rank=pd.merge(left=team_player_gp,right=team_player_sum,how='inner')
# .copy() so that the in-place sort and new columns don't trigger SettingWithCopyWarning
team_player_gp_rank10=team_player_gp_rank.loc[team_player_gp_rank['rank']<=10].copy()
team_player_gp_rank10.sort_values(by=['rosterTeamName','rank'],inplace=True)

# running sum of targetAvg within each team, in rank order
team_player_gp_rank10['top10score']=team_player_gp_rank10.groupby(['rosterTeamName'])['targetAvg'].cumsum()
team_player_gp_rank10

# the rank 10 row carries the cumulative score of the team's top 10 players
team_player_gp_rank10_contribution=team_player_gp_rank10.loc[team_player_gp_rank10['rank']==10].copy()
team_player_gp_rank10_contribution
team_player_gp_rank10_contribution['pct_contribution_top10']=team_player_gp_rank10_contribution['top10score']/team_player_gp_rank10_contribution['targetAvgSum']
team_player_gp_rank10_contribution.sort_values(by='pct_contribution_top10',ascending=False,inplace=True)

team_avg=player_engagement_with_info.groupby(['rosterTeamName']).agg({'targetAvg':'mean'}).reset_index()


team_player_gp_rank10_contribution=pd.merge(left=team_player_gp_rank10_contribution,right=team_avg,how='inner',on='rosterTeamName')

px.scatter(team_player_gp_rank10_contribution,y='pct_contribution_top10',x='targetAvgSum',size='targetAvg_y',text='rosterTeamName',title='Contribution of Top10 Players towards sum of team targetAvg scores' )


* Each point in the first scatter plot shows a team's average target score for a particular season. **Familiar team names like the Yankees, Red Sox & Dodgers show up at the top, which shows that teams with a high win percentage generally tend to have a higher targetAvg as well**
* The 2018 Yankees were the most engaged team
* Based on the race plot (its code is commented out above), the list of top 10 teams has been fairly constant over time
* I also looked at the variation of targetAvg scores by leagueRankTeam. **Teams that have a better rank tend to have higher targetAvg scores as well.**
* We then looked at the contribution of the top 10 players towards the sum of targetAvg of all players in the team. The Yankees clearly stand out here - they have the highest sum of targetAvg by a fair margin and only 30% of their total comes from their top 10 players
* Teams with lower engagement, like the Astros, Cardinals & Rockies, are much more dependent on their top 10 players
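
The top-10 contribution can also be computed more compactly. The sketch below is an alternative to the rank-based cell above, not the code that produced the chart. `toy_team` and `top_n_share` are hypothetical names, and the example uses n=2 to keep the toy data small.

```python
import pandas as pd

# Hypothetical per-player averages within each team.
toy_team = pd.DataFrame({
    "rosterTeamName": ["A", "A", "A", "A", "B", "B", "B"],
    "playerName": ["a1", "a2", "a3", "a4", "b1", "b2", "b3"],
    "targetAvg": [10.0, 5.0, 3.0, 2.0, 8.0, 1.0, 1.0],
})

def top_n_share(df, n):
    """Share of each team's summed targetAvg that comes from its n highest players."""
    grouped = df.groupby("rosterTeamName")["targetAvg"]
    total = grouped.sum()
    top = grouped.apply(lambda s: s.nlargest(n).sum())
    return (top / total).rename("pct_contribution_top_n")

share = top_n_share(toy_team, n=2)
```

Unlike filtering on a dense rank equal to 10, `nlargest` always takes exactly n rows per team (or all of them if the team has fewer), so ties and small rosters don't drop a team from the result.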

Let's also quickly take a look at variation of targetAvg scores by game types as well 

In [None]:
gametype=player_engagement_with_info.groupby('gameType').agg({'targetAvg':'mean'}).reset_index()
px.bar(gametype,x='gameType',y='targetAvg',title='Variation of targetAvg with GameType')

In [None]:
seasontype=player_engagement_with_info.groupby('seasonPart').agg({'targetAvg':'mean'}).reset_index()
px.bar(seasontype,x='seasonPart',y='targetAvg',title='Variation of targetAvg with seasonPart')

The preseason period tends to have the highest engagement score & the postseason has the lowest targetAvg. The targetAvg scores tend to remain high during the regular season. Let's now take a look at pitching & batting stats to see how they impact targetAvg scores. 

### Pitching & Batting Stats 

There are lots of detailed stats around batting & pitching that are available in the dataset. Let's try to see how good & bad performances lead to engagement. 

In [None]:
player_engagement_with_info.sort_values(by=['playerId','dailyDataDate'],inplace=True)
#player_engagement_with_info

In [None]:
pitching_stats=player_engagement_with_info.groupby(['pitchingGameScore']).agg({'targetAvg':'mean','playerId':'count'}).reset_index()
pitching_stats

px.scatter(pitching_stats,x='pitchingGameScore',y='targetAvg',title='Variation in targetAvg scores with Tom Tango pitchingGameScore')


In the JSON unpacking code provided by the competition organizers, we see them calculating an aggregate pitching statistic, Tom Tango's pitchingGameScore, which gives us an impression of a pitcher's performance. Here is how it is calculated - 

pitchingGameScore = 40 + 2 * outs + 1 * strikeOutsPitching - 2 * baseOnBallsPitching - 2 * hitsPitching - 3 * runsPitching - 6 * homeRunsPitching

As we can see above, engagement scores are high for both poor & great pitching performances. However, there is a cluster of scores in the 60-85 range where targetAvg values are lower. Since this doesn't show a very clear pattern in targetAvg, we may need to look at individual pitching statistics. 
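
To get a feel for the scale of the game score, here is a small sketch of the formula as a plain function. It assumes the second -2 term applies to hits allowed, as in Tango's published game score. The function name and argument names are made up for this example.

```python
def pitching_game_score(outs, strike_outs, walks, hits, runs, home_runs):
    """Tom Tango's game score for one pitching appearance (illustrative helper)."""
    return (40
            + 2 * outs
            + 1 * strike_outs
            - 2 * walks
            - 2 * hits      # assumption: this term is hits allowed
            - 3 * runs
            - 6 * home_runs)

# A strong start: 7 innings (21 outs), 8 strikeouts, 1 walk, 4 hits, 1 run, no home runs.
good_start = pitching_game_score(21, 8, 1, 4, 1, 0)

# A rough start: 3 innings (9 outs), 2 strikeouts, 3 walks, 8 hits, 6 runs, 2 home runs.
bad_start = pitching_game_score(9, 2, 3, 8, 6, 2)
```

The strong start scores 77 and the rough one 8, so the 60-85 band mentioned above corresponds to good but not exceptional outings.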

In [None]:
run_stats=player_engagement_with_info.groupby(['runsScored','homeRuns']).agg({'targetAvg':'mean','target1':'count'}).reset_index()
run_stats['homeRuns']=run_stats['homeRuns'].apply(str)
run_stats.columns=['runsScored','homeRuns','targetAvg','Innings']

team_run_stats=player_engagement_with_info.groupby(['runsScoredTeam']).agg({'targetAvg':'mean','target1':'count'}).reset_index()
team_run_stats.columns=['runsScoredTeam','targetAvg','Innings']
#team_run_stats


In [None]:
fig = px.bar(run_stats, x="runsScored", y="targetAvg",
             color='homeRuns', barmode='group',
             height=400)
fig.update_layout(title='Variation of targetAvg with Player runsScored & homeRuns scored in Match')
fig.show()

rbi_stats=player_engagement_with_info.groupby(['rbi']).agg({'targetAvg':'mean'}).reset_index()

fig=px.line(rbi_stats,x='rbi',y='targetAvg',title='Variation in targetAvg values with RunsBattedIn by Player')
fig.show()

fig1 = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig1.add_trace(
    go.Scatter(x=team_run_stats['runsScoredTeam'], y=team_run_stats['targetAvg'], name="targetAvg"),
    secondary_y=True,
)

fig1.add_trace(
    go.Scatter(x=team_run_stats['runsScoredTeam'], y=team_run_stats['Innings'], name="Innings"),
    secondary_y=False,
)
fig1.update_layout(title='Variation of Runs Scored by a Team vs targetAvg')
fig1.show()

Let's now take a look at the impact of batting stats on targetAvg values - 
* **The general rule seems to be that players who score more runs in a match tend to have higher engagement**. Baseball is a low scoring game and scoring a run is a big event. Therefore more runs generally means more engagement.
* **Home runs are also a rare event in baseball. The greater the number of home runs a player hits in the game, the higher the engagement.**
* Another statistic for batter performance is RBI - Runs Batted In. It counts the runs that score as a result of a player's batting. Here too, the higher the RBI number, the greater the engagement for a player
* Also, the **team as a whole tends to have higher engagement levels when its runsScored are higher**.

Since this is essentially a time series dataset, we should also look at autocorrelations for target variables - autocorrelation measures correlation of a time series value at any point with its previous values. For every player, we will calculate autocorrelation at various lags and look at the distribution of autocorrelation for different lag values. 
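
Before running this on the full dataset, here is a minimal sketch of what per-player autocorrelation looks like on toy data. The frame `toy` is made up. `Series.autocorr(lag)` is the Pearson correlation between a series and itself shifted by `lag` rows.

```python
import pandas as pd

# Two hypothetical players: one steadily rising series, one that flips sign every day.
toy = pd.DataFrame({
    "playerId": [1] * 6 + [2] * 6,
    "targetAvg": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
                  1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})

# Lag-1 and lag-2 autocorrelation for each player.
lag1 = toy.groupby("playerId")["targetAvg"].apply(lambda x: x.autocorr(lag=1))
lag2 = toy.groupby("playerId")["targetAvg"].apply(lambda x: x.autocorr(lag=2))

# The same quantity computed by hand for player 2.
s = toy.loc[toy["playerId"] == 2, "targetAvg"]
manual_lag1 = s.corr(s.shift(1))
```

The rising series has a lag-1 autocorrelation of about +1, while the alternating series is about -1 at lag 1 and +1 at lag 2.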

In [None]:
# out_stats=player_engagement_with_info.groupby(['strikeOuts']).agg({'targetAvg':'mean','target1':'count'}).reset_index()
# out_stats.columns=['strikeOuts','targetAvg','Num_Days']
# #px.line(out_stats,x='strikeOuts',y='targetAvg')

# fig = make_subplots(specs=[[{"secondary_y": True}]])

# # Add traces
# fig.add_trace(
#     go.Scatter(x=out_stats['strikeOuts'], y=out_stats['targetAvg'], name="targetAvg"),
#     secondary_y=True,
# )

# fig.add_trace(
#     go.Scatter(x=out_stats['strikeOuts'], y=out_stats['Num_Days'], name="Num_Days"),
#     secondary_y=False,
# )
# fig.update_layout(xaxis_title='Count of Strike Outs',title='StrikeOuts in a Game vs Target Score')
# fig.show()

# pitch_run_stats=player_engagement_with_info.groupby(['runsPitching']).agg({'targetAvg':'mean','target1':'count'}).reset_index()
# pitch_run_stats.columns=['runsPitching','targetAvg','Num_Days']

# fig = make_subplots(specs=[[{"secondary_y": True}]])

# # Add traces
# fig.add_trace(
#     go.Scatter(x=pitch_run_stats['runsPitching'], y=pitch_run_stats['targetAvg'], name="targetAvg"),
#     secondary_y=True,
# )

# fig.add_trace(
#     go.Scatter(x=pitch_run_stats['runsPitching'], y=pitch_run_stats['Num_Days'], name="Num_Days"),
#     secondary_y=False,
# )
# fig.update_layout(xaxis_title='Count of Runs Allowed',title='Runs Allowed in a Game vs targetAvg')
# fig.show()


In [None]:
### group by year?
player_avg = player_engagement_with_info.groupby(["playerId"])[["target1","target2","target3","target4"]].mean().reset_index()
player_median = player_engagement_with_info.groupby(["playerId"])[["target1","target2","target3","target4"]].median().reset_index()
player_mean=pd.concat([player_avg, player_median], axis=1).groupby(axis=1, level=0).mean()
player_mean["target1"] = .85 * player_mean["target1"]
player_mean["target2"] = .85 * player_mean["target2"]
player_mean["target3"] = .85 * player_mean["target3"]
player_mean["target4"] = .85 * player_mean["target4"]
gc.collect()
print(player_mean.shape)
player_mean

In [None]:
#player_engagement_with_info.columns

In [None]:
data=player_engagement_with_info[['dailyDataDate','playerId','target1','target2','target3','target4','targetAvg','quarter','gameday']]

In [None]:
autocorrel_list=list()
for i in range(31):
    ser=data.groupby('playerId')['targetAvg'].apply(lambda x: x.autocorr(lag=i))
    df=ser.to_frame().reset_index()
    df['lag']=i
    autocorrel_list.append(df)
auto_frame=pd.concat(autocorrel_list).reset_index(drop=True)

#autocorrel_list

In [None]:
# Subplot titles lag_1 ... lag_30
tup=tuple('lag_'+str(i) for i in range(1,31))

fig = make_subplots(
    rows=5, cols=6,shared_yaxes=True,
    subplot_titles=tup)

for i in range(1,31):
    lag_frame=auto_frame.loc[auto_frame['lag']==i]
    # Fill the 5x6 grid row by row: lags 1-6 in row 1, lags 7-12 in row 2, ...
    k, j = divmod(i-1, 6)
    fig.add_trace(
    go.Histogram(x=lag_frame['targetAvg'],name=str(i)),
    row=k+1, col=j+1
    )

fig.update_layout(height=1000,title_text="Autocorrelation Plots for player targetAvg for various lags",showlegend=False)
fig.show()

In [None]:
lag_gp=auto_frame.groupby('lag').agg({'targetAvg':['mean','median']}).reset_index()
lag_gp.columns=['lag','mean','median']
fig=go.Figure()
fig.add_trace(go.Scatter(x=lag_gp['lag'],y=lag_gp['mean'],mode='lines',name='mean'))
fig.add_trace(go.Scatter(x=lag_gp['lag'],y=lag_gp['median'],mode='lines',name='median'))
fig.update_layout(title='Mean & Median of player Autocorrelation distributions across various lag periods',xaxis_title='lag')


* **The autocorrelation distributions seem to become more left skewed as the lag increases**
* Mean autocorrelations are reasonably high up to lag 5 but drop significantly afterwards. This information will be important for us at the Feature Engineering stage since we would be looking to incorporate lag values as features in the model.
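
As a rough idea of what lag features could look like, here is a minimal sketch using a per-player shift. The frame `toy` and the `targetAvg_lag_*` column names are assumptions for illustration. The baseline model built below does not use them.

```python
import pandas as pd

# Hypothetical player-day rows, sorted so that shift() moves back one day per player.
toy = pd.DataFrame({
    "playerId": [1, 1, 1, 1, 2, 2, 2],
    "dailyDataDate": pd.to_datetime(
        ["2021-04-01", "2021-04-02", "2021-04-03", "2021-04-04",
         "2021-04-01", "2021-04-02", "2021-04-03"]),
    "targetAvg": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0],
}).sort_values(["playerId", "dailyDataDate"])

# One column per lag; values never cross from one player to another.
for lag in range(1, 3):
    toy[f"targetAvg_lag_{lag}"] = toy.groupby("playerId")["targetAvg"].shift(lag)
```

The first rows of each player stay NaN because there is no earlier value to look back to. For forecasting, only lags that are actually known at prediction time should be used.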

## Feature Engineering<a name="feature"></a>
Some of the exploratory analysis we did helped us gain a better understanding of the data. Now we would be looking to convert some of that information into features and utilize that to fit a model on the data to generate predictions. 

These are the features we would go with for starters - 

### Player Features 
* Number of Runs Scored
* Home Runs
* Birth Country
* Starting Position
* Roster Status

### Target Features 
* Mean of Target Variables
* Min of Target Variables
* Max of Target Variables
* Std. Dev of Target Variables

These features should help us generate a good baseline model. We will now prepare the datasets with these features generated. 


In [None]:
player_engagement_with_info.columns

In [None]:
data=player_engagement_with_info[['dailyDataDate','playerId','target1','target2','target3','target4','quarter','year','month','runsScored', 'homeRuns','gamesStartedPitching','birthCountry','primaryPositionName','rosterTeamName','rosterStatus']]

In [None]:
data_avg=player_engagement_with_info.loc[player_engagement_with_info['engagementMetricsDate']<'2021-04-01']
data_avg

player_target_stats=data_avg.groupby('playerId').agg({'target1':['min','max','mean','std'],'target2':['min','max','mean','std'],'target3':['min','max','mean','std'],'target4':['min','max','mean','std']}).reset_index()
player_target_stats.columns=['playerId','target1_min','target1_max','target1_mean','target1_std','target2_min','target2_max','target2_mean','target2_std','target3_min','target3_max','target3_mean','target3_std','target4_min','target4_max','target4_mean','target4_std']
player_target_stats

player_score_stats=data_avg.groupby('playerId').agg({'runsScored':'sum','homeRuns':'sum'}).reset_index()
player_score_stats

In [None]:
import gc
del data_avg
gc.collect()

In [None]:
# .copy() so that adding the label columns below doesn't trigger SettingWithCopyWarning
player=player_engagement_with_info[['engagementMetricsDate','playerId','birthCountry','primaryPositionName','rosterTeamId','rosterStatus','target1','target2','target3','target4']].copy()
# label encoding
country2num = {c: i for i, c in enumerate(player['birthCountry'].unique())}
position2num = {c: i for i, c in enumerate(player['primaryPositionName'].unique())}
teamid2num = {c: i for i, c in enumerate(player['rosterTeamId'].unique())}
status2num = {c: i for i, c in enumerate(player['rosterStatus'].unique())}
player['label_country_id'] = player['birthCountry'].map(country2num).fillna(-1)
player['label_position_id'] = player['primaryPositionName'].map(position2num).fillna(-1)
player['label_team_id'] = player['rosterTeamId'].map(teamid2num).fillna(-1)
player['label_status_id'] = player['rosterStatus'].map(status2num).fillna(-1)

player_info=player[['playerId','label_country_id','label_position_id']]
player_info
#player



In [None]:
player_target=pd.merge(left=player,right=player_target_stats,on='playerId',how='inner')
del player
gc.collect()
player_score_stats1=pd.merge(left=player_target,right=player_score_stats,on='playerId',how='inner')
player_score_stats1

In [None]:
del data
del player_engagement_with_info
gc.collect()

In [None]:
player_score_stats1.columns

In [None]:
feature_cols=['engagementMetricsDate', 'playerId', 'target1_min', 'target1_max',
        'label_country_id','label_position_id', 'label_team_id', 'label_status_id',
       'target1_mean', 'target1_std', 'target2_min', 'target2_max',
       'target2_mean', 'target2_std', 'target3_min', 'target3_max',
       'target3_mean', 'target3_std', 'target4_min', 'target4_max',
       'target4_mean', 'target4_std', 'runsScored', 'homeRuns']
target_cols=['target1','target2', 'target3', 'target4']


In [None]:
index=player_score_stats1['engagementMetricsDate']<'2021-04-01'
X_train=player_score_stats1.loc[index,feature_cols]
Y_train=player_score_stats1.loc[index,target_cols]

X_test=player_score_stats1.loc[~index,feature_cols]
Y_test=player_score_stats1.loc[~index,target_cols]

del X_train['engagementMetricsDate']
del X_test['engagementMetricsDate']


In [None]:
# from skmultilearn.model_selection import iterative_train_test_split
# from sklearn.multioutput import MultiOutputClassifier
# from sklearn.ensemble import RandomForestRegressor

# model = RandomForestRegressor(n_estimators=200,max_depth=10, random_state=0,min_samples_split=10)
# model.fit(X_train, Y_train)

# # #Generating predictions from Random Fores Models
# # pred_rf=model.predict(X_test)
# # pred_rf_proba=model.predict_proba(X_test)

# # feat_importances = pd.Series(model.feature_importances_, index=feature_list)
# # feat_importances=feat_importances.sort_values()
# # feat_importances.plot(kind='barh',figsize=(16,16))#Plotting feature importance

# print('Model Accuracy')
# #print(model.score(X_test,Y_test))



## Model Development & Hyperparameter Tuning<a name="model"></a>

* We will use LightGBM as the model for prediction. 
* We will train four separate models, one for each of the target variables
* Since this is a time series competition, we need to be very careful about leakage of features. Therefore we will use data up to 31 March 2021 as training data and the data from April 2021 onwards for validation
* We obtained tuned parameters through hyperparameter tuning with Optuna
* The resulting models are then used to generate predictions on the final data
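
The date-based split and the averaged MAE over targets can be sketched as follows. This uses toy data and a naive mean baseline instead of LightGBM. `toy`, `mean_column_mae` and the two-target setup are assumptions made to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical rows on both sides of the cutoff date.
toy = pd.DataFrame({
    "engagementMetricsDate": pd.to_datetime(
        ["2021-03-30", "2021-03-31", "2021-04-01", "2021-04-02"]),
    "target1": [1.0, 2.0, 3.0, 4.0],
    "target2": [0.0, 0.0, 1.0, 1.0],
})
target_cols = ["target1", "target2"]

# Everything before the cutoff is training data, the rest is validation.
cutoff = pd.Timestamp("2021-04-01")
is_train = toy["engagementMetricsDate"] < cutoff
train, valid = toy[is_train], toy[~is_train]

def mean_column_mae(y_true, y_pred):
    """MAE computed per target column, then averaged across columns."""
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return float(errors.mean(axis=0).mean())

# Naive baseline: predict each target's training mean for every validation row.
baseline = np.tile(train[target_cols].mean().to_numpy(), (len(valid), 1))
score = mean_column_mae(valid[target_cols], baseline)
```

Splitting on the date rather than at random keeps future rows out of the training set, which is the leakage concern mentioned above.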


In [None]:
import lightgbm as lgbm
from sklearn.metrics import mean_absolute_error
#https://www.kaggle.com/columbia2131/mlb-lightgbm-starter-dataset-code-en-ja
    
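def fit_lgbm(x_train, y_train, x_valid, y_valid, params: dict=None, verbose=100):
    # `verbose` is used both as the early stopping patience and as the logging period
    model = lgbm.LGBMRegressor(**(params or {}))
    # Recent LightGBM releases (>= 3.3) take early stopping and logging as callbacks;
    # the early_stopping_rounds / verbose arguments of fit() were removed in 4.0
    model.fit(x_train, y_train, 
        eval_set=[(x_valid, y_valid)],  
        callbacks=[lgbm.early_stopping(stopping_rounds=verbose),
                   lgbm.log_evaluation(period=verbose)])
    oof_pred = model.predict(x_valid)
    score = mean_absolute_error(y_valid, oof_pred)
    print('mae:', score)
    return oof_pred, model, score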



In [None]:
##Hyperparameter Optimization using Optuna

# import optuna

# # 1. Define an objective function to be minimized.
# def objective1(trial):
#     ...

#     # 2. Suggest values of the hyperparameters using a trial object.
#     param = {
#         'objective': 'regression',
#         'metric': 'mae',
#         'verbosity': -1,
#         'boosting_type': 'gbdt',
#         'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
#         'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
#         'num_leaves': trial.suggest_int('num_leaves', 2, 256),
#         'feature_fraction': trial.suggest_uniform('feature_fraction', 0.4, 1.0),
#         'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.4, 1.0),
#         'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
#         'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
#          'n_estimators': 100,
#         'feature_pre_filter':False
#     }

#     gbm = lgbm.train(param, train_data1)
#     preds = gbm.predict(X_test)
#     score = mean_absolute_error(preds, Y_test['target1'])

#     ...
#     return score
# def objective2(trial):
#     ...

#     # 2. Suggest values of the hyperparameters using a trial object.
#     param = {
#         'objective': 'regression',
#         'metric': 'mae',
#         'verbosity': -1,
#         'boosting_type': 'gbdt',
#         'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
#         'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
#         'num_leaves': trial.suggest_int('num_leaves', 2, 256),
#         'feature_fraction': trial.suggest_uniform('feature_fraction', 0.4, 1.0),
#         'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.4, 1.0),
#         'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
#         'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
#          'n_estimators': 100,
#         'feature_pre_filter':False
#     }

#     gbm = lgbm.train(param, train_data2)
#     preds = gbm.predict(X_test)
#     score = mean_absolute_error(preds, Y_test['target2'])

#     ...
#     return score
# def objective3(trial):
#     ...

#     # 2. Suggest values of the hyperparameters using a trial object.
#     param = {
#         'objective': 'regression',
#         'metric': 'mae',
#         'verbosity': -1,
#         'boosting_type': 'gbdt',
#         'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
#         'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
#         'num_leaves': trial.suggest_int('num_leaves', 2, 256),
#         'feature_fraction': trial.suggest_uniform('feature_fraction', 0.4, 1.0),
#         'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.4, 1.0),
#         'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
#         'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
#          'n_estimators': 100,
#         'feature_pre_filter':False
#     }

#     gbm = lgbm.train(param, train_data3)
#     preds = gbm.predict(X_test)
#     score = mean_absolute_error(preds, Y_test['target3'])

#     ...
#     return score
# def objective4(trial):
#     ...

#     # 2. Suggest values of the hyperparameters using a trial object.
#     param = {
#         'objective': 'regression',
#         'metric': 'mae',
#         'verbosity': -1,
#         'boosting_type': 'gbdt',
#         'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
#         'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
#         'num_leaves': trial.suggest_int('num_leaves', 2, 256),
#         'feature_fraction': trial.suggest_uniform('feature_fraction', 0.4, 1.0),
#         'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.4, 1.0),
#         'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
#         'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
#          'n_estimators': 100,
#         'feature_pre_filter':False
#     }

#     gbm = lgbm.train(param, train_data4)
#     preds = gbm.predict(X_test)
#     score = mean_absolute_error(preds, Y_test['target4'])

#     ...
#     return score


# train_data1 = lgbm.Dataset(X_train, label=Y_train['target1'])
# train_data2 = lgbm.Dataset(X_train, label=Y_train['target2'])
# train_data3 = lgbm.Dataset(X_train, label=Y_train['target3'])
# train_data4 = lgbm.Dataset(X_train, label=Y_train['target4'])


# # 3. Create a study object and optimize the objective function.
# study1 = optuna.create_study(direction='minimize')
# study1.optimize(objective1, n_trials=5,gc_after_trial=True,n_jobs=-1)

# study2 = optuna.create_study(direction='minimize')
# study2.optimize(objective2, n_trials=5,gc_after_trial=True,n_jobs=-1)

# study3 = optuna.create_study(direction='minimize')
# study3.optimize(objective3, n_trials=5,gc_after_trial=True,n_jobs=-1)

# study4 = optuna.create_study(direction='minimize')
# study4.optimize(objective4, n_trials=5,gc_after_trial=True,n_jobs=-1)


In [None]:
# params1=study1.best_params
# optuna.visualization.plot_param_importances(study1)


In [None]:
# fig = optuna.visualization.plot_optimization_history(study1)
# fig.show()

In [None]:
# params1=study1.best_params
# params2=study2.best_params
# params3=study3.best_params
# params4=study4.best_params

# with open('params1.pickle', 'wb') as handle:
#     pickle.dump(params1, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('params2.pickle', 'wb') as handle:
#     pickle.dump(params2, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('params3.pickle', 'wb') as handle:
#     pickle.dump(params3, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('params4.pickle', 'wb') as handle:
#     pickle.dump(params4, handle, protocol=pickle.HIGHEST_PROTOCOL)
    


In [None]:
# with open('params1.pickle', 'rb') as handle:
#      params1 = pickle.load(handle)
# with open('params2.pickle', 'rb') as handle:
#      params2 = pickle.load(handle)
# with open('params3.pickle', 'rb') as handle:
#      params3 = pickle.load(handle)
# with open('params4.pickle', 'rb') as handle:
#      params4 = pickle.load(handle)

In [None]:
params1={'objective':'mae','lambda_l1': 2.4720578317437477e-05,
 'lambda_l2': 1.3527755261208504e-06,
 'num_leaves': 40,
 'feature_fraction': 0.7554047245795381,
 'bagging_fraction': 0.5080066015078072,
 'bagging_freq': 3,
 'min_child_samples': 100}
params2={'objective':'mae','lambda_l1': 0.0014725905491640504,
 'lambda_l2': 5.713092762240107,
 'num_leaves': 47,
 'feature_fraction': 0.462021949863225,
 'bagging_fraction': 0.4500482872517192,
 'bagging_freq': 4,
 'min_child_samples': 24}
params3={'objective':'mae','lambda_l1': 1.646629475621938e-07,
 'lambda_l2': 0.00020089914770634004,
 'num_leaves': 42,
 'feature_fraction': 0.6792583947012405,
 'bagging_fraction': 0.4772814797729401,
 'bagging_freq': 4,
 'min_child_samples': 99}
params4={'objective':'mae','lambda_l1': 0.007010740786027014,
 'lambda_l2': 8.864818411173501e-08,
 'num_leaves': 46,
 'feature_fraction': 0.6146671105020214,
 'bagging_fraction': 0.7314001423885812,
 'bagging_freq': 2,
 'min_child_samples': 75}


In [None]:

oof1, model1, score1 = fit_lgbm(
    X_train, Y_train['target1'],
    X_test, Y_test['target1'],
    params1
)

oof2, model2, score2 = fit_lgbm(
    X_train, Y_train['target2'],
    X_test, Y_test['target2'],
    params2
)
oof3, model3, score3 = fit_lgbm(
    X_train, Y_train['target3'],
    X_test, Y_test['target3'],
    params3
)
oof4, model4, score4 = fit_lgbm(
    X_train, Y_train['target4'],
    X_test, Y_test['target4'],
    params4
)

score = (score1+score2+score3+score4) / 4
print(f'score: {score}')
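
The final `score` is the plain average of the four per-target scores. Assuming `fit_lgbm` (defined earlier in the notebook) returns the validation mean absolute error for each target, that average is a mean column-wise MAE. A small standard-library sketch of the calculation on made-up numbers follows. `mean_columnwise_mae` is a hypothetical helper, not a function from the notebook.

```python
from statistics import mean


def mean_columnwise_mae(y_true, y_pred):
    """Return (overall score, per-target MAE).

    y_true and y_pred map a target name to a list of values of equal length.
    """
    per_target = {
        name: mean(abs(t - p) for t, p in zip(y_true[name], y_pred[name]))
        for name in y_true
    }
    # Overall score: unweighted mean of the per-target MAEs
    return mean(per_target.values()), per_target


# Made-up values for two targets, for illustration only
y_true = {'target1': [0.0, 10.0], 'target2': [5.0, 5.0]}
y_pred = {'target1': [1.0, 8.0], 'target2': [5.0, 9.0]}

overall, per_target = mean_columnwise_mae(y_true, y_pred)
```

Because every target carries equal weight, a large error on one target cannot be hidden by small errors on the others.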


In [None]:
lgbm.plot_importance(model1)

## Final Submission<a name="submit"></a>

We will use the competition API to get access to the hidden test set and make predictions. 

In [None]:
import mlb
env = mlb.make_env() # initialize the environment
iter_test = env.iter_test() # iterator which loops over each date in test set


In [None]:
rosters_cols = ['playerId','gameDate','teamId','statusCode','status']
player_info=player_info.reset_index(drop=True)
player_info=player_info.drop_duplicates(subset=['playerId'], keep='first')


In [None]:
for (test_df, sample_prediction_df) in iter_test:
    
#     test_df=test_df.reset_index(drop=True)
#     if test_df['rosters'].iloc[0] == test_df['rosters'].iloc[0]:
#         test_rosters = pd.DataFrame(eval(test_df['rosters'].iloc[0]))
#     else:
#         test_rosters = pd.DataFrame({'playerId': sample_prediction_df['playerId']})
        
#     for col in rosters_cols:
#         if col == 'playerId': continue
#         test_rosters[col] = np.nan
#     test_rosters['label_team_id'] = test_rosters['teamId'].map(teamid2num).fillna(-1)
#     test_rosters['label_status_id'] = test_rosters['status'].map(status2num).fillna(-1)
    
#     player_rosters=test_rosters[['playerId','label_team_id','label_status_id']]
#     player_rosters=player_rosters.drop_duplicates(subset=['playerId'],keep='first')
    
    sample_prediction_df = sample_prediction_df.reset_index(drop=True)

    # Recover the integer playerId from the '<date>_<playerId>' key
    sample_prediction_df['playerId'] = (
        sample_prediction_df['date_playerId'].map(lambda x: int(x.split('_')[1]))
    )

    # Attach static player attributes
    sample_prediction_df_info = pd.merge(left=sample_prediction_df, right=player_info, on='playerId', how='left')

    # Roster features are not rebuilt at inference time (see the commented-out
    # block above), so both label columns are set to a constant 0
    sample_prediction_df_info['label_team_id'] = 0
    sample_prediction_df_info['label_status_id'] = 0

    # Attach the per-player target and score statistics
    sample_prediction_df_tgt = pd.merge(left=sample_prediction_df_info, right=player_target_stats, on='playerId', how='left')
    sample_prediction_df_tgt_stats = pd.merge(left=sample_prediction_df_tgt, right=player_score_stats, on='playerId', how='left')

    # Placeholder date so that indexing with feature_cols succeeds;
    # the column is dropped again before prediction
    sample_prediction_df_tgt_stats['engagementMetricsDate'] = '2020-04-01'
    test_X = sample_prediction_df_tgt_stats[feature_cols].drop(columns=['engagementMetricsDate'])
    test_X = test_X.fillna(-1)

    # predict
    pred1 = model1.predict(test_X)
    pred2 = model2.predict(test_X)
    pred3 = model3.predict(test_X)
    pred4 = model4.predict(test_X)

    # merge submission, clipping predictions to the 0-100 target scale
    sample_prediction_df['target1'] = np.clip(pred1, 0, 100)
    sample_prediction_df['target2'] = np.clip(pred2, 0, 100)
    sample_prediction_df['target3'] = np.clip(pred3, 0, 100)
    sample_prediction_df['target4'] = np.clip(pred4, 0, 100)
    sample_prediction_df = sample_prediction_df.fillna(0.)
    del sample_prediction_df['playerId']

    env.predict(sample_prediction_df)
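
Two small steps in the loop above are easy to get wrong: recovering `playerId` from the `date_playerId` key, and clipping predictions to the 0-100 target scale. A minimal standard-library sketch of both follows. The keys and raw predictions are made up, and `player_id_from_key` and `clip_to_scale` are hypothetical helpers that mirror the `split('_')` lambda and `np.clip` call used in the loop.

```python
def player_id_from_key(date_player_id):
    """Return the integer player ID from a '<date>_<playerId>' key."""
    return int(date_player_id.split('_')[1])


def clip_to_scale(value, low=0.0, high=100.0):
    """Clamp a single prediction to the closed interval [low, high]."""
    return max(low, min(high, value))


# Made-up keys and raw model outputs, for illustration only
keys = ['20210501_123456', '20210501_654321']
raw_preds = [-3.2, 104.5]

player_ids = [player_id_from_key(key) for key in keys]
clipped = [clip_to_scale(pred) for pred in raw_preds]
```

Clipping matters because a tree model trained with an MAE objective can still produce values slightly outside the range of the targets once predictions are combined or post-processed.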

In [None]:
X_test.isna().sum()

In [None]:
# if test_df['rosters'].iloc[0] == test_df['rosters'].iloc[0]:
#         test_rosters = pd.DataFrame(eval(test_df['rosters'].iloc[0]))
# else:
#     test_rosters = pd.DataFrame({'playerId': sample_prediction_df['playerId']})
#     for col in rosters.columns:
#         if col == 'playerId': continue
#         test_rosters[col] = np.nan
# test_rosters
# player_rosters=test_rosters[['playerId','teamId','status']]
# test_rosters


In [None]:
sample_prediction_df

In [None]:
player_info


I love working on sports analytics projects. If you'd like to see more of my work in this area, check out my blog: https://arpitsolanki.github.io/blog/

### Next Steps
* Run further analysis to better understand the data, e.g. the impact of player origin, average age and fluctuations in form on engagement scores
* Bring in additional datasets, such as FiveThirtyEight Elo ratings, to understand how the competitiveness of games affects engagement scores
* Convert the results of the analysis into meaningful features for prediction
* Create a model to predict engagement scores for the test dataset
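
As a starting point for the age analysis listed above, a player's age on a given date can be derived from the `DOB` field in players.csv. This is a minimal sketch that assumes `DOB` is an ISO-formatted `YYYY-MM-DD` string; `age_on` is a hypothetical helper and the dates are made up.

```python
from datetime import date


def age_on(dob_iso, as_of_iso):
    """Return the age in whole years on as_of_iso for a given date of birth."""
    dob = date.fromisoformat(dob_iso)
    as_of = date.fromisoformat(as_of_iso)
    # Subtract one year if this year's birthday has not happened yet
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


# Made-up date of birth, evaluated the day before and on the birthday
age_before = age_on('1990-07-15', '2021-07-14')
age_after = age_on('1990-07-15', '2021-07-15')
```

Computing age relative to each engagement date, rather than to a fixed date, keeps the feature consistent across the 2018-2021 training range.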

To be continued...

