<center><img src="https://stn2.tv/wp-content/uploads/2020/04/mlb-logo.jpg"></center>

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
#!pip install raceplotly
#from raceplotly.plots import barplot

# About the Competition🚩
<p style="font-size:15px">In this competition, you’ll predict how fans engage with MLB players’ digital content on a daily basis for a future date range. You’ll have access to player performance data, social media data, and team factors like market size. Successful models will provide new insights into what signals most strongly correlate with and influence engagement.

Imagine if you could predict MLB All Stars all season long or when each of a team’s 25 players has his moment in the spotlight. These insights are possible when you dive deeper into the fandom of America’s pastime. Be part of the first method of its kind to try to understand digital engagement at the player level in this granular, day-to-day fashion. Simultaneously help MLB build innovation more easily using Google Cloud’s data analytics, Vertex AI and MLOps tools. You could play a part in shaping the future of MLB fan and player engagement.

Submissions are evaluated on the mean column-wise mean absolute error (MCMAE). A mean absolute error is calculated for each of the four target variables and the score is the average of those four MAE values.

</p>
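
As a rough illustration of the metric described above, here is a minimal sketch of MCMAE on toy arrays. The function name `mcmae` and the toy numbers are my own assumptions for illustration, not part of the competition code.

```python
import numpy as np

def mcmae(y_true, y_pred):
    """Mean column-wise MAE: one MAE per target column, then the average of those."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    col_mae = np.abs(y_true - y_pred).mean(axis=0)  # one MAE per target column
    return col_mae.mean()

# Toy data (hypothetical): 2 rows, 4 target columns
y_true = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 4.0, 6.0, 8.0]])
y_pred = np.ones_like(y_true)

score = mcmae(y_true, y_pred)
print(score)
```

Because each column is averaged separately before the final mean, every target contributes equally to the score regardless of its scale.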

# Data Description

<div style="font-size:15px">
 We are given 7 CSV files:
<ul>
    <li><code>train.csv</code>: training set</li>
    <li><code>example_test.csv</code>: example of the test set</li>
    <li><code>example_sample_submission.csv</code>: example of a sample submission</li>
    <li><code>awards.csv</code>: awards won by players before 2018</li>
    <li><code>players.csv</code>: high-level information about all players</li>
    <li><code>seasons.csv</code>: start and end dates of all seasons in this dataset</li>
    <li><code>teams.csv</code>: high-level information about all MLB teams</li>
</ul>    
</div>

<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 2.0em;">
Note: Since this is a code competition, you must submit using the provided MLB Python time-series module, which ensures that models do not peek forward in time.
</div>

# EDA

In [None]:
players = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/players.csv')
seasons = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/seasons.csv')
awards = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/awards.csv')
teams = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/teams.csv')
train = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/train.csv')

<p style="font-size:15px">Let's take a peek at train.csv</p>

In [None]:
train.head()

In [None]:
train.info()

<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 2.0em;">
Note: The column nextDayPlayerEngagement contains the targets that we want to predict
</div>

In [None]:
train

I am using an unnested dataset created by @Ml_Bear. Do check out the original [kernel](https://www.kaggle.com/naotaka1128/creating-unnested-dataset).
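
To give an idea of what "unnesting" means here, below is a minimal sketch on a toy frame. It assumes the nested column stores a JSON string holding a list of records per date; the helper name `unnest` and the toy values are hypothetical, and the linked kernel remains the actual source of the pickles used in this notebook.

```python
import json
import pandas as pd

# Toy frame (hypothetical): one row per date, nested column holds a JSON string
toy = pd.DataFrame({
    "date": [20180101, 20180102],
    "nextDayPlayerEngagement": [
        '[{"playerId": 1, "target1": 0.1}, {"playerId": 2, "target1": 0.5}]',
        None,  # some dates may have no nested data
    ],
})

def unnest(df, col):
    """Expand a JSON-string column into one row per nested record."""
    frames = []
    for date, raw in zip(df["date"], df[col]):
        if pd.isna(raw):
            continue  # skip dates without nested data
        frame = pd.DataFrame(json.loads(raw))
        frame.insert(0, "date", date)  # keep the date each record came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

flat = unnest(toy, "nextDayPlayerEngagement")
print(flat)
```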

In [None]:
train_next_day = pd.read_pickle('../input/mlb-unnested/train_nextDayPlayerEngagement.pickle')
# pd.to_datetime avoids the unit-less 'datetime64' dtype, which recent pandas versions reject
train_next_day['engagementMetricsDate'] = pd.to_datetime(train_next_day['engagementMetricsDate'])

In [None]:
train_next_day

In [None]:
train_next_day.head()

<p style="font-size:15px">Let's visualize how our targets change over time</p>

In [None]:
# Average each target per date (a groupby replaces the manual per-date loop)
target_cols = ['target1', 'target2', 'target3', 'target4']
daily_mean = train_next_day.groupby('engagementMetricsDate')[target_cols].mean().sort_index()
unique_date = list(daily_mean.index)
target1_lis = daily_mean['target1'].tolist()
target2_lis = daily_mean['target2'].tolist()
target3_lis = daily_mean['target3'].tolist()
target4_lis = daily_mean['target4'].tolist()

In [None]:
px.line(x=unique_date,y=target1_lis,title='target 1 over time')

In [None]:
px.line(x=unique_date,y=target2_lis,title='target 2 over time')

In [None]:
px.line(x=unique_date,y=target3_lis,title='target 3 over time')

In [None]:
px.line(x=unique_date,y=target4_lis,title='target 4 over time')

In [None]:
team_twitter = pd.read_pickle('../input/mlb-unnested/train_teamTwitterFollowers.pickle')
team_twitter

<p style="font-size:15px">We can use a race plot to visualize how each team's number of followers grows over the years.</p>

In [None]:
#my_raceplot = barplot(team_twitter,  item_column='teamName', value_column='numberOfFollowers', time_column='date')
#my_raceplot.plot(item_label = 'team name', value_label = 'number of followers', frame_duration = 800)

<div class="alert alert-block alert-info" style="font-size:15px; font-family:verdana; line-height: 2.0em;">
📌The number of followers has grown over the years<br>
📌The Houston Astros also seem to be becoming more popular
</div>

In [None]:
standings = pd.read_pickle('../input/mlb-unnested/train_standings.pickle')
transactions = pd.read_pickle('../input/mlb-unnested/train_transactions.pickle')

Similarly, we can use a race plot to visualize each team's number of wins over the years.

In [None]:
#my_raceplot = barplot(standings,  item_column='teamName',value_column='wins', time_column='dailyDataDate',top_entries=10)
#my_raceplot.plot(item_label = 'team name', value_label = 'current wins', frame_duration = 800)

In [None]:
transactions = transactions.dropna()

In [None]:
transactions

In [None]:
most_transaction_team=transactions.groupby('fromTeamName')['toTeamName'].count().reset_index(name='Count').sort_values('Count',ascending=False)
px.bar(most_transaction_team.head(10),x='fromTeamName',y='Count')

In [None]:
most_transaction_player=transactions.groupby('playerName')['toTeamName'].count().reset_index(name='Count').sort_values('Count',ascending=False)
px.bar(most_transaction_player.head(10),x='playerName',y='Count')

<p style="font-size:15px">Now let's take a look at players.csv</p>

In [None]:
players.head()

In [None]:
players.info()

<p style="font-size:15px">Let's visualize the players' birth countries</p>

In [None]:
px.histogram(players,x='birthCountry',color='birthCountry')

<div style="font-size:15px">Let's see if there is a relationship between height and weight of players</div>

In [None]:
px.scatter(players,x='weight',y='heightInches')

<div style="font-size:15px">We can also combine two data frames to gain more insight. For example, by combining each player's award count with their primary position name, we can see which position receives the most awards.</div>

In [None]:
# Count awards once per player; building one row per award would duplicate
# players after the merge and inflate the per-position sums
award_count = awards.groupby('playerId').size().reset_index(name='award_count')
players = pd.merge(players, award_count, on='playerId', how='left')
players['award_count'] = players['award_count'].fillna(0).astype(int)  # players without awards get 0

In [None]:
# Total award count per primary position
position_awards = players.groupby('primaryPositionName')['award_count'].sum().reset_index()

In [None]:
px.bar(position_awards, x='primaryPositionName', y='award_count')

In [None]:
most_award = awards['playerId'].mode()
print(f"playerID: {most_award.values}")
awards[awards['playerId'] == 405395]

In [None]:
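# category_orders expects a dict mapping a column name to an ordered list of its values
award_order = awards['awardName'].value_counts().index.tolist()
px.histogram(awards, y='awardName', category_orders={'awardName': award_order})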

# Model

The following starter model is inspired by @ulrich07's <a href="https://www.kaggle.com/ulrich07/baseline-model-player-mean-or-median">notebook</a>. Instead of using the plain median, I use a weighted median that gives higher weight to recent years.

In [None]:
sample_prediction = pd.read_pickle('../input/mlb-unnested/example_sample_submission.pickle')
example_test_games = pd.read_pickle('../input/mlb-unnested/example_test_games.pickle')
train_games = pd.read_pickle('../input/mlb-unnested/train_games.pickle')
train_next_day = pd.read_pickle('../input/mlb-unnested/train_nextDayPlayerEngagement.pickle')
train_next_day['year'] = pd.DatetimeIndex(train_next_day['engagementMetricsDate']).year

In [None]:
year = train_next_day.year.unique()
weight = [0.05, 0.05, 0.1, 0.8]  # one weight per entry in `year`; experiment with different values

In [None]:
def weighted_median(df, val, weight):
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.
    return df_sorted[cumsum >= cutoff][val].iloc[0]
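
To see how the weighted median differs from the plain median, here is a small self-contained sketch. The toy values and years are hypothetical; the function is repeated only so that the snippet runs on its own.

```python
import pandas as pd

def weighted_median(df, val, weight):
    # Sort by value, then return the first value whose cumulative weight
    # reaches half of the total weight
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.
    return df_sorted[cumsum >= cutoff][val].iloc[0]

# Toy data (hypothetical): one observation per year, most weight on 2021
toy = pd.DataFrame({
    "year":    [2018, 2019, 2020, 2021],
    "target1": [5.0, 1.0, 3.0, 2.0],
    "weight":  [0.05, 0.05, 0.1, 0.8],
})

wm = weighted_median(toy, "target1", "weight")
plain = toy["target1"].median()
print(wm, plain)
```

With these weights the 2021 observation dominates, so the weighted median lands on its value, while the plain median sits between the two middle values.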

Assign a year-wise weight to each row of the targets

In [None]:
def preprocess(df,weight,year):
    for i in range(len(year)):
        df.loc[df['year']==year[i],'weight'] = weight[i]
preprocess(train_next_day,weight,year)
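
The same year-to-weight assignment could also be sketched without a loop by mapping a dictionary over the `year` column. This is an alternative under the assumption that every year in the data has a weight; the toy frame and the names `years`, `weights` and `year_to_weight` are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical) standing in for the real training data
toy = pd.DataFrame({"year": [2018, 2019, 2020, 2021, 2021]})

years = [2018, 2019, 2020, 2021]
weights = [0.05, 0.05, 0.1, 0.8]

# Build a lookup table and map it over the column in one vectorized step
year_to_weight = dict(zip(years, weights))
toy["weight"] = toy["year"].map(year_to_weight)
print(toy)
```

Years missing from the dictionary would come out as NaN, which makes an incomplete weight list easy to spot.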

In [None]:
train_next_day.head()

In [None]:
playerId = train_next_day['playerId'].unique()
for i in playerId:
    df=train_next_day[train_next_day['playerId']==i]
    wm1 = weighted_median(df,'target1','weight')
    wm2 = weighted_median(df,'target2','weight')
    wm3 = weighted_median(df,'target3','weight')
    wm4 = weighted_median(df,'target4','weight')
    train_next_day.loc[train_next_day['playerId']==i,'target1'] = wm1
    train_next_day.loc[train_next_day['playerId']==i,'target2'] = wm2
    train_next_day.loc[train_next_day['playerId']==i,'target3'] = wm3
    train_next_day.loc[train_next_day['playerId']==i,'target4'] = wm4

In [None]:
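# Every row of a player now holds that player's weighted median, so this groupby
# simply collapses to one row per player (despite its name, train_mean stores medians)
train_mean = train_next_day.groupby(["playerId"])[["target1","target2","target3","target4"]].median().reset_index()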

In [None]:
train_mean.head()

In [None]:
def process_pred(df):
    df["playerId"] = df["date_playerId"].apply(lambda x: int( x.split("_")[1] ) )
    df.drop(["target1","target2","target3","target4"], axis=1, inplace=True)
    df = df.merge(train_mean, on="playerId", how="left")
    df.drop("playerId", axis=1, inplace=True)
    df = df.fillna(0.)
    return df

In [None]:
import mlb
env = mlb.make_env() # initialize the environment
iter_test = env.iter_test() # iterator which loops over each date in test set

for (test_df, sample_prediction_df) in iter_test:
    sample_prediction_df = process_pred(sample_prediction_df)
    env.predict(sample_prediction_df)

In [None]:
sample_prediction_df.head()


<h2><center>Work in Progress ... ⏳</center></h2>