# MLB Player Digital Engagement Forecasting

Ideas
* Do rivalry games create more digital content for players?
* Are the best players on the best teams the most followed on Twitter?
* Do other sporting events impact the digital content for MLB?
* Does digital engagement change during the innings?
* Does the All-Star event impact players' performances?
* Do awards benefit players' digital content?
* Do Twitter followers engage with the best players?

### 1. Importing the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

import os
# List every file available in the Kaggle input directory (full path)
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Review each of the csv files to understand what is available
dir_name = '/kaggle/input/mlb-player-digital-engagement-forecasting/'
data = ['players', 'teams', 'seasons', 'awards']
# Create a list of dataframes
csvs = [pd.read_csv(f'{dir_name}{d}.csv') for d in data]
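
The list above is easier to review when each DataFrame is keyed by its name. The following is a minimal sketch of one way to summarise what is available; `summarise_tables` is a hypothetical helper (not part of the original notebook), and the toy DataFrames are stand-ins so the example runs without the Kaggle files.

```python
import pandas as pd

def summarise_tables(tables):
    """Return one row per table with its row count, column count and column names."""
    rows = [
        {
            "table": name,
            "rows": len(df),
            "columns": df.shape[1],
            "column_names": ", ".join(df.columns),
        }
        for name, df in tables.items()
    ]
    return pd.DataFrame(rows).set_index("table")

# Toy stand-ins (hypothetical values) for the real competition tables
toy_tables = {
    "players": pd.DataFrame({"playerId": [1, 2], "playerName": ["A", "B"]}),
    "awards": pd.DataFrame({"awardId": ["X"], "playerId": [1], "awardSeason": [2019]}),
}

summary = summarise_tables(toy_tables)
print(summary)
```

In the notebook itself, the same helper could be called with `dict(zip(data, csvs))` to pair each name in `data` with its DataFrame.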

### 2. Exploratory data analysis

#### 2a. Awards data

In [None]:
# Import the awards data (awardDate is converted to datetime in a later cell)
awards = pd.read_csv(f'{dir_name}awards.csv')

In [None]:
# Variable data types
awards.dtypes

In [None]:
# Describe the key variables within the awards DataFrame
# include='all' adds the string (object) columns to the output
# datetime_is_numeric is omitted: there are no datetime columns yet, and the
# argument was removed in pandas 2.0
awards.describe(include='all')

In [None]:
# Sample of the dataframe
awards.sample(10)

In [None]:
# Word cloud visualisation of the player names that appear most often in the awards data
from wordcloud import WordCloud

plt.subplots(figsize=(20,15))
wordcloud = WordCloud(
                          width=1920,
                          height=1080
                         ).generate(" ".join(awards.playerName))
plt.imshow(wordcloud)
plt.title('Word Cloud for Player Name Awards')
plt.axis('off')
plt.show()

In [None]:
# The 10 players with the most awards (keep='all' retains ties for 10th place)
top_players = awards.groupby('playerName')['playerName'].count().nlargest(n=10, keep='all')
top_players

In [None]:
# For the 10 players with the most awards, review the number of unique awards won
# and the first and last award seasons: did one player dominate a single award?
player = awards.groupby('playerName').agg(
    {
        'playerName' : 'count',
        'awardSeason' : ['min', 'max'],
        'awardId' : pd.Series.nunique
    }
).nlargest(10, ('playerName', 'count'))
player
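
The dictionary-style `agg` above returns MultiIndex columns, which is why `nlargest` needs the tuple `('playerName', 'count')`. As a sketch of an alternative, pandas named aggregation gives flat column names. The output names (`awards_won`, `first_season`, and so on) and the toy award values below are assumptions for illustration only.

```python
import pandas as pd

# Toy awards data (hypothetical values) with the same column names as awards.csv
toy_awards = pd.DataFrame({
    "playerName": ["A", "A", "A", "B", "B", "C"],
    "awardSeason": [2017, 2018, 2018, 2019, 2019, 2020],
    "awardId": ["MVP", "MVP", "ALAS", "ALAS", "ALAS", "MVP"],
})

# Named aggregation: each keyword becomes a flat output column
summary = (
    toy_awards.groupby("playerName")
    .agg(
        awards_won=("awardId", "count"),
        first_season=("awardSeason", "min"),
        last_season=("awardSeason", "max"),
        unique_awards=("awardId", "nunique"),
    )
    .nlargest(2, "awards_won")
)
print(summary)
```

With flat columns, later cells can refer to `summary['awards_won']` directly instead of indexing with tuples.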

In [None]:
# Convert date variable from object to datetime
awards['awardDate'] = pd.to_datetime(awards['awardDate'])
awards.head()

In [None]:
# Keep only the rows for the top award winners (used for the animated chart below)
awards_tp = awards.loc[awards['playerName'].isin(player.index)]

In [None]:
awards_tp.head()

In [None]:
awards_sum = awards_tp.groupby(['playerName', 'awardSeason'])['playerId'].count()
awards_sum1 = awards_sum.reset_index()
awards_sum1

In [None]:
fig = px.bar(awards_sum1, x='playerName', y='playerId', color='playerName',
            animation_frame='awardSeason')
fig.show()

#### 2b. Players

In [None]:
# Import the players data (date columns are converted to datetime in a later cell)
df_p = pd.read_csv(f'{dir_name}players.csv')

In [None]:
df_p.sample(10)

In [None]:
# Distribution of players by country
df_p_s1 = df_p.groupby('birthCountry').agg(
    {
        'playerName' : 'count'
    }
)

# Bar chart
fig = px.bar(df_p_s1, x=df_p_s1.index, y="playerName", title="Distribution by Country")
fig.show()

In [None]:
# Convert the mlbDebutDate and DOB to datetime
df_p['mlbDebutDate'] = pd.to_datetime(df_p['mlbDebutDate'])
df_p['DOB'] = pd.to_datetime(df_p['DOB'])


In [None]:
# Debut year and birth year
df_p['mlbDebutYear'] = df_p['mlbDebutDate'].dt.year
df_p['DOBYear'] = df_p['DOB'].dt.year

# Approximate age at MLB debut (difference between calendar years, so it can be off by one)
df_p['DebutAge'] = df_p['mlbDebutYear'] - df_p['DOBYear']
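
Subtracting calendar years overstates the age of any player who debuted before their birthday in the debut year. The sketch below shows one way to compute the age in completed years from the two date columns. The column name `DebutAgeExact` and the toy dates are assumptions for illustration.

```python
import pandas as pd

# Toy players (hypothetical dates) with the same column names as players.csv
toy = pd.DataFrame({
    "DOB": pd.to_datetime(["1990-12-31", "1990-01-01"]),
    "mlbDebutDate": pd.to_datetime(["2012-04-01", "2012-04-01"]),
})

debut = toy["mlbDebutDate"]
dob = toy["DOB"]

# Difference in calendar years, as in the cell above
year_diff = debut.dt.year - dob.dt.year

# True where the debut happened before the birthday in the debut year
before_birthday = (debut.dt.month < dob.dt.month) | (
    (debut.dt.month == dob.dt.month) & (debut.dt.day < dob.dt.day)
)

# Age in completed years at debut
toy["DebutAgeExact"] = year_diff - before_birthday.astype(int)
print(toy)
```

For the first toy player the year difference is 22, but the completed age at debut is 21.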

In [None]:
# summary of the numeric values
df_p.describe()

In [None]:
# Have debuting players got younger over time?
age_sum = df_p.groupby(['mlbDebutYear', 'DebutAge'])['playerName'].count()
age_sum = age_sum.reset_index()
# Stacked bars: number of debuts per year, coloured by debut age
fig = px.bar(age_sum, x="mlbDebutYear", y="playerName", color="DebutAge")
fig.show()

This output is affected by players who have left the baseball dataset over time: only the players who remain are included, so earlier debut years are under-represented. The most recent years therefore provide a fairer reflection of the age distribution for MLB debuts.
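
One way to act on this caveat is to restrict the summary to recent debut years before comparing ages. The following is a minimal sketch on toy data; the cutoff year `recent_cutoff` is an arbitrary assumption, not a value taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values) with the columns derived earlier in the notebook
toy_players = pd.DataFrame({
    "mlbDebutYear": [2005, 2015, 2018, 2018, 2019, 2019, 2020],
    "DebutAge": [20, 22, 23, 25, 22, 24, 26],
})

recent_cutoff = 2018  # assumed cutoff for "recent" debuts

# Keep only recent debuts, then summarise debut age per year
recent = toy_players[toy_players["mlbDebutYear"] >= recent_cutoff]
debut_age_by_year = recent.groupby("mlbDebutYear")["DebutAge"].agg(
    ["count", "mean", "median"]
)
print(debut_age_by_year)
```

The `count` column is worth keeping next to the averages, since a year with very few debuts gives an unreliable mean.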

In [None]:
# Review a scatter plot
fig = px.scatter(
    age_sum, x='mlbDebutYear', y='DebutAge', opacity=0.65, size="playerName",
    trendline='ols', trendline_color_override='darkblue'
)
fig.show()
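
Because `age_sum` holds one row per (debut year, debut age) pair, the OLS trendline above treats every pair equally, regardless of how many players it represents; the `size` argument only changes the marker size. As a rough sketch of the difference, the example below compares an unweighted fit with one where each row is repeated by its player count. It uses `np.polyfit` on toy data rather than the Plotly trendline, so treat it as an illustration under those assumptions.

```python
import numpy as np
import pandas as pd

# Toy version of age_sum (hypothetical values); playerName holds the player count
toy_age_sum = pd.DataFrame({
    "mlbDebutYear": [2016, 2016, 2017, 2018],
    "DebutAge": [22, 24, 23, 24],
    "playerName": [1, 3, 2, 2],
})

# Centre the years so the fit works with small numbers
x = toy_age_sum["mlbDebutYear"] - toy_age_sum["mlbDebutYear"].min()

# Unweighted fit: one point per (year, age) pair
slope_unweighted, _ = np.polyfit(x, toy_age_sum["DebutAge"], 1)

# Weighted fit: repeat each row once per player it represents
expanded = toy_age_sum.loc[toy_age_sum.index.repeat(toy_age_sum["playerName"])]
x_expanded = expanded["mlbDebutYear"] - toy_age_sum["mlbDebutYear"].min()
slope_weighted, _ = np.polyfit(x_expanded, expanded["DebutAge"], 1)

print(f"unweighted slope: {slope_unweighted:.3f}")
print(f"weighted slope:   {slope_weighted:.3f}")
```

On the real data, fitting on `df_p` directly (one row per player) would give the player-weighted trend without the expansion step.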