## Introduction

This notebook shares some of the insights I gained during the competition.  
I hope you find them helpful.  

The main points of my analysis are as follows.  
* Use lag features: they are useful for explaining short-term trends.  
* Use MSE (Mean Squared Error) as the objective: compared to MAE, it is better suited to explaining the "peaks" (outliers) scattered throughout the data.  

### Import libraries

In [None]:
import numpy as np
import pandas as pd

from pathlib import Path
from sklearn.metrics import mean_absolute_error, mean_squared_error

import lightgbm as lgbm

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.3)

import pickle
import gc
import os

pd.set_option("display.max_columns", 1000)
pd.set_option('display.max_rows', 1000)

### Settings

In [None]:
BASE_DIR = Path('../input/mlb-player-digital-engagement-forecasting')
TRAIN_DIR = Path('../input/mlb-datasets')

### Tools

In [None]:
# Memory reduction function
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

### Columns

In [None]:
targets_cols = ['playerId', 'target1', 'target2', 'target3', 'target4', 'date']
teams_cols = ['id','leagueId', 'divisionId']
players_cols = ['playerId', 'primaryPositionName']
rosters_cols = ['playerId', 'teamId', 'status', 'date']
scores_cols = ['playerId', 'battingOrder', 'gamesPlayedBatting', 'flyOuts',
       'groundOuts', 'runsScored', 'doubles', 'triples', 'homeRuns',
       'strikeOuts', 'baseOnBalls', 'intentionalWalks', 'hits', 'hitByPitch',
       'atBats', 'caughtStealing', 'stolenBases', 'groundIntoDoublePlay',
       'groundIntoTriplePlay', 'plateAppearances', 'totalBases', 'rbi',
       'leftOnBase', 'sacBunts', 'sacFlies', 'catchersInterference',
       'pickoffs', 'gamesPlayedPitching', 'gamesStartedPitching',
       'completeGamesPitching', 'shutoutsPitching', 'winsPitching',
       'lossesPitching', 'flyOutsPitching', 'airOutsPitching',
       'groundOutsPitching', 'runsPitching', 'doublesPitching',
       'triplesPitching', 'homeRunsPitching', 'strikeOutsPitching',
       'baseOnBallsPitching', 'intentionalWalksPitching', 'hitsPitching',
       'hitByPitchPitching', 'atBatsPitching', 'caughtStealingPitching',
       'stolenBasesPitching', 'inningsPitched', 'saveOpportunities',
       'earnedRuns', 'battersFaced', 'outsPitching', 'pitchesThrown', 'balls',
       'strikes', 'hitBatsmen', 'balks', 'wildPitches', 'pickoffsPitching',
       'rbiPitching', 'gamesFinishedPitching', 'inheritedRunners',
       'inheritedRunnersScored', 'catchersInterferencePitching',
       'sacBuntsPitching', 'sacFliesPitching', 'saves', 'holds', 'blownSaves',
       'assists', 'putOuts', 'errors', 'chances', 'date']
awards_cols = ['date', 'playerId', 'awardId']
playerTwitterFollowers_cols = ['playerId', 'numberOfFollowers', 'date']
teamTwitterFollowers_cols = ['teamId', 'numberOfFollowers', 'date']
standings_cols = ['teamId', 'wins', 'losses', 'lastTenWins', 'lastTenLosses', 'date']

# lag features
lag_cols =  ['lag_t' + str(i) + '_tgt' + str(j) for i in range(1, 4) for j in range(1, 5)] + \
            ['roll_mean_t' + str(i) + '_tgt' + str(j) for i in [7, 28] for j in range(1, 5)] + \
            ['roll_min_t' + str(i) + '_tgt' + str(j) for i in [7, 28] for j in range(1, 5)] + \
            ['roll_max_t' + str(i) + '_tgt' + str(j) for i in [7, 28] for j in range(1, 5)]

# date features
date_cols = ['year', 'month', 'week', 'day', 'dayofweek']

# statistics
stats_cols = ['playerId', 
              'target1_mean', 'target1_max', 'target1_min', 
              'target2_mean', 'target2_max', 'target2_min', 
              'target3_mean', 'target3_max', 'target3_min',
              'target4_mean', 'target4_max', 'target4_min']

feature_cols = [
       'label_playerId', 'label_primaryPositionName', 'label_teamId', 'label_status', 
       'battingOrder', 'gamesPlayedBatting', 'flyOuts',
       'groundOuts', 'runsScored', 'doubles', 'homeRuns',
       'strikeOuts', 'baseOnBalls', 'intentionalWalks', 'hits', 'hitByPitch',
       'atBats', 'caughtStealing', 'stolenBases', 'groundIntoDoublePlay',
       'groundIntoTriplePlay', 'plateAppearances', 'totalBases', 'rbi',
       'leftOnBase', 'sacBunts', 'sacFlies', 'catchersInterference',
       'pickoffs', 'gamesPlayedPitching', 'gamesStartedPitching',
       'completeGamesPitching', 'shutoutsPitching', 'winsPitching',
       'lossesPitching', 'flyOutsPitching', 'airOutsPitching',
       'groundOutsPitching', 'runsPitching', 'doublesPitching',
       'triplesPitching', 'homeRunsPitching', 'strikeOutsPitching',
       'baseOnBallsPitching', 'intentionalWalksPitching', 'hitsPitching',
       'hitByPitchPitching', 'atBatsPitching', 'caughtStealingPitching',
       'stolenBasesPitching', 'inningsPitched', 'saveOpportunities',
       'earnedRuns', 'battersFaced', 'outsPitching', 'pitchesThrown', 'balls',
       'strikes', 'hitBatsmen', 'balks', 'wildPitches', 'pickoffsPitching',
       'rbiPitching', 'gamesFinishedPitching', 'inheritedRunners',
       'inheritedRunnersScored', 'catchersInterferencePitching',
       'sacBuntsPitching', 'sacFliesPitching', 'saves', 'holds', 'blownSaves',
       'assists', 'putOuts', 'errors', 'chances', 'awardId_count', 'playernumberOfFollowers',               
       'teamnumberOfFollowers', 'label_leagueId', 'label_divisionId', 'wins', 'losses', 
       'lastTenWins', 'lastTenLosses'
    ] + stats_cols[1:] + lag_cols + date_cols

### Read csv data

In [None]:
players = pd.read_csv(BASE_DIR / 'players.csv', usecols=players_cols)
players = reduce_mem_usage(players)

teams = pd.read_csv(BASE_DIR / 'teams.csv')
teams = teams.rename(columns = {'id':'teamId'})
teams = reduce_mem_usage(teams)

rosters = pd.read_csv(TRAIN_DIR / 'rosters.csv', usecols=rosters_cols)
rosters = reduce_mem_usage(rosters)

targets = pd.read_csv(TRAIN_DIR / 'nextDayPlayerEngagement.csv', usecols=targets_cols)
targets = reduce_mem_usage(targets)

scores = pd.read_csv(TRAIN_DIR / 'playerBoxScores.csv', usecols = scores_cols)
scores = scores.groupby(['playerId', 'date']).sum().reset_index()
scores = reduce_mem_usage(scores)

awards = pd.read_csv(TRAIN_DIR / 'awards.csv', usecols=awards_cols)
awards_count = awards[['playerId', 'awardId']].groupby('playerId').count().reset_index()
awards_count = awards_count.rename(columns = {'awardId':'awardId_count'})
awards_count = reduce_mem_usage(awards_count)

playerTwitterFollowers = pd.read_csv(TRAIN_DIR / 'playerTwitterFollowers.csv', usecols=playerTwitterFollowers_cols)
playerTwitterFollowers["year"] = pd.to_datetime(playerTwitterFollowers['date'], format='%Y%m%d').dt.year
playerTwitterFollowers["month"] = pd.to_datetime(playerTwitterFollowers['date'], format='%Y%m%d').dt.month
playerTwitterFollowers = playerTwitterFollowers.drop('date', axis=1)
playerTwitterFollowers = playerTwitterFollowers.groupby(['playerId', 'year', 'month']).sum().reset_index()
playerTwitterFollowers = playerTwitterFollowers.rename(columns = {'numberOfFollowers':'playernumberOfFollowers'})
playerTwitterFollowers = reduce_mem_usage(playerTwitterFollowers)

teamTwitterFollowers = pd.read_csv(TRAIN_DIR / 'teamTwitterFollowers.csv', usecols=teamTwitterFollowers_cols)
teamTwitterFollowers["year"] = pd.to_datetime(teamTwitterFollowers['date'], format='%Y%m%d').dt.year
teamTwitterFollowers["month"] = pd.to_datetime(teamTwitterFollowers['date'], format='%Y%m%d').dt.month
teamTwitterFollowers = teamTwitterFollowers.drop('date', axis=1)
teamTwitterFollowers = teamTwitterFollowers.groupby(['teamId', 'year', 'month']).sum().reset_index()
teamTwitterFollowers = teamTwitterFollowers.rename(columns = {'numberOfFollowers':'teamnumberOfFollowers'})
teamTwitterFollowers = reduce_mem_usage(teamTwitterFollowers)

standings = pd.read_csv(TRAIN_DIR / 'standings.csv', usecols=standings_cols)
standings = reduce_mem_usage(standings)

stats = pd.read_csv(TRAIN_DIR / 'player_target_stats.csv', usecols=stats_cols)
stats = reduce_mem_usage(stats)

gc.collect()

### Merge data

In [None]:
train = targets[targets_cols].merge(players[players_cols], on=['playerId'], how='left')
train['year'] = pd.to_datetime(train['date'], format='%Y%m%d').dt.year
train['month'] = pd.to_datetime(train['date'], format='%Y%m%d').dt.month
train = train.merge(rosters, on=['playerId', 'date'], how='left')
train = train.merge(scores, on=['playerId', 'date'], how='left')
train = train.merge(stats, on=['playerId'], how='left')
train = train.merge(teams, on='teamId', how='left')
train = train.merge(awards_count, on='playerId', how='left')
train['awardId_count'] = train['awardId_count'].fillna(0)
train = train.merge(playerTwitterFollowers, how='left', on=['playerId', 'year', 'month'])
train = train.merge(teamTwitterFollowers, how = 'left', on=['teamId', 'year', 'month'])
train = train.merge(standings, how='left', on = ['teamId', 'date'])
train = train.drop(['year', 'month'], axis=1)

# label encoding
player2num = {c: i for i, c in enumerate(train['playerId'].unique())}
position2num = {c: i for i, c in enumerate(train['primaryPositionName'].unique())}
teamid2num = {c: i for i, c in enumerate(train['teamId'].unique())}
status2num = {c: i for i, c in enumerate(train['status'].unique())}
leagueId2num = {c: i for i, c in enumerate(train['leagueId'].unique())}
divisionId2num = {c: i for i, c in enumerate(train['divisionId'].unique())}

train['label_playerId'] = train['playerId'].map(player2num)
train['label_primaryPositionName'] = train['primaryPositionName'].map(position2num)
train['label_teamId'] = train['teamId'].map(teamid2num)
train['label_status'] = train['status'].map(status2num)
train['label_leagueId'] = train['leagueId'].map(leagueId2num)
train['label_divisionId'] = train['divisionId'].map(divisionId2num)

train["date"] = pd.to_datetime(train['date'], format='%Y%m%d')

gc.collect()

## Simple EDA - Trend of target values (per playerId)

First, as a simple analysis, let's check how each target changes over time for a few players.

### Target1

* Engagement tends to be higher during the regular season. (See the upper graph.)  
* Engagement from March to July 2020 is lower than in a normal year.  
  This is apparently due to the late start of the season caused by COVID-19.  
* Peaks are scattered throughout the data, and they look difficult to predict.  

In [None]:
fig, ax = plt.subplots(nrows=2, figsize=(25,10))
for i in [0, 5, 12, 18, 21]:
    ax[0].plot(train[train["playerId"] == train["playerId"].loc[i]]["date"], 
              train[train["playerId"] == train["playerId"].loc[i]]["target1"], 
              label=train["playerId"].loc[i])
    ax[1].plot(train[train["playerId"] == train["playerId"].loc[i]]["date"], 
          train[train["playerId"] == train["playerId"].loc[i]]["target1"], 
          label=train["playerId"].loc[i])

# Expanded regular season period
ax[1].set_xlim(np.array(('2019-03-01', '2019-10-01'),  dtype='datetime64'))
ax[0].set_ylim([0,100])
ax[1].set_ylim([0,100])
ax[0].legend()

### Target2

* The trend appears to be more stable than for Target1.  

In [None]:
fig, ax = plt.subplots(nrows=2, figsize=(25,10))
for i in [0, 5, 12, 18, 21]:
    ax[0].plot(train[train["playerId"] == train["playerId"].loc[i]]["date"], 
              train[train["playerId"] == train["playerId"].loc[i]]["target2"], 
              label=train["playerId"].loc[i])
    ax[1].plot(train[train["playerId"] == train["playerId"].loc[i]]["date"], 
          train[train["playerId"] == train["playerId"].loc[i]]["target2"], 
          label=train["playerId"].loc[i])

# Expanded regular season period
ax[1].set_xlim(np.array(('2019-03-01', '2019-10-01'),  dtype='datetime64'))
ax[0].set_ylim([0,100])
ax[1].set_ylim([0,100])
ax[0].legend()

### Target3

* There are drastic changes in the data, similar to Target1.  

In [None]:
fig, ax = plt.subplots(nrows=2, figsize=(25,10))
for i in [0, 5, 12, 18, 21]:
    ax[0].plot(train[train["playerId"] == train["playerId"].loc[i]]["date"], 
              train[train["playerId"] == train["playerId"].loc[i]]["target3"], 
              label=train["playerId"].loc[i])
    ax[1].plot(train[train["playerId"] == train["playerId"].loc[i]]["date"], 
          train[train["playerId"] == train["playerId"].loc[i]]["target3"], 
          label=train["playerId"].loc[i])

# Expanded regular season period
ax[1].set_xlim(np.array(('2019-03-01', '2019-10-01'),  dtype='datetime64'))
ax[0].set_ylim([0,100])
ax[1].set_ylim([0,100])
ax[0].legend()

### Target4

* The values are low and stable overall, which looks relatively similar to Target2.

In [None]:
fig, ax = plt.subplots(nrows=2, figsize=(25,10))
for i in [0, 5, 12, 18, 21]:
    ax[0].plot(train[train["playerId"] == train["playerId"].loc[i]]["date"], 
              train[train["playerId"] == train["playerId"].loc[i]]["target4"], 
              label=train["playerId"].loc[i])
    ax[1].plot(train[train["playerId"] == train["playerId"].loc[i]]["date"], 
          train[train["playerId"] == train["playerId"].loc[i]]["target4"], 
          label=train["playerId"].loc[i])

# Expanded regular season period
ax[1].set_xlim(np.array(('2019-03-01', '2019-10-01'),  dtype='datetime64'))
ax[0].set_ylim([0,100])
ax[1].set_ylim([0,100])
ax[0].legend()

As a first impression, targets 1 and 3 are relatively similar to each other, as are targets 2 and 4.  
Targets 1 and 3 show sharp short-term fluctuations, while targets 2 and 4 show more gradual medium- to long-term fluctuations.

These trends reminded me of a competition I participated in previously:  
[M5-forecast](https://www.kaggle.com/c/m5-forecasting-accuracy)  

In that competition, lag features were very useful, and I was able to make good predictions by using them. (61st/5558)  
I therefore expected lag features to be useful in this competition as well.  

## Make "lag features"

Here is a function that generates lag features.  
In addition to simple lags, it uses the rolling method to calculate the mean, maximum, and minimum over the past 7 and 28 days.  
(In the actual competition, we used additional lag features. For simplicity, they are omitted here.)  
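
The shift-then-roll pattern described above can be sketched on a toy frame. The data, the 3-row window, and the column name `roll_mean_t3` are made up for illustration; only the `groupby` / `shift` / `rolling` pattern mirrors the function in this section.

```python
import pandas as pd

# Toy data: two hypothetical players with a made-up daily target.
toy = pd.DataFrame({
    "playerId": [1] * 5 + [2] * 5,
    "target1": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Plain lag: the value from the previous row of the same player.
toy["lag_t1"] = toy.groupby("playerId")["target1"].shift(1)

# Rolling mean over the 3 rows before the current one (shift first, then roll),
# so the current row's target never enters its own feature.
toy["roll_mean_t3"] = toy.groupby("playerId")["target1"].transform(
    lambda x: x.shift(1).rolling(3).mean()
)

print(toy)
```

Because the operations run per `playerId`, the first rows of each player are NaN rather than borrowing values from the previous player.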

In [None]:
def make_features(df, shift=1):
    
    for i in range(1, 5):
        # lag features
        df['lag_t1_tgt' + str(i)] = df.groupby(['playerId'])['target' + str(i)].transform(lambda x: x.shift(shift))
        df['lag_t2_tgt' + str(i)] = df.groupby(['playerId'])['target' + str(i)].transform(lambda x: x.shift(shift+1))
        df['lag_t3_tgt' + str(i)] = df.groupby(['playerId'])['target' + str(i)].transform(lambda x: x.shift(shift+2))
        
        # use rolling methods to calculate statistics
        df['roll_mean_t7_tgt' + str(i)] = df.groupby(['playerId'])['target' + str(i)].transform(lambda x: x.shift(shift).rolling(7).mean())
        df['roll_mean_t28_tgt' + str(i)] = df.groupby(['playerId'])['target' + str(i)].transform(lambda x: x.shift(shift).rolling(28).mean())
        df['roll_max_t7_tgt' + str(i)] = df.groupby(['playerId'])['target' + str(i)].transform(lambda x: x.shift(shift).rolling(7).max())
        df['roll_max_t28_tgt' + str(i)] = df.groupby(['playerId'])['target' + str(i)].transform(lambda x: x.shift(shift).rolling(28).max())
        df['roll_min_t7_tgt' + str(i)] = df.groupby(['playerId'])['target' + str(i)].transform(lambda x: x.shift(shift).rolling(7).min())
        df['roll_min_t28_tgt' + str(i)] = df.groupby(['playerId'])['target' + str(i)].transform(lambda x: x.shift(shift).rolling(28).min())
        
    # time features
    df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
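    # Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
    df['week'] = df['date'].dt.isocalendar().week.astype(int)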
    df['day'] = df['date'].dt.day
    df['dayofweek'] = df['date'].dt.dayofweek

    return df

## Confirmation of correlation between each feature and Target

Next, let's check the correlation between each feature and the target.  

* Given features (in training data)  
* stats (long-term trends of each target)  
* lag features (relatively short-term trends of each target)   
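
As a minimal sketch of what "correlation between a feature and a target" means here, the snippet below computes Pearson correlations on synthetic data with `DataFrame.corrwith`. The data is randomly generated and the `noise` column is hypothetical; it is not taken from the competition files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: one informative feature and one pure-noise feature.
toy = pd.DataFrame({
    "homeRuns": rng.poisson(0.2, n).astype(float),
    "noise": rng.normal(size=n),
})
# The made-up target depends on homeRuns only.
toy["target1"] = 5.0 * toy["homeRuns"] + rng.normal(scale=0.5, size=n)

# Pearson correlation of each feature column with the target.
corr_toy = toy[["homeRuns", "noise"]].corrwith(toy["target1"])
print(corr_toy)
```

The informative feature ends up with a correlation close to 1, while the noise feature stays near 0, which is the kind of contrast the heatmaps in this section visualize.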

In [None]:
# given features
corr = train[scores_cols + standings_cols + targets_cols + 
            ["playernumberOfFollowers", "teamnumberOfFollowers", "awardId_count"]].corr().\
            drop(["target1", 'target2', 'target3', 'target4', 'playerId', 'teamId'], axis=0)
corr = corr.loc[:, ["target1", 'target2', 'target3', 'target4']]

# stats
corr_static = train[stats_cols + targets_cols].corr().\
              drop(["target1", 'target2', 'target3', 'target4', 'playerId'], axis=0)
corr_static = corr_static.loc[:, ["target1", 'target2', 'target3', 'target4']]

### Given features

* Most of the features appear to have little correlation with the targets.  

##### Target1  
* "homeRuns", "totalBases", and "rbi" have a relatively high correlation with Target1.  
  It is easy to imagine that reactions on social media (SNS) spike when a player hits a home run, gets an extra-base hit, or drives in a run.  
  In Japan, my home country, social media got very excited whenever Shohei Ohtani hit a home run! lol  
  The "peaks" of Target1 may be related to such sudden surges in reactions.  
  
##### Target2 / 4
* "numberOfFollowers" has a relatively high correlation with Target2 / 4.  
  This suggests that Target2 / 4 may be related to social media interactions between fans and players/teams.  
  Since such interactions happen regardless of whether a game is played, it makes sense that the trend is stable over the long term.  
    
##### Target3
* There is a slight correlation with features that describe the details of the game, such as "strikes", "winsPitching", and "flyOutsPitching".  
  Target3 may therefore have something to do with engagement related to the content of the game.  

In [None]:
sns.set(font_scale=1)
fig=plt.figure(figsize=(15, 25))
sns.heatmap(corr, vmax=0.4, vmin=-0.4, annot=True, cmap='coolwarm')

### Stats

* Stats are relatively strongly correlated with all targets.  
* The correlation is particularly high for targets 2 and 4.  
  This is probably due to the long-term stability of targets 2 and 4.  

In [None]:
sns.set(font_scale=1)
fig=plt.figure(figsize=(15, 7))
sns.heatmap(corr_static, vmax=0.4, vmin=-0.4, annot=True, cmap='coolwarm')

### Lag features

First, let's look at the lag features with a 1-day delay.  

* "lag_tx": each target value x days ago. For example, if the base date is 5/1, "lag_t1" is the value on 4/30.  
  The heatmap shows that lag_t1 is very highly correlated with each target. (lag_t1_tgt4 and target4: 0.7!!)  
  Put simply, it is natural for reactions on consecutive days to be similar.  
  Therefore, if the previous day's information is known, lag_t1 is a very powerful feature.  
  (In the competition, the target information for 7/31 should be used as the lag feature when predicting 8/1.)  
* Other features, such as the rolling features, are also useful for prediction.  

In [None]:
train = make_features(train, 1)

corr_lag = train[lag_cols + targets_cols].corr().\
           drop(["target1", 'target2', 'target3', 'target4', 'playerId'], axis=0)
corr_lag = corr_lag.loc[:, ["target1", 'target2', 'target3', 'target4']]

In [None]:
sns.set(font_scale=1)
fig=plt.figure(figsize=(15, 20))
sns.heatmap(corr_lag, vmax=0.4, vmin=-0.4, annot=True, cmap='coolwarm')

Next, let's look at the lag features with a 7-day delay.  

Here, lag_t1 is the information from 7 (= 1 + 6) days ago, and lag_t2 is the information from 8 (= 2 + 6) days ago.  
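
The offset can be checked on a toy series. This is a sketch under one assumption: each player has exactly one row per calendar day with no gaps, because `shift` counts rows, not days. The dates and values are made up.

```python
import pandas as pd

# One hypothetical player, 10 consecutive days; the target equals the day of the month.
toy = pd.DataFrame({
    "playerId": [1] * 10,
    "date": pd.date_range("2021-07-01", periods=10, freq="D"),
    "target1": [float(d) for d in range(1, 11)],
})

shift = 7
# lag_t1 with shift=7: the value from 7 rows (days) earlier.
toy["lag_t1"] = toy.groupby("playerId")["target1"].shift(shift)
# lag_t2 with shift=7: the value from 8 rows (days) earlier.
toy["lag_t2"] = toy.groupby("playerId")["target1"].shift(shift + 1)

# For 2021-07-10, lag_t1 comes from 2021-07-03 and lag_t2 from 2021-07-02.
row = toy[toy["date"] == "2021-07-10"].iloc[0]
print(row["lag_t1"], row["lag_t2"])
```

If some dates were missing for a player, the row-based shift would silently reach further back than 7 calendar days, so it is worth confirming that the data is gap-free before relying on this reading.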

* Not surprisingly, information from 7 days ago is not as useful as information from 1 day ago.  
  Nonetheless, it shows a high correlation, especially for targets 2 and 4.  
* As with stats, the short-term statistics calculated by the rolling method are especially useful for targets 2 and 4.  

In [None]:
train = make_features(train, 7)

corr_lag = train[lag_cols + targets_cols].corr().\
           drop(["target1", 'target2', 'target3', 'target4', 'playerId'], axis=0)
corr_lag = corr_lag.loc[:, ["target1", 'target2', 'target3', 'target4']]

In [None]:
sns.set(font_scale=1)
fig=plt.figure(figsize=(15, 20))
sns.heatmap(corr_lag, vmax=0.4, vmin=-0.4, annot=True, cmap='coolwarm')

## Confirm feature importance (use LightGBM)

Finally, let's check the importance of each feature using LightGBM.  

The model generation function is shown below.  

In [None]:
def train_lgbm(train_X, train_y, valid_X, valid_y, params: dict=None):
    lgb_train = lgbm.Dataset(train_X, train_y)
    lgb_valid = lgbm.Dataset(valid_X, valid_y, reference=lgb_train)
    model = lgbm.train(
        params, lgb_train, valid_sets=lgb_valid,
        # log the validation metric every 50 rounds; this callback replaces
        # the verbose_eval argument, which was removed in LightGBM 4.0
        callbacks=[lgbm.log_evaluation(period=50)],
    )

    pred = model.predict(valid_X)
    # scikit-learn expects (y_true, y_pred)
    score = mean_absolute_error(valid_y, pred)
    print('MAE:', score)
    
    if valid_y[valid_y > 10].shape[0] != 0:
        score_over10 = mean_absolute_error(pred[valid_y > 10], valid_y[valid_y > 10])
        print('MAE for TGT values higher than 10:', score_over10)

    lgbm.plot_importance(model, figsize=(10, 30), importance_type="gain")
    plt.show()
    
    return pred, model, score

In the competition, submissions are evaluated by the mean absolute error (MAE).  
In this analysis, however, I use the mean squared error (MSE) as the objective function.  

I think what you are most interested in is what causes the "peaks" (outliers).  
Since MSE penalizes large errors heavily, I expected that using MSE would produce a model that emphasizes the peaks.  
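
To illustrate why MSE reacts more strongly to peaks, the sketch below searches for the best constant prediction under each loss on a made-up sample with a single peak. The best constant under MSE is the mean, which the peak pulls upward, while under MAE it is the median, which ignores the peak. The numbers are invented for illustration.

```python
import numpy as np

# Made-up engagement values: mostly small, with a single peak.
y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 100.0])

# Candidate constant predictions on a fine grid (step 0.01).
grid = np.linspace(0.0, 100.0, 10001)

# Loss of every candidate constant against all observations.
mse = ((y[None, :] - grid[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - grid[:, None]).mean(axis=1)

best_mse = grid[mse.argmin()]  # lands next to the mean of y
best_mae = grid[mae.argmin()]  # lands next to the median of y

print("MSE-optimal constant:", best_mse, "mean:", y.mean())
print("MAE-optimal constant:", best_mae, "median:", np.median(y))
```

A gradient-boosted model is of course not a constant predictor, but the same pull applies within each leaf, which is one way to read the difference between the two objectives.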

In [None]:
# Minimal adjustment so as not to disturb the competition

params = {'objective': 'mse', 
          'metric': 'l2', 
          'lambda_l1': 3.0, 
          'early_stopping_round': 50,
          'random_state': 63,
          'learning_rate': 0.1}

Here, we use the lag features computed with a 7-day shift.

In [None]:
train = make_features(train, shift=7)

In [None]:
# use only regular season's data 
train_idx = ((train['date'] >= "2018-03-01") & (train['date'] < "2018-10-01")) |\
            ((train['date'] >= "2019-03-01") & (train['date'] < "2019-10-01")) |\
            ((train['date'] >= "2020-07-23") & (train['date'] < "2020-10-01")) |\
            ((train['date'] >= "2021-03-01") & (train['date'] < "2021-06-19"))
valid_idx = (train['date'] >= "2021-06-19") & (train['date'] < "2021-07-19")

train_X = train.loc[train_idx].reset_index(drop=True)[feature_cols]
train_y = train.loc[train_idx].reset_index(drop=True)[['target1', 'target2', 'target3', 'target4']]
valid_X = train.loc[valid_idx].reset_index(drop=True)[feature_cols]
valid_y = train.loc[valid_idx].reset_index(drop=True)[['target1', 'target2', 'target3', 'target4']]

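# label 0 is assumed to be the pitcher label, i.e. the pitcher position is taken to be
# the first primaryPositionName encountered when position2num was built above
idx_pitcher_t = (train_X['label_primaryPositionName'] == 0)
idx_pitcher_v = (valid_X['label_primaryPositionName'] == 0)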

Many of the provided statistics can be divided into those related to pitchers and those related to batters.  
Therefore, I check the feature importance separately for pitchers and batters.  
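
The pitcher mask in this notebook relies on the pitcher position having received label 0. A slightly more defensive sketch looks the label up by name instead. It assumes that pitchers appear as the string `"Pitcher"` in `primaryPositionName`, which should be checked against the actual data; the toy frame is made up.

```python
import pandas as pd

# Toy stand-in for the merged training frame.
toy = pd.DataFrame(
    {"primaryPositionName": ["Catcher", "Pitcher", "Pitcher", "Shortstop"]}
)

# Same label-encoding pattern as in the merge step.
position2num = {c: i for i, c in enumerate(toy["primaryPositionName"].unique())}
toy["label_primaryPositionName"] = toy["primaryPositionName"].map(position2num)

# Look the label up by name instead of hard-coding 0 ("Pitcher" is an assumed value).
pitcher_label = position2num["Pitcher"]
idx_pitcher = toy["label_primaryPositionName"] == pitcher_label

print(pitcher_label, idx_pitcher.tolist())
```

In this toy frame the pitcher label is 1 rather than 0, because the label depends only on which position happens to appear first.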

### Pitcher

#### Target1
* The most important feature turned out to be "inningsPitched".  
  High engagement is reasonable for a good pitcher who can pitch deep into games.  
  Pitchers who strike out many batters are also attractive, so it is understandable that "strikeOutsPitching" is quite important.
* Surprisingly, "homeRuns" is very important for pitchers as well (not to mention batters).  
  Pitchers rarely hit home runs, so it is no wonder that engagement increases when they do.

In [None]:
print("==========target1-pitcher==========")
pred_1, model_1, score_1 = train_lgbm(
    train_X[idx_pitcher_t], train_y['target1'][idx_pitcher_t],
    valid_X[idx_pitcher_v], valid_y['target1'][idx_pitcher_v],
    params
)

#### Target2
* Most of the top-ranked features are lag features; as mentioned above, target2 is associated with relatively medium/long-term trends.  
* "label_status" is ranked third.  
  For example, a player who is out of action is less likely to be featured on the team's Twitter account, so the status seems to be important.  

In [None]:
print("==========target2-pitcher==========")
pred_2, model_2, score_2 = train_lgbm(
    train_X[idx_pitcher_t], train_y['target2'][idx_pitcher_t],
    valid_X[idx_pitcher_v], valid_y['target2'][idx_pitcher_v],
    params
)

#### Target3
* Features related to game information are judged to be important (e.g. "gamesStartedPitching", "gamesFinishedPitching").  
* "teamnumberOfFollowers" is also judged to be important.  
  Given this, I suppose that the team posts game information on social media and that the fans' reactions to those posts are counted as target3.  

In [None]:
print("==========target3-pitcher==========")
pred_3, model_3, score_3 = train_lgbm(
    train_X[idx_pitcher_t], train_y['target3'][idx_pitcher_t],
    valid_X[idx_pitcher_v], valid_y['target3'][idx_pitcher_v],
    params
)

#### Target4
* Like target2, the main features carry medium/long-term information.

In [None]:
print("==========target4-pitcher==========")
pred_4, model_4, score_4 = train_lgbm(
    train_X[idx_pitcher_t], train_y['target4'][idx_pitcher_t],
    valid_X[idx_pitcher_v], valid_y['target4'][idx_pitcher_v],
    params
)

### Batter

#### Target1
* The most important feature was "homeRuns".  
  It is no exaggeration to say that the peaks of target1 reflect the excitement when a player hits a home run!  
* "rbi" is also important. It is easy to imagine social media getting excited when a player drives in runs.  

In [None]:
print("==========target1-nonpitcher==========")
pred_1, model_1, score_1 = train_lgbm(
    train_X[~idx_pitcher_t], train_y['target1'][~idx_pitcher_t],
    valid_X[~idx_pitcher_v], valid_y['target1'][~idx_pitcher_v],
    params
)

#### Target2
* As discussed in the pitcher section, long-term trends are emphasized.

In [None]:
print("==========target2-nonpitcher==========")
pred_2, model_2, score_2 = train_lgbm(
    train_X[~idx_pitcher_t], train_y['target2'][~idx_pitcher_t],
    valid_X[~idx_pitcher_v], valid_y['target2'][~idx_pitcher_v],
    params
)

#### Target3
* Long-term information is important, but features related to game information, such as home runs and RBIs, are also important.

In [None]:
print("==========target3-nonpitcher==========")
pred_3, model_3, score_3 = train_lgbm(
    train_X[~idx_pitcher_t], train_y['target3'][~idx_pitcher_t],
    valid_X[~idx_pitcher_v], valid_y['target3'][~idx_pitcher_v],
    params
)

#### Target4
* Like target2, the main features carry medium/long-term information.

In [None]:
print("==========target4-nonpitcher==========")
pred_4, model_4, score_4 = train_lgbm(
    train_X[~idx_pitcher_t], train_y['target4'][~idx_pitcher_t],
    valid_X[~idx_pitcher_v], valid_y['target4'][~idx_pitcher_v],
    params
)

### (FYI) MSE vs MAE

Create an MAE model to compare with the MSE predictions.

In [None]:
# Minimal adjustment so as not to disturb the competition

params_mae = {'objective': 'mae', 
              'metric': 'l1', 
              'lambda_l1': 3.0, 
              'early_stopping_round': 50,
              'random_state': 63,
              'learning_rate': 0.1}

In [None]:
print("==========target1-nonpitcher==========")
pred_1_mae, model_1_mae, score_1_mae = train_lgbm(
    train_X[~idx_pitcher_t], train_y['target1'][~idx_pitcher_t],
    valid_X[~idx_pitcher_v], valid_y['target1'][~idx_pitcher_v],
    params_mae
)

Looking at the graph, we can see that the MSE model is more sensitive to high values, i.e. peaks (regardless of whether the predicted values are correct).  
The MAE model does not place much emphasis on outliers, which makes it insensitive to peaks.  
For this reason, I consider MSE more suitable for explaining the peaks.
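
The visual comparison can also be put into numbers by reporting the MAE separately for rows whose true value exceeds a threshold, in the same spirit as the "higher than 10" check in `train_lgbm`. The helper `mae_by_range` and all values below are hypothetical; they are not results from the models above.

```python
import numpy as np

def mae_by_range(y_true, y_pred, threshold=10.0):
    """Return (MAE over all rows, MAE over rows whose true value exceeds threshold)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    peak_mask = y_true > threshold
    # NaN signals that no row exceeded the threshold.
    peak_mae = abs_err[peak_mask].mean() if peak_mask.any() else float("nan")
    return abs_err.mean(), peak_mae

# Made-up numbers for illustration only.
y_true = np.array([1.0, 2.0, 1.0, 50.0, 2.0])
pred_flat = np.array([1.0, 2.0, 1.0, 2.0, 2.0])    # ignores the peak
pred_peaky = np.array([3.0, 4.0, 3.0, 40.0, 4.0])  # noisier, but follows the peak

flat_all, flat_peak = mae_by_range(y_true, pred_flat)
peaky_all, peaky_peak = mae_by_range(y_true, pred_peaky)
print("flat :", flat_all, flat_peak)
print("peaky:", peaky_all, peaky_peak)
```

With the real predictions, one would pass `valid_y['target1'][~idx_pitcher_v]` together with `pred_1` and `pred_1_mae` to see whether the MSE model's peak error is actually lower.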

In [None]:
fig = plt.figure(figsize=(25, 6))
plt.plot(pred_1, label="mse")
plt.plot(pred_1_mae, label="mae")
plt.legend(loc='upper right', fontsize=26)

## Summary

I shared some of the insights I gained during the competition.  
By using lag features and the MSE objective, I think I was able to analyze each target and its peaks effectively.  

I hope you find this analysis useful.  
Thank you!  