# MLB Feature Importance with LOFO

![](https://upload.wikimedia.org/wikipedia/tr/4/48/MLB_Belirtke.png)
![](https://raw.githubusercontent.com/aerdem4/lofo-importance/master/docs/lofo_logo.png)

**LOFO** (Leave One Feature Out) Importance calculates the importances of a set of features based on **a metric of choice**, for **a model of choice**, by **iteratively removing each feature from the set**, and **evaluating the performance** of the model, with **a validation scheme of choice**, based on the chosen metric.

LOFO first evaluates the performance of the model with all the input features included, then iteratively removes one feature at a time, retrains the model, and evaluates its performance on a validation set. The mean and standard deviation (across the folds) of the importance of each feature are then reported.

While other feature importance methods usually calculate how much a feature is used by the model, LOFO estimates how much a feature can make a difference by itself given that we have the other features. Here are some advantages of LOFO:
* It generalises well to unseen test sets since it uses a validation scheme.
* It is model agnostic.
* It gives negative importance to features that hurt performance upon inclusion.
* It can group features, which is especially useful for high-dimensional features such as TF-IDF or one-hot-encoded (OHE) features. It is also good practice to group highly correlated features to avoid misleading results.
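
The procedure described above can be sketched in a few lines. This is a toy illustration, not the `lofo-importance` implementation: the least-squares model, the synthetic data and the helper names (`neg_mae`, `fit_predict`, `lofo_sketch`) are assumptions made for the example.

```python
import numpy as np


def neg_mae(y_true, y_pred):
    """Negative mean absolute error, so that higher is better."""
    return -float(np.mean(np.abs(y_true - y_pred)))


def fit_predict(X_train, y_train, X_val):
    """Fit ordinary least squares with an intercept and predict on X_val."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.column_stack([X_val, np.ones(len(X_val))]) @ coef


def lofo_sketch(X, y, names, folds):
    """Return {feature: (mean, std)} of the per-fold score drop when it is removed."""
    def cv_scores(cols):
        return np.array([
            neg_mae(y[va], fit_predict(X[tr][:, cols], y[tr], X[va][:, cols]))
            for tr, va in folds
        ])

    all_cols = list(range(X.shape[1]))
    base = cv_scores(all_cols)  # score with every feature included
    result = {}
    for j, name in enumerate(names):
        kept = [c for c in all_cols if c != j]
        drop = base - cv_scores(kept)  # positive: the feature helped
        result[name] = (float(drop.mean()), float(drop.std()))
    return result


# Synthetic data: the target depends on the "signal" column only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
y_toy = 3.0 * X_toy[:, 0] + 0.1 * rng.normal(size=300)
idx = np.arange(300)
toy_folds = [(idx[idx % 3 != k], idx[idx % 3 == k]) for k in range(3)]

toy_importance = lofo_sketch(X_toy, y_toy, ["signal", "noise"], toy_folds)
print(toy_importance)
```

Removing the informative column should cost far more score than removing the noise column, whose importance stays close to zero.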

https://github.com/aerdem4/lofo-importance

In [None]:
!pip install lofo-importance

In [None]:
import numpy as np
import pandas as pd
import os, gc
from tqdm import tqdm


ROOT_DIR = "/kaggle/input/mlb-player-digital-engagement-forecasting"

df = pd.read_csv(f"{ROOT_DIR}/train_updated.csv")
print(df.shape)
df.head()

# Read the data

In [None]:
target_df = [eval(x) for x in tqdm(df["nextDayPlayerEngagement"].values)]

flatten = lambda t: [item for sublist in t for item in sublist]


target_df = pd.DataFrame(flatten(target_df))

print(target_df.shape)
target_df.head()

In [None]:
roster_df = [eval(x) for x in tqdm(df["rosters"].values) if str(x) != "nan"]

roster_df = pd.DataFrame(flatten(roster_df)).drop(["statusCode"], axis=1).rename(columns={"gameDate": "date"})

print(roster_df.shape)
roster_df.head()

In [None]:
standings_df = [eval(x.replace(":false", ":False").replace(":true", ":True").replace(":null", ":None")) for x in tqdm(df["standings"].values) if str(x) != "nan"]

standings_df = pd.DataFrame(flatten(standings_df)).rename(columns={"gameDate": "date"})[["date", "teamId", "leagueRank", "lastTenWins"]]

print(standings_df.shape)
standings_df.head()
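
The string replacements above exist because `eval` expects Python literals rather than JSON. As an alternative sketch, assuming every non-missing cell holds valid JSON, `json.loads` handles `true`, `false` and `null` natively and avoids evaluating arbitrary text. The helper name `parse_nested`, the sample strings and their field names are made up for illustration.

```python
import json


def parse_nested(cells):
    """Flatten a column of JSON-encoded lists of records, skipping missing cells."""
    rows = []
    for cell in cells:
        if isinstance(cell, str):  # missing cells arrive as float NaN
            rows.extend(json.loads(cell))
    return rows


sample_cells = [
    '[{"teamId": 1, "isLeader": false, "rank": null}]',
    float("nan"),
    '[{"teamId": 2, "isLeader": true, "rank": 3}]',
]
parsed_rows = parse_nested(sample_cells)
print(parsed_rows)
```

Note that `null` becomes `None` with this approach, whereas some cells in this notebook replace it with an empty string before calling `eval`.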

In [None]:
transaction_df = []

for x in tqdm(df["transactions"].values):
    if str(x) != "nan":
        transaction_df.extend(eval(x.replace(":null", ':""')))

transaction_df = pd.DataFrame(transaction_df)

print(transaction_df.shape)
transaction_df.head()

In [None]:
scores_df = []

for x in tqdm(df["playerBoxScores"].values):
    if str(x) != "nan":
        scores_df.extend(eval(x.replace(":null", ':""')))

scores_df = pd.DataFrame(scores_df)

print(scores_df.shape)
scores_df.head()

In [None]:
awards_df = [eval(x.replace(":null", ':""')) for x in df["awards"].values if str(x) != "nan"]
awards_df = pd.DataFrame(flatten(awards_df))
awards_df.shape

In [None]:
twitter_df = []

for x in tqdm(df["playerTwitterFollowers"].values):
    if str(x) != "nan":
        twitter_df.extend(eval(x.replace(":null", ':""')))

twitter_df = pd.DataFrame(twitter_df)[["date", "playerId", "numberOfFollowers"]]
twitter_df["date"] = pd.to_datetime(twitter_df["date"])

print(twitter_df.shape)
twitter_df.head()

### Only the evaluated players will be used for this analysis

In [None]:
players_df = pd.read_csv(f"{ROOT_DIR}/players.csv")

available_players = players_df[players_df["playerForTestSetAndFuturePreds"] == True]["playerId"].values
len(available_players)

In [None]:
target_df = target_df[target_df["playerId"].isin(available_players)].reset_index(drop=True)
target_df.shape

In [None]:
targets = ["target1", "target2", "target3", "target4"]


target_df["target_date"] = pd.to_datetime(target_df["engagementMetricsDate"])
target_df['date'] = target_df['target_date'] -  pd.to_timedelta(1, unit='d')

target_df["dom"] = target_df["target_date"].dt.day
target_df["dow"] = target_df["target_date"].dt.dayofweek
target_df["month"] = target_df["target_date"].dt.month - 1
target_df["year"] = target_df["target_date"].dt.year

target_df["time"] = target_df["year"]*12 + target_df["month"]
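
The `time` column is a running month counter built from a zero-based month, which is why the validation months later in the notebook are written as `12*year + 7` for August. A small standard-library sketch of the encoding and its inverse (the function names are hypothetical):

```python
from datetime import date


def month_index(d):
    """Zero-based month counter: year * 12 + (month - 1)."""
    return d.year * 12 + (d.month - 1)


def decode_month_index(idx):
    """Inverse of month_index: return (year, calendar month 1-12)."""
    return idx // 12, idx % 12 + 1


aug_2019 = month_index(date(2019, 8, 15))
print(aug_2019, decode_month_index(aug_2019))
```

Consecutive calendar months map to consecutive integers, including across a year boundary, so comparing `time` values orders rows chronologically by month.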

# Player Feature Extraction
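* Age of the player (in days)
* Whether the player was born in the US
* Time since the player's MLB debut (in days)
* Whether the player's name carries the "Jr." suffix (a rough proxy for being the son of an earlier, well-known MLB player)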

In [None]:
target_df = target_df.merge(players_df[["playerId", "DOB", "mlbDebutDate", "birthCountry", "playerName"]],
                            on="playerId", how="left") 

In [None]:
target_df["age"] = (target_df["date"] - pd.to_datetime(target_df["DOB"])).dt.days
target_df["US_person"] = 1*(target_df["birthCountry"] == "USA")
target_df["time_since_debut"] = (target_df["date"] - pd.to_datetime(target_df["mlbDebutDate"])).dt.days
target_df["jr"] = target_df["playerName"].apply(lambda x: "Jr." in x)
target_df["jr"].mean()

# Twitter Feature Extraction
* Number of followers
* Twitter trend: Percentage increase/decrease in followers compared to last month

In [None]:
twitter_df["twitter_trend"] = ((twitter_df["numberOfFollowers"] + 1 - twitter_df.groupby("playerId")["numberOfFollowers"].shift())
                               / (twitter_df["numberOfFollowers"] + 1))
twitter_df["twitter_trend"].clip(-0.2, 0.2).hist(bins=50)
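
To make the trend formula concrete, here is the same computation on a made-up follower table. The `+ 1` in the numerator and denominator presumably guards against division by zero; one side effect, visible below, is that an unchanged follower count gives a small positive value rather than exactly zero.

```python
import pandas as pd

# Made-up follower counts for two players, in chronological order.
toy_followers = pd.DataFrame({
    "playerId": [1, 1, 1, 2, 2],
    "numberOfFollowers": [100, 110, 110, 50, 40],
})

# Previous observation of the same player (NaN for each player's first row).
previous = toy_followers.groupby("playerId")["numberOfFollowers"].shift()
toy_followers["twitter_trend"] = (
    (toy_followers["numberOfFollowers"] + 1 - previous)
    / (toy_followers["numberOfFollowers"] + 1)
)
print(toy_followers)
```

Because the shift is done per player, the first row of player 2 does not borrow the last value of player 1.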

In [None]:
target_df = target_df.merge(twitter_df, on=["date", "playerId"], how="left")

# Carry the latest known follower count and trend forward within each player.
target_df["numberOfFollowers"] = target_df.groupby("playerId")["numberOfFollowers"].ffill()
target_df["twitter_trend"] = target_df.groupby("playerId")["twitter_trend"].ffill()
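
The per-player forward fill is used repeatedly in this notebook, so a toy example may help. The frame below is made up; it uses pandas' `GroupBy.ffill`, which should behave the same as the per-group forward fill above.

```python
import numpy as np
import pandas as pd

toy_fill = pd.DataFrame({
    "playerId": [1, 1, 1, 2, 2],
    "numberOfFollowers": [100.0, np.nan, np.nan, np.nan, 70.0],
})
# Fill gaps with the last known value of the same player only.
toy_fill["filled"] = toy_fill.groupby("playerId")["numberOfFollowers"].ffill()
print(toy_fill)
```

The first row of player 2 stays missing: values are never carried over from one player to the next.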

# Awards Feature Extraction
* Award type
* Time since award

In [None]:
awards_df.loc[awards_df.groupby("awardId")["awardDate"].transform("count") < 3, "awardName"] = "Other"
awards_df["awardName"].value_counts()
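
The rare-category bucketing above relies on `transform("count")`, which returns one count per row rather than one per group. A sketch on made-up award names:

```python
import pandas as pd

toy_awards = pd.DataFrame({
    "awardId": ["A", "A", "A", "B", "C", "C"],
    "awardName": ["Player of the Week"] * 3 + ["Rare Award"] + ["Other Rare Award"] * 2,
})

# Row-aligned group sizes, so they can be used directly as a boolean mask.
group_sizes = toy_awards.groupby("awardId")["awardName"].transform("count")
toy_awards.loc[group_sizes < 3, "awardName"] = "Other"
print(toy_awards["awardName"].value_counts())
```

Merging rare categories this way keeps the label encoder from creating many categories with too few rows to learn from.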

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
awards_df["awardType"] = le.fit_transform(awards_df["awardName"])

In [None]:
awards_df = awards_df.groupby(["awardDate", "playerId"])["awardType"].first().reset_index().rename(columns={"awardDate": "date"})
awards_df["date"] = pd.to_datetime(awards_df["date"])
awards_df["award_date"] = awards_df["date"].values
awards_df.head()

In [None]:
target_df = target_df.merge(awards_df, how="left", on=["date", "playerId"])
target_df["award_date"].isnull().mean()

In [None]:
# Carry the most recent award forward within each player.
target_df["award_date"] = target_df.groupby("playerId")["award_date"].ffill()
target_df["awardType"] = target_df.groupby("playerId")["awardType"].ffill()

In [None]:
target_df["time_since_award"] = (target_df["date"] - target_df["award_date"]).dt.days
target_df["time_since_award"].hist()

# Roster Feature Extraction
* Status

In [None]:
roster_df.loc[roster_df.groupby("status")["date"].transform("count") < 20, "status"] = "Other"
roster_df["status"].value_counts()

In [None]:
le = LabelEncoder()
roster_df["status"] = le.fit_transform(roster_df["status"])

In [None]:
roster_df["date"] = pd.to_datetime(roster_df["date"])

target_df = target_df.merge(roster_df, how="left", on=["date", "playerId"])
target_df["status"].isnull().mean(), target_df["teamId"].isnull().mean()

# Transaction Feature Extraction
* Transfer type
* Time since transfer

In [None]:
transaction_df["date"] = pd.to_datetime(transaction_df["date"])
transaction_df["effectiveDate"] = pd.to_datetime(transaction_df["effectiveDate"])


le = LabelEncoder()
transaction_df["transfer_type"] = le.fit_transform(transaction_df["typeCode"])

In [None]:
target_df = target_df.merge(transaction_df[["playerId", "date", "effectiveDate", "transfer_type"]], 
                            how="left", on=["date", "playerId"])

In [None]:
# Carry the most recent transaction forward within each player.
target_df["effectiveDate"] = target_df.groupby("playerId")["effectiveDate"].ffill()
target_df["transfer_type"] = target_df.groupby("playerId")["transfer_type"].ffill()

target_df["time_since_transfer"] = (target_df["date"] - target_df["effectiveDate"]).dt.days
target_df["time_since_transfer"].hist()

# Scores Feature Extraction
* Total Bases
* Strike outs Pitching
* Plate Appearances
* Home Runs
* Innings Pitched
* Saves
* RBI (runs batted in)
* Days since the last game (`no_game_since`)

In [None]:
scores_df.rename(columns={"gameDate": "date"}, inplace=True)
scores_df["date"] = pd.to_datetime(scores_df["date"])

scores_features = ['totalBases', 'strikeOutsPitching', 'plateAppearances', 
                   'homeRuns', 'inningsPitched', 'saves', 'rbi']
target_df = target_df.merge(scores_df[["playerId", "date"] + scores_features], on=["playerId", "date"], how="left")

target_df[scores_features].isnull().mean()

In [None]:
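def f(x):
    """Count the rows since the last game for one player.

    x is a boolean sequence that is True on rows where the player has a box score.
    The counter resets to 0 on game days and on the player's first row.
    """
    y = np.zeros(len(x))

    for i, el in enumerate(x):
        if el or (i == 0):
            y[i] = 0
        else:
            y[i] = y[i-1] + 1
    return y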
            

target_df["no_game_since"] = target_df["rbi"].notnull()
target_df["no_game_since"] = target_df.groupby("playerId")["no_game_since"].transform(f)
target_df["no_game_since"].hist()
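
As a worked example of the counter, here is the same logic re-implemented in plain Python on a made-up sequence of game days. The function name `days_since_last_game` is hypothetical.

```python
def days_since_last_game(played):
    """Count consecutive rows since the last True; the first row is always 0."""
    counts = []
    for i, flag in enumerate(played):
        if flag or i == 0:
            counts.append(0)
        else:
            counts.append(counts[-1] + 1)
    return counts


example_counts = days_since_last_game([False, True, False, False, True, False])
print(example_counts)  # [0, 0, 1, 2, 0, 1]
```

Note that the first row is 0 even when no game was played, so the counter cannot distinguish "played today" from "no history yet" on that row.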

In [None]:
for col in tqdm(scores_features):
    target_df[col] = target_df.groupby("playerId")[col].ffill()

In [None]:
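def to_float(x):
    """Convert x to float; return None when it cannot be parsed (e.g. empty strings)."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return None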


for col in scores_features:
    target_df[col] = target_df[col].apply(to_float)
    
target_df["playerId"] = target_df["playerId"].astype(int)

# Standing Features
* League Rank
* Last Ten Wins

In [None]:
standings_df["date"] = pd.to_datetime(standings_df["date"])

target_df = target_df.merge(standings_df, on=["teamId", "date"], how="left")
standing_features = ["leagueRank", "lastTenWins"]

for col in standing_features:
    target_df[col] = target_df.groupby("playerId")[col].ffill()

# Target Lag Features

**Since the competition task is to predict the digital engagement for 45 days, we need to be careful with lag features: on day 10, for example, we won't have yesterday's target values. To simplify the problem for feature importance, we assume that we want to predict the 20th day, so every lag feature must only use data from more than 20 days back. In addition to the lagged values, the mean and std over the last available month and year are calculated.**

In [None]:
FUTURE = 20

lags = np.array([21, 28, 35])
assert np.all(lags > FUTURE)



for lag in lags:
    for t in tqdm(targets):
        fname = f"lag{lag}_{t}"
        target_df[fname] = target_df.groupby("playerId")[t].shift(lag)
        
        if lag == 28:
            for period in [4*7, 52*7]:
                fname = f"std{period}_{lag}_{t}"
                target_df[fname] = target_df.groupby("playerId")[t].rolling(period).std().reset_index().sort_values("level_1")[t].values
                target_df[fname] = target_df.groupby("playerId")[fname].shift(lag)
                fname = f"mean{period}_{lag}_{t}"
                target_df[fname] = target_df.groupby("playerId")[t].rolling(period).mean().reset_index().sort_values("level_1")[t].values
                target_df[fname] = target_df.groupby("playerId")[fname].shift(lag)
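
The pattern above (a rolling statistic per player, then a shift by the lag) can be checked on a tiny made-up frame. This sketch uses `transform` with a lambda instead of the `reset_index` / `sort_values` realignment, which is an alternative formulation rather than the code used in this notebook; the window and lag are deliberately tiny.

```python
import pandas as pd

toy_targets = pd.DataFrame({
    "playerId": ["A"] * 5 + ["B"] * 3,
    "target1": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0],
})
WINDOW, LAG = 2, 1  # tiny values for readability

grouped = toy_targets.groupby("playerId")["target1"]
# Plain lag: the value LAG rows earlier for the same player.
toy_targets["lag"] = grouped.shift(LAG)
# Rolling mean over WINDOW rows, then pushed LAG rows into the past.
toy_targets["mean_lagged"] = grouped.transform(
    lambda s: s.rolling(WINDOW).mean().shift(LAG)
)
print(toy_targets)
```

Both columns only ever look at earlier rows of the same player, which is the property that keeps future target values out of the features. As in the notebook, this assumes the rows of each player are in chronological order.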

In [None]:
for t in targets:
    target_df[f"dif_{t}"] = target_df[f"lag21_{t}"] - target_df[f"lag28_{t}"]
    target_df[f"diff_{t}"] = target_df[f"dif_{t}"] - (target_df[f"lag28_{t}"] - target_df[f"lag35_{t}"])

# Validation Scheme: Time-based Cross-validation

**We need to split the data by time, because the model will be applied to future data. A single time split is not enough, so we pick 5 months as validation sets, each with all the months before it as its training set. The competition will run in August-September, so these 2 months are good choices for validation, and it is worth including at least one month from 2021.**
* August 2019
* September 2019
* August 2020
* September 2020
* May 2021

![](https://miro.medium.com/max/558/1*AXRu72CV1hdjLfODFGbMWQ.png)

In [None]:
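# Months are zero-based in the "time" column, so 7 = August, 8 = September and 4 = May.
# Validation months: 2019 August, September, 2020 August, September, 2021 May
VAL_MONTHS = {12*2019 + 7, 12*2019 + 8, 12*2020 + 7, 12*2020 + 8, 12*2021 + 4}
# Keep only the validation months that actually occur in the data.
VAL_MONTHS = target_df[(target_df["year"]*12 + target_df["month"]).isin(VAL_MONTHS)]["time"].drop_duplicates().values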

cv_scheme = []

for valm in VAL_MONTHS:
    train_ind = np.where(target_df["time"].values < valm)[0]
    val_ind = np.where(target_df["time"].values == valm)[0]
    cv_scheme.append((train_ind, val_ind))
VAL_MONTHS
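
The split logic can be verified on a toy month index. In the sketch below, the helper name `time_splits` and the data are made up; the logic mirrors the loop above.

```python
import numpy as np


def time_splits(time_index, val_months):
    """For each validation month, train on all rows from strictly earlier months."""
    time_index = np.asarray(time_index)
    splits = []
    for val_month in val_months:
        train_idx = np.where(time_index < val_month)[0]
        val_idx = np.where(time_index == val_month)[0]
        splits.append((train_idx, val_idx))
    return splits


toy_time = np.array([0, 0, 1, 1, 2, 2, 3, 3])
toy_splits = time_splits(toy_time, val_months=[2, 3])
for train_idx, val_idx in toy_splits:
    print(train_idx, val_idx)
```

The training set grows with each later validation month, and rows from months after the validation month are left out of that fold entirely.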

### Individual features including categoricals

In [None]:
categoricals = ["year", "dow", "awardType", "transfer_type", "status", "playerId", "teamId", "month"]
numericals = ["time_since_award", "time_since_transfer", "no_game_since",
              "age", "time_since_debut", "US_person", "jr", "twitter_trend", "numberOfFollowers"]
features = list(categoricals) + list(numericals)

### LOFO allows us to group features. This prevents highly correlated features from having underestimated importance.

In [None]:
target_features = {f"{t}_features": target_df[[f"lag21_{t}", f"dif_{t}", f"diff_{t}",
                                               f"std28_28_{t}", f"mean28_28_{t}",
                                               f"std364_28_{t}", f"mean364_28_{t}"]].values for t in targets}

target_features["scores_features"] = target_df[scores_features].values
target_features["standing_features"] = target_df[standing_features].values

target_features.keys()

# Applying LOFO
* Model: LGBMRegressor with Huber loss
* Metric: Mean Absolute Error
* CV: 5 time splits

**Green bars represent useful features and red bars represent harmful ones. The plots also show the standard deviation of each importance, which helps us judge whether a feature is really important or whether its estimate is just noisy.**
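
One way to read the mean and standard deviation together is the mean/std ratio. The sketch below uses made-up numbers, assumes the importance table has `feature`, `importance_mean` and `importance_std` columns, and picks a threshold of 2 arbitrarily for illustration.

```python
import pandas as pd

# Made-up importances, not results from this notebook.
toy_importance_df = pd.DataFrame({
    "feature": ["feat_a", "feat_b", "feat_c"],
    "importance_mean": [0.030, 0.010, -0.004],
    "importance_std": [0.005, 0.020, 0.001],
})
toy_importance_df["mean_std_ratio"] = (
    toy_importance_df["importance_mean"] / toy_importance_df["importance_std"]
)

# Arbitrary cut-off: call a feature consistent when |mean| exceeds 2 std.
toy_importance_df["verdict"] = "noisy"
toy_importance_df.loc[toy_importance_df["mean_std_ratio"] > 2, "verdict"] = "consistently useful"
toy_importance_df.loc[toy_importance_df["mean_std_ratio"] < -2, "verdict"] = "consistently harmful"
print(toy_importance_df[["feature", "mean_std_ratio", "verdict"]])
```

With only five validation folds, such a ratio is a rough guide rather than a significance test.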

# target1

In [None]:
from lofo import Dataset, LOFOImportance, plot_importance
from lightgbm import LGBMRegressor


def get_importance(target_name):
    model = LGBMRegressor(min_child_samples=20, n_jobs=-1, objective="huber", alpha=0.1,
                          n_estimators=100)
    dataset = Dataset(df=target_df, target=target_name, features=features, feature_groups=target_features)
    lofo_imp = LOFOImportance(dataset, cv=cv_scheme, scoring="neg_mean_absolute_error", model=model,
                              fit_params={"categorical_feature": categoricals})
    return lofo_imp.get_importance()


importance_df = get_importance("target1")
plot_importance(importance_df, figsize=(8, 8), kind="default")

**scores_features** and **no_game_since** are the only significantly important features. teamId, target1_features, status and month also look important, but not significantly so. **playerId** seems to be a harmful feature for target1, meaning that it causes overfitting and we can benefit from removing it. Most of the features seem to be either redundant or not useful. One would expect the lag features of **target1** to have the highest importance, but LOFO importance measures the importance of a feature given that the other features are present, so it reflects how much difference the feature can make on its own. Given the target2, target3 and target4 lag features and playerId, the target1 lag features seem to be redundant.

Let's repeat the experiment with fewer features:


In [None]:
model = LGBMRegressor(min_child_samples=20, n_jobs=-1, objective="huber", alpha=0.1,
                      n_estimators=100)
dataset = Dataset(df=target_df, target="target1", 
                  features=["no_game_since", "teamId", "status", "month"], 
                  feature_groups={"target1_features": target_features["target1_features"],
                                  "scores_features": target_features["scores_features"]})
lofo_imp = LOFOImportance(dataset, cv=cv_scheme, scoring="neg_mean_absolute_error", model=model,
                          fit_params={"categorical_feature": ["teamId", "status", "month"]})

plot_importance(lofo_imp.get_importance(), figsize=(8, 4), kind="box")

**target1** lag features now have larger importance, but still less than **scores_features** and **no_game_since**. We can say that the target values for the previous month and year are not very important for **target1**, which mainly depends on the **Player Box Scores** data.

# target2

In [None]:
importance_df = get_importance("target2")
plot_importance(importance_df, figsize=(8, 8), kind="default")

**status**, **target2_features**, **scores_features**, **no_game_since** and **month** seem to be the important features. Let's repeat the experiment with fewer features.

In [None]:
model = LGBMRegressor(min_child_samples=20, n_jobs=-1, objective="huber", alpha=0.1,
                      n_estimators=100)
dataset = Dataset(df=target_df, target="target2", 
                  features=["no_game_since", "teamId", "status", "month"], 
                  feature_groups={"target2_features": target_features["target2_features"],
                                  "scores_features": target_features["scores_features"]})
lofo_imp = LOFOImportance(dataset, cv=cv_scheme, scoring="neg_mean_absolute_error", model=model,
                          fit_params={"categorical_feature": ["teamId", "status", "month"]})

plot_importance(lofo_imp.get_importance(), figsize=(8, 4), kind="box")

**no_game_since** and **scores_features** are important, but not as much as they were for **target1**. The **month** feature has a high std, meaning that it is probably not consistently important across all the validation sets. The most important features are the **target2** lag features and the **status** feature.

# target3

In [None]:
importance_df = get_importance("target3")
plot_importance(importance_df, figsize=(8, 8), kind="default")

**scores_features** and **all target lag features** seem to be important and **no_game_since** is a harmful feature. We can repeat the experiment with fewer features:

In [None]:
model = LGBMRegressor(min_child_samples=20, n_jobs=-1, objective="huber", alpha=0.1,
                      n_estimators=100)
dataset = Dataset(df=target_df, target="target3", 
                  features=["no_game_since", "time_since_transfer", "status", "month"], 
                  feature_groups={"target3_features": target_features["target3_features"],
                                  "scores_features": target_features["scores_features"]})
lofo_imp = LOFOImportance(dataset, cv=cv_scheme, scoring="neg_mean_absolute_error", model=model,
                          fit_params={"categorical_feature": ["status", "month"]})

plot_importance(lofo_imp.get_importance(), figsize=(8, 4), kind="box")

**target3** lag features are the most important ones. Interestingly, while **scores_features** are important, the **no_game_since** feature is harmful. For some reason, the **Player Box Scores** data is useful for **target3**, but only as the most recent scores, without knowing how many days old they are. However, there is very little signal overall for **target3**, so it is difficult to draw strong conclusions.

# target4

In [None]:
importance_df = get_importance("target4")
plot_importance(importance_df, figsize=(8, 8), kind="default")

**target4** lag features, **playerId** and **status** are important features, but they have small mean/std ratios, so they are not robust across different validation sets. **time_since_debut** is a consistently harmful feature, but the performance degradation it causes is not significant.

In [None]:
model = LGBMRegressor(min_child_samples=20, n_jobs=-1, objective="huber", alpha=0.1,
                      n_estimators=100)
dataset = Dataset(df=target_df, target="target4", 
                  features=["no_game_since", "playerId", "status", "month"], 
                  feature_groups={"target4_features": target_features["target4_features"],
                                  "scores_features": target_features["scores_features"]})
lofo_imp = LOFOImportance(dataset, cv=cv_scheme, scoring="neg_mean_absolute_error", model=model,
                          fit_params={"categorical_feature": ["playerId", "status", "month"]})

plot_importance(lofo_imp.get_importance(), figsize=(8, 4), kind="box")

**no_game_since** is a harmful feature. This probably means that the behavior of **no_game_since** changes over time: learning its interactions with the other features does more harm than good for **target4**.

# Summary
* **target1** mostly depends on the player's individual game performances.
* Previous target2 values are very predictive for **target2**, and status is also very important.
* There is very little signal for **target3** and **target4**. Target lag features are useful and robust compared to the other features, and **no_game_since** is a harmful feature for these targets.