# Back to the Kaggle
<hr style="border:0.02in solid gray"> </hr>

This is a visual exploratory data analysis for the MLB Player Digital Engagement Forecasting competition. It is also a kind of template I am building for myself to restart my Kaggle activity, so instead of keeping it private I thought it would be better to share some of the formats and code snippets and apply them to an active competition. I hope you find this notebook useful, not only for predictive modeling but also for code organization, documentation, formatting, data visualization, and projects outside Kaggle.

# A Short Series: Exploratory-Data-Analysis
<hr style="border:0.02in solid gray"> </hr>

My intention is to produce notebooks iteratively until I feel comfortable jumping into competition mode. Here is a simple roadmap of the intended Kaggle notebooks:

1. Players Visual EDA (Short Version)
2. Teams Visual EDA (Short Version)
3. ... evaluate next content according to evidence of interaction and modeling findings (mainly notebooks or mainly competition)

There are two main reasons why I'm sharing this:
1. To draw your attention to the documentation aspect of these notebooks, so that they help *you* win the competition.
2. To ask for your comments and suggestions, and hopefully a couple of votes if you find some of the content here useful for your own purposes.

In [None]:
# -------------------------------------------------------------------------------------------- #
# -- Kaggle Python 3 environment
# -- Image: https://github.com/kaggle/docker-python
# -------------------------------------------------------------------------------------------- #

# -- Generic packages
import numpy as np   # Linear Algebra and Scientific Computing
import pandas as pd  # Data I/O and Processing

# -- Operating System Navigation
import os            
from os import listdir, path
from os.path import isfile, join

# -- Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp

# About the competition
<hr style="border:0.02in solid gray"> </hr>

I found the first paragraph of the competition's description tab very useful and self-explanatory. Based on it, and for the sake of producing a quick and short version of a visual EDA, consider the following:

- Multi-target time series forecasting problem with four numeric engagement targets: target1, target2, target3, target4.
- You have data from other years, but you only forecast for players active in the 2021 season.
- There are files with static data (that do not change over time): players.csv, teams.csv, seasons.csv, awards.csv.
- There is a file with daily time series data: train.csv.
- Predicting the target at $d$ (in the dataset) means forecasting the target value for $d+1$ (in the problem context).
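The last point is easy to get wrong, so here is a minimal sketch of the date shift on a toy table. The column names `date`, `playerId`, `engagement`, and `target_next_day` are assumptions for illustration only, not the competition's actual schema, and the values are made up.

```python
import pandas as pd

# Toy daily series for one hypothetical player (values are made up)
toy = pd.DataFrame({
    "date": pd.date_range("2021-04-01", periods=4, freq="D"),
    "playerId": [1, 1, 1, 1],
    "engagement": [10.0, 20.0, 30.0, 40.0],
})

# The label attached to day d is the engagement observed on day d+1,
# so each player's series is shifted one step back in time
toy["target_next_day"] = toy.groupby("playerId")["engagement"].shift(-1)

print(toy)
```

The last row has no label because there is no following day in the toy data.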

# Input data
<hr style="border:0.02in solid gray"> </hr>

Input data files are available in the read-only "../input/" directory, which contains:
- A set of static files that do not change with time: players, seasons, teams, awards
- A training dataset with daily labels (shifted +1 in time)

In [None]:
# Get the absolute path of the folder with train files
files_path = path.abspath('/kaggle/input/mlb-player-digital-engagement-forecasting')

# Load the static files (join builds the full path to each file)
data_players = pd.read_csv(join(files_path, 'players.csv'))
data_teams = pd.read_csv(join(files_path, 'teams.csv'))
data_seasons = pd.read_csv(join(files_path, 'seasons.csv'))
data_awards = pd.read_csv(join(files_path, 'awards.csv'))

I would argue that the fundamental aspect of the competition is the *players*, since you are asked to forecast *engagement* for every player active in the 2021 season. So, let's start with that.

# Serious Exploratory-Data-Analysis
<hr style="border:0.02in solid gray"> </hr>

To be honest, it sometimes feels like cheating: *pandas_profiling* is an automatic data profiling tool built on top of *pandas*. According to [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling): *The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.*

In [None]:
# Let's see the data types of the columns
data_players.dtypes

In [None]:
# Option 1
players_profile_ds = data_players.describe()
players_profile_ds

No wonder: by default *describe* only summarizes the 3 numeric columns, hence it falls short of providing a quick summary. There are several other options besides *describe*; here is a very interesting one.

In [None]:
# Option 2
players_profile_pp = pp.ProfileReport(data_players, title = "players.csv Data Profile")
players_profile_pp.to_notebook_iframe()
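A lightweight middle ground between the two options is `describe(include="all")`, which also summarizes non-numeric columns. The sketch below runs on a small made-up table rather than *players.csv*, so the rows and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up table standing in for players.csv
toy_players = pd.DataFrame({
    "playerId": [1, 2, 3, 4],
    "birthCountry": ["USA", "USA", "Cuba", "Japan"],
    "birthStateProvince": ["TX", "CA", np.nan, np.nan],
    "heightInches": [72, 75, 70, 74],
})

# include="all" adds count, unique, top and freq for object columns
summary = toy_players.describe(include="all")

# Missing values per column, a quick complement to the summary
missing = toy_players.isna().sum()

print(summary)
print(missing)
```

This is far less detailed than the profile report, but it needs nothing beyond *pandas* and is cheap to rerun after every cleaning step.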

## Exploration
<hr style="border:0.02in solid gray"> </hr>

Are the *birthStateProvince* and *birthCity* NAs due to players who were born outside the USA?

In [None]:
display(data_players.head(10))

Let's get the list of _birthCountry_ values and relate _birthStateProvince_ and _birthCity_ to it, so we can see the NaN counts for each _birthCountry_.

In [None]:
# Unique countries
countries = list(data_players['birthCountry'].unique())

# Number of NAs in StateProvince and City
state_nas = [data_players[data_players['birthCountry'] == country]['birthStateProvince'].isna().sum() for country in countries]
city_nas = [data_players[data_players['birthCountry'] == country]['birthCity'].isna().sum() for country in countries]
country_count = [len(data_players[data_players['birthCountry'] == country]) for country in countries]

# Aggregated dataframe
origins = pd.DataFrame({'birthCountry': countries, 'count_birthCountry': country_count, 'NaNs_StateProvince': state_nas, 'NaNs_City': city_nas})

# Ordered dataframe
origins = origins.sort_values(by=['birthCountry'], inplace=False, ascending=False)
data_players = data_players.sort_values(by=['birthCountry'], inplace=False, ascending=False)

# Display results and reference
display(origins)
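The list comprehensions above filter the table once per country. As a hedged alternative, the same per-country counts can be sketched with a single `groupby`. The toy rows below are made up; only the column names follow the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the notebook
toy_players = pd.DataFrame({
    "birthCountry": ["USA", "USA", "Cuba", "Cuba", "Japan"],
    "birthStateProvince": ["TX", "CA", np.nan, "Havana", np.nan],
    "birthCity": ["Austin", "Fresno", "Havana", "Havana", "Osaka"],
})

# One pass: row count and NaN counts per birthCountry
origins_toy = (
    toy_players
    .groupby("birthCountry")
    .agg(
        count_birthCountry=("birthCity", "size"),
        NaNs_StateProvince=("birthStateProvince", lambda s: s.isna().sum()),
        NaNs_City=("birthCity", lambda s: s.isna().sum()),
    )
    .reset_index()
    .sort_values("birthCountry", ascending=False)
)

print(origins_toy)
```

One difference to keep in mind: `groupby` drops rows whose *birthCountry* is itself missing unless `dropna=False` is passed, so such rows would not appear in this table.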

Well, as suspected, the NaNs do not come from players born in the USA. As the pandas_profiling report previously stated, there are 545 missing values in StateProvince.

In [None]:
display('There are: {NaNs} missing values in StateProvince'.format(NaNs=origins['NaNs_StateProvince'].sum()))

Interestingly enough, none were from Mexico. Since the majority of Latinos in the USA are from that country, and baseball is a popular sport in the north of Mexico, it could be the case that engagement for Mexican players has interesting properties. But of course, it is up to you to decide whether to explore that.

In [None]:
display(origins[origins['NaNs_StateProvince'] > 0])

## Well, that is it for the short version. What do you think about quick progress and iterative content in this notebook? Does that make sense?