# Import and prepare data for model

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Suppress pandas' SettingWithCopyWarning, which is raised on chained assignment
pd.options.mode.chained_assignment = None

In [None]:
hitters=pd.read_csv('/kaggle/input/hitters/hitters_filtered').drop(columns='Unnamed: 0')
hitters.head(10)

Add each player's full name, rather than just the player ID.

In [None]:
names=pd.read_csv('/kaggle/input/the-history-of-baseball/player.csv')
names['name']=names['name_first']+' '+names['name_last']
names=names[['player_id','name']]
names=names[names['player_id'].isin(hitters['player_id'].tolist())]
hitters=hitters.join(names.set_index('player_id'),on='player_id')
hitters.head(10)
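A join like this leaves `NaN` in `name` for any `player_id` that has no match in the lookup table, so a quick check can be worthwhile. The sketch below shows the pattern on two tiny made-up frames; the IDs and names are purely illustrative and are not rows from the datasets.

```python
import pandas as pd

# Toy stand-ins for `hitters` and `names`; the rows are illustrative only
toy_hitters = pd.DataFrame({'player_id': ['aaa01', 'bbb01', 'ccc01'],
                            'hr': [10, 20, 30]})
toy_names = pd.DataFrame({'player_id': ['aaa01', 'bbb01'],
                          'name': ['Player A', 'Player B']})

# Same pattern as above: left join on player_id
joined = toy_hitters.join(toy_names.set_index('player_id'), on='player_id')

# Players with no match in the lookup table end up with a missing name
unmatched = joined.loc[joined['name'].isna(), 'player_id'].tolist()
print(unmatched)
```

Running the same `isna()` check on the real `hitters` frame would show whether any players were left without a name.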

Some of the entries in the 'percent' column are empty. To use 'percent' in our machine learning algorithm, we have to either fill these entries with something or cut the rows out. Here we cut them out.

In [None]:
# Keep only rows that have a 'percent' value, then renumber the index from 0
hitters2=hitters[hitters['percent'].notna()].reset_index(drop=True)
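As a small illustration of the two options mentioned above (dropping versus filling), here is a sketch on a made-up three-row frame. The values are invented, and filling with 0 is only an assumption for the example; whether 0 is a sensible placeholder depends on why the value is missing.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing 'percent' value; the rows are illustrative only
toy = pd.DataFrame({'player_id': ['aaa01', 'bbb01', 'ccc01'],
                    'percent': [0.82, np.nan, 0.05]})

# Option 1 (the one used in this notebook): drop rows with no vote percentage
dropped = toy.dropna(subset=['percent']).reset_index(drop=True)

# Option 2: keep every row and fill the gap with a placeholder (0 is an assumption)
filled = toy.fillna({'percent': 0})

print(len(dropped), len(filled))
```

Dropping keeps the target column honest at the cost of fewer rows, while filling keeps every row but bakes the placeholder into what the model learns.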

# Create decision tree regressor model

In [None]:
y=hitters2['percent']
features=['g','ab','r','h','rbi','bb','hr']
X=hitters2[features]

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state=1, test_size=0.4)
basic_model = DecisionTreeRegressor(random_state=1)
basic_model.fit(train_X, train_y)
predictions=basic_model.predict(val_X)
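A decision tree with no depth limit can memorize its training data, so comparing training error with validation error is a useful sanity check. The sketch below does this on synthetic data (the `X_demo` and `y_demo` arrays are made up for illustration, not the hitters data) using mean absolute error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 200 rows, 3 numeric features, noisy target
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Same 60/40 split settings as the cell above
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1, test_size=0.4)

demo_model = DecisionTreeRegressor(random_state=1).fit(tr_X, tr_y)

# Error on the data the tree was fit on versus data it has never seen
train_mae = mean_absolute_error(tr_y, demo_model.predict(tr_X))
val_mae = mean_absolute_error(va_y, demo_model.predict(va_X))
print(f'train MAE: {train_mae:.3f}, validation MAE: {val_mae:.3f}')
```

The same two calls with `train_X`, `train_y`, `val_X`, and `val_y` would give the corresponding numbers for `basic_model`.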

# Evaluate model

Make a new dataframe from the testing data, then add the predictions, whether each prediction would warrant a hall of fame induction, and whether that guess was correct.

In [None]:
df=val_X.copy()
df=df.join(hitters2[['player_id','name','inducted','percent','threshold','year']])
df['prediction']=predictions

# 'Y' if the predicted vote percentage meets the player's threshold, otherwise 'N'
# (vectorized, so no chained assignment inside a loop)
df['guess']=np.where(df['prediction']>=df['threshold'],'Y','N')
df['correct?']=df['guess']==df['inducted']

#change the order of the columns
df=df[['name','player_id','g','ab','r','h','hr','rbi','bb','percent','threshold','year','inducted','prediction','guess',
      'correct?']]

In [None]:
df['correct?'].value_counts()

231/265 predictions correct, or 87.2%.

Let's take a closer look at the actual hall of famers in the testing data.

In [None]:
pd.set_option('display.max_rows', None)
hof=df[df['inducted']=='Y']
hof

In [None]:
hof['correct?'].value_counts()

Only 18 out of the 30 hall of famers were predicted to be hall of famers by the model.

# Visualize the data and predictions

In [None]:
import matplotlib.pyplot as plt

In [None]:
s30=range(1930,1940)
s40=range(1940,1950)
s50=range(1950,1960)
s60=range(1960,1970)
s70=range(1970,1980)
s80=range(1980,1990)
s90=range(1990,2000)
s2000 = range(2000,2016)

decades=[s30,s40,s50,s60,s70,s80,s90,s2000]

fig, axes = plt.subplots(nrows=4, ncols=2,figsize=(40, 20))
fig.subplots_adjust(hspace=1)
plt.suptitle('MLB HOF Voting results and predictions \n green: incorrect- should be HOF \n blue: incorrect- should not be HOF',fontsize=30)
for decade,ax in zip(decades,axes.flatten()):
    frame=df[df['year'].isin(decade)]
    
    ax.plot(frame['name'],frame['percent'],'o',color='red',label = 'Actual Values')

    ax.plot(frame['name'],frame['prediction'],'X',color='yellow',label = 'Predicted Values')
  
    incorrect=frame[frame['correct?'].isin([False])]
    circle_rad = 10 
    
    overshoot=incorrect[incorrect['prediction']>incorrect['percent']]
    ax.plot(overshoot['name'], overshoot['percent'], 'o',ms=circle_rad * 2, mec='b', mfc='none', mew=2)
    ax.plot(overshoot['name'], overshoot['prediction'], 'o',ms=circle_rad * 2, mec='b', mfc='none', mew=2)
    
    undershoot=incorrect[incorrect['percent']>incorrect['prediction']]
    ax.plot(undershoot['name'], undershoot['percent'], 'o',ms=circle_rad * 2, mec='g', mfc='none', mew=2)
    ax.plot(undershoot['name'], undershoot['prediction'], 'o',ms=circle_rad * 2, mec='g', mfc='none', mew=2)
    
    ax.set_xlabel('Player')
    ax.set_ylabel('Percent of HOF Votes')
    ax.set_title(str(decade[0])+'-'+str(decade[-1]))
    ax.legend(loc = 'upper right')
    # Rotate the player names already on the axis instead of overwriting the tick labels
    ax.tick_params(axis='x', labelrotation=90)

# Interpreting the visuals

These plots show what the model did well and what it didn't. What stands out is the number of blue circles, which mark players who were predicted to make the HOF but didn't actually get voted in. Let's take a closer look at all of the incorrect predictions.

In [None]:
incorrect=df[df['correct?'].isin([False])]
incorrect

In [None]:
overshoot=incorrect[incorrect['prediction']>incorrect['percent']]
len(overshoot)

22 of the incorrect predictions were too high (in effect, false positives).

In [None]:
undershoot=incorrect[incorrect['percent']>incorrect['prediction']]
len(undershoot)

12 of the incorrect predictions were too low (in effect, false negatives).

Takeaways:

- 87.2% accuracy seems pretty good for a start, but a closer look at the data reveals many flaws in the model.
- Only 18 of the 30 hall of famers in the testing data were actually predicted to make the hall of fame.
- 22 players who are not hall of famers were predicted to be.

This means that most of the players our model predicted to be in the hall of fame (22 of 40) are not, so 87.2% accuracy on its own is certainly not a fair way to summarize the model's performance.
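To put numbers on this, the counts reported above can be turned into precision and recall, plus a baseline for comparison. This is a back-of-the-envelope sketch that treats the 22 too-high predictions as false positives and the 12 too-low ones as false negatives, as described above.

```python
# Counts taken from the results above
total = 265        # players in the testing data
correct = 231      # correct predictions
actual_hof = 30    # actual hall of famers
true_pos = 18      # hall of famers predicted to be hall of famers
false_pos = 22     # non hall of famers predicted to be hall of famers

false_neg = actual_hof - true_pos           # hall of famers the model missed
true_neg = total - actual_hof - false_pos   # non hall of famers predicted correctly

accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)   # share of predicted hall of famers who really are
recall = true_pos / (true_pos + false_neg)      # share of real hall of famers the model found

# Baseline: predict 'N' for every single player
baseline_accuracy = (total - actual_hof) / total

print(f'accuracy {accuracy:.3f}, precision {precision:.3f}, recall {recall:.3f}')
print(f'baseline accuracy {baseline_accuracy:.3f}')
```

By these counts, precision is 45% and recall is 60%, and guessing 'N' for everyone would score 235/265, or about 88.7%, which is slightly higher than the model's accuracy.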

# Future work

There are so many ways to improve this model. Some ideas include:

1. Include player awards such as MVP, Silver Slugger, and many other baseball awards.
2. Include advanced stats such as slugging percentage and on-base percentage.
3. Include postseason stats and other stats that weren't included in this model.
4. Change the parameters to fine-tune the model. I have some other machine learning notebooks that show how to do this.
5. Try a different type of model, e.g. a random forest regressor.
6. Instead of predicting percent of the vote, try making it a binary classification problem. That is, rather than predict a numerical value and determine if that warrants a HOF induction, simply predict whether or not a player will get inducted into the HOF.
7. The data used for this project is dated. Data through 2020 would help create a better model. It may be hard to find that data, however, and the data for this project was already available on Kaggle, which made it easier to use.
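As a rough illustration of idea 6, the sketch below trains a classifier that predicts 'Y' or 'N' directly. It runs on synthetic data (`X_demo` and `y_demo` are made up here, with only a minority of 'Y' labels to mimic how rare inductions are); a real attempt would use `X` and `hitters2['inducted']` instead. The `max_depth=3` and `class_weight='balanced'` settings are assumptions chosen for the example, not tuned values.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 400 rows, 3 numeric features, imbalanced 'Y'/'N' labels
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 3))
score = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)
y_demo = np.where(score > 1.5, 'Y', 'N')

# stratify keeps the share of 'Y' labels similar in both splits
tr_X, va_X, tr_y, va_y = train_test_split(
    X_demo, y_demo, random_state=1, test_size=0.4, stratify=y_demo)

# class_weight='balanced' makes mistakes on the rare 'Y' class count for more
clf = DecisionTreeClassifier(max_depth=3, class_weight='balanced', random_state=1)
clf.fit(tr_X, tr_y)
pred = clf.predict(va_X)

# Precision and recall for the 'Y' class say more than accuracy on imbalanced data
precision = precision_score(va_y, pred, pos_label='Y', zero_division=0)
recall = recall_score(va_y, pred, pos_label='Y', zero_division=0)
print(f'precision {precision:.3f}, recall {recall:.3f}')
```

Swapping in `RandomForestClassifier` from `sklearn.ensemble` would combine this with idea 5.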

There are more ways to improve the model; these are just the ones that come to mind. The other notebooks show how I combined data from different datasets to create the data for this model, and how I decided which features to put in the model.