# Import and prepare data for the model

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        


# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#suppress pandas' SettingWithCopyWarning, which chained assignment would otherwise trigger
pd.options.mode.chained_assignment = None

In [None]:
pitchers=pd.read_csv('/kaggle/input/filtered/pitchers_filtered (1)').drop(columns='Unnamed: 0')
pitchers.head(10)

This dataset already has the full player names.

In [None]:
#can't use rows where the 'percent' column is null
pitchers2=pitchers[pitchers['percent'].notnull()].reset_index(drop=True)

In [None]:
y=pitchers2['percent']
features=['W','SHO','H','SO','BFP','IP']
X=pitchers2[features]

# Create decision tree regressor model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state=1, test_size=0.4)
basic_model = DecisionTreeRegressor(random_state=1)
basic_model.fit(train_X, train_y)
predictions=basic_model.predict(val_X)

# Evaluate the model

Make a new dataframe from the validation data, then add the predictions, whether each prediction would warrant a hall of fame induction, and whether that guess was correct.

In [None]:
#copy so the validation features themselves are not modified
df=val_X.copy()
df['prediction']=predictions
df=df.join(pitchers2[['playerID','name','inducted','percent','threshold','year']])
#guess 'Y' when the predicted vote share reaches that year's induction threshold, else 'N'
df['guess']=np.where(df['prediction']>=df['threshold'],'Y','N')
df['correct?']=df['guess']==df['inducted']

Change the order of the columns (this makes it easier to interpret a player's attributes).

In [None]:
df=df[['name','playerID','W','SHO','H','SO','BFP','IP','percent','threshold','year','inducted','prediction','guess',
      'correct?']]
#view first five rows
df.head(5)

In [None]:
df['correct?'].value_counts()

125/141 correct, or 88.7%.

A closer look at the actual hall of famers shows that this model doesn't do a very good job of identifying players who are in the hall of fame.

In [None]:
pd.set_option('display.max_rows', None)
hof=df[df['inducted']=='Y']
hof

In [None]:
hof['correct?'].value_counts()

Only 5/15 hall of famers were actually predicted to be in the hall of fame.

# Visualize the data and predictions

In [None]:
import matplotlib.pyplot as plt

In [None]:
s30=range(1930,1940)
s40=range(1940,1950)
s50=range(1950,1960)
s60=range(1960,1970)
s70=range(1970,1980)
s80=range(1980,1990)
s90=range(1990,2000)
s2000 = range(2000,2016)

decades=[s30,s40,s50,s60,s70,s80,s90,s2000]

fig, axes = plt.subplots(nrows=4, ncols=2,figsize=(40, 20))
fig.subplots_adjust(hspace=1)
plt.suptitle('MLB HOF Voting results and predictions \n green: incorrect- should be HOF \n blue: incorrect- should not be HOF',fontsize=30)

for decade,ax in zip(decades,axes.flatten()):
    frame=df[df['year'].isin(decade)]
    
    ax.plot(frame['name'],frame['percent'],'o',color='red',label = 'Actual Values')

    ax.plot(frame['name'],frame['prediction'],'X',color='yellow',label = 'Predicted Values')
  
    incorrect=frame[frame['correct?'].isin([False])]
    circle_rad = 10 
    
    overshoot=incorrect[incorrect['prediction']>incorrect['percent']]
    ax.plot(overshoot['name'], overshoot['percent'], 'o',ms=circle_rad * 2, mec='b', mfc='none', mew=2)
    ax.plot(overshoot['name'], overshoot['prediction'], 'o',ms=circle_rad * 2, mec='b', mfc='none', mew=2)
    
    undershoot=incorrect[incorrect['percent']>incorrect['prediction']]
    ax.plot(undershoot['name'], undershoot['percent'], 'o',ms=circle_rad * 2, mec='g', mfc='none', mew=2)
    ax.plot(undershoot['name'], undershoot['prediction'], 'o',ms=circle_rad * 2, mec='g', mfc='none', mew=2)
    
    ax.set_xlabel('Player')
    ax.set_ylabel('Percent of HOF Votes')
    ax.set_title(str(decade[0])+'-'+str(decade[-1]))
    ax.legend(loc = 'upper right')
    #rotate the existing tick labels instead of overwriting them, so labels always match their ticks
    ax.tick_params(axis='x', labelrotation=90)

# Interpreting the visuals

The green and blue circles mark where our predictions were wrong: blue marks a player who was wrongly predicted to be in the HOF (a false positive), and green marks a player who was wrongly predicted not to be in the HOF (a false negative).

Let's take a closer look at the incorrect predictions.

In [None]:
incorrect=df[df['correct?'].isin([False])]
incorrect

In [None]:
incorrect['inducted'].value_counts()

6 were false positives (blue), and 10 were false negatives (green).
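
The same split can be read off a confusion table. The sketch below is illustrative only: it does not use the real `df`, but rebuilds a toy frame (the hypothetical name `toy`) from the counts reported in this notebook, namely 141 validation players, 15 hall of famers of whom 5 were guessed correctly, and 6 false positives.

```python
import pandas as pd

# Toy stand-in for df, rebuilt from the counts reported above (not the real data):
# 5 true positives, 10 false negatives, 6 false positives, 120 true negatives.
toy = pd.DataFrame({
    'inducted': ['Y'] * 5 + ['Y'] * 10 + ['N'] * 6 + ['N'] * 120,
    'guess':    ['Y'] * 5 + ['N'] * 10 + ['Y'] * 6 + ['N'] * 120,
})

# Rows are the actual outcome, columns are the model's guess
confusion = pd.crosstab(toy['inducted'], toy['guess'])
print(confusion)

false_positives = confusion.loc['N', 'Y']  # guessed HOF, not inducted
false_negatives = confusion.loc['Y', 'N']  # inducted, guessed not HOF
```

On the real data, `pd.crosstab(df['inducted'], df['guess'])` should produce the same four cells.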

Takeaways:

88.7% accuracy seems good at first, but a closer look reveals that this figure is propped up by all the easy predictions made on players who clearly should not be in the hall of fame.

When it comes to true contenders, the model does not do so well. Only 5/15 hall of famers were actually predicted to be in the hall.

6 players were wrongly predicted to be in the HOF. This means that most (6/11) of the players predicted to be in the HOF are not actually in it.
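
The arithmetic behind these takeaways can be checked with a few lines. This is a small sketch that uses only the counts reported above; the "always guess N" baseline is an added comparison point, not something the notebook computed.

```python
# Counts taken from the results above
total = 141      # players in the validation set
hof = 15         # actual hall of famers
true_pos = 5     # hall of famers predicted to be in the HOF
false_pos = 6    # non-HOF players predicted to be in the HOF

false_neg = hof - true_pos                            # hall of famers the model missed
true_neg = total - true_pos - false_pos - false_neg   # non-HOF players guessed correctly

accuracy = (true_pos + true_neg) / total        # share of all guesses that are right
recall = true_pos / hof                         # share of hall of famers found
precision = true_pos / (true_pos + false_pos)   # share of 'Y' guesses that are right

# Hypothetical baseline: guess 'N' for every player
baseline_accuracy = (total - hof) / total

print(f"accuracy  {accuracy:.1%}")
print(f"recall    {recall:.1%}")
print(f"precision {precision:.1%}")
print(f"always-N baseline accuracy {baseline_accuracy:.1%}")
```

Because only 15 of the 141 players are hall of famers, guessing 'N' for everyone would already be right 126/141 of the time (about 89.4%), slightly more often than the model's 125/141. That is why accuracy alone says little here, and why recall and precision are more informative.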

# Future work

This notebook was a rough draft, meaning there is much room for improvement that I have left undone.

There are so many ways to improve this model. Some ideas include:

1. Include player awards such as MVP, Cy Young, and other baseball awards.
2. Include advanced stats such as WHIP (walks + hits per inning pitched), BB/9, and K/9.
3. Include postseason stats and other stats that weren't included in this model.
4. Tune the model's parameters. I have some other machine learning notebooks that show how to do this.
5. Try a different type of model, e.g. a random forest regressor.
6. Instead of predicting the percent of the vote, try making it a binary classification problem. That is, rather than predicting a numerical value and determining whether it warrants a HOF induction, simply predict whether or not a player will get inducted into the HOF.
7. The data used for this project is dated. Data through 2020 would help create a better model. It may be hard to find that data, however, and the data for this project was already available on Kaggle, which made it easier to use.
8. There are more ways to improve the model; these are just the ones that come to mind. The other notebooks show how I combined data from different datasets to create the data for this model, and how I decided which features to put in the model.
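
As a rough illustration of idea 2, the rate stats can be derived from counting stats with a few column operations. The numbers below are made up for two hypothetical pitchers, and `BB` (walks) is an assumed column name that is not among the features used in this notebook; `H`, `SO`, and `IP` match the existing feature names.

```python
import pandas as pd

# Made-up career totals for two hypothetical pitchers (not from the dataset).
# 'BB' (walks) is an assumed column name; 'H', 'SO', 'IP' match the features used above.
toy = pd.DataFrame({
    'name': ['Pitcher A', 'Pitcher B'],
    'BB': [500, 900],
    'H': [2000, 2700],
    'SO': [2500, 1800],
    'IP': [2500.0, 3000.0],
})

toy['WHIP'] = (toy['BB'] + toy['H']) / toy['IP']  # walks + hits per inning pitched
toy['BB9'] = 9 * toy['BB'] / toy['IP']            # walks per nine innings
toy['K9'] = 9 * toy['SO'] / toy['IP']             # strikeouts per nine innings

print(toy[['name', 'WHIP', 'BB9', 'K9']])
```

This assumes `IP` holds true innings as a decimal number. If the data stores innings in the box-score convention, where .1 and .2 mean one and two outs, it would need to be converted first.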