# Machine Learning Lab 2

### Eric Johnson & Quincy Schurr

#### Overview
The dataset we will be using for this lab is Lahman's Baseball Database. It consists of 24 tables that describe a variety of baseball statistics for players from 1871 through the 2015 season. For this lab, we will select a prediction task for this dataset and evaluate several different prediction algorithms on it. We will tune the hyper-parameters of each algorithm, define a measure for evaluating the predictions, and then summarize our findings.

This data set exists for fun and for learning. It has been made available so that baseball fans can analyze player performance and so that baseball statisticians can look for correlations in player statistics. This is helpful for baseball teams looking to use their budget to acquire the best available players at the lowest cost. The book Moneyball, based on the true story of the Oakland Athletics, describes how the team used this same process to gain an advantage over its competition by buying lesser-known, high-quality players cheaply. If our analysis reveals a correlation or a predictive signal, it could be useful to baseball teams looking to win more games.

The documentation for this data set can be found at the following link: http://seanlahman.com/files/database/readme2014.txt

For this lab, we have decided to assess the quality of a baseball player. We will do this by analyzing a player's performance in one season in order to predict whether that player will make the next season's All Star team. We will classify each player as a "yes" or a "no" in terms of making next season's All Star team.


In [None]:
#import statements
import plotly
import plotly.graph_objs as go
import plotly.plotly as py
plotly.offline.init_notebook_mode() # run at the start of every notebook
py.sign_in('qschurr', '926ny2havn')

import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline

To start, we will import the tables we are going to use from GitHub. No postseason tables will be used in this analysis, since not every team has the opportunity to play extra games. The Master table contains all the demographic data for a player, including their name, playerID, date of birth, hometown, height, and weight. The Salary table contains the year, team, league, and salary for every player (identified by their playerID). The Batting table shows all batting statistics for a player during a single season, including games, at bats, runs, hits, doubles, triples, home runs, runs batted in, etc. The Pitching table contains statistics about a pitcher's wins and losses, games, innings pitched, earned run average, hits and walks allowed, strikeouts, etc. The Fielding table contains statistics for both offense and defense of a player, such as double plays, stolen bases, errors, assists, wild pitches, etc.

In [None]:
# bring in table data from github
master = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Master.csv')
batting = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Batting.csv')
pitching = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Pitching.csv')
fielding = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Fielding.csv')
allstar = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/AllstarFull.csv')

We will be excluding all null values within each table. We feel that imputing the data would lead to false conclusions, because imputed values may not accurately represent player performance: some players are better than average and some are worse. The exception is the All Star table. There we will simply drop the last column, starting position, because some players make the team but do not start, and they have no value in that column. We have also decided to drop the GP column, a binary flag for whether or not the player actually played in the game. For this lab, we do not need to know if they played in the All Star game; it is enough to know that they made the team. We also need to rename the yearID column in the All Star table, because we will be comparing it to the previous year's performance tables, which use the same column identifier.

In [None]:
master.drop(['nameGiven', 'retroID', 'bbrefID', 'deathYear', 'deathMonth', 'deathDay', 'deathCountry', 'deathState', 'deathCity'], axis = 1, inplace=True)
batting.drop(['CS', 'IBB', 'HBP', 'SH', 'SF', 'GIDP', 'stint', 'SB'], axis=1, inplace=True)
fielding.drop(['stint', 'PB', 'WP', 'SB', 'CS', 'ZR', 'GS', 'InnOuts'], axis=1, inplace=True)
fielding.head()

In [None]:
### Drop NA's when building training and testing datasets
#master.dropna(inplace=True)
#batting.dropna(inplace=True)
#pitching.dropna(inplace=True)
#fielding.dropna(inplace=True)
allstar.rename(columns={'yearID' : 'asYear'}, inplace=True)
allstar.drop(['startingPos', 'gameID', 'GP', 'gameNum'], axis=1, inplace=True)

In [None]:
#pulling only data from allstar games in 1933-1949 to see if we can pull out predictors to base our predictions on.
#must drop null values to only get the rows within the year bounds
'''allstar33_49 = allstar.where((1933 <= allstar['asYear']) & (allstar['asYear'] <= 1949)).dropna()
start_pred = allstar33_49.merge(master, on='playerID')
start_pred.head()'''

In [None]:
'''floatList = ['birthYear', 'birthMonth', 'birthDay', 'asYear']
for x in floatList:
    start_pred[x] = start_pred[x].astype(int)'''

In [None]:
def createTrainingData(start_year=1933, end_year=1949):
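    # Use the function arguments rather than hard-coded years so that non-default ranges work
    allstar_in_range = allstar.where((start_year <= allstar['asYear']) & (allstar['asYear'] <= end_year)).dropna()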
    
    #Stats for all players other than pitchers during this range
    batting_train = batting.where((start_year <= batting['yearID']) & (batting['yearID'] <= end_year)).dropna()
    fielding_train = fielding.where((start_year <= fielding['yearID']) & (fielding['yearID'] <= end_year)).dropna()
    fielding_train.rename(columns={'G' : 'GF'}, inplace=True)
    #master_train = batting_train.merge(fielding_train, on=['playerID', 'yearID', 'teamID', 'lgID'])
    master_train = master.merge(batting_train.merge(fielding_train, on=['playerID', 'yearID', 'teamID', 'lgID']), on='playerID')
    master_train = master_train[master_train.POS != 'P']
    master_train.drop(['birthYear', 'birthMonth', 'birthDay', 'birthState', 'birthCity', 'nameFirst', 'nameLast'], axis=1, inplace=True)    
    
    return master_train, allstar_in_range
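
The year filtering above relies on `.where(...).dropna()`. As a minimal sketch on made-up data (the `toy_batting` frame and the `filter_years` helper are hypothetical names, not part of the lab), the same rows can be selected with a boolean mask. One difference worth noting is that `.where` first fills the non-matching rows with NaN, which turns integer columns such as `yearID` into floats, whereas a mask keeps the original integer dtype.

```python
import pandas as pd

# Toy stand-in for the Batting table (values are invented for illustration).
toy_batting = pd.DataFrame({
    'playerID': ['a01', 'b02', 'c03', 'd04'],
    'yearID':   [1932, 1940, 1949, 1951],
    'HR':       [3, 12, None, 20],
})

def filter_years(df, start_year, end_year, year_col='yearID'):
    """Keep rows whose year lies in [start_year, end_year], then drop rows with nulls."""
    mask = df[year_col].between(start_year, end_year)  # inclusive on both ends
    return df[mask].dropna()

# Boolean-mask version: yearID stays an integer column.
in_range = filter_years(toy_batting, 1933, 1949)

# .where() version, as used in the notebook: same rows, but yearID becomes float.
where_version = toy_batting.where(
    (1933 <= toy_batting['yearID']) & (toy_batting['yearID'] <= 1949)
).dropna()

print(in_range.dtypes)
print(where_version.dtypes)
```

Both versions keep the same rows here; the mask version simply avoids having to cast year columns back to integers afterwards.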

In [None]:
def createTestData(start_year=1950, end_year=2016):
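    # Use the function arguments rather than hard-coded years so that non-default ranges work
    allstar_end = allstar.where((start_year <= allstar['asYear']) & (allstar['asYear'] <= end_year)).dropna()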
    #Stats for all players other than pitchers during this range
    batting_test = batting.where((start_year <= batting['yearID']) & (batting['yearID'] <= end_year)).dropna()
    fielding_test = fielding.where((start_year <= fielding['yearID']) & (fielding['yearID'] <= end_year)).dropna()
    fielding_test.rename(columns={'G' : 'GF'}, inplace=True)
    #master_train = batting_train.merge(fielding_train, on=['playerID', 'yearID', 'teamID', 'lgID'])
    master_test = master.merge(batting_test.merge(fielding_test, on=['playerID', 'yearID', 'teamID', 'lgID']), on='playerID')
    master_test = master_test[master_test.POS != 'P']
    master_test.drop(['birthYear', 'birthMonth', 'birthDay', 'birthState', 'birthCity', 'nameFirst', 'nameLast'], axis=1, inplace=True)   
    
    return master_test

In [None]:
'''start_year=1933
end_year=1949
batting_train = batting.where((start_year <= batting['yearID']) & (batting['yearID'] <= end_year)).dropna()
fielding_train = fielding.where((start_year <= fielding['yearID']) & (fielding['yearID'] <= end_year)).dropna()
fielding_train.rename(columns={'G' : 'GF'}, inplace=True)
master_train = master.merge(batting_train.merge(fielding_train, on=['playerID', 'yearID', 'teamID', 'lgID']), on='playerID')
master_train = master_train[master_train.POS != 'P']
master_train.drop(['birthYear', 'birthMonth', 'birthDay', 'birthState', 'birthCity', 'nameFirst', 'nameLast'], axis=1, inplace=True)'''  

master_train, allstar_in_range = createTrainingData()
#master_train.head()
#allstar_in_range.head()
master_train['AS'] = 0

y = 0
for x, row in master_train.iterrows():
    for y, row2 in allstar_in_range.iterrows():
        if row2['playerID'] == row['playerID']:
            print('TRUE')
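
The nested loop above compares every stats row against every All Star row and only prints matches. A possible vectorized alternative is sketched below on made-up data. It assumes the labeling rule implied by the stated goal: a season row gets `AS = 1` when the same player appears on the All Star roster in the following year (`asYear == yearID + 1`). The toy frames and the `label_next_season_allstar` helper are hypothetical and are not part of the lab code.

```python
import pandas as pd

# Toy stand-ins for master_train and allstar_in_range (values are invented).
toy_stats = pd.DataFrame({
    'playerID': ['a01', 'a01', 'b02', 'c03'],
    'yearID':   [1940, 1941, 1940, 1941],
})
toy_allstar = pd.DataFrame({
    'playerID': ['a01', 'c03', 'c03'],
    'asYear':   [1941, 1941, 1941],  # repeated row: a player can be listed twice in one year
})

def label_next_season_allstar(stats, allstar_df):
    """Add an AS column: 1 if the player is an All Star in the season after yearID."""
    # One row per (player, All Star year), so the merge cannot duplicate stats rows.
    rosters = allstar_df[['playerID', 'asYear']].drop_duplicates()
    # Shift the All Star year back by one so it lines up with the previous season's stats.
    rosters = rosters.assign(yearID=rosters['asYear'] - 1).drop(columns='asYear')
    merged = stats.merge(rosters, on=['playerID', 'yearID'], how='left', indicator=True)
    merged['AS'] = (merged['_merge'] == 'both').astype(int)
    return merged.drop(columns='_merge')

labeled = label_next_season_allstar(toy_stats, toy_allstar)
print(labeled)
```

A single left merge does the matching in one pass, and the `indicator` column records which stats rows found a roster entry, so no Python-level loop is needed.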
        

In [None]:
master_test = createTestData()
master_test.head()