# Machine Learning Lab 1
## Eric Johnson & Quincy Schurr

### Overview

The dataset that we will be using for this lab is Lahman's Baseball Database. It comprises 24 tables that describe a variety of baseball statistics for players from 1871 through the 2015 season. The four main tables are the Master, Batting, Pitching, and Fielding tables. For this analysis we will be using the Master and Batting tables.

The Master table contains all the demographic data for a player, including their name, playerID, date of birth, hometown, height, and weight. This table originally had 24 features and 19049 rows, one for each player in the database. The requirements for this project call for at least 30,000 records. Once the Master table has been cleaned up and merged with the other tables that we want to draw data from, there will be multiple records for each player, which brings the total above that threshold.

The other three main tables hold each player's playing statistics for every year that they played, including the team they played for. We cleaned up the data by dropping columns that we did not want to analyze, either because not all players had that statistic or because it was unnecessary for the study.

The purpose of this data set is fun and learning. It is made available so that baseball fans and statisticians can analyze player performance and look for correlations in the player statistics. This is helpful for baseball teams that want to use their budget to acquire the best available players at the lowest cost.

There are many different statistics in this dataset that we could analyze to draw interesting conclusions. We would like to find the most common hometown in each country represented by players. We would also like to look at how many trades have occurred during each season, and possibly for each team, to see trends over time. The correlation between batting average and home runs hit would be an interesting statistic as well. The last thing we would like to analyze is the average length of time a player stays in the major leagues.



### Data Understanding

This section describes each feature in the merged data set.

##### playerID
The player ID identifies the player that each record belongs to. It is defined in the Master table and is the key used to link most of the other tables together. It is unique to each player.
##### birthYear, birthMonth, birthDay
These features represent the year, month, and day that the player was born. They are all integers and the data is ordinal.
##### birthCountry, birthState, birthCity
These features represent the hometown of each player. They give the city, state, and country each player was born in. They are strings, which pandas stores with the object dtype in the data frame.
##### nameFirst, nameLast
Objects representing the name of the player.
##### weight, height
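Floats that represent the weight (in pounds) and height (in inches) of the player. These come from the Master table, which has one row per player, so they are a single value for each player rather than a per-season measurement.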
##### bats, throws
Object features representing the hand with which the player bats and the hand with which the player throws.
##### debut, finalGame
The date of each player's Major League Baseball debut and final game.
##### yearID
Integer value representing the season that the statistics apply to.
##### teamID
Object that represents which team the player played for and which team their statistics apply to.
##### lgID
Represents the league that the player's team belonged to, most often the National League (NL) or the American League (AL).
##### G, AB, R, H, 2B, 3B, HR, RBI, BB, SO
These features are all integers. In order, they represent the number of games played, at-bats, runs scored, hits, doubles, triples, home runs, runs batted in, walks, and strikeouts for the season.
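
Several of these counting statistics combine into the rate statistics we want to study. As a minimal sketch, assuming batting average is computed as `H / AB` and using made-up values in a hypothetical `toy` frame rather than the real data, the correlation between batting average and home runs could be computed like this:

```python
import pandas as pd

# Toy season lines; the values are made up for illustration only.
toy = pd.DataFrame({
    "playerID": ["a01", "b02", "c03", "d04"],
    "AB": [400, 500, 0, 250],
    "H": [100, 150, 0, 50],
    "HR": [10, 30, 0, 5],
})

# Batting average is hits divided by at-bats; skip rows with no at-bats
# so we never divide by zero.
hitters = toy[toy["AB"] > 0].copy()
hitters["AVG"] = hitters["H"] / hitters["AB"]

# Pearson correlation between batting average and home runs.
corr = hitters["AVG"].corr(hitters["HR"])
print(hitters[["playerID", "AVG", "HR"]])
print(round(corr, 3))
```

Filtering out records with zero at-bats first matters on the real table, since pitchers and late-season call-ups can appear in a game without ever batting.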

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline

In [4]:
master = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Master.csv')
master.drop(['nameGiven', 'retroID', 'bbrefID', 'deathYear', 'deathMonth', 'deathDay', 'deathCountry', 'deathState', 'deathCity'], axis = 1, inplace=True)

In [5]:
batting = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Batting.csv')
batting.drop(['CS', 'IBB', 'HBP', 'SH', 'SF', 'GIDP', 'stint', 'SB'], axis=1, inplace=True)

We dropped every record that had a null or missing value rather than filling the gaps with column means. The statistics in this data set describe individual player performance, so a mean-filled value would not represent how that player actually performed that year, and an overwhelming number of mean-filled records would skew the statistics. It was better to remove the whole record.
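
To make the trade-off concrete, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical) that contrasts mean filling with dropping the record:

```python
import numpy as np
import pandas as pd

# Made-up season records; np.nan marks a missing RBI value.
toy = pd.DataFrame({
    "playerID": ["a01", "b02", "c03"],
    "RBI": [100.0, np.nan, 20.0],
})

# Option 1: fill the gap with the column mean (the approach we decided against).
filled = toy.fillna({"RBI": toy["RBI"].mean()})

# Option 2: drop any record with a missing value (the approach used in this lab).
dropped = toy.dropna()

print(filled)
print(dropped)
```

With mean filling, player `b02` is credited with the average of two very different seasons that belong to other players, which is exactly the distortion we wanted to avoid.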

In [6]:
# Merge player demographics with batting statistics
baseballdata = master.merge(batting)
# Drop records with missing values to clean up the data; 83822 records remain.
baseballdata.dropna(inplace=True)
# Convert to int because birth dates should not be floats. Runs, hits, etc. are also whole numbers.
floatList = ['birthYear', 'birthMonth', 'birthDay', 'R', 'AB', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO']
for x in floatList:
    baseballdata[x] = baseballdata[x].astype(int)
baseballdata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 83822 entries, 0 to 101331
Data columns (total 28 columns):
playerID        83822 non-null object
birthYear       83822 non-null int64
birthMonth      83822 non-null int64
birthDay        83822 non-null int64
birthCountry    83822 non-null object
birthState      83822 non-null object
birthCity       83822 non-null object
nameFirst       83822 non-null object
nameLast        83822 non-null object
weight          83822 non-null float64
height          83822 non-null float64
bats            83822 non-null object
throws          83822 non-null object
debut           83822 non-null object
finalGame       83822 non-null object
yearID          83822 non-null int64
teamID          83822 non-null object
lgID            83822 non-null object
G               83822 non-null int64
AB              83822 non-null int64
R               83822 non-null int64
H               83822 non-null int64
2B              83822 non-null int64
3B              83822 n
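
When `merge` is called without an `on` argument, pandas joins on the columns the two frames have in common, which for the Master and Batting tables appears to be only `playerID`. As a hedged sketch on hypothetical stand-in frames (`master_toy` and `batting_toy` are made up), the same join can be written with the key named explicitly:

```python
import pandas as pd

# Tiny stand-ins for the Master and Batting tables (made-up values).
master_toy = pd.DataFrame({
    "playerID": ["a01", "b02"],
    "birthCountry": ["USA", "Cuba"],
})
batting_toy = pd.DataFrame({
    "playerID": ["a01", "a01", "b02"],
    "yearID": [2001, 2002, 2001],
    "HR": [12, 18, 7],
})

# Name the key explicitly and ask pandas to check that each player
# appears only once on the Master side.
merged = master_toy.merge(batting_toy, on="playerID", how="inner",
                          validate="one_to_many")
print(merged)
```

Each Master row is repeated once for every matching batting record, which is why the merged frame has far more rows than the Master table alone.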

In [25]:
# See where players are from, by country (counts are batting records, not unique players)
b_grouped = baseballdata.groupby(by='birthCountry')
print(b_grouped.birthCountry.count())

birthCountry
Australia           136
Bahamas              23
Belgium               5
Brazil                8
CAN                 862
Colombia             87
Cuba                777
Czech Republic        1
D.R.               3582
Germany              22
Greece                1
Honduras              8
Ireland              19
Italy                 1
Jamaica              50
Japan               263
Mexico              671
Nicaragua            70
Panama              355
Poland               23
Saudi Arabia          3
South Korea          87
USA               74779
United Kingdom      110
V.I.                 89
Venezuela          1790
Name: birthCountry, dtype: int64
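
Because the merged frame holds one row per batting record, the totals above count records rather than people, so a player with a long career is counted many times. A minimal sketch of the difference, using a made-up `toy` frame:

```python
import pandas as pd

# Made-up merged records: several rows per player.
toy = pd.DataFrame({
    "playerID": ["a01", "a01", "a01", "b02", "c03", "c03"],
    "birthCountry": ["USA", "USA", "USA", "USA", "Cuba", "Cuba"],
})

# Rows per country (what count() on the grouped frame reports).
rows_per_country = toy.groupby("birthCountry")["playerID"].count()

# Distinct players per country.
players_per_country = toy.groupby("birthCountry")["playerID"].nunique()

print(rows_per_country)
print(players_per_country)
```

Counting with `nunique()` is probably the better fit for questions about where players come from, while the record counts are more useful as a measure of total seasons played.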


In [26]:
# Start by just plotting what we previously grouped!
plt.style.use('ggplot')
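
The cell above only sets the plot style. As a rough sketch of one way the grouped counts could be plotted, the following draws a horizontal bar chart. The `counts` values are made up, and the bar orientation and log scale are assumptions chosen for illustration, not something taken from the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts standing in for the grouped birthCountry totals.
counts = pd.Series({"USA": 50, "D.R.": 12, "Cuba": 7, "Japan": 3})

fig, ax = plt.subplots(figsize=(6, 3))
# Horizontal bars, sorted so the largest group sits at the top.
counts.sort_values().plot(kind="barh", ax=ax)
# A log scale keeps the small groups visible next to a dominant one.
ax.set_xscale("log")
ax.set_xlabel("Records")
ax.set_ylabel("birthCountry")
fig.tight_layout()
```

A log scale seems reasonable here because the USA total is so much larger than any other country's that the other bars would be nearly invisible on a linear axis.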



In [22]:
# Boolean indexing keeps only the USA rows (where() would keep every row and fill the rest with NaN)
statePlayers = baseballdata[baseballdata.birthCountry == 'USA']
state_group = statePlayers.groupby(by='birthState')
print(state_group.birthState.count())

birthState
AK       69
AL     1858
AR      828
AZ      536
CA    11988
CO      395
CT      817
DC      334
DE      215
FL     2686
GA     1824
HI      218
IA      964
ID      167
IL     4590
IN     1577
KS      953
KY     1180
LA     1314
MA     2158
MD     1087
ME      186
MI     2052
MN      701
MO     2529
MS     1040
MT       93
NC     2134
ND       78
NE      607
NH      205
NJ     1827
NM      126
NV      159
NY     4604
OH     4146
OK     1393
OR      711
PA     4841
RI      329
SC     1026
SD      183
TN     1392
TX     4478
UT      175
VA     1295
VT      141
WA      979
WI      949
WV      549
WY       93
Name: birthState, dtype: int64
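
Going one level below the state counts, toward the most-common-hometown question from the overview, a hedged sketch on made-up data (the `toy` frame is hypothetical and has one row per player) could look like this:

```python
import pandas as pd

# Made-up player table: one row per player.
toy = pd.DataFrame({
    "playerID": ["a01", "b02", "c03", "d04", "e05"],
    "birthCountry": ["USA", "USA", "USA", "Cuba", "Cuba"],
    "birthCity": ["Chicago", "Chicago", "Dallas", "Havana", "Havana"],
})

# Count distinct players for every (country, city) pair.
city_counts = (toy.groupby(["birthCountry", "birthCity"])["playerID"]
                  .nunique()
                  .reset_index(name="players"))

# Sort by count and keep the first (largest) city for each country.
top_city = (city_counts.sort_values("players", ascending=False)
                       .drop_duplicates("birthCountry")
                       .set_index("birthCountry"))
print(top_city)
```

On the real data the same city name can occur in more than one state, so grouping by `birthState` as well may be needed to keep those hometowns separate.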
