# Machine Learning Lab 1
### Quincy Schurr

#### Overview

The dataset I will be using for this lab is Lahman's Baseball Database. It comprises 24 tables that describe a variety of baseball statistics for players from 1871 through the 2015 season. The four main tables are the Master, Batting, Pitching, and Fielding tables. For my analysis I will be using the Master and Batting tables.

The Master table contains the demographic data for each player, including their name, playerID, date of birth, hometown, height, and weight. This table originally had 24 features and 19049 rows, covering all players in baseball. The requirements for this project call for at least 30,000 records, which the Master table alone does not meet. Once I have cleaned up the data and merged it with the other tables I want to draw from, there will be several records for each player and the total will be well above that threshold.
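To illustrate why merging increases the record count, here is a minimal sketch on made-up data. The frames `master_toy` and `batting_toy` and all of their values are hypothetical stand-ins, not rows from the real tables. The sketch assumes, as in the description above, one demographic row per player and several statistics rows per player.

```python
import pandas as pd

# Hypothetical stand-in for the Master table: one row per player.
master_toy = pd.DataFrame({
    "playerID": ["aaa01", "bbb01"],
    "nameLast": ["Alpha", "Beta"],
})

# Hypothetical stand-in for the Batting table: one row per player per season.
batting_toy = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "aaa01", "bbb01"],
    "yearID": [2001, 2002, 2003, 2001],
    "HR": [10, 12, 8, 3],
})

# An inner merge on playerID repeats each player's demographic row
# once for every matching batting row.
merged_toy = master_toy.merge(batting_toy, on="playerID", how="inner")

print(len(master_toy), "players ->", len(merged_toy), "merged records")
```

The merged frame has as many rows as the batting frame has matching records, which is how a table of roughly 19,000 players can grow past the 30,000-record requirement.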

The other three main tables contain each player's playing statistics for every year they played, including their team. I cleaned up the data by dropping columns that I did not want to analyze, either because not all players had that statistic or because it was unnecessary for the study.

The purpose of this dataset is fun and learning. It is made available so that baseball fans can analyze player performance and so that baseball statisticians can look for correlations in the player statistics. This is helpful for baseball teams looking to use their budget to acquire the best players available at the lowest cost.

There are many statistics in this dataset that I could analyze to draw interesting conclusions. I would like to find the most common hometown in each country represented by players. I would also like to look at how many trades have occurred during each season, perhaps broken down by team, to see trends over time. The correlation between batting average and home runs hit would be an interesting statistic as well. The last category I would like to analyze is the average time a player stays in the major leagues.
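As a rough sketch of how two of these questions could be approached, the example below computes a batting average (hits divided by at-bats), its correlation with home runs, and a career length from the `debut` and `finalGame` dates. The frame `toy` and every number in it are invented for illustration, and the sketch assumes the date columns are strings in `YYYY-MM-DD` form.

```python
import pandas as pd

# Hypothetical player lines; all values are invented for illustration.
toy = pd.DataFrame({
    "playerID": ["a", "b", "c", "d"],
    "AB": [500, 450, 0, 600],
    "H": [150, 120, 0, 180],
    "HR": [30, 10, 0, 25],
    "debut": ["2001-04-02", "1999-06-15", "2005-09-01", "1990-04-09"],
    "finalGame": ["2010-09-28", "2003-10-01", "2005-09-30", "2004-10-03"],
})

# Batting average is undefined with zero at-bats, so exclude those rows.
with_ab = toy[toy["AB"] > 0].copy()
with_ab["AVG"] = with_ab["H"] / with_ab["AB"]

# Pearson correlation between batting average and home runs.
corr = with_ab["AVG"].corr(with_ab["HR"])

# Career length in years, from first to last game played.
career_days = pd.to_datetime(toy["finalGame"]) - pd.to_datetime(toy["debut"])
career_years = career_days.dt.days / 365.25

print("correlation:", corr)
print("mean career length (years):", career_years.mean())
```

In a merged table with one row per season, `debut` and `finalGame` would repeat on every row for a player, so the career-length step would need one row per `playerID` first, for example via `drop_duplicates('playerID')`.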



#### Data Understanding

The following section describes each statistic in this dataset.

###### playerID
The player ID identifies the player in every table of statistics. It is defined in the Master table, is unique to each player, and is used to link most of the other tables together.
###### birthYear
This is the year the player was born. It is an integer, and the data is ordinal.
###### birthMonth
###### birthDay
###### birthCountry
###### birthState
###### birthCity
###### nameFirst
###### nameLast
###### weight
###### height
###### bats
###### throws
###### debut
###### finalGame
###### yearID
###### teamID
###### lgID
###### G
###### AB
###### R
###### H
###### 2B
###### 3B
###### HR
###### RBI
###### BB
###### SO

In [13]:
import numpy as np
import pandas as pd

master = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Master.csv')
master.drop(['nameGiven', 'retroID', 'bbrefID', 'deathYear', 'deathMonth', 'deathDay', 'deathCountry', 'deathState', 'deathCity'], axis = 1, inplace=True)

In [21]:
batting = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Batting.csv')
batting.drop(['CS', 'IBB', 'HBP', 'SH', 'SF', 'GIDP', 'stint', 'SB'], axis=1, inplace=True)

I dropped every record with a null or missing value rather than filling the gaps with column means, because the statistics should not be influenced by a large number of records holding mean values. The statistics in this dataset describe individual player performance, so a mean-filled value would not represent how that player actually performed that year. It was better to remove the whole record.
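The difference between the two options can be seen on a small made-up example. The frame `toy` and its values are hypothetical. The sketch only contrasts mean imputation with dropping the record and is not the cleaning code used in this lab.

```python
import numpy as np
import pandas as pd

# Hypothetical season totals; player "b" has a missing home run count.
toy = pd.DataFrame({
    "playerID": ["a", "b", "c"],
    "HR": [40.0, np.nan, 2.0],
})

# Option 1: fill the gap with the column mean.
# Player "b" is credited with 21 home runs that nobody actually hit.
filled = toy.assign(HR=toy["HR"].fillna(toy["HR"].mean()))

# Option 2: drop the incomplete record, keeping only observed values.
dropped = toy.dropna()

print(filled)
print(dropped)
```

With mean filling, player "b" ends up with a value halfway between a power hitter and a light hitter, which says nothing about that player's season.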

In [30]:
# Merge player demographics with batting statistics on the shared playerID key
baseballdata = master.merge(batting, on='playerID')
# Drop records with missing values; 83822 records remain afterwards
baseballdata.dropna(inplace=True)
# Convert to int because birth dates should not be floats; runs, hits, etc. are also whole numbers
floatList = ['birthYear', 'birthMonth', 'birthDay', 'R', 'AB', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO']
for x in floatList:
    baseballdata[x] = baseballdata[x].astype(int)
baseballdata.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 83822 entries, 0 to 101331
Data columns (total 28 columns):
playerID        83822 non-null object
birthYear       83822 non-null int64
birthMonth      83822 non-null int64
birthDay        83822 non-null int64
birthCountry    83822 non-null object
birthState      83822 non-null object
birthCity       83822 non-null object
nameFirst       83822 non-null object
nameLast        83822 non-null object
weight          83822 non-null float64
height          83822 non-null float64
bats            83822 non-null object
throws          83822 non-null object
debut           83822 non-null object
finalGame       83822 non-null object
yearID          83822 non-null int64
teamID          83822 non-null object
lgID            83822 non-null object
G               83822 non-null int64
AB              83822 non-null int64
R               83822 non-null int64
H               83822 non-null int64
2B              83822 non-null int64
3B              83822 n