# Baseball Data Hitting Statistics
by Ryan Reardon

This analysis uses a data set of hitting statistics for individual baseball players over a portion of the sport's history. It is limited to two hitting categories, batting average and home runs, but it also includes each player's height, weight, and handedness (batting stance). The data set is tidy and largely free of quality issues.

## Gather

In [1]:
#import libraries
import pandas as pd
import numpy as np

In [2]:
#read in baseball data file
df_data = pd.read_csv('baseball_data.csv')

## Assess

In [3]:
#assess baseball data file
df_data

Unnamed: 0,name,handedness,height,weight,avg,HR
0,Tom Brown,R,73,170,0.000,0
1,Denny Lemaster,R,73,182,0.130,4
2,Joe Nolan,L,71,175,0.263,27
3,Denny Doyle,L,69,175,0.250,16
4,Jose Cardenal,R,70,150,0.275,138
5,Mike Ryan,R,74,205,0.193,28
6,Fritz Peterson,B,72,185,0.159,2
7,Dick Bertell,R,72,200,0.250,10
8,Rod Kanehl,R,73,180,0.241,6
9,Ozzie Osborn,R,74,195,0.000,0


In [4]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1157 entries, 0 to 1156
Data columns (total 6 columns):
name          1157 non-null object
handedness    1157 non-null object
height        1157 non-null int64
weight        1157 non-null int64
avg           1157 non-null float64
HR            1157 non-null int64
dtypes: float64(1), int64(3), object(2)
memory usage: 54.3+ KB
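
The assessment above relies on reading the frame and `info()` by eye. As a minimal sketch, the same checks can be expressed in code. The three-row `sample` frame below is a hypothetical stand-in built from the first rows shown above, not the full file, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for baseball_data.csv, using the first three rows shown above
sample = pd.DataFrame({
    'name': ['Tom Brown', 'Denny Lemaster', 'Joe Nolan'],
    'handedness': ['R', 'R', 'L'],
    'height': [73, 73, 71],
    'weight': [170, 182, 175],
    'avg': [0.000, 0.130, 0.263],
    'HR': [0, 4, 27],
})

# Missing values per column
null_counts = sample.isnull().sum()

# Number of fully duplicated rows
duplicate_rows = int(sample.duplicated().sum())

# Distinct handedness codes present in the data
stances = sorted(sample['handedness'].unique())

# Players with neither a batting average nor any home runs
no_stats = int(((sample['avg'] == 0) & (sample['HR'] == 0)).sum())

print(null_counts)
print(duplicate_rows, stances, no_stats)
```

Run against the full `df_data`, the last count would show how many players have no batting statistics at all.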


## Clean

#### Quality
- No quality issues, other than the inclusion of players without batting statistics (AL pitchers after the DH rule). Those players are dropped below; pitchers who do have batting stats are kept.

#### Tidy
- No tidiness issues
- Add a BMI column and a BMI classification column for analysis
- Rename and reorder the columns for analysis

### Add BMI Statistic Column

In [5]:
#add a BMI column (imperial formula: 703 * weight in pounds / height in inches squared)
wt = df_data['weight']
ht = df_data['height']

df_data['BMI'] = (wt/ht/ht)*703

#TEST
df_data.head()

Unnamed: 0,name,handedness,height,weight,avg,HR,BMI
0,Tom Brown,R,73,170,0.0,0,22.426346
1,Denny Lemaster,R,73,182,0.13,4,24.009383
2,Joe Nolan,L,71,175,0.263,27,24.40488
3,Denny Doyle,L,69,175,0.25,16,25.84016
4,Jose Cardenal,R,70,150,0.275,138,21.520408
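
As a quick sanity check on the BMI column, the calculation can be worked by hand for single players. The helper name `bmi_imperial` is hypothetical and used only for illustration; it assumes weight in pounds and height in inches, which is what the factor 703 implies.

```python
def bmi_imperial(weight_lb, height_in):
    """Body mass index from pounds and inches: 703 * weight / height squared."""
    return 703 * weight_lb / height_in ** 2

# Tom Brown: 170 lb, 73 in  ->  703 * 170 / 5329
tom_brown = bmi_imperial(170, 73)

# Jose Cardenal: 150 lb, 70 in  ->  703 * 150 / 4900
jose_cardenal = bmi_imperial(150, 70)

print(round(tom_brown, 6), round(jose_cardenal, 6))
```

Both values agree with the `BMI` column in the output above to the displayed precision.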


### Add BMI Classification Column

In [6]:
#function to place BMI into a category
#thresholds are checked from highest to lowest, so no value falls through a gap
def b(row):
    bmi = row['BMI']
    if bmi >= 40:
        val = 'morbidly obese'
    elif bmi >= 30:
        val = 'obese'
    elif bmi >= 25:
        val = 'overweight'
    elif bmi >= 18.5:
        val = 'normal'
    else:
        val = 'underweight'
    return val

In [7]:
#name BMI category column
df_data['BMI_Classification'] = df_data.apply(b, axis=1)

#TEST
df_data.head()

Unnamed: 0,name,handedness,height,weight,avg,HR,BMI,BMI_Classification
0,Tom Brown,R,73,170,0.0,0,22.426346,normal
1,Denny Lemaster,R,73,182,0.13,4,24.009383,normal
2,Joe Nolan,L,71,175,0.263,27,24.40488,normal
3,Denny Doyle,L,69,175,0.25,16,25.84016,overweight
4,Jose Cardenal,R,70,150,0.275,138,21.520408,normal
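
A row-wise `apply` works well at this size. As an alternative sketch, the same binning can be done in one vectorized call with `pd.cut`. This is not the method used in this notebook; the bin edges mirror the thresholds above, and treating a BMI of exactly 40 as 'morbidly obese' is an assumption about the intended boundary. The `bmi` series holds the five values shown in the output above.

```python
import numpy as np
import pandas as pd

# BMI values for the first five players, copied from the output above
bmi = pd.Series([22.426346, 24.009383, 24.404880, 25.840160, 21.520408])

# Left-closed bins: [-inf, 18.5), [18.5, 25), [25, 30), [30, 40), [40, inf)
bins = [-np.inf, 18.5, 25, 30, 40, np.inf]
labels = ['underweight', 'normal', 'overweight', 'obese', 'morbidly obese']

# right=False makes each bin include its lower edge and exclude its upper edge
classes = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(classes.astype(str).tolist())
```

Because the bins share their edges, every finite BMI lands in exactly one category.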


### Rename and Reorder Columns

In [8]:
#rename columns
df_data = df_data.rename(columns={'avg': 'BA', 'handedness': 'Batted', 'name': 'Name',
                                 'height': 'Height','weight': 'Weight'})

#reorder the columns
df_data = df_data[['Name', 'Batted', 'BA', 'HR', 'Height', 'Weight', 'BMI', 'BMI_Classification']]

In [9]:
#check for null data in BMI classification
df_data['BMI_Classification'].isnull().sum()

0

In [10]:
#TEST
df_data.head()

Unnamed: 0,Name,Batted,BA,HR,Height,Weight,BMI,BMI_Classification
0,Tom Brown,R,0.0,0,73,170,22.426346,normal
1,Denny Lemaster,R,0.13,4,73,182,24.009383,normal
2,Joe Nolan,L,0.263,27,71,175,24.40488,normal
3,Denny Doyle,L,0.25,16,69,175,25.84016,overweight
4,Jose Cardenal,R,0.275,138,70,150,21.520408,normal


### Drop Players without Batting Statistics

In [11]:
#drop players without batting stats
df_data = df_data.drop(df_data[(df_data.BA == 0) & (df_data.HR == 0)].index)

#Test check dataframe
df_data

Unnamed: 0,Name,Batted,BA,HR,Height,Weight,BMI,BMI_Classification
1,Denny Lemaster,R,0.130,4,73,182,24.009383,normal
2,Joe Nolan,L,0.263,27,71,175,24.404880,normal
3,Denny Doyle,L,0.250,16,69,175,25.840160,overweight
4,Jose Cardenal,R,0.275,138,70,150,21.520408,normal
5,Mike Ryan,R,0.193,28,74,205,26.317568,overweight
6,Fritz Peterson,B,0.159,2,72,185,25.087770,overweight
7,Dick Bertell,R,0.250,10,72,200,27.121914,overweight
8,Rod Kanehl,R,0.241,6,73,180,23.745543,normal
12,Juan Bonilla,R,0.256,7,69,170,25.101869,overweight
13,Frank Tepedino,L,0.241,6,71,185,25.799445,overweight


In [12]:
#Test view dataframe info
df_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 1154
Data columns (total 8 columns):
Name                  891 non-null object
Batted                891 non-null object
BA                    891 non-null float64
HR                    891 non-null int64
Height                891 non-null int64
Weight                891 non-null int64
BMI                   891 non-null float64
BMI_Classification    891 non-null object
dtypes: float64(2), int64(3), object(3)
memory usage: 62.6+ KB
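
The drop leaves 891 of the original 1157 rows, and the index keeps its original labels (1 to 1154). As a sketch of an equivalent approach, a boolean mask selects the rows to keep, and `reset_index(drop=True)` renumbers them from 0. The `sample` frame is a hypothetical three-row stand-in for `df_data`.

```python
import pandas as pd

# Hypothetical stand-in for df_data after the rename step
sample = pd.DataFrame({
    'Name': ['Tom Brown', 'Denny Lemaster', 'Joe Nolan'],
    'BA': [0.000, 0.130, 0.263],
    'HR': [0, 4, 27],
})

# True for players who have a batting average or at least one home run
has_stats = ~((sample['BA'] == 0) & (sample['HR'] == 0))

# Keep those rows and renumber the index from 0
kept = sample[has_stats].reset_index(drop=True)

print(kept)
```

Resetting the index is optional here, since the file is later saved with `index=False`.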


In [13]:
#save wrangled baseball data file for the data story in Tableau
df_data.to_csv('baseball_data_new.csv', encoding='utf-8', index=False)
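
Before loading the file into another tool, it can help to read it back and confirm the columns survived the round trip. The sketch below writes a hypothetical two-row `sample` frame to a temporary directory rather than touching the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the wrangled df_data
sample = pd.DataFrame({
    'Name': ['Denny Lemaster', 'Joe Nolan'],
    'Batted': ['R', 'L'],
    'BA': [0.130, 0.263],
    'HR': [4, 27],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'baseball_data_new.csv')
    # index=False keeps the row index out of the file
    sample.to_csv(path, encoding='utf-8', index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist(), len(reloaded))
```

With `index=False`, the reloaded frame has no extra `Unnamed: 0` column.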