---
Title: "Learning the Basics of Pandas in Python Using 2024 MLB Team Batting Data - ANSWER KEY"

Author:
  - Name: Austin Hayes

  -  Email: ahayes65@charlotte.edu

  -  Affiliation: University of North Carolina at Charlotte

Date: June 2, 2025

Description: Using 2024 batting data from every MLB team, we explore the basic functions and uses of the pandas library to learn about data science in Python.

Categories:
  - Dataframes
  - Summary statistics
  - Importing and Reading data
  - Series
  - Data Science
---


### Data

This dataset is originally from Baseball Reference and has been converted to CSV and Excel files for this learning module.

Visit their website at: https://www.baseball-reference.com/leagues/majors/2024.shtml

The data set contains 32 rows and 29 columns. Each row represents an MLB team.

Download data: 

Available on the [Data For Pandas Module Data Repository](https://github.com/ahaze65/Data-For-Pandas-Module): [2024_MLB_Team_Batting_Data.csv](https://raw.githubusercontent.com/ahaze65/Data-For-Pandas-Module/refs/heads/main/2024_MLB_Team_Batting_Data.csv)

<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description |
|----|----------------------------|
| Tm | Team |
| '#Bat' | Number of Players used in Games | 
| BatAge | Batters’ average age. Weighted by AB + Games Played |
| R/G | Runs Scored Per Game |
| G | Games Played or Pitched |
| PA | Plate Appearances. When available, we use actual plate appearances from play-by-play game accounts. Otherwise estimated using AB + BB + HBP + SF + SH, which excludes catcher interferences. |
| AB | At Bats |
| R | Runs Scored/Allowed |
| H | Hits/Hits Allowed |
| 2B | Doubles |
| 3B | Triples |
| HR | Home Runs Hit/Allowed |
| RBI | Runs Batted In |
| SB | Stolen Bases |
| CS | Caught Stealing |
| BB | Bases on Balls/Walks |
| SO | Strikeouts |
| BA | Hits/At Bats. For recent years, leaders need 3.1 PA per team game played. |
| OBP | (H + BB + HBP) / (At Bats + BB + HBP + SF). For recent years, leaders need 3.1 PA per team game played. |
| SLG | Total Bases/At Bats OR `(1B + 2*2B + 3*3B + 4*HR) / AB`. For recent years, leaders need 3.1 PA per team game played. |
| OPS | On-Base + Slugging Percentages. For recent years, leaders need 3.1 PA per team game played. |
| OPS+ | `100*[OBP/lg OBP + SLG/lg SLG - 1]`. Adjusted to the player’s ballpark(s) |
| TB | Total Bases. Singles + 2 x Doubles + 3 x Triples + 4 x Home Runs. |
| GDP | Double Plays Grounded Into. Only includes standard 6-4-3, 4-3, etc. double plays. First tracked in 1933. For gamelogs only in seasons we have play-by-play, we include triple plays as well. All official seasonal totals do not include GITP's. |
| HBP | Times Hit by a Pitch |
| SH | Sacrifice Hits (Sacrifice Bunts) |
| SF | Sacrifice Flies. First tracked in 1954. |
| IBB | Intentional Bases on Balls. First tracked in 1955. |
| LOB | Runners Left On Base |

</details>
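
The rate statistics in the table (BA, OBP, SLG, OPS) and TB are all derived from counting columns, so they can be recomputed in pandas. The sketch below applies the formulas from the table to a small made-up DataFrame; the team names and numbers are hypothetical, not real 2024 values.

```python
import pandas as pd

# Made-up counting stats for two hypothetical teams (not real 2024 values)
toy_batting = pd.DataFrame(
    {
        "Tm": ["Team A", "Team B"],
        "AB": [5000, 5400],
        "H": [1250, 1350],
        "2B": [250, 270],
        "3B": [20, 30],
        "HR": [180, 200],
        "BB": [500, 450],
        "HBP": [50, 60],
        "SF": [40, 45],
    }
)

# Singles are the hits that are not doubles, triples, or home runs
singles = toy_batting["H"] - toy_batting["2B"] - toy_batting["3B"] - toy_batting["HR"]

# TB = Singles + 2 x Doubles + 3 x Triples + 4 x Home Runs
toy_batting["TB"] = singles + 2 * toy_batting["2B"] + 3 * toy_batting["3B"] + 4 * toy_batting["HR"]

# BA = H / AB
toy_batting["BA"] = toy_batting["H"] / toy_batting["AB"]

# OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
toy_batting["OBP"] = (toy_batting["H"] + toy_batting["BB"] + toy_batting["HBP"]) / (
    toy_batting["AB"] + toy_batting["BB"] + toy_batting["HBP"] + toy_batting["SF"]
)

# SLG = TB / AB, and OPS = OBP + SLG
toy_batting["SLG"] = toy_batting["TB"] / toy_batting["AB"]
toy_batting["OPS"] = toy_batting["OBP"] + toy_batting["SLG"]

print(toy_batting[["Tm", "TB", "BA", "OBP", "SLG", "OPS"]].round(3))
```

Each line operates on whole columns at once, which is the usual pandas style: no loop over teams is needed.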

# Answer Key:

## For Questions 1 - 3 use the following dataset: https://raw.githubusercontent.com/ahaze65/Data-For-Pandas-Module/refs/heads/main/Pitching%20Data%20-%20MLB%202024%20Team%20Stats.csv


SOURCE:

This dataset is from Baseball Reference. Visit their website at https://www.baseball-reference.com

Or for this exact dataset: https://www.baseball-reference.com/leagues/majors/2024.shtml#all_teams_standard_pitching 

### 1. Import the dataset from above, drop the last 2 rows, and find the mean number of earned runs allowed last season.

Hint: Earned Runs Allowed column name is 'ER'

In [None]:
import pandas as pd

import numpy as np

In [None]:
# Import data
pitching_data = pd.read_csv('https://raw.githubusercontent.com/ahaze65/Data-For-Pandas-Module/refs/heads/main/Pitching%20Data%20-%20MLB%202024%20Team%20Stats.csv')

# Drop the last 2 rows
pitching_data = pitching_data.drop([30,31])

# Calculate the average ER among teams
average_er = pitching_data['ER'].mean()

print(f"The average number of earned runs allowed (ER) among teams is: {average_er:.2f}")
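
The answer above drops rows by their index labels (30 and 31), which works as long as the DataFrame still has its default integer index. A position-based alternative is sketched below on a toy table; the team names, the summary-row names, and the numbers are all made up for illustration, and the idea that the last two rows are non-team summary rows is an assumption about why the exercise drops them.

```python
import pandas as pd

# Toy stand-in for the pitching table: three team rows plus two trailing
# rows that are not individual teams (hypothetical names and values)
toy_pitching = pd.DataFrame(
    {
        "Tm": ["Team A", "Team B", "Team C", "League Average", "Total"],
        "ER": [600, 650, 700, 650, 1950],
    }
)

# iloc slices by position, so this keeps everything except the last two rows
# no matter what the index labels happen to be
teams_only = toy_pitching.iloc[:-2]

# The mean is now computed over the team rows only
print(teams_only["ER"].mean())  # 650.0
```

Slicing with `iloc[:-2]` returns a new DataFrame and does not modify `toy_pitching`, so the result has to be assigned to a name, just as the answer reassigns `pitching_data`.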

### 2. Assign a string index label to every row that matches the team in that row. Test this by using the function that selects rows by label to view a team of your choice.

Hint: Reuse the team name column 'Tm'

Hint: This function is in the data selection section.

In [None]:
# Make indexing labels
pitching_data.index = pitching_data['Tm']

# Select a team by its name
pitching_data.loc['Washington Nationals']
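
Assigning `pitching_data['Tm']` to the index keeps `Tm` as a regular column as well. A common alternative is `set_index`, which moves the column into the index instead. The sketch below shows that variant and two forms of `.loc` on a made-up table (hypothetical team names and values).

```python
import pandas as pd

toy_teams = pd.DataFrame(
    {"Tm": ["Team A", "Team B", "Team C"], "ER": [600, 650, 700]}
)

# set_index returns a new DataFrame with 'Tm' moved out of the columns
# and into the index
by_team = toy_teams.set_index("Tm")

# .loc with a single label returns that row as a Series
team_b = by_team.loc["Team B"]
print(team_b["ER"])  # 650

# .loc with a list of labels returns a DataFrame with those rows
print(by_team.loc[["Team A", "Team C"]])
```

Which form to use depends on whether later steps still need `Tm` as a column; the answer above keeps it, this sketch does not.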

### 3. Print out a list of teams who had at least 1 complete pitching shutout last season. 

Hint: complete shutout column is called 'cSho'

In [None]:
# Select teams with at least 1 complete game shutout (cSho)
pitching_data.loc[pitching_data['cSho'] >= 1]
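
The expression inside `.loc[...]` is a boolean mask: a Series of `True`/`False` values, one per row. The sketch below, on a made-up table with hypothetical teams, shows the mask on its own and then uses it to pull out only the team names as a plain Python list.

```python
import pandas as pd

toy_shutouts = pd.DataFrame(
    {"Tm": ["Team A", "Team B", "Team C", "Team D"], "cSho": [0, 2, 1, 0]}
)

# Comparing a column to a value gives one True/False entry per row
mask = toy_shutouts["cSho"] >= 1
print(mask.tolist())  # [False, True, True, False]

# .loc[mask, 'Tm'] keeps the matching rows and only the team-name column
teams_with_shutout = toy_shutouts.loc[mask, "Tm"].tolist()
print(teams_with_shutout)  # ['Team B', 'Team C']
```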

### 4. Create your own dataframe with at least 10 rows and 3 columns. Include at least 3 rows with null data points (Use np.nan) and make at least 1 column of numbers.

In [None]:
# Create a Dataframe
pitchers = pd.DataFrame(
    {
    'Player': ['Max Scherzer', 'Jacob deGrom', 'Clayton Kershaw', 'Gerrit Cole', 'Paul Skenes', 'Dallas Keuchel', 'Tarik Skubal', 'Jack Flaherty', 'Chris Sale', 'Yoshinobu Yamamoto', 'Justin Verlander'],
    'Team': ['Toronto Blue Jays', 'Texas Rangers', 'Los Angeles Dodgers', 'New York Yankees', 'Pittsburgh Pirates', 'Milwaukee Brewers','Detroit Tigers', 'Detroit Tigers', 'Atlanta Braves', 'Los Angeles Dodgers', 'Houston Astros'],
    'ERA': [2.85, np.nan, 3.01, 2.67, 3.50, 4.20, 3.80, 4.10, 3.90, np.nan, 2.75],
    'W': [15, 14, 16, 18, 12, 10, 9, 8, np.nan, 13, 17],
    'L': [4, 5, 6, 3, 7, np.nan, 9, 10, 5, 4, 3],
    }
)

#Print it out
print(pitchers)

### 5. Using the dataframe you created in the last question, fill those null values with a phrase of your choosing and print out the dataframe. Make sure you are NOT applying those changes to the original dataframe.

Hint: Do not use the inplace argument in your function call.

In [None]:
# Fill null values with 'N/A'; fillna returns a new DataFrame and leaves pitchers unchanged
filled_pitchers = pitchers.fillna('N/A')

# Print out the filled copy
print(filled_pitchers)
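
One way to confirm that the original DataFrame was not changed is to count the nulls before and after. The sketch below does this on a small made-up table (hypothetical player labels); it assumes nothing beyond `fillna` and `isna`.

```python
import numpy as np
import pandas as pd

toy_era = pd.DataFrame(
    {"Player": ["P1", "P2", "P3"], "ERA": [2.85, np.nan, 3.01]}
)

# Without inplace=True, fillna returns a new DataFrame
filled_era = toy_era.fillna("N/A")

# The original still has its null value
print(toy_era["ERA"].isna().sum())  # 1

# The returned copy has none
print(filled_era["ERA"].isna().sum())  # 0
print(filled_era.loc[1, "ERA"])  # N/A
```

Note that filling a numeric column with a string means the filled column no longer holds only numbers, so numeric summaries such as `mean()` should be run on the original rather than the filled copy.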

### 6. Still using the dataframe you created, drop any rows that contain null values. This time, make sure you apply those changes to your dataframe and NOT to a temporary copy.


In [None]:
# Drop rows with null values and apply changes
pitchers.dropna(inplace=True)

# Print out the result
print(pitchers)
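
Reassigning the result of `dropna()` has the same end effect as `inplace=True`. The sketch below shows that form on a made-up table (hypothetical player labels), along with the `subset` parameter, which restricts the null check to chosen columns.

```python
import numpy as np
import pandas as pd

toy_pitchers = pd.DataFrame(
    {
        "Player": ["P1", "P2", "P3", "P4"],
        "ERA": [2.85, np.nan, 3.01, 2.67],
        "W": [15, 14, np.nan, 18],
    }
)

# Drop every row that has a null in any column
no_nulls = toy_pitchers.dropna()
print(no_nulls.index.tolist())  # [0, 3]

# subset= only checks the listed columns for nulls
era_known = toy_pitchers.dropna(subset=["ERA"])
print(era_known.index.tolist())  # [0, 2, 3]
```

The surviving rows keep their original index labels, so there are gaps after a drop. This matters for any later step that refers to rows by label.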

### 7. Using one function, calculate all key statistics for your dataframe such as mean, median, min, etc.

In [None]:
# Get key statistics of the DataFrame
pitchers.describe()

### 8. View the last 5 rows of your dataframe and drop 3 of them of your choosing with only one line of code. View the last 5 rows again to check your changes.

In [None]:
# View last 5 rows of the DataFrame
print(pitchers.tail())

In [None]:
# Drop specific rows by index
pitchers.drop([10, 3, 4], inplace=True)

In [None]:
# Print out the DataFrame after dropping rows
print(pitchers.tail())

### 9. Choose one of your numeric columns, convert it to an integer data type (if it is not already) and then into a Numpy array. Print the final array out.

In [None]:
# Check initial data types
pitchers.info()

In [None]:
# Convert the Win column to type int
pitchers['W'] = pitchers['W'].astype(int)

# Check to see if it changed (info() prints directly, so no print() is needed)
pitchers.info()


In [None]:
# Convert it to numpy array and print it out
pitchers_array = pitchers['W'].to_numpy()
print(pitchers_array)   
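
The conversion with `astype(int)` works here only because the rows with null values were dropped in Question 6. The sketch below, on a made-up Series, shows what happens when a null is still present and one possible workaround using the nullable `Int64` dtype; whether that dtype suits a given analysis is a judgment call.

```python
import numpy as np
import pandas as pd

toy_wins = pd.Series([15, 14, np.nan, 13], name="W")

# NaN is a float, so a column containing it is stored as float64
print(toy_wins.dtype)

# Plain int cannot represent a missing value, so this conversion raises
cast_failed = False
try:
    toy_wins.astype(int)
except (ValueError, TypeError) as err:
    cast_failed = True
    print(f"astype(int) failed: {err}")

# The nullable integer dtype 'Int64' (capital I) keeps the missing value
nullable_wins = toy_wins.astype("Int64")
print(nullable_wins)

# Dropping the nulls first gives a plain integer NumPy array
clean_wins = toy_wins.dropna().astype(int).to_numpy()
print(clean_wins)
```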