Comments: M. Schuckers 
11 June 2025

1. Add a set of learning goals at the start that are specifically what students should take away from the module - done
2. 'pip install pandas' did not work for me - linked to a resource
3. Add some motivation about why we would read in data. - done
4. Motivate the use of 'NA' for EXCEL_Batting_Data_2024_NA_version, be clear about why we might want to do that. - done
5. Add some additional detail about what is a Dataframe at the start of that section - done
6. Explain what 0-based indexing is - done
7. Add some depth to the np.nan section.  Explain a bit about missing data - done
8. When you change to float and you're looking at Dtype's, maybe talk about the consequences of this difference - done
9. Talk a bit about numpy and what it is useful for - done.
10. Make sure you've got comments for nearly all of your cells. - done



---
Title: "Learning the Basics of Pandas in Python Using 2024 MLB Team Batting Data"

Author:
  - Name: Austin Hayes

  -  Email: ahayes65@charlotte.edu

  - Affiliation: University of North Carolina at Charlotte

Date: June 2, 2025

Description: Using 2024 Batting Data from every team, we will be exploring the basic functions and uses of the Pandas library in Python in order to learn about Data Science in Python.

Categories:
  - Dataframes
  - Summary statistics
  - Importing and Reading data
  - Series
  - Data Science


### Data

This dataset is originally from Baseball Reference and has been converted to CSV and Excel files for this learning module. 

Visit their website at: https://www.baseball-reference.com/leagues/majors/2024.shtml

The data set contains 32 rows and 29 columns. The first 30 rows each represent an MLB team; the last two are league summary rows rather than teams.

Download data: 

Available on the [Data For Pandas Module Data Repository](https://github.com/ahaze65/Data-For-Pandas-Module): [2024_MLB_Team_Batting_Data.csv](https://raw.githubusercontent.com/ahaze65/Data-For-Pandas-Module/refs/heads/main/2024_MLB_Team_Batting_Data.csv)

<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description |
|----|----------------------------|
| Tm | Team |
| '#Bat' | Number of Players used in Games | 
| BatAge | Batters’ average age. Weighted by AB + Games Played |
| R/G | Runs Scored Per Game |
| G | Games Played or Pitched |
| PA | Plate Appearances. When available, we use actual plate appearances from play-by-play game accounts. Otherwise estimated using AB + BB + HBP + SF + SH, which excludes catcher interferences. |
| AB | At Bats |
| R | Runs Scored/Allowed |
| H | Hits/Hits Allowed |
| 2B | Doubles (two-base hits) | 
| 3B | Triples (three-base hits) |
| HR | Home Runs Hit/Allowed |
| RBI | Runs Batted In |
| SB | Stolen Bases |
| CS | Caught Stealing |
| BB | Bases on Balls/Walks |
| SO | Strikeouts |
| BA | Hits/At Bats. For recent years, leaders need 3.1 PA per team game played. |
| OBP | (H + BB + HBP) / (At Bats + BB + HBP + SF). For recent years, leaders need 3.1 PA per team game played. |
|SLG | Total Bases/At Bats OR (1B + 2*2B + 3*3B + 4*HR) / AB. For recent years, leaders need 3.1 PA per team game played. |
|OPS | On-Base + Slugging Percentages. For recent years, leaders need 3.1 PA per team game played. |
|OPS+ | 100*[OBP/lg OBP + SLG/lg SLG - 1]. Adjusted to the player’s ballpark(s) |
| TB | Total Bases. Singles + 2 x Doubles + 3 x Triples + 4 x Home Runs. |
| GDP | Double Plays Grounded Into. Only includes standard 6-4-3, 4-3, etc. double plays. First tracked in 1933. For gamelogs only in seasons we have play-by-play, we include triple plays as well. All official seasonal totals do not include GITP's. |
| HBP | Times Hit by a Pitch |
| SH | Sacrifice Hits (Sacrifice Bunts) |
|SF | Sacrifice Flies. First tracked in 1954. |
| IBB | Intentional Bases on Balls. First tracked in 1955. |
|LOB | Runners Left On Base |

</details>

---

# Learning Goals:


- Use Pandas to import data from CSV and Excel files
- Create your own Dataframe using Python dictionaries
- Select rows and columns with loc and iloc
- Filter data with conditionals
- Calculate key statistics
- Handle null data
- Change data types

---

# Introduction:

This SCORE module is an introduction to the basics of the Pandas library in Python. We will go over the basic items needed to use Python and Pandas throughout this module. 

Some of the topics we will be discussing: 

- Reading and Importing Data
- Creating a Dataframe
- Creating a Series
- Data Selection
- Dataframe Analysis
- Data Manipulation

# Importing Pandas:

Before getting started, we need to import the Pandas library into this Jupyter Notebook. 

Try running the code cell below.

In [1]:
# Import the Pandas library
import pandas as pd

NOTE: "as pd" is an alias we can use for Pandas so that we do not have to type out "pandas" every time we want to use it.

If that code did not run, you may need to install Pandas on your computer. 

Here is a resource for learning how to download Pandas: https://pandas.pydata.org/docs/getting_started/install.html 

# Importing / Reading In Data:

Importing data into Python allows us to turn raw numbers into powerful insights with a few lines of Python. Learning to read data into Python isn’t just a skill; it’s the key to becoming a data scientist or a data analyst.

We can read in several different file types using the Pandas library. These include CSV files, Excel files (XLSX), HTML files, JSON files, etc. Pandas automatically reads the data in as a dataframe, which is essentially the Python version of an Excel spreadsheet for organizing data. We will talk about dataframes later on in the module.

We will keep it simple by importing the same dataset in CSV form and Excel form. 

Let's go ahead and run the code cells below so that we can read in our baseball datasets for analysis.

In [None]:
# Read in Baseball Data CSV file as a DataFrame
CSV_Batting_Data_2024 = pd.read_csv('https://raw.githubusercontent.com/ahaze65/Data-For-Pandas-Module/refs/heads/main/2024_MLB_Team_Batting_Data.csv')

In order to read an Excel file in Python you may need to install "openpyxl". 

Here is a resource to learn about installing it: https://openpyxl.readthedocs.io/en/stable/tutorial.html

Install commands in documentation are often shown with a leading "$". That symbol is the command-line prompt, not part of the command, so you do not need to type it.

In [None]:
# Read in Baseball Data Excel file as a DataFrame
EXCEL_Batting_Data_2024 = pd.read_excel('https://raw.githubusercontent.com/ahaze65/Data-For-Pandas-Module/main/2024_MLB_Team_Batting_Data_Excel_Version.xlsx')

Certain datasets may contain missing values that are represented by a stand-in value. For example, a dataset may have missing values that are labeled as 'NA' but we want them to be recognized as true missing values without a string representation. 

Following this example, we can register them as missing values with the code below:

In [None]:
# Read in Baseball Data Excel file as a DataFrame and treat all instances of 'NA' as missing values
EXCEL_Batting_Data_2024_NA_version = pd.read_excel('https://raw.githubusercontent.com/ahaze65/Data-For-Pandas-Module/main/2024_MLB_Team_Batting_Data_Excel_Version.xlsx', na_values=['NA'])

By doing this, we tell Pandas which values should be treated as missing values.
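As a small illustration of the same idea, here is a sketch that uses made-up data rather than the baseball files. The team names, numbers, and the stand-in strings "missing" and "unknown" are all assumptions chosen for the example. Note that Pandas already treats some common strings, including 'NA' and empty cells, as missing by default, so `na_values` is most useful for other stand-ins like these.

```python
import io
import pandas as pd

# Hypothetical toy data with two different stand-ins for missing values
raw_csv = io.StringIO(
    "Tm,HR,SB\n"
    "Team A,200,missing\n"
    "Team B,unknown,95\n"
    "Team C,180,110\n"
)

# Tell pandas which extra strings should be read in as missing values
toy = pd.read_csv(raw_csv, na_values=["missing", "unknown"])

print(toy)
print(toy.isna().sum())  # number of missing values in each column
```

Because the stand-in strings were converted to true missing values, the HR and SB columns are read in as numbers instead of text.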

# Dataframes and Basic Data Selection:

As discussed earlier, dataframes are a convenient way of organizing data in Python using Pandas. A dataframe is a lot like an Excel spreadsheet: it organizes data into rows and columns. Each column can hold a different data type, and the size of a dataframe is mutable, meaning rows and columns can be added or removed. Dataframes are essential to Pandas and allow you to do many different things to the data, including data cleaning, adding/deleting columns and rows, data entry, data manipulation, etc. 

Let's take a look at how to create our own dataframe. 

Run the following code:

In [5]:
# Create a new DataFrame called df that stores information about States and their capitals
df = pd.DataFrame(
    {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'Capital': ['Montgomery', 'Juneau', 'Phoenix', 'Little Rock', 'Sacramento']
    }
)

print(df)

        State      Capital
0     Alabama   Montgomery
1      Alaska       Juneau
2     Arizona      Phoenix
3    Arkansas  Little Rock
4  California   Sacramento


There are a few functions we can use to gather data from only specific columns and/or rows:

 - loc[ ]: Accesses rows and columns by labels.
 - iloc[ ]: Accesses rows and columns by integer-based index.
 - df_name['column_name']: Selects a specific column.
 - df_name[['column_name1', 'column_name2']]: Selects multiple columns.



Now let's try printing out a specific row and column from the dataframe we just made.

NOTE: By default, rows and columns in dataframes are numbered with 0-based indexing.

This means that counting starts at 0 and not 1, just as it does for a Python list. For example, to get the very first item in a list named numbers, I would write: numbers[0]. 

Here is a code example:

In [None]:
# Create a list (named numbers so that we do not overwrite Python's built-in list type)
numbers = [1,2,3,4,5,6,7,8,9,10]

# Get the first element of the list
numbers[0]

1

### Loc[ ]:

Syntax: df.loc[start_label : end_label]. Unlike most Python slicing, both the start and end labels are included.

 - If you want to print all rows from the second row to the end, you can use df.loc['row1':]. 

 - If you want to print every row up to and including the second row, you can use df.loc[:'row1'].

Loc[ ] uses labels to locate rows and columns, but our rows currently carry only the default integer labels 0 through 4. Let's give them more descriptive string labels. 

Run the code below:


In [6]:
# Create a true copy of our df (df_copy = df would only create a second name for the same DataFrame)
df_copy = df.copy()

# Add row labels to our DataFrame
df.index = ['row0', 'row1', 'row2', 'row3', 'row4']

# Let's try printing the DataFrame with .loc[]

print(df.loc['row2':'row4'])

           State      Capital
row2     Arizona      Phoenix
row3    Arkansas  Little Rock
row4  California   Sacramento


We can also apply filtering to our row-level operations. For example, let's filter and get the rows where the capital is Little Rock.  Note that we use `==` to test whether two values are equal; a single `=` assigns a value to an object.

In [None]:
# Print the row(s) where the Capital is 'Little Rock'
print(df.loc[df['Capital'] == 'Little Rock'])

         State      Capital
row3  Arkansas  Little Rock


We can also do the reverse and return any rows whose capital is NOT Little Rock.  Note that we use `!=` for 'not equal to.'

In [None]:
# Print the row(s) where the Capital is not 'Little Rock'
print(df.loc[df['Capital'] != 'Little Rock'])

           State     Capital
row0     Alabama  Montgomery
row1      Alaska      Juneau
row2     Arizona     Phoenix
row4  California  Sacramento


### iLoc[ ]:

Syntax: df.iloc[start_row (inclusive): end_row (exclusive)].

 - If you want to print all rows from the second row to the end, you can use df.iloc[1:]. 

 - If you want to print up to but not including the second row, you can use df.iloc[:1].

Unlike loc[ ], iloc[ ] always selects rows by integer position. It therefore works the same way whether a dataframe's rows carry string labels or the default integer labels. 

Run the code below:

In [None]:
# Print out the first 3 rows of the DataFrame
print(df.iloc[0:3])

Notice how iloc still works despite the string label indexing? Very convenient to use.
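To make the difference between the two slicing rules concrete, here is a minimal sketch. The small dataframe and its row labels "a" through "e" are assumptions made up for this example. It compares a label slice with loc[ ] against a position slice with iloc[ ], and also shows how to pick out a single cell by giving both a row and a column.

```python
import pandas as pd

# Hypothetical example DataFrame with string row labels
states = pd.DataFrame(
    {"State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"]},
    index=["a", "b", "c", "d", "e"],
)

by_label = states.loc["a":"c"]   # label slice: the end label "c" is included
by_position = states.iloc[0:2]   # position slice: stops before position 2

print(len(by_label))     # 3
print(len(by_position))  # 2

# Select a single cell by giving a row and a column
print(states.loc["b", "State"])  # Alaska
print(states.iloc[1, 0])         # Alaska
```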

### df_name['column_name'] and df_name[['column_name1', 'column_name2']]:

Luckily, columns are a bit easier to work with, whether you are selecting one or many. The only two differences are the use of single brackets vs. double brackets and the comma-separated list of column names. Single brackets return one column as a Series, while double brackets return a dataframe.

Let's try it out!

In [None]:
# Print out the state names only
print(df['State'])


In [None]:
# Now let's print out the state and capital names
print(df[['State', 'Capital']])
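
The following sketch checks what kind of object each style of selection returns. The three-row dataframe is an assumption created just for this example.

```python
import pandas as pd

# Hypothetical example DataFrame
caps = pd.DataFrame(
    {
        "State": ["Alabama", "Alaska", "Arizona"],
        "Capital": ["Montgomery", "Juneau", "Phoenix"],
    }
)

one_column = caps["State"]     # single brackets: a Series
one_frame = caps[["State"]]    # double brackets: a DataFrame with one column

print(type(one_column).__name__)  # Series
print(type(one_frame).__name__)   # DataFrame
print(one_column.shape)           # (3,)
print(one_frame.shape)            # (3, 1)
```

The distinction matters later on, because some functions behave differently on a Series than on a dataframe.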

# Dataframe Analysis:


The Pandas library gives us quite a few functions to gather information about a dataframe. 

Functions we will be learning about:
 - .head(): Displays the first n rows of a DataFrame.
 - .tail(): Displays the last n rows of a DataFrame.
 - .info(): Provides information about the DataFrame, including data types and non-null counts.
 - .describe(): Generates descriptive statistics of numerical columns.
 - .shape: Returns the dimensions of the DataFrame (rows, columns).


Let's use our Baseball dataset for this one!

## .head()

By default, .head() will show us the first 5 rows of a dataframe. However, we can specify how many rows we want if needed.

Run the code below:

In [None]:
# Print out the first 5 rows of the Dataframe
CSV_Batting_Data_2024.head()

In [None]:
# Print out only the first 2 rows of the Dataframe
CSV_Batting_Data_2024.head(2)

## .tail()

Just like .head(), .tail() will show us 5 rows by default but it shows us the last 5 rows of a dataframe instead. We can also specify how many rows we want if needed.

Run the code below:

In [None]:
# Print out the last 5 rows of the Dataframe
CSV_Batting_Data_2024.tail()

In [None]:
# Print out the last 2 rows of the Dataframe

CSV_Batting_Data_2024.tail(2)

NOTE: As you can see, there is a league average row and a 'NaN' row at the bottom. These represent a total sum row and an average row. However, we may not want these in our data since we only want teams as our rows. We will go over how to get rid of these later.

## .info()

.info() will give you some brief information about a dataset. This will include column names, column data types (strings, integers, floats, etc.), how many rows of non-null data are in each column, and memory usage. 

Run the code below:

In [None]:
# Let's get some information about our baseball data!
CSV_Batting_Data_2024.info()

## .describe()

.describe() will give you some basic summary statistics for each numerical column, including the count of rows, the average, the standard deviation, the minimum value, the maximum value and percentiles. 

Run the code below:

In [None]:
# Describe the DataFrame with key statistics
CSV_Batting_Data_2024.describe()

## .shape


.shape is a very simple attribute (note that it is written without parentheses) that returns how many rows and columns are in the dataframe. Its output lists rows first and columns second: (r, c).

In [None]:
# Let's check how many rows and columns are in the baseball dataset.
CSV_Batting_Data_2024.shape

There are 32 rows and 29 columns in total!

# Data Manipulation using Dataframes

There are a lot of different functions that you can use for data manipulation with Pandas. In this module we will go over the following:

- .drop()
- .dropna()
- .fillna()
- .astype()

## .drop()

.drop() is a very simple function that can drop given rows/columns in a dataframe. Let's use our Baseball dataset from earlier.

In [None]:
# Let's check the column names in the baseball dataset.
CSV_Batting_Data_2024.columns

Let's say that we don't care about the average age of batters. So let's create a copy of the dataset and get rid of that column.

In [None]:
# Create a copy of the dataset.
copy_dataset = CSV_Batting_Data_2024.copy()

# Let's get rid of the average batters age column 'BatAge'
copy_dataset.drop(columns=['BatAge'],inplace=True)
copy_dataset.head()

We use 'inplace=True' so that the change is applied directly to copy_dataset. Without it, .drop() leaves copy_dataset unchanged and returns a new dataframe with the column removed.
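
Here is a minimal sketch of what happens without the inplace argument. The two-team dataframe is an assumption made up for this example.

```python
import pandas as pd

# Hypothetical example DataFrame
teams = pd.DataFrame(
    {
        "Tm": ["Team A", "Team B"],
        "BatAge": [27.5, 29.1],
        "HR": [200, 180],
    }
)

# Without inplace=True, drop() returns a new DataFrame
trimmed = teams.drop(columns=["BatAge"])

print(teams.columns.tolist())    # ['Tm', 'BatAge', 'HR']
print(trimmed.columns.tolist())  # ['Tm', 'HR']
```

Assigning the result to a new name like this is a common alternative to inplace=True, since it keeps the original dataframe available in case you need it again.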

We can also drop rows and use conditions for .drop()

Run the code below:

In [None]:
# Drop the first row of the DataFrame
copy_dataset.drop(0, inplace=True)

# Get first 5 rows of the DataFrame after dropping the first row
copy_dataset.head()

See how it got rid of the Arizona Diamondbacks data using row indexing?
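
Dropping rows by a condition, as mentioned above, can be sketched as follows. The dataframe and the cutoff of 175 home runs are assumptions chosen only for illustration. The idea is to find the index labels of the rows that match the condition and pass those labels to .drop().

```python
import pandas as pd

# Hypothetical example DataFrame
teams = pd.DataFrame(
    {
        "Tm": ["Team A", "Team B", "Team C"],
        "HR": [200, 150, 180],
    }
)

# Index labels of the rows where HR is below the (made-up) cutoff
low_power_rows = teams[teams["HR"] < 175].index

# Drop those rows; the result is a new DataFrame
kept = teams.drop(low_power_rows)

# The same result using a filter that keeps the opposite condition
kept_alt = teams[teams["HR"] >= 175]

print(kept)
```

In practice, filtering for the rows you want to keep is often simpler than dropping the rows you do not want, but both approaches give the same dataframe here.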




Now that we have learned about .drop(), let's drop the last 2 rows of the dataset because they are not teams. 

In [None]:
# Drop the last two rows of the DataFrame
CSV_Batting_Data_2024.drop([30,31], inplace=True)

# Get the last 5 rows of the DataFrame after dropping the last two rows
CSV_Batting_Data_2024.tail()

## .dropna()

.dropna() is a very simple function that allows you to remove all rows with null values.

Let's start by creating our own dataframe for this example.

We are going to want to import Numpy in order to make null data entries.

In [None]:
# Import Numpy library
import numpy as np

Numpy is another Python library that is used for mathematical computation using arrays. It provides a very useful constant called np.nan. It stands for "Not a Number" and allows us to mark an entry as a null value. This is useful for flagging missing data with a "NaN". Missing data can be hard to handle at times, and sometimes labeling it as missing may be the best possible answer depending on the situation.

If the Numpy import does not work, you may need to download it. Here is a resource to help: https://numpy.org/install/

In [None]:
# Create a DataFrame with missing values
df1 = pd.DataFrame(
    {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', np.nan],
    'Capital': ['Montgomery', np.nan, 'Phoenix', 'Little Rock', 'Sacramento']
    }
)

# Print the DataFrame with missing values
print(df1)

      State      Capital
0   Alabama   Montgomery
1    Alaska          NaN
2   Arizona      Phoenix
3  Arkansas  Little Rock
4       NaN   Sacramento


Now let's get rid of those null values!

In [None]:
# Drop rows with any missing values
df1.dropna(inplace=True)

# Print out the resulting DataFrame after dropping rows with missing values
print(df1)

See how it dropped those rows! Very useful.
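
By default, .dropna() removes a row if any value in it is missing. As a sketch of two other options, the example below uses the `how` and `subset` arguments of .dropna(). The dataframe is an assumption made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical example DataFrame with different patterns of missing values
df2 = pd.DataFrame(
    {
        "State": ["Alabama", "Alaska", np.nan, np.nan],
        "Capital": ["Montgomery", np.nan, "Phoenix", np.nan],
    }
)

any_missing = df2.dropna()                  # drop rows with at least one missing value
all_missing = df2.dropna(how="all")         # drop only rows where every value is missing
state_only = df2.dropna(subset=["State"])   # only look at the 'State' column

print(len(any_missing), len(all_missing), len(state_only))  # 1 3 2
```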

## .fillna()

This function is used to fill null values with a value of your choice.

Let's reuse the custom dataframe from the .dropna() example.

In [None]:
# Create a DataFrame with missing values
df1 = pd.DataFrame(
    {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', np.nan],
    'Capital': ['Montgomery', np.nan, 'Phoenix', 'Little Rock', 'Sacramento']
    }
)

# Let's fill those missing values with 'empty!'
df1.fillna('empty!', inplace=True)

# Print out the resulting DataFrame after filling missing values
print(df1)
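
A single fill value is not always the right choice for every column. As a sketch, .fillna() also accepts a dictionary that maps column names to fill values. The dataframe, the "Unknown" label, and the choice to fill the numeric column with its mean are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical example DataFrame with one text column and one numeric column
df3 = pd.DataFrame(
    {
        "Tm": ["Team A", np.nan, "Team C"],
        "HR": [200, np.nan, 180],
    }
)

# Use a different fill value for each column
filled = df3.fillna({"Tm": "Unknown", "HR": df3["HR"].mean()})

print(filled)
```

Filling a numeric column with text such as 'empty!' would turn the whole column into a non-numeric type, so a numeric fill value keeps the column usable for calculations.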

## .astype()


This function converts a Pandas object to a specified data type.

Let's create a custom example with some players and enter their batting averages as strings.

In [14]:
# Let's create a new DataFrame with some batting numbers
batting_numbers = pd.DataFrame(
    {
        'Player': ['Bryce Harper', 'Andrew McCutchen', 'Buster Posey'],
        'Team': ['Philadelphia Phillies', 'Pittsburgh Pirates', 'San Francisco Giants'],
        'Batting_Average': ['0.300', '0.275', '0.290'],
        'Home_Runs': [30, 25, 20],
        'RBIs': [90, 80, 70]
    }
)

# Let's check the column type
print(batting_numbers.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Player           3 non-null      object
 1   Team             3 non-null      object
 2   Batting_Average  3 non-null      object
 3   Home_Runs        3 non-null      int64 
 4   RBIs             3 non-null      int64 
dtypes: int64(2), object(3)
memory usage: 252.0+ bytes
None


Notice how 'Batting_Average' is an object dtype? That means the values are stored as text, not numbers. Let's change the column to a float so the values are stored in a decimal format.

In [None]:
# Convert the 'Batting_Average' column to the float type
batting_numbers['Batting_Average'] = batting_numbers['Batting_Average'].astype(float)

# Let's check the column type again 
print(batting_numbers.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Player           3 non-null      object 
 1   Team             3 non-null      object 
 2   Batting_Average  3 non-null      float64
 3   Home_Runs        3 non-null      int64  
 4   RBIs             3 non-null      int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 252.0+ bytes
None


There are a few reasons why we may want to change a column's data type to something else:

- Ensures Data is properly represented with correct formatting
- Helps prevent errors during calculation and analysis (such as trying to divide a number by a string)

Perfect!
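
One practical consequence is worth a short sketch. If a text column contains an entry that is not a number, .astype(float) will raise an error. One alternative, shown below, is `pd.to_numeric` with `errors="coerce"`, which turns entries it cannot convert into missing values. The Series and its "n/a" entry are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical batting averages stored as text, with one bad entry
avgs = pd.Series(["0.300", "0.275", "n/a"])

# avgs.astype(float) would raise a ValueError because "n/a" is not a number.
# to_numeric with errors="coerce" converts what it can and marks the rest as NaN.
as_numbers = pd.to_numeric(avgs, errors="coerce")

print(as_numbers.dtype)         # float64
print(as_numbers.isna().sum())  # 1
print(as_numbers.mean())        # the mean skips the missing value
```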

# Computing Basic Statistics


Instead of using .describe() on a whole dataframe as previously discussed, we can calculate basic statistics for just one column, such as:

 - .mean()
 - .median()
 - .mode()
 - .max()
 - .min()
 - .std()

I'll provide some examples below for the column 'SO', which will allow us to examine some strikeout statistics among all MLB teams.

In [None]:
# Print the average number of Strikeouts (SO) for a team during the 2024 MLB season
print(f'Mean: {CSV_Batting_Data_2024["SO"].mean()}')

# Print the median number of Strikeouts (SO) for a team during the 2024 MLB season
print(f'Median: {CSV_Batting_Data_2024["SO"].median()}')

# Print the most commonly occurring number of Strikeouts (SO) among teams during the 2024 MLB season
print(f'Mode: {CSV_Batting_Data_2024["SO"].mode()}')

# Print the highest amount of Strikeouts (SO) a team had during the 2024 MLB season
print(f'Maximum: {CSV_Batting_Data_2024["SO"].max()}')

# Print the lowest amount of Strikeouts (SO) a team had during the 2024 MLB season
print(f'Minimum: {CSV_Batting_Data_2024["SO"].min()}')

# Print the standard deviation of team Strikeouts (SO) during the 2024 MLB season
print(f'Standard Deviation: {CSV_Batting_Data_2024["SO"].std()}')


We can see that the league average for team strikeouts last season was 1373.2333!
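
As a sketch of a more compact approach, the .agg() function can compute several of these statistics in one call. The strikeout totals below are made-up numbers, not values from the baseball dataset. The example also shows that .mode() returns a Series rather than a single number, because more than one value can be tied for most common.

```python
import pandas as pd

# Hypothetical strikeout totals for four teams
so = pd.Series([1400, 1350, 1400, 1290], name="SO")

# Compute several statistics in one call
summary = so.agg(["mean", "median", "max", "min", "std"])
print(summary.round(1))

# mode() returns a Series; take the first entry to get a single value
modes = so.mode()
print(modes.iloc[0])  # 1400
```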

# Numpy Conversion



Lastly, let's go over how to convert a dataframe column to a Numpy array for easier calculations.

As discussed earlier in this module, the Numpy library is a powerful Python library that is used for mathematical computation using arrays. This can be extremely useful with topics such as matrices and linear algebra. However, for this lesson we just want to learn how to convert a dataframe column into a Numpy array.


Let's start by recreating our custom dataframe from earlier.

In [None]:
# Create a new DataFrame with some batting numbers
batting_numbers2 = pd.DataFrame(
    {
        'Player': ['Bryce Harper', 'Andrew McCutchen', 'Buster Posey'],
        'Team': ['Philadelphia Phillies', 'Pittsburgh Pirates', 'San Francisco Giants'],
        'Batting_Average': [0.300, 0.275, 0.290],
        'Home_Runs': [30, 25, 20],
        'RBIs': [90, 80, 70]
    }
)

Let's say we want to convert the 'Home_Runs' column to a Numpy array. We can do the following:

In [None]:
# Convert the 'Home_Runs' column to a numpy array
HR_array = batting_numbers2['Home_Runs'].to_numpy()

# Print it out
print(HR_array)

See, very simple and easy! This will now allow you to do Numpy computation on the 'Home_Runs' column if you want.
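
As a brief sketch of what that computation might look like, the example below rebuilds a small home run column and applies a few Numpy operations to the resulting array. The variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Rebuild a small home run column and convert it to a Numpy array
hr = pd.Series([30, 25, 20], name="Home_Runs")
hr_array = hr.to_numpy()

print(hr_array.sum())             # 75
print(hr_array.mean())            # 25.0
print(hr_array / hr_array.sum())  # each player's share of the total
print(np.sqrt(hr_array))          # element-wise square root
```

Numpy applies operations such as division and square root to every element of the array at once, so no loop is needed.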

# Review Questions:

Answer the following questions about Pandas for review:

## For Questions 1 - 3 use the following dataset: https://raw.githubusercontent.com/ahaze65/Data-For-Pandas-Module/refs/heads/main/Pitching%20Data%20-%20MLB%202024%20Team%20Stats.csv


SOURCE:

This dataset is from Baseball Reference. Please go visit their website at https://www.baseball-reference.com

Or for this exact dataset: https://www.baseball-reference.com/leagues/majors/2024.shtml#all_teams_standard_pitching 

### 1. Import the dataset from above, drop the last 2 rows and find the mean amount of earned runs allowed last year. 

Hint: Earned Runs Allowed column name is 'ER'

### 2. Assign a string index label to every row so that each row is labeled with its team name. Test this by using a function that lets you select rows by label.

Hint: Reuse the team name column 'Tm'

Hint: This function is in the data selection section.

### 3. Print out a list of teams who had at least 1 complete pitching shutout last season. 

Hint: complete shutout column is called 'cSho'

### 4. Create your own dataframe with at least 10 rows and 3 columns. Include at least 3 rows with null data points (Use np.nan) and make at least 1 column of numbers.

### 5. Using the dataframe you created in the last question, fill those null values with a phrase of your choosing and print out the dataframe. Make sure you are NOT applying those changes to the original dataframe.

Hint: do not use the inplace argument in your function call

### 6. Still using that dataframe you created, drop any rows that contain null values. Make sure you apply those changes to your dataframe and NOT to a temporary copy. 

### 7. Using one function, calculate all key statistics for your dataframe such as mean, median, min, etc.

### 8. View the last 5 rows of your dataframe and drop 3 of them of your choosing with only one line of code. View the last 5 rows again to check your changes.

### 9. Choose one of your numeric columns, convert it to an integer data type (if it is not already) and then into a Numpy array. Print the final array out.