# Introduction to Pandas

## Overview

This module introduces Pandas, a powerful data manipulation library for Python. It covers basic operations, data structures, and common data analysis tasks using Pandas, including analyzing, cleaning, exploring, and manipulating data.

## Learning Objectives

- Understand Pandas data structures: Series and DataFrame
- Learn to create, read, and manipulate DataFrames
- Perform basic data analysis operations using Pandas
- Handle missing data in Pandas

## Prerequisites

- Basic Python knowledge (For a refresher, see the [Python tutorial](https://docs.python.org/tutorial/).)
- Familiarity with NumPy is helpful but not required

## Get Started

Install the optional libraries for HDF5 and Excel support, then import pandas and NumPy.


In [None]:
# Install the tables library to read and write HDF5 files, typically used for large datasets
%pip install tables

# Install the openpyxl library to read and write Excel files (.xlsx)
%pip install openpyxl

# Import pandas library, used for data manipulation and analysis
import pandas as pd

# Import numpy library, used for numerical operations and working with arrays
import numpy as np


## Object creation

See the [introduction to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro) section of Pandas documentation for details.

You can create a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) (One-dimensional ndarray with axis labels, including time series) by passing a list of values, letting pandas create a default integer index:


In [None]:
# Create a pandas Series with some integer values and a NaN (Not a Number) value
# The NaN value represents missing data or undefined values in the series
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Display the Series to view its contents
s

In [None]:
# Create a pandas Series with hourly energy consumption values (in kWh) for a day
# The index represents hours (0 to 23), and the values are sample energy readings
energy = pd.Series(
    [2.5, 2.7, 2.3, 2.1, 2.0, 2.4, 2.8, 3.0, 3.5, 3.2, 3.1, 2.9, 2.8, 2.6, 2.5, 2.7, 3.0, 3.4, 3.6, 3.2, 2.9, 2.5, 2.3, 2.1],
    index=range(24)
)

In [None]:
# Display the entire Series to view its contents
print("Full Series:")
print(energy)

In [None]:
# Slice the `energy` Series to extract elements from index 2 to 15 (exclusive) with a step of 3
# This means selecting every third element starting from index 2 up to (but not including) index 15
# Corresponds to hours 2, 5, 8, 11, 14
print("\nSlice from index 2 to 15 with step 3:")
print(energy[2:15:3])

You can create a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) (Two-dimensional, size-mutable, potentially heterogeneous tabular data) by passing a NumPy array, with a datetime index and labeled columns:


In [None]:
# Generates a range of equally spaced time points (dates)
# "20220306" is the starting date (March 6, 2022)
# `periods=6` specifies that the range will contain 6 equally spaced dates
dates = pd.date_range("20220306", periods=6)

# Displays the generated date range
dates

In [None]:
# Create a DataFrame with random values from a standard normal distribution
# The shape of the DataFrame is (6, 4), i.e., 6 rows and 4 columns
# The index is set to the `dates` DatetimeIndex created above
# The column labels are set to the list ['A', 'B', 'C', 'D']
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Display the DataFrame
df

You can also create a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:


In [None]:
# Create a new DataFrame named df2 with various types of data for each column
df2 = pd.DataFrame(
    {
        # Column 'A' with a constant value of 1.0 for all rows
        "A": 1.0,
        # Column 'B' with a single timestamp (2013-01-02) for all rows
        "B": pd.Timestamp("20130102"),
        # Column 'C' with a pandas Series of length 4 filled with 1's, dtype is float32
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        # Column 'D' with a NumPy array of integers (3 repeated 4 times), dtype is int32
        "D": np.array([3] * 4, dtype="int32"),
        # Column 'E' with a categorical variable containing 'test' and 'train'
        "E": pd.Categorical(["test", "train", "test", "train"]),
        # Column 'F' with a string constant "foo" for all rows
        "F": "foo",
    }
)

# Display the DataFrame df2
df2

The columns of the resulting DataFrame have different `dtypes`:


In [None]:
# Display the data types of each column in the DataFrame `df2`
# This allows us to check the type of data stored in each column (e.g., int, float, object, etc.)
df2.dtypes

## Viewing data

See the [Essential basic functionality](https://pandas.pydata.org/docs/user_guide/basics.html#basics) section of the Pandas documentation for details.

You can view the top and bottom rows of the frame:


In [None]:
df.head(3)  # first three rows

In [None]:
df.tail(2)  # last two rows

You can display the index and the columns:


In [None]:
df.index

In [None]:
df.columns

**describe**() shows a quick statistical summary of your data:


In [None]:
df.describe()

Transposing your data:


In [None]:
df.T

[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html) by axis:


In [None]:
# `axis` selects which labels to sort by: 0 sorts by row labels (the index), 1 sorts by column labels.
df.sort_index(axis=1, ascending=False)  # Sort columns by label in descending order

[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values) by values:


In [None]:
df.sort_values(by="C")  # Sort by 'C' column ascending

## Selection

See the [Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing) section of the Pandas documentation for details.

Selecting a single column, which yields a Series, equivalent to df.A:


In [None]:
df["A"]  # Select 'A' column

Selecting via [], which slices the rows:


In [None]:
df[0:4]  # Select first 4 rows

In [None]:
df["20220306":"20220310"]  # Get "2022-03-06" through "2022-03-10" rows

### Selection by label

- **loc** selects rows and columns with specific labels.
- **at** selects a single value for a row/column pair with specific labels (faster than `loc` when you only need a single value)

Getting a cross section using a label:


In [None]:
dates

In [None]:
df.loc[dates[0]]  # Get row indexed by '2022-03-06'

Selecting on a multi-axis by label:


In [None]:
df.loc[:, ["A", "B"]]  # Get 'A' and 'B' columns for all rows

Label slicing includes both endpoints:


In [None]:
df.loc[
    "20220307":"20220309", ["A", "B"]
]  # Get 'A' and 'B' columns for rows indexed by '2022-03-07' through '2022-03-09'
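The difference between label slicing and position slicing is easy to miss, so here is a minimal sketch that puts the two side by side. The frame `demo` and its values are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
idx = pd.date_range("20220306", periods=6)
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=["A", "B"])

# Label-based slice: both '2022-03-07' and '2022-03-09' are included (3 rows)
by_label = demo.loc["20220307":"20220309"]

# Position-based slice: the stop position 3 is excluded (2 rows, positions 1 and 2)
by_position = demo.iloc[1:3]

print(len(by_label), len(by_position))  # prints: 3 2
```

With `loc` the stop label is part of the result, while `iloc` follows the usual Python convention of excluding the stop position.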

Reduction in the dimensions of the returned object:


In [None]:
df.loc["20220308", ["A", "B"]]  # Get 'A' and 'B' columns of '2022-03-08' row

Getting a scalar value:


In [None]:
df.loc[dates[0], "A"]  # Get value at dates[0] row and 'A' column.

Getting fast access to a scalar (equivalent to the prior method):


In [None]:
df.at[dates[0], "A"]

### Selection by position

- **iloc** selects rows and columns at specific integer positions.
- **iat** selects a single value for a row/column pair at specific integer positions (faster than `iloc` when you only need a single value)

Selecting via the position of the passed integers:


In [None]:
df.iloc[2]  # Get all values of third row

By integer slices, similar to NumPy/Python:


In [None]:
df

In [None]:
df.iloc[3:5, 0:2]  # Get the 4th and 5th rows (positions 3 and 4), columns 'A' and 'B'

By lists of integer position locations, similar to the NumPy/Python style:


In [None]:
df.iloc[[1, 2, 4], [0, 2]]  # Get the 2nd, 3rd, and 5th rows, columns 'A' and 'C'

Slicing rows explicitly:


In [None]:
df.iloc[1:3, :]  # Get the 2nd and 3rd rows, all columns

Slicing columns explicitly:


In [None]:
df.iloc[:, 1:3]  # Get the 2nd and 3rd columns, all rows

Getting a value explicitly:


In [None]:
df.iloc[1, 1]  # Get the value at the 2nd row and 2nd column

Getting fast access to a scalar (equivalent to the prior method):


In [None]:
df.iat[1, 1]

### Boolean indexing

Using a single column’s values to select data:


In [None]:
df[df["A"] > 0]  # Get rows where column 'A' is greater than 0

Using the isin() method for filtering:


In [None]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]  # Add new column 'E'
df2

In [None]:
df2[
    df2["E"].isin(["two", "four"])
]  # Get rows where the value in column 'E' is either "two" or "four"
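Boolean masks can also be combined. The sketch below is illustrative only: the frame `demo` and its values are hypothetical. It uses the element-wise operators `&` (and), `|` (or), and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`.

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame(
    {"A": [-1.0, 0.5, 2.0, 3.0], "E": ["one", "two", "two", "four"]}
)

# Rows where 'A' is positive AND 'E' equals "two"
both = demo[(demo["A"] > 0) & (demo["E"] == "two")]

# Rows where 'A' is greater than 2 OR 'E' equals "one"
either = demo[(demo["A"] > 2) | (demo["E"] == "one")]

# Rows where 'E' is NOT "two"
not_two = demo[~demo["E"].isin(["two"])]

print(both)
print(either)
print(not_two)
```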

## Setting

Setting a new column automatically aligns the data by the indexes:


In [None]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20220306", periods=6))
s1

In [None]:
df["F"] = s1
df
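To make the alignment visible, here is a small sketch in which the Series is deliberately out of order and shorter than the frame. The names `frame` and `partial` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

idx = pd.date_range("20220306", periods=4)
frame = pd.DataFrame({"A": [10, 20, 30, 40]}, index=idx)

# The Series is in a different order and covers only three of the four dates
partial = pd.Series([3.0, 1.0, 2.0], index=[idx[2], idx[0], idx[1]])

# Assignment matches values on index labels, not on position;
# the date with no matching label receives NaN
frame["F"] = partial
print(frame)
```

Values land next to their matching dates regardless of the order in `partial`, and the last date, which has no entry in `partial`, is filled with `NaN`.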

Setting values by label:


In [None]:
df.at[dates[0], "A"] = 0  # Set the value at dates[0] and 'A' column to be 0
df

Setting values by position:


In [None]:
df.iat[0, 1] = 0  # Set the value at first row, second column to be 0
df

Setting by assigning with a NumPy array:


In [None]:
df[df.columns[3]] = np.array([5] * len(df))  # Set every value in the fourth column ('D') to 5
df

## Missing data

See the [Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data) section of the Pandas documentation for details.

Pandas primarily uses the value `np.nan` to represent missing data. By default, missing values are excluded from computations.


In [None]:
# Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
# Set the first two rows of the new column "E" to 1; the remaining rows stay NaN
df1.loc[dates[0] : dates[1], "E"] = 1
df1

To drop any rows that have missing data:


In [None]:
# Drop rows with any missing (NaN) values
# The 'how="any"' parameter specifies that if any column in a row has a NaN value, that row will be dropped
df1.dropna(how="any")

Filling missing data:


In [None]:
# Fill missing (NaN) values in the DataFrame `df1` with the specified value (5)
df1.fillna(value=5)

To get the boolean mask where values are _NaN_:


In [None]:
# Check for missing (NaN) values in the DataFrame 'df1'
# df1.isna() returns a DataFrame of the same shape as 'df1' with True for NaN values and False for non-NaN values

df1.isna()
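As noted at the start of this section, missing values are left out of computations by default. The following sketch, using a hypothetical three-element Series, shows that behavior and how `skipna=False` changes it.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
values = pd.Series([1.0, np.nan, 3.0])

total = values.sum()                      # NaN is skipped: 1.0 + 3.0
mean_skip = values.mean()                 # mean of the two present values
mean_strict = values.mean(skipna=False)   # NaN propagates into the result
n_present = values.count()                # counts non-missing values only

print(total, mean_skip, mean_strict, n_present)
```

Checking `count()` alongside a statistic is a simple way to see how many values actually contributed to it.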

## Operations

See the [Flexible binary operations](https://pandas.pydata.org/docs/user_guide/basics.html#basics-binop) section of Pandas documentation for details.

### Stats

Operations in general exclude missing data.

Performing a descriptive statistic:


In [None]:
df

In [None]:
df.max(axis=0)  # Get the max of each column

Same operation on the other axis:


In [None]:
df.max(axis=1)  # Get the max of each row

In [None]:
# Create a pandas DataFrame with three performance metrics for six employees
# The DataFrame includes some missing values (NaN) to demonstrate handling of incomplete data
data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Sales': [12000, 15000, None, 10000, 18000, 13000],
    'Productivity': [85, 90, 78, 82, 95, None],
    'Customer_Rating': [4.5, None, 4.0, 3.8, 4.8, 4.2]
}
df_emp = pd.DataFrame(data)

In [None]:
# Display the full DataFrame to view its contents
print("Full DataFrame:")
print(df_emp)

In [None]:
# Compute descriptive statistics for all numerical columns using describe()
# This provides count, mean, std, min, 25% (Q1), 50% (median), 75% (Q3), and max
print("\nDescriptive Statistics (describe()):")
print(df_emp.describe())

In [None]:
# Compute specific quantiles (10th, 25th, 75th, 90th percentiles) for each numerical column
# quantile() accepts a single value or list of values between 0 and 1
print("\nCustom Quantiles (10th, 25th, 75th, 90th percentiles):")
print(df_emp[['Sales', 'Productivity', 'Customer_Rating']].quantile([0.1, 0.25, 0.75, 0.9]))

In [None]:
# Compute the mean for each metric
print("\nMean Values per Metric:")
print(df_emp[['Sales', 'Productivity', 'Customer_Rating']].mean())

In [None]:
# Compute the median (50th percentile) for each metric
print("\nMedian Values per Metric:")
print(df_emp[['Sales', 'Productivity', 'Customer_Rating']].median())

In [None]:
# Compute the standard deviation for each metric
print("\nStandard Deviation per Metric:")
print(df_emp[['Sales', 'Productivity', 'Customer_Rating']].std())

In [None]:
# Count non-missing values for each metric
print("\nCount of Non-Missing Values per Metric:")
print(df_emp[['Sales', 'Productivity', 'Customer_Rating']].count())

In [None]:
# Compute the minimum value across all metrics for each employee
# Use min() along axis=1 to find the lowest value per row (excluding 'Employee' column)
print("\nMinimum Value per Employee (across metrics):")
print(df_emp[['Sales', 'Productivity', 'Customer_Rating']].min(axis=1))

### Apply

Applying functions to the data:


In [None]:
df.apply(lambda x: x.max() - x.min(), axis=1)  # Get the max-min difference of each row (axis=1 passes one row at a time)
# def test(x):
#   result = x.max() - x.min()
#   return result
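The `axis` argument of `apply` decides whether the function receives each column or each row. Here is a minimal sketch with a hypothetical frame `demo` and a named helper `value_range` (both introduced only for this illustration):

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"A": [1, 4], "B": [10, 2], "C": [5, 5]})


def value_range(x):
    """Return the difference between the largest and smallest value."""
    return x.max() - x.min()


# axis=0 (the default): the function receives one column at a time
per_column = demo.apply(value_range, axis=0)

# axis=1: the function receives one row at a time
per_row = demo.apply(value_range, axis=1)

print(per_column)
print(per_row)
```

`per_column` is indexed by the column labels, and `per_row` is indexed by the row labels.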

### String Methods

Series is equipped with a set of string processing methods in the `str` attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in `str` generally uses regular expressions by default (and in some cases always uses them).


In [None]:
# Create a pandas Series with a mix of strings and a NaN value
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

# Apply the `str.lower()` function to convert all string elements in the Series to lowercase
# This function works element-wise, meaning it will convert each string in the Series to lowercase
s.str.lower()
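The regular-expression default mentioned above matters as soon as a pattern contains a special character such as `.`. The sketch below uses a hypothetical Series `names` and the `regex`, `case`, and `na` parameters of `str.contains`.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
names = pd.Series(["a.b", "axb", np.nan, "A.B"])

# Default: the pattern is a regular expression, so "." matches any character
as_regex = names.str.contains("a.b", na=False)

# regex=False treats the pattern as a literal string
as_literal = names.str.contains("a.b", regex=False, na=False)

# case=False makes the literal match case-insensitive
ignore_case = names.str.contains("a.b", regex=False, case=False, na=False)

print(as_regex.tolist())
print(as_literal.tolist())
print(ignore_case.tolist())
```

Passing `na=False` makes missing entries count as non-matches, so each result can be used directly as a boolean mask.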

In [None]:
# Create a pandas DataFrame with data about animal species
# Columns include species name, weight, lifespan, population size, and predation risk
data = {
    'species': ['Tiger', 'Elephant', 'Blue Whale', 'Bald Eagle'],
    'weight_kg': [250, 5000, 150000, 6],
    'lifespan_years': [15, 60, 90, 20],
    'population': [3900, 40000, 25000, 100000],
    'predation_risk': [0.3, 0.1, 0.05, 0.2]
}
animal_data = pd.DataFrame(data)


In [None]:
# Display the full DataFrame to view its contents
print("Full DataFrame:")
print(animal_data)

In [None]:
# Filter the `animal_data` DataFrame in a single `loc` call:
# - rows: animals whose weight exceeds 1000 kg
# - columns: names that start with "p", found with str.startswith("p") on the column labels
# A single `loc` call avoids the chained indexing of df[rows][columns]
filtered_data = animal_data.loc[
    animal_data["weight_kg"] > 1000,
    animal_data.columns.str.startswith("p"),
]

In [None]:
# Display the filtered DataFrame
print("\nFiltered DataFrame (weight > 1000 kg, columns starting with 'p'):")
print(filtered_data)

## Getting data in/out

### CSV

Writing to a csv file:


In [None]:
# Save the DataFrame `df` to a CSV file named "foo.csv"
# The `to_csv` method writes the data from the DataFrame into a CSV file
df.to_csv("foo.csv")

Reading from a csv file:


In [None]:
# Use pandas to read a CSV file and load its contents into a DataFrame
df = pd.read_csv("foo.csv")
df
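By default, `to_csv` writes the index as an unnamed first column, and `read_csv` reads it back as ordinary data. The sketch below shows one way to restore a date index on read by using the `index_col` and `parse_dates` parameters. It writes to a temporary directory, and the file name `demo.csv` and the frame `original` are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame with a date index, for illustration only
idx = pd.date_range("20220306", periods=3)
original = pd.DataFrame(np.arange(6).reshape(3, 2), index=idx, columns=["A", "B"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")  # hypothetical file name
    original.to_csv(path)

    # Plain read: the saved index comes back as a regular column
    plain = pd.read_csv(path)

    # Use the first column as the index and parse it as dates
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(plain.columns.tolist())
print(restored.index)
```

In `plain`, the dates appear as strings in a column named `Unnamed: 0`; in `restored`, they form a `DatetimeIndex` again.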

### HDF5

HDF5 is a file format and library suite designed for storing and managing very large and complex data collections. To learn more about the HDF5 format, see [Introduction to HDF5](https://support.hdfgroup.org/documentation/hdf5/latest/_intro_h_d_f5.html).

Writing to an HDF5 store:


In [None]:
# Save the DataFrame 'df' to an HDF5 file named 'foo.h5', with the data stored under the key "df"
df.to_hdf("foo.h5", key="df")

Reading from an HDF5 store:


In [None]:
# Use pandas to read an HDF5 file ("foo.h5") and load the data from the "df" dataset
df = pd.read_hdf("foo.h5", "df")
df

In [None]:
# Replace specific cells with NaN using loc
df.loc[1, "A"] = np.nan  # Row label 1, column 'A'
df.loc[2, "D"] = np.nan  # Row label 2, column 'D'
df

### Excel

Reading and writing to MS Excel.

Writing to an excel file:


In [None]:
# Save the dataframe 'df' to an Excel file with the name 'foo.xlsx'
# The data will be written to the sheet named 'Sheet1'
df.to_excel("foo.xlsx", sheet_name="Sheet1")

Reading from an excel file:


In [None]:
# Read an Excel file and load the data from "Sheet1"
# - "foo.xlsx" is the path to the Excel file.
# - "Sheet1" specifies the sheet name to be read.
# - index_col=None ensures that no column is used as the index.
# - na_values=["NA"] specifies that any "NA" values in the data should be treated as NaN (Not a Number).
df = pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
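# Rename the first two columns: the integer index written by to_excel and
# the date column carried over from the earlier CSV round trip
new_columns = ['id', 'date'] + df.columns[2:].tolist()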
df.columns = new_columns
df

## Conclusion

In this module, we covered the two core Pandas data structures, Series and DataFrame. We also created, read, and manipulated DataFrames, performed basic data analysis operations, and handled missing data.

## Clean up

Remember to shut down your Jupyter Notebook instance when you're done to avoid unnecessary charges. You can do this by stopping the notebook instance from the Amazon SageMaker console.
