# Introduction to Pandas

## Overview

This module introduces Pandas, a powerful data manipulation library for Python. It covers basic operations, data structures, and common data analysis tasks using Pandas, including analyzing, cleaning, exploring, and manipulating data.

## Learning Objectives

* Understand Pandas data structures: Series and DataFrame
* Learn to create, read, and manipulate DataFrames
* Perform basic data analysis operations using Pandas
* Handle missing data in Pandas

## Prerequisites

- Basic Python knowledge (For a refresher, see the [Python tutorial](https://docs.python.org/tutorial/).)
- Familiarity with NumPy is helpful but not required

## Get Started

Install pandas and import the required libraries.

In [None]:
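!pip install pandas
# PyTables, used by pandas for HDF5 input/output
!pip install tables
# openpyxl, used by pandas to read and write .xlsx files
!pip install openpyxl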

import pandas as pd
import numpy as np

## Object creation

See the [introduction to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro) section of Pandas documentation for details.

You can create a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) (a one-dimensional ndarray with axis labels, including time series) by passing a list of values and letting pandas create a default integer index:

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
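As a small sketch of the difference between the default integer index and explicit labels, the example below builds the same kind of Series both ways. The labels `"x"`, `"y"`, `"z"` and the variable names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Default integer index: pandas generates the labels 0..5.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Explicit labels (hypothetical names chosen for this example).
labeled = pd.Series([10, 20, 30], index=["x", "y", "z"])

default_index = list(s.index)   # labels generated by pandas
value_at_y = labeled["y"]       # look up a value by its label
series_dtype = str(s.dtype)     # the NaN forces a floating-point dtype
```

Because `np.nan` is a float, the integers in the first Series are stored as floats as well.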

You can create a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) (two-dimensional, size-mutable, potentially heterogeneous tabular data) by passing a NumPy array together with a datetime index and labeled columns:

In [None]:
# Returns the range of equally spaced time points
dates = pd.date_range("20220306", periods=6)
dates

In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

You can also create a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

In [None]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

The columns of the resulting DataFrame have different `dtypes`:

In [None]:
df2.dtypes
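The sketch below uses a reduced version of the dictionary above to show two things: scalar values are repeated for every row, and each column keeps its own dtype. The frame name `small` and the choice of columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A reduced version of the frame above; the column subset is illustrative.
small = pd.DataFrame(
    {
        "A": 1.0,  # scalar, repeated for every row
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "F": "foo",  # scalar string, repeated for every row
    }
)

n_rows = len(small)

# Keep only the numeric columns, then record each column's dtype by name.
numeric_only = small.select_dtypes(include="number")
numeric_dtypes = {col: str(dtype) for col, dtype in numeric_only.dtypes.items()}
```

`select_dtypes` is one way to act on the per-column dtypes, for example before running numeric-only computations.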

## Viewing data

See the [Essential basic functionality](https://pandas.pydata.org/docs/user_guide/basics.html#basics) section of the pandas documentation for details.

You can view the top and bottom rows of the frame:

In [None]:
df.head(3) # first three rows

In [None]:
df.tail(2) # last two rows

You can display the index and the columns:

In [None]:
df.index

In [None]:
df.columns

`describe()` shows a quick statistical summary of your data:

In [None]:
df.describe()
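To make the shape of the summary concrete, here is a minimal sketch on a tiny frame with known values. The frame and its single column `"A"` are assumptions for illustration, not the random data used above.

```python
import pandas as pd

# A tiny frame with known values so the summary is easy to check by hand.
df_small = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})

summary = df_small.describe()

summary_rows = list(summary.index)      # names of the statistics reported
mean_a = summary.loc["mean", "A"]       # arithmetic mean of column A
median_a = summary.loc["50%", "A"]      # the 50% quantile is the median
```

The result is itself a DataFrame, so individual statistics can be read back with `loc`.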

Transposing your data:

In [None]:
df.T

[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html) by axis:

In [None]:
# The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
df.sort_index(axis=1, ascending=False) # Sort based on column label

[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values) by values:

In [None]:
df.sort_values(by="C") # Sort by 'C' column ascending
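The following sketch contrasts the two kinds of sorting on a small frame with fixed values. The row labels `r1` to `r3` are hypothetical. It also checks that both methods return a new object and leave the original frame unchanged.

```python
import pandas as pd

# Small frame with fixed values; row labels are hypothetical.
df_small = pd.DataFrame({"A": [3, 1, 2], "B": [9, 8, 7]}, index=["r1", "r2", "r3"])

# Sort rows by the values in column A (ascending by default).
by_a = df_small.sort_values(by="A")

# Sort rows by column A in descending order.
by_a_desc = df_small.sort_values(by="A", ascending=False)

# Sort the column labels themselves in descending order.
cols_desc = df_small.sort_index(axis=1, ascending=False)

row_order = list(by_a.index)
row_order_desc = list(by_a_desc.index)
column_order = list(cols_desc.columns)
original_order = list(df_small.index)   # the original frame is not modified
```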

## Selection

See the [Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing) section of the pandas documentation for details.

Selecting a single column, which yields a Series (equivalent to `df.A`):

In [None]:
df["A"] # Select 'A' column

Selecting via [], which slices the rows:

In [None]:
df[0:4] # Select first 4 rows

In [None]:
df["20220306":"20220310"] # Get "2022-03-06" through "2022-03-10" rows

### Selection by label

**loc** selects rows and columns with specific labels. **iloc** selects rows and columns at specific integer positions.

Getting a cross section using a label:

In [None]:
dates

In [None]:
df.loc[dates[0]] # Get row indexed by '2022-03-06'

Selecting on a multi-axis by label:

In [None]:
df.loc[:, ["A", "B"]] # Get 'A' and 'B' columns for all rows

Showing label slicing, both endpoints are included:

In [None]:
df.loc["20220307":"20220309", ["A", "B"]] # Get 'A' and 'B' columns for rows '2022-03-07' through '2022-03-09' (inclusive)
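Since `loc` includes both endpoints of a label slice while `iloc` follows Python's half-open convention, a short side-by-side sketch may help. The frame here is filled with `np.arange` values instead of random numbers so the result is reproducible; that substitution is an assumption for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20220306", periods=6)
# Deterministic values 0..23 instead of random numbers, for reproducibility.
df_demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: both '2022-03-07' and '2022-03-09' are included (3 rows).
by_label = df_demo.loc["20220307":"20220309", ["A", "B"]]

# Position slice: the stop position 3 is excluded (2 rows).
by_position = df_demo.iloc[1:3, 0:2]

label_shape = by_label.shape
position_shape = by_position.shape
last_label_row = by_label.iloc[-1].tolist()
```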

Reduction in the dimensions of the returned object:

In [None]:
df.loc["20220308", ["A", "B"]] # Get 'A' and 'B' columns of '2022-03-08' row

Getting a scalar value:

In [None]:
df.loc[dates[0], "A"] # Get value at dates[0] row and 'A' column.

Getting fast access to a scalar (equivalent to the prior method):

In [None]:
df.at[dates[0], "A"]

### Selection by position

Selecting via the position of the passed integers:

In [None]:
df.iloc[2] # Get all values of third row

By integer slices, similar to NumPy/Python:

In [None]:
df

In [None]:
df.iloc[3:5, 0:2] # Get the 4th and 5th rows, columns 'A' and 'B'

By lists of integer position locations, similar to the NumPy/Python style:

In [None]:
df.iloc[[1, 2, 4], [0, 2]] # Get the 2nd, 3rd, and 5th rows, columns 'A' and 'C'

Slicing rows explicitly:

In [None]:
df.iloc[1:3, :] # Get the 2nd and 3rd rows, all columns

Slicing columns explicitly:

In [None]:
df.iloc[:, 1:3] # Get the 2nd and 3rd columns, all rows

Getting a value explicitly:

In [None]:
df.iloc[1, 1] # Get the value at the 2nd row and 2nd column

Getting fast access to a scalar (equivalent to the prior method):

In [None]:
df.iat[1, 1]

### Boolean indexing

Using a single column’s values to select data:

In [None]:
df[df["A"] > 0] # Get rows where the value in column 'A' is greater than 0

Using the isin() method for filtering:

In [None]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"] # Add new column 'E'
df2

In [None]:
df2[df2["E"].isin(["two", "four"])] # Get rows where the value in column 'E' is either "two" or "four"
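Boolean masks can also be combined. The sketch below, on a small frame with fixed values chosen for illustration, joins a comparison and an `isin()` test with `&`. The parentheses around the comparison are needed because `&` binds more tightly than `>`.

```python
import pandas as pd

# Fixed values chosen for illustration.
df_demo = pd.DataFrame(
    {
        "A": [-1.0, 0.5, 2.0, -0.3],
        "E": ["one", "two", "three", "two"],
    }
)

# Each condition produces a boolean Series aligned with the rows.
positive = df_demo["A"] > 0
in_set = df_demo["E"].isin(["two", "four"])

# Element-wise AND keeps rows that satisfy both conditions.
both = df_demo[(df_demo["A"] > 0) & in_set]

positive_rows = list(df_demo[positive].index)
matching_rows = list(both.index)
```

Use `|` for element-wise OR and `~` to negate a mask.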

## Setting

Setting a new column automatically aligns the data by index:

In [None]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20220306", periods=6))
s1

In [None]:
df["F"] = s1
df
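Alignment is easiest to see when the two indexes do not match exactly. In this sketch the new Series starts one day later than the frame (an assumption made for illustration), so the first row has no matching label and receives `NaN`, and the last Series value is dropped.

```python
import pandas as pd

dates = pd.date_range("20220306", periods=4)
df_demo = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# The Series index is shifted by one day relative to the frame's index.
shifted = pd.Series([10, 20, 30, 40], index=pd.date_range("20220307", periods=4))

# Values are matched by label, not by position.
df_demo["F"] = shifted

first_is_missing = bool(pd.isna(df_demo.loc[dates[0], "F"]))
aligned_values = df_demo["F"].tolist()[1:]
row_count = len(df_demo)   # the frame keeps its own index; no rows are added
```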

Setting values by label:

In [None]:
df.at[dates[0], "A"] = 0 # Set the value at dates[0] and 'A' column to be 0
df

Setting values by position:

In [None]:
df.iat[0, 1] = 0 # Set the value at first row, second column to be 0
df

Setting by assigning with a NumPy array:

In [None]:
df.loc[:, "D"] = np.array([5] * len(df)) # Set every value in column 'D' to 5
df

## Missing data

See [Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data) section of Pandas documentation for details.

Pandas primarily uses the value `np.nan` to represent missing data. By default, missing values are excluded from computations.
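A minimal sketch of what "excluded from computations" means in practice, using a three-element Series chosen for illustration:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, np.nan, 3.0])

total = s_demo.sum()        # NaN is skipped: 1.0 + 3.0
average = s_demo.mean()     # mean over the two non-missing values
non_missing = s_demo.count()  # count ignores NaN

# Passing skipna=False makes the missing value propagate into the result.
strict_total = s_demo.sum(skipna=False)
```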

In [None]:
# Reindexing lets you change, add, or delete labels on a specified axis. It returns a copy of the data.
# Here it keeps the first four rows and adds a new, empty column "E":
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
# Set the first two rows of "E" to 1; the remaining rows stay NaN
df1.loc[dates[0] : dates[1], "E"] = 1
df1

To drop any rows that have missing data:

In [None]:
df1.dropna(how="any")

Filling missing data:

In [None]:
df1.fillna(value=5)

To get the boolean mask where values are *NaN*:

In [None]:
pd.isna(df1)
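The three operations above can be compared on a small frame with known gaps. The frame below is an assumption for illustration. Note that `dropna` and `fillna` return new objects, so the original frame keeps its missing values.

```python
import numpy as np
import pandas as pd

# Fixed values with known gaps, chosen for illustration.
df_gaps = pd.DataFrame({"A": [1.0, np.nan, 3.0], "E": [1.0, np.nan, np.nan]})

# Count missing values per column by summing the boolean mask.
missing_per_column = df_gaps.isna().sum()

# how="any" drops rows with at least one NaN; how="all" drops only fully empty rows.
complete_rows = df_gaps.dropna(how="any")
non_empty_rows = df_gaps.dropna(how="all")

# Replace every NaN with a constant.
filled = df_gaps.fillna(value=5)

remaining_in_original = int(df_gaps.isna().sum().sum())
```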

## Operations

See the [Flexible binary operations](https://pandas.pydata.org/docs/user_guide/basics.html#basics-binop) section of Pandas documentation for details.

### Stats

Operations in general exclude missing data.

Performing a descriptive statistic:

In [None]:
df

In [None]:
df.max(axis=0) # Get the max of each column

Same operation on the other axis:

In [None]:
df.max(axis=1) # Get the max of each row

### Apply

Applying functions to the data:

In [None]:
df.apply(lambda x: x.max() - x.min(), axis=1) # With axis=1, get the max-min difference within each row
# The lambda is equivalent to this named function:
# def test(x):
#   result = x.max() - x.min()
#   return result
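The `axis` argument decides whether the function receives each column or each row. The sketch below runs the same max-minus-min function both ways on a small frame with fixed values (chosen for illustration) so the two results can be told apart.

```python
import pandas as pd

# Fixed values chosen for illustration.
df_demo = pd.DataFrame({"A": [1, 5], "B": [2, 9], "C": [4, 6]})

def value_range(x):
    """Return the difference between the largest and smallest value in x."""
    return x.max() - x.min()

# Default axis=0: the function receives each column; one result per column.
per_column = df_demo.apply(value_range)

# axis=1: the function receives each row; one result per row.
per_row = df_demo.apply(value_range, axis=1)

per_column_result = per_column.to_dict()
per_row_result = per_row.tolist()
```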

### String Methods

Series is equipped with a set of string processing methods in the `str` attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in `str` generally uses regular expressions by default (and in some cases always uses them).

In [None]:
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()
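To illustrate the note about regular expressions, this sketch searches for the pattern `a.b` with `str.contains`, first as a regular expression (where `.` matches any character) and then as a literal string. The sample strings are assumptions for illustration, and `na=False` is used so that missing values yield `False` instead of `NaN`.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(["a.b", "axb", np.nan, "dog"])

# Default: the pattern is a regular expression, so "." matches any character.
regex_match = s_demo.str.contains("a.b", na=False).tolist()

# regex=False: the pattern is matched literally, so only "a.b" itself matches.
literal_match = s_demo.str.contains("a.b", regex=False, na=False).tolist()

# String methods pass missing values through unchanged.
lowered = s_demo.str.lower()
missing_preserved = bool(pd.isna(lowered[2]))
```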

## Getting data in/out

### CSV

Writing to a CSV file:

In [None]:
df.to_csv("foo.csv")

Reading from a CSV file:

In [None]:
pd.read_csv("foo.csv")
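One detail worth knowing about the round trip: `to_csv` writes the index as an unnamed first column, so a plain `read_csv` returns it as a regular column called `Unnamed: 0`. The sketch below shows one way to restore the datetime index with `index_col` and `parse_dates`. It uses an in-memory `io.StringIO` buffer in place of `foo.csv` and a two-row frame, both assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, 2.0]}, index=pd.date_range("20220306", periods=2))

# Write to an in-memory buffer that stands in for a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer)

# Plain read: the saved index comes back as an ordinary column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Use the first column as the index and parse it as dates.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

naive_columns = list(naive.columns)
restored_columns = list(restored.columns)
index_is_datetime = isinstance(restored.index, pd.DatetimeIndex)
```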

### HDF5

HDF5 is a technology suite for managing extremely large and complex data collections. To learn more about the HDF5 format, see [What is HDF5](https://support.hdfgroup.org/HDF5/whatishdf5.html).

Writing to an HDF5 store:

In [None]:
# Pass the key by keyword; recent pandas versions deprecate passing it positionally
df.to_hdf("foo.h5", key="df")

Reading from an HDF5 store:

In [None]:
pd.read_hdf("foo.h5", key="df")

### Excel

Reading from and writing to Microsoft Excel files.

Writing to an Excel file:

In [None]:
df.to_excel("foo.xlsx", sheet_name="Sheet1")

Reading from an Excel file:

In [None]:
pd.read_excel("foo.xlsx", sheet_name="Sheet1", index_col=None, na_values=["NA"])