# Starting with Pandas

Here we use the Pandas library to load table data into Python.

Thus far we have used the Numpy library to work with data in arrays. As
always with Python, when we want to use a library — like Numpy, or
Pandas — we have to `import` it first.

We have used the term *library* here, but Python uses the term *module*
to refer to libraries of code and data that you `import`. We will use
the terms “library” and “module” to mean the same thing — a Python
module.

When using Numpy, we write:

In [None]:
# Import the Numpy library (module), name it "np".
import numpy as np

Now we will use the Pandas library (module).

We can import Pandas like this:

In [None]:
# Import the Pandas library (module)
import pandas

Just as Numpy has the standard abbreviation `np`, which almost everyone
writing Python code will recognize and use, Pandas has the standard
abbreviation `pd`:

In [None]:
# Import the Pandas library (module), name it "pd".
import pandas as pd

Pandas is a very widely used data science library for Python. It is
particularly good at loading data files and presenting them to us as a
useful table-like structure, called a *data frame*.

We start by using Pandas to load our data file:

In [None]:
district_income = pd.read_csv('data/congress_2023.csv')

We have thus far done many operations that returned Numpy *arrays*.
`pd.read_csv` returns a Pandas *data frame*:

In [None]:
type(district_income)
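
To see `pd.read_csv` at work without needing the data file, here is a minimal sketch that reads a tiny made-up table from a string. The `Median_Income` column name follows the text; the `District` column name and all the values are invented for illustration and are not taken from `congress_2023.csv`.

```python
import io

import pandas as pd

# A tiny, made-up CSV table (these values are NOT from congress_2023.csv).
csv_text = """District,Median_Income
A-1,41000
B-2,52000
C-3,67000
"""

# pd.read_csv accepts a file name or a file-like object; io.StringIO
# makes the string above behave like an open text file.
example_table = pd.read_csv(io.StringIO(csv_text))

# The result is a Pandas data frame.
print(type(example_table))
# The shape attribute gives (number of rows, number of columns).
print(example_table.shape)
```

Notice that the first line of the CSV text supplies the column names, and each later line becomes one row of the data frame.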

A data frame is Pandas’ own way of representing a table, with columns
and rows. You can think of it as Python’s version of a spreadsheet. Just
as strings and Numpy arrays have *methods* (functions attached to the
value), Pandas data frames have methods. These methods do things with
the data frame to which they are attached. For example, the `head`
method of the data frame shows (by default) the first five rows in the
table:

In [None]:
# Show the first five rows in the data frame
district_income.head()
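
The phrase "by default" matters here: `head` also accepts a number of rows. The sketch below uses a small made-up data frame (the `District` labels and income values are hypothetical) to show both forms.

```python
import pandas as pd

# A small, made-up data frame with seven rows (hypothetical values).
example_table = pd.DataFrame({
    'District': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'Median_Income': [41000, 45000, 52000, 58000, 61000, 67000, 73000],
})

# With no argument, head returns the first five rows.
first_five = example_table.head()
# Passing a number asks for that many rows instead.
first_two = example_table.head(2)

print(len(first_five), len(first_two))
```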

The data are in income order, from lowest to highest, so the first five
districts are those with the lowest household income.

**Note: Sorting**

If the data were not already in income order, we could have sorted them
with the `sort_values` method of the data frame.

**End of note**
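
One way to sort a data frame by the values in a column is the `sort_values` method of the data frame. This is a minimal sketch on a made-up table; the `District` labels and income values are hypothetical.

```python
import pandas as pd

# A made-up table that is NOT in income order (hypothetical values).
unsorted_table = pd.DataFrame({
    'District': ['A', 'B', 'C'],
    'Median_Income': [67000, 41000, 52000],
})

# sort_values returns a new data frame, sorted by the named column,
# lowest value first by default.  The original is left unchanged.
sorted_table = unsorted_table.sort_values('Median_Income')

print(sorted_table)
```

Each row stays together when sorting, so the district labels move along with their incomes.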

We are particularly interested in the column named `Median_Income`.

You may remember the idea of *indexing*, introduced in the earlier
section on array indexing. Indexing occurs when we fetch data from
within a container, such as a string or an array. We do this by putting
square brackets `[]` after the value we want to index into, and putting
something inside the brackets to say what we want.

For example, to get the *first* element of the `priv` array above, we
use indexing:

In [None]:
# Fetch the first element of the priv array with indexing.
# This is the element at position 0.
priv[0]
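
As a reminder of how indexing by position works, here is a short sketch using a string and an array. The names `a_word` and `some_values`, and their contents, are made up for this illustration.

```python
import numpy as np

# A made-up string and a made-up array (hypothetical values).
a_word = 'Pandas'
some_values = np.array([10, 20, 30, 40])

# Position 0 is the first element in both cases.
first_letter = a_word[0]
first_value = some_values[0]

print(first_letter, first_value)
```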

Just as you can index into strings and Numpy arrays by using square
brackets, you can index into Pandas data frames. Instead of putting
the *position* between the square brackets, we can put the *column
name*. This fetches the data from that column, returning a new type of
value called a Pandas *Series*.

In [None]:
# Index into Pandas data frame to get one column of data.
# Notice we use a string between the square brackets, giving the column name.
income_col = district_income['Median_Income']
# The value that comes back is of type Series.  A Series represents the
# data from a single column.
type(income_col)
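
The same column indexing can be tried on a small made-up data frame. In this sketch the `District` labels and the values are hypothetical. It also shows that the column name must match exactly, including upper and lower case.

```python
import pandas as pd

# A small, made-up data frame (hypothetical values).
example_table = pd.DataFrame({
    'District': ['A', 'B', 'C'],
    'Median_Income': [41000, 52000, 67000],
})

# Indexing with a column name gives back a Series.
income_series = example_table['Median_Income']
print(type(income_series))

# Column names are case-sensitive; a name that does not match any
# column raises a KeyError.
try:
    example_table['median_income']
except KeyError:
    print('No column with that name')
```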

We want to go straight to our familiar Numpy arrays, so we convert the
column of data into a Numpy array, using the `np.array` function you
have already seen:

In [None]:
# Convert column data into a Numpy array.
incomes = np.array(income_col)
# Show the first five values, by indexing with a slice.
incomes[:5]
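
As a small check on this conversion, the sketch below builds a made-up Series (the values are hypothetical) and converts it to an array in two ways: with `np.array`, as in the text, and with the Series' own `to_numpy` method, an alternative that the text does not use.

```python
import numpy as np
import pandas as pd

# A made-up Series standing in for a column of data (hypothetical values).
income_series = pd.Series([41000, 52000, 67000], name='Median_Income')

# Convert with np.array, as in the text.
as_array = np.array(income_series)
# An alternative: the to_numpy method of the Series.
also_array = income_series.to_numpy()

print(type(as_array))
# Both routes give arrays holding the same values.
print(np.array_equal(as_array, also_array))
```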