# Agenda

1. Getting started and series
    - Why Python?
    - What is Pandas?
    - What is a Pandas series -- defining them and working with them
    - Broadcasting
    - Boolean/mask indexes
    - Indexes
    - Dtypes
2. Data frames
3. Analyzing data

# Jupyter notebook

Jupyter is a REPL -- read, eval, print loop -- but one that is web-based.  

I'll be using Jupyter in this course! If you want to follow along, you have a few options:

- Install Jupyter on your computer, if you have some Python experience.
- Use VSCode or PyCharm, both of which can work with Jupyter notebooks.
- Use Jupyter Lite, at https://jupyter.org/try-jupyter/lab/



# Some things to keep in mind during the course

1. PLEASE ASK LOTS OF QUESTIONS! Use the Q&A widget (not the attendee chat) to ask.
2. We will do exercises. Please try to do them, too!
3. Whatever I type into my Jupyter window is mirrored on GitHub, at https://github.com/reuven/OReilly-2026-02February-pandas .

In [1]:
2 + 5

7

# A 5-minute introduction to Jupyter

We type into *cells* in Jupyter. Each cell contains Python code or documentation (Markdown). Each cell has one of two "modes" when you type into it:

- Edit mode, which means that when you type, the text goes into the cell (like right now!). You can enter edit mode by pressing `ENTER` or by clicking inside of a cell. This is true for either Python cells or Markdown cells; typing into them requires that you're in edit mode.
- Command mode, which means that you're giving one-letter commands to Jupyter. The character isn't entered into the cell, but affects how Jupyter does things. To enter command mode, press `ESC` or click to the left of the cell.

When you're in edit mode and want to tell Python to run/display what you've done, press shift+`ENTER` together. That'll "run" the cell.

### Command mode commands:

- `c` -- copies the current cell
- `x` -- cuts the current cell
- `v` -- pastes the most recent copy/cut
- `a` -- add a new cell *above* the current one
- `b` -- add a new cell *below* the current one
- `y` -- puts a cell into Python mode, for coding
- `m` -- puts a cell into Markdown mode, for documenting

# Why am I using Jupyter?

1. It's really easy to mix code + documentation, which I need to do when I'm teaching.
2. It's relatively easy to install, and works on every Python platform.
3. It lets me do things interactively.
4. It creates a document that lasts after our class, so you can remind yourself of what we did.
5. Jupyter is also super-duper popular in the data world.

# What is data science?

This term has become huge over the last decade or two. I divide it into several parts:

- All of data science is about retrieving, analyzing, reporting, visualizing, and forecasting with data. It's the umbrella term.
- *Data analytics* is making sense of data we've already collected. If you have a store, you want to know when people bought, and what they bought, and what coupons they used.
- *Data engineering* is about moving data around, from one place to another, typically to make it easier to analyze and work with.
- *Modeling* and *machine learning* is about making predictions, or forecasts, based on existing data.

All of these use Python as their main language.

If you have been using Python for any length of time, this seems **SUPER DUPER WEIRD**. That's because Python runs slowly, and uses lots of memory.

The reason that Python is the #1 language for data science is NumPy, a package that basically implements data structures in C, but exposes them with a thin layer of Python. 

Pandas is basically a wrapper around NumPy. You *could* work with NumPy directly, and many people do! But Pandas provides a huge number of convenient methods for reading, cleaning, filtering, analyzing, exporting, and visualizing data.

Pandas is super popular -- it has about 100m downloads every month.

Many organizations are moving from Matlab, R, and/or Excel to Pandas, because it does the same sort of thing, but lets you control it with Python *and* it's open source.

You can think of Pandas as Excel inside of Python.

In [2]:
# let's load Pandas!

import numpy as np
import pandas as pd
from pandas import Series

If you aren't yet using `uv` for your Python projects, *you should switch ASAP*! 

I have a free course: https://uvCrashCourse.com

# Series

The core data structure in Pandas is the *series*. This is a 1D data structure. It's kind of like a list in Python, but it has lots of other, different behaviors. We can create a new series with `pd.Series` (or in my case, `Series`), and by passing it a Python list of integers. Those integers will be used to initialize the series:

In [3]:
s = Series([10, 20, 30, 40, 50, 60, 70])  

s  # in Jupyter, if you have an expression on the final line of a cell, you'll see the expression's value

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

# Let's take apart what we just did

1. We created a new Pandas series. This is *not* a list! We used a list to initialize it, but it's not a list.
2. Our series needs to have values that are all of the same type. Here, they're all integers, and specifically, 64-bit integers.
3. The series can have any number of values.
4. The series, behind the scenes, actually contains a NumPy array of integers!
5. When we print our series, you can see not just the values (on the right), but the index on the left. Right now, we have the basic, standard, default index, which is a bunch of integers starting at 0. Pretty soon, you'll see how we can modify that index.
6. We can see the `dtype` of the series at the bottom, which is `int64` -- meaning, 64-bit integers.
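
Here is a minimal sketch that checks several of the points above. It rebuilds the same series so that it runs on its own; `to_numpy` is one way to get at the underlying values as a NumPy array, and the variable name `values` is just for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70])

print(type(s))          # a Pandas series, not a list
print(s.dtype)          # the one dtype shared by every value
print(s.index)          # the default index, integers starting at 0

values = s.to_numpy()   # the values, as a NumPy array
print(type(values))
```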

In [4]:
# compare with a list!

mylist = [10, 20, 30, 40, 50]

mylist + mylist   # what will happen? will this even work?

[10, 20, 30, 40, 50, 10, 20, 30, 40, 50]

In [6]:
# what if I add my series to itself?

s + s

0     20
1     40
2     60
3     80
4    100
5    120
6    140
dtype: int64

If you run an operation on two series, then the operator is applied to the values at the same index, and we get back a new series of the same length.

When I said `s+s`, I got:

- `s[0] + s[0]` in the 0 index
- `s[1] + s[1]` in the 1 index
- `s[2] + s[2]` in the 2 index

and so forth.
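
To make that concrete, here is a small sketch that builds the same result by hand, one index at a time, and compares it with `s + s`. The names `total` and `by_hand` are just for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70])

total = s + s

# Do the same addition ourselves, one index at a time
by_hand = pd.Series([s[i] + s[i] for i in s.index], index=s.index)

print(total.equals(by_hand))   # do both series have the same values and index?
```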

In [7]:
# can we retrieve from a series?

s[0]  # returns the value at index 0

np.int64(10)

In [9]:
s[1]

np.int64(20)

In [10]:
# can I use a slice? Yes!

s[2:7]   # from index 2 until (not including) index 7

2    30
3    40
4    50
5    60
6    70
dtype: int64

In [11]:
# can I change a series?

s[2] = 999
s

0     10
1     20
2    999
3     40
4     50
5     60
6     70
dtype: int64

In [12]:
s[0]

np.int64(10)

In [13]:
np.__version__  # np, dot, two underscores, "version", then two more underscores

'2.4.2'

In [14]:
pd.__version__

'3.0.0'

In [15]:
s[6]

np.int64(70)

In [16]:
s[-1]   # will this work?

KeyError: -1

In [17]:
%xmode Minimal

Exception reporting mode: Minimal


In [18]:
s[-1]

KeyError: -1
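
The negative number fails because, with the default integer index, square brackets look for a value in the index that equals -1, and there isn't one. If you want to count from the end the way you would with a Python list, one option is `.iloc`, which retrieves by position rather than by index. A minimal sketch, using a fresh copy of the seven-value series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70])

last = s.iloc[-1]        # by position: the final value
last_two = s.iloc[-2:]   # by position: a slice with the final two values

print(last)
print(last_two)
```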

# Mixing types

In Python, it's totally OK (technically) to have any combination of values in a list or tuple (or even a dict or a set). That's why we don't refer to Python lists as "arrays," because an array *must* have values that are all of the same type.

The same is true in NumPy arrays... and thus, the same thing is true in Pandas. All of the values must be of the same type. If you try to have a string in an integer series, it'll be converted into an integer... if it can be converted! If not, then you'll get an error.
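
How an implicit conversion behaves can depend on your Pandas version and on how the value gets into the series, so this sketch uses an explicit conversion with `astype` instead. It assumes we start with strings and ask for 64-bit integers; the names `digits` and `mixed` are just for illustration.

```python
import pandas as pd

digits = pd.Series(["10", "20", "30"])
as_ints = digits.astype("int64")     # every string can be read as an integer
print(as_ints.dtype)

mixed = pd.Series(["10", "twenty", "30"])
try:
    mixed.astype("int64")            # "twenty" cannot be read as an integer
    converted = True
except ValueError as err:
    converted = False
    print("Conversion failed:", err)
```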

# Exercise: Initial series stuff

1. `import pandas as pd` in your environment/notebook.
2. Create a Pandas series containing 5 integers.
3. Grab the first integer (at index 0).
4. Grab the third integer (at index 2).
5. What happens if you add the series to itself?

In [19]:
import pandas as pd

s = pd.Series([10, 15, 18, 12, 3])

s[0]

np.int64(10)

In [20]:
s[2]

np.int64(18)

In [21]:
s + s

0    20
1    30
2    36
3    24
4     6
dtype: int64

# Useful methods on a series

In [22]:
s

0    10
1    15
2    18
3    12
4     3
dtype: int64

In [23]:
s.sum()   # this returns an integer, the sum total of the elements of s

np.int64(58)

In [24]:
# whether you say pd.Series or Series depends on how you imported Pandas and/or specific names

import pandas as pd        # now I can use pd.Series
from pandas import Series  # now I can use just Series -- on its own, this line doesn't define pd

In [25]:
s.count()  # how many values are there in the series?

np.int64(5)

In [26]:
s.mean()  # what is the arithmetic mean of the series? It's the same as s.sum() / s.count()

np.float64(11.6)

In [27]:
s.sum() / s.count()

np.float64(11.6)

In [28]:
s.std()  # standard deviation -- how much do values in s "wiggle" from the mean?

np.float64(5.683308895353129)
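
By default, `std` calculates the *sample* standard deviation, dividing by one less than the number of values. Here is a sketch of the same arithmetic done step by step, on the same five numbers. Subtracting `mean` from `s` applies the subtraction to every value, and `ddof=0` asks for the population version, which divides by the number of values instead.

```python
import pandas as pd

s = pd.Series([10, 15, 18, 12, 3])

mean = s.mean()
squared_diffs = (s - mean) ** 2      # squared distance of each value from the mean

# Sample standard deviation: divide by (number of values - 1), then take the square root
sample_std = (squared_diffs.sum() / (s.count() - 1)) ** 0.5

print(sample_std)
print(s.std())           # should match sample_std
print(s.std(ddof=0))     # population version: divides by the number of values
```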

In [29]:
s

0    10
1    15
2    18
3    12
4     3
dtype: int64

In [30]:
len(s)   # how many values are in s?

5

In [31]:
s.min()  # smallest value

np.int64(3)

In [32]:
s.max()  # largest value

np.int64(18)

In [33]:
s.median()  # what is the central value? This is often a better measure of the "center" than mean

np.float64(12.0)

In [34]:
numbers = Series([10, 20, 30, 40, 50, 100_000])
numbers

0        10
1        20
2        30
3        40
4        50
5    100000
dtype: int64

In [35]:
numbers.mean()

np.float64(16691.666666666668)

In [36]:
numbers.median()

np.float64(35.0)

In [37]:
# if you want all of the summary data about a series, you can invoke "describe"

s.describe()

count     5.000000
mean     11.600000
std       5.683309
min       3.000000
25%      10.000000
50%      12.000000
75%      15.000000
max      18.000000
dtype: float64

In [38]:
# some other really nice methods

s.head()  # give me the first 5 values

0    10
1    15
2    18
3    12
4     3
dtype: int64

In [39]:
s.head(2)  # give me the first 2 values

0    10
1    15
dtype: int64

In [40]:
s.tail(2)  # give me the final 2 values

3    12
4     3
dtype: int64

# My favorite method: `value_counts`

If you invoke this on a series, you get back a new series, telling you how often every value appeared.

In [41]:
s = Series([10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 30, 30, 40, 50, 50, 50, 60, 70])

s.value_counts()  # the numbers in s become the index! The values of this series represent the counts

10    6
20    4
50    3
30    2
40    1
60    1
70    1
Name: count, dtype: int64

In [42]:
# since it's sorted from most common to least common
# and since it's a series
# we can get the 3 most common elements in s with:

s.value_counts().head(3)  

10    6
20    4
50    3
Name: count, dtype: int64

# Next up

1. Series exercises
2. Setting and retrieving values
3. Broadcasting

# Exercise: Series calculations

1. Create a series of 10 numbers. It's OK for some of them to repeat.
2. Find the largest and smallest numbers in two ways. First, using individual methods. Then, use `describe` to find them.
3. Which number occurs the most often in your series?