# Agenda

1. Get started and series
    - Why Python?
    - What is Pandas?
    - What is a Pandas series -- defining them and working with them
    - Broadcasting
    - Boolean/mask indexes
    - Indexes
    - Dtypes
2. Data frames
3. Analyzing data

# Jupyter notebook

Jupyter is a REPL -- read, eval, print loop -- but one that is web-based.  

I'll be using Jupyter in this course! If you want to follow along, you have a few options:

- If you have some Python experience, you can install Jupyter on your computer
- If you use VSCode or PyCharm, then they come with Jupyter installed already
- You can use Jupyter Lite, at https://jupyter.org/try-jupyter/lab/



# Some things to keep in mind during the course

1. PLEASE ASK LOTS OF QUESTIONS! Use the Q&A widget (not the attendee chat) to ask.
2. We will do exercises. Please try to do them, too!
3. Whatever I type into my Jupyter window is mirrored on GitHub, at https://github.com/reuven/OReilly-2026-02February-pandas .

In [1]:
2 + 5

7

# A 5-minute introduction to Jupyter

We type into *cells* in Jupyter. Each cell contains Python code or documentation (Markdown). Each cell has one of two "modes" when you type into it:

- Edit mode, which means that when you type, the text goes into the cell (like right now!). You can enter edit mode by pressing `ENTER` or by clicking inside of a cell. This is true for either Python cells or Markdown cells; typing into them requires that you're in edit mode.
- Command mode, which means that you're giving one-letter commands to Jupyter. The character isn't entered into the cell, but affects how Jupyter does things. To enter command mode, press `ESC` or click to the left of the cell.

When you're in edit mode and want to tell Python to run/display what you've done, press shift+`ENTER` together. That'll "run" the cell.

### Command mode commands:

- `c` -- copies the current cell
- `x` -- cuts the current cell
- `v` -- pastes the most recent copy/cut
- `a` -- add a new cell *above* the current one
- `b` -- add a new cell *below* the current one
- `y` -- puts a cell into Python mode, for coding
- `m` -- puts a cell into Markdown mode, for documenting

# Why am I using Jupyter?

1. It's really easy to mix code + documentation, which I need to do when I'm teaching.
2. It's relatively easy to install, and works on every Python platform.
3. It lets me do things interactively.
4. It creates a document that lasts after our class, so you can remind yourself of what we did.
5. Jupyter is also super-duper popular in the data world.

# What is data science?

This is actually a fairly new term, and it has become huge in the last decade or two. I divide it into several parts:

- All of data science is about retrieving, analyzing, reporting, visualizing, and forecasting with data. It's the umbrella term.
- *Data analytics* is making sense of data we've already collected. If you have a store, you want to know when people bought, and what they bought, and what coupons they used.
- *Data engineering* is about moving data around, from one place to another, typically to make it easier to analyze and work with.
- *Modeling* and *machine learning* is about making predictions, or forecasts, based on existing data.

All of these use Python as their main language.

If you have been using Python for any length of time, this seems **SUPER DUPER WEIRD**. That's because Python runs slowly, and uses lots of memory.

The reason that Python is the #1 language for data science is NumPy, a package that basically implements data structures in C, but exposes them with a thin layer of Python. 

Pandas is basically a wrapper around NumPy. You *could* work with NumPy directly, and many people do! But Pandas provides a huge number of convenient methods for reading, cleaning, filtering, analyzing, exporting, and visualizing data.

Pandas is super popular -- it has about 100m downloads every month.

Many organizations are moving from Matlab, R, and/or Excel to Pandas, because it does the same sort of thing, but lets you control it with Python *and* it's open source.

You can think of Pandas as Excel inside of Python.

In [2]:
# let's load Pandas!

import numpy as np
import pandas as pd
from pandas import Series

If you aren't yet using `uv` for your Python projects, *you should switch ASAP*! 

I have a free course: https://uvCrashCourse.com

# Series

The core data structure in Pandas is the *series*. This is a 1D data structure. It's kind of like a list in Python, but it has lots of other, different behaviors. We can create a new series with `pd.Series` (or in my case, `Series`), and by passing it a Python list of integers. Those integers will be used to initialize the series:

In [3]:
s = Series([10, 20, 30, 40, 50, 60, 70])  

s  # in Jupyter, if you have an expression on the final line of a cell, you'll see the expression's value

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

# Let's take apart what we just did

1. We created a new Pandas series. This is *not* a list! We used a list to initialize it, but it's not a list.
2. Our series needs to have values that are all of the same type. Here, they're all integers, and specifically, 64-bit integers.
3. The series can have any number of values.
4. The series, behind the scenes, actually contains a NumPy array of integers!
5. When we print our series, you can see not just the values (on the right), but the index on the left. Right now, we have the basic, standard, default index, which is a bunch of integers starting at 0. Pretty soon, you'll see how we can modify that index.
6. We can see the `dtype` of the series at the bottom, which is `int64` -- meaning, 64-bit integers.
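
Here's a small sketch that lets you check several of those points for yourself. It rebuilds the same series and uses `to_numpy()`, `dtype`, and `index`, which are standard parts of a series; the variable name `values` is just something I made up for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70])

values = s.to_numpy()      # the values, as a NumPy array
print(type(values))        # a NumPy array, not a Python list
print(s.dtype)             # the one dtype shared by every value
print(list(s.index))       # the default index: integers starting at 0
print(len(s))              # how many values the series holds
```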

In [4]:
# compare with a list!

mylist = [10, 20, 30, 40, 50]

mylist + mylist   # what will happen? will this even work?

[10, 20, 30, 40, 50, 10, 20, 30, 40, 50]

In [6]:
# what if I add my series to itself?

s + s

0     20
1     40
2     60
3     80
4    100
5    120
6    140
dtype: int64

If you run an operation on two series, then the operator is applied to the values at the same index, and we get back a new series of the same length.

When I said `s+s`, I got:

- `s[0] + s[0]` in the 0 index
- `s[1] + s[1]` in the 1 index
- `s[2] + s[2]` in the 2 index

and so forth.
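
If you want to convince yourself, here's a sketch that builds the same result by hand, one index at a time, and compares it with `s + s`. The name `by_hand` is only for this illustration, and you would never write real Pandas code this way.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70])

# build the result by hand: s[0] + s[0], then s[1] + s[1], and so on
by_hand = pd.Series([s[i] + s[i] for i in s.index])

# equals() checks that two series have the same values in the same places
print((s + s).equals(by_hand))
```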

In [7]:
# can we retrieve from a series?

s[0]  # returns the value at index 0

np.int64(10)

In [9]:
s[1]

np.int64(20)

In [10]:
# can I use a slice? Yes!

s[2:7]   # from index 2 until (not including) index 7

2    30
3    40
4    50
5    60
6    70
dtype: int64

In [11]:
# can I change a series?

s[2] = 999
s

0     10
1     20
2    999
3     40
4     50
5     60
6     70
dtype: int64

In [12]:
s[0]

np.int64(10)

In [13]:
np.__version__  # np, dot, two underscores, "version", two more underscores

'2.4.2'

In [14]:
pd.__version__

'3.0.0'

In [15]:
s[6]

np.int64(70)

In [16]:
s[-1]   # will this work?

KeyError: -1

In [17]:
%xmode Minimal

Exception reporting mode: Minimal


In [18]:
s[-1]

KeyError: -1

# Mixing types

In Python, it's totally OK (technically) to have any combination of values in a list or tuple (or even a dict or a set). That's why we don't refer to Python lists as "arrays," because an array *must* have values that are all of the same type.

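The same is true in NumPy arrays... and thus, the same thing is true in Pandas. All of the values must be of the same type. If you mix a string in with the integers when you create a series, Pandas won't give you an integer series; it falls back to the general-purpose `object` dtype. And if you try to assign a string into an existing integer series, recent versions of Pandas refuse with an error rather than quietly changing the dtype. If your strings contain digits, convert them to integers explicitly.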
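
Here's a sketch of how a series settles on a single dtype when you create it. The variable names are made up for this illustration, and `astype` is the usual way to ask for an explicit conversion.

```python
import pandas as pd

ints_and_floats = pd.Series([10, 20.5, 30])
print(ints_and_floats.dtype)      # float64: the integers were converted to floats

ints_and_text = pd.Series([10, 'hello', 30])
print(ints_and_text.dtype)        # object: the catch-all dtype for arbitrary Python values

digits = pd.Series(['10', '20', '30'])
as_ints = digits.astype('int64')  # explicit conversion; text such as 'hello' would raise an error
print(as_ints.sum())
```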

# Exercise: Initial series stuff

1. `import pandas as pd` in your environment/notebook.
2. Create a Pandas series containing 5 integers.
3. Grab the first integer (at index 0).
4. Grab the third integer (at index 2).
5. What happens if you add the series to itself?

In [19]:
import pandas as pd

s = pd.Series([10, 15, 18, 12, 3])

s[0]

np.int64(10)

In [20]:
s[2]

np.int64(18)

In [21]:
s + s

0    20
1    30
2    36
3    24
4     6
dtype: int64

# Useful methods on a series

In [22]:
s

0    10
1    15
2    18
3    12
4     3
dtype: int64

In [23]:
s.sum()   # this returns an integer, the sum total of the elements of s

np.int64(58)

In [24]:
# whether you say pd.Series or Series depends on how you imported Pandas and/or specific names

import pandas as pd        # now I can use pd.Series
from pandas import Series  # now I can use just Series -- but this line alone doesn't define pd

In [25]:
s.count()  # how many values are there in the series?

np.int64(5)

In [26]:
s.mean()  # what is the arithmetic mean of the series? It's the same as s.sum() / s.count()

np.float64(11.6)

In [27]:
s.sum() / s.count()

np.float64(11.6)

In [28]:
s.std()  # standard deviation -- how much do values in s "wiggle" from the mean?

np.float64(5.683308895353129)
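
One detail worth knowing: by default, `std` in Pandas calculates the *sample* standard deviation, dividing by `n - 1`, whereas NumPy's `np.std` divides by `n` unless you tell it otherwise. Here's a sketch that does the calculation by hand to show where the number comes from. The names `squared_deviations` and `by_hand` are just for this illustration.

```python
import math
import numpy as np
import pandas as pd

s = pd.Series([10, 15, 18, 12, 3])

# subtract the mean from every value, square the results, and divide their sum by (n - 1)
squared_deviations = (s - s.mean()) ** 2
by_hand = math.sqrt(squared_deviations.sum() / (s.count() - 1))

print(by_hand)                # the same number that s.std() gives us
print(s.std(ddof=0))          # dividing by n instead, which is NumPy's default
print(np.std(s.to_numpy()))
```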

In [29]:
s

0    10
1    15
2    18
3    12
4     3
dtype: int64

In [30]:
len(s)   # how many values are in s?

5

In [31]:
s.min()  # smallest value

np.int64(3)

In [32]:
s.max()  # largest value

np.int64(18)

In [33]:
s.median()  # what is the central value? This is often a better measure of the "center" than mean

np.float64(12.0)

In [34]:
numbers = Series([10, 20, 30, 40, 50, 100_000])
numbers

0        10
1        20
2        30
3        40
4        50
5    100000
dtype: int64

In [35]:
numbers.mean()

np.float64(16691.666666666668)

In [36]:
numbers.median()

np.float64(35.0)

In [37]:
# if you want all of the summary data about a series, you can invoke "describe"

s.describe()

count     5.000000
mean     11.600000
std       5.683309
min       3.000000
25%      10.000000
50%      12.000000
75%      15.000000
max      18.000000
dtype: float64

In [38]:
# some other really nice methods

s.head()  # give me the first 5 values

0    10
1    15
2    18
3    12
4     3
dtype: int64

In [39]:
s.head(2)  # give me the first 2 values

0    10
1    15
dtype: int64

In [40]:
s.tail(2)  # give me the final 2 values

3    12
4     3
dtype: int64

# My favorite method: `value_counts`

If you invoke this on a series, you get back a new series, telling you how often every value appeared.

In [41]:
s = Series([10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 30, 30, 40, 50, 50, 50, 60, 70])

s.value_counts()  # the numbers in s become the index! The values of this series represent the counts

10    6
20    4
50    3
30    2
40    1
60    1
70    1
Name: count, dtype: int64

In [42]:
# since it's sorted from most common to least common
# and since it's a series
# we can get the 3 most common elements in s with:

s.value_counts().head(3)  

10    6
20    4
50    3
Name: count, dtype: int64
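
If you'd rather see fractions than raw counts, `value_counts` takes a `normalize=True` argument. Here's a short sketch using the same series; the name `shares` is my own.

```python
import pandas as pd

s = pd.Series([10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 30, 30, 40, 50, 50, 50, 60, 70])

shares = s.value_counts(normalize=True)   # fractions of the total, instead of counts
print(shares.head(3))                     # still sorted from most common to least common
print(shares.sum())                       # the fractions add up to 1, give or take float rounding
```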

# Next up

1. Series exercises
2. Setting and retrieving values
3. Broadcasting

# Exercise: Series calculations

1. Create a series of 10 numbers. It's OK for some of them to repeat.
2. Find the largest and smallest numbers in two ways. First, using individual methods. Then, use `describe` to find them.
3. Which number occurs the most often in your series?

In [43]:
s = Series([10, 12, 18, 5, -8, 16, 30, 30, 12, 12])
s

0    10
1    12
2    18
3     5
4    -8
5    16
6    30
7    30
8    12
9    12
dtype: int64

In [44]:
s.min()

np.int64(-8)

In [45]:
s.max()

np.int64(30)

In [46]:
s.describe()

count    10.000000
mean     13.700000
std      11.175867
min      -8.000000
25%      10.500000
50%      12.000000
75%      17.500000
max      30.000000
dtype: float64

In [47]:
# s.describe() returns a series...
# ... and we can retrieve an element from a series with []
# if we put the index in the []

s.describe()['min']

np.float64(-8.0)

In [48]:
s.describe()['max']

np.float64(30.0)

In [50]:
s.value_counts().head(1)

12    3
Name: count, dtype: int64

# Setting and retrieving values

So far, we've seen that we can use `[]` to retrieve from a series, and also to set on it.

From this moment on, I will *not* use `[]` to retrieve or set. Rather, I'll use `.loc[]`. I promise, this is for your own good!

- `.loc` works like a method, but it's written with `[]` rather than `()`. One reason is that Python only allows slice syntax, such as `3:7`, inside of square brackets.
- `.loc` lets you retrieve or set via the index.
- `.loc` is one of the most powerful and common ways to retrieve, filter, and set our data in Pandas.
- Soon, we'll contrast it with some other things, and see why indexes are so crucial.

Also, when we get to 2D data frames next week, `[]` will *no longer* work with the index. Rather, it'll refer to columns. That can be very confusing and frustrating.

In [51]:
s

0    10
1    12
2    18
3     5
4    -8
5    16
6    30
7    30
8    12
9    12
dtype: int64

In [52]:
s.loc[3]   # this is way better than s[3]

np.int64(5)

In [53]:
s.loc[3] = 999  # this is better than s[3] = 999
s

0     10
1     12
2     18
3    999
4     -8
5     16
6     30
7     30
8     12
9     12
dtype: int64

# Broadcasting

This is one of the most important and powerful tools we have in Pandas!

In [54]:
s

0     10
1     12
2     18
3    999
4     -8
5     16
6     30
7     30
8     12
9     12
dtype: int64

In [55]:
s = pd.Series([10, 20, 30, 40, 50, 55, 65, 75, 85, 95])
s

0    10
1    20
2    30
3    40
4    50
5    55
6    65
7    75
8    85
9    95
dtype: int64

In [56]:
s + s   # we've seen this -- we'll get each element added to itself, and a new series back with the same length

0     20
1     40
2     60
3     80
4    100
5    110
6    130
7    150
8    170
9    190
dtype: int64

In [57]:
# what if we add 3 to s?

s + 3  # this is known as "broadcasting" -- the +3 is run on each element of s, giving us a new series back

0    13
1    23
2    33
3    43
4    53
5    58
6    68
7    78
8    88
9    98
dtype: int64

In [58]:
# we can do this with *any* operator!

In [59]:
s - 3

0     7
1    17
2    27
3    37
4    47
5    52
6    62
7    72
8    82
9    92
dtype: int64

In [60]:
s * 3

0     30
1     60
2     90
3    120
4    150
5    165
6    195
7    225
8    255
9    285
dtype: int64

In [61]:
s / 3

0     3.333333
1     6.666667
2    10.000000
3    13.333333
4    16.666667
5    18.333333
6    21.666667
7    25.000000
8    28.333333
9    31.666667
dtype: float64

Why is broadcasting useful?

1. It takes advantage of "vectorization": the operation runs over the whole series at once in fast, compiled code, rather than in a slow Python loop over individual numbers.
2. We will soon see that we can use it to retrieve specific values, and filter through our series as well.
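
To make the comparison concrete, here's a sketch that gets the same result twice: once with an explicit Python loop, and once with broadcasting. The names `looped` and `broadcast` are just for this illustration. On ten numbers you won't notice a speed difference, but on millions of values the broadcast version is typically much faster.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 55, 65, 75, 85, 95])

# the explicit way: a Python loop that visits every value
looped = pd.Series([value + 3 for value in s])

# the broadcast way: one expression, and no loop in our code
broadcast = s + 3

print(broadcast.equals(looped))
```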

In [63]:
(s + 3).loc[9]

np.int64(98)

# Exercise: Playing with series

1. Define a series, `highs`, with the 10-day high temp forecast for your city.
2. Define a second series, `lows`, with the 10-day low temp forecast for your city.
3. Find the mean difference between highs and lows. (That is, what, on average, will be the difference between high and low temps over the next 10 days in your city?)
4. Repeat this, but if you originally used Fahrenheit, change it to Celsius. Or vice versa.
    - From F to C, celsius = (fahrenheit - 32) * 5 / 9
    - From C to F, fahrenheit = (celsius * 9/5) + 32

In [67]:
# if you want random integers, you can use np.random.randint(LOW, HIGH, COUNT)

np.random.randint(0, 100, 10)

array([46,  9, 31, 44, 23, 50, 19,  1, 31, 63])

In [68]:
s = Series(np.random.randint(0, 100, 10))
s

0    42
1    67
2    84
3    36
4    68
5    68
6    59
7     0
8    84
9    62
dtype: int64

In [69]:
high_temps = Series([23, 28, 20, 26, 30, 20, 20, 19, 23, 22])
low_temps = Series([16, 13, 10, 16, 14, 12, 11, 10, 12, 11])




In [70]:
high_temps

0    23
1    28
2    20
3    26
4    30
5    20
6    20
7    19
8    23
9    22
dtype: int64

In [71]:
high_temps - low_temps

0     7
1    15
2    10
3    10
4    16
5     8
6     9
7     9
8    11
9    11
dtype: int64

In [72]:
(high_temps - low_temps).mean()

np.float64(10.6)

In [73]:
(high_temps - low_temps).min()

np.int64(7)

In [74]:
(high_temps - low_temps).max()

np.int64(16)

In [75]:
diff_temps = high_temps - low_temps

In [76]:
diff_temps.describe()

count    10.000000
mean     10.600000
std       2.875181
min       7.000000
25%       9.000000
50%      10.000000
75%      11.000000
max      16.000000
dtype: float64

In [77]:
high_temps_f = (high_temps * 9/5) + 32
low_temps_f = (low_temps * 9/5) + 32


In [78]:
high_temps_f

0    73.4
1    82.4
2    68.0
3    78.8
4    86.0
5    68.0
6    68.0
7    66.2
8    73.4
9    71.6
dtype: float64

In [79]:
low_temps_f

0    60.8
1    55.4
2    50.0
3    60.8
4    57.2
5    53.6
6    51.8
7    50.0
8    53.6
9    51.8
dtype: float64

In [81]:
(high_temps_f - low_temps_f).describe()

count    10.000000
mean     19.080000
std       5.175326
min      12.600000
25%      16.200000
50%      18.000000
75%      19.800000
max      28.800000
dtype: float64

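In [None]:
# converting a *difference* from C to F: multiply by 9/5, but don't add 32
# (the +32 offset cancels out when we subtract lows from highs)
(high_temps.mean() - low_temps.mean()) * 9/5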

In [82]:
(high_temps - low_temps).mean()

np.float64(10.6)

In [83]:
high_temps.mean() - low_temps.mean()

np.float64(10.600000000000001)
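
Those two results differ only in the last digit, which is ordinary floating-point rounding. Here's a sketch, assuming the same forecast numbers as above, that compares them with a tolerance. It also checks that a *difference* in Celsius turns into a difference in Fahrenheit with the 9/5 scaling alone. The variable names are my own.

```python
import math
import pandas as pd

high_temps = pd.Series([23, 28, 20, 26, 30, 20, 20, 19, 23, 22])
low_temps = pd.Series([16, 13, 10, 16, 14, 12, 11, 10, 12, 11])

mean_of_diffs = (high_temps - low_temps).mean()
diff_of_means = high_temps.mean() - low_temps.mean()

# floats can differ in the last digit, so compare with a tolerance rather than ==
print(math.isclose(mean_of_diffs, diff_of_means))

# convert both series to Fahrenheit, then compare the mean difference with the scaled Celsius one
high_f = high_temps * 9 / 5 + 32
low_f = low_temps * 9 / 5 + 32
print(math.isclose((high_f - low_f).mean(), mean_of_diffs * 9 / 5))
```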

In [84]:
# Fancy indexing

s

0    42
1    67
2    84
3    36
4    68
5    68
6    59
7     0
8    84
9    62
dtype: int64

In [85]:
# if I want the item at index 3, I can say

s.loc[3]

np.int64(36)

In [86]:
# if I want the item at index 5, I can say

s.loc[5]

np.int64(68)

In [87]:
# what if I want both of them?
# we can use fancy indexing -- giving [] not one numeric index, but rather a list of indexes!

s.loc[ [3, 5] ]

3    36
5    68
dtype: int64

In [88]:
s.loc[ [3, 5, 2, 4] ]

3    36
5    68
2    84
4    68
dtype: int64

In [90]:
output = s.loc[ [3, 5, 3, 5, 3, 5] ]
output

3    36
5    68
3    36
5    68
3    36
5    68
dtype: int64

In [93]:
# what if I want the first item? I can't use .loc now, because I'll get all of the matching elements!

output.loc[3]   # if you ask for all elements with index 3, that's what .loc will return!

3    36
3    36
3    36
dtype: int64

In [94]:
# the other way to do this is .iloc, where it only refers to the position
# this is the way to retrieve/set values in a series that doesn't depend on the index!

output.iloc[3]  # this will return the element at positional index 3

np.int64(68)

In [95]:
# .iloc can use negative indexes!
# if you want the final value in a series, just use .iloc[-1]

output.iloc[-1]

np.int64(68)

In [98]:
# can we use slices with .loc and/or .iloc? YES!
# but remember that fancy indexing has double [] -- the outer for .loc/.iloc, and the inner for a list of indexes
# slices only have one pair of [], the slice goes in them

s.loc[3:7]   # normally, a slice is "up to and not including". .loc breaks that, and says, "up to AND INCLUDING"

3    36
4    68
5    68
6    59
7     0
dtype: int64

In [99]:
s.iloc[3:7]  # now it will be based on position, and thus it's "up to and NOT including"

3    36
4    68
5    68
6    59
dtype: int64
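
With the default index, labels and positions are the same numbers, so the only visible difference is the extra element at the end. Here's a sketch with a series whose index is *not* 0, 1, 2, which makes the distinction clearer. The data and the names `temps` and `picked` are made up for this illustration.

```python
import pandas as pd

temps = pd.Series([23, 28, 20, 26, 30])
picked = temps.iloc[[1, 3, 4]]        # the index of picked is now 1, 3, 4

by_label = picked.loc[3:4]            # labels 3 through 4, including 4
by_position = picked.iloc[1:2]        # position 1 only; position 2 is not included

print(by_label.tolist())
print(by_position.tolist())
```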

# Exercise: Retrieving with `.loc` and `.iloc`

1. From the `high_temps` series, retrieve the items at indexes 2, 5, and 8, and get their mean. Do you want to use `.loc` or `.iloc` here? Does it matter?
2. Grab the elements from indexes 2-8 in `high_temps`, and assign to a new series. If you want the 2nd and 3rd values from here, how would you use `.loc`? How would you use `.iloc`?

In [100]:
high_temps

0    23
1    28
2    20
3    26
4    30
5    20
6    20
7    19
8    23
9    22
dtype: int64

In [101]:
high_temps.loc[ [2,5,8] ]

2    20
5    20
8    23
dtype: int64

In [102]:
high_temps.iloc[ [2,5,8] ]

2    20
5    20
8    23
dtype: int64

In [103]:
some_high_temps = high_temps.iloc[ [2,5,8] ]
some_high_temps

2    20
5    20
8    23
dtype: int64

In [107]:
# 2nd and 3rd values from some_high_temps using .loc

some_high_temps.loc[[5, 8]]   # here, we retrieve by index!

5    20
8    23
dtype: int64

In [106]:
some_high_temps.iloc[[1, 2]]     # here, we retrieve by position!

5    20
8    23
dtype: int64

# Next up

1. EK's questions
2. Mask/boolean arrays, for filtering data
3. Custom indexes

In [108]:
s

0    42
1    67
2    84
3    36
4    68
5    68
6    59
7     0
8    84
9    62
dtype: int64

In [109]:
some_high_temps

2    20
5    20
8    23
dtype: int64

In [110]:
some_high_temps.iloc[1]   # this returns the number at position 1

np.int64(20)

In [112]:
some_high_temps.where(20)


ValueError: Array conditional must be same shape as self
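
That error happens because `where` wants a condition with one `True` or `False` per element, not a single value to search for. Here's a sketch of what a working call might look like, using a comparison to build the condition. The name `kept` is my own.

```python
import pandas as pd

high_temps = pd.Series([23, 28, 20, 26, 30, 20, 20, 19, 23, 22])
some_high_temps = high_temps.iloc[[2, 5, 8]]

# where() keeps each value whose condition is True, and puts NaN everywhere else
kept = some_high_temps.where(some_high_temps == 20)
print(kept)
```

Notice that the result has the same length as the original. That's different from the filtering we're about to do with `.loc`, which drops the elements we don't want.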

In [121]:
some_high_temps.iloc[ [1] ]   # one-element list of indexes

5    20
dtype: int64

# Remember broadcasting?

The idea is that we can apply any operator to a series with a scalar value (i.e., an integer or float), and the operator + number will be applied to every element in the series. We'll get back a new series as a result.

In [115]:
s = Series([10, 20, 30, 40])
s

0    10
1    20
2    30
3    40
dtype: int64

In [116]:
s * 3

0     30
1     60
2     90
3    120
dtype: int64

In [117]:
# we've seen fancy indexing, where I can specify which elements I want

s.loc[ [2, 1, 3] ]

2    30
1    20
3    40
dtype: int64

In [118]:
# what if I pass a list of boolean values, instead of integers?
# that is: instead of saying s.loc[ [SOME NUMBERS] ], I say s.loc[ [SOME TRUE/FALSE VALUES] ]?

In [123]:
# if we pass a list of True/False values to s.loc,
# then we get the values aligned with True
# and the values aligned with False are ignored

s.loc[ [True, False, True, False]  ]

0    10
2    30
dtype: int64

# No one writes these boolean lists by hand!

Instead, we can take advantage of broadcasting! 

We broadcast not `+` or `-`, but `==` or `<` or another comparison. That gives us back a series of boolean values! We can then use it to retrieve selected values.

# Mask indexes are central to Pandas

They are also super weird and hard to wrap your head around:

- The syntax is bizarre
- What we're doing is a bit weird, too

This is why you almost never need `for` loops or `if` statements when working with Pandas: the mask does the looping and the testing for you.

In [124]:
s

0    10
1    20
2    30
3    40
dtype: int64

In [125]:
s < 30   # here, I'm broadcasting! I'll get back a series of booleans (True/False), for the operator and each value

0     True
1     True
2    False
3    False
dtype: bool

In [127]:
#          10   20    30     40         # values in s

s.loc[  [True, True, False, False]  ]   # what happens here?

0    10
1    20
dtype: int64

In [128]:
# we can do all of this in one fell swoop:

s.loc[ s < 30 ]

0    10
1    20
dtype: int64

# How to read this:

1. Start inside of the `[]`. We see a broadcast taking place, `s < 30`
2. That broadcast returns a series of booleans (`True` and `False` values), of the same length as `s`
3. Then, thanks to `.loc`, we apply the boolean series inside of `[]`.
4. `.loc` returns only those elements of `s` for which there is a corresponding `True` in the boolean series.

This is known as a "boolean index" or a "mask index."
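
The same steps, written out one at a time, might look like this sketch. The names `mask` and `selected` are just ones I picked for the illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

mask = s < 30            # steps 1 and 2: broadcast the comparison, get a boolean series
selected = s.loc[mask]   # steps 3 and 4: keep only the values lined up with True

print(mask.tolist())
print(selected.tolist())
```

Assigning the boolean series to a variable first is often easier to read than putting the comparison directly inside of the `[]`.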

# Exercise: Boolean indexes

1. On how many days will the forecast high temp be higher than the mean high temp?
2. What is the mean of the odd temperatures? Remember: You can say `%2` in Python to get the remainder from dividing by 2, and if you compare that with 1, you can check if a number is odd.