# Agenda

1. Series
2. Creating them
3. Retrieving with `[]`, `.loc`, `.iloc`
4. Indexes and how that affects retrieval
5. Broadcasting and retrieving

In [1]:
import pandas as pd

In [2]:
from pandas import Series

In [3]:
s = Series([10, 20, 30, 40, 50])

In [4]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [5]:
s = Series([10, 20, 30.5, 40, 50])

In [6]:
s

0    10.0
1    20.0
2    30.5
3    40.0
4    50.0
dtype: float64

# Lists vs series

A list can contain any objects we want. By convention, the elements are usually all of the same type, but they don't have to be.

In a series, all values *must* be of the same dtype. That is why including `30.5` in the series above turned every value into a float.
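
To make the single-dtype rule concrete, here is a minimal sketch. The variable names are made up for illustration, and the exact integer dtype can depend on your platform.

```python
from pandas import Series

ints = Series([10, 20, 30])           # all integers
mixed = Series([10, 20, 30.5])        # one float forces every value to float
text = Series([10, 'twenty', 30.5])   # no common numeric type, so pandas falls back to object

print(ints.dtype)    # an integer dtype (int64 on most platforms)
print(mixed.dtype)   # float64
print(text.dtype)    # object
```

The `object` dtype means "arbitrary Python objects", so the series still has one dtype, just a very general one.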

In [7]:
# How can I retrieve from a series?

s[0]

10.0

In [8]:
s[1]

20.0

In [9]:
type(s[1])

numpy.float64

In [10]:
s[-1]  # can I get the final value? No: -1 is looked up as an index label, and there is no such label

KeyError: -1

# Don't use `[]` by themselves on a series!

Using `[]` often works on a series, but as the `KeyError` above shows, with an integer index it looks up index labels rather than positions, so it can surprise you. It's also a bad habit to get into, because once we start to work with data frames, we'll be using `[]` to refer to the columns, rather than to the rows.

What do you use instead?

You can use `.loc` and `.iloc`.

With the default index (0, 1, 2, and so on), these are (almost) identical in behavior, because each label is the same as its position. However, we will soon see that they don't have to be.

The basic idea is that you write `.loc` with `[]` after it, and put the index label you want inside of the `[]`.

In [11]:
s.loc[0]

10.0

In [12]:
s.loc[1]

20.0

In [13]:
# what, then, is .iloc if .loc uses the index?
# .iloc uses the position, starting with 0

s.iloc[0]

10.0

In [14]:
s.iloc[1]

20.0
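
Because `.iloc` works by position, it also answers the earlier question about getting the final value. A small sketch, using the same values as above:

```python
from pandas import Series

s = Series([10, 20, 30.5, 40, 50])

last = s.iloc[-1]        # negative positions count from the end, as in a list
last_two = s.iloc[-2:]   # positional slices accept negative numbers too

print(last)       # 50.0
print(last_two)
```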

# Methods we can run on our series

- `min`
- `max`
- `mean`
- `std`
- `count` (how many non-NaN values are in there)
- `median`
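
Here is a short sketch of these methods on a series that contains a missing value, to show how `count` differs from `len`. The data is made up for illustration.

```python
from pandas import Series

s = Series([10, 20, None, 40])   # None is stored as NaN, so the dtype is float64

print(s.min())      # 10.0
print(s.max())      # 40.0
print(s.mean())     # NaN is skipped: (10 + 20 + 40) / 3
print(s.std())      # sample standard deviation of the non-NaN values
print(s.median())   # 20.0
print(s.count())    # 3, because NaN is not counted
print(len(s))       # 4, because len counts every element
```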

# Exercise: Weather report

1. Define a series containing the max temperature of where you live over the coming 10 days.
2. What will be the mean temperature?
3. What will be the median? Are they significantly different, and does that matter?


In [15]:
s = Series([33, 38, 27, 24, 23, 24, 25, 27, 32, 35])
s

0    33
1    38
2    27
3    24
4    23
5    24
6    25
7    27
8    32
9    35
dtype: int64

In [16]:
s.mean()

28.8

In [17]:
s.sum() / s.count()

28.8

In [18]:
s.median()

27.0

# Setting the index

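An index in a Pandas series can hold nearly any data type (integers, strings, dates, and so on), and can contain whatever values you want.

You can (almost) think of a series as a dictionary, with the index playing the role of the keys, but with even fewer restrictions on what it can contain. For example, index values are allowed to repeat.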
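
The dictionary comparison can be taken literally: a series can be built from a dict, in which case the keys become the index. This is a sketch with hypothetical data, not something from the session itself.

```python
from pandas import Series

# Hypothetical data: the dict keys become the index labels
prices = Series({'apple': 3, 'banana': 1, 'cherry': 7})

print(prices.loc['banana'])   # 1
print('cherry' in prices)     # True: `in` checks the index, like dict keys
print(list(prices.index))     # ['apple', 'banana', 'cherry']
```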

In [19]:
s = Series([10, 20, 30, 40, 50])
s.index  # what is the index on my series?

RangeIndex(start=0, stop=5, step=1)

In [20]:
# I can replace the index by assigning to it
# so long as I assign the right number of values, that's fine

s.index = [2,4,6,8,10]

In [21]:
s

2     10
4     20
6     30
8     40
10    50
dtype: int64

In [22]:
# now we can see the difference between .loc and .iloc

s.loc[6]

30

In [23]:
s.iloc[6]

IndexError: single positional indexer is out-of-bounds

In [24]:
x = 'abcd'
x.upper()

'ABCD'

In [25]:
str.upper(x)

'ABCD'

In [26]:
type(s)

pandas.core.series.Series

In [28]:
type(pd.core.series.Series.index)

pandas._libs.properties.AxisProperty

In [29]:
s.index = list('abcde')
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [30]:
s.loc['b']

20

In [31]:
s.loc['c']

30

In [32]:
s.iloc[1]

20

In [33]:
s.iloc[2]

30

In [34]:
s.iloc[3]

40

In [37]:
# I can always use : to retrieve a range (a slice)
s.iloc[1:4]  # up to and *NOT* including position 4

b    20
c    30
d    40
dtype: int64

In [38]:
s.loc['b':'d']   # up to and *INCLUDING* 'd'

b    20
c    30
d    40
dtype: int64

When I say

    s.iloc[a:b]

we get everything up to and *not* including position `b`. But if I say

    s.loc[a:b]

we get everything up to and *including* label `b`!
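
The difference is easiest to see when the labels are integers that don't match the positions. A minimal sketch, reusing the even-numbered index from earlier:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=[2, 4, 6, 8, 10])

by_position = s.iloc[1:3]   # positions 1 and 2; position 3 is excluded
by_label = s.loc[4:8]       # labels 4, 6, and 8; label 8 is included

print(list(by_position))    # [20, 30]
print(list(by_label))       # [20, 30, 40]
```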



In [39]:
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [41]:
s = Series([10, 20, 30, 40, 50])

s.loc[2]

30

In [42]:
s.loc[4]

50

In [43]:
# what if I want both of them?
s.loc[ [2,4] ]     # this is known as "fancy indexing" -- I can retrieve more than one value at a time

2    30
4    50
dtype: int64

In [44]:
s.loc[ [2,4] ].mean()

40.0

In [45]:
# what if our series has a custom index?

s.index = list('abcde')

s.loc[['b', 'd']]

b    20
d    40
dtype: int64

In [46]:
# I can set the index when I create the series by passing the keyword argument index= and a list of values

s = Series([10, 20, 30, 40, 50],
           index=list('abcde'))
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [47]:
# how about this?

s = Series([10, 20, 30, 40, 50],
           index=list('abcab'))
s

a    10
b    20
c    30
a    40
b    50
dtype: int64

In [48]:
# you can have an index that repeats!

s.loc['a']

a    10
a    40
dtype: int64

In [49]:
s.loc[['a', 'b']]

a    10
a    40
b    20
b    50
dtype: int64

In [50]:
s.loc['a':'c']  # slice from a-c

KeyError: "Cannot get left slice bound for non-unique label: 'a'"
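
One possible way around this error, not covered in the session, is to sort the index first. My understanding is that label slices work on a repeating index as long as it is sorted, because pandas can then find where each label's run begins and ends. A sketch under that assumption:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=list('abcab'))

print(s.index.is_unique)                 # False: 'a' and 'b' repeat
print(s.index.is_monotonic_increasing)   # False: the labels are not in sorted order

s_sorted = s.sort_index()                # index is now a, a, b, b, c
subset = s_sorted.loc['a':'b']           # every 'a' and every 'b'

print(subset)
```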

In [51]:
s.index

Index(['a', 'b', 'c', 'a', 'b'], dtype='object')

In [52]:
list(s.index)

['a', 'b', 'c', 'a', 'b']

In [54]:
set(s.index)

{'a', 'b', 'c'}

In [55]:
s.index.drop_duplicates()

Index(['a', 'b', 'c'], dtype='object')

In [56]:
s

a    10
b    20
c    30
a    40
b    50
dtype: int64

In [58]:
s.index=list('abcbf')

In [59]:
s.loc['c':'f']

c    30
b    40
f    50
dtype: int64

In [61]:
%timeit s['c':'f']

219 µs ± 8.53 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [62]:
%timeit s.loc['c':'f']

219 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [63]:
%timeit s.iloc[2:4]

46.6 µs ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


# Exercise: Better temperatures

1. Recreate your high-temperature series, but make the index contain days of the week.
2. Get the mean temperature for all Fridays.
3. Get the mean temperature for all Fridays and Mondays.


In [64]:
s

a    10
b    20
c    30
b    40
f    50
dtype: int64

In [66]:
type(s.index)

pandas.core.indexes.base.Index

In [67]:
s = Series([33, 38, 27, 24, 23, 24, 25, 27, 32, 35],
          index='Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun'.split())
s

Fri    33
Sat    38
Sun    27
Mon    24
Tue    23
Wed    24
Thu    25
Fri    27
Sat    32
Sun    35
dtype: int64

In [68]:
s.loc['Fri']

Fri    33
Fri    27
dtype: int64

In [69]:
s.loc[['Fri', 'Mon']]

Fri    33
Fri    27
Mon    24
dtype: int64

In [70]:
s.loc[['Mon', 'Fri']]

Mon    24
Fri    33
Fri    27
dtype: int64

In [71]:
s.loc[['Mon', 'Fri', 'Mon']]

Mon    24
Fri    33
Fri    27
Mon    24
dtype: int64
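
The cells above retrieve the values but stop short of the means the exercise asks for. One way to finish it, sketched here with the name `highs` chosen for illustration:

```python
from pandas import Series

highs = Series([33, 38, 27, 24, 23, 24, 25, 27, 32, 35],
               index='Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun'.split())

friday_mean = highs.loc['Fri'].mean()             # (33 + 27) / 2
fri_mon_mean = highs.loc[['Fri', 'Mon']].mean()   # (33 + 27 + 24) / 3

print(friday_mean)    # 30.0
print(fri_mon_mean)   # 28.0
```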

# Per-index operations

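If I have two series and I perform an operation on them, that operation is done on a per-index basis: values are matched up by their index labels, not by their positions. A label that appears in only one of the two series produces `NaN` in the result.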

In [72]:
s1 = Series([10, 20, 30, 40, 50], index=list('abcde'))
s2 = Series([100, 200, 300, 400, 500], index=list('abcde'))


In [73]:
s1 + s2

a    110
b    220
c    330
d    440
e    550
dtype: int64

In [74]:
# let's reverse the index in s2
s1 = Series([10, 20, 30, 40, 50], index=list('abcde'))
s2 = Series([100, 200, 300, 400, 500], index=list('edcba'))

s1 + s2

a    510
b    420
c    330
d    240
e    150
dtype: int64

In [75]:
# what if an index is repeated?
s1 = Series([10, 20, 30, 40, 50], index=list('abcde'))
s2 = Series([100, 200, 300, 400, 500], index=list('abcdd'))

s1 + s2

a    110.0
b    220.0
c    330.0
d    440.0
d    540.0
e      NaN
dtype: float64

In [76]:
# what if they aren't the same length?
s1 = Series([10, 20, 30, 40, 50], index=list('abcde'))
s2 = Series([100, 200, 300, 400], index=list('abcd'))

s1 + s2

a    110.0
b    220.0
c    330.0
d    440.0
e      NaN
dtype: float64
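
If the `NaN` is not what you want, one option (not covered in the session) is the `add` method with its `fill_value` argument, which substitutes a value for labels missing from one side. A minimal sketch:

```python
from pandas import Series

s1 = Series([10, 20, 30, 40, 50], index=list('abcde'))
s2 = Series([100, 200, 300, 400], index=list('abcd'))

# 'e' is missing from s2, so it is treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)

print(total)
```

Whether treating a missing value as 0 makes sense depends on the data, so this is a choice to make deliberately rather than a default.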

In [77]:
# What if I add a number to a series? What happens then?

s = Series([10, 20, 30, 40, 50], index=list('abcde'))
s + 3

a    13
b    23
c    33
d    43
e    53
dtype: int64

# Broadcasting

"Broadcasting" is one of the most important ideas in all of Pandas. The idea is: If you perform an operation between a series and a scalar (single) value, the scalar is "broadcast" to each element of the series, and the operation is performed on each element in turn.

In [78]:
s * 3

a     30
b     60
c     90
d    120
e    150
dtype: int64

In [79]:
s + 3

a    13
b    23
c    33
d    43
e    53
dtype: int64

In [80]:
s / 3

a     3.333333
b     6.666667
c    10.000000
d    13.333333
e    16.666667
dtype: float64

In [81]:
s ** 3

a      1000
b      8000
c     27000
d     64000
e    125000
dtype: int64

# Never use `for` loops!

One of the best ways to slow down your Pandas code is to use a `for` loop to go through each element and do something. Never, never use `for` loops unless you have no choice -- and that's going to be rare.
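
To illustrate, here is a sketch that computes the same result both ways. It shows only that the two are equivalent. It does not measure speed, which you can check yourself with `%timeit`.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=list('abcde'))

# Loop version: builds the result one element at a time in Python
looped = Series([value * 3 for value in s], index=s.index)

# Broadcast version: a single expression, with the looping done inside pandas
broadcast = s * 3

print(looped.equals(broadcast))   # True
```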



# Exercise: More temperatures

1. Define two separate series, each with 10 temperatures. The first will be what we used before, with the high temps for the coming 10 days. The second will be for the low temps over the coming 10 days. They should have the same index (with day names).
2. Find the mean difference between highs and lows over the coming 10 days.
3. If you used Celsius for your temps, convert your highs to Fahrenheit. Or convert to Celsius, if you used Fahrenheit.

- Fahrenheit = (Celsius * 1.8) + 32
- Celsius = (Fahrenheit - 32) / 1.8

In [82]:
highs = Series([33, 38, 27, 24, 23, 24, 25, 27, 32, 35],
          index='Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun'.split())
lows = Series([19, 19, 17, 16, 16, 15, 14, 16, 19, 21],
          index='Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun'.split())


In [84]:
diffs = highs - lows
diffs.mean()

11.6

In [85]:
(highs - lows).mean()

11.6

In [87]:
(highs * 1.8) + 32

Fri     91.4
Sat    100.4
Sun     80.6
Mon     75.2
Tue     73.4
Wed     75.2
Thu     77.0
Fri     80.6
Sat     89.6
Sun     95.0
dtype: float64
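
As a sanity check on the two formulas, we can convert to Fahrenheit and back again. This is a sketch, with the names `highs_c`, `highs_f`, and `back_to_c` invented for illustration.

```python
from pandas import Series

highs_c = Series([33, 38, 27, 24, 23, 24, 25, 27, 32, 35],
                 index='Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun'.split())

highs_f = (highs_c * 1.8) + 32     # Celsius -> Fahrenheit
back_to_c = (highs_f - 32) / 1.8   # Fahrenheit -> Celsius

# Floating-point math can leave tiny rounding errors, so compare with a tolerance
max_error = (back_to_c - highs_c).abs().max()
print(max_error < 1e-9)
```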

In [88]:
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [89]:
s.loc['b'] = 999
s

a     10
b    999
c     30
d     40
e     50
dtype: int64
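
Assignment through `.loc` also accepts a list of labels, in which case a scalar on the right-hand side is broadcast to each of them. A minimal sketch:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=list('abcde'))

s.loc['b'] = 999        # replace a single value
s.loc[['d', 'e']] = 0   # the scalar 0 is assigned to both labels

print(s)
```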

# Next up: Mask indexes

- Comparisons
- Broadcasts and comparisons
- Using that to filter our series with a "boolean index" or a "mask index"