# Agenda

1. Series
2. Creating a series
3. Retrieving from a series with `.loc` and `.iloc`
4. Fancy indexing
5. Setting an index
6. Broadcasting and retrieving

In [1]:
import pandas as pd
from pandas import Series

In [2]:
s = Series([10, 20, 30, 40, 50])

In [3]:
type(s)

pandas.core.series.Series

In [4]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# Series vs. list

A Python list can contain any number of values, of any types, including a mix of types. It's traditional not to do that, but there's no technical barrier to doing so.

By contrast, a series *must* contain values that are all of the same type. There is no way around this.

The type of data that a series contains is determined by its `dtype`. By default, if it sees that we're creating a series with integers, Pandas will use a `dtype` of "int64", meaning 64-bit (8-byte) integers. 

In [5]:
# what if I create a series with some other values?

s = Series([10, 30, 30.5, 40, 50])
s

0    10.0
1    30.0
2    30.5
3    40.0
4    50.0
dtype: float64

In [6]:
# you can always ask a series what its dtype is:

s.dtype

dtype('float64')
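
Pandas picked the `dtype` for us above, but we can also ask for one explicitly. Here is a minimal sketch of that; the variable names are just for illustration. It passes `dtype=` when creating the series, and uses `astype` to convert an existing one. It also shows what happens when the values really can't share a numeric type: Pandas falls back to the catch-all `object` dtype.

```python
from pandas import Series

# Ask for a specific dtype at creation time
small = Series([10, 20, 30], dtype='int8')      # 8-bit integers
as_float = Series([10, 20, 30], dtype='float64')

# Convert an existing series to a different dtype (returns a new series)
widened = small.astype('int64')

# Values that can't share a numeric type get the catch-all "object" dtype
mixed = Series([10, 'abc', 30.5])

print(small.dtype, as_float.dtype, widened.dtype, mixed.dtype)
```

Note that `astype` doesn't change the original series; it gives back a new one.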

In [7]:
# how can I retrieve from a series?
# I can get values from a list with [], giving it either an integer or a slice. Can I do the same here?
# Yes (but)

s[0]

10.0

In [8]:
s[4]

50.0

In [9]:
s[-1]  # can I get the final value with a negative index?

KeyError: -1

# Don't use `[]` on their own with a series!

On a series, you can use plain ol' `[]` to retrieve values. However, it's a better idea to use `.loc` and `.iloc`. 

Why? What's wrong with `[]`? 

The answer is that when we start to work with data frames, `[]` on their own will be for the *columns*, not for the rows. That makes using `[]` with a series very confusing. I strongly suggest that you only use `.loc` and `.iloc` to retrieve (and set) values in your series, to avoid confusion.

In [10]:
s.loc[0]

10.0

In [11]:
s.loc[1]

30.0

In [12]:
s.loc[-1]  # still cannot do this

KeyError: -1

In [13]:
# can I do a slice?
s

0    10.0
1    30.0
2    30.5
3    40.0
4    50.0
dtype: float64

In [14]:
s.loc[1:3]   # in Python, a slice lets us grab values from the start up to and not including the end

1    30.0
2    30.5
3    40.0
dtype: float64

# `.loc` and slices

It turns out that `.loc` returns values in a slice, up to *and including* the end value.

In [15]:
# we can also use .iloc. Right now, the two seem to be almost identical.
# .loc goes based on the index that we actually have
# .iloc goes based on the position, starting with position 0, just like lists, strings, tuples, etc.

s.iloc[0]

10.0

In [16]:
s.iloc[1]

30.0

In [18]:
s.iloc[1:3]  # this is up to and *not* including!

1    30.0
2    30.5
dtype: float64
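
To put the two slice styles side by side, here is a small sketch using the same values as above. It compares a plain Python list slice, an `.iloc` slice, and a `.loc` slice, and it also shows that `.iloc` (unlike `.loc` with this index) accepts a negative position.

```python
from pandas import Series

values = [10, 30, 30.5, 40, 50]
s = Series(values)

list_slice = values[1:3]    # plain Python: positions 1 and 2
iloc_slice = s.iloc[1:3]    # positions 1 and 2, end not included
loc_slice = s.loc[1:3]      # labels 1, 2 and 3, end included

last = s.iloc[-1]           # positions can be negative, so this is the final value

print(len(list_slice), len(iloc_slice), len(loc_slice), last)
```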

In [19]:
s

0    10.0
1    30.0
2    30.5
3    40.0
4    50.0
dtype: float64

# What can we do with a series?

We can retrieve values from the series. We can also run methods:

- `min`
- `max`
- `mean`
- `median`
- `std` (standard deviation)
- `count` (how many values)
- `sum`

(There are others, too.)

In [20]:
s.mean()

32.1
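
The other methods in that list are all called the same way. A quick sketch, using the same values as our current series:

```python
from pandas import Series

s = Series([10, 30, 30.5, 40, 50])

# Each method returns a single value summarizing the whole series
lowest = s.min()
highest = s.max()
middle = s.median()
how_many = s.count()
total = s.sum()
spread = s.std()     # sample standard deviation

print(lowest, highest, middle, how_many, total, spread)
```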

# Exercise: High temperatures

1. Create a series containing the high (max) temperatures of where you live over the next 10 days.
2. What will be the mean temperature during that time?
3. What will be the median temperature? Are they the same? If not, why not?
4. Compare the mean temperature in the first 5 days of your series with the temps in the final 5 days of your series.

In [21]:
s = Series([29, 26, 23, 25, 26, 28, 34, 34, 34, 34])
s

0    29
1    26
2    23
3    25
4    26
5    28
6    34
7    34
8    34
9    34
dtype: int64

In [22]:
s.mean()

29.3

In [23]:
s.median()

28.5

In [24]:
s.describe()

count    10.000000
mean     29.300000
std       4.347413
min      23.000000
25%      26.000000
50%      28.500000
75%      34.000000
max      34.000000
dtype: float64

In [29]:
s.loc[:4].mean()

25.8

In [30]:
s.loc[5:].mean()

32.8
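
The two cells above used `.loc`, which works here because the index is still the default 0 through 9. As a sketch of a position-based alternative, which keeps working even if the index changes, we could use `.iloc` slices or the `head` and `tail` methods:

```python
from pandas import Series

s = Series([29, 26, 23, 25, 26, 28, 34, 34, 34, 34])

# Position-based slices: the first five and the final five values
first_mean = s.iloc[:5].mean()
final_mean = s.iloc[5:].mean()

# head(n) and tail(n) return the first / last n values
head_mean = s.head(5).mean()
tail_mean = s.tail(5).mean()

print(first_mean, final_mean)
```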

# Setting the index to custom values

The index on a series is, by default, a range of integers starting with 0. But we can use any values we want! One way to set the index is to take an existing series and assign to its `.index` attribute. If we assign a list or series that is the same length as the current one, the index will be replaced.

In [32]:
s = Series([10, 20, 30, 40, 50])

s.index = [2,4,6,8,10]
s

2     10
4     20
6     30
8     40
10    50
dtype: int64

In [33]:
s.loc[10]  # this will retrieve the value at index 10

50

In [35]:
s.iloc[3]

40

In [34]:
s.iloc[10]   # this looks for position 10 in the series, finds that it doesn't exist, and gives an error

IndexError: single positional indexer is out-of-bounds
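
An integer index like this one is where plain `[]` gets confusing, so here is a short sketch of the three ways of asking for "2" on the same series. With an integer index, plain `[]` goes by label, not by position.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=[2, 4, 6, 8, 10])

by_label = s.loc[2]       # the value whose index label is 2
by_position = s.iloc[2]   # the value at position 2 (the third value)
plain = s[2]              # with an integer index, [] looks up the label 2

print(by_label, by_position, plain)
```

Writing `.loc` or `.iloc` makes it clear to the reader (and to you) which of the two you meant.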

In [36]:
s.index = list('abcde')
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [37]:
s.loc['d']

40

In [38]:
s.iloc[2]

30

In [39]:
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [41]:
s.loc['b':'d']

b    20
c    30
d    40
dtype: int64

# Can we create a series, and then assign the index right away?

In [42]:
s = Series([10, 20, 30, 40, 50],
           index=list('abcde'))
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [43]:
# consider this:

s = Series([10, 20, 30, 40, 50],
           index=list('abcab'))   # now we have an index with repeating characters!
s

a    10
b    20
c    30
a    40
b    50
dtype: int64

In [44]:
# if I ask for .loc['c']
s.loc['c']

30

In [45]:
s.loc['a']

a    10
a    40
dtype: int64

# Some classic index values

- Days of the week
- Dates and times
- Usernames
- Regions
- Companies



In [46]:
# what does the index look like?
s.index

Index(['a', 'b', 'c', 'a', 'b'], dtype='object')

In [47]:
s = Series([10, 20, 30, 40, 50])
s.index

RangeIndex(start=0, stop=5, step=1)

In [48]:
# again, our repeated index

s = Series([10, 20, 30, 40, 50],
           index=list('abcab'))

s.loc['b':'b']

KeyError: "Cannot get left slice bound for non-unique label: 'b'"
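
If you do end up with a repeated index, here is a sketch of two ways around that error. This is one possible approach, not the only one: either select with a comparison against the index instead of a slice, or sort the index first, after which slicing on a repeated label works.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50],
           index=list('abcab'))

# We can ask the index whether its labels are all distinct
unique = s.index.is_unique

# Option 1: compare the index with the label we want, and pass the result to .loc
only_b = s.loc[s.index == 'b']

# Option 2: sort by the index; slices on a sorted index work even with repeats
sorted_s = s.sort_index()
sliced = sorted_s.loc['b':'b']

print(unique)
print(only_b)
print(sliced)
```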

# Exercise: Better temperatures

1. Recreate (or copy) your high-temp series, but make the index contain days of the week ('Sun', 'Mon', etc.)
2. Get the mean temperature for Fridays
3. Get the mean temperature for all Fridays and Mondays.

In [51]:
'Sun Mon Tue'.split()

['Sun', 'Mon', 'Tue']

In [52]:
list('abcde')  # this returns a new list based on the string 'abcde' -- returning a list whose elements are single characters

['a', 'b', 'c', 'd', 'e']

In [53]:
['Sun', 'Mon',  'Tue']

['Sun', 'Mon', 'Tue']

In [54]:
s = Series([29, 26, 23, 25, 26, 28, 34, 34, 34, 34],
          index='Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun'.split())
s


Fri    29
Sat    26
Sun    23
Mon    25
Tue    26
Wed    28
Thu    34
Fri    34
Sat    34
Sun    34
dtype: int64

In [56]:
s.loc['Fri'].mean()

31.5

# Fancy indexing

If I want to retrieve something with `.loc`, I have (so far) two options:

- Retrieve based on the index, passing one value
- Retrieve several values, based on a slice

We can put a variety of things inside of the square brackets that we pass to `.loc`. We can, among other things, pass a list of index values, and we'll get all of their values back.

In [57]:
# I'm passing a list to `[]`!

s.loc[   ['Fri', 'Mon']   ]

Fri    29
Fri    34
Mon    25
dtype: int64

In [58]:
s.loc[   ['Fri', 'Mon']   ].mean()

29.333333333333332

In [59]:
s.loc[   ['Mon', 'Fri']   ]

Mon    25
Fri    29
Fri    34
dtype: int64

In [60]:
s.loc[ ['Mon', 'Fri', 'Mon', 'Fri'] ]

Mon    25
Fri    29
Fri    34
Mon    25
Fri    29
Fri    34
dtype: int64
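
Fancy indexing isn't limited to `.loc`. As a sketch, here is the same idea with `.iloc`, passing a list of positions rather than a list of index values:

```python
from pandas import Series

s = Series([29, 26, 23, 25, 26, 28, 34, 34, 34, 34],
           index='Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun'.split())

# Positions 0, 3 and 7 are the first Friday, the Monday and the second Friday
picked = s.iloc[[0, 3, 7]]

print(picked)
print(picked.mean())
```

The values come back in the order of the list we passed, just as they did with `.loc`.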

# Per-index operations

We've seen that we can assign an index to our series, and that we can use the index to retrieve one or more items. The index also plays a major role if we try to perform operations on the values in the series.

In [61]:
# Python lists

list1 = [10, 20, 30]
list2 = [40, 50, 60]

list1 + list2

[10, 20, 30, 40, 50, 60]

In [62]:
# Pandas series

s1 = Series([10, 20, 30], index=list('abc'))
s2 = Series([100, 200, 300], index=list('abc'))

s1 + s2

a    110
b    220
c    330
dtype: int64

In [63]:
# What if the indexes don't match up in their order:

s1 = Series([10, 20, 30], index=list('abc'))
s2 = Series([100, 200, 300], index=list('cba'))

s1 + s2

a    310
b    220
c    130
dtype: int64

In [64]:
# What if they don't have completely overlapping indexes?

s1 = Series([10, 20, 30], index=list('abc'))
s2 = Series([100, 200, 300], index=list('bcd'))

s1 + s2

a      NaN
b    120.0
c    230.0
d      NaN
dtype: float64

In [65]:
# NaN == "not a number"
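
If you would rather not get `NaN` where only one series has a value, one option is the `add` method with its `fill_value` argument. In this sketch, a label that is missing from one of the two series is treated as if it held 0:

```python
from pandas import Series

s1 = Series([10, 20, 30], index=list('abc'))
s2 = Series([100, 200, 300], index=list('bcd'))

# Same as s1 + s2, except a label missing from one side counts as 0
total = s1.add(s2, fill_value=0)

print(total)
```

Whether 0 is a sensible stand-in depends on your data; sometimes `NaN` is the more honest answer.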

In [66]:
# What if an index repeats?

s1 = Series([10, 20, 30], index=list('abc'))
s2 = Series([100, 200, 300], index=list('abb'))

s1 + s2

a    110.0
b    220.0
b    320.0
c      NaN
dtype: float64

In [67]:
s2 + s1

a    110.0
b    220.0
b    320.0
c      NaN
dtype: float64

In [68]:
s1 * s2

a    1000.0
b    4000.0
b    6000.0
c       NaN
dtype: float64

In [69]:
s1 / s2

a    0.100000
b    0.100000
b    0.066667
c         NaN
dtype: float64

In [70]:
s1 ** s2

a    1.000000e+100
b    1.606938e+260
b              inf
c              NaN
dtype: float64

In [71]:
s1

a    10
b    20
c    30
dtype: int64

In [72]:
s2

a    100
b    200
b    300
dtype: int64

# Broadcasting

The idea is: If you have a series and a scalar (single) value, and you perform an operation on them together, then the scalar will be "broadcast" to all elements of the series. You'll get back a series as a result.



In [73]:
s = Series([10, 20, 30, 40, 50], 
           index=list('abcde'))

s * 3

a     30
b     60
c     90
d    120
e    150
dtype: int64

In [74]:
s ** 2

a     100
b     400
c     900
d    1600
e    2500
dtype: int64

In [75]:
s - 5

a     5
b    15
c    25
d    35
e    45
dtype: int64

# This is why we don't use `for` loops

If you want to perform an operation on every element of a series, *DO NOT* use a `for` loop. 

Pandas does what's known as "vectorization" of operations, so that you don't have to loop.
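
To make that concrete, here is a small sketch that computes the same result both ways. The loop version is there only for comparison; the vectorized version is shorter, and on large series it is typically much faster.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50],
           index=list('abcde'))

# The loop way: visit each value in Python, then rebuild a series (don't do this)
looped = Series([value * 3 for value in s], index=s.index)

# The Pandas way: the scalar 3 is broadcast to every element
vectorized = s * 3

print(looped.tolist() == vectorized.tolist())
```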

# Exercise: More temperatures

1. Define two series, each with 10 temperatures. The first will be what we used before, with high temps for the coming 10 days. The second will be for the low temps over the coming 10 days. They should have the same index (with day names).
2. Find the mean difference between highs and lows over the coming 10 days.
3. If you used Celsius for your temps, convert them to Fahrenheit and re-run your calculation. If you used Fahrenheit, then convert to Celsius and re-run.


    Fahrenheit = (Celsius * 1.8) + 32
    Celsius = (Fahrenheit - 32) / 1.8
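
Here is one possible sketch of this exercise, assuming Celsius. The highs are the ones we used earlier; the lows are made-up numbers, there only so that the code runs. Both formulas are broadcast over the whole series, with no loop.

```python
from pandas import Series

days = 'Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun'.split()

highs_c = Series([29, 26, 23, 25, 26, 28, 34, 34, 34, 34], index=days)
lows_c = Series([19, 17, 15, 16, 18, 19, 23, 24, 24, 23], index=days)  # made-up values

# Subtract one series from the other, then take the mean of the result
mean_diff_c = (highs_c - lows_c).mean()

# Convert both series to Fahrenheit, and re-run the calculation
highs_f = (highs_c * 1.8) + 32
lows_f = (lows_c * 1.8) + 32
mean_diff_f = (highs_f - lows_f).mean()

# Converting back should give us (almost exactly) the original values
highs_back = (highs_f - 32) / 1.8

print(mean_diff_c, mean_diff_f)
```

The `+ 32` cancels out when we subtract, so the mean difference in Fahrenheit is just 1.8 times the mean difference in Celsius.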
