A Series can be queried by index position or by index label.

If you do not give the Series an index when you create it, pandas assigns a default integer index, so the position and the label have the same values.

To query by numeric position, starting at zero, use the `iloc` attribute.

To query by index label, use the `loc` attribute.
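As a minimal sketch of the default-index case, the snippet below builds a Series without an index and reads the same entry through both attributes. The variable names (`subjects`, `by_position`, `by_label`) are made up for illustration.

```python
import pandas as pd

# No index is passed, so pandas assigns the default labels 0, 1, 2
subjects = pd.Series(['Physics', 'Chemistry', 'English'])

by_position = subjects.iloc[1]   # second item, counted by position
by_label = subjects.loc[1]       # item whose label is 1

print(by_position, by_label)
```

With the default index the two lookups land on the same entry. Once the index holds custom labels, they can differ.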

In [45]:
import pandas as pd

student_classes ={'Alice':'Physics',
                 'Jack':'Chemistry',
                 'Molly':'English',
                 'Sam': 'History'}

s = pd.Series(student_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [46]:
# to see the 4th entry, use iloc attribute with param 3
s.iloc[3]

'History'

In [47]:
# to see what class Molly has, use loc attribute with param Molly
s.loc['Molly']

'English'

### iloc and loc are attributes, not methods.
Use `[]`, the indexing operator, with them instead of parentheses.

In Python, this operator gets or sets an item depending on the context in which it is used.

This may be confusing if you are used to languages such as Java, where attributes, variables and properties are commonly encapsulated behind methods.
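To make the get-or-set behaviour concrete, here is a small sketch that spells out the special methods Python calls behind `[]`. Calling `__getitem__` directly is only for illustration and the variable names are hypothetical. In normal code you would always use the brackets.

```python
import pandas as pd

s = pd.Series({'Alice': 'Physics', 'Molly': 'English'})

# Reading: s.loc['Molly'] is shorthand for calling __getitem__ on s.loc
read_with_brackets = s.loc['Molly']
read_with_dunder = s.loc.__getitem__('Molly')

# Writing: an assignment through [] calls __setitem__ instead
s.loc['Molly'] = 'Art'
```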


In [48]:
# if you pass an integer and the index labels are not integers, the operator behaves as if you queried via iloc
# (recent pandas versions deprecate this positional fallback, so prefer s.iloc[3])
s[3]

'History'

In [49]:
# if you pass in a label such as a string, it queries as if you had used the label-based loc attribute
s['Molly']

'English'

If your index is a list of integers, this gets complicated, because pandas cannot determine whether you want to query by index position or by index label.

The safer option is to be explicit and use the `iloc` or `loc` attribute directly.

In [50]:
class_code ={99: 'Physics',
            100: 'Chemistry',
            101: 'English',
            102: 'History'}

s = pd.Series(class_code)

In [51]:
# s[0] raises a KeyError because no item in the Series has an index label of 0;
# to get the first item we have to call iloc explicitly
#s[0]

In [52]:
# s[0] does not fall back to s.iloc[0]; it raises an error instead
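The explicit calls might look like the sketch below, which rebuilds the integer-labelled Series under the hypothetical name `class_code` and catches the error so the snippet runs to the end.

```python
import pandas as pd

class_code = pd.Series({99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'})

first_by_position = class_code.iloc[0]   # position 0
first_by_label = class_code.loc[99]      # label 99

# With integer labels, plain [] looks up labels, so 0 is not found
try:
    class_code[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, first_by_label, lookup_failed)
```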

A common task is to take the values inside a Series and summarize or transform them.

A programmatic approach is to iterate over all the items in the Series and perform the operation we are interested in.

For example, calculating the average grade.

This works, but it is slow.

Pandas and NumPy support a method of computation called vectorization.

Vectorization works with most of the functions in the NumPy library, including the `sum` function.


In [53]:
grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total += grade
print(total/len(grades))

75.0


In [54]:
import numpy as np

# use np.sum and pass in the iterable item, our Series
total = np.sum(grades)
print(total/len(grades))

75.0
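As a side note, a Series also has its own vectorized methods, so the same average can be computed without calling NumPy directly. This is a small sketch with illustrative variable names, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

average_from_sum = np.sum(grades) / len(grades)
average_from_method = grades.mean()   # Series method, also vectorized

print(average_from_sum, average_from_method)
```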


In [55]:
# example with big series of random numbers

# generate 10000 random numbers to create a new series.
numbers = pd.Series(np.random.randint(0,1000,10000))

# look at top 5 items
numbers.head()

0    741
1    661
2    416
3    266
4    387
dtype: int32

In [56]:
# verify the length of series
len(numbers)

10000

In [57]:
# use a cell magic function
# the one we are using is called timeit; it runs our code several times to determine, on average, how long it takes
# the -n option sets the number of loops per run; if it is omitted, timeit picks a suitable number automatically
# here we ask timeit for 100 loops
# a cell magic has to be on the first line of the cell

In [58]:
%%timeit -n 100
total = 0
for number in numbers:
    total+=number
    
total/len(numbers)

1.45 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Not bad. Now let's time the vectorized method.


In [59]:
%%timeit -n 100
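# note: this cell sums `grades` (4 items), not the 10,000-item `numbers` timed above,
# so the two timings are not a like-for-like comparison; use np.sum(numbers) to compare fairly
total = np.sum(grades)
total/len(grades)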

52.9 µs ± 4.96 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


As you can see, the vectorized method is much, much faster!

You should be aware of parallel computing features and start thinking in functional programming terms.

Vectorization means applying one operation to many data elements at once, and with high-performance chips, especially graphics cards, you can get dramatic speedups.

Modern graphics cards can run thousands of operations in parallel.
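The `%%timeit` magic only exists inside IPython and Jupyter. Outside a notebook, a rough equivalent is the standard library `timeit` module. The sketch below times both approaches on the same 10,000-item Series, which also gives a like-for-like comparison (the vectorized cell above sums `grades`, not `numbers`). The helper names `loop_mean` and `vectorized_mean` are made up for this example, and the timings depend on your machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean():
    # iterative approach: add the items one by one
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)

def vectorized_mean():
    # vectorized approach: let NumPy sum the whole Series
    return np.sum(numbers) / len(numbers)

# total seconds for 100 calls of each function; values vary by machine
loop_seconds = timeit.timeit(loop_mean, number=100)
vectorized_seconds = timeit.timeit(vectorized_mean, number=100)

print(f"loop: {loop_seconds:.4f} s, vectorized: {vectorized_seconds:.4f} s")
```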


A related feature in pandas and NumPy is called broadcasting.

You can apply an operation to every value in the Series, changing the Series.

For example, if we wanted to increase every random number by 2, we could do so quickly using the `+=` operator directly on the Series object.


In [60]:
numbers.head()

0    741
1    661
2    416
3    266
4    387
dtype: int32

In [61]:
# increase every value in Series by 2
numbers += 2
numbers.head()

0    743
1    663
2    418
3    268
4    389
dtype: int32
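Broadcasting is not limited to `+=`. As a small sketch on a hypothetical three-item Series, other operators apply a scalar to every element in the same way, and the non-assigning forms return a new Series instead of changing the original.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

plus_two = s + 2      # new Series; s itself is not modified
doubled = s * 2       # the scalar is applied to every element
is_large = s > 1      # comparisons broadcast too, giving booleans

print(plus_two.tolist(), doubled.tolist(), is_large.tolist())
```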

The procedural way of doing this would be to iterate through all of the items in the Series and increase the values directly.

Pandas supports iterating through a Series much like a dict, allowing you to unpack values easily.

Use the `items()` method, which yields a label and a value for each entry. (Older pandas versions also offered `iteritems()`, which was deprecated and then removed in pandas 2.0.)


In [62]:
type(numbers)

pandas.core.series.Series

In [69]:
# items() replaces iteritems(), which was removed in pandas 2.0
for label, value in numbers.items():
    # set_value() was deprecated and later removed from pandas; use the .at accessor instead
    numbers.at[label] = value+2

numbers.head()

0    755
1    675
2    430
3    280
4    401
dtype: int32
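To check that the loop and the broadcast really do the same thing, the sketch below runs both on a small, hypothetical Series and compares the results. Working on a copy keeps the original values intact, so re-running the code does not keep adding 2.

```python
import pandas as pd

original = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# iterative version: visit each (label, value) pair and write back with .at
looped = original.copy()
for label, value in original.items():
    looped.at[label] = value + 2

# broadcast version: one expression over the whole Series
broadcast = original + 2

print(looped.equals(broadcast))
```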

In [70]:
# look at speed comparisons
# 10 loops using the iterative approach

In [72]:
%%timeit -n 10
# create a fresh Series of random numbers to work with
s = pd.Series(np.random.randint(0,1000, 1000))

# items() replaces iteritems(), which was removed in pandas 2.0
for label, value in s.items():
    s.loc[label] = value+2

43.8 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [73]:
# now we try the broadcasting method

In [74]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000, 1000))
# broadcast with +=
s += 2

368 µs ± 56.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [75]:
# The broadcasting method is much faster than the iterative approach!
# It is also more concise and easier to read.


The `.loc` attribute lets you not only modify data in place, but also add new data.

If the label you pass in as the index does not exist, a new entry is added.

Indices can have mixed types.

It is important to be aware of the typing: pandas will automatically change the underlying NumPy types as appropriate.

In [76]:
# e.g. using a Series with a few numbers
s = pd.Series([1,2,3])

# Add some new value, maybe a uni course
s.loc['History'] = 102

s

0            1
1            2
2            3
History    102
dtype: int64

Since 'History' is not in the original list of indices, pandas creates a new element in the Series with index 'History' and a value of 102.
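The type changes mentioned above can be inspected directly. The sketch below repeats the 'History' example, then adds a float under a hypothetical 'Math' label to show the values being converted, and finally shows that `iloc`, being position based, refuses to add an entry. The exact dtype names may differ between pandas versions, so the code only stores them for inspection.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s.loc['History'] = 102        # new label, so the Series grows by one entry
index_dtype = s.index.dtype   # the index now mixes integers and a string

s.loc['Math'] = 1.5           # a float cannot be stored in integer data
values_dtype = s.dtype        # so the values are converted to a float type

# iloc is position based and cannot add entries
try:
    s.iloc[10] = 0
    iloc_enlarged = True
except IndexError:
    iloc_enlarged = False

print(index_dtype, values_dtype, iloc_enlarged)
```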

### E.g. Non-unique index values

In [77]:
student_classes ={'Alice':'Physics',
                 'Jack':'Chemistry',
                 'Molly':'English',
                 'Sam': 'History'}

student_classes

{'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English', 'Sam': 'History'}

In [79]:
# create a Series for a new student, Kelly, listing all the courses she has taken: index = 'Kelly', data = course names
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [81]:
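# TODO: DOES NOT WORK.
# student_classes is a plain dict here, not a Series, so it has no Series methods;
# convert it with pd.Series(student_classes) first. Note also that Series.append()
# was removed in pandas 2.0, where pd.concat() is the replacement.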
all_student_classes = student_classes.append(kelly_classes)

all_student_classes

AttributeError: 'dict' object has no attribute 'append'
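One way to get the cell above working, sketched under the assumption that a recent pandas version is installed, is to turn the dict into a Series first and then combine the two with `pd.concat`. The name `kelly_rows` is made up for the example.

```python
import pandas as pd

student_classes = pd.Series({'Alice': 'Physics',
                             'Jack': 'Chemistry',
                             'Molly': 'English',
                             'Sam': 'History'})
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'],
                          index=['Kelly', 'Kelly', 'Kelly'])

# concat returns a new Series; the two inputs are left unchanged
all_student_classes = pd.concat([student_classes, kelly_classes])

# a non-unique label returns every matching entry as a Series
kelly_rows = all_student_classes.loc['Kelly']

print(all_student_classes)
print(kelly_rows)
```

Because the label 'Kelly' appears three times, `loc['Kelly']` gives back a Series of three courses, while a unique label such as 'Alice' gives back a single value.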