In this lecture, we'll talk about one of the primary data types of the Pandas library, the Series. You'll learn about the structure of the Series, how to query and merge Series objects together, and the importance of thinking about parallelization when engaging in data science programming.


In [1]:
# A pandas Series can be queried either by the index position or the index label. If you don't give an index to the series when
# creating it, the position and the label are effectively the same values. To query by numeric location, starting at 0, use the
# "iloc" attribute. To query by the index label, you can use the "loc" attribute.

# So let's start with an example. We'll use students enrolled in classes coming from a dictionary:
import pandas as pd
students_classes = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [2]:
# So, for this series, if we wanted to see the 4th entry, we could use the iloc attribute with the parameter 3:
s.iloc[3]

'History'

In [3]:
# If we wanted to see what class Molly has, we would use the loc attribute with a parameter of 'Molly':
s.loc['Molly']

'English'
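
When no index is supplied at creation, the positions and the labels coincide, so iloc and loc return the same entry. The sketch below illustrates this with a small series invented for the purpose (the name `courses` is not part of the lecture).

```python
import pandas as pd

# No index is supplied, so pandas assigns the default labels 0, 1, 2, 3.
courses = pd.Series(['Physics', 'Chemistry', 'English', 'History'])

by_position = courses.iloc[2]  # third entry, counting from 0
by_label = courses.loc[2]      # entry whose label is 2

print(by_position, by_label)
print(list(courses.index))
```

With the default index the two lookups agree. Once the index holds other labels, as with the student names above, they no longer do.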

In [4]:
# So keep in mind that iloc and loc are not methods, they are attributes, so you don't use parentheses to query them, but square
# brackets instead, and this is called the indexing operator. In Python this calls the object's __getitem__ or __setitem__
# method, depending on whether you are reading or assigning an item.
# This might seem a bit confusing if you're used to languages where encapsulation of attributes, variables, and properties is
# common, such as in Java.

# Pandas tries to make our code a bit more readable and provides a sort of smart syntax using the indexing operator directly on
# the series itself. For instance, if you pass in an integer parameter, the operator will behave as if you want to query via the
# iloc attribute. Recent pandas releases deprecate this positional fallback when the index holds labels rather than integers,
# so s.iloc[3] is the more dependable spelling:
s[3]

'History'

In [5]:
# If you pass in a label, such as a string, it will query as if you had used the loc attribute:
s['Molly']

'English'
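
To see why an attribute can be followed by square brackets, it may help to look at how Python routes the indexing operator. The classes below are hypothetical stand-ins written only for illustration. They are not how pandas implements iloc, but they rely on the same `__getitem__` and `__setitem__` hooks.

```python
class PositionIndexer:
    """Hypothetical stand-in for an indexer attribute such as iloc."""

    def __init__(self, values):
        self._values = values

    def __getitem__(self, position):
        # Called for reads such as indexer[0]
        return self._values[position]

    def __setitem__(self, position, value):
        # Called for assignments such as indexer[1] = 'Biology'
        self._values[position] = value


class TinySeries:
    """Hypothetical container that exposes its indexer as an attribute."""

    def __init__(self, values):
        self._values = list(values)
        self.iloc = PositionIndexer(self._values)


courses = TinySeries(['Physics', 'Chemistry', 'English'])
first = courses.iloc[0]        # routed to __getitem__
courses.iloc[1] = 'Biology'    # routed to __setitem__
second = courses.iloc[1]
print(first, second)
```

The attribute `iloc` is an ordinary object, and the square brackets are applied to that object. That is why no parentheses are involved.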

In [6]:
# So what happens if your index is actually a list of integers? This is a bit complicated, because Pandas can't determine
# automatically whether you're intending to query by index position or index label. So you need to be careful when you're using
# the indexing operator on the Series itself. The safer option is to be more explicit and to use the iloc and loc attributes
# directly.

# Here's an example using classes and their class code information, where classes are indexed by class codes in the form of
# integers.
# So let's create a new dictionary, class_code, in which 99 maps to Physics, 100 to Chemistry, 101 to English, and 102 to
# History, and then create a new series from it.
class_code = {99: 'Physics',
             100: 'Chemistry',
             101: 'English',
             102: 'History'}
s = pd.Series(class_code)

In [8]:
# If we try to call s[0], we get a KeyError because there's no item in the series with an index label of 0. Instead, we have to
# call iloc explicitly if we want the first item.
s[0]


KeyError: 0

In [9]:
# So, that didn't call s.iloc[0] underneath as one might expect. With an integer index, s[0] is treated as a label lookup, and
# because there is no label 0, it generated this error.
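
Being explicit removes the ambiguity. The following sketch rebuilds the class code series under a different name (`codes`, chosen here so it does not overwrite `s`) and queries it both ways, catching the KeyError that a missing label produces.

```python
import pandas as pd

class_code = {99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'}
codes = pd.Series(class_code)

first_by_position = codes.iloc[0]   # position 0
first_by_label = codes.loc[99]      # label 99

try:
    codes.loc[0]                    # there is no label 0
    missing_label = False
except KeyError:
    missing_label = True

print(first_by_position, first_by_label, missing_label)
```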

In [10]:
# Now we know how to get data out of this series, let's talk about working with the data. A common task is to want to consider
# all of the values inside of a series and do some sort of operation. This could be trying to find a certain number, or
# summarizing the data or transforming the data in some kind of way.

# A typical programmatic approach to this would be to iterate over all of the items in the series, and invoke the operation one
# is interested in. For instance, we could create a Series of integers representing student grades, and just try and get the
# average grade:
grades = pd.Series([90,80,70,60])

total = 0
for grade in grades:
    total+=grade
print(total/len(grades))

75.0


In [11]:
# So just a very simple averaging function. This works, but it's slow. Modern computers can do many tasks simultaneously,
# especially, but not only, tasks involving mathematics.

# Pandas and the underlying NumPy support a method of computation called vectorization. Vectorization works with most of the
# functions in the NumPy library, including the sum function.

# So here's how we would really write the code using the NumPy sum function. First we need to import the NumPy module:
import numpy as np

# Then we'll just call np.sum and pass in an iterable item. In this case, our pandas series. 
total = np.sum(grades)
print(total/len(grades))

75.0
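
The Series object also exposes vectorized reductions as methods of its own, which avoids the manual division. A brief sketch follows. The variable names are illustrative, and `mean()` and `sum()` are standard Series methods.

```python
import numpy as np
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

mean_via_numpy = np.sum(grades) / len(grades)  # the approach shown above
mean_via_method = grades.mean()                # Series method, same result
total_via_method = grades.sum()

print(mean_via_numpy, mean_via_method, total_via_method)
```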


In [12]:
# Now both the loop and np.sum produce the same value, but is one actually faster? The Jupyter Notebook has a magic function
# which can help.

# First let's create a big series of random numbers, and this is actually used a lot when demonstrating techniques with pandas,
# so you should get used to seeing this:
numbers = pd.Series(np.random.randint(0,1000,10000))

# Now let's look at the first five items in this series to make sure they actually seem random, and we could do this with the
# head() method:
numbers.head()

0    509
1     39
2     99
3    786
4    456
dtype: int32

In [13]:
# We can actually verify the length of the series is correct using the len function
len(numbers)

10000

In [14]:
# Okay, so now we're confident that we have a big series. The ipython interpreter has something called magic functions that
# begin with a percentage sign. If we type this sign and hit the tab key, you can see a list of the available magic functions.
# You could write your own magic functions too, but that's a little bit outside of the scope of this course.

In [15]:
# So here we're actually going to use a cell magic function. These start with two percentage signs, and wrap the code in the
# current Jupyter cell. The function we're going to use is called "timeit". This function will run our code a few times to
# determine on average how long it takes.

# So let's run timeit with our original iterative code. You can give timeit the number of loops that you would like to run with
# the -n flag. By default, timeit chooses the number of loops itself. I'll ask timeit here to use 100 loops because we're
# recording this. Note that in order to use a cell magic function, it has to be the first line of the cell.

In [16]:
%%timeit -n 100
total = 0
for number in numbers:
    total += number
total/len(numbers)

3.36 ms ± 72.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [17]:
# All right, not bad. timeit ran the code, and it doesn't seem to take very long at all. Now let's try with vectorization:

In [18]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

194 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
# Wow, this is a pretty shocking difference in speed, and it demonstrates why one should be aware of parallel computing features
# and start thinking in functional programming terms. Put more simply, vectorization is the ability for a computer to apply an
# operation to many values at once. With high performance chips, especially graphics cards, you can get dramatic speedups. Modern
# graphics cards can run thousands of instructions in parallel.
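
Outside of a notebook, where the %%timeit magic is not available, a similar comparison can be sketched with the standard library `timeit` module. The helper names `loop_mean` and `vectorized_mean` are made up for this example, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))


def loop_mean(series):
    # Iterative approach: visit every value in Python
    total = 0
    for number in series:
        total += number
    return total / len(series)


def vectorized_mean(series):
    # Vectorized approach: let NumPy do the summation
    return np.sum(series) / len(series)


# number=100 mirrors the "-n 100" flag; timeit.timeit returns total seconds
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=100)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=100)

print(f"loop:       {loop_seconds / 100 * 1e6:.1f} µs per run")
print(f"vectorized: {vector_seconds / 100 * 1e6:.1f} µs per run")
```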

In [20]:
# A related feature in pandas and NumPy is called "broadcasting". With broadcasting, we can apply an operation to every value
# in the series, changing the series. For instance, if we wanted to increase every random number by 2, we could do so quickly
# using the += operator directly on the series object.

# Let's look at the head of our series:
numbers.head()

0    509
1     39
2     99
3    786
4    456
dtype: int32

In [21]:
# And now let's just increase everything in the series by 2. Here we're applying the += operator directly to the series
# object, not to a single value. Then let's look at the head.
numbers+=2
numbers.head()

0    511
1     41
2    101
3    788
4    458
dtype: int32
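
Broadcasting is not limited to +=. As a rough illustration, the sketch below applies a few other operators to a small series holding the first five values shown above. Unlike +=, these expressions return new Series objects and leave the original untouched.

```python
import pandas as pd

numbers = pd.Series([509, 39, 99, 786, 456])

plus_two = numbers + 2      # new Series; numbers itself is unchanged
doubled = numbers * 2       # other arithmetic operators broadcast too
over_100 = numbers > 100    # comparisons yield a boolean Series

print(plus_two.tolist())
print(doubled.tolist())
print(over_100.tolist())
```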

In [22]:
# The procedural way of adding 2 to every value would be to iterate through all of the items in the series and increase the values directly. Pandas does support iterating through the series much like a dictionary, allowing you to unpack values easily.

# We can use the items() method in particular, which returns a label and value for each entry. (Older versions of pandas also offered iteritems(), which was removed in pandas 2.0.)

for label, value in numbers.items():
    # For each item which is returned, use the .at indexer to set the value at that label, incremented by 2
    numbers.at[label] = value + 2
# And then let's check the result of this for loop by looking at the head.
numbers.head()
numbers.head()

0    513
1     43
2    103
3    790
4    460
dtype: int32

In [24]:
# So the result is the same, though you may notice a warning depending on the version of Pandas being used. But if you find yourself iterating pretty much anytime in Pandas, you should question whether you're doing things in the best possible way.

# Let's take a look at some speed comparisons. First, let's try ten loops using the iterative approach:


In [31]:
%%timeit -n 10
# We'll create a fresh series of items to deal with. It's always good when timing to create a new series
s = pd.Series(np.random.randint(0,1000,1000))
# And we'll just rewrite our loop from above:
for label, value in s.items():
    s.loc[label] = value+2

109 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
# Now let's try using that broadcasting method:

In [32]:
%%timeit -n 10
# We need to recreate the series:
s = pd.Series(np.random.randint(0,1000,1000))
# And we just broadcast with +=
s+=2

374 µs ± 134 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [33]:
# Amazing. Not only is it significantly faster, but it's more concise, and even easier to read too. The typical mathematical operations that you would expect are vectorized, and the NumPy documentation outlines what it would take to create vectorized functions of your own.

# One last note on using the indexing operators to access series data. The .loc attribute lets you not only modify data in place, but also add new data as well. If the label you pass in doesn't exist in the index, then a new entry is created. And keep in mind indices can have mixed types. While it's important to be aware of the typing going on underneath, Pandas will automatically change the underlying NumPy types as appropriate.

# Here's an example using a series of a few numbers:
s = pd.Series([1,2,3])
# We could add some new value, maybe a university course:
s.loc['History'] = 102
s

0            1
1            2
2            3
History    102
dtype: int64
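
The automatic type change can be observed by inspecting `dtype` before and after each assignment. This is a small sketch. The 'English' entry and its 'n/a' value are invented for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
dtype_before = s.dtype

s.loc['History'] = 102      # new label, integer value
dtype_after_int = s.dtype

s.loc['English'] = 'n/a'    # new label, string value
dtype_after_str = s.dtype

print(dtype_before, dtype_after_int, dtype_after_str)
print(s.index.tolist())
```

Adding another integer keeps a numeric dtype, while adding a string makes pandas fall back to the general-purpose object dtype for the values.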

In [40]:
# We see that mixed types for data values or index labels are no problem for Pandas. Since "History" is not in the original list of indices, s.loc['History'] essentially creates a new element in the series, with the index name of 'History', and the value of 102.

# Up until now, I've shown only examples of a series where the index values were unique. I want to end this lecture by showing an example where index values are not unique. And this makes the Pandas Series a little different conceptually than, for instance, a relational database.

# Let's create a series with students and courses which they have taken:
students_classes = pd.Series({'Alice': 'Physics',
                            'Jack': 'Chemistry',
                            'Molly': 'English',
                            'Sam': 'History'})
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [35]:
# Now let's create a series just for some new student, Kelly, which lists all of the courses that she's taken. We'll set the index to Kelly, and the data to be the names of the courses:
kelly_classes = pd.Series(['Philosophy','Arts','Math'], index=['Kelly','Kelly','Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [41]:
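# Finally, we can append all of the data in this new Series to the first. Older versions of pandas did this with the .append() method, which was removed in pandas 2.0, so here we use pd.concat() instead:
all_students_classes = pd.concat([students_classes, kelly_classes])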

# This creates a series which has our original people in it as well as all of Kelly's courses:
all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [42]:
# There are a couple of important considerations when appending series. First, Pandas will take the series and try to infer the best data types to use. In this example, everything is a string, so there are no problems here. Second, appending doesn't actually change the underlying Series objects. It instead returns a new series which is made up of the two appended together. This is a common pattern in Pandas, and one that you should come to expect: by default, an operation returns a new object instead of modifying one in place. By printing the original series, we can see that that series hasn't changed:
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [43]:
# Finally, we can see that when we query the appended series for Kelly, we don't get a single value, but a series itself:
all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
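
Whether a label lookup returns a single value or a Series depends on whether that label is repeated. The sketch below checks this on a shortened version of the data. It combines the series with `pd.concat`, an assumption made here because `Series.append` is not available in pandas 2.0 and later.

```python
import pandas as pd

students_classes = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry'})
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'],
                          index=['Kelly', 'Kelly', 'Kelly'])
all_students_classes = pd.concat([students_classes, kelly_classes])

unique_labels = all_students_classes.index.is_unique
alice = all_students_classes.loc['Alice']   # unique label: a single value
kelly = all_students_classes.loc['Kelly']   # repeated label: a Series

print(unique_labels)
print(type(alice).__name__, type(kelly).__name__)
print(kelly.tolist())
```

Checking `index.is_unique` before querying is one way to know in advance which kind of result to expect.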

In this lecture, we focused on one of the primary data types of the Pandas library, the Series. You learned how to query the Series with loc and iloc, that the Series is an indexed data structure, how to merge two Series objects together by appending, and the importance of vectorization.

There are many more methods associated with the Series object that we haven't talked about, but with these basics down we'll move on to talking about pandas' two-dimensional data structure, the DataFrame. The DataFrame is very similar to the Series object, but includes multiple columns of data, and is the structure you'll spend the majority of your time working on when cleaning and aggregating data.