# Introduction to Data Science – Lecture 5: Tuples, Sets, Dictionaries, Classes/Objects, Pandas Series
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/*

In this lecture we introduce more data structures: tuples, sets, dictionaries, and series. While tuples, sets, and dictionaries are built-in Python data structures, series (and dataframes, which we'll cover in the next lecture) are part of the [pandas library](http://pandas.pydata.org/) tailored to data science applications. We will also briefly introduce objects and object-oriented programming. 

## Announcements / Reminders: Submitting Homework 

#### Homework 2 Due Friday

Please:
 * Submit within the given Jupyter Notebook (not as a `.py` or other file). The file you submit should be an altered `.ipynb`. If there are multiple files needed (e.g., images, data), they must be included in the zip.
 * Make sure that you have filled in the cells with your work.  
 * Use the assignment from this year, **2025**, from [https://github.com/datascience-course/2025-datascience-homework](https://github.com/datascience-course/2025-datascience-homework) 
 
We do not accept submissions that are not `.ipynb` or from an alternative year.

## 1. Tuples

[Tuples](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences) are a list-like data structure that is, in contrast to lists, **immutable**. 

Tuples are typically used to store objects of different types. Remember that lists should only contain **homogeneous data**, and NumPy arrays even enforce that; tuples are designed for the **heterogeneous case**. 

Tuples also have practical implications for performance and for hash-based data structures (sets and dictionaries), which we will discuss later. 

Here is how we can initialize a tuple: 

In [1]:
person = "Alex", 1981, "Computer Science"
person

('Alex', 1981, 'Computer Science')

Initialization with parentheses is preferred, since it's more explicit:

In [2]:
person = ("Alex", 1981, "Austria")
person

('Alex', 1981, 'Austria')

We can access the elements by index, just like in a list: 

In [3]:
person[0]

'Alex'

We cannot, however, change values. This throws a **TypeError**.

In [4]:
# throws TypeError
# person[1] = 1985

# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# Cell In[4], line 2
#       1 # throws TypeError
# ----> 2 person[1] = 1985

# TypeError: 'tuple' object does not support item assignment

Arbitrary objects can be part of a tuple:

In [5]:
train_schedule = ("Train 1", [7,11])
print(train_schedule[1])
# this works because we're modifying the mutable list within the immutable tuple.
train_schedule[1][0] = 15
print(train_schedule[1])

[7, 11]
[15, 11]


Of course, that includes tuples:

In [6]:
train_schedule = ("Train 1", (7,11))
# this doesn't work
# train_schedule[1][0] = 15

# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# Cell In[8], line 3
#       1 train_schedule = ("Train 1", (7,11))
#       2 # this doesn't work
# ----> 3 train_schedule[1][0] = 15
#       4 train_schedule

# TypeError: 'tuple' object does not support item assignment


# train_schedule

Tuples also allow us to create functions with **multiple return values**.

Consider the following code:

In [7]:
def multiply(a, b, c):
    return (a*b), (a*c), (b*c), (a*b*c)

Here, it looks like we return multiple values – something that's not possible in most programming languages! But it's very convenient. In practice, we "only" return a tuple. 

Let's try it out:

In [8]:
multiply(3, 7, 11)

(21, 33, 77, 231)

The round brackets in the returned values indicate what's going on: what is returned is, in fact, a tuple!

We can use this return value to assign multiple variables at the same time:

In [9]:
ab, ac, bc, abc = multiply(3, 7, 11)
my_tuple = multiply(3, 7, 11)
print(my_tuple)
print(type(my_tuple))
print(ab)
ab = 19 
print(ab)

(21, 33, 77, 231)
<class 'tuple'>
21
19


To do this, no function is necessary. We can just do the following:

In [10]:
what, i_s, going, on = "this", "is", "really", "nice" # use () to be more explicit
print(what, i_s, going, on)
print(type(what))

this is really nice
<class 'str'>
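
The same packing and unpacking mechanism can be used in a few other handy ways. The following is a small sketch (the variable names are made up for illustration) that swaps two variables without a temporary variable and uses a starred name to collect the remaining elements of a tuple:

```python
# Swap two variables: the right-hand side is packed into a tuple first,
# then unpacked into the names on the left.
first, second = "left", "right"
first, second = second, first
print(first, second)

# A starred name collects all remaining elements into a list.
name, *details = ("Alex", 1981, "Austria")
print(name)
print(details)
```

Note that the number of names on the left has to match the number of elements in the tuple, unless one of the names is starred.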


## 2. Sets

A [set](https://docs.python.org/3/tutorial/datastructures.html#sets) is a mutable collection, similar to a list. However, it is
 * **not ordered**, and
 * **cannot contain the same element twice**

Here is an example:

In [11]:
# Initialize a set with {}
beatles = {"John", "Paul", "Ringo", "George"}
beatles

{'George', 'John', 'Paul', 'Ringo'}

Notice that on my machine, the **order** of the output is **different** from the input: 
`{'George', 'John', 'Paul', 'Ringo'}`

We can also initialize a set from a list or a tuple:

In [12]:
usernames = set(["Jimmy", "Robert", "John", "John"])
# You don't get an error when you try to put a duplicate in a set
# it just doesn't get added a second time.
usernames

{'Jimmy', 'John', 'Robert'}

We've initialized the set `usernames` with a list of names. We have chosen a set because we don't want to have duplicate user names. 

However, **in the second example, the list included a duplicate: John was specified twice**. We can see, however, that **John is contained in the set only once**.

Sets are great for various tasks. For example, they can be used to remove duplicate entries from lists. Most importantly, they let you very efficiently check whether an element already exists. 

A set works based on a mathematical function that produces a "hash code". This hash code is then used as an index to an array. For example, "Jimmy" could hash to the value 13, and accordingly, Jimmy would be put at the 13th index of an array. When we want to test whether "Jimmy" is already in a set, we simply compute the hash, which will again produce 13, and then look up whether something is stored at index 13. 
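
To make the idea more concrete, here is a deliberately simplified sketch of such a structure. It is not how Python's `set` is actually implemented; the class `ToyHashSet` and the bucket count of 16 are made up for illustration. It uses the built-in `hash()` function and stores a small list per array slot, so that two values that happen to land at the same index don't overwrite each other:

```python
class ToyHashSet:
    """A toy hash set for illustration only (hypothetical, not Python's real set)."""

    def __init__(self, n_buckets=16):
        # one (initially empty) list per array slot
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, value):
        # map the hash code to a valid array index
        return hash(value) % len(self.buckets)

    def add(self, value):
        bucket = self.buckets[self._index(value)]
        if value not in bucket:   # ignore duplicates
            bucket.append(value)

    def __contains__(self, value):
        # only the one bucket the value hashes to has to be searched
        return value in self.buckets[self._index(value)]


toy = ToyHashSet()
toy.add("Jimmy")
toy.add("Jimmy")    # duplicate, not stored a second time
toy.add("Robert")

print("Jimmy" in toy)
print("Ringo" in toy)
```

The key point is that a lookup only inspects a single bucket, no matter how many elements are stored overall.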

We can check whether a set contains a value using the `in` keyword:

In [13]:
"Jimmy" in usernames

True

In [14]:
"Ringo" in usernames

False

Note that `in` also works with lists, but for large collections a membership test on a list is considerably slower than on a set, because the list has to be searched element by element. 

In [15]:
username_list = ["Jimmy", "Robert", "John", "John"]
"John" in username_list

# This takes longer than checking in a set

True
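
If you want to see the difference yourself, you can measure it. The following sketch uses the standard library's `timeit` module; the collection size and the number of repetitions are arbitrary choices, and the exact timings will differ from machine to machine:

```python
import timeit

n = 100_000
big_list = list(range(n))
big_set = set(big_list)
missing = -1  # not contained, so the list has to be scanned completely

# run each membership test 100 times and measure the total time
list_time = timeit.timeit(lambda: missing in big_list, number=100)
set_time = timeit.timeit(lambda: missing in big_set, number=100)

print(f"list: {list_time:.5f} s, set: {set_time:.5f} s")
```

For a handful of elements the difference doesn't matter; it becomes relevant for large collections or when the test is repeated many times.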

We can add values using the add function on a set:

In [16]:
usernames.add("JohnB")
usernames

{'Jimmy', 'John', 'JohnB', 'Robert'}

And remove elements with the remove function: 

In [17]:
usernames.remove("John")
usernames

{'Jimmy', 'JohnB', 'Robert'}

If the set doesn't contain a key we want to remove, it will throw a `KeyError`.

In [18]:
# usernames.remove("Joseph")

# ---------------------------------------------------------------------------
# KeyError                                  Traceback (most recent call last)
# Cell In[18], line 1
# ----> 1 usernames.remove("Joseph")

# KeyError: 'Joseph'

To prevent that, it is advisable to first check whether a set actually contains a value, if you're not 100% sure: 

In [19]:
if ("JohnB" in usernames):
    usernames.remove("JohnB")

usernames    

# This will never throw an error

{'Jimmy', 'Robert'}
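
An alternative to checking first is the set's `discard` method, which removes an element if it is present and does nothing otherwise. A minimal sketch, using a separate example set called `names`:

```python
names = {"Jimmy", "Robert"}

names.discard("Joseph")   # not in the set: nothing happens, no KeyError
names.discard("Jimmy")    # in the set: removed

print(names)
```

Use `remove` when a missing element indicates a bug you want to hear about, and `discard` when a missing element is fine.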

We can iterate over the values of a set. Remember that no guarantee about the order of the items in the set is made. 

In [20]:
for name in usernames:
    print (name)

Robert
Jimmy


Make sure to check out the [documentation](https://docs.python.org/3.5/library/stdtypes.html#set) to see what else a set can do. 

### Try it!
You're now ready to try the first problem in today's activities.

## 3. Dictionaries

[Dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) are related to sets, but are more powerful: in addition to the **key** used to identify an element in a set, dictionaries also store a **value** associated with a key. Dictionaries store **key-value pairs**. Other terms commonly used for dictionaries are *associative arrays*, *(hash) maps*, and *hash tables*. 

Here is a simple example:

In [21]:
musicians = {"John":"Zeppelin", "Jimmy":"Zeppelin", "Paul":"Beatles", "Ringo":"Beatles"}
musicians
# key:value
# key should be unique, values do not need to be

{'John': 'Zeppelin',
 'Jimmy': 'Zeppelin',
 'Paul': 'Beatles',
 'Ringo': 'Beatles'}

As we can see, a dictionary can be created with curly brackets and a comma-separated list of key-value pairs, where each key is separated from its value by a `:`. Here, the names are the keys, the bands are the values. 

There are other ways of creating a dictionary. Here, we pass a list of tuples to the dictionary, but we could also pass a list of lists.

In [22]:
more_musicians = dict([("Thom", "Radiohead"), ("Dave", "Foo Fighters")])
more_musicians

{'Thom': 'Radiohead', 'Dave': 'Foo Fighters'}

Of course, keys and values can be of other data types, too. Here is an example with ints as keys and floats as values:

In [23]:
numbers = {3:1.45, 4:1.32, 19:9.97, 6:9.99}
numbers

{3: 1.45, 4: 1.32, 19: 9.97, 6: 9.99}

Note that it's not a good idea to use floats as keys, as they are stored only as approximations. You should stick to integers and strings mostly. 

Lists cannot be used as keys at all: they are mutable and therefore not hashable, so Python raises a `TypeError`. Use a tuple instead.
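
Here is a short sketch of both cases, using a made-up `grid` dictionary with coordinate tuples as keys:

```python
# Tuples are hashable (as long as their elements are), so they work as keys.
grid = {(0, 0): "origin", (1, 2): "treasure"}
print(grid[(1, 2)])

# Lists are mutable and not hashable, so using one as a key fails.
try:
    grid[[3, 4]] = "list as key"
except TypeError as err:
    print("TypeError:", err)
```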

Values, in contrast, are frequently complex data types, such as lists, floats, strings, or generic objects.

In [24]:
multi_band_musicians = {"Dave":["Nirvana", "Foo Fighters"], "Eric":["Yardbirds", "Cream","Solo"]}
multi_band_musicians

{'Dave': ['Nirvana', 'Foo Fighters'], 'Eric': ['Yardbirds', 'Cream', 'Solo']}

Dictionary elements are accessed just as elements in a list, with square brackets, but instead of the index, we pass in the key: 

In [25]:
numbers[3]
# this is not index 3, this is key 3

1.45

In [26]:
multi_band_musicians["Eric"]

['Yardbirds', 'Cream', 'Solo']

We can add elements to a dict:

In [27]:
musicians["Thom"] = "Radiohead"
musicians

{'John': 'Zeppelin',
 'Jimmy': 'Zeppelin',
 'Paul': 'Beatles',
 'Ringo': 'Beatles',
 'Thom': 'Radiohead'}

And remove them using the `del` keyword:

In [28]:
del musicians["Thom"]
musicians

{'John': 'Zeppelin',
 'Jimmy': 'Zeppelin',
 'Paul': 'Beatles',
 'Ringo': 'Beatles'}

Again, we have to worry about key errors. If we want to remove Thom again, we'd get a `KeyError`.

In [29]:
# del musicians["Thom"]

# ---------------------------------------------------------------------------
# KeyError                                  Traceback (most recent call last)
# Cell In[29], line 1
# ----> 1 del musicians["Thom"]

# KeyError: 'Thom'
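
As with sets, there are a few ways to avoid the error. The sketch below uses a separate example dictionary, `bands_by_musician`, and shows a check with `in`, `pop()` with a default value, and `get()` for lookups:

```python
bands_by_musician = {"John": "Zeppelin", "Paul": "Beatles"}

# check for the key before deleting
if "Thom" in bands_by_musician:
    del bands_by_musician["Thom"]

# pop() with a default removes the key if present and never raises
removed = bands_by_musician.pop("Thom", None)
print(removed)

# get() returns a default instead of raising for missing keys
print(bands_by_musician.get("Thom", "unknown"))
```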


We can access the keys and values separately: 

In [30]:
musicians.keys()

dict_keys(['John', 'Jimmy', 'Paul', 'Ringo'])

Notice that the result is not a list or a set, but a [view object](https://docs.python.org/3/library/stdtypes.html#dict-views). A view object is updated when the dictionary is changed, and we can use it to iterate over a dictionary. 

In [31]:
for musician in musicians.keys():
    print(musician)

John
Jimmy
Paul
Ringo
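
To illustrate what it means that a view is updated along with the dictionary, the following sketch (with an example dictionary called `members`) compares a keys view to a list copy made from it:

```python
members = {"John": "Zeppelin", "Jimmy": "Zeppelin"}

names = members.keys()    # a view, not a copy
snapshot = list(names)    # a list is an independent copy

members["Robert"] = "Zeppelin"

print(names)     # the view reflects the new key
print(snapshot)  # the list does not
```

If you need a stable copy of the keys, for example because you want to change the dictionary while looping, convert the view to a list first.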


This also works with `values()` and `items()`:

In [32]:
musicians.values()

dict_values(['Zeppelin', 'Zeppelin', 'Beatles', 'Beatles'])

In [33]:
musicians.items()

dict_items([('John', 'Zeppelin'), ('Jimmy', 'Zeppelin'), ('Paul', 'Beatles'), ('Ringo', 'Beatles')])

The latter is especially handy for iterating over the key-value pairs in a dictionary:

In [34]:
# notice that we iterate over the tuples and have the elements of the tuple assigned to k and v, respectively.
for k, v in musicians.items():
    print (k + ", Band: " + v)

John, Band: Zeppelin
Jimmy, Band: Zeppelin
Paul, Band: Beatles
Ringo, Band: Beatles


Another way to write the previous expression would be like this: 

In [35]:
for k in musicians.keys():
    print(k + ", Band: " +  musicians[k])

John, Band: Zeppelin
Jimmy, Band: Zeppelin
Paul, Band: Beatles
Ringo, Band: Beatles


Make sure to check out [the dictionary documentation](https://docs.python.org/3/library/stdtypes.html#typesmapping) for more info. 

### Try it!
You're now ready to try the second problem in today's activities. 


## 4. Classes and Objects
*Note that this is not a detailed introduction to object-oriented programming (OOP); we gloss over a lot of subtleties and use terminology loosely.*

We won't be actively doing much object-oriented programming in this class, but we will frequently use objects as they are returned by a library, and hence it's a good idea to understand the basics of OOP. 

**Objects** are data structures that you can customize completely. They also provide interfaces to manipulate that data. 

[**Object oriented programming**](https://en.wikipedia.org/wiki/Object-oriented_programming) is one of the most commonly used programming paradigms. It's based on bundling data together with functionality, i.e., it's a combination of a data structure and functions – called **methods** – that operate on the data of an object. 

**Classes** are templates (data types) for **objects**. An object of a class is also called an **instance** of that class. 

Let's define a class:


In [36]:
class Person: 
    # a class variable, shared by all instances unless an instance assigns its own value
    name = "blank"
    
    def __init__(self):
        pass
    
    # a method setting the value of a member
    def set_name(self, name):
        # write the parameter name to the member variable name
        # "self" is only a naming convention, and the parameter could be called differently
        self.name = name
        # self.name = name.strip()
    
    # a method that does something, without additional parameters
    def print_name(self):
        print("Name:", self.name)

In [37]:
class HobbyPerson: 
    # a class variable, shared by all instances
    name = "blank"
    hobby = "blank"
    
newPerson = HobbyPerson()
print(newPerson.name)
print(newPerson.hobby)

newPerson.name = "Robert"
newPerson.hobby = "swimming"

print(newPerson.name)
print(newPerson.hobby)

blank
blank
Robert
swimming


Notice the use of the `class` keyword to define the class. 

Methods are defined just like functions, but they have the `self` variable as their first parameter. The name of that variable is actually not relevant, but it's customary to call it `self`. This is a reference to the specific instance. You don't specify that variable when you call the method; it's provided for you automatically based on the object you're calling the method on. 

Here, we instantiate that class and set a parameter via a method; then use the `print_name()` method: 

In [38]:
ringo = Person()
# method without parameter
ringo.print_name()
# call a method with a parameter
ringo.set_name("Ringo")
ringo.print_name()
# accessing a class member
ringo.name

Name: blank
Name: Ringo


'Ringo'

The key thing here is the way to access functions (and members) of objects with the `.` notation: 

`object.method()`

Here, we're saying execute that method on that specific object. We have already used this, and will be using this all the time! 

Here, we create a different person: 

In [39]:
paul = Person()
paul.set_name("Paul")
paul.print_name()
ringo.print_name()

Name: Paul
Name: Ringo
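
Each of the two instances keeps its own name, even though `name` was declared at the class level: assigning to `self.name` creates an attribute on that instance, which hides the class-level one. The difference between class-level and instance-level attributes matters most for mutable values. The following sketch uses a hypothetical `Band` class (not part of the lecture code) to illustrate this:

```python
class Band:
    # class attributes: one object shared by all instances
    members = []
    genre = "unknown"

    def __init__(self):
        # instance attribute: a separate list for every instance
        self.albums = []


a = Band()
b = Band()

a.genre = "rock"             # assignment creates an instance attribute on a only
a.members.append("Ringo")    # mutates the shared class-level list
a.albums.append("Album 1")   # mutates a's own list

print(b.genre)
print(b.members)
print(b.albums)
```

Here `b` still reports the class-level `genre` and an empty `albums` list, but it does see the member that was appended through `a`. For this reason, it's usually safer to create per-object data inside `__init__`.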


If we ask for the data type of our ringo variable, we'll see that it's an instance of our class:

In [40]:
type(ringo)

__main__.Person

We can also use a shorthand to initialize objects with the required variables. We use the `__init__` method (the name matters here) to do that. This `__init__` method is also called the "constructor".

In [43]:
class Musician: 
    # instantiation operation
    def __init__(self, name, instrument):
        # an instance variable, specific to that instance
        self.name = name
        self.instrument = instrument
        self.inname = name + instrument
    
    def print_musician(self):
        print(self.name, "plays", self.instrument)

With this definition, we create an object and at the same time specify its parameters. 

In [44]:
ringo = Musician("Ringo", "Drums")
ringo.print_musician()

Ringo plays Drums


We can also access member variables directly: 

In [45]:
ringo.instrument

'Drums'

If the constructor has required parameters, we have to pass them when we create an object. This will fail: 

In [49]:
# will throw a Type Error because we didn't use the proper signature
# paul = Musician()

# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# Cell In[46], line 2
#       1 # will throw a Type Error because we didn't use the proper signature
# ----> 2 paul = Musician()

# TypeError: Musician.__init__() missing 2 required positional arguments: 'name' and 'instrument'

A workaround is to use default values: 

In [47]:
class Musician: 
    def __init__(self, name="MyName", instrument="default", backup_singer=False):
        # an instance variable, specific to that instance
        self.instrument = instrument
        self.name = name
    
    def print_musician(self):
        print(self.name, "plays", self.instrument)

In [48]:
paul = Musician()
paul.print_musician()
paul.name

MyName plays default


'MyName'

There is more to OOP than what we covered here. For example, inheritance is a common paradigm that's also supported in Python. But what we've learned is enough to use objects that are provided by the libraries we'll be using. You can learn more from the [official documentation](https://docs.python.org/3/tutorial/classes.html). 

### Try it!
You're now ready to try the third problem in today's activities.

## 5. Working with Modules

While we briefly touched on modules (remember the `import math` statement), we haven't really talked about what a module is. Modules are used, well, to modularize code. You can write a module by simply creating a `.py` file. We won't be writing many modules ourselves, but we will use them extensively.

To import a module simply write

```python
import module_name
```

You can then use functions defined in the module with the `.` notation, just like we did for objects. Here's an example:

In [50]:
import math
math.sqrt(9)

3.0

We can also use the `from` notation to import specific functions from a module and add them directly to the namespace:

In [51]:
from math import log10
# notice that this is NOT accessed via math.log10()
log10(3)

0.47712125471966244

You can also bulk-import all functions of a module into your local namespace. However, this is **strongly discouraged**, as it can lead to name clashes and eventually makes your code unreadable.

In [None]:
# from math import * 
# log2(3)
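
To see what such a name clash looks like, consider `pow`, which exists both as a Python built-in and in the `math` module. The sketch below imports only that one name explicitly; a bulk import would silently do the same for every name the module defines:

```python
import builtins
from math import pow  # this name now hides the built-in pow in this namespace

print(pow is builtins.pow)     # the two are different functions
print(builtins.pow(2, 3))      # built-in: integer result
print(pow(2, 3))               # math.pow: always returns a float
print(builtins.pow(2, 3, 5))   # the built-in supports a modulus argument

try:
    pow(2, 3, 5)               # math.pow accepts exactly two arguments
except TypeError as err:
    print("TypeError:", err)
```

With an explicit import, the clash is at least visible in the import line. With a bulk import, nothing in your code tells a reader where `pow` comes from.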

Finally, we can redefine the name of a module. This is useful to define a shorthand for long library names.

In [52]:
import math as m 
m.sqrt(13)

3.605551275463989

## 6. The Pandas Library: Series

Pandas is a popular library for manipulating vectors, tables, and time series. We will frequently use Pandas data structures instead of the built-in python data structures, as they provide much richer functionality. Also, Pandas is **fast**, which makes working with large datasets easier.  Check out the official pandas website at [http://pandas.pydata.org/](http://pandas.pydata.org/).

This tutorial is partially based on the [excellent book by Matt Harrison](https://www.amazon.com/Learning-Pandas-Library-Analysis-Visualization-ebook/dp/B01GIE03GW/).

When you work with Pandas, it's handy to have a [cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) lying around. 

Pandas provides two main data structures: 

 * the **series**, which represents a single column of data similar to a python list
 * the **data frame**, which represents multiple series of data
 
Older versions of pandas also offered a third structure, the **panel**, which represented multiple data frames; it has been removed from current versions. Today we will stick to series. 

Pandas should already be part of your anaconda installation. If not, simply run:

```
$ conda install pandas
```

To make pandas available, we'll import the module into this notebook. It is customary to import pandas as `pd`:

In [53]:
import pandas as pd

Series are the most fundamental data structure in pandas. Let's create two simple series based on lists:

In [54]:
bands = pd.Series(["Stones", "Beatles", "Zeppelin", "Pink Floyd"])
bands

0        Stones
1       Beatles
2      Zeppelin
3    Pink Floyd
dtype: object

In [55]:
founded = pd.Series([1962, 1960, 1968, 1965])
founded

0    1962
1    1960
2    1968
3    1965
dtype: int64

When we output these objects we can see an index, also called an axis, which by default is an integer sequence starting at 0, and the associated values. 

| Index | Value |
| - | - |
| 0 | Stones |
| 1 | Beatles |
| 2 | Zeppelin |
| 3 | Pink Floyd |

Pandas also tells us the data type of the values: `object` for the first series (in this case, the values are strings) and `int64` (a 64-bit integer) for the second.

Notice that `int64` is not a Python data type, but a C integer of 64-bit length, which, unlike Python integers, can overflow!
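
The following sketch illustrates the difference. It assumes a standard pandas installation where integer series use `int64`; with that assumption, adding 1 to the largest representable value is expected to wrap around silently instead of raising an error:

```python
import pandas as pd

largest = 2**63 - 1      # the largest value an int64 can hold
print(largest + 1)       # a Python int simply grows as needed

s = pd.Series([largest])
print(s.dtype)
print(s + 1)             # the int64 value wraps around to a negative number
```

This rarely matters for typical datasets, but it is worth keeping in mind when working with very large counts or identifiers.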

We can also use other data types as indices, in which case the series behaves a lot like a dictionary:

In [56]:
# the data is the first parameter, the index is given by the index keyword
bands_founded = pd.Series([1962, 1960, 1968, 1965, 2012],
                          index=["Stones", "Beatles", "Zeppelin", "Pink Floyd", "Pink Floyd"], 
                          name="Bands founded")
bands_founded

Stones        1962
Beatles       1960
Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2012
Name: Bands founded, dtype: int64

| Index | Value |
| - | - |
| Stones | 1962 |
| Beatles | 1960 |
| Zeppelin | 1968 |
| Pink Floyd | 1965 |
| Pink Floyd | 2012 |

Here we see something interesting: We've used the same index (Pink Floyd) twice, once for the original founding of the band, and once for the re-union starting in 2012. Also, the order of the entries is preserved. 

A series behaves like both a list and a dictionary! 

We can access the values of a series through its `values` member.

In [57]:
bands.values

array(['Stones', 'Beatles', 'Zeppelin', 'Pink Floyd'], dtype=object)

And we can look at how the index is composed:

In [58]:
bands.index

RangeIndex(start=0, stop=4, step=1)

What we see here is that this isn't an explicit list, but rather a set of rules, similar to the ranges we've already worked with. 

Let's compare this to the index where we used explicit labels:

In [59]:
bands_founded.index

Index(['Stones', 'Beatles', 'Zeppelin', 'Pink Floyd', 'Pink Floyd'], dtype='object')

We can access individual entries as we'd access an array or a dictionary:

In [60]:
bands[0]

'Stones'

In [61]:
bands_founded["Beatles"]

1960

There is also a method for looking up a value:

In [62]:
bands_founded.get("Stones")

1962

Note that label-based access relies on a hash table, similar to a dictionary lookup, so it does not have to scan the entries one by one as a search through a list would.

This also works with lists of labels, in which case the return type is a series, not a single value.

In [63]:
bands_founded.get(["Stones", "Beatles"])

Stones     1962
Beatles    1960
Name: Bands founded, dtype: int64

Notice that when we access a label that occurs more than once in the index, we don't get a single value, as in the above cases, but instead get another series back:

In [64]:
bands_founded["Pink Floyd"]

Pink Floyd    1965
Pink Floyd    2012
Name: Bands founded, dtype: int64

Series also have indexers for label-based access: [`loc`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)

In [65]:
# And one more way for looking up a value:
bands_founded.loc["Stones"]
# this is equivalent to 
# bands_founded["Stones"]

1962

Related to the `loc` indexer is the [`iloc`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) indexer. However, instead of operating on our label index, `iloc` operates purely on position: 

In [66]:
bands_founded.iloc[1]

1960

Older versions of pandas also had an `ix` indexer. It was deprecated and later removed, so it should not be used.

These ways of accessing slices of a dataset (`loc`, `iloc`), will make more sense when we use dataframes instead of series – in dataframes, `loc` and `iloc` operate on the rows, whereas square brackets operate on the columns.

### Iterating

Iteration works as you would expect: 

In [67]:
for band in bands:
    print(band)

Stones
Beatles
Zeppelin
Pink Floyd


In [68]:
for band, founded in bands_founded.items():
    print(band + ", " + str(founded))

Stones, 1962
Beatles, 1960
Zeppelin, 1968
Pink Floyd, 1965
Pink Floyd, 2012


### Updating
Updating works largely as expected. However, you have to be careful when updating series with duplicate indices:

In [69]:
bands[2] = "The Doors"
bands

0        Stones
1       Beatles
2     The Doors
3    Pink Floyd
dtype: object

We can add a new item by directly assigning it to a new index.

In [70]:
bands[4] = "Zeppelin"
bands

0        Stones
1       Beatles
2     The Doors
3    Pink Floyd
4      Zeppelin
dtype: object

Note that the indices don't have to be sequential.

In [71]:
bands[17] = "The Who"
bands

0         Stones
1        Beatles
2      The Doors
3     Pink Floyd
4       Zeppelin
17       The Who
dtype: object

When we update based on an index that occurs more than once, all instances are updated:

In [72]:
bands_founded["Pink Floyd"] = 2015
bands_founded

Stones        1962
Beatles       1960
Zeppelin      1968
Pink Floyd    2015
Pink Floyd    2015
Name: Bands founded, dtype: int64

A way to update a specific entry when an index is used multiple times is to use the `iloc` indexer, which sets values based purely on position. However, all of this is rather ugly.

In [73]:
bands_founded.iloc[3] = 1965
bands_founded

Stones        1962
Beatles       1960
Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

### Deleting 

Deleting is rarely done with pandas data structures; instead, filters and masks are used. It is possible, however, based on indices:

In [74]:
del bands_founded["Stones"]
bands_founded

Beatles       1960
Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

### Indexing and slicing

Indexing and slicing work largely like in normal python, but instead of just directly using the bracket notation, it is recommended to use `iloc` for indexing by position and `loc` for indexing by labelled indices. 

In [75]:
# slicing by position
bands_founded.iloc[1:3]

Zeppelin      1968
Pink Floyd    1965
Name: Bands founded, dtype: int64

When slicing by labelled index, the last value specified is *included*, which differs from regular Python slicing behavior.

In [76]:
# slicing by index
bands_founded.loc["Zeppelin" : "Pink Floyd"]

Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

In [77]:
# Note that index 17 is included
bands.loc[1:17]

1        Beatles
2      The Doors
3     Pink Floyd
4       Zeppelin
17       The Who
dtype: object
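
The difference between the two slicing conventions is easiest to see side by side. The following sketch uses a small example series, `letters`, whose integer labels deliberately differ from the positions:

```python
import pandas as pd

letters = pd.Series(["a", "b", "c", "d"], index=[10, 20, 30, 40])

by_position = letters.iloc[1:3]   # positions 1 and 2; the end is excluded
by_label = letters.loc[20:40]     # labels 20 through 40; the end is included

print(list(by_position))
print(list(by_label))
```

Being explicit about `iloc` or `loc` avoids having to remember which convention applies, especially when the index itself consists of integers.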

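For series (not for data frames), bracket notation with a single label behaves like `loc`. Integer slices in brackets, however, are treated as positions, like with `iloc`. The result below matches `bands.loc[2:17]` only because the series has six entries, so positions 2 to 5 happen to carry the labels 2 to 17: 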

In [78]:
bands[2:17]

2      The Doors
3     Pink Floyd
4       Zeppelin
17       The Who
dtype: object

Both `iloc` and `loc` can be used with lists of positions or labels, which isn't possible with plain Python lists:

In [79]:
bands_founded.iloc[[0,3]]

Beatles       1960
Pink Floyd    2015
Name: Bands founded, dtype: int64

In [80]:
bands_founded.loc[["Beatles", "Pink Floyd"]]

Beatles       1960
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

And, all these variants can also be used with boolean arrays, which we will soon find out to be very helpful:

In [81]:
bands_founded

Beatles       1960
Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

In [82]:
bands_founded.loc[[True, False, False, True]]

Beatles       1960
Pink Floyd    2015
Name: Bands founded, dtype: int64

### Masking and Filtering

With pandas we can create boolean arrays that we can use to mask and filter a dataset. In the following expression, we'll create a new series that has `True` for every band formed after 1964:

In [83]:
mask = bands_founded > 1964
mask

Beatles       False
Zeppelin       True
Pink Floyd     True
Pink Floyd     True
Name: Bands founded, dtype: bool

This is called **broadcasting**. We can use broadcasting with various operations:

In [84]:
# Not particularly useful for this dataset..
founding_months = bands_founded * 12
founding_months

Beatles       23520
Zeppelin      23616
Pink Floyd    23580
Pink Floyd    24180
Name: Bands founded, dtype: int64

We can use a boolean mask to filter a series, as we've seen before:

In [85]:
# applying the mask to the original array
# note that almost all of those operations return a new copy and don't modify in place
bands_founded[mask]

Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

The short form, here with a threshold of 1967 instead of 1964, would be:

In [86]:
bands_founded[bands_founded > 1967]

Zeppelin      1968
Pink Floyd    2015
Name: Bands founded, dtype: int64

In the above example, this expression: 
`bands_founded > 1967` creates the series with boolean values, and that is then used to filter the `bands_founded` series. 
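
Masks can also be combined. The following sketch rebuilds the values of `bands_founded` from above in a separate series called `years` and selects the entries that fall between two thresholds:

```python
import pandas as pd

years = pd.Series([1960, 1968, 1965, 2015],
                  index=["Beatles", "Zeppelin", "Pink Floyd", "Pink Floyd"])

# & is element-wise "and", | is element-wise "or", ~ negates a mask.
# The parentheses are required because & and | bind tighter than comparisons.
sixties_after_1964 = years[(years > 1964) & (years < 1970)]
print(sixties_after_1964)
```

Note that Python's `and` and `or` keywords don't work here, since they expect a single truth value and not a whole series of them.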

### Exploring a Series

There are various ways we can explore a series. We can count the number of non-null values: 

In [87]:
numbers = pd.Series([1962, 1960, 1968, 1965, 2012, None, 2016])
numbers.count()

6

In [88]:
numbers

0    1962.0
1    1960.0
2    1968.0
3    1965.0
4    2012.0
5       NaN
6    2016.0
dtype: float64

We can get the sum, mean, median of a series:

In [89]:
numbers.sum()

11883.0

In [90]:
numbers.mean()

1980.5

In [91]:
numbers.median()

1966.5

We can also get an overview of the statistical properties of a series: 

In [92]:
numbers.describe()

count       6.000000
mean     1980.500000
std        26.120873
min      1960.000000
25%      1962.750000
50%      1966.500000
75%      2001.000000
max      2016.000000
dtype: float64

Note that None/NaN values are ignored here. We can drop all NaN values if we desire:

In [93]:
numbers = numbers.dropna()
numbers

0    1962.0
1    1960.0
2    1968.0
3    1965.0
4    2012.0
6    2016.0
dtype: float64

`describe()` also works for non-numerical data. Of course, we get different measures:

In [94]:
bands.describe()

count          6
unique         6
top       Stones
freq           1
dtype: object

Other useful methods are asking for a specific quantile, the minimum, the maximum, etc. 

In [95]:
numbers.quantile(0.25)

1962.75

In [96]:
numbers.max()

2016.0

In [97]:
numbers.min()

1960.0

### Sorting 

We can sort a series:

In [98]:
numbers.sort_values()

1    1960.0
0    1962.0
3    1965.0
2    1968.0
4    2012.0
6    2016.0
dtype: float64

And make the sorting descending: 

In [99]:
sorted_numbers = numbers.sort_values(ascending=False)
sorted_numbers

6    2016.0
4    2012.0
2    1968.0
3    1965.0
0    1962.0
1    1960.0
dtype: float64

Note that the indices remain the same! We can **reset the indices**:

In [None]:
# If we don't specify drop to be true, the previous indices are preserved in a separate column
sorted_numbers = sorted_numbers.reset_index(drop=True)
sorted_numbers

We can also sort by the index:

In [None]:
# mix up the indices first
new_sorted_numbers = numbers.sort_values()
print(new_sorted_numbers)
new_sorted_numbers.sort_index()

### Applying a Function

Often, we will want to apply a function to all values of a Series. We can do that with the [`map()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) function:

In [None]:
import datetime

# Convert an integer year into a date, assuming Jan 1 as day and month.
def to_date(year):
    return datetime.date(int(year), 1, 1)
    
new_sorted_numbers.map(to_date)

This is an incredibly powerful concept that you can use to modify series in sophisticated ways, similar to list comprehensions. 

Another way to use the map function is to pass in a dictionary: each value that matches a key is replaced by the corresponding dictionary value, and values without a matching key become `NaN`: 

In [None]:
new_sorted_numbers.map({1965:1945, 2012:1999, 1968:"What"})
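
Since the cell above mixes several effects, here is a smaller sketch with made-up labels that isolates what happens to values without a matching key in the dictionary:

```python
import pandas as pd

years = pd.Series([1965, 2012, 1999])

# 1999 has no entry in the dictionary, so it is mapped to NaN
mapped = years.map({1965: "sixties", 2012: "tens"})

print(mapped)
print(mapped.isna().sum())  # number of values without a match
```

If unmatched values should be kept instead, you can pass a function to `map()` that returns the original value as a fallback.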

### Conclusion

Series (and data frames, which we will tackle in the next lab) are incredibly powerful. We've only covered a small part of the features here. Make sure to also check out resources such as the [10 minutes to pandas guide](http://pandas.pydata.org/pandas-docs/stable/10min.html).