# Introduction to Data Science - Lecture 5 - Dictionaries, Series
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/*

In this lecture we will, after a quick recap, introduce more data structures: tuples, sets, dictionaries, and series. While tuples, sets, and dictionaries are built-in Python data structures, series and dataframes are part of the [pandas library](http://pandas.pydata.org/) tailored to data science applications.

## If-Statements Recap

* By using if-elif-else statements we can realize conditional flow in a program. 
* `if` and `elif` each take a condition that is tested for truth. The condition can be a boolean (`True` or `False`), or a value of any other data type. 
    * For numerical data types, 0 evaluates to false, everything else to true. `None` also evaluates to false.
    * For lists, strings, dictionaries, an empty container evaluates to false.

See the [documentation](https://docs.python.org/3/library/stdtypes.html#truth-value-testing) to understand what is considered true and false.
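
The rules above can be checked directly with the built-in `bool()` function. The following is a small sketch; the values in `samples` are arbitrary and chosen only for illustration.

```python
samples = [0, 0.0, 1, -3, None, "", "text", [], [0], {}, {"a": 1}]

for value in samples:
    # bool() shows how the value is treated in an if/elif condition
    print(repr(value), "->", bool(value))

# collect the values that count as false
falsy = [value for value in samples if not value]
print(falsy)
```

Note that `[0]` counts as true: the list itself is not empty, even though its only element is 0.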

In [16]:
def factors(x):
    # notice the use of the negation and the use of 0 as false
    if(not x % 2):
        print("2 is a factor of " + str(x))  
    elif(not x % 3):     # only evaluated when if was false
        print("3 is a factor of " + str(x))
    else: # only evaluated when both if and elif were false
        print("Neither 2 nor 3 are factors of " + str(x))

factors(4)
factors(9)
factors(13)

2 is a factor of 4
3 is a factor of 9
Neither 2 nor 3 are factors of 13



## Lists Recap

**A list is a collection of items.**   
**Lists are created with square brackets `[]` and can be accessed via an index:**

In [2]:
beatles = ["Paul", "John", "George", "Ringo"]
# printing the whole array
print(beatles)
# printing the first element of that array, at index 0
print(beatles[0])
# fourth element, at index 3
print(beatles[3])
# access the second-to-last element
print(beatles[-2])

['Paul', 'John', 'George', 'Ringo']
Paul
Ringo
George


We can also create **slices of an array with the slice operator `:`**

```python
a[start:end] # items start through end-1
a[start:]    # items start through the rest of the array
a[:end]      # items from the beginning through end-1
a[:]         # a copy of the whole array
```
There is also the step value, which can be used with any of the above:

```python
a[start:end:step] # start through not past end, by step
```

See [this post](for a good explanation on slicing).
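
As a small illustration of the step value, the following sketch slices a made-up list `letters` (not part of the lecture data):

```python
letters = ["a", "b", "c", "d", "e", "f"]

every_second = letters[::2]     # start at index 0, take every second item
middle = letters[1:5:2]         # indices 1 and 3
reversed_copy = letters[::-1]   # a negative step walks backwards

print(every_second)
print(middle)
print(reversed_copy)
print(letters)                  # the original list is unchanged
```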

In [3]:
# Get the slice from 0 (included) to 2 (excluded)
beatles[:2] # this can also be written as [0:2]

['Paul', 'John']

The slice operation returns a new array; the original array is untouched: 

In [4]:
beatles

['Paul', 'John', 'George', 'Ringo']

**We can change the elements that are contained in a list**: 

In [5]:
beatles[1] = "JohnYoko"
beatles

['Paul', 'JohnYoko', 'George', 'Ringo']

Lists can also be **extended in-place with the `append()` function**:

In [6]:
beatles.append("George Martin")
beatles

['Paul', 'JohnYoko', 'George', 'Ringo', 'George Martin']

Lists can be **concatenated**: 

In [7]:
zeppelin = ["Jimmy", "Robert", "John", "John"]
beatles += zeppelin
beatles

['Paul',
 'JohnYoko',
 'George',
 'Ringo',
 'George Martin',
 'Jimmy',
 'Robert',
 'John',
 'John']

We can **check the length** of a list:

In [8]:
len(zeppelin)

4

Lists can also be **nested**: 

In [9]:
# let's reset the beatles first
beatles = ["Paul", "John", "George", "Ringo"]
bands = [beatles, zeppelin]
bands

[['Paul', 'John', 'George', 'Ringo'], ['Jimmy', 'Robert', 'John', 'John']]

## While Loop Recap

While loops use the `while` keyword, a condition, and the loop body:

In [10]:
a, b = 1, 1
while (b < 1000):
    print(b, end=", ") 
    # end is a parameter of print that defines how the printed string ends. 
    # By default, a newline \n is appended, which we override here
    temp = b
    b += a
    a = temp
    # a better way of writing this is using simultaneous assignment: 
    # a, b = b, a + b

1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 

The loop continues until the condition `b < 1000` is no longer true. 

We can also **use the `break` statement to terminate a loop**: 

In [11]:
a, b = 1, 1
while (True):
    print(b, end=", ") 
    a, b = b, a + b
    if (b > 1000):
        break

1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 

## For Loop Recap


For loops are mainly used to iterate over items of a sequence. 

In [12]:
for member in zeppelin: 
    print(member)

Jimmy
Robert
John
John


When you want to iterate over a sequence of numbers, use the [`range()`](https://docs.python.org/3/library/stdtypes.html#range) function. Range generates a sequence of numbers:

In [13]:
for i in range(10): 
    print(i)

0
1
2
3
4
5
6
7
8
9
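
Like slices, `range()` also accepts an optional start and step. The following sketch converts each range to a list only so that the generated numbers can be printed:

```python
# range(stop): 0 up to, but not including, stop
print(list(range(5)))

# range(start, stop)
print(list(range(2, 6)))

# range(start, stop, step)
evens = list(range(0, 10, 2))
print(evens)

# a negative step counts down
countdown = list(range(3, 0, -1))
print(countdown)
```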


## 1. Tuples

[Tuples](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences) are a list-like data structure that is, in contrast to lists, **immutable**. 

By convention, tuples are used to store objects of different types. Remember that lists should only contain homogeneous data; tuples are designed for the heterogeneous case. 

Tuples also have practical implications for performance and hash tables, which we will discuss later. 

In [14]:
person = "Alex", 1981, "Computer Science"
person

('Alex', 1981, 'Computer Science')

Initialization with parentheses is preferred, since it's more explicit:

In [15]:
person = ("Alex", 1981, "Austria")
person

('Alex', 1981, 'Austria')

We can access them just like arrays: 

In [16]:
person[1]

1981

We cannot, however, change values. This throws a **TypeError**.

In [17]:
# throws TypeError
person[1] = 1985

TypeError: 'tuple' object does not support item assignment
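
Immutability is also what makes tuples usable in hash-based containers such as the sets and dictionaries introduced later in this lecture. A minimal sketch, using a made-up `board` dictionary keyed by (row, column) positions:

```python
# A tuple of immutable values is hashable, so it can serve as a dictionary key.
board = {(0, 0): "X", (1, 2): "O"}
print(board[(1, 2)])

# A list is mutable and therefore not hashable.
try:
    board[[2, 2]] = "X"
    list_key_worked = True
except TypeError as error:
    list_key_worked = False
    print("TypeError:", error)
```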

Arbitrary objects can be part of a tuple:

In [18]:
train_schedule = ("Train 1", [9,11])
# this works because we're modifying the mutable array within the immutable tuple.
train_schedule[1][0] = 15
train_schedule

('Train 1', [15, 11])

Of course, that includes tuples:

In [19]:
train_schedule = ("Train 1", (9,11))
# this doesn't work
# train_schedule[1][0] = 15
train_schedule

('Train 1', (9, 11))

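Tuples can also be used to swap the values of two variables without a temporary variable. The same works with list syntax on both sides:

In [14]: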
a = 1
b = 2
(a, b) = (b, a)
a, b

(2, 1)

In [15]:
[a, b] = [b, a]
a, b

(1, 2)

### Aside: Functions with Multiple Return Values

Consider the following code:

In [21]:
def multiply(a, b, c):
    return (a*b), (a*c), (b*c), (a*b*c)

Here, we return multiple values - something that's not possible in many programming languages! But it's very convenient. 

Let's try it out:

In [22]:
multiply(3, 7, 11)

(21, 33, 77, 231)

The round brackets indicate what's going on: what is returned is, in fact, a tuple!

We can use this return value to assign multiple variables at the same time:

In [23]:
ab, ac, bc, abc = multiply(3, 7, 11)
print(ab, ac, bc, abc)

21 33 77 231


To do this, no function is necessary. We can just do the following:

In [24]:
what, i_s, going, on = "this", "is", "really", "nice" # use () to be more explicit
print(what, i_s, going, on)

this is really nice


## 2. Sets

A [set](https://docs.python.org/3/tutorial/datastructures.html#sets) is a mutable collection similar to a list. However, it is
 * **not ordered**, and
 * **cannot contain the same element twice**

Here is an example:

In [2]:
# Initialize a set with {}
beatles = {"John", "Paul", "Ringo", "George", "Paul"}
beatles

{'George', 'John', 'Paul', 'Ringo'}

In [26]:
# Initialize the set with an array
usernames = set(["Jimmy", "Robert", "John", "John"])
usernames

{'Jimmy', 'John', 'Robert'}

In [3]:
set(range(10))

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

We've initialized the set `usernames` with an array of names. We have chosen a set because we don't want to have duplicate user names. In the second example, **the array included a duplicate: John was specified twice**. We can see, however, that **John was added to the set only once**.

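Also note the order of the elements in the first set. It was initialized with:
```python
{"John", "Paul", "Ringo", "George", "Paul"}
```
But the output shows the elements in a different order (and without the duplicate), because a set does not keep track of insertion order:
```python
{'George', 'John', 'Paul', 'Ringo'}
```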

Sets are great for various tasks. For example, they can be used to remove duplicate entries from lists. Most importantly, they let you very efficiently check whether an element already exists. 

A set works based on a mathematical function that produces a "hash code". This hash code is then used as an index to an array. For example, "Jimmy" could hash to the value 13, and accordingly, Jimmy would be put at the 13th index of an array. When we want to test whether "Jimmy" is already in a set, we simply compute the hash, which will again produce 13, and then look up whether something is stored at index 13. 
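
The idea can be sketched with the built-in `hash()` function. This is a toy model only: the slot count of 16 and the `slots` list are assumptions made for illustration, and real sets additionally handle collisions and resizing, which the sketch ignores.

```python
name = "Jimmy"

# hash() returns the hash code Python computes for an object.
# For strings the value changes between interpreter runs,
# but it is stable within a single run.
code = hash(name)
print(code)
print(code == hash("Jimmy"))

# Toy model of the lookup: a fixed number of slots, with the hash code
# reduced to a slot index by the modulo operator.
number_of_slots = 16
slots = [None] * number_of_slots
index = code % number_of_slots
slots[index] = name

# Membership test: recompute the index and look at that slot only.
is_member = slots[hash("Jimmy") % number_of_slots] == "Jimmy"
print(is_member)
```

The membership test never scans the other slots, which is why the lookup cost does not grow with the number of stored elements.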

We can check whether a set contains a value using the `in` keyword:

In [27]:
"Jimmy" in usernames

True

In [5]:
print("Ringo" in beatles)
print("Steven" in beatles)

True
False


In [8]:
"Ringo" in list(beatles)

True

We can add values using the add function on a set:

In [29]:
usernames.add("JohnB")
usernames

{'Jimmy', 'John', 'JohnB', 'Robert'}

And remove elements with the remove function: 

In [30]:
usernames.remove("John")
usernames

{'Jimmy', 'JohnB', 'Robert'}

If the set doesn't contain a key we want to remove, it will throw a `KeyError`.

In [31]:
usernames.remove("Joseph")

KeyError: 'Joseph'

To prevent that, it is advisable to first check whether a set actually contains a value, if you're not 100% sure: 

In [32]:
if ("Joseph" in usernames):
    usernames.remove("Joseph")
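
An alternative to checking first is the set method `discard()`, which removes an element if it is present and does nothing otherwise. The sketch uses a separate, made-up set `names` so that `usernames` stays untouched:

```python
names = {"Jimmy", "JohnB", "Robert"}

names.discard("Joseph")   # not in the set: no KeyError, nothing happens
names.discard("Robert")   # in the set: removed
print(names)
```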

We can iterate over the values of a set. Note, however, that no guarantee about the order of the set is made. 

In [33]:
for name in usernames:
    print(name)

Jimmy
Robert
JohnB


Make sure to check out the [documentation](https://docs.python.org/3.5/library/stdtypes.html#set) to see what else a set can do. 

### Exercise 2: Sets

Write a function that finds the overlap of two sets and prints them. Initialize two sets, e.g., with values {13, 25, 37, 45, 13} and {14, 25, 38, 8, 45} and call this function with them.

In [12]:
def setOverlap(s1, s2):
    ret = set()
    for e in s1:
        if e in s2:
            ret.add(e)
    return ret

def setOverlapComprehension(s1, s2):
    # a set comprehension builds the set directly, without an intermediate list
    return {x for x in s1 if x in s2}

print(setOverlap({13, 25, 37, 45, 13}, {14, 25, 38, 8, 45}))
print(setOverlapComprehension({13, 25, 37, 45, 13}, {14, 25, 38, 8, 45}))

{25, 45}
{25, 45}
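
Writing the loop by hand is a good exercise, but Python's sets also provide the overlap, along with other set operations, as built-in operators. A short sketch with the same sample values:

```python
s1 = {13, 25, 37, 45, 13}
s2 = {14, 25, 38, 8, 45}

overlap = s1 & s2        # intersection, same as s1.intersection(s2)
combined = s1 | s2       # union, same as s1.union(s2)
only_in_s1 = s1 - s2     # difference, same as s1.difference(s2)

print(overlap)
print(combined)
print(only_in_s1)
```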


## 3. Dictionaries

[Dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) are related to sets, but are more powerful: in addition to the key used to identify an element in a set, dictionaries also store a value associated with a key. Other terms commonly used for dictionaries are *associative arrays*, *(hash) maps*, and *hash tables*. 

Here is a simple example:

In [13]:
musicians = {"John":"Zeppelin", "Jimmy":"Zeppelin", "Paul":"Beatles", "Ringo":"Beatles"}
musicians

{'John': 'Zeppelin',
 'Jimmy': 'Zeppelin',
 'Paul': 'Beatles',
 'Ringo': 'Beatles'}

As we can see, a dictionary can be created with curly brackets and a list of key-value pairs, where each key is separated from its value by a `:`. Here, the names are the keys, the bands are the values. 

There are other ways of creating a dictionary. Here, we pass a list of tuples to the dictionary, but we could also pass a list of lists.

In [36]:
more_musicians = dict([("Thom", "Radiohead"), ("Dave", "Foo Fighters")])
more_musicians

{'Thom': 'Radiohead', 'Dave': 'Foo Fighters'}
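
A few more ways of building a dictionary, as a sketch. The lists `names` and `groups` are sample data introduced only for this example:

```python
names = ["Thom", "Dave", "Kurt"]
groups = ["Radiohead", "Foo Fighters", "Nirvana"]

# a list of two-element lists works just like a list of tuples
from_lists = dict([["Thom", "Radiohead"], ["Dave", "Foo Fighters"]])
print(from_lists)

# zip() pairs up the two lists; dict() turns the pairs into key-value entries
from_zip = dict(zip(names, groups))
print(from_zip)

# a dictionary comprehension computes keys and values in one expression
name_lengths = {name: len(name) for name in names}
print(name_lengths)
```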

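Dictionary keys must be hashable. A regular `set` is mutable and cannot be used as a key, but its immutable counterpart `frozenset` can:

In [16]: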
a = frozenset({1,2,3})
print(type(a))
b = dict()
b[a] = 5

<class 'frozenset'>


Of course, keys and values can be of other data types. Here is an example with ints as keys and floats as values:

In [37]:
numbers = {3:1.45, 4:1.32, 19:9.97, 6:9.99}
numbers

{3: 1.45, 4: 1.32, 19: 9.97, 6: 9.99}

Note that it's generally not a good idea to use floats as keys, as they are stored only as approximations.

Dict elements are accessed just as elements in a list, with square brackets, but instead of the index, we pass in the key: 

In [38]:
numbers[3]

1.45

In [39]:
musicians["John"]

'Zeppelin'

We can add elements to a dict:

In [17]:
musicians["Thom"] = "Radiohead"
musicians

{'John': 'Zeppelin',
 'Jimmy': 'Zeppelin',
 'Paul': 'Beatles',
 'Ringo': 'Beatles',
 'Thom': 'Radiohead'}

Assigning to a key that already exists overwrites the stored value:

In [18]:
musicians["Thom"] = "Readiohead"
musicians

{'John': 'Zeppelin',
 'Jimmy': 'Zeppelin',
 'Paul': 'Beatles',
 'Ringo': 'Beatles',
 'Thom': 'Readiohead'}

And remove them using the `del` keyword:

In [41]:
del musicians["Thom"]
musicians

{'John': 'Zeppelin',
 'Jimmy': 'Zeppelin',
 'Paul': 'Beatles',
 'Ringo': 'Beatles'}

Again, we have to worry about key errors. If we want to remove Thom again, we'd get a `KeyError`.

In [42]:
del musicians["Thom"]

KeyError: 'Thom'
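
As with sets, a `KeyError` can be avoided by testing for the key first. Dictionaries also offer `get()` and `pop()` with a default value. A small sketch on a made-up dictionary `lineup`:

```python
lineup = {"John": "Zeppelin", "Paul": "Beatles"}

# membership test on the keys
print("Thom" in lineup)

# get() returns None, or a default we choose, instead of raising KeyError
print(lineup.get("Thom"))
print(lineup.get("Thom", "unknown"))

# pop() with a default removes the key if present and never raises
missing = lineup.pop("Thom", None)
paul_band = lineup.pop("Paul", None)
print(missing, paul_band)
print(lineup)
```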

We can access a list of keys and values separately: 

In [43]:
musicians.keys()

dict_keys(['John', 'Jimmy', 'Paul', 'Ringo'])

Notice that the result is not a list or a set, but a [view object](https://docs.python.org/3/library/stdtypes.html#dict-views). A view object is always updated when the dictionary changes, and we can use it to iterate over a dictionary, for example in a list comprehension or in a loop:

In [19]:
[key for key in musicians.keys() if len(key) < 5]

['John', 'Paul']

In [45]:
for musician in musicians.keys():
    print(musician)

John
Jimmy
Paul
Ringo


This also works with `values()` and `items()`:

In [46]:
musicians.values()

dict_values(['Zeppelin', 'Zeppelin', 'Beatles', 'Beatles'])

In [47]:
musicians.items()

dict_items([('John', 'Zeppelin'), ('Jimmy', 'Zeppelin'), ('Paul', 'Beatles'), ('Ringo', 'Beatles')])

The latter is especially handy for iterating over the key-value pairs in a dictionary:

In [48]:
# notice that we iterate over the tuples and have the elements of the tuple assigned to k and v, respectively.
for k, v in musicians.items():
    print(k + ", " + v)

John, Zeppelin
Jimmy, Zeppelin
Paul, Beatles
Ringo, Beatles


Another way to write the previous expression would be like this: 

In [49]:
for k in musicians.keys():
    print(k + ", " +  musicians[k])

John, Zeppelin
Jimmy, Zeppelin
Paul, Beatles
Ringo, Beatles
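
That the views returned by `keys()`, `values()`, and `items()` stay in sync with the dictionary can be seen in the following sketch, which uses a made-up dictionary `roles`:

```python
roles = {"Paul": "bass", "Ringo": "drums"}

keys = roles.keys()          # a view, not a copy
snapshot = list(keys)        # a list is a one-time copy

roles["George"] = "guitar"

print(keys)                  # the view reflects the new key
print(snapshot)              # the list does not
```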


Make sure to check out [the dictionary documentation](https://docs.python.org/3/library/stdtypes.html#typesmapping) for more info. 

Iterating over a dictionary directly yields its keys, so `.keys()` can be omitted:

In [50]:
for k in musicians:
    print(k)

John
Jimmy
Paul
Ringo


### Exercise 3: Dictionaries

 * Create a dictionary with the two-letter codes of two US states and their full names, e.g., UT: Utah, NY: New York
 * After initially creating the dictionary, add two more states to the dictionary.
 * Create a second dictionary that maps the state codes to an array of cities in that state, e.g., UT: [Salt Lake City, Ogden, Provo, St. George]. 
 * Write a function that takes a state code and prints the full name of the state and lists the cities in that state.

## 4. Working with Modules

While we briefly touched on modules (remember the `import math` statement), we haven't really talked about what a module is. Modules are used, well, to modularize code. You can write a module by simply creating a `.py` file. We won't be writing many modules ourselves, but we will use them extensively.

To import a module simply write

```python
import module_name
```

You can then use functions defined in the module with the `.` notation. Here's an example:

In [51]:
import math
math.sqrt(9)

3.0

We can also use the `from` notation to import specific functions from a module and add them directly to the namespace:

In [52]:
from math import log10
# notice that this is NOT accessed via math.log10()
log10(3)

0.47712125471966244

You can also bulk-import all functions of a module into your local namespace. However, this is **strongly discouraged**, as it can lead to name clashes and eventually makes your code unreadable.

In [53]:
from math import * 
log2(3)

1.584962500721156
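
Name clashes are easy to produce. In this sketch, two explicit `from` imports stand in for two bulk imports; `math` and `cmath` are both standard-library modules that define a function called `sqrt`:

```python
from math import sqrt
real_root = sqrt(4)           # math.sqrt returns a float

from cmath import sqrt        # silently replaces the name sqrt
complex_root = sqrt(4)        # cmath.sqrt returns a complex number

print(real_root, type(real_root))
print(complex_root, type(complex_root))
```

With `import math` and `import cmath` instead, both functions remain available as `math.sqrt` and `cmath.sqrt`.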

Finally, we can import a module under a different name. This is useful to define a shorthand for long library names.

In [54]:
import math as m 
m.sqrt(13)

3.605551275463989

## 5. The Pandas Library: Series

Pandas is a popular library for manipulating vectors, tables, and time series. We will frequently use Pandas data structures instead of the built-in python data structures, as they provide much richer functionality. Also, Pandas is **fast**, which makes working with large datasets easier.  Check out the official pandas website at [http://pandas.pydata.org/](http://pandas.pydata.org/).

This tutorial is partially based on the [excellent book by Matt Harrison](https://www.amazon.com/Learning-Pandas-Library-Analysis-Visualization-ebook/dp/B01GIE03GW/).

Pandas provides three data structures: 

 * the **series**, which represents a single column of data similar to a python list
 * the **data frame**, which represents multiple series of data
 * the **panel**, which represents multiple data frames (deprecated, and removed from recent pandas releases)
 
We'll mostly work with series and data frames and largely ignore panels. Today we will stick to series. 

To install pandas, simply run:

```
$  /usr/local/bin/pip3 install pandas
```

To make pandas available, we'll import the module into this notebook. It is customary to import pandas as `pd`:

In [20]:
import pandas as pd

Series are the most fundamental data structure in pandas. Let's create two simple series based on arrays:

In [21]:

bands = pd.Series(["Stones", "Beatles", "Zeppelin", "Pink Floyd"])
bands

0        Stones
1       Beatles
2      Zeppelin
3    Pink Floyd
dtype: object

In [23]:
type(bands[0])

str

In [8]:
founded = pd.Series([1962, 1960, 1968, 1965])
founded

0    1962
1    1960
2    1968
3    1965
dtype: int64

When we output these objects we can see an index, also called an axis, which by default is an integer sequence starting at 0, and the associated values. 

| Index | Value |
| - | - |
| 0 | Stones |
| 1 | Beatles |
| 2 | Zeppelin |
| 3 | Pink Floyd |

Pandas also tells us the data type of the values: `object` for the first series (in this case, the objects are strings) and `int64` (a 64-bit integer) for the second.

Notice that `int64` is not a Python data type, but a C integer of 64 bit length, which, unlike Python integers, can overflow!
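
A minimal sketch of such an overflow, assuming the default behavior of pandas and NumPy integer arithmetic, which wraps around silently. The names `largest`, `big`, and `wrapped` are made up for this example:

```python
import pandas as pd

largest = 2**63 - 1                       # largest value an int64 can hold
big = pd.Series([largest], dtype="int64")

# Python integers simply grow as needed
print(largest + 1)

# the int64 series wraps around to a negative number instead
wrapped = big + 1
print(wrapped.iloc[0])
```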

We can also use other data types as indices, in which case the series behaves a lot like a dictionary:

In [36]:
# the data is the first parameter, the index is given by the index keyword
bands_founded = pd.Series([1962, 1960, 1968, 1965, 2012],
                          index=["Stones", "Beatles", "Zeppelin", "Pink Floyd", "Pink Floyd"], 
                          name="Bands founded",
                          dtype="int64")
bands_founded

Stones        1962
Beatles       1960
Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2012
Name: Bands founded, dtype: int64

In [37]:
bands_founded["Stones"]

1962

| Index | Value |
| - | - |
| Stones | 1962 |
| Beatles | 1960 |
| Zeppelin | 1968 |
| Pink Floyd | 1965 |
| Pink Floyd | 2012 |

Here we see something interesting: we've used the same index (Pink Floyd) twice, once for the original founding of the band, and once for the reunion starting in 2012. Also, the order of the entries is preserved. 

A series is, so to speak, both a list and a dictionary! 
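
The kinship with dictionaries goes further: a series can be created directly from a dictionary, in which case the keys become the index. A short sketch, using a made-up series `years` that mirrors the founding years above:

```python
import pandas as pd

years = pd.Series({"Stones": 1962, "Beatles": 1960, "Zeppelin": 1968})

print(years["Beatles"])      # dictionary-style access by label
print(years.iloc[0])         # list-style access by position
print("Zeppelin" in years)   # `in` tests the index labels, as with a dict
print(list(years.index))     # the order of the dictionary entries is kept
```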

We can access the values of a series through its `values` attribute.

In [38]:
bands.values

array(['Stones', 'Beatles', 'Zeppelin', 'Pink Floyd'], dtype=object)

And we can look at how the index is composed:

In [39]:
bands.index

RangeIndex(start=0, stop=4, step=1)

What we see here is that this isn't an explicit list, but rather a set of rules!

Let's compare this to the index where we used explicit labels:

In [40]:
bands_founded.index
for x in bands_founded.index:
    print(x)

Stones
Beatles
Zeppelin
Pink Floyd
Pink Floyd


In [41]:
bands_founded.values

array([1962, 1960, 1968, 1965, 2012])

We can access individual entries as we'd access an array or a dictionary:

In [42]:
bands[0]

'Stones'

In [43]:
bands_founded["Beatles"]

1960

There is also a method for looking up a value:

In [44]:
bands_founded.get("Stones")

1962

Note that these access methods are as fast as a dictionary lookup (compared to a lookup in a list).

This also works with arrays of labels, in which case the return type is a series, not a single value.

In [45]:
bands_founded.get(["Stones", "Beatles"])

Stones     1962
Beatles    1960
Name: Bands founded, dtype: int64

Notice that when we access data with a label that occurs multiple times, we don't get a simple data type, as in the above cases, but instead get another series back:

In [46]:
display(bands_founded["Pink Floyd"])
type(bands_founded["Pink Floyd"])

Pink Floyd    1965
Pink Floyd    2012
Name: Bands founded, dtype: int64

pandas.core.series.Series

Series also have indexers for label-based access: [`loc`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)

In [47]:
# And one more way for looking up a value:
bands_founded.loc["Stones"]
# this is equivalent to 
# bands_founded["Stones"]

1962

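Related to the `loc` indexer is the [`iloc`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) indexer. However, `iloc` operates purely on position, not on index labels. Plain square brackets with an integer, as in the first cell below, fall back to positional access only when the index contains no integer labels; recent pandas versions deprecate this fallback, so `iloc` is the safer choice: 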

In [48]:
bands_founded[0]

1962

In [49]:
bands_founded.iloc[0]

1962

In [50]:
for i in range(len(bands_founded)):
    print(i, bands_founded.iloc[i])

0 1962
1 1960
2 1968
3 1965
4 2012


Notice that there is also an `ix` indexer, which, however, is deprecated (and was removed in pandas 1.0) and should not be used.

These ways of accessing slices of a dataset (`loc`, `iloc`), will make more sense when we use dataframes instead of series - in dataframes, `loc` and `iloc` operate on the rows, whereas square brackets operate on the columns.
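
The difference between `loc` and `iloc` is easiest to see when the index labels are integers that do not match the positions. The series `albums` below is made up for this sketch:

```python
import pandas as pd

# integer labels that differ from the positions 0, 1, 2
albums = pd.Series(["Revolver", "Abbey Road", "Let It Be"], index=[10, 20, 30])

by_label = albums.loc[10]        # the entry labelled 10
by_position = albums.iloc[0]     # the first entry, whatever its label
last = albums.iloc[-1]           # positions support negative indexing

print(by_label, by_position, last)

# label 0 does not exist, so a label-based lookup fails
try:
    albums.loc[0]
    found = True
except KeyError:
    found = False
print(found)
```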

### Iterating

Iteration works as you would expect: 

In [51]:
for band in bands:
    print(band)

Stones
Beatles
Zeppelin
Pink Floyd


In [52]:
for band, founded in bands_founded.items():
    print(band + ", " + str(founded))

Stones, 1962
Beatles, 1960
Zeppelin, 1968
Pink Floyd, 1965
Pink Floyd, 2012


### Updating
Updating works largely as expected, however, you have to be careful when updating series with duplicate indices:

In [53]:
bands[2] = "The Doors"
bands

0        Stones
1       Beatles
2     The Doors
3    Pink Floyd
dtype: object

We can add a new item by directly assigning it to a new index.

In [54]:
bands[4] = "Zeppelin"
bands

0        Stones
1       Beatles
2     The Doors
3    Pink Floyd
4      Zeppelin
dtype: object

Note that the indices don't have to be sequential.

In [55]:
bands[17] = "The Who"
bands

0         Stones
1        Beatles
2      The Doors
3     Pink Floyd
4       Zeppelin
17       The Who
dtype: object

We can also use the `at` accessor to set a single value:

In [56]:
bands.at[9] = "Hendrix"
bands

0         Stones
1        Beatles
2      The Doors
3     Pink Floyd
4       Zeppelin
17       The Who
9        Hendrix
dtype: object

When we update based on an index that occurs more than once, all instances are updated:

In [57]:
bands_founded["Pink Floyd"] = 2015
bands_founded

Stones        1962
Beatles       1960
Zeppelin      1968
Pink Floyd    2015
Pink Floyd    2015
Name: Bands founded, dtype: int64

A way to update a specific entry when an index is used multiple times is to use the `iloc` indexer, which sets values based purely on position. However, all of this is rather ugly.

In [58]:
bands_founded.iloc[3] = 1965
bands_founded

Stones        1962
Beatles       1960
Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

### Deleting 

Deleting is rarely done with pandas data structures; instead, filters and masks are used. Deleting is possible based on indices, though:

In [59]:
del bands_founded["Stones"]
bands_founded

Beatles       1960
Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

### Indexing and slicing

Indexing and slicing work largely like in normal Python, but instead of directly using the bracket notation, it is recommended to use `iloc` for indexing by position and `loc` for indexing by label. 

In [60]:
# slicing by position
bands_founded.iloc[1:3]

Zeppelin      1968
Pink Floyd    1965
Name: Bands founded, dtype: int64

When slicing by index, the last value specified is *included*, which differs from regular Python slicing behavior:

In [61]:
# slicing by index
bands_founded.loc["Zeppelin" : "Pink Floyd"]

Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

In [62]:
# Note that index 17 is included
bands.loc[1:17]

1        Beatles
2      The Doors
3     Pink Floyd
4       Zeppelin
17       The Who
dtype: object

Both `iloc` and `loc` can be used with arrays, which isn't possible with vanilla Python lists:

In [63]:
bands_founded.iloc[[0,3]]

Beatles       1960
Pink Floyd    2015
Name: Bands founded, dtype: int64

In [64]:
bands_founded.loc[["Beatles", "Pink Floyd"]]

Beatles       1960
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

And, all these variants can also be used with boolean arrays, which we will soon find out to be very helpful:

In [65]:
bands_founded

Beatles       1960
Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

In [66]:
bands_founded.loc[[True, False, False, True]]

Beatles       1960
Pink Floyd    2015
Name: Bands founded, dtype: int64

### Masking and Filtering

With pandas we can create boolean arrays that we can use to mask and filter a dataset. In the following expression, we'll create a new array that has "True" for every band formed after 1964:

In [67]:
mask = bands_founded > 1964
mask

Beatles       False
Zeppelin       True
Pink Floyd     True
Pink Floyd     True
Name: Bands founded, dtype: bool

This uses a technique called **broadcasting**. We can use broadcasting with various operations:

In [68]:
# Not particularly useful for this dataset..
founding_months = bands_founded * 12
founding_months

Beatles       23520
Zeppelin      23616
Pink Floyd    23580
Pink Floyd    24180
Name: Bands founded, dtype: int64

We can use a boolean mask to filter a series, as we've seen before:

In [69]:
# applying the mask to the original array
# note that almost all of those operations return a new copy and don't modify in place
bands_founded[mask]

Zeppelin      1968
Pink Floyd    1965
Pink Floyd    2015
Name: Bands founded, dtype: int64

The short form here would be:

**This is super useful and important!**

In [70]:
bands_founded[bands_founded > 1967]

Zeppelin      1968
Pink Floyd    2015
Name: Bands founded, dtype: int64

**This is a super useful thing.  We'll use it to select rows from tables a lot!**
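
Masks can also be combined. The sketch below assumes a series `years` that re-creates the founding years used above; note that pandas uses the operators `&`, `|`, and `~` instead of the Python keywords `and`, `or`, and `not`:

```python
import pandas as pd

years = pd.Series([1960, 1968, 1965, 2015],
                  index=["Beatles", "Zeppelin", "Pink Floyd", "Pink Floyd"])

# & is "and", | is "or", ~ is "not"; each comparison needs its own parentheses
sixties = years[(years >= 1960) & (years < 1970)]
early_or_late = years[(years < 1962) | (years > 2000)]
not_sixties = years[~((years >= 1960) & (years < 1970))]

print(sixties)
print(early_or_late)
print(not_sixties)
```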

## Exploring a Series

There are various ways we can explore a series. We can count the number of non-null values: 

In [91]:
numbers = pd.Series([1962, 1960, 1968, 1965, 2012, None, 2016])
print(numbers.count())
print(len(numbers))

6
7


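Note that `None` is stored as `NaN` ("not a number"), and because `NaN` is a floating-point value, the whole series is converted to `float64`:

In [72]: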
display(numbers)
3

0    1962.0
1    1960.0
2    1968.0
3    1965.0
4    2012.0
5       NaN
6    2016.0
dtype: float64

3

We can get the sum, mean, median of a series:

In [73]:
numbers.sum()

11883.0

In [74]:
numbers.mean()

1980.5

In [75]:
numbers.median()

1966.5

We can also get an overview of the statistical properties of a series: 

In [76]:
numbers.describe()

count       6.000000
mean     1980.500000
std        26.120873
min      1960.000000
25%      1962.750000
50%      1966.500000
75%      2001.000000
max      2016.000000
dtype: float64

Note that None/NaN values are ignored here. We can drop all NaN values if we desire:

In [77]:
numbers = numbers.dropna()
numbers

0    1962.0
1    1960.0
2    1968.0
3    1965.0
4    2012.0
6    2016.0
dtype: float64

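As an aside, a string placed as the first statement of a function is its docstring, which is available as `__doc__`:

In [78]: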
def f(x):
    'this is the f function'
    return 3
print(f.__doc__)

this is the f function


`describe()` also works for non-numerical data. Of course, we get different measures:

In [79]:
bands.describe()

count           7
unique          7
top       Hendrix
freq            1
dtype: object

Other useful methods are asking for a specific quantile, the minimum, the maximum, etc. 

In [80]:
numbers.quantile(0.25)

1962.75

In [81]:
numbers.max()

2016.0

In [82]:
numbers.min()

1960.0

## Sorting 

We can sort a series:

In [83]:
numbers.sort_values()

1    1960.0
0    1962.0
3    1965.0
2    1968.0
4    2012.0
6    2016.0
dtype: float64

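Be careful when accessing a sorted series: square brackets look up the index label `0`, whereas `iloc[0]` returns the first element in sorted order:

In [84]: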
sorted_nums = numbers.sort_values()
print(sorted_nums[0])
print(sorted_nums.iloc[0])


1962.0
1960.0


In [85]:
sorted_numbers = numbers.sort_values(ascending=False)
sorted_numbers

6    2016.0
4    2012.0
2    1968.0
3    1965.0
0    1962.0
1    1960.0
dtype: float64

Note that each value keeps its original index label. We can **reset the indices**:

In [86]:
# If we don't specify drop to be true, the previous indices are preserved in a separate column
sorted_numbers = sorted_numbers.reset_index(drop=True)
sorted_numbers

0    2016.0
1    2012.0
2    1968.0
3    1965.0
4    1962.0
5    1960.0
dtype: float64

We can also sort by the index:

In [87]:
# mix up the indices first
new_sorted_numbers = numbers.sort_values()
print(new_sorted_numbers)
new_sorted_numbers.sort_index()

1    1960.0
0    1962.0
3    1965.0
2    1968.0
4    2012.0
6    2016.0
dtype: float64


0    1962.0
1    1960.0
2    1968.0
3    1965.0
4    2012.0
6    2016.0
dtype: float64

## Applying a Function

Often, we will want to apply a function to all values of a Series. We can do that with the map function:

In [88]:
import datetime

# Convert an integer year into a date, assuming Jan 1 as day and month.
def to_date(year):
    return datetime.date(int(year), 1, 1)
    
new_sorted_numbers.map(to_date)

1    1960-01-01
0    1962-01-01
3    1965-01-01
2    1968-01-01
4    2012-01-01
6    2016-01-01
dtype: object

This is an incredibly powerful concept that you can use to modify series in sophisticated ways.
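
For short transformations, the function passed to `map` is often written in place as a `lambda`. The following sketch uses a made-up series `founded_years` with values taken from the examples above:

```python
import pandas as pd

founded_years = pd.Series([1962, 1960, 1968, 2012],
                          index=["Stones", "Beatles", "Zeppelin", "Pink Floyd"])

# a lambda is an unnamed function defined in place;
# here it rounds each year down to the start of its decade
decades = founded_years.map(lambda year: (year // 10) * 10)
print(decades)

# the function may also change the data type, e.g. from int to str
labels = founded_years.map(lambda year: "founded in " + str(year))
print(labels)
```

As with the other operations, `map` returns a new series and leaves `founded_years` unchanged.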

Another way to use the map function is to pass in a dictionary that is then applied to matching objects: 

In [89]:
new_sorted_numbers.map({1965:1945, 2012:1999, 1968:"What"})

1     NaN
0     NaN
3    1945
2    What
4    1999
6     NaN
dtype: object

## Conclusion

Series (and data frames, which we will tackle in the next lab) are incredibly powerful. We've only covered a small part of the features here. Make sure to also check out resources such as the [10 minutes to pandas guide](http://pandas.pydata.org/pandas-docs/stable/10min.html).

### Exercise 4: Pandas Series

Create a new pandas series with the lists given below that contain NFL team names and the number of Super Bowl titles they won. Use the names as indices, the wins as the data.

 * Once the series is created, sort the series alphabetically by index. 
 * Print an overview of the statistical properties of the series. What's the mean number of wins?
 * Filter out all teams that have won fewer than four Super Bowl titles
 * A football team has 45 players. Update the series so that instead of the number of titles, it reflects the number of Super Bowl rings given to the players. 
 * Assume that each ring costs \$ 30,000. Update the series so that it contains a string of the dollar amount including the \$ sign. For the Steelers, for example, this would correspond to: 
 ```
 Pittsburgh Steelers             $ 8100000
 ```


In [90]:
teams = ["Pittsburgh Steelers",
"New England Patriots",
"Dallas Cowboys",
"San Francisco 49ers",
"Green Bay Packers",
"New York Giants",
"Denver Broncos",
"Oakland/Los Angeles Raiders",
"Washington Redskins",
"Miami Dolphins",
"Baltimore/Indianapolis Colts",
"Baltimore Ravens"]
wins = [6,6,5,5,4,4,3,3,3,2,2,2]