-----
#Pandas Series Data Structure
-----

In this lecture we're going to explore the **pandas Series structure**. By the end of this lecture you should be 
familiar with how to store and manipulate single dimensional indexed data in the Series object.

**The series is one of the core data structures in pandas**. You can think of it as a cross between a list and a dictionary.

![Series.png](https://drive.google.com/uc?id=1RlS1LIvx_9oaClrnx3hhcGDt7aQBQSq5)

The items are all stored in order, and there are labels with which you can retrieve them. An easy way to visualize this is as two columns of data. The first is the special index, a lot like the keys in a dictionary, while the
second is your actual data. It's important to note that the data column has a label of its own, which can be retrieved using the `.name` attribute. This is different from dictionaries and is useful when it comes to
merging multiple columns of data. We'll talk about that in the next couple of lectures.

In [53]:
# Let's import pandas to get started
import pandas as pd

##Creating Series using lists

One of the easiest ways to create a series is to use an array-like object, like a list. 

When you do this, Pandas automatically assigns an index starting with zero and sets the name of the series to `None`.

In [54]:
# Let's make a list of three students, Alice, Jack, and Molly, all as strings
students = ["Alice", "Jack", "Molly"]

# Now we just call the Series function in pandas and pass in the students
pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

The result is a Series object. We see here that pandas has automatically identified the type of data in this Series as `object` and set the `dtype` accordingly. We also see that the values are indexed with integers, starting at zero.
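To make the default index and name concrete, here is a minimal sketch that inspects both attributes. The name `"student"` is only an illustrative choice and does not come from the lecture.

```python
import pandas as pd

students = ["Alice", "Jack", "Molly"]
s = pd.Series(students)

# With no extra arguments, the name is None and the index is 0, 1, 2
print(s.name)           # None
print(list(s.index))    # [0, 1, 2]

# A name can be supplied when the Series is created ("student" is illustrative)
named = pd.Series(students, name="student")
print(named.name)       # student
```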

***Why use pandas vs traditional lists***? 
Pandas stores series values in a typed array using the Numpy library. This offers significant speedup when processing data versus traditional python lists.

This is why, in the above example, Pandas assigned the Series' `dtype` as `object`. The `object` `dtype` comes from NumPy, where it describes the type of element in an `ndarray`. Every element in an `ndarray` must have the same size in bytes. For `int64` and `float64`, that size is 8 bytes. But for strings, the length of the string is not fixed. So instead of saving the bytes of the strings in the `ndarray` directly, Pandas uses an `object` `ndarray`, which saves pointers to Python objects; because of this the `dtype` of this kind of `ndarray` is `object`.

To double check this, let's create a Series from a list of integers this time.

In [55]:
# Lets create a Series from a list of integers
numbers = [1, 2, 3]

pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

As expected, Pandas set the type to `int64`. 
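The fixed element size described above can be checked on the underlying NumPy array. This is a small sketch. The dtypes are passed explicitly here (an assumption made for reproducibility) so the result does not depend on the platform or pandas version.

```python
import pandas as pd

# Explicit dtypes keep the example independent of platform defaults
ints = pd.Series([1, 2, 3], dtype="int64")
floats = pd.Series([1.5, 2.5, 3.5], dtype="float64")
names = pd.Series(["Alice", "Jack", "Molly"], dtype=object)

# Numeric arrays store every element in the same number of bytes
print(ints.to_numpy().itemsize)     # 8
print(floats.to_numpy().itemsize)   # 8

# An object array holds references to Python objects, not the string bytes
print(names.to_numpy().dtype)       # object
print(type(names.iloc[0]))          # <class 'str'>
```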

## How Pandas handles missing data

It is important to know how `numpy` (and thus `pandas`) handles missing data.

In Python, we have the `None` type to indicate the lack of data. When a list of strings contains `None`, pandas does some type conversion in the backend and uses the type `object` for the underlying array.



In [56]:
# Let's recreate our list of students, but leave the last one as a None
students = ['Alice', 'Jack', None]
# And lets convert this to a series
pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

As expected, the missing entry is kept as `None`. However, if we create a list of numbers, integers or floats, and put in the `None` type, pandas will automatically convert it to a special floating point value designated as `NaN`, which stands for "Not a Number".

In [57]:
# So lets create a list with a None value in it
numbers = [1, 2, None]
# And turn that into a series
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

So, we notice a couple of things: 

1.   `NaN` is a different value to `None`.
2.   Pandas sets the dtype of a series to floating point when the other values are numerical.

So, why not just leave it as an integer? Underneath, pandas represents `NaN` as a floating point number, and because integers can be typecast to floats, pandas went and converted our integers to floats. 
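A quick way to see this is to check the type of `NaN` itself. The sketch below also shows the nullable `"Int64"` extension dtype as one possible alternative. It is not used elsewhere in this lecture and assumes a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary float, which is why the integers get converted
print(type(np.nan))                     # <class 'float'>
print(pd.Series([1, 2, None]).dtype)    # float64

# Assumption: a pandas version that provides the nullable "Int64" dtype,
# which keeps the integers and marks the gap as missing instead
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)                   # Int64
print(nullable.isna().tolist())         # [False, False, True]
```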

**It is important to stress that while data scientists might use `None` and `NaN` in the same way, i.e. to denote missing data, underneath they are NOT represented by pandas in the same way.**


In [58]:
# Lets bring in numpy which allows us to generate an NaN value
import numpy as np
# NaN is NOT equivalent to None. Try the equality test, the result is False.
np.nan == None

False

Note: you can't meaningfully test `NaN` for equality with itself. When you do, the answer is always False.

In [59]:
np.nan == np.nan

False

The reason behind that is that not every `NaN` can be considered to be the same value. In layman's terms, `NaN` cannot be equal to itself because `NaN` is the result of a failure, but that failure can happen in multiple ways. The result of one failure cannot be assumed equal to the result of any other failure, and unknown values cannot be equal to each other.

This is why NumPy provides a special function for this special case, `np.isnan()`.

In [60]:
np.isnan(np.nan)

True

So keep in mind when you see `NaN`, while its meaning is similar to `None`, it's a numeric value and treated differently for efficiency reasons.
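Because equality tests are unreliable for missing values, a common approach is to ask pandas directly. The following is a minimal sketch using `pd.isna()` and the Series methods `isna()` and `dropna()`, which treat `None` and `NaN` alike.

```python
import numpy as np
import pandas as pd

# pd.isna() reports both None and NaN as missing
print(pd.isna(None))      # True
print(pd.isna(np.nan))    # True

numbers = pd.Series([1, 2, None])
# isna() gives one boolean per entry, dropna() removes the missing ones
print(numbers.isna().tolist())      # [False, False, True]
print(numbers.dropna().tolist())    # [1.0, 2.0]
```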

##More on creating Series

Let's talk more about how pandas' Series can be created. While a list might be a common way to create some play data, you often find that you have to label data that you want to manipulate. 

As such, a series can be created directly from dictionary data. If you do this, the index is automatically assigned to the keys of the dictionary that you provided and not just incrementing integers.


In [61]:
# Here's an example using a dictionary with some data of students and their classes.
students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}
s = pd.Series(students_scores)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

Since it was string data, pandas set the data type of the series to `object`.

Note also that the index (the first column) is also a list of strings set to dtype `object`.

In [65]:
# Once the series has been created, we can get the "index object" using the `index` attribute.
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

As you play more with pandas you'll notice that a lot of things are implemented as numpy arrays, and have the dtype value set. 

This is true of indices, and here pandas inferred that we were using objects for the index.

Also, note that the dtype of object is not just for strings, but for arbitrary objects. For example, let's create a more complex type of data, say, a list of tuples.

In [66]:
students = [("Alice","Brown"), ("Jack", "White"), ("Molly", "Green")]
pd.Series(students)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

As you can see, each of the tuples is stored in the series, and the type is `object`.

You can also separate your index creation from the data by passing in the index as a list explicitly to the series.

In [67]:
s = pd.Series(['Physics', 'Chemistry', 'English'], index=['Alice', 'Jack', 'Molly'])
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

So what happens if the values in your index list are not aligned with the keys in the dictionary you use to create the series? Well, pandas overrides the automatic creation and keeps only the index values that you provided. It will ignore all keys in your dictionary which are not in your index, and it will add `None` or `NaN` values for any index value you provide which is not among your dictionary keys.


In [68]:
# Here's an example. Let's pass in a dictionary of three items, in this case students and their courses
students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}

# When I create the series object though I'll only ask for an index with three students, and exclude Jack
# You can imagine that this came from a large dictionary and for some reason you just need this subset of three.
s = pd.Series(students_scores, index=['Alice', 'Molly', 'Sam'])
s

Alice    Physics
Molly    English
Sam          NaN
dtype: object

The result is that the Series object doesn't have Jack in it, even though he was in our original dataset, but it explicitly does have Sam in it as a missing value.
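The same outcome can be checked programmatically. This is a small sketch that rebuilds the Series above (the variable name `students_classes` is chosen here for clarity) and uses the `in` operator, which tests index labels, together with `isna()`.

```python
import pandas as pd

students_classes = {"Alice": "Physics", "Jack": "Chemistry", "Molly": "English"}
s = pd.Series(students_classes, index=["Alice", "Molly", "Sam"])

# Membership tests on a Series look at the index labels, not the values
print("Jack" in s)    # False: Jack was not in the requested index
print("Sam" in s)     # True: Sam is present, but without a value

# isna() shows which labels ended up with missing data
print(s[s.isna()].index.tolist())    # ['Sam']
```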

So far, we've explored the pandas Series data structure. You've seen how to create a series from lists and dictionaries, how indices on data work, and the way that pandas typecasts data including missing values.

Next, let's look at how to "query" Series objects.

##Querying Pandas Series

A pandas Series can be "queried" either by:

1.   the index position
2.   the index label

Note: If you don't give an index to the series when creating it, the position and the label are effectively the same values.

There are two Series "attributes" that can be used in a query:

1.   `iloc`: To query by numeric location (i.e. index position), starting at zero
2.   `loc` : To query by the index label

Note: Keep in mind that `iloc` and `loc` are not "methods", they are "attributes". So you don't use parentheses to query them, but square brackets instead, which is called *the indexing operator*. 

In [69]:
# Let's start with an example. We'll use students enrolled in classes coming from a dictionary
import pandas as pd
students_classes = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

So, for this series, if we wanted to see the fourth entry we would use the `iloc` attribute with the parameter 3.

In [70]:
s.iloc[3]

'History'

If you wanted to see what class Molly has, we would use the `loc` attribute with a parameter of "Molly".

In [71]:
s.loc['Molly']

'English'
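Both attributes also accept slices and lists, with one difference worth knowing: `iloc` slices exclude the end position, while `loc` slices include the end label. A minimal sketch on the same data:

```python
import pandas as pd

s = pd.Series({"Alice": "Physics", "Jack": "Chemistry",
               "Molly": "English", "Sam": "History"})

# iloc slices by position and, like a Python list, excludes the end position
print(s.iloc[0:2].tolist())               # ['Physics', 'Chemistry']

# loc slices by label and includes the end label
print(s.loc["Alice":"Molly"].tolist())    # ['Physics', 'Chemistry', 'English']

# A list of labels picks out several entries in the order given
print(s.loc[["Sam", "Alice"]].tolist())   # ['History', 'Physics']
```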

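Pandas tries to make our code a bit more readable and provides a sort of smart syntax using the indexing operator, `[]`, directly on the series itself. For instance, if you pass in an integer parameter and the index labels are not integers, the operator will behave as if you want it to query via the `iloc` attribute. Note that this positional fallback is deprecated in recent pandas releases (2.1 and later emit a `FutureWarning`), so `s.iloc[3]` is the more future-proof spelling.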

In [72]:
s[3]

'History'

If you pass in an object, it will query as if you wanted to use the label based `loc` attribute.

In [73]:
s['Molly']

'English'

So what do you think will happen if your index is a list of integers?

In [74]:
# Here's an example using class and their classcode information, where classes are indexed by 
# classcodes, in the form of integers
class_code = {99: 'Physics',
              100: 'Chemistry',
              101: 'English',
              102: 'History'}
s = pd.Series(class_code)
s

99       Physics
100    Chemistry
101      English
102      History
dtype: object

What do you think will happen if we try and call `s[0]`?

In [75]:
s[0]

KeyError: ignored

So, that didn't call `s.iloc[0]` underneath as one might expect. Instead, it generated a key error. This is because there's no item in the classes list with an index of zero.

Here, Pandas can't determine automatically whether you're intending to query by index position or index label. So you need to be careful when using the indexing operator on the Series itself. The safer option is to be more explicit and use the `iloc` or `loc` attributes directly.

In [76]:
# Let's give iloc a go
s.iloc[0]

'Physics'
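To make the distinction explicit, here is a short sketch that queries the same integer-labelled Series by label and by position, and shows that `loc` does not fall back to positions when a label is missing.

```python
import pandas as pd

s = pd.Series({99: "Physics", 100: "Chemistry", 101: "English", 102: "History"})

print(s.loc[99])     # Physics, looked up by label
print(s.iloc[0])     # Physics, looked up by position

# With loc, a missing label raises KeyError instead of falling back to position
try:
    s.loc[0]
except KeyError:
    print("no label 0 in the index")
```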

##Performing operations with Pandas Series

Now we know how to get data out of the series, let's talk about working with the data. A common task is to want to consider all of the values inside of a series and do some sort of operation. This could be trying to find a certain number, or aggregating the data or transforming the data in some way.

###Vectorization

A typical programmatic approach to this would be to iterate over all the items in the series, and invoke the operation one is interested in. For instance, we could create a Series of integers representing student grades, and just try and get an average grade.

In [78]:
grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total+=grade
print(total/len(grades))

75.0


This works, but it's slow. Modern computers can do many tasks simultaneously, especially, but not only, tasks involving mathematics.

Pandas and the underlying numpy libraries support a method of computation called **vectorization**. 

Vectorization works with most of the functions in the numpy library, including the sum function.

In [81]:
# Here's how we would really write the code using the numpy sum method. First we need to import 
# the numpy module

import numpy as np

# Then we just call np.sum and pass in an iterable item. In this case, our pandas Series.
total = np.sum(grades)
print(total/len(grades))

75.0
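As an aside, the Series object also exposes vectorized aggregations as methods, so the NumPy call is not the only option. A minimal sketch with the same grades:

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# These methods run over the whole array without a Python-level loop
print(grades.sum())     # 300
print(grades.mean())    # 75.0
print(grades.max())     # 90
```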


Now both of these methods create the same value, but is one actually faster? 

The Jupyter Notebook has a magic function which can help. But before we go into that, let's first create a large series of random numbers using numpy's `random` module.

In [85]:
# np.random.randint() takes 3 parameters here: min value, max value (exclusive), number of integers required
numbers = pd.Series(np.random.randint(0,1000,10000))

# Now lets look at the top five items in that series to make sure they actually seem random. We
# can do this with the head() function
numbers.head()

0    962
1    690
2    492
3     82
4    966
dtype: int64

In [86]:
# We can also verify that length of the series is correct using the len() function
len(numbers)

10000

###Magic Functions

The ipython interpreter has something called "magic functions" that begin with a percentage sign. Try typing `%`, and a list of the available magic functions should appear. 

You could even write your own magic functions too, but that's a little bit outside of the scope of this course. :)

There are two types of magic functions:

1. **Line magics**: These are similar to command line calls. They start with a single `%` character. The rest of the line is the argument, passed without parentheses or quotes. Line magics can be used as an expression and their return value can be assigned to a variable. We won't cover these in this lecture, but you might see them throughout the course.

2. **Cell magics**: These have a `%%` prefix. Unlike line magic functions, they can operate on multiple lines below their call, as long as it's within the same Jupyter cell. 



###Cell Magic Function: timeit

One very useful cell magic function is called `timeit`. This function will run our code a few times to determine, on average, how long it takes.

Let's run `timeit` with our original iterative code. You can give `timeit` the number of loops that you would like to run with the `-n` option. By default, it picks a suitable number of loops automatically, but I'll ask it to use 100 loops for the sake of time.

Note that in order to use a cell magic function, it has to be the first line in the cell.

In [87]:
%%timeit -n 100
total = 0
for number in numbers:
    total+=number

total/len(numbers)

100 loops, best of 5: 1.46 ms per loop


Not bad. Timeit ran the code and it doesn't seem to take very long at all, just over 1 millisecond per loop. Now let's try with **vectorization**.

In [90]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

100 loops, best of 5: 68.3 µs per loop


The time per loop went from milliseconds (thousandths of a second) to microseconds (millionths of a second), roughly a twentyfold speedup in this run. This is a pretty shocking difference in speed and demonstrates why one should be aware of parallel computing features and start thinking in functional programming terms.

Put more simply, *vectorization* means applying one operation to a whole array of values at once instead of looping over them one at a time in Python. NumPy carries out the work in compiled code, which can use the processor's ability to handle several values per instruction. Hardware built for parallel work, especially graphics cards, pushes the same idea much further and can run thousands of operations in parallel.
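Outside of a notebook, a similar comparison can be sketched with Python's standard-library `timeit` module. The function names `loop_mean` and `vectorized_mean` are illustrative only, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean():
    # Plain Python iteration over every value
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)

def vectorized_mean():
    # A single call into NumPy's compiled sum
    return np.sum(numbers) / len(numbers)

# number=100 mirrors the "-n 100" option used with %%timeit
loop_time = timeit.timeit(loop_mean, number=100)
vector_time = timeit.timeit(vectorized_mean, number=100)

print(f"loop:       {loop_time / 100 * 1e6:.1f} µs per call")
print(f"vectorized: {vector_time / 100 * 1e6:.1f} µs per call")
```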

###Broadcasting

A related feature in pandas and numpy is called **broadcasting**. With broadcasting, you can apply an operation to every value in the series, changing the series. 

For instance, if we wanted to increase every random variable by 2, we could do so quickly using the += operator directly on the Series object. 


In [91]:
# Let's look at the head of our series
numbers.head()

0    962
1    690
2    492
3     82
4    966
dtype: int64

In [92]:
# And now let's just increase everything in the series by 2
numbers+=2
numbers.head()

0    964
1    692
2    494
3     84
4    968
dtype: int64

The procedural way of doing this would be to iterate through all of the items in the series and increase the values directly. Pandas does support iterating through a series much like a dictionary, allowing you to unpack values easily.

In [93]:
# We can use the items() method, which yields a label and a value for each entry
# (older pandas versions called this iteritems(), which was removed in pandas 2.0)
for label, value in numbers.items():
    # now for the item which is returned, let's update it through .loc
    numbers.loc[label] = value+2
# And we can check the result of this computation
numbers.head()

0    966
1    694
2    496
3     86
4    970
dtype: int64

The result is the same. Let's take a look at some speed comparisons. First, let's try ten loops using the iterative approach...

In [94]:
%%timeit -n 10
# we'll create a fresh series of items to deal with
s = pd.Series(np.random.randint(0,1000,1000))
# And we'll just rewrite our loop from above, using items() (iteritems() in older pandas)
for label, value in s.items():
    s.loc[label]= value+2

10 loops, best of 5: 52.2 ms per loop


Now, let's try that using the broadcasting approach.

In [95]:
%%timeit -n 10
# We need to recreate a series
s = pd.Series(np.random.randint(0,1000,1000))
# And we just broadcast with +=
s+=2

10 loops, best of 5: 251 µs per loop


Not only is it significantly faster, but it's more concise and easier to read. Again, this is because the typical mathematical operations you would expect are vectorized. If interested, the numpy documentation outlines what it takes to create vectorized functions of your own. 
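Broadcasting is not limited to `+=`. As a small sketch, other arithmetic and comparison operators are applied element by element in the same way, and the non-in-place forms return a new Series. The variable names `curved` and `passed` are illustrative.

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Arithmetic with a scalar is applied to every element and returns a new Series
curved = grades + 5
print(curved.tolist())    # [95, 85, 75, 65]

# Comparisons broadcast too, producing a boolean Series
passed = grades >= 70
print(passed.tolist())    # [True, True, True, False]

# The original is unchanged because + (unlike +=) does not modify in place
print(grades.tolist())    # [90, 80, 70, 60]
```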

###Indexing Operators
One last note on using the indexing operators to access series data. The `.loc` attribute lets you not only modify data in place, but also add new data as well. If the value you pass in as the index doesn't exist, then a new entry is added. Keep in mind, though, that indices can have mixed types. While it's important to be aware of the typing going on underneath, Pandas will automatically change the underlying NumPy types as appropriate.

In [96]:
# Here's an example using a Series of a few numbers. 
s = pd.Series([1, 2, 3])

# We could add some new value, maybe a university course
s.loc['History'] = 102

s

0            1
1            2
2            3
History    102
dtype: int64

We see that mixed types for data values or index labels are no problem for Pandas. Since "History" is not in the original list of indices, `s.loc['History']` essentially creates a new element in the series, with the index named "History", and the value of 102.
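The automatic type changes mentioned above can be observed directly. This sketch repeats the example and then adds a floating point value under a hypothetical label `"Math"`, which is not part of the lecture's data.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s.loc["History"] = 102

# The index now mixes integers and a string, so it is stored as object
print(s.index.tolist())    # [0, 1, 2, 'History']
print(s.index.dtype)       # object

# Adding a float under a new (hypothetical) label upcasts the values
s.loc["Math"] = 99.5
print(s.dtype)             # float64
```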

Up until now I've shown only examples of a series where the index values were unique. I want to end this lecture by showing an example where index values are not unique, and this makes pandas Series a little different conceptually than, for instance, a relational database.

Let's create a Series with students and the courses which they have taken.

In [97]:
students_classes = pd.Series({'Alice': 'Physics',
                              'Jack': 'Chemistry',
                              'Molly': 'English',
                              'Sam': 'History'})
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

Now let's create a Series just for some new student Kelly, which lists all of the courses she has taken. We'll set the index to Kelly, and the data to be the names of courses.

In [98]:
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

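Finally, we can combine all of the data in this new Series with the first. Older versions of pandas offered a `.append()` method on the Series for this, but it was removed in pandas 2.0, so we use the `pd.concat()` function instead. This creates a series which has our original people in it as well as all of Kelly's courses.

In [99]:
all_students_classes = pd.concat([students_classes, kelly_classes])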
all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

There are a couple of important considerations when combining Series with `.append()` (or with `pd.concat()`, which replaces it in pandas 2.0 and later). First, Pandas will take the series and try to infer the best data types to use. In this example, everything is a string, so there are no problems here. Second, the operation doesn't actually change the underlying Series objects; it instead returns a new series which is made up of the two appended together. This is a common pattern in pandas, by default returning a new object instead of modifying in place, and one you should come to expect. By printing the original series we can see that it hasn't changed.

In [100]:
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

Finally, we see that when we query the appended series for "Kelly", we don't get a single value, but all her instances. 

In [101]:
all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
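When working with an index that may contain duplicates, it can help to check for them up front. The sketch below uses a shortened version of the data above and `pd.concat()`, since `Series.append()` is unavailable in pandas 2.0 and later. The variable name `all_classes` is illustrative.

```python
import pandas as pd

students_classes = pd.Series({"Alice": "Physics", "Jack": "Chemistry"})
kelly_classes = pd.Series(["Philosophy", "Arts"], index=["Kelly", "Kelly"])
all_classes = pd.concat([students_classes, kelly_classes])

# is_unique reports whether any label appears more than once
print(all_classes.index.is_unique)          # False

# A unique label returns a single value, a repeated label returns a Series
print(type(all_classes.loc["Alice"]))       # <class 'str'>
print(all_classes.loc["Kelly"].tolist())    # ['Philosophy', 'Arts']
```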

In this lecture, we focused on one of the primary data types of the Pandas library, the Series. You learned how to query the Series with `.loc` and `.iloc`, that the Series is an indexed data structure, how to merge two Series objects together with `.append()` (`pd.concat()` in pandas 2.0 and later), and the importance of vectorization.

There are many more methods associated with the Series object that we haven't talked about. But with these basics down, we'll move on to talking about pandas' two-dimensional data structure, the `DataFrame`. The `DataFrame` is very similar to the Series object, but includes multiple columns of data, and is the structure that you'll spend the majority of your time working with when cleaning and aggregating data.