## Pandas Data Structures

Pandas has three basic data structures:

* Series - 1-dimensional; similar to a single column of data in a spreadsheet
* DataFrame - 2-dimensional; similar to a sheet with rows and columns in a spreadsheet
* Panel - 3-dimensional; multiple sheets (deprecated and removed in recent pandas releases)

In practice, Series and DataFrames are used much more frequently than Panels. In pandas, a DataFrame is essentially multiple Series that share the same index. It is generally not useful to think of pandas data structures in terms of primitive Python data structures (e.g. lists and dictionaries).
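As a rough illustration of the relationship between the two structures, the sketch below builds a DataFrame from two Series that share an index and then pulls one column back out as a Series. The names `songs` and `extra`, and the values in `extra`, are made up for this example.

```python
import pandas as pd

# Two Series sharing the same index labels (values in `extra` are illustrative only)
songs = pd.Series([145, 142, 38, 13],
                  index=['Paul', 'John', 'George', 'Ringo'],
                  name='songs')
extra = pd.Series([1, 2, 3, 4],
                  index=['Paul', 'John', 'George', 'Ringo'],
                  name='extra')

# Each Series becomes one column of the DataFrame
df = pd.DataFrame({'songs': songs, 'extra': extra})

# Selecting a single column gives back a Series
column = df['songs']
print(type(column).__name__)
print(df.shape)
```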

### Series

A Series is a one-dimensional, list-like data structure in pandas. It has a single _axis_ called the **index**. We can simulate the pandas Series data structure with a simple Python dictionary:

In [1]:
import pandas as pd

ser = {
    'index': [0, 1, 2, 3],
    'data': [145, 142, 38, 13],
    'name': 'songs',
}

ser

{'index': [0, 1, 2, 3], 'data': [145, 142, 38, 13], 'name': 'songs'}

With a dictionary like this, we can write a Python function that allows us to pull items out of this kind of data structure based on the `index`:

In [2]:
def get(ser, idx):
    # Fetch a 'data' value from a dictionary-based series (ser) given an index label (idx)
    value_idx = ser['index'].index(idx)  # position of the label within the index list
    return ser['data'][value_idx]

get(ser, 1)

142

In the call above, we supplied `get()` with the index `1`, which corresponds to the value `142` within `data`.

Despite what "index" might imply, it does not have to be integer-based; it can consist of strings or dates, it can contain duplicate values, and its order may be arbitrary.

In [3]:
songs = {
    'index': ['Paul', 'John', 'George', 'Ringo'],
    'data': [145, 142, 38, 13],
    'name': 'counts',
}

songs

{'index': ['Paul', 'John', 'George', 'Ringo'],
 'data': [145, 142, 38, 13],
 'name': 'counts'}

In [4]:
get(songs, 'John')

142

With the Series structure in pandas, the above dictionary-based structure can be expressed in a much simpler and more powerful way.

In [5]:
songs2 = pd.Series([145, 142, 38, 13], name='counts')

songs2

0    145
1    142
2     38
3     13
Name: counts, dtype: int64

Pandas displays the Series in a way that facilitates reading. The leftmost column is the *index*, while the associated *values* are in the rightmost column. In the language of pandas, the generic name for an index is an **axis**, and the individual entries of the index (here `0` through `3`) are called the **axis labels**. The same holds for DataFrames, which have two axes, one for the rows and one for the columns.

The `dtype` of a pandas Series object is the data type of its values (in a DataFrame, each column has its own `dtype`). In the case above, `int64` indicates a 64-bit integer. A Series or a DataFrame can hold strings, floats, Booleans, or arbitrary Python objects. The values don't even have to be of the same type (although mixing types slows things down and limits vectorized operations).

The `index` of a Series can be accessed by its attribute:

In [6]:
songs2.index

RangeIndex(start=0, stop=4, step=1)

Because the index of `songs2` was generated automatically, pandas used the default, which is monotonically increasing integers starting from zero. As mentioned above, an index can be string-based as well. In that case, however, pandas reports the `dtype` of the index as `object` rather than as a string type:

In [7]:
songs3 = pd.Series([145, 142, 38, 13],
                  name='counts',
                  index=['Paul', 'John', 'George', 'Ringo'])

songs3

Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64

In [8]:
songs3.index

Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')

The values of a Series do not have to be numeric, or even homogeneous. Mixed generic objects can be inserted into a Series:

In [9]:
class Foo:
    pass

ringo = pd.Series(['Richard', 'Starkey', 13, Foo()],
                 name='ringo')

ringo

0                                        Richard
1                                        Starkey
2                                             13
3    <__main__.Foo object at 0x000002954C4E4788>
Name: ringo, dtype: object

In the case of `ringo`, the `dtype` is `object`. The `object` dtype is used for strings as well as for heterogeneous data. Because vectorized operations work so well in pandas, we would much rather store numeric data as `int64` or `float64` than as `object`. Note that pandas stores time data in a dedicated data type, `datetime64`. If the `dtype` of a Series holding time data is reported as `object`, the time data was stored as strings (or generic Python objects) rather than as the pandas datetime type. While this may not be desirable, converting strings to datetime is not difficult in pandas.
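As a minimal sketch of that string-to-datetime conversion, the example below uses `pd.to_datetime()` on a Series of date strings. The date values and the names `dates_as_text` and `dates` are made up for illustration.

```python
import pandas as pd

# Dates stored as plain strings (illustrative values)
dates_as_text = pd.Series(['1968-11-22', '1969-09-26', '1970-05-08'])

# pd.to_datetime parses the strings into pandas datetime values
dates = pd.to_datetime(dates_as_text)

print(dates.dtype)     # a datetime64 dtype
print(dates.dt.year)   # datetime-specific accessors are now available
```

Once the values are stored as datetimes, the `.dt` accessor exposes components such as the year without any string handling.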

#### The NaN value

NaN stands for "Not a Number" and is encountered throughout the NumPy package. In pandas, when a Series is determined to hold numeric values but no number is available for an entry, pandas fills that entry with NaN. Aggregation methods such as `.sum()` and `.mean()` skip NaN by default, whereas element-wise arithmetic involving NaN yields NaN.

In [10]:
nan_ser = pd.Series([2, None],
                   index=['Ono', 'Clapton'],
                   )

nan_ser

Ono        2.0
Clapton    NaN
dtype: float64

Note the `dtype` of this Series is `float64` and not `int64`. This is because NaN is a floating-point value. Pandas determined that `nan_ser` holds numeric values, but the `Clapton` entry has no number associated with it (technically the value is `None`), so pandas filled it with NaN, and the entire Series has to use the `float64` `dtype`. When pandas reads a file, e.g. a .csv file, an empty value in an otherwise numeric column becomes NaN. Pandas offers convenient methods for working with missing values, such as `.fillna()` and `.dropna()`.
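The following sketch shows a few common ways of dealing with the missing entry in `nan_ser`. The variable names on the left-hand side are only for illustration.

```python
import pandas as pd

nan_ser = pd.Series([2, None], index=['Ono', 'Clapton'])

missing = nan_ser.isna()      # Boolean Series marking the NaN entries
filled = nan_ser.fillna(0)    # replace NaN with 0
dropped = nan_ser.dropna()    # remove the NaN entries

total = nan_ser.sum()         # aggregations skip NaN by default
shifted = nan_ser + 1         # element-wise arithmetic keeps NaN as NaN

print(filled)
print(dropped)
print(total)
print(shifted)
```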

#### The Series and NumPy Arrays

Many functions and methods associated with NumPy arrays work with pandas Series as well:

In [11]:
import numpy as np

numpy_ser = np.array([145, 142, 38, 13])

songs3[1]

142

In [12]:
numpy_ser[1]

142

In [13]:
songs3.mean()

84.5

In [14]:
numpy_ser.mean()

84.5

*Boolean arrays* are masks that can be used to filter out items, and they can be applied to both NumPy arrays and pandas Series.

In [15]:
mask = songs3 > songs3.mean() # Boolean array

mask

Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool

Boolean arrays can be used to filter the items of a sequence (either an array or a Series) by performing an index operation. If the mask has a `True` value for a given index, the value is kept. Otherwise, the value is dropped. Here, `mask` is a Boolean array that marks the locations whose value is greater than the mean of the Series.

In [16]:
songs3[mask]

Paul    145
John    142
Name: counts, dtype: int64

In [17]:
numpy_ser[numpy_ser > numpy_ser.mean()] # The same filter applied to a NumPy array

array([145, 142])

#### Creating a Series

The easiest way to create a pandas Series is to pass Python lists as arguments:

In [18]:
george_dupe = pd.Series([10, 7, 1, 22],
                       index=['1968', '1969', '1970', '1970'],
                       name='George Songs',
                       )

george_dupe

1968    10
1969     7
1970     1
1970    22
Name: George Songs, dtype: int64

Here we can see that we have used strings for the index entries, and that there are duplicated index entries (i.e., they are not unique). We created `george_dupe` by explicitly passing the constructor a list containing the index. We do not have to do this, in which case pandas generates a default integer index. We can also create a Series from a Python dictionary that maps index entries to values. An additional sequence can then be supplied through `index` to control which labels appear and in what order (older Python versions did not preserve dictionary order, so this used to be necessary). A caveat is that Python dictionaries can only have unique keys. We can try a workaround if we want to create the above Series with a dictionary:

In [19]:
g2 = pd.Series({'1969': 7,
               '1970': [1, 22],
               },
              index=['1969', '1970', '1970'])

g2

1969          7
1970    [1, 22]
1970    [1, 22]
dtype: object

Note that the above Series did not behave the way we wanted it to. Therefore, to create a Series with non-unique index entries, we should stick with the list method and not use dictionaries.
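For the ordinary case of unique labels, creating a Series from a dictionary is straightforward. The sketch below uses made-up names `g3` and `g4` and shows how the optional `index` argument selects and orders labels.

```python
import pandas as pd

# Keys become index labels, values become the Series values
g3 = pd.Series({'1968': 10, '1969': 7, '1970': 1}, name='George Songs')

# An explicit index selects and orders labels; labels missing from the dict get NaN
g4 = pd.Series({'1968': 10, '1969': 7, '1970': 1},
               index=['1970', '1968', '1971'])

print(g3)
print(g4)
```

Note that `g4` is stored as floats, because the label `'1971'` has no value in the dictionary and is therefore filled with NaN.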

#### Reading a Series

Data from a Series can be accessed via index operations, much like Python lists and dictionaries:

In [20]:
george_dupe['1968']    # Access the value at index entry '1968' of the Series george_dupe

10

Note that pandas normally returns a scalar value here (its exact type depends on the data), but not when the index label is non-unique. In that case a Series is returned instead:

In [21]:
george_dupe['1970']

1970     1
1970    22
Name: George Songs, dtype: int64

Data in a Series can be iterated upon:

In [22]:
for item in george_dupe:
    print(item)

10
7
1
22


Here's a quirk with the Series object: when iterating (via `.__iter__()`), it is the **values** of a Series that are iterated over. However, checking for membership within a Series (via `.__contains__()`) is done against the **index** entries of a Series. For example, if we do a Boolean check to see if "22" is in `george_dupe`:

In [23]:
22 in george_dupe

False

Python returned `False` because `.__contains__()` checks the **index** of `george_dupe`, which contains only year strings such as `'1970'` and no integers like `22`. To check whether 22 is among the **values** of `george_dupe`, we use the `.values` attribute of the Series:

In [24]:
22 in george_dupe.values

True

In [25]:
22 in set(george_dupe) # The same can be accomplished by querying the set of the Series

True

In [26]:
'1970' in george_dupe # Remember that membership queries on a Series are made against its index

True
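Another way to test the values rather than the index is the `.isin()` method, which checks every value against a list of candidates. A small sketch, with `is_match` and `has_22` as illustrative names:

```python
import pandas as pd

george_dupe = pd.Series([10, 7, 1, 22],
                        index=['1968', '1969', '1970', '1970'],
                        name='George Songs')

# .isin() tests each value and returns a Boolean Series
is_match = george_dupe.isin([22, 7])

# .any() collapses the Boolean Series into a single True/False
has_22 = george_dupe.isin([22]).any()

print(is_match)
print(has_22)
```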

Always remember that iterating over a Series goes over its **values** by default. To iterate over both the index label and the value, we can use the `.items()` method (older pandas versions also offered `.iteritems()`, which was removed in pandas 2.0):

In [27]:
for item in george_dupe.items():
    print(item)

('1968', 10)
('1969', 7)
('1970', 1)
('1970', 22)


#### Updating a Series

The standard assignment operation can be used in conjunction with the index operation to update the values of a Series:

In [28]:
george_dupe['1969'] = 6

george_dupe['1969']

6

We can add a new index and a corresponding value in the exact same way:

In [29]:
george_dupe['1973'] = 11

george_dupe

1968    10
1969     6
1970     1
1970    22
1973    11
Name: George Songs, dtype: int64

One thing to keep in mind: if the Series has non-unique index labels, assigning to a duplicated label with the above approach updates every entry that carries that label:

In [30]:
george_dupe['1970'] = 2

george_dupe

1968    10
1969     6
1970     2
1970     2
1973    11
Name: George Songs, dtype: int64

To avoid this, we can either use a DataFrame data structure instead (either with a column for artist, or a multi-index), or we can update values in the Series by position via the `.iloc` attribute:

In [31]:
george_dupe.iloc[3] = 22

george_dupe

1968    10
1969     6
1970     2
1970    22
1973    11
Name: George Songs, dtype: int64

Older pandas versions provided an `.append()` method for Series objects. It did not behave like the `.append()` method of Python lists; it behaved more like `.extend()`, in that it expected another pandas Series to append. The method was removed in pandas 2.0, and `pd.concat()` does the same job:

In [32]:
pd.concat([george_dupe, pd.Series({'1974': 9})])

1968    10
1969     6
1970     2
1970    22
1973    11
1974     9
dtype: int64

In [33]:
george_dupe # Note the above line of code DOES NOT modify the original Series but rather returns a new, extended Series.

1968    10
1969     6
1970     2
1970    22
1973    11
Name: George Songs, dtype: int64
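Since the combined Series is returned rather than stored, the result needs to be assigned to a name if it is to be kept. A minimal sketch using `pd.concat()`, which combines Series in current pandas releases (the names `extra` and `extended` are only for illustration):

```python
import pandas as pd

george_dupe = pd.Series([10, 6, 2, 22, 11],
                        index=['1968', '1969', '1970', '1970', '1973'],
                        name='George Songs')
extra = pd.Series({'1974': 9}, name='George Songs')

# pd.concat returns a new Series; assign it to keep the result
extended = pd.concat([george_dupe, extra])

print(len(george_dupe))  # the original is unchanged
print(extended)
```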

#### Deleting items in Series

While deleting entries is not common in pandas (it is more common to filter entries using masks such as Boolean arrays), it is possible. The `del` statement can be used to delete entries based on the index:

In [34]:
del george_dupe['1973']

george_dupe

1968    10
1969     6
1970     2
1970    22
Name: George Songs, dtype: int64

A more common way to remove unwanted entries in pandas is to filter the Series to get a new Series. For example, let's say we want to keep only the values less than or equal to 2 in `george_dupe`:

In [35]:
george_dupe[george_dupe <= 2]

1970    2
Name: George Songs, dtype: int64
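A non-destructive alternative to `del` is the `.drop()` method, which returns a new Series without the given labels. The sketch below recreates `george_dupe` so that it runs on its own; `without_1969` and `without_1970` are illustrative names.

```python
import pandas as pd

george_dupe = pd.Series([10, 6, 2, 22],
                        index=['1968', '1969', '1970', '1970'],
                        name='George Songs')

# .drop() returns a new Series and leaves the original untouched
without_1969 = george_dupe.drop('1969')

# Dropping a duplicated label removes every entry with that label
without_1970 = george_dupe.drop('1970')

print(without_1969)
print(without_1970)
```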

#### Indexing a Series

The index of a Series can be numeric, but it can also consist of strings:

In [36]:
george = pd.Series([10, 7],
                  index=['1968', '1969'],
                  name='George Songs',
                  )

george

1968    10
1969     7
Name: George Songs, dtype: int64

The `dtype` of the Series is `int64`, which is actually referring to the values of the Series. We can check the `dtype` of the Series' index instead:

In [37]:
george.index

Index(['1968', '1969'], dtype='object')

Indices do not have to be unique:

In [38]:
dupe = pd.Series([10, 2, 7],
                index=['1968', '1968', '1969'],
                name='George Songs',
                )

dupe

1968    10
1968     2
1969     7
Name: George Songs, dtype: int64

In [39]:
dupe.index.is_unique

False

In [40]:
george

1968    10
1969     7
Name: George Songs, dtype: int64

In [41]:
george.index.is_unique

True

A Series object can be indexed and sliced along its axis just like a NumPy array:

In [42]:
george[0]

10

In [43]:
george[-1]

7

Note that the above indexing operation is rather confusing, as it isn't clear whether we are indexing by the Series' *index labels* or by *position*. Recent pandas versions have deprecated this positional fallback of plain `[]` indexing. It is therefore recommended to use the explicit pandas accessors `.at[]`, `.iat[]`, `.loc[]`, and `.iloc[]`. Recall that when a Series has non-unique index labels, we can work around that by using `.iloc[]` to select an entry by position. Plain `[]` indexing, however, does not fall back to positions when the index labels are integers; the integer is treated as a label:

In [44]:
george_i = pd.Series([10, 7],
                    index=[1968, 1969],
                    name='George Songs',
                    )

george_i

1968    10
1969     7
Name: George Songs, dtype: int64

In [45]:
george_i[-1]

KeyError: -1

We can explicitly instruct pandas to index by label or by position with the `.loc[]` and `.iloc[]` attributes, respectively.

In [46]:
george.iloc[0]

10

In [47]:
george.iloc[-1]

7

In [48]:
george.iloc[4]

IndexError: single positional indexer is out-of-bounds

In [49]:
george.iloc['1968']

TypeError: Cannot index by location index with a non-integer key

In [50]:
george.iloc[0:3]

1968    10
1969     7
Name: George Songs, dtype: int64

In [51]:
george.iloc[[0,1]] # This is an additional functionality in pandas not available in base Python; indexing via a list of index positions.

1968    10
1969     7
Name: George Songs, dtype: int64

In contrast, `.loc[]` indexes by index label rather than by position. This is analogous to Python dictionary-based indexing, but with some additional functionality, such as accepting Boolean arrays.

In [52]:
george.loc['1968']

10

In [53]:
george.loc['1970']

KeyError: '1970'

In [54]:
george.loc[0]

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0] of <class 'int'>

In [55]:
george.loc['1968':]

1968    10
1969     7
Name: George Songs, dtype: int64
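Returning to the integer-labelled Series from earlier, the sketch below shows how `.loc[]` and `.iloc[]` remove the ambiguity. It also illustrates one difference worth keeping in mind: label slices include the end label, while position slices exclude the end position. The variable names are illustrative.

```python
import pandas as pd

george_i = pd.Series([10, 7], index=[1968, 1969], name='George Songs')

by_label = george_i.loc[1968]          # label lookup
by_position = george_i.iloc[-1]        # last position

label_slice = george_i.loc[1968:1969]  # label slices include the end label
position_slice = george_i.iloc[0:1]    # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```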

The `.loc[]` and `.iloc[]` accessors return a Series (or DataFrame) when the selection covers more than one entry. The `.at[]` and `.iat[]` accessors are analogous to `.loc[]` and `.iloc[]`, but are intended for fast access to a single value. With a duplicated label, what `.at[]` returns has varied between pandas versions; in the version used for these notes it returned a NumPy array:

In [56]:
george_dupe = pd.Series([10, 7, 1, 22],
                       index=['1968', '1969', '1970', '1970'],
                       name='George Songs',
                       )

george_dupe

1968    10
1969     7
1970     1
1970    22
Name: George Songs, dtype: int64

In [57]:
george_dupe.at['1970']

array([ 1, 22], dtype=int64)

In [58]:
george_dupe.loc['1970']

1970     1
1970    22
Name: George Songs, dtype: int64

#### Slicing

Slicing is supported with both `.loc[]` and `.iloc[]`. The usual form of `[start]:[end]:[stride]` is used here. For example, with `.iloc[]`, the behavior is as such:

* `.iloc[0:1]` - First item
* `.iloc[:1]` - First item (start default is 0)
* `.iloc[:-2]` - Take from the start up to, but not including, the second-to-last item
* `.iloc[::2]` - Take from the start to the end, skipping every other item

In [59]:
george.iloc[0:2]

1968    10
1969     7
Name: George Songs, dtype: int64
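The slice forms listed above can be tried on a slightly longer Series. This is a small sketch with a made-up five-element Series `s`:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

first = s.iloc[0:1]            # only the first item
also_first = s.iloc[:1]        # start defaults to 0
all_but_last_two = s.iloc[:-2]  # stops before the second-to-last item
every_other = s.iloc[::2]      # stride of 2

print(first.tolist())
print(all_but_last_two.tolist())
print(every_other.tolist())
```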

#### Boolean Arrays

Indexing with a Boolean array returns a filtered Series containing only the entries for which the Boolean condition is `True`.

In [60]:
mask = george > 7

mask

1968     True
1969    False
Name: George Songs, dtype: bool

In [61]:
george[mask]

1968    10
Name: George Songs, dtype: int64

What is happening here is that we first generate a Series by evaluating a Boolean operation. In this case, each value in the Series `george` is compared against 7, and the result is a Series of `True` or `False` values. Next, we apply the resulting `mask` Boolean Series to `george`, and the returned Series consists only of the values that passed the `mask` filter (i.e., the entries marked `True`). The comparison step relies on **broadcasting**: the `> 7` operation is broadcast, or applied, to every entry in the Series, and the result is a new Series holding the outcome of each of those operations. This is very commonly encountered in NumPy, and also in pandas.

Broadcasting is not limited to Boolean masks and arrays. It can also be a simple arithmetic operation applied over an entire Series:

In [62]:
george + 2

1968    12
1969     9
Name: George Songs, dtype: int64

Multiple Boolean operations can be combined:

* And: `ser[a & b]`
* Or: `ser[a | b]`
* Not: `ser[~a]`

Note that the operators `&`, `|`, and `~` bind more tightly than comparison operators such as `<=` and `>`. Hence, parentheses should be used around each comparison to avoid surprises.

In [63]:
mask2 = george <= 2

mask2

1968    False
1969    False
Name: George Songs, dtype: bool

In [64]:
george[mask | mask2]

1968    10
Name: George Songs, dtype: int64

In [65]:
george[mask | george <= 2] # Evaluated as (mask | george) <= 2, so this does not work as intended

1968    10
1969     7
Name: George Songs, dtype: int64

In [66]:
george[mask | (george <= 2)] # Parentheses are needed to ensure proper operation order

1968    10
Name: George Songs, dtype: int64
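The examples above cover the "or" case. As a sketch of the "and" and "not" operators, the code below uses a made-up four-element Series (also called `george` here, but with illustrative values) and wraps every comparison in parentheses.

```python
import pandas as pd

george = pd.Series([10, 7, 1, 22],
                   index=['1968', '1969', '1970', '1971'],
                   name='George Songs')

# And: both conditions must hold
mid_range = george[(george > 5) & (george < 20)]

# Not: invert a condition
not_above_5 = george[~(george > 5)]

print(mid_range)
print(not_above_5)
```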

### Series Methods

Generally, methods of a Series object return a new Series object rather than modifying the original. Some methods accept an `inplace` argument that changes this behavior. To obtain an independent duplicate of a Series, the `.copy()` method can be used.

In [67]:
songs_66 = pd.Series([3, None, 11, 9],
                    index=['George', 'Ringo', 'John', 'Paul'],
                    name='Counts',
                    )

songs_66

George     3.0
Ringo      NaN
John      11.0
Paul       9.0
Name: Counts, dtype: float64

In [68]:
songs_69 = pd.Series([18, 22, 7, 5],
                    index=['John', 'Paul', 'George', 'Ringo'],
                    name='Counts')

songs_69

John      18
Paul      22
George     7
Ringo      5
Name: Counts, dtype: int64

Recall that iteration (e.g. a `for` loop) goes over the values of a Series by default:

In [69]:
for value in songs_66:
    print(value)

3.0
nan
11.0
9.0


The Series object has an `.items()` method (`.iteritems()` in older pandas versions) that enables looping over index, value pairs:

In [70]:
for idx, value in songs_66.items():
    print(idx, value)

George 3.0
Ringo nan
John 11.0
Paul 9.0


To loop over the index labels instead of the values of a Series, the `.keys()` method can be used:

In [71]:
for idx in songs_66.keys(): # Note that the keys (index labels) are returned in the order of the index.
    print(idx)

George
Ringo
John
Paul


Pandas Series respond to operators element-wise, aligning on the index when the other operand is a Series:

* `+`: Adds a scalar (or a Series with matching index values), returns a Series
* `-`: Subtracts a scalar (or a Series with matching index values), returns a Series
* `/`: Divides by a scalar (or a Series with matching index values), returns a Series
* `//`: "Floor" divides by a scalar (or a Series with matching index values), returns a Series
* `*`: Multiplies by a scalar (or a Series with matching index values), returns a Series
* `%`: Modulus by a scalar (or a Series with matching index values), returns a Series
* `==` and `!=`: Equality with a scalar (or a Series with matching index values), returns a Series
* `>` and `<`: Greater/less than a scalar (or a Series with matching index values), returns a Series
* `>=` and `<=`: Greater/less than or equal to a scalar (or a Series with matching index values), returns a Series
* `^`: Element-wise XOR, returns a Series
* `|`: Element-wise OR, returns a Series
* `&`: Element-wise AND, returns a Series

Addition of two Series objects aligns them on their index labels. An index label that occurs in only one of the Series produces `NaN` in the result, and so does a label whose value is `NaN` in either Series.

In [72]:
songs_66 + songs_69 # The 'Ringo' entry will be NaN because its value in songs_66 is NaN

George    10.0
John      29.0
Paul      31.0
Ringo      NaN
Name: Counts, dtype: float64

To avoid `NaN` values in the result, we can use the `.fillna()` method to replace `NaN` with zeros (another fill value can be specified):

In [73]:
songs_66.fillna(0) + songs_69.fillna(0)

George    10.0
John      29.0
Paul      31.0
Ringo      5.0
Name: Counts, dtype: float64
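`.fillna()` only helps with `NaN` values that already exist in the inputs. When a label is missing from one Series altogether, the `.add()` method with its `fill_value` argument covers both cases. The sketch below uses two small made-up Series, `a` and `b`, whose indexes only partly overlap.

```python
import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'z'])

# Plain addition: labels present in only one Series become NaN
plain = a + b

# .add() with fill_value treats the missing side as 0 before adding
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```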

Using the `.loc[]` and `.iloc[]`, we can get and set values of a Series very easily.

In [74]:
songs_66.loc['John']

11.0

On the other hand, pandas also supports dotted attribute access for index labels that are valid attribute names (and don't conflict with pre-existing Series attributes):

In [75]:
songs_66.John

11.0

For convenience, pandas supports the `.get()` method, which accepts an optional default value to return if the lookup fails. Below, `songs_66` does not contain an index label `'Fred'`, so it returns `'missing'`.

In [76]:
songs_66.get('Fred', 'missing')

'missing'

We can use similar code to assign/set values within a Series. However, we need to be cautious, as any assignment that goes through `__setitem__()` updates the Series in place, with no return value:

In [77]:
songs_66.loc['John'] = 82

In [78]:
songs_66

George     3.0
Ringo      NaN
John      82.0
Paul       9.0
Name: Counts, dtype: float64

This behavior is the same if we use the dotted attribute to set the value:

In [79]:
songs_66.John = 81

In [80]:
songs_66

George     3.0
Ringo      NaN
John      81.0
Paul       9.0
Name: Counts, dtype: float64

Accessing and setting values in a Series via the dotted attribute accessor can lead to unexpected behavior if the index labels of a Series conflict with Python keywords or with methods and attributes of pandas objects. Hence, it is recommended to avoid setting and accessing values that way, and to stick with `.loc[]` and `.iloc[]`. The Python module `keyword` can be helpful, as it supplies a `kwlist` containing all the keywords currently in use:

In [81]:
import keyword

print(keyword.kwlist)

['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']


Since we are just talking about Series right now, `.loc[]` and `.iloc[]` basically behave exactly like `.at[]` and `.iat[]`. In a 2-dimensional DataFrame, `.loc[]` and `.iloc[]` would return columns and rows (depending on which axis was specified), while `.at[]` and `.iat[]` would return a single value.

In [82]:
songs_66.at['John'] = 80

songs_66

George     3.0
Ringo      NaN
John      80.0
Paul       9.0
Name: Counts, dtype: float64

Note that setting values with `.loc[]` and `.at[]` can be problematic if there are duplicated (i.e. non-unique) index labels. In that case `.iloc[]` and `.iat[]` can be used as a workaround.

In [83]:
george = pd.Series([10, 7, 1, 22],
                  index=['1962', '1969', '1970', '1970'],
                  name='George Songs',
                  )

george

1962    10
1969     7
1970     1
1970    22
Name: George Songs, dtype: int64

In [84]:
george.at['1970'] = 23 # This will update both entries with index label '1970'

george

1962    10
1969     7
1970    23
1970    23
Name: George Songs, dtype: int64

In [85]:
george.iat[2] = 3 # Using .iat[] or .iloc[] is a workaround if only one of the entries is to be updated.

george

1962    10
1969     7
1970     3
1970    23
Name: George Songs, dtype: int64

One way to quickly retrieve the index positions for a given label is to use a list comprehension on the `.items()` method (`.iteritems()` in older pandas versions) in combination with the Python `enumerate()` function:

In [86]:
[pos for pos, x in enumerate(george.items()) if x[0]=='1970']

[2, 3]

Here, we're telling Python to build a list by looping over the enumerated index-value pairs in `george`, keeping the position (`pos`) only if the index label (selected by `x[0]`) equals `'1970'`.
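The same positions can also be found without an explicit loop. As an alternative sketch (not the approach used above), comparing the index against the label gives a Boolean NumPy array, and `np.flatnonzero()` returns the positions where it is `True`. The names `matches` and `positions` are illustrative.

```python
import numpy as np
import pandas as pd

george = pd.Series([10, 7, 3, 23],
                   index=['1962', '1969', '1970', '1970'],
                   name='George Songs')

# Comparing the index to a label yields a Boolean NumPy array
matches = george.index == '1970'

# np.flatnonzero returns the positions of the True entries
positions = np.flatnonzero(matches)

print(positions.tolist())
```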

#### Reset Index

Pandas has a built-in method `.reset_index()` that resets the index of a Series (or a DataFrame) to the default monotonically increasing integers starting from zero. Note that by default `.reset_index()` returns a DataFrame even when called on a Series, and moves the current index values to a column named `index`.

In [87]:
songs_66.reset_index()

    index  Counts
0  George     3.0
1   Ringo     NaN
2    John    80.0
3    Paul     9.0


We can pass `True` to the `drop` parameter for `.reset_index()` to return a Series instead:

In [88]:
songs_66.reset_index(drop=True)

0     3.0
1     NaN
2    80.0
3     9.0
Name: Counts, dtype: float64

The method `.reindex()` can be used instead if a specific index order is desired. The result is conformed to the index passed in. Labels in the new index that are not present in the original Series receive the value of the optional parameter `fill_value`, which defaults to `NaN`.

In [89]:
songs_66.reindex(['Billy', 'Eric', 'George', 'Yoko'])

Billy     NaN
Eric      NaN
George    3.0
Yoko      NaN
Name: Counts, dtype: float64
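A short sketch of the `fill_value` parameter follows. It recreates `songs_66` with the values it holds at this point so the example runs on its own; the result names are illustrative. Note that `fill_value` applies only to labels that are new to the Series, not to `NaN` values already stored in it.

```python
import pandas as pd

songs_66 = pd.Series([3, None, 80, 9],
                     index=['George', 'Ringo', 'John', 'Paul'],
                     name='Counts')

# New labels receive fill_value instead of NaN
with_zeros = songs_66.reindex(['Billy', 'Eric', 'George', 'Yoko'], fill_value=0)

# An existing NaN ('Ringo') is carried over unchanged
kept_nan = songs_66.reindex(['Ringo', 'Billy'], fill_value=0)

print(with_zeros)
print(kept_nan)
```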

The method `.rename()` relabels the index instead. It accepts a dictionary that maps the current index labels to new labels, or a function that accepts a label and returns a new one:

In [90]:
songs_66.rename({'Ringo': 'Richard'})

George      3.0
Richard     NaN
John       80.0
Paul        9.0
Name: Counts, dtype: float64

In [91]:
songs_66.rename(lambda x: x.lower())

george     3.0
ringo      NaN
john      80.0
paul       9.0
Name: Counts, dtype: float64

Because the index of a Series can also be accessed with the `.index` attribute, the index can be changed that way as well. However, you may have noticed that most pandas methods do not change the object they operate on, but return a new object (e.g. a Series). This is not the case when we assign to the `.index` attribute, which modifies the Series in place. Hence, it may be more consistent and less confusing to stick with the pandas methods instead of attribute assignment when modifying pandas objects.

In [92]:
idx = songs_66.index

idx

Index(['George', 'Ringo', 'John', 'Paul'], dtype='object')

In [93]:
idx2 = range(len(idx))

idx2

range(0, 4)

In [94]:
list(idx2)

[0, 1, 2, 3]

In [95]:
songs_66.index = idx2

songs_66

0     3.0
1     NaN
2    80.0
3     9.0
Name: Counts, dtype: float64

In [96]:
songs_66.index

RangeIndex(start=0, stop=4, step=1)

#### Counts

There are several methods in pandas that give us summary information, such as counts, about a set of data.

In [97]:
songs_66 = pd.Series([3, None, 11, 9],
                    index=['George', 'Ringo', 'John', 'Paul'],
                    name='Counts',
                    )

songs_66

George     3.0
Ringo      NaN
John      11.0
Paul       9.0
Name: Counts, dtype: float64

In [98]:
scores2 = pd.Series([67.3, 100, 96.7, None, 100],
                   index=['Ringo', 'Paul', 'George', 'Peter', 'Billy'],
                   name='test2',
                   )

scores2

Ringo      67.3
Paul      100.0
George     96.7
Peter       NaN
Billy     100.0
Name: test2, dtype: float64

The method `.count()` returns the number of non-null items.

In [99]:
scores2.count() # .count() will not count NaN entries

4

The method `.value_counts()` generates a histogram of sorts. It returns a Series indexed by the values found in the input Series:

In [100]:
scores2.value_counts()

100.0    2
96.7     1
67.3     1
Name: test2, dtype: int64

The `.unique()` and `.nunique()` methods return the unique values and the number of unique non-NaN values in the Series, respectively:

In [101]:
scores2.unique()

array([ 67.3, 100. ,  96.7,   nan])

In [102]:
scores2.nunique()

3

The `.drop_duplicates()` method drops duplicated values and returns a new Series (unless `inplace=True` is passed).

In [103]:
scores2.drop_duplicates()

Ringo      67.3
Paul      100.0
George     96.7
Peter       NaN
Name: test2, dtype: float64

To check whether a Series contains duplicated values, we can use the `.duplicated()` method:

In [104]:
scores2.duplicated()

Ringo     False
Paul      False
George    False
Peter     False
Billy      True
Name: test2, dtype: bool
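By default only the later occurrences are flagged, which is why `Paul` is `False` above. If the goal is to see every entry that shares its value with another, `keep=False` can be passed. A small sketch, with `repeated` as an illustrative name:

```python
import pandas as pd

scores2 = pd.Series([67.3, 100, 96.7, None, 100],
                    index=['Ringo', 'Paul', 'George', 'Peter', 'Billy'],
                    name='test2')

# keep=False flags every member of a duplicated group, not just the later ones
repeated = scores2[scores2.duplicated(keep=False)]

print(repeated)
```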

Manipulating a Series by the labels of the index requires a little more effort (unless you reset the index and work with the resulting DataFrame).

In [105]:
scores3 = pd.Series([67.3, 100, 96.7, None, 100, 79],
                   index=['Ringo', 'Paul', 'George', 'Peter', 'Billy', 'Paul'],
                   )

scores3

Ringo      67.3
Paul      100.0
George     96.7
Peter       NaN
Billy     100.0
Paul       79.0
dtype: float64

We can use the `.groupby()` method to group the Series by the index. In this case it will group the `'Paul'` entries together. The method `.groupby()` returns a GroupBy object, to which another method must be applied. We can use `.first()` to have the GroupBy object return the first value in each group, or the `.last()` method to return the last value in each group.

In [106]:
scores3.groupby(scores3.index).first()

Billy     100.0
George     96.7
Paul      100.0
Peter       NaN
Ringo      67.3
dtype: float64

In [107]:
scores3.groupby(scores3.index).last()

Billy     100.0
George     96.7
Paul       79.0
Peter       NaN
Ringo      67.3
dtype: float64
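Other aggregations can be applied to the grouped object in the same way. The sketch below averages the values per label, and also shows an alternative that keeps the first entry per label while preserving the original order, using `Index.duplicated()`. The result names are illustrative.

```python
import pandas as pd

scores3 = pd.Series([67.3, 100, 96.7, None, 100, 79],
                    index=['Ringo', 'Paul', 'George', 'Peter', 'Billy', 'Paul'])

# Average the values that share an index label
mean_per_label = scores3.groupby(level=0).mean()

# Keep only the first entry for each label, in the original order
first_only = scores3[~scores3.index.duplicated(keep='first')]

print(mean_per_label)
print(first_only)
```

Unlike the grouped result, `first_only` is not sorted by label, which can matter when the original order carries meaning.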

### Statistics with Series

Pandas includes many basic statistical methods that can be called on Series (and DataFrames). For example, `.sum()` returns the sum of a Series:

In [108]:
songs_66.sum() # Note that generally these methods ignore NaN values. This can be switched by passing skipna=False

23.0

The methods `.mean()` and `.median()` perform as expected:

In [109]:
songs_66.mean()

7.666666666666667

In [110]:
songs_66.median()

9.0

The method `.quantile()` accepts a quantile between 0 and 1 as its argument, where 0.5 (the median) is the default.

In [111]:
songs_66.quantile() # 50% quantile

9.0

In [112]:
songs_66.quantile(0.1) # 10% quantile

4.2

In [113]:
songs_66.quantile(0.9) # 90% quantile

10.6
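The values 4.2 and 10.6 do not occur in the Series because `.quantile()` interpolates linearly between neighbouring values by default. The sketch below reproduces the 10% quantile by hand; the helper variables (`values`, `pos`, `lower`, `upper`, `frac`, `manual`) are illustrative only.

```python
import pandas as pd

songs_66 = pd.Series([3, None, 11, 9],
                     index=['George', 'Ringo', 'John', 'Paul'],
                     name='Counts')

values = sorted(songs_66.dropna())   # NaN is ignored: [3.0, 9.0, 11.0]
q = 0.1

pos = q * (len(values) - 1)          # fractional position within the sorted values
lower = int(pos)                     # position just below
upper = min(lower + 1, len(values) - 1)
frac = pos - lower

# Linear interpolation between the two neighbouring values
manual = values[lower] + frac * (values[upper] - values[lower])

print(manual)
print(songs_66.quantile(q))
```

Here the position is 0.2, so the result lies 20% of the way from 3 to 9, i.e. 3 + 0.2 × 6 = 4.2.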

Pandas includes a summary method, `.describe()`, for when we just want a quick rundown of the statistics of a Series.

In [114]:
songs_66.describe()

count     3.000000
mean      7.666667
std       4.163332
min       3.000000
25%       6.000000
50%       9.000000
75%      10.000000
max      11.000000
Name: Counts, dtype: float64

Specific percentiles can be passed to `.describe()` to override the defaults:

In [115]:
songs_66.describe(percentiles=[.05, .1, .2])

count     3.000000
mean      7.666667
std       4.163332
min       3.000000
5%        3.600000
10%       4.200000
20%       5.400000
50%       9.000000
max      11.000000
Name: Counts, dtype: float64

The methods `.max()` and `.min()` return the maximum and minimum values of the Series, respectively. The `.idxmax()` and `.idxmin()` methods are analogous, but return the index labels of the maximum and minimum values instead.

In [116]:
songs_66.min()

3.0

In [117]:
songs_66.max()

11.0

In [118]:
songs_66.idxmin()

'George'

In [119]:
songs_66.idxmax()

'John'

Here is a summary of other statistical methods available for Series:

* `.var()`: Returns variance
* `.std()`: Returns standard deviation
* `.mad()`: Returns mean absolute deviation (removed in pandas 2.0)
* `.skew()`: Returns skew
* `.kurt()`: Returns kurtosis
* `.cov()`: Returns covariance, requires another Series object as argument
* `.corr()`: Returns correlation, requires another Series object as argument
* `.autocorr()`: Returns autocorrelation (correlation with itself shifted by one position)
* `.diff()`: Returns first discrete difference
* `.cumsum()`: Returns cumulative sum
* `.cumprod()`: Returns cumulative product
* `.cummin()`: Returns cumulative minimum
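
A few of the methods listed above can be tried on a small example. This is only a sketch with made-up values (`demo` is hypothetical). Because `.mad()` is not available in pandas 2.0 and later, the mean absolute deviation is computed by hand here.

```python
import pandas as pd

demo = pd.Series([1.0, 2.0, 4.0, 8.0])   # hypothetical values

variance = demo.var()            # sample variance (divides by n - 1)
running_total = demo.cumsum()    # 1, 3, 7, 15
step_changes = demo.diff()       # NaN, 1, 2, 4 (the first entry has no predecessor)

# Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = (demo - demo.mean()).abs().mean()

print(variance)
print(running_total.tolist())
print(step_changes.tolist())
print(mean_abs_dev)
```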


### Converting values of a Series

Pandas includes many methods that let us manipulate the values of a Series. For example, we can use `.round()` to round all values of a Series to the nearest whole number (the values remain floats):

In [120]:
songs_66.round()

George     3.0
Ringo      NaN
John      11.0
Paul       9.0
Name: Counts, dtype: float64
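
Since the values of `songs_66` are already whole numbers, the effect of `.round()` is not visible above. The sketch below uses made-up values to show rounding to the nearest whole number and to a given number of decimals. Exact halves are rounded to the nearest even number, following NumPy's behavior.

```python
import pandas as pd

halves = pd.Series([0.5, 1.5, 2.5, 2.4, 2.6])   # hypothetical values

# Nearest whole number; exact halves go to the even neighbour.
rounded = halves.round()

# Passing a number of decimals keeps that many digits after the point.
one_decimal = pd.Series([3.14159, 2.71828]).round(1)

print(rounded.tolist())
print(one_decimal.tolist())
```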

We can "clip" the values of a Series with `.clip()` by passing a lower and an upper bound. Values below `lower` are raised to `lower`, and values above `upper` are lowered to `upper`:

In [121]:
songs_66.clip(upper=90, lower=80)

George    80.0
Ringo      NaN
John      80.0
Paul      80.0
Name: Counts, dtype: float64
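
In the output above every value lies below the lower bound, so all of them become 80. A sketch with made-up values (`demo` is hypothetical) shows both bounds at work:

```python
import pandas as pd

demo = pd.Series([1, 5, 10, 15])   # hypothetical values

# 1 is raised to the lower bound, 15 is lowered to the upper bound.
clipped = demo.clip(lower=4, upper=12)

print(clipped.tolist())
```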

Note that neither `.round()` nor `.clip()` changes the `dtype` of the Series' values. To do so, we need to use `.astype()`:

In [122]:
songs_66.astype(str)

George     3.0
Ringo      nan
John      11.0
Paul       9.0
Name: Counts, dtype: object

Here we have converted all the values of `songs_66` to strings, hence the `dtype` is now `object`. Converting data from one `dtype` to another is not always trivial, and sometimes requires methods other than `.astype()`. Also, the presence of `NaN` values can raise an exception (for example, when casting a float Series that contains `NaN` to `int`), so it is prudent to deal with any `NaN` values first and then do the conversion. Many methods also accept arguments that tell them how to handle `NaN` values.

For example, we can use the function `to_numeric()` to convert values to a numeric `dtype`. When a value cannot be converted, the function raises an error. However, if we pass `errors='coerce'`, it fills in `NaN` wherever a value cannot be converted:

In [123]:
pd.to_numeric(songs_66.astype(str), errors='coerce')

George     3.0
Ringo      NaN
John      11.0
Paul       9.0
Name: Counts, dtype: float64
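
The example above contains no value that actually fails to convert apart from the string `'nan'`. The sketch below uses made-up input (`raw` and `raw_whole` are hypothetical) with a clearly non-numeric entry, and then drops the resulting `NaN` before casting to integers:

```python
import pandas as pd

raw = pd.Series(['1', '2.5', 'abc'])        # hypothetical input; 'abc' is not numeric
converted = pd.to_numeric(raw, errors='coerce')   # 'abc' becomes NaN

raw_whole = pd.Series(['1', '2', 'abc'])    # hypothetical input
# Drop the NaN first, because NaN cannot be stored in a NumPy integer dtype.
whole = pd.to_numeric(raw_whole, errors='coerce').dropna().astype(int)

print(converted)
print(whole)
```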

To convert to datetime `dtype`, we can use the pandas function `to_datetime()`, with similar error handling:

In [124]:
pd.to_datetime(pd.Series(['Sep 7, 2001',
                         '9/8/2001',
                         '9-9-2001',
                         '10th of September 2001',
                         'Once de Septiembre 2001', # this will raise an error if we don't pass errors='coerce'
                        ]),
              errors='coerce')

0   2001-09-07
1   2001-09-08
2   2001-09-09
3   2001-09-10
4          NaT
dtype: datetime64[ns]
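
When all dates share one known layout, passing an explicit `format=` avoids relying on format guessing. This is a sketch with made-up input (`raw_dates` is hypothetical):

```python
import pandas as pd

raw_dates = pd.Series(['2001-09-07', '2001-09-08', 'not a date'])  # hypothetical input

# Entries that do not match the format become NaT because of errors='coerce'.
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

print(parsed)
```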

### Dealing with None

We can use the `.fillna()` method to fill in `NaN` values with whatever we want:

In [125]:
songs_66.fillna(-1)

George     3.0
Ringo     -1.0
John      11.0
Paul       9.0
Name: Counts, dtype: float64

Or, we can drop `NaN` values with `.dropna()`:

In [126]:
songs_66.dropna()

George     3.0
John      11.0
Paul       9.0
Name: Counts, dtype: float64

The `.notnull()` method generates a Boolean array that is `True` wherever the value is not `NaN`. The resulting mask can be used just like any other Boolean mask:

In [127]:
val_mask = songs_66.notnull()

val_mask

George     True
Ringo     False
John       True
Paul       True
Name: Counts, dtype: bool

In [128]:
songs_66[val_mask]

George     3.0
John      11.0
Paul       9.0
Name: Counts, dtype: float64

Analogously, `.isnull()` generates a Boolean mask of values that are `NaN`:

In [130]:
nan_mask = songs_66.isnull()

nan_mask

George    False
Ringo      True
John      False
Paul      False
Name: Counts, dtype: bool

In [131]:
songs_66[nan_mask]

Ringo   NaN
Name: Counts, dtype: float64

Note that using the `~` operator, we can flip a Boolean mask:

In [132]:
~nan_mask

George     True
Ringo     False
John       True
Paul       True
Name: Counts, dtype: bool
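
Boolean masks can also be combined with `&` (and) and `|` (or). The sketch below assumes a Series `songs_demo` rebuilt from the values printed earlier in this section; it is not the original `songs_66` object.

```python
import pandas as pd

# Rebuilt from the output shown in this section; Ringo's entry is NaN.
songs_demo = pd.Series([3.0, float('nan'), 11.0, 9.0],
                       index=['George', 'Ringo', 'John', 'Paul'],
                       name='Counts')

# Parentheses around the comparisons are required, because & and | bind
# more tightly than > and <.
valid_and_large = songs_demo[songs_demo.notnull() & (songs_demo > 5)]
missing_or_small = songs_demo[songs_demo.isnull() | (songs_demo < 5)]

print(valid_and_large)
print(missing_or_small)
```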

The `.first_valid_index()` and `.last_valid_index()` methods return the index labels of the first and last non-`NaN` values, respectively:

In [135]:
songs_66.first_valid_index()

'George'

In [136]:
songs_66.last_valid_index()

'Paul'

### Matrix Operations

Because much of pandas is based on NumPy, matrix operations work well on Series and DataFrames. For example, `.dot()` calculates dot products. Note that the result is `NaN` if any of the entries are `NaN`:

In [140]:
songs_66.dot(songs_66)

nan

In [142]:
songs_66.dropna().dot(songs_66.dropna())

211.0
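
The dot product of a Series with itself is the sum of the squares of its values: 3² + 11² + 9² = 211. A sketch that checks this by hand, again assuming a Series `songs_demo` rebuilt from the printed values:

```python
import pandas as pd

# Rebuilt from the output shown in this section; Ringo's entry is NaN.
songs_demo = pd.Series([3.0, float('nan'), 11.0, 9.0],
                       index=['George', 'Ringo', 'John', 'Paul'],
                       name='Counts')

# Element-wise product followed by .sum(), which skips the NaN entry.
manual_dot = (songs_demo * songs_demo).sum()

# The same result via .dot() after dropping the NaN entry.
dropna_dot = songs_demo.dropna().dot(songs_demo.dropna())

print(manual_dot, dropna_dot)
```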

The transpose can be obtained either from the `.T` attribute or the `.transpose()` method. Because a Series is one-dimensional, its transpose is the Series itself:

In [143]:
songs_66.T

George     3.0
Ringo      NaN
John      11.0
Paul       9.0
Name: Counts, dtype: float64

In [144]:
songs_66.transpose()

George     3.0
Ringo      NaN
John      11.0
Paul       9.0
Name: Counts, dtype: float64

### Appending, combining, and joining two Series

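Two Series can be concatenated by using the `.append()` method. Note that this is unlike the `.append()` method of a Python list: the pandas `.append()` method takes another Series object as an argument and returns a new Series rather than modifying the original. (Note that `Series.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, where `pd.concat()` is used instead.)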

In [146]:
songs_66.append(songs_69)

George     3.0
Ringo      NaN
John      11.0
Paul       9.0
John      18.0
Paul      22.0
George     7.0
Ringo      5.0
Name: Counts, dtype: float64

By default, the pandas `.append()` method retains all duplicate index labels. If we pass `verify_integrity=True`, it raises an error when the indices overlap:

In [147]:
songs_66.append(songs_69, verify_integrity=True) # This will raise an error as there are duplicated indices

ValueError: Indexes have overlapping values: Index(['John', 'Paul', 'George', 'Ringo'], dtype='object')
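
On pandas versions where `Series.append()` is not available, the same concatenation can be written with `pd.concat()`, which also accepts `verify_integrity=`. The following is a sketch with two small made-up Series (`first` and `second` are hypothetical stand-ins, not the tutorial's `songs_66` and `songs_69`):

```python
import pandas as pd

# Hypothetical stand-ins for two Series with overlapping index labels.
first = pd.Series([3.0, 11.0], index=['George', 'John'], name='Counts')
second = pd.Series([18.0, 7.0], index=['John', 'George'], name='Counts')

# Duplicate labels are kept, just as with .append().
combined = pd.concat([first, second])

# With verify_integrity=True, overlapping labels raise a ValueError.
try:
    pd.concat([first, second], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(combined)
print(overlap_detected)
```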

To perform element-wise operations on two Series objects, we can use the `.combine()` method, which takes another Series object as an argument, as well as a function to be applied to each pair of values:

In [149]:
def avg(v1, v2):
    return (v1 + v2)/2.0

songs_66.combine(songs_69, avg)

George     5.0
John      14.5
Paul      15.5
Ringo      NaN
Name: Counts, dtype: float64
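
`.combine()` also takes a `fill_value=` argument, which is used when an index label is present in only one of the two Series. A sketch with made-up values (`left` and `right` are hypothetical):

```python
import pandas as pd

left = pd.Series({'a': 1.0, 'b': 2.0})    # hypothetical values
right = pd.Series({'b': 4.0, 'c': 6.0})   # hypothetical values

# 'a' is missing from right and 'c' is missing from left;
# the missing side is treated as 0 before max() is applied.
with_fill = left.combine(right, max, fill_value=0)

print(with_fill)
```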

The `.update()` method provides a way to quickly replace the values of a Series with the values of another Series that share the same index labels (`NaN` values in the other Series are ignored). However, note that unlike most other pandas methods, `.update()` behaves as if `inplace=True` was passed: it does not return a new Series but actually replaces values in the Series the method was called upon.

In [151]:
songs_66.update(songs_69)

songs_66

George     7.0
Ringo      5.0
John      18.0
Paul      22.0
Name: Counts, dtype: float64
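
The sketch below illustrates these rules with made-up values (`target` and `source` are hypothetical): only matching labels are updated, `NaN` values in the other Series are skipped, and the method returns `None`.

```python
import numpy as np
import pandas as pd

target = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])       # hypothetical values
source = pd.Series([10.0, np.nan, 99.0], index=['a', 'b', 'z'])  # hypothetical values

# 'a' is replaced; 'b' is kept because the new value is NaN;
# 'z' is ignored because target has no such label.
result = target.update(source)

print(result)
print(target)
```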

The `.repeat()` method repeats each of the Series' entries a given number of times:

In [152]:
songs_69.repeat(2)

John      18
John      18
Paul      22
Paul      22
George     7
George     7
Ringo      5
Ringo      5
Name: Counts, dtype: int64

### Sorting a Series

Pandas includes a `.sort_values()` method for sorting a Series or a DataFrame. You can pass the `kind=` parameter to choose the sorting algorithm. The default is `kind='quicksort'`:

In [156]:
songs_66.sort_values()

Ringo      5.0
George     7.0
John      18.0
Paul      22.0
Name: Counts, dtype: float64

Note that if several rows share the same value, the sorting algorithm may rearrange those rows relative to one another. To preserve their original order, use a stable sort such as `kind='mergesort'`:

In [158]:
s = pd.Series([2, 2, 2], index=['a2', 'a1', 'a3'])

s.sort_values(kind='mergesort')

a2    2
a1    2
a3    2
dtype: int64

Other types of sort may not preserve that order:

In [159]:
s.sort_values(kind='heapsort')

a1    2
a3    2
a2    2
dtype: int64
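
Stability also matters when only some of the values are tied. In this sketch with made-up values (`ties` is hypothetical), the stable sort keeps `'b'` before `'d'` and `'a'` before `'c'`, matching their original order:

```python
import pandas as pd

ties = pd.Series([2, 1, 2, 1], index=['a', 'b', 'c', 'd'])   # hypothetical values

# A stable sort keeps tied entries in their original relative order.
stable = ties.sort_values(kind='mergesort')

print(stable)
```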

We can pass `ascending=False` to the method for a descending sort:

In [161]:
songs_66.sort_values(ascending=False)

Paul      22.0
John      18.0
George     7.0
Ringo      5.0
Name: Counts, dtype: float64

The `.sort_index()` method works similarly, but sorts by index labels instead:

In [162]:
songs_66.sort_index()

George     7.0
John      18.0
Paul      22.0
Ringo      5.0
Name: Counts, dtype: float64

In [163]:
songs_66.sort_index(ascending=False)

Ringo      5.0
Paul      22.0
John      18.0
George     7.0
Name: Counts, dtype: float64

The method `.rank()` assigns each entry its rank, i.e. the position it would have if the Series were sorted in ascending order:

In [164]:
songs_66.rank()

George    2.0
Ringo     1.0
John      3.0
Paul      4.0
Name: Counts, dtype: float64
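
When values are tied, the `method=` argument of `.rank()` decides how the ranks are shared. A sketch with made-up values (`tied` is hypothetical):

```python
import pandas as pd

tied = pd.Series([10, 20, 20, 30])   # hypothetical values with one tie

average_rank = tied.rank()                 # default: tied entries share the average rank
min_rank = tied.rank(method='min')         # tied entries get the lowest rank of the group
dense_rank = tied.rank(method='dense')     # like 'min', but ranks increase by 1 between groups

print(average_rank.tolist())
print(min_rank.tolist())
print(dense_rank.tolist())
```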

### Applying a function across a Series

Similar to broadcasting, we can use the `.map()` method to apply a function to every entry of a Series:

In [165]:
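def format_count(x):
    # Use the singular form only when the count is exactly 1.
    # (The name format_count avoids shadowing Python's built-in format().)
    if x == 1:
        template = '{} song'
    else:
        template = '{} songs'
    return template.format(x)

songs_66.map(format_count)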

George     7.0 songs
Ringo      5.0 songs
John      18.0 songs
Paul      22.0 songs
Name: Counts, dtype: object

`.map()` also accepts a dictionary as its argument, in which case each value of the Series is looked up as a key in the dictionary and replaced by the corresponding dictionary value:

In [167]:
song_dict = {
    5: None,
    18.: 21,
    22.: 23,
}

songs_66.map(song_dict) # No key in the dictionary matches George's value (7.0), so NaN is used instead

George     NaN
Ringo      NaN
John      21.0
Paul      23.0
Name: Counts, dtype: float64

Lastly, `.map()` accepts Series objects as arguments. The behavior in this case is the same as passing a dictionary, with the index of the passed Series acting as the keys:

In [168]:
mapping = pd.Series({22.: 33})

songs_66.map(mapping)

George     NaN
Ringo      NaN
John       NaN
Paul      33.0
Name: Counts, dtype: float64

The `.apply()` method behaves similarly to `.map()`, but only accepts functions and not Series or dictionaries.
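
One practical difference is that `.apply()` can forward extra arguments to the function. A sketch with made-up values; `counts` and the helper `scale()` are hypothetical names used only for illustration:

```python
import pandas as pd

counts = pd.Series([7.0, 5.0, 18.0])   # hypothetical values


def scale(value, factor):
    """Multiply a single value by factor."""
    return value * factor


doubled = counts.apply(scale, factor=2)    # extra keyword is forwarded to scale()
tripled = counts.apply(scale, args=(3,))   # extra positional arguments go in args

print(doubled.tolist())
print(tripled.tolist())
```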

### Serialization

A pandas Series object can be serialized into a file, such as a .csv file. We can use the `.to_csv()` method for this:

In [169]:
from io import StringIO

fout = StringIO()
songs_66.to_csv(fout)
print(fout.getvalue())

,Counts
George,7.0
Ringo,5.0
John,18.0
Paul,22.0



To write to an actual .csv file, we should use a *context manager*:

In [174]:
with open('./songs_66.csv', 'w') as fout:
    songs_66.to_csv(fout)

Notice that the header in the above output is missing a label for the index column. We can remedy that by passing the `index_label=` argument to `.to_csv()`:

In [176]:
fout = StringIO()
songs_66.to_csv(fout, index_label='Name')
print(fout.getvalue())

Name,Counts
George,7.0
Ringo,5.0
John,18.0
Paul,22.0



Conversely, we can use the `read_csv()` function to create a Series from a .csv file:

In [184]:
fout.seek(0)
series = pd.read_csv(fout, index_col='Name', squeeze=True)

series

Name
George     7.0
Ringo      5.0
John      18.0
Paul      22.0
Name: Counts, dtype: float64

Note that `read_csv()` reads the file into a DataFrame by default. To get a Series out of the file, we must first tell the function that the column 'Name' should be used as the index by passing `index_col=`, and then pass `squeeze=True`; otherwise the function returns a one-column DataFrame. (The `squeeze=` argument was deprecated in pandas 1.4 and removed in pandas 2.0; in newer versions, call `.squeeze('columns')` on the returned DataFrame instead.)
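
On newer pandas versions, the same result can be obtained by squeezing the one-column DataFrame after reading it. This is a sketch; `csv_text` is a made-up string that mirrors the CSV output printed above.

```python
from io import StringIO

import pandas as pd

# Made-up CSV content mirroring the output printed above.
csv_text = "Name,Counts\nGeorge,7.0\nRingo,5.0\nJohn,18.0\nPaul,22.0\n"

# read_csv() returns a DataFrame with a single 'Counts' column...
frame = pd.read_csv(StringIO(csv_text), index_col='Name')

# ...which .squeeze('columns') turns into a Series.
series_demo = frame.squeeze('columns')

print(type(series_demo).__name__)
print(series_demo)
```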

### String Operations in Series

A real advantage of working with pandas Series objects is that actions like string operations can be applied to all entries in a vectorized manner, without writing explicit loops. In particular, many string operations are built into pandas and do not require the use of `.map()` or `.apply()`. To use them, call them on the `.str` attribute of the Series:

In [185]:
names = pd.Series(['George', 'John', 'Paul'])

names

0    George
1      John
2      Paul
dtype: object

In [186]:
names.str.lower() # This will turn all entries in 'names' lowercase

0    george
1      john
2      paul
dtype: object

In [187]:
names.str.findall('o') # This will find all 'o's in each row

0    [o]
1    [o]
2     []
dtype: object

Below are some of the methods that are built in on the `.str` attribute:

* `.cat()`: Concatenate list of strings onto items
* `.center()`: Centers strings to width
* `.contains()`: Boolean for whether pattern matches
* `.count()`: Count occurrences of pattern in each string
* `.decode()`: Decode bytes using a given encoding
* `.encode()`: Encode strings using a given encoding
* `.endswith()`: Boolean for whether each string ends with a pattern
* `.findall()`: Find all occurrences of pattern in each string
* `.get()`: Extract the element at a given position from each item
* `.join()`: Join items with separator
* `.len()`: Return length of items
* `.lower()`: Lowercase the items
* `.lstrip()`: Remove whitespace on the left of items
* `.match()`: Boolean for whether each string matches the pattern from its start
* `.pad()`: Pad the items
* `.repeat()`: Repeat the string a certain number of times
* `.replace()`: Replace a pattern with a new value
* `.rstrip()`: Remove whitespace on the right of items
* `.slice()`: Pull out slices from strings
* `.split()`: Split items by pattern
* `.startswith()`: Boolean for whether each string starts with a pattern
* `.strip()`: Remove whitespace from the items
* `.title()`: Titlecase the items
* `.upper()`: Uppercase the items
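
A few of these methods can be tried on the `names` Series from above (recreated here so the sketch is self-contained). The Boolean result of `.str.contains()` can be used as a mask like any other:

```python
import pandas as pd

names = pd.Series(['George', 'John', 'Paul'])

lengths = names.str.len()              # number of characters in each entry
has_o = names.str.contains('o')        # Boolean mask: does the entry contain 'o'?
first_three = names.str.slice(0, 3)    # first three characters of each entry

# The Boolean result can be used to filter the Series.
with_o = names[has_o]

print(lengths.tolist())
print(first_three.tolist())
print(with_o.tolist())
```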