## Pandas Data Structures

Pandas has three basic data structures:

* Series - 1-dimensional; similar to a single column of data in a spreadsheet
* Data Frame - 2-dimensional; similar to a sheet with rows and columns in a spreadsheet
* Panel - 3-dimensional; similar to multiple sheets (note that Panel has been removed from recent versions of pandas)

In practice, Series and Data Frames are used much more frequently than Panels. In pandas, a Data Frame is essentially multiple Series connected together. It is generally not useful to think of pandas data structures in terms of primitive Python data structures (e.g. lists and dictionaries).
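
To make the "multiple Series connected together" idea concrete, here is a minimal sketch. The names `counts`, `rank`, and `df` are made up for illustration, and the `rank` values simply order the two counts.

```python
import pandas as pd

# Two Series that share the same index labels (names are illustrative)
counts = pd.Series([145, 142], index=['Paul', 'John'], name='counts')
rank = pd.Series([1, 2], index=['Paul', 'John'], name='rank')

# Joining them column-wise yields a DataFrame; each Series becomes a column
df = pd.concat([counts, rank], axis=1)
print(df)

# Selecting a single column hands back a Series again
col = df['counts']
print(type(col).__name__)
```

Each column keeps the name of the Series it came from, and the shared index becomes the row labels of the DataFrame.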

### Series

A Series is a one-dimensional, list-like data structure in pandas. It has a single _axis_ called the **index**. We can simulate the pandas Series data structure with a simple Python dictionary:

In [1]:
import pandas as pd

ser = {
    'index': [0, 1, 2, 3],
    'data': [145, 142, 38, 13],
    'name': 'songs',
}

ser

{'index': [0, 1, 2, 3], 'data': [145, 142, 38, 13], 'name': 'songs'}

With a dictionary like this, we can write a Python function that allows us to pull items out of this kind of data structure based on the `index`:

In [2]:
def get(ser, idx):
    # Fetch a 'data' value from a dictionary-based series (ser) given an index label (idx)
    value_idx = ser['index'].index(idx)  # position of the label within the index list
    return ser['data'][value_idx]

get(ser, 1)

142

In the above call, we supplied `get()` with the index `1`, which corresponds to the value `142` within `data`.

Despite what "index" might imply, it does not have to be integer-based: it can consist of strings or dates, it can contain duplicate values, and its order may be arbitrary.

In [3]:
songs = {
    'index': ['Paul', 'John', 'George', 'Ringo'],
    'data': [145, 142, 38, 13],
    'name': 'counts',
}

songs

{'index': ['Paul', 'John', 'George', 'Ringo'],
 'data': [145, 142, 38, 13],
 'name': 'counts'}

In [4]:
get(songs, 'John')

142

With the Series structure in pandas, the above dictionary-based structure can be built in a greatly simplified and more powerful way.

In [5]:
songs2 = pd.Series([145, 142, 38, 13], name='counts')

songs2

0    145
1    142
2     38
3     13
Name: counts, dtype: int64

Pandas displays the Series in a way that facilitates reading. The leftmost column is the *index*, and the rightmost column holds the associated *values*. In the language of pandas, the generic name for an index is an **axis**, and the entries of the index (here 0, 1, 2, 3) are called the **axis labels**. The same holds for DataFrames, which have two axes, one for the rows and one for the columns.

The `dtype` of a pandas Series object is the data type of its values (in a DataFrame, each column has its own `dtype`). In the case above, `int64` indicates a 64-bit integer. A Series or a DataFrame can hold strings, floats, Booleans, or arbitrary Python objects. The values don't even have to share a type (although mixing types hurts speed and limits vectorized operations).

The `index` of a Series can be accessed by its attribute:

In [6]:
songs2.index

RangeIndex(start=0, stop=4, step=1)

Because the index of `songs2` was automatically generated, pandas used the default, which is a range of monotonically increasing integers. As mentioned above, an index can be string-based as well. In that case, however, pandas reports the `dtype` of the index not as a string but as `object`:

In [7]:
songs3 = pd.Series([145, 142, 38, 13],
                  name='counts',
                  index=['Paul', 'John', 'George', 'Ringo'])

songs3

Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64

In [8]:
songs3.index

Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')

The data (that is, the values) does not have to be numeric, or even homogeneous. Mixed generic objects can be inserted into a Series:

In [9]:
class Foo:
    pass

ringo = pd.Series(['Richard', 'Starkey', 13, Foo()],
                 name='ringo')

ringo

0                                        Richard
1                                        Starkey
2                                             13
3    <__main__.Foo object at 0x000001EB32171B88>
Name: ringo, dtype: object

In the case of `ringo`, the `dtype` is `object`. The `object` data type is used for strings as well as for Series whose data is heterogeneous. Because vectorized operations work so well in pandas, we would much rather store numeric data as `int64` or `float64` than as `object`. Note that pandas stores time data in its own special data type, `datetime64`. If you examine the `dtype` of a Series holding time data and it is reported as `object`, the time data was stored as strings, not as the pandas datetime data type. While this may not be desirable, converting strings to datetime is not difficult in pandas.
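
As a rough illustration of the string-to-datetime conversion mentioned above, the sketch below uses `pd.to_datetime()`. The two date strings and the names `raw` and `parsed` are sample values chosen for this example only.

```python
import pandas as pd

# Dates stored as plain strings: pandas does not treat them as datetimes
raw = pd.Series(['1968-11-22', '1969-09-26'])
print(raw.dtype)

# pd.to_datetime parses the strings into a datetime64 Series
parsed = pd.to_datetime(raw)
print(parsed.dtype)

# Once converted, datetime-specific accessors become available
print(parsed.dt.year.tolist())
```

The exact `dtype` strings printed can differ between pandas versions, so they are not shown here.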

#### The NaN value

NaN stands for "Not a Number" and is encountered ubiquitously in the NumPy package. In pandas, when a Series is determined to hold numeric values but no number is available for an entry, pandas fills that entry with NaN. By default, pandas aggregation methods such as `.sum()` and `.mean()` skip NaN values, whereas element-wise arithmetic on a NaN yields NaN.

In [10]:
nan_ser = pd.Series([2, None],
                   index=['Ono', 'Clapton'],
                   )

nan_ser

Ono        2.0
Clapton    NaN
dtype: float64

Note the `dtype` of this Series is `float64` and not `int64`. This is because NaN is a floating-point value. Pandas determined that `nan_ser` contains numeric values, but the `Clapton` entry has no value associated with it (technically the value is `None`), so pandas filled it with NaN, and the entire Series has to take the `dtype` `float64`. When pandas reads a file, e.g. a .csv file, an empty value in an otherwise numeric column becomes NaN. Pandas has convenient methods such as `.fillna()` and `.dropna()` for handling these missing values.
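
A short sketch of how `.fillna()` and `.dropna()` might be applied to `nan_ser`. The fill value `0` and the names `filled`, `dropped`, and `total` are arbitrary choices for illustration.

```python
import pandas as pd

nan_ser = pd.Series([2, None], index=['Ono', 'Clapton'])

# Replace every NaN with a chosen value (0 is an arbitrary choice here)
filled = nan_ser.fillna(0)

# Remove the entries whose value is NaN
dropped = nan_ser.dropna()

# Aggregations skip NaN by default
total = nan_ser.sum()

print(filled)
print(dropped)
print(total)
```

Both methods return a new Series and leave `nan_ser` itself unchanged.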

#### The Series and NumPy Arrays

Many functions and methods associated with NumPy arrays work with pandas Series as well:

In [11]:
import numpy as np

numpy_ser = np.array([145, 142, 38, 13])

songs3.iloc[1]    # positional access; plain songs3[1] on a string index is deprecated in recent pandas

142

In [12]:
numpy_ser[1]

142

In [13]:
songs3.mean()

84.5

In [14]:
numpy_ser.mean()

84.5

*Boolean arrays* are masks that can be used to filter out items, and they can be applied to both NumPy arrays and pandas Series.

In [15]:
mask = songs3 > songs3.mean() # Boolean array

mask

Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool

Boolean arrays can be used to filter the items of a sequence (either an array or a Series) by performing an index operation. If the mask has a `True` value for a given index, the value is kept. Otherwise, the value is dropped. Here, `mask` is a Boolean array that marks the locations whose value is greater than the mean of the Series.

In [16]:
songs3[mask]

Paul    145
John    142
Name: counts, dtype: int64

In [18]:
numpy_ser[numpy_ser > np.median(numpy_ser)] # The same filtering works on NumPy arrays; arrays have no .median() method, so np.median() is used here.

array([145, 142])
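
Masks can also be combined before filtering. The following is a small sketch, with thresholds picked only for illustration, using the element-wise operators `&` (and) and `~` (not).

```python
import pandas as pd

songs3 = pd.Series([145, 142, 38, 13],
                   name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

# Each comparison needs its own parentheses because & binds tighter than > and <
middle = songs3[(songs3 > 20) & (songs3 < 145)]

# ~ inverts a mask: keep everything that is NOT above the mean
not_above_mean = songs3[~(songs3 > songs3.mean())]

print(middle)
print(not_above_mean)
```

The Python keywords `and`, `or`, and `not` do not work element-wise on a Series, which is why the operator forms are used here.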

#### Creating a Series

The easiest way to create a pandas Series is to pass it Python lists as arguments:

In [19]:
george_dupe = pd.Series([10, 7, 1, 22],
                       index=['1968', '1969', '1970', '1970'],
                       name='George Songs',
                       )

george_dupe

1968    10
1969     7
1970     1
1970    22
Name: George Songs, dtype: int64

Here we can see that we have used strings for the index entries, and that there are duplicated index entries (i.e., they are not unique). We created `george_dupe` by explicitly passing the constructor a list containing the index. We do not have to do this, in which case pandas will generate a default index of integers. We can also create a Series from a Python dictionary that maps index entries to values. An additional sequence giving the order of the index can be supplied as well; this was required with older Python versions, whose dictionaries were unordered, but since Python 3.7 dictionaries preserve insertion order. A caveat is that the keys of a Python dictionary must be unique, so to create the above Series from a dictionary we would have to try a workaround:

In [20]:
g2 = pd.Series({'1969': 7,
               '1970': [1, 22],
               },
              index=['1969', '1970', '1970'])

g2

1969          7
1970    [1, 22]
1970    [1, 22]
dtype: object

Note that the above Series did not behave the way we wanted: each duplicated label received the whole list. Therefore, to create a Series with non-unique index entries, we should stick with the list approach and not use dictionaries.
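
When the index entries are unique, the dictionary route works as expected. The sketch below uses made-up variable names (`g3`, `g4`) and also shows what happens when the optional `index` argument lists a label that is missing from the dictionary.

```python
import pandas as pd

# Dictionary keys become the index, in insertion order
g3 = pd.Series({'1968': 10, '1969': 7}, name='George Songs')
print(g3)

# An explicit index selects and reorders entries;
# a label that is absent from the dictionary ('1971') gets NaN
g4 = pd.Series({'1968': 10, '1969': 7}, index=['1969', '1968', '1971'])
print(g4)
```

Because of the NaN entry, `g4` is stored as floats rather than integers.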

#### Reading a Series

Data from a Series can be accessed via index operations, much like items in a Python list or dictionary:

In [21]:
george_dupe['1968']    # Access the value at index entry '1968' of the Series george_dupe

10

Note that the index operation normally returns a scalar value (whose exact data type depends on the data), but not when the index entry is non-unique. In that case a Series is returned instead:

In [22]:
george_dupe['1970']

1970     1
1970    22
Name: George Songs, dtype: int64
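
For more explicit access, pandas also offers the `.loc` (label-based) and `.iloc` (position-based) indexers. The sketch below, with illustrative variable names, shows one way to get a predictable return type regardless of duplicates.

```python
import pandas as pd

george_dupe = pd.Series([10, 7, 1, 22],
                        index=['1968', '1969', '1970', '1970'],
                        name='George Songs')

# Label-based lookup of a unique label gives a scalar
by_label = george_dupe.loc['1968']

# Passing a list of labels always gives a Series, even for a single label
as_series = george_dupe.loc[['1968']]

# Position-based lookup ignores the labels entirely
by_position = george_dupe.iloc[-1]

print(by_label, by_position)
print(as_series)
```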

Data in a Series can be iterated over:

In [23]:
for item in george_dupe:
    print(item)

10
7
1
22


Here's a quirk with the Series object: when iterating (via `.__iter__()`), it is the **values** of a Series that are iterated over. However, checking for membership within a Series (via `.__contains__()`) is done against the **index** entries of a Series. For example, if we do a Boolean check to see if the integer `22` is in `george_dupe`:

In [24]:
22 in george_dupe

False

Python returned `False` because it applied `.__contains__()` to the **index** of `george_dupe`, which only contains year strings such as '1970', and no integers like 22. To check whether 22 is among the **values** of `george_dupe`, we must say so explicitly via the `.values` attribute of the Series:

In [26]:
22 in george_dupe.values

True

In [29]:
22 in set(george_dupe) # The same can be accomplished by querying the set of the Series

True

In [30]:
'1970' in george_dupe # Remember that membership queries on a Series are done against its index

True

Always remember that iteration over a Series goes over its **values** by default. To iterate over both the index label and the value, we can use the `.items()` method (older pandas versions also offered `.iteritems()`, which was removed in pandas 2.0):

In [31]:
for item in george_dupe.items():
    print(item)

('1968', 10)
('1969', 7)
('1970', 1)
('1970', 22)
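
Another way to test the values for membership is the Series method `.isin()`, which returns a Boolean mask rather than a single `True`/`False`. A minimal sketch, where the variable name `hits` is chosen for illustration:

```python
import pandas as pd

george_dupe = pd.Series([10, 7, 1, 22],
                        index=['1968', '1969', '1970', '1970'],
                        name='George Songs')

# Element-wise membership test against a list of candidate values
hits = george_dupe.isin([22, 7])
print(hits)

# .any() collapses the mask to a single Boolean
print(hits.any())

# The mask can also be used to pull out the matching entries
print(george_dupe[hits])
```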


#### Updating a Series

The standard assignment operation can be used in conjunction with the index operation to update the values of a Series:

In [34]:
george_dupe['1969'] = 6

george_dupe['1969']

6

We can add a new index and a corresponding value in the exact same way:

In [35]:
george_dupe['1973'] = 11

george_dupe

1968    10
1969     6
1970     1
1970    22
1973    11
Name: George Songs, dtype: int64

One thing to keep in mind is that if we try to update a Series' values with the above approach, but the Series in question has non-unique index labels, then any changes made to the non-unique index entries will be applied across the board:

In [36]:
george_dupe['1970'] = 2

george_dupe

1968    10
1969     6
1970     2
1970     2
1973    11
Name: George Songs, dtype: int64

To avoid this, we can either use a DataFrame data structure instead (either with a column for artist, or a multi-index), or we can update values in the Series by position via the `.iloc` attribute:

In [38]:
george_dupe.iloc[3] = 22

george_dupe

1968    10
1969     6
1970     2
1970    22
1973    11
Name: George Songs, dtype: int64

Older versions of pandas provided an `.append()` method on Series objects (it was removed in pandas 2.0). It did not behave like the built-in Python `.append()` method (e.g. for lists); it behaved more like Python's `.extend()`, expecting another pandas Series to append. In current pandas, the same result is obtained with `pd.concat()`:

In [39]:
pd.concat([george_dupe, pd.Series({'1974': 9})])

1968    10
1969     6
1970     2
1970    22
1973    11
1974     9
dtype: int64

In [41]:
george_dupe # Note the above line of code DOES NOT modify the original Series but rather returns a new, extended Series.

1968    10
1969     6
1970     2
1970    22
1973    11
Name: George Songs, dtype: int64
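
Because the original Series is left untouched, the extended result has to be assigned to a name if it is to be kept. A sketch using `pd.concat()`, where `extra` and `extended` are illustrative names:

```python
import pandas as pd

george_dupe = pd.Series([10, 6, 2, 22, 11],
                        index=['1968', '1969', '1970', '1970', '1973'],
                        name='George Songs')

extra = pd.Series({'1974': 9})

# pd.concat returns a new Series; keep it by assigning the result
extended = pd.concat([george_dupe, extra])

print(len(george_dupe))  # the original still has 5 entries
print(len(extended))     # the new Series has 6 entries
```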

#### Deleting items in Series

While deleting entries is not common in pandas (it is more common to filter entries using masks such as Boolean arrays), it is possible. The `del` statement can be used to delete entries based on the index:

In [44]:
del george_dupe['1973']

george_dupe

1968    10
1969     6
1970     2
1970    22
Name: George Songs, dtype: int64

A more common way to remove unwanted entries in pandas is to filter the Series to get a new Series. For example, let's say we want to keep only the values less than or equal to 2 in `george_dupe`:

In [45]:
george_dupe[george_dupe <= 2]

1970    2
Name: George Songs, dtype: int64
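
Two related ways of removing entries without `del`, sketched under the assumption that `george_dupe` holds the four entries shown above: inverting the comparison, and the `.drop()` method, which removes entries by label and returns a new Series.

```python
import pandas as pd

george_dupe = pd.Series([10, 6, 2, 22],
                        index=['1968', '1969', '1970', '1970'],
                        name='George Songs')

# Keep only the values greater than 2 (the complement of the filter above)
kept = george_dupe[george_dupe > 2]

# .drop() removes every entry carrying the given label
without_1970 = george_dupe.drop('1970')

print(kept)
print(without_1970)
```

Neither operation changes `george_dupe` itself, which still holds all four entries afterwards.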

#### Indexing a Series

The index of a Series can be numeric, but it can also consist of strings:

In [46]:
george = pd.Series([10, 7],
                  index=['1968', '1969'],
                  name='George Songs',
                  )

george

1968    10
1969     7
Name: George Songs, dtype: int64

The `dtype` of the Series is `int64`, which is actually referring to the values of the Series. We can check the `dtype` of the Series' index instead:

In [47]:
george.index

Index(['1968', '1969'], dtype='object')

Indices do not have to be unique:

In [49]:
dupe = pd.Series([10, 2, 7],
                index=['1968', '1968', '1969'],
                name='George Songs',
                )

dupe

1968    10
1968     2
1969     7
Name: George Songs, dtype: int64

In [51]:
dupe.index.is_unique

False

In [52]:
george

1968    10
1969     7
Name: George Songs, dtype: int64

In [53]:
george.index.is_unique

True
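
When `.is_unique` is `False`, it can help to see which labels repeat. One possible approach, sketched here with illustrative variable names, is the `Index.duplicated()` method:

```python
import pandas as pd

dupe = pd.Series([10, 2, 7],
                 index=['1968', '1968', '1969'],
                 name='George Songs')

# keep=False marks every occurrence of a repeated label
dup_mask = dupe.index.duplicated(keep=False)
repeated = dupe[dup_mask]
print(repeated)

# The default (keep='first') marks only the later repeats,
# so inverting it keeps the first entry for each label
first_only = dupe[~dupe.index.duplicated()]
print(first_only.index.is_unique)
```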