# Introduction



Pandas is a package built on top of NumPy. It uses NumPy arrays to provide an efficient implementation of the `DataFrame`, which is essentially a multidimensional array designed for real-world data: data that may carry labels, is likely to be heterogeneous, and may have missing values.

Like NumPy, Pandas not only provides a convenient interface for working with data, but also implements efficient data operations on top of that interface. Pandas is commonly imported with the alias `pd`:

In [1]:
import numpy as np
import pandas as pd
pd.__version__

'1.0.3'

As always, IPython provides quick access to the package documentation.

In [2]:
pd?

Pandas objects can be thought of as NumPy structured arrays in which rows and columns are identified through the use of labels instead of integer indices. The three fundamental Pandas data structures are: `Series`, `DataFrame` and `Index`.

## The Series Object



According to Pandas documentation, a `Series` object is:
> One-dimensional ndarray with axis labels (including time series).
> Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).



In [3]:
data = pd.Series(np.linspace(0.25, 1.0, num=4))
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As shown in the output, `Series` wraps a sequence of indices and a sequence of values, both of which can be accessed with their respective attributes `index` and `values`:

In [4]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [5]:
data.index

RangeIndex(start=0, stop=4, step=1)

The attribute `values` is, as per the description, a NumPy array, while `index` is an array-like object of type `pd.Index`.

The main difference between a `Series` object and a one-dimensional NumPy array is that the latter uses an _implicitly_ defined integer index, while the Pandas object uses an _explicitly_ defined index (not necessarily integer type) associated with each value.
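
The difference between an implicit and an explicit index is easiest to see when the labels are integers that do not match the positions. The following is a minimal sketch; the label values `2, 5, 3, 7` are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([0.25, 0.5, 0.75, 1.0])

# A NumPy array only knows positions: 0, 1, 2, 3.
second_by_position = values[1]

# A Series carries an explicit index; here the labels are deliberately
# non-contiguous integers (an arbitrary choice for illustration).
series = pd.Series(values, index=[2, 5, 3, 7])
value_for_label_5 = series[5]      # looks up the label 5, not position 5

# .loc always uses labels and .iloc always uses positions.
by_label = series.loc[3]           # value stored under the label 3
by_position = series.iloc[3]       # value stored at position 3 (the last one)

print(second_by_position, value_for_label_5, by_label, by_position)
```

Using `.loc` and `.iloc` makes the intent explicit whenever integer labels could be confused with positions.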

The explicit index can consist of values of any type, as the example below demonstrates:

In [6]:
data = pd.Series(np.linspace(0.25, 1.0, num=4),
                index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

A `Series` object can also be thought of as a specialization of a Python dictionary (and a much more efficient one for array-style operations):

In [7]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [8]:
# Element access by key
population['California']

38332521

In [9]:
# Array-style operations such as slicing also work!
population['California':'Illinois':2]

California    38332521
New York      19651127
Illinois      12882135
dtype: int64
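
The dictionary-like and array-like behaviours can be checked side by side. The sketch below rebuilds the `population` object so that it runs on its own, and it highlights one detail that is easy to miss: a slice by label includes its stop label, while a slice by position (via `.iloc`) excludes its stop position, as in NumPy.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style behaviour: membership tests and key listing.
has_texas = 'Texas' in population
keys = list(population.keys())

# Label-based slices include the stop label ('New York' is kept)...
label_slice = population['California':'New York']

# ...while position-based slices exclude the stop position, as in NumPy.
position_slice = population.iloc[0:2]

print(len(label_slice), len(position_slice))
```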

The general way of constructing a `Series` object is of the form

```
pd.Series(data, index=index)
```

where `data`, as seen in the examples above, can be one of many entities. The following are more examples exploring the possibilities:

In [10]:
# Python list
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [11]:
# A scalar is repeated to match the length of the index
pd.Series(5, index=[1, 2, 3])

1    5
2    5
3    5
dtype: int64

In [12]:
# An explicit index selects (and orders) which dictionary keys are kept
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
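
Two related behaviours are worth a quick check: what happens when the explicit index asks for a label that the dictionary does not contain, and how the statistical methods mentioned in the documentation quote treat missing data. This is a small sketch; the names `letters` and `readings` are made up for the example.

```python
import numpy as np
import pandas as pd

# Label 4 is not a key of the dictionary, so its value becomes NaN.
letters = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 4])

# Statistical methods skip NaN by default; skipna=False disables that.
readings = pd.Series([1.0, np.nan, 3.0])
total_skipping_nan = readings.sum()
total_with_nan = readings.sum(skipna=False)

# Summing the underlying NumPy array propagates the NaN instead.
numpy_total = np.sum(readings.values)

print(letters.isna().tolist(), total_skipping_nan, total_with_nan, numpy_total)
```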

## The DataFrame Object

Just like the `Series` object, a `DataFrame` can be thought of either as a generalization of a NumPy array or as a specialization of a Python dictionary. From the official documentation:

> Two-dimensional, size-mutable, potentially heterogeneous tabular data.
> Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for `Series` objects. **The primary pandas data structure**.

And, according to VanderPlas:

> [..] you can think of a `DataFrame` as a sequence of aligned `Series` objects.

The example below illustrates how a `DataFrame` can act as a container for several `Series` objects:

In [13]:
# New Series object (its index labels match those of the population object)
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [14]:
# A new DataFrame is constructed from the 2 previously created Series objects
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


The `DataFrame` has both an `index` attribute, which gives access to its row labels, and a `columns` attribute, which holds the column labels:

In [15]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [16]:
states.columns

Index(['population', 'area'], dtype='object')

Just like in Python dictionaries or `Series` objects, `DataFrame` data can be accessed by key values (or, in this case, column labels):

In [17]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
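
The alignment described in the quotes above can be made visible by giving the two `Series` objects different label orders and one label that appears in only one of them. This is a sketch with a reduced, reordered copy of the data, not the exact objects used earlier.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})
# Same labels in a different order, plus one label with no population entry.
area = pd.Series({'Texas': 695662, 'Florida': 170312,
                  'California': 423967, 'New York': 141297})

# Division pairs values by label, not by position; labels present in only
# one operand ('Florida') yield NaN.
density = population / area

# The DataFrame constructor aligns its Series on their labels in the same way.
states = pd.DataFrame({'population': population, 'area': area})

print(density)
print(states)
```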

The following snippets showcase some different ways of creating `DataFrame` objects:

In [18]:
# DataFrame from a single Series object
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


In [19]:
# DataFrame from a list of dicts
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [20]:
# DataFrame from dictionaries with missing keys. Missing values are filled with the special value NaN (Not a Number)
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
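
The output above shows `1.0` and `4.0` even though the dictionaries contained integers. A short sketch of why: NaN is a floating-point value, so any column that needs one is stored as `float64`, while a column without gaps keeps its integer type.

```python
import pandas as pd

frame = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Boolean mask marking the cells that were filled with NaN.
missing = frame.isna()

# Columns 'a' and 'c' were upcast to float64 to hold NaN; 'b' stays int64.
column_dtypes = frame.dtypes.astype(str).to_dict()

print(missing)
print(column_dtypes)
```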


In [21]:
# From a dictionary of Series objects
pd.DataFrame({'population': population,
              'area': area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [22]:
# From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.915700  0.742388
b  0.224626  0.397751
c  0.991019  0.966943


In [23]:
# From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0
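
The structured-array case is a compact illustration of heterogeneous data: each field of the array has its own type, and that type carries over to the corresponding column. A minimal sketch, with `records` as an illustrative variable name:

```python
import numpy as np
import pandas as pd

# One 8-byte integer field and one 8-byte float field.
records = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
frame = pd.DataFrame(records)

# Field names become column labels and each column keeps its own dtype.
column_names = list(frame.columns)
column_dtypes = frame.dtypes.astype(str).to_dict()

print(column_names, column_dtypes)
```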


## The Index Object


Pandas objects are designed to facilitate operations across datasets, such as joins. With this in mind, the `Index` object follows many of the conventions of Python's built-in `set`, so that indices can be combined as the following examples show:

In [24]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [25]:
# Intersection
indA & indB

Int64Index([3, 5, 7], dtype='int64')

In [26]:
# Union
indA | indB

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [27]:
# Symmetric difference
indA ^ indB

Int64Index([1, 2, 9, 11], dtype='int64')

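All of the operations above are also available as methods, for example `indA.intersection(indB)`, `indA.union(indB)` and `indA.symmetric_difference(indB)`. The operator forms shown here match the pandas 1.0.3 used in this notebook; pandas 2.0 and later treat `&`, `|` and `^` on an `Index` as element-wise logical operations, so the methods are the portable choice.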
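
A minimal sketch of the method forms, reusing the two indices defined earlier. Only the variable names `common`, `combined` and `exclusive` are new; the results are converted to plain lists so the printout does not depend on the index class of a particular pandas version.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)              # labels present in both
combined = indA.union(indB)                   # labels present in either
exclusive = indA.symmetric_difference(indB)   # labels present in exactly one

print(list(common), list(combined), list(exclusive))
```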