Pandas is the Python library that implements `DataFrame` objects, the counterpart of data frames in `R`. A `DataFrame` is a two-dimensional array with labeled rows and columns. The `pandas` library is a layer built on top of `numpy`.

In [1]:
import numpy as np
import pandas as pd

## Series

These are the simplest `pandas` objects: labeled $1$-dimensional arrays.


### Defining a `Series` object

There are different ways to define a `Series`.

* Giving an array of objects: the labels (indexes) are then simply the positions.

In [2]:
grades_a = pd.Series([1.5, 10, 12, 14, 18, 17, 18, 15, 0, 9])
grades_a

0     1.5
1    10.0
2    12.0
3    14.0
4    18.0
5    17.0
6    18.0
7    15.0
8     0.0
9     9.0
dtype: float64

Printing a series shows its indexes, its values and the type of its data.
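The pieces shown in the printout can also be read one by one. The following is a minimal sketch on a shortened, hypothetical series named `grades`; it only relies on the `values`, `index` and `dtype` attributes of `Series`.

```python
import pandas as pd

# A shortened stand-in for the grades series (hypothetical data).
grades = pd.Series([1.5, 10, 12])

values = grades.values   # the underlying numpy array
index = grades.index     # default labels: the positions 0, 1, 2
dtype = grades.dtype     # float64, because 1.5 forces the integers to floats

print(type(values).__name__, list(index), dtype)
```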

* Explicitly specifying the indexes. Series would not be any more interesting than arrays if they did not offer something more.

In [3]:
grades_b = pd.Series(np.random.randint(0, 20, 10), index=[2*p for p in range(10)])
grades_b

0      2
2     19
4      4
6      0
8      1
10    12
12    15
14    11
16     4
18    15
dtype: int64

A series behaves much like a dictionary. The analogy holds up to one point: a dictionary may have `keys` (or `values`) that are not all of the same type, whereas a series is typed. As is the case for `keys`, the indexes of a series are immutable objects.

In [4]:
population = pd.Series([2138551, 794811, 472317, 433055, 338620, 
                        277269, 274845, 248252, 231844, 228328],
                       index=['Paris', 'Marseille', 'Lyon', 'Toulouse', 'Nice', 
                              'Nantes', 'Strasbourg', 'Montpellier', 'Bordeaux', 'Lille'])
population

Paris          2138551
Marseille       794811
Lyon            472317
Toulouse        433055
Nice            338620
Nantes          277269
Strasbourg      274845
Montpellier     248252
Bordeaux        231844
Lille           228328
dtype: int64
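The dictionary analogy can be checked directly. The sketch below uses a two-city series named `cities` (a hypothetical, shortened version of the data above) and suggests that membership tests, `keys()` and `items()` work as they do on a dictionary, while the index refuses in-place modification.

```python
import pandas as pd

cities = pd.Series([2138551, 794811], index=['Paris', 'Marseille'])

has_paris = 'Paris' in cities   # membership looks at labels, like dict keys
labels = list(cities.keys())    # keys() returns the index
pairs = list(cities.items())    # (label, value) pairs, as with dict.items()

# Unlike a dictionary, all values share a single dtype.
value_dtype = cities.dtype

# The index is immutable: item assignment raises TypeError.
try:
    cities.index[0] = 'Lyon'
    index_is_mutable = True
except TypeError:
    index_is_mutable = False

print(has_paris, labels, value_dtype, index_is_mutable)
```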

A similar result can be obtained using dictionaries.

In [5]:
population = pd.Series({'Paris':2138551, 'Marseille':794811, 'Lyon':472317, 
                        'Toulouse':433055, 'Nice':338620, 'Nantes':277269,
                        'Strasbourg':274845, 'Montpellier':248252, 'Bordeaux':231844, 
                        'Lille':228328})
population

Bordeaux        231844
Lille           228328
Lyon            472317
Marseille       794811
Montpellier     248252
Nantes          277269
Nice            338620
Paris          2138551
Strasbourg      274845
Toulouse        433055
dtype: int64

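Notice the difference: here the indexes come out in alphabetical order. When this notebook was run, dictionaries had no specified order, so `pandas` sorted the keys to build the series. This contrasts with the previous approach, where the order of the indexes was preserved. Recent versions of Python and `pandas` keep the insertion order of the dictionary instead, so the output may differ; `population.sort_index()` gives the alphabetical order in every case.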
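Since the order obtained from a dictionary depends on the versions in use, it can be safer to sort explicitly. A minimal sketch, on a hypothetical three-city series named `cities`:

```python
import pandas as pd

cities = pd.Series({'Paris': 2138551, 'Marseille': 794811, 'Lyon': 472317})

# Whatever order the constructor chose, sort_index() orders the labels.
by_label = cities.sort_index()

print(list(by_label.index))
```

`sort_index()` returns a new series and leaves `cities` untouched.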

In [6]:
population.sort_values(ascending=False)

Paris          2138551
Marseille       794811
Lyon            472317
Toulouse        433055
Nice            338620
Nantes          277269
Strasbourg      274845
Montpellier     248252
Bordeaux        231844
Lille           228328
dtype: int64

We'll nevertheless keep the alphabetical order on the indexes; the reason will become apparent when slicing a series.

Indexes can also be specified explicitly through `index=`, which selects and orders the entries taken from the dictionary.

In [7]:
pd.Series({'Paris':2138551, 'Marseille':794811, 'Lyon':472317, 
            'Toulouse':433055, 'Nice':338620, 'Nantes':277269,
            'Strasbourg':274845, 'Montpellier':248252, 'Bordeaux':231844, 
            'Lille':228328}, index=['Paris', 'Bordeaux'])

Paris       2138551
Bordeaux     231844
dtype: int64
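A label listed in `index=` does not have to be a key of the dictionary. The sketch below uses a hypothetical label `'Rennes'` that is absent from the data; the corresponding entry comes out as a missing value and the series is converted to floats.

```python
import pandas as pd

data = {'Paris': 2138551, 'Bordeaux': 231844}

# 'Rennes' is a hypothetical label with no entry in the dictionary.
subset = pd.Series(data, index=['Bordeaux', 'Rennes'])

print(subset)
```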

### Selection and Slicing

Extending what can be done with arrays, selection and slicing take the indexes into account.

In [10]:
grades_a[1]

10.0

In [15]:
grades_a[-1:10:2]

9    9.0
dtype: float64

In [16]:
population['Paris']

2138551

In [17]:
population['Nantes':'Paris']

Nantes     277269
Nice       338620
Paris     2138551
dtype: int64

Having labels does not prevent the use of position indexes.

In [18]:
population[1:4]

Lille        228328
Lyon         472317
Marseille    794811
dtype: int64

In [19]:
population[population < 500000]

Bordeaux       231844
Lille          228328
Lyon           472317
Montpellier    248252
Nantes         277269
Nice           338620
Strasbourg     274845
Toulouse       433055
dtype: int64

This behavior can be problematic when a series has integer-valued indexes that do not correspond to the integers from $0$ to the length of the series $-1$. The problem only shows up with slices, which can be implicit (slicing through position indexes) or explicit (slicing using row names): as the cells below show, an integer slice is read as positions whereas a single integer key is read as a label.

In [20]:
grades_b

0      2
2     19
4      4
6      0
8      1
10    12
12    15
14    11
16     4
18    15
dtype: int64

In [21]:
grades_b[1:2] # corresponds to looking for row at position 1

2    19
dtype: int64

In [22]:
grades_b[2]  # corresponds to looking at element at index 2

19

To avoid potential bugs, `pandas` lets you state explicitly which kind of index is meant.

* `loc` : for explicit indexes (labels)
* `iloc`: for implicit indexes (positions)
* `ix`  : for a mix of both **(deprecated, and removed in `pandas` 1.0)**.

In [24]:
population.loc['Nantes':'Paris']

Nantes     277269
Nice       338620
Paris     2138551
dtype: int64

**Notice that *stop* is included in the selected range!** This is only true for explicit indexing.

In [25]:
population.iloc[4:8]

Montpellier     248252
Nantes          277269
Nice            338620
Paris          2138551
dtype: int64
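The difference between the two indexers is easiest to see on integer labels, where the same slice bounds select different rows. A minimal sketch on a hypothetical series `s` with even labels:

```python
import pandas as pd

s = pd.Series([2, 19, 4, 0], index=[0, 2, 4, 6])

by_label = s.loc[2:4]       # labels 2 and 4: the stop label is included
by_position = s.iloc[2:4]   # positions 2 and 3: the stop position is excluded

print(by_label)
print(by_position)
```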

### Assigning

As previously said, these behave as dictionaries do. 

In [27]:
grades_a.iloc[1] = 20
grades_a

0     1.5
1    20.0
2    12.0
3    14.0
4    18.0
5    17.0
6    18.0
7    15.0
8     0.0
9     9.0
dtype: float64

In [31]:
grades_a.loc[1:3] = -1
grades_a

0     1.5
1    -1.0
2    -1.0
3    -1.0
4    18.0
5    17.0
6    18.0
7    15.0
8     0.0
9     9.0
dtype: float64

One can also extend a `Series` object simply by assigning a value to a new key.

In [32]:
grades_b

0      2
2     -1
4     -1
6      0
8      1
10    12
12    15
14    11
16     4
18    15
dtype: int64

In [34]:
grades_b.loc[20] = 4
grades_b

0      2
2     -1
4     -1
6      0
8      1
10    12
12    15
14    11
16     4
18    15
20     4
dtype: int64

## `Index` Objects

The indexes of `Series`, and later on of `DataFrame` objects, do more than just label entries. Many data manipulation procedures involve taking unions, intersections or complements of indexes over several `Series` or `DataFrame` objects. That is why indexes are objects in their own right, which can be thought of as immutable arrays.

In [35]:
type(grades_b.index)

pandas.core.indexes.numeric.Int64Index

In [36]:
type(grades_a.index)

pandas.core.indexes.range.RangeIndex

In [37]:
type(population.index)

pandas.core.indexes.base.Index

In [38]:
area = pd.Series({'Paris':105.40, 'Marseille':240.62, 'Lyon':47.95, 
                  'Toulouse':118.30, 'Nice':73.91, 'Nantes':65.77,
                  'Strasbourg':78.09, 'Montpellier':57.11, 'Bordeaux':49.70, 
                  'Lille':34.99})

In [39]:
# Set intersection; recent pandas no longer uses `&` for this.
area.iloc[2:5].index.intersection(population.iloc[1:3].index)

Index(['Lyon'], dtype='object')

In [40]:
# Set union; recent pandas no longer uses `|` for this.
area.loc['Bordeaux':'Nantes'].index.union(population.loc['Lille':'Nice'].index)

Index(['Bordeaux', 'Lille', 'Lyon', 'Marseille', 'Montpellier', 'Nantes',
       'Nice'],
      dtype='object')
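The set operations mentioned above are available as methods of `Index`. The following sketch uses two small hypothetical indexes, `left` and `right`, to illustrate intersection, union and complement.

```python
import pandas as pd

left = pd.Index(['Lyon', 'Marseille', 'Montpellier'])
right = pd.Index(['Lille', 'Lyon'])

common = left.intersection(right)    # labels present in both
merged = left.union(right)           # labels present in either one
only_left = left.difference(right)   # labels of left that are absent from right

print(list(common), list(merged), list(only_left))
```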

## DataFrames

A `DataFrame` can be understood as a collection of `Series` that share a common row index.

### Defining a DataFrame

The most common approach is as a dictionary of `Series`.

In [41]:
france = pd.DataFrame({'population':population, 'area':area})
france

| | area | population |
|---|---|---|
| Bordeaux | 49.7 | 231844 |
| Lille | 34.99 | 228328 |
| Lyon | 47.95 | 472317 |
| Marseille | 240.62 | 794811 |
| Montpellier | 57.11 | 248252 |
| Nantes | 65.77 | 277269 |
| Nice | 73.91 | 338620 |
| Paris | 105.4 | 2138551 |
| Strasbourg | 78.09 | 274845 |
| Toulouse | 118.3 | 433055 |


As you can see, Jupyter notebooks render `pandas` `DataFrame` objects in their own tabular format.

A `DataFrame` has two indexes: one for rows and another for columns. They are exposed through slightly different attributes.

In [42]:
france.index

Index(['Bordeaux', 'Lille', 'Lyon', 'Marseille', 'Montpellier', 'Nantes',
       'Nice', 'Paris', 'Strasbourg', 'Toulouse'],
      dtype='object')

In [43]:
france.columns

Index(['area', 'population'], dtype='object')
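The row index of a `DataFrame` built from several `Series` is the union of their indexes. The sketch below, on two hypothetical series `pop` and `surface` that only share the label `'Lyon'`, suggests that entries missing from one of them are filled with `NaN`.

```python
import pandas as pd

pop = pd.Series({'Paris': 2138551, 'Lyon': 472317})
surface = pd.Series({'Lyon': 47.95, 'Nice': 73.91})

# Rows are aligned on labels; absent entries become NaN.
df = pd.DataFrame({'population': pop, 'area': surface})

print(df)
```

The `population` column is converted to floats here, since `NaN` is itself a float.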

`DataFrame` objects can also have a single column. For instance:

In [44]:
pd.DataFrame(population)

| | 0 |
|---|---|
| Bordeaux | 231844 |
| Lille | 228328 |
| Lyon | 472317 |
| Marseille | 794811 |
| Montpellier | 248252 |
| Nantes | 277269 |
| Nice | 338620 |
| Paris | 2138551 |
| Strasbourg | 274845 |
| Toulouse | 433055 |


In such a case, the column gets the default integer label, starting at $0$. To get a more meaningful label, one can specify the column index.

In [45]:
pd.DataFrame(population, columns=['population'])

| | population |
|---|---|
| Bordeaux | 231844 |
| Lille | 228328 |
| Lyon | 472317 |
| Marseille | 794811 |
| Montpellier | 248252 |
| Nantes | 277269 |
| Nice | 338620 |
| Paris | 2138551 |
| Strasbourg | 274845 |
| Toulouse | 433055 |


Another useful way to define a `DataFrame` is to start from a `numpy` array and supply the row and column labels.

In [47]:
pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'], index=['Hello', 'Bonjour', 'Guten Tag', 'Holà'])

| | A | B |
|---|---|---|
| Hello | 1.370014 | 0.369823 |
| Bonjour | -0.571574 | -0.100883 |
| Guten Tag | -0.408951 | 0.095164 |
| Holà | 0.601696 | 1.12654 |


The underlying array of a dataframe is available through its `values` attribute.

In [48]:
france.values

array([[  4.97000000e+01,   2.31844000e+05],
       [  3.49900000e+01,   2.28328000e+05],
       [  4.79500000e+01,   4.72317000e+05],
       [  2.40620000e+02,   7.94811000e+05],
       [  5.71100000e+01,   2.48252000e+05],
       [  6.57700000e+01,   2.77269000e+05],
       [  7.39100000e+01,   3.38620000e+05],
       [  1.05400000e+02,   2.13855100e+06],
       [  7.80900000e+01,   2.74845000e+05],
       [  1.18300000e+02,   4.33055000e+05]])
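The scientific notation above comes from the fact that a `numpy` array has a single dtype, so integer and float columns are stored together as floats. A minimal sketch on a hypothetical two-row frame, assuming a `pandas` version that provides `to_numpy()` (0.24 or later):

```python
import pandas as pd

df = pd.DataFrame({'area': [49.70, 34.99], 'population': [231844, 228328]})

arr = df.to_numpy()       # same data as df.values
col_dtypes = df.dtypes    # one dtype per column in the frame itself

print(arr.dtype, arr.shape)
print(col_dtypes)
```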

### Selecting and Slicing

Selection with a single key applies to the columns of a dataframe (a slice inside `[]`, on the other hand, selects rows).

In [49]:
france['area']

Bordeaux        49.70
Lille           34.99
Lyon            47.95
Marseille      240.62
Montpellier     57.11
Nantes          65.77
Nice            73.91
Paris          105.40
Strasbourg      78.09
Toulouse       118.30
Name: area, dtype: float64

The same column can be accessed as `france.area`. This is not advised, since a column whose name matches a method or attribute of `DataFrame` cannot be reached this way. To access several columns at once, give a list of column names.

In [50]:
france[['population', 'area']]

| | population | area |
|---|---|---|
| Bordeaux | 231844 | 49.7 |
| Lille | 228328 | 34.99 |
| Lyon | 472317 | 47.95 |
| Marseille | 794811 | 240.62 |
| Montpellier | 248252 | 57.11 |
| Nantes | 277269 | 65.77 |
| Nice | 338620 | 73.91 |
| Paris | 2138551 | 105.4 |
| Strasbourg | 274845 | 78.09 |
| Toulouse | 433055 | 118.3 |
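The risk with attribute access can be illustrated with a column whose name collides with a `DataFrame` method. In this sketch the column name `count` is a hypothetical choice made for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({'area': [49.70, 34.99], 'count': [3, 5]})

same_column = df.area.equals(df['area'])   # attribute access works for 'area'
attr_is_method = callable(df.count)        # df.count is the DataFrame.count method
bracket_values = list(df['count'])         # brackets always reach the column

print(same_column, attr_is_method, bracket_values)
```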


Other kinds of access to elements or slices go through the `iloc` and `loc` indexers. They behave as we have seen for `Series`, taking a row selection followed by a column selection.

In [51]:
france.loc['Paris':'Toulouse', 'area':'population']

| | area | population |
|---|---|---|
| Paris | 105.4 | 2138551 |
| Strasbourg | 78.09 | 274845 |
| Toulouse | 118.3 | 433055 |
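Both indexers accept a row selector and a column selector, and `loc` also accepts a boolean mask for the rows. A minimal sketch on a hypothetical three-city frame:

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [105.40, 47.95, 240.62], 'population': [2138551, 472317, 794811]},
    index=['Paris', 'Lyon', 'Marseille'],
)

big = df.loc[df['population'] > 500000, ['area']]   # rows by mask, columns by label
corner = df.iloc[:2, :1]                            # first two rows, first column
cell = df.loc['Lyon', 'population']                 # a single scalar

print(big)
print(corner)
print(cell)
```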


### Assignment

To create a new column in a dataframe:

In [52]:
france['density'] = france['population']/france['area'] 

In [53]:
france

| | area | population | density |
|---|---|---|---|
| Bordeaux | 49.7 | 231844 | 4664.869215 |
| Lille | 34.99 | 228328 | 6525.521578 |
| Lyon | 47.95 | 472317 | 9850.198123 |
| Marseille | 240.62 | 794811 | 3303.179287 |
| Montpellier | 57.11 | 248252 | 4346.909473 |
| Nantes | 65.77 | 277269 | 4215.736658 |
| Nice | 73.91 | 338620 | 4581.518063 |
| Paris | 105.4 | 2138551 | 20289.857685 |
| Strasbourg | 78.09 | 274845 | 3519.592778 |
| Toulouse | 118.3 | 433055 | 3660.650888 |
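Assignment is not limited to whole columns computed from other columns. The sketch below, on a hypothetical two-city frame with made-up replacement values, suggests that a scalar is broadcast to every row and that `loc` can overwrite a single cell.

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [105.40, 47.95], 'population': [2138551, 472317]},
    index=['Paris', 'Lyon'],
)

df['country'] = 'France'                 # a scalar is broadcast to every row
df.loc['Lyon', 'population'] = 500000    # overwrite one cell (hypothetical value)

print(df)
```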


## Universal Functions on DataFrames

Since `pandas` dataframes are built on `numpy` arrays, the element-wise mathematical operations on arrays (universal functions) extend to dataframes.

In [54]:
test = pd.DataFrame(np.random.randint(0, 10, (4, 3)), columns=['Z', 'Y', 'X'])

In [55]:
test

| | Z | Y | X |
|---|---|---|---|
| 0 | 0 | 5 | 0 |
| 1 | 8 | 2 | 9 |
| 2 | 2 | 7 | 9 |
| 3 | 4 | 1 | 6 |


In [56]:
np.exp(test)

| | Z | Y | X |
|---|---|---|---|
| 0 | 1.0 | 148.413159 | 1.0 |
| 1 | 2980.957987 | 7.389056 | 8103.083928 |
| 2 | 7.389056 | 1096.633158 | 8103.083928 |
| 3 | 54.59815 | 2.718282 | 403.428793 |


In [57]:
test*test

| | Z | Y | X |
|---|---|---|---|
| 0 | 0 | 25 | 0 |
| 1 | 64 | 4 | 81 |
| 2 | 4 | 49 | 81 |
| 3 | 16 | 1 | 36 |


In [58]:
np.cos(france)

| | area | population | density |
|---|---|---|---|
| Bordeaux | 0.84433 | 0.73485 | -0.922659 |
| Lille | -0.907929 | -0.982468 | -0.907444 |
| Lyon | -0.677741 | -0.859921 | -0.262529 |
| Marseille | -0.28419 | 0.810956 | -0.203947 |
| Montpellier | 0.846547 | -0.978668 | 0.493437 |
| Nantes | -0.979376 | -0.11338 | 0.960866 |
| Nice | 0.082479 | 0.957023 | 0.474876 |
| Paris | 0.156006 | 0.972669 | 0.118191 |
| Strasbourg | -0.900527 | 0.930547 | 0.532703 |
| Toulouse | 0.470869 | 0.556253 | -0.767768 |
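What `pandas` adds to these `numpy` operations is that labels are kept, and that binary operations align their operands on labels first. A minimal sketch with two hypothetical series `a` and `b` whose indexes only partly overlap:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

total = a + b                     # labels are aligned before adding
filled = a.add(b, fill_value=0)   # treat a missing operand as 0

print(total)
print(filled)
```

Labels present in only one operand give `NaN` with the `+` operator; the `add` method with `fill_value` is one way to avoid that.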


## Handling Missing Data

Any library for data analysis has to handle missing values properly. In `pandas`, handling missing data is not as straightforward as it can be in `R`, for instance: there is no `NA` object for each datatype. `pandas` therefore has to infer the proper missing value, or the proper datatype, depending on whether a missing value is present. It has two standard choices: the `nan` object from `numpy` or the Python `None` object.

In [59]:
pd.Series([1, np.nan, 1, None])

0    1.0
1    NaN
2    1.0
3    NaN
dtype: float64

In [60]:
x = pd.Series(np.random.randint(0, 10, 3), dtype=int)
x

0    7
1    1
2    1
dtype: int64

In [61]:
x[0] = None
x

0    NaN
1    1.0
2    1.0
dtype: float64
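Once missing values are stored as `NaN`, they also affect computations. The sketch below suggests that `pandas` reductions skip `NaN` by default, whereas the underlying `numpy` array propagates it.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 1, None])

pandas_sum = s.sum()               # NaN entries are skipped by default
strict_sum = s.sum(skipna=False)   # NaN propagates when skipping is disabled
numpy_sum = s.values.sum()         # a plain numpy sum also propagates NaN

print(pandas_sum, strict_sum, numpy_sum)
```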

## Exercise

Look in the documentation for `isnull`, `notnull`, `dropna`, `fillna`.

## Exercise

Look into the pandas documentation to learn about data aggregation and grouping by.