# Pandas

#### this course is inspired by
   - Arnaud Legout, Inria, (courses and MOOC python)
   - Thierry Parmentelat (the numeric part of the MOOC python)
 

#### pandas cheat sheet
   - https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

### pandas
   - `under development` since 2008
   - tries to `close the gap` between `python`, `statistical computing` and `multidimensional datasets`
   - not very `intuitive` but very `powerful` and there is `no better` solution
   
   
   - PyHPC11 (Python High-Performance and Scientific Computing conference 2011):
      - https://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf

# objectives

   - data `arrays` with `labeled axes`
   - `vectorized` operations
   - handling of `missing data`
   - `merge`, `pivot`, `groupby` and other `relational` operations
   - automatic or explicit `data alignment`
   - integrated `time series` functionality

### pandas versus numpy

   - numpy contains `efficient` array `creation` and `manipulations`
   - pandas adds an `index-based structure` (labels) on top of numpy.ndarray

  - see http://pandas.pydata.org/pandas-docs/stable/
  
  

   - pandas uses numpy as a `black-box`
   - `no assumption` is made on memory allocation
   
   
   - i.e. pandas works mainly with `copies` instead of `in-place` modification
   - see https://stackoverflow.com/questions/23296282/what-rules-does-pandas-use-to-generate-a-view-vs-a-copy
   
   

   - `in-place` modifications in pandas are usually a copy followed by an assignment
   - see https://stackoverflow.com/questions/22532302/pandas-peculiar-performance-drop-for-inplace-rename-after-dropna/22533110#22533110

   - importing  `pandas library`

In [None]:
import pandas as pd

   - pandas version

In [None]:
pd.__version__

   - Pandas version and version of its dependencies

In [None]:
#pd.show_versions(as_json=False)  # very long do not print

there are two pandas containers
   - I) $\texttt{pandas.Series}$ is for `one-dimensional` arrays 
   - II) $\texttt{pandas.DataFrame}$ is for `two-dimensional` arrays

## I) $\texttt{pandas.Series}$

#### Series contain
   - an `array-like` data
   - an `array-like` index of same length as the data
   - by default, the index starts at $0$

### 1) creating $\texttt{pandas.Series}$ from arrays

example
   - we have $11$ `European countries` with their `names`
   - their corresponding `total areas` in $km^2$
      - Russia, Ukraine, France, Spain, Sweden, Norway, Germany, Finland, Poland, Italy, UnitedKingdom
      - 3972400, 603628, 551695, 505992, 450295, 385178, 357578, 338145, 312685, 301338, 242495

we can create a $\texttt{pandas.Series}$ with the areas

In [None]:
import pandas as pd

In [None]:
areas = [3972400, 603628, 551695, 505992, 450295, 385178,
                   357578, 338145, 312685, 301338, 242495]
sa = pd.Series(areas)
sa

   - by `default` a $\texttt{pandas.Series}$ is `indexed` by `numbers`
   - the `type` of the elements is here $\texttt{int64}$

In [None]:
sa.index

   - the `index` is also called the `keys`

In [None]:
sa.keys() is sa.index

   - they are the `same` `python` object

### 2) providing an index to a $\texttt{pandas.Series}$

   - in our example, the series is `indexed` by `numbers`
   - we can index it by the `names` of the countries

In [None]:
countries = ['Russia', 'Ukraine', 'France', 'Spain',
             'Sweden', 'Norway',  'Germany', 'Finland',
             'Poland', 'Italy', 'UnitedKingdom']
sc = pd.Series(areas, index = countries)

In [None]:
sc.index

   - the `index` need not be `unique`

In [None]:
index = ['a', 'b', 'c', 'a', 'c']
s = pd.Series([10, 39, 27, 8, 46], index = index)

   - the $\texttt{index}$ $\texttt{'a'}$ has two values

In [None]:
s['a']

   - we can `test` if an `index` is in the series

In [None]:
'a' in s

   - you can `reorganize` the index of a series

In [None]:
s = pd.Series([10, 39, 27, 8, 46],
              index = ['d', 'b', 'c', 'a', 'e'])

In [None]:
s.reindex(['a', 'b', 'c', 'd', 'e', 'f'], fill_value=0)

  - you can give `default-values` for `new` elements

### 3) creating $\texttt{pandas.Series}$ from dictionaries $\{(key_i, item_i)\}$

   - `keys` of the $\texttt{dict}$ are `indexes`
   - `values` of the $\texttt{dict}$ are `elements`

In [None]:
d = {'Russia': 3972400, 'Ukraine': 603628, 'France': 551695, 'Spain': 505992,
     'Sweden': 450295, 'Norway': 385178, 'Germany': 357578, 'Finland': 338145,
     'Poland': 312685, 'Italy': 301338, 'UnitedKingdom': 242495}

scd = pd.Series(d)

In [None]:
scd.index

### 4) types of elements of a $\texttt{pandas.Series}$

In [None]:
index = ['a', 'b', 'c', 'a', 'c']
s = pd.Series([10, 39, 27, 8, 46], index = index)

In [None]:
s.index

in this example
   - the `type` of the `index` is $\texttt{object}$
   - and not a `fixed-length` array of characters like in $\texttt{numpy.ndarray}$ 

In [None]:
s.index.dtype # 'O' for object

   - a $\texttt{pandas.Series}$ can hold data of any type, but all its elements share a `single` type (*)
   - when the type is $\texttt{object}$, elements are `references` to Python `objects`

(*) `unlike` python containers, which can hold objects of `any type` and are `heterogeneous`

In [None]:
l = [1, 'toto', 12.89, {'a':1}, (10, 230)]
[type(e) for e in l]
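
a minimal sketch of the rule above, with illustrative values that are not part of the course data: a homogeneous list gets one native dtype, a mixed list falls back to $\texttt{object}$ and each element keeps its own Python type

```python
import pandas as pd

# homogeneous data: one native dtype for the whole Series
ints = pd.Series([1, 2, 3])
print(ints.dtype)            # an integer dtype

# heterogeneous data: the Series dtype is object,
# each element is a reference to a Python object
mixed = pd.Series([1, 'toto', 12.89])
print(mixed.dtype)           # object
element_types = [type(e).__name__ for e in mixed]
print(element_types)
```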

### naming the elements and the index of a series

In [None]:
s = pd.Series({'Russia': 3972400, 'Ukraine': 603628, 'France': 551695})
s

In [None]:
s.name = 'areas'
s.index.name = 'countries'
s.head()

### 5) accessing the underlying $\texttt{numpy.ndarray}$ from  $\texttt{pandas.Series}$

#### it is recommended to use:

   - $\texttt{pandas.Series.array}$ is a wrapper around the `underlying data`
   - $\texttt{pandas.Series.to_numpy}$ returns the `underlying` data as a $\texttt{numpy.ndarray}$

In [None]:
s = pd.Series(['a', 'b', 'c'])
s

In [None]:
s.array

In [None]:
s.to_numpy()

#### it is recommended to avoid $\texttt{Series.values}$
   - it returns a $\texttt{numpy}$ array `representing the underlying data`
   - **but** it has `inconsistent behaviour` (it is not deprecated)
   - (https://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#accessing-the-values-in-a-series-or-index)
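
one case of this inconsistency, as a minimal sketch with an illustrative categorical series: $\texttt{values}$ returns a pandas extension array there, while $\texttt{to_numpy()}$ always returns a $\texttt{numpy.ndarray}$

```python
import numpy as np
import pandas as pd

# a categorical Series is backed by a pandas extension array
s = pd.Series(['a', 'b', 'a'], dtype='category')

v = s.values        # a pandas Categorical, not a numpy.ndarray
a = s.to_numpy()    # always a numpy.ndarray

print(type(v).__name__)
print(type(a).__name__)
```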

      

### 6) accessing elements in a $\texttt{pandas.Series}$

In [None]:
countries = ['Russia', 'Ukraine', 'France', 'Spain', 'Sweden', 'Norway',
                       'Germany', 'Finland', 'Poland', 'Italy', 'UnitedKingdom']
areas = [3972400, 603628, 551695, 505992, 450295, 385178,
                   357578, 338145, 312685, 301338, 242495]
sc = pd.Series(areas, index = countries)

#### a) `accessing` elements by their `index` (their `key`)

the `strong` way

In [None]:
sc['Spain']

In [None]:
'Spain' in sc 

in case of `absence`: the `strong` way produces an `error`

In [None]:
try:
    sc['Denmark']
except KeyError as e:
    print(e, 'is not an index')

In [None]:
'Denmark' in sc

#### b) `accessing` elements by their `index` using $\texttt{pandas.Series.loc[]}$
   - it is a `property` not a `function`

access the element of index `Russia` in the series

In [None]:
sc.loc['Russia'] # the same element as sc['Russia']

in case of `absence`: it produces an `error`

In [None]:
try:
    sc.loc['Denmark']
except KeyError as e:
    print(e, 'is not an index')

#### c) `accessing` elements by the `position` of their index using $\texttt{pandas.Series.iloc[]}$

   - the position of the `key` in the list of `index`

`'Russia'` is the `first` index (i.e. $0$) 

In [None]:
sc.iloc[0] # the same element as sc.loc['Russia']

In [None]:
sc.iloc[-1] # the last one like in python

In [None]:
sc.iloc[0:3] # several indexes

in case of `absence`: it produces an `error`

In [None]:
try:
    sc.iloc[1000]
except IndexError as e:
    print(e)

#### d) iterating over the elements with $\texttt{pandas.Series.items}$

   - you obtain a python $\texttt{zip}$, i.e. a lazy iterable of $(index, value)$ pairs

In [None]:
zsa = sa.items() # the iterable contain the numbered elements
zsa

In [None]:
for z in zsa:
    print(z)

### 7) deleting an element with  $\texttt{pandas.Series.drop}$

   - it allocates and returns a new $\texttt{pandas.Series}$
   - or you can do it `inplace` (it will *modify* the actual series and return $\texttt{None}$)

In [None]:
sc.drop('Russia')

In [None]:
sc = sc.drop('Russia')

we do it `inplace`

In [None]:
sc.drop('Spain', inplace=True)  # returns None: do not assign the result to sc

### 8) adding an element in a $\texttt{pandas.Series}$

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])

#### a) you can check whether values are contained in a series

the `direct` way

In [None]:
s == 10

   - the `proper` way

In [None]:
s.isin([10]) # a single element

In [None]:
s.isin([100, 47]) # several elements

#### b) you can check whether values are indexes in a series

   - the python way

In [None]:
'a' in s

In [None]:
s.index.isin(['a', 'd'])

#### c) you can change the value of an element

In [None]:
s['a']

In [None]:
s['a'] = 17 # you modify all the 'a'

In [None]:
s['a']

   - you modify the `original` array (`not` a copy of the array)

In [None]:
s.dtype

In [None]:
type(s['c'])

In [None]:
s['c'] = "toto"

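   - the `type` of the element `changed`
   - it was an integer, it became a `string`, and the `dtype` of the series became `object`
   - (the 2.x versions of pandas, from 2.1 on, emit a `FutureWarning` for such a silent change on an existing element)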

In [None]:
type(s['c'])

In [None]:
s.dtype

#### d) you can add elements

   - the same way you change an existing one
   - you `give` a new `pair` $(index,\ value)$

In [None]:
s['v'] = 134

In [None]:
s

   - `adding` elements can `change` the `data-type` of the array

In [None]:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [None]:
s.dtype

In [None]:
s.loc['str'] = '4' # we add a string

In [None]:
s.dtype

In [None]:
s + s   #   as the sum is defined on both
        # for character strings it is the concatenation

#### e) `implicit` type conversion
   - type conversion can be done `automatically`

   - `changing` elements can `change` the `data-type` of the array

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])

In [None]:
s.dtype

In [None]:
s + s # addition of 64-bit integers

   - we add an element of type `character string`

In [None]:
s['w'] = '101'

   - you silently change the data-type of the array

   - when `printed` the array `looks` the same !
   - but `from now on` elements are `references` to objects
   - elements indexed by $\texttt{['a', 'b', 'c']}$ are `references to 64-bit integer` objects
   - the element indexed by $\texttt{'w'}$ is a `reference to a character string` object

In [None]:
s

In [None]:
s.dtype

   - but:

In [None]:
[type(e) for e in s.array]   # the last one is a `str` not an `int`

In [None]:
s + s

   - $+$ is the `addition of integers`
   - $+$ is the `concatenation of strings`

performance (be careful):
   - operations on $\texttt{numpy.ndarray}$ with elements of type $\texttt{object}$
   - are `slower` than operations on $\texttt{numpy.ndarray}$ with elements of a `numeric` type (int32, int64, float64, etc.)

In [None]:
import numpy as np

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])
s.dtype # dtype('int64')
np.power(s, 2)

   - the `operation` is done directly on `64-bit` integers

In [None]:
s['w'] = '101'  # we add an element of type str
s.dtype         # dtype('O')
                # the type of the array changed to `object`
s.drop(['w'], inplace= True) # we remove the element of type str
s.dtype        # dtype('O') the type remains `object`

In [None]:
np.power(s, 2)

   - the `operation` is still done on `64-bit` integers
   - but now the integers are `referenced` by the array
   - (one `indirection` has been added)
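
the indirection can be removed again with an explicit conversion; a minimal sketch, assuming that all remaining elements are integers (otherwise $\texttt{astype}$ raises an error)

```python
import pandas as pd

s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])
s['w'] = '101'                  # adding a str switches the dtype to object
s = s.drop('w')                 # the str is gone, the dtype is still object
print(s.dtype)

restored = s.astype('int64')    # explicit conversion back to a numeric dtype
print(restored.dtype)
```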

#### f) `explicit type` conversion with $\texttt{pandas.Series.astype}$

   - type conversion can be done `explicitly`

In [None]:
s = pd.Series([10.20, 47, 47, 67], index=['a', 'b', 'c', 'a'])

In [None]:
s

In [None]:
import numpy as np

   - we change the type

In [None]:
s.astype(np.int32)

   - it returns a new $\texttt{pandas.Series}$
   - with converted values
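
a short check of these two points, as a sketch on the same values: the original series keeps its dtype, and converting floats to integers truncates the fractional part

```python
import numpy as np
import pandas as pd

s = pd.Series([10.20, 47, 47, 67], index=['a', 'b', 'c', 'a'])
converted = s.astype(np.int32)   # new Series, fractional parts are truncated

print(converted)
print(s.dtype)                   # the original Series is unchanged
```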

### 9) applying `vectorized` operations on $\texttt{pandas.Series}$

#### a) advanced array indexing and assignment features

In [None]:
s = pd.Series([56, 45, 23, 8, 19, 34], index=['a', 'b', 'c', 'd', 'e', 'f'])
s

In [None]:
s[s<30]

In [None]:
s.loc[s<30] # the same as previously

   - we can `modify` all the `selected` elements

In [None]:
s.loc[s<30] = 30 # threshold

In [None]:
s[s<30]

   - note that the scalar (here $30$) has been `broadcast` to the `required size`

In [None]:
s[s>50] = 50

In [None]:
s

## II) $\texttt{pandas.DataFrame}$


   - `two-dimensional` arrays
   - where `rows` and `columns` are indexed
   - `missing` values are replaced by $\texttt{numpy.nan}$
   - `scalar` values are broadcasted
   
   
   
   - can be built in `several ways` or `read` from files (the most usual way)
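
reading from a file is not shown in this section; a minimal sketch, assuming a CSV file whose content is the illustrative text below (an in-memory $\texttt{io.StringIO}$ stands in for the file on disk)

```python
import io
import pandas as pd

# stand-in for a file on disk (illustrative content)
csv_text = """planet,distance,lowest temperature
Earth,1.0,-90
Mars,1.523,-140
Neptune,30.0,
"""

# index_col selects the column used as row labels;
# the empty field is read as a missing value (NaN)
df = pd.read_csv(io.StringIO(csv_text), index_col='planet')
print(df)
```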

### 1) creating $\texttt{pandas.DataFrame}$ from $\texttt{pandas.Series}$
   - the series are `aligned` on their `index`; a label `missing` from a series gives a $\texttt{numpy.nan}$

   - we create three `series of data`
      - `distance`, `lowest_temp` and `highest_temp` related to the solar system
   - series are `indexed by` the `names` of the planets
   - some values are `missing`
      - the `lowest` and the `highest` temperature of `neptune`, `saturn` and `uranus`
   - all planets are from the `solar system`

In [None]:
# distances from the sun, relative to the Earth's (Earth = 1)
distance = pd.Series([0.387, 0.723, 30, 1., 5.203, 1.523, 9.6, 19.19],
                     index=['Mercury', 'Venus', 'Neptune', 'Earth', 'Jupiter', 'Mars', 'Saturn', 'Uranus'])

lowest_temp = pd.Series([-200.0, 446.0,  -90.0, -125.0, -140.0],
                        index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])

highest_temp = pd.Series([430.0, 490.0, 60.0, 17.0, 20.0],
                         index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])


we `group` the series using a `python dict` 
   - the `names` of the series are the `keys` of `dict`
   - the `elements` of the series are the `values` 

In [None]:
planets = pd.DataFrame({'distance': distance,
                        'lowest temperature': lowest_temp, 
                        'highest temperature': highest_temp, 
                        'origin':'solar system'})

   - for the `column` $\texttt{'origin'}$
   - we give the `single` value $\texttt{'solar system'}$

   - we `see` the first elements

In [None]:
planets.head()

   - the `single` value is `broadcasted` to the `entire column` (here  *Solar System*)
   - `missing` values are `replaced by` $\texttt{numpy.nan}$ (min/max temperature of neptune, saturn and uranus)

you can `retrieve` columns by `name`
   - you obtain a `reference` to the `series`, `not` a `copy`

In [None]:
planets[['distance', 'lowest temperature', 'highest temperature']]

In [None]:
planets['distance'] # relative distance where earth is 1

   - you can use a `column name` as an `attribute` (when it is a valid python identifier)
   - it `won't` work for $\texttt{lowest temperature}$ (the name contains a space)

In [None]:
planets.distance

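   - we can attach a name to the data frame (a plain python attribute: $\texttt{pandas}$ does not use it and does not keep it in derived data frames)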

In [None]:
planets.name = 'planets'

 - we can give a name to the index

In [None]:
planets.index.name = 'planets names'

In [None]:
planets.head()

   - you can create a `data frame` from a `dict of dicts`

In [None]:
planets_2 = pd.DataFrame(
    {'distance': {'Mercury' : 0.387, 'Venus' : 0.723, 'Neptune' : 30, 'Earth' : 1},
     'lowest temperature': {'Mercury' : -200, 'Venus': 446, 'Earth' : -90},})
                        

In [None]:
planets_2

### 2) creating $\texttt{pandas.DataFrame}$ by specifying parameters $\texttt{data}$, $\texttt{columns}$ and $\texttt{index}$

In [None]:
planets_1 = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],
                          [1.523, -140.0, 20.0],
                          [0.387, -200.0, 430.0],
                          [30.0],
                          [9.600],
                          [ 19.190],
                          [ 0.723, 446.0, 490.0]],
                         
                         index= ['Earth', 'Jupiter', 'Mars', 'Mercury', 'Neptune', 'Saturn', 'Uranus', 'Venus'],
                         
                         columns=['distance', 'lowest temperature', 'highest temperature'])

In [None]:
planets_1.head(3)

in a `data frame`
   - the `rows` and the `columns` are `indexed`
   - the type is $\texttt{pandas.Index}$ (for short)

In [None]:
type(planets_1.index), type(planets_1.columns)

In [None]:
pd.Index

   - you can create an object `Index`
   - and pass it to the data frame `constructor`

In [None]:
index_rows = pd.Index(['Earth', 'Jupiter', 'Mars','Mercury',
                       'Neptune', 'Saturn', 'Uranus', 'Venus'])

In [None]:
index_cols = pd.Index(['distance', 'lowest temperature',
                     'highest temperature'])

In [None]:
planets_3 = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],
                          [1.523, -140.0, 20.0],
                          [0.387, -200.0, 430.0],
                          [30.0],
                          [9.600],
                          [ 19.190],
                          [ 0.723, 446.0, 490.0]],                         
                         index = index_rows,
                         columns = index_cols)

### 3) information on $\texttt{pandas.DataFrame}$

   - you can access the `index`

In [None]:
planets.index

   - you can access the `columns names`

In [None]:
planets.columns

   - you access `columns` by keys like for a dictionary

In [None]:
planets['distance']

   - you can `transpose` a $\texttt{pandas.DataFrame}$

In [None]:
planets.T # now columns are rows

   - you can access the `underlying` two-dimensional $\texttt{numpy.ndarray}$

In [None]:
planets.to_numpy()

   - you can get general `statistics` on the `numerical` columns

In [None]:
planets.describe()

   - you can access `information` on `columns`
   - *numbers of non-null elements, types, memory usage*

In [None]:
planets.info()

### 4) accessing elements in a $\texttt{pandas.DataFrame}$ using $\texttt{pandas.DataFrame.loc}$

the classical way
   - `standard` (python and numpy) `indexing operators` `[]` and attribute operator `.`
   - are `available` and `intuitive`

   
However
   - using `standard operators` has  `optimization` limits
   - for `production code` use the `optimized pandas data access methods` 
   
   
   
http://pandas.pydata.org/pandas-docs/stable/indexing.html

#### accessing elements using `labels` and $\texttt{pandas.DataFrame.loc}$
   - $\texttt{df.loc[row_label]}$
   - $\texttt{df.loc[row_label, column_label]}$
   

$\texttt{row_label}$ and $\texttt{column_label}$ can be:
   - `labels`
   - `list of labels`
   - `slices` with labels
   - `masks` (`Boolean array`) 
   

when $\texttt{row_label}$ and $\texttt{column_label}$ are `labels`
   - it returns a value

In [None]:
planets.loc['Earth', 'distance']

when only one of $\texttt{row_label}$ and $\texttt{column_label}$ is a `label`
   - it returns a $\texttt{pandas.Series}$

In [None]:
planets.loc['Earth']

In [None]:
type(planets.loc['Earth'])

In [None]:
planets.loc[['Earth'], 'distance']

In [None]:
type(planets.loc[['Earth'], 'distance'])

In [None]:
planets.loc['Earth', ['distance']]

In [None]:
type(planets.loc['Earth', ['distance']])

when $\texttt{row_label}$ and $\texttt{column_label}$ are `lists of labels`
   - it returns a $\texttt{pandas.DataFrame}$

In [None]:
planets.loc[['Earth']]

In [None]:
type(planets.loc[['Earth']])

In [None]:
planets.loc[['Earth', 'Mars']]

   - all columns $\texttt{':'}$
   - rows from 'Earth' to 'Mars', both `included`

In [None]:
planets.loc['Earth':'Mars', :]

   - all rows $\texttt{':'}$
   - columns from $\texttt{distance}$ to $\texttt{highest temperature}$ `included`

In [None]:
planets.loc[:, 'distance':'highest temperature']

   - $\texttt{planets}$ `farther` from the sun `than` earth

In [None]:
planets.loc[planets.loc[:, 'distance'] > 1]

### 5) accessing elements in a $\texttt{pandas.DataFrame}$ using `position` and  $\texttt{pandas.DataFrame.iloc}$

#### accessing elements using $\texttt{pandas.DataFrame.iloc}$
   - $\texttt{df.iloc[row_id]}$
   - $\texttt{df.iloc[row_id, column_id]}$
   

$\texttt{row_id}$ and $\texttt{column_id}$ can be:
   - `integer`
   - `list of integers`
   - `slices`
   - `masks` (`Boolean array`)  

In [None]:
planets_1 = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],
                          [1.523, -140.0, 20.0],
                          [0.387, -200.0, 430.0]],
                         
                         index= ['Earth', 'Jupiter', 'Mars', 'Mercury'],
                         
                         columns=['distance', 'lowest temperature', 'highest temperature'])

   - the first row

In [None]:
planets.iloc[0] # pandas.Series

   - the first and the third rows

In [None]:
planets.iloc[[0, 2]] # pandas.DataFrame

   - $\texttt{[1, 3]}$ the second and the fourth columns
   - $\texttt{[0, 2]}$ of the first and the third rows

In [None]:
planets.iloc[[0, 2], [1, 3]] # pandas.DataFrame

   - first row, second column (as a float)

In [None]:
planets.iloc[0, 1] # a scalar (float)

   - first row, second column (as a $\texttt{pandas.DataFrame}$)

In [None]:
planets.iloc[[0], [1]] # pandas.DataFrame

   - the `rows` from `position` $0$ to `position` $2$ `excluded` (*python slicing rules*)
   - the `columns` from `position` $1$ to position $3$ `excluded` (*python slicing rules*)

In [None]:
planets.iloc[0:2, 1:3] # pandas.DataFrame

   - all rows $\texttt{':'}$
   - columns from $1$ to $3$ excluded

In [None]:
planets.iloc[:, 1:3]

   - all columns $\texttt{':'}$
   - rows from $0$ to $3$ excluded

In [None]:
planets.iloc[0:3, :]

   - planets farther from the sun than the earth
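
no cell accompanies this bullet; a possible sketch, on a reduced illustrative data frame: $\texttt{iloc}$ does not accept a Boolean $\texttt{pandas.Series}$ as a mask, so the mask is first converted to a $\texttt{numpy.ndarray}$

```python
import pandas as pd

planets = pd.DataFrame({'distance': [1.0, 5.203, 1.523, 0.387]},
                       index=['Earth', 'Jupiter', 'Mars', 'Mercury'])

# iloc rejects a Boolean Series (it carries labels),
# so the mask is converted to a plain numpy array of Booleans
mask = (planets['distance'] > 1).to_numpy()
farther = planets.iloc[mask]
print(farther)
```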

### 6) changing the $\texttt{pandas.DataFrame}$ $\texttt{index}$

   - $\texttt{pandas.DataFrame.set_index(new_column)}$
   - $\texttt{pandas.DataFrame.reset_index()}$
   - direct assignment

In [None]:
planets = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],],                         
                         index= ['Earth', 'Jupiter'],                         
                         columns=['distance', 'lowest temperature', 'highest temperature'])

   - with $\texttt{pandas.DataFrame.set_index}$ you `index` by `another` column

In [None]:
planets.set_index('distance')

   - with $\texttt{pandas.DataFrame.reset_index}$  the `index` becomes a `normal` $\texttt{pandas.DataFrame}$ column 

In [None]:
planets.reset_index()

   - with direct assignment you create a new index

In [None]:
planets

In [None]:
planets.index = ['la terre', 'jupiter']

### 7) sorting $\texttt{pandas.DataFrame}$ according to `columns`

In [None]:
df = pd.DataFrame({ 'col1':  [19, 3, 26, 46, 4, 19],
                    'col2': ['h', 'w', 'y', 'd', 'm', 'w'],
                    'col3':  [8.45, 19.23, 89.56, 17.5, 54.76, 89.56]})

In [None]:
df.sort_values(by='col1', ascending=False)

   - `first` $\texttt{col1}$ is `sorted`
   - then, for `identical values`, $\texttt{col2}$ is sorted 

In [None]:
df.sort_values(by=['col1', 'col2'], ascending=False)
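
the sort direction can also be chosen per column; a minimal sketch on the same data, where $\texttt{ascending}$ receives one Boolean per sort key

```python
import pandas as pd

df = pd.DataFrame({'col1': [19, 3, 26, 46, 4, 19],
                   'col2': ['h', 'w', 'y', 'd', 'm', 'w'],
                   'col3': [8.45, 19.23, 89.56, 17.5, 54.76, 89.56]})

# col1 in descending order, ties broken by col2 in ascending order
out = df.sort_values(by=['col1', 'col2'], ascending=[False, True])
print(out)
```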

   - you can sort only a few elements ($\texttt{pandas.DataFrame.nlargest()}$, $\texttt{pandas.DataFrame.nsmallest()}$)
   - (*it might be faster on large datasets*)

In [None]:
df.nlargest(2, 'col3')

In [None]:
df.nsmallest(3, 'col1')

### 8) applying vectorized functions to $\texttt{pandas.DataFrame}$

   - $\texttt{pandas.DataFrame}$ columns are stored in $\texttt{numpy.ndarray}$
   - $\texttt{numpy}$ `ufuncs` (universal functions) can be `applied` to $\texttt{pandas.Series}$ and $\texttt{pandas.DataFrame}$
   - `rows` and `columns` labels are preserved

In [None]:
df = pd.DataFrame(np.linspace(0, 2*np.pi, 100), columns=['angle'])

In [None]:
df.head()

In [None]:
df['sinus'] = np.sin(df['angle'])

In [None]:
df.head(3)

In [None]:
df['cosinus'] = np.cos(df['angle'])

In [None]:
df.head(3)

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
df[['sinus', 'cosinus']].plot()

In [None]:
( np.power(df['sinus'], 2) + np.power(df['cosinus'], 2) )[0:3]

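## III) `alignment` of `labels` (rows, columns)

   - $\texttt{pandas}$ automatically `aligns labels` to `perform` an operation
   - operations are performed on values with the `same row` and `same column` label

   - `label alignment` is guaranteed with the `pandas` operators and methods ($+$, $\texttt{add}$, ...)
   - in old $\texttt{pandas}$ versions, $\texttt{numpy}$ ufuncs `operated` on the underlying `ndarray` independently of the `labels`; since version 0.25 they align $\texttt{pandas.Series}$ as well
   - to ignore the `labels`, operate on the arrays returned by $\texttt{to_numpy()}$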

### 1) `alignment` on $\texttt{pandas.Series}$ (on `rows` labels)

In [None]:
s1 = pd.Series([1, 2, 3, 4],     index=['a', 'b', 'c', 'a'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'e', 'f', 'c'])

In [None]:
s1 + s2  # s1['a'] + s2['a'] = 1 + 10  (first 'a' of s1)
         # s1['a'] + s2['a'] = 4 + 10  (second 'a' of s1)
         # s1['b'] + NaN
         # s1['c'] + s2['c'] = 3 + 40
         # NaN + s2['e']
         # NaN + s2['f']

   - `missing` values are replaced by $\texttt{numpy.nan}$
   - note that a $\texttt{numpy.nan}$ `"contaminates"` an expression:
      - $\texttt{numpy.nan + 20 = numpy.nan}$
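
a minimal sketch of this contamination, on illustrative values; it also shows that pandas reductions such as $\texttt{sum}$ skip missing values by default

```python
import numpy as np
import pandas as pd

contaminated = np.nan + 20
print(contaminated)           # any arithmetic with NaN gives NaN

s = pd.Series([1.0, np.nan, 3.0])
shifted = s + 20              # the NaN element stays NaN
print(shifted)
print(s.sum())                # reductions skip NaN by default
print(s.sum(skipna=False))    # NaN when missing values are not skipped
```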

In [None]:
s1.add(s2) # the same as s1 + s2

   - you can `fill` missing values
   - (here missing values are replaced by $0$)

In [None]:
s1.add(s2, fill_value=0)

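   - **but** $\texttt{numpy}$ alone `does not align` labels: on the underlying arrays the addition is positional
   - (recent $\texttt{pandas}$ versions align the labels even in $\texttt{np.add(s1, s2)}$)

In [None]:
np.add(s1.to_numpy(), s2.to_numpy()) # it adds the two numpy.ndarrays, position by position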

### 2) `alignment` on $\texttt{pandas.DataFrame}$ ( on `rows` and `columns` labels)

example
   - number of `kilometers` done in `bicycles`, `cars` and `bus`
   - by `Garance`, `Nathalie` and `Baptiste`

In [None]:
names = ['Garance', 'Nathalie', 'Baptiste']

bicycle = pd.Series([280, 340, 150], index=['Garance', 'Nathalie', 'Baptiste'])
car = pd.Series([1500, 450, 670], index=['Garance', 'Nathalie', 'Baptiste'])
bus = pd.Series([30, 11, 36], index=['Garance', 'Nathalie', 'Baptiste'])

trips_in_january = pd.DataFrame({'bicycle':bicycle, 'car': car, 'bus': bus})


In [None]:
trips_in_january

In [None]:
bicycle = pd.Series([130, 80], index=['Garance', 'Baptiste']) # missing Nathalie's values
car = pd.Series([270, 890], index=['Nathalie', 'Baptiste'])  # missing Garance's values
bus = pd.Series([27, 130], index=['Garance', 'Nathalie'])    # missing Baptiste's values

trips_in_february = pd.DataFrame({'bicycle':bicycle, 'car': car, 'bus': bus})

In [None]:
trips_in_february # missing values are NaN

In [None]:
trips_with_NaN = trips_in_january + trips_in_february # alignment is done on rows and columns

In [None]:
trips_with_NaN

In [None]:
trips = trips_in_january.add(trips_in_february, fill_value=0)  # alignment is done on rows and columns

In [None]:
trips

### 3) `alignment` on $\texttt{pandas.Series}$  and $\texttt{pandas.DataFrame}$

In [None]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30], 'c': [100, 200, 300]}, index=['x', 'y', 'z'])

In [None]:
df

   - we add a `row`

In [None]:
s_row = pd.Series([0.10, 0.20, 0.30], index=['a', 'b', 'c'])

   - the $\texttt{pandas.Series}$ is considered as a `row`
   - the row is `broadcast` over the three rows of the data frame
   - the `alignment` is done between the `Series labels` and the `columns` `labels`

In [None]:
df + s_row

In [None]:
s_col = pd.Series([1000, 2000, 3000], index=['x', 'y', 'z'])

In [None]:
df + s_col # it is wrong !
           # for pandas, the series is a `row`
           # 'x', 'y' and 'z' are considered as new `columns`
           # (axis is 1)

you must indicate the `axis`
   - $\texttt{axis=0}$ `means` that the `Series labels` are `indexes`
   - the `broadcast` is done `column-wise`

In [None]:
df.add(s_col, axis=0)
# s_col is 'x' [1000]
#          'y' [2000]
#          'z' [3000]

# s_col broadcasted is 'x' [1000][1000][1000]
#                      'y' [2000][2000][2000]
#                      'z' [3000][3000][3000]

## IV) handling `missing data` in $\texttt{numpy}$ and  $\texttt{pandas}$

   - in `real data` you can have `missing values`
   - `missing values` are represented in $\texttt{pandas}$ arrays by $\texttt{numpy.nan}$

### 1) the type of `missing values`

   - the type of $\texttt{numpy.nan}$ is `float`
   
   - i.e. $\texttt{numpy.nan}$ can only be stored in arrays of `float` or `object` type
   
   - in other cases a conversion is done
      - `integers` are converted to `float64`
      - `Booleans` are converted to `object`

   - when a $\texttt{numpy.nan}$ is `present` in a numeric $\texttt{pandas.Series}$
   - the `dtype` of the $\texttt{pandas.Series}$ is `numpy float64`
   

In [None]:
df = pd.Series([1, 2, 3, np.nan])  # np.NaN was removed in NumPy 2.0, use np.nan
df.dtype
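
the two conversions listed above can be checked side by side; a minimal sketch with illustrative values

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, np.nan])           # integers and a missing value
bools = pd.Series([True, False, np.nan])   # Booleans and a missing value

print(ints.dtype)    # float64
print(bools.dtype)   # object
```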

   - if you try to `force` an integer dtype, an `exception` is `raised`

In [None]:
try:
    df = pd.Series([1, 2, 3, np.nan], dtype=np.int64)
except ValueError as e:
    print(e)

   - since `version 0.24`, the $\texttt{pandas}$ `library`
   - can hold `integer dtypes` with `missing values`
   
   - this is not done through the `regular integer type`
   - but through `extension types`
   
   - the `extended integer type` that can hold missing values is $\texttt{'Int64'}$ (not $\texttt{'int64'}$)
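
a minimal sketch of this extension type, on illustrative values: the integers stay integers and the missing value is shown as $\texttt{<NA>}$

```python
import pandas as pd

# capital 'I': the nullable extension dtype, not numpy's int64
s = pd.Series([1, 2, 3, None], dtype='Int64')

print(s)
print(s.dtype)          # Int64
print(s.isna().sum())   # number of missing values
print(s.sum())          # missing values are skipped
```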

   - in $\texttt{pandas.Series}$, $\texttt{None}$ is replaced by $\texttt{numpy.nan}$
   
   - except for $\texttt{pandas.Series}$ of type `object`


In [None]:
df = pd.Series([1, 2, 3, None], dtype='object')
df

In [None]:
#pd.isna?

### 2) $\texttt{pandas}$ functions to `deal` with `missing values`

$\texttt{pandas.isna()}$, $\texttt{DataFrame.isna}$  and $\texttt{Index.isna}$
   - returns the `Boolean mask` of `missing` values 

In [None]:
df = pd.Series([1, 2, np.nan, None], dtype='object')
pd.isna(df)  # same as df.isna()

In [None]:
df[df.isna()] # select the missing values in the Series

In [None]:
df = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, None]])
   # 4 columns of two values each
   # the first two are int64
   # the third and the fourth are float64 (presence of NaN)
df.head()

   - on `index`

In [None]:
df = pd.DataFrame([[1, 2], [4, 5]], index=['a', np.nan])

In [None]:
df.index.isna()

#### $\texttt{pandas.notna()}$, $\texttt{DataFrame.notna}$  and $\texttt{Index.notna}$
   - returns the `Boolean mask` of `non-missing` values 
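
a minimal sketch, with illustrative values, of this mask used to keep the non-missing values

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, None], dtype='object')

mask = s.notna()      # same as pd.notna(s), the opposite of s.isna()
print(mask)
kept = s[mask]        # keep only the non-missing values
print(kept)
```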

#### $\texttt{pandas.Series.dropna}$ and $\texttt{pandas.DataFrame.dropna}$ `remove missing values`

on $\texttt{pandas.Series}$ it removes the value

on $\texttt{pandas.DataFrame}$ it removes the `whole row` or `column`
   - $\texttt{axis = 0}$ or $\texttt{axis = 'index'}$  for `rows`
   - $\texttt{axis = 1}$ or $\texttt{axis = 'columns'}$  for `columns`

In [None]:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, np.nan, 7], [np.nan, 8, 9, 10]])
df

In [None]:
df.dropna() # by default axis=0

In [None]:
df.dropna(axis='index')

In [None]:
df.dropna(axis=1)

In [None]:
df.dropna(axis='columns')

the parameter $\texttt{how}$



   - when $\texttt{how='any'}$ (the default) a `row` or `column` is removed when it contains at least one missing value
   
   - when $\texttt{how='all'}$ a `row` or `column` is removed only when all its values are missing


In [None]:
df = pd.DataFrame([[1, 2, 3, np.nan], []])
df

In [None]:
df.dropna(how='all')

In [None]:
df.dropna(how='any') # there is nothing left !

the parameter $\texttt{thresh}$
   - you keep `rows` (or `columns`)
   - where `thresh` values or `more` are `not missing`

In [None]:
df = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, np.nan, np.nan], [6, 7, np.nan, np.nan]])
df

In [None]:
df.dropna(thresh=3, axis=0)

In [None]:
df.dropna(thresh=1, axis=1)

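#### $\texttt{pandas.Series.fillna}$ and $\texttt{pandas.DataFrame.fillna}$: `missing values` are replaced
   - you can replace them by a given `value`, or `propagate` neighbouring valid values

methods
   - `propagation` of the `last valid` observation to the `next valid` one
   - `forward` ($\texttt{ffill}$)
   - `backward` ($\texttt{bfill}$)
   - the $\texttt{method}$ argument of $\texttt{fillna}$ is deprecated since pandas 2.1: call $\texttt{ffill()}$ or $\texttt{bfill()}$ directly

In [None]:
df = pd.Series([1, np.nan, np.nan, 5, np.nan,  6, np.nan, 9])
df

In [None]:
df.ffill() # propagation forward (formerly df.fillna(method='ffill'))

In [None]:
df.bfill()  # propagation backward (formerly df.fillna(method='bfill'))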

   - the same for $\texttt{pandas.DataFrame}$

In [None]:
df = pd.DataFrame([[1, np.nan, np.nan], [np.nan, 6, np.nan], [2, np.nan, 9]])
df.head()

In [None]:
df.ffill(axis=0)  # formerly df.fillna(axis=0, method='ffill')

In [None]:
df.bfill(axis=1)  # formerly df.fillna(axis=1, method='bfill')
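
missing values can also be replaced by a given value instead of a propagated one; a minimal sketch, where the choice of $0$ or of the mean as replacement is only illustrative

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, np.nan, 5, np.nan, 6, np.nan, 9])

zeros = s.fillna(0)           # every missing value becomes 0
means = s.fillna(s.mean())    # every missing value becomes the mean of the valid ones
print(zeros)
print(means)
```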

   - computing `equality` in presence of `NaN` values
   - `equals` is not the same as `==`

In [None]:
df1 = pd.DataFrame([[2, 3, 4], [5, np.nan, 7]])
df2 = pd.DataFrame([[2, 3, 4], [5, np.nan, 7]])

In [None]:
df1.equals(df2) # NaN == NaN

In [None]:
df1 == df2 # NaN != NaN
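
If you need an element-wise comparison that treats two missing values as equal, one possible approach (a sketch, not the only way) is to combine `==` with the masks returned by `isna`. The names `left` and `right` are made up for the illustration.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame([[2, 3, 4], [5, np.nan, 7]])
right = pd.DataFrame([[2, 3, 4], [5, np.nan, 7]])

elementwise = (left == right)               # False where both values are NaN
both_missing = left.isna() & right.isna()   # True where both values are NaN
same_cells = elementwise | both_missing     # NaN-aware element-wise comparison

print(same_cells.all().all())               # agrees with left.equals(right)
```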

## V) computing simple `statistics` in $\texttt{pandas}$

   - on $\texttt{pandas.Series}$ it returns a `single number`
   - on $\texttt{pandas.DataFrame}$ it returns a `number per axis`

### 1) $\texttt{pandas.DataFrame.describe}$

   - it is a statistical overview of a DataFrame

#### a) on `DataFrame` with only `numerical data`

   - you have `count`, `mean`, `standard deviation` (`std`), `min/max`, `quartiles`

In [None]:
df = pd.DataFrame([[1.70, 67], [1.67, 59], [1.84, 78], [1.86, 90], [1.56, 45], [1.57, 63]], columns=['height', 'weight'])

In [None]:
df.describe() 
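
The `std` line of `describe` is the sample standard deviation (division by $n - 1$), which differs from the default of `numpy.std` (division by $n$). A small check, reusing the heights of the example above:

```python
import pandas as pd

heights = pd.Series([1.70, 1.67, 1.84, 1.86, 1.56, 1.57])

summary = heights.describe()
sample_std = heights.std(ddof=1)       # divides by n - 1 (the pandas default)
population_std = heights.std(ddof=0)   # divides by n (the numpy default)

print(summary['std'], sample_std, population_std)
```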

#### b) on  `DataFrame` with only `categorical` data

   - you obtain `count`, `unique` (number of distinct values), `top` (most common value), `freq` (frequency of the most common value)

In [None]:
df = pd.DataFrame([['M', 'Lower'], ['F', 'Middle'], ['F', 'Lower'],
                   ['F', 'Middle'], ['M', 'Lower'], ['M', 'Lower']],
                 columns=['Sex', 'Income'])

In [None]:
df.describe()

#### c) on `DataFrame` mixing numerical and categorical data

In [None]:
df = pd.DataFrame([[1.70, 67, 'M', 'Lower'], [1.67, 59, 'F', 'Middle'], [1.84, 78, 'F', 'Lower'],
                   [1.86, 90, 'F', 'Middle'], [1.56, 45, 'M', 'Lower'], [1.57, 63, 'M', 'Lower']],
                  columns=['height', 'weight', 'sex', 'income'])

In [None]:
df.describe() # by default it is applied to numerical data only

In [None]:
df[['sex', 'income']].describe()

### 2) index of the `minimum` or the `maximum`

#### a) on $\texttt{pandas.Series}$

In [None]:
s = pd.Series([10, 4, 4, 89, 4, 120, 67, 67])

In [None]:
s.idxmin(), s.idxmax()

In [None]:
s.value_counts() # number of elements of the same value

In [None]:
s.mode() # the most frequent value

#### b) on $\texttt{pandas.DataFrame}$

In [None]:
df = pd.DataFrame([[2, 3, 4], [5, 0, 7]])

In [None]:
df

In [None]:
df.idxmin(axis=0)

In [None]:
df.idxmax(axis=1)

## VI) `multi indexing` in  $\texttt{pandas}$

   - in a $\texttt{pandas.DataFrame}$ axis labels are `rows` and `columns` labels
   - they are represented by $\texttt{Index}$ `objects`
   - with `row` and `column` `Index` you have `two-dimensional structured` arrays 

In [None]:
df = pd.DataFrame({'row': [0, 1, 2],
                   'one_X': [1, 2, 3],
                   'one_Y': [4, 5, 6],
                   'two_X': [10, 20, 30],
                   'two_Y': [40, 50, 60]})

df = df.set_index('row')   # we set the index to the 'row' column

In [None]:
df

   - this example appears to be more `structured`
   - we can see `two labelled pairs` of values $(X_{one}, Y_{one})$ and $(X_{two}, Y_{two})$
   - the `first` pair is labelled by `one` and the `second` by `two`

   - there is a `hierarchy` in the labels

the `two pairs` of values $(X_{one}, Y_{one})$ and $(X_{two}, Y_{two})$ can be seen as:
   - `two labels`: $one$ and $two$
   - with `two values` labelled `X` and `Y` each
   - with `three values` each indexed by the label `row`

something like this:

|$\  $ |one   |$\ $ |$\ $ |two | $\ $|
|-     |-     |-    |    -|  - |-    |
|$\ $  |**X**     |**Y**    |$\ $ |**X**   |**Y**    |
|**row**   | $\ $ |$\ $ |$\ $ |$\ $|$\ $ | 
|**0**  |1     |4    |$\ $ |10  |40   |
|**1**  |2     |5    |$\ $ |20  |50   |
|**2**  |3     |6    |$\ $ |30  |60   |


it is `multi-indexing`

   - you want to express `multi-dimensionality` in a data structure of `lower dimension`
      - a $\texttt{pandas.Series}$ with `more than one` dimension
      - a $\texttt{pandas.DataFrame}$ with `more than two` dimensions
      
      

   - `columns` `Index` will be `replaced by` a `columns multiIndex`

   - you express the `multi-indexing` by `tuples` of `related labels`

In [None]:
tuples_from_pairs = [('one', 'X'), ('one', 'Y'), ('two', 'X'), ('two', 'Y')]

   - you create a multi-index `object` from the `tuples`

In [None]:
pd.MultiIndex.from_tuples(tuples_from_pairs)

a multi-index is `composed` of:
   - the `levels` (groups of labels in *`descending`* `levels` like $[[one, two], [X, Y]]$)
   - their `coding`
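
These two components can be inspected directly. In this sketch, `levels` holds the distinct labels of each level and `codes` gives, for each entry, the position of its label inside the corresponding level (the variable name `mi` is only for the illustration).

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([('one', 'X'), ('one', 'Y'), ('two', 'X'), ('two', 'Y')])

print(mi.levels)   # the distinct labels of each level
print(mi.codes)    # for each entry, the position of its label in the level
```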

   - you `replace` the columns index by a columns multi-index

In [None]:
df.columns = pd.MultiIndex.from_tuples(tuples_from_pairs)

In [None]:
df

   - you have now an indexing with `hierarchical columns`

   - to  `access` multi-index  use the $\texttt{pandas.DataFrame.loc}$ and $\texttt{pandas.DataFrame.iloc}$
   - the first index is the row and the second is the column

In [None]:
df.loc[0] # first row, all columns

In [None]:
df.loc[0, 'one'] # first row, columns 'one'

In [None]:
df.loc[0, ['one', 'two']] # first row
                          # list of columns 'one' and 'two'

   - the index of the columns is `hierarchical`
   - i.e. it can be described using `tuples` of `labels`
   - the same `tuples` you use to construct the `multi-index`
   - $(one, X), (one, Y), (two, X), (two, Y)$

   - you can use `tuples` of `labels` with `.loc`

In [None]:
df.loc[0, ('two', 'X')] # first row
                        # column label ('two', 'X')

In [None]:
df.loc[[0,2], ('two', 'X')] # column label ('two', 'X')
                            # of first and third rows

   - you can use `.iloc`

In [None]:
df.iloc[0] # first row
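
Selecting on the inner level (all the `X` columns, whatever the outer label) is not shown above. Two possible ways are sketched here, on a rebuilt copy of the same data frame called `frame`: the `xs` method (cross-section) and `.loc` with `slice(None)` standing for "every label of this level".

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([('one', 'X'), ('one', 'Y'), ('two', 'X'), ('two', 'Y')])
frame = pd.DataFrame([[1, 4, 10, 40], [2, 5, 20, 50], [3, 6, 30, 60]], columns=cols)

x_only = frame.xs('X', axis=1, level=1)       # inner label 'X', the inner level is dropped
x_sliced = frame.loc[:, (slice(None), 'X')]   # same columns, both levels are kept

print(x_only)
```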

   - multi-index on `rows` and `columns`

In [None]:
# index for years and visits
index = pd.MultiIndex.from_product([[2013, 2014],
                                    [1, 2, 3]],
                                   names=['year',
                                          'visit'])

In [None]:
# columns for clients and medical data
columns = pd.MultiIndex.from_product([['Alice', 'Bob'],
                                      ['before test', 'after test']],
                                     names=['Patient',
                                            'HeartRate'])

In [None]:
data = np.random.randint(60, 100, 24).reshape(6, 4) # heart rates between 60 and 99 beats per minute


medical_data = pd.DataFrame(data, index=index, columns=columns)
medical_data

In [None]:
medical_data.columns

In [None]:
medical_data.loc[:, 'Alice'] # all medical data on Alice

In [None]:
medical_data.loc[(2013, 2), ('Alice', 'before test')] # Alice's heart rate 'before test'
                                                      # in the second visit in 2013

In [None]:
medical_data.loc[(2013, 2), ('Alice', 'before test')] = 82

   - you `must` use `.loc` or `.iloc` to modify an element
   - never use `chained` indexing (like `df[a][b] = value`): it may modify a temporary `copy`
   - http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a

In [None]:
medical_data < 80 # element-wise test, it returns a Boolean data frame

## VII) Importing data in pandas

### 1) formats of files

   - `pandas` can `import` files of `a lot of formats`
      - CSV, JSON, HTML, Excel, ...
   - see http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

### 2) reading and writing `csv` files (comma separated values)

   - to write a `csv` use the `method` $\texttt{pandas.DataFrame.to_csv}$

In [None]:
distance = pd.Series([0.387, 0.723, 30, 1., 5.203, 1.523, 9.6, 19.19],
                     index=['Mercury', 'Venus', 'Neptune', 'Earth', 'Jupiter', 'Mars', 'Saturn', 'Uranus'])

lowest_temp = pd.Series([-200.0, 446.0,  -90.0, -125.0, -140.0],
                        index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])

highest_temp = pd.Series([430.0, 490.0, 60.0, 17.0, 20.0],
                         index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])

planets = pd.DataFrame({'distance': distance,
                        'lowest temperature': lowest_temp, 
                        'highest temperature': highest_temp, 
                        'origin':'solar system'})

In [None]:
planets

In [None]:
planets.index

In [None]:
planets.to_csv('planets.csv', index_label='names', float_format='%.3f')

   - a file `planets.csv` has been `created` in your current folder
   - we gave a `name` to the `rows` index

   - the csv `format` is very `simple`: a `two-dimensional` table of text, where:
   - by default, the `first` line is the `columns` header (`labels` if any, else `indexes`)
   - the `other` lines are `rows` written `one below the other` with values `separated by` ','

`planets.csv`
   - *names,distance,lowest temperature,highest temperature,origin  
Earth,1.0,-90.0,60.0,solar system  
Jupiter,5.203,-125.0,17.0,solar system  
Mars,1.523,-140.0,20.0,solar system  
Mercury,0.387,-200.0,430.0,solar system  
Neptune,30.0,,,solar system  
Saturn,9.6,,,solar system  
Uranus,19.19,,,solar system  
Venus,0.723,446.0,490.0,solar system*

   - to `read` a `csv` use the `function` $\texttt{pandas.read_csv}$

In [None]:
df = pd.read_csv('planets.csv')

In [None]:
df = df.set_index('names')       # the column 'names' becomes the rows index

#### digression:
   - you can see a `general floating point problem` $5.523$ became $5.5230000000000001$ when printed by $\texttt{to_csv}$
   - https://github.com/pandas-dev/pandas/issues/17154

In [None]:
planets.loc['Mars', 'distance'], df.loc['Mars', 'distance']

In [None]:
df.loc['Mars', 'distance']  == planets.loc['Mars', 'distance']

In [None]:
np.isclose(df.loc['Mars', 'distance'], planets.loc['Mars', 'distance'])

*trying to get exact equality out of floating points is generally a losing battle*

*let's go back to the course*

#### the function $\texttt{pandas.read_csv}$
   - has many optional `parameters` that you can `set`
   - see the help

In [None]:
#pd.read_csv?

In [None]:
#pd.DataFrame.to_csv?
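
Among these optional parameters, `index_col` sets the index while reading, which avoids a separate `set_index` call. This is a minimal sketch: the csv content is a made-up extract held in memory with `io.StringIO`, so that no file is needed.

```python
import io
import pandas as pd

# hypothetical csv content, held in memory instead of a file
csv_text = "names,distance\nEarth,1.000\nMars,1.523\n"

by_names = pd.read_csv(io.StringIO(csv_text), index_col='names')

print(by_names.loc['Mars', 'distance'])
```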

## VIII)  $\texttt{pandas.DataFrame.groupby}$

   - can be applied on $\texttt{pandas.Series}$ and on $\texttt{pandas.DataFrame}$
   - it `groups together` the rows of a DataFrame or a Series that share the same key
   - for example, to compute an operation on each of these groups

   - http://pandas.pydata.org/pandas-docs/stable/groupby.html

In [None]:
df = pd.DataFrame([[1.70, 67, 'M', 'Lower'], [1.67, 59, 'F', 'Middle'], [1.84, 78, 'F', 'Lower'],
                   [1.86, 90, 'F', 'Middle'], [1.56, 45, 'M', 'Middle'], [1.57, 63, 'M', 'Lower']],
                  columns=['height', 'weight', 'sex', 'income'])
df

   - we can `group by` `income`, or `sex`

In [None]:
gdf1 = df.groupby('sex')
gdf1.size() # we have two groups:
               # the three rows with `sex` == 'F' are grouped together
               # the three rows with `sex` == 'M' are grouped together

In [None]:
gdf1.groups # the description of the group

In [None]:
gdf2 = df.groupby(['sex', 'income'])
gdf2.size() # we have four groups:
            # the first group contains one row with 'sex' == 'F' and 'income' == 'Lower'
            # the second group contains two rows with 'sex' == 'F' and 'income' == 'Middle'
            # ...

In [None]:
gdf2.groups

   - you can apply `operations` on the groups

In [None]:
gdf2.sum()
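
Several operations can be computed at once with `agg`. The sketch below uses named aggregation, where each keyword is the name of an output column and each tuple is (input column, function); the `people` data and the output names are made up for the illustration.

```python
import pandas as pd

people = pd.DataFrame({'height': [1.70, 1.67, 1.84, 1.86],
                       'weight': [67, 59, 78, 90],
                       'sex': ['M', 'F', 'F', 'F']})

stats = people.groupby('sex').agg(mean_height=('height', 'mean'),
                                  max_weight=('weight', 'max'),
                                  n_people=('weight', 'count'))

print(stats)
```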

## IX) combining datasets

### 1) concatenation of `data frames` with the function $\texttt{pandas.concat}$

   - you can `concatenate` $\texttt{pandas.Series}$
   - you can `concatenate` $\texttt{pandas.DataFrame}$ along an `axis` (rows or columns)
   
   
   - it generates a `new` $\texttt{pandas.DataFrame}$
   
   
   
   - there are many optional `parameters` you can `set`

#### a) concatenation along the `columns axis`

   - the first `data frame`

In [None]:
df1 = pd.DataFrame([[1.70, 67], [1.67, 59], [1.84, 78],
                    [1.86, 90,], [1.56, 45,], [1.57, 63]],
                  columns=['height', 'weight'],
                  index=['Gabriel', 'Emma', 'Jules', 'Louise', 'Hugo', 'Nathan'])
df1.head(2)

   - the second `data frame`

In [None]:
df2 = pd.DataFrame([['M', 'Lower'], ['F', 'Middle'], ['M', 'Lower'],
                   ['F', 'Middle'], ['M', 'Middle'], ['M', 'Lower']],
                  columns=['sex', 'income'],
                  index=['Gabriel', 'Emma', 'Jules', 'Louise', 'Hugo', 'Nathan'])
df2.head(2)

   - their `concatenation`

In [None]:
df3 = pd.concat([df1, df2], axis=1)
df3.tail(2)

#### b) concatenation along the `rows` `axis` 

   - the first `data frame` is `df3`

   - the second `data frame`

In [None]:
df4 = pd.DataFrame([[1.54, 45, 'F', 'Lower'], [1.76, 84, 'F', 'Middle'], [1.67, 72, 'F', 'Middle']],
                  columns=['height', 'weight', 'sex', 'income'],
                  index=['Alice', 'Paul', 'Léna'])
df4.head(2)

   - their `concatenation`

In [None]:
df5 = pd.concat([df3, df4], axis=0)
df5.tail(4)

#### c) concatenation in presence of duplicate indexes

   - by default you will get `several` index or column entries with the `same name`

   - the first dataset

In [None]:
df3.index

   - the second dataset

In [None]:
df6 = pd.DataFrame([[1.54, 45, 'F', 'Lower'], [1.76, 84, 'F', 'Middle'], [1.67, 72, 'F', 'Middle']],
                  columns=['height', 'weight', 'sex', 'income'],
                  index=['Emma', 'Paul', 'Louise'])
df6.index

In [None]:
set(df3.index).intersection(df6.index) # or df3.index.intersection(df6.index)

   -  we concatenate in presence of two `duplicated` indexes

In [None]:
df7 = pd.concat([df3, df6], sort=False)
df7.loc['Emma'] 

   - you get two 'Emma' entries in your index

   - you can `force` `duplicate` indexes check with the $\texttt{verify_integrity}$ parameter

In [None]:
try:
    df7 = pd.concat([df3, df6], verify_integrity=True)
except ValueError as e:
    print(e)        
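
When duplicates are expected, two other parameters of `pandas.concat` can help: `ignore_index` discards the original labels, and `keys` adds an outer index level naming the source of each row. A minimal sketch on made-up frames `first` and `second`:

```python
import pandas as pd

first = pd.DataFrame({'height': [1.67, 1.86]}, index=['Emma', 'Louise'])
second = pd.DataFrame({'height': [1.54, 1.76]}, index=['Emma', 'Paul'])

renumbered = pd.concat([first, second], ignore_index=True)      # new index 0, 1, 2, 3
tagged = pd.concat([first, second], keys=['first', 'second'])   # adds an outer index level

print(tagged.loc[('second', 'Emma'), 'height'])
```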

   - you can `concatenate` when `axes` are not `aligned`
   - `missing values` are replaced by $\texttt{numpy.nan}$

   - the first `data frame` does not contain the `sex` column

In [None]:
df9 = pd.DataFrame([[1.70, 67, 'Lower'], [1.67, 59, 'Middle']],
                  columns=['height', 'weight', 'income'],
                  index=['Paul', 'Louise'])
df9.head(2)

   - the second `data frame` does not have the `income` column

In [None]:
df10 = pd.DataFrame([[1.54, 45, 'F'], [1.76, 84, 'F']],
                  columns=['height', 'weight', 'sex'],
                  index=['Alice', 'Léna'])
df10.head(2)

In [None]:
df11 = pd.concat([df9, df10], axis=0, sort=False) # (we pass `sort=False` to silence a warning)
df11

   - the `resulting` dataframe contains NaN (not available, not a number, ...)

   - $\texttt{pandas.DataFrame.append}$ was a `shortcut` to `concat` with a `simplified interface`; it was removed in pandas 2.0, use $\texttt{pandas.concat}$ instead

### 2) combining datasets with $\texttt{pandas.merge}$

   - the `rows` represent `objects` (like `objects` in a `data base`)
   - you `merge` two `data frames` by `joining` objects

   - two `rows` are `merged` if they have a `matching key`
   - a `key` is defined by `one or several` columns `names`
   - by default `merge` considers `all` columns with the `same name` in the data frames

#### a) `one-to-one` merge

   - there is `no duplicate entry` in the `key columns`
   - two `rows` are merged when the `key column` matches

   - the first `data frame`

In [None]:
df1 = pd.DataFrame({'names': ['Gabriel', 'Emma', 'Jules'],
                    'sex': ['M', 'F', 'M']})
df1

   - the second `data frame`

In [None]:
df2 = pd.DataFrame({'names': ['Gabriel', 'Emma', 'Paul'],
                    'incomes': ['Lower', 'Middle', 'Lower']})
df2

   - the two `data frames` describe two common `objects`: 'Gabriel' and 'Emma'
   - 'Jules' and 'Paul' appear in only one `data frame`, so they cannot be joined to another `object`
   - you `merge` objects of the two `data frames` `one-to-one`

In [None]:
df3 = pd.merge(df1, df2)
df3
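
By default `merge` keeps only the matching objects (an inner join). To see what was left out, one option is the `how` and `indicator` parameters, sketched here on copies of the two frames above named `left` and `right`.

```python
import pandas as pd

left = pd.DataFrame({'names': ['Gabriel', 'Emma', 'Jules'], 'sex': ['M', 'F', 'M']})
right = pd.DataFrame({'names': ['Gabriel', 'Emma', 'Paul'],
                      'incomes': ['Lower', 'Middle', 'Lower']})

inner = pd.merge(left, right)                                # default: how='inner'
outer = pd.merge(left, right, how='outer', indicator=True)   # keep unmatched rows too

# the '_merge' column tells where each row comes from
origin = outer.set_index('names')['_merge'].astype(str).to_dict()
print(origin)
```

The unmatched rows are filled with missing values in the columns coming from the other frame.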

   - the case where keys are `not unique` is covered in b) and c)

   - a `key` can span `several columns`

In [None]:
df1 = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Emma'],
                    'incomes': ['Lower', 'Upper', 'Lower'],
                    'height': [1.87, 1.67, 1.64]})

df2 = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Emma'], 
                    'sex': ['M', 'M', 'F'],
                    'incomes': ['Lower', 'Middle', 'Lower'],})

In [None]:
pd.merge(df1, df2) # merge on 'names' and 'incomes'
                   # 'Jules' is not the same object, incomes are different

#### b) `many-to-one` merge

   - one of the two `key columns` contains `duplicate values`
   - a `one-to-one` strategy for each duplicated row is `applied`

In [None]:
df4 = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Jules'],   # two 'Jules'
                    'sex': ['M', 'M', 'F'],
                    'height': [1.87, 1.67, 1.84]})

In [None]:
df5 = pd.DataFrame({'names': ['Gabriel', 'Emma', 'Jules'],
                    'incomes': ['Lower', 'Middle', 'Lower'],})

In [None]:
pd.merge(df4, df5) # every 'Jules' of the first data frame is merged with the 'Jules' of the second data frame 

   - another example

In [None]:
df3 = pd.DataFrame({'names': ['Gabriel', 'Emma', 'Jules'],
                    'incomes': ['L', 'M', 'L']})
df3

In [None]:
df4 = pd.DataFrame({'incomes': ['L', 'M', 'U'],
                    'explanation': ['Lower', 'Middle', 'Upper']})
df4

In [None]:
pd.merge(df3, df4)

#### c) `many-to-many` merge

   - both `key columns` contain `duplicate` entries
   - a `cartesian product` is used

In [None]:
df5 = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Jules'],   # two "Jules"
                    'height': [1.87, 1.67, 1.84]})
df5

In [None]:
df6 = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Jules'],   # two 'Jules'
                    'incomes': ['Lower', 'Middle', 'Lower']})

In [None]:
pd.merge(df5, df6) # four 'Jules' rows
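
A many-to-many merge is often a sign of unexpected duplicates. The `validate` parameter of `merge` lets you state which kind of merge you expect and raises an error otherwise. A minimal sketch on frames equivalent to the ones above (the names `heights`, `incomes` and `checked` are only for the illustration):

```python
import pandas as pd

heights = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Jules'],
                        'height': [1.87, 1.67, 1.84]})
incomes = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Jules'],
                        'incomes': ['Lower', 'Middle', 'Lower']})

product = pd.merge(heights, incomes)   # 1 'Gabriel' row + 2 x 2 'Jules' rows
try:
    pd.merge(heights, incomes, validate='one_to_one')
    checked = 'accepted'
except pd.errors.MergeError as e:
    checked = 'rejected'               # the keys are not unique
    print(e)
```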

#### d) controlling `keys`

   - you can `specify` the `key columns` with the parameter $\texttt{on='names'}$
   
   
   - you can `link` columns with different `names` (parameters $\texttt{left_on='names', right_on='identity'}$)

In [None]:
df1 = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Emma'],
                    'incomes': ['Lower', 'Upper', 'Lower'],
                    'height': [1.87, 1.67, 1.64]})

df2 = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Emma'], 
                    'sex': ['M', 'M', 'F'],
                    'incomes': ['Lower', 'Middle', 'Lower'],})

In [None]:
pd.merge(df1, df2, on=['names']) # 'incomes' is no longer part of the key: both columns are kept as 'incomes_x' and 'incomes_y'

In [None]:
df1 = pd.DataFrame({'names': ['Gabriel', 'Jules', 'Emma'],
                    'incomes': ['Lower', 'Middle', 'Lower'],
                    'height': [1.87, 1.67, 1.64]})

df2 = pd.DataFrame({'identity': ['Gabriel', 'Jules', 'Emma'], 
                    'sex': ['M', 'M', 'F'],
                    'incomes': ['Lower', 'Middle', 'Lower'],})

   - columns with the `same name` that are `excluded` from the `key` are `renamed` (suffixes `_x` and `_y` by default)
   

In [None]:
pd.merge(df1, df2, left_on='names', right_on='identity')   # 'incomes' is excluded => incomes will be renamed
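
The default suffixes `_x` and `_y` are not very telling; the `suffixes` parameter of `merge` lets you choose them. In this sketch the frames `a` and `b` and the suffixes `_declared` and `_observed` are made up for the illustration.

```python
import pandas as pd

a = pd.DataFrame({'names': ['Gabriel', 'Jules'], 'incomes': ['Lower', 'Upper']})
b = pd.DataFrame({'identity': ['Gabriel', 'Jules'], 'incomes': ['Lower', 'Middle']})

default_names = pd.merge(a, b, left_on='names', right_on='identity')
custom_names = pd.merge(a, b, left_on='names', right_on='identity',
                        suffixes=('_declared', '_observed'))

print(list(default_names.columns))
print(list(custom_names.columns))
```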

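   - `merge` does not preserve the `index`: when merging on columns, the result gets a new integer index

   - $\texttt{pandas.DataFrame.join}$ is a shortcut to `merge` that joins on the `index` by default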

##  X) pivoting tables

   - let us see an `example`

you have a `dataset` about the sinking of the `Titanic`
   - for each `passenger` you have values such as:
      - the `survival status`
      - the `sex`
      - the `class` (first, second, third)
      - the `age`

   - we `import` the dataset
   - we keep the `interesting columns` (parameter $\texttt{usecols}$)

In [None]:
df = pd.read_csv('titanic.csv', usecols=['Survived', 'Pclass', 'Sex', 'Age'])

In [None]:
df.head(3)

   - the `number` of passengers in `each` class

In [None]:
df['Pclass'].value_counts()

suppose you want to know
   - the `survival rate` depending on the `sex` and the `class` 

you know
   - the `value` to be `aggregated`: here the `survival status` (column `Survived`)
   - the `aggregation` function: here $\texttt{numpy.mean}$
   - the `key` to be the `index`: here the column `Sex`
   - the `key` to be the `columns`: here the column `Pclass`

   - this is done by the $\texttt{pandas.DataFrame.pivot_table}$ `method`

In [None]:
df.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc=np.mean)

   - it returns a `new` data frame 

In [None]:
df1 = df.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc=np.mean)

   - `Pclass` is the name of the `columns index`
   - `Sex` is the name of the `rows index`
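
A pivot table can be read as a `groupby` on the two keys followed by `unstack`, which moves the second key to the columns. The sketch below checks this on a made-up miniature data set `mini` that only borrows the column names of the Titanic file.

```python
import pandas as pd

# hypothetical miniature data set with the same column names as the Titanic file
mini = pd.DataFrame({'Survived': [1, 0, 1, 1, 0, 0],
                     'Sex': ['female', 'female', 'female', 'male', 'male', 'male'],
                     'Pclass': [1, 3, 3, 1, 3, 3]})

pivoted = mini.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc='mean')
grouped = mini.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack()

print(pivoted)
```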

   - another example

   - you want to `compute` the `survival rate` by `age group`
   - but we do not have `age group`
   - so we must `pack` the ages in `bins` representing `age groups`

   - we create `bins` of ages and `names` for those `bins`

In [None]:
age_groups = [0, 11, 17, 25, 35, 45, 55, 65, 100]
age_group_names = ['<11', '11-17', '17-25', '25-35', '35-45', '45-55', '55-65', '>65'] 

   - we create a new `column` where the `Age` is replaced by the `age group` 
   - using the `function` $\texttt{pandas.cut}$ with the `bins` and the `names`

   - we can `add` the column in our data frame

In [None]:
df['Age group'] = pd.cut(df['Age'], bins=age_groups, labels=age_group_names)
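
Note that by default the bins of `pandas.cut` are closed on the right: with the edges $0, 11, 17$ the first bin is $]0, 11]$, so an age of exactly 11 falls in the first group. A small check on made-up ages and labels:

```python
import pandas as pd

ages = pd.Series([5, 11, 12, 40])
groups = pd.cut(ages, bins=[0, 11, 17, 100], labels=['<=11', '12-17', '>17'])

print(list(groups))   # 11 belongs to the first bin, 12 to the second
```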

In [None]:
#df.sort_values(by='Age', ascending=False)

   - we compute a new `data frame` with the  `survival rate` by `age group`

In [None]:
df.pivot_table('Survived', index=['Sex', 'Pclass'], columns='Age group', aggfunc=np.mean)

   - a `higher` rate of `women` was `saved` in `all` categories except `children under 11`
   - where $55 \%$ of the boys were saved against $54 \%$ of the girls

   - we `do not need` to `add` a column to the `data frame`
   
   - here we  `pass` the number of `bins` and their `names`

In [None]:
col = pd.cut(df['Age'], 3, labels=['child', 'adult', 'old'])

In [None]:
df.pivot_table('Survived', index=['Sex', 'Pclass'], columns=col, aggfunc=np.mean)

In [None]:
df = pd.read_csv('titanic.csv')

  - example of changing the `data`
  - for example, we want to `replace` the `number` of the classes by the `names`
  
  
  - create the `Boolean mask`
  - always use `loc` or `iloc`, `never` a chained `array`-style assignment

In [None]:
mask = (df['Pclass'] == 1)  # we have a mask of indexes

In [None]:
mask.head()

   - we `localize` the `true` values
   - and replace their `Pclass` column `value` by the string `'first'`

In [None]:
df.loc[ mask, 'Pclass'] = 'first'

In [None]:
df.loc[ df['Pclass'] == 2, 'Pclass'] = 'second'
df.loc[ df['Pclass'] == 3, 'Pclass'] = 'third'

In [None]:
df.head()
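
When every value of a column must be translated, an alternative to one `loc` assignment per value is the `map` method with a dictionary. This is a sketch on a made-up series `classes`; it also avoids writing strings into an integer column, which recent pandas versions may warn about.

```python
import pandas as pd

classes = pd.Series([1, 3, 2, 3], name='Pclass')
class_names = classes.map({1: 'first', 2: 'second', 3: 'third'})

print(list(class_names))
```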

##  XI) dealing with `temporality` in $\texttt{pandas}$

`3 ways` to talk about `temporality`
   - `date` or `time`: it is an `instant` e.g. `just now` 
   - `time duration`: e.g. `3 hours` (deltas)
   - `time period`: it is an `interval of time` e.g. a `date` plus a `duration`

### 1) date and time intervals in `numpy`

#### a) date in `numpy` $\texttt{numpy.datetime64}$

   - dates in $\texttt{pandas}$ are based on $\texttt{numpy.datetime64}$
   - the format is `'year-month-day hour:minute:second'`
   - the numbers are `zero-padded` ($09$ and not $9$)

In [None]:
np.datetime64('2019-09-04 14:00:09')

#### b) `time duration` in `numpy` $\texttt{numpy.timedelta64}$
   

In [None]:
np.datetime64('2019-09-04 14:00:09') - np.datetime64('2019-09-11 09:27:09')

### 2)  temporality in `pandas`

#### a) dates in `pandas` $\texttt{pandas.Timestamp}$

In [None]:
pd.Timestamp(0) # the Unix epoch (an integer is a number of nanoseconds since that date)

*"the Unix time is the number of seconds that have elapsed since 00:00:00 Thursday, 1 January 1970"*

In [None]:
pd.Timestamp('2019-10-04 14:00:00') 

   - if you have a `specific format` use $\texttt{pandas.to_datetime}$ with the $\texttt{format}$ parameter
   - `Y` is year (2019), `y` is year (19), `m` is month, `d` is day (number), `M` is minute, ...

In [None]:
pd.to_datetime('2019|10|04 14;00;07', format='%Y|%m|%d %H;%M;%S') 

#### b) time duration in `pandas`  $\texttt{pandas.Timedelta}$

   - it is a `time interval`
   - with no mention of a precise `date`

   - `duration` between `two` dates

In [None]:
pd.Timestamp('2019-09-04 14:00:00') - pd.Timestamp('2019-09-04 8:36:57')

In [None]:
pd.Timestamp('2019-10-04 14:00:00') - pd.Timestamp('2019-09-04 8:36:57')
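
A `Timedelta` can also be built directly and added to a date, and it can be converted to a plain number of seconds. A minimal sketch (the names `delta` and `later` are only for the illustration):

```python
import pandas as pd

delta = pd.Timestamp('2019-09-04 14:00:00') - pd.Timestamp('2019-09-04 08:36:57')

print(delta.total_seconds())   # 5 h 23 min 3 s expressed in seconds
later = pd.Timestamp('2019-09-04 14:00:00') + pd.Timedelta(hours=3)
print(later)
```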

In [None]:
#pd.Timedelta?

#### c) time period in `pandas`  $\texttt{pandas.Period}$

   - a `period` is a `date` and a `duration`

In [None]:
d = pd.Timestamp('2019-10-04 14:00:00')

In [None]:
d.to_period(d - pd.Timestamp('2019-09-04 8:36:57'))

In [None]:
d.to_period('D')
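
A period knows where it starts and where it ends. The sketch below looks at the day built above and at a month created directly with `pandas.Period` (the names `day` and `month` are only for the illustration).

```python
import pandas as pd

day = pd.Timestamp('2019-10-04 14:00:00').to_period('D')

print(day.start_time)   # first instant of the period
print(day.end_time)     # last instant of the period
month = pd.Period('2019-10', freq='M')
print(month.days_in_month)
```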

### 3) columns of dates for $\texttt{pandas.DataFrame}$

#### a) in an already created `dataframe`

   - you have a `dataset` with a column of `dates`

In [None]:
df = pd.DataFrame({'time': ['2019/12/25 23:59', '2019/12/31 23:59'],
                   'holidays': ['Christmas', 'New Year']})

   - the `time` is a `simple` python `string`

In [None]:
type(df.loc[0, 'time'])   

   - you can `transform` `strings` into `objects` of type `date`
   - with the $\texttt{pandas.to_datetime}$ function

In [None]:
df['time'] = pd.to_datetime(df['time'])

   - note that $\texttt{pandas.to_datetime}$, applied to an `array of dates`, returns an `index of dates`

In [None]:
df.dtypes

   - remember the $\texttt{pandas}$ datetimes rely on $\texttt{numpy.datetime64}$

   - you can `index` the `data frame` by a column of dates 

In [None]:
df.set_index('time')
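
Once a data frame is indexed by dates, `.loc` accepts partial date strings, and the components of the dates are available on the index. A sketch on a made-up frame `events` (a third row was added so that the selection is visible):

```python
import pandas as pd

events = pd.DataFrame({'time': pd.to_datetime(['2019/12/25 23:59', '2019/12/31 23:59',
                                               '2020/01/01 00:00']),
                       'holidays': ['Christmas', 'New Year eve', 'New Year']})
events = events.set_index('time')

december = events.loc['2019-12']   # every row of December 2019
years = events.index.year          # components are available on the index

print(december)
```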

#### b) creating `date` type while reading the `csv` file

   - you can `convert` the date during a `csv` read
   - and index your DataFrame by the date

   - we write the data frame in a file without the index

In [None]:
df.to_csv('foo.csv', index=False)

   - we `parse` the date while we `read` the csv-file

In [None]:
df = pd.read_csv('foo.csv', parse_dates=['time'])

In [None]:
df.dtypes

In [None]:
df.head()

   - we `index` the data frame by `date` while we read the csv-file

In [None]:
df = pd.read_csv('foo.csv', parse_dates=['time'], index_col='time')

In [None]:
df.head()

In [None]:
df.index

   - you have a new type: $\texttt{pandas.DatetimeIndex}$

   - for an `unusual date format` indicate the parser function `to use`
   - in the file `test_date.csv` we replaced the '/' by '|' in the date strings

In [None]:
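def my_date_parser(d):
    # parse dates written like '2019|12|25 23:59'
    return pd.to_datetime(d, format='%Y|%m|%d %H:%M')

# note: pandas 2.0 deprecated date_parser in favor of the date_format parameter
df = pd.read_csv('test_date.csv', parse_dates=['time'], index_col='time', date_parser=my_date_parser)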

In [None]:
df.head()
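
A variant that does not depend on the parser parameters of `read_csv` is to read the column as strings and convert it afterwards with `pandas.to_datetime`. This is a sketch: the csv content below is a made-up stand-in for `test_date.csv`, held in memory with `io.StringIO`.

```python
import io
import pandas as pd

# hypothetical content standing in for the file test_date.csv
csv_text = "time,holidays\n2019|12|25 23:59,Christmas\n2019|12|31 23:59,New Year\n"

raw = pd.read_csv(io.StringIO(csv_text))                             # 'time' is read as strings
raw['time'] = pd.to_datetime(raw['time'], format='%Y|%m|%d %H:%M')   # explicit conversion
indexed = raw.set_index('time')

print(indexed.index)
```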

### 4) when dates are `wrong` you can `ignore` or `coerce`

   - you get an error

In [None]:
try:
    pd.to_datetime('30/02/2019')
except ValueError as e:
    print(e)

   - you ignore the error

In [None]:
pd.to_datetime('30/02/2019', errors='ignore') # the input string is returned unchanged (deprecated in recent pandas)

   - you coerce the error

In [None]:
pd.to_datetime('30/02/2019', errors='coerce') # this is Not a Time

   - it is the $\texttt{pandas}$ `object`: $\texttt{pandas.NaT}$
   - classical `NaN` methods work on `NaT values`
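
For instance, `isna` and `dropna` treat `NaT` like any other missing value. A minimal sketch on a made-up series containing one impossible date:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2019-02-28', '2019-02-30']), errors='coerce')

print(dates.isna().tolist())   # the 30th of February became NaT
valid = dates.dropna()         # only the valid date is kept
print(valid)
```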

# ANNEXES

## 1) Dealing with unicode in $\texttt{pandas}$


https://docs.python.org/3/library/codecs.html#standard-encodings

In [None]:
df = pd.DataFrame({'names': ['à è é ï ù']})

In [None]:
df.to_csv('bar.csv', header=True, index=False, encoding='utf-8')