<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

# Pandas

#### this course is inspired by
   - Arnaud Legout, Inria, (courses and MOOC python)
   - Thierry Parmentelat (the numeric part of the MOOC python)
 

#### pandas cheat sheet
   - https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

### pandas
   - **under development** since 2008
   - tries to **close the gap** between **python**, **statistical computing** and **multidimensional datasets**
   - not very **intuitive** but very **powerful** and there is **no better** solution
   
   
   - PyHPC11 (Python High-Performance and Scientific Computing conference 2011):
      - https://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf

# objectives

   - data **arrays** with **labeled axes**
   - **vectorized** operations
   - handling of **missing data**
   - **merge**, **pivot**, **groupby** and other **relational** operations
   - automatic or explicit **data alignment**
   - integrated **time series** functionality

### pandas versus numpy

   - numpy contains **efficient** array **creation** and **manipulations**
   - pandas adds an **index-based structure** on top of numpy.ndarray

  - see http://pandas.pydata.org/pandas-docs/stable/
  
  

   - pandas uses numpy as a **black-box**
   - **no assumption** is made on memory allocation
   
   
   - i.e. pandas works mainly with **copies** instead of **in-place** modification
   - see https://stackoverflow.com/questions/23296282/what-rules-does-pandas-use-to-generate-a-view-vs-a-copy
   
   

   - **in-place** modifications in pandas are usually implemented as a copy followed by an assignment
   - see https://stackoverflow.com/questions/22532302/pandas-peculiar-performance-drop-for-inplace-rename-after-dropna/22533110#22533110
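
A minimal sketch of what "working with copies" means in practice (illustrative values): a method such as `drop` returns a new series and leaves the original untouched, so the usual idiom is to rebind the name to the result.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# drop returns a new Series; the original keeps all its elements
t = s.drop('a')
print(len(s), len(t))

# the usual idiom: rebind the name to the result
s = s.drop('a')
print(s.index.tolist())
```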

   - importing  **pandas library**

In [None]:
import pandas as pd

   - pandas version

In [None]:
pd.__version__

   - Pandas version and version of its dependencies

In [None]:
#pd.show_versions(as_json=False)  # very long do not print

there are two pandas containers
   - I) $\texttt{pandas.Series}$ is for **one-dimensional** arrays 
   - II) $\texttt{pandas.DataFrame}$ is for **two-dimensional** arrays

## I) $\texttt{pandas.Series}$

#### Series contain
   - **array-like** data
   - an **array-like** index of same length as the data
   - by default, the index starts at $0$

### 1) creating $\texttt{pandas.Series}$ from arrays

example
   - we have $11$ **European countries** with their **names**
   - their corresponding **total areas** in $km^2$
      - Russia, Ukraine, France, Spain, Sweden, Norway, Germany, Finland, Poland, Italy, UnitedKingdom
      - 3972400, 603628, 551695, 505992, 450295, 385178, 357578, 338145, 312685, 301338, 242495

we can create a $\texttt{pandas.Series}$ with the areas

In [None]:
import pandas as pd

In [None]:
areas = [3972400, 603628, 551695, 505992, 450295, 385178,
                   357578, 338145, 312685, 301338, 242495]
sa = pd.Series(areas)
sa

   - by **default** a $\texttt{pandas.Series}$ is **indexed** by **numbers**
   - the **type** of the elements is here $\texttt{int64}$

In [None]:
sa.index

   - the **index** is also called the **keys**

In [None]:
sa.keys() is sa.index

   - they are the **same** **python** object

### 2) providing an index to a $\texttt{pandas.Series}$

   - in our example, the series is **indexed** by **numbers**
   - we can index it by the **names** of the countries

In [None]:
countries = ['Russia', 'Ukraine', 'France', 'Spain',
             'Sweden', 'Norway',  'Germany', 'Finland',
             'Poland', 'Italy', 'UnitedKingdom']
sc = pd.Series(areas, index = countries)

In [None]:
sc.index

   - the **index** need not be **unique**

In [None]:
index = ['a', 'b', 'c', 'a', 'c']
s = pd.Series([10, 39, 27, 8, 46], index = index)

   - the $\texttt{index}$ $\texttt{'a'}$ has two values

In [None]:
s['a']
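
A short sketch of the consequence of a non-unique index, using the same values as above: the kind of result depends on how many times the label appears. The `is_unique` attribute of the index tells whether labels are repeated.

```python
import pandas as pd

s = pd.Series([10, 39, 27, 8, 46], index=['a', 'b', 'c', 'a', 'c'])

dup = s['a']   # label 'a' appears twice: the result is a Series
one = s['b']   # label 'b' appears once: the result is a scalar
print(type(dup).__name__, len(dup))
print(one)

# the index itself can tell whether its labels are unique
print(s.index.is_unique)
```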

   - we can **test** if an **index** is in the series

In [None]:
'a' in s

   - you can **reorganize** the index of a series

In [None]:
s = pd.Series([10, 39, 27, 8, 46],
              index = ['d', 'b', 'c', 'a', 'e'])

In [None]:
s.reindex(['a', 'b', 'c', 'd', 'e', 'f'], fill_value=0)

  - you can give **default-values** for **new** elements
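
For comparison, a sketch of what happens when no default value is given (same series as above): new labels receive a missing value, and the integers are converted to floats so that `NaN` can be stored.

```python
import pandas as pd

s = pd.Series([10, 39, 27, 8, 46],
              index=['d', 'b', 'c', 'a', 'e'])

# 'f' is not in the original index: without fill_value it becomes NaN
r = s.reindex(['a', 'b', 'f'])
print(r)
print(r.dtype)   # float64, because NaN is a float

# with fill_value the new element gets the given default
r0 = s.reindex(['a', 'b', 'f'], fill_value=0)
print(r0)
```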

### 3) creating $\texttt{pandas.Series}$ from dictionaries $\{(key_i, item_i)\}$

   - **keys** of the $\texttt{dict}$ are **indexes**
   - **values** of the $\texttt{dict}$ are **elements**

In [None]:
d = {'Russia': 3972400, 'Ukraine': 603628, 'France': 551695, 'Spain': 505992,
     'Sweden': 450295, 'Norway': 385178, 'Germany': 357578, 'Finland': 338145,
     'Poland': 312685, 'Italy': 301338, 'UnitedKingdom': 242495}

scd = pd.Series(d)

In [None]:
scd.index
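
A dictionary can also be combined with an explicit index. In this sketch (with 'Denmark' chosen as an example of a key absent from the dictionary), the index selects and orders the entries, and a label without a matching key gets a missing value.

```python
import pandas as pd

d = {'Russia': 3972400, 'Ukraine': 603628, 'France': 551695}

# only the labels listed in index are kept, in that order;
# 'Denmark' is not a key of d, so its value is NaN
sub = pd.Series(d, index=['France', 'Denmark'])
print(sub)
```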

### 4) types of elements of a $\texttt{pandas.Series}$

In [None]:
index = ['a', 'b', 'c', 'a', 'c']
s = pd.Series([10, 39, 27, 8, 46], index = index)

In [None]:
s.index

in this example
   - the **type** of the **index** is $\texttt{object}$
   - and not a **fixed-length** array of characters like in $\texttt{numpy.ndarray}$ 

In [None]:
s.index.dtype # 'O' for object

   - a $\texttt{pandas.Series}$ can hold data of any type, but all its elements share a **single** type(*)
   - when the type is $\texttt{object}$, elements are **references** to Python **objects**

(*) **unlike** python containers, which can hold objects of **any type** and are **heterogeneous**

In [None]:
l = [1, 'toto', 12.89, {'a':1}, (10, 230)]
[type(e) for e in l]
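
By contrast, a sketch of how a series settles on one type for all its elements (illustrative values): pandas picks a common type when the series is built, falling back to $\texttt{object}$ when the elements have nothing in common.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])              # integers only
mixed_num = pd.Series([1, 2.5, 3])       # integers and floats: upcast to float
mixed = pd.Series([1, 'toto', 12.89])    # unrelated Python objects

print(ints.dtype)
print(mixed_num.dtype)   # float64
print(mixed.dtype)       # object
```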

### XXX) naming the elements and the index of a series

In [None]:
s = pd.Series({'Russia': 3972400, 'Ukraine': 603628, 'France': 551695})
s

In [None]:
s.name = 'areas'
s.index.name = 'countries'
s.head()
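
One place where these names become visible, as a sketch: when the series is turned into a $\texttt{pandas.DataFrame}$ with `reset_index`, the index name and the series name become the column names.

```python
import pandas as pd

s = pd.Series({'Russia': 3972400, 'Ukraine': 603628, 'France': 551695})
s.name = 'areas'
s.index.name = 'countries'

# the index becomes a regular column named after the index,
# the values become a column named after the series
df = s.reset_index()
print(df.columns.tolist())   # ['countries', 'areas']
```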

### 5) accessing the underlying $\texttt{numpy.ndarray}$ from  $\texttt{pandas.Series}$

#### it is recommended to use:

   - $\texttt{pd.Series.array}$ is a wrapper around the **underlying data**
   - $\texttt{pd.Series.to_numpy}$ returns the **underlying** data as a $\texttt{numpy.ndarray}$

In [None]:
s = pd.Series(['a', 'b', 'c'])
s

In [None]:
s.array

In [None]:
s.to_numpy()

#### it is recommended to avoid $\texttt{Series.values}$
   - it returns a $\texttt{numpy}$ array **representing the underlying data**
   - **but** it has **inconsistent behaviour** (it is not deprecated)
   - (https://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#accessing-the-values-in-a-series-or-index)
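
A sketch of this inconsistency, using categorical data as one example case: the type returned by `values` depends on the kind of data stored, whereas `to_numpy` always gives a $\texttt{numpy.ndarray}$.

```python
import numpy as np
import pandas as pd

plain = pd.Series([1, 2, 3])
cat = pd.Series(pd.Categorical(['a', 'b', 'a']))

# .values: a numpy.ndarray for plain data, but not for categorical data
print(type(plain.values).__name__)
print(type(cat.values).__name__)

# .to_numpy(): a numpy.ndarray in both cases
print(type(plain.to_numpy()).__name__)
print(type(cat.to_numpy()).__name__)
```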

      

### 6) accessing elements in a $\texttt{pandas.Series}$

In [None]:
countries = ['Russia', 'Ukraine', 'France', 'Spain', 'Sweden', 'Norway',
                       'Germany', 'Finland', 'Poland', 'Italy', 'UnitedKingdom']
areas = [3972400, 603628, 551695, 505992, 450295, 385178,
                   357578, 338145, 312685, 301338, 242495]
sc = pd.Series(areas, index = countries)

#### a) **accessing** elements by their **index** (their **key**)

the **strong** way

In [None]:
sc['Spain']

In [None]:
'Spain' in sc 

in case of **absence**: the **strong** way produces an **error**

In [None]:
try:
    sc['Denmark']
except KeyError as e:
    print(e, 'is not an index')

In [None]:
'Denmark' in sc

#### b) **accessing** elements by their **index** using $\texttt{pandas.Series.loc[]}$
   - it is a **property** not a **function**

access the element of index **Russia** in the series

In [None]:
sc.loc['Russia'] # the same element as sc['Russia']

in case of **absence**: it produces an **error**

In [None]:
try:
    sc.loc['Denmark']
except KeyError as e:
    print(e, 'is not an index')

#### c) **accessing** elements by the **position** of their index using $\texttt{pandas.Series.iloc[]}$

   - the position of the **key** in the **index**

**'Russia'** is the **first** index (i.e. $0$) 

In [None]:
sc.iloc[0] # the same element as sc.loc['Russia']

In [None]:
sc.iloc[-1] # the last one like in python

In [None]:
sc.iloc[0:3] # several indexes

in case of **absence**: it produces an **error**

In [None]:
try:
    sc.iloc[1000]
except IndexError as e:
    print(e)
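
A point worth checking when switching between the two accessors, sketched on the first four countries: a slice by position excludes its end (as in python), while a slice by label includes it.

```python
import pandas as pd

sc = pd.Series([3972400, 603628, 551695, 505992],
               index=['Russia', 'Ukraine', 'France', 'Spain'])

# positions 0 and 1: the end position 2 is excluded
by_position = sc.iloc[0:2]
print(by_position.index.tolist())

# from 'Russia' to 'France': the end label is included
by_label = sc.loc['Russia':'France']
print(by_label.index.tolist())
```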

#### d) aggregating the elements in an **iterable** with $\texttt{pandas.Series.items}$

   - you obtain a python $\texttt{zip}$, i.e. an iterable of $(index, value)$ pairs

In [None]:
zsa = sa.items() # the iterable contains the numbered elements
zsa

In [None]:
for z in zsa:
    print(z)
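
Two typical uses of these pairs, as a sketch on three countries (the threshold of 600000 is arbitrary). Note that a $\texttt{zip}$ is consumed by its first traversal, so `items()` is called again each time.

```python
import pandas as pd

sc = pd.Series([3972400, 603628, 551695],
               index=['Russia', 'Ukraine', 'France'])

# build a python dict from the (index, value) pairs
as_dict = dict(sc.items())
print(as_dict)

# filter on the value while keeping the index
large = [country for country, area in sc.items() if area > 600000]
print(large)

# a zip can only be traversed once
pairs = sc.items()
first_pass = list(pairs)
second_pass = list(pairs)   # empty: the iterator is exhausted
print(len(first_pass), len(second_pass))
```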

### 7) deleting an element with  $\texttt{pandas.Series.drop}$

   - it allocates and returns a new $\texttt{pandas.Series}$
   - or you can do it **inplace** (it will *modify* the series itself)

In [None]:
sc.drop('Russia')

In [None]:
sc = sc.drop('Russia')

we do it **inplace**

In [None]:
sc.drop('Spain', inplace=True)  # returns None: do not assign the result back to sc
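
A sketch of why the result of an **inplace** call should not be assigned (three countries, for brevity): with `inplace=True`, `drop` modifies the series and returns `None`.

```python
import pandas as pd

sc = pd.Series([3972400, 603628, 551695],
               index=['Russia', 'Ukraine', 'France'])

# the series is modified, and the call itself returns None
result = sc.drop('France', inplace=True)
print(result)             # None
print(sc.index.tolist())  # ['Russia', 'Ukraine']
```

Writing `sc = sc.drop(..., inplace=True)` would therefore replace the series with `None`.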

### 8) adding an element in a $\texttt{pandas.Series}$

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])

#### a) you can check whether values are contained in a series

the **direct** way

In [None]:
s == 10

   - the **proper** way

In [None]:
s.isin([10]) # a single element

In [None]:
s.isin([100, 47]) # several elements

#### b) you can check whether values are indexes in a series

   - the python way

In [None]:
'a' in s

In [None]:
s.index.isin(['a', 'd'])

#### c) you can change the value of an element

In [None]:
s['a']

In [None]:
s['a'] = 17 # you modify all the 'a'

In [None]:
s['a']

   - you modify the **original** array (**not** a copy of the array)

In [None]:
s.dtype

In [None]:
type(s['c'])

In [None]:
s['c'] = "toto"

   - the **type** of the elements **changed**
   - it was an integer type, it became **object** (the new element is a **string**)

In [None]:
type(s['c'])

In [None]:
s.dtype

#### d) you can add elements

   - the same way you change an existing one
   - you **give** a new **pair** $(index,\ value)$

In [None]:
s['v'] = 134

In [None]:
s

   - **adding** elements can **change** the **data-type** of the array

In [None]:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [None]:
s.dtype

In [None]:
s.loc['str'] = '4' # we add a string

In [None]:
s.dtype

In [None]:
s + s   # + is defined for both integers and strings:
        # for character strings it is the concatenation

#### e) **implicit** type conversion
   - type conversion can be done **automatically**

   - **changing** elements can **change** the **data-type** of the array

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])

In [None]:
s.dtype

In [None]:
s + s # the addition of 64-bit integers

   - we add an element of type **character string**

In [None]:
s['w'] = '101'

   - you silently change the data-type of the array

   - when **printed** the array **looks** the same !
   - but **from now on** elements are **references** to objects
   - elements indexed by $\texttt{['a', 'b', 'c']}$ are **references to 64-bit integer** objects
   - the element indexed by $\texttt{'w'}$ is a **reference to a character string** object

In [None]:
s

In [None]:
s.dtype

   - but:

In [None]:
[type(e) for e in s.array]   # the last one is a **str** not an **int**

In [None]:
s + s

   - for the **integers**, $+$ is the **addition**
   - for the **string**, $+$ is the **concatenation**

performance (be careful):
   - operations on $\texttt{numpy.ndarray}$ with **elements of type $\texttt{object}$**
   - are **slower** than operations on $\texttt{numpy.ndarray}$ with elements of **numeric** type (int32, int64, float64, etc.)

In [None]:
import numpy as np

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])
s.dtype # dtype('int64')
np.power(s, 2)

   - the **operation** is done directly on **64-bit** integers

In [None]:
s['w'] = '101'  # we add an element of type str
s.dtype         # dtype('O')
                # the type of the array changed to **object**
s.drop(['w'], inplace= True) # we remove the element of type str
s.dtype        # dtype('O') the type remains **object**

In [None]:
np.power(s, 2)

   - the **operation** is still done on **64-bit** integers
   - but now the integers are **referenced** by the array
   - (one **indirection** has been added)
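
One possible way to get rid of this indirection, as a sketch: once only integers remain, convert the series back explicitly with `astype`. Here the series is built directly with `dtype=object` to stand for the state reached above.

```python
import numpy as np
import pandas as pd

# integers stored as references to Python objects
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'], dtype=object)
print(s.dtype)       # object

# explicit conversion back to 64-bit integers
s_int = s.astype(np.int64)
print(s_int.dtype)   # int64

# the operation now runs directly on the integers again
squares = np.power(s_int, 2)
print(squares)
```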

#### f) **explicit type** conversion with $\texttt{pandas.Series.astype}$

   - type conversion can be done **explicitly**

In [None]:
s = pd.Series([10.20, 47, 47, 67], index=['a', 'b', 'c', 'a'])

In [None]:
s

In [None]:
import numpy as np

   - we change the type

In [None]:
s.astype(np.int32)

   - it returns a new $\texttt{pandas.Series}$
   - with converted values
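
A detail to keep in mind, sketched with illustrative values: converting floats to integers drops the fractional part (it does not round). Rounding first is one way to get the nearest integer instead.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.2, 47.7, -3.9])

# the fractional part is dropped (truncation toward zero)
truncated = s.astype(np.int32)
print(truncated.tolist())   # [10, 47, -3]

# round first to get the nearest integer
rounded = s.round().astype(np.int32)
print(rounded.tolist())     # [10, 48, -4]

# the original series is unchanged
print(s.dtype)              # float64
```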

### 9) applying **vectorized** operations on $\texttt{pandas.Series}$

#### a) advanced array indexing and assignment features

In [None]:
s = pd.Series([56, 45, 23, 8, 19, 34], index=['a', 'b', 'c', 'd', 'e', 'f'])
s

In [None]:
s[s<30]

In [None]:
s.loc[s<30] # the same as previously

   - we can **modify** all the **selected** elements

In [None]:
s.loc[s<30] = 30 # threshold

In [None]:
s[s<30]

   - note that the scalar (here $30$) has been **broadcast** to the **required size**

In [None]:
s[s>50] = 50

In [None]:
s
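
The two thresholds above can also be written in one step with `Series.clip`. This sketch checks that both approaches give the same values, and that `clip` returns a new series instead of modifying the original.

```python
import pandas as pd

s = pd.Series([56, 45, 23, 8, 19, 34], index=['a', 'b', 'c', 'd', 'e', 'f'])

# thresholds with boolean masks, applied to a copy
manual = s.copy()
manual[manual < 30] = 30
manual[manual > 50] = 50

# the same bounds with clip, which returns a new series
clipped = s.clip(lower=30, upper=50)

print(manual.tolist())
print(clipped.tolist())
```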