<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

# Pandas

**this course is inspired by**
   - Arnaud Legout, Inria, (courses and MOOC python)
   - Thierry Parmentelat (the numeric part of the MOOC python)
 

**see also: pandas cheat sheet**
   - https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

## history
   - **under development** since 2008
   - tries to **close the gap** between **python**, **statistical computing** and **multidimensional datasets**
   - not very **intuitive** but very **powerful** and there is **no better** solution
   
   
   - PyHPC11 (Python High-Performance and Scientific Computing conference 2011):
      - https://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf

## objectives

   - data **arrays** with **labeled axes**
   - **vectorized** operations
   - handling of **missing data**
   - **merge**, **pivot**, **groupby** and other **relational** operations
   - automatic or explicit **data alignment**
   - integrated **time series** functionality
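
as an illustration of **automatic data alignment** and **missing data**, here is a minimal sketch (the two series and their values are made up for the example): the addition matches elements by label, and labels found on one side only give a missing value

```python
import pandas as pd

# two Series whose indexes only partly overlap (made-up values)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# the addition is aligned on the labels, not on the positions
total = s1 + s2
print(total)

# labels present on one side only give a missing value (NaN)
print(total.isna())
```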

## pandas versus numpy

   - numpy contains **efficient** array **creation** and **manipulations**
   - pandas adds an **index-based structure** on top of numpy.ndarray

  - see http://pandas.pydata.org/pandas-docs/stable/
  
  

   - pandas uses numpy as a **black-box**
   - **no assumption** is made on memory allocation
   
   
   - i.e. pandas works mainly with **copies** instead of **in-place** modification
   - see https://stackoverflow.com/questions/23296282/what-rules-does-pandas-use-to-generate-a-view-vs-a-copy
   
   

   - **in-place** modifications in pandas are usually an assignment after a copy
   - see https://stackoverflow.com/questions/22532302/pandas-peculiar-performance-drop-for-inplace-rename-after-dropna/22533110#22533110
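
a small sketch of what "working with copies" means in practice, using *sort_values* as one example among many methods that behave this way: the method returns a new object and the usual idiom is to rebind the name to the result

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=['x', 'y', 'z'])

# sort_values returns a new Series; s itself is left untouched
sorted_s = s.sort_values()

print(sorted_s is s)   # the result is a distinct object
print(s.tolist())      # the original order is preserved

# the usual idiom: rebind the name to the result
s = s.sort_values()
```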

## install and import

as always, install if needed from the terminal with

```bash
pip install pandas
```

In [None]:
import pandas as pd

In [None]:
# show version

pd.__version__

## first-class citizens

there are two main pandas containers
   - ***pandas.Series*** is for **one-dimensional** arrays (think, a column)
   - ***pandas.DataFrame*** is for **two-dimensional** arrays (think, a spreadsheet)

# *pandas.Series*

a Series instance contains
   - an **array-like** data
   - an **array-like** index of same length as the data
   - by default, the index starts at $0$

## creating *pandas.Series* from arrays

example
   - we have $11$ **European countries** with their **names**
   - their corresponding **total areas** in $km^2$
      - Russia, Ukraine, France, Spain, Sweden, Norway, Germany, Finland, Poland, Italy, UnitedKingdom
      - 3972400, 603628, 551695, 505992, 450295, 385178, 357578, 338145, 312685, 301338, 242495

we can create a *pandas.Series* with the areas

In [None]:
import pandas as pd

In [None]:
areas = [3972400, 603628, 551695, 505992, 450295, 385178,
         357578, 338145, 312685, 301338, 242495]
sa = pd.Series(areas)
sa

   - by **default** a *pandas.Series* is **indexed** by **numbers**
   - the **type** of the elements is here *int64*

In [None]:
sa.index

   - the **index** is also called the **keys**

In [None]:
sa.keys() is sa.index

   - they are the **same** **python** object

## providing an index to a *pandas.Series*

   - in our example, the series is **indexed** by **numbers**
   - we can index it by the **names** of the countries

In [None]:
countries = ['Russia', 'Ukraine', 'France', 'Spain',
             'Sweden', 'Norway',  'Germany', 'Finland',
             'Poland', 'Italy', 'UnitedKingdom']
sc = pd.Series(areas, index = countries)

In [None]:
sc.index

   - the **index** need not be **unique**

In [None]:
index = ['a', 'b', 'c', 'a', 'c']
s = pd.Series([10, 39, 27, 8, 46], index = index)

In [None]:
# index 'a' has two values
s['a']

In [None]:
# we can test if an index is in the series
'a' in s
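
with a non-unique index, the result of a lookup depends on the label; a minimal sketch reusing the same values as above (*Index.is_unique* is used here to detect the situation):

```python
import pandas as pd

s = pd.Series([10, 39, 27, 8, 46], index=['a', 'b', 'c', 'a', 'c'])

dup = s['a']      # 'a' appears twice: we get a Series
single = s['b']   # 'b' appears once: we get a scalar

print(type(dup).__name__)
print(single)
print(s.index.is_unique)   # tells whether all labels are distinct
```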

In [None]:
# we can reorganize the index of a series
s = pd.Series([10, 39, 27, 8, 46],
              index = ['d', 'b', 'c', 'a', 'e'])

In [None]:
s.reindex(['a', 'b', 'c', 'd', 'e', 'f'],
          fill_value=0)

## creating *pandas.Series* from a dictionary

   - **keys** of the *dict* are **indexes**
   - **values** of the *dict* are **elements**

In [None]:
d = {'Russia': 3972400, 'Ukraine': 603628, 'France': 551695, 'Spain': 505992,
     'Sweden': 450295, 'Norway': 385178, 'Germany': 357578, 'Finland': 338145,
     'Poland': 312685, 'Italy': 301338, 'UnitedKingdom': 242495}

scd = pd.Series(d)

In [None]:
scd.index

## types of elements of a *pandas.Series*

In [None]:
index = ['a', 'b', 'c', 'a', 'c']
s = pd.Series([10, 39, 27, 8, 46], index = index)

In [None]:
s.index

in this example
   - the **type** of the **index** is *object*
   - and not a **fixed-length** array of characters like in *numpy.ndarray* 

In [None]:
s.index.dtype # 'O' stands for object

   - a *pandas.Series* can hold data of any type, but all its elements share a **single** type (*)
   - when the type is *object*, the elements are **references** to Python **objects**

(*) **unlike** python containers, which can hold objects of **any type** and are **heterogeneous**

In [None]:
l = [1, 'toto', 12.89, {'a':1}, (10, 230)]
[type(e) for e in l]
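
a short sketch of how this plays out when building a Series (the variable names are only for illustration): homogeneous data gets a numeric type, while the heterogeneous list above falls back to *object*

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([1, 2.5, 3])    # the integers are upcast to float
mixed = pd.Series([1, 'toto', 12.89, {'a': 1}, (10, 230)])

print(ints.dtype)
print(floats.dtype)
print(mixed.dtype)   # object: the elements are references

# the referenced Python objects keep their own type
print([type(e).__name__ for e in mixed])
```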

## naming the values and the index of a series

In [None]:
s = pd.Series({'Russia': 3972400, 'Ukraine': 603628, 'France': 551695})
s

In [None]:
s.name = 'areas'
s.index.name = 'countries'
s.head()

## accessing elements in a *pandas.Series*

In [None]:
countries = ['Russia', 'Ukraine', 'France', 'Spain', 'Sweden', 'Norway',
             'Germany', 'Finland', 'Poland', 'Italy', 'UnitedKingdom']
areas = [3972400, 603628, 551695, 505992, 450295, 385178,
         357578, 338145, 312685, 301338, 242495]
sc = pd.Series(areas, index = countries)

### **accessing** elements by their **index** (their **key**)

the **strong** way

In [None]:
sc['Spain']

In [None]:
'Spain' in sc 

in case of **absence**: the **strong** way produces an **error**

In [None]:
try:
    sc['Denmark']
except KeyError as e:
    print(e, 'is not an index')

In [None]:
'Denmark' in sc

### **accessing** elements by their **index** using *pandas.Series.loc[]*

- it is a **property** not a **function**

access the element of index **Russia** in the series

In [None]:
sc.loc['Russia'] # the same element as sc['Russia']

in case of **absence**: it produces an **error**

In [None]:
try:
    sc.loc['Denmark']
except KeyError as e:
    print(e, 'is not an index')

### **accessing** elements by the **position** of their index using *pandas.Series.iloc[]*

   - the position is the **rank** of the **key** in the **index** (starting at $0$)

In [None]:
# 'Russia' is the **first** index - so 0 
sc.iloc[0] # the same element as sc.loc['Russia']

In [None]:
sc.iloc[-1] # the last one like in python

In [None]:
sc.iloc[0:3] # a slice of positions: the first three elements

in case of **absence**: it produces an **error**

In [None]:
try:
    sc.iloc[1000]
except IndexError as e:
    print(e)
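
one difference worth keeping in mind between the two accessors, sketched on a shortened version of the series: a slice of **labels** with *loc* includes both ends, a slice of **positions** with *iloc* excludes the end, like in python

```python
import pandas as pd

sc = pd.Series([3972400, 603628, 551695, 505992],
               index=['Russia', 'Ukraine', 'France', 'Spain'])

by_label = sc.loc['Russia':'France']   # both ends are included
by_position = sc.iloc[0:2]             # the end position is excluded

print(by_label)
print(by_position)
```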

## changing the value of an element

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])

In [None]:
s['a']

In [None]:
s['a'] = 17 # you modify all the 'a'

In [None]:
s['a']

   - you modify the **original** array (**not** a copy of the array)

In [None]:
s.dtype

In [None]:
type(s['c'])

**be sure to use proper type**

In [None]:
s['c'] = "toto"

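   - the **type** of the elements **changed**
   - it was integer, it became *object* (the elements are now references to Python objects)
   - recent pandas versions (2.1 and later) emit a *FutureWarning* for this kind of silent upcast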

In [None]:
type(s['c'])

In [None]:
s.dtype

## adding elements in a Series

   - the same way you change an existing one
   - you **give** a new **pair** $(index,\ value)$

In [None]:
s['v'] = 134

In [None]:
s

## deleting an element with *pandas.Series.drop*

   - it allocates and returns a **new** *pandas.Series*


In [None]:
sc.drop('Russia')

In [None]:
sc = sc.drop('Russia')

   - or you can do it **inplace** (it will *modify* the Series object)

In [None]:
# with inplace=True the method returns None: do not assign the result
sc.drop('Spain', inplace=True)
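
a minimal sketch of the two styles side by side, on a made-up series: with *inplace=True* the call returns *None*, so assigning its result back to the variable would lose the series

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# default: a new Series is returned, s is unchanged
without_a = s.drop('a')

# inplace=True: s itself is modified and the call returns None
result = s.drop('b', inplace=True)

print(result)
print(s)
```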

## **vectorized** operations on *pandas.Series*

Series support advanced array indexing, as well as broadcast-based assignment,  
just like numpy arrays

In [None]:
s = pd.Series([56, 45, 23, 8, 19, 34],
              index=['a', 'b', 'c', 'd', 'e', 'f'])
s

In [None]:
# we could also have written
# s.loc[s<30]
s[s<30]

In [None]:
# we can modify the selected elements
s.loc[s<30] = 30 # threshold

In [None]:
s[s<=30]
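
a few more vectorized operations on the same data, as a sketch: arithmetic with a scalar, masks combined with `&`, and *clip*, which is assumed here as an alternative that does the same thresholding without modifying the series

```python
import pandas as pd

s = pd.Series([56, 45, 23, 8, 19, 34],
              index=['a', 'b', 'c', 'd', 'e', 'f'])

doubled = s * 2                      # the scalar is broadcast to all elements
in_range = s[(s >= 20) & (s < 50)]   # masks are combined with & and |
clipped = s.clip(lower=30)           # thresholding that returns a new Series

print(doubled)
print(in_range)
print(clipped)
```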

***
remaining slides in this notebook are optional

## contains ?

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])

### check whether values are contained in a series

the **direct** way

In [None]:
s == 10

   - the **proper** way

In [None]:
s.isin([10]) # a single element

In [None]:
s.isin([100, 47]) # several elements

### check whether values are indexes in a series

In [None]:
# the python way
'a' in s

In [None]:
# the pandas way
s.index.isin(['a', 'd'])
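
both calls return boolean masks, so they can be used directly to select elements; a minimal sketch on the same series:

```python
import pandas as pd

s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])

# keep the elements whose value is in the list
by_value = s[s.isin([100, 47])]

# keep the elements whose index is in the list ('d' is simply absent)
by_index = s[s.index.isin(['a', 'd'])]

print(by_value)
print(by_index)
```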

## iterating over a *Series* with *pandas.Series.items*

   - you obtain a python *zip* object, i.e. a lazy iterable of $(index,\ value)$ pairs

In [None]:
zsa = sa.items() # the iterable yields the (index, value) pairs
zsa

In [None]:
for index, value in zsa:
    print(f"index={index}, value={value}")

## **implicit** type conversion

- type conversion can be done **automatically**

   - changing or adding elements can **change** the **data-type** of the array

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])

In [None]:
s.dtype

In [None]:
s + s # addition of 64-bit integers

   - we add an element of type **character string**

In [None]:
s['w'] = '101'

   - you **silently** change the data-type of the array 

   - when **printed** the array **looks** the same !
   - but **from now on** elements are **references** to objects
   - elements indexed by *['a', 'b', 'c']* are **references to integer** objects
   - the element indexed by *'w'* is a **reference to a character string** object

In [None]:
s

In [None]:
s.dtype

   - but:

In [None]:
[type(e) for e in s.array]   # the last one is a **str** not an **int**

In [None]:
s + s

   - on the integer elements, $+$ is the **addition of integers**
   - on the string element, $+$ is the **concatenation of strings**

performance (be careful):
   - operations on *numpy.ndarray* with **elements of type *object***
   - are **slower** than operations on *numpy.ndarray* with **elements of a numeric type** (int32, int64, float64, etc.)

In [None]:
import numpy as np

In [None]:
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'])
s.dtype # dtype('int64')
np.power(s, 2)

   - the **operation** is done directly on **64-bit** integers

In [None]:
s['w'] = '101'  # we add an element of type str
s.dtype         # dtype('O'): the type of the array changed to object
s.drop(['w'], inplace=True)  # we remove the element of type str
s.dtype         # dtype('O'): the type remains object

In [None]:
np.power(s, 2)

   - the **operation** is still done on **integers**
   - but now the integers are **referenced** by the array
   - (one **indirection** has been added)
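
one possible way to get rid of the indirection, sketched under the assumption that all remaining elements are integers: convert back explicitly with *astype*, or let *pd.to_numeric* choose a numeric type (the object series is built directly here to keep the example short)

```python
import pandas as pd

# an object Series that only contains integers
s = pd.Series([10, 47, 47, 67], index=['a', 'b', 'c', 'a'], dtype=object)
print(s.dtype)

as_int = s.astype('int64')    # explicit conversion
inferred = pd.to_numeric(s)   # pandas picks a numeric type

print(as_int.dtype)
print(inferred.dtype)
```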

## **explicit type** conversion with *pandas.Series.astype*

   - type conversion can be done **explicitly**

In [None]:
s = pd.Series([10.20, 47, 47, 67], index=['a', 'b', 'c', 'a'])

In [None]:
s

In [None]:
import numpy as np

   - we change the type

In [None]:
s.astype(np.int32)

   - it returns a new *pandas.Series*
   - with converted values
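
note that converting floats to integers drops the fractional part; a short sketch with made-up values, where rounding first is shown as one option when the nearest integer is wanted:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.20, 47.9, -1.7], index=['a', 'b', 'c'])

truncated = s.astype(np.int32)         # the fractional part is dropped
rounded = s.round().astype(np.int32)   # round to the nearest integer first

print(truncated)
print(rounded)
print(s.dtype)   # the original Series is unchanged
```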

## accessing the underlying *numpy.ndarray* from a *pandas.Series*

**it is recommended to use**:

   - *pd.Series.array* is a wrapper around the **underlying data**
   - *pd.Series.to_numpy* returns the data as a *numpy.ndarray*

In [None]:
s = pd.Series(['a', 'b', 'c'])
s

In [None]:
s.array

In [None]:
s.to_numpy()

#### it is recommended to avoid *Series.values*
   - it returns an array **representing the underlying data**
   - **but** it has **inconsistent behaviour**: depending on the type of the elements, the result is a *numpy.ndarray* or a pandas extension array (it is not deprecated)
   - (https://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#accessing-the-values-in-a-series-or-index)
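
a minimal sketch of the inconsistency, using a categorical series as one example of a type that is not stored in a *numpy.ndarray*:

```python
import numpy as np
import pandas as pd

nums = pd.Series([1, 2, 3])
cats = pd.Series(['a', 'b', 'a'], dtype='category')

# .values: the type of the result depends on the type of the elements
print(type(nums.values).__name__)
print(type(cats.values).__name__)

# .to_numpy() always gives a numpy.ndarray
print(type(cats.to_numpy()).__name__)

# .array always gives the pandas wrapper around the data
print(type(cats.array).__name__)
```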

      