# CHAPTER 5
---
# Getting Started with pandas

In [1]:
import pandas as pd
import numpy as np

## Introduction to pandas Data Structures <font color='green'>[Essential]</font>
To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for every
problem, they provide a solid, easy-to-use basis for most applications.

### Series <font color='green'>[Essential]</font>
A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index. The simplest
Series is formed from only an array of data:

In short, a Series is an array tied to an index (the index is itself an array).

In [2]:
# the simplest Series is built from a plain array of data
obj = pd.Series([4, 7, -5, 3])

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

You can choose the index by passing it as an argument:

In [6]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [7]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Use the index labels to select one or several values:

In [8]:
obj2['a']

-5

In [9]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

The index-to-value link is preserved when you filter the values or apply mathematical operations to them:

In [10]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

A Series can also be built from a dictionary; the keys become the index:

In [11]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
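
As a small complement to building a Series from a dictionary, the sketch below passes an explicit `index` alongside the dict. The `states` list is an illustrative assumption, not part of the original notebook: keys found in the dict keep their value, requested labels with no matching key become `NaN`, and dict keys that are not requested are left out.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']  # hypothetical label list

# 'California' has no entry in sdata, so its value is NaN;
# 'Utah' is not requested, so it is dropped from the result.
obj4 = pd.Series(sdata, index=states)
print(obj4)

# isna() flags the labels that had no matching key
print(obj4.isna())
```

Because `NaN` is a floating-point value, the resulting Series is stored as `float64` rather than `int64`.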

### DataFrame <font color='green'>[Essential]</font>

- A DataFrame represents a table.
- It is comparable to a 2D array whose rows and columns are both indexed.
- It behaves somewhat like a 'Series of Series'.

In [12]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

In [13]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


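- The Series (columns) of a DataFrame can be accessed like the values of a dictionary.
- Column access has traditionally returned a view on the underlying data rather than a copy; with Copy-on-Write enabled (the default from pandas 3.0), changes made through the returned Series no longer propagate to the DataFrame.
- All the Series of a DataFrame share the same index.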

In [14]:
frame['state'] # access a column

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [15]:
frame.loc[2] # access a row by its index label

state    Ohio
year     2002
pop       3.6
Name: 2, dtype: object

In [16]:
frame['debt'] = 10
frame['emp'] = [1, 2, 3, 4, 5]
frame

Unnamed: 0,state,year,pop,debt,emp
0,Ohio,2000,1.5,10,1
1,Ohio,2001,1.7,10,2
2,Ohio,2002,3.6,10,3
3,Nevada,2001,2.4,10,4
4,Nevada,2002,2.9,10,5


In [17]:
# the values of a DataFrame are stored in a 2-dimensional array
frame.values

array([['Ohio', 2000, 1.5, 10, 1],
       ['Ohio', 2001, 1.7, 10, 2],
       ['Ohio', 2002, 3.6, 10, 3],
       ['Nevada', 2001, 2.4, 10, 4],
       ['Nevada', 2002, 2.9, 10, 5]], dtype=object)

### Index Objects <font color="#D22328">[Advanced]</font>

In [18]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [19]:
index[1:]

Index(['b', 'c'], dtype='object')

In [20]:
if False:
    # Index objects are immutable: this assignment would raise a TypeError
    index[1] = 0
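
The guarded cell above never runs the assignment. The sketch below, a minimal illustration rather than part of the original notebook, actually attempts it and catches the error, then builds a new `Index` instead of modifying the existing one. The replacement label `'z'` is an arbitrary choice.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'z'          # item assignment is not supported on an Index
    mutated = True
except TypeError as exc:
    mutated = False
    print('TypeError:', exc)

# To get different labels, build a new Index from the old one
new_index = pd.Index(['z' if label == 'b' else label for label in index])
print(new_index)
```

Immutability is what makes it safe to share one Index object between several Series or DataFrames.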

## Essential Functionality

### Reindexing <font color='#D22328'>[Advanced]</font>

In [21]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [22]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [23]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [24]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [25]:
frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)), 
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [26]:
frame.reindex(columns=['Texas', 'California', 'France']) # a new object is returned; frame itself is unchanged

Unnamed: 0,Texas,California,France
a,1,2,
c,4,5,
d,7,8,
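
Reindexing introduces `NaN` for labels that did not exist before. As a hedged sketch of two common alternatives, the example below uses the `fill_value` and `method` parameters of `reindex`; the `colors` Series is an illustrative assumption, not data from the notebook.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# fill_value replaces the NaN that would otherwise appear for the new label 'e'
filled_zero = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
print(filled_zero)

# method='ffill' carries the last known value forward; it requires an ordered index
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
forward = colors.reindex(range(6), method='ffill')
print(forward)
```

Forward filling is mostly useful for ordered data such as time series, where the previous observation is a reasonable stand-in for a missing one.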


### Dropping entries from an axis <font color='green'>[Essential]</font>

In [27]:
frame.drop('California', axis=1)

Unnamed: 0,Ohio,Texas
a,0,1
c,3,4
d,6,7


In [28]:
frame.drop('a')

Unnamed: 0,Ohio,Texas,California
c,3,4,5
d,6,7,8
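
Like `reindex`, `drop` returns a new object and leaves the original untouched. The sketch below rebuilds the same frame and uses the `index=` and `columns=` keywords, an alternative spelling to `axis=`, to drop a row and a column in one call.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)

# Drop one row label and one column label at once
smaller = frame.drop(index=['a'], columns=['California'])
print(smaller)

# The original frame still has its 3 rows and 3 columns
print(frame.shape)
```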


### Indexing, selection, and filtering <font color='green'>[Essential]</font>

In [29]:
obj = pd.Series([2, 3, 5, 9], index=['a', 'b', 'c', 'd'])
obj

a    2
b    3
c    5
d    9
dtype: int64

In [30]:
# indexing
obj.iloc[[1, 3]]

b    3
d    9
dtype: int64

In [31]:
# filtering
obj.loc[obj < 6]

a    2
b    3
c    5
dtype: int64

In [32]:
# slicing
obj.loc['b':'c']

b    3
c    5
dtype: int64
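
Note that the label slice above includes its end point `'c'`. A minimal sketch comparing it with a positional slice, which follows the usual Python convention of excluding the end:

```python
import pandas as pd

obj = pd.Series([2, 3, 5, 9], index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]    # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```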

In [33]:
obj.loc['a':'b'] = 4

In [34]:
obj

a    4
b    4
c    5
d    9
dtype: int64

In [35]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

In [36]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [37]:
data.iloc[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [38]:
data.loc[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### Arithmetic and data alignment <font color='#D22328'>[Advanced]</font>

In [39]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [40]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [41]:
s1 + s2 # only labels present in both objects keep a value; the others become NaN

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
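
If the `NaN` values produced by alignment are not wanted, the `add` method accepts a `fill_value` that stands in for a label missing from one side. A short sketch on the same two Series:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

plain = s1 + s2                    # 'd', 'f' and 'g' become NaN
total = s1.add(s2, fill_value=0)   # a missing operand is treated as 0

print(plain.isna().sum())
print(total)
```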

In [42]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [43]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [44]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [45]:
df1 + df2 # only cells defined in both objects keep a value; the others become NaN

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [46]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [47]:
# operations between a DataFrame and a Series: the Series is matched on the columns and applied to every row
df1 - df1.iloc[0]

Unnamed: 0,a,b,c,d
0,0.0,0.0,0.0,0.0
1,4.0,4.0,4.0,4.0
2,8.0,8.0,8.0,8.0
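
The subtraction above matches the Series against the column labels and repeats it down the rows. To match on the row index instead, the arithmetic methods take an `axis` argument. A minimal sketch on the same `df1`:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

by_row = df1 - df1.iloc[0]           # Series aligned on the columns
by_col = df1.sub(df1['a'], axis=0)   # Series aligned on the row index

print(by_row)
print(by_col)
```

With `axis=0`, column `'a'` is subtracted from every column, so each row of the result reads 0, 1, 2, 3.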


### Function application and mapping <font color='green'>[Essential]</font>

In [48]:
frame = pd.DataFrame(
    np.random.randn(3, 3), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas']
)

In [49]:
frame

Unnamed: 0,b,d,e
Utah,-0.821269,0.940856,0.541246
Ohio,0.166392,1.300106,1.088083
Texas,1.728017,-0.01593,2.330886


NumPy (`np`) functions can be applied directly to a Series or a DataFrame.

In [50]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.821269,0.940856,0.541246
Ohio,0.166392,1.300106,1.088083
Texas,1.728017,0.01593,2.330886


In [51]:
def series_min(series):
    return series.min() # returns a single value

# calling the function on each column (a Series) of the DataFrame produces a new Series
frame.apply(series_min)

b   -0.821269
d   -0.015930
e    0.541246
dtype: float64
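
By default `apply` hands each column to the function. Passing `axis=1` hands it each row instead. The sketch below uses a small frame with fixed, made-up values (the notebook's frame is random) so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical fixed values, chosen only for illustration
frame = pd.DataFrame(
    [[1.0, -2.0, 3.0], [4.0, 0.5, -6.0]],
    columns=list('bde'),
    index=['Utah', 'Ohio']
)

def value_range(series):
    # spread between the largest and smallest value
    return series.max() - series.min()

col_range = frame.apply(value_range)           # one result per column
row_range = frame.apply(value_range, axis=1)   # one result per row

print(col_range)
print(row_range)
```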

In [52]:
def fois_deux(series):
    return series * 2 # returns a Series

# when the function returns a Series for each column, the result is a new DataFrame
frame.apply(fois_deux)

Unnamed: 0,b,d,e
Utah,-1.642538,1.881712,1.082493
Ohio,0.332783,2.600211,2.176167
Texas,3.456034,-0.031861,4.661772


In [53]:
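def plus_un(valeur):
    # despite its name, this formats the value as a string with two decimals
    return 'valeur: %.2f' % valeur

# the function is applied to every cell
# (applymap is deprecated since pandas 2.1; DataFrame.map is its replacement)
frame.applymap(plus_un)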

Unnamed: 0,b,d,e
Utah,valeur: -0.82,valeur: 0.94,valeur: 0.54
Ohio,valeur: 0.17,valeur: 1.30,valeur: 1.09
Texas,valeur: 1.73,valeur: -0.02,valeur: 2.33


### Sorting and ranking <font color='green'>[Essential]</font>

In [54]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [55]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [56]:
frame = pd.DataFrame(
    np.arange(8).reshape((2, 4)), 
    index=['three', 'one'],
    columns=['d', 'a', 'b', 'c']
)

In [57]:
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [58]:
# sorting can be done along the column axis or the row axis
sorted_frame = frame.sort_index(axis=1)
sorted_frame

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [59]:
transposed = sorted_frame.T
transposed

Unnamed: 0,three,one
a,1,5
b,2,6
c,3,7
d,0,4


In [60]:
transposed.sort_values(by='one')

Unnamed: 0,three,one
d,0,4
a,1,5
b,2,6
c,3,7


In [61]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 5, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    5
7    4
dtype: int64

In [62]:
obj.rank() # gives the rank of each value; ties receive the mean of the ranks they span, hence the fractions

0    7.5
1    1.0
2    7.5
3    4.5
4    3.0
5    2.0
6    6.0
7    4.5
dtype: float64
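
Averaging is only the default way to break ties. As a sketch of two other options of the `method` parameter, `'min'` gives every tied value the lowest rank of its group, and `'first'` ranks ties in the order they appear in the data:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 5, 4])

rank_min = obj.rank(method='min')      # ties share the lowest rank of the group
rank_first = obj.rank(method='first')  # ties ranked by order of appearance

print(rank_min)
print(rank_first)
```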

## Axis indexes with duplicate values <font color='#D22328'>[Advanced]</font>

In [63]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [64]:
obj['c'] # a unique label returns a scalar value

4

In [65]:
obj['a'] # a duplicated label returns a Series!

a    0
a    1
dtype: int64
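
Because the return type depends on whether a label is duplicated, code that indexes by label can behave inconsistently. The sketch below shows one way to guard against this: check `index.is_unique`, or select with a list of labels so that the result is always a Series.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# is_unique tells whether every label appears exactly once
print(obj.index.is_unique)

# Selecting with a list always returns a Series, even for a unique label
only_c = obj.loc[['c']]
both_a = obj.loc[['a']]
print(only_c)
print(both_a)
```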

## Summarizing and Computing Descriptive Statistics <font color='green'>[Essential]</font>  <font color='#D22328'>[Advanced]</font>
pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values from
the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data.

In [66]:
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
) # NaN (not a number) is used to represent missing data

In [67]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [68]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [69]:
df.cumsum() # cumulative sum down each column; NaN entries are skipped

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [70]:
df.idxmax() # index label of the maximum value in each column

one    b
two    d
dtype: object

In [71]:
df.sum(axis=1) # you can sum along the axis of your choice

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
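
Row `c` sums to 0.00 above because missing values are skipped by default. The sketch below, using the same data, passes `skipna=False` to make `NaN` propagate instead, and shows that `mean` returns `NaN` only when a row has no valid value at all.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)

strict = df.sum(axis=1, skipna=False)  # any NaN in a row makes the row sum NaN
row_mean = df.mean(axis=1)             # NaN skipped; all-NaN row 'c' stays NaN

print(strict)
print(row_mean)
```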

In [72]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [73]:
df['one'].describe()

count    3.000000
mean     3.083333
std      3.493685
min      0.750000
25%      1.075000
50%      1.400000
75%      4.250000
max      7.100000
Name: one, dtype: float64

In [74]:
strings = df.astype(str) # convert the values of df to strings
strings['one'].describe() # describe behaves differently on non-numeric data

count       4
unique      4
top       nan
freq        1
Name: one, dtype: object

### Correlation and Covariance <font color='#D22328'>[Advanced]</font>

In [75]:
filled = df.fillna(0) # replace NaN with 0
filled

Unnamed: 0,one,two
a,1.4,0.0
b,7.1,-4.5
c,0.0,0.0
d,0.75,-1.3


In [76]:
filled.corr()

Unnamed: 0,one,two
one,1.0,-0.94454
two,-0.94454,1.0


### Unique Values, Value Counts, and Membership <font color='#D22328'>[Advanced]</font>

In [77]:
filled['one'].unique()

array([ 1.4 ,  7.1 ,  0.  ,  0.75])

In [78]:
filled['one'].nunique()

4

## Handling Missing Data <font color='green'>[Essential]</font>

In [79]:
df.isnull()

Unnamed: 0,one,two
a,False,True
b,False,False
c,True,True
d,False,False


In [80]:
df.dropna(subset=['two'])

Unnamed: 0,one,two
b,7.1,-4.5
d,0.75,-1.3
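
`dropna` and `fillna` have a few options worth knowing. As a hedged sketch on the same data: `how='all'` drops only rows where every value is missing, and `fillna` accepts a dict to use a different replacement per column. The replacement values 0 and -1 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)

any_missing = df.dropna()           # drop rows with at least one NaN
all_missing = df.dropna(how='all')  # drop rows where every value is NaN

# One replacement value per column (arbitrary values)
filled_by_col = df.fillna({'one': 0, 'two': -1})

print(any_missing)
print(all_missing)
print(filled_by_col)
```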


## Hierarchical Indexing <font color='#D22328'>[Advanced]</font>

In [81]:
data = pd.Series(
    np.random.randn(10), 
    index=[
        ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]
    ]
)

In [82]:
data

a  1    1.461448
   2   -0.387216
   3   -0.932193
b  1   -0.482614
   2    0.109048
   3    0.730106
c  1    1.876444
   2    1.667494
d  2    0.986478
   3   -0.129860
dtype: float64

In [83]:
data.index # a MultiIndex: one array of labels per level

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

In [84]:
data['a'] # selecting on the outer level returns a Series indexed by the inner level

1    1.461448
2   -0.387216
3   -0.932193
dtype: float64

In [85]:
# the Series can be filtered on each level of its index
# the syntax follows from that of one-dimensional indexes
data.loc['a', [1, 2]]

a  1    1.461448
   2   -0.387216
dtype: float64
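
Selection can also target the inner level alone. The sketch below uses fixed values (`np.arange`) instead of the notebook's random data so the result is predictable, and shows two spellings: a full slice on the outer level with `loc`, and the `xs` method with `level=1`.

```python
import numpy as np
import pandas as pd

# Same index as above, with predictable values 0.0 .. 9.0 (illustrative)
data = pd.Series(
    np.arange(10.),
    index=[
        ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]
    ]
)

inner = data.loc[:, 2]          # every outer label, inner label 2
cross = data.xs(2, level=1)     # cross-section on the inner level

print(inner)
print(cross)
```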

In [86]:
unstacked = data.unstack()

In [87]:
unstacked

Unnamed: 0,1,2,3
a,1.461448,-0.387216,-0.932193
b,-0.482614,0.109048,0.730106
c,1.876444,1.667494,
d,,0.986478,-0.12986


In [88]:
unstacked[4] = 7

In [89]:
unstacked

Unnamed: 0,1,2,3,4
a,1.461448,-0.387216,-0.932193,7
b,-0.482614,0.109048,0.730106,7
c,1.876444,1.667494,,7
d,,0.986478,-0.12986,7


In [90]:
stacked = unstacked.stack()
stacked

a  1    1.461448
   2   -0.387216
   3   -0.932193
   4    7.000000
b  1   -0.482614
   2    0.109048
   3    0.730106
   4    7.000000
c  1    1.876444
   2    1.667494
   4    7.000000
d  2    0.986478
   3   -0.129860
   4    7.000000
dtype: float64

In [91]:
# a DataFrame can have a hierarchical index on both axes
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'],
             ['Green', 'Red', 'Green']]
)

In [92]:
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


In [93]:
# the levels of each axis can be named to make them easier to work with
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [94]:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10


### Reordering and Sorting Levels <font color='#D22328'>[Advanced]</font>

In [95]:
frame.swaplevel('key1', 'key2')

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


In [96]:
frame.swaplevel('key1', 'key2').sort_index()

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11


### Using a DataFrame’s Columns <font color='#D22328'>[Advanced]</font>

In [97]:
frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3]
})

In [98]:
frame.set_index(['c', 'd'])

       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
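
`set_index` moves the chosen columns into the index and, by default, removes them from the columns. The sketch below shows two related options on the same frame: `drop=False` keeps the columns as well, and `reset_index` performs the reverse operation, moving the index levels back into columns.

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3]
})

indexed = frame.set_index(['c', 'd'])            # 'c' and 'd' leave the columns
kept = frame.set_index(['c', 'd'], drop=False)   # 'c' and 'd' stay as columns too
restored = indexed.reset_index()                 # index levels become columns again

print(list(indexed.columns))
print(list(kept.columns))
print(list(restored.columns))
```

After the round trip, the former index levels come first in the column order, so `restored` has its columns ordered `c`, `d`, `a`, `b`.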
