## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib

## Notebook customisation

In [2]:
# Change default size of plots
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Numpy

## NumPy Basics

_Numpy_ (from _Numerical Python_) is the fundamental package for scientific computing with Python. Many other packages, such as pandas, build on _Numpy_'s array object.

Features according to [Numpy website](https://www.numpy.org/):
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

By convention most people use this command to import _Numpy_:

```python
import numpy as np
```

## Numpy array

In [3]:
data = [[8, 4.5, 10, 50],
        [25., 65, 20.3, 89]]

data_np = np.array(data)
data_np

array([[ 8. ,  4.5, 10. , 50. ],
       [25. , 65. , 20.3, 89. ]])

In [4]:
data_np.ndim

2

In [5]:
data_np.shape

(2, 4)

In [6]:
data_np.dtype

dtype('float64')

## Numpy array (2)

A _Numpy_ array can be created from an existing list with `np.array()`. Other creation functions are:
- `np.arange()`, same as `range()` but it returns an array and accepts non-integer steps:

In [9]:
a=np.arange(4, 10, 0.2)

In [10]:
a.shape

(30,)

- `np.ones()`, `np.zeros()` and `np.empty()`, which create an array filled with 1, filled with 0, or allocated without initialising its content, respectively.
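The creation functions listed above can be tried out as follows. This is only a minimal sketch; the shapes and variable names are arbitrary choices for illustration.

```python
import numpy as np

ones = np.ones((2, 3))          # 2x3 array filled with 1.0
zeros = np.zeros(4, dtype=int)  # 1-D array of four integer zeros
buffer = np.empty((2, 2))       # allocated but uninitialised: contents are arbitrary

# Unlike range(), np.arange() accepts a non-integer step
steps = np.arange(0, 1, 0.25)

print(ones.shape, zeros.dtype, buffer.shape, steps)
```

Because `np.empty()` skips initialisation, always write to such an array before reading from it.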

## Numpy array (3)

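The `np.newaxis` expression adds a dimension of length 1 to an array. Here it turns a 1-D array of shape `(6,)` into a column of shape `(6, 1)`, which can then be broadcast against a row: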

In [13]:
a = np.arange(0, 51, 10)
b = np.arange(0, 51, 10)[:,np.newaxis]
b

array([[ 0],
       [10],
       [20],
       [30],
       [40],
       [50]])

In [14]:
np.arange(6)

array([0, 1, 2, 3, 4, 5])

In [8]:
np.arange(0, 51, 10)[:,np.newaxis] + np.arange(6)

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

<img src="pictures/numpy_indexing.png" alt="Drawing" style="float:center; width: 70%;"/>

<sup>source: http://scipy-lectures.org/intro/numpy/array_object.html</sup>

## Numpy array (4)

You can set the type of the array elements with the keyword `dtype`:
- int8 ... int64, integers 8-bit to 64-bit
- float16 ... float128, floating point from half precision to extended precision (float128 is not available on every platform)
- complex64 ... complex256, complex numbers (complex256 is not available on every platform)
- bool, boolean type
- object, string_, unicode_ (`string_` and `unicode_` were removed in NumPy 2.0; use `bytes_` and `str_` instead)

Note that, unlike a Python list, a _Numpy_ array cannot mix different types: all of its elements share a single `dtype`.

In [69]:
np.ones((2, 5), dtype=np.float32)  # Standard single precision = float32

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]], dtype=float32)
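To see what the single-type rule means in practice, here is a small sketch (the values are made up) showing how _Numpy_ converts a mixed list to one common type, and how an explicit `dtype` forces a conversion.

```python
import numpy as np

mixed_list = [1, 2.5, True]      # a Python list happily mixes types
as_array = np.array(mixed_list)  # Numpy upcasts everything to one common dtype

print(as_array, as_array.dtype)  # every element is now a float

# An explicit dtype converts the values, here truncating the fractional part
as_int = np.array([1.7, 2.2], dtype=np.int32)
print(as_int)
```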

## Linear Algebra

In [15]:
a = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]])
b = a.transpose()

In [16]:
np.linalg.inv(b)

array([[-2.27683616,  1.04519774,  0.39548023],
       [ 0.96045198, -0.56497175,  0.05649718],
       [ 0.07909605,  0.1299435 , -0.11299435]])

In [35]:
b = np.random.rand(3, 3) * 20
b

array([[ 4.64713306, 16.41449131, 19.24232856],
       [ 7.49005688, 13.01884916,  8.13688624],
       [10.31539264, 17.63964764,  9.81692078]])

In [38]:
a @ b  # Matrix product; needs NumPy >= 1.10 and Python >= 3.5, otherwise use np.dot(a, b)

array([[ 50.57342473,  95.37113254,  64.96686337],
       [113.01475737, 219.50450973, 156.04789985],
       [142.22314041, 287.44033717, 218.52822287]])
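A quick way to convince yourself that `np.linalg.inv()` and `@` do what you expect is to multiply a matrix by its inverse. The sketch below reuses the matrix values from the cells above under the new name `m`.

```python
import numpy as np

m = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]])
m_inv = np.linalg.inv(m.transpose())

# A matrix times its inverse should give the identity, up to rounding error
identity_check = m.transpose() @ m_inv
print(np.allclose(identity_check, np.eye(3)))

# `@` and np.dot() agree for 2-D arrays
same_product = np.allclose(m @ m_inv, np.dot(m, m_inv))
print(same_product)
```

`np.allclose()` is used rather than `==` because floating point results are rarely exactly equal.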

## Numpy is fast !!

Let's create two random lists of 1 million points each and add them element by element:

In [16]:
import random
N = 1000000
a = [random.random() for i in range(N)]
b = [random.random() for i in range(N)]

%timeit [a[i] + b[i] for i in range(N)]

76 ms ± 501 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


And now with _Numpy_:

In [19]:
a_np = np.array(a)
b_np = np.array(b)

%timeit a_np + b_np

657 µs ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Numpy is fast !! (2)

With _Numpy_ the operation is ~100 times faster. Let's compare the built-in function `sum()` with `np.sum()`:

In [20]:
%timeit sum_a = sum(a)

3.05 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [21]:
%timeit sum_a = np.sum(a_np)

371 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


~8 times faster.
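`%timeit` only works inside IPython or Jupyter. In a plain Python script, a similar comparison can be sketched with the standard `timeit` module. The size `n` and the number of repetitions are arbitrary here, and the timings will differ from machine to machine.

```python
import random
import timeit

import numpy as np

n = 100_000  # smaller than in the cells above to keep the run short
x = [random.random() for _ in range(n)]
y = [random.random() for _ in range(n)]
x_np, y_np = np.array(x), np.array(y)

# Each call returns the total time in seconds for `number` repetitions
t_list = timeit.timeit(lambda: [xi + yi for xi, yi in zip(x, y)], number=10)
t_numpy = timeit.timeit(lambda: x_np + y_np, number=10)

print(f"list: {t_list:.4f} s, numpy: {t_numpy:.4f} s")

# Both approaches compute the same values
list_sum = [xi + yi for xi, yi in zip(x, y)]
print(np.allclose(list_sum, x_np + y_np))
```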

# Pandas

## Pandas

Like _Numpy_, _Pandas_ provides data structures and data manipulation tools. However, _Pandas_ can deal with heterogeneous data while keeping _Numpy_'s array-based computing.

By convention most people use this command to import pandas:

```python
import pandas as pd
```

_Pandas_ contains two important data structures: _Series_ and _DataFrame_.


## Pandas _Series_

A _Series_ is a one-dimensional array object. It contains a sequence of values, each associated with a data label; together the labels form the _index_.

In [4]:
test_serie = pd.Series(range(5), ['a', 'b', 'c', 'd', 'e'])
test_serie

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [5]:
test_serie.e 

4

In [8]:
test_serie['e']

4

In [7]:
test_serie.iloc[4]  # access by position; plain test_serie[4] is deprecated on a label index in recent pandas

4
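The explicit accessors `.loc` (by label) and `.iloc` (by position) make the intent unambiguous. A small sketch on a Series equivalent to the one above, renamed `s` here:

```python
import pandas as pd

s = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['e']          # label-based access
by_position = s.iloc[4]        # position-based access
label_slice = s.loc['b':'d']   # label slices include both endpoints
position_slice = s.iloc[1:3]   # position slices exclude the stop, like lists

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```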

## Operation on series

In [82]:
(test_serie * 10) + 1.5

a     1.5
b    11.5
c    21.5
d    31.5
e    41.5
dtype: float64

In [83]:
test2_serie = pd.Series(range(5), ['b', 'c', 'd', 'e', 'f'])

In [84]:
test_serie + test2_serie

a    NaN
b    1.0
c    3.0
d    5.0
e    7.0
f    NaN
dtype: float64
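The `NaN` values above come from index alignment: labels `a` and `f` exist in only one of the two Series. If you would rather treat a missing label as 0, one option is the `add()` method with `fill_value`. A sketch with the same data under the names `s1` and `s2`:

```python
import pandas as pd

s1 = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series(range(5), index=['b', 'c', 'd', 'e', 'f'])

# Treat a label missing from one side as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```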

## Select Data

In [89]:
test_serie*10 > 20

a    False
b    False
c    False
d     True
e     True
dtype: bool

In [90]:
test_serie[test_serie*10 > 20]

d    3
e    4
dtype: int64

## Pandas _DataFrame_

A _DataFrame_ represents a table of data with heterogeneous column types (int, float, bool, string). 
A _DataFrame_ can also be considered as an ordered dictionary of _Series_ sharing the same index.

In [123]:
student_grad = pd.DataFrame({'gender': ['M', 'F', 'M', 'M', 'F', 'F'],
                             'math': [random.random()*20 for _ in range(6)],
                             'biology': [random.random()*20 for _ in range(6)],
                             'french': [random.random()*20 for _ in range(6)],
                             'history': [random.random()*20 for _ in range(6)],
                             'physics': [random.random()*20 for _ in range(6)]},
                            index=['Anders', 'Olga', 'Peter', 'Icham', 'Nathalie', 'Aiko'])

In [124]:
student_grad

Unnamed: 0,gender,math,biology,french,history,physics
Anders,M,12.451162,19.535397,3.794617,1.853126,10.813565
Olga,F,12.333553,18.508612,11.287596,8.019036,6.158078
Peter,M,1.920123,3.154818,9.765983,5.487378,19.576605
Icham,M,13.363628,4.968228,15.345407,7.137415,1.727489
Nathalie,F,6.934926,3.526261,19.092621,18.342093,7.40607
Aiko,F,16.807241,14.337433,8.894549,16.052968,1.700058
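The "dictionary of _Series_" view can be made concrete by building a small table from two Series that share an index. The values below are hypothetical and chosen only for illustration.

```python
import pandas as pd

names = ['Anders', 'Olga', 'Peter']
# Hypothetical values, not the random grades generated above
gender = pd.Series(['M', 'F', 'M'], index=names)
math = pd.Series([12.5, 12.3, 1.9], index=names)

table = pd.DataFrame({'gender': gender, 'math': math})

print(table.dtypes)      # one dtype per column
column = table['math']   # selecting a column gives back a Series
print(type(column))
```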


## Data analysis

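In [131]:
# The 'mean' entry shows up below because the cell that adds this column had already been run once in this session
student_grad.loc['Olga']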

gender           F
math       12.3336
biology    18.5086
french     11.2876
history    8.01904
physics    6.15808
mean       11.2614
Name: Olga, dtype: object

In [132]:
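# Average the numeric columns only; recent pandas raises an error on the text column 'gender' otherwise
student_grad['mean'] = student_grad.mean(axis=1, numeric_only=True)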
student_grad

Unnamed: 0,gender,math,biology,french,history,physics,mean
Anders,M,12.451162,19.535397,3.794617,1.853126,10.813565,9.689573
Olga,F,12.333553,18.508612,11.287596,8.019036,6.158078,11.261375
Peter,M,1.920123,3.154818,9.765983,5.487378,19.576605,7.980981
Icham,M,13.363628,4.968228,15.345407,7.137415,1.727489,8.508434
Nathalie,F,6.934926,3.526261,19.092621,18.342093,7.40607,11.060394
Aiko,F,16.807241,14.337433,8.894549,16.052968,1.700058,11.55845


## Merge Data

In [163]:
student_grad_2 = pd.DataFrame({'english': [random.random()*20 for _ in range(5)]},
                              index=['Olga', 'Peter', 'Icham', 'Nathalie', 'Aiko'])

In [164]:
all_grad = pd.concat([student_grad,student_grad_2], axis=1, sort=False)
all_grad

Unnamed: 0,gender,math,biology,french,history,physics,mean,english
Anders,M,12.451162,19.535397,3.794617,1.853126,10.813565,9.689573,
Olga,F,12.333553,18.508612,11.287596,8.019036,6.158078,11.261375,16.415284
Peter,M,1.920123,3.154818,9.765983,5.487378,19.576605,7.980981,19.158013
Icham,M,13.363628,4.968228,15.345407,7.137415,1.727489,8.508434,16.285414
Nathalie,F,6.934926,3.526261,19.092621,18.342093,7.40607,11.060394,13.372065
Aiko,F,16.807241,14.337433,8.894549,16.052968,1.700058,11.55845,19.413025
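`pd.concat()` with `axis=1` aligns rows on the index and, by default, keeps every label (an outer join), which is why Anders gets an empty `english` value. A sketch with two small hypothetical tables shows the difference with `join='inner'`:

```python
import pandas as pd

# Hypothetical grades, for illustration only
left = pd.DataFrame({'math': [12.0, 8.0, 15.0]}, index=['Anders', 'Olga', 'Peter'])
right = pd.DataFrame({'english': [16.0, 19.0]}, index=['Olga', 'Peter'])

outer = pd.concat([left, right], axis=1)                # keeps every student
inner = pd.concat([left, right], axis=1, join='inner')  # keeps only students present in both

print(outer)
print(inner)
```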


## Filter Data

In [165]:
all_grad.isna()

Unnamed: 0,gender,math,biology,french,history,physics,mean,english
Anders,False,False,False,False,False,False,False,True
Olga,False,False,False,False,False,False,False,False
Peter,False,False,False,False,False,False,False,False
Icham,False,False,False,False,False,False,False,False
Nathalie,False,False,False,False,False,False,False,False
Aiko,False,False,False,False,False,False,False,False


In [210]:
all_grad.drop(columns='gender') > 10  # compare the numeric columns only

Unnamed: 0,math,biology,french,history,physics,mean,english
Anders,True,True,False,False,True,False,False
Olga,True,True,True,False,False,True,True
Peter,False,False,False,False,True,False,True
Icham,True,False,True,False,False,False,True
Nathalie,False,False,True,True,False,True,True
Aiko,True,True,False,True,False,True,True
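Boolean tables like these are mostly useful as filters. As a sketch on a tiny hypothetical table, here is how `isna()` can be combined with `any()`, `sum()` and `dropna()` to locate or discard incomplete rows:

```python
import numpy as np
import pandas as pd

# Hypothetical grades with one missing value
grades = pd.DataFrame({'math': [12.0, 8.0], 'english': [np.nan, 16.0]},
                      index=['Anders', 'Olga'])

incomplete = grades[grades.isna().any(axis=1)]  # rows with at least one missing value
complete = grades.dropna()                      # rows with no missing value
missing_per_column = grades.isna().sum()        # count of NaN per column

print(missing_per_column)
```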


## DataFrame attributes and methods

In [167]:
all_grad.columns

Index(['gender', 'math', 'biology', 'french', 'history', 'physics', 'mean',
       'english'],
      dtype='object')

In [169]:
all_grad.index

Index(['Anders', 'Olga', 'Peter', 'Icham', 'Nathalie', 'Aiko'], dtype='object')

In [168]:
# Drop the existing 'mean' column so that it is not counted as a grade,
# then recompute the mean per student, this time including 'english'
all_grad.drop(columns='mean').mean(axis=1, numeric_only=True)

Anders       9.689573
Olga        12.120360
Peter        9.843820
Icham        9.804597
Nathalie    11.445673
Aiko        12.867545
dtype: float64

## Data frame selection

Don't forget the parentheses around each condition when combining them, because `&` has a higher priority than `>`:

In [176]:
all_grad[ (all_grad['math'] > 10) & (all_grad['french'] > 10)]

Unnamed: 0,gender,math,biology,french,history,physics,mean,english
Olga,F,12.333553,18.508612,11.287596,8.019036,6.158078,11.261375,16.415284
Icham,M,13.363628,4.968228,15.345407,7.137415,1.727489,8.508434,16.285414
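The same pattern works with the other element-wise logical operators, `|` (or) and `~` (not). A sketch on a small hypothetical table:

```python
import pandas as pd

# Hypothetical grades, for illustration only
grades = pd.DataFrame({'math': [12.0, 8.0, 15.0], 'french': [11.0, 14.0, 6.0]},
                      index=['Olga', 'Peter', 'Icham'])

both = grades[(grades['math'] > 10) & (grades['french'] > 10)]    # AND
either = grades[(grades['math'] > 10) | (grades['french'] > 10)]  # OR
not_math = grades[~(grades['math'] > 10)]                         # NOT

print(list(both.index), list(either.index), list(not_math.index))
```

Python's `and`, `or` and `not` cannot be used here, since they expect a single truth value rather than a whole column of them.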


In [181]:
all_grad.loc[['Olga'], ['math', 'biology']]

Unnamed: 0,math,biology
Olga,12.333553,18.508612


In [183]:
all_grad.iloc[1, [1, 2]]

math       12.3336
biology    18.5086
Name: Olga, dtype: object

## Function Application and mapping

Let's say that you want to compute the gap between the worst and the best grade. You can define your own function and apply it to every column or every row:

In [189]:
gap_f = lambda x: x.max() - x.min()
all_grad.drop(columns=['gender']).apply(gap_f)

math       14.887118
biology    16.380578
french     15.298004
history    16.488967
physics    17.876547
mean        3.577468
english     6.040960
dtype: float64

In [190]:
all_grad.drop(columns=['gender']).apply(gap_f, axis='columns')

Anders      17.682271
Olga        12.350534
Peter       17.656482
Icham       14.557925
Nathalie    15.566360
Aiko        17.712967
dtype: float64
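For a simple statistic like this one, `apply()` is not strictly needed: the built-in reductions give the same result. The sketch below, on hypothetical grades, uses a named function instead of a lambda and compares both approaches.

```python
import pandas as pd

# Hypothetical grades, for illustration only
grades = pd.DataFrame({'math': [12.0, 8.0, 15.0], 'french': [11.0, 14.0, 6.0]},
                      index=['Olga', 'Peter', 'Icham'])


def gap(values):
    """Difference between the largest and smallest value."""
    return values.max() - values.min()


per_column = grades.apply(gap)            # one result per column
per_row = grades.apply(gap, axis=1)       # one result per row
vectorised = grades.max() - grades.min()  # same as per_column, without apply

print(per_column)
print(per_row)
```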

## Sorting and counting

To rank the students by their mean grade:

In [193]:
all_grad.sort_values(by='mean', ascending=False)

Unnamed: 0,gender,math,biology,french,history,physics,mean,english
Aiko,F,16.807241,14.337433,8.894549,16.052968,1.700058,11.55845,19.413025
Olga,F,12.333553,18.508612,11.287596,8.019036,6.158078,11.261375,16.415284
Nathalie,F,6.934926,3.526261,19.092621,18.342093,7.40607,11.060394,13.372065
Anders,M,12.451162,19.535397,3.794617,1.853126,10.813565,9.689573,
Icham,M,13.363628,4.968228,15.345407,7.137415,1.727489,8.508434,16.285414
Peter,M,1.920123,3.154818,9.765983,5.487378,19.576605,7.980981,19.158013


Number of courses passed (grade above 10):

In [204]:
(all_grad.drop(columns=['gender', 'mean']) > 10).sum(axis=1)

Anders      3
Olga        4
Peter       2
Icham       3
Nathalie    3
Aiko        4
dtype: int64

## GroupBy 

The `groupby()` mechanism works similarly to its counterpart in `R`. It splits a `DataFrame` into groups based on one or more _keys_ and then applies a statistic to each group.

In [209]:
all_grad.groupby('gender').mean()

gender,math,biology,french,history,physics,mean,english
F,12.02524,12.124102,13.091588,14.138032,5.088068,11.293406,16.400125
M,9.244971,9.219481,9.635336,4.825973,10.705886,8.726329,17.721713
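Several statistics can be computed per group in one go with `agg()`. A minimal sketch on a hypothetical table with a single grade column:

```python
import pandas as pd

# Hypothetical grades, for illustration only
grades = pd.DataFrame({'gender': ['M', 'F', 'M', 'F'],
                       'math': [12.0, 14.0, 8.0, 10.0]})

# Split by gender, then apply several statistics to the 'math' column
summary = grades.groupby('gender')['math'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

Selecting the column before aggregating keeps the result small and avoids computing statistics on columns you do not need.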


## Visualising dataset

In [18]:
import seaborn as sns
iris = sns.load_dataset('iris')
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [226]:
# Show the first 2 rows
iris.head(2)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa


In [225]:
# Show the last 2 rows
iris.tail(2)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


## [Qgrid](https://github.com/quantopian/qgrid)

_"Qgrid is a Jupyter notebook widget which uses SlickGrid to render pandas DataFrames within a Jupyter notebook. This allows you to explore your DataFrames with intuitive scrolling, sorting, and filtering controls, as well as edit your DataFrames by double clicking cells."_

In [19]:
import qgrid
qgrid_widget = qgrid.show_grid(iris, show_toolbar=True, grid_options={'maxVisibleRows':5})
qgrid_widget

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…

## Exercise idea:

In [20]:
import io
import requests

#https://www.data.gouv.fr/fr/datasets/elections-europeennes-2019-resultats/#_

url = "https://www.data.gouv.fr/fr/datasets/r/da21a51c-ac5d-41a0-9b43-5dc2f814f21a"
s = requests.get(url).content
ds = pd.read_csv(io.StringIO(s.decode('utf-8')), sep=';')
print(ds.describe())


         Sequence   Année   Tour  Département  Commune  Code canton  \
count  263.000000   263.0  263.0        263.0    263.0   263.000000   
mean   132.000000  2019.0    1.0         31.0    555.0    19.319392   
std     76.065761     0.0    0.0          0.0      0.0     3.131453   
min      1.000000  2019.0    1.0         31.0    555.0    15.000000   
25%     66.500000  2019.0    1.0         31.0    555.0    17.000000   
50%    132.000000  2019.0    1.0         31.0    555.0    19.000000   
75%    197.500000  2019.0    1.0         31.0    555.0    22.000000   
max    263.000000  2019.0    1.0         31.0    555.0    25.000000   

       Code circonscription  Nombre d'inscrits  Nombre d'abstentions  \
count            263.000000         263.000000            263.000000   
mean               3.623574         923.281369            439.615970   
std                2.551555         188.675568            115.579229   
min                1.000000          68.000000             48.000000   


In [43]:
ds.head()

Unnamed: 0,Sequence,Type,Année,Tour,Département,Commune,Numéro du bureau,Code canton,Code circonscription,Indicatif,...,Nombre de voix Liste 30,Code dépôt Liste 31,Nombre de voix Liste 31,Code dépôt Liste 32,Nombre de voix Liste 32,Code dépôt Liste 33,Nombre de voix Liste 33,Code dépôt Liste 34,Nombre de voix Liste 34,Procurations
0,12,ER,2019,1,31,555,12,17,1,I,...,132,9,4,2,1,26,0,34,0,35.0
1,20,ER,2019,1,31,555,20,17,2,I,...,133,9,6,2,0,26,0,34,0,14.0
2,27,ER,2019,1,31,555,27,17,2,I,...,183,9,6,2,0,26,0,34,0,23.0
3,28,ER,2019,1,31,555,28,17,2,I,...,104,9,10,2,1,26,0,34,0,15.0
4,29,ER,2019,1,31,555,29,17,2,I,...,123,9,10,2,2,26,0,34,0,15.0


In [None]:
# http://www.pygal.org/en/stable/documentation/types/maps/pygal_maps_fr.html