# Interactive data analysis, visualisation

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper _The use of multiple measurements in taxonomic problems_ as an example of linear discriminant analysis.**[1]**

- 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor)
- Four features measured (the length and the width of the sepals and petals, in centimeters)

Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. 

**[1]** R. A. Fisher (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x.

Source: [Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set).

**We use this dataset to test different visualisation libraries:**  

- matplotlib
- pandas 
- seaborn
- ggplot with %%R
- plotly
- bokeh

## Imports

In [8]:
import numpy as np
import pandas as pd
import matplotlib

## Notebook customisation

In [9]:
# Change default size of plots
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## NumPy Basics

NumPy (from _Numerical Python_) is the fundamental package for scientific computing with Python. Many packages, such as pandas, build on NumPy's array object.

According to the [NumPy website](https://www.numpy.org/), it provides:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
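
As a rough illustration of two items from the list above (broadcasting and linear algebra), here is a minimal sketch. The array values and the variable names `matrix`, `row`, `shifted` and `gram` are made up for this example and are not part of the notebook.

```python
import numpy as np

# A 2x3 array and a length-3 row vector (values chosen only for illustration)
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
row = np.array([10.0, 20.0, 30.0])

# Broadcasting: the row is virtually repeated along the first axis,
# so it is added to every row of the matrix without an explicit loop
shifted = matrix + row
print(shifted)

# Linear algebra: matrix product of the (2, 3) array with its (3, 2) transpose
gram = matrix @ matrix.T
print(gram.shape)
```

Broadcasting avoids writing a Python loop over the rows, which is one reason NumPy code tends to be both shorter and faster than the list-based equivalent.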



## Numpy array

In [26]:
data = [[8, 4.5, 10, 50],
        [25., 65, 20.3, 89]]

data_np = np.array(data)
data_np

array([[ 8. ,  4.5, 10. , 50. ],
       [25. , 65. , 20.3, 89. ]])

In [29]:
data_np.ndim

2

In [27]:
data_np.shape

(2, 4)

In [28]:
data_np.dtype

dtype('float64')

Note that, unlike a Python list, a NumPy array does not mix different types: all elements share a single `dtype`. Here the integers in `data` were converted to `float64` to match the floats.
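
The following sketch shows what happens when mixed input is passed to `np.array()`. The variable names `mixed_numbers`, `mixed_text` and `py_list` are hypothetical and only serve this example.

```python
import numpy as np

# Integers and floats are upcast to a common floating-point type
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)    # float64

# Numbers and strings are all converted to strings
mixed_text = np.array([1, "a"])
print(mixed_text.dtype.kind)  # 'U' (unicode string)

# A plain Python list keeps each element's own type
py_list = [1, 2.5, "a"]
print([type(x).__name__ for x in py_list])  # ['int', 'float', 'str']
```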

## Numpy array (2)

NumPy arrays can be created from an existing list with `np.array()`. Other functions are:
- `np.arange()`, which works like `range()` but returns an array and also accepts non-integer steps:

In [43]:
np.arange(4, 10, 0.2)

array([4. , 4.2, 4.4, 4.6, 4.8, 5. , 5.2, 5.4, 5.6, 5.8, 6. , 6.2, 6.4,
       6.6, 6.8, 7. , 7.2, 7.4, 7.6, 7.8, 8. , 8.2, 8.4, 8.6, 8.8, 9. ,
       9.2, 9.4, 9.6, 9.8])

- `np.ones()`, `np.zeros()` and `np.empty()` create an array filled with 1.0, filled with 0.0, or with memory allocated but not initialised, respectively.
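
A short sketch of these three constructors follows. The shapes and the names `zeros`, `ones` and `scratch` are chosen only for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))      # 2x3 array filled with 0.0
ones = np.ones(4)             # 1-D array of four 1.0 values
scratch = np.empty((2, 2))    # allocated only: contents are arbitrary

print(zeros.shape, ones.sum(), scratch.shape)

# np.empty() is only safe if every element is written before being read
scratch[:] = 7.0
print(scratch)
```

`np.empty()` skips the initialisation step, so it should only be used for arrays that are fully overwritten afterwards.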

## Numpy array (3)

You can set the element type of an array with the `dtype` keyword argument:
- int8-64, integers 8-bit to 64-bit
- float16-128, floating point from half precision to extended precision
- complex64-256, complex numbers
- bool, boolean type
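- object, string_, unicode_ (the last two are spelled `bytes_` and `str_` in NumPy 2.0 and later, where the old aliases were removed)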

In [47]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [48]:
np.ones(10, dtype=np.int32)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
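
The choice of `dtype` also determines how much memory an array uses. The sketch below assumes nothing beyond NumPy itself; the names `small`, `large`, `values` and `as_int` are hypothetical.

```python
import numpy as np

small = np.ones(10, dtype=np.int8)      # 1 byte per element
large = np.ones(10, dtype=np.float64)   # 8 bytes per element
print(small.nbytes, large.nbytes)       # 10 80

# astype() returns a converted copy; float -> int truncates toward zero
values = np.array([1.7, -2.3, 3.9])
as_int = values.astype(np.int32)
print(as_int)                           # [ 1 -2  3]
```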

## Numpy is fast !!

Let's create two random lists of 1 million points each and add them element by element:

In [16]:
import random
N = 1000000
a = [random.random() for i in range(N)]
b = [random.random() for i in range(N)]

%timeit [a[i] + b[i] for i in range(N)]

76 ms ± 501 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


And now with NumPy:

In [19]:
a_np = np.array(a)
b_np = np.array(b)

%timeit a_np + b_np

657 µs ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Numpy is fast !! (2)

With NumPy, the operation is ~100 times faster. Let's compare the built-in function `sum()` with `np.sum()`:

In [20]:
%timeit sum_a = sum(a)

3.05 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [21]:
%timeit sum_a = np.sum(a_np)

371 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


~8 times faster (3.05 ms versus 371 µs).
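
The `%timeit` magic only works inside IPython or Jupyter. As a sketch of how a similar comparison could be run in a plain Python script, the standard `timeit` module can be used instead. The helper names `add_lists` and `add_arrays`, the smaller `N` and the use of `zip()` are choices made for this example, and the timings will differ from those shown in the notebook.

```python
import random
import timeit

import numpy as np

N = 100_000  # smaller than the 1 M used in the notebook, to keep the script quick
a = [random.random() for _ in range(N)]
b = [random.random() for _ in range(N)]
a_np = np.array(a)
b_np = np.array(b)


def add_lists():
    # Pure-Python element-wise addition
    return [x + y for x, y in zip(a, b)]


def add_arrays():
    # Vectorised addition, executed in compiled code
    return a_np + b_np


# Both approaches must give the same numbers
assert np.allclose(add_lists(), add_arrays())

repeats = 20
t_list = timeit.timeit(add_lists, number=repeats) / repeats
t_numpy = timeit.timeit(add_arrays, number=repeats) / repeats
print(f"list: {t_list * 1e3:.3f} ms, numpy: {t_numpy * 1e3:.3f} ms")
```

The exact ratio depends on the machine, the Python version and the array size, so the figures should be read as orders of magnitude rather than fixed constants.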

In [3]:
data

array([[-0.68139835,  0.11932005,  1.10275277],
       [-0.04288342, -0.00721042, -0.02208139]])