# Module 8 - NumPy and SciPy: introduction to statistics in python <a id='0'></a>
--------------------------------------------------------------------------------------------------

## Table of Contents <a id='toc'></a>

<br>

[**NumPy - The fundamental package for scientific computing in Python**](#0)

&nbsp;&nbsp;&nbsp;&nbsp;[**Introduction**](#1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Importing NumPy](#1.1)

&nbsp;&nbsp;&nbsp;&nbsp;[**The heart of NumPy: the array**](#2)

&nbsp;&nbsp;&nbsp;&nbsp;[**Creating NumPy arrays**](#3)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Creating arrays from lists](#3.1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Creating homogeneous-filled arrays](#3.2)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Create an identity matrix](#3.3)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Array with random numbers following a uniform distribution between 0.0 and 1.0](#3.4)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Evenly spaced arrays: the np.arange function](#3.5)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Reading and writing arrays from/to files](#3.6)  

&nbsp;&nbsp;&nbsp;&nbsp;[**Array reshaping**](#4)

&nbsp;&nbsp;&nbsp;&nbsp;[**Accessing data in NumPy arrays**](#5)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Indexing](#5.1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Slicing](#5.2)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Comparison operations](#5.3)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Iterating over arrays](#5.4)  

&nbsp;&nbsp;&nbsp;&nbsp;[**Modifying array values**](#6)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Assigning new values to an existing index](#6.1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Adding rows and columns to an existing array](#6.2)  

&nbsp;&nbsp;&nbsp;&nbsp;[**To copy or not to copy?**](#7)

&nbsp;&nbsp;&nbsp;&nbsp;[**NumPy functions**](#8)

&nbsp;&nbsp;&nbsp;&nbsp;[**Random numbers in NumPy**](#9)

&nbsp;&nbsp;&nbsp;&nbsp;[**Linear algebra built-in capabilities**](#10)

<br>

[**SciPy.stats and statistics in python**](#11)

&nbsp;&nbsp;&nbsp;&nbsp;[**Manipulation of random distributions**](#12)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Drawing random numbers: rvs](#12.1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Looking up the quantiles and probability density functions](#12.2)  

&nbsp;&nbsp;&nbsp;&nbsp;[**Statistical tests**](#13)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Statistical modelling and regression](#13.1)

<br>

[**Exercises**](#14)

<br>
<br>

# NumPy - The fundamental package for scientific computing in Python <a id='0'></a>
-------------


## Introduction <a id='1'></a>
-------------------

**[NumPy](https://www.numpy.org)** is the fundamental package for scientific computing with Python.  
Its **highlights** are:
* A powerful N-dimensional array object.
* Efficient, broadcasting functions.
* Tools for integrating C/C++ and Fortran code.
* Useful linear algebra, Fourier transform, and random number generation capabilities.

**Why use NumPy ?**
* **NumPy arrays are faster** than standard python lists.
* NumPy functions make it possible to **perform many operations with much less code**.
* NumPy functions are **faster than native Python implementation**.
* Great **collection of mathematical functions** available (numpy, scipy, sympy).

### Importing NumPy <a id='1.1'></a>

`numpy` is generally imported under the alias **`np`**.


In [None]:
import numpy as np

<br>
<br>

[Back to ToC](#toc)

## The heart of NumPy: the array <a id='2'></a>
------------------------------------

* NumPy's main object is the homogeneous **multidimensional array** (all elements in the array must
  be of the same type).
* NumPy arrays are created with **`np.array()`**.
* While a NumPy array object might be reminiscent of a standard python `list`, NumPy arrays are much
  more efficient for operations with large numerical data and in general outperform standard Python lists.
  
<br>

**Example:** create a new numpy array, here of dimension 2 (rows) x 3 (columns).

In [None]:
my_array = np.array([[1.,2.,3.],[4.,5.,6.]])

print("my_array:\n", my_array)
print("my_array dimensions:", my_array.shape)          # Number of rows and columns in the array.
print("my_array number of elements:", my_array.size)   # Total number of elements in the array.
print("my_array type of elements", my_array.dtype)     # Type of data stored in the array.

<br>

One great feature of NumPy arrays is that it is very easy to **perform an operation on all elements of the array at once**:

In [None]:
my_result = my_array * 3   # Multiply by 3 all elements in the array.
print(my_result)

> To perform the same operation with **native python** would take a lot more effort and code:
>
> ```python
> my_list = [[1.,2.,3.],[4.,5.,6.]]
> my_result_2 = [[0,0,0],[0,0,0]]
> for i in range(len(my_list)):
>     for j in range(len(my_list[i])):
>         my_result_2[i][j] = my_list[i][j] * 3
>
> print(my_result_2)
> ```

<br>

NumPy also exploits the homogeneity constraint on array data to provide **substantial speed-ups**:

In [None]:
from time import time

native_data = tuple(range(10**7))
numpy_data = np.array(native_data)

t0 = time()
numpy_data *= 3
numpy_time = time() - t0

t0 = time()
native_data = [ x*3 for x in native_data ]
native_time = time() - t0

print("native timing :", native_time)
print("numpy timing :", numpy_time)
print("numpy acceleration factor:" , native_time // numpy_time )

<br>
<br>

[Back to ToC](#toc)

## Creating NumPy arrays <a id='3'></a>
---------------------------------

### Creating arrays from lists (and more generally from sequences) <a id='3.1'></a>

The simplest way to create a new NumPy array is to use the **`np.array()`** function, which accepts a sequence as argument, e.g. a `list`, a `tuple` or a `range` object.

* **Create one-dimensional** arrays.

In [None]:
my_array = np.array([1, 2, 3])          # Create an array from a list...
my_array_2 = np.array((7, 8, "nine"))   # ...from a tuple
my_array_3 = np.array(range(10))        # ...from a range object.

print("object type:", type(my_array))
print("array value:", my_array)
print("array value:", my_array_2)
print("array value:", my_array_3)

> Note how `np.array((7, 8, "nine"))` has converted the values of `7` and `8` to strings.
> This is because all values in a NumPy array must be of the same type.
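
The type conversion described above can be inspected through the `dtype` attribute. The following is a minimal sketch (the variable names are chosen for illustration only) showing how mixed inputs are converted to a common type, and how a type can be requested explicitly:

```python
import numpy as np

# Mixing integers and floats: all values are converted to floats.
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers, mixed_numbers.dtype)

# Mixing numbers and strings: all values are converted to strings.
mixed_strings = np.array((7, 8, "nine"))
print(mixed_strings, mixed_strings.dtype)

# The type can also be requested explicitly with the "dtype" argument.
as_float = np.array([1, 2, 3], dtype=float)
print(as_float, as_float.dtype)
```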

<br>

* **Create N-dimensional** arrays: NumPy also allows creating 2-(or more)dimensional arrays.

In [None]:
# Example of a 2-dimensional array:
my_array = np.array([[1,2,3],[4,5,6]])

print("object type:", type(my_array))
print("array value:\n", my_array)

In [None]:
# Example of a 3-dimensional array:
my_array = np.array([[range(10), range(10)], [range(10), range(10)], [range(10), range(10)]])
my_array

<br>

### Creating homogeneous-filled arrays <a id='3.2'></a>

The following functions generate arrays filled with the same value. The **`<shape>`** argument is a tuple/list (iterable) that indicates the dimensions of the array (e.g. rows and columns).
* Array **filled with zeroes**: `np.zeros(<shape>)`.
* Array **filled with ones**: `np.ones(<shape>)`.
* Array **filled with a specific value**: `np.full(<shape>, value)`.

In [None]:
# 2-dimensional array filled with 0s.
my_array = np.zeros((3, 5))
print(my_array)

In [None]:
# 3-dimensional array filled with 1s.
my_array = np.ones((2, 4, 2))
print(my_array)

In [None]:
# 2-dimensional array filled with a specific value.
my_array = np.full((2, 3), 42)
print(my_array)

<br>

### Create an identity matrix <a id='3.3'></a>

* **`np.eye()`**: `np.eye(<number of rows>, <number of columns>, <position of diagonal>)`.
* The 3rd argument, **`<position of diagonal>`** specifies where the diagonal with `1` values should be.

In [None]:
my_array = np.eye(4,4,0)
print("Diagonal at index position 0:\n", my_array)

my_array = np.eye(4,4,1)
print("\nDiagonal at index position 1:\n", my_array)

<br>

### Array with random numbers following a uniform distribution between 0.0 and 1.0  <a id='3.4'></a>

In [None]:
my_array = np.random.rand(2,5)
print(my_array)

<br>

### Evenly spaced arrays: the `np.arange` function <a id='3.5'></a>

**`np.arange()`** generates one-dimensional arrays filled with **evenly spaced numbers**.
* Accepts custom `start`, `stop`, and `step` (increment) values: `np.arange(start, stop, step)`.
* By default, arrays are filled with integers starting at `0`.
* The stop (end) value is *not* included (as usual in python).
* Unlike the built-in `range()` function, `np.arange()` supports **float** `start`/`stop` points and float increments (`step`).

In [None]:
my_array_1 = np.arange(6)             # Integers from 0 to 5 - the stop value is not included.
my_array_2 = np.arange(6, 21, 3)      # Every 3rd number from 6 to 21, 21 not included.
my_array_3 = np.arange(1.1, 7, 1.5)   # Supports float values! Here between 1.1 and 7, incremented by 1.5.

print(my_array_1)
print(my_array_2)
print(my_array_3)
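
With float steps, the exact number of elements returned by `np.arange()` can be hard to predict because of floating point rounding. When the number of points matters more than the step size, `np.linspace()` may be a better fit. A minimal sketch, with values chosen for illustration only:

```python
import numpy as np

# np.linspace(start, stop, num): "num" evenly spaced values, stop value included by default.
my_array_4 = np.linspace(0.0, 1.0, 5)
print(my_array_4)

# With endpoint=False the stop value is excluded, as in np.arange().
my_array_5 = np.linspace(0.0, 1.0, 5, endpoint=False)
print(my_array_5)
```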

<br>

### Reading and writing arrays from/to files <a id='3.6'></a>

* **`numpy.loadtxt()`**: read data from a (tabulated) text file.

> `numpy.loadtxt(<file name>, dtype=<type of data in file>, comments='#', delimiter=None, converters=None,
   skiprows=0, usecols=None, unpack=False, ndmin=0)`

In [None]:
# Print the content of the file.
# Note that we use "!" to call a shell command (here "cat"). This is a feature of Jupyter Notebooks,
# but it will not work on Windows machines.
!cat data/test.tab

In [None]:
# Note: we skip the 1st row of the file, because it contains column names.
my_array = np.loadtxt("data/test.tab", dtype=float, delimiter=" ", skiprows=1)
print(my_array)

> **Notes:**
> * There is a more sophisticated function for handling files with complex formatting: **`numpy.genfromtxt()`**.
> * For complex cases, the use of other libraries or custom parsers is required.
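
As a rough illustration of `numpy.genfromtxt()`, the sketch below reads a small comma-separated table that contains a missing value. The table content is hypothetical and is held in memory (via `io.StringIO`) only so that the example is self-contained:

```python
import io
import numpy as np

# Hypothetical file content: a header line and two data rows, one with an empty field.
text = io.StringIO("a,b,c\n1,2,3\n4,,6\n")

# skip_header=1 skips the column names; the empty field is read as nan.
data = np.genfromtxt(text, delimiter=",", skip_header=1)
print(data)
```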

<br>

* **`numpy.savetxt()`**: write data to a text file.
> `numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ')`

In [None]:
my_array = np.array([[1,2,3], [4,5,6]])
np.savetxt(
    "test_out.tab", 
    my_array, 
    fmt="%d", 
    delimiter="\t",
    header="# my sample\n#test",
    comments=""
)

In [None]:
# Print the content of the file we just created (note: this command will not work on Windows).
!cat test_out.tab

<br>
<br>

[Back to ToC](#toc)

## Array reshaping <a id='4'></a>
----------------------

* **`numpy.reshape()`**: changes the shape of an array without changing its data.

In [None]:
# 1-dimensional array.
my_array = np.arange(1.1, 6, 0.7)
print(my_array)

In [None]:
# Reshape to a 2D array.
print(np.reshape(my_array, (2, 4)))

In [None]:
# Reshape to a 3D array.
print(np.reshape(my_array, (2, 2, 2), order="C"))  # order="C" is the default value.
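
Reshaping only works when the new shape holds exactly the same number of elements as the original array. One dimension can be given as `-1`, in which case NumPy infers it from the others. A short sketch under these assumptions:

```python
import numpy as np

my_array = np.arange(12)

# -1 lets NumPy infer the dimension: 12 elements / 3 rows = 4 columns.
reshaped = my_array.reshape((3, -1))
print(reshaped.shape)

# The total number of elements must match, otherwise a ValueError is raised.
try:
    my_array.reshape((5, 3))
except ValueError as error:
    print("Cannot reshape:", error)
```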


<br>
<br>

[Back to ToC](#toc)

## Accessing data in NumPy arrays <a id='5'></a>
------------------------------------------------

### Indexing <a id='5.1'></a>

* **One-dimensional** arrays are accessed in the same way as standard lists in python.
* **Multi-dimensional** arrays are accessed by indices along every axis. These indices are separated by commas and given in square brackets.

In [None]:
# Create a new 3x3 Numpy array.
my_array = np.arange(1., 10.).reshape((3, 3))
print(my_array)

In [None]:
# Accessing a single element in standard way (list-like syntax):
print(my_array[1][0])

# Accessing a row in the standard way:
print("row:", my_array[1])

* **Accessing an element using the NumPy syntax**: in NumPy, indices are given in square brackets and separated by commas.

In [None]:
# Accessing a single element in numpy syntax:
print(my_array[1, 0])

<br>

### Slicing <a id='5.2'></a>

* Accessing a subset of an array is done using the **`[rows to select, cols to select]`** syntax.
* Passing **`:`** selects all rows or columns.
* Remember that indexing is zero-based (i.e. the first row/column has index 0, not 1), and that
  the end index of a slice is not included.

<br>

**Example 1:** accessing a single row or column.

In [None]:
print(my_array, "\n")

# Accessing a single row.
print("Second row:", my_array[1,:])

# Accessing a single column.
print("Third column:", my_array[:, 2])

<br>

**Example 2:** accessing a 2-dimensional subset of an array.

In [None]:
# All rows from the second row to the end, and all columns
# from the first one to the third one (not included).
print(my_array[1:,:2])

In [None]:
# First and 3rd row, and all columns.
print(my_array[[0,2],:])

<br>

* Similarly to slicing lists/tuples, the **`start:stop:step`** notation can also be used when accessing array elements.

In [None]:
my_array = np.arange(1., 21.).reshape((4,5))
print(my_array)

# Accessing a subset with a step argument.
print("subset:\n", my_array[:,0::2])

<br>

### Comparison operations <a id='5.3'></a>

When applying **comparison operators**, such as `>`, `<` or `==`, an array of boolean values of the same shape as the input array is returned.

In [None]:
my_array = np.arange(1., 10.).reshape((3,3))

print("my_array:\n", my_array, "\n")
print("my_array > 5:\n", my_array > 5)

* The **`.all()`** and **`.any()`** methods of arrays test whether all values (`.all()`) or at least
  one value (`.any()`) in a boolean array are `True`. These methods also work on non-boolean arrays,
  where they test the *truthiness* of the values.

In [None]:
results = my_array > 5

print("All values in the array are greater than 5:", results.all()) 
print("There is at least one value in the array greater than 5:", results.any()) 

* Boolean arrays can also be used to **filter an array**, i.e. to extract the values that match a given criterion.

In [None]:
# A boolean array is used to select values >= 5.
print("Values which are greater or equal to 5:", my_array[my_array >= 5])
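
Several conditions can be combined into a single boolean array. The sketch below assumes the same 3x3 array as above and uses the element-wise operators `&` (and) and `|` (or), as well as `np.where()` to replace values based on a condition:

```python
import numpy as np

my_array = np.arange(1., 10.).reshape((3, 3))

# Combine conditions with & (and) or | (or). The parentheses are required.
in_range = (my_array > 2) & (my_array < 7)
selected = my_array[in_range]
print("Values between 2 and 7 (both excluded):", selected)

# np.where(condition, a, b) takes values from a where the condition is True, from b otherwise.
capped = np.where(my_array > 5, 5, my_array)
print("Values capped at 5:\n", capped)
```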

<br>

### Iterating over arrays <a id='5.4'></a>

* When iterating over a multi-dimensional NumPy array with a `for` loop, the iteration is done over rows.
* To iterate over **individual elements**, the **`np.nditer()`** function can be used.

In [None]:
# The standard for loop iterates over rows.
for row in my_array:
    print(row)

In [None]:
# Iterating over all elements in standard way.
for x in my_array:
    for y in x:
        print(y, " ",end="")

In [None]:
# Iterating over all elements in numpy way.
for x in np.nditer(my_array):
    print(x, " ", end="")

<br>
<br>

[Back to ToC](#toc)

## Modifying array values <a id='6'></a>
--------------------------------

### Assigning new values to an existing index <a id='6.1'></a>

NumPy arrays are mutable, and their values can be modified.
* To modify a value, simply assign a new value to an existing index (or range) position.

**Example 1:** modify a single value.

In [None]:
# Create a new array.
my_array = np.arange(1,5).reshape((2,2))
print(my_array, "\n")

# Modify the value in row 0, column 1:
my_array[0,1] = 5
print("After changing one element:\n", my_array)

<br>

**Example 2:** modify an entire row or column.

In [None]:
# Create a new array.
my_array = np.arange(1,5).reshape((2,2))
print(my_array, "\n")

# Modify values in the 1st row.
my_array[0,:] = np.array([5,6])
print("After changing one row:\n", my_array, "\n")

# Modify values in the 2nd column.
my_array[:,1] = np.array([7,8])
print("After changing one column:\n", my_array)

<br>

### Adding rows and columns to an existing array <a id='6.2'></a>

* **`np.append()`**: append a row (`axis=0`) or column (`axis=1`) to an array.
* **`np.insert()`**: add a row (`axis=0`) or column (`axis=1`) at a specific position in the array.
* **`np.concatenate()`**: bind multiple arrays by rows or columns.

<br>

**Example 1:** add a row or column with **`np.append()`**.

In [None]:
# Create a new array
my_array = np.arange(0,6).reshape((2,3))
print(my_array, "\n")

# Append column by setting "axis=1".
print(
    "After appending a column:\n",
    np.append(my_array, np.array([6, 7]).reshape(2,1), axis=1),
    "\n"
)

# Append a row by setting "axis=0".
print(
    "After appending a row:\n",
    np.append(my_array, np.array([6, 7, 8]).reshape(1,3), axis=0),
)

<br>

**Example 2:** insert a row in an array with **`np.insert()`**.

In [None]:
print(my_array, "\n")

# Insert a row in 2nd position (index = 1).
print("After row insertion\n", np.insert(my_array, 1, [[6, 7, 8]], axis=0))

<br>

**Example 3:** combine arrays by row or column with **`np.concatenate()`**.

In [None]:
array_1 = np.arange(0,6).reshape((2,3))
array_2 = np.arange(6,12).reshape((2,3))

print("Row concatenation:\n", np.concatenate((array_1, array_2), axis=0), "\n")
print("Column concatenation:\n", np.concatenate((array_1, array_2), axis=1))

<br>
<br>

[Back to ToC](#toc)

## To copy or not to copy? <a id='7'></a>
--------------------------------

When operating and manipulating arrays, their data is sometimes copied into a new array and sometimes not. This is often a source of confusion for beginners.
* If you plan to change array values but want to keep the old array untouched, then make a copy of it
  using the **`.copy()`** method!

**Example:**

In [None]:
my_array = np.arange(1., 21.).reshape((4,5))
print(my_array, "\n")

# When assigning "my_array" to a new variable, the data from "my_array" is NOT copied.
# Instead, "tmp" simply points to the same memory location as "my_array".
tmp = my_array
print("Are 'tmp' and 'my_array' pointing to the same object in memory?", tmp is my_array, "\n")

# As a result, a change made to "tmp" will also be present in "my_array"...
# because they are pointing to the same object!
tmp[1, 1] = 999
print("Values of 'my_array' after we changed 'tmp':\n", my_array) 


**Example 2:** even if it appears that the `my_array` and `tmp` arrays are not the same object, they do in fact access the same memory location.

> This behavior may appear strange, but slicing a NumPy array returns a *view* on the data of the original array rather than a copy.
> Multiple arrays may thus access the same memory space, or a subset of the same memory space.

In [None]:
my_array = np.arange(1., 21.).reshape((4,5))
print(my_array, "\n")

tmp = my_array[1:3,1:3]
print("Are 'tmp' and 'my_array' pointing to the same object in memory?", tmp is my_array, "\n")

# "tmp" is a different object than "my_array", but the change in one is still present in the other!
tmp[1,1] = 999
print("tmp:\n", tmp, "\n")
print("Content of 'my_array' after changing 'tmp':\n", my_array)
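
Whether two arrays access the same memory can be checked with `np.shares_memory()`. The sketch below (variable names are illustrative) contrasts basic slicing, which returns a view, with indexing by a list of indices, which returns a copy:

```python
import numpy as np

my_array = np.arange(1., 21.).reshape((4, 5))

view = my_array[1:3, 1:3]      # Basic slicing returns a view on the same data.
fancy = my_array[[1, 2], :]    # Indexing with a list of indices returns a copy.

print("Slice shares memory with my_array:", np.shares_memory(my_array, view))
print("List-indexed array shares memory with my_array:", np.shares_memory(my_array, fancy))

# Modifying the copy leaves the original array untouched.
fancy[0, 0] = 999
print(my_array[1, 0])
```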

<br>

**If you plan to change array values but want to keep the old array untouched, then make a copy of it!**
* To copy an array, use its **`.copy()`** method.

In [None]:
my_array = np.arange(1., 21.).reshape((4,5))
print(my_array, "\n")

# Make a copy of the original array.
tmp = my_array.copy()
print("Are 'tmp' and 'my_array' pointing to the same object in memory?", tmp is my_array, "\n")

# This time, modifying the copy does not modify the original array.
tmp[1,1] = 999
print("'tmp' array after the change:\n", tmp, "\n")
print("'my_array' after having changed 'tmp':\n", my_array)

<br>
<br>

[Back to ToC](#toc)

## NumPy functions <a id='8'></a>
-------------------------

NumPy provides a great number of functions. The power of NumPy functions is their **speed** and the possibility to **apply them to arrays element-wise, column-wise or row-wise**.

### Element-wise operations and functions
* When used on an array, all basic math operators (**`+`**, **`-`**, **`/`**, **`*`**, **`//`**, __`**`__) are automatically applied element-wise.

In [None]:
# Create a test array.
my_array = np.arange(1, 7).reshape((2,3))    # We use .reshape to create a 2 x 3 matrix.
print("Original values:\n", my_array)

# Apply math operations to each element of the array.
print("\nDivision:\n", my_array / 2)
print("\nAddition:\n", my_array + 10)
print("\nPower of 2:\n", my_array ** 2)

**When arrays are of the same size**, the requested operation is carried out element-wise.

In [None]:
print("Divide arrays:\n", my_array / my_array)
print("\nAddition arrays:\n", my_array + my_array)

<br>

**When arrays differ in shape**, NumPy tries to *broadcast* the smaller array across the larger one (this only works when the shapes are compatible).  
In this example, the array `[1, 2, 3]` is added to both the 1st and the 2nd row.

In [None]:
my_array + np.array([1, 2, 3])
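
Whether two shapes can be combined depends on NumPy's broadcasting rules: shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1. The sketch below illustrates a row-wise case, a column-wise case, and an incompatible case:

```python
import numpy as np

my_array = np.arange(1, 7).reshape((2, 3))

# Shape (2, 3) + shape (3,): the 1D array is added to every row.
row_added = my_array + np.array([1, 2, 3])
print(row_added)

# Shape (2, 3) + shape (2, 1): the single column is added to every column.
col_added = my_array + np.array([[10], [20]])
print(col_added)

# Shape (2, 3) + shape (2,): the shapes are not compatible and a ValueError is raised.
try:
    my_array + np.array([10, 20])
except ValueError as error:
    print("Broadcasting failed:", error)
```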

<br>

### Element-wise, row-wise, column-wise functions

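Many NumPy functions can be applied:
* To the whole array - the default for aggregating functions such as `np.sum()`.
* Column-wise (one result per column) - by passing `axis=0` as argument.
* Row-wise (one result per row) - by passing `axis=1` as argument.

Some examples include the aggregating functions **`np.sum()`**, **`np.std()`** and **`np.mean()`**, as well as **`np.log2()`**, which is always applied element-wise.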

In [None]:
# Sum.
print("Sum of the whole array:", np.sum(my_array))
print("Sum of columns:", np.sum(my_array, axis=0))
print("Sum of rows:", np.sum(my_array, axis=1))

In [None]:
# Standard deviation.
print("Std of the whole array:", np.std(my_array))
print("Std of columns:", np.std(my_array, axis=0))
print("Std of rows:", np.std(my_array, axis=1))

In [None]:
# Mean function.
print("Mean of the whole array:", np.mean(my_array))
print("Mean of columns:", np.mean(my_array, axis=0))
print("Mean of rows:", np.mean(my_array, axis=1))

In [None]:
# Log base 2.
print("Compute log2 of each element:\n", np.log2(my_array))
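
The `axis` argument indicates the dimension that is collapsed by the function: `axis=0` collapses the rows and returns one value per column, while `axis=1` collapses the columns and returns one value per row. A minimal sketch, which also shows the `keepdims` argument used to subtract row means from each row:

```python
import numpy as np

my_array = np.arange(1, 7).reshape((2, 3))

col_means = my_array.mean(axis=0)   # Shape (3,): one value per column.
row_means = my_array.mean(axis=1)   # Shape (2,): one value per row.
print(col_means.shape, row_means.shape)

# keepdims=True keeps the collapsed axis with a length of 1, so that the result
# has shape (2, 1) and can be broadcast against the original (2, 3) array.
centered_rows = my_array - my_array.mean(axis=1, keepdims=True)
print(centered_rows)
```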

<br>
<br>

[Back to ToC](#toc)

## Random numbers in NumPy <a id='9'></a>
---------------------------------------

The **`numpy.random`** module provides a large collection of distributions (uniform, normal, beta, binomial, gamma, poisson ...) to draw from.

* **`np.random.rand()`**: random numbers from a uniform distribution over [0, 1).
* **`np.random.randn()`**: random numbers from a standard normal distribution (mean 0, standard deviation 1).
* **`np.random.permutation()`**: random permutation of the array's values (reshuffling).
* **`np.random.choice(array, size, replace=True/False)`**: random sampling from an array, with (`replace=True`) or without (`replace=False`) replacement.

<br>

**Examples:**

In [None]:
# Random numbers from a uniform distribution [0,1].
print("single number:", np.random.rand())
print("array of random numbers:\n", np.random.rand(2,3))

In [None]:
# Random numbers from normal distribution.
my_array = np.random.randn(100000)
print("array of random numbers.\n", "mean :", np.mean(my_array) , "\tstandard deviation :" , np.std(my_array) )

In [None]:
# Permutation and sampling.
my_array = np.arange(7)
print("array permutation:\n", np.random.permutation(my_array))
print("sample from the array:\n", np.random.choice(my_array, size=3, replace=True))
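
Recent versions of NumPy also offer a generator-based interface, `np.random.default_rng()`, which keeps the random state in an object instead of a global state. The sketch below reproduces the operations shown above with this interface. The seed value is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # The seed value is arbitrary.

uniform_values = rng.random((2, 3))                       # Uniform distribution over [0, 1).
normal_values = rng.normal(loc=0.0, scale=1.0, size=3)    # Normal distribution.
shuffled = rng.permutation(np.arange(7))                  # Random permutation.
sample = rng.choice(np.arange(7), size=3, replace=False)  # Sampling without replacement.

print(uniform_values)
print(normal_values)
print(shuffled)
print(sample)
```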

<br>
<br>

[Back to ToC](#toc)

## Linear algebra built-in capabilities <a id='10'></a>
----------------------------------------------

NumPy arrays can be used as matrices without any special conversion.  
For advanced linear algebra operations there is a dedicated module: **`numpy.linalg`**.

In [None]:
# Create a new array.
my_array = np.arange(1., 7.).reshape((2,3))
print(my_array)

In [None]:
# Transpose of a matrix.
print("transposed matrix:\n", my_array.T)

#### Element-wise product vs. matrix product

In [None]:
# Matrix multiplication element-wise.
print(my_array * my_array)

In [None]:
# Matrix product.
print(my_array.dot(my_array.T)) 

# Alternatively, one can also use the "@" operator:
print(my_array @ my_array.T)
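
As a brief illustration of `numpy.linalg`, the sketch below solves a small linear system and computes the determinant and the inverse of a matrix. The matrix `A` and the vector `b` are arbitrary values chosen for the example:

```python
import numpy as np

A = np.array([[2., 1.], [1., 3.]])
b = np.array([3., 5.])

# Solve the linear system A @ x = b.
x = np.linalg.solve(A, b)
print("solution:", x)

determinant = np.linalg.det(A)
inverse = np.linalg.inv(A)
print("determinant:", determinant)
print("inverse:\n", inverse)

# Check: A @ x should give back b (up to floating point error).
print("A @ x equals b:", np.allclose(A @ x, b))
```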

<br>
<br>
<br>


[Back to ToC](#toc)

# SciPy.stats and statistics in python <a id='11'></a>
------------------------------------------------------

**SciPy** refers both to a comprehensive [project for scientific Python programming](https://scipy.org) and to a [library](https://docs.scipy.org/doc/scipy/reference) implementing various tools and algorithms for scientific software.

Here we will give a few pointers on the **`scipy.stats`** library, which provides ways to interact with various random distribution functions, as well as implement numerous statistical tests.

<br>

## Manipulation of random distributions <a id='12'></a>
--------------------------------------------------

The **`scipy.stats`** module implements utilities for a large number of continuous and discrete distributions:

In [None]:
from scipy import stats

In [None]:
dist_continue = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_continuous)]
dist_discrete = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_discrete)]

print('number of continuous distributions: ' , len(dist_continue))
print('number of discrete distributions:   ' , len(dist_discrete))

<br>

Let's experiment with the **normal distribution**, or **`norm`** in `scipy.stats`.

A look at `help(stats.norm)` tells us that:

```
 |  The location (``loc``) keyword specifies the mean.
 |  The scale (``scale``) keyword specifies the standard deviation.
```

Let's generate a normal distribution with:
* mean = 10 -> `loc=10`.
* stdev = 2 -> `scale=2`.

In [None]:
N = stats.norm(loc=10, scale=2)

print(N.stats())  # The mean and variance of a distribution can be retrieved using the .stats method.
print(type(N))

<br>

That distribution object (`N`) we just created can now be used to interact with the distribution in many ways, as will be illustrated below.

<br>

### Drawing random numbers: `.rvs()` <a id='12.1'></a>

* **`.rvs()`**: this method of a scipy distribution is used to draw random numbers from the distribution.
* The **`size`** argument defines the dimensions of the returned arrays of random numbers.
  The type returned by the method is a NumPy array.

In [None]:
N.rvs(size = [5,5])

> **Note:** as with any drawing of random numbers on a computer,
> [one merely emulates randomness](https://en.wikipedia.org/wiki/Pseudorandom_number_generator).
> This also means that one can make some random operation reproducible by setting the random seed to
  a specific value.



In [None]:
import numpy as np

np.random.seed(2020) # Set the seed of the random number generator to a specific value.
draw1 = N.rvs(size=5)
print("Random numbers:", draw1)

# By setting the random seed back to 2020, we can reproduce randomness.
np.random.seed(2020)
draw2 = N.rvs(size=5)
print("Random numbers:", draw2)

print("\nAre the random draws equal?", np.array_equal(draw1, draw2))
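
As an alternative to setting the global NumPy seed, the `.rvs()` method accepts a `random_state` argument, which makes a single draw reproducible without affecting other random operations. A minimal sketch, where the seed value is arbitrary:

```python
from scipy import stats

N = stats.norm(loc=10, scale=2)

# Passing the same "random_state" value gives the same random draw.
draw1 = N.rvs(size=5, random_state=2020)
draw2 = N.rvs(size=5, random_state=2020)

print("Random numbers:", draw1)
print("Random numbers:", draw2)
print("Are the random draws equal?", (draw1 == draw2).all())
```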

<br>

### Looking up quantiles and probability density functions <a id='12.2'></a>

* **`.pdf()`**: Probability Density Function of distribution.
* **`.cdf()`**: Cumulative Distribution Function of distribution.
* **`.ppf()`**: Percent Point Function (inverse of CDF): gives the quantiles of the distribution.

<br>

**Example:** plotting a distribution.

In [None]:
import matplotlib.pyplot as plt 

# pdf: Probability Density Function.
x = np.arange(0, 20, 0.1)
plt.plot(x, N.pdf(x))
plt.show()

# cdf: Cumulative Distribution Function.
print("What is the probability of drawing a number <=15.0?", N.cdf(15.0)) 

# ppf: Percent Point Function (inverse of CDF): gives the quantiles of the distribution.
P = [0.025, 0.5, 0.975]
for threshold, quantile in zip(P, N.ppf(P)):
    print("Quantile:", threshold, "->", quantile)
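
These methods can be combined to answer common questions about a distribution. The sketch below, which assumes the same normal distribution as above (mean 10, standard deviation 2), computes the probability of falling within an interval and checks that `.ppf()` is the inverse of `.cdf()`:

```python
from scipy import stats

N = stats.norm(loc=10, scale=2)

# Probability of drawing a number between 8 and 12, i.e. within one
# standard deviation around the mean (about 68% for a normal distribution).
prob = N.cdf(12) - N.cdf(8)
print("P(8 <= X <= 12):", prob)

# ppf is the inverse of cdf: applying one after the other gives back the input value.
roundtrip = N.ppf(N.cdf(13.0))
print("ppf(cdf(13.0)):", roundtrip)

# .interval() returns the bounds of the central interval containing
# the given fraction of the distribution.
low, high = N.interval(0.95)
print("Central 95% interval:", low, high)
```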

<br>

For discrete distributions these rules change a bit: the `.pdf()` function is replaced by **`.pmf()`**.
* **Example:** binomial distribution with 10 draws and a 0.5 probability of success.

In [None]:
x = np.arange(0, 11)   # Possible numbers of successes: 0 to 10 (included).
plt.scatter(x, stats.binom.pmf(x, n=10, p=0.5))
plt.show()

<br>
<br>

[Back to ToC](#toc)

## Statistical tests <a id='13'></a>
-----------------------

**`scipy.stats`** implements a number of statistical tests as functions.
* Most tests return two values: the computed **test statistic** and the **p-value**.
* We will only demonstrate a couple of tests here, but you can get a more in-depth explanation and
  demonstration of `scipy.stats` tests
  [here](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet).

<br>

### Student's t-test
Imagine that we have two samples of measurements drawn from two sub-populations:

In [None]:
sample1 = stats.norm.rvs(size=93, loc=173, scale=20)
sample2 = stats.norm.rvs(size=132, loc=181, scale=20)

To test whether the samples have the same mean value, we perform a **t-test to test the equality of the means**:

In [None]:
statistic, pvalue = stats.ttest_ind(sample1, sample2)
significance_threshold = 0.05

if pvalue < significance_threshold:
    print("We reject the null hypothesis (H0)")
else:
    print("We do not reject the null hypothesis (H0)")
print("p-value=",pvalue)


**`stats.ttest_ind`** has an **`equal_var`** parameter that one can set to `False` in order to perform Welch's t-test, which is warranted when one cannot assume that the variances of the two sub-populations are equal.
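
The effect of the `equal_var` parameter can be sketched as follows. The two samples are simulated with deliberately different standard deviations, and all values (sizes, means, seeds) are chosen for illustration only:

```python
from scipy import stats

# Simulated samples with different spreads (values chosen for illustration only).
sample_a = stats.norm.rvs(size=50, loc=173, scale=5, random_state=1)
sample_b = stats.norm.rvs(size=80, loc=181, scale=25, random_state=2)

student = stats.ttest_ind(sample_a, sample_b)                  # Assumes equal variances.
welch = stats.ttest_ind(sample_a, sample_b, equal_var=False)   # Welch's t-test.

print("Student's t-test p-value:", student.pvalue)
print("Welch's t-test p-value:  ", welch.pvalue)
```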

> In general, these functions are very well documented, with details on the tests and usage examples.  
> We wholeheartedly recommend that would-be users read their `help()`.

<br>

### Chi-Squared Test

Imagine that we count different cell types in two biopsies and report them in a list:

In [None]:
biopsy1 = np.array([135, 423, 24, 72])
biopsy2 = [184, 552, 77, 101]

We now want to test whether the two biopsies differ significantly in their composition using a **Chi-squared** test.

In [None]:
table = [biopsy1 , biopsy2]

stat, pvalue, degrees_freedom, expected_values = stats.chi2_contingency(table)
print('Chi-square test of independence of variables')
print('stat=%.3f, degrees of freedom=%i, p-value=%.3f' % (stat, degrees_freedom, pvalue))
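
The expected values returned by `stats.chi2_contingency` are the counts one would expect if cell type and biopsy were independent: for each cell of the table, (row total × column total) / grand total. The sketch below recomputes them with NumPy for the same table and compares them with the values returned by SciPy:

```python
import numpy as np
from scipy import stats

table = np.array([[135, 423, 24, 72],
                  [184, 552, 77, 101]])

row_totals = table.sum(axis=1, keepdims=True)   # Shape (2, 1).
col_totals = table.sum(axis=0, keepdims=True)   # Shape (1, 4).
expected_manual = row_totals * col_totals / table.sum()

stat, pvalue, dof, expected = stats.chi2_contingency(table)
print("Manual and SciPy expected values agree:", np.allclose(expected, expected_manual))

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
print("degrees of freedom:", dof)
```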

<br>

### Statistical modelling and regression <a id='13.1'></a>

`scipy` implements methods to fit a model to some data. 

**`scipy.stats`** provides a simple linear regression function between two variables,
while **`scipy.optimize`** implements functions to fit (non-linear) models to data.

In [None]:
x = stats.uniform.rvs(size=30, loc=0, scale=100)         # generating x
y = 1.6*x  + stats.norm.rvs(size=30, loc=0, scale=50)  # Y = 1.6 * x + some noise
    
# Perform the linear regression:   
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("slope:",slope,"\tintercept:" , intercept)
print("R-squared:" , r_value**2)
print("p-value for the slope:", p_value)

# Plot the data along with the fitted line:
plt.plot(x, y, 'o', label='original data')
plt.plot(x, intercept + slope*x, 'r', label='fitted line')
plt.legend()
plt.show()
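
The object returned by `stats.linregress` can also be kept as a whole and its values accessed by name. The sketch below uses noise-free data (chosen for illustration, so that the expected slope and intercept are known) and uses the fit to predict a new value:

```python
import numpy as np
from scipy import stats

# Noise-free data following y = 2 * x + 1, so that the fit is exact.
x = np.array([0., 1., 2., 3., 4.])
y = 2.0 * x + 1.0

result = stats.linregress(x, y)
print("slope:", result.slope, "\tintercept:", result.intercept)
print("R-squared:", result.rvalue ** 2)

# Use the fitted line to predict y for a new x value.
predicted = result.intercept + result.slope * 5.0
print("predicted y for x=5:", predicted)
```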


In [None]:
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit


# Example with 2 explanatory variables
# ************************************

# We define the model as a function.
# here it is a form of exponential decay model 
# where variable1 is the time and the rate of decay changes with variable2
# the model function takes as argument the 2 explanatory variable (grouped in a single tuple), and the 4 parameters
def func(X , a, b, c , d):
    x0 , x1 = X
    return a * np.exp( -(b + d*x1 ) * x0 ) + c 

#realParameter values
realParams = [ 5, 0.2, 0.5,  0.1 ]

# we simulate some data, with some noise
n=50 # number of points
variable1 = stats.uniform.rvs(size =n , loc = 0 , scale = 10 ) # explanatory variable number 1 : some uniform variable
variable2 = stats.bernoulli.rvs(size=n , p=0.5) # explanatory variable number 2 : can be 0 or 1

y = func( (variable1 , variable2) , realParams[0], realParams[1], realParams[2],  realParams[3])
y_noise = stats.norm.rvs(size=n , scale = 0.4) 
ydata = y+y_noise


popt, pcov = curve_fit(func, (variable1 , variable2), ydata)
perr = np.sqrt(np.diag(pcov))
print('parameter estimates          :', popt)
print('parameter standard deviation :', perr)
print('\nrelative estimation error    :',np.abs(popt - realParams)/realParams )


plt.scatter(variable1, ydata, c=variable2 , label='data')

x = np.linspace(min(variable1) , max(variable1) , 100)
plt.plot(x, func( ( x , np.zeros(100) )  , *popt), '--' ,  label='fit for variable2==0')
plt.plot(x, func( ( x , np.ones(100) )  , *popt), '--' , label='fit for variable2==1')

plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()


> You can find most of what was discussed here and more in the [official tutorial](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html)

<br>
<br>
<br>

[Back to ToC](#toc)

# Exercises <a id='14'></a>
-----------------

## Exercise 8.1

1. Load the data in the file `data/sample_data.tsv` as a numpy array. Display its dimensions (number of rows and columns).
2. Log-transform the data (use base 2 log).
3. Find the row-wise means for replicates of Sample1 and Sample2.
4. Find the row-wise standard deviations, in the same way as for the means.
5. Use the function *scipy.stats.ttest_ind* to calculate a p-value for every row.
6. Select p-values which are smaller than $10^{-2}$.
7. Print how many p-values below $10^{-2}$ were found.

In [None]:
import numpy as np
import scipy.stats as stats

In [None]:
# Display the top of the input data file using the shell command "head".
# Note: this command will not work on Windows.
!head -3 data/sample_data.tsv

### Solution:

* **Part 1**: load the data.

In [None]:
# %load -r 1-24 solutions/solution_81.py

* **Part 2**: log-transform the data.

In [None]:
# %load -r 25-31 solutions/solution_81.py

* **Part 3**: find the row-wise means.

In [None]:
# %load -r 32-52 solutions/solution_81.py

* **Part 4**: find the row-wise standard deviations.

In [None]:
# %load -r 53-60 solutions/solution_81.py

* **Part 5:** calculate p-value for every row.

In [None]:
# %load -r 61-65 solutions/solution_81.py

* **Part 6 and 7:** select p-values.

In [None]:
# %load -r 66- solutions/solution_81.py