# Table of Contents <a id='toc'></a>


&nbsp;&nbsp;&nbsp;&nbsp;[Module 8 - NumPy and SciPy: introduction to statistics in python](#0)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Why NumPy?](#1)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[How to start?](#2)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[The heart of NumPy: array](#3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Creating NumPy arrays](#4)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Accessing data in NumPy arrays](#5)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Iterating through the numpy arrays](#6)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Modifying array values](#7)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[To copy or not to copy?](#8)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Functions](#9)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Random numbers in NumPy](#10)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Linear algebra built-in capabilities](#11)

&nbsp;&nbsp;&nbsp;&nbsp;[SciPy.stats and statistics in python](#12)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Manipulation of random distributions](#13)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Statistical tests](#14)

&nbsp;&nbsp;&nbsp;&nbsp;[Exercise](#15)


[back to the toc](#toc)

<br>

# Module 8 - NumPy and SciPy: introduction to statistics in python <a id='0'></a>
--------------------------------------------------------------------------------------------------

[NumPy](https://numpy.org) is the fundamental package for scientific computing with Python.

The **highlights** of NumPy are:
* A powerful N-dimensional array object
* Efficient functions that support broadcasting
* Tools for integrating C/C++ and Fortran code
* Useful linear algebra, Fourier transform, and random number capabilities
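
Broadcasting is what lets NumPy combine arrays of different but compatible shapes. The following is a minimal sketch; the array names and values are made up for illustration.

```python
import numpy as np

grid = np.array([[1., 2., 3.],
                 [4., 5., 6.]])

# A (2, 3) array combined with a (3,) array:
# the 1D array is applied to every row.
per_column_offset = np.array([10., 20., 30.])  # hypothetical values
shifted = grid + per_column_offset
print(shifted)

# A (2, 1) array is stretched along the columns instead:
# each row gets its own factor.
per_row_scale = np.array([[1.], [100.]])       # hypothetical values
scaled = grid * per_row_scale
print(scaled)
```

Shapes are compared from the last axis backwards; two axes are compatible when they are equal or when one of them is 1.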



[back to the toc](#toc)

<br>

## Why NumPy? <a id='1'></a>

* NumPy arrays are faster than standard Python lists
* NumPy functions let you write many operations with much less code
* NumPy functions are faster than naive Python implementations
* Great collection of mathematical functions available (numpy, scipy, sympy)


[back to the toc](#toc)

<br>

## How to start? <a id='2'></a>


In [None]:
# loading numpy module
import numpy as np

<br>


[back to the toc](#toc)

<br>

## The heart of NumPy: array <a id='3'></a>

NumPy's main object is the **homogeneous** multidimensional array.

Arrays are very efficient for operations with large numerical data and in general outperform standard Python lists.
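
Homogeneous means that all elements of an array share a single type. As a small illustration (the variable names are arbitrary), this is how NumPy picks a common type when the input is mixed:

```python
import numpy as np

mixed_numbers = np.array([1, 2, 3.5])         # integers and a float
print(mixed_numbers.dtype)                     # everything is stored as float

numbers_and_text = np.array([1, 2, "three"])  # numbers and a string
print(numbers_and_text.dtype)                  # everything is stored as text

# an explicit dtype forces a conversion: floats are truncated to integers
as_integers = np.array([1.9, 2.1], dtype=int)
print(as_integers)
```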

In [None]:
my_array = np.array([[1.,2.,3.],[4.,5.,6.]])
print("my_array:\n", my_array)
print("my_array dimensions:", my_array.shape)
print("my_array number of elements:", my_array.size)
print("my_array type of elements", my_array.dtype)

One great feature of NumPy arrays is that it is very easy to perform an operation on all the elements of the array at once:

In [None]:
my_result =  my_array * 3 # multiply by 3 all elements in the array

#compare with a native python equivalent :
my_list=[[1.,2.,3.],[4.,5.,6.]]
my_result2 = [[0,0,0],[0,0,0]]
for i  in range(len(my_list)):
    for j in range(len(my_list[i])):
        my_result2[i][j] = my_list[i][j] * 3
        

NumPy also exploits the homogeneity constraint on the array data to provide large speed-ups:

In [None]:
from time import time
native_data = [x for x in range(10**7)]
numpy_data = np.array(native_data)

t0 = time()
numpy_data *= 3
t1 = time()
numpyTime = t1-t0

t0 = time()
native_data = [ x*3 for x in native_data ]
t1 = time()
nativeTime = t1-t0

print("native timing :",nativeTime)
print("numpy timing :",numpyTime)
print("numpy acceleration factor :" , nativeTime / numpyTime )

So, let's familiarize ourselves with the numpy array.

<br>


[back to the toc](#toc)

<br>

## Creating NumPy arrays <a id='4'></a>


### Creating arrays from lists

In [None]:
# one-dimensional array
my_array = np.array([1,2,3])
print("my_array:", type(my_array), my_array)

In [None]:
# two-dimensional array
my_array = np.array([[1,2,3],[4,5,6]])
print("my_array:", type(my_array))
print(my_array)

### Creating arrays with functions

In [None]:
# array filled with zeroes
my_array = np.zeros((3,5)) # ( number of rows , number of columns )
print(my_array)

In [None]:
# array filled with ones
my_array = np.ones((4,2))
print(my_array)

In [None]:
# array filled with desired number
my_array = np.full((2,3), 42)
print(my_array)

In [None]:
## identity matrix
my_array = np.eye(4,4,0) 
# the first 2 arguments give the matrix dimensions and the 3rd argument specifies which diagonal is filled with ones (0 is the main diagonal)
print(my_array)

In [None]:
my_array = np.random.rand(2,2) # random numbers from a uniform distribution between 0.0 and 1.0
print(my_array)

### Arange function

Generates a one-dimensional array of evenly spaced numbers.

In [None]:
my_array = np.arange(6)
print(my_array)

In [None]:
# support float start/end points, as well as float steps
my_array = np.arange(1.1, 6 , 1.1) #start , stop , step
print(my_array)
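
With a float step, the number of elements returned by `arange` depends on floating-point rounding. When the number of points matters more than the step, `np.linspace` is a common alternative. A short sketch with arbitrary end points:

```python
import numpy as np

# np.linspace takes the number of points instead of a step,
# and includes the end point by default
points = np.linspace(1.1, 5.5, 5)
print(points)
print(len(points))
```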

### Array reshaping

**numpy.reshape** changes the shape of an array without changing its data

In [None]:
# 1D array
my_array = np.arange(1.1, 6, 0.7)
print(my_array)

In [None]:
# 2D array
print(np.reshape(my_array, (2,4)))

In [None]:
# 3D array
print(np.reshape(my_array, (2,2,2), order="C"))
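
Two practical details of reshaping, shown here on a made-up array: one dimension can be given as `-1` and is then inferred, and the new shape must account for exactly the same number of elements.

```python
import numpy as np

flat = np.arange(12)

# -1 asks NumPy to infer that dimension from the array size
table = flat.reshape((3, -1))
print(table.shape)

# flatten() returns a 1D copy of the data
back_to_1d = table.flatten()
print(back_to_1d.shape)

# a shape that does not match the number of elements raises a ValueError
try:
    flat.reshape((5, 3))
except ValueError as error:
    print("reshape failed:", error)
```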


### Reading arrays from files

**numpy.loadtxt**(fname, dtype=float, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0)

In [None]:
!cat data/test.tab

In [None]:
my_array = np.loadtxt("data/test.tab", dtype=str, delimiter=" ") # the np.str alias was removed from recent NumPy versions
print(my_array)

 * There is a more sophisticated function for handling files with complex formatting: **numpy.genfromtxt**.
 * For complex cases, the use of other libraries or custom parsers is required.
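
As a rough illustration of `numpy.genfromtxt`, the sketch below parses a small in-memory text (hypothetical content, standing in for a file) that has a header line and a missing value:

```python
import io
import numpy as np

# hypothetical file content with a header line and one missing value
raw_text = "a,b,c\n1,2,3\n4,,6\n"

parsed = np.genfromtxt(io.StringIO(raw_text), delimiter=",", skip_header=1)
print(parsed)   # the missing field is read as nan

# filling_values replaces missing fields with a chosen value
filled = np.genfromtxt(io.StringIO(raw_text), delimiter=",",
                       skip_header=1, filling_values=0)
print(filled)
```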

### Writing an array to a file

**numpy.savetxt**(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ')

In [None]:
my_array = np.array([[1,2,3], [4,5,6]])
np.savetxt("test_out.tab", my_array, fmt="%d", delimiter="\t",
           header="# my sample\n#test", comments="")

In [None]:
!cat test_out.tab

<br>


[back to the toc](#toc)

<br>

## Accessing data in NumPy arrays <a id='5'></a>

### Indexing

* One-dimensional arrays are accessed the same way as standard lists in python
* Multi-dimensional arrays are accessed by indices along every axis. These indices are separated by commas and given in square brackets

In [None]:
my_array = np.arange(1., 10.).reshape((3,3))
print(my_array)

In [None]:
# accessing a single element in standard way
print(my_array[1][0])
# accessing a row  in the standard way
print("row:", my_array[1])

In NumPy, the indices for all axes are given in a single pair of square brackets, separated by commas.

In [None]:
# accessing a single element in numpy way
print(my_array[1,0])

### Slicing

Accessing a subset of an array.

In [None]:
print(my_array)

In [None]:
# accessing a row
print(my_array[1,:])

In [None]:
# accessing a column
print(my_array[:, 2])

In [None]:
# accessing a subset
print(my_array[1:,:2])

In [None]:
# accessing selected rows with a list of indices
print(my_array[[0,2],:])

The `start:stop:step` notation can also be used when accessing array elements

In [None]:
my_array = np.arange(1., 21.).reshape((4,5))
print(my_array)
# accessing a subset with step argument
print("subset:\n", my_array[:,0::2])

### Comparison operations

In [None]:
# comparison operators return array of boolean values
my_array = np.arange(1., 10.).reshape((3,3))
print("my_array:\n", my_array)
print("my_array > 5:\n", my_array > 5)

In [None]:
# evaluation of boolean arrays
results = my_array > 5
print("All values in my_array are greater than 5:", results.all()) 
print("There is at least one value in my_array greater than 5:",
      results.any()) 

In [None]:
# extracting values with boolean arrays
print("Values which are greater than 5:", my_array[my_array > 5])
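
Boolean arrays can also be combined. The following sketch (with an illustrative array) selects values inside a range; note that `&` and `|` are used instead of `and` and `or`, and that each comparison needs its own parentheses.

```python
import numpy as np

values = np.arange(1., 10.).reshape((3, 3))

# combine two conditions element-wise
in_range = (values > 2) & (values < 7)
print(values[in_range])

# True counts as 1, so the sum is the number of matching elements
print(in_range.sum())

# np.where builds a new array: keep values in range, replace the others by 0
print(np.where(in_range, values, 0))
```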


[back to the toc](#toc)

<br>

## Iterating through the numpy arrays <a id='6'></a>


In [None]:
# standard for loop iterates over rows
for x in my_array:
    print(x)

In [None]:
# iterating over all elements in standard way
for x in my_array:
    for y in x:
        print(y, " ",end="")
    print()

In [None]:
# iterating over all elements in numpy way
for x in np.nditer(my_array):
    print(x, " ", end="")
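
When the position of each element is needed along with its value, `np.ndenumerate` is one option. This is only a sketch on a small made-up array; for large arrays, explicit Python loops are usually much slower than whole-array operations.

```python
import numpy as np

small = np.arange(1., 5.).reshape((2, 2))

visited = []
for index, value in np.ndenumerate(small):
    # index is a tuple with one entry per axis
    print(index, value)
    visited.append(index)
```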

<br>


[back to the toc](#toc)

<br>

## Modifying array values <a id='7'></a>


### Assignment

In [None]:
# changing one element
my_array = np.arange(1,5).reshape((2,2))
print(my_array)
my_array[0,1] = 5
print("After changing one element:\n", my_array)

In [None]:
# changing rows and columns
my_array = np.arange(1,5).reshape((2,2))
print(my_array)
my_array[0,:] = np.array([5,6])
print("After changing one row:\n", my_array)
my_array[:,1] = np.array([7,8])
print("After changing one column:\n", my_array)

### Adding rows and columns to an existing array

In [None]:
# appending
my_array = np.arange(0,6).reshape((2,3))
print(my_array)
# append column
my_array = np.append(my_array, np.array([6, 7]).reshape(2,1), axis=1) # 0 : row , 1: column
print("after column adding\n", my_array)

In [None]:
# insertion
my_array = np.arange(0,6).reshape((2,3))
print(my_array)
# insert row
my_array = np.insert(my_array, 1, [[6, 7, 8]], axis=0)
print("after row insertion\n", my_array)


In [None]:
# concatenation
my_array = np.arange(0,6).reshape((2,3))
my_array2 = np.arange(6,12).reshape((2,3))
print("row concatenation:\n", np.concatenate((my_array, my_array2),
                                             axis=0))
print("column concatenation:\n", np.concatenate((my_array, my_array2),
                                                axis=1))


[back to the toc](#toc)

<br>

## To copy or not to copy? <a id='8'></a>
When operating on and manipulating arrays, their data is sometimes copied into a new array and sometimes not. This is often a source of confusion for beginners.

In [None]:
my_array = np.arange(1., 21.).reshape((4,5))
print(my_array)
tmp = my_array
print("tmp is my_array", tmp is my_array)
tmp[1,1] = 999
print("my array after we changed tmp:\n", my_array) 
# the change in tmp is present in my_array, because they are the same object

In [None]:
my_array = np.arange(1., 21.).reshape((4,5))
print(my_array)
tmp = my_array[1:3,1:3]
print("tmp is my_array", tmp is my_array)
print("tmp:\n", tmp)
tmp[1,1] = 999
print("my_array after we changed tmp:\n", my_array)
# tmp is not my_array, but the change in one is reflected in the other

> This behavior may appear strange, but it follows from the fact that NumPy arrays access their data by reference: basic slicing returns a view rather than a copy, so multiple arrays may access the same memory space, or a subset of it.
> See <insert Bulak slides on references>
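
One way to check whether two arrays share their data is `np.shares_memory`. The sketch below (array names are illustrative) also shows that indexing with a list of indices returns a copy, unlike basic slicing.

```python
import numpy as np

original = np.arange(1., 21.).reshape((4, 5))

view = original[1:3, 1:3]      # basic slicing: a view on the same data
picked = original[[1, 2], :]   # indexing with a list: a copy

print(np.shares_memory(original, view))
print(np.shares_memory(original, picked))

picked[0, 0] = -1              # modifying the copy ...
print(original[1, 0])          # ... leaves the original untouched
```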

### If you plan to change array values but want to keep the old array untouched, then make a copy of it!

In [None]:
my_array = np.arange(1., 21.).reshape((4,5))
print(my_array)
tmp = my_array.copy()
print("tmp is my_array", tmp is my_array)
tmp[1,1] = 999
print("my array after we changed tmp:\n", my_array)
print("tmp array after change:\n", tmp)


[back to the toc](#toc)

<br>

## Functions <a id='9'></a>

NumPy provides a great number of functions. Their strengths are speed and the possibility to apply them to arrays element-wise, column-wise or row-wise.

### Element-wise functions
This includes all basic math operators like <span style="font-size: 1.5em;">+, -, /, \*, //, \*\*</span>

In [None]:
# element-wise operation with scalars
my_array = np.arange(1., 7.).reshape((2,3))
print(my_array)
print("division")
print(my_array/2)
print("sum")
print(my_array + 10)
# element-wise functions
print("power of 2")
print(my_array**2)
print("log2")
print(np.log2(my_array))

### Sum, division ... of arrays

In [None]:
# element-wise operation with arrays
my_array = np.arange(1., 7.).reshape((2,3))
print(my_array)
print("division")
print(my_array / my_array)
print("sum")
print(my_array + my_array)

### Element-wise, row-wise, column-wise functions

In [None]:
# sum
print("sum of all elements:", np.sum(my_array))
print("sum of columns:", np.sum(my_array, axis=0))
print("sum of rows:", np.sum(my_array, axis=1))

In [None]:
# standard deviation
print("std of the whole array:", np.std(my_array))
print("std of columns:", np.std(my_array, axis=0))
print("std of rows:", np.std(my_array, axis=1))

In [None]:
# mean function
print("mean of the whole array:", np.mean(my_array))
print("mean of columns:", np.mean(my_array, axis=0))
print("mean of rows:", np.mean(my_array, axis=1))
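
A common use of row-wise statistics is to center each row on its mean. The sketch below assumes a small example array and relies on the `keepdims` argument so that the result can be subtracted from the original array.

```python
import numpy as np

data = np.arange(1., 7.).reshape((2, 3))

row_means = np.mean(data, axis=1)                    # shape (2,)
row_means_2d = np.mean(data, axis=1, keepdims=True)  # shape (2, 1)
print(row_means.shape, row_means_2d.shape)

# the (2, 1) result broadcasts against the (2, 3) array
centered = data - row_means_2d
print(centered)

# np.std computes the population standard deviation by default (ddof=0);
# ddof=1 gives the sample standard deviation
print(np.std(data, axis=1), np.std(data, axis=1, ddof=1))
```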


[back to the toc](#toc)

<br>

## Random numbers in NumPy <a id='10'></a>

The numpy.random module provides a large collection of distributions (uniform, normal, beta, binomial, gamma, poisson ...) to draw from.

In [None]:
# random numbers from uniform distribution [0,1)
print("single number:", np.random.rand())
print("array of random numbers:\n", np.random.rand(2,3))

In [None]:
# random numbers from normal distribution
my_array = np.random.randn(100000)
print("array of random numbers.\n", "mean :", np.mean(my_array) , "\tstandard deviation :" , np.std(my_array) )

In [None]:
# permutation and sampling
my_array = np.arange(7)
print("array permutation:\n", np.random.permutation(my_array))
print("sample from the array:\n", np.random.choice(my_array,
                                                   size=3,
                                                   replace=True))
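
Recent NumPy versions also offer a generator object, created with `np.random.default_rng`, as an alternative to the module-level functions used above. A minimal sketch, where the seed value is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2020)    # seeded generator; 2020 is arbitrary

uniform_draws = rng.random((2, 3))   # uniform on [0, 1)
normal_draws = rng.normal(loc=0.0, scale=1.0, size=5)
shuffled = rng.permutation(np.arange(7))
sample = rng.choice(np.arange(7), size=3, replace=False)  # without replacement
print(uniform_draws, normal_draws, shuffled, sample, sep="\n")

# a second generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(2020)
print(np.array_equal(uniform_draws, rng_again.random((2, 3))))
```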


[back to the toc](#toc)

<br>

## Linear algebra built-in capabilities <a id='11'></a>
NumPy arrays can be used as matrices without any special conversion.
For advanced linear algebra operations there is a dedicated package (**numpy.linalg**).

In [None]:
my_array = np.arange(1., 7.).reshape((2,3))
print(my_array)

In [None]:
# Transpose of a matrix
print("transposed matrix:\n", my_array.T)

#### There are different types of matrix multiplication!!!

In [None]:
# matrix multiplication element-wise
print(my_array * my_array)

In [None]:
# matrix product
print(my_array.dot(my_array.T)) 
# one can also use the @ operator :
print(my_array @ my_array.T) 
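
As a brief illustration of **numpy.linalg**, the sketch below solves a small linear system. The system itself is made up for the example.

```python
import numpy as np

# hypothetical system:  2x +  y = 5
#                        x + 3y = 10
coefficients = np.array([[2., 1.],
                         [1., 3.]])
constants = np.array([5., 10.])

solution = np.linalg.solve(coefficients, constants)
print(solution)

# determinant and inverse of the coefficient matrix
print(np.linalg.det(coefficients))
inverse = np.linalg.inv(coefficients)
print(coefficients @ inverse)   # close to the identity matrix
```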

<br>


[back to the toc](#toc)

<br>

# SciPy.stats and statistics in python <a id='12'></a>


SciPy refers both to a comprehensive [project for scientific Python programming](https://scipy.org) and to a [library](https://docs.scipy.org/doc/scipy/reference/) (which is part of the project) implementing various tools and algorithms for scientific software.

Here we will give a few pointers on the `scipy.stats` library, which provides ways to interact with various random distribution functions, and implements numerous statistical tests.



[back to the toc](#toc)

<br>

## Manipulation of random distributions <a id='13'></a>

scipy.stats implements utilities for a large number of continuous and discrete distributions:

In [None]:
from scipy import stats

dist_continu = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_continuous)]
dist_discrete = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_discrete)]
print('number of continuous distributions: %d' % len(dist_continu))
print('number of discrete distributions:   %d' % len(dist_discrete))


Let's experiment with the normal distribution, or `norm` in `scipy.stats`.


A look at `help(stats.norm)` tells us that 
```
 |  The location (``loc``) keyword specifies the mean.
 |  The scale (``scale``) keyword specifies the standard deviation.
```


In [None]:
## we can generate a specific normal distribution :
N = stats.norm(loc = 10 , scale = 2)

# the mean and variance of a distribution can be retrieved using the .stats method :
print(N.stats())

That object can then be used to interact with the distribution in many ways.


### Drawing some random numbers : rvs

In [None]:
# draw some random number in this distribution : rvs
# the size argument is 1 or several integers and defines the dimensions of the returned arrays of random numbers
N.rvs(size = [5,5]) 

As with any drawing of random variables on a computer, [one merely emulates randomness](https://en.wikipedia.org/wiki/Pseudorandom_number_generator). This also means that one can make a random operation reproducible by setting the random seed.



In [None]:
import numpy as np
np.random.seed(2020) # we set the random seed
draw1 = N.rvs(size=5)
np.random.seed(2020) # we set the random seed back to 2020
draw2 = N.rvs(size=5)
print("Are the random draws equal?",draw1 == draw2)
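
Instead of setting the global seed, the `rvs` method also accepts a `random_state` argument. A small sketch, with an arbitrary seed value:

```python
from scipy import stats

dist = stats.norm(loc=10, scale=2)

# the same random_state gives the same draws,
# without touching the global NumPy seed
first = dist.rvs(size=5, random_state=2020)
second = dist.rvs(size=5, random_state=2020)
print((first == second).all())
```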

### Looking up the quantiles and probability density functions


In [None]:
#  pdf: Probability Density Function
# I know this is not the plotting lesson, but here is a small recipe to plot the distribution
import matplotlib.pyplot as plt 
X = np.arange(0,20,0.1)
plt.plot( X , N.pdf(X) )
plt.show()

#    cdf: Cumulative Distribution Function
print('what is the probability of drawing a number <=15.0 ?' ,  N.cdf(15.0)) 


#    ppf: Percent Point Function (Inverse of CDF) , gives the quantiles of the distribution
P = [0.025,0.5,0.975]
Q = N.ppf(P)
print( 'quantiles:', P , '->' , Q )
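
To make the relation between `cdf` and `ppf` concrete, here is a short sketch on the same kind of normal distribution. The `sf` method (survival function, equal to 1 - cdf) is part of `scipy.stats` but was not introduced above.

```python
import numpy as np
from scipy import stats

dist = stats.norm(loc=10, scale=2)

probabilities = np.array([0.025, 0.5, 0.975])
quantiles = dist.ppf(probabilities)

# cdf undoes ppf: we get the probabilities back
print(dist.cdf(quantiles))

# probability of falling between two values,
# here within one standard deviation of the mean
print(dist.cdf(12) - dist.cdf(8))

# sf gives the probability of drawing a number > 15.0
print(dist.sf(15.0))
```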


In [None]:
# For discrete distributions these rules change a bit: the pdf function is replaced by pmf (Probability Mass Function)
X = np.arange(0,11) # a binomial with n = 10 can take the values 0 to 10
plt.scatter( X , stats.binom.pmf( X ,  n = 10 , p = 0.5 ) ) # binomial distribution with 10 draws and a 0.5 probability of success
plt.show()


[back to the toc](#toc)

<br>

## Statistical tests <a id='14'></a>

`scipy.stats` implements a number of statistical tests as functions.

Most return two values : the computed test statistic and the p-value.

We will only demonstrate a couple of tests here.
You can get a more in-depth explanation and demonstration of scipy.stats tests [here](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/)



In [None]:
# Imagine we have two samples of measurements drawn from 2 sub-populations:
sample1 = stats.norm.rvs(size = 93 , loc = 173 , scale = 20)
sample2 = stats.norm.rvs(size = 132 , loc = 181 , scale = 20)

# we perform a t-test to test the equality of the means
statistic , pValue = stats.ttest_ind(sample1 , sample2) 
significanceThreshold = 0.05
if pValue < significanceThreshold:
    print( "We reject the hypothesis of equality of means(H0). p-value :" , pValue )
else:
    print( "We do not reject the hypothesis of equality of means(H0). p-value :" , pValue )


`stats.ttest_ind` has an `equal_var` parameter that one can set to `False` in order to perform Welch's t-test, which is warranted when one cannot assume the variances of the two sub-populations to be equal.
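
The sketch below compares the two variants on simulated samples with clearly different spreads. All sample sizes, means, scales and seeds are arbitrary choices made for the illustration.

```python
from scipy import stats

# simulated samples with different standard deviations
group_a = stats.norm.rvs(size=93, loc=173, scale=10, random_state=1)
group_b = stats.norm.rvs(size=132, loc=181, scale=30, random_state=2)

student = stats.ttest_ind(group_a, group_b)                 # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print("Student's t-test p-value:", student.pvalue)
print("Welch's t-test p-value  :", welch.pvalue)
```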

> In general, these functions are very well documented, with details on the tests and usage examples. We heartily recommend that any would-be user read their `help()`.

In [None]:
# Example of the Chi-Squared Test
# imagine you count different cell types in two biopsies and report them in a list :
biopsy1 = np.array([135 , 423 , 24 , 72])
biopsy2 = [184 , 552 , 77 , 101]

table = [biopsy1 , biopsy2]

stat, pValue, degreeOfFreedom, expectedValues = stats.chi2_contingency(table)
print('Chi-square test of independence of variables')
print('stat=%.3f, degree of  freedom=%i , p-value=%.3f' % (stat, degreeOfFreedom,  pValue))
# here the two biopsies seem to differ significantly in their composition


### Statistical modelling and regression

`scipy` implements methods to fit a model to some data. 

`scipy.stats` offers a simple linear regression function between two variables,
while `scipy.optimize` implements functions to fit (non-linear) models to data.

In [None]:

x = stats.uniform.rvs(size=30 , loc=0 , scale=100) # generating X
y = 1.6*x  + stats.norm.rvs(size=30 , loc=0 , scale=50) # Y = 1.6 * X + some noise
    
# Perform the linear regression:
    
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("slope: %f    intercept: %f" % (slope, intercept))
print("R-squared: %f" % r_value**2)
print("p-value for the slope: %f" % p_value)

#   Plot the data along with the fitted line:
    
plt.plot(x, y, 'o', label='original data')
plt.plot(x, intercept + slope*x, 'r', label='fitted line')
plt.legend()
plt.show()
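
The value returned by `stats.linregress` can also be kept as a single object and queried by attribute name. A minimal sketch on noise-free, made-up data, which also uses the fitted line for a prediction:

```python
import numpy as np
from scipy import stats

x_values = np.arange(10.)
y_values = 2.0 * x_values + 1.0   # noise-free line, for illustration only

result = stats.linregress(x_values, y_values)
print(result.slope, result.intercept, result.rvalue)

# use the fitted line to predict y at a new x (12.0 is arbitrary)
predicted = result.intercept + result.slope * 12.0
print(predicted)
```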




In [None]:
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit


## example with 2 explanatory variables


# we define the model as a function.
# here it is a form of exponential decay model 
# where variable1 is the time and the rate of decay changes with variable2
# the model function takes as argument the 2 explanatory variable (grouped in a single tuple), and the 4 parameters
def func(X , a, b, c , d):
    x0 , x1 = X
    return a * np.exp( -(b + d*x1 ) * x0 ) + c 

# true parameter values used for the simulation
realParams = [ 5, 0.2, 0.5,  0.1 ]

# we simulate some data, with some noise
n=50 # number of points
variable1 = stats.uniform.rvs(size =n , loc = 0 , scale = 10 ) # explanatory variable number 1 : some uniform variable
variable2 = stats.bernoulli.rvs(size=n , p=0.5) # explanatory variable number 2 : can be 0 or 1

y = func( (variable1 , variable2) , realParams[0], realParams[1], realParams[2],  realParams[3])
y_noise = stats.norm.rvs(size=n , scale = 0.4) 
ydata = y+y_noise


popt, pcov = curve_fit(func, (variable1 , variable2), ydata)
perr = np.sqrt(np.diag(pcov))
print('parameter estimates          :',popt)
print('parameter standard deviation :',perr)
print('\nrelative estimation error    :',np.abs(popt - realParams)/realParams )


plt.scatter(variable1, ydata, c=variable2 , label='data')

x = np.linspace(min(variable1) , max(variable1) , 100)
plt.plot(x, func( ( x , np.zeros(100) )  , *popt), '--' ,  label='fit for variable2==0')
plt.plot(x, func( ( x , np.ones(100) )  , *popt), '--' , label='fit for variable2==1')

plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
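
Non-linear fits can be sensitive to their starting point. `curve_fit` accepts an initial guess (`p0`) and parameter bounds (`bounds`). The sketch below uses a simpler, hypothetical one-variable decay model and noise-free data to show both arguments.

```python
import numpy as np
from scipy.optimize import curve_fit

# hypothetical model: simple exponential decay
def decay(t, amplitude, rate):
    return amplitude * np.exp(-rate * t)

t_values = np.linspace(0, 5, 30)
observed = decay(t_values, 3.0, 0.7)   # noise-free data, for illustration only

estimates, covariance = curve_fit(
    decay, t_values, observed,
    p0=[1.0, 1.0],                 # initial guess for (amplitude, rate)
    bounds=([0, 0], [10, 5]),      # lower and upper limits for each parameter
)
print(estimates)
```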



> You can find most of what was discussed here and more in the [official tutorial](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html)


[back to the toc](#toc)

<br>

# Exercise <a id='15'></a>


1. Load the data in the file "sample_data.tsv" as a NumPy array
2. Log-transform the data
3. Find the row-wise means for the replicates of Sample1 and Sample2
4. Find the row-wise standard deviations in the same way as the means
5. Use the function *scipy.stats.ttest_ind* to calculate a p-value for every row
6. Select the p-values which are smaller than $10^{-2}$
7. Print how many p-values below $10^{-2}$ were found

In [None]:
# use of ttest_ind function
import scipy.stats as sps

# two arrays of random numbers
a = np.random.randn(3,5) * 3 + 15
b = np.random.randn(3,8) * 2 + 5

print(sps.ttest_ind(a, b, axis=1, equal_var=False).pvalue)

In [None]:
!head -3 data/sample_data.tsv

In [None]:
# %load -r 1-19 solutions/solution_81.py

In [None]:
# %load -r 20-26 solutions/solution_81.py

In [None]:
# %load -r 27-49 solutions/solution_81.py

In [None]:
# %load -r 50-54 solutions/solution_81.py

In [None]:
# %load -r 55-58 solutions/solution_81.py

In [None]:
# %load -r 59-62 solutions/solution_81.py

In [None]:
# %load -r 63- solutions/solution_81.py