[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shaneahmed/StatswithPython/blob/main/02-DescriptiveStatistics.ipynb) 

[![Open In Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/shaneahmed/StatswithPython/blob/main/02-DescriptiveStatistics.ipynb)

# Descriptive Statistics
In the [previous notebook](https://github.com/shaneahmed/StatswithPython/blob/main/01-Introduction%20to%20Python.ipynb) we discussed basic Python syntax, data structures and objects. In this notebook, we will learn to import Python modules and use them to perform descriptive analysis.

## Modules in Python
The function and variable definitions you made in the previous exercise are lost once that session ends. You may also want to reuse a handy function, such as one that calculates _the mean_, in several programs without copying its definition into each one. Python lets you put definitions in a file and use them in a script or in an interactive instance of the interpreter. Such a file is called a module; definitions from a module can be imported into other modules or into the main module.

A module is a file containing Python definitions and statements. The file name is the module name with the suffix `.py` appended. Within a module, the module's name (as a string) is available as the value of the global variable `__name__`.

In the previous notebook, you wrote a function to calculate Fibonacci numbers. Save that code in a file named `fibo.py` in the current directory; once the file is saved, you can import the module in this notebook.
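If you no longer have that function to hand, the following is a minimal sketch of what `fibo.py` could contain. It is modelled on the Fibonacci example in the official Python tutorial and is an assumption, not necessarily the exact code from the previous notebook; the helper `fib2` is included only for illustration.

```python
# fibo.py (hypothetical contents, modelled on the Python tutorial example)

def fib(n):
    """Print the Fibonacci numbers smaller than n on one line."""
    a, b = 0, 1
    while a < n:
        print(a, end=" ")
        a, b = b, a + b
    print()


def fib2(n):
    """Return the Fibonacci numbers smaller than n as a list."""
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)
        a, b = b, a + b
    return result
```

With this file saved next to the notebook, `fibo.fib(500)` prints the series and `fibo.fib2(500)` returns the same numbers as a list.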

In [1]:
import fibo

    You can run the fib function in fibo using `fibo.fib()` call

In [2]:
fibo.fib(500)

0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 


    You can also import modules with a different name

In [3]:
import fibo as fib

In [4]:
fib.fib(500)

0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 


    You can import functions within a module directly

In [5]:
from fibo import fib

In [6]:
fib(500)

0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 


### Statistics Module
Python has some built-in modules, such as `math` and `statistics`, for basic mathematical and statistical calculations. You can load them with Python's `import` statement.

In [7]:
import math
math.sqrt(4) # sqrt calculates the square root of a number

2.0

In [8]:
import statistics

In [9]:
statistics.mean([1, 2, 3, 4, 4])

2.8
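The `statistics` module covers more than the mean. As a small sketch using the same list, the standard-library functions `statistics.median` and `statistics.mode` give the other two measures of central tendency discussed later in this notebook. The variable names here are chosen only for illustration.

```python
import statistics

scores = [1, 2, 3, 4, 4]

mean_value = statistics.mean(scores)      # sum of the scores divided by their count
median_value = statistics.median(scores)  # middle value of the sorted scores
mode_value = statistics.mode(scores)      # most frequent score

print(mean_value, median_value, mode_value)
```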

### Installing Module in Python
For more sophisticated functions you can install modules such as "numpy", "scipy" and "pandas" using the `pip` command, or `conda` if you are in an Anaconda environment. Let's install "numpy", "scipy" and "pandas". In a notebook, a leading "!" runs the rest of the line as a terminal command.

In [10]:
!pip install numpy scipy pandas



    After installation we can import these modules directly. The examples below show mean calculation using the numpy and pandas libraries. scipy is deprecating its own support for mean calculation and proposes moving to the numpy mean calculation.

In [11]:
import numpy as np # numpy (numerical python) is usually imported as np in python codes
import pandas as pd # pandas (Python Data Analysis Library) is usually imported in python as pd
import scipy.stats as stats # SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy 
                            # extension of Python. We are importing stats package from scipy here.

In [12]:
np.mean([1, 2, 3, 4, 4])

2.8

In [13]:
import pandas as pd # pandas (Python Data Analysis Library) is usually imported in python as pd

In [14]:
d = pd.DataFrame([1, 2, 3, 4, 4]) # pandas stores data in a data frame; we will learn about data frames in the following notebooks

In [15]:
d.to_numpy()

array([[1],
       [2],
       [3],
       [4],
       [4]], dtype=int64)

In [16]:
d.mean()[0]

2.8

## Frequency Distribution
Let's consider the data from the lecture slides.

In [17]:
data = [15, 8, 20, 16, 12, 18, 14, 22, 17, 5,
19, 15, 18, 29, 6, 13, 16, 19, 10, 24,
15, 3, 26, 30, 13, 17, 7, 16, 23, 25,
1, 15, 18, 14, 5, 27, 16, 20, 14, 6,
24, 14, 20, 25, 21, 15, 17, 8, 23, 21,
17, 14, 10, 13, 18, 16, 21, 9, 11, 22,
15, 12, 9, 16, 20, 11, 13, 22, 17, 13,
9, 22, 16, 12, 19, 17, 14, 10, 19, 18,
11, 16, 12, 18, 13, 17, 15, 14, 15, 28]

In [18]:
print(data)

[15, 8, 20, 16, 12, 18, 14, 22, 17, 5, 19, 15, 18, 29, 6, 13, 16, 19, 10, 24, 15, 3, 26, 30, 13, 17, 7, 16, 23, 25, 1, 15, 18, 14, 5, 27, 16, 20, 14, 6, 24, 14, 20, 25, 21, 15, 17, 8, 23, 21, 17, 14, 10, 13, 18, 16, 21, 9, 11, 22, 15, 12, 9, 16, 20, 11, 13, 22, 17, 13, 9, 22, 16, 12, 19, 17, 14, 10, 19, 18, 11, 16, 12, 18, 13, 17, 15, 14, 15, 28]


In [19]:
np.bincount(data)[::-1] # Frequencies of the scores 30 down to 0. Is the frequency distribution the same as in the slides?

array([1, 1, 1, 1, 1, 2, 2, 2, 4, 3, 4, 4, 6, 7, 8, 8, 7, 6, 4, 3, 3, 3,
       2, 1, 2, 2, 0, 1, 0, 1, 0], dtype=int64)

In [20]:
np.unique(data) # Identifies unique values in the data with f>0

array([ 1,  3,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])
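The two calls above can be combined into a single frequency table. The sketch below uses `np.unique` with `return_counts=True`, which returns each distinct value together with its frequency. To keep the example short it uses only the first 20 scores of the data; the names `sample` and `frequency` are chosen for illustration.

```python
import numpy as np

# First 20 scores of the lecture data, used here as a short sample
sample = [15, 8, 20, 16, 12, 18, 14, 22, 17, 5,
          19, 15, 18, 29, 6, 13, 16, 19, 10, 24]

# values holds the distinct scores (sorted), counts holds how often each occurs
values, counts = np.unique(sample, return_counts=True)
frequency = {int(v): int(c) for v, c in zip(values, counts)}

# Print the table from the highest score to the lowest
for score in sorted(frequency, reverse=True):
    print(score, frequency[score])
```

Unlike `np.bincount`, this approach lists only the scores that actually occur, so there are no rows with a frequency of zero.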

## The Mean
As shown above, the mean can be calculated using the `np.mean()` function.

In [21]:
np.mean([5, 8, 10, 11, 12])

9.2
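To see what `np.mean()` does, the same result can be reproduced by hand: add up the scores and divide by how many there are. This is a plain-Python sketch with illustrative variable names.

```python
scores = [5, 8, 10, 11, 12]

total = sum(scores)        # sum of all scores
n = len(scores)            # number of scores
manual_mean = total / n    # mean = sum of scores / number of scores

print(total, n, manual_mean)
```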

    You can calculate the mean of multiple columns using the same function.

In [22]:
data = pd.DataFrame([[15, 8, 20, 16, 12, 18, 14, 22, 17, 5],
[19, 15, 18, 29, 6, 13, 16, 19, 10, 24]])

data.to_numpy() # print the values

array([[15,  8, 20, 16, 12, 18, 14, 22, 17,  5],
       [19, 15, 18, 29,  6, 13, 16, 19, 10, 24]], dtype=int64)

In [23]:
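np.mean(data, axis=0).to_numpy() # axis=0 averages down each column; stating it explicitly avoids version-dependent behaviour in pandas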

array([17. , 11.5, 19. , 22.5,  9. , 15.5, 15. , 20.5, 13.5, 14.5])

    As the data is in a pandas data frame, you can call its `mean()` method directly.

In [24]:
data.mean()

0    17.0
1    11.5
2    19.0
3    22.5
4     9.0
5    15.5
6    15.0
7    20.5
8    13.5
9    14.5
dtype: float64

### Weighted Mean
The weighted mean can be calculated using the `np.average()` function. In the example below, each school's mean is weighted by its number of scores.

In [25]:
School_A = [60, 40]
School_B = [80, 70, 60]
School_C = [75, 60, 60]

In [26]:
mean_data = [np.mean(School_A), np.mean(School_B), np.mean(School_C)]

In [27]:
np.average(mean_data, axis=0, weights=[2,3,3])

63.125
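The sketch below spells out the arithmetic behind `np.average()`, assuming the weights `[2, 3, 3]` are the number of scores in each school. It also checks that weighting the school means by their sizes gives the same answer as pooling all the scores into one list. Lower-case variable names are used here for illustration.

```python
import numpy as np

school_a = [60, 40]
school_b = [80, 70, 60]
school_c = [75, 60, 60]
schools = [school_a, school_b, school_c]

means = [np.mean(s) for s in schools]  # mean of each school
sizes = [len(s) for s in schools]      # number of scores in each school

# Weighted mean = sum(weight * mean) / sum(weights)
weighted_mean = sum(m * w for m, w in zip(means, sizes)) / sum(sizes)

# Mean of all scores pooled together
pooled_mean = np.mean(school_a + school_b + school_c)

print(weighted_mean, pooled_mean)
```

An unweighted mean of the three school means would treat the small school the same as the larger ones, which is why the weights matter.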

## The Median
You can calculate the median using `np.median()` or `pd.DataFrame.median()`.

In [28]:
data_odd_n = [40, 1, 4, 42, 6, 8, 43, 45, 47]
data_even_n = [3, 9, 15, 16, 19, 22]

In [29]:
np.median(data_odd_n)

40.0

In [30]:
np.median(data_even_n)

15.5

In [31]:
data_odd_n = pd.DataFrame([40, 1, 4, 42, 6, 8, 43, 45, 47])
data_even_n = pd.DataFrame([3, 9, 15, 16, 19, 22])

In [32]:
data_odd_n.median().to_numpy()

array([40.])

In [33]:
data_even_n.median().to_numpy()

array([15.5])
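The difference between the odd and even cases is easier to see when the median is computed by hand. The function below is a minimal sketch (the name `manual_median` is hypothetical): it sorts the values, then takes the middle value for an odd count or the average of the two middle values for an even count.

```python
def manual_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return float(ordered[middle])
    # Even count: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_result = manual_median([40, 1, 4, 42, 6, 8, 43, 45, 47])
even_result = manual_median([3, 9, 15, 16, 19, 22])
print(odd_result, even_result)
```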

## The Mode
The mode can be calculated using `scipy.stats.mode` or `pd.DataFrame.mode()`.

In [34]:
data = [100, 101, 105, 105, 107, 108]

In [35]:
stats.mode(data)

ModeResult(mode=array([105]), count=array([2]))

In [36]:
data = pd.DataFrame(data)

In [37]:
data.mode().to_numpy()

array([[105]], dtype=int64)

In [38]:
data = [100, 101, 105, 106, 107, 108] # If all the values have equal frequency, every value is returned as a mode

In [39]:
data = pd.DataFrame(data)

In [40]:
data.mode().to_numpy()

array([[100],
       [101],
       [105],
       [106],
       [107],
       [108]], dtype=int64)
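The standard library offers a similar result without pandas. As a sketch, `statistics.multimode` (available from Python 3.8) returns a list of every value that shares the highest frequency, so it handles both of the cases above. The variable names are chosen for illustration.

```python
import statistics

# One value occurs more often than the rest
single_mode = statistics.multimode([100, 101, 105, 105, 107, 108])

# Every value occurs once, so all of them are returned
all_tied = statistics.multimode([100, 101, 105, 106, 107, 108])

print(single_mode)
print(all_tied)
```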

## Range
Range = X<sub>H</sub> - X<sub>L</sub>

X<sub>H</sub> = Highest score in the distribution

X<sub>L</sub> = Lowest score in the distribution

Note: This is different from Python's built-in [range function](https://docs.python.org/3/library/functions.html#func-range).

In [41]:
data = [17, 44, 50, 23, 42]

In [42]:
np.max(data)-np.min(data) # np.max() calculates XH whereas np.min calculates XL

33

We can also use the interquartile range function `iqr` in ``scipy.stats`` by setting ``rng`` to span 0 to 100%.

In [43]:
stats.iqr(data, rng=[0, 100]) 

33.0

## The Interquartile Range
The same function calculates the interquartile range when `rng` is left at its default value of `(25, 75)`.

In [44]:
stats.iqr(data) 

21.0

In [45]:
data = pd.DataFrame([17, 44, 50, 23, 42])

In [46]:
Q1, Q3 = data.quantile([0.25, 0.75], axis=0).to_numpy()
Q3-Q1

array([21.])

## Semi-interquartile range
SIQR = IQR/2

In [47]:
SIQR = (Q3-Q1)/2
print(SIQR)

[10.5]
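As a cross-check, the three spread measures in this notebook can be reproduced with NumPy alone. The sketch below uses `np.percentile` with its default linear interpolation; other quartile conventions exist and can give slightly different values for Q1 and Q3 on small samples. The variable names are chosen for illustration.

```python
import numpy as np

scores = [17, 44, 50, 23, 42]

# Range: highest score minus lowest score
score_range = max(scores) - min(scores)

# Q1 and Q3 are the 25th and 75th percentiles
q1, q3 = np.percentile(scores, [25, 75])

iqr = q3 - q1    # interquartile range
siqr = iqr / 2   # semi-interquartile range

print(score_range, q1, q3, iqr, siqr)
```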
