<h1 style='text-align: center;'>How to Use the Python Descripstats Package</h1>
<h3 style='text-align: center;'>Shouke Wei, Ph.D. Professor<sup>1,2</sup></h3>
<h5 style='text-align: center;'><sup>1</sup> Deepsim Intelligence Technology Inc, BC V2T0G9, Abbotsford, Canada</h5>
<h5 style='text-align: center;'><sup>2</sup> Deepsim Academy, BC V2T0G9, Abbotsford, Canada</h5>
<h5 style='text-align: center;'>Email: shouke.wei@gmail.com</h5>

## 1. Brief introduction of the package

For numeric data, the `describe()` method of Pandas provides a convenient way to generate a summary table of descriptive statistics. However, the index of the result includes only `count`, `mean`, `std`, `min`, `max`, and the lower, `50%` and upper percentiles. By default, the lower percentile is `25%`, the upper percentile is `75%`, and the `50%` percentile is the median.
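The percentiles reported by `describe()` can be changed through its `percentiles` argument. The following is a minimal sketch on made-up data; the column name `x` and the values are illustrative only and are not taken from the example data set used later.

```python
import pandas as pd

# Illustrative data: the integers 1..100 in a single column
demo = pd.DataFrame({"x": range(1, 101)})

# Request the 10th, 50th and 90th percentiles instead of the default 25/50/75
summary = demo.describe(percentiles=[0.1, 0.5, 0.9])

# The index now lists the requested percentiles between `min` and `max`
print(summary.index.tolist())
```

The median is passed explicitly here so that the `50%` row is present regardless of how a given Pandas version treats an omitted median.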

In many cases, such as writing a scientific or data analysis report or a journal paper, we need more statistics than these defaults, for example mean absolute deviation (`mad`), `variance`, standard error of the mean (`sem`), `sum`, `skewness` and `kurtosis`. Pandas provides methods to calculate them, but we have to write extra code to add them to the summary table returned by `describe()`.

This small package adds more descriptive statistic measures to the default `describe()` of Pandas. The added measures are:
   - mad: mean absolute deviation
   - variance: variance
   - sem: standard error of the mean
   - sum: sum
   - skewness: skewness
   - kurtosis: kurtosis
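To illustrate the kind of code the package saves you from writing, here is a minimal sketch that appends the six measures to the output of `describe()` using plain Pandas. The function name `describe_extended` and the demo data are hypothetical, and this is not the package's actual implementation; for instance, the rows are appended at the end rather than inserted after `mean`. Because `DataFrame.mad()` was removed in Pandas 2.0, the mean absolute deviation is computed here from its definition.

```python
import pandas as pd


def describe_extended(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: describe() plus six extra rows (a sketch only)."""
    num = df.select_dtypes(include="number")  # keep numeric columns only
    base = num.describe()
    extra = pd.DataFrame(
        {
            # mean absolute deviation around the mean
            "mad": (num - num.mean()).abs().mean(),
            "variance": num.var(),   # sample variance (ddof=1)
            "sem": num.sem(),        # standard error of the mean
            "sum": num.sum(),
            "skewness": num.skew(),
            "kurtosis": num.kurt(),  # excess kurtosis
        }
    ).T  # transpose so that each measure becomes a row
    return pd.concat([base, extra])


# Illustrative data, not from the example data set
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})
table = describe_extended(demo)
print(table)
```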

### Method:
`Describe(data)`  
   - **Parameters**:   
      - data: the input data, as a NumPy array or a Pandas DataFrame  
   - **Return**:  
       - stats: the descriptive statistics
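When the input is a NumPy array, the columns have no names. How the package labels them is not shown in this document, so the following sketch simply assumes you prefer to name the columns yourself by wrapping the array in a DataFrame first. The column names `height` and `weight` and the values are hypothetical, and the plain Pandas `describe()` is used as a stand-in so that the block runs without the package installed.

```python
import numpy as np
import pandas as pd

# Illustrative 3x2 array: rows are observations, columns are variables
arr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Wrap the array in a DataFrame so each column gets a readable label
frame = pd.DataFrame(arr, columns=["height", "weight"])  # hypothetical names

# Stand-in for the package call; a labeled DataFrame is accepted either way
summary = frame.describe()
print(summary)
```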

## 2. Example

### (1) Import the packages
You can import the package with:
```python
from descripstats import Describe
```
and then call `Describe()` directly. Alternatively, import the package under an alias:
```python
import descripstats as ds
```
and then call `ds.Describe()`.

We use the second method in this example as follows:

```python
import pandas as pd
import descripstats as ds
```

### (2) Read the data into a Pandas DataFrame

```python
url = 'https://raw.githubusercontent.com/Sid-149/Life-Expectancy-Predictor-Comparative-Analysis/main/Notebooks/Life%20Expectancy%20Data.csv'
df = pd.read_csv(url, index_col=False)

# display the first rows
df.head()
```

| | Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.25921 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.47 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |


### (3) Display the default descriptive statistic measures of Pandas
First, let's use the `describe()` function of Pandas so that you can clearly see which measures this package adds later.

```python
df.describe()
```

| | Year | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2938.0 | 2928.0 | 2928.0 | 2938.0 | 2744.0 | 2938.0 | 2385.0 | 2938.0 | 2904.0 | 2938.0 | 2919.0 | 2712.0 | 2919.0 | 2938.0 | 2490.0 | 2286.0 | 2904.0 | 2904.0 | 2771.0 | 2775.0 |
| mean | 2007.51872 | 69.224932 | 164.796448 | 30.303948 | 4.602861 | 738.251295 | 80.940461 | 2419.59224 | 38.321247 | 42.035739 | 82.550188 | 5.93819 | 82.324084 | 1.742103 | 7483.158469 | 12753380.0 | 4.839704 | 4.870317 | 0.627551 | 11.992793 |
| std | 4.613841 | 9.523867 | 124.292079 | 117.926501 | 4.052413 | 1987.914858 | 25.070016 | 11467.272489 | 20.044034 | 160.445548 | 23.428046 | 2.49832 | 23.716912 | 5.077785 | 14270.169342 | 61012100.0 | 4.420195 | 4.508882 | 0.210904 | 3.35892 |
| min | 2000.0 | 36.3 | 1.0 | 0.0 | 0.01 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.37 | 2.0 | 0.1 | 1.68135 | 34.0 | 0.1 | 0.1 | 0.0 | 0.0 |
| 25% | 2004.0 | 63.1 | 74.0 | 0.0 | 0.8775 | 4.685343 | 77.0 | 0.0 | 19.3 | 0.0 | 78.0 | 4.26 | 78.0 | 0.1 | 463.935626 | 195793.2 | 1.6 | 1.5 | 0.493 | 10.1 |
| 50% | 2008.0 | 72.1 | 144.0 | 3.0 | 3.755 | 64.912906 | 92.0 | 17.0 | 43.5 | 4.0 | 93.0 | 5.755 | 93.0 | 0.1 | 1766.947595 | 1386542.0 | 3.3 | 3.3 | 0.677 | 12.3 |
| 75% | 2012.0 | 75.7 | 228.0 | 22.0 | 7.7025 | 441.534144 | 97.0 | 360.25 | 56.2 | 28.0 | 97.0 | 7.4925 | 97.0 | 0.8 | 5910.806335 | 7420359.0 | 7.2 | 7.2 | 0.779 | 14.3 |
| max | 2015.0 | 89.0 | 723.0 | 1800.0 | 17.87 | 19479.91161 | 99.0 | 212183.0 | 87.3 | 2500.0 | 99.0 | 17.6 | 99.0 | 50.6 | 119172.7418 | 1293859000.0 | 27.7 | 28.6 | 0.948 | 20.7 |
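Note that the text columns `Country` and `Status` do not appear in this summary: for a DataFrame with mixed column types, `describe()` summarizes only the numeric columns by default. The following minimal sketch shows this on a small made-up frame (the column names and values are illustrative only) and how `include="all"` brings the text columns back.

```python
import pandas as pd

# Illustrative frame with one text column and two numeric columns
demo = pd.DataFrame(
    {
        "country": ["A", "A", "B"],
        "year": [2014, 2015, 2015],
        "value": [1.5, 2.5, 3.5],
    }
)

numeric_only = demo.describe()             # default: numeric columns only
everything = demo.describe(include="all")  # also summarizes text columns

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
```

With `include="all"`, rows such as `unique`, `top` and `freq` are added for the text columns, and cells that do not apply to a column are filled with `NaN`.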


### (4) Descriptive statistic measures added by this package
Now, let's use the function of the package.

```python
ds.Describe(df)
```

| | Year | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2938.0 | 2928.0 | 2928.0 | 2938.0 | 2744.0 | 2938.0 | 2385.0 | 2938.0 | 2904.0 | 2938.0 | 2919.0 | 2712.0 | 2919.0 | 2938.0 | 2490.0 | 2286.0 | 2904.0 | 2904.0 | 2771.0 | 2775.0 |
| mean | 2007.519 | 69.224932 | 164.796448 | 30.303948 | 4.602861 | 738.2513 | 80.940461 | 2419.592 | 38.321247 | 42.035739 | 82.550188 | 5.93819 | 82.324084 | 1.742103 | 7483.158 | 12753380.0 | 4.839704 | 4.870317 | 0.627551 | 11.992793 |
| mad | 4.005042 | 7.791541 | 95.360086 | 40.69157 | 3.492952 | 1016.691 | 18.039704 | 3923.784 | 18.084219 | 57.108057 | 16.582604 | 1.953667 | 16.746393 | 2.451941 | 8850.176 | 17416290.0 | 3.425156 | 3.456303 | 0.164336 | 2.593734 |
| variance | 21.28753 | 90.704052 | 15448.520903 | 13906.659712 | 16.422048 | 3951805.0 | 628.505682 | 131498300.0 | 401.763279 | 25742.774003 | 548.873337 | 6.241601 | 562.491918 | 25.783896 | 203637700.0 | 3722476000000000.0 | 19.538123 | 20.330018 | 0.04448 | 11.282342 |
| sem | 0.085121 | 0.176006 | 2.296984 | 2.175632 | 0.077361 | 36.67515 | 0.513346 | 211.5603 | 0.371952 | 2.960069 | 0.43363 | 0.047974 | 0.438976 | 0.09368 | 285.9759 | 1276080.0 | 0.082024 | 0.08367 | 0.004007 | 0.063763 |
| std | 4.613841 | 9.523867 | 124.292079 | 117.926501 | 4.052413 | 1987.915 | 25.070016 | 11467.27 | 20.044034 | 160.445548 | 23.428046 | 2.49832 | 23.716912 | 5.077785 | 14270.17 | 61012100.0 | 4.420195 | 4.508882 | 0.210904 | 3.35892 |
| min | 2000.0 | 36.3 | 1.0 | 0.0 | 0.01 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.37 | 2.0 | 0.1 | 1.68135 | 34.0 | 0.1 | 0.1 | 0.0 | 0.0 |
| 25% | 2004.0 | 63.1 | 74.0 | 0.0 | 0.8775 | 4.685343 | 77.0 | 0.0 | 19.3 | 0.0 | 78.0 | 4.26 | 78.0 | 0.1 | 463.9356 | 195793.2 | 1.6 | 1.5 | 0.493 | 10.1 |
| 50% | 2008.0 | 72.1 | 144.0 | 3.0 | 3.755 | 64.91291 | 92.0 | 17.0 | 43.5 | 4.0 | 93.0 | 5.755 | 93.0 | 0.1 | 1766.948 | 1386542.0 | 3.3 | 3.3 | 0.677 | 12.3 |
| 75% | 2012.0 | 75.7 | 228.0 | 22.0 | 7.7025 | 441.5341 | 97.0 | 360.25 | 56.2 | 28.0 | 97.0 | 7.4925 | 97.0 | 0.8 | 5910.806 | 7420359.0 | 7.2 | 7.2 | 0.779 | 14.3 |


### (5) Remove some of them
You can remove one or more measures that you do not want as follows.

```python
stats = ds.Describe(df)
```

#### (i) remove one
For example, suppose you do not want to include `mad` (mean absolute deviation) in the summary table:

```python
stats.drop('mad')
```

| | Year | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2938.0 | 2928.0 | 2928.0 | 2938.0 | 2744.0 | 2938.0 | 2385.0 | 2938.0 | 2904.0 | 2938.0 | 2919.0 | 2712.0 | 2919.0 | 2938.0 | 2490.0 | 2286.0 | 2904.0 | 2904.0 | 2771.0 | 2775.0 |
| mean | 2007.519 | 69.224932 | 164.796448 | 30.303948 | 4.602861 | 738.2513 | 80.940461 | 2419.592 | 38.321247 | 42.035739 | 82.550188 | 5.93819 | 82.324084 | 1.742103 | 7483.158 | 12753380.0 | 4.839704 | 4.870317 | 0.627551 | 11.992793 |
| variance | 21.28753 | 90.704052 | 15448.520903 | 13906.659712 | 16.422048 | 3951805.0 | 628.505682 | 131498300.0 | 401.763279 | 25742.774003 | 548.873337 | 6.241601 | 562.491918 | 25.783896 | 203637700.0 | 3722476000000000.0 | 19.538123 | 20.330018 | 0.04448 | 11.282342 |
| sem | 0.085121 | 0.176006 | 2.296984 | 2.175632 | 0.077361 | 36.67515 | 0.513346 | 211.5603 | 0.371952 | 2.960069 | 0.43363 | 0.047974 | 0.438976 | 0.09368 | 285.9759 | 1276080.0 | 0.082024 | 0.08367 | 0.004007 | 0.063763 |
| std | 4.613841 | 9.523867 | 124.292079 | 117.926501 | 4.052413 | 1987.915 | 25.070016 | 11467.27 | 20.044034 | 160.445548 | 23.428046 | 2.49832 | 23.716912 | 5.077785 | 14270.17 | 61012100.0 | 4.420195 | 4.508882 | 0.210904 | 3.35892 |
| min | 2000.0 | 36.3 | 1.0 | 0.0 | 0.01 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.37 | 2.0 | 0.1 | 1.68135 | 34.0 | 0.1 | 0.1 | 0.0 | 0.0 |
| 25% | 2004.0 | 63.1 | 74.0 | 0.0 | 0.8775 | 4.685343 | 77.0 | 0.0 | 19.3 | 0.0 | 78.0 | 4.26 | 78.0 | 0.1 | 463.9356 | 195793.2 | 1.6 | 1.5 | 0.493 | 10.1 |
| 50% | 2008.0 | 72.1 | 144.0 | 3.0 | 3.755 | 64.91291 | 92.0 | 17.0 | 43.5 | 4.0 | 93.0 | 5.755 | 93.0 | 0.1 | 1766.948 | 1386542.0 | 3.3 | 3.3 | 0.677 | 12.3 |
| 75% | 2012.0 | 75.7 | 228.0 | 22.0 | 7.7025 | 441.5341 | 97.0 | 360.25 | 56.2 | 28.0 | 97.0 | 7.4925 | 97.0 | 0.8 | 5910.806 | 7420359.0 | 7.2 | 7.2 | 0.779 | 14.3 |
| max | 2015.0 | 89.0 | 723.0 | 1800.0 | 17.87 | 19479.91 | 99.0 | 212183.0 | 87.3 | 2500.0 | 99.0 | 17.6 | 99.0 | 50.6 | 119172.7 | 1293859000.0 | 27.7 | 28.6 | 0.948 | 20.7 |


#### (ii) remove more
For example, suppose you want to remove `mad`, `variance` and `sem`. By default, `drop()` uses `inplace=False`, which returns a new table and leaves `stats` itself unchanged, as displaying `stats` again shows:

```python
stats
```

| | Year | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2938.0 | 2928.0 | 2928.0 | 2938.0 | 2744.0 | 2938.0 | 2385.0 | 2938.0 | 2904.0 | 2938.0 | 2919.0 | 2712.0 | 2919.0 | 2938.0 | 2490.0 | 2286.0 | 2904.0 | 2904.0 | 2771.0 | 2775.0 |
| mean | 2007.519 | 69.224932 | 164.796448 | 30.303948 | 4.602861 | 738.2513 | 80.940461 | 2419.592 | 38.321247 | 42.035739 | 82.550188 | 5.93819 | 82.324084 | 1.742103 | 7483.158 | 12753380.0 | 4.839704 | 4.870317 | 0.627551 | 11.992793 |
| mad | 4.005042 | 7.791541 | 95.360086 | 40.69157 | 3.492952 | 1016.691 | 18.039704 | 3923.784 | 18.084219 | 57.108057 | 16.582604 | 1.953667 | 16.746393 | 2.451941 | 8850.176 | 17416290.0 | 3.425156 | 3.456303 | 0.164336 | 2.593734 |
| variance | 21.28753 | 90.704052 | 15448.520903 | 13906.659712 | 16.422048 | 3951805.0 | 628.505682 | 131498300.0 | 401.763279 | 25742.774003 | 548.873337 | 6.241601 | 562.491918 | 25.783896 | 203637700.0 | 3722476000000000.0 | 19.538123 | 20.330018 | 0.04448 | 11.282342 |
| sem | 0.085121 | 0.176006 | 2.296984 | 2.175632 | 0.077361 | 36.67515 | 0.513346 | 211.5603 | 0.371952 | 2.960069 | 0.43363 | 0.047974 | 0.438976 | 0.09368 | 285.9759 | 1276080.0 | 0.082024 | 0.08367 | 0.004007 | 0.063763 |
| std | 4.613841 | 9.523867 | 124.292079 | 117.926501 | 4.052413 | 1987.915 | 25.070016 | 11467.27 | 20.044034 | 160.445548 | 23.428046 | 2.49832 | 23.716912 | 5.077785 | 14270.17 | 61012100.0 | 4.420195 | 4.508882 | 0.210904 | 3.35892 |
| min | 2000.0 | 36.3 | 1.0 | 0.0 | 0.01 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.37 | 2.0 | 0.1 | 1.68135 | 34.0 | 0.1 | 0.1 | 0.0 | 0.0 |
| 25% | 2004.0 | 63.1 | 74.0 | 0.0 | 0.8775 | 4.685343 | 77.0 | 0.0 | 19.3 | 0.0 | 78.0 | 4.26 | 78.0 | 0.1 | 463.9356 | 195793.2 | 1.6 | 1.5 | 0.493 | 10.1 |
| 50% | 2008.0 | 72.1 | 144.0 | 3.0 | 3.755 | 64.91291 | 92.0 | 17.0 | 43.5 | 4.0 | 93.0 | 5.755 | 93.0 | 0.1 | 1766.948 | 1386542.0 | 3.3 | 3.3 | 0.677 | 12.3 |
| 75% | 2012.0 | 75.7 | 228.0 | 22.0 | 7.7025 | 441.5341 | 97.0 | 360.25 | 56.2 | 28.0 | 97.0 | 7.4925 | 97.0 | 0.8 | 5910.806 | 7420359.0 | 7.2 | 7.2 | 0.779 | 14.3 |


As you can see, `mad` is still there. To change the table itself, use `inplace=True`.

```python
stats.drop(['mad', 'variance', 'sem'], inplace=True)
stats
```

| | Year | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2938.0 | 2928.0 | 2928.0 | 2938.0 | 2744.0 | 2938.0 | 2385.0 | 2938.0 | 2904.0 | 2938.0 | 2919.0 | 2712.0 | 2919.0 | 2938.0 | 2490.0 | 2286.0 | 2904.0 | 2904.0 | 2771.0 | 2775.0 |
| mean | 2007.519 | 69.224932 | 164.796448 | 30.303948 | 4.602861 | 738.2513 | 80.940461 | 2419.592 | 38.321247 | 42.035739 | 82.550188 | 5.93819 | 82.324084 | 1.742103 | 7483.158 | 12753380.0 | 4.839704 | 4.870317 | 0.627551 | 11.992793 |
| std | 4.613841 | 9.523867 | 124.292079 | 117.926501 | 4.052413 | 1987.915 | 25.070016 | 11467.27 | 20.044034 | 160.445548 | 23.428046 | 2.49832 | 23.716912 | 5.077785 | 14270.17 | 61012100.0 | 4.420195 | 4.508882 | 0.210904 | 3.35892 |
| min | 2000.0 | 36.3 | 1.0 | 0.0 | 0.01 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.37 | 2.0 | 0.1 | 1.68135 | 34.0 | 0.1 | 0.1 | 0.0 | 0.0 |
| 25% | 2004.0 | 63.1 | 74.0 | 0.0 | 0.8775 | 4.685343 | 77.0 | 0.0 | 19.3 | 0.0 | 78.0 | 4.26 | 78.0 | 0.1 | 463.9356 | 195793.2 | 1.6 | 1.5 | 0.493 | 10.1 |
| 50% | 2008.0 | 72.1 | 144.0 | 3.0 | 3.755 | 64.91291 | 92.0 | 17.0 | 43.5 | 4.0 | 93.0 | 5.755 | 93.0 | 0.1 | 1766.948 | 1386542.0 | 3.3 | 3.3 | 0.677 | 12.3 |
| 75% | 2012.0 | 75.7 | 228.0 | 22.0 | 7.7025 | 441.5341 | 97.0 | 360.25 | 56.2 | 28.0 | 97.0 | 7.4925 | 97.0 | 0.8 | 5910.806 | 7420359.0 | 7.2 | 7.2 | 0.779 | 14.3 |
| max | 2015.0 | 89.0 | 723.0 | 1800.0 | 17.87 | 19479.91 | 99.0 | 212183.0 | 87.3 | 2500.0 | 99.0 | 17.6 | 99.0 | 50.6 | 119172.7 | 1293859000.0 | 27.7 | 28.6 | 0.948 | 20.7 |
| sum | 5898090.0 | 202690.6 | 482524.0 | 89033.0 | 12630.25 | 2168982.0 | 193043.0 | 7108762.0 | 111284.9 | 123501.0 | 240964.0 | 16104.37 | 240304.0 | 5118.3 | 18633060.0 | 29154220000.0 | 14054.5 | 14143.4 | 1738.944 | 33280.0 |
| skewness | -0.006409027 | -0.638605 | 1.174369 | 9.786963 | 0.589563 | 4.652051 | -1.930845 | 9.441332 | -0.219312 | 9.495065 | -2.098053 | 0.618686 | -2.072753 | 5.396112 | 3.206655 | 15.91624 | 1.711471 | 1.777424 | -1.143763 | -0.602437 |
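An alternative to `inplace=True` is to assign the result of `drop()` to a new name, which keeps the original table available. The following minimal sketch uses a small made-up table; the values and the names `table`, `trimmed` and `safe` are illustrative only. It also shows `errors="ignore"`, which helps when a cell is re-run, because dropping a label that is no longer in the index otherwise raises a `KeyError`.

```python
import pandas as pd

# Illustrative summary table with made-up values
table = pd.DataFrame(
    {"a": [4.0, 2.5, 1.0, 1.29], "b": [4.0, 5.0, 2.0, 2.58]},
    index=["count", "mean", "mad", "std"],
)

# drop() returns a new table; `table` itself keeps all four rows
trimmed = table.drop(["mad", "std"])
print(table.index.tolist())
print(trimmed.index.tolist())

# errors="ignore" skips labels that are not in the index ("sem" here)
safe = table.drop(["mad", "sem"], errors="ignore")
print(safe.index.tolist())
```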


### (6) Transpose the table
For publication in a journal, we usually need to transpose the above table. We also round the values to one decimal place.

```python
stats.round(1).T
```

| | count | mean | std | min | 25% | 50% | 75% | max | sum | skewness | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | 2938.0 | 2007.5 | 4.6 | 2000.0 | 2004.0 | 2008.0 | 2012.0 | 2015.0 | 5898090.0 | -0.0 | -1.2 |
| Life expectancy | 2928.0 | 69.2 | 9.5 | 36.3 | 63.1 | 72.1 | 75.7 | 89.0 | 202690.6 | -0.6 | -0.2 |
| Adult Mortality | 2928.0 | 164.8 | 124.3 | 1.0 | 74.0 | 144.0 | 228.0 | 723.0 | 482524.0 | 1.2 | 1.7 |
| infant deaths | 2938.0 | 30.3 | 117.9 | 0.0 | 0.0 | 3.0 | 22.0 | 1800.0 | 89033.0 | 9.8 | 116.0 |
| Alcohol | 2744.0 | 4.6 | 4.1 | 0.0 | 0.9 | 3.8 | 7.7 | 17.9 | 12630.2 | 0.6 | -0.8 |
| percentage expenditure | 2938.0 | 738.3 | 1987.9 | 0.0 | 4.7 | 64.9 | 441.5 | 19479.9 | 2168982.0 | 4.7 | 26.6 |
| Hepatitis B | 2385.0 | 80.9 | 25.1 | 1.0 | 77.0 | 92.0 | 97.0 | 99.0 | 193043.0 | -1.9 | 2.8 |
| Measles | 2938.0 | 2419.6 | 11467.3 | 0.0 | 0.0 | 17.0 | 360.2 | 212183.0 | 7108762.0 | 9.4 | 114.9 |
| BMI | 2904.0 | 38.3 | 20.0 | 1.0 | 19.3 | 43.5 | 56.2 | 87.3 | 111284.9 | -0.2 | -1.3 |
| under-five deaths | 2938.0 | 42.0 | 160.4 | 0.0 | 0.0 | 4.0 | 28.0 | 2500.0 | 123501.0 | 9.5 | 109.8 |
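After transposing, each measure is a column, so it can be rounded or typed on its own, for example to show `count` as an integer. The following is a minimal sketch, assuming a three-row excerpt of the summary table (the values are copied from the `Year` and `Life expectancy` columns of the default `describe()` output above); the name `report` and the chosen precisions are illustrative.

```python
import pandas as pd

# Excerpt of the summary table: measures as rows, variables as columns
stats = pd.DataFrame(
    {
        "Year": [2938.0, 2007.51872, 4.613841],
        "Life expectancy": [2928.0, 69.224932, 9.523867],
    },
    index=["count", "mean", "std"],
)

# Transpose first, then round each measure to its own number of decimals
report = stats.T.round({"mean": 1, "std": 2})

# Show the counts as integers instead of floats
report = report.astype({"count": int})

# Export for a manuscript, here as CSV text
csv_text = report.to_csv()
print(csv_text)
```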
