# **``BioStatistics``**
#### **``Foundation for Data Analysis in Health & Medical Sciences``**

## ***````Chapter-1 :: Introduction to Statistics````***

##### **``Statistics``** is the field of study concerned with:
- Collecting data
- Organizing data
- Summarizing data
- Analysing data
- Drawing conclusions about a population (the entire body of data) by studying only a part of it

##### There can be various **``Sources of data``**:
- **Routinely kept records**
    - ***For example***: In a hospital, records such as patient entries or the number of patients treated in a day are prime examples of routinely kept (often electronic) records.


- **Surveys**
    - Surveys help us collect data that routinely kept records do not contain, in order to answer a specific question.
    - ***For example***: An administrator of a clinic wishes to obtain information regarding the mode of transportation used by patients to visit the clinic. If admission forms do not contain a question on mode of transportation, we may conduct a survey among patients to obtain this information.


- **Experiments**
    - ***For example***: A doctor runs an experiment to find out whether a newly created drug helps lower the level of bad cholesterol.


- **External Sources**
    - ***For example***: Already published reports, or the work of someone who has already studied the question or problem that you are trying to answer.

#### **``BioStatistics``**
- Statistics applied to data derived from the biological sciences or medicine.
    - ***For example***: Coronary Heart Disease data, Dopamine effect data, ASD data and others.


#### **``Variable``**
- An attribute in the data set that takes different values for the different people/subjects in the population or body of data.
    - ***For example***: Blood Sugar, Heart Rate, Cholesterol level and others.


#### **``Random Variable``**
- A variable whose value arises from many factors, including chance, and cannot be exactly predicted in advance.
    - ***For example***: The height and weight of a newborn child 20 years from now. Here, height and weight depend on genetics (DNA composition) and on other external factors.


#### **``Measurement``**
- It is the unit or scale in which the data of a variable is expressed or collected.
    - ***For example***: Nominal scale, ordinal scale, interval scale, and units such as kilograms and centimeters.
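
The difference between a nominal and an ordinal scale can be sketched with pandas categoricals. The blood groups and pain ratings below are hypothetical values chosen for illustration, not data from this notebook.

```python
import pandas as pd

# Nominal scale: labels with no inherent order (hypothetical blood groups)
blood_group = pd.Categorical(["A", "O", "B", "O"], ordered=False)

# Ordinal scale: labels with a meaningful order (hypothetical pain ratings)
pain = pd.Categorical(["mild", "severe", "moderate", "mild"],
                      categories=["mild", "moderate", "severe"],
                      ordered=True)

print(blood_group.ordered, pain.ordered)
# min/max are only meaningful on the ordered (ordinal) variable
print(pain.min(), pain.max())
```

Only the ordinal variable supports comparisons such as minimum and maximum; a nominal variable supports counting and equality checks only.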
    
    
#### **``Statistical Inference``**
- It is the procedure by which we use descriptive statistics computed from a sample to estimate the value of a population parameter.
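
A minimal sketch of this idea, assuming a synthetic population of ages generated with NumPy (not the dataset used later in this notebook): the sample mean is the statistic, and it estimates the population mean, which is normally unknown.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 synthetic ages (an assumption, not real data)
population = rng.normal(loc=55, scale=10, size=10_000)

# Draw a simple random sample without replacement
sample = rng.choice(population, size=200, replace=False)

population_mean = population.mean()  # the parameter (normally unknown)
sample_mean = sample.mean()          # the statistic used to estimate it
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f} (standard error {standard_error:.2f})")
```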

#### **``How to collect the sample``**
- **Simple Random Sampling**
    - Every record has an equal probability of being picked from the population.
        - There are two ways of drawing a simple random sample:
            - **Sampling with replacement**
                - A record picked from the population is returned to it, so it can be picked again and may occur more than once in the sample.
            - **Sampling without replacement**
                - A record, once picked from the population, is set aside so that it cannot be picked again and cannot occur more than once in the sample.
- **Systematic Sampling**
    - Sample observations are picked at a fixed interval from an ordered list of the population.
        - ***For example***: x is the starting number, either provided by the user or chosen at random, and k is the gap (interval) between consecutive picks.
        - x, x+k, x+2k, x+3k, x+4k, x+5k, x+6k, ...... , x+nk
- **Stratified Sampling**
    - The population is divided into categories or classes, and observations are randomly picked from each class.
    - These classes are referred to as strata (singular: stratum). From each stratum you can pick a sample in proportion to the stratum's size, pick a fixed number of observations, or apply systematic sampling within the stratum.
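
The two variants of simple random sampling can be sketched with NumPy's `Generator.choice`. The population of 20 subject IDs is hypothetical and only serves to make the difference visible.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
subject_ids = np.arange(1, 21)  # hypothetical population of 20 subject IDs

# With replacement: an ID may appear more than once in the sample
with_replacement = rng.choice(subject_ids, size=8, replace=True)

# Without replacement: every selected ID is unique
without_replacement = rng.choice(subject_ids, size=8, replace=False)

print(with_replacement)
print(without_replacement)
```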

#### **``Scientific method or Design of Experiments``**

#### **Making an observation**: 
- An observation is made of a phenomenon or a group of phenomena.
    - For example, it is readily observable that regular exercise reduces body weight in many people. It is also readily observable that changing diet may have a similar effect. In this case there are two observable phenomena, regular exercise and diet change, that have the same endpoint.
    

#### **Formulating a Hypothesis**:
- A statistical hypothesis is a statement such as "The average (mean) loss of body weight of people who exercise is greater than the average (mean) loss of body weight of people who do not exercise." In this statement a quantitative measure, the "average" or "mean" value, is hypothesized to be greater in the group of people who exercise.
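
In symbols, the statement above can be written as an alternative hypothesis set against a null hypothesis. The notation is an assumption introduced here for illustration: $\mu_E$ and $\mu_N$ denote the mean loss of body weight of people who do and do not exercise.

\begin{align}
H_0 &: \mu_E \le \mu_N \\
H_A &: \mu_E > \mu_N
\end{align}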


#### **Designing an Experiment**:
- The third step of the scientific method involves designing an experiment that will yield the data necessary to validly test an appropriate statistical hypothesis. 
- This step of the scientific method, like that of data analysis, requires the expertise of a statistician. 
- Improperly designed experiments are the leading cause of invalid results and unjustified conclusions. 
- Further, most studies that are challenged by experts are challenged on the basis of the appropriateness or inappropriateness of the study’s research design.   

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
patient_id_age = pd.read_csv("ch01_all/EXA_C01_S04_01.csv")

In [3]:
patient_id_age.describe()

Unnamed: 0,SUBJ,AGE
count,189.0,189.0
mean,95.0,55.031746
std,54.703748,9.914477
min,1.0,30.0
25%,48.0,48.0
50%,95.0,54.0
75%,142.0,61.0
max,189.0,82.0


In [4]:
patient_id_age.head(15)

Unnamed: 0,SUBJ,AGE
0,1,48
1,2,35
2,3,46
3,4,44
4,5,43
5,6,42
6,7,39
7,8,44
8,9,49
9,10,49


In [5]:
patient_id_age.shape

(189, 2)

## **``Simple_Random_Sampling``**

In [6]:
simple_random_numbers = [val for val in np.random.randint(low=1,high=patient_id_age.shape[0]+1,size=14)]

In [7]:
simple_random_numbers

[12, 185, 25, 3, 10, 20, 90, 79, 104, 20, 135, 27, 143, 12]

In [8]:
simple_random_sample = patient_id_age[patient_id_age['SUBJ'].isin(simple_random_numbers)]

In [9]:
simple_random_sample

Unnamed: 0,SUBJ,AGE
2,3,46
9,10,49
11,12,39
19,20,61
24,25,72
26,27,67
78,79,54
89,90,59
103,104,54
134,135,38
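
The list of random numbers above contains 12 and 20 twice, because `np.random.randint` draws with replacement, so `isin` matches fewer than 14 distinct subjects. A possible alternative, sketched here on a synthetic stand-in for `patient_id_age` (the ages are generated, not read from the CSV), is `DataFrame.sample`, which draws without replacement by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
# Synthetic stand-in for patient_id_age (assumed columns SUBJ and AGE)
demo = pd.DataFrame({"SUBJ": np.arange(1, 190),
                     "AGE": rng.integers(30, 83, size=189)})

# IDs drawn with replacement may repeat, so isin() can return fewer than 14 rows
ids_with_repeats = rng.integers(1, 190, size=14)
rows_from_isin = demo[demo["SUBJ"].isin(ids_with_repeats)]

# DataFrame.sample draws without replacement by default: exactly 14 distinct rows
rows_from_sample = demo.sample(n=14, random_state=42)

print(len(rows_from_isin), len(rows_from_sample))
```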


## **``Systematic_Random_Sampling``**

In [10]:
first_number = np.random.randint(low=1,high=5,size=1,dtype=int)  # np.int was removed in NumPy 1.24; use the built-in int

In [11]:
first_number

array([3])

In [12]:
sample_size = 20

In [13]:
population_size = patient_id_age.shape[0] - first_number[0]

In [14]:
interval_number_for_sample = np.round(population_size/sample_size,0)

In [15]:
interval_number_for_sample

9.0

In [236]:
def gen_systematic_sampling_numbers(start_num,pop_size,sample_size,interval_num):
    """
    ``Description``: This function is created for performing Systematic Random Sampling by generating systematic sample numbers.
    
    ``Input Parameters``: It accepts below input parameters:
        1. start_num : One-element sequence holding the first number (x) from which the sample numbers start
        2. pop_size : Size of the entire data
        3. sample_size : Size of the systematic sample (currently unused)
        4. interval_num : Represents k in generating the series of numbers like:
                        ``x, x+k, x+2k, x+3k, x+4k, ..... x+nk``
        
    ``Returns``: It returns sys_sampling_nums as the systematic sampling numbers.
    """
    sys_sampling_nums = []
    for num in range(start_num[-1], pop_size):
        if sys_sampling_nums == []:
            sys_sampling_nums.append(num)
        elif sys_sampling_nums != [] and sys_sampling_nums[-1] < pop_size and sys_sampling_nums[-1]+interval_num < pop_size:
            sys_sampling_nums.append(sys_sampling_nums[-1]+interval_num)
    return sys_sampling_nums

In [17]:
systematic_sampling_numbers = gen_systematic_sampling_numbers(first_number,population_size,sample_size,interval_number_for_sample)

In [18]:
systematic_sampling_numbers

[3,
 12.0,
 21.0,
 30.0,
 39.0,
 48.0,
 57.0,
 66.0,
 75.0,
 84.0,
 93.0,
 102.0,
 111.0,
 120.0,
 129.0,
 138.0,
 147.0,
 156.0,
 165.0,
 174.0,
 183.0]

In [19]:
systematic_sample = patient_id_age[patient_id_age['SUBJ'].isin(systematic_sampling_numbers)]

In [20]:
systematic_sample

Unnamed: 0,SUBJ,AGE
2,3,46
11,12,39
20,21,53
29,30,46
38,39,48
47,48,49
56,57,56
65,66,50
74,75,61
83,84,57
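
The same sequence x, x+k, x+2k, ... can also be sketched with `np.arange`. The helper name `systematic_ids` is hypothetical; the start, population size and interval are the values used above. Integer arithmetic keeps the subject numbers as integers rather than floats.

```python
import numpy as np

def systematic_ids(start, population_size, interval):
    """Return start, start+k, start+2k, ... not exceeding population_size."""
    return np.arange(start, population_size + 1, interval)

ids = systematic_ids(start=3, population_size=189, interval=9)
print(ids)
```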


## **``Stratified_Sampling``**

In [237]:
def stratified_sampling(strata_length,total_stratums,df,df_col,s_size=False,s_intervals=False):
    """
    ``Description``: This function is created for below tasks:
        1. Dividing the entire data into stratums
        2. Generating systematic sample numbers for all the stratums
        
    ``Inputs``: It accepts below input parameters:
        1. strata_length : Minimum number of records in each stratum
        2. total_stratums : Number of stratums in which data needs to be divided
        3. df : DataFrame having all the records
        4. df_col : DataFrame column used for defining the classes or categories for stratums
        5. s_size : Sample size
        6. s_intervals : Represents k of systematic sampling(interval for separating the data values)
        
    ``Returns``: It returns below outputs:
        1. stratums_dict : Data divided in stratums
        2. intervals : Stratum ranges as per dataframe column
        3. stratum_sample_numbers : Systematic sample numbers of each stratum based on dataframe column 
    """
    intervals = []
    for i in range(total_stratums+1):
        if intervals == []:
            intervals.append(df[df_col].min())
        elif intervals != []:
            intervals.append(intervals[-1] + strata_length)
    
    stratums_dict = {}
    for i in range(len(intervals)-1):
        stratums_dict[i] = df[(df[df_col] >= intervals[i]) & (df[df_col] < intervals[i+1])]
    
    # Use the strata built above, not the global all_stratums (undefined on the first call)
    stratum_sample_numbers = {}
    for i in stratums_dict.keys():
        stratum_sample_numbers[i] = gen_systematic_sampling_numbers([stratums_dict[i][df_col].min()],stratums_dict[i][df_col].max(),10,2)
    return stratums_dict, intervals, stratum_sample_numbers

In [228]:
all_stratums, stratum_ranges, stratum_sys_numbers = stratified_sampling(18,3,patient_id_age,'AGE')

In [229]:
stratum_ranges

[30, 48, 66, 84]

In [230]:
all_stratums.keys()

dict_keys([0, 1, 2])

In [231]:
all_stratums[0].shape, all_stratums[1].shape, all_stratums[2].shape

((43, 2), (116, 2), (30, 2))

In [232]:
stratum_sys_numbers

{0: [30, 32, 34, 36, 38, 40, 42, 44, 46],
 1: [48, 50, 52, 54, 56, 58, 60, 62, 64],
 2: [66, 68, 70, 72, 74, 76, 78, 80]}

### **``Stratified Systematic Samples``**

In [233]:
stratified_sample_1 = patient_id_age[patient_id_age['AGE'].isin(stratum_sys_numbers[0])]
stratified_sample_2 = patient_id_age[patient_id_age['AGE'].isin(stratum_sys_numbers[1])]
stratified_sample_3 = patient_id_age[patient_id_age['AGE'].isin(stratum_sys_numbers[2])]

In [234]:
stratified_sample_1.shape, stratified_sample_2.shape, stratified_sample_3.shape

((23, 2), (57, 2), (15, 2))
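
The samples above select every subject whose age matches one of the systematic age values. A different approach, random selection within each stratum, can be sketched with `pd.cut` and `groupby(...).sample` (pandas 1.1 or later). The data frame is a synthetic stand-in for `patient_id_age`, and the 20% sampling fraction is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)
# Synthetic stand-in for patient_id_age (generated ages, not the CSV)
demo = pd.DataFrame({"SUBJ": np.arange(1, 190),
                     "AGE": rng.integers(30, 83, size=189)})

# Assign each subject to an age stratum: [30, 48), [48, 66), [66, 84)
demo["STRATUM"] = pd.cut(demo["AGE"], bins=[30, 48, 66, 84], right=False)

# Proportional allocation: draw 20% of each stratum at random
stratified = demo.groupby("STRATUM", observed=True).sample(frac=0.2, random_state=7)

print(stratified["STRATUM"].value_counts().sort_index())
```

Sampling a fixed fraction of every stratum keeps each age group represented in proportion to its size.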

## ***````Chapter-2 :: Descriptive Statistics````***

In [246]:
patient_id_age.sort_values(by='AGE',axis=0,ascending=True,inplace=True)
patient_id_age.reset_index(drop=True,inplace=True)

In [248]:
patient_id_age.head(10)

Unnamed: 0,SUBJ,AGE
0,35,30
1,139,34
2,2,35
3,132,37
4,29,37
5,49,38
6,135,38
7,13,38
8,28,38
9,7,39


In [249]:
patient_id_age.shape

(189, 2)

#### **``Frequency Distribution``**

In [296]:
freq_dist = pd.DataFrame(patient_id_age.groupby(by='AGE')['SUBJ'].count())
freq_dist.reset_index(inplace=True)
freq_dist.shape

(44, 2)

In [297]:
freq_dist['AGE_CLASS'] = freq_dist['AGE'].apply(lambda val : "[30-48]" if 30 <= val <= 48 
                                                else "[49-67]" if 49 <= val <= 67 
                                                else "[68-82]")

In [281]:
freq_dist

Unnamed: 0,AGE,SUBJ,AGE_CLASS
0,30,1,[30-48]
1,34,1,[30-48]
2,35,1,[30-48]
3,37,2,[30-48]
4,38,4,[30-48]
5,39,2,[30-48]
6,40,2,[30-48]
7,42,2,[30-48]
8,43,6,[30-48]
9,44,7,[30-48]


In [298]:
freq_dist = pd.DataFrame(freq_dist.groupby(by='AGE_CLASS')['SUBJ'].sum())
freq_dist.reset_index(inplace=True)
freq_dist.columns = ['AGE_CLASS', 'FREQ']
freq_dist

Unnamed: 0,AGE_CLASS,FREQ
0,[30-48],50
1,[49-67],116
2,[68-82],23
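
The grouping into age classes can also be sketched in one step with `pd.cut`. The ages below are a handful of illustrative values, not the full dataset; the bin edges 29, 48, 67, 82 with right-closed intervals reproduce the classes [30-48], [49-67], [68-82] for whole-number ages.

```python
import pandas as pd

# Illustrative ages only (an assumption, not the 189 records above)
ages = pd.Series([30, 34, 35, 44, 48, 49, 54, 61, 67, 68, 72, 82], name="AGE")

# Right-closed bins (29, 48], (48, 67], (67, 82] labelled with the class names
age_class = pd.cut(ages, bins=[29, 48, 67, 82],
                   labels=["[30-48]", "[49-67]", "[68-82]"])

# Count the observations falling in each class
freq = age_class.value_counts(sort=False).rename("FREQ")
print(freq)
```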


#### **``Sturges Equation or Rule``** **for creating the intervals in data for Frequency Distribution**

\begin{align}
k & = 1 + 3.322 \, \log_{10} n
\end{align}

##### Here, n is the number of observations in your sample, and k is the number of class intervals to form from your data.
- ***For example***: Our dataset has a total of 189 records or observations. Therefore, using **``Sturges formula``**, the number of intervals to form is:

\begin{align}
k & = 1 + 3.322 \, \log_{10} 189 \\
k & = 1 + 3.322 \, (2.27646) \\
k & = 8.56 \\
k & \approx 9
\end{align}

Here, k is the number of class intervals, not the class width. To find the class width w, divide the range R of the data by k:
\begin{align}
w & = \frac{R}{k} \\
R & = \text{greatest value} - \text{smallest value}
\end{align}

- ***For example***:
\begin{align}
w & = \frac{82 - 30}{9} \\
w & = 5.78 \\
w & \approx 6
\end{align}
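
The two calculations can be checked with a short sketch. The function names are hypothetical, and rounding up to a whole number is an assumption made here so that the classes cover the full range of the data.

```python
import math

def sturges_classes(n):
    """Number of class intervals suggested by Sturges' rule, rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))

def class_width(smallest, greatest, k):
    """Class width w = R / k, rounded up to a whole unit."""
    return math.ceil((greatest - smallest) / k)

k = sturges_classes(189)    # 189 observations in the dataset
w = class_width(30, 82, k)  # ages range from 30 to 82
print(k, w)
```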

### **``Relative Frequency and Cumulative Frequency``**

In [293]:
freq_dist

Unnamed: 0,AGE_CLASS,FREQ
0,[30-48],50
1,[49-67],116
2,[68-82],23


In [299]:
freq_dist['R_FREQ'] = (freq_dist['FREQ']/freq_dist['FREQ'].sum())*100

In [302]:
freq_dist

Unnamed: 0,AGE_CLASS,FREQ,R_FREQ
0,[30-48],50,26.455026
1,[49-67],116,61.375661
2,[68-82],23,12.169312


In [303]:
freq_dist['CUM_FREQ'] = freq_dist['R_FREQ'].cumsum()

In [304]:
freq_dist

Unnamed: 0,AGE_CLASS,FREQ,R_FREQ,CUM_FREQ
0,[30-48],50,26.455026,26.455026
1,[49-67],116,61.375661,87.830688
2,[68-82],23,12.169312,100.0
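
Note that `CUM_FREQ` above accumulates the relative frequencies, so it is a cumulative percentage. A sketch that keeps cumulative counts and cumulative percentages apart, using the class frequencies from the table above, could look like this; the column name `CUM_R_FREQ` is a hypothetical choice.

```python
import pandas as pd

# Class frequencies taken from the frequency distribution above
table = pd.DataFrame({"AGE_CLASS": ["[30-48]", "[49-67]", "[68-82]"],
                      "FREQ": [50, 116, 23]})

table["CUM_FREQ"] = table["FREQ"].cumsum()                   # cumulative counts
table["R_FREQ"] = table["FREQ"] / table["FREQ"].sum() * 100  # relative frequency in percent
table["CUM_R_FREQ"] = table["R_FREQ"].cumsum()               # cumulative percent

print(table.round(2))
```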
