<table align="left" width=100%>
    <tr>
        <td width="20%">
            <img src="faculty.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> Faculty Notebook <br> (Session 1) </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Table of Contents

1. **[Import Libraries](#lib)**
2. **[Descriptive Statistics](#des)**
    - 2.1 - **[Measures of Central Tendency](#CT)**
    - 2.2 - **[Measures of Dispersion](#disp)**
    - 2.3 - **[Skewness and Kurtosis](#sk)**
    - 2.4 - **[Covariance and Correlation](#cc)**

<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt
from matplotlib import gridspec
%matplotlib inline

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'factorial' from math library
from math import factorial

# import 'stats' package from scipy library
from scipy import stats
from scipy.stats import randint
from scipy.stats import skewnorm

# import 'random' to generate a random sample
import random

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

The study of statistics is mainly divided into two parts: `Descriptive` and `Inferential`.

The main focus here is `Inferential Statistics`. Before that, let us recall the descriptive statistics methods learned as part of exploratory data analysis.

<a id="des"></a>
# 2. Descriptive Statistics

Descriptive statistics summarizes or describes the given data. It includes measures of central tendency, measures of dispersion, and measures of the shape of the distribution of the data.

<a id="CT"></a>
## 2.1 Measures of Central Tendency

A measure of central tendency is a single value that identifies the central position of the data. This section covers the mean, median, and mode, along with the partition values, which locate other positions in the data.

### Mean:
It is defined as the ratio of the sum of all the observations to the total number of observations. It is affected by the presence of outliers.

### Median:
It is the middle observation when the data is arranged in increasing or decreasing order. When the number of observations is even, it is the average of the two middle observations. It divides the dataset into two equal parts.
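The different sensitivity of the mean and the median to outliers can be seen in a minimal sketch. The values below are hypothetical and chosen only for illustration; they are not part of the supermarket example.

```python
import numpy as np

# hypothetical observations, used only for illustration
values = [10, 12, 11, 13, 12]

# the same observations with one extreme value appended
values_with_outlier = values + [100]

# the mean shifts noticeably once the outlier is included
mean_before = np.mean(values)
mean_after = np.mean(values_with_outlier)

# the median stays at the middle of the sorted data
median_before = np.median(values)
median_after = np.median(values_with_outlier)

print('Mean without / with outlier:', mean_before, mean_after)
print('Median without / with outlier:', median_before, median_after)
```

In this sketch the mean moves from 11.6 to roughly 26.3, while the median remains 12.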

### Mode: 
It is defined as the value in the data with the highest frequency. There can be more than one mode in the data.
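A small sketch of data with more than one mode is given below. It assumes Python 3.8 or later, where the standard library provides `statistics.multimode`; the ratings are hypothetical.

```python
from statistics import multimode

# hypothetical customer ratings, used only for illustration
ratings = [3, 5, 4, 5, 3, 2]

# multimode() returns every value that shares the highest frequency
modes = multimode(ratings)
print('Modes:', modes)
```

Both 3 and 5 occur twice, so this data has two modes.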

### Partition values:
Partition values are defined as the values that divide the data into equal parts. `Quartiles` divide the data into 4 equal parts, `Deciles` divide the data into 10 equal parts and `Percentiles` divide the data into 100 equal parts.
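As an illustration, the partition values can be obtained with `np.quantile` and `np.percentile`. The sketch below uses the supermarket sales from the examples in this notebook and NumPy's default linear interpolation; other quantile conventions can give slightly different values for small samples.

```python
import numpy as np

# one-day sales (in dollars) of the 12 branches
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# quartiles: the cut points at 25%, 50% and 75%
quartiles = np.quantile(sale, [0.25, 0.50, 0.75])
print('Quartiles:', quartiles)

# deciles: the nine cut points at 10%, 20%, ..., 90%
deciles = np.quantile(sale, np.linspace(0.1, 0.9, 9))
print('Deciles:', deciles)

# a single percentile, here the 90th
p90 = np.percentile(sale, 90)
print('90th percentile:', p90)
```

The second quartile, the fifth decile and the 50th percentile all coincide with the median.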

### Example:

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider one day sale (in dollars) of all the branches. Calculate the mean and median to find the average sale.
    
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [3]:
# given data
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# calculate mean sale
mean_sale = np.mean(sale)
print('Mean:', mean_sale)

# calculate median sale
med_sale = np.median(sale)
print('Median:', med_sale)

Mean: 169.33333333333334
Median: 173.0


<a id="disp"></a>
## 2.2 Measures of Dispersion

A measure of dispersion describes the variability in the data. Some of the measures of dispersion are range, variance, standard deviation, coefficient of variation, and IQR.

### Range:
It is defined as the difference between the largest and smallest observation in the data. It is affected by the presence of extreme observations. 
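The effect of a single extreme observation on the range can be sketched with the supermarket sales used in this notebook. Dropping the smallest value is done here only to illustrate the sensitivity; it is not a recommended cleaning step.

```python
import numpy as np

# one-day sales (in dollars) of the 12 branches
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# range: largest observation minus smallest observation
sale_range = np.max(sale) - np.min(sale)
print('Range:', sale_range)

# recompute the range after dropping the smallest observation
trimmed = sorted(sale)[1:]
trimmed_range = max(trimmed) - min(trimmed)
print('Range without the smallest sale:', trimmed_range)
```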

### Variance: 
It measures the dispersion of the data around the mean. It is defined as the average of the squared differences between each observation and the mean.
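The definition can be checked against `np.var` in a short sketch. Dividing by the number of observations gives the population variance, which is NumPy's default; the sample variance with `ddof=1` is shown only for comparison and is an addition to the definition above.

```python
import numpy as np

# one-day sales (in dollars) of the 12 branches
sale = np.array([165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175])

# variance from the definition: mean of the squared deviations from the mean
deviations = sale - sale.mean()
var_manual = np.mean(deviations ** 2)

# population variance (divides by n), the default of np.var
var_population = np.var(sale)

# sample variance (divides by n - 1)
var_sample = np.var(sale, ddof=1)

print('Variance from the definition:', var_manual)
print('Population variance:', var_population)
print('Sample variance:', var_sample)
```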

### Standard Deviation:
It is the positive square root of the variance. The unit of standard deviation is the same as the unit of the data points. A variable with near-zero standard deviation is almost constant, so it usually contributes little information to the analysis.

### Coefficient of Variation:
It is the ratio of the standard deviation to the mean, so it measures the dispersion of the data points relative to the mean. It is usually expressed as a percentage. Since it has no unit, we can compare the coefficient of variation of two or more groups to identify the group with more relative spread.
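A minimal sketch of such a comparison follows. The helper `coefficient_of_variation` and the second group `other_sale` are hypothetical and introduced only for illustration; the population standard deviation is assumed.

```python
import numpy as np

# one-day sales (in dollars) of the 12 branches
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# hypothetical sales of a second group of branches, for comparison only
other_sale = [150, 210, 95, 240, 180, 120, 205, 88, 230, 260, 110, 190]

def coefficient_of_variation(values):
    """Return the standard deviation as a percentage of the mean."""
    values = np.asarray(values, dtype=float)
    return np.std(values) / np.mean(values) * 100

cv_sale = coefficient_of_variation(sale)
cv_other = coefficient_of_variation(other_sale)

print('CV of sale (%):', cv_sale)
print('CV of the hypothetical group (%):', cv_other)
```

The group with the larger coefficient of variation has more spread relative to its mean.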

### Interquartile Range (IQR):
It is defined as the difference between the third and first quartiles. It gives the width of the interval covered by the middle 50% of the data. IQR can be used to identify the outliers in the data.
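One common way to use the IQR for outlier detection is sketched below. The 1.5 multiplier is a widely used convention and an assumption here, not a rule stated in this notebook.

```python
import numpy as np

# one-day sales (in dollars) of the 12 branches
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# first and third quartiles
q1 = np.quantile(sale, 0.25)
q3 = np.quantile(sale, 0.75)
iqr = q3 - q1

# fences placed 1.5 * IQR beyond the quartiles (assumed convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# observations outside the fences are flagged as potential outliers
outliers = [value for value in sale if value < lower_fence or value > upper_fence]

print('Fences:', lower_fence, upper_fence)
print('Potential outliers:', outliers)
```

With these fences, only the sale of 124 is flagged.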

### Example:

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider the one-day sales (in dollars) of all the branches. Calculate the standard deviation of the sales. Also, find the range in which the middle 50% of the sales lie.
    
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [4]:
# given sale
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# calculate standard deviation
# by default, 'np.std()' returns the population standard deviation (ddof = 0)
std_sale = np.std(sale)
print('Standard Deviation:', std_sale)

# calculate the IQR to obtain the range of middle 50% of the sale

# 1st quartile
# pass the sale values to the parameter, 'a'
# pass the required quantile value to the parameter, 'q'
Q1_sale = np.quantile(a = sale, q = 0.25)

# 3rd quartile
# pass the sale values to the parameter, 'a'
# pass the required quantile value to the parameter, 'q'
Q3_sale = np.quantile(a = sale, q = 0.75)

# calculate IQR
IQR = Q3_sale - Q1_sale

print('Range of the middle 50% of the sale:', IQR)

Standard Deviation: 21.76898915634093
Range of the middle 50% of the sale: 22.5
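The IQR above is the width of the middle 50% of the sales. To report the interval itself, the two quartiles can be printed as its bounds, as in this short sketch.

```python
import numpy as np

# one-day sales (in dollars) of the 12 branches
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# the middle 50% of the sales lies between the first and third quartiles
q1 = np.quantile(sale, 0.25)
q3 = np.quantile(sale, 0.75)
print('Middle 50% of the sales lies between', q1, 'and', q3)

# count the observations that fall inside this interval
inside_count = sum(q1 <= value <= q3 for value in sale)
print('Observations inside the interval:', inside_count)
```

Six of the twelve branches fall inside this interval, as expected for the middle half of the data.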


<a id="sk"></a>
## 2.3 Skewness and Kurtosis

### Skewness:
It measures the asymmetry of the distribution of the data about its mean. The value of skewness can be `positive` (longer right tail), `negative` (longer left tail), or `zero` (symmetric distribution).
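The three cases can be reproduced with `stats.skew` on small hypothetical datasets, constructed only to show the sign of the statistic.

```python
from scipy import stats

# hypothetical datasets, used only for illustration
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 20]
left_tailed = [-value for value in right_tailed]

# symmetric data gives zero skewness
skew_symmetric = stats.skew(symmetric)

# a long right tail gives positive skewness
skew_right = stats.skew(right_tailed)

# mirroring the data flips the tail and the sign
skew_left = stats.skew(left_tailed)

print('Symmetric:', skew_symmetric)
print('Right-tailed:', skew_right)
print('Left-tailed:', skew_left)
```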

### Kurtosis:
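It describes the heaviness of the tails of the data distribution, often loosely called its peakedness, relative to the normal distribution. The values below refer to excess kurtosis, for which the normal distribution has the value zero. A positive value represents the `leptokurtic` distribution, a negative value represents the `platykurtic` distribution, and a zero value represents the `mesokurtic` distribution.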
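The sign convention matters when reading the output of `stats.kurtosis`. By default it returns the excess (Fisher) kurtosis; passing `fisher=False` returns the Pearson kurtosis, which is larger by 3. A short sketch with the supermarket sales used in this notebook:

```python
from scipy import stats

# one-day sales (in dollars) of the 12 branches
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# excess kurtosis (Fisher definition, the default): normal distribution gives 0
excess_kurt = stats.kurtosis(sale)

# Pearson kurtosis: normal distribution gives 3
pearson_kurt = stats.kurtosis(sale, fisher=False)

print('Excess kurtosis:', excess_kurt)
print('Pearson kurtosis:', pearson_kurt)
```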

### Example:

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider the one-day sales (in dollars) of all the branches. Identify the type of skewness and kurtosis of the sales.
    
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [5]:
# calculate the value of skewness to identify the type
sale_skew = stats.skew(sale)
print('Skewness of Sale:', sale_skew)

# calculate the value of kurtosis to identify the type
sale_kurt = stats.kurtosis(sale)
print('Kurtosis of Sale:', sale_kurt)

Skewness of Sale: -0.5285526567587567
Kurtosis of Sale: -0.38240010775017863


The above output shows that the value of skewness is negative, which implies that the data is `negatively skewed`. Also, the value of kurtosis is negative, which implies that the distribution of the sales is `platykurtic`.

<a id="cc"></a>
## 2.4 Covariance and Correlation

### Covariance:
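It measures the degree to which two variables move together. A positive value indicates that the variables tend to increase together, and a negative value indicates that one tends to decrease as the other increases. The value of covariance can range from $-\infty$ to $\infty$. The magnitude of covariance is not easy to interpret because it depends on the units of the variables.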
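The difficulty in interpreting the magnitude can be illustrated with the sales and working hours used in this notebook. The sketch uses `np.cov`, which returns the sample covariance by default, and re-expresses the working hours in minutes; this unit change is introduced only for illustration.

```python
import numpy as np

# one-day sales (in dollars) and working hours of the 12 branches
sale = np.array([165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175])
working_hrs = np.array([7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9])

# np.cov returns a 2 x 2 matrix; the off-diagonal entry is the covariance
cov_hours = np.cov(working_hrs, sale)[0, 1]

# the same relationship with working time measured in minutes
working_min = working_hrs * 60
cov_minutes = np.cov(working_min, sale)[0, 1]

print('Covariance (hours, dollars):', cov_hours)
print('Covariance (minutes, dollars):', cov_minutes)
```

The relationship between the variables is unchanged, yet the covariance is 60 times larger when the time is measured in minutes.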

### Correlation:
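It is the covariance divided by the product of the standard deviations of the two variables, so its value always lies between -1 and +1. A correlation value near +1 indicates a `strong positive` correlation between the variables, a value near -1 indicates a `strong negative` correlation, and a value near 0 indicates little or no linear relationship.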
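The normalization can be verified directly: dividing the covariance by the product of the two standard deviations reproduces the correlation coefficient. The sketch below uses the sample versions (`ddof=1`) consistently and the sales and working hours used in this notebook.

```python
import numpy as np

# one-day sales (in dollars) and working hours of the 12 branches
sale = np.array([165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175])
working_hrs = np.array([7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9])

# sample covariance of the two variables
cov_xy = np.cov(working_hrs, sale)[0, 1]

# correlation = covariance / (std of x * std of y), using sample standard deviations
r_manual = cov_xy / (np.std(working_hrs, ddof=1) * np.std(sale, ddof=1))

# the same coefficient from NumPy's built-in function
r_numpy = np.corrcoef(working_hrs, sale)[0, 1]

print('Correlation from the definition:', r_manual)
print('Correlation from np.corrcoef:', r_numpy)
```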

### Example:

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider the one-day sales (in dollars) and working hours of all the branches. Find the relationship between the working hours of a store and its sales.
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]
    Working hours = [7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9]

In [6]:
# given data
sale = pd.Series([165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175])
working_hrs = pd.Series([7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9])

# calculate the correlation coefficient to find the relationship between working hours and sales of a store
corr_coeff = working_hrs.corr(sale)

print('Correlation coefficient:', corr_coeff)

Correlation coefficient: 0.6447248082202144


The correlation coefficient of about 0.64 shows that there is a moderate positive linear relationship between the working hours and the sales of a store.