# Descriptive Statistics
Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability.

# Inferential Statistics
Inferential statistics use measurements from the sample of subjects in an experiment to compare the treatment groups and make generalizations about the larger population of subjects. There are many types of inferential statistics, and each is appropriate for a specific research design and set of sample characteristics.

# Descriptive Statistics vs. Inferential Statistics
Consider the example of a speedometer that displays instantaneous speed.

### Descriptive statistics
Descriptive statistics describe the data as it is. In the above case, if the instantaneous speed is 65 kph, the description would simply be that the speed is 65 kph.

### Inferential statistics
Inferential statistics involves drawing conclusions from the given data. In the above case, if a constant speed of 65 kph is maintained for a certain amount of time, then a destination at the corresponding distance would be reached after that time period.

If that speed is not maintained, the arrival time changes. Deducing such consequences is what inferential statistics is about.

Essentially, "inferring more from the given data is inferential statistics".

# Mean
The mean is the average of all the values in the dataset: the sum of the values divided by the number of values.

# Median
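The median is the value that occurs at the center of the sorted dataset. When the dataset has an even number of values, the median is the average of the two middle values.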

# What Is Understood From Mean And Median?
Mean and median help in observing the central tendency of the data.

Central tendency is the value that typically lies at the center of the data and is representative of it.

Consider an array representing salaries (in LPA) of a company,
- `salaries = [20, 25, 30, 35, 40, 45, 50]`.
- `mean = 35`.
- `median = 35`.

Now consider,
- `salaries = [20, 25, 30, 35, 40, 45, 50, 300]`.
- `mean = 68.125`.
- `median = 37.5`.

There is a huge difference between mean and median in the second case, compared to the first one.

In the second case, the median is more robust to the outlier (300): it shifts only marginally when an outlier is present, whereas the mean is pulled strongly toward it.

The dataset should be sorted before finding the median.
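The two salary examples above can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

# Salaries (in LPA) without and with a single extreme value.
salaries = [20, 25, 30, 35, 40, 45, 50]
salaries_with_outlier = salaries + [300]

# median() sorts the data internally, so input order does not matter.
mean_plain, median_plain = mean(salaries), median(salaries)
mean_outlier, median_outlier = mean(salaries_with_outlier), median(salaries_with_outlier)

print(mean_plain, median_plain)      # 35 35
print(mean_outlier, median_outlier)  # 68.125 37.5
```

The single value 300 moves the mean by more than 33 LPA but moves the median by only 2.5 LPA.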

# Mode
Mode is the most recurring value or the most frequently occurring value in the dataset. 

Consider `array = [90, 80, 90, 70, 95, 90, 60, 90]`,
- `mode = 90`.

Now consider `array = [2, 2, 5, 2, 3, 3, 3, 4]`,
- `mode = 2, 3`.

The second scenario is called bimodal. The mode can also be applied to strings, whereas the mean and median are generally not defined for string data.
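As a minimal sketch, the mode examples above can be reproduced with `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest frequency. The `colors` list is a hypothetical example of string data.

```python
from statistics import multimode

scores = [90, 80, 90, 70, 95, 90, 60, 90]
values = [2, 2, 5, 2, 3, 3, 3, 4]
colors = ["red", "blue", "red", "green"]  # hypothetical string data

# multimode() returns a list, so a tie between values is reported in full.
print(multimode(scores))  # [90]
print(multimode(values))  # [2, 3]
print(multimode(colors))  # ['red']
```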

Mean, median or mode can be used to impute the missing values in the dataset.
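One possible way to impute missing values is sketched below, assuming missing entries are marked with `None` and that the median is the chosen fill value. The `ages` data is hypothetical.

```python
from statistics import median

# Hypothetical ages with missing entries marked as None.
ages = [25, None, 30, 35, None, 40]

# Compute the fill value from the observed entries only.
observed = [a for a in ages if a is not None]
fill_value = median(observed)  # (30 + 35) / 2 = 32.5

# Replace each missing entry with the fill value.
imputed = [fill_value if a is None else a for a in ages]
print(imputed)  # [25, 32.5, 30, 35, 32.5, 40]
```

Swapping `median` for `mean` or `mode` from the same module gives the other two imputation strategies.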

# Percentage
In mathematics, a percentage is a number or ratio expressed as a fraction of 100. To calculate what percentage a number is of a whole, divide the number by the whole (the total) and multiply by 100. Hence, percentage means parts per hundred; the term "per cent" means "per 100".

# Percentile
In statistics, a percentile is a term that describes how a score compares to other scores from the same set. While there is no universal definition of percentile, it is commonly expressed as the percentage of values in a set of data that fall below a given value.

# Difference Between Percentage And Percentile
The key difference between percentage and percentile is that a percentage is a mathematical value presented out of 100, while a percentile is the percentage of values below a specific value. A percentage is a means of comparing quantities. A percentile is used to display position or rank.
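The distinction can be illustrated with a small sketch. The helper names `percentage` and `percentile_rank` and the `scores` data are hypothetical, and the percentile here follows the "percentage of values below" definition given above; other definitions exist.

```python
def percentage(part, whole):
    """Return part as a percentage of whole."""
    return part * 100 / whole

def percentile_rank(scores, value):
    """Return the percentage of scores strictly below value."""
    below = sum(1 for s in scores if s < value)
    return below * 100 / len(scores)

# Hypothetical exam scores (out of 100) for a class of ten students.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

print(percentage(80, 100))          # 80.0
print(percentile_rank(scores, 80))  # 50.0
```

A student who scores 80 out of 100 has a percentage of 80, but sits only at the 50th percentile of this class because half of the scores are lower.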

# Standard Deviation
Standard deviation tells how much spread there is in the data.

Standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance.

To compute it, determine each data point's deviation from the mean, square the deviations, sum them and divide by $n - 1$ to get the variance, then take the square root.

If the data points are farther from the mean, there is a higher deviation within the dataset. Thus, the more spread out the data is, the higher the standard deviation.

![standard_deviation.png](attachment:standard_deviation.png)

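The sample standard deviation is mathematically expressed as follows (for a full population, the denominator is $n$ instead of $n - 1$ and the symbol $\sigma$ is used),

$SD (s) = \sqrt{\frac{\sum_{i = 1}^n(x_i - \bar{x})^2}{n - 1}}$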

Where,
- $x_i$ = Value of the i-th point in the dataset.
- $\bar{x}$ = Mean value of the dataset.
- $n$ = Number of data points in the dataset.
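The formula can be followed step by step in a short sketch and compared against `statistics.stdev`, which also uses the $n - 1$ denominator. The data is the salary list used earlier; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

data = [20, 25, 30, 35, 40, 45, 50]

n = len(data)
x_bar = sum(data) / n                                   # mean of the dataset
squared_deviations = [(x - x_bar) ** 2 for x in data]   # (x_i - x_bar)^2
manual_sd = sqrt(sum(squared_deviations) / (n - 1))     # divide by n - 1, then root

print(manual_sd)    # manual computation
print(stdev(data))  # library computation, should agree up to rounding
```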

# Variance
The term variance refers to a statistical measurement of the spread between members in a dataset. More specifically, variance measures how far each member in the dataset is from the mean and thus from every other member in the set.

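Variance is the square of the standard deviation. For a full population, it is mathematically written as,

$\sigma^2 = \frac{\sum_{i = 1}^N(x_i - \mu)^2}{N}$

or, for a sample,

$s^2 = \frac{\sum_{i = 1}^n(x_i - \bar{x})^2}{n - 1}$

Where,
- $x_i$ = Each value in the dataset.
- $\mu$ = Mean of all values in the population.
- $\bar{x}$ = Mean of all values in the sample.
- $N$ = Number of values in the population.
- $n$ = Number of values in the sample.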
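The two denominators can be compared in a brief sketch. It assumes the same illustrative salary data as earlier and checks the manual results against `statistics.pvariance` (divides by the number of values) and `statistics.variance` (divides by the number of values minus one).

```python
from statistics import pvariance, variance

data = [20, 25, 30, 35, 40, 45, 50]

count = len(data)
data_mean = sum(data) / count
# Sum of squared deviations from the mean, shared by both formulas.
sum_sq = sum((x - data_mean) ** 2 for x in data)

population_var = sum_sq / count        # treats data as the whole population
sample_var = sum_sq / (count - 1)      # treats data as a sample

print(population_var, pvariance(data))
print(sample_var, variance(data))
```

The sample variance is always the larger of the two, since the same sum is divided by a smaller number.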

# Inter Quartile Range (IQR)
In descriptive statistics, the interquartile range tells about the spread of the middle half of the distribution.

Quartiles segment any distribution that is ordered from low to high into 4 equal parts. The IQR contains the second and third quartiles or the middle half of the dataset.

While the range tells about the spread of the whole dataset, the interquartile range tells about the range of the middle half of the dataset.

![iqr_1.png](attachment:iqr_1.png)

IQR is calculated as, $IQR = Q_3 - Q_1$.

The upper fence is calculated as, $\text{Upper Fence} = Q_3 + (1.5 \times IQR)$.

The lower fence is calculated as, $\text{Lower Fence} = Q_1 - (1.5 \times IQR)$.

Any values that are greater than the upper fence or less than the lower fence are considered "*outliers*".

These fences were proposed by John Tukey.
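A minimal sketch of the fence calculation follows. The function names `tukey_fences` and `find_outliers` are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; other quartile conventions give slightly different values for $Q_1$ and $Q_3$.

```python
from statistics import quantiles

def tukey_fences(values, scale=1.5):
    """Return (lower_fence, upper_fence) for the given values."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr

def find_outliers(values, scale=1.5):
    """Return the values lying outside the fences."""
    lower, upper = tukey_fences(values, scale)
    return [v for v in values if v < lower or v > upper]

salaries = [20, 25, 30, 35, 40, 45, 50, 300]

print(tukey_fences(salaries))   # (-7.5, 82.5)
print(find_outliers(salaries))  # [300]
```

With a much larger multiplier, such as `find_outliers(salaries, scale=12)`, the fences widen enough that 300 is no longer flagged.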

### What is 1.5 in the above equation?
1.5 is a constant used to discern the outliers. This number, hereafter called the scale, controls the sensitivity of the range and hence the decision rule. A bigger scale would cause some outliers to be treated as inliers, while a smaller one would cause some ordinary data points to be perceived as outliers. Neither case is desirable.

### Visualizing the outliers
A box plot is used to visualize the IQR and the outliers.

![iqr_2.png](attachment:iqr_2.png)

# Skew In Distribution

![distribution_skew.png](attachment:distribution_skew.png)

Right skewed (positive skew) distributions occur when the long tail is on the right side of the distribution. Analysts also refer to them as positively skewed. This condition occurs because probabilities taper off (decrease) more slowly for higher values. Consequently, extreme values will be found far from the peak on the high end more frequently than on the low end.

Left skewed (negative skew) distributions occur when the long tail is on the left side of the distribution. Statisticians also refer to them as negatively skewed. This condition occurs because probabilities taper off (decrease) more slowly for lower values. Therefore, extreme values will be found far from the peak on the low end more frequently than on the high end.

The crucial point to keep in mind is that the direction of the long tail defines the skew because it indicates where the majority of the exceptional values will be found.
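The direction of the skew can also be checked numerically. The sketch below assumes the moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 1.5), which is one common measure and is not defined in the text above. The function name `skewness` and both datasets are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: positive for a long right tail, negative for a long left tail."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data with a long tail on the right (the value 10).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]
# Mirroring the data moves the long tail to the left.
left_skewed = [-x for x in right_skewed]

print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
```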