# Dispersion - Measure of Spread

Additional source: https://www.youtube.com/watch?v=mk8tOD0t8M0

In statistics, a measure of central tendency gives a single value that represents the entire dataset; however, a single value cannot describe the observations fully. This is where dispersion helps us study the variability of the items. **Dispersion describes how spread out a dataset is. When the dispersion is small, the values in the dataset are tightly clustered; when it is large, the values are widely scattered.**

For example, the mean can be the same for two datasets that are quite different from each other. **A measure of dispersion therefore reveals characteristics of a dataset that central tendency alone cannot.**

**Range, standard deviation and interquartile range** are the three widely used **measures of dispersion**.

![image.png](attachment:a47b2258-280f-4314-b240-9939a84b6a44.png)

# 1) Range

The range is the simplest measure of dispersion, defined as the **difference between the maximum and the minimum values.** Its main advantage is that it is easy to calculate. Its main disadvantages are that it is highly susceptible to extreme values and does not use all the observations in a dataset.

**R = Maximum Value – Minimum Value**

**Range in a simple array**:

[1,5,15,17,19, 150]
R = 150-1 = 149

However, the range is sensitive to extreme values. In this case, without 150, it would be 19 - 1 = 18. So a single extreme value can make the range unrepresentative of the data.
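The sensitivity described above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `value_range` is hypothetical and not part of any library.

```python
def value_range(values):
    """Return the range (maximum minus minimum) of a non-empty sequence."""
    return max(values) - min(values)


data = [1, 5, 15, 17, 19, 150]

print(value_range(data))       # range including the extreme value 150
print(value_range(data[:-1]))  # range after dropping the last element (150)
```

Dropping one value changes the result from 149 to 18, which is why the range is usually reported together with a more robust measure.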

**Range in grouped data**:

R = upper limit of the highest class interval – lower limit of the lowest class interval

For example, for the class intervals

1-4

2-5

6-9

R = 9 - 1 = 8


# 2) Standard Deviation (𝜎)

The most commonly used measure of dispersion is the standard deviation (𝜎). **Standard deviation measures the spread around the mean.** It is the **square root of the variance**, so we must first describe the variance (𝜎²). **Variance is defined as the average of the squared differences from the mean.** The formulas for the variance and standard deviation are given below.

![image.png](attachment:b90aa2f4-1578-44e2-a1e7-05d6d5db6a08.png)
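In case the figure does not render, the standard textbook formulas can be written out as follows. The symbols (μ and N for a population, x̄, s and n for a sample) are the usual conventions and are assumed here; they may differ from the notation in the figure.

$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2, \qquad \sigma = \sqrt{\sigma^2}
$$

$$
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad s = \sqrt{s^2}
$$

The first pair is the population version (divide by N); the second is the sample version (divide by n - 1).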

Like the range, the standard deviation is affected by outliers. A single value can contribute greatly to the standard deviation, so an unusually large standard deviation can hint at the presence of outliers.
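As a rough illustration of this effect, the sketch below reuses the array from the range example and compares the population standard deviation with and without the extreme value 150. The variable names are chosen for illustration only.

```python
import numpy as np

data = np.array([1, 5, 15, 17, 19, 150])

std_with = np.std(data)          # population standard deviation, all values
std_without = np.std(data[:-1])  # same measure after dropping the value 150

print("STD with 150:   ", std_with)
print("STD without 150:", std_without)
```

One extreme value inflates the standard deviation several times over, because each deviation from the mean is squared before averaging.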

**What does SD tell us? The lower it is, the closer the values are to the mean; the higher it is, the further the values are from the mean.**

The standard deviation is also **useful when comparing the spread of two different datasets that have the same mean.** The dataset with the smaller standard deviation has a **narrower spread of measurements around the mean** and therefore usually has **relatively fewer very high or very low values**. In the following example, the standard deviation of the first population is 40, whereas that of the second is 10. The second population has a narrower spread of measurements around the mean.

For a population we divide by N; for a sample we divide by n-1.

![image.png](attachment:b0ded981-dd05-4d63-b890-57171c8ae019.png) ![image.png](attachment:26dd016c-903b-4a07-a52a-b0a5a96777fd.png)

**Tip: data with a smaller standard deviation has a narrower spread of measurements around the mean.**
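To make the comparison concrete, here is a small sketch with two made-up arrays (`narrow` and `wide` are hypothetical data, not taken from the figures) that share the same mean but differ in spread.

```python
import numpy as np

# Two hypothetical datasets with the same mean (50) but different spread
narrow = np.array([48, 49, 50, 51, 52])
wide = np.array([10, 30, 50, 70, 90])

print("Means:", np.mean(narrow), np.mean(wide))
print("STDs: ", np.std(narrow), np.std(wide))
```

The means are identical, so only the standard deviation reveals that the second dataset is far more scattered.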

In [42]:
# Task: find the variance and standard deviation of the list of grades.
import numpy as np

grades = [34,38,46,50,64,72,85,91]
A = np.mean(grades)  # mean
print("Mean is", A)
# population variance: average of the squared differences from the mean
variance_grades = sum([(i-A)**2 for i in grades]) / len(grades)
print("Variance is", variance_grades)

standard_dev = variance_grades ** (1/2)  # standard deviation = square root of the variance
print("STD is", standard_dev)

# or we can use NumPy's var and std functions (population versions by default)
print()
print("Variance is", np.var(grades))
print("STD is", np.std(grades))

Mean is 60.0
Variance is 400.25
STD is 20.006249023742555

Variance is 400.25
STD is 20.006249023742555


# 3) Interquartile Range (IQR)

**Quartiles are the values that divide a group of numbers into quarters.**

**Q1 or the 25th percentile is the first quartile** and defined as **the middle number between the smallest number and the median of the dataset.**

**Q2 is the second quartile which is the median** of the whole dataset.

**Q3 or 75th percentile is the third quartile** which is the **middle value between the median and the highest value of the dataset.**

The interquartile range, often denoted **“IQR”**, is **a way to measure the spread of the middle 50% of a dataset**. It is calculated as the **difference between the third quartile (the 75th percentile) and the first quartile (the 25th percentile) of a dataset.**

For example, a dataset consists of those numbers: 0,4,5,7,8,9,10,12,13,14,15,16,20.

The median (Q2) is the value in the middle of the sorted list. In this case, 10 is the median.

The first quartile (Q1) is the middle value of the numbers from the smallest (0) up to the median (10), that is, of 0, 4, 5, 7, 8, 9, 10, which is 7.

The third quartile (Q3) is the middle value of the numbers from the median (10) up to the largest (20), that is, of 10, 12, 13, 14, 15, 16, 20, which is 14.

**Interquartile Range(IQR) is the difference between Q3 and Q1**. In this case:

IQR = Q3 - Q1

IQR = 14 - 7 = 7
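The worked example can be reproduced with `numpy.percentile()`, whose default (linear interpolation) method agrees with the hand calculation for this dataset. This is a minimal sketch; note that other quartile conventions, for instance excluding the median from both halves, can give slightly different Q1 and Q3.

```python
import numpy as np

data = [0, 4, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 20]

# Request the three quartiles in one call; results follow the requested order
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```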

We have already mentioned extreme values and called them outliers. In statistics, an outlier is a data point that differs significantly from other observations. **IQR helps us give a technical definition of outliers.** A typical definition of an outlier is **any data point more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile**.

In this case, we can say:

**Outliers are any data point below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR).**

The following picture shows the relationship between IQR and outliers.

![image.png](attachment:8ec3a8e9-cd25-4dee-b631-006a83d2d1ab.png)

It’s easy to calculate the interquartile range of a dataset in Python using the **numpy.percentile() function**.

Imagine we have the following list of numbers, and let's find out whether it contains any outliers.

number_list = [1, 5, 10, 15, 40]

minimum number = 1

maximum number =40

median=10

Q1 = 5

Q3 = 15

IQR = Q3-Q1

IQR= 15-5 = 10

Therefore, (1.5 * IQR) = 15

To determine if there are any outliers, we must consider the numbers that are 1.5*IQR beyond the quartiles.

Q1 – (1.5 * IQR) = 5-15 = -10

Q3 + (1.5 * IQR) = 15+15 = 30

The last number in our list is 40. It lies outside the interval from –10 to 30, so 40 is an outlier. The rest of the numbers in the list are not outliers.
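The manual steps above can be wrapped in a small helper. This is a sketch under the 1.5 * IQR rule with NumPy's default percentile method; the function name `iqr_outliers` and its parameter `k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only the points strictly outside the two fences
    return arr[(arr < lower) | (arr > upper)].tolist()


number_list = [1, 5, 10, 15, 40]
print(iqr_outliers(number_list))
```

For `number_list` the fences are -10 and 30, so only 40 is reported.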

Additional reading: How to find outliers in data with python https://careerfoundry.com/en/blog/data-analytics/how-to-find-outliers/

For detailed information about the quantiles (quartiles are a kind of quantile): https://en.wikipedia.org/wiki/Quantile

In [12]:
import numpy as np

#define array of data
data = np.array([14, 19, 20, 22, 24, 26, 27, 30, 30, 31, 36, 38, 44, 47])

#calculate interquartile range 
q3, q1 = np.percentile(data, [75, 25])  # pass several percentiles at once; results come back in the same order
iqr = q3 - q1

#display interquartile range 
iqr

12.25

In [50]:
import numpy as np
from scipy import stats

samples = np.random.normal(loc=100, scale=15, size=1000)
q1 = np.percentile(samples, 25)
q3 = np.percentile(samples, 75)
iqr = stats.iqr(samples)

print("Q1 is", q1, "\nQ3 is", q3, "\nIQR is", iqr)
print()
print("Q2, median, is", np.percentile(samples, 50))
print("Median is", np.median(samples))

# With the values printed below, points outside roughly 59.5 (89.5 - 30.1) to 139.6 (109.6 + 30.1) are outliers.

Q1 is 89.52143439932422 
Q3 is 109.56087632588168 
IQR is 20.03944192655746

Q2, median, is 99.5922512696162
Median is 99.5922512696162


# Box Plot

Box plots (also called **box-and-whisker plots or box-whisker plots**) give a good **graphical image of the concentration of the data**. They also show **how far the extreme values** are from most of the data. 

A box plot is constructed from five values: 

- the minimum value, 

- the first quartile, 

- the median, 

- the third quartile, 

- and the maximum value. 

**These five values are called the "5-number summary".** They are important because the initial interpretation of a dataset during analysis is usually based on them.

**We use these values to compare how close other data values are to them.**
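A five-number summary can be sketched with a single `numpy.percentile()` call, since the minimum and maximum are the 0th and 100th percentiles. The helper name `five_number_summary` is hypothetical, and the quartiles follow NumPy's default interpolation method.

```python
import numpy as np


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) as plain Python floats."""
    summary = np.percentile(values, [0, 25, 50, 75, 100])
    return tuple(float(v) for v in summary)


print(five_number_summary([1, 5, 10, 15, 40]))
```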

To construct a box plot, use a horizontal or vertical number line and a rectangular box. **The smallest and largest data values label the endpoints of the axis.** **The first quartile marks one end of the box and the third quartile marks the other end of the box.** Approximately the middle 50 percent of the data fall inside the box. **The "whiskers" extend from the ends of the box to the smallest and largest data values**. The median or second quartile can be between the first and third quartiles, or it can be one, or the other, or both. The box plot gives a good, quick picture of the data.

**Box Plot (Min & Max)**

![image.png](attachment:0628f87d-1302-4447-b53e-895640b07170.png)

The two whiskers extend from the first quartile to the smallest value and from the third quartile to the largest value. The median is shown in the middle of the box.

**Box Plot (1.5xIQR Rule)**

Consider the following dataset. (Replace 11.5 with 20)

![image.png](attachment:a857483f-00be-44fc-b4ed-b71cf2e4fda5.png)

**The ‘middle’ line drawn inside the box shows the position of the median value.**

**The ends of the ‘box’ give the positions of the first and third quartiles.**

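**The ends of the ‘whiskers’ give the minimum and maximum values in the data that are not outliers; under the 1.5xIQR rule, points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are drawn individually.**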
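Where the whiskers end under the 1.5xIQR rule can be computed directly. The sketch below assumes the dataset is the 13-value list used in the `describe()` cell of this notebook with 11.5 replaced by 20; that the figure shows the same data is an assumption. The helper name `whisker_ends` is hypothetical.

```python
import numpy as np


def whisker_ends(values, k=1.5):
    """Return (lower whisker, upper whisker, outliers) under the k*IQR rule."""
    arr = np.sort(np.asarray(values, dtype=float))
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers stop at the most extreme data points that are still inside the fences
    inside = arr[(arr >= lower_fence) & (arr <= upper_fence)]
    outliers = arr[(arr < lower_fence) | (arr > upper_fence)]
    return float(inside.min()), float(inside.max()), outliers.tolist()


data = [1, 1, 2, 2, 4, 6.8, 7, 8, 8.3, 9, 10, 10, 20]
print(whisker_ends(data))
```

Here Q1 = 2 and Q3 = 9, so the fences are -8.5 and 19.5: the whiskers end at 1 and 10, and 20 is plotted as a separate point.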


Both show the distribution, but they differ: a histogram has two axes (frequencies and class intervals), whereas a box plot has a single axis (values only).

# How to make a box plot by using min and max

![image.png](attachment:48facfa7-e0c0-4998-8257-08d287501ca9.png)

![image.png](attachment:94f78675-81b3-4679-a15f-197d15fa4443.png)

![image.png](attachment:0e772635-046c-40ca-8b1a-1ead3dcce86d.png)

![image.png](attachment:f6a780b7-f31c-4465-9d9e-62ee5d67ff5a.png)

![image.png](attachment:9a2f17dd-d644-4880-97ca-16c2883d7376.png)

![image.png](attachment:f90119f8-1a49-438a-9a45-dd59cfc9ca4b.png)

# How to make a box plot using the 1.5*IQR rule

![image.png](attachment:cdea5409-050c-4211-8168-118ddfef772a.png)

![image.png](attachment:ade15a8c-0179-471e-9867-6c3586c5974e.png)

![image.png](attachment:d4e460ad-59be-48c8-ae20-c697c310d52d.png)

![image.png](attachment:c2795696-a0a0-467f-b34e-24a122abf835.png)

![image.png](attachment:44a6304b-95ba-4171-86aa-e83d17127e3c.png)

In [52]:
# box plot values except median
min_value = np.min(samples)  # named min_value so the built-in min() is not shadowed
max_value = np.max(samples)  # named max_value so the built-in max() is not shadowed
q1 = np.percentile(samples, 25)
q3 = np.percentile(samples, 75)
iqr = stats.iqr(samples)
print("Min value in the array is", min_value, "\nMax value in the array is", max_value, "\nQ1 is", q1, "\nQ3 is", q3, "\nIQR is", iqr)

Min value in the array is 55.85147801433594 
Max value in the array is 148.6135659884155 
Q1 is 89.52143439932422 
Q3 is 109.56087632588168 
IQR is 20.03944192655746


In [19]:
import numpy as np
from scipy import stats

salary = [102, 33, 26, 27, 30, 25, 33, 33, 24]

print("Range: ", (np.max(salary)-np.min(salary)))

print("Variance: ", (np.var(salary)))

print("Std: ", (np.std(salary)))

print("Q1:", (np.percentile(salary, 25)))

print("Q2:", (np.percentile(salary, 50)))  #q2 is also called median

print("Q3:", (np.percentile(salary, 75)))

print("IQR:", (stats.iqr(salary)))

Range:  78
Variance:  539.5555555555555
Std:  23.22833518691246
Q1: 26.0
Q2: 30.0
Q3: 33.0
IQR: 7.0


In [18]:
# With pandas describe() we can see the count, mean, std, min, max, Q1, median and Q3 of a Series.
import pandas
lst = pandas.Series([1, 1, 2, 2, 4, 6.8, 7, 8, 8.3, 9, 10, 10, 11.5])
lst.describe()

count    13.000000
mean      6.200000
std       3.737869
min       1.000000
25%       2.000000
50%       7.000000
75%       9.000000
max      11.500000
dtype: float64

In [31]:
lst = [1, 1, 2, 2, 4, 6.8, 7, 8, 8.3, 9, 10, 10, 11.5]

print("Mean of the list is", np.mean(lst))
print("Median of the list is",np.median(lst))
print("Variance of the list is", np.var(lst))
print("SD of the list is", np.std(lst))
print("Range of the list is", np.max(lst) - np.min(lst))
print("Q1 of the list is", np.percentile(lst, 25))
print("Q3 of the list is", np.percentile(lst, 75))
print("Q2 of the list is", np.percentile(lst, 50))  # this is median
print("IQR (Interquartile range) of the list is:", (stats.iqr(lst)))  # use lst, not the salary list from the earlier cell


Mean of the list is 6.199999999999999
Median of the list is 7.0
Variance of the list is 12.896923076923077
SD of the list is 3.5912286305557153
Range of the list is 10.5
Q1 of the list is 2.0
Q3 of the list is 9.0
Q2 of the list is 7.0
IQR (Interquartile range) of the list is: 7.0
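The standard deviation reported by `describe()` (3.737869) differs from the `np.std()` result (3.5912...) for the same list. The reason is the divisor: `np.std()` divides by N by default (`ddof=0`), whereas pandas uses the sample formula with n - 1 (`ddof=1`). A minimal sketch using only NumPy:

```python
import numpy as np

lst = [1, 1, 2, 2, 4, 6.8, 7, 8, 8.3, 9, 10, 10, 11.5]

population_std = np.std(lst)      # ddof=0 (default): divide by N
sample_std = np.std(lst, ddof=1)  # ddof=1: divide by n - 1, as pandas does

print("Population STD:", population_std)
print("Sample STD:    ", sample_std)
```

Which one to report depends on whether the data are treated as the whole population or as a sample drawn from it.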


In [None]:
# Q1 – (1.5 * IQR) = 2 - 10.5 = -8.5

# Q3 + (1.5 * IQR) = 9+10.5 = 19.5

# So any number outside the interval from -8.5 to 19.5 is an outlier. There is none in this list, but 20 would be one.

TASK: Randomly generate 1,000 samples from the normal distribution using np.random.normal() (mean = 100, standard deviation = 15).

np.random.normal(loc=0.0, scale=1.0, size=None) #you need to modify this code.

loc will be equal to mean, scale will be equal to std deviation, size will be equal to sample size.

In [35]:
import numpy as np
import scipy
import pandas as pd
samples = np.random.normal(loc=100, scale=15, size=1000)

In [57]:
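from scipy import stats

mean = np.mean(samples)
median = np.median(samples)
# Continuous random samples are almost surely all distinct, so the reported
# mode is simply the smallest value with a count of 1 and is not informative.
mode = stats.mode(samples)
print("Mean is", mean, "\nMedian is", median, "\nMode is", mode)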

Mean is 99.35804184957564 
Median is 99.5922512696162 
Mode is ModeResult(mode=array([55.85147801]), count=array([1]))


In [38]:
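# Compute the min, max, Q1, Q3, and interquartile range

min_value = np.min(samples)  # named min_value so the built-in min() is not shadowed
max_value = np.max(samples)  # named max_value so the built-in max() is not shadowed
q1 = np.percentile(samples, 25)
q3 = np.percentile(samples, 75)
iqr = stats.iqr(samples)
print(min_value, max_value, q1, q3, iqr, sep= "\n")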

51.57352991295932
153.488671264277
89.58552577827659
109.57715733975083
19.991631561474236


In [39]:
# compute the variance and STD
variance = np.var(samples)
std_dev = np.std(samples)
print(variance, std_dev, sep= "\n")

229.75567761769867
15.157693677393624


In [40]:
# Compute the skewness and kurtosis
# We can use scipy.stats.skew and scipy.stats.kurtosis

skewness = scipy.stats.skew(samples)
kurtosis = scipy.stats.kurtosis(samples)
print(skewness, kurtosis, sep="\n")

-0.05297051435185616
-0.07774200337096016


# Absolute Deviation

In [2]:
import numpy as np


In [8]:
# Absolute deviation
# Grades of 8 students. Find the mean absolute deviation.

grades = [34,38,46,50,64,72,85,91]

# sum of the absolute differences between each value and the mean, divided by the number of values

mean_grades = np.mean(grades)  # compute the mean once instead of once per element
abs_dev = sum([abs(i-mean_grades) for i in grades])/len(grades)
print(abs_dev)

18.0
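The same quantity can also be computed in vectorized form. This is a sketch; the helper name `mean_absolute_deviation` is hypothetical and not a NumPy function.

```python
import numpy as np


def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    arr = np.asarray(values, dtype=float)
    return float(np.mean(np.abs(arr - arr.mean())))


grades = [34, 38, 46, 50, 64, 72, 85, 91]
print(mean_absolute_deviation(grades))
```

Unlike the variance, the deviations are not squared, so a single extreme value influences this measure less strongly than it influences the standard deviation.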


In [5]:
grades = [34,38,46,50,64,72,85,91]
A = np.mean(grades)

total = 0  # named total so the built-in sum() is not shadowed
for i in range(len(grades)):
    av = abs(grades[i] - A)   # Absolute value of the difference between each data point and A
    # Summing all those absolute values
    total = total + av

# Total divided by length of data yields
# the absolute deviation
print(total / len(grades))
print(np.std(grades))


18.0
20.006249023742555


In [12]:
data = [75, 69, 56, 46, 47, 79, 92, 97, 89, 88,
        36, 96, 105, 32, 116, 101, 79, 93, 91, 112]

# Choose the point A about which the absolute deviation is calculated (here, the mean)
A = np.mean(data)

total = 0  # Initialize the running total to 0 (named total so the built-in sum() is not shadowed)

# Absolute deviation calculation
for i in range(len(data)):
    av = abs(data[i] - A)   # Absolute value of the difference
                            # between each data point and A
    # Summing all those absolute values
    total = total + av

# Total divided by length of data yields
# the absolute deviation
print(total / len(data))

20.055000000000003


In [4]:
numbers = [3,4,6,7,10]

mean_numbers = np.mean(numbers)
print(mean_numbers)
abs_val_numbers = sum([abs(i - mean_numbers) for i in numbers])/len(numbers)
print(abs_val_numbers)
print(np.std(numbers))

6.0
2.0
2.449489742783178
