# CENTRAL TENDENCY

**Additional online sources:**

The 8 Basic Statistics Concepts for Data Science
https://medium.com/swlh/the-8-basic-statistics-concepts-for-data-science-7b865fca92b9

The 5 Basic Statistics Concepts Data Scientists Need to Know
https://towardsdatascience.com/the-5-basic-statistics-concepts-data-scientists-need-to-know-2c96740377ae
    

# Central Tendency (Measure of Center)

**The idea behind central tendency is that a single value can summarize a whole dataset.** The **mean, median, and mode** are the three most common such summaries, and each of them is a measure of central tendency.

![image.png](attachment:8ebd6990-f7a7-4e2d-9ad0-d8b169a8a8fd.png)

The mean (the average) is the most famous measure of central tendency, but there are others, such as the median and the mode. All three are valid measures of central tendency, but depending on the data, one may be more appropriate than the others.


# Mean ( 𝜇 )

The mean is equal to the sum of the values in the dataset divided by the number of values. The number of values in the dataset will be equal to the population or sample size. The table below gives the formula for the population mean and the sample mean.

![image.png](attachment:e24dcc4c-0fef-4e9c-990c-1fe3f5f2b145.png)
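
In case the attached table does not render, the standard textbook definitions are sketched below. The symbols $N$ (population size), $n$ (sample size) and $\bar{x}$ (sample mean) are the conventional ones and are assumed here rather than read from the image.

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$

Both formulas do the same arithmetic (add the values, divide by how many there are). Only the notation differs, depending on whether the data is the whole population or a sample drawn from it.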

A major disadvantage of the mean compared with the median or the mode is that it is particularly sensitive to extreme values, also called **outliers**. A technical definition of outliers comes later; for now, they are values that are unusually small or large compared with the rest of the dataset. For example, the mean of 0, 1, 2, 3, 4, 5000 is 5010 / 6 = 835, which is far from every value in the list.

We would like a better measure of central tendency in this situation, and the **median** is often that measure.
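
As a minimal sketch of that sensitivity, the snippet below computes the mean and the median of the example list with and without its extreme value. The variable names are illustrative only.

```python
import statistics

values = [0, 1, 2, 3, 4, 5000]   # 5000 is the extreme value
trimmed = values[:-1]            # the same data without it

mean_with = statistics.mean(values)
mean_without = statistics.mean(trimmed)
median_with = statistics.median(values)
median_without = statistics.median(trimmed)

# The mean moves by hundreds, the median by half a unit.
print("mean   with / without outlier:", mean_with, "/", mean_without)
print("median with / without outlier:", median_with, "/", median_without)
```

Removing a single value changes the mean drastically, while the median barely moves.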

In [77]:
import statistics
import numpy as np
from scipy import stats  # needed for stats.mode() in the cells below

In [53]:
salaries = [102, 33, 26, 27, 30, 25, 33, 33, 24]

print(sum(salaries)/len(salaries))
# or
print(np.mean(salaries))

# A mean of 37k does not represent this company well: the outlier 102 drags the mean up. The median is the better choice here.

37.0
37.0


In [74]:
data = {'x1':[6, 5, 3, 5, 2, 7, 2, 8],'x2':range(11, 19),'group1':['A', 'B', 'A', 'C', 'C', 'C', 'B', 'A'],
        'group2':['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']}
print(np.mean(data["x1"]))
print(np.mean(data["x2"]))


4.75
14.5


In [2]:
tuple1 = (11, 3, 4, 5, 7, 9, 2)
print(statistics.mean(tuple1))  # arithmetic mean
print(sum(tuple1))
print(statistics.median(tuple1))  # median (middle value)
print(statistics.mode(tuple1))  # most frequent value; on a tie (here every value occurs once) it returns the first one (Python 3.8+)
print(statistics.multimode(tuple1))  # all most-frequent values; e.g. with two 11s and two 7s it would return [11, 7]
print(statistics.fsum(tuple1))  # same total as sum(), as a float with less rounding error (this is math.fsum, which the statistics module imports)


5.857142857142857
41
5
11
[11, 3, 4, 5, 7, 9, 2]
41.0
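
The difference between a plain running total and `fsum` only shows up with floating-point values, so the integer example above cannot reveal it. The sketch below uses ten copies of 0.1 and an explicit loop for the naive total; the loop is used on purpose, because the built-in `sum()` may itself apply error compensation on recent Python versions.

```python
import math

tenths = [0.1] * 10

# Naive accumulation: each addition rounds, and the errors add up.
naive_total = 0.0
for t in tenths:
    naive_total += t

# math.fsum keeps track of the lost low-order bits and rounds once at the end.
precise_total = math.fsum(tenths)

print(naive_total)
print(precise_total)
print(naive_total == precise_total)
```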


In [46]:
# NumPy's mean and median functions can be used as well
# (NumPy is imported as np above).
tuple1 = (11, 3, 4, 5, 7, 9, 2)
print(np.mean(tuple1))
print(np.median(tuple1))
# NumPy has no mode function, but SciPy does: scipy.stats.mode()
print(np.std(tuple1))  # standard deviation (population form, ddof=0, by default)


5.857142857142857
5.0
3.0438965360946453


In [50]:
lst = [1,2,5,6,7,3,5,6,87,8,5,3,2,1,4,4,2,2,4,4,5,5,3,2,2,3,5,6,6,7,4]

print(statistics.median(lst))  # 31 elements, so this is the exact middle value; with 32 it would be the mean of the middle two
print(np.median(lst))  # with NumPy
print()
print(statistics.mode(lst))  # the most frequent value: 2, which occurs 6 times
print(stats.mode(lst))  # stats from SciPy
print()
print(statistics.multimode(lst))  # 5 also occurs 6 times, so there are two modes
print()
print(statistics.mean(lst))  # the mean of this list is about 6.7
print(np.mean(lst))  # with NumPy
print()
print(statistics.fsum(lst))  # sum of the list (as a float)
print(sum(lst))
print()
# Counter and its most_common() method from collections also give the most frequent values
from collections import Counter

print(Counter(lst))
# keep the Counter in a variable so that most_common() can be called on it
counter_list = Counter(lst)
print(counter_list.most_common(2))  # 2 asks for the two most frequent values
print(counter_list.most_common(1))  # on a tie, the value that was seen first (2) is returned

print(stats.mode(lst))  # SciPy's stats.mode is another option

4
4.0

2
ModeResult(mode=array([2]), count=array([6]))

[2, 5]

6.741935483870968
6.741935483870968

209.0
209

Counter({2: 6, 5: 6, 4: 5, 6: 4, 3: 4, 1: 2, 7: 2, 87: 1, 8: 1})
[(2, 6), (5, 6)]
[(2, 6)]
ModeResult(mode=array([2]), count=array([6]))


In [76]:
# Let's do the same with NumPy
lst = [1,2,5,6,7,3,5,6,8,87,5,3,2,1,4,4,2,2,4,4,5,5,3,2,2,3,5,6,6,7,4]

print(np.mean(lst))
print(np.median(lst))
print(np.std(lst))  # standard deviation from numpy

6.741935483870968
4.0
14.764752582909995


# Median

The median is the **middle score of a dataset that has been sorted from smallest to largest**. It is less affected by outliers than the mean.

In [54]:
salaries = [102, 33, 26, 27, 30, 25, 33, 33, 24]
salaries.sort()
print(salaries)

median_salary = salaries[len(salaries) // 2]  # integer division: list indices must be integers, not floats
print(median_salary)
# or
print(np.median(salaries))


# The mean is 37k, but only because of the 102. The median, 30k, represents salaries at this company better.

[24, 25, 26, 27, 30, 33, 33, 33, 102]
30
30.0


This works well when the number of scores is odd, but what happens when it is even, for example with 10 scores? In that case, we take the two middle scores and average them.
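
The odd and even cases can be combined into one small helper. This is only a sketch: `median_of` is a hypothetical name, and in practice `statistics.median` or `np.median` already does the same job.

```python
def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)   # sorted() returns a new list and leaves the input untouched
    n = len(ordered)
    if n == 0:
        raise ValueError("median_of() needs at least one value")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: mean of the two middle values


odd_salaries = [102, 33, 26, 27, 30, 25, 33, 33, 24]   # 9 values
even_salaries = odd_salaries + [50]                    # 10 values

print(median_of(odd_salaries))
print(median_of(even_salaries))
```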

In [58]:
salaries = [102, 33, 26, 27, 30, 25, 33, 33, 24]
salaries.append(50)
salaries.sort()  # sorts in place; list.sort() returns None, so there is nothing useful to assign
print(salaries)

[24, 25, 26, 27, 30, 33, 33, 33, 50, 102]


In [68]:
mid = len(salaries) // 2  # index of the upper of the two middle values
median_salary = (salaries[mid - 1] + salaries[mid]) / 2
print(median_salary)

31.5


In [78]:
print(statistics.median(salaries))
print(np.median(salaries))
# with the 10 sorted salaries, both functions average the two middle values (30 and 33)

31.5
31.5


# Mean vs. Median

**While the mean is the balance point of the data, the median is the middle point.**

The mean can be strongly influenced by an outlier. It is the better choice when a dataset, especially a large one, has no outliers.

The median is far less sensitive to outliers. It is the better choice when a dataset, especially a small one, contains an outlier.
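
One way to make "balance point" and "middle point" concrete is sketched below, using the salary list from earlier in this notebook. The deviations from the mean cancel out, while the median splits this particular list into equally sized halves (with tied values around the median the split is not always exact).

```python
import statistics

salaries = [102, 33, 26, 27, 30, 25, 33, 33, 24]
mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# Balance point: positive and negative deviations from the mean sum to zero.
deviation_sum = sum(s - mean_salary for s in salaries)

# Middle point: count the values strictly below and strictly above the median.
below = sum(1 for s in salaries if s < median_salary)
above = sum(1 for s in salaries if s > median_salary)

print("sum of deviations from the mean:", deviation_sum)
print("values below / above the median:", below, "/", above)
```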

**Generally, if the shape of the distribution is:**

Perfectly symmetric, the mean equals the median.

Skewed to the right, the mean is larger than the median.

Skewed to the left, the mean is smaller than the median.

A long tail pulls the mean toward itself. "Positive" and "negative" skew do not mean a good or a bad distribution; think of a number line, with positive to the right and negative to the left.

![image.png](attachment:715bbcdb-b183-4516-ae2f-02b89e30dc4a.png)
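
The relationship between skew, mean and median can be checked on small made-up lists. The three datasets below are hypothetical and chosen only so that the tail direction is obvious.

```python
import statistics

datasets = {
    "right-skewed": [1, 2, 2, 3, 3, 3, 4, 20],          # long tail toward large values
    "left-skewed": [-20, -4, -3, -3, -3, -2, -2, -1],   # mirror image, tail toward small values
    "symmetric": [1, 2, 3, 4, 5],
}

results = {}
for name, data in datasets.items():
    results[name] = (statistics.mean(data), statistics.median(data))
    mean_value, median_value = results[name]
    print(f"{name}: mean={mean_value}, median={median_value}")
```

In each case the mean sits on the side of the longer tail, and the two agree when there is no tail to pull on.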


**FOR MEDIAN AND MODE CALCULATION IN GROUPED DATA see**: 

https://tanersayinnn.medium.com/serinin-tüm-değerlerini-dikkate-almayan-ortalamalar-61e1852692b5

# Mode

**The mode is the most frequent score in a dataset.** **It represents the highest bar in a histogram or bar chart.** Therefore, sometimes you can consider the mode as being the most popular option. **The mode is normally used for categorical data where we want to know which category is the most common.** An example of a mode is presented below.

![image.png](attachment:a069479b-b29a-45b8-bd09-44fcd4ed1ef0.png)

The histogram shows the number of doors for a sample of about 15,000 used cars. The most popular option is the 5-door car, so the mode of this dataset is 5.

If all observations are repeated an equal number of times, then the mode does not exist.
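
Python's `statistics.mode` does not signal this situation; on Python 3.8+ it simply returns the first value it encounters. A minimal sketch of a helper that makes the "no mode" case explicit is shown below. `modes_or_none` is a hypothetical name, and returning `None` is just one possible convention.

```python
import statistics
from collections import Counter


def modes_or_none(values):
    """Return the list of modes, or None when all distinct values are equally frequent."""
    counts = Counter(values)
    all_equal = len(set(counts.values())) == 1
    if all_equal and len(counts) > 1:
        return None                        # every value occurs equally often: no mode
    return statistics.multimode(values)    # one or more most-frequent values


print(modes_or_none([1, 2, 3, 4]))                            # all values occur once
print(modes_or_none([1, 1, 2, 2]))                            # all values occur twice
print(modes_or_none([102, 33, 26, 27, 30, 25, 33, 33, 24]))   # a single mode
print(modes_or_none([2, 2, 5, 5, 1]))                         # two modes
```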

In categorical data, missing values are usually filled with the mode, whatever it happens to be. Missing numerical values can be filled with the median or the mean. This is only the simplest approach; other methods exist as well.
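
A minimal sketch of this kind of filling, in plain Python, follows. The records are made up, `None` is assumed to mark a missing value, and `fill_missing` is a hypothetical helper; data-frame libraries offer their own tools for the same task.

```python
import statistics

# Hypothetical records; None marks a missing value.
colors = ["red", "blue", None, "red", "green", None, "red"]
ages = [23, 35, None, 41, 29, None, 95]


def fill_missing(values, fill_value):
    """Return a copy of values with every None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]


# Compute the statistic from the observed values only.
color_mode = statistics.mode([c for c in colors if c is not None])
age_median = statistics.median([a for a in ages if a is not None])

filled_colors = fill_missing(colors, color_mode)   # categorical: fill with the mode
filled_ages = fill_missing(ages, age_median)       # numerical: fill with the median

print(filled_colors)
print(filled_ages)
```

The median is used for the numerical column here because the observed ages include one unusually large value, which would pull a mean-based fill upward.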

In [85]:
# How to find the most frequent element

salaries = [102, 33, 26, 27, 30, 25, 33, 33, 24]

print(statistics.mode(salaries))
print(stats.mode(salaries))
print()
# or we can import Counter from collections and use its most common function

from collections import Counter
counter_salaries = Counter(salaries)
print(counter_salaries.most_common(1))
# to get only the most common value itself, index into the result:
# most_common(1)[0] is the (value, count) tuple inside the list, and most_common(1)[0][0] is its first item
print(counter_salaries.most_common(1)[0][0])
# to get only how many times it occurs
print(counter_salaries.most_common(1)[0][1])
print()

# or we can use max function by changing its key with count.
print(max(salaries, key = salaries.count))


33
ModeResult(mode=array([33]), count=array([3]))

[(33, 3)]
33
3

33


# Calculate Mean, Median and Mode with Python

We can easily calculate the mean, median and mode with Python. **We can use the "NumPy" library for the mean and median, and the "SciPy" library for the mode.**

In [73]:
import numpy as np
from scipy import stats

salary = [102, 33, 26, 27, 30, 25, 33, 33, 24]

mean_salary = np.mean(salary)
print("mean:", mean_salary)

median_salary = np.median(salary)
print("median:", median_salary)

mode_salary = stats.mode(salary)
print("mode:", mode_salary)
print("The mode value repeats {} times".format(salary.count(max(salary, key = salary.count))))

mean: 37.0
median: 30.0
mode: ModeResult(mode=array([33]), count=array([3]))
The mode value repeats 3 times


In [80]:
lst1 = [100, 55, 95, 150, 101, 99, 53, 57, 70]

print(statistics.median(lst1))
print(np.median(lst1))

95
95.0


In [82]:
# TASK : Write a program that; Finds out the most frequent number in the given list.
# Calculates its frequency.
# Prints out the result such as : the most frequent number is x and it was y times repeated

numbers = [1, 2, 3, 4, 7, 9, 5, 4, 9, 4, 3, 5, 6, 6, 8, 2, 4, 5, 8, 8, 5, 4, 3, 7, 6]
from collections import Counter

def most_frequent(numbers):
    """Return the most frequent value in numbers."""
    occurrence_count = Counter(numbers)
    return occurrence_count.most_common(1)[0][0]

most_freq = most_frequent(numbers)
count_most_freq = numbers.count(most_freq)  # list.count() returns the number of occurrences of a value

print(f"the most frequent number is {most_freq} and it was {count_most_freq} times repeated")

the most frequent number is 4 and it was 5 times repeated


In [30]:
# solution of the same question by manipulating the key parameter of max function
numbers = [1,3,7,4,3,6,0,3,6,3,7,6,6]
print(max(numbers))  # max returns the largest value in an iterable (e.g. max(11, 1, 35) returns 35)
numbers.count(max(numbers))  # how often the largest value occurs (computed but not printed)
# now pass a key function to max instead of relying on the default comparison
print(max(numbers, key = numbers.count))  # with key=numbers.count, max returns the element with the highest count
# two values in the list (3 and 6) occur 4 times; max returns the first one it encounters
print(numbers.count(3))
print(f"The most frequent number is {max(numbers, key = numbers.count)} and it was \
{numbers.count(max(numbers, key = numbers.count))} times repeated.")

7
3
4
The most frequent number is 3 and it was 4 times repeated.


![image.png](attachment:d2be6b5d-4be4-4a91-b653-6d4271f3c123.png)