# Descriptive Statistics

**Introduction to Python for Data Science** \
Course with Hacktiv8

***

**Session 9**

Sunday, 13 June 2021 • 09:00 - 12:00 WIB

- Mean
- Median
- Mode
- Variance
- Standard Deviation
- Skewness
- Percentiles
- Ranges

Tuesday, 15 June 2021 • 19:00 - 22:00 WIB

- Correlation

***

In [1]:
import math
import statistics
import scipy.stats
import numpy as np
import pandas as pd

Create a _list_, a _NumPy array_, and a _pandas Series_ holding some numeric data:

In [2]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

y = np.array(x)
y_with_nan = np.array(x_with_nan)

z = pd.Series(x)
z_with_nan = pd.Series(x_with_nan)

print(x)
print(x_with_nan, end='\n\n')

print(y)
print(y_with_nan, end='\n\n')

print(z, end='\n\n')
print(z_with_nan)

[8.0, 1, 2.5, 4, 28.0]
[8.0, 1, 2.5, nan, 4, 28.0]

[ 8.   1.   2.5  4.  28. ]
[ 8.   1.   2.5  nan  4.  28. ]

0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64

0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64


## Mean

### Sample Mean/Average

The arithmetic average of all items in a dataset.

In [3]:
native_mean = sum(x) / len(x)
stats_mean = statistics.mean(x)
numpy_mean = np.mean(y)
np_array_mean = y.mean()
pandas_mean = z.mean()

print('Native mean:', native_mean)
print('Statistics mean:', stats_mean)
print('NumPy mean:', numpy_mean)
print('NumPy array mean:', np_array_mean)
print('Pandas mean:', pandas_mean)

Native mean: 8.7
Statistics mean: 8.7
NumPy mean: 8.7
NumPy array mean: 8.7
Pandas mean: 8.7


Mean with NaN value:

In [4]:
print('Native mean with NaN:', sum(x_with_nan) / len(x_with_nan))
print('Statistics mean with NaN:', statistics.mean(x_with_nan))
print('NumPy mean with NaN (default):', np.mean(y_with_nan))
print('NumPy mean with NaN (ignored):', np.nanmean(y_with_nan))
print('Pandas mean with NaN (default):', z_with_nan.mean())
print('Pandas mean with NaN (not ignored):', z_with_nan.mean(skipna=False))

Native mean with NaN: nan
Statistics mean with NaN: nan
NumPy mean with NaN (default): nan
NumPy mean with NaN (ignored): 8.7
Pandas mean with NaN (default): 8.7
Pandas mean with NaN (not ignored): nan
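
The standard-library calls above have no option for skipping NaN, so one approach is to filter the missing values out before averaging. This is a minimal sketch, assuming NaN marks missing observations that can safely be dropped; the name `clean` is only illustrative.

```python
import math
import statistics

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Keep only the values that are not NaN before averaging.
clean = [item for item in x_with_nan if not math.isnan(item)]

mean_clean = statistics.mean(clean)
print('Mean after dropping NaN:', mean_clean)
```

The result should agree with `np.nanmean()` and with the pandas default (`skipna=True`) shown above.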


### Weighted Mean

Used for datasets whose items occur with given relative frequencies.

In [5]:
x_new = [8.0, 1, 2.5, 4, 28.0]
weights = [0.1, 0.2, 0.3, 0.25, 0.15]

wmean_range = sum(x_new[i] * weights[i] for i in range(len(x_new))) / sum(weights)

wmean_zip = sum(x_item * w_item for (x_item, w_item) in zip(x_new, weights)) / sum(weights)

wmean_np = np.average(x_new, weights=weights)

x_new_arr = np.array(x_new)
weights_arr = np.array(weights)
wmean_np_arr = (x_new_arr * weights_arr).sum() / weights_arr.sum()

print('Native weighted mean (range method):', wmean_range)
print('Native weighted mean (zip method):', wmean_zip)
print('NumPy weighted mean:', wmean_np)
print('NumPy array weighted mean:', wmean_np_arr)


Native weighted mean (range method): 6.95
Native weighted mean (zip method): 6.95
NumPy weighted mean: 6.95
NumPy array weighted mean: 6.95
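
To see why weights behave like frequencies, the sketch below uses a small hypothetical dataset (the `values` and `counts` lists are made up for illustration) and checks that the weighted mean equals the plain mean of the data written out in full.

```python
import math
import statistics

# Hypothetical example: each value paired with how many times it occurs.
values = [1, 2, 3]
counts = [2, 5, 3]

# Weighted mean: the counts act as the weights.
weighted = sum(v * c for v, c in zip(values, counts)) / sum(counts)

# The same data written out in full: [1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
expanded = [v for v, c in zip(values, counts) for _ in range(c)]
plain = statistics.mean(expanded)

print('Weighted mean:', weighted)
print('Mean of expanded data:', plain)
```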


### Harmonic Mean

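Useful when the data contains large outliers, because large values pull the harmonic mean up far less than they pull the arithmetic mean.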

In [6]:
print('x:', x)

hmean_native = len(x) / sum(1 / item for item in x)
hmean_stats = statistics.harmonic_mean(x)
hmean_scipy = scipy.stats.hmean(x)
hmean_np_arr = 1 / (1 / np.array(x)).mean()

print('Native harmonic mean', hmean_native)
print('Statistics harmonic mean', hmean_stats)
print('SciPy harmonic mean', hmean_scipy)
print('NumPy array harmonic mean', hmean_np_arr)

x: [8.0, 1, 2.5, 4, 28.0]
Native harmonic mean 2.7613412228796843
Statistics harmonic mean 2.7613412228796843
SciPy harmonic mean 2.7613412228796843
NumPy array harmonic mean 2.7613412228796843
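
The harmonic mean is also the natural average for rates. The sketch below uses a hypothetical trip (the speeds and the leg distance are invented for illustration) and compares the true average speed over two equal-length legs with the harmonic and arithmetic means.

```python
import math
import statistics

# Hypothetical trip: two legs of equal distance, driven at 40 km/h and 60 km/h.
speeds = [40, 60]
distance = 120  # km per leg (assumed value)

# True average speed = total distance / total time.
total_time = sum(distance / s for s in speeds)  # 3 h + 2 h
average_speed = (distance * len(speeds)) / total_time

print('True average speed:', average_speed)
print('Harmonic mean:     ', statistics.harmonic_mean(speeds))
print('Arithmetic mean:   ', statistics.mean(speeds))
```

The harmonic mean reproduces the true average speed, while the arithmetic mean overstates it.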


### Geometric Mean

Used to compare items whose properties differ widely in scale.

In [7]:
print('x:', x)

gmean_native = 1
for item in x:
    gmean_native *= item
gmean_native **= 1/len(x)
print('Native geometric mean:\t', gmean_native)

gmean_scipy = scipy.stats.gmean(x)
print('SciPy geometric mean:\t', gmean_scipy)

gmean_stats = statistics.geometric_mean(x)
print('Stats geometric mean:\t', gmean_stats)

x: [8.0, 1, 2.5, 4, 28.0]
Native geometric mean:	 4.677885674856041
SciPy geometric mean:	 4.67788567485604
Stats geometric mean:	 4.67788567485604
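
Multiplying many values together, as in the native loop above, can overflow or underflow for long lists. A common alternative is to average the logarithms instead. This is a sketch, assuming every value is strictly positive.

```python
import math

x = [8.0, 1, 2.5, 4, 28.0]

# exp(mean(log(x))) is algebraically the same as the n-th root of the product.
gmean_log = math.exp(sum(math.log(item) for item in x) / len(x))

print('Log-based geometric mean:', gmean_log)
```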


## Median

The middle element of a sorted dataset. When the number of elements is even, the median is the average of the two middle elements.

In [8]:
def native_median(x):
    n = len(x)
    x_sorted = sorted(x)

    print('       x:', x)
    print('sorted x:', x_sorted, end='\n\n')

    if n % 2:
        # Odd length: take the single middle element.
        median_ = x_sorted[n // 2]
    else:
        # Even length: average the two middle elements.
        index = n // 2
        median_ = (x_sorted[index-1] + x_sorted[index]) / 2
    return median_

print('median (native):', native_median(x))

       x: [8.0, 1, 2.5, 4, 28.0]
sorted x: [1, 2.5, 4, 8.0, 28.0]

median (native): 4


In [9]:
xx = [19, 3, 2, 2.0, 0.3, 5, 1, 6, 5, 1]

print('Native median:', native_median(xx))
print('NumPy median:', np.median(xx))
print('Statistics median:', statistics.median(xx))
print('Statistics median (low):', statistics.median_low(xx))
print('Statistics median (high):', statistics.median_high(xx))

       x: [19, 3, 2, 2.0, 0.3, 5, 1, 6, 5, 1]
sorted x: [0.3, 1, 1, 2, 2.0, 3, 5, 5, 6, 19]

Native median: 2.5
NumPy median: 2.5
Statistics median: 2.5
Statistics median (low): 2.0
Statistics median (high): 3
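
Because the median depends only on the middle of the sorted data, a single extreme value barely affects it. The sketch below illustrates this with a hypothetical variant of `x` in which the largest value is made 100 times larger.

```python
import statistics

x = [8.0, 1, 2.5, 4, 28.0]
# Hypothetical variant: the largest value is replaced by a much larger one.
x_extreme = [8.0, 1, 2.5, 4, 2800.0]

print('mean:  ', statistics.mean(x), '->', statistics.mean(x_extreme))
print('median:', statistics.median(x), '->', statistics.median(x_extreme))
```

The mean jumps with the outlier, while the median stays the same.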


## Mode

The value that occurs most often in a dataset.

In [10]:
u = [2, 3, 2, 8, 12]
v = [12, 15, 12, 15, 21, 15, 12]

mode_u = max((u.count(item), item) for item in set(u))[1]
mode_v = max((v.count(item), item) for item in set(v))[1]

print('Native mode (u):', mode_u)
print('Native mode (v):', mode_v)

Native mode (u): 2
Native mode (v): 15


In [11]:
u_series = pd.Series(u)
v_series = pd.Series(v)

mode_stats_v = statistics.mode(v)
mode_scipy_v = scipy.stats.mode(v)
mode_series_v = v_series.mode()

print('Statistics mode (v):', mode_stats_v, end='\n\n')
print('SciPy mode (v):', mode_scipy_v, end='\n\n')
print('Pandas series mode (v):')
mode_series_v

Statistics mode (v): 12

SciPy mode (v): ModeResult(mode=array([12]), count=array([3]))

Pandas series mode (v):


0    12
1    15
dtype: int64
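
The list `v` has two modes, since 12 and 15 each occur three times. That is why the native approach (which breaks ties by taking the larger value) and `statistics.mode()` (which, from Python 3.8, returns the first mode encountered) disagree above. A short sketch, assuming Python 3.8 or newer, that lists every mode with the standard library:

```python
import statistics

v = [12, 15, 12, 15, 21, 15, 12]

# multimode() returns every value that shares the highest count,
# in the order each was first encountered.
modes_v = statistics.multimode(v)
print('All modes (v):', modes_v)
```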

## Variance

Measures the spread of the data. It shows numerically how far the data points lie from the mean.

In [12]:
print('x:', x)

mean_x = sum(x) / len(x)
var_ = sum((item - mean_x) ** 2 for item in x) / (len(x) - 1)

print('Native variance:', var_)

x: [8.0, 1, 2.5, 4, 28.0]
Native variance: 123.19999999999999


In [13]:
var_stats = statistics.variance(x)
var_numpy = np.var(y, ddof=1)
var_series = z.var(ddof=1)
var_scipy = scipy.stats.tvar(x)

print('Statistics variance:', var_stats)
print('NumPy variance:', var_numpy)
print('Pandas series variance:', var_series)
print('SciPy variance:', var_scipy)

Statistics variance: 123.2
NumPy variance: 123.19999999999999
Pandas series variance: 123.19999999999999
SciPy variance: 123.19999999999999
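
All the calls above compute the sample variance, which divides by `n - 1` (`ddof=1`). Note that `np.var()` defaults to `ddof=0` (population variance), whereas the pandas `.var()` method defaults to `ddof=1`. The sketch below shows both denominators side by side using only the standard library.

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)
mean_x = sum(x) / n
sum_sq = sum((item - mean_x) ** 2 for item in x)

sample_var = sum_sq / (n - 1)   # ddof=1, same quantity as statistics.variance
population_var = sum_sq / n     # ddof=0, same quantity as statistics.pvariance

print('Sample variance:    ', sample_var)
print('Population variance:', population_var)
```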


## Standard Deviation

A measure of the typical distance between each value and the mean.

A low standard deviation indicates that the data points tend to lie close to the mean of the dataset, whereas a high standard deviation indicates that the data points are spread over a wider range of values.

In [14]:
print('x:', x)

std_ = var_ ** .5
std_stats = statistics.stdev(x)
std_numpy = np.std(y, ddof=1)
std_series = z.std(ddof=1)

print('Native standard deviation:       ', std_)
print('Statistics standard deviation:   ', std_stats)
print('NumPy standard deviation:        ', std_numpy)
print('Pandas series standard deviation:', std_series)

x: [8.0, 1, 2.5, 4, 28.0]
Native standard deviation:        11.099549540409285
Statistics standard deviation:    11.099549540409287
NumPy standard deviation:         11.099549540409285
Pandas series standard deviation: 11.099549540409285


## Skewness

A measure of the asymmetry of the probability distribution of a _real-valued random variable_ about its _mean_. Skewness can be positive, negative, or undefined. If the _skewness_ is close to 0 (for example between −0.5 and 0.5), the dataset is considered fairly symmetric.

In [15]:
print('x:', x)

skew_scipy = scipy.stats.skew(x, bias=False)
skew_series = z.skew()

print('SciPy skewness:', skew_scipy)
print('Pandas series skewness:', skew_series)

x: [8.0, 1, 2.5, 4, 28.0]
SciPy skewness: 1.9470432273905927
Pandas series skewness: 1.9470432273905924
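
For reference, the same number can be reproduced in pure Python from the second and third central moments. This is a sketch of the sample-size-adjusted (Fisher-Pearson) coefficient, which is what `bias=False` asks SciPy for; the variable names are illustrative.

```python
import math

x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)
mean_x = sum(x) / n

# Biased central moments of order 2 and 3.
m2 = sum((item - mean_x) ** 2 for item in x) / n
m3 = sum((item - mean_x) ** 3 for item in x) / n

skew_biased = m3 / m2 ** 1.5
# Sample-size correction (adjusted Fisher-Pearson coefficient).
skew_adjusted = skew_biased * math.sqrt(n * (n - 1)) / (n - 2)

print('Biased skewness:  ', skew_biased)
print('Adjusted skewness:', skew_adjusted)
```

The positive value reflects the long right tail created by the value 28.0.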


### Kurtosis

A measure of whether the data is heavy-tailed (many outliers) or light-tailed (few outliers) relative to a normal distribution.

The main difference between skewness and kurtosis is that skewness describes the degree of asymmetry, whereas kurtosis describes how strongly outliers are present in the distribution.
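
This section has no code cell, so here is a minimal sketch that computes the (biased) excess kurtosis from central moments in pure Python. The data is the same as the `x3` list used later in this notebook, and the result should agree with the `kurtosis` field reported by `scipy.stats.describe()` in the summary section. SciPy users can call `scipy.stats.kurtosis()`, which returns this excess form by default.

```python
import math

# Same values as the x3 list used later in this notebook.
data = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
n = len(data)
mean_ = sum(data) / n

# Biased central moments of order 2 and 4.
m2 = sum((item - mean_) ** 2 for item in data) / n
m4 = sum((item - mean_) ** 4 for item in data) / n

# Excess (Fisher) kurtosis: 0 for a normal distribution.
excess_kurtosis = m4 / m2 ** 2 - 3

print('Excess kurtosis:', excess_kurtosis)
```

A negative value indicates lighter tails than a normal distribution.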

## Percentiles

A way to represent the position of a value within a dataset. To compute percentiles, the values in the dataset must always be sorted in _ascending_ order.

### Quartiles

In statistics and probability, quartiles are the values that split the data into four _quarters_ once it is sorted in _ascending_ order.

There are three quartile values:
* Q1: the 25th sample percentile. The first quartile separates roughly the smallest 25% of the items from the rest of the dataset.
* Q2: the 50th sample percentile, or the **median**. Roughly 25% of the items lie between the first and second quartiles, and another 25% between the second and third quartiles.
* Q3: the 75th sample percentile. The third quartile separates roughly the largest 25% of the items from the rest of the dataset.

To split the data into several intervals, use `statistics.quantiles()`:

Parameters: `statistics.quantiles(data, *, n=4, method='exclusive')`

* `n` = the number of equal-probability intervals to divide the data into; the function returns the `n - 1` cut points between them (_default_: `4`)
* `method` = how the cut points are computed: `exclusive` treats the data as a sample from a population that may contain values more extreme than those observed, while `inclusive` treats the data as already containing the population's minimum and maximum (_default_: `exclusive`)

In [16]:
x3 = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]

print(statistics.quantiles(x3))
print(statistics.quantiles(x3, method="inclusive"))

[-0.5, 8.0, 23.4]
[0.1, 8.0, 21.0]
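
The two methods differ only in where the cut points are placed along the sorted data. The function below is a simplified sketch of that idea using linear interpolation. `quartile_cut_points` is a hypothetical helper written for this illustration, not the library's implementation, and it may differ from `statistics.quantiles()` on very small datasets.

```python
import math

def quartile_cut_points(data, method="exclusive"):
    """Return [Q1, Q2, Q3] using linear interpolation between sorted values."""
    values = sorted(data)
    n = len(values)
    cuts = []
    for i in (1, 2, 3):
        if method == "inclusive":
            # Positions are spread over the n - 1 gaps between the observed values.
            pos = i * (n - 1) / 4
        else:
            # Positions are spread over n + 1 gaps, leaving room beyond both ends.
            pos = i * (n + 1) / 4 - 1
            pos = min(max(pos, 0), n - 1)  # keep the index inside the data
        lower = math.floor(pos)
        upper = min(lower + 1, n - 1)
        frac = pos - lower
        cuts.append(values[lower] + (values[upper] - values[lower]) * frac)
    return cuts

x3 = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
print(quartile_cut_points(x3))
print(quartile_cut_points(x3, method="inclusive"))
```

For `x3` this reproduces the two results printed above, up to floating-point rounding.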


`np.percentile()` can also be used to compute any sample percentile of a dataset:

In [17]:
print(np.percentile(x3, 50))
print(np.percentile(x3, [0, 25, 50, 75, 100]))

8.0
[-5.   0.1  8.  21.  41. ]


To ignore `nan` values, use `np.nanpercentile()` instead:

In [18]:
x3_with_nan = np.insert(x3, 2, np.nan)
print(x3_with_nan)
print(np.nanpercentile(x3_with_nan, [25, 50, 75]))

[-5.  -1.1  nan  0.1  2.   8.  12.8 21.  25.8 41. ]
[ 0.1  8.  21. ]


NumPy also offers very similar functionality in `quantile()` and `nanquantile()`. When using them, pass the quantile values as numbers between 0 and 1 rather than as percentiles:

In [19]:
print(np.quantile(x3, 0.05))
print(np.quantile(x3, 0.95))
print(np.quantile(x3, [0.25, 0.5, 0.75]))
print(np.nanquantile(x3_with_nan, [0.25, 0.5, 0.75]))

-3.44
34.919999999999995
[ 0.1  8.  21. ]
[ 0.1  8.  21. ]


pandas Series objects also have a `.quantile()` method:

In [20]:
z, z_with_nan = pd.Series(x3), pd.Series(x3_with_nan)

print(z.quantile(0.05), end='\n\n')
print(z.quantile(0.95), end='\n\n')
print(z.quantile([0.25, 0.5, 0.75]), end='\n\n')
print(z_with_nan.quantile([0.25, 0.5, 0.75]))

-3.44

34.919999999999995

0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64

0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64


## Ranges

The difference between the maximum and minimum elements of a dataset.

In [21]:
print(x3)
print(np.ptp(x3))
print(max(x3) - min(x3))

[-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
46.0
46.0


### _Interquartile Range_

The difference between the third and first quartiles.

In [22]:
print(x3)
print(np.percentile(x3, 75) - np.percentile(x3, 25))
print(statistics.quantiles(x3, method="inclusive")[-1] - statistics.quantiles(x3, method="inclusive")[0])

[-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
20.9
20.9


## _Summary of Descriptive Statistics_

In [23]:
result = scipy.stats.describe(x3)
print(result)
print('')
print('Number of observations:', result.nobs)
print('Min:', result.minmax[0])
print('Max:', result.minmax[1])
print('Mean:', result.mean)
print('Variance:', result.variance)
print('Skewness:', result.skewness)
print('Kurtosis:', result.kurtosis)

DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.763007130834308, kurtosis=-0.522454225944291)

Number of observations: 9
Min: -5.0
Max: 41.0
Mean: 11.622222222222222
Variance: 228.75194444444446
Skewness: 0.763007130834308
Kurtosis: -0.522454225944291


In [24]:
pd_result = z.describe()
pd_result

count     9.000000
mean     11.622222
std      15.124548
min      -5.000000
25%       0.100000
50%       8.000000
75%      21.000000
max      41.000000
dtype: float64

---

## Correlation

Examines the relationship between corresponding elements of two variables/features in a dataset.

Ways a pair of variables can be correlated:

-   Positive correlation

    A relationship between two variables in which an increase in one variable goes together with an increase in the other. Conversely, the smaller the value of one variable, the smaller the value of the other tends to be. In other words, the two variables move in the same direction.<sup>[[1]]</sup>


-   Negative correlation

    A relationship between two variables in which an increase in one variable goes together with a decrease in the other. Likewise, the smaller the value of one variable, the larger the value of the other. The two variables move in opposite directions.<sup>[[1]]</sup>

-   Weak or no correlation

    Occurs when the two variables show no linear relationship.

```
Reference: [1] https://yuvalianda.com/analisis-korelasi/
```

Two statistics that measure the correlation between datasets are _covariance_ and the _correlation coefficient_.

In [25]:
x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

print(x)
print(y)

x_arr, y_arr = np.array(x), np.array(y)
x_ser, y_ser = pd.Series(x_arr), pd.Series(y_arr)

[-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]


### Covariance

A measure of the direction and strength of the relationship between a pair of variables.

- If the correlation is positive, then the covariance is positive, as well. A stronger relationship corresponds to a higher value of the covariance.
- If the correlation is negative, then the covariance is negative, as well. A stronger relationship corresponds to a lower (or higher absolute) value of the covariance.
- If the correlation is weak, then the covariance is close to zero.

Computing the covariance in pure Python:

In [26]:
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n)) / (n - 1))
print(cov_xy)

19.95


In [27]:
cov_np = np.cov(x_arr, y_arr)
print(cov_np)

[[38.5        19.95      ]
 [19.95       13.91428571]]


In [28]:
var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
print(var_x)
print(var_y)

38.5
13.914285714285711
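
One caveat is that the size of the covariance depends on the units of the variables. The sketch below wraps the pure-Python formula in a hypothetical helper, `sample_covariance`, and shows that multiplying `y` by 10 multiplies the covariance by 10 even though the relationship itself is unchanged. Python 3.10 and newer also provide `statistics.covariance()` for the same calculation.

```python
import math

def sample_covariance(a, b):
    """Sample covariance with the n - 1 denominator used above."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

# Same relationship expressed in a different unit (for example mm instead of cm).
y_scaled = [10 * item for item in y]

cov_xy = sample_covariance(x, y)
cov_xy_scaled = sample_covariance(x, y_scaled)

print('Covariance (x, y):       ', cov_xy)
print('Covariance (x, 10 * y):  ', cov_xy_scaled)
```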


### _Correlation Coefficient_

Also called the _Pearson product-moment correlation coefficient_ and denoted by the symbol 𝑟. The coefficient is another measure of the correlation between data. We can think of it as a _standardized covariance_. Some of its properties:

- The value 𝑟 > 0 indicates positive correlation.
- The value 𝑟 < 0 indicates negative correlation.
- The value 𝑟 = 1 is the maximum possible value of 𝑟. It corresponds to a perfect positive linear relationship between variables.
- The value 𝑟 = −1 is the minimum possible value of 𝑟. It corresponds to a perfect negative linear relationship between variables.
- The value 𝑟 ≈ 0, or when 𝑟 is around zero, means that the correlation between variables is weak.




In [29]:
corr_xy = cov_np[0, 1] / (var_x**.5 * var_y**.5)
print(corr_xy)

0.861950005631606
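
The same coefficient can be computed directly from the deviations, without building the covariance matrix first. The helper `pearson_r` below is a hypothetical pure-Python sketch. In practice `np.corrcoef()`, `scipy.stats.pearsonr()`, or the pandas `Series.corr()` method would normally be used instead.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    dev_a = [ai - mean_a for ai in a]
    dev_b = [bi - mean_b for bi in b]
    # The (n - 1) factors of the covariance and the standard deviations cancel out.
    numerator = sum(da * db for da, db in zip(dev_a, dev_b))
    denominator = math.sqrt(sum(da ** 2 for da in dev_a) * sum(db ** 2 for db in dev_b))
    return numerator / denominator

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

r_xy = pearson_r(x, y)
# Rescaling y changes the covariance but leaves r unchanged.
r_xy_scaled = pearson_r(x, [10 * item for item in y])

print('r(x, y):     ', r_xy)
print('r(x, 10 * y):', r_xy_scaled)
```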


In [30]:
print("x:", x)
print("y:", y)

x: [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y: [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]


In [31]:
z = list(range(20, -1, -1))
print("z:", z, len(z))
cov_xz = np.cov(x, z)
corr_xz = cov_xz[0, 1] / (cov_xz[0, 0]**.5 * cov_xz[1, 1]**.5)

print(corr_xz)

z: [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0] 21
-1.0
