# Descriptive Statistics

Descriptive statistics help summarize, organize, and make sense of data. They describe the data at hand and do not draw inferential conclusions from it.

## Descriptive statistics measures

### Central tendency
* average/mean
* median
* mode

### Dispersion (spread)
* standard deviation
* mean deviation
* variance
* percentile
* quartile
    
## Shape of the curve

### By central value (skewness)
* positive skew
* symmetrical (zero skew)
* negative skew

### By spread (kurtosis)
* leptokurtic
* mesokurtic
* platykurtic
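
The shape categories above can be checked numerically. The following is a minimal sketch on made-up samples (the data and the helper `kurtosis_label` are illustrative assumptions, not part of the original notebook). It relies on `scipy.stats.kurtosis` returning excess kurtosis by default, so a normal distribution scores 0.

```python
import numpy as np
import scipy.stats

# Illustrative samples (assumed data, chosen only to show the sign of each measure)
right_skewed = np.array([1, 1, 1, 2, 2, 3, 10])     # long tail on the right
symmetric = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])   # evenly spread, flat shape

skew_right = scipy.stats.skew(right_skewed)   # positive for a right tail
skew_sym = scipy.stats.skew(symmetric)        # about zero for symmetric data
kurt_sym = scipy.stats.kurtosis(symmetric)    # excess kurtosis (normal = 0)


def kurtosis_label(excess, tol=1e-9):
    """Hypothetical helper: name the shape from the excess kurtosis."""
    if excess > tol:
        return "leptokurtic"   # sharper peak, heavier tails than normal
    if excess < -tol:
        return "platykurtic"   # flatter peak, lighter tails than normal
    return "mesokurtic"        # comparable to the normal distribution


print(skew_right, skew_sym, kurt_sym, kurtosis_label(kurt_sym))
```

A positive skewness means the tail extends to the right, a negative one to the left.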

## Correlation
* no correlation
* positive correlation
* negative correlation

**CENTRAL TENDENCY CODE**

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

In [2]:
x = [8.0, 1, 2, 1.5, 8, 17]
x_with_nan = [8.0, 1, 2, math.nan, 1.5, 8, 17]

print(x)
print(x_with_nan)

[8.0, 1, 2, 1.5, 8, 17]
[8.0, 1, 2, nan, 1.5, 8, 17]


In [4]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)

In [5]:
y

array([ 8. ,  1. ,  2. ,  1.5,  8. , 17. ])

In [6]:
y_with_nan

array([ 8. ,  1. ,  2. ,  nan,  1.5,  8. , 17. ])

In [8]:
z

0     8.0
1     1.0
2     2.0
3     1.5
4     8.0
5    17.0
dtype: float64

In [7]:
z_with_nan

0     8.0
1     1.0
2     2.0
3     NaN
4     1.5
5     8.0
6    17.0
dtype: float64

In [26]:
#Central tendency data y
mean_y = np.mean(y)
median_y = np.median(y)
mode_y = statistics.mode(y)

print(mean_y)
print(median_y)
print(mode_y)

6.25
5.0
8.0


In [27]:
#Central tendency data y_with_nan
mean_y_with_nan = np.nanmean(y_with_nan)
median_y_with_nan = np.nanmedian(y_with_nan)
mode_y_with_nan = statistics.mode(y_with_nan)

print(mean_y_with_nan)
print(median_y_with_nan)
print(mode_y_with_nan)

6.25
5.0
8.0


In [28]:
#Central tendency data z
mean_z = z.mean()
median_z = z.median()
mode_z = statistics.mode(z)

print(mean_z)
print(median_z)
print(mode_z)

6.25
5.0
8.0


In [29]:
#Central tendency data z_with_nan
mean_z_with_nan = z_with_nan.mean()
median_z_with_nan = z_with_nan.median()
mode_z_with_nan = statistics.mode(z_with_nan)

print(mean_z_with_nan)
print(median_z_with_nan)
print(mode_z_with_nan)

6.25
5.0
8.0
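
The cells above use `np.nanmean` and `np.nanmedian` for the array that contains `nan`. As a small sketch of why, the example below compares the plain and the NaN-aware calls on the same data. It assumes pandas' default `skipna=True` behaviour for `Series.mean`.

```python
import math
import numpy as np
import pandas as pd

data = [8.0, 1, 2, math.nan, 1.5, 8, 17]
arr = np.array(data)
ser = pd.Series(data)

plain_mean = np.mean(arr)                  # NaN propagates through the sum
nan_mean = np.nanmean(arr)                 # NaN values are ignored
series_mean = ser.mean()                   # pandas skips NaN by default
series_mean_keep = ser.mean(skipna=False)  # opt out of skipping: result is NaN

print(plain_mean, nan_mean, series_mean, series_mean_keep)
```

Note that `statistics.mode` treats `nan` as an ordinary value, so it only returns the expected mode here because 8.0 occurs more often than `nan`.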


**WEIGHTED MEAN CODE**

In [32]:
# Weights for the six values in x (defined above), one weight per value
w = [0.7, 0.8, 0.1, 0.2, 0.3, 0.7]

In [37]:
# Weighted mean: sum(w_i * x_i) / sum(w_i)
wmean = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
wmean

7.57142857142857

In [39]:
wmeanbp = np.average(y, weights=w)
wmeanbp

7.57142857142857
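
Two properties of the weighted mean are easy to verify. The sketch below reuses the same values and weights as above (redefined so that it runs on its own) and checks that equal weights give the ordinary mean, and that rescaling all weights by a constant leaves the result unchanged.

```python
import numpy as np

values = np.array([8.0, 1, 2, 1.5, 8, 17])
weights = np.array([0.7, 0.8, 0.1, 0.2, 0.3, 0.7])

# Equal weights reduce the weighted mean to the arithmetic mean
equal_weighted = np.average(values, weights=np.ones(len(values)))
plain = values.mean()

# Only the relative size of the weights matters, not their scale
original = np.average(values, weights=weights)
rescaled = np.average(values, weights=10 * weights)

print(equal_weighted, plain, original, rescaled)
```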

**HARMONIC MEAN CODE**

In [40]:
# Example: a man walks the first 5 km at 2 km/h and the next 5 km at 4 km/h.
# His average speed is the harmonic mean of the two speeds:
#   n / sum(1 / x_i)  ->  2 / (1/2 + 1/4)
# The code below applies the same formula to the list x defined earlier.

harmonic_mean = len(x) / sum(1/i for i in x)
harmonic_mean

2.423762376237624

In [41]:
harmonic_mean_stat = statistics.harmonic_mean(x)
harmonic_mean_stat

2.4237623762376237

In [42]:
harmonic_mean_scipy = scipy.stats.hmean(x)
harmonic_mean_scipy

2.423762376237624
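
The walking example from the comment can also be worked out directly. This sketch computes the average speed as total distance over total time and compares it with `statistics.harmonic_mean` of the two speeds. The variable names are illustrative.

```python
import statistics

speeds = [2, 4]        # km/h for each leg
distance_per_leg = 5   # km, the same for both legs

total_distance = distance_per_leg * len(speeds)           # 10 km
total_time = sum(distance_per_leg / s for s in speeds)    # 2.5 h + 1.25 h
average_speed = total_distance / total_time

hmean_speed = statistics.harmonic_mean(speeds)

print(average_speed, hmean_speed)  # both are about 2.667 km/h
```

The two agree because both legs cover the same distance. The arithmetic mean (3 km/h) would overstate the speed, since more time is spent on the slower leg.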

**GEOMETRIC MEAN**

In [44]:
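# Suppose laptop 1 costs 20 million and lasts 7 years,
# and laptop 2 costs 12 million and lasts 4 years.
# Each score is the geometric mean of (1 / price) and lifetime: sqrt((1 / price) * lifetime)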

laptop1_geomean = math.sqrt(1/20 * 7)
laptop2_geomean = math.sqrt(1/12 * 4)
print(laptop1_geomean, laptop2_geomean)

0.5916079783099616 0.5773502691896257
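
The same scores can be obtained from the standard library. The sketch below assumes Python 3.8 or later, where `statistics.geometric_mean` is available, and compares it with the hand-written square root for both laptops.

```python
import math
import statistics

# (1 / price, lifetime) pairs, using the figures from the example above
laptop1 = [1 / 20, 7]
laptop2 = [1 / 12, 4]

laptop1_score = statistics.geometric_mean(laptop1)
laptop2_score = statistics.geometric_mean(laptop2)

# For two values the geometric mean is the square root of their product
laptop1_by_hand = math.sqrt(laptop1[0] * laptop1[1])
laptop2_by_hand = math.sqrt(laptop2[0] * laptop2[1])

print(laptop1_score, laptop2_score)
```

Under this scoring, the higher value indicates more years of use per unit of price, so laptop 1 comes out slightly ahead.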


**MEDIAN**

In [45]:
# To compute the median, sort the data first (the code below sorts in ascending order); if n % 2 == 0 ...

median_data = [0.7, 0.8, 0.1, 0.2, 0.3, 0.7]
median_data.sort()
median_data

[0.1, 0.2, 0.3, 0.7, 0.7, 0.8]

In [46]:
median_stat_high = statistics.median_high(median_data)
median_stat_high

0.7

In [47]:
median_stat_low = statistics.median_low(median_data)
median_stat_low

0.3
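
`median_high` and `median_low` return the upper and lower of the two middle values. For comparison, the sketch below shows the usual convention for an even number of observations, which averages the two middle values, and checks it against `statistics.median`.

```python
import statistics

median_data = sorted([0.7, 0.8, 0.1, 0.2, 0.3, 0.7])
n = len(median_data)

if n % 2 == 0:
    # Even count: average the two middle values
    manual_median = (median_data[n // 2 - 1] + median_data[n // 2]) / 2
else:
    # Odd count: take the single middle value
    manual_median = median_data[n // 2]

library_median = statistics.median(median_data)

print(manual_median, library_median)
```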

**MODE**

In [54]:
modus_data = [1,1,1,3,4,5,6,7,8,9,1]
modus = max((modus_data.count(i),i) for i in set(modus_data))[1]
modus

1

In [55]:
modus_stat = statistics.mode(modus_data)
modus_stat

1
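
Both approaches above return a single value even when several values tie for the highest count. As a sketch of how ties can be surfaced, `statistics.multimode` (Python 3.8 or later) returns every mode. The list `tied` is made-up data for illustration.

```python
import statistics

tied = [1, 1, 2, 2, 3]   # 1 and 2 both occur twice

all_modes = statistics.multimode(tied)

# The manual approach from above picks the largest value among the ties
manual_mode = max((tied.count(i), i) for i in set(tied))[1]

print(all_modes, manual_mode)
```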

**VARIANCE**

In [56]:
# Sample variance (divides by n - 1): once with the statistics module, once by hand
variancestate = statistics.variance(modus_data)
mean_ = sum(modus_data) / len(modus_data)
variance = sum((i - mean_) ** 2 for i in modus_data) / (len(modus_data) - 1)
print(variance, variancestate)

9.163636363636362 9.163636363636364
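
The value above is the sample variance, which divides by n - 1. The sketch below contrasts it with the population variance, which divides by n, and shows that `np.var` computes the population version unless `ddof=1` is passed.

```python
import statistics
import numpy as np

data = [1, 1, 1, 3, 4, 5, 6, 7, 8, 9, 1]

sample_var = statistics.variance(data)        # divides by n - 1
population_var = statistics.pvariance(data)   # divides by n

np_population = np.var(data)                  # ddof=0 by default
np_sample = np.var(data, ddof=1)              # matches statistics.variance

print(sample_var, population_var, np_population, np_sample)
```

The same `ddof` argument applies to `np.std`, while `statistics.stdev` and `statistics.pstdev` mirror the sample and population variants.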


**STANDARD DEVIATION**

In [57]:
stdv = math.sqrt(variance)
stdvStat = statistics.stdev(modus_data)
print(stdv, stdvStat)

3.027149874657078 3.027149874657078


**SKEWNESS**

In [58]:
skwness = scipy.stats.skew(modus_data)
skwness

0.24935156205469858

**PERCENTILE**

In [60]:
# q=100 requests the 100th percentile, which equals the maximum of the data
percentile_100 = np.percentile(modus_data, 100)
percentile_100

9.0
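
Quartiles are the 25th, 50th and 75th percentiles. The sketch below computes them for the same data in one call, along with the interquartile range. It assumes NumPy's default linear interpolation between neighbouring values.

```python
import numpy as np

data = [1, 1, 1, 3, 4, 5, 6, 7, 8, 9, 1]

# Passing a list of percentages returns one value per percentage
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1   # interquartile range: spread of the middle half of the data

print(q1, q2, q3, iqr)
```

The middle value `q2` is the median, so it matches `np.median(data)`.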

**RANGE (MAX VAL - MIN VAL)**

In [61]:
describeData = scipy.stats.describe(modus_data)
describeData

DescribeResult(nobs=11, minmax=(1, 9), mean=4.181818181818182, variance=9.163636363636364, skewness=0.24935156205469858, kurtosis=-1.3784722222222225)
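
`scipy.stats.describe` reports the minimum and maximum but not the range itself. A minimal sketch of deriving it from the `minmax` field, with `np.ptp` ("peak to peak") as a one-call alternative:

```python
import numpy as np
import scipy.stats

data = [1, 1, 1, 3, 4, 5, 6, 7, 8, 9, 1]

result = scipy.stats.describe(data)
low, high = result.minmax      # (minimum, maximum)
value_range = high - low       # range = max value - min value

np_range = np.ptp(data)        # same quantity computed by NumPy

print(value_range, np_range)
```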

**CORRELATION**

In [65]:
# x and y are reassigned here; y is random and unseeded, so its values change on every run
x = np.array(range(0,10))
y = np.random.rand(10)
print(x, y)

[0 1 2 3 4 5 6 7 8 9] [0.07818741 0.79452355 0.55012285 0.82444301 0.48751851 0.21704292
 0.98127607 0.73176495 0.3934123  0.73184246]


In [66]:
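# Covariance matrix: the diagonal holds the variances of x and y, the off-diagonal their covariance
covarianxy = np.cov(x,y)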
covarianxy

array([[9.16666667, 0.23240838],
       [0.23240838, 0.08229871]])
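
The matrix above holds covariances, which depend on the scale of the data. To get a correlation coefficient between -1 and 1, the covariance is divided by the product of the two standard deviations. The sketch below does this by hand and compares the result with `np.corrcoef`. The `y` values are copied from the printed sample output above, so they are rounded to eight decimals.

```python
import numpy as np

x = np.arange(10)
# Values copied from the sample output above (rounded as printed)
y = np.array([0.07818741, 0.79452355, 0.55012285, 0.82444301, 0.48751851,
              0.21704292, 0.98127607, 0.73176495, 0.3934123, 0.73184246])

cov = np.cov(x, y)

# Pearson correlation: cov(x, y) / (std(x) * std(y))
r_from_cov = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

# np.corrcoef returns the full correlation matrix; [0, 1] is r between x and y
r = np.corrcoef(x, y)[0, 1]

print(r_from_cov, r)
```

A positive `r` indicates a positive correlation, a negative `r` a negative one, and a value near zero little or no linear relationship.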

In [67]:
twod = np.array([
    [1,1,1],
    [1,2,4],
    [4,8,9],
    [5,6,7]
    
])
twod

array([[1, 1, 1],
       [1, 2, 4],
       [4, 8, 9],
       [5, 6, 7]])

In [68]:
np.mean(twod)

4.083333333333333

In [69]:
twod.mean()

4.083333333333333

In [71]:
np.median(twod)

4.0

In [75]:
# Despite the names, these are means: axis=0 averages each column, axis=1 averages each row
xmed = np.mean(twod, axis=0)
ymed = np.mean(twod, axis=1)
print(xmed, ymed)

[2.75 4.25 5.25] [1.         2.33333333 7.         6.        ]


In [76]:
scipy.stats.gmean(twod)

array([2.11474253, 3.13016916, 3.9842826 ])

In [77]:
scipy.stats.gmean(twod, axis=1)

array([1.        , 2.        , 6.6038545 , 5.94392195])

In [78]:
scipy.stats.describe(twod)

DescribeResult(nobs=4, minmax=(array([1, 1, 1]), array([5, 8, 9])), mean=array([2.75, 4.25, 5.25]), variance=array([ 4.25      , 10.91666667, 12.25      ]), skewness=array([ 0.11531718,  0.13205603, -0.18515606]), kurtosis=array([-1.84775087, -1.71586737, -1.41302235]))

In [79]:
row_names = [1,2,3,4]
col_names = ["a","b","c"]

df = pd.DataFrame(twod, index = row_names, columns=col_names)
df

   a  b  c
1  1  1  1
2  1  2  4
3  4  8  9
4  5  6  7


In [80]:
df.mean()

a    2.75
b    4.25
c    5.25
dtype: float64

In [81]:
df.var()

a     4.250000
b    10.916667
c    12.250000
dtype: float64

In [82]:
df.mean(axis = 1)

1    1.000000
2    2.333333
3    7.000000
4    6.000000
dtype: float64

In [83]:
df.loc[:,"a"]

1    1
2    1
3    4
4    5
Name: a, dtype: int32

In [84]:
df['a']

1    1
2    1
3    4
4    5
Name: a, dtype: int32

In [86]:
df['a'].mean()

2.75

In [87]:
df.mean().mean()

4.083333333333333

In [89]:
df.describe()

              a         b     c
count  4.000000  4.000000  4.00
mean   2.750000  4.250000  5.25
std    2.061553  3.304038  3.50
min    1.000000  1.000000  1.00
25%    1.000000  1.750000  3.25
50%    2.500000  4.000000  5.50
75%    4.250000  6.500000  7.50
max    5.000000  8.000000  9.00
