# <a id='toc1_'></a>[Sampling and Data Preprocessing](#toc0_)
---

**Table of contents**<a id='toc0_'></a>    
- [Sampling and Data Preprocessing](#toc1_)
  - [Sampling](#toc1_1_)
    - [`random.choice`](#toc1_1_1_)
  - [Data Preprocessing](#toc1_2_)
    - [Measures of Central Tendency](#toc1_2_1_)
      - [Mean](#toc1_2_1_1_)
      - [Median](#toc1_2_1_2_)
      - [Mode](#toc1_2_1_3_)
    - [Measures of Dispersion (Spread)](#toc1_2_2_)
      - [Variance](#toc1_2_2_1_)
      - [Standard Deviation](#toc1_2_2_2_)
      - [Range](#toc1_2_2_3_)
      - [Percentile, IQR](#toc1_2_2_4_)
    - [Ranking](#toc1_2_3_)
    - [Miscellaneous](#toc1_2_4_)

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

---

## <a id='toc1_1_'></a>[Sampling](#toc0_)

### <a id='toc1_1_1_'></a>[`random.choice`](#toc0_)

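- Simple random sampling: `replace=False` draws without replacement (no repeated values), `replace=True` draws with replacement (values may repeat).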

In [1]:
import numpy as np

numbers = np.arange(1, 11)
s = np.random.choice(numbers, size=5, replace=False)
print(s)

[ 6  9 10  7  1]


In [2]:
s = np.random.choice(numbers, size=5, replace=True)
print(s)

[4 3 7 3 4]
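
The draws above change on every run because no seed is set. As a minimal sketch, assuming reproducible samples are wanted, NumPy's `Generator` API (`np.random.default_rng`) can be seeded. The seed value `42` and the variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np

numbers = np.arange(1, 11)

# Seeded generator: the same seed reproduces the same draws (42 is arbitrary)
rng = np.random.default_rng(42)

# Without replacement: every sampled value is distinct
without_repl = rng.choice(numbers, size=5, replace=False)

# With replacement: the same value may appear more than once
with_repl = rng.choice(numbers, size=5, replace=True)

print(without_repl)
print(with_repl)
```

Re-creating the generator with the same seed and repeating the same calls yields the same samples, which is useful when results need to be compared across runs.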


## <a id='toc1_2_'></a>[Data Preprocessing](#toc0_)

### <a id='toc1_2_1_'></a>[Measures of Central Tendency](#toc0_)

#### <a id='toc1_2_1_1_'></a>[Mean](#toc0_)

In [3]:
import pandas as pd
from scipy.stats import trim_mean

In [4]:
x = np.concatenate([np.arange(0, 51), [50]])   # the integers 0..50 plus one extra 50 (52 values)

print(np.mean(x))
print(trim_mean(x, 0.10))    # mean after trimming 10% of the values from each end

25.48076923076923
25.5
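
To make the trimming step concrete, the following sketch reproduces the result by hand: sort the data, cut `int(proportion * n)` values from each tail, and average what remains. The names `proportion`, `k`, and `trimmed` are illustrative.

```python
import numpy as np
from scipy.stats import trim_mean

x = np.concatenate([np.arange(0, 51), [50]])  # 52 values: 0..50 plus an extra 50

# Manual trimmed mean: sort, then drop int(proportion * n) values from each end
proportion = 0.10
n = len(x)
k = int(proportion * n)          # int(5.2) = 5 values cut from each tail
trimmed = np.sort(x)[k:n - k]    # keeps the values 5..46

manual = trimmed.mean()
print(manual)                    # 25.5
print(trim_mean(x, proportion))  # 25.5
```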


In [5]:
y = np.array([12, 7, 4, -5, np.nan])

print(np.mean(y))
print(np.nanmean(y))    # mean that ignores missing values (NaN)

nan
4.5
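
pandas, imported earlier in this section, handles missing values differently from `np.mean`: `Series.mean` skips NaN by default. A short sketch of the contrast, with illustrative variable names:

```python
import numpy as np
import pandas as pd

y = pd.Series([12, 7, 4, -5, np.nan])

skip_mean = y.mean()                 # skipna=True by default, so NaN is ignored
strict_mean = y.mean(skipna=False)   # NaN propagates, as with np.mean

print(skip_mean)     # 4.5
print(strict_mean)   # nan
```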


#### <a id='toc1_2_1_2_'></a>[Median](#toc0_)

In [1]:
import numpy as np

x = np.array([12, 7, 4, -5, np.nan])

print(np.median(x))
print(np.nanmedian(x))

nan
5.5
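
The value 5.5 does not appear in the data: with an even number of non-missing values, the median is the average of the two middle values. A minimal sketch of that rule, using the hypothetical names `clean` and `manual_median`:

```python
import numpy as np

x = np.array([12, 7, 4, -5, np.nan])

# Drop NaN, then sort the remaining values
clean = np.sort(x[~np.isnan(x)])   # [-5., 4., 7., 12.]
n = len(clean)

# Even count: average the two middle values; odd count: take the middle one
if n % 2 == 0:
    manual_median = (clean[n // 2 - 1] + clean[n // 2]) / 2
else:
    manual_median = clean[n // 2]

print(manual_median)   # 5.5
```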


#### <a id='toc1_2_1_3_'></a>[Mode](#toc0_)

In [3]:
import numpy as np

x = np.array([2, 1, 1, 3, 1])

print(np.bincount(x).argmax())    # mode: the most frequent value (bincount needs non-negative integers)

1
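
`np.bincount` only accepts non-negative integers. For floats or negative values, one possible alternative is `np.unique` with `return_counts=True`; the sample data below is made up for illustration.

```python
import numpy as np

x = np.array([-2.5, 1.0, 1.0, 3.0, 1.0, -2.5])

# np.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(x, return_counts=True)

# argmax picks the first maximum, so ties resolve to the smallest value
mode = values[counts.argmax()]
print(mode)   # 1.0
```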


### <a id='toc1_2_2_'></a>[Measures of Dispersion (Spread)](#toc0_)

#### <a id='toc1_2_2_1_'></a>[Variance](#toc0_)

In [4]:
import numpy as np

x = np.array([3, 4, 5, 2, 4, 3, 4])

print(np.var(x, ddof=1))    # sample variance (divides by n - 1)

0.9523809523809526
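
The `ddof` argument sets the divisor to `n - ddof`. The sketch below computes both variants from the sum of squared deviations so the difference is explicit; `ss`, `pop_var`, and `sample_var` are illustrative names.

```python
import numpy as np

x = np.array([3, 4, 5, 2, 4, 3, 4])
n = len(x)

# Sum of squared deviations from the mean
ss = np.sum((x - x.mean()) ** 2)

pop_var = ss / n           # population variance, same as np.var(x) (ddof=0)
sample_var = ss / (n - 1)  # sample variance, same as np.var(x, ddof=1)

print(pop_var, sample_var)
print(np.sqrt(sample_var))  # sample standard deviation, same as np.std(x, ddof=1)
```

Note that NumPy defaults to `ddof=0`, so `ddof=1` has to be passed explicitly whenever the data is a sample rather than the whole population.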


#### <a id='toc1_2_2_2_'></a>[Standard Deviation](#toc0_)

In [5]:
import numpy as np

x = np.array([3, 4, 5, 2, 4, 3, 4])

print(np.std(x, ddof=1))    # sample standard deviation (square root of the sample variance)

0.9759000729485333


#### <a id='toc1_2_2_3_'></a>[Range](#toc0_)

In [6]:
import numpy as np

x = np.array([1, 7, 3, 5, 11, 4, 6])

print(np.ptp(x))    # range ("peak to peak"): max - min
print(np.min(x))
print(np.max(x))

10
1
11


#### <a id='toc1_2_2_4_'></a>[Percentile, IQR](#toc0_)

In [7]:
import numpy as np
from scipy.stats import iqr

x = np.array([3, 4, 5, 2, 4, 3, 4])

q1 = np.percentile(x, 25)
q3 = np.percentile(x, 75)

# IQR, method 1: Q3 - Q1
print(q3 - q1)

# IQR, method 2: scipy.stats.iqr
print(iqr(x))


1.0
1.0
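
By default `np.percentile` uses linear interpolation between neighbouring sorted values. The sketch below is a hand-rolled version of that rule, meant only to show where Q1 and Q3 come from; `percentile_linear` is a hypothetical helper, not a NumPy or SciPy function.

```python
import numpy as np

def percentile_linear(data, q):
    """Hypothetical helper: q-th percentile by linear interpolation."""
    s = np.sort(data)
    pos = (len(s) - 1) * q / 100     # fractional index into the sorted data
    lo = int(np.floor(pos))
    hi = int(np.ceil(pos))
    # Interpolate between the two neighbouring sorted values
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

x = np.array([3, 4, 5, 2, 4, 3, 4])   # sorted: [2, 3, 3, 4, 4, 4, 5]
q1 = percentile_linear(x, 25)          # index 1.5, between 3 and 3
q3 = percentile_linear(x, 75)          # index 4.5, between 4 and 4
print(q1, q3, q3 - q1)
```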


### <a id='toc1_2_3_'></a>[Ranking](#toc0_)

In [9]:
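import numpy as np
import pandas as pd

x = np.array([1, 1, 5, 5, 9, 7])

print(pd.Series(x).rank(method='average'))  # ties get the mean of their ranks
print(pd.Series(x).rank(method='first'))    # ties are ranked in order of appearance
print(pd.Series(x).rank(method='min'))      # ties get the lowest rank in the group
print(pd.Series(x).rank(method='max'))      # ties get the highest rank in the group
print(pd.Series(x).rank(method='dense'))    # like 'min', but ranks rise by 1 between groups

0    1.5
1    1.5
2    3.5
3    3.5
4    6.0
5    5.0
dtype: float64
0    1.0
1    2.0
2    3.0
3    4.0
4    6.0
5    5.0
dtype: float64
0    1.0
1    1.0
2    3.0
3    3.0
4    6.0
5    5.0
dtype: float64
0    2.0
1    2.0
2    4.0
3    4.0
4    6.0
5    5.0
dtype: float64
0    1.0
1    1.0
2    2.0
3    2.0
4    4.0
5    3.0
dtype: float64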
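
The tie-breaking methods are easier to compare side by side than as separate printouts. As a sketch, the ranks can be collected into one `DataFrame` with a column per method; the names `methods` and `ranks` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 1, 5, 5, 9, 7]))

# One column per tie-breaking method, so the rows can be compared directly
methods = ['average', 'first', 'min', 'max', 'dense']
ranks = pd.DataFrame({m: s.rank(method=m) for m in methods})

print(ranks)
```

Rows 0 to 3 hold the tied values (1, 1 and 5, 5), so those are the rows where the columns differ; the untied values 9 and 7 get the same rank under every method except `dense`.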


### <a id='toc1_2_4_'></a>[Miscellaneous](#toc0_)

In [10]:
print(round(3.1425, 2))    # round to two decimal places
print(int(3.1425))   # convert to int (drops the fractional part)

3.14
3
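
Two behaviours of these built-ins are easy to miss: `round` sends exact halves to the nearest even number and is subject to binary floating-point representation, while `int` truncates toward zero rather than rounding. The values below are chosen only to illustrate this.

```python
# round() rounds exact halves to the nearest even integer
half_cases = [round(0.5), round(1.5), round(2.5)]   # [0, 2, 2]

# 2.675 is stored as a binary float slightly below 2.675, so it rounds down
float_case = round(2.675, 2)                        # 2.67

# int() truncates toward zero instead of rounding
trunc_cases = [int(3.99), int(-3.99)]               # [3, -3]

print(half_cases, float_case, trunc_cases)
```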
