# <a id='toc1_'></a>[Data Exploration](#toc0_)
---

**Table of contents**<a id='toc0_'></a>    
- [Data Exploration](#toc1_)    
  - [Exploring Individual Variables](#toc1_1_)    
    - [Categorical Data: Frequency Counts](#toc1_1_1_)    
    - [Numerical Data](#toc1_1_2_)    
  - [Multivariate Data Exploration](#toc1_2_)    
    - [Categorical vs. Categorical](#toc1_2_1_)    
    - [Numerical vs. Numerical](#toc1_2_2_)    
      - [The `corr` Function](#toc1_2_2_1_)    
      - [Correlation Coefficients](#toc1_2_2_2_)    
    - [Categorical vs. Numerical](#toc1_2_3_)    
      - [The `groupby` Function](#toc1_2_3_1_)    
  - [Overview of the Whole Dataset](#toc1_3_)    
    - [The `info` Function](#toc1_3_1_)    
    - [The `head` Function](#toc1_3_2_)    
    - [The `tail` Function](#toc1_3_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

---

## <a id='toc1_1_'></a>[Exploring Individual Variables](#toc0_)

### <a id='toc1_1_1_'></a>[Categorical Data: Frequency Counts](#toc0_)

In [3]:
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data

# mtcars['cyl'] is already a pandas Series, so value_counts() can be called on it directly
print(mtcars['cyl'].value_counts())

cyl
8    14
4    11
6     7
Name: count, dtype: int64
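
Relative frequencies are often easier to compare than raw counts. The sketch below is a minimal illustration, assuming a stand-in Series rebuilt from the counts printed above rather than the downloaded dataset; `value_counts(normalize=True)` returns proportions that sum to 1.

```python
import pandas as pd

# Stand-in for mtcars['cyl'], rebuilt from the counts shown above (assumption:
# only the counts matter here, not the original row order)
cyl = pd.Series([8] * 14 + [4] * 11 + [6] * 7, name='cyl')

counts = cyl.value_counts()                 # absolute frequencies
shares = cyl.value_counts(normalize=True)   # relative frequencies

print(counts)
print(shares.round(3))
```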


### <a id='toc1_1_2_'></a>[Numerical Data](#toc0_)

In [4]:
import pandas as pd
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data

print(mtcars['wt'].describe())

count    32.000000
mean      3.217250
std       0.978457
min       1.513000
25%       2.581250
50%       3.325000
75%       3.610000
max       5.424000
Name: wt, dtype: float64


## <a id='toc1_2_'></a>[Multivariate Data Exploration](#toc0_)

### <a id='toc1_2_1_'></a>[Categorical vs. Categorical](#toc0_)

In [5]:
import pandas as pd
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data

print(pd.crosstab(mtcars['am'], mtcars['cyl']))

cyl  4  6   8
am           
0    3  4  12
1    8  3   2
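
A contingency table is sometimes easier to read with totals or row-wise proportions. The following is a sketch under the assumption that a frame rebuilt from the cell counts above is an acceptable stand-in for mtcars; it uses the `margins` and `normalize` parameters of `pd.crosstab`.

```python
import pandas as pd

# Stand-in rows (am, cyl) rebuilt from the cell counts in the table above
rows = (
    [(0, 4)] * 3 + [(0, 6)] * 4 + [(0, 8)] * 12
    + [(1, 4)] * 8 + [(1, 6)] * 3 + [(1, 8)] * 2
)
df = pd.DataFrame(rows, columns=['am', 'cyl'])

# margins=True appends an 'All' row and column holding the totals
table = pd.crosstab(df['am'], df['cyl'], margins=True)

# normalize='index' divides each row by its row total, so every row sums to 1
row_share = pd.crosstab(df['am'], df['cyl'], normalize='index')

print(table)
print(row_share.round(3))
```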


### <a id='toc1_2_2_'></a>[Numerical vs. Numerical](#toc0_)

#### <a id='toc1_2_2_1_'></a>[The `corr` Function](#toc0_)

In [6]:
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data
cor_mpg_wt = mtcars['mpg'].corr(mtcars['wt'])   # store the correlation coefficient between mpg and wt

print(cor_mpg_wt)

-0.8676593765172281
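
`Series.corr` defaults to the Pearson coefficient, that is, the covariance of the two variables divided by the product of their standard deviations. As a rough check, the sketch below recomputes that value by hand on a few made-up numbers (not taken from mtcars; the names `weight` and `mileage` are placeholders) and compares it with `corr`.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values, not the mtcars data
weight = pd.Series([2.6, 2.9, 2.3, 3.2, 3.4, 5.0])
mileage = pd.Series([21.0, 20.5, 23.0, 21.5, 18.5, 11.0])

# Deviations from the mean
dw = weight - weight.mean()
dm = mileage - mileage.mean()

# Pearson r = sum(dw * dm) / sqrt(sum(dw^2) * sum(dm^2))
manual = (dw * dm).sum() / np.sqrt((dw ** 2).sum() * (dm ** 2).sum())
builtin = weight.corr(mileage)   # method='pearson' is the default

print(manual, builtin)
```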


#### <a id='toc1_2_2_2_'></a>[Correlation Coefficients](#toc0_)

In [8]:
import pandas as pd

df = pd.read_csv('./datasets/PimaIndianDiabetes2.csv')
df.info()   # info() prints the summary itself and returns None, so wrapping it in print() is unnecessary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [9]:
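df = df.iloc[:, [2, 3, 4, 7]]   # keep BloodPressure, SkinThickness, Insulin, Age
df = df.dropna()                # removes no rows here: info() reports 768 non-null values in every column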
print(df.describe())

       BloodPressure  SkinThickness     Insulin         Age
count     768.000000     768.000000  768.000000  768.000000
mean       69.105469      20.536458   79.799479   33.240885
std        19.355807      15.952218  115.244002   11.760232
min         0.000000       0.000000    0.000000   21.000000
25%        62.000000       0.000000    0.000000   24.000000
50%        72.000000      23.000000   30.500000   29.000000
75%        80.000000      32.000000  127.250000   41.000000
max       122.000000      99.000000  846.000000   81.000000
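
In the summary above, `BloodPressure`, `SkinThickness`, and `Insulin` all have a minimum of 0 while every count is still 768, so `dropna()` left the frame unchanged. If those zeros are placeholders for missing measurements (an assumption; the file itself does not confirm it), one possible approach is to convert them to `NaN` before dropping rows. The sketch below uses a tiny made-up frame with the same column names.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse the column names above; values are illustrative only
df_demo = pd.DataFrame({
    'BloodPressure': [72, 66, 0, 40],
    'SkinThickness': [35, 29, 0, 35],
    'Insulin':       [0, 94, 0, 168],
    'Age':           [50, 31, 21, 33],
})

# Columns where a zero is assumed to mean "not measured"
zero_as_missing = ['BloodPressure', 'SkinThickness', 'Insulin']

cleaned = df_demo.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
cleaned = cleaned.dropna()   # now actually removes the rows that contained zeros

print(len(df_demo.dropna()), len(cleaned))
```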


In [11]:
print(df.corr(method='pearson'))   # Pearson correlation coefficient

               BloodPressure  SkinThickness   Insulin       Age
BloodPressure       1.000000       0.207371  0.088933  0.239528
SkinThickness       0.207371       1.000000  0.436783 -0.113970
Insulin             0.088933       0.436783  1.000000 -0.042163
Age                 0.239528      -0.113970 -0.042163  1.000000


In [12]:
print(df.corr(method='spearman'))   # Spearman rank correlation coefficient

               BloodPressure  SkinThickness   Insulin       Age
BloodPressure       1.000000       0.126486 -0.006771  0.350895
SkinThickness       0.126486       1.000000  0.541000 -0.066795
Insulin            -0.006771       0.541000  1.000000 -0.114213
Age                 0.350895      -0.066795 -0.114213  1.000000
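
The Spearman coefficient is the Pearson coefficient computed on ranks instead of raw values, which is why it responds to monotonic rather than strictly linear relationships. A small sketch on hypothetical data (the column names `a`, `b`, `c` are placeholders) checks that equivalence:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few ties; not related to the diabetes dataset
demo = pd.DataFrame({
    'a': [1, 2, 2, 4, 5, 7],
    'b': [10, 0, 0, 30, 25, 80],
    'c': [3, 9, 1, 4, 4, 2],
})

spearman = demo.corr(method='spearman')

# rank() assigns average ranks to ties by default; Pearson on those ranks
# should reproduce the Spearman matrix
via_ranks = demo.rank().corr(method='pearson')

print(np.allclose(spearman, via_ranks))
```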


In [13]:
print(df.corr(method='kendall'))   # Kendall rank correlation coefficient

               BloodPressure  SkinThickness   Insulin       Age
BloodPressure       1.000000       0.094868 -0.003682  0.246056
SkinThickness       0.094868       1.000000  0.420066 -0.044754
Insulin            -0.003682       0.420066  1.000000 -0.080176
Age                 0.246056      -0.044754 -0.080176  1.000000


### <a id='toc1_2_3_'></a>[Categorical vs. Numerical](#toc0_)

#### <a id='toc1_2_3_1_'></a>[The `groupby` Function](#toc0_)

In [14]:
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data

print(mtcars.groupby('cyl')['mpg'].mean())   # mean mpg for each cyl group

cyl
4    26.663636
6    19.742857
8    15.100000
Name: mpg, dtype: float64
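
`groupby` can also report several statistics per group at once through `agg`. A minimal sketch on made-up values (these are not the mtcars figures):

```python
import pandas as pd

# Made-up values that only mimic the cyl/mpg layout
cars = pd.DataFrame({
    'cyl': [4, 4, 6, 6, 8, 8],
    'mpg': [30.0, 24.0, 21.0, 19.0, 16.0, 14.0],
})

# One row per cyl group, one column per requested statistic
summary = cars.groupby('cyl')['mpg'].agg(['count', 'mean', 'min', 'max'])

print(summary)
```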


## <a id='toc1_3_'></a>[Overview of the Whole Dataset](#toc0_)

### <a id='toc1_3_1_'></a>[The `info` Function](#toc0_)

In [1]:
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data

mtcars['wt'].info()    # prints the dtype, non-null count, and memory usage of the wt column; it returns None, so print() is not needed

<class 'pandas.core.series.Series'>
Index: 32 entries, Mazda RX4 to Volvo 142E
Series name: wt
Non-Null Count  Dtype  
--------------  -----  
32 non-null     float64
dtypes: float64(1)
memory usage: 512.0+ bytes


### <a id='toc1_3_2_'></a>[The `head` Function](#toc0_)

In [2]:
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data

print(mtcars['wt'].head())    # first 5 rows of the wt column of the mtcars dataset

rownames
Mazda RX4            2.620
Mazda RX4 Wag        2.875
Datsun 710           2.320
Hornet 4 Drive       3.215
Hornet Sportabout    3.440
Name: wt, dtype: float64


### <a id='toc1_3_3_'></a>[The `tail` Function](#toc0_)

In [3]:
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data

print(mtcars['wt'].tail())    # last 5 rows of the wt column of the mtcars dataset

rownames
Lotus Europa      1.513
Ford Pantera L    3.170
Ferrari Dino      2.770
Maserati Bora     3.570
Volvo 142E        2.780
Name: wt, dtype: float64
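
Both `head` and `tail` take an optional row count `n` (5 when omitted). A brief sketch on a placeholder Series, not the mtcars data:

```python
import pandas as pd

# Hypothetical ten-element Series used only to show the n argument
s = pd.Series(range(10), name='demo')

first_three = s.head(3)   # the first three rows
last_two = s.tail(2)      # the last two rows

print(first_three.tolist())
print(last_two.tolist())
```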
