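## Taking the Measure of Your Data
1. How are the rows of the data uniquely identified?
2. How many rows and columns does the dataset have?
3. What are the key categorical variables, and how frequent is each of their values?
4. How are the key continuous variables distributed?
5. How are the variables related to one another?
6. Which variable values fall outside the expected range, and how are missing values distributed?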
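
As a rough sketch, each of the six questions above maps onto a short pandas call. The toy DataFrame below is hypothetical: the column names are borrowed from the NLS data used later, but the values are invented, and the 4 to 12 hour sleep range is an assumed "expected range" chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame(
    {
        "personid": [101, 102, 103, 104, 105],
        "gender": ["Female", "Male", "Female", "Female", "Male"],
        "wageincome": [52000.0, 38000.0, np.nan, 61000.0, 45000.0],
        "nightlyhrssleep": [7.0, 6.0, 8.0, np.nan, 7.0],
    }
).set_index("personid")

# 1. Are rows uniquely identified by the index?
index_is_unique = toy.index.is_unique

# 2. How many rows and columns?
n_rows, n_cols = toy.shape

# 3. Frequency of each value of a categorical variable.
gender_counts = toy["gender"].value_counts()

# 4. Distribution of a continuous variable.
wage_summary = toy["wageincome"].describe()

# 5. Pairwise correlation between continuous variables.
correlations = toy[["wageincome", "nightlyhrssleep"]].corr()

# 6. Values outside an assumed expected range, and missing values per column.
sleep = toy["nightlyhrssleep"]
out_of_range = toy.loc[(sleep < 4) | (sleep > 12)]
missing_per_column = toy.isnull().sum()

print(index_is_unique, (n_rows, n_cols))
print(gender_counts)
print(missing_per_column)
```

On real data, the threshold in step 6 would come from domain knowledge rather than a fixed constant.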


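### It is important to understand the data by following a set routine
- Take a first look at the data (question 1)
- Select and organize columns (question 2)
- Select rows (question 2)
- Generate frequencies for categorical variables (question 3)
- Generate summary statistics for continuous variables (question 4)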
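
The selection and summary steps in the routine above can be sketched with a few standard pandas calls. The DataFrame here is a hypothetical stand-in with invented values, not the actual NLS data.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female", "Male"],
        "maritalstatus": ["Married", "Never-married", "Married", "Divorced"],
        "birthyear": [1980, 1983, 1984, 1981],
        "gpaoverall": [3.1, 2.8, 3.6, 3.0],
    },
    index=pd.Index([101, 102, 103, 104], name="personid"),
)

# Column selection: by a list of names, and by dtype.
demographics = toy[["gender", "birthyear"]]
numeric_cols = toy.select_dtypes(include="number")

# Row selection: by index label, by position, and by a boolean condition.
one_person = toy.loc[102]
first_two = toy.iloc[:2]
born_after_1981 = toy.loc[toy["birthyear"] > 1981]

# Frequencies for a categorical column, as counts and as proportions.
marital_freq = toy["maritalstatus"].value_counts()
marital_share = toy["maritalstatus"].value_counts(normalize=True)

# Summary statistics for a continuous column.
gpa_stats = toy["gpaoverall"].describe()

print(marital_freq)
print(gpa_stats)
```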

In [28]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns',None)

nls97=pd.read_csv('C:/data-cleansing-main/Chapter03/data/nls97.csv')
# Parse the lastdate column as a datetime while loading; pd.to_datetime can also convert it after loading.
covidtotals=pd.read_csv('C:/data-cleansing-main/Chapter03/data/covidtotals.csv',parse_dates=['lastdate'])
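
Date parsing is not limited to load time. The sketch below compares `parse_dates` in `read_csv` with a later `pd.to_datetime` call. It reads a small in-memory CSV (an assumption made so the example runs without the data files); the two rows reuse values shown in the `covidtotals` output.

```python
import io

import pandas as pd

# Small in-memory CSV so the example runs without the real file.
csv_text = (
    "iso_code,lastdate,total_cases\n"
    "AFG,2020-06-01,15205\n"
    "ALB,2020-06-01,1137\n"
)

# Option 1: parse the date column while reading.
at_load = pd.read_csv(io.StringIO(csv_text), parse_dates=["lastdate"])

# Option 2: read the column as text, then convert it afterwards.
after_load = pd.read_csv(io.StringIO(csv_text))
dtype_before = after_load["lastdate"].dtype
after_load["lastdate"] = pd.to_datetime(after_load["lastdate"])

print(dtype_before)
print(at_load["lastdate"].dtype, after_load["lastdate"].dtype)
```

Converting after loading is useful when the date format needs inspection first, or when the DataFrame comes from a source other than `read_csv`.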

In [29]:
nls97.set_index('personid',inplace=True)            # when the unit of analysis is a person, set the unique identifier as the index
covidtotals.set_index('iso_code',inplace=True)
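
`set_index` does not complain about duplicate identifiers by default. One way to catch them early is the `verify_integrity=True` argument, sketched here on a hypothetical frame in which one `personid` is deliberately repeated.

```python
import pandas as pd

# Hypothetical data: the second personid is duplicated on purpose.
people = pd.DataFrame(
    {
        "personid": [100061, 100139, 100139],
        "gender": ["Female", "Male", "Male"],
    }
)

# verify_integrity=True makes set_index raise if the new index has duplicates.
try:
    people.set_index("personid", verify_integrity=True)
    duplicates_found = False
except ValueError as err:
    duplicates_found = True
    print(f"set_index refused: {err}")

# After removing the repeated identifier, the same call succeeds.
deduped = people.drop_duplicates(subset="personid").set_index(
    "personid", verify_integrity=True
)
print(deduped)
```

Whether dropping duplicates is the right remedy depends on the data; here it only serves to show the call succeeding.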

In [23]:
print(nls97.index)
print(nls97.shape)
print(nls97.index.unique())         # distinct index values; the length (8984) matches the row count, so the index is unique

Int64Index([100061, 100139, 100284, 100292, 100583, 100833, 100931, 101089,
            101122, 101132,
            ...
            998997, 999031, 999053, 999087, 999103, 999291, 999406, 999543,
            999698, 999963],
           dtype='int64', name='personid', length=8984)
(8984, 88)
Int64Index([100061, 100139, 100284, 100292, 100583, 100833, 100931, 101089,
            101122, 101132,
            ...
            998997, 999031, 999053, 999087, 999103, 999291, 999406, 999543,
            999698, 999963],
           dtype='int64', name='personid', length=8984)
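
Comparing the length of `index.unique()` with the row count works, but pandas also offers more direct checks. A minimal sketch on two hypothetical indexes, one clean and one with a repeated value:

```python
import pandas as pd

# Hypothetical indexes: one unique, one with a repeated identifier.
idx_ok = pd.Index([100061, 100139, 100284], name="personid")
idx_dup = pd.Index([100061, 100139, 100139], name="personid")

# is_unique answers the question directly with a boolean.
print(idx_ok.is_unique)   # True
print(idx_dup.is_unique)  # False

# nunique() gives the number of distinct values to compare with len().
n_distinct = idx_dup.nunique()
print(n_distinct, len(idx_dup))

# duplicated(keep=False) flags every occurrence of a repeated value,
# which shows which identifiers are the problem.
repeated = idx_dup[idx_dup.duplicated(keep=False)].unique()
print(list(repeated))
```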


In [24]:
nls97.info()        # check dtypes and non-null counts; many columns are object dtype and many have missing values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8984 entries, 100061 to 999963
Data columns (total 88 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   gender                 8984 non-null   object 
 1   birthmonth             8984 non-null   int64  
 2   birthyear              8984 non-null   int64  
 3   highestgradecompleted  6663 non-null   float64
 4   maritalstatus          6672 non-null   object 
 5   childathome            4791 non-null   float64
 6   childnotathome         4791 non-null   float64
 7   wageincome             5091 non-null   float64
 8   weeklyhrscomputer      6710 non-null   object 
 9   weeklyhrstv            6711 non-null   object 
 10  nightlyhrssleep        6706 non-null   float64
 11  satverbal              1406 non-null   float64
 12  satmath                1407 non-null   float64
 13  gpaoverall             6004 non-null   float64
 14  gpaenglish             5798 non-null   float64
 ...
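
The `info()` output reports non-null counts, so missing counts must be worked out by subtraction. The sketch below computes them directly and converts text columns to the `category` dtype, a common follow-up when many columns are `object`. The data is a hypothetical stand-in with invented values, and the list of text columns is written out by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female", "Male"],
        "maritalstatus": ["Married", None, "Married", "Divorced"],
        "satverbal": [np.nan, 540.0, np.nan, np.nan],
    }
)

# Missing values per column, as counts and as a share of rows.
missing_counts = toy.isnull().sum()
missing_share = toy.isnull().mean()

# Convert the text columns to category; missing values stay missing.
text_cols = ["gender", "maritalstatus"]
converted = toy.copy()
for col in text_cols:
    converted[col] = converted[col].astype("category")

print(missing_counts)
print(converted.dtypes)
```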

In [25]:
nls97.head(2).T     # transpose the first rows, which is easier to read when there are many columns

| personid | 100061 | 100139 |
|---|---|---|
| gender | Female | Male |
| birthmonth | 5 | 9 |
| birthyear | 1980 | 1983 |
| highestgradecompleted | 13.0 | 12.0 |
| maritalstatus | Married | Married |
| ... | ... | ... |
| colenroct15 | 1. Not enrolled | 1. Not enrolled |
| colenrfeb16 | 1. Not enrolled | 1. Not enrolled |
| colenroct16 | 1. Not enrolled | 1. Not enrolled |
| colenrfeb17 | 1. Not enrolled | 1. Not enrolled |


In [30]:
print(covidtotals.index)
print(covidtotals.index.unique())       # distinct index values; the length (210) matches the row count, so the index is unique

Index(['AFG', 'ALB', 'DZA', 'AND', 'AGO', 'AIA', 'ATG', 'ARG', 'ARM', 'ABW',
       ...
       'VIR', 'URY', 'UZB', 'VAT', 'VEN', 'VNM', 'ESH', 'YEM', 'ZMB', 'ZWE'],
      dtype='object', name='iso_code', length=210)
Index(['AFG', 'ALB', 'DZA', 'AND', 'AGO', 'AIA', 'ATG', 'ARG', 'ARM', 'ABW',
       ...
       'VIR', 'URY', 'UZB', 'VAT', 'VEN', 'VNM', 'ESH', 'YEM', 'ZMB', 'ZWE'],
      dtype='object', name='iso_code', length=210)


In [31]:
covidtotals.info()

<class 'pandas.core.frame.DataFrame'>
Index: 210 entries, AFG to ZWE
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   lastdate         210 non-null    datetime64[ns]
 1   location         210 non-null    object        
 2   total_cases      210 non-null    int64         
 3   total_deaths     210 non-null    int64         
 4   total_cases_pm   209 non-null    float64       
 5   total_deaths_pm  209 non-null    float64       
 6   population       210 non-null    float64       
 7   pop_density      198 non-null    float64       
 8   median_age       186 non-null    float64       
 9   gdp_per_capita   182 non-null    float64       
 10  hosp_beds        164 non-null    float64       
dtypes: datetime64[ns](1), float64(7), int64(2), object(1)
memory usage: 27.8+ KB
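
The non-null counts above show that several columns have gaps, but not whether those gaps fall in the same rows. A small sketch of counting missing values per row follows. The frame is hypothetical: the `AFG` and `ALB` values for `pop_density` and `median_age` are taken from the output in this section, while the `hosp_beds` value and the `XXX` row are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; hosp_beds and the XXX row are invented for illustration.
toy = pd.DataFrame(
    {
        "pop_density": [54.422, 104.871, np.nan],
        "median_age": [18.6, 38.0, np.nan],
        "hosp_beds": [np.nan, 2.9, np.nan],
    },
    index=pd.Index(["AFG", "ALB", "XXX"], name="iso_code"),
)

# Number of missing values in each row.
missing_per_row = toy.isnull().sum(axis=1)

# Rows where at least two of the three columns are missing.
rows_mostly_missing = toy.loc[missing_per_row >= 2]

# Rows with no missing values at all.
complete_rows = toy.dropna()

print(missing_per_row)
print(list(rows_mostly_missing.index), list(complete_rows.index))
```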


In [32]:
covidtotals.head(2).T

| iso_code | AFG | ALB |
|---|---|---|
| lastdate | 2020-06-01 00:00:00 | 2020-06-01 00:00:00 |
| location | Afghanistan | Albania |
| total_cases | 15205 | 1137 |
| total_deaths | 257 | 33 |
| total_cases_pm | 390.589 | 395.093 |
| total_deaths_pm | 6.602 | 11.467 |
| population | 38928341.0 | 2877800.0 |
| pop_density | 54.422 | 104.871 |
| median_age | 18.6 | 38.0 |
| gdp_per_capita | 1803.987 | 11803.431 |
