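# Exploring and Manipulating Data with pandas

**Once the data to be analyzed has been collected, the next step is to understand its characteristics and reshape it into a form that is easy to work with.**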

In [1]:
import numpy as np
import pandas as pd

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity="all"

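# #1. Understanding the Data

When you are handed a dataset, the first step is to examine its overall structure and get a feel for its characteristics:

- What does the data look like?
- How large is it?
- Which variables does it contain?
- Are there missing values?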

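### pandas commands for inspecting data

#### Previewing the contents
- **head()** : print the first rows of the data
- **tail()** : print the last rows of the data

#### Checking summary information
- **shape** : number of rows and columns
- **info()** : basic information about the data
- **describe()** : summary statistics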

In [7]:
titanic = pd.read_csv('data/titanic.csv')

### Checking the first and last rows : head(), tail()

- **DataFrame.head(n=5)**
- **DataFrame.tail(n=5)**

In [11]:
titanic.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [12]:
titanic.tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
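
The behavior of `head()` and `tail()` can be reproduced on a small stand-in table. This is a minimal sketch: the DataFrame `df` and its values are hypothetical toy data, not the Titanic file used above.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic file.
df = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
})

first_default = df.head()        # n defaults to 5
first_three = df.head(3)         # first 3 rows
last_three = df.tail(3)          # last 3 rows
all_but_last_two = df.head(-2)   # negative n: every row except the last 2

print(first_three)
print(last_three)
```

Both methods return a new DataFrame and keep the original index labels, which is why the `tail(3)` output above starts at index 888.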


### Checking the size of the data : shape

- **Returns the number of rows and columns of the DataFrame as a tuple**

In [13]:
# Total number of rows and columns
titanic.shape

(891, 12)

In [15]:
# Total number of elements (rows × columns)
titanic.size

10692
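
The two outputs above are related: `size` is the product of the two numbers in `shape` (891 × 12 = 10692). The sketch below checks this on a hypothetical toy DataFrame.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 4 columns.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4.0, 5.0, 6.0],
    "c": ["x", "y", "z"],
    "d": [True, False, True],
})

n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple
total_cells = df.size       # size counts every cell, including missing ones

print(n_rows, n_cols, total_cells)
print(len(df))              # len() gives the number of rows only
```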

### Checking basic information : info()

- **The class type**
- **The dtype and number of entries of the index, and the number of columns**
- **The number of non-null values and the dtype of each column**
- **Memory usage**

In [16]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
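
In the output above, `Age` (714), `Cabin` (204), and `Embarked` (889) have fewer non-null values than the 891 rows, so those columns contain missing values. As a rough sketch, the same counts can be obtained as numbers with `notna()` and `isna()`. These two methods are not used in the original notebook, and the toy DataFrame is hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# info() prints its report; passing buf captures the text instead.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

non_null = df.notna().sum()   # same numbers as the non-null counts in info()
missing = df.isna().sum()     # number of missing values per column
print(missing)
```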


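### Computing summary statistics : describe()

- **Signature**: describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False) (the `datetime_is_numeric` parameter exists only in pandas 1.1 through 1.5; it was removed in pandas 2.0)


- **Computes descriptive statistics, excluding NaN values, that summarize the central tendency, dispersion, and shape of the data**
    - Center and location : mean, median (the 50% row), percentiles, minimum (min), maximum (max)
    - Dispersion : standard deviation (std)
    - Frequency : count, unique (number of distinct values), top (most frequent value), freq (frequency of the most frequent value)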
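
To illustrate that NaN values are excluded, the sketch below runs `describe()` on a hypothetical column with one missing entry. The count and the mean are based on the three available values only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: four rows, one of them missing.
df = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})

summary = df.describe()
print(summary)

count = summary.loc["count", "Age"]   # 3.0, the NaN row is not counted
mean = summary.loc["mean", "Age"]     # 30.0, the mean of 20, 30 and 40
std = summary.loc["std", "Age"]       # sample standard deviation (ddof=1)
```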

In [17]:
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [18]:
# Choosing which percentiles to report: percentiles
titanic.describe(percentiles=[0.2,0.4,0.6,0.8])

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
20%,179.0,0.0,1.0,19.0,0.0,0.0,7.8542
40%,357.0,0.0,2.0,25.0,0.0,0.0,10.5
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
60%,535.0,0.0,3.0,31.8,0.0,0.0,21.6792
80%,713.0,1.0,3.0,41.0,1.0,1.0,39.6875
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
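
Each percentile row corresponds to a call to `quantile()` on the column. The sketch below checks this on a hypothetical column holding the values 0 to 100, where the 20th and 80th percentiles are easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy column: 0.0, 1.0, ..., 100.0
df = pd.DataFrame({"Fare": [float(x) for x in range(101)]})

summary = df.describe(percentiles=[0.2, 0.8])
print(summary)

p20 = summary.loc["20%", "Fare"]   # row labels are the percentiles as percentages
p80 = summary.loc["80%", "Fare"]
same_as_quantile = p20 == df["Fare"].quantile(0.2)
```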


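#### describe(include='all') : also report summary statistics for string columns

- unique : number of distinct categories (duplicates removed)
- top : most frequent value (mode)
- freq : frequency of the most frequent value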

In [20]:
titanic.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891,891.0,204,889
unique,,,,891,2,,,,681,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082,,B96 B98,S
freq,,,,1,577,,,,7,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,
max,891.0,1.0,3.0,,,80.0,8.0,6.0,,512.3292,,


In [22]:
titanic.describe(include='object')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644
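
The `unique`, `top`, and `freq` rows can be cross-checked against `nunique()` and `value_counts()`. This is a minimal sketch on a hypothetical text column. It calls `describe()` on a single Series, which reports the same four rows for non-numeric data.

```python
import pandas as pd

# Hypothetical toy column with one missing value.
sex = pd.Series(["male", "female", "male", "male", None], name="Sex")

summary = sex.describe()
print(summary)

counts = sex.value_counts()   # frequency of each category, NaN excluded
n_unique = sex.nunique()      # matches the "unique" row
top_value = counts.idxmax()   # matches the "top" row
top_freq = counts.max()       # matches the "freq" row
```

When several categories are tied for the highest frequency, `top` reports only one of them, so `value_counts()` is the safer way to see the full picture.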


In [23]:
titanic.describe(include='int64')

Unnamed: 0,PassengerId,Survived,Pclass,SibSp,Parch
count,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.523008,0.381594
std,257.353842,0.486592,0.836071,1.102743,0.806057
min,1.0,0.0,1.0,0.0,0.0
25%,223.5,0.0,2.0,0.0,0.0
50%,446.0,0.0,3.0,0.0,0.0
75%,668.5,1.0,3.0,1.0,0.0
max,891.0,1.0,3.0,8.0,6.0


---------------------------------------------

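> ## Exploratory Data Analysis (EDA)

- Exploratory Data Analysis
- Identify the characteristics of the data and describe them in detail
- Decide the direction of the analysis after examining the data according to its type (attributes)

### EDA techniques by data type (categorical/numeric)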
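
As a rough sketch of how the data type steers the choice of technique, the columns can be split by dtype with `select_dtypes()`, then summarized with `describe()` for numeric columns and `value_counts()` for the rest. The toy DataFrame is hypothetical. Note that a column such as `Pclass` is stored as an integer but is categorical in meaning, so the dtype alone is only a first approximation.

```python
import pandas as pd

# Hypothetical toy data with numeric and text columns.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Numeric columns: summary statistics.
print(df[numeric_cols].describe())

# Non-numeric columns: frequency of each category.
for col in categorical_cols:
    print(df[col].value_counts())
```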

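### Descriptive statistics

**: statistics that summarize and describe the characteristics of the data numerically**

- **Measures of location**
    - Indicate the central tendency of the data
    - Also called representative values
    - Numeric data : mean (arithmetic/harmonic/geometric), median (the middle value by position), trimmed mean (the mean after excluding a given percentage of the smallest and largest values), quartiles (25%, 50%, 75%, 100%), etc.
    - Categorical data : mode
    
    
- **Measures of dispersion**
    - Indicate how spread out or scattered the data is
    - Also called measures of variability
    - Numeric data : variance, standard deviation, interquartile range, range, etc.
    - Categorical data : frequency of each category
    
    
- **Measures of shape**
    - Skewness : the degree to which the data is asymmetric, that is, pulled to one side of its center
    - Kurtosis : how sharply peaked the distribution is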
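
The measures listed above can be computed with pandas Series methods. The following is a minimal sketch on hypothetical values that include one large outlier. The trimmed mean here simply drops the single smallest and largest value, which is one possible trimming rule chosen for illustration.

```python
import pandas as pd

# Hypothetical toy values with one large outlier (40.0).
values = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 40.0])

# Measures of location
mean = values.mean()
median = values.median()
trimmed_mean = values.sort_values().iloc[1:-1].mean()  # drop min and max
mode = values.mode().iloc[0]                            # most frequent value
q1, q3 = values.quantile(0.25), values.quantile(0.75)

# Measures of dispersion
variance = values.var()               # sample variance (ddof=1)
std = values.std()                    # sample standard deviation (ddof=1)
iqr = q3 - q1                         # interquartile range
value_range = values.max() - values.min()

# Measures of shape
skewness = values.skew()              # positive: longer tail on the right
kurtosis = values.kurt()              # excess kurtosis (normal distribution = 0)

print(mean, median, trimmed_mean, mode)
print(variance, std, iqr, value_range)
print(skewness, kurtosis)
```

The outlier pulls the mean well above the median, while the median and the trimmed mean stay close to the bulk of the data. The positive skewness reflects the same long right tail.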

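### Understanding data through visualization tools

**: the appropriate visualization tool depends on the number of variables, their types, and the purpose of the analysis**

#### The pandas plot function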

----------------------------------