# 10 Minutes to Pandas

- [Related link](https://dataitgirls2.github.io/10minutes2pandas/)

### Table of Contents
|Chapter|Contents|
|---|---|
|1|Object Creation|
|2|Viewing Data|
|3|Selection|
|4|Missing Data|
|5|Operation|
|6|Merge|
|7|Grouping|
|8|Reshaping|
|9|Time Series|
|10|Categoricals|
|11|Plotting|
|12|Getting Data In/Out|
|13|Gotchas|

By convention, the packages are imported under the names `pd`, `np`, and `plt`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## 1. Object Creation

Passing a list of values creates a `Series`; pandas assigns a default integer index.

In [3]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [4]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
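
As a minimal sketch of the default index and dtype behavior described above, the snippet below rebuilds the same Series and inspects it. The name `s_labeled` and its labels are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same values as above; np.nan forces the integers to be upcast to float64.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.index)         # default integer index running from 0 to 5
print(s.dtype)         # float64
print(s.isna().sum())  # number of missing values

# Illustrative assumption: explicit labels instead of the default index.
s_labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])
print(s_labeled["b"])
```

Without a NaN in the list, the values would stay integers, which is why `s_labeled` is not upcast to float.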

Create a DataFrame by passing a NumPy array, with a `datetime` index and labeled columns.

In [5]:
dates = pd.date_range('20250123', periods=6)

In [6]:
dates

DatetimeIndex(['2025-01-23', '2025-01-24', '2025-01-25', '2025-01-26',
               '2025-01-27', '2025-01-28'],
              dtype='datetime64[ns]', freq='D')

In [7]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [8]:
df

| |A|B|C|D|
|---|---|---|---|---|
|2025-01-23|-0.546589|-0.554499|0.813592|0.439799|
|2025-01-24|1.952686|1.204519|-0.587892|0.322988|
|2025-01-25|-0.76934|-0.603733|0.76391|0.708147|
|2025-01-26|-1.349199|0.881114|0.136657|-1.355605|
|2025-01-27|0.329758|-0.50083|-0.184495|0.579466|
|2025-01-28|0.09059|-0.885316|-0.849857|-0.873267|
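
Because `np.random.randn` draws new values on every run, the numbers shown above will differ from run to run. The following is a hedged sketch of one way to make such a frame reproducible with a seeded NumPy generator. The name `df_seeded` and the seed value `0` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20250123", periods=6)

# Seeded generator so the "random" values are identical on every run.
# The seed (0) is an arbitrary choice for this sketch.
rng = np.random.default_rng(0)
df_seeded = pd.DataFrame(
    rng.standard_normal((6, 4)), index=dates, columns=list("ABCD")
)

print(df_seeded.shape)
```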


Create a DataFrame by passing a `dict` of objects that can be converted to Series-like structures.

In [9]:
df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20250123'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})

In [11]:
df2

| |A|B|C|D|E|F|
|---|---|---|---|---|---|---|
|0|1.0|2025-01-23|1.0|3|test|foo|
|1|1.0|2025-01-23|1.0|3|train|foo|
|2|1.0|2025-01-23|1.0|3|test|foo|
|3|1.0|2025-01-23|1.0|3|train|foo|


The columns of the resulting DataFrame have different data types (dtypes).

In [14]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
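
The `dict` constructor repeats scalar values to match the length of the array-like entries, which is why `'A'` and `'F'` fill every row above. A small sketch of that behavior, using a hypothetical three-row frame named `df_small`:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "A": 1.0,                                             # scalar, repeated for every row
    "C": pd.Series(1, index=range(3), dtype="float32"),   # supplies the index 0, 1, 2
    "D": np.array([3] * 3, dtype="int32"),                # must match the index length
    "F": "foo",                                           # string scalar, also repeated
})

print(len(df_small))
print(df_small.dtypes)
```

Each column keeps the dtype it was given, so a single DataFrame can hold `float64`, `float32`, and `int32` columns side by side.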

## 2. Viewing Data

Use `head()` and `tail()` to view the first and last rows of a DataFrame.
- The parentheses may or may not contain a number.
- If a number is given, that many rows are returned from the top or the bottom.
- If no number is given, the default is 5.

In [15]:
df.tail(3) # last 3 rows (not displayed: a cell shows only its final expression)
df.tail() # last 5 rows (the default)

| |A|B|C|D|
|---|---|---|---|---|
|2025-01-24|1.952686|1.204519|-0.587892|0.322988|
|2025-01-25|-0.76934|-0.603733|0.76391|0.708147|
|2025-01-26|-1.349199|0.881114|0.136657|-1.355605|
|2025-01-27|0.329758|-0.50083|-0.184495|0.579466|
|2025-01-28|0.09059|-0.885316|-0.849857|-0.873267|


In [16]:
df.head() # first 5 rows

| |A|B|C|D|
|---|---|---|---|---|
|2025-01-23|-0.546589|-0.554499|0.813592|0.439799|
|2025-01-24|1.952686|1.204519|-0.587892|0.322988|
|2025-01-25|-0.76934|-0.603733|0.76391|0.708147|
|2025-01-26|-1.349199|0.881114|0.136657|-1.355605|
|2025-01-27|0.329758|-0.50083|-0.184495|0.579466|


In [17]:
df.tail(3)

| |A|B|C|D|
|---|---|---|---|---|
|2025-01-26|-1.349199|0.881114|0.136657|-1.355605|
|2025-01-27|0.329758|-0.50083|-0.184495|0.579466|
|2025-01-28|0.09059|-0.885316|-0.849857|-0.873267|
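
To make the row-count rules for `head()` and `tail()` concrete, here is a minimal sketch on a deterministic eight-row frame. The frame `df_demo` and its column `x` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"x": np.arange(8)})  # rows hold the values 0..7

print(len(df_demo.head()))            # 5, the default
print(df_demo.head(2)["x"].tolist())  # [0, 1]
print(df_demo.tail(3)["x"].tolist())  # [5, 6, 7]
```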


Display the index, the columns, and the underlying NumPy data.

In [18]:
df.index

DatetimeIndex(['2025-01-23', '2025-01-24', '2025-01-25', '2025-01-26',
               '2025-01-27', '2025-01-28'],
              dtype='datetime64[ns]', freq='D')

In [19]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [20]:
df.values

array([[-0.54658863, -0.55449873,  0.81359169,  0.43979884],
       [ 1.95268561,  1.20451876, -0.58789204,  0.32298773],
       [-0.76934018, -0.60373303,  0.76391003,  0.70814726],
       [-1.34919939,  0.88111376,  0.13665677, -1.35560538],
       [ 0.32975752, -0.50082974, -0.1844955 ,  0.57946646],
       [ 0.09059024, -0.88531586, -0.84985747, -0.87326705]])
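
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for obtaining the underlying array. The sketch below, using the hypothetical frames `df_num` and `df_mixed`, shows that the resulting array has a single dtype, so mixing numeric and string columns falls back to `object`.

```python
import numpy as np
import pandas as pd

# All-float frame: the array keeps the float64 dtype.
df_num = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
arr = df_num.to_numpy()
print(arr.dtype, arr.shape)

# Mixed float and string columns: NumPy needs one common dtype for the array.
df_mixed = pd.DataFrame({"a": [1.0, 2.0], "label": ["x", "y"]})
print(df_mixed.to_numpy().dtype)
```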

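`describe()` shows a quick statistical summary of the data.

- `count`: number of valid (non-missing) values in the column
- `mean`: mean of the column
- `std`: standard deviation of the column (how spread out the data is)
- `min`: minimum value of the column
- `25%`: first quartile (the point below which 25% of the data falls)
- `50%`: median (the 50% point, also called the second quartile)
- `75%`: third quartile (the point below which 75% of the data falls)
- `max`: maximum value of the column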

In [21]:
df.describe()

| |A|B|C|D|
|---|---|---|---|---|
|count|6.0|6.0|6.0|6.0|
|mean|-0.048682|-0.076457|0.015319|-0.029745|
|std|1.151233|0.883032|0.687702|0.863707|
|min|-1.349199|-0.885316|-0.849857|-1.355605|
|25%|-0.713652|-0.591424|-0.487043|-0.574203|
|50%|-0.227999|-0.527664|-0.023919|0.381393|
|75%|0.269966|0.535628|0.607097|0.54455|
|max|1.952686|1.204519|0.813592|0.708147|
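
As a quick check of how the summary rows relate to the individual methods, the sketch below runs `describe()` on a small hypothetical column `v` containing one missing value. The frame `df_stats` is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

df_stats = pd.DataFrame({"v": [1.0, 2.0, 3.0, 4.0, np.nan]})
summary = df_stats.describe()

print(summary.loc["count", "v"])  # 4.0, the NaN is excluded
print(summary.loc["mean", "v"])   # 2.5
print(summary.loc["50%", "v"])    # 2.5, same as df_stats["v"].median()
print(summary.loc["std", "v"])    # sample standard deviation, same as df_stats["v"].std()
```

Note that `count` is smaller than the number of rows whenever a column contains missing values.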


Transpose the data.

In [22]:
df.T

| |2025-01-23|2025-01-24|2025-01-25|2025-01-26|2025-01-27|2025-01-28|
|---|---|---|---|---|---|---|
|A|-0.546589|1.952686|-0.76934|-1.349199|0.329758|0.09059|
|B|-0.554499|1.204519|-0.603733|0.881114|-0.50083|-0.885316|
|C|0.813592|-0.587892|0.76391|0.136657|-0.184495|-0.849857|
|D|0.439799|0.322988|0.708147|-1.355605|0.579466|-0.873267|
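
Transposing swaps the row index and the column labels. A minimal sketch on a hypothetical two-by-two frame `df_t` makes the swap easy to follow:

```python
import pandas as pd

df_t = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])
transposed = df_t.T

print(transposed.index.tolist())    # ['A', 'B']
print(transposed.columns.tolist())  # ['r1', 'r2']
print(transposed.loc["B", "r1"])    # 3, the value at row 'r1', column 'B' of df_t
```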


Sort by an axis.

In [25]:
df.sort_index(axis=1, ascending=False)

| |D|C|B|A|
|---|---|---|---|---|
|2025-01-23|0.439799|0.813592|-0.554499|-0.546589|
|2025-01-24|0.322988|-0.587892|1.204519|1.952686|
|2025-01-25|0.708147|0.76391|-0.603733|-0.76934|
|2025-01-26|-1.355605|0.136657|0.881114|-1.349199|
|2025-01-27|0.579466|-0.184495|-0.50083|0.329758|
|2025-01-28|-0.873267|-0.849857|-0.885316|0.09059|


Sort by values.

In [26]:
df.sort_values(by='B')

| |A|B|C|D|
|---|---|---|---|---|
|2025-01-28|0.09059|-0.885316|-0.849857|-0.873267|
|2025-01-25|-0.76934|-0.603733|0.76391|0.708147|
|2025-01-23|-0.546589|-0.554499|0.813592|0.439799|
|2025-01-27|0.329758|-0.50083|-0.184495|0.579466|
|2025-01-26|-1.349199|0.881114|0.136657|-1.355605|
|2025-01-24|1.952686|1.204519|-0.587892|0.322988|
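
The two sorting methods can be contrasted on a small deterministic frame: `sort_index` reorders by labels along an axis, while `sort_values` reorders rows by the contents of a column. The frame `df_sort` and its values are hypothetical, chosen only for this sketch.

```python
import pandas as pd

df_sort = pd.DataFrame(
    {"B": [0.3, -1.2, 0.9], "A": [1, 2, 3]},
    index=["x", "y", "z"],
)

# axis=1 sorts the column labels; the rows stay in place.
by_cols = df_sort.sort_index(axis=1)
print(by_cols.columns.tolist())  # ['A', 'B']

# Sort the rows by the values in column 'B', largest first.
by_vals = df_sort.sort_values(by="B", ascending=False)
print(by_vals.index.tolist())    # ['z', 'x', 'y']
```

Both methods return a new DataFrame by default and leave `df_sort` unchanged.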
