![pandas_1-1.png](attachment:f81e5f7d-38b0-427e-bbcc-acc7607b245b.png) 

## The name Pandas is derived from "panel data". So what does "panel" mean? The figure below gives a good picture of panel data.

![pandas_1-2.png](attachment:2abe823c-7b82-4b89-b4c6-8611619768fc.png)

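#### The figure above is a typical Excel screen. In this tutorial, two-dimensional data like this that is "not made up of numbers only" is loosely called panel data. (Strictly speaking, panel data is an econometrics term for observations of multiple entities over multiple time periods.)
#### Such data is always organized into "rows" and "columns". Because it is structured by rows and columns, it is also called structured data. Since so much data takes this form, a dedicated package was created for Python to handle it.
#### That package is Pandas.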

![pandas_1-3-1.png](attachment:250f8b20-9c30-415b-ad05-c9e86493467b.png)

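#### Unlike a list or a NumPy array, pandas data always has a key-value structure: every value is addressed by a label. The terms key and value also appear in one of the data types covered earlier, the dictionary.
#### Pandas data is very similar to this dictionary structure.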
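
#### The similarity to a dictionary can be sketched as follows. This is a minimal illustration, not part of the original notebook; the `items` values are the ones used in the cells below, and the variable name `ser` is only an example.

```python
import pandas as pd

# A plain dictionary and a Series built from it share the same keys
items = {'apple': 1000, 'cup': 2000, 'pen': 500, 'banana': 4000}
ser = pd.Series(items)

# Look up a value by its key, just as with the dictionary
print(items['apple'], ser['apple'])

# Membership tests check the keys (index labels), not the values
print('pen' in items, 'pen' in ser)

# keys() returns the index labels of the Series
print(list(ser.keys()))
```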

In [1]:
import pandas as pd # pandas is conventionally imported under the alias pd

In [2]:
# First, let's get familiar with dictionary data and its key-value structure

items = {'apple':1000, 
         'cup': 2000, 
         'pen': 500, 
         'banana': 4000}
print("Created dictionary: ", items)

# Look up a specific value by its key
print("Value of apple: ", items["apple"])
print("Value of pen: ", items["pen"])

Created dictionary:  {'apple': 1000, 'cup': 2000, 'pen': 500, 'banana': 4000}
Value of apple:  1000
Value of pen:  500


In [3]:
# A single key can also hold several values (here, a nested dictionary)

items = {'apple': {'price': 1000, 'pcs': "one"}, 
         'cup': {'price': 2000, 'pcs': "one"}, 
         'pen': {'price': 500, 'pcs': "four"}, 
         'banana': {'price': 4000, 'pcs': "three"}}
print("Created dictionary: ", items)

# Look up a specific value by its key
print("Value of apple: ", items["apple"])
print("Value of pen: ", items["pen"])

Created dictionary:  {'apple': {'price': 1000, 'pcs': 'one'}, 'cup': {'price': 2000, 'pcs': 'one'}, 'pen': {'price': 500, 'pcs': 'four'}, 'banana': {'price': 4000, 'pcs': 'three'}}
Value of apple:  {'price': 1000, 'pcs': 'one'}
Value of pen:  {'price': 500, 'pcs': 'four'}


In [44]:
# Create a pandas DataFrame from the dictionary
df = pd.DataFrame(items)
df 

Unnamed: 0,apple,cup,pen,banana
price,1000,2000,500,4000
pcs,one,one,four,three


In [47]:
# Exercise 1

# Create a dictionary of your own and build a pandas DataFrame from it
total_items = {'tank': {'price': 10000000, 'pcs': 1}, 
                'car': {'price': 2000000, 'pcs': 8}, 
                'house': {'price': 50000000, 'pcs': 4}}
df = pd.DataFrame(total_items)
df

Unnamed: 0,tank,car,house
price,10000000,2000000,50000000
pcs,1,8,4


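##### The outer keys of the dictionary become the columns, and the keys of each inner dictionary (`price`, `pcs`) become the row labels. Note also that a DataFrame can hold strings alongside numbers, as in the `pcs` row of the first example.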
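
##### If the outer keys should become rows instead, one option is `pd.DataFrame.from_dict` with `orient='index'`. The sketch below reuses the `total_items` dictionary from Exercise 1; the names `by_column` and `by_row` are only illustrative.

```python
import pandas as pd

total_items = {'tank': {'price': 10000000, 'pcs': 1},
               'car': {'price': 2000000, 'pcs': 8},
               'house': {'price': 50000000, 'pcs': 4}}

# Default constructor: outer keys -> columns, inner keys -> row labels
by_column = pd.DataFrame(total_items)

# orient='index': outer keys -> row labels, inner keys -> columns
by_row = pd.DataFrame.from_dict(total_items, orient='index')

print(by_column)
print(by_row)
```

##### Transposing the first result with `by_column.T` gives the same row/column arrangement.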

![pandas_1-4.png](attachment:62892787-13af-4f88-8cd7-3beaf475189b.png)

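#### As the figure above shows, a pandas DataFrame is essentially a collection of Series. What we created from the simple dictionary examples above was exactly such a DataFrame.
    Series: the data of a single column out of the whole dataset
    DataFrame: a structure that combines several Series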
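
#### The relationship between the two can be checked directly. In this small sketch (the `price` and `pcs` Series are made-up examples), two Series are combined into a DataFrame, and selecting one column gives back a Series.

```python
import pandas as pd

# Two Series that share the same index labels
price = pd.Series({'apple': 1000, 'cup': 2000, 'pen': 500}, name='price')
pcs = pd.Series({'apple': 1, 'cup': 1, 'pen': 4}, name='pcs')

# Each Series becomes one column of the DataFrame
df = pd.DataFrame({'price': price, 'pcs': pcs})
print(df)

# Selecting a single column returns a Series again
col = df['price']
print(type(col))
print(col.name)
```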

In [48]:
# Create pandas Series data

ser = pd.Series(['a','b','c',3]) 
print("pandas Series:\n", ser)

# Example: country calling codes
Nations_Numbers = {'Korea': 82, 'America': 1, 'Switzerland': 41, 'Italy': 39, 'Japan': 81, 'China': 86, 'Russia': 7}
ser = pd.Series(Nations_Numbers)
print("\npandas Series:\n", ser)


pandas Series:
 0    a
1    b
2    c
3    3
dtype: object

pandas Series:
 Korea          82
America         1
Switzerland    41
Italy          39
Japan          81
China          86
Russia          7
dtype: int64


In [49]:
# A pandas Series can be given a "name", and its index can be given a name too

# Both are None at this point
print(ser.name) 
print(ser.index.name)

ser.name = 'Country code'
ser.index.name = 'Country'

print("\npandas Series:\n", ser) # the Series name and the index name are now set

None
None

pandas Series:
 Country
Korea          82
America         1
Switzerland    41
Italy          39
Japan          81
China          86
Russia          7
Name: Country code, dtype: int64


In [59]:
# Exercise 2

# Create a pandas Series and set its name and its index name

ser = pd.Series({'health': 80, 'living': 40, 'food': 100}) 

ser.name = 'Importance'
ser.index.name = 'Life factor'

ser

Life factor
health     80
living     40
food      100
Name: Importance, dtype: int64

In [60]:
# Apply Series and DataFrame to a dictionary whose keys each hold several values
data = {'Region' : ['Korea', 'America', 'China', 'Canada', 'Italy'],
        'Sales' : [300, 200, 500, 150, 50],
        'Amount' : [90, 80, 100, 30, 10],
        }

ser = pd.Series(data)
ser

Region    [Korea, America, China, Canada, Italy]
Sales                   [300, 200, 500, 150, 50]
Amount                     [90, 80, 100, 30, 10]
dtype: object

In [61]:
df = pd.DataFrame(data)
df

Unnamed: 0,Region,Sales,Amount
0,Korea,300,90
1,America,200,80
2,China,500,100
3,Canada,150,30
4,Italy,50,10


##### The results above show the difference between a Series and a DataFrame. A Series always forms a single column, so each list is stored whole as one value, whereas a DataFrame has multiple columns and spreads each list over the rows.

In [62]:
# Give names to the rows and columns of a pandas DataFrame

df.index=['one','two','three','four','five'] # rows
df.columns = ['Country','Price','Quantity'] # columns
df

Unnamed: 0,Country,Price,Quantity
one,Korea,300,90
two,America,200,80
three,China,500,100
four,Canada,150,30
five,Italy,50,10


In [63]:
# Select a single column

print(df.Country)
print(df['Country'])

# Select several columns
print(df[['Country','Quantity']])


one        Korea
two      America
three      China
four      Canada
five       Italy
Name: Country, dtype: object
one        Korea
two      America
three      China
four      Canada
five       Italy
Name: Country, dtype: object
       Country  Quantity
one      Korea        90
two    America        80
three    China       100
four    Canada        30
five     Italy        10


In [64]:
# Select a row: loc (row label), iloc (integer position)
print(df.loc['one'])
print(df.iloc[2])

# Select several rows
print(df.loc['one':'three'])
print(df.iloc[1:4])


Country     Korea
Price         300
Quantity       90
Name: one, dtype: object
Country     China
Price         500
Quantity      100
Name: three, dtype: object
       Country  Price  Quantity
one      Korea    300        90
two    America    200        80
three    China    500       100
       Country  Price  Quantity
two    America    200        80
three    China    500       100
four    Canada    150        30
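
##### One detail worth noting in the slices above: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, like ordinary Python slicing. A small sketch with a made-up one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Sales': [300, 200, 500, 150, 50]},
                  index=['one', 'two', 'three', 'four', 'five'])

by_label = df.loc['two':'four']  # labels: the end label 'four' is included
by_position = df.iloc[1:3]       # positions: the end position 3 is excluded

print(by_label)
print(by_position)

# A single cell can be addressed by (row, column) in both styles
print(df.loc['three', 'Sales'], df.iloc[2, 0])
```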


In [73]:
# Exercise 3

# Build a DataFrame from a dictionary whose keys each hold several values
data = {
        'Name' : ['Jane', 'Hana', 'Jonghan', 'Mark', 'Irin'],
        'Strawberry' : [100, 200, 50, 80, 20],
        'Apple' : [90, 80, 100, 30, 10],
        'Banana' : [10, 20, 10, 45, 60],
        }

df = pd.DataFrame(data)

# Select rows: loc (row label), iloc (integer position)
print(df.iloc[1:4])

# Select several columns
print(df[['Name', "Apple"]])

      Name  Strawberry  Apple  Banana
1     Hana         200     80      20
2  Jonghan          50    100      10
3     Mark          80     30      45
      Name  Apple
0     Jane     90
1     Hana     80
2  Jonghan    100
3     Mark     30
4     Irin     10


## Loading data with pandas

In [75]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data' # Data URL
df = pd.read_csv(data_url, sep=r'\s+', header=None) # load the file; values are separated by whitespace and there is no header row

In [76]:
# A pandas DataFrame provides head and tail methods: by default, head returns the first 5 rows and tail returns the last 5.
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [77]:
df.tail() # the last index is 505, so the data has 506 rows in total

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.12,76.7,2.2875,1,273.0,21.0,396.9,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.9,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273.0,21.0,396.9,7.88,11.9


In [78]:
# Rename the columns to their English abbreviations
df.columns = [
    'CRIM','ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO' ,'B', 'LSTAT', 'MEDV'] 
df.head() # show the first five rows

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2
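
#### Column names can also be supplied while loading, through the `names` parameter of `pd.read_csv`. The sketch below avoids the network by reading a two-row sample from a string; the sample rows are copied from the output above, and `sample` and `columns` are illustrative names.

```python
import io
import pandas as pd

# Two whitespace-separated rows taken from the output shown above
sample = """\
0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
0.02731  0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
"""

columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
           'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# names= assigns the column labels at load time; r'\s+' is a raw-string regex
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None, names=columns)
print(df)
```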


In [79]:
# Select only specific columns
index_list = ["CRIM", "DIS", "B"] # store the names of the columns you want in a list
df[index_list] # index the DataFrame with that list

Unnamed: 0,CRIM,DIS,B
0,0.00632,4.0900,396.90
1,0.02731,4.9671,396.90
2,0.02729,4.9671,392.83
3,0.03237,6.0622,394.63
4,0.06905,6.0622,396.90
...,...,...,...
501,0.06263,2.4786,391.99
502,0.04527,2.2875,396.90
503,0.06076,2.1675,396.90
504,0.10959,2.3889,393.45


In [81]:
# Exercise 4

# Given df, print its first 5 rows
print("First 5 rows\n", df.head())

# Rename the columns as desired
df.columns = [
    'CRIM','ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO' ,'B', 'LSTAT', 'MEDV'
] 
df

First 5 rows
       CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0   
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0   
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0   
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0   
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0   

   PTRATIO       B  LSTAT  MEDV  
0     15.3  396.90   4.98  24.0  
1     17.8  396.90   9.14  21.6  
2     17.8  392.83   4.03  34.7  
3     18.7  394.63   2.94  33.4  
4     18.7  396.90   5.33  36.2  


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0


In [27]:
# Select the columns you want and choose how many rows from the top to print
print(df['CRIM'].head(3))  
print(df[["CRIM", "DIS", "B"]].head(15))  

0    0.00632
1    0.02731
2    0.02729
Name: CRIM, dtype: float64
       CRIM     DIS       B
0   0.00632  4.0900  396.90
1   0.02731  4.9671  396.90
2   0.02729  4.9671  392.83
3   0.03237  6.0622  394.63
4   0.06905  6.0622  396.90
5   0.02985  6.0622  394.12
6   0.08829  5.5605  395.60
7   0.14455  5.9505  396.90
8   0.21124  6.0821  386.63
9   0.17004  6.5921  386.71
10  0.22489  6.3467  392.52
11  0.11747  6.2267  396.90
12  0.09378  5.4509  390.50
13  0.62976  4.7075  396.90
14  0.63796  4.4619  380.02


In [37]:
# A DataFrame can also be sliced like an array; this selects the first four rows
df[:4]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
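
#### Note that square brackets behave differently depending on what is passed in: a slice selects rows by position, while a label selects a column. A small sketch on a made-up two-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'CRIM': [0.00632, 0.02731, 0.02729, 0.03237, 0.06905],
                   'MEDV': [24.0, 21.6, 34.7, 33.4, 36.2]})

first_rows = df[:2]      # a slice inside [] selects rows by position
one_column = df['CRIM']  # a label inside [] selects a column

# The row slice matches the more explicit iloc and head forms
print(first_rows.equals(df.iloc[:2]))
print(first_rows.equals(df.head(2)))
print(type(one_column).__name__)
```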


In [42]:
# The row index currently consists of integers starting from 0, but it can be replaced with one of the columns
df.index = df['CRIM'] # use CRIM as the index
del df['CRIM'] # delete the original CRIM column
df

CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0
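
#### The same result can usually be reached in one step with `DataFrame.set_index`, which by default also drops the column from the data. A minimal sketch on a three-row sample (values taken from the table above; `indexed` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'CRIM': [0.00632, 0.02731, 0.02729],
                   'ZN': [18.0, 0.0, 0.0],
                   'MEDV': [24.0, 21.6, 34.7]})

# set_index returns a new DataFrame; the original df is left unchanged
indexed = df.set_index('CRIM')

print(indexed)
print(list(indexed.columns))
print(indexed.index.name)
```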


In [43]:
# The index can also be switched back to integers.
df.index = list(range(len(df)))
df

Unnamed: 0,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
1,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
2,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
503,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
504,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0
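
#### `DataFrame.reset_index` is another way to get back to an integer index. With `drop=True` the old index is discarded, as in the cell above; without it, the old index is kept as a regular column. A sketch on a three-row sample (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ZN': [18.0, 0.0, 0.0], 'MEDV': [24.0, 21.6, 34.7]},
                  index=pd.Index([0.00632, 0.02731, 0.02729], name='CRIM'))

renumbered = df.reset_index(drop=True)  # old index is thrown away
restored = df.reset_index()             # old index comes back as a column

print(renumbered)
print(restored)
```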


In [85]:
# Exercise 5
# Given the URL of a dataset, load the data, change the index of df, and delete a column
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data' # Data URL

# Load df
df = pd.read_csv(data_url, sep=r'\s+', header=None) # load the file; values are separated by whitespace and there is no header row
df.columns = [
    'CRIM','ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO' ,'B', 'LSTAT', 'MEDV'
] 

# Replace the index of df with the NOX column
df.index = df['NOX'] # use NOX as the index
del df["NOX"] # delete the original NOX column
df

NOX,CRIM,ZN,INDUS,CHAS,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0.538,0.00632,18.0,2.31,0,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
0.469,0.02731,0.0,7.07,0,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
0.469,0.02729,0.0,7.07,0,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
0.458,0.03237,0.0,2.18,0,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
0.458,0.06905,0.0,2.18,0,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
0.573,0.06263,0.0,11.93,0,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
0.573,0.04527,0.0,11.93,0,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
0.573,0.06076,0.0,11.93,0,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
0.573,0.10959,0.0,11.93,0,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0
