# 2. Wrangling Data in Python with pandas
**Let's work with DataFrames using pandas.**

1. Getting started with pandas
    - Prerequisite: tables
    - Importing pandas
2. Handling 1-D data with pandas - Series
    - Creating a Series
    - Series vs ndarray
    - Series vs dict
    - Naming a Series
3. Handling 2-D data with pandas - DataFrame
    - Creating a DataFrame
    - From CSV to DataFrame
    - Accessing DataFrame data

[COVID data used in this lesson](https://www.kaggle.com/imdevskp/corona-virus-report)

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width: 90% !important; }</style>"))

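## I. Getting started with pandas

### Prerequisite: tables
- A table is a data structure (container) that stores and manages data in rows and columns
- Rows usually represent entities, and columns represent their attributes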


In [2]:
import pandas as pd

## II. Handling 1-D data with pandas - Series

### What is a Series?
- A 1-D labeled array
- You can specify its index (the labels) yourself

In [3]:
s = pd.Series([1, 4, 9, 16, 25])
s

0     1
1     4
2     9
3    16
4    25
dtype: int64

In [4]:
t = pd.Series({"one" : 1, "two" : 2, "three" : 3})
t

one      1
two      2
three    3
dtype: int64
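
Besides passing a dict, the labels can also be supplied directly through the `index` argument. The following is a minimal sketch; the letter labels and the variable name `squares` are made up for illustration.

```python
import pandas as pd

# Hypothetical labels: letters instead of the default 0..4
squares = pd.Series([1, 4, 9, 16, 25], index=["a", "b", "c", "d", "e"])

print(squares["c"])             # look up a value by its label
print(squares.index.tolist())   # the labels themselves
```

The values are the same as in `s` above; only the labels used to reach them differ.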

### Series + NumPy
- A Series behaves much like an ndarray

In [5]:
print(s[1])
print()
print(t.iloc[1])  # positional access; t[1] on a string-labeled Series is deprecated in recent pandas
print()
print(t[1 : 3])

4

2

two      2
three    3
dtype: int64


In [6]:
s[s > s.median()]  # keep only the values greater than the Series' own median

3    16
4    25
dtype: int64

In [7]:
s[[3, 1, 4]]  # pass several index labels bundled in a list

3    16
1     4
4    25
dtype: int64

In [8]:
import numpy as np

np.exp(s)

0    2.718282e+00
1    5.459815e+01
2    8.103084e+03
3    8.886111e+06
4    7.200490e+10
dtype: float64

In [9]:
s.dtype

dtype('int64')

### Series + dict
- A Series also behaves much like a dict

In [10]:
t

one      1
two      2
three    3
dtype: int64

In [11]:
t['one']

1

In [12]:
# Add a value to the Series
t["five"] = 5
t

one      1
two      2
three    3
five     5
dtype: int64

In [13]:
print(t.get("seven"))  # the label does not exist, so None is returned
print(t.get("seven", 0))  # the second parameter is the value to return when the label is missing

None
0
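
The dict analogy also covers membership tests and missing keys. The sketch below uses a made-up Series named `prices`; it assumes nothing beyond standard pandas behavior.

```python
import pandas as pd

prices = pd.Series({"one": 1, "two": 2, "three": 3})  # illustrative data

has_two = "two" in prices        # `in` checks the labels, like dict keys
has_seven = "seven" in prices

try:
    prices["seven"]              # a missing label raises KeyError, like a dict
    missing_raised = False
except KeyError:
    missing_raised = True

print(has_two, has_seven, missing_raised)
```

This is why `.get()` with a default is handy when a label may be absent.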


### Naming a Series
- A Series has a `name` attribute.
- The name can be set when the Series is first created

In [14]:
s = pd.Series(np.random.randn(5), name = "random_nums")
s

0   -1.271151
1    1.448788
2    0.897725
3   -0.453392
4    1.047403
Name: random_nums, dtype: float64

In [15]:
s.name = "arbitrary random numbers"
s

0   -1.271151
1    1.448788
2    0.897725
3   -0.453392
4    1.047403
Name: arbitrary random numbers, dtype: float64

## III. Handling 2-D data with pandas - DataFrame

### What is a DataFrame?
- A 2-D labeled **table**
- Its index can also be specified

In [16]:
d = {"height" : [1, 2, 3, 4], "weight" : [30, 40, 50, 60]}
df = pd.DataFrame(d)
df

|   | height | weight |
|---|---|---|
| 0 | 1 | 30 |
| 1 | 2 | 40 |
| 2 | 3 | 50 |
| 3 | 4 | 60 |
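
As with a Series, the row labels of a DataFrame can be set through the `index` argument. This is a small sketch with hypothetical labels `"a"` to `"d"`; the name `people` is an assumption for illustration.

```python
import pandas as pd

# Same columns as above, but with hypothetical row labels instead of 0..3
people = pd.DataFrame(
    {"height": [1, 2, 3, 4], "weight": [30, 40, 50, 60]},
    index=["a", "b", "c", "d"],
)

print(people.index.tolist())     # the row labels
print(people.columns.tolist())   # the column labels
```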


### Checking dtypes

In [17]:
df.dtypes

height    int64
weight    int64
dtype: object

### From CSV to DataFrame
- A CSV (comma-separated values) file can be loaded into a DataFrame
- Use `pd.read_csv()`

In [18]:
# Assuming country_wise_latest.csv exists in the same directory

covid = pd.read_csv("./country_wise_latest.csv")
covid

|   | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 36263 | 1269 | 25198 | 9796 | 106 | 10 | 18 | 3.50 | 69.49 | 5.04 | 35526 | 737 | 2.07 | Eastern Mediterranean |
| 1 | Albania | 4880 | 144 | 2745 | 1991 | 117 | 6 | 63 | 2.95 | 56.25 | 5.25 | 4171 | 709 | 17.00 | Europe |
| 2 | Algeria | 27973 | 1163 | 18837 | 7973 | 616 | 8 | 749 | 4.16 | 67.34 | 6.17 | 23691 | 4282 | 18.07 | Africa |
| 3 | Andorra | 907 | 52 | 803 | 52 | 10 | 0 | 0 | 5.73 | 88.53 | 6.48 | 884 | 23 | 2.60 | Europe |
| 4 | Angola | 950 | 41 | 242 | 667 | 18 | 1 | 0 | 4.32 | 25.47 | 16.94 | 749 | 201 | 26.84 | Africa |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 182 | West Bank and Gaza | 10621 | 78 | 3752 | 6791 | 152 | 2 | 0 | 0.73 | 35.33 | 2.08 | 8916 | 1705 | 19.12 | Eastern Mediterranean |
| 183 | Western Sahara | 10 | 1 | 8 | 1 | 0 | 0 | 0 | 10.00 | 80.00 | 12.50 | 10 | 0 | 0.00 | Africa |
| 184 | Yemen | 1691 | 483 | 833 | 375 | 10 | 4 | 36 | 28.56 | 49.26 | 57.98 | 1619 | 72 | 4.45 | Eastern Mediterranean |
| 185 | Zambia | 4552 | 140 | 2815 | 1597 | 71 | 1 | 465 | 3.08 | 61.84 | 4.97 | 3326 | 1226 | 36.86 | Africa |


### Using pandas

### 1. Looking at part of the data

- `head(n)`: returns the first n rows
- `tail(n)`: returns the last n rows

In [19]:
covid.head(5)
# tail() works the same way

|   | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 36263 | 1269 | 25198 | 9796 | 106 | 10 | 18 | 3.5 | 69.49 | 5.04 | 35526 | 737 | 2.07 | Eastern Mediterranean |
| 1 | Albania | 4880 | 144 | 2745 | 1991 | 117 | 6 | 63 | 2.95 | 56.25 | 5.25 | 4171 | 709 | 17.0 | Europe |
| 2 | Algeria | 27973 | 1163 | 18837 | 7973 | 616 | 8 | 749 | 4.16 | 67.34 | 6.17 | 23691 | 4282 | 18.07 | Africa |
| 3 | Andorra | 907 | 52 | 803 | 52 | 10 | 0 | 0 | 5.73 | 88.53 | 6.48 | 884 | 23 | 2.6 | Europe |
| 4 | Angola | 950 | 41 | 242 | 667 | 18 | 1 | 0 | 4.32 | 25.47 | 16.94 | 749 | 201 | 26.84 | Africa |
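
Since only `head` is run above, here is a small sketch of `tail` and of the default row count, using a toy frame (`numbers`, a made-up name) rather than the COVID data.

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})  # toy frame with rows 0..9

first_three = numbers.head(3)   # rows 0, 1, 2
last_three = numbers.tail(3)    # rows 7, 8, 9
default_head = numbers.head()   # n defaults to 5 when omitted

print(last_three)
```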


### 2. Accessing data

- `df["column_name"]` or `df.column_name`

In [20]:
covid["Confirmed"]

0      36263
1       4880
2      27973
3        907
4        950
       ...  
182    10621
183       10
184     1691
185     4552
186     2704
Name: Confirmed, Length: 187, dtype: int64
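
The two access forms are not fully interchangeable: attribute access only works when the column name is a valid Python identifier. A minimal sketch on made-up data (the frame `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"Confirmed": [10, 20], "New cases": [1, 2]})  # toy data

by_bracket = toy["Confirmed"]
by_attribute = toy.Confirmed             # works: "Confirmed" is a valid identifier
same = by_bracket.equals(by_attribute)   # both return the same Series

new_cases = toy["New cases"]             # the space in the name requires brackets

print(same, type(by_bracket).__name__)
```

Bracket access is the safer default for this dataset, since many of its column names contain spaces or slashes.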

### TIP: each column of a DataFrame is a Series!

In [21]:
covid["Confirmed"][1 : 5]

1     4880
2    27973
3      907
4      950
Name: Confirmed, dtype: int64

### 3. Accessing data with a "condition"

In [22]:
# Find countries with more than 100 new cases
# The expression covid["New cases"] > 100 inside covid[...] returns a Boolean Series, and only the rows where it is True are returned
covid[covid["New cases"] > 100].head(5)

|   | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 36263 | 1269 | 25198 | 9796 | 106 | 10 | 18 | 3.5 | 69.49 | 5.04 | 35526 | 737 | 2.07 | Eastern Mediterranean |
| 1 | Albania | 4880 | 144 | 2745 | 1991 | 117 | 6 | 63 | 2.95 | 56.25 | 5.25 | 4171 | 709 | 17.0 | Europe |
| 2 | Algeria | 27973 | 1163 | 18837 | 7973 | 616 | 8 | 749 | 4.16 | 67.34 | 6.17 | 23691 | 4282 | 18.07 | Africa |
| 6 | Argentina | 167416 | 3059 | 72575 | 91782 | 4890 | 120 | 2057 | 1.83 | 43.35 | 4.21 | 130774 | 36642 | 28.02 | Americas |
| 8 | Australia | 15303 | 167 | 9311 | 5825 | 368 | 6 | 137 | 1.09 | 60.84 | 1.79 | 12428 | 2875 | 23.13 | Western Pacific |


In [23]:
# Find the countries whose WHO Region is South-East Asia
# First check the available categories with unique()
covid["WHO Region"].unique()

array(['Eastern Mediterranean', 'Europe', 'Africa', 'Americas',
       'Western Pacific', 'South-East Asia'], dtype=object)

In [24]:
covid[covid["WHO Region"] == "South-East Asia"].head(5)

|   | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | Bangladesh | 226225 | 2965 | 125683 | 97577 | 2772 | 37 | 1801 | 1.31 | 55.56 | 2.36 | 207453 | 18772 | 9.05 | South-East Asia |
| 19 | Bhutan | 99 | 0 | 86 | 13 | 4 | 0 | 1 | 0.0 | 86.87 | 0.0 | 90 | 9 | 10.0 | South-East Asia |
| 27 | Burma | 350 | 6 | 292 | 52 | 0 | 0 | 2 | 1.71 | 83.43 | 2.05 | 341 | 9 | 2.64 | South-East Asia |
| 79 | India | 1480073 | 33408 | 951166 | 495499 | 44457 | 637 | 33598 | 2.26 | 64.26 | 3.51 | 1155338 | 324735 | 28.11 | South-East Asia |
| 80 | Indonesia | 100303 | 4838 | 58173 | 37292 | 1525 | 57 | 1518 | 4.82 | 58.0 | 8.32 | 88214 | 12089 | 13.7 | South-East Asia |
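
Boolean masks like the ones above can be combined with `&` (and), `|` (or), and `~` (not). The sketch below runs on a made-up four-row frame; the labels `"A"` to `"D"` and the numbers are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "New cases": [0, 150, 0, 300],
        "WHO Region": ["Europe", "Europe", "Africa", "Americas"],
    },
    index=["A", "B", "C", "D"],  # hypothetical country labels
)

# Each comparison needs its own parentheses because & and | bind
# more tightly than > and == in Python
both = toy[(toy["New cases"] > 100) & (toy["WHO Region"] == "Europe")]
either = toy[(toy["New cases"] > 100) | (toy["WHO Region"] == "Europe")]
not_europe = toy[~(toy["WHO Region"] == "Europe")]

print(both.index.tolist(), either.index.tolist(), not_europe.index.tolist())
```

Python's `and` / `or` keywords do not work here because they expect a single truth value, not an element-wise Series.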


### 4. Accessing data by row

### Accessing by index label: `.loc[row, col]`

In [25]:
# At this point the index is the integer shown in the leftmost column of covid, so a country name will not work as a label yet

covid.loc[1]

Country/Region            Albania
Confirmed                    4880
Deaths                        144
Recovered                    2745
Active                       1991
New cases                     117
New deaths                      6
New recovered                  63
Deaths / 100 Cases           2.95
Recovered / 100 Cases       56.25
Deaths / 100 Recovered       5.25
Confirmed last week          4171
1 week change                 709
1 week % increase              17
WHO Region                 Europe
Name: 1, dtype: object

In [26]:
# I'd rather use the country name as the index...

covid.set_index("Country/Region", inplace = True)
covid.loc["India"]

Confirmed                         1480073
Deaths                              33408
Recovered                          951166
Active                             495499
New cases                           44457
New deaths                            637
New recovered                       33598
Deaths / 100 Cases                   2.26
Recovered / 100 Cases               64.26
Deaths / 100 Recovered               3.51
Confirmed last week               1155338
1 week change                      324735
1 week % increase                   28.11
WHO Region                South-East Asia
Name: India, dtype: object

In [27]:
covid.loc["India", "Deaths"]

33408

### Accessing by integer position: `.iloc[rowidx, colidx]`

In [28]:
covid.iloc[0]  # Afghanistan

Confirmed                                 36263
Deaths                                     1269
Recovered                                 25198
Active                                     9796
New cases                                   106
New deaths                                   10
New recovered                                18
Deaths / 100 Cases                          3.5
Recovered / 100 Cases                     69.49
Deaths / 100 Recovered                     5.04
Confirmed last week                       35526
1 week change                               737
1 week % increase                          2.07
WHO Region                Eastern Mediterranean
Name: Afghanistan, dtype: object

In [29]:
covid.iloc[0, 2 : 4]  # Afghanistan, [Recovered, Active]

Recovered    25198
Active        9796
Name: Afghanistan, dtype: object
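
One difference between the two accessors is easy to miss: a slice in `.loc` includes its end label, while a slice in `.iloc` excludes its end position, as ordinary Python slices do. A minimal sketch on made-up data (the labels `"X"`, `"Y"`, `"Z"` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Confirmed": [100, 200, 300], "Deaths": [1, 2, 3], "Recovered": [50, 60, 70]},
    index=["X", "Y", "Z"],  # hypothetical country labels
)

by_label = toy.loc["X":"Y", "Confirmed"]   # label slice: "Y" is included
by_position = toy.iloc[0:1, 0]             # position slice: row 1 is excluded

print(by_label.tolist(), by_position.tolist())
```

This matches the `covid.iloc[0, 2 : 4]` call above, which returns columns 2 and 3 only.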

### 5. groupby

- Split: divide the DataFrame into groups according to a chosen key
- Apply: reduce each group with an aggregate function, e.g. sum(), mean(), median()
- Combine: assemble the per-group results into a new Series (group_key : applied_val)

`.groupby()`

In [30]:
# Confirmed cases per WHO Region

"""
1. Extract only the Confirmed column from covid
2. Group it by covid's WHO Region
"""

covid_by_region = covid["Confirmed"].groupby(by = covid["WHO Region"])
covid_by_region

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000020AF5408DC0>

In [31]:
covid_by_region.sum()  # total per region

WHO Region
Africa                    723207
Americas                 8839286
Eastern Mediterranean    1490744
Europe                   3299523
South-East Asia          1835297
Western Pacific           292428
Name: Confirmed, dtype: int64

In [32]:
# Average number of confirmed cases per country in each region

covid_by_region.mean()  # sum / number of countries

WHO Region
Africa                    15066.812500
Americas                 252551.028571
Eastern Mediterranean     67761.090909
Europe                    58920.053571
South-East Asia          183529.700000
Western Pacific           18276.750000
Name: Confirmed, dtype: float64
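
To make the split, apply, and combine steps concrete, they can be written out by hand and compared with the built-in call. This is only an illustrative sketch on a made-up frame; it is not how pandas implements `groupby` internally.

```python
import pandas as pd

toy = pd.DataFrame(
    {"region": ["A", "B", "A", "B"], "confirmed": [1, 10, 3, 30]}  # toy data
)

# Split: iterate over one sub-frame per region
# Apply: sum the confirmed column of each sub-frame
# Combine: collect the per-region sums into a new Series
manual = pd.Series(
    {key: group["confirmed"].sum() for key, group in toy.groupby("region")}
)

# The same result in a single call
built_in = toy["confirmed"].groupby(toy["region"]).sum()

print(manual.to_dict(), built_in.to_dict())
```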

## Mission:
### 1. In the covid data, which country has the highest number of deaths per 100 cases (`Deaths / 100 Cases`)?

In [33]:
# Scrolling back up is tedious, so take another quick look at the data
covid.head(5)

| Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 36263 | 1269 | 25198 | 9796 | 106 | 10 | 18 | 3.5 | 69.49 | 5.04 | 35526 | 737 | 2.07 | Eastern Mediterranean |
| Albania | 4880 | 144 | 2745 | 1991 | 117 | 6 | 63 | 2.95 | 56.25 | 5.25 | 4171 | 709 | 17.0 | Europe |
| Algeria | 27973 | 1163 | 18837 | 7973 | 616 | 8 | 749 | 4.16 | 67.34 | 6.17 | 23691 | 4282 | 18.07 | Africa |
| Andorra | 907 | 52 | 803 | 52 | 10 | 0 | 0 | 5.73 | 88.53 | 6.48 | 884 | 23 | 2.6 | Europe |
| Angola | 950 | 41 | 242 | 667 | 18 | 1 | 0 | 4.32 | 25.47 | 16.94 | 749 | 201 | 26.84 | Africa |


In [34]:
# Use nlargest to list the countries with the largest "Deaths / 100 Cases", in descending order
covid.nlargest(10, "Deaths / 100 Cases").head(5)

| Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yemen | 1691 | 483 | 833 | 375 | 10 | 4 | 36 | 28.56 | 49.26 | 57.98 | 1619 | 72 | 4.45 | Eastern Mediterranean |
| United Kingdom | 301708 | 45844 | 1437 | 254427 | 688 | 7 | 3 | 15.19 | 0.48 | 3190.26 | 296944 | 4764 | 1.6 | Europe |
| Belgium | 66428 | 9822 | 17452 | 39154 | 402 | 1 | 14 | 14.79 | 26.27 | 56.28 | 64094 | 2334 | 3.64 | Europe |
| Italy | 246286 | 35112 | 198593 | 12581 | 168 | 5 | 147 | 14.26 | 80.64 | 17.68 | 244624 | 1662 | 0.68 | Europe |
| France | 220352 | 30212 | 81212 | 108928 | 2551 | 17 | 267 | 13.71 | 36.86 | 37.2 | 214023 | 6329 | 2.96 | Europe |


### The rows come out in descending order of "Deaths / 100 Cases", so the answer is Yemen
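
When only the single top entry is needed, `Series.idxmax()` is an alternative to sorting: it returns the index label of the maximum value. The sketch below rebuilds a three-row frame from values shown in the outputs above; the name `toy` is made up.

```python
import pandas as pd

# Three rows copied from the outputs above, with the country as the index
toy = pd.DataFrame(
    {"Deaths / 100 Cases": [3.5, 28.56, 2.95]},
    index=["Afghanistan", "Yemen", "Albania"],
)

worst = toy["Deaths / 100 Cases"].idxmax()  # index label of the largest value
print(worst)
```

This relies on the country names being the index, as set earlier with `set_index`.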

### 2. In the covid data, list every country with no new cases whose WHO Region is 'Europe'.  
Hint: applying two conditions at once on a single line may trigger a warning.

In [35]:
# First extract the rows with no new cases (New cases == 0)
covid_no_newcases = covid[covid["New cases"] == 0]

# Then extract the rows whose WHO Region is Europe
covid_no_newcases_Europe = covid_no_newcases[covid_no_newcases["WHO Region"] == "Europe"]

# The country names are stored in the index, so just print the index
covid_no_newcases_Europe.index

Index(['Estonia', 'Holy See', 'Latvia', 'Liechtenstein', 'Monaco',
       'San Marino', 'Spain'],
      dtype='object', name='Country/Region')

### 3. Using this [data](https://www.kaggle.com/neuromusic/avocado-prices), find the highest average avocado price (`AveragePrice`) by region.

- Date - The date of the observation
- AveragePrice - the average price of a single avocado
- type - conventional or organic
- year - the year
- Region - the city or region of the observation
- Total Volume - Total number of avocados sold
- 4046 - Total number of avocados with PLU 4046 sold
- 4225 - Total number of avocados with PLU 4225 sold
- 4770 - Total number of avocados with PLU 4770 sold

In [36]:
# First load the CSV and take a look

avo = pd.read_csv("./avocado.csv")
avo.head(5)

|   | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |


In [37]:
# Each row means that Total Volume avocados were sold at AveragePrice, so compute the total cost first
avo["Price*Volume"] = (avo["AveragePrice"] * avo["Total Volume"])
# Sum the total cost and the total volume by region
avo_by_region = avo[["Total Volume", "Price*Volume", "region"]].groupby("region").sum()
# Total cost / total volume per region -> average price per region
avo_by_region["AveragePrice"] = avo_by_region["Price*Volume"] / avo_by_region["Total Volume"]
# Look at only the five regions with the highest average price
avo_by_region["AveragePrice"].nlargest(5)

region
HartfordSpringfield    1.404888
NewYork                1.392389
Syracuse               1.389533
Philadelphia           1.389442
BuffaloRochester       1.373767
Name: AveragePrice, dtype: float64

### If I haven't made a mistake... HartfordSpringfield is the most expensive region :)
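
The solution above uses a volume-weighted average, which can differ from a plain mean of the `AveragePrice` rows, and the question could also be read as asking for the highest single `AveragePrice` row per region. The sketch below compares the three readings on made-up numbers; which one the mission intends is an assumption left open here.

```python
import pandas as pd

# Made-up numbers: region A sells most of its volume at the lower price
toy = pd.DataFrame(
    {
        "region": ["A", "A", "B", "B"],
        "AveragePrice": [1.0, 2.0, 1.5, 1.5],
        "Total Volume": [300.0, 100.0, 50.0, 50.0],
    }
)

# Reading 1: plain mean of the per-row prices
simple = toy.groupby("region")["AveragePrice"].mean()

# Reading 2: volume-weighted mean, as in the solution above
toy["Price*Volume"] = toy["AveragePrice"] * toy["Total Volume"]
totals = toy.groupby("region")[["Price*Volume", "Total Volume"]].sum()
weighted = totals["Price*Volume"] / totals["Total Volume"]

# Reading 3: highest single AveragePrice observed in each region
highest_row_price = toy.groupby("region")["AveragePrice"].max()

print(simple["A"], weighted["A"], highest_row_price["A"])
```

For region A the plain mean is 1.5 while the weighted mean is 1.25, so the ranking of regions can change depending on the reading chosen.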