# 2. Wrangling Data in Python with pandas
**Let's work with DataFrames using pandas.**

1. Getting started with pandas
    - Prerequisite: Table
    - Importing pandas

2. Handling 1-D data with pandas - Series
    - Declaring a Series
    - Series vs ndarray
    - Series vs dict
    - Naming a Series
3. Handling 2-D data with pandas - DataFrame
    - Declaring a DataFrame
    - From CSV to DataFrame
    - Accessing DataFrame data

[COVID data used in this lesson](https://www.kaggle.com/imdevskp/corona-virus-report)

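## I. Getting started with pandas

### Prerequisite: Table  
- A data structure (container) that stores and manages data in rows and columns
- Typically, each row represents an entity and each column represents an attribute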
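
To make the row/column idea concrete, here is a minimal sketch of a table built with pandas. The student names and values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy table: each row is one entity (a student),
# each column is one attribute of that entity.
table = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cho"],
        "age": [21, 25, 19],
        "major": ["math", "physics", "biology"],
    }
)

n_rows, n_cols = table.shape  # (number of entities, number of attributes)
print(n_rows, n_cols)
print(list(table.columns))
```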

## II. Handling 1-D data with pandas - Series

Series: a 1-D labeled array whose index can be specified explicitly.

In [4]:
import pandas as pd

In [6]:
# When a list is converted to a Series, a default integer index (0, 1, 2, ...) is assigned automatically.
s = pd.Series([1, 4, 6, 89, 10])
s

0     1
1     4
2     6
3    89
4    10
dtype: int64

In [7]:
# When a dictionary is converted to a Series, its keys become the index.
dic = {'one':1, 'two':2, 'three':3}

t = pd.Series(dic)
t

one      1
two      2
three    3
dtype: int64

In [9]:
# Access elements by index
s[1], s[0], s[3]

(4, 1, 89)

In [11]:
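# A Series built from a dictionary can also be read by integer position (the order of the values).
# Use .iloc for this: plain t[0] on a label-indexed Series is deprecated in recent pandas versions.
t.iloc[0], t.iloc[2], t.iloc[1]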

(1, 3, 2)

In [12]:
# Of course, values can also be looked up by key (label).
t['one']

1
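
Because a Series can be read both by label and by position, it can help to make the intent explicit. The sketch below uses `.loc` for labels and `.iloc` for positions; the second Series `u` is a hypothetical example whose integer labels are deliberately shuffled to show where the two differ.

```python
import pandas as pd

t = pd.Series({"one": 1, "two": 2, "three": 3})

by_label = t.loc["two"]    # label-based lookup
by_position = t.iloc[1]    # position-based lookup (second element)
print(by_label, by_position)

# Hypothetical Series whose integer labels are not 0, 1, 2 in order.
u = pd.Series([10, 20, 30], index=[2, 0, 1])
label_zero = u.loc[0]      # the element labeled 0
position_zero = u.iloc[0]  # the first element
print(label_zero, position_zero)
```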

In [15]:
# Slicing by label returns a Series. Unlike positional slices, label slices include the end label.
t['one':], t['two':'three']

(one      1
 two      2
 three    3
 dtype: int64,
 two      2
 three    3
 dtype: int64)
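
The difference between label slices and positional slices is easy to miss, so here is a small sketch comparing the two on the same data.

```python
import pandas as pd

t = pd.Series({"one": 1, "two": 2, "three": 3})

# Label slice: both endpoints are included.
label_slice = t.loc["one":"two"]

# Positional slice: the stop position is excluded, as with Python lists.
pos_slice = t.iloc[0:1]

print(label_slice)
print(pos_slice)
```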

In [29]:
# Passing a list of index labels returns just those elements as a Series, in the requested order and with their original labels.
s[[4,1,2]]

4    10
1     4
2     6
dtype: int64

In [24]:
s2 = pd.Series([[2,4,1],[6,0,1]])
s2, s2[1][2]

(0    [2, 4, 1]
 1    [6, 0, 1]
 dtype: object,
 1)

In [25]:
# Unlike NumPy: a Series is one-dimensional, so 2-D style indexing fails.
s2[1,:]

KeyError: 'key of type tuple not found and not a MultiIndex'
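
The error arises because a Series of lists is still a 1-D container whose elements happen to be Python lists. As a rough comparison, the sketch below puts the same nested list into a NumPy array, a Series, and a DataFrame.

```python
import numpy as np
import pandas as pd

nested = [[2, 4, 1], [6, 0, 1]]

arr = np.array(nested)        # true 2-D array, shape (2, 3)
s2 = pd.Series(nested)        # 1-D Series holding two list objects
frame = pd.DataFrame(nested)  # 2-D labeled table

row_from_array = arr[1, :]         # 2-D indexing works on the array
row_from_series = s2.iloc[1]       # one element, which is a Python list
row_from_frame = frame.iloc[1, :]  # 2-D positional indexing on a DataFrame

print(arr.shape, s2.shape, frame.shape)
```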

In [18]:
s.median(), t.median()

(6.0, 2.0)

In [21]:
# A boolean condition on the values selects the matching elements; the result keeps their original index labels.
s[s > 3], t[t > 1]

(1     4
 2     6
 3    89
 4    10
 dtype: int64,
 two      2
 three    3
 dtype: int64)

In [27]:
# The boolean mask must share the index of the Series being filtered; a mask built from a different Series fails.
s[t > 1]

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
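
The mask is matched to the Series by index label, not by position, which is why a mask from `t` cannot filter `s`. If positional filtering is really what is wanted, one possible workaround (an illustration, not the only option) is to convert the mask to a plain NumPy array of matching length.

```python
import pandas as pd

s = pd.Series([1, 4, 6, 89, 10])
t = pd.Series({"one": 1, "two": 2, "three": 3})

# A mask derived from s shares its index, so it aligns cleanly.
selected = s[s > 3]

# A mask derived from t has different labels. Dropping the labels with
# to_numpy() applies it by position; the lengths must then match.
s_short = s.iloc[:3]
by_position = s_short[(t > 1).to_numpy()]

print(selected)
print(by_position)
```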

In [30]:
import numpy as np

np.exp(s)

0    2.718282e+00
1    5.459815e+01
2    4.034288e+02
3    4.489613e+38
4    2.202647e+04
dtype: float64
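
NumPy functions such as `np.exp` are applied element by element and the result is again a Series with the same index. The sketch below shows this with a label-indexed Series, where the preserved index is easier to see.

```python
import numpy as np
import pandas as pd

t = pd.Series({"one": 1, "two": 2, "three": 3})

squared = np.square(t)  # NumPy ufunc applied to each element
doubled = t * 2         # arithmetic operators are element-wise too

print(squared)
print(doubled)
```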

In [35]:
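# Add a value to a Series: assigning to a label that does not exist yet appends a new element.
# (Labels 5 and 6 in the output below were presumably added in cells that are not shown here.)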
s[10] = 1

In [36]:
s

0      1
1      4
2      6
3     89
4     10
5     10
6      1
10     1
dtype: int64

In [37]:
t['any'] = 'good'

In [38]:
t

one         1
two         2
three       3
any      good
dtype: object
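
Note the change from `int64` to `object` in the output above: a Series has a single dtype, so adding a string forces pandas to fall back to the generic `object` dtype. A minimal sketch of that effect:

```python
import pandas as pd

t = pd.Series({"one": 1, "two": 2, "three": 3})
dtype_before = t.dtype

# Adding a string value under a new label changes the dtype of the whole Series.
t["any"] = "good"
dtype_after = t.dtype

print(dtype_before, dtype_after)
```

Mixed-type Series like this lose fast numeric operations, so it is usually better to keep one kind of value per Series.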

In [39]:
'any' in t

True

In [40]:
# get returns the value for a key; if the key is missing it returns None, so nothing is displayed.
t.get('seven')

In [42]:
# A default return value can be specified for missing keys.
t.get('seven',7)

7
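
For comparison, the sketch below contrasts `get` with plain bracket access, which raises a `KeyError` for a missing label instead of returning a default.

```python
import pandas as pd

t = pd.Series({"one": 1, "two": 2, "three": 3})

missing = t.get("seven")      # None: the key is absent and no default is given
fallback = t.get("seven", 7)  # the supplied default
present = t.get("two", 0)     # the key exists, so the default is ignored

# Bracket access raises instead of returning a default.
try:
    t["seven"]
    raised = False
except KeyError:
    raised = True

print(missing, fallback, present, raised)
```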

## III. Handling 2-D data with pandas - DataFrame

DataFrame: a 2-D labeled table whose index and column names can be specified.

In [43]:
# Convert a dictionary to a DataFrame
d = {'height':[1,2,3,4], 'weight':[20, 49, 29, 55]}

In [44]:
df = pd.DataFrame(d)
df

   height  weight
0       1      20
1       2      49
2       3      29
3       4      55
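
Each dictionary key becomes a column and each list supplies that column's values. The sketch below builds the same data with an explicit index (the labels `a` to `d` are hypothetical) and inspects the basic structure of the result.

```python
import pandas as pd

d = {"height": [1, 2, 3, 4], "weight": [20, 49, 29, 55]}

# Hypothetical row labels, supplied through the index argument.
df = pd.DataFrame(d, index=["a", "b", "c", "d"])

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names come from the dictionary keys
print(list(df.index))    # row labels come from the index argument
print(df.dtypes)         # each column has its own dtype
```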


In [46]:
covid_data = pd.read_csv('~/Jupyter workspace/archive/country_wise_latest.csv')
covid_data

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.50,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.00,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
3,Andorra,907,52,803,52,10,0,0,5.73,88.53,6.48,884,23,2.60,Europe
4,Angola,950,41,242,667,18,1,0,4.32,25.47,16.94,749,201,26.84,Africa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
182,West Bank and Gaza,10621,78,3752,6791,152,2,0,0.73,35.33,2.08,8916,1705,19.12,Eastern Mediterranean
183,Western Sahara,10,1,8,1,0,0,0,10.00,80.00,12.50,10,0,0.00,Africa
184,Yemen,1691,483,833,375,10,4,36,28.56,49.26,57.98,1619,72,4.45,Eastern Mediterranean
185,Zambia,4552,140,2815,1597,71,1,465,3.08,61.84,4.97,3326,1226,36.86,Africa


In [47]:
covid_data.head()

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.5,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.0,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
3,Andorra,907,52,803,52,10,0,0,5.73,88.53,6.48,884,23,2.6,Europe
4,Angola,950,41,242,667,18,1,0,4.32,25.47,16.94,749,201,26.84,Africa


In [48]:
covid_data.tail()

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
182,West Bank and Gaza,10621,78,3752,6791,152,2,0,0.73,35.33,2.08,8916,1705,19.12,Eastern Mediterranean
183,Western Sahara,10,1,8,1,0,0,0,10.0,80.0,12.5,10,0,0.0,Africa
184,Yemen,1691,483,833,375,10,4,36,28.56,49.26,57.98,1619,72,4.45,Eastern Mediterranean
185,Zambia,4552,140,2815,1597,71,1,465,3.08,61.84,4.97,3326,1226,36.86,Africa
186,Zimbabwe,2704,36,542,2126,192,2,24,1.33,20.04,6.64,1713,991,57.85,Africa


In [49]:
# Accessing data
# Selecting a single column returns a Series.
covid_data['Active']

0      9796
1      1991
2      7973
3        52
4       667
       ... 
182    6791
183       1
184     375
185    1597
186    2126
Name: Active, Length: 187, dtype: int64

In [50]:
covid_data.Active

0      9796
1      1991
2      7973
3        52
4       667
       ... 
182    6791
183       1
184     375
185    1597
186    2126
Name: Active, Length: 187, dtype: int64
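
Attribute access (`covid_data.Active`) is a convenient shorthand, but it cannot express column names that contain spaces (such as `New cases`), and it is shadowed when a column shares its name with a DataFrame method. The sketch below uses a hypothetical two-row frame with a column named `count` to show the second case.

```python
import pandas as pd

# Hypothetical stand-in frame; "count" collides with the DataFrame.count method.
frame = pd.DataFrame({"Active": [9796, 1991], "count": [3, 5]})

# For an ordinary column name, both forms give the same Series.
same = frame["Active"].equals(frame.Active)

# For "count", attribute access returns the method, not the column.
attr_is_method = callable(frame.count)
column = frame["count"]  # bracket access always returns the column

print(same, attr_is_method, column.tolist())
```

Bracket access works for every column name, which makes it the safer default.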

In [52]:
covid_data['Confirmed'][1]

4880

In [51]:
covid_data['Confirmed'][2:5]

2    27973
3      907
4      950
Name: Confirmed, dtype: int64

In [53]:
# Comparing a column against a value compares each element and returns a boolean Series.
covid_data['New cases'] > 100

0       True
1       True
2       True
3      False
4      False
       ...  
182     True
183    False
184    False
185    False
186     True
Name: New cases, Length: 187, dtype: bool

In [54]:
# Passing that boolean Series inside [ ] extracts only the matching rows. The result is a DataFrame.
covid_data[covid_data['New cases'] > 100]

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.50,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.00,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
6,Argentina,167416,3059,72575,91782,4890,120,2057,1.83,43.35,4.21,130774,36642,28.02,Americas
8,Australia,15303,167,9311,5825,368,6,137,1.09,60.84,1.79,12428,2875,23.13,Western Pacific
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
177,United Kingdom,301708,45844,1437,254427,688,7,3,15.19,0.48,3190.26,296944,4764,1.60,Europe
179,Uzbekistan,21209,121,11674,9414,678,5,569,0.57,55.04,1.04,17149,4060,23.67,Europe
180,Venezuela,15988,146,9959,5883,525,4,213,0.91,62.29,1.47,12334,3654,29.63,Americas
182,West Bank and Gaza,10621,78,3752,6791,152,2,0,0.73,35.33,2.08,8916,1705,19.12,Eastern Mediterranean


In [56]:
covid_data['WHO Region'].unique()

array(['Eastern Mediterranean', 'Europe', 'Africa', 'Americas',
       'Western Pacific', 'South-East Asia'], dtype=object)

In [57]:
covid_data[covid_data['WHO Region']== 'South-East Asia']

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
13,Bangladesh,226225,2965,125683,97577,2772,37,1801,1.31,55.56,2.36,207453,18772,9.05,South-East Asia
19,Bhutan,99,0,86,13,4,0,1,0.0,86.87,0.0,90,9,10.0,South-East Asia
27,Burma,350,6,292,52,0,0,2,1.71,83.43,2.05,341,9,2.64,South-East Asia
79,India,1480073,33408,951166,495499,44457,637,33598,2.26,64.26,3.51,1155338,324735,28.11,South-East Asia
80,Indonesia,100303,4838,58173,37292,1525,57,1518,4.82,58.0,8.32,88214,12089,13.7,South-East Asia
106,Maldives,3369,15,2547,807,67,0,19,0.45,75.6,0.59,2999,370,12.34,South-East Asia
119,Nepal,18752,48,13754,4950,139,3,626,0.26,73.35,0.35,17844,908,5.09,South-East Asia
158,Sri Lanka,2805,11,2121,673,23,0,15,0.39,75.61,0.52,2730,75,2.75,South-East Asia
167,Thailand,3297,58,3111,128,6,0,2,1.76,94.36,1.86,3250,47,1.45,South-East Asia
168,Timor-Leste,24,0,0,24,0,0,0,0.0,0.0,0.0,24,0,0.0,South-East Asia


In [58]:
books = {"Available":[True, True, False], "Location":[102, 215, 324], "Genre":["Programming","Physics","math"]}

In [59]:
books_df = pd.DataFrame(books, index=['Bug', 'Phys','Det'])
books_df

      Available  Location        Genre
Bug        True       102  Programming
Phys       True       215      Physics
Det       False       324         math


Accessing data NumPy-style

In [61]:
books_df.loc['Bug', "Available"]

True

In [62]:
# Access data by integer position
books_df.iloc[0, 1]

102

In [63]:
books_df.iloc[1, 0:2]

Available    True
Location      215
Name: Phys, dtype: object
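
`loc` and `iloc` also accept slices and lists on both axes. The sketch below selects the same block of `books_df` both ways; note that the label slice includes its end label while the positional slice excludes its stop position.

```python
import pandas as pd

books = {
    "Available": [True, True, False],
    "Location": [102, 215, 324],
    "Genre": ["Programming", "Physics", "math"],
}
books_df = pd.DataFrame(books, index=["Bug", "Phys", "Det"])

# Label-based: rows 'Bug' through 'Phys' (inclusive), two named columns.
by_label = books_df.loc["Bug":"Phys", ["Available", "Location"]]

# Position-based: rows 0 and 1, columns 0 and 1 (stop is exclusive).
by_position = books_df.iloc[0:2, 0:2]

# A whole row comes back as a Series indexed by the column names.
det_row = books_df.loc["Det"]

print(by_label.equals(by_position))
print(det_row)
```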

#### pandas groupby
- split: divide the DataFrame into groups according to a given key
- apply: apply a statistical function (sum, mean, median) to condense each group
- combine: build a new Series from the applied results (group_key: applied_value)
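
The three steps can be spelled out by hand to see what `groupby` does behind the scenes. This is only an illustrative sketch on hypothetical data; in practice the one-line `groupby(...).sum()` at the end is what you would write.

```python
import pandas as pd

# Hypothetical data: two regions with a few values each.
frame = pd.DataFrame(
    {
        "region": ["A", "B", "A", "B", "A"],
        "confirmed": [10, 20, 30, 40, 50],
    }
)

# split: one sub-DataFrame per group key
groups = {key: part for key, part in frame.groupby("region")}

# apply: reduce each group to a single number
sums = {key: part["confirmed"].sum() for key, part in groups.items()}

# combine: collect the results into a Series keyed by group
combined = pd.Series(sums, name="confirmed")

# The same result in one step.
shortcut = frame.groupby("region")["confirmed"].sum()

print(combined)
print(shortcut)
```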

In [64]:
covid_data.head(3)

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.5,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.0,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa


In [67]:
covid_by_region = covid_data['Confirmed'].groupby(by=covid_data['WHO Region'])
covid_by_region.sum()

WHO Region
Africa                    723207
Americas                 8839286
Eastern Mediterranean    1490744
Europe                   3299523
South-East Asia          1835297
Western Pacific           292428
Name: Confirmed, dtype: int64

In [68]:
covid_by_region.mean()

WHO Region
Africa                    15066.812500
Americas                 252551.028571
Eastern Mediterranean     67761.090909
Europe                    58920.053571
South-East Asia          183529.700000
Western Pacific           18276.750000
Name: Confirmed, dtype: float64
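
When several statistics are needed for the same groups, `agg` can compute them in one call instead of running `sum()` and `mean()` separately. The sketch below uses a hypothetical three-row frame with the same column names as the COVID data.

```python
import pandas as pd

# Hypothetical stand-in rows, not the real dataset.
frame = pd.DataFrame(
    {
        "WHO Region": ["Europe", "Europe", "Africa"],
        "Confirmed": [100, 300, 50],
    }
)

# One row per group, one column per statistic.
summary = frame.groupby("WHO Region")["Confirmed"].agg(["sum", "mean", "count"])
print(summary)
```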

## Mission:
### 1. In the COVID data, which country has the highest death rate per 100 cases (`Deaths / 100 Cases`)?

In [75]:
covid_data.keys()

Index(['Country/Region', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'New cases', 'New deaths', 'New recovered', 'Deaths / 100 Cases',
       'Recovered / 100 Cases', 'Deaths / 100 Recovered',
       'Confirmed last week', '1 week change', '1 week % increase',
       'WHO Region'],
      dtype='object')

In [79]:
# idxmax returns the index label of the row with the largest value; use it to look up the country.
M1 = covid_data['Deaths / 100 Cases'].idxmax()
covid_data.loc[M1, 'Country/Region']

'Yemen'
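
To find the row holding a column's maximum, `idxmax` or a descending sort are two common options. The sketch below tries both on a three-row excerpt taken from the tables shown earlier (Afghanistan, Yemen, Zambia), so the result applies to the excerpt only.

```python
import pandas as pd

# Three rows copied from the outputs above; not the full dataset.
excerpt = pd.DataFrame(
    {
        "Country/Region": ["Afghanistan", "Yemen", "Zambia"],
        "Deaths / 100 Cases": [3.50, 28.56, 3.08],
    }
)

# Option 1: idxmax gives the index label of the largest value.
top_label = excerpt["Deaths / 100 Cases"].idxmax()
top_country = excerpt.loc[top_label, "Country/Region"]

# Option 2: sort in descending order and take the first row.
ranked = excerpt.sort_values("Deaths / 100 Cases", ascending=False)
top_by_sort = ranked.iloc[0]["Country/Region"]

print(top_country, top_by_sort)
```

Taking the maximum of the country names themselves would only return the alphabetically last name, so the comparison has to be made on the numeric column.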

### 2. In the COVID data, list every country with no new cases whose WHO Region is 'Europe'.  
Hint: Applying two conditions at once on a single line may raise a warning.

In [83]:
covid_europe = covid_data[covid_data['WHO Region']=='Europe']
covid_europe[covid_europe['New cases']==0]

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
56,Estonia,2034,69,1923,42,0,0,1,3.39,94.54,3.59,2021,13,0.64,Europe
75,Holy See,12,0,12,0,0,0,0,0.0,100.0,0.0,12,0,0.0,Europe
95,Latvia,1219,31,1045,143,0,0,0,2.54,85.73,2.97,1192,27,2.27,Europe
100,Liechtenstein,86,1,81,4,0,0,0,1.16,94.19,1.23,86,0,0.0,Europe
113,Monaco,116,4,104,8,0,0,0,3.45,89.66,3.85,109,7,6.42,Europe
143,San Marino,699,42,657,0,0,0,0,6.01,93.99,6.39,699,0,0.0,Europe
157,Spain,272421,28432,150376,93613,0,0,0,10.44,55.2,18.91,264836,7585,2.86,Europe


### 3. Using the following [data](https://www.kaggle.com/neuromusic/avocado-prices), output the highest average price (AveragePrice) of avocados for each region.

In [84]:
avocado = pd.read_csv('~/Jupyter workspace/avocado.csv')

In [85]:
avocado.head()

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [86]:
avocado.keys()

Index(['Unnamed: 0', 'Date', 'AveragePrice', 'Total Volume', '4046', '4225',
       '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type',
       'year', 'region'],
      dtype='object')

In [88]:
avocado['region'].head()

0    Albany
1    Albany
2    Albany
3    Albany
4    Albany
Name: region, dtype: object

In [91]:
# Group the prices by region, then take the maximum within each group.
# The result is a Series with one entry per region.
avocado['AveragePrice'].groupby(by=avocado['region']).max()
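
A per-region maximum means grouping the prices by region, not the other way round. The sketch below shows the idea on a tiny frame: the three Albany prices are taken from `avocado.head()` above, while `RegionB` and its prices are hypothetical.

```python
import pandas as pd

# Albany prices come from the head() output; RegionB is a made-up group.
sample = pd.DataFrame(
    {
        "region": ["Albany", "Albany", "Albany", "RegionB", "RegionB"],
        "AveragePrice": [1.33, 1.35, 0.93, 1.10, 1.80],
    }
)

# Highest AveragePrice within each region.
max_price = sample.groupby("region")["AveragePrice"].max()

# Region whose maximum is the largest overall.
priciest_region = max_price.idxmax()

print(max_price)
print(priciest_region)
```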