# Pandas & Preprocessing
FIRA Big Data Platform < Data Mining >

### 1. Pandas Tutorial
- 1-1. Data Structure
- 1-2. Selection : Getting, Slicing
- 1-3. Add New Rows & Columns
- 1-4. Inspection
- 1-5. Arithmetic
- 1-6. Map & Apply Function
- 1-7. Sort
    
### 2. Preprocessing
- 2-1. Data from .csv, .sql to DataFrame
- 2-2. Merge/Join two DataFrame : .merge()
- 2-3. Fill or Abandon NaN values
- 2-4. Save & Load as DataFrame : pickle
- 2-5. Data Summary : .groupby()

### Exercise : Basic Preprocessing of the Clustering Practice Data
    
### 3. Basic Visualization : Scatter Plot of Normalized Total Volume vs. Normalized Value
- 3-1. `matplotlib.pyplot`
- 3-2. `bokeh.js`

### 1. Pandas Tutorial
---
Pandas is a Python library built on NumPy and optimized for data analysis. It provides the Series and DataFrame data structures, together with tools for computing on them, which makes data analysis more convenient.

In [1]:
# import package - be sure to check the version
import pandas as pd # `pd` is the convention among developers

#### 1-1. Data Structure
---
##### pd.Series 
1. A one-dimensional array object that can be labeled
2. All values share the same dtype.

In [2]:
# A Series with index ['a', 'b', 'c', 'd'] and values [3, -5, 7, 4]
s = pd.Series([3, -5, 7, 4], index = ['a', 'b', 'c', 'd'])
s

a    3
b   -5
c    7
d    4
dtype: int64

##### pd.DataFrame 
1. A two-dimensional array object that can be labeled
2. Each column can have a different dtype.

In [3]:
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
data

{'Capital': ['Brussels', 'New Delhi', 'Brasília'],
 'Country': ['Belgium', 'India', 'Brazil'],
 'Population': [11190846, 1303171035, 207847528]}

In [4]:
# Convert the data above to a DataFrame
df = pd.DataFrame(data)
df

|   | Capital | Country | Population |
|---|---|---|---|
| 0 | Brussels | Belgium | 11190846 |
| 1 | New Delhi | India | 1303171035 |
| 2 | Brasília | Brazil | 207847528 |


In [5]:
df = pd.DataFrame(data, index = [1,2,3])
df

|   | Capital | Country | Population |
|---|---|---|---|
| 1 | Brussels | Belgium | 11190846 |
| 2 | New Delhi | India | 1303171035 |
| 3 | Brasília | Brazil | 207847528 |
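To make the dtype rules for the two structures concrete, here is a minimal sketch. The names `mixed_s` and `example_df` are illustrative and do not appear in the original notebook. A Series holds a single dtype, so mixing integers and floats upcasts everything to float64, while a DataFrame tracks one dtype per column.

```python
import pandas as pd

# A Series has a single dtype: mixing int and float upcasts all values to float64.
mixed_s = pd.Series([3, -5, 7.5])
print(mixed_s.dtype)

# A DataFrame keeps a separate dtype for each column.
example_df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Population': [11190846, 1303171035, 207847528],
})
print(example_df.dtypes)
```

The exact dtype reported for the text column depends on the pandas version, so the sketch only prints it.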


#### 1-2. Selection : Getting, Slicing
---

##### Getting

In [6]:
# Series : By index

s

a    3
b   -5
c    7
d    4
dtype: int64

In [7]:
# DataFrame : By Row 

df[0:1] # works as a positional row slice, but it is not the recommended way; prefer df.iloc[0:1]

|   | Capital | Country | Population |
|---|---|---|---|
| 1 | Brussels | Belgium | 11190846 |


In [8]:
# DataFrame : By column

df['Capital']

1     Brussels
2    New Delhi
3     Brasília
Name: Capital, dtype: object

In [9]:
df[['Capital', 'Country']]

|   | Capital | Country |
|---|---|---|
| 1 | Brussels | Belgium |
| 2 | New Delhi | India |
| 3 | Brasília | Brazil |


##### Slicing
* df.iloc : slicing by integer position
* df.loc : slicing by label

In [10]:
# By Position
df.iloc[0,] # selecting a single row returns a Series

Capital       Brussels
Country        Belgium
Population    11190846
Name: 1, dtype: object

In [11]:
df.iloc[1:, 0]

2    New Delhi
3     Brasília
Name: Capital, dtype: object

In [12]:
# By Label
df.loc[1,]

Capital       Brussels
Country        Belgium
Population    11190846
Name: 1, dtype: object
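Because the index here starts at 1, position and label differ by one, which is easy to mix up. The following sketch contrasts the two accessors on a hypothetical `toy_df` that mirrors the frame above. It also shows a detail not covered in the cells: `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import pandas as pd

# Hypothetical toy frame with a 1-based integer index, as in In [5].
toy_df = pd.DataFrame({'Capital': ['Brussels', 'New Delhi', 'Brasília']},
                      index=[1, 2, 3])

first_by_position = toy_df.iloc[0]   # position 0 -> the row labelled 1
first_by_label = toy_df.loc[1]       # label 1 -> the same row

# iloc slices exclude the end position; loc slices include the end label.
by_position = toy_df.iloc[1:2]       # position 1 only -> label 2
by_label = toy_df.loc[1:2]           # labels 1 and 2

print(list(by_position.index), list(by_label.index))
```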

##### Boolean Indexing (Mask)
Select only the parts that satisfy a condition, based on True/False values

In [13]:
# print only the values of s that are greater than 1
s[s>1]

a    3
c    7
d    4
dtype: int64

In [14]:
# print only the values of s that are not greater than 1 (1 or less)
s[~(s>1)]

b   -5
dtype: int64

In [15]:
# OR
# print the values greater than 5 or less than -1
s[(s>5)| (s<-1)]

b   -5
c    7
dtype: int64

In [16]:
df[df['Population'] > 207840000]

|   | Capital | Country | Population |
|---|---|---|---|
| 2 | New Delhi | India | 1303171035 |
| 3 | Brasília | Brazil | 207847528 |


In [17]:
# Negated condition
df.index
df[df.index != 2]

|   | Capital | Country | Population |
|---|---|---|---|
| 1 | Brussels | Belgium | 11190846 |
| 3 | Brasília | Brazil | 207847528 |


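*Q. Can columns be filtered with a boolean mask in the same way? As written, no: `df[mask]` applies the mask to rows, so the cell below drops the third row instead of the `Population` column.*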

In [18]:
df.columns
df[df.columns != 'Population']

|   | Capital | Country | Population |
|---|---|---|---|
| 1 | Brussels | Belgium | 11190846 |
| 2 | New Delhi | India | 1303171035 |
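The output above keeps every column and drops row 3, because a boolean array passed to `df[...]` filters rows. It ran without error only because this frame has as many rows as columns. A minimal sketch of two ways to filter columns instead, using a hypothetical `toy_df` with the same contents:

```python
import pandas as pd

toy_df = pd.DataFrame({'Capital': ['Brussels', 'New Delhi', 'Brasília'],
                       'Country': ['Belgium', 'India', 'Brazil'],
                       'Population': [11190846, 1303171035, 207847528]},
                      index=[1, 2, 3])

# Put the boolean mask in the column slot of .loc to filter columns.
without_population = toy_df.loc[:, toy_df.columns != 'Population']

# Equivalent, and arguably clearer: drop the column by name.
dropped = toy_df.drop(columns='Population')

print(list(without_population.columns))
```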


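# Regular-expression practice: select the rows whose Capital starts with 'B'.
# The original attempt, df['Capital' == r'B*'], evaluates to df[False] and raises a KeyError.
df[df['Capital'].str.contains(r'^B')]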

#### 1-3. Adding a New Row & Column
---
##### Row

In [19]:
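# Add or update a row with label 4. Values are assigned in column order
# (Capital, Country, Population), so this puts 'Korea' under Capital and 'Seoul'
# under Country; ['Seoul', 'Korea', 50000000] would match the column meanings.
df.loc[4] = ['Korea', 'Seoul', 50000000]
df
# df.loc[3] = ... would overwrite the existing row 3 (DataFrames are mutable)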

|   | Capital | Country | Population |
|---|---|---|---|
| 1 | Brussels | Belgium | 11190846 |
| 2 | New Delhi | India | 1303171035 |
| 3 | Brasília | Brazil | 207847528 |
| 4 | Korea | Seoul | 50000000 |


*Q. Can a new row be added with .iloc? Why or why not?*

In [20]:
# Can't use iloc
#df.iloc[4,:] # iloc looks up an existing position to read or modify it; position 4 does not exist, so the lookup fails
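The answer can be checked directly. In this sketch (with a hypothetical two-column `toy_df`), `.loc` creates a row for a label that does not exist yet, whereas assigning through `.iloc` past the last position raises an `IndexError`.

```python
import pandas as pd

toy_df = pd.DataFrame({'Capital': ['Brussels'], 'Country': ['Belgium']}, index=[1])

# .loc creates the row when the label does not exist yet.
toy_df.loc[2] = ['New Delhi', 'India']

# .iloc only addresses existing positions, so writing past the end fails.
try:
    toy_df.iloc[2] = ['Seoul', 'Korea']
    iloc_error = None
except IndexError as exc:
    iloc_error = exc

print(type(iloc_error).__name__)
```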

##### Column

In [21]:
# Add or update a column : 'Continent' - ['Europe', 'Asia', 'America', 'Asia']
df['Continent'] = ['Europe', 'Asia', 'America', 'Asia']
df

|   | Capital | Country | Population | Continent |
|---|---|---|---|---|
| 1 | Brussels | Belgium | 11190846 | Europe |
| 2 | New Delhi | India | 1303171035 | Asia |
| 3 | Brasília | Brazil | 207847528 | America |
| 4 | Korea | Seoul | 50000000 | Asia |


#### 1-4. Inspection
---

In [22]:
# DataFrame size
df.shape
df.shape[0] # number of rows

4

In [23]:
# Row information
df.index # Index object
list(df.index) # convert to a list if needed

[1, 2, 3, 4]

In [24]:
# Column information
df.columns

Index(['Capital', 'Country', 'Population', 'Continent'], dtype='object')

In [25]:
# Details : row and column information, dtypes, memory usage
df.info() # info is a method, so the parentheses are required

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 1 to 4
Data columns (total 4 columns):
Capital       4 non-null object
Country       4 non-null object
Population    4 non-null int64
Continent     4 non-null object
dtypes: int64(1), object(3)
memory usage: 160.0+ bytes


#### 1-5. Arithmetic
---
##### Common to all data types

In [26]:
# Number of values per column / row
df.count() # counts per column by default (axis = 0)

Capital       4
Country       4
Population    4
Continent     4
dtype: int64

In [27]:
df.count(axis=1) # per row

1    4
2    4
3    4
4    4
dtype: int64

In [28]:
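# Column-wise / row-wise sum
df.sum()
df.sum(axis=1) # only the numeric column contributes here; recent pandas versions may need numeric_only=True for mixed dtypes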

1      11190846
2    1303171035
3     207847528
4      50000000
dtype: int64

In [29]:
# Column-wise / row-wise cumulative sum
df.cumsum()

|   | Capital | Country | Population | Continent |
|---|---|---|---|---|
| 1 | Brussels | Belgium | 11190846 | Europe |
| 2 | BrusselsNew Delhi | BelgiumIndia | 1314361881 | EuropeAsia |
| 3 | BrusselsNew DelhiBrasília | BelgiumIndiaBrazil | 1522209409 | EuropeAsiaAmerica |
| 4 | BrusselsNew DelhiBrasíliaKorea | BelgiumIndiaBrazilSeoul | 1572209409 | EuropeAsiaAmericaAsia |


In [30]:
df

|   | Capital | Country | Population | Continent |
|---|---|---|---|---|
| 1 | Brussels | Belgium | 11190846 | Europe |
| 2 | New Delhi | India | 1303171035 | Asia |
| 3 | Brasília | Brazil | 207847528 | America |
| 4 | Korea | Seoul | 50000000 | Asia |


In [31]:
# Column-wise minimum
df.min()

Capital       Brasília
Country        Belgium
Population    11190846
Continent      America
dtype: object

In [32]:
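# Row-wise minimum. The int64 result shows that only the numeric column (Population) was used; the text columns were skipped.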
df.min(axis=1)

1      11190846
2    1303171035
3     207847528
4      50000000
dtype: int64

In [33]:
# Column-wise maximum
df.max()

Capital        New Delhi
Country            Seoul
Population    1303171035
Continent         Europe
dtype: object

##### Numeric data

In [34]:
# Population minimum / maximum (ratio of the smallest to the largest)
df['Population'].min() / df['Population'].max()

0.0085873962046739329

In [35]:
# Index label where Population is smallest
df["Population"].idxmin()

1

In [36]:
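# Another way to write it. In the pandas version used here argmin returned the label, like idxmin; recent versions return the integer position instead.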
df["Population"].argmin()

1

In [37]:
# Index label where Population is largest
df["Population"].idxmax()

2
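Since labels and positions are easy to confuse when the index starts at 1, the following sketch separates the two explicitly. The Series `pop` is a stand-in for `df['Population']`, and the position is taken from the underlying NumPy array so that the result does not depend on the pandas version.

```python
import pandas as pd

# Stand-in for df['Population'] with the same 1-based index.
pop = pd.Series([11190846, 1303171035, 207847528, 50000000], index=[1, 2, 3, 4])

label_of_min = pop.idxmin()              # index label of the smallest value
position_of_min = pop.values.argmin()    # 0-based position, computed by NumPy

# The index maps the position back to the label.
print(label_of_min, position_of_min, pop.index[position_of_min])
```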

In [38]:
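# Basic summary statistics per column
df.describe() # picks out the numeric columns and computes statistics for them

|   | Population |
|---|---|
| count | 4.0 |
| mean | 393052400.0 |
| std | 612677200.0 |
| min | 11190850.0 |
| 25% | 40297710.0 |
| 50% | 128923800.0 |
| 75% | 481678400.0 |
| max | 1303171000.0 |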


In [39]:
# Column-wise mean
df.mean()

Population    3.930524e+08
dtype: float64

#### 1-6. Map & Apply Function
---
##### Series : .map

In [40]:
s.map(print)

3
-5
7
4


a    None
b    None
c    None
d    None
dtype: object

In [41]:
s

a    3
b   -5
c    7
d    4
dtype: int64

In [42]:
# Define a square function and apply it to the Series
square = lambda x: x**2
s.map(square)

a     9
b    25
c    49
d    16
dtype: int64

##### DataFrame : .apply, .applymap

In [43]:
num_data = {
   'a' : [1, 2, 3],
   'b' : [4, 5, 6],
   'c' : [7, 8, 9]
}

In [44]:
num_df = pd.DataFrame(num_data)
num_df

|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 4 | 7 |
| 1 | 2 | 5 | 8 |
| 2 | 3 | 6 | 9 |


In [45]:
# Apply element-wise (cell by cell)
num_df.applymap(print)

1
2
3
1
2
3
4
5
6
7
8
9


Unnamed: 0,a,b,c
0,,,
1,,,
2,,,


In [46]:
# num_df.applymap(sum)

In [47]:
# Apply column-wise / row-wise
num_df.apply(print)

0    1
1    2
2    3
Name: a, dtype: int64
0    4
1    5
2    6
Name: b, dtype: int64
0    7
1    8
2    9
Name: c, dtype: int64


a    None
b    None
c    None
dtype: object

In [48]:
num_df.apply(sum)

a     6
b    15
c    24
dtype: int64
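The cells above only use the default direction of `.apply`. As a small sketch of the row-wise variant and of a custom function, the block below rebuilds `num_df` and uses lambdas, which receive each column (or each row when `axis=1`) as a Series.

```python
import pandas as pd

num_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Default axis=0: the function is called once per column.
col_sums = num_df.apply(lambda col: col.sum())

# axis=1: the function is called once per row.
row_sums = num_df.apply(lambda row: row.sum(), axis=1)

# Any function of a Series works, for example the range of each column.
col_ranges = num_df.apply(lambda col: col.max() - col.min())

print(col_sums.tolist(), row_sums.tolist(), col_ranges.tolist())
```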

#### 1-7. Sort
---
Rows can be sorted by a specific column.

In [49]:
# Sort by Population in descending order
df.sort_values('Population', ascending = False)

|   | Capital | Country | Population | Continent |
|---|---|---|---|---|
| 2 | New Delhi | India | 1303171035 | Asia |
| 3 | Brasília | Brazil | 207847528 | America |
| 4 | Korea | Seoul | 50000000 | Asia |
| 1 | Brussels | Belgium | 11190846 | Europe |


### 2. Preprocessing
---
- 2-1. Data from .csv, .sql to DataFrame 
- 2-2. Merge/Join two DataFrame : .merge()
- 2-3. Fill or Abandon NaN values
- 2-4. Save & Load as DataFrame : pickle
- 2-5. Data Summary : .groupby()

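##### Data used : 2016 US Election (Kaggle) - results of each party's 2016 US presidential primaries
- primary_results : 24611 x 8 (Republican and Democratic presidential primary results)
- county_facts : 3195 x 54 (population, housing, business, retail and other characteristics by state and county)
- county_facts_dictionary : descriptions of the county_facts columns (50 entries)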

#### 2-1. Data from .csv, .sql to DataFrame
---
##### pd.read_sql & sqlite3

In [50]:
# import package
import sqlite3

In [51]:
# connect sqlite3 - database.sqlite
connect = sqlite3.connect("database.sqlite")

In [52]:
# pd.read_sql(query, con)
pd.read_sql('SELECT * FROM primary_results', connect)

DatabaseError: Execution failed on sql 'SELECT * FROM primary_results': no such table: primary_results

In [None]:
pd.read_sql('SELECT * FROM county_facts', connect)
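When a query fails with "no such table", it helps to list the tables the connection can actually see. Note that `sqlite3.connect` creates a new, empty database file if the given path does not exist, so a wrong working directory is one possible cause of that error. The sketch below uses an in-memory database with made-up contents in place of `database.sqlite`.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for database.sqlite (hypothetical contents).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE county_facts (fips INTEGER, area_name TEXT)')
conn.execute("INSERT INTO county_facts VALUES (1001, 'Autauga County')")
conn.commit()

# List the tables that exist before querying them.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", conn)
print(tables)

# Query a table that is known to exist.
facts = pd.read_sql('SELECT * FROM county_facts', conn)
conn.close()
```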

##### pd.read_csv

In [54]:
# county_facts_df
county_facts_df = pd.read_csv('county_facts.csv')
county_facts_df

Unnamed: 0,fips,area_name,state_abbreviation,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,...,SBO415207,SBO015207,MAN450207,WTN220207,RTN130207,RTN131207,AFN120207,BPS030214,LND110210,POP060210
0,0,United States,,318857056,308758105,3.3,308745538,6.2,23.1,14.5,...,8.3,28.8,5319456312,4174286516,3917663456,12990,613795732,1046363,3531905.43,87.4
1,1000,Alabama,,4849377,4780127,1.4,4779736,6.1,22.8,15.3,...,1.2,28.1,112858843,52252752,57344851,12364,6426342,13369,50645.33,94.4
2,1001,Autauga County,AL,55395,54571,1.5,54571,6.0,25.2,13.8,...,0.7,31.7,0,0,598175,12003,88157,131,594.44,91.8
3,1003,Baldwin County,AL,200111,182265,9.8,182265,5.6,22.2,18.7,...,1.3,27.3,1410273,0,2966489,17166,436955,1384,1589.78,114.6
4,1005,Barbour County,AL,26887,27457,-2.1,27457,5.7,21.2,16.5,...,0.0,27.0,0,0,188337,6334,0,8,884.88,31.0
5,1007,Bibb County,AL,22506,22919,-1.8,22915,5.3,21.0,14.8,...,0.0,0.0,0,0,124707,5804,10757,19,622.58,36.8
6,1009,Blount County,AL,57719,57322,0.7,57322,6.1,23.6,17.0,...,0.0,23.2,341544,0,319700,5622,20941,3,644.78,88.9
7,1011,Bullock County,AL,10764,10915,-1.4,10914,6.3,21.4,14.9,...,0.0,38.8,0,0,43810,3995,3670,1,622.81,17.5
8,1013,Butler County,AL,20296,20946,-3.1,20947,6.1,23.6,18.0,...,0.0,0.0,399132,56712,229277,11326,28427,2,776.83,27.0
9,1015,Calhoun County,AL,115916,118586,-2.3,118572,5.7,22.2,16.0,...,0.5,24.7,2679991,0,1542981,13678,186533,114,605.87,195.7


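*Q. Split the DataFrame loaded from `county_facts.csv` into states and counties, as state_df and county_df respectively.*
<br>
- Hint : fips is the area code; 0 is the whole United States, and the digits from the thousands place up identify the state, so state rows are multiples of 1000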

In [56]:
# state_df
state_df = county_facts_df[county_facts_df['fips']%1000 == 0]
state_df

Unnamed: 0,fips,area_name,state_abbreviation,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,...,SBO415207,SBO015207,MAN450207,WTN220207,RTN130207,RTN131207,AFN120207,BPS030214,LND110210,POP060210
0,0,United States,,318857056,308758105,3.3,308745538,6.2,23.1,14.5,...,8.3,28.8,5319456312,4174286516,3917663456,12990,613795732,1046363,3531905.43,87.4
1,1000,Alabama,,4849377,4780127,1.4,4779736,6.1,22.8,15.3,...,1.2,28.1,112858843,52252752,57344851,12364,6426342,13369,50645.33,94.4
69,2000,Alaska,,736732,710249,3.7,710231,7.4,25.3,9.4,...,0.0,25.9,8204030,4563605,9303387,13635,1851293,1518,570640.95,1.2
99,4000,Arizona,,6731484,6392310,5.3,6392017,6.4,24.1,15.9,...,10.7,28.1,57977827,57573459,86758801,13637,13268514,26997,113594.08,56.3
115,5000,Arkansas,,2966369,2915958,1.7,2915918,6.5,23.8,15.7,...,2.3,24.5,60735582,29659789,32974282,11602,3559795,7666,52035.48,56.0
191,6000,California,,38802500,37254503,4.2,37253956,6.5,23.6,12.9,...,16.5,30.3,491372092,598456486,455032270,12561,80852787,83645,155779.22,239.1
250,8000,Colorado,,5355866,5029324,6.5,5029196,6.3,23.3,12.7,...,6.2,29.2,46331953,53598986,65896788,13609,11440395,28686,103641.89,48.5
315,9000,Connecticut,,3596677,3574096,0.6,3574097,5.3,21.6,15.5,...,4.2,28.1,58404898,107917037,52165480,14953,9138437,5329,4842.36,738.1
324,10000,Delaware,,935614,897936,4.2,897934,6.0,21.8,16.4,...,2.1,26.1,25679939,5727401,14202083,16421,1910770,5194,1948.54,460.8
328,11000,District Of Columbia,,658893,601767,9.5,601723,6.5,17.5,11.3,...,6.1,34.5,332844,2117990,3843716,6555,4278171,4189,61.05,9856.5


In [57]:
# county_df
county_df = county_facts_df[county_facts_df['fips']%1000 != 0]
county_df

Unnamed: 0,fips,area_name,state_abbreviation,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,...,SBO415207,SBO015207,MAN450207,WTN220207,RTN130207,RTN131207,AFN120207,BPS030214,LND110210,POP060210
2,1001,Autauga County,AL,55395,54571,1.5,54571,6.0,25.2,13.8,...,0.7,31.7,0,0,598175,12003,88157,131,594.44,91.8
3,1003,Baldwin County,AL,200111,182265,9.8,182265,5.6,22.2,18.7,...,1.3,27.3,1410273,0,2966489,17166,436955,1384,1589.78,114.6
4,1005,Barbour County,AL,26887,27457,-2.1,27457,5.7,21.2,16.5,...,0.0,27.0,0,0,188337,6334,0,8,884.88,31.0
5,1007,Bibb County,AL,22506,22919,-1.8,22915,5.3,21.0,14.8,...,0.0,0.0,0,0,124707,5804,10757,19,622.58,36.8
6,1009,Blount County,AL,57719,57322,0.7,57322,6.1,23.6,17.0,...,0.0,23.2,341544,0,319700,5622,20941,3,644.78,88.9
7,1011,Bullock County,AL,10764,10915,-1.4,10914,6.3,21.4,14.9,...,0.0,38.8,0,0,43810,3995,3670,1,622.81,17.5
8,1013,Butler County,AL,20296,20946,-3.1,20947,6.1,23.6,18.0,...,0.0,0.0,399132,56712,229277,11326,28427,2,776.83,27.0
9,1015,Calhoun County,AL,115916,118586,-2.3,118572,5.7,22.2,16.0,...,0.5,24.7,2679991,0,1542981,13678,186533,114,605.87,195.7
10,1017,Chambers County,AL,34076,34170,-0.3,34215,5.9,21.4,18.3,...,0.0,29.3,667283,0,264650,7620,23237,8,596.53,57.4
11,1019,Cherokee County,AL,26037,25986,0.2,25989,4.8,20.4,20.9,...,0.0,14.5,307439,62293,186321,7613,13948,2,553.70,46.9


In [None]:
# primary_df
primary_df= pd.read_csv('primary_results.csv')
primary_df

#### 2-2. Merge/Join two DataFrames : .merge()
---
* Which df is the base of the merge?
* Which column is the key for the merge?
* How should the two be merged (inner, outer, ...)?

In [None]:
primary_county_facts_inner_df = primary_df.merge(county_df, left_on = 'fips',
                                                 right_on = 'fips', how = 'inner')
primary_county_facts_inner_df

In [None]:
primary_county_facts_outer_df = primary_df.merge(county_df, left_on = "fips",
                                                right_on = 'fips', how = 'outer')
primary_county_facts_outer_df

*Q. inner? outer?*

 -> Let's use inner!
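The difference between the two join types is easiest to see on a toy example. In this sketch the frames `left` and `right` and their values are invented; `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Invented data: fips 9999 exists only on the left, 1005 only on the right.
left = pd.DataFrame({'fips': [1001, 1003, 9999], 'votes': [10, 20, 30]})
right = pd.DataFrame({'fips': [1001, 1003, 1005], 'area_name': ['A', 'B', 'C']})

# inner keeps only keys present in both frames.
inner = left.merge(right, on='fips', how='inner')

# outer keeps every key and fills the missing side with NaN.
outer = left.merge(right, on='fips', how='outer', indicator=True)

print(len(inner), len(outer))
print(outer['_merge'].value_counts())
```

An inner join therefore avoids introducing NaN rows for unmatched keys, which is consistent with the choice made above.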

#### 2-3. Fill or Abandon NaN
---

In [None]:
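# check NaN in data
primary_county_facts_inner_df.isnull().sum() # NaN count per column; chain another .sum() for a single total, since the long output is truncated with ... and hard to read
primary_county_facts_inner_df.isnull()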
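The section title mentions filling or dropping NaN values, so here is a minimal sketch of both options on an invented `toy_df`. Which one is appropriate depends on the data; the fill value 0.0 is only an illustration.

```python
import numpy as np
import pandas as pd

# Invented frame with one missing value in each column.
toy_df = pd.DataFrame({'votes': [10.0, np.nan, 30.0],
                       'party': ['Republican', 'Democrat', None]})

nan_per_column = toy_df.isnull().sum()    # NaN count per column
filled = toy_df.fillna({'votes': 0.0})    # fill only the 'votes' column
dropped = toy_df.dropna()                 # drop every row that contains a NaN

print(nan_per_column)
print(len(dropped))
```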

*Q. Split the DataFrame from 2-2 into Republican and Democratic results, as rep_df and dem_df respectively.*

In [None]:
# rep_df
rep_df = primary_county_facts_inner_df[primary_county_facts_inner_df['party'] == "Republican"]

# dem_df
dem_df = primary_county_facts_inner_df[primary_county_facts_inner_df['party'] == "Democrat"]

#### 2-4. Save & Load DataFrame : pickle
---

In [None]:
# import package
import pickle

In [None]:
# save as DataFrame
with open('primary_results.pkl', 'wb') as f: # 'wb': open for writing in binary mode
    pickle.dump(primary_county_facts_inner_df, f) # pickle.dumps returns a bytes object instead of writing to a file, so use dump here

In [None]:
# load as DataFrame
with open('primary_results.pkl', 'rb') as f:
    primary_results_df = pickle.load(f)
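pandas also offers its own shortcuts for the same round trip, `DataFrame.to_pickle` and `pd.read_pickle`. The sketch below uses an invented `toy_df` and a temporary directory so that it leaves no files behind; the file name `toy.pkl` is arbitrary.

```python
import os
import tempfile
import pandas as pd

toy_df = pd.DataFrame({'candidate': ['A', 'B'], 'votes': [10, 20]})

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    toy_df.to_pickle(path)             # counterpart of pickle.dump
    restored = pd.read_pickle(path)    # counterpart of pickle.load

print(restored.equals(toy_df))
```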

#### 2-5. Data Summary  : .groupby()
---

In [None]:
# Total votes received by each Republican candidate (select 'votes' before summing so that only that column is aggregated)
rep_df.groupby('candidate')['votes'].sum()
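As a runnable illustration of the same pattern, the sketch below groups an invented `toy_rep_df` by candidate, sums the votes, and sorts the totals. The candidate names and vote counts are made up.

```python
import pandas as pd

# Invented primary results: one row per (candidate, state).
toy_rep_df = pd.DataFrame({
    'candidate': ['A', 'B', 'A', 'B', 'C'],
    'state': ['X', 'X', 'Y', 'Y', 'Y'],
    'votes': [100, 50, 30, 60, 10],
})

# Group by candidate, sum the votes, then sort from most to fewest.
votes_by_candidate = (toy_rep_df.groupby('candidate')['votes']
                      .sum()
                      .sort_values(ascending=False))
print(votes_by_candidate)
```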

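## Exercise : Preprocessing the Clustering Practice Data
---
Let's preprocess in advance the data that will be used in the clustering exercise.

##### A programming tip that helps with analysis
- immutable : whenever possible, avoid overwriting a variable with a different value once it has been assigned

##### Data used : bath soap customer purchase data (textbook 21.6) - `BathSoap.xlsx`
- sheet3 : DM_Sheet, member information and soap purchase information
- sheet4 : Durables, information on the members' ownership of items other than soap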

In [None]:
# Brand information
brand_code_description = pd.read_csv('BathSoapBrandCode.csv')
brand_code_description

In [None]:
# Column and Durables information
all_columns_description = pd.read_csv('BathSoapCodelList.csv')
all_columns_description

##### pd.read_excel
Convert the sheets of `BathSoap.xlsx` that contain data into DataFrames
* See the pd.read_excel documentation
* Specify the sheet and the row to use as the header carefully
* Row numbering starts at 0

In [None]:
# df (`sheet_name` replaces the older `sheetname` keyword)
df = pd.read_excel('BathSoap.xlsx', sheet_name = 'DM_Sheet', header = 2)

# durable_df
durable_df = pd.read_excel('BathSoap.xlsx', sheet_name = 'Durables', header = 4) # header = 3 caused an error in the join

In [None]:
df #  column 'Member id' has float values

In [None]:
df['Member id'] = df['Member id'].fillna(0.0).astype(int) # convert float to int
df

In [None]:
durable_df

##### pd.merge
Merge the Durables and DM_Sheet DataFrames
* Check carefully which column serves as the merge key
* Choose the merge method carefully

In [None]:
# merged_df
merged_df = df.merge(durable_df, left_on = "Member id", right_on = "MEM", how = "inner")
merged_df

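##### Handling missing values - DataFrame.fillna OR delete rows
* First find the rows that contain NaN
* Depending on the nature of each row, either fill in its values or delete it
* When doing this, work on a copy() of the original df under a new name such as `nan_filled_df` (immutable style)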

In [None]:
# check NaN by row
nan_s = merged_df.isnull().sum(axis = 1)

# Rebuild the DataFrame from only the rows that contain no NaN
nan_filled_df = merged_df[~(nan_s > 0)]

##### Scaling 
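* Numeric columns should usually be normalized
* Find the numeric columns that you think need scaling and apply standard scaling column by column
    * Standard : (x - column mean) / column standard deviation
* When doing this, work on a copy() of the original df under a new name such as scaled_df (immutable style)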

In [None]:
nan_filled_df

In [None]:
# Choose columns
scaling_needed_columns = nan_filled_df.drop("Member id", axis = 1)

# Scale the chosen columns
scaled_df = (scaling_needed_columns - scaling_needed_columns.mean()) / scaling_needed_columns.std()

# Put 'Member id' back and keep the result as scaled_df
scaled_df['Member id'] = nan_filled_df["Member id"]
scaled_df
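To check that the formula behaves as expected, the sketch below applies the same steps to an invented `toy_df`. After scaling, each column should have mean 0 and standard deviation 1. Note that `DataFrame.std` uses the sample standard deviation (ddof=1) by default, which differs slightly from scalers that divide by the population standard deviation.

```python
import pandas as pd

# Invented data with an id column that must not be scaled.
toy_df = pd.DataFrame({'Member id': [1, 2, 3, 4],
                       'Total Volume': [10.0, 20.0, 30.0, 40.0],
                       'Value': [1.0, 3.0, 5.0, 7.0]})

numeric = toy_df.drop('Member id', axis=1)
scaled = (numeric - numeric.mean()) / numeric.std()   # std() uses ddof=1 by default

# Work on a copy and put the id column back.
scaled_toy_df = scaled.copy()
scaled_toy_df['Member id'] = toy_df['Member id']

print(scaled.mean().round(6).tolist(), scaled.std().round(6).tolist())
```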

##### Save DataFrame

### 3. Basic Visualization : Scatter Plot of Normalized Total Volume vs. Normalized Value
---
#### 3-1. `matplotlib.pyplot`

In [None]:
import matplotlib.pyplot as plt

In [None]:
xaxis_label = 'Normalized Total Volume'
yaxis_label = 'Normalized Value'
x_range = [-2, 5]
y_range = [-2, 5]

# plt.scatter
# xlabel
# ylabel
# xlim
# ylim
# show
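One possible way to fill in the placeholders above is sketched here. Since `BathSoap.xlsx` is not available in this text, the points are synthetic standard-normal values (an assumption) standing in for the two scaled columns; in the exercise they would come from `scaled_df`.

```python
import numpy as np
import matplotlib.pyplot as plt

xaxis_label = 'Normalized Total Volume'
yaxis_label = 'Normalized Value'
x_range = [-2, 5]
y_range = [-2, 5]

# Synthetic standardized values standing in for the scaled columns (assumption).
rng = np.random.RandomState(0)
x = rng.normal(size=100)
y = rng.normal(size=100)

plt.scatter(x, y, s=10)
plt.xlabel(xaxis_label)
plt.ylabel(yaxis_label)
plt.xlim(x_range[0], x_range[1])
plt.ylim(y_range[0], y_range[1])
ax = plt.gca()  # keep a handle on the axes for later inspection
# plt.show()    # uncomment in a script; notebooks usually render the figure inline
```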

#### 3-2. `bokeh.js`

In [None]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import HoverTool, ColumnDataSource

In [None]:
xaxis_label = 'Normalized Total Volume'
yaxis_label = 'Normalized Value'
x_range = [-2, 5]
y_range = [-2, 5]

title = 'Total Volume vs. Value '
TOOLS="hover,crosshair,pan,wheel_zoom,box_zoom,reset,tap,save,box_select"


# output_notebook()

# data source as ColumnDataSource

# figure

# plot kind

# xaxis label
# yaxis label

# hover Setting

# show