Pandas study - Rank
---

## Learning objectives
Learn how pandas `rank` works and apply it to a stock dataset.

## pandas docs
- rank : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rank.html
- pandas.core.groupby.GroupBy.rank : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.rank.html#pandas.core.groupby.GroupBy.rank

In [5]:
import pandas as pd
import numpy as np

In [3]:
pd.DataFrame.rank?

Signature:
pd.DataFrame.rank(
    self,
    axis=0,
    method='average',
    numeric_only=None,
    na_option='keep',
    ascending=True,
    pct=False,
)
Docstring:
Compute numerical data ranks (1 through n) along axis. Equal values are
assigned a rank that is the average of the ranks of those values.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    index to direct ranking
method : {'average', 'min', 'max', 'first', 'dens

### Example from the official documentation

In [6]:
df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
                                   'spider', 'snake'],
                        'Number_legs': [4, 2, 4, 8, np.nan]})

In [7]:
df

|   | Animal | Number_legs |
|---|---|---|
| 0 | cat | 4.0 |
| 1 | penguin | 2.0 |
| 2 | dog | 4.0 |
| 3 | spider | 8.0 |
| 4 | snake | NaN |


The following example shows how the method behaves with the above parameters:

- default_rank: the default behaviour, obtained without passing any parameter.
- max_rank: with method='max', records that share a value all receive the highest rank of the tie (e.g. since 'cat' and 'dog' occupy the 2nd and 3rd positions, both are assigned rank 3).
- NA_bottom: with na_option='bottom', records with NaN values are placed at the bottom of the ranking.
- pct_rank: with pct=True, the ranking is expressed as a percentile rank.

In [8]:
df['default_rank'] = df['Number_legs'].rank()
df['max_rank'] = df['Number_legs'].rank(method='max')
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
df['pct_rank'] = df['Number_legs'].rank(pct=True)

In [9]:
df

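|   | Animal | Number_legs | default_rank | max_rank | NA_bottom | pct_rank |
|---|---|---|---|---|---|---|
| 0 | cat | 4.0 | 2.5 | 3.0 | 2.5 | 0.625 |
| 1 | penguin | 2.0 | 1.0 | 1.0 | 1.0 | 0.25 |
| 2 | dog | 4.0 | 2.5 | 3.0 | 2.5 | 0.625 |
| 3 | spider | 8.0 | 4.0 | 4.0 | 4.0 | 1.0 |
| 4 | snake | NaN | NaN | NaN | 5.0 | NaN |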
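
The official example covers `method='max'` only. As a minimal sketch, the remaining tie-breaking methods can be compared on the same leg counts. The `legs` Series and the `comparison` frame are names introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same leg counts as the official example, indexed by animal name
legs = pd.Series([4, 2, 4, 8, np.nan],
                 index=['cat', 'penguin', 'dog', 'spider', 'snake'])

comparison = pd.DataFrame({
    'min': legs.rank(method='min'),      # tied values share the lowest rank of the tie
    'first': legs.rank(method='first'),  # ties are broken by order of appearance
    'dense': legs.rank(method='dense'),  # like 'min', but ranks have no gaps after a tie
})
print(comparison)
```

With the default `na_option='keep'`, the NaN row for 'snake' stays NaN in every column.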


### Applying rank to stock data

In [10]:
file_path = 'Stock_Dataset(2017_07_06)/000020.csv'

In [11]:
stock_df = pd.read_csv(file_path)

In [13]:
stock_df.head(10)

|   | Date | Close | Open | High | Low | Volume | Code | Company | Up&Down | Rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005-04-27 | 11250 | 10750 | 11500 | 10650 | 31610 | 20 | 동화약품 | 0.0 | 0.0 |
| 1 | 2005-04-28 | 11050 | 11200 | 11450 | 10950 | 12670 | 20 | 동화약품 | -200.0 | -1.777778 |
| 2 | 2005-04-29 | 10900 | 10550 | 10950 | 10550 | 15280 | 20 | 동화약품 | -150.0 | -1.357466 |
| 3 | 2005-05-02 | 11200 | 10900 | 11300 | 10750 | 71860 | 20 | 동화약품 | 300.0 | 2.752294 |
| 4 | 2005-05-03 | 11700 | 11450 | 12150 | 11400 | 150848 | 20 | 동화약품 | 500.0 | 4.464286 |
| 5 | 2005-05-04 | 13450 | 13450 | 13450 | 13450 | 425714 | 20 | 동화약품 | 1750.0 | 14.957265 |
| 6 | 2005-05-06 | 13850 | 15000 | 15400 | 13600 | 984975 | 20 | 동화약품 | 400.0 | 2.973978 |
| 7 | 2005-05-09 | 14400 | 14200 | 14550 | 13600 | 444290 | 20 | 동화약품 | 550.0 | 3.971119 |
| 8 | 2005-05-10 | 13700 | 14300 | 14650 | 13600 | 377674 | 20 | 동화약품 | -700.0 | -4.861111 |
| 9 | 2005-05-11 | 12800 | 13600 | 13700 | 12700 | 216202 | 20 | 동화약품 | -900.0 | -6.569343 |


In [14]:
stock_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3025 entries, 0 to 3024
Data columns (total 10 columns):
Date       3025 non-null object
Close      3025 non-null int64
Open       3025 non-null int64
High       3025 non-null int64
Low        3025 non-null int64
Volume     3025 non-null int64
Code       3025 non-null int64
Company    3025 non-null object
Up&Down    3025 non-null float64
Rate       3025 non-null float64
dtypes: float64(2), int64(6), object(2)
memory usage: 236.4+ KB


In [21]:
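def make_preprocessed_dataframe(df):
    # Work on a copy so the caller's DataFrame is left unchanged
    df = df.copy()
    # Parse the date strings, then extract the calendar year used for grouping
    df['Date'] = pd.to_datetime(df['Date'])
    df['year'] = df['Date'].dt.year

    return df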

In [22]:
stock_df_ = make_preprocessed_dataframe(stock_df)

In [24]:
stock_df_.head()

|   | Date | Close | Open | High | Low | Volume | Code | Company | Up&Down | Rate | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005-04-27 | 11250 | 10750 | 11500 | 10650 | 31610 | 20 | 동화약품 | 0.0 | 0.0 | 2005 |
| 1 | 2005-04-28 | 11050 | 11200 | 11450 | 10950 | 12670 | 20 | 동화약품 | -200.0 | -1.777778 | 2005 |
| 2 | 2005-04-29 | 10900 | 10550 | 10950 | 10550 | 15280 | 20 | 동화약품 | -150.0 | -1.357466 | 2005 |
| 3 | 2005-05-02 | 11200 | 10900 | 11300 | 10750 | 71860 | 20 | 동화약품 | 300.0 | 2.752294 | 2005 |
| 4 | 2005-05-03 | 11700 | 11450 | 12150 | 11400 | 150848 | 20 | 동화약품 | 500.0 | 4.464286 | 2005 |


In [37]:
stock_df_[stock_df_['year']==2005]['Volume'].max()

984975

In [30]:
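# Row with the highest trading volume in each year

def get_highest_volume_individaul_year(df):
    # Rank volumes within each year, largest first; with method='min' every tied maximum gets rank 1
    df['Volume_rank'] = df.groupby(by='year')['Volume'].rank(method='min', ascending=False)
    # Keep only the top-ranked rows and the columns of interest
    top_volume_df = df[df['Volume_rank'] == 1].reset_index(drop=True)[['Date', 'Volume']]

    return top_volume_df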

In [31]:
get_highest_volume_individaul_year(stock_df_)

|   | Date | Volume |
|---|---|---|
| 0 | 2005-05-06 | 984975 |
| 1 | 2006-04-19 | 254105 |
| 2 | 2007-07-02 | 371190 |
| 3 | 2008-11-06 | 112957 |
| 4 | 2009-08-28 | 1440335 |
| 5 | 2010-02-02 | 454953 |
| 6 | 2011-06-16 | 1919922 |
| 7 | 2012-01-11 | 745913 |
| 8 | 2013-12-27 | 1318159 |
| 9 | 2014-12-08 | 672393 |
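
As a cross-check on the rank-based selection, the same rows can also be found with `idxmax`, which is a different technique from the one this notebook uses. The sketch below runs on a small hypothetical frame named `toy`. Three of its volumes are borrowed from the outputs above and one (100000) is invented, because the original CSV is not reproduced here.

```python
import pandas as pd

# Hypothetical miniature of the stock data (the 100000 volume is invented)
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2005-05-04', '2005-05-06', '2006-04-18', '2006-04-19']),
    'Volume': [425714, 984975, 100000, 254105],
})
toy['year'] = toy['Date'].dt.year

# Rank-based selection: keep rows whose within-year rank is 1
ranks = toy.groupby('year')['Volume'].rank(method='min', ascending=False)
by_rank = toy.loc[ranks == 1, ['Date', 'Volume']].reset_index(drop=True)

# idxmax-based selection: one row label per year (first occurrence when tied)
top_idx = toy.groupby('year')['Volume'].idxmax()
by_idxmax = toy.loc[top_idx, ['Date', 'Volume']].reset_index(drop=True)

print(by_rank.equals(by_idxmax))
```

The two approaches agree when each year has a unique maximum. They differ on ties: `rank(method='min')` keeps every tied row, whereas `idxmax` returns a single label per group.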


In [32]:
stock_df_.groupby(by='year')['Volume'].rank(method='min', ascending=False)

0       165.0
1       172.0
2       171.0
3       123.0
4        43.0
5         9.0
6         1.0
7         7.0
8        12.0
9        23.0
10       27.0
11       33.0
12       53.0
13       29.0
14       34.0
15       42.0
16       80.0
17       26.0
18      115.0
19      111.0
20       18.0
21       76.0
22       56.0
23      119.0
24      155.0
25      113.0
26       20.0
27       86.0
28       89.0
29       92.0
        ...  
2995     93.0
2996     74.0
2997    111.0
2998     91.0
2999     32.0
3000     69.0
3001     68.0
3002     18.0
3003     52.0
3004      8.0
3005     43.0
3006     89.0
3007     37.0
3008     58.0
3009     23.0
3010     29.0
3011     99.0
3012     10.0
3013     63.0
3014     92.0
3015     49.0
3016    124.0
3017     97.0
3018     96.0
3019    119.0
3020    117.0
3021    126.0
3022    122.0
3023    102.0
3024    115.0
Name: Volume, Length: 3025, dtype: float64

#### pandas.core.groupby.GroupBy.rank

Provide the rank of values within each group.

Parameters
- method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’
  - average: average rank of group.
  - min: lowest rank in group.
  - max: highest rank in group.
  - first: ranks assigned in order they appear in the array.
  - dense: like ‘min’, but rank always increases by 1 between groups.
- ascending : bool, default True
  - False for ranks by high (1) to low (N).
- na_option : {‘keep’, ‘top’, ‘bottom’}, default ‘keep’
  - keep: leave NA values where they are.
  - top: smallest rank if ascending.
  - bottom: smallest rank if descending.
- pct : bool, default False
  - Compute percentage rank of data within each group.
- axis : int, default 0
  - The axis of the object over which to compute the rank.

Returns
- DataFrame with ranking of values within each group
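
The parameters listed above can be tried on a small example. This is a minimal sketch on hypothetical data: the `scores` frame, its `team` and `score` columns, and the `result` frame are all invented for illustration.

```python
import pandas as pd

# Hypothetical two-group data, invented for illustration
scores = pd.DataFrame({
    'team': ['a', 'a', 'a', 'b', 'b'],
    'score': [10, 30, 30, 5, 7],
})

grouped = scores.groupby('team')['score']

result = pd.DataFrame({
    'team': scores['team'],
    'score': scores['score'],
    # Highest score in each team gets rank 1; ties share the lowest rank
    'rank_desc': grouped.rank(method='min', ascending=False),
    # Same ordering, but ranks increase by exactly 1 between distinct values
    'rank_dense': grouped.rank(method='dense', ascending=False),
    # Default 'average' ranks divided by the group size
    'rank_pct': grouped.rank(pct=True),
})
print(result)
```

Ranks restart in every group, so both team `a` and team `b` contain a row with `rank_desc == 1`.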

In [40]:
def get_high_closing_price_individaul_year(df):
    # Rank closing prices within each year, highest first; tied maxima all receive rank 1
    df['Close_rank'] = df.groupby(by='year')['Close'].rank(method='min', ascending=False)
    # Keep only the top-ranked rows and the columns of interest
    top_close_df = df[df['Close_rank'] == 1].reset_index(drop=True)[['Date', 'Close']]

    return top_close_df

In [41]:
get_high_closing_price_individaul_year(stock_df_)

|   | Date | Close |
|---|---|---|
| 0 | 2005-10-13 | 24100 |
| 1 | 2005-11-24 | 24100 |
| 2 | 2006-12-22 | 33350 |
| 3 | 2007-06-29 | 93000 |
| 4 | 2008-05-06 | 68100 |
| 5 | 2009-06-22 | 50700 |
| 6 | 2010-01-04 | 7520 |
| 7 | 2011-08-01 | 6100 |
| 8 | 2012-11-05 | 6760 |
| 9 | 2013-05-08 | 7300 |
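
The result contains two rows for 2005 because both dates closed at 24100 and `method='min'` gives every tied maximum rank 1. The following sketch contrasts this with `method='first'`. It uses a hypothetical frame `toy` built from those two 2005 rows plus one invented lower close.

```python
import pandas as pd

# Two 2005 rows taken from the output above, plus one invented lower close
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2005-10-13', '2005-11-24', '2005-12-01']),
    'Close': [24100, 24100, 20000],
})
toy['year'] = toy['Date'].dt.year

closes = toy.groupby('year')['Close']
rank_min = closes.rank(method='min', ascending=False)      # ties share rank 1
rank_first = closes.rank(method='first', ascending=False)  # ties get distinct ranks

print(toy[rank_min == 1])    # both tied dates
print(toy[rank_first == 1])  # only one of the tied dates
```

If exactly one row per year is required, `method='first'` is one option. Keeping `method='min'` is preferable when every date that reached the yearly high should be reported.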


In [50]:
stock_df_['Rate_rank'] = stock_df_.groupby(by='year')['Rate'].rank(method='min', ascending=False)

In [52]:
stock_df_[stock_df_['Rate_rank']==1].reset_index(drop=True)[['Date', 'Rate']]

|   | Date | Rate |
|---|---|---|
| 0 | 2005-05-04 | 14.957265 |
| 1 | 2006-04-13 | 7.344633 |
| 2 | 2007-05-30 | 14.965197 |
| 3 | 2008-10-30 | 14.987715 |
| 4 | 2009-05-25 | 8.009153 |
| 5 | 2010-05-31 | 8.433735 |
| 6 | 2011-04-28 | 8.245243 |
| 7 | 2012-09-10 | 4.90566 |
| 8 | 2013-05-03 | 7.014925 |
| 9 | 2014-10-27 | 4.83871 |
