Optimizing pandas
----

Pandas optimization study notes
- Source: https://github.com/s-heisler/pycon2017-optimizing-pandas
- The profiling part is omitted

In [1]:
import pandas as pd
import numpy as np

### Reading the data

In [2]:
df = pd.read_csv("new_york_hotels.csv", encoding='cp1252')
df.head()

Unnamed: 0,ean_hotel_id,name,address1,city,state_province,postal_code,latitude,longitude,star_rating,high_rate,low_rate
0,269955,Hilton Garden Inn Albany/SUNY Area,1389 Washington Ave,Albany,NY,12206,42.68751,-73.81643,3.0,154.0272,124.0216
1,113431,Courtyard by Marriott Albany Thruway,1455 Washington Avenue,Albany,NY,12206,42.68971,-73.82021,3.0,179.01,134.0
2,108151,Radisson Hotel Albany,205 Wolf Rd,Albany,NY,12205,42.7241,-73.79822,3.0,134.17,84.16
3,254756,Hilton Garden Inn Albany Medical Center,62 New Scotland Ave,Albany,NY,12208,42.65157,-73.77638,3.0,308.2807,228.4597
4,198232,CrestHill Suites SUNY University Albany,1415 Washington Avenue,Albany,NY,12206,42.68873,-73.81854,3.0,169.39,89.39


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1631 entries, 0 to 1630
Data columns (total 11 columns):
ean_hotel_id      1631 non-null int64
name              1631 non-null object
address1          1631 non-null object
city              1631 non-null object
state_province    1631 non-null object
postal_code       1631 non-null object
latitude          1631 non-null float64
longitude         1631 non-null float64
star_rating       1630 non-null float64
high_rate         1631 non-null float64
low_rate          1631 non-null float64
dtypes: float64(5), int64(1), object(5)
memory usage: 140.2+ KB


#### Define the normalization function

In [4]:
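def normalize(df, pd_series):
    pd_series = pd_series.astype(float)
    
    # Outlier bounds: two standard deviations around the mean
    avg = np.mean(pd_series)
    sd = np.std(pd_series)
    
    lower_bound = avg - 2 * sd
    upper_bound = avg + 2 * sd
    
    # Start from the original values so rows inside the bounds are not left as NaN
    df["cutoff_rate"] = pd_series
    # Collapse outliers onto the bounds
    df.loc[pd_series < lower_bound, "cutoff_rate"] = lower_bound
    df.loc[pd_series > upper_bound, "cutoff_rate"] = upper_bound
    
    # Take the natural log of the clipped rates
    normalized_price = np.log(df["cutoff_rate"].astype(float))
    
    return normalized_price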

#### Timing the normalization function

In [5]:
%timeit df['high_rate_normalized'] = normalize(df, df['high_rate'])

3.98 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
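
The outlier handling in `normalize` can also be written with `Series.clip`, which bounds every value in a single call and does not modify the DataFrame. The following is a minimal sketch under that assumption and is not part of the source material; `normalize_clip` and the sample rates are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

def normalize_clip(pd_series):
    """Clip values to mean +/- 2 standard deviations, then take the natural log."""
    values = pd_series.astype(float)
    avg = values.mean()
    # ddof=0 gives the population standard deviation, the default of np.std
    sd = values.std(ddof=0)
    # Series.clip bounds every value in one vectorized call
    clipped = values.clip(lower=avg - 2 * sd, upper=avg + 2 * sd)
    return np.log(clipped)

# Hypothetical sample rates, not taken from new_york_hotels.csv
rates = pd.Series([100.0, 120.0, 110.0, 90.0, 105.0,
                   95.0, 115.0, 100.0, 110.0, 2000.0])
normalized = normalize_clip(rates)
```

In this sample only the last value lies outside the bounds, so it is pulled down to the upper bound before the log is taken, while the other values pass through unchanged.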


## Haversine definition

In [7]:
def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in decimal degrees."""
    miles_constant = 3959  # approximate Earth radius in miles
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    mi = miles_constant * c
    
    return mi
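
A few quick checks can help confirm that the function behaves as a distance should. The sketch below repeats the definition so that it runs on its own; the variable names are hypothetical, the reference point is the one used in the timing cells, and the Albany coordinates come from the first row of the `df.head()` output.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    # Same formula as the definition above; 3959 is the Earth radius in miles
    miles_constant = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return miles_constant * 2 * np.arcsin(np.sqrt(a))

# Distance from a point to itself should be zero
same_point = haversine(40.671, -73.985, 40.671, -73.985)

# Moving one degree due north should cover radius * (pi / 180) miles
one_degree_north = haversine(40.671, -73.985, 41.671, -73.985)
expected_one_degree = 3959 * np.pi / 180

# Swapping origin and destination should not change the distance
there = haversine(40.671, -73.985, 42.68751, -73.81643)
back = haversine(42.68751, -73.81643, 40.671, -73.985)
```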

### iterrows haversine

In [8]:
%%timeit

haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985, row['latitude'], row['longitude']))
    
df['distance'] = haversine_series

187 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Apply Haversine on rows

#### Timing "apply"

In [13]:
%%timeit
df['distance'] = df.apply(
    lambda row : haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1
)

51.9 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Vectorized implementation of Haversine applied on Pandas series

#### Timing vectorized implementation

In [15]:
%timeit df['distance'] = haversine(40.671, -73.985, df['latitude'], df['longitude'])

1.95 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Vectorized implementation of Haversine applied on NumPy arrays 

#### Timing vectorized implementation

In [16]:
%timeit df['distance'] = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)

210 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [17]:
df['latitude'].values

array([42.68751, 42.68971, 42.7241 , ..., 40.92625, 40.95375, 40.97308])

In [18]:
%%timeit
# Convert pandas Series to NumPy ndarrays
np_lat = df['latitude'].values
np_lon = df['longitude'].values

4.11 µs ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
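
The four approaches timed above differ in speed but should return the same distances. The sketch below checks this on a small frame built from the first three coordinates in the `df.head()` output. It uses `Series.to_numpy()`, which the pandas documentation recommends over `.values`; that substitution and the names `sample`, `loop_result`, `apply_result`, `series_result` and `array_result` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Same formula as the notebook's haversine; result is in miles
    miles_constant = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return miles_constant * 2 * np.arcsin(np.sqrt(a))

# Small sample frame; coordinates copied from the df.head() output
sample = pd.DataFrame({
    "latitude": [42.68751, 42.68971, 42.7241],
    "longitude": [-73.81643, -73.82021, -73.79822],
})

# 1. Python-level loop over rows
loop_result = [
    haversine(40.671, -73.985, row["latitude"], row["longitude"])
    for _, row in sample.iterrows()
]

# 2. Row-wise apply
apply_result = sample.apply(
    lambda row: haversine(40.671, -73.985, row["latitude"], row["longitude"]),
    axis=1,
)

# 3. Vectorized over pandas Series
series_result = haversine(40.671, -73.985, sample["latitude"], sample["longitude"])

# 4. Vectorized over NumPy ndarrays
array_result = haversine(
    40.671, -73.985,
    sample["latitude"].to_numpy(), sample["longitude"].to_numpy(),
)
```

The Series version returns a `pd.Series` with an index, while the ndarray version returns a plain `np.ndarray`; skipping the index handling is a likely reason the ndarray call is the fastest in the timings above.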


---------------

## Code Snippet

```python
import numpy as np
import pandas as pd

def function(val_1, val_2, np_array_1, np_array_2):
    # Apply one element-wise NumPy function to every input;
    # np.deg2rad stands in here as a placeholder, as in haversine
    val_1_result, val_2_result, np_array_res_1, np_array_res_2 = map(
        np.deg2rad, [val_1, val_2, np_array_1, np_array_2]
    )
    result = val_1_result - val_2_result * (np_array_res_1 + np_array_res_2)
    
    return result

df = pd.read_csv("file.csv")

# Fastest: pass NumPy ndarrays so each column is processed in one call
df['result'] = function(10, 15, df['arr_1'].values, df['arr_2'].values)

# Slower: call the function once per row (axis=1);
# DataFrame.apply accepts only axis 0 or 1
df['result'] = df.apply(
    lambda row : function(10, 15, row['arr_1'], row['arr_2']), axis=1
)
```