# `vaex` @ PyData Budapest 2020

## New York Taxi Dataset (2009-2015): Exploratory Data Analysis

https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page


Running this notebook requires `vaex==3.0.0`.

In [None]:
import vaex
from vaex.ui.colormaps import cm_plusmin
import warnings; warnings.simplefilter('ignore')

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
pd.options.display.max_rows = 70

### Main concepts behind `vaex`:
 - Memory mapping
 - Lazy evaluations
 - Expression system ("virtual" columns)
 - High-performance algorithms

### Memory mapping

In [None]:
!du -h /data/yellow*

Get instant access to your data!

In [None]:
df = vaex.open('/data/yellow_taxi_2009_2015_f32.hdf5')

You can also stream it directly from S3:
```
df = vaex.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5?anon=true')
```
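
To make the memory-mapping idea concrete, here is a minimal sketch that uses only the Python standard library. It is not how `vaex` is implemented; the file name `column_f32.bin` and the column contents are made up for illustration. The point is that mapping a file is cheap, and only the part you actually touch gets read from disk.

```python
import os
import mmap
import struct
import tempfile
from array import array

n_values = 100_000

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file standing in for one float32 column on disk
    path = os.path.join(tmp_dir, "column_f32.bin")

    # Write a float32 column (0.0, 1.0, 2.0, ...) in native byte order
    column = array("f", (float(i) for i in range(n_values)))
    with open(path, "wb") as f:
        column.tofile(f)

    # Map the file: this does not read the values into memory up front
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Touch only the last three values; the OS pages in just that region
            offset = (n_values - 3) * column.itemsize
            tail = struct.unpack_from("3f", mapped, offset)

print(tail)
```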

### Lazy evaluations

Just get a quick preview whenever you want to "peek" at your data

In [None]:
df

### Expression system ("virtual" columns)

We call a single "column" an "expression"

In [None]:
df.tip_amount

Defining new columns takes no memory

In [None]:
df['tip_percentage'] = df.tip_amount / df.total_amount
df.tip_percentage

Peeking at the data is instant

In [None]:
df

Vaex knows when to be lazy, and when to be eager:
 - If the output of an operation is a new column, vaex will be lazy
 - If the output of an operation is expected to be a new data structure (a single number, a list, etc.), vaex will be eager

In [None]:
df.tip_percentage.mean()
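
The lazy/eager split can be sketched with a toy class. This is an illustration of the idea only, not the `vaex` expression system; `LazyColumn`, `load_tips` and `load_totals` are hypothetical names and the numbers are made up. Dividing two columns only stores a recipe, while asking for the mean forces a single pass over the data.

```python
class LazyColumn:
    """Toy stand-in for an expression: it stores a recipe, not the values."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable returning an iterable

    def __truediv__(self, other):
        # Lazy: build a new recipe, nothing is computed here
        return LazyColumn(
            lambda: (a / b for a, b in zip(self._compute(), other._compute()))
        )

    def mean(self):
        # Eager: a single number is requested, so stream over the data once
        total, count = 0.0, 0
        for value in self._compute():
            total += value
            count += 1
        return total / count


loads = []  # records every time the underlying data is actually read


def load_tips():
    loads.append("tip_amount")
    return [1.0, 2.0, 4.0]


def load_totals():
    loads.append("total_amount")
    return [8.0, 8.0, 8.0]


tip_percentage = LazyColumn(load_tips) / LazyColumn(load_totals)
loads_after_definition = len(loads)  # still 0: defining the column read nothing

mean_tip = tip_percentage.mean()     # the data is read here
print(loads_after_definition, len(loads), mean_tip)
```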

Filtering creates a shallow copy of the DataFrame. The data itself is not copied!

In [None]:
df_filtered = df[df.total_amount>0]

In [None]:
df_filtered.tip_percentage.mean()
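
Conceptually, a filtered DataFrame can be thought of as a reference to the same data plus a row mask. The sketch below illustrates that idea with plain Python lists; `FilteredView` is a hypothetical name, the values are made up, and this is not how `vaex` stores its filters internally.

```python
class FilteredView:
    """Toy filtered frame: a reference to the original data plus a mask."""

    def __init__(self, data, mask):
        self.data = data  # the same list object, not a copy
        self.mask = mask  # one boolean per row

    def mean(self):
        # Only rows that pass the mask contribute to the result
        kept = [value for value, keep in zip(self.data, self.mask) if keep]
        return sum(kept) / len(kept)


total_amount = [12.5, -3.0, 8.0, 0.0]
view = FilteredView(total_amount, [value > 0 for value in total_amount])

print(view.data is total_amount)  # the underlying data is shared
print(view.mean())
```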

### High performance, efficient algorithms

In [None]:
# Check length of file
rows, columns = df.shape
print(f'Number of rows: {rows:,}')
print(f'Number of columns: {columns}')

In [None]:
df.describe()

## Application: Exploring and cleaning the New York Taxi dataset

### Remove missing data

In [None]:
# Drop NANs
df = df.dropna(column_names=['dropoff_latitude', 'dropoff_longitude', 'pickup_latitude'])

### Abnormal number of passengers

In [None]:
df.passenger_count.value_counts(progress='widget')

In [None]:
# Filter abnormal number of passengers
df = df[(df.passenger_count>0) & (df.passenger_count<7)]

### Clean up distance values

In [None]:
plt.figure(figsize=(8,4))
df.plot1d('trip_distance', limits='minmax', f='log1p', progress='widget')
plt.show()

In [None]:
# How many trips have 0.0 distance?
(df.trip_distance==0).astype('int').sum()

In [None]:
# What is the largest distance?
max_distance = df.trip_distance.max(progress='widget')
print()
print(f'The maximum trip distance in the data is {max_distance} miles')
print()
print(f'This is {max_distance / 238_900:3.1f} times the distance between the Earth and the Moon!')
print('or')
print(f'This is {max_distance / 33_900_000:1.1f} times the distance to Mars at its closest approach!')

In [None]:
plt.figure(figsize=(8,4))
df.plot1d('trip_distance', limits=[0, 20], f=None, progress='widget')
plt.show()

In [None]:
# Filter out zero, negative and too large distances
df = df[(df.trip_distance>0) & (df.trip_distance<10)]

### What _is_ New York City really?

In [None]:
# Interactively plot the pickup locations
df.plot_widget(df.pickup_longitude, 
               df.pickup_latitude, 
               shape=512, 
               f='log1p', 
               colormap='plasma', 
               limits='minmax')

In [None]:
# Define the NYC boundaries
long_min = -74.05
long_max = -73.75
lat_min = 40.58
lat_max = 40.90

# Make a selection based on the boundaries
df = df[(df.pickup_longitude > long_min)  & (df.pickup_longitude < long_max) &
        (df.pickup_latitude > lat_min)    & (df.pickup_latitude < lat_max) &
        (df.dropoff_longitude > long_min) & (df.dropoff_longitude < long_max) &
        (df.dropoff_latitude > lat_min)   & (df.dropoff_latitude < lat_max)]

### Create some date/time features

In [None]:
# Daily activities
df['pickup_hour'] = df.pickup_datetime.dt.hour
df['pickup_day_of_week'] = df.pickup_datetime.dt.dayofweek
df['pickup_is_weekend'] = (df.pickup_day_of_week>=5).astype('int')

# Treat as a categorical feature
df.categorize(column='pickup_hour', inplace=True)

weekday_names_list = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
df.categorize(column='pickup_day_of_week', labels=weekday_names_list, inplace=True)
df
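
The weekend flag above relies on the day-of-week convention Monday = 0, ..., Sunday = 6, which is why `>=5` selects Saturday and Sunday. The standard library's `datetime.weekday()` uses the same convention, so the features can be checked on a single timestamp. `date_features` is a hypothetical helper written for this illustration, and the timestamps are made up.

```python
from datetime import datetime

weekday_names_list = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']


def date_features(pickup):
    """Compute the same three features for one pickup timestamp."""
    day_of_week = pickup.weekday()  # Monday == 0, Sunday == 6
    return {
        'pickup_hour': pickup.hour,
        'pickup_day_of_week': weekday_names_list[day_of_week],
        'pickup_is_weekend': int(day_of_week >= 5),
    }


saturday_night = date_features(datetime(2015, 1, 3, 23, 15))
monday_morning = date_features(datetime(2015, 1, 5, 8, 0))
print(saturday_night)
print(monday_morning)
```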

In [None]:
# Number of pick-ups per hour for a given day of the week
df.plot('pickup_hour', 'pickup_day_of_week', colorbar=True, colormap=cm_plusmin, figsize=(15, 5))

plt.xticks(np.arange(24), np.arange(24))
plt.yticks(np.arange(7), weekday_names_list)
plt.show()

In [None]:
# Mean trip distance per hour for a given day of the week
df.plot('pickup_hour', 'pickup_day_of_week', what='mean(trip_distance)', 
        colorbar=True, colormap=cm_plusmin, figsize=(15, 5))

plt.xticks(np.arange(24), np.arange(24))
plt.yticks(np.arange(7), weekday_names_list)
plt.show()

### Groupby examples

In [None]:
df_per_hour = df.groupby(by=df.pickup_hour).agg({'tip_amount': 'mean',
                                                 'tip_amount_weekend': vaex.agg.mean('tip_amount', 
                                                                                     selection='pickup_is_weekend==1')
                                                })

# Display the grouped DataFrame
df_per_hour
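
To spell out what the two aggregations compute, here is a plain-Python sketch on a handful of made-up rows: one mean over all trips in an hour, and one mean restricted to the rows where `pickup_is_weekend == 1`. This only mirrors the result of the groupby; it says nothing about how `vaex` computes it.

```python
from collections import defaultdict

# Made-up rows: (pickup_hour, pickup_is_weekend, tip_amount)
rows = [
    (8, 0, 2.0),
    (8, 1, 4.0),
    (8, 0, 3.0),
    (23, 1, 5.0),
    (23, 1, 1.0),
]

# Per hour, keep a running [sum, count] for all rows and for weekend rows
sums = defaultdict(lambda: {'all': [0.0, 0], 'weekend': [0.0, 0]})
for hour, is_weekend, tip in rows:
    acc = sums[hour]
    acc['all'][0] += tip
    acc['all'][1] += 1
    if is_weekend == 1:  # the "selection" part of the aggregation
        acc['weekend'][0] += tip
        acc['weekend'][1] += 1


def safe_mean(total, count):
    # An empty selection has no mean; report NaN in that case
    return total / count if count else float('nan')


per_hour = {
    hour: {
        'tip_amount': safe_mean(*acc['all']),
        'tip_amount_weekend': safe_mean(*acc['weekend']),
    }
    for hour, acc in sorted(sums.items())
}
print(per_hour)
```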

In [None]:
plt.figure(figsize=(14, 5))

plt.subplot(121)
sns.barplot(x=df_per_hour.pickup_hour.values, y=df_per_hour.tip_amount.values)
plt.title('Mean tip amount')
plt.xlabel('hour of day')
plt.ylabel('mean tip amount')

plt.subplot(122)
sns.barplot(x=df_per_hour.pickup_hour.values, y=df_per_hour.tip_amount_weekend.values)
plt.title('Mean tip amount (weekend only)')
plt.xlabel('hour of day')
plt.ylabel('mean tip amount')


plt.tight_layout()
plt.show()

### Join

In [None]:
df = df.join(df_per_hour, on='pickup_hour', rprefix="right_")
df
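
The join attaches the per-hour aggregates back onto every trip. The toy left join below shows the idea on made-up rows. `left_join` is a hypothetical helper, and it assumes that the prefix is only applied to right-hand columns whose names collide with a left-hand column; check the `vaex` documentation for the exact behavior of `rprefix`.

```python
def left_join(left_rows, right_rows, on, rprefix):
    """Toy left join: one right-hand row per key, looked up via a dict."""
    lookup = {row[on]: row for row in right_rows}
    joined = []
    for left in left_rows:
        out = dict(left)
        match = lookup.get(left[on], {})
        for name, value in match.items():
            # Assumption: only colliding column names receive the prefix
            key = rprefix + name if name in left else name
            out[key] = value
        joined.append(out)
    return joined


# Made-up trips and per-hour aggregates
trips = [
    {'pickup_hour': 8, 'tip_amount': 2.0},
    {'pickup_hour': 23, 'tip_amount': 5.0},
    {'pickup_hour': 8, 'tip_amount': 3.0},
]
per_hour = [
    {'pickup_hour': 8, 'tip_amount': 2.5, 'tip_amount_weekend': 4.0},
    {'pickup_hour': 23, 'tip_amount': 5.0, 'tip_amount_weekend': 5.0},
]

joined = left_join(trips, per_hour, on='pickup_hour', rprefix='right_')
for row in joined:
    print(row)
```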

### Expensive columns

Let's see how Vaex performs on a computationally expensive virtual column.

In [None]:
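def arc_distance(theta_1, phi_1, theta_2, phi_2):
    """Great-circle (haversine) distance in miles; theta = latitude, phi = longitude, in degrees."""
    temp = (np.sin((theta_2-theta_1)/2*np.pi/180)**2
           + np.cos(theta_1*np.pi/180)*np.cos(theta_2*np.pi/180) * np.sin((phi_2-phi_1)/2*np.pi/180)**2)
    distance = 2 * np.arctan2(np.sqrt(temp), np.sqrt(1-temp))
    return distance * 3958.8  # 3958.8 miles is the mean radius of the Earth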

# distance Budapest - Utrecht [miles]
arc_distance(47.4813602, 18.9902182, 52.0842715, 5.0124523)
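
The formula above is the haversine distance, with the first argument of each pair acting as latitude (it is the one inside the cosines) and the second as longitude. A quick sanity check with the standard library, using points whose distance is known in closed form, might look like the sketch below. `haversine_miles` is a hypothetical scalar re-implementation written for this check, not part of `vaex`.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat_1, lon_1, lat_2, lon_2):
    """Scalar haversine distance; all inputs in degrees."""
    lat_1, lon_1, lat_2, lon_2 = map(math.radians, (lat_1, lon_1, lat_2, lon_2))
    a = (math.sin((lat_2 - lat_1) / 2) ** 2
         + math.cos(lat_1) * math.cos(lat_2) * math.sin((lon_2 - lon_1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.atan2(math.sqrt(a), math.sqrt(1 - a))


# A quarter of a great circle, along the equator and along a meridian
quarter_equator = haversine_miles(0.0, 0.0, 0.0, 90.0)
equator_to_pole = haversine_miles(0.0, 0.0, 90.0, 0.0)

# One degree of latitude near New York, and the same call with the
# latitude/longitude arguments swapped by mistake
one_degree_lat = haversine_miles(40.0, -74.0, 41.0, -74.0)
swapped = haversine_miles(-74.0, 40.0, -74.0, 41.0)

print(quarter_equator, equator_to_pole)
print(one_degree_lat, swapped)
```

Swapping latitude and longitude does not raise an error; it silently gives a different distance, so the argument order is worth double-checking.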

By default we are using numpy

In [None]:
# Add the arc-distance in miles as a virtual column (latitude first, then longitude)
df['arc_distance_miles_numpy'] = arc_distance(df.pickup_latitude, df.pickup_longitude,
                                              df.dropoff_latitude, df.dropoff_longitude)

In [None]:
sum_numpy = df['arc_distance_miles_numpy'].sum(progress='widget')
print(f'{sum_numpy:.5}')

We can accelerate this by using Just-In-Time compiling (JIT)

In [None]:
df['arc_distance_miles_numba'] = df.arc_distance_miles_numpy.jit_numba()

In [None]:
sum_numba = df.arc_distance_miles_numba.sum(progress='widget')
print(f'{sum_numba:.5}')

Acceleration via an Nvidia GPU is also possible! This example uses an _Nvidia 2080 Super_.

In [None]:
df['arc_distance_miles_cuda'] = df.arc_distance_miles_numpy.jit_cuda()

In [None]:
sum_cuda = df.arc_distance_miles_cuda.sum(progress='widget')
print(f'{sum_cuda:.5}')

### For a fuller picture please check out [the tutorial on the documentation pages](https://docs.vaex.io/en/latest/tutorial.html).