## Simple ETL / Exploration with node-rapids

This notebook demonstrates how the basic APIs of `node-rapids` ([GitHub](https://github.com/rapidsai/node-rapids), [docs](https://rapidsai.github.io/node-rapids/)) can be used to load and process data on the GPU from Node.js.

First, we load the cudf module from `node-rapids`:

In [1]:
var cudf = require("@rapidsai/cudf")

We are going to look at the 1.5 GB [US Accidents (Dec 20) dataset from Kaggle](https://www.kaggle.com/sobhanmoosavi/us-accidents?select=US_Accidents_Dec20.csv). First we need to load the CSV using `readCSV`:

In [2]:
console.time("readCSV")
df = cudf.DataFrame.readCSV({
    header: 0,
    sourceType: 'files',
    sources: ["modules/cudf/notebooks/data/US_Accidents_Dec20.csv"]
});
console.timeEnd("readCSV")

readCSV: 5.340s


Now that we have loaded the CSV into a GPU DataFrame `df` we can look at some basic information like number of rows and columns:

In [3]:
console.log("Number of rows:", df.numRows)
console.log("Number of cols:", df.numColumns)

Number of rows: 4229394
Number of cols: 49


We can also take a quick look at the top of the dataframe:

In [4]:
console.log(df.head().toString({maxColumns: 0}))

 ID   Source   TMC Severity          Start_Time            End_Time Start_Lat  Start_Lng End_Lat End_Lng Distance(mi)                                        Description Number                    Street Side         City     County State    Zipcode Country   Timezone Airport_Code   Weather_Timestamp Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Precipitation(in) Weather_Condition Amenity  Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station  Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
A-1 MapQuest 201.0        3 2016-02-08 05:46:00 2016-02-08 11:00:00 39.865147 -84.058723    null    null         0.01                                                ...   null                    I-70 E    R       Dayton Montgomery    OH      45424      US US/Eastern         KFFO 2016-02-08 05:58:00           36.9          null        91.0        29.68           10.0

We can see that this data set has lots of columns we don't really care about. We can pare things down using the `DataFrame.drop` method:

In [5]:
// Drop the columns we do not need; every name must match a CSV header exactly.
var df = df.drop([
    'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'Wind_Speed(mph)', 'Wind_Direction', 'Sunrise_Sunset',
    'Amenity', 'Bump', 'Give_Way', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Traffic_Calming', 'Turning_Loop', 'Timezone', 'Crossing', 'Stop', 'Traffic_Signal', 'Junction', 'Number', 'Side', 'County',
    'Airport_Code', 'TMC', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Street', 'Country', 'Zipcode', 'Distance(mi)', 'Wind_Chill(F)', 'Pressure(in)', 'Humidity(%)']
)

In [6]:
df.names

[
  'ID',
  'Source',
  'Severity',
  'Description',
  'City',
  'State',
  'Weather_Timestamp',
  'Temperature(F)',
  'Visibility(mi)',
  'Precipitation(in)',
  'Weather_Condition'
]

Things are a bit more manageable now, though you still need a wide screen to see all the columns:

In [7]:
console.log(df.head().toString({maxColumns: 0}))

 ID   Source Severity                                        Description         City State   Weather_Timestamp Temperature(F) Visibility(mi) Precipitation(in) Weather_Condition
A-1 MapQuest        3                                                ...       Dayton    OH 2016-02-08 05:58:00           36.9           10.0              0.02        Light Rain
A-2 MapQuest        2 Accident on Brice Rd at Tussing Rd. Expect delays. Reynoldsburg    OH 2016-02-08 05:51:00           37.9           10.0               0.0        Light Rain
A-3 MapQuest        2                                                ... Williamsburg    OH 2016-02-08 06:56:00           36.0           10.0              null          Overcast
A-4 MapQuest        3                                                ...       Dayton    OH 2016-02-08 07:38:00           35.1            9.0              null     Mostly Cloudy
A-5 MapQuest        2                                                ...       Dayton    OH 2016-02-08 07:53:0

In [8]:
temp = df.get('Temperature(F)')
console.log("Min temp:", temp.min())
console.log("Max temp:", temp.max())

Min temp: -89
Max temp: 203


Some of the temperature values are clearly bad data, so let's restrict the dataframe to a more reasonable range. The `lt` and `gt` comparison methods return a boolean mask that is true where values are less than or greater than the given value, respectively. These masks can be combined with the `logicalAnd` method and then passed to `DataFrame.filter` to keep only the valid rows we care about:

In [9]:
temp = df.get('Temperature(F)')

console.time("filter")
valid_temps = temp.lt(120).logicalAnd(temp.gt(-30))
df = df.filter(valid_temps)
console.timeEnd("filter")

filter: 32.037ms
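
To make the mask semantics concrete, here is a minimal CPU-only sketch using plain JavaScript arrays. It is not the cuDF API, the sample values are made up, and treating `null` as invalid is an assumption of this sketch rather than a statement about how cuDF handles nulls.

```javascript
// CPU-only illustration with plain arrays; this is not the cuDF API.
// The sample values are invented for the example.
const temps = [36.9, -89, 72.5, 203, null, 119];

// Element-wise comparisons produce one boolean per row.
// Nulls are treated as "not valid" here (an assumption of this sketch).
const ltMask = temps.map((t) => t !== null && t < 120);
const gtMask = temps.map((t) => t !== null && t > -30);

// Logical AND keeps a row only when both comparisons hold.
const validMask = ltMask.map((flag, i) => flag && gtMask[i]);

// Filtering keeps the rows whose mask entry is true.
const validTemps = temps.filter((_, i) => validMask[i]);

console.log(validMask);  // [ true, false, true, false, false, true ]
console.log(validTemps); // [ 36.9, 72.5, 119 ]
```

The masks have to be combined with AND rather than OR: every number is either below 120 or above -30, so an OR of the two masks would keep the out-of-range rows as well.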


We can see above how long filtering the full 1.5 GB data set took. Below we can verify that the filtered data only has values in the specified range:

In [10]:
temp = df.get('Temperature(F)')

console.log("New number of rows:", df.numRows)
console.log("New min temp:", temp.min())
console.log("New max temp:", temp.max())

New number of rows: 4139552
New min temp: -29.9
New max temp: 119


Another thing we might want to examine is the grouping of weather conditions. The original dataframe has very fine-grained weather conditions, e.g. "Fog" vs. "Shallow Fog", as seen below:

In [11]:
weather_groups = df.groupBy({by: "Weather_Condition"})
JSON.stringify(weather_groups.nth(0).get("Weather_Condition").toArrow().toArray())

'["Blowing Dust","Blowing Dust / Windy","Blowing Sand","Blowing Snow","Blowing Snow / Windy","Clear","Cloudy","Cloudy / Windy","Drifting Snow","Drizzle","Drizzle / Windy","Drizzle and Fog","Dust Whirls","Fair","Fair / Windy","Fog","Fog / Windy","Freezing Drizzle","Freezing Rain","Freezing Rain / Windy","Funnel Cloud","Hail","Haze","Haze / Windy","Heavy Blowing Snow","Heavy Drizzle","Heavy Freezing Drizzle","Heavy Freezing Rain","Heavy Ice Pellets","Heavy Rain","Heavy Rain / Windy","Heavy Rain Shower","Heavy Rain Showers","Heavy Sleet","Heavy Smoke","Heavy Snow","Heavy Snow / Windy","Heavy Snow with Thunder","Heavy T-Storm","Heavy T-Storm / Windy","Heavy Thunderstorms and Rain","Heavy Thunderstorms and Snow","Heavy Thunderstorms with Small Hail","Ice Pellets","Light Blowing Snow","Light Drizzle","Light Drizzle / Windy","Light Fog","Light Freezing Drizzle","Light Freezing Fog","Light Freezing Rain","Light Freezing Rain / Windy","Light Hail","Light Haze","Light Ice Pellets","Light Rain","
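
Grouping by a column and taking the first row of each group (`nth(0)`) leaves one row per distinct key, which is why the output above reads as a list of unique conditions. The same idea can be sketched on the CPU with a plain `Map`; the input values below are invented for illustration, and sorting the keys is an assumption made so the result is deterministic.

```javascript
// CPU-only illustration with plain arrays; this is not the cuDF API.
// The sample rows are invented for the example.
const conditions = ["Fog", "Light Rain", "Fog", "Clear", "Light Rain", "Fog"];

// Group rows by value and count how many rows fall into each group.
const groups = new Map();
for (const c of conditions) {
  groups.set(c, (groups.get(c) ?? 0) + 1);
}

// One entry per group gives the distinct values; sort them for stable output.
const distinct = [...groups.keys()].sort();

console.log(distinct);          // [ 'Clear', 'Fog', 'Light Rain' ]
console.log(groups.get("Fog")); // 3
```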

Let's use cuDF's GPU regex functions to get some quick counts of more generic weather categories. The `Series.containsRe` method returns a boolean mask that is true wherever the series value matches the regex:

In [12]:
weather = df.get("Weather_Condition")

console.time("regex")
clouds_mask = weather.containsRe("Cloud|Overcast");
rain_mask = weather.containsRe("Rain|T-Storm|Thunderstorm|Squalls|Drizzle");
snow_mask = weather.containsRe("Snow")
fog_mask = weather.containsRe("Fog")
ice_mask = weather.containsRe("Ice|Hail|Freezing|Sleet")
particulate_mask = weather.containsRe("Dust|Smoke|Sand")
console.timeEnd("regex")

regex: 143.383ms
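
As a rough CPU analogue of these masks, the sketch below applies the same patterns to a handful of condition strings taken from the list printed earlier. It uses the standard JavaScript `RegExp`, not the cuDF regex engine, and `containsMask` is a hypothetical helper written for this example.

```javascript
// CPU-only illustration with plain arrays; this is not the cuDF API.
const conditions = [
  "Light Rain",
  "Heavy Snow",
  "Drizzle and Fog",
  "Light Freezing Rain",
  "Clear",
];

// Hypothetical helper: true where the pattern matches anywhere in the string.
const containsMask = (values, pattern) => {
  const re = new RegExp(pattern);
  return values.map((v) => re.test(v));
};

const rainMask = containsMask(conditions, "Rain|T-Storm|Thunderstorm|Squalls|Drizzle");
const fogMask = containsMask(conditions, "Fog");
const iceMask = containsMask(conditions, "Ice|Hail|Freezing|Sleet");

console.log(rainMask); // [ true, false, true, true, false ]
console.log(fogMask);  // [ false, false, true, false, false ]
console.log(iceMask);  // [ false, false, false, true, false ]
```

Note that "Drizzle and Fog" lands in both the rain and fog masks, and "Light Freezing Rain" in both rain and ice, so a single row can count toward more than one category.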


The categories above are not necessarily exclusive and may overlap, but we can see how many accidents involved each category by summing the corresponding mask:

In [13]:
console.time("sum")
console.log("Accidents with clouds     :", clouds_mask.sum())
console.log("Accidents with rain       :", rain_mask.sum())
console.log("Accidents with snow       :", snow_mask.sum())
console.log("Accidents with fog        :", fog_mask.sum())
console.log("Accidents with particulate:", particulate_mask.sum())
console.log("Accidents with ice        :", ice_mask.sum())
console.timeEnd("sum")

Accidents with clouds     : 1889173
Accidents with rain       : 325042
Accidents with snow       : 67902
Accidents with fog        : 51700
Accidents with particulate: 8789
Accidents with ice        : 4688
sum: 11.89ms


We might also be interested in filtering by these subsets to see the average severity when each category is involved:

In [14]:
console.time("means")
console.log("Severity with clouds     :", df.filter(clouds_mask).get("Severity").mean())
console.log("Severity with rain       :", df.filter(rain_mask).get("Severity").mean())
console.log("Severity with snow       :", df.filter(snow_mask).get("Severity").mean())
console.log("Severity with fog        :", df.filter(fog_mask).get("Severity").mean())
console.log("Severity with particulate:", df.filter(particulate_mask).get("Severity").mean())
console.log("Severity with ice        :", df.filter(ice_mask).get("Severity").mean())
console.timeEnd("means")


Severity with clouds     : 2.3208912047758465
Severity with rain       : 2.3520591185139152
Severity with snow       : 2.402550734882625
Severity with fog        : 2.2155319148936172
Severity with particulate: 2.2825122311980883
Severity with ice        : 2.476962457337884
means: 62.381ms
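
The two aggregations used above, counting the true entries of a mask and averaging a column over the masked rows, can be sketched on the CPU as follows. The data is invented, and `countTrue` and `maskedMean` are hypothetical helpers for illustration, not part of cuDF.

```javascript
// CPU-only illustration with plain arrays; this is not the cuDF API.
// The sample rows are invented for the example.
const severity = [3, 2, 2, 4, 2];
const iceMask = [false, false, false, true, true];

// Summing a boolean mask is the same as counting its true entries.
const countTrue = (mask) => mask.filter(Boolean).length;

// Mean of the values whose mask entry is true (NaN if nothing is selected).
const maskedMean = (values, mask) => {
  const kept = values.filter((_, i) => mask[i]);
  if (kept.length === 0) return NaN;
  return kept.reduce((acc, v) => acc + v, 0) / kept.length;
};

console.log(countTrue(iceMask));            // 2
console.log(maskedMean(severity, iceMask)); // 3
```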


Unsurprisingly, the highest average severities were recorded in ice and snow conditions.

Hopefully this has been a helpful introduction to cuDF in node-rapids! For more information, [see the documentation](https://rapidsai.github.io/node-rapids/).