## Simple ETL / Exploration with node-rapids

This notebook demonstrates how basic APIs from `node-rapids` ([GitHub](https://github.com/rapidsai/node-rapids), [docs](https://rapidsai.github.io/node-rapids/)) can be used to load and process data on the GPU in Node.

First, we load the cudf module from `node-rapids`:

In [None]:
cudf = require("@rapidsai/cudf")

We are going to look at the 1.5 GB [US Accidents (Dec 20) dataset from Kaggle](https://www.kaggle.com/sobhanmoosavi/us-accidents?select=US_Accidents_Dec20.csv). First we load the CSV using `readCSV`:

In [None]:
console.time("readCSV")
df = cudf.DataFrame.readCSV({
    header: 0,
    sourceType: 'files',
    sources: ["data/US_Accidents_Dec20.csv"]
});
console.timeEnd("readCSV")

Now that we have loaded the CSV into a GPU DataFrame `df`, we can look at some basic information like the number of rows and columns:

In [None]:
console.log("Number of rows:", df.numRows)
console.log("Number of cols:", df.numColumns)

We can also take a quick look at the top of the dataframe:

In [None]:
console.log(df.head().toString({maxColumns: 0}))

We can see this data set has lots of columns we don't really care about. We can pare things down using the `DataFrame.drop` method:

In [None]:
df = df.drop([
    'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'Wind_Speed(mph)', 'Wind_Direction', 'Wind_Chill', 'Humidity(50)', 'Sunrise_Sunset',
    'Pressure', 'Amenity', 'Bump', 'Give_Way', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Traffic_Calming', 'Turning_Loop', 'Timezone', 'Crossing', 'Stop', 'Traffic_Signal', 'Junction', 'Number', 'Side', 'County',
    'Airport_Code', 'TMC', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Street', 'Country', 'Zipcode', 'Distance(mi)', 'Wind_Chill(F)', 'Pressure(in)', 'Humidity(%)']
)

In [None]:
df.names

In [None]:
temp = df.get('Temperature(F)')
console.log("Min temp:", temp.min())
console.log("Max temp:", temp.max())

Some of the temperature values are clearly bad data, so let's restrict the dataframe to a more reasonable range. The `lt` and `gt` comparison methods return a boolean mask that is true where values are less than or greater than the given value, respectively. These masks can be combined with the `logicalAnd` method and then passed to `DataFrame.filter` to keep only the valid rows we care about:

In [None]:
temp = df.get('Temperature(F)')

console.time("filter")
valid_temps = temp.lt(120).logicalAnd(temp.gt(-30))
df = df.filter(valid_temps)
console.timeEnd("filter")

We can see above how long filtering the full 1.5 GB data set took. Below we can verify that the filtered data only has values in the specified range:

In [None]:
temp = df.get('Temperature(F)')

console.log("New number of rows:", df.numRows)
console.log("New min temp:", temp.min())
console.log("New max temp:", temp.max())
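To make the mask semantics of the filter step concrete, here is a minimal sketch of the same logic on ordinary CPU arrays. It is plain JavaScript, not the cudf API, and the temperatures are made-up sample values rather than rows from the dataset:

```javascript
// Plain-JavaScript sketch of mask-based filtering (CPU arrays, not the cudf API).
// The temperatures below are made-up sample values.
const temps = [72.5, -89.0, 31.0, 170.6, 119.9, -30.0];

// Element-wise comparisons produce one boolean per row.
const ltMask = temps.map((t) => t < 120);
const gtMask = temps.map((t) => t > -30);

// A logical AND keeps a row only if both comparisons hold.
const validMask = ltMask.map((ok, i) => ok && gtMask[i]);

// Filtering keeps the rows whose mask entry is true.
const validTemps = temps.filter((_, i) => validMask[i]);

console.log(validMask);  // [ true, false, true, false, true, false ]
console.log(validTemps); // [ 72.5, 31, 119.9 ]
```

The masks are combined with AND rather than OR because every number is either below 120 or above -30, so an OR mask would be true for every row and filter nothing out.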

Another thing we might want to examine is the grouping of weather conditions. The original dataframe has very fine-grained weather conditions, e.g. "Fog" vs. "Shallow Fog", as seen below:

In [None]:
weather_groups = df.groupBy({by: "Weather_Condition"})
JSON.stringify(weather_groups.nth(0).get("Weather_Condition").toArrow().toArray())
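Grouping by a column and taking the first row of each group is a way of listing that column's distinct values. As a rough CPU-side illustration of the idea, here is a plain-JavaScript sketch (not the cudf API) on a handful of made-up sample values; the order of groups that cudf returns may differ from the insertion order used here:

```javascript
// Plain-JavaScript sketch: group rows by value, then take the first row of each group.
// The conditions below are made-up sample values.
const conditions = ["Fog", "Shallow Fog", "Fog", "Light Rain", "Shallow Fog", "Fog"];

// Map each distinct value to the list of row indices where it occurs.
const groups = new Map();
conditions.forEach((value, row) => {
  if (!groups.has(value)) groups.set(value, []);
  groups.get(value).push(row);
});

// The first row of every group gives one representative per distinct value.
const firstOfEachGroup = [...groups.values()].map((rows) => conditions[rows[0]]);

console.log(JSON.stringify(firstOfEachGroup)); // ["Fog","Shallow Fog","Light Rain"]
```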

Let's use cuDF's GPU regex functions to get some quick counts of more generic weather categories. The `Series.containsRe` method returns a boolean mask that is true wherever the series value matches the regex:

In [None]:
weather = df.get("Weather_Condition")

console.time("regex")
clouds_mask = weather.containsRe("Cloud|Overcast");
rain_mask = weather.containsRe("Rain|T-Storm|Thunderstorm|Squalls|Drizzle");
snow_mask = weather.containsRe("Snow")
fog_mask = weather.containsRe("Fog")
ice_mask = weather.containsRe("Ice|Hail|Freezing|Sleet")
particulate_mask = weather.containsRe("Dust|Smoke|Sand")
console.timeEnd("regex")

The categorization above is not necessarily exclusive, and categories may overlap, but we can see how many accidents had a category involved by summing each mask:

In [None]:
console.time("sum")
console.log("Accidents with clouds     :", clouds_mask.sum())
console.log("Accidents with rain       :", rain_mask.sum())
console.log("Accidents with snow       :", snow_mask.sum())
console.log("Accidents with fog        :", fog_mask.sum())
console.log("Accidents with particulate:", particulate_mask.sum())
console.log("Accidents with ice        :", ice_mask.sum())
console.timeEnd("sum")
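The following plain-JavaScript sketch mirrors the regex masks and counts on CPU arrays, to show why summing a boolean mask gives a count and how one row can fall into two categories. The `containsRe` and `sum` helpers here are hypothetical stand-ins written for this example, not the cudf methods, and the weather strings are illustrative sample values:

```javascript
// Plain-JavaScript sketch of regex masks and mask sums (not the cudf API).
// The weather strings are illustrative sample values.
const weather = [
  "Mostly Cloudy", "Light Rain", "Light Freezing Rain",
  "Fog", "Shallow Fog", "Light Snow", "Clear",
];

// Hypothetical stand-in for Series.containsRe: true where the pattern matches.
const containsRe = (values, pattern) => {
  const re = new RegExp(pattern); // no "g" flag, so test() keeps no state between calls
  return values.map((v) => re.test(v));
};

// Summing a boolean mask counts its true entries.
const sum = (mask) => mask.reduce((n, hit) => n + (hit ? 1 : 0), 0);

const rainMask = containsRe(weather, "Rain|T-Storm|Thunderstorm|Squalls|Drizzle");
const iceMask = containsRe(weather, "Ice|Hail|Freezing|Sleet");
const fogMask = containsRe(weather, "Fog");

console.log("rain:", sum(rainMask)); // rain: 2
console.log("ice :", sum(iceMask));  // ice : 1
console.log("fog :", sum(fogMask));  // fog : 2

// "Light Freezing Rain" matches both the rain and the ice pattern.
const both = rainMask.map((hit, i) => hit && iceMask[i]);
console.log("rain and ice:", sum(both)); // rain and ice: 1
```

Because of such overlaps, the per-category counts can add up to more than the number of rows.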

We might also want to filter by these subsets to see the average severity when each category is involved:

In [None]:
console.time("means")
console.log("Severity with clouds     :", df.filter(clouds_mask).get("Severity").mean())
console.log("Severity with rain       :", df.filter(rain_mask).get("Severity").mean())
console.log("Severity with snow       :", df.filter(snow_mask).get("Severity").mean())
console.log("Severity with fog        :", df.filter(fog_mask).get("Severity").mean())
console.log("Severity with particulate:", df.filter(particulate_mask).get("Severity").mean())
console.log("Severity with ice        :", df.filter(ice_mask).get("Severity").mean())
console.timeEnd("means")


Unsurprisingly, the most severe accidents were recorded in ice and snow conditions.
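Each of the averages above is a masked mean: filter the rows by a category mask, then average the `Severity` values that remain. A minimal plain-JavaScript sketch of that computation follows. It is not the cudf API, `maskedMean` is a hypothetical helper, and the severity values and mask are made up, so the numbers say nothing about the real dataset:

```javascript
// Plain-JavaScript sketch of a masked mean (not the cudf API).
// The severity values and the mask are made-up sample data.
const severity = [2, 3, 2, 4, 2, 3];
const iceMask = [false, true, false, true, false, false];

// Hypothetical helper: average only the values whose mask entry is true.
const maskedMean = (values, mask) => {
  const kept = values.filter((_, i) => mask[i]);
  if (kept.length === 0) return NaN; // no matching rows, so the mean is undefined
  return kept.reduce((a, b) => a + b, 0) / kept.length;
};

console.log("mean severity, ice rows:", maskedMean(severity, iceMask)); // 3.5
console.log("mean severity, all rows:", maskedMean(severity, severity.map(() => true)));
```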

Hopefully this has been a helpful introduction to cuDF in node-rapids! For more information, [see the documentation](https://rapidsai.github.io/node-rapids/).