## Simple ETL / Exploration with node-rapids

This notebook demonstrates how basic APIs from `node-rapids` ([GitHub](https://github.com/rapidsai/node-rapids), [docs](https://rapidsai.github.io/node-rapids/)) can be used to load and process data on the GPU from Node.js.

First, we load the cudf module from `node-rapids`:

In [1]:
cudf = require("@rapidsai/cudf");

{
  addon: [Getter],
  Column: [Getter],
  DataFrame: [Getter],
  GroupByMultiple: [Getter],
  GroupBySingle: [Getter],
  AbstractSeries: [Getter],
  Series: [Getter],
  Bool8Series: [Getter],
  Float32Series: [Getter],
  Float64Series: [Getter],
  Int8Series: [Getter],
  Int16Series: [Getter],
  Int32Series: [Getter],
  Uint8Series: [Getter],
  Uint16Series: [Getter],
  Uint32Series: [Getter],
  Int64Series: [Getter],
  Uint64Series: [Getter],
  StringSeries: [Getter],
  ListSeries: [Getter],
  StructSeries: [Getter],
  Table: [Getter],
  NullOrder: [Getter],
  DuplicateKeepOption: [Getter],
  Int8: [Getter],
  Int16: [Getter],
  Int32: [Getter],
  Int64: [Getter],
  Uint8: [Getter],
  Uint16: [Getter],
  Uint32: [Getter],
  Uint64: [Getter],
  Float32: [Getter],
  Float64: [Getter],
  Bool8: [Getter],
  Utf8String: [Getter],
  List: [Getter],
  Struct: [Getter],
  TimestampDay: [Getter],
  TimestampSecond: [Getter],
  TimestampMillisecond: [Getter],
  TimestampMicrosecond: [Getter],


We are going to look at some data from Wikipedia. The data is broken up into ten files. Let's load one of them:

In [2]:
console.time("readCSV")
df = cudf.DataFrame.readCSV({header: 0, sourceType: 'files', sources: ["data/page_titles_en_0.csv"]})
console.timeEnd("readCSV")

readCSV: 1.341s
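
The `console.time` / `console.timeEnd` pair used above can be wrapped in a small helper so that any step in the notebook can be timed the same way. This is a minimal sketch; `timed` is a hypothetical helper name, not part of `node-rapids`.

```javascript
// Hypothetical helper: run `fn`, print how long it took under `label`,
// and return whatever `fn` returned. Uses only Node's built-in console timers.
function timed(label, fn) {
  console.time(label);
  try {
    return fn();
  } finally {
    // Runs even if fn throws, so the timer is always closed.
    console.timeEnd(label);
  }
}
```

With this in place, the load above could be written as `df = timed("readCSV", () => cudf.DataFrame.readCSV({...}))` with the same options object.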


Now that we have loaded the CSV into a GPU DataFrame `df`, we can look at some basic information, such as the number of rows and columns:

In [3]:
console.log("Number of rows:", df.numRows)
console.log("Number of cols:", df.numColumns)
console.log("Columns:", df.names)

Number of rows: 1593959
Number of cols: 5
Columns: [ 'id', 'revid', 'url', 'title', 'text' ]
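
The three properties read above can be gathered into a single summary. The sketch below assumes only that the object exposes `numRows`, `numColumns`, and `names` as shown; `describeFrame` and `formatShape` are hypothetical helper names.

```javascript
// Hypothetical helper: collect the basic shape of a DataFrame-like object.
// It only reads the `numRows`, `numColumns`, and `names` properties used above.
function describeFrame(df) {
  return {
    rows: df.numRows,
    cols: df.numColumns,
    columns: [...df.names],
  };
}

// Format the summary as a single line for logging.
function formatShape(df) {
  const { rows, cols, columns } = describeFrame(df);
  return `${rows} rows x ${cols} cols (${columns.join(", ")})`;
}
```

Calling `console.log(formatShape(df))` after each transformation is a cheap way to confirm that a step changed the table as intended.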


This data set may have columns we don't really care about. We can pare things down using the `DataFrame.drop` method:

In [4]:
df = df.drop(['revid'])

DataFrame {
  _accessor: ColumnAccessor {
    _data: {
      id: Column {},
      url: Column {},
      title: Column {},
      text: Column {}
    },
    _types: undefined,
    _names: [ 'id', 'url', 'title', 'text' ],
    _columns: undefined,
    _labels_to_indices: Map(4) { 'id' => 0, 'url' => 1, 'title' => 2, 'text' => 3 }
  }
}
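
When the same cleanup is applied to several files, it can help to drop only the columns that are actually present. The following is a sketch under the assumption that `df.names` and `df.drop([...])` behave as in the cells above; `dropIfPresent` is a hypothetical helper, and how `drop` itself treats unknown names is not covered here.

```javascript
// Hypothetical helper: drop only the columns that actually exist in `df`.
// Relies on `df.names` and `df.drop(names)` as used in the cells above.
function dropIfPresent(df, unwanted) {
  const present = unwanted.filter((name) => df.names.includes(name));
  // Return the original frame untouched when there is nothing to drop.
  return present.length > 0 ? df.drop(present) : df;
}
```

For example, `df = dropIfPresent(df, ['revid'])` would give the same result as the cell above, and would leave `df` unchanged if it were run a second time.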

We can also get a quick preview of the table using `toString` (similar to the `.head()` method in Pandas or cuDF):

In [5]:
console.log(df.toString())

      id                                          url                                            title text
 6140642  https://en.wikipedia.org/wiki?curid=6140642                        Scorpion (roller coaster)  ...
 6140647  https://en.wikipedia.org/wiki?curid=6140647                                       Metamutant null
 6140648  https://en.wikipedia.org/wiki?curid=6140648                           Standardisation policy null
 6140651  https://en.wikipedia.org/wiki?curid=6140651                        Baron Grey of Chillingham null
 6140652  https://en.wikipedia.org/wiki?curid=6140652                    General Hospital (Blackadder)  ...
 6140657  https://en.wikipedia.org/wiki?curid=6140657                                         Tokarahi  ...
 6140662  https://en.wikipedia.org/wiki?curid=6140662                                  Angel (musical)  ...
 6140664  https://en.wikipedia.org/wiki?curid=6140664               Berlin Township, Erie County, Ohio  ...
 6140673  https://en.wikiped

We can use basic column methods to quickly ask questions like: What is the longest title?

In [6]:
df.get('title').len().max()

228

In [7]:
title = df.get('title')

console.log([...title.filter(title.len().eq(228))])


[
  'Agreement for the Implementation of the Provisions of the United Nations Convention on the Law of the Sea of 10 December 1982 relating to the Conservation and Management of Straddling Fish Stocks and Highly Migratory Fish Stocks'
]
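
The two cells above can be combined so that the maximum length does not need to be copied by hand into the filter. This sketch uses only the Series calls already shown (`len()`, `max()`, `eq()`, `filter()`, and spreading the result); `longestValues` is a hypothetical helper name.

```javascript
// Hypothetical helper: return every value in a string Series whose length
// equals the maximum length, along with that maximum.
function longestValues(series) {
  const lengths = series.len();   // per-row string lengths
  const maxLen = lengths.max();   // largest length in the column
  // Boolean mask of rows at the maximum length, applied to the original Series.
  const values = [...series.filter(lengths.eq(maxLen))];
  return { maxLen, values };
}
```

Calling `longestValues(df.get('title'))` would be expected to return the length and title printed above, and it also covers ties, where more than one title shares the maximum length.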


Similarly, what are the shortest and longest article lengths?

In [25]:
text = df.get('text')

console.log("Min text:", text.len().min())
console.log("Max text:", text.len().max())

Min text: 1
Max text: 236433


We might be interested in filtering by these subsets to see the average severity when each category is involved: