$ ni <DOSFILE> cleandos
TODO: write real docs
p'url_encode'
p'url_decode'
This will replace the CR/LFs in DOS with just LFs.
Many other languages use square brackets to represent literal arrays; in Perl, these are used for array references:
$ ni 1p'my @arr = [1, 2, 3]; r @arr'
ARRAY(0x7fa7e4184818)
This code has built a length-1 array containing an array reference; if you really wanted to create an array reference, you'd more likely do it explicitly.
$ ni 1p'my $arr_ref = [1, 2, 3]; r $arr_ref'
ARRAY(0x7fa7e4184818)
To dereference the reference, use the appropriate sigil:
$ ni 1p'my $arr_ref = [1, 2, 3]; r @$arr_ref'
1 2 3
Back to the first example, to dereference the array reference we've (probably unintentionally) wrapped in an array, do:
$ ni 1p'my @arr = [1, 2, 3]; r @{$arr[0]}'
1 2 3
You can implement a custom sort by passing a block to sort
.
Let's say you wanted to sort by the length of the string, rather than the order:
$ ni i[romeo juliet rosencrantz guildenstern hero leander] \
p'my @arr = F_; r sort {length $a <=> length $b } F_'
hero romeo juliet leander rosencrantz guildenstern
To reverse the sort, reverse $b
and $a
.
$ ni i[romeo juliet rosencrantz guildenstern hero leander] \
p'my @arr = F_; r sort {length $b <=> length $a } F_'
guildenstern rosencrantz leander juliet romeo hero
More details are available in the perldocs.
ni 1p'my %h1 = ("foo" => "u", "bar" => undef, "baz" => "drank", "horse" => "", "lst" => ["car", "cadr"], "hash" => {"secret" => "key", "lst2" => ["cubs"]}); my %h2 = ("bar" => "x", "foo" => "j", "horse" => "tiger", "lst" => ["caddr"], "hash" => {"worth" => "it", "lst2" => [34]}); %h3 = ("lst" => ["cadddr", "bye"], "hash" => {"word" => "up", "lst2" => ["verj"]}); %h = accumulate(\%h1, \%h2, \%h3); dump_data \%h'
ni
's hacky Data::Dumper
because we're too hardcore for SILLY LIBRARIES
Convert lists to hashes of their frequencies. This is useful!!
my $h = {"foo" => {"bar" => [u,u,u,u,v,baz,baz], "qux" => [ay, ay, bee]}};
my @keys = (foo, [bar, qux]); freqify $h, \@keys;
$h => {"foo" => {"bar" => {"u" => 4, "v" => 1, "baz" => 2},
"qux" => {"ay" => 2, "bee" => 1}}}
ni ::test[i[a,b,c,d hi] i[c1,c2,c3 foo] i[u,v,w yo]] ::test2[i[ba,bb,bc this] i[ux,uy,uc that]] 1p'^{@pair = abc test; @pair2 = abc test2} @ks = ("ux", "bb", "c1"); $j ="bb"; @ls = ihash_def(\@ks, 1, @pair, @pair2); @ls'
- Generate a list of things you want to filter, and put it in a data closure.
::ids[list_of_ids]
- Convert the data closure to a hash using a begin block (
^{%id_hash = ab_ ids}
) - Filter another dataset (
ids_and_data
) using the hash (exists($id_hash{a()})
) $ ni ::ids[list_of_ids] ids_and_data rp'^{%id_hash = ab_ ids} exists($id_hash{a()})'
WARNING: ni
will let you name your files the same name as ni
commands. ni
will look for filenames before it uses its own operators. Therefore, be careful not to name your files anything too obviously terrible. For example:
$ ni n5 \>n1.3E5
n1.3E5
$ ni n1.3E5
1
2
3
4
5
because ni
is reading from the file named n10
.
At this point, you've probably executed a long-running enough ni
job to see the ni
monitor appear at the top of your screen.
- Negative numbers = non-rate-determining step
- Positive numbers = rate-determining step
- Large Positive numbers = slow step
Some things to keep in mind when examining the output of the monitor: if you have access to more compute resources for slow steps, you can use them for that step alone, and this can be done in a very simple way via the S
horizontal scaling operator.
You may also want to consider refactoring your job to make use of Hadoop Streaming with HS
, depending on what's suited to your job. More details are available in the monitor docs and in the optimization docs.
map
can also take an expression rather than a block, but this syntax is tricky
The following example is taken from Perl Monks
You'd probably be even more surprised by the result of this:
my @in = qw(aaa bbb ccc);
my @out = map { s/.$/x/ } @in;
print "@in\n=>\n@out\n";
[download]
which prints:
aax bbx ccx
=>
1 1 1
[download]
Yeah, the original is modified. Basically: never modify $_ in a map, and remember that the "output" of s/// is the success count/code. To get what you want, you need to localize $_ and reuse it as the last expression evaluated in the block:
my @in = qw(aaa bbb ccc);
my @out = map { local
When use strict
is enabled, Perl will complain when you try to create a variable in a Perl snippet that does not start with ::
.
The reasons for this are very specific to Perl; if you are a true Perl nerd, you can look them up, but you do not need to know them if you just accept that variables need to start with ::
when you apply use strict
.
It is probably a good idea to use strict
when the variables you define are sufficiently complex; otherwise you're probably okay not using it.
Perl is a multi-paradigm language, but the dominant paradigm for Perl within ni
is procedural programming. Unlike the object-oriented languages with which you are likely familiar, where methods are objects (often "first-class" objects, whatever that means) procedural languages focus on their functions, called "subroutines."
Perl subroutines store the variables with which they are called in a default variable named @_
. Before continuing, take a moment to think about how one would refer the elements in @_
.
Subroutines are stored and referenced with the following syntax:
*x = sub {"hi"}
$v = &x # $v will be set to "hi"
Here's how you'd write a function that copies a stream of values 4 times in ni
, using Perl.
$ ni n3p'*v = sub {$_[0] x 4}; &v(a)'
1111
2222
3333
To review the syntax, the name of the variable is _
, and within the body of the subroutine, the array associated with that name, @_
is the array of values that are passed to the function. To get any particular scalar value within @_
, you tell Perl you want a scalar ($
) from the variable with name _
, and then indicate to Perl that of the variables named _
, you want to reference the array, by using the postfix [0]
.
Subroutines can also be called without the preceding &
, or created using the following syntax:
$ ni 1p'sub yo {"hi " . $_[0]} yo a'
hi 1
Note that in these examples, a new function will be defined for every line in the input, which is highly inefficient. In the next section, we will introduce begin blocks, which allow functions to be defined once and then used over all of the lines in a Perl mapper block.
In earlier drafts of ni
by example, streaming reduce had a much greater role; however, with time I've found that most operations that are done with streaming reduce are more simply performed using buffered readahead; moreover, buffered readahead. Streaming reduce is still useful and very much worth knowing as an alternative, highly flexible, constant-space set of reduce methods in Perl.
Reduction is dependent on the stream being appropriately sorted, which can make the combined act of sort and reduce expensive to execute on a single machine. Operations like these are (unsurprisingly) good options for using in the combiner or reducer steps of HS
operations.
These operations encapsulate the most common types of reduce operations that you would want to do on a dataset; if your operation is more complicated, it may be more effectively performed using buffered readahead and multi-line reducers.
Let's look at a simple example, summing the integers from 1 to 100,000:
$ ni n1E5p'sr {$_[0] + a} 0'
5000050000
Outside of Perl's trickiness,that the syntax is relatively simple; sr
takes an anonymous function wrapped in curly braces, and one or more initial values. A more general syntax is the following:
@final_state = sr {reducer} @init_state
To return both the sum of all the integers as well as their product, as well as them all concatenated as a string, we could write the following:
$ ni n1E1p'r sr {$_[0] + a, $_[1] * a, $_[2] . a} 0, 1, ""'
55 3628800 12345678910
A few useful details to note here: The array @init_state
is read automatically from the comma-separated list of arguments; it does not need to be wrapped in parentheses like one would use to explicitly declare an array in Perl. The results of this streaming reduce come out on separate lines, which is how p'...'
returns from array-valued functions
sr
is useful, but limited in that it will always reduce the entire stream. Often, it is more useful to reduce over a specific set of criteria.
Let's say we want to sum all of the 1-digit numbers, all of the 2-digit numbers, all of the 3-digit numbers, etc. The first thing to check is that our input is sorted, and we're in luck, because when we use the n
operator to generate numbers for us, they'll come out sorted numerically, which implies they will be sorted by their number of digits.
What we'd like to do is, each time a number streams in, to check if it number of digits is equal to the number of digits for the previous number we saw in the stream. If it does, we'll add it to a running total, otherwise, we emit the running total to the output stream and start a new running total.
In ni
-speak, the reduce-while-equal operator looks like this:
@final_state = se {reducer} \&partition_fn, @init_state
se
differs from sr
only in the addition of the somewhat cryptic \&partition_fn
. This is an anonymous function, which is can be expressed pretty much a block with the perl keyword sub
in front of it. For our example, we'll use sub {length}
, which uses the implicit default variable $_
, as our partition function.
NOTE: The comma after partition_fn
is important.
$ ni n1000p'se {$_[0] + a} sub {length}, 0'
45
4905
494550
1000
That's pretty good, but it's often useful to know what the value of the partition function was for each partition (this is useful, for example, to catch errors where the input was not sorted). This allows us to see how se
interacts with row-based operators.
$ ni n1000p'r length a, se {$_[0] + a} sub {length}, 0'
1 45
2 4905
3 494550
4 1000
One might worry that the row operators will continue to fire for each line of the input while the streaming reduce operator is running. In fact, the streaming reduce will eat all of the lines, and the first line will only be used once.
If your partition function is sufficiently complicated, you may want to write it once and reference it in multiple places.
$ ni n1000p'sub len($) {length $_}; r len a, se {$_[0] + a} \&len, 0'
1 45
2 4905
3 494550
4 1000
The code above uses a Perl function signature sub len($)
tells Perl that the function has one argument. The signature for a function with two arguments would be written as sub foo($$)
. This allows the function to be called as len a
rather than the more explicit len(a)
. The above code will work with or without the begin block syntax.
It's very common that you will want to reduce over one of the columns in the data. For example, we could equivalently use the following spell to perform the same sum-over-number-of digits as we did above.
$ ni n1000p'r a, length a' p'r b, se {$_[0] + a} \&b, 0'
1 45
2 4905
3 494550
4 1000
ni
offers a shorthand for reducing over a particular column and everything to
its left:
$ ni n1000p'r length a, a' p'r a, sea {$_[0] + b} 0'
1 45
2 4905
3 494550
4 1000
Consider the following reduction operation, which computes the sum, mean, min and max.
$ ni n100p'my ($sum, $n, $min, $max) = sr {$_[0] + a, $_[1] + 1,
min($_[2], a), max($_[3], a)}
0, 0, a, a;
r $sum, $sum / $n, $min, $max'
5050 50.5 1 100
For common operations like these, ni
offers a shorthand:
$ ni n100p'r rc \&sr, rsum "a", rmean "a", rmin "a", rmax "a"'
5050 50.5 1 100
If you screw up your partitioning, you're ruined.
JSON => Raw TSV => Cleaned TSV => Filtered TSV => Joined TSV => Statistics
If there's a potential problem with a column, you cannot use it as the first column in a reduce. If you do, Hadoop is very unforgiving; you'll end up having to bail on a particular slice of your data, or restart from the last nearly-equally-partitioned step.
Let's say you have data with 3 columns: user_id
, geohash
, timestamp
.
It's likely that both some of the users and some of the geohashes are spurious. This will cause us difficulty down the line, for example when we want to join data to each record using geohash as the joining key.
Developing and deploying to producion ni
pipelines that involve multiple Hadoop streaming steps involves a risk of failure outside the programmer's control; when this happens, we don't want to have to run the entire job again.
Checkpoints allow you to save the results of difficult jobs while continuing to develop in ni
.
Checkpoints and files share many commonalities. The key difference between a checkpoint and a file are:
- A checkpoint will wait for an
EOF
before writing to the specified checkpoint name, so the data in a checkpoint file will consist of all the output of the operator it encloses. A file, on the other hand, will accept all output that is streamed in, with no guarantee that it is complete. - If you run a
ni
pipeline with checkpoints, and there are files that correspond to the checkpoint names in the directory from whichni
is being run, thenni
will use the data in the checkpoints in place of running the job. This allows you to save the work of long and expensive jobs, and to run these jobs multiple times without worrying that bhe steps earlier in the pipeline that have succeeded will have to be re-run.
An example of checkpoint use is the following:
$ rm -f numbers # prevent ni from reusing any existing file
$ ni nE6gr4 :numbers
1
10
100
1000
This computation will take a while, because g
requires buffering to disk; However, this result has been checkpointed, thus more instuctions can be added to the command without requiring a complete re-run.
$ ni nE6gr4 :numbers O
1000
100
10
1
Compare this to the same code writing to a file:
$ ni nE6gr4 =\>numbers
1000
100
10
1
Because the sort is not checkpointed, adding to the command is not performant, and everything must be recomputed.
$ ni nE6gr4 =\>numbers O
1000
100
10
1
One caveat with checkpoint files is that they are persisted, so these files must be cleared between separate runs of ni
pipelines to avoid collisions. Luckily, if you're using checkpoints to do your data science, errors like these will come out fast and should be obvious.
As of now, ni
auto-generates the names for the Hadoop directories, and these can be hard to remember off the top of your head. If you want to
- A good markdown editor; I like Laverna, and it should work on basically all platforms.
- Infinite patience
- A reasonable test set.
Check out the tutorial for some examples of cool, interactive ni
plotting.
TODO: Say something useful.
In this section, we'll cover two useful operations
We recognize the first and last operators instantly; the middle two operators are new.
The =
operator can be used to divert the stream Running the spell puts us back in less
:
$ ni n10 =[\>ten.txt] z\>ten.gz
ten.gz
The last statement is another \>
, which, as we saw above, writes to a file and emits the file name. That checks out with the output above.
To examine the contents
Let's take a look at this with --explain
:
# note: 'z' will explain as ["sh","pigz"] if you have pigz installed
# (pigz = parallel gzip); ni will autodetect and use faster variants when
# present
$ ni --explain n10 =[\>ten.txt] z\>ten.gz
["n",1,11]
["divert",["file_write","ten.txt"]]
["sh","gzip"]
["file_write","ten.gz"]
Looking at the output of ni --explain
:
...
["divert",["file_write","ten.txt"]]
...
We see that, after "divert"
, the output of ni --explain
of the operator within brackets is shown as a list.
=
is formed of two parallel lines, this may be a useful mnemonic to remember how this operator functions; it takes the input stream and duplicates it. One copy is diverted to the operators within the brackets, while the other copy continues to the other operations in the spell.
One of the most obvious uses of the =
operator is to sink data to a file midway through the stream while allowing the stream to continue to flow.
For simple operations, the square brackets are unnecessary; we could have equivalently written:
ni n10 =\>ten.txt z\>ten.gz
This more aesthetically-pleasing statement is the preferred ni
style. The lack of whitespace between =
and the file write is critical.
TODO: Understand this
Operations on huge matrices are not entirely ni
ic, since they may require space greater than memory, whichwill make them slow. However, operators are provided to improve These operations are suited best to
X<col>
,Y<col>
,N<col>
: Matrix Partitioning- TODO: understand how these actually work.
X
: sparse-to-dense transformation- In the case that there are collisions for locations
X
,X
will sum the values - For example:
ni n010p'r 0, a%3, 1' X
- In the case that there are collisions for locations
@:[disk_backed_data_closure]
In theory, this can save you a lot of space. But I haven't used this in practice.
clip
within
TODO Understand how this works
- Image
- Binary
- Bytestream
- gnuplot
ni
and Perl go well together philosophically. Both have deeply flawed lexicons and both are highly efficient in terms of developer time and processing time. ni
and Perl scripts are both difficult to read to the uninitiated. They demand a much higher baseline level of expertise to understand what's going on than, for example, a Python script.
In addition to Perl, ni
offers direct interfaces to Python, Ruby, and Lisp. While all three of the languages are useful and actively maintained, ni
is written in Perl, and it is by far the most useful of the three. If you haven't learned Perl yet, but you're familiar with another scripting language, like Python or Ruby, I found this course on Udemy useful for learning Perl's odd syntax.
We'll start with the following ni
spell.
$ ni /usr/share/dict/words rx40 r10 p'r substr(a, 0, 3), substr(a, 3, 3), substr(a, 6)'
The output here will depend on the contents of your /usr/share/dict/words/
, but you should have 3 columns; the first 3 letters of each word, the second 3 letters of each word, and any remaining letters. On my machine it looks like this:
aba iss ed
aba sta rdize
abb rev iature
abd uct ion
abe tme nt
abi gai l
abj udg e
abl est
abo lis h
abo rti vely
Notice that ni
has produced tab-delimited columns for us; these will be useful for the powerful column operators we will introduce in this section and the next.
$ ni --explain /usr/share/dict/words rx40 r10 p'r substr(a, 0, 3), substr(a, 3, 3), substr(a, 6)'
["cat","/usr/share/dict/words"]
["row_every",40]
["head","-n",10]
["perl_mapper","r substr(a, 0, 3), substr(a, 3, 3), substr(a, 6)"]
We have reviewed every operator previously except the last.