### The pipe command
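`RDD.pipe(cmd)` sends the elements of the RDD, one per line, to the standard input of the external command `cmd`, and returns a new RDD of strings with one element per line that the command writes to standard output.
The command can be any Unix command-line program, either pre-built or written by the programmer.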

The `pipe()` command lets a Spark program interface with any other program. This is particularly useful when we need to use legacy software, possibly written in `Matlab` or `Fortran`, that was not designed for parallel computation.
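To make the mechanics concrete, here is a minimal sketch in plain Python (no Spark) of what piping does to each partition: the elements are written as lines to the command's stdin, and the lines of its stdout become the new elements. This is an illustration under stated assumptions, not Spark's actual implementation; `pipe_partition`, `upper_cmd`, and the hand-made `partitions` list are hypothetical names introduced only for this example.

```python
import subprocess
import sys

def pipe_partition(elements, argv):
    """Write each element as one line to argv's stdin; return its stdout lines."""
    stdin_text = ''.join(str(e) + '\n' for e in elements)
    proc = subprocess.Popen(argv,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            universal_newlines=True)  # text mode on Python 2 and 3
    out, _ = proc.communicate(stdin_text)
    return out.splitlines()

# Hypothetical external command: a tiny filter that upper-cases each line.
upper_cmd = [sys.executable, '-c',
             'import sys\nfor line in sys.stdin:\n    print(line.strip().upper())']

# Two hand-made "partitions"; the command is started once for each of them.
partitions = [['a', 'b'], ['c']]
piped = [pipe_partition(p, upper_cmd) for p in partitions]
print(piped)
```

The point of the sketch is that the external program never needs to know about parallelism: it only sees a stream of lines on stdin, while the splitting of the data into partitions happens outside it.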

### Defining a simple script
We define a simple script that reads lines of text from stdin and writes the altered text to stdout.

In [35]:
%%writefile concat.py
#!/usr/bin/env python

# A simple Python program that reads lines from stdin, makes a simple alteration,
# and sends the result back to stdout
import sys

for line in sys.stdin:
    # Remove surrounding whitespace, including the trailing newline
    line = line.strip()
    print('This Is ' + line)

Overwriting concat.py


In [30]:
# The script file needs to be executable
!chmod a+x concat.py
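
The `!chmod` shell escape only works inside a notebook. As a hedged sketch, the same step can be done from plain Python with the standard `os` and `stat` modules. `make_executable` and `demo_path` are hypothetical names, and the example operates on a temporary file rather than on `concat.py` itself.

```python
import os
import stat
import tempfile

def make_executable(path):
    """Add execute permission for user, group and others (like chmod a+x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

# Demonstrate on a throwaway file so that nothing real is modified.
fd, demo_path = tempfile.mkstemp(suffix='.py')
os.close(fd)
make_executable(demo_path)

# Keep only the three execute bits of the resulting mode.
exec_bits = os.stat(demo_path).st_mode & 0o111
print(oct(exec_bits))
os.remove(demo_path)
```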

### Testing the python script

In [31]:
%%writefile data.txt
line
another line


Overwriting data.txt


In [32]:
!./concat.py < data.txt

This Is line
This Is another line


### Using the script inside pipe

In [37]:
A=sc.parallelize(range(10))
# './' runs the script from the current directory even when '.' is not on PATH
results=A.pipe('./concat.py')

print(A.collect())
print('\n'.join(results.collect()))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
This Is 0
This Is 1
This Is 2
This Is 3
This Is 4
This Is 5
This Is 6
This Is 7
This Is 8
This Is 9
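
Note that the elements of the piped RDD are strings, even though the input elements were integers. If typed values are needed downstream, each line has to be parsed back. The following is a minimal sketch of such a parsing step in plain Python; `parse_line` is a hypothetical helper, and `piped_lines` is a hand-built stand-in for `results.collect()` that copies the output shown above.

```python
def parse_line(line, prefix='This Is '):
    """Strip the prefix added by concat.py and convert the remainder to int."""
    if not line.startswith(prefix):
        raise ValueError('unexpected line: %r' % line)
    return int(line[len(prefix):])

# Stand-in for results.collect(): the ten lines printed above.
piped_lines = ['This Is %d' % i for i in range(10)]
numbers = [parse_line(s) for s in piped_lines]
print(numbers)
```

With a live Spark context, the same function could be applied to the piped RDD as `results.map(parse_line)`.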
