# Files 

<a href="docs/ch04/fizzbuzz.py" download>fizzbuzz.py</a>
<a href="docs/ch04/fizzbuzz.R" download>fizzbuzz.R</a>
<a href="docs/ch04/stream.py" download>stream.py</a>
<a href="docs/ch04/stream.R" download>stream.R</a>
<a href="docs/ch04/top-words-4.sh" download>top-words-4.sh</a>
<a href="docs/ch04/top-words-5.sh" download>top-words-5.sh</a>
<a href="docs/ch04/top-words" download>top-words</a>
<a href="docs/ch04/top-words.py" download>top-words.py</a>
<a href="docs/ch04/top-words.R" download>top-words.R</a>

<a href="docs/ch04/stopwords.R" download>stopwords.R</a>

# Converting one-liners into shell scripts

`curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | trim`

***
```
$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | trim
*** START OF THE PROJECT GUTENBERG EBOOK 11 ***
[Illustration]




Alice’s Adventures in Wonderland

by Lewis Carroll

… with 3373 more lines

```
***

Get the 10 most frequently used words in the book:

`curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | tr '[:upper:]' '[:lower:]' | grep -oE "[a-z\']{2,}" | sort | uniq -c | sort -nr | head -n 10`

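The **grep -oE "[a-z\']{2,}"** part of the command puts each word on its own line: `-o` prints only the matched text, one match per line, and `-E` enables extended regular expressions, so the pattern matches runs of two or more letters or apostrophes.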
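
To see the tokenizing step in isolation, without downloading the book, here is a minimal sketch. The `split_words` function name and the sample sentence are made up for illustration, and the pattern is written as `[a-z']` (without the backslash), which matches the same letters and apostrophes:

***
```shell
# Hypothetical helper: lowercase the input, then let grep -o print
# every match of the pattern on its own line.
split_words() {
  tr '[:upper:]' '[:lower:]' | grep -oE "[a-z']{2,}"
}

# Made-up sample sentence instead of the downloaded book.
printf "Alice's cat saw a RABBIT; it ran.\n" | split_words
```
***

This should print `alice's`, `cat`, `saw`, `rabbit`, `it`, and `ran`, one per line; the single-letter word `a` is skipped because the pattern requires at least two characters.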

***
```
$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | tr '[:upper:]' '[:lower:]' | grep -oE "[a-z\']{2,}" | sort | uniq -c | sort -nr | head -n 10
   1653 the
    874 and
    729 to
    595 it
    553 she
    517 of
    462 said
    411 you
    399 alice
    370 in

```
***

Nearly all of these are common words that merely connect sentences (stop words), so we filter them out with a list that contains them.

`curl -sL "https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt" | sort | tee stopwords | trim 20`

***
```
$ curl -sL "https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt" | sort | tee stopwords | trim 20
10
39
a
able
ableabout
about
above
abroad
abst
accordance
according
accordingly
across
act
actually
ad
added
adj
adopted
ae
… with 1278 more lines
```
***

With grep **we can filter out the common words before we start counting**.

`curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | tr '[:upper:]' '[:lower:]' | grep -oE "[a-z\']{2,}" | sort | grep -Fvwf stopwords | uniq -c | sort -nr | head -n 10`

Breaking down **grep -Fvwf stopwords**: `-f` reads the patterns from the *stopwords* file, one per line; `-F` interprets those patterns as fixed strings; `-w` selects only lines where the match forms a whole word; and `-v` inverts the selection, keeping the lines that do not match.
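
The effect of these four options can be checked on a tiny input. The sketch below assumes a hypothetical three-word stop list written to a temporary file (not the real *stopwords* file) and a made-up `filter_stopwords` function:

***
```shell
# Hypothetical three-word stop list stored in a temporary file.
stoplist="$(mktemp)"
printf '%s\n' the and it > "$stoplist"

# Same options as in the pipeline, pointed at the temporary list.
filter_stopwords() {
  grep -Fvwf "$stoplist"
}

# One word per line, as produced by the earlier grep -oE step.
printf '%s\n' the theory and rabbit it item | filter_stopwords
```
***

Only `theory`, `rabbit`, and `item` should remain: they contain `the` or `it` as substrings, but not as whole words, so `-w` keeps them from matching and `-v` lets them through.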

***
```
$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | tr '[:upper:]' '[:lower:]' | grep -oE "[a-z\']{2,}" | sort | grep -Fvwf stopwords | uniq -c | sort -nr | head -n 10
    399 alice
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
     53 rabbit
     50 head

```
***

Next, we convert the command into a Bash script.

## Step 1: Create the file

**top-words-1.sh**

***
```
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | 
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort  |
grep -Fvwf stopwords|
uniq -c |
sort -nr |
head -n 10

```
***

Now we test running the file:

***
```
$ bash top-words-1.sh 
    399 alice
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
     53 rabbit
     50 head

```
***

## Step 2: Give permission to execute

Copy **top-words-1.sh** to **top-words-2.sh** and give the owner of the copy execute permission:

`cp -v top-words-{1,2}.sh`
`chmod u+x top-words-2.sh`

***
```
$ l top-words-{1,2}.sh    
-rw-r--r-- 1 dst wheel 174 Jun 25 05:11 top-words-1.sh
-rwxr--r-- 1 dst wheel 174 Jun 25 05:11 top-words-2.sh*
```
***

Verify that it runs:

***
```
$ ./top-words-2.sh 
    399 alice
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
     53 rabbit
     50 head

```
***

## Step 3: Define a Shebang

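Copy the file, give it execute permission, and add `#!/usr/bin/env bash` as its first line so that the system knows which interpreter should run the script: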

`cp -v top-words-{2,3}.sh`
`chmod u+x top-words-3.sh`

**top-words-3.sh**

***
```
#!/usr/bin/env bash
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | 
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort  |
grep -Fvwf stopwords|
uniq -c |
sort -nr |
head -n 10

```
***

## Step 4: Remove the fixed input

Copy the file and give it execute permission:

`cp -v top-words-{3,4}.sh`
`chmod u+x top-words-4.sh`

Delete line 2 of the script **top-words-4.sh**, the line with the curl command, so that the script can be applied to more general input.

**top-words-4.sh**

***
```
#!/usr/bin/env bash
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort  |
grep -Fvwf stopwords|
uniq -c |
sort -nr |
head -n 10

```
***

Now we test it by sending a data stream to the script:

`curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | ./top-words-4.sh`

***
```
$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | ./top-words-4.sh 
    399 alice
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
     53 rabbit
     50 head
                  
```
***
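
Because the script no longer names its own source, it behaves as a filter: whatever arrives on standard input is processed. A minimal sketch of the same idea, using a hypothetical `shout` function (not part of the chapter's files) that only calls `tr`, shows it accepting input from a pipe and from a file redirection alike:

***
```shell
# Hypothetical filter: like top-words-4.sh, it reads standard input only.
shout() {
  tr '[:lower:]' '[:upper:]'
}

# Input arriving through a pipe.
printf 'from a pipe\n' | shout

# Input arriving through a redirection from a temporary file.
sample="$(mktemp)"
printf 'from a file\n' > "$sample"
shout < "$sample"
```
***

Both calls go through the same code path; the function never needs to know where its input came from.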

## Step 5: Add arguments

`cp -v top-words-{4,5}.sh`
`chmod u+x top-words-5.sh`

**top-words-5.sh**

***
```
#!/usr/bin/env bash

NUM_WORDS="${1:-10}"

tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort  |
grep -Fvwf stopwords|
uniq -c |
sort -nr |
head -n "$NUM_WORDS"
```
***

**NUM_WORDS="${1:-10}"**: if no value is given for the first argument, **10** is used as the default.
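
The `${1:-10}` expansion can be tried on its own. In this small sketch, `num_words` is a hypothetical function that exists only to show which value the expansion yields:

***
```shell
# Hypothetical helper that echoes what the script would store in NUM_WORDS.
num_words() {
  # Use the first argument when it is set and non-empty, otherwise 10.
  echo "${1:-10}"
}

num_words        # no argument: falls back to 10
num_words 5      # explicit argument: 5
num_words ""     # empty argument: the ":-" form also falls back to 10
```
***

The colon matters: `${1-10}` (without it) would keep an empty string that was passed explicitly, while `${1:-10}` replaces it with the default as well.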

Now let's test the script.

`curl -sL "https://www.gutenberg.org/files/11/11-0.txt" > alice.txt`
`< alice.txt ./top-words-5.sh`

***
```
$ < alice.txt ./top-words-5.sh
    399 alice
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
     53 rabbit
     50 head

```
***

## Step 6: Extend your path

***
```
$ echo $PATH
/usr/local/lib/R/site-library/rush/exec:/usr/bin/dsutils:/home/dst/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
```
***

***
```
$ echo $PATH | tr ':' '\n'
/usr/local/lib/R/site-library/rush/exec
/usr/bin/dsutils
/home/dst/.local/bin
/usr/local/sbin
/usr/local/bin
/usr/sbin
/usr/bin
/sbin
/bin

```
***

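Copy the file to give the command its final name, without the extension:

`cp -v top-words{-5.sh,}`
`chmod u+x top-words`

Then append the directory that contains the script (here */data/ch04*) to the search path:

`export PATH="${PATH}:/data/ch04"`

Test it with the following command: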

`curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | top-words 10`

***
```
$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | top-words 10
    399 alice
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
     53 rabbit
     50 head

```
***
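
The PATH change in this step can be tried safely in a throwaway directory. The sketch below assumes a hypothetical tool called `hello-tool` created in a temporary directory; neither the name nor the directory comes from the chapter:

***
```shell
# Hypothetical tool placed in a throwaway directory.
tooldir="$(mktemp -d)"
printf '#!/usr/bin/env bash\necho "hello from hello-tool"\n' > "$tooldir/hello-tool"
chmod u+x "$tooldir/hello-tool"

# Appending the directory to PATH makes the tool callable by name.
export PATH="${PATH}:${tooldir}"

command -v hello-tool   # prints the full path the shell resolved
hello-tool              # runs the script without a ./ prefix
```
***

An `export` like this only lasts for the current shell session; to make it permanent, the same line would have to go into a startup file such as *~/.bashrc*.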

# Creating command-line tools with Python and R

## Porting the shell script

**top-words.py**

***
```
#!/usr/bin/env python
import re
import sys

from collections import Counter
from urllib.request import urlopen

def top_words(text, n):
    with urlopen("https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt") as f:
        stopwords = f.read().decode("utf-8").split("\n")

    words = re.findall("[a-z']{2,}", text.lower())
    words = (w for w in words if w not in stopwords)

    for word, count in Counter(words).most_common(n):
        print(f"{count:>7} {word}")


if __name__ == "__main__":
    text = sys.stdin.read()

    try:
        n = int(sys.argv[1])
    except (IndexError, ValueError):
        # No argument given, or it is not an integer: fall back to 10 words.
        n = 10

    top_words(text, n)

```
***

**The NLTK package is recommended** for processing text.

Now in R:

**top-words.R**

***
```
#!/usr/bin/env Rscript
n <- as.integer(commandArgs(trailingOnly = TRUE))
if (length(n) == 0) n <- 10

f_stopwords <- url("https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt")
stopwords <- readLines(f_stopwords, warn = FALSE)
close(f_stopwords)

f_text <- file("stdin")
lines <- tolower(readLines(f_text))

words <- unlist(regmatches(lines, gregexpr("[a-z']{2,}", lines)))
words <- words[is.na(match(words, stopwords))]

counts <- sort(table(words), decreasing = TRUE)
cat(sprintf("%7d %s\n", counts[1:n], names(counts[1:n])), sep = "")
close(f_text)

```
***

Now let's check that all three files produce the same output:

***
```
$ time < alice.txt ./top-words-5.sh 5
    399 alice
     76 queen
     71 time
     63 king
     60 turtle
./top-words-5.sh 5 < alice.txt  0.05s user 0.01s system 130% cpu 0.050 total

```
***

***
```
$ time < alice.txt ./top-words.py 5
    399 alice
     76 queen
     71 time
     63 king
     60 turtle
./top-words.py 5 < alice.txt  0.32s user 0.02s system 57% cpu 0.584 total

```
***

***
```
$ time < alice.txt ./top-words.R 5 
    399 alice
     76 queen
     71 time
     63 king
     60 turtle
./top-words.R 5 < alice.txt  0.27s user 0.11s system 51% cpu 0.735 total

```
***

## Processing streaming data from standard input

The Fizz Buzz problem is defined as follows: print every number from 1 to 100, but if the number is divisible by 3, print "fizz" instead of the number; if it is divisible by 5, print "buzz"; and if it is divisible by 15, print "fizzbuzz".
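
The rules can be written directly as a chain of divisibility checks, with the 15 case tested first so that it is not shadowed by the checks for 3 and 5. The sketch below is a plain shell illustration with made-up function names (`fizz_buzz`, `fizz_buzz_stream`); it also reads its numbers line by line from standard input. The scripts that follow use a 15-entry lookup table instead of branching.

***
```shell
# Hypothetical helper: apply the Fizz Buzz rules to a single number.
fizz_buzz() {
  if   [ $(( $1 % 15 )) -eq 0 ]; then echo "fizzbuzz"
  elif [ $(( $1 % 3 )) -eq 0 ];  then echo "fizz"
  elif [ $(( $1 % 5 )) -eq 0 ];  then echo "buzz"
  else echo "$1"
  fi
}

# Hypothetical helper: process numbers as they arrive, one line at a time.
fizz_buzz_stream() {
  while IFS= read -r n; do
    fizz_buzz "$n"
  done
}

printf '%s\n' 13 14 15 | fizz_buzz_stream
```
***

For the three sample numbers this should print `13`, `14`, and `fizzbuzz`.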

**fizzbuzz.py**

***
```
#!/usr/bin/env python
import sys

CYCLE_OF_15 = ["fizzbuzz", None, None, "fizz", None,
               "buzz", "fizz", None, None, "fizz",
               "buzz", None, "fizz", None, None]

def fizz_buzz(n: int) -> str:
    return CYCLE_OF_15[n % 15] or str(n)

if __name__ == "__main__":
    try:
        while (n:= sys.stdin.readline()):
            print(fizz_buzz(int(n)))
    except:
        pass

```
***

**fizzbuzz.R**

***
```
#!/usr/bin/env Rscript
cycle_of_15 <- c("fizzbuzz", NA, NA, "fizz", NA,
                 "buzz", "fizz", NA, NA, "fizz",
                 "buzz", NA, "fizz", NA, NA)

fizz_buzz <- function(n) {
  word <- cycle_of_15[as.integer(n) %% 15 + 1]
  ifelse(is.na(word), n, word)
}

f <- file("stdin")
open(f)
while(length(n <- readLines(f, n = 1)) > 0) {
  write(fizz_buzz(n), stdout())
}
close(f)

```
***

Now let's test them:

***
```
$ seq 30 | ./fizzbuzz.py | column -x
1		2		fizz		4		buzz
fizz	7		8		fizz		buzz
11		fizz		13		14		fizzbuzz
16		17		fizz		19		buzz
fizz	22		23		fizz		buzz
26		fizz		28		29		fizzbuzz
```
***

***
```
$ seq 30 | ./fizzbuzz.R | column -x
1		2		fizz		4		buzz
fizz	7		8		fizz		buzz
11		fizz		13		14		fizzbuzz
16		17		fizz		19		buzz
fizz	22		23		fizz		buzz
26		fizz		28		29		fizzbuzz

```
***


