# File Processing

This folder contains protein files from two different biosynthetic gene clusters. In this exercise we will again use wildcards to select some files and not others, but this time we will also learn more about pipes and how to save our results in a new file.

## Flow of Information

Before we start, let's think about what happens when we look at the content of a file with `cat` in the shell. The command opens the file (or files) and prints every line it finds to the terminal (the so-called standard output, or STDOUT).

When we add a pipe with `|`, the output of the previous command is no longer printed to the terminal but is used as the input of the next command. This is also why you need to specify an input file for the first command in a pipeline, but not for the ones that follow. For example:

```
# open file and return every line | count every line returned by the previous command
cat file.txt | wc -l
```

You can imagine this as a flow of information that starts at the first command, which reads the data source, and runs all the way to the last command in the pipeline, which prints the modified result.
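The flow described above can be sketched with a slightly longer pipeline. This is only a minimal illustration: the file name `pipe_demo.txt` and its content are made up for this example, and it assumes `grep` is available in your shell.

```shell
# Create a small example file (hypothetical name) with four lines
printf 'alpha\nbeta\ngamma\ndelta\n' > pipe_demo.txt

# Only the first command names the file; wc reads from the pipe
line_count=$(cat pipe_demo.txt | wc -l)
echo "lines in file: $line_count"

# Three stages: read the file | keep lines containing "t" | count them
match_count=$(cat pipe_demo.txt | grep "t" | wc -l)
echo "lines containing t: $match_count"
```

Each command only sees what the command before it passed along, so `wc -l` in the second pipeline counts the lines that `grep` kept, not the lines in the file.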



## Redirecting Information to a File

But what if you want to save the result of your pipeline in a new file? In this case, you can simply redirect the information flow from the standard output (i.e. printing to the terminal) to a file by using the `>` character.

```
# save the result to a file in your current directory
cat file.txt | wc -l > linecount.txt

# save the result to a file anywhere else
cat file.txt | wc -l > /path/to/results/linecount.txt
```

Watch out, though! If a file with the exact same name already exists at this location, it will be overwritten. For the same reason, you cannot write to the file that you are reading from (your data source): the shell empties the output file as soon as it sets up the redirection, usually before any data has been read. If you try to do this, you will be left with only an empty file and sadness in your heart.
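The pitfall above can be reproduced in a small, throwaway example. This is a sketch under assumptions: the file names are hypothetical, and `sort` and `mv` are used only as a convenient way to modify and rename a file.

```shell
# Create a hypothetical example file with three lines
printf 'one\ntwo\nthree\n' > overwrite_demo.txt

# Safe: write the result to a different file, then replace the original
sort overwrite_demo.txt > overwrite_demo.sorted.txt
mv overwrite_demo.sorted.txt overwrite_demo.txt
lines_after_safe=$(cat overwrite_demo.txt | wc -l)
echo "after safe rewrite: $lines_after_safe"

# Unsafe: the shell empties overwrite_demo.txt before sort reads it
sort overwrite_demo.txt > overwrite_demo.txt
lines_after_unsafe=$(cat overwrite_demo.txt | wc -l)
echo "after unsafe rewrite: $lines_after_unsafe"
```

Writing to a second file first keeps the data source intact until the result is complete, which is why the first variant keeps all three lines while the second leaves an empty file.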

## Tasks

1. How many sequences are there in each file? Document your commands below. This time, use **relative paths** to the files you want to access. (Your current directory is the one where this Jupyter notebook is located.)

2. Concatenate the sequences of each file into a single FASTA file.

3. What is the total length of the amino acid sequences for each file?