# Working with big files and filtering

This folder contains the complete sequence of the mitochondrial genome of *Homo sapiens* in [GenBank format](https://www.ncbi.nlm.nih.gov/genbank/). In contrast to a FASTA sequence record, a GenBank file contains a great deal of additional information, not only about the sequence itself but also about where the data comes from.

Go ahead and take a look at the `mt-human.gb` file with the `cat` command in your local terminal. Do you find the additional information mentioned above?

In this exercise, we will learn how to handle bigger files like this and how to extract information from them.

## Start small

As you might have realized, looking at larger files as a whole becomes impractical. Oftentimes, you can get a feel for the structure of a file and what data it contains by looking only at its first couple of lines. You can do this with the `head` command:

```
# print first 10 lines of a file
head file.txt

# print first 50 lines of a file
head -n50 file.txt
```

If you are interested in the last lines of a file, the `tail` command is what you are looking for.
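
As a small sketch of how `tail` mirrors `head`, the block below builds a throwaway 20-line file and prints its end. The file name `numbers.txt` and the temporary directory are made up for this illustration. The last command chains `head` and `tail` to show a slice from the middle of the file.

```bash
# create a throwaway 20-line file in a temporary directory (hypothetical name)
tmpdir=$(mktemp -d)
seq 1 20 > "$tmpdir/numbers.txt"

# print the last 10 lines of the file (the default, just like head)
tail "$tmpdir/numbers.txt"

# print only the last 3 lines of the file
last_three=$(tail -n 3 "$tmpdir/numbers.txt")
echo "$last_three"

# print lines 11 to 15: take the first 15 lines, then the last 5 of those
middle=$(head -n 15 "$tmpdir/numbers.txt" | tail -n 5)
echo "$middle"

# remove the temporary directory again
rm -r "$tmpdir"
```

Chaining the two commands this way is handy when the interesting part of a file is neither at the very top nor at the very bottom.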

In the GenBank example, we are looking at the top of the file because it contains a special type of information. But you might want to use `head` and `tail` for another reason: speed! Printing a large file line by line in the terminal takes time, so trying out your command on the first few lines before applying it to the whole file is often much faster.

(**Tip**: if you end up printing a large file to the terminal anyway, you can stop the process by pressing Ctrl+C on your keyboard.)

## Filtering rows

Remember how we used filename patterns to select only certain files in exercises 1 and 2? We can use the same logic to filter for certain rows inside a file. We use the `grep` command for this, one of the most powerful command-line tools available in Bash. It has a lot of different options that are useful in different situations, so let's look at some examples:

In [6]:
%%bash
# run the code cells below to see what the commands do
cat ejemplo.txt

XXXXXXXXXXXXX
aaaaa	xxxxx
xxxxx	bbbbb
ccccc	xxxxx
xxxxx	ddddd
eeeee	xxxxx
aaaaa	bbbbb
....	fffff
axaxa	bxbxb
XXXXXXXXXXXXX

In [4]:
%%bash
# you can either print a file with cat first and pipe the output to grep
cat ejemplo.txt | grep "aaaa"

aaaaa	xxxxx
aaaaa	bbbbb


In [5]:
%%bash
# or you can open a file with grep directly
grep "aaaa" ejemplo.txt

aaaaa	xxxxx
aaaaa	bbbbb
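
The piped form becomes useful when you want to try a filter on just the beginning of a file, as suggested in "Start small". The sketch below recreates the example file in a temporary directory (an assumption made only so the block runs on its own) and applies `grep` to the first five lines.

```bash
# recreate the example file in a temporary directory (illustration only)
tmpdir=$(mktemp -d)
printf 'XXXXXXXXXXXXX\naaaaa\txxxxx\nxxxxx\tbbbbb\nccccc\txxxxx\nxxxxx\tddddd\neeeee\txxxxx\naaaaa\tbbbbb\n....\tfffff\naxaxa\tbxbxb\nXXXXXXXXXXXXX\n' > "$tmpdir/ejemplo.txt"

# filter only the first 5 lines instead of the whole file
preview=$(head -n 5 "$tmpdir/ejemplo.txt" | grep "aaaa")
echo "$preview"

# remove the temporary directory again
rm -r "$tmpdir"
```

Only one of the two matching lines is printed here, because the second match sits on line 7 and never reaches `grep`. Once the filter behaves as expected on the preview, drop the `head` step to run it on the full file.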


In [7]:
%%bash
# you can invert your search to find lines NOT containing your pattern
grep -v "xxxx" ejemplo.txt

XXXXXXXXXXXXX
aaaaa	bbbbb
....	fffff
axaxa	bxbxb
XXXXXXXXXXXXX


In [10]:
%%bash
# or you can tell grep to also return a number of lines below the match
grep -A1 "ee" ejemplo.txt

eeeee	xxxxx
aaaaa	bbbbb


In [12]:
%%bash
# keep in mind that grep patterns are case-sensitive (unless you use grep -i)
grep "X" ejemplo.txt

XXXXXXXXXXXXX
XXXXXXXXXXXXX
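
A few more options are worth trying on the same example. The sketch below uses `-i` (ignore case), `-c` (count matching lines instead of printing them) and `-n` (prefix each match with its line number). The example file is recreated in a temporary directory, which is an assumption made only so the block runs on its own.

```bash
# recreate the example file in a temporary directory (illustration only)
tmpdir=$(mktemp -d)
printf 'XXXXXXXXXXXXX\naaaaa\txxxxx\nxxxxx\tbbbbb\nccccc\txxxxx\nxxxxx\tddddd\neeeee\txxxxx\naaaaa\tbbbbb\n....\tfffff\naxaxa\tbxbxb\nXXXXXXXXXXXXX\n' > "$tmpdir/ejemplo.txt"

# -c counts the matching lines instead of printing them
count=$(grep -c "aaaa" "$tmpdir/ejemplo.txt")
echo "$count"

# -i ignores case, so "x" matches both x and X; combined with -c
count_ci=$(grep -c -i "x" "$tmpdir/ejemplo.txt")
echo "$count_ci"

# -n prints the line number in front of each matching line
numbered=$(grep -n "fffff" "$tmpdir/ejemplo.txt")
echo "$numbered"

# remove the temporary directory again
rm -r "$tmpdir"
```

Counting with `-c` is often all you need when the question is "how many lines match?" rather than "which lines match?".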


`grep` has many other features. You can find more information about any standard command-line tool by reading its manual with the `man` command.

In [11]:
%%bash
man grep

GREP(1)                          User Commands                         GREP(1)

NAME
       grep, egrep, fgrep, rgrep - print lines that match patterns

SYNOPSIS
       grep [OPTION...] PATTERNS [FILE...]
       grep [OPTION...] -e PATTERNS ... [FILE...]
       grep [OPTION...] -f PATTERN_FILE ... [FILE...]

DESCRIPTION
       grep  searches  for  PATTERNS  in  each  FILE.  PATTERNS is one or more
       patterns separated by newline characters, and  grep  prints  each  line
       that  matches a pattern.  Typically PATTERNS should be quoted when grep
       is used in a shell command.

       A FILE of “-”  stands  for  standard  input.   If  no  FILE  is  given,
       recursive  searches  examine  the  working  directory, and nonrecursive
       searches read standard input.

       In addition, the variant programs egrep, fgrep and rgrep are  the  same
       as  grep -E,  grep -F,  and  grep -r, respectively.  These variants are
       deprecated, but are provided for backward compatibility.

## Tasks

1. Extract all PubMed IDs from the GenBank file in this folder. How many publications contributed to this assembly of the human mitochondrial genome?

In [23]:
%%bash


ejemplo.txt
ex04.ipynb
mt-human.gb
README
