# Extracting information

The file identifiers.txt contains a list of sequence identifiers along with the name of the species each sequence came from. We do not know whether the sequence identifiers are unique, yet this kind of information is often very important. For example, you might want to know whether your analysis identified 10 different genes of interest, or whether you picked up a signal from the same gene over and over again.

## Working with columns

Although it is hard to see, there are two columns in the identifiers.txt file. The columns are separated by a tab character. This special character is invisible when printed to STDOUT, but it also has a written form that the BASH shell and many command-line tools understand: `\t`
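One way to convince yourself that the tab is really there is to swap it for a printable character. The following is a minimal sketch on a made-up line (the identifier is invented for illustration and is not taken from identifiers.txt):

```shell
# Build one tab-separated line; the content is invented for illustration
line=$(printf 'Drosophila melanogaster\tsome_identifier')

# tr replaces every tab with a visible pipe character
visible=$(printf '%s\n' "$line" | tr '\t' '|')
echo "$visible"
```

The pipe character marks the exact position where the first column ends and the second one begins.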

You can make use of these columns with commands like `cut`. For example, you can print only the first column of a file:

In [4]:
%%bash
# try running this cell in the notebook!
cut -f1 identifiers.txt | head

Drosophila melanogaster
Drosophila melanogaster
Drosophila melanogaster
Drosophila melanogaster
Drosophila melanogaster
Drosophila melanogaster
Drosophila melanogaster
Drosophila melanogaster
Drosophila melanogaster
Drosophila melanogaster


You can also give `cut` a delimiter of your choice with the `-d` option. Each line is then split at that character instead of at the tab, creating completely new columns!

In [3]:
%%bash
cut -d"_" -f1,3 identifiers.txt | head

Drosophila melanogaster	drome_14401
Drosophila melanogaster	drome_14402
Drosophila melanogaster	drome_14403
Drosophila melanogaster	drome_14407
Drosophila melanogaster	drome_14408
Drosophila melanogaster	drome_1440
Drosophila melanogaster	drome_14410
Drosophila melanogaster	drome_14412
Drosophila melanogaster	drome_14413
Drosophila melanogaster	drome_14415
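The output above still contains an underscore because `cut` joins the fields it keeps with the same delimiter it used for splitting. The tab is also no longer treated as a separator once `-d` is set, so it stays inside the first field. A small sketch on an invented string shows the joining behaviour in isolation:

```shell
# An invented line with three underscore-separated fields
line='alpha_beta_gamma'

# Keep fields 1 and 3; cut joins the kept fields with the same delimiter
picked=$(printf '%s\n' "$line" | cut -d'_' -f1,3)
echo "$picked"   # alpha_gamma
```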


## How many of you are there? Testing uniqueness

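There is a dedicated command for testing uniqueness: `uniq`. It is a separate program rather than a part of BASH itself, but it is available on practically every system. Be careful though, `uniq` only compares adjacent lines, so a duplicate is only removed when it directly follows an identical line. Look at this example: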

In [17]:
%%bash
echo "The file looks like this:"
cat example.txt

The file looks like this:
apple
apple
doctor
apple


In [18]:
%%bash
echo "This is not unique at all :("
cat example.txt | uniq

This is not unique at all :(
apple
doctor
apple


You can get around this by sorting the elements of your file first:

In [19]:
%%bash
cat example.txt | sort | uniq

apple
doctor


In [20]:
%%bash
# this also works and involves less typing
cat example.txt | sort -u

apple
doctor


In [21]:
%%bash
# if you want to count how often each line occurs, you need uniq -c though
cat example.txt | sort | uniq -c

      3 apple
      1 doctor
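A quick yes-or-no check for uniqueness is to compare the total number of lines with the number of distinct lines. The sketch below is one possible way to do it; it recreates the four-line example in a temporary file, and the variable names are only illustrative:

```shell
# Recreate the four-line example from above in a temporary file
tmp=$(mktemp)
printf 'apple\napple\ndoctor\napple\n' > "$tmp"

# Count all lines, then count distinct lines
# (tr strips the padding that some versions of wc add)
total=$(wc -l < "$tmp" | tr -d ' ')
distinct=$(sort -u "$tmp" | wc -l | tr -d ' ')

if [ "$total" -eq "$distinct" ]; then
    echo "all $total lines are unique"
else
    echo "$total lines, but only $distinct distinct values"
fi

rm -f "$tmp"
```

If both numbers agree, every line occurs exactly once; otherwise at least one value is repeated.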


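One last thing on sorting numbers: by default `sort` compares lines as text, character by character, so 10 ends up before 9. If you want to sort numbers by their value, use `sort -g`. The g stands for "general numeric". The more common `sort -n` does the same for plain integers and decimals.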
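The difference is easiest to see on a few numbers. This sketch sorts the same three invented values twice and prints each result on a single line:

```shell
# Default sort compares text character by character
as_text=$(printf '10\n9\n100\n' | sort | paste -s -d' ' -)

# -g compares the values as numbers
as_numbers=$(printf '10\n9\n100\n' | sort -g | paste -s -d' ' -)

echo "text:    $as_text"
echo "numeric: $as_numbers"
```

In the text ordering, 9 comes last because the character 9 is larger than the character 1, no matter how many digits follow.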

## Tasks

1. How many different species are in the file?

2. How many identifiers are there for each species?

3. How many unique identifiers are in the file for each species?