# Looking inside files

A common task is to look at the contents of a file. This can be done with several different Unix commands: `less`, `cat`, `head` and `tail`. Let us consider some examples.

## less

The `less` command displays the contents of a specified file one screen at a time. To look at the contents of the file S_typhi.embl use:

In [6]:
less S_typhi.embl

S_typhi.embl: No such file or directory


The contents of the file S_typhi.embl are displayed one screen at a time; to view the next screen, press the space bar. Depending on how `less` is set up, the percentage of the file that has been viewed so far may be displayed at the bottom of the screen. As S_typhi.embl is a large file, paging through all of it will take a while, so you may want to exit from this command early. To do this, press the `q` key, which quits `less` and returns you to the Unix prompt. `less` can also scroll backwards if you press the `b` key. Another useful feature is the slash key, `/`, which searches for an expression in the file: type `/` followed by the expression and press Enter.

## head and tail

Sometimes you may just want to view the text at the beginning or the end of a file, without having to display all of the file. The `head` and `tail` commands can be used to do this.

The `head` command displays the first ten lines of a file.

To look at the beginning of the file S_typhi.embl use:

In [4]:
head S_typhi.embl

head: S_typhi.embl: No such file or directory


The `tail` command displays the last ten lines of a file.

To look at the end of S_typhi.embl use:

In [1]:
tail S_typhi.embl

tail: S_typhi.embl: No such file or directory


The amount of the file that is displayed can be changed by adding extra arguments. To increase the number of lines viewed from 10 to 100, add the `-100` argument (shorthand for `-n 100`) to the command:

In [9]:
tail -100 S_typhi.embl

tail: S_typhi.embl: No such file or directory
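As a minimal sketch of how the line count changes what `head` and `tail` print, the following uses a small throwaway file instead of S_typhi.embl. The file name `numbers.txt` and the temporary directory are assumptions made only for this illustration, and the `-n` form of the option is used in place of the shorthand.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)

# Create a 20-line file (hypothetical name) holding the numbers 1 to 20.
seq 1 20 > "$workdir/numbers.txt"

# Show only the first three lines.
first_three=$(head -n 3 "$workdir/numbers.txt")
echo "$first_three"

# Show only the last two lines.
last_two=$(tail -n 2 "$workdir/numbers.txt")
echo "$last_two"
```

Changing the number after `-n` is all that is needed to see more or less of the file.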


## cat

The `cat` command joins files together. 

Having looked at the beginning and end of the S_typhi.embl file you should notice that in EMBL files the annotation comes first, then the DNA sequence at the end. If you had two separate files containing the annotation and the DNA sequence, both in EMBL format, it is possible to concatenate or join the two together to make a single file like the S_typhi.embl file you have just looked at. The command `cat` can be used to join two or more files into a single file. The order in which the files are joined is determined by the order in which they appear in the command line. 

For example, we have two separate files, MAL13P1.dna and MAL13P1.tab, that contain the DNA and annotation, respectively, from the P. falciparum genome. To join together these files use:

In [3]:
cat MAL13P1.tab MAL13P1.dna > MAL13P1.embl

cat: MAL13P1.tab: No such file or directory
cat: MAL13P1.dna: No such file or directory


The files MAL13P1.tab and MAL13P1.dna will be joined together and written to a file called MAL13P1.embl.

The `>` symbol in the command line redirects the output of the `cat` program to the designated file, MAL13P1.embl.
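To see the effect of file order without needing the MAL13P1 files, here is a small sketch that builds two stand-in files and joins them. The file names `part.tab`, `part.dna` and `joined.txt`, and their contents, are invented for this illustration and are not real EMBL data.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)

# Two small stand-in files (hypothetical names and contents).
printf 'annotation line 1\nannotation line 2\n' > "$workdir/part.tab"
printf 'acgtacgt\n' > "$workdir/part.dna"

# Join them: annotation first, then sequence, because that is the
# order in which the file names appear on the command line.
cat "$workdir/part.tab" "$workdir/part.dna" > "$workdir/joined.txt"

# With no redirect, cat prints the joined file to the screen.
cat "$workdir/joined.txt"
```

Bear in mind that, by default, `>` replaces the contents of the destination file if it already exists.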

## Saving time

Saving time while typing may not seem important, but the longer that you spend in front of a computer, the happier you will be if you can reduce the time you spend at the keyboard.

* Pressing the up/down arrows will let you scroll through the previous commands. 

* If you highlight some text, middle clicking will paste it on the command line.

* One of the best Unix tips you can learn early on is that you can use tab to complete the names of programs and files on most Unix systems. Type enough letters to uniquely identify the name of a file, directory or command and press tab. Unix will do the rest...

In [3]:
fin

bash: fin: command not found


## Redirects and pipes

Multiple Unix commands can be combined together to do very powerful things.

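## Other commands

The commands in the following sections, `wc`, `sort` and `uniq`, are small tools that become especially useful when they are combined with pipes.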

## wc - counting

The command `wc` counts lines, words or characters.
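The three kinds of count can be tried on a small test file. This is only a sketch: the file name `words.txt` and its contents are made up for the example.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)

# A two-line test file (hypothetical name and contents).
printf 'one two three\nfour five\n' > "$workdir/words.txt"

# -l counts lines, -w counts words, -c counts bytes
# (bytes and characters are the same for plain ASCII text).
# Reading the file through < makes wc print the number without the file name.
line_count=$(wc -l < "$workdir/words.txt")
word_count=$(wc -w < "$workdir/words.txt")
byte_count=$(wc -c < "$workdir/words.txt")

echo "lines: $line_count  words: $word_count  bytes: $byte_count"
```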

To count the number of files that are listed by `ls` use:

In [None]:
ls | wc -l

The `|` symbol above, also known as the pipe symbol, connects the two commands into a single operation. We say that the output from the first command is piped to the second command and used as its input.
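One way to picture what the pipe does is to compare it with saving the output of the first command in a file and then reading that file with the second command. The sketch below does both in a temporary directory; the file names are invented for the illustration.

```shell
# A temporary directory with three empty files (hypothetical names).
workdir=$(mktemp -d)
touch "$workdir/a.embl" "$workdir/b.embl" "$workdir/notes.txt"

# Without a pipe: save the listing in an intermediate file, then count its lines.
listing=$(mktemp)
ls "$workdir" > "$listing"
count_via_file=$(wc -l < "$listing")

# With a pipe: the listing goes straight into wc, with no intermediate file.
count_via_pipe=$(ls "$workdir" | wc -l)

echo "via file: $count_via_file  via pipe: $count_via_pipe"
```

Both approaches report the same number; the pipe simply avoids the extra file.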

You can connect as many commands as you want. For example:

In [None]:
ls | grep ".embl" | wc -l

What does this command do?

## sort - sorting values

The `sort` command lets you sort the lines of its input. When the input is sorted, lines with identical content end up next to each other in the output, which can then be fed to the `uniq` command (see below) to count the number of unique lines in the input.

To sort the contents of the BED file use:

In [None]:
sort Pfalciparum.bed

To sort the contents of the BED file numerically on position (the second column), type the following command:

In [None]:
sort -k 2 -n Pfalciparum.bed

The `sort` command can sort by multiple columns, e.g. 1st column and then 2nd column, by specifying successive `-k` parameters in the command.
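As an illustration of successive `-k` parameters, the sketch below sorts a tiny tab-separated file by its first column and then numerically by its second. The file `toy.bed` and its four records are made up for this example and are not taken from Pfalciparum.bed. Writing a key as `-k1,1` limits it to that single column, whereas `-k 1` on its own would run from column 1 to the end of the line.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)

# A tiny stand-in for a BED file (hypothetical data): chromosome, start, end.
printf 'chr2\t300\t400\nchr1\t1000\t1100\nchr2\t50\t150\nchr1\t200\t250\n' > "$workdir/toy.bed"

# -k1,1  : first key is column 1 only (chromosome), compared as text
# -k2,2n : second key is column 2 only (start position), compared as a number
sorted=$(sort -k1,1 -k2,2n "$workdir/toy.bed")
echo "$sorted"
```

Within each chromosome the records now appear in order of increasing start position.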

## uniq - finding unique values

The `uniq` command extracts unique lines from the input. It is usually used in combination with `sort` to count unique values in the input.
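The reason `uniq` is normally paired with `sort` is that it only collapses duplicate lines that sit next to each other. The following sketch shows the difference on a small made-up list; the file name `names.txt` and its contents are assumptions for this example.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)

# A short list with repeated, but not always adjacent, values (hypothetical data).
printf 'chr2\nchr1\nchr2\nchr1\nchr1\n' > "$workdir/names.txt"

# uniq alone only merges duplicates that are adjacent,
# so names repeated further apart are printed again.
unsorted_result=$(uniq "$workdir/names.txt")

# Sorting first puts identical lines together, so each name is printed once.
sorted_result=$(sort "$workdir/names.txt" | uniq)

echo "uniq only:"
echo "$unsorted_result"
echo "sort | uniq:"
echo "$sorted_result"
```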

To get the list of chromosomes in the BED file use:

In [None]:
awk '{ print $1 }' Pfalciparum.bed | sort | uniq

How many chromosomes are there?

Now modify the previous command to count the number of features per chromosome. How many of these are repeated?
Hint: use the man command to look at the options for the uniq command. Or peruse the wc or grep manuals. There’s more than one way to do it!

## man - getting help

To obtain further information on any of the Unix commands introduced in this tutorial you can use the `man` command. For example, to get a full description of how to use the `sort` command, use the following command in a terminal window:

In [4]:
man sort

SORT(1)                          User Commands                         SORT(1)

NAME
       sort - sort lines of text files

SYNOPSIS
       sort [OPTION]... [FILE]...

DESCRIPTION
       Write sorted concatenation of all FILE(s) to standard output.

       Mandatory  arguments  to  long  options are mandatory for short options
       too.  Ordering options:

       -b, --ignore-leading-blanks
              ignore leading blanks

       -d, --dictionary-order
              consider only blanks and alphanumeric characters

       -f, --ignore-case
              fold lower case to upper case characters

       -g, --general-numeric-sort
              compare according to general numerical value

       -i, --ig

## Exercises

The following exercises ... Open up another terminal window.

1. head/tail command
2. cat command
3. Use the `sort` command to sort the BED file X on chromosome and then gene position
4. Use the `uniq` command to count the number of features per chromosome in the BED file
5. Exercise with pipes