# Command Line Bioinformatics: Exploring File Formats

**Duration**: 2 hours  
**Goals**:  
- Learn to explore file formats without memorization.  
- Use tools like `grep`, `awk`, `sort`, and `less`.  
- Prioritize efficiency and documentation (e.g., `man`, `-h`).  



## Setup  
Ensure you have:   
- Access to the `Data/` directory.
- Example files: `*.fasta`, `*.fastq`, `*.gff`, `*.bed`.

## Resources 
- Tools/Commands: `cat`, `less`, `grep`, `sort`, `uniq`, `file`, `awk`, `head`/`tail`, `gzip`/`gunzip`, `wc`.
- BED files: https://genome.ucsc.edu/FAQ/FAQformat.html
- GitHub resource for learning bioinformatics: https://github.com/harvardinformatics/learning-bioinformatics-at-home

## Template/Workflow

Show students how to use the command line tools to:
- check documentation/manuals
- navigate directories
- list directories
- check file types
- check file sizes

Bonus advanced material:
- install the conda environment from the yml file
- download reads using sratoolkit
- convert fastq to fasta using seqtk
- split multiline fasta to single files using seqtk

## Setup

In [None]:
# Check your location
!pwd

In [34]:
# Navigate to 07_BioFile_Formats directory from CM515-course-2025/
# Use the %cd magic here: `!cd` runs in a temporary subshell, so the change would not persist.
%cd CM515-course-2025/modules/07_BioFile_Formats

/home/jake/Projects/CM515-course-2025/modules/07_BioFile_Formats
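
A note on why `!cd` does not stick: each `!` line in a notebook runs in its own short-lived shell, so a directory change made there is gone by the next line, whereas the `%cd` magic changes the working directory of the notebook itself. The sketch below reproduces the same effect in plain shell with a subshell (the parentheses); `/tmp` is only an illustrative target directory.

```shell
# Remember where we start.
start_dir=$(pwd)

# A subshell (parentheses) gets its own working directory,
# much like a single `!command` line in a notebook.
( cd /tmp && echo "inside subshell: $(pwd)" )

# Back in the parent shell nothing has changed.
after_dir=$(pwd)
echo "after subshell:  $after_dir"
```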


## How to use manuals and help flags

In [38]:
# 1. Option 1 to try: man {tool_name}
!man ls 

In [None]:
# 2. Option 2 to try: {tool_name} --help or {tool_name} -h
!ls --help
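
Help output can be long, so it is often quicker to search it for a keyword than to read all of it. A minimal sketch, assuming the tool accepts `--help` (most GNU tools do, but not every tool); `help_grep` is a hypothetical helper name, not a standard command.

```shell
# Print only the lines of a tool's help text that mention a keyword.
# Usage: help_grep TOOL KEYWORD   (help_grep is a hypothetical helper)
help_grep() {
    # 2>&1 also captures help text that some tools print to stderr.
    "$1" --help 2>&1 | grep -i -- "$2"
}

# Which gzip option decompresses a file?
help_grep gzip decompress
```

The same idea works interactively inside `man` and `less`: type `/keyword` and press Enter to jump to the next match.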

## Tools commonly have options (flags) that can be combined

In [39]:
# List the files in the data directory 
!ls Data/

Animals.fasta
Bio_data
Biology_protein_data.fasta
Covid_1.fastq
Covid_2.fastq
fasterq.tmp.archlinux.15709
Mystery_Data
Pan_paniscus.panpan1.1.113.gff3
Pan_paniscus.panpan1.1.113.gtf
Pan_paniscus.panpan1.1.cds.all.fa
Pan_paniscus.panpan1.1.dna.chromosome.1.fa.gz
SARS_CoV_2_ref.fasta
SRR22903825
Twist_Exome_Core_Covered_Targets_hg19_liftover.bed
Twist_Exome_Core_Covered_Targets_hg38.bed

In [1]:
# List the files in the data directory sorted by time modified (newest first)
# -l long format, -h human-readable sizes, -t sort by modification time
!ls -lht Data/

total 743M
-rw-r--r-- 1 jake jake 242K Jan 29 13:41 Covid_2.fastq
-rw-r--r-- 1 jake jake 242K Jan 29 13:41 Covid_1.fastq
drwxr-xr-x 1 jake jake   30 Jan 29 13:28 SRR22903825
drwxr-xr-x 1 jake jake  816 Jan 29 12:23 fasterq.tmp.archlinux.15709
-rw-r--r-- 1 jake jake  84K Jan 29 11:30 Biology_protein_data.fasta
-rw-r--r-- 1 jake jake  84K Jan 29 11:24 Bio_data
-rw-r--r-- 1 jake jake  62M Jan 29 10:41 Pan_paniscus.panpan1.1.dna.chromosome.1.fa.gz
-rw-r--r-- 1 jake jake  77M Jan 29 10:39 Pan_paniscus.panpan1.1.cds.all.fa
-rw-r--r-- 1 jake jake 168M Jan 29 10:39 Pan_paniscus.panpan1.1.113.gff3
-rw-r--r-- 1 jake jake 410M Jan 29 10:38 Pan_paniscus.panpan1.1.113.gtf
-rw-r--r-- 1 jake jake 4.4M Jan 29 10:37 Twist_Exome_Core_Covered_Targets_hg38.bed
-rw-r--r-- 1 jake jake 4.4M Jan 29 10:36 Twist_Exome_Core_Covered_Targets_hg19_liftover.bed
-rw-r--r-- 1 jake jake  19M Jan 29 09:39 Animals.fasta
-rw------- 1 jake jake  30K Jan 28 15:59 SARS_CoV_2_ref.fasta
drwxr-xr-x 1 jake jake    0 Jan 28 14:04

In [5]:
# List all files in the directory, including hidden files (names starting with a dot)
!ls -lha Data/

total 743M
drwxr-xr-x 1 jake jake  846 Feb  1 21:06 .
drwxr-xr-x 1 jake jake  204 Feb  1 21:06 ..
-rw-r--r-- 1 jake jake  19M Jan 29 09:39 Animals.fasta
-rw-r--r-- 1 jake jake  84K Jan 29 11:24 Bio_data
-rw-r--r-- 1 jake jake  84K Jan 29 11:30 Biology_protein_data.fasta
-rw-r--r-- 1 jake jake 242K Jan 29 13:41 Covid_1.fastq
-rw-r--r-- 1 jake jake 242K Jan 29 13:41 Covid_2.fastq
-rw-r--r-- 1 jake jake   24 Feb  1 21:06 .Davids_secret_diary.txt
drwxr-xr-x 1 jake jake  816 Jan 29 12:23 fasterq.tmp.archlinux.15709
drwxr-xr-x 1 jake jake    0 Jan 28 14:04 Mystery_Data
-rw-r--r-- 1 jake jake 168M Jan 29 10:39 Pan_paniscus.panpan1.1.113.gff3
-rw-r--r-- 1 jake jake 410M Jan 29 10:38 Pan_paniscus.panpan1.1.113.gtf
-rw-r--r-- 1 jake jake  77M Jan 29 10:39 Pan_paniscus.panpan1.1.cds.all.fa
-rw-r--r-- 1 jake jake  62M Jan 29 10:41 Pan_paniscus.panpan1.1.dna.chromosome.1.fa.gz
-rw------- 1 jake jake  30K Jan 28 15:59 SARS_CoV_2_ref.fasta
drwxr-xr-x 1 jake jake   30 Jan 29 13:28 SRR22903825
-rw-r--r

In [None]:
# Can you list the files in the data directory sorted by reverse time modified? 


## Task 1: Write a one-sentence description of each of these tools using their man page / help documentation

In [None]:
# cat - 
# grep - 
# less -
# head -
# tail -
# file -
# awk -
# wc -
# sort - 
# uniq -
# gzip -
# gunzip -

## Tools can be piped together using `|`

In [9]:
## Here is an example

# This command will extract all of the headers from the fasta file
!grep "^>" Data/Animals.fasta

>Gerbil
>Lizard
>Guinea_pig
>Ferret
>Ferret
>Gerbil
>Guinea_pig
>Cat
>Lizard
>Fish
>Snake
>Cat
>Fish
>Snake
>Dog
>Bird
>Dog
>Bird


### Let's pipe commands together to find out more about the dataset

In [10]:
# We can use the wc -l command to return the count for the number of lines
!grep "^>" Data/Animals.fasta | wc -l

18


In [13]:
# We can use sort to sort the headers alphabetically
!grep "^>" Data/Animals.fasta | sort

>Bird
>Bird
>Cat
>Cat
>Dog
>Dog
>Ferret
>Ferret
>Fish
>Fish
>Gerbil
>Gerbil
>Guinea_pig
>Guinea_pig
>Lizard
>Lizard
>Snake
>Snake


In [12]:
# We can use uniq -c to return the number of entries for each unique header
# (uniq only merges adjacent duplicates, so the input must be sorted first)
!grep "^>" Data/Animals.fasta | sort | uniq -c

      2 >Bird
      2 >Cat
      2 >Dog
      2 >Ferret
      2 >Fish
      2 >Gerbil
      2 >Guinea_pig
      2 >Lizard
      2 >Snake
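
`uniq` only collapses adjacent duplicate lines, which is why the headers are sorted before counting. As a small sketch, the pipeline below runs on a made-up FASTA file (the file and its records are hypothetical) and adds a final `sort -nr` so the most frequent header comes first.

```shell
# Build a tiny, made-up FASTA file to experiment on.
demo_fasta=$(mktemp)
printf '>Cat\nACGT\n>Dog\nGGCC\n>Cat\nTTAA\n>Bird\nACAC\n>Cat\nGGGG\n' > "$demo_fasta"

# Without sort, uniq only merges neighbouring duplicates,
# so every header is reported as its own group here.
unsorted_groups=$(grep "^>" "$demo_fasta" | uniq -c | wc -l)

# Sort first, count, then order by count (largest first).
header_counts=$(grep "^>" "$demo_fasta" | sort | uniq -c | sort -nr)
top_header=$(echo "$header_counts" | head -n 1 | awk '{print $2}')

echo "groups without sort: $unsorted_groups"
echo "$header_counts"
echo "most common header: $top_header"

rm -f "$demo_fasta"
```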


## Now that you know how to use tools and look up their documentation, let's explore some bio file formats

#### Below are common bioinformatics file formats and the file extensions you may encounter in the wild

- Fasta: `.fasta`, `.fna`, `.fas`, `.fa`
- Fastq: `.fastq`, `.fq`
- GFF: `.gff`, `.gff3`
- GTF: `.gtf`
- Bed: `.bed`

##### Important note: it is common practice to compress biological files to save disk space.

Two common compression formats are gzip and zip.

- To unzip a gzipped file, use the command `gunzip {filename}`.
- To gzip a file, use the command `gzip {filename}`.
- To unzip a zipped file, use the command `unzip {filename}`.
- To zip a file, use the command `zip {archive_name}.zip {filename}`.

Gzip is the preferred and most common form of file compression in bioinformatics.
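
Because compressed files are so common, it helps to know that you can inspect one without writing a decompressed copy to disk. A minimal sketch, using a made-up file in a temporary directory (all file names here are hypothetical): `gzip -c` writes compressed data to standard output, `gzip -t` tests integrity, and `gzip -dc` streams the decompressed content into other tools.

```shell
# Work in a scratch directory with a made-up FASTA file.
workdir=$(mktemp -d)
printf '>seq1\nACGTACGT\n>seq2\nGGCCGGCC\n' > "$workdir/demo.fasta"

# Compress to a new file; -c writes to stdout so the original is kept.
gzip -c "$workdir/demo.fasta" > "$workdir/demo.fasta.gz"

# Test integrity without extracting anything.
if gzip -t "$workdir/demo.fasta.gz"; then gz_ok=yes; else gz_ok=no; fi

# Stream the decompressed content straight into other tools.
first_line=$(gzip -dc "$workdir/demo.fasta.gz" | head -n 1)
n_seqs=$(gzip -dc "$workdir/demo.fasta.gz" | grep -c "^>")

echo "valid gzip: $gz_ok, first line: $first_line, sequences: $n_seqs"
rm -rf "$workdir"
```

Streaming this way is handy for large genome files, where a full decompressed copy can take up a lot of disk space.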

#### To check if a file is zipped you can use the file command

## Let's explore an unknown file together.
### We are bioinformaticians and were just sent a file called Bio_data and asked to summarize the data set. 

In [23]:
# We are bioinformaticians and were just sent a file called Bio_data and asked to summarize the data set.
# Step 1. Check what file type it is and whether it is zipped.
!file Data/Bio_data

Data/Bio_data: gzip compressed data, was "Bio_data", last modified: Wed Jan 29 18:24:21 2025, from Unix, original size modulo 2^32 86009


In [18]:
# Let's unzip the data. gunzip only accepts files with a known suffix such as .gz, so rename the file first.
!mv Data/Bio_data Data/Bio_data.gz && gunzip Data/Bio_data.gz

In [None]:
# Now let's take a peek at the file structure using the head command.
!head -n 10 Data/Bio_data

What do you notice about the file structure? 

Headers or columns? 

If headers, is it a fastq or fasta file?

What do you notice about the sequence characters? DNA or Amino Acid?

Are the sequence lines wrapped at a set length or do they continue forever?

In [24]:
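## Here is a handy way to remove the line wrapping using awk and save the output as a new file.
## Header lines are printed on their own line, sequence lines are joined, and END adds the final newline.
!awk '/^>/ {print (NR==1?"":"\n") $0; next} {printf "%s", $0} END {print ""}' Data/Bio_data > Data/Biology_protein_data.fasta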
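
To see what this kind of `awk` one-liner does, here is a sketch on a small made-up wrapped FASTA file (file names and sequences are hypothetical). Header lines are printed on their own line, sequence lines are printed without a trailing newline so they join together, and this sketch uses an `END` block so the output finishes with a newline.

```shell
wrapped=$(mktemp)
unwrapped=$(mktemp)

# Two made-up protein records, wrapped at 3 characters per line.
printf '>protA\nMKV\nLAA\nG\n>protB\nMST\nQ\n' > "$wrapped"

awk '/^>/ {print (NR==1 ? "" : "\n") $0; next}   # header: start a new line
     {printf "%s", $0}                           # sequence: append, no newline
     END {print ""}' "$wrapped" > "$unwrapped"   # finish the last sequence line

cat "$unwrapped"
n_lines=$(wc -l < "$unwrapped")
second_line=$(sed -n '2p' "$unwrapped")

rm -f "$wrapped" "$unwrapped"
```

After unwrapping, every record is exactly two lines, which makes line-based tools such as `grep`, `head`, and `wc` much easier to apply.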

In [None]:
# Write a command to view the first 30 lines of the new file Data/Biology_protein_data.fasta

In [None]:
# Let's see how many sequences are in the file
!grep -c "^>" Data/Biology_protein_data.fasta

In [None]:
# Let's look at the header column pattern for this file
!grep "^>" Data/Biology_protein_data.fasta

In [None]:
# Let's extract the gene column
!grep "^>" Data/Biology_protein_data.fasta | awk '{print $4}'

In [48]:
# Let's remove the gene: prefix and sort the list
!grep "^>" Data/Biology_protein_data.fasta | awk '{print $4}' | sed 's/gene://' | sort

ENSPPAG00000000006.1
ENSPPAG00000000010.1
ENSPPAG00000000016.1
ENSPPAG00000000019.1
ENSPPAG00000000021.1
ENSPPAG00000000022.1
ENSPPAG00000000023.1
ENSPPAG00000000025.1
ENSPPAG00000000027.1
ENSPPAG00000000028.1
ENSPPAG00000000032.1
ENSPPAG00000000033.1
ENSPPAG00000000035.1
ENSPPAG00000000038.1
ENSPPAG00000000040.1
ENSPPAG00000000040.1
ENSPPAG00000000041.1
ENSPPAG00000000042.1
ENSPPAG00000000042.1
ENSPPAG00000000043.1
ENSPPAG00000000044.1
ENSPPAG00000000044.1
ENSPPAG00000000044.1
ENSPPAG00000000044.1
ENSPPAG00000000045.1
ENSPPAG00000000045.1
ENSPPAG00000000045.1
ENSPPAG00000000046.1
ENSPPAG00000000047.1
ENSPPAG00000000049.1
ENSPPAG00000000050.1
ENSPPAG00000000050.1
ENSPPAG00000000051.1
ENSPPAG00000000052.1
ENSPPAG00000000053.1
ENSPPAG00000000054.1
ENSPPAG00000000054.1
ENSPPAG00000000055.1
ENSPPAG00000000056.1
ENSPPAG00000000056.1
ENSPPAG00000000056.1
ENSPPAG00000000056.1
ENSPPAG00000000057.1
ENSPPAG00000000061.1
ENSPPAG00000000069.1
ENSPPAG00000000071.1
ENSPPAG00000000072.1
ENSPPAG000000
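
Extracting the gene ID with `awk '{print $4}'` relies on the gene field always being the fourth column. A position-independent sketch is `grep -o`, which prints only the part of each line that matches a pattern. The headers below are made up to loosely resemble the ones above (the IDs and field layout are hypothetical), and the final `sort -nr` ranks genes by how many sequences they have.

```shell
headers=$(mktemp)
cat > "$headers" <<'EOF'
>T001.1 pep chr:1 gene:G0002.1 biotype:protein_coding
>T002.1 pep gene:G0001.1 chr:1 biotype:protein_coding
>T003.1 pep chr:2 gene:G0002.1 biotype:protein_coding
EOF

# -o prints only the matching text; [^ ]* means "up to the next space".
gene_counts=$(grep "^>" "$headers" | grep -o 'gene:[^ ]*' | sed 's/gene://' \
    | sort | uniq -c | sort -nr)
echo "$gene_counts"

top_gene=$(echo "$gene_counts" | head -n 1 | awk '{print $2}')
n_genes=$(echo "$gene_counts" | wc -l)

rm -f "$headers"
```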

## Fasta file exploration

In [28]:
# gunzip the file Data/Pan_paniscus.cds.fa.gz (if gunzip reports "No such file or directory", check the exact file name with ls Data/)
!gunzip Data/Pan_paniscus.cds.fa.gz

gzip: Data/Pan_paniscus.cds.fa.gz: No such file or directory



In [None]:
# Let's take a look at the fasta file

# That wasn't very helpful, let's try and open it up.
# !head -n 10 Data/Animals.fasta

# What do you notice about the file structure?

# What do you notice about the sequence characters?
# Are the sequence lines wrapped?

# Let's see how many sequences are in the file
#!grep -c "^>" Data/Biology_protein_data.fasta

### Bioawk further exploration 
# Print sequence lengths from FASTA
#!bioawk -c fastx '{print $name, length($seq)}' Data/Animals.fasta


# Get GC content
#!bioawk -c fastx '{print $name, gc($seq)}' Data/Animals.fasta


# Get reverse complement
#!bioawk -c fastx '{print ">"$name; print revcomp($seq)}' Data/Animals.fasta
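
If `bioawk` is not installed, plain `awk` can produce similar summaries. The following sketch (on a made-up FASTA file; names and sequences are hypothetical) prints each record's name, length, and GC fraction, and works whether or not the sequence lines are wrapped. It is an illustration of the idea, not a replacement for `bioawk`'s parser.

```shell
demo=$(mktemp)
printf '>seqA\nGGCC\nAATT\n>seqB\nGCGC\n' > "$demo"

# For each record, accumulate the sequence, then report on it when the
# next header (or the end of the file) is reached.
summary=$(awk '
    function report() {
        if (name != "") {
            gc = gsub(/[GCgc]/, "", seq)   # gsub returns how many G or C it replaced
            printf "%s\t%d\t%.2f\n", name, len, (len ? gc / len : 0)
        }
    }
    /^>/ { report(); name = substr($1, 2); seq = ""; len = 0; next }
         { seq = seq $0; len += length($0) }
    END  { report() }
' "$demo")

# Columns: name, length, GC fraction
echo "$summary"
rm -f "$demo"
```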

In [49]:
!head -n 10 Data/Animals.fasta

>Gerbil
GGACTGCAGGGGCTCCCTCCAGCGTCCGTGTCGCCAGCCCCAGGGCAGCAGTCCTTGAAAGGGGGACATCTCCAGCCCCCAAGGGTCCTCTGGAGGCGCAACTGGCCCCCCTGCTCCTTCCCAGCACAAGCATGGCATGGAAGGAAGGATACCCTGGCATGGAGAGTCCCTGAAAGGGGGGGGGGGGGGTGAGCCATTTCACAGTGCCAGTGTGCGCTGGCCAGGCTCTCCCCCACACCCCTGAAAAAAAAGATAGCATGAGGACAGCTTCTGTTTACATTCAGCACACATACACACTGCTGTCCTCCATTCGCTGCCACCAGGAGCTATACAGACCCCCGAGGTGGGAGTCAGCTCCGCATTCATCCATGAGACGCTTCCTAAAGCAGTCTCACAGGCAGGGGGACCCCTGCCCAGCCTGGCTGGTGAGCCTCCCCATCACGCATGCCCTCCCATTCCCCCAGTTGTAAAGCGGATACTTCAAGTGGGAGAGGCATGCTTCAGTGTCTGTGGTGCCTGAACACTTTGTAGTTCAGGTTCAGGGGCTGAGGTCCCTGGGACCTCCGCAGAACCAGAGTCCCTGTTCGTCACAGTCCTCCCCAGGAGATTAGGGTGATTTTCATCCCTAAGGTCCTCAAACTTCCTGGACAGGGCATCTGCGTTCTGATTCTGTTCCCCCTTCTTATGGATAACTTCCATATCCAGTTCCTGTAGTGCCAGTGACCATCGCAGTAATTTGCTGTTCTCCCCTTTGCATTGCATCAGCCATACCAGTGGGTTGTGGTCTGTGTGGACAATGAAGTGTGTCCCAAATAAATAGGGCTTCAATTTCCTCACAGCCCACACCAGGGCAAGACATTCCCGTTCCACTGTTGCATACTTTCTTTCAGCAGGAATTAACTTCCTGCTGATGTATGCTACTGGCTGTTCTGCCCCATTGGCATCCATCTGTGCAAGTACAGCACCGATGCCACTGCTGGAGGCATCGGTTT

## Fastq file exploration

In [None]:
# Step 1. Check what file type it is and whether it is zipped.
#!file Data/Covid_1.fastq

# Let's try and open it up.
#!head -n 10 Data/Covid_1.fastq

# What do you notice about the file structure? 

# Let's see how many sequences are in the file.
# Each FASTQ record is 4 lines; counting "^@" can overcount because quality lines may also start with "@".
#!awk 'END {print NR/4}' Data/Covid_1.fastq

# Let's get the mean Phred score for each sequence 
#!bioawk -c fastx '{print $name, meanqual($qual)}' Data/Covid_1.fastq

# Let's get the mean Phred score across all sequences 
#!bioawk -c fastx '{print $name, meanqual($qual)}' Data/Covid_1.fastq | awk '{sum+=$2} END {print sum/NR}'

# Let's see how the sequences break down categorically by Phred score 
# !bioawk -c fastx '{print $name, meanqual($qual)}' Data/Covid_1.fastq | awk '{print int($2)}' | sort -n | uniq -c | sort -nr
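
A FASTQ record is four lines (header, sequence, `+`, qualities) in the common unwrapped layout, and a quality line is allowed to start with `@`, so counting lines is more reliable than counting `@` characters. A minimal sketch on a made-up FASTQ file (read names and qualities are hypothetical) that counts reads and tabulates read lengths with plain `awk`:

```shell
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGTAC
+
IIIIII
@read2
ACGT
+
@III
@read3
GGCCAA
+
IIII#I
EOF

# Naive count: also matches the quality line of read2, which starts with "@".
naive_count=$(grep -c "^@" "$fq")

# More reliable count: total lines divided by four.
read_count=$(awk 'END {print NR / 4}' "$fq")

# Sequence lines are line 2 of every 4-line record.
length_table=$(awk 'NR % 4 == 2 {print length($0)}' "$fq" | sort -n | uniq -c)

echo "naive: $naive_count, by line count: $read_count"
echo "$length_table"
rm -f "$fq"
```

The length table uses the same `sort | uniq -c` pattern as the header counts earlier, with the first column giving the number of reads at each length.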



## Advanced Section

### Set up a conda environment and activate the environment to install additional tools

In [None]:

# Create the conda environment from the yml file
!conda env create -f environment.yml

In [None]:
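# Activate the conda environment
# Note: run this in a terminal; a `!` command runs in a temporary subshell, so the activation does not persist in the notebook.
!conda activate Bio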