## Command Line Bioinformatics: Exploring File Formats
**Duration**: 2 hours
**Goals**:
- Learn to explore bioinformatics file formats without memorizing their structures.
- Master basic command-line tools like `grep`, `awk`, `sort`, and `less` for data exploration.
- Develop skills in checking documentation (e.g., `man`, `--help`) and piping commands efficiently.

This notebook is designed for new bioinformatics students. You’ll use pre-installed tools to investigate common file formats: FASTA, FASTQ, GFF, and BED. No additional installations are required for the main exercises.



## Setup

Before starting, ensure you have:
- Access to a "Data" directory containing example files: `*.fasta`, `*.fastq`, `*.gff`, `*.bed`.
- A terminal or Jupyter notebook environment with standard UNIX tools installed (`cat`, `grep`, `awk`, etc.).

Let’s verify the example files are present:



In [6]:
# List the contents of the Data directory
ls Data/

Animals.fasta
Bio_data
Biology_protein_data.fasta
Covid_1.fastq
Covid_2.fastq
fasterq.tmp.archlinux.15709
Mystery_Data
Pan_paniscus.cds.fa
Pan_paniscus.panpan1.1.113.gff3
Pan_paniscus.panpan1.1.113.gtf
Pan_paniscus.panpan1.1.cds.all.fa
Pan_paniscus.panpan1.1.dna.chromosome.1.fa.gz
SARS_CoV_2_ref.fasta
SRR22903825
Twist_Exome_Core_Covered_Targets_hg19_liftover.bed
Twist_Exome_Core_Covered_Targets_hg38.bed


## **Task**: Run the command above. Do you see files ending in `.fasta`, `.fastq`, `.gff` (or `.gff3`), and `.bed`? If not, let your instructor know.



## Resources

Here are the command-line tools we’ll use (all standard and pre-installed):
- `cat`: Concatenates and displays file contents.
- `less`: Views files page by page (use `q` to quit).
- `grep`: Searches for patterns in files.
- `sort`: Sorts lines alphabetically or numerically.
- `uniq`: Removes or counts duplicate lines (use with `sort`).
- `file`: Identifies file types.
- `awk`: Processes text and extracts fields.
- `head`: Shows the first few lines of a file.
- `tail`: Shows the last few lines of a file.
- `wc`: Counts lines, words, or characters.
- `gzip`/`gunzip`: Compresses or decompresses files (e.g., `.gz`).

**External Resources**:
- [BED Format FAQ](https://genome.ucsc.edu/FAQ/FAQformat.html)
- [Learning Bioinformatics at Home](https://github.com/harvardinformatics/learning-bioinformatics-at-home)



## Workflow: Using Command-Line Tools

Let’s build a workflow for exploring files. Try these commands step-by-step:

## 1. Check Documentation

In [None]:
man grep  #Shows the manual for grep

In [None]:
ls --help  #Displays help for ls

## 2. Navigate Directories
Check your location and move around:
You should be in `/home/user/CM515-course-2025/modules/07_BioFile_Formats`.

In [None]:
pwd       #Confirm your location

In [None]:
cd Data/  #Enter the Data directory
cd ..     #Return to the module directory; the commands below use paths that start with Data/

## 3. List Files
Explore the "Data" directory:

In [None]:
ls Data/          #List files

In [None]:
ls -lh Data/      #Long listing with human-readable sizes

In [None]:
ls -lt Data/      #Sort by modification time (newest first)

## 4. Check File Types
Identify what kind of files you’re working with:

In [None]:
file Data/Animals.fasta  #Example output: ASCII text

## 5. Check File Sizes
Assess file sizes to understand your data:

In [7]:
ls -lh Data/  #Human-readable sizes

.rw-r--r-- jake jake  13 KB Sat Feb  1 21:13:34 2025  Animals.fasta
.rw-r--r-- jake jake  28 KB Wed Jan 29 11:24:21 2025  Bio_data
.rw-r--r-- jake jake  28 KB Sat Feb  1 22:20:24 2025  Biology_protein_data.fasta
.rw-r--r--

In [8]:
du -h Data/Animals.fasta  #Disk space used by one file; du counts allocated blocks, so it can report more than ls

16K	Data/Animals.fasta


**Task**: Write a command to list files in "Data/" sorted by reverse modification time. Run it below:

In [None]:
# Your command here




## Task 1: Describe the Tools

Use `man` or `--help` to write a short sentence describing each tool’s purpose. Example:
- `cat`: "Displays the entire contents of a file to the screen."

Fill in the rest:



In [None]:
# cat - "Displays the entire contents of a file to the screen."
# grep -
# less -
# head -
# tail -
# file -
# awk -
# wc -
# sort -
# uniq -
# gzip -
# gunzip -



# **Hint**: Run `man grep` or `grep --help` to get started. After completing this, compare with a classmate.



## Piping Commands Together

Piping (`|`) lets you chain commands. Here’s an example with a FASTA file:



In [None]:
# Example: Exploring FASTA Headers

grep "^>" Data/Animals.fasta  # Extract headers (lines starting with ">")

In [None]:
# - Count sequences:

grep "^>" Data/Animals.fasta | wc -l

In [None]:
# - Sort headers alphabetically:

grep "^>" Data/Animals.fasta | sort

In [None]:
# - Count how often each distinct header occurs:

grep "^>" Data/Animals.fasta | sort | uniq -c

# **Task**: Run the commands above. What do the outputs tell you about `Animals.fasta`?
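
The same pipe pattern can answer a slightly different question: are any headers repeated? The sketch below is one way to check, using a small made-up FASTA file (`demo.fasta` and its contents are hypothetical, created in a temporary directory so nothing in `Data/` is touched).

```bash
# Build a tiny FASTA file in a temporary directory (hypothetical demo data).
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq1\nTTAA\n' > "$tmpdir/demo.fasta"

# Count all header lines.
total=$(grep -c "^>" "$tmpdir/demo.fasta")

# Count distinct headers: sort -u keeps one copy of each line.
distinct=$(grep "^>" "$tmpdir/demo.fasta" | sort -u | wc -l)

# uniq -d prints only the lines that occur more than once (input must be sorted).
duplicated=$(grep "^>" "$tmpdir/demo.fasta" | sort | uniq -d)

echo "total=$total distinct=$distinct duplicated=$duplicated"
rm -r "$tmpdir"
```

If `total` and `distinct` differ, at least one header is repeated, and `uniq -d` tells you which one.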



## Exploring Bioinformatics File Formats

Now, let’s dive into specific file formats. We’ll check if files are compressed, peek at their contents, and extract useful information.



### FASTA Files (e.g., `*.fasta`, `*.fa`, `*.fna`)

FASTA files store sequences (DNA, RNA, or protein) with headers starting with `>`.



#### Example:
```
>Gene1_danio_rerio
atcgtagctcagcagacatcgtagctcagcagacatcgtagctcagcagacatcgtagctcagcagacatcgtagctcagcagac
```

In [11]:
#Step 1: Check the File

file Data/Pan_paniscus.cds.fa.gz

Data/Pan_paniscus.cds.fa.gz: cannot open `Data/Pan_paniscus.cds.fa.gz' (No such file or directory)


In [None]:
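# If `file` reports "gzip compressed data", decompress the file first.
# If it reports "No such file or directory" (as above), the file has already been
# decompressed to Data/Pan_paniscus.cds.fa, and gunzip fails the same way:
gunzip Data/Pan_paniscus.cds.fa.gz

gzip: Data/Pan_paniscus.cds.fa.gz: No such file or directory
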
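
You do not always have to decompress a file to look inside it. As a minimal sketch (the file `demo.fa` is hypothetical and created on the fly), `gzip -dc` writes the decompressed text to standard output, so it can be piped into the same tools while the `.gz` file stays untouched:

```bash
# Create and compress a small demo FASTA file (hypothetical data).
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$tmpdir/demo.fa"
gzip "$tmpdir/demo.fa"    # produces demo.fa.gz and removes demo.fa

# -d decompresses, -c sends the result to standard output instead of a file.
first_line=$(gzip -dc "$tmpdir/demo.fa.gz" | head -n 1)
n_seqs=$(gzip -dc "$tmpdir/demo.fa.gz" | grep -c "^>")

echo "first line: $first_line, sequences: $n_seqs"
rm -r "$tmpdir"
```

This is handy for large files such as whole chromosomes, where a decompressed copy would take up a lot of disk space.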

In [None]:
# Step 2: Peek at the Contents

head -n 10 Data/Pan_paniscus.cds.fa

# Notice the `>` headers and sequence lines. Are they wrapped (broken into fixed-length lines)?

In [None]:
# Step 3: Count Sequences

grep -c "^>" Data/Pan_paniscus.cds.fa

In [None]:
# Step 4: Calculate Sequence Lengths

# Print the length of every sequence line.
# If the sequences are wrapped, these are line lengths, not full sequence lengths.
grep -v "^>" Data/Pan_paniscus.cds.fa | awk '{print length($0)}'



# **Task**: What’s the longest sequence in this file? Modify the command above to find out.


In [3]:
# Solution:
grep -v "^>" Data/Pan_paniscus.cds.fa | awk '{print length($0)}' | sort -n | tail -n 1


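# **Why?** This command filters out the headers (lines starting with `>`) and then calculates the length of each remaining line. `sort -n` sorts the lengths numerically, and `tail -n 1` keeps the largest one. Note that this is the longest *line*: it equals the longest sequence only when each sequence sits on a single line.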
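
If the sequences turn out to be wrapped, one possible approach is to add up the line lengths for each record until the next header appears. The sketch below compares both approaches on a small made-up file (`wrapped.fa` is hypothetical); it is an illustration, not the only way to do this.

```bash
tmpdir=$(mktemp -d)
# seq1 is 10 bases wrapped over three lines; seq2 is 6 bases on one line.
printf '>seq1\nACGT\nACGT\nAC\n>seq2\nGGCCGG\n' > "$tmpdir/wrapped.fa"

# Longest single line (what the line-based command reports).
longest_line=$(grep -v "^>" "$tmpdir/wrapped.fa" | awk '{print length($0)}' | sort -n | tail -n 1)

# Longest full sequence: accumulate line lengths per record.
longest_seq=$(awk '
    /^>/ { if (len > max) max = len; len = 0; next }   # header: close the previous record
         { len += length($0) }                          # sequence line: extend the current record
    END  { if (len > max) max = len; print max }        # close the final record
' "$tmpdir/wrapped.fa")

echo "longest line: $longest_line, longest sequence: $longest_seq"
rm -r "$tmpdir"
```

Here the line-based count reports 6, while the record-based count reports 10, the true length of `seq1`.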





### FASTQ Files (e.g., `*.fastq`, `*.fq`)

FASTQ files store sequences with quality scores (4 lines per record: header, sequence, `+`, quality).

#### Example: 
```
@SRR22903825.40838091 40838091 length=101
CCCAGCGAGAGGGCGTACTCTACCCCAGAGAGGGAAACACCATGCCCACAGTGCTTGGTTTTGCACTCAGGTGTGCGGGCAGCACAGCAGGCCTCACCTTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F
```

#### Step 1: Check and Peek

In [13]:
file Data/Covid_1.fastq

Data/Covid_1.fastq: ASCII text


In [14]:
head -n 8 Data/Covid_1.fastq



@SRR22903825.40838091 40838091 length=101
CCCAGCGAGAGGGCGTACTCTACCCCAGAGAGGGAAACACCATGCCCACAGTGCTTGGTTTTGCACTCAGGTGTGCGGGCAGCACAGCAGGCCTCACCTTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F
@SRR22903825.57624042 57624042 length=101
GTAGCTGTGCTGTGCTAACTGCCTATTCGCTTTGTTTCTTTCAAGAACACTTTTGAAAGGACAAGAAAACGTAAAAGGCAGCAGAACAGAACAGATCGGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF


#### Step 2: Count Sequences
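Headers start with `@`, so counting those lines gives a first estimate. Quality lines can also begin with `@`, so treat the result as an upper bound: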




In [None]:
grep -c "^@" Data/Covid_1.fastq
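
Because `@` is also a valid quality character, a count based on `^@` can overshoot. A more robust sketch, assuming every record occupies exactly 4 lines, divides the line count by 4. The demo file below is hypothetical and built so that one quality line starts with `@`:

```bash
tmpdir=$(mktemp -d)
# Two records; the second quality string begins with "@".
printf '@read1\nACGT\n+\nFFFF\n@read2\nGGCC\n+\n@FFF\n' > "$tmpdir/demo.fastq"

# Counting lines that start with "@" also picks up the quality line.
by_grep=$(grep -c "^@" "$tmpdir/demo.fastq")

# NR is the total number of lines once awk reaches END.
by_lines=$(awk 'END {print NR / 4}' "$tmpdir/demo.fastq")

echo "grep: $by_grep, lines/4: $by_lines"
rm -r "$tmpdir"
```

If the two numbers disagree on a real file, the line-based count is the one to trust, provided the total line count is a multiple of 4.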



#### Step 3: Extract Sequence Lengths
Sequences are the 2nd line of each 4-line block:


In [None]:
awk 'NR%4==2 {print length($0)}' Data/Covid_1.fastq
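
Printing one length per read produces a lot of output for a real FASTQ file. As a sketch of how to summarize it instead, the example below tallies how many reads have each length, using a small hypothetical file:

```bash
tmpdir=$(mktemp -d)
# Three reads: two of length 4 and one of length 6 (hypothetical data).
printf '@r1\nACGT\n+\nFFFF\n@r2\nACGTAC\n+\nFFFFFF\n@r3\nGGCC\n+\nFFFF\n' > "$tmpdir/demo.fastq"

# n[length]++ counts reads per length; the END block prints "length count" pairs.
dist=$(awk 'NR % 4 == 2 {n[length($0)]++} END {for (l in n) print l, n[l]}' "$tmpdir/demo.fastq" | sort -n)

echo "$dist"
rm -r "$tmpdir"
```

Each output line shows a read length followed by the number of reads with that length.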



**Task**: How would you count sequences with length > 100? Hint: Add a condition to the `awk` command.



### GFF Files (e.g., `*.gff`, `*.gff3`)

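GFF files store genomic features in nine tab-separated columns: sequence ID (e.g., chromosome), source, feature type, start, end, score, strand, phase, and attributes. Lines starting with `#` are comments or directives.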

#### Step 1: Peek



In [None]:
head -n 10 Data/example.gff



#### Step 2: Extract Genes
Filter for "gene" features (column 3):

In [None]:
grep -v "^#" Data/example.gff | awk '$3=="gene"'



#### Step 3: Calculate Feature Lengths
Length = end (col 5) - start (col 4) + 1 (GFF coordinates are 1-based and include both endpoints):

In [None]:
grep -v "^#" Data/example.gff | awk '$3=="gene" {print $5-$4+1}'
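
By default `awk` splits on any whitespace, which works for columns 1 to 8 but can cut the attributes column (column 9) short when it contains spaces. A minimal sketch of the difference, using a hypothetical one-feature file, is to set the field separator to a tab with `-F '\t'`:

```bash
tmpdir=$(mktemp -d)
# One gene whose attributes column contains spaces (hypothetical data).
printf '##gff-version 3\n1\tdemo\tgene\t100\t500\t.\t+\t.\tID=gene1;description=heat shock protein\n' > "$tmpdir/demo.gff"

# Default splitting: column 9 stops at the first space.
default_split=$(grep -v "^#" "$tmpdir/demo.gff" | awk '$3 == "gene" {print $9}')

# Tab splitting: column 9 is the full attributes string.
tab_split=$(grep -v "^#" "$tmpdir/demo.gff" | awk -F '\t' '$3 == "gene" {print $9}')

echo "default: $default_split"
echo "tab:     $tab_split"
rm -r "$tmpdir"
```

The length calculation above is unaffected, since it only uses columns 3 to 5, but `-F '\t'` is the safer habit whenever you need the attributes.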



# **Task**: How many genes are in this file? Combine `grep` and `wc`.



### BED Files (e.g., `*.bed`)

BED files describe genomic regions with at least 3 columns (chromosome, start, end).


#### Step 1: Peek

In [2]:
head -n 10 Data/example.bed



browser position ch1r:923928-127477000					
browser hide all					
track name="FRA1A_chr1_gene" description="FRA1A_gene_info" visibility=2 colorByStrand="255,0,0 0,0,255"					
chr1	3967618	23980294	SRSF10	0	-
chr1	2404819	2412571	PEX10	0	-
chr1	1915108	1917294	CALML6	0	+
chr1	6448467	6460052	ESPN	0	+
chr1	26294205	26306636	UBXN11	0	-
chr1	15410265	15429081	EFHD2	0	+
chr1	11674525	11691477	MAD2L2	0	-


#### Step 2: Calculate Region Lengths
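Length = end (col 3) - start (col 2). BED starts are 0-based and ends are exclusive, so no `+ 1` is needed (unlike GFF, which is 1-based and inclusive). Header lines (`browser`, `track`) contain no coordinates, so `awk` prints 0 for them: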



In [3]:
awk '{print $3-$2}' Data/example.bed



0
0
0
20012676
7752
2186
11585
12431
18816
16952
5238
29935
2039
2813
12483
1294
14624
21959
2476
27790
13665
10874
49279
17997
43213
18028
6838
22440
3029
6737
2122
16857
7821
15363
6298
2569
4420
18287
30113
18397
411871
13031
5988
78390
10497
2466
18059
27759
30775
6738
244524
13044
20526
8557
21134
2385
12557
12830
44330
20850
2228
33962
77174
49482
3519
11436
9472
105132
2984
14017
11910
40832
42620
5835
28318
54374
3584
12615
1427
4234
21592
2475
3610
7741
16610
2052
6120
71119
56663
5077
40343
8783
145947
10429
31595
7682
5703
33089
5851
169987
23820
931
1853
6889
61156
7677
3190
15411
51168
10073
46216
80079
1835
14254
21636
12406
4985
84619
24041
202778
2021
24739
7935
17747
3877
21338
46842
9466
7099
5334
4977
2786
12295
142786
1732
5566
25388
25870
6167
2993
5790
3986
1493
61969
4782
19613
20634
1395
509
22942
11906
66769
16171
4408
29885
23356
2607
8485
21878
20976
149437
2345
22226
79552
4395
69556
3752
40404
26601
47107
8382
14744
13403
42787
9576
57173
60523
22553
8767
1
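
The zeros at the top of the output come from the `browser` and `track` header lines. One way to leave them out, sketched here on a small hypothetical file, is to skip lines that start with those keywords and to print the region name next to its length:

```bash
tmpdir=$(mktemp -d)
# Two header lines followed by two regions (hypothetical data).
printf 'browser hide all\ntrack name="demo"\nchr1\t100\t250\tGENE_A\t0\t+\nchr1\t400\t1000\tGENE_B\t0\t-\n' > "$tmpdir/demo.bed"

# Without filtering, each header line yields a 0.
all_lines=$(awk '{print $3 - $2}' "$tmpdir/demo.bed")

# Skip header and comment lines, and require at least 3 tab-separated fields.
data_only=$(awk -F '\t' '!/^(browser|track|#)/ && NF >= 3 {print $4, $3 - $2}' "$tmpdir/demo.bed")

echo "$all_lines"
echo "$data_only"
rm -r "$tmpdir"
```

Keeping the name in the output makes it easier to see which region each length belongs to once the list is sorted.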

**Task**: Sort the regions by length (largest first). Hint: Use `sort -n`, and add `-r` to reverse the order.



## Wrap-Up

You’ve explored FASTA, FASTQ, GFF, and BED files using standard tools. Key takeaways:
- Use `file`, `head`, and `grep` to understand file structure.
- Chain commands with `|` to analyze data efficiently.
- Check documentation when stuck (`man`, `--help`).

**Next Steps**: Try these techniques on your own datasets.



## Advanced Section (Optional)

For advanced users, install tools like `bioawk` or `seqtk` using a conda environment:

### Set Up Conda
```bash
conda env create -f environment.yml
conda activate Bio
```

### Example with bioawk
Calculate mean Phred scores for FASTQ:
```bash
bioawk -c fastx '{print $name, meanqual($qual)}' Data/Covid_1.fastq
```

This section is optional and requires installation, so it’s separate from the main exercises.
