# Working directly with files and directories via CLI

I hope everyone is feeling more comfortable with coding and using the command-line interface! Now, let's explore how we can interact directly with plain text files, such as .txt, .tsv, and .csv, using the command line. This skill is especially valuable when working with genomic data, where you often need to count the number of FASTA sequences, inspect file contents, or even make quick edits. The command line provides powerful tools to accomplish all of these tasks efficiently—let’s dive in! 

*Remember, if you need more help with any of these commands, most of them have a built-in help menu that you can open by providing ```--help``` (or, for some commands, ```-h```) as the only argument.*

**I want to note that many of these commands will not work on files that are not plain text. Some examples of common file types that are not plain-text files would be “.docx”, “.pdf”, or “.xlsx”. This is because those file formats contain special types of compression and formatting information that are only interpretable by programs specifically designed to work with them, like Microsoft Word/Excel.**

### **I also want to note that some of these commands may not work in the Jupyter Notebook. Please open your own terminal from the "Other" tab in the Launcher menu and follow along there.**

## Viewing file contents
Now that we are experts in navigating around our directories, let's make our way to the **unix_intro** folder. 

Say I want to see the entire contents of the ***example.txt*** file in our working directory? To do this, I would run the **```cat```** command, which will print and display the entire file content. Let's try it below:

In [None]:
cat example.txt

In using the ```cat``` command, you will see the entire contents of the text file. Let's number the file lines using the ```-n``` flag.

In [None]:
cat -n example.txt

Now we have numbered all of the lines, so we know exactly how many lines/rows we have in our file. 
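
If you would like to try ```cat -n``` on a file where you already know the answer, here is a minimal sketch. It does not use the course files: the file name `three_lines.txt` and the temporary practice folder are made up for illustration. It also uses `>` (send output into a file), `|` (pass output to another command), and `$( )` (store a command's output in a variable), which we have not covered yet.

```shell
# Make a temporary practice folder so we don't touch any course files.
workdir=$(mktemp -d)

# Write a small three-line file.
printf 'first line\nsecond line\nthird line\n' > "$workdir/three_lines.txt"

# Print the file with a line number in front of every line.
numbered=$(cat -n "$workdir/three_lines.txt")
echo "$numbered"

# The number on the last line is the total number of lines in the file.
last_numbered=$(echo "$numbered" | tail -n 1)
echo "$last_numbered"

# Clean up the practice folder.
rm -r "$workdir"
```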

What if we just wanted to see the first few lines of our file? Then, we would use the **```head```** command. Let's try it out:

In [None]:
head example.txt

What about if we wanted to see the last few lines? We would use the **```tail```** command.

In [None]:
tail example.txt

By default, ```head``` and ```tail``` will show the first and last 10 lines of a file, respectively. If we want to see a specific number of lines, we can specify it using the ```-n``` flag. 
Let's say I only want to see the first 5 lines of the **example.txt** file. We would run:

In [None]:
head -n 5 example.txt

And what about the last 5 lines?

In [None]:
tail -n 5 example.txt
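
To make the effect of ```-n``` easy to check, here is a small sketch that builds its own practice file instead of using **example.txt**. The file name `numbers.txt` and the temporary folder are assumptions made for this example. Because the file simply holds the numbers 1 through 20, you can predict exactly what each command should print.

```shell
# Make a temporary practice folder so we don't touch any course files.
workdir=$(mktemp -d)

# Build a 20-line file containing the numbers 1 through 20, one per line.
seq 1 20 > "$workdir/numbers.txt"

# First 3 lines of the file (1, 2, 3).
first_three=$(head -n 3 "$workdir/numbers.txt")
echo "$first_three"

# Last 3 lines of the file (18, 19, 20).
last_three=$(tail -n 3 "$workdir/numbers.txt")
echo "$last_three"

# With no -n flag, head falls back to its default of 10 lines.
default_count=$(head "$workdir/numbers.txt" | wc -l)
echo "$default_count"

# Clean up the practice folder.
rm -r "$workdir"
```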

**Great work everyone! These are some of the most basic commands for looking into your files, and some of the commands we use most often in bioinformatics to inspect our data.**

Let's say we have a very large file we want to inspect. If we try to use the ```cat``` command, it may take up a lot of space in our CLI, or we may crash our environment trying to load it in. In this case, we may use something like the **```less```** or **```more```** commands to inspect the files.

### ```less``` command
The ```less``` command allows you to view a file page by page. Let's try it out on the **example.txt** file first (if the notebook spits out the entire file, please move to a new terminal session to visualize). 

In [None]:
less example.txt

What you should be seeing is the first screenful of our **example.txt** file (roughly the first 50 lines, depending on the size of your terminal window). At the bottom of the page, you will see a blinking black square. Hit the down arrow on your keyboard one time. Now hit the up arrow. 

As you can see, we can navigate around our files this way. We can also scroll on our mouse to move among the data in our file. This is a great way to look at data in our files, without having to:
* Open it in GUI (when we may not even have an application on our computer that can read the file)
* Load in all of the data using ```cat```

After using the `less` command, you might notice that your terminal doesn’t immediately return to a new command line prompt. This is because `less` allows you to scroll through the file interactively, and it remains open until you explicitly close it. To exit `less` and return to the command prompt, simply press the **"q"** key (for "quit"). That’s it—just press **q**!

I want to show you what the ```less``` command looks like on a larger file. To do this, let's navigate to the **~/six_commands** directory and try it out on the **genes_and_seqs.tsv** file. 

In [None]:
less genes_and_seqs.tsv

***Here, we can see that the ```less``` command displays three columns from our dataset: the gene identifier (gene_ID), the length of the sequence (AA_length), and the protein sequence that corresponds to that gene (seq).***

##### ```more``` command
The ```more``` command is very similar to the ```less``` command, but it really only allows for forward navigation. Feel free to try it out in the command line, but since it is limited compared to ```less```, I tend not to use it. 

### Word count- ```wc```

##### Basic Function
```wc``` counts lines, words, and characters in text files. Its basic syntax is: ```wc [options] [file(s)]```. Let's try it on the **example.txt** file.

In [None]:
wc example.txt

The ```wc``` command outputs three different numbers:
1. Number of lines
2. Number of words
3. Number of bytes (characters)

So in the **example.txt** file, we have 107 lines, 438 words, and the file is 1778 bytes. 

##### Common Options
* ```-l```: Count only lines
* ```-w```: Count only words
* ```-c```: Count bytes
* ```-m```: Count characters
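
Here is a minimal sketch of the four options above, run on a tiny file whose counts you can verify by eye. The file name `codons.txt` and its contents are made up for this example. One assumption to flag: I feed the file to ```wc``` with `<` so that it prints only the number, without the file name after it.

```shell
# Make a temporary practice folder so we don't touch any course files.
workdir=$(mktemp -d)

# Two lines, five words, twenty bytes (letters, spaces, and line endings).
printf 'ATG CCA\nGGT TAA TGA\n' > "$workdir/codons.txt"

# -l counts lines.
line_count=$(wc -l < "$workdir/codons.txt")

# -w counts words (anything separated by spaces or line breaks).
word_count=$(wc -w < "$workdir/codons.txt")

# -c counts bytes.
byte_count=$(wc -c < "$workdir/codons.txt")

# -m counts characters; for plain English letters this matches the byte count.
char_count=$(wc -m < "$workdir/codons.txt")

echo "$line_count lines, $word_count words, $byte_count bytes, $char_count characters"

# Clean up the practice folder.
rm -r "$workdir"
```

The byte and character counts only differ when a file contains characters that take more than one byte to store, such as accented letters.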

This command has many applications in bioinformatics research. What if we wanted to analyze the length of a gene? We could run something like the following, where **gene.txt** is a hypothetical file containing a gene sequence:

In [None]:
wc -l gene.txt

This tells us the number of lines we have in this specific gene file, which gives us a rough estimate of the length of the gene (rough because it depends on how many bases are written on each line). This is useful for quickly assessing the size of genetic data files.

For this next exercise, I want everyone to take a quick peek at the **gene_annotations.tsv** file.

In [None]:
head -n 5 gene_annotations.tsv

The **gene_annotations.tsv** file contains four columns describing hypothetical gene information from our dataset. The first column lists gene identifiers, sequentially numbering the genes as "gene1," "gene2," "gene3," and so on.
*Instead of opening the file manually (using a GUI or the `less` command), we can count the number of lines with the `wc` command to quickly determine how many genes were identified in our study:*

In [None]:
wc -l gene_annotations.tsv

Subtracting the header, we see that we have characterized 100 genes! Neat, right?
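
If you would rather have the shell do the "minus one for the header" arithmetic, here is one possible sketch. It uses a made-up three-gene table (`mini_annotations.tsv`, with invented genome names) rather than the real **gene_annotations.tsv**, and it relies on `$(( ))`, the shell's built-in arithmetic, which we have not covered.

```shell
# Make a temporary practice folder so we don't touch any course files.
workdir=$(mktemp -d)

# A tiny tab-separated table: one header line plus three gene lines.
printf 'gene_ID\tgenome\ngene1\tgenomeA\ngene2\tgenomeB\ngene3\tgenomeA\n' > "$workdir/mini_annotations.tsv"

# Count every line in the file, header included.
total_lines=$(wc -l < "$workdir/mini_annotations.tsv")

# Subtract 1 so the header is not counted as a gene.
gene_count=$(( total_lines - 1 ))
echo "$gene_count genes"

# Clean up the practice folder.
rm -r "$workdir"
```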

### Searching for patterns- ```grep```

When searching for specific words or phrases in a Word document or PDF, my go-to method is **CTRL+F**, which quickly locates the information I need without scrolling through unnecessary content. **CTRL+F** is the graphical interface (GUI) tool for pattern searching in documents. In the command-line interface (CLI), we use the **`grep`** command to achieve the same functionality when working with plain text files. This tool is very versatile, so let's take our time learning it. 

##### Basic Function
```grep``` stands for "globally search for a regular expression and print matching lines". It searches for specific patterns in files and displays lines containing those patterns.

##### Syntax
The basic syntax of grep is:
```grep [options] pattern [file(s)]```

##### Common Options
* ```-i```: Ignore case distinctions

* ```-v```: Invert the match (show lines that don't match)

* ```-c```: Count matching lines instead of printing them

* ```-n```: Display line numbers along with matching lines

* ```-w```: Match whole words only
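
To see how each of these options changes the result, here is a small sketch on a five-line practice file. The file `demo_colors.txt` and its contents are made up for this example (it is not the course's **colors.txt**), chosen so that each flag gives a different answer.

```shell
# Make a temporary practice folder so we don't touch any course files.
workdir=$(mktemp -d)

# Five lines: note the capitalized "Blue" and the longer word "bluebird".
printf 'red\nBlue\ngreen\nblue\nbluebird\n' > "$workdir/demo_colors.txt"

# -i ignores case, so "Blue", "blue", and "bluebird" all match.
ignore_case=$(grep -i blue "$workdir/demo_colors.txt")

# -v inverts the match: lines that do NOT contain lowercase "blue".
inverted=$(grep -v blue "$workdir/demo_colors.txt")

# -c counts matching lines ("blue" and "bluebird").
count_blue=$(grep -c blue "$workdir/demo_colors.txt")

# -n puts the line number in front of each match.
numbered=$(grep -n green "$workdir/demo_colors.txt")

# -w matches whole words only, so "bluebird" is left out.
whole_word=$(grep -w blue "$workdir/demo_colors.txt")

echo "$ignore_case"
echo "$inverted"
echo "$count_blue"
echo "$numbered"
echo "$whole_word"

# Clean up the practice folder.
rm -r "$workdir"
```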

##### Bioinformatics Applications
In bioinformatics, we use ```grep``` extensively for multiple different purposes. These include:
* Searching Sequence Files
* Extracting FASTA Headers
* Counting Sequences

Let's try it with our files!

First off, let's say I want to see whether "blue" appears in my **colors.txt** file. To do this, we would simply run: 

In [None]:
grep blue colors.txt

This should print out the lines that contain what you searched for. What if what we are searching for has multiple hits in a single file?

In [None]:
grep re colors.txt

You'll see that grep gave us two outputs: **red** and **green**, as they both contain **re**. 

Let's try with another file, **example.txt**

In [None]:
grep data example.txt

From this output, we can see that the word **"data"** appears on four lines of our file. Notice that instead of simply displaying the number of matches, the command prints each entire line where the word appears. For example, if the file contains the sentence **"This is a pretend data file,"** that full line will be shown in the results. 

What if we want to see the line numbers on which the word **"data"** occurs? We would use the ```-n``` flag:

In [None]:
grep -n data example.txt

Now we have the line numbers on which **data** occurs. 

What if we ***just*** wanted the number of lines on which **data** occurs, without seeing the full text of each line? We would use the **```-c```** flag:

In [None]:
grep -c data example.txt

This just prints out the number of lines in the file that contain **data**. If a single line contained **data** twice, it would still only be counted once.

A great use for ```grep -c``` is counting the number of sequences you may have in a .fasta file. In a .fasta file, each sequence is introduced by a header line that gives you general information about the sequence. A header *always* starts with **'>'**. An example of this looks like:

    >chr1 Jackalope chromosome 1;length=7
    GATTACA

So by using the ```grep``` command with the '>' symbol, we can count how many sequences are in a .fasta file. Let's try it with the **sequence.fasta** file. 

In [None]:
grep -c ">" sequence.fasta

This should tell you that there are 2 sequences in the **sequence.fasta** file. Let's take a look to make sure that it's true!

In [None]:
less sequence.fasta

Great! We see that we do indeed have 2 sequences. 
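
Here is a self-contained sketch of the same idea on a made-up three-sequence file (`demo.fasta`, not the course's **sequence.fasta**). It also shows two details worth knowing. First, the quotes around `">"` matter: without them, the shell treats `>` as "send output into a file" instead of passing it to ```grep```. Second, `"^>"` is a slightly stricter pattern that only matches a `>` at the very start of a line.

```shell
# Make a temporary practice folder so we don't touch any course files.
workdir=$(mktemp -d)

# Three made-up sequences; the second one is split across two lines.
printf '>seq1 made-up example\nGATTACA\n' > "$workdir/demo.fasta"
printf '>seq2 made-up example\nACGT\nACGT\n' >> "$workdir/demo.fasta"
printf '>seq3 made-up example\nTTTT\n' >> "$workdir/demo.fasta"

# Print only the header lines.
headers=$(grep ">" "$workdir/demo.fasta")
echo "$headers"

# Count the header lines, which equals the number of sequences.
header_count=$(grep -c ">" "$workdir/demo.fasta")
echo "$header_count"

# Stricter version: only count lines that START with ">".
anchored_count=$(grep -c "^>" "$workdir/demo.fasta")
echo "$anchored_count"

# Clean up the practice folder.
rm -r "$workdir"
```

Notice that the count is 3 even though the file has 7 lines, which is why counting headers is more reliable than counting lines for FASTA files.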

Okay, let's apply this to some more sequence data. Go ahead and move back into your ~/six_commands working directory.

Now, I want you guys to go ahead and look at the header for the **gene_annotations.tsv** file. Remember, we use the **```head```** command to look at the first lines of a file:

In [None]:
head -n 1 gene_annotations.tsv

Does everyone see that we have 4 columns in this file? We have: *gene_ID*, *genome*, *KO_ID*, and *KO_annotation*
(KO stands for KEGG Orthology. A KO ID is an identifier used in the KEGG (Kyoto Encyclopedia of Genes and Genomes) database to classify genes and proteins based on their functions. We can go over this in more detail if need be.)

Let's say that we are hypothetically interested in genes involved with sulfide oxidation, which is the process of turning H2S (sulfide) into elemental sulfur. Using the KEGG pathway database (we will go over this another time), I see that there is one gene involved in that process: **K17229**.

Let's see if we have this gene in our table!

In [None]:
grep K17229 gene_annotations.tsv

Since this command gave us nothing back, we can assume that this gene is not in our annotated file and was therefore not characterized in our hypothetical study. What if we were instead interested in some of the photosystem II genes that encode important proteins for photosynthesis? I can see from KEGG that I am interested in these KEGG IDs:

* K02703
* K02706
* K02705
* K02704
* K02707
* K02708

**With the ```-e``` flag, we can search for multiple patterns at a time in a file with a single command.** So let's search for them and see what we get! 

In [None]:
grep -e K02703 -e K02706 -e K02705 -e K02704 -e K02707 -e K02708 gene_annotations.tsv

From this, we can see that we have 2 photosystem II genes in our hypothetical study:

* 39      GEYO    K02704  psbB; photosystem II CP47 chlorophyll apoprotein
* 82      UW179A  K02705  psbC; photosystem II CP43 chlorophyll apoprotein
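
If you want to practice the ```-e``` flag on something small, here is a minimal sketch. Everything in the table is invented for illustration: the file name `mini_annotations.tsv`, the genome names, and the IDs `K90001` through `K90004` are placeholders, not real KEGG lookups.

```shell
# Make a temporary practice folder so we don't touch any course files.
workdir=$(mktemp -d)
table="$workdir/mini_annotations.tsv"

# A tiny tab-separated table: one header line plus four placeholder genes.
printf 'gene_ID\tgenome\tKO_ID\tKO_annotation\n' > "$table"
printf 'gene1\tgenomeA\tK90001\tplaceholder annotation\n' >> "$table"
printf 'gene2\tgenomeB\tK90002\tplaceholder annotation\n' >> "$table"
printf 'gene3\tgenomeA\tK90003\tplaceholder annotation\n' >> "$table"
printf 'gene4\tgenomeB\tK90004\tplaceholder annotation\n' >> "$table"

# Each -e adds one more pattern; a line is printed if it matches ANY of them.
matches=$(grep -e K90002 -e K90003 "$table")
echo "$matches"

# -e can be combined with other flags, such as -c to count the matching lines.
match_count=$(grep -c -e K90002 -e K90003 "$table")
echo "$match_count"

# Clean up the practice folder.
rm -r "$workdir"
```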

***Great work everyone! Getting the hang of these commands is tough. Remember that we are not here to memorize every detail about how to use the CLI. We are just getting the hang of the tools and commands we can use in bioinformatics.***

### Creating new files- ```touch```

The ```touch``` command is primarily used for creating empty files and updating file timestamps. Here's an overview of its key features and usage:

##### Basic Function
The touch command can:
1. Create new empty files
2. Update timestamps of existing files

The basic syntax for ```touch``` is: ```touch [options] file(s)```

For example, let's say we want to create a new file in our **dog_pics** directory. We would first make **dog_pics** our working directory, then run:

In [None]:
touch newfile.txt

As you can see, this creates an empty text file named newfile.txt. If the file already exists, it only updates its timestamp.

If we wanted to create or update multiple files in one command, we would run:

In [None]:
touch file1.txt file2.txt file3.txt
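
Here is a small sketch that checks both behaviors of ```touch```: new files start out empty, and touching a file that already exists leaves its contents alone. It works in a temporary practice folder (an assumption of this example) rather than in **dog_pics**.

```shell
# Make a temporary practice folder so we don't touch any course files.
workdir=$(mktemp -d)

# Create two new files with one command.
touch "$workdir/file1.txt" "$workdir/file2.txt"

# A file made by touch is empty: wc -c reports 0 bytes.
empty_size=$(wc -c < "$workdir/file1.txt")
echo "$empty_size"

# Put some text into file2.txt, then touch it again.
echo "keep me" > "$workdir/file2.txt"
touch "$workdir/file2.txt"

# The text is still there; only the timestamp was updated.
kept=$(cat "$workdir/file2.txt")
echo "$kept"

# List the folder to confirm both files exist.
file_list=$(ls "$workdir")
echo "$file_list"

# Clean up the practice folder.
rm -r "$workdir"
```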

### A terminal text editor & creating new files- ```nano```


***The ```nano``` command is more powerful overall than the ```touch``` command, as you can create, open, and edit plain text files directly in the command line.*** This is one of the most useful commands we have in bioinformatics, as we can edit/update text files and/or shell scripts using ```nano```. Let's take a look.

To open or create a file, the basic syntax is: ```nano filename```

I want you to navigate to your **unix_intro** directory and look at your **example.txt** file:

In [None]:
cd ~/unix_intro/

In [None]:
less example.txt

Let's say I want to get rid of all of the lines that say something like: "I have data in here" or "These could be fasta sequences." I could do it via GUI, but that would be time consuming, especially when we start working in LEAP2 (our HPC). However, we can use ```nano``` to open the file up directly in our CLI and edit the file here. 

**Unfortunately, the ```nano``` command does not work directly within a Jupyter notebook because Jupyter provides its own web-based interface for editing and running code, which is separate from command-line text editors like ```nano```.** We will go over ```nano``` in class. 

### Delete files/directories- ```rm```

A lot of times when you're working with files as large as .fasta files can be, or when you're working on an HPC, it isn't easy to delete files via GUI. It can take too long or be too power-intensive, especially when you need to work on the HPC. To work around this, we often use the remove ```rm``` command. 

#### ```rm``` Command
The ```rm``` command is used to delete files and directories. The basic syntax is: ```rm [option] [file]```. By default, it will not delete directories. Be careful: files deleted with ```rm``` do not go to a trash folder, so they are not easy to get back.

Let's first start out with removing a couple of the **dog_pics** files for practice. 

I want to delete the .PNG picture of Claude. To do this, I would run:

In [1]:
cd ~/dog_pics/claude/

/home/jovyan/dog_pics/claude


In [2]:
rm claude1.PNG

After running this, you will see that this file is no longer in our **claude** directory. If we were in another directory and wanted to do this without changing our working directory, we could run:

```rm ~/dog_pics/claude/claude1.PNG```

This is true for most commands: you can give them the path to a file instead of changing your working directory first.

Let's say I wanted to delete multiple files at once and remove all of the remaining photos from the **claude** directory. In this case, we would run:

In [3]:
rm claude2.JPG claude3.JPG

And as you will be able to see via GUI, the **claude** directory is now empty. 

The ```rm``` command has some common options/flags:

1. ```-r``` or ```-R```: Recursively delete directories and their contents
2. ```-f```: Force deletion without prompting for confirmation
3. ```-i```: Interactive mode, prompts before each deletion
4. ```-v```: Verbose mode, shows what's being deleted

The most important of these flags for our purposes is probably the ```-r``` flag, which will allow us to delete entire directories and their contents. Let's try it out with the **jimmy** directory. 

In [5]:
rm -r ~/dog_pics/jimmy

If we tried to delete the directory without the ```-r```, it would spit back an error telling us that because **jimmy** is a directory, we cannot delete it. 
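
If you want to practice ```rm``` without risking any real files, here is a minimal sketch that creates its own throwaway folder first. The names `practice_dir`, `a.txt`, `b.txt`, and `loose.txt` are made up for this example. It removes a single file by its path, then removes a whole directory with ```-r```, adding ```-v``` so that ```rm``` reports what it deletes.

```shell
# Make a temporary practice folder so we don't touch any course files.
workdir=$(mktemp -d)

# Set up one directory holding two files, plus one file sitting next to it.
mkdir "$workdir/practice_dir"
touch "$workdir/practice_dir/a.txt" "$workdir/practice_dir/b.txt" "$workdir/loose.txt"

# How many items are in the practice folder before deleting anything?
before=$(ls "$workdir" | wc -l)
echo "$before"

# Remove a single file by giving its path (no cd needed).
rm "$workdir/loose.txt"

# -r removes the directory and everything inside it; -v lists each deletion.
rm -rv "$workdir/practice_dir"

# The practice folder should now be empty.
remaining=$(ls "$workdir" | wc -l)
echo "$remaining"

# Remove the (now empty) practice folder itself.
rmdir "$workdir"
```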

### Great work getting through this module everyone! I know this one was a little bit more intensive, but remember that you can always look these commands and their flags up. What's important is getting the syntax down.

> If you need any more practice, please play around in the terminal. 