# Text Editors

Text editors are an integral part of data analysis. There are many different flavors: some have a graphical interface, some are on the command line; some are good for writing papers, others for code. As you begin to conduct more analyses you will see that some tools suit you better than others. With that said, we will introduce you to some of the text editors that we use often and recommend.

## vi/vim

vim is one of the most basic, lightweight, and widely used text editors around. It is bundled with almost all Unix environments. It is convenient for working remotely, where you often have only a terminal and no mouse or graphical display, which graphical editors require. Instead, you use commands and the arrow keys to navigate the file. It sounds intimidating, but it's pretty straightforward!

### Basic Usage

To get started with vim, you can open a new file with
```bash
vim
```
If you want to open an existing file, just add the name of the file after vim 
```bash
vim [filename]
```

### Command Mode

When you first open vim, all alphanumeric keys are bound to commands. For example, if you type `dd` then you will delete the line where your cursor is instead of actually entering 'dd'. 

**Navigation**
- `0` moves the cursor to the beginning of the line.
- `$` moves the cursor to the end of the line.
- `shift+G` (`G`) moves to the end of the file.
- `gg` moves to the beginning of the file.
- `:` followed by a line number (e.g. `:42`) goes directly to that line.
- `shift+5` (`%`) jumps to the corresponding bracket.
- `shift+3` (`#`) jumps to the previous instance of the word under the cursor.

**Editing**
- `d` starts the delete operation.
- `dw` will delete a word.
- `d0` will delete to the beginning of a line.
- `d$` will delete to the end of a line.
- `dd` will delete an entire line.
- `dgg` will delete to the beginning of the file.
- `dG` will delete to the end of the file.
- `u` will undo the last operation.
- `Ctrl-r` will redo the last undo.

### Insert Mode

This is the basic editing mode. All of the keys will function as normal: i.e. if you now type `dd` you will enter 'dd' into the file.

- `i` enters insert mode
- `escape` exits insert mode

**Copying and Pasting**

These commands are used from command mode, not insert mode.
- `v` highlight one character at a time.
- `V` highlight one line at a time.
- `Ctrl-v` highlight by columns.
- `y` yank (copy) the highlighted text into the copy buffer.
- `p` paste text after the cursor (or below the current line when whole lines were yanked).
- `P` paste text before the cursor (or above the current line when whole lines were yanked).

### Menu Mode

**Regex/substitute**
- `/text` search for text in the document, going forward.
- `n` move the cursor to the next instance of the text from the last search. This will wrap to the beginning of the document.
- `N` move the cursor to the previous instance of the text from the last search.
- `?text` search for text in the document, going backwards.
- `:%s/text/replacement text/g` search through the entire document for text and replace it with replacement text.
- `:%s/text/replacement text/gc` search through the entire document and confirm before replacing text.

**Save files**
- `:w` saves the file
- `:w [filename]` saves the file to `filename`
- `:q` exits vim
- `:wq` saves and quits
- `:q!` force quit (discard unsaved changes and ignore prompts)



# Regular Expressions



A regular expression (aka regex) is a sequence of characters that describes or matches a text pattern. For example, the string **aabb12** could be described as *aabb12*, as *two a's, two b's, then 1, then 2*, or as *four letters followed by two numbers*.
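
As a rough illustration, each of the three descriptions of **aabb12** can be written as a pattern with Python's `re` module. The patterns use syntax that is introduced in the sections that follow, and the variable names are made up for this sketch.

```python
import re

text = 'aabb12'

# Three ways of describing the same string, from most to least specific.
literal = re.compile(r'aabb12')            # the exact characters
counted = re.compile(r'a{2}b{2}12')        # two a's, two b's, then 1, then 2
general = re.compile(r'[a-z]{4}[0-9]{2}')  # four letters followed by two numbers

matches = [p.fullmatch(text) is not None for p in (literal, counted, general)]
print(matches)

# The most general pattern also accepts other strings of the same shape,
# while the literal pattern does not.
print(general.fullmatch('wxyz99') is not None)
print(literal.fullmatch('wxyz99') is not None)
```

The more general the description, the more strings it accepts, which is the trade-off to keep in mind when constructing a pattern.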

## Metacharacters

Metacharacters are characters that have an alternate meaning rather than a literal meaning. There are many of these and they can be found on the [Python regular expression documentation](https://docs.python.org/2/library/re.html#regular-expression-syntax).

For example, we'll use each of the following characters as a pattern to search a string:
- 'A'
- '.'
- '$'

In [76]:
import re

text = 'Alan is the coolest person ever!$$$'

# compiling a regular expression allows you to reuse the regular expression.
A = re.compile('A')
dot = re.compile('.')
dollar = re.compile('$')

# Every match is returned as an item in a list.
A.findall(text)

['A']

In [77]:
dot.findall(text)

['A',
 'l',
 'a',
 'n',
 ' ',
 'i',
 's',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'c',
 'o',
 'o',
 'l',
 'e',
 's',
 't',
 ' ',
 'p',
 'e',
 'r',
 's',
 'o',
 'n',
 ' ',
 'e',
 'v',
 'e',
 'r',
 '!',
 '$',
 '$',
 '$']

In [78]:
dollar.findall(text)

['']

## Position Metacharacters

This type of metacharacter matches based on where it is located in the text rather than on a particular character.

Let's take the '$' example. It matches the end of a line, which is why it returned an empty string above rather than the dollar signs. So if we couple that with another character:

In [79]:
text = 'Alan is the coolest man ever!$$$'

# In regular expressions, backslashes tell the engine to interpret metacharacters as literals.
# The r prefix (raw string) stops Python from interpreting the backslash itself.
dollar = re.compile(r'\$$')

dollar.findall(text)

['$']
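
When the text to match is full of metacharacters, escaping each one by hand gets tedious. A small sketch using `re.escape`, which builds the escaped pattern for you (the variable names are illustrative):

```python
import re

text = 'Alan is the coolest man ever!$$$'

# re.escape backslash-escapes the metacharacters in its argument,
# so the resulting pattern matches the three dollar signs literally.
literal_dollars = re.compile(re.escape('$$$'))
found = literal_dollars.findall(text)
print(found)
```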

Similarly, '^' matches the beginning of a line:

In [80]:
# IGNORECASE tells the interpreter to match letters regardless of case
caret = re.compile('^a', re.IGNORECASE)

caret.findall(text)

['A']
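
By default `^` and `$` only anchor to the start and end of the whole string. A minimal sketch, using a two-line string made up for this example, of how the `re.MULTILINE` flag makes them anchor to every line instead:

```python
import re

two_lines = 'Alan is cool\nalso Alan'

# Without MULTILINE, '^' matches only at the very start of the string.
start_only = re.findall(r'^a', two_lines, re.IGNORECASE)

# With MULTILINE, '^' also matches right after each newline.
each_line = re.findall(r'^a', two_lines, re.IGNORECASE | re.MULTILINE)

print(start_only)
print(each_line)
```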

There are also boundary metacharacters. '\b' matches the boundary between a word character and a non-word character, so, for example, 'ing\b' matches 'ing' only at the end of a word. Inversely, '\B' matches anywhere that is not a word boundary, so 'ing\B' would match the 'ing' in 'things' but not in 'thing'. This is useful for specifying substrings or whole words.
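
A quick sketch of the 'thing' versus 'things' case, reporting where each pattern matches (the test string and variable names are made up for this example):

```python
import re

words = 'thing things'

# 'ing' followed by a word boundary: only the 'ing' that ends 'thing'.
ends_word = [m.start() for m in re.finditer(r'ing\b', words)]

# 'ing' NOT followed by a boundary: only the 'ing' inside 'things'.
inside_word = [m.start() for m in re.finditer(r'ing\B', words)]

print(ends_word)
print(inside_word)
```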

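NOTE: Python interprets some backslash sequences in ordinary string literals before the regex engine ever sees them (for example, '\b' becomes a backspace character), so patterns that contain backslashes are best written as raw strings with the r prefix. See here for more details:
https://stackoverflow.com/questions/2241600/python-regex-r-prefix
https://stackoverflow.com/questions/21104476/what-does-the-r-in-pythons-re-compiler-pattern-flags-mean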
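
To make the difference concrete, here is a small sketch comparing an ordinary string with a raw string for the same pattern text (the sample sentence is made up):

```python
import re

# In an ordinary string literal, Python turns '\b' into a single
# backspace character before the regex engine ever sees it.
plain = '\bis\b'
raw = r'\bis\b'

print(len(plain), len(raw))

# Only the raw version reaches the engine as "boundary, 'is', boundary".
plain_hits = re.findall(plain, 'this is it')
raw_hits = re.findall(raw, 'this is it')
print(plain_hits)
print(raw_hits)
```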

In [81]:
text = 'pistol'

# 'is' is surrounded by word characters ('p' and 't'), so there is no word boundary on either side
boundary = re.compile(r'\bis\b')
print('word boundary',boundary.findall(text))

# Because both sides are non-boundaries, it is found by the non-word-boundary search
nonboundary = re.compile(r'\Bis\B')
print('non-word boundary',nonboundary.findall(text))


word boundary []
non-word boundary ['is']


In [82]:
text = 'is'

boundary = re.compile(r'\bis\b')
print('word boundary', boundary.findall(text))

nonboundary = re.compile(r'\Bis\B')
print('non-word boundary', nonboundary.findall(text))

word boundary ['is']
non-word boundary []


## Single Metacharacters

These metacharacters match specific types of characters. For example, you can match any alphanumeric character (or underscore) with '\w' or any whitespace character with '\s'.

In [83]:
text = '123 abc !@#'

digits = re.compile(r'\d')
digits.findall(text)

['1', '2', '3']

In [84]:
# any alphanumeric character or underscore
wordchar = re.compile(r'\w')
wordchar.findall(text)

['1', '2', '3', 'a', 'b', 'c']

In [85]:
# any non-word character
nonwordchar = re.compile(r'\W')
nonwordchar.findall(text)

[' ', ' ', '!', '@', '#']

In [86]:
# any non-newline character
dot = re.compile('.')
dot.findall(text)

['1', '2', '3', ' ', 'a', 'b', 'c', ' ', '!', '@', '#']
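
The whitespace class '\s' and the negated classes '\D' and '\S' work the same way. A short sketch on the same text:

```python
import re

text = '123 abc !@#'

whitespace = re.findall(r'\s', text)  # spaces, tabs, newlines
non_digits = re.findall(r'\D', text)  # anything that is not a digit
non_space = re.findall(r'\S', text)   # anything that is not whitespace

print(whitespace)
print(non_digits)
print(non_space)
```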

## Quantifiers

All examples so far have searched for individual characters. Quantifiers allow you to match repeated patterns.

In [87]:
text = 'aa bb cdef 123'

# this looks for every instance of a word character
wordchar = re.compile(r'\w')
wordchar.findall(text)

['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', '1', '2', '3']

In [88]:
# the '+' tells the interpreter to look for one or more consecutive word characters
wordchar = re.compile(r'\w+')
wordchar.findall(text)

['aa', 'bb', 'cdef', '123']

In [89]:
# '?' is looking for characters that appear once or not at all. It basically makes the preceding character optional.
text = 'colour'

conditional = re.compile('colou?r')
print('colour',conditional.findall(text))

text = 'color'
print('color', conditional.findall(text))

colour ['colour']
color ['color']


In [90]:
# The asterisk matches the preceding character zero or more times: the character does not have to be there,
# but if it is, the pattern will repeat as many times as it can.
text = 'b'
star = re.compile('bo*')
print('b',star.findall(text))

text = 'boo'
print('boo',star.findall(text))

text = 'boooo!'
print('boooo!', star.findall(text))

b ['b']
boo ['boo']
boooo! ['boooo']


In [91]:
# curly brackets define the number of times the preceding pattern must be repeated
repeat = re.compile('bo{2}')
text = 'boo'
print('boo',repeat.findall(text))

text = 'boooo!'
print('boooo!', repeat.findall(text))

boo ['boo']
boooo! ['boo']


## Character Classes

A character class allows you to match a particular set of user-defined characters, as opposed to the predefined metacharacters we went over previously. Think of searching for any vowel.

In [92]:
text = 'I like chocolate'

vowels = re.compile('[AEIOUY]', re.IGNORECASE)
print('vowels', vowels.findall(text))

vowels ['I', 'i', 'e', 'o', 'o', 'a', 'e']


Alternatively, you can put a caret (^) first within the square brackets to list characters you do not want to match. In this case, it will match everything that is not a vowel: the consonants and also the spaces.

In [93]:
vowels = re.compile('[^AEIOUY]', re.IGNORECASE)
print('vowels', vowels.findall(text))

vowels [' ', 'l', 'k', ' ', 'c', 'h', 'c', 'l', 't']


You can also specify ranges of characters:

In [94]:
pattern = re.compile('[a-d]', re.IGNORECASE)
print('pattern', pattern.findall(text))

pattern ['c', 'c', 'a']
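
Several ranges can be combined in one class. A small sketch, with sample strings invented for this example:

```python
import re

text = 'Sample_A7 (run 12)'

# Several ranges can sit side by side inside one pair of brackets.
letters_and_digits = re.findall(r'[A-Za-z0-9]+', text)

# A hyphen placed last in the class is treated as a literal hyphen.
digits_or_hyphen = re.findall(r'[0-9-]+', '2015-03-11')

print(letters_and_digits)
print(digits_or_hyphen)
```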


## Alternation

This is essentially just an 'or' statement. An example would be checking whether a sentence says 'We have ten dollars.' or 'I have ten dollars.' Since the only varying piece of that sentence is the pronoun, you can try to match either.

In [95]:
text = 'I have ten dollars'
pattern = re.compile('we|i|they', re.IGNORECASE)
print('I', pattern.findall(text))

text = 'They have ten dollars'
pattern = re.compile('we|i|they', re.IGNORECASE)
print('They', pattern.findall(text))

text = 'We have ten dollars'
pattern = re.compile('we|i|they', re.IGNORECASE)
print('We', pattern.findall(text))

I ['I']
They ['They']
We ['We']
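
To match the whole sentence rather than just the pronoun, the alternatives can be wrapped in parentheses. A minimal sketch (the third sentence is made up to show a non-match):

```python
import re

# Without parentheses, '|' splits the WHOLE pattern, so the alternatives
# would be 'we', 'i', and 'they have ten dollars'. Parentheses limit its scope.
pattern = re.compile(r'^(we|i|they) have ten dollars$', re.IGNORECASE)

sentences = ['I have ten dollars', 'They have ten dollars', 'You have ten dollars']

pronouns = []
for sentence in sentences:
    match = pattern.search(sentence)
    # group(1) holds whichever alternative matched inside the parentheses.
    pronouns.append(match.group(1) if match else None)

print(pronouns)
```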


## Backreferences

Backreferences (captures) allow you to reuse the text matched by part of a regular expression, either later in the same pattern or in a replacement.

Let's give a biological example:
You're looking for a motif that is flanked by the restriction site ACTG. The motif can be of any length and any composition of nucleotides but is always flanked by those cut sites:

In [96]:
text = 'ACTGTTTTTTTTTACTG'

# the '\1' is the reference to the first captured pattern.
# All captured patterns are enumerated but can also be named.
print('matches:',re.search(r'(ACTG)([ACTG]+)\1', text).groups())

text = 'ATCGCAGCTACGACTGAAAAAAAAAAAAAAACTG'
print('matches:',re.search(r'(ACTG)([ACTG]+)\1', text).groups())

matches: ('ACTG', 'TTTTTTTTT')
matches: ('ACTG', 'AAAAAAAAAAAAAA')
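
Named captures can be sketched as follows; the group names `site` and `motif` are chosen for this example only:

```python
import re

text = 'ACTGTTTTTTTTTACTG'

# (?P<name>...) names a capture group; (?P=name) refers back to it.
pattern = re.compile(r'(?P<site>ACTG)(?P<motif>[ACTG]+)(?P=site)')

match = pattern.search(text)
print(match.group('site'), match.group('motif'))
print(match.groupdict())
```

Names make longer patterns easier to read than counting parentheses to work out which number a group has.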


## Substitution Mode

There are several ways that one can use regular expressions, though the two main modes are searching and substituting.

Searching/matching will only look for whether or not a pattern is matched. Substituting will replace any pattern that is matched with another pattern. Above we have used only searching, so below will only showcase an example of a substitution.

In [97]:
text = 'ACTGTTTTTTTTTACTG'
re.sub(r'ACTG', 'AAAA', text)

'AAAATTTTTTTTTAAAA'
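
Substitution can also reuse captured groups in the replacement text. A short sketch, assuming we want to mark the motif rather than overwrite the flanking sites:

```python
import re

text = 'ACTGTTTTTTTTTACTG'

# \1 and \2 in the replacement refer to the captured groups:
# keep the flanking sites and wrap the motif in square brackets.
marked = re.sub(r'(ACTG)([ACTG]+)\1', r'\1[\2]\1', text)
print(marked)

# count=1 limits the substitution to the first match only.
first_only = re.sub(r'ACTG', 'AAAA', text, count=1)
print(first_only)
```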

# Command Line Examples


## Constructing a Regular Expression

Arguably the most difficult task with regular expressions is constructing one. Often I am trying to extract information that matches multiple different patterns, or the same pattern in different lines of text. **In order to create a regular expression, you first need to identify the pattern**. This generally requires some understanding of the file format and the relevant information, as well as finding a pattern that applies to the information that you're interested in.

To give you some insight on how to do this, the examples below are scenarios that I often find myself in that are a perfect fit for regular expressions.

## Parsing a GFF file

There are a few tools, such as sed, awk, and grep, that are used for text munging but I happen to use Perl, as it provides a lot of flexibility when doing complex regular expressions and other data munging tasks.

grep can only match patterns (or the inverse of patterns) and cannot be used to replace or transform text. Though limited, it is a nice tool to have for quick searches in files or across filesystems. In this example, all I need is to match a pattern, so grep will suffice.

A GFF file is a standard tab-delimited file format for genomic annotations. Most GFF files contain every type of annotated feature for that organism (mRNA, exons, UTRs, motifs, etc.), which are not always relevant to the analysis and may sometimes actually interfere with it.

Let's say **I want to extract all annotated mRNAs** from this file. The first thing is to understand the [gff file format](https://www.ensembl.org/info/website/upload/gff.html). Reading the documentation will tell you that it is tab delimited and the third column contains the feature type. So looking at the file will show you:

In [98]:
! head -n 20 caenorhabditis_elegans.PRJNA13758.WBPS9.annotations.gff3

##gff-version 3
##sequence-region I 1 15072434
##sequence-region II 1 15279421
##sequence-region III 1 13783801
##sequence-region IV 1 17493829
##sequence-region V 1 20924180
##sequence-region X 1 17718942
##sequence-region MtDNA 1 13794
I	BLAT_EST_OTHER	expressed_sequence_match	1	50	12.8	-	.	ID=yk585b5.5.6;Target=yk585b5.5 119 168 +
I	BLAT_Trinity_OTHER	expressed_sequence_match	1	52	20.4	+	.	ID=elegans_PE_SS_GG6116|c0_g1_i1.2;Target=elegans_PE_SS_GG6116|c0_g1_i1 174 225 +
I	inverted	inverted_repeat	1	212	66	.	.	Note=loop 426
I	Genbank	assembly_component	1	2679	.	+	.	genbank=FO080985
I	Genomic_canonical	assembly_component	1	2679	.	+	.	Name=cTel33B;Note=Clone:cTel33B,GenBank:FO080985
I	Variation_project_Polymorphism	tandem_duplication	1	11000	.	+	.	variation=WBVar02123961;public_name=WBVar02123961;other_name=cewivar00854884;strain=JU533;polymorphism=1;consequence=Coding_exon
I	interpolated_pmap_position	gene	1	559784	.	.	.	ID=gmap:spe-13;gmap=spe-13;status=uncloned;Note=-2

Looking at the entire file is daunting and doesn't provide a list of features found within this file. So, we can use a few commands to get a full representation of each feature type described in the file.

In [99]:
! cut -f3 caenorhabditis_elegans.PRJNA13758.WBPS9.annotations.gff3 | sort | uniq

antisense_RNA
assembly_component
base_call_error_correction
binding_site
biological_region
CDS
complex_substitution
conserved_region
deletion
DNAseI_hypersensitive_site
duplication
enhancer
exon
experimental_result_region
expressed_sequence_match
five_prime_UTR
gene
##gff-version 3
G_quartet
histone_binding_site
insertion_site
intron
inverted_repeat
lincRNA
low_complexity_region
miRNA
miRNA_primary_transcript
mRNA
mRNA_region
nc_primary_transcript
ncRNA
nucleotide_match
operon
PCR_product
piRNA
point_mutation
polyA_signal_sequence
polyA_site
polypeptide_motif
possible_base_call_error
pre_miRNA
promoter
protein_coding_primary_transcript
protein_match
pseudogenic_rRNA
pseudogenic_transcript
pseudogenic_tRNA
reagent
regulatory_region
repeat_region
RNAi_reagent
rRNA
SAGE_tag
scRNA
##sequence-region I 1 15072434
##sequence-region II 1 15279421
##sequence-region III 1 13783801
##sequence-region IV 1 17493829
##sequence-region MtDNA 1 13794
##sequence-region V 1 20924180
##sequence-region X 1
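
The same `cut | sort | uniq` idea can be sketched in Python. This version works on a few lines copied from the `head` output above rather than on the real file, and it additionally skips the `##` header lines that show up in the shell output:

```python
# A few lines copied from the `head` output above; a real script would
# read them from the GFF file instead.
sample = (
    "##gff-version 3\n"
    "I\tinverted\tinverted_repeat\t1\t212\t66\t.\t.\tNote=loop 426\n"
    "I\tGenbank\tassembly_component\t1\t2679\t.\t+\t.\tgenbank=FO080985\n"
    "I\tGenomic_canonical\tassembly_component\t1\t2679\t.\t+\t.\t"
    "Name=cTel33B;Note=Clone:cTel33B,GenBank:FO080985\n"
)

# Take the third tab-separated column of every non-header line,
# deduplicate with a set, and sort the result.
feature_types = sorted({
    line.split('\t')[2]
    for line in sample.splitlines()
    if line and not line.startswith('#')
})
print(feature_types)
```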

We see that there are a lot of *RNA*s here, and there happen to be two different feature types that contain *mRNA* (`mRNA` and `mRNA_region`). So, now that we know the feature type and the format we can construct our regular expression:

In [100]:
# The -P tells grep to use Perl regular expressions, which make things a bit easier: '\t' is understood
# as a tab, and you don't have to escape as many characters!

# The pattern we're looking for is mRNA surrounded by tabs
! grep -P '\tmRNA\t' caenorhabditis_elegans.PRJNA13758.WBPS9.annotations.gff3 | head -n 20

I	WormBase	mRNA	4116	10230	.	-	.	ID=Transcript:Y74C9A.3;Parent=Gene:WBGene00022277;Name=Y74C9A.3;wormpep=WP:CE28146;locus=homt-1
I	WormBase	mRNA	11495	16793	.	+	.	ID=Transcript:Y74C9A.2a.1;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.1;wormpep=WP:CE24660;locus=nlp-40
I	WormBase	mRNA	11495	16837	.	+	.	ID=Transcript:Y74C9A.2a.2;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.2;wormpep=WP:CE24660;locus=nlp-40
I	WormBase	mRNA	11499	16837	.	+	.	ID=Transcript:Y74C9A.2a.3;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.3;wormpep=WP:CE24660;locus=nlp-40
I	WormBase	mRNA	11505	16837	.	+	.	ID=Transcript:Y74C9A.2a.4;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.4;wormpep=WP:CE24660;locus=nlp-40
I	WormBase	mRNA	11618	16837	.	+	.	ID=Transcript:Y74C9A.2a.5;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.5;wormpep=WP:CE24660;locus=nlp-40
I	WormBase	mRNA	11623	16837	.	+	.	ID=Transcript:Y74C9A.2b;Parent=Gene:WBGene00022276;Name=Y74C9A.2b;wormpep=WP:CE49228;locus=nlp-40
I	WormBase	mRNA	17487	26781	.	-	.	ID=Transcript:Y74C
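
For comparison, here is a Python sketch of the same filter. It runs on three in-memory lines: two shortened from the outputs above and one hypothetical `mRNA_region` line invented to show what the surrounding tabs rule out.

```python
import re

lines = [
    # Shortened from the grep output above (attributes trimmed).
    "I\tWormBase\tmRNA\t4116\t10230\t.\t-\t.\tID=Transcript:Y74C9A.3",
    # Shortened from the head output above.
    "I\tinterpolated_pmap_position\tgene\t1\t559784\t.\t.\t.\tID=gmap:spe-13",
    # Hypothetical line, invented for this example.
    "I\tWormBase\tmRNA_region\t4116\t4200\t.\t-\t.\tID=example",
]

# Regex approach: 'mRNA' with a tab on each side, as in the grep command.
tab_mrna = re.compile(r'\tmRNA\t')
by_regex = [line for line in lines if tab_mrna.search(line)]

# Column approach: split on tabs and compare the third column exactly.
by_column = [line for line in lines if line.split('\t')[2] == 'mRNA']

print(len(by_regex), by_regex == by_column)
```

The column comparison is the stricter of the two, since the regex would also accept `mRNA` appearing in any other tab-delimited column.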

## Renaming FastQ Files

The filenames for sequence files generally have too many characters and too much information, which can be annoying when working with those files. To extract only the relevant information and shorten the filename I often construct a regular expression, coupled with some bash commands, to rename the files how I deem fit.


In [101]:
ls /home/at120/badas/2015-03-11_AE52Y-redo

[0m[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated a1.341000000083f5.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated a2.3410000000847d.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated a3.341000000084f4.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated a4.3410000000857c.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated a5.341000000085f3.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated a6.3410000000867b.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated a7.341000000086f2.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated a8.3410000000877a.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated b1.34100000008403.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated b2.3410000000848a.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updated b3.34100000008502.fastq.gz[0m*
[38;5;34m000000000-AE52Y l01n01 flu mrt-pcr updat

There is a lot of information here, some of which is intelligible only to Tubo. The only pieces of information that I would like to retain are the sample name, consisting of **one lowercase letter followed by one digit**, and the pairing information, which is **n0 followed by a 1 or a 2**.

What I usually do to get started is to work with one filename and construct the pattern from that. Regular expressions are read from left to right, so we should start by grabbing the first bit of information we want, which is the pair information:

In [102]:
%%bash
echo "000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz" | perl -pe 's/^.+n0([12]).+/$1/g'

2


Next would be to grab the sample name:

In [103]:
%%bash

# It's good to be very specific with regular expressions for multiple reasons.
# 1) It limits the scope of the regular expression
# 2) It makes it easier to read later.
# So I do not NEED to specify 'flu' but it makes for fewer instructions and is more human readable.

echo "000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz" \
| perl -pe 's/^.+n0([12])\sflu.+\s([a-z]\d)\..+/$1$2/g'

2f3


Lastly, capture the file extension and rearrange the captured patterns:

In [104]:
%%bash
echo "000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz" \
| perl -pe 's/^.+n0([12])\sflu.+\s([a-z]\d)\..+(fastq.gz)$/$2.r$1.$3/g'

f3.r2.fastq.gz
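
The same substitution can be sketched with Python's `re.sub`. The only deliberate change from the Perl one-liner is that the dot in `fastq.gz` is escaped so it matches a literal dot:

```python
import re

filename = '000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz'

# Same pattern as the Perl one-liner, with the dot in 'fastq.gz' escaped.
pattern = re.compile(r'^.+n0([12])\sflu.+\s([a-z]\d)\..+(fastq\.gz)$')

# \g<2> and friends are Python's unambiguous form of Perl's $2 references.
short = pattern.sub(r'\g<2>.r\g<1>.\g<3>', filename)
print(short)
```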


Now wrap it up in a for loop and invoke a copy command:

In [105]:
%%bash
cd /home/at120/badas/2015-03-11_AE52Y-redo
for i in 0*gz; do cp "$i" "$(echo "$i" | perl -pe 's/^.+n0([12])\sflu.+\s([a-z]\d)\..+(fastq.gz)$/$2.r$1.$3/g')" ; done

ls

000000000-AE52Y l01n01 flu mrt-pcr updated a1.341000000083f5.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a2.3410000000847d.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a3.341000000084f4.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a4.3410000000857c.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a5.341000000085f3.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a6.3410000000867b.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a7.341000000086f2.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a8.3410000000877a.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b1.34100000008403.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b2.3410000000848a.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b3.34100000008502.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b4.34100000008589.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b5.34100000008601.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b6.34100000008688.fastq.gz
000000000-AE52Y l01n

cp: ‘a1.r2.fastq.gz’ and ‘a1.r2.fastq.gz’ are the same file
cp: ‘a2.r2.fastq.gz’ and ‘a2.r2.fastq.gz’ are the same file
cp: ‘a3.r2.fastq.gz’ and ‘a3.r2.fastq.gz’ are the same file
cp: ‘a4.r2.fastq.gz’ and ‘a4.r2.fastq.gz’ are the same file
cp: ‘a5.r2.fastq.gz’ and ‘a5.r2.fastq.gz’ are the same file
cp: ‘a6.r2.fastq.gz’ and ‘a6.r2.fastq.gz’ are the same file
cp: ‘a7.r2.fastq.gz’ and ‘a7.r2.fastq.gz’ are the same file
cp: ‘a8.r2.fastq.gz’ and ‘a8.r2.fastq.gz’ are the same file
cp: ‘b1.r2.fastq.gz’ and ‘b1.r2.fastq.gz’ are the same file
cp: ‘b2.r2.fastq.gz’ and ‘b2.r2.fastq.gz’ are the same file
cp: ‘b3.r2.fastq.gz’ and ‘b3.r2.fastq.gz’ are the same file
cp: ‘b4.r1.fastq.gz’ and ‘b4.r1.fastq.gz’ are the same file
cp: ‘b4.r2.fastq.gz’ and ‘b4.r2.fastq.gz’ are the same file
cp: ‘b5.r1.fastq.gz’ and ‘b5.r1.fastq.gz’ are the same file
cp: ‘b5.r2.fastq.gz’ and ‘b5.r2.fastq.gz’ are the same file
cp: ‘b6.r1.fastq.gz’ and ‘b6.r1.fastq.gz’ are the same file
cp: ‘b6.r2.fastq.gz’ and ‘b6.r2.fastq.gz
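
Before copying anything, it can help to preview the mapping and check that no two files collapse to the same short name. A minimal Python sketch of such a dry run, assuming the same pattern as above and a handful of filenames taken from the listing; the function name `short_name` is made up for this example:

```python
import re
from collections import Counter

# Same pattern as the Perl one-liner, with the dot in 'fastq.gz' escaped.
PATTERN = re.compile(r'^.+n0([12])\sflu.+\s([a-z]\d)\..+(fastq\.gz)$')


def short_name(filename):
    """Return the shortened name, or None when the pattern does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    pair, sample, extension = match.groups()
    return f'{sample}.r{pair}.{extension}'


originals = [
    '000000000-AE52Y l01n01 flu mrt-pcr updated a1.341000000083f5.fastq.gz',
    '000000000-AE52Y l01n01 flu mrt-pcr updated a2.3410000000847d.fastq.gz',
    '000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz',
    'a1.r1.fastq.gz',  # an already-shortened name, which the pattern skips
]

plan = {name: short_name(name) for name in originals}
for old, new in plan.items():
    print(repr(old), '->', new)

# Any short name produced more than once would overwrite another file.
counts = Counter(new for new in plan.values() if new is not None)
collisions = [new for new, n in counts.items() if n > 1]
print('collisions:', collisions)
```

Once the plan looks right, the actual copy could be done by looping over `plan` and calling `shutil.copy(old, new)` for every entry whose new name is not `None`.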