# Text Editors

Text editors are an integral part of data analysis. There are many different flavors: some have a graphical interface, some are on the command line; some are good for writing papers, others for code. As you begin to conduct more analyses you will see that some tools suit you better than others. With that said, we will introduce you to some of the text editors that we use often and recommend.

## vi/vim

vi was the first implementation of this text editor and vim ("Vi IMproved") is an upgraded version, so everything vi does can be done in vim, plus a lot more (syntax highlighting, editing over network protocols, a built-in diff, etc.). vim is one of the most basic, lightweight, and widely available text editors around, and it is bundled with almost all Unix environments. It is convenient for working remotely, where you often have only a terminal and none of the graphical interface and mouse that graphical editors require. Instead, you use keyboard commands and the arrow keys to navigate the file. It sounds intimidating, but it's pretty straightforward!

### Basic Usage

To get started with vim, you can open a new file with
```bash
vim
```
If you want to open an existing file, add the name of the file after `vim`:
```bash
vim [filename]
```

### Command Mode

When you first open vim, you are in command mode (vim's own documentation calls it Normal mode), where all alphanumeric keys are bound to commands. For example, typing `dd` deletes the line where your cursor is instead of entering 'dd' into the file.

**Navigation**
- `h` moves the cursor one character to the left.
- `j` moves the cursor down one line.
- `k` moves the cursor up one line.
- `l` moves the cursor one character to the right.
- `0` moves the cursor to the beginning of the line.
- `$` moves the cursor to the end of the line.
- `w` moves the cursor forward one word.
- `b` moves the cursor backward one word.
- `G` moves the cursor to the end of the file.
- `gg` moves the cursor to the beginning of the file.
- `` `. `` moves the cursor to the last edit.

**Editing**
- `d` starts the delete operation.
- `dw` will delete a word.
- `d0` will delete to the beginning of a line.
- `d$` will delete to the end of a line.
- `dgg` will delete to the beginning of the file.
- `dG` will delete to the end of the file.
- `u` will undo the last operation.
- `Ctrl-r` will redo the last undo.

**Copying and Pasting**
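- `v` highlights one character at a time.
- `V` highlights one line at a time.
- `Ctrl-v` highlights by columns.
- `y` yanks (copies) the highlighted text into the copy buffer.
- `p` pastes text after the cursor (below the current line if whole lines were yanked).
- `P` pastes text before the cursor (above the current line if whole lines were yanked).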

### Menu Mode

**Regex/substitute**
- `/text` search for text in the document, going forward.
- `n` move the cursor to the next instance of the text from the last search. This will wrap to the beginning of the document.
- `N` move the cursor to the previous instance of the text from the last search.
- `?text` search for text in the document, going backwards.
- `:%s/text/replacement text/g` search through the entire document for text and replace it with replacement text.
- `:%s/text/replacement text/gc` search through the entire document and confirm before replacing text.

**Save files**
- `:w` saves the file.
- `:w [filename]` saves the file to `filename`.
- `:q` exits vim.
- `:wq` saves and quits.
- `:q!` force quits, discarding any unsaved changes.

### Insert Mode

This is the basic editing mode. Press `i` in command mode to enter it and `Esc` to return to command mode. All of the keys will function as normal, i.e. if you now type `dd` you will enter 'dd' into the file.


## Atom

## Notepad++

# Regular Expressions



A regular expression (aka regex) is a sequence of characters that describes or matches a text pattern. For example, the string **aabb12** could be described as *aabb12*, as *two a's, two b's, then 1, then 2*, or as *four letters followed by two numbers*.
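The three descriptions above can be written as patterns. This is a minimal sketch using Python's `re` module; the pattern names are made up for illustration, and "letters" is assumed to mean lowercase a-z.

```python
import re

text = 'aabb12'

# Three illustrative patterns, from most to least specific.
literal = re.compile(r'aabb12')            # the exact string
counted = re.compile(r'a{2}b{2}12')        # two a's, two b's, then 1, then 2
general = re.compile(r'[a-z]{4}[0-9]{2}')  # four lowercase letters, then two digits

for name, pattern in [('literal', literal), ('counted', counted), ('general', general)]:
    # fullmatch succeeds only if the whole string fits the pattern
    print(name, bool(pattern.fullmatch(text)))

# Only the most general pattern also describes a different string.
print(bool(general.fullmatch('wxyz99')), bool(literal.fullmatch('wxyz99')))
```

All three patterns describe 'aabb12', but the more general a pattern is, the more other strings it accepts as well.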

## Metacharacters

Metacharacters are characters that have an alternate meaning rather than a literal meaning. 



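For example, we'll use the following characters to search a string:
- 'A' has no special meaning, so it matches a literal 'A'.
- '.' is a metacharacter that matches any single character except a newline.
- '$' is a metacharacter that matches the end of a line, so on its own it matches an empty string.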

In [1]:
import re

text = 'Alan is the coolest person ever!$$$'

A = re.compile('A')
dot = re.compile('.')
dollar = re.compile('$')

A.findall(text)

['A']

In [2]:
dot.findall(text)

['A',
 'l',
 'a',
 'n',
 ' ',
 'i',
 's',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'c',
 'o',
 'o',
 'l',
 'e',
 's',
 't',
 ' ',
 'p',
 'e',
 'r',
 's',
 'o',
 'n',
 ' ',
 'e',
 'v',
 'e',
 'r',
 '!',
 '$',
 '$',
 '$']

In [3]:
dollar.findall(text)

['']
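In the outputs above, 'A' matched itself, '.' matched every character, and '$' matched only the empty string at the end of the text, never the dollar signs. To search for a metacharacter as a literal, it has to be escaped. A small sketch (the variable names and the 'v1.2.3' string are invented for this example):

```python
import re

text = 'Alan is the coolest person ever!$$$'

# A backslash in front of a metacharacter makes it a literal.
literal_dollars = re.findall(r'\$', text)
literal_dots = re.findall(r'\.', 'v1.2.3')

# re.escape adds the backslashes for you, which helps with longer literals.
whole_run = re.findall(re.escape('$$$'), text)

print(literal_dollars)
print(literal_dots)
print(whole_run)
```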

## Position Metacharacters

This type of metacharacter matches based on where in the text it is located, as opposed to which character is there.

Let's take the '$' example. It matches the end of a line, so if we couple it with an escaped (literal) '$', only the last dollar sign in the text is matched:

In [4]:
text = 'Alan is the coolest man ever!$$$'

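# In regular expressions, a backslash tells the engine to interpret the following metacharacter as a literal.
# The r prefix (raw string) passes the backslash through to the regex engine unchanged.
dollar = re.compile(r'\$$')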

dollar.findall(text)

['$']

Similarly, '^' matches the beginning of a line:

In [5]:
# IGNORECASE makes the match case-insensitive, so 'a' matches both 'a' and 'A'
caret = re.compile('^a', re.IGNORECASE)

caret.findall(text)

['A']
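By default, Python's `re` module anchors '^' at the start of the whole string and '$' at its end, even when the string contains several lines. The `re.MULTILINE` flag makes both anchors work on every line. A short sketch with an invented two-line string:

```python
import re

text = 'alpha one\nbeta two'

# Default behaviour: anchors apply to the string as a whole.
first_default = re.findall(r'^\w+', text)
last_default = re.findall(r'\w+$', text)

# With re.MULTILINE: anchors apply at the start and end of each line.
first_multi = re.findall(r'^\w+', text, re.MULTILINE)
last_multi = re.findall(r'\w+$', text, re.MULTILINE)

print(first_default, last_default)
print(first_multi, last_multi)
```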

There are also boundary metacharacters. '\b' matches a word boundary, so the pattern 'ing\b' matches the 'ing' at the end of 'thing'. Inversely, '\B' matches anywhere that is not a word boundary, so 'ing\B' would match the 'ing' in 'things' but not in 'thing'. This is useful for specifying substrings or whole words.
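The 'thing' and 'things' case can be checked directly. This is a small sketch; the names `ends_word` and `inside_word` are chosen here for illustration.

```python
import re

ends_word = re.compile(r'ing\b')    # 'ing' followed by a word boundary
inside_word = re.compile(r'ing\B')  # 'ing' followed by another word character

for word in ['thing', 'things']:
    print(word, ends_word.findall(word), inside_word.findall(word))
```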

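NOTE: in a normal Python string, some backslash sequences are interpreted by Python before the regex engine sees them (for example, '\b' becomes a backspace character). Writing patterns as raw strings (with the r prefix) avoids this. See here for more details: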
https://stackoverflow.com/questions/2241600/python-regex-r-prefix
https://stackoverflow.com/questions/21104476/what-does-the-r-in-pythons-re-compiler-pattern-flags-mean
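A quick way to see why the r prefix matters for '\b' (a sketch with an invented test string):

```python
import re

text = 'this is it'

# In a normal string, Python turns '\b' into one backspace character.
# In a raw string it stays as two characters: a backslash and a 'b'.
print(len('\b'), len(r'\b'))

plain = re.findall('\bis\b', text)  # searches for backspace + 'is' + backspace
raw = re.findall(r'\bis\b', text)   # searches for the whole word 'is'

print(plain, raw)
```

The non-raw pattern finds nothing because the text contains no backspace characters, while the raw pattern finds the standalone word.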

In [6]:
text = 'pistol'

# 'is' is surrounded by word characters, so there is no word boundary on either side
boundary = re.compile(r'\bis\b')
print('word boundary',boundary.findall(text))

# However, both ends of 'is' are non-boundary positions, so it is found by the non-word-boundary search
nonboundary = re.compile(r'\Bis\B')
print('non-word boundary',nonboundary.findall(text))


word boundary []
non-word boundary ['is']


In [7]:
text = 'is'

boundary = re.compile(r'\bis\b')
print('word boundary', boundary.findall(text))

nonboundary = re.compile(r'\Bis\B')
print('non-word boundary', nonboundary.findall(text))

word boundary ['is']
non-word boundary []


## Single Metacharacters

These metacharacters match specific types of characters. For example, you can match any word character (a letter, digit, or underscore) with '\w' or any whitespace character with '\s'.
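As a brief illustration of '\s', and of the fact that '\w' also accepts the underscore (the sample strings are invented for this sketch):

```python
import re

text = '123 abc\t!@#\n'

# '\s' matches spaces, tabs, and newlines.
whitespace = re.findall(r'\s', text)

# '\w' matches letters, digits, and the underscore.
word_chars = re.findall(r'\w', 'a_b')

print(whitespace)
print(word_chars)
```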

In [8]:
text = '123 abc !@#'

digits = re.compile(r'\d')
digits.findall(text)

['1', '2', '3']

In [9]:
# any word character (letter, digit, or underscore)
wordchar = re.compile(r'\w')
wordchar.findall(text)

['1', '2', '3', 'a', 'b', 'c']

In [10]:
# any non-word character
nonwordchar = re.compile(r'\W')
nonwordchar.findall(text)

[' ', ' ', '!', '@', '#']

In [11]:
# any non-newline character
dot = re.compile('.')
dot.findall(text)

['1', '2', '3', ' ', 'a', 'b', 'c', ' ', '!', '@', '#']

## Quantifiers

All examples so far have searched for individual characters. Quantifiers allow you to match repeated patterns.

In [12]:
text = 'aa bb cdef 123'

# this looks for every instance of a single word character
wordchar = re.compile(r'\w')
wordchar.findall(text)

['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', '1', '2', '3']

In [13]:
# the '+' tells the engine to look for one or more consecutive occurrences of the preceding character
wordchar = re.compile(r'\w+')
wordchar.findall(text)

['aa', 'bb', 'cdef', '123']

In [14]:
# '?' is looking for characters that appear once or not at all. It basically makes the preceding character optional.
text = 'colour'

conditional = re.compile('colou?r')
print('colour',conditional.findall(text))

text = 'color'
print('color', conditional.findall(text))

colour ['colour']
color ['color']


In [15]:
# The asterisk doesn't require the preceding character to be there, but if it is, it will match it
# and repeat the match as many times as it can (zero or more times).
text = 'b'
star = re.compile('bo*')
print('b',star.findall(text))

text = 'boo'
print('boo',star.findall(text))

text = 'boooo!'
print('boooo!', star.findall(text))

b ['b']
boo ['boo']
boooo! ['boooo']


In [16]:
# curly brackets set the exact number of times the preceding character must be repeated
repeat = re.compile('bo{2}')
text = 'boo'
print('boo',repeat.findall(text))

text = 'boooo!'
print('boooo!', repeat.findall(text))

boo ['boo']
boooo! ['boo']
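Curly brackets also accept a range. The sketch below, using an invented test string, shows `{2,3}` (between two and three repeats) and `{2,}` (two or more), and how a trailing '?' makes a quantifier stop at the fewest repeats it can instead of the most.

```python
import re

text = 'bo boo booo boooo'

# {2,3}: at least two and at most three o's; takes three when it can
between = re.findall(r'bo{2,3}', text)

# {2,}: two or more o's, with no upper limit
at_least = re.findall(r'bo{2,}', text)

# {2,3}?: the same range, but matching as few o's as possible
lazy = re.findall(r'bo{2,3}?', text)

print(between)
print(at_least)
print(lazy)
```

Note that the lone 'bo' is never matched, because every pattern here requires at least two o's.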


## Character Classes

A character class allows you to match a particular set of user-defined characters as opposed to the predefined metacharacters we went over previously. Think of searching for any vowel.

In [17]:
text = 'I like chocolate'

vowels = re.compile('[AEIOUY]', re.IGNORECASE)
print('vowels', vowels.findall(text))

vowels ['I', 'i', 'e', 'o', 'o', 'a', 'e']


Alternatively, you can put a caret (^) first within the square brackets to list characters you do not want to match. In this case, it will match every character that is not a vowel: the consonants, and also the spaces.

In [18]:
non_vowels = re.compile('[^AEIOUY]', re.IGNORECASE)
print('non-vowels', non_vowels.findall(text))

non-vowels [' ', 'l', 'k', ' ', 'c', 'h', 'c', 'l', 't']


You can also specify ranges of characters:

In [19]:
pattern = re.compile('[a-d]', re.IGNORECASE)
print('pattern', pattern.findall(text))

pattern ['c', 'c', 'a']
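Several ranges can share one pair of brackets, and a class can be combined with a quantifier. As a sketch, assuming we want runs of lowercase hexadecimal digits in an invented string:

```python
import re

text = 'ff 1a2b xyz 42'

# digits 0-9 or letters a-f, one or more in a row
hex_runs = re.findall(r'[0-9a-f]+', text)

# the negated class: runs of anything that is not a hex digit or a space
other_runs = re.findall(r'[^0-9a-f ]+', text)

print(hex_runs)
print(other_runs)
```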


## Alternation

This is essentially just an 'or' statement. An example would be checking whether a sentence says 'We have ten dollars.' or 'I have ten dollars.' Since the only varying piece of that sentence is the pronoun, you can try to match either one.

In [20]:
text = 'I have ten dollars'
pattern = re.compile('we|i|they', re.IGNORECASE)
print('I', pattern.findall(text))

text = 'They have ten dollars'
pattern = re.compile('we|i|they', re.IGNORECASE)
print('They', pattern.findall(text))

text = 'We have ten dollars'
pattern = re.compile('we|i|they', re.IGNORECASE)
print('We', pattern.findall(text))

I ['I']
They ['They']
We ['We']
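One caveat with the pattern above: a bare 'i' also matches the letter inside other words. A possible refinement, sketched here with invented sentences, wraps the alternatives in a non-capturing group `(?:...)` and adds word boundaries so that only whole words match.

```python
import re

loose = re.compile(r'we|i|they', re.IGNORECASE)
strict = re.compile(r'\b(?:we|i|they)\b', re.IGNORECASE)

text = 'This is it'
loose_hits = loose.findall(text)    # picks up the 'i' inside each word
strict_hits = strict.findall(text)  # no whole-word pronoun here

pronouns = strict.findall('I think they know we have ten dollars')

print(loose_hits)
print(strict_hits)
print(pronouns)
```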


## Backreferences

Backreferences (captures) allow you to reuse part of a regular expression, or the text that part matched, later on.

Let's give a biological example:
You're looking for a motif that is flanked by the restriction site ACTG. The motif can be of any length and any composition of nucleotides, but it is always flanked by those cut sites:

In [44]:
text = 'ACTGTTTTTTTTTACTG'

# the '\1' is the reference to the first captured pattern.
# All captured patterns are numbered but can also be named.
print('matches:',re.search(r'(ACTG)([ACTG]+)\1', text).groups())

text = 'ATCGCAGCTACGACTGAAAAAAAAAAAAAAACTG'
print('matches:',re.search(r'(ACTG)([ACTG]+)\1', text).groups())

matches: ('ACTG', 'TTTTTTTTT')
matches: ('ACTG', 'AAAAAAAAAAAAAA')
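The same search can be written with named groups instead of numbers. In this sketch the group names `site` and `motif` are chosen for illustration; `(?P<name>...)` defines a named group and `(?P=name)` refers back to it.

```python
import re

text = 'ACTGTTTTTTTTTACTG'

pattern = re.compile(r'(?P<site>ACTG)(?P<motif>[ACTG]+)(?P=site)')
match = pattern.search(text)

# Groups can be read by name, or all at once as a dictionary.
print(match.group('site'), match.group('motif'))
print(match.groupdict())
```

Named groups make longer patterns easier to read, since each captured piece says what it is for.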


## Substitution Mode

Regular expressions can be used in several ways, though the two main modes are searching and substituting.

Searching/matching only checks whether a pattern is matched. Substituting replaces any text that matches the pattern with other text. Above we have only used searching, so below is an example of a substitution.

In [47]:
text = 'ACTGTTTTTTTTTACTG'
re.sub(r'ACTG', 'AAAA', text)

'AAAATTTTTTTTTAAAA'
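Substitution has a few useful variations. The sketch below reuses the same sequence to show limiting the number of replacements with `count`, counting replacements with `re.subn`, and referring to captured groups in the replacement text; the dash-separated output format is just an illustration.

```python
import re

text = 'ACTGTTTTTTTTTACTG'

# count=1 replaces only the first match
first_only = re.sub(r'ACTG', 'AAAA', text, count=1)

# subn returns the new string and the number of replacements made
replaced, n = re.subn(r'ACTG', 'AAAA', text)

# \1 and \2 in the replacement refer to the captured groups
marked = re.sub(r'(ACTG)([ACTG]+)\1', r'\1-\2-\1', text)

print(first_only)
print(replaced, n)
print(marked)
```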