Skip to content

Latest commit

 

History

History
241 lines (185 loc) · 11.5 KB

Week8-mergingPython&Bash.md

File metadata and controls

241 lines (185 loc) · 11.5 KB

Merging Python and UNIX

Automating file renaming

First, a quick trick for renaming lots of files! Let's imagine you have a bunch of files named,

  • env1.csv
  • env2.csv
  • env3.csv
  • env4.csv
  • env5.csv
  • ...

But, perhaps you're not satisfied with them being called "env", and you'd prefer them to be called environment1.csv, etc. Now, let's say there are hundreds of these files. Renaming these files individually would be hugely time-consuming, so.. regular expressions to the rescue! All you'll need is a text editor which can search/replace using regular expressions (good options are TextWrangler or Sublime Text). Follow these steps (or similar, depending on your specific data files):

1. From the command line, type ls env* to list all files named as env1.csv, env2.csv ...

2. Copy/paste this list of files into a text editor which can accomodate regular expressions.

3. You'll want each file name to have its own line, with no empty lines in between. You can clean up whitespace with the following two regular expression search/replace pairs (in order):

Search Replace What it does
\s* \n Replace all whitespace with a newline character
\n+ \n Replace all "double newlines" with a single newline (no empty space between lines)

4. Finally, perform a search/replace with the following regular expressions:
Search: ^env(\d+\.csv)
Replace: mv env\1 environment\1

5. Copy and paste the new search/replaced text file contents back to the command line, and viola! All files will have been renamed to environment1.csv, etc.



The python sys module

The sys module interacts with the python interpreter. This module primarily comes with certain variables (rather than functions) which are particularly useful, two of which are described below.

Passing command-line arguments sys.argv

Often, you'll want to pass input arguments to a script. All input arguments are stored in the variable sys.argv (note that you must import the sys module!). For example, you might have a script called calc_dna.py which performs certain calculations on a sequence data file, but each time you run the script, you might want to process a different file. One option to pass the file name as an input argument: python calc_dna.py inputfile.fasta, where inputfile.fasta is the single command-line argument. If you've loaded the sys module, all command line arguments will be stored in sys.argv:

Assume the following code is included in the script calc_dna.py.

import sys
print sys.argv

From the command line you run python calc_dna.py inputfile.fasta and get back:
['calc_dna.py', 'inputfile.fasta']

Notice that the first entry in sys.argv is the name of the script. After this come all command line arguments! In addition, all sys.argv entries will be strings. So remember that if you want to use an input argument as a number, you must convert it to a float or integer.

Generally, you should save the input arguments to a new variable inside the script:

import sys
infile = sys.argv[1]

Error checking is often very useful here; you might have a script which requires two command line arguments. For instance, let's say you have a script (which does something..) which takes a file name and a number as its two arguments. To ensure that this always happens, help yourself out with assertion statements:

import sys
# sys.argv must be of length 3 (script name, inputfile, number)
# print a usage statement if arguments not provided or provided incorrectly
assert( len(sys.argv) == 3 ), "Usage: python my_script.py <inputfile> <number>"
infile = sys.argv[1]
number = int( sys.argv[2] ) # remember to convert from string to int, as needed!

Editting the path with sys.path

You can view everything in python's path by printing the contents of the list sys.path. This variable will you tell which directories on your computer that the current python interpretter is able to access. To edit this path in place, for instance by adding a directory to the path, simply use .append():

import sys
sys.path.append("/path/to/directory/that/python/should/know/about/")



The python os and shutil modules

The os and shutil modules are useful for interacting with your computer's operating system (typically UNIX). With these modules, you can run commands from your python script which are analogous to UNIX commands like cd and pwd.
Some examples:

Module Command Description Unix equivalent Example
os os.listdir List all items in a given directory ls os.listdir("/directory/of/interest/")
os os.remove Remove a file rm os.remove("i_hate_this_file.txt")
os os.rmdir Remove a directory rm -r os.rmdir("/i/hate/this/directory/")
os os.mkdir Create a new directory mkdir os.mkdir("/path/to/brand/new/directory/")
os os.mkdirs Create many new directories mkdir os.mkdir("/path/to/a/brand/new/directory/", "/path/to/another/brand/new/directory/")
os os.chdir Change directory where python is running cd os.chdir("/another/directory/where/i/want/to/be/")
shutil shutil.copy Copy a file cp shutil.copy("old_file.txt", "new_file.txt")
shutil shutil.move Move a file mv shutil.move("old_file.txt", "new_file.txt")

Running external commands with os

You will often want to use Python scripting to automate analyses which use external programs or softwares. You can actually call these programs directly from your python script using the function os.system(). This function takes a single argument: the command you want to run (as a string). Anything that you could type into the command line can be given to os.system!

# Create a multiple sequence alignment in MAFFT from python
import os
# Define input, output files
infile = "unaligned.fasta"
outfile = "aligned.fasta"
command = "mafft " + infile + " > " + outfile
os.system(command)

We can also incorporate sys to provide the input/output file names as arguments!

# Create a multiple sequence alignment in MAFFT from python
import os
import sys

# Check and save input arguments
assert( len(sys.argv) == 3 ), "Usage: python align.py <inputfile> <outputfile>"
infile = sys.argv[1]
outfile = sys.argv[2]

# Run the alignment
command = "mafft " + infile + " > " + outfile
os.system(command)

Finally, you can check that the command has run properly by saving the output of os.system() (basically, save it to a variable). In UNIX, a returned value of 1 means an error occurred, but a returned value of 0 means everything went fine. Therefore, we want to make sure that os.system() returns a value of 0, by editting the last few lines:

command = "mafft " + infile + " > " + outfile
aligned_properly = os.system(command)
assert(aligned_properly == 0), "MAFFT didn't work!"

Part 2: useful UNIX/Bash one-liners: sed, sort, uniq and awk

Bash is great for getting a quick look at your data and for simple regex replacements



sed is useful for quick and recursive replacements using Regex

  • General pseudocode: sed [-E] command/regex/replacement/optionalflag filename > newfile
  • My favorite pseudocode: sed -E s/OLD/NEW/
  • Mac users must include -E to access regular expressions
  • sed does not understand \t and \n, see below
# replace first instance of XX with YY for each line
sed s/XX/YY/ filename > newfile
# replace all instances of XX with YY 
# - `g` flag means 'global' and searches for all instances of the pattern
sed s/XX/YY/g filename > newfile
# replace all instances of XX with YY and of AA with ZZ
# - `-e` flag lets you execute multiple sed commands at once
sed -e s/XX/YY/g -e s/AA/ZZ/g filename > newfile
# keep only letter and space characters ([a-zA-Z' ']*) that come before a different type of character in each line
# - must escape all () using `\`, ie: \([regex]\)
# - must put whitespace and replacement ('\1\2') in quotes, or else it is interpreted as a separate command
sed -E s/\([a-zA-Z' ']*\)\(.*\)/'\1'/ example 
  • To insert tabs (\t) you'll have to hit ctrl + v, then Tab while in the terminal environment
  • For newline characters (\n), you have to code it directly into the line with \ + enter, for example:
sed -E s/\([0-9]\)/'\1\
'/ examplefile.txt

sort sorts the lines in a file by numbers then lowercase letters then uppercase letters

Command Meaning Example
-b ignore leading blanks sort -b filename > filename.sorted
-r reverse sort -r filename > filename.sortedr
-k POS1 sort by field/character indicated by POS1 sort by field 2: sort -k 2 filename
sort by second character in field 2: sort -k 2.2 filename
-k POS1,POS2 sort based on the characters from POS1 to POS2 sort by characters in fields 2 and 3: sort -k 2,3 filename
sort starting with second character in field 2 up to and including field 3: sort -k 2.2,3 filename

uniq counts the number of unique ADJACENT lines in a file; use sort beforehand to ready the file for uniq

Command Meaning Example
-c prefixes lines with the number of times they occur uniq -c filename
-d prints only repeated lines uniq -d filename
-u prints only unique lines uniq -u filename > filename.unique
-f N skips N number of lines uniq -f 30 filename

A combination of grep, sort, and uniq is great for looking at raw sequence reads

# unzip file (`bunzip2`) and print contents to stdout rather than to a new file (`-c`)
# select any lines that start with A, C, T, or G (`grep ^[ACTG]`)
# keep lines that do not contain F, H, J, or B (`grep -v [FHJB]`)
# print the first 10000 lines, sort them alphabetically, show unique lines with counts (`head -10000 | sort | uniq -c`)
bunzip2 -c AzetekiG_SRR957179.fastq.bz2 | grep ^[ACTG] | grep -v [FHJB] | head -10000 | sort | uniq -c

awk is useful for quick subsetting of tab- or csv-delimited datasets

  • General pseudocode: cat filename | awk '{ command }'
  • My favorite pseudocode: cat filename | awk -Fdelimiter '{ print($linenumber,$otherlinenumber) }'
  • awk, unlike sed, does understand \n and \t
# example using HW csv file
# print columns 2 and 3
cat WEEK_06_python5_HW.csv | awk -F, '{ print ($2,$3) }'
# print columns 2 and 3, add a `tab` between the items
cat WEEK_06_python5_HW.csv | awk -F, '{ print ($2"\t"$3) }'
# print columns 1 and 2, adding in a `tab` between, a new field "newline", and a new line at the end
cat WEEK_06_python5_HW.csv | awk -F, '{print ($1"\t"$2 "\tnewline\n")}'
# more complicated, a for loop that prints each item 
cat WEEK_06_python5_HW.csv | awk -F, '{for (i=1;i<=4;i++) {print ($i)}}'
# for each line, from items 1 until 4: `for (i=1;i<=4;i++)`
# print each item: `{print $i}`
# much more complicated, a for loop that prints a 10-nucleotide sequence that overlaps by 1 nucleotide along the entire sequence
# you should recall this first part that grabs the sequence data from an illumina output file
bunzip2 -c AzetekiG_SRR957179.fastq.bz2 | grep ^[ACTG] | grep -v [FHJB] | head -10000 | sort | uniq | \
awk '{for (i=1;i<=length($1)-10;i++) {print substr($1,i,10)}}'
# for 1 to the length of the line until 10 before the end: `for (i=1;i<=length($1)-10;i++)`
# print from the entire line (`$1`) a substring from i to 10: `{print substr($1,i,10)}`