# vikasrawal/orgpaper

26f51e2 Feb 13, 2016

# Introduction

This guide introduces an open-source toolkit for writing research papers and monographs. The main features of this toolkit, which is centred around Emacs and Org-mode, are:

• embedded R code in the document that allows for statistical results to be revised and reproduced,
• bibliographic citations from a personal bibliographic database,
• formatting using well defined styles with minimal markup,
• support for production of final output as pdf, odt, docx, html and many other formats.

## Will this guide be useful for you?

This guide will be useful for you if you are writing a research paper, a dissertation or an academic book. It would be useful if your writing involves one or more of the following:

• Citing existing literature in your area
• Presenting results of statistical analyses (in tabular form and/or graphically)
• Using mathematical equations

Following this guide needs some investment of time, but the benefits far outweigh the investment you make.

## What is our goal?

What are the most interesting features of the writing platform that you will set up using this guide?

• With easy style specifications that you provide, the document will be almost entirely automatically formatted by the software.
• Complicated LaTeX-style markup is a pain and OpenOffice/MS-Word documents require too much manual formatting. Basic Org-mode mark-up is extremely simple, and can be mastered in very little time.
• Org-mode can produce well formatted output in LaTeX, pdf, odt, docx, html and many other formats.
• Instead of including statistical results (tables, graphs, etc.), we would embed appropriate R programs in the document, so that when the formatted output is produced, all programs are run to generate the results. Advantages of doing this are:
  • Any changes in the data being used can be accommodated just by publishing the document again.
  • Any modifications in statistical analysis are easily made by modifying the programs that are embedded in the file itself.
  • Anyone who has the org file can reproduce your results. You can also extract all R programs from the org file and distribute those for reproduction of your results.
• The document will be integrated with a citation manager, so that bibliographic information will be pulled automatically from a central database to create a fully formatted bibliography.
  • You will maintain a bibliographic database in BibTeX format, that you can build over time, adding bibliographic information for works that you cite.
  • Many websites (including Google Scholar) provide bibliographic information directly in BibTeX format, and we will have integrated tools that will allow us to pull this information directly into our local database.

## Acknowledgements

In my adaptation of Org, I have benefitted immensely from the great community of Orgers. The Org-mode manual, Worg, and archives of the Org-mode mailing list have been the most important resources. In addition, I have greatly benefited from solutions provided by various people to my specific queries on the Org-mode mailing list. What I present in this document is essentially a synthesis of solutions provided by various people. The community has been extremely generous in providing these.

I would particularly like to thank

• Carsten Dominik, the author of Org-mode.
• Bastien Guerry, who has been a great maintainer of Org-mode, after Carsten passed on the baton to him.
• Nicolas Goaziou, who wrote the brilliant new exporter framework. The amount of code Nicolas has contributed to Org over the last two years or so is incredible. Nicolas very kindly responded to several of my queries.
• Eric Schulte, the main author of Babel, which gave Org mode the ability to execute code. I used to use Org-mode as a task manager and for taking notes. I discovered org-babel in the summer of 2010, when I was doing fieldwork in villages in eastern India. This discovery completely changed my work flow, and Org-mode became central to all my academic work.
• In addition to the above, Suvayu Ali, for responses to several of my queries on the mailing list.

# Installing necessary software

This setup will work with any operating system. I have tested it on GNU/Linux and Mac OS-X, but it should work on Windows as well. For this setup, you need to install Emacs (version 24, along with a few additional Emacs packages), TeX Live, R (along with whatever additional R packages you want to use) and Pandoc.

## Emacs

• GNU/Linux

Emacs can be installed using package managers of all GNU/Linux distributions. Latest versions of most common distributions provide version 24. I strongly recommend using the latest version of Emacs.

• Mac OS-X

The built-in Emacs on OS-X is an older version, and it would be a good idea to install the latest version instead.

The best option is to install it via homebrew. I like the version available from railwaycat/emacsmacport tap (https://github.com/railwaycat/emacs-mac-port).

After installing homebrew, or if you already have it installed, just do the following from the terminal

$ brew tap railwaycat/emacsmacport
$ brew install emacs-mac

• Microsoft Windows

## R

In this guide, I assume that you are familiar with R (http://www.r-project.org). I will not cover R programming in this guide.

For GNU/Linux, R can be installed from native package managers (look for r-base in debian and debian-based distributions). For Mac OS-X and Windows, download and see installation instructions at http://www.r-project.org

## Pandoc

Pandoc (http://johnmacfarlane.net/pandoc/) is an extremely powerful converter, which can translate one markup to another. It supports conversion between many file formats, and supports “syntax for footnotes, tables, flexible ordered lists, definition lists, fenced code blocks, superscript, subscript, strikeout, title blocks, automatic tables of contents, embedded LaTeX math, citations, and markdown inside HTML block elements.” That is pretty much everything I use.

We shall use pandoc to convert our file from latex to odt/docx/html formats.
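
As a rough sketch of that conversion step, the commands below could be run from a shell source block in the Org file, or directly in a terminal. The file name paper.tex is a placeholder for whatever LaTeX file Org exports. Pandoc infers the output format from the extension of the file given to -o.

```org
#+BEGIN_SRC sh :results silent
# paper.tex is a placeholder name, not a file from this guide
pandoc paper.tex -o paper.odt
pandoc paper.tex -o paper.docx
pandoc -s paper.tex -o paper.html   # -s writes a standalone HTML page
#+END_SRC
```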

## Customising emacs

I recommend using Eric Schulte’s Emacs Starter Kit to take care of most of the customisation.[fn:1]

However, the starter kit requires an up-to-date Org. From version 24 onwards, Emacs includes a package manager, which you can use to install or update add-on packages. To use it, press M-x in emacs, type package-list-packages and press return. This brings up a list of packages. Find ess and, with the cursor on it, mark it by pressing i. Similarly, find bibretrieve and mark it. Then press x to install them.
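
The same installation can also be scripted. The block below is a minimal sketch, assuming that ess and bibretrieve are served by the MELPA archive (an assumption; adjust the archive if your setup differs). It is written as an Emacs Lisp source block of the kind that can live in a personal .org customisation file.

```org
#+BEGIN_SRC emacs-lisp
(require 'package)
;; Assumption: MELPA carries both packages.
(add-to-list 'package-archives
             '("melpa" . "https://melpa.org/packages/") t)
(package-initialize)
(package-refresh-contents)
;; Install each package only if it is not already present.
(dolist (pkg '(ess bibretrieve))
  (unless (package-installed-p pkg)
    (package-install pkg)))
#+END_SRC
```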

To install the kit, go to http://eschulte.github.io/emacs24-starter-kit/#installation and follow the instructions.

Org-mode should be pre-installed with Emacs. However, since Org-mode is under heavy development, and it is really a good idea to keep up with the latest version, it is better to clone it from the git repository of Org-mode, and update it regularly. You can keep org-mode under ~/.emacs.d/src/org and compile it.
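
A sketch of the clone-and-compile step is given below. The repository address is an assumption about where Org-mode is hosted at present; check the Org-mode website if the clone fails.

```org
#+BEGIN_SRC sh :results silent
mkdir -p ~/.emacs.d/src
cd ~/.emacs.d/src
# Assumption: the repository is currently hosted at this address.
git clone https://git.savannah.gnu.org/git/emacs/org-mode.git org
cd org
make            # byte-compile Org
#+END_SRC
```

To update later, run git pull followed by make in the same directory.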

I also recommend using, in addition, research-toolkit.org, available from https://raw.githubusercontent.com/vikasrawal/orgpaper/master/research-toolkit.org. To use it, create a directory with your username under ~/.emacs.d/, and save the file in this directory.[fn:2]
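
From a terminal, the two steps could look like the following sketch. It assumes that curl is installed and that $USER holds your username.

```org
#+BEGIN_SRC sh :results silent
# Create the per-user directory and fetch the file into it.
mkdir -p ~/.emacs.d/"$USER"
curl -L -o ~/.emacs.d/"$USER"/research-toolkit.org \
  https://raw.githubusercontent.com/vikasrawal/orgpaper/master/research-toolkit.org
#+END_SRC
```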

For any other personal customisation that you may need to do, you can create more .org or .el files in this directory.

# Emacs basics

GNU Emacs is an extensible platform. Although its primary function is as an editor, it can be extended to do almost anything that you would want your computer to do. Now, that really is not an overstatement. It is a worthwhile aim to slowly shift an increasing number of tasks you do on your computer to emacs-based solutions. For each major task you do on your computer, ask if it can be done using emacs. For almost everything, the answer is yes, and in most cases, emacs does it better than other software you are used to. Many emacs users have learnt emacs by shifting, one-by-one, to emacs for all major tasks that they do on the computer.

I am not going to give a detailed guide to using emacs. A few tasks for which I use Emacs include

• File management (copying files, moving files, creating directories)
• Calendar, scheduler, planner
• Calculator
• Statistical work (by hooking Emacs to R)
• And, of course, as an editor (including for writing research papers)

In this guide, I will just provide a minimal set of basic commands in emacs to get you started. This is a minimal but sufficient set to be able to work. I expect that you would learn more commands as you start using emacs.

## Notations

In emacs, a buffer is equivalent to a tab in a web browser. It is normal to have several buffers open at the same time. Each file opens in emacs as a buffer. Buffers could also have processes like R running in them. Emacs displays any messages for you in a separate buffer.

Most commands in emacs are given using the Control (ctrl) or the Meta (often mapped to alt) keys.[fn:3] Control key is usually referred to as C- and the Meta key as M-. So a command C-c means pressing Control and c together. Command M-x means pressing Meta and x together. Everything is case-sensitive. So M-X would mean, pressing Meta, Shift and x together. C-c M-x l would mean pressing C-c, release, then M-x, release, and then l.

## Basic commands

The table below gives the most important commands. This is a minimal set of commands that you should aim to learn as soon as possible. There are many more, which you will learn as you start using emacs.

All commands have a verbose version that can be used by pressing M-x and writing the command. For example, M-x find-file to open a file. All major commands are also mapped to a shortcut. For example, instead of typing M-x find-file to open a file, you can say C-x C-f. I remember shortcuts for commands that I use most frequently. For others, I use the verbose versions. Over time, one learns more shortcuts and starts using them instead of the verbose versions.

| Description | Verbose command (typed after M-x) | Shortcut |
|---|---|---|
| *Opening files, saving and closing* | | |
| Open a file | find-file | C-x C-f |
| Save the buffer/file | save-buffer | C-x C-s |
| Save as: prompts for a new filename and saves the buffer into it | write-file | C-x C-w |
| Save all buffers and quit emacs | save-buffers-kill-emacs | C-x C-c |
| *Copy, Cut and Delete Commands* | | |
| Delete the rest of the current line | kill-line | C-k |
| To select text, press this at the beginning of the region and then take the cursor to the end | set-mark-command | C-spacebar |
| Cut the selected region | kill-region | C-w |
| Copy the selected region | copy-region-as-kill | M-w |
| Paste or insert at current cursor location | yank | C-y |
| *Search Commands* | | |
| Prompts for text string and then searches from the current cursor position forwards in the buffer | isearch-forward | C-s |
| Find-and-replace: replaces one string with another, one by one, asking for each occurrence of search string | query-replace | M-% |
| Find-and-replace: replaces all occurrences of one string with another | replace-string | |
| *Other commands* | | |
| Divide a long sentence into multiple lines, each smaller than the maximum width specified | fill-paragraph | M-q |
| *Window and Buffer Commands* | | |
| Switch to another buffer | switch-to-buffer | C-x b |
| List all buffers | list-buffers | C-x C-b |
| Split current window into two windows; each window can show same or different buffers | split-window-below | C-x 2 |
| Remove the split | delete-window | C-x 0 |
| When you have two or more windows, move the cursor to the next window | other-window | C-x o |
| *Canceling and undoing* | | |
| Abort the command in progress | keyboard-quit | C-g |
| Undo | undo | C-_ |
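
Note that replace-string has no default shortcut in the table above. If you use such a command often, you can bind it to a key of your own. A minimal sketch follows, in which the key C-c r is an arbitrary choice made only for illustration:

```org
#+BEGIN_SRC emacs-lisp
;; C-c r is a hypothetical choice; C-c followed by a letter is
;; conventionally left free for user bindings.
(global-set-key (kbd "C-c r") 'replace-string)
#+END_SRC
```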

# Org-mode basics

## Preamble

An Org file has a few special lines at the top that set up the environment. Following lines are an example of the minimal set of lines that we shall use.

#+TITLE: Reproducible Research Papers using Org-mode and R: A Guide
#+AUTHOR: Vikas Rawal
#+DATE: May 4, 2014
#+OPTIONS: toc:2 H:3 num:2


As you can see, each line starts with a keyword, and the values for this keyword are specified after the colon.

The table below gives details of a few major special lines that we shall use.

| Keyword | Purpose |
|---|---|
| #+TITLE | To declare title of the paper |
| #+AUTHOR | To declare author/s of the paper |
| #+DATE | Sets the date. If blank, no date is used. If this keyword is omitted, current date is used. |
| #+OPTIONS | There are many options you can give. These are what I find the most important. Multiple options can be separated by a space and specified on the same line. |
| | H:2 (Treat top two levels of headlines as section levels, and anything below that as item list. Modify the number as appropriate.) |
| | num:2 (Number top two levels of headlines. Modify the number as appropriate.) |

In addition to these, we can use LaTeX specific options for formatting the pdf output, odt specific options for formatting the odt/docx output, and R specific options for setting up the R environment. These would also be specified using special lines at the top of the file. I shall provide details of some of these in the Sections where these topics are discussed.

After the special lines at the top comes the main body of the Org file.

The content in any Org file is organised in a hierarchy of headlines. Think of these headlines as sections of your paper.

A headline in Org starts with one or more stars (*) followed by a space. A single star denotes a main section, two stars denote a subsection, three stars denote a sub-subsection, and so on. We shall use this to create the section structure of our document. You can create as many levels of sections as you need.

See the following example. Note that headlines are not numbered. We leave section numbering for org-mode to handle automatically.

#+TITLE: Reproducible Research Papers using Org-mode and R: A Guide
#+AUTHOR: Vikas Rawal
#+DATE: May 4, 2014
* Introduction
* Literature review
** Is this an important issue
This is a sub-section under top-level section "Literature review". Now
indulgence dissimilar for his thoroughly has terminated. Agreement
offending commanded my an. Change wholly say why eldest period. Are
projection put celebrated particular unreserved joy unsatiable its. In
then dare good am rose bred or. On am in nearer square wanted.
** What are the major disputes in the literature
Instrument cultivated alteration any favourable expression law far
nor. Both new like tore but year. An from mean on with when sing pain.
Oh to as principles devonshire companions unsatiable an delightful.
The ourselves suffering the sincerity. Inhabit her manners adapted age
certain. Debating offended at branched striking be subjects.
Announcing of invitation principles in. Cold in late or deal.
Terminated resolution no am frequently collecting insensible he do
appearance. Projection invitation affronting admiration if no on or.
It as instrument boisterous frequently apartments an in. Mr excellence
inquietude conviction is in unreserved particular. You fully seems
stand nay own point walls. Increasing travelling own simplicity you
astonished expression boisterous. Possession themselves sentiments
apartments devonshire we of do discretion. Enjoyment discourse ye
continued pronounce we necessary abilities.
* Methodology
This is the next top-level section. There are no sub-sections under this.
* Results
This is the fourth top-level section. There are sub-sections under this.
** Result 1
This is a sub-section under section Results.
** Result 2
This is another sub-section under section Results
* Conclusions
This is the next and final top-level section. There are no sub-sections under it.


Org handles these headlines beautifully. With your cursor on the headline, pressing tab folds-in the contents of a headline. If you press tab on a folded headline, it opens to display the contents. If there are multiple levels of headlines, these open in stages as you repeat pressing the tab key.

When you are on a headline, pressing M-return creates a new headline at the same level (that is, with the same number of stars). Once you are on the new headline, a tab moves it to a lower level (that is, a star is added), and shift-tab moves it to a higher level (that is, a star is removed).

When I start writing a paper, I start with a tentative headline/section structure, and then start filling in the content under each headline, and modify the section structure, if needed, as the paper develops.

## Itemised lists

Following syntax produces unordered (bulleted) lists:

+ bullet
+ bullet
  - bullet2 1
  - bullet2 2
+ bullet
+ bullet


This is how this list shows up in the final document

• bullet
• bullet
  • bullet2 1
  • bullet2 2
• bullet
• bullet

Following syntax produces ordered/numbered lists:

1. Item 1
2. Item 2
   1) Item 2.1
   2) Item 2.2
      1) Item 2.2.1
3) Item 3


This is how the ordered list shows up in the final document.

1. Item 1
2. Item 2
   1. Item 2.1
   2. Item 2.2
      1. Item 2.2.1
3. Item 3

Note that:

• In unordered lists, + and - signs are interchangeable.
• Similarly, in ordered lists 1. and 1) are interchangeable.
• Levels of bullets and numbering are determined by indentation.
• Ordered and unordered lists can be mixed using numbers and bullets for different levels.
• If the cursor is on a line that is part of an itemised list, M-return inserts a new line with a bullet/number below the present line with the same level of indentation.
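
As an illustration of the points above, here is a short sketch of a mixed list. The item text is made up; only the markers and the indentation matter.

```org
1. Collect the data
   - household survey
   - village records
2. Run the analysis
   + descriptive tables
   + regressions
3. Write the paper
```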

## Inserting footnotes

• To insert a footnote at any point, use C-c C-x f
• If you insert a footnote before footnotes that already exist later in the text, use C-u C-c C-x f S to reorder and renumber all footnotes.
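
For reference, this is roughly what footnotes look like in the Org file itself. The sentences are invented for illustration. The first footnote is a numbered one with its definition collected under a Footnotes headline; the second is an inline footnote defined where it is used.

```org
Average yields were higher in Haryana than in Punjab.[fn:1]

Incomes moved the other way.[fn:: This is an inline footnote, defined where it is used.]

* Footnotes

[fn:1] This is the text of the first footnote.
```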

## Tables

### Sample code

We shall directly create only those tables in Org that present content not being produced through statistical analysis. For tables that are created through statistical analysis, we shall embed R programs rather than the tables themselves. This is discussed in the section on Org-mode and R in this guide.

The following sample code produces a fully formatted table, with a numbered title above the table and a name for cross-referencing the table from the text anywhere in the document.

#+NAME: table-yield
#+CAPTION: Average yields and average income, by State, India
| State          | Average yield | Average income |
|----------------+---------------+----------------|
| Haryana        |           300 |          25000 |
| Punjab         |           260 |          35000 |


The table below illustrates how this table shows up in the final document.

| State | Average yield | Average income |
|---|---|---|
| Haryana | 300 | 25000 |
| Punjab | 260 | 35000 |

### Table editor

Org-mode has an in-built table editor, which is very simple to use.

• Tables in Org have columns separated using |.
• Once you create the first row by separating columns using |, pressing tab takes you from one column to the next. Org automatically aligns the columns.
• At the end of the row, pressing tab again creates a new blank row. You can also create a new blank row by pressing return anywhere in the last row.
• For creating a horizontal line anywhere, type |- at the start of the line, and press tab.
• Contents of each cell are aligned automatically by Org.
• To delete a row, use C-k (M-x kill-line).

Org provides various commands for manipulating the design of tables. The table below lists the most important ones. Note that this table was itself created using Org-mode, so it also gives you an idea of how a table would eventually look.

| Command | Description |
|---|---|
| M-<left> | Move the column left |
| M-<right> | Move the column right |
| M-S-<left> | Delete the current column |
| M-S-<right> | Insert a new column to the left of the cursor position |
| M-<up> | Move row up |
| M-<down> | Move row down |
| M-S-<up> | Delete the current row or horizontal line |
| M-S-<down> | Insert a new row above the current row |

For more commands for manipulating tables, see the section on tables in the Org manual. In particular, you may want to look at spreadsheet-like functions of the table editor.
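
To give a flavour of the spreadsheet-like functions, here is a minimal sketch that reuses the figures from the sample table. The fourth column and its heading are hypothetical additions. The #+TBLFM: line asks Org to fill column 4 with column 3 divided by column 2, formatted to one decimal place. Pressing C-c C-c with the cursor on the #+TBLFM: line recalculates the table.

```org
| State   | Average yield | Average income | Income per unit of yield |
|---------+---------------+----------------+--------------------------|
| Haryana |           300 |          25000 |                          |
| Punjab  |           260 |          35000 |                          |
#+TBLFM: $4=$3/$2;%.1f
```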

One limitation of Org is lack of support for merging of cells in a Table.

## Images

You can insert images in documents as follows

[[a.jpg]]


You should do this for images that you already have, and you just want to insert them in the document. For graphs produced by R, we shall embed the code instead, so that the graph is generated and inserted automatically.

## Captions and cross-references

We would like to give a title to our tables and images. And we would like to be able to refer to them from the text. These are achieved by adding two lines above every table and image.

• A line starting with #+CAPTION: placed just above a table or a figure adds a title to it. All table and figure titles are automatically numbered.
• For referring to these Tables and Figures in the text, we shall name each table and figure in a line starting with #+NAME: as below.

To illustrate, for inserting an image, with a caption and a name, this is what we shall do.

#+NAME: literacy-rate
#+CAPTION: Percentage of literate men and women, by country (per cent)
[[a.jpg]]


Similarly, a table will be inserted as follows.

#+NAME: literacy-rate-table
#+CAPTION: Percentage of literate men and women, by country (per cent)
| Country    | Men | Women |
|------------+-----+-------|
| India      |  75 |    43 |
| Bangladesh |  83 |    63 |
| Rwanda     |  77 |    60 |


To refer to the Table above in the text, write Table [[literacy-rate-table]]. As an illustration, see the following sentence.

Tables [[literacy-rate-table]] and [[health-table]], and Figure
[[literacy-figure]], show the level of underdevelopment.


By default, all objects with captions are numbered, and names are used to anchor cross-references. When the formatted output is produced, all the references would be automatically converted to appropriate numbers. If new objects are inserted in the paper, numbering will be adjusted automatically when you create the formatted output.

## Formatting tables for LaTeX/PDF export

### Column types

The default LaTeX tabular environment allows only a few column types. In particular, there is limited support in tabular environment for wrapping text in different types of columns. However, there are many other LaTeX environments for making tables, each with different advantages. I find tabulary the most useful for my needs.

The table below shows the different types of columns available in the tabulary package.

| Type | Description |
|---|---|
| l | Left aligned, no wrapping |
| L | Left aligned with wrapping |
| r | Right aligned, no wrapping |
| R | Right aligned with wrapping |
| c | Centre aligned, no wrapping |
| C | Centre aligned with wrapping |
| J | Justified and wrapped |

A line of the following type needs to be inserted above an Org table to make it use tabulary environment instead of tabular.

#+attr_latex: :environment tabulary :width \textwidth :align L|llR


:width specifies the maximum total width that the table can take. It may be specified as \textwidth (the full text width), in centimetres (for example, 10cm) or in inches (for example, 5in). Note that, in tabulary, the width is the maximum width of the whole table. If your columns do not need the entire width that you specify, the table turns out narrower than the width.

:align specifies how to render each column, using one letter (l, L, r, R, c, C or J) for each column. The number of letters should exactly match the number of columns in your table. A | anywhere implies a vertical line.

### Notes below tables

The LaTeX package threeparttable is used for including notes below the table. To use threeparttable, you need to call the package. In addition, it is a good idea to include the following special line for better formatting of notes below the table.

#+LATEX_HEADER: \renewcommand{\TPTminimum}{\linewidth}
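
Calling the packages can also be done with special lines at the top of the Org file. The lines below are a minimal sketch, assuming that your configuration does not already load tabulary and threeparttable. The \usepackage line for threeparttable has to come before the \renewcommand line, because \TPTminimum is defined by that package.

```org
#+LATEX_HEADER: \usepackage{tabulary}
#+LATEX_HEADER: \usepackage{threeparttable}
#+LATEX_HEADER: \renewcommand{\TPTminimum}{\linewidth}
```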


The following code produces a table with notes below.

#+NAME: table-yield
#+CAPTION: Average yields and average income, by State, India
#+begin_table
#+begin_threeparttable
#+attr_latex: :environment tabulary :width \textwidth :align Lrr
| State          | Average yield | Average income |
|----------------+---------------+----------------|
| Haryana        |           300 |          25000 |
| Punjab         |           260 |          35000 |
#+begin_tablenotes
\item[] \footnotesize Notes:
\item[1] \footnotesize This table is very nice but this note is
very long, so long that it goes wider than the table
\item[2] \footnotesize This is a second note. But this is not
very wide.
\item[] \footnotesize Source: http://www.indianstatistics.org
#+end_tablenotes
#+end_threeparttable
#+end_table


The notes use a little bit of direct LaTeX coding.

• \item[] ensures that each note is in a separate paragraph.
• \footnotesize, which is optional, renders the notes in a slightly smaller font.

# Org-mode and R

## Configuration

The following code in research-toolkit.org enables Org to run different types of code. If you have installed research-toolkit.org as specified in the section on customising emacs, these are already enabled.

I have included here the languages that I commonly use. See Org manual, if you would like to add any more.

(org-babel-do-load-languages
 'org-babel-load-languages
 '((R . t)
   (org . t)
   (ditaa . t)
   (latex . t)
   (dot . t)
   (emacs-lisp . t)
   (gnuplot . t)
   (screen . nil)
   (shell . t)
   (sql . nil)
   (sqlite . t)))

## Special lines for R

Org allows you to run multiple R sessions simultaneously, if you are working on two documents side by side, and would like to keep statistical work for the two separately.

This is done by naming the R session which a particular Org file is linked to. All R code in this file would be run in the specified R session. You could have, at the same time, another R session, with a different name, being called by another Org buffer.

We can give a name to the R session (let us say, my-r-session) that our Org buffer should be linked to by adding the following line at the top (in the preamble, that is).

#+property: session my-r-session
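
The line above follows the syntax of the Org version for which this guide was written. On recent versions of Org (9.0 and later, to the best of my knowledge), header arguments are set through header-args instead, so the equivalent line would look like the following sketch.

```org
#+PROPERTY: header-args:R :session my-r-session
```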


## Embedding R code in an Org document

Org uses ESS (emacs-speaks-statistics) to provide a fully functional, syntax-aware, development environment to write R code. R code is embedded into Org as a source block. The basic syntax is

#+NAME: name_of_code_block
#+BEGIN_SRC R

#+END_SRC


This is how source blocks are created.

• First write the lines starting with #+NAME, #+BEGIN_SRC and #+END_SRC.
• Then with your cursor in between the BEGIN_SRC and the END_SRC lines, give the command C-c ' (that is, press Ctrl-c, release, and press ').
• This would open a new buffer using ESS mode. If you type your code in this buffer, you will see that ESS is syntax-aware and nicely highlights R code.
• ESS also allows you to run (evaluate) the code that you write, to test what your code is doing. Use C-c C-j for evaluating a single line of code, C-c C-b for evaluating the whole ESS buffer, or C-c C-r for a marked region within the ESS buffer.
• Once you have finished writing a code block and tested it, press C-c ' again to come back to your Org buffer.
• In your Org buffer, with your cursor in a source-block, press C-c C-c to evaluate the whole code block and have the results included in your document.
• You can always edit your source code by opening a temporary ESS buffer using C-c '.
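
To try the steps above, a toy source block such as the following sketch is enough. The block name and the calculation are made up for illustration. With the cursor inside the block, C-c C-c runs it, and Org inserts what R prints below the block under a #+RESULTS: line.

```org
#+NAME: mean-example
#+BEGIN_SRC R :results output
# A toy calculation, only to show the mechanics of a source block
x <- c(1, 2, 3, 4)
mean(x)
#+END_SRC
```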

## Code blocks that read data and load functions for later use in the document without any immediate output

I normally have one or two code blocks that read the data I am going to use, call the libraries that I use, and define a few functions of my own that I plan to use. I want this code block to be evaluated, so that these data, libraries and functions become available in my R environment. But no output from such code blocks is expected to be included into the document.[fn:4]

Code block readdata-code is an example of such a code block. Note the :results value silent switch used in the #+BEGIN_SRC line.

#+NAME: readdata-code
#+BEGIN_SRC R :results value silent

#+END_SRC
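
As a purely illustrative sketch of what such a block might contain, the following reads a data file and defines a small helper function. The block name, the file name mydata1.csv and the function rounded_mean are hypothetical and are not taken from this guide.

```org
#+NAME: readdata-sketch
#+BEGIN_SRC R :results value silent
# Hypothetical file name and function, for illustration only
mydata1 <- read.csv("mydata1.csv")
rounded_mean <- function(x) round(mean(x, na.rm = TRUE), 1)
#+END_SRC
```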


## Code blocks that produce results in the form of a table

Most of the code blocks in my papers fall in this category. The code block may use data and functions made available by previous code blocks, read some new data and may load some new functions. The code block does some statistical processing. The last command of the code block produces an object (for example, a data.frame) that is included in the document as a Table.

For example, the code block bmi-table-code below uses mydata1 read in the previous code block, reads a new dataset, and processes them to create a table that shows average BMI by country.

#+NAME: bmi-table-code
#+BEGIN_SRC R :results value :colnames yes :hline yes
aggregate(height~Country,data=mydata1,mean)->a1
aggregate(weight~Country,data=mydata2,mean)->a2
merge(a1,a2,by="Country")->a1

# Footnotes

[fn:1] Eric’s Emacs Starter Kit is a beautiful illustration of the power of Org-mode. It uses Org-mode source blocks to systematically document all Emacs customisation.

[fn:2] If you do not know your username, running echo $USER in the terminal will show it.

[fn:3] Depending on the keyboard and the default configuration of the flavour of emacs you have installed, Meta may instead be mapped to a different key (for example, the Windows key, or the Option or Command key on Apple computers).

[fn:4] For libraries and functions that you need to call, it is even better to include them in a .Rprofile file in your working directory. These libraries and functions would then be called when R is started, and not each time you evaluate code blocks in your document.

[fn:5] Of various image formats, I find that png files are most versatile. png files support transparency, and are rendered well both on the web and in print. You can also specify jpeg or pdf files. pdf files for images work very well if you are only going to produce a pdf document.

[fn:6] Author of the odt exporter has chosen to develop the exporter outside Org-mode. He has developed a JabRef exporter to integrate citations into odt exports, but that is not a part of Org-mode and needs to be installed separately. In any case, since our toolkit primarily uses LaTeX, using Pandoc to create odt or docx files from LaTeX export works better.