# Markdown generator for publications

Takes a TSV file with publication information and converts it into markdown for display on the website. Inspired by the packages used in [academicpages.github.io](https://academicpages.github.io) but adapted for the use case in astronomy, in particular by adding links to the arXiv. I've also allowed an optional distinction between journal papers and conference proceedings (in case you want to list the latter).

To use this code, replace ```publications.tsv``` and ```proceedings.tsv``` with files containing your own data. 

## Data format

I have not yet implemented a version that reads directly from BibTeX, so at present you need to create the TSV file yourself, which can be rather cumbersome if done by hand.

The TSV ```publications.tsv``` should have the following columns: ```acc```, ```year```, ```title```, ```journal```, ```vol```, ```doi```, ```eprint```, ```author```, ```comment```, ```url_slug```

- ```acc```: boolean, indicates whether paper has been accepted or not
- ```year```: the year to appear in the citation (I like to have this the year of journal publication, if different from the year of the preprint appearing on arXiv)
- ```title```: self-explanatory
- ```journal```: optional (ignored if ```acc=False```)
- ```vol```: optional, bibliographic information on journal volume and page number (can also put ```in press``` for papers that have received a DOI but not been assigned other bibliographic information)
- ```doi```: optional, DOI for accepted papers (note, only the DOI, not the full URL)
- ```eprint```: arXiv identifier, in the form ```arXiv:xxxx.yyyyy``` or ```astro-ph/zzzzzz``` (note: don't provide only the numerical value, as otherwise ```pandas``` will interpret it as a float and that screws things up!)
- ```author```: self-explanatory (format is up to you, but I like to list 3 authors then **et al**. for >10 authors, or all authors otherwise)
- ```comment```: optional, a field for you to add things like "front cover of Nature!" or whatever
- ```url_slug```: short descriptive text, e.g. "first_author_et_al", used to name the markdown file for this entry. The markdown file will have the name ```[num]-[url_slug]-[year].md``` where ```num``` is generated from the value of ```eprint``` (see below)
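
The float pitfall mentioned for ```eprint``` is easy to reproduce. The sketch below uses a made-up one-row TSV (the identifier and slug are placeholders, not taken from ```publications.tsv```) and shows a bare numerical identifier being parsed as a float and losing its trailing zeros, while the prefixed form survives as text. Passing ```dtype``` to ```read_csv``` is an alternative that the code in this notebook does not rely on.

```python
import io
import pandas as pd

# Made-up rows for illustration only: the same identifier without and with the prefix.
bare = "eprint\turl_slug\n2007.09000\tExample_et_al\n"
prefixed = "eprint\turl_slug\narXiv:2007.09000\tExample_et_al\n"

bare_df = pd.read_csv(io.StringIO(bare), sep="\t", header=0)
prefixed_df = pd.read_csv(io.StringIO(prefixed), sep="\t", header=0)
# Alternative: tell pandas to keep the column as text even without the prefix.
forced_df = pd.read_csv(io.StringIO(bare), sep="\t", header=0, dtype={"eprint": str})

print(bare_df.eprint.dtype)      # a float dtype: the identifier is now a number
print(str(bare_df.eprint[0]))    # the trailing zeros are gone
print(prefixed_df.eprint[0])     # the prefixed identifier is kept as text
print(forced_df.eprint[0])       # forcing a string dtype also keeps it intact
```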

I want my papers to be listed on the website in reverse chronological order, with the chronology determined by the date of appearing on the arXiv (not the date of publication in the journal, which can be very different). Therefore I sort the list of papers extracted from the TSV according to the ```eprint``` value, and assign each of $N$ items an integer ```num``` in the range $[1, N]$, represented using 3 digits with leading zeros if necessary (this assumes you don't have more than 999 papers – if you do you probably don't want to put them all on your website!). 
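
As a minimal sketch of this numbering scheme, the snippet below sorts three entries by ```eprint``` and builds the zero-padded file names. The ```num``` and ```md_filename``` columns are added here purely for illustration; the loop later in this notebook keeps a running counter instead of storing them.

```python
import pandas as pd

# Three entries taken from the example table; only these columns matter here.
papers = pd.DataFrame({
    "eprint": ["arXiv:2007.08991", "arXiv:2010.12671", "arXiv:2002.01689"],
    "url_slug": ["eBOSS_cosmology", "Manolopoulou_et_al", "Li_et_al"],
    "year": [2020, 2020, 2020],
})

# Newest arXiv identifier first, as on the website.
papers = papers.sort_values("eprint", ascending=False).reset_index(drop=True)

# The newest paper gets num = N and the oldest gets num = 1.
n = len(papers)
papers["num"] = [n - i for i in range(n)]

# Three digits with leading zeros, following [num]-[url_slug]-[year].md
papers["md_filename"] = [
    "%03d-%s-%d.md" % (int(num), slug, int(year))
    for num, slug, year in zip(papers.num, papers.url_slug, papers.year)
]
print(papers[["eprint", "md_filename"]])
```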

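If you are old enough to have papers from the days of the old arXiv listing format ```astro-ph/zzzzzz``` or similar, this sorting will not work for you, because a plain string sort does not put the old identifiers in chronological order relative to the new ones. In this case you should adjust the code below as you see fit.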
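
One possible adjustment, sketched here and not part of the code below, is to sort on a key that understands both formats. The helper ```arxiv_sort_key``` is hypothetical; it assumes old-style identifiers look like ```archive/YYMMNNN``` and new-style ones like ```YYMM.NNNN``` or ```YYMM.NNNNN```, and that two-digit years from 91 upwards belong to the 1990s.

```python
import re

def arxiv_sort_key(eprint):
    """Return (year, month, serial) for an arXiv identifier in either format."""
    ident = eprint.replace("arXiv:", "")
    new_style = re.fullmatch(r"(\d{2})(\d{2})\.(\d{4,5})", ident)
    old_style = re.fullmatch(r"[a-z-]+(?:\.[A-Z]{2})?/(\d{2})(\d{2})(\d{3})", ident)
    match = new_style or old_style
    if match is None:
        raise ValueError("unrecognised arXiv identifier: " + eprint)
    yy, month, serial = (int(group) for group in match.groups())
    # Assumption: two-digit years 91-99 mean 1991-1999, everything else is 20yy.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return (year, month, serial)

# The first and last identifiers are placeholders in the two formats.
eprints = ["astro-ph/9901001", "arXiv:2010.12671", "astro-ph/0601001", "arXiv:0704.0001"]

plain = sorted(eprints, reverse=True)                      # string sort, newest "first"
chrono = sorted(eprints, key=arxiv_sort_key, reverse=True) # key-based sort
print(plain)
print(chrono)
```

The plain string sort puts every ```astro-ph/``` entry ahead of the ```arXiv:``` ones, whereas the key-based sort runs from the 2020 paper back to the 1999 one.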

## Import the data

We use ```pandas``` to read in the TSV data that you will have created using a spreadsheet or other editing software.

```python
import pandas as pd

publications = pd.read_csv("publications.tsv", sep="\t", header=0).sort_values('eprint', ascending=False)
publications
```

| Unnamed: 0 | acc | year | title | journal | vol | doi | eprint | author | comment | url_slug |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | 2020 | Environmental dependence of X-ray and optical ... | MNRAS | in press | 10.1093/mnras/staa3341 | arXiv:2010.12671 | M. Manolopoulou, B. Hoyle, R. G. Mann, M. Sahl... | | Manolopoulou_et_al |
| 3 | False | 2020 | DES Y1 results: Splitting growth and geometry ... | | | | arXiv:2010.05924 | J. Muir, E. Baxter, V. Miranda, et al. | | Muir_et_al |
| 4 | True | 2020 | The completed SDSS-IV extended baryon oscillat... | MNRAS | 499, 4140 | 10.1093/mnras/staa3074 | arXiv:2008.06060 | Seshadri Nadathur, Alex Woodfinden, Will J. Pe... | | Nadathur_et_al_eBOSS |
| 7 | False | 2020 | The Completed SDSS-IV Extended Baryon Oscillat... | | | | arXiv:2007.09013 | Marie Aubert, Marie-Claude Cousinou, St{\'e}ph... | | Aubert_et_al |
| 8 | False | 2020 | The Completed SDSS-IV extended Baryon Oscillat... | | | | arXiv:2007.09008 | Arnaud de Mattia, Vanina Ruhlmann-Kleider, Ana... | | deMattia_et_al |
| 1 | True | 2020 | The completed SDSS-IV extended Baryon Oscillat... | MNRAS | in press | 10.1093/mnras/staa3336 | arXiv:2007.09007 | Anand Raichoor, Arnaud de Mattia, Ashley J. Ro... | | Raichoor_et_al |
| 6 | True | 2020 | The Completed SDSS-IV extended Baryon Oscillat... | MNRAS | 498, 2492 | 10.1093/mnras/staa2455 | arXiv:2007.08994 | H{\'e}ctor Gil-Mar{\'\i}n, Juli{\'a}n E. Bauti... | | Gil-Marin_et_al |
| 5 | True | 2020 | The Completed SDSS-IV extended Baryon Oscillat... | MNRAS | in press | 10.1093/mnras/staa2800 | arXiv:2007.08993 | Julian E. Bautista, Romain Paviot, Mariana Var... | | Bautista_et_al |
| 9 | False | 2020 | The Completed SDSS-IV extended Baryon Oscillat... | | | | arXiv:2007.08991 | eBOSS Collaboration (S. Alam et al.) | | eBOSS_cosmology |
| 14 | False | 2020 | Reconstructing the radial velocity profile of ... | | | | arXiv:2002.01689 | Yi-Chao Li, Yin-Zhe Ma, Seshadri Nadathur | | Li_et_al |


## Escape special characters

YAML is very picky about what it accepts as a valid string, so we replace single and double quotes (and ampersands) with their HTML-encoded equivalents. This makes them less readable in raw format, but they are parsed and rendered nicely.

I have also added functionality to replace standard TeX code for some subset of accents, umlauts and diacritics in author names, and to convert "et al." to *et al.* (in italics). Note that in ```publications.tsv``` I have replaced the standard TeX ```{\"o}``` with ```{\:o}``` to represent &ouml; (and similar for &auml;, &uuml; etc.). This is because ```\"``` is an escape character that screws up my TSV editor; you may not have the same issue.

```python
# Single characters that would break a double-quoted YAML string
html_char_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;"
    }

# TeX accent codes (and "et al.") mapped to their HTML equivalents
html_str_replace_table = {
    "{\\'A}": "&Aacute;",
    "{\\'a}": "&aacute;",
    "{\\`a}": "&agrave;",
    '{\\"a}': "&auml;",
    '{\\:a}': "&auml;",  # note the unusual (non-TeX) symbol for umlauts: \" is an escape character in TSV readers
    "{\\'e}": "&eacute;",
    "{\\`e}": "&egrave;",
    "{\\~n}": "&ntilde;",
    "{\\'\\i}": "&iacute;",
    "{\\'o}": "&oacute;",
    '{\\"o}': "&ouml;",
    '{\\:o}': "&ouml;",  # note the unusual (non-TeX) symbol for umlauts
    "{\\~o}": "&otilde;",
    "{\\o}": "&oslash;",
    "{\\'r}": "&racute;",
    '{\\"u}': "&uuml;",
    '{\\:u}': "&uuml;",  # note the unusual (non-TeX) symbol for umlauts
    "{\\'u}": "&uacute;",
    "{\\v{z}}": "&zcaron;",
    "et al.": "<em>et al.</em>"
}

def html_char_escape(text):
    """Replace ampersands and quotes in text with their HTML entities."""
    return "".join(html_char_escape_table.get(c, c) for c in text)

def html_str_replace(text):
    """Replace TeX accent codes and "et al." in text with their HTML equivalents."""
    for c in html_str_replace_table:
        text = text.replace(c, html_str_replace_table[c])
    return text
```
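
To see what the two helpers do, here is a small self-contained sketch using trimmed copies of the tables. The title is made up; the author string follows the TeX style of one of the entries in the table above.

```python
# Trimmed copies of the two tables, just enough for the examples below.
html_char_escape_table = {"&": "&amp;", '"': "&quot;", "'": "&apos;"}
html_str_replace_table = {
    "{\\'e}": "&eacute;",
    "{\\'\\i}": "&iacute;",
    "et al.": "<em>et al.</em>",
}

def html_char_escape(text):
    """Replace ampersands and quotes in text with their HTML entities."""
    return "".join(html_char_escape_table.get(c, c) for c in text)

def html_str_replace(text):
    """Replace TeX accent codes and "et al." in text with their HTML equivalents."""
    for c in html_str_replace_table:
        text = text.replace(c, html_str_replace_table[c])
    return text

title = 'Dark energy & "voids"'                   # made-up title
author = "H{\\'e}ctor Gil-Mar{\\'\\i}n, et al."   # TeX accents as typed in the TSV

escaped_title = html_char_escape(title)
replaced_author = html_str_replace(author)
double_escaped = html_char_escape(replaced_author)

print(escaped_title)     # Dark energy &amp; &quot;voids&quot;
print(replaced_author)   # H&eacute;ctor Gil-Mar&iacute;n, <em>et al.</em>
print(double_escaped)    # H&amp;eacute;ctor Gil-Mar&amp;iacute;n, <em>et al.</em>
```

The last line shows why the two functions are applied to different fields: running ```html_char_escape``` on a string that already contains entities escapes their ampersands a second time.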

## Creating the markdown files

This is the main bit of the code that creates a markdown file for each entry in the TSV file. These files are then processed by ```Liquid``` templates when generating the publication list on the website.

```python
import os

# The list is sorted newest first, so entries are numbered from N down to 1
count = len(publications.index)
for row, item in publications.iterrows():
    md_filename = "%03d-" % count + item.url_slug + "-" + str(item.year) + ".md"
    arxiv_id = item.eprint.replace('arXiv:', '')
    arxiv_link = "https://arxiv.org/abs/" + arxiv_id

    # YAML front matter variables
    md = "---\nnumber: \"" + str(count) + '"\n'
    md += "title: \"" + html_char_escape(item.title) + '"\n'
    md += "arxiv_link: \"" + arxiv_link + '"\n'
    md += "arxiv_id: \"" + arxiv_id + '"\n'
    md += "author: \"" + html_str_replace(item.author) + '"'
    md += "\nreviewed: " + str(item.acc)
    if item.acc:
        # journal and vol are only read for accepted papers, so both must be filled in for those rows
        md += "\njournal: \"" + item.journal + ", " + item.vol + " (" + str(item.year) + ')"'
        md += "\ndoi: \"" + str(item.doi) + '"'
    if pd.notna(item.comment):  # empty TSV cells are read in as NaN
        md += "\ncomment: \"" + html_char_escape(item.comment) + '"'
    md += "\n---"

    # Drop any directory components that may have crept into the slug
    md_filename = os.path.basename(md_filename)

    with open("../_publications/" + md_filename, 'w') as f:
        f.write(md)
    count -= 1
```
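
For orientation, the sketch below builds the kind of front matter the loop writes for one accepted paper. The helper ```front_matter``` is hypothetical (the loop above concatenates strings directly), and every value is a placeholder.

```python
def front_matter(fields):
    """Join (key, value) pairs into a YAML front matter block (hypothetical helper)."""
    lines = ["---"]
    for key, value in fields:
        lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines)

arxiv_id = "xxxx.yyyyy"  # placeholder identifier, as in the data format section

example = front_matter([
    ("number", '"3"'),
    ("title", '"An example title"'),
    ("arxiv_link", f'"https://arxiv.org/abs/{arxiv_id}"'),
    ("arxiv_id", f'"{arxiv_id}"'),
    ("author", '"A. Author, B. Author, <em>et al.</em>"'),
    ("reviewed", "True"),
    ("journal", '"MNRAS, in press (2020)"'),
    ("doi", '"10.xxxx/placeholder"'),
])
print(example)
```

The file contains only this front matter; there is no body text after the closing ```---```.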

## Do the same with proceedings, if desired

If you want to list conference proceedings separately from journal papers, as I do, repeat the steps above with ```proceedings.tsv```. The differences are:
- we do not assume that all proceedings are posted on the arXiv, so there is a new column ```arXiv```, containing a boolean which indicates whether or not to use the ```eprint``` information
- sorting is done on ```year``` only
- the ```journal``` and ```vol``` columns are replaced by ```proceedings```, which contains the equivalent bibliographic information
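
Because the loop below reads columns as attributes (```item.arXiv```, ```item.proceedings``` and so on), a misspelt or differently capitalised header only fails part-way through. A quick check like the following sketch can catch that earlier. The helper ```missing_columns``` is hypothetical, and the column list is my reading of what the loop accesses.

```python
import io
import pandas as pd

# Columns read by the proceedings loop (inferred from the attribute accesses in it).
REQUIRED_PROCEEDINGS_COLUMNS = [
    "acc", "year", "title", "proceedings", "doi",
    "eprint", "arXiv", "author", "comment", "url_slug",
]

def missing_columns(table, required):
    """Return the required column names that are absent from table."""
    return [name for name in required if name not in table.columns]

# Made-up header line with the arXiv column written in lower case.
sample = "acc\tyear\ttitle\tproceedings\tdoi\teprint\tarxiv\tauthor\tcomment\turl_slug\n"
table = pd.read_csv(io.StringIO(sample), sep="\t", header=0)

missing = missing_columns(table, REQUIRED_PROCEEDINGS_COLUMNS)
print(missing)  # column names are case-sensitive, so "arxiv" does not satisfy "arXiv"
```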

```python
proceedings = pd.read_csv("proceedings.tsv", sep="\t", header=0).sort_values('year', ascending=False)

# Sorted newest first, so entries are numbered from N down to 1
count = len(proceedings.index)
for row, item in proceedings.iterrows():
    md_filename = "%03d-" % count + item.url_slug + "-" + str(item.year) + ".md"

    # YAML front matter variables
    md = "---\nnumber: \"" + str(count) + '"\n'
    md += "title: \"" + html_char_escape(item.title) + '"\n'
    md += "arxiv: " + str(item.arXiv) + '\n'
    if item.arXiv:
        # only use the eprint information when the proceedings are on the arXiv
        arxiv_id = item.eprint.replace('arXiv:', '')
        arxiv_link = "https://arxiv.org/abs/" + arxiv_id
        md += "arxiv_link: \"" + arxiv_link + '"\n'
        md += "arxiv_id: \"" + arxiv_id + '"\n'
    md += "author: \"" + html_str_replace(item.author) + '"'
    md += "\npublished: " + str(item.acc)
    if item.acc:
        md += "\njournal: \"" + item.proceedings + " (" + str(item.year) + ')"'
        if pd.notna(item.doi):  # allowing for proceedings that have publication information but no doi
            md += "\ndoi: \"" + str(item.doi) + '"'
    if pd.notna(item.comment):  # empty TSV cells are read in as NaN
        md += "\ncomment: \"" + html_char_escape(item.comment) + '"'
    md += "\n---"

    # Drop any directory components that may have crept into the slug
    md_filename = os.path.basename(md_filename)

    with open("../_proceedings/" + md_filename, 'w') as f:
        f.write(md)
    count -= 1
```