# Publications markdown generator for academicpages

Takes a TSV of publications with metadata and converts them for use with [academicpages.github.io](https://academicpages.github.io). This is an interactive Jupyter notebook ([see more info here](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html)). The core Python code is also in `publications.py`. Run either from the `markdown_generator` folder after replacing `publications.tsv` with one containing your data.

TODO: Make this work with BibTeX and other databases of citations, rather than Stuart's non-standard TSV format and citation style.


## Data format

The TSV needs to have the following columns: `pub_date`, `title`, `venue`, `excerpt`, `citation`, `url_slug`, and `paper_url`, with a header at the top.

- `excerpt` and `paper_url` can be blank, but the others must have values. 
- `pub_date` must be formatted as YYYY-MM-DD.
- `url_slug` will be the descriptive part of the .md file and the permalink URL for the page about the paper. The .md file will be `YYYY-MM-DD-[url_slug].md` and the permalink will be `https://[yourdomain]/publication/YYYY-MM-DD-[url_slug]`
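
The rules above can be checked before running the generator. The following is a minimal sketch using only the standard library; the function name `validate_publications` and the sample rows are hypothetical, and the column list is taken from the header row of the sample file shown in this notebook.

```python
import csv
import io
import re
from datetime import datetime

# Column names as they appear in the header row of publications.tsv.
REQUIRED_COLUMNS = ["pub_date", "title", "venue", "excerpt",
                    "citation", "url_slug", "paper_url"]
# Columns that are allowed to be blank.
OPTIONAL_BLANK = {"excerpt", "paper_url"}


def validate_publications(tsv_text):
    """Return a list of human-readable problems found in the TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in REQUIRED_COLUMNS:
            if col not in OPTIONAL_BLANK and not (row[col] or "").strip():
                problems.append(f"line {line_no}: '{col}' is blank")

        # Require the exact YYYY-MM-DD shape and a real calendar date.
        value = (row["pub_date"] or "").strip()
        date_ok = bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", value))
        if date_ok:
            try:
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                date_ok = False
        if not date_ok:
            problems.append(f"line {line_no}: pub_date is not YYYY-MM-DD")
    return problems


# Hypothetical sample: the second data row has a blank venue and a bad date.
sample = (
    "pub_date\ttitle\tvenue\texcerpt\tcitation\turl_slug\tpaper_url\n"
    "2022-11-11\tA title\tLand\t\tDoe, J. 2022.\tdoe-2022\t\n"
    "11/11/2022\tAnother title\t\t\tDoe, J. 2021.\tdoe-2021\t\n"
)
for problem in validate_publications(sample):
    print(problem)
```

To check a real file, read it with `open("publications.tsv", encoding="utf-8").read()` and pass the text to the function; an empty list means no problems were found by these particular checks.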

This is how the raw file looks (it isn't pretty, so use a spreadsheet or other program to edit and create it).

In [5]:
!Powershell.exe -Command type publications.tsv -Head 5

pub_date	title	venue	excerpt	citation	url_slug	paper_url
2022-11-11	Topographic wetness index as a proxy for soil moisture in a hillslope catena: flow algorithms and map generalization	Land	Open access!	"Winzeler, H. E., P. R. Owens, Q. D. Read, Z. Libohova, A. Ashworth, and T. Sauer. 2022. Topographic wetness index as a proxy for soil moisture in a hillslope catena: flow algorithms and map generalization. Land 11:2018. DOI: 10.3390/land11112018."	winzeler-et-al-2022	https://doi.org/10.3390/land11112018
2022-10-18	Integrating natural gradients, experiments, and statistical modelling in a distributed network experiment: an example from the WaRM Network	Ecology and Evolution	Open access!	"Prager, C. M., A. T. Classen, M. K. Sundqvist, M. N. Barrios-Garcia, E. K. Cameron, L. Chen, C. Chisholm, T. W. Crowther, J. R. Deslippe, K. Grigulis, J.-S. He, J. A. Henning, M. Hovenden, T. T. Høye, X. Jing, S. Lavorel, J. R. McLaren, D. B. Metcalfe, G. S. Newman, M. L. Nielsen, C. Rixen, Q. D. Read, K. E

## Import pandas

We are using the very handy pandas library for dataframes.

In [6]:
import pandas as pd

## Import TSV

Pandas makes this easy with the `read_csv` function. We are using a TSV, so we specify the separator as a tab, or `\t`.

I found it important to put this data in a tab-separated values format, because citations and titles contain many commas, and comma-separated files break into the wrong columns unless every such field is quoted correctly. However, you can modify the import statement, as pandas also has `read_excel()`, `read_json()`, and others.
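
To illustrate the point about commas, here is a small sketch using the standard library `csv` module rather than pandas. The shortened citation is a hypothetical example, and the "naive" comma handling stands in for any tool that fails to quote fields properly.

```python
import csv
import io

# Hypothetical, shortened record: date, venue, citation.
citation = "Winzeler, H. E., P. R. Owens, and Q. D. Read. 2022."
record = ["2022-11-11", "Land", citation]

# Joining with commas without quoting, then splitting again,
# scatters the citation across several extra fields.
naive_fields = ",".join(record).split(",")
print(len(naive_fields))

# Tabs do not occur inside the citation, so a tab-delimited
# round trip gives back exactly the three original fields.
tsv_line = "\t".join(record)
parsed = next(csv.reader(io.StringIO(tsv_line), delimiter="\t"))
print(len(parsed), parsed[2] == citation)
```

A correctly quoted CSV would also survive the round trip; the tab delimiter simply removes the dependence on quoting for this kind of data.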

In [19]:
publications = pd.read_csv("publications.tsv", sep="\t", header=0)
publications.head()


| | pub_date | title | venue | excerpt | citation | url_slug | paper_url |
|---|---|---|---|---|---|---|---|
| 0 | 2022-11-11 | Topographic wetness index as a proxy for soil ... | Land | Open access! | Winzeler, H. E., P. R. Owens, Q. D. Read, Z. L... | winzeler-et-al-2022 | https://doi.org/10.3390/land11112018 |
| 1 | 2022-10-18 | Integrating natural gradients, experiments, an... | Ecology and Evolution | Open access! | Prager, C. M., A. T. Classen, M. K. Sundqvist,... | prager-et-al-2022 | https://doi.org/10.1002/ece3.9396 |
| 2 | 2022-09-10 | Accuracy of genomic prediction of yield and su... | Agriculture | Open access! | Islam, Md. S., P. McCord, Q. D. Read, L. Qin, ... | islam-et-al-2022 | https://doi.org/10.3390/agriculture12091436 |
| 3 | 2022-09-09 | Potential of silicon to improve biological con... | Agriculture | Open access! | Zimba, K. J., Q. D. Read, M. Haseeb, R. L. Mea... | zimba-et-al-2022 | https://doi.org/10.3390/agriculture12091432 |
| 4 | 2022-08-27 | Dasymetric population mapping based on US Cens... | Scientific Data | Open access! | Swanwick, R. H., Q. D. Read, S. M. Guinn, M. A... | swanwick-et-al-2022 | https://doi.org/10.1038/s41597-022-01603-z |


## Escape special characters

YAML is very picky about what it accepts as a valid string, so we replace single and double quotes (and ampersands) with their HTML-encoded equivalents. This makes them less readable in raw format, but they are parsed and rendered nicely.

In [15]:
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;"
    }

def html_escape(text):
    """Produce entities within text."""
    return "".join(html_escape_table.get(c,c) for c in text)
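
As a quick check of what the escaping does, the sketch below repeats the function so it runs on its own and applies it to a made-up title (the title is a hypothetical example, not from the data).

```python
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;"
    }

def html_escape(text):
    """Produce entities within text."""
    return "".join(html_escape_table.get(c, c) for c in text)

# Hypothetical title containing all three special characters.
title = 'Plants & "soil": a reader\'s guide'
print(html_escape(title))
# Plants &amp; &quot;soil&quot;: a reader&apos;s guide
```

Because the function iterates over its argument, it only accepts strings. Passing the `NaN` float that pandas uses for an empty cell raises a `TypeError`, which is one reason the required columns must not be blank.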

In [20]:
# Some test code to see what the filename would be
# Looks like excel messed it up.
str(publications['pub_date'])

'0     2022-11-11\n1     2022-10-18\n2     2022-09-10\n3     2022-09-09\n4     2022-08-27\n5     2022-05-13\n6     2022-05-01\n7     2022-04-15\n8     2022-04-04\n9     2022-03-12\n10    2022-01-03\n11    2021-06-21\n12    2021-04-20\n13    2021-04-01\n14    2021-02-02\n15    2021-01-20\n16    2020-06-23\n17    2020-06-01\n18    2020-01-24\n19    2020-01-14\n20    2019-07-16\n21    2019-02-27\n22    2019-03-12\n23    2018-10-20\n24    2018-10-01\n25    2018-01-24\n26    2018-03-06\n27    2018-05-01\n28    2017-09-01\n29    2017-07-01\n30    2017-12-29\n31    2016-07-01\n32    2016-07-01\n33    2016-08-01\n34    2015-09-01\n35    2014-02-01\n36    2013-12-07\n37    2013-11-12\n38    2012-02-29\nName: pub_date, dtype: object'
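
The date strings matter because they become part of each output filename and permalink. A minimal sketch of that combination, mirroring the string concatenation used in the generation loop (`output_names` is a hypothetical helper, not part of the notebook):

```python
def output_names(pub_date, url_slug):
    """Return (markdown filename, permalink path) for one publication."""
    base = f"{pub_date}-{url_slug}"
    return base + ".md", "/publication/" + base

md_filename, permalink = output_names("2022-11-11", "winzeler-et-al-2022")
print(md_filename)  # 2022-11-11-winzeler-et-al-2022.md
print(permalink)    # /publication/2022-11-11-winzeler-et-al-2022
```

If a spreadsheet program has rewritten a date as something like `11/11/2022`, the slashes end up in the filename and permalink, so it is worth confirming the `YYYY-MM-DD` format at this point.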

## Creating the markdown files

This is where the heavy lifting is done. The code loops through all the rows in the TSV dataframe and, for each publication, concatenates a big string (`md`) that contains the markdown for that publication's page. It writes the YAML metadata first, then the description for the individual page.
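
To make the shape of one generated file concrete, here is a standalone sketch that builds the same kind of text for a single record. It is not the notebook's code: `build_markdown` and the example record are hypothetical, the record is a plain dict rather than a dataframe row, blank optional fields are assumed to be empty strings, and escaping the title is a choice made in this sketch.

```python
html_escape_table = {"&": "&amp;", '"': "&quot;", "'": "&apos;"}

def html_escape(text):
    """Replace the characters that tend to break YAML strings."""
    return "".join(html_escape_table.get(c, c) for c in text)

def build_markdown(pub):
    """Return the text of one publication page.

    `pub` is a dict keyed by the TSV column names. Blank optional
    fields are assumed to be empty strings (not NaN) in this sketch.
    """
    base = pub["pub_date"] + "-" + pub["url_slug"]

    # YAML front matter, in the same order as the generation loop.
    front = [
        'title: "' + html_escape(pub["title"]) + '"',
        "collection: publications",
        "permalink: /publication/" + base,
    ]
    if pub["excerpt"]:
        front.append("excerpt: '" + html_escape(pub["excerpt"]) + "'")
    front.append("date: " + pub["pub_date"])
    front.append("venue: '" + html_escape(pub["venue"]) + "'")
    if pub["paper_url"]:
        front.append("paperurl: '" + pub["paper_url"] + "'")
    front.append("citation: '" + html_escape(pub["citation"]) + "'")

    # Page body: optional excerpt and optional download link.
    body = []
    if pub["excerpt"]:
        body.append(html_escape(pub["excerpt"]))
    if pub["paper_url"]:
        body.append("[Download paper here](" + pub["paper_url"] + ")")

    text = "---\n" + "\n".join(front) + "\n---\n"
    if body:
        text += "\n\n".join(body) + "\n"
    return text

# Hypothetical record for illustration only.
example = {
    "pub_date": "2022-11-11",
    "title": "An example title",
    "venue": "Land",
    "excerpt": "Open access!",
    "citation": "Doe, J. 2022. An example title. Land 11:1.",
    "url_slug": "doe-2022",
    "paper_url": "https://example.org/paper",
}
print(build_markdown(example))
```

Collecting the lines in a list and joining them once keeps the optional fields easy to see; the string concatenation used in the notebook produces equivalent output for a complete record.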

In [21]:
import os
for row, item in publications.iterrows():
    
    md_filename = str(item.pub_date) + "-" + item.url_slug + ".md"
    html_filename = str(item.pub_date) + "-" + item.url_slug
    year = str(item.pub_date)[:4]
    
    ## YAML variables
    
    # Escape the title as well, so a double quote in it cannot end the YAML string early
    md = "---\ntitle: \""   + html_escape(item.title) + '"\n'
    
    md += """collection: publications"""
    
    md += """\npermalink: /publication/""" + html_filename
    
    # Blank cells load as NaN, and str(NaN) is 'nan' (3 characters), so this length test skips them
    if len(str(item.excerpt)) > 5:
        md += "\nexcerpt: '" + html_escape(item.excerpt) + "'"
    
    md += "\ndate: " + str(item.pub_date) 
    
    md += "\nvenue: '" + html_escape(item.venue) + "'"
    
    if len(str(item.paper_url)) > 5:
        md += "\npaperurl: '" + item.paper_url + "'"
    
    md += "\ncitation: '" + html_escape(item.citation) + "'"
    
    md += "\n---"
    
    ## Markdown description for individual page
        
    if len(str(item.excerpt)) > 5:
        md += "\n" + html_escape(item.excerpt) + "\n"
    
    if len(str(item.paper_url)) > 5:
        md += "\n[Download paper here](" + item.paper_url + ")\n" 
        
    ##md += "\nRecommended citation: " + item.citation
    
    md_filename = os.path.basename(md_filename)
       
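    # Write as UTF-8 so non-ASCII characters in author names do not depend on the OS default encoding
    with open("../_publications/" + md_filename, 'w', encoding="utf-8") as f: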
        f.write(md)

These files are in the `_publications` directory, one level up from where we're working.
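
The listing cell that follows uses the Windows `cmd` shell. On other platforms, a portable alternative is to list the directory from Python. This is a sketch using `pathlib`; the helper name `list_markdown_files` is hypothetical.

```python
from pathlib import Path

def list_markdown_files(directory):
    """Return the sorted names of the .md files in `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.md"))

# Same relative location the generator writes to.
output_dir = Path("..") / "_publications"
if output_dir.is_dir():
    for name in list_markdown_files(output_dir):
        print(name)
```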

In [22]:
%%cmd
dir ..\_publications\

Microsoft Windows [Version 10.0.19042.2130]
(c) Microsoft Corporation. All rights reserved.

C:\Users\qdread\Documents\GitHub\qdread.github.io\markdown_generator>dir ..\_publications\
 Volume in drive C has no label.
 Volume Serial Number is 08D0-A2BB

 Directory of C:\Users\qdread\Documents\GitHub\qdread.github.io\_publications

11/11/2022  10:35 AM    <DIR>          .
11/11/2022  10:35 AM    <DIR>          ..
11/11/2022  10:35 AM               724 2012-02-29-clark-et-al-2012.md
11/11/2022  10:35 AM               708 2013-11-12-vannuland-et-al-2013.md
11/11/2022  10:35 AM               776 2013-12-07-gorman-et-al-2013.md
11/11/2022  10:35 AM               871 2014-02-01-read-et-al-2014-functional-ecology.md
11/11/2022  10:35 AM               741 2015-09-01-schussler-et-al-2015.md
11/11/2022  10:35 AM               806 2016-07-01-read-et-al-2016-oikos.md
11/11/2022  10:35 AM               682 2016-07-01-vannuland-et-al-2016.md
11/11/2022  10:35 AM               773 2016-08-01-yoon-and-