# Demonstration: making author indexes from paper information

This Jupyter Notebook demonstrates how to create the author indexes of 3 conference proceedings from author information using regular expressions and LaTeX.

## Requirement

The author index should list the author names together with the numbers of the papers they wrote. Since there are three categories of papers, there should be three author indexes, all in the same format.

## Procedure

We will go through a few steps to create the author indexes for the three categories of papers from the raw text data of the conference proceedings, which looks like this:

In [17]:
with open('conference_session_paper.txt') as f:
    for i in range(10):
        print(f.readline()[:-1])

﻿R-01: Evaluating Alternative Refrigerants & Technologies
Time: Monday, 14/Jul/2014: 1:00pm - 3:00pm  •  Location: 214 A&B
ID: 2175
R-404A Alternative Refrigerant with Low Compressor Discharge Temperature
Barbara Minor1, Vladimir Sulc2, Jeff Berge3, Michal Kolda4, Michal Hegar5
1DuPont Fluoroproducts; 2Thermo King, Corp, Ingersoll Rand; 3Thermo King, Corp, Ingersoll Rand; 4Ingersoll Rand Equipment Manufacturing; 5Ingersoll Rand Equipment Manufacturing; barbara.h.minor@dupont.com

ID: 2250
AHRI Low Global Warming Potential Alternative Refrigerants Evaluation Program (Low-GWP AREP) – Summary of Phase I Testing Results
Xudong Wang, Karim Amrane


The steps are:

* Extract the author names and paper numbers
* Order the data
* Put the information and the links in the LaTeX template
* Compile the author indexes

## Extracting and organizing the data

Let's start the demo by opening the files to import the data. We have two data files to import:

* __conference\_session\_paper.txt__, which contains the conference paper data
* __double_name.txt__, which lists the last names that contain more than one word, such as 'van der Bor'
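
The reason for listing the multi-word last names can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper `last_name_first` and the first name `Firstname` are hypothetical. Unlike the notebook, which keeps the underscores until the LaTeX step, the sketch restores the spaces immediately.

```python
def last_name_first(person, double_names):
    """Return 'Last, First' for one author, keeping multi-word last names whole."""
    for dbname in double_names:
        if dbname in person:
            # glue the words of the last name together so split() keeps them as one
            person = person.replace(dbname, dbname.replace(' ', '_'))
    words = person.split()
    last, given = words[-1], words[:-1]
    # restore the spaces of the last name for display
    return last.replace('_', ' ') + ', ' + ' '.join(given)


# 'Firstname' is a placeholder, not a name from the data
print(last_name_first('Firstname van der Bor', ['van der Bor']))  # van der Bor, Firstname
print(last_name_first('Barbara Minor', ['van der Bor']))          # Minor, Barbara
```

Without the list, the last word alone ('Bor') would be taken as the last name.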

In [18]:
# open read file
main_name = 'conference_session_paper.txt'
fread = open(main_name, 'r')  # read only

# find the names which have double family names and add an identifier to them
fname = open('double_name.txt')
fname.readline()  # discard the first line of the file
double_name = []
name = fname.readline().split('\n')[0]  # and arrange the names in a list
while name != '':
    double_name.append(name)
    name = fname.readline().split('\n')[0] 

Next, let's prepare the files that store the data for the three categories of papers.

In [19]:
# open write files
filename = []
file_declar = []
filename.append('author_index_raw_comp.txt')
filename.append('author_index_raw_refri.txt')
filename.append('author_index_raw_building.txt')
for name in filename:
    file_declar.append(open(name, 'w'))

Now let's extract the data from _conference_session_paper.txt_ and write the author names and paper numbers into the separate files. The first digit of each paper ID selects the category file.

In [20]:
import re

# identify text and write appropriate lines
text_line = fread.readline()  # read one line
while text_line != '':   # not end of file
    # check if it has read an id number
    if text_line[0:3] == 'The':
        # withdrawn papers, skip to the next line
        text_line = fread.readline()
    elif text_line[0:3] == 'ID:':
        id_num = text_line[4:8]
        # set filewriter
        fwrite = file_declar[int(id_num[0])-1]
        fread.readline()  # ignore the paper title line
        # split the author line on commas, affiliation digits and the newline;
        # the pattern must not end with an empty alternative (a trailing '|'),
        # because it matches the empty string, which newer Python versions split on
        author_list = re.split(
            r'[0-9],[0-9], |[0-9], |, |,|[0-9]\n|[0-9]|\n',
            fread.readline()
        )  # form an author list
        # check for multiple words in family name first
        for person in author_list:  # for each author
            for dbname in double_name:
                # str.find() returns -1 (truthy) when the name is absent,
                # so test membership with 'in' instead
                if dbname in person:
                    person = person.replace(
                        dbname, dbname.replace(' ', '_')
                    )
            name = person.split(' ')
            if name[0] != '':  # only write if a string is observed
                word = len(name)
                fwrite.write(name[word-1]+', ')  # write last name first
                for char in range(word-1):
                    fwrite.write(name[char]+' ')
                # write the corresponding id number with an indicator
                fwrite.write(':'+id_num+'\n')
    text_line = fread.readline()  # read another line

# close files
fread.close()
for declar in file_declar:
    declar.close()



Notice the use of a __regular expression__ in _re.split_ to pull only the names out of a line of author data, such as

* Barbara Minor1, Vladimir Sulc2, Jeff Berge3, Michal Kolda4, Michal Hegar5
* Stephen Anthony Kujak, Panayu Robert Srichai, Kenneth J. Schultz
* Noriaki Ishii1, Takuma Tsuji2, Keiko Anami3, Charles W. Knisely4, Tatsuya Oku2, Koichi Nokiyama1, Kiyoshi Sawai5, Hirofumi Yoshida6, Hiroaki Nakai6

Without the regular expression, you may need to write many more lines to extract the names from the line of information
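
To make the splitting concrete, here is a small sketch that applies a variant of the pattern above to two of the sample lines. The names `AUTHOR_SEP` and `split_authors` are introduced here for illustration only, and the variant contains no empty alternative, so it also behaves well on recent Python versions.

```python
import re

# separators: affiliation digits, commas (with or without a space) and the newline
AUTHOR_SEP = re.compile(r'[0-9],[0-9], |[0-9], |, |,|[0-9]\n|[0-9]|\n')


def split_authors(line):
    """Return the author names found on one line of author data."""
    # re.split leaves empty strings around adjacent separators; drop them
    return [part for part in AUTHOR_SEP.split(line) if part]


with_digits = 'Barbara Minor1, Vladimir Sulc2, Jeff Berge3, Michal Kolda4, Michal Hegar5\n'
without_digits = 'Stephen Anthony Kujak, Panayu Robert Srichai, Kenneth J. Schultz\n'

print(split_authors(with_digits))
# ['Barbara Minor', 'Vladimir Sulc', 'Jeff Berge', 'Michal Kolda', 'Michal Hegar']
print(split_authors(without_digits))
# ['Stephen Anthony Kujak', 'Panayu Robert Srichai', 'Kenneth J. Schultz']
```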

Let's look at one of the files

In [21]:
with open('author_index_raw_comp.txt') as f:
    for i in range(10):
        print(f.readline()[:-1])

Bertagnolio, Stephane :1627
Winandy, Eric :1627
Vazquez, Sonia :1627
Gao, Haiyang :1425
Fukuta, Mitsuhiro :1255
Ogi, Daisuke :1255
Motozawa, Masaaki :1255
Yanagisawa, Tadashi :1255
Iwanami, Shigeki :1255
Hotta, Tadashi :1255


There are two problems:

* The data are not ordered
* Each line holds only one paper number, so an author with several papers appears on several lines

We need to fix both.

## Order the author names

Ordering is done by opening each file, sorting the rows alphabetically (ignoring case), and writing the data back into the file.

In [22]:
# after writing the initial file, read for sorting and write again
for name in filename:
    fread = open(name, 'r')
    raw = sorted(str.split(fread.read(), '\n'), key=str.lower)  # where the ordering is done
    fread.close()
    fwrite = open(name, 'w')
    for item in raw:
        if item != '':
            fwrite.write(item+'\n')
    fwrite.close()
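
The effect of `key=str.lower` in the sort above can be seen in a short sketch. The entries 'de Vries, A' and 'Zhang, B' and their paper numbers are made up for illustration; only the 'Anami, Keiko' entry comes from the data.

```python
# two hypothetical entries plus one from the data
names = ['de Vries, A :1001', 'Zhang, B :1002', 'Anami, Keiko :1562']

# default ordering compares code points, so every uppercase letter sorts before lowercase
print(sorted(names))
# ['Anami, Keiko :1562', 'Zhang, B :1002', 'de Vries, A :1001']

# comparing lowercased copies gives the alphabetical order expected in an index
print(sorted(names, key=str.lower))
# ['Anami, Keiko :1562', 'de Vries, A :1001', 'Zhang, B :1002']
```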

Merging the paper numbers of authors who appear on several lines is more involved, but the tools required are similar.

In [23]:
# look at the files again and for identical author names, align them into one
# single line
for name in filename:
    fread = open(name, 'r')
    line1 = fread.readline()[:-1]  # skip the newline character
    nextline = 'a'  # fill in some dummy to start while loop
    content = ''
    while line1 != '':  # not eof
        nextline = fread.readline()[:-1]
        if nextline != '':  # not eof
            line2 = re.split(':', nextline)
            name1 = line1.split(':')[0]
            name2 = line2[0]
            id2 = line2[1].split('\n')
            while name1 == name2:
                line1 = line1+', '+id2[0]
                nextline = fread.readline()[:-1]
                line2 = re.split(':|\n', nextline)
                name2 = line2[0]
                # may encounter error, to ensure the newline character removal
                try:
                    id2 = line2[1].split('\n')
                except Exception:
                    break
            content = content+line1+'\n'
        else:
            content = content+line1+'\n'
        line1 = nextline
    fread.close()
    fwrite = open(name, 'w')
    fwrite.write(content)
    fwrite.close()
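
The merging step above could also be written with a dictionary that collects the paper numbers per author. This is an alternative sketch, not the notebook's method; `merge_papers` is a hypothetical helper, and the paper number 1563 is made up to show the merge.

```python
def merge_papers(lines):
    """Collapse 'Name :id' lines into one 'Name :id1, id2' line per author."""
    merged = {}  # dicts keep insertion order, so sorted input stays sorted
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        name, _, paper_id = line.partition(':')
        merged.setdefault(name, []).append(paper_id.strip())
    return [name + ':' + ', '.join(ids) for name, ids in merged.items()]


rows = ['Anami, Keiko :1562', 'Anami, Keiko :1563', 'Gao, Haiyang :1425']
print(merge_papers(rows))
# ['Anami, Keiko :1562, 1563', 'Gao, Haiyang :1425']
```

Because the names are grouped by key, this version does not depend on identical names being on adjacent lines.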

## Copying the data into the LaTeX template

The LaTeX template is composed of two parts: the _preamble_, which we prepare beforehand, and the _document content_, which we create in Python from the data.

Before we move forward, let's look at the preamble part.

In [24]:
with open('preamble.tex') as f:
    print(f.read())


% document class definition
\documentclass[letter, 10pt, oneside]{article}
\setlength{\parskip}{0in}
\setlength{\parindent}{0in}
\textwidth = 470pt
\linespread{1.1}
\textheight = 630pt
\oddsidemargin = 15pt
\marginparsep = 4pt
\marginparwidth = 0pt
\hoffset = -0.5in
\footskip = 10pt
\pagestyle{empty}
\setlength{\columnsep}{1in}
\columnwidth = 3in

% include packages
\usepackage[pdftex, hidelinks]{hyperref} % for hyperlink
\usepackage[utf8]{inputenc}
\usepackage{supertabular} % for long table
\usepackage{array} % for left-aligned cells
\newcolumntype{P}[1]{>{\raggedright\arraybackslash}p{#1}}

\begin{document}

\twocolumn[  % for two columns
\begin{@twocolumnfalse}
{\centering{\Large{\bf{Author Index}}} \\}
\vspace{0.5in}
\end{@twocolumnfalse}
]
\centering{
\begin{supertabular}{p{2.25in} P{1in}}


Let's create the document content and append it to the preamble.

In [25]:
# write latex files for a table with space fill in between
latex_filename = []
# use UTF-8 for European characters
preamble = str.encode(open('preamble.tex', 'r').read())
ending = b'\\end{supertabular}\n}\n\n\\end{document}'  # for the ending of the document content
for name in filename:
    # open file
    latex_name = name.split('.txt')[0]+'.tex'
    latex_filename.append(latex_name)
    latex_write = open(latex_name, 'wb')

    # write preamble and table headers stored in another tex file
    latex_write.write(preamble+b'\n')

    fread = open(name, 'r')
    line = re.split(':|\n', fread.readline())
    while line != ['']:
        # add link to each number
        num_string = [num.strip() for num in line[1].split(',')]
        if num_string[0][0] == '1':
            store = 'COMP_Links/'
        elif num_string[0][0] == '2':
            store = 'Refrig_LINKS/'
        else:
            store = 'Bldg_LINKS/'
        pdf_string = []
        for ind_num in num_string:
            # double the backslashes: '\h' and '\ ' are invalid escape sequences
            pdf_string.append(
                '\\href{run:./'+store+ind_num+'.pdf}{'+ind_num+'}'
            )
        # join the links of all papers of the author with commas
        pdf_new = ', '.join(pdf_string)
        latex_write.write(
            str.encode(line[0].replace('_', ' ')+' '+' & '+pdf_new+'\\\\ '+'\n')
        )
        line = re.split(':|\n', fread.readline())
    latex_write.write(ending)
    fread.close()
    latex_write.close()
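
The row-building step above can be isolated in a small function to check a single table row. This is a sketch under the same folder names as the cell above; the helper `index_row` and the dictionary `FOLDERS` are introduced here for illustration, and the spacing around `&` is simplified.

```python
# folder per first digit of the paper number; anything else goes to 'Bldg_LINKS/'
FOLDERS = {'1': 'COMP_Links/', '2': 'Refrig_LINKS/'}


def index_row(entry):
    """Turn one 'Name :id1, id2' line into a row of the LaTeX table."""
    name, _, numbers = entry.partition(':')
    ids = [num.strip() for num in numbers.split(',')]
    folder = FOLDERS.get(ids[0][0], 'Bldg_LINKS/')
    links = ', '.join(
        '\\href{run:./%s%s.pdf}{%s}' % (folder, num, num) for num in ids
    )
    # underscores mark multi-word last names and become spaces again
    return name.replace('_', ' ').strip() + ' & ' + links + ' \\\\'


print(index_row('Almbauer, Raimund :1260'))
# Almbauer, Raimund & \href{run:./COMP_Links/1260.pdf}{1260} \\
```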

Now let's look at the '.tex' files created.


In [26]:
with open('author_index_raw_comp.tex') as f:
    for i in range(45):
        print(f.readline()[:-1])


% document class definition
\documentclass[letter, 10pt, oneside]{article}
\setlength{\parskip}{0in}
\setlength{\parindent}{0in}
\textwidth = 470pt
\linespread{1.1}
\textheight = 630pt
\oddsidemargin = 15pt
\marginparsep = 4pt
\marginparwidth = 0pt
\hoffset = -0.5in
\footskip = 10pt
\pagestyle{empty}
\setlength{\columnsep}{1in}
\columnwidth = 3in

% include packages
\usepackage[pdftex, hidelinks]{hyperref} % for hyperlink
\usepackage[utf8]{inputenc}
\usepackage{supertabular} % for long table
\usepackage{array} % for left-aligned cells
\newcolumntype{P}[1]{>{\raggedright\arraybackslash}p{#1}}

\begin{document}

\twocolumn[  % for two columns
\begin{@twocolumnfalse}
{\centering{\Large{\bf{Author Index}}} \\}
\vspace{0.5in}
\end{@twocolumnfalse}
]
\centering{
\begin{supertabular}{p{2.25in} P{1in}}
Aleksandr, Drozdov   & \href{run:./COMP_Links/1197.pdf}{1197}\\ 
Almbauer, Raimund   & \href{run:./COMP_Links/1260.pdf}{1260}\\ 
Anami, Keiko   & \href{run:./COMP_Links/1562.pdf}{1562}, \href{r

## Compile the file

To compile the 3 LaTeX files created for the 3 author indexes, we need a LaTeX distribution. In this demo, we'll use [MiKTeX](https://miktex.org/).

Follow the instructions online to install the software. We can then compile the files by simply running the command
```
latexmk -pdf author*.tex
```
in a terminal (command prompt), or by running the following in Python

In [27]:
import os
os.system('latexmk -pdf author*.tex')

0
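
As an alternative to `os.system`, the same command can be run through `subprocess.run`, which raises an error when the compilation fails instead of returning a status code. This is a sketch, assuming `latexmk` is on the PATH; the helper `latexmk_command` is hypothetical. The wildcard is expanded with `glob` because no shell is involved.

```python
import glob
import shutil
import subprocess


def latexmk_command(tex_files):
    """Build the argument list for one latexmk call over the given files."""
    return ['latexmk', '-pdf'] + list(tex_files)


tex_files = sorted(glob.glob('author*.tex'))
if tex_files and shutil.which('latexmk'):
    # check=True raises CalledProcessError if latexmk exits with an error
    subprocess.run(latexmk_command(tex_files), check=True)
else:
    print('nothing compiled: no author*.tex files found or latexmk not on PATH')
```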

The PDF files of the author indexes will then be ready for you!