# im2latex

&copy; Copyright 2017-2018 Sumeet S Singh

    This file is part of im2latex solution by Sumeet S Singh.

    This program is free software: you can redistribute it and/or modify
    it under the terms of the Affero GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    Affero GNU General Public License for more details.

    You should have received a copy of the Affero GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

In [1]:
import pandas as pd
import os
import numpy as np

In [2]:
pd.options.display.max_rows = 600
pd.options.display.max_columns = 20
pd.options.display.max_colwidth = 1000
pd.options.display.width = 160

## Please Read Before You Begin
**Only follow these steps if you want to generate additional data on top of what was used to generate the results presented in the accompanying paper.**

On the other hand, if you are content with the dataset that was used for this project and/or want to compare your model with mine, then skip this notebook and start with step 1 instead, i.e. move on to notebook preprocessing_step_1...

This notebook guides you through downloading, processing and saving LaTeX formulas for the following steps.

### Perform Manual Steps

#### Download Github repos
Download contents of [untrix/im2latex](https://github.com/untrix/im2latex) to ~/im2latex and [untrix/im2latex-dataset](https://github.com/untrix/im2latex-dataset) to ~/im2latex-dataset.

#### Download LaTeX documents from the web
Follow instructions at [untrix/im2latex-dataset](https://github.com/untrix/im2latex-dataset) and download a bunch of tar.gz files of latex documents. The instructions provide a bunch of URLs to the KDD Cup 2003 dataset from where you can download the data.
In the example below the files are downloaded into ~/im2latex/data/hep_downloads/hep-ph/tars - but feel free to download them wherever you want and replace the directory names in the instructions appropriately.

#### Extract Math formulas from the LaTeX documents
Follow instructions at [untrix/im2latex-dataset](https://github.com/untrix/im2latex-dataset) on running latex2formulas.py

> `cd ~/im2latex/data/hep_downloads/hep-ph`  
> `python ~/im2latex-dataset/src/latex2formulas.py tars/`

This will produce a file called formulas.txt. Copy it out.

> `cp formulas.txt ~/im2latex/data/dataset_tmp`

#### Normalize formulas
Here, we'll use the preprocessing code from [harvardnlp/im2markup](https://github.com/harvardnlp/im2markup). For convenience, the preprocessing code has been copied into the folder thirdparty/harvardnlp-im2markup in this repo, with some minor changes made thereafter.

Note that this script internally uses Node.js, so you'll need to install Node.js before you run it.

> `cd ~/im2latex/thirdparty/harvardnlp_im2markup`  
> `python scripts/preprocessing/preprocess_formulas.py --mode normalize --input-file ~/im2latex/data/dataset_tmp/formulas.txt --output-file ~/im2latex/data/dataset_tmp/formulas.norm.txt`

This produces a text file ~/im2latex/data/dataset_tmp/formulas.norm.txt with normalized formulas. Expect the script to print hundreds of errors as it encounters syntactically incorrect formulas; that's okay, because this is how the bad formulas get weeded out. I do not expect any other errors or issues in producing the normalized output file. If you do run into issues, look at the repo [harvardnlp/im2markup](https://github.com/harvardnlp/im2markup) for more details.
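
As a quick sanity check on the normalized file, the sketch below counts total and non-empty lines. The loading code later in this notebook skips empty lines, so the difference tells you how many lines will be dropped at load time. The helper `count_lines` and the demo file name are hypothetical, and the snippet is written in Python 3 syntax, unlike the Python 2 cells of this notebook.

```python
import io
import os
import tempfile


def count_lines(path):
    """Return (total, non_empty) line counts for a text file."""
    total = 0
    non_empty = 0
    with io.open(path, 'r', encoding='utf-8') as f:
        for line in f:
            total += 1
            if line.strip():
                non_empty += 1
    return total, non_empty


# Demo on a throwaway file; point count_lines at your own formulas.norm.txt instead.
demo_path = os.path.join(tempfile.mkdtemp(), 'formulas.norm.demo.txt')
with io.open(demo_path, 'w', encoding='utf-8') as f:
    f.write(u'a ^ { 2 } + b ^ { 2 }\n\n\\frac { 1 } { 2 }\n')

total, non_empty = count_lines(demo_path)
print('%d lines, %d non-empty' % (total, non_empty))
```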

Next, we'll load the normalized formulas and process them further. Run the following cells in a live Jupyter notebook or IPython environment. **Be sure to change the file and directory names below as appropriate.**

In [3]:
input_dir = '../data/dataset_tmp'
formula_file = os.path.join(input_dir, 'formulas.norm.txt')
output_dir = '../data/dataset_tmp/step0'
output_file = os.path.join(output_dir, 'formulas.norm.filtered.txt')
dump = True

In [4]:
def readlines_to_df(path, colname):
#   return pd.read_csv(output_file, sep='\t', header=None, names=['formula'], index_col=False, dtype=str, skipinitialspace=True, skip_blank_lines=True)
    rows = []
    n = 0
    with open(path, 'r') as f:
        print 'opened file %s'%path
        for line in f:
            n += 1
            line = line.strip()  # remove \n
            if len(line) > 0:
                rows.append(line.encode('utf-8'))
    print 'processed %d lines resulting in %d rows'%(n, len(rows))
    return pd.DataFrame({colname:rows}, dtype=np.str_)

def readlines_to_sr(path):
    rows = []
    n = 0
    with open(path, 'r') as f:
        print 'opened file %s'%path
        for line in f:
            n += 1
            line = line.strip()  # remove \n
            if len(line) > 0:
                rows.append(line.encode('utf-8'))
    print 'processed %d lines resulting in %d rows'%(n, len(rows))
    return pd.Series(rows, dtype=np.str_)

def sr_to_lines(sr, path):
#   df.to_csv(path, header=False, index=False, columns=['formula'], encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar=None, sep='\t')
    assert sr.dtype == np.str_ or sr.dtype == np.object_
    with open(path, 'w') as f:
        for s in sr:
            assert '\n' not in s
            f.write(s.strip())
            f.write('\n')

In [5]:
# df_all = readlines_to_df(formula_file, 'formula')
# df_all.shape
sr_all = readlines_to_sr(formula_file)
sr_all.shape

opened file ../data/dataset_tmp/formulas.norm.txt
processed 338921 lines resulting in 338142 rows


(338142,)

In [6]:
# sr_formula = df_all.formula.str.strip()
# df_stripped = df_all.drop('formula', axis='columns').assign(formula=sr_formula)
# print df_stripped.shape
# df_len_filtered = df_stripped[(sr_formula.str.split().str.len() > 3)]
# print df_len_filtered.shape
sr_stripped = sr_all.str.strip()
print sr_stripped.shape
sr_len_filtered = sr_stripped[(sr_stripped.str.split().str.len() > 3)]
print sr_len_filtered.shape

(338142,)
(338078,)
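
The length filter above keeps only formulas with more than three whitespace-separated tokens. A minimal sketch of the same expression on a toy Series (the sample strings are made up for illustration, and the snippet uses Python 3 syntax):

```python
import pandas as pd

# Toy data, not taken from the real dataset.
sr_demo = pd.Series([
    r'x = 1',                    # 3 tokens, dropped
    r'\alpha + \beta = \gamma',  # 5 tokens, kept
    r'  a b c d  ',              # 4 tokens once split on whitespace, kept
])

# Same expression as in the cell above: count whitespace-separated tokens.
token_counts = sr_demo.str.strip().str.split().str.len()
kept = sr_demo[token_counts > 3]

print(token_counts.tolist())
print(kept.index.tolist())
```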


#### Remove text-only 'formulas'
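It turns out that some of the downloaded formulas were only English sentences, i.e. they didn't have any mathematics in them. Such entries can be detected by the absence of a backslash, since every LaTeX command starts with one. Note that the cell below leaves that backslash check commented out; the filter it actually applies drops every formula that contains a comment character (`%`) or one of the commands `\label`, `\cite`, `\ref`, `\pageref` and `\footnote`.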

In [7]:
def filter_words(sr_, words):
    sr = sr_
    for word in words:
        sr = sr[~(sr.str.contains(word))]
    return sr

sr2 = filter_words(sr_len_filtered, [r'%', r'\label', r'\cite', r'\ref', r'\pageref', r'\footnote'])
# df2 =df2[~df2['formula'].str.contains('%')]  # Remove all commented out lines (will remove any line with a comment)
# df2 = df2[~df2['formula'].str.contains(r'\label')]
# df2 = df2[~df2['formula'].str.contains(r'\cite')]
# df2 =df2[df2['formula'].str.contains(r'\\')]
# df2 = df2[df2['formula'].str.contains(r'=')]
# df2 = df2[~df2['formula'].str.contains(r'\ref')]
# df2 = df2[~df2['formula'].str.contains(r'\pageref')]
# df2 = df2[~df2['formula'].str.contains(r'\footnote')]
sr2.shape  # 337565

(337565,)
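
One thing to keep in mind when adapting this filter: `Series.str.contains` interprets its pattern as a regular expression by default, so a pattern like `\ref` is read as an escape sequence (`\r` is a carriage return) rather than as a literal backslash followed by `ref`. The sketch below shows one way to match the words literally with `regex=False`, combined with the backslash check described above. It is an illustration on made-up strings in Python 3 syntax, not the code that produced the counts shown in this notebook.

```python
import pandas as pd

# Toy data, not taken from the real dataset.
sr_demo = pd.Series([
    r'\frac { a } { b }',
    r'see Eq. \ref { eq1 }',
    r'this is plain text',
    r'x \label { a } = 1',
])
unwanted = ['%', r'\label', r'\cite', r'\ref', r'\pageref', r'\footnote']

# Mark every row that contains any unwanted word, matched literally.
mask = pd.Series(False, index=sr_demo.index)
for word in unwanted:
    mask |= sr_demo.str.contains(word, regex=False)

# Keep only rows with at least one backslash, i.e. at least one LaTeX command.
has_command = sr_demo.str.contains('\\', regex=False)

kept = sr_demo[~mask & has_command]
print(kept.tolist())
```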

#### Sample a fraction of the formulas
Since we ended up with more formulas than we need, we'll sample a fraction for our use.

In [8]:
sr_final = sr2.sample(frac=0.2)
sr_final.shape  # 67513

(67513,)
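
The sample above is drawn without a fixed seed, so each run selects a different subset. If you want a repeatable subset, `Series.sample` accepts a `random_state` argument. A minimal sketch on toy data follows; the seed value 42 is an arbitrary choice, not one used in this project.

```python
import pandas as pd

# Toy data standing in for the filtered formulas.
sr_demo = pd.Series(['f%d' % i for i in range(100)])

# The same seed yields the same rows on every call.
sample_a = sr_demo.sample(frac=0.2, random_state=42)
sample_b = sr_demo.sample(frac=0.2, random_state=42)

print(len(sample_a))
print(sample_a.equals(sample_b))
```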

#### Save to disk

In [9]:
if dump:
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    sr_to_lines(sr_final, output_file)

#### Check The Saved File

In [11]:
sr_read = readlines_to_sr(output_file)
assert sr_read.shape[0] == sr_final.shape[0]

opened file ../data/dataset_tmp/step0/formulas.norm.filtered.txt
processed 67513 lines resulting in 67513 rows
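
The check above compares row counts only. A stricter check compares the contents line by line, as in the sketch below. `write_lines` and `read_lines` are hypothetical Python 3 stand-ins for this notebook's `sr_to_lines` and `readlines_to_sr`, and the data is made up.

```python
import io
import os
import tempfile

import pandas as pd


def write_lines(sr, path):
    """Write one stripped string per line."""
    with io.open(path, 'w', encoding='utf-8') as f:
        for s in sr:
            f.write(s.strip() + u'\n')


def read_lines(path):
    """Read non-empty lines back into a Series."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return pd.Series([line.strip() for line in f if line.strip()])


# A sampled Series keeps its original (non-contiguous) index, so compare values only.
sr_demo = pd.Series([r'\frac { a } { b }', r'x ^ { 2 } + y ^ { 2 }'], index=[7, 3])
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.txt')

write_lines(sr_demo, path)
sr_back = read_lines(path)

same = sr_back.tolist() == sr_demo.tolist()
print(same)
```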


## END