Note on preparing TOC in vim

1. Introduction ....................................................5

To process a line like this in vim, the strategy is to record a macro (start recording with `qq`, stop with `q`) that:
1. `$` jumps to the end of the line
2. `F␣` finds the previous space character. This takes you to the space after "Introduction".
 - Note: Type the actual space character.
3. `w` moves forward a word, landing on the first period after the space
4. `diw` deletes the entire "word" made of the periods, leaving `1. Introduction 5`

Then apply the macro to the entire document (each line) with `:%norm @q`, where `q` is whichever register the macro was stored under.

-OR-

Apply it to a range of lines with `:XX,YY norm @q`, where XX and YY are the first and last lines of the range and `q` is whichever register the macro is stored under.
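
The same clean-up can also be done outside vim. Below is a minimal sketch in Julia, assuming every line to be cleaned ends in a run of at least two periods followed by the page number. `strip_dot_leaders` is a hypothetical helper name and is not part of the script further down.

```julia
# Hypothetical helper: remove the dot leaders between a heading and its page number.
# Assumption: the leaders are a run of two or more periods directly before the page number.
function strip_dot_leaders(line::AbstractString)
    m = match(r"^(.*?)\s*\.{2,}\s*(\d+)\s*$", line)
    m === nothing && return string(line)   # no leaders found: leave the line unchanged
    return string(m.captures[1], " ", m.captures[2])
end

println(strip_dot_leaders("1. Introduction ....................................................5"))
# prints: 1. Introduction 5
```

Lines without dot leaders, such as `INDEX 808`, pass through unchanged, so the function can be mapped over a whole file.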

**Procedure**
1. DUMP METADATA THAT IS ALREADY IN PDF
pdftk input.pdf dump_data output meta.txt

2. RUN THE JULIA SCRIPT BELOW ON A TABLE OF CONTENTS FILE YOU HAVE CREATED MANUALLY.
    IT WILL OUTPUT A TABLE OF CONTENTS FILE FORMATTED FOR PDFTK
preparePDFTKBookmark(input_toc_path, output_toc_path)

3. MERGE EXISTING METADATA WITH GENERATED PDFTK FORMATTED TABLE OF CONTENTS
cat meta.txt output_toc.txt > updated_meta.txt

4. UPDATE PDF WITH UPDATED METADATA FILE
pdftk input.pdf update_info updated_meta.txt output final.pdf

5. SEE ZIPPED FILE--> "preparePDFTKBookmark() Julia Sample files.zip" for input files used to develop this script. 
    * meta.txt was generated from pdftk
    
    * toc.txt and testtoc.txt were hand generated by copying contents from OCR'd pmbok.pdf pages and pasting into text file
    
    * pmbok.pdf was the pdf that we wish to put TOC into
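
The procedure above can also be driven from Julia. This is a minimal sketch, assuming `pdftk` is on the PATH; the helper names `dump_cmd`, `update_cmd` and `merge_metadata` are hypothetical and only illustrate the steps. The commands are built but not executed until `run` is called.

```julia
# Hypothetical helpers that mirror steps 1, 3 and 4 of the procedure.
# Backtick literals build Cmd objects; nothing runs until `run(cmd)` is called.
dump_cmd(pdf, meta) = `pdftk $pdf dump_data output $meta`
update_cmd(pdf, meta, out) = `pdftk $pdf update_info $meta output $out`

# Step 3: concatenate the dumped metadata and the generated bookmark records,
# like `cat meta.txt output_toc.txt > updated_meta.txt`.
function merge_metadata(meta_path, toc_path, merged_path)
    open(merged_path, "w") do out
        for path in (meta_path, toc_path)
            write(out, read(path))
        end
    end
    return merged_path
end

# Intended order of calls (left commented out because they need pdftk and real files):
# run(dump_cmd("input.pdf", "meta.txt"))
# preparePDFTKBookmark(input_toc_path, "output_toc.txt")
# merge_metadata("meta.txt", "output_toc.txt", "updated_meta.txt")
# run(update_cmd("input.pdf", "updated_meta.txt", "final.pdf"))
```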
    


In [None]:
#=
PART I Title of Part One
PART II Title of Part Two
APPENDIX
    Appendix A -- Title of Appendix A
        A.1  Section 1 of Appendix A
            A.1.1 Section 1.1 of Appendix A 
INDEX
REFERENCES
BIBLIOGRAPHY

If you are combining multiple PDFs whose bookmarks files are already made, the PDF file names can be added as the top level by:
    1. extracting the bookmarks file from each PDF with pdftk
    2. editing each bookmarks file to increment the level of every entry and prepend an entry for the document name as the highest level
    3. offsetting the page numbers of each bookmarks file to correspond to where that PDF will be in the final document
    4. combining the bookmarks files into one bookmarks file
    5. combining the PDFs and applying the new bookmarks file
=#

#function preparePDFTKBookmark(input::AbstractString, output::AbstractString)
#function preparePDFTKBookmark(input::Array{String,1})
#preparePdftkBookmark(input::String, output::String)
#    * input: input file path
#    * output: output file path
    
#=
    The table of contents file must have the following format:
    
    1 Heading 999
    2.2 Heading2 999
    2.2.3 Heading3 999
    
    where "999" is some page number (e.g. 1, 12, 1919)
    where "Heading" is the name of the heading
    where X.X.X is the section numbering
    
    At present, any other headings must be inserted manually.
=#
    ##############################
    #Helper functions
    ##############################
    function getLevelNumeric(line::String)   
        # Determine nesting level
        # Note: This count starts at 2 instead of 1 to allow for a top level for 
        # "PART I", "APPENDIX", etc. in pdftk bookmark format
        if (ismatch(r"^[0][!\s]", line))
            # 0 is for title
            level = 1    
        elseif (ismatch(r"^\d*([\.]{0})[^\d\.]", line))
            # then we have d (level 1)
            level = 2
        elseif (ismatch(r"^\d*[\.]\d+[^\d\.]", line))
            # then we have d.d (level 2)
            level = 3
        elseif (ismatch(r"^\d*[\.]\d*[\.]\d*[^\d\.]", line))
            # then we have d.d.d (level 3)
            level = 4
        elseif (ismatch(r"^\d*[\.]\d*[\.]\d*[\.]\d*[^\d\.]", line))
            # then we have d.d.d.d (level 4)
            level = 5
        else
            # maxlevel is not defined inside this helper, so the limit is spelled out here
            error("Only section numbers up to d.d.d.d (bookmark level 5) are supported at the moment.")
        end
    end

    function getLevelAlpha(line::String)
        # Determine nesting level if the entry starts with an alpha character
        # Two types of alpha entries: simple and nested
        # Simple: Index, Bibliography, References, Appendix A, PART XII, Part 12
        # Nested: sections inside an appendix such as A.1, B.1.2, C.1.2.3

        if ismatch(r"^[A-Z][\.]\d", line)
            # the line starts with a letter, a period and a digit, so go down this path of checking
            if ismatch(r"^[A-Z][\.]\d+[^\d\.]", line)
                # Matches "A.d " => "A.9 Some Text 8534" but not "A.d" with nothing after it
                level = 3
            elseif ismatch(r"^[A-Z][\.]\d*[\.]\d*[^\d\.]", line)
                # A.d.d
                level = 4
            elseif ismatch(r"^[A-Z][\.]\d*[\.]\d*[\.]\d*[^\d\.]", line)
                # A.d.d.d
                level = 5
            else
                error("Either nesting too deep or improper formatting.")
            end
        elseif ismatch(r"\b(appendix|index|references|bibliography)\b"i, line) ||
               ismatch(r"^part\b"i, line)
            level = 2
        else
            # Without this branch the function would silently return nothing
            error("Unrecognized heading: \"$line\"")
        end
    end

    function makeEntry!(line::String, level::Int, entry::Array{String,1}, maxlevel::Int, isAlpha::Bool)
        # Fill `entry` with the four lines of one pdftk bookmark record:
        #   entry[1] = BookmarkBegin
        #   entry[2] = BookmarkTitle: MySection
        #   entry[3] = BookmarkLevel: X
        #   entry[4] = BookmarkPageNumber: XX
        #
        # Example parsings of entries
        #   "1.1 Section of the Book 910" will turn into
        #       BookmarkBegin
        #       BookmarkTitle: 1.1 Section of the Book
        #       BookmarkLevel: 3
        #       BookmarkPageNumber: 910
        #
        #   "Appendix B.5 1001" will turn into
        #       BookmarkBegin
        #       BookmarkTitle: Appendix B.5
        #       BookmarkLevel: 2
        #       BookmarkPageNumber: 1001
        #
        #   "INDEX 1406" will turn into
        #       BookmarkBegin
        #       BookmarkTitle: INDEX
        #       BookmarkLevel: 2
        #       BookmarkPageNumber: 1406

        if !isAlpha
            # Starts with a number: 5, 1.1, 3.2.3, 4.6.3.9, etc.
            if level == 1
                # Title entry: strip the 0 off the front of the line
                number = split(line, r"\D+")[1]
                entry[2] = string("BookmarkTitle: ", split(line[length(number) + 2 : end], r"\s*\d+$")[1])
            elseif level == 2
                number = split(line, r"\D+")[1]
                entry[2] = string("BookmarkTitle: ", number, " ", split(line[length(number) + 2 : end], r"\s*\d+$")[1])
            elseif level > 2 && level <= maxlevel
                # This pattern can generate the title for any level greater than 2.
                # It is only limited by the upper limit placed on the level in the elseif line.
                number = split(line, r"\s")[1]
                entry[2] = string("BookmarkTitle: ", number, " ", split(line[length(number) + 2 : end], r"\s*\d+$")[1])
            else
                error("makeEntry!(): Please choose an appropriate level. You chose $level. Integers between 1 and $maxlevel are accepted.")
            end
        else
            # Starts with a letter: Appendix C, Part 3, Part V, INDEX, references, A.1, A.1.2, etc.
            # The title is the whole line with the trailing page number removed.
            entry[2] = string("BookmarkTitle: ", split(line, r"\s*\d+$")[1])
        end

        # These work for all scenarios, both numeric and alpha headings.
        entry[1] = "BookmarkBegin"
        entry[3] = string("BookmarkLevel: ", level)
        # The page number is the last run of digits on the line. Taking it with a
        # match avoids depending on whether the line still carries a trailing newline.
        page = match(r"(\d+)\s*$", line)
        page === nothing && error("makeEntry!(): no page number found at the end of \"$line\"")
        entry[4] = string("BookmarkPageNumber: ", page.captures[1])

        return entry
    end


    function addEntry!(entry::Array{String,1}, output::AbstractString)
        # Append the four lines of one bookmark record to the bookmark file.
        # println writes each field followed by a newline, with no trailing space.
        open(output, "a") do f
            for field in entry
                println(f, field)
            end
        end
    end
    ###########################
    #END Helper Functions
    ###########################

    ###########################
    # Main Code
    ###########################   
function preparePDFTKBookmark(input::AbstractString, output::AbstractString)
#    * input: input file path
#    * output: output file path

    entry = fill("", 4)  # The four lines of one pdftk bookmark record
    maxlevel = 5 # Maximum level we will support. e.g. 10.2.3.5 is four levels. 1.4.20 is three levels of nesting.
    level = -1 # Initialize

    open(input) do file
        for line in eachline(file)
            println(line)
            if ismatch(r"^[0-9]", line)
                # starts with a number
                level = getLevelNumeric(line)
                println("Numeric line")
                println("Level is $level")
                isAlpha = false
                entry = makeEntry!(line, level, entry, maxlevel, isAlpha)
            elseif ismatch(r"^[a-zA-Z]", line)
                # starts with a letter
                level = getLevelAlpha(line)
                println("Alpha line")
                println("Level is $level")
                isAlpha = true
                entry = makeEntry!(line, level, entry, maxlevel, isAlpha)
            elseif !ismatch(r"\S", line)
                # no non-whitespace characters, i.e. a blank line
                continue
            else
                error("Each line must start with a letter or a number.")
                # Do file cleanup if fails. TODO
            end
            addEntry!(entry, output)

            println(entry[1])
            println(entry[2])
            println(entry[3])
            println(entry[4])
            println("")
        end
    end
end

inputFile="/path/to/file/filename.toc"
outputPath="/output/path/"
preparePDFTKBookmark(inputFile, string(outputPath, "bookmark.bkmk"))
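
The comment near the top of this cell lists the steps for combining several PDFs that already have bookmarks. Steps 2 and 3 (increment every level, prepend an entry for the document, offset the page numbers) could look roughly like the sketch below. It assumes the bookmark records are held as a vector of lines in pdftk's `BookmarkBegin` / `BookmarkTitle` / `BookmarkLevel` / `BookmarkPageNumber` layout, and that `page_offset` is the number of pages that precede this PDF in the combined file. `nest_bookmarks` is a hypothetical name, not part of the script above.

```julia
# Hypothetical helper: nest one document's bookmarks under a top-level entry.
# Assumption: page_offset = number of pages before this document in the combined PDF.
function nest_bookmarks(lines, doc_title, page_offset)
    # New level 1 entry that points at the first page of this document
    out = String["BookmarkBegin",
                 "BookmarkTitle: $doc_title",
                 "BookmarkLevel: 1",
                 "BookmarkPageNumber: $(page_offset + 1)"]
    for line in lines
        m = match(r"^(BookmarkLevel|BookmarkPageNumber):\s*(\d+)\s*$", line)
        if m === nothing
            push!(out, string(line))   # BookmarkBegin, BookmarkTitle and anything else
        elseif m.captures[1] == "BookmarkLevel"
            push!(out, string("BookmarkLevel: ", parse(Int, m.captures[2]) + 1))
        else
            push!(out, string("BookmarkPageNumber: ", parse(Int, m.captures[2]) + page_offset))
        end
    end
    return out
end

second_doc = ["BookmarkBegin", "BookmarkTitle: 1 Intro", "BookmarkLevel: 1", "BookmarkPageNumber: 3"]
foreach(println, nest_bookmarks(second_doc, "second.pdf", 100))
```

With an offset of 100 pages, the existing entry moves from level 1 to level 2 and from page 3 to page 103, under a new level 1 entry for `second.pdf` at page 101. The results for each PDF can then be concatenated into one bookmarks file (step 4).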



In [1]:
# A simple test of some common ways that the input might be nonideal

#=
myfile = ["130000. introduction to book        791161",
"2. an introduction to agile  802",
"210.1000010101 definable work vs. high-uncertainty work 80233",
"2.2 the agile manifesto and mindset  803",
"241.24443.414 xxxx and the kanban method    9",
"2.4 uncertainty, risk, and life cycle selection 808",
"3. life cycle selection 812",
"3.1 characteristics of project life cycles 813"]
=#


myfile = ["130000. introduction to book        791161",
"A.3 Appendix A3  802",
"210.1000010101 definable work vs. high-uncertainty work 80233",
"A.2.2 the agile manifesto and mindset  803",
"241.24443.414 xxxx and the kanban method    9",
"INDEX 808",
"3. life cycle selection 812",
"3.1 characteristics of project life cycles 813"]


8-element Array{String,1}:
 "130000. introduction to book        791161"                   
 "A.3 Appendix A3  802"                                         
 "210.1000010101 definable work vs. high-uncertainty work 80233"
 "A.2.2 the agile manifesto and mindset  803"                   
 "241.24443.414 xxxx and the kanban method    9"                
 "INDEX 808"                                                    
 "3. life cycle selection 812"                                  
 "3.1 characteristics of project life cycles 813"               

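A quick way to see how such nonideal lines break down is to split each one into section number, title and page, and to count the dot-separated parts of the number. This is only a sketch of an alternative to the regex chains in `getLevelNumeric` and `getLevelAlpha`, not what the script does. `section_parts` is a hypothetical name, and the pattern assumes a number such as `3`, `3.`, `3.1` or `A.2.2`, then whitespace, a title, whitespace and a page number.

```julia
# Hypothetical helper: split a TOC line into (number, title, page, depth).
# Returns nothing for lines that do not start with a section number (e.g. "INDEX 808").
function section_parts(line)
    m = match(r"^((?:\d+|[A-Z])(?:\.\d+)*)\.?\s+(.*?)\s+(\d+)\s*$", line)
    m === nothing && return nothing
    number, title, page = m.captures
    depth = length(split(number, "."))   # "3" has depth 1, "3.1" has depth 2, "A.2.2" has depth 3
    return (string(number), string(title), parse(Int, page), depth)
end

for line in ["130000. introduction to book        791161",
             "A.2.2 the agile manifesto and mindset  803",
             "INDEX 808"]
    println(section_parts(line))
end
```

For numeric headings, adding 1 to the depth gives the bookmark level used by the script, since level 1 is kept free for top-level entries.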
In [5]:
outputPath = "/users/me"
string(outputPath, "/bookmarks.bkmk")

"/users/me/bookmarks.bkmk"