# Convert PDF files to TXT files

The code in this Jupyter Notebook converts PDF files into plain text files (*.txt) by extracting text from each page of the documents, then writing that text to newly created TXT files. 

## Setup

First, we need to install the libraries/packages required to run the code in this notebook.

<div class="alert alert-block alert-info">You only need to run the cell below the first time you use this notebook.</div>

In [None]:
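# Uninstall any existing 'fitz' package, which conflicts with PyMuPDF
# (-y skips the confirmation prompt, which cannot be answered from a notebook cell)
%pip uninstall -y fitz
%pip install -r requirements.txt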

Next, we import the libraries and modules we'll be using.

In [None]:
# Import required modules 
import os
import fitz

<div class='alert alert-block alert-info'>When you ran the installation cell earlier in this section, you may have seen the warning message "WARNING: Skipping fitz as it is not installed." If so, no worries! That just means there wasn't already a package called <code>fitz</code> installed. If you didn't get the warning, an existing <code>fitz</code> package was uninstalled so that it would not interfere with importing the PyMuPDF library in the <code>import fitz</code> line. For more information, see the following GitHub issue for the PyMuPDF project: <a href='https://github.com/pymupdf/PyMuPDF/issues/523' target='_blank'>Unable to use fitz with python 3.8</a>.</div>
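If you'd like to confirm which of the two packages is present, the minimal sketch below queries the installed distributions using only the standard library (Python 3.8 or later). The helper name `installed_version` is a hypothetical name chosen for this illustration and is not part of PyMuPDF.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# 'PyMuPDF' provides the `fitz` module imported in this notebook;
# a separate distribution that is itself named 'fitz' can interfere with it.
for name in ('PyMuPDF', 'fitz'):
    print(name, '->', installed_version(name) or 'not installed')
```

Ideally, the output shows a version number for PyMuPDF and "not installed" for fitz.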

The last setup steps are to specify the folder/directory containing the PDF files you want to convert and the directory where you want to save the newly-created TXT files.

<div class="alert alert-block alert-info">This notebook assumes that the PDF files you want to convert are in a folder/directory called 'pdf' that is in the same directory as the notebook. If the PDF files are in a different directory, you will need to update the value for the <code>pdf_dir</code> variable accordingly in the cell below. You may use either a relative path or an absolute path (for more information, see <a href="https://www.computerhope.com/issues/ch001708.htm" target="_blank">What is the difference between a relative and absolute path?</a>).</div>
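As a quick illustration of the difference, the sketch below shows how Python resolves a relative path against the current working directory. The variable name `example_dir` is used only for this example, so it does not overwrite `pdf_dir`.

```python
import os

# A relative path such as 'pdf' is interpreted relative to the
# current working directory (usually the notebook's own folder)
example_dir = 'pdf'

print(os.path.isabs(example_dir))    # False, because the path is relative
print(os.path.abspath(example_dir))  # the equivalent absolute path on your machine
```

If the second line does not point to the folder you expect, using an absolute path for `pdf_dir` avoids the ambiguity.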

In [None]:
# Set the directory with PDF documents for text extraction
pdf_dir = 'pdf'

# Validate that the path exists and is a directory
if not os.path.isdir(pdf_dir):
    raise FileNotFoundError("Directory '%s' not found!" % pdf_dir)

<div class='alert alert-block alert-info'>If you'd like to store your TXT files in a different directory, make sure to update the value for the <code>txt_dir</code> variable accordingly in the cell below.</div>

In [None]:
# Create a subdirectory called 'txt' for plain text files
txt_dir = 'txt'

# Create the directory if it does not already exist
os.makedirs(txt_dir, exist_ok=True)

## Create TXT files with extracted text from PDF documents

Finally, we can convert our PDF files into TXT files!

**Resources**
- [Document — PyMuPDF 1.19.4 documentation](https://pymupdf.readthedocs.io/en/latest/document.html)
- [Page — PyMuPDF 1.19.4 documentation](https://pymupdf.readthedocs.io/en/latest/page.html)

In [None]:
# Create a list for files in which text is not recognized
exceptions = []

# Iterate through all files in the directory with PDFs
for filename in os.listdir(pdf_dir):
        
    # Check if the file type is PDF before trying to extract text
    if filename.lower().endswith('.pdf'):

        # Create a PDF Document object; record files that cannot be opened
        try:
            doc = fitz.Document(os.path.join(pdf_dir, filename))
        except Exception:
            exceptions.append(filename)
            continue

        # Extract text from the document page-by-page
        text = []

        for page in doc:
            page_text = page.get_text('text')
            if page_text:
                text.append(page_text)

        # Close the document to release the file
        doc.close()

        # If no text is extracted, add the filename to the exceptions list and move on to the next file
        if not text:
            exceptions.append(filename)
            continue
        
        # Create a TXT file for each document using the PDF filename
        # (swap only the file extension, not every 'pdf' in the name)
        txt_name = os.path.splitext(filename)[0] + '.txt'

        # Write each page of document text to the TXT file
        with open(os.path.join(txt_dir, txt_name), 'w', encoding='utf-8') as file:
            for page in text:
                file.write(page)

# Create a TXT file that lists all files in the exceptions list and save to the directory with TXT files
if exceptions:
    with open('%s/exceptions.txt' % (txt_dir), 'w', encoding='utf-8') as file:
        for filename in exceptions:
            file.write('%s\n' % (filename))

In [None]:
# Check the exceptions list for any files from which text was not extracted successfully
exceptions
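As an optional cross-check, the sketch below compares the two directories and lists any PDF that has no matching TXT file. It assumes each TXT file shares its PDF's base name (for example, `report.pdf` and `report.txt`); the function name `find_unconverted` is hypothetical and chosen for this illustration.

```python
import os


def find_unconverted(pdf_dir, txt_dir):
    """List PDF filenames in pdf_dir that have no same-named .txt file in txt_dir."""
    # Collect the base names (without extension) of all TXT files
    txt_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(txt_dir)
        if name.lower().endswith('.txt')
    }
    # Keep only the PDFs whose base name has no TXT counterpart
    return sorted(
        name
        for name in os.listdir(pdf_dir)
        if name.lower().endswith('.pdf')
        and os.path.splitext(name)[0] not in txt_stems
    )
```

Calling `find_unconverted(pdf_dir, txt_dir)` after the conversion returns the PDFs that still lack a TXT file, which you can compare against the `exceptions` list.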

<div class='alert alert-block alert-info'><p>Sometimes, text may not be extracted successfully from a PDF document because it needs to be processed with Optical Character Recognition (OCR) software. OCR processing makes the text in the document editable and searchable (for more information, see the <a href="https://pitt.libguides.com/ocr" target="_blank">Optical Character Recognition (OCR) @ Pitt Guide</a>).</p>
    <p>If you need to OCR your document, you can do so using <a href="https://acrobat.adobe.com/us/en/acrobat/acrobat-pro.html" target="_blank">Adobe Acrobat Pro DC</a>, which is installed on most computers in the <a href="https://www.technology.pitt.edu/services/student-computing-labs" target="_blank">Student Computing Labs</a> across campus as well as in the specialized section of the Pitt IT <a href="https://www.technology.pitt.edu/services/virtual-lab" target="_blank">Virtual Computing Lab</a>. Acrobat, among other Adobe products, is also available to Pitt students and faculty to <a href="https://www.technology.pitt.edu/software/adobe-software" target="_blank">install for free</a>.</p></div>