This notebook shows how the PDF files in `datasets/texts-pdf/` were converted into a usable dataset.

Texts taken from: https://lingua.com/english/reading/

In [40]:
import PyPDF2

def process_text(text):
    # Discard documents that contain a dotted line ("……………")
    if "……………" in text:
        return ""

    # Drop everything from the comprehension questions onward, if present.
    # str.find returns -1 when the marker is missing, so guard the slice.
    cutoff = text.find("Did you understand the text?")
    if cutoff != -1:
        text = text[:cutoff]

    # Remove CEFR level tags (A1, A2, ...) and the British Council footer
    for tag in ("(A1)", "(A2)", "(B1)", "(B2)", "(C1)", "(C2)"):
        text = text.replace(tag, "")
    text = text.replace("© 2019 British Council   www.britishcouncil.org/learnenglish", "")

    # Find the "Reading text:" header, allowing for stray spaces from PDF extraction
    start = -1
    for marker in ("Reading text:", "Reading text :", "Reading t ext:", "Reading tex t:"):
        start = text.find(marker)
        if start != -1:
            break
    if start == -1:
        return text

    # Keep the text from the header up to the "Tasks" section, if there is one
    end = text.find("Tasks")
    if end == -1:
        end = len(text)
    # Fall back to the whole text if the slice is empty (e.g. "Tasks" precedes the header)
    return text[start:end] or text


def extract_text_from_pdf(file_path):
    # Open the PDF file in binary mode
    with open(file_path, "rb") as file:
        # PdfReader replaces the deprecated PdfFileReader (PyPDF2 >= 2.0)
        pdf = PyPDF2.PdfReader(file)

        # Concatenate the text extracted from every page
        extracted_text = "".join(page.extract_text() for page in pdf.pages)

    return process_text(extracted_text)



# Specify the path to your PDF file
file_path = "datasets/texts-pdf/B1/_pdf_storage_english-text-empire-state-building.pdf"
#file_path = "datasets/texts-pdf/A1/_pdf_storage_english-text-house.pdf"

# Extract and process the text from the PDF
text = extract_text_from_pdf(file_path)

# Print the processed text
print(text)

The Empire State Building 
When exploring New York City, there are several different options for activities during a day trip. Some
visitors come to see a show, visit art museums, or simply to shop in many of the city's high-end
retailers. However, many tourists simply come to New York City for the sightseeing. One of the most
visited landmarks in New York City is the Empire State Building.
The Empire State Building, constructed in 1931, is a 102-story skyscraper, the ninth highest building in
the world, and the fourth tallest structure in the United States. It is located in Midtown, Manhattan.
This skyscraper is an iconic symbol of the city, having been featured in over 90 popular movies (as of
2018) throughout film history. Tourists come from all over the world to visit this building and view the
city from its famous observation decks.
Matthew, an enthusiast of historic buildings, was excited for this trip to New York City because he has
always appreciated architectural design. Matth
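
The trimming step in `process_text` depends on `str.find`, which returns -1 when a marker is absent; slicing with that value silently drops the last character instead of failing. The sketch below isolates that logic on a made-up string. `trim_between` and the sample text are hypothetical names introduced only for illustration and are not part of the notebook's pipeline.

```python
def trim_between(text, start_markers, end_marker):
    """Return `text` from the first matching start marker up to `end_marker`.

    Hypothetical helper for illustration; returns the whole text
    when none of the start markers is present.
    """
    # str.find returns -1 when the marker is absent, so test for it explicitly
    start = -1
    for marker in start_markers:
        start = text.find(marker)
        if start != -1:
            break
    if start == -1:
        return text

    # Look for the end marker only after the start position
    end = text.find(end_marker, start)
    if end == -1:
        return text[start:]
    return text[start:end]


# Made-up sample mimicking a header with a stray space from PDF extraction
sample = "Header\nReading t ext: Asteroids\nBody of the text.\nTasks\nTask 1"
trimmed = trim_between(sample, ["Reading text:", "Reading t ext:"], "Tasks")
print(repr(trimmed))
```

Searching for the end marker from `start` onward is a design choice of this sketch: it avoids an empty slice when the end marker also appears earlier in the document.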

In [41]:
# Walk the level subfolders (A1 ... C2) and build a labelled dataset
import os
import pandas as pd

path = "datasets/texts-pdf/"

def transform_pdfs_to_dataset(path):
    rows = []
    for subdir in os.listdir(path):
        print(subdir)
        # Skip the macOS metadata file
        if subdir.endswith(".DS_Store"):
            continue
        for filename in os.listdir(os.path.join(path, subdir)):
            if not filename.endswith(".pdf"):
                continue
            text = extract_text_from_pdf(os.path.join(path, subdir, filename))
            # process_text returns "" for documents that should be excluded
            if text == "":
                continue
            # The subfolder name is the CEFR level label
            rows.append({"text": text, "label": subdir})
    # Build the DataFrame once instead of concatenating inside the loop
    return pd.DataFrame(rows, columns=["text", "label"])

dataset = transform_pdfs_to_dataset(path)


.DS_Store
B2
A2
C1
B1
A1
C2


In [43]:
# Print one B2-level text (the row with index label 6)
print(dataset[dataset["label"]=="B2"]['text'][6])

Reading text: Asteroids  
A 
In 2010, the planetary defence team at NASA had identified and logged 90 per cent of the 
asteroids near Earth measuring 1km wide. These ‘ near-Earth objects ’, or NEOs, are the size of 
mountains and include anything within 50 million kilometres of Earth’ s orbit. With an estimated 
50 left to log, NASA says none of the 887 it knows about are a significant danger to the planet.  
B 
Now NASA is working towards logging some of the smaller asteroids, those measuring 140 
metres wide or more. Of the 25,000 estimated asteroids of this size, so far  about 8,000 have 
been logged, leaving 17,000 unaccounted for. Considering that a 19 -metre asteroid that 
exploded above the city of Chelyabinsk in Russia in 2013 injured 1,200 people, these middle-sized asteroids would be a serious danger if they enter Earth’ s orbit.  
C 
Whether NASA can find the remaining middle- sized NEOs depends on getting the money to 
build NEOCam, a 0.5- metre space telescope which would 

In [44]:
dataset.to_csv("from_pdfs.csv",index=False)
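
Because the extracted texts contain line breaks, it can be worth checking that they survive the CSV round trip. The following is a minimal sketch of such a check, assuming a toy DataFrame with invented texts and labels in place of the real `dataset`, and an in-memory buffer in place of `from_pdfs.csv`.

```python
import io

import pandas as pd

# Toy stand-in for the real dataset; these texts and labels are made up
toy = pd.DataFrame({
    "text": ["First line\nsecond line", "A short text"],
    "label": ["B1", "A1"],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Multi-line texts should come back unchanged, with one row per document
round_trip_ok = restored["text"].tolist() == toy["text"].tolist()
label_counts = restored["label"].value_counts().to_dict()
print(round_trip_ok, label_counts)
```

The same two checks could be run on the real file by reading `from_pdfs.csv` with `pd.read_csv` and comparing against `dataset`.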