This notebook shows how the PDF files in `datasets/texts-pdf/` were converted into a labeled text dataset, using the name of each subfolder (A2, B2, C1, ...) as the CEFR label.

Texts taken from: https://lingua.com/english/reading/ and https://learnenglish.britishcouncil.org/skills/reading

In [48]:
import PyPDF2

# Variants of the "Reading text:" marker produced by PDF extraction
READING_MARKERS = ("Reading text:", "Reading text :", "Reading t ext:", "Reading tex t:")

def process_text(text):
    # Skip documents that contain dotted fill-in lines
    if "……………" in text:
        return ""

    # Drop the comprehension questions that follow the reading text.
    # str.find returns -1 when the marker is missing, so only slice on a hit.
    end = text.find("Did you understand the text?")
    if end != -1:
        text = text[:end]

    # Remove CEFR level tags (A1, B2, etc.) and the British Council footer
    for tag in ("(A1)", "(A2)", "(B1)", "(B2)", "(C1)", "(C2)"):
        text = text.replace(tag, "")
    text = text.replace("© 2019 British Council   www.britishcouncil.org/learnenglish", "")

    # Keep only what follows the first "Reading text:" marker, if one is present
    new_text = text
    for marker in READING_MARKERS:
        pos = new_text.find(marker)
        if pos != -1:
            new_text = new_text[pos + len(marker):]
            break

    # Cut everything from the "Tasks" section onwards (searched in the trimmed text)
    end = new_text.find("Tasks")
    if end != -1:
        new_text = new_text[:end]

    # Fall back to the untrimmed text if the trimming left nothing
    return new_text if new_text else text

def extract_text_from_pdf(file_path):
    # Open the PDF file in binary mode
    with open(file_path, "rb") as file:
        # PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
        pdf = PyPDF2.PdfReader(file)

        # Extract the text of every page and join it into one string
        extracted_text = "".join(page.extract_text() for page in pdf.pages)

    return process_text(extracted_text)



# Specify the path to your PDF file
file_path = "datasets/texts-pdf/B1/_pdf_storage_english-text-empire-state-building.pdf"
#file_path = "datasets/texts-pdf/A1/_pdf_storage_english-text-house.pdf"

# Extract and process the text from the PDF
text = extract_text_from_pdf(file_path)

# Print the processed text
print(text)

The Empire State Building 
When exploring New York City, there are several different options for activities during a day trip. Some
visitors come to see a show, visit art museums, or simply to shop in many of the city's high-end
retailers. However, many tourists simply come to New York City for the sightseeing. One of the most
visited landmarks in New York City is the Empire State Building.
The Empire State Building, constructed in 1931, is a 102-story skyscraper, the ninth highest building in
the world, and the fourth tallest structure in the United States. It is located in Midtown, Manhattan.
This skyscraper is an iconic symbol of the city, having been featured in over 90 popular movies (as of
2018) throughout film history. Tourists come from all over the world to visit this building and view the
city from its famous observation decks.
Matthew, an enthusiast of historic buildings, was excited for this trip to New York City because he has
always appreciated architectural design. Matth
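
The trimming in `process_text` depends on `str.find`, which returns -1 when a marker is absent; used directly as a slice bound, that value silently drops the last character or keeps only the last one. The sketch below shows one way to guard against this. The helper name `slice_between` and the sample string are hypothetical and serve only as an illustration; they are not part of the notebook.

```python
def slice_between(text, start_markers, end_marker):
    """Return the part of `text` after the first start marker found
    and before `end_marker`. Missing markers leave that side untouched."""
    for marker in start_markers:
        pos = text.find(marker)
        if pos != -1:  # str.find returns -1 when the marker is absent
            text = text[pos + len(marker):]
            break
    end = text.find(end_marker)  # searched in the already-trimmed text
    if end != -1:
        text = text[:end]
    return text.strip()


# Hypothetical sample mimicking the layout of an extracted worksheet
sample = "Preparation task Reading t ext: Life on Mars is a topic. Tasks Task 1"
markers = ("Reading text:", "Reading t ext:")
print(slice_between(sample, markers, "Tasks"))
```

Searching for the end marker only after the start has been cut keeps both offsets in the same coordinate system, so the end index cannot point past the intended boundary.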

In [49]:
# Inspect the raw extraction of a single British Council PDF
pdf = PyPDF2.PdfReader('datasets/texts-pdf/C1/LearnEnglish-Reading-C1-Life-on-Mars.pdf')
extracted_text = ""
for page in pdf.pages:
    extracted_text += page.extract_text()

start = extracted_text.find('Reading text:') + len('Reading text:')

extracted_text[start:]

' Life on Mars  \nA new study published in the journal  Science shows definitive evidence of organic matter on \nthe surface of Mars.  The data was collected by NASA’ s nuclear -powered rover Curiosity. It \nconfirms earlier findings  that the Red Planet once contained carbon- based compounds. These \ncompounds – also called organic mo lecules – are essential ingredients for life as scientists \nunderstand it.  \nThe organ ic molecules were found in Mars’s Gale Crater, a large area that may have been a \nwatery lake over three billion years ago. The rover encountered traces of the molecule in \nrocks extracted from the area. The rocks also contain sulfur, which scientists speculate \nhelped preserve the organics even when the rocks were exposed to the harsh radiation on \nthe surface of the planet.  \nScientists are quick to state that the presence of these organic molecules is not sufficient evidence for ancient life on Mars, as the molecules could have been formed by non -living \npr
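
The raw output above contains typical extraction artifacts: hard line breaks, doubled spaces, and stray spaces around hyphens ("nuclear -powered", "carbon- based"). A possible clean-up step is sketched below. The function name `normalize_extracted_text` and the sample string are assumptions made for illustration, and the hyphen rule is a heuristic that would also join a legitimate suspended hyphen such as "pre- and".

```python
import re


def normalize_extracted_text(text):
    """Tidy whitespace artifacts left by PDF text extraction (heuristic)."""
    # Rejoin hyphenated words split by a stray space, e.g. "nuclear -powered"
    text = re.sub(r"(?<=\w) -(?=\w)", "-", text)
    text = re.sub(r"(?<=\w)- (?=\w)", "-", text)
    # Collapse newlines and runs of spaces into single spaces
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample modelled on the artifacts seen in the output above
raw = " Life on Mars  \nNASA's nuclear -powered rover found carbon- based compounds."
print(normalize_extracted_text(raw))
```

Note that collapsing newlines also merges the title into the first sentence; whether that is acceptable depends on how the texts are used afterwards.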

In [50]:
# Traverse the per-level folders and build the dataset
import os
import pandas as pd

path = "datasets/texts-pdf/"

def transform_pdfs_to_dataset(path):
    rows = []
    for subdir in os.listdir(path):
        print(subdir)
        # Skip stray files such as .DS_Store; only level folders are processed
        if not os.path.isdir(os.path.join(path, subdir)):
            continue
        for filename in os.listdir(os.path.join(path, subdir)):
            # Debug output: show file names and extracted texts for the C1 folder
            if subdir == "C1":
                print(filename)
            if not filename.endswith(".pdf"):
                continue
            text = extract_text_from_pdf(os.path.join(path, subdir, filename))
            if subdir == "C1":
                print(text)
            # process_text returns "" for documents that should be skipped
            if text == "":
                continue
            rows.append({"text": text, "label": subdir})
    # Build the frame once instead of concatenating inside the loop
    return pd.DataFrame(rows, columns=["text", "label"])

dataset = transform_pdfs_to_dataset(path)


.DS_Store
B2
A2
C1
LearnEnglish-Reading-C1-Horror-film-cliches.pdf

LearnEnglish-Reading-C1-Life-on-Mars.pdf

LearnEnglish-Reading-C1-A-threat-to-bananas.pdf

LearnEnglish-Reading-C1-Sustainable-supermarkets.pdf

LearnEnglish-Reading-C1-Political-manifestos.pdf

LearnEnglish-Reading-C1-A-biography-of-Kilian-Jornet.pdf

LearnEnglish-Reading-C1-How-humans-evolved-language.pdf

LearnEnglish-Reading-C1-Four-book-summaries.pdf

_pdf_storage_english-text-environment.pdf
The Environment  
In our modern world, there are many factors that place the wellbeing of the planet in jeopardy. While
some people have the opinion that environmental problems are just a natural occurrence, others
believe that human beings have a huge impact on the environment. Regardless of your viewpoint, take
into consideration the following factors that place our environment as well as the planet Earth in
danger.
Global warming or climate change is a major contributing factor to environmental damage. Because of
global wa
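
The folder traversal can also be expressed with `pathlib`, with the extraction step passed in as a function so the directory logic can be checked without any PDF library. This is only a sketch under that assumption: `collect_labeled_texts` and its `extract` parameter are hypothetical names, not part of the notebook.

```python
from pathlib import Path


def collect_labeled_texts(root, extract):
    """Return [{"text": ..., "label": ...}] for every PDF under root/<label>/.

    `extract` is any callable that maps a PDF path to its text and
    returns "" for documents that should be skipped.
    """
    rows = []
    for level_dir in sorted(Path(root).iterdir()):
        if not level_dir.is_dir():  # skips stray files such as .DS_Store
            continue
        for pdf_path in sorted(level_dir.glob("*.pdf")):
            text = extract(pdf_path)
            if text:  # drop documents the extractor rejected
                rows.append({"text": text, "label": level_dir.name})
    return rows
```

Assuming the notebook's `extract_text_from_pdf` is used as the extractor, the result could be turned into a table with `pd.DataFrame(collect_labeled_texts(path, extract_text_from_pdf), columns=["text", "label"])`.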

In [53]:
# show only the texts labeled C1
print(dataset[dataset["label"]=="C1"])

                                                 text label
21  The Environment  \nIn our modern world, there ...    C1
22  Spanish Flu Pandemic of 1918\nThe deadliest vi...    C1
23   Cultural behaviour in business  \nMuch of tod...    C1
24   Giving and receiving positive feedback  \nYou...    C1


In [54]:
dataset.to_csv("datasets/from_pdfs_automatic.csv",index=False)

In [49]:
dataset = pd.read_csv("datasets/cefr_texts_labeled.csv")
print(dataset[dataset["label"]=="C2"]['text'].head(20))

1292    A strong earthquake shook central Japan on Sat...
1293    Plato’s Republic centers on a simple question:...
1294    Taliban militants, who implemented Islamic law...
1295    ﻿They may not know who Steve Jobs was or even ...
1296    China's top foreign policy official met with N...
1297    After receiving nearly 100 reports of stuck ga...
1298    ﻿The Duke and Duchess of Cambridge won the fir...
1299    New York City's Board of Health voted Thursday...
1300    Current feminist political philosophy is indeb...
1301    Hume proposes that feeling, not thought, infor...
1302    In 1932 Marcuse published one of the first rev...
1303    An influential strand of feminist ethics devel...
1304    The last bit of news I heard before boarding a...
1305    Truth-value realism is the view that every wel...
1306    The deadly attack in Kabul on Shi'ite worshipp...
1307    The investigation considered farmers’ readines...
1308    ﻿The threatened extinction of the tiger in Ind...
1309    New la