# Wikipedia Scraper and Script Generator
# 🖥→🚿→📃→🗣🎙

## Goal
Generate a voice acting script file given a target Wikipedia page.

## Why is this important?
If we're going to hire voice actors to generate hours upon hours of high-quality recordings for us, we'll need to give them something to read.


## How do you create a script?
1. Choose script sources (*Wikipedia pages, articles, essays, etc.*)
2. Extract text from sources
3. Clean and normalize text
4. Export the script

# Step One: Choosing A Script Source

### What does an ideal script source look like?
We still have lots to learn, but here are our current assumptions:

1. **A script should be in the same domain as the expected use case of the voice**

If you're training a voice that will be used to narrate business audiobooks, it's best to train on scripts that talk about business, not biology. The Tacotron 2 mean opinion score dropped off significantly when it attempted to speak out-of-domain sentences.

2. **A script should use similar style and punctuation as the expected use case of the voice**

If you're training a voice for a phone support system that speaks in 1-3 word options, you might not want to train on long-winded novels where sentences are much longer and interpreted quite differently.

3. **A script should contain coherent, complete pieces of content**

A voice actor may read the second sentence of a paragraph in a different style depending on what was in the sentence before it. Accordingly, it's important that scripts contain the entire context in which a line is to be spoken.

4. **Ideally, a script does not contain copyrighted content**

The law is ambiguous on whether training on copyrighted content is permissible. To be safe, prefer scripts that are public and licensed for reuse (like Wikipedia articles).
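
One rough way to sanity-check assumption 2 (similar style and punctuation) is to compare sentence lengths between a candidate script and sample text from the target use case. The sketch below is only an illustration: `sentence_lengths()` is a hypothetical helper, the two sample strings are made up, and the punctuation-based sentence split is deliberately naive.

```python
import re
from statistics import mean


def sentence_lengths(text):
    """Return the word count of each sentence in text (naive punctuation split)."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [len(sentence.split()) for sentence in sentences if sentence]


# made-up samples: a phone menu vs. a long-winded sentence
phone_menu = "Press one. Press two for billing. Goodbye!"
long_form = "The committee, having deliberated for several hours, finally reached a decision."

print(mean(sentence_lengths(phone_menu)))  # short prompts, few words each
print(mean(sentence_lengths(long_form)))   # one long sentence
```

A large gap between the two averages suggests the candidate script may not match the style the voice will be used for.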

# Step Two: Extract Text from Sources
Let's assume we've found a few Wikipedia articles in our target domain.

We'll use the [MediaWiki python library], which is a wrapper around the MediaWiki API that makes requests and text extraction a bit easier.

There is a more popular but less-maintained wrapper, [Wikipedia], that is a worthwhile fallback if we have issues with this library. 


[MediaWiki python library]: https://github.com/barrust/mediawiki
[Wikipedia]: https://github.com/goldsmith/Wikipedia
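
If you want the fallback to be automatic, a minimal sketch might look like the following. `pick_backend()` and `fetch_content()` are hypothetical helpers (not part of either library); the sketch assumes both wrappers expose a `page(title=...)` call that returns an object with a `.content` attribute, which is how they are used in this notebook.

```python
from importlib.util import find_spec


def pick_backend():
    """Return the name of the first installed wrapper, or None if neither is installed."""
    for name in ('mediawiki', 'wikipedia'):
        if find_spec(name) is not None:
            return name
    return None


def fetch_content(title, backend=None):
    """Fetch the plain-text body of a page. The library calls need network access."""
    backend = backend or pick_backend()
    if backend == 'mediawiki':
        from mediawiki import MediaWiki
        return MediaWiki().page(title=title).content
    if backend == 'wikipedia':
        import wikipedia as wikipedia_fallback
        return wikipedia_fallback.page(title=title).content
    raise ImportError('Install pymediawiki or wikipedia before fetching pages')


print(pick_backend())
```

The rest of this notebook sticks with `mediawiki` directly.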

In [1]:
from IPython.display import display, Markdown
from mediawiki import MediaWiki
import pandas as pd
import string, re

wikipedia = MediaWiki()

Let's imagine we want to scrape the Wikipedia entry on [Yuval Noah Harari].

The "content" of a wiki page is the body of the article. It starts from the very first sentence (just below the page title) and ends on the last sentence of the last section of the page.

Bonus points: The `mediawiki` library automatically removes tables, images, and links from the retrieved content.

[Yuval Noah Harari]: https://en.wikipedia.org/wiki/Yuval_Noah_Harari

In [2]:
TARGET_WIKI_PAGE = 'Yuval Noah Harari'

Here's what the raw text content of the Wiki looks like:

In [3]:
wiki_page = wikipedia.page(title=TARGET_WIKI_PAGE)

print(wiki_page.content)

Yuval Noah Harari (Hebrew: יובל נח הררי‎; born 24 February 1976) is an Israeli historian and a tenured professor in the Department of History at the Hebrew University of Jerusalem. He is the author of the international bestsellers Sapiens: A Brief History of Humankind (2014) and Homo Deus: A Brief History of Tomorrow (2016). His writings examine concepts of free will, consciousness and definitions of intelligence.
Harari's early publications are concerned with what he describes as the "cognitive revolution" occurring roughly 50,000 years ago, when Homo sapiens supplanted the rival Neanderthals, mastered cognitive linguistics, developed structured societies, and ascended as apex predators, aided by the agricultural revolution and more recently accelerated by scientific methodology and rationale which have allowed humans to approach near mastery over their environment.
His recent books are more cautionary, and work through the consequences of a futuristic biotechnological world where sen

When we look at the page above, it's fairly obvious we don't want to include any of the text from the "Publications" section and below.

# Step Three: Clean and normalize text

Each script source is likely to have unique styling, syntax, and structure. We'll need to create some rules to make sure we're **only extracting the text we intend the voice actor to read.**

### 1. Clean the body text

In our case, Wikipedia articles can contain all sorts of anomalies, from foreign characters (like Hebrew letters) to lengthy citations (which can include things like ISBN numbers).

The `mediawiki` library automatically removes links, citations, images, and tables from the page content it returns.

We'll write a `clean_text()` function that removes a few more elements:
- **Brackets and the information contained within.** *Ex.* [16].
- **Parentheses that include years and acronyms.** *Ex.* (1945-1994).
- (Optional) **All parenthetical information**. *Ex.* (Any information contained in parentheses, as this is where many anomalies appear to occur)

In [4]:
def clean_text(input_text, remove_parenthetical_content=True):
    """ Remove [brackets] and (years, acronyms) from input_text. 
    Optionally, it can remove all content found within parentheses.

    Returns:
        Cleaned Text
    """
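    # raw strings keep the regex backslashes from being read as string escapes
    cleaned_text = re.sub(r'\[.*?\]', '', input_text) # [brackets]
    cleaned_text = re.sub(r' \(\S+\)', '', cleaned_text) # (years, acronyms): no spaces inside
    if remove_parenthetical_content:
        cleaned_text = re.sub(r' \([^)]*\)', '', cleaned_text) # everything else in parentheses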

    return cleaned_text
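
To see what each rule does, here is a small self-contained check. It restates `clean_text()` so the block runs on its own; the sample sentence is made up for illustration and is not taken from the Wikipedia page.

```python
import re


def clean_text(input_text, remove_parenthetical_content=True):
    """Same rules as above, restated so this example is self-contained."""
    cleaned_text = re.sub(r'\[.*?\]', '', input_text)      # [brackets]
    cleaned_text = re.sub(r' \(\S+\)', '', cleaned_text)   # (years, acronyms)
    if remove_parenthetical_content:
        cleaned_text = re.sub(r' \([^)]*\)', '', cleaned_text)
    return cleaned_text


sample = ('Harari (born 1976) wrote Sapiens (2014).[16] '
          'He studied at the Hebrew University of Jerusalem (HUJI).')

print(clean_text(sample, remove_parenthetical_content=False))
# Harari (born 1976) wrote Sapiens. He studied at the Hebrew University of Jerusalem.

print(clean_text(sample, remove_parenthetical_content=True))
# Harari wrote Sapiens. He studied at the Hebrew University of Jerusalem.
```

Note that the second rule only matches parentheses with no spaces inside, which is why "(born 1976)" survives unless the optional third rule is switched on.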

### 2. (New) Truncate page at a specific section and create DataFrame with usable sections

When we generate scripts for voice actors to read, it's very likely that only a portion of our Wikipedia page will be selected. Accordingly, we'll want to make it easy to select individual sections from our Wikipedia page.

We'll create a pandas `DataFrame` object that contains the cleaned text from each usable section in our page.

Occasionally, we may receive a section title with no corresponding body text because Wikipedia has a "This section needs expansion" placeholder. We'll skip these empty sections.


Finally, we may want to truncate the page at a certain section. For example, a wiki entry about an author may have a `"Publications"` section at the bottom of the page that includes a long list of books we don't intend our voice actor to read.

`dataframe_from_wiki_page()` creates a `DataFrame` object containing the desired content from a `MediaWiki.page()` object. It iterates through each section in the page, cleans the text, and stores it as a new row in a pandas `DataFrame` object.

It optionally truncates the page at a given section. If you set the `truncate_at_section` argument to `"Publications"`, for example, it will return the cleaned Wiki content up until the `"Publications"` section.

The `DataFrame` will look something like this:


| Title | Content |
|------------|------------|
|Yuval Noah Harari: Summary|Yuval Noah Harari is an Israeli historian...|
|Yuval Noah Harari: Biography|Harari was born in Kiryat Ata, Israel...|
|Yuval Noah Harari: Academic career|Harari first specialized in medieval...|
|...|...|
|Yuval Noah Harari: Views and opinions|Harari is interested in how Homo sapiens...|


By default, `dataframe_from_wiki_page()` will truncate at the `"See also"` or `"References"` sections, both of which are lists of articles, books, topics, etc. found at the bottom of the page. We don't intend our voice actors to read these.

In [5]:
def dataframe_from_wiki_page(wiki_page, truncate_at_section=None):
    """ Generates a pandas DataFrame containing the titles and accompanying content of
    each individual section contained within a Wikipedia page. Cleans the text of each
    block of text and optionally truncates the page at a specific section.
    
    Arguments:
        wiki_page: A MediaWiki page object.
        truncate_at_section: (Optional) Title of a section to stop at, in
            addition to the defaults ('See also' and 'References').

    Returns:
        A DataFrame with 'Title' and 'Content' columns, one row per section.
    """
    # set up sections to truncate
    break_at_sections = ['See also', 'References'] # defaults
    if truncate_at_section:
        break_at_sections.append(truncate_at_section)

    # Every page starts with a "Summary" section
    title_list = ['%s: Summary' % wiki_page.title]
    content_list = [clean_text(wiki_page.summary, remove_parenthetical_content=True)]

    # extract the content for each section
    for section in wiki_page.sections:
        if section in break_at_sections:
            break
        section_text = wiki_page.section(section) # look the section up once
        if section_text: # skip sections with no body text
            title_list.append('%s: %s' % (wiki_page.title, section))
            content_cleaned = clean_text(section_text, remove_parenthetical_content=True)
            content_list.append(content_cleaned)

    # write section titles and content to a dictionary
    script_as_dict = {
        'Title': title_list,
        'Content': content_list
    }
    
    # create and return a DataFrame using the dictionary
    return pd.DataFrame(script_as_dict)
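
To try the truncation and empty-section logic without hitting the network, a stub page can stand in for the real one. Everything in this sketch is hypothetical: `StubPage` only mimics the `title`, `summary`, `sections`, and `section()` members used above, and `usable_sections()` is a pandas-free restatement of the loop that returns plain tuples and skips text cleaning.

```python
class StubPage:
    """Hypothetical stand-in for a MediaWiki page object (illustration only)."""

    def __init__(self, title, summary, sections):
        self.title = title
        self.summary = summary
        self._sections = sections  # dict keeps section order

    @property
    def sections(self):
        return list(self._sections)

    def section(self, name):
        return self._sections[name]


def usable_sections(page, truncate_at_section=None):
    """Return (title, content) tuples up to the first break section."""
    break_at = ['See also', 'References']
    if truncate_at_section:
        break_at.append(truncate_at_section)
    rows = [('%s: Summary' % page.title, page.summary)]
    for name in page.sections:
        if name in break_at:
            break
        body = page.section(name)
        if body:  # skip "needs expansion" placeholders with no text
            rows.append(('%s: %s' % (page.title, name), body))
    return rows


page = StubPage('Example Author', 'An example summary.', {
    'Biography': 'Born somewhere.',
    'Awards': '',                      # empty section, should be skipped
    'Publications': 'Book A\nBook B',  # truncation point
    'See also': 'Other pages',
})

rows = usable_sections(page, truncate_at_section='Publications')
titles = [title for title, _ in rows]
print(titles)
# ['Example Author: Summary', 'Example Author: Biography']
```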

Let's see this in action!

In [6]:
page_as_df = dataframe_from_wiki_page(wiki_page, truncate_at_section='Publications')

# clean the text
page_as_df['Content'] = page_as_df['Content'].apply(clean_text, remove_parenthetical_content=True)

pd.set_option('display.max_colwidth', None) # None means "no limit"; -1 is deprecated in newer pandas
display(page_as_df)

Unnamed: 0,Title,Content
0,Yuval Noah Harari: Summary,"Yuval Noah Harari is an Israeli historian and a tenured professor in the Department of History at the Hebrew University of Jerusalem. He is the author of the international bestsellers Sapiens: A Brief History of Humankind and Homo Deus: A Brief History of Tomorrow. His writings examine concepts of free will, consciousness and definitions of intelligence.\nHarari's early publications are concerned with what he describes as the ""cognitive revolution"" occurring roughly 50,000 years ago, when Homo sapiens supplanted the rival Neanderthals, mastered cognitive linguistics, developed structured societies, and ascended as apex predators, aided by the agricultural revolution and more recently accelerated by scientific methodology and rationale which have allowed humans to approach near mastery over their environment.\nHis recent books are more cautionary, and work through the consequences of a futuristic biotechnological world where sentient biological organisms are surpassed by their own creations; he has said ""Homo sapiens as we know them will disappear in a century or so""."
1,Yuval Noah Harari: Biography,"Harari was born in Kiryat Ata, Israel, in 1976 and grew up in a secular Jewish family with Lebanese and Eastern European roots in Haifa, Israel. Harari is openly gay, and in 2002 met his husband Itzik Yahav, whom he calls ""my internet of all things"". Yahav is also Harari's personal manager. They married in a civil ceremony in Toronto in Canada. The couple lives in a moshav Mesilat Zion near Jerusalem.Harari says Vipassana meditation, which he began whilst in Oxford in 2000, has ""transformed my life"". He practises for two hours every day, every year undertakes a meditation retreat of 30 days or longer, in silence and with no books or social media, and is an assistant meditation teacher. He dedicated Homo Deus to ""my teacher, S. N. Goenka, who lovingly taught me important things,"" and said ""I could not have written this book without the focus, peace and insight gained from practising Vipassana for fifteen years."" He also regards meditation as a way to research.Harari is a vegan, and says this resulted from his research, including his view that the foundation of the dairy industry is the breaking of the bond between mother and calf cows. As of September 2017, he does not have a smartphone."
2,Yuval Noah Harari: Academic career,"Harari first specialized in medieval history and military history in his studies from 1993 to 1998 at the Hebrew University of Jerusalem. He completed his DPhil degree at Jesus College, Oxford, in 2002 under the supervision of Steven J. Gunn. From 2003 to 2005 he pursued postdoctoral studies in history as a Yad Hanadiv Fellow.He has published numerous books and articles, including Special Operations in the Age of Chivalry, 1100–1550;The Ultimate Experience: Battlefield Revelations and the Making of Modern War Culture, 1450–2000; The Concept of 'Decisive Battles' in World History; and Armchairs, Coffee and Authority: Eye-witnesses and Flesh-witnesses Speak about War,1100–2000. He now specializes in world history and macro-historical processes.\nHis book Sapiens: A Brief History of Humankind was published in Hebrew in 2011 and then in English in 2014; it has since been translated into some 30 additional languages. The book surveys the entire length of human history, from the evolution of Homo sapiens in the Stone Age up to the political and technological revolutions of the 21st century. The Hebrew edition became a bestseller in Israel, and generated much interest both in the academic community and among the general public, turning Harari into a celebrity.\nYouTube video clips of Harari's Hebrew lectures on the history of the world have been viewed by tens of thousands of Israelis.Harari also gives a free online course in English titled A Brief History of Humankind."
3,Yuval Noah Harari: Awards and recognition,"Harari twice won the Polonsky Prize for ""Creativity and Originality"", in 2009 and 2012. In 2011 he won the Society for Military History's Moncado Award for outstanding articles in military history. In 2012 he was elected to the Young Israeli Academy of Sciences."
4,Yuval Noah Harari: Published works,"His book Homo Deus: A Brief History of Tomorrow was published in 2016, examining broad possibilities of the future of Homo sapiens. The book's premise outlines that in the future, humanity is likely to make a significant attempt to gain happiness, immortality and God-like powers. The book goes on to openly speculate various ways this ambition might be realised for Homo sapiens in the future based on the past and present. Among several possibilities for the future, Harari develops a term for a philosophy or mindset that worships big data.Harari's next book will be called 21 Lessons for the 21st Century and will focus more on present-day concerns. It is due to be published on 30th August 2018."
5,Yuval Noah Harari: Views and opinions,"Harari is interested in how Homo sapiens reached their current condition, and in their future. His research focuses on macro-historical questions such as: What is the relation between history and biology? What is the essential difference between Homo sapiens and other animals? Is there justice in history? Does history have a direction? Did people become happier as history unfolded?\nHarari regards dissatisfaction as the ""deep root"" of human reality, and as related to evolution.In a 2017 article Harari has argued that through continuing technological progress and advances in the field of artificial intelligence, ""by 2050 a new class of people might emerge – the useless class. People who are not just unemployed, but unemployable."" He put forward the case that dealing with this new social class economically, socially and politically will be a central challenge for humanity in the coming decades.Harari has commented on the plight of animals, particularly domesticated animals since the agricultural revolution, and is a vegan. In a 2015 Guardian article under the title ""Industrial farming is one of the worst crimes in history"" he called ""he fate of industrially farmed animals one of the most pressing ethical questions of our time."""


# Step Four: Export the script

We finally have our script ready to use as source material for our final voice actor scripts.

We'll just need to export it as a usable file. We'll use `.csv`, a lightweight, universal file format. There's no reason you can't export the script to a different format, but I've chosen `.csv` because it's compact and plays nicely with tools like `pandas` that we'll use to analyze phoneme coverage. Note that the export below uses a tab as the field separator, so the file must be read back with the same separator.

In [18]:
OUTPUT_DIRECTORY = 'wikipedia/csv-source/'
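
The export will fail if the output directory does not exist yet. A minimal sketch for creating it first is shown here; `build_output_path()` is a hypothetical helper, and the demo writes into a temporary directory rather than the real `OUTPUT_DIRECTORY`.

```python
import os
import tempfile


def build_output_path(directory, page_title, extension='.csv'):
    """Create directory if needed and return the full file path for page_title."""
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    return os.path.join(directory, page_title + extension)


# demo in a throwaway location so nothing in the project is touched
base = tempfile.mkdtemp()
demo_path = build_output_path(os.path.join(base, 'wikipedia', 'csv-source'),
                              'Yuval Noah Harari')
print(demo_path)
```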

In [19]:
output_filename = '%s%s.csv' % (OUTPUT_DIRECTORY, TARGET_WIKI_PAGE)

page_as_df.to_csv(path_or_buf=output_filename, sep='\t', index=True, index_label='Index')

print('File successfully saved as %s' % output_filename)

File successfully saved as wikipedia/csv-source/Yuval Noah Harari.csv
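
Because the file is tab-separated despite its `.csv` extension, any reader has to be told about the separator. The round trip below is a sketch using only the standard library `csv` module and made-up rows; the file name and contents are placeholders, not the real export.

```python
import csv
import os
import tempfile

# made-up rows in the same Index / Title / Content layout as the export
rows = [
    {'Index': '0', 'Title': 'Example: Summary', 'Content': 'First line, with a comma.'},
    {'Index': '1', 'Title': 'Example: Biography', 'Content': 'Second "quoted" line.'},
]
path = os.path.join(tempfile.mkdtemp(), 'example.csv')

# write with a tab delimiter, mirroring sep='\t' in the export cell
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['Index', 'Title', 'Content'], delimiter='\t')
    writer.writeheader()
    writer.writerows(rows)

# read it back with the same delimiter
with open(path, newline='', encoding='utf-8') as f:
    loaded = list(csv.DictReader(f, delimiter='\t'))

print(loaded[0]['Content'])
```

With pandas, the equivalent read would be `pd.read_csv(output_filename, sep='\t', index_col='Index')`.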


# Putting It All Together

Below is a straightforward example of each step carried out in order. We'll retrieve, clean, and save the Wikipedia entry on Strategic management.

In [20]:
# 1. choose target
page_to_scrape = 'Strategic management'

# 2. retrieve text from target
wiki_page = wikipedia.page(title=page_to_scrape)
wiki_content = wiki_page.content

# 3. create DataFrame & clean text
page_as_df = dataframe_from_wiki_page(wiki_page)
page_as_df['Content'] = page_as_df['Content'].apply(clean_text) # clean text

# 4. save as .csv
output_filename = '%s%s.csv' % (OUTPUT_DIRECTORY, page_to_scrape)
page_as_df.to_csv(path_or_buf=output_filename, sep='\t', index=True, index_label='Index')
print('File successfully saved as %s' % output_filename)


# for exporting txt...
# 3. Clean and normalize text
# cleaned_wiki_content = normalize_wiki_content(wiki_content)

# 4. Save as txt
# save_wiki_as_txt(page_to_scrape, cleaned_wiki_content)

File successfully saved as wikipedia/csv-source/Strategic management.csv


### Footnote for the clean_text() function
If you do intend for your voice actor to read any of the elements removed during cleaning (bracketed text, parenthetical years and acronyms, or other parenthetical content), you'll need to modify the `clean_text()` function.



## [IGNORE] Archive of old functions and notes...

### 2. (Old) Normalize the output and (optionally) truncate the page at a specific section

Occasionally, the Wikipedia API will return content with inconsistent formatting. For example, we may receive a section title with no corresponding body text because Wikipedia has a "This section needs expansion" placeholder. 

Furthermore, we don't want voice actors to read section titles since these don't appear in the majority of the real-world scripts we've seen so far. *Caveat:* We'll likely want to include section titles for any domains where they are more common to read, given that they may affect how certain content is read.

But, in our case, we'll remove all section headers and append together each block of body text in a format that is easy for voice actors to read (two line breaks after each paragraph).

Finally, we may want to truncate the page at a certain section. For example, a wiki entry about an author may have a `"Publications"` section at the bottom of the page that includes a long list of books we don't intend our voice actor to read.

`normalize_wiki_content()` takes the raw text from a Wiki page, cleans each line, omits section headers, and optionally truncates the page at a given section. If you set the `truncate_at_section` argument to `"Publications"`, for example, it will return the cleaned Wiki content up until the `"Publications"` section.

By default, `normalize_wiki_content()` will truncate at the `"See also"` or `"References"` sections, both of which are lists of articles, books, topics, etc. found at the bottom of the page. We don't intend our voice actors to read these.



In [10]:
def normalize_wiki_content(wiki_content, remove_parenthetical_content=False, truncate_at_section=None):
    """ Cleans the body text of a Wikipedia page, marks each section header with a
    'SECTION:' prefix, and truncates the page at a given section.

    Returns:
        Cleaned Wiki Content (String)
    """
    cleaned_wiki_content = ''
    break_at_sections = ['See also', 'References'] # default sections to break on
    
    if truncate_at_section:
        break_at_sections.append(truncate_at_section)
        
    section_pattern = re.compile(r'={1,5}.*?={1,5}') # regex for "== Sections =="
    
    # start cleaning each line of text!
    for paragraph in wiki_content.splitlines():
        # ...so long as it's not a section header
        if not section_pattern.match(paragraph):
            cleaned_paragraph = clean_text(paragraph, remove_parenthetical_content=remove_parenthetical_content)
            if cleaned_paragraph:
                cleaned_wiki_content += '%s\n\n' % cleaned_paragraph # write paragraph content
        else:
            # if we land on a section header, check if we should break
            section_title = paragraph.strip('=')
            section_title = section_title.strip()
            if section_title in break_at_sections:
                break
            else:
                cleaned_wiki_content += 'SECTION: %s\n\n' % section_title # write section title

    return cleaned_wiki_content

In [11]:
cleaned_wiki_content = normalize_wiki_content(wiki_page.content, remove_parenthetical_content=False, truncate_at_section="Publications")

# print(cleaned_wiki_content)

#### That's much better

Depending on your domain, there may be additional anomalies you'll need to account for. Perhaps you're scraping entries about physics or math, which may contain non-alphanumeric characters that you'd like removed from your final script.
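
As a starting point for that kind of domain-specific scrubbing, here is a minimal sketch that keeps only a whitelist of characters. `strip_unreadable()` and the whitelist itself are assumptions for illustration; note that this approach also drops accented letters, which may or may not be what you want.

```python
import re

# hypothetical whitelist: ASCII letters, digits, whitespace, basic punctuation
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()-]")


def strip_unreadable(text):
    """Drop characters outside the whitelist, then collapse leftover double spaces."""
    stripped = NOT_ALLOWED.sub('', text)
    return re.sub(r'[ \t]{2,}', ' ', stripped).strip()


print(strip_unreadable('Energy ≈ mass × c².'))
# Energy mass c.
```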

#### (Optional) Next Steps

Modify `clean_text()`, `normalize_wiki_content()`, or both to better scrub and normalize the Wiki text for your specific domain. Alternatively, you can build an entirely new, domain-specific cleaning function.

# Step Four: Export the script

We finally have our script ready to share with a voice actor or pass along to our phoneme analyzer.

We'll just need to export it as a usable file. We'll use `.txt`, a lightweight, universal file format. There's no reason you can't export the script to a different format, but I've chosen `.txt` because it's compact and plays nicely with tools like `pandas` that we'll use to analyze phoneme coverage.

The `save_wiki_as_txt()` function below writes the target Wiki page title (ex. "Yuval Noah Harari") and the cleaned wiki content to a `.txt` file. It also retrieves and saves the original URL of the Wiki entry for later reference.

In [12]:
# directory to save final script
OUTPUT_DIRECTORY = 'wikipedia/'

In [13]:
def save_wiki_as_txt(target_wiki_page, wiki_content, write_to_directory='wikipedia/'):
    """ Writes text content from a Wikipedia page as a .txt file.

    Returns:
        None
    """
    
    # build the path of the txt file for export
    output_file_format = '%s%s.txt'
    output_filename = output_file_format % (write_to_directory, target_wiki_page)
    
    # request Wiki page to retrieve source URL
    wiki_page = wikipedia.page(title=target_wiki_page)

    try:
        # the context manager closes the file even if a write fails
        with open(output_filename, 'w', encoding='utf-8') as output_file:
            # write a header row with source info
            output_file.write('%s\n' % target_wiki_page)
            output_file.write('*Source: Wikipedia - %s\n\n' % wiki_page.url)

            # write content of wiki page
            output_file.write(wiki_content)
        print('%s entry successfully saved!' % target_wiki_page)
    except OSError as e:
        print('Something went wrong!')
        print(e)

In [15]:
# save_wiki_as_txt(target_wiki_page=TARGET_WIKI_PAGE, wiki_content=cleaned_wiki_content, write_to_directory=OUTPUT_DIRECTORY)