### Scraping Text from Wikipedia
The *[wikipedia](https://pypi.org/project/wikipedia/)* Python library can be used to search Wikipedia and retrieve data from it.

In addition to going step-by-step through scraping text from Wikipedia, we will cover these core Python topics:
<br>
Regular Expressions
<br>
String Operations
<br>
List Operations
<br>
Exceptions
<br>
For loops
<br>
Try/Except blocks
<br>
Output data to file (.txt)
<br>
Functions


In [None]:
pip install wikipedia

Let's import and use the wikipedia package to look at the main text of the DXARTS Wikipedia page.
Pay attention to the `wikipedia.page()` function and the `.content` attribute of the page object it returns.
The full list of functions and page attributes can be found [here](https://wikipedia.readthedocs.io/en/latest/code.html#module-wikipedia); they can return other types of results, such as image URLs, internal Wikipedia links, external links, and the summary.


In [None]:
# Import package
import wikipedia
# Specify the title of the Wikipedia page
wiki = wikipedia.page('dxarts')
# Extract the plain text content of the page
text = wiki.content
# Return the text
text

Notice that the formatting characters for headings (`==`) and new lines (`\n`) are included. `print()` renders each `\n` as an actual line break, although the `==` heading markers remain visible:

In [None]:
print(text)
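
The difference between the two views can be checked without a network connection. This is a minimal sketch on a hand-written string; `sample` and `raw_view` are hypothetical names, and the text only imitates the layout of `wiki.content`.

```python
# Hypothetical sample imitating the layout of wiki.content
sample = "Intro paragraph.\n\n\n== History ==\nBody text."

# repr() shows the string the way a notebook cell displays it: \n appears as two characters
raw_view = repr(sample)
print(raw_view)

# print() renders each \n as a line break; the == heading markers are still there
print(sample)
```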

However, we probably don't want this formatting in our text dataset (when using it for training a language model). If we only want to keep the body of each paragraph and nothing else, we will have to do some *scrubbing* with regular expressions:

Drop headers surrounded by `==`:

`re.sub(r'==.*==+', '', text)`

`.` = any character except a newline

`*` = zero or more repetitions of the preceding element

`+` = one or more repetitions of the preceding element

Replace `\n` (a new line) with `' '` (a single space):

`.replace('\n', ' ')`

The output text is a string (you can check this using `type(text)`), which allows us to use string methods such as `.replace()`.

In [None]:
# Import regular expression package
import re
# Clean text as described above
text = re.sub(r'==.*==+', '', text)
text = text.replace('\n', ' ')
text
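
To see exactly what the two cleaning steps do, it can help to run them on a short string first. The sketch below uses a hand-written `sample` (hypothetical, not real Wikipedia output). The final step that collapses repeated spaces is an optional extra, not part of the cell above.

```python
import re

# Hypothetical sample imitating the layout of wiki.content
sample = ("Intro paragraph.\n\n\n== History ==\nFirst section."
          "\n\n\n=== Early years ===\nSubsection text.")

# Remove heading lines; '.' does not match a newline, so each match stays on one line
no_headings = re.sub(r'==.*==+', '', sample)

# Replace every newline with a single space
flat = no_headings.replace('\n', ' ')

# Optional extra step: collapse the runs of spaces left behind by the blank lines
flat = re.sub(r' {2,}', ' ', flat).strip()

print(flat)
```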

### Searching wikipedia
The `wikipedia.search()` function returns the top 10 relevant page titles as a list.

In [None]:
wiki_search = wikipedia.search("art")
print(wiki_search)

10 is the default, but it can be changed with the `results` argument, for example `results=100`.

In [None]:
wiki_search = wikipedia.search("art", results=100)
print(wiki_search)

Most of these results seem relevant to "Art", but you may want to remove one or more list elements, for example the singer "Art Garfunkel" or the movie "O Brother, Where Art Thou?". We can remove individual elements while retaining the rest of the list.
Note that `remove()` takes exactly one argument, so multiple calls may be needed.

`remove()` raises a `ValueError` if the named element is NOT in the list.

In [None]:
wiki_search.remove('Art Garfunkel')
wiki_search.remove('O Brother, Where Art Thou?')
# len() returns the length of the list (the number of elements)
length = len(wiki_search)
print(length, wiki_search)
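
If the exact search results are not known in advance, one way to avoid the `ValueError` from `remove()` is to build a new list that leaves out the unwanted titles. This is a sketch on a hypothetical list, not actual search output. The names `results`, `unwanted` and `filtered` are made up for illustration.

```python
# Hypothetical search results standing in for the output of wikipedia.search()
results = ['Art', 'Art Garfunkel', 'Art Deco', 'O Brother, Where Art Thou?']

# Titles to drop; the last one is deliberately absent from the results
unwanted = ['Art Garfunkel', 'O Brother, Where Art Thou?', 'Not In The List']

# Keep only titles that are not in the unwanted list; absent titles are simply ignored
filtered = [title for title in results if title not in unwanted]

print(len(filtered), filtered)
```

Unlike `remove()`, this leaves the original list unchanged, so the filter can be adjusted and re-run without repeating the search.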

### Warning:
Some of these results may lead to a disambiguation page rather than a single article, which raises an exception in Python.

Essentially this occurs when there are multiple wiki pages under the larger "Art" umbrella. This `DisambiguationError` is raised when the title may refer to any one of several pages.

In [None]:
wiki = wikipedia.page(title='Art')
text = wiki.content
text

*Sometimes* this can be avoided by adding the following argument to the `wikipedia.page()` function:

`auto_suggest=False`

This tells the library to use the title exactly as given, rather than substituting a suggested title.

In [None]:
wiki = wikipedia.page(title='Art', auto_suggest=False)
text = wiki.content
text

However, this doesn't always work. To really avoid exceptions/errors, we need to put a try/except block inside a loop to ignore any failed lookups. So first, let's create a list of 5 results:

In [None]:
wiki_search = wikipedia.search("art", results=5)
print(wiki_search)

Next, we use `enumerate()`, `try` and `except` to create a loop that ignores any result that causes an exception. Pay attention to indentation, which indicates which loop or block each line belongs to.
<br>

`collect = ''` creates an empty string to save all of our text in.

In the next line, `for i, val in enumerate(wiki_search):`, `i` is the index of the current iteration and `val` is the element of our list `wiki_search` at that index. So on the first iteration, `i = 0` and `val = "Art"`; on the final iteration, `i = 4` and `val = "Elements of art"`.
<br>

Normally this *for* loop could do everything we want, but we also need a way to catch and ignore exceptions, so we put a try/except block inside the for loop. The `try` block contains everything we want to accomplish: grab the wiki page, extract the text content, add it to our long string of text, and print the results. The `except` block catches one specific exception from wikipedia; if it occurs, we pass over it and move on to the next iteration.

In [None]:
# make an empty string to collect all of the text
collect = ''
# enumerate() yields an (index, value) pair for each element of the list wiki_search
for i, val in enumerate(wiki_search):
    try:
        wiki = wikipedia.page(title=val, auto_suggest=False)
        # Extract the plain text content of the page
        text = wiki.content
        collect = collect + text
        # At the end of each iteration, print the index, the result name, and the text
        print(i, val, text)
    # the except clause catches disambiguation errors and skips that result
    except wikipedia.exceptions.DisambiguationError:
        pass
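
The same loop pattern can be tried without a network connection by swapping in a stand-in for `wikipedia.page()`. Everything named `Fake...` or `fake_...` below is hypothetical and exists only for this illustration. The sketch also records which indices were skipped instead of passing over them silently.

```python
# Hypothetical stand-in for wikipedia.exceptions.DisambiguationError
class FakeDisambiguationError(Exception):
    pass

# Hypothetical stand-in for wikipedia.page(...).content
def fake_page_content(title):
    if title == 'Art (disambiguation)':
        raise FakeDisambiguationError(title)
    return 'Text of ' + title + '. '

titles = ['Art', 'Art (disambiguation)', 'Elements of art']
collect = ''
skipped = []
for i, val in enumerate(titles):
    try:
        collect = collect + fake_page_content(val)
    except FakeDisambiguationError:
        # remember the index and title of each skipped result
        skipped.append((i, val))

print(collect)
print(skipped)
```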

If we look closely at the resulting text, we'll see that it was able to scrape indices 0, 3, and 4, while passing over 1 and 2.

Finally, we could either copy and paste this output, or save it to a .txt file directly from Python

In [None]:
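# "with" closes the file automatically; utf-8 handles the non-ASCII characters common in Wikipedia text
with open("sample.txt", "w", encoding="utf-8") as text_file:
    n = text_file.write(collect)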
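
A quick way to confirm that the file was written correctly is to read it back and compare. This is a sketch with a hypothetical string and a hypothetical file name (`sample_demo.txt`), written to the system's temporary directory so that it does not overwrite `sample.txt`.

```python
import os
import tempfile

# Hypothetical text standing in for the scraped string
collect = "Some scraped text with a non-ASCII character: café"

# Hypothetical file name, placed in the system's temporary directory
path = os.path.join(tempfile.gettempdir(), "sample_demo.txt")

# write() returns the number of characters written
with open(path, "w", encoding="utf-8") as text_file:
    n = text_file.write(collect)

# Read the file back with the same encoding and compare
with open(path, encoding="utf-8") as text_file:
    read_back = text_file.read()

print(n, read_back == collect)
```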

## Wrapping everything into a function

If you want to run a large number of searches, or easily switch between different search terms, it can be useful to wrap all of these steps into a function. We have already seen almost everything that will go into this function, but now it is defined as its own, complete process.
<br>
To create a function, use `def`, choose a unique name like `wiki_scrape`, then define the arguments you may want to change on subsequent runs of the function. Here we want to be able to change the wikipedia `search` term, the maximum number of results `num_results`, and the `filename` for the resulting text file. These arguments will be replaced with real names/numbers when we actually run the function.

In [None]:
def wiki_scrape(search, num_results, filename):
    import re
    import wikipedia
    # search for up to num_results pages
    w_search = wikipedia.search(search, results=num_results)
    print(w_search)
    # scrape all pages and collect all text in a single string
    collect = ''
    for i, val in enumerate(w_search):
        # this try/except block skips pages that raise a disambiguation error
        try:
            wiki = wikipedia.page(title=val, auto_suggest=False)
            text = wiki.content
            collect = collect + text
        except wikipedia.exceptions.DisambiguationError:
            # report which titles were skipped
            print("skipped redirect: " + val)
    # scrub formatting: drop heading lines, then collapse runs of blank lines
    # (re.sub is needed here because str.replace does not understand regex patterns)
    scrub = re.sub(r'==.*==+', '', collect)
    scrub = re.sub(r'\n\n+', '\n', scrub)
    # print character and word counts
    print(str(len(scrub)) + " characters (w/spaces)")
    print(str(len(scrub.split())) + " appx words")
    # print the raw (unscrubbed) text
    print(collect)
    # write the scrubbed text to file
    with open(filename, "w", encoding="utf-8") as text_file:
        text_file.write(scrub)

Once the function is defined, all we need to do is call the function with the three arguments `(search, num_results, filename)`:

In [None]:
wiki_scrape("art",25,"art_25.txt")
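
Because `wiki_scrape` needs a network connection, its scrubbing step is easier to check when it is pulled out on its own. The helper `scrub_text` below is hypothetical (it is not part of the wikipedia package or of the function above) and runs on a hand-written string. Note that `str.replace()` matches literal text only, so a pattern such as `\n\n+` has to go through `re.sub()`.

```python
import re

def scrub_text(raw):
    """Hypothetical helper: drop heading lines, then collapse runs of blank lines."""
    no_headings = re.sub(r'==.*==+', '', raw)
    # \n\n+ is a regex pattern, so it needs re.sub() rather than str.replace()
    return re.sub(r'\n\n+', '\n', no_headings)

# Hypothetical sample imitating the layout of wiki.content
raw = "Intro.\n\n\n== History ==\nBody text.\n"
scrubbed = scrub_text(raw)

print(repr(scrubbed))
print(str(len(scrubbed)) + " characters (w/spaces)")
print(str(len(scrubbed.split())) + " appx words")
```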

## Function with manual filtering
Here, the steps for searching and filtering are broken out of the function, so you can manually remove unwanted entries.

In [None]:
art_search = wikipedia.search("art", results=50)
print(art_search)

In [None]:
art_search.remove('O Brother, Where Art Thou?')
art_search.remove('Art Garfunkel')
art_search.remove('Nicholas Art')
art_search.remove('Art the Clown')
art_search.remove('Art Malik')
length = len(art_search)
print(length, art_search)

In [None]:
def filtered_wiki_scrape(listname, filename):
    import re
    import wikipedia
    # the search step is removed; the caller passes in an already filtered list
    # w_search = wikipedia.search(search, results=num_results)
    print(listname)
    # scrape all pages and collect all text in a single string
    collect = ''
    for i, val in enumerate(listname):
        # this try/except block skips pages that raise a disambiguation error
        try:
            wiki = wikipedia.page(title=val, auto_suggest=False)
            text = wiki.content
            collect = collect + text
        except wikipedia.exceptions.DisambiguationError:
            # report which titles were skipped
            print("skipped redirect: " + val)
    # scrub formatting: drop heading lines
    scrub = re.sub(r'==.*==+', '', collect)
    # scrub = scrub.replace('\n', ' ')
    # print character and word counts
    print(str(len(scrub)) + " characters (w/spaces)")
    print(str(len(scrub.split())) + " appx words")
    # print the raw (unscrubbed) text
    print(collect)
    # write the scrubbed text to file
    with open(filename, "w", encoding="utf-8") as text_file:
        text_file.write(scrub)

In [None]:
filtered_wiki_scrape(art_search, 'filtered.txt')