# Data Retrieval Project
This project retrieves data from the web for use in machine learning applications throughout the rest of the course.

**Requirements**
At least 500 records of each of the following data types:
* Numeric
* Categorical
* Text
* Images

Also, you need at least one _label_ to predict.

# Chosen data: 1,000 random Wikipedia articles
For my dataset, I will use Wikipedia's API to retrieve 1,000 random articles, along with the following data about each article page:


---


Page ID, Title, URL, Page Views in last 60 days, Description (local), Description (Wikidata), Alias, Label, Page Size in Bytes, Count of Available Languages, Categories, Category URLs, Page Image Name, Page Image URL, Images, Image URLs, Location (Latitude), Location (Longitude), Location (Distance to BYU in meters), Location (Name), Location (Type), Location (Country), Location (Region), Location (Globe)


---



In addition, I will use BeautifulSoup to retrieve the text (i.e., complete content) from each Wikipedia article returned by the API call. Those two datasets can be combined for a complete analysis across the 1,000 articles.

# Steps for data retrieval

In [None]:
# import needed packages (libraries)
import requests                 # for making web requests (API calls)
from bs4 import BeautifulSoup   # for parsing HTML
import pandas as pd             # for storing and analyzing data in a tabular format
import re                       # for cleaning text data in Wikipedia articles
import sqlite3                  # for storing a copy of the data


## Build API parameters for Wikipedia, `action=query`

I created [this query](https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&curtimestamp=1&prop=categories%7Ccoordinates%7Cinfo%7Cpageimages%7Cpageviews%7Cdescription%7Clanglinkscount%7Cpageterms%7Cimages&indexpageids=1&generator=random&redirects=1&formatversion=2&cllimit=10&coprop=country%7Cregion%7Cglobe%7Cname%7Ctype&codistancefrompage=Brigham%20Young%20University&inprop=url&piprop=name%7Coriginal&pilimit=10&pilicense=any&pvipmetric=pageviews&descprefersource=local&imlimit=100&grnnamespace=0&grnfilterredir=nonredirects&grnlimit=5) using Wikipedia's [API sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox#)

```python
# URL string:
URL = 'https://en.wikipedia.org/w/api.php?action=query&format=json&curtimestamp=1&prop=categories%7Ccoordinates%7Cinfo%7Cpageimages%7Cpageviews%7Cdescription%7Clanglinkscount%7Cpageterms%7Cimages&indexpageids=1&generator=random&redirects=1&formatversion=2&cllimit=10&coprop=country%7Cregion%7Cglobe%7Cname%7Ctype&codistancefrompage=Brigham%20Young%20University&inprop=url&piprop=name%7Coriginal&pilimit=10&pilicense=any&pvipmetric=pageviews&descprefersource=local&imlimit=100&grnnamespace=0&grnfilterredir=nonredirects&grnlimit=5'

# OR...
# JSON dictionary of settings
# (see the code in cells below, this is a long list of query parameters)
```
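The URL string and the parameter dictionary describe the same request in two forms. As a minimal sketch, using a shortened illustrative URL rather than the full query above, the standard library can convert between the two:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Shortened example URL for illustration; the real query has many more parameters
example_url = ('https://en.wikipedia.org/w/api.php?action=query&format=json'
               '&prop=categories%7Cinfo&generator=random&grnnamespace=0&grnlimit=5')

# Split off the query string and decode it into a dictionary
# (%7C is the URL-encoded pipe character that separates multiple values)
query_string = urlsplit(example_url).query
params_from_url = dict(parse_qsl(query_string))

# Going the other way: encode the dictionary back into a query string
rebuilt_query = urlencode(params_from_url)

print(params_from_url['prop'])   # categories|info
```

Passing a dictionary through the `params` argument of `requests.get` performs this encoding automatically, which is why the cells below use the dictionary form.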

In [None]:
# See example at: https://www.mediawiki.org/wiki/API:Categories#Python

base_url = 'https://en.wikipedia.org/w/api.php'

# See: https://meta.wikimedia.org/wiki/User-Agent_policy
your_name = 'YOUR_NAME_HERE'
contact_info = 'EMAIL_ADDRESS'
required_headers = {
    'User-Agent': f'{your_name} ({contact_info}) using Python requests library'
}
parameters = {
    "action": "query",  # the type of API call (in this case, a data request)
    "format": "json",   # return format
    "curtimestamp": 1,  # return the current time stamp (UTC)
    "prop": "categories|coordinates|info|pageimages|pageviews|description|langlinkscount|pageterms|images", # additional page properties to include
    "indexpageids": 1,              # include a list of the returned page IDs; useful for making other API calls
    "continue": "grncontinue||",    # parameter that fetches the next batch of results
    "generator": "random",          # return a randomly-chosen article and a sequence of articles after it
    "redirects": 1,                 # automatically resolve redirects
    "formatversion": "2",           # return data in modern (JSON) format. Other version is "1" (XML-compatible)
    "clshow": "!hidden",            # include only non-hidden categories (other options are to omit this item or "hidden")
    "cllimit": "10",                # the limit of categories returned for a single page (ranges from 1-5000)
    "coprop": "country|region|globe|name|type",         # properties of article coordinates to return
    "codistancefrompage": "Brigham Young University",   # calculate the distance (in meters) from the returned page to BYU's coordinates
    "inprop": "url",                # additional page information: URL of the page
    "piprop": "name|original",      # additional page image information: the name and URL of the original image (not thumbnail)
    "pilimit": "10",                # limit of page images to return (although results still return only 1 image). Ranges from 1-50.
    "pilicense": "any",             # include images of any license type. Other option is "free"
    "pvipmetric": "pageviews",      # page view metric is set to pageviews (the only option)
    "descprefersource": "local",    # try to find a local (formula-derived) description of the page. If not, return the "global" description from Wikidata
    "imlimit": "100",               # limit of images to return (actually, returns a list of the links to images on the page). Ranges from 1-100.
    "grnnamespace": "0",            # 0 means "articles". Other namespaces include media, files (images), talks/discussions, or others.
    "grnfilterredir": "nonredirects",   # other options: "all", "redirects". "All" will include a list of the pages and associated redirects to those pages.
    "grnlimit": "5"                     # the number of random pages to return. Can range from 1-500.
    # "grncontinue": "KEY_FROM_LAST_QUERY"
}

# api_session = requests.session()

# response = api_session.get(url = base_url, params = parameters, headers = required_headers)
# json_data = response.json()

# for key, value in json_data.items():
#     print(key, value)

## Construct lists to hold data from query
Each list item represents one row of data

In [None]:
# These are the columns obtained from the API call
# Additional data scraping will retrieve other helpful data, 
# like page text (content) and last updated date
data_columns = [
                'Page ID', 
                'Title', 
                'URL', 
                'Page Views, last 60 days', 
                'Description, local', 
                'Description, Wikidata', 
                'Alias', 
                'Label', 
                'Size in Bytes', 
                'Available languages count', 
                'Categories', 
                'Category URLs',  
                'Page Image Name', 
                'Page Image URL', 
                'Images', 
                'Image URLs', 
                'Location, Latitude', 
                'Location, Longitude', 
                'Location, Distance to BYU', 
                'Location, Name', 
                'Location, Type', 
                'Location, Country', 
                'Location, Region', 
                'Location, Globe' 
                ]

## Create a function to extract data from API calls

In [None]:
def Query_Wikipedia_API(data_list, num_pages = 5, is_continued = False, continue_key = ''):
    '''
    Purpose: query Wikipedia's API and return page data to a list that can be passed to a DataFrame
    Columns in return list: 'Page ID', 'Title', 'URL', 'Page Views, last 60 days', 'Description, local', 
                'Description, Wikidata', 'Alias', 'Label', 'Size in Bytes', 'Available languages count', 
                'Categories', 'Category URLs', 'Page Image Name', 'Page Image URL', 'Images', 
                'Image URLs', 'Location, Latitude', 'Location, Longitude', 'Location, Distance to BYU', 
                'Location, Name', 'Location, Type', 'Location, Country', 'Location, Region', 'Location, Globe' 

    Parameters
    ----------
    data_list: required. List object to store the values returned by the API call. Will be modified by this function.
    num_pages: Optional, default = 5 (range: 1-500). The number of random pages to request from Wikipedia.
        Note that Wikipedia will not reliably return Category or Page View information if num_pages is > 5.
    is_continued: Optional, default = False. If this is the second or later query, set to True to get a new resultset
    continue_key: Optional, default is ''. If is_continued is True, set continue_key to the value 
        returned by the previous request
    '''
    import requests
    import urllib.parse
    from bs4 import BeautifulSoup

    json_params = {
    "action": "query",  # the type of API call (in this case, a data request)
    "format": "json",   # return format
    "curtimestamp": 1,  # return the current time stamp (UTC)
    "prop": "categories|coordinates|info|pageimages|pageviews|description|langlinkscount|pageterms|images", # additional page properties to include
    "indexpageids": 1,              # include a list of the returned page IDs; useful for making other API calls
    "continue": "grncontinue||",    # parameter that fetches the next batch of results
    "generator": "random",          # return a randomly-chosen article and a sequence of articles after it
    "redirects": 1,                 # automatically resolve redirects
    "formatversion": "2",           # return data in modern (JSON) format. Other version is "1" (XML-compatible)
    "clshow": "!hidden",            # include only non-hidden categories (other options are to omit this item or "hidden")
    "cllimit": "10",                # the limit of categories returned for a single page (ranges from 1-5000)
    "coprop": "country|region|globe|name|type",         # properties of article coordinates to return
    "codistancefrompage": "Brigham Young University",   # calculate the distance (in meters) from the returned page to BYU's coordinates
    "inprop": "url",                # additional page information: URL of the page
    "piprop": "name|original",      # additional page image information: the name and URL of the original image (not thumbnail)
    "pilimit": "10",                # limit of page images to return (although results still return only 1 image). Ranges from 1-50.
    "pilicense": "any",             # include images of any license type. Other option is "free"
    "pvipmetric": "pageviews",      # page view metric is set to pageviews (the only option)
    "descprefersource": "local",    # try to find a local (formula-derived) description of the page. If not, return the "global" description from Wikidata
    "imlimit": "100",               # limit of images to return (actually, returns a list of the links to images on the page). Ranges from 1-100.
    "grnnamespace": "0",            # 0 means "articles". Other namespaces include media, files (images), talks/discussions, or others.
    "grnfilterredir": "nonredirects",   # other options: "all", "redirects". "All" will include a list of the pages and associated redirects to those pages.
    "grnlimit": "5"                 # the number of random pages to return. Can range from 1-500. Pageviews won't be shown if more than 5 pages are requested.
    # "grncontinue": "KEY_FROM_LAST_QUERY"
    }

    # override grnlimit with the number of pages requested by the caller
    json_params['grnlimit'] = num_pages

    # add the grncontinue key and its value if this request is a subsequent one after the first
    if is_continued:
        json_params['grncontinue'] = continue_key
    else:
        if 'grncontinue' in json_params.keys():
            del json_params['grncontinue']
    
    # query Wikipedia's API and return a random list of num_pages articles
    base_url = 'https://en.wikipedia.org/w/api.php'

    # See: https://meta.wikimedia.org/wiki/User-Agent_policy
    required_headers = {
        'User-Agent': 'Student: Ryan Parker (rparker8@byu.edu) using Python requests library'
        }
    api_session = requests.session()
    response = api_session.get(url = base_url, params = json_params, headers = required_headers)
    json_data = response.json()

    # record the continue key (empty string if the response has no 'continue' block)
    response_continue_key = json_data.get('continue', {}).get('grncontinue', '')

    # set up the counting variables
    numeric_count = 0
    categorical_count = 0
    text_count = 0
    image_count = 0
    
    # Change None values to 0 so the Page_views calculate properly
    # This function uses recursion to trace the dictionary tree and replace None with 0
    # See: https://stackoverflow.com/a/35986190/17005348
    def replace_none_in_dict(any_dict):
        # what to replace None with
        replace_value = 0
        
        # recursive loop
        for k, v in any_dict.items():
            if v is None:
                any_dict[k] = replace_value
            elif type(v) == dict:
                replace_none_in_dict(v)
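            elif type(v) == list:
                # assign by index; rebinding the loop variable would leave the list unchanged
                for index, item in enumerate(v):
                    if item is None:
                        v[index] = replace_value
                    elif type(item) == dict:
                        replace_none_in_dict(item)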
    
    replace_none_in_dict(json_data)

    # parse json_data to get information for each variable
    for item in json_data['query']['pages']:
        Page_ID = item.get('pageid', '')
        
        Title = item.get('title', '')
        
        URL = item.get('fullurl', '')
        
        try:
            Page_views = sum([value for key, value in item['pageviews'].items()])
            numeric_count += 1
        except:
            Page_views = 0

        Desc_loc = item.get('description', '')
        if Desc_loc != '':
            text_count += 1

        try:
            Desc_global = item['terms']['description'][0]
        except:
            Desc_global = ''
        
        try:
            # Alias = ', '.join([a for a in item['terms']['alias']])
            Alias = [a for a in item['terms']['alias']]
        except:
            Alias = ''
        
        try:
            # Label = ', '.join([l for l in item['terms']['label']])
            Label = [l for l in item['terms']['label']]
        except:
            Label = ''

        Byte_length = item.get('length', '')

        Num_languages = item.get('langlinkscount', '')

        try:
            Categories = [cat['title'][9:] for cat in item['categories']]   # "Category:" is 9 characters, so we'll trim that off of the results
            categorical_count += 1
        except:
            Categories = ''

        if len(Categories) > 0:
            try:
                Category_URLs = ['https://en.wikipedia.org/wiki/Category:' + cat.replace(' ', '_') for cat in Categories]
            except:
                Category_URLs = ''
        else:
            Category_URLs = ''
        

        Page_Img_Name = item.get('pageimage', '')
        
        try:
            Page_Img_URL = item['original']['source']
        except:
            Page_Img_URL = ''

        try:
            Images = ['https://en.wikipedia.org/wiki/' + file['title'].replace(' ', '_') for file in item['images']]
            image_count += len(Images)
        except:
            Images = ''
        
        if len(Images) > 0: 
            try:
                img_url_list = []
                for file_page_url in Images:
                    # each file page links to the full-size image inside <div class="fullImageLink">
                    file_page = requests.get(file_page_url, headers = required_headers)
                    soup = BeautifulSoup(file_page.text, 'html.parser')
                    direct_url = 'https:' + soup.find("div", {'class':'fullImageLink'}).find('a')['href']
                    img_url_list.append(direct_url)
                Image_URLs = img_url_list
            except:
                Image_URLs = ''
        else:
            Image_URLs = ''
        
        try: 
            Loc_Lat = item['coordinates'][0]['lat']
        except: 
            Loc_Lat = ''

        try:
            Loc_Lon = item['coordinates'][0]['lon']
        except:
            Loc_Lon = ''
        
        try: 
            Loc_Dist = item['coordinates'][0]['dist']
        except:
            Loc_Dist = ''
        
        try:
            Loc_Name = item['coordinates'][0]['name'] 
        except:
            Loc_Name = ''

        try:
            Loc_Type = item['coordinates'][0]['type']
        except:
            Loc_Type = ''
        
        try: 
            Loc_Country = item['coordinates'][0]['country']
        except:
            Loc_Country = ''
        
        try:
            Loc_Region = item['coordinates'][0]['region']
        except: 
            Loc_Region = ''
        
        try:
            Loc_Globe = item['coordinates'][0]['globe']
        except:
            Loc_Globe = ''
        
        # combine all values into one row
        one_row = [
                   Page_ID, 
                   Title,
                   URL, 
                   Page_views, 
                   Desc_loc, 
                   Desc_global, 
                   Alias, 
                   Label, 
                   Byte_length,
                   Num_languages, 
                   Categories, 
                   Category_URLs, 
                   Page_Img_Name, 
                   Page_Img_URL, 
                   Images, 
                   Image_URLs, 
                   Loc_Lat, 
                   Loc_Lon, 
                   Loc_Dist, 
                   Loc_Name, 
                   Loc_Type, 
                   Loc_Country, 
                   Loc_Region, 
                   Loc_Globe    
        ]

        # add row to data_list
        data_list.append(one_row)
    
    return_dict = {
        # 'data_list': data_list,               # Not necessary because the data_list is updated within the function
        'continue_key': response_continue_key, 
        'page_ids': json_data['query']['pageids'], 
        'numeric_count': numeric_count, 
        'categorical_count': categorical_count, 
        'text_count': text_count, 
        'images_count': image_count
    }

    return return_dict
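The function above returns a continue key so that a later request can ask for the next batch. As a hedged sketch of the general continuation pattern (merging every key of the response's `continue` object into the next request), the loop below uses a hypothetical `fetch` callable and a stand-in `fake_fetch` in place of the real API so that it runs offline. Neither name is part of the notebook.

```python
def query_all_batches(fetch, base_params, max_batches=3):
    """Collect pages across several requests by following the 'continue' block.

    fetch: any callable that takes a params dict and returns the decoded JSON
           (hypothetical hook; in practice a thin wrapper around requests.get).
    """
    pages = []
    params = dict(base_params)          # copy so the caller's dict is not modified
    for _ in range(max_batches):
        data = fetch(params)
        pages.extend(data.get('query', {}).get('pages', []))
        if 'continue' not in data:      # no more batches available
            break
        # merge every key from 'continue' into the next request's parameters
        params = {**base_params, **data['continue']}
    return pages


# Stand-in for the real API: returns two batches, then stops
def fake_fetch(params):
    batch = int(params.get('grncontinue', 0))
    response = {'query': {'pages': [{'pageid': batch * 10 + 1}, {'pageid': batch * 10 + 2}]}}
    if batch < 1:
        response['continue'] = {'grncontinue': str(batch + 1), 'continue': 'grncontinue||'}
    return response


all_pages = query_all_batches(fake_fetch, {'action': 'query', 'generator': 'random'})
print([p['pageid'] for p in all_pages])   # [1, 2, 11, 12]
```

The `max_batches` cap is a design choice for the sketch: a random generator can keep producing batches indefinitely, so the loop needs its own stopping rule.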

## Loop through API calls until data requirements are met

In [None]:
data_rows = []

pageid_list = []        # list of all Wikipedia page IDs returned from the queries
cont_key = ''           # unique key to get next batch of random articles
num_count = 0
catg_count = 0
txt_count = 0
img_count = 0
data_requirements_met = False
is_first_request = True

while not data_requirements_met:

    # Code below commented out for testing purposes. 
    # I believe that the 'grncontinue' parameter causes the query to be replicated, 
    # which is not what we want; rather, we want new data each time
    # if is_first_request:
    #     result_dict = Query_Wikipedia_API(data_list=data_rows, num_pages=5)
    #     is_first_request = False
    # else:
    #     result_dict = Query_Wikipedia_API(data_list=data_rows, num_pages=5, is_continued=True, continue_key=cont_key)
    
    result_dict = Query_Wikipedia_API(data_list=data_rows, num_pages=5)
    cont_key = result_dict['continue_key']
    num_count += result_dict['numeric_count']
    catg_count += result_dict['categorical_count']
    txt_count += result_dict['text_count']
    img_count += result_dict['images_count']
    pageid_list.extend(result_dict['page_ids'])
    
    print('Total numeric records:\t', num_count)
    print('Total categ. records:\t', catg_count)
    print('Total text records:\t', txt_count)
    print('Total image count:\t', img_count)
    print('Number of pages retrieved:\t', len(pageid_list))
    print('\n\n')

    # Check whether there are at least 500 values of each data type
    if (num_count >= 500) and (catg_count >= 500) and (txt_count >= 500) and (img_count >= 500):
        data_requirements_met = True

# Create a DataFrame from the results
df = pd.DataFrame(data=data_rows, columns=data_columns)

# Display DataFrame
df.head(5)

In [None]:
# Show basic info about DataFrame
df.describe(include='all')

## Save the DataFrame to a SQLite database and to a .csv file

In [None]:
# NOTE: I commented out the SQL part of the code below because SQLite cannot
# store a column whose values are Python lists.
# 
# For simplicity, I used lists to store values for a page (article) when there 
# were multiple values for a single page; for example, with images or categories.
# In proper database implementation, I would create a separate table to hold
# those multiple values, and would link it to the main table using a mapping table.
# Since I will use pandas to manipulate this data, I will keep it condensed for now
# and retain the Python lists.

# # First, to a SQLite database (.db file)
# # establish connection
# conn = sqlite3.connect('Wikipedia_data.db')

# # run SQL -- this will create the table, too
# df.to_sql(name='Wikipedia_data', con=conn, if_exists='replace', index=False)

# # Note: there is no need for conn.commit(), the changes are automatically saved
# # Close the connection
# conn.close()
# # ------------------------------------
# # End of SQL section
# # ------------------------------------


# Next, to a .csv file
df.to_csv('Wikipedia_data.csv', index=False)
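The note in the cell above describes moving multi-valued columns into a separate table. As a rough sketch of that idea, the example below uses a small made-up DataFrame (`toy_df`, not the real `df`) and a simple child table keyed by page ID rather than a full mapping-table design. The table names are assumptions chosen for illustration.

```python
import sqlite3
import pandas as pd

# Toy stand-in for df: one list-valued column, like the 'Categories' column above
toy_df = pd.DataFrame({
    'Page ID': [101, 102],
    'Title': ['Alpha', 'Beta'],
    'Categories': [['Rivers', 'Geography'], ['Birds']],
})

# Main table: scalar columns only
pages_table = toy_df[['Page ID', 'Title']]

# Child table: one row per (page, category) pair
categories_table = toy_df[['Page ID', 'Categories']].explode('Categories')
categories_table = categories_table.rename(columns={'Categories': 'Category'})

conn = sqlite3.connect(':memory:')      # in-memory database for the demonstration
pages_table.to_sql('pages', conn, index=False)
categories_table.to_sql('page_categories', conn, index=False)

row_count = conn.execute('SELECT COUNT(*) FROM page_categories').fetchone()[0]
conn.close()
print(row_count)   # 3
```

In the real data, pages without categories hold an empty string instead of a list, so those rows would need to be filtered out of the child table first.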

# Additional info from web scraping
Along with the information obtained from API calls, this additional information will enhance the analysis.

This information can be obtained from the article page and from the Page Information page. For example, see the [Page Information for the Wikipedia article on Brigham Young University](https://en.wikipedia.org/w/index.php?title=Brigham_Young_University&action=info).

**Information to retrieve**
* Page content (i.e., the text of a page)
* Date of last edit
* Date of page creation
* Number of redirects to page
* Total number of edits
* Number of edits, last 30 days
* Recent number of distinct authors

## Create a function to retrieve and clean text from Wikipedia articles
> This function receives a Wikipedia URL as input and returns the full text of that article with reference markers removed. For example, _"Sir Isaac Newton is credited for inventing calculus**[12]**"_ would become _"Sir Isaac Newton is credited for inventing calculus"_, without the **[12]** at the end.

In [None]:
def wikipedia_get_text(wikipedia_url):
    import re
    import requests
    from bs4 import BeautifulSoup

    webpage = requests.get(wikipedia_url)
    parsed_page = BeautifulSoup(webpage.text, 'html.parser')   # name the parser explicitly
    # paragraphs = parsed_page.find_all(string=True)     # alternative: collect every text node on the page, not just <p> tags
    paragraphs = parsed_page.find_all('p')

    article_text = ''

    for paragraph in paragraphs:
        article_text += '\n\n' + paragraph.text

    # remove blank spaces before and after the article text
    article_text = article_text.strip()

    # Use the re (RegEx) library to replace any reference markers with an empty string
    # See: https://www.kite.com/python/answers/how-to-use-re.sub()-in-python
    # Also: https://docs.python.org/3/library/re.html#regular-expression-syntax

    # The 'r' in front of the pattern tells Python to treat this as a raw string
    # so that backslashes (as in \[ or \n) are passed to the regex engine unchanged
    # instead of being interpreted by Python as escape sequences.
    article_text = re.sub(
        pattern = r'\[[0-9]*\]', # or, to remove lettered markers too: pattern = r'\[[a-z0-9]*\]'
        repl = '',
        string = article_text)

    return article_text
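To show the reference-removal step in isolation, here is a small sketch that applies a similar pattern to a sample sentence. The helper name `strip_reference_markers` is hypothetical, and the pattern uses `+` instead of `*` so that an empty pair of brackets is left alone.

```python
import re

def strip_reference_markers(text):
    # remove bracketed numeric markers such as [12]; '+' requires at least one digit
    return re.sub(r'\[[0-9]+\]', '', text)

sample = 'Sir Isaac Newton is credited for inventing calculus[12]. He also studied optics[3][4].'
print(strip_reference_markers(sample))
# Sir Isaac Newton is credited for inventing calculus. He also studied optics.
```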

## Create a function to retrieve additional info from the 'Page information' page for any Wikipedia article

In [None]:
def wikipedia_get_page_info(wikipedia_info_page_url):
    import requests
    from bs4 import BeautifulSoup
    # format for a Wikipedia info page URL:
    # https://en.wikipedia.org/w/index.php?title=[PAGE_TITLE]&action=info

    webpage = requests.get(wikipedia_info_page_url)
    soup = BeautifulSoup(webpage.text, 'html.parser')

    try:
        Last_edited_on = soup.find('tr', {'id':'mw-pageinfo-lasttime'}).findAll('td')[1].text
        comma_position = Last_edited_on.find(',')
        if comma_position > -1:
            Last_edited_on = Last_edited_on[comma_position + 2:]    # skip the comma and the space after it
    except:
        Last_edited_on = ''
    
    try:
        Page_created_on = soup.find('tr', {'id':'mw-pageinfo-firsttime'}).findAll('td')[1].text
        comma_position = Page_created_on.find(',')
        if comma_position > -1:
            Page_created_on = Page_created_on[comma_position + 2:]
    except:
        Page_created_on = ''

    # No unique ID for number of redirects, so I use BeautifulSoup's ability to get the next elements
    # See: https://www.kite.com/python/examples/1742/beautifulsoup-find-the-next-element-after-a-tag
    # See also: https://www.kite.com/python/examples/1730/beautifulsoup-find-the-next-sibling-of-a-tag
    try:
        Num_redirects = soup.find('td', text='Number of redirects to this page').next_sibling.text
    except:
        try:
            Num_redirects = soup.find('tr', {'id':'mw-pageinfo-visiting-watchers'}).next_sibling.next_sibling.findAll('td')[1].text
        except:
            Num_redirects = 0
    
    try:
        Total_edits = soup.find('tr', {'id':'mw-pageinfo-edits'}).findAll('td')[1].text
    except:
        Total_edits = 0
    
    try:
        Edits_30_days = soup.find('tr', {'id':'mw-pageinfo-recent-edits'}).findAll('td')[1].text
    except:
        Edits_30_days = 0
    
    try:
        Distinct_authors_30_days = soup.find('tr', {'id':'mw-pageinfo-recent-authors'}).findAll('td')[1].text
    except:
        Distinct_authors_30_days = 0
    

    return_dict = {
        "Redirects": Num_redirects, 
        "Created on": Page_created_on,
        "Last edited": Last_edited_on,  
        "Total edits": Total_edits, 
        "Edits, 30 days": Edits_30_days, 
        "Recent authors": Distinct_authors_30_days
    }

    return return_dict
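The values returned by this function are scraped strings, so they usually need converting before analysis. The sketch below assumes counts may contain thousands separators (such as `1,234`) and that dates look like `5 March 2004` once the time has been trimmed. Both formats are assumptions, and `to_int` and `sample_info` are illustrative names.

```python
import pandas as pd

def to_int(value):
    """Convert a scraped count such as '1,234' to an int; fall back to 0."""
    try:
        return int(str(value).replace(',', '').strip())
    except ValueError:
        return 0

# Hypothetical return value, shaped like the dictionary built above
sample_info = {
    'Redirects': '12',
    'Created on': '5 March 2004',
    'Last edited': '17 October 2021',
    'Total edits': '1,234',
    'Edits, 30 days': 0,
    'Recent authors': '',
}

total_edits = to_int(sample_info['Total edits'])
recent_authors = to_int(sample_info['Recent authors'])
# errors='coerce' turns unparseable or empty dates into NaT instead of raising
created_on = pd.to_datetime(sample_info['Created on'], format='%d %B %Y', errors='coerce')

print(total_edits, recent_authors, created_on.date())   # 1234 0 2004-03-05
```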

## Create lists of page titles and page URLs for obtaining additional information through web scraping

In [None]:
# Create a list of all page titles returned from the queries
# This also replaces spaces with underscores to prepare for use in Page Information URLs
page_titles = [i.replace(' ', '_') for i in list(df['Title'])]

# Create a list of the URLs of all pages returned from the queries
page_urls = list(df['URL'])
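Replacing spaces with underscores covers most titles, but a title containing a character such as `&` or `?` would change the meaning of the Page Information URL. One possible safeguard, sketched here with a hypothetical helper `build_info_url`, is to percent-encode the title with the standard library:

```python
from urllib.parse import quote

def build_info_url(title):
    # replace spaces with underscores, then percent-encode characters
    # such as '&' or '?' that would otherwise be read as URL syntax
    safe_title = quote(title.replace(' ', '_'), safe='')
    return f'https://en.wikipedia.org/w/index.php?title={safe_title}&action=info'

print(build_info_url('AT&T Stadium'))
# https://en.wikipedia.org/w/index.php?title=AT%26T_Stadium&action=info
```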


## Loop through page titles to get Page Information on each article

In [None]:
# Set up lists for storing returned values
Page_info_urls = []
Page_text = []
Redirects = []
Date_created = []
Date_last_edit = []
Total_edits = []
Recent_edits = []
Recent_authors = []

In [None]:
# Get all Page Information data

for i, pgtitle in enumerate(page_titles):
    info_url = f'https://en.wikipedia.org/w/index.php?title={pgtitle}&action=info'
    Page_info_urls.append(info_url)
    
    # Run function
    return_dict = wikipedia_get_page_info(info_url)

    # Store values
    Redirects.append(return_dict['Redirects'])
    Date_created.append(return_dict['Created on'])
    Date_last_edit.append(return_dict['Last edited'])
    Total_edits.append(return_dict['Total edits'])
    Recent_edits.append(return_dict['Edits, 30 days'])
    Recent_authors.append(return_dict['Recent authors'])

    print('Completed page:', i + 1, 'of', len(page_titles))

In [None]:
# Get all Page content (text)

for i, pgurl in enumerate(page_urls):
    Page_text.append(wikipedia_get_text(pgurl))
    
    print('Completed page:', i + 1, 'of', len(page_urls))
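Both loops above make one request per article, so a single failed request can stop a long run partway through. The sketch below is one possible way to keep going: it wraps each call, stores an empty string on failure so the results stay aligned with the input list, and pauses briefly between requests. The names `scrape_all` and `fake_fetch_text` are hypothetical, and the stand-in fetch function lets the example run without network access.

```python
import time

def scrape_all(urls, fetch_one, pause_seconds=0.5):
    """Call fetch_one(url) for every URL, pausing between requests.

    fetch_one is any function that takes a URL and returns a value,
    for example wikipedia_get_text. A failed page is stored as ''
    so the results stay aligned with the input list.
    """
    results = []
    for i, url in enumerate(urls):
        try:
            results.append(fetch_one(url))
        except Exception as error:      # e.g. a dropped connection
            print('Failed on', url, '-', error)
            results.append('')
        if i < len(urls) - 1:
            time.sleep(pause_seconds)   # brief pause between consecutive requests
    return results


# Offline demonstration with a stand-in fetch function
def fake_fetch_text(url):
    if 'bad' in url:
        raise ValueError('simulated failure')
    return 'text of ' + url

demo = scrape_all(['page_a', 'bad_page', 'page_c'], fake_fetch_text, pause_seconds=0)
print(demo)   # ['text of page_a', '', 'text of page_c']
```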

## Create a dictionary to organize lists of additional info

In [None]:
info_dict = {
    "Page ID": list(df['Page ID']), 
    "Page title": list(df['Title']), 
    "Page URL": list(df['URL']), 
    "Page Info URL": Page_info_urls, 
    "Page text": Page_text, 
    "Redirects": Redirects, 
    "Date created": Date_created, 
    "Last edited date": Date_last_edit, 
    "Total edits": Total_edits, 
    "Recent edits": Recent_edits, 
    "Recent authors": Recent_authors 
}

## Create a DataFrame of additional info

In [None]:
# This DataFrame is called df2, since the first one was just df

df2 = pd.DataFrame(info_dict)

In [None]:
df2.head(5)

## Save DataFrame as a SQLite database and a .csv file

In [None]:
# First, to a SQLite database (.db file)
# establish connection
conn = sqlite3.connect('Wikipedia_other_info.db')

# run SQL -- this will create the table, too
df2.to_sql(name='Wikipedia_other_info', con=conn, if_exists='replace', index=False)

# Note: there is no need for conn.commit(), the changes are automatically saved
# Close the connection
conn.close()


# Next, to a .csv file
df2.to_csv('Wikipedia_other_info.csv', index=False)
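The introduction notes that the API data and the scraped data can be combined for analysis. A minimal sketch of that join, using two small made-up DataFrames in place of `df` and `df2` and assuming `Page ID` is unique in each table:

```python
import pandas as pd

# Toy stand-ins for df (API data) and df2 (scraped data)
api_df = pd.DataFrame({
    'Page ID': [101, 102],
    'Title': ['Alpha', 'Beta'],
    'Size in Bytes': [2048, 512],
})
scraped_df = pd.DataFrame({
    'Page ID': [101, 102],
    'Page text': ['Alpha text', 'Beta text'],
    'Total edits': ['15', '3'],
})

# validate='one_to_one' raises an error if a Page ID is duplicated in either table
combined = api_df.merge(scraped_df, on='Page ID', how='left', validate='one_to_one')
print(combined.shape)   # (2, 5)
```

Because the articles are drawn at random, the same page could in principle appear twice. In that case, dropping duplicate page IDs before merging would be needed for the validation to pass.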

# Alternative method to get article text using Wikipedia's `action=parse` API

This method is not recommended -- it takes much more effort to clean the text compared to using BeautifulSoup on the article's page.

Also, don't use the `action=query&prop=revisions&rvprop=content` API call, because it returns the content as _wikitext_ (Wikipedia's own markup language, not plain text or HTML), which is difficult to parse and clean.

In [None]:
import re                       # used in the clean-up cell below
import requests
from bs4 import BeautifulSoup

In [None]:
response = requests.get('https://en.wikipedia.org/w/api.php?action=parse&page=Earth&prop=text&formatversion=2&format=json')
json_data = response.json()

json_data

In [None]:
soup = BeautifulSoup(json_data['parse']['text'], 'html.parser')

In [None]:
soup.text

In [None]:
print(soup.text)

In [None]:
pagetext = soup.text

# remove the 'Contents' section from the page text (if the page has one)
start_of_contents_section = pagetext.find('\n\nContents\n\n')
if start_of_contents_section > -1:
    end_of_contents_section = pagetext.find('\n\n\n', start_of_contents_section)
    if end_of_contents_section > -1:
        pagetext = pagetext[:start_of_contents_section] + pagetext[end_of_contents_section:]

# Remove the 'See also' and 'References' sections
end_of_text = pagetext.find('\n\nSee also[edit]')
if end_of_text == -1:
    # See also section not found, use References instead
    end_of_text = pagetext.find('\n\nReferences[edit]')

# Truncate only if a section was found; slicing with -1 would drop the last character
if end_of_text > -1:
    pagetext = pagetext[:end_of_text]

# The 'r' in front of the pattern tells Python to treat this as a raw string
# so that backslashes (as in \[ or \n) are passed to the regex engine unchanged
# instead of being interpreted by Python as escape sequences.
pagetext = re.sub(
    pattern = r'\[[0-9A-Za-z]*\]', # removes bracketed markers such as [12], [a], and [edit]
    repl = '',
    string = pagetext)

# Remove extra line breaks
pagetext = re.sub(
    pattern = r'\n\n\n', # collapse three consecutive newlines into two
    repl = '\n\n',
    string = pagetext)

print(pagetext)
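Much of the string clean-up above could instead be done on the parsed HTML before the text is extracted. The sketch below runs on a small hand-written fragment, not real API output, and assumes the table of contents has `id="toc"`, edit links use the class `mw-editsection`, and reference markers are `sup` elements with the class `reference`. The actual markup returned by the API may differ.

```python
from bs4 import BeautifulSoup

# Small hand-written fragment imitating parsed article HTML (not real API output)
sample_html = '''
<div id="toc"><h2>Contents</h2><ul><li>1 History</li></ul></div>
<h2>History<span class="mw-editsection">[edit]</span></h2>
<p>Earth formed long ago.<sup class="reference">[1]</sup></p>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Remove the table of contents, section edit links, and reference markers
for unwanted in sample_soup.select('#toc, span.mw-editsection, sup.reference'):
    unwanted.decompose()

clean_text = sample_soup.get_text().strip()
print(clean_text)
```

Removing elements by selector avoids depending on exact newline patterns in the extracted text, which is what makes the string-based approach fragile.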