# Text Analysis: 50 yrs. of Magazine Issues

**Author:** [Ryan Parker](https://github.com/rparkr)

**Data source:** Data scraped from magazines of The Church of Jesus Christ of Latter-day Saints from 1971-2021. Starting point: [Church Magazines](https://www.churchofjesuschrist.org/study/magazines?lang=eng).

**Objective**: understand trends in topics addressed through the content of four magazines produced by [The Church of Jesus Christ of Latter-day Saints](https://churchofjesuschrist.org).

In addition to collecting data, I'll analyze the data and implement various machine learning algorithms to understand the data, uncover insights, and make predictions.

## Steps
1. <strong><span style="color:rgb(82, 191, 127)">Data collection (this notebook)</span></strong>: Gather data through web scraping, lightly process it, and save it to .csv files.
2. Topic modeling: assign a topic to each document (article) and label the topics, then append the topic label as a new feature (column) in the dataset.
3. Data augmentation: modify and add additional features (columns) to the dataset to enhance its analysis.
4. Analysis: explore the dataset through visualization, and make predictions using supervised and unsupervised machine learning algorithms.

# Data scraping
**Note:** I reviewed the site's [`robots.txt` page](https://www.churchofjesuschrist.org/robots.txt) and verified that the pages used for this data collection exercise are allowed.
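
A check like this can also be scripted with the standard library's `urllib.robotparser`. The sketch below is only an illustration: it parses a small, made-up set of rules (not the site's actual `robots.txt`) so that it runs offline. The rules, the example URLs, and the `is_allowed` helper are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the site's real robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Allow: /study/
"""


def is_allowed(rules_text: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_RULES, "https://www.example.org/study/magazines"))
print(is_allowed(SAMPLE_RULES, "https://www.example.org/private/page"))
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` downloads and parses the real file.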

Import required packages for this section

In [4]:
# Built-in packages
import json                     # parse JSON to Python dictionaries
import os                       # file management and other operating system utilities
import csv                      # save data to .csv file
import re                       # text-based pattern matching
import time                     # pause between each webpage request
import random                   # sample from a list

# Third-party packages
import requests                 # send and receive data over HTTP(S)
from bs4 import BeautifulSoup   # class used to parse HTML document object model (DOM)
from tqdm.auto import tqdm      # progress bars
import pandas as pd             # tabular data analysis
tqdm.pandas()                   # register df.progress_apply() to add progress bars to DataFrame.apply() methods


In [None]:
# Optional alternative to requests, for returning content rendered by JavaScript.
# Does not work in a Jupyter notebook, only from a Python script (.py file).

# %pip install requests_html
# from requests_html import HTMLSession   # requests, with support for rendering JavaScript
# See: https://github.com/psf/requests-html
#      https://stackoverflow.com/questions/45448994/wait-page-to-load-before-getting-data-with-requests-get-in-python-3/67778709#67778709
#      https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python/49865035#49865035
# Usage:
# session = HTMLSession()
# response = session.get(URL_TO_PAGE)
# html = response.html.render()
# parsed_page = BeautifulSoup(html, features='lxml')
# print(parsed_page.text)

# Another alternative: Selenium with Chrome in headless mode
# See, for example: https://stackoverflow.com/questions/45448994/wait-page-to-load-before-getting-data-with-requests-get-in-python-3/68787500#68787500

Test sites with interesting structures to account for: 
* [2017 collection of _Liahona_ issues](https://www.churchofjesuschrist.org/study/magazines/liahona/2017?lang=eng), has Special Issues in addition to standard monthly issues. See also [2018](https://www.churchofjesuschrist.org/study/magazines/liahona/2018?lang=eng) for other examples of special editions. 
* [August 2021 example](https://www.churchofjesuschrist.org/study/liahona/2021/08/hear-him?lang=eng) of an article with no text content
* [Do Not Copy photo notice example](https://www.churchofjesuschrist.org/study/liahona/2018/02/a-release-is-a-beginning-not-an-end?lang=eng)
* [Nested category example](https://www.churchofjesuschrist.org/study/liahona/2018/02/october-2017-conference-notebook/how-can-we-bring-the-savior-into-our-lives?lang=eng), where categories form the part of the URL after the month (e.g., `.../liahona/2018/02/additional_category/article?lang=eng`)
* [Extra Contents section example](https://www.churchofjesuschrist.org/study/liahona/2010/10/contents?lang=eng)
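
The nested-category URL pattern noted above can be taken apart with `urllib.parse`. This is a minimal sketch, assuming article paths follow `/study/<magazine>/<year>/<month>/[<category>/...]<article>`; the `split_article_url` helper is hypothetical and is not part of the scraping class defined later in this notebook.

```python
from urllib.parse import urlsplit


def split_article_url(url: str) -> dict:
    """Split an article URL into magazine, year, month, categories, and article slug.

    Assumes a path of the form:
    /study/<magazine>/<year>/<month>/[<category>/...]<article>
    """
    parts = urlsplit(url).path.strip("/").split("/")
    # Drop the leading 'study' segment if present
    if parts and parts[0] == "study":
        parts = parts[1:]
    if len(parts) < 4:
        raise ValueError(f"Unexpected article URL structure: {url}")
    magazine, year, month, *rest = parts
    return {
        "magazine": magazine,
        "year": year,
        "month": month,
        "categories": rest[:-1],   # empty list when the article is not nested
        "article": rest[-1],
    }


example = (
    "https://www.churchofjesuschrist.org/study/liahona/2018/02/"
    "october-2017-conference-notebook/"
    "how-can-we-bring-the-savior-into-our-lives?lang=eng"
)
print(split_article_url(example))
```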


Things to account for: 
* All links on a Year page, including special editions
* All articles, using links from left-side navigation (including nested categories like Youth, Adults, Children, etc.)
* Multiple pictures
* Author bios, if present (bios are present if you can find the author's name at the beginning of the sidebar text)
* Author locations, if present
* Author titles, if present (e.g., "Of the Quorum of the Twelve Apostles" or "Church magazines")
* Article summaries, if present
* Don't include an entry for an article that has no text; e.g., one that is just a picture and a caption

## Dataset features scraped from the web
* Magazine (_Liahona_, _The Friend_, _Ensign_, _New Era_)
* URL
* Year
* Month
* Issue (month/year combo)
* Author name
* Author's role
* Author bio (if present)
* Article source (e.g., "from a devotional address given at BYU on...")
* Article summary
* Article text
* Article main topic (uses Python function to generate topic from synonym dictionary)
* Article category (Main, Young Adults, Children, etc.)
* Word count
* Number of links on each page
* Image captions
* Image descriptions (alt text). Note: the alt text is written client-side by JavaScript, so it appears as an empty string (`''`) in the HTML retrieved by `requests.get()`. For more info, see this [Stack Overflow question](https://stackoverflow.com/questions/53469230/beautifulsoup-not-extracting-image-alt-text)

<img alt="DevTools window showing that alt text is written by JavaScript after HTML DOM is loaded" src="images/alt_text_attribute_assigned_by_JavaScript.png" width=300>
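
The month and special-issue features in the list above can be derived from an issue's URL and title. The sketch below mirrors the approach taken in the scraping code in this notebook (the text between the last `/` and the last `?` of the issue URL is treated as the issue identifier), written as a standalone function. The `parse_issue` helper and the example URLs are hypothetical, and the exact format of special-issue identifiers on the site is an assumption.

```python
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def parse_issue(url: str, issue_title: str) -> dict:
    """Derive the month name and a special-issue flag from an issue URL and title."""
    # The issue identifier sits between the last '/' and the last '?'
    q_pos = url.rfind("?")
    end = q_pos if q_pos != -1 else len(url)
    issue_id = url[url.rfind("/") + 1 : end]

    if issue_id.isdigit():
        # Standard monthly issue: the title starts with the month name
        return {"month": issue_title.split(" ")[0], "extra_issue": False}

    # Special issue: use a two-digit month prefix if there is one
    month = None
    prefix = issue_id[:2]
    if prefix.isdigit() and 1 <= int(prefix) <= 12:
        month = MONTH_NAMES[int(prefix) - 1]
    return {"month": month, "extra_issue": True}


# Hypothetical URLs for illustration
print(parse_issue("https://www.example.org/study/liahona/2018/02?lang=eng", "February 2018"))
print(parse_issue("https://www.example.org/study/liahona/2017/10-se?lang=eng", "Special Issue"))
```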

## Data collection methods

In [2]:
class MagazineData():
    '''
    A class to collect and store magazine data by scraping webpages.
    '''
    def __init__(self, root_url, magazine_names, magazine_urls):
        '''
        Create an object that will collect and store magazine data.

        Parameters
        ---

        `root_url`: str
        The URL of the main page of the website. Necessary because the links
        given throughout the website are relative links that start at the end
        of the root URL (e.g., a relative link looks like: '/study/liahona/')

        `magazine_names`: list
        A list of the names for each magazine. Will be used as the first-level
        keys in the returned dictionary. Please ensure that the number of
        elements in `magazine_names` matches the number of elements in
        `magazine_urls`.
        
        `magazine_urls`: list
        A list of the main URLs for the magazines, where the main URL is the page
        with links to each year for the magazine.
        '''
        # Set the root_url, magazine_names, and magazine_urls properties of
        # the object. Same as self.root_url = root_url, etc. for each parameter.
        self.__dict__.update(locals())
        self.all_years = None
        self.all_issues = None
        self.article_data = None
        self.article_text = None

        # Alternative form
        # self.root_url = root_url
        # self.magazine_names = magazine_names
        # self.magazine_urls = magazine_urls

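        # Enforce lists for the input parameters: wrap a single string in a list.
        # (Calling list() on a string would split it into individual characters.)
        if isinstance(self.magazine_names, str):
            self.magazine_names = [self.magazine_names]
        if isinstance(self.magazine_urls, str):
            self.magazine_urls = [self.magazine_urls]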
        
        # Ensure lists passed are of equal length
        assert len(self.magazine_urls) == len(self.magazine_names), (
            f"The number of items passed to 'magazine_urls' ({len(self.magazine_urls)})"
            f"\ndoesn't match the number of items passed to 'magazine_names' ({len(self.magazine_names)})."
            f"\nPlease re-initialize the object with an equal number of items in each list."
        )



    # Collect links to magazines, grouped by year
    def get_magazine_years(
        self,
        return_data: bool=False,
        crawl_delay: float=0.1) -> dict:
        '''
        Creates a dictionary of dictionaries that organizes URLs to the year
        page for each magazine.

        Parameters
        ---

        `return_data`: bool, default=False
        Whether to return the data that this function collects. Either way, the
        data is stored in the `all_years` property of the `MagazineData` object.

        `crawl_delay`: float, default=0.1
        The number of seconds to pause prior to making each web request. Implemented
        to reduce the load on the web server. This delay separates each request by a
        minimum of `crawl_delay` seconds, since the code's execution time also creates
        a delay between consecutive requests.

        Return value
        ---
        Dictionary of the form:
        ```python
        {
            'Liahona': {
                '2021': {'url': 'https://www.example.org/'},
                '2020': {'url': 'https://www.example.org/'}
            },
            'Friend': {
                '2021': {'url': 'https://www.example.org/'},
                '2020': {'url': 'https://www.example.org/'}
            },
        }
        ```
        '''
        # Create a dictionary to store the links to each year for each magazine
        magazine_years = {name: {} for name in self.magazine_names}

        # Start a web session
        sesh = requests.session()

        # Loop through all provided links
        for i, main_link in enumerate(self.magazine_urls):
            mag_name = self.magazine_names[i]

            # Access the magazine's main page
            response = sesh.get(main_link)

            # Parse the HTML into a Python-navigable structure using BeautifulSoup.
            # For best speed, use the 'lxml' parser. 
            #  See: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
            #  Alternative: built-in parser 'html.parser'
            page = BeautifulSoup(response.text, 'lxml')

            for link in page.find('section', id='main').find_all('a'):
                # Make sure the link is to a full year, not a month from the current year.
                # Full years have only 4 characters (the digits of the year), and months
                #  of the current year have a month name prefixing the year.
                if link.text.isdigit():
                    magazine_years[mag_name][link.text] = {'url': self.root_url + link.get('href')}
            
            # Pause between page requests as a courtesy to reduce server load
            time.sleep(crawl_delay)
        
        # Save data in the all_years property of the object
        self.all_years = magazine_years

        if return_data:
            return magazine_years
        else:
            return None



    # Collect links to magazine issues
    def get_magazine_issues(
        self,
        return_data: bool=False,
        force_retrieve: bool=False,
        crawl_delay: float=0.1):
        '''
        Creates a dictionary with the following information for each year of
        a magazine:
        * 'url': link to a page that groups links for each issue of the year
        * 'issues': list of titles of issues for that year (e.g., 'April 2018')
        * 'issue_urls': list of links to each issue for that year

        Parameters
        ---
        `return_data`: bool, default=False
        Whether to return the data that this function collects. Either way, the
        data is stored in the `all_issues` property of the `MagazineData` object.

        `force_retrieve`: bool, default=False
        If True, will collect the data regardless of whether the data has already
        been scraped.

        `crawl_delay`: float, default=0.1
        The number of seconds to pause prior to making each web request. Implemented
        to reduce the load on the web server. This delay separates each request by a
        minimum of `crawl_delay` seconds, since the code's execution time also creates
        a delay between consecutive requests.

        Return value
        ---
        Dictionary of the form:
        ```python
        {
            'Liahona': {
                '2020': {
                    'url': 'https://www.example.org/',
                    'issues': ['January 2020', 'February 2020'],
                    'issue_urls': ['https://www.example.org/', 'https://www.example.org/']
                    },
                '2021': {
                    'url': 'https://www.example.org/',
                    'issues': ['January 2021', 'February 2021'],
                    'issue_urls': ['https://www.example.org/', 'https://www.example.org/']
                    }
            },
            'Friend': {
                '2020': {
                    'url': 'https://www.example.org/',
                    'issues': ['January 2020', 'February 2020'],
                    'issue_urls': ['https://www.example.org/', 'https://www.example.org/']
                    },
                '2021': {
                    'url': 'https://www.example.org/',
                    'issues': ['January 2021', 'February 2021'],
                    'issue_urls': ['https://www.example.org/', 'https://www.example.org/']
                    }
            }
        }
        ```
        '''
        # Check if the magazine issues have already been collected and saved
        if os.path.exists('data/magazine_issues.json') and not force_retrieve:
            with open('data/magazine_issues.json', mode='rt', encoding='utf-8') as jsonfile:
                self.all_issues = json.load(jsonfile)
            print('magazine_issues.json file found; data loaded to `all_issues` attribute of the MagazineData object.')
        else:
            if self.all_years is None:
                print("Running the get_magazine_years() function to find URLs for the year pages of each magazine.")
                self.get_magazine_years()
            
            # Start the dictionary with the data already scraped and saved in the all_years attribute.
            magazine_issues = self.all_years

            total_mag_years = 0

            # Create lists for issue titles and URLs
            for mag_name, yrs in magazine_issues.items():
                total_mag_years += len(yrs)
                for yr in yrs:
                    magazine_issues[mag_name][yr]['issues'] = []
                    magazine_issues[mag_name][yr]['issue_urls'] = []

            # Initialize progress bar that counts the year pages to request (one per magazine per year)
            progress_bar = tqdm(total=total_mag_years)

            # Create a web session
            sesh = requests.session()

            total_issues = 0

            for mag_name, yrs in magazine_issues.items():
                for yr, properties in yrs.items():
                    progress_bar.set_description(f"Retrieving issues for: {mag_name}, {yr}")
                    # Get webpage with links to issues for a given year
                    response = sesh.get(properties['url'])
                    response.encoding = 'utf-8'
                    page = BeautifulSoup(response.text, 'lxml')
                    for link in page.find('section', id='main').find_all('a'):
                        total_issues += 1
                        # Store issue titles and URLs in the list for the magazine year
                        magazine_issues[mag_name][yr]['issues'].append(link.text)
                        magazine_issues[mag_name][yr]['issue_urls'].append(self.root_url + link.get('href'))
                    progress_bar.update()

                    # Pause between page requests as a courtesy to reduce server load
                    time.sleep(crawl_delay)

            print(f"There are a total of {total_issues:,.0f} magazine issues.")

            self.all_issues = magazine_issues

            # Save data to a file so it doesn't need to be scraped again
            os.makedirs('data', exist_ok=True)
            with open('data/magazine_issues.json', mode='wt', encoding='utf-8') as jsonfile:
                json.dump(obj=magazine_issues, fp=jsonfile)

        # Return the data whether it was just scraped or loaded from the saved file
        if return_data:
            return self.all_issues
        return None



    # Scrape data on articles from each magazine issue
    def get_article_data(
        self,
        save_progress: bool=True,
        return_data: bool=False,
        crawl_delay: float=0.1,
        force_retrieve: bool=False,
        continue_last_saved: bool=False):
        '''
        Scrape article data from all issues in the `all_issues` attribute of the
        MagazineData object.
        
        The resulting data is saved to this `MagazineData` object
        as a dictionary of columns (keys) and rows (values, as lists).
        After running this function, access the data using the `.article_data`
        and `.article_text` attributes of the `MagazineData` object.
        
        Also saves the data to .csv files called: "data/article_data.csv"
        and "data/article_text.csv" in the current working directory.

        **NOTE:** This function will take approximately 12-18 hours to scrape
        data for all 2,302 magazine issues released between 1970 and 2020 across
        all four magazines. It is best to use the .csv files already scraped:
        the data likely won't change since the articles are historical.

        On my 2015 Windows 10 laptop with a 2.20GHz Intel Core i5-5200U CPU,
        8GB of RAM, and 150Mb/s internet connection, the scraping process
        took about 13 hours in total (conducted over multiple sessions).

        There are about 67,145 articles in the dataset, and each article
        took approximately 0.7 seconds to scrape.


        Parameters
        ---
        `save_progress`: bool, default=True
            Whether to save the scraped data to .csv files during the
            function's execution or to hold all data in memory until the end.
            Due to the relative speed of RAM compared with disk access,
            the in-memory option is faster. Saving progress to disk at 
            each step safeguards against function failure partway at 
            the expense of additional computation time.
            
            Approximately 332MB of data are collected from all magazines
            (the Friend, New Era, Liahona, and Ensign) from 1970-2020.
            Regardless of the `save_progress` setting, the collected 
            data is held in-memory in the `.article_data` 
            and `.article_text` attributes of the `MagazineData` object
            
        `return_data`: bool, default=False
        Whether to return the data that this function collects. Either way, the
        data is stored in the `.article_data` and `.article_text` attributes
        of the `MagazineData` object. If True, returns a DataFrame.

        `force_retrieve`: bool, default=False
        Whether to force the function to scrape data and save it, even though
        `data/article_data.csv` file already exists. Will overwrite existing data.

        `crawl_delay`: float, default=0.1
        The number of seconds to pause prior to making each web request. Implemented
        to reduce the load on the web server. This delay separates each request by a
        minimum of `crawl_delay` seconds, since the code's execution time also creates
        a delay between consecutive requests.

        `continue_last_saved`: bool, default=False
        Whether to continue scraping starting with the last scraped article. Useful
        because the scraping takes many hours, so this enables a developer to continue
        at the last saved point. `force_retrieve` must be set to `False` if this
        parameter is set to `True`.


        Return value
        ---
        `None`, unless `return_data` is set to `True`, in which case the function returns
        a pandas DataFrame with data scraped from each article (one row per article).
        '''
        
        if force_retrieve:
            assert not continue_last_saved, (
                "Conflicting arguments: if `force_retrieve` is `True`, `continue_last_saved` cannot also be `True`."
                + "\nPlease set one of the arguments to `False` to proceed.")

        # Dictionary to hold data during function execution.
        # Each key is a column, with the value as a list of the rows in that column.
        columns = [
            'magazine', 'year', 'month', 'issue', 'extra_issue', 'url',
            'section', 'series', 'title', 'subtitle', 'source',
            'author_name', 'author_role', 'author_bio',
            'summary', 'word_count', 'references_count',
            'image_alt_text', 'image_captions'
        ]

        text_columns = ['url', 'text']

        if self.all_issues is None:
            print("Running the get_magazine_issues() function to find issue URLs for each magazine.")
            self.get_magazine_issues()
            print("Now, running the get_article_data function.")

        # Check if the article data has already been scraped and saved
        if os.path.exists('data/article_data.csv') and os.path.exists('data/article_text.csv') and not force_retrieve:
            if not self.article_data:
                self.article_data = pd.read_csv('data/article_data.csv').to_dict(orient='list')
                print("article_data.csv file found; data loaded to `article_data` attribute of the MagazineData object.")
            if not self.article_text:
                self.article_text = pd.read_csv('data/article_text.csv').to_dict(orient='list')
                print("article_text.csv file found; data loaded to `article_text` attribute of the MagazineData object.")
            
            if continue_last_saved:
                print("Continuing to scrape article data starting from the last retrieved article.")
            else:
                if return_data:
                    print("Merging article_data and article_text dictionaries and returning a single DataFrame with all data.")
                    return pd.DataFrame({**self.article_data, **self.article_text})
                else:
                    return None
        else:
            self.article_data = {col: [] for col in columns}
            self.article_text = {col: [] for col in text_columns}
            if save_progress:
                # Create the data folder if needed, then create CSV files
                # and write the header row in each file
                os.makedirs('data', exist_ok=True)
                with open('data/article_data.csv', mode='wt', encoding='utf-8', newline='') as csvfile:
                    csv_writer = csv.writer(csvfile)
                    csv_writer.writerow(columns)
                with open('data/article_text.csv', mode='wt', encoding='utf-8', newline='') as csvfile:
                    csv_writer = csv.writer(csvfile)
                    csv_writer.writerow(text_columns)
    
        # Start a web requests session
        sesh = requests.session()

        # Headers updated in Jul 2022
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml,application/json;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-US,en;q=0.9',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.71',
            # 'Referer': 'https://www.churchofjesuschrist.org/study?lang=eng&platform=web'
        }
        
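        # Deep-copy the all_issues dictionary to modify it if necessary,
        # which enables the continue_last_saved feature. A shallow copy (dict.copy())
        # would share the nested year dictionaries and issue lists with the original,
        # so deleting years or issues below would also alter self.all_issues.
        # The pattern below is: iterate through the all_issues dict (the copy)
        # and remove items whose index is less than the index of the last saved item;
        # that is, remove from the copy the items that have already been scraped.
        from copy import deepcopy   # built-in package
        all_issues = deepcopy(self.all_issues)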
        if continue_last_saved:
            last_magazine = self.article_data['magazine'][-1]
            last_year = str(self.article_data['year'][-1])
            last_issue = self.article_data['issue'][-1]

            # Remove magazines from the all_issues dictionary if
            # all articles from those magazines have already been scraped
            del_keys = []
            index_last_mag = list(all_issues.keys()).index(last_magazine)
            for key in all_issues.keys():
                if list(all_issues.keys()).index(key) < index_last_mag:
                    del_keys.append(key)
            for key in del_keys:
                # Note that I am deleting keys (magazines) from the COPY, not the original
                del all_issues[key]
            
            # Remove years for the magazine that have already been scraped.
            del_yrs = []
            index_last_year = list(all_issues[last_magazine]).index(last_year)
            for yr_num in all_issues[last_magazine]:
                if list(all_issues[last_magazine]).index(yr_num) < index_last_year:
                    del_yrs.append(yr_num) 
            for yr_num in del_yrs:
                del all_issues[last_magazine][yr_num]
            
            # Remove issues for the magazine that have already been scraped
            del_issue_nums = []
            index_last_issue = all_issues[last_magazine][last_year]['issues'].index(last_issue)
            for i_num, issue in enumerate(all_issues[last_magazine][last_year]['issues']):
                if all_issues[last_magazine][last_year]['issues'].index(issue) < index_last_issue:
                    del_issue_nums.append(i_num)
            # When deleting items from a list using the index number,
            # items must be deleted in reverse order; otherwise the indices
            # of later elements shift down with each deletion.
            for i_num in reversed(del_issue_nums):
                del all_issues[last_magazine][last_year]['issues'][i_num]
                del all_issues[last_magazine][last_year]['issue_urls'][i_num]

        # Count the number of issues
        issue_count = 0
        for mag_name in all_issues:
            for yr_num in all_issues[mag_name]:
                issue_count += len(all_issues[mag_name][yr_num]['issues'])
        
        # When scraping article text within the <div class="body-block"> by
        # looping through each direct child element in body_block.findChildren(recursive=False),
        # if an element is an <aside>, <figure>, or <figcaption> tag,
        # or if an element is a <div> tag with class "credit" or "imageWrapper...",
        # ignore its text since it isn't part of the article's main body text.
        tags_to_omit = ['aside', 'figure', 'figcaption']
        classes_to_omit = ['credit', re.compile('imageWrapper.*')]

        # When scraping image alt text, the text is prefixed
        # with "Image", which I later strip out when saving the data.
        img_prefix_chars = len('Image')
        
        # Create progress bar
        pbar = tqdm(total=issue_count)

        month_names = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

        # Loop through each issue in all_issues dict
        for mag_name in all_issues:
            for yr_num in all_issues[mag_name]:
                for n, url in enumerate(all_issues[mag_name][yr_num]['issue_urls']):
                    issue = all_issues[mag_name][yr_num]['issues'][n]
                    # The position of the last question mark in the URL
                    q_pos = url.rfind('?')
                    # The position of the last forward slash
                    start_pos = url.rfind('/')
                    month_num = url[start_pos + 1 : q_pos]
                    if month_num.isdigit():
                        month_name = issue.split(' ')[0]
                        extra_issue = False
                    else:
                        extra_issue = True
                        if month_num[:2].isdigit():
                            month_name = month_names[int(month_num[:2]) - 1]
                        else:
                            month_name = None

                    # Pause between page requests as a courtesy to reduce server load
                    time.sleep(crawl_delay)
                    # Find all articles in the issue
                    response = sesh.get(url=url, headers=headers)
                    retry_count = 5
                    while not response.ok and retry_count > 0:
                        time.sleep(1)
                        response = sesh.get(url=url, headers=headers)
                        retry_count -= 1
                    response.encoding = 'utf-8'
                    page = BeautifulSoup(response.text, 'lxml')
                    sections = []
                    article_urls = []
                    section = "general"
                    try:
                        for li in page.find('div', class_='body').find('nav', class_='manifest').find_all('li'):
                            # A single HTML element can have multiple classes,
                            # so BeautifulSoup returns a list of classes when accessing the "class" attribute
                            n_n_class = li.next.next.get('class')  # -> list of classes or None
                            if n_n_class:
                                if any(nnc in ['title'] for nnc in n_n_class):
                                    # Update section to match current section
                                    section = li.find('p', class_='title').text
                            a_url = li.find('a', {'href': True})
                            if a_url.text != "Contents" and (self.root_url + a_url.get('href') not in article_urls):
                                sections.append(section)
                                article_urls.append(self.root_url + a_url.get('href'))
                            else:
                                continue
                    except AttributeError:
                        # An AttributeError means that either:
                        # 1. page is None because BeautifulSoup didn't have any HTML to parse (unlikely because requests retries 5 times if the response fails)
                        # 2. there is no <div class="body"> element on the page
                        # 3. there is no <nav class="manifest"> (i.e., the "Contents" section) within the body of the page
                        # For any of those errors, simply skip this issue and move on to the next one.
                        continue

                    # Loop through each article
                    for a_num, a_url in enumerate(article_urls):
                        pbar.set_description(
                            f"Working on {mag_name}, "
                            + f"issue {issue}, "
                            + f"article {a_num + 1:,d}/{len(article_urls):,d}")
                        # Skip article if its data has already been scraped
                        if continue_last_saved:
                            if a_url in self.article_data['url']:
                                continue
                        # Pause between page requests as a courtesy to reduce server load
                        time.sleep(crawl_delay)
                        # Open webpage
                        response = sesh.get(url=a_url, headers=headers, timeout=None)
                        retry_count = 5
                        while not response.ok and retry_count > 0:
                            time.sleep(1)
                            response = sesh.get(url=a_url, headers=headers)
                            retry_count -= 1
                        # Skip article if it still isn't loading
                        if not response.ok:
                            print(f"Failed to retrieve article with status code {response.status_code} at URL: {a_url}")
                            continue
                        response.encoding = 'utf-8'
                        page = BeautifulSoup(response.text, 'lxml')
                        body = page.find('div', class_='body')
                        # If no body element is found on the page, skip to next article
                        if not body:
                            continue
                        # Sometimes, an article has multiple body-block div tags. In that case, use the last one (i.e., index [-1])
                        # See, for example: https://www.churchofjesuschrist.org/study/liahona/2021/01/unlocking-the-door-to-personal-revelation?lang=eng
                        body_block = body.find_all('div', class_='body-block')

                        if body_block:
                            body_block = body_block[-1]
                            # A better approach might be to merge the body-block sections.
                            # For example, see:
                            # https://stackoverflow.com/questions/50026264/beautifulsoup-combine-consecutive-tags
                            # https://stackoverflow.com/questions/35531608/how-can-i-merge-two-beautiful-soup-tags
                            
                        else:
                            # print(f"No <div class=\"body-block\"> tag found on page. Could not load text from article: {a_url}")
                            # Skip this article since it has no content
                            continue
                        
                        series = body.find('p', class_='series-title')
                        # BeautifulSoup returns None if no element is found.
                        # None is a "falsey" value that evaluates to False.
                        # Thus, we can write "if series:" and the condition will be True
                        # if series is an object (that is, a match was found) or False
                        # if series is None (that is, no match was found).
                        # pandas treats None as a missing value (NaN), which is the intended behavior
                        # in this dataset. See: https://note.nkmk.me/en/python-pandas-nan-none-na/#none-is-also-considered-a-missing-value
                        series = series.text.strip() if series else None

                        title = body.find(id='title1')
                        title = title.text.strip() if title else None
                        subtitle = body.find(class_='subtitle')
                        subtitle = subtitle.text.strip() if subtitle else None

                        byline = body.find('div', class_='byline')
                        # Guard against a byline <div> that contains no <p> elements
                        byline_vals = byline.find_all('p') if byline else []
                        if byline_vals:
                            author_name = byline_vals[0].text
                            if author_name.startswith("By "):
                                # Trim the leading "By "
                                author_name = author_name[3:]
                            author_role = byline_vals[1].text if len(byline_vals) > 1 else None
                        else:
                            author_name = None
                            author_role = None

                        author_bio = None
                        asides = body_block.find_all('aside', class_='sidebar')
                        # asides is a list of elements found. If no elements
                        # were found, it will be an empty list.
                        # Empty lists (i.e., [] or list()) evaluate to False.
                        if asides and author_name:
                            for aside_box in asides:
                                for p_elem in aside_box.find_all('p'):
                                    if author_name in p_elem.text:
                                        author_bio = p_elem.text
                                        break
                        else:
                            # Check if there is a "bio" class
                            bio_div = body.find('div', class_='bio')
                            if bio_div:
                                author_bio = bio_div.text
                        
                        # Trim leading or trailing newline characters
                        if author_bio:
                            author_bio = author_bio.strip()
                    
                        summary = body.find('p', class_='kicker')
                        if summary:
                            summary = summary.text
                        else:
                            # Check if there is an epigraph rather than a "kicker"
                            summary = body.find('p', class_='epigraph')
                            summary = summary.text.strip() if summary else None
                        
                        # Source describes where the article came from, if adapted
                        # from a prior address given by the author.
                        source = body.find('div', class_='event')
                        source = source.text.strip() if source else None
                        
                        references_count = len(body_block.find_all('a', class_='scripture-ref'))
                        references_count += len(body_block.find_all('a', class_='note-ref'))
                        references_count += len(body_block.find_all('a', class_='cross-ref'))

                        # Use the .text attribute of the container that holds
                        # the image, which is a <div> element with class that begins with "imageWrapper"
                        images = body.find_all(class_=re.compile('imageWrapper.*'))
                        if images:
                            image_alt_text = [img.text.strip() for img in images]
                        else:
                            image_alt_text = None
                            
                        image_captions = body.find_all('figcaption')
                        if image_captions:
                            image_captions = [fig.text.strip() for fig in image_captions]
                        else:
                            image_captions = None
                        
                        # The code below applies filters to refrain from collecting
                        # text that is not part of the article (image captions, section headings, asides).
                        text = []
                        for ch in body_block.findChildren(recursive=False):
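                            ch_class = ch.get('class')  # -> a list, or None
                            # classes_to_omit may hold plain strings and compiled regex
                            # patterns, so test each class name against both kinds of entry.
                            skip_class = any(
                                omit.match(cc) if isinstance(omit, re.Pattern) else cc == omit
                                for cc in (ch_class or [])
                                for omit in classes_to_omit)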
                            
                            if ch.name in tags_to_omit or skip_class:
                                continue
                            else:
                                text.append(ch.text.strip())

                        # Combine into a single string, with paragraphs delimited by newlines.
                        text = '\n'.join(text)
                        # Replace multiple newlines with a single newline.
                        text = re.sub(r'\n{2,}', '\n', text)

                        # Remove unwanted text that was included through child elements
                        # (e.g., a section that, along with article text, 
                        # contained text from unwanted elements like images, credits, or asides)
                        credit_elems = body_block.find_all('div', class_='credit')
                        if credit_elems:
                            for cr in credit_elems:
                                text = text.replace(cr.text.strip(), '', 1)
                        
                        aside_elems = body_block.find_all('aside')
                        if aside_elems:
                            for ae in aside_elems:
                                text = text.replace(ae.text.strip(), '', 1)

                        if image_alt_text:
                            for iat in image_alt_text:
                                text = text.replace(iat, '', 1)

                        if image_captions:
                            for ic in image_captions:
                                text = text.replace(ic, '', 1)
                        
                        # Replace multiple whitespace characters with a single space
                        text = re.sub(r'\s{2,}', ' ', text)
                        
                        if text:
                            # Trim leading newline character, if present
                            if text[0] == '\n':
                                text = text[1:]
                        # else:
                        #     # Skip to next article, since there is no text in this article
                        #     continue

                        # Convert less-common Unicode characters to simpler ones (e.g., á -> a)
                        # text = unicode_to_ascii(text)
                        
                        word_count = len(text.split()) if text else None

                        # Clean image_alt_text by removing the "Image" prefix.
                        if image_alt_text:
                            image_alt_text = [iat[img_prefix_chars:] for iat in image_alt_text]
                    

                        # Add info to article_data
                        self.article_data['magazine'].append(mag_name)
                        self.article_data['year'].append(yr_num)
                        self.article_data['month'].append(month_name)
                        self.article_data['issue'].append(issue)
                        self.article_data['extra_issue'].append(extra_issue)
                        self.article_data['url'].append(a_url)
                        self.article_data['section'].append(sections[a_num])
                        self.article_data['series'].append(series)
                        self.article_data['source'].append(source)
                        self.article_data['title'].append(title)
                        self.article_data['subtitle'].append(subtitle)
                        self.article_data['author_name'].append(author_name)
                        self.article_data['author_role'].append(author_role)
                        self.article_data['author_bio'].append(author_bio)
                        self.article_data['summary'].append(summary)
                        self.article_data['word_count'].append(word_count)
                        self.article_data['references_count'].append(references_count)
                        self.article_data['image_alt_text'].append('\n'.join(image_alt_text) if image_alt_text else None)
                        self.article_data['image_captions'].append('\n'.join(image_captions) if image_captions else None)

                        self.article_text['url'].append(a_url)
                        self.article_text['text'].append(text)

                        if save_progress:
                            # Write a new row with data scraped from the current article
                            values = [self.article_data[col][-1] for col in self.article_data.keys()]
                            text_values = [self.article_text[col][-1] for col in self.article_text.keys()]
                            # Append data to the .csv files
                            with open('data/article_data.csv', mode='a', encoding='utf-8', newline='') as csvfile:
                                csv_writer = csv.writer(csvfile)
                                csv_writer.writerow(values)
                            with open('data/article_text.csv', mode='a', encoding='utf-8', newline='') as csvfile:
                                csv_writer = csv.writer(csvfile)
                                csv_writer.writerow(text_values)

                    # Completed scraping data for all articles in the current issue.
                    # Update the progress bar and move to the next issue.
                    pbar.update()
        
        # Completed scraping data from all issues
        # Save to .csv files if files were not already created
        if not save_progress:
            df_data = pd.DataFrame(self.article_data)
            df_text = pd.DataFrame(self.article_text)
            df_data.to_csv('data/article_data.csv', index=False)
            df_text.to_csv('data/article_text.csv', index=False)
        if return_data:
            # Return a single DataFrame by merging the dictionaries using unpacking (**)
            return pd.DataFrame({**self.article_data, **self.article_text})
        else:
            return None


    # Load already-scraped data to the MagazineData object
    def load_article_data(
        self,
        return_data: bool=False):
        '''
        Loads data from the .csv files: 'data/article_data.csv' and 'data/article_text.csv'.

        If `return_data` is `False` (default), the data is loaded as attributes of the MagazineData
        object but no data is returned. If `True`, the data is loaded as attributes of the MagazineData
        object _and_ returned as a single pandas DataFrame combining both datasets.
        '''

        if os.path.exists('data/article_data.csv'):
            if not self.article_data:
                self.article_data = pd.read_csv('data/article_data.csv').to_dict(orient='list')
                print("article_data.csv file found; data loaded to `article_data` attribute of the MagazineData object.")
        else:
            print("Could not find the file 'data/article_data.csv'. Consider running the `get_article_data` method to create the file and scrape article data.")
            return
        
        if os.path.exists('data/article_text.csv'):
            if not self.article_text:
                self.article_text = pd.read_csv('data/article_text.csv').to_dict(orient='list')
                print("article_text.csv file found; data loaded to `article_text` attribute of the MagazineData object.")
        else:
            print("Could not find the file 'data/article_text.csv'. Consider running the `get_article_data` method to create the file and scrape article text.")
            return
        
        if return_data:
            print("Merging article_data and article_text dictionaries and returning a single DataFrame with all data...")
            return pd.DataFrame({**self.article_data, **self.article_text})
        else:
            return None

## Collect data using the `MagazineData` class
This process will likely take 12-15 hours to complete. There are about 2,700 magazine issues and over 67,000 articles. The scraped dataset will be approximately 300 MB.

Note that after scraping and saving the data to .csv files, you can safely delete the `MagazineData` instance (here, I've called it `magdata`) to release its memory.

In [4]:
ROOT_URL = 'https://www.churchofjesuschrist.org'
MAGAZINE_NAMES = ['Liahona', 'Ensign', 'New Era', 'Friend']
MAGAZINE_URLS = [
    'https://www.churchofjesuschrist.org/study/magazines/liahona',
    'https://www.churchofjesuschrist.org/study/magazines/ensign-19712020',
    'https://www.churchofjesuschrist.org/study/magazines/for-the-strength-of-youth/new-era-19712020?lang=eng',
    'https://www.churchofjesuschrist.org/study/magazines/friend']

magdata = MagazineData(
    root_url=ROOT_URL,
    magazine_names=MAGAZINE_NAMES,
    magazine_urls=MAGAZINE_URLS)

In [6]:
magdata.get_article_data(
    save_progress=True,
    crawl_delay=0.0,
    continue_last_saved=True)

Running the get_magazine_issues() function to find issue URLs for each magazine.
magazine_issues.json file found; data loaded to `all_issues` attribute of the MagazineData object.
Now, running the get_article_data function.
article_data.csv file found; data loaded to `article_data` attribute of the MagazineData object.
article_text.csv file found; data loaded to `article_text` attribute of the MagazineData object.
Continuing to scrape article data starting from the last retrieved article.


  0%|          | 0/1091 [00:00<?, ?it/s]

Missed articles during scraping due to server error
- Failed to retrieve article with status code 500 at URL: https://www.churchofjesuschrist.org/study/ensign/1990/04/portraits/ed-rawley-a-steel-grip-on-family-history
- Failed to retrieve article with status code 500 at URL: https://www.churchofjesuschrist.org/study/new-era/2014/09/what-is-the-work-of-salvation
- Failed to retrieve article with status code 500 at URL: https://www.churchofjesuschrist.org/study/new-era/2013/10/come-follow-me
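
The three URLs above still returned HTTP 500 after the retries built into `get_article_data`. One option is to revisit them later with longer pauses between attempts. The sketch below is a hypothetical helper (`fetch_with_backoff` is not part of the `MagazineData` class). It accepts any callable that returns an object with `ok` and `status_code` attributes, such as `requests.Session().get`, and the delay values are assumptions rather than settings taken from this notebook.

```python
import time


def fetch_with_backoff(get, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call get(url) until the response is OK or the attempts run out.

    Hypothetical helper: `get` is any callable returning an object with
    `.ok` and `.status_code` (for example, requests.Session().get).
    The pause doubles after each failed attempt: 1 s, 2 s, 4 s, ...
    Returns the last response, whether or not it succeeded.
    """
    response = get(url)
    for attempt in range(1, max_attempts):
        if response.ok:
            break
        # Wait longer after every failure before trying again
        sleep(base_delay * 2 ** (attempt - 1))
        response = get(url)
    return response


# Example (assumes the `requests` package and network access):
# import requests
# sesh = requests.Session()
# resp = fetch_with_backoff(
#     sesh.get,
#     'https://www.churchofjesuschrist.org/study/new-era/2013/10/come-follow-me')
# print(resp.status_code)
```

Passing the request function as an argument keeps the helper independent of any one HTTP library and makes it easy to try with a stand-in function.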

In [7]:
print(f"Count of articles with text extracted: {len(pd.read_csv('data/article_text.csv').index)}")
print(f"Count of articles with data extracted: {len(pd.read_csv('data/article_data.csv').index)}")

Count of articles with text extracted: 67143
Count of articles with data extracted: 67143
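
Matching row counts suggest the two files are in step, but they don't prove that each row refers to the same article. The sketch below is one way to check that, assuming both files have a header row with a `url` column (as the `article_data` and `article_text` dictionaries do). `urls_match` is a hypothetical helper that uses only the standard library.

```python
import csv

# Article text fields may exceed the csv module's default
# 131,072-character field limit, so raise it before reading.
csv.field_size_limit(10**9)


def urls_match(data_file, text_file):
    """Return True if both CSV files list the same URLs in the same order.

    Each argument is an open text file (or any iterable of CSV lines)
    whose header row includes a 'url' column.
    """
    data_urls = [row['url'] for row in csv.DictReader(data_file)]
    text_urls = [row['url'] for row in csv.DictReader(text_file)]
    return data_urls == text_urls


# Example with the files written by get_article_data:
# with open('data/article_data.csv', encoding='utf-8', newline='') as f_data, \
#      open('data/article_text.csv', encoding='utf-8', newline='') as f_text:
#     print(urls_match(f_data, f_text))
```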




# Dataset processing: article section
After collecting the data, I noticed that some of the articles with specific sections should be re-classified in the "general" section based on their URL pattern. This post-collection data processing will enhance the quality of the dataset.

After running this code, I'll update the data collection class so any future runs will correctly classify the section of each magazine article; then I'll move this code to the "Experiments" section below.

**What I've learned so far**<br>
One way to determine the section is to open the _contents_ page of each issue (the main URL of the issue) and scrape the links to each article from the `<nav class="tableOfContents-HYfFr">` element on the left of the screen. Each article link (`<a>` element) has a `style` attribute that shows how it relates to the header it falls under. If the attribute is `style="padding-inline-start:16px"`, the article belongs to the "general" section of magazine articles. If the indentation is `32px` or more (for example, `style="padding-inline-start:32px"`), the article's section is that of the nearest section header above it (a `<span class="sectionTitle-_Dn99 sectionLabel-HdHsm">` element that is a parent element to the article).
> Key takeaway: the level of indentation is the surest way to identify the section that an article belongs to; it's difficult to figure out from the URL alone or from the combination of the URL and the series.

Aside from re-scraping the data on each article to determine the section based on its level of indentation, there might not be a sure way to accurately determine the correct section (that is, assign the section "general" rather than a more specific category when the article belongs to the general magazine articles and not to a specific category).
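
The indentation rule described above can be sketched as follows. This is only an illustration: `assign_sections` and the shape of `toc_entries` are hypothetical, and the sketch assumes the table-of-contents entries have already been extracted from the page in order. Treating a link with no recognizable padding as `16px` is also an assumption.

```python
import re


def assign_sections(toc_entries):
    """Assign a section to each article link in a table of contents.

    `toc_entries` is a hypothetical, already-extracted list of tuples in
    page order: ('header', label, None) for a section header, or
    ('link', href, style) for an article link, where `style` is the
    link's style attribute (e.g. 'padding-inline-start:32px').
    Returns a list of (href, section) pairs.
    """
    results = []
    current_header = None
    for kind, value, style in toc_entries:
        if kind == 'header':
            current_header = value
            continue
        match = re.search(r'padding-inline-start:\s*(\d+)px', style or '')
        # Assumption: a link without a recognizable padding is top-level
        indent = int(match.group(1)) if match else 16
        if indent >= 32 and current_header is not None:
            # Indented link: it belongs to the nearest header above it
            results.append((value, current_header))
        else:
            results.append((value, 'general'))
    return results
```

Because the section comes from the page layout rather than the URL, this approach would avoid the guesswork described above, at the cost of requesting each issue's contents page again.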

The code below is my best attempt to clean up many of the incorrect section labels.

Logic:
- If `section` is not 'general'
- AND `series` is `NaN`
- AND the `url` has no forward slashes (`/`) after the issue segment (which is formatted as '2021/01/article_title', '1987/12/article_title', or '2017/11-se/article_title'), that is, the URL contains exactly 7 forward slashes. Seven is the minimum: every article in the "general" section has 7, and so do many articles that belong to a specific section.
- THEN reclassify the article's section as 'general' and keep track of what its section was, so the articles that follow can be checked.
- If a later article has the same section as one that was just changed to 'general', change that article's section to 'general' as well, since articles don't seem to switch back and forth between the 'general' category and a more specific category (from the samples I've looked at so far, that holds true).

This should correctly update the section classification for most articles that were misclassified before.
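
As a quick illustration of the slash-count rule, the snippet below counts the forward slashes in two URLs taken from the dataset. `slashes_after_issue` is a hypothetical helper added only for this example.

```python
def slashes_after_issue(url):
    """Count the forward slashes beyond the 7 found in a top-level article URL."""
    return url.count('/') - 7


general_url = 'https://www.churchofjesuschrist.org/study/liahona/2021/01/hear-him'
section_url = ('https://www.churchofjesuschrist.org/study/liahona/2021/01/'
               'latter-day-saint-voices/a-better-choice')

print(general_url.count('/'), slashes_after_issue(general_url))   # 7 0
print(section_url.count('/'), slashes_after_issue(section_url))   # 8 1
```

The extra path segment (`latter-day-saint-voices/`) is what raises the count to 8 for an article nested under a specific section.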

In [51]:
# Load the dataset
df_data = pd.read_csv('data/article_data.csv', low_memory=False)

# Ensure entire URLs can be viewed
pd.options.display.max_colwidth = None
# Reset to default
# pd.reset_option('display.max_colwidth')

# Count the number of forward slashes in each URL
df_data['slash_count'] = df_data['url'].str.count(r'/')

In [54]:
pd.options.display.max_rows = 100

In [55]:
print(f"Number of articles in the 'general' category: {df_data[(df_data.section=='general')]['section'].count():,.0f}/{len(df_data.index):,.0f}")
df_data[['url', 'section', 'series', 'slash_count']].head(100)

Number of articles in the 'general' category: 22,399/67,143


Unnamed: 0,url,section,series,slash_count
0,https://www.churchofjesuschrist.org/study/liahona/2021/01/pointing-us-all-to-jesus-christ,general,,7
1,https://www.churchofjesuschrist.org/study/liahona/2021/01/hear-him,general,Hear Him,7
2,https://www.churchofjesuschrist.org/study/liahona/2021/01/a-new-publication-for-a-worldwide-church,general,Welcome to This Issue,7
3,https://www.churchofjesuschrist.org/study/liahona/2021/01/grow-into-the-principle-of-revelation,general,,7
4,https://www.churchofjesuschrist.org/study/liahona/2021/01/god-speaks-to-us-today,general,Gospel Basics,7
5,https://www.churchofjesuschrist.org/study/liahona/2021/01/the-power-of-personal-revelation,general,,7
6,https://www.churchofjesuschrist.org/study/liahona/2021/01/latter-day-saint-voices/a-better-choice,Latter-day Saint Voices,Latter-day Saint Voices,8
7,https://www.churchofjesuschrist.org/study/liahona/2021/01/latter-day-saint-voices/deneto-forde-saint-catherine-jamaica,Latter-day Saint Voices,Latter-day Saint Voices,8
8,https://www.churchofjesuschrist.org/study/liahona/2021/01/latter-day-saint-voices/the-anchor-of-my-life-and-faith,Latter-day Saint Voices,Latter-day Saint Voices,8
9,https://www.churchofjesuschrist.org/study/liahona/2021/01/latter-day-saint-voices/home-centered-church-away-from-home,Latter-day Saint Voices,Latter-day Saint Voices,8


In [56]:
current_section = 'general'
update_following_labels = False

for n, row in enumerate(df_data.itertuples()):
    if row.section != 'general':
        # Check the next rows that have the same section,
        # since they will also need to be updated to 'general'
        # until the section changes.
        if update_following_labels:
            if row.section == current_section:
                # Change this section to 'general' since
                # it is in the same section as a previous
                # article that was reclassified as 'general'
                df_data.loc[n, 'section'] = 'general'
                continue
            else:
                # Update the current section
                current_section = row.section
                # Don't change the next labels to 'general' since we're in a new section
                update_following_labels = False
        
        if pd.isna(row.series):
            if row.slash_count == 7:
                # Update the section
                df_data.loc[n, 'section'] = 'general'
                current_section = row.section
                update_following_labels = True

print(f"Number of articles in the 'general' category: {df_data[(df_data.section=='general')]['section'].count():,.0f}/{len(df_data.index):,.0f}")
df_data[['url', 'section', 'series', 'slash_count']].head(100)

Number of articles in the 'general' category: 41,495/67,143


Unnamed: 0,url,section,series,slash_count
0,https://www.churchofjesuschrist.org/study/liahona/2021/01/pointing-us-all-to-jesus-christ,general,,7
1,https://www.churchofjesuschrist.org/study/liahona/2021/01/hear-him,general,Hear Him,7
2,https://www.churchofjesuschrist.org/study/liahona/2021/01/a-new-publication-for-a-worldwide-church,general,Welcome to This Issue,7
3,https://www.churchofjesuschrist.org/study/liahona/2021/01/grow-into-the-principle-of-revelation,general,,7
4,https://www.churchofjesuschrist.org/study/liahona/2021/01/god-speaks-to-us-today,general,Gospel Basics,7
5,https://www.churchofjesuschrist.org/study/liahona/2021/01/the-power-of-personal-revelation,general,,7
6,https://www.churchofjesuschrist.org/study/liahona/2021/01/latter-day-saint-voices/a-better-choice,Latter-day Saint Voices,Latter-day Saint Voices,8
7,https://www.churchofjesuschrist.org/study/liahona/2021/01/latter-day-saint-voices/deneto-forde-saint-catherine-jamaica,Latter-day Saint Voices,Latter-day Saint Voices,8
8,https://www.churchofjesuschrist.org/study/liahona/2021/01/latter-day-saint-voices/the-anchor-of-my-life-and-faith,Latter-day Saint Voices,Latter-day Saint Voices,8
9,https://www.churchofjesuschrist.org/study/liahona/2021/01/latter-day-saint-voices/home-centered-church-away-from-home,Latter-day Saint Voices,Latter-day Saint Voices,8


# How many topics should I use?
> Answer: use the Gospel Topics section on https://ChurchOfJesusChrist.org/study to find out!

In this section, I scrape article data for topics, to be used for automatically labeling topics during the topic modeling phase of this analysis.

Much of the code below is messy and written in a one-off fashion -- I was exploring an idea more than developing a re-usable solution, so some of the code is redundant and certainly not optimized, but it worked for the task of creating a list of topics.

In [None]:
# Count the number of topics listed on the General Conference Topics page

# import requests
# from bs4 import BeautifulSoup
# URL = "https://www.churchofjesuschrist.org/study/general-conference/topics?lang=eng"

# response = requests.get(URL)
# response.encoding = 'utf-8'
# page = BeautifulSoup(response.text)

# topics = [a.text for a in page.find_all('a', class_='sc-omeqik-0 ksscEV tile-P903U listTile-WHLxI')]
# print(f"Number of topics: {len(topics)}")
# print("\nTopics:")
# print(topics)

# Number of topics: 317

# Topics:
# ['Aaronic Priesthood', 'Adam and Eve', 'Articles of Faith', 'Atonement', ...]

In [3]:
# First, using the Gospel Topics articles
URL = "https://www.churchofjesuschrist.org/study/manual/gospel-topics"

response = requests.get(URL)
response.encoding = 'utf-8'
page = BeautifulSoup(response.text, 'lxml')

gt_topic_links = [a.get('href') for a in page.find('div', class_='body').find('ul', class_='doc-map').find_all('a')]
print(f"There are {len(gt_topic_links)} topics in Gospel Topics.")

# Next, using True to the Faith
URL = "https://www.churchofjesuschrist.org/study/manual/true-to-the-faith"
response = requests.get(URL)
response.encoding = 'utf-8'
page = BeautifulSoup(response.text, 'lxml')

tf_topic_links = [a.get('href') for a in page.find('div', class_='body').find('ul', class_='doc-map').find_all('a')]
print(f"There are {len(tf_topic_links)} topics in True to the Faith.")

There are 224 topics in Gospel Topics.
There are 173 topics in True to the Faith.


Scrape data from topics in [_True to the Faith_](https://www.churchofjesuschrist.org/study/manual/true-to-the-faith) to use for automatic topic labeling after performing LDA or NMF.

Most of this code is adapted from the "Data Scraping > Data collection class" section at the beginning of this notebook (see the `get_article_data` method of that class).
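
The labeling step itself is not part of this notebook. As a rough idea of how the scraped reference text could be used, the sketch below picks the reference topic whose text is most similar to a topic's top words, using a simple bag-of-words cosine similarity. All function names here are hypothetical, and this is only one possible approach, not necessarily the one used later in the analysis.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r'[a-z]+', text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_label(top_words, reference_texts):
    """Return the reference title whose text is most similar to the top words.

    `top_words` is a list of a topic's highest-weighted words (e.g., from
    LDA or NMF); `reference_texts` maps a title to its text, like the
    'topic' and 'text' columns saved in data/topics_tf.csv.
    """
    query = bag_of_words(' '.join(top_words))
    return max(
        reference_texts,
        key=lambda title: cosine(query, bag_of_words(reference_texts[title])))


# Example, assuming `topics_df` was loaded from data/topics_tf.csv and
# using an illustrative list of top words:
# reference_texts = dict(zip(topics_df['topic'], topics_df['text']))
# best_label(['faith', 'believe', 'trust'], reference_texts)
```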

In [4]:
# Create a .csv file to store the data
with open('data/topics_tf.csv', mode='w', encoding='utf-8', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['url', 'topic', 'text'])

ROOT_URL = "https://www.churchofjesuschrist.org"
response = requests.get(ROOT_URL + "/study/manual/true-to-the-faith")
response.encoding = 'utf-8'
page = BeautifulSoup(response.text, 'lxml')

skip_topics = ['Title Page', 'Message from the First Presidency']
# When scraping article text within the <div class="body-block"> by
# looping through each direct child element in body_block.findChildren(recursive=False),
# if an element is an <aside>, <figure>, or <figcaption> tag,
# or if an element is a <div> tag with class "credit" or "imageWrapper...",
# ignore its text since it isn't part of the article's main body text.
tags_to_omit = ['aside', 'figure', 'figcaption']
classes_to_omit = ['credit', re.compile('imageWrapper.*')]

# When scraping image alt text, the text is prefixed
# with "Image", which I later strip out when saving the data.
img_prefix_chars = len('Image')

topic_links = []
topic_titles = []
for a in page.find('div', class_='body').find('ul', class_='doc-map').find_all('a'):
    if a.text not in skip_topics:
        topic_links.append(ROOT_URL + a.get('href'))
        topic_titles.append(a.text.strip())

total_topics = len(topic_links)
pbar = tqdm(total=total_topics)

for n, url in enumerate(topic_links):
    response = requests.get(url)
    response.encoding = 'utf-8'
    page = BeautifulSoup(response.text, 'lxml')
    body = page.find('div', class_='body')
    body_block = body.find('div', class_='body-block')

    pbar.set_description(f"Working on topic {n + 1}/{total_topics}, {topic_titles[n]}")

    title = body.find(id='title1')
    if title:
        title = title.text.strip()

    # Use the .text attribute of the container that holds
    # the image, which is a <div> element with class that begins with "imageWrapper"
    images = body.find_all(class_=re.compile('imageWrapper.*'))
    if images:
        image_alt_text = [img.text.strip() for img in images]
    else:
        image_alt_text = None
        
    image_captions = body.find_all('figcaption')
    if image_captions:
        image_captions = [fig.text.strip() for fig in image_captions]
    else:
        image_captions = None

    # The code below applies filters to refrain from collecting
    # text that is not part of the article (image captions, section headings, asides).
    text = []
    for ch in body_block.findChildren(recursive=False):
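        ch_class = ch.get('class')  # -> a list, or None
        # classes_to_omit holds both plain strings and compiled regex patterns,
        # so test each class name against both kinds of entry.
        skip_class = any(
            omit.match(cc) if isinstance(omit, re.Pattern) else cc == omit
            for cc in (ch_class or [])
            for omit in classes_to_omit)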
        
        if ch.name in tags_to_omit or skip_class:
            continue
        else:
            text.append(ch.text.strip())

    # Combine into a single string, with paragraphs delimited by newlines.
    text = '\n'.join(text)
    # Replace multiple newlines with a single newline.
    text = re.sub(r'[\n]{2,}', '\n', text)

    # Remove unwanted text that was included through child elements
    # (e.g., a section that, along with article text, 
    # contained text from unwanted elements like images, credits, or asides)
    credit_elems = body_block.find_all('div', class_='credit')
    if credit_elems:
        for cr in credit_elems:
            text = text.replace(cr.text.strip(), '', 1)
    
    aside_elems = body_block.find_all('aside')
    if aside_elems:
        for ae in aside_elems:
            text = text.replace(ae.text.strip(), '', 1)

    if image_alt_text:
        for iat in image_alt_text:
            text = text.replace(iat, '', 1)

    if image_captions:
        for ic in image_captions:
            text = text.replace(ic, '', 1)
    
    # Replace multiple whitespace characters with a single space
    text = re.sub(r'[\s]{2,}', ' ', text)
    
    if text:
        # Trim leading newline character, if present
        if text[0] == '\n':
            text = text[1:]
    # else:
    #     # Skip to next article, since there is no text in this article
    #     continue
    
    word_count = len(text.split()) if text else 0

    pbar.update(n=1)

    if word_count > 0:
        # Append this topic's row to the .csv file created at the top of this cell
        with open('data/topics_tf.csv', mode='a', encoding='utf-8', newline='') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow([url, title, text])
    # else:
        # Don't save data; skip this article to move to the next one



  0%|          | 0/171 [00:00<?, ?it/s]

In [5]:
topics_df = pd.read_csv('data/topics_tf.csv')

topics_df['word_count'] = topics_df['text'].apply(lambda x: len(x.split()))
topics_df['word_count'].describe()

count     108.000000
mean      497.500000
std       388.832581
min        44.000000
25%       224.750000
50%       356.500000
75%       629.250000
max      1966.000000
Name: word_count, dtype: float64

Scrape the articles in [_Gospel Topics_](https://www.churchofjesuschrist.org/study/manual/gospel-topics) to use as reference documents for automatically labeling topics after fitting a latent Dirichlet allocation (LDA) or non-negative matrix factorization (NMF) model.

Most of this code is adapted from the "Data Scraping > Data collection class" section at the beginning of this notebook (see the `get_article_data` method of that class).

In [4]:
# Create a .csv file to store the data
with open('data/topics_gt.csv', mode='w', encoding='utf-8', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['url', 'topic', 'text'])

ROOT_URL = "https://www.churchofjesuschrist.org"
response = requests.get(ROOT_URL + "/study/manual/gospel-topics")
response.encoding = 'utf-8'
page = BeautifulSoup(response.text)

skip_topics = ['Contents', 'Introduction to Gospel Topics']
# When scraping article text within the <div class="body-block"> by
# looping through each direct child element in body_block.findChildren(recursive=False),
# if an element is a <button>, <aside>, <figure>, or <figcaption> tag,
# or if an element is a <div> tag with class "credit" or "imageWrapper...",
# ignore its text since it isn't part of the article's main body text.
tags_to_omit = ['button', 'aside', 'figure', 'figcaption']
classes_to_omit = ['credit', re.compile('imageWrapper.*')]

# When scraping image alt text, the text is prefixed
# with "Image", which I later strip out when saving the data.
img_prefix_chars = len('Image')

topic_links = []
topic_titles = []
for a in page.find('div', class_='body').find('ul', class_='doc-map').find_all('a'):
    ttl = a.find('p', class_='title').text.strip()
    if ttl not in skip_topics:
        topic_links.append(ROOT_URL + a.get('href'))
        topic_titles.append(ttl)

# Determine the length of the longest title
# to use for left-justifying the names of
# other titles when displaying on the progress bar
max_title_chars = max((len(item) for item in topic_titles), default=0)

total_topics = len(topic_links)
pbar = tqdm(total=total_topics)

for n, url in enumerate(topic_links):
    response = requests.get(url)
    response.encoding = 'utf-8'
    page = BeautifulSoup(response.text)
    body = page.find('div', class_='body')
    body_block = body.find('div', class_='body-block')

    justified_title = topic_titles[n].ljust(max_title_chars)
    pbar.set_description(f"Working on topic {n + 1}/{total_topics}, {justified_title}")

    title = body.find(id='title1')
    if title:
        title = title.text.strip()

    # Use the .text attribute of the container that holds
    # the image, which is a <div> element with class that begins with "imageWrapper"
    images = body.find_all(class_=re.compile('imageWrapper.*'))
    if images:
        image_alt_text = [img.text.strip() for img in images]
    else:
        image_alt_text = None
        
    image_captions = body.find_all('figcaption')
    if image_captions:
        image_captions = [fig.text.strip() for fig in image_captions]
    else:
        image_captions = None

    # The code below applies filters to refrain from collecting
    # text that is not part of the article (image captions, section headings, asides).
    text = []
    for ch in body_block.findChildren(recursive=False):
        ch_class = ch.get('class')  # -> a list, or None
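        if ch_class:
            # classes_to_omit mixes plain strings and compiled regex patterns,
            # so compare strings by equality and patterns with .match().
            skip_class = any(
                (omit == cc) if isinstance(omit, str) else bool(omit.match(cc))
                for cc in ch_class
                for omit in classes_to_omit
            )
        else:
            skip_class = False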
        
        if ch.name in tags_to_omit or skip_class:
            continue
        else:
            text.append(ch.text.strip())

    # Combine into a single string, with paragraphs delimited by newlines.
    text = '\n'.join(text)

    # Remove unwanted text that was included through child elements
    # (e.g., a section that, along with article text, 
    # contained text from unwanted elements like images, credits, or asides)
    credit_elems = body_block.find_all('div', class_='credit')
    if credit_elems:
        for cr in credit_elems:
            text = text.replace(cr.text.strip(), '', 1)
    
    aside_elems = body_block.find_all('aside')
    if aside_elems:
        for ae in aside_elems:
            text = text.replace(ae.text.strip(), '', 1)

    if image_alt_text:
        for iat in image_alt_text:
            text = text.replace(iat, '', 1)

    if image_captions:
        for ic in image_captions:
            text = text.replace(ic, '', 1)
    
    # Replace multiple whitespace characters with a single space
    text = re.sub(r'[\s]{2,}', ' ', text)
    
    if text:
        # Trim leading newline character, if present
        if text[0] == '\n':
            text = text[1:]
        if text.startswith('Overview'):
            # Remove the "Overview\n" at the beginning of the article
            text = text[len('Overview'):]
        
        # str.find() returns -1 if the text isn't found. Slicing with -1
        # would silently drop the last character, so only truncate on a match.
        # A match at index 0 is ignored because truncating there would leave no text.
        end_of_article = text.find('Related Topics')
        if end_of_article > 0:
            text = text[:end_of_article]

        end_of_article = text.find('Record Your Impressions')
        if end_of_article > 0:
            text = text[:end_of_article]

        # Replace multiple newlines with a single newline.
        text = re.sub(r'[\n]{2,}', r'\n', text)

        end_of_article = text.find('Scriptures Scripture References')
        if end_of_article > 0:
            text = text[:end_of_article]

        # Trim leading and trailing whitespace (including any leading newline).
        # Unlike indexing text[0], this is safe if the text is empty at this point.
        text = text.strip()

        # Prefix topic text with topic name in case 
        # the topic is not mentioned within the text. 
        # Increases score for TFIDF normalization.
        text = f"{title}: {text}"
    
    word_count = len(text.split()) if text else 0

    pbar.update(n=1)

    if word_count > 0:
        # Append this topic as a new row in the .csv file
        with open('data/topics_gt.csv', mode='a', encoding='utf-8', newline='') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow([url, title, text])
    # else:
        # Don't save data; skip this article to move to the next one


  0%|          | 0/223 [00:00<?, ?it/s]
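
One detail in the loop above is worth a note: `classes_to_omit` mixes a plain string with a compiled regular expression, and the `in` operator compares by equality, so a class name such as `imageWrapper-3Xk2` never equals the compiled pattern. The sketch below shows one way to test a class list against both kinds of entries. The helper name `has_omitted_class` and the sample class names are hypothetical, chosen only for illustration.

```python
import re

classes_to_omit = ['credit', re.compile('imageWrapper.*')]

def has_omitted_class(element_classes, omit=classes_to_omit):
    """Return True if any class equals a string entry or matches a regex entry."""
    for cls in element_classes or []:   # tolerate None (element without a class)
        for entry in omit:
            if isinstance(entry, str):
                if cls == entry:
                    return True
            elif entry.match(cls):      # compiled pattern
                return True
    return False

# Plain membership testing misses the regex entry:
print('imageWrapper-3Xk2' in classes_to_omit)    # False
print(has_omitted_class(['imageWrapper-3Xk2']))  # True
print(has_omitted_class(['credit', 'small']))    # True
print(has_omitted_class(['body-text']))          # False
print(has_omitted_class(None))                   # False
```

In the scraping loop, image text is also removed afterward by the `text.replace(...)` calls, so the saved text is likely similar either way; the helper simply makes the skip rule do what its comment describes.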

In [5]:
topics_df = pd.read_csv('data/topics_gt.csv')

topics_df['word_count'] = topics_df['text'].apply(lambda x: len(x.split()))
topics_df['word_count'].describe()

count     223.000000
mean      457.547085
std       583.696188
min         9.000000
25%       172.000000
50%       306.000000
75%       472.500000
max      4146.000000
Name: word_count, dtype: float64

In [12]:
# Consider removing the top 20% longest articles,
# since they tend to be overweighted when creating
# the bag-of-words (vectorized) representation.

# Never mind: I corrected that problem by normalizing the topics' doc-terms matrix
# by the word count of each topic, so the sum across the terms for each topic
# equals 1.

# cutoff_wrdcnt = topics_df['word_count'].quantile(0.80)
# cutoff_wrdcnt

# trimmed_df = topics_df[topics_df['word_count'] < cutoff_wrdcnt]
# trimmed_df.describe()

| | word_count |
|---|---|
| count | 178.0 |
| mean | 253.926966 |
| std | 152.539669 |
| min | 9.0 |
| 25% | 139.5 |
| 50% | 263.5 |
| 75% | 386.5 |
| max | 541.0 |


## Manually adjust the dataset
For example, some topic pages contain only a pointer such as "See ___ in Gospel Topics Essays"; I replace that pointer with the actual text from the Gospel Topics essay.

In [34]:
topics_df.sort_values(by='word_count', axis=0, ascending=True).head(10)

| | url | topic | text | word_count |
|---|---|---|---|---|
| 209 | https://www.churchofjesuschrist.org/study/manu... | Transgression | Violation or breaking of a commandment or law. | 8 |
| 17 | https://www.churchofjesuschrist.org/study/manu... | Becoming Like God | See Becoming Like God in the Gospel Topics Essays | 9 |
| 122 | https://www.churchofjesuschrist.org/study/manu... | Mother in Heaven | See Heavenly Parents and Mother in Heaven (Gos... | 10 |
| 148 | https://www.churchofjesuschrist.org/study/manu... | Priesthood and Race | See Race and the Priesthood in the Gospel Topi... | 10 |
| 100 | https://www.churchofjesuschrist.org/study/manu... | Joseph Smith’s Teachings about Priesthood, Tem... | See Joseph Smith’s Teachings about Priesthood,... | 14 |
| 2 | https://www.churchofjesuschrist.org/study/manu... | Abraham, Book of | See Translation and Historicity of the Book of... | 14 |
| 120 | https://www.churchofjesuschrist.org/study/manu... | Mormons | This is a commonly used term for members of Th... | 17 |
| 177 | https://www.churchofjesuschrist.org/study/manu... | Sealing | An ordinance performed in the temple eternally... | 17 |
| 134 | https://www.churchofjesuschrist.org/study/manu... | Patriarch | A patriarch is a priesthood holder who is orda... | 19 |
| 119 | https://www.churchofjesuschrist.org/study/manu... | Mormonism | A common term used to describe the teachings a... | 19 |


In [74]:
# topics_df = pd.read_csv('data/topics_gt_mod.csv')

In [67]:
# Find the index of a topic
# topics_df['topic'].values.tolist().index('SEARCH TOPIC')

'Family'

In [75]:
# Text to add to the article (topic) text.
manual_adjustment_text = """
THE FAMILY
A PROCLAMATION TO THE WORLD

The First Presidency and Council of the Twelve Apostles of The Church of Jesus Christ of Latter-day Saints

We, the First Presidency and the Council of the Twelve Apostles of The Church of Jesus Christ of Latter-day Saints, solemnly proclaim that marriage between a man and a woman is ordained of God and that the family is central to the Creator’s plan for the eternal destiny of His children.

All human beings—male and female—are created in the image of God. Each is a beloved spirit son or daughter of heavenly parents, and, as such, each has a divine nature and destiny. Gender is an essential characteristic of individual premortal, mortal, and eternal identity and purpose.

In the premortal realm, spirit sons and daughters knew and worshipped God as their Eternal Father and accepted His plan by which His children could obtain a physical body and gain earthly experience to progress toward perfection and ultimately realize their divine destiny as heirs of eternal life. The divine plan of happiness enables family relationships to be perpetuated beyond the grave. Sacred ordinances and covenants available in holy temples make it possible for individuals to return to the presence of God and for families to be united eternally.

The first commandment that God gave to Adam and Eve pertained to their potential for parenthood as husband and wife. We declare that God’s commandment for His children to multiply and replenish the earth remains in force. We further declare that God has commanded that the sacred powers of procreation are to be employed only between man and woman, lawfully wedded as husband and wife.

We declare the means by which mortal life is created to be divinely appointed. We affirm the sanctity of life and of its importance in God’s eternal plan.

Husband and wife have a solemn responsibility to love and care for each other and for their children. “Children are an heritage of the Lord” (Psalm 127:3). Parents have a sacred duty to rear their children in love and righteousness, to provide for their physical and spiritual needs, and to teach them to love and serve one another, observe the commandments of God, and be law-abiding citizens wherever they live. Husbands and wives—mothers and fathers—will be held accountable before God for the discharge of these obligations.

The family is ordained of God. Marriage between man and woman is essential to His eternal plan. Children are entitled to birth within the bonds of matrimony, and to be reared by a father and a mother who honor marital vows with complete fidelity. Happiness in family life is most likely to be achieved when founded upon the teachings of the Lord Jesus Christ. Successful marriages and families are established and maintained on principles of faith, prayer, repentance, forgiveness, respect, love, compassion, work, and wholesome recreational activities. By divine design, fathers are to preside over their families in love and righteousness and are responsible to provide the necessities of life and protection for their families. Mothers are primarily responsible for the nurture of their children. In these sacred responsibilities, fathers and mothers are obligated to help one another as equal partners. Disability, death, or other circumstances may necessitate individual adaptation. Extended families should lend support when needed.

We warn that individuals who violate covenants of chastity, who abuse spouse or offspring, or who fail to fulfill family responsibilities will one day stand accountable before God. Further, we warn that the disintegration of the family will bring upon individuals, communities, and nations the calamities foretold by ancient and modern prophets.

We call upon responsible citizens and officers of government everywhere to promote those measures designed to maintain and strengthen the family as the fundamental unit of society.
"""
topics_df.at[47, "text"] = topics_df.at[47, "text"] + manual_adjustment_text
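
A hard-coded row index breaks silently if the row order of the .csv file ever changes. As a hedged alternative, the sketch below selects the row by topic name. The DataFrame `toy_df` and the shortened `extra_text` are made-up stand-ins for `topics_df` and `manual_adjustment_text`.

```python
import pandas as pd

# Toy stand-in for topics_df; the rows and text are invented for illustration.
toy_df = pd.DataFrame({
    "topic": ["Faith", "Family", "Hope"],
    "text": ["Faith: ...", "Family: See the proclamation.", "Hope: ..."],
})
extra_text = "\nTHE FAMILY\nA PROCLAMATION TO THE WORLD"

# Select the row by topic name instead of by a hard-coded index.
mask = toy_df["topic"] == "Family"
assert mask.sum() == 1, "expected exactly one matching topic"
toy_df.loc[mask, "text"] = toy_df.loc[mask, "text"] + extra_text

print(toy_df.loc[mask, "text"].iloc[0])
```

The assertion guards against a misspelled or duplicated topic name, which would otherwise leave the data unchanged or modify several rows.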

In [84]:
# Prefix topic text with topic name in case 
# the topic is not mentioned within the text. 
# Increases score for TFIDF normalization.

test_data = {
    "url": ["example.org/1", "example.org/2"],
    "topic": ["one", "two"],
    "text": ["This is topic one.", "This is topic two."]
}

test_df = pd.DataFrame(test_data)

def prefix_topic(series):
    x = series.copy()
    x['text'] = f"{x['topic']}: {x['text']}"
    return x

test_df = test_df.apply(prefix_topic, axis=1)
test_df.head()

| | url | topic | text |
|---|---|---|---|
| 0 | example.org/1 | one | one: This is topic one. |
| 1 | example.org/2 | two | two: This is topic two. |
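
The same prefixing can be done without a row-wise `apply`, since pandas concatenates string columns element by element. This is a minimal sketch on the same test data, not a replacement for the function above.

```python
import pandas as pd

test_df = pd.DataFrame({
    "url": ["example.org/1", "example.org/2"],
    "topic": ["one", "two"],
    "text": ["This is topic one.", "This is topic two."],
})

# Column-wise string concatenation; no row-by-row apply needed.
test_df["text"] = test_df["topic"] + ": " + test_df["text"]
print(test_df["text"].tolist())
# ['one: This is topic one.', 'two: This is topic two.']
```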


In [86]:
def prefix_topic(series):
    '''No longer needed; I added the topic prefix in the data collection function.'''
    x = series.copy()
    x['text'] = f"{x['topic']}: {x['text']}"
    return x

topics_df2 = topics_df.copy()
# topics_df2 = topics_df2.apply(prefix_topic, axis=1)
topics_df2['word_count'] = topics_df2['text'].apply(lambda x: len(x.split()))
topics_df2.head()

| | url | topic | text | word_count |
|---|---|---|---|---|
| 0 | https://www.churchofjesuschrist.org/study/manu... | Aaronic Priesthood | Aaronic Priesthood: The priesthood is the eter... | 406 |
| 1 | https://www.churchofjesuschrist.org/study/manu... | Abortion | Abortion: See the Church’s official statement ... | 276 |
| 2 | https://www.churchofjesuschrist.org/study/manu... | Abraham, Book of | Abraham, Book of: The Church of Jesus Christ o... | 2947 |
| 3 | https://www.churchofjesuschrist.org/study/manu... | Abrahamic Covenant | Abrahamic Covenant: Abraham made covenants wit... | 160 |
| 4 | https://www.churchofjesuschrist.org/study/manu... | Abuse | Abuse: Abuse is the mistreatment or neglect of... | 525 |


In [88]:
topics_df2.sort_values(by='word_count', ascending=False).head()

| | url | topic | text | word_count |
|---|---|---|---|---|
| 9 | https://www.churchofjesuschrist.org/study/manu... | Answering Gospel Questions | Answering Gospel Questions: The Lord encourage... | 4146 |
| 206 | https://www.churchofjesuschrist.org/study/manu... | The Manifesto and the End of Plural Marriage | The Manifesto and the End of Plural Marriage: ... | 3893 |
| 137 | https://www.churchofjesuschrist.org/study/manu... | Peace and Violence among 19th-Century Latter-d... | Peace and Violence among 19th-Century Latter-d... | 3632 |
| 17 | https://www.churchofjesuschrist.org/study/manu... | Becoming Like God | Becoming Like God: One of the most common imag... | 3488 |
| 210 | https://www.churchofjesuschrist.org/study/manu... | Translation and Historicity of the Book of Abr... | Translation and Historicity of the Book of Abr... | 2952 |


In [92]:
# Save modified version
topics_df2 = topics_df2.drop(columns=["word_count"])
topics_df2.to_csv('data/topics_gt_mod.csv', index=False)

### Remove unnecessary topics

In [77]:
# Load the modified version to make additional changes (e.g., removing topics)
topics_df = pd.read_csv('data/topics_gt_mod_old.csv')

# The following topics don't represent true topics 
# for magazine articles, but rather are more like terms in a glossary
topics_to_remove = [
    'Aaronic Priesthood',
    'Abraham, Book of',
    'Apostle',
    'Are “Mormons” Christian?',
    'Articles of Faith',
    'Bible, Inerrancy of',
    'Birth Control',
    'Bishop',
    'Book of Mormon and DNA Studies',
    'Book of Mormon Geography',
    'Book of Mormon Translation',
    'Church Finances—Commercial Businesses',
    'Cross',
    'Daughters in My Kingdom',
    'Deacon',
    'Elder',
    'First Presidency',
    'First Vision Accounts',
    'Gambling',
    'Gold Plates',
    'High Council',
    'High Priest',
    'Jesus Christ Chosen as Savior',
    "Joseph Smith’s Teachings about Priesthood, Temple, and Women",
    'Journal of Discourses',
    'Laying On of Hands',
    'Melchizedek Priesthood',
    'Membership Councils',
    'Mormon Church',
    'Mormonism',
    'Mormons',
    'Mother in Heaven',
    'Mountain Meadows Massacre',
    'Noah',
    'Original Sin',
    'Patriarch',
    'Peace and Violence among 19th-Century Latter-day Saints',
    'Plural Marriage in The Church of Jesus Christ of Latter-day Saints',
    'Priest',
    'Priesthood Blessing',
    'Priesthood and Race',
    'Primary',
    'Prison Ministry',
    'Prophecy',
    'Quorum',
    'Quorum of the Twelve Apostles',
    'Race and the Priesthood',
    'Relief Society',
    'Restoration of the Priesthood',
    'Sacrament Meeting',
    'Same-Sex Marriage',
    'Satan',
    'Signs',
    'Spaulding Manuscript',
    'Stake',
    'Tattooing and Body Piercing',
    'Teacher (Aaronic Priesthood)',
    'Standard Works',
    'Telestial Kingdom',
    'Terrestrial Kingdom',
    'The Manifesto and the End of Plural Marriage',
    'Transgression',
    'Translation and Historicity of the Book of Abraham',
    'Unwed Pregnancy',
    'Urim and Thummim'
]

topics_to_remove = [item.lower() for item in topics_to_remove]
# Boolean mask of topics to keep (named keep_mask to avoid shadowing the built-in filter())
keep_mask = topics_df['topic'].apply(lambda x: x.lower() not in topics_to_remove)
topics_df = topics_df[keep_mask.values]

topics_df.to_csv('data/topics_gt_mod.csv', index=False)
print(f"Success! Filtered to {len(topics_df.index)} topics.")

Success! Filtered to 158 topics.
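
The case-insensitive filter can also be written with pandas string methods and `isin`, which avoids the `lambda`. This is a small sketch on a made-up DataFrame (`toy_df`), assuming the removal list has already been lowercased as in the cell above.

```python
import pandas as pd

toy_df = pd.DataFrame({"topic": ["Bishop", "Faith", "MORMONS", "Hope"]})
topics_to_remove = ["bishop", "mormons"]  # already lowercased

# True for rows whose lowercased topic is NOT in the removal list.
keep_mask = ~toy_df["topic"].str.lower().isin(topics_to_remove)
filtered_df = toy_df[keep_mask]

print(filtered_df["topic"].tolist())  # ['Faith', 'Hope']
```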


### Combine similar topics

In [13]:
# Built-in packages
import time                     # measure code execution time
import re                       # text-based pattern matching
import unicodedata              # Unicode character reference (database on character properties)
                                # for normalizing characters. See also unicodedata2: https://pypi.org/project/unicodedata2/

# Third-party packages
import numpy as np              # array-based data storage and computation
import matplotlib.pyplot as plt
import pandas as pd             # tabular data analysis
import seaborn as sns           # statistical data visualization

# Create a doc-terms matrix for the topics, to find out which ones are most similar
from sklearn.feature_extraction.text import CountVectorizer

pd.options.display.precision = 3  # Show 3 decimal places of precision in DataFrames

#### Text pre-processing

The function below lowercases the text, normalizes it (converting Unicode characters to their ASCII equivalents), and replaces non-alphabetic characters with spaces.

In [4]:
def preprocess_text(orig_str):
    # Normalize after lowercasing
    # Third-party method: works with most Latin-looking characters and punctuation.
    # new_str = unidecode.unidecode(orig_str.lower())

    # Built-in normalizing method: fails to convert “” to "" and ‘’ to '',
    # but good enough for my purposes since I remove those characters anyway.
    # TfidfVectorizer also doesn't remove those characters, or numbers either.
    new_str = unicodedata.normalize(
        'NFKD',
        orig_str.lower()
    ).encode(encoding='ascii', errors='ignore').decode()

    # Remove digits and punctuation (and anything else besides lowercase a-z and whitespace)
    new_str = re.sub(pattern=r'[^a-z\s]', repl=' ', string=new_str)
    
    # Replace multiple whitespace chars with a single space
    new_str = re.sub(pattern=r'[\s]{2,}', repl=' ', string=new_str)

    return new_str
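
A short usage example helps show what the function does at the edges. The function is repeated here (with the same steps) so the block runs on its own. One behavior to be aware of: characters with no ASCII equivalent, such as an em dash, are dropped by the `encode(..., errors='ignore')` step before the regex can turn them into spaces, so the words on either side are fused. The variant `preprocess_text_keep_breaks` is a hypothetical alternative, not part of the notebook, that converts Unicode punctuation to spaces first.

```python
import re
import unicodedata

def preprocess_text(orig_str):
    # Same steps as the function above, repeated so this example is self-contained.
    new_str = unicodedata.normalize('NFKD', orig_str.lower())
    new_str = new_str.encode(encoding='ascii', errors='ignore').decode()
    new_str = re.sub(r'[^a-z\s]', ' ', new_str)
    return re.sub(r'[\s]{2,}', ' ', new_str)

def preprocess_text_keep_breaks(orig_str):
    # Hypothetical variant: replace any Unicode punctuation with a space
    # *before* dropping non-ASCII characters, so word boundaries survive.
    decomposed = unicodedata.normalize('NFKD', orig_str.lower())
    spaced = ''.join(
        ' ' if unicodedata.category(ch).startswith('P') else ch
        for ch in decomposed
    )
    ascii_str = spaced.encode(encoding='ascii', errors='ignore').decode()
    ascii_str = re.sub(r'[^a-z\s]', ' ', ascii_str)
    return re.sub(r'[\s]{2,}', ' ', ascii_str)

# Accents are stripped; digits and punctuation become a single space.
print(repr(preprocess_text("Jos\u00e9 read 3 Nephi 11!")))            # 'jose read nephi '

# The em dash (U+2014) is dropped, fusing the neighboring words.
print(repr(preprocess_text("wives\u2014mothers")))                    # 'wivesmothers'
print(repr(preprocess_text_keep_breaks("wives\u2014mothers")))        # 'wives mothers'
```

Whether the fused tokens matter depends on how often such punctuation appears without surrounding spaces in the corpus; with `stop_words` and n-grams in play, a fused token is simply treated as a rare new term.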

### Topic consolidation
Create two DataFrames: one with all topics that will be considered (`full_topics_df`), and one with only the topics that will be kept in the final, cleaned version of the topic articles (`topics_df`).

For each topic that is in `full_topics_df` but not in `topics_df`, the text of that excluded topic will be appended to the text of the most similar topic kept in `topics_df`.

In [78]:
# Load the modified version to make additional changes (e.g., removing topics)
full_topics_df = pd.read_csv('data/topics_gt_mod_old.csv')
print(f"Starting number of topics: {len(full_topics_df.index)}.")

# The following topics are overrepresented in the data and don't represent
# true topics for magazine articles, but rather are more like terms in a glossary
topics_to_remove = [
    # 'Aaronic Priesthood',
    'Abraham, Book of',
    # 'Apostle',
    'Are “Mormons” Christian?',
    'Articles of Faith',
    'Bible, Inerrancy of',
    'Birth Control',
    # 'Bishop',
    'Book of Mormon and DNA Studies',
    'Book of Mormon Geography',
    'Book of Mormon Translation',
    'Church Finances—Commercial Businesses',
    # 'Cross',
    'Daughters in My Kingdom',
    # 'Deacon',
    # 'Elder',
    # 'First Presidency',
    # 'First Vision Accounts',
    'Gambling',
    'Gold Plates',
    'High Council',
    # 'High Priest',
    # 'Jesus Christ Chosen as Savior',
    "Joseph Smith’s Teachings about Priesthood, Temple, and Women",
    'Journal of Discourses',
    'Laying On of Hands',
    # 'Melchizedek Priesthood',
    'Membership Councils',
    'Mormon Church',
    'Mormonism',
    'Mormons',
    # 'Mother in Heaven',
    'Mountain Meadows Massacre',
    'Noah',
    'Original Sin',
    # 'Patriarch',
    'Peace and Violence among 19th-Century Latter-day Saints',
    'Plural Marriage in The Church of Jesus Christ of Latter-day Saints',
    # 'Priest',
    # 'Priesthood Blessing',
    # 'Priesthood and Race',
    # 'Primary',
    'Prison Ministry',
    # 'Prophecy',
    'Quorum',
    # 'Quorum of the Twelve Apostles',
    'Race and the Priesthood',
    # 'Relief Society',
    # 'Restoration of the Priesthood',
    'Sacrament Meeting',
    'Same-Sex Marriage',
    'Satan',
    'Signs',
    'Spaulding Manuscript',
    'Stake',
    'Tattooing and Body Piercing',
    # 'Teacher (Aaronic Priesthood)',
    # 'Standard Works',
    # 'Telestial Kingdom',
    # 'Terrestrial Kingdom',
    'The Manifesto and the End of Plural Marriage',
    'Transgression',
    'Translation and Historicity of the Book of Abraham',
    'Unwed Pregnancy',
    'Urim and Thummim'
]

topics_to_remove = [item.lower() for item in topics_to_remove]
# Boolean mask of topics to keep (named keep_mask to avoid shadowing the built-in filter())
keep_mask = full_topics_df['topic'].apply(lambda x: x.lower() not in topics_to_remove)
full_topics_df = full_topics_df[keep_mask.values]

print(f"After filtering, {len(full_topics_df.index)} topics remain.")

Starting number of topics: 223.
After filtering, 183 topics remain.


In [79]:
print(f"Starting number of topics: {len(full_topics_df.index)}.")

# The topics_to_keep list is meant to group together
# similar topics (e.g., Celestial Kingdom into Kingdoms of Glory)
topics_to_keep = [
    'Addiction',
    'Adversity',
    'Agency and Accountability',
    'Atonement of Jesus Christ',
    'Baptism',
    'Bible',
    'Book of Mormon',
    'Charity',
    'Chastity',
    'Christmas',
    'Citizenship',
    'Communication',
    'Conversion',
    'Covenant',
    'Dating and Courtship',
    'Diversity and Unity in The Church of Jesus Christ of Latter-day Saints',
    'Easter',
    'Education',
    'Emergency Preparedness',
    'Employment',
    'Eternal Life',
    'Faith in Jesus Christ',
    'Family',
    'Family Finances',
    'Family History',
    'Fasting and Fast Offerings',
    'Forgiveness',
    'Gospel',
    'Grace',
    'Gratitude',
    'Grief',
    'Happiness',
    'Health',
    'Heavenly Parents',
    'Holy Ghost',
    'Honesty',
    'Hope',
    'Humility',
    'Jesus Christ',
    'Joseph Smith',
    'Judging Others',
    'Kingdoms of Glory',
    'Love',
    'Marriage',
    'Media',
    'Mercy',
    'Ministering',
    'Modesty',
    'Music',
    'Obedience',
    'Parenting',
    'Peace',
    'Peer Pressure',
    'Plan of Salvation',
    'Prayer',
    'Priesthood',
    'Prophets',
    'Repentance',
    'Restoration of the Church',
    'Revelation',
    'Sabbath Day',
    'Scriptures',
    'Second Coming of Jesus Christ',
    'Self-Reliance',
    'Service',
    'Spiritual Gifts',
    'Teaching the Gospel',
    'Temples',
    'Testimony',
    'Tithing',
    'Vicarious Work',
    'Virtue',
    'Women in the Church',
    'Word of Wisdom',
    'Worship'
]


topics_df = full_topics_df[full_topics_df['topic'].isin(topics_to_keep)]

print(f"After filtering, {len(topics_df.index)} topics remain.")

Starting number of topics: 183.
After filtering, 75 topics remain.


Load data and create doc-terms matrix from the `full_topics_df` DataFrame. Later, that will be consolidated into the abbreviated list of topics.

In [80]:
# print(f"Topic dataset loaded. Number of topics: {len(topics_df.index)}")

NGRAMS = 2

count_vectorizer = CountVectorizer(
    preprocessor = preprocess_text,
    ngram_range = (1, NGRAMS),
    stop_words = 'english'
)

# Fit the doc-terms matrix
# dtm = count_vectorizer.fit_transform(topics_df['text'])
dtm = count_vectorizer.fit_transform(full_topics_df['text'])

# Normalize by word count so each row sums to 1. The .reshape turns the row sums into a column vector used in the divide operation
dtm_normalized = dtm / dtm.sum(axis=1).reshape(-1, 1)

# Check vocabulary size
print(f"\nNumber of docs by number of terms:\n{dtm.shape}")

# View the size, in bytes, of the doc-to-terms matrix
print(f"\nApproximate size in memory: {dtm.data.nbytes:,.0f} bytes")


Number of docs by number of terms:
(183, 34083)

Approximate size in memory: 464,928 bytes
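
To make the normalization step concrete, here is a minimal sketch on a small dense array (the notebook itself works on the sparse matrix returned by `CountVectorizer`; the numbers below are made up). Dividing each row by its total turns raw counts into proportions, so a long document and a short one with the same mix of terms end up with identical rows.

```python
import numpy as np

# Toy term counts: 2 documents x 3 terms (invented numbers).
# The second document is 5x longer but has the same term proportions.
counts = np.array([[2, 1, 1],
                   [10, 5, 5]], dtype=float)

# keepdims=True keeps the row sums as a column vector of shape (2, 1),
# which broadcasts across the columns in the division.
row_totals = counts.sum(axis=1, keepdims=True)
normalized = counts / row_totals

print(normalized)              # both rows are [0.5, 0.25, 0.25]
print(normalized.sum(axis=1))  # [1. 1.]
```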


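Compare topic similarity using matrix multiplication: multiplying the normalized doc-terms matrix by its transpose gives the dot product between every pair of topics. (The result is named `corr_matrix` in the code, although it holds dot-product similarity scores rather than statistical correlations.)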

In [81]:
# Matrix multiplication using the @ operator.
corr_matrix = dtm_normalized @ dtm_normalized.transpose()
corr_matrix = np.array(corr_matrix)
print(f"Shape of correlation matrix: {corr_matrix.shape}")

df_corr = pd.DataFrame(
    data=np.array(corr_matrix),
    index=full_topics_df['topic'].values,
    columns=full_topics_df['topic'].values
)

# Save for later use
# df_corr.to_csv('experiments/topic_correlations_for_consolidation.csv')

# Display heatmap of correlation matrix
# fig, ax = plt.subplots(figsize=(30, 30))
# sns.heatmap(df_corr)
# plt.show()

Shape of correlation matrix: (183, 183)
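
For comparison, the sketch below contrasts the dot product of rows that sum to 1 (the approach in the cell above) with cosine similarity, where each row is scaled to unit Euclidean length first. This is an alternative technique shown for illustration, not what the notebook uses, and the toy counts are made up. With sum-to-1 rows, a document concentrated on a few terms gets larger dot products, so a topic's highest score is not guaranteed to be with itself; with cosine similarity every self-similarity is 1.

```python
import numpy as np

counts = np.array([[4, 0, 0],    # doc A: concentrated on one term
                   [1, 1, 1],    # doc B: spread evenly
                   [2, 1, 0]],   # doc C: in between
                  dtype=float)

# Dot products of rows normalized to sum to 1.
l1 = counts / counts.sum(axis=1, keepdims=True)
dot_sim = l1 @ l1.T

# Cosine similarity: scale rows to unit Euclidean length first.
l2 = counts / np.linalg.norm(counts, axis=1, keepdims=True)
cos_sim = l2 @ l2.T

print(np.round(dot_sim, 3))
print(np.round(cos_sim, 3))

# Doc C scores higher against doc A than against itself in dot_sim...
print(dot_sim[2, 0] > dot_sim[2, 2])   # True
# ...but not in cos_sim, where the diagonal is always the row maximum.
print(np.argmax(cos_sim, axis=1))      # [0 1 2]
```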


In [82]:
# Find the most similar topics for each topic

df_similar_topics = pd.DataFrame(
    data = np.argsort(corr_matrix, axis=1)[:, ::-1], # reverse the order of the sorted columns so the max ones are first
    index = full_topics_df['topic'].values
).apply(
    # Replace the index number (from argsort) with the word
    lambda s: pd.Series([full_topics_df['topic'].values[i] for i in s]),
    axis = 1)

df_similar_topics.to_csv('experiments/similar_topics.csv')

# Add score
# df_similar_topics = df_similar_topics.join(
#     pd.DataFrame(
#         data = np.sort(corr_matrix, axis=1)[:, ::-1], # reverse the order of the sorted columns so the max ones are first
#         index = full_topics_df['topic'].values
#     ),
#     rsuffix='_score'
# )

# df_similar_topics.to_csv('experiments/similar_topics_with_scores.csv')
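
The ranking step can be easier to follow on a tiny example. The sketch below uses a made-up 3x3 similarity matrix and hypothetical labels; it reverses the ascending `argsort` order and maps indices to labels with NumPy fancy indexing instead of a row-wise `apply`.

```python
import numpy as np
import pandas as pd

labels = np.array(['Faith', 'Hope', 'Charity'])
sim = np.array([[1.0, 0.2, 0.5],
                [0.2, 1.0, 0.7],
                [0.5, 0.7, 1.0]])

# argsort sorts ascending, so reverse each row to put the highest score first.
order = np.argsort(sim, axis=1)[:, ::-1]

# Indexing the label array with the index matrix maps every index to its label.
ranked = pd.DataFrame(labels[order], index=labels)
print(ranked)
```

Each row of `ranked` lists the labels from most to least similar; for example, the row for `Hope` reads `Hope`, `Charity`, `Faith`.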

Combine text from excluded articles based on their most-similar included article.

In [None]:
final_topic_list = {k: [k] for k in topics_df['topic'].values}

for row in df_similar_topics.itertuples():
    # row[0] is the index (the topic word), and row[1:] contains the next most similar topics
    if row[0] not in final_topic_list:
        # This topic is excluded from the final topic list; append it to the next most similar topic that is included
        for t in row[1:]:
            if t in final_topic_list:
                final_topic_list[t].append(row[0])
                break

# View the topics included in the final topic list
final_topic_list
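
Once a mapping like `final_topic_list` exists, merging the text is a lookup and a join. The sketch below is one possible way to do it, assuming the member texts are simply concatenated with newlines; the DataFrame `toy_full_df`, the mapping `toy_topic_list`, and the placeholder texts are all made up for illustration.

```python
import pandas as pd

toy_full_df = pd.DataFrame({
    "topic": ["Marriage", "Divorce", "Media", "Movies and Television"],
    "text": ["Marriage: a", "Divorce: b", "Media: c", "Movies and Television: d"],
})
toy_topic_list = {
    "Marriage": ["Marriage", "Divorce"],
    "Media": ["Media", "Movies and Television"],
}

# Index the text by topic name so each group's members can be looked up directly.
text_by_topic = toy_full_df.set_index("topic")["text"]

consolidated = pd.DataFrame({
    "topic": list(toy_topic_list),
    "text": ["\n".join(text_by_topic[members])
             for members in toy_topic_list.values()],
})
print(consolidated["text"].tolist())
# ['Marriage: a\nDivorce: b', 'Media: c\nMovies and Television: d']
```

Because the kept topic is listed first in each group, its own text leads the merged document.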

Using the similarity-based groupings above as a starting point, I now make a few manual adjustments to the categories.

In [92]:
final_topic_list = {
    'Addiction': [
        'Addiction',
        'Pornography'],
    'Adversity': ['Adversity'],
    'Agency and Accountability': [
        'Agency and Accountability',
        'Temptation'],
    'Atonement of Jesus Christ': [
        'Atonement of Jesus Christ',
        'Cross',
        'Fall of Adam and Eve',
        'Sacrifice',
        'Salvation'],
    'Baptism': ['Baptism'],
    'Bible': [
        'Bible',
        'New Testament',
        'Old Testament'],
    'Book of Mormon': ['Book of Mormon'],
    'Charity': ['Charity'],
    'Chastity': ['Chastity'],
    'Christmas': ['Christmas'],
    'Citizenship': [
        'Citizenship',
        'Religious Freedom'],
    'Communication': ['Communication'],
    'Conversion': ['Conversion'],
    'Covenant': [
        'Covenant', 
        'Abrahamic Covenant', 
        'Endowment',
        'Ordinances'],
    'Dating and Courtship': ['Dating and Courtship'],
    'Diversity and Unity in The Church of Jesus Christ of Latter-day Saints': [
        'Diversity and Unity in The Church of Jesus Christ of Latter-day Saints',
        'Disabilities',
        'Priesthood and Race',
        'Racial and Cultural Prejudice',
        'Same-Sex Attraction',
        'Single Adult Members of the Church',
        'Transgender',
        'Unity',
        'War'],
    'Easter': [
        'Easter', 
        'Resurrection'],
    'Education': ['Education'],
    'Emergency Preparedness': [
        'Emergency Preparedness',
        'Food Storage',
        'Emergency Response'],
    'Employment': ['Employment'],
    'Eternal Life': ['Eternal Life'],
    'Faith in Jesus Christ': [
        'Faith in Jesus Christ', 
        'Answering Gospel Questions'],
    'Family': [
        'Family', 
        'Abortion',
        'Family Councils',
        'Home Evening',
        'Single-Parent Families'],
    'Family Finances': [
        'Family Finances', 
        'Debt'],
    'Family History': ['Family History'],
    'Fasting and Fast Offerings': ['Fasting and Fast Offerings'],
    'Forgiveness': ['Forgiveness'],
    'Gospel': ['Gospel'],
    'Grace': [
        'Grace',
        'Abuse',
        'Miracles',
        'Suicide'],
    'Gratitude': ['Gratitude'],
    'Grief': ['Grief'],
    'Happiness': ['Happiness'],
    'Health': ['Health'],
    'Heavenly Parents': [
        'Heavenly Parents',
        'Mother in Heaven',
        'God the Father',
        'Spirit',
        'Spirit Children of Heavenly Parents'],
    'Holy Ghost': [
        'Holy Ghost',
        'Godhead',
        'Light of Christ',
        'Spiritual Experiences'],
    'Honesty': ['Honesty'],
    'Hope': ['Hope'],
    'Humility': ['Humility'],
    'Jesus Christ': ['Jesus Christ'],
    'Joseph Smith': [
        'Joseph Smith',
        'First Vision',
        'First Vision Accounts'],
    'Judging Others': ['Judging Others'],
    'Kingdoms of Glory': [
        'Kingdoms of Glory',
        'Celestial Kingdom',
        'Heaven',
        'Telestial Kingdom',
        'Terrestrial Kingdom'],
    'Love': ['Love'],
    'Marriage': [
        'Marriage',
        'Divorce'],
    'Media': [
        'Media',
        'Movies and Television'],
    'Mercy': [
        'Mercy',
        'Justice'],
    'Ministering': [
        'Ministering',
        'Bishop'],
    'Modesty': [
        'Modesty',
        'Profanity'],
    'Music': ['Music'],
    'Obedience': [
        'Obedience',
        'Stewardship',
        'Ten Commandments'],
    'Parenting': [
        'Parenting',
        'Adoption',
        'Primary',
        'Sex Education and Behavior'],
    'Peace': [
        'Peace',
        'Conscience'],
    'Peer Pressure': ['Peer Pressure'],
    'Plan of Salvation': [
        'Plan of Salvation',
        'Becoming Like God',
        'Council in Heaven',
        'Creation',
        'Environmental Stewardship and Conservation',
        'Death, Physical',
        'Death, Spiritual',
        'Foreordination',
        'Hell',
        'Immortality',
        'Jesus Christ Chosen as Savior',
        'Millennium',
        'Mortality',
        'Paradise',
        'Postmortality',
        'Premortality',
        'Soul',
        'Spirit World',
        'War in Heaven'],
    'Prayer': ['Prayer'],
    'Priesthood': [
        'Priesthood',
        'Aaronic Priesthood',
        'Deacon',
        'Elder',
        'High Priest',
        'Patriarch',
        'Priest',
        'Priesthood Blessing',
        'Melchizedek Priesthood',
        'Teacher (Aaronic Priesthood)'],
    'Prophets': [
        'Prophets',
        'Apostle',
        'First Presidency',
        'Prophecy',
        'Quorum of the Twelve Apostles'],
    'Repentance': [
        'Repentance',
        'Sin'],
    'Restoration of the Church': [
        'Restoration of the Church',
        'Apostasy',
        'Dispensations',
        'Doctrine and Covenants',
        'Pearl of Great Price',
        'Restoration of the Priesthood',
        'Zion'],
    'Revelation': [
        'Revelation',
        'Patriarchal Blessings'],
    'Sabbath Day': [
        'Sabbath Day',
        'Sacrament'],
    'Scriptures': [
        'Scriptures',
        'Standard Works'],
    'Second Coming of Jesus Christ': ['Second Coming of Jesus Christ'],
    'Self-Reliance': [
        'Self-Reliance',
        'Gardening'],
    'Service': ['Service'],
    'Spiritual Gifts': ['Spiritual Gifts'],
    'Teaching the Gospel': [
        'Teaching the Gospel',
        'Missionary Work'],
    'Temples': [
        'Temples',
        'Garments',
        'Sealing'],
    'Testimony': [
        'Testimony',
        'Witness'],
    'Tithing': [
        'Tithing',
        'Church Councils',
        'Consecration'],
    'Vicarious Work': [
        'Vicarious Work',
        'Baptisms for the Dead',
        'Proxy Baptism'],
    'Virtue': ['Virtue'],
    'Women in the Church': [
        'Women in the Church',
        'Relief Society'],
    'Word of Wisdom': ['Word of Wisdom'],
    'Worship': [
        'Worship',
        'Reverence']
 }

Create a new DataFrame with the final topic list, and save to file: `data/topics_gt_mod.csv` to indicate that this is a modified version of the topics list from the Gospel Topics articles.

In [106]:
# Set index of full_topics_df to 'topic' so items can be accessed more succinctly
full_topics_df = full_topics_df.set_index('topic')

data_dict = {
    'url': topics_df['url'].values.tolist(),
    'topic': topics_df['topic'].values.tolist(),
    'text': []
}

for subtopics in final_topic_list.values():
    # Join the text of every subtopic in the group, separated by newlines
    combined_text = '\n'.join(
        full_topics_df.at[subtopic, 'text'] for subtopic in subtopics)
    # Add the combined text to its position in the data dictionary
    data_dict['text'].append(combined_text.strip())

# Create DataFrame from dictionary
final_topics_df = pd.DataFrame.from_dict(data_dict)

# Save resulting topics list
final_topics_df.to_csv('data/topics_gt_mod.csv', index=False)
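
The merge step above can be illustrated on toy data. This is a minimal sketch that uses plain dictionaries instead of DataFrames; the texts are invented for illustration, and `merge_subtopic_text` is a hypothetical helper that is not part of the notebook.

```python
def merge_subtopic_text(topic_groups: dict, topic_text: dict) -> dict:
    """Combine the text of each subtopic into its parent topic."""
    merged = {}
    for parent, subtopics in topic_groups.items():
        # Join subtopic texts with newlines, in the order they are listed
        merged[parent] = '\n'.join(topic_text[sub] for sub in subtopics).strip()
    return merged


# Toy data (invented for illustration)
toy_groups = {'Easter': ['Easter', 'Resurrection'], 'Education': ['Education']}
toy_text = {
    'Easter': 'Text about Easter.',
    'Resurrection': 'Text about resurrection.',
    'Education': 'Text about education.',
}

merged = merge_subtopic_text(toy_groups, toy_text)
# There is one merged entry per parent topic, so the counts must match
assert len(merged) == len(toy_groups)
print(merged['Easter'])
```

A length check like the one above may be worth running before building the DataFrame, since the `url`, `topic`, and `text` lists in `data_dict` need to have the same number of entries.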

---

# <strong><span style="color: rgba(220, 85, 80, 1.0)"> 🛑 STOP HERE</span></strong>

When running the entire notebook, stop before this section to save computation time.

I solve problems by experimenting: exploring ideas and testing different approaches. I usually note what I learned and save the code in case I want to refer to it later, but only the best parts make it into the finished analysis in the sections above.

The sections below document my problem-solving process and explain what worked and what didn't. Review the code below to understand the steps I carried out before arriving at the finished code above, but **comment out the sections below if you plan to "Run All" cells in the notebook from top to bottom** to avoid redundant computation on problems that have already been explored and resolved.

# Experiments: data scraping

Test for scraping image alt text

In [8]:
url = 'https://www.churchofjesuschrist.org/study/liahona/2021/01/pointing-us-all-to-jesus-christ?lang=eng'
test_response = requests.get(url)
test_response.encoding = 'utf-8'
# Name the parser explicitly so results do not depend on which parsers are installed
test_page = BeautifulSoup(test_response.text, 'lxml')

body_block = test_page.find('div', class_='body-block')
img_alt_text = []
prefix_chars = len('Image')
for img in body_block.find_all(class_=re.compile('imageWrapper.*')):
    img_alt_text.append(img.text.strip()[prefix_chars:])

# img_alt_text = '\n'.join(img_alt_text)
# img_alt_text = re.sub('[\n]{2,}', '\n', img_alt_text)
print(img_alt_text)

['Liahona', 'Human Race', 'First Presidency 2018 Official Portraits Photography', 'Smartphone', 'Taking Notes']


Test for scraping article body text

In [None]:
# Alternate test page (uncomment to use it instead of the one below)
# url = 'https://www.churchofjesuschrist.org/study/liahona/2021/04/united-states-and-canada-section/finding-blessings-in-tragedy?lang=eng'
url = 'https://www.churchofjesuschrist.org/study/liahona/2021/01/lucy-mack-smith-a-faithful-witness'
test_response = requests.get(url)
test_response.encoding = 'utf-8'
test_page = BeautifulSoup(test_response.text, 'lxml')

body = test_page.find('div', class_='body')
body_block = test_page.find('div', class_='body-block')

images = body.find_all(class_=re.compile('imageWrapper.*'))
if images:
    image_alt_text = [img.text.strip() for img in images]
else:
    image_alt_text = None
    
image_captions = body.find_all('figcaption')
if image_captions:
    image_captions = [fig.text.strip() for fig in image_captions]
else:
    image_captions = None

# If the element is an <aside>, <figure>, or <figcaption> tag,
# or if the element is a <div> tag with class "credit" or "imageWrapper...",
# ignore its text since it isn't part of the article's main body text.
tags_to_omit = ['aside', 'figure', 'figcaption']
classes_to_omit = ['credit', re.compile('imageWrapper.*')]
text = []

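for ch in body_block.find_all(recursive=False):
    ch_class = ch.get('class') or []  # -> a list of class names (empty if none)

    # Compare plain strings with ==, but match compiled patterns with .match();
    # `cc in classes_to_omit` would never match the compiled pattern.
    skip_class = any(
        omit.match(cc) if isinstance(omit, re.Pattern) else omit == cc
        for cc in ch_class
        for omit in classes_to_omit)

    if ch.name in tags_to_omit or skip_class:
        continue
    text.append(ch.text.strip())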

# Combine into a single string, with paragraphs delimited by newlines.
text = '\n'.join(text)
# Replace multiple newlines with a single newline.
text = re.sub('[\n]{2,}', '\n', text)

# Remove unwanted text that was included through child elements
# (e.g., a section that, along with text, contained unwanted child elements like images or credits)
credit_elems = body_block.find_all('div', class_='credit')
if credit_elems:
    for cr in credit_elems:
        text = text.replace(cr.text.strip(), '')

aside_elems = body_block.find_all('aside')
if aside_elems:
    for ae in aside_elems:
        text = text.replace(ae.text.strip(), '')

if image_alt_text:
    for iat in image_alt_text:
        text = text.replace(iat, '')

if image_captions:
    for ic in image_captions:
        text = text.replace(ic, '')
# Replace multiple whitespace characters with a single space
# (raw string, so that \s is not treated as an invalid string escape)
text = re.sub(r'\s{2,}', ' ', text)
print(text)
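
One detail of the filtering logic is worth isolating: a compiled pattern stored in a list is not matched by Python's `in` operator, which only tests equality. The sketch below shows one way to apply a mixed list of literal class names and patterns without needing a live page. `should_skip` is a hypothetical helper, and the class name `imageWrapper-abc12` is made up for the example.

```python
import re

TAGS_TO_OMIT = {'aside', 'figure', 'figcaption'}
CLASSES_TO_OMIT = ['credit', re.compile(r'imageWrapper.*')]


def should_skip(tag_name, class_list):
    """Return True if an element should be left out of the body text."""
    if tag_name in TAGS_TO_OMIT:
        return True
    for class_name in class_list or []:
        for rule in CLASSES_TO_OMIT:
            if isinstance(rule, re.Pattern):
                # Compiled patterns must be matched, not compared with ==
                if rule.match(class_name):
                    return True
            elif class_name == rule:
                return True
    return False


print(should_skip('div', ['imageWrapper-abc12']))  # matched by the pattern
print(should_skip('p', None))                      # ordinary paragraph, no class
```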

Test for how to delete items from a dictionary, which is used in the logic for implementing the `continue_last_saved` feature (i.e., picking up where the scraping left off).

In [171]:
test_dict = {
    'Liahona': {
        '2021': {'url': 'https://www.example.org/'},
        '2020': {'url': 'https://www.example.org/'}
    },
    'Friend': {
        '2021': {'url': 'https://www.example.org/'},
        '2020': {'url': 'https://www.example.org/'}
    },
}

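# Use a deep copy so that deleting nested keys does not also modify test_dict
# (dict.copy() is shallow: the inner dictionaries would be shared)
import copy
new_test_dict = copy.deepcopy(test_dict)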

test_index = list(test_dict['Liahona']).index('2020')

del_years = []
for yr_num in test_dict['Liahona']:
    if list(test_dict['Liahona']).index(yr_num) < test_index:
        del_years.append(yr_num)

for dy in del_years:
    del new_test_dict['Liahona'][dy]

print(new_test_dict)

{'Liahona': {'2020': {'url': 'https://www.example.org/'}}, 'Friend': {'2021': {'url': 'https://www.example.org/'}, '2020': {'url': 'https://www.example.org/'}}}
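
When deleting nested keys, the kind of copy matters. The following sketch, with toy data and placeholder URLs, contrasts `dict.copy()` (shallow) with `copy.deepcopy()`; it is an illustration only and not part of the scraping code.

```python
import copy

# Toy data with placeholder URLs
issues = {
    'Liahona': {
        '2021': {'url': 'https://www.example.org/'},
        '2020': {'url': 'https://www.example.org/'},
    },
}

shallow = issues.copy()          # new outer dict, but the inner dicts are shared
deep = copy.deepcopy(issues)     # fully independent copy

# Deleting from the deep copy leaves the original untouched
del deep['Liahona']['2021']
intact_after_deep_delete = '2021' in issues['Liahona']

# Deleting from the shallow copy also removes the key from the original
del shallow['Liahona']['2021']
intact_after_shallow_delete = '2021' in issues['Liahona']

print(intact_after_deep_delete, intact_after_shallow_delete)
```

If the original dictionary of issues is still needed after trimming (for example, to restart a scrape from the beginning), a deep copy is the safer choice.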


Test: extracting the month from the URL

In [None]:
all_matches = []
issue_months = []

for url in magdata.all_issues['Liahona']['2017']['issue_urls']:
    # Version using string methods
    q_pos = url.rfind('?')
    # The last forward slash
    start_pos = url.rfind('/')
    issue_months.append(url[start_pos + 1:q_pos])

    # Version using RegEx
    # What this re is doing:
    # starts with (?<=...) four digits followed by a forward slash (/)
    # ends with (?=...) a question mark
    # capture everything in between that start and end point
    pattern = re.compile(r'(?<=[0-9]{4}[/]).+(?=\?)')
    match = re.search(pattern, url)
    if match:
        all_matches.append(match.group(0))

for string_methods_month, month, url in zip(issue_months, all_matches, magdata.all_issues['Liahona']['2017']['issue_urls']):
    print(f"{string_methods_month}\t{month}\t{url}")

01	01	https://www.churchofjesuschrist.org/study/liahona/2017/01?lang=eng
02	02	https://www.churchofjesuschrist.org/study/liahona/2017/02?lang=eng
03	03	https://www.churchofjesuschrist.org/study/liahona/2017/03?lang=eng
04	04	https://www.churchofjesuschrist.org/study/liahona/2017/04?lang=eng
05	05	https://www.churchofjesuschrist.org/study/liahona/2017/05?lang=eng
06	06	https://www.churchofjesuschrist.org/study/liahona/2017/06?lang=eng
07	07	https://www.churchofjesuschrist.org/study/liahona/2017/07?lang=eng
08	08	https://www.churchofjesuschrist.org/study/liahona/2017/08?lang=eng
09	09	https://www.churchofjesuschrist.org/study/liahona/2017/09?lang=eng
10	10	https://www.churchofjesuschrist.org/study/liahona/2017/10?lang=eng
11-se	11-se	https://www.churchofjesuschrist.org/study/liahona/2017/11-se?lang=eng
11	11	https://www.churchofjesuschrist.org/study/liahona/2017/11?lang=eng
12	12	https://www.churchofjesuschrist.org/study/liahona/2017/12?lang=eng
digital	digital	https://www.churchofjesusc
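
Both versions above rely on the URL containing a `?`. Without one, `rfind('?')` returns -1, so the slice drops the last character, and the lookahead in the regular expression never matches. A third option, sketched below with the standard library's `urllib.parse`, separates the path from the query string first. `issue_id_from_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def issue_id_from_url(url: str) -> str:
    """Return the last path segment of an issue URL (e.g., '01' or '11-se')."""
    path = urlsplit(url).path  # the path excludes the query string and fragment
    return path.rstrip('/').rsplit('/', 1)[-1]


urls = [
    'https://www.churchofjesuschrist.org/study/liahona/2017/01?lang=eng',
    'https://www.churchofjesuschrist.org/study/liahona/2017/11-se?lang=eng',
    'https://www.churchofjesuschrist.org/study/liahona/2017/12',  # no query string
]
issue_ids = [issue_id_from_url(u) for u in urls]
print(issue_ids)
```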

In [35]:
# Test retrieving alt text for img tags.
# It appears that the alt text is generated by JavaScript and is not present
# in the HTML returned by the server. Perhaps that helps accommodate multiple languages?

url = 'https://www.churchofjesuschrist.org/study/liahona/2022/08/03_trust-god-and-let-him-prevail?lang=eng'
# url = 'https://www.churchofjesuschrist.org/study/liahona/2008/11/you-know-enough?lang=eng'
# url = 'https://www.churchofjesuschrist.org/study/liahona/1991/02/the-towers-of-chartres?lang=eng'
response = requests.get(url, timeout=None)
response.encoding = 'utf-8'
page = BeautifulSoup(response.text, 'lxml')

image_captions = page.find('div', class_='body').find_all('figcaption')
if image_captions:
    image_captions = [fig.text for fig in image_captions]
    print(image_captions)
else:
    image_captions = None
    print("No captions found")

images = page.find('div', class_='body').find_all('img', {'alt': True})
if images:
    # Keep only non-empty alt attributes
    image_alt_text = [img.get('alt') for img in images if img.get('alt')]
    # Convert to a string representation, delimited by "; "
    # image_alt_text = '; '.join(image_alt_text)
    print(image_alt_text)
    print(images)
else:
    image_alt_text = None
    print("No images found")


# Image alt text is generated by JavaScript; it is not available in the HTML,
# so I am commenting out this section for now.
# The alt text could be retrieved by using Selenium in headless mode or
#  by using the requests-html package and rendering the page before extracting
#  alt text, but Selenium takes more memory than requests, and requests-html
#  does not work in a Jupyter notebook.
# See, for example: https://stackoverflow.com/questions/53469230/beautifulsoup-not-extracting-image-alt-text
# image_alt_text = None
# images = body.find_all('img', alt=True)
# if images:
#     image_alt_text = [
#         img.get('alt') 
#         for img in images]
#     if image_alt_text:
#         # Convert to a string representation delimited by ";"
#         image_alt_text = ';'.join(image_alt_text)

['\nAny suffering “can be made right through the Atonement of Jesus Christ.”\n', '\n“When [God] hath tried me,” Job learned, “I shall come forth as gold.”\n']
[]
[<img alt="" height="119" width="100"/>, <img alt="" height="500" width="500"/>, <img alt="" height="500" width="500"/>]


### Estimate memory requirements

In [None]:
def estimate_memory_usage(magazine_issues: dict, root_url: str, sample_size: int = 10) -> float:
    '''
    Estimates the memory (RAM) needed to store all the data scraped from
    magazine issues listed in the `magazine_issues` dictionary.
    
    Each character of text requires approximately 1 byte when encoded as UTF-8
    (exactly 1 byte for ASCII, 2-4 bytes for other characters),
    in addition to the memory overhead for Python objects. This function simply
    counts the number of characters in a sample of `sample_size` number of
    issues from the `magazine_issues` dictionary, then uses the average number
    of characters per issue to estimate the total number of characters of
    text in all articles.

    Note that this function underestimates memory usage, since it does not
    account for fields to be captured in addition to article text, such as
    author name, article URL, or magazine issue number.

    The estimate serves as a guideline to the user, who determines whether
    to save the scraped data to a .csv file during the web scraping process or
    to hold all data in memory until the end, when a .csv file will be created
    before the memory is released. Due to the relative speed of RAM compared
    with disk access, the in-memory option is much faster, but will result in
    greater memory usage as the function runs. Saving progress to disk at 
    each step reduces the memory footprint (and safeguards against function
    failure partway) at the expense of additional computation time.

    Parameters
    ---
    `magazine_issues`: dict
        The dictionary returned by the function: `get_magazine_issues`
    
    `root_url`: str
        The URL to the domain's main page. Necessary because links on any
        issue or article page are relative links, rather than absolute.

    `sample_size`: int, default=10
        The number of magazine issues to sample prior to making an estimate.
        This number of issues will be sampled for each magazine in the
        `magazine_issues` dictionary.

    Return value
    ---
        Returns a float approximating the number of bytes required to store
        all magazine data in RAM.
    '''

    # Start a web requests session
    sesh = requests.session()

    chars_per_issue = {mag_name: [] for mag_name in magazine_issues}
    articles_per_issue = {mag_name: [] for mag_name in magazine_issues}
    
    # Create progress bar
    pbar = tqdm(total=(sample_size * len(magazine_issues)))
    
    for mag_name in magazine_issues:
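        # Sample years without replacement; cap k so that a magazine with
        #  fewer than `sample_size` years does not raise a ValueError
        years = list(magazine_issues[mag_name])
        sampled_years = random.sample(years, k=min(sample_size, len(years)))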
        for yr_num, yr in enumerate(sampled_years):
            # Select a random issue from that year
            issue_url = random.choice(magazine_issues[mag_name][yr]['issue_urls'])
            issue_chars = 0
            # Open the webpage with the magazine issue's contents,
            #  with no timeout (so Requests waits until the page is returned)
            #  See: https://requests.readthedocs.io/en/master/user/advanced/#timeouts
            issue_page = sesh.get(issue_url, timeout=None)
            issue_page.encoding = 'utf-8'
            # Parse the page's text
            issue_page = BeautifulSoup(issue_page.text, 'lxml')
            # Find links to all articles in the issue
            issue_contents = issue_page.find('nav', {'class': 'manifest'})
            if issue_contents is not None:
                # 💡 Consider using a standard for loop and only prepending
                #  the root_url if the link is a relative link,
                #  that is: a['href'].startswith('/')
                article_links = [
                    root_url + a['href'] 
                    for a in issue_contents.find_all('a') 
                    if a.text != "Contents"]
                articles_per_issue[mag_name].append(len(article_links))
            else:
                print(f"No <nav class=\"manifest\"> tag found on page. {issue_url}")
                # Move to next issue
                continue
            # Access each article to determine its length (in characters)
            for article_num, link in enumerate(article_links):
                pbar.set_description(
                    f"Working on {mag_name}, "
                  + f"issue {yr_num + 1:,d}/{sample_size:,d}, "
                  + f"article {article_num + 1:,d}/{len(article_links):,d}")
                # Pause for 100ms between page requests as a courtesy to reduce server load
                time.sleep(0.1)
                # Open webpage
                response = sesh.get(link, timeout=None)
                # Set the encoding used when returning text from the page
                response.encoding = 'utf-8'
                article_page = BeautifulSoup(response.text, 'lxml')
                article_text = article_page.find('div', class_='body-block')
                if article_text is not None:
                    issue_chars += len(article_text.text.strip())
                else:
                    print(f"No <div class=\"body-block\"> tag found on page. Could not load text from article: {link}")
                   
            # Add the number of characters in this issue to the running list
            chars_per_issue[mag_name].append(issue_chars)
            pbar.update(n=1)
    
    # Compute the average characters per issue from the sampled issues
    for mag_name in chars_per_issue:
        chars_per_issue[mag_name] = round(
            sum(chars_per_issue[mag_name]) / len(chars_per_issue[mag_name]), 0)
    
    # Count the number of issues in each magazine
    issues_per_mag = {}

    for mag_name in magazine_issues:
        issue_count = 0
        for yr in magazine_issues[mag_name]:
            issue_count += len(magazine_issues[mag_name][yr]['issue_urls'])
        issues_per_mag[mag_name] = issue_count

    # Estimate the characters in all issues
    total_chars = 0
    for mag_name in chars_per_issue:
        total_chars += (chars_per_issue[mag_name] * issues_per_mag[mag_name])
    
    print(f"Estimated memory required: {total_chars:,.0f} bytes.")
    print(f"Number of articles per issue:\n{articles_per_issue}")

    return total_chars
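
The 💡 note inside the function suggests prepending `root_url` only when a link is relative. One way to get that behavior, sketched here, is `urllib.parse.urljoin`, which resolves relative links against a base URL and returns absolute links unchanged. The base URL value and the example `href` values below are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Assumed base URL, for illustration only
root_url_example = 'https://www.churchofjesuschrist.org'

hrefs = [
    '/study/liahona/2021/01/example-article?lang=eng',           # relative link
    'https://www.churchofjesuschrist.org/study/friend/2021/01',  # absolute link
]

# Relative links are resolved against the base; absolute links pass through
article_links = [urljoin(root_url_example, href) for href in hrefs]
for link in article_links:
    print(link)
```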

In [None]:
est_memory_reqd = estimate_memory_usage(
    magazine_issues=magdata.all_issues, root_url=ROOT_URL, sample_size=10)

  0%|          | 0/40 [00:00<?, ?it/s]

No <div class="body-block"> tag found on page. Could not load text from article: https://www.churchofjesuschrist.org/study/liahona/1988/10/tambulilit
No <div class="body-block"> tag found on page. Could not load text from article: https://www.churchofjesuschrist.org/study/liahona/2000/11/the-friend
No <div class="body-block"> tag found on page. Could not load text from article: https://www.churchofjesuschrist.org/study/liahona/1980/02/childrens-section
No <div class="body-block"> tag found on page. Could not load text from article: https://www.churchofjesuschrist.org/study/liahona/1984/01/tambulilit
No <div class="body-block"> tag found on page. Could not load text from article: https://www.churchofjesuschrist.org/study/liahona/1994/11/tambulilit
No <div class="body-block"> tag found on page. Could not load text from article: https://www.churchofjesuschrist.org/study/liahona/1989/09/tambulilit
No <div class="body-block"> tag found on page. Could not load text from article: https://www.

In [None]:
print("Estimated memory required to store text from all issues of all magazines"
   + f" from 1970-2020:\n{(est_memory_reqd / 10**6):,.2f} MB")

Estimated memory required to store text from all issues of all magazines from 1970-2020:
316.16 MB
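
The arithmetic behind this estimate is the average number of characters per sampled issue multiplied by the total number of issues. The sketch below works through it with hypothetical numbers (not values from the scrape) and then shows why one character is only approximately one byte.

```python
import sys

# Hypothetical sample: characters counted in three sampled issues of one magazine
sampled_issue_chars = [200_000, 250_000, 150_000]
total_issues = 600  # hypothetical number of issues for that magazine

avg_chars = sum(sampled_issue_chars) / len(sampled_issue_chars)
estimated_bytes = avg_chars * total_issues
print(f"{estimated_bytes / 10**6:,.2f} MB")

# One character is one byte only for ASCII text. Typographic quotation marks,
# for example, take three bytes each when encoded as UTF-8.
text = '“gold”'
utf8_bytes = len(text.encode('utf-8'))
# sys.getsizeof also counts the overhead of the Python string object
print(len(text), utf8_bytes, sys.getsizeof(text))
```

Because the articles contain many typographic quotation marks, and because each Python string carries its own overhead, the character count is best read as a lower bound on memory use.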


In [None]:
# I copied this dictionary from the output printed two cells above
article_count = {
    'Liahona': [20, 21, 15, 15, 18, 41, 22, 23, 14, 64],
    'Ensign': [38, 43, 31, 45, 39, 31, 33, 47, 55, 40],
    'New Era': [20, 18, 19, 29, 14, 16, 29, 16, 18, 14],
    'Friend': [25, 27, 24, 41, 26, 24, 28, 35, 28, 27]
    }

msg = "Average number of articles per magazine:"
print(msg, '\n', '-' * len(msg), sep='')
for mag_name in article_count:
    total_articles = sum(article_count[mag_name])
    total_issues = len(article_count[mag_name])
    avg = total_articles / total_issues
    print(f"{mag_name}: {avg:,.0f}")

Average number of articles per magazine:
----------------------------------------
Liahona: 25
Ensign: 40
New Era: 19
Friend: 28
