# Text Analysis: 50 yrs. of Magazine Issues

**Author:** [Ryan Parker](https://github.com/rparkr)

**Data source:** Data scraped from magazines of The Church of Jesus Christ of Latter-day Saints published from 1971 to 2021. Starting point: [Church Magazines](https://www.churchofjesuschrist.org/study/magazines?lang=eng).

**Objective**: understand trends in topics addressed through the content of four magazines produced by [The Church of Jesus Christ of Latter-day Saints](https://churchofjesuschrist.org).

In addition to collecting data, I'll analyze the data and implement various machine learning algorithms to understand the data, uncover insights, and make predictions.

## Steps
1. Data collection: gather data through web scraping, then lightly process it and save it to .csv files.
2. Topic modeling: assign a topic to each document (article) and label the topics, then append the topic label as a new feature (column) in the dataset.
3. <strong><span style="color:rgb(82, 191, 127)">Data augmentation (this notebook)</span></strong>: modify existing features and add new ones (columns) to the dataset to enrich its analysis.
4. Analysis: explore the dataset through visualization, and make predictions using supervised and unsupervised machine learning algorithms.


# Data augmentation
In this step, also known as feature engineering, I add features (columns) with computed information to enhance the analysis of this dataset.

This notebook assumes that the .csv files created by the Data Collection notebook exist and are saved in the `data/` directory (relative to where this notebook is saved). In particular, it relies on:
- `data/article_data.csv`
- `data/article_text.csv`
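
A quick check that both input files are present can save a confusing error later in the notebook. The sketch below is only an illustration: the helper name `missing_inputs` is hypothetical, and it assumes nothing about the files beyond the two names listed above.

```python
from pathlib import Path

# File names taken from the list above; the helper name is hypothetical.
REQUIRED_FILES = ("article_data.csv", "article_text.csv")


def missing_inputs(data_dir="data"):
    """Return the required CSV paths that do not exist under data_dir."""
    base = Path(data_dir)
    return [base / name for name in REQUIRED_FILES if not (base / name).is_file()]
```

Calling `missing_inputs()` at the top of the notebook returns an empty list when both files are in place; a non-empty result suggests the Data Collection notebook should be run first.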

**Features to be added to the dataset**
* [Readability score](https://en.wikipedia.org/wiki/Readability#Popular_readability_formulas "Wikipedia article"), a measure of article complexity. We would expect complexity to be lowest for articles in the children's magazine (_The Friend_) and higher for magazines aimed at youth and adults.
* Author's gender (predicted from the [Genderize API](https://genderize.io) or estimated from the [Social Security Names dataset](https://www.ssa.gov/oact/babynames/limits.html))
* Author's age (predicted from the [Agify API](https://agify.io/) or estimated from the [Social Security Names dataset](https://www.ssa.gov/oact/babynames/limits.html))
* Article summary ("extractive" statistical method, e.g., [using NLTK and n-grams](https://stackabuse.com/text-summarization-with-nltk-in-python/ "StackAbuse tutorial article"))
* Article summary ("abstractive" neural network method, e.g., using a pre-trained Large Language Model)
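
To make the "extractive" idea concrete, here is a minimal sketch of frequency-based sentence scoring. It is a standard-library stand-in for the NLTK approach linked above, not the method this notebook settles on: the tiny stop-word list, the naive sentence splitter, and the function name `extractive_summary` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); NLTK's list is far larger.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in", "is",
    "it", "of", "on", "or", "that", "the", "this", "to", "was", "were", "with",
}


def extractive_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    if not sentences or not words:
        return ""
    freq = Counter(words)
    top = max(freq.values())

    def score(sentence):
        # Sum of normalized frequencies of the sentence's non-stop words.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] / top for t in tokens if t in freq)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)
```

Because scores are summed rather than averaged, longer sentences tend to win; dividing by sentence length is one possible adjustment if summaries come out too wordy.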



## Next steps
6-Aug-2022

1. Use a regex pattern to set the 'section' column to 'general' when the article URL contains a certain number of forward slashes ('/'), since additional forward slashes indicate that the article came from a section other than general. Other approaches could extract the text between the forward slashes, but since the sections are already correctly assigned, I only need to reassign some articles to the 'general' category.
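
The slash-counting rule could be sketched as follows. This version counts slashes with `urlparse` and `str.count` rather than a regex, and the threshold `max_general_depth` is a hypothetical parameter: the right value has to be read off the real article URLs, which this sketch does not assume.

```python
from urllib.parse import urlparse


def path_depth(url):
    """Count the forward slashes in the path portion of a URL.

    The scheme ("https://") and any query string are ignored, and a
    trailing slash is not counted.
    """
    return urlparse(url).path.rstrip("/").count("/")


def reassign_section(url, section, max_general_depth):
    """Return 'general' when the URL path is shallow enough, else keep section.

    max_general_depth is a hypothetical threshold chosen by inspecting the data.
    """
    return "general" if path_depth(url) <= max_general_depth else section
```

If the URLs live in a pandas column, `Series.str.count("/")` gives a comparable count in one vectorized call, though it would also count the slashes in the scheme.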

## Article complexity: Readability Score

**References**

* [textstat](https://pypi.org/project/textstat/) Python package for returning readability scores from texts
* [py-readability-metrics](https://pypi.org/project/py-readability-metrics/) Python package for returning readability scores from texts. See also its [documentation](https://py-readability-metrics.readthedocs.io/en/latest/).
* [GeeksForGeeks article](https://www.geeksforgeeks.org/readability-index-pythonnlp/) on calculating readability scores
* [Readability formulas](https://readabilityformulas.com/search/pages/Readability_Formulas/) for implementing readability scores directly (that is, without importing an external package)
* [Wikipedia article](https://en.wikipedia.org/wiki/Readability#Popular_readability_formulas) on popular readability scores
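
For reference, two of the popular formulas can be computed without an external package. The sketch below implements Flesch Reading Ease and Flesch-Kincaid Grade Level using their published coefficients; the syllable counter is a rough vowel-group heuristic (an assumption, and less accurate than what the packages above use), so scores may differ somewhat from theirs.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for text."""
    # Naive splitting: sentences end at ., ! or ?; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level
```

Note that the two scores run in opposite directions: a higher Reading Ease means simpler text, while a higher Grade Level means more complex text, which matters when comparing magazines.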