# Exploring changes in the front pages of newspapers

How has the content of newspaper front pages changed over time? In the nineteenth and early twentieth centuries, front pages of many newspapers were dominated by classified advertising. The idea that front pages would carry the big, breaking stories came later, but when? To investigate this change, I've [harvested information](large-harvest-example.ipynb) about the contents of all front pages in Trove's digitised newspapers and [converted this harvest](convert-front-pages-harvest.ipynb) into parquet datasets for easy analysis. This notebook starts to explore this data.

In [1]:
import os

import altair as alt
import arrow
import ipywidgets as widgets
import pandas as pd
import requests
from dateutil.relativedelta import relativedelta
from IPython.display import HTML, Image, Markdown, display

In [2]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [3]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

## Load the data

The data file is available in `parquet` format. Each row in the dataset provides the total number of words by page and category. This enables us to examine how the number of words in each category changes over time, and to explore variations in individual newspapers.

In [4]:
df = pd.read_parquet("front_pages_totals.parquet")

# Convert the category field to the category dtype to reduce memory usage
df["category"] = df["category"].astype("category")

In [5]:
df.head()

|   | date | page_id | newspaper_id | category | total |
|---|------|---------|--------------|----------|-------|
| 0 | 1936-07-28 | 13651456 | 501 | Advertising | 1723.0 |
| 1 | 1936-12-04 | 13651678 | 501 | Article | 1154.0 |
| 2 | 1938-04-29 | 13652539 | 501 | Article | 715.0 |
| 3 | 1936-11-10 | 13651636 | 501 | Article | 2604.0 |
| 4 | 1922-12-08 | 13654107 | 501 | Advertising | 1186.0 |
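To make the row structure concrete, here is a minimal sketch using made-up rows (the page ids and totals are invented for illustration). It shows how a page with words in two categories occupies two rows, and how the rows can be reshaped so that each page becomes a single row.

```python
import pandas as pd

# Made-up rows that mimic the dataset's layout: one row per page and category
sample = pd.DataFrame(
    {
        "date": ["1936-07-28", "1936-07-28", "1936-12-04"],
        "page_id": [1, 1, 2],
        "newspaper_id": [501, 501, 501],
        "category": ["Advertising", "Article", "Article"],
        "total": [1723.0, 310.0, 1154.0],
    }
)

# Reshape so each page is a single row with one column per category
wide = sample.pivot_table(
    index="page_id", columns="category", values="total", aggfunc="sum", fill_value=0
)
print(wide)
```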


Here's a list of the categories:

In [6]:
df["category"].unique()

['Advertising', 'Article', 'Detailed lists, results, guides', 'Family Notices', 'Literature']
Categories (5, object): ['Advertising', 'Article', 'Detailed lists, results, guides', 'Family Notices', 'Literature']

How many words in total are in each of the categories?

In [7]:
# Display big numbers in a human friendly way
pd.options.display.float_format = "{:,.0f}".format
df.groupby("category")["total"].sum()

category
Advertising                       6,316,130,808
Article                           2,870,871,592
Detailed lists, results, guides     244,157,191
Family Notices                      166,221,018
Literature                                8,194
Name: total, dtype: float64

So that's more than 6.3 billion words in advertisements!
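To put those totals in proportion, the sketch below copies the figures from the output above into a small Series and works out each category's share of all front-page words. The variable names `totals` and `shares` are just for illustration.

```python
import pandas as pd

# Category totals copied from the output above
totals = pd.Series(
    {
        "Advertising": 6_316_130_808,
        "Article": 2_870_871_592,
        "Detailed lists, results, guides": 244_157_191,
        "Family Notices": 166_221_018,
        "Literature": 8_194,
    }
)

# Each category's share of all front-page words, as a percentage
shares = (totals / totals.sum() * 100).round(1)
print(shares)
```

On these figures advertising accounts for roughly two-thirds of all the words on front pages.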

How many front pages are there?

In [8]:
df["page_id"].nunique()

2480438

To prepare the dataset for visualisation, we'll make sure the `date` field is recognised as a datetime value, and add a new column for the year.

In [9]:
df["date"] = pd.to_datetime(df["date"])
df["year"] = pd.to_datetime((df["date"].dt.year), format="%Y")
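As a quick illustration of what the `year` column ends up holding, this sketch applies the same conversion to a few made-up dates. Every date is mapped to 1 January of its year, which is what lets us group by year while keeping a datetime value for charting.

```python
import pandas as pd

# Made-up dates for illustration
dates = pd.Series(pd.to_datetime(["1936-07-28", "1936-12-04", "1938-04-29"]))

# Same approach as the cell above: keep only the year, then parse it back to a datetime
years = pd.to_datetime(dates.dt.year, format="%Y")
print(years.tolist())
```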

## Number of words on front pages

Let's start by looking at the total number of words on front pages by year.

In [10]:
df_word_totals = df.groupby(["year"])["total"].sum().to_frame().reset_index()

In [11]:
alt.Chart(df_word_totals).mark_bar().encode(
    x=alt.X("year:T", axis=alt.Axis(labelAngle=45)),
    y=alt.Y("total:Q", title="number of words"),
    tooltip=[alt.Tooltip("year:T", format="%Y"), alt.Tooltip("total:Q", format=",")],
).properties(width="container")

The peak from 1914 to 1918 is another example of the 'WWI effect': more newspapers have been digitised from the WWI period as a result of digitisation priorities. There are more words because more pages have been digitised. Not very surprising really. What we really want to know is how the number of words per page changes over time. To do that we first need to find the number of pages per year.

In [12]:
# Count distinct pages, because each page has one row per category
df_page_counts = df.groupby(["year"])["page_id"].nunique().to_frame().reset_index()

In [13]:
alt.Chart(df_page_counts).mark_bar().encode(
    x=alt.X("year:T", axis=alt.Axis(labelAngle=45)),
    y=alt.Y("page_id:Q", title="number of front pages"),
    tooltip=[alt.Tooltip("year:T", format="%Y"), alt.Tooltip("page_id:Q", title="pages", format=",")],
).properties(width="container")

Hey look, it's the 'WWI effect' again. But now that we have the words per year and the pages per year, we can combine them to find the average number of words per page in each year.

## Number of words per page/year

By merging the two dataframes created above, we can calculate the average number of words per page by dividing the total number of words by the number of pages for each year.

In [14]:
# Merge the two dataframes
df_words_per_page = pd.merge(df_word_totals, df_page_counts, how="left", on="year")

# Calculate the number of words per page by dividing the number of words by the number of pages
df_words_per_page["words_per_page"] = (
    df_words_per_page["total"] / df_words_per_page["page_id"]
)
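One thing to keep in mind with this calculation is that a page can contribute several rows to the dataset (one per category), so counting rows and counting distinct pages give different denominators. The sketch below uses made-up data to show both counts side by side. The column names `rows` and `pages` are assumptions for illustration only.

```python
import pandas as pd

# Made-up data: page 1 has rows for two categories, page 2 has one
sample = pd.DataFrame(
    {
        "year": pd.to_datetime(["1936", "1936", "1936"], format="%Y"),
        "page_id": [1, 1, 2],
        "category": ["Advertising", "Article", "Article"],
        "total": [1500.0, 500.0, 1000.0],
    }
)

by_year = sample.groupby("year").agg(
    total=("total", "sum"),
    rows=("page_id", "count"),  # counts page/category rows
    pages=("page_id", "nunique"),  # counts distinct pages
)

# Average words per distinct page
by_year["words_per_page"] = by_year["total"] / by_year["pages"]
print(by_year)
```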

In [15]:
alt.Chart(df_words_per_page).mark_bar().encode(
    x=alt.X("year:T", axis=alt.Axis(labelAngle=45)),
    y=alt.Y("words_per_page:Q", title="number of words per page"),
    tooltip=[alt.Tooltip("year:T", format="%Y"), alt.Tooltip("words_per_page:Q", title="words per page", format=",")],
).properties(width="container")

That's a bit more interesting! We can see that the average number of words on front pages peaked in 1854, and has declined steadily since.

## Number of words in each category

Now let's look at the average number of words per page by category. 

In [16]:
def get_words_per_page(df, newspaper=None):
    """
    Calculates the average number of words per page by article category and returns a dataframe with the results.
    Supply a newspaper identifier to limit results to a specific newspaper.
    """
    if newspaper:
        df = df.loc[df["newspaper_id"] == newspaper]
    # Get the number of words in each category per year
    df_words = df.groupby(["year", "category"])["total"].sum().to_frame().reset_index()
    # Get the number of distinct front pages each year
    # (nunique rather than count, because each page has one row per category)
    df_pages = df.groupby(["year"])["page_id"].nunique().to_frame().reset_index()
    # Add the number of front pages to the df with number of words
    df_words_per_page = pd.merge(df_words, df_pages, how="left", on="year")
    # Calculate the average number of words per page per year, i.e. number of words / number of pages
    df_words_per_page["words_per_page"] = (
        df_words_per_page["total"] / df_words_per_page["page_id"]
    )
    return df_words_per_page


def display_words_per_page(df, newspaper=None, start=None, end=None):
    """
    Build a chart that displays the average number of words per page by article category.
    Supply a newspaper identifier to limit results to a specific newspaper.

    The `start` and `end` parameters are date values (for example, pandas Timestamps), and can be used
    to highlight a particular section of the chart (see below for an example of this).
    """
    df_words_per_page = get_words_per_page(df, newspaper)
    base = alt.Chart(df_words_per_page).properties(width="container")
    chart = base.mark_line().encode(
        x=alt.X("year:T", axis=alt.Axis(tickMinStep=1, labelAngle=45)),
        y=alt.Y("words_per_page:Q", title="words per page"),
        color="category:N",
        tooltip=[
            "year(year)",
            "category",
            alt.Tooltip("words_per_page", title="words per page"),
        ],
    )
    if start and end:
        start_rule = base.mark_rule(strokeDash=[6, 3], size=1, color="gray").encode(
            x=alt.datum(
                alt.DateTime(year=start.year, month=start.month, date=start.day)
            ),
        )
        end_rule = base.mark_rule(strokeDash=[6, 3], size=1, color="gray").encode(
            x=alt.datum(alt.DateTime(year=end.year, month=end.month, date=end.day))
        )
        return chart + start_rule + end_rule
    return chart

The two functions above calculate the average number of words per page by category and display the results. Let's see how the average number of words in each category changed over time.

In [17]:
display_words_per_page(df)

We can see, as expected, that advertising dominated front pages in the late 19th and early 20th centuries. But the number of words in news articles steadily increased until it overtook advertising around 1930.

The functions above also accept a `newspaper` parameter that expects a Trove newspaper identifier. Using this we can explore changes in an individual newspaper. Let's try the *Sydney Morning Herald* which has an identifier of `35`.

In [18]:
display_words_per_page(df, 35)

Comparing the two charts we see that the *Sydney Morning Herald* continued to have mostly advertisements on the front page until 1944 when things changed dramatically.

## When did articles overtake advertising?

The charts above show that, across all newspapers, articles overtook advertising about 1930. But the *Sydney Morning Herald* was significantly later. Let's see if we can find the date this change occurred across different newspapers. Of course, not every newspaper started with more ads than articles, so not all of them will have a 'crossover' date. Indeed, some newspapers on Trove have short publication spans, so you're not likely to observe major changes. Also, in some cases the number of words in each category went up and down, so there might be multiple crossovers. Here I'm looking for newspapers where at some point in their history there were more words in advertising than in articles on front pages, and then selecting the latest date that this changed so that there were more words in articles.

In [19]:
def find_latest_year(df, newspaper=None):
    """
    Find the latest 'crossover' date when the number of words in articles
    became greater than in advertising.
    """
    df_words_per_page = get_words_per_page(df, newspaper)
    # Limit to only Article and Advertising categories
    df_changes = df_words_per_page.copy().loc[
        df_words_per_page["category"].isin(["Advertising", "Article"])
    ]
    # Because we're grouping by year, this will calculate the difference
    # between the Advertising and Article totals
    df_changes["diff"] = df_changes.groupby("year")["total"].diff()

    # We want to check that Articles are in front of Ads at the end of this time span
    # So let's find the last Article
    last_year = df_changes["year"].max()
    last_article = df_changes.loc[
        (df_changes["category"] == "Article") & (df_changes["year"] == last_year)
    ]
    try:
        # Check that Articles have more words than Ads in the final year
        if last_article.iloc[0]["diff"] > 0:
            # Find all of the years when Ads were still ahead of Articles and select the last one
            latest_year = df_changes.loc[
                (df_changes["category"] == "Article") & (df_changes["diff"] < 0)
            ]["year"].max()
        else:
            latest_year = None
    except (IndexError, ValueError):
        # No Article row was found for the final year
        latest_year = None
    if pd.isnull(latest_year):
        latest_year = None
    return latest_year
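The same logic can be sketched on a small, made-up table with one column per category, which makes the steps easier to follow. The function name `latest_ads_ahead` and the yearly totals are hypothetical. Like the function above, it only reports a result if articles lead in the final year, and it returns the last year in which advertising still had more words.

```python
import pandas as pd

# Made-up yearly word totals for a hypothetical newspaper
yearly = pd.DataFrame(
    {
        "year": [1926, 1927, 1928, 1929, 1930],
        "Advertising": [900, 700, 800, 750, 300],
        "Article": [200, 750, 400, 600, 900],
    }
).set_index("year")


def latest_ads_ahead(yearly):
    """Return the last year ads led articles, if articles lead at the end."""
    diff = yearly["Article"] - yearly["Advertising"]
    # Only report a crossover if articles lead in the final year
    if diff.iloc[-1] <= 0:
        return None
    # Years in which advertising still had more words than articles
    ads_ahead = diff[diff < 0]
    return None if ads_ahead.empty else int(ads_ahead.index.max())


print(latest_ads_ahead(yearly))
```

In this example articles briefly pull ahead in 1927 and fall behind again, so the earlier crossover is ignored in favour of the later one.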

Let's try to find the latest crossover date across all newspapers.

In [20]:
find_latest_year(df)

Timestamp('1930-01-01 00:00:00')

Cool, that matches the chart above. Now let's try the *SMH*.

In [21]:
find_latest_year(df, 35)

Timestamp('1944-01-01 00:00:00')

Again that matches the chart. Now let's run this function across all the newspapers.

In [22]:
changes = []

# Get a list of all newspaper ids
newspapers = list(df["newspaper_id"].unique())

# Loop through all the newspaper ids trying to get the last crossover year
for newspaper in newspapers:
    latest_year = find_latest_year(df, int(newspaper))
    if latest_year:
        changes.append({"newspaper_id": newspaper, "year": latest_year})

In [23]:
df_changes = pd.DataFrame(changes)

How many newspapers can we identify a 'crossover' date for?

In [24]:
df_changes.shape

(390, 2)

How are these crossover dates distributed across time?

In [25]:
alt.Chart(df_changes).mark_bar().encode(
    x=alt.X("year:T", axis=alt.Axis(labelAngle=45)),
    y=alt.Y("count()", title="number of 'crossovers'"),
    tooltip=[alt.Tooltip("year:T", format="%Y"), alt.Tooltip("count()", title="crossovers")],
).properties(width="container")

Once again, it seems that the late 1920s was the period when most change was happening.
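If you'd rather check this with numbers than by eye, one option is to count the crossovers in each decade. The sketch below uses a handful of made-up years standing in for `df_changes`. The names `crossovers` and `per_decade` are assumptions for illustration.

```python
import pandas as pd

# Made-up crossover years standing in for df_changes
crossovers = pd.DataFrame(
    {"year": pd.to_datetime(["1919", "1927", "1929", "1929", "1944"], format="%Y")}
)

# Floor each year to the start of its decade and count crossovers per decade
decade = crossovers["year"].dt.year // 10 * 10
per_decade = decade.value_counts().sort_index()
print(per_decade)
```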

## Looking at individual titles in depth

To examine these 'crossover' dates in more detail we can zoom in and use a scatter plot to display the number of words in each category for every issue. The code below creates two charts. The first, like those above, shows the number of words in each category across the newspaper's complete publication run on Trove. However, if a 'crossover' date has been found, a window running from the year before that date to two years after it will be highlighted. The second chart is limited to the four calendar years in that window, and displays a point for each category in every issue within that period. This will enable us to look at what's actually changing. Finally, the code grabs two front page images from Trove, one at the beginning of the period and one at the end, so you can compare how they look.

In [26]:
def get_newspapers():
    """
    Get a list of newspapers from the Trove API.
    We'll use this to get titles from ids.
    """
    params = {"encoding": "json"}
    headers = {"X-API-KEY": API_KEY}
    response = requests.get(
        "https://api.trove.nla.gov.au/v3/newspaper/titles",
        params=params,
        headers=headers,
    )
    data = response.json()
    newspapers = {n["id"]: n["title"] for n in data["newspaper"]}
    return newspapers


def embed_image(page_id, date, width=300):
    """
    Prepare HTML to display a page image.
    """
    # Url to download image
    image_url = (
        f"https://trove.nla.gov.au/ndp/imageservice/nla.news-page{page_id}/level1"
    )
    # Url to view page on Trove
    page_url = f"https://nla.gov.au/nla.news-page{page_id}"
    return (
        f'<figure style="float: left; margin-right: 40px;">'
        f'<a target="_blank" href="{page_url}">'
        f'<img src="{image_url}" style="display:inline;margin:1px" width="{width}px">'
        f"</a>"
        f"<figcaption>"
        f'<a target="_blank" href="{page_url}">{arrow.get(date).format("D MMMM YYYY")}</a>'
        f"</figcaption>"
        f"</figure>"
    )


def create_scatter_plot(df):
    """
    Create a scatter plot showing all the categories/page totals.
    This assumes that the df will have a limited time span or else you're
    likely to get MaxRowsErrors from Altair.
    """
    chart = (
        alt.Chart(df)
        .mark_circle()
        .encode(
            x=alt.X("date:T", axis=alt.Axis(labelAngle=45)),
            y="total:Q",
            color="category:N",
            tooltip=["date:T", "category:N", "total:Q"],
        )
        .properties(width="container")
        .interactive()
    )
    return chart


def create_pages_box(df):
    """
    Creates some HTML to display the first and last front page image,
    including links to Trove.
    """
    # Get the first and last dates in this df
    first_date = df["date"].min()
    last_date = df["date"].max()
    # Get the page ids from the first and last dates
    first_page = df.loc[df["date"].idxmin()]["page_id"]
    last_page = df.loc[df["date"].idxmax()]["page_id"]
    # Compile the HTML
    pages_box = HTML(
        embed_image(first_page, first_date, width=300)
        + embed_image(last_page, last_date, width=300)
    )
    return pages_box


def display_newspaper(nid, title, start=None, end=None):
    """
    Display summary information about a specific newspaper, includes:
    - words per page line chart
    - scatter plot (if date range is set or crossover date is found)
    - images of front pages from beginning and end of date range (linked to Trove)
    """
    # Display a heading with the newspaper title
    display(Markdown(f"### {title}"))
    # If a date range hasn't been explicitly set, try to find
    # the crossover date and set a date range based on that
    if not start and not end:
        # Look for crossover
        year = find_latest_year(df, nid)
        if year:
            # Set the date span from one year before to two years after the detected year
            start = year - relativedelta(years=1)
            end = year + relativedelta(years=2)
    # Display the words per page line chart
    display(display_words_per_page(df, nid, start, end))
    # If we have a defined date range, then create a scatter plot chart 
    if start and end:
        # Set limits on df
        df_year = df.loc[
            (df["newspaper_id"] == nid) & (df["year"] >= start) & (df["year"] <= end)
        ]
        # Create the chart
        display(create_scatter_plot(df_year))
        # Get HTML to display page images
        pages_box = create_pages_box(df_year)
    else:
        pages_box = create_pages_box(df.loc[df["newspaper_id"] == nid])
    display(pages_box)


# Generate the list of newspapers
newspapers = get_newspapers()
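To see which years end up in the window, this sketch applies the same offsets used in `display_newspaper()` to the year found for the *Sydney Morning Herald* above, then filters a made-up `year` column in the same way. The `rows` dataframe is invented for illustration.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# The year returned for the Sydney Morning Herald above
year = pd.Timestamp("1944-01-01")

# Same offsets as display_newspaper()
start = year - relativedelta(years=1)
end = year + relativedelta(years=2)

# Made-up rows: the 'year' column holds 1 January of each issue's year
rows = pd.DataFrame(
    {"year": pd.to_datetime([str(y) for y in range(1941, 1949)], format="%Y")}
)

# Keep only the rows whose year falls inside the window (inclusive at both ends)
window = rows.loc[(rows["year"] >= start) & (rows["year"] <= end)]
print(window["year"].dt.year.tolist())
```

Because both comparisons are inclusive, the window covers four calendar years.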

Let's try it out with the *Sydney Morning Herald* again.

In [27]:
display_newspaper(35, newspapers["35"])

### The Sydney Morning Herald (NSW : 1842 - 1954)

You can see that the *SMH* layout literally changed overnight from 14 April 1944 to 15 April 1944 (hover over the points to see the exact dates). But was this a common pattern, or were some of the crossovers more gradual or messy? We saw above that the largest number of crossovers was in 1929, so let's look in depth at all the newspapers that had crossover dates in 1929.

In [28]:
year = 1929
for nid in df_changes.loc[df_changes["year"] == f"{year}-01-01"][
    "newspaper_id"
].to_list():
    display_newspaper(nid, newspapers[str(nid)])

### Ovens and Murray Advertiser (Beechworth, Vic. : 1855 - 1955)

### The Manning River Times and Advocate for the Northern Coast Districts of New South Wales (Taree, NSW : 1898 - 1954)

### Collie Mail (Perth, WA : 1908 - 1954)

### The Southern Districts Advocate (Katanning, WA : 1913 - 1936)

### North-Eastern Advertiser (Scottsdale, Tas. : 1909 - 1954)

### Great Southern Leader (Pingelly, WA : 1907 - 1934)

### Kyabram Free Press and Rodney and Deakin Shire Advocate (Vic. : 1894 - 1954)

### The Blackwood Times (Greenbushes, WA : 1905 - 1955)

### Bunyip (Gawler, SA : 1863 - 1954)

### Daily Commercial News and Shipping List (Perth, WA : 1927 - 1934)

### Burra Record (SA : 1878 - 1954)

### The Register News-Pictorial (Adelaide, SA : 1929 - 1931)

### Gilgandra Weekly and Castlereagh (NSW : 1929 - 1942)

### Western Star and Roma Advertiser (Qld. : 1875 - 1948)

### Wiluna Chronicle and East Murchison Advocate (WA : 1924 - 1931)

### Box Hill Reporter (Vic. : 1925 - 1930)

### The Blackheath Bulletin (Katoomba, NSW : 1926; 1929 - 1931)

### The Border Star (Coolangatta, Qld. : 1929 - 1942)

From the examples above we can see that the evolution of front pages varied considerably. In some cases there was a gradual change, as the amount of 'news' on the front page increased. In other cases there was an abrupt change to the layout.

## The metropolitan dailies

Many of the newspapers identified above with crossovers in 1929 are smaller local titles. It would be interesting to look at how the large, metropolitan dailies changed. We've already seen that the *SMH* flipped its layout in 1944, but what about the others?

In [29]:
metro_dailies = [35, 11, 809, 12, 10, 44, 30]
for nid in metro_dailies:
    display_newspaper(nid, newspapers[str(nid)])

### The Sydney Morning Herald (NSW : 1842 - 1954)

### The Canberra Times (ACT : 1926 - 1995)

### The Age (Melbourne, Vic. : 1854 - 1954)

### The Courier-Mail (Brisbane, Qld. : 1933 - 1954)

### The Mercury (Hobart, Tas. : 1860 - 1954)

### The Advertiser (Adelaide, SA : 1931 - 1954)

### The West Australian (Perth, WA : 1879 - 1954)

So, with the exception of the *Canberra Times*, which has always had news on the front page, the metropolitan dailies all changed their front page layout quite abruptly in the 1930s or early 1940s.

## Problems and questions

The major problem with this method is that it only takes into account the *number of words* in articles, not the amount of space they take up. As we can see in the assortment of front pages displayed above, there were also changes in the *way* advertising was displayed, from text-dense classifieds to display ads. And when the headlines came to the front page, they were often accompanied by photographs. How much would this analysis change if we factored in how front page *space* was allocated? The Trove API doesn't provide data on the spatial coordinates of articles, though it is possible to scrape it from the web site. Something to explore in the future perhaps?

----

Created by [Tim Sherratt](https://timsherratt.org), August 2023