# 24 ways search index

Start by using `wget -r` to create a local mirror of the pages we care about:

    cd /Users/simonw/Dropbox/Development/24ways-search
    wget --recursive --wait 2 --no-clobber \
      -I /2005,/2006,/2007,/2008,/2009,/2010,/2011,/2012,/2013,/2014,/2015,/2016,/2017 \
      -X "/*/*/comments" \
      https://24ways.org/archives/ 

Then fetch the remaining 2018 articles, this time starting from the homepage:

    wget --recursive --wait 2 --no-clobber \
      -I /2018 \
      -X "/*/*/comments" \
      https://24ways.org/

In [1]:
from pathlib import Path

The [pathlib](https://docs.python.org/3/library/pathlib.html) module lets us easily work with recursive directory structures. We can use `glob` to find deeply nested files matching a specified pattern.

In [2]:
base = Path("/Users/simonw/Dropbox/Development/24ways-search")

In [3]:
articles = list(base.glob("*/*/*/*.html"))
len(articles)

329

In [4]:
articles[:5]

[PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/why-bother-with-accessibility/index.html'),
 PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/levelling-up/index.html'),
 PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/project-hubs/index.html'),
 PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/credits-and-recognition/index.html'),
 PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/managing-a-mind/index.html')]
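
To see why the pattern needs exactly four segments, here is a small sketch that runs the same `glob` against a throwaway directory tree. The file names are invented for illustration; only the `host/year/slug/index.html` layout mirrors the output above.

```python
import tempfile
from pathlib import Path

# Hypothetical mirror layout, loosely modelled on what wget produces.
sample_files = [
    "24ways.org/2013/levelling-up/index.html",   # article: host/year/slug/file
    "24ways.org/2014/some-article/index.html",   # article: host/year/slug/file
    "24ways.org/archives/index.html",            # too shallow, should be skipped
    "24ways.org/index.html",                     # too shallow, should be skipped
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in sample_files:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("<html></html>")

    # Each "*" matches exactly one path segment, so only files that sit
    # three directories below the root are returned.
    matches = sorted(p.relative_to(root).as_posix() for p in root.glob("*/*/*/*.html"))

print(matches)
```

Because each `*` matches a single path segment, the homepage and archive index are left out without any extra filtering.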

We need a way to derive the URL on 24 ways based on the filepath:

In [5]:
str(articles[37].relative_to(base).parent)

'24ways.org/2014/websites-of-christmas-past-present-and-future'
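
The same idea can be wrapped in a small helper. This is only a sketch: `url_for` is a hypothetical name, `/mirror` is a placeholder base directory, and the `https://` prefix and trailing slash are assumptions about how the site serves its articles.

```python
from pathlib import PurePosixPath


def url_for(path, base):
    """Turn a mirrored file path into the URL it was (presumably) fetched from."""
    # Drop the base directory and the trailing index.html,
    # leaving "<host>/<year>/<slug>".
    relative_dir = path.relative_to(base).parent
    return "https://" + relative_dir.as_posix() + "/"


base_dir = PurePosixPath("/mirror")  # placeholder for the real mirror directory
article = base_dir / "24ways.org/2014/websites-of-christmas-past-present-and-future/index.html"
url = url_for(article, base_dir)
print(url)
```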

Let's grab a random article and experiment with using the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python HTML scraping library to extract the data we need from it.

The library has to be installed first with `pip install beautifulsoup4 html5lib`.

In [6]:
path = articles[37]
html = path.read_text()

In [7]:
len(html)

40295

In [8]:
from bs4 import BeautifulSoup as Soup

In [9]:
soup = Soup(html, "html5lib")

In [10]:
soup.find("title").text

'Websites of Christmas Past, Present and Future ◆ 24 ways'

BeautifulSoup lets you extract data from HTML using CSS selectors:

In [11]:
div = soup.select_one(".e-content")

`div.text` returns just the text content of the div, with all HTML tags stripped:

In [12]:
print(div.text.strip()[:200])

The websites of Christmas past

The first website was created at CERN. It was launched on 20 December 1990 (just in time for Christmas!), and it still works today, after twenty-four years. Isn’t that 
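
For experimenting without a downloaded page, the selector calls can be tried against a hand-written fragment. This is a minimal sketch: the HTML below is made up (it only borrows the class names used in this notebook), and it uses the built-in `html.parser` instead of `html5lib` to keep the dependencies small.

```python
from bs4 import BeautifulSoup

# Invented fragment that reuses the class names seen on the real pages.
fragment = """
<div class="c-meta">
  <time class="dt-published" datetime="2014-12-08T00:00:00+00:00">8 Dec 2014</time>
</div>
<div class="e-content"><p>First <em>paragraph</em>.</p></div>
"""

soup = BeautifulSoup(fragment, "html.parser")

# select_one returns the first matching element, or None if nothing matches.
content = soup.select_one(".e-content")
missing = soup.select_one(".does-not-exist")

# .text concatenates the text nodes and drops the tags.
text = content.text

# select returns a list of every matching element.
times = soup.select(".c-meta time")
published = times[0]["datetime"]

print(text, missing, published)
```

Note that `select_one` returns `None` for a selector that matches nothing, so indexing into the result of a failed lookup raises `TypeError`.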


In [13]:
soup.select(".c-meta time")

[<time class="dt-published" datetime="2014-12-08T00:00:00+00:00">8 Dec<span>ember</span> 2014</time>]

We can use `["attribute"]` syntax to access HTML attributes.

In [14]:
soup.select_one(".c-meta time")["datetime"]

'2014-12-08T00:00:00+00:00'
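
The notebook stores this value as a plain string. If a real date object is ever needed, the standard library can parse it directly; the sketch below assumes Python 3.7 or later, where `datetime.fromisoformat` is available.

```python
from datetime import datetime

raw = "2014-12-08T00:00:00+00:00"

# fromisoformat understands the "+00:00" offset and returns an aware datetime.
published = datetime.fromisoformat(raw)

publication_year = published.year
publication_day = published.date().isoformat()
offset_seconds = published.utcoffset().total_seconds()

print(publication_year, publication_day, offset_seconds)
```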

The easiest way to extract the topic of the article (code, process, design, etc.) is from the link to its topic page:

In [15]:
soup.select_one('.c-meta a[href^="/topics/"]')["href"].split("/topics/")[1].split("/")[0]

'code'
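
The string handling can be separated from the HTML lookup, which makes the missing-link case easier to reason about. In this sketch `topic_from_href` is a hypothetical helper, and the sample `href` values are assumptions about what the links look like.

```python
def topic_from_href(href):
    """Return the topic slug from a "/topics/<slug>/" link, or None if there is none."""
    if not href or "/topics/" not in href:
        return None
    # Keep whatever sits between "/topics/" and the next slash.
    return href.split("/topics/")[1].split("/")[0]


with_topic = topic_from_href("/topics/code/")
without_link = topic_from_href(None)
unrelated = topic_from_href("/authors/joshemerson/")

print(with_topic, without_link, unrelated)
```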

The author's name can be found in a link's `title` attribute, which reads "More information about Josh Emerson":

In [16]:
soup.select_one(".c-continue")["title"].split("More information about")[1].strip()

'Josh Emerson'

In [17]:
soup.select_one(".c-continue")["href"].split("/authors/")[1].split("/")[0]

'joshemerson'

In [18]:
year = str(path.relative_to(base)).split("/")[1]; year

'2014'
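
Splitting the string on `/` works on macOS and Linux. As an alternative sketch, `parts` gives the path segments directly and avoids hard-coding the separator; the relative path below is an example in the same shape as the mirrored files.

```python
from pathlib import PurePosixPath

# Example relative path: <host>/<year>/<slug>/index.html
relative = PurePosixPath("24ways.org/2014/websites-of-christmas-past-present-and-future/index.html")

# parts is a tuple of path segments, so the year is the second element.
segments = relative.parts
year_from_parts = segments[1]

print(segments[0], year_from_parts)
```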

Now that we've figured out how to extract data from a single article, we can extract the data from all of the articles inside a loop:

In [19]:
docs = []
for path in articles:
    year = str(path.relative_to(base)).split("/")[1]
    url = 'https://' + str(path.relative_to(base).parent) + '/'
    soup = Soup(path.read_text(), "html5lib")
    author = soup.select_one(".c-continue")["title"].split("More information about")[1].strip()
    author_slug = soup.select_one(".c-continue")["href"].split("/authors/")[1].split("/")[0]
    published = soup.select_one(".c-meta time")["datetime"]
    contents = soup.select_one(".e-content").text.strip()
    title = soup.find("title").text.split(" ◆")[0]
    try:
        topic = soup.select_one('.c-meta a[href^="/topics/"]')["href"].split("/topics/")[1].split("/")[0]
    except TypeError:
        # Some articles don't have topics
        topic = None
    docs.append({
        "title": title,
        "contents": contents,
        "year": year,
        "author": author,
        "author_slug": author_slug,
        "published": published,
        "url": url,
        "topic": topic,
    })

In [20]:
len(docs)

329
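
Before loading the records anywhere, a quick sanity check can catch extraction mistakes. The sketch below runs on `sample_docs`, a hypothetical stand-in with invented values; the same two expressions could be pointed at the real `docs` list.

```python
from collections import Counter

# Invented records with the same "year" and "topic" keys as the real docs.
sample_docs = [
    {"title": "First example", "year": "2013", "topic": "code"},
    {"title": "Second example", "year": "2013", "topic": None},
    {"title": "Third example", "year": "2014", "topic": "design"},
]

# How many articles were extracted for each year?
per_year = Counter(doc["year"] for doc in sample_docs)

# Which articles ended up without a topic?
missing_topic = [doc["title"] for doc in sample_docs if doc["topic"] is None]

print(per_year, missing_topic)
```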

In [21]:
docs[0]

{'title': 'Why Bother with Accessibility?',
 'contents': 'Web accessibility (known in other fields as inclusive design or universal design) is the degree to which a website is available to as many people as possible. Accessibility is most often used to describe how people with disabilities can access the web.\n\nHow we approach accessibility\n\nIn the web community, there’s a surprisingly inconsistent approach to accessibility. There are some who are endlessly dedicated to accessible web design, and there are some who believe it so intrinsic to the web that it shouldn’t be considered a separate topic. Still, of those who are familiar with accessibility, there’s an overwhelming number of designers, developers, clients and bosses who just aren’t that bothered.\n\nOver the last few months I’ve spoken to a lot of people about accessibility, and I’ve heard the same reasons to ignore it over and over again. Let’s take a look at the most common excuses.\n\nExcuse 1: “People with disabilities 

Finally, we can use the [sqlite-utils](https://sqlite-utils.readthedocs.io/) library to create a brand new SQLite database and then automatically create an `articles` database table with the correct columns to store all of our collected data.

In [22]:
import sqlite_utils

In [23]:
db = sqlite_utils.Database("/tmp/24ways.db")

In [24]:
db["articles"].insert_all(docs)

<Table articles>

We want to be able to run full-text search queries against the contents of the `title`, `author` and `contents` columns:

In [25]:
db["articles"].enable_fts(["title", "author", "contents"])

<Table articles>
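
To illustrate what a full-text index makes possible, here is a standalone sketch using only the standard library `sqlite3` module. It builds a tiny `articles` table with made-up rows and an `articles_fts` index over the same three columns, then queries it. It assumes the local SQLite build includes FTS5, and it is an approximation of the kind of structure `enable_fts` sets up, not a copy of what sqlite-utils does internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, author TEXT, contents TEXT)")
conn.executemany(
    "INSERT INTO articles (title, author, contents) VALUES (?, ?, ?)",
    [
        # Invented rows, for illustration only.
        ("Accessible forms", "Example Author", "Notes on accessibility and form labels"),
        ("Grid layouts", "Another Author", "Notes on CSS grid"),
    ],
)

# An FTS5 virtual table that indexes the text stored in the articles table.
conn.execute(
    "CREATE VIRTUAL TABLE articles_fts USING fts5("
    "title, author, contents, content='articles')"
)
conn.execute(
    "INSERT INTO articles_fts (rowid, title, author, contents) "
    "SELECT rowid, title, author, contents FROM articles"
)


def search(query):
    """Return the titles of articles whose indexed columns match the query."""
    sql = (
        "SELECT title FROM articles "
        "WHERE rowid IN (SELECT rowid FROM articles_fts WHERE articles_fts MATCH ?) "
        "ORDER BY rowid"
    )
    return [row[0] for row in conn.execute(sql, (query,))]


accessibility_hits = search("accessibility")
notes_hits = search("notes")
print(accessibility_hits, notes_hits)
```

The default tokenizer is case-insensitive but does not stem words, so a search for "accessibility" finds the first row through its `contents` column, not through the word "Accessible" in its title.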

And here's our final SQLite database:

In [26]:
!ls -lah /tmp/24ways.db

-rw-r--r--  1 simonw  wheel   5.2M Dec 16 22:13 /tmp/24ways.db


You can play with it in Datasette at https://search-24ways.herokuapp.com/