https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy

A funny thing I found while exploring Wikipedia is that if I go from page to page by clicking the first link in each article, I eventually find myself at the Philosophy article. I wonder if that will happen every time. I'm going to try it out by clicking the first link on each Wikipedia page I come to.

Let me start at some Wikipedia page, like the article for African swallow. I suppose I could have used the random article feature if I didn't already have this topic in mind. Now I'm going to click the first link I come to in the main part of the article. I'll skip the disambiguation link and head to the article proper. That takes me to the article for bird. The first link here is endothermic. I can keep going like this for quite a long time. The first link in this article takes me to Ancient Greek, and the next step is Greek language. From there, I get to Modern Greek. Here, the first link is a pronunciation help link. I think it's a good idea to skip over that: it doesn't take me to another article, it just teaches me how to say "Modern Greek" in Modern Greek. The first link to another article is colloquially. After that comes word, then linguistics, then scientific. These articles are getting more and more abstract. The first link in this article is knowledge, which leads to awareness, and then quality, which is a philosophical term. So eventually, I make it to Philosophy.

So the process is: go to a Wikipedia page and find the first ordinary link in the main part of the text. Click through to the new page and repeat. Keep going until you reach Philosophy, which seems to happen pretty often, or until you get tired of clicking.

Let's try the process again. Hmm, where should I start? Why not the chair article? The first link in the article is the word furniture, and the first link in the furniture article takes me back to chair. So not every article chain makes it to Philosophy. Instead, some chains make loops. We can try this with some other articles to see what happens. All this clicking is a bit slow, though. This is something we can automate with Python. It's surprising how often I find that everyday tasks can be made easier with programming.
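
Before touching the network, the click-and-repeat process can be sketched on a tiny in-memory table. This is only an illustration: the `FIRST_LINK` dictionary and the `follow_first_links` function are hypothetical names, and the entries are simplified from the walkthrough above rather than scraped from Wikipedia (in particular, the Quality to Philosophy step is assumed from the narration).

```python
# Hypothetical "first link" table, simplified from the walkthrough above.
# These entries are assumptions for illustration, not scraped data.
FIRST_LINK = {
    "Chair": "Furniture",
    "Furniture": "Chair",
    "Knowledge": "Awareness",
    "Awareness": "Quality",
    "Quality": "Philosophy",
}


def follow_first_links(start, target="Philosophy", max_steps=25):
    """Follow first links from start; return (chain, status)."""
    chain = [start]
    while chain[-1] != target and len(chain) <= max_steps:
        next_page = FIRST_LINK.get(chain[-1])
        if next_page is None:
            # The current page has no known first link.
            return chain, "dead end"
        if next_page in chain:
            # Record the repeated page so the loop is visible in the chain.
            chain.append(next_page)
            return chain, "loop"
        chain.append(next_page)
    status = "reached target" if chain[-1] == target else "gave up"
    return chain, status


print(follow_first_links("Knowledge"))
print(follow_first_links("Chair"))
```

The three ways a chain can stop here (target reached, loop, too many steps) are the same cases the real crawler has to handle, plus the dead end for a page with no links.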


import time
import urllib.parse  # urljoin lives in the urllib.parse submodule

import bs4
import requests


start_url = "https://en.wikipedia.org/wiki/Special:Random"
target_url = "https://en.wikipedia.org/wiki/Philosophy"


def find_first_link(url):
    response = requests.get(url)
    html = response.text
    soup = bs4.BeautifulSoup(html, "html.parser")

    # This div contains the article's body
    # (June 2017 Note: Body nested in two div tags)
    content_div = soup.find(
        id="mw-content-text").find(class_="mw-parser-output")

    # stores the first link found in the article, if the article contains no
    # links this value will remain None
    article_link = None

    # Find all the direct children of content_div that are paragraphs
    for element in content_div.find_all("p", recursive=False):
        # Find the first anchor tag that's a direct child of a paragraph.
        # It's important to only look at direct children, because other types
        # of link, e.g. footnotes and pronunciation, could come before the
        # first link to an article. Those other link types aren't direct
        # children though, they're in divs of various classes.
        anchor = element.find("a", recursive=False)
        if anchor:
            article_link = anchor.get('href')
            break

    if not article_link:
        return

    # Build a full url from the relative article_link url
    first_link = urllib.parse.urljoin(
        'https://en.wikipedia.org/', article_link)

    return first_link


def continue_crawl(search_history, target_url, max_steps=25):
    if search_history[-1] == target_url:
        print("We've found the target article!")
        return False
    elif len(search_history) > max_steps:
        print("The search has gone on suspiciously long, aborting search!")
        return False
    elif search_history[-1] in search_history[:-1]:
        print("We've arrived at an article we've already seen, aborting search!")
        return False
    else:
        return True


article_chain = [start_url]

while continue_crawl(article_chain, target_url):
    print(article_chain[-1])

    first_link = find_first_link(article_chain[-1])
    if not first_link:
        print("We've arrived at an article with no links, aborting search!")
        break

    article_chain.append(first_link)

    time.sleep(.25)  # Slow things down so as to not hammer Wikipedia's servers

https://en.wikipedia.org/wiki/Special:Random
https://en.wikipedia.org/wiki/Ahmad_Jamal
https://en.wikipedia.org/wiki/Jazz
https://en.wikipedia.org/wiki/Music_genre
https://en.wikipedia.org/wiki/Music
https://en.wikipedia.org/wiki/Art
https://en.wikipedia.org/wiki/Human_behavior
https://en.wikipedia.org/wiki/Motion_(physics)
https://en.wikipedia.org/wiki/Physics
https://en.wikipedia.org/wiki/Ancient_Greek
https://en.wikipedia.org/wiki/Greek_language
https://en.wikipedia.org/wiki/Modern_Greek
https://en.wikipedia.org/wiki/Colloquialism
https://en.wikipedia.org/wiki/Vernacular
https://en.wikipedia.org/wiki/Dialect
We've arrived at an article we've already seen, aborting search!
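
The run above stops on a repeat, but the printed chain does not show which article was revisited, because the loop prints each URL before the next one is appended. A minimal sketch of a helper that reports the repeat is below. The name `find_repeat` is hypothetical and not part of the code above, and the example chain uses the chair and furniture loop from the walkthrough rather than the output of this run.

```python
def find_repeat(chain):
    """Return (index_of_first_visit, url) if the last entry was seen before.

    Returns None when the chain is empty or its last entry is new.
    """
    if not chain:
        return None
    last = chain[-1]
    if last in chain[:-1]:
        # list.index returns the position of the first occurrence.
        return chain.index(last), last
    return None


# Example chain taken from the chair/furniture walkthrough (an assumption,
# not output captured from the crawler).
example_chain = [
    "https://en.wikipedia.org/wiki/Chair",
    "https://en.wikipedia.org/wiki/Furniture",
    "https://en.wikipedia.org/wiki/Chair",
]
print(find_repeat(example_chain))
```

Calling something like this on `article_chain` after the loop ends would show where the cycle starts, assuming the chain ended because of a repeat.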
