# Getting to the Philosophy article on Wikipedia

## 0. Intro

According to [Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy), clicking the **first link** in the main text of a Wikipedia article, and then repeating the process for each subsequent article, usually leads to the **Philosophy article**.

As of February 2016, **97% of all articles** in Wikipedia eventually led to the Philosophy article. The remaining articles led to an article without any outgoing wikilinks, led to pages that do not exist, or got stuck in loops. This was up from 94.52% in 2011. At one point, the median length of the link chain needed to reach Philosophy was 23.
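
The process described above can be tried out without touching the network. The following is a minimal sketch, assuming each article's first link is already known and stored in a dictionary. The function name `follow_first_links`, the outcome labels, and the toy link table are all hypothetical and are not taken from Wikipedia data.

```python
def follow_first_links(first_link, start, target="Philosophy", max_steps=25):
    """Follow first links from `start` and return (path, outcome).

    `first_link` maps an article name to the name of its first link,
    or to None when the article has no outgoing links.
    """
    path = [start]
    while True:
        current = path[-1]
        if current == target:
            return path, "reached target"
        if len(path) > max_steps:
            return path, "gave up"
        nxt = first_link.get(current)  # a missing key models a nonexistent page
        if nxt is None:
            return path, "dead end"
        if nxt in path:
            # The next article was already visited, so the chain would cycle
            return path + [nxt], "loop"
        path.append(nxt)


# Hypothetical first-link table, for illustration only
toy_links = {
    "A": "B", "B": "C", "C": "Philosophy",  # a chain that reaches the target
    "X": "Y", "Y": "X",                     # two articles pointing at each other
    "Z": None,                              # an article with no outgoing links
}

for start in ("A", "X", "Z"):
    print(start, follow_first_links(toy_links, start))
```

With this table, starting from "A" reaches the target in three clicks, "X" is caught in a two-article loop, and "Z" stops at an article with no outgoing link. These are the same three outcomes the statistics above distinguish.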

## 1. Web scraping Wikipedia

In [30]:
import time
import urllib.parse

import bs4
import requests


start_url = "https://en.wikipedia.org/wiki/special:Random"
target_url = "https://en.wikipedia.org/wiki/Philosophy"

def find_first_link(url):
    # A timeout keeps the crawl from hanging on a stalled connection
    response = requests.get(url, timeout=10)
    html = response.text
    soup = bs4.BeautifulSoup(html, "html.parser")

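    # This div contains the article's body
    content_div = soup.find(class_="mw-parser-output")
    if content_div is None:
        # No article body found (e.g. an error page), so there is no link
        return None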

    # stores the first link found in the article, if the article contains no
    # links this value will remain None
    article_link = None

    # Find all the direct children of content_div that are paragraphs
    for element in content_div.find_all("p", recursive=False):
        print(element)  # debug: show each paragraph as it is examined
        # Find the first anchor tag that's a direct child of a paragraph.
        # It's important to only look at direct children, because other types
        # of link, e.g. footnotes and pronunciation, could come before the
        # first link to an article. Those other links aren't direct children
        # though; they're nested inside elements such as <span> or <sup>.
        link = element.find("a", recursive=False)
        if link:
            article_link = link.get('href')
            break

    if not article_link:
        return None

    # Build a full url from the relative article_link url
    first_link = urllib.parse.urljoin('https://en.wikipedia.org/', article_link)

    return first_link

def continue_crawl(search_history, target_url, max_steps=25):
    if search_history[-1] == target_url:
        print("We've found the target article!")
        return False
    elif len(search_history) > max_steps:
        print("The search has gone on suspiciously long, aborting search!")
        return False
    elif search_history[-1] in search_history[:-1]:
        print("We've arrived at an article we've already seen, aborting search!")
        return False
    else:
        return True

article_chain = [start_url]

while continue_crawl(article_chain, target_url):
    print(article_chain[-1])

    first_link = find_first_link(article_chain[-1])
    if not first_link:
        print("We've arrived at an article with no links, aborting search!")
        break

    article_chain.append(first_link)

    time.sleep(2)  # Slow things down so as to not hammer Wikipedia's servers
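
The direct-child rule in `find_first_link` can be checked offline on a small HTML string. The sketch below is a stand-in under stated assumptions, not the notebook's method: it uses the standard library's `html.parser` instead of BeautifulSoup, and it accepts any `<p>` rather than only those directly under the article body. The class name `FirstDirectLinkParser`, the helper `first_direct_link`, and the simplified `sample` paragraph (modeled on the Culture paragraph in the run output) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have a closing tag, so they must not change nesting depth
VOID_TAGS = {"br", "img", "hr", "wbr", "input", "meta", "link"}


class FirstDirectLinkParser(HTMLParser):
    """Record the href of the first <a> whose parent element is a <p>."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, outermost first
        self.first_href = None  # stays None if no qualifying link is seen

    def handle_starttag(self, tag, attrs):
        parent_is_p = bool(self.stack) and self.stack[-1] == "p"
        if self.first_href is None and tag == "a" and parent_is_p:
            self.first_href = dict(attrs).get("href")
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def first_direct_link(html, base="https://en.wikipedia.org/"):
    """Return the absolute URL of the first direct-child link, or None."""
    parser = FirstDirectLinkParser()
    parser.feed(html)
    parser.close()
    if parser.first_href is None:
        return None
    return urljoin(base, parser.first_href)


# Simplified paragraph: a pronunciation link inside a <span>, then an
# article link directly under <p>, then a footnote link inside a <sup>
sample = (
    '<p><b>Culture</b> (<span class="IPA">'
    '<a href="/wiki/Help:IPA/English">/ˈkʌltʃər/</a></span>) is the '
    '<a href="/wiki/Social_behavior">social behavior</a> and norms found in '
    'human societies.<sup><a href="#cite_note-1">[1]</a></sup></p>'
)

print(first_direct_link(sample))
```

`first_direct_link(sample)` returns the Social behavior URL, because the pronunciation link sits inside a `<span>` and the footnote link inside a `<sup>`, so neither counts as a direct child of the paragraph.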

Output from one run of the crawl (the printed paragraphs are cut off in places):

https://en.wikipedia.org/wiki/special:Random
<p><b>The Toasters</b> are one of the original American <a class="mw-redirect" href="/wiki/2_Tone_(music_genre)" title="2 Tone (music genre)">second wave of ska</a> bands. Founded in New York City in 1981, the band has released nine studio albums, primarily through <a href="/wiki/Moon_Ska_Records" title="Moon Ska Records">Moon Ska Records</a>.
</p>
https://en.wikipedia.org/wiki/2_Tone_(music_genre)
<p class="mw-empty-elt">
</p>
<p class="mw-empty-elt">
</p>
<p><b>Two-tone</b> (or <b>2 tone</b>) is a genre of British music that fuses traditional <a href="/wiki/Ska" title="Ska">ska</a> with musical elements of <a href="/wiki/Punk_rock" title="Punk rock">punk rock</a> and <a href="/wiki/New_wave_music" title="New wave music">new wave music</a>.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> Its name comes from <a href="/wiki/2_Tone_Records" title="2 Tone Records">2 Tone Records</a>, a label founded by <a href="/wiki/

https://en.wikipedia.org/wiki/Culture
<p class="mw-empty-elt">
</p>
<p><b>Culture</b> (<span class="nowrap"><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="/ˈ/: primary stress follows">ˈ</span><span title="'k' in 'kind'">k</span><span title="/ʌ/: 'u' in 'cut'">ʌ</span><span title="'l' in 'lie'">l</span><span title="/tʃ/: 'ch' in 'China'">tʃ</span><span title="/ər/: 'er' in 'letter'">ər</span></span>/</a></span></span>) is the <a href="/wiki/Social_behavior" title="Social behavior">social behavior</a> and <a class="mw-redirect" href="/wiki/Norm_(social)" title="Norm (social)">norms</a> found in <a href="/wiki/Human" title="Human">human</a> <a href="/wiki/Society" title="Society">societies</a><sup class="noprint Inline-Template Template-Fact" style="white-space:nowrap;">[<i><a href="/wiki/Wikipedia:Citation_needed" title="Wikipedia:Citation needed"><span title="This claim needs references 

https://en.wikipedia.org/wiki/Sociolinguistics
<p><b>Sociolinguistics</b> is the descriptive study of the effect of any and all aspects of <a href="/wiki/Society" title="Society">society</a>, including cultural <a class="mw-redirect" href="/wiki/Norm_(sociology)" title="Norm (sociology)">norms</a>, expectations, and context, on the way <a href="/wiki/Language" title="Language">language</a> is used, and society's effect on language. It differs from <a href="/wiki/Sociology_of_language" title="Sociology of language">sociology of language</a>, which focuses on the effect of language on society. Sociolinguistics overlaps considerably with <a href="/wiki/Pragmatics" title="Pragmatics">pragmatics</a>. It is historically closely related to <a href="/wiki/Linguistic_anthropology" title="Linguistic anthropology">linguistic anthropology</a>, and the distinction between the two fields has been questioned.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>
</p>
https://en.w

https://en.wikipedia.org/wiki/Reality
<p class="mw-empty-elt">
</p>
<p><b>Reality</b> is the sum or aggregate of all that is real or <a href="/wiki/Existence" title="Existence">existent</a>, as opposed to that which is merely <a href="/wiki/Object_of_the_mind" title="Object of the mind">imaginary</a>. The term is also used to refer to the ontological status of things, indicating their existence.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> In <a href="/wiki/Physics" title="Physics">physical</a> terms, reality is the totality of the <a href="/wiki/Universe" title="Universe">universe</a>, known and unknown. Philosophical questions about the nature of reality or existence or being are considered under the rubric of <a href="/wiki/Ontology" title="Ontology">ontology</a>, which is a major branch of <a href="/wiki/Metaphysics" title="Metaphysics">metaphysics</a> in the Western philosophical tradition. Ontological questions also feature in diverse branches of phi