### **Getting Early Life Information for U.S. Presidents from Wikipedia**

The goal is to collect information on the early life of every U.S. president. For each president, we'll take the **first paragraph** of the Wikipedia section that discusses their early life.

Take a look at some of their pages to decide how we can target that paragraph:

https://en.wikipedia.org/wiki/William_McKinley

https://en.wikipedia.org/wiki/John_Adams

https://en.wikipedia.org/wiki/Thomas_Jefferson


It's somewhat difficult to target exactly that portion of the page:

- There are many paragraphs; if we scrape all of them, we need a way to tell which one belongs to the early-life section
- We can try to look at the paragraph in the "Early life" section, but this section is named differently on each page ("Early life and education", "Early life and career", "Early life and family")
- Section headings always have an `h` tag (`h2`, `h3`, ...) containing a span with the `mw-headline` class, but every section heading has these, so they don't single out the one we want
- My approach is to look through the section headings and take the one that contains the phrase "early life"
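The heading-matching idea can be sketched on a small HTML snippet before touching a live page. The markup in `SAMPLE_HTML` and the helper name `find_early_life_headings` are illustrative assumptions, not taken from any real article; the point is only that a case-insensitive substring test catches all the naming variants listed above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for headings from different articles.
SAMPLE_HTML = """
<h2>Contents</h2>
<h2><span class="mw-headline">Early life and education</span></h2>
<h2><span class="mw-headline">Early life and career</span></h2>
<h3><span class="mw-headline">Early Life and Family</span></h3>
<h2><span class="mw-headline">Presidency</span></h2>
"""

def find_early_life_headings(html):
    """Return the text of every h2/h3 heading that mentions 'early life'."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        h.get_text(strip=True)
        for h in soup.find_all(["h2", "h3"])
        if "early life" in h.get_text().lower()  # case-insensitive match
    ]

matches = find_early_life_headings(SAMPLE_HTML)
print(matches)
```

Lowercasing the heading text before comparing means differences in capitalization do not matter, and the substring test ignores whatever follows the phrase.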



In [1]:
from bs4 import BeautifulSoup
import requests

Let's do this for one page first: https://en.wikipedia.org/wiki/George_Washington

In [2]:
url = 'https://en.wikipedia.org/wiki/George_Washington'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning

In [3]:
r.status_code

200

In [11]:
soup.find_all(['h2','h3'])

[<h2 class="vector-pinnable-header-label">Contents</h2>,
 <h2><span id="Early_life_.281732.E2.80.931752.29"></span><span class="mw-headline" id="Early_life_(1732–1752)">Early life (1732–1752)</span></h2>,
 <h2><span id="Colonial_military_career_.281752.E2.80.931758.29"></span><span class="mw-headline" id="Colonial_military_career_(1752–1758)">Colonial military career (1752–1758)</span></h2>,
 <h3><span class="mw-headline" id="French_and_Indian_War">French and Indian War</span></h3>,
 <h2><span id="Marriage.2C_civilian.2C_and_political_life_.281755.E2.80.931775.29"></span><span class="mw-headline" id="Marriage,_civilian,_and_political_life_(1755–1775)">Marriage, civilian, and political life (1755–1775)</span></h2>,
 <h3><span class="mw-headline" id="Opposition_to_the_British_Parliament_and_Crown">Opposition to the British Parliament and Crown</span></h3>,
 <h2><span id="Commander_in_chief_.281775.E2.80.931783.29"></span><span class="mw-headline" id="Commander_in_chief_(1775–1783)">Comma

Use a list comprehension to see what text appears in each `h2` tag:

In [12]:
[item.text for item in soup.find_all('h2')]

['Contents',
 'Early life (1732–1752)',
 'Colonial military career (1752–1758)',
 'Marriage, civilian, and political life (1755–1775)',
 'Commander in chief (1775–1783)',
 'Early republic (1783–1789)',
 'Presidency (1789–1797)',
 'Post-presidency (1797–1799)',
 'Burial, net worth, and aftermath',
 'Personal life',
 'Slavery',
 'Historical reputation and legacy',
 'See also',
 'Notes',
 'References',
 'Bibliography',
 'Further reading',
 'External links']

Our target is the second heading (index 1). Rather than relying on its position, we can select it by its text.

In [13]:
heading = [item for item in soup.find_all('h2') if 'early life' in item.text.lower()]
type(heading)
heading[0]

<h2><span id="Early_life_.281732.E2.80.931752.29"></span><span class="mw-headline" id="Early_life_(1732–1752)">Early life (1732–1752)</span></h2>

Use the `find_next` method to find the first `p` tag that follows the heading:

In [None]:
first_paragraph = heading[0].find_next('p').text
first_paragraph

'George Washington was born on February 22, 1732,[a] at Popes Creek in Westmoreland County, Virginia.[3] He was the first of six children of Augustine and Mary Ball Washington.[4] His father was a justice of the peace and a prominent public figure who had four additional children from his first marriage to Jane Butler.[5] The family moved to Little Hunting Creek in 1734 before eventually settling in Ferry Farm near Fredericksburg, Virginia. When Augustine died in 1743, Washington inherited Ferry Farm and ten slaves; his older half-brother Lawrence inherited Little Hunting Creek and renamed it Mount Vernon.[6][7]\n'
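To make the behavior of `find_next` concrete, here is a minimal sketch on invented markup (the `SAMPLE_HTML` string is an assumption for illustration, not real Wikipedia HTML). It shows that `find_next('p')` skips over any non-`p` elements between the heading and the paragraph, and that it returns `None` when no later `p` exists.

```python
from bs4 import BeautifulSoup

# Hypothetical section: a note and a figure sit between the heading and its text.
SAMPLE_HTML = """
<h2>Early life</h2>
<div class="hatnote">Further information: a related article</div>
<figure><figcaption>A caption, not a paragraph</figcaption></figure>
<p>First paragraph of the section.</p>
<p>Second paragraph of the section.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
heading = soup.find("h2")

# find_next walks forward in document order and returns the first matching tag.
next_paragraph = heading.find_next("p")
print(next_paragraph.get_text())

# Nothing follows the last paragraph, so find_next returns None here.
last_paragraph = soup.find_all("p")[-1]
after_last = last_paragraph.find_next("p")
print(after_last)
```

Because `find_next` does not stop at the next heading, a section with no paragraph of its own would yield a paragraph from a later section.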

Put all of this into a function.

In [14]:
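def get_early_life(url):
    # Download and parse the article
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Keep the h2/h3 headings whose text mentions "early life"
    heading = [item for item in soup.find_all(['h2','h3']) if 'early life' in item.text.lower()]
    # Take the first paragraph after the first matching heading
    # (raises IndexError if the page has no such heading)
    first_paragraph = heading[0].find_next('p').text
    return first_paragraph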

Try it on a different article to see if it works:

In [15]:
url = 'https://en.wikipedia.org/wiki/William_McKinley'
get_early_life(url)

'William McKinley Jr. was born in 1843 in Niles, Ohio, the seventh of nine children of William McKinley Sr. and Nancy (née Allison) McKinley.[1] The McKinleys were of English and Scots-Irish descent and had settled in western Pennsylvania in the 18th century. Their immigrant ancestor was David McKinley, born in Dervock, County Antrim, in present-day Northern Ireland. William McKinley Sr. was born in Pennsylvania, in Pine Township, Mercer County.[1]\n'
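Before running this over every president, it may help to handle pages that have no "early life" heading, since `heading[0]` would raise an `IndexError` there. The following is a sketch under assumptions: `extract_early_life` is a hypothetical variant that takes already-downloaded HTML instead of a URL, and the entries in `PAGES` are invented stand-ins for real articles.

```python
from bs4 import BeautifulSoup

def extract_early_life(html):
    """Return the first paragraph after an 'early life' heading, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for heading in soup.find_all(["h2", "h3"]):
        if "early life" in heading.get_text().lower():
            paragraph = heading.find_next("p")
            # A matching heading may still have no paragraph after it.
            return paragraph.get_text().strip() if paragraph else None
    return None  # no heading mentions "early life"

# Hypothetical pages: one with the section, one without.
PAGES = {
    "With section": "<h2>Early life and education</h2><p>Born in a small town.[1]</p>",
    "Without section": "<h2>Presidency</h2><p>Took office.</p>",
}

results = {name: extract_early_life(html) for name, html in PAGES.items()}
print(results)
```

With real articles, the HTML could come from `requests.get(url).text` as in the cells above; pages that map to `None` can then be inspected by hand.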