# Scraping Text

John McLevey    
Winter 2018

What if we want to grab text that is not in a table, such as the text of news stories? Beautiful Soup will do the heavy lifting.

Let's grab a small amount of text from Factiva. Note that doing this on a large scale violates their policies.

![](images/Screenshot 2018-02-06 11.43.52.png)

![](images/Screenshot 2018-02-06 11.44.56.png)

Then let's further restrict the results to op-eds and commentaries. Duplicates are dropped automatically. For the purposes of this small-scale example, let's take just the first 100.

![](images/Screenshot 2018-02-06 12.05.00.png)

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import csv

In [9]:
text = "data/raw/Factiva.html"
with open(text) as f:  # close the file once it has been parsed
    soup = BeautifulSoup(f, "html.parser")

In [14]:
articles = soup.find_all("div", class_="article enArticle")
print('Found ' + str(len(articles)) + ' articles in this dataset.')

Found 100 articles in this dataset.
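
A note on the `class_` argument used above: when it is given a string with several class names, Beautiful Soup compares it against the whole `class` attribute, so the order of the names matters. The sketch below illustrates this on a small, made-up HTML snippet (the HTML and the names `demo_soup`, `exact`, `both`, and `any_article` are hypothetical, not part of the Factiva file).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for illustration; not taken from the Factiva file
html = """
<div class="article enArticle"><p>First</p></div>
<div class="enArticle article"><p>Second</p></div>
<div class="article"><p>Third</p></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# A multi-word class_ string is compared against the whole class attribute
exact = demo_soup.find_all("div", class_="article enArticle")

# A CSS selector requires both classes but ignores their order
both = demo_soup.select("div.article.enArticle")

# A single class name matches any element that carries it
any_article = demo_soup.find_all("div", class_="article")

print(len(exact), len(both), len(any_article))
```

For the Factiva export the exact string works because every article `div` uses the same attribute value. If a page might vary the order of its classes, a CSS selector is probably the safer choice.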


In [15]:
articles

[<div class="article enArticle"><p>
 <div>Opinion</div>
 <div id="hd"><span class="enHeadline">
 NAFTA will undermine health unless Canada resists monopolies on medicines</span>
 </div><div class="author">Nicholas Caivano Richard Elliott </div><div>728 words</div><div>5 February 2018</div><div>The Toronto Star</div><div>TOR</div><div>English</div><div>Copyright (c) 2018 The Toronto Star </div>
 </p>
 <p class="articleParagraph enarticleParagraph">In his first State of the Union address, Donald <b>Trump</b> again reiterated his campaign promise to bring down drug costs. But a year into his presidency, he's done nothing of the sort. And if U.S. negotiators and powerful corporate lobbyists have their way, the new <span class="companylink">North American Free Trade Agreement</span> (<span class="companylink">NAFTA</span>) will change in ways that would keep drug prices high and out of reach for people in Canada, the U.S., Mexico and beyond.</p>
 <p class="articleParagraph enarticleParagrap

# Extract!

Let's grab titles and body text.

In [11]:
titles = []
bodies = []

In [17]:
for a in articles:
    # The headline lives in a <span class="enHeadline">
    title = a.find_all('span', class_='enHeadline')
    # Collect the body paragraphs, dropping commas from the text
    content = []
    paragraphs = a.find_all('p', class_="articleParagraph enarticleParagraph")
    for p in paragraphs:
        content.append(p.text.replace(',', ''))
    content_clean = " ".join(content)

    # Store one title and one body per article
    titles.append(title[0].text.replace(',', ''))
    bodies.append(content_clean.replace('\n', ''))
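
The loop strips commas from the titles and bodies, presumably so the text cannot be split into extra columns when it is later saved as CSV (that is an assumption about the intent). As a minimal sketch of an alternative, Python's `csv` module quotes fields that contain commas, so the punctuation can stay in the text. The row below is made up for illustration.

```python
import csv
import io

# Hypothetical row, made up for illustration
row = ["NAFTA, health, and medicines", "Prices rose, then fell."]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
csv.writer(buffer).writerow(row)

# Fields containing commas are wrapped in quotes when written
written = buffer.getvalue()

# Reading the line back recovers the original fields, commas included
recovered = next(csv.reader(io.StringIO(written)))
print(recovered == row)
```

Whether to keep or drop commas depends on what the text will be used for later. Keeping them preserves the original punctuation.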

In [21]:
len(titles) == len(bodies)

False

In [24]:
len(titles) - len(bodies)

1

Your turn: Where is the extra title coming from? What's going on here? How can we fix this problem?

Factiva also marks company and organization names with a `companylink` span. Let's see which ones appear.

In [30]:
companies = []

for a in articles:
    linked_companies = a.find_all('span', class_='companylink')
    for company in linked_companies:
        companies.append(company.text)

In [32]:
companies[:10]

['North American Free Trade Agreement',
 'NAFTA',
 'NAFTA',
 'NAFTA',
 'Canadian Institute for Health Information',
 'NAFTA',
 'Health Canada',
 'NAFTA',
 'NAFTA',
 'NAFTA']

In [34]:
import collections

num_cos = collections.Counter(companies)
print(num_cos)

Counter({'NAFTA': 119, 'UN': 28, 'Twitter': 11, 'Google': 11, 'North American Free Trade Agreement': 10, 'G7': 10, 'Facebook': 9, 'NATO': 9, 'World Economic Forum': 7, 'Parliament': 7, 'Ford': 6, 'UN Security Council': 6, 'Canadian government': 6, 'Boeing': 6, 'European Union': 5, 'FCC': 5, 'World Trade Organization': 4, 'U.S. government': 4, 'UN General Assembly': 4, 'New York Times': 3, 'Statistics Canada': 3, 'Canadian Press': 3, 'Canadian Human Rights Tribunal': 3, 'WTO': 2, 'McKinsey Global Institute': 2, 'Fraser Institute': 2, 'Universite Laval': 2, 'Canadian Centre for Policy Alternatives': 2, 'CBC Radio': 2, 'government of Canada': 2, 'U.S. Congress': 2, 'EU': 2, 'Canadian Institute for Health Information': 1, 'Health Canada': 1, 'Elections Canada': 1, 'Inter-Parliamentary Union': 1, 'Netflix': 1, 'Midas': 1, 'IMF': 1, 'SPDR S&P; 500 ETF': 1, 'Quinnipiac University': 1, 'Peterson Institute for International Economics': 1, 'Brookings Institution': 1, 'Universal': 1, 'Department 
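
Printing the whole `Counter` gets unwieldy. As a small sketch, `most_common(n)` returns only the `n` largest counts, already sorted. The list of mentions below is made up for illustration and is not drawn from the Factiva data.

```python
from collections import Counter

# Hypothetical mentions, made up for illustration
mentions = ["NAFTA", "UN", "NAFTA", "Google", "UN", "NAFTA"]
counts = Counter(mentions)

# most_common(n) returns the n highest counts as (item, count) pairs
top_two = counts.most_common(2)
print(top_two)
```

With the real data, `num_cos.most_common(10)` would give a quick top-ten list without building a dataframe first.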

In [40]:
df = pd.DataFrame.from_dict(num_cos, orient='index').reset_index()
df

| | index | 0 |
|---|---|---|
| 0 | North American Free Trade Agreement | 10 |
| 1 | NAFTA | 119 |
| 2 | Canadian Institute for Health Information | 1 |
| 3 | Health Canada | 1 |
| 4 | World Trade Organization | 4 |
| 5 | Twitter | 11 |
| 6 | New York Times | 3 |
| 7 | Facebook | 9 |
| 8 | Elections Canada | 1 |
| 9 | European Union | 5 |


In [42]:
df.rename(columns = {
    0:'Number of Mentions'
}, inplace = True)

In [45]:
df.sort_values('Number of Mentions', ascending = False)

| | index | Number of Mentions |
|---|---|---|
| 1 | NAFTA | 119 |
| 64 | UN | 28 |
| 5 | Twitter | 11 |
| 75 | Google | 11 |
| 0 | North American Free Trade Agreement | 10 |
| 61 | G7 | 10 |
| 7 | Facebook | 9 |
| 38 | NATO | 9 |
| 49 | Parliament | 7 |
| 16 | World Economic Forum | 7 |
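
The build, rename, and sort steps above can also be collapsed into one call. The sketch below is one possible shortcut rather than the approach used in this notebook: it passes the `(item, count)` pairs from `most_common()` to `pd.DataFrame` and names the columns up front. The counts are made up for illustration, and the column name `Organization` is a hypothetical choice.

```python
from collections import Counter

import pandas as pd

# Hypothetical mentions, made up for illustration
counts = Counter(["NAFTA", "UN", "NAFTA", "Google", "UN", "NAFTA"])

# most_common() yields (item, count) pairs sorted from most to least frequent,
# so the resulting dataframe is already in descending order
ranked = pd.DataFrame(
    counts.most_common(),
    columns=["Organization", "Number of Mentions"],
)
print(ranked)
```

The explicit `from_dict`, `rename`, and `sort_values` steps are still worth knowing, since they apply to dataframes that do not start life as a `Counter`.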


Lots of talk about [NAFTA](https://en.wikipedia.org/wiki/North_American_Free_Trade_Agreement)!

# Your Turn

* Where did that extra title come from?
* Make a dataframe where each row is an article. There are two columns. One stores the title, the other stores the body text. 
* What else can you extract from this data?
* Write the dataframes to `csv` or some other format. 