# Web Scraping with Beautiful Soup
*Beautiful Soup* is a Python library for pulling data out of HTML and XML files. It can be used to scrape text, links, image URLs, and other data from a web page.


In [None]:
#pip install beautifulsoup4
#above line not required in Colab, as bs4 is preinstalled

After installing, import `BeautifulSoup`, as well as `requests` for loading web pages and `re` for regular expressions.

In [None]:
from bs4 import BeautifulSoup
import requests, re

Here, we use `requests.get` to load our website. In this example we are interested in downloading articles from the Associated Press *Politics* page.
<br>
Next, we use `BeautifulSoup` to parse and print the HTML content of the page. The output will be messy at first, but we will learn to look through the HTML for the desired links.

In [None]:
s = requests.get("https://apnews.com/hub/politics")
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(s.content, "html.parser")

print(soup.prettify())

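The request above assumes the page loads successfully. As a minimal sketch, the download and parse steps could be wrapped in a small helper that sets a timeout and fails loudly on an HTTP error. The name `fetch_soup` and the 10-second default are assumptions made for illustration; they are not part of `requests` or `bs4`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, for illustration)."""
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling `fetch_soup("https://apnews.com/hub/politics")` would then return a parsed page comparable to `soup` above.
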
## Scraping HTML for links
One common task is extracting all the URLs found within a page’s `<a>` tags. The link itself is stored in the tag's `href` attribute, as can be seen in the following HTML:
    
`<a class="Component-headline-0-2-110" data-key="card-headline" href="/article/election-2020-virus-outbreak-joe-biden-campaigns-kamala-harris-aa0bb12aca5568d20a7d6b86f24da7d0">`
    

In [None]:
links = soup.find_all("a", {"href": True})
print(links)

While the output above still contains all of the `<a>` tag data (titles, authors, and links), the following loop prints ONLY the links:

In [None]:
for o in links:
    print(o.attrs["href"])

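To see what the `href` filter does without depending on the live page, here is a small offline sketch. The HTML snippet and the names `sample_html`, `demo_soup`, and `demo_links` are made up for illustration. It uses the keyword form `href=True`, which should behave the same as the `{"href": True}` dictionary used above.

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for a real page
sample_html = """
<a class="headline" href="/article/first-story">First story</a>
<a href="https://twitter.com/AP">AP on Twitter</a>
<a name="top">Anchor without a link</a>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
demo_links = [a["href"] for a in demo_soup.find_all("a", href=True)]
print(demo_links)  # ['/article/first-story', 'https://twitter.com/AP']
```
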
Rather than just printing the links, let's `append` them to a new list

In [None]:
link_list = []
for o in links:
    link_list.append(o.attrs["href"])
print(link_list)

At this point, we have a list containing all of our links, but it also contains some unwanted links (to AP social media, etc.).

The next cell uses a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html) to keep only those list elements that contain the string `"/article"`.

In [None]:
new_list = [x for x in link_list if "/article" in x]
new_list

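For readers new to list comprehensions, the sketch below shows the same filter written first as an explicit loop and then as a comprehension. The links in `sample_links` are made up for illustration.

```python
# made-up hrefs standing in for link_list
sample_links = [
    "/article/story-one",
    "https://twitter.com/AP",
    "/hub/politics",
    "/article/story-two",
]

# explicit loop: test each element and append the ones that match
kept_loop = []
for x in sample_links:
    if "/article" in x:
        kept_loop.append(x)

# equivalent list comprehension: same test, one line
kept_comp = [x for x in sample_links if "/article" in x]

print(kept_comp)  # ['/article/story-one', '/article/story-two']
```
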
Finally, we see many duplicate/triplicate links, but a quick trick eliminates them. A `list` can contain duplicates, but the keys of a `dictionary` cannot. So the following cell turns `new_list` into the keys of a dictionary and then back into a list, eliminating duplicates while keeping the original order.

In [None]:
new_list = list(dict.fromkeys(new_list))
new_list

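A small sketch of why `dict.fromkeys` is a reasonable choice here: dictionaries keep keys in insertion order (Python 3.7+), so the first occurrence of each link stays in place, whereas a `set` removes duplicates without guaranteeing any order. The sample paths are made up.

```python
# made-up links with repeats
sample = ["/article/b", "/article/a", "/article/b", "/article/c", "/article/a"]

# dict keys are unique and keep insertion order
deduped = list(dict.fromkeys(sample))
print(deduped)  # ['/article/b', '/article/a', '/article/c']

# a set also removes duplicates, but its iteration order is not guaranteed
as_set = set(sample)
print(len(as_set))  # 3
```
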
## Scraping HTML for bodies of text
Great! Now that we have our list of links, we want to follow each one and scrape the textual content from each of the pages.
The following code follows the same pattern as cell #3, where we first used `requests`.
Note: `url` is created by prepending `https://apnews.com` to the second element (index 1) of our `new_list` of links.


In [None]:
url = "https://apnews.com" + new_list[1]
print(url)
p = requests.get(url)
mysoup = BeautifulSoup(p.content, "html.parser")
print(mysoup.prettify())

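String concatenation works because every kept link is a site-relative path. As a hedged alternative, the standard library's `urllib.parse.urljoin` handles both relative and already-absolute links. The example path below is made up.

```python
from urllib.parse import urljoin

base = "https://apnews.com"
relative = "/article/example-slug"              # made-up relative path
absolute = "https://apnews.com/hub/politics"    # already a full URL

# a relative path is resolved against the base
joined_relative = urljoin(base, relative)
# an absolute URL is returned unchanged
joined_absolute = urljoin(base, absolute)

print(joined_relative)  # https://apnews.com/article/example-slug
print(joined_absolute)  # https://apnews.com/hub/politics
```
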
Once again we see a big mess of HTML, but it is fairly clear that the article text begins with the tag `<div class="Article" data-key="article">`.
So we will once again use `.find_all` to isolate the article.

In [None]:
text = mysoup.find_all("div", {"class": "Article"})
text_string = str(text)
print(text_string)

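Converting the result to a string keeps all of the markup. As an alternative sketch (not the approach this notebook takes), Beautiful Soup's `get_text` can return only the text of a tag. The HTML below is made up, and the `Article` class name is assumed to match the page structure described above.

```python
from bs4 import BeautifulSoup

# made-up article markup for illustration
sample_html = '<div class="Article"><p>First paragraph.</p><p>Second <b>bold</b> one.</p></div>'
demo = BeautifulSoup(sample_html, "html.parser")

# find returns the first match, or None when nothing matches
article = demo.find("div", class_="Article")

# join the text pieces with spaces and trim whitespace around each piece
article_text = article.get_text(separator=" ", strip=True) if article else ""
print(article_text)  # First paragraph. Second bold one.
```
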
There are just a few formatting issues to take care of using `re` (regular expressions). First we replace anything between angle brackets `< >` with a space. Replacing `\n` newlines with spaces is left commented out, because AP pages don't appear to contain newline characters.

In [None]:
text_string = re.sub(r'<[^>]+>', ' ', text_string)
#text_string = text_string.replace('\n', ' ')
#print(text_string)
text_string

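Replacing each tag with a space tends to leave runs of spaces and undecoded entities such as `&amp;`. The sketch below shows one possible follow-up cleanup on a made-up string, using only the standard library. The extra steps are suggestions, not part of the original workflow.

```python
import html
import re

# made-up markup for illustration
raw = '<div class="Article"><p>Tom &amp; Jerry</p><p>ran   away.</p></div>'

# same substitution as above: every tag becomes a single space
no_tags = re.sub(r'<[^>]+>', ' ', raw)

# decode HTML entities such as &amp;
decoded = html.unescape(no_tags)

# collapse any run of whitespace into one space and trim the ends
clean = re.sub(r'\s+', ' ', decoded).strip()
print(clean)  # Tom & Jerry ran away.
```
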
So, here we have the entire article text from that single AP article. But remember, we have a whole list of 50+ articles. 

In [None]:
print(len(new_list))
new_list

For the sake of demonstration, let's shorten this to 10:

In [None]:
del new_list[10:]
print(len(new_list))
print(new_list)


The following loop brings everything together: downloading each link, scraping the body text, scrubbing with regex, and combining the results into one long string (this may take a couple of minutes to process all articles).

In [None]:
collect = ''
for link in new_list:

    # make whole url from apnews + list element
    url = "https://apnews.com" + link
    print(url)

    # get html
    s = requests.get(url)
    soup = BeautifulSoup(s.content, "html.parser")

    # grab the article
    text = soup.find_all("div", {"class": "Article"})
    text_str = str(text)

    # strip tags with regex
    text_str = re.sub(r'<[^>]+>', ' ', text_str)
    #text_str = text_str.replace('\n', ' ')

    # collect all texts
    collect = collect + text_str
print(collect)

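A variation on the loop, sketched under stated assumptions: the texts are gathered in a list and joined once at the end, with a short pause between requests to go easy on the server. `collect_articles`, `fetch_text`, and `fake_fetch` are hypothetical names. `fake_fetch` stands in for the download, parse, and regex steps so the sketch runs offline.

```python
import time


def collect_articles(links, fetch_text, delay=1.0):
    """Fetch each link and join the texts (hypothetical helper, for illustration)."""
    parts = []
    for i, link in enumerate(links):
        if i > 0:
            time.sleep(delay)  # brief pause between consecutive requests
        parts.append(fetch_text("https://apnews.com" + link))
    # join once at the end instead of growing a string inside the loop
    return " ".join(parts)


def fake_fetch(url):
    """Stand-in for the real download/parse/regex steps."""
    return "text from " + url


combined = collect_articles(["/article/a", "/article/b"], fake_fetch, delay=0)
print(combined)
```

In real use, `fetch_text` would be a function that performs the `requests.get`, `find_all`, and `re.sub` steps from the loop above and returns the cleaned string.
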

Let's check the word count of that long string of combined articles

In [None]:
print(str(len(collect.split())) + " appx words") 

And finally, we can write the complete collected text to a single .txt file

In [None]:
# the with block closes the file automatically, even if the write fails
with open("ap_politics.txt", "w", encoding="utf-8") as text_file:
    text_file.write(collect)