# Web Scraping with Beautiful Soup
*[Beautiful Soup](https://pypi.org/project/beautifulsoup4/)* is a Python library for pulling data out of HTML and XML files. It can be used to scrape text, links, or image URLs from a web page.
<br>
In addition to going step by step through scraping web text with Beautiful Soup, we will cover these core Python topics:

- Regular expressions
- List comprehensions
- For loops
- Writing output to a file (.txt)


In [None]:
#pip install beautifulsoup4
#above line not required in Colab, as bs4 is preinstalled

After installing, import `BeautifulSoup`, along with `requests` for loading web pages and `re` for regular expressions.

In [None]:
from bs4 import BeautifulSoup
import requests, re

Here, we use `requests.get` to load our website. In this example we are interested in downloading articles from eFlux Architecture's project *Superhumanity.*
<br>
Next, we use `BeautifulSoup` to parse the page and print its HTML. The output will look messy at first, but we will learn to search it for the links we want.

In [None]:
url = 'https://www.e-flux.com/architecture/superhumanity/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

print(soup.prettify())

## Scraping HTML for links
One common task is extracting all the URLs found in a page's `<a>` tags. The link itself is stored in the tag's `href` attribute, as can be seen in the following example:
    
`<a class="preview-item-details-title js-open-overlay is-architecture" data-pagetitle="Our Heads Are Round, Our Hands Irregular - Superhumanity - e-flux" data-topline="Superhumanity - Hu Fang - Our Heads Are Round, Our Hands Irregular" href="/architecture/superhumanity/68654/our-heads-are-round-our-hands-irregular/">`

In [None]:
links = soup.find_all("a", {"href": True})
print(links)

While the output above still contains all of the `<a>` tag data (titles, authors, and links), the following prints ONLY the links:

In [None]:
for o in links:
    print(o.attrs["href"])

Rather than just printing the links, let's `append` them to a new list that only contains the links.

<br>
The for loop in this example simply goes through our big, messy list of `<a>` tags (with their titles, authors, and so on) and adds *just* the links to a new list.


In [None]:
# make an empty list for our links
link_list = []
for o in links:
    link_list.append(o.attrs["href"])
print(link_list)

At this point, we have a list containing all of our links, but it also contains some unwanted links (to eFlux social media, parent links, etc.)

The next cell uses a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html) to keep only the list elements that contain the string `"/architecture/superhumanity/"`.

A list comprehension is Python's built-in shorthand for building a new list from an existing one, doing in a single line what would otherwise take a for loop.

In [None]:
new_list = [x for x in link_list if "/architecture/superhumanity/" in x]
new_list
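
To make the equivalence concrete, here is a small sketch that filters a list twice, once with a for loop and once with a list comprehension. The `sample_links` values are invented for illustration and are not taken from the site.

```python
# Hypothetical sample data, for illustration only
sample_links = [
    "/architecture/superhumanity/1/first-article/",
    "/about/",
    "/architecture/superhumanity/2/second-article/",
    "https://twitter.com/example",
]
keyword = "/architecture/superhumanity/"

# Version 1: a for loop that appends matching links one at a time
loop_result = []
for x in sample_links:
    if keyword in x:
        loop_result.append(x)

# Version 2: the same filter written as a list comprehension
comp_result = [x for x in sample_links if keyword in x]

# Both versions produce the same two-element list
print(loop_result == comp_result)
print(comp_result)
```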

One problem with the above list is that it contains duplicate entries. We will address that later.

## Scraping HTML for bodies of text
Great! Now that we have our list of links, we want to follow each one and scrape the text from each page.
The following code follows the same pattern as the earlier cell where we first used `requests`, except this time we are looking at one of the article pages, not the higher-level "superhumanity" page.

In [None]:
p = requests.get("https://www.e-flux.com/architecture/superhumanity/91055/art-without-death/")
# name the parser explicitly, as in the first example, to avoid a bs4 warning
mysoup = BeautifulSoup(p.content, 'html.parser')

print(mysoup.prettify())

Once again we see a big mess of HTML, but it is fairly clear that the article text begins with the tag `<div class="article__body">`.
So we will once again use `.find_all` to isolate the block of text.

In [None]:
text = mysoup.find_all("div", {"class":"article__body"})
text_string = str(text)
print(text_string)
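
As an aside, Beautiful Soup can also return the text without its tags through the `get_text()` method. The sketch below runs on a small invented HTML snippet (an assumption for illustration, not the real e-flux markup) so that it works offline. The rest of this notebook sticks with regular expressions for practice.

```python
from bs4 import BeautifulSoup

# Invented HTML snippet standing in for an article page
sample_html = (
    '<html><body>'
    '<div class="article__body"><p>First paragraph.</p><p>Second paragraph.</p></div>'
    '<div class="footer">Not part of the article</div>'
    '</body></html>'
)

sample_soup = BeautifulSoup(sample_html, "html.parser")
blocks = sample_soup.find_all("div", {"class": "article__body"})

# get_text() drops the tags; separator=" " puts a space between text fragments
# and strip=True trims the whitespace around each fragment
article_text = " ".join(b.get_text(separator=" ", strip=True) for b in blocks)
print(article_text)
```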

There are just a few formatting issues to take care of using `re` (regular expressions). First we delete anything between < angle brackets >.

<br>
Before proceeding, let's look at a specific issue with regular expressions. In our previous example with Wikipedia, we could remove everything between the ==headers== with the following regex: `==.*==+`

<br>
So we may assume we can remove everything between the angle brackets with something like `<.*>`. The problem is that `.*` is greedy: it matches as much as it can, running from the first `<` on a line to the last `>` and deleting the text between tags along with them. Instead, we match a `<`, then any run of characters that are not `>`, then the closing `>`.


<br>

`re.sub(r'` defines this as a substitution with regular expressions
<br>
`<` starting with a left angle bracket
<br>
`[^>]+` one or more characters that are not a right angle bracket
<br>
`>` ending with a right angle bracket
<br>
....replace all of this with `''`, an empty string
<br>

Then replace `\n` newlines with spaces: `text_string.replace('\n', ' ')`

In [None]:
text_string = re.sub(r'<[^>]+>', '', text_string)
text_string = text_string.replace('\n', ' ')
text_string
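
A quick way to see why the pattern uses `[^>]+` rather than `.*` is to try both on a short string. The `sample` string below is invented for illustration.

```python
import re

# Invented sample with several tags on one line
sample = "<p>Hello <b>world</b></p>"

# Greedy: .* runs from the first < to the last >, so the text is lost too
greedy = re.sub(r"<.*>", "", sample)

# Negated character class: each match stops at the first > it meets
tag_only = re.sub(r"<[^>]+>", "", sample)

print(repr(greedy))    # ''
print(repr(tag_only))  # 'Hello world'
```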


So, here we have the entire article text from that single eFlux page. But remember, we have a whole list of 125 other pages.

In [None]:
print(len(new_list))
new_list

It looks like there are some duplicate links in the above list. Here we can use a for loop to keep only the first occurrence of each link.

In [None]:
scrubbed_list = []
for x in new_list:
    # only add a link the first time we see it
    if x not in scrubbed_list:
        scrubbed_list.append(x)
scrubbed_list
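
Duplicates can also be removed in a single step with `dict.fromkeys`, which keeps the first occurrence of each item and preserves the original order. This is offered as an optional shortcut; the `sample` list is invented for illustration.

```python
# Invented sample list with repeated entries
sample = ["/a/", "/b/", "/a/", "/c/", "/b/"]

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so converting back to a list gives the de-duplicated links
unique = list(dict.fromkeys(sample))

print(unique)  # ['/a/', '/b/', '/c/']
```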

The following loop brings everything together: downloading each link, scraping the body text, scrubbing with regex, and combining the results into one long string (this will take a couple of minutes to process all 125 articles).

In [None]:
collect = ''
for link in scrubbed_list:
    # make the whole url from e-flux + list element
    url = "https://www.e-flux.com" + link
    print(url)
    # get the html
    s = requests.get(url)
    soup = BeautifulSoup(s.content, 'html.parser')
    # grab the block-text
    text = soup.find_all("div", {"class":"article__body"})
    text_str = str(text)
    # regex: strip the tags, then replace new lines with spaces
    text_str = re.sub(r'<[^>]+>', '', text_str)
    text_str = text_str.replace('\n', ' ')
    # collect all texts
    collect = collect + text_str
print(collect)
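
The loop builds each address by concatenating strings, which works because every link in the list starts with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` does the same job and also copes with links that are already absolute. The two paths below are the article addresses that appear earlier in this notebook.

```python
from urllib.parse import urljoin

base = "https://www.e-flux.com"
sample_links = [
    # a relative link, as found in the href attributes
    "/architecture/superhumanity/68654/our-heads-are-round-our-hands-irregular/",
    # an absolute link, which urljoin leaves unchanged
    "https://www.e-flux.com/architecture/superhumanity/91055/art-without-death/",
]

full_urls = [urljoin(base, link) for link in sample_links]
for u in full_urls:
    print(u)
```

When requesting many pages in a row, a short pause such as `time.sleep(1)` inside the loop is a common courtesy to the server.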


Let's check the word count of that long string of combined articles

In [None]:
print(str(len(collect.split())) + " appx words") 

And finally, we can write the complete collected text to a single .txt file.

In [None]:
# the with block closes the file automatically, even if an error occurs
with open("eflux.txt", "w", encoding="utf-8") as text_file:
    text_file.write(collect)
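
To confirm that a file was written correctly, it can be read back and compared with the original string. The sketch below uses a short invented string and a temporary folder (both assumptions for illustration) so that it does not touch `eflux.txt`.

```python
import os
import tempfile

# Invented stand-in for the collected article text
sample_text = "First article text. Second article text."

# Write to a temporary folder so nothing in the working directory is changed
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample_text)

# Read the file back and check that it matches what we wrote
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back == sample_text)
```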