# Chapter 8 - Building a Web Scraping Pipeline

In [None]:
import requests, bs4
from bs4 import BeautifulSoup as bs
import pandas as pd

Now it's time to build your very own web scraping pipeline.  For these exercises, we will once again be exploring the Metis blog.  We've requested and parsed the HTML for the main blog page for you and saved it as `soup`.

In [None]:
url = 'https://www.thisismetis.com/blog'
response = requests.get(url)
status = response.status_code
if status == 200:
  page = response.text
  soup = bs(page, 'html.parser')  # name the parser explicitly to avoid bs4's parser-guessing warning
else:
  print(f"Oops! Received status code {status}")

## Exercise 1 - Collecting Information from One Page

For this exercise, you will extract each blog post's title, author, and URL from the main blog page.  Create one dictionary for each post with the keys "title", "author", and "link"; here's one example:

```
{
  'title': 'Our Top 10 Most-Read Blog Posts of 2020',
  'author': 'Carlos Russo',
  'link': '/blog/our-top-10-most-read-blog-posts-of-2020'
}
```

Save these dictionaries in a list called `post_list`.  You should find that `post_list` has between 5 and 20 dictionary elements.

_Hint_: Be sure to clean the string containing the author!
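
As a rough illustration of the hint, here is a minimal sketch of cleaning a byline string. It assumes the raw text looks something like `By <author> • <date>`; the sample string and the helper name `clean_author` are hypothetical, so check the actual page text before relying on this format.

```python
def clean_author(details: str) -> str:
    """Return the author name from a byline such as 'By Jane Doe • Jan 1, 2021'."""
    # Keep only the part before the bullet separator (assumed format).
    byline = details.split('•')[0].strip()
    # Drop a leading "By" only; a blanket replace would also alter names like "Byron".
    if byline.startswith('By '):
        byline = byline[len('By '):]
    return byline.strip()


# Hypothetical raw string, including stray whitespace from the HTML.
print(clean_author('\n  By Carlos Russo • January 5, 2021 '))
```

Removing only a leading "By" is a design choice: it leaves authors whose names happen to contain those letters untouched.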

In [None]:
post_list = []

### BEGIN SOLUTION
for div in soup.find_all(class_='blog-post-summary'):
  title_tag = div.find(class_='blog-post-title')
  title = title_tag.text
  link = title_tag.parent['href']
  author = div.find(class_='blog-post-details').text.split('•')[0].replace('By', '').strip()
  post_list.append({'title': title, 'author': author, 'link': link})
### END SOLUTION

post_list

[{'author': 'Carlos Russo',
  'link': '/blog/bootcamp-grad-aims-to-drive-advancements-in-healthcare-research',
  'title': 'Bootcamp Grad Aims To Drive Advancements in Healthcare Research'},
 {'author': 'Carlos Russo',
  'link': '/blog/our-top-10-most-read-blog-posts-of-2020',
  'title': 'Our Top 10 Most-Read Blog Posts of 2020'},
 {'author': 'Carlos Russo',
  'link': '/blog/a-virtual-classroom-tour-with-course-report',
  'title': 'A Virtual Classroom Tour with Course Report'},
 {'author': 'Jason Moss',
  'link': '/blog/founders-note-the-continuing-evolution-of-metis',
  'title': 'Founder’s Note: The Continuing Evolution of Metis'},
 {'author': 'Metis',
  'link': '/blog/jason-moss-discusses-innovation-covid-19-and-managing-through-difficult-times',
  'title': 'Jason Moss Discusses Innovation, COVID-19, and Managing Through Difficult Times'},
 {'author': 'Tony Yiu',
  'link': '/blog/stress-testing-our-fair-value-calculation',
  'title': 'Stress Testing Our Stock Market Fair Value Calcula

In [None]:
assert type(post_list) == list, "Be sure that post_list is a Python list."
assert type(post_list[0]) == dict, "Each entry in post_list should be a Python dictionary."
assert 5 < len(post_list) < 20, "You should find between 5 and 20 posts on the main page of the blog."

In [None]:
### BEGIN HIDDEN TESTS
test_list = []
for div in soup.find_all(class_='blog-post-summary'):
  title_tag = div.find(class_='blog-post-title')
  title = title_tag.text
  link = title_tag.parent['href']
  author = div.find(class_='blog-post-details').text.split('•')[0].replace('By', '').strip()
  test_list.append({'title': title, 'author': author, 'link': link})

assert post_list == test_list
### END HIDDEN TESTS

## Exercise 2 - Collecting Information from Multiple Pages

Now you're going to scale up and collect the same information for at least the fifty most recent blog posts (including those from the main blog page).

You will need to create a web scraping pipeline to do this.  Remember the main steps:
1. Gather links -- develop a strategy for scraping past posts
2. Scrape the same data from each page
3. Clean the data as necessary

Save the information as one dictionary for each post (as described in Exercise 1) and put all the posts in a list called `pipeline_list`. 
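
The three steps above can be sketched as two small functions. This is only one possible structure, not the required solution: the names `page_urls` and `collect_posts`, the `example.com` URLs, and the stand-in `fake_fetch`/`fake_parse` functions are all assumptions made so the sketch runs without network access.

```python
import time
from typing import Callable, Dict, Iterable, List

Post = Dict[str, str]


def page_urls(base: str, n_pages: int) -> List[str]:
    """Step 1: build the list of paginated index URLs (pages 1..n_pages)."""
    return [f"{base}{i}" for i in range(1, n_pages + 1)]


def collect_posts(urls: Iterable[str],
                  fetch: Callable[[str], str],
                  parse: Callable[[str], List[Post]],
                  delay: float = 0.0) -> List[Post]:
    """Steps 2 and 3: fetch each page, parse it, and skip duplicate links."""
    seen = set()
    posts: List[Post] = []
    for url in urls:
        html = fetch(url)
        for post in parse(html):
            if post['link'] not in seen:
                seen.add(post['link'])
                posts.append(post)
        time.sleep(delay)  # pause between requests to be polite to the server
    return posts


# Stand-in data and functions (hypothetical) so the sketch runs offline.
fake_pages = {
    'https://example.com/blog/page/1': 'a,b',
    'https://example.com/blog/page/2': 'b,c',
}


def fake_fetch(url: str) -> str:
    return fake_pages[url]


def fake_parse(html: str) -> List[Post]:
    return [{'title': s.upper(), 'author': 'n/a', 'link': f'/blog/{s}'}
            for s in html.split(',')]


demo = collect_posts(page_urls('https://example.com/blog/page/', 2),
                     fake_fetch, fake_parse)
print([p['link'] for p in demo])  # ['/blog/a', '/blog/b', '/blog/c']
```

In the notebook, `fetch` could wrap `requests.get(url).text` and `parse` could hold the extraction loop from Exercise 1. Passing them in as arguments keeps the looping logic easy to test on its own.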

In [None]:
pipeline_list = []

### BEGIN SOLUTION

url_base = 'https://www.thisismetis.com/blog/page/'
url_list = [url_base+str(i) for i in range(1,6)]
for url in url_list:
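  response = requests.get(url)
  response.raise_for_status()  # fail loudly instead of parsing an error page
  soup = bs(response.text, 'html.parser')  # name the parser explicitly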

  for div in soup.find_all(class_='blog-post-summary'):
    title_tag = div.find(class_='blog-post-title')
    title = title_tag.text
    link = title_tag.parent['href']
    author = div.find(class_='blog-post-details').text.split('•')[0].replace('By', '').strip()
    pipeline_list.append({'title': title, 'author': author, 'link': link}) 


### END SOLUTION

pipeline_list[-5:]

[{'author': 'Emily Wilson',
  'link': '/blog/a-hunger-for-data-leads-grad-to-role-at-hellofresh',
  'title': 'A Hunger for Data Leads Grad to Role at HelloFresh'},
 {'author': 'Metis',
  'link': '/blog/metis-courses-offered-through-new-wake-forest-university-financial-services-and-fintech-hub',
  'title': 'Metis Courses Offered Through New Wake Forest University Financial Services and Fintech Hub'},
 {'author': 'Metis',
  'link': '/blog/artists-data-science-podcast-metis-chief-data-scientist-debbie-berebichez',
  'title': 'The Artists of Data Science Podcast Feat. Metis Chief Data Scientist Debbie Berebichez'},
 {'author': 'Metis',
  'link': '/blog/video-metis-chief-data-scientist-making-of-a-data-scientist',
  'title': 'VIDEO: Metis Chief Data Scientist Discusses The Making of a Data Scientist'},
 {'author': 'Metis',
  'link': '/blog/metis-included-course-reports-35-best-online-bootcamps-of-2020-list',
  'title': "Metis Included on Course Report's 35 Best Online Bootcamps of 2020 List

In [None]:
assert type(pipeline_list) == list, "Be sure that pipeline_list is a Python list."
assert type(pipeline_list[0]) == dict, "Each entry in pipeline_list should be a Python dictionary."
assert len(pipeline_list) >= 50, "Be sure to collect information for at least 50 blog posts."

In [None]:
### BEGIN HIDDEN TESTS
test_pipeline_list = []
url_base = 'https://www.thisismetis.com/blog/page/'
url_list = [url_base+str(i) for i in range(1,6)]
for url in url_list:
  page = requests.get(url).text
  soup = bs(page, 'html.parser')  # name the parser explicitly

  for div in soup.find_all(class_='blog-post-summary'):
    title_tag = div.find(class_='blog-post-title')
    title = title_tag.text
    link = title_tag.parent['href']
    author = div.find(class_='blog-post-details').text.split('•')[0].replace('By', '').strip()
    test_pipeline_list.append({'title': title, 'author': author, 'link': link}) 

for post in test_pipeline_list:
  assert post in pipeline_list
### END HIDDEN TESTS

## Exercise 3 - Storing Scraped Data

Now that you have built `pipeline_list`, let's convert that data into a pandas DataFrame.  Call your DataFrame `pipeline_df`.  It should have at least 50 rows, one for each of the most recent posts, and three columns: "title", "author", and "link".
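
As a small illustration of how a list of dictionaries maps onto a DataFrame, here is a sketch using two made-up records in the same shape as the entries of `pipeline_list`. The names `sample_posts` and `sample_df` and the record contents are hypothetical.

```python
import pandas as pd

# Hypothetical records shaped like the entries of pipeline_list.
sample_posts = [
    {'title': 'First Post', 'author': 'A. Writer', 'link': '/blog/first-post'},
    {'title': 'Second Post', 'author': 'B. Writer', 'link': '/blog/second-post'},
]

# Each dictionary becomes a row; the keys become column names.
# Passing columns= fixes the column order explicitly.
sample_df = pd.DataFrame(sample_posts, columns=['title', 'author', 'link'])

print(sample_df.shape)          # (2, 3)
print(list(sample_df.columns))  # ['title', 'author', 'link']
```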

Congratulations -- you have now completed a full web scraping pipeline!

In [None]:
### BEGIN SOLUTION
pipeline_df = pd.DataFrame(pipeline_list)
### END SOLUTION

pipeline_df.head()

|   | title | author | link |
|---|---|---|---|
| 0 | Bootcamp Grad Aims To Drive Advancements in He... | Carlos Russo | /blog/bootcamp-grad-aims-to-drive-advancements... |
| 1 | Our Top 10 Most-Read Blog Posts of 2020 | Carlos Russo | /blog/our-top-10-most-read-blog-posts-of-2020 |
| 2 | A Virtual Classroom Tour with Course Report | Carlos Russo | /blog/a-virtual-classroom-tour-with-course-report |
| 3 | Founder’s Note: The Continuing Evolution of Metis | Jason Moss | /blog/founders-note-the-continuing-evolution-o... |
| 4 | Jason Moss Discusses Innovation, COVID-19, and... | Metis | /blog/jason-moss-discusses-innovation-covid-19... |


In [None]:
assert type(pipeline_df) == pd.DataFrame, "Be sure pipeline_df is a pandas dataframe."
assert len(pipeline_df) >= 50, "pipeline_df should contain the information for at least 50 blog posts."
for col in ['title', 'author', 'link']:
  assert col in pipeline_df.columns, f"pipeline_df should contain a column called {col}"

In [None]:
### BEGIN HIDDEN TESTS
test_pipeline_df = pd.DataFrame(test_pipeline_list)
for col in test_pipeline_df.columns:
  assert test_pipeline_df.loc[0, col] in pipeline_df[col].values   # first required post
  assert test_pipeline_df.loc[49, col] in pipeline_df[col].values  # last required post
### END HIDDEN TESTS