### Class Example:  Building A Web Scraper

This notebook briefly walks through building a web scraper and relates it to several concepts covered so far in this workshop.

**Important**:  This example is not meant to explain every line of code that's used, only to relate some important concepts discussed in class to a real-world example.  It's okay if you don't understand everything that's shown here. Focus on the bigger picture, and turn this notebook into a follow-up project after the workshop.

**Step 1**:  Do the imports

In [1]:
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd

**Step 2**: Connect to the URL using the requests library.

In [2]:
url         = 'https://generalassemb.ly/education'

params      = {
                'where'  : 'new-york-city',
                'format' : 'classes-workshops'
              }

r           = requests.get(url, params=params)
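
The `params` dictionary ends up in the URL as a query string. As a rough illustration using only the standard library (a sketch of the idea, not a description of how `requests` works internally), the following builds the URL that the request above should correspond to:

```python
from urllib.parse import urlencode

url = 'https://generalassemb.ly/education'
params = {
    'where': 'new-york-city',
    'format': 'classes-workshops',
}

# urlencode turns the dictionary into 'key=value' pairs joined by '&'
query_string = urlencode(params)
full_url = url + '?' + query_string
print(full_url)
```

After the real request has run, `r.url` holds the URL that `requests` actually used, which is a quick way to confirm the parameters were applied.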

Now the server's response is saved in the variable `r`.

We can grab the entire content of the page as a single piece of text with `r.text`.

In [3]:
r.text



**Our Problem**:  We don't need the vast majority of this information, just the list at the bottom of the page that contains the information about the classes.

To turn this into actual data, we need to accomplish two steps:
 
 - Extract the portion of the string that ONLY contains what we need
 - Convert this into an actual Python list that we can then manipulate further.
 
We are not going to go over some of this code, so make a note to follow up on anything you don't understand:

In [5]:
# turn the request text into a web scraping object
doc         = BeautifulSoup(r.text, 'html.parser')
# grab all the script tags, then index from the end to get the one we want (the second to last)
script      = doc.find_all('script')[-2]
# convert the web scraping object into a string
text        = str(script)
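
Picking the script tag by position (`[-2]`) works only as long as the page layout stays the same. A sturdier idea is to pick the tag by what it contains. The sketch below shows that idea with the standard library's `html.parser` instead of BeautifulSoup, so that it runs on its own; `ScriptCollector` and `sample_html` are hypothetical names and data made up for illustration, not part of the real page.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text inside every <script> tag."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        # script text may arrive in several chunks, so append to the current entry
        if self.in_script:
            self.scripts[-1] += data


# hypothetical miniature page, standing in for r.text
sample_html = (
    '<html><body>'
    '<script>var tracking = 1;</script>'
    '<script>window.EDUCATIONAL_OFFERINGS_JSON = [{"title": "Demo"}];</script>'
    '<script>var footer = 2;</script>'
    '</body></html>'
)

collector = ScriptCollector()
collector.feed(sample_html)

# choose the script by what it contains rather than by its position
marker = 'window.EDUCATIONAL_OFFERINGS_JSON'
matches = [s for s in collector.scripts if marker in s]
print(matches[0])
```

With BeautifulSoup the same idea would be to loop over `doc.find_all('script')` and keep the tag whose `str(...)` contains the marker.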

Now let's take a look at our variable text:

In [7]:
text

'<script>\n  window.EDUCATIONAL_OFFERINGS_JSON = [{"format":"workshop","overview":"Creating a perfect resumé from a recruiter’s lense!","topics":[{"id":9,"name":"Career Development","asset_folder":"career_development"}],"instructors":[{"id":20807,"name":"Daniel Robinson","title":"Career Coach \\u0026 Employer Partnerships, General Assembly"}],"title":"The Resumé Run-Down","starts":"2019-12-14T15:00:00.000Z","length_in_weeks":null,"url":"http://generalassemb.ly/education/the-resume-run-down/new-york-city/93407","image_url":"https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/12118/thumb_thumb_Tech_Branding_Networking_You_Hand_Business_Card.jpg","duration_description":null,"next_info_session":null,"number_of_sessions":null,"date_num":"14","date_description":"Sat, 14 December","time_description":"10:00 - 11:30am EST"},{"format":"workshop","overview":"Find out which of your ideas are most viable by learning how to test them quickly and efficiently.","topics":[{"id":1,

This is closer to what we want, but we still need to whittle it down to the actual list that we need, and nothing else.

In [8]:
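# get the index position where the next variable starts, which is just past the end of our list
end         = text.index('window.TOPICS_JSON')
# these two slices reduce the string to the exact positions of the list:
# 47 is the position of the opening '[', and -4 trims the leftover characters after the closing ']'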
text        = text[:end]
text        = text[47:-4] 
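
The hard-coded `47` matches the length of the prefix `'<script>\n  window.EDUCATIONAL_OFFERINGS_JSON = '` seen in the output above, so it breaks if that prefix ever changes. A possible alternative, sketched here on a hypothetical miniature string (`sample` is made up for illustration), is to search for the brackets instead of counting characters:

```python
# hypothetical stand-in for the script text shown above
sample = (
    '<script>\n'
    '  window.EDUCATIONAL_OFFERINGS_JSON = [{"title": "Demo"}];\n'
    '  window.TOPICS_JSON = [{"id": 1}];\n'
    '</script>'
)

start_marker = 'window.EDUCATIONAL_OFFERINGS_JSON'
end_marker = 'window.TOPICS_JSON'

# cut off everything from the second variable onward
end = sample.index(end_marker)
head = sample[:end]

# the list starts at the first '[' after the first marker
# and ends at the last ']' before the second marker
start = head.index('[', head.index(start_marker))
stop = head.rindex(']') + 1
list_text = head[start:stop]
print(list_text)
```

On this sample the search finds the same positions as the fixed slice `[47:-4]`, but it would keep working if the spacing around the variable changed.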

Now let's go ahead and take a look at our text variable:

In [9]:
text

'[{"format":"workshop","overview":"Creating a perfect resumé from a recruiter’s lense!","topics":[{"id":9,"name":"Career Development","asset_folder":"career_development"}],"instructors":[{"id":20807,"name":"Daniel Robinson","title":"Career Coach \\u0026 Employer Partnerships, General Assembly"}],"title":"The Resumé Run-Down","starts":"2019-12-14T15:00:00.000Z","length_in_weeks":null,"url":"http://generalassemb.ly/education/the-resume-run-down/new-york-city/93407","image_url":"https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/12118/thumb_thumb_Tech_Branding_Networking_You_Hand_Business_Card.jpg","duration_description":null,"next_info_session":null,"number_of_sessions":null,"date_num":"14","date_description":"Sat, 14 December","time_description":"10:00 - 11:30am EST"},{"format":"workshop","overview":"Find out which of your ideas are most viable by learning how to test them quickly and efficiently.","topics":[{"id":1,"name":"Business","asset_folder":"business"}],"i

This is a list serialized as a string, so it has to be turned back into a list.  Python's built-in `json` module does this with one line of code.

In [13]:
data = json.loads(text)
data[5]

{'format': 'workshop',
 'overview': 'In this introductory class, learn how to build an email marketing strategy that grows your business by acquiring and retaining customers.\r\n',
 'topics': [{'id': 14, 'name': 'Marketing', 'asset_folder': 'marketing'}],
 'instructors': [{'id': 13206,
   'name': 'Sharon Lee Thony',
   'title': 'Digital Marketing Executive, Founder & Principal of SLT Consulting, Sharon Lee Thony Consulting'}],
 'title': 'Email Marketing for Entrepreneurs and Startups',
 'starts': '2019-12-14T15:00:00.000Z',
 'length_in_weeks': None,
 'url': 'http://generalassemb.ly/education/email-marketing-for-entrepreneurs-and-startups/new-york-city/84081',
 'image_url': 'https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/10294/thumb_Marketing_Email_Communication_Envelope_Data_Send_Chart_Open_Card.jpg',
 'duration_description': None,
 'next_info_session': None,
 'number_of_sessions': None,
 'date_num': '14',
 'date_description': 'Sat, 14 December',
 'time_descr
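
As a small self-contained illustration of what `json.loads` does with the different JSON types, the sketch below parses a hypothetical one-record string (`sample_text` is made up, but uses the same kinds of keys as the scraped text):

```python
import json

# hypothetical one-record string in the same shape as the scraped text
sample_text = '[{"title": "Demo Class", "length_in_weeks": null, "topics": [{"id": 8, "name": "Data"}]}]'

records = json.loads(sample_text)

print(type(records))                    # the outer JSON array becomes a Python list
print(type(records[0]))                 # each JSON object becomes a dictionary
print(records[0]['length_in_weeks'])    # JSON null becomes None
print(records[0]['topics'][0]['name'])  # nested values are reached by chaining keys and indexes
```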

And now if we take a look at this, we can see that it is a normal list that can be manipulated in every way we've discussed so far today.

In [9]:
data[0]

{'format': 'workshop',
 'overview': 'Learn to communicate with databases by learning the language of choice, Structured Query Language (SQL).\r\n',
 'topics': [{'id': 8, 'name': 'Data', 'asset_folder': 'data'}],
 'instructors': [{'id': 5455,
   'name': 'Nigel Caldon',
   'title': 'Co-Founder, BALLSTAR'}],
 'title': 'SQL Bootcamp: Learning the Language',
 'starts': '2019-10-31T14:00:00.000Z',
 'length_in_weeks': None,
 'url': 'http://generalassemb.ly/education/sql-bootcamp-learning-the-language/new-york-city/83835',
 'image_url': 'https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/7867/thumb_Data_Excel_SQL_Analysis_Can_QA_Number_Can_Base_Database.jpg',
 'duration_description': None,
 'next_info_session': None,
 'number_of_sessions': None,
 'date_num': '31',
 'date_description': 'Thu, 31 October',
 'time_description': '10:00 -  5:00pm EDT'}
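
To make "manipulated like any other list" concrete, here is a minimal sketch on a few hypothetical records (`sample_data` is made up and uses only some of the keys shown above):

```python
# hypothetical records with some of the same keys as the scraped data
sample_data = [
    {'title': 'SQL Bootcamp', 'format': 'workshop', 'starts': '2019-10-31T14:00:00.000Z'},
    {'title': 'UX Series', 'format': 'workshop-series', 'starts': '2019-12-14T15:00:00.000Z'},
    {'title': 'Email Marketing', 'format': 'workshop', 'starts': '2019-12-14T15:00:00.000Z'},
]

# indexing and slicing work as on any list
first_two = sample_data[:2]

# a list comprehension can filter the records by a key
workshops = [x['title'] for x in sample_data if x['format'] == 'workshop']

# sorted() can order the records by any field
by_title = sorted(sample_data, key=lambda x: x['title'])

print(len(sample_data), workshops, by_title[0]['title'])
```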

And finally, let's do some additional processing to see how we can go ahead and turn this into a dataframe.

**Key Note**:  We are not going to go over these details in class!

In [14]:
# unpack the first instructor's name from each record's instructors list (None if the list is empty)
instructors = [x['instructors'][0]['name'] if x['instructors'] else None for x in data]
# unpack the first topic's name from each record's topics list
topics      = [x['topics'][0]['name'] for x in data]
# unpack the start timestamp from each record
date        = [x['starts'] for x in data]

# turn the original list into a dataframe
df          = pd.DataFrame(data)

# make new columns from the variables instructors, topics, date
df['Instructor']   = pd.Series(instructors)
df['Topic']        = pd.Series(topics)
df['date']         = pd.to_datetime(date)
df['date']         = df['date'].dt.date
# drop unnecessary columns
df                 = df.drop(['url', 'topics', 'instructors', 'image_url', 'next_info_session', 'starts', 'date_description', 'number_of_sessions', 'date_num', 'duration_description'], axis=1)
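
Note that the instructors comprehension guards against an empty list while the topics one does not, so a record without topics would raise an `IndexError`. A possible way to treat both the same is sketched below with a hypothetical helper, `first_name`, on made-up records; it is an illustration of the pattern, not a change to the code above.

```python
def first_name(items):
    """Return the 'name' of the first dictionary in a list, or None if the list is empty."""
    return items[0]['name'] if items else None


# hypothetical records; the second has no instructor or topic listed
sample_data = [
    {'title': 'Demo A', 'instructors': [{'id': 1, 'name': 'Ada'}], 'topics': [{'id': 8, 'name': 'Data'}]},
    {'title': 'Demo B', 'instructors': [], 'topics': []},
]

# build one flat dictionary per record, applying the same guard to both nested fields
flat = [
    {
        'title': x['title'],
        'Instructor': first_name(x['instructors']),
        'Topic': first_name(x['topics']),
    }
    for x in sample_data
]
print(flat)
```

A list of flat dictionaries like `flat` can be passed straight to `pd.DataFrame`.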

And now if we display the first ten rows, we'll see that the result is a tidy, manageable dataframe that is easy to work with.

In [16]:
df.head(10)

| | format | length_in_weeks | overview | time_description | title | Instructor | Topic | date |
|---|---|---|---|---|---|---|---|---|
| 0 | workshop | | Creating a perfect resumé from a recruiter’s l... | 10:00 - 11:30am EST | The Resumé Run-Down | Daniel Robinson | Career Development | 2019-12-14 |
| 1 | workshop | | Find out which of your ideas are most viable b... | 10:00 - 5:00pm EST | Product Management Bootcamp | Ryan Cooley | Business | 2019-12-14 |
| 2 | workshop-series | | Learn the fundamentals of front-end and back-e... | 10:00 - 5:00pm EST | Programming for Non-Programmers Bootcamp | Michael Glumac | Coding | 2019-12-14 |
| 3 | workshop | | Learn how to use your Google Analytics to get ... | 10:00 - 5:00pm EST | Google Analytics Bootcamp | Kassie Phillips | Marketing | 2019-12-14 |
| 4 | workshop | | In this introductory class, learn how to build... | 10:00 - 1:00pm EST | Email Marketing for Entrepreneurs and Startups | Sharon Lee Thony | Marketing | 2019-12-14 |
| 5 | workshop | | In this introductory class, learn how to build... | 10:00 - 1:00pm EST | Email Marketing for Entrepreneurs and Startups | Sharon Lee Thony | Marketing | 2019-12-14 |
| 6 | workshop-series | | Build and Evaluate Machine Learning Models Wit... | 10:00 - 5:00pm EST | Python and Machine Learning Bootcamp Series | Jonathan Bechtel | Data | 2019-12-14 |
| 7 | workshop | | Get in the design mindset, learn the principle... | 10:00 - 5:00pm EST | User Experience Journey Workshop | Tyler Hartrich | Design | 2019-12-14 |
| 8 | workshop | | Go from concept to prototype with Adobe XD, th... | 10:00 - 5:00pm EST | Adobe XD Bootcamp | Ivan Freaner | Design | 2019-12-14 |
| 9 | workshop-series | | User experiences are evolving - fast. Learn ho... | 10:00 - 5:00pm EST | 2-Day User Experience Bootcamp | Kyra Peralte | Design | 2019-12-14 |
