# Introduction

Behind most websites is some sort of data structure.

When you see something like a list of search results from Google or a list of items on Amazon, there is a structure to the data that makes up the list.

For example, a single Google search result shows these pieces of information:
* The title of the page that the link goes to
* The URL of the page that the link goes to
* A snippet of some of the text on the page that the link goes to

Every other search result follows the same format and carries the same information.
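
As a rough sketch of that idea, a page of search results could be modeled as a list of dictionaries that all share the same keys. The titles, URLs, and snippets below are made-up placeholders, not real search results.

```python
# Hypothetical search results, modeled as a list of dictionaries.
# The titles, URLs, and snippets are made-up placeholders.
search_results = [
    {
        "title": "Example Page One",
        "url": "https://example.com/one",
        "snippet": "A short excerpt of the text on page one.",
    },
    {
        "title": "Example Page Two",
        "url": "https://example.com/two",
        "snippet": "A short excerpt of the text on page two.",
    },
]

# Every result has the same keys, so the same code works for all of them.
for result in search_results:
    print(result["title"], "->", result["url"])
```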

# Installing Packages

Web scraping is used to extract data such as the above.

A number of packages can assist with this, and it can be done in many languages.

In Python there are packages that you can install that provide a lot of different functionalities. You should only have to install each package once.

For web scraping, we will use 2 packages: Requests and BeautifulSoup4.

* Requests is a package that lets you request and get the raw HTML from webpages: http://docs.python-requests.org/en/master/

* BeautifulSoup4 is a package that parses HTML and lets you search it for specific tags: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

To install the packages, run the code below. (The same commands can also be run in a terminal if you remove the leading `!`.)

In [None]:
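# Install the "requests" and "beautifulsoup4" packages, plus the "html5lib" parser used below
!pip install requests
!pip install beautifulsoup4 html5lib
# pandas is used at the end to save the results as a CSV
!pip install pandas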

# Importing packages

To make the functionality of each package available, you will need to import the installed package.

If errors appear when running the code below, then the package wasn't installed correctly:

In [None]:
import requests
from bs4 import BeautifulSoup
print('Done importing')

# Web scraping

Now, let's scrape data!

The site we will be scraping data from is iDTech - https://www.idtech.com/courses

On the Courses page, there is one row for every course, each of which follows the same format.

We will get information about every course.

# Step 1: Setting things up

The links on this website use relative URLs, so saving the base URL is very helpful.

We will also make an empty list to hold dictionaries, which will be converted into a CSV file at the end. Each dictionary in this list will become 1 row and contain the fields for one course (Title, URL, etc.), which are described in Step 5.

In [None]:
# The base url of the website. Let's set this once so we won't have to set it again.
base_url = 'https://www.idtech.com'

csv = [] # Empty list to contain dictionaries that will become a CSV at the end
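
To make the "list of dictionaries becomes a CSV" idea concrete, here is a minimal sketch using only Python's standard library. The notebook itself uses pandas for this at the end; the row contents and the names `rows` and `csv_module` are assumptions made for the illustration. The module is imported under an alias because this notebook already uses `csv` as the name of a list.

```python
import csv as csv_module  # aliased because this notebook uses "csv" as a list name
import io

# Hypothetical rows, shaped like the course dictionaries built later on.
rows = [
    {"Title": "Example Course A", "URL": "https://example.com/a"},
    {"Title": "Example Course B", "URL": "https://example.com/b"},
]

# Write to an in-memory buffer so the sketch does not create a file.
buffer = io.StringIO()
writer = csv_module.DictWriter(buffer, fieldnames=["Title", "URL"])
writer.writeheader()    # the dictionary keys become the header row
writer.writerows(rows)  # each dictionary becomes one row

csv_text = buffer.getvalue()
print(csv_text)
```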

# Step 2: Grabbing the raw HTML of the course directory page

On the course directory page, you can access the page of every course that's available: https://www.idtech.com/courses

So we will start there and get all the raw HTML data.

From there, we can get all the links to all the different courses, then loop over them to extract each one's data, turn that data into a dictionary, and create a list of dictionaries.

In [None]:
# Request the Courses page, then parse its raw HTML with BeautifulSoup
r = requests.get(base_url + '/courses')
soup = BeautifulSoup(r.text, 'html5lib')

# You can check out the raw HTML by printing soup
#print(soup)

Instead of trying to read the printed soup, I recommend opening the webpage you are about to scrape, right-clicking the page, and pressing "Inspect". The Elements tab will show the same content.

# Step 3: Grabbing the links to each individual course page

In a browser like Chrome or Firefox, with the inspector open, click the top-left option that says "Select an element in the page to inspect it".

Then click a course link, such as "Code-a-bot: AI and Robotics with Your Own Cozmo".

Notice that it has the class "course-name". When you search for "course-name" in the inspector, you will find that everything with this class is a link to a course.

So, we will want to get only the links with the class "course-name", then get each one's "href", which is the URL of each page, relative to the base URL.

BeautifulSoup helps us in doing this with the find_all function:

In [None]:
# Find all tags in the soup that are <a> (anchor/link) tags that have the class "course-name"
course_tags = soup.find_all('a', attrs={'class': 'course-name'})

#print(course_tags)
print(len(course_tags))
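
To see what `find_all` returns without depending on the live site, the sketch below runs the same call on a small hand-written HTML snippet. The snippet, its hrefs, and the names `sample_html`, `sample_soup`, and `sample_tags` are all made up for illustration; the real page will look different.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely imitating a course listing.
sample_html = """
<div>
  <a class="course-name" href="/courses/example-a">Example Course A</a>
  <a class="other-link" href="/about">About</a>
  <a class="course-name" href="/courses/example-b">Example Course B</a>
</div>
"""

# "html.parser" is Python's built-in parser, so html5lib is not needed here.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Same call as above: only <a> tags with the class "course-name" are returned.
sample_tags = sample_soup.find_all("a", attrs={"class": "course-name"})

for tag in sample_tags:
    print(tag.getText(), tag["href"])
```

The link with the class "other-link" is skipped, which is why filtering by class is enough to isolate the course links.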

# Step 4: Looping through all URLs of course pages

With each course tag, we can get its href and concatenate it to the end of the base URL to get the URL to every course page.



In [None]:
course_urls = []

# Loop over every course tag
for course_tag in course_tags:
    
    # Assemble the URL of the specific course page:
    course_url = base_url + course_tag['href'] # Access the "href" of every <a> course tag
    course_urls.append(course_url)
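
Plain concatenation works as long as every href starts with "/". As a possible alternative, the standard library's `urljoin` also copes with hrefs that lack the leading slash or that are already absolute. This is only a sketch: the href values and the name `site_root` are hypothetical.

```python
from urllib.parse import urljoin

site_root = "https://www.idtech.com"  # same value as base_url above

# Hypothetical href values; real ones come from the <a> tags.
hrefs = [
    "/courses/example-course",                        # relative, leading slash
    "courses/example-course",                         # relative, no leading slash
    "https://www.idtech.com/courses/example-course",  # already absolute
]

# The trailing "/" makes the root behave like a directory when joining.
joined = [urljoin(site_root + "/", href) for href in hrefs]

for url in joined:
    print(url)
```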

# Step 5: Getting the data!

Now that we have a list of URLs for each page, we can request the raw HTML of each page to be able to find the information we want.

### For every course, we want the following information:
1. #### __URL__: The URL of the course

2. #### __Title__: The title of the course

3. #### __Topic__: The title is NOT the topic. The topic could be something like "Robotics", "Design", or "Coding", and under one topic there could be many courses with different titles.

4. #### __SkillLevel__: The skill requirement of the course, such as "Beginner" or "Advanced"

5. #### __CourseType__: There are several types of courses, such as Summer Camps or just regular classes. We want to distinguish between these.

6. #### __MinimumAge__: The minimum age at which someone can enroll in the course. If there is no minimum, set this to -1.

7. #### __MaximumAge__: The maximum age at which someone can enroll in the course. If there is no maximum, set this to -1.

8. #### __Description__: A description of the course - course pages will usually have a description of what the course is about.

9. #### __Provider__: The main title of the website itself, such as "iDTech" or "Computing Kids". This will be the same for every course per website - all courses scraped from iDTech will have "iDTech" as the provider.

10. #### __Locations__: The locations where the course takes place. A course can have several locations, so make sure you set this to be a list [] rather than one string. This list will be processed to remove duplicates later.


So we must find where in the page each field is located and get its contents using BeautifulSoup's functions.

Each course page will have its own dictionary that contains all of the fields, then get added to the list "csv" that we created in Step 1.
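
The -1 convention for MinimumAge and MaximumAge can be sketched as a small helper. The function name `parse_age_range` and the sample strings are assumptions; the exact wording of age ranges differs from site to site, so treat this as a starting point and not as the notebook's own method.

```python
import re

def parse_age_range(text):
    """Return (minimum_age, maximum_age), using -1 for a missing bound."""
    ages = [int(number) for number in re.findall(r"\d+", text)]
    if not ages:
        # No numbers at all: neither bound is known.
        return -1, -1
    if len(ages) == 1:
        # A single number is treated as the minimum, e.g. "13+".
        return ages[0], -1
    return ages[0], ages[1]

print(parse_age_range("7 - 9"))
print(parse_age_range("13+"))
print(parse_age_range("All ages"))
```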

In [None]:
csv = []

print('Start')

# Get every course page's information and put it in the "csv" list of dictionaries
for course_url in course_urls:
    
    # New course dictionary, will get added to "csv" list
    course = {} 

    #__________
    # Get the content of the course page
    r = requests.get(course_url)
    soup = BeautifulSoup(r.text, 'html5lib')

    #__________
    # [1] URL
    #       The URL that we are scraping from
    
    course['URL'] = course_url
    
    #__________
    # [2] Title
    #       The first h1 tag has the title.
    
    title = soup.find('h1').getText()
    course['Title'] = title

    # Age range, skill level, topic, and course type are all in this table
    course_attributes_blocks = soup.find('div', attrs={'class': 'course-attributes'})
    first_dl = course_attributes_blocks.find('dl', attrs={'class': 'col-sm-6'})
    dds = first_dl.find_all('dd')

    #__________
    # [3] Topic
    #        The 3rd <dd> in the first <dl> of <div> with class "course-attributes"
    topic = dds[2].getText()
    course['Topic'] = topic

    #__________
    # [4] SkillLevel
    #        The 2nd <dd> in the first <dl> of <div> with class "course-attributes"
    skill_level = dds[1].getText()
    skill_level = skill_level.replace('–', '-')
    course['SkillLevel'] = skill_level

    #__________
    # [5] CourseType
    #        The 1st <dd> in the 2nd <dl> of <div> with class "course-attributes"
    second_dl = course_attributes_blocks.find_all('dl', attrs={'class': 'col-sm-6'})[1]
    dds_2 = second_dl.find_all('dd')
    course_type = dds_2[0].getText()
    
    course['CourseType'] = course_type

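    #__________
    # [6] MinimumAge
    #        The age range is from the 1st <dd> in the first <dl> of <div> with class "course-attributes"
    #        Assumes text such as "7 - 9"; both "-" and "–" are treated as separators
    age_text = dds[0].getText().replace('–', ' ').replace('-', ' ')
    ages = [int(token) for token in age_text.split() if token.isdigit()]
    
    # The age range can be split into the lower age and the higher age (-1 if missing).
    course['MinimumAge'] = ages[0] if ages else -1
    
    #__________
    # [7] MaximumAge
    course['MaximumAge'] = ages[1] if len(ages) > 1 else -1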

    #__________
    # [8] Description
    #        The content of a div with class "content" inside a div with the class "course-details"
    course_details = soup.find('div', attrs={'class': 'course-details'})
    course_details_content = course_details.find('div', attrs={'class': 'content'})
    
    course['Description'] = course_details_content.getText()
    
    #__________
    # [9] Provider
    #       The name of the website. This is very important for distinguishing different providers
    course['Provider'] = 'iDTech'
    
    #__________
    # [10] Locations
    collapsed_lis = soup.find_all('li', attrs={'class': 'collapsed'})
    
    locations = []
    
    # Get the address of every location available
    for li in collapsed_lis:
        uls = li.find_all('ul', attrs={'class': 'location-list'})
        
        for ul in uls:
            location_lis = ul.find_all('li')
            
            for location_li in location_lis:
                address = location_li.find('span', attrs={'class': 'address'}).getText()
                locations.append(address)
                
    course['Locations'] = locations
    

    csv.append(course)

#________________
# Finally, convert list of dictionaries into a CSV
import pandas as pd

# Change list of dictionaries of courses into a dataframe
df = pd.DataFrame(csv)

# Save dataframe as a CSV
df.to_csv('course_idtech.csv', index=False, encoding='utf-8')

print('Complete')
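
The Locations field is meant to have duplicates removed later. One possible way to do that while keeping the original order is sketched below; the addresses and the names `sample_locations` and `unique_locations` are made up for the example.

```python
# Hypothetical addresses, including a repeat.
sample_locations = [
    "123 Example St, Seattle, WA",
    "456 Sample Ave, Bellevue, WA",
    "123 Example St, Seattle, WA",
]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique_locations = list(dict.fromkeys(address.strip() for address in sample_locations))

print(unique_locations)
```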


# Notes

After you run all of the code above, on the left you should see a "course_idtech.csv" file, which you can open here, or in Excel.

When you scrape other websites, the same information will have to come from completely different places, since different websites have different structures.

Some of the information may not be available at all, so if you can't find the data, make a note of this.
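
One way to "make a note" of unavailable data is to list which fields a course dictionary is missing. The helper name `missing_fields`, the constant `REQUIRED_FIELDS`, and the sample dictionary are assumptions for this sketch; the field names themselves come from Step 5.

```python
REQUIRED_FIELDS = [
    "URL", "Title", "Topic", "SkillLevel", "CourseType",
    "MinimumAge", "MaximumAge", "Description", "Provider", "Locations",
]

def missing_fields(course):
    """Return the names of required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS
            if course.get(name) in (None, "", [])]

# Hypothetical, partly filled course dictionary.
partial_course = {
    "URL": "https://example.com/a",
    "Title": "Example Course A",
    "Provider": "Example Provider",
    "Locations": [],
}

print(missing_fields(partial_course))
```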

Every site can be vastly different from the others, so each website will need its own scraping script.

The more entries there are on a site, the longer the script will take to finish. However, sites with more courses may actually be easier to scrape than websites with fewer courses, for example when every entry follows the same page format.

# What will we be doing with all these CSV files?

We are making a website that aggregates many "providers" throughout Washington as well as the courses that they offer.

Examples of providers include iDTech, the Pacific Science Center, and Seattle.gov: basically anything that has technical classes available.

Every one of these Providers will have their information scraped with the fields specified above.

After all the data is scraped, further adjustments will be made, such as grouping certain topics together (one provider may give a topic a different name than another provider does). The CSV files will then be combined into one CSV file and used to create a database table for our site, "Digital Skills For All".
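
The topic grouping could look roughly like the sketch below, which maps provider-specific topic names onto shared ones before the rows are combined. The alias table, provider names, and rows are all hypothetical; the real groupings will be decided once the data is in.

```python
# Hypothetical mapping from provider-specific topic names to shared ones.
TOPIC_ALIASES = {
    "coding": "Coding",
    "programming": "Coding",
    "robotics": "Robotics",
    "robots": "Robotics",
}

def normalize_topic(topic):
    """Map a provider's topic name to a shared name; unknown topics pass through."""
    cleaned = topic.strip()
    return TOPIC_ALIASES.get(cleaned.lower(), cleaned)

# Hypothetical rows from two providers.
provider_a = [{"Provider": "Provider A", "Topic": "Programming"}]
provider_b = [{"Provider": "Provider B", "Topic": " robots "}]

# Combine the rows and replace each Topic with its shared name.
combined = [dict(row, Topic=normalize_topic(row["Topic"]))
            for row in provider_a + provider_b]

print(combined)
```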

Then users of our site can search by the parameters they want, such as certain skills, skill level, or recommended age, and matching results will appear from across all of these different providers.
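
As a rough illustration of such a search, the sketch below filters a list of course dictionaries by skill level and age, treating -1 as "no limit". The site itself would presumably query its database instead; the function name `search_courses` and the sample courses are assumptions.

```python
def search_courses(courses, skill_level=None, age=None):
    """Return courses matching the given skill level and age (None means 'any')."""
    matches = []
    for course in courses:
        if skill_level is not None and course["SkillLevel"] != skill_level:
            continue
        if age is not None:
            # -1 means the course has no minimum or no maximum age.
            too_young = course["MinimumAge"] != -1 and age < course["MinimumAge"]
            too_old = course["MaximumAge"] != -1 and age > course["MaximumAge"]
            if too_young or too_old:
                continue
        matches.append(course)
    return matches

# Hypothetical courses using the field names from Step 5.
sample_courses = [
    {"Title": "Example Course A", "SkillLevel": "Beginner", "MinimumAge": 7, "MaximumAge": 9},
    {"Title": "Example Course B", "SkillLevel": "Advanced", "MinimumAge": 13, "MaximumAge": -1},
]

for course in search_courses(sample_courses, age=8):
    print(course["Title"])
```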

If you have any questions, email me at greycabbage@gmail.com

Here is a list of providers:
* https://www.pacificsciencecenter.org/
* http://sdkbridge.com/youth.php
* http://www.piercecountylibrary.org/
* https://codeday.org/
* http://www.techsmartkids.com/
* https://www.tbcs.org/community/summer-camp
* http://www.bigbrainedsuperheroes.org/
* https://www.cs.washington.edu/outreach/k12
* https://www.summer-camp.uw.edu/
* https://www.thecoderschool.com/bellevue?tcs=
* https://www.idtech.com/courses (already done)
* http://apprenticareers.org/apply/
* https://www.literacysource.org/programs/
* http://www.elcentrodelaraza.org/events/event/
* https://sahareducation.org/digital-literacy/
* http://www.seattle.gov/iandraffairs/RTW
* https://seattlegoodwill.org/job-training-and-education/work-readiness/computer-classes
* https://www.seattleymca.org/accelerator/ytech
* https://www.spl.org/programs-and-services/learning/learning-calendar
* https://www.nhseattle.com/about-us/contact-us
* https://generalassemb.ly/seattle/marketing/digital-marketing
* https://itconnect.uw.edu/learn/workshops/
* https://itconnect.uw.edu/learn/workshops/online-tutorials/
* https://www.intrepidlearning.com/training-providers
* http://seattlecolleges.edu/
* https://www.onlc.com/dirwase.htm
* https://www.techsherpas.com/it-training-centers/seattle-wa/
* https://www.ramco-training.com/index.html
* https://www.pryor.com/
* https://generalassemb.ly/locations/seattle
* https://learncodinganywhere.com/TheTechAcademySeattle
* https://ncs.seattleu.edu/programs-courses/digital-technology/
* https://rtc.edu/#
* http://www.kalacademy.org/
* www.schoolsoutwashington.org
* https://www.501commons.org/services/technology-services/plan-IT-program