# Web Scraping using Beautiful Soup

#### Author: William Blanzeisky: github.com/williamblanzeisky

This project scrapes course data from the Minnesota State University, Mankato academic catalog. The catalog page is https://mankato.mnsu.edu/academics/academic-catalog/undergraduate/computer-information-technology/computer-information-technology-bs/. The steps are as follows:
- Import all the libraries needed
- Assign the variable "url" to the specific URL we're trying to scrape
- Inspect the website and find the data we're trying to scrape
- Clean the data
- Convert the data into a pandas dataframe (so that it can be exported to a .csv file)

This code is written in Python 3. 

In [5]:
# import all the libraries needed
import requests                # fetches the page over HTTP
import pandas as pd            # builds the final dataframe (used as pd below)
from bs4 import BeautifulSoup  # parses the HTML

In [6]:
# assign url to the link we're trying to scrape
url = 'https://mankato.mnsu.edu/academics/academic-catalog/undergraduate/computer-information-technology/computer-information-technology-bs/'
response = requests.get(url)

In [7]:
# test the connection: a 200 status code means the request succeeded
response

<Response [200]>
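
Displaying the response object only confirms the status when someone reads the output. A minimal sketch of failing fast instead is shown here. The helper name `fetch_html` and its `get` parameter are hypothetical and not part of the original notebook; `get` exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the page HTML, raising an exception on HTTP errors.

    `fetch_html` and the `get` parameter are illustrative assumptions.
    """
    # A timeout keeps the call from waiting indefinitely on a slow server.
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, `html = fetch_html(url)` would stand in for the `requests.get(url)` call plus the manual status check.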

In [10]:
# parse the page HTML with Beautiful Soup, using Python's built-in html.parser
soup = BeautifulSoup(response.text, "html.parser")

In [213]:
# After inspecting the website, we see that the data we want is inside <span> elements.
# Therefore, we tell Beautiful Soup to collect every <span> element on the page.
soup.find_all('span')

[<span class="caret"></span>,
 <span class="sr-only">Search</span>,
 <span class="btn-hamburger__text">Open Menu</span>,
 <span class="btn-hamburger__wrapper">
 <span class="btn-hamburger__bun"></span>
 <span class="btn-hamburger__bun"></span>
 <span class="btn-hamburger__bun"></span>
 </span>,
 <span class="btn-hamburger__bun"></span>,
 <span class="btn-hamburger__bun"></span>,
 <span class="btn-hamburger__bun"></span>,
 <span class="input-group-btn">
 <button class="btn btn-default search-bar-btn" type="submit">Search</button>
 </span>,
 <span class="btn-hamburger__text">Close</span>,
 <span class="caret"></span>,
 <span class="caret"></span>,
 <span><img alt="Harley Ries, Minnesota State Mankato student" height="166" src="/globalassets/globalassets/images/woit_160419_0013.jpg" style="float: left; margin-left: 10px; margin-right: 10px;" width="250">“Minnesota State Mankato supplies its students with countless valuable opportunities—whether it be starting a business with your Integrat
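
Most of the spans above belong to the navigation menu rather than the course list. The sketch below shows one way to keep only attribute-free spans that contain no image, run on a small made-up HTML sample rather than the live page. The sample markup and the names `sample_html`, `sample_soup`, and `is_plain_text_span` are assumptions for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Made-up sample that imitates the kinds of <span> elements seen above.
sample_html = """
<div>
  <span class="caret"></span>
  <span>ENG 101</span>
  <span>Composition</span>
  <span>4 credits</span>
  <span><img alt="photo" src="x.jpg">A student quote</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")


def is_plain_text_span(tag):
    # Keep <span> tags that have no attributes and contain no <img>.
    return tag.name == "span" and not tag.attrs and tag.find("img") is None


# find_all accepts a function that receives each tag and returns True or False.
plain_spans = [tag.get_text(strip=True) for tag in sample_soup.find_all(is_plain_text_span)]
print(plain_spans)
```

This is slightly stricter than checking whether the markup starts with `<span><img`, because it rejects a span with an image anywhere inside it.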

In [202]:
# initialize empty lists as temporary storage for the data we're trying to scrape
course_number = []
course_name = []
credit_hours = []
i = 0  # index of the <span> currently being examined

In [203]:
# loop through all <span> elements and extract the information we need
# print is used for debugging purposes

spans = soup.find_all('span')  # query the document once instead of on every iteration
while i <= len(spans) - 3:
    span_html = str(spans[i])
    # a course entry starts at a bare <span> that does not open with an image and whose text is not "none"
    if span_html.startswith("<span>") and not span_html.startswith("<span><img") and spans[i].text != "none":
        print(i)
        # course number, course name, and credit hours sit in three consecutive spans
        course_number.append(spans[i].text)
        course_name.append(spans[i+1].text)
        credit_hours.append(spans[i+2].text)
        print(course_number)
        print(course_name)
        print(credit_hours)
        i = i + 3
    else:
        i = i + 1

16
['ENG 101']
['Composition']
['4 credits']
20
['ENG 101', 'IT 202W']
['Composition', 'Computers in Society']
['4 credits', '4 credits']
24
['ENG 101', 'IT 202W', 'MATH 121']
['Composition', 'Computers in Society', 'Calculus I']
['4 credits', '4 credits', '4 credits']
27
['ENG 101', 'IT 202W', 'MATH 121', 'STAT 154']
['Composition', 'Computers in Society', 'Calculus I', 'Elementary Statistics']
['4 credits', '4 credits', '4 credits', '4 credits']
30
['ENG 101', 'IT 202W', 'MATH 121', 'STAT 154', 'CMST 100']
['Composition', 'Computers in Society', 'Calculus I', 'Elementary Statistics', 'Fundamentals of Communication']
['4 credits', '4 credits', '4 credits', '4 credits', '3 credits']
34
['ENG 101', 'IT 202W', 'MATH 121', 'STAT 154', 'CMST 100', 'CMST 102']
['Composition', 'Computers in Society', 'Calculus I', 'Elementary Statistics', 'Fundamentals of Communication', 'Public Speaking']
['4 credits', '4 credits', '4 credits', '4 credits', '3 credits', '3 credits']
38
['ENG 101', 'IT 202W'
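
The index bookkeeping in the loop can also be expressed as grouping a flat list into rows of three. The sketch below assumes the span texts have already been filtered so that they arrive as clean (number, name, credits) triples, which the real page may not guarantee. The list `span_texts` and the `sample_*` names are hypothetical.

```python
# Hypothetical, already-filtered span texts in page order.
span_texts = [
    "ENG 101", "Composition", "4 credits",
    "IT 202W", "Computers in Society", "4 credits",
]

# Every third item starting at offsets 0, 1, and 2 forms one column;
# zipping the three columns yields one tuple per course.
rows = list(zip(span_texts[0::3], span_texts[1::3], span_texts[2::3]))

# Transpose the rows back into three parallel lists.
sample_numbers, sample_names, sample_hours = (list(column) for column in zip(*rows))
print(rows)
```

If the list length is not a multiple of three, `zip` silently drops the incomplete trailing group, so a length check beforehand may be worthwhile.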

In [209]:
# convert the lists into a dataframe with the desired column names
catalog = pd.DataFrame({'Course Number': course_number, 'Course Name': course_name, 'Credit Hours': credit_hours})
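
The task list mentions data cleaning, and one natural cleaning step is turning strings such as "4 credits" into numbers. The following is a minimal sketch of that idea, not the notebook's own method. The helper `parse_credits` and the sample list are hypothetical, and the helper keeps only the first number it finds, so a range of credits would be reduced to its lower bound.

```python
import re


def parse_credits(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Hypothetical sample values in the same shape as the scraped credit hours.
sample_credit_hours = ["4 credits", "3 credits", "none"]
cleaned_credits = [parse_credits(text) for text in sample_credit_hours]
print(cleaned_credits)
```

Applied to the scraped data, the same helper could be mapped over `credit_hours` before the dataframe is built.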

In [211]:
# catalog dataframe
catalog

|   | Course Number | Course Name | Credit Hours |
|---|---|---|---|
| 0 | ENG 101 | Composition | 4 credits |
| 1 | IT 202W | Computers in Society | 4 credits |
| 2 | MATH 121 | Calculus I | 4 credits |
| 3 | STAT 154 | Elementary Statistics | 4 credits |
| 4 | CMST 100 | Fundamentals of Communication | 3 credits |
| 5 | CMST 102 | Public Speaking | 3 credits |
| 6 | CMST 312 | Professional Communication and Interviewing | 4 credits |
| 7 | ENG 271W | Technical Communication | 4 credits |
| 8 | IT 210 | Fundamentals of Programming | 4 credits |
| 9 | IT 214 | Fundamentals of Software Development | 4 credits |
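
The final step in the task list, exporting to a .csv file, is not shown in the notebook. A minimal sketch using pandas' `DataFrame.to_csv` follows, run on a two-row stand-in dataframe. The name `sample_catalog` and the file name in the comment are assumptions.

```python
import pandas as pd

# Two-row stand-in for the scraped catalog dataframe.
sample_catalog = pd.DataFrame({
    "Course Number": ["ENG 101", "IT 202W"],
    "Course Name": ["Composition", "Computers in Society"],
    "Credit Hours": ["4 credits", "4 credits"],
})

# With no path given, to_csv returns the CSV text instead of writing a file.
# index=False leaves out the row index column.
csv_text = sample_catalog.to_csv(index=False)
print(csv_text)

# To write a file instead (hypothetical file name):
# sample_catalog.to_csv("catalog.csv", index=False)
```

Passing `index=False` keeps the row numbers out of the file, so reading it back later does not produce an extra unnamed index column.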
