### Web Scraper for www.cs.olemiss.edu

Use this notebook to *finish* the web scraper for the CS domain. Store the data in the format (url, text from url, etc.). It may be beneficial to split the data into a different format (page, dictionary of key phrases, text corresponding to key phrases).

You will need to examine the page source code to identify which tags are worth storing. You might need to store more than just the URL and its text, but that is for you to decide. Libraries such as BeautifulSoup can help scrape data from pages.
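One way to survey which tags a page uses is to count every (tag, class) pair with BeautifulSoup. The sketch below runs on a small hypothetical HTML snippet rather than the live site; `SAMPLE_HTML` and `tag_class_counts` are illustrative names, and the real page source will differ.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; inspect the real page source instead.
SAMPLE_HTML = """
<html><body>
  <ul><li class="menu-item"><a href="https://cs.olemiss.edu/faq/">FAQ</a></li></ul>
  <h2 class="ab-profile-name"><a href="#">Faculty Member</a></h2>
  <p>First paragraph.</p>
  <p class="footer">Footer text.</p>
</body></html>
"""


def tag_class_counts(html):
    """Count how often each (tag name, class) pair occurs in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in soup.find_all(True):  # True matches every tag
        # Tags without a class attribute are counted under the empty string.
        for cls in tag.get("class") or [""]:
            counts[(tag.name, cls)] += 1
    return counts


counts = tag_class_counts(SAMPLE_HTML)
for (name, cls), n in counts.most_common():
    print(name, cls or "(no class)", n)
```

Pairs that occur often and wrap readable text are reasonable candidates to store.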

In the end, you should write all the data you acquire to a CSV file for later use by the next part of the project.
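As a rough illustration of the alternative format mentioned above, the sketch below flattens a (page, dictionary of key phrases) structure into CSV rows using only the standard library. The `pages` data, the `to_rows` helper, and the `KeyPhrase` column name are all hypothetical; how key phrases are actually chosen is left open.

```python
import csv
import io

# Hypothetical record: one page whose key phrases map to their text.
pages = {
    "https://cs.olemiss.edu/faq/": {
        "advising": "Text about advising.",
        "transfer credit": "Text about transfer credit.",
    },
}


def to_rows(pages):
    """Flatten {url: {key_phrase: text}} into one row per key phrase."""
    return [
        {"Url": url, "KeyPhrase": phrase, "Text": text}
        for url, phrases in pages.items()
        for phrase, text in phrases.items()
    ]


rows = to_rows(pages)

# An in-memory buffer stands in for a real file here; for a file, use
# open("some_name.csv", "w", newline="", encoding="utf8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Url", "KeyPhrase", "Text"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```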

In [53]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


### Steps
- Download main page and make list of links
- Extend list of links with links in faculty page
- Visit each link and download text

In [54]:

page = requests.get('https://cs.olemiss.edu/faculty/')     
soup = BeautifulSoup(page.content, 'html.parser')          

### Build list of links
- Some faculty cards have no link, so filter those out
- Not all of the links are from the CSCI server, so filter those out too
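Both filters can also be expressed by parsing the hostname instead of comparing string prefixes. This is a minimal sketch, not the notebook's method: `is_cs_link` and its `domain` parameter are hypothetical names, and the example hrefs are made up. Unlike a check for hosts that merely start with `cs`, it rejects unrelated hosts such as `csrc.example.org`.

```python
from urllib.parse import urlparse


def is_cs_link(href, domain="cs.olemiss.edu"):
    """Return True if href is an absolute http(s) URL on `domain` or a subdomain."""
    if not href:
        return False  # missing href (e.g. a card without a link)
    parsed = urlparse(href)
    if parsed.scheme not in ("http", "https"):
        return False  # '#', relative paths, mailto:, ...
    host = parsed.hostname or ""
    return host == domain or host.endswith("." + domain)


# Made-up examples covering the cases described in the list above.
hrefs = [
    "https://cs.olemiss.edu/faq/",
    "https://csci.cs.olemiss.edu/faculty/chen/",
    "https://olemiss.edu/",
    "https://csrc.example.org/",
    "#",
    None,
]
kept = [h for h in hrefs if is_cs_link(h)]
print(kept)
```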

In [55]:
# main menu
menu_items = soup.find_all('li', class_='menu-item')

# faculty pages
faculty_cards = soup.find_all('h2', class_='ab-profile-name')

# extract 'a' tags with links (find returns None when an item has no <a>)
a_tags = []
for item in menu_items:
    a_tags.append(item.find('a'))

for item in faculty_cards:
    a_tags.append(item.find('a'))

# check that a link is an absolute URL (contains '//') rather than just '#'
def has_link(a):
    # .get avoids a KeyError when the tag has no href attribute
    return '//' in a.get('href', '')

# check that a link is from the cs website
# (matches hosts starting with 'cs', e.g. cs.olemiss.edu and csci.cs.olemiss.edu)
def in_cs(a):
    return a['href'].split('//')[1].startswith('cs')


# make set of valid href links (a set removes duplicates)
urls = set()
for a in a_tags:
    if a and has_link(a) and in_cs(a):
        urls.add(a['href'])

# display all the links
print('All links:', [link.split('//')[1] for link in urls])

All links: ['cs.olemiss.edu/course-descriptions/', 'cs.olemiss.edu/minors-offered/', 'cs.olemiss.edu/forms/', 'cs.olemiss.edu/bscs/', 'cs.olemiss.edu/calendar/', 'cs.olemiss.edu/research-groups/', 'csci.cs.olemiss.edu/faculty/cunningham/', 'cs.olemiss.edu/faculty/', 'cs.olemiss.edu/help/', 'csci.cs.olemiss.edu/faculty/davidson/', 'csci.cs.olemiss.edu/faculty/carlisle/', 'cs.olemiss.edu/doctor-of-philosophy/', 'csci.cs.olemiss.edu/faculty/xiong/', 'cs.olemiss.edu/accreditation/', 'cs.olemiss.edu/research-areas/', 'cs.olemiss.edu/future-students/', 'csci.cs.olemiss.edu/faculty/rhodes/', 'cs.olemiss.edu/bscs-emphases-revision-draft/', 'csci.cs.olemiss.edu/faculty/chen/', 'cs.olemiss.edu/faq/', 'csci.cs.olemiss.edu/faculty/vitter/', 'csci.cs.olemiss.edu/faculty/wilkins/', 'csci.cs.olemiss.edu/faculty/jang/', 'cs.olemiss.edu/', 'cs.olemiss.edu/master-of-science/', 'cs.olemiss.edu/bacs/', 'cs.olemiss.edu/staff/', 'cs.olemiss.edu/mission/', 'cs.olemiss.edu/mission-2/']


In [58]:
texts = []
links = []

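for url in urls:
    # timeout keeps one unresponsive page from stalling the whole crawl
    page = requests.get(url, timeout=10)
    soup = BeautifulSoup(page.content, 'html.parser')
    p_tags = soup.find_all('p', class_='')

    # record each paragraph together with the URL it came from
    texts.extend(p.text for p in p_tags)
    links.extend(url for _ in p_tags)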


# number of paragraphs collected before cleaning
print(len(texts))

# Text preprocessing
clean_texts = []
clean_links = []

for (text, link) in zip(texts, links):
    # turn literal "\n" sequences into newlines, then drop all line breaks
    text = text.replace(r'\n', '\n').replace('\n', '').replace('\r', '')

    # skip duplicates and empty paragraphs
    if text in clean_texts or text == '' or text == ' ':
        continue

    clean_texts.append(text)
    clean_links.append(link)


# save to files
with open('text.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(clean_texts)) 

with open('links.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(clean_links))

print(f'Saved {len(clean_texts)} sentences to file')

# Create a pandas dataframe and write it to a CSV file
data_df = pd.DataFrame({'Url': clean_links, 'Text': clean_texts})
data_df.to_csv("url_text.csv")

420
Saved 322 sentences to file


In [37]:
data_df

Unnamed: 0,Url,Text
0,https://cs.olemiss.edu/course-descriptions/,See below for the catalog descriptions of CSci...
1,https://cs.olemiss.edu/course-descriptions/,The first digit of the course number more or l...
2,https://cs.olemiss.edu/course-descriptions/,Visit the University’s main catalog to read de...
3,https://cs.olemiss.edu/course-descriptions/,Introduction to computers and computing for st...
4,https://cs.olemiss.edu/course-descriptions/,Introduction to computer science with emphasis...
...,...,...
317,https://cs.olemiss.edu/mission/,Our faculty have a wide range of research inte...
318,https://cs.olemiss.edu/mission/,"The computer science student body is diverse, ..."
319,https://cs.olemiss.edu/mission/,The following are only a few of the many organ...
320,https://cs.olemiss.edu/mission/,Reflection Logic: CS students complete a senio...
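
The `Unnamed: 0` column in the table above is what pandas typically produces when a CSV written with the default `to_csv` settings is read back, because the index is saved as an unnamed first column. That this is what happened here is an assumption. The sketch below uses hypothetical rows and an in-memory buffer to show the effect and how `index=False` avoids it.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the scraped data.
df = pd.DataFrame({
    "Url": ["https://cs.olemiss.edu/faq/", "https://cs.olemiss.edu/mission/"],
    "Text": ["First paragraph.", "Second paragraph."],
})

# Default settings: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False leaves the index out, so the columns round-trip unchanged.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

Alternatively, `pd.read_csv(..., index_col=0)` restores the saved index when reading a file that was written with the default settings.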
