# Simple "Wellesley News" Search Results Scraper
Shreya Parjan

10/31/19

This scraper is designed for use with the Wellesley College News website. It takes a page of results from the site's search bar and outputs a CSV listing the title and date of each article on that page. Ultimately, the scraper's output can be used to analyze trends in article titles and in how often certain issues are discussed. For this proof of concept, part of our group project to create a dataset quantifying conversations about student housing, we examine one page of results for the search term "housing."

## Table of Contents
1. [Step 1: Scraping Environment Set-up](#s0)
2. [Step 2: Isolating Content](#s1)
3. [Step 3: Extracting Dates with RegEx](#s2)
4. [Step 4: Extracting Titles with String Indexing](#s3)
5. [Step 5: Writing Data to CSV](#s4)

## 1. Scraping Environment Set-up
<a id="s0"></a>

We use Python's BeautifulSoup package and the requests library. 

In [13]:
import requests
from bs4 import BeautifulSoup 

Using requests, we fetch the page at our URL; the response holds the full HTML of the results page.

In [14]:
url = "https://thewellesleynews.com/page/1/?s=housing"
r = requests.get(url)

In [15]:
#print(r.content)
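
The URL above encodes both the results page number (`/page/1/`) and the search term (`?s=housing`). As a minimal sketch, a helper like the one below could build the URL for other terms or pages. `build_search_url` and `BASE_URL` are hypothetical names, and the sketch assumes that later result pages follow the same `/page/<n>/?s=<term>` pattern, which this notebook does not verify.

```python
from urllib.parse import urlencode

BASE_URL = "https://thewellesleynews.com"  # site root taken from the URL above


def build_search_url(term, page=1):
    """Return the search-results URL for `term` on results page `page`.

    Hypothetical helper; assumes every page follows /page/<n>/?s=<term>.
    """
    query = urlencode({"s": term})  # percent-encodes spaces and punctuation
    return f"{BASE_URL}/page/{page}/?{query}"


print(build_search_url("housing"))  # https://thewellesleynews.com/page/1/?s=housing
print(build_search_url("student housing", 3))
```

The resulting string can be passed to `requests.get` in place of the hard-coded URL. Calling `r.raise_for_status()` on the response is one way to stop early if the request fails.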

## 2. Isolating Content
<a id="s1"></a>

The raw content (uncomment the print above to see it) is too dense to parse by eye, let alone convert into a usable spreadsheet. We therefore use BeautifulSoup to filter this page content down.

In [16]:
"""create a BeautifulSoup instance using the content we extracted above"""
soup = BeautifulSoup(r.content, 'html5lib') 

In [17]:
"""extract the 'h2' headings with class 'entry-title' (one per article in the results)"""
articles = soup.find_all('h2', class_='entry-title')

In [18]:
"""convert each heading's second child node (the article link) to a string"""
articleStrings = []
for article in articles:
    articleStrings.append(str(article.contents[1]))

## 3. Extracting Dates with RegEx
<a id="s2"></a>

Now that we've narrowed down the parts of the webpage we actually want to extract content from, we can use regular expressions to extract information like the date of publication from the HTML.

In [19]:
"""find the first YYYY/MM/DD date in each article link"""
import re
redates = []
for articleString in articleStrings:
    redates.append(re.search(r'\d{4}/\d{2}/\d{2}', articleString))

In [20]:
redates

[<_sre.SRE_Match object; span=(38, 48), match='2019/10/12'>,
 <_sre.SRE_Match object; span=(38, 48), match='2019/09/26'>,
 <_sre.SRE_Match object; span=(38, 48), match='2019/09/18'>,
 <_sre.SRE_Match object; span=(38, 48), match='2018/04/25'>,
 <_sre.SRE_Match object; span=(38, 48), match='2018/02/21'>,
 <_sre.SRE_Match object; span=(38, 48), match='2017/02/25'>,
 <_sre.SRE_Match object; span=(38, 48), match='2016/11/17'>,
 <_sre.SRE_Match object; span=(38, 48), match='2016/10/27'>,
 <_sre.SRE_Match object; span=(38, 48), match='2016/10/19'>,
 <_sre.SRE_Match object; span=(38, 48), match='2016/02/03'>]

In [21]:
"""convert SRE_Match objects to strings"""
dates = []
for match in redates:
    dates.append(match.group(0))
dates

['2019/10/12',
 '2019/09/26',
 '2019/09/18',
 '2018/04/25',
 '2018/02/21',
 '2017/02/25',
 '2016/11/17',
 '2016/10/27',
 '2016/10/19',
 '2016/02/03']
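
The loop above assumes every link contains a date; `re.search` returns `None` when there is no match, and calling `.group(0)` on `None` raises an error. The following is a minimal sketch of a more defensive variant that also converts each match to a `datetime.date`, which is easier to sort and group later. `extract_date` is a hypothetical helper, and the sample link is made up: only its date segment mirrors the output above.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r'(\d{4})/(\d{2})/(\d{2})')


def extract_date(link_html):
    """Return the first YYYY/MM/DD date in link_html as a date, or None if absent."""
    match = DATE_PATTERN.search(link_html)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Hypothetical link string; the slug and title are placeholders.
sample = '<a href="https://thewellesleynews.com/2019/10/12/example-slug/">Example title</a>'
print(extract_date(sample))                         # 2019-10-12
print(extract_date('<a href="/about/">About</a>'))  # None
```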

## 4. Extracting Titles with String Indexing
<a id="s3"></a>

If you look at the `<a href=...>` tag for any given article on the search results page, you'll see that the title always starts right after the first '>' and ends just before the first '</'. We can use this fact to extract the titles of the articles.

In [22]:
titles = []
for i in range(len(articleStrings)):
    startSubStr = articleStrings[i].index('>')
    endSubStr = articleStrings[i].index('</')
    titles.append(articleStrings[i][startSubStr+1:endSubStr])

In [23]:
titles

['Tensions Rise Between Student Activists and Senior Administration Over Housing',
 'Letter to the Editor: More than just displacement: the housing crisis and its debilitating effects',
 'Dozens of students displaced due to on-campus housing issues',
 'Affordable housing proposals cause for debate in the town of Wellesley',
 'Office of Residential Life announces changes to the housing process',
 'New housing process introduced at Wellesley',
 'College revamps housing registration process',
 'Off-campus housing diversifies students’ residential and social experiences',
 'Co-housing caters to the modern family',
 'Wellesley students scramble to find housing upon returning from study abroad']
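
As a sketch of the same string-indexing idea applied to a single link, the hypothetical helper `extract_title` below slices between the first '>' and the first '</' that follows it. It also unescapes HTML entities, since a tag converted with `str()` can contain sequences such as `&amp;`. The sample link is made up for illustration.

```python
import html


def extract_title(link_html):
    """Return the text between the first '>' and the next '</' in link_html."""
    start = link_html.index('>') + 1
    end = link_html.index('</', start)
    # html.unescape turns entities such as &amp; back into plain characters
    return html.unescape(link_html[start:end]).strip()


# Hypothetical link string; the slug and title are placeholders.
sample = ('<a href="https://thewellesleynews.com/2019/10/12/example-slug/">'
          'Housing &amp; Residential Life</a>')
print(extract_title(sample))  # Housing & Residential Life
```

This approach relies on the link containing only plain text. If a title contained nested tags, the slice would include markup, and BeautifulSoup's `get_text()` method on the tag would likely be a more robust choice.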

## 5. Writing Data to CSV
<a id="s4"></a>

The output of this simple scraper is a CSV whose columns are the titles and dates of the articles on that given page.

In [24]:
import csv

# newline='' avoids blank rows on Windows; utf-8 preserves characters such as ’ in titles
with open('wellesley_news_scraper_output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Date"])
    writer.writerows(zip(titles, dates))
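
To illustrate how the CSV might feed the trend analysis mentioned in the introduction, here is a minimal sketch that reads rows in the same Title/Date layout and counts articles per year. It uses an in-memory excerpt of four rows from the results above rather than opening the output file, so it runs on its own. The variable names are illustrative.

```python
import csv
import io
from collections import Counter

# Four rows excerpted from the results above, in the same layout as the CSV.
csv_text = """Title,Date
Dozens of students displaced due to on-campus housing issues,2019/09/18
New housing process introduced at Wellesley,2017/02/25
College revamps housing registration process,2016/11/17
Co-housing caters to the modern family,2016/10/19
"""

reader = csv.DictReader(io.StringIO(csv_text))
# The year is the first four characters of each YYYY/MM/DD string.
articles_per_year = Counter(row["Date"][:4] for row in reader)

for year, count in sorted(articles_per_year.items()):
    print(year, count)
```

To run this against the real file, `io.StringIO(csv_text)` would be swapped for a file opened with `open('wellesley_news_scraper_output.csv', newline='', encoding='utf-8')`.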