v.2023-10-03

## 📅 GWU Events Scraper Challenge 📅



1. Request the **HTML** page contents of https://calendar.gwu.edu/calendar.
2. Parse the page, loop through the events, and **display** some information about each, including:
  + **Title** (e.g. "Annie Artist Gallery Opening")
  + **Date info** including date, day of week, and start and end times (e.g. "Mon, Oct 2, 2023 11am to 3pm", or just "Mon, Oct 2, 2023" for events that don't list start and end times)
  + **Location** (e.g. "Smith Hall of Art, Gallery 102")
  + **Image** (ideally displayed)
  + **Primary event type** / first tag (e.g. "Arts and Culture")

> BONUS: instead of a single field for the date info including day of week, start, and end times, see if you can further **decompose the date info** into separate fields:
>  + **Date** (e.g. "2023-09-08")
>  + **Day of week** (e.g. "Friday")
>  + **Start time**, as applicable (e.g. "12:30pm" or None)
>  + **End time**, as applicable (e.g. "01:30pm" or None)

3. Ensure your loop from Question 2 also **collects** the events as a list of dictionaries called `events`.

4. Convert the entire list of `events` to a `pandas.DataFrame` object called `events_df`. Optionally inspect it:
  + Print the column names.
  + Print the number of rows.
  + Show the first few rows.
  + Print the earliest start time and the latest end time.

5. Write the `events_df` contents to a **CSV file** called "events.csv" in the Colab filesystem. Optionally download the CSV file to your computer and open it with spreadsheet software.

> NOTE: we haven't worked with DataFrames yet, so the required parts of Questions 4 and 5 have been done for you.

**Further Exploration**

If you get through that easily enough, consider tackling this optional bonus challenge.

6. Write a function called `fetch_events` that accepts a date string like "2023-10-20" as a parameter input called `selected_date`, and returns the first page of events on that day, formatted as a list of dictionaries. If the `selected_date` parameter is omitted, the function should return the current day's events.


> HINT: when we select a different date in the browser, the URL will be something like: https://calendar.gwu.edu/calendar/day/2023/10/20


> HINT: use a default value of `None` for the `selected_date` parameter. Inside the function, check the parameter's value and, if one is present, append something like "/day/2023/10/20" to the request URL.
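
The two hints above can be sketched as a small URL-building helper. This is only an illustration: `build_events_url` and `BASE_URL` are hypothetical names, and whether the site expects zero-padded months and days is an assumption the hints do not settle.

```python
from datetime import datetime

BASE_URL = "https://calendar.gwu.edu/calendar"


def build_events_url(selected_date=None):
    """Return the calendar URL for a "YYYY-MM-DD" date string, or the default page."""
    if selected_date is None:
        # No date given: the plain calendar page is used.
        return BASE_URL
    # strptime validates the input and raises ValueError on a malformed date.
    day = datetime.strptime(selected_date, "%Y-%m-%d")
    # Assumption: the site accepts zero-padded month and day segments.
    return BASE_URL + day.strftime("/day/%Y/%m/%d")


print(build_events_url("2023-10-20"))
```

Parsing the string first, rather than just swapping "-" for "/", means a typo such as "2023-13-40" fails before any request is made.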


Also implement tests like these for your function, and make sure they pass:


```python
from datetime import date

events = fetch_events()
assert any(events)
assert events[0]["date"] == str(date.today())

events = fetch_events("2023-10-20")
assert any(events)
assert events[0]["date"] == "2023-10-20"
```

## Evaluation

Rubric:

Part | Weight
--- | ---
Part 1 | 20%
Part 2 | 60%
Parts 3, 4, and 5 | 20%

The date decomposition bonus from Part 2 is worth 10% extra credit.

The event-fetching function bonus (Part 6) is worth another 10% extra credit.


## Solution

In [38]:
import string

import requests
from bs4 import BeautifulSoup
from IPython.display import Image, display

In [39]:
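EVENTS_PAGE_URL = "https://calendar.gwu.edu/calendar"
response = requests.get(EVENTS_PAGE_URL)
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly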

In [40]:
item = soup.find_all("div", "em-card")  # one "em-card" div per event

In [82]:
# Questions 2 and 3: display each event and collect it as a dictionary

events = []  # one dictionary per event
for i in item:
    events_dict = {}
    title_tag = i.find("h3","em-card_title")
    if title_tag is None:
        print("title : No Title Available")
        events_dict["title"] = "No Title Available"
    else:
        title = title_tag.text.title()
        print("title :", title)
        events_dict["title"] = title
    date_tag = i.find("p","em-card_event-text")
    if date_tag is None:
        print("Time : No Time Available")
    else:
        # Replace punctuation (except the ":" inside times) with spaces, then split into words,
        # e.g. "Wed, Oct 11, 2023 8:30am to 9:30am" -> ["Wed", "Oct", "11", "2023", "8:30am", "to", "9:30am"]
        sentence = date_tag.text
        for char in string.punctuation:
            if char != ":":
                sentence = sentence.replace(char, ' ')
        words = sentence.split()
        length = len(words)
        date_text = ' '.join(words[1:4])
        print("Date :", date_text)
        events_dict["date"] = date_text
        print("Day of Week :",words[0])
        events_dict["day_of_week"] = words[0]
        if length <= 4:
            print("Start Time :","Time Not Specified")
            events_dict["start_time"] = "Time Not Specified"
            print("End Time :","Time Not Specified")
            events_dict["end_time"] = "Time Not Specified"
        elif length <= 5:
            print("Start Time :",words[-1])
            events_dict["start_time"] = words[-1]
            print("End Time : Time Not Specified")
            events_dict["end_time"] = "Time Not Specified"
        else:
            print("Start Time :",words[4])
            events_dict["start_time"] = words[4]
            print("End Time :",words[-1])
            events_dict["end_time"] = words[-1]
    # Every "em-card_event-text" paragraph that lacks the year is treated as the location.
    # NOTE: the hard-coded "2023" only works for events dated in 2023.
    x = i.find_all("p","em-card_event-text")
    for j in x:
        if "2023" not in j.text.strip():
            print(j.text.strip())
            events_dict["location"] = j.text.strip()
    # A card with a single paragraph has date info only, so no location was listed.
    if len(x) == 1:
        print("Location: No Location Was Listed")
        events_dict["location"] = "No Location Was Listed"
    # find() returns None when the card has no image, so check the tag before reading "src".
    img = i.find("img","img_card")
    if img is None or img.get("src") is None:
        print("Image: No Image Found")
        events_dict["image"] = "No Image Found"
    else:
        display(Image(url=img["src"], height=200))
        events_dict["image"] = img["src"]
    tag = i.find("a","em-card_tag")
    if tag is None:
        print("Primary Event type: No Event Type Specified")
        events_dict["primary_event_type"] = "No Event Type Specified"
    else:
        print("Primary Event type:", tag.text)
        events_dict["primary_event_type"] = tag.text
    events.append(events_dict)
    print("--------------------------------------")

title : The Art Of Collecting: Gifts From The Luther W. Brady Estate
Date : Oct 11 2023
Day of Week : Wed
Start Time : Time Not Specified
End Time : Time Not Specified
Luther W. Brady Art Gallery


Primary Event type: Exhibition
--------------------------------------
title : Next Next_
Date : Oct 11 2023
Day of Week : Wed
Start Time : Time Not Specified
End Time : Time Not Specified
Flagg Building, Student Lounge & Gallery 7


Primary Event type: Arts & Culture
--------------------------------------
title : Alumni In Finance & Real Estate: Industry Networking Breakfast (Nyc)
Date : Oct 11 2023
Day of Week : Wed
Start Time : 8:30am
End Time : 9:30am
Location: No Location Was Listed


Primary Event type: Alumni
--------------------------------------
title : Gwsb Ms In Information Systems Technology  Information Session & Webinar
Date : Oct 11 2023
Day of Week : Wed
Start Time : 9am
End Time : 9:30am
Virtual Event


Primary Event type: Admissions
--------------------------------------
title : Anne Lindberg: What Color Is Divine Light?
Date : Oct 11 2023
Day of Week : Wed
Start Time : 10am
End Time : 5pm
The George Washington Museum and the Textile Museum


Primary Event type: Arts & Culture
--------------------------------------
title : Classical Washington
Date : Oct 11 2023
Day of Week : Wed
Start Time : 10am
End Time : 5pm
The George Washington Museum and the Textile Museum


Primary Event type: Arts & Culture
--------------------------------------
title : Handstitched Worlds: The Cartography Of Quilts
Date : Oct 11 2023
Day of Week : Wed
Start Time : 10am
End Time : 5pm
The George Washington Museum and the Textile Museum


Primary Event type: Arts & Culture
--------------------------------------
title : The New Naval And Military Map Of The United States
Date : Oct 11 2023
Day of Week : Wed
Start Time : 10am
End Time : 5pm
The George Washington Museum and the Textile Museum


Primary Event type: Arts & Culture
--------------------------------------
title : From Spark To Impact: Promoting And Enabling Research With The Libraries And Academic Innovation Team
Date : Oct 11 2023
Day of Week : Wed
Start Time : 11am
End Time : 12pm
Gelman Library, Room 608


Primary Event type: Research
--------------------------------------
title : Gwsb Ms In Project Management Information Session & Webinar
Date : Oct 11 2023
Day of Week : Wed
Start Time : 12pm
End Time : 12:30pm
Virtual Event


Primary Event type: Admissions
--------------------------------------
title : International Education (Master'S) - Virtual Information Session
Date : Oct 11 2023
Day of Week : Wed
Start Time : 12pm
End Time : 1pm
Virtual Event


Primary Event type: Admissions
--------------------------------------
title : Bloomberg Industry Group Careers Panel
Date : Oct 11 2023
Day of Week : Wed
Start Time : 12:30pm
End Time : 2pm
Law School, Tasher Great Room


Primary Event type: 
--------------------------------------
title : From Spark To Impact: Software Systems For Researchers With Proposals And Awards
Date : Oct 11 2023
Day of Week : Wed
Start Time : 1pm
End Time : 2pm
Rome Hall (Academic Center), Room 206


Primary Event type: Research
--------------------------------------
title : Gw Collection / Corcoran Faculty Selection 
Date : Oct 11 2023
Day of Week : Wed
Start Time : 1pm
End Time : 5pm
Flagg Building, Gallery 6


Primary Event type: Exhibition
--------------------------------------
title : Global Reflections: A Virtual Group Processing Space
Date : Oct 11 2023
Day of Week : Wed
Start Time : 1pm
End Time : 2:30pm
Online


Primary Event type: Student Life
--------------------------------------
title : Too Good To Be True? Beware Of Job Offer Scams! 
Date : Oct 11 2023
Day of Week : Wed
Start Time : 2pm
End Time : Time Not Specified
Virtual Event


Primary Event type: Info Session
--------------------------------------
title : Dr. Sharon Murphy Book Talk 
Date : Oct 11 2023
Day of Week : Wed
Start Time : 2:30pm
End Time : 4pm
Location: No Location Was Listed


Primary Event type: No Event Type Specified
--------------------------------------
title : Macro-International Seminar: Tony Zhang, Federal Reserve
Date : Oct 11 2023
Day of Week : Wed
Start Time : 2:30pm
End Time : 4pm
Hall of Government, 321


Primary Event type: Academic
--------------------------------------
title : Accessibility On Campus Discussion
Date : Oct 11 2023
Day of Week : Wed
Start Time : 3pm
End Time : 4pm
Online


Primary Event type: Student Life
--------------------------------------
title : Collaborative Reform Conference -- Effective Law Enforcement For All
Date : Oct 11 2023
Day of Week : Wed
Start Time : 3pm
End Time : 7:30am
Elliott School of International Affairs, GW University, City View Room


Primary Event type: Lectures & Speakers
--------------------------------------
title : Inequality And The Crisis Of Liberal Democracy
Date : Oct 11 2023
Day of Week : Wed
Start Time : 4pm
End Time : 5pm
1957 E Street NW, 412Q


Primary Event type: Lectures & Speakers
--------------------------------------


In [83]:
# Question 4 basic solution:
from pandas import DataFrame

events_df = DataFrame(events)
events_df.head()

|  | title | date | day_of_week | start_time | end_time | location | image | primary_event_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | The Art Of Collecting: Gifts From The Luther W... | Oct 11 2023 | Wed | Time Not Specified | Time Not Specified | Luther W. Brady Art Gallery | https://localist-images.azureedge.net/photos/4... | Exhibition |
| 1 | Next Next_ | Oct 11 2023 | Wed | Time Not Specified | Time Not Specified | Flagg Building, Student Lounge & Gallery 7 | https://localist-images.azureedge.net/photos/4... | Arts & Culture |
| 2 | Alumni In Finance & Real Estate: Industry Netw... | Oct 11 2023 | Wed | 8:30am | 9:30am | No Location Was Listed | https://localist-images.azureedge.net/photos/4... | Alumni |
| 3 | Gwsb Ms In Information Systems Technology Inf... | Oct 11 2023 | Wed | 9am | 9:30am | Virtual Event | https://localist-images.azureedge.net/photos/4... | Admissions |
| 4 | Anne Lindberg: What Color Is Divine Light? | Oct 11 2023 | Wed | 10am | 5pm | The George Washington Museum and the Textile M... | https://localist-images.azureedge.net/photos/4... | Arts & Culture |
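
The optional inspection steps from Question 4 could look roughly like the sketch below. It rebuilds a tiny `events_df` from three rows copied from the output above so that it runs on its own. `to_time` is a hypothetical helper, and treating any unparseable value (such as "Time Not Specified") as missing is an assumption of this sketch.

```python
from datetime import datetime

from pandas import DataFrame

# Three rows copied from the printed output, trimmed to the time columns.
events = [
    {"title": "Alumni In Finance & Real Estate: Industry Networking Breakfast (Nyc)",
     "start_time": "8:30am", "end_time": "9:30am"},
    {"title": "Anne Lindberg: What Color Is Divine Light?",
     "start_time": "10am", "end_time": "5pm"},
    {"title": "Next Next_",
     "start_time": "Time Not Specified", "end_time": "Time Not Specified"},
]
events_df = DataFrame(events)

print(events_df.columns.tolist())  # column names
print(len(events_df))              # number of rows
print(events_df.head())            # first few rows


def to_time(text):
    """Turn "9am" or "8:30am" into a datetime.time; return None for anything else."""
    for fmt in ("%I:%M%p", "%I%p"):
        try:
            return datetime.strptime(text, fmt).time()
        except ValueError:
            continue
    return None


# Comparing the raw strings would sort "10am" before "8:30am", so convert first
# and drop the rows that have no usable time.
start_times = events_df["start_time"].map(to_time).dropna()
end_times = events_df["end_time"].map(to_time).dropna()
earliest_start = min(start_times)
latest_end = max(end_times)
print("Earliest start:", earliest_start)
print("Latest end:", latest_end)
```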


In [84]:
# Question 5 solution:
events_df.to_csv("events.csv")
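
By default `to_csv` also writes the row index as an extra first column, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below contrasts the default with `index=False`. The one-row frame and the temporary folder are stand-ins chosen for illustration, not part of the original solution.

```python
import os
import tempfile

from pandas import DataFrame, read_csv

# Hypothetical one-row frame standing in for the real events_df.
events_df = DataFrame([{"title": "Classical Washington", "start_time": "10am", "end_time": "5pm"}])

# A temporary folder keeps this sketch from overwriting a real "events.csv";
# in Colab the plain filename "events.csv" could be passed instead.
csv_path = os.path.join(tempfile.mkdtemp(), "events.csv")

events_df.to_csv(csv_path)  # default: the row index becomes an extra first column
with_index = read_csv(csv_path).columns.tolist()

events_df.to_csv(csv_path, index=False)  # leave the row index out
without_index = read_csv(csv_path).columns.tolist()

print(with_index)
print(without_index)
```

Either form satisfies Question 5; `index=False` just gives a cleaner file to open in spreadsheet software.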

## Exploration / Scratch Work

In [47]:
item[0].find_all("p","em-card_event-text")

[<p class="em-card_event-text">
       
         Wed, Oct 11, 2023
       
     </p>,
 <p class="em-card_event-text">
 <a href="https://calendar.gwu.edu/event/the_art_of_collecting_gifts_from_the_luther_w_brady_estate"><i class="fas fa-map-marker-alt"></i>  Luther W. Brady Art Gallery</a>
 </p>]

In [55]:
for i in item:
    x = i.find_all("p","em-card_event-text")
    for j in x:
        # Paragraphs without the year are locations (assumes every event is dated in 2023).
        if "2023" not in j.text.strip():
            print(j.text.strip())

Luther W. Brady Art Gallery
Flagg Building, Student Lounge & Gallery 7
Virtual Event
The George Washington Museum and the Textile Museum
The George Washington Museum and the Textile Museum
The George Washington Museum and the Textile Museum
The George Washington Museum and the Textile Museum
Gelman Library, Room 608
Virtual Event
Virtual Event
Law School, Tasher Great Room
Rome Hall (Academic Center), Room 206
Flagg Building, Gallery 6
Online
Virtual Event
Hall of Government, 321
Online
Elliott School of International Affairs, GW University, City View Room
1957 E Street NW, 412Q


In [None]:
for i in item:
    if i.find("h3","em-card_title") == None:
        print("Title : No Title Available")
    else:
        print("Title :",(i.find("h3","em-card_title").text).title())

In [None]:
for i in item:
    if i.find("a","em-card_tag") == None:
        print("Primary Event type: No Event Type Specified")
    else:
        print("Primary Event type:",(i.find("a","em-card_tag").text))

In [None]:
for i in item:
    if i.find("p","em-card_event-text") == None:
        print("Time : No Time Available")
    else:
        sentence = i.find("p","em-card_event-text").text
        for char in string.punctuation:
            if char != ":":
                sentence = sentence.replace(char, ' ')
        words = list(sentence.split())
        length = int(len(words))
        if length <= 4:
            print("Date :",' '.join(i for i in words[1:4]))
            print("Day of Week :",words[0])
            print("Start Time :",None)
            print("End Time :",None)
        elif length <= 5:
            print("Date :",' '.join(i for i in words[1:4]))
            print("Day of Week :",words[0])
            print("Start Time :",words[-1])
            print("End Time :",None)
        else:
            print("Date :",' '.join(i for i in words[1:4]))
            print("Day of Week :",words[0])
            print("Start Time :",words[4])
            print("End Time :",words[-1])
    print("--------------------------------------")

In [None]:
for i in item:
    # find() returns None when there is no image, so check the tag before reading "src".
    img = i.find("img","img_card")
    if img is None or img.get("src") is None:
        print("Image: No Image Found")
    else:
        display(Image(url=img["src"]))

In [71]:
for i in item:
    x = i.find_all("p","em-card_event-text")
    for j in x:
        if "2023" not in j.text.strip():
            print(j.text.strip())
            events_dict["location"] = j.text.strip()
    # A single paragraph means the card has date info only.
    if len(x) == 1:
        print("Location: No Location Was Listed")
        events_dict["location"] = "No Location Was Listed"

Luther W. Brady Art Gallery
Flagg Building, Student Lounge & Gallery 7
Location: No Location Was Listed
Virtual Event
The George Washington Museum and the Textile Museum
The George Washington Museum and the Textile Museum
The George Washington Museum and the Textile Museum
The George Washington Museum and the Textile Museum
Gelman Library, Room 608
Virtual Event
Virtual Event
Law School, Tasher Great Room
Rome Hall (Academic Center), Room 206
Flagg Building, Gallery 6
Online
Virtual Event
Location: No Location Was Listed
Hall of Government, 321
Online
Elliott School of International Affairs, GW University, City View Room
1957 E Street NW, 412Q


In [None]:
def date_parser(date_info:str):
    # TODO: parse the date string
    return {
        "weekday": "TODO",
        "date": "TODO",
        "start": "TODO",
        "end": "TODO",
    }

In [None]:
d1 = "Mon, Oct 2, 2023 10am to 11:30am"

d2 = "Mon, Oct 2, 2023"

In [None]:
assert date_parser(d1) == {
    "weekday": "Monday", # or Mon
    "date": "2023-10-02",
    "start": "10:00am", # or "10am" or "10:00" or "10:00:00"
    "end": "11:30am" # or "11:30" or "11:30:00"
}

In [None]:
assert date_parser(d2) == {
    "weekday": "Monday", # or Mon
    "date": "2023-10-02",
    "start": None,
    "end": None
}
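
One possible way to fill in the `date_parser` stub so that both assertions above pass is sketched below. It is not the official solution: `parse_time` is a hypothetical helper, the zero-padded output ("10:00am") is just one of the formats the comments allow, and the `%a`/`%b` codes assume English day and month abbreviations (the default locale).

```python
from datetime import datetime


def parse_time(text):
    """Normalize "10am" or "11:30am" to a zero-padded form such as "10:00am"."""
    fmt = "%I:%M%p" if ":" in text else "%I%p"
    return datetime.strptime(text, fmt).strftime("%I:%M%p").lower()


def date_parser(date_info: str):
    words = date_info.split()
    # The first four words are the date part, e.g. ["Mon,", "Oct", "2,", "2023"].
    day = datetime.strptime(" ".join(words[:4]), "%a, %b %d, %Y")
    # Whatever remains is "10am to 11:30am", a lone start time, or nothing at all.
    times = [word for word in words[4:] if word.lower() != "to"]
    return {
        "weekday": day.strftime("%A"),
        "date": day.strftime("%Y-%m-%d"),
        "start": parse_time(times[0]) if len(times) >= 1 else None,
        "end": parse_time(times[1]) if len(times) >= 2 else None,
    }


print(date_parser("Mon, Oct 2, 2023 10am to 11:30am"))
print(date_parser("Mon, Oct 2, 2023"))
```

Letting `strptime` and `strftime` do the conversions avoids hand-written month tables, at the cost of depending on the locale's day and month names.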

In [None]:
def fetch_events(selected_date=None):
    events = []
    # TODO
    return events

In [None]:
from datetime import date

events = fetch_events()
assert any(events)
assert events[0]["date"] == str(date.today())

events = fetch_events("2023-10-20")
assert any(events)
assert events[0]["date"] == "2023-10-20"
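
A minimal sketch of how the `fetch_events` stub might be completed follows. It rests on several assumptions: the CSS class names are the ones used in the solution above, the first "em-card_event-text" paragraph holds the date info and the second the location, and only three fields are collected. `parse_events` is a hypothetical helper that separates HTML parsing from the network request so the parsing can be tried on a saved snippet.

```python
from datetime import datetime

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://calendar.gwu.edu/calendar"


def parse_events(html):
    """Return one dictionary per "em-card" div found in the page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for card in soup.find_all("div", "em-card"):
        title_tag = card.find("h3", "em-card_title")
        texts = [p.text.strip() for p in card.find_all("p", "em-card_event-text")]
        # Assumption: the first paragraph starts with text like "Wed, Oct 11, 2023".
        date_words = texts[0].split()[:4] if texts else []
        iso_date = None
        if len(date_words) == 4:
            day = datetime.strptime(" ".join(date_words), "%a, %b %d, %Y")
            iso_date = day.strftime("%Y-%m-%d")
        events.append({
            "title": title_tag.text.strip() if title_tag else None,
            "date": iso_date,
            "location": texts[1] if len(texts) > 1 else None,
        })
    return events


def fetch_events(selected_date=None):
    url = BASE_URL
    if selected_date is not None:
        # "2023-10-20" becomes "/day/2023/10/20", as in the hint.
        url += "/day/" + selected_date.replace("-", "/")
    response = requests.get(url)
    response.raise_for_status()
    return parse_events(response.text)
```

Whether the assertions above pass with this sketch depends on the live page, for example on which date the site shows for multi-day events.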