0037 spider stl board of public service #96

Open. Wants to merge 3 commits into base: main

Samatha-Kodali

Summary

Issue: #37

Replace "ISSUE_NUMBER" with the number of your issue so that GitHub will link this pull request with the issue and make review easier.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

Questions

Include any questions you have about what you're working on.

Comment on lines +33 to +34
"https://www.stlouis-mo.gov/events/"
"past-meetings.cfm?span=-30&department=332"

Member

The department number for the Board of Public Service is 209, so change the end of the URL to &department=209.
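
A minimal sketch of how the corrected URL could be assembled, assuming the same past-meetings endpoint and span value quoted in the diff above. The helper name `build_past_meetings_url` and the two constants are hypothetical and are not part of the spider.

```python
from urllib.parse import urlencode

# Endpoint taken from the two string literals quoted in the diff.
BASE_URL = "https://www.stlouis-mo.gov/events/past-meetings.cfm"
# Department number given in the review comment.
PUBLIC_SERVICE_DEPARTMENT = 209


def build_past_meetings_url(department, span=-30):
    """Return the past-meetings URL for one department (hypothetical helper)."""
    query = urlencode({"span": span, "department": department})
    return f"{BASE_URL}?{query}"


print(build_past_meetings_url(PUBLIC_SERVICE_DEPARTMENT))
```

Keeping the department number in one named constant makes it harder to leave a value copied from another spider in the URL.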

custom_settings = {"ROBOTSTXT_OBEY": False}
start_urls = [
    (
        "https://www.stlouis-mo.gov/government/departments/public-service/index.cfm"

Member

start_urls should point to the page where the meeting materials are posted. I believe that page is https://www.stlouis-mo.gov/government/departments/public-service/documents/meeting-materials.cfm.

event_sponsors = response.css("ul.list-group li span.small::text").getall()
urls = []
for url, sponsor in zip(event_urls, event_sponsors):
    if "aldermen" in sponsor.lower() or "aldermanic" in sponsor.lower():

Member

Here, change the condition to something like if "public service" in sponsor.lower(). The current code only scrapes events for the Board of Aldermen.
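
As a rough illustration of the suggested condition, the filter could be pulled into a small function. This is a sketch only: the function name and the sample URLs and sponsor strings are made up for the example, and the real spider works on values extracted from the response.

```python
def select_public_service_urls(event_urls, event_sponsors):
    """Keep only the URLs whose sponsor text mentions "public service"."""
    return [
        url
        for url, sponsor in zip(event_urls, event_sponsors)
        if "public service" in sponsor.lower()
    ]


# Hypothetical sample data, not scraped from the site.
sample_urls = ["/events/a.cfm", "/events/b.cfm", "/events/c.cfm"]
sample_sponsors = [
    "Sponsor: Board of Aldermen",
    "Sponsor: Board of Public Service",
    "Sponsor: Aldermanic Committee",
]
print(select_public_service_urls(sample_urls, sample_sponsors))
```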

Comment on lines +86 to +92
def _parse_title(self, response):
    """Parse or generate meeting title."""
    title = response.css("div.page-title-row h1::text").get()
    title = title.replace("Meeting", "").replace("Metting", "")
    title = title.replace("-", "- ")
    title = title.replace("(Canceled)", "Cancelled")
    return title.replace("  ", " ").strip()

Member

It looks like the Board of Public Service's meeting titles are either "Board of Public Service" or "Special Board of Public Service Meeting", so _parse_title only needs to distinguish those two cases.
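
The reviewer's own snippet is not preserved in this thread, so the following is only a guess at what a two-case title parser might look like. It assumes the word "special" in the page heading is enough to tell the two cases apart; the function takes the raw heading text rather than a Scrapy response so that it runs on its own.

```python
REGULAR_TITLE = "Board of Public Service"
SPECIAL_TITLE = "Special Board of Public Service Meeting"


def parse_title(raw_title):
    """Return one of the two known titles (hypothetical stand-in for _parse_title)."""
    # Assumption: any heading containing "special" is a special meeting.
    if raw_title and "special" in raw_title.lower():
        return SPECIAL_TITLE
    return REGULAR_TITLE


print(parse_title("Special Board of Public Service Meeting"))
print(parse_title("Board of Public Service"))
```

Returning a fixed string also handles the case where the heading is missing, which the original chain of replace calls would not.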

Comment on lines +94 to +108
def _parse_description(self, response):
    """Parse or generate meeting description."""
    description = response.css(
        "div#EventDisplayBlock div.col-md-8 h4 strong::text"
    ).getall()
    i = 0
    while i < len(description) - 1:
        if "following:" in description[i]:
            return description[i + 1].replace("\xa0", "")
        elif "will" in description[i]:
            return description[i].replace("\xa0", "")
        else:
            i += 1
    else:
        return ""

Member

You can get rid of _parse_description and put description="" in _parse_event.

Comment on lines +110 to +118
def _parse_classification(self, response):
    """Parse or generate classification from allowed options."""
    title = response.css("div.page-title-row h1::text").get()
    if "committee" in title.lower():
        return COMMITTEE
    elif "board" in title.lower():
        return BOARD
    else:
        return NOT_CLASSIFIED

Member

You can get rid of this and put classification=BOARD in _parse_event.
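
To show what hard-coding these two fields might look like, here is a sketch that uses a plain dict in place of the meeting item the spider builds in _parse_event. The helper name is hypothetical, and the local BOARD value is a stand-in assumed for the example; the spider would keep using the BOARD constant it already imports.

```python
# Stand-in for the BOARD constant the spider imports; the value is an assumption.
BOARD = "Board"


def build_event_fields(title):
    """Return the fields that no longer need their own parse method."""
    return {
        "title": title,
        "description": "",        # replaces _parse_description
        "classification": BOARD,  # replaces _parse_classification
    }


print(build_event_fields("Board of Public Service"))
```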

Comment on lines +156 to +189
def _parse_location(self, response):
    """Parse or generate location."""
    location = response.css("div.col-md-4 div.content-block p *::text").getall()
    temp = []
    for item in location:
        item = item.replace("\n", "")
        if item != "":
            temp.append(item)
    location = temp
    i, location_index, sponsor_index = 0, 0, 0
    while i < len(location):
        if "location" in location[i].lower():
            location_index = i
        if "sponsor" in location[i].lower():
            sponsor_index = i
            break
        i += 1

    if location_index + 1 < len(location) and sponsor_index < len(location):
        name = location[location_index + 1]
        address = []
        for j in range(location_index + 2, sponsor_index):
            address.append(location[j])
        address = (
            " ".join(address).replace("Directions to this address", "").strip()
        )
    else:
        name = ""
        address = ""

    return {
        "address": address,
        "name": name,
    }

Member

When there is a virtual/Zoom meeting, we want the name to be "Zoom" and the address to be "".
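
One way to apply this rule is a small post-processing step on the parsed name and address. This is a sketch under the assumption that a virtual meeting can be recognized by the words "zoom" or "virtual" appearing in either field; the function name is hypothetical and the site may signal virtual meetings differently.

```python
def normalize_location(name, address):
    """Collapse virtual meetings to name "Zoom" with an empty address."""
    text = f"{name} {address}".lower()
    # Assumption: these keywords are how a virtual meeting shows up on the page.
    if "zoom" in text or "virtual" in text:
        return {"name": "Zoom", "address": ""}
    return {"name": name, "address": address}


print(normalize_location("Virtual Meeting", "Zoom link to follow"))
print(normalize_location("City Hall", "1200 Market St"))
```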

Comment on lines +201 to +218
pattern_mmddyy = r"(?P<date>(\d{1,2}-\d{1,2}-\d{2}))"
pattern_mmddyyyy = r"(?P<date>(\d{1,2}-\d{1,2}-\d{4}))"
pattern_monthddyyyy = r"(?P<date>([A-Z]* \d{1,2}, \d{4}))"

rm_mmddyy = re.search(pattern_mmddyy, description)
rm_mmddyyyy = re.search(pattern_mmddyyyy, description)
rm_monthddyyyy = re.search(pattern_monthddyyyy, description)

dt = None
if rm_mmddyy is not None:
    date = rm_mmddyy.group("date")
    dt = datetime.strptime(date, "%m-%d-%y")
if rm_mmddyyyy is not None:
    date = rm_mmddyyyy.group("date")
    dt = datetime.strptime(date, "%m-%d-%Y")
if rm_monthddyyyy is not None:
    date = rm_monthddyyyy.group("date")
    dt = datetime.strptime(date, "%b %d, %Y")

Member

None of these regex patterns match the way the date is formatted for the Public Service Board meeting materials.

pattern = r"(?P<date>[A-Z][a-z]* \d{1,2})"
rm = re.search(pattern, description)
if rm is not None:
    date = rm.group("date")
    dt = datetime.strptime(date, "%B %d")
else:
    dt = None
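
One thing to watch with the pattern above: strptime with "%B %d" alone leaves the year at its default of 1900, and a capitalized word followed by digits that is not a month name raises ValueError. A possible variant, sketched here with a hypothetical `year` argument (where the year comes from is an assumption, and "%B" expects English month names under the default locale):

```python
import re
from datetime import datetime

MONTH_DAY = re.compile(r"(?P<date>[A-Z][a-z]* \d{1,2})")


def parse_month_day(description, year):
    """Return the first "Month day" in description as a datetime, or None."""
    for match in MONTH_DAY.finditer(description):
        try:
            # Supplying the year avoids the 1900 default and handles Feb 29.
            return datetime.strptime(f"{match.group('date')} {year}", "%B %d %Y")
        except ValueError:
            # Matched something like "Room 20", which is not a month name.
            continue
    return None


print(parse_month_day("Agenda for February 29", 2024))
```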
