0037 spider stl board of public service #96

Open

Samatha-Kodali wants to merge 3 commits into stl-public-meetings:main from Samatha-Kodali:0037-spider-stl_board_of_public_service

Samatha-Kodali commented Aug 3, 2020

Summary

Issue: #37

Replace "ISSUE_NUMBER" with the number of your issue so that GitHub will link this pull request with the issue and make review easier.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

Tests are implemented
All tests are passing
Style checks run (see documentation for more details)
Style checks are passing
Code comments from template removed

Questions

Include any questions you have about what you're working on.

SamathaKodali and others added 3 commits

July 27, 2020 14:13

public works
1b1a2a5

stl public service
ceb1bda

Merge remote-tracking branch 'upstream/main' into 0037-spider-stl_board_of_public_service
e915a62

ledaliang reviewed

city_scrapers/spiders/example.py

Comment on lines +33 to +34

		"https://www.stlouis-mo.gov/events/"
		"past-meetings.cfm?span=-30&department=332"

Member

ledaliang Aug 3, 2020

The department number for the Board of Public Service is 209, so change the query string in the URL to end with &department=209.
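A minimal sketch of the corrected URL, built from the base path in the diff above. The helper name build_past_meetings_url and the constant names are hypothetical and exist only for illustration; in the spider itself the string could simply be edited in place.

```python
from urllib.parse import urlencode

# Base path taken from the diff under review.
BASE_URL = "https://www.stlouis-mo.gov/events/past-meetings.cfm"

# Department 209 is the Board of Public Service, per the review comment.
BOARD_OF_PUBLIC_SERVICE_DEPARTMENT = 209


def build_past_meetings_url(department, span=-30):
    """Return the past-meetings URL for one department (hypothetical helper)."""
    query = urlencode({"span": span, "department": department})
    return f"{BASE_URL}?{query}"


past_meetings_url = build_past_meetings_url(BOARD_OF_PUBLIC_SERVICE_DEPARTMENT)
```

Keeping the department number in a named constant makes it harder to leave another board's ID behind when a spider is copied from an existing one.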

ledaliang reviewed

city_scrapers/spiders/example.py

+                  custom_settings = {"ROBOTSTXT_OBEY": False}
+                  start_urls = [
+                      (
+                          "https://www.stlouis-mo.gov/government/departments/public-service/index.cfm"

Member

ledaliang Aug 3, 2020

start_urls should point to the page where the meeting materials are posted. I believe that page is https://www.stlouis-mo.gov/government/departments/public-service/documents/meeting-materials.cfm.

ledaliang reviewed

city_scrapers/spiders/example.py

+                      event_sponsors = response.css("ul.list-group li span.small::text").getall()
+                      urls = []
+                      for url, sponsor in zip(event_urls, event_sponsors):
+                          if "aldermen" in sponsor.lower() or "aldermanic" in sponsor.lower():

Member

ledaliang Aug 3, 2020

Change this condition to something like if "public service" in sponsor.lower(). As written, the code only scrapes events for the Board of Aldermen.
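A rough sketch of that filter as a standalone function, so it can be checked without a live response. The function name select_public_service_urls and the sample values in the usage note are made up for illustration; in the spider, event_urls and event_sponsors would come from the response.css(...) calls shown in the diff.

```python
def select_public_service_urls(event_urls, event_sponsors):
    """Keep only the URLs whose sponsor text mentions "public service".

    The two lists are assumed to be parallel, as in the zip() loop in the diff.
    """
    return [
        url
        for url, sponsor in zip(event_urls, event_sponsors)
        if "public service" in sponsor.lower()
    ]
```

For example, given the hypothetical inputs ["/a", "/b"] and ["Board of Aldermen", "Board of Public Service"], only "/b" is kept.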

ledaliang reviewed

city_scrapers/spiders/example.py

Comment on lines +86 to +92

+                  def _parse_title(self, response):
+                      """Parse or generate meeting title."""
+                      title = response.css("div.page-title-row h1::text").get()
+                      title = title.replace("Meeting", "").replace("Metting", "")
+                      title = title.replace("-", "- ")
+                      title = title.replace("(Canceled)", "Cancelled")
+                      return title.replace("  ", " ").strip()

Member

ledaliang Aug 3, 2020

It looks like the Board of Public Service's meeting titles are either Board of Public Service or Special Board of Public Service Meeting, so _parse_title only needs to distinguish those two cases.
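The reviewer's own snippet is not preserved in this capture, so the following is only a guess at the idea: a sketch that maps the raw heading onto one of the two known titles. The function name parse_title is hypothetical, and dropping the trailing "Meeting" is an assumption carried over from the replace("Meeting", "") call in the diff.

```python
def parse_title(raw_title):
    """Map a raw page heading to one of the two known meeting titles.

    Assumption: any heading containing "special" is a special meeting,
    and everything else is a regular board meeting.
    """
    if raw_title and "special" in raw_title.lower():
        return "Special Board of Public Service"
    return "Board of Public Service"
```

Inside the spider, the argument would be the result of response.css("div.page-title-row h1::text").get(), which can be None, hence the guard before calling lower().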

ledaliang reviewed

city_scrapers/spiders/example.py

Comment on lines +94 to +108

+                  def _parse_description(self, response):
+                      """Parse or generate meeting description."""
+                      description = response.css(
+                          "div#EventDisplayBlock div.col-md-8 h4 strong::text"
+                      ).getall()
+                      i = 0
+                      while i < len(description) - 1:
+                          if "following:" in description[i]:
+                              return description[i + 1].replace("\xa0", "")
+                          elif "will" in description[i]:
+                              return description[i].replace("\xa0", "")
+                          else:
+                              i += 1
+                      else:
+                          return ""

Member

ledaliang Aug 3, 2020

You can get rid of _parse_description and put description="" in _parse_event.

ledaliang reviewed

city_scrapers/spiders/example.py

Comment on lines +110 to +118

+                  def _parse_classification(self, response):
+                      """Parse or generate classification from allowed options."""
+                      title = response.css("div.page-title-row h1::text").get()
+                      if "committee" in title.lower():
+                          return COMMITTEE
+                      elif "board" in title.lower():
+                          return BOARD
+                      else:
+                          return NOT_CLASSIFIED

Member

ledaliang Aug 3, 2020

You can get rid of this and put classification=BOARD in _parse_event.
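Taken together with the earlier comment on _parse_description, this means two helpers collapse into constants. A minimal sketch of the resulting fields, using a plain dict instead of the project's meeting item so that it runs on its own; build_meeting_fields is a hypothetical name, and BOARD = "Board" is an assumed stand-in for the constant the spider imports.

```python
# Stand-in so the sketch is self-contained; the spider would import BOARD
# from its constants module, and the value here is an assumption.
BOARD = "Board"


def build_meeting_fields(title):
    """Return the fields that no longer need their own parse helper."""
    return {
        "title": title,
        "description": "",        # no per-meeting description is scraped
        "classification": BOARD,  # every meeting of this body is a board meeting
    }
```

Hard-coding these values is reasonable here because the spider covers a single body, so there is nothing to classify per page.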

ledaliang reviewed

city_scrapers/spiders/example.py

Comment on lines +156 to +189

+                  def _parse_location(self, response):
+                      """Parse or generate location."""
+                      location = response.css("div.col-md-4 div.content-block p *::text").getall()
+                      temp = []
+                      for item in location:
+                          item = item.replace("\n", "")
+                          if item != "":
+                              temp.append(item)
+                      location = temp
+                      i, location_index, sponsor_index = 0, 0, 0
+                      while i < len(location):
+                          if "location" in location[i].lower():
+                              location_index = i
+                          if "sponsor" in location[i].lower():
+                              sponsor_index = i
+                              break
+                          i += 1
+                      if location_index + 1 < len(location) and sponsor_index < len(location):
+                          name = location[location_index + 1]
+                          address = []
+                          for j in range(location_index + 2, sponsor_index):
+                              address.append(location[j])
+                          address = (
+                              " ".join(address).replace("Directions to this address", "").strip()
+                          )
+                      else:
+                          name = ""
+                          address = ""
+                      return {
+                          "address": address,
+                          "name": name,
+                      }

Member

ledaliang Aug 3, 2020

When there is a virtual/Zoom meeting, we want the name to be "Zoom" and the address to be "".
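One way to apply that rule is to post-process whatever _parse_location extracted. This is only a sketch: the function name normalize_location is hypothetical, and treating the words "zoom" or "virtual" anywhere in the extracted text as the signal for an online meeting is an assumption about how the pages are worded.

```python
def normalize_location(name, address):
    """Return {"name": "Zoom", "address": ""} for virtual meetings.

    Assumption: a virtual meeting mentions "zoom" or "virtual" somewhere
    in the extracted name or address text.
    """
    combined = f"{name} {address}".lower()
    if "zoom" in combined or "virtual" in combined:
        return {"name": "Zoom", "address": ""}
    return {"name": name, "address": address}
```

In-person meetings pass through unchanged, so the existing parsing logic in the diff does not need to know about the virtual case.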

ledaliang reviewed

city_scrapers/spiders/example.py

Comment on lines +201 to +218

+                          pattern_mmddyy = r"(?P<date>(\d{1,2}-\d{1,2}-\d{2}))"
+                          pattern_mmddyyyy = r"(?P<date>(\d{1,2}-\d{1,2}-\d{4}))"
+                          pattern_monthddyyyy = r"(?P<date>([A-Z]* \d{1,2}, \d{4}))"
+                          rm_mmddyy = re.search(pattern_mmddyy, description)
+                          rm_mmddyyyy = re.search(pattern_mmddyyyy, description)
+                          rm_monthddyyyy = re.search(pattern_monthddyyyy, description)
+                          dt = None
+                          if rm_mmddyy is not None:
+                              date = rm_mmddyy.group("date")
+                              dt = datetime.strptime(date, "%m-%d-%y")
+                          if rm_mmddyyyy is not None:
+                              date = rm_mmddyyyy.group("date")
+                              dt = datetime.strptime(date, "%m-%d-%Y")
+                          if rm_monthddyyyy is not None:
+                              date = rm_monthddyyyy.group("date")
+                              dt = datetime.strptime(date, "%b %d, %Y")

Member

ledaliang Aug 3, 2020

None of these regex patterns matches the date format used in the Board of Public Service meeting materials. Match the month name and day instead:

pattern = r"(?P<date>[A-Z][a-z]* \d{1,2})"
rm = re.search(pattern, description)
if rm is not None:
    date = rm.group("date")
    dt = datetime.strptime(date, "%B %d")
else:
    dt = None
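One caveat with the snippet above: parsing with "%B %d" alone yields a datetime in the year 1900, and "February 29" fails because 1900 is not a leap year. The sketch below is a hedged variant that takes the year as an explicit argument. Where that year comes from (for example, the page the link appears on) is an assumption, and parse_month_day and DATE_PATTERN are hypothetical names. It also assumes an English locale for "%B".

```python
import re
from datetime import datetime

# A capitalized word followed by a one- or two-digit number, e.g. "July 28".
DATE_PATTERN = re.compile(r"(?P<date>[A-Z][a-z]+ \d{1,2})")


def parse_month_day(description, year):
    """Return the first "Month DD" date found in description, or None.

    Candidates that are not real dates (for example "Room 20") are skipped.
    """
    for match in DATE_PATTERN.finditer(description):
        candidate = f"{match.group('date')} {year}"
        try:
            return datetime.strptime(candidate, "%B %d %Y")
        except ValueError:
            continue  # not a month name, or an impossible day; try the next match
    return None
```

Iterating over every match rather than only the first avoids returning None when an unrelated capitalized word followed by a number precedes the actual date.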
