# School Board Minutes

Scrape all of the school board minutes from http://www.mineral.k12.nv.us/pages/School_Board_Minutes

Save a CSV called `minutes.csv` with the date and the URL to the file. The date should be formatted as YYYY-MM-DD.

**Bonus:** Download the PDF files

**Bonus 2:** Use [PDF OCR X](https://solutions.weblite.ca/pdfocrx/index.php) on one of the PDF files and see if it can be converted into text successfully.

* **Hint:** If you're just looking for links, there are a lot of other links on that page! Can you look at a link and tell whether it points to minutes or not? You'll want to use an "if" statement.
* **Hint:** You could also filter out bad links later on using pandas instead of when scraping
* **Hint:** If you get a weird error that you can't really figure out, you can always tell Python to just ignore it using `try` and `except`, like below. Python will try to do the stuff inside of 'try', but if it hits an error it will skip right out.
* **Hint:** Remember the codes at http://strftime.org
* **Hint:** If you have a date that you've parsed, you can use `.dt.strftime` to turn it into a specially-formatted string. You use the same codes (like %B etc) that you use for converting strings into dates.

```python
try:
    # your code
    # your code
    # your code
    pass
except Exception:
    # something went wrong, so ignore it and move on
    pass
```
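
As a concrete version of the `try`/`except` hint, here is a minimal sketch that skips over a row missing its link. The `rows` list is made-up sample data, not the real page, and catching `IndexError` specifically (rather than every error) is a choice made for this sketch.

```python
# Made-up sample rows: [date text, file name]. The second row has no file.
rows = [
    ["September 1, 2020", "9.1.20_minutes.pdf"],
    ["March 27, 2018"],
    ["July 14, 2020", "7.14.20_minutes.pdf"],
]

links = []
for row in rows:
    try:
        # row[1] raises IndexError when the row has no second item
        links.append(row[1])
    except IndexError:
        # Python jumps here instead of crashing, and the loop keeps going
        links.append("not available")

print(links)
```

Catching only the error you expect means that a genuinely surprising error still shows up instead of being silently ignored.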

* **Hint:** You can use `.apply` to download each pdf, or you can use one of a thousand other ways. It'd be good `.apply` practice though!
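
For the hint about checking links with an "if" statement, here is a minimal sketch. It assumes that minutes links contain `/files/` and end in `.pdf`. The helper name `is_minutes_link` and the short `hrefs` list are hypothetical and used only for illustration.

```python
def is_minutes_link(href):
    """Return True when href looks like a minutes PDF (assumed pattern)."""
    if href is None:
        # some <a> tags have no href at all
        return False
    return "/files/" in href and href.lower().endswith(".pdf")

# Hypothetical sample of hrefs, standing in for every link on the page
hrefs = [
    "http://www.mineral.k12.nv.us/pages/School_Board_Minutes",
    "http://www.mineral.k12.nv.us/files/9.1.20_minutes.pdf",
    None,
    "http://www.mineral.k12.nv.us/files/8.11.20_minutes.pdf",
]

minutes_links = []
for href in hrefs:
    if is_minutes_link(href):
        minutes_links.append(href)

print(minutes_links)
```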

In [1]:
import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import re
import numpy as np



In [2]:
driver = webdriver.Chrome()

In [3]:
url = 'http://www.mineral.k12.nv.us/pages/School_Board_Minutes'
driver.get(url)

In [4]:
#Getting the date and PDF links 

meetings = []

#Selenium 4 deprecated and later removed the find_element_by_* helpers,
#so this uses find_element(By..., ...), which also works in Selenium 3
from selenium.webdriver.common.by import By

minutes_block = driver.find_element(By.ID, "livesite-page-content-left")
minutes_list = minutes_block.find_elements(By.TAG_NAME, 'p')[4:]

for minute_date in minutes_list:
    date = ''
    pdf_link = ''
    
    try:
        date = minute_date.find_elements(By.TAG_NAME, 'span')[0].text.strip()
            
        #because June 16 is formatted with the link in a separate span than the text
        if date == "":
            date = minute_date.find_elements(By.TAG_NAME, 'span')[3].text.strip()
    
    #strangely coded exceptions without a span or a tag
    except Exception:
        if minute_date.text.strip() == "March 27, 2018":
            date = minute_date.text.strip()
        elif minute_date.text.strip()  == "May 21, 2019 CANCELLED":
            date = minute_date.text.strip()
        else:
            pass
    
    try:
        pdf_link =  minute_date.find_element(By.TAG_NAME, 'a').get_attribute('href')
    except Exception: 
        pdf_link = "not available"
      
    #print(date)
    
    if date != '':
        if "cancel" not in date.lower():
            meetings.append({'date': date, 
                         'pdf link': pdf_link})

meetings

[{'date': 'September 1, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/9.1.20_minutes.pdf'},
 {'date': 'August 11, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/8.11.20_minutes.pdf'},
 {'date': 'July 28, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/7.28.20_minutes.pdf'},
 {'date': 'July 14, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/7.14.20_minutes.pdf'},
 {'date': 'June 16, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/6.16.20_minutes.pdf'},
 {'date': 'May 20, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/5.20.20_minutes.pdf'},
 {'date': 'April 7, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/4.7.20_minutes.pdf'},
 {'date': 'March 12, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/3.12.20_minutes.pdf'},
 {'date': 'March 5, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/3.5.20_minutes.pdf'},
 {'date': 'February 21, 2020',
  'pdf link': 'http://www.mineral.k12.nv.us/files/2.21.20_minutes.

In [5]:
df = pd.DataFrame(meetings)
df

| | date | pdf link |
|---|---|---|
| 0 | September 1, 2020 | http://www.mineral.k12.nv.us/files/9.1.20_minu... |
| 1 | August 11, 2020 | http://www.mineral.k12.nv.us/files/8.11.20_min... |
| 2 | July 28, 2020 | http://www.mineral.k12.nv.us/files/7.28.20_min... |
| 3 | July 14, 2020 | http://www.mineral.k12.nv.us/files/7.14.20_min... |
| 4 | June 16, 2020 | http://www.mineral.k12.nv.us/files/6.16.20_min... |
| ... | ... | ... |
| 62 | March 6, 2018 | http://www.mineral.k12.nv.us/files/march_6,_20... |
| 63 | February 20, 2018 | http://www.mineral.k12.nv.us/files/feb_20,_210... |
| 64 | February 6, 2018 | http://www.mineral.k12.nv.us/files/2.6.18_minu... |
| 65 | January 16, 2018 | http://www.mineral.k12.nv.us/files/january_16,... |


In [6]:
#Cleaning it up

#Fixing 2108 date in there
df[df.date.str.contains('2108')]
#shows it's row 51

#Replacing in the dataframe
df.iloc[51,0] = df.iloc[51,0].replace('2108', '2018')
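
The same typo fix can be done without hard-coding the row number. This is a minimal sketch on a made-up two-row `sample` dataframe (the dates are placeholders, not rows from the real page), using `.str.contains` to find the row and `.str.replace` to fix the whole column at once.

```python
import pandas as pd

# Made-up sample: the second row carries the same kind of year typo
sample = pd.DataFrame({'date': ['September 1, 2020', 'January 1, 2108']})

# Find which rows contain the typo (instead of reading the row number off the output)
typo_rows = sample.index[sample['date'].str.contains('2108')].tolist()

# Replace the typo everywhere in the column; regex=False treats '2108' as plain text
sample['date'] = sample['date'].str.replace('2108', '2018', regex=False)

print(typo_rows)
print(sample)
```

This keeps working if the page gains or loses rows, whereas `df.iloc[51,0]` depends on the typo staying in row 51.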

In [7]:
#Turning it into date time

df['date'] = pd.to_datetime(df['date'])
df

| | date | pdf link |
|---|---|---|
| 0 | 2020-09-01 | http://www.mineral.k12.nv.us/files/9.1.20_minu... |
| 1 | 2020-08-11 | http://www.mineral.k12.nv.us/files/8.11.20_min... |
| 2 | 2020-07-28 | http://www.mineral.k12.nv.us/files/7.28.20_min... |
| 3 | 2020-07-14 | http://www.mineral.k12.nv.us/files/7.14.20_min... |
| 4 | 2020-06-16 | http://www.mineral.k12.nv.us/files/6.16.20_min... |
| ... | ... | ... |
| 62 | 2018-03-06 | http://www.mineral.k12.nv.us/files/march_6,_20... |
| 63 | 2018-02-20 | http://www.mineral.k12.nv.us/files/feb_20,_210... |
| 64 | 2018-02-06 | http://www.mineral.k12.nv.us/files/2.6.18_minu... |
| 65 | 2018-01-16 | http://www.mineral.k12.nv.us/files/january_16,... |
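
The assignment also asks for a `minutes.csv` with dates written as YYYY-MM-DD. Here is a minimal sketch of that step on a two-row sample built from the first two results above. It writes to an in-memory buffer so nothing is saved to disk; passing `'minutes.csv'` to `to_csv` instead should write the file.

```python
import io
import pandas as pd

# Two rows copied from the scraped results, standing in for the full dataframe
df_sample = pd.DataFrame({
    'date': ['September 1, 2020', 'August 11, 2020'],
    'pdf link': [
        'http://www.mineral.k12.nv.us/files/9.1.20_minutes.pdf',
        'http://www.mineral.k12.nv.us/files/8.11.20_minutes.pdf',
    ],
})

# Parse the text dates (%B = full month name), then turn them back into
# strings shaped like YYYY-MM-DD using the same strftime codes
df_sample['date'] = pd.to_datetime(df_sample['date'], format='%B %d, %Y')
df_sample['date'] = df_sample['date'].dt.strftime('%Y-%m-%d')

# index=False leaves out the 0, 1, 2... row numbers
buffer = io.StringIO()
df_sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```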


## Getting the PDFs

In [13]:
#Selenium was not working on the buttons in Google's PDF viewer, and StackOverflow's explanations on disabling it seemed a bit complex.
#Instead I looked into urllib's request function.
#These are the sources I looked at for help on this:
#https://stackoverflow.com/questions/24844729/download-pdf-using-urllib
#https://www.codegrepper.com/code-examples/delphi/download+pdf+from+link+using+python

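#importing the submodule explicitly, since plain "import urllib" does not guarantee urllib.request is loaded
import urllib.request

def download_file(download_url, filename):
    #fetch the PDF and write the raw bytes to <filename>.pdf
    response = urllib.request.urlopen(download_url)
    with open(filename + ".pdf", 'wb') as file:
        file.write(response.read())

#downloading only first 10
for pdf in df['pdf link'][0:10]:
    if pdf != "not available":
        #escaping the dots so they only match literal periods
        filename = re.findall(r'^http://www\.mineral\.k12\.nv\.us/files/(.+)\.pdf', pdf)[0]
        download_file(pdf, filename)
        print(f'{filename} download complete')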

9.1.20_minutes download complete
8.11.20_minutes download complete
7.28.20_minutes download complete
7.14.20_minutes download complete
6.16.20_minutes download complete
5.20.20_minutes download complete
4.7.20_minutes download complete
3.12.20_minutes download complete
3.5.20_minutes download complete
2.21.20_minutes download complete
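
For `.apply` practice, the same download can be written as a function applied to the link column. This is a sketch, not the approach used above: the helper names `filename_from_url` and `download_pdf` are hypothetical, and the `fetch` parameter exists only so the demo can swap in a stand-in that records what would be downloaded instead of touching the network.

```python
import re
import urllib.request
import pandas as pd

def filename_from_url(url):
    """Pull 'name.pdf' out of a .../files/name.pdf URL, or None if it doesn't match."""
    match = re.search(r'/files/(.+)\.pdf$', url)
    return match.group(1) + '.pdf' if match else None

def download_pdf(url, fetch=urllib.request.urlretrieve):
    """Download one PDF and return the saved file name (None if skipped)."""
    if url == 'not available':
        return None
    name = filename_from_url(url)
    if name is None:
        return None
    fetch(url, name)  # by default, urlretrieve(url, filename) saves the file
    return name

# Dry run: a stand-in fetch that only records the planned downloads
links = pd.Series([
    'http://www.mineral.k12.nv.us/files/9.1.20_minutes.pdf',
    'http://www.mineral.k12.nv.us/files/8.11.20_minutes.pdf',
    'not available',
])
planned = []
saved = links.apply(download_pdf, fetch=lambda url, name: planned.append((url, name)))
print(planned)
```

With a real dataframe, something like `df['pdf link'].head(10).apply(download_pdf)` would use the default `fetch` and actually download the files.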


In [None]:
#I was able to convert the first page of one of the PDFs with PDF OCR X (converting more pages requires upgrading from the community version)