# EQAO Grade 9 Math Results Analysis
## EQAO Info
EQAO (Education Quality and Accountability Office) testing is a yearly, census-style assessment for Grades 3, 6 and 9 run by the Government of Ontario. Student results are said to inform the development of Ontario's curricula.

## Data Collection
Unfortunately, the newest results, which are of most interest, are only publicly available through a query on the EQAO website. Posting a spreadsheet of every school's results would arguably have been easier to work with than an interactive website.
Each school page on the EQAO website is addressed by a numeric id ("https://www.eqao.com/report/?id=####"). Data is collected by iterating through the ids and scraping each page that has the required data.
### Features
The following features of each page will be scraped:
- The average achievement of each school
- Total school enrolment
- Number of students taking the test.
- Town name (to cross reference)
Some socio-economic factors that are not available on the EQAO pages will also be included. These will be read from a separate CSV file:
- School's town's population
- Poverty rate in the respective school's town
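
One way to attach these town-level figures to the scraped rows is a lookup keyed on town name. The sketch below uses only the standard library and inline placeholder data; the column names (`Town`, `Population`, `Poverty Rate`), the placeholder values, and the helpers `load_town_stats` and `attach_town_stats` are hypothetical and are not taken from the actual CSV file.

```python
import csv
import io

# Hypothetical town-level CSV with placeholder values (not real statistics).
TOWN_CSV = """Town,Population,Poverty Rate
Town A,1000,0.10
Town B,2000,0.20
"""

def load_town_stats(csv_text):
    """Return a dict mapping town name to its row of socio-economic figures."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Town"].strip(): row for row in reader}

def attach_town_stats(schools, town_stats):
    """Add population and poverty rate to each school; None when the town is missing."""
    merged = []
    for school in schools:
        stats = town_stats.get(school["Town Name"].strip())
        merged.append({
            **school,
            "Town Population": int(stats["Population"]) if stats else None,
            "Town Poverty Rate": float(stats["Poverty Rate"]) if stats else None,
        })
    return merged

# Placeholder schools; the second one has no matching town in the CSV.
schools = [
    {"School Name": "School 1", "Town Name": "Town A"},
    {"School Name": "School 2", "Town Name": "Town C"},
]
merged = attach_town_stats(schools, load_town_stats(TOWN_CSV))
print(merged)
```

Keeping unmatched towns as `None` rather than dropping them makes it easy to spot town names that need cross-referencing by hand.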


## Imports


In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from numpy.random import randint
# lxml must also be installed (it is a separate package from bs4); it is used as a parser later


The library requests will be used to access the webpages' data.

BeautifulSoup will be used for parsing the HTML files that are grabbed by requests.

Pandas will be used for organizing and analyzing the data.

## Collecting Data


### Functions and Boilerplate

In [22]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}

document = requests.get("https://www.eqao.com/report/?id=3838", headers = headers)
src = document.content
soup = BeautifulSoup(src, "html.parser")

The grades are embedded in bar-chart calls within \<script> tags, so the JavaScript has to be parsed manually, unfortunately.

In [23]:
def find_grades(source):
    """
    Finds respective academic and applied averages from the EQAO website.
    @param - source - a soup document to be examined
    @return - a list with applied in 0th position and academic in 1st position
    """
    all_scripts = source.find_all('script')
    fall_grades = []
    counter = 0

    # Keep the text of the first two bar-chart scripts found on the page.
    for script in all_scripts:
        if 'drawBarChart([["Year Range"' in script.text:
            counter += 1
            if counter == 1 or counter == 2:
                fall_grades.append(script.text)

    scraped_grades = []
    string_list = ""

    for grades in fall_grades:
        done = False
        for i in range(len(grades)):
            x = i
            # Once a 'P' is found, read characters starting 12 positions after it.
            if grades[i] == 'P':
                while not done:
                    if grades[x+12] == ',':
                        # A comma ends the current value.
                        scraped_grades.append(string_list)
                        string_list = ''
                        x += 1
                        continue
                    elif grades[x+12] == ']':
                        # A closing bracket ends the row of values.
                        scraped_grades.append(string_list)
                        string_list = ''
                        done = True
                        continue
                    string_list += grades[x+12]
                    x += 1
    # Positions 2 and 5 hold the applied and academic averages (see docstring).
    return [scraped_grades[2], scraped_grades[5]]


In [24]:
all_grades = find_grades(soup)
print(all_grades)

['0.69', '0.89']
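
An alternative to scanning character by character is to pull out the array literal passed to `drawBarChart` and parse it as JSON. This is only a sketch: the sample script text, the `"Percentage"` row label, and the helper name `parse_bar_chart` are assumptions for illustration, and the real page may format the call differently.

```python
import json
import re

# Hypothetical script text; the real EQAO page may format this call differently.
SAMPLE_SCRIPT = (
    'drawBarChart([["Year Range","Year 1","Year 2","Year 3"],'
    '["Percentage",0.5,0.6,0.69]], "chart_div");'
)

def parse_bar_chart(script_text):
    """Return the rows passed to drawBarChart as Python lists, or None if absent."""
    # Capture from the opening "[[" up to the first "]]" after it.
    match = re.search(r'drawBarChart\((\[\[.*?\]\])', script_text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

rows = parse_bar_chart(SAMPLE_SCRIPT)
latest = rows[1][-1]  # last value of the second row
print(latest)
```

JavaScript array literals are not always valid JSON (single-quoted strings, for example), in which case `json.loads` would fail and the manual scan may still be needed.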


The school size is stored within a \<div> tag. Other pieces of data share the same div class, so the specific div has to be selected by its index.

In [29]:
def find_pop(source):
    """
    Finds population of the school from the given source/school page
    @param - source - a soup
    @return - the population of the school"""
    elems = source.find("div", {"class": "basic-info"}).find_all("div", {"class": "flex-table row"})[5]
    pop = elems.find_all("div", {"class": "flex-row"})
    pop = pop[1].text
    return pop

In [30]:
school_size = find_pop(soup)
print(school_size)

808


### Collecting the Proper Data
Iterating over the ids in the aforementioned URL isn't as simple as it seems: not every school page has the information that is meant to be collected.

Certain schools, particularly the Adult Learning Centres (where people who haven't completed high school go to finish their credits), do not offer the Grade 9 EQAO assessment -- only the OSSLT (Ontario Secondary School Literacy Test).

With some testing, it became apparent that high schools only start to pop up around the 1300 mark, so the id iteration starts there.

In [None]:
def create_data():
    # find_school_name and find_postal_town are helpers not shown in this section.
    large_data = []
    for x in range(1300, 10000):
        document = requests.get("https://www.eqao.com/report/?id=" + str(x), headers=headers).content
        website = BeautifulSoup(document, "lxml")
        # Skip pages containing either of these marker strings.
        if "Either the requested page" in website.text or "I Like to Write" in website.text:
            continue
        print(x)  # progress indicator
        # Call each extractor once and reuse the result.
        postal_town = find_postal_town(website)
        grades = find_grades(website)
        large_data.append([
            find_school_name(website),
            find_pop(website),
            postal_town[0],
            postal_town[1],
            grades[0],
            grades[1],
        ])
    frame = pd.DataFrame(large_data, columns = ['School Name', 'School Size', 'Town Name', 'Postal Code', 'Applied EQAO Grades', 'Academic EQAO Grades'])
    frame.to_csv('EQAODataset.csv')
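
Because not every page carries every field, a lookup such as `scraped_grades[5]` can raise `IndexError` on a page with fewer values than expected, which would stop the whole run. A minimal sketch of one way to guard against that follows; the `safe_extract` helper and the stand-in extractor `fake_find_grades` are hypothetical and are not part of the original scraper.

```python
def safe_extract(extractor, page, default=None):
    """Call extractor(page); return default if the page lacks the expected data."""
    try:
        return extractor(page)
    except (IndexError, AttributeError):
        # IndexError: fewer items than expected.
        # AttributeError: a lookup returned None and was then used.
        return default

def fake_find_grades(values):
    """Stand-in for find_grades: expects at least six scraped values."""
    return [values[2], values[5]]

# Placeholder inputs standing in for scraped pages.
complete = ["0.5", "0.6", "0.69", "0.8", "0.85", "0.89"]
partial = ["0.5", "0.6", "0.69"]  # only three values available

good = safe_extract(fake_find_grades, complete)
missing = safe_extract(fake_find_grades, partial)
print(good, missing)
```

Rows that come back as `None` can then be skipped or logged, instead of being removed by hand afterwards.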


That completes the raw data collection.

### Data Cleaning
As the scraper ran, it unexpectedly picked up some OSSLT-only schools, as well as some schools that report only one of the academic or applied marks. These schools were removed with some simple Excel operations so that academic and applied results can be compared school by school.

The cleaned data should only contain schools that have both an academic and applied math class (most Ontario schools).
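
The same filtering could also be scripted rather than done by hand in a spreadsheet. The sketch below uses only the standard library and made-up rows: it keeps schools that have both scores and converts sizes such as `1,137` to integers. The helper names `parse_size` and `has_both_scores` and the sample rows are hypothetical.

```python
def parse_size(text):
    """Convert a size string such as '1,137' to an int."""
    return int(text.replace(",", ""))

def has_both_scores(row):
    """True when both the applied and academic scores are present and numeric."""
    try:
        float(row["Applied EQAO Grades"])
        float(row["Academic EQAO Grades"])
    except (TypeError, ValueError):
        return False
    return True

# Made-up rows for illustration; not real scraped values.
rows = [
    {"School Name": "School A", "School Size": "1,137",
     "Applied EQAO Grades": "0.5", "Academic EQAO Grades": "0.8"},
    {"School Name": "School B", "School Size": "366",
     "Applied EQAO Grades": "", "Academic EQAO Grades": "0.9"},
]

# Keep only schools with both scores, converting the size along the way.
cleaned = [
    {**row, "School Size": parse_size(row["School Size"])}
    for row in rows
    if has_both_scores(row)
]
print(cleaned)
```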

In [12]:
data = pd.read_csv(r'C:\Users\isaac\Documents\EQAO Data Analysis\EQAODataset(only matches).csv')
data.pop('id')
print(data)

                               School Name School Size      School Town  \
0                                  Ajax HS       1,137             Ajax   
1                      Anderson C &amp; VI         866           Whitby   
2                                 Brock HS         366       Cannington   
3                      Donald A. Wilson SS       1,217           Whitby   
4                             Dunbarton HS       1,465        Pickering   
..                                     ...         ...              ...   
470        École secondaire Roméo Dallaire         118           Barrie   
471                   Superior Heights CVS         800  SaultSte. Marie   
472  St. Michael Catholic Secondary School       1,194           Bolton   
473          David Suzuki Secondary School       1,621         Brampton   
474     Nottawasaga Pines Secondary School         735            Angus   

    Postal Code  Applied EQAO Achievement  Academic EQAO Achievement  
0       L1S 1P2             

## Data Analysis
This step computes the average difference between academic and applied EQAO achievement across schools.

In [17]:
data['Difference Academic Applied'] = data['Academic EQAO Achievement'] - data['Applied EQAO Achievement']
print(data['Difference Academic Applied'].mean(axis = 0))

0.3555999999999998
