# Webscraping GOSA Data

The goal of scraping the GOSA website is to retrieve the URLs needed to download the relevant datasets.

The URLs are located in the "DATA AVAILABLE FOR DOWNLOAD" column of the table on the GOSA Downloadable Data page.

Code adapted from https://towardsdatascience.com/web-scraping-scraping-table-data-1665b6b2271c

## Import packages

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

## Use BeautifulSoup to Scrape and Review HTML

In [2]:
# GOSA Downloadable Data site URL
url = 'https://gosa.georgia.gov/dashboards-data-report-card/downloadable-data'

In [3]:
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
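
A bare `requests.get(url).text` will wait indefinitely on a stalled connection and will happily return the body of an error page. The sketch below wraps the request with a timeout and an HTTP status check. The helper name `fetch_html`, the 30-second default, and the optional `session` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(page_url, timeout=30, session=None):
    """Hypothetical helper: GET a page and return its HTML text.

    Raises requests.HTTPError for 4xx/5xx responses instead of
    silently returning the error page.
    """
    http = session or requests  # a Session is optional; plain requests works too
    response = http.get(page_url, timeout=timeout)
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out here):
# html_content = fetch_html(url)
```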

In [4]:
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
#print(soup.prettify()) # print the parsed data of html

In [5]:
#print(soup.prettify())

Reviewing the HTML above shows that the table we need has the class "stacked-row-plus", and the columns we need are the first ("DATA CATEGORY") and the third ("DATA AVAILABLE FOR DOWNLOAD").

In [6]:
# The site has 1 table with the class "stacked-row-plus"
# The following line will generate a list of HTML content for each table
gdp = soup.find_all("table", attrs={"class": "stacked-row-plus"})
print("Number of tables on site: ",len(gdp))

Number of tables on site:  1
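
To illustrate how the class filter picks out the table, here is a minimal, offline sketch. The HTML string is a made-up stand-in for the real page (the `example.com` link and the second table are assumptions for illustration), and it uses the built-in `html.parser` so it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the page markup (not the real GOSA HTML)
sample_html = """
<table class="stacked-row-plus">
  <tr><th>DATA CATEGORY</th><th>DESCRIPTION</th><th>DATA AVAILABLE FOR DOWNLOAD</th></tr>
  <tr><td>Attendance</td><td>Sample description</td>
      <td><a href="https://example.com/attendance_2020.csv">2019-20</a></td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches tables whose class attribute includes the given name
matching_tables = sample_soup.find_all("table", class_="stacked-row-plus")
print("Matching tables:", len(matching_tables))

# select_one takes a CSS selector and returns the first match (or None)
first_table = sample_soup.select_one("table.stacked-row-plus")
first_header = first_table.find("th").get_text()
print(first_header)
```

Checking the count before indexing with `[0]` makes it easier to notice if the site layout changes and the class no longer matches anything.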


## Determine Headers for Single Table

In [7]:
# Scrape the first (and only) matching table, gdp[0]
table1 = gdp[0]
# Collect every table row; the first row holds the column names
body = table1.find_all("tr")
# Split the header row from the data rows
head = body[0] # 0th item is the header row
body_rows = body[1:] # All other items become the data rows

# Now iterate through the header HTML and build a list of clean headings

# Declare an empty list to hold the column names
headings = []
for item in head.find_all("th"): # loop through all th elements
    # convert the th element to text and strip trailing "\n" characters
    item = (item.text).rstrip("\n")
    # append the clean column name to headings
    headings.append(item)
print(headings)

['DATA CATEGORY', 'DESCRIPTION', 'DATA AVAILABLE FOR DOWNLOAD']
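
The same header cleanup can be written as a single list comprehension. This is a sketch on a made-up header row (the stray whitespace is added for illustration); `get_text(strip=True)` trims surrounding whitespace on both sides, which is slightly broader than the `rstrip("\n")` used above.

```python
from bs4 import BeautifulSoup

# Hypothetical header row with untidy whitespace (not the real GOSA HTML)
header_html = """
<table>
  <tr>
    <th>DATA CATEGORY
</th>
    <th>  DESCRIPTION </th>
    <th>DATA AVAILABLE FOR DOWNLOAD</th>
  </tr>
</table>
"""

header_row = BeautifulSoup(header_html, "html.parser").find("tr")

# One clean string per <th>, with leading/trailing whitespace removed
clean_headings = [th.get_text(strip=True) for th in header_row.find_all("th")]
print(clean_headings)
```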


## Create a Dataframe for Relevant Data

In [8]:
# Loop through all body rows

all_rows = [] # list of all body rows
for body_row in body_rows: # one row at a time
    # keep each <td> Tag (not just its text) so the links can be extracted later
    row = list(body_row.find_all("td"))
    all_rows.append(row)

In [9]:
# Create a dataframe using the rows and three column headers
df = pd.DataFrame(all_rows, columns=headings)

In [10]:
# Review the dataframe
df.head()

|   | DATA CATEGORY | DESCRIPTION | DATA AVAILABLE FOR DOWNLOAD |
|---|---|---|---|
| 0 | `[\n, [[ACT Scores (Highest)]], \n]` | `[\n, [ACT testing counts and average composite...` | `[\n, [ ], \n, [[2019-20], [], [2018-19], , []...` |
| 1 | `[[ACT Scores (Recent)]]` | `[\n, [ACT testing counts and average composite...` | `[[2019-20]]` |
| 2 | `[[Advanced Placement (AP) Scores]]` | `[\n, [Number of students tested, number of AP ...` | `[\n, [ ], \n, [[2019-20], [], [2018-19], , []...` |
| 3 | `[[Attendance]]` | `[\n, [Collected from the Student Record showin...` | `[\n, [ ], \n, [[2019-20], [], [2018-19], [], [...` |
| 4 | `[[Certified Personnel]]` | `[Certified Personnel data are compiled from in...` | `[\n, [ ], \n, [[2019-20], [], [2018-19], [], [...` |


## Reformat the Data Available for Download Column

In [11]:
# Check to see what is available in the data for download column
df['DATA AVAILABLE FOR DOWNLOAD'][0]

<td>
<p> </p>
<p><a href="https://download.gosa.ga.gov/2020/ACT_HIGHEST_2020_JUN_21_2021.csv">2019-20</a><br/><a href="https://download.gosa.ga.gov/2019/ACT_HIGHEST_2019_FEB_24_2020.csv">2018-19</a> <br/><a href="https://download.gosa.ga.gov/2018/ACT_HIGHEST_2018_FEB_24_2020.csv">2017-18</a><br/><a href="https://download.gosa.ga.gov/2017/ACT_HIGHEST_2017_FEB_24_2020.csv">2016-17</a><br/><a href="https://download.gosa.ga.gov/2016/ACT_HIGHEST_2016_FEB_24_2020.csv">2015-16</a><br/><a href="https://download.gosa.ga.gov/2015/ACT_HIGHEST_2015_FEB_24_2020.csv">2014-15</a><br/><a href="https://download.gosa.ga.gov/2014/ACT_HIGHEST_2014_FEB_24_2020.csv">2013-14</a><br/><a href="https://download.gosa.ga.gov/2013/ACT_HIGHEST_2013_FEB_24_2020.csv">2012-13</a><br/><a href="https://download.gosa.ga.gov/2012/ACT_HIGHEST_2012_FEB_24_2020.csv">2011-12</a><br/><a href="https://download.gosa.ga.gov/2011/ACT_HIGHEST_2011_FEB_24_2020.csv">2010-11</a></p>
<p> </p>
</td>

In [12]:
# Isolate each link and year
act_isolation = df['DATA AVAILABLE FOR DOWNLOAD'][0].find_all("a", href=True)

act_isolation

[<a href="https://download.gosa.ga.gov/2020/ACT_HIGHEST_2020_JUN_21_2021.csv">2019-20</a>,
 <a href="https://download.gosa.ga.gov/2019/ACT_HIGHEST_2019_FEB_24_2020.csv">2018-19</a>,
 <a href="https://download.gosa.ga.gov/2018/ACT_HIGHEST_2018_FEB_24_2020.csv">2017-18</a>,
 <a href="https://download.gosa.ga.gov/2017/ACT_HIGHEST_2017_FEB_24_2020.csv">2016-17</a>,
 <a href="https://download.gosa.ga.gov/2016/ACT_HIGHEST_2016_FEB_24_2020.csv">2015-16</a>,
 <a href="https://download.gosa.ga.gov/2015/ACT_HIGHEST_2015_FEB_24_2020.csv">2014-15</a>,
 <a href="https://download.gosa.ga.gov/2014/ACT_HIGHEST_2014_FEB_24_2020.csv">2013-14</a>,
 <a href="https://download.gosa.ga.gov/2013/ACT_HIGHEST_2013_FEB_24_2020.csv">2012-13</a>,
 <a href="https://download.gosa.ga.gov/2012/ACT_HIGHEST_2012_FEB_24_2020.csv">2011-12</a>,
 <a href="https://download.gosa.ga.gov/2011/ACT_HIGHEST_2011_FEB_24_2020.csv">2010-11</a>]
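
Each isolated anchor carries the year in its text and the download link in its `href`, so the pair can be collected into a dictionary without touching the dataframe. The sketch below uses two of the links shown above; the helper name `links_by_year` is a hypothetical name chosen for illustration.

```python
from bs4 import BeautifulSoup

# Two of the links from the cell shown above, reduced to a small sample
cell_html = """
<td>
<p><a href="https://download.gosa.ga.gov/2020/ACT_HIGHEST_2020_JUN_21_2021.csv">2019-20</a><br/>
<a href="https://download.gosa.ga.gov/2019/ACT_HIGHEST_2019_FEB_24_2020.csv">2018-19</a></p>
</td>
"""
cell = BeautifulSoup(cell_html, "html.parser").find("td")


def links_by_year(td):
    """Hypothetical helper: map each link's visible text to its href."""
    return {a.get_text(strip=True): a["href"] for a in td.find_all("a", href=True)}


year_to_url = links_by_year(cell)
for year, link in year_to_url.items():
    print(year, link)
```

Because the function takes the cell as an argument and returns a new dictionary, it can be tested on a single cell before being applied to every row.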

In [13]:
# Define a function that replaces one row's download cell with a {year: url} dictionary
def parse_year_url(df_row):
    column = 'DATA AVAILABLE FOR DOWNLOAD'

    # Every <a> tag with an href is one downloadable file
    isolation = df.at[df_row, column].find_all("a", href=True)

    # Map the link text (the year) to the link target (the URL)
    url_dict = {entry.text: entry['href'] for entry in isolation}

    # Write back with .at to avoid chained assignment on the global df
    df.at[df_row, column] = url_dict

In [14]:
# Loop through each row in the dataframe to make the Data Available for Download column into dictionaries
for row in range(len(df['DATA AVAILABLE FOR DOWNLOAD'])):
    parse_year_url(row)

In [15]:
# Test to confirm the rows worked
df['DATA AVAILABLE FOR DOWNLOAD'][35]

{'2016-17': 'https://gosa.georgia.gov/sites/gosa.georgia.gov/files/related_files/site_page/ELL_Deferred_2017_DEC_1st_2017.csv',
 '2015-16': 'https://gosa.georgia.gov/sites/gosa.georgia.gov/files/related_files/site_page/ELL_Deferred_2016_DEC_1st_2017.csv',
 '2014-15': 'https://gosa.georgia.gov/sites/gosa.georgia.gov/files/related_files/site_page/ELL_Deffered_2015_Mar292016.csv',
 '2013-14': 'https://gosa.georgia.gov/sites/gosa.georgia.gov/files/related_files/site_page/ELL_Deffered_2014_Jan_15th_2015.csv',
 '2012-13': 'https://gosa.georgia.gov/sites/gosa.georgia.gov/files/related_files/site_page/ELL_New_by_County_2013.csv',
 '2011-12': 'https://gosa.georgia.gov/sites/gosa.georgia.gov/files/related_files/site_page/ELL_New_by_County_2012.csv',
 '2010-11': 'https://download.gosa.ga.gov/2011/ELL_Deferred_2011_MAR_23_2020.csv'}

## Reformat the Data Category and Description Columns

In [16]:
# Define function to convert each cell in a column to text and strip '\n'
def strip_row(dataframe, col_name):
    # Assign the whole column at once to avoid chained assignment
    dataframe[col_name] = dataframe[col_name].apply(lambda cell: cell.text.strip('\n'))

In [17]:
# Call function for Data Category and Description columns
strip_row(df, 'DATA CATEGORY')
strip_row(df, 'DESCRIPTION')

In [18]:
df.shape

(54, 3)

## Melt the Data Frame for Years and URL Columns

In [19]:
# Convert dictionary into df columns 
df_year_url = df['DATA AVAILABLE FOR DOWNLOAD'].apply(pd.Series)
df_year_url.shape

(54, 18)

In [20]:
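# The frames share no key column, so add a constant temporary key.
# Note: merging on a constant key is a cross join (every row of df is paired with every row of df_year_url).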
df['tmp'] = 1
df_year_url['tmp'] = 1

# Merge
df_years = pd.merge(df, df_year_url, on=['tmp'])

# Drop temp column
df_years = df_years.drop('tmp', axis=1)
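
A merge on a constant key pairs every row on the left with every row on the right. If the intent is instead to keep each category next to its own year columns, an index-aligned join may be the better fit, since `df_year_url` was derived from `df` and shares its index. The toy frames below are assumptions made purely to show the difference in shape.

```python
import pandas as pd

# Toy stand-ins for df and df_year_url (made-up values)
left = pd.DataFrame({"category": ["A", "B", "C"]})
right = pd.DataFrame({"2019-20": ["url_a", None, "url_c"]})

# Constant-key merge: 3 x 3 = 9 rows, every category paired with every URL
cross = pd.merge(left.assign(tmp=1), right.assign(tmp=1), on="tmp").drop(columns="tmp")

# Index-aligned join: 3 rows, each category keeps only its own URL
aligned = left.join(right)

print(cross.shape)    # (9, 2)
print(aligned.shape)  # (3, 2)
```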


In [21]:
# Melt the year columns into long format (one row per category and year)
df_years_long = pd.melt(df_years, id_vars = ['DATA CATEGORY', 'DESCRIPTION', 'DATA AVAILABLE FOR DOWNLOAD'],
                       value_vars = ['2019-20', '2018-19', '2017-18', '2016-17', '2015-16', '2014-15',
                                     '2013-14', '2012-13', '2011-12', '2010-11', '2020', '2017', '2018',
                                     '2019', '\n2019-20\n', '2015-2016', '\n2018-19\n'],
                       var_name = 'Year', value_name = 'URL')
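
Hard-coding the year labels means the list has to be updated whenever the site adds a year or changes a label. One possible alternative, sketched here on a made-up two-row frame, is to treat every column that is not an identifier column as a year column. The frame contents and the `example.com` URLs are assumptions for illustration.

```python
import pandas as pd

# Toy wide frame with one untidy year label (made-up values)
wide = pd.DataFrame({
    "DATA CATEGORY": ["Attendance", "Enrollment"],
    "2019-20": ["https://example.com/att_2020.csv", None],
    "\n2018-19\n": [None, "https://example.com/enr_2019.csv"],
})

id_cols = ["DATA CATEGORY"]
# Everything that is not an identifier column is treated as a year column
year_cols = [col for col in wide.columns if col not in id_cols]

long_df = wide.melt(id_vars=id_cols, value_vars=year_cols,
                    var_name="Year", value_name="URL")

# Drop category/year combinations that have no file, then tidy the labels
long_df = long_df.dropna(subset=["URL"]).reset_index(drop=True)
long_df["Year"] = long_df["Year"].str.strip()
print(long_df)
```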

In [22]:
# Drop all rows with NaN
df_years_long = df_years_long.dropna().reset_index()

In [23]:
# Strip out all '\n' from the Year column
df_years_long['Year'] = df_years_long['Year'].str.strip('\n')


In [24]:
df_years_long

|   | index | DATA CATEGORY | DESCRIPTION | DATA AVAILABLE FOR DOWNLOAD | Year | URL |
|---|---|---|---|---|---|---|
| 0 | 0 | ACT Scores (Highest) | ACT testing counts and average composite and s... | `{'2019-20': 'https://download.gosa.ga.gov/2020...` | 2019-20 | `https://download.gosa.ga.gov/2020/ACT_HIGHEST_...` |
| 1 | 1 | ACT Scores (Highest) | ACT testing counts and average composite and s... | `{'2019-20': 'https://download.gosa.ga.gov/2020...` | 2019-20 | `https://download.gosa.ga.gov/2020/ACT_RECENT_2...` |
| 2 | 2 | ACT Scores (Highest) | ACT testing counts and average composite and s... | `{'2019-20': 'https://download.gosa.ga.gov/2020...` | 2019-20 | `https://download.gosa.ga.gov/2020/AP_2020_JUN_...` |
| 3 | 3 | ACT Scores (Highest) | ACT testing counts and average composite and s... | `{'2019-20': 'https://download.gosa.ga.gov/2020...` | 2019-20 | `https://download.gosa.ga.gov/2020/Attendance_2...` |
| 4 | 4 | ACT Scores (Highest) | ACT testing counts and average composite and s... | `{'2019-20': 'https://download.gosa.ga.gov/2020...` | 2019-20 | `https://download.gosa.ga.gov/2020/Certified_Pe...` |
| ... | ... | ... | ... | ... | ... | ... |
| 17059 | 49452 | Georgia Alternate Assessment (GAA) (Retired) | Number of students tested as well as totals an... | `{'2017-18': 'https://download.gosa.ga.gov/2018...` | 2018-19 | `https://gosa.georgia.gov/document/document/201...` |
| 17060 | 49505 | Georgia High School Graduation Test (GHSGT) (R... | Number of students tested as well as totals an... | `{'2010-11': 'https://gosa.georgia.gov/sites/go...` | 2018-19 | `https://gosa.georgia.gov/document/document/201...` |
| 17061 | 49506 | Georgia High School Graduation Test (GHSGT) (R... | Number of students tested as well as totals an... | `{'2010-11': 'https://gosa.georgia.gov/sites/go...` | 2018-19 | `https://gosa.georgia.gov/document/document/201...` |
| 17062 | 49559 | Georgia High School Writing Test (GHSWT) (Reti... | Number of students tested as well as totals an... | `{'2014-15': 'https://gosa.georgia.gov/sites/go...` | 2018-19 | `https://gosa.georgia.gov/document/document/201...` |