# Historical Tuition and Fees from 2010 to 2020

The budget information page at http://otcads.umd.edu/bfa/budgetinfo3.htm links to a historical tuition and fees page (http://otcads.umd.edu/bfa/FY20%20Working%20Budget/web/TuitHist%202010%20to%202020%20(Excel).htm).  
That page lists standard tuition rates and mandatory fees for each student type from 2010 to 2020.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle

## Scraping

In [2]:
url = 'http://otcads.umd.edu/bfa/FY20%20Working%20Budget/web/TuitHist%202010%20to%202020%20(Excel)_files/sheet001.htm'
r = requests.get(url)
r.raise_for_status()  # Stop early if the page could not be fetched.

In [3]:
soup = BeautifulSoup(r.text,'html.parser')

In [4]:
# Extracting the only table there is.
len(soup.find_all('table'))

1

In [5]:
table = soup.find('table')

In [6]:
str_table = []
for e in table.find_all('td'):
    if(e.text.strip()):
        str_table.append(e.text.strip())

To find the parts of the parsed text that need to be copied into the dataframe, keywords such as 'Undergrad Resident' are used to locate the starting and ending indexes. Printing each element of `str_table` alongside its index shows where these keywords fall.
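
The keyword lookup can be illustrated on a small stand-in list. This is a minimal sketch, not the real page contents: `cells` is a hypothetical toy version of `str_table` with only three yearly values per row (the real table has eleven), and the comma-formatted values are assumed from the cleanup step later in the notebook.

```python
# Toy stand-in for str_table (hypothetical): each label is followed by its
# yearly values. Only three years are shown here; the real table has eleven.
cells = [
    'Standard Tuition Rates',
    'Undergrad Resident', '6,566', '6,763', '6,966',
    'Undergrad Non-Res', '22,503', '23,178', '24,337',
    'Mandatory Fees',
    'Undergrad FT', '1,487', '1,653', '1,689',
]

# Print every cell with its position to see where each block starts.
for position, text in enumerate(cells):
    print(position, text)

# list.index returns the position of the first exact match.
start = cells.index('Undergrad Resident')
end = cells.index('Mandatory Fees')
print(start, end)
```

Because `list.index` matches the whole string exactly, a keyword such as 'Undergrad Resident' will not be confused with a longer label that merely contains it.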

## Parsing

In [7]:
years = list(map(str, range(2010, 2021)))
rows = []

# Standard tuition rates: each record is a label followed by 11 yearly values.
i = str_table.index('Undergrad Resident')
while i < str_table.index('Mandatory Fees'):
    descr = str_table[i]
    skip = 0
    # Per-credit-hour rows carry one extra trailing cell; fold it into the label.
    if str_table[i + 12] == '(fee per credit hour)':
        descr = descr + ' (fee per credit hour)'
        skip = 1

    row = dict(zip(years, str_table[i+1:i+12]))
    row['Student Type'] = descr
    row['Fee Type (Total for Fall and Spring)'] = 'Standard Tuition Rates'
    rows.append(row)
    i = i + 12 + skip

# Mandatory fees: same layout, running to the end of the parsed cells.
i = str_table.index('Undergrad FT')
while i <= len(str_table) - 2:
    row = dict(zip(years, str_table[i+1:i+12]))
    row['Student Type'] = str_table[i]
    row['Fee Type (Total for Fall and Spring)'] = 'Mandatory Fees'
    rows.append(row)
    i = i + 12

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows.
tuition_fees = pd.DataFrame(rows, columns=['Fee Type (Total for Fall and Spring)', 'Student Type'] + years)

## Finalizing & Fine-tuning

The year columns still hold strings with thousands separators, so the commas are stripped and the values are cast to float.

In [8]:
for year in map(str, range(2010, 2021)):
    # Remove thousands separators, then convert the column to float.
    tuition_fees[year] = tuition_fees[year].str.replace(',', '', regex=False).astype(float)
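
To make the conversion above concrete, here is a small plain-Python sketch of what happens to each cell. The helper name `to_number` and the sample strings are illustrative assumptions, not part of the notebook.

```python
def to_number(cell: str) -> float:
    """Strip thousands separators and convert to float (hypothetical helper)."""
    return float(cell.replace(',', ''))

# Sample cells in the comma-formatted style assumed for the scraped table.
raw = ['6,566', '273', '22,503']
converted = [to_number(cell) for cell in raw]
print(converted)  # [6566.0, 273.0, 22503.0]
```

Values without a comma, such as per-credit-hour rates, pass through unchanged apart from the cast.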

In [9]:
tuition_fees

,Fee Type (Total for Fall and Spring),Student Type,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Standard Tuition Rates,Undergrad Resident,6566.0,6763.0,6966.0,7175.0,7390.0,7764.0,8152.0,8315.0,8481.0,8651.0,8824.0
1,Standard Tuition Rates,Undergrad Non-Res,22503.0,23178.0,24337.0,25554.0,26576.0,27905.0,29300.0,30179.0,31688.0,33272.0,34936.0
2,Standard Tuition Rates,Undergrad PT Resident (fee per credit hour),273.0,282.0,290.0,299.0,308.0,324.0,340.0,346.0,353.0,360.0,367.0
3,Standard Tuition Rates,Undergrad PT Non-Res (fee per credit hour),938.0,966.0,1014.0,1065.0,1108.0,1163.0,1221.0,1258.0,1321.0,1387.0,1456.0
4,Standard Tuition Rates,Graduate Resident (fee per credit hour),471.0,500.0,525.0,551.0,573.0,602.0,632.0,651.0,683.0,717.0,731.0
5,Standard Tuition Rates,Graduate Non-Res (fee per credit hour),1016.0,1077.0,1131.0,1188.0,1236.0,1298.0,1363.0,1404.0,1474.0,1548.0,1625.0
6,Mandatory Fees,Undergrad FT,1487.0,1653.0,1689.0,1733.0,1771.0,1815.0,1844.0,1866.0,1918.0,1944.0,1955.0
7,Mandatory Fees,Undergrad PT,678.0,761.0,779.0,799.0,818.0,840.0,855.0,866.0,893.0,906.0,910.0
8,Mandatory Fees,Graduate FT,1188.0,1351.0,1383.0,1413.0,1446.0,1490.0,1521.0,1538.0,1590.0,1620.0,1635.0
9,Mandatory Fees,Graduate PT,675.0,757.0,773.0,788.0,806.0,829.0,846.0,855.0,881.0,898.0,902.0


In [10]:
tuition_fees.to_pickle('df/tuition_fees')