# Phase II: Data Curation, Exploratory Analysis and Plotting
## Project Title Here

- Team: John Rotondo, Spring Yan, Anne Hu, Evan Li

## Project Goal:

We are using the College Scorecard API to gather data on various colleges. The program we are designing will allow students to input their ACT or SAT scores along with their major of interest and any other needed information. Based on this input, the program will generate a customized list of colleges categorized as reach schools, target schools, and safety schools. These categories are based on the student’s academic profile and the best fit for the chosen major, helping students make informed decisions about which colleges to apply to.
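
As a rough illustration of the categorization step, the sketch below compares a student's SAT score with a school's average SAT score. The function name `categorize_school` and the 100-point `margin` are hypothetical placeholders, not the project's chosen rule; the actual method, including how the chosen major factors in, is still to be decided.

```python
import math


def categorize_school(student_sat, school_avg_sat, margin=100):
    """Label a school as 'reach', 'target', or 'safety' for one student.

    Assumption: `margin` (in SAT points) is an illustrative cutoff only.
    Returns None when the school has no reported average (None or NaN).
    """
    if school_avg_sat is None or math.isnan(school_avg_sat):
        return None
    gap = student_sat - school_avg_sat
    if gap < -margin:
        return 'reach'
    if gap > margin:
        return 'safety'
    return 'target'


# Example using SAT averages that appear in the school table in this notebook
schools = {
    'Alabama A & M University': 920.0,
    'The University of Alabama': 1304.0,
    'Amridge University': None,
}
labels = {name: categorize_school(1150, avg) for name, avg in schools.items()}
print(labels)
```

A real version would probably combine SAT, ACT, and admission rate rather than rely on a single score gap.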

## Pipeline Overview:

Explain how we would scrape here

### Pipeline:

#### 0. Imports:

In [1]:
import requests
import os
import random
import time
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#### 1. Query Generation:

#### 2. Selenium Crawler:

In [2]:
# Uses the selenium library
from selenium import webdriver

def open_browser():
    '''
    Opens a new automated browser window with the usual signs of browser automation disabled
    '''
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    
    # remove all signs of this being an automated browser
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)

    # open the browser with the new options
    driver = webdriver.Chrome(options=options)
    return driver

In [3]:
def load_website(driver, attempts=5):
    ...

#### 3. Webscraping:

In [22]:
def find_school(school):
    school_info = []
    
    # This is the URL and the key used to get access to the API.
    url = 'https://api.data.gov/ed/collegescorecard/v1/schools'
    key = 'eKqjIMsTr8NFbXbKmGecMsA1oebkdIPd3TlGeEGY'
    
    # Parameters for the API request
    api_req = {
        'api_key': key,
        'school.name': school,
        'fields': 'id,school.name,school.city,latest.admissions.sat_scores.average.overall,latest.admissions.act_scores.midpoint.cumulative'
    }

    # Query the College Scorecard API and parse the JSON response
    response = requests.get(url, params=api_req, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors (e.g., an invalid key)

    data = response.json()
    
    # Append one record per school in the results
    for x in data.get('results', []):
        school_info.append({
            'School ID': x.get('id'),
            'School Name': x.get('school.name'),
            'City': x.get('school.city'),
            'SAT Average': x.get('latest.admissions.sat_scores.average.overall'),
            'ACT Average': x.get('latest.admissions.act_scores.midpoint.cumulative')
        })

    # Convert the list of records to a DataFrame
    df = pd.DataFrame(school_info)
    return df

df = find_school('Harvard')
print(df)

   School ID         School Name       City  SAT Average  ACT Average
0     166027  Harvard University  Cambridge         1553           35


In [32]:
# ----- To get school dataframe
key = 'eKqjIMsTr8NFbXbKmGecMsA1oebkdIPd3TlGeEGY'
URL = f'https://api.data.gov/ed/collegescorecard/v1/schools?api_key={key}'

In [150]:
response = requests.get(URL)
data = response.json()

# Note: these values look like per-major flags (only 0 or 1 in the output below), not enrollment counts
# Open question: how do we put this into a single DataFrame if each school has a different list of majors?
data['results'][0]['latest']['academics']['program']['bachelors']  # per-major values for the first school

{'legal': 0,
 'health': 0,
 'english': 1,
 'history': 0,
 'library': 0,
 'computer': 1,
 'language': 0,
 'military': 0,
 'education': 1,
 'resources': 1,
 'biological': 1,
 'humanities': 1,
 'psychology': 1,
 'agriculture': 1,
 'engineering': 1,
 'mathematics': 1,
 'architecture': 1,
 'construction': 0,
 'communication': 0,
 'social_science': 1,
 'transportation': 0,
 'multidiscipline': 0,
 'physical_science': 1,
 'personal_culinary': 0,
 'visual_performing': 1,
 'business_marketing': 1,
 'science_technology': 0,
 'philosophy_religious': 0,
 'precision_production': 0,
 'engineering_technology': 1,
 'ethnic_cultural_gender': 0,
 'family_consumer_science': 1,
 'parks_recreation_fitness': 1,
 'security_law_enforcement': 1,
 'communications_technology': 1,
 'mechanic_repair_technology': 0,
 'theology_religious_vocation': 0,
 'public_administration_social_service': 1}
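
One possible answer to the question in the cell above: build one flat record per school and let any major a school does not list default to 0, so every row ends up with the same columns. This is only a sketch. `majors_rows` is a hypothetical helper, and it assumes each result carries a `school.name` entry plus the nested `latest` → `academics` → `program` → `bachelors` layout used in this notebook.

```python
def majors_rows(results):
    """Build one flat record per school with a column for every major seen.

    Assumes the nested response layout used in this notebook. A major that
    is missing from a school's record is filled with 0.
    """
    per_school = []
    all_majors = set()
    for result in results:
        majors = (result.get('latest', {}).get('academics', {})
                  .get('program', {}).get('bachelors', {})) or {}
        per_school.append((result.get('school', {}).get('name'), majors))
        all_majors.update(majors)

    columns = sorted(all_majors)
    rows = []
    for name, majors in per_school:
        row = {'school': name}
        row.update({major: majors.get(major, 0) for major in columns})
        rows.append(row)
    return rows


# Tiny made-up input: the two schools report different sets of majors
sample = [
    {'school': {'name': 'School A'},
     'latest': {'academics': {'program': {'bachelors': {'english': 1, 'computer': 1}}}}},
    {'school': {'name': 'School B'},
     'latest': {'academics': {'program': {'bachelors': {'english': 0, 'history': 1}}}}},
]
rows = majors_rows(sample)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would then give one row per school and one column per major. Calling `pd.DataFrame` directly on a list of dicts with differing keys also works, but it fills the gaps with NaN instead of 0.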

In [54]:
# A financial aid component could be interesting (e.g., FAFSA applications received, tuition)
# First college in the results: Alabama A&M University
data['results'][0]['latest']['admissions']['admission_rate']['overall']  # admission rate
data['results'][0]['latest']['admissions']['sat_scores']['average']['overall']  # average SAT score
data['results'][0]['latest']['admissions']['act_scores']['midpoint']['cumulative']  # midpoint ACT score (only this last value is displayed)

18

In [56]:
len(data['results'])

20

In [126]:
from collections import defaultdict

def get_school_data(response_data):
    # TODO: this only covers the first 20 schools (per-page limit); how do we get the rest?
    """Processes API response data and stores school names and other relevant data (e.g., average SAT and ACT scores).

    Params:
    - response_data = API JSON response
    Returns: a dictionary of lists containing school names and the related admissions data
    """
    data_dct = defaultdict(list)

    for result in response_data['results']:
        admissions = result['latest']['admissions']
        data_dct['school'].append(result['school'].get('name'))
        data_dct['admin_rate'].append(admissions['admission_rate'].get('overall'))
        data_dct['avg_sat'].append(admissions['sat_scores']['average'].get('overall'))
        data_dct['avg_act'].append(admissions['act_scores']['midpoint'].get('cumulative'))

    return data_dct

In [128]:
len(data['results']) # -------- default shows 20 schools per page 

20
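
The 20-school limit comes from the API's paging: each response is a single page. To the best of our knowledge, the College Scorecard API accepts `page` (zero-based) and `per_page` query parameters and reports the total number of matches under `metadata`; the exact `per_page` cap should be checked against the API documentation. The sketch below loops over pages until everything is collected. `collect_all_results` and `scorecard_fetcher` are hypothetical helper names, and the page fetcher is passed in as a function so the loop can be tried without network access.

```python
def collect_all_results(fetch_page, per_page=100, max_pages=None):
    """Gather the `results` lists from every page of a paged API.

    `fetch_page(page, per_page)` must return the parsed JSON for one page.
    Assumption: each response has `results` and, optionally, `metadata.total`.
    """
    results = []
    page = 0
    while max_pages is None or page < max_pages:
        payload = fetch_page(page, per_page)
        batch = payload.get('results', [])
        results.extend(batch)
        total = payload.get('metadata', {}).get('total')
        # Stop on an empty page or once every reported match is collected
        if not batch or (total is not None and len(results) >= total):
            break
        page += 1
    return results


def scorecard_fetcher(api_key, fields=None):
    """Return a `fetch_page` function backed by the live API (not run here)."""
    import requests  # imported lazily so the paging loop works without it

    url = 'https://api.data.gov/ed/collegescorecard/v1/schools'

    def fetch_page(page, per_page):
        params = {'api_key': api_key, 'page': page, 'per_page': per_page}
        if fields is not None:
            params['fields'] = fields
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        return response.json()

    return fetch_page


# Offline check with a fake API that serves 45 records in pages
def fake_fetch(page, per_page):
    records = list(range(45))
    start = page * per_page
    return {'metadata': {'total': len(records)},
            'results': records[start:start + per_page]}

print(len(collect_all_results(fake_fetch, per_page=20)))
```

With a real key, something like `get_school_data({'results': collect_all_results(scorecard_fetcher(key))})` should cover every school. Setting `max_pages` is a sensible guard against rate limits while experimenting.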

In [130]:
data_dct = get_school_data(data)
school_df = pd.DataFrame(data_dct)
school_df

                                school  admin_rate  avg_sat  avg_act
0             Alabama A & M University       0.684    920.0     18.0
1  University of Alabama at Birmingham      0.8668   1291.0     27.0
2                   Amridge University         NaN      NaN      NaN
3  University of Alabama in Huntsville       0.781   1259.0     28.0
4             Alabama State University       0.966    963.0     18.0
5            The University of Alabama      0.8006   1304.0     26.0
6    Central Alabama Community College         NaN      NaN      NaN
7              Athens State University         NaN      NaN      NaN
8      Auburn University at Montgomery      0.9223   1051.0     21.0
9                    Auburn University      0.4374   1292.0      NaN


#### 4. Cleaning the Data:

## Plotting:

In [5]:
# For this, we can plot the student population of each school on the recommended list

## Analysis/ML Plan:

Plan here