# Phase II: Data Curation, Exploratory Analysis and Plotting
## Project Title Here

- Team: John Rotondo, Spring Yan, Anne Hu, Evan Li

## Project Goal:

We use the College Scorecard API to gather data on colleges. Our program will let students enter their ACT or SAT scores, their major of interest, and any other required information. From this input, it will generate a customized list of colleges grouped into reach, target, and safety schools. The grouping is based on the student’s academic profile and how well each school fits the chosen major, helping students make informed decisions about where to apply.
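
As a rough illustration of the grouping step, the sketch below labels a school by comparing its average SAT score with the student's score. The function name `categorize_school` and the 100-point `margin` are hypothetical placeholders, not decisions the project has made; the SAT averages are the ones shown in the API output later in this notebook.

```python
def categorize_school(student_sat, school_sat_avg, margin=100):
    """Label a school as 'reach', 'target', or 'safety' for one student.

    `margin` is a hypothetical cutoff in SAT points, chosen only for illustration.
    """
    # Schools with no reported SAT average cannot be placed in a category
    if school_sat_avg is None:
        return "unknown"

    gap = school_sat_avg - student_sat
    if gap > margin:
        return "reach"    # school average is well above the student's score
    if gap < -margin:
        return "safety"   # school average is well below the student's score
    return "target"


# SAT averages taken from the outputs shown in this notebook
schools = {
    "Harvard University": 1553,
    "The University of Alabama": 1304,
    "Troy University": 1050,
    "Grand Canyon University": None,
}
labels = {name: categorize_school(1300, sat) for name, sat in schools.items()}
print(labels)
```

A fuller version would also weigh ACT scores and the share of students in the chosen major, which this sketch ignores.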

## Pipeline Overview:

Explain how we would scrape here

### Pipeline:

#### 0. Imports:

In [1]:
import requests
import os
import random
import time
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#### 1. Query Generation:

#### 2. Selenium Crawler:

In [2]:
# Uses the selenium library
from selenium import webdriver

def open_browser():
    '''
    Opens a new automated browser window with the usual telltale signs of browser automation disabled
    '''
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    
    # remove all signs of this being an automated browser
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)

    # open the browser with the new options
    driver = webdriver.Chrome(options=options)
    return driver

In [3]:
def load_website(driver, attempts = 5):
    ...

#### 3. Webscraping:

In [2]:
# this is the API key for college scorecard
key = 'eKqjIMsTr8NFbXbKmGecMsA1oebkdIPd3TlGeEGY'

# this is the URL and the key used to get access to the API.
url = 'https://api.data.gov/ed/collegescorecard/v1/schools'

In [130]:
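def find_school(school, url, key):
    """
    Searches the College Scorecard API for schools matching a name.

    Args:
        school (str): school name (or part of a name) to search for
        url (str): base URL of the College Scorecard schools endpoint
        key (str): API key

    Returns:
        df (pd.DataFrame): one row per matching school with its ID, name,
            city, average SAT score, and midpoint ACT score
    """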
    school_info = []
    
    # Parameters for the API request
    api_req = {
        'api_key': key,
        'school.name': school,
        'fields': 'id,school.name,school.city,latest.admissions.sat_scores.average.overall,latest.admissions.act_scores.midpoint.cumulative'
    }

    # Request the data from the College Scorecard API and parse the JSON response
    response = requests.get(url, params=api_req)
    response.raise_for_status()  # stop early on HTTP errors such as an invalid key

    data = response.json()
    
    # Appending data to list for a school in the results
    for x in data.get('results', []):
        school_info.append({
            'School ID': x.get('id'),
            'School Name': x.get('school.name'),
            'City': x.get('school.city'),
            'SAT Average': x.get('latest.admissions.sat_scores.average.overall'),
            'ACT Average': x.get('latest.admissions.act_scores.midpoint.cumulative')
        })

    # Turn the list to a DataFrame
    df = pd.DataFrame(school_info)
    return df

df = find_school('Harvard', url, key)
print(df)

   School ID         School Name       City  SAT Average  ACT Average
0     166027  Harvard University  Cambridge         1553           35


In [36]:
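def filter_schools(url, key, size_range='10000..'):
    """
    Returns a DataFrame of schools whose student size falls within size_range.

    size_range uses the API's range syntax, e.g. '10000..' means 10,000 or more.
    Only the first page of results is returned.
    """
    school_info = []

    # Request and filter on the same data year ('latest') for student size
    fields = ['id',
              'school.name',
              'latest.student.size',
              'school.city',
              'latest.admissions.sat_scores.average.overall',
              'latest.admissions.act_scores.midpoint.cumulative',]
    
    # Parameters for the API request
    api_req = {
        'api_key': key,
        'fields': ','.join(fields),
        'latest.student.size__range': size_range
    }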

    # Gets the data from the College Scorecard API and turns it into json format
    response = requests.get(url, params=api_req)

    data = response.json()
    
    # Appending data to list for a school in the results
    for x in data.get('results', []):
        school_dict = {
            'School ID': x.get('id'),
            'School Name': x.get('school.name'),
            'City': x.get('school.city'),
            'SAT Average': x.get('latest.admissions.sat_scores.average.overall'),
            'ACT Average': x.get('latest.admissions.act_scores.midpoint.cumulative')
        }
        
        # Append this school's record to the school_info list
        school_info.append(school_dict)

    # Turn the list to a DataFrame
    df = pd.DataFrame(school_info)
    
    return df

In [37]:
school_df = filter_schools(url, key)
school_df  # only 20 schools are returned because of the API's page limit

|   | School ID | School Name | City | SAT Average | ACT Average |
|---|-----------|-------------|------|-------------|-------------|
| 0 | 100663 | University of Alabama at Birmingham | Birmingham | 1291.0 | 27.0 |
| 1 | 100751 | The University of Alabama | Tuscaloosa | 1304.0 | 26.0 |
| 2 | 100858 | Auburn University | Auburn | 1292.0 | NaN |
| 3 | 102368 | Troy University | Troy | 1050.0 | 21.0 |
| 4 | 104151 | Arizona State University Campus Immersion | Tempe | NaN | NaN |
| 5 | 104179 | University of Arizona | Tucson | 1229.0 | 25.0 |
| 6 | 104717 | Grand Canyon University | Phoenix | NaN | NaN |
| 7 | 105154 | Mesa Community College | Mesa | NaN | NaN |
| 8 | 105330 | Northern Arizona University | Flagstaff | NaN | NaN |
| 9 | 105525 | Pima Community College | Tucson | NaN | NaN |


In [18]:
school_name = "Stanford University"
stanford_url = f"https://api.data.gov/ed/collegescorecard/v1/schools?api_key={key}&school.name={school_name}"

s_response = requests.get(stanford_url)
s_data = s_response.json()

# TODO: decide whether to use program.bachelors (bool?) or program_percentage
program_dct = s_data['results'][0]['latest']['academics']['program_percentage']
program_df = pd.DataFrame(program_dct, index=[0])

In [19]:
program_df

|   | legal | health | english | history | library | computer | language | military | education | resources | ... | precision_production | engineering_technology | ethnic_cultural_gender | family_consumer_science | parks_recreation_fitness | security_law_enforcement | communications_technology | mechanic_repair_technology | theology_religious_vocation | public_administration_social_service |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.0169 | 0.0156 | 0 | 0.1704 | 0.0187 | 0 | 0 | 0 | ... | 0 | 0.0468 | 0.0287 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0094 |


In [43]:
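# Use a separate name so the base `url` passed to find_school and filter_schools is not overwritten
all_schools_url = f"https://api.data.gov/ed/collegescorecard/v1/schools?api_key={key}"
response = requests.get(all_schools_url)
data = response.json()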

# data['results'][i]['latest']['academics']['program_percentage']

{'legal': 0,
 'health': 0,
 'english': 0.0117,
 'history': 0,
 'library': 0,
 'computer': 0.0802,
 'language': 0,
 'military': 0,
 'education': 0.0391,
 'resources': 0.0157,
 'biological': 0.1644,
 'humanities': 0.0176,
 'psychology': 0.0744,
 'agriculture': 0.0333,
 'engineering': 0.1076,
 'mathematics': 0.002,
 'architecture': 0.0059,
 'construction': 0,
 'communication': 0,
 'social_science': 0.0313,
 'transportation': 0,
 'multidiscipline': 0,
 'physical_science': 0.0117,
 'personal_culinary': 0,
 'visual_performing': 0.0137,
 'business_marketing': 0.1526,
 'science_technology': 0,
 'philosophy_religious': 0,
 'precision_production': 0,
 'engineering_technology': 0.0157,
 'ethnic_cultural_gender': 0,
 'family_consumer_science': 0.0235,
 'parks_recreation_fitness': 0.0333,
 'security_law_enforcement': 0.09,
 'communications_technology': 0.0333,
 'mechanic_repair_technology': 0,
 'theology_religious_vocation': 0,
 'public_administration_social_service': 0.0431}
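
Since students will enter a major of interest, one way to use a `program_percentage` dictionary like the one above is to rank its programs by share of degrees. The helper `top_programs` below is a hypothetical sketch, not part of the current pipeline; the example values are a subset of the output above.

```python
def top_programs(program_pct, n=5):
    """Return the n programs with the largest share, skipping programs with a share of 0."""
    ranked = sorted(program_pct.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, share) for name, share in ranked[:n] if share > 0]


# Subset of the program_percentage output shown above
example_pct = {
    'legal': 0,
    'computer': 0.0802,
    'biological': 0.1644,
    'engineering': 0.1076,
    'business_marketing': 0.1526,
    'security_law_enforcement': 0.09,
}
print(top_programs(example_pct, n=3))
```

The same ranking could later be used to check whether a school's share in the student's chosen major clears some minimum threshold.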

### To-Dos
- Figure out a workaround for the page limit
- Find additional helpful parameters to add to the DataFrame
- Cleaning: decide how to handle NaN values
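
One possible workaround for the page limit is to request pages in a loop until every result has been collected. The sketch below assumes the API accepts `page` (zero-indexed) and `per_page` parameters and reports a `metadata.total` count in each response; the helper names `fetch_all_results` and `make_scorecard_fetcher` are hypothetical. The loop takes the page-fetching function as an argument so it can be tried without network access.

```python
def fetch_all_results(fetch_page, per_page=100, max_pages=1000):
    """Collect results across pages.

    fetch_page(page, per_page) must return the decoded JSON dict for one page.
    max_pages is a safety cap so a misbehaving API cannot loop forever.
    """
    results = []
    for page in range(max_pages):
        data = fetch_page(page, per_page)
        batch = data.get('results', [])
        results.extend(batch)

        # Stop on an empty page or once the reported total has been reached
        total = data.get('metadata', {}).get('total')
        if not batch or (total is not None and len(results) >= total):
            break
    return results


def make_scorecard_fetcher(url, params):
    """Build a fetch_page function for the College Scorecard endpoint (assumed parameters)."""
    import requests  # imported here so the loop above can be tested offline

    def fetch_page(page, per_page):
        page_params = dict(params, page=page, per_page=per_page)
        response = requests.get(url, params=page_params, timeout=30)
        response.raise_for_status()
        return response.json()

    return fetch_page


# Usage (not run here, needs the url, key, and field list defined above):
# fetcher = make_scorecard_fetcher(url, {'api_key': key, 'fields': 'id,school.name'})
# all_schools = fetch_all_results(fetcher)
```

Adding a short `time.sleep` between pages may be worth considering if the API rate-limits requests.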


#### 4. Cleaning the Data:
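
The cleaning approach is still undecided. As one option, the sketch below converts the score columns to numbers and drops only the schools that report neither an SAT nor an ACT value, so schools with one of the two scores are kept. The function name `clean_scores` is hypothetical; the column names match the DataFrames built above, and the example rows come from the `filter_schools` output.

```python
import pandas as pd


def clean_scores(df, score_cols=('SAT Average', 'ACT Average')):
    """Return a copy of df without rows that are missing every score column."""
    cols = list(score_cols)
    cleaned = df.copy()

    # Coerce anything non-numeric to NaN so the columns are consistently numeric
    cleaned[cols] = cleaned[cols].apply(pd.to_numeric, errors='coerce')

    # Drop a school only when all of its score columns are NaN
    cleaned = cleaned.dropna(subset=cols, how='all')
    return cleaned.reset_index(drop=True)


# Rows taken from the filter_schools output above
example_df = pd.DataFrame([
    {'School Name': 'Auburn University', 'SAT Average': 1292.0, 'ACT Average': None},
    {'School Name': 'Arizona State University Campus Immersion', 'SAT Average': None, 'ACT Average': None},
    {'School Name': 'Troy University', 'SAT Average': 1050.0, 'ACT Average': 21.0},
])
cleaned_df = clean_scores(example_df)
print(cleaned_df)
```

Dropping rows is only one choice; keeping them and flagging the missing scores would let open-admission schools still appear in a recommendation list.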

## Plotting:

In [5]:
# for this we can plot the student population for each school on the recommended list
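
A minimal sketch of that plot, assuming the recommended list is a DataFrame with a `School Name` column and a `Student Size` column. The `Student Size` column does not exist in the current DataFrames and would have to be added from the API; the function name and the enrollment numbers below are placeholders, not real data.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_student_population(df, name_col='School Name', size_col='Student Size'):
    """Draw a horizontal bar chart of student population per school and return the Axes."""
    ordered = df.sort_values(size_col)  # smallest at the bottom, largest at the top

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(ordered[name_col], ordered[size_col])
    ax.set_xlabel('Student population')
    ax.set_title('Student population of recommended schools')
    fig.tight_layout()
    return ax


# Placeholder schools and sizes, for illustration only
example_list = pd.DataFrame({
    'School Name': ['School A', 'School B', 'School C'],
    'Student Size': [12000, 30000, 18500],
})
ax = plot_student_population(example_list)
# plt.show()  # uncomment when running interactively
```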

## Analysis/ML Plan:

Plan here