# Perkins Core Indicator Report Scraper

## Install Required Packages

Before running the notebook, install the required Python packages by running the next cell.

If you're using a virtual environment, activate it first. Once the packages are installed, proceed to the setup instructions below.

In [None]:
%pip install tqdm
%pip install selenium
%pip install bs4
%pip install requests
%pip install pandas
%pip install openpyxl
%pip install pyyaml

## 1. Setup Instructions

There are two ways to set up your parameters for the scraper:

1. **Using config.yml**: If you have already set up your parameters in the config.yml file, just run the next cell.
2. **Manual Setup (Recommended for Python Beginners)**: If you haven't set up config.yml, you can define your parameters directly in the cell labeled "Option 2" below.
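
As a quick sanity check before loading, the sketch below lists which of the keys read by this notebook's loading cell (`forms`, `colleges`, `years`, and the entries under `paths`) are missing from a parsed configuration. The helper name `missing_config_keys` is hypothetical and is not part of the project; it only illustrates the expected structure.

```python
# Keys that the config-loading cell in this notebook reads.
REQUIRED_TOP_LEVEL = ("forms", "colleges", "years", "paths")
REQUIRED_PATHS = (
    "data_folder",
    "college_folder",
    "district_folder",
    "top_code_folder",
    "record_csv",
)


def missing_config_keys(config):
    """Return dotted names of required keys that are absent from `config`.

    Hypothetical helper: `config` is the dict produced by yaml.safe_load.
    """
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in config]
    if "paths" in config:
        paths = config["paths"] or {}
        missing += [f"paths.{key}" for key in REQUIRED_PATHS if key not in paths]
    return missing


# Example: a config that defines only one of the five path entries.
example = {
    "forms": ["Form 1 Part E-C - College"],
    "colleges": ["San Diego City College"],
    "years": ["2023-2024"],
    "paths": {"data_folder": "Data"},
}
print(missing_config_keys(example))
```

An empty list means every key the loading cell needs is present.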

### Important Parameters:
- **Forms**: Types of forms to scrape
- **Colleges**: List of colleges to collect data from
- **Years**: Fiscal years to analyze

### Note:
- Make sure all college names match exactly with the website dropdown menu
- Years should be in the format 'YYYY-YYYY' (e.g., '2023-2024')
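
The year format can be checked before scraping starts. The following is a minimal sketch, not part of the scraper itself: the function name `is_valid_fiscal_year` is hypothetical, and the rule that the second year must be exactly one more than the first is an assumption based on the example `'2023-2024'`.

```python
import re

# Four digits, a hyphen, four digits, and nothing else.
YEAR_PATTERN = re.compile(r"(\d{4})-(\d{4})")


def is_valid_fiscal_year(year):
    """Return True if `year` looks like 'YYYY-YYYY' with consecutive years.

    Assumption: a fiscal year always spans two consecutive calendar years.
    """
    match = YEAR_PATTERN.fullmatch(year)
    if match is None:
        return False
    start, end = int(match.group(1)), int(match.group(2))
    return end == start + 1


# Flag any entries that would not match the expected format.
years = ["2023-2024", "2022-23", "2021/2022"]
invalid = [y for y in years if not is_valid_fiscal_year(y)]
print("Invalid year entries:", invalid)
```

College names cannot be validated the same way, because the accepted values come from the website dropdown.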

In [None]:
# Option 1: Load parameters from config.yml
import yaml
import os

# Define the project root directory
project_root = os.getcwd()

try:
    with open(os.path.join(project_root, 'config.yml'), 'r') as f:
        config = yaml.safe_load(f)

    # Get parameters from config
    FORM_TYPE_LS = config['forms']
    COLLEGE_LS = config['colleges']
    YEAR_LS = config['years']

    # Get output paths from config (inside the try block so that a missing
    # config.yml does not cause a NameError on `config` afterwards)
    data_folder = config['paths']['data_folder']
    college_fp = config['paths']['college_folder']
    district_fp = config['paths']['district_folder']
    top_code_fp = config['paths']['top_code_folder']
    record_fp = config['paths']['record_csv']

    # Create directories if they don't exist
    os.makedirs(college_fp, exist_ok=True)
    os.makedirs(district_fp, exist_ok=True)
    os.makedirs(top_code_fp, exist_ok=True)

    print('Successfully loaded parameters from config.yml!')
except (OSError, yaml.YAMLError, KeyError, TypeError) as e:
    # Show the cause so a missing file can be told apart from a missing key
    print(f'Could not load config.yml ({type(e).__name__}: {e}).')
    print('Please use Option 2 in the next cell to define parameters manually.')

In [None]:
# Option 2: Define parameters manually if config.yml is not set up
# Uncomment and modify these lines if needed:
# To uncomment multiple lines at once, select them and press Ctrl + / (Windows/Linux) or Command + / (MacOS)

# Edit this list to reflect the colleges you want to scrape.
# You can find a college's name by selecting it from the 'Select District/College' dropdown on the website.
# Note that some colleges may be listed under names different from their official ones.
# COLLEGE_LS = [
#     'San Diego City College',
#     'San Diego Mesa College',
#     'San Diego Miramar College Reg Cntr'
# ]

# Edit this list to reflect the fiscal years you want to scrape.
# YEAR_LS = [
#     '2025-2026',
#     '2024-2025',
#     '2023-2024',
#     '2022-2023',
#     '2021-2022',
#     '2020-2021'
# ]

# Define your paths (variable names match those used in the rest of the notebook)
# import os
# project_root = os.getcwd()
# data_folder = os.path.join(project_root, 'Data')
# college_fp = os.path.join(data_folder, 'College')
# district_fp = os.path.join(data_folder, 'District')
# top_code_fp = os.path.join(data_folder, 'Top Code')
# record_fp = os.path.join(data_folder, 'scraping_log.csv')

# Create directories if they don't exist
# os.makedirs(college_fp, exist_ok=True)
# os.makedirs(district_fp, exist_ok=True)
# os.makedirs(top_code_fp, exist_ok=True)

## 2. Run the Scraping Process

The scraping process will be executed for each form type defined in the configuration:

1. Form 1 Part E-C - College: College-level core indicators
2. Form 1 Part E-D - District: District-level indicators (we recommend downloading this one manually, since there is only one report per district)
3. Form 1 Part F by 6 Digit TOP Code - College: Program-specific indicators

Each form will be processed for all specified colleges and years.
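
To get a sense of how many reports a run covers, the combinations can be enumerated up front. This is only an illustrative sketch: `planned_requests` is a hypothetical helper, and it makes no claim about how `run.run` iterates internally.

```python
from itertools import product


def planned_requests(form_types, colleges, years):
    """Return every (form, college, year) combination as a list of tuples."""
    return list(product(form_types, colleges, years))


# Example values taken from this notebook; adjust to your own lists.
forms = [
    "Form 1 Part E-C - College",
    "Form 1 Part F by 6 Digit TOP Code - College",
]
colleges = ["San Diego City College", "San Diego Mesa College"]
years = ["2023-2024", "2022-2023"]

plan = planned_requests(forms, colleges, years)
print(f"{len(plan)} form/college/year combinations planned")
```

With two forms, two colleges, and two years, the plan contains eight combinations.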

In [None]:
import run

In [None]:
# 1. College Level Core Indicators Report
form = 'Form 1 Part E-C - College'

print(f'Scraping {form} for the following colleges: {COLLEGE_LS}')
print(f'Scraping {form} for the following years: {YEAR_LS}')
run.run(
    form_type=form,
    college_ls=COLLEGE_LS,
    year_ls=YEAR_LS,
)

In [None]:
# 3. College Level Core Indicators Report by 6 Digit TOP Code
form = 'Form 1 Part F by 6 Digit TOP Code - College'

print(f'Scraping {form} for the following colleges: {COLLEGE_LS}')
print(f'Scraping {form} for the following years: {YEAR_LS}')
run.run(
    form_type=form,
    college_ls=COLLEGE_LS,
    year_ls=YEAR_LS,
)

## 3. Verify Data Collection

Check the contents of the output directories to ensure all data was collected properly.
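
The directory listing in this section shows only the top level of each folder. If the scraper writes into subfolders, a recursive count can be a useful complement. The sketch below assumes nothing about file names or folder layout; `count_files` is a hypothetical helper, and the path in the usage line is the default location mentioned in this notebook.

```python
import os


def count_files(path):
    """Return the number of files under `path`, including subfolders.

    A missing directory yields 0 because os.walk produces no entries for it.
    """
    total = 0
    for _root, _dirs, files in os.walk(path):
        total += len(files)
    return total


# Usage with the default college output folder; adjust to your own paths.
print("Files under Data/College:", count_files(os.path.join("Data", "College")))
```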

In [None]:
import os


def check_directory_contents(path, indent=0):
    """Print the immediate contents of `path`, marking folders and files."""
    if not os.path.exists(path):
        print(f'{" "*indent}Directory not found: {path}')
        return
    print(f'{" "*indent}Contents of {path}:')
    for item in os.listdir(path):
        full_path = os.path.join(path, item)
        if os.path.isdir(full_path):
            print(f'{" "*(indent+2)}📁 {item}')
        else:
            print(f'{" "*(indent+2)}📄 {item}')

print('Checking College Data:')
check_directory_contents(college_fp)
print('\nChecking District Data:')
check_directory_contents(district_fp)
print('\nChecking TOP Code Data:')
check_directory_contents(top_code_fp)

## 4. Scraping Process Complete

The data has been collected and saved to the locations set in config.yml or in the manual setup cell. With the paths shown in this notebook, these are:
- College-level indicators: `Data/College/`
- District-level indicators: `Data/District/`
- TOP Code indicators: `Data/Top Code/`

Check the scraping log at `Data/scraping_log.csv` for detailed information about the collected data.
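
For a quick look at the log without opening it in a spreadsheet, the sketch below reads the CSV with the standard library and reports its columns and row count. The log's column names are not documented here, so the code does not assume any; `summarize_log` is a hypothetical helper.

```python
import csv
import os


def summarize_log(csv_path):
    """Return (column names, number of data rows) for a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = reader.fieldnames or []
    return columns, len(rows)


# Default log location from this notebook; adjust if you changed the paths.
log_path = os.path.join("Data", "scraping_log.csv")
if os.path.exists(log_path):
    columns, n_rows = summarize_log(log_path)
    print(f"{n_rows} rows with columns: {columns}")
else:
    print(f"Log not found at {log_path}")
```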