# Code Samples

The purpose of this notebook is to demonstrate how to use the code in this repository.

## College Scorecard

**Prerequisites:**

1. Go to [the API documentation website](https://collegescorecard.ed.gov/data/api-documentation/) and request an API key.
2. Store the API key as an environment variable called `SCORECARDAPI`.
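
Before calling the wrapper, it can help to confirm that the key is visible to Python. The following is a minimal sketch: `require_api_key` is a hypothetical helper that is not part of this repository, and it only checks that the variable is set, not that the key is valid.

```python
import os


def require_api_key(name="SCORECARDAPI", environ=None):
    """Return the API key stored in `name`, or raise if it is missing."""
    # Allow a plain dict to be passed in so the check can be tried offline.
    environ = os.environ if environ is None else environ
    key = environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {name} is not set; request a key and export it first."
        )
    return key
```

Calling `require_api_key()` at the top of a script fails early with a readable message instead of a less obvious API error later on.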

**Use case:**

Below is an example of how you can use the College Scorecard API wrapper in downstream code.

In [1]:
from ucd_sta_221_project.api.college_scorecard import (
    get_latest_student_scorecard_data_by_state,
    get_scorecard_by_college,
)

In [None]:
# All colleges in California:
# Warning: This function may take about a minute to run
all_ca = get_latest_student_scorecard_data_by_state()

{'page': 1, 'total': 672, 'per_page': 100}
{'page': 2, 'total': 672, 'per_page': 100}
{'page': 3, 'total': 672, 'per_page': 100}
{'page': 4, 'total': 672, 'per_page': 100}
{'page': 5, 'total': 672, 'per_page': 100}
{'page': 6, 'total': 672, 'per_page': 100}
{'page': 7, 'total': 672, 'per_page': 100}


In [5]:
all_ca.shape

(572, 134)

In [3]:
all_ca.head()

Unnamed: 0,college,size,grad_students,size_category,enrollment_all,enrollment_grad_12_month,enrollment_undergrad_12_month,share_25_older,part_time_share,demographics_men,...,share_independent_middleincome_48001_75000,undergrads_with_pell_grant_or_federal_student_loan,retention_rate_suppressed_four_year_full_time_pooled,retention_rate_suppressed_four_year_part_time_pooled,retention_rate_suppressed_lt_four_year_full_time_pooled,retention_rate_suppressed_lt_four_year_part_time_pooled,dcs_undergrads_with_pell_grant_or_federal_student_loan,ftft_undergrads_with_pell_grant_or_federal_student_loan,dcs_undergrads_with_pell_grant_or_federal_student_loan_pooled,ftft_undergrads_with_pell_grant_or_federal_student_loan_pooled
0,International School of Beauty Inc,173.0,,1.0,,,305.0,0.3757,0.0,0.2659,...,,305.0,,,0.8734,,305.0,112.0,593.0,206.0
0,College of the Desert,8900.0,,2.0,,,14165.0,0.2917,0.5975,0.4287,...,0.053652,10359.0,,,0.6603,0.4285,8652.0,1173.0,17286.0,2268.0
0,Design Institute of San Diego,92.0,19.0,1.0,,18.0,130.0,0.5,0.2283,0.0652,...,,93.0,,,,,93.0,8.0,212.0,20.0
0,Dharma Realm Buddhist University,,,0.0,,,,,,,...,,,,,,,,,14.0,4.0
0,Diablo Valley College,14734.0,,2.0,,,24223.0,0.2966,0.5813,0.5019,...,0.059963,15727.0,,,0.7881,0.4862,14067.0,1525.0,29576.0,2833.0


In [None]:
# Data availability is a problem. Note the Non-Null Counts:
all_ca.iloc[:, 0:70].info()

<class 'pandas.core.frame.DataFrame'>
Index: 572 entries, 0 to 0
Data columns (total 70 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   college                                                 572 non-null    object 
 1   size                                                    488 non-null    float64
 2   grad_students                                           156 non-null    float64
 3   size_category                                           572 non-null    float64
 4   enrollment_all                                          0 non-null      float64
 5   enrollment_grad_12_month                                153 non-null    float64
 6   enrollment_undergrad_12_month                           487 non-null    float64
 7   share_25_older                                          486 non-null    float64
 8   part_time_share                                

In [26]:
all_ca.iloc[:, 71:134].info()

<class 'pandas.core.frame.DataFrame'>
Index: 572 entries, 0 to 0
Data columns (total 63 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   fafsa_sent_3_college                                            0 non-null      float64
 1   fafsa_sent_2_colleges                                           0 non-null      float64
 2   fafsa_sent_4_colleges                                           0 non-null      float64
 3   fafsa_sent_2_college_allyrs                                     395 non-null    float64
 4   fafsa_sent_3_college_allyrs                                     348 non-null    float64
 5   fafsa_sent_4_college_allyrs                                     302 non-null    float64
 6   fafsa_sent_5_or_more_colleges                                   0 non-null      float64
 7   fafsa_sent_5plus_college_allyrs                             
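
The non-null counts above can be turned into a per-column availability share, which makes it easier to decide which features to keep. Below is a minimal sketch on a toy frame; the values and the 0.5 threshold are illustrative assumptions, not choices made in this project.

```python
import pandas as pd

# Toy stand-in for `all_ca`; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "college": ["A", "B", "C", "D"],
        "size": [173.0, 8900.0, None, 14734.0],
        "enrollment_all": [None, None, None, None],
    }
)

# Share of non-null values per column (1.0 means fully populated).
availability = toy.notna().mean()

# Keep only columns that are at least half populated (arbitrary threshold).
usable = toy.loc[:, availability >= 0.5]
```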

In [None]:
# Or, get data by college name
ucdavis = get_scorecard_by_college(
    college_name = "Davis",
    college_city = "Davis",
    college_state = "CA"
)
ucdavis.head()

Unnamed: 0,school_name,latest_student_size,latest_student_enrollment_undergrad_12_month,latest_student_demographics_over_23_at_entry,latest_student_demographics_first_generation,latest_student_demographics_median_hh_income,latest_student_demographics_student_faculty_ratio,latest_student_FAFSA_applications
0,University of California-Davis,31777,33442,0.11,0.408311,76621,21,9280


In [None]:
# The fuzzy name matching may return unintended results:
california = get_scorecard_by_college(
    college_name = "University of California",
    college_state = "CA"
)
california.head()

Unnamed: 0,school_name,latest_student_size,latest_student_enrollment_undergrad_12_month,latest_student_demographics_over_23_at_entry,latest_student_demographics_first_generation,latest_student_demographics_median_hh_income,latest_student_demographics_student_faculty_ratio,latest_student_FAFSA_applications
0,University of California-San Francisco,,,0.85,,78614.0,,69.0
1,California State University Maritime Academy,761.0,891.0,0.2,0.199475,78248.0,10.0,402.0
2,California Aeronautical University,383.0,453.0,0.51,0.5,65732.0,14.0,309.0
3,DeVry University-California,2081.0,2679.0,0.59,0.502252,63213.0,18.0,24522.0
4,Vanguard University of Southern California,1975.0,2239.0,0.2,0.358836,75891.0,16.0,1040.0
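
One way to guard against this is to filter the returned names after the call. The sketch below assumes, based on the outputs shown in this notebook, that UC campuses are named `University of California-<Campus>`; the frame is a toy stand-in built from names in the output above.

```python
import pandas as pd

# Toy stand-in for the `california` result, using names from the output above.
california_toy = pd.DataFrame(
    {
        "school_name": [
            "University of California-San Francisco",
            "California State University Maritime Academy",
            "DeVry University-California",
        ]
    }
)

# Keep only rows whose name starts with the assumed UC prefix.
is_uc = california_toy["school_name"].str.startswith("University of California-")
uc_only = california_toy[is_uc].reset_index(drop=True)
```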


---

## California Community College Chancellor's Office

**Use case:**

Below is an example of how you can use the CCCCO API wrapper in downstream code.

In [2]:
from ucd_sta_221_project.api.cccco import (
    get_ccc_colleges,
    get_ccc_districts,
    get_ccc_programs
)

In [3]:
# Either get all colleges:
ccc_colleges = get_ccc_colleges()
ccc_colleges.head()

Unnamed: 0,CollegeID,CollegeName,DistrictID,StreetAddress,City,County,Zip,ZipPlus4,MailingAddress,MailingCity,MailingZip,Phone,WebsiteURL,Latitude,Longitude,LogoURL,District
0,21,Cuyamaca College,20,900 Rancho San Diego Parkway,El Cajon,San Diego,92019,4304,900 Rancho San Diego Parkway,El Cajon,92019,619.660.4000,www.cuyamaca.edu,32.74489,-116.935229,CuyamacaCollegeLogo.jpg,
1,22,Grossmont College,20,8800 Grossmont College Drive,El Cajon,San Diego,92020,1799,8800 Grossmont College Drive,El Cajon,92020,619.644.7000,www.grossmont.edu,32.817897,-117.00564,GrossmontCollegelogo.jpg,
2,31,Imperial Valley College,30,380 East Aten Road,Imperial,Imperial,92251,9787,380 East Aten Road,Imperial,92251,760.352.8320,www.imperial.edu,32.825859,-115.502999,ImperialValleyCollegeLogocopy.jpg,
3,51,MiraCosta College,50,1 Barnard Drive,Oceanside,San Diego,92056,3899,1 Barnard Drive,Oceanside,92056,760.757.2121,www.miracosta.edu,33.188864,-117.301064,Mira_Costa_College_Logo_4c.png,
4,61,Palomar College,60,1140 West Mission Road,San Marcos,San Diego,92069,1487,1140 West Mission Road,San Marcos,92069,760.744.1150,www.palomar.edu,33.147015,-117.18398,PalomarCollegeLogo.jpg,


In [None]:
# Or, get data by college name:
solano = get_ccc_colleges(search_param = "Solano")
solano.head()

Unnamed: 0,CollegeID,CollegeName,DistrictID,StreetAddress,City,County,Zip,ZipPlus4,MailingAddress,MailingCity,MailingZip,Phone,WebsiteURL,Latitude,Longitude,LogoURL,District
0,281,Solano Community College,280,4000 Suisun Valley Road,Fairfield,Solano,94534,3197,4000 Suisun Valley Road,Fairfield,94534,707.864.7000,www.solano.edu,38.232644,-122.126308,SolanoCollegeLogo.jpg,


In [4]:
# Either get all districts:
ccc_districts = get_ccc_districts()
ccc_districts.head()

Unnamed: 0,DistrictID,DistrictName,DistrictTitle,StreetAddress,City,Zip,Phone,WebsiteURL,Boundaries,Colleges
0,20,Grossmont-Cuyamaca,Grossmont-Cuyamaca Community College District,8800 Grossmont College Drive,El Cajon,92020.0,619-644-7010,www.gcccd.edu,,"[{'CollegeID': '021', 'CollegeName': 'Cuyamaca..."
1,30,Imperial,Imperial Community College District,380 E. Aten Road (PO Box 158),Imperial,92251.0,760-352-8320,www.imperial.edu,,"[{'CollegeID': '031', 'CollegeName': 'Imperial..."
2,50,MiraCosta,MiraCosta Community College District,,,,,www.miracosta.edu,,"[{'CollegeID': '051', 'CollegeName': 'MiraCost..."
3,60,Palomar,Palomar Community College District,1140 W. Mission Road,San Marcos,92069.0,760-744-1150,www.palomar.edu,,"[{'CollegeID': '061', 'CollegeName': 'Palomar ..."
4,70,San Diego,San Diego Community College District,3375 Camino del Rio South,San Diego,92108.0,619-388-6500,www.sdccd.edu,,"[{'CollegeID': '071', 'CollegeName': 'San Dieg..."


In [None]:
# Or, get data by district name:
kernccd = get_ccc_districts(search_param = "Kern")
kernccd.head()

Unnamed: 0,DistrictID,DistrictName,DistrictTitle,StreetAddress,City,Zip,Phone,WebsiteURL,Boundaries,Colleges
0,520,Kern,Kern Community College District,2100 Chester Avenue,Bakersfield,93301,661-336-5100,www.kccd.edu,,"[{'CollegeID': '521', 'CollegeName': 'Bakersfi..."
1,690,West Kern,West Kern Community College District,29 Emmons Park Drive,Taft,93268,661-763-7700,www.taftcollege.edu,,"[{'CollegeID': '691', 'CollegeName': 'Taft Col..."


In [None]:
# Colleges that offer programs matching the search argument:
data_science_programs = get_ccc_programs(search_param = "Data Science")
data_science_programs.head()

Unnamed: 0,CollegeID,CollegeName,ProgramAward,CreditType,Title,TopCode
0,731,Glendale Community College,S,C,Data Science,170100
1,831,Coastline College,S,C,Data Science,70700
2,963,Norco College,S,C,Data Science,70730
3,651,Santa Barbara City College,S,C,Data Science,170100
4,651,Santa Barbara City College,N,C,Data Science,170100


---

## UC Official Transfer Data

**Use case:**

Below is an example of how you can use the Tableau data in downstream code.

In [None]:
import io, requests, pandas as pd
import time

def get_transfer_data_major(year, cc):
    """Download UC transfer enrollments by major for one community college and year."""

    base = "https://visualizedata.ucop.edu/t/Public/views/Transfersbymajor/Bycommunitycollege.csv"

    # Tableau view filters are passed as query parameters.
    params = [
        (":showVizHome", "no"),
        ("Sch Src Name", cc),
        ("Academic year", year),
    ]

    r = requests.get(base, params=params, timeout=60)
    r.raise_for_status()
    time.sleep(1.5)  # Pause between requests

    df = pd.read_csv(io.StringIO(r.text))
    df['CC'] = cc

    # Drop incomplete rows and the aggregate 'All' rows, then keep six columns.
    df = df.dropna()
    uc2cc = df[df['CIP family title'] != 'All'].iloc[:, [0, 1, 3, 5, 6, 7]]
    uc2cc.columns = ['Year', 'UC', 'Field', 'Major', 'Enrolls', 'CC']

    return uc2cc

In [None]:
year_list = ["2012-13","2013-14","2014-15","2015-16","2016-17","2017-18",
          "2018-19","2019-20","2020-21","2021-22","2022-23","2023-24"]

cc_list = [
    "ALLAN HANCOCK COLLEGE",
    "AMERICAN RIVER COLLEGE",
    "ANTELOPE VALLEY COLLEGE",
    "BAKERSFIELD COLLEGE",
    "BARSTOW COLLEGE",
    "BARSTOW COMMUNITY COLLEGE",
    "BERKELEY CITY COLLEGE",
    "BUTTE COLLEGE",
    "CABRILLO COLLEGE",
    "CANADA COLLEGE",
    "CERRITOS COLLEGE",
    "CERRO COSO COMMUNITY COLLEGE",
    "CHABOT COLLEGE",
    "CHAFFEY COLLEGE",
    "CITRUS COLLEGE",
    "CITY COLLEGE OF SAN FRANCISCO",
    "CITY COLLEGE SAN FRANCISCO",
    "CLOVIS COMMUNITY COLLEGE",
    "COASTLINE COMMUNITY COLLEGE",
    "COLLEGE OF ALAMEDA",
    "COLLEGE OF MARIN",
    "COLLEGE OF SAN MATEO",
    "COLLEGE OF THE CANYONS",
    "COLLEGE OF THE DESERT",
    "COLLEGE OF THE REDWOODS",
    "COLLEGE OF THE SEQUOIAS",
    "COLLEGE OF THE SISKIYOUS",
    "COLUMBIA COLLEGE",
    "COMPTON COLLEGE",
    "CONTRA COSTA COLLEGE",
    "COPPER MOUNTAIN COLLEGE",
    "COSUMNES RIVER COLLEGE",
    "CRAFTON HILLS COLLEGE",
    "CUESTA COLLEGE",
    "CUYAMACA COLLEGE",
    "CYPRESS COLLEGE",
    "DE ANZA COLLEGE",
    "DIABLO VALLEY COLLEGE",
    "EAST LOS ANGELES COLLEGE",
    "EL CAMINO COLLEGE",
    "EVERGREEN VALLEY COLLEGE",
    "FEATHER RIVER COLLEGE",
    "FOLSOM LAKE COLLEGE",
    "FOOTHILL COLLEGE",
    "FRESNO CITY COLLEGE",
    "FULLERTON COLLEGE",
    "GAVILAN COLLEGE",
    "GLENDALE COMMUNITY COLLEGE",
    "GOLDEN WEST COLLEGE",
    "GROSSMONT CMTY COLLEGE",
    "GROSSMONT COLLEGE",
    "HARTNELL COLLEGE",
    "IMPERIAL VALLEY COLLEGE",
    "IRVINE VALLEY COLLEGE",
    "LAKE TAHOE COMMUNITY COLLEGE",
    "LANEY COLLEGE",
    "LAS POSITAS COLLEGE",
    "LASSEN COLLEGE",
    "LONG BEACH CITY COLLEGE",
    "LOS ANGELES CITY COLLEGE",
    "LOS ANGELES HARBOR COLLEGE",
    "LOS ANGELES MISSION COLLEGE",
    "LOS ANGELES PIERCE COLLEGE",
    "LOS ANGELES SOUTHWEST COLLEGE",
    "LOS ANGELES TRADE TECHNICAL COLLEGE",
    "LOS ANGELES TRADE-TECH COLLEGE",
    "LOS ANGELES VALLEY COLLEGE",
    "LOS MEDANOS COLLEGE",
    "MADERA COMMUNITY COLLEGE",
    "MENDOCINO COLLEGE",
    "MERCED COLLEGE",
    "MERRITT COLLEGE",
    "MIRACOSTA COLLEGE",
    "MISSION COLLEGE",
    "MODESTO JUNIOR COLLEGE",
    "MONTEREY PENINSULA COLLEGE",
    "MOORPARK COLLEGE",
    "MORENO VALLEY COLLEGE",
    "MOUNT SAN ANTONIO COLLEGE",
    "MOUNT SAN JACINTO COLLEGE",
    "MT. SAN ANTONIO COLLEGE",
    "NAPA VALLEY COLLEGE",
    "NORCO COLLEGE",
    "OHLONE COLLEGE",
    "ORANGE COAST COLLEGE",
    "OXNARD COLLEGE",
    "PALOMAR COLLEGE",
    "PALO VERDE COLLEGE",
    "PASADENA CITY COLLEGE",
    "PORTERVILLE COLLEGE",
    "RANCHO SANTIAGO CANYON COLLEGE",
    "REEDLEY COLLEGE",
    "RIO HONDO COLLEGE",
    "RIVERSIDE CITY COLLEGE",
    "SACRAMENTO CITY COLLEGE",
    "SADDLEBACK COLLEGE",
    "SAN BERNARDINO VALLEY COLLEGE",
    "SAN DIEGO CITY COLLEGE",
    "SAN DIEGO MESA COLLEGE",
    "SAN DIEGO MIRAMAR COLLEGE",
    "SAN JOAQUIN DELTA COLLEGE",
    "SAN JOSE CITY COLLEGE",
    "SANTA ANA COLLEGE",
    "SANTA BARBARA CITY COLLEGE",
    "SANTA MONICA COLLEGE",
    "SANTA ROSA JUNIOR COLLEGE",
    "SANTIAGO CANYON COLLEGE",
    "SHASTA COLLEGE",
    "SIERRA COLLEGE",
    "SKYLINE COLLEGE",
    "SOLANO COMMUNITY COLLEGE",
    "SOUTHWESTERN COLLEGE",
    "TAFT COLLEGE",
    "VENTURA COLLEGE",
    "VICTOR VALLEY COLLEGE",
    "WEST HILLS COLLEGE COALINGA",
    "WEST HILLS COLLEGE LEMOORE",
    "WEST LOS ANGELES COLLEGE",
    "WEST VALLEY COLLEGE",
    "WOODLAND COMMUNITY COLLEGE",
    "YUBA COLLEGE"
]
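
Several entries in `cc_list` look like spelling variants of the same college (for example `MT. SAN ANTONIO COLLEGE` and `MOUNT SAN ANTONIO COLLEGE`). If results are later aggregated per college, the variants probably need to be merged. Below is a minimal sketch, assuming the pairs listed really refer to the same institution; the `ALIASES` mapping and the `canonical_name` helper are hypothetical and should be checked against the source data.

```python
# Hypothetical alias map: variant spelling -> name to report under.
ALIASES = {
    "BARSTOW COLLEGE": "BARSTOW COMMUNITY COLLEGE",
    "CITY COLLEGE SAN FRANCISCO": "CITY COLLEGE OF SAN FRANCISCO",
    "GROSSMONT CMTY COLLEGE": "GROSSMONT COLLEGE",
    "LOS ANGELES TRADE-TECH COLLEGE": "LOS ANGELES TRADE TECHNICAL COLLEGE",
    "MT. SAN ANTONIO COLLEGE": "MOUNT SAN ANTONIO COLLEGE",
}


def canonical_name(name):
    """Map a variant spelling to its canonical name; leave other names unchanged."""
    cleaned = name.strip().upper()
    return ALIASES.get(cleaned, cleaned)
```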

In [None]:
uc2cc_major = get_transfer_data_major(year_list[0], cc_list[0])
uc2cc_major.head(10)

Unnamed: 0,Major,Field,Year,UC,Enrolls,CC
0,Plant Sciences,AGRICULTURAL/ANIMAL/PLANT/VETERINARY SCIENCE A...,2012-13,UCD,1,ALLAN HANCOCK COLLEGE
1,"Area, Ethnic, Cultural, Gender, and Group Stud...","AREA, ETHNIC, CULTURAL, GENDER, AND GROUP STUDIES",2012-13,UCSC,1,ALLAN HANCOCK COLLEGE
2,"Ethnic, Cultural Minority, Gender, and Group S...","AREA, ETHNIC, CULTURAL, GENDER, AND GROUP STUDIES",2012-13,UCSC,1,ALLAN HANCOCK COLLEGE
3,"Biology, General",BIOLOGICAL AND BIOMEDICAL SCIENCES,2012-13,UCB,1,ALLAN HANCOCK COLLEGE
4,"Biology, General",BIOLOGICAL AND BIOMEDICAL SCIENCES,2012-13,UCD,2,ALLAN HANCOCK COLLEGE
5,Biotechnology,BIOLOGICAL AND BIOMEDICAL SCIENCES,2012-13,UCD,1,ALLAN HANCOCK COLLEGE
6,Neurobiology and Neurosciences,BIOLOGICAL AND BIOMEDICAL SCIENCES,2012-13,UCD,1,ALLAN HANCOCK COLLEGE
7,"Biology, General",BIOLOGICAL AND BIOMEDICAL SCIENCES,2012-13,UCI,1,ALLAN HANCOCK COLLEGE
8,"Biology, General",BIOLOGICAL AND BIOMEDICAL SCIENCES,2012-13,UCSB,3,ALLAN HANCOCK COLLEGE
9,"Ecology, Evolution, Systematics, and Populatio...",BIOLOGICAL AND BIOMEDICAL SCIENCES,2012-13,UCSC,2,ALLAN HANCOCK COLLEGE
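
The call above fetches one year for one college. To build a full dataset, the same call can be repeated over every combination in `year_list` and `cc_list`. The sketch below shows one way to do that while recording failures instead of stopping; `collect_all` is a hypothetical helper, and the fetch function is passed in as an argument so the loop can be tried without network access.

```python
from itertools import product


def collect_all(fetch, years, colleges):
    """Call fetch(year, college) for every pair; return (results, failures)."""
    results, failures = [], []
    for year, college in product(years, colleges):
        try:
            results.append(fetch(year, college))
        except Exception as exc:  # keep going; report the failed pair later
            failures.append((year, college, str(exc)))
    return results, failures
```

With the function defined earlier, `frames, failed = collect_all(get_transfer_data_major, year_list, cc_list)` followed by `pd.concat(frames, ignore_index=True)` would give a single table. The full grid is large (12 years times the length of `cc_list`), and each call sleeps for 1.5 seconds, so a complete run takes a while.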


## UC Official Admission Data From CC

**Use case:**

Below is an example of how you can use the Tableau data in downstream code.

### Ethnicity

In [None]:
import io, requests, pandas as pd
import time

def get_transfer_data_3status_eth(year, uc):
    """Download transfer counts by status and ethnicity for one UC campus and year."""

    base = "https://visualizedata.ucop.edu/t/Public/views/AdmissionsDataTable/TREthbyYr.csv"

    # Tableau view filters are passed as query parameters.
    params = [
        ("Campus", uc),
        ("Academic Yr", year),
    ]

    r = requests.get(base, params=params, timeout=60)
    r.raise_for_status()
    time.sleep(1.5)  # Pause between requests

    df = pd.read_csv(io.StringIO(r.text))
    # To keep only the aggregate rows instead:
    # df1 = df[df['uad_uc_ethn_7_cat'] == 'All'].iloc[:,[1,2,3,4,6]]
    df1 = df.iloc[:, [1, 2, 3, 4, 5, 6]].copy()  # copy before adding columns
    df1['UC'] = uc
    df1['Year'] = year
    df2 = df1.rename(columns={"Pivot Field Values": 'Num', "uad_uc_ethn_7_cat": "Ethnicity"})
    df2 = df2.fillna(0)

    return df2


In [None]:
year_list = ["2012-13","2013-14","2014-15","2015-16","2016-17","2017-18",
          "2018-19","2019-20","2020-21","2021-22","2022-23","2023-24"]

uc_list = ['Berkeley', 'Davis', 'Irvine', 'Los Angeles', 'Merced', 'Riverside',
           'San Diego', 'Santa Barbara', 'Santa Cruz']


In [None]:
get_transfer_data_3status_eth(year_list[0], uc_list[0])

Unnamed: 0,City,Count,County,School,Ethnicity,Num,UC,Year
0,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,All,48,Berkeley,2012-13
1,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Hispanic/ Latinx,0,Berkeley,2012-13
2,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Hispanic/ Latinx,24,Berkeley,2012-13
3,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Pacific Islander,0,Berkeley,2012-13
4,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Asian,0,Berkeley,2012-13
...,...,...,...,...,...,...,...,...
3469,Marysville,Enr,Yuba,YUBA COLLEGE,All,1,Berkeley,2012-13
3470,Marysville,Enr,Yuba,YUBA COLLEGE,African American,0,Berkeley,2012-13
3471,Marysville,Enr,Yuba,YUBA COLLEGE,Asian,0,Berkeley,2012-13
3472,Marysville,Enr,Yuba,YUBA COLLEGE,Asian,1,Berkeley,2012-13


### Gender

In [None]:
import io, requests, pandas as pd
import time

def get_transfer_data_3status_gnd(year, uc):
    """Download transfer counts by status and gender for one UC campus and year."""

    base = "https://visualizedata.ucop.edu/t/Public/views/AdmissionsDataTable/TRGndbyYr.csv"

    # Tableau view filters are passed as query parameters.
    params = [
        ("Campus", uc),
        ("Academic year", year),
    ]

    r = requests.get(base, params=params, timeout=60)
    r.raise_for_status()
    time.sleep(1.5)  # Pause between requests

    df = pd.read_csv(io.StringIO(r.text))
    df1 = df.iloc[:, [1, 2, 3, 4, 5, 6]].copy()  # copy before adding columns
    df1['UC'] = uc
    df1['Year'] = year
    df2 = df1.rename(columns={"Pivot Field Values": 'Num', "gender": "Gender"})
    df2 = df2.fillna(0)

    return df2


In [27]:
year_list = ["2012-13","2013-14","2014-15","2015-16","2016-17","2017-18",
          "2018-19","2019-20","2020-21","2021-22","2022-23","2023-24"]

uc_list = ['Berkeley', 'Davis', 'Irvine', 'Los Angeles', 'Merced', 'Riverside',
           'San Diego', 'Santa Barbara', 'Santa Cruz']


In [32]:
get_transfer_data_3status_gnd(year_list[0], uc_list[0])

Unnamed: 0,City,Count,County,Gender,School,Num,UC,Year
0,Santa Maria,App,Santa Barbara,All,ALLAN HANCOCK COLLEGE,45,Berkeley,2012-13
1,Santa Maria,App,Santa Barbara,Female,ALLAN HANCOCK COLLEGE,0,Berkeley,2012-13
2,Santa Maria,App,Santa Barbara,Female,ALLAN HANCOCK COLLEGE,14,Berkeley,2012-13
3,Santa Maria,App,Santa Barbara,Male,ALLAN HANCOCK COLLEGE,0,Berkeley,2012-13
4,Santa Maria,App,Santa Barbara,Male,ALLAN HANCOCK COLLEGE,27,Berkeley,2012-13
...,...,...,...,...,...,...,...,...
2344,Marysville,Enr,Yuba,Female,YUBA COLLEGE,0,Berkeley,2012-13
2345,Marysville,Enr,Yuba,Female,YUBA COLLEGE,0,Berkeley,2012-13
2346,Marysville,Enr,Yuba,Male,YUBA COLLEGE,0,Berkeley,2012-13
2347,Marysville,Enr,Yuba,Male,YUBA COLLEGE,0,Berkeley,2012-13


# College Scorecard Downloads

The purpose of this section is to pull relevant information from the College Scorecard API and save it to `./ucd_sta_221_project/data_files/{uc, cc}_scorecard.csv`.

In [20]:
import os
import pandas as pd
import re
import requests
import time
import warnings # Suppress all Pandas FutureWarning messages

from collections import defaultdict
from ucd_sta_221_project.api.cccco import get_ccc_colleges

warnings.simplefilter(action='ignore', category=FutureWarning)

url = "https://api.data.gov/ed/collegescorecard/v1/schools"

In [None]:
FEATURES = [
    ## From .aid
    "aid.ftft_pell_grant_rate",
    "aid.ftft_federal_loan_rate",
    "aid.pell_grant_rate",
    "aid.federal_loan_rate",
    "aid.loan_principal",
    "aid.median_debt.income",
    "aid.median_debt.pell_grant",
    "aid.median_debt.no_pell_grant",
    "aid.median_debt.first_generation_students",
    "aid.median_debt.non_first_generation_students",

    ## From .student.demographics
    "student.demographics.race_ethnicity.white",
    "student.demographics.race_ethnicity.black",
    "student.demographics.race_ethnicity.hispanic",
    "student.demographics.race_ethnicity.asian",
    "student.demographics.race_ethnicity.aian",
    "student.demographics.race_ethnicity.nhpi",
    "student.demographics.race_ethnicity.two_or_more",
    "student.demographics.race_ethnicity.non_resident_alien",
    "student.demographics.race_ethnicity.unknown",
    "student.demographics.men",
    "student.demographics.women",
    "student.demographics.student_faculty_ratio",
    "student.demographics.faculty.race_ethnicity.two_or_more",
    "student.demographics.faculty.race_ethnicity.aian",
    "student.demographics.faculty.race_ethnicity.asian",
    "student.demographics.faculty.race_ethnicity.black",
    "student.demographics.faculty.race_ethnicity.hispanic",
    "student.demographics.faculty.race_ethnicity.nhpi",
    "student.demographics.faculty.race_ethnicity.non_resident_alien",
    "student.demographics.faculty.race_ethnicity.unknown",
    "student.demographics.faculty.race_ethnicity.white",
    "student.demographics.faculty.men",
    "student.demographics.faculty.women",

    ## From .student.enrollment
    "student.enrollment.undergrad_12_month",

    ## From .cost
    "cost.avg_net_price.public",
    "cost.net_price.public.by_income_level.0-30000",
    "cost.net_price.public.by_income_level.30001-48000",
    "cost.net_price.public.by_income_level.48001-75000",
    "cost.net_price.public.by_income_level.75001-110000",
    "cost.net_price.public.by_income_level.110001-plus",
    "cost.title_iv.public.by_income_level.0-30000",
    "cost.title_iv.public.by_income_level.30001-48000",
    "cost.title_iv.public.by_income_level.48001-75000",
    "cost.title_iv.public.by_income_level.75001-110000",
    "cost.title_iv.public.by_income_level.110001-plus",
    "cost.attendance.academic_year",
    "cost.tuition.in_state",
    "cost.tuition.out_of_state",
    "cost.booksupply",
    "cost.roomboard.oncampus",
    "cost.otherexpense.oncampus",
    "cost.roomboard.offcampus",
    "cost.otherexpense.offcampus",
    "cost.otherexpense.withfamily",

    ## From .academics
    # This requires some post-processing
    "academics.program_percentage", # Shows program distribution (use for %STEM)

    ## From .admissions
    "admissions.admission_rate.overall",

    ## From .completion
    "completion.title_iv",
    "completion.transfer_rate"
]

In [52]:
def get_feature_list(year: str) -> list[str]:
    """
    Generate a list of College Scorecard features with year prefixes.

    :param year: Year string (e.g., "2020")
    :return: List of features with year prefixes
    """
    return [f"{year}.{feature}" for feature in FEATURES]
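
For illustration, here is what the year-prefixed list looks like for a two-item subset of `FEATURES`, and how it is joined into the comma-separated `fields` parameter used in the requests later in this notebook. The subset and the standalone function name are only for this sketch.

```python
# Two features taken from FEATURES above, to keep the example short.
FEATURES_SUBSET = [
    "aid.pell_grant_rate",
    "cost.tuition.in_state",
]


def prefixed_features(year, features):
    """Prefix each feature name with the reporting year."""
    return [f"{year}.{feature}" for feature in features]


fields = ["school.name", "ope6_id"] + prefixed_features("2020", FEATURES_SUBSET)
fields_param = ",".join(fields)
# fields_param == "school.name,ope6_id,2020.aid.pell_grant_rate,2020.cost.tuition.in_state"
```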

In [18]:
years = [
    "2012",
    "2013",
    "2014",
    "2015",
    "2016",
    "2017",
    "2018",
    "2019",
    "2020",
    "2021",
    "2022",
    "2023"
]

## The UC's

Pull the UC data from the College Scorecard API using their 6-digit OPEIDs (`ope6_id`).

In [None]:
uc_scorecard = pd.DataFrame()

for year in years:
    fields = get_feature_list(year)
    fields.insert(0, "school.name")
    fields.insert(1, "ope6_id")
    fields.insert(2, "location")
    params = {
        "api_key": os.getenv("SCORECARDAPI"),
        "ope6_id": ",".join(
            [
                # The 10 UC's:
                "001313",
                "001312",
                "001314",
                "001321",
                "001320",
                "041271",
                "001315",
                "001316",
                "001317",
                "001319",
            ]
        ),
        "fields": ",".join(fields),
        "per_page": 100,
    }
    try:
        time.sleep(1)
        response = requests.get(url, params=params, timeout=60)
        response.raise_for_status()
        # print(year, response.json()["metadata"])
        data = response.json()
        df = pd.json_normalize(data["results"])
        uc_scorecard = pd.concat([uc_scorecard, df], ignore_index=True)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data for year {year}: {e}")

The different reporting years are captured as columns in the API response, so the resulting DataFrame has a block-diagonal structure. Below we convert the reporting years to rows and flatten the DataFrame.
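
To make the block-diagonal structure concrete, the toy example below builds a two-row frame in which each row only has values in its own year's columns, then reshapes it so that the year becomes a column. This sketch uses `melt` and `pivot_table`, which is a different technique from the backfill approach in the next cell; the data values are illustrative.

```python
import re

import pandas as pd

# Toy stand-in for the API result: one row per request year, with values
# only in that year's columns (hence "block-diagonal").
toy = pd.DataFrame(
    {
        "school.name": ["A", "A"],
        "2012.cost.booksupply": [1213.0, None],
        "2013.cost.booksupply": [None, 1250.0],
    }
)

year_cols = [c for c in toy.columns if re.match(r"^20\d{2}\.", c)]

# Wide -> long: one row per (school, year-prefixed column), dropping empty cells.
long = toy.melt(id_vars=["school.name"], value_vars=year_cols).dropna(subset=["value"])

# Split "2012.cost.booksupply" into year "2012" and feature "cost.booksupply".
long[["year", "feature"]] = long["variable"].str.split(".", n=1, expand=True)

# Long -> tidy: one row per (school, year), one column per feature.
tidy = long.pivot_table(
    index=["school.name", "year"], columns="feature", values="value", aggfunc="first"
).reset_index()
```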

In [None]:
# Compress columns like "2012.*" -> "*" and capture the year per row.
_year_re = re.compile(r"^(20\d{2})\.(.+)$")

year_cols = [c for c in uc_scorecard.columns if _year_re.match(c)]
other_cols = [c for c in uc_scorecard.columns if not _year_re.match(c)]

# Map base column name -> sorted list of (year, colname)
base_map = defaultdict(list)
for c in year_cols:
    y, base = _year_re.match(c).groups()
    base_map[base].append((int(y), c))
for base in base_map:
    base_map[base].sort(key=lambda t: t[0])  # sort by year

# Build compressed DataFrame by taking the first non-null value across
# year-columns for each base name
compressed = pd.DataFrame(index=uc_scorecard.index)
for base, year_col_pairs in base_map.items():
    cols_sorted = [col for _, col in year_col_pairs]
    # bfill across columns then take first column -> first non-null value in
    # the row
    compressed[base] = uc_scorecard[cols_sorted].bfill(axis=1).iloc[:, 0]

# Copy over non-year columns (school.name, ope6_id)
for c in other_cols:
    compressed[c] = uc_scorecard[c]

# Build a 'year' column
years_list = years if "years" in globals() else [
    str(y) for y in range(2012, 2024)
]
year_series = pd.Series(index=uc_scorecard.index, dtype="object")
for y in years_list:
    prefixed = [col for col in year_cols if col.startswith(f"{y}.")]
    if not prefixed:
        continue
    mask = uc_scorecard[prefixed].notna().any(axis=1)
    year_series.loc[mask & year_series.isna()] = y

compressed["year"] = year_series

# Reorder columns: year and known meta columns first, then features
meta_first = ["year"] + [
    c for c in ("school.name", "ope6_id") if c in compressed.columns
]
feature_cols = [c for c in compressed.columns if c not in meta_first]
compressed = compressed[meta_first + feature_cols]


uc_scorecard_compact = compressed.reset_index(drop=True)
uc_scorecard_compact.head()

Unnamed: 0,year,school.name,ope6_id,aid.ftft_pell_grant_rate,aid.ftft_federal_loan_rate,aid.pell_grant_rate,aid.federal_loan_rate,aid.loan_principal,aid.median_debt.pell_grant,aid.median_debt.no_pell_grant,...,cost.booksupply,cost.roomboard.oncampus,cost.otherexpense.oncampus,cost.roomboard.offcampus,cost.otherexpense.offcampus,cost.otherexpense.withfamily,admissions.admission_rate.overall,aid.median_debt.income.0_30000,aid.median_debt.income.30001_75000,aid.median_debt.income.greater_than_75000
0,2012,University of California-Berkeley,1312,0.2441,0.2897,0.333,0.3181,13117.0,13008.0,13600.0,...,1213.0,15304.0,3497.0,10281.0,4121.0,9682.0,0.2161,12051.5,13215.5,15000.0
1,2012,University of California-Davis,1313,0.421,0.4902,0.4268,0.4616,12300.0,12849.0,11000.0,...,1602.0,13503.0,3186.0,8247.0,4162.0,9280.0,0.4826,11667.0,13000.0,12081.0
2,2012,University of California-Irvine,1314,0.4487,0.5092,0.4001,0.4443,14500.0,13666.0,15000.0,...,1567.0,11706.0,3633.0,9635.0,4605.0,8824.0,0.4746,12499.5,14544.0,15439.0
3,2012,University of California-Los Angeles,1315,0.3152,0.3854,0.3617,0.3868,13300.0,13740.0,12500.0,...,1521.0,14232.0,3487.0,10336.0,4375.0,9218.0,0.2711,12050.0,14148.0,14950.0
4,2012,University of California-Riverside,1316,0.5907,0.5775,0.5678,0.5662,15000.0,15000.0,15000.0,...,1784.0,13240.0,3613.0,8890.0,4287.0,8621.0,0.6816,14000.0,15458.0,16500.0


In [None]:
uc_scorecard_compact.to_csv(
    "./ucd_sta_221_project/data_files/uc_scorecard.csv",
    index=False
)
uc_scorecard_compact.head()

Unnamed: 0,year,school.name,ope6_id,aid.ftft_pell_grant_rate,aid.ftft_federal_loan_rate,aid.pell_grant_rate,aid.federal_loan_rate,aid.loan_principal,aid.median_debt.pell_grant,aid.median_debt.no_pell_grant,...,cost.booksupply,cost.roomboard.oncampus,cost.otherexpense.oncampus,cost.roomboard.offcampus,cost.otherexpense.offcampus,cost.otherexpense.withfamily,admissions.admission_rate.overall,aid.median_debt.income.0_30000,aid.median_debt.income.30001_75000,aid.median_debt.income.greater_than_75000
0,2012,University of California-Berkeley,1312,0.2441,0.2897,0.333,0.3181,13117.0,13008.0,13600.0,...,1213.0,15304.0,3497.0,10281.0,4121.0,9682.0,0.2161,12051.5,13215.5,15000.0
1,2012,University of California-Davis,1313,0.421,0.4902,0.4268,0.4616,12300.0,12849.0,11000.0,...,1602.0,13503.0,3186.0,8247.0,4162.0,9280.0,0.4826,11667.0,13000.0,12081.0
2,2012,University of California-Irvine,1314,0.4487,0.5092,0.4001,0.4443,14500.0,13666.0,15000.0,...,1567.0,11706.0,3633.0,9635.0,4605.0,8824.0,0.4746,12499.5,14544.0,15439.0
3,2012,University of California-Los Angeles,1315,0.3152,0.3854,0.3617,0.3868,13300.0,13740.0,12500.0,...,1521.0,14232.0,3487.0,10336.0,4375.0,9218.0,0.2711,12050.0,14148.0,14950.0
4,2012,University of California-Riverside,1316,0.5907,0.5775,0.5678,0.5662,15000.0,15000.0,15000.0,...,1784.0,13240.0,3613.0,8890.0,4287.0,8621.0,0.6816,14000.0,15458.0,16500.0


## The CC's

Pull the CC data from the College Scorecard API by attempting to query based on college name and lat/lon coordinates. This was the most successful way to capture the correct colleges and not include extraneous colleges.

If providing lat/lon returns no results, the query is retried using college name only. In testing, it was found that this only occurs for two schools, and searching for them by name alone provided the correct results.
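
The retry logic described above can be sketched independently of the API. Here `search` is a hypothetical callable that stands in for the request function and returns a list of matches; the actual code in this section works with DataFrames instead.

```python
def search_with_fallback(search, name, latitude=None, longitude=None, distance=None):
    """Try a location-constrained search first; fall back to name only if empty."""
    matches = search(name, latitude=latitude, longitude=longitude, distance=distance)
    if matches:
        return matches, "location"
    # No match near the coordinates: broaden to a name-only query.
    return search(name, latitude=None, longitude=None, distance=None), "name_only"
```

Returning the search mode alongside the matches makes it easy to log which colleges needed the broader query.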

In [None]:
def fetch_college_data(
        fields: list[str],
        college_name: str,
        latitude: str | None = None,
        longitude: str | None = None,
        distance: int | None = None,
    ) -> pd.DataFrame:
    """ 
    Fetch College Scorecard data for a specific college. College is queried by
    name, and optionally by location (latitude, longitude) and distance from that
    location. If no results are found when searching with location, a broader
    search is performed without location constraints.
    
    :param fields: List of fields to retrieve from the College Scorecard API
    :param college_name: Name of the college to search for
    :param latitude: Latitude of the location to search around (optional)
    :param longitude: Longitude of the location to search around (optional)
    :param distance: Distance (mi) from the location to search within (optional)
    :return: DataFrame containing the College Scorecard data for the college
    """

    params = {
        "api_key": os.getenv("SCORECARDAPI"),
        "school.name": college_name,
        "school.state": "CA",
        "fields": ",".join(fields),
        "per_page": 10,
    }

    # Add the optional location constraints only when they are provided.
    if latitude:
        params["location.lat"] = latitude
    if longitude:
        params["location.lon"] = longitude
    if distance:
        params["distance"] = distance

    try:
        response = requests.get(url, params=params, timeout=60)
        response.raise_for_status()
        data = response.json()
        results_list = data.get("results", [])
        results_df = pd.json_normalize(results_list, sep='.')
        return results_df
    except Exception as e:
        print(f"* Error processing {college_name}: {e}")
        return pd.DataFrame()

In [None]:
# NOTE: This is a large batch of API GET requests (~116 schools x 12 years =
# 1,392 rows). The College Scorecard API has rate limiting (1000 requests per
# hour), so we include significant delays between requests.

cc_colleges_df = get_ccc_colleges()
frames = []

for year in years:
    # Build a new list so the list returned by get_feature_list is not mutated.
    fields = ["school.name", "ope6_id", "location"] + get_feature_list(year)
    for index, row in cc_colleges_df.iterrows():
        time.sleep(3.5)
        # print(f"Processing {row['CollegeName']} for year {year}...")
        college_data = fetch_college_data(
            fields=fields,
            college_name=row["CollegeName"],
            latitude=row["Latitude"],
            longitude=row["Longitude"],
            distance=1,
        )

        # There are two colleges (Cerritos College and Palo Verde College)
        # that return no results when lat/lon is provided. They both return
        # unique results when queried without lat/lon and distance.
        if college_data.empty:
            time.sleep(3.5)
            # print(f"* College not found. Broadening search...")
            college_data = fetch_college_data(
                fields=fields,
                college_name=row["CollegeName"],
            )

        if not college_data.empty:
            frames.append(college_data)

# Concatenate once at the end instead of growing the DataFrame in the loop.
cc_scorecard = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

The different reporting years are captured as columns in the API response, so the resulting DataFrame has a block-diagonal structure. Below we convert the reporting years to rows and flatten the DataFrame.
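
To make the block-diagonal shape concrete, here is a minimal sketch on a toy frame. The column names and values are invented for illustration (only the `<year>.<field>` naming follows the API response), and the year is derived with `idxmax`, which is a shortcut rather than the loop used in this notebook.

```python
import pandas as pd

# Toy stand-in for the API response: each row has values only under its own
# year prefix, which produces the block-diagonal pattern.
toy = pd.DataFrame(
    {
        "school.name": ["College A", "College A"],
        "2012.cost.booksupply": [1500.0, None],
        "2013.cost.booksupply": [None, 1600.0],
    }
)

year_cols = ["2012.cost.booksupply", "2013.cost.booksupply"]

# Take the first non-null value across the year-prefixed columns, row by row.
flat = pd.DataFrame({"school.name": toy["school.name"]})
flat["cost.booksupply"] = toy[year_cols].bfill(axis=1).iloc[:, 0]

# The reporting year is the prefix of whichever column holds the value.
flat["year"] = toy[year_cols].notna().idxmax(axis=1).str.split(".").str[0]

print(flat)
```

Each input row collapses to a single `cost.booksupply` value plus the year it came from, which is the long format the rest of the analysis expects.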

In [84]:
# Compress columns like "2012.*" -> "*" and capture the year per row.
_year_re = re.compile(r"^(20\d{2})\.(.+)$")

year_cols = [c for c in cc_scorecard.columns if _year_re.match(c)]
other_cols = [c for c in cc_scorecard.columns if not _year_re.match(c)]

# Map base column name -> sorted list of (year, colname)
base_map = defaultdict(list)
for c in year_cols:
    y, base = _year_re.match(c).groups()
    base_map[base].append((int(y), c))
for base in base_map:
    base_map[base].sort(key=lambda t: t[0])  # sort by year

# Build compressed DataFrame by taking the first non-null value across
# year-columns for each base name
compressed = pd.DataFrame(index=cc_scorecard.index)
for base, year_col_pairs in base_map.items():
    cols_sorted = [col for _, col in year_col_pairs]
    # bfill across columns then take first column -> first non-null value in
    # the row
    compressed[base] = cc_scorecard[cols_sorted].bfill(axis=1).iloc[:, 0]

# Copy over non-year columns (school.name, ope6_id)
for c in other_cols:
    compressed[c] = cc_scorecard[c]

# Build a 'year' column
years_list = years if "years" in globals() else [
    str(y) for y in range(2012, 2024)
]
year_series = pd.Series(index=cc_scorecard.index, dtype="object")
for y in years_list:
    prefixed = [col for col in year_cols if col.startswith(f"{y}.")]
    if not prefixed:
        continue
    mask = cc_scorecard[prefixed].notna().any(axis=1)
    year_series.loc[mask & year_series.isna()] = y

compressed["year"] = year_series

# Reorder columns: year and known meta columns first, then features
meta_first = ["year"] + [
    c for c in ("school.name", "ope6_id") if c in compressed.columns
]
feature_cols = [c for c in compressed.columns if c not in meta_first]
compressed = compressed[meta_first + feature_cols]


cc_scorecard_compact = compressed.reset_index(drop=True)
cc_scorecard_compact.head()

Unnamed: 0,year,school.name,ope6_id,aid.ftft_pell_grant_rate,aid.ftft_federal_loan_rate,aid.pell_grant_rate,aid.federal_loan_rate,aid.loan_principal,aid.median_debt.pell_grant,aid.median_debt.no_pell_grant,...,cost.booksupply,cost.roomboard.oncampus,cost.otherexpense.oncampus,cost.roomboard.offcampus,cost.otherexpense.offcampus,cost.otherexpense.withfamily,admissions.admission_rate.overall,aid.median_debt.income.0_30000,aid.median_debt.income.30001_75000,aid.median_debt.income.greater_than_75000
0,2012,Cuyamaca College,21113,0.4523,0.0098,0.2902,0.0244,3500.0,3500.0,4500.0,...,1500.0,,,10500.0,4000.0,3700.0,,3500.0,,
1,2012,Grossmont College,1208,0.3899,0.045,0.2348,0.0304,3234.0,3209.5,3500.0,...,1500.0,,,10500.0,4000.0,3700.0,,3152.0,3500.0,3200.0
2,2012,Imperial Valley College,1214,0.7614,0.0,0.5777,0.0,,,,...,1665.0,,,10962.0,4158.0,4275.0,,,,
3,2012,MiraCosta College,1239,0.2165,0.0184,0.0985,0.0125,3500.0,3500.0,3500.0,...,1656.0,,,10962.0,4158.0,4266.0,,3500.0,,
4,2012,Palomar College,1260,0.3291,0.0198,0.1508,0.0146,3840.0,4474.0,3500.0,...,1666.0,,,10962.0,4160.0,4272.0,,4194.0,3500.0,3500.0


Two CCs have associated beauty schools that were also returned by the GET requests. They are removed below.

In [None]:
cc_scorecard_compact = cc_scorecard_compact[
    ~cc_scorecard_compact["school.name"]
    .str.contains("Beauty", case=False, na=False)
]

In [85]:
cc_scorecard_compact.to_csv(
    "./ucd_sta_221_project/data_files/cc_scorecard.csv",
    index=False
)
cc_scorecard_compact.head()

Unnamed: 0,year,school.name,ope6_id,aid.ftft_pell_grant_rate,aid.ftft_federal_loan_rate,aid.pell_grant_rate,aid.federal_loan_rate,aid.loan_principal,aid.median_debt.pell_grant,aid.median_debt.no_pell_grant,...,cost.booksupply,cost.roomboard.oncampus,cost.otherexpense.oncampus,cost.roomboard.offcampus,cost.otherexpense.offcampus,cost.otherexpense.withfamily,admissions.admission_rate.overall,aid.median_debt.income.0_30000,aid.median_debt.income.30001_75000,aid.median_debt.income.greater_than_75000
0,2012,Cuyamaca College,21113,0.4523,0.0098,0.2902,0.0244,3500.0,3500.0,4500.0,...,1500.0,,,10500.0,4000.0,3700.0,,3500.0,,
1,2012,Grossmont College,1208,0.3899,0.045,0.2348,0.0304,3234.0,3209.5,3500.0,...,1500.0,,,10500.0,4000.0,3700.0,,3152.0,3500.0,3200.0
2,2012,Imperial Valley College,1214,0.7614,0.0,0.5777,0.0,,,,...,1665.0,,,10962.0,4158.0,4275.0,,,,
3,2012,MiraCosta College,1239,0.2165,0.0184,0.0985,0.0125,3500.0,3500.0,3500.0,...,1656.0,,,10962.0,4158.0,4266.0,,3500.0,,
4,2012,Palomar College,1260,0.3291,0.0198,0.1508,0.0146,3840.0,4474.0,3500.0,...,1666.0,,,10962.0,4160.0,4272.0,,4194.0,3500.0,3500.0


# UC2CC Data

In [60]:
import io, requests, pandas as pd
import time

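def get_transfer_data_major(year, cc):
    """
    Download UC transfer enrollments by major for one community college and
    one academic year. Returns None if the response cannot be parsed into the
    expected columns.
    """
    base = "https://visualizedata.ucop.edu/t/Public/views/Transfersbymajor/Bycommunitycollege.csv"

    params = [
        (":showVizHome", "no"),
        ("Sch Src Name", cc),
        ("Academic year", year),
    ]

    r = requests.get(base, params=params, timeout=60)
    r.raise_for_status()
    time.sleep(1.5)  # pause between consecutive requests

    try:
        df = pd.read_csv(io.StringIO(r.text))
        df['CC'] = cc

        df1 = df.dropna()

        # Drop rows whose CIP family title is 'All' and keep the columns of interest.
        uc2cc = df1[df1['CIP family title'] != 'All'].iloc[:, [0, 1, 3, 5, 6, 7]]
        uc2cc.columns = ['Year', 'UC', 'Field', 'Major', 'Enrolls', 'CC']
    except Exception:
        # For example, an empty CSV or an unexpected column layout.
        return None
    return uc2cc


year_list = [
    "2012-13", "2013-14", "2014-15", "2015-16", "2016-17", "2017-18",
    "2018-19", "2019-20", "2020-21", "2021-22", "2022-23", "2023-24",
]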

cc_list = [
    "ALLAN HANCOCK COLLEGE",
    "AMERICAN RIVER COLLEGE",
    "ANTELOPE VALLEY COLLEGE",
    "BAKERSFIELD COLLEGE",
    "BARSTOW COLLEGE",
    "BARSTOW COMMUNITY COLLEGE",
    "BERKELEY CITY COLLEGE",
    "BUTTE COLLEGE",
    "CABRILLO COLLEGE",
    "CANADA COLLEGE",
    "CERRITOS COLLEGE",
    "CERRO COSO COMMUNITY COLLEGE",
    "CHABOT COLLEGE",
    "CHAFFEY COLLEGE",
    "CITRUS COLLEGE",
    "CITY COLLEGE OF SAN FRANCISCO",
    "CITY COLLEGE SAN FRANCISCO",
    "CLOVIS COMMUNITY COLLEGE",
    "COASTLINE COMMUNITY COLLEGE",
    "COLLEGE OF ALAMEDA",
    "COLLEGE OF MARIN",
    "COLLEGE OF SAN MATEO",
    "COLLEGE OF THE CANYONS",
    "COLLEGE OF THE DESERT",
    "COLLEGE OF THE REDWOODS",
    "COLLEGE OF THE SEQUOIAS",
    "COLLEGE OF THE SISKIYOUS",
    "COLUMBIA COLLEGE",
    "COMPTON COLLEGE",
    "CONTRA COSTA COLLEGE",
    "COPPER MOUNTAIN COLLEGE",
    "COSUMNES RIVER COLLEGE",
    "CRAFTON HILLS COLLEGE",
    "CUESTA COLLEGE",
    "CUYAMACA COLLEGE",
    "CYPRESS COLLEGE",
    "DE ANZA COLLEGE",
    "DIABLO VALLEY COLLEGE",
    "EAST LOS ANGELES COLLEGE",
    "EL CAMINO COLLEGE",
    "EVERGREEN VALLEY COLLEGE",
    "FEATHER RIVER COLLEGE",
    "FOLSOM LAKE COLLEGE",
    "FOOTHILL COLLEGE",
    "FRESNO CITY COLLEGE",
    "FULLERTON COLLEGE",
    "GAVILAN COLLEGE",
    "GLENDALE COMMUNITY COLLEGE",
    "GOLDEN WEST COLLEGE",
    "GROSSMONT CMTY COLLEGE",
    "GROSSMONT COLLEGE",
    "HARTNELL COLLEGE",
    "IMPERIAL VALLEY COLLEGE",
    "IRVINE VALLEY COLLEGE",
    "LAKE TAHOE COMMUNITY COLLEGE",
    "LANEY COLLEGE",
    "LAS POSITAS COLLEGE",
    "LASSEN COLLEGE",
    "LONG BEACH CITY COLLEGE",
    "LOS ANGELES CITY COLLEGE",
    "LOS ANGELES HARBOR COLLEGE",
    "LOS ANGELES MISSION COLLEGE",
    "LOS ANGELES PIERCE COLLEGE",
    "LOS ANGELES SOUTHWEST COLLEGE",
    "LOS ANGELES TRADE TECHNICAL COLLEGE",
    "LOS ANGELES TRADE-TECH COLLEGE",
    "LOS ANGELES VALLEY COLLEGE",
    "LOS MEDANOS COLLEGE",
    "MADERA COMMUNITY COLLEGE",
    "MENDOCINO COLLEGE",
    "MERCED COLLEGE",
    "MERRITT COLLEGE",
    "MIRACOSTA COLLEGE",
    "MISSION COLLEGE",
    "MODESTO JUNIOR COLLEGE",
    "MONTEREY PENINSULA COLLEGE",
    "MOORPARK COLLEGE",
    "MORENO VALLEY COLLEGE",
    "MOUNT SAN ANTONIO COLLEGE",
    "MOUNT SAN JACINTO COLLEGE",
    "MT. SAN ANTONIO COLLEGE",
    "NAPA VALLEY COLLEGE",
    "NORCO COLLEGE",
    "OHLONE COLLEGE",
    "ORANGE COAST COLLEGE",
    "OXNARD COLLEGE",
    "PALOMAR COLLEGE",
    "PALO VERDE COLLEGE",
    "PASADENA CITY COLLEGE",
    "PORTERVILLE COLLEGE",
    "RANCHO SANTIAGO CANYON COLLEGE",
    "REEDLEY COLLEGE",
    "RIO HONDO COLLEGE",
    "RIVERSIDE CITY COLLEGE",
    "SACRAMENTO CITY COLLEGE",
    "SADDLEBACK COLLEGE",
    "SAN BERNARDINO VALLEY COLLEGE",
    "SAN DIEGO CITY COLLEGE",
    "SAN DIEGO MESA COLLEGE",
    "SAN DIEGO MIRAMAR COLLEGE",
    "SAN JOAQUIN DELTA COLLEGE",
    "SAN JOSE CITY COLLEGE",
    "SANTA ANA COLLEGE",
    "SANTA BARBARA CITY COLLEGE",
    "SANTA MONICA COLLEGE",
    "SANTA ROSA JUNIOR COLLEGE",
    "SANTIAGO CANYON COLLEGE",
    "SHASTA COLLEGE",
    "SIERRA COLLEGE",
    "SKYLINE COLLEGE",
    "SOLANO COMMUNITY COLLEGE",
    "SOUTHWESTERN COLLEGE",
    "TAFT COLLEGE",
    "VENTURA COLLEGE",
    "VICTOR VALLEY COLLEGE",
    "WEST HILLS COLLEGE COALINGA",
    "WEST HILLS COLLEGE LEMOORE",
    "WEST LOS ANGELES COLLEGE",
    "WEST VALLEY COLLEGE",
    "WOODLAND COMMUNITY COLLEGE",
    "YUBA COLLEGE"
]

In [61]:
from tqdm import tqdm

uc2cc_major_list = []

for cc in tqdm(cc_list, desc = 'CC loop'):
    for year in year_list:
        uc2cc_major = get_transfer_data_major(year, cc)
        uc2cc_major_list.append(uc2cc_major)

CC loop: 100%|██████████| 121/121 [59:03<00:00, 29.28s/it]


In [62]:
uc2cc_major_agg = pd.concat([uc2cc_major for uc2cc_major in uc2cc_major_list if uc2cc_major is not None], ignore_index=True)

uc2cc_major_agg.to_csv(
    "./ucd_sta_221_project/data_files/uc2cc_major.csv",
    index=False
)

In [8]:
cc2uc_major_agg = pd.read_csv("./ucd_sta_221_project/data_files/uc2cc_major.csv")
cc2uc_major_agg.to_csv(
    "./ucd_sta_221_project/data_files/cc2uc_major.csv",
    index=False
)

cc2uc_major_agg

Unnamed: 0,Year,UC,Field,Major,Enrolls,CC
0,2012-13,UCSB,BIOLOGICAL AND BIOMEDICAL SCIENCES,"Biology, General",3,ALLAN HANCOCK COLLEGE
1,2012-13,UCSB,"COMMUNICATION, JOURNALISM, AND RELATED PROGRAMS",Communication and Media Studies,3,ALLAN HANCOCK COLLEGE
2,2012-13,UCSB,ENGLISH LANGUAGE AND LITERATURE/LETTERS,"English Language and Literature, General",3,ALLAN HANCOCK COLLEGE
3,2012-13,UCSC,"FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS","Linguistic, Comparative, and Related Language ...",3,ALLAN HANCOCK COLLEGE
4,2012-13,UCSB,PSYCHOLOGY,"Psychology, General",6,ALLAN HANCOCK COLLEGE
...,...,...,...,...,...,...
21144,2021-22,UCD,PSYCHOLOGY,Research and Experimental Psychology,4,YUBA COLLEGE
21145,2021-22,UCD,SOCIAL SCIENCES,Economics,3,YUBA COLLEGE
21146,2022-23,UCD,PSYCHOLOGY,Research and Experimental Psychology,5,YUBA COLLEGE
21147,2023-24,UCD,FAMILY AND CONSUMER SCIENCES/HUMAN SCIENCES,"Human Development, Family Studies, and Related...",5,YUBA COLLEGE


In [41]:
import io, requests, pandas as pd
import time

def get_transfer_data_3status_eth(year, uc):
    """Download transfer counts (App/Adm/Enr) by ethnicity for one UC campus and year."""
    base = "https://visualizedata.ucop.edu/t/Public/views/AdmissionsDataTable/TREthbyYr.csv"

    params = [("Campus", uc), ("Academic Yr", year)]

    r = requests.get(base, params=params, timeout=60)
    r.raise_for_status()
    time.sleep(1.5)  # pause between consecutive requests

    df = pd.read_csv(io.StringIO(r.text))
    # df1 = df[df['uad_uc_ethn_7_cat'] == 'All'].iloc[:,[1,2,3,4,6]]
    # Copy the slice so the assignments below do not trigger pandas'
    # SettingWithCopyWarning.
    df1 = df.iloc[:, [1, 2, 3, 4, 5, 6]].copy()
    df1['UC'] = uc
    df1['Year'] = year
    df2 = df1.rename(columns={"Pivot Field Values": 'Num', "uad_uc_ethn_7_cat": "Ethnicity"})
    df2 = df2.fillna(0)

    return df2


def get_transfer_data_3status_gnd(year, uc):
    """Download transfer counts (App/Adm/Enr) by gender for one UC campus and year."""
    base = "https://visualizedata.ucop.edu/t/Public/views/AdmissionsDataTable/TRGndbyYr.csv"

    params = [("Campus", uc), ("Academic year", year)]

    r = requests.get(base, params=params, timeout=60)
    r.raise_for_status()
    time.sleep(1.5)  # pause between consecutive requests

    df = pd.read_csv(io.StringIO(r.text))
    # Copy the slice so the assignments below do not trigger pandas'
    # SettingWithCopyWarning.
    df1 = df.iloc[:, [1, 2, 3, 4, 5, 6]].copy()
    df1['UC'] = uc
    df1['Year'] = year
    df2 = df1.rename(columns={"Pivot Field Values": 'Num', "gender": "Gender"})
    df2 = df2.fillna(0)

    return df2


# Start from 2005-06 because UC Merced (UCM) opened in 2005.

year_list = [
    "2005-06",
    "2006-07", "2007-08", "2008-09", "2009-10", "2010-11", "2011-12",
    "2012-13", "2013-14", "2014-15", "2015-16", "2016-17", "2017-18",
    "2018-19", "2019-20", "2020-21", "2021-22", "2022-23", "2023-24",
]

uc_list = [
    'Berkeley', 'Davis', 'Irvine', 'Los Angeles', 'Merced', 'Riverside',
    'San Diego', 'Santa Barbara', 'Santa Cruz',
]

In [42]:
uc2cc_3status_eth_list = []

for year in tqdm(year_list, desc = 'Year loop'):
    for uc in uc_list:
        uc2cc_3status_eth = get_transfer_data_3status_eth(year, uc)
        uc2cc_3status_eth_list.append(uc2cc_3status_eth)

Year loop: 100%|██████████| 19/19 [09:52<00:00, 31.19s/it]


In [49]:
uc2cc_3status_eth_agg = pd.concat(uc2cc_3status_eth_list, ignore_index=True)

uc2cc_3status_eth_agg.to_csv(
    "./ucd_sta_221_project/data_files/uc2cc_3status_eth.csv",
    index=False
)

uc2cc_3status_eth_agg.head()

Unnamed: 0,City,Count,County,School,Ethnicity,Num,UC,Year
0,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,All,29.0,Berkeley,2005-06
1,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,African American,0.0,Berkeley,2005-06
2,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Hispanic/ Latinx,0.0,Berkeley,2005-06
3,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Hispanic/ Latinx,7.0,Berkeley,2005-06
4,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Asian,0.0,Berkeley,2005-06
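
The preview above contains repeated key combinations: `Hispanic/ Latinx` appears twice for the same school, campus, and year, once with 0.0 and once with 7.0. A minimal sketch of keeping the row with the largest `Num` per key is shown here on a toy frame that mirrors those rows. It uses `drop_duplicates` for brevity, which is an assumption about what is wanted and not necessarily the approach taken elsewhere in this notebook; a `groupby(...).first()` variant differs slightly because it takes the first non-null value per column.

```python
import pandas as pd

# Toy frame mirroring the duplicated rows in the preview.
toy = pd.DataFrame(
    {
        "School": ["ALLAN HANCOCK COLLEGE"] * 3,
        "Ethnicity": ["All", "Hispanic/ Latinx", "Hispanic/ Latinx"],
        "Num": [29.0, 0.0, 7.0],
    }
)

# Columns that together identify one record.
key_cols = ["School", "Ethnicity"]

deduped = (
    toy.sort_values("Num", ascending=False)          # largest Num first
    .drop_duplicates(subset=key_cols, keep="first")  # keep one row per key
    .sort_index()                                    # restore original order
    .reset_index(drop=True)
)

print(deduped)
```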


In [12]:
import numpy as np
import pandas as pd

uc2cc_3status_eth_agg = pd.read_csv("./ucd_sta_221_project/data_files/uc2cc_3status_eth.csv")

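# Every column except "Num" identifies a record (the key).
key_cols = [c for c in uc2cc_3status_eth_agg.columns if c != "Num"]

# Remember the original row order so it can be restored after grouping.
uc2cc_3status_eth_agg2 = uc2cc_3status_eth_agg.copy()
uc2cc_3status_eth_agg2["__idx__"] = np.arange(len(uc2cc_3status_eth_agg2))

# Numeric copy of "Num" used for sorting; non-numeric values count as 0.
uc2cc_3status_eth_agg2["__num__"] = pd.to_numeric(uc2cc_3status_eth_agg2["Num"], errors="coerce").fillna(0)

# For each key, keep the entry with the largest "Num".
uc2cc_3status_eth_agg_best = (
    uc2cc_3status_eth_agg2.sort_values("__num__", ascending=False)
       .groupby(key_cols, as_index=False)
       .first()
)

# Restore the original order and drop the helper columns.
uc2cc_3status_eth_agg_out = (
    uc2cc_3status_eth_agg_best.sort_values("__idx__")
           .drop(columns=["__idx__", "__num__"])
)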


uc2cc_3status_eth_agg_out.to_csv(
    "./ucd_sta_221_project/data_files/uc2cc_3status_eth.csv",
    index=False
)

uc2cc_3status_eth_agg_out

Unnamed: 0,City,Count,County,School,Ethnicity,UC,Year,Num
295403,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,All,Berkeley,2005-06,29.0
295265,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,African American,Berkeley,2005-06,0.0
295942,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Hispanic/ Latinx,Berkeley,2005-06,7.0
295643,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Asian,Berkeley,2005-06,0.0
296177,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,White,Berkeley,2005-06,18.0
...,...,...,...,...,...,...,...,...
126606,Marysville,Enr,Yuba,YUBA COLLEGE,American Indian,Santa Cruz,2023-24,0.0
126998,Marysville,Enr,Yuba,YUBA COLLEGE,Hispanic/ Latinx,Santa Cruz,2023-24,0.0
126775,Marysville,Enr,Yuba,YUBA COLLEGE,Asian,Santa Cruz,2023-24,1.0
127181,Marysville,Enr,Yuba,YUBA COLLEGE,White,Santa Cruz,2023-24,0.0


In [2]:
import pandas as pd

cc2uc_3status_eth_agg = pd.read_csv("./ucd_sta_221_project/data_files/uc2cc_3status_eth.csv")
cc2uc_3status_eth_agg.to_csv(
    "./ucd_sta_221_project/data_files/cc2uc_3status_eth.csv",
    index=False
)

cc2uc_3status_eth_agg

Unnamed: 0,City,Count,County,School,Ethnicity,UC,Year,Num
0,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,All,Berkeley,2005-06,29.0
1,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,African American,Berkeley,2005-06,0.0
2,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Hispanic/ Latinx,Berkeley,2005-06,7.0
3,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,Asian,Berkeley,2005-06,0.0
4,Santa Maria,App,Santa Barbara,ALLAN HANCOCK COLLEGE,White,Berkeley,2005-06,18.0
...,...,...,...,...,...,...,...,...
366748,Marysville,Enr,Yuba,YUBA COLLEGE,American Indian,Santa Cruz,2023-24,0.0
366749,Marysville,Enr,Yuba,YUBA COLLEGE,Hispanic/ Latinx,Santa Cruz,2023-24,0.0
366750,Marysville,Enr,Yuba,YUBA COLLEGE,Asian,Santa Cruz,2023-24,1.0
366751,Marysville,Enr,Yuba,YUBA COLLEGE,White,Santa Cruz,2023-24,0.0


In [50]:
uc2cc_3status_gnd_list = []

for year in tqdm(year_list, desc = 'Year loop'):
    for uc in uc_list:
        uc2cc_3status_gnd = get_transfer_data_3status_gnd(year, uc)
        uc2cc_3status_gnd_list.append(uc2cc_3status_gnd)

Year loop: 100%|██████████| 19/19 [07:58<00:00, 25.16s/it]


In [51]:
uc2cc_3status_gnd_agg = pd.concat(uc2cc_3status_gnd_list, ignore_index=True)

uc2cc_3status_gnd_agg.to_csv(
    "./ucd_sta_221_project/data_files/uc2cc_3status_gnd.csv",
    index=False
)

uc2cc_3status_gnd_agg.head()

Unnamed: 0,City,Count,County,Gender,School,Num,UC,Year
0,Santa Maria,App,Santa Barbara,All,ALLAN HANCOCK COLLEGE,45,Berkeley,2005-06
1,Santa Maria,App,Santa Barbara,Female,ALLAN HANCOCK COLLEGE,0,Berkeley,2005-06
2,Santa Maria,App,Santa Barbara,Female,ALLAN HANCOCK COLLEGE,14,Berkeley,2005-06
3,Santa Maria,App,Santa Barbara,Male,ALLAN HANCOCK COLLEGE,0,Berkeley,2005-06
4,Santa Maria,App,Santa Barbara,Male,ALLAN HANCOCK COLLEGE,27,Berkeley,2005-06


In [20]:
import numpy as np
import pandas as pd

uc2cc_3status_gnd_agg = pd.read_csv("./ucd_sta_221_project/data_files/uc2cc_3status_gnd.csv")

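# Every column except "Num" identifies a record (the key).
key_cols = [c for c in uc2cc_3status_gnd_agg.columns if c != "Num"]

# Remember the original row order so it can be restored after grouping.
uc2cc_3status_gnd_agg2 = uc2cc_3status_gnd_agg.copy()
uc2cc_3status_gnd_agg2["__idx__"] = np.arange(len(uc2cc_3status_gnd_agg2))

# Numeric copy of "Num" used for sorting; non-numeric values count as 0.
uc2cc_3status_gnd_agg2["__num__"] = pd.to_numeric(uc2cc_3status_gnd_agg2["Num"], errors="coerce").fillna(0)

# For each key, keep the entry with the largest "Num".
uc2cc_3status_gnd_agg_best = (
    uc2cc_3status_gnd_agg2.sort_values("__num__", ascending=False)
       .groupby(key_cols, as_index=False)
       .first()
)

# Restore the original order and drop the helper columns.
uc2cc_3status_gnd_agg_out = (
    uc2cc_3status_gnd_agg_best.sort_values("__idx__")
           .drop(columns=["__idx__", "__num__"])
)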


uc2cc_3status_gnd_agg_out.to_csv(
    "./ucd_sta_221_project/data_files/uc2cc_3status_gnd.csv",
    index=False
)

uc2cc_3status_gnd_agg_out.head(10)

Unnamed: 0,City,Count,County,Gender,School,UC,Year,Num
199880,Santa Maria,App,Santa Barbara,All,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,45
200051,Santa Maria,App,Santa Barbara,Female,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,14
200222,Santa Maria,App,Santa Barbara,Male,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,27
200393,Santa Maria,App,Santa Barbara,Other,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,0
200564,Santa Maria,App,Santa Barbara,Unknown,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,0
199044,Santa Maria,Adm,Santa Barbara,All,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,12
199215,Santa Maria,Adm,Santa Barbara,Female,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,5
199386,Santa Maria,Adm,Santa Barbara,Male,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,6
199557,Santa Maria,Adm,Santa Barbara,Other,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,0
199728,Santa Maria,Adm,Santa Barbara,Unknown,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,0


In [7]:
cc2uc_3status_gnd_agg = pd.read_csv("./ucd_sta_221_project/data_files/uc2cc_3status_gnd.csv")
cc2uc_3status_gnd_agg.to_csv(
    "./ucd_sta_221_project/data_files/cc2uc_3status_gnd.csv",
    index=False
)

cc2uc_3status_gnd_agg

Unnamed: 0,City,Count,County,Gender,School,UC,Year,Num
0,Santa Maria,App,Santa Barbara,All,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,45
1,Santa Maria,App,Santa Barbara,Female,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,14
2,Santa Maria,App,Santa Barbara,Male,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,27
3,Santa Maria,App,Santa Barbara,Other,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,0
4,Santa Maria,App,Santa Barbara,Unknown,ALLAN HANCOCK COLLEGE,Berkeley,2005-06,0
...,...,...,...,...,...,...,...,...
250966,Marysville,Adm,Yuba,Other,YUBA COLLEGE,Santa Cruz,2023-24,0.0
250967,Marysville,Enr,Yuba,All,YUBA COLLEGE,Santa Cruz,2023-24,0.0
250968,Marysville,Enr,Yuba,Female,YUBA COLLEGE,Santa Cruz,2023-24,0.0
250969,Marysville,Enr,Yuba,Male,YUBA COLLEGE,Santa Cruz,2023-24,0.0
