## Problem Statement

This project aims to understand the prevalence of standardized test taking among California high school students. Some questions I'd like to consider are:
* Which schools have the highest testing rates?
* Do schools where more students take standardized tests also score better on those tests?
* Do different tests (ACT, SAT, potentially AP) have different patterns?
* How did testing prevalence / scores change in California during COVID?

### Contents

## Background

**To-Do:**

### Datasets Used

**To-Do:**

### Outside Research

**To-Do:** 

### Cali - ACTs

In [1]:
# Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### View Dataset

In [2]:
act_ca = pd.read_csv('../data/act_2019_ca.csv')

In [3]:
act_ca.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2310 entries, 0 to 2309
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   CDS          2309 non-null   float64
 1   CCode        2309 non-null   float64
 2   CDCode       2309 non-null   float64
 3   SCode        1787 non-null   float64
 4   RType        2309 non-null   object 
 5   SName        1729 non-null   object 
 6   DName        2251 non-null   object 
 7   CName        2309 non-null   object 
 8   Enroll12     2309 non-null   float64
 9   NumTstTakr   2309 non-null   float64
 10  AvgScrRead   1953 non-null   object 
 11  AvgScrEng    1953 non-null   object 
 12  AvgScrMath   1953 non-null   object 
 13  AvgScrSci    1953 non-null   object 
 14  NumGE21      1953 non-null   object 
 15  PctGE21      1953 non-null   object 
 16  Year         2309 non-null   object 
 17  Unnamed: 17  0 non-null      float64
dtypes: float64(7), object(11)
memory usage: 325.0+ KB

In [4]:
act_ca.head()

Unnamed: 0,CDS,CCode,CDCode,SCode,RType,SName,DName,CName,Enroll12,NumTstTakr,AvgScrRead,AvgScrEng,AvgScrMath,AvgScrSci,NumGE21,PctGE21,Year,Unnamed: 17
0,33669930000000.0,33.0,3366993.0,129882.0,S,21st Century Learning Institute,Beaumont Unified,Riverside,18.0,0.0,,,,,,,2018-19,
1,19642120000000.0,19.0,1964212.0,1995596.0,S,ABC Secondary (Alternative),ABC Unified,Los Angeles,58.0,0.0,,,,,,,2018-19,
2,15637760000000.0,15.0,1563776.0,1530377.0,S,Abraham Lincoln Alternative,Southern Kern Unified,Kern,18.0,0.0,,,,,,,2018-19,
3,43696660000000.0,43.0,4369666.0,4333795.0,S,Abraham Lincoln High,San Jose Unified,Santa Clara,463.0,53.0,23.0,22.0,22.0,23.0,34.0,64.15,2018-19,
4,19647330000000.0,19.0,1964733.0,1935121.0,S,Abraham Lincoln Senior High,Los Angeles Unified,Los Angeles,226.0,19.0,21.0,20.0,23.0,22.0,11.0,57.89,2018-19,


#### Data Cleaning

In [5]:
act_ca.drop(columns='Unnamed: 17', inplace=True) # This column is all NA

In [6]:
act_ca['PctGE21'].value_counts()

*        532
50.00     22
33.33     10
0.00       9
55.56      9
        ... 
6.56       1
8.00       1
80.81      1
24.53      1
57.58      1
Name: PctGE21, Length: 915, dtype: int64

#### Drop unused columns
For our analysis, we won't use the codes, the school/district/county names, or any of the four subject scores.

In [69]:
act_ca.columns

Index(['CDS', 'CCode', 'CDCode', 'SCode', 'RType', 'SName', 'DName', 'CName',
       'Enroll12', 'NumTstTakr', 'AvgScrRead', 'AvgScrEng', 'AvgScrMath',
       'AvgScrSci', 'NumGE21', 'PctGE21', 'Year', 'ACT_taken_%',
       'ACT_high_score_%'],
      dtype='object')

In [70]:
code_cols = ['CDS', 'CCode', 'CDCode', 'SCode']
name_cols = ['SName', 'DName', 'CName']
subject_cols = ['AvgScrRead', 'AvgScrEng', 'AvgScrMath', 'AvgScrSci']
dropped_cols = code_cols + name_cols + subject_cols
act_ca.drop(columns=dropped_cols, inplace = True)

#### Exclude schools with very low sample size

In [65]:
act_ca = act_ca[act_ca['PctGE21'] != '*'] 

In [66]:
act_ca.loc[act_ca['NumGE21'].isna(), ['NumGE21']] = 0 # NaN means no students took test

##### Should we drop rows with no students taking test?
These rows are needed for participation metrics but not for performance metrics.

In [54]:
act_ca['PctGE21'].isna().sum()
act_ca['PctGE21'].value_counts(dropna = False)

NaN      356
50.00     22
33.33     10
55.56      9
0.00       9
        ... 
6.56       1
8.00       1
80.81      1
24.53      1
57.58      1
Name: PctGE21, Length: 915, dtype: int64

In [55]:
act_ca['RType'].value_counts()

S    1308
D     413
C      55
X       1
Name: RType, dtype: int64

In [62]:
act_ca[['RType', 'PctGE21']].groupby('RType').count()

RType,PctGE21
C,54
D,350
S,1016
X,1
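
One way to avoid choosing between the two needs is to keep two views of the same table: one with every row for participation, and one restricted to rows with a defined pass rate for performance. The sketch below uses a small made-up frame (`toy`) rather than `act_ca`, and the names `participation` and `performance` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for act_ca; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "RType": ["S", "S", "S", "D"],
    "Enroll12": [18, 463, 226, 900],
    "NumTstTakr": [0, 53, 19, 120],
    "PctGE21": [np.nan, 64.15, 57.89, 60.00],
})

# Participation view: keep every row, including those with no test takers.
participation = toy[["RType", "Enroll12", "NumTstTakr"]]

# Performance view: keep only rows where a pass rate is actually defined.
performance = toy.dropna(subset=["PctGE21"])

print(len(participation), len(performance))
```

Keeping both views means no rows have to be deleted from the source table, so either metric can be recomputed later.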


#### Fix data types

In [71]:
act_ca.dtypes

RType                object
Enroll12              int32
NumTstTakr            int32
NumGE21               int32
PctGE21              object
Year                 object
ACT_taken_%         float64
ACT_high_score_%    float64
dtype: object

* Make Enroll12 and NumTstTakr ints
* Make NumGE21 int
* Make PctGE21 a float

In [23]:
act_ca[act_ca['Enroll12'].isna()]

Unnamed: 0,CDS,CCode,CDCode,SCode,RType,SName,DName,CName,Enroll12,NumTstTakr,AvgScrRead,AvgScrEng,AvgScrMath,AvgScrSci,NumGE21,PctGE21,Year
2309,,,,,,,,,,,,,,,0,,


In [24]:
act_ca.drop(index=2309, inplace = True)

In [28]:
act_ca['Enroll12'].astype(int)

0        18
1        58
2        18
3       463
4       226
       ... 
2302    138
2303    394
2305    102
2306    628
2308     47
Name: Enroll12, Length: 1777, dtype: int32

In [30]:
act_ca['NumTstTakr'].astype(int)

0        0
1        0
2        0
3       53
4       19
        ..
2302    38
2303    56
2305     0
2306    61
2308     0
Name: NumTstTakr, Length: 1777, dtype: int32

In [31]:
act_ca['Enroll12'] = act_ca['Enroll12'].astype(int)
act_ca['NumTstTakr'] = act_ca['NumTstTakr'].astype(int)

In [74]:
act_ca['NumGE21'] = act_ca['NumGE21'].astype(int)
act_ca['PctGE21'] = act_ca['PctGE21'].astype(float)
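
As an alternative to filtering out the `*` rows before calling `astype`, the conversion can be sketched with `pd.to_numeric(..., errors="coerce")`, which turns unparseable entries into `NaN`. This is only a sketch on made-up Series. Note that it no longer distinguishes suppressed values (`*`) from missing ones, which may or may not be acceptable for this analysis.

```python
import pandas as pd

# Toy column mimicking PctGE21 as read from the CSV: strings, '*', and missing values.
raw_pct = pd.Series(["64.15", "*", None, "57.89"], name="PctGE21")

# errors="coerce" converts anything that cannot be parsed (here '*') to NaN.
pct = pd.to_numeric(raw_pct, errors="coerce")

# For counts, the nullable Int64 dtype keeps missing entries as <NA>
# instead of requiring them to be filled with 0 first.
raw_num = pd.Series(["34", "*", None, "11"], name="NumGE21")
num = pd.to_numeric(raw_num, errors="coerce").astype("Int64")

print(pct.dtype, num.dtype)
```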

### Features to measure test performance and participation
Since our goal is to track test performance and participation, we need a metric for each.
For our purposes, we can measure them as follows:
* performance by the percentage of test takers with a composite score of 21 or higher (`PctGE21`)
* participation by the percentage of grade 12 enrollees who take the test

In [77]:
act_ca['ACT_taken_%'] = 100*act_ca['NumTstTakr']/act_ca['Enroll12']
act_ca['ACT_high_score_%'] = 100*act_ca['NumGE21']/act_ca['NumTstTakr'] # <-- This is redundant with PctGE21
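
The comment above notes that `ACT_high_score_%` is redundant with `PctGE21`. A quick way to check that claim is to compare the two with a small tolerance, since `PctGE21` is rounded to two decimals. The sketch below uses three rows whose values are taken from the first rows of the ACT table shown earlier; the frame name `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Values taken from rows 0, 3 and 4 of the ACT table displayed earlier.
toy = pd.DataFrame({
    "NumTstTakr": [0, 53, 19],
    "NumGE21": [0, 34, 11],
    "PctGE21": [np.nan, 64.15, 57.89],
})

# With no test takers this is 0/0, which pandas evaluates to NaN.
computed = 100 * toy["NumGE21"] / toy["NumTstTakr"]

# Allow for rounding to two decimals and treat NaN == NaN as a match.
matches = np.isclose(computed, toy["PctGE21"], atol=0.01, equal_nan=True)
print(matches.all())
```

If the check holds on the full table, one of the two columns can be dropped without losing information.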

#### Select Data for Later Analysis

In [76]:
act_ca.drop(columns = ['Enroll12', 'NumTstTakr', 'NumGE21', 'Year']) # Returns a new DataFrame; act_ca itself is unchanged

Unnamed: 0,RType,Enroll12,NumTstTakr,NumGE21,PctGE21,Year,ACT_taken_%,ACT_high_score_%
0,S,18,0,0,,2018-19,0.000000,
1,S,58,0,0,,2018-19,0.000000,
2,S,18,0,0,,2018-19,0.000000,
3,S,463,53,34,64.15,2018-19,11.447084,64.150943
4,S,226,19,11,57.89,2018-19,8.407080,57.894737
...,...,...,...,...,...,...,...,...
2302,S,138,38,20,52.63,2018-19,27.536232,52.631579
2303,S,394,56,35,62.50,2018-19,14.213198,62.500000
2305,S,102,0,0,,2018-19,0.000000,
2306,S,628,61,40,65.57,2018-19,9.713376,65.573770


In [20]:
act_ca.loc[act_ca['RType'] == 'S', ['Enroll12', 'NumGE21']]

Unnamed: 0,Enroll12,NumGE21
0,18.0,0
1,58.0,0
2,18.0,0
3,463.0,34
4,226.0,11
...,...,...
2302,138.0,20
2303,394.0,35
2305,102.0,0
2306,628.0,40


In [21]:
act_schools = set(act_ca['SCode'].dropna())
len(act_schools)

1309

In [32]:
act_ca['RType'].value_counts()

S    1308
D     413
C      55
X       1
Name: RType, dtype: int64

### Cali SATs

#### View Dataset

In [22]:
sat_ca = pd.read_csv('../data/sat_2019_ca.csv')

In [23]:
sat_ca.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2580 entries, 0 to 2579
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   CDS                    2579 non-null   float64
 1   CCode                  2579 non-null   float64
 2   CDCode                 2579 non-null   float64
 3   SCode                  2579 non-null   float64
 4   RType                  2579 non-null   object 
 5   SName                  1982 non-null   object 
 6   DName                  2521 non-null   object 
 7   CName                  2579 non-null   object 
 8   Enroll12               2579 non-null   float64
 9   NumTSTTakr12           2579 non-null   float64
 10  NumERWBenchmark12      2304 non-null   object 
 11  PctERWBenchmark12      2304 non-null   object 
 12  NumMathBenchmark12     2304 non-null   object 
 13  PctMathBenchmark12     2304 non-null   object 
 14  Enroll11               2579 non-null   float64
 15  NumT

In [24]:
sat_ca.head()

Unnamed: 0,CDS,CCode,CDCode,SCode,RType,SName,DName,CName,Enroll12,NumTSTTakr12,...,NumERWBenchmark11,PctERWBenchmark11,NumMathBenchmark11,PctMathBenchmark11,TotNumBothBenchmark12,PctBothBenchmark12,TotNumBothBenchmark11,PctBothBenchmark11,Year,Unnamed: 25
0,6615981000000.0,6.0,661598.0,630046.0,S,Colusa Alternative Home,Colusa Unified,Colusa,18.0,0.0,...,,,,,,,,,2018-19,
1,6616061000000.0,6.0,661606.0,634758.0,S,Maxwell Sr High,Maxwell Unified,Colusa,29.0,10.0,...,*,*,*,*,*,*,*,*,2018-19,
2,19647330000000.0,19.0,1964733.0,1930924.0,S,Belmont Senior High,Los Angeles Unified,Los Angeles,206.0,102.0,...,42,24.14,12,6.90,14,13.73,11,6.32,2018-19,
3,19647330000000.0,19.0,1964733.0,1931476.0,S,Canoga Park Senior High,Los Angeles Unified,Los Angeles,227.0,113.0,...,97,35.27,37,13.45,18,15.93,35,12.73,2018-19,
4,19647330000000.0,19.0,1964733.0,1931856.0,S,Whitman Continuation,Los Angeles Unified,Los Angeles,18.0,14.0,...,*,*,*,*,*,*,*,*,2018-19,


#### Clean Data

In [25]:
sat_ca.drop(columns='Unnamed: 25', inplace=True)

In [26]:
sat_ca.head()

Unnamed: 0,CDS,CCode,CDCode,SCode,RType,SName,DName,CName,Enroll12,NumTSTTakr12,...,NumTSTTakr11,NumERWBenchmark11,PctERWBenchmark11,NumMathBenchmark11,PctMathBenchmark11,TotNumBothBenchmark12,PctBothBenchmark12,TotNumBothBenchmark11,PctBothBenchmark11,Year
0,6615981000000.0,6.0,661598.0,630046.0,S,Colusa Alternative Home,Colusa Unified,Colusa,18.0,0.0,...,0.0,,,,,,,,,2018-19
1,6616061000000.0,6.0,661606.0,634758.0,S,Maxwell Sr High,Maxwell Unified,Colusa,29.0,10.0,...,6.0,*,*,*,*,*,*,*,*,2018-19
2,19647330000000.0,19.0,1964733.0,1930924.0,S,Belmont Senior High,Los Angeles Unified,Los Angeles,206.0,102.0,...,174.0,42,24.14,12,6.90,14,13.73,11,6.32,2018-19
3,19647330000000.0,19.0,1964733.0,1931476.0,S,Canoga Park Senior High,Los Angeles Unified,Los Angeles,227.0,113.0,...,275.0,97,35.27,37,13.45,18,15.93,35,12.73,2018-19
4,19647330000000.0,19.0,1964733.0,1931856.0,S,Whitman Continuation,Los Angeles Unified,Los Angeles,18.0,14.0,...,5.0,*,*,*,*,*,*,*,*,2018-19


In [27]:
sat_ca[sat_ca['TotNumBothBenchmark12'].isna()]

Unnamed: 0,CDS,CCode,CDCode,SCode,RType,SName,DName,CName,Enroll12,NumTSTTakr12,...,NumTSTTakr11,NumERWBenchmark11,PctERWBenchmark11,NumMathBenchmark11,PctMathBenchmark11,TotNumBothBenchmark12,PctBothBenchmark12,TotNumBothBenchmark11,PctBothBenchmark11,Year
0,6.615981e+12,6.0,661598.0,630046.0,S,Colusa Alternative Home,Colusa Unified,Colusa,18.0,0.0,...,0.0,,,,,,,,,2018-19
12,1.563776e+13,15.0,1563776.0,1530377.0,S,Abraham Lincoln Alternative,Southern Kern Unified,Kern,18.0,0.0,...,0.0,,,,,,,,,2018-19
19,1.062117e+13,10.0,1062117.0,1030469.0,S,Enterprise Alternative,Clovis Unified,Fresno,18.0,0.0,...,0.0,,,,,,,,,2018-19
36,3.768163e+13,37.0,3768163.0,137109.0,S,Diego Valley East Public Charter,Julian Union Elementary,San Diego,78.0,0.0,...,1.0,*,*,*,*,,,*,*,2018-19
43,3.467314e+13,34.0,3467314.0,3430352.0,S,Las Flores High (Alternative),Elk Grove Unified,Sacramento,64.0,0.0,...,1.0,*,*,*,*,,,*,*,2018-19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2540,4.110413e+13,41.0,4110413.0,0.0,D,,San Mateo County Office of Education,San Mateo,97.0,0.0,...,0.0,,,,,,,,,2018-19
2561,1.976992e+13,19.0,1976992.0,0.0,D,,SBE - Prepa Tec Los Angeles High,Los Angeles,0.0,0.0,...,47.0,17,36.17,4,8.51,,,4,8.51,2018-19
2572,5.071092e+13,50.0,5071092.0,0.0,D,,Hart-Ransom Union Elementary,Stanislaus,18.0,0.0,...,0.0,,,,,,,,,2018-19
2573,5.071134e+13,50.0,5071134.0,0.0,D,,Keyes Union,Stanislaus,25.0,0.0,...,0.0,,,,,,,,,2018-19


In [28]:
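# Caution: .loc with a list of columns returns a copy, so this in-place fill does not modify sat_ca
sat_ca.loc[:,['NumERWBenchmark12', 'NumERWBenchmark11', 
            'NumMathBenchmark12', 'NumMathBenchmark11', 
            'TotNumBothBenchmark12', 'TotNumBothBenchmark11']].fillna(0, inplace = True)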
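
If the intent is for the filled values to persist, one option is to assign the filled columns back to the frame. The sketch below shows the pattern on a made-up two-column frame (`toy`), not on `sat_ca`. The `*` entries are left untouched because they are strings rather than missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in with the same mix of strings, '*' and NaN seen in the SAT columns.
toy = pd.DataFrame({
    "NumERWBenchmark12": ["42", np.nan, "*"],
    "TotNumBothBenchmark12": [np.nan, "14", "*"],
})
cols = ["NumERWBenchmark12", "TotNumBothBenchmark12"]

# Assign the result back rather than filling a temporary selection in place.
toy[cols] = toy[cols].fillna(0)

print(toy[cols].isna().sum().sum())
```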

In [29]:
sat_schools = set(sat_ca['SCode'].dropna())
sat_schools

{0.0,
 1933316.0,
 114694.0,
 1335306.0,
 1933324.0,
 1237007.0,
 106518.0,
 4530200.0,
 532507.0,
 114736.0,
 131128.0,
 1933381.0,
 4333639.0,
 106567.0,
 1531987.0,
 1933399.0,
 3334232.0,
 114777.0,
 4530267.0,
 5636188.0,
 4030557.0,
 131169.0,
 131177.0,
 131185.0,
 5030010.0,
 3432572.0,
 2130045.0,
 6119556.0,
 4530309.0,
 1933449.0,
 6111371.0,
 6119564.0,
 3932308.0,
 3334299.0,
 4530333.0,
 2130078.0,
 114850.0,
 3236007.0,
 131250.0,
 114868.0,
 3735750.0,
 1933530.0,
 106716.0,
 3834082.0,
 4333795.0,
 106732.0,
 6119671.0,
 5030135.0,
 123133.0,
 123141.0,
 3334406.0,
 106765.0,
 5030168.0,
 1933597.0,
 131359.0,
 5030176.0,
 4030755.0,
 1737006.0,
 114991.0,
 2335024.0,
 106799.0,
 5431598.0,
 123190.0,
 131383.0,
 5030200.0,
 3735867.0,
 1032507.0,
 5431614.0,
 4137279.0,
 1933647.0,
 106831.0,
 3637584.0,
 5030226.0,
 115030.0,
 123224.0,
 5030234.0,
 106849.0,
 106864.0,
 5030259.0,
 123257.0,
 2630010.0,
 5030267.0,
 4333951.0,
 1532290.0,
 3432838.0,
 5030283.0,
 11

In [30]:
sat_ca.loc[sat_ca['RType'] == 'S', ['Enroll12', 'TotNumBothBenchmark12']]

Unnamed: 0,Enroll12,TotNumBothBenchmark12
0,18.0,
1,29.0,*
2,206.0,14
3,227.0,18
4,18.0,*
...,...,...
1976,76.0,6
1977,15.0,*
1978,27.0,*
1979,1083.0,293


### Compare schools

In [31]:
schools = act_schools & sat_schools
act_only_schools = act_schools - sat_schools
sat_only_schools = sat_schools - act_schools
print(len(schools))
print(len(act_only_schools))
print(len(sat_only_schools))

1306
3
676
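
The same comparison can be sketched with an outer merge and `indicator=True`, which labels each code as present in both tables or in only one. Note that the SAT set above contains `0.0`, the `SCode` shown on district-level rows, so the SAT-only count may include that non-school code. Treating `0.0` as a placeholder to exclude is an assumption here, and the frames `act_toy` and `sat_toy` use made-up codes.

```python
import pandas as pd

# Made-up school codes; 0.0 mimics the code shown on district-level SAT rows.
act_toy = pd.DataFrame({"SCode": [1.0, 2.0, 3.0]})
sat_toy = pd.DataFrame({"SCode": [2.0, 3.0, 4.0, 0.0]})

# Assumption: 0.0 is not a real school code, so exclude it before comparing.
sat_toy = sat_toy[sat_toy["SCode"] != 0]

# The _merge column records where each code was found:
# 'both', 'left_only' (ACT only) or 'right_only' (SAT only).
merged = act_toy.merge(sat_toy, on="SCode", how="outer", indicator=True)
counts = merged["_merge"].value_counts()
print(counts)
```

Unlike the set operations, the merged frame keeps one row per code, so it can be joined back to school names for inspecting which schools appear in only one dataset.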
