# Capstone 2 Data Wrangling File

This notebook downloads and reviews the following files from the Governor's Office of Student Achievement (GOSA) website for school year 2019-20 (SY19-20):

**Student Operational Data**
- Attendance
- Enrollment by Subgroup Programs

**Teacher Categorization Data**
- Educator Experience
- Emergency and Provisional Teacher Credentials
- Out of Field Teachers


## <div class="alert alert-block alert-success">Summary of Data Selection</div>



<div class="alert alert-block alert-success">
According to the [Georgia Department of Education](https://www.gadoe.org/External-Affairs-and-Policy/AskDOE/Pages/Schools-and-Districts.aspx), the state of Georgia is home to 181 school districts with more than 2,200 schools staffed by more than 114,800 teachers serving 1.6 million students. 

The goal of this project is to cluster schools by features including performance (ACT and SAT), attendance, enrollment, teacher experience, and out-of-field credentials.

To select a subset of schools with relevant data, the following process was followed for **academic performance data**:

1. Given that school numbers are not unique for the state, an identifier was created with a combination of the district number and the school number.
2. The ACT dataset was filtered for schools with a non-null Composite score, and the identifier was used to confirm that the rows are unique. There were 360 schools meeting these criteria.
3. The SAT dataset was filtered for schools with a non-null Combined score, and the identifier was used to confirm that the rows are unique. There were 383 schools meeting these criteria.
4. The SAT and ACT datasets were merged to compare the schools: there were 8 schools without SAT scores and 31 schools without ACT scores.

**A set of 352 schools was chosen that have both ACT and SAT overall scores.**

With the ACT/SAT data reviewed, the operational data was then considered to see which schools have attendance and enrollment data for the 352 schools:

1. Similar to the academic data, a unique school identifier was created.
2. The attendance dataset was filtered to only include schools with identifiers from the performance dataset. There were 344 schools with attendance data and ACT/SAT scores.
3. The enrollment dataset was filtered to only include schools with identifiers from the performance dataset. There were 344 schools with enrollment data and ACT/SAT scores.
4. The identifiers from the attendance and enrollment datasets were compared to see if the same 344 schools were present in both datasets.

**A set of 344 schools was chosen that have both ACT and SAT overall scores and both enrollment and attendance data.**

With the academic performance and operational data reviewed, the teacher categorization data was considered to see which of the 344 schools have educator and out-of-field teacher data:


</div>

## Dataset Issues

In reviewing the data sets, a few issues were observed:
 - Institution/school numbers are not unique for the entire state. For example, Baker County K12 School in Baker county has the number 105; Lanier College and Career Academy in Hall County also has the number 105.
 - Some counts in the datasets appear as "TFS", which stands for too few students. According to the GOSA website, this means that fewer than 10 students were included. When a count is "TFS", the corresponding average score is not reported: it appears as NaN in the ACT data and as "TFS" in the SAT data.
 - The operational datasets have leading zeros for their school numbers that need to be removed.
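
The last two issues can be handled at load time. The following is a minimal sketch on a made-up two-row CSV, not the notebook's actual cleaning step; the rows and the choice of `pd.to_numeric` with `errors="coerce"` are assumptions for illustration.

```python
import io

import pandas as pd

# Toy CSV standing in for a GOSA file; the rows are made up for illustration.
raw = io.StringIO(
    "SCHOOL_DSTRCT_CD,INSTN_NUMBER,INSTN_NUM_TESTED_CNT,INSTN_AVG_SCORE_VAL\n"
    "601,0103,62,1005\n"
    "604,0105,TFS,TFS\n"
)

# Read the ID columns as strings so leading zeros stay visible.
toy = pd.read_csv(raw, dtype={"SCHOOL_DSTRCT_CD": str, "INSTN_NUMBER": str})

# Strip leading zeros from the school number.
toy["INSTN_NUMBER"] = toy["INSTN_NUMBER"].str.lstrip("0")

# Turn "TFS" into NaN so the count and score columns become numeric.
for col in ["INSTN_NUM_TESTED_CNT", "INSTN_AVG_SCORE_VAL"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy)
```

Coercing "TFS" to NaN means one filter (dropping missing values) works for every file, whichever placeholder the file uses.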

## Imports

In [1]:
import pandas as pd

## Load the Georgia Education Data

In [2]:
# Gather URLs
act_url = 'https://download.gosa.ga.gov/2020/ACT_RECENT_2020_JUN_21_2021.csv'
attendance_url = 'https://download.gosa.ga.gov/2020/Attendance_2020_Dec112020.csv'
experience_url = 'https://download.gosa.ga.gov/2020/Educators_Inexperienced_2020_Dec112020.csv'
credential_url = 'https://download.gosa.ga.gov/2020/Educators_EMERGENCY_WAIVERS_2020_Dec112020.csv'
enrollment_url = 'https://download.gosa.ga.gov/2020/Enrollment_by_Subgroups_Programs_2020_Dec112020.csv'
oof_url = 'https://download.gosa.ga.gov/2020/Educators_OUT_OF_FIELD_2020_Dec112020.csv'
sat_url = 'https://download.gosa.ga.gov/2020/SAT_HIGHEST_2020_JUN_21_2021.csv'

In [3]:
# Create list for urls
urls = [act_url, attendance_url, experience_url, credential_url, enrollment_url, oof_url, sat_url]

In [4]:
# Create dataframes
d = {url: pd.read_csv(url) for url in urls}
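
Keying the dictionary by a short name instead of the full URL can make later lookups easier to read. This is only a sketch: the helper `load_all` and the names `"act"` and `"sat"` are hypothetical, and in-memory CSV text stands in for the GOSA URLs so that it runs offline.

```python
import io

import pandas as pd


def load_all(sources, reader=pd.read_csv):
    """Read every source into a DataFrame, keyed by a short name."""
    return {name: reader(src) for name, src in sources.items()}


# In the notebook, the values would be the GOSA URLs (act_url, sat_url, ...).
# In-memory CSV text is used here so the sketch runs without a network.
sources = {
    "act": io.StringIO("INSTN_NUMBER,INSTN_AVG_SCORE_VAL\n103,19.0\n"),
    "sat": io.StringIO("INSTN_NUMBER,INSTN_AVG_SCORE_VAL\n103,1005\n"),
}

frames = load_all(sources)
print(frames["sat"].shape)
```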

## Define a Function to Create an Identifying Number

In [5]:
# Define a function to create a unique school identifier from the district number and school number
def identifier(df):
    """Add an 'Identifier' column (district number-school number) to df in place."""
    # Convert the district number and school number to strings and strip leading zeros
    nums_to_change = ['INSTN_NUMBER', 'SCHOOL_DISTRCT_CD']

    for num in nums_to_change:
        df[num] = df[num].astype(str)
        df[num] = df[num].str.lstrip('0')
    
    # Create new column with identifying number
    df['Identifier'] = df['SCHOOL_DISTRCT_CD'] + '-' + df['INSTN_NUMBER']
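
A variant that returns a new DataFrame instead of modifying its argument is sketched below. It is not the function used in the rest of this notebook; the name `with_identifier` is hypothetical, and the district code "999" is a made-up placeholder for a second district that reuses school number 105.

```python
import pandas as pd


def with_identifier(df):
    """Return a copy of df with an 'Identifier' column (district-school)."""
    out = df.copy()
    for col in ["SCHOOL_DISTRCT_CD", "INSTN_NUMBER"]:
        out[col] = out[col].astype(str).str.lstrip("0")
    out["Identifier"] = out["SCHOOL_DISTRCT_CD"] + "-" + out["INSTN_NUMBER"]
    return out


# District 604 / school 105 is Baker County K12 School in the source data.
# District code 999 is a made-up placeholder for a second district.
toy = pd.DataFrame(
    {"SCHOOL_DISTRCT_CD": [604, 999], "INSTN_NUMBER": ["0105", "105"]}
)

result = with_identifier(toy)
print(result["Identifier"].tolist())
print(result["Identifier"].is_unique)
```

The school number alone repeats across the two rows, while the combined identifier does not. Because the input is copied first, the caller's DataFrame is left untouched.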

## Explore Student Academic Performance Data

### ACT Scores

The ACT dataset shows the school information (name, number), the test component (English, Reading, Science, Writing, Math, or Composite), and the counts and average scores. The counts and average scores are provided for the national, state, district, and school levels.

For some counts, "TFS" (too few students) is listed. This results in a NaN value for the average.

**After filtering for only the Composite score, removing schools without an institutional average score, and using the combined identifier to confirm the rows are unique, 360 schools remain.**

In [6]:
act_df = d[act_url]

In [7]:
act_df.head()

Unnamed: 0,LONG_SCHOOL_YEAR,SCHOOL_DISTRCT_CD,SCHOOL_DSTRCT_NM,INSTN_NUMBER,INSTN_NAME,SUBGRP_DESC,TEST_CMPNT_TYP_CD,NATIONAL_NUM_TESTED_CNT,STATE_NUM_TESTED_CNT,DSTRCT_NUM_TESTED_CNT,INSTN_NUM_TESTED_CNT,NATIONAL_AVG_SCORE_VAL,STATE_AVG_SCORE_VAL,DSTRCT_AVG_SCORE_VAL,INSTN_AVG_SCORE_VAL
0,2019-20,644,DeKalb County,2055,Druid Hills High School,All Students,English,1670497,26811,1337,80,19.7,20.2,19.5,21.6
1,2019-20,619,Calhoun County,113,Calhoun County High School,All Students,English,1670497,26811,TFS,TFS,19.7,20.2,,
2,2019-20,772,Dalton Public Schools,110,Morris Innovative High School,All Students,English,1670497,26811,82,TFS,19.7,20.2,18.5,
3,2019-20,604,Baker County,105,Baker County K12 School,All Students,English,1670497,26811,TFS,TFS,19.7,20.2,,
4,2019-20,657,Floyd County,107,Pepperell High School,All Students,English,1670497,26811,201,57,19.7,20.2,19.1,17.9


In [8]:
act_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2417 entries, 0 to 2416
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   LONG_SCHOOL_YEAR         2417 non-null   object 
 1   SCHOOL_DISTRCT_CD        2417 non-null   int64  
 2   SCHOOL_DSTRCT_NM         2417 non-null   object 
 3   INSTN_NUMBER             2417 non-null   int64  
 4   INSTN_NAME               2417 non-null   object 
 5   SUBGRP_DESC              2417 non-null   object 
 6   TEST_CMPNT_TYP_CD        2417 non-null   object 
 7   NATIONAL_NUM_TESTED_CNT  2417 non-null   int64  
 8   STATE_NUM_TESTED_CNT     2417 non-null   int64  
 9   DSTRCT_NUM_TESTED_CNT    2417 non-null   object 
 10  INSTN_NUM_TESTED_CNT     2417 non-null   object 
 11  NATIONAL_AVG_SCORE_VAL   2417 non-null   float64
 12  STATE_AVG_SCORE_VAL      2417 non-null   float64
 13  DSTRCT_AVG_SCORE_VAL     2166 non-null   float64
 14  INSTN_AVG_SCORE_VAL     

#### Select the overall score for ACT

In [9]:
# Check to see which component scores are included
act_df['TEST_CMPNT_TYP_CD'].unique()

array(['English', 'Mathematics', 'Composite', 'Reading', 'Science',
       'Writing Subscore'], dtype=object)

In [10]:
# Filter to only keep Composite scores
act_df = act_df[act_df['TEST_CMPNT_TYP_CD'] == 'Composite']

In [11]:
act_df.shape

(409, 15)

#### Create the identifier column

In [12]:
# Call identifier function
identifier(act_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[num] = df[num].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[num] = df[num].str.lstrip('0')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Identifier'] = df['SCHOOL_DISTRCT_CD'] + '-' + df['INSTN_NUMBER']
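
The messages above are the pandas SettingWithCopyWarning: `act_df` was produced by filtering another DataFrame, and `identifier` then assigns columns on it. One common way to avoid the warning is to take an explicit copy when filtering. The sketch below uses a made-up three-row frame rather than the ACT data.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "TEST_CMPNT_TYP_CD": ["English", "Composite", "Composite"],
        "INSTN_NUMBER": [2055, 810, 103],
    }
)

# .copy() makes the filtered frame independent of `toy`, so later column
# assignments are unambiguous and do not touch the original.
composite = toy[toy["TEST_CMPNT_TYP_CD"] == "Composite"].copy()
composite["INSTN_NUMBER"] = composite["INSTN_NUMBER"].astype(str)

print(composite["INSTN_NUMBER"].tolist())
print(toy["INSTN_NUMBER"].tolist())
```

The original frame keeps its integer school numbers, while the copy holds the string versions.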


In [13]:
act_df.head()

Unnamed: 0,LONG_SCHOOL_YEAR,SCHOOL_DISTRCT_CD,SCHOOL_DSTRCT_NM,INSTN_NUMBER,INSTN_NAME,SUBGRP_DESC,TEST_CMPNT_TYP_CD,NATIONAL_NUM_TESTED_CNT,STATE_NUM_TESTED_CNT,DSTRCT_NUM_TESTED_CNT,INSTN_NUM_TESTED_CNT,NATIONAL_AVG_SCORE_VAL,STATE_AVG_SCORE_VAL,DSTRCT_AVG_SCORE_VAL,INSTN_AVG_SCORE_VAL,Identifier
87,2019-20,644,DeKalb County,810,Elizabeth Andrews High School,All Students,Composite,1670497,26810,1337,TFS,20.6,20.8,20.0,,644-810
88,2019-20,633,Cobb County,103,Kell High School,All Students,Composite,1670497,26810,2143,98,20.6,20.8,22.5,20.8,633-103
89,2019-20,625,Savannah-Chatham County,499,Savannah Arts Academy,All Students,Composite,1670497,26810,261,67,20.6,20.8,19.4,25.5,625-499
90,2019-20,757,Wilkes County,110,Washington-Wilkes Comprehensive High School,All Students,Composite,1670497,26810,11,11,20.6,20.8,18.1,18.1,757-110
91,2019-20,656,Fayette County,182,McIntosh High School,All Students,Composite,1670497,26810,617,155,20.6,20.8,22.9,25.5,656-182


#### Remove schools that do not have an institutional average score

In [14]:
# Determine the number of unique schools with ACT composite scores
act_df['Identifier'].nunique()

409

In [15]:
# Keep only rows with a non-NaN institutional average score
act_df = act_df[~act_df['INSTN_AVG_SCORE_VAL'].isnull()]

In [16]:
act_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 360 entries, 88 to 2261
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   LONG_SCHOOL_YEAR         360 non-null    object 
 1   SCHOOL_DISTRCT_CD        360 non-null    object 
 2   SCHOOL_DSTRCT_NM         360 non-null    object 
 3   INSTN_NUMBER             360 non-null    object 
 4   INSTN_NAME               360 non-null    object 
 5   SUBGRP_DESC              360 non-null    object 
 6   TEST_CMPNT_TYP_CD        360 non-null    object 
 7   NATIONAL_NUM_TESTED_CNT  360 non-null    int64  
 8   STATE_NUM_TESTED_CNT     360 non-null    int64  
 9   DSTRCT_NUM_TESTED_CNT    360 non-null    object 
 10  INSTN_NUM_TESTED_CNT     360 non-null    object 
 11  NATIONAL_AVG_SCORE_VAL   360 non-null    float64
 12  STATE_AVG_SCORE_VAL      360 non-null    float64
 13  DSTRCT_AVG_SCORE_VAL     360 non-null    float64
 14  INSTN_AVG_SCORE_VAL     

### SAT Scores

The SAT dataset shows the school information (name, number), the test component, and the counts and average scores. The counts and average scores are provided for the national, state, district, and school levels.

Similar to the AP and ACT datasets, for some counts, "TFS" (too few students) is listed. In the SAT data, the corresponding average is also listed as "TFS" rather than NaN.

**After filtering for only the Combined Test Score, removing schools whose institutional score is "TFS", and using the combined identifier to confirm the rows are unique, 383 schools remain.**

In [17]:
sat_df = d[sat_url]

In [18]:
sat_df.head()

Unnamed: 0,LONG_SCHOOL_YEAR,SCHOOL_DISTRCT_CD,SCHOOL_DSTRCT_NM,INSTN_NUMBER,INSTN_NAME,SUBGRP_DESC,TEST_CMPNT_TYP_CD,NATIONAL_NUM_TESTED_CNT,STATE_NUM_TESTED_CNT,DSTRCT_NUM_TESTED_CNT,INSTN_NUM_TESTED_CNT,STATE_AVG_SCORE_VAL,DSTRCT_AVG_SCORE_VAL,INSTN_AVG_SCORE_VAL
0,2019-20,601,Appling County,103,Appling County High School,All Students,Combined Test Score,2198460,43074,62,62,1028,1005,1005
1,2019-20,601,Appling County,103,Appling County High School,All Students,Essay Analysis Score - New,1258478,15097,17,17,3,2,2
2,2019-20,601,Appling County,103,Appling County High School,All Students,Essay Reading Score - New,1258478,15097,17,17,5,4,4
3,2019-20,601,Appling County,103,Appling County High School,All Students,Essay Total,1258478,15097,17,17,13,11,11
4,2019-20,601,Appling County,103,Appling County High School,All Students,Essay Writing Score - New,1258478,15097,17,17,5,5,5


In [19]:
sat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   LONG_SCHOOL_YEAR         3276 non-null   object
 1   SCHOOL_DISTRCT_CD        3276 non-null   int64 
 2   SCHOOL_DSTRCT_NM         3276 non-null   object
 3   INSTN_NUMBER             3276 non-null   int64 
 4   INSTN_NAME               3276 non-null   object
 5   SUBGRP_DESC              3276 non-null   object
 6   TEST_CMPNT_TYP_CD        3276 non-null   object
 7   NATIONAL_NUM_TESTED_CNT  3276 non-null   int64 
 8   STATE_NUM_TESTED_CNT     3276 non-null   int64 
 9   DSTRCT_NUM_TESTED_CNT    3276 non-null   object
 10  INSTN_NUM_TESTED_CNT     3276 non-null   object
 11  STATE_AVG_SCORE_VAL      3276 non-null   int64 
 12  DSTRCT_AVG_SCORE_VAL     3276 non-null   object
 13  INSTN_AVG_SCORE_VAL      3276 non-null   object
dtypes: int64(5), object(9)
memory usage: 358

#### Select the overall score for SAT

In [20]:
# Check to see which subject scores are included
sat_df['TEST_CMPNT_TYP_CD'].unique()

array(['Combined Test Score', 'Essay Analysis Score - New',
       'Essay Reading Score - New', 'Essay Total',
       'Essay Writing Score - New', 'Math Section Score - New',
       'Reading Test  Score - New', 'WritLang Test  Score - New'],
      dtype=object)

In [21]:
# Filter to only keep combined test scores
sat_df = sat_df[sat_df['TEST_CMPNT_TYP_CD'] == 'Combined Test Score']

In [22]:
sat_df.shape

(420, 14)

#### Create a new identifier that is a combination of the school district number and the school number

In [23]:
identifier(sat_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[num] = df[num].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[num] = df[num].str.lstrip('0')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Identifier'] = df['SCHOOL_DISTRCT_CD'] + '-' + df['INSTN_NUMBER']


In [24]:
sat_df.head()

Unnamed: 0,LONG_SCHOOL_YEAR,SCHOOL_DISTRCT_CD,SCHOOL_DSTRCT_NM,INSTN_NUMBER,INSTN_NAME,SUBGRP_DESC,TEST_CMPNT_TYP_CD,NATIONAL_NUM_TESTED_CNT,STATE_NUM_TESTED_CNT,DSTRCT_NUM_TESTED_CNT,INSTN_NUM_TESTED_CNT,STATE_AVG_SCORE_VAL,DSTRCT_AVG_SCORE_VAL,INSTN_AVG_SCORE_VAL,Identifier
0,2019-20,601,Appling County,103,Appling County High School,All Students,Combined Test Score,2198460,43074,62,62,1028,1005,1005,601-103
8,2019-20,602,Atkinson County,103,Atkinson County High School,All Students,Combined Test Score,2198460,43074,39,39,1028,1004,1004,602-103
16,2019-20,603,Bacon County,302,Bacon County High School,All Students,Combined Test Score,2198460,43074,33,33,1028,954,954,603-302
24,2019-20,604,Baker County,105,Baker County K12 School,All Students,Combined Test Score,2198460,43074,TFS,TFS,1028,TFS,TFS,604-105
32,2019-20,605,Baldwin County,189,Baldwin High School,All Students,Combined Test Score,2198460,43074,83,83,1028,904,904,605-189


#### Remove schools that do not have an institutional score

In [25]:
# Determine the number of unique schools with SAT combined scores
sat_df['Identifier'].nunique()

420

In [26]:
# Filter for non "TFS" scores in the Institutional score
sat_df = sat_df[~sat_df['INSTN_AVG_SCORE_VAL'].isin(['TFS'])]

In [27]:
sat_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383 entries, 0 to 3264
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   LONG_SCHOOL_YEAR         383 non-null    object
 1   SCHOOL_DISTRCT_CD        383 non-null    object
 2   SCHOOL_DSTRCT_NM         383 non-null    object
 3   INSTN_NUMBER             383 non-null    object
 4   INSTN_NAME               383 non-null    object
 5   SUBGRP_DESC              383 non-null    object
 6   TEST_CMPNT_TYP_CD        383 non-null    object
 7   NATIONAL_NUM_TESTED_CNT  383 non-null    int64 
 8   STATE_NUM_TESTED_CNT     383 non-null    int64 
 9   DSTRCT_NUM_TESTED_CNT    383 non-null    object
 10  INSTN_NUM_TESTED_CNT     383 non-null    object
 11  STATE_AVG_SCORE_VAL      383 non-null    int64 
 12  DSTRCT_AVG_SCORE_VAL     383 non-null    object
 13  INSTN_AVG_SCORE_VAL      383 non-null    object
 14  Identifier               383 non-null    

### Combined SAT and ACT Scores

After filtering for overall scores and schools that have overall scores, there are 360 schools with ACT scores and 383 schools with SAT scores. After merging the data sets to see the two scores, there are 8 schools without SAT scores and 31 schools without ACT scores.

**A final set of 352 schools is chosen that has both ACT and SAT overall scores.**
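
The counts of schools missing one score or the other can be read directly from an outer merge by passing `indicator=True`. The sketch below uses small made-up frames, not the real SAT and ACT data; the column names `sat` and `act` and all values are illustrative.

```python
import pandas as pd

# Toy stand-ins for the filtered SAT and ACT frames; identifiers and
# scores are illustrative only.
sat_toy = pd.DataFrame(
    {"Identifier": ["601-103", "602-103", "605-189"], "sat": [1005, 1004, 904]}
)
act_toy = pd.DataFrame(
    {"Identifier": ["601-103", "605-189", "644-2055"], "act": [19.0, 15.6, 21.6]}
)

merged = sat_toy.merge(act_toy, how="outer", on="Identifier", indicator=True)

# "_merge" is "left_only" (SAT only), "right_only" (ACT only), or "both".
counts = merged["_merge"].value_counts()
print(counts["left_only"], counts["right_only"], counts["both"])

# Keeping the "both" rows is equivalent to an inner join on the identifier.
both = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(sorted(both["Identifier"]))
```

Filtering on the indicator column is more explicit than a blanket `dropna()`, which would also drop a school for a missing value in any unrelated column.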

#### Merge the SAT and ACT datasets

In [28]:
scores = sat_df.merge(act_df, how='outer', on=['LONG_SCHOOL_YEAR', 'SCHOOL_DISTRCT_CD', 'SCHOOL_DSTRCT_NM',
       'INSTN_NUMBER', 'INSTN_NAME','Identifier'], suffixes=['_sat', '_act'])

In [29]:
scores.head()

Unnamed: 0,LONG_SCHOOL_YEAR,SCHOOL_DISTRCT_CD,SCHOOL_DSTRCT_NM,INSTN_NUMBER,INSTN_NAME,SUBGRP_DESC_sat,TEST_CMPNT_TYP_CD_sat,NATIONAL_NUM_TESTED_CNT_sat,STATE_NUM_TESTED_CNT_sat,DSTRCT_NUM_TESTED_CNT_sat,...,SUBGRP_DESC_act,TEST_CMPNT_TYP_CD_act,NATIONAL_NUM_TESTED_CNT_act,STATE_NUM_TESTED_CNT_act,DSTRCT_NUM_TESTED_CNT_act,INSTN_NUM_TESTED_CNT_act,NATIONAL_AVG_SCORE_VAL,STATE_AVG_SCORE_VAL_act,DSTRCT_AVG_SCORE_VAL_act,INSTN_AVG_SCORE_VAL_act
0,2019-20,601,Appling County,103,Appling County High School,All Students,Combined Test Score,2198460.0,43074.0,62,...,All Students,Composite,1670497.0,26810.0,16.0,16.0,20.6,20.8,19.0,19.0
1,2019-20,602,Atkinson County,103,Atkinson County High School,All Students,Combined Test Score,2198460.0,43074.0,39,...,,,,,,,,,,
2,2019-20,603,Bacon County,302,Bacon County High School,All Students,Combined Test Score,2198460.0,43074.0,33,...,,,,,,,,,,
3,2019-20,605,Baldwin County,189,Baldwin High School,All Students,Combined Test Score,2198460.0,43074.0,83,...,All Students,Composite,1670497.0,26810.0,41.0,41.0,20.6,20.8,15.6,15.6
4,2019-20,606,Banks County,199,Banks County High School,All Students,Combined Test Score,2198460.0,43074.0,68,...,All Students,Composite,1670497.0,26810.0,28.0,28.0,20.6,20.8,20.1,20.1


In [30]:
scores.columns

Index(['LONG_SCHOOL_YEAR', 'SCHOOL_DISTRCT_CD', 'SCHOOL_DSTRCT_NM',
       'INSTN_NUMBER', 'INSTN_NAME', 'SUBGRP_DESC_sat',
       'TEST_CMPNT_TYP_CD_sat', 'NATIONAL_NUM_TESTED_CNT_sat',
       'STATE_NUM_TESTED_CNT_sat', 'DSTRCT_NUM_TESTED_CNT_sat',
       'INSTN_NUM_TESTED_CNT_sat', 'STATE_AVG_SCORE_VAL_sat',
       'DSTRCT_AVG_SCORE_VAL_sat', 'INSTN_AVG_SCORE_VAL_sat', 'Identifier',
       'SUBGRP_DESC_act', 'TEST_CMPNT_TYP_CD_act',
       'NATIONAL_NUM_TESTED_CNT_act', 'STATE_NUM_TESTED_CNT_act',
       'DSTRCT_NUM_TESTED_CNT_act', 'INSTN_NUM_TESTED_CNT_act',
       'NATIONAL_AVG_SCORE_VAL', 'STATE_AVG_SCORE_VAL_act',
       'DSTRCT_AVG_SCORE_VAL_act', 'INSTN_AVG_SCORE_VAL_act'],
      dtype='object')

In [31]:
scores.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 391 entries, 0 to 390
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   LONG_SCHOOL_YEAR             391 non-null    object 
 1   SCHOOL_DISTRCT_CD            391 non-null    object 
 2   SCHOOL_DSTRCT_NM             391 non-null    object 
 3   INSTN_NUMBER                 391 non-null    object 
 4   INSTN_NAME                   391 non-null    object 
 5   SUBGRP_DESC_sat              383 non-null    object 
 6   TEST_CMPNT_TYP_CD_sat        383 non-null    object 
 7   NATIONAL_NUM_TESTED_CNT_sat  383 non-null    float64
 8   STATE_NUM_TESTED_CNT_sat     383 non-null    float64
 9   DSTRCT_NUM_TESTED_CNT_sat    383 non-null    object 
 10  INSTN_NUM_TESTED_CNT_sat     383 non-null    object 
 11  STATE_AVG_SCORE_VAL_sat      383 non-null    float64
 12  DSTRCT_AVG_SCORE_VAL_sat     383 non-null    object 
 13  INSTN_AVG_SCORE_VAL_

In [32]:
# Filter for only the schools that have both SAT and ACT scores
scores = scores.dropna()

In [33]:
scores.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 352 entries, 0 to 382
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   LONG_SCHOOL_YEAR             352 non-null    object 
 1   SCHOOL_DISTRCT_CD            352 non-null    object 
 2   SCHOOL_DSTRCT_NM             352 non-null    object 
 3   INSTN_NUMBER                 352 non-null    object 
 4   INSTN_NAME                   352 non-null    object 
 5   SUBGRP_DESC_sat              352 non-null    object 
 6   TEST_CMPNT_TYP_CD_sat        352 non-null    object 
 7   NATIONAL_NUM_TESTED_CNT_sat  352 non-null    float64
 8   STATE_NUM_TESTED_CNT_sat     352 non-null    float64
 9   DSTRCT_NUM_TESTED_CNT_sat    352 non-null    object 
 10  INSTN_NUM_TESTED_CNT_sat     352 non-null    object 
 11  STATE_AVG_SCORE_VAL_sat      352 non-null    float64
 12  DSTRCT_AVG_SCORE_VAL_sat     352 non-null    object 
 13  INSTN_AVG_SCORE_VAL_

In [64]:
# Create a list of the school identifiers being used
school_identifiers = scores['Identifier'].to_list()
len(school_identifiers)

352

## Explore Student Operational Data

### Attendance Data

The attendance dataset shows the total number of students at the school and the percentages of students missing five or fewer days, 6 to 15 days, and more than 15 days. The percentages are also available for race, gender, students with disability designations, economically disadvantaged designations, limited English proficiency, and migrant status. Chronic absenteeism rates (missing at least 10% of school days) by subgroup are also included.

While the attendance data shares the same columns for school and district names and numbers, some column headers are spelled slightly differently (for example, SCHOOL_DSTRCT_CD instead of SCHOOL_DISTRCT_CD).

In [35]:
attendance_df = d[attendance_url]

In [36]:
attendance_df.head()

Unnamed: 0,LONG_SCHOOL_YEAR,DETAIL_LVL_DESC,SCHOOL_DSTRCT_CD,SCHOOL_DSTRCT_NM,INSTN_NUMBER,INSTN_NAME,GRADES_SERVED_DESC,STUDENT_COUNT_ALL,FIVE_OR_FEWER_PERCENT_ALL,SIX_TO_FIFTEEN_PERCENT_ALL,...,CHRONIC_ABSENT_PERC_HISPANI,CHRONIC_ABSENT_PERC_MULTI,CHRONIC_ABSENT_PERC_FEMALE,CHRONIC_ABSENT_PERC_MALE,CHRONIC_ABSENT_PERC_SWD,CHRONIC_ABSENT_PERC_NOT_SWD,CHRONIC_ABSENT_PERC_ED,CHRONIC_ABSENT_PERC_NOT_ED,CHRONIC_ABSENT_PERC_LEP,CHRONIC_ABSENT_PERC_MIGRANT
0,2019-20,School,601,Appling County,103,Appling County High School,09101112,1027,55.6,32.9,...,11.8,23.1,10.7,13.3,15.4,11.6,12.1,0.0,13.6,8.7
1,2019-20,School,601,Appling County,177,Appling County Elementary School,02030405,520,62.5,32.5,...,4.4,8.3,6.8,5.2,6.5,5.8,6.0,0.0,4.9,0.0
2,2019-20,School,601,Appling County,195,Appling County Middle School,060708,867,62.5,30.1,...,3.9,17.6,6.2,8.3,10.4,6.7,7.3,0.0,2.6,1.9
3,2019-20,School,601,Appling County,277,Appling County Primary School,"PK,KK,01,02",579,54.9,37.8,...,6.7,11.5,7.6,7.2,10.3,7.0,7.4,0.0,3.7,0.0
4,2019-20,School,601,Appling County,1050,Altamaha Elementary School,"PK,KK,01,02,03,04,05",380,51.1,42.1,...,7.4,7.7,6.4,4.1,6.8,5.0,5.3,0.0,9.1,0.0


In [37]:
attendance_df.shape

(2493, 82)

#### Initial review of dataset

In [38]:
#How many unique district names are there?
attendance_df['SCHOOL_DSTRCT_NM'].nunique()

216

In [39]:
#How many unique school names are included?
attendance_df['INSTN_NAME'].nunique()

2181

In [40]:
#How many rows with 9th grade are included?
attendance_hs = attendance_df[attendance_df['GRADES_SERVED_DESC'].str.contains('09')]
attendance_hs.shape

(711, 82)

In [41]:
#How many unique school numbers are included?
attendance_hs['INSTN_NUMBER'].nunique()

224

In [42]:
#How many unique school names are included?
attendance_hs['INSTN_NAME'].nunique()

487

#### Create identifier column

In [43]:
# Rename the school district number column to match the academic performance datasets
attendance_df = attendance_df.rename(columns={"SCHOOL_DSTRCT_CD": "SCHOOL_DISTRCT_CD"})

In [44]:
identifier(attendance_df)

In [45]:
attendance_df.head()

Unnamed: 0,LONG_SCHOOL_YEAR,DETAIL_LVL_DESC,SCHOOL_DISTRCT_CD,SCHOOL_DSTRCT_NM,INSTN_NUMBER,INSTN_NAME,GRADES_SERVED_DESC,STUDENT_COUNT_ALL,FIVE_OR_FEWER_PERCENT_ALL,SIX_TO_FIFTEEN_PERCENT_ALL,...,CHRONIC_ABSENT_PERC_MULTI,CHRONIC_ABSENT_PERC_FEMALE,CHRONIC_ABSENT_PERC_MALE,CHRONIC_ABSENT_PERC_SWD,CHRONIC_ABSENT_PERC_NOT_SWD,CHRONIC_ABSENT_PERC_ED,CHRONIC_ABSENT_PERC_NOT_ED,CHRONIC_ABSENT_PERC_LEP,CHRONIC_ABSENT_PERC_MIGRANT,Identifier
0,2019-20,School,601,Appling County,103,Appling County High School,09101112,1027,55.6,32.9,...,23.1,10.7,13.3,15.4,11.6,12.1,0.0,13.6,8.7,601-103
1,2019-20,School,601,Appling County,177,Appling County Elementary School,02030405,520,62.5,32.5,...,8.3,6.8,5.2,6.5,5.8,6.0,0.0,4.9,0.0,601-177
2,2019-20,School,601,Appling County,195,Appling County Middle School,060708,867,62.5,30.1,...,17.6,6.2,8.3,10.4,6.7,7.3,0.0,2.6,1.9,601-195
3,2019-20,School,601,Appling County,277,Appling County Primary School,"PK,KK,01,02",579,54.9,37.8,...,11.5,7.6,7.2,10.3,7.0,7.4,0.0,3.7,0.0,601-277
4,2019-20,School,601,Appling County,1050,Altamaha Elementary School,"PK,KK,01,02,03,04,05",380,51.1,42.1,...,7.7,6.4,4.1,6.8,5.0,5.3,0.0,9.1,0.0,601-1050


In [46]:
attendance_df.shape

(2493, 83)

#### Filter data based on schools from academic performance

In [47]:
attendance = attendance_df[attendance_df['Identifier'].isin(school_identifiers)]

In [65]:
attendance.shape

(344, 83)
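
The filter keeps 344 rows for the 352 selected schools. A quick way to see which selected schools have no attendance row, and to check that no identifier is counted twice, is sketched below on made-up data; the variable names ending in `_toy` and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins: three selected schools, two of which appear in attendance.
school_identifiers_toy = ["601-103", "605-189", "606-199"]
attendance_toy = pd.DataFrame(
    {
        "Identifier": ["601-103", "601-177", "606-199"],
        "STUDENT_COUNT_ALL": [1027, 520, 900],
    }
)

kept = attendance_toy[attendance_toy["Identifier"].isin(school_identifiers_toy)]

# Selected schools that have no attendance row at all.
missing = sorted(set(school_identifiers_toy) - set(kept["Identifier"]))

# Repeated identifiers would inflate the row count.
has_duplicates = bool(kept["Identifier"].duplicated().any())

print(len(kept), missing, has_duplicates)
```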

### Enrollment Data

The enrollment dataset shows the grade levels at the school and the percentages of students by race and sub-groups (migrant, ED, SWD, LEP, gender). Also included are the counts and percentages of remedial middle school students, early intervention elementary students, special education, alternative programs, and gifted students.

In [49]:
enrollment_df = d[enrollment_url]

In [50]:
enrollment_df.head()

Unnamed: 0,DETAIL_LVL_DESC,INSTN_NUMBER,SCHOOL_DSTRCT_CD,LONG_SCHOOL_YEAR,INSTN_NAME,SCHOOL_DSTRCT_NM,GRADES_SERVED_DESC,ENROLL_PERCENT_ASIAN,ENROLL_PERCENT_NATIVE,ENROLL_PERCENT_BLACK,...,ENROLL_COUNT_SPECIAL_ED_PK,ENROLL_PCT_SPECIAL_ED_PK,ENROLL_COUNT_VOCATION_9_12,ENROLL_PCT_VOCATION_9_12,ENROLL_COUNT_ALT_PROGRAMS,ENROLL_PCT_ALT_PROGRAMS,ENROLL_COUNT_GIFTED,ENROLL_PCT_GIFTED,ENROLL_PERCENT_MALE,ENROLL_PERCENT_FEMALE
0,School,103,601,2019-20,Appling County High School,Appling County,09101112,1.0,0.0,21.0,...,0,0.0,579.0,59.2,53.0,7.0,59.0,6.0,51.0,49.0
1,School,177,601,2019-20,Appling County Elementary School,Appling County,02030405,1.0,0.0,28.0,...,0,0.0,,,0.0,0.0,25.0,5.1,52.0,48.0
2,School,195,601,2019-20,Appling County Middle School,Appling County,060708,1.0,0.0,23.0,...,0,0.0,,,22.0,2.6,59.0,7.1,50.0,50.0
3,School,277,601,2019-20,Appling County Primary School,Appling County,"PK,KK,01,02",0.0,0.0,27.0,...,26,17.1,,,0.0,0.0,11.0,2.0,48.0,52.0
4,School,1050,601,2019-20,Altamaha Elementary School,Appling County,"PK,KK,01,02,03,04,05",0.0,0.0,7.0,...,10,20.0,,,0.0,0.0,36.0,10.0,50.0,50.0


In [51]:
enrollment_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2497 entries, 0 to 2496
Data columns (total 37 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   DETAIL_LVL_DESC                2497 non-null   object 
 1   INSTN_NUMBER                   2497 non-null   object 
 2   SCHOOL_DSTRCT_CD               2497 non-null   object 
 3   LONG_SCHOOL_YEAR               2497 non-null   object 
 4   INSTN_NAME                     2497 non-null   object 
 5   SCHOOL_DSTRCT_NM               2497 non-null   object 
 6   GRADES_SERVED_DESC             2497 non-null   object 
 7   ENROLL_PERCENT_ASIAN           2493 non-null   float64
 8   ENROLL_PERCENT_NATIVE          2493 non-null   float64
 9   ENROLL_PERCENT_BLACK           2493 non-null   float64
 10  ENROLL_PERCENT_HISPANIC        2493 non-null   float64
 11  ENROLL_PERCENT_MULTIRACIAL     2493 non-null   float64
 12  ENROLL_PERCENT_WHITE           2493 non-null   f

#### Initial review of dataset

In [52]:
#How many high schools are included?
enrollment_hs = enrollment_df[enrollment_df['GRADES_SERVED_DESC'].str.contains('09')]
enrollment_hs.shape

(711, 37)

In [53]:
enrollment_hs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 711 entries, 0 to 2496
Data columns (total 37 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   DETAIL_LVL_DESC                711 non-null    object 
 1   INSTN_NUMBER                   711 non-null    object 
 2   SCHOOL_DSTRCT_CD               711 non-null    object 
 3   LONG_SCHOOL_YEAR               711 non-null    object 
 4   INSTN_NAME                     711 non-null    object 
 5   SCHOOL_DSTRCT_NM               711 non-null    object 
 6   GRADES_SERVED_DESC             711 non-null    object 
 7   ENROLL_PERCENT_ASIAN           711 non-null    float64
 8   ENROLL_PERCENT_NATIVE          711 non-null    float64
 9   ENROLL_PERCENT_BLACK           711 non-null    float64
 10  ENROLL_PERCENT_HISPANIC        711 non-null    float64
 11  ENROLL_PERCENT_MULTIRACIAL     711 non-null    float64
 12  ENROLL_PERCENT_WHITE           711 non-null    fl

#### Create identifier column

In [66]:
# Rename the school district number column to match the academic performance datasets
enrollment_df = enrollment_df.rename(columns={"SCHOOL_DSTRCT_CD": "SCHOOL_DISTRCT_CD"})

In [67]:
identifier(enrollment_df)

In [68]:
enrollment_df.head()

Unnamed: 0,DETAIL_LVL_DESC,INSTN_NUMBER,SCHOOL_DISTRCT_CD,LONG_SCHOOL_YEAR,INSTN_NAME,SCHOOL_DSTRCT_NM,GRADES_SERVED_DESC,ENROLL_PERCENT_ASIAN,ENROLL_PERCENT_NATIVE,ENROLL_PERCENT_BLACK,...,ENROLL_PCT_SPECIAL_ED_PK,ENROLL_COUNT_VOCATION_9_12,ENROLL_PCT_VOCATION_9_12,ENROLL_COUNT_ALT_PROGRAMS,ENROLL_PCT_ALT_PROGRAMS,ENROLL_COUNT_GIFTED,ENROLL_PCT_GIFTED,ENROLL_PERCENT_MALE,ENROLL_PERCENT_FEMALE,Identifier
0,School,103,601,2019-20,Appling County High School,Appling County,09101112,1.0,0.0,21.0,...,0.0,579.0,59.2,53.0,7.0,59.0,6.0,51.0,49.0,601-103
1,School,177,601,2019-20,Appling County Elementary School,Appling County,02030405,1.0,0.0,28.0,...,0.0,,,0.0,0.0,25.0,5.1,52.0,48.0,601-177
2,School,195,601,2019-20,Appling County Middle School,Appling County,060708,1.0,0.0,23.0,...,0.0,,,22.0,2.6,59.0,7.1,50.0,50.0,601-195
3,School,277,601,2019-20,Appling County Primary School,Appling County,"PK,KK,01,02",0.0,0.0,27.0,...,17.1,,,0.0,0.0,11.0,2.0,48.0,52.0,601-277
4,School,1050,601,2019-20,Altamaha Elementary School,Appling County,"PK,KK,01,02,03,04,05",0.0,0.0,7.0,...,20.0,,,0.0,0.0,36.0,10.0,50.0,50.0,601-1050


#### Filter data based on schools from academic performance

In [69]:
enrollment = enrollment_df[enrollment_df['Identifier'].isin(school_identifiers)]

In [70]:
enrollment.shape

(344, 38)

### Compare Filtered Attendance and Enrollment Datasets

In [75]:
# List schools that appear in the attendance data but not in the enrollment data (one-directional check)
list(set(attendance['Identifier']) - set(enrollment['Identifier']))

[]
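
The set difference above only reports identifiers that are in `attendance` and missing from `enrollment`. A minimal sketch of a two-way check follows. The two small frames and their identifier values are hypothetical stand-ins for illustration, not rows taken from the filtered GOSA data.

```python
import pandas as pd

# Hypothetical stand-ins for the filtered attendance and enrollment frames
attendance_demo = pd.DataFrame({"Identifier": ["601-103", "601-177", "602-195"]})
enrollment_demo = pd.DataFrame({"Identifier": ["601-103", "601-177", "603-277"]})

attendance_ids = set(attendance_demo["Identifier"])
enrollment_ids = set(enrollment_demo["Identifier"])

# Check each direction separately, then both at once with symmetric difference
only_in_attendance = sorted(attendance_ids - enrollment_ids)
only_in_enrollment = sorted(enrollment_ids - attendance_ids)
in_exactly_one = sorted(attendance_ids ^ enrollment_ids)

print(only_in_attendance)  # ['602-195']
print(only_in_enrollment)  # ['603-277']
print(in_exactly_one)      # ['602-195', '603-277']
```

An empty `in_exactly_one` list would mean both datasets cover exactly the same schools.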

## Explore Teacher Categorization Data

### Educator Experience

In [54]:
experience_df = d[experience_url]

In [55]:
experience_df.head()

Unnamed: 0,LONG_SCHOOL_YEAR,SCHOOL_DSTRCT_NM,INSTN_NAME,LABEL_LVL_3_DESC,LABEL_LVL_2_DESC,FTE,INEXPERIENCED_FTE,INEXPERIENCED_FTE_PCT
0,2019-20,Appling County,Altamaha Elementary School,Leaders,Not Applicable,1.0,0.0,0
1,2019-20,Appling County,Altamaha Elementary School,Teachers,Total,27.3,6.0,22
2,2019-20,Appling County,Appling County Elementary School,Leaders,Not Applicable,2.0,1.5,75
3,2019-20,Appling County,Appling County Elementary School,Teachers,Total,44.9,14.0,31
4,2019-20,Appling County,Appling County High School,Leaders,Not Applicable,2.5,0.5,20


In [56]:
experience_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   LONG_SCHOOL_YEAR       6435 non-null   object 
 1   SCHOOL_DSTRCT_NM       6435 non-null   object 
 2   INSTN_NAME             6435 non-null   object 
 3   LABEL_LVL_3_DESC       6435 non-null   object 
 4   LABEL_LVL_2_DESC       6435 non-null   object 
 5   FTE                    6435 non-null   float64
 6   INEXPERIENCED_FTE      6435 non-null   float64
 7   INEXPERIENCED_FTE_PCT  6435 non-null   int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 402.3+ KB


The experience dataset reports each school's total full-time equivalents (FTEs) separately for leaders and for teachers. It also includes the number of inexperienced FTEs and the percentage of total FTEs that are inexperienced, stored as an integer.
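
As a quick sanity check on how the three numeric columns relate, the sketch below recomputes the percentage from the five rows shown in `experience_df.head()`. It assumes the percentage is `INEXPERIENCED_FTE / FTE` rounded to the nearest whole number. That rule is inferred from these five rows only and is not documented by the source.

```python
import pandas as pd

# The five rows displayed by experience_df.head(), retyped by hand
sample = pd.DataFrame({
    "INSTN_NAME": [
        "Altamaha Elementary School",
        "Altamaha Elementary School",
        "Appling County Elementary School",
        "Appling County Elementary School",
        "Appling County High School",
    ],
    "LABEL_LVL_3_DESC": ["Leaders", "Teachers", "Leaders", "Teachers", "Leaders"],
    "FTE": [1.0, 27.3, 2.0, 44.9, 2.5],
    "INEXPERIENCED_FTE": [0.0, 6.0, 1.5, 14.0, 0.5],
    "INEXPERIENCED_FTE_PCT": [0, 22, 75, 31, 20],
})

# Assumed rule: share of inexperienced FTEs, rounded to a whole percent
recomputed = (100 * sample["INEXPERIENCED_FTE"] / sample["FTE"]).round().astype(int)

all_match = bool(recomputed.eq(sample["INEXPERIENCED_FTE_PCT"]).all())
print(all_match)  # True
```

Running the same comparison on the full `experience_df` would show whether the rule holds beyond these rows.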

### Teacher Emergency Credentials

In [57]:
credential_df = d[credential_url]

In [58]:
credential_df.head()

Unnamed: 0,LONG_SCHOOL_YEAR,SCHOOL_DSTRCT_NM,INSTN_NAME,LABEL_LVL_3_DESC,LABEL_LVL_2_DESC,FTE,OUTOFFIELD_FTE,OUTOFFIELD_FTE_PCT
0,2019-20,Appling County,Altamaha Elementary School,Teachers,Total,27.3,1.0,4
1,2019-20,Appling County,Appling County Elementary School,Teachers,Total,44.9,2.0,4
2,2019-20,Appling County,Appling County High School,Teachers,Total,55.7,8.1,15
3,2019-20,Appling County,Appling County Middle School,Teachers,Total,53.9,13.0,24
4,2019-20,Appling County,Appling County Primary School,Teachers,Total,48.2,0.0,0


In [59]:
credential_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3779 entries, 0 to 3778
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   LONG_SCHOOL_YEAR    3779 non-null   object 
 1   SCHOOL_DSTRCT_NM    3779 non-null   object 
 2   INSTN_NAME          3779 non-null   object 
 3   LABEL_LVL_3_DESC    3779 non-null   object 
 4   LABEL_LVL_2_DESC    3779 non-null   object 
 5   FTE                 3779 non-null   float64
 6   OUTOFFIELD_FTE      3779 non-null   float64
 7   OUTOFFIELD_FTE_PCT  3779 non-null   int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 236.3+ KB


The emergency credentials dataset reports each school's total full-time equivalents (FTEs) at the teacher level. It also includes the number of out-of-field FTEs (here, those with emergency or provisional credentials) and the percentage of total FTEs that are out-of-field.

Note that this definition of out-of-field differs from the one used in the out-of-field dataset, even though both files use the same column names.

### Out-of-Field Teachers

In [60]:
oof_df = d[oof_url]

In [61]:
oof_df.head()

Unnamed: 0,LONG_SCHOOL_YEAR,SCHOOL_DSTRCT_NM,INSTN_NAME,LABEL_LVL_3_DESC,LABEL_LVL_2_DESC,FTE,OUTOFFIELD_FTE,OUTOFFIELD_FTE_PCT
0,2019-20,Appling County,Altamaha Elementary School,Teachers,Total,27.3,1.0,4
1,2019-20,Appling County,Appling County Elementary School,Teachers,Total,44.9,1.0,2
2,2019-20,Appling County,Appling County High School,Teachers,Total,55.7,2.0,4
3,2019-20,Appling County,Appling County Middle School,Teachers,Total,53.9,0.0,0
4,2019-20,Appling County,Appling County Primary School,Teachers,Total,48.2,0.0,0


In [62]:
oof_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3779 entries, 0 to 3778
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   LONG_SCHOOL_YEAR    3779 non-null   object 
 1   SCHOOL_DSTRCT_NM    3779 non-null   object 
 2   INSTN_NAME          3779 non-null   object 
 3   LABEL_LVL_3_DESC    3779 non-null   object 
 4   LABEL_LVL_2_DESC    3779 non-null   object 
 5   FTE                 3779 non-null   float64
 6   OUTOFFIELD_FTE      3779 non-null   float64
 7   OUTOFFIELD_FTE_PCT  3779 non-null   int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 236.3+ KB


In [63]:
oof_df.compare(credential_df)

Unnamed: 0_level_0,OUTOFFIELD_FTE,OUTOFFIELD_FTE,OUTOFFIELD_FTE_PCT,OUTOFFIELD_FTE_PCT
Unnamed: 0_level_1,self,other,self,other
1,1.0,2.0,2.0,4.0
2,2.0,8.1,4.0,15.0
3,0.0,13.0,0.0,24.0
5,4.0,24.1,2.0,10.0
7,3.0,7.3,10.0,23.0
...,...,...,...,...
3774,0.0,3.7,0.0,9.0
3775,0.0,2.0,0.0,4.0
3776,0.0,2.0,0.0,4.0
3777,0.0,13.4,0.0,7.0


The out-of-field dataset reports each school's total full-time equivalents (FTEs) at the teacher level. It also includes the number of out-of-field FTEs (here, those teaching a subject or field for which the teacher is not certified or licensed) and the percentage of total FTEs that are out-of-field.

Note that this definition of out-of-field differs from the one used in the emergency credentials dataset. The `compare` output above confirms that the two files hold different values for many schools despite sharing column names.
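
Because both files use `OUTOFFIELD_FTE` and `OUTOFFIELD_FTE_PCT` for different measures, one option is to rename the emergency credential columns before combining the two. The sketch below uses the five rows shown in each `head()` output. The `EMERGENCY_FTE` and `EMERGENCY_FTE_PCT` names are hypothetical, and matching on district and school name is an assumption that would need checking on the full data.

```python
import pandas as pd

keys = ["SCHOOL_DSTRCT_NM", "INSTN_NAME"]
schools = [
    "Altamaha Elementary School",
    "Appling County Elementary School",
    "Appling County High School",
    "Appling County Middle School",
    "Appling County Primary School",
]

# Rows retyped from credential_df.head() and oof_df.head()
credential_sample = pd.DataFrame({
    "SCHOOL_DSTRCT_NM": ["Appling County"] * 5,
    "INSTN_NAME": schools,
    "OUTOFFIELD_FTE": [1.0, 2.0, 8.1, 13.0, 0.0],
    "OUTOFFIELD_FTE_PCT": [4, 4, 15, 24, 0],
})
oof_sample = pd.DataFrame({
    "SCHOOL_DSTRCT_NM": ["Appling County"] * 5,
    "INSTN_NAME": schools,
    "OUTOFFIELD_FTE": [1.0, 1.0, 2.0, 0.0, 0.0],
    "OUTOFFIELD_FTE_PCT": [4, 2, 4, 0, 0],
})

# Hypothetical column names that make the emergency/provisional measure explicit
credential_renamed = credential_sample.rename(columns={
    "OUTOFFIELD_FTE": "EMERGENCY_FTE",
    "OUTOFFIELD_FTE_PCT": "EMERGENCY_FTE_PCT",
})

# validate raises an error if a district/school name pair is duplicated on either side
teacher_quality = oof_sample.merge(
    credential_renamed, on=keys, how="inner", validate="one_to_one"
)
print(teacher_quality.columns.tolist())
```

With distinct names, both measures can sit side by side as separate clustering features.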

## Next Steps

1. Summarize the selection criteria that determine which subset of schools will be used.
2. Merge the datasets for a few test cases first (for example, at the county level). The data for those schools needs to be understood before merging: a merge can look reasonable at first and still be inaccurate.
3. Loop over the scraped DataFrame CSV to pull out the right URL.
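
The teacher files shown above carry district and school names but no district or school numbers, so a trial merge has to match on names. The sketch below shows one way to run such a test merge and surface rows that fail to match. The lookup rows come from the `enrollment_df.head()` output, while "Example County" / "Example Academy" is a made-up row added to illustrate a mismatch. Matching on names is an assumption, since names may not be unique or consistently spelled across files.

```python
import pandas as pd

keys = ["SCHOOL_DSTRCT_NM", "INSTN_NAME"]

# Name-to-identifier lookup, retyped from enrollment_df.head()
lookup = pd.DataFrame({
    "SCHOOL_DSTRCT_NM": ["Appling County"] * 3,
    "INSTN_NAME": [
        "Appling County High School",
        "Appling County Elementary School",
        "Altamaha Elementary School",
    ],
    "Identifier": ["601-103", "601-177", "601-1050"],
})

# Teacher-level rows; the last row is hypothetical and has no match in the lookup
teachers = pd.DataFrame({
    "SCHOOL_DSTRCT_NM": ["Appling County", "Appling County", "Example County"],
    "INSTN_NAME": [
        "Altamaha Elementary School",
        "Appling County Elementary School",
        "Example Academy",
    ],
    "INEXPERIENCED_FTE_PCT": [22, 31, 10],
})

# indicator=True adds a _merge column recording where each row was found
merged = teachers.merge(
    lookup, on=keys, how="left", validate="many_to_one", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "INSTN_NAME"].tolist()
print(unmatched)  # ['Example Academy']
```

Reviewing the unmatched rows for a single county before merging everything is one way to catch name mismatches early.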