# CSCA 5622 Intro to Machine Learning: Final Project

## Likelihood of Student Criminal Activity

Using public datasets from the Department of Education, look for state-level correlations between preschool enrollment, teacher certification, and criminal offenses occurring at school.


#### DATA INFO

**Teacher Credentials:**
- The `teacher_creds` data frame records number and percentage of public school classroom teachers (in full-time equivalents), by certification status and years of experience, by state: School Year 2013-14.

- Table reads (for US Totals): Of all 3,138,535 classroom teachers (FTE), 3,084,697 (98.3%) met all state licensing/certification requirements. Data reported in this table represent 100.0% of responding schools.

- *Source: U.S. Department of Education, Office for Civil Rights, Civil Rights Data Collection, 2013-14, available at http://ocrdata.ed.gov. Data notes are available at http://ocrdata.ed.gov/downloads/DataNotes.docx*
 
**Preschool Enrollment:**
- The `preschool` data frame records number and percentage of public school students enrolled in Preschool, by race/ethnicity, disability status, and English proficiency, by state: School Year 2015-16.

- Table reads (for US Totals): Of all 1,536,982 public school students enrolled in Preschool, 17,964 (1.2%) were American Indian or Alaska Native, and 313,601 (20.4%) were students with disabilities served under the Individuals with Disabilities Education Act (IDEA). Data reported in this table represent 100.0% of responding schools.

- *Source: U.S. Department of Education, Office for Civil Rights, Civil Rights Data Collection, 2015-16, available at http://ocrdata.ed.gov. Data notes are available at https://ocrdata.ed.gov/Downloads/Data-Notes-2015-16-CRDC.pdf*

**School Incidents:**
- The `incidents` data frame records number of incidents, by state: School Year 2015-16.

- Table reads (for US): The number of incidents of sexual assault was 9,255. Data reported in this table represent 98.0% of responding schools.

- *Source: U.S. Department of Education, Office for Civil Rights, Civil Rights Data Collection, 2015-16, available at http://ocrdata.ed.gov. Data notes are available at https://ocrdata.ed.gov/Downloads/Data-Notes-2015-16-CRDC.pdf.*
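
As a quick sanity check, the percentages quoted in the "Table reads" notes above can be reproduced from the raw counts. The sketch below uses only the US totals quoted in those notes; the `percent` helper is a hypothetical name introduced for illustration and is not part of the project code.

```python
# Counts quoted in the "Table reads" notes above (US totals).
teachers_total = 3_138_535      # classroom teachers (FTE), 2013-14
teachers_certified = 3_084_697  # met all state licensing/certification requirements

preschool_total = 1_536_982     # students enrolled in preschool, 2015-16
preschool_aian = 17_964         # American Indian or Alaska Native
preschool_idea = 313_601        # students with disabilities served under IDEA


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(percent(teachers_certified, teachers_total))  # 98.3
print(percent(preschool_aian, preschool_total))     # 1.2
print(percent(preschool_idea, preschool_total))     # 20.4
```

The same ratio is what the `(P)` columns in the data frames represent: a count divided by the relevant total, times 100.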

##### Disclaimers 
- Due to limited data available, the teacher credentials data is from school year 2013-2014 while the other two data sets are from 2015-2016.
- The school incidents data provides a disclaimer at the footer to "Interpret data in this row with caution. Data are missing for more than 15 percent of schools."
- Each data set covers only one school year; several years of data would probably give more reliable results.

**The results of this experiment will inherently be inaccurate, so they should not be interpreted as factual.**

In [90]:
%matplotlib inline
import numpy as np
import scipy as sp
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Set color map to have light blue background
sns.set()
import statsmodels.formula.api as smf
import statsmodels.api as sm
from pathlib import Path
from ISLP import load_data

# Load & Prepare Teacher Credentials Data
- Load the Excel sheet and exclude the top three rows of formatting.
- Rename long column names to shorter acronyms. See comments in the code for mappings.
- Remove the empty 0-index `NaN` column from the Excel sheet.
- Drop the trailing rows of metadata and formatting (index labels 52 through 55).

In [91]:
# -- Load Teacher Credentials DataFrame
file_path = Path.cwd().joinpath("../../data/teacher-certification-and-years-of-experience.xlsx")
teacher_creds = pd.read_excel(
    file_path,
    header=2,
    skiprows=[0, 1, 2], # -- Exclude the top three rows of formatting
    names=[    
        "C0",
        "State",
        "CT (FTE)",
        # "Classroom Teachers (FTE)",
        "MR (FTE)",
        # "Meeting All State Licensing/Certification Requirements (FTE)",
        "MR (P)",
        # "Meeting All State Licensing/Certification Requirements (P)",
        "FY (FTE)",
        # "Classroom Teachers in their First Year of Teaching (FTE)",
        "FY (P)",
        # "Classroom Teachers in their First Year of Teaching (P)",
        "SY (FTE)",
        "SY (P)",
        "schools",
        "schools (P)",
    ]
)

# -- Remove empty NaN column values
teacher_creds.drop(["C0"], axis=1, inplace=True)

# -- Delete metadata text
teacher_creds.drop([52, 53, 54, 55], inplace=True)
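
The hard-coded index labels above depend on the exact layout of the spreadsheet. As an alternative, here is a minimal sketch that drops footer rows by checking whether a numeric column actually parses as a number. The `drop_footer_rows` helper and the toy frame are hypothetical and for illustration only; the values are not CRDC data.

```python
import pandas as pd

# Toy stand-in for a loaded sheet: two data rows followed by footer rows.
# The names and values here are illustrative, not CRDC data.
toy = pd.DataFrame(
    {
        "State": ["Alpha", "Beta", None, "NOTE: footer text"],
        "CT (FTE)": [100.0, 250.5, None, None],
    }
)


def drop_footer_rows(df, numeric_col):
    """Keep only rows where numeric_col parses as a number (hypothetical helper)."""
    parsed = pd.to_numeric(df[numeric_col], errors="coerce")
    return df[parsed.notna()].reset_index(drop=True)


cleaned = drop_footer_rows(toy, "CT (FTE)")
print(cleaned)
```

One trade-off: a state row with a genuinely missing value in the chosen column would also be dropped, so it is worth comparing the row count before and after.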

# Load & Prepare Preschool Data
- Load the Excel sheet and exclude the top five rows of formatting.
- Rename long column names to shorter acronyms. See comments in the code for mappings.
- Remove the empty 0-index `NaN` column from the Excel sheet.
- Drop the trailing rows of metadata and formatting (index labels 52 through 55).

In [100]:
# -- Load Preschool Enrollment DataFrame
file_path = Path.cwd().joinpath("../../data/preschool-enrollment.xlsx")
preschool = pd.read_excel(
    file_path,
    # header=2,
    sheet_name="Total",
    skiprows=[0, 1, 2, 3, 4], # -- Exclude the top five rows of formatting
    names=[
        "C0",
        "State",
        "Total (N)",
        "Total (P)",
        # "American Indian or Alaska Native (N)",
        "AIAN (N)",
        # "American Indian or Alaska Native (P)",
        "AIAN (P)",
        "Asian (N)",
        "Asian (P)",
        # "Hispanic or Latino of any race (N)",
        "HL (N)",
        # "Hispanic or Latino of any race (P)",
        "HL (P)",
        # "Black or African American (N)"
        "BAA (N)",
        # "Black or African American (P)"
        "BAA (P)",
        "White (N)",
        "White (P)",
        # "Native Hawaiian or Other Pacific Islander (N)"
        "NHPI (N)",
        # "Native Hawaiian or Other Pacific Islander (P)"
        "NHPI (P)",
        # "Two or more races (N)",
        "TMR (N)",
        # "Two or more races (P)",
        "TMR (P)",
        # Students With Disabilities Served Under IDEA
        "SWDSUI (N)",
        "SWDSUI (P)",
        # English Language Learners
        "ELL (N)",
        "ELL (P)",
        # Number of Schools
        "schools",
        # Percent of Schools Reporting
        "schools (P)",        
    ]
)

# -- Remove empty NaN column values
preschool.drop(["C0"], axis=1, inplace=True)

# -- Delete metadata text
preschool.drop([52, 53, 54, 55], inplace=True)
preschool.tail(10)

| | State | Total (N) | Total (P) | AIAN (N) | AIAN (P) | Asian (N) | Asian (P) | HL (N) | HL (P) | BAA (N) | ... | NHPI (N) | NHPI (P) | TMR (N) | TMR (P) | SWDSUI (N) | SWDSUI (P) | ELL (N) | ELL (P) | schools | schools (P) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42 | South Dakota | 3715.0 | 100.0 | 514.0 | 13.8358 | 66.0 | 1.7766 | 242.0 | 6.5141 | 222.0 | ... | 4.0 | 0.1077 | 174.0 | 4.6837 | 890.0 | 23.9569 | 11.0 | 0.2961 | 688.0 | 100.0 |
| 43 | Tennessee | 28121.0 | 100.0 | 49.0 | 0.1742 | 420.0 | 1.4935 | 2930.0 | 10.4193 | 7487.0 | ... | 24.0 | 0.0853 | 671.0 | 2.3861 | 5430.0 | 19.3094 | 379.0 | 1.3477 | 1818.0 | 100.0 |
| 44 | Texas | 237646.0 | 100.0 | 791.0 | 0.3328 | 8212.0 | 3.4556 | 150094.0 | 63.1586 | 34407.0 | ... | 305.0 | 0.1283 | 4679.0 | 1.9689 | 21564.0 | 9.074 | 88136.0 | 37.0871 | 8616.0 | 100.0 |
| 45 | Utah | 15453.0 | 100.0 | 241.0 | 1.5596 | 359.0 | 2.3232 | 2653.0 | 17.1682 | 242.0 | ... | 236.0 | 1.5272 | 322.0 | 2.0837 | 5663.0 | 36.6466 | 87.0 | 0.563 | 1009.0 | 100.0 |
| 46 | Vermont | 4632.0 | 100.0 | 13.0 | 0.2807 | 105.0 | 2.2668 | 62.0 | 1.3385 | 95.0 | ... | 3.0 | 0.0648 | 104.0 | 2.2453 | 707.0 | 15.2634 | 24.0 | 0.5181 | 306.0 | 100.0 |
| 47 | Virginia | 33012.0 | 100.0 | 83.0 | 0.2514 | 1388.0 | 4.2045 | 5796.0 | 17.5573 | 11140.0 | ... | 49.0 | 0.1484 | 1751.0 | 5.3041 | 9079.0 | 27.5021 | 1642.0 | 4.9739 | 1971.0 | 100.0 |
| 48 | Washington | 19912.0 | 100.0 | 272.0 | 1.366 | 1045.0 | 5.2481 | 5962.0 | 29.9417 | 1374.0 | ... | 186.0 | 0.9341 | 1819.0 | 9.1352 | 8758.0 | 43.9835 | 562.0 | 2.8224 | 2305.0 | 100.0 |
| 49 | West Virginia | 15106.0 | 100.0 | 13.0 | 0.0861 | 85.0 | 0.5627 | 199.0 | 1.3174 | 594.0 | ... | 4.0 | 0.0265 | 507.0 | 3.3563 | 2360.0 | 15.6229 | 112.0 | 0.7414 | 720.0 | 100.0 |
| 50 | Wisconsin | 53896.0 | 100.0 | 495.0 | 0.9184 | 2265.0 | 4.2025 | 6932.0 | 12.8618 | 5835.0 | ... | 54.0 | 0.1002 | 2234.0 | 4.145 | 9410.0 | 17.4596 | 2586.0 | 4.7981 | 2232.0 | 100.0 |
| 51 | Wyoming | 635.0 | 100.0 | 73.0 | 11.4961 | 2.0 | 0.315 | 125.0 | 19.685 | 13.0 | ... | 3.0 | 0.4724 | 23.0 | 3.622 | 11.0 | 1.7323 | 20.0 | 3.1496 | 365.0 | 100.0 |
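
One way to confirm that the renamed columns line up with the right data is to recompute a percentage column from its count column. The sketch below does this for three rows copied from the output above. It is an illustrative check rather than part of the original analysis, and the tolerance of `1e-3` is an assumption chosen to allow for the four-decimal rounding in the displayed values.

```python
import numpy as np
import pandas as pd

# Three rows copied from the preschool.tail(10) output above.
sample = pd.DataFrame(
    {
        "State": ["South Dakota", "Texas", "Wyoming"],
        "Total (N)": [3715.0, 237646.0, 635.0],
        "AIAN (N)": [514.0, 791.0, 73.0],
        "AIAN (P)": [13.8358, 0.3328, 11.4961],
    }
)

# Recompute the percentage from the counts and compare it to the reported value.
recomputed = 100 * sample["AIAN (N)"] / sample["Total (N)"]
matches = np.isclose(recomputed, sample["AIAN (P)"], atol=1e-3)
print(matches.all())
```

If a column were mislabeled or shifted by one position, this kind of comparison would be expected to fail for most rows.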
