# More data cleaning for present school year: cleaning up outcomes for students who switch school systems

SQL script 3 tallies absences separately for the DCPS and PCS tables, which are too large to combine easily when they contain all students.

There are two types of student migration between schools:

1. A student moves between schools within DCPS or within PCS, e.g., the student attends one charter school and then switches to another
2. A student moves between the DCPS system and the PCS system

The SQL script handles (1) correctly: attendance is aggregated by ID, so switching schools within a system is not a problem. For students in group (2), the present script identifies those students and makes sure that their absence tally does not restart when they switch between school systems.

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell

from suso.utils import here

InteractiveShell.ast_node_interactivity = "all"
pd.set_option("display.max_columns", None)  # or 1000
pd.set_option("display.max_rows", None)  # or 1000
pd.set_option("display.max_colwidth", None)

In [None]:
DATA_DIR = here("data")

# 1: Remaining data cleaning

## 1.1: initialize db connection and load SUSO data/attendance outcomes

In [None]:
public_attendance = pd.read_parquet(
    DATA_DIR / "dcps_sy1718_attendanceoutcomes_suso.parquet"
)

In [None]:
charter_attendance = pd.read_parquet(
    DATA_DIR / "charter_sy1718_attendanceoutcomes_suso.parquet"
)

In [None]:
## add school-system indicator column (DCPS vs. PCS)
public_attendance["type_school"] = "DCPS"
charter_attendance["type_school"] = "PCS"

## 1.2: create lookup table for students who switch between DCPS and charter

Because the cumulative sums in SQL were computed separately for DCPS and public charter schools (they use separate attendance codes, and the data are large), students who switch between school systems have their absence clocks restarted in the new system. This also came up in Clarice's analysis.

Here, I find those students and write a lookup table to reconstruct their outcomes.
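To illustrate why the absence clock restarts and how it can be rebuilt, here is a minimal sketch on toy data. It is not the lookup-table code used in this notebook: the daily-record layout and the column names `date`, `is_absent`, and `cum_absent` are assumptions made for illustration only.

```python
import pandas as pd

# Toy daily records. `date`, `is_absent`, and `cum_absent` are hypothetical
# column names; the real tables may be structured differently.
dcps_toy = pd.DataFrame(
    {
        "usi": [1, 1, 2, 2],
        "date": pd.to_datetime(
            ["2017-09-01", "2017-09-05", "2017-09-01", "2017-09-05"]
        ),
        "is_absent": [1, 1, 0, 1],
        "type_school": "DCPS",
    }
)
pcs_toy = pd.DataFrame(
    {
        "usi": [1, 1],  # student 1 switches from DCPS to PCS in October
        "date": pd.to_datetime(["2017-10-02", "2017-10-03"]),
        "is_absent": [1, 0],
        "type_school": "PCS",
    }
)

# Students who appear in both systems
switchers = set(dcps_toy.usi).intersection(set(pcs_toy.usi))

# Stack both systems and order each student's days chronologically
combined = (
    pd.concat([dcps_toy, pcs_toy], ignore_index=True)
    .sort_values(["usi", "date"])
    .reset_index(drop=True)
)

# Running absence total per student, carried across the system switch
combined["cum_absent"] = combined.groupby("usi")["is_absent"].cumsum()
print(combined)
```

In this toy example, student 1 arrives at the charter school with two absences already on the clock, so the first charter absence is counted as the third rather than the first.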

In [None]:
## find students whose usi is present in both public and charter
usi_both = set(public_attendance.usi).intersection(set(charter_attendance.usi))
print(
    f"{len(usi_both)} students present in both dcps and pcs over the course of\n"
    "the 2017-2018 school year"
)

In [None]:
attendance_both_clean = pd.read_parquet(DATA_DIR / "attendance_both_clean.parquet")

## 1.4: Sanity checks

### Sanity check one: how many school days are students observed for?

Ideally, this should be close to 180 for each student.

In [None]:
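# per-student maximum of total_schooldays; the string "max" avoids the
# deprecation warning newer pandas versions raise for np.max in agg
days_perstudent = (
    attendance_both_clean.groupby("usi").agg({"total_schooldays": "max"}).reset_index()
)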
print(
    "The mean number of observed school days per student is: "
    + str(round(np.mean(days_perstudent.total_schooldays), 2))
    + " school days"
)
print(
    "The max number of observed school days per student is: "
    + str(round(np.max(days_perstudent.total_schooldays), 2))
    + " school days"
)
print(
    "The min number of observed school days per student is: "
    + str(round(np.min(days_perstudent.total_schooldays), 2))
    + " school days"
)

## see google doc for confirmation that the low count is correct for those
## students
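To cross-check the low counts against an external record, it can help to list the affected students explicitly. The following is a minimal sketch on toy data; `days_perstudent_toy` stands in for `days_perstudent`, and the cutoff `MIN_EXPECTED_DAYS` is a hypothetical value, not one taken from this analysis.

```python
import pandas as pd

# Toy stand-in for days_perstudent (one row per student)
days_perstudent_toy = pd.DataFrame(
    {"usi": [1, 2, 3], "total_schooldays": [180, 178, 42]}
)

# Hypothetical cutoff for "far fewer days than expected"
MIN_EXPECTED_DAYS = 150

# Students observed for fewer school days than the cutoff
low_count = days_perstudent_toy[
    days_perstudent_toy["total_schooldays"] < MIN_EXPECTED_DAYS
]
print(low_count)
```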

### Sanity check two: is the count of unexcused + excused always at least as large as unexcused only?

See below that total unexcused is always less than or equal to total excused or unexcused

In [None]:
np.unique(
    np.where(
        attendance_both_clean.total_unexcused
        <= attendance_both_clean.total_excusedorunexcused,
        1,
        0,
    ),
    return_counts=True,
)
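The check above reports counts of passing and failing rows. A stricter variant, sketched here on toy data, pulls out any violating rows and fails loudly if there are any. The toy frame reuses the column names `total_unexcused` and `total_excusedorunexcused` from above; the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for attendance_both_clean
attendance_toy = pd.DataFrame(
    {
        "usi": [1, 2, 3],
        "total_unexcused": [2, 5, 0],
        "total_excusedorunexcused": [4, 5, 1],
    }
)

# Rows where unexcused absences exceed the combined count (should be none)
violations = attendance_toy[
    attendance_toy["total_unexcused"] > attendance_toy["total_excusedorunexcused"]
]

assert violations.empty, f"{len(violations)} rows have unexcused > excused or unexcused"
```

Keeping the violating rows in a DataFrame makes it easy to inspect which students and dates are affected if the assertion ever fails.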