# Report Part 1: High Level Missingness

The goal of this report is to draw attention to major gaps in the National Incident-Based Reporting System (NIBRS) data. In particular, it seeks to identify cases where entire states are not reporting.

This report uses the "Universe" file, which lists all agencies in each state and whether each is listed as a NIBRS reporter, to identify agencies that should be reporting. It also compares Summary Reporting System (SRS) crime counts with the incidents reported by each agency.
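
As a minimal sketch of the comparison described above, the following uses small made-up frames to flag agencies that the Universe file lists as NIBRS reporters but that have no rows in the NIBRS incident counts. The ORIs, state names, and counts are invented for illustration, and the `"S"` value for non-NIBRS reporters is an assumption; only `reporting_type == "I"` is taken from the code later in this report.

```python
import pandas as pd

# Toy stand-ins for the Universe file and the NIBRS incident counts.
# All ORIs and numbers here are invented for illustration.
universe = pd.DataFrame({
    "ori": ["AA0000001", "AA0000002", "BB0000001"],
    "state_name": ["Alpha", "Alpha", "Beta"],
    "reporting_type": ["I", "S", "I"],  # "S" is an assumed non-NIBRS code
})
nibrs = pd.DataFrame({
    "ori": ["AA0000001"],
    "num_incidents": [120],
})

# A left merge keeps every Universe agency; indicator=True records whether
# a matching row was found in the NIBRS counts.
merged = universe.merge(nibrs, on="ori", how="left", indicator=True)
merged["in_nibrs"] = merged["_merge"] == "both"

# Agencies listed as NIBRS reporters that never show up in NIBRS.
missing_reporters = merged.loc[
    (merged["reporting_type"] == "I") & ~merged["in_nibrs"], "ori"
].tolist()
print(missing_reporters)
```

The report itself builds the same flag with a constant `In NIBRS` column and `fillna(False)`; the `indicator` option is swapped in here only because it keeps the sketch short.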

In [None]:
from datetime import datetime
import os
print("Author: Automated Pipeline")
# DATA_YEAR must be set by the pipeline; os.environ raises a KeyError if it is missing
year = int(os.environ['DATA_YEAR'])
print("Generating reports for year:",year)
print("Report date:", datetime.now().strftime("%m/%d/%y"))

In [None]:
from utils import *
from dictionaries import *
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from IPython.display import display, Markdown
from datetime import datetime as dt

output_folder = Path(os.getenv("OUTPUT_PIPELINE_DIR"))

years = get_available_years()
output_dir = output_folder / "QC_output_files"
output_dir.mkdir(parents=True, exist_ok=True)
input_dir = output_folder / "initial_tasks_output"

engine_database = connect_to_database()

### Import some variables we need.
state_name_to_abbrev = {v: k for k, v in us_state_abbrev.items()}

states_list = list(us_state_abbrev.values())
states_list.sort()

print("--------------------------------")
print(" Loading datasets, please wait. ")

reta_frame = pd.read_csv(output_folder/"artifacts"/f"missing_months_{year}.csv")
universe_frame = pd.read_csv(input_dir/f"ref_agency_{year}.csv")

srs_frame = pd.read_csv("../compute_weights/Data/srs2016_2020_smoothed.csv")


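# This query counts incidents per NIBRS agency for the report year.
# Because the WHERE clause filters on nibrs_incident.incident_date, agencies
# with no incidents in that year are not returned by the query.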
agency_incident_query = f"""
SELECT 
  ref_agency.ori,
  COUNT(nibrs_incident.incident_id) as num_incidents,
  ref_state.name as state_name,
  ref_agency.nibrs_start_date
 FROM ucr_prd.ref_agency_yearly ref_agency_yearly
	 LEFT JOIN ucr_prd.ref_agency ref_agency USING (agency_id)
	 LEFT JOIN ucr_prd.ref_state USING (state_id)
	 LEFT JOIN ucr_prd.ref_agency_status ref_agency_status USING (agency_id, data_year)
	 LEFT JOIN ucr_prd.nibrs_incident ON (nibrs_incident.agency_id = ref_agency.agency_id) AND (EXTRACT(year FROM nibrs_incident.incident_date) = ref_agency_yearly.data_year)
 WHERE ref_agency_status.data_year IS NOT NULL 
       AND ref_agency_yearly.is_nibrs IS TRUE
       AND EXTRACT(year FROM nibrs_incident.incident_date) = {year}
 GROUP BY ori, nibrs_start_date, state_name;
"""

nibrs_frame = pd.read_sql(agency_incident_query, engine_database)

next_year = dt(day=1,month=1,year=year+1)
nibrs_frame = nibrs_frame.loc[(pd.to_datetime(nibrs_frame["nibrs_start_date"]) < next_year)]

srs_frame = srs_frame.rename(columns={col:col.lower() for col in srs_frame.columns}).rename(columns={"ori_universe":"ori"})
universe_frame.rename(columns={col:col.lower() for col in universe_frame.columns},inplace=True)
reta_frame.rename(columns={col:col.lower() for col in reta_frame.columns},inplace=True)

# Get the total crime counts for srs
srs_frame["total_crime"] = srs_frame[["totcrime"]].sum(axis=1)

# Subset the universe by eligible agencies according to reta-mm
eligible_agencies = get_elegible_agency_list(reta_frame)

reta_frame_el = reta_frame.loc[reta_frame["ori"].isin(eligible_agencies)]

# Subset the universe and NIBRS frames to eligible agencies, then attach
# agency type and region to the NIBRS frame by merging on the ORI field
universe_frame_el = universe_frame.loc[universe_frame["ori"].isin(eligible_agencies)].copy()
nibrs_frame_el = nibrs_frame.loc[nibrs_frame["ori"].isin(eligible_agencies)].copy()
nibrs_frame_el = nibrs_frame_el.merge(universe_frame_el[["ori","parent_pop_group_desc","region_name"]],on="ori",how="left")

srs_frame_el = srs_frame.loc[srs_frame["ori"].isin(eligible_agencies)].copy()



print("              done              ")
print("--------------------------------")

## Part 1: Number of Eligible Agencies by State, Agency Type, and Region

This section identifies states, agency types, and regions where there are no eligible agencies in the Universe file, SRS file, or NIBRS database. Agency eligibility is determined from the Missing Months report data: an eligible agency is active, not covered by a different agency, and not dormant.
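
The coverage figures reported in this section can be sketched as follows, assuming a toy frame of eligible agencies (all values invented). The percentage is rounded down to a whole number, similar to the `int(...)` truncation used in the report code, and the 80% flag uses `>=`.

```python
import pandas as pd

# Hypothetical eligible-agency universe; values are invented.
universe_el = pd.DataFrame({
    "ori": ["A1", "A2", "A3", "B1", "B2"],
    "state_name": ["Alpha", "Alpha", "Alpha", "Beta", "Beta"],
    "reporting_type": ["I", "I", "S", "S", "S"],
})

# Count all eligible agencies and the type I (NIBRS) subset per state.
total = universe_el.groupby("state_name")["ori"].count()
type_i = (
    universe_el.loc[universe_el["reporting_type"] == "I"]
    .groupby("state_name")["ori"]
    .count()
)

# States with no type I agencies come out as NaN and are filled with 0.
coverage = pd.DataFrame({
    "Agencies in Universe": total,
    "NIBRS Reporters Universe": type_i,
}).fillna(0).astype(int)

# Whole-number percentage, rounded down.
coverage["Percent Coverage"] = (
    100 * coverage["NIBRS Reporters Universe"] // coverage["Agencies in Universe"]
)
coverage["Over 80% Coverage"] = coverage["Percent Coverage"] >= 80
print(coverage)
```

Every group in this sketch has at least one agency, so no division by zero can occur; the report guards that case explicitly because its state table also includes states absent from the Universe file.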

In [None]:
nibrs_frame_el["In NIBRS"] = True
uni_nibrs = universe_frame_el.merge(nibrs_frame_el[["ori","In NIBRS","num_incidents","nibrs_start_date"]],on="ori",how="left",suffixes=["__universe","__NIBRS"])
uni_nibrs["In NIBRS"] = uni_nibrs["In NIBRS"].fillna(False)
type_i_no_NIBRS = uni_nibrs.loc[(uni_nibrs["reporting_type"] == "I") & (uni_nibrs["In NIBRS"] == False)]

universe_cols = ['ucr_agency_name','ori','state_name','parent_pop_group_desc','population']
retamm_cols = ['ori','jan_mm_flag','feb_mm_flag','mar_mm_flag','apr_mm_flag','may_mm_flag','jun_mm_flag','jul_mm_flag','aug_mm_flag','sep_mm_flag','oct_mm_flag','nov_mm_flag','dec_mm_flag']

uni_nibrs_reta_srs = type_i_no_NIBRS[universe_cols].merge(reta_frame[retamm_cols],on="ori",how="left").merge(srs_frame[["ori","total_crime"]],on="ori",how="left").rename(columns={"total_crime":"SRS_total_crime"})
retamm_cols.remove('ori')
uni_nibrs_reta_srs.to_csv(output_dir / f"{year}_type_I_not_in_NIBRS_somecols.csv",index=False)
uni_nibrs_reta_srs = uni_nibrs_reta_srs.loc[uni_nibrs_reta_srs[retamm_cols].sum(axis=1) > 0]
uni_nibrs_reta_srs.to_csv(output_dir / f"{year}_type_I_not_in_NIBRS_some-retamm.csv",index=False)

# get agencies which are type I but not in NIBRS

type_i_no_NIBRS.to_csv(output_dir / f"{year}_type_I_not_in_NIBRS.csv",index=False)

# get agencies which are NOT type I but ARE in NIBRS
type_s_in_NIBRS = uni_nibrs.loc[(uni_nibrs["reporting_type"] != "I") & (uni_nibrs["In NIBRS"] == True)]
type_s_in_NIBRS.to_csv(output_dir / f"{year}_not_type_I_but_in_NIBRS.csv",index=False)

In [None]:
from IPython.display import display, Markdown

group_var_name = {
    "state_name":"State",
    "parent_pop_group_desc":"Agency Type",
    "region_name":"Region"
}

for grouping_var in ["state_name","parent_pop_group_desc","region_name"]:

    universe_counts_el = universe_frame_el.groupby(grouping_var)["ori"].count().to_dict()
    universe_typei_counts_el = universe_frame_el.loc[universe_frame_el["reporting_type"] == "I"\
                                                    ].groupby(grouping_var)["ori"].count().astype(int).to_dict()
    nibrs_counts_el = nibrs_frame_el.groupby(grouping_var)["ori"].count().astype(int).to_dict()

    nibrs_nonzero_counts_el = nibrs_frame_el.loc[nibrs_frame_el["num_incidents"] > 0].groupby(grouping_var)["ori"].count().astype(int).to_dict()

    if grouping_var == "state_name":
        count_frame2 = pd.DataFrame({"Abbrev":state_name_to_abbrev,\
                                "Agencies in Universe":universe_counts_el,\
                                "NIBRS Reporters Universe":universe_typei_counts_el,\
                                "NIBRS Reporters NIBRS":nibrs_counts_el,\
                                 "Nibrs Reporters NIBRS (>0 incidents)":nibrs_nonzero_counts_el
                                }).fillna(0).loc[states_list]
    else:
        count_frame2 = pd.DataFrame({"Agencies in Universe":universe_counts_el,\
                                "NIBRS Reporters Universe":universe_typei_counts_el,\
                                "NIBRS Reporters NIBRS":nibrs_counts_el,\
                                 "Nibrs Reporters NIBRS (>0 incidents)":nibrs_nonzero_counts_el
                                }).fillna(0)
    
    
    

    count_frame2["Percent Coverage"] = count_frame2.apply(lambda x: 0 if x["Agencies in Universe"] == 0 \
                                                        else int(100 * (x['NIBRS Reporters Universe']/x['Agencies in Universe']))  ,axis=1)

    count_frame2["Over 80% Coverage"] = count_frame2["Percent Coverage"].apply(lambda x: x >=80)
    count_frame2[["Agencies in Universe","NIBRS Reporters Universe","NIBRS Reporters NIBRS","Nibrs Reporters NIBRS (>0 incidents)"]\
                ] = count_frame2[["Agencies in Universe","NIBRS Reporters Universe","NIBRS Reporters NIBRS","Nibrs Reporters NIBRS (>0 incidents)"]
                                ].astype(int)

    count_frame2.sort_values(by=["Percent Coverage","NIBRS Reporters NIBRS"],ascending=False,inplace=True)
    count_frame2["Percent Coverage"] = count_frame2["Percent Coverage"].apply(lambda x: f"{x}%")


    universe_missing2 = set(count_frame2.loc[count_frame2["Agencies in Universe"] == 0].index.tolist())
    nibrs_missing2 = set(count_frame2.loc[count_frame2["NIBRS Reporters NIBRS"] == 0\
                                 ].loc[count_frame2["NIBRS Reporters Universe"] > 0\
                                      ].index.tolist())

    display(Markdown(f"### {group_var_name[grouping_var]}-Level Eligible Agencies"))
    # Sort the names so the printed lists are in a stable order between runs
    print(f"The following {group_var_name[grouping_var]}s have no eligible agencies in the Universe file:",", ".join(sorted(universe_missing2)))
    print(f"The following {group_var_name[grouping_var]}s have no eligible agencies in NIBRS but do have eligible NIBRS agencies in the Universe file:",", ".join(sorted(nibrs_missing2)))

    display(Markdown(f"\n The table below outlines the following:"))
    display(Markdown(f"""* **Agencies in Universe:** The number of eligible agencies in the Universe file for each {group_var_name[grouping_var]}.
* **NIBRS Reporters Universe:** The number of eligible agencies in the Universe file listed as NIBRS reporters for each {group_var_name[grouping_var]}.
* **NIBRS Reporters NIBRS:** The number of eligible agencies found in the NIBRS database for each {group_var_name[grouping_var]}.
* **Nibrs Reporters NIBRS (>0 incidents):** The number of eligible agencies found in the NIBRS database with at least one incident for each {group_var_name[grouping_var]}.
* **Percent Coverage:** The percent of eligible agencies in the Universe file which are listed as NIBRS reporters for each {group_var_name[grouping_var]}.
* **Over 80% Coverage:** Whether or not the Percent Coverage is at least 80% for the {group_var_name[grouping_var]}.
    """))
    if grouping_var == "parent_pop_group_desc":
        ordering = [
            'Cities under 2,500',
            'Cities from 2,500 thru 9,999',
            'Cities from 10,000 thru 24,999',
            'Cities from 25,000 thru 49,999',
            'Cities from 50,000 thru 99,999',
            'Cities from 100,000 thru 249,999',
            'All cities 250,000 or over',
            'MSA Counties',
            'Non-MSA Counties',
            'Possessions (Puerto Rico, Guam, Canal Zone, Virgin Islands, and American Samoa'
        ]
        count_frame2 = count_frame2.loc[ordering]
    
    display(count_frame2)

## Part 2: Missing Crimes from Eligible NIBRS Agencies in NIBRS or SRS

This section identifies states where eligible agencies have no crimes reported in the NIBRS database. It also looks at the SRS crime reports for eligible agencies that appear in NIBRS and identifies cases where the SRS crime counts are 0.
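
A minimal sketch of this check, using invented per-agency counts: totals are summed by state, and a state is kept only if its NIBRS total or its SRS total is zero. The column labels are shortened for the example and are not the ones used in the report table.

```python
import pandas as pd

# Invented per-agency counts for eligible type I agencies.
nibrs_typei = pd.DataFrame({
    "ori": ["A1", "A2", "B1"],
    "state_name": ["Alpha", "Alpha", "Beta"],
    "num_incidents": [40, 0, 0],
})
srs = pd.DataFrame({
    "ori": ["A1", "A2", "B1"],
    "state_name": ["Alpha", "Alpha", "Beta"],
    "total_crime": [35, 2, 7],
})

# Sum both sources by state; missing states would be filled with 0.
by_state = pd.DataFrame({
    "NIBRS incidents": nibrs_typei.groupby("state_name")["num_incidents"].sum(),
    "SRS crimes": srs.groupby("state_name")["total_crime"].sum(),
}).fillna(0).astype(int)

# Keep only states where at least one of the two totals is zero.
flagged = by_state.loc[
    (by_state["NIBRS incidents"] == 0) | (by_state["SRS crimes"] == 0)
]
print(flagged)
```

In this toy data only "Beta" is flagged: it has SRS crimes but no NIBRS incidents, which is the kind of state-level gap this section is meant to surface.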

In [None]:
typei_agencies = universe_frame_el.loc[universe_frame_el["reporting_type"] == "I"]["ori"].unique().tolist()


universe_typei_counts_el = universe_frame_el.loc[universe_frame_el["reporting_type"] == "I"\
                                                ].groupby("state_name")["ori"].count().astype(int).to_dict()


nibrs_typei = nibrs_frame_el.loc[nibrs_frame_el["ori"].isin(typei_agencies)]

nibrs_crime_counts_el = nibrs_typei.groupby("state_name")["num_incidents"].sum().astype(int).to_dict()

srs_crime_counts_el = srs_frame_el.loc[srs_frame_el["ori"].isin(nibrs_typei["ori"].tolist())].groupby("state_name")["total_crime"].sum().to_dict()


count_frame3 = pd.DataFrame({"state":state_name_to_abbrev,\
                            "Type I Agencies Universe":universe_typei_counts_el,\
                            "SRS crimes for Agencies in NIBRS":srs_crime_counts_el,\
                            "NIBRS incidents for Agencies in NIBRS":nibrs_crime_counts_el
                            }).fillna(0).loc[states_list]

count_frame3[["Type I Agencies Universe","SRS crimes for Agencies in NIBRS","NIBRS incidents for Agencies in NIBRS"]] = count_frame3[["Type I Agencies Universe","SRS crimes for Agencies in NIBRS","NIBRS incidents for Agencies in NIBRS"]].astype(int)

count_frame3 = count_frame3.loc[(count_frame3["SRS crimes for Agencies in NIBRS"] == 0) | (count_frame3["NIBRS incidents for Agencies in NIBRS"] == 0)]

srs_missing3 = count_frame3.loc[count_frame3["SRS crimes for Agencies in NIBRS"] == 0\
                               ].loc[count_frame3["Type I Agencies Universe"] > 0].index.tolist() 
nibrs_missing3 = count_frame3.loc[count_frame3["NIBRS incidents for Agencies in NIBRS"] == 0\
                             ].loc[count_frame3["Type I Agencies Universe"] > 0].index.tolist() 

if len(nibrs_missing3) > 0:
    print("The following states report no incidents in the NIBRS database despite there being "\
          "NIBRS type agencies in the Universe file:",", ".join(nibrs_missing3))
if len(srs_missing3) > 0:
    print("The following states report no crimes in SRS for NIBRS agencies despite there being "\
          "NIBRS type agencies in the Universe file:",", ".join(srs_missing3))

display(Markdown("""\n The table below outlines the following:
 * **Type I Agencies Universe:** The number of eligible agencies of type I in the Universe file for each state.
 * **SRS crimes for Agencies in NIBRS:** The number of crimes reported to SRS for eligible type I agencies which appeared in the NIBRS database.
 * **NIBRS incidents for Agencies in NIBRS:** The number of incidents found in the NIBRS database from eligible type I agencies in each state.

**NOTE:** States are only shown if at least one of the crime count columns is zero.
"""))

display(count_frame3)

## Part 3: Number of Agencies, Including Not Eligible, by State

This section identifies states with no agencies in each file. It does not take eligibility criteria into account.
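
One way to sketch the "states with no agencies" check is to count agencies per state and then align the counts against a full list of states, so that absent states show up with a count of zero. The state names and ORIs below are invented, and `reindex` is used here for brevity; the report code reaches the same result with `fillna(0)` followed by `.loc[states_list]`.

```python
import pandas as pd

# Hypothetical full state list and a universe that omits one state.
states = ["Alpha", "Beta", "Gamma"]
universe = pd.DataFrame({
    "ori": ["A1", "A2", "B1"],
    "state_name": ["Alpha", "Alpha", "Beta"],
})

# Count agencies per state, then align to the full list; states with no
# rows in the universe get a count of 0 instead of being dropped.
counts = universe.groupby("state_name")["ori"].count().reindex(states, fill_value=0)
missing_states = counts[counts == 0].index.tolist()
print(missing_states)
```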

In [None]:
from IPython.display import display, Markdown

universe_counts = universe_frame.groupby("state_name")["ori"].count().to_dict()
universe_typei_counts = universe_frame.loc[universe_frame["reporting_type"] == "I"\
                                          ].groupby("state_name")["ori"].count().to_dict()
srs_counts = srs_frame.groupby("state_name")["ori"].count().to_dict()
nibrs_counts = nibrs_frame.groupby("state_name")["ori"].count().to_dict()

count_frame = pd.DataFrame({"Abbrev":state_name_to_abbrev,\
                            "Agencies in Universe":universe_counts,\
                            "NIBRS Reporters Universe":universe_typei_counts,\
                            "NIBRS Reporters NIBRS":nibrs_counts
                            }).fillna(0).loc[states_list]

count_frame["Percent Coverage"] = count_frame.apply(lambda x: 0 if x["Agencies in Universe"] == 0 \
                                                    else int(100 * (x['NIBRS Reporters Universe']/x['Agencies in Universe'])),axis=1)
                                                    
count_frame["Over 80% Coverage"] = count_frame["Percent Coverage"].apply(lambda x: x>=80)
count_frame[["Agencies in Universe","NIBRS Reporters Universe","NIBRS Reporters NIBRS"]\
           ] = count_frame[["Agencies in Universe","NIBRS Reporters Universe","NIBRS Reporters NIBRS"]\
                          ].astype(int)

count_frame.sort_values(by=["Percent Coverage","NIBRS Reporters NIBRS"],ascending=False,inplace=True)

count_frame["Percent Coverage"] = count_frame["Percent Coverage"].apply(lambda x: f"{x}%")

universe_missing = set(count_frame.loc[count_frame["Agencies in Universe"] == 0].index.tolist())
nibrs_missing = set(count_frame.loc[count_frame["NIBRS Reporters NIBRS"] == 0\
                             ].loc[count_frame["NIBRS Reporters Universe"] > 0\
                                  ].index.tolist())

print("The following states have no agencies in the Universe file:",", ".join(universe_missing))
print("The following states have no agencies in NIBRS but do have NIBRS agencies in the Universe file:",", ".join(nibrs_missing))

display(Markdown("""\n The table below outlines the following:
 * **Agencies in Universe:** The number of agencies in the Universe file for each state.
 * **NIBRS Reporters Universe:** The number of agencies in the Universe file listed as NIBRS reporters.
 * **NIBRS Reporters NIBRS:** The number of agencies found in the NIBRS database for each state.
 * **Percent Coverage:** The percent of agencies in the Universe file which are listed as NIBRS reporters.
 * **Over 80% Coverage:** Whether or not the Percent Coverage is at least 80% for the state.
"""))
display(count_frame)

## Datasets Used:

`Missing months datafile`: missing_months_<year>.csv (reta missing months)
* **Source**: NIBRS database
* **Description**: All law enforcement agencies in the US, whether or not they should be reporting crimes, and which months they reported incidents. Lists eligible agencies and whether or not they reported in each month.
* **Typical data**: 23 columns of ORI, state, status flags, population information, and indicators of whether they reported crimes for each month.


`Universe datafile`: ref_agency_<year>.csv
* **Source**: FBI CJIS
* **Description**: Annual snapshot list of all agencies and metadata, regardless of NIBRS reporting status.
* **Typical data**: 66 columns of ORI, population, region, and officer metadata. This includes both NIBRS and non-NIBRS agencies.
   * Agency Population loaded from column (POPULATION)
   * Agency Officer Count loaded from columns (PE_MALE_OFFICER_COUNT + PE_FEMALE_OFFICER_COUNT)

`SRS datafile (historic)`: srs2016_2020_smoothed.csv
* **Source**: FBI CJIS
* **Description**: Summary Reporting System (SRS) Crime data smoothed across four years.
* **Typical data**: Several hundred columns of crime counts by month/category. For NIBRS agencies, the SRS crime counts should reflect the subset of incidents reported to NIBRS which are relevant.
   * SRS incident count is sum of all monthly total columns (v95,v213,v331,v449,v567,v685,v803,v921,v1039,v1157,v1275,v1393)

`NIBRS datafile`: Amazon Web Services database
* **Source**: FBI CJIS
* **Description**: Incident/Offender/Victim dataset of crimes published by the FBI.
* **Typical data**: Incident-level data can be retrieved in various ways (e.g. incident-, victim-, offender-, or agency-centric viewpoints).
   * Eligible ORIs selected from reta-mm have:
     * AGENCY_STATUS is 'Active' or 'Federal' (reject 'LEOKA', blanks)
     * COVERED_FLAG is 'N' (reject 'Y')
     * DORMANT_FLAG is 'N' (reject 'Y')
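
The three eligibility criteria above can be sketched as a simple filter. This is not the implementation of `get_elegible_agency_list` from `utils`, which is not shown in this report; the function name `eligible_oris` is hypothetical, the column names are assumed to be the lower-cased versions of the flags listed above, and the sample rows are invented.

```python
import pandas as pd

def eligible_oris(reta: pd.DataFrame) -> list:
    """Return ORIs that are active or federal, not covered, and not dormant."""
    mask = (
        reta["agency_status"].isin(["Active", "Federal"])  # rejects 'LEOKA' and blanks
        & (reta["covered_flag"] == "N")
        & (reta["dormant_flag"] == "N")
    )
    return reta.loc[mask, "ori"].unique().tolist()

# Invented missing-months rows covering each rejection case.
reta = pd.DataFrame({
    "ori": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "agency_status": ["Active", "Federal", "LEOKA", "Active", None, "Active"],
    "covered_flag": ["N", "N", "N", "Y", "N", "N"],
    "dormant_flag": ["N", "N", "N", "N", "N", "Y"],
})
print(eligible_oris(reta))
```

Only `A1` and `A2` pass all three checks; the remaining rows are rejected for a LEOKA status, a covered flag, a blank status, and a dormant flag respectively.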