# Report Part 3: Missing NIBRS Agencies Added Before Current Year


This report uses the "Universe" file, which lists all agencies for each state and whether each is flagged as a NIBRS reporter, to identify agencies that should be reporting. The Universe file also records the date on which each agency switched to NIBRS reporting. This report highlights agencies that switched to NIBRS before the current year but do not appear in the NIBRS database.

Note that in all cases we only look at eligible agencies. Eligibility is determined from the Missing Months report data for the year of interest: an eligible agency is one that is active, not covered by a different agency, and not dormant.

In [None]:
from datetime import datetime
import os
print("Author: Automated Pipeline")
year = int(os.getenv('DATA_YEAR'))
print("Generating reports for year:",year)
print("Report date:", datetime.now().strftime("%m/%d/%y"))

# the lower bound year to report on
if year % 2 == 0:
    min_year = year - 8
else:
    min_year = year - 7

In [None]:
# should be in the top level folder above src
from utils import *
from dictionaries import *
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from IPython.display import display, Markdown
from datetime import datetime as dt

output_folder = Path(os.getenv("OUTPUT_PIPELINE_DIR"))
output_dir = output_folder / "QC_output_files"
output_dir.mkdir(parents=True, exist_ok=True)
input_dir = output_folder / "initial_tasks_output"

engine_database = connect_to_database()

important_cols = [
    "DATA_YEAR",
    "ORI",
    "STATE_ABBR",
    "UCR_AGENCY_NAME",
    "AGENCY_TYPE_NAME",
    "POPULATION",
    "PE_MALE_OFFICER_COUNT",
    "PE_FEMALE_OFFICER_COUNT",
    "NIBRS_START_YEAR",
    "NIBRS_START_DATE",
    "NIBRS_LEOKA_START_DATE",
    "NIBRS_CT_START_DATE",
    "NIBRS_MULTI_BIAS_START_DATE",
    "NIBRS_OFF_ETH_START_DATE",
    #"ADDED_DATE",
]
state_name_to_abbrev = {v: k for k, v in us_state_abbrev.items()}

print("--------------------------------")
print(" Loading datasets, please wait. ")

reta_frame = pd.read_csv(output_folder/"artifacts"/f"missing_months_{year}.csv")
universe_frame = pd.read_csv(input_dir/f"ref_agency_{year}.csv")
available_years_nibrs = pd.read_sql('SELECT DISTINCT data_year FROM ucr_prd.ref_agency_status WHERE data_year IS NOT NULL;', engine_database)["data_year"].tolist()
available_years_nibrs.sort()

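# Plain (non-f) string: the {year} placeholder is filled per year by .format() in the loop below.
# An f-string here would bake in the report year and make every iteration run the same query.
agencies_query = """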
  SELECT DISTINCT 
    ref_agency.ori,
    ref_agency.nibrs_start_date,
    ref_agency_status.data_year
   FROM ucr_prd.ref_agency_yearly ref_agency_yearly
     LEFT JOIN ucr_prd.ref_agency USING (agency_id)
     LEFT JOIN ucr_prd.ref_agency_status USING (agency_id, data_year)
  WHERE ref_agency_status.data_year IS NOT NULL AND ref_agency_yearly.is_nibrs IS TRUE
    AND ref_agency_status.data_year = {year}
"""

# Get the agencies for each year up to the current year
nibrs_frame_dict = {}
for nibrs_year in available_years_nibrs:
    nibrs_frame_dict[nibrs_year] = pd.read_sql(agencies_query.format(year=nibrs_year), engine_database)
    # drop agencies that started the following year
    next_year = dt(day=1,month=1,year=nibrs_year+1)
    nibrs_frame_dict[nibrs_year] = nibrs_frame_dict[nibrs_year].loc[(pd.to_datetime(nibrs_frame_dict[nibrs_year]["nibrs_start_date"]) < next_year)]
    if nibrs_year == year: 
        break
        
# Subset the universe by eligible agencies according to reta-mm
eligible_agencies = get_elegible_agency_list(reta_frame)

universe_frame_el = universe_frame.loc[universe_frame["ORI"].isin(eligible_agencies)]

nibrs_frame_dict_el = {key: df.loc[df["ori"].isin(eligible_agencies)] for key, df in nibrs_frame_dict.items()}

# remove territories from universe
universe_frame_el = universe_frame_el.loc[universe_frame_el["STATE_NAME"].isin(us_state_abbrev.values())]

# get some values we are going to need for the analysis

universe_nibrs_el = universe_frame_el.loc[universe_frame_el["REPORTING_TYPE"] == "I"].copy()
universe_nibrs_el["NIBRS_START_YEAR"] = pd.to_datetime(universe_nibrs_el["NIBRS_START_DATE"]).dt.year
universe_nibrs_earlierstart_el = universe_nibrs_el.loc[universe_nibrs_el["NIBRS_START_YEAR"] < int(year)]

print("              done              ")
print("--------------------------------")

## Agencies Added Each Year

The table below identifies how many NIBRS agencies in the Universe file are said to have been added for each year.

For each state, the bar plot takes the number of NIBRS agencies added during each year range and divides it by the total number of eligible agencies (NIBRS and non-NIBRS) in the Universe file, expressed as a percentage.
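The percentage calculation described above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own code: the ORIs, the report year of 2022 (and therefore the bucket edges), and the `"S"` code used for non-NIBRS agencies are all assumptions made for the example.

```python
import pandas as pd

# Toy Universe data (made-up ORIs). "I" marks NIBRS agencies, as in the notebook;
# "S" is only a placeholder code for non-NIBRS agencies.
universe = pd.DataFrame({
    "STATE_NAME": ["Alabama", "Alabama", "Alabama", "Alabama", "Alaska", "Alaska"],
    "ORI": ["AL0000001", "AL0000002", "AL0000003", "AL0000004", "AK0000001", "AK0000002"],
    "REPORTING_TYPE": ["I", "I", "S", "I", "I", "S"],
    "NIBRS_START_YEAR": [2012, 2019, float("nan"), 2021, 2020, float("nan")],
})

# Year ranges for an assumed report year of 2022.
bins = [float("-inf"), 2013, 2019, 2021, float("inf")]
labels = ["before 2014", "2014 - 2019", "2020 - 2021", "2022+"]

nibrs = universe.loc[universe["REPORTING_TYPE"] == "I"].copy()
nibrs["bucket"] = pd.cut(nibrs["NIBRS_START_YEAR"], bins=bins, labels=labels).astype(str)

# Numerator: NIBRS agencies per state and year range.
counts = nibrs.groupby(["STATE_NAME", "bucket"])["ORI"].count().unstack(fill_value=0)

# Denominator: all eligible agencies per state, NIBRS or not.
totals = universe.groupby("STATE_NAME")["ORI"].count()

pct = counts.div(totals, axis=0).mul(100).round(2)
print(pct)
```

Because the denominator includes non-NIBRS agencies, the bars for a state sum to less than 100% whenever some of its eligible agencies are not NIBRS reporters.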

In [None]:
import plotly.express as px

added_years = {}

state_eligible_sum = universe_frame_el.groupby("STATE_NAME")["ORI"].count()

added_years["Total Type I (NIBRS)"] = universe_nibrs_el.groupby("STATE_NAME")["ORI"].count()

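# int() guards against float years (e.g. 2015.0) when the column contains missing start dates
universe_nibrs_el["NIBRS_START_YEAR_cutoff"] = universe_nibrs_el["NIBRS_START_YEAR"].apply(lambda x: str(int(x)) if x >= min_year else f"before {min_year}")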
added_years[f"Added before {min_year}"] = universe_nibrs_el.loc[universe_nibrs_el["NIBRS_START_YEAR_cutoff"] == f"before {min_year}"].groupby("STATE_NAME")["ORI"].count()

for added_year, year_group in universe_nibrs_el.groupby("NIBRS_START_YEAR_cutoff"):
    if added_year == f"before {min_year}":
        continue
    added_years[f"Added {added_year}"] = year_group.groupby("STATE_NAME")["ORI"].count()
    

    
added_frame = pd.DataFrame(added_years).fillna(0).astype(int)
added_frame.loc["ALL"] = added_frame.sum()
added_frame.sort_index(inplace=True)

display(added_frame)


# need to turn this into a dataframe with (state, added_year, count)
df = added_frame.copy()
df.drop(columns=["Total Type I (NIBRS)"], index=["ALL"],inplace=True)
df.rename(columns={col: int(col.replace("Added ","")) for col in df.columns if col != f"Added before {min_year}"},inplace=True)
year_cols = df.columns.values.tolist()
year_cols.remove(f"Added before {min_year}")

mid_year = min_year + 5
df[f"{min_year} - {mid_year}"] = df[min_year]
df.drop(columns=[min_year],inplace=True)
for i in range(min_year+1,mid_year + 1):
    df[f"{min_year} - {mid_year}"] += df[i]
    df.drop(columns=[i],inplace=True)


for i in range(mid_year+1,year-1,2):
    df[f"{i} - {i+1}"] = df[i] + df[i+1]
    df.drop(columns=[i,i+1],inplace=True)
    
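# pool the current year and any later start years into one open-ended bucket
# (i + 2 equals `year` here, whether or not the pairing loop above ran)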
df[f"{i+2}+"] = df[i+2]
df.drop(columns=[i+2],inplace=True)
for j in range(i+3, max(year_cols)+1):
    df[f"{i+2}+"] += df[j]
    df.drop(columns=[j],inplace=True)
    
for c in df.columns:
    df[c] = (100 * (df[c] / state_eligible_sum)).apply(lambda x: round(x,2))
    
df.index = df.index.rename(None)
df_reset = df.reset_index()
added_melted = df_reset.melt(id_vars="index").rename(columns={"variable":"Year Added","value":"% Agency in Universe","index":"State"})

fig = px.bar(added_melted, x="State", y="% Agency in Universe", color='Year Added',color_discrete_sequence=px.colors.qualitative.Safe)
fig.update_layout(xaxis=dict(tickmode="linear"))
display(fig)

## Missing Agencies which were Added in the Past

This section identifies agencies which, according to the Universe file, were added as NIBRS reporters before the current year and yet are missing from the NIBRS database.
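The check described above amounts to an anti-join between the Universe file and the database. A minimal sketch on toy data, assuming a report year of 2022, made-up ORIs, and a hypothetical `db_oris_by_year` mapping that stands in for the per-year database query results:

```python
import pandas as pd

year = 2022  # assumed report year for this illustration

# Toy Universe rows for NIBRS agencies (made-up ORIs and dates).
universe_nibrs = pd.DataFrame({
    "ORI": ["AA0000001", "AA0000002", "AA0000003"],
    "STATE_NAME": ["Alabama", "Alabama", "Alaska"],
    "NIBRS_START_DATE": ["2015-01-01", "2022-03-01", "2019-06-01"],
})

# Hypothetical stand-in for the ORIs found in the NIBRS database for each year.
db_oris_by_year = {
    2021: ["AA0000001", "AA0000003"],
    2022: ["AA0000001"],
}

# Keep agencies whose NIBRS start year is before the report year ...
start_year = pd.to_datetime(universe_nibrs["NIBRS_START_DATE"]).dt.year
earlier_start = universe_nibrs.loc[start_year < year]

# ... and that do not appear in the database for the report year.
missing = earlier_start.loc[~earlier_start["ORI"].isin(db_oris_by_year[year])].copy()

# Flag whether each missing agency did appear in an earlier year.
for earlier_year in sorted(db_oris_by_year):
    if earlier_year < year:
        missing[f"In_NIBRS_{earlier_year}"] = missing["ORI"].isin(db_oris_by_year[earlier_year])

print(missing)
```

Here `AA0000002` is excluded because it only started in 2022, `AA0000001` is present in the 2022 data, and `AA0000003` is reported as missing even though it appeared in 2021.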

In [None]:
from IPython.display import display, Markdown

nibrs_oris = nibrs_frame_dict_el[year]["ori"].tolist()
universe_not_nibrs = universe_nibrs_earlierstart_el.loc[~universe_nibrs_earlierstart_el["ORI"].isin(nibrs_oris)].copy()

# check whether the ORIs missing from NIBRS for the current year were present in earlier years
added_cols = []
for earlier_year in available_years_nibrs:
    if earlier_year == year:
        break
    earlier_nibrs_oris = nibrs_frame_dict_el[earlier_year]["ori"].tolist()
    added_cols.append(f"In_NIBRS_{earlier_year}")
    universe_not_nibrs[f"In_NIBRS_{earlier_year}"] = universe_not_nibrs["ORI"].isin(earlier_nibrs_oris)
    
output_file = output_dir / f"Missing agencies that were NIBRS before {year}.csv"

display(Markdown(f"The full set of agencies which were missing for year {year} "\
      f"but added before {year} are available at: {output_file}"))

universe_not_nibrs[important_cols + added_cols].to_csv(output_file,index=False)

display(Markdown(f"### Old NIBRS Agencies for each State which are missing from NIBRS database for {year}"))

display(Markdown(f"The number of agencies per state which were missing in {year} and were added "\
      f"before {year}.\n"))

missing_dict = universe_not_nibrs.groupby("STATE_NAME")["ORI"].count().sort_values(ascending=False).to_dict()

display(Markdown(f'* **ALL**: {universe_not_nibrs.groupby("STATE_NAME")["ORI"].count().sum()}'))
for state, count in missing_dict.items():
    display(Markdown(f"* **{state}**: {count}"))

In [None]:
for earlier_year in available_years_nibrs:
    if earlier_year == year:
        break
        
    display(Markdown(f"### Old NIBRS Agencies for each State which are missing from NIBRS database for {year} but not for {earlier_year}"))
    
    display(Markdown(f"\nThe number of agencies per state which were missing in {year} and were added "\
          f"before {year}, but did appear \nin the NIBRS data for year {earlier_year}.\n"))
    missing_dict2 = universe_not_nibrs.loc[universe_not_nibrs[f"In_NIBRS_{earlier_year}"]\
                                       ].groupby("STATE_NAME")["ORI"].count().sort_values(ascending=False).to_dict()
    # built-in sum keeps the total an int (np.sum of an empty list would display as 0.0)
    display(Markdown(f'* **ALL**: {sum(missing_dict2.values())}'))
    for state, count in missing_dict2.items():
        display(Markdown(f"* **{state}**: {count}"))
    

## Datasets Used:

`Missing months datafile`: missing_months_<year>.csv (reta missing months)
* **Source**: NIBRS database
* **Description**: All law enforcement agencies in the US, whether or not they should be reporting crimes, and what months they reported incidents. Lists eligible agencies and whether or not they reported for different months.
* **Typical data**: 23 columns of ORI, state, status flags, population information, and indicators for if they reported crimes for each month.


`Universe datafile`: ref_agency_<year>.csv
* **Source**: FBI CJIS
* **Description**: Annual Snapshot List of all agencies and meta-data, regardless of NIBRS reporting status.
* **Typical data**: 66 columns of ORI, population region and officer meta-data. This includes both NIBRS and non-NIBRS agencies.
   * Agency Population loaded from column (POPULATION)
   * Agency Officer Count loaded from columns (PE_MALE_OFFICER_COUNT + PE_FEMALE_OFFICER_COUNT)


`NIBRS datafile`: Amazon Web Services database
* **Source**: FBI CJIS
* **Description**: Incident/Offender/Victim dataset of crimes published by FBI
* **Typical data**: Incident level data can be retrieved in various ways (e.g. incident, victim, offender, or agency centric viewpoints)
   * Eligible ORIs are selected from reta-mm using the following criteria:
     * AGENCY_STATUS is 'Active' or 'Federal' (reject 'LEOKA', blanks)
     * COVERED_FLAG is 'N' (reject 'Y')
     * DORMANT_FLAG is 'N' (reject 'Y')
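
The eligibility criteria listed above can be sketched as a pandas filter. This is only an illustration: the notebook obtains its list from `get_elegible_agency_list` in `utils`, whose implementation is not shown here, so the function name `select_eligible_oris` and the toy data below are hypothetical.

```python
import pandas as pd

def select_eligible_oris(reta_mm: pd.DataFrame) -> list:
    """Return ORIs that are Active or Federal, not covered, and not dormant."""
    status_ok = reta_mm["AGENCY_STATUS"].isin(["Active", "Federal"])  # rejects 'LEOKA' and blanks
    not_covered = reta_mm["COVERED_FLAG"] == "N"
    not_dormant = reta_mm["DORMANT_FLAG"] == "N"
    return reta_mm.loc[status_ok & not_covered & not_dormant, "ORI"].tolist()

# Toy missing-months rows (made-up ORIs) covering each rejection case.
toy = pd.DataFrame({
    "ORI": ["AA0000001", "AA0000002", "AA0000003", "AA0000004", "AA0000005"],
    "AGENCY_STATUS": ["Active", "LEOKA", "Federal", "Active", "Active"],
    "COVERED_FLAG": ["N", "N", "N", "Y", "N"],
    "DORMANT_FLAG": ["N", "N", "N", "N", "Y"],
})

eligible = select_eligible_oris(toy)
print(eligible)
```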