# Report Part 4: NIBRS vs SRS Dataset Comparisons

**Summary of Report Finding**<br>
The goal of this report is to highlight differences in agency reporting between the National Incident-Based Reporting System (NIBRS) database and the Summary Reporting System (SRS) datasets. For example, when an agency actively reports to SRS but is missing from NIBRS, estimates that rely on NIBRS data may be negatively affected. In particular, we want to draw attention to discrepancies at large agencies (population >100,000 or >750 officers), which have a larger impact on estimates.

In [None]:
from datetime import datetime
import os
print("Author: Automated Pipeline")
year = os.getenv('DATA_YEAR')
print("Generating reports for year:",year)
print("Report date:", datetime.now().strftime("%m/%d/%y"))

In [None]:
%autosave 0
# Must mount nibrs share: mount_smbfs //rtpnfil02.rti.ns/0216153_NIBRS/ /mnt/nibrs_qc
high_population_threshold = 100000
high_officer_count_threshold = 750
import os
import pandas as pd
from IPython.display import display, HTML
from pathlib import Path
from utils import *
from dictionaries import *

output_folder = Path(os.getenv("OUTPUT_PIPELINE_DIR"))

years = get_available_years()
output_dir = output_folder / "QC_output_files"
output_dir.mkdir(parents=True, exist_ok=True)

# Import code to check for and process data files.
# If file is not present for this <year>, generate it
%run Part_4_code.ipynb

# Connect to AWS database via utils.connect_to_database
engine_database = connect_to_database()

# Load 3 primary datafiles and merge incident counts into nibrs
df_final, df_universe, df_incidents, df_srs = load_datasets(year, output_folder)

# Clean Up ORIs and Incident Counts
df_final = clean_up_oris(df_final)

# Add 12 columns with various flags for NIBRS vs SRS counts
df_final = compare_nibrs_to_srs(df_final)
print('comparisons flags created')

df_final.to_csv(output_dir / f"{year}_nibrs_srs_comparison.csv")
print('csv saved')

year_counts = str(year) + '_Counts'
year_large_pop_counts = str(year) + '_Over_' + str(high_population_threshold) + '_Pop'
year_large_officer_counts = str(year) + '_Over_' + str(high_officer_count_threshold) + '_Officer'
df_output = pd.DataFrame(columns=[year_counts, year_large_pop_counts, year_large_officer_counts,
                                  'Flag_Name', 'Flag_Definition', 'Comments'])
df_output_high_population = df_output.copy()
df_output_high_officer_count = df_output.copy()

df_output = create_output_df(df_output, df_final, year)
df_output_high_population = create_output_df_high_population(df_output_high_population, df_final, year, high_population_threshold)
df_output_high_officer_count = create_output_df_officer_count(df_output_high_officer_count, df_final, year, high_officer_count_threshold)

# Copy High Population and High Officer Count data into df_output 
df_output[year_large_pop_counts] = df_output_high_population[year_counts]
df_output[year_large_officer_counts] = df_output_high_officer_count[year_counts]

generate_csv_files(year, df_final)

databases_loaded = True

### Main Results

In [None]:
# NIBRS=0_SRS>0
# NIBRS_missing_SRS>0
print()
print("Agencies that appear in SRS, but have zero reports in NIBRS database or are missing from NIBRS database")
print()

# LARGE AGENCY RESULTS
# Mask for agencies with above threshold number of officers
large_officer = df_final['Officer_Count']>high_officer_count_threshold

# Mask for agencies with above threshold population
large_population = df_final['POPULATION']>high_population_threshold

# Select agencies that are large by either measure (officers or population)
# and that also have a NIBRS vs SRS flag set
flagged = (df_final['NIBRS=0_SRS>0'] == 1) | (df_final['NIBRS_missing_SRS>0'] == 1)
agencies_of_interest = df_final[(large_officer | large_population) & flagged]
num_observations = len(agencies_of_interest)

total_large_pop = agencies_of_interest['POPULATION'].sum()
total_large_officers = agencies_of_interest['Officer_Count'].sum()
total_srs_incidents = agencies_of_interest['srs_annual_total_incidents'].sum()

if num_observations == 1:
    print("     {} Large Agency".format(num_observations))
else:
    print("     {} Large Agencies".format(num_observations))
print("        Total population size", total_large_pop)
print("        Total officer count", int(total_large_officers))
print("        Total SRS incident count", int(total_srs_incidents))

if num_observations > 0:
    print()
    print("        ORI, Agency Name")

    for index, row in agencies_of_interest.iterrows():
        print("       ", row['ORI_resolved'] + ",", row['UCR_AGENCY_NAME'])
print()

# SMALL AGENCY RESULTS
# Mask for agencies below threshold number of officers
small_officer = df_final['Officer_Count']<=high_officer_count_threshold

# Mask for agencies below threshold population
small_population = df_final['POPULATION']<=high_population_threshold

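# Select agencies that are small by BOTH measures (the complement of the large group)
# and that also have a NIBRS vs SRS flag set. Using "|" here would also count
# agencies that are large on one measure, double counting them with the large group.
flagged = (df_final['NIBRS=0_SRS>0'] == 1) | (df_final['NIBRS_missing_SRS>0'] == 1)
agencies_of_interest = df_final[(small_officer & small_population) & flagged]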
num_observations = len(agencies_of_interest)

total_small_pop = agencies_of_interest['POPULATION'].sum()
total_small_officers = agencies_of_interest['Officer_Count'].sum()
total_srs_incidents = agencies_of_interest['srs_annual_total_incidents'].sum()

if num_observations == 1:
    print("     {} Small Agency".format(num_observations))
else:
    print("     {} Small Agencies".format(num_observations))
print("        Total population size", total_small_pop)
print("        Total officer count", int(total_small_officers))
print("        Total SRS incident count", int(total_srs_incidents))


In [None]:
# NIBRS>0_SRS=0
# NIBRS>0_SRS_missing

print()

print("Agencies that appear in NIBRS database, but have zero reports in SRS or are missing from SRS")
# print("(NIBRS incident counts > 0 but SRS incident counts are 0 or missing)")
print()
# Select agencies that are large by either measure (officers or population)
# and that also have a NIBRS vs SRS flag set
flagged = (df_final['NIBRS>0_SRS=0'] == 1) | (df_final['NIBRS>0_SRS_missing'] == 1)
agencies_of_interest = df_final[(large_officer | large_population) & flagged]
num_observations = len(agencies_of_interest)

total_large_pop = agencies_of_interest['POPULATION'].sum()
total_large_officers = agencies_of_interest['Officer_Count'].sum()
total_nibrs_incidents = agencies_of_interest['nibrs_annual_total_incidents'].sum()

if num_observations == 1:
    print("     {} Large Agency".format(num_observations))
else:
    print("     {} Large Agencies".format(num_observations))
print("        Total population size", total_large_pop)
print("        Total officer count", int(total_large_officers))
print("        Total NIBRS incident count", int(total_nibrs_incidents))

if num_observations > 0:
    print()
    print("        ORI, Agency Name")

    for index, row in agencies_of_interest.iterrows():
        print("       ", row['ORI_resolved'] + ",", row['UCR_AGENCY_NAME'])
print()

# SMALL AGENCY RESULTS
# Mask for agencies below threshold number of officers
small_officer = df_final['Officer_Count']<=high_officer_count_threshold

# Mask for agencies below threshold population
small_population = df_final['POPULATION']<=high_population_threshold

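# Select agencies that are small by BOTH measures (the complement of the large group)
# and that also have a NIBRS vs SRS flag set. Using "|" here would also count
# agencies that are large on one measure, double counting them with the large group.
flagged = (df_final['NIBRS>0_SRS=0'] == 1) | (df_final['NIBRS>0_SRS_missing'] == 1)
agencies_of_interest = df_final[(small_officer & small_population) & flagged]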
num_observations = len(agencies_of_interest)

total_small_pop = agencies_of_interest['POPULATION'].sum()
total_small_officers = agencies_of_interest['Officer_Count'].sum()
total_nibrs_incidents = agencies_of_interest['nibrs_annual_total_incidents'].sum()

if num_observations == 1:
    print("     {} Small Agency".format(num_observations))
else:
    print("     {} Small Agencies".format(num_observations))
print("        Total population size", total_small_pop)
print("        Total officer count", int(total_small_officers))
print("        Total NIBRS incident count", int(total_nibrs_incidents))

### All Agency Count Checks

In [None]:
df_output.style.hide()

### 5 CSV files are also generated for the following flags
- NIBRS>0_SRS=0
- NIBRS>0_SRS_missing
- NIBRS_missing_SRS>0
- NIBRS_missing_SRS=0
- NIBRS_missing_SRS_missing
<br/>
<br/>
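
The flag names above suggest a simple partition of agencies by whether each source has a positive count, a zero count, or no record at all. The sketch below illustrates that reading on toy data. It is an assumption inferred from the flag names, not the actual logic in `compare_nibrs_to_srs`. The ORIs are made up, and treating a missing value (NaN) as "agency missing from that source" is also an assumption.

```python
import pandas as pd

# Toy data: NaN stands in for "agency not present in that source" (assumption).
df = pd.DataFrame({
    "ORI_resolved": ["AA0000001", "BB0000002", "CC0000003", "DD0000004", "EE0000005"],
    "nibrs_annual_total_incidents": [120, 40, None, None, None],
    "srs_annual_total_incidents": [0, None, 300, 0, None],
})

nibrs = df["nibrs_annual_total_incidents"]
srs = df["srs_annual_total_incidents"]

# Hypothetical definitions inferred from the flag names only.
flags = {
    "NIBRS>0_SRS=0": (nibrs > 0) & (srs == 0),
    "NIBRS>0_SRS_missing": (nibrs > 0) & srs.isna(),
    "NIBRS_missing_SRS>0": nibrs.isna() & (srs > 0),
    "NIBRS_missing_SRS=0": nibrs.isna() & (srs == 0),
    "NIBRS_missing_SRS_missing": nibrs.isna() & srs.isna(),
}

# Store each flag as a 0/1 column, matching the `== 1` checks used in the cells above.
for name, mask in flags.items():
    df[name] = mask.astype(int)

print(df[["ORI_resolved"] + list(flags)])
```

Under these assumed definitions each toy agency sets exactly one flag, so the per-flag CSV files would not overlap.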

### Datasets Used:

`Missing months datafile`: missing_months_&lt;year&gt;.csv (reta missing months)
* **Source**: NIBRS database
* **Description**: All law enforcement agencies in the US, whether or not they should be reporting crimes. Lists eligible agencies and the months in which each one reported incidents.
* **Typical data**: 23 columns of ORI, state, status flags, population information, and indicators of whether the agency reported crimes in each month.


`Universe datafile`: ref_agency_year.xlsx
* **Source**: FBI CJIS
* **Description**: Annual snapshot list of all agencies and their metadata, regardless of NIBRS reporting status.
* **Typical data**: 66 columns of ORI, population, region, and officer metadata. This includes both NIBRS and non-NIBRS agencies.
   * Agency Population loaded from column (POPULATION)
   * Agency Officer Count loaded from columns (PE_MALE_OFFICER_COUNT + PE_FEMALE_OFFICER_COUNT)
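
As a minimal sketch of the two derived fields above, the officer count can be built by adding the male and female officer columns, and the large-agency definition from the summary can then be applied. The toy rows and ORIs are made up. Treating a single missing officer column as zero, while leaving the total missing when both are missing, is an assumption and not necessarily what `load_datasets` does.

```python
import pandas as pd

high_population_threshold = 100000
high_officer_count_threshold = 750

# Toy stand-in for the universe file (the real file has about 66 columns).
universe = pd.DataFrame({
    "ORI": ["AA0000001", "BB0000002", "CC0000003", "DD0000004"],
    "POPULATION": [250000, 12000, 90000, 500],
    "PE_MALE_OFFICER_COUNT": [600, 10, None, None],
    "PE_FEMALE_OFFICER_COUNT": [200, 2, 5, None],
})

officer_cols = ["PE_MALE_OFFICER_COUNT", "PE_FEMALE_OFFICER_COUNT"]

# min_count=1: a row needs at least one non-missing value to get a total;
# rows with both columns missing stay NaN instead of becoming 0.
universe["Officer_Count"] = universe[officer_cols].sum(axis=1, min_count=1)

# Large agency: above either threshold. Comparisons with NaN evaluate to False.
is_large = (
    (universe["Officer_Count"] > high_officer_count_threshold)
    | (universe["POPULATION"] > high_population_threshold)
)

print(universe.assign(is_large=is_large))
```

Note that an agency with a missing officer count and a small population is classified as not large here, because a comparison against NaN is always False.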

`SRS datafile (historic)`: srs2016_2020_smoothed.csv
* **Source**: FBI CJIS
* **Description**: Summary Reporting System (SRS) Crime data smoothed across four years.
* **Typical data**: Several hundred columns of crime counts by month and category. For NIBRS agencies, the SRS crime counts should reflect the subset of incidents reported to NIBRS that are relevant to SRS.
   * SRS incident count is sum of all monthly total columns (v95,v213,v331,v449,v567,v685,v803,v921,v1039,v1157,v1275,v1393)
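
The twelve monthly total columns listed above are evenly spaced (every 118th column starting at v95), so the list can be generated rather than typed out. The following sketch sums them into an annual total on a toy one-row frame. The `ORI` column name and the toy values are assumptions; `srs_annual_total_incidents` is the column name used in the results cells.

```python
import pandas as pd

# v95, v213, ..., v1393: twelve monthly total columns, 118 apart.
monthly_total_cols = [f"v{n}" for n in range(95, 1394, 118)]

# Toy SRS row with monthly totals 1, 2, ..., 12 (made-up values).
srs = pd.DataFrame(
    [{col: month for month, col in enumerate(monthly_total_cols, start=1)}]
)
srs.insert(0, "ORI", ["AA0000001"])

# Annual SRS incident count is the row-wise sum of the monthly totals.
srs["srs_annual_total_incidents"] = srs[monthly_total_cols].sum(axis=1)

print(monthly_total_cols)
print(srs[["ORI", "srs_annual_total_incidents"]])
```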

`NIBRS datafile`: Amazon Web Services database
* **Source**: FBI CJIS
* **Description**: Incident/Offender/Victim dataset of crimes published by the FBI.
* **Typical data**: Incident-level data can be retrieved in various ways (e.g. incident-, victim-, offender-, or agency-centric viewpoints)
   * Eligible ORIs are selected from reta-mm (the missing months datafile) when:
     * AGENCY_STATUS is 'Active' or 'Federal' (reject 'LEOKA', blanks)
     * COVERED_FLAG is 'N' (reject 'Y')
     * DORMANT_FLAG is 'N' (reject 'Y')
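
The eligibility rules above amount to a three-part row filter. A minimal sketch on toy data follows. The ORIs are made up, and the stored values are assumed to be spelled exactly as in the list ('Active', 'Federal', 'N', 'Y'); the real file may encode them differently.

```python
import pandas as pd

# Toy stand-in for the reta missing months file.
reta_mm = pd.DataFrame({
    "ORI": ["AA0000001", "BB0000002", "CC0000003", "DD0000004", "EE0000005", "FF0000006"],
    "AGENCY_STATUS": ["Active", "Federal", "LEOKA", None, "Active", "Active"],
    "COVERED_FLAG": ["N", "N", "N", "N", "Y", "N"],
    "DORMANT_FLAG": ["N", "N", "N", "N", "N", "Y"],
})

# Keep Active/Federal agencies that are neither covered by another agency nor dormant.
# isin() returns False for blanks (None/NaN), so those rows are rejected as well.
eligible = reta_mm[
    reta_mm["AGENCY_STATUS"].isin(["Active", "Federal"])
    & (reta_mm["COVERED_FLAG"] == "N")
    & (reta_mm["DORMANT_FLAG"] == "N")
]

print(eligible["ORI"].tolist())
```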