Table: Crashes
- CrashID (Primary Key, Auto Increment) - System-generated unique identifying number for a crash
- UnitNumber - Unit number entered on crash report for a unit involved in the crash
- PersonNumber - Person number captured on the crash report
- PersonType - Type of person involved in the crash
- Location - The physical location of an occupant in, on, or outside of the motor vehicle prior to the First Harmful Event or loss of control
- InjurySeverity - Severity of injury to the occupant
- Age - Age of person involved in the crash
- Ethnicity - Ethnicity of person involved in the crash
- Gender - Gender of person involved in the crash
- BodyExpulsion - The extent to which the person's body was expelled from the vehicle during any part of the crash
- RestraintType - The type of restraint used by each occupant
- AirbagDeployment - Indicates whether a person's airbag deployed during the crash and in what manner
- HelmetWorn - Indicates if a helmet was worn at the time of the crash
- Solicitation - Solicitation information
- AlcoholSpecimenType - Type of alcohol specimen taken for analysis from the primary persons involved in the crash
- AlcoholResult - Numeric blood alcohol content test result for a primary person involved in the crash, using standardized alcohol breath results (e.g., 0.08 or 0.129)
- DrugSpecimenType - Type of drug specimen taken for analysis from the primary persons involved in the crash
- DrugTestResult - Primary person drug test result
- TimeOfDeath - Time of death

Table: InjuryCounts
- CrashID (Foreign Key) - CrashID referencing the CrashID in the Crashes table
- SuspectedSeriousInjuryCount - Count of suspected serious injuries
- NonIncapacitatingInjuryCount - Count of non-incapacitating injuries
- PossibleInjuryCount - Count of possible injuries
- NotInjuredCount - Count of individuals not injured
- UnknownInjuryCount - Count of individuals with unknown injuries
- TotalInjuryCount - Total count of injuries
- DeathCount - Count of deaths
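
The two tables above could be created roughly as follows. This is a minimal sketch using Python's built-in `sqlite3` module; the column types and the reduced column set are assumptions, since the data dictionary lists names and descriptions only.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Column types are assumed; only a few of the documented columns are included.
conn.executescript("""
CREATE TABLE Crashes (
    CrashID INTEGER PRIMARY KEY AUTOINCREMENT,
    PersonType TEXT,
    InjurySeverity TEXT,
    Age INTEGER,
    HelmetWorn TEXT
);
CREATE TABLE InjuryCounts (
    CrashID INTEGER NOT NULL REFERENCES Crashes (CrashID),
    TotalInjuryCount INTEGER,
    DeathCount INTEGER
);
""")

# Made-up example row; CrashID is generated by the database.
cur = conn.execute(
    "INSERT INTO Crashes (PersonType, InjurySeverity, Age, HelmetWorn) VALUES (?, ?, ?, ?)",
    ("5 - driver of motorcycle type vehicle", "c - possible injury", 24, "1 - not worn"),
)
crash_id = cur.lastrowid
conn.execute("INSERT INTO InjuryCounts VALUES (?, ?, ?)", (crash_id, 1, 0))

# Join the two tables on the shared key.
row = conn.execute(
    "SELECT c.CrashID, c.InjurySeverity, i.TotalInjuryCount "
    "FROM Crashes AS c JOIN InjuryCounts AS i ON i.CrashID = c.CrashID"
).fetchone()
print(row)
```

With the pragma enabled, inserting an `InjuryCounts` row whose `CrashID` does not exist in `Crashes` raises `sqlite3.IntegrityError`.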


In [2]:
import os
from glob import glob

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

In [3]:
def get_single_motorcycle_crashes(df):
    '''
    Filter a person-level DataFrame down to crashes that have exactly one person record.
    Requires a 'crash_id' column, i.e. column names must already be standardized.
    Note: person_type is not checked here.

    INPUT:
    df = pandas DataFrame with one row per person involved in a crash

    OUTPUT:
    new_df = Filtered DataFrame containing only crashes with a single person record
    '''
    # reset_index() returns a new DataFrame and keeps the old index as an 'index' column
    original_df = df.reset_index()
    count_of_people_involved_in_crash = original_df['crash_id'].value_counts()
    crashes_with_only_one_person = count_of_people_involved_in_crash[count_of_people_involved_in_crash == 1].index
    crashes_with_only_one_person = crashes_with_only_one_person.to_list()
    new_df = original_df[original_df['crash_id'].isin(crashes_with_only_one_person)]
    return new_df


csv_files = [file for file in glob('*.csv') if '_all_person' in file] 

def process_csv_files(csv_files):
    # Read the first CSV file to initialize the dataframe with columns
    df = pd.read_csv(csv_files[0])

    # Iterate over the remaining CSV files and append them to the dataframe
    for file in csv_files[1:]:
        df = pd.concat([df, pd.read_csv(file)], ignore_index=True)

    # Standardize column names
    df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('-', '_')

    # Process the DataFrame for single motorcycle crashes
    df_svc = get_single_motorcycle_crashes(df)

    # Standardize the text in the DataFrame
    df_svc = df_svc.applymap(lambda x: x.lower().strip() if isinstance(x, str) else x)
    # Replace 'no data' with an empty string
    df_svc.replace('no data', '', inplace=True)

    return df_svc









In [4]:
process_csv_files(csv_files)

Unnamed: 0,index,crash_id,charge,citation,person_age,person_airbag_deployed,person_alcohol_result,person_alcohol_specimen_type_taken,person_blood_alcohol_content_test_result,person_death_count,person_drug_specimen_type,person_drug_test_result,person_ejected,person_ethnicity,person_gender,person_helmet,person_injury_severity,person_non_suspected_serious_injury_count,person_not_injured_count,person_possible_injury_count,person_restraint_used,person_suspected_serious_injury_count,person_time_of_death,person_total_injury_count,person_type,person_unknown_injury_count,physical_location_of_an_occupant
0,0,19202337,no charges,,24,97 - not applicable,,96 - none,,0,96 - none,97 - not applicable,97 - not applicable,w - white,1 - male,1 - not worn,c - possible injury,0,0,1,97 - not applicable,0,,1,5 - driver of motorcycle type vehicle,0,1 - front left or motorcycle driver
11,11,19185965,no charges,,75,97 - not applicable,,96 - none,,0,96 - none,97 - not applicable,97 - not applicable,w - white,1 - male,"3 - worn, not damaged",n - not injured,0,1,0,97 - not applicable,0,,0,5 - driver of motorcycle type vehicle,0,1 - front left or motorcycle driver
12,12,19071274,fail to drive single lane,tx6dlm0doxpg,60,97 - not applicable,,96 - none,,0,96 - none,97 - not applicable,97 - not applicable,w - white,1 - male,"3 - worn, not damaged",b - suspected minor injury,1,0,0,97 - not applicable,0,,1,5 - driver of motorcycle type vehicle,0,1 - front left or motorcycle driver
13,13,18923820,no charges,,52,97 - not applicable,1 - positive,2 - blood,0.184,1,2 - blood,1 - positive,97 - not applicable,w - white,1 - male,1 - not worn,k - fatal injury,0,0,0,97 - not applicable,0,1808,0,5 - driver of motorcycle type vehicle,0,1 - front left or motorcycle driver
27,27,19120390,no charges,,29,97 - not applicable,,96 - none,,0,96 - none,97 - not applicable,97 - not applicable,w - white,1 - male,99 - unknown if worn,n - not injured,0,1,0,97 - not applicable,0,,0,5 - driver of motorcycle type vehicle,0,1 - front left or motorcycle driver
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81139,81139,17514238,no charges,,44,97 - not applicable,,96 - none,,0,96 - none,97 - not applicable,97 - not applicable,w - white,1 - male,"2 - worn, damaged",n - not injured,0,1,0,97 - not applicable,0,,0,5 - driver of motorcycle type vehicle,0,1 - front left or motorcycle driver
81140,81140,17202931,no charges,,27,97 - not applicable,,96 - none,,0,96 - none,97 - not applicable,97 - not applicable,w - white,1 - male,"2 - worn, damaged",a - suspected serious injury,0,0,0,97 - not applicable,1,,1,5 - driver of motorcycle type vehicle,0,1 - front left or motorcycle driver
81143,81143,17202929,no charges,,26,97 - not applicable,,96 - none,,0,96 - none,97 - not applicable,97 - not applicable,w - white,1 - male,"2 - worn, damaged",b - suspected minor injury,1,0,0,97 - not applicable,0,,1,5 - driver of motorcycle type vehicle,0,1 - front left or motorcycle driver
81149,81149,17366774,no charges,,32,97 - not applicable,1 - positive,2 - blood,0.265,1,2 - blood,2 - negative,97 - not applicable,w - white,1 - male,1 - not worn,k - fatal injury,0,0,0,97 - not applicable,0,0023,0,5 - driver of motorcycle type vehicle,0,1 - front left or motorcycle driver


In [5]:
# Get list of all CSV files in the project directory with "_all_person" in the file name
csv_files = [file for file in glob('*.csv') if '_all_person' in file]


In [6]:
csv_files

['2022_all_person.csv',
 '2021_all_person.csv',
 '2020_all_person.csv',
 '2018_all_person.csv',
 '2019_all_person.csv']
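
The listing above is not in year order (2018 comes after 2020), because `glob` makes no ordering guarantee. A minimal sketch of a reproducible load: sort the names first and concatenate once from a list. The helper name `combine_csvs`, the `source_file` column, and the in-memory rows are hypothetical stand-ins, not part of the original data.

```python
import io

import pandas as pd


def combine_csvs(sources):
    """Read each (label, path-or-buffer) pair and stack the rows into one frame."""
    frames = []
    for label, src in sources:
        frame = pd.read_csv(src)
        # Hypothetical provenance column so each row can be traced to its file.
        frame["source_file"] = label
        frames.append(frame)
    # A single concat over the list avoids re-copying the growing frame in a loop.
    return pd.concat(frames, ignore_index=True)


# Tiny in-memory stand-ins for the yearly files (made-up rows).
demo_sources = [
    ("2019_all_person.csv", io.StringIO("Crash ID,Person Age\n2,31\n")),
    ("2018_all_person.csv", io.StringIO("Crash ID,Person Age\n1,40\n1,22\n")),
]

# Sort by file name so the row order does not depend on the filesystem.
combined = combine_csvs(sorted(demo_sources, key=lambda pair: pair[0]))
print(combined)
```

With real files, the pairs could be built as `[(name, name) for name in sorted(csv_files)]`.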

In [7]:
# Read the first CSV file to initialize the dataframe with columns
df = pd.read_csv(csv_files[0])

# Iterate over the remaining CSV files and append them to the dataframe
for file in csv_files[1:]:
    df = pd.concat([df, pd.read_csv(file)], ignore_index=True)
    
df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('-', '_')

# Display the combined dataframe
df.head()

Unnamed: 0,crash_id,charge,citation,person_age,person_airbag_deployed,person_alcohol_result,person_alcohol_specimen_type_taken,person_blood_alcohol_content_test_result,person_death_count,person_drug_specimen_type,person_drug_test_result,person_ejected,person_ethnicity,person_gender,person_helmet,person_injury_severity,person_non_suspected_serious_injury_count,person_not_injured_count,person_possible_injury_count,person_restraint_used,person_suspected_serious_injury_count,person_time_of_death,person_total_injury_count,person_type,person_unknown_injury_count,physical_location_of_an_occupant
0,19202337,NO CHARGES,No Data,24,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,1 - NOT WORN,C - POSSIBLE INJURY,0,0,1,97 - NOT APPLICABLE,0,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
1,19136802,NO CHARGES,No Data,64,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,H - HISPANIC,1 - MALE,1 - NOT WORN,C - POSSIBLE INJURY,0,0,1,97 - NOT APPLICABLE,0,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
2,19136802,NO CHARGES,No Data,46,1 - NOT DEPLOYED,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,1 - NO,W - WHITE,1 - MALE,97 - NOT APPLICABLE,N - NOT INJURED,0,1,0,1 - SHOULDER & LAP BELT,0,No Data,0,1 - DRIVER,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
3,19120420,NO CHARGES,No Data,25,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,"2 - WORN, DAMAGED",C - POSSIBLE INJURY,0,0,1,97 - NOT APPLICABLE,0,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
4,19120420,"NO DRIVER'S LICENSE, FTYROW - TURNING LEFT","L081650, L081650",42,1 - NOT DEPLOYED,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,H - HISPANIC,1 - MALE,97 - NOT APPLICABLE,N - NOT INJURED,0,1,0,1 - SHOULDER & LAP BELT,0,No Data,0,1 - DRIVER,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER


In [8]:
def get_single_motorcycle_crashes(df):
    '''
    Filter a person-level DataFrame down to crashes that have exactly one person record.
    Requires a 'crash_id' column, i.e. column names must already be standardized.
    Note: person_type is not checked here.

    INPUT:
    df = pandas DataFrame with one row per person involved in a crash

    OUTPUT:
    new_df = Filtered DataFrame containing only crashes with a single person record
    '''
    # reset_index() returns a new DataFrame and keeps the old index as an 'index' column
    original_df = df.reset_index()
    count_of_people_involved_in_crash = original_df['crash_id'].value_counts()
    crashes_with_only_one_person = count_of_people_involved_in_crash[count_of_people_involved_in_crash == 1].index
    crashes_with_only_one_person = crashes_with_only_one_person.to_list()
    new_df = original_df[original_df['crash_id'].isin(crashes_with_only_one_person)]
    return new_df

In [9]:
df_svc = get_single_motorcycle_crashes(df)

In [10]:
df_svc

Unnamed: 0,index,crash_id,charge,citation,person_age,person_airbag_deployed,person_alcohol_result,person_alcohol_specimen_type_taken,person_blood_alcohol_content_test_result,person_death_count,person_drug_specimen_type,person_drug_test_result,person_ejected,person_ethnicity,person_gender,person_helmet,person_injury_severity,person_non_suspected_serious_injury_count,person_not_injured_count,person_possible_injury_count,person_restraint_used,person_suspected_serious_injury_count,person_time_of_death,person_total_injury_count,person_type,person_unknown_injury_count,physical_location_of_an_occupant
0,0,19202337,NO CHARGES,No Data,24,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,1 - NOT WORN,C - POSSIBLE INJURY,0,0,1,97 - NOT APPLICABLE,0,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
11,11,19185965,NO CHARGES,No Data,75,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,"3 - WORN, NOT DAMAGED",N - NOT INJURED,0,1,0,97 - NOT APPLICABLE,0,No Data,0,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
12,12,19071274,FAIL TO DRIVE SINGLE LANE,TX6DLM0DOXPG,60,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,"3 - WORN, NOT DAMAGED",B - SUSPECTED MINOR INJURY,1,0,0,97 - NOT APPLICABLE,0,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
13,13,18923820,NO CHARGES,No Data,52,97 - NOT APPLICABLE,1 - POSITIVE,2 - BLOOD,0.184,1,2 - BLOOD,1 - POSITIVE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,1 - NOT WORN,K - FATAL INJURY,0,0,0,97 - NOT APPLICABLE,0,1808,0,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
27,27,19120390,NO CHARGES,No Data,29,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,99 - UNKNOWN IF WORN,N - NOT INJURED,0,1,0,97 - NOT APPLICABLE,0,No Data,0,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81139,81139,17514238,NO CHARGES,No Data,44,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,"2 - WORN, DAMAGED",N - NOT INJURED,0,1,0,97 - NOT APPLICABLE,0,No Data,0,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
81140,81140,17202931,NO CHARGES,No Data,27,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,"2 - WORN, DAMAGED",A - SUSPECTED SERIOUS INJURY,0,0,0,97 - NOT APPLICABLE,1,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
81143,81143,17202929,NO CHARGES,No Data,26,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,"2 - WORN, DAMAGED",B - SUSPECTED MINOR INJURY,1,0,0,97 - NOT APPLICABLE,0,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
81149,81149,17366774,NO CHARGES,No Data,32,97 - NOT APPLICABLE,1 - POSITIVE,2 - BLOOD,0.265,1,2 - BLOOD,2 - NEGATIVE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,1 - NOT WORN,K - FATAL INJURY,0,0,0,97 - NOT APPLICABLE,0,0023,0,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER


In [11]:
# Standardize the text in the DataFrame
df_svc = df_svc.applymap(lambda x: x.lower().strip() if isinstance(x, str) else x)
# Replace 'no data' with an empty string
df_svc.replace('no data', '', inplace=True)
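
Replacing 'no data' with an empty string leaves those cells non-null, so `isna()` will not count them. A minimal sketch of the difference on a made-up three-row frame, assuming that treating 'no data' as missing is the intent:

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the mixed casing and padding of the raw text.
toy = pd.DataFrame({
    "charge": ["NO CHARGES", "No Data", " Fail To Drive Single Lane "],
    "person_age": [24, 75, 60],
})

# Lower-case and trim the text column; the numeric column is left untouched.
cleaned = toy.copy()
cleaned["charge"] = cleaned["charge"].str.lower().str.strip()

# Option used in the cell above: empty string, which pandas does not treat as missing.
as_empty = cleaned.replace("no data", "")
# Alternative: a real missing value, which isna() does pick up.
as_missing = cleaned.replace("no data", np.nan)

print(as_empty["charge"].isna().sum())
print(as_missing["charge"].isna().sum())
```

Which option is preferable depends on whether downstream steps expect strings everywhere or rely on pandas' missing-value handling.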


In [12]:
df_svc.isna().sum()

index                                         0
crash_id                                      0
charge                                        9
citation                                     15
person_age                                    0
person_airbag_deployed                        0
person_alcohol_result                         0
person_alcohol_specimen_type_taken            0
person_blood_alcohol_content_test_result      0
person_death_count                            0
person_drug_specimen_type                     0
person_drug_test_result                       0
person_ejected                                0
person_ethnicity                              0
person_gender                                 0
person_helmet                                 0
person_injury_severity                        0
person_non_suspected_serious_injury_count     0
person_not_injured_count                      0
person_possible_injury_count                  0
person_restraint_used                   

## Function work

In [13]:
# Goal: reproduce get_single_motorcycle_crashes and keep only motorcycle drivers
df.head()

Unnamed: 0,crash_id,charge,citation,person_age,person_airbag_deployed,person_alcohol_result,person_alcohol_specimen_type_taken,person_blood_alcohol_content_test_result,person_death_count,person_drug_specimen_type,person_drug_test_result,person_ejected,person_ethnicity,person_gender,person_helmet,person_injury_severity,person_non_suspected_serious_injury_count,person_not_injured_count,person_possible_injury_count,person_restraint_used,person_suspected_serious_injury_count,person_time_of_death,person_total_injury_count,person_type,person_unknown_injury_count,physical_location_of_an_occupant
0,19202337,NO CHARGES,No Data,24,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,1 - NOT WORN,C - POSSIBLE INJURY,0,0,1,97 - NOT APPLICABLE,0,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
1,19136802,NO CHARGES,No Data,64,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,H - HISPANIC,1 - MALE,1 - NOT WORN,C - POSSIBLE INJURY,0,0,1,97 - NOT APPLICABLE,0,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
2,19136802,NO CHARGES,No Data,46,1 - NOT DEPLOYED,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,1 - NO,W - WHITE,1 - MALE,97 - NOT APPLICABLE,N - NOT INJURED,0,1,0,1 - SHOULDER & LAP BELT,0,No Data,0,1 - DRIVER,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
3,19120420,NO CHARGES,No Data,25,97 - NOT APPLICABLE,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,W - WHITE,1 - MALE,"2 - WORN, DAMAGED",C - POSSIBLE INJURY,0,0,1,97 - NOT APPLICABLE,0,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
4,19120420,"NO DRIVER'S LICENSE, FTYROW - TURNING LEFT","L081650, L081650",42,1 - NOT DEPLOYED,No Data,96 - NONE,No Data,0,96 - NONE,97 - NOT APPLICABLE,97 - NOT APPLICABLE,H - HISPANIC,1 - MALE,97 - NOT APPLICABLE,N - NOT INJURED,0,1,0,1 - SHOULDER & LAP BELT,0,No Data,0,1 - DRIVER,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER


In [14]:
filtered_df = df[~df['crash_id'].duplicated(keep=False)]
filtered_df = filtered_df[filtered_df.person_type == '5 - DRIVER OF MOTORCYCLE TYPE VEHICLE']
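
The filter above can be checked on a small made-up frame. The sketch shows that `duplicated(keep=False)` flags every row of a repeated `crash_id`, that it selects the same rows as the `value_counts` approach in `get_single_motorcycle_crashes`, and what the extra `person_type` condition removes.

```python
import pandas as pd

MC_DRIVER = "5 - DRIVER OF MOTORCYCLE TYPE VEHICLE"

# Made-up rows: crash 2 has two people, crash 3 has one person who is not on a motorcycle.
toy = pd.DataFrame({
    "crash_id": [1, 2, 2, 3],
    "person_type": [MC_DRIVER, MC_DRIVER, "1 - DRIVER", "1 - DRIVER"],
})

# keep=False marks every row of a repeated crash_id, not just the later repeats.
repeated = toy["crash_id"].duplicated(keep=False)
single_person = toy[~repeated]

# Same selection via per-crash row counts, as in the value_counts approach.
rows_per_crash = toy["crash_id"].map(toy["crash_id"].value_counts())
assert single_person.equals(toy[rows_per_crash == 1])

# The person_type condition then drops single-person crashes without a motorcycle driver.
motorcycle_only = single_person[single_person["person_type"] == MC_DRIVER]
print(single_person["crash_id"].tolist())    # [1, 3]
print(motorcycle_only["crash_id"].tolist())  # [1]
```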

In [15]:
crash_id_master = filtered_df.crash_id

In [16]:
test = df[~df['crash_id'].duplicated(keep=False)]

In [17]:
# Tally person types among the 14557 rows whose crash_id appears only once
test.person_type.value_counts()

In [18]:
import functions as f

In [19]:
lat_long_files = f.get_csv_files(keyword='lat_long')

In [20]:
# test_master_lat_long = f.process_csv_files(csv_files=lat_long_files)

## Data Validation

In [21]:
svc_master = pd.read_csv('new_svcs (2).csv')
svc_master.columns = svc_master.columns.str.lower().str.replace(' ', '_').str.replace('-', '_')

In [22]:
filtered_df_1 = svc_master[~svc_master['crash_id'].duplicated(keep=False)]
filtered_df_1 = filtered_df_1[filtered_df_1.person_type == '5 - DRIVER OF MOTORCYCLE TYPE VEHICLE']

In [23]:
filtered_df_1.head()

Unnamed: 0,crash_id,$1000_damage_to_any_one_person's_property,active_school_zone_flag,adjusted_average_daily_traffic_amount,adjusted_percentage_of_average_daily_traffic_for_trucks,adjusted_roadway_part,agency,at_intersection_flag,average_daily_traffic_amount,average_daily_traffic_year,case_id,city,contributing_factors,control_section,control_section_milepoint,county,crash_date,crash_month,crash_number,crash_severity,crash_time,crash_year,day_of_week,dfo,highway_alpha_suffix,highway_lane_design,highway_number,highway_system,hour_of_day,inside_shoulder_width_on_divided_highway,intersecting_highway_alpha_suffix,intersecting_highway_number,intersecting_highway_system,intersecting_street_name,intersection_related,latitude,left_curb_type,left_shoulder_type,left_shoulder_use,left_shoulder_width,light_condition,longitude,manner_of_collision,median_type,median_width,median_width_plus_both_inside_shoulders,medical_advisory_flag,mpo,number_of_entering_roads,number_of_lanes,object_struck,on_system_flag,other_factor,outside_shoulder_width_on_divided_highway,percentage_of_combo_truck_average_daily_traffic,percentage_of_single_unit_truck_average_daily_traffic,physical_feature_1,physical_feature_2,population_group,private_drive_flag,property_damages,reference_marker_number,reference_marker_offset_distance,right_curb_type,right_of_way_usual_width,right_shoulder_type,right_shoulder_use,right_shoulder_width,road_base_type,road_class,roadbed_width,roadway_alignment,roadway_function,roadway_part,roadway_relation,roadway_type,rural_flag,rural_urban_type,speed_limit,street_name,street_number,surface_condition,surface_width,toll_road_flag,traffic_control_type,txdot_reportable_flag,weather_condition,contributing_factor_1,contributing_factor_2,contributing_factor_3,driver_blood_alcohol_content_test_result,driver_drug_test_result,driver_license_class,driver_license_endorsements,driver_license_restrictions,driver_license_state,driver_license_type,emergency_responder_flag,financial_responsibility_proof,financial_responsibility_type,license_plate_state,possible_contributing_factor_1,possible_contributing_factor_2,possible_vehicle_defect_1,vehicle_body_style,vehicle_color,vehicle_damage_rating_1___area,vehicle_damage_rating_1___direction_of_force,vehicle_damage_rating_1___severity,vehicle_damage_rating_2___area,vehicle_damage_rating_2___direction_of_force,vehicle_damage_rating_2___severity,vehicle_defect_1,vehicle_hit_and_run_flag,vehicle_inventoried_flag,vehicle_make,vehicle_model_name,vehicle_model_year,vehicle_parked_flag,vehicle_towed_by,vehicle_towed_to,vehicle_travel_direction,charge,citation,person_age,person_ethnicity,person_gender,person_helmet,person_injury_severity,person_type,physical_location_of_an_occupant
0,16189632,No,NO,no data,no data,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,no data,no data,1848553,OUTSIDE CITY LIMITS,OTHER (EXPLAIN IN NARRATIVE),no data,no data,JIM WELLS,2018-01-01,1,2018006416,A - SUSPECTED SERIOUS INJURY,1123,2018,MONDAY,no data,-1,no data,no data,no data,11:00 - 11:59,no data,no data,no data,no data,no data,NON INTERSECTION,28.04319507,no data,no data,no data,no data,"3 - DARK, LIGHTED",-97.9290595,ONE MOTOR VEHICLE - GOING STRAIGHT,no data,no data,no data,No,no data,97 - NOT APPLICABLE,no data,OVERTURNED,No,NOT APPLICABLE,no data,no data,no data,NOT APPLICABLE,NOT APPLICABLE,RURAL,NO,NONE,no data,no data,no data,no data,no data,no data,no data,no data,COUNTY ROAD,no data,"1 - STRAIGHT, LEVEL",no data,1 - MAIN/PROPER LANE,OFF ROADWAY,no data,Yes,no data,60,COUNTY ROAD 373,178,1 - DRY,no data,NO,98 - OTHER (EXPLAIN IN NARRATIVE),Yes,2 - CLOUDY,98 - OTHER (EXPLAIN IN NARRATIVE),no data,no data,no data,97 - NOT APPLICABLE,C - CLASS C,NONE,NONE,TX - TEXAS,1 - DRIVER LICENSE,No,NO,no data,TX - TEXAS,no data,no data,no data,MC - MOTORCYCLE,BLU - BLUE,"MC - MOTORCYCLE, SCOOTER, MOPED, ETC. ONLY",no data,1 - DAMAGED 1 MINIMUM,no data,no data,no data,no data,No,NO,OTHER (EXPLAIN IN NARRATIVE),OTHER (EXPLAIN IN NARRATIVE) (OTHER (EXPLAIN I...,no data,NO,TAKEN BY FATHER,TAKEN BY FATHER,N - NORTH,OPERATE UNREGISTERED MOTOR VEHICLE,TX52Q80UKZPL,37,W - WHITE,1 - MALE,1 - NOT WORN,A - SUSPECTED SERIOUS INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER
1,16203470,No,NO,no data,no data,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,no data,no data,no data,OUTSIDE CITY LIMITS,ANIMAL ON ROAD - DOMESTIC,no data,no data,HIDALGO,2018-01-04,1,2018018233,C - POSSIBLE INJURY,1316,2018,THURSDAY,no data,-1,no data,no data,no data,13:00 - 13:59,no data,no data,no data,no data,no data,NON INTERSECTION,26.29303508,no data,no data,no data,no data,1 - DAYLIGHT,-98.34886112,ONE MOTOR VEHICLE - GOING STRAIGHT,no data,no data,no data,No,HIDALGO COUNTY,97 - NOT APPLICABLE,no data,NOT APPLICABLE,No,NOT APPLICABLE,no data,no data,no data,NOT APPLICABLE,NOT APPLICABLE,RURAL,NO,DOG,no data,no data,no data,no data,no data,no data,no data,no data,COUNTY ROAD,no data,"1 - STRAIGHT, LEVEL",no data,1 - MAIN/PROPER LANE,ON ROADWAY,no data,Yes,no data,30,CALIFORNIA ST,6350,1 - DRY,no data,NO,96 - NONE,Yes,1 - CLEAR,1 - ANIMAL ON ROAD - DOMESTIC,no data,no data,no data,97 - NOT APPLICABLE,C - CLASS C,NONE,WITH CORRECTIVE LENSES,TX - TEXAS,1 - DRIVER LICENSE,No,YES,2 - PROOF OF LIABILITY INSURANCE,TX - TEXAS,no data,no data,no data,MC - MOTORCYCLE,GRY - GRAY,"MC - MOTORCYCLE, SCOOTER, MOPED, ETC. ONLY",no data,1 - DAMAGED 1 MINIMUM,no data,no data,no data,no data,No,NO,SUZUKI,GSX-R600 (SUZUKI),2004,NO,no data,no data,N - NORTH,"NO CLASS ""M"" LICENSE",TX52QD0NAP34,30,H - HISPANIC,1 - MALE,"3 - WORN, NOT DAMAGED",C - POSSIBLE INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER
2,16191458,Yes,NO,no data,no data,1 - MAIN/PROPER LANE,WILLIAMSON COUNTY SHERIFF'S OFFICE,False,no data,no data,20180100170,OUTSIDE CITY LIMITS,UNSAFE SPEED;OTHER (EXPLAIN IN NARRATIVE),no data,no data,WILLIAMSON,2018-01-05,1,2018007939,99 - UNKNOWN,2207,2018,FRIDAY,no data,-1,no data,no data,no data,22:00 - 22:59,no data,no data,no data,no data,no data,NON INTERSECTION,30.46717132,no data,no data,no data,no data,"3 - DARK, LIGHTED",-97.8333446,ONE MOTOR VEHICLE - GOING STRAIGHT,no data,no data,no data,No,CAPITAL AREA METROPOLITAN PLANNING ORGANIZATIO...,97 - NOT APPLICABLE,no data,OVERTURNED,No,NOT APPLICABLE,no data,no data,no data,NOT APPLICABLE,NOT APPLICABLE,RURAL,NO,NONE,no data,no data,no data,no data,no data,no data,no data,no data,COUNTY ROAD,no data,"1 - STRAIGHT, LEVEL",no data,1 - MAIN/PROPER LANE,ON ROADWAY,no data,Yes,no data,35,SABINAL TRL,2794,1 - DRY,no data,NO,96 - NONE,Yes,1 - CLEAR,98 - OTHER (EXPLAIN IN NARRATIVE),no data,no data,no data,97 - NOT APPLICABLE,C - CLASS C,NONE,NONE,TX - TEXAS,1 - DRIVER LICENSE,No,YES,2 - PROOF OF LIABILITY INSURANCE,TX - TEXAS,60 - UNSAFE SPEED,no data,no data,MC - MOTORCYCLE,WHI - WHITE,FD - FRONT END DAMAGE DISTRIBUTED IMPACT,10 - 10 O'CLOCK,3 - DAMAGED 3,FD - FRONT END DAMAGE DISTRIBUTED IMPACT,9 - 9 O'CLOCK,2 - DAMAGED 2,no data,No,YES,YAMAHA,YX600 (YAMAHA),2010,NO,LAKESIDE WRECKER,12228 ROXIE DR. AUSTIN 78681,UNK - UNKNOWN,ACCIDENT INVOLVING DAMAGE TO VEHICLE>=$200/ FS...,2018-01-00170,20,W - WHITE,1 - MALE,99 - UNKNOWN IF WORN,99 - UNKNOWN,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER
3,16192023,Yes,NO,no data,no data,1 - MAIN/PROPER LANE,HARRIS COUNTY SHERIFF'S OFFICE,False,no data,no data,180002640,OUTSIDE CITY LIMITS,FAILED TO DRIVE IN SINGLE LANE;UNSAFE SPEED,no data,no data,HARRIS,2018-01-05,1,2018008161,A - SUSPECTED SERIOUS INJURY,2045,2018,FRIDAY,no data,-1,no data,no data,no data,20:00 - 20:59,no data,no data,no data,no data,no data,NON INTERSECTION,29.95793172,no data,no data,no data,no data,"2 - DARK, NOT LIGHTED",-95.64982499,ONE MOTOR VEHICLE - GOING STRAIGHT,no data,no data,no data,No,HOUSTON-GALVESTON AREA COUNCIL (HGAC),97 - NOT APPLICABLE,no data,HIT CULVERT-HEADWALL,No,NOT APPLICABLE,no data,no data,no data,NOT APPLICABLE,NOT APPLICABLE,RURAL,NO,NONE,no data,no data,no data,no data,no data,no data,no data,no data,COUNTY ROAD,no data,"1 - STRAIGHT, LEVEL",no data,1 - MAIN/PROPER LANE,OFF ROADWAY,no data,Yes,no data,45,TELGE RD,no data,1 - DRY,no data,NO,17 - MARKED LANES,Yes,1 - CLEAR,23 - FAILED TO DRIVE IN SINGLE LANE,no data,no data,no data,97 - NOT APPLICABLE,C - CLASS C,NONE,NONE,TX - TEXAS,1 - DRIVER LICENSE,No,YES,2 - PROOF OF LIABILITY INSURANCE,TX - TEXAS,60 - UNSAFE SPEED,no data,11 - DEFECTIVE STEERING MECHANISM,MC - MOTORCYCLE,BLU - BLUE,"MC - MOTORCYCLE, SCOOTER, MOPED, ETC. ONLY",no data,1 - DAMAGED 1 MINIMUM,no data,no data,no data,no data,No,YES,YAMAHA,YZFR6 (YAMAHA),2017,NO,CERTIFIED TOWING & TRANSPORT,17715 CLAY RD,S - SOUTH,NO CHARGES,no data,21,W - WHITE,1 - MALE,"2 - WORN, DAMAGED",A - SUSPECTED SERIOUS INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER
4,16196720,No,NO,no data,no data,1 - MAIN/PROPER LANE,MCALLEN POLICE DEPARTMENT,False,no data,no data,2018-1099,MCALLEN,FAILED TO CONTROL SPEED,no data,no data,HIDALGO,2018-01-05,1,2018012213,B - SUSPECTED MINOR INJURY,307,2018,FRIDAY,no data,-1,no data,no data,no data,03:00 - 03:59,no data,no data,no data,no data,N 27TH ST,INTERSECTION RELATED,26.2246359,no data,no data,no data,no data,"2 - DARK, NOT LIGHTED",-98.24724129,ONE MOTOR VEHICLE - GOING STRAIGHT,no data,no data,no data,No,HIDALGO COUNTY,97 - NOT APPLICABLE,no data,OVERTURNED,No,"SLOWING/STOPPING - FOR OFFICER, FLAGMAN, OR TR...",no data,no data,no data,NOT APPLICABLE,NOT APPLICABLE,"100,000 - 249,999 POP",NO,NONE,no data,no data,no data,no data,no data,no data,no data,no data,CITY STREET,no data,"1 - STRAIGHT, LEVEL",no data,1 - MAIN/PROPER LANE,ON ROADWAY,no data,No,no data,35,TAMARACK AVE,2686,1 - DRY,no data,NO,8 - STOP SIGN,Yes,1 - CLEAR,22 - FAILED TO CONTROL SPEED,no data,no data,no data,97 - NOT APPLICABLE,5 - UNLICENSED,UNLICENSED,UNLICENSED,TX - TEXAS,4 - ID CARD,No,NO,no data,TX - TEXAS,no data,no data,no data,MC - MOTORCYCLE,BLU - BLUE,"MC - MOTORCYCLE, SCOOTER, MOPED, ETC. ONLY",no data,1 - DAMAGED 1 MINIMUM,no data,no data,no data,no data,No,no data,YAMAHA,RZ500 (YAMAHA),2002,NO,no data,no data,S - SOUTH,NO DRIVER LICENSE NO INSURANCE,667341,18,H - HISPANIC,1 - MALE,1 - NOT WORN,B - SUSPECTED MINOR INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER


In [24]:
# Print the value distribution of every column as a quick sanity check
for col in filtered_df_1.columns:
    print(filtered_df_1[col].value_counts())
    

crash_id
16189632    1
18394514    1
18382203    1
18383951    1
18384531    1
           ..
18891909    1
18894741    1
18875613    1
18876202    1
19332955    1
Name: count, Length: 14548, dtype: int64
$1000_damage_to_any_one_person's_property
Yes    12581
No      1967
Name: count, dtype: int64
active_school_zone_flag
NO     14547
YES        1
Name: count, dtype: int64
adjusted_average_daily_traffic_amount
no data    6711
379          29
55           26
640          20
152153       17
           ... 
6522          1
7004          1
6373          1
9724          1
10030         1
Name: count, Length: 4267, dtype: int64
adjusted_percentage_of_average_daily_traffic_for_trucks
no data    6711
3.7         117
4.5         105
3.3          94
5.3          89
           ... 
54.4          1
44.9          1
56            1
36.8          1
42.9          1
Name: count, Length: 438, dtype: int64
adjusted_roadway_part
1 - MAIN/PROPER LANE                 12350
2 - SERVICE/FRONTAGE ROAD           

In [25]:
len(filtered_df_1)

14548
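
One way to see which `crash_id` values account for any difference between this frame and the one built from the person files is an outer merge with `indicator=True`. A minimal sketch on made-up IDs; the frame names `from_person_files` and `from_master_file` are hypothetical stand-ins for the two filtered frames.

```python
import pandas as pd

# Made-up IDs standing in for the crash_id columns of the two filtered frames.
from_person_files = pd.DataFrame({"crash_id": [101, 102, 103, 104]})
from_master_file = pd.DataFrame({"crash_id": [101, 102, 104]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
comparison = from_person_files.merge(
    from_master_file, on="crash_id", how="outer", indicator=True
)

only_in_person_files = comparison.loc[comparison["_merge"] == "left_only", "crash_id"]
only_in_master_file = comparison.loc[comparison["_merge"] == "right_only", "crash_id"]

print(only_in_person_files.tolist())
print(only_in_master_file.tolist())
```

IDs that appear on only one side are the rows to inspect when the two row counts disagree.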

In [26]:
lat_list = f.get_csv_files(keyword='lat')

In [27]:
df_svc.columns

Index(['index', 'crash_id', 'charge', 'citation', 'person_age',
       'person_airbag_deployed', 'person_alcohol_result',
       'person_alcohol_specimen_type_taken',
       'person_blood_alcohol_content_test_result', 'person_death_count',
       'person_drug_specimen_type', 'person_drug_test_result',
       'person_ejected', 'person_ethnicity', 'person_gender', 'person_helmet',
       'person_injury_severity', 'person_non_suspected_serious_injury_count',
       'person_not_injured_count', 'person_possible_injury_count',
       'person_restraint_used', 'person_suspected_serious_injury_count',
       'person_time_of_death', 'person_total_injury_count', 'person_type',
       'person_unknown_injury_count', 'physical_location_of_an_occupant'],
      dtype='object')

In [28]:
df_svc.person_injury_severity

0                 c - possible injury
11                    n - not injured
12         b - suspected minor injury
13                   k - fatal injury
27                    n - not injured
                     ...             
81139                 n - not injured
81140    a - suspected serious injury
81143      b - suspected minor injury
81149                k - fatal injury
81150             c - possible injury
Name: person_injury_severity, Length: 14557, dtype: object

In [29]:
mask_list = ['person_age', 'person_ethnicity', 'person_gender', 'person_helmet', 'person_injury_severity', 'person_drug_test_result']