<div style='background-color:orange'>
<a id="TableOfContents"></a>
    <h1 style='text-align:center ; top-padding:5px'>
        <b><i>
            TABLE OF CONTENTS:
        </i></b></h1>
    <li><a href='#imports'>Imports</a>
    <li><a href="#acquire">Acquire</a>
    <li><a href='#prepare'>Prepare</a>
    <li><a href="#wrangle">Wrangle</a>
    <li><a href='#misc'>Miscellaneous</a>
    </li>
</div>

<div style='background-color:orange'>
<a id="imports"></a>
    <h1 style='text-align:center ; top-padding:5px'>
        <b><i>
            Imports
        </i></b></h1>
    <li><a href='#TableOfContents'>Table of Contents</a>
    </li>
</div>

In [1]:
# Vectorization and tables
import numpy as np
import pandas as pd

# Regex
import re

# .py files
import wrangle as w

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

<div style='background-color:orange'>
<a id="acquire"></a>
    <h1 style='text-align:center ; top-padding:5px'>
        <b><i>
            Acquire
        </i></b></h1>
    <li><a href='#TableOfContents'>Table of Contents</a>
    </li>
</div>

Acquire all crashes involving motorcycles in Texas from 2018 through 2022 via a <a href='https://cris.dot.state.tx.us/public/Query/app/query-builder'>CRIS Query</a> data pull.

- Master Vanilla Shape:
    - Rows: 81,153
    - Columns: 233

In [2]:
# Read the .csv exported file from CRIS Query
master = pd.read_csv('master_list.csv', index_col=0)
master.shape

(81153, 233)

<div style='background-color:orange'>
<a id="prepare"></a>
    <h1 style='text-align:center ; top-padding:5px'>
        <b><i>
            Prepare
        </i></b></h1>
    <li><a href='#TableOfContents'>Table of Contents</a>
    <li><a href='#preparenormalize'>Normalize Columns</a>
    <li><a href='#preparenulls'>Null Handling</a>
    <li><a href='#preparesvc'>Identify Single Vehicle Crashes (SVC)</a>
    <li><a href='#prepareremovespecific'>Remove Specific Columns</a>
    <li><a href='#prepareremovethreshold'>Remove Columns By Null Thresholds</a>
    <li><a href='#preparetarget'>Fix Target Column</a>
    <li><a href='#preparedatetime'>Create Crash Datetime Column</a>
    <li><a href='#preparesummary'>Preparation Summary</a>
    </li>
</div>

<a id='preparenormalize'></a>
<h3><b><i>
    Normalize Columns:
</i></b></h3>
<li><a href='#prepare'>Prepare Top</a></li>

In [3]:
# Reset the index so that crash IDs are a regular column, not the index
master.reset_index(inplace=True)
master.head(2)

Unnamed: 0,Crash ID,$1000 Damage to Any One Person's Property,Active School Zone Flag,Adjusted Average Daily Traffic Amount,Adjusted Percentage of Average Daily Traffic For Trucks,Adjusted Roadway Part,Agency,At Intersection Flag,Average Daily Traffic Amount,Average Daily Traffic Year,...,Person Non-Suspected Serious Injury Count,Person Not Injured Count,Person Possible Injury Count,Person Restraint Used,Person Suspected Serious Injury Count,Person Time of Death,Person Total Injury Count,Person Type,Person Unknown Injury Count,Physical Location of An Occupant
0,16189632,No,NO,No Data,No Data,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,No Data,No Data,...,0,0,0,97 - NOT APPLICABLE,1,No Data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
1,16188701,Yes,NO,No Data,No Data,1 - MAIN/PROPER LANE,ABILENE POLICE DEPARTMENT,True,No Data,No Data,...,0,1,0,1 - SHOULDER & LAP BELT,0,No Data,0,1 - DRIVER,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER


In [4]:
# Replace spaces with underscores and lowercase the column names
master.columns = master.columns.str.replace(' ', '_').str.lower()
master.columns

Index(['crash_id', '$1000_damage_to_any_one_person's_property',
       'active_school_zone_flag', 'adjusted_average_daily_traffic_amount',
       'adjusted_percentage_of_average_daily_traffic_for_trucks',
       'adjusted_roadway_part', 'agency', 'at_intersection_flag',
       'average_daily_traffic_amount', 'average_daily_traffic_year',
       ...
       'person_non-suspected_serious_injury_count', 'person_not_injured_count',
       'person_possible_injury_count', 'person_restraint_used',
       'person_suspected_serious_injury_count', 'person_time_of_death',
       'person_total_injury_count', 'person_type',
       'person_unknown_injury_count', 'physical_location_of_an_occupant'],
      dtype='object', length=234)

---

<a id='preparenulls'></a>
<h3><b><i>
    Null Handling:
</i></b></h3>
<li><a href='#prepare'>Prepare Top</a></li>

In [5]:
# Identify columns with null values first
columns_with_nulls = master.columns[master.isna().sum() > 0]
master[columns_with_nulls].isna().sum()

case_id                             9939
intersecting_street_name           47292
street_name                            2
carrier's_primary_address_-_zip    80077
driver_license_endorsements          378
driver_license_restrictions          378
vehicle_towed_by                    1692
vehicle_towed_to                    2168
charge                                38
citation                              55
dtype: int64

In [6]:
# Convert all nulls to 'No Data' to maintain consistency with
# null value formatting from original data pull
master.fillna('No Data', inplace=True)

In [7]:
# Ensure every occurrence of 'No Data' is stored consistently as 'no data'
# rather than as variants like 'NO DATA', 'No DaTa', etc.
# Note: the pattern replaces any string cell that contains 'no data'
master = master.replace(to_replace=re.compile(r'.*no\s*data.*', re.IGNORECASE), value='no data', regex=True)
master.head(2)

Unnamed: 0,crash_id,$1000_damage_to_any_one_person's_property,active_school_zone_flag,adjusted_average_daily_traffic_amount,adjusted_percentage_of_average_daily_traffic_for_trucks,adjusted_roadway_part,agency,at_intersection_flag,average_daily_traffic_amount,average_daily_traffic_year,...,person_non-suspected_serious_injury_count,person_not_injured_count,person_possible_injury_count,person_restraint_used,person_suspected_serious_injury_count,person_time_of_death,person_total_injury_count,person_type,person_unknown_injury_count,physical_location_of_an_occupant
0,16189632,No,NO,no data,no data,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,no data,no data,...,0,0,0,97 - NOT APPLICABLE,1,no data,1,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
1,16188701,Yes,NO,no data,no data,1 - MAIN/PROPER LANE,ABILENE POLICE DEPARTMENT,True,no data,no data,...,0,1,0,1 - SHOULDER & LAP BELT,0,no data,0,1 - DRIVER,0,1 - FRONT LEFT OR MOTORCYCLE DRIVER
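As a quick sanity check on the pattern used above, here is a minimal sketch on a toy dataframe. The column names and values are invented for illustration and are not from the CRIS pull. Because the pattern is wrapped in `.*` on both sides, a string cell that merely contains "no data" is replaced in full, which may or may not be what is wanted for free-text columns.

```python
import re
import pandas as pd

# Toy frame; names and values are hypothetical, not from the CRIS pull
toy = pd.DataFrame({
    'a': ['No Data', 'NO DATA', '1 - DRIVER'],
    'b': ['nodata', 'value has no data here', 'No DaTa'],
    'n': [1, 2, 3],
})

# Same pattern as the notebook cell: case-insensitive, optional whitespace
pattern = re.compile(r'.*no\s*data.*', re.IGNORECASE)
cleaned = toy.replace(to_replace=pattern, value='no data', regex=True)

# String cells containing the phrase collapse to 'no data';
# other strings and the numeric column are left alone
print(cleaned)
```

If only exact placeholders should be normalized, anchoring the pattern (for example `r'^\s*no\s*data\s*$'`) would be a stricter alternative.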


---

<a id='preparesvc'></a>
<h3><b><i>
    Identify Single Vehicle Crashes (SVC):
</i></b></h3>
<li><a href='#prepare'>Prepare Top</a></li>

In [8]:
# Using the function from wrangle.py,
# Get only single motorcycle crashes
svcs = w.get_single_motorcycle_crashes(master)
svcs.shape

(14548, 234)

In [9]:
# Ensure each crash is unique (Meaning each crash only has one vehicle)
svcs.crash_id.nunique()

14548

In [10]:
# Ensure only motorcyclists are identified
svcs.person_type.value_counts()

person_type
5 - DRIVER OF MOTORCYCLE TYPE VEHICLE    14548
Name: count, dtype: int64

In [11]:
svcs.vehicle_body_style.value_counts()

vehicle_body_style
MC - MOTORCYCLE           14517
PM - POLICE MOTORCYCLE       31
Name: count, dtype: int64

- Single Vehicle Crashes (SVCs // Motorcyclists) Shape:
    - Rows: 14,548
    - Columns: 234
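The body of `w.get_single_motorcycle_crashes` lives in wrangle.py and is not shown here. The sketch below is only a guess at the kind of filter that would produce the checks above (unique crash IDs, motorcycle drivers only). The function name `single_motorcycle_crashes_sketch` and the toy data are hypothetical.

```python
import pandas as pd

MC_DRIVER = '5 - DRIVER OF MOTORCYCLE TYPE VEHICLE'

def single_motorcycle_crashes_sketch(df):
    """Hypothetical stand-in for w.get_single_motorcycle_crashes."""
    # A crash ID that appears on exactly one row has a single unit/person
    is_single = ~df['crash_id'].duplicated(keep=False)
    # Keep only rows where that person is a motorcycle driver
    is_mc_driver = df['person_type'] == MC_DRIVER
    return df[is_single & is_mc_driver]

# Toy data: crash 2 has two rows, crash 3 is a single non-motorcycle row
toy = pd.DataFrame({
    'crash_id': [1, 2, 2, 3],
    'person_type': [MC_DRIVER, '1 - DRIVER', MC_DRIVER, '1 - DRIVER'],
})
result = single_motorcycle_crashes_sketch(toy)
print(result)
```

Only crash 1 survives in the toy example; the real function may apply additional or different criteria.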

---

<a id='prepareremovespecific'></a>
<h3><b><i>
    Remove Specific Columns:
</i></b></h3>
<li><a href='#prepare'>Prepare Top</a></li>

In [12]:
# Specific columns to remove
# Removed for redundancy, target leakage (e.g. injury and death counts),
# or lack of predictive value for modeling
cols_to_remove = [
    'driver_drug_specimen_type',
    'person_drug_specimen_type',
    'person_drug_test_result',
    'driver_alcohol_result',
    'driver_alcohol_specimen_type',
    'person_alcohol_result',
    'person_alcohol_specimen_type_taken',
    'person_blood_alcohol_content_test_result',
    'crash_non-suspected_serious_injury_count',
    'crash_not_injured_count',
    'crash_possible_injury_count',
    'crash_suspected_serious_injury_count',
    'crash_total_injury_count',
    'crash_unknown_injury_count',
    'unit_non-suspected_serious_injury_count',
    'unit_not_injured_count',
    'unit_possible_injury_count',
    'unit_suspected_serious_injury_count',
    'unit_total_injury_count',
    'unit_unknown_injury_count',
    'person_non-suspected_serious_injury_count',
    'person_not_injured_count',
    'person_possible_injury_count',
    'person_suspected_serious_injury_count',
    'person_total_injury_count',
    'person_unknown_injury_count',
    'crash_death_count',
    'driver_time_of_death',
    'unit_death_count',
    'person_death_count',
    'person_time_of_death',
    'autonomous_level_engaged',
    'autonomous_unit_-_reported',
    'school_bus_flag',
    'bus_type',
    'cmv_actual_gross_weight',
    'cmv_cargo_body_type',
    'cmv_carrier_id_type',
    'cmv_disabling_damage_-_power_unit',
    'cmv_gvwr',
    'cmv_hazmat_release_flag',
    'cmv_intermodal_shipping_container_permit',
    'cmv_rgvw',
    'cmv_roadway_access',
    'cmv_sequence_of_events_1',
    'cmv_sequence_of_events_2',
    'cmv_sequence_of_events_3',
    'cmv_sequence_of_events_4',
    'cmv_total_number_of_axles',
    'cmv_total_number_of_tires',
    'cmv_trailer_disabling_damage',
    'cmv_trailer_gvwr',
    'cmv_trailer_rgvw',
    'cmv_trailer_type',
    'cmv_vehicle_operation',
    'cmv_vehicle_type',
    'vehicle_cmv_flag',
    'first_harmful_event',
    'first_harmful_event_involvement',
    'hazmat_class_1_id',
    'hazmat_class_2_id',
    'hazmat_id_number_1_id',
    'hazmat_id_number_2_id',
    'responder_struck_flag',
    'unit_description',
    'person_airbag_deployed',
    'person_ejected',
    'person_restraint_used',
    'highway_lane_design_for_hov,_railroads,_and_toll_roads',
    'railroad_company',
    'railroad_flag',
    'bridge_detail',
    'feature_crossed_by_bridge',
    'on_bridge_service_type',
    'under_bridge_service_type',
    'construction_zone_flag',
    'construction_zone_workers_present_flag',
    'commercial_motor_vehicle_flag',
    'date_arrived',
    'date_notified',
    'date_roadway_cleared',
    'date_scene_cleared',
    'direction_of_traffic'
]

In [13]:
# How many columns are being removed?
len(cols_to_remove)

83

In [14]:
# Drop the columns from the dataframe
svcs.drop(columns=cols_to_remove, inplace=True)

---

<a id='prepareremovethreshold'></a>
<h3><b><i>
    Remove Columns By Null Thresholds:
</i></b></h3>
<li><a href='#prepare'>Prepare Top</a></li>

In [15]:
# Remove any columns that exceed 98.7% total nulls
svcs, new_dict = w.drop_nullpct_alternate(svcs, 0.987)

In [16]:
# How many columns are being dropped?
pd.DataFrame(new_dict).shape[0]

18
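`w.drop_nullpct_alternate` is defined in wrangle.py and is not reproduced in this notebook. A minimal sketch of what such a helper might look like follows, assuming it measures the share of `'no data'` placeholders per column and returns both the trimmed dataframe and a record of what was dropped. The name `drop_nodata_columns_sketch`, the `placeholder` parameter, and the toy data are assumptions.

```python
import pandas as pd

def drop_nodata_columns_sketch(df, threshold, placeholder='no data'):
    """Hypothetical stand-in for w.drop_nullpct_alternate."""
    dropped = {'column_name': [], 'percent_nodata': []}
    for col in df.columns:
        # Fraction of rows in this column holding the placeholder
        pct = (df[col] == placeholder).mean()
        if pct > threshold:
            dropped['column_name'].append(col)
            dropped['percent_nodata'].append(pct)
    # Drop everything at once, after the scan is complete
    return df.drop(columns=dropped['column_name']), dropped

# Toy data: one column is 99% placeholder, the other has none
toy = pd.DataFrame({
    'mostly_missing': ['no data'] * 99 + ['x'],
    'useful': list(range(100)),
})
trimmed, report = drop_nodata_columns_sketch(toy, 0.987)
print(pd.DataFrame(report))
```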

In [17]:
# Apply the same threshold logic to columns that use -1 as a placeholder
drop_negone_pct_dict = {
    'column_name' : [],
    'percent_nodata' : []
}
for col in svcs:
    pct = (svcs[col] == -1).sum() / svcs.shape[0]
    if pct > 0.987:
        svcs = svcs.drop(columns=col)
        drop_negone_pct_dict['column_name'].append(col)
        drop_negone_pct_dict['percent_nodata'].append(pct)

In [18]:
# How many columns are being dropped?
pd.DataFrame(drop_negone_pct_dict).shape[0]

2
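The loop above can also be written without reassigning the dataframe while iterating. This is a sketch of a vectorized alternative on toy data (the column names are invented), not a replacement for the cell that was actually run.

```python
import pandas as pd

# Toy data; column names are hypothetical
toy = pd.DataFrame({
    'mostly_negone': [-1] * 99 + [5],
    'speed_limit': [30] * 50 + [-1] * 50,
    'agency': ['x'] * 100,
})

# Share of -1 values per column, computed in one pass
negone_pct = (toy == -1).mean()

# Columns above the 98.7% threshold, with their percentages
to_drop = negone_pct[negone_pct > 0.987]
trimmed = toy.drop(columns=to_drop.index)
print(to_drop)
```

Keeping `to_drop` as a Series gives the same column/percentage record that the dictionary in the loop collects.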

---

<a id='preparetarget'></a>
<h3><b><i>
    Fix Target Column:
</i></b></h3>
<li><a href='#prepare'>Prepare Top</a></li>

Target classes after adjustment (source values in parentheses):
- Killed (Fatal injury)
- Serious (Serious injury)
- Minor (Minor & Possible injury)
- Not Injured (Not injured)
- REMOVE VALUE (Unknown)

In [19]:
# What does the distribution of the target look like?
svcs.person_injury_severity.value_counts()

person_injury_severity
B - SUSPECTED MINOR INJURY      5578
A - SUSPECTED SERIOUS INJURY    4016
C - POSSIBLE INJURY             2569
N - NOT INJURED                 1162
K - FATAL INJURY                 849
99 - UNKNOWN                     374
Name: count, dtype: int64

In [20]:
# Adjust minor target value
svcs.person_injury_severity = svcs.person_injury_severity.str.replace('C - POSSIBLE INJURY', 'B - SUSPECTED MINOR INJURY')

In [21]:
# Remove instances of unknown values
svcs = svcs[~(svcs.person_injury_severity == '99 - UNKNOWN')]

In [22]:
# Verify the distribution of the target value
# GOT THAT 'BANK' DISTRIBUTION! :D
svcs.person_injury_severity.value_counts()

person_injury_severity
B - SUSPECTED MINOR INJURY      8147
A - SUSPECTED SERIOUS INJURY    4016
N - NOT INJURED                 1162
K - FATAL INJURY                 849
Name: count, dtype: int64
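The two adjustments above can be expressed as one exact-value mapping followed by a filter. The sketch below runs on a toy Series so it is self-contained; the `severity_map` name is an assumption. As a check on the real counts, 5578 + 2569 = 8147 minor injuries and 14548 - 374 = 14174 remaining rows, which matches the outputs above.

```python
import pandas as pd

# Exact-value mapping: fold 'possible injury' into 'suspected minor injury'
severity_map = {'C - POSSIBLE INJURY': 'B - SUSPECTED MINOR INJURY'}

# Toy target column, one row per original class of interest
toy = pd.Series(
    ['K - FATAL INJURY', 'C - POSSIBLE INJURY', '99 - UNKNOWN',
     'B - SUSPECTED MINOR INJURY', 'N - NOT INJURED'],
    name='person_injury_severity',
)

merged = toy.replace(severity_map)      # whole-value replacement, no regex
kept = merged[merged != '99 - UNKNOWN']  # drop rows with unknown severity
print(kept.value_counts())
```

Using a dictionary with `Series.replace` matches whole values only, so it cannot accidentally rewrite part of a longer string.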

---

<a id='preparedatetime'></a>
<h3><b><i>
    Create Crash Datetime Column:
</i></b></h3>
<li><a href='#prepare'>Prepare Top</a></li>

In [23]:
# Fix 'crash_time' column: left-pad to four digits (HHMM), e.g. '307' -> '0307'
fixed_times_list = svcs.crash_time.astype(str).str.zfill(4).to_list()

In [24]:
# Replace unclean column with clean version and create datetime column
svcs.crash_time = fixed_times_list
svcs['crash_datetime'] = svcs.crash_date.str.strip() + ' ' + svcs.crash_time
svcs.crash_datetime = pd.to_datetime(svcs.crash_datetime)

In [25]:
# Confirm dtype and formatting
svcs.crash_datetime

0       2018-01-01 11:23:00
11      2018-01-04 13:16:00
18      2018-01-05 20:45:00
22      2018-01-05 03:07:00
35      2018-01-06 12:10:00
                ...        
81125   2022-12-31 11:26:00
81126   2022-12-31 22:29:00
81140   2022-12-31 15:55:00
81144   2022-12-31 14:49:00
81152   2022-12-31 12:26:00
Name: crash_datetime, Length: 14174, dtype: datetime64[ns]
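For reference, the padding and parsing steps can be sketched end to end on toy data. The raw `crash_date` format in the CRIS export is not shown in this notebook, so the `YYYY-MM-DD` strings and the explicit `format` argument below are assumptions; adjust the format string to whatever the export actually contains.

```python
import pandas as pd

# Toy data; the date format here is assumed, not taken from the export
toy = pd.DataFrame({
    'crash_date': ['2018-01-01 ', '2018-01-05'],
    'crash_time': [1123, 307],   # HHMM stored as integers, leading zeros lost
})

# Restore leading zeros so every time is exactly four digits
padded = toy['crash_time'].astype(str).str.zfill(4)

# Combine date and time, then parse with an explicit format
combined = toy['crash_date'].str.strip() + ' ' + padded
toy['crash_datetime'] = pd.to_datetime(combined, format='%Y-%m-%d %H%M')
print(toy['crash_datetime'])
```

Passing an explicit `format` avoids relying on pandas to infer how a four-digit time should be read.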

---

<a id='preparesummary'></a>
<h3><b><i>
    Preparation Summary:
</i></b></h3>
<li><a href='#prepare'>Prepare Top</a></li>

- Changed whitespace to '_' and lowercased columns
- Changed all nulls to 'no data'
- Identified only single vehicle crashes (SVCs // Motorcycle only)
    - 81153 Rows ==> 14548 Rows
- Removed columns
    - 234 Columns ==> 131 Columns
- Adjusted target column
    - Added possible injury as minor injury
    - Removed unknown values
    - 14548 Rows ==> 14174 Rows
- Datetime column created
    - Example 'crash_datetime' value: 2022-12-31 12:26:00
    - 131 Columns ==> 132 Columns

In [26]:
# Final dataset shape
svcs.shape

(14174, 132)
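The counts in the summary can be cross-checked with plain arithmetic. Every number below is taken from the cell outputs above; only the variable names are made up.

```python
# Column bookkeeping
start_cols = 234        # after reset_index and column normalization
removed_specific = 83   # len(cols_to_remove)
removed_nodata = 18     # dropped by the 'no data' threshold
removed_negone = 2      # dropped by the -1 threshold
added_cols = 1          # crash_datetime

after_removal = start_cols - removed_specific - removed_nodata - removed_negone
final_cols = after_removal + added_cols

# Row bookkeeping
svc_rows = 14548        # single motorcycle crashes
unknown_rows = 374      # '99 - UNKNOWN' severity rows removed
final_rows = svc_rows - unknown_rows

print(after_removal)             # 131
print((final_rows, final_cols))  # (14174, 132)
```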

<div style='background-color:orange'>
<a id="wrangle"></a>
    <h1 style='text-align:center ; top-padding:5px'>
        <b><i>
            Wrangle
        </i></b></h1>
    <li><a href='#TableOfContents'>Table of Contents</a>
    </li>
</div>

In [27]:
# Check functionality from wrangle.py
function_svcs = w.wrangle()
print(f'\033[35mWrangle Function:\033[0m {function_svcs.shape}\n\033[35mManual Preparation:\033[0m {svcs.shape}')

Wrangle Function: (14174, 132)
Manual Preparation: (14174, 132)


In [28]:
function_svcs

Unnamed: 0,crash_id,$1000_damage_to_any_one_person's_property,active_school_zone_flag,adjusted_average_daily_traffic_amount,adjusted_percentage_of_average_daily_traffic_for_trucks,adjusted_roadway_part,agency,at_intersection_flag,average_daily_traffic_amount,average_daily_traffic_year,...,charge,citation,person_age,person_ethnicity,person_gender,person_helmet,person_injury_severity,person_type,physical_location_of_an_occupant,crash_datetime
0,16189632,No,NO,no data,no data,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,no data,no data,...,OPERATE UNREGISTERED MOTOR VEHICLE,TX52Q80UKZPL,37,W - WHITE,1 - MALE,1 - NOT WORN,A - SUSPECTED SERIOUS INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER,2018-01-01 11:23:00
11,16203470,No,NO,no data,no data,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,no data,no data,...,"NO CLASS ""M"" LICENSE",TX52QD0NAP34,30,H - HISPANIC,1 - MALE,"3 - WORN, NOT DAMAGED",B - SUSPECTED MINOR INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER,2018-01-04 13:16:00
18,16192023,Yes,NO,no data,no data,1 - MAIN/PROPER LANE,HARRIS COUNTY SHERIFF'S OFFICE,False,no data,no data,...,NO CHARGES,no data,21,W - WHITE,1 - MALE,"2 - WORN, DAMAGED",A - SUSPECTED SERIOUS INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER,2018-01-05 20:45:00
22,16196720,No,NO,no data,no data,1 - MAIN/PROPER LANE,MCALLEN POLICE DEPARTMENT,False,no data,no data,...,NO DRIVER LICENSE NO INSURANCE,667341,18,H - HISPANIC,1 - MALE,1 - NOT WORN,B - SUSPECTED MINOR INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER,2018-01-05 03:07:00
35,16189103,Yes,NO,1426,7.1,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,1426,2019,...,NO CHARGES,no data,28,W - WHITE,1 - MALE,"3 - WORN, NOT DAMAGED",B - SUSPECTED MINOR INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER,2018-01-06 12:10:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81125,19321499,Yes,NO,1211,6.9,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,1211,2019,...,NO CHARGES,no data,49,W - WHITE,2 - FEMALE,1 - NOT WORN,B - SUSPECTED MINOR INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER,2022-12-31 11:26:00
81126,19323296,Yes,NO,no data,no data,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,no data,no data,...,NO CHARGES,no data,33,W - WHITE,1 - MALE,1 - NOT WORN,A - SUSPECTED SERIOUS INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER,2022-12-31 22:29:00
81140,19327850,No,NO,no data,no data,1 - MAIN/PROPER LANE,HARRIS COUNTY CONSTABLE PRECINCT 4,False,no data,no data,...,NO CHARGES,no data,35,W - WHITE,1 - MALE,"4 - WORN, UNK DAMAGE",A - SUSPECTED SERIOUS INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER,2022-12-31 15:55:00
81144,19330330,No,NO,no data,no data,1 - MAIN/PROPER LANE,"DEPARTMENT OF PUBLIC SAFETY, STATE OF TEXAS",False,no data,no data,...,FAIL TO DRIVE IN SINGLE LANE,TX6HCF0DDEIM,42,B - BLACK,2 - FEMALE,"4 - WORN, UNK DAMAGE",B - SUSPECTED MINOR INJURY,5 - DRIVER OF MOTORCYCLE TYPE VEHICLE,1 - FRONT LEFT OR MOTORCYCLE DRIVER,2022-12-31 14:49:00


<div style='background-color:orange'>
<a id="misc"></a>
    <h1 style='text-align:center ; top-padding:5px'>
        <b><i>
            Miscellaneous
        </i></b></h1>
    <li><a href='#TableOfContents'>Table of Contents</a>
    </li>
</div>

In [29]:
def dataset_consistency(df):
    # Normalize column names the same way as the master dataframe,
    # then return the column list with the target column included.
    # Note: this modifies the dataframe passed in (index and columns).
    df.reset_index(inplace=True)
    df.columns = df.columns.str.replace(' ', '_').str.lower()
    col_list = df.columns.to_list()
    if 'person_injury_severity' not in col_list:
        col_list.append('person_injury_severity')
    return col_list
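To illustrate how the returned column list is meant to be used in the commented-out cells that follow, here is a small sketch on toy data. It uses a hypothetical non-mutating variant, `column_list_with_target`, which skips the `reset_index` step of `dataset_consistency`; the dataframes and their values are invented.

```python
import pandas as pd

def column_list_with_target(df, target='person_injury_severity'):
    """Hypothetical non-mutating variant of dataset_consistency."""
    # Normalize names on a copy of the column index; df itself is untouched
    cols = df.columns.str.replace(' ', '_').str.lower().to_list()
    if target not in cols:
        cols.append(target)
    return cols

# Toy stand-in for one of the separated CRIS exports (raw column names)
subset_export = pd.DataFrame({'Crash ID': [1], 'Person Age': [37]})

# Toy stand-in for the prepared svcs dataframe
svcs_toy = pd.DataFrame({
    'crash_id': [1],
    'person_age': [37],
    'person_helmet': ['1 - NOT WORN'],
    'person_injury_severity': ['A - SUSPECTED SERIOUS INJURY'],
})

cols = column_list_with_target(subset_export)
subset = svcs_toy[cols]   # same columns as the export, plus the target
print(subset)
```

Selecting with the list raises a `KeyError` if the prepared dataframe has already dropped one of the export's columns, which is worth keeping in mind given the column removals in the Prepare section.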

In [30]:
# Read all separated versions of files
# all_person = pd.read_csv('2018_all_person.csv', index_col=0)
# contributing_factors = pd.read_csv('2018_contributing_factors.csv', index_col=0)
# crash_conditions = pd.read_csv('2018_crash_conditions.csv', index_col=0)
# lat_long = pd.read_csv('2018_lat_long.csv', index_col=0)
# road_stuff = pd.read_csv('2018_road_stuff.csv', index_col=0)
# vehicle_info = pd.read_csv('2018_vehicle_info.csv', index_col=0)

In [31]:
# Create lists of column names WITH target variable column
# all_person_col_list = dataset_consistency(all_person)
# contributing_factors_col_list = dataset_consistency(contributing_factors)
# crash_conditions_col_list = dataset_consistency(crash_conditions)
# lat_long_col_list = dataset_consistency(lat_long)
# road_stuff_col_list = dataset_consistency(road_stuff)
# vehicle_info_col_list = dataset_consistency(vehicle_info)

In [32]:
# Select each column list from the svcs dataframe
# all_person_new = svcs[all_person_col_list]
# contributing_factors_new = svcs[contributing_factors_col_list]
# crash_conditions_new = svcs[crash_conditions_col_list]
# lat_long_new = svcs[lat_long_col_list]
# road_stuff_new = svcs[road_stuff_col_list]
# vehicle_info_new = svcs[vehicle_info_col_list]

In [33]:
# Create .csvs for each of them for potential later use
# all_person_new.to_csv('all_person_master.csv')
# contributing_factors_new.to_csv('contributing_factors_master.csv')
# crash_conditions_new.to_csv('crash_conditions_master.csv')
# lat_long_new.to_csv('lat_long_master.csv')
# road_stuff_new.to_csv('road_stuff_master.csv')
# vehicle_info_new.to_csv('vehicle_info_master.csv')