# Explore the two data sources - Traffic violations

Two versions of the same traffic violations dataset were identified, both published on OpenML.

The first is unprocessed and can be found [here](https://api.openml.org/d/42132). The second is a preprocessed and subsampled version that can be downloaded [here](https://www.openml.org/search?type=data&status=active&sort=runs&order=desc&id=42345).

In [1]:
from pathlib import Path

import pandas as pd
from scipy.io.arff import loadarff

In [2]:
FULL_DATA_PATH = Path("..") / ".." / "data" / "dataset.csv"
DATA_PATH = Path("..") / ".." / "data" / "file65ef3a759daf.arff"

## Preprocessed dataset

In [3]:
data, meta = loadarff(DATA_PATH)
data = pd.DataFrame(data)
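
The `b'...'` prefixes visible in the table further down come from how `loadarff` represents nominal attributes. The following is a minimal sketch on a made-up, in-memory ARFF file (the relation, attribute names, and values are illustrative assumptions, not taken from the traffic dataset); it shows what the returned `meta` object reports and that nominal values arrive as bytes.

```python
from io import StringIO

from scipy.io.arff import loadarff

# Hypothetical miniature ARFF file, used only for illustration.
toy_arff = StringIO(
    "@relation toy\n"
    "@attribute width numeric\n"
    "@attribute height numeric\n"
    "@attribute color {red,green,blue}\n"
    "@data\n"
    "5.0,3.25,blue\n"
    "4.5,3.75,green\n"
)

toy_records, toy_meta = loadarff(toy_arff)

# Attribute names and their ARFF types, in file order.
print(toy_meta.names())
print(toy_meta.types())

# Nominal values are stored as bytes, hence the b'...' display.
print(toy_records["color"][0])
```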

In [4]:
print("Dimensions of the dataset:", data.shape)

Dimensions of the dataset: (70340, 21)


In [5]:
data.columns

Index(['Description', 'Belts', 'Personal.Injury', 'Property.Damage',
       'Commercial.License', 'Commercial.Vehicle', 'State', 'VehicleType',
       'Year', 'Make', 'Model', 'Color', 'Charge', 'Contributed.To.Accident',
       'Race', 'Gender', 'Driver.City', 'Driver.State', 'DL.State',
       'Arrest.Type', 'Violation.Type'],
      dtype='object')

In [6]:
data.head()

Unnamed: 0,Description,Belts,Personal.Injury,Property.Damage,Commercial.License,Commercial.Vehicle,State,VehicleType,Year,Make,...,Color,Charge,Contributed.To.Accident,Race,Gender,Driver.City,Driver.State,DL.State,Arrest.Type,Violation.Type
0,b'DISPLAYING EXPIRED REGISTRATION PLATE ISSUED...,b'No',b'No',b'No',b'No',b'No',b'NC',b'02 - Automobile',2013.0,b'HYUNDAI',...,b'GRAY',b'13411f',b'No',b'WHITE',b'F',b'ASHEVILLE',b'NC',b'NC',b'A - Marked Patrol',b'Citation'
1,b'DRIVER FAIL TO STOP AT RED TRAFFIC SIGNAL BE...,b'No',b'No',b'No',b'No',b'No',b'MD',b'02 - Automobile',2015.0,b'FORD',...,b'SILVER',b'21202i1',b'No',b'OTHER',b'M',b'SILVER SPRING',b'MD',b'MD',b'A - Marked Patrol',b'Citation'
2,b'DRIVING UNDER THE INFLUENCE OF ALCOHOL PER SE',b'No',b'No',b'No',b'No',b'No',b'MD',b'02 - Automobile',2000.0,b'TOYOTA',...,b'BLACK',b'21902a2',b'No',b'BLACK',b'M',b'SILVER SPRING',b'MD',b'MD',b'B - Unmarked Patrol',b'Citation'
3,b'PERSON DRIVING MOTOR VEHICLE ON HIGHWAY OR P...,b'No',b'No',b'No',b'No',b'No',b'MD',b'02 - Automobile',2012.0,b'HOND',...,b'BLACK',b'16303c',b'No',b'BLACK',b'M',b'COLUMBIA',b'MD',b'MD',b'A - Marked Patrol',b'Citation'
4,b'DISPLAYING EXPIRED REGISTRATION PLATE ISSUED...,b'No',b'No',b'No',b'Yes',b'No',b'MD',b'02 - Automobile',2010.0,b'FORD',...,b'BLACK',b'13411f',b'No',b'WHITE',b'M',b'MOUNT AIRY',b'MD',b'MD',b'A - Marked Patrol',b'Citation'


In [7]:
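# Decode the byte strings (b'...') returned by loadarff into regular str values.
# Note: the concat below moves the non-string columns (e.g. Year) to the end.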
str_df = data.select_dtypes([object])
str_df = str_df.stack().str.decode("utf-8").unstack()
data = pd.concat([str_df, data.select_dtypes(exclude=[object])], axis=1)
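
As an alternative to the stack/unstack approach, the decoding can be sketched column by column. The helper name `decode_bytes_columns` and the toy frame are assumptions made for illustration; this is one possible variant, not the method used in this notebook.

```python
import pandas as pd


def decode_bytes_columns(frame: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy in which bytes values in object columns are decoded to str."""
    out = frame.copy()
    for col in out.select_dtypes(include=[object]).columns:
        # Leave non-bytes values (e.g. missing values) untouched.
        out[col] = out[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    return out


# Hypothetical toy data mimicking the structure produced by loadarff.
toy = pd.DataFrame({"Race": [b"WHITE", b"BLACK"], "Year": [2013.0, 2015.0]})
decoded = decode_bytes_columns(toy)
print(decoded)
```

Unlike the concat-based cell, this variant keeps the original column order, so numeric columns such as `Year` stay in place.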

In [8]:
data.head()

Unnamed: 0,Description,Belts,Personal.Injury,Property.Damage,Commercial.License,Commercial.Vehicle,State,VehicleType,Make,Model,...,Charge,Contributed.To.Accident,Race,Gender,Driver.City,Driver.State,DL.State,Arrest.Type,Violation.Type,Year
0,DISPLAYING EXPIRED REGISTRATION PLATE ISSUED B...,No,No,No,No,No,NC,02 - Automobile,HYUNDAI,SONATA,...,13411f,No,WHITE,F,ASHEVILLE,NC,NC,A - Marked Patrol,Citation,2013.0
1,DRIVER FAIL TO STOP AT RED TRAFFIC SIGNAL BEFO...,No,No,No,No,No,MD,02 - Automobile,FORD,FUSION,...,21202i1,No,OTHER,M,SILVER SPRING,MD,MD,A - Marked Patrol,Citation,2015.0
2,DRIVING UNDER THE INFLUENCE OF ALCOHOL PER SE,No,No,No,No,No,MD,02 - Automobile,TOYOTA,CAMRY,...,21902a2,No,BLACK,M,SILVER SPRING,MD,MD,B - Unmarked Patrol,Citation,2000.0
3,PERSON DRIVING MOTOR VEHICLE ON HIGHWAY OR PUB...,No,No,No,No,No,MD,02 - Automobile,HOND,CROSSTOUR,...,16303c,No,BLACK,M,COLUMBIA,MD,MD,A - Marked Patrol,Citation,2012.0
4,DISPLAYING EXPIRED REGISTRATION PLATE ISSUED B...,No,No,No,Yes,No,MD,02 - Automobile,FORD,F250,...,13411f,No,WHITE,M,MOUNT AIRY,MD,MD,A - Marked Patrol,Citation,2010.0


Check the distribution of the target variable (`Violation.Type`) and of the obvious protected attributes (`Race`, `Gender`).

In [9]:
data["Violation.Type"].value_counts()

Violation.Type
Citation    32452
SERO         3506
Name: count, dtype: int64

In [10]:
data["Race"].value_counts()

Race
WHITE              24973
BLACK              22249
HISPANIC           15035
ASIAN               4140
OTHER               3784
NATIVE AMERICAN      159
Name: count, dtype: int64

In [11]:
data["Gender"].value_counts()

Gender
M    47163
F    23087
U       90
Name: count, dtype: int64
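
Raw counts like the ones above can be complemented by relative shares and by the target distribution within each group. The sketch below uses a small made-up frame (the values are illustrative assumptions, not results from the dataset); on the real data, `toy` would be replaced by `data`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the preprocessed dataset.
toy = pd.DataFrame(
    {
        "Race": ["WHITE", "WHITE", "BLACK", "BLACK", "BLACK"],
        "Violation.Type": ["Citation", "SERO", "Citation", "Citation", "SERO"],
    }
)

# Share of each group in the data.
shares = toy["Race"].value_counts(normalize=True)

# Distribution of the target within each group (each row sums to 1).
rates = pd.crosstab(toy["Race"], toy["Violation.Type"], normalize="index")

print(shares)
print(rates)
```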

## Full dataset

WARNING: the following code does not load the data fully correctly. Among other things, `on_bad_lines="skip"` silently drops every line that cannot be parsed. It was nonetheless used to get a first impression of the dataset and of how clean it is.

In [12]:
df = pd.read_csv(FULL_DATA_PATH, quotechar="'", on_bad_lines="skip")


In [13]:
df.shape

(1578095, 43)
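
To see how many lines `on_bad_lines="skip"` discards, one option is to pass a callable instead, which pandas supports with `engine="python"`. The sketch below runs on a made-up in-memory CSV; the names `collect_bad_line` and `bad_lines` are assumptions for illustration, and the python engine may be noticeably slower on a file of this size.

```python
import io

import pandas as pd

# Hypothetical toy CSV: the third line has one field too many.
raw = io.StringIO(
    "a,b,c\n"
    "1,2,3\n"
    "4,5,6,7\n"
    "8,9,10\n"
)

bad_lines = []


def collect_bad_line(fields):
    """Record the offending line and return None so that pandas skips it."""
    bad_lines.append(fields)
    return None


df_toy = pd.read_csv(raw, engine="python", on_bad_lines=collect_bad_line)

print("rows kept:", len(df_toy))
print("rows skipped:", len(bad_lines))
```

Inspecting the collected lines can indicate whether the failures stem from quoting problems or from genuinely malformed records.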

In [14]:
df.columns = df.columns.str.replace('"', "")

In [15]:
df.columns

Index(['seqid', 'date_of_stop', 'time_of_stop', 'agency', 'subagency',
       'description', 'location', 'latitude', 'longitude', 'accident', 'belts',
       'personal_injury', 'property_damage', 'fatal', 'commercial_license',
       'hazmat', 'commercial_vehicle', 'alcohol', 'work_zone',
       'search_conducted', 'search_disposition', 'search_outcome',
       'search_reason', 'search_reason_for_stop', 'search_type',
       'search_arrest_reason', 'state', 'vehicletype', 'year', 'make', 'model',
       'color', 'violation_type', 'charge', 'article',
       'contributed_to_accident', 'race', 'gender', 'driver_city',
       'driver_state', 'dl_state', 'arrest_type', 'geolocation'],
      dtype='object')

In [16]:
df.head()

Unnamed: 0,seqid,date_of_stop,time_of_stop,agency,subagency,description,location,latitude,longitude,accident,...,charge,article,contributed_to_accident,race,gender,driver_city,driver_state,dl_state,arrest_type,geolocation
0,fdcc1a6b-4854-4cde-bb60-248f478fa5b6,09/11/2019,09:56:00,MCP,"2nd District, Bethesda",STOP LIGHTS (*),27 @ SWEEPSTAKES RD,39.259627,-77.22376,No,...,64*,?,False,HISPANIC,F,DAMASCUS,MD,MD,B - Unmarked Patrol,"(39.2596266666667, -77.22376)"
1,842dad60-5edf-47a8-9e94-c7e6da729498,09/11/2019,09:42:00,MCP,"1st District, Rockville",DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC...,FREDERICK RD / REDLAND RD,39.11286,-77.162435,No,...,21-201(a1),Transportation Article,False,BLACK,M,TEMPLE HILLS,MD,MD,A - Marked Patrol,"(39.11286, -77.162435)"
2,4db837cc-f2fa-4a5b-9ac8-37698492b5f9,09/11/2019,09:36:00,MCP,"1st District, Rockville",FAILURE VEH. TO YIELD INTERSECTION RIGHT-OF-WA...,S/B GEORGIA AVE AT MD200,39.11708,-77.068113,No,...,21-401,Transportation Article,False,WHITE,M,OLNEY,MD,MD,A - Marked Patrol,"(39.11708, -77.0681133333333)"
3,79761295-50f6-4336-8b48-fdf55e87a326,09/11/2019,09:33:00,MCP,"2nd District, Bethesda",DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC...,OLD GEORGETOWN RD AT MCKINLEY ST,38.99163,-77.105873,No,...,21-201(a1),Transportation Article,False,BLACK,F,GAITHERSBURG,MD,MD,A - Marked Patrol,"(38.99163, -77.1058733333333)"
4,f9a7a508-386c-466e-95b2-dbf26d7d59fe,09/11/2019,09:30:00,MCP,"4th District, Wheaton",MOTOR VEH. W/O REQUIRED STOP LAMPS EQUIPMENT,CONNECTICUT AVE / WELLER RD,39.06456,-77.073673,No,...,22-206(a),Transportation Article,False,BLACK,F,SILVER SPRING,MD,MD,A - Marked Patrol,"(39.06456, -77.0736733333333)"
