# Final Data Analysis and Visualization

In this notebook, we perform data analysis and visualization on the 'person_data_info (2).csv' dataset, with 'person_injury_severity' as the target variable. We convert 'No Data' entries to nulls, handle the resulting null values, and encode categorical variables as needed. We then split the data into train, validate, and test sets using a custom function. Finally, we use Recursive Feature Elimination (RFE) to select the most important features for predicting the target variable.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

In [5]:
# Load the data
df = pd.read_csv('person_data_info (2).csv')
df.drop(columns=['Unnamed: 0'], inplace=True)
# Display the first few rows of the dataframe
df.head()

Unnamed: 0,crash_id,person_age,charge,person_ethnicity,financial_responsibility_type,crash_date,day_of_week,person_gender,person_helmet,driver_license_class,driver_license_endorsements,driver_drug_test_result,driver_license_state,driver_license_type,person_injury_severity
0,16189632,37,OPERATE UNREGISTERED MOTOR VEHICLE,W - WHITE,no data,2018-01-01,MONDAY,1 - MALE,1 - NOT WORN,C - CLASS C,NONE,97 - NOT APPLICABLE,TX - TEXAS,1 - DRIVER LICENSE,A - SUSPECTED SERIOUS INJURY
1,16203470,30,"NO CLASS ""M"" LICENSE",H - HISPANIC,2 - PROOF OF LIABILITY INSURANCE,2018-01-04,THURSDAY,1 - MALE,"3 - WORN, NOT DAMAGED",C - CLASS C,NONE,97 - NOT APPLICABLE,TX - TEXAS,1 - DRIVER LICENSE,C - POSSIBLE INJURY
2,16191458,20,ACCIDENT INVOLVING DAMAGE TO VEHICLE>=$200/ FS...,W - WHITE,2 - PROOF OF LIABILITY INSURANCE,2018-01-05,FRIDAY,1 - MALE,99 - UNKNOWN IF WORN,C - CLASS C,NONE,97 - NOT APPLICABLE,TX - TEXAS,1 - DRIVER LICENSE,99 - UNKNOWN
3,16192023,21,NO CHARGES,W - WHITE,2 - PROOF OF LIABILITY INSURANCE,2018-01-05,FRIDAY,1 - MALE,"2 - WORN, DAMAGED",C - CLASS C,NONE,97 - NOT APPLICABLE,TX - TEXAS,1 - DRIVER LICENSE,A - SUSPECTED SERIOUS INJURY
4,16196720,18,NO DRIVER LICENSE NO INSURANCE,H - HISPANIC,no data,2018-01-05,FRIDAY,1 - MALE,1 - NOT WORN,5 - UNLICENSED,UNLICENSED,97 - NOT APPLICABLE,TX - TEXAS,4 - ID CARD,B - SUSPECTED MINOR INJURY


## Data Preparation

The raw data marks missing entries with the placeholder string 'no data'; these are converted to NaN in a later cell. Here we fill the missing values in the 'financial_responsibility_type' column based on the 'charge' column: if there is no insurance-related charge, we assume that the person had proof of liability insurance. We also convert all text in the dataframe to lower case to ensure consistency.

In [6]:
# Treat the 'no data' placeholder as missing, since it has not been converted to NaN yet
frt_missing = df['financial_responsibility_type'].isnull() | df['financial_responsibility_type'].str.lower().eq('no data')

# A charge that does not mention insurance is taken to mean the person had coverage
no_insurance_charge = ~df['charge'].str.contains('INSURANCE', case=False, na=False)
df.loc[frt_missing & no_insurance_charge, 'financial_responsibility_type'] = '2 - PROOF OF LIABILITY INSURANCE'

# Standardize the text to lower case
# (on pandas 2.1+, DataFrame.map is the non-deprecated equivalent of applymap)
df = df.applymap(lambda s: s.lower() if isinstance(s, str) else s)

df.head()

Unnamed: 0,crash_id,person_age,charge,person_ethnicity,financial_responsibility_type,crash_date,day_of_week,person_gender,person_helmet,driver_license_class,driver_license_endorsements,driver_drug_test_result,driver_license_state,driver_license_type,person_injury_severity
0,16189632,37,operate unregistered motor vehicle,w - white,2 - proof of liability insurance,2018-01-01,monday,1 - male,1 - not worn,c - class c,none,97 - not applicable,tx - texas,1 - driver license,a - suspected serious injury
1,16203470,30,"no class ""m"" license",h - hispanic,2 - proof of liability insurance,2018-01-04,thursday,1 - male,"3 - worn, not damaged",c - class c,none,97 - not applicable,tx - texas,1 - driver license,c - possible injury
2,16191458,20,accident involving damage to vehicle>=$200/ fs...,w - white,2 - proof of liability insurance,2018-01-05,friday,1 - male,99 - unknown if worn,c - class c,none,97 - not applicable,tx - texas,1 - driver license,99 - unknown
3,16192023,21,no charges,w - white,2 - proof of liability insurance,2018-01-05,friday,1 - male,"2 - worn, damaged",c - class c,none,97 - not applicable,tx - texas,1 - driver license,a - suspected serious injury
4,16196720,18,no driver license no insurance,h - hispanic,no data,2018-01-05,friday,1 - male,1 - not worn,5 - unlicensed,unlicensed,97 - not applicable,tx - texas,4 - id card,b - suspected minor injury


In [None]:
# Convert the 'no data' placeholder (lower-cased in the previous cell) to NaN
df.replace('no data', np.nan, inplace=True)

# Count the missing values in each column
df.isnull().sum()
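
As a minimal sketch of the conversion above, the snippet below applies the same `replace` call to a small made-up frame and counts the resulting nulls per column. The rows and the name `toy` are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are invented, not from the dataset
toy = pd.DataFrame({
    'financial_responsibility_type': ['no data', '2 - proof of liability insurance', 'no data'],
    'person_helmet': ['1 - not worn', 'no data', '1 - not worn'],
})

# Swap the placeholder string for NaN so pandas treats it as missing
toy_clean = toy.replace('no data', np.nan)

# Count missing values per column
null_counts = toy_clean.isnull().sum()
print(null_counts)
```

Returning a new frame instead of using `inplace=True` leaves the original untouched, which makes it easy to compare before and after.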

## Data Splitting

Next, we will split the data into train, validate, and test sets using the `split` function defined below. This will allow us to explore the data and create and validate models.

In [None]:
def split(df):
    '''
    Split a dataframe into train, validate, and test sets in order to
    explore the data and to create and validate models.
    Takes in a dataframe; the seed is fixed at 123 inside the function for replication.
    Test is 20% of the original dataset. The remaining 80% of the dataset is
    divided between validate and train, with validate being .30*.80 = 24% of
    the original dataset, and train being .70*.80 = 56% of the original dataset.
    Returns the train, validate, and test dataframes.
    '''
    train, test = train_test_split(df, test_size = .2, random_state=123)   
    train, validate = train_test_split(train, test_size=.3, random_state=123)
    
    return train, validate, test

# Split the data
train, validate, test = split(df)
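
To check the 56% / 24% / 20% proportions described in the docstring, here is a small sketch that repeats the same two-step split on a made-up 1,000-row frame. The frame `toy` and the `*_part` names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 1,000 rows; the column name is made up for illustration
toy = pd.DataFrame({'x': range(1000)})

# Same two-step split as above: 20% test, then 30% of the remainder as validate
train_part, test_part = train_test_split(toy, test_size=.2, random_state=123)
train_part, validate_part = train_test_split(train_part, test_size=.3, random_state=123)

# Report each split's row count and its share of the original rows
for name, part in [('train', train_part), ('validate', validate_part), ('test', test_part)]:
    print(name, len(part), f'{len(part) / len(toy):.0%}')
```

If the class balance of the target matters, `train_test_split` also accepts a `stratify` argument; whether that suits this dataset is not something the notebook establishes.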

## Data Normalization

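Next, we will normalize the numerical columns with min-max scaling, which maps each column to the range [0, 1] based on its minimum and maximum in the train set. Putting all numerical features on the same scale can improve the performance of some machine learning models. The scaler is fit on the train set only and then applied to validate and test, so that no information from those sets leaks into the scaling.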

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Take the numeric columns from train so all three splits use the same set
num_cols = train.select_dtypes(include=[np.number]).columns

# Initialize a scaler
scaler = MinMaxScaler()

# Fit the scaler to the train data and transform the train data
train_scaled = scaler.fit_transform(train[num_cols])

# Transform the validate and test data with the scaler fitted on train
validate_scaled = scaler.transform(validate[num_cols])
test_scaled = scaler.transform(test[num_cols])
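
As a rough illustration of what the scaler does, the sketch below fits `MinMaxScaler` on a few invented ages and compares the result with the formula (x - train_min) / (train_max - train_min). The arrays and the name `scaler_demo` are assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented ages for illustration; not taken from the dataset
train_age = np.array([[18.0], [30.0], [42.0]])
validate_age = np.array([[24.0], [54.0]])

scaler_demo = MinMaxScaler()
train_age_scaled = scaler_demo.fit_transform(train_age)    # learns min=18 and max=42 from train
validate_age_scaled = scaler_demo.transform(validate_age)  # reuses train's min and max

# Same result by hand: (x - train_min) / (train_max - train_min)
manual = (validate_age - train_age.min()) / (train_age.max() - train_age.min())

print(train_age_scaled.ravel())
print(validate_age_scaled.ravel())
```

Because the minimum and maximum come from the train set, a validate or test value outside that range (54 here) lands outside [0, 1] after scaling.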

## Feature Selection

Next, we will perform feature selection using Recursive Feature Elimination (RFE). This method fits a model, removes the weakest feature (or features), and repeats until the specified number of features remains. This will help us identify the most important features for predicting the target variable.

In [None]:
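from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

target = 'person_injury_severity'
# Categorical columns to one-hot encode (the ID, date, free-text charge, and age are left out)
cat_cols = train.columns.drop([target, 'crash_id', 'crash_date', 'charge', 'person_age'])

# Numeric features: reuse the scaled train values, drop the crash_id identifier,
# and fill any gaps with the column median
num_cols = train.select_dtypes(include=[np.number]).columns
X_num = pd.DataFrame(np.asarray(train_scaled), columns=num_cols, index=train.index)
X_num = X_num.drop(columns=['crash_id'], errors='ignore')
X_num = X_num.fillna(X_num.median())

# Encode the categoricals (NaN gets its own indicator column) and combine
X_train = pd.concat([X_num, pd.get_dummies(train[cat_cols], dummy_na=True, dtype=float)], axis=1)

# Initialize the model and RFE
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=10)

# Fit RFE on the rows that have a known target
has_target = train[target].notna()
rfe.fit(X_train[has_target], train.loc[has_target, target])

# Get the names of the selected features
selected_features = X_train.columns[rfe.support_]

selected_features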
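
To make the elimination loop easier to see, here is a minimal sketch of RFE on synthetic data from `make_classification`. The data, the feature counts, and the names `X_demo`, `y_demo`, and `rfe_demo` are assumptions for illustration and say nothing about which features matter in the crash dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 are informative (illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=123
)

# Repeatedly fit the model and drop the feature with the smallest coefficient magnitude
rfe_demo = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)  # boolean mask: True for the features that were kept
print(rfe_demo.ranking_)  # 1 for kept features; larger numbers were dropped earlier
```

The `ranking_` attribute is useful alongside `support_` because it shows the order in which features were eliminated, not only which ones survived.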

In [None]:
# Replace missing ages with 0; NaN never equals itself, so `== np.nan` cannot find it
train = train.assign(person_age=train['person_age'].fillna(0))

In [None]:
train.person_age