# Safety Score Modeling

In this notebook, we will create a new column, 'safety_score', for the 'modeling.csv' dataset. The column holds a value between 0 and 1: a binary classification model's predicted probability that the target variable, 'injury_binary', equals 1.

We will use a training, validation, and test dataset for this process. Let's start by importing the necessary libraries and loading the dataset.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import wrangle as w 

pd.set_option('display.max_columns', None)
# Load the dataset
df = w.prepare_third_filtered_dataset_version()
df.head()

Unnamed: 0,crash_id,person_age,person_ethnicity,person_gender,has_motocycle_endorsment,person_injury_severity,vehicle_body_style,vehicle_color,vehicle_make,vehicle_model,vehicle_model_year,vehicle_make_country,injury_binary
0,16189632,37,w - white,1 - male,0,a - suspected serious injury,mc - motorcycle,blu - blue,harley-davidson,fld,2007,usa,1
1,16203470,30,h - hispanic,1 - male,0,b - suspected minor injury,mc - motorcycle,gry - gray,suzuki,gsx-r600,2004,japan,1
2,16192023,21,w - white,1 - male,0,a - suspected serious injury,mc - motorcycle,blu - blue,yamaha,yzfr6,2017,japan,1
3,16196720,18,h - hispanic,1 - male,0,b - suspected minor injury,mc - motorcycle,blu - blue,yamaha,rz500,2002,japan,1
4,16189103,28,w - white,1 - male,1,b - suspected minor injury,mc - motorcycle,blk - black,harley-davidson,fxdf,2009,usa,1


In [2]:
cols_to_drop = ['crash_id', 'person_injury_severity', 'injury_binary']

In [3]:
df[cols_to_drop]

Unnamed: 0,crash_id,person_injury_severity,injury_binary
0,16189632,a - suspected serious injury,1
1,16203470,b - suspected minor injury,1
2,16192023,a - suspected serious injury,1
3,16196720,b - suspected minor injury,1
4,16189103,b - suspected minor injury,1
...,...,...,...
14129,19321499,b - suspected minor injury,1
14130,19323296,a - suspected serious injury,1
14131,19327850,a - suspected serious injury,1
14132,19330330,b - suspected minor injury,1


In [19]:
x = df.drop(columns=cols_to_drop)

The dataset has been loaded successfully. Now, let's preprocess the data. We will one-hot encode every feature column (all columns except those in 'cols_to_drop'), which means numeric columns such as 'person_age' are treated as categorical too. We will also split the data into training, validation, and test sets.

In [4]:
# One-hot encode every feature column in a single call
feature_cols = [col for col in df.columns if col not in cols_to_drop]
df_encoded = pd.get_dummies(df, columns=feature_cols, drop_first=True, dtype=int)
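
To illustrate what the encoding step does, here is a minimal sketch on a made-up three-row frame (the toy values are assumptions, not rows from the dataset). With `drop_first=True`, each encoded column loses its first category level, which becomes the implicit baseline. Note that a numeric column listed in `columns` is also expanded into one indicator per distinct value.

```python
import pandas as pd

# Toy frame (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "vehicle_make": ["honda", "suzuki", "yamaha"],
    "person_age": [21, 30, 21],
    "injury_binary": [1, 0, 1],
})

# Encode every column except the target in a single call
feature_cols = [c for c in toy.columns if c != "injury_binary"]
toy_encoded = pd.get_dummies(toy, columns=feature_cols, drop_first=True, dtype=int)

print(toy_encoded.columns.tolist())
```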



In [5]:
# Split the data into training, validation, and test sets
train, validate, test = w.split(df_encoded)

In [10]:
X_train = train.drop(columns=cols_to_drop)
y_train = train['injury_binary']

X_validate = validate.drop(columns=cols_to_drop)
y_validate = validate['injury_binary']

X_test = test.drop(columns=cols_to_drop)
y_test = test['injury_binary']

In [11]:
X_train.shape, X_validate.shape, X_test.shape

((7914, 1147), (3393, 1147), (2827, 1147))
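
The `w.split` helper is not shown in this notebook. The shapes above are consistent with holding out 20% of the rows for testing and then 30% of the remainder for validation, so a hypothetical stand-in might look like the sketch below. The function name `split_sketch`, the seed, and the absence of stratification are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_sketch(df, seed=42):
    """Hypothetical stand-in for w.split: 20% test, then 30% of the rest for validation."""
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)
    return train, validate, test

# Dummy frame with the same number of rows as the three splits combined
demo = pd.DataFrame({"row": range(14134)})
train_demo, validate_demo, test_demo = split_sketch(demo)
print(len(train_demo), len(validate_demo), len(test_demo))
```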

The data has been split into training, validation, and test sets. Now, let's train a Logistic Regression model on the training data. We will then use this model to predict the 'injury_binary' target variable on the validation set.

In [13]:
# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Predict the probabilities of injury on the validation set
y_val_pred = model.predict_proba(X_validate)[:, 1]

# Calculate the ROC AUC score
roc_auc_score(y_validate, y_val_pred)

0.537406464168531

The model achieved a ROC AUC score of about 0.537 on the validation set, which is only slightly better than random guessing (0.5). As encoded, the features give the model little ability to separate the two classes. To check whether this holds on unseen data, we should evaluate the model on the test set.
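
As a reference for reading ROC AUC values, here is a small sketch with hand-made labels and scores (all values are illustrative assumptions). A ranking that places every positive above every negative scores 1.0, while scores that carry no ranking information score 0.5.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Every positive outranks every negative
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])
# Constant scores carry no ranking information
uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])
# One of the four positive/negative pairs is misordered
partial = roc_auc_score(y_true, [0.1, 0.4, 0.35, 0.8])

print(perfect, uninformative, partial)
```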

In [14]:
# Predict the probabilities of injury on the test set
y_test_pred = model.predict_proba(X_test)[:, 1]

# Calculate the ROC AUC score
roc_auc_score(y_test, y_test_pred)

0.5414084577253935

The model achieved a similar ROC AUC score of about 0.541 on the test set. The close agreement between the validation and test scores suggests that the weak performance reflects limited signal in the features rather than an unlucky split. However, for the purpose of this task, we will proceed with this model, keeping in mind that its probabilities discriminate only weakly between the two classes.

Now, let's use this model to calculate the 'safety_score' for each observation in the dataset. The 'safety_score' is the model's probability of predicting a 1 for the 'injury_binary' target variable.

In [26]:
# Calculate the 'safety_score' for each observation
df_encoded['safety_score'] = model.predict_proba(x)[:, 1]

df_encoded.head()

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- has_motocycle_endorsment
- person_age
- person_ethnicity
- person_gender
- vehicle_body_style
- ...
Feature names seen at fit time, yet now missing:
- has_motocycle_endorsment_1
- person_age_10
- person_age_11
- person_age_12
- person_age_13
- ...


In [None]:
df['safety_score'] = df_encoded['safety_score']

In [None]:
df.safety_score

In [None]:
from sklearn.feature_selection import RFE

# Initialize the model (max_iter matches the earlier fit to help convergence)
model = LogisticRegression(max_iter=1000)

# Initialize RFE; eliminating down to a single feature yields a full ranking
rfe = RFE(model, n_features_to_select=1)

# Fit RFE
rfe = rfe.fit(X_train, y_train)

# Pair each feature with its rank (1 = most important) and sort by rank
ranking = rfe.ranking_
feature_ranking = list(zip(X_train.columns, ranking))
sorted_feature_ranking = sorted(feature_ranking, key=lambda pair: pair[1])

sorted_feature_ranking
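
With 1,147 encoded features, the default `step=1`, and `n_features_to_select=1`, RFE refits the model more than a thousand times. The sketch below shows the same technique on synthetic data with a larger `step`, which removes several features per iteration. The synthetic data, the feature names `f0` to `f19`, and the chosen `step` and `n_features_to_select` values are assumptions for illustration, not settings tuned for this dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X_train / y_train (assumption: 200 rows, 20 features)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

# step=5 removes five features per iteration instead of one, cutting the number of refits
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=5)
rfe_demo.fit(X_demo, y_demo)

# Features kept by RFE all share rank 1
selected = X_demo.columns[rfe_demo.support_].tolist()
print(selected)
```

A larger `step` gives a coarser ranking, since features removed in the same iteration share a rank.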