# Modeling
In this notebook, we fit and score several types of classification models to predict the target variable, `person_injury_severity`. We start by reading `person_updated.csv` and splitting the data into train, validate, and test sets. We then create a baseline model and compare its score against the scores of the other models.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Read the CSV file
df = pd.read_csv('person_updated.csv')

In [9]:
len(df)

14548

In [11]:
df_prepped = pd.read_csv('prepped_svc.csv')

In [15]:
len(df_prepped)

14184

In [3]:
def split(df, target_variable):
    '''
    Split a dataframe into train, validate, and test sets, stratified on
    target_variable, in order to explore the data and to create and validate models.
    Takes a dataframe and the name of the target column. The random seed is
    fixed at 123 so the split can be replicated.
    Test is 20% of the original dataset. The remaining 80% of the dataset is
    divided between validate and train, with validate being .30*.80 = 24% of
    the original dataset, and train being .70*.80 = 56% of the original dataset.
    Returns the train, validate, and test dataframes.
    '''
    train, test = train_test_split(df, test_size=.2, random_state=123, stratify=df[target_variable])
    train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train[target_variable])

    return train, validate, test

In [4]:
# Split the data
train, validate, test = split(df, 'person_injury_severity')
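
As a quick sanity check on the 56% / 24% / 20% proportions described in the docstring, the same two-step stratified split can be run on a small synthetic frame. This is only a sketch: the `toy` dataframe and its column names are made up for illustration and are not part of the project data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows with a three-class target.
toy = pd.DataFrame({
    "feature": range(1000),
    "target": ["A", "B", "C", "B"] * 250,
})

# Same two-step stratified split as in `split`: 20% test first,
# then 30% of the remainder for validate.
rest, toy_test = train_test_split(
    toy, test_size=.2, random_state=123, stratify=toy["target"])
toy_train, toy_validate = train_test_split(
    rest, test_size=.3, random_state=123, stratify=rest["target"])

sizes = {"train": len(toy_train), "validate": len(toy_validate), "test": len(toy_test)}
print(sizes)  # {'train': 560, 'validate': 240, 'test': 200}
```

The three pieces add back up to the full 1,000 rows, and stratifying keeps the class mix of `target` the same in each piece.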

In [5]:
# Split the data into X and y
X_train = train.drop('person_injury_severity', axis=1)
y_train = train['person_injury_severity']

X_validate = validate.drop('person_injury_severity', axis=1)
y_validate = validate['person_injury_severity']

X_test = test.drop('person_injury_severity', axis=1)
y_test = test['person_injury_severity']

Unnamed: 0,person_age,person_gender,person_ethnicity,person_injury_severity,driver_license_class,motorcycle_endorsed
8279,19,1 - MALE,W - WHITE,A - SUSPECTED SERIOUS INJURY,C - CLASS C,False
12660,73,1 - MALE,W - WHITE,A - SUSPECTED SERIOUS INJURY,CM - CLASS C AND M,True
6787,24,2 - FEMALE,W - WHITE,A - SUSPECTED SERIOUS INJURY,5 - UNLICENSED,False
6088,27,1 - MALE,W - WHITE,K - FATAL INJURY,C - CLASS C,False
10085,37,1 - MALE,W - WHITE,B - SUSPECTED MINOR INJURY,CM - CLASS C AND M,True


Now that the data is split into train, validate, and test sets, we can start building models. We begin with a baseline: a simple model that gives us a point of reference for judging the others. A model is only worth keeping if it outperforms the baseline.

In [6]:
from sklearn.dummy import DummyClassifier

# Create a baseline model
baseline_model = DummyClassifier(strategy='most_frequent')

# Fit the model
baseline_model.fit(X_train, y_train)

# Score the model
baseline_score = baseline_model.score(X_validate, y_validate)
baseline_score

0.40893470790378006

The baseline score is the model's accuracy on the validation set. Because the `most_frequent` strategy always predicts the most common class seen in training, a score of about 0.409 means that class makes up roughly 41% of the validation rows. This score is the point of reference for the other models. Next, we create and score several types of classification models to predict the target variable.
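
To make the meaning of the baseline accuracy concrete, the sketch below shows on made-up labels that a `most_frequent` dummy model scores exactly the share of validation rows belonging to the most common training class. The `toy_*` names and the class mix are assumptions for illustration only.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels; the class mix is invented for illustration.
toy_y_train = pd.Series(["B"] * 5 + ["A"] * 3 + ["K"] * 2)
toy_y_validate = pd.Series(["B"] * 4 + ["A"] * 4 + ["K"] * 2)

# DummyClassifier ignores feature values, so a placeholder column is enough.
toy_X_train = pd.DataFrame({"placeholder": [0] * len(toy_y_train)})
toy_X_validate = pd.DataFrame({"placeholder": [0] * len(toy_y_validate)})

dummy = DummyClassifier(strategy="most_frequent").fit(toy_X_train, toy_y_train)
dummy_score = dummy.score(toy_X_validate, toy_y_validate)

# Share of validation rows that carry the most common training class.
most_common = toy_y_train.mode()[0]
share = (toy_y_validate == most_common).mean()

print(most_common, dummy_score, share)
```

Here the most common training class is `B`, and both `dummy_score` and `share` come out to 0.4.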

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Create a dictionary to store the models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC()
}

# Fit and score the models
for name, model in models.items():
    model.fit(X_train, y_train)
    score = model.score(X_validate, y_validate)
    print(f'{name} Score: {score}')

ValueError: could not convert string to float: '1 - MALE'

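The cell above stops with a `ValueError` because the feature columns still hold raw strings such as `'1 - MALE'`, and these scikit-learn estimators expect numeric input. (The baseline fit without complaint only because `DummyClassifier` ignores the feature values.) The categorical features need to be encoded before the loop can run. Once it does, each printed score is that model's accuracy on the validation set, and any model that scores above the baseline of about 0.409 predicts `person_injury_severity` better than always guessing the most frequent class.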
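
One possible way around the `could not convert string to float` error is to one-hot encode the string columns inside a pipeline, so each model receives numeric input. The sketch below is a minimal illustration on randomly generated toy data, not the notebook's actual fix: the `toy` frame, the choice of `OneHotEncoder`, the `with_encoding` helper, and `max_iter=1000` are all assumptions, and because the toy labels are random the printed scores carry no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data that mimics a few of the notebook's columns.
rng = np.random.default_rng(123)
n = 300
toy = pd.DataFrame({
    "person_age": rng.integers(16, 90, size=n),
    "person_gender": rng.choice(["1 - MALE", "2 - FEMALE"], size=n),
    "driver_license_class": rng.choice(
        ["C - CLASS C", "CM - CLASS C AND M", "5 - UNLICENSED"], size=n),
    "motorcycle_endorsed": rng.choice([True, False], size=n),
    "person_injury_severity": rng.choice(
        ["A - SUSPECTED SERIOUS INJURY", "B - SUSPECTED MINOR INJURY", "K - FATAL INJURY"],
        size=n),
})

toy_X = toy.drop("person_injury_severity", axis=1)
toy_y = toy["person_injury_severity"]
toy_X_train, toy_X_validate, toy_y_train, toy_y_validate = train_test_split(
    toy_X, toy_y, test_size=.3, random_state=123, stratify=toy_y)

categorical_cols = ["person_gender", "driver_license_class"]

def with_encoding(model):
    """Wrap a model so string columns are one-hot encoded before fitting."""
    encoder = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough",  # numeric and boolean columns pass through unchanged
    )
    return make_pipeline(encoder, model)

toy_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=123),
    "Random Forest": RandomForestClassifier(random_state=123),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
}

# Fit each wrapped model and record its validation accuracy.
toy_scores = {}
for name, model in toy_models.items():
    pipeline = with_encoding(model).fit(toy_X_train, toy_y_train)
    toy_scores[name] = pipeline.score(toy_X_validate, toy_y_validate)
    print(f"{name} Score: {toy_scores[name]}")
```

Fitting the encoder inside the pipeline means its categories are learned from the training rows only, and `handle_unknown="ignore"` keeps scoring from failing if the validation set contains a category the training set did not.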