# **Porto Seguro’s Safe Driver Prediction**

## **Machine Learning Final Project**

### **Sai Charith Govardhanam**


### **Introduction**

*   Auto insurance is a critical industry for ensuring the safety of drivers and protecting their assets in case of accidents. However, accurately assessing the risk of claims is a complex and time-consuming process.

*   Predictive modeling has become increasingly popular in the insurance industry as a way to improve the accuracy and efficiency of risk assessment.

*   The Porto Seguro Safe Driver Prediction dataset contains a large amount of information about policyholders, including demographic data, vehicle characteristics, and past claims history, that can be used to train a machine learning model to predict the likelihood of future claims.

*   Predicting auto insurance claims is difficult because many factors can contribute to the likelihood of a claim being filed, such as the driver's age, driving history, and the type of vehicle they own.

*   By developing a predictive model that can accurately assess the risk of claims, Porto Seguro can improve its ability to price policies more accurately, reduce the number of fraudulent claims, and improve overall customer satisfaction.







# Notebook Configuration

## Google drive

In [1]:
from google.colab import drive
import sys

# Mount Google Drive
drive.mount('/content/drive')

# Get the absolute path of the current folder
abspath_curr = '/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/mlproject/'

# Get the absolute path of the shallow utilities folder
abspath_util_shallow = '/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/code/utilities/p2_shallow_learning/'

# Get the absolute path of the shallow models folder
abspath_model_shallow = '/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/code/models/p2_shallow_learning/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Warning

In [2]:
import warnings

# Ignore warnings
warnings.filterwarnings('ignore')

## Matplotlib

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline 

# Set matplotlib sizes
plt.rc('font', size=20)
plt.rc('axes', titlesize=20)
plt.rc('axes', labelsize=20)
plt.rc('xtick', labelsize=20)
plt.rc('ytick', labelsize=20)
plt.rc('legend', fontsize=20)
plt.rc('figure', titlesize=20)

## TensorFlow

In [4]:
# The magic below allows us to use tensorflow version 2.x
%tensorflow_version 2.x 
import tensorflow as tf
from tensorflow import keras

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


## Random seed

In [5]:
# The random seed
random_seed = 42

# Set random seed in tensorflow
tf.random.set_seed(random_seed)

# Set random seed in numpy
import numpy as np
np.random.seed(random_seed)

# Data Preprocessing

In [6]:
# Change working directory to the absolute path of the shallow utilities folder
%cd $abspath_util_shallow

# Import the shallow utilities
%run pmlm_utilities_shallow.ipynb

/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/code/utilities/p2_shallow_learning


# **Loading the data**
In this case study, we will use the Porto Seguro’s Safe Driver Prediction dataset.

In [7]:
import pandas as pd

# Load the raw training data
df_raw_train = pd.read_csv(abspath_curr + '/data/train.csv',
                           header=0)
# Make a copy of df_raw_train
df_train = df_raw_train.copy(deep=True)

# Load the raw test data
df_raw_test = pd.read_csv(abspath_curr + '/data/test.csv',
                          header=0)
# Make a copy of df_raw_test
df_test = df_raw_test.copy(deep=True)

# Get the name of the target
target = 'target'

In [8]:
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])

Unnamed: 0,# rows,# columns
0,595212,59


In [9]:
# Print the first 5 rows of df_train
df_train.head()

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


In [10]:
# Print the first 5 rows of df_test
df_test.head()

Unnamed: 0,id,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,0,0,1,8,1,0,0,1,0,0,...,1,1,1,12,0,1,1,0,0,1
1,1,4,2,5,1,0,0,0,0,1,...,2,0,3,10,0,0,1,1,0,1
2,2,5,1,3,0,0,0,0,0,1,...,4,0,2,4,0,0,0,0,0,0
3,3,0,1,6,0,0,1,0,0,0,...,5,1,0,5,1,0,1,0,0,0
4,4,5,1,7,0,0,0,0,0,1,...,4,0,0,4,0,1,1,0,0,1


# **Splitting the data**
The code below shows how to divide the training data into training (80%) and validation (20%).

In [11]:
from sklearn.model_selection import train_test_split

# Divide the training data into training (80%) and validation (20%)
df_train, df_val = train_test_split(df_train, train_size=0.8, random_state=random_seed)

# Reset the index
df_train, df_val = df_train.reset_index(drop=True), df_val.reset_index(drop=True)

In [12]:
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])

Unnamed: 0,# rows,# columns
0,476169,59


In [13]:
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])

Unnamed: 0,# rows,# columns
0,119043,59
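
The split above is a plain random split. When the positive class is rare, a stratified split keeps the class proportions equal in both parts. The sketch below uses toy data (the column names and the 5% positive rate are assumptions for illustration, not values from this dataset) and the stratify argument of train_test_split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data (hypothetical): 50 positives out of 1000 rows
rng = np.random.default_rng(42)
df_toy = pd.DataFrame({
    'feature': rng.normal(size=1000),
    'target': np.r_[np.ones(50, dtype=int), np.zeros(950, dtype=int)],
})

# Stratify on the target so both parts keep the same share of positives
df_tr, df_va = train_test_split(df_toy,
                                train_size=0.8,
                                random_state=42,
                                stratify=df_toy['target'])

# Share of positives in each part
rate_train = df_tr['target'].mean()
rate_val = df_va['target'].mean()
print(rate_train, rate_val)
```

Whether stratification changes the results here was not tested in this notebook; it is offered only as an option to consider.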


# **Handling uncommon features**
## Identifying uncommon features
The code below shows how to find common variables between the training, validation and test data.

In [14]:
# Call common_var_checker
# See the implementation in pmlm_utilities_shallow.ipynb
df_common_var = common_var_checker(df_train, df_val, df_test, target)

# Print df_common_var
df_common_var

Unnamed: 0,common var
0,id
1,ps_calc_01
2,ps_calc_02
3,ps_calc_03
4,ps_calc_04
5,ps_calc_05
6,ps_calc_06
7,ps_calc_07
8,ps_calc_08
9,ps_calc_09
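
The helper common_var_checker lives in the utilities notebook and its code is not shown here. A minimal sketch of how such a check could work, assuming it intersects the column names and treats the target as shared even though the test data lacks it, is given below. The function name shared_columns and the toy dataframes are hypothetical.

```python
import numpy as np
import pandas as pd

def shared_columns(df_a, df_b, df_c, target):
    """Hypothetical stand-in: columns present in all three dataframes.

    The target is added to the third dataframe's column set because
    test data normally has no target column.
    """
    cols_c = np.union1d(df_c.columns, [target])
    shared = np.intersect1d(np.intersect1d(df_a.columns, df_b.columns), cols_c)
    return pd.DataFrame(shared, columns=['common var'])

# Toy data: 'only_train' exists in the training data alone
toy_train = pd.DataFrame({'id': [1], 'x1': [0.5], 'only_train': [9], 'target': [0]})
toy_val = pd.DataFrame({'id': [2], 'x1': [0.1], 'target': [1]})
toy_test = pd.DataFrame({'id': [3], 'x1': [0.7]})

df_shared = shared_columns(toy_train, toy_val, toy_test, 'target')

# Columns in the training data that are not shared
uncommon = np.setdiff1d(toy_train.columns, df_shared['common var'])
print(df_shared)
print(uncommon)
```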


The code below shows how to find features in the training data but not in the validation or test data.

In [15]:
# Get the features in the training data but not in the validation or test data
uncommon_feature_train_not_val_test = np.setdiff1d(df_train.columns, df_common_var['common var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_train_not_val_test, columns=['uncommon feature'])

Unnamed: 0,uncommon feature


The code below shows how to find the features in the validation data but not in the training or test data.

In [16]:
# Get the features in the validation data but not in the training or test data
uncommon_feature_val_not_train_test = np.setdiff1d(df_val.columns, df_common_var['common var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_val_not_train_test, columns=['uncommon feature'])

Unnamed: 0,uncommon feature


The code below shows how to find the features in the test data but not in the training or validation data.

In [17]:
# Get the features in the test data but not in the training or validation data
uncommon_feature_test_not_train_val = np.setdiff1d(df_test.columns, df_common_var['common var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_test_not_train_val, columns=['uncommon feature'])

Unnamed: 0,uncommon feature


## **Removing uncommon features**

In [18]:
# Remove the uncommon features from the training data
df_train = df_train.drop(columns=uncommon_feature_train_not_val_test)

# Print the first 5 rows of df_train
df_train.head()

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,97859,0,2,1,5,0,0,1,0,0,...,4,4,3,11,0,0,1,0,1,0
1,1195534,0,0,1,7,1,0,0,1,0,...,8,1,3,6,0,1,0,1,0,0
2,1367737,0,0,1,3,0,0,0,1,0,...,5,2,3,10,0,0,0,0,0,0
3,970233,0,0,3,4,0,0,1,0,0,...,2,3,1,10,0,1,0,0,0,0
4,158613,0,0,1,2,1,0,1,0,0,...,8,3,6,11,0,0,0,0,1,1


In [19]:
# Remove the uncommon features from the validation data
df_val = df_val.drop(columns=uncommon_feature_val_not_train_test)

# Print the first 5 rows of df_val
df_val.head()

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,642026,0,4,1,5,1,0,1,0,0,...,7,3,2,3,0,1,0,1,0,0
1,297043,0,6,2,10,1,0,0,0,0,...,6,3,3,5,0,1,1,0,0,0
2,140591,0,4,1,9,1,0,0,0,1,...,3,1,0,7,0,0,1,0,0,0
3,1354540,0,0,1,7,1,4,0,1,0,...,1,1,3,6,1,1,0,0,0,0
4,873173,0,1,1,3,1,0,1,0,0,...,6,1,5,6,0,1,0,0,0,0


In [20]:
# Remove the uncommon features from the test data
df_test = df_test.drop(columns=uncommon_feature_test_not_train_val)

# Print the first 5 rows of df_test
df_test.head()

Unnamed: 0,id,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,0,0,1,8,1,0,0,1,0,0,...,1,1,1,12,0,1,1,0,0,1
1,1,4,2,5,1,0,0,0,0,1,...,2,0,3,10,0,0,1,1,0,1
2,2,5,1,3,0,0,0,0,0,1,...,4,0,2,4,0,0,0,0,0,0
3,3,0,1,6,0,0,1,0,0,0,...,5,1,0,5,1,0,1,0,0,0
4,4,5,1,7,0,0,0,0,0,1,...,4,0,0,4,0,1,1,0,0,1


# **Handling identifiers**
## Combining the training, validation and test data
The code below shows how to combine the training, validation and test data.

In [21]:
# Combine df_train, df_val and df_test
df = pd.concat([df_train, df_val, df_test], sort=False)

## **Identifying identifiers**
The code below shows how to find identifiers in the data.

In [22]:
# Call id_checker on df
# See the implementation in pmlm_utilities_shallow.ipynb
df_id = id_checker(df)

# Print the first 5 rows of df_id
df_id.head()

Unnamed: 0,id
0,97859
1,1195534
2,1367737
3,970233
4,158613
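
The implementation of id_checker is not shown in this notebook. One plausible rule, sketched below as an assumption, is to treat a column as an identifier when it is not floating point and every non-missing value is unique. The name find_id_columns and the toy data are hypothetical.

```python
import pandas as pd

def find_id_columns(df):
    """Hypothetical stand-in: keep non-float columns whose non-missing values are all unique."""
    id_cols = [col for col in df.columns
               if not pd.api.types.is_float_dtype(df[col])
               and df[col].nunique(dropna=True) == df[col].notna().sum()]
    return df[id_cols]

# Toy data: 'id' is unique, 'age' repeats, 'score' is float and therefore skipped
toy = pd.DataFrame({'id': [10, 11, 12, 13],
                    'age': [3, 3, 5, 7],
                    'score': [0.1, 0.2, 0.3, 0.4]})

df_ids = find_id_columns(toy)
print(list(df_ids.columns))
```

Excluding float columns is a design choice in this sketch: continuous measurements are often unique per row without being identifiers.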


## **Removing identifiers**
The code below shows how to remove identifiers from the data.

In [23]:
import numpy as np

# Remove identifiers from df_train
df_train.drop(columns=np.intersect1d(df_id.columns, df_train.columns), inplace=True)

# Remove identifiers from df_val
df_val.drop(columns=np.intersect1d(df_id.columns, df_val.columns), inplace=True)

# Remove identifiers from df_test
df_test.drop(columns=np.intersect1d(df_id.columns, df_test.columns), inplace=True)

In [24]:
# Print the first 5 rows of df_train
df_train.head()

Unnamed: 0,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,0,2,1,5,0,0,1,0,0,0,...,4,4,3,11,0,0,1,0,1,0
1,0,0,1,7,1,0,0,1,0,0,...,8,1,3,6,0,1,0,1,0,0
2,0,0,1,3,0,0,0,1,0,0,...,5,2,3,10,0,0,0,0,0,0
3,0,0,3,4,0,0,1,0,0,0,...,2,3,1,10,0,1,0,0,0,0
4,0,0,1,2,1,0,1,0,0,0,...,8,3,6,11,0,0,0,0,1,1


In [25]:
# Print the first 5 rows of df_val
df_val.head()

Unnamed: 0,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,0,4,1,5,1,0,1,0,0,0,...,7,3,2,3,0,1,0,1,0,0
1,0,6,2,10,1,0,0,0,0,1,...,6,3,3,5,0,1,1,0,0,0
2,0,4,1,9,1,0,0,0,1,0,...,3,1,0,7,0,0,1,0,0,0
3,0,0,1,7,1,4,0,1,0,0,...,1,1,3,6,1,1,0,0,0,0
4,0,1,1,3,1,0,1,0,0,0,...,6,1,5,6,0,1,0,0,0,0


In [26]:
# Print the first 5 rows of df_test
df_test.head()

Unnamed: 0,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,0,1,8,1,0,0,1,0,0,0,...,1,1,1,12,0,1,1,0,0,1
1,4,2,5,1,0,0,0,0,1,0,...,2,0,3,10,0,0,1,1,0,1
2,5,1,3,0,0,0,0,0,1,0,...,4,0,2,4,0,0,0,0,0,0
3,0,1,6,0,0,1,0,0,0,0,...,5,1,0,5,1,0,1,0,0,0
4,5,1,7,0,0,0,0,0,1,0,...,4,0,0,4,0,1,1,0,0,1


# **Handling missing data**
## Combining the training, validation and test data
The code below shows how to combine the training, validation and test data.

In [27]:
# Combine df_train, df_val and df_test
df = pd.concat([df_train, df_val, df_test], sort=False)

## **Identifying missing values**
The code below shows how to find variables with NaN, along with their proportion of NaN and data type.

In [28]:
# Call nan_checker on df
# See the implementation in pmlm_utilities_shallow.ipynb
df_nan = nan_checker(df)

# Print df_nan
df_nan

Unnamed: 0,var,proportion,dtype
0,target,0.599999,float64


In [29]:
# Print the unique data type of variables with NaN
pd.DataFrame(df_nan['dtype'].unique(), columns=['dtype'])

Unnamed: 0,dtype
0,float64


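The code below shows how to use data type to select variables with missing values in the combined data. Only the target is flagged: the test data has no labels, so its 892,816 rows (about 60% of the 1,488,028 combined rows) have a NaN target.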

In [30]:
# Get the variables with missing values, their proportion of missing values and data type
df_miss = df_nan[df_nan['dtype'] == 'float64'].reset_index(drop=True)

# Print df_miss
df_miss

Unnamed: 0,var,proportion,dtype
0,target,0.599999,float64
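
No feature is reported as missing above. A NaN check only sees values stored as NaN, so a dataset that codes missing entries with a sentinel would pass it unnoticed. The sketch below assumes, as the competition's data description is understood to state, that -1 marks a missing value; the toy values are invented for illustration, and this notebook does not perform the conversion.

```python
import pandas as pd

# Toy data (hypothetical values) where -1 stands for "missing"
toy = pd.DataFrame({'ps_ind_02_cat': [-1, 1, 2, -1],
                    'ps_ind_01': [2, 0, 5, 1],
                    'ps_calc_01': [0.7, -1.0, 0.9, 0.6]})

# Replace the sentinel with NaN so that NaN-based checks can see it
toy_nan = toy.mask(toy == -1)

# Proportion of NaN per variable, keeping only variables that have any
nan_share = toy_nan.isna().mean()
nan_share = nan_share[nan_share > 0].sort_values(ascending=False)
print(nan_share)
```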


## **Separating the training, validation and test data**
The code below shows how to separate the training, validation and test data.

In [31]:
# Separating the training data
df_train = df.iloc[:df_train.shape[0], :]

# Separating the validation data
df_val = df.iloc[df_train.shape[0]:df_train.shape[0] + df_val.shape[0], :]

# Separating the test data
df_test = df.iloc[df_train.shape[0] + df_val.shape[0]:, :]

In [32]:
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])

Unnamed: 0,# rows,# columns
0,476169,58


In [33]:
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])

Unnamed: 0,# rows,# columns
0,119043,58


In [34]:
# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])

Unnamed: 0,# rows,# columns
0,892816,58
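
The combine-then-separate pattern relies on row order: the combined dataframe is sliced back by the original row counts. A minimal sketch on toy data follows; the variable names n_train and n_val are assumptions, and storing the counts before concatenating avoids reading sizes from dataframes that are being reassigned.

```python
import pandas as pd

# Toy training, validation and test data (the test data has no target)
toy_train = pd.DataFrame({'x': [1, 2, 3], 'target': [0, 1, 0]})
toy_val = pd.DataFrame({'x': [4, 5], 'target': [1, 0]})
toy_test = pd.DataFrame({'x': [6, 7, 8, 9]})

# Record the sizes before combining
n_train, n_val = len(toy_train), len(toy_val)

# Combine, then slice back by position; copy() gives independent dataframes
combined = pd.concat([toy_train, toy_val, toy_test], sort=False)
part_train = combined.iloc[:n_train].copy()
part_val = combined.iloc[n_train:n_train + n_val].copy()
part_test = combined.iloc[n_train + n_val:].copy()

print(part_train.shape, part_val.shape, part_test.shape)
```

After the round trip the test part carries a target column filled with NaN, which is why the target appears in the missing-value report.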


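## **Imputing missing values**
The code below shows how to use the mode of a variable to impute its missing values. Here the only variable with missing values is the target in the test rows, so this step fills the unlabeled test target with the most frequent training label; those values are placeholders, not real labels.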

In [35]:
from sklearn.impute import SimpleImputer

# If there are missing values
if len(df_miss['var']) > 0:
    # The SimpleImputer
    si = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

    # Impute the variables with missing values in df_train, df_val and df_test 
    df_train[df_miss['var']] = si.fit_transform(df_train[df_miss['var']])
    df_val[df_miss['var']] = si.transform(df_val[df_miss['var']])
    df_test[df_miss['var']] = si.transform(df_test[df_miss['var']])
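
The behaviour of the most_frequent strategy can be seen on a small example. The sketch below uses invented values: the imputer learns the mode from the training column only and reuses it for the validation column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy columns with missing entries (hypothetical values)
toy_train = pd.DataFrame({'cat': [1.0, 1.0, 2.0, np.nan]})
toy_val = pd.DataFrame({'cat': [np.nan, 2.0]})

# Learn the mode on the training data only
si_toy = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
train_imp = si_toy.fit_transform(toy_train)

# Reuse the training mode on the validation data
val_imp = si_toy.transform(toy_val)

print(si_toy.statistics_)  # the learned mode per column
print(train_imp.ravel(), val_imp.ravel())
```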

# **Encoding the data**
## Combining the training, validation and test data
The code below shows how to combine the training, validation and test data.

In [36]:
# Combine df_train, df_val and df_test
df = pd.concat([df_train, df_val, df_test], sort=False)

# Print the unique data type of variables in df
pd.DataFrame(df.dtypes.unique(), columns=['dtype'])

Unnamed: 0,dtype
0,float64
1,int64


## **Identifying categorical variables**
The code below shows how to find categorical variables (selected by data type) and their number of unique values. None are found here because every column is numeric.

In [37]:
# Call cat_var_checker on df
# See the implementation in pmlm_utilities_shallow.ipynb
df_cat = cat_var_checker(df)

# Print the dataframe
df_cat

Unnamed: 0,var,nunique


## **Removing the categorical features with a large number of categories**
The code below shows how to remove the categorical features with a large number of categories from the combined data.

In [38]:
# Remove features from df
df = df.drop(columns=[])

## **Encoding categorical features**
The code below shows how to encode categorical features in the combined data.

In [39]:
# One-hot-encode the categorical features in the combined data
df = pd.get_dummies(df, columns=np.setdiff1d(np.intersect1d(df.columns, df_cat['var']), [target]))

# Print the first 5 rows of df
df.head()

Unnamed: 0,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,0.0,2,1,5,0,0,1,0,0,0,...,4,4,3,11,0,0,1,0,1,0
1,0.0,0,1,7,1,0,0,1,0,0,...,8,1,3,6,0,1,0,1,0,0
2,0.0,0,1,3,0,0,0,1,0,0,...,5,2,3,10,0,0,0,0,0,0
3,0.0,0,3,4,0,0,1,0,0,0,...,2,3,1,10,0,1,0,0,0,0
4,0.0,0,1,2,1,0,1,0,0,0,...,8,3,6,11,0,0,0,0,1,1
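
Because no categorical variables were detected, get_dummies left the columns unchanged, including those whose names end in _cat. If that suffix is taken to mark integer-coded categories (an assumption based on the column naming, not something this notebook verifies), they could be selected by name instead of by data type. The toy values below are invented.

```python
import pandas as pd

# Toy data: all columns are integers, one carries the '_cat' suffix
toy = pd.DataFrame({'ps_ind_01': [2, 0, 5],
                    'ps_ind_02_cat': [1, 2, 1],
                    'ps_ind_06_bin': [1, 0, 0]})

# Selecting by data type finds nothing, since no column is of type object
by_dtype = toy.select_dtypes(include='object').columns.tolist()

# Selecting by name suffix finds the integer-coded category
by_suffix = [col for col in toy.columns if col.endswith('_cat')]

# One-hot encode only the columns selected by suffix
encoded = pd.get_dummies(toy, columns=by_suffix)
print(by_dtype, by_suffix)
print(list(encoded.columns))
```

Tree-based models can often work with integer codes directly, so whether one-hot encoding helps would need to be checked on validation data.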


## **Encoding categorical target**
The code below shows how to encode the categorical target in the combined data.

In [40]:
from sklearn.preprocessing import LabelEncoder

# The LabelEncoder
le = LabelEncoder()

# Encode categorical target in the combined data
df[target] = le.fit_transform(df[target].astype(str))

# Print the first 5 rows of df
df.head()

Unnamed: 0,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,0,2,1,5,0,0,1,0,0,0,...,4,4,3,11,0,0,1,0,1,0
1,0,0,1,7,1,0,0,1,0,0,...,8,1,3,6,0,1,0,1,0,0
2,0,0,1,3,0,0,0,1,0,0,...,5,2,3,10,0,0,0,0,0,0
3,0,0,3,4,0,0,1,0,0,0,...,2,3,1,10,0,1,0,0,0,0
4,0,0,1,2,1,0,1,0,0,0,...,8,3,6,11,0,0,0,0,1,1
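
The target arrives at this step as float64, so casting it to string yields the labels '0.0' and '1.0', which LabelEncoder maps to 0 and 1. A small sketch on invented values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy float target, as it looks after the combined data is imputed
toy_target = pd.Series([0.0, 1.0, 0.0, 1.0])

# Encode the string form of the target
le_toy = LabelEncoder()
encoded_target = le_toy.fit_transform(toy_target.astype(str))

# The learned classes are the sorted string labels
classes = list(le_toy.classes_)
print(encoded_target, classes)
```

Any NaN left in the target would become the string 'nan' and turn into a third class, which is one reason the missing target values need handling before this step.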


## **Separating the training, validation and test data**
The code below shows how to separate the training, validation and test data.

In [41]:
# Separating the training data
df_train = df.iloc[:df_train.shape[0], :]

# Separating the validation data
df_val = df.iloc[df_train.shape[0]:df_train.shape[0] + df_val.shape[0], :]

# Separating the test data
df_test = df.iloc[df_train.shape[0] + df_val.shape[0]:, :]

In [42]:
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])

Unnamed: 0,# rows,# columns
0,476169,58


In [43]:
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])

Unnamed: 0,# rows,# columns
0,119043,58


In [44]:
# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])

Unnamed: 0,# rows,# columns
0,892816,58


# **Splitting the feature and target**
The code below shows how to split the feature and target.

In [45]:
# Get the feature matrix
X_train = df_train[np.setdiff1d(df_train.columns, [target])].values
X_val = df_val[np.setdiff1d(df_val.columns, [target])].values
X_test = df_test[np.setdiff1d(df_test.columns, [target])].values

# Get the target vector
y_train = df_train[target].values
y_val = df_val[target].values
# The test data has no labels; y_test only holds the imputed placeholder values
y_test = df_test[target].values

# **Scaling the data**
### **Standardization**
The code below shows how to standardize the data.

In [46]:
from sklearn.preprocessing import StandardScaler

# The StandardScaler
ss = StandardScaler()

### **Standardizing the features**
The code below shows how to standardize the features.

In [47]:
# Standardize the training data
X_train = ss.fit_transform(X_train)

# Standardize the validation data
X_val = ss.transform(X_val)

# Standardize the test data
X_test = ss.transform(X_test)
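
The scaler is fitted on the training data only, and the same mean and standard deviation are then applied to the other parts. The sketch below shows this on a tiny invented matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (hypothetical values)
X_tr_toy = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0]])
X_va_toy = np.array([[2.0, 40.0]])

# Fit on the training data, then reuse its statistics on the validation data
ss_toy = StandardScaler()
X_tr_scaled = ss_toy.fit_transform(X_tr_toy)
X_va_scaled = ss_toy.transform(X_va_toy)

print(ss_toy.mean_)    # per-column training means
print(X_va_scaled)     # validation rows expressed in training units
```

The scaled training columns have mean 0 and standard deviation 1; the validation rows generally do not, because they are measured against the training statistics.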

# **Model Building**

## **Decision Tree**

The code below shows how to train the sklearn decision tree model (on the training data) and use it for prediction (on the validation data).

In [48]:
from sklearn.tree import DecisionTreeClassifier

# The DecisionTreeClassifier
dtc = DecisionTreeClassifier(class_weight='balanced', random_state=random_seed)

# Train the decision tree classifier on the training data
dtc.fit(X_train, y_train)

# Get the prediction on the validation data
y_val_pred_dtc = dtc.predict(X_val)

In [52]:
from sklearn.metrics import accuracy_score

# Get the accuracy
acc_dtc = accuracy_score(y_val, y_val_pred_dtc)

# Print the accuracy
print("Accuracy score for Decision Tree Classifier:", acc_dtc)

Accuracy score for Decision Tree Classifier: 0.9263543425484909


## **Random Forest**

The code below shows how to train the random forest (on the training data) and use it for prediction (on the validation data).

In [55]:
from sklearn.ensemble import RandomForestClassifier

# The RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=random_seed)

# Train the RandomForestClassifier on the training data
rfc.fit(X_train, y_train)

# Get the prediction on the validation data
y_val_pred_rfc = rfc.predict(X_val)

In [56]:
from sklearn.metrics import accuracy_score

# Get the accuracy
acc_rfc = accuracy_score(y_val, y_val_pred_rfc)

# Print the accuracy
print("Accuracy score for Random Forest Classifier:", acc_rfc)

Accuracy score for Random Forest Classifier: 0.9631645707853465


## **MLP Classifier**

The code below shows how to train the MLP Classifier (on the training data) and use it for prediction (on the validation data).

In [57]:
from sklearn.neural_network import MLPClassifier

# The MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=random_seed)

# Train the MLPClassifier on the training data
mlp.fit(X_train, y_train)

# Get the prediction on the validation data
y_val_pred_mlp = mlp.predict(X_val)

In [58]:
from sklearn.metrics import accuracy_score

# Get the accuracy
acc_mlp = accuracy_score(y_val, y_val_pred_mlp)

# Print the accuracy
print("Accuracy score for MLPClassifier:", acc_mlp)

Accuracy score for MLPClassifier: 0.9631729711112791


## **Logistic regression**

The code below shows how to train the logistic regression model (on the training data) and use it for prediction (on the validation data).

In [59]:
from sklearn.linear_model import LogisticRegression

# The Logistic Regression
lr = LogisticRegression(class_weight='balanced', random_state=random_seed)

# Train the Logistic Regression on the training data
lr.fit(X_train, y_train)

# Get the prediction on the validation data
y_val_pred_lr = lr.predict(X_val)

In [60]:
from sklearn.metrics import accuracy_score

# Get the accuracy
acc_lr = accuracy_score(y_val, y_val_pred_lr)

# Print the accuracy
print("Accuracy score for Logistic Regression:", acc_lr)

Accuracy score for Logistic Regression: 0.6217081222751443
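
Accuracy can look strong on imbalanced data even when no positive case is found. The sketch below uses invented labels with 4% positives to show that a model predicting no claims at all reaches 0.96 accuracy with zero recall. It also computes ROC AUC and the normalized Gini coefficient as 2 * AUC - 1, which is understood to be the metric used by the competition; the scores are random placeholders, not model outputs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Toy labels (hypothetical): 4 claims among 100 policies
y_true = np.array([1] * 4 + [0] * 96)

# A trivial baseline that never predicts a claim
y_all_zero = np.zeros_like(y_true)
acc_baseline = accuracy_score(y_true, y_all_zero)
rec_baseline = recall_score(y_true, y_all_zero)

# Placeholder scores standing in for predicted claim probabilities
rng = np.random.default_rng(42)
scores = rng.random(100) * 0.5
scores[:4] += 0.3  # nudge the positives upward so ranking carries some signal

# Ranking metrics use the scores rather than hard 0/1 predictions
auc = roc_auc_score(y_true, scores)
gini = 2 * auc - 1

print(acc_baseline, rec_baseline)
print(auc, gini)
```

With the fitted models in this notebook, such scores would come from predict_proba(X_val)[:, 1]; that comparison was not run here.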


### **Conclusion**



*   The Porto Seguro Safe Driver Prediction dataset was analysed using machine learning algorithms to predict whether a driver will file an insurance claim based on a set of features provided by Porto Seguro.

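*   The Random Forest and MLP Classifier models achieved the highest validation accuracy (about 0.963 each), ahead of the Decision Tree (about 0.926) and Logistic Regression (about 0.622). Accuracy was the only metric computed; precision and recall were not measured, and no predictions were generated on the test data.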

*   Data preprocessing techniques such as handling missing data, encoding categorical features, and splitting the feature and target variables were applied to improve the quality of the dataset and increase the performance of the machine learning models.

*   The results of this analysis can be used by Porto Seguro to identify high-risk drivers and develop targeted insurance policies that mitigate risk and reduce the number of insurance claims filed by their customers.

*   This analysis illustrates the potential for data-driven approaches to improve the insurance industry's risk assessment and underwriting processes. Better risk assessment could lead to more efficient and cost-effective insurance policies for both customers and insurance providers.





