In this notebook, we explore the dataset given for the challenge. 

In [1]:
%load_ext autoreload
%autoreload 2

In [29]:
import pandas as pd
# set precision for pandas
pd.set_option('display.precision', 2)
# imports to set up plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# load data 
train_df = pd.read_csv('../data/train.csv')

train_df.head()

Unnamed: 0,id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,0,Male,21,1,35.0,0,1-2 Year,Yes,65101.0,124.0,187,0
1,1,Male,43,1,28.0,0,> 2 Years,Yes,58911.0,26.0,288,1
2,2,Female,25,1,14.0,1,< 1 Year,No,38043.0,152.0,254,0
3,3,Female,35,1,1.0,0,1-2 Year,Yes,2630.0,156.0,76,0
4,4,Female,36,1,15.0,1,1-2 Year,No,31951.0,152.0,294,0


In [6]:
# check columns
train_df.columns

Index(['id', 'Gender', 'Age', 'Driving_License', 'Region_Code',
       'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel', 'Vintage', 'Response'],
      dtype='object')

In [9]:
# describe data
train_df.info(show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11504798 entries, 0 to 11504797
Data columns (total 12 columns):
 #   Column                Non-Null Count     Dtype  
---  ------                --------------     -----  
 0   id                    11504798 non-null  int64  
 1   Gender                11504798 non-null  object 
 2   Age                   11504798 non-null  int64  
 3   Driving_License       11504798 non-null  int64  
 4   Region_Code           11504798 non-null  float64
 5   Previously_Insured    11504798 non-null  int64  
 6   Vehicle_Age           11504798 non-null  object 
 7   Vehicle_Damage        11504798 non-null  object 
 8   Annual_Premium        11504798 non-null  float64
 9   Policy_Sales_Channel  11504798 non-null  float64
 10  Vintage               11504798 non-null  int64  
 11  Response              11504798 non-null  int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 1.0+ GB


There are no missing values in the dataset.
Some variables to look at more closely:
 
0. How balanced is the target (Response)?
1. Why is Driving_License an int? Is it a 0/1 flag?
2. Why is Region_Code a float? How many unique values does it have, and does it need one-hot encoding?
3. Vehicle_Age -> ordinal encoding (1, 2, 3)
4. Vehicle_Damage -> 0/1 encoding
5. What does the Vintage variable measure?
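
The encodings proposed for Vehicle_Age and Vehicle_Damage could look roughly like the sketch below. It runs on a tiny stand-in frame (`sample`, hypothetical) rather than on `train_df`, and it assumes the three Vehicle_Age labels seen in `train_df.head()` are the only ones present; the 1/2/3 ordering is an assumption as well.

```python
import pandas as pd

# Tiny stand-in frame using the category labels visible in train_df.head()
sample = pd.DataFrame({
    "Vehicle_Age": ["1-2 Year", "> 2 Years", "< 1 Year"],
    "Vehicle_Damage": ["Yes", "Yes", "No"],
})

# Assumed ordinal mapping: older vehicle -> larger code
vehicle_age_order = {"< 1 Year": 1, "1-2 Year": 2, "> 2 Years": 3}

encoded = sample.assign(
    Vehicle_Age=sample["Vehicle_Age"].map(vehicle_age_order),
    Vehicle_Damage=sample["Vehicle_Damage"].map({"No": 0, "Yes": 1}),
)
print(encoded)
```

`Series.map` returns NaN for any label missing from the dictionary, so checking `encoded.isna().sum()` afterwards is a cheap way to catch unexpected categories.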

#### Balance

In [18]:
train_df['Response'].value_counts(normalize=True)

Response
0    0.877003
1    0.122997
Name: proportion, dtype: float64

The dataset is heavily imbalanced, with a ratio of roughly 7:1 (87.7% vs. 12.3%) between 0 (does not purchase car insurance) and 1.
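
As a quick sanity check, the class ratio can be computed directly from the proportions printed above. The sketch also derives the accuracy of a trivial majority-class predictor, which is a useful floor to keep in mind when judging a model on imbalanced data; the variable names are illustrative only.

```python
# Class proportions copied from the value_counts(normalize=True) output above
p_negative = 0.877003
p_positive = 0.122997

# Number of negative responses per positive response
imbalance_ratio = p_negative / p_positive
print(f"negatives per positive: {imbalance_ratio:.2f}")

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(p_negative, p_positive)
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```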

#### Driving_License

In [12]:
# check values (only the last expression in a cell is displayed, so show proportions)
train_df['Driving_License'].value_counts(normalize=True)

Driving_License
1    0.998022
0    0.001978
Name: proportion, dtype: float64

Presumably this flags whether the customer holds a driving license; almost everyone does (99.8%). Who has a car but no driving license? These customers are probably less likely to buy insurance, which the crosstabs below check.

In [14]:
pd.crosstab(train_df['Driving_License'], train_df['Response'])

Driving_License,Response=0,Response=1
0,21502,1255
1,10068237,1413804


In [16]:
pd.crosstab(train_df['Driving_License'], train_df['Response'], normalize = 'index')

Driving_License,Response=0,Response=1
0,0.944852,0.055148
1,0.876868,0.123132


Customers without a driving license buy insurance at less than half the rate of those with one (5.5% vs. 12.3%).
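
The comparison can be reproduced from the raw counts in the first crosstab. This is a minimal sketch in plain Python with the counts copied by hand; `counts`, `rates` and `relative_rate` are illustrative names, not part of the notebook.

```python
# Counts copied from the first crosstab above:
# {Driving_License: (Response == 0, Response == 1)}
counts = {0: (21502, 1255), 1: (10068237, 1413804)}

# Share of positive responses within each Driving_License group
rates = {lic: pos / (neg + pos) for lic, (neg, pos) in counts.items()}

# Response rate without a license relative to the rate with one
relative_rate = rates[0] / rates[1]

for lic, rate in rates.items():
    print(f"Driving_License={lic}: response rate {rate:.4f}")
print(f"relative rate (no license / license): {relative_rate:.2f}")
```

Note that only 22,757 of the 11.5 million rows have no license, so even a clear difference in rates affects very few customers.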

#### Region_Code

In [24]:
train_df['Region_Code'].value_counts().sort_index()

Region_Code
0.0       59274
1.0       33966
2.0      118097
3.0      246303
4.0       52504
5.0       36832
6.0      181122
7.0       92240
8.0     1021036
9.0       93371
10.0     126081
11.0     278261
12.0      92142
13.0     108838
14.0     134585
15.0     403977
16.0      54905
17.0      74533
18.0     148548
19.0      44197
20.0      58765
21.0     126793
22.0      36932
23.0      54518
24.0      69136
25.0      70556
26.0      71228
27.0      78878
28.0    3451062
29.0     338146
30.0     367307
31.0      58442
32.0      78797
33.0     232387
34.0      48685
35.0     200035
36.0     261946
37.0     158976
38.0      60587
39.0     138068
39.2          1
40.0      35888
41.0     557581
42.0      13693
43.0      75868
44.0      20305
45.0     159292
46.0     578208
47.0     229190
48.0     114230
49.0      50822
50.0     302334
51.0       1880
52.0       3450
Name: count, dtype: int64

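53 integer codes (0 to 52), plus a single stray value of 39.2 that is most likely a data error. The codes are whole numbers stored as floats, so Region_Code is really a categorical variable -> possibly states or similar administrative regions, although the dataset does not say which.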
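
One way to locate non-integer codes such as 39.2 and to store the rest as integers is sketched below. It uses a small stand-in Series instead of `train_df['Region_Code']`, and dropping the fractional row is only one possible treatment (an assumption, not a decision made in this notebook); rounding it to 39 would be another.

```python
import pandas as pd

# Stand-in for train_df['Region_Code'], including the stray 39.2 seen above
region = pd.Series([28.0, 8.0, 39.0, 39.2, 46.0], name="Region_Code")

# Flag values that have a fractional part
is_fractional = region % 1 != 0
print(region[is_fractional])

# Assumed treatment: drop the fractional rows, then store codes as integers
region_clean = region[~is_fractional].astype("int64")
print(region_clean.nunique())
```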

#### Vintage

In [27]:
train_df["Vintage"].value_counts().sort_index()

Vintage
10     25723
11     37077
12     21034
13     23862
14     20001
       ...  
295    30348
296    33415
297    33308
298    82529
299    26561
Name: count, Length: 290, dtype: int64

In [34]:
train_df["Vintage"].describe().apply(lambda x: format(x, '.2f'))

count    11504798.00
mean          163.90
std            79.98
min            10.00
25%            99.00
50%           166.00
75%           232.00
max           299.00
Name: Vintage, dtype: object