# Data Analysis: Fraud Detection for Automobile Customers Dataset

<a id='overview-0'></a>

## [Overview](./0-AutoClaimFraudDetection.ipynb)
* **[1: Overview, Architecture, and Data Exploration](./0-AutoClaimFraudDetection.ipynb)**
  * **[DataSets and Exploratory Data Analysis](#nb0-data-explore)**
  * **[Exploratory Data Science and Operational ML workflows](#nb0-workflows)**
  * **[The ML Life Cycle: Detailed View](#nb0-ml-lifecycle)**


<a id ='nb0-data-explore'> </a>

## DataSets and Exploratory Visualizations
[Overview](#overview-0)

The dataset is synthetically generated and consists of two tables: <font color='green'> customers and claims </font>.
Here we load the customers table and explore it with some visualizations.

In [None]:
import warnings
warnings.filterwarnings('ignore')
!pip install seaborn==0.11.1
!pip install pandas --upgrade

In [None]:
# Importing required libraries.
import pandas as pd
import numpy as np
import seaborn as sns  # visualisation
import matplotlib.pyplot as plt  # visualisation

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(color_codes=True)

df_customers = pd.read_csv("./data/customers.csv", index_col=0)
df_customers.head()

## Let's have a look at data dimensionality, feature names, and feature types.

In [None]:
print(df_customers.shape)

In [None]:
features=df_customers.columns
features

### We can use the info() method to output some general information about the dataframe:

In [None]:
df_customers.info()

In [None]:
df_customers.describe()

In [None]:
df_customers['num_claims_past_year'].value_counts()

In [None]:
df_customers['customer_education'].value_counts()

In [None]:
df_customers['customer_gender'].value_counts()

In [None]:
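# Plot the share of customers by gender as a bar chart.
df_customers['customer_gender'].value_counts(normalize=True).plot.bar()
# NOTE: these labels assume value_counts() returns the categories in the order
# Male, Female, Unknown, Other; check the output of the previous cell.
plt.xticks([0, 1, 2, 3], ["Male", "Female", "Unknown", "Other"]);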

In [None]:
sns.countplot(x="customer_gender", hue="customer_education", data=df_customers);

In [None]:
# plot the number of claims filed in the past year
df_customers.num_claims_past_year.hist(density=True)
plt.suptitle("Number of Claims in the Past Year")
plt.xlabel("Number of claims per year")

In [None]:
sns.pairplot(
    data=df_customers, vars=["num_insurers_past_5_years", "months_as_customer", "customer_age"]
);

Understandably, `months_as_customer` and `customer_age` are correlated with each other. A younger person has been driving for less time and has therefore had less opportunity to accumulate time as a customer.

We can also see that `num_insurers_past_5_years` is negatively correlated with `months_as_customer`. Someone who frequently switched between insurers has probably spent less time as a customer of this insurer.
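
The relationships read off the pair plot can also be checked numerically. The following is a minimal sketch, assuming Pearson correlation is an adequate summary. The helper `pairwise_correlations` and the `toy` frame are hypothetical illustrations and do not come from `customers.csv`; in the notebook you would pass `df_customers` instead.

```python
import pandas as pd


def pairwise_correlations(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Return the Pearson correlation matrix for the selected numeric columns."""
    return df[cols].corr(method="pearson")


# Hypothetical toy data for illustration only (NOT the real customers table).
toy = pd.DataFrame(
    {
        "num_insurers_past_5_years": [1, 1, 2, 3, 4],
        "months_as_customer": [200, 150, 90, 40, 10],
        "customer_age": [60, 52, 41, 30, 22],
    }
)

cols = ["num_insurers_past_5_years", "months_as_customer", "customer_age"]
corr = pairwise_correlations(toy, cols)
print(corr.round(2))
```

A positive entry for `months_as_customer` versus `customer_age` and a negative entry for `num_insurers_past_5_years` versus `months_as_customer` would be consistent with the pair plot. If desired, `sns.heatmap(corr, annot=True)` displays the same matrix visually.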

In [None]:
sns.boxplot(x=df_customers["months_as_customer"]);

In [None]:
sns.boxplot(x=df_customers["customer_age"]);
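
By default, the points a box plot draws beyond the whiskers are those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below lists such points explicitly. The helper `iqr_outliers` and the `toy_ages` series are hypothetical and used only for illustration; in the notebook you would pass a column such as `df_customers["customer_age"]`.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Hypothetical toy values for illustration only (NOT the real customers table).
toy_ages = pd.Series([25, 30, 32, 35, 38, 40, 41, 45, 95], name="customer_age")
flagged = iqr_outliers(toy_ages)
print(flagged)
```

Whether a flagged value is a data error or a legitimate extreme is a judgment call that depends on the feature; the rule only marks candidates for inspection.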

___

### Next Notebook: [Data Preparation, Data Wrangler, Feature Store](./03-DataPrep-Wrangler-FeatureStore.ipynb)