# Data Analysis: Fraud Detection for Automobile Claims Dataset

<a id='overview-0'></a>

## [Overview](./0-AutoClaimFraudDetection.ipynb)
* **[1: Overview, Architecture, and Data Exploration](./0-AutoClaimFraudDetection.ipynb)**
  * **[DataSets and Exploratory Data Analysis](#nb0-data-explore)**
  * **[Exploratory Data Science and Operational ML workflows](#nb0-workflows)**
  * **[The ML Life Cycle: Detailed View](#nb0-ml-lifecycle)**


<a id='nb0-data-explore'></a>

## Datasets and Exploratory Visualizations
[Overview](#overview-0)

The data is synthetically generated and consists of two datasets: <font color='green'> customers and claims </font>.
In this notebook we load the claims dataset and explore it with summary statistics and visualizations.

In [None]:
import warnings
warnings.filterwarnings('ignore')
!pip install seaborn==0.11.1
!pip install pandas --upgrade

In [None]:
# Importing required libraries.
import pandas as pd
import numpy as np
import seaborn as sns  # visualisation
import matplotlib.pyplot as plt  # visualisation

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(color_codes=True)

df_claims = pd.read_csv("./data/claims.csv", index_col=0)
df_claims.head()

## Let’s have a look at data dimensionality, feature names, and feature types.

In [None]:
print(df_claims.shape)

In [None]:
# All columns except the last one, which is assumed to hold the target ('fraud').
features = df_claims.columns[:-1]
features
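To look at feature types as well as feature names, one option is to split the columns into numeric and non-numeric groups. The following is a minimal sketch, not part of the original notebook: `split_feature_types` is a hypothetical helper, and `demo_types` is a made-up toy frame that only borrows a few column names from the cells in this notebook so the example runs without `claims.csv`.

```python
import pandas as pd


def split_feature_types(df, target="fraud"):
    """Return (numeric, categorical) feature-name lists, excluding the target column."""
    feature_frame = df.drop(columns=[target])
    numeric = feature_frame.select_dtypes(include="number").columns.tolist()
    categorical = feature_frame.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical


# Hypothetical toy data; the values are invented for illustration only.
demo_types = pd.DataFrame(
    {
        "incident_severity": ["Minor", "Major", "Totaled"],
        "police_report_available": ["Yes", "No", "Yes"],
        "total_claim_amount": [1200.0, 8500.0, 23000.0],
        "fraud": [0, 0, 1],
    }
)

numeric_cols, categorical_cols = split_feature_types(demo_types)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On the real data, the same call would be `split_feature_types(df_claims)`, assuming the target column is named `fraud`.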

### We can use the `info()` method to output general information about the dataframe, such as column types, non-null counts, and memory usage:

In [None]:
df_claims.info()
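The non-null counts reported by `info()` can also be turned into an explicit missing-value report. This is a small sketch under assumptions: `missing_value_report` is a hypothetical helper, and `demo_missing` is an invented toy frame, so nothing here says whether the real claims data actually contains missing values.

```python
import pandas as pd


def missing_value_report(df):
    """List columns that contain missing values, with counts and shares of rows."""
    missing = df.isna().sum()
    report = pd.DataFrame({"missing": missing, "share": missing / len(df)})
    # Keep only columns with at least one missing value, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy data with a few gaps; values are invented for illustration.
demo_missing = pd.DataFrame(
    {
        "total_claim_amount": [1200.0, None, 23000.0, 400.0],
        "police_report_available": ["Yes", "No", None, None],
        "fraud": [0, 0, 1, 0],
    }
)

missing_report = missing_value_report(demo_missing)
print(missing_report)
```

An empty result from `missing_value_report(df_claims)` would indicate that no column has missing entries.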

In [None]:
df_claims.describe()

In [None]:
df_claims['fraud'].value_counts()
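Raw counts are easier to interpret next to their shares, since fraud labels are often imbalanced. The sketch below puts both in one table. `class_balance` is a hypothetical helper, and `demo_labels` is an invented series, not the real label distribution.

```python
import pandas as pd


def class_balance(labels):
    """Return per-class counts and shares, ordered by class label."""
    counts = labels.value_counts().sort_index()
    shares = labels.value_counts(normalize=True).sort_index()
    return pd.DataFrame({"count": counts, "share": shares})


# Hypothetical labels: nine non-fraud claims and one fraud claim.
demo_labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="fraud")

balance = class_balance(demo_labels)
print(balance)
```

For the notebook's data the call would be `class_balance(df_claims['fraud'])`.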

In [None]:
df_claims['incident_severity'].value_counts()

In [None]:
df_claims['police_report_available'].value_counts()

In [None]:
# Plot the share of non-fraudulent vs. fraudulent claims.
# sort_index() orders the bars by label (0, then 1) so the tick labels below always match.
df_claims['fraud'].value_counts(normalize=True).sort_index().plot.bar()
plt.xticks([0, 1], ["Not Fraud", "Fraud"]);

In [None]:
sns.countplot(x="incident_severity", hue="fraud", data=df_claims);

In [None]:
# Plot the distribution of total claim amounts across all claims.
plt.hist(df_claims.total_claim_amount, bins=30)
plt.xlabel("Total Claim Amount")
plt.ylabel("Number of Claims");

In [None]:
# Distribution of total claim amounts among fraudulent claims only.
fraud_df = df_claims[df_claims['fraud'] > 0]
fraud_df.total_claim_amount.hist(density=True)
plt.suptitle("Total Claim Amount for Fraudulent Claims")
plt.xlabel("Total Claim Amount")
plt.ylabel("Density");

In [None]:
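# Correlations are only defined for numeric columns; recent pandas versions
# raise an error if string columns are passed to corr(), so select numerics first.
df_claims_corr = df_claims.select_dtypes(include="number").corr()

df_claims_corr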

In [None]:
fig, ax = plt.subplots(figsize=(12, 10))

# Draw on the axes created above so the figure size is applied to the heatmap.
sns.heatmap(df_claims_corr, annot=True, fmt=".2f", ax=ax);
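A full heatmap can be hard to scan when the main question is which features move with the target. One possible complement is to rank the numeric features by the absolute value of their correlation with `fraud`. This is a sketch only: `correlation_with_target` is a hypothetical helper, `demo_corr` is an invented toy frame, and Pearson correlation against a binary label is a rough screening signal rather than a measure of predictive power.

```python
import pandas as pd


def correlation_with_target(df, target="fraud"):
    """Rank numeric features by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by strength of association, keeping the sign in the returned values.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Hypothetical toy data; column names and values are for illustration only.
demo_corr = pd.DataFrame(
    {
        "total_claim_amount": [1000, 2000, 3000, 4000],
        "num_witnesses": [2, 0, 1, 3],
        "incident_severity": ["Minor", "Minor", "Major", "Totaled"],
        "fraud": [0, 0, 1, 1],
    }
)

ranked = correlation_with_target(demo_corr)
print(ranked)
```

With the notebook's data, `correlation_with_target(df_claims)` would return the same kind of ranking for the real columns; non-numeric columns such as `incident_severity` are skipped.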

___

### Next Notebook: [Data Preparation, Data Wrangler, Feature Store](./03-DataPrep-Wrangler-FeatureStore.ipynb)