### Step 0: Downloading dataset

The following code will help you download the dataset stored in your Azure Machine Learning Service Workspace:

**NOTE**: We recommend storing your *Azure Credentials in Secrets, Azure Key Vault, or Environment Variables*

**You only need to run it once.**

In [None]:
import os
import joblib

SUBSCRIPTION_ID = os.getenv('SUBSCRIPTION_ID')
RESOURCE_GROUP = os.getenv('RESOURCE_GROUP')
WORKSPACENAME = os.getenv('WORKSPACENAME')
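
Note that `os.getenv` returns `None` when a variable is not set, so a missing credential would only surface later, when connecting to the workspace. A minimal sketch of an early check follows; `require_env` and the variable name `DEMO_SETTING` are hypothetical names introduced for illustration and are not part of the Azure ML SDK:

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message.

    `require_env` is a hypothetical helper introduced for illustration.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demo with a placeholder variable; real credentials are set outside the notebook.
os.environ["DEMO_SETTING"] = "placeholder-value"
print(require_env("DEMO_SETTING"))
```

The same helper could be used in place of `os.getenv` for `SUBSCRIPTION_ID`, `RESOURCE_GROUP` and `WORKSPACENAME`.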

### Step 1: Importing libraries

We typically recommend importing libraries beforehand. That will help you identify some dependency issues before even loading the data.

In [None]:
import pandas as pd
import pandas_profiling
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import cufflinks as cf
from plotly.offline import init_notebook_mode, iplot

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.covariance import EmpiricalCovariance, MinCovDet

from interpret import show
from interpret.data import ClassHistogram
from interpret.glassbox import ExplainableBoostingClassifier

import statistics
import hdbscan

import azureml.core
from azureml.core import Experiment, Workspace
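
If you prefer to see every missing dependency at once instead of stopping at the first failed import, a small check such as the following can help. This is only a sketch: `missing_packages` is a hypothetical helper, and the list contains top-level import names, which may differ from the distribution names used for installation:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level import names used in this notebook.
required = ["pandas", "pandas_profiling", "seaborn", "matplotlib", "plotly",
            "cufflinks", "sklearn", "interpret", "hdbscan", "azureml"]
print("Missing packages:", missing_packages(required))
```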

---------------------------------------------

### Step 2: Reading downloaded data into dataframe

Typically, the file is going to be stored in the local file system that you are using.

In [None]:
df = pd.read_csv("insurance_claims_data.csv")
target_variable = 'fraud_reported'

### Step 3: Using info() method to understand the data schema.

This shows the number of non-null values (and therefore the number of missing samples) and the data type of each field.

In [None]:
df.info()
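
If you want the same information as a table that can be sorted or filtered, one option is to build it explicitly. The sketch below uses an invented toy DataFrame (the column names mirror the dataset, the values do not come from it):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the claims data; values are invented.
toy = pd.DataFrame({
    "insured_sex": ["MALE", "FEMALE", None, "FEMALE"],
    "total_claim_amount": [5000.0, np.nan, 62000.0, 58000.0],
    "fraud_reported": ["Y", "N", "N", "Y"],
})

# One row per column: dtype, number of missing values, and share of missing values.
schema = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
print(schema)
```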

### Step 4: Formatting target variable into Boolean

This will help in the model creation and evaluation process.

In [None]:
df[target_variable] = [1 if i=='Y' else 0 for i in df[target_variable]]
df['incident_hour_of_the_day'] = pd.Categorical(df['incident_hour_of_the_day'])
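
The list comprehension above maps every value other than `'Y'` to 0, including missing or misspelled labels. If you would rather fail loudly on unexpected values, a stricter variant could look like this (a sketch; `encode_yes_no` is a hypothetical helper, and we assume the only valid labels are `'Y'` and `'N'`):

```python
import pandas as pd


def encode_yes_no(series):
    """Map 'Y' to 1 and 'N' to 0, raising if any other value is present."""
    encoded = series.map({"Y": 1, "N": 0})
    if encoded.isna().any():
        unexpected = sorted(series[encoded.isna()].astype(str).unique())
        raise ValueError(f"Unexpected target values: {unexpected}")
    return encoded.astype(int)


labels = pd.Series(["Y", "N", "N", "Y"])
print(encode_yes_no(labels).tolist())  # [1, 0, 0, 1]
```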

### Step 5: Data Profiling (Unidimensional)

We are going to understand the main characteristics of the different features and the target in a unidimensional way (one variable at a time).

In [None]:
pandas_profiling.ProfileReport(df, explorative=True)

### Step 6: Data Profiling (With reference to Target)

In this case, since our target is binary, we recommend the following types of visualizations:
- **For categorical independent variables**: We recommend visualizing the percentage of the target class vs. each class of the independent variable.
- **For continuous independent variables**: We recommend splitting the data into bins and visualizing the percentage of the target class vs. each bin of the independent variable. We also recommend using scatterplots in cases where the dependent variable is continuous.
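
The binning recommendation for continuous variables can be sketched as follows. The data is invented, and the choice of four equal-frequency bins (`pd.qcut` with `q=4`) is an assumption made for illustration, not a value taken from this project:

```python
import pandas as pd

# Toy data: the column names mirror the dataset, the values are invented.
toy = pd.DataFrame({
    "total_claim_amount": [3000, 4500, 5200, 48000, 52000, 61000, 67000, 75000],
    "fraud_reported":     [0,    0,    1,    0,     1,     0,     1,     1],
})

# Split the continuous variable into 4 equal-frequency bins (quartiles).
toy["amount_bin"] = pd.qcut(toy["total_claim_amount"], q=4)

# Fraud rate and number of samples per bin.
by_bin = toy.groupby("amount_bin", observed=True)["fraud_reported"].agg(["mean", "count"])
print(by_bin)
```

Showing the count next to the rate helps to spot bins whose percentage is based on very few samples.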

#### 6.1. Percentage of fraud by Gender. | Cat

In [None]:
temp = df[['insured_sex','fraud_reported']].groupby(['insured_sex'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_sex',  y = 'fraud_reported', data=temp)

In [None]:
temp = df[['insured_sex','fraud_reported']].groupby(['insured_sex'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_sex',  y = 'fraud_reported', data=temp)
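
The two cells above (fraud rate and sample count) follow a pattern that repeats for every categorical feature in this step. As a possible shortcut, both aggregations can be computed in one call. This is only a sketch with invented data, and `fraud_summary` is a hypothetical helper that is not used elsewhere in this notebook:

```python
import pandas as pd


def fraud_summary(frame, feature, target="fraud_reported"):
    """Return the fraud rate and sample count for each class of `feature`."""
    summary = (
        frame.groupby(feature)[target]
        .agg(fraud_rate="mean", n_samples="count")
        .sort_values("fraud_rate", ascending=False)
        .reset_index()
    )
    return summary


toy = pd.DataFrame({
    "insured_sex": ["MALE", "MALE", "FEMALE", "FEMALE", "FEMALE"],
    "fraud_reported": [1, 0, 0, 0, 1],
})
print(fraud_summary(toy, "insured_sex"))
```

The result can be passed to `sns.barplot` with `y='fraud_rate'` or `y='n_samples'` to reproduce either chart.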

#### 6.2. Percentage of fraud by Auto Model. | Cat

In [None]:
plt.figure(figsize=(30,5))
temp = df[['auto_model','fraud_reported']].groupby(['auto_model'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='auto_model',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(30,5))
temp = df[['auto_model','fraud_reported']].groupby(['auto_model'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='auto_model',  y = 'fraud_reported', data=temp)

#### 6.3. Percentage of fraud by Auto Make. | Cat

In [None]:
plt.figure(figsize=(20,5))
temp = df[['auto_make','fraud_reported']].groupby(['auto_make'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='auto_make',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(20,5))
temp = df[['auto_make','fraud_reported']].groupby(['auto_make'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='auto_make',  y = 'fraud_reported', data=temp)

#### 6.4. Percentage of fraud by Police Report Available. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['police_report_available','fraud_reported']].groupby(['police_report_available'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='police_report_available',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['police_report_available','fraud_reported']].groupby(['police_report_available'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='police_report_available',  y = 'fraud_reported', data=temp)

#### 6.5. Percentage of fraud by Property Damage. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['property_damage','fraud_reported']].groupby(['property_damage'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='property_damage',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['property_damage','fraud_reported']].groupby(['property_damage'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='property_damage',  y = 'fraud_reported', data=temp)

#### 6.6. Percentage of fraud by Incident City. | Cat

In [None]:
plt.figure(figsize=(10,5))
temp = df[['incident_city','fraud_reported']].groupby(['incident_city'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_city',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(10,5))
temp = df[['incident_city','fraud_reported']].groupby(['incident_city'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_city',  y = 'fraud_reported', data=temp)

#### 6.7. Percentage of fraud by Incident State. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['incident_state','fraud_reported']].groupby(['incident_state'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_state',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['incident_state','fraud_reported']].groupby(['incident_state'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_state',  y = 'fraud_reported', data=temp)

#### 6.8. Percentage of fraud by Authorities contacted. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['authorities_contacted','fraud_reported']].groupby(['authorities_contacted'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='authorities_contacted',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['authorities_contacted','fraud_reported']].groupby(['authorities_contacted'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='authorities_contacted',  y = 'fraud_reported', data=temp)

#### 6.9. Percentage of fraud by incident severity. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['incident_severity','fraud_reported']].groupby(['incident_severity'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_severity',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['incident_severity','fraud_reported']].groupby(['incident_severity'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_severity',  y = 'fraud_reported', data=temp)

#### 6.10. Percentage of fraud by Collision Type. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['collision_type','fraud_reported']].groupby(['collision_type'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='collision_type',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['collision_type','fraud_reported']].groupby(['collision_type'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='collision_type',  y = 'fraud_reported', data=temp)

#### 6.11. Percentage of fraud by Incident Type. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['incident_type','fraud_reported']].groupby(['incident_type'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_type',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['incident_type','fraud_reported']].groupby(['incident_type'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_type',  y = 'fraud_reported', data=temp)

#### 6.12. Percentage of fraud by Incident Day and Month. | Cat

In [None]:
df['incident_date'] = pd.to_datetime(df['incident_date'], errors = 'coerce')

# extracting days and month from date
df['incident_month'] = df['incident_date'].dt.month
df['incident_day'] = df['incident_date'].dt.day

plt.figure(figsize=(10,5))
temp = df[['incident_month','fraud_reported']].groupby(['incident_month'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_month',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(10,5))
temp = df[['incident_day','fraud_reported']].groupby(['incident_day'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='incident_day',  y = 'fraud_reported', data=temp)

#### 6.13. Percentage of fraud by Insured Relationship. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['insured_relationship','fraud_reported']].groupby(['insured_relationship'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_relationship',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['insured_relationship','fraud_reported']].groupby(['insured_relationship'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_relationship',  y = 'fraud_reported', data=temp)

#### 6.14. Percentage of fraud by Insured Hobbies. | Cat

In [None]:
plt.figure(figsize=(25,5))
temp = df[['insured_hobbies','fraud_reported']].groupby(['insured_hobbies'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_hobbies',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(25,5))
temp = df[['insured_hobbies','fraud_reported']].groupby(['insured_hobbies'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_hobbies',  y = 'fraud_reported', data=temp)

#### 6.15. Percentage of fraud by Insured Occupation. | Cat

In [None]:
plt.figure(figsize=(25,5))
temp = df[['insured_occupation','fraud_reported']].groupby(['insured_occupation'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_occupation',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(25,5))
temp = df[['insured_occupation','fraud_reported']].groupby(['insured_occupation'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_occupation',  y = 'fraud_reported', data=temp)

#### 6.16. Percentage of fraud by Insured Education Level. | Cat

In [None]:
plt.figure(figsize=(10,5))
temp = df[['insured_education_level','fraud_reported']].groupby(['insured_education_level'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_education_level',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(10,5))
temp = df[['insured_education_level','fraud_reported']].groupby(['insured_education_level'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='insured_education_level',  y = 'fraud_reported', data=temp)

#### 6.17. Percentage of fraud by policy CSL. | Cat

**CSL** stands for Combined Single Limit

**CSL** is a single number that describes the predetermined limit for the combined total of the **Bodily Injury Liability** coverage and **Property Damage Liability** coverage per occurrence or accident.

In [None]:
plt.figure(figsize=(8,5))
temp = df[['policy_csl','fraud_reported']].groupby(['policy_csl'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='policy_csl',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['policy_csl','fraud_reported']].groupby(['policy_csl'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='policy_csl',  y = 'fraud_reported', data=temp)

#### 6.18. Percentage of fraud vs. Auto Year | Cat

In [None]:
plt.figure(figsize=(20,5))
temp = df[['auto_year','fraud_reported']].groupby(['auto_year'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='auto_year',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(20,5))
temp = df[['auto_year','fraud_reported']].groupby(['auto_year'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='auto_year',  y = 'fraud_reported', data=temp)

#### 6.19. Approach using the **Interpret** library from Microsoft Research.

More information [Here.](https://github.com/interpretml/interpret)

In [None]:
features_ignore_initial = ['policy_number', 'policy_bind_date', 'incident_date', 'incident_location', 'insured_zip', 'auto_model', 'fraud_reported']
X = df.drop(features_ignore_initial, axis=1)
y = df[target_variable].values

from interpret.data import ClassHistogram

hist = ClassHistogram().explain_data(X, y, name = 'Train Data')
show(hist)

### Step 7:  Insights from Data Profiling

### Step 8:  Outlier and Extreme values identification

#### Step 8.1: Types of outliers and extreme values:

Pretty much all datasets contain outliers or extreme values. We should be very aware of them, as they can hinder the generalization capacity of Machine Learning models.

Most common causes of outliers on a dataset:

- **Data entry** errors: human errors.
- **Measurement** errors: instrument errors.
- **Experimental** errors: data extraction or experiment planning/executing errors.
- **Intentional**: dummy outliers made to test detection methods.
- **Data processing** errors: data manipulation or data set unintended mutations.
- **Sampling** errors: extracting or mixing data from wrong or various sources.
- **Natural**: not an error, novelties in data. ***We might want to keep these***, as they can contain important information about the data.

In the process of producing, collecting, processing and analyzing data, outliers can come from many sources and hide in many dimensions. Those that are not a product of an error are called **novelties.**


[More detailed information can be found here.](https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561)

#### Step 8.2: Approaches to outlier and extreme values identification:

There are two general categories of approaches for **Outlier Detection**:
- **Parametric** approaches: these assume that the data has some underlying distribution, such as the normal distribution.
- **Nonparametric** approaches: there are no requirements on the underlying distribution.
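
To make the difference concrete, the sketch below compares a parametric rule (z-score, which implicitly assumes roughly normal data) with a nonparametric one (Tukey's IQR fences). The sample is invented, and the cut-offs `k=3.0` and `k=1.5` are conventional defaults that we assume here; they are not values taken from this project:

```python
import statistics


def zscore_outliers(values, k=3.0):
    """Parametric rule: flag values more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * std]


def iqr_outliers(values, k=1.5):
    """Nonparametric rule (Tukey fences): flag values beyond k * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Invented sample with one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 60]
print("z-score rule:", zscore_outliers(sample))
print("IQR rule:    ", iqr_outliers(sample))
```

On this small sample the z-score rule flags nothing, because the extreme value itself inflates the mean and the standard deviation, while the IQR rule flags 60.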

Additionally, you can conduct your **Outlier Detection** in any number of features:
- You can use a **Single** feature.
- Or you can use a **Subset** of features.


There are virtually infinite approaches (and combinations of approaches) that help in the outlier identification process. The following are some of the approaches that we use:

- **Heuristics**: based on experience, or because observations with certain (pre-defined) characteristics are always treated differently, customers sometimes mark those observations either to exclude them from the process or to specifically observe how the models behave for them.
- **Z-Score or Extreme Value Analysis (EVA)**: parametric | [Theory (EVA)](https://www.sciencedirect.com/science/article/pii/S0963869517300488) | [Python Implementation (EVA)](https://github.com/georgebv/pyextremes)
- **Isolation Forests**: [Intuition and Python Implementation](http://www.extended-cognition.com/2018/11/15/multivariate-outlier-detection-with-isolation-forests/)
- **Proximity Based Models with or without PCA:** [Theory PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) | [Theory Euclidean Distance](https://en.wikipedia.org/wiki/Euclidean_distance) | [Theory Mahalanobis Distance](https://nirpyresearch.com/detecting-outliers-using-mahalanobis-distance-pca-python/) | [Python Implementation](https://nirpyresearch.com/detecting-outliers-using-mahalanobis-distance-pca-python/)
- **High Dimensional Outlier Detection Methods** (high dimensional sparse data): [Theory HiCS](https://www.ipd.kit.edu/mitarbeiter/muellere/publications/ICDE2012.pdf) | [Why to use HiCS in highly dimensional spaces](https://members.loria.fr/MOBerger/Enseignement/Master2/Exposes/beyer.pdf) | [Python Implementation](https://github.com/KDD-OpenSource/fexum-hics)
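
As an illustration of the proximity-based idea, the following sketch computes the Mahalanobis distance of each sample to the sample mean using plain NumPy. The data is invented, and this is not the implementation behind the links above; it only shows why a distance that accounts for correlation can expose a point that breaks the pattern of the other samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: 200 strongly correlated 2-D points plus one point that breaks the correlation.
inliers = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=200)
X_demo = np.vstack([inliers, [3.0, -3.0]])

# Mahalanobis distance of every row to the sample mean.
centered = X_demo - X_demo.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X_demo, rowvar=False))
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", centered, inv_cov, centered))

print("Row with the largest Mahalanobis distance:", int(np.argmax(mahalanobis)))
```

The planted point is the last row (index 200), and it receives the largest distance because it lies far from the direction in which the other points vary.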


#### Step 8.3: Identifying Outliers and Extreme Values

We recommend identifying both extreme values for specific **single features** and extreme values for a **subset of features**. For this project, we are going to do both:

#### 8.3.1. Identifying Outliers and Extreme values for single features:

In this case, as a result of combining the insights generated during the **Data Profiling** process with **business decisions**, we are going to focus on identifying outliers for **total_claim_amount**:

**Let's start by plotting the distribution:**

In [None]:
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

df['total_claim_amount'].iplot(
    mean = 'True',
    kind='hist',
    bins=100,
    xTitle='Total Claim Amount',
    linecolor='black',
    yTitle='count',
    colorscale = 'greens',
    title='Histogram of Total Claim Amount')

As we can see, this variable has a bimodal distribution, so we are going to split it into two datasets and then conduct the outlier analysis.
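
The reason for splitting first can be seen with a small invented example: when both modes are pooled, the gap between them inflates the standard deviation, so any threshold of the form mean + k * std ends up far above the data. The numbers below are made up for illustration, and `k = 2.5` is just one of the multipliers used later in this step:

```python
import statistics

# Invented claim amounts forming two clearly separated groups.
lower_mode = [3000, 4200, 5100, 5600, 6400]
upper_mode = [48000, 52000, 55000, 58000, 61000, 64000, 70000]
pooled = lower_mode + upper_mode

pooled_std = statistics.stdev(pooled)
upper_std = statistics.stdev(upper_mode)

pooled_threshold = statistics.mean(pooled) + 2.5 * pooled_std
upper_threshold = statistics.mean(upper_mode) + 2.5 * upper_std

print("Std of pooled data:", round(pooled_std))
print("Std of upper mode: ", round(upper_std))
print("Pooled mean + 2.5 std:    ", round(pooled_threshold))
print("Upper-mode mean + 2.5 std:", round(upper_threshold))
```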

**Distribution A (<15K)**:

In this case we are not going to flag any outliers, as the smallest values correspond to minor parking accidents and the upper threshold is still smaller than the start of the next cluster.

In [None]:
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

less_15k = df['total_claim_amount'][df['total_claim_amount']<15000]
less_15k.iplot(
    mean = 'True',
    kind='hist',
    bins=100,
    xTitle='Total Claim Amount',
    linecolor='black',
    yTitle='count',
    colorscale = 'greens',
    title='Histogram of Total Claim Amount (<15K)')

**Distribution B (>20K)**:

In this case we are not going to look for outliers among the smallest values, as we already analyzed them in the previous histogram. Now we are going to focus on the upper side of this distribution.

In [None]:
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

more_20k = df['total_claim_amount'][df['total_claim_amount']>20000]
more_20k.iplot(
    mean = 'True',
    kind='hist',
    bins=100,
    xTitle='Total Claim Amount',
    linecolor='black',
    yTitle='count',
    colorscale = 'blues',
    title='Histogram of Total Claim Amount (>20K)')

#### Calculating upper thresholds based on the standard deviation

In [None]:
more_20k_std = statistics.stdev(more_20k)
more_20k_mean = statistics.mean(more_20k)

upper_threshold_2_5 = more_20k_mean+(2.5*more_20k_std)
print("- Upper Threshold with 2.5 stds: ", int(upper_threshold_2_5))

upper_threshold_3_5 = more_20k_mean+(3.5*more_20k_std)
print("- Upper Threshold with 3.5 stds: ", int(upper_threshold_3_5))

We can see that there are no values above 3.5 standard deviations.

#### Assigning classifications into variables on the original DataFrame

We do this because we want to generate some exploratory Data Analysis with those variables later:

In [None]:
df['total_claim_amount_15k'] = df['total_claim_amount'].apply(lambda x: 1 if x<15000 else 0)
print("- Total samples smaller than 15k:", int(sum(df['total_claim_amount_15k'])))

df['total_claim_amount_2_5_std'] = df['total_claim_amount'].apply(lambda x: 1 if x<upper_threshold_2_5 else 0)
print("- Total samples with more than 2.5std:", int(len(df) - sum(df['total_claim_amount_2_5_std'])))

df['total_claim_amount_3_5_std'] = df['total_claim_amount'].apply(lambda x: 1 if x<upper_threshold_3_5 else 0)
print("- Total samples with more than 3.5std:", int(len(df) - sum(df['total_claim_amount_3_5_std'])))

#### Percentage of fraud by total_claim_amount category <15k. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['total_claim_amount_15k','fraud_reported']].groupby(['total_claim_amount_15k'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='total_claim_amount_15k',  y = 'fraud_reported', data=temp)

#### Percentage of fraud by total_claim_amount category below 2.5 std. | Cat

In [None]:
plt.figure(figsize=(8,5))
temp = df[['total_claim_amount_2_5_std','fraud_reported']].groupby(['total_claim_amount_2_5_std'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='total_claim_amount_2_5_std',  y = 'fraud_reported', data=temp)

In [None]:
plt.figure(figsize=(8,5))
temp = df[['total_claim_amount_2_5_std','fraud_reported']].groupby(['total_claim_amount_2_5_std'], as_index = False).count().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='total_claim_amount_2_5_std',  y = 'fraud_reported', data=temp)

#### 8.3.2. Identifying Outliers and Extreme values for a subset of the features:

In the **Data Profiling** process, we identified some variables that we should not use here:
* policy_number
* policy_bind_date
* incident_date
* incident_location
* auto_model

Additionally, we are going to ignore:

* fraud_reported
* total_claim_amount_15k
* total_claim_amount_2_5_std
* total_claim_amount_3_5_std


In [None]:
features_ignore_initial_pca = ['policy_number', 'policy_bind_date', 'incident_date', 'incident_location', 'auto_model', 'fraud_reported', 'total_claim_amount_15k', 'total_claim_amount_2_5_std', 'total_claim_amount_3_5_std']

In [None]:
df2 = df.drop(features_ignore_initial_pca, errors ='ignore', axis = 1)

We will also need to one-hot encode the **categorical variables**.

In [None]:
categorial = ['policy_state', 'policy_csl', 'insured_sex', 'insured_education_level', 'insured_occupation', 'insured_hobbies', 'insured_relationship','incident_type','collision_type','incident_severity','authorities_contacted','incident_state','incident_city','property_damage','police_report_available','auto_make']
df2 = pd.get_dummies(df2.drop('fraud_reported', errors ='ignore', axis = 1))
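
For reference, this is what `pd.get_dummies` does on a small invented DataFrame: string columns are expanded into one indicator column per class, while numeric columns are passed through unchanged:

```python
import pandas as pd

toy = pd.DataFrame({
    "insured_sex": ["MALE", "FEMALE", "FEMALE"],
    "total_claim_amount": [5000, 62000, 58000],
})

# String columns become one indicator column per class; numeric columns pass through.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
# ['total_claim_amount', 'insured_sex_FEMALE', 'insured_sex_MALE']
```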

Now we will use **PCA**:

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Define the PCA object
pca = PCA(svd_solver = 'full', random_state=80)
# Run PCA on scaled data and obtain the scores array
T = pca.fit_transform(StandardScaler().fit_transform(df2))
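
Since only the first two components are used for clustering in the next cell, it can be useful to check how much variance they retain. The sketch below shows the idea on invented data (100 samples, 5 features); the same `explained_variance_ratio_` attribute is available on the fitted `pca` object above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(80)
# Invented data: 100 samples, 5 features, two of them strongly correlated.
data = rng.normal(size=(100, 5))
data[:, 1] = data[:, 0] + 0.1 * rng.normal(size=100)

pca_demo = PCA(svd_solver="full")
scores = pca_demo.fit_transform(StandardScaler().fit_transform(data))

# Cumulative share of variance retained by the first k components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print("Variance retained by the first two components:", round(float(cumulative[1]), 3))
```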

Then we will use [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html):

In [None]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
clusterer.fit(T[:,:2])

print("- Number of clusters:", max(clusterer.labels_)+1)
print("- Number of possible outliers:", sum(clusterer.labels_ ==-1))

colors = [plt.cm.jet(float(i)/max(clusterer.labels_)) for i in clusterer.labels_]
fig = plt.figure(figsize=(10,8))
with plt.style.context(('ggplot')):
    plt.scatter(T[:, 0], T[:, 1], c=colors, edgecolors='k', s=60)
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('Score Plot')
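
To read the output of the cell above, keep in mind that HDBSCAN labels clusters with integers starting at 0 and assigns the label -1 to points it considers noise. A small sketch with an invented label vector:

```python
from collections import Counter

# Invented label vector in the format returned by a fitted HDBSCAN clusterer.
labels = [0, 0, 1, -1, 1, 2, -1, 0]

counts = Counter(labels)
n_noise = counts.get(-1, 0)                       # points left unassigned (possible outliers)
n_clusters = len([k for k in counts if k != -1])  # distinct cluster labels, noise excluded

print("- Number of clusters:", n_clusters)
print("- Number of possible outliers:", n_noise)
```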

In [None]:
df['hdbscan_cluster'] = clusterer.labels_

In [None]:
plt.figure(figsize=(8,5))
temp = df[['hdbscan_cluster','fraud_reported']].groupby(['hdbscan_cluster'], as_index = False).mean().sort_values(
    by = 'fraud_reported', ascending = False)
ax = sns.barplot(x='hdbscan_cluster',  y = 'fraud_reported', data=temp)

In [None]:
features_ignore_initial = ['policy_number', 'policy_bind_date', 'incident_date', 'incident_location','insured_zip', 'auto_model', 'total_claim_amount_15k', 'total_claim_amount_2_5_std', 'total_claim_amount_3_5_std','hdbscan_cluster']

#### Step 8.4: What can be done with extreme values?

There are virtually infinite ways to deal with extreme values. However, one common practice is to research and understand their causes in order to determine what to do with them. The following are some of the most common practices:

- Use this engine to redirect these data points to other processes: **human-in-the-loop** or **heuristics**-based approaches.
- If the cause of those extreme values can be identified and explained (they have **natural** causes), they can be treated by adding features that capture their behavior and/or adding more samples to this dataset.
- In cases where those extreme values were generated because of **non-natural** causes (errors in data pipeline, measurement errors), Data Scientists typically remove those extreme values from the data. 

-----------------------------------------

### Step 9:  Working on data split for baseline model

In this step we are not going to do any **model, feature or hyperparameter selection**. Here we just want to have a quick model in order to understand the limitations of the dataset and get a ballpark idea of how good the model can be.

For this step, we typically recommend doing as little **Feature Engineering** as possible, as we want to understand the behavior of the **raw** features. We also recommend using the simplest algorithms available, as we do not want to spend too much time in this process. We typically recommend using simple linear models such as **Linear Regressions** or **Logistic Regressions**.

#### 9.1. Separating features and targets

In this step we do not do anything special; we are just separating our **target** from our **features** and ignoring some of the features that cannot be put into the model without preprocessing (**dates, customer_id, etc.**).

In [None]:
df2 = df.drop(features_ignore_initial,errors='ignore', axis = 1)
X = df2.drop(target_variable, axis=1)
y = df2[target_variable].values

#### 9.2. Defining Split between test/train

Even if it seems trivial, this is one of the most important steps in the Data Science process, as an error here can mislead you and/or make the process more difficult by introducing **data leakage** and **underrepresentation** in the training set. These are some of the things that we recommend taking into account:

- For **categorical** targets, we recommend using stratified splits, as we want to have a similar class distribution in both the training and testing datasets.
- When you are working with **Time Series**, we recommend doing the splits based on a date **cut-off**, as samples of the same period tend to have similar behaviors and in real life we will never have future data to train the models. Thus, the objective of these models is to predict **future** behavior with data from the **past**.
- For **Non-Time Series** cases, splits are usually based on percentages, which works in most cases not involving big data but is not always applicable. In cases where you have **2 million observations**, for example, **20% represents 400k observations** (that seems like a lot). In these cases, you might want to analyze what number of observations makes sense in order to do a correct model selection.
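
The date cut-off recommendation for time series can be sketched as follows. The data and the cut-off date are invented for illustration; this project itself uses a stratified percentage split:

```python
import pandas as pd

toy = pd.DataFrame({
    "incident_date": pd.to_datetime(
        ["2015-01-05", "2015-01-20", "2015-02-03", "2015-02-18", "2015-03-01"]),
    "fraud_reported": [0, 1, 0, 0, 1],
})

# Hypothetical cut-off: everything before it is used for training, the rest for testing.
cutoff = pd.Timestamp("2015-02-10")
train = toy[toy["incident_date"] < cutoff]
test = toy[toy["incident_date"] >= cutoff]

print(len(train), "training rows,", len(test), "testing rows")
```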



#### 9.3. Splitting train/test

In [None]:
t_size = 0.2

In [None]:
#X=pd.get_dummies(X.drop('fraud_reported', errors ='ignore', axis = 1))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=t_size, random_state=19, stratify=y)
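
The effect of `stratify=y` can be verified on a small invented target with 20% positives: both partitions keep the same positive rate. This sketch reuses the `test_size` and `random_state` values from the cell above only for consistency:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented imbalanced target: 20 positives out of 100 samples.
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=19, stratify=y_demo)

print("Positive rate in train:", y_tr.mean())
print("Positive rate in test: ", y_te.mean())
```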

In [None]:
X_train.head()

### Step 10: Creation of baseline model

In [None]:
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

#### Step 10.1. Evaluating initial model

Now we are going to do an initial evaluation of the performance metrics.

In [None]:
y_pred_ef = ebm.predict(X_test)

print("Training Accuracy: ", ebm.score(X_train, y_train))
print('Testing Accuracy: ', ebm.score(X_test, y_test))

# making a classification report
cr = classification_report(y_test,  y_pred_ef,target_names=['non-fraud','fraud'])
print(cr)

# making a confusion matrix
cm = confusion_matrix(y_test, y_pred_ef)
sns.heatmap(cm, annot = True)
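
To read the heatmap, recall that scikit-learn's confusion matrix places the true classes in rows and the predicted classes in columns. The sketch below uses an invented matrix (not the result of this model) to show how the fraud-class metrics in the classification report relate to its cells:

```python
# Invented confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes, in the order [non-fraud, fraud].
cm_demo = [[150, 10],
           [20, 20]]

tn, fp = cm_demo[0]
fn, tp = cm_demo[1]

precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that the model catches
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

In this invented case the accuracy is 0.85 even though only half of the frauds are caught, which is why accuracy alone can be misleading on imbalanced targets.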

### Step 11. Logging results into Azure Machine Learning Workspace

#### Choose a logging option

If you want to track or monitor your experiment, you must add code to start logging when you submit the run. The following are ways to trigger the run submission:

- **Run.start_logging** - Add logging functions to your training script and start an interactive logging session in the specified experiment. **start_logging** creates an interactive run for use in scenarios such as notebooks. Any metrics that are logged during the session are added to the run record in the experiment.
- **ScriptRunConfig** - Add logging functions to your training script and load the entire script folder with the run. **ScriptRunConfig** is a class for setting up configurations for script runs. With this option, you can add monitoring code to be notified of completion or to get a visual widget to monitor.
- **Designer logging** - Add logging functions to a drag-&-drop designer pipeline by using the **Execute Python Script** module. Add Python code to log designer experiments.
- **AML AutoML** - When we use AutoML, it automatically creates an experiment and registers all the important metrics.

You can find more information [Here.](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-track-experiments)

#### Step 11.1. Connecting to Workspace

In [None]:
# Check core SDK version number
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")
print("")

ws = Workspace(SUBSCRIPTION_ID, RESOURCE_GROUP, WORKSPACENAME)

#### Step 11.2. Generating Metrics

In [None]:
cr = classification_report(y_test,  y_pred_ef,target_names=['non-fraud','fraud'], output_dict =True)

In [None]:
# Non-Fraud
non_fraud_precision = cr['non-fraud']['precision']
non_fraud_recall = cr['non-fraud']['recall']
non_fraud_f1_score = cr['non-fraud']['f1-score']
non_fraud_support = cr['non-fraud']['support']

# Fraud
fraud_precision = cr['fraud']['precision']
fraud_recall = cr['fraud']['recall']
fraud_f1_score = cr['fraud']['f1-score']
fraud_support = cr['fraud']['support']

# Macro Average
macro_precision = cr['macro avg']['precision']
macro_recall = cr['macro avg']['recall']
macro_f1_score = cr['macro avg']['f1-score']
macro_support = cr['macro avg']['support']

# Weighted Average
weighted_precision = cr['weighted avg']['precision']
weighted_recall = cr['weighted avg']['recall']
weighted_f1_score = cr['weighted avg']['f1-score']
weighted_support = cr['weighted avg']['support']

values_to_log = [non_fraud_precision, non_fraud_recall, non_fraud_f1_score,non_fraud_support, 
                 fraud_precision, fraud_recall,fraud_f1_score,fraud_support,
                macro_precision,macro_recall,macro_f1_score, macro_support,
                weighted_precision, weighted_recall, weighted_f1_score, weighted_support]

names_to_log = ['non_fraud_precision', 'non_fraud_recall', 'non_fraud_f1_score','non_fraud_support', 
                 'fraud_precision', 'fraud_recall','fraud_f1_score','fraud_support',
                'macro_precision','macro_recall','macro_f1_score', 'macro_support',
                'weighted_precision', 'weighted_recall', 'weighted_f1_score', 'weighted_support']
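
The two lists above can also be generated from the report dictionary instead of being written by hand. The following is a sketch only: `flatten_report` is a hypothetical helper, the input is an invented dictionary shaped like the output of `classification_report(..., output_dict=True)`, and the generated names differ slightly from the ones above (for example `macro_avg_precision` instead of `macro_precision`):

```python
def flatten_report(report):
    """Flatten a nested classification-report dict into (name, value) pairs."""
    pairs = []
    for label, metrics in report.items():
        prefix = label.replace(" ", "_").replace("-", "_")
        if isinstance(metrics, dict):
            for metric, value in metrics.items():
                pairs.append((f"{prefix}_{metric.replace('-', '_')}", value))
        else:
            # Scalar entries such as 'accuracy' have no nested metrics.
            pairs.append((prefix, metrics))
    return pairs


# Invented report with the same shape as scikit-learn's output_dict.
demo_report = {
    "non-fraud": {"precision": 0.9, "recall": 0.95, "f1-score": 0.92, "support": 160},
    "fraud": {"precision": 0.7, "recall": 0.5, "f1-score": 0.58, "support": 40},
    "accuracy": 0.86,
}
for name, value in flatten_report(demo_report):
    print(name, value)
```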

In [None]:
# Get an experiment object from Azure Machine Learning
experiment = Experiment(workspace=ws, name="train-within-notebook")

# Create a run object in the experiment
run =  experiment.start_logging()
# Log each classification metric to the run
for name, value in zip(names_to_log, values_to_log):
    run.log(name, value)

# Save the model to the outputs directory for capture
os.makedirs('outputs', exist_ok=True)
model_file_name = 'outputs/ebm.pkl'

joblib.dump(value = ebm, filename = model_file_name)

# upload the model file explicitly into artifacts 
run.upload_file(name = model_file_name, path_or_stream = model_file_name)

# Complete the run
run.complete()