<img src="https://github.com/seclea/seclea_ai/raw/dev/docs/media/logos/logo-light.png" width="400" alt="Seclea" />

# Getting Started

We will run through the basic process of using Seclea to record your data science work
and explore the results in the Seclea Platform.

If you are not a data scientist, pay most attention to the Seclea Platform sections.

## Setting up

Head to [platform.seclea.com](https://platform.seclea.com) and log in.

Create a new project and give it a name and description.

<img src="https://github.com/seclea/seclea_ai/media/notebooks/getting_started/create-new-project.png" width=300/>
<img src="https://github.com/seclea/seclea_ai/media/notebooks/getting_started/create-project-name-description.png" width=300/>

Go to project settings.

<img src="https://github.com/seclea/seclea_ai/media/notebooks/getting_started/project-settings.png" width=300/>

Select Compliance, Risk and Performance Templates for this project.
These are optional but are needed to take advantage of Checks. If in doubt leave these empty for now and come back.

TODO fill in images --- here ---

## The Data

[Download](https://raw.githubusercontent.com/mwitiderrick/insurancedata/master/insurance_claims.csv) the data for this tutorial.

In [None]:
!pip install seclea-ai

Now we can upload the initial data to the Seclea Platform. This should include whatever information we know about the dataset at this point as metadata.

There are only two keys to add in metadata for now - outcome_name and continuous_features.

Here you will also have to log in to the Platform using the credentials given to you.

In [None]:
import numpy as np
import pandas as pd
from seclea_ai import SecleaAI

# load the data 
# read_csv takes index_col (not index) to choose the index column
data = pd.read_csv('insurance_claims.csv', index_col="policy_number")

# upload the data in its initial state to the Seclea Platform
# NOTE - use the organization name provided to you when issued credentials.
seclea = SecleaAI(project_name="My Project", organization='My Org')

dataset_metadata = {"outcome_name": "fraud_reported", 
                    "continuous_features": [
                                            "total_claim_amount",
                                            'policy_annual_premium',
                                            'capital-gains',
                                            'capital-loss',
                                            'injury_claim',
                                            'property_claim',
                                            'vehicle_claim',
                                            'incident_hour_of_the_day',
                                            ]}

seclea.upload_dataset(dataset=data, dataset_name="Auto Insurance Fraud", metadata=dataset_metadata)


## Transformations

There is one important requirement when using Seclea to record your data science work: how you
handle transformations of the data.

We require that every transformation is encapsulated in a function that takes the data and returns the
transformed data. Packaging the processing code into functions lets us record and use it effectively.

In [None]:
# Creating a copy to isolate the original dataset
df1 = data.copy(deep=True)

def encode_nans(df):
    # converting the special character to nans so we can use nan processing code
    # available in pandas.
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    return df.replace('?', np.nan)

In [None]:
df1 = encode_nans(df1)


## 3.4 Seclea Platform

Now navigate in your browser to platform.seclea.com and log in.

You should see a dashboard. Select your project.

Navigate to the Datasets section, under the Prepare tab. Look at the preview and try the format check and the PII check.

Include some tasks for them to explore.

Include screen shots.

# 4. Data preprocessing/Feature Engineering

## 4.1 Dealing with Missing values

It is tempting to treat None (or NaN) values as zeroes because they represent the absence of a value. They are not the same: zero is a value (for example an integer or float), while None means that no value was recorded.
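To make the difference concrete, here is a minimal sketch on made-up claim amounts (the numbers are illustrative and not taken from the tutorial dataset). It compares the mean when the missing value is skipped with the mean when it is wrongly treated as zero.

```python
import numpy as np
import pandas as pd

# Hypothetical claim amounts; the third value is unknown, not zero.
claims = pd.Series([100.0, 200.0, np.nan, 300.0])

# pandas skips NaN by default, so the mean uses the three known values.
mean_ignoring_missing = claims.mean()

# Treating the missing value as zero drags the mean down.
mean_with_zero = claims.fillna(0).mean()

print(mean_ignoring_missing)  # 200.0
print(mean_with_zero)         # 150.0
```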

There are various techniques for replacing missing values, such as:
1. Fill NaN with the mean, median or mode of the data
2. Fill NaN with a constant value
3. Impute with KNN
4. Impute with MICE (Multiple Imputation by Chained Equations)

We will try two of these techniques. For one version of the dataset we will fill with the constant value -1, and for a second version we will fill with the mode.
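As a rough illustration of how the first three options look in code, the sketch below applies them to a small made-up DataFrame (the column names and values are assumptions for illustration only). MICE is left out; scikit-learn offers an experimental `IterativeImputer` in that spirit, which you may want to explore separately.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with one missing value per column (illustrative only).
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "premium": [1000.0, 1200.0, np.nan, 1100.0],
})

# 1. Fill with a per-column statistic.
filled_mean = toy.fillna(toy.mean())
filled_median = toy.fillna(toy.median())
# mode() can return several rows when there are ties; take the first.
filled_mode = toy.fillna(toy.mode().iloc[0])

# 2. Fill with a constant.
filled_const = toy.fillna(-1)

# 3. Fill from the nearest rows, using the features that are present.
knn = KNNImputer(n_neighbors=2)
filled_knn = pd.DataFrame(knn.fit_transform(toy), columns=toy.columns)

print(filled_knn)
```

Every variant returns a new DataFrame and leaves `toy` unchanged, which matches the function style used in the rest of this tutorial.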

Here we define a function to carry out the dropping of columns that contain more than a certain proportion of nulls. We do this as these columns usually don't add useful information and only slow us down.

In [None]:
# Drop the columns with more than a given proportion of NaN values
def drop_nulls(df, threshold):
    cols = [x for x in df.columns if df[x].isnull().sum() / df.shape[0] > threshold]
    return df.drop(columns=cols)

You will notice that we define the threshold value as a variable and pass it in. That is because we will use this variable again later when we upload these transformations, which is easier than copying and pasting the value.

In [None]:
# We choose 95% as our threshold
null_thresh = 0.95
df1 = drop_nulls(df1, threshold=null_thresh)

In this case only _c39 will be dropped as it is 100% null values.
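If you want to check which columns a given threshold would remove before dropping anything, one option is to look at the share of nulls per column. The sketch below uses a small made-up frame (only the `_c39` column name comes from the dataset; the values are illustrative).

```python
import numpy as np
import pandas as pd

# Toy frame: one partly missing column and one fully missing column.
toy = pd.DataFrame({
    "collision_type": ["Rear", np.nan, "Side", "Front"],
    "_c39": [np.nan, np.nan, np.nan, np.nan],
})

# isnull() gives booleans, so the column mean is the share of nulls.
null_share = toy.isnull().mean()

# Columns whose null share exceeds the threshold would be dropped.
to_drop = null_share[null_share > 0.95].index.tolist()

print(null_share)
print(to_drop)  # ['_c39']
```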


Now we will define a function that will replace nans with a constant value. Here we have shown an option that will deal directly with None values in the dataframe as well - not needed for this dataset but can be useful on others.

In [None]:
# Fill NaNs in the first dataset with -1

def fill_nan_const(df, val):
    """Fill NaN values in the dataframe with a constant value"""
    return df.replace(['None', np.nan], val) 
 

const_val = -1
df_const = fill_nan_const(df1, const_val)

In [None]:
df_const.isnull().sum()

In [None]:
def fill_nan_mode(df, columns):
    """
    Fills nans in specified columns with the mode of that column
    Note that we want to make sure to not modify the dataset we passed in but to
    return a new copy.
    We do that by making a copy and specifying deep=True.
    """
    new_df = df.copy(deep=True)
    for col in df.columns:
        if col in columns:
            new_df[col] = df[col].fillna(df[col].mode()[0])
    return new_df

In [None]:
nan_cols = ['collision_type','property_damage','police_report_available']
df_mode = fill_nan_mode(df1, nan_cols)

In [None]:
print(df_mode)

In [None]:
df_mode.isnull().sum()


In [None]:
def drop_correlated(data, thresh):
    import numpy as np

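    # calculate correlations between numeric columns only
    # (numeric_only needs pandas 1.5+; newer pandas raises on non-numeric columns without it)
    corr_matrix = data.corr(numeric_only=True).abs()
    # get the upper part of correlation matrix
    # (the np.bool alias was removed from NumPy, so use the builtin bool)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))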

    # columns with correlation above threshold
    redundant = [column for column in upper.columns if any(upper[column] >= thresh)]
    print(f"Columns to drop with correlation > {thresh}: {redundant}")
    new_data = data.drop(columns=redundant)
    return new_data

correlation_threshold = 0.9

df_const = drop_correlated(df_const, correlation_threshold)
df_mode = drop_correlated(df_mode, correlation_threshold)
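
The correlation filter above keeps only the upper triangle of the correlation matrix so that each pair of columns is considered once, and the later column of a highly correlated pair is the one dropped. A minimal sketch on made-up data (column names `a`, `b`, `c` are illustrative) shows the effect:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # exactly 2 * a, so perfectly correlated with a
    "c": [4.0, 1.0, 3.0, 2.0],  # only weakly correlated with a and b
})

corr = toy.corr().abs()

# True strictly above the diagonal, False elsewhere.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is redundant if it correlates strongly with an earlier column.
redundant = [col for col in upper.columns if (upper[col] >= 0.9).any()]

print(upper)
print(redundant)  # ['b']
```

Only `b` is flagged, so `a` survives and the information shared by the pair is kept once.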

## 4.3 Changing categorical data to numeric


As we still have the same columns for both df_const and df_mode we are only looking for the columns in one dataset.

If we had dropped different columns in each that would not be possible.

Also an important note is that these columns are only considered categorical for encoding purposes! This doesn't mean that no other columns represent categorical values, just that if there are, they have already been encoded to numerical values in some way. This may be important for other datasets and analyses.

In [None]:
# find columns with categorical data for both dataset
cat_cols = df_const.select_dtypes(include=['object']).columns.tolist()
cat_cols

Here we use the sklearn LabelEncoder to encode each of the selected columns passed in. We pass in only the selected columns to avoid re-encoding columns that are already numeric.

In [None]:
def encode_categorical(df, cat_cols):
    from sklearn.preprocessing import LabelEncoder

    new_df = df.copy(deep=True)
    for col in cat_cols:
        if col in df.columns:
            # cast to str so columns with mixed types encode cleanly
            le = LabelEncoder()
            new_df[col] = le.fit_transform(df[col].astype(str))
    return new_df


In [None]:
df_const = encode_categorical(df_const, cat_cols)
df_mode = encode_categorical(df_mode, cat_cols)
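
To see what the label encoding does to a single column, here is a small sketch on made-up category values (the strings are illustrative). `LabelEncoder` sorts the distinct values and assigns each an integer code, and the mapping can be reversed with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

values = ["Rear Collision", "Side Collision", "Front Collision", "Side Collision"]

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ holds the sorted distinct values; a value's code is its position there.
print(list(le.classes_))  # ['Front Collision', 'Rear Collision', 'Side Collision']
print(codes.tolist())     # [1, 2, 0, 2]

# The encoding is reversible as long as you keep the fitted encoder.
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Note that the integer codes follow alphabetical order and carry no real ranking. Tree-based models like the ones used later usually cope with this, but it is worth keeping in mind for other model types.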

### 4.3.1 Uploading processed datasets

Before getting to balancing the datasets we will upload them to the Seclea Platform.

- We define the metadata for the dataset. If there have been any changes since the original dataset we need to put them here; otherwise we can reuse the original metadata. In this case we have dropped some of the continuous feature columns so we need to redefine the metadata.

- We define the transformations that took place between the last state we uploaded and this dataset. This is a list of functions and their arguments. See docs.seclea.com for more details of the correct formatting.
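
One way to picture a list of functions and arguments is as a recipe that can be replayed step by step on the parent dataset. The sketch below is purely illustrative: `apply_transformations`, `double` and `add_const` are hypothetical helpers written for this example and are not part of the Seclea API, and the list format is assumed to be bare functions or `(function, [args])` tuples as used in this tutorial.

```python
import pandas as pd


def apply_transformations(df, transformations):
    """Hypothetical helper: replay each transformation in order."""
    result = df
    for step in transformations:
        if callable(step):
            # a bare function takes only the DataFrame
            func, args = step, []
        else:
            # a (function, [args]) tuple carries extra positional arguments
            func, args = step
        result = func(result, *args)
    return result


def double(df):
    return df * 2


def add_const(df, val):
    return df + val


toy = pd.DataFrame({"x": [1, 2, 3]})
replayed = apply_transformations(toy, [double, (add_const, [10])])

print(replayed["x"].tolist())  # [12, 14, 16]
```

Because each function returns a new DataFrame, the parent (`toy` here) is left untouched, which is what makes it possible to trace a dataset back through its history.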

In [None]:
# define the metadata
# NOTE even though we defined an index initially, because this dataset has been 
# augmented, the index has been dropped so now there is no specific index column.
processed_metadata = {"index": None, 
                  "outcome_name": "fraud_reported", 
                  "continuous_features": ["total_claim_amount",
                                          'policy_annual_premium',
                                          'capital-gains',
                                          'capital-loss',
                                          'injury_claim',
                                          'property_claim',
                                          'incident_hour_of_the_day',
                                          ]}

# here we need to define the transformations we applied to our original dataset
# to get it to this point.
# see the documentation for more details of the formatting this needs.

const_processed_transformations = [
    encode_nans,
    (drop_nulls, [null_thresh]),
    (fill_nan_const, [const_val]),
    (drop_correlated, [correlation_threshold]),
    (encode_categorical, [cat_cols]),
]

seclea.upload_dataset(dataset=df_const, 
                      dataset_name="Auto Insurance Fraud - Const Fill", 
                      metadata=processed_metadata, 
                      parent=data,  # the original uploaded dataset
                      transformations=const_processed_transformations)

- We need to do the same for the dataset that filled NaN values with the mode. We can reuse the metadata for the processed data as it is the same, but we need to change the transformations.

In [None]:
mode_processed_transformations = [
    encode_nans,
    (drop_nulls, [null_thresh]),
    (fill_nan_mode, [nan_cols]),
    (drop_correlated, [correlation_threshold]),
    (encode_categorical, [cat_cols]),
]

seclea.upload_dataset(dataset=df_mode, 
                      dataset_name="Auto Insurance Fraud - Mode Fill", 
                      metadata=processed_metadata, 
                      parent=data,  # the original uploaded dataset
                      transformations=mode_processed_transformations)

## 4.4 Balancing the dataset

In [None]:
# Check the class balance of the outcome column
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(2.5, 5))
plt.title("Fraud Transaction Distribution")
p1 = sns.countplot(x=df_const['fraud_reported'], palette='plasma')
for p in p1.patches:
    height = p.get_height()
    # label each bar with its share of all rows
    p1.text(p.get_x() + p.get_width() / 2.,
            height,
            f'{height / df_const.shape[0] * 100:.2f}%',
            ha='center', fontsize=12)


Most machine learning models perform best with balanced datasets and tolerate imbalance to varying degrees. Imbalance very often causes poor predictions, especially for minority class samples. This dataset is only mildly imbalanced; many fraud datasets have fraud cases making up as little as 0.1% of samples, and security datasets even fewer.

We would expect to achieve fairly good accuracy on this basis. However, we will explore the impact that balance can have on the accuracy of a model and see whether balancing is beneficial to us.
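
A quick way to see why accuracy alone can flatter a model on imbalanced data is to score a "model" that always predicts the majority class. The sketch below uses made-up labels with a 75/25 split (an assumption for illustration, not a measurement of this dataset).

```python
import numpy as np

# Hypothetical labels: 75 legitimate claims (0) and 25 fraudulent ones (1).
y = np.array([0] * 75 + [1] * 25)

# Always predict whichever class is most common.
majority = np.bincount(y).argmax()
preds = np.full_like(y, majority)

accuracy = (preds == y).mean()
# Share of actual fraud cases that were predicted as fraud.
fraud_recall = (preds[y == 1] == 1).mean()

print(accuracy)      # 0.75
print(fraud_recall)  # 0.0
```

The baseline reaches 75% accuracy while catching no fraud at all, so any accuracy figure should be read against the class balance.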



There are more than 10 techniques available for balancing datasets. Here are the three most commonly used:

1.   Random undersampling: removing samples from the majority class until it is the same size as the minority class. The main drawback is that it may remove important information from the dataset.

2.   Random oversampling: duplicating samples from the minority class when there is not enough minority data. This can cause overfitting and poor generalization.

3. SMOTE (Synthetic Minority Oversampling Technique): randomly picks a point from the minority class and computes its k nearest neighbours. Synthetic points are then added between the chosen point and its neighbours.

In this dataset we will use SMOTE for balancing, as random oversampling is prone to overfitting and undersampling would remove information that may be useful to our model.
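
To give a feel for the interpolation idea behind SMOTE, here is a simplified NumPy sketch. It is not the imbalanced-learn implementation used later in this tutorial; `smote_sketch` is a hypothetical function written for illustration, and it ignores details such as categorical features and choosing how many points to generate per class.

```python
import numpy as np


def smote_sketch(X_minority, n_new, k=2, seed=0):
    """Simplified SMOTE-style interpolation (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of each point.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))           # random minority point
        neighbour = neighbours[base, rng.integers(k)]  # one of its neighbours
        gap = rng.random()                             # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic


# Four minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(X_min, n_new=5)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority points, which is why SMOTE adds variety rather than exact duplicates.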



The following function has some particular restrictions in order to work well with the Seclea Platform. We need the function that processes the dataset to return the dataset in a complete form - that is with both features and labels as part of the same DataFrame. 

This will likely be changed in the future but for now you will need to split the data in the function into features and labels for the oversampling and then concat them back together after sampling to return all together from the function.

In [None]:
# define a balancing function

def smote_balance(df):
    from imblearn.over_sampling import SMOTE

    X1 = df.drop('fraud_reported', axis=1)
    y1 = df.fraud_reported

    sm = SMOTE(random_state=42)

    X_sm, y_sm = sm.fit_resample(X1, y1)

    print(f'''Shape of X before SMOTE: {X1.shape}
    Shape of X after SMOTE: {X_sm.shape}''')
    print(f'''Shape of y before SMOTE: {y1.shape}
    Shape of y after SMOTE: {y_sm.shape}''')
    return pd.concat([X_sm, y_sm], axis=1)


In [None]:
# Using Smote to balance the dataset 
df_const_smote = smote_balance(df_const)
df_mode_smote = smote_balance(df_mode)

In [None]:
df_const_smote.head(4)

In [None]:
df_mode_smote.head(4)

### 4.4.1 Upload Smote datasets

Here again we need to upload the transformed datasets.
It is easier here because we only have the one transformation to upload and the metadata we can reuse as this transformation didn't affect that.

This is in many ways the easiest way to keep track of your datasets and transformations as it saves keeping track of too many functions or variables at any one time.

Note that we include the processed datasets as the parent here, not the original dataset. That is because this dataset comes directly from the processed dataset with one transformation, not directly from the original data.

In [None]:
# here we need to define the transformation we applied to the processed datasets

smote_transformations = [
    smote_balance,
]

seclea.upload_dataset(dataset=df_const_smote, 
                      dataset_name="Auto Insurance Fraud - Const fill - Smote", 
                      metadata=processed_metadata, 
                      parent=df_const, 
                      transformations=smote_transformations)

seclea.upload_dataset(dataset=df_mode_smote, 
                      dataset_name="Auto Insurance Fraud - Mode Fill - Smote", 
                      metadata=processed_metadata, 
                      parent=df_mode, 
                      transformations=smote_transformations)

Now head to platform.seclea.com again to take another look at the Datasets section. You will see that there is a lot more to look at this time.

You can see here how the transformations are used to show you the history of the data and how it arrived in its final state.

## 4.6 Building Train and Test Datasets

Now that we have finished processing our data, and logged it in the Platform, we will define a function to split the data for input to our training and evaluation code.

In [None]:
# Splitting the dataset 

def get_test_train_splits(df, output_col, test_size, random_state):
    from sklearn.model_selection import train_test_split

    X = df.drop(output_col, axis=1)
    y = df[output_col]

    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=random_state)
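
The `stratify=y` argument in the split above keeps the class proportions the same in the train and test sets. A minimal sketch on made-up data (100 rows with 20% positive labels, chosen for illustration) shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: 20 fraud rows (1) and 80 non-fraud rows (0).
toy = pd.DataFrame({
    "feature": np.arange(100),
    "fraud_reported": [1] * 20 + [0] * 80,
})

X = toy.drop("fraud_reported", axis=1)
y = toy["fraud_reported"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep the 20% fraud share of the full dataset.
print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

Without stratification a small test set can end up with noticeably more or fewer minority samples than the full data, which makes test scores harder to compare.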


# 5. Modeling with Balancing Techniques

Now we get started with the modelling. We will run the same models over each of our datasets to explore how the different processing of the data has affected our results.

We will use three models from sklearn for this: the DecisionTree, RandomForest and GradientBoosting classifiers.

In [None]:
!pip install tabulate
from tabulate import tabulate


Here we are defining our classifiers in a dictionary, this will help to simplify training code later on.

In [None]:
# Modeling
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

classifiers = {
    "RandomForestClassifier": RandomForestClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "GradientBoostingClassifier": GradientBoostingClassifier()
}

Here we are training all of our classifiers on each of the datasets in turn so that we can easily and rapidly evaluate their performance.

Note that the use of functions really simplifies our lives here. We are doing the same thing for each classifier and dataset, so it is better to loop over each and save ourselves some typing than to repeat code.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from seclea_utils.get_model_manager import Frameworks

datasets = [("Const Fill", df_const), ("Mode Fill", df_mode), ("Const Fill Smote", df_const_smote), ("Mode Fill Smote", df_mode_smote)]

for name, dataset in datasets:
    X_train, X_test, y_train, y_test = get_test_train_splits(dataset, output_col="fraud_reported", test_size=0.2, random_state=42)

    for key, classifier in classifiers.items():
        # cross validate to get an idea of generalisation.
        training_score = cross_val_score(classifier, X_train, y_train, cv=5)
        # train on the full training set
        classifier.fit(X_train, y_train)
        # upload the fully trained model
        seclea.upload_training_run(classifier, Frameworks.SKLEARN, dataset=dataset)
        # test accuracy
        y_preds = classifier.predict(X_test)
        test_score = accuracy_score(y_test, y_preds)
        # the .1% format avoids floating point noise from round(...) * 100
        print(f"Classifier: {classifier.__class__.__name__} has a mean cross-validation score of {training_score.mean():.1%} accuracy on {name}")
        print(f"Classifier: {classifier.__class__.__name__} has a test score of {test_score:.1%} accuracy on {name}")

So now we can see the overall results but here is a perfect opportunity to head to the Platform to dig deeper into our results and the performance differences.

