# 🌟 Auto-document your work with Vectice - Quickstart

This Vectice Quickstart notebook illustrates how to use Vectice auto-documentation features in a realistic business scenario. We will follow a classic but simplified model training flow to quickly show how Vectice can help you automate your **Model Documentation**.

<div class="alert" style="color: #383d41; background-color: #e2e3e5; border-color: #d6d8db" role="alert">
<b> This is a quickstart project designed to showcase Vectice’s capabilities in automatically documenting notebooks. Vectice also supports more complex projects, which will be explored in upcoming tutorials.</b>
</div>

## Install the latest Vectice Python client library

In [None]:
%pip install -q vectice -U

## Imports

In [None]:
import numpy as np
import pandas as pd
import json

from sklearn.preprocessing import LabelEncoder

import os
#import warnings
#warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
#!wget https://vectice-examples.s3.us-west-1.amazonaws.com/Samples+Data/tutorial_data.csv -q --no-check-certificate

## 🌟 Get started by configuring the Vectice autolog

### Pre-requisites:
Before using this notebook you will need:
* Your API key: copy it from the Vectice instructions page and paste it into the cell below.
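
To keep the API key out of the notebook itself, one option is to read the settings from environment variables. The sketch below is only an illustration: the helper `load_autolog_settings` and the variable names `VECTICE_API_TOKEN`, `VECTICE_HOST`, `VECTICE_PHASE` and `VECTICE_PREFIX` are hypothetical, and the defaults are assumptions. Only the keyword names (`api_token`, `host`, `phase`, `prefix`) come from the `autolog.config` call used in this notebook.

```python
import os


def load_autolog_settings(env=None):
    """Collect autolog settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    required = ("VECTICE_API_TOKEN", "VECTICE_PHASE")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "api_token": env["VECTICE_API_TOKEN"],
        "host": env.get("VECTICE_HOST", "https://app.vectice.com"),  # assumed default
        "phase": env["VECTICE_PHASE"],
        "prefix": env.get("VECTICE_PREFIX", "UB"),  # assumed default
    }


# Exercise the helper with an explicit mapping instead of the real environment
settings = load_autolog_settings(
    {"VECTICE_API_TOKEN": "<your-api-key>", "VECTICE_PHASE": "<your-phase-id>"}
)
print(sorted(settings))
```

With Vectice installed and `autolog` imported, the resulting dictionary could then be passed on as `autolog.config(**settings)`.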

In [None]:
#### Paste your API key configuration below (the commented lines show the expected format)

# import vectice
# from vectice import autolog

# autolog.config(
#     api_token='YOUR_API_KEY',         # your API key to connect to Vectice
#     host='https://app.vectice.com',   # your host info
#     phase='PHA-11059',                # the phase in which you want to log your work
#     prefix='UB',                      # a prefix for your variable names
# )

<div class="alert" style="color: #383d41; background-color: #e2e3e5; border-color: #d6d8db" role="alert">
<b> Once configured, autolog automatically monitors and captures all assets from your notebook—such as models, datasets, and graphs—for seamless logging and documentation in Vectice.</b>
</div>

# Start Your Regular Data Science Work -> Not specific to Vectice

In this notebook, we will work on predicting the probability of loan default using a simplified yet complete data science workflow. Here's the plan:

- Dataset Loading: 
  - Load a dataset containing information about loan default applications.
  - Select a subset of the data.
- Data Preparation:
  - Perform small feature engineering tasks.
  - Apply scaling to the features.
- Model Building and Evaluation:
  - Build a logistic regression model to predict the probability of loan default.
  - Evaluate the results of the model.
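
The plan above can also be pictured as a single scikit-learn pipeline. This is a minimal sketch on synthetic data, not the notebook's actual code: the column names `income` and `credit` and all values are made up, and the notebook itself runs the steps separately so that each intermediate asset exists as a variable.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (hypothetical columns and values)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, 200),
    "credit": rng.normal(200_000, 50_000, 200),
})
X.loc[::10, "income"] = np.nan                  # introduce some missing values
y = (np.arange(200) % 5 == 0).astype(int)       # 20% positives, deterministic

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Median imputation -> standardization -> logistic regression
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(solver="liblinear", max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Probability of the positive class for each held-out row
probabilities = pipeline.predict_proba(X_test)[:, 1]
print(probabilities.shape)
```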



## Dataset loading
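The cell in this section reads the whole CSV and then keeps a subset of columns. An alternative, shown here only as a sketch, is to let `pd.read_csv` skip the unwanted columns with `usecols`. The in-memory CSV and the `UNUSED_COLUMN` name are hypothetical stand-ins for `tutorial_data.csv`.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (hypothetical values)
csv_text = """SK_ID_CURR,AMT_CREDIT,AMT_INCOME_TOTAL,UNUSED_COLUMN,TARGET
1,100000,50000,a,0
2,250000,60000,b,1
"""

wanted = ["SK_ID_CURR", "AMT_CREDIT", "AMT_INCOME_TOTAL", "TARGET"]

# usecols makes the parser keep only the listed columns
df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(df.columns.tolist())
```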

In [None]:
# For the baseline model, we only select the subset of columns that makes sense for the business (see selected_columns below)

from sklearn.model_selection import train_test_split

selected_columns = ['SK_ID_CURR','AMT_ANNUITY','NAME_CONTRACT_TYPE','AMT_CREDIT','AMT_INCOME_TOTAL','AMT_GOODS_PRICE','CNT_CHILDREN','CNT_FAM_MEMBERS',
           'DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH', 'EXT_SOURCE_2', 'EXT_SOURCE_3',"TARGET"]

# Load the data, keep the selected columns, and hold out 15% of the rows for testing
path_train = "./tutorial_data.csv"
application_cleaned_baseline = pd.read_csv(path_train)[selected_columns]
app_train_feat, app_test_feat = train_test_split(application_cleaned_baseline, test_size=0.15, random_state=42)




### Split the data
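The notebook uses a plain random split. When the positive class is rare, a stratified split is a common variation that keeps the class ratio the same in both parts. The sketch below shows that variation on a toy frame with made-up data; it is not what the notebook's own cells do.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 10% positive rate (hypothetical data)
df = pd.DataFrame({"feature": np.arange(100), "TARGET": [1] * 10 + [0] * 90})

# stratify keeps the share of TARGET == 1 equal in both parts
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["TARGET"]
)

# Separate the target from the held-out features, as the notebook does
test_target = test_df["TARGET"]
test_features = test_df.drop(columns="TARGET")
print(train_df["TARGET"].mean(), test_target.mean())
```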

In [None]:

# Separate the target variable from the testing set

target_variable = 'TARGET'
app_test_feat_target = app_test_feat[target_variable]
app_test_feat = app_test_feat.drop(target_variable, axis=1)

# Print the shapes of the resulting dataframes
print('Training data shape: ', app_train_feat.shape)
print('Testing shape: ', app_test_feat.shape)
print('Testing target shape: ', app_test_feat_target.shape)

In [None]:
app_train_feat

# Data preparation

### Feature engineering
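The two cells in this section apply the same six derived features, first to the training frame and then to the testing frame. One way to avoid the duplication is a small helper; `add_ratio_features` is a hypothetical name, and the sample row contains made-up values.

```python
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the engineered features added."""
    out = df.copy()
    out["CREDIT_INCOME_PERCENT"] = out["AMT_CREDIT"] / out["AMT_INCOME_TOTAL"]
    out["ANNUITY_INCOME_PERCENT"] = out["AMT_ANNUITY"] / out["AMT_INCOME_TOTAL"]
    out["CREDIT_TERM"] = out["AMT_ANNUITY"] / out["AMT_CREDIT"]
    out["DAYS_EMPLOYED_PERCENT"] = out["DAYS_EMPLOYED"] / out["DAYS_BIRTH"]
    out["NEW_SOURCES_PROD"] = out["EXT_SOURCE_2"] * out["EXT_SOURCE_3"]
    out["NEW_EXT_SOURCES_MEAN"] = out[["EXT_SOURCE_2", "EXT_SOURCE_3"]].mean(axis=1)
    return out


# One hypothetical applicant, just to exercise the helper
sample = pd.DataFrame({
    "AMT_CREDIT": [200000.0],
    "AMT_INCOME_TOTAL": [50000.0],
    "AMT_ANNUITY": [10000.0],
    "DAYS_EMPLOYED": [-1000],
    "DAYS_BIRTH": [-10000],
    "EXT_SOURCE_2": [0.5],
    "EXT_SOURCE_3": [0.3],
})
result = add_ratio_features(sample)
print(result.iloc[0][["CREDIT_INCOME_PERCENT", "CREDIT_TERM"]])
```

Because the helper works on a copy, the input frame is left unchanged, which also sidesteps pandas warnings about assigning to a slice.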

In [None]:
app_train_feat["CREDIT_INCOME_PERCENT"] = (
    app_train_feat["AMT_CREDIT"] / app_train_feat["AMT_INCOME_TOTAL"]
)
app_train_feat["ANNUITY_INCOME_PERCENT"] = (
    app_train_feat["AMT_ANNUITY"] / app_train_feat["AMT_INCOME_TOTAL"]
)
app_train_feat["CREDIT_TERM"] = (
    app_train_feat["AMT_ANNUITY"] / app_train_feat["AMT_CREDIT"]
)
app_train_feat["DAYS_EMPLOYED_PERCENT"] = (
    app_train_feat["DAYS_EMPLOYED"] / app_train_feat["DAYS_BIRTH"]
)
app_train_feat["NEW_SOURCES_PROD"] = (
    app_train_feat["EXT_SOURCE_2"] * app_train_feat["EXT_SOURCE_3"]
)
app_train_feat["NEW_EXT_SOURCES_MEAN"] = app_train_feat[
    ["EXT_SOURCE_2", "EXT_SOURCE_3"]
].mean(axis=1)

In [None]:
app_test_feat["CREDIT_INCOME_PERCENT"] = (
    app_test_feat["AMT_CREDIT"] / app_test_feat["AMT_INCOME_TOTAL"]
)
app_test_feat["ANNUITY_INCOME_PERCENT"] = (
    app_test_feat["AMT_ANNUITY"] / app_test_feat["AMT_INCOME_TOTAL"]
)
app_test_feat["CREDIT_TERM"] = (
    app_test_feat["AMT_ANNUITY"] / app_test_feat["AMT_CREDIT"]
)
app_test_feat["DAYS_EMPLOYED_PERCENT"] = (
    app_test_feat["DAYS_EMPLOYED"] / app_test_feat["DAYS_BIRTH"]
)
app_test_feat["NEW_SOURCES_PROD"] = (
    app_test_feat["EXT_SOURCE_2"] * app_test_feat["EXT_SOURCE_3"]
)
app_test_feat["NEW_EXT_SOURCES_MEAN"] = app_test_feat[
    ["EXT_SOURCE_2", "EXT_SOURCE_3"]
].mean(axis=1)

### One-Hot Encoding
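One-hot encoding the two frames separately can produce different columns when a category appears in only one of them, which is why the cell in this section calls `align`. A minimal sketch on toy frames (all values are made up):

```python
import pandas as pd

train = pd.DataFrame({
    "AMT_CREDIT": [1.0, 2.0, 3.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
})
test = pd.DataFrame({
    "AMT_CREDIT": [4.0, 5.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans"],
})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)  # has no 'Revolving loans' column

# An inner join on the column axis keeps only columns present in both frames
train_aligned, test_aligned = train_encoded.align(test_encoded, join="inner", axis=1)
print(train_encoded.shape, test_encoded.shape)
print(train_aligned.columns.tolist())
```

The inner join also drops any column that exists only in the training frame, which is why the notebook saves `TARGET` before aligning and adds it back afterwards.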

In [None]:
# one-hot encoding of categorical variables
app_train_feat = pd.get_dummies(app_train_feat)
app_test_feat = pd.get_dummies(app_test_feat)
train_labels = app_train_feat['TARGET']

app_train_feat, app_test_feat = app_train_feat.align(app_test_feat, join = 'inner', axis = 1)

app_train_feat['TARGET'] = train_labels


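# Flag the anomalous DAYS_EMPLOYED value 365243, then replace it with NaN.
# Assigning the result back avoids relying on inplace=True on a selected column.
app_train_feat['DAYS_EMPLOYED_ANOM'] = app_train_feat["DAYS_EMPLOYED"] == 365243
app_train_feat['DAYS_EMPLOYED'] = app_train_feat['DAYS_EMPLOYED'].replace({365243: np.nan})

app_test_feat['DAYS_EMPLOYED_ANOM'] = app_test_feat["DAYS_EMPLOYED"] == 365243
app_test_feat['DAYS_EMPLOYED'] = app_test_feat['DAYS_EMPLOYED'].replace({365243: np.nan})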
print('Training Features shape: ', app_train_feat.shape)
print('Testing Features shape: ', app_test_feat.shape)
print('There are %d anomalies in the test data out of %d entries' % (app_test_feat["DAYS_EMPLOYED_ANOM"].sum(), len(app_test_feat)))

In [None]:
def plot_feature_importances(features, feature_importance_values):

    df = pd.DataFrame({'feature': features, 'importance': feature_importance_values}).sort_values('importance', ascending = False).reset_index()
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()

    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))),
            df['importance_normalized'].head(15),
            align = 'center', edgecolor = 'k')

    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))

    # Plot labeling
    plt.xlabel('Normalized Importance')
    plt.title('Feature Importances')
    plt.tight_layout()
    plt.savefig('Feature Importance.png')
    plt.show()

## Scaling and missing data handling

### Define the feature list - drop target and fit data
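The cell in this section fits the imputer and the scaler on the training data only and reuses them on the testing data, detaching and reattaching `SK_ID_CURR` along the way. The sketch below shows the same fit-on-train idea in a slightly different arrangement: the ID is set as the index first and carried through by rebuilding each DataFrame with its original index. The toy values and the helper name `prepare` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy frames with SK_ID_CURR as the index (hypothetical values)
train = pd.DataFrame(
    {"SK_ID_CURR": [10, 11, 12, 13], "AMT_CREDIT": [1.0, np.nan, 3.0, 5.0]}
).set_index("SK_ID_CURR")
test = pd.DataFrame(
    {"SK_ID_CURR": [20, 21], "AMT_CREDIT": [np.nan, 4.0]}
).set_index("SK_ID_CURR")

# Both transformers learn their statistics from the training data only
imputer = SimpleImputer(strategy="median").fit(train)
scaler = StandardScaler().fit(imputer.transform(train))


def prepare(df):
    """Impute then scale, keeping the original columns and index."""
    values = scaler.transform(imputer.transform(df))
    return pd.DataFrame(values, columns=df.columns, index=df.index)


train_ready, test_ready = prepare(train), prepare(test)
print(test_ready)
```

The missing test value is filled with the training median, so it lands exactly on the training mean after scaling in this toy example.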

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Drop the target from the training data (errors='ignore' keeps this safe if the column is absent)
train_no_missing = app_train_feat.drop(columns=['TARGET'], errors='ignore')

# Separate 'SK_ID_CURR'
train_ids = train_no_missing['SK_ID_CURR']
test_ids = app_test_feat['SK_ID_CURR']

# Drop 'SK_ID_CURR' from the features list
train_no_missing = train_no_missing.drop(columns=['SK_ID_CURR'])
test_no_missing = app_test_feat.drop(columns=['SK_ID_CURR'])

# Define the features list without 'SK_ID_CURR'
features = list(train_no_missing.columns)

# Median imputation of missing values
imputer = SimpleImputer(strategy='median')
# Fit on the training data
imputer.fit(train_no_missing)

# Transform both training and testing data
train_no_missing = pd.DataFrame(imputer.transform(train_no_missing), columns=features)
test_no_missing = pd.DataFrame(imputer.transform(test_no_missing), columns=features)

# Standardize the features
scaler = StandardScaler()
# Fit on the training data
scaler.fit(train_no_missing)

# Transform both training and testing data
train_no_missing = pd.DataFrame(scaler.transform(train_no_missing), columns=features)
test_no_missing = pd.DataFrame(scaler.transform(test_no_missing), columns=features)

# Reattach 'SK_ID_CURR' to the DataFrames
train_no_missing['SK_ID_CURR'] = train_ids.values
test_no_missing['SK_ID_CURR'] = test_ids.values

# Set 'SK_ID_CURR' as the index
train_no_missing = train_no_missing.set_index('SK_ID_CURR')
test_no_missing = test_no_missing.set_index('SK_ID_CURR')

# Display the first few rows of the transformed training and testing data
print(train_no_missing.head())
print(test_no_missing.head())

# Model building and evaluation
### Logistic Regression Model
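Besides ROC AUC, the evaluation in this section reports recall and F1 when only the highest-scoring 25% of applications are flagged. The idea can be sketched as a standalone helper. `recall_at_top_fraction` is a hypothetical name, and it is a variant rather than a copy of the cell's logic: it flags exactly the top-ranked cases, whereas the cell thresholds on the probability found at that rank and can therefore flag slightly more.

```python
import numpy as np


def recall_at_top_fraction(y_true, scores, fraction):
    """Recall when only the top `fraction` of cases, ranked by score, are flagged."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_flagged = max(1, int(fraction * len(scores)))
    # Indices of the n_flagged highest scores
    top_indices = np.argsort(scores)[::-1][:n_flagged]
    flagged = np.zeros(len(scores), dtype=bool)
    flagged[top_indices] = True
    n_positives = max(1, int((y_true == 1).sum()))
    return float((y_true[flagged] == 1).sum()) / n_positives


# Hypothetical labels and scores: three positives among eight cases
y_true = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]

recall_25 = recall_at_top_fraction(y_true, scores, 0.25)
recall_50 = recall_at_top_fraction(y_true, scores, 0.50)
print(recall_25, recall_50)
```

Flagging the top quarter catches one of the three positives in this toy example, and flagging the top half catches two.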

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, recall_score, roc_curve, auc

# Define the logistic regression model and train it on the training data
logistic_regression = LogisticRegression(random_state=50, solver='liblinear', max_iter=1000)
features = list(train_no_missing.columns)
logistic_regression.fit(train_no_missing, train_labels)

# Make predictions on the test data
predictions = logistic_regression.predict_proba(test_no_missing)[:, 1]

# Evaluate the model
roc_auc = roc_auc_score(app_test_feat_target.values, predictions)

sorted_indices = np.argsort(predictions)[::-1]
sorted_labels = app_test_feat_target.iloc[sorted_indices]

desired_percentage = 0.25

threshold_index = int(desired_percentage * len(predictions))
threshold_probability = predictions[sorted_indices[threshold_index]]
binary_predictions = (predictions >= threshold_probability).astype(int)

# Calculate recall and F1 when the top desired_percentage of predictions is flagged
recall = recall_score(app_test_feat_target.values, binary_predictions)
f1 = f1_score(app_test_feat_target.values, binary_predictions)

# Collect the metrics (desired_percentage is a fraction, so format it as a percentage)
metric = {"auc": float(roc_auc),
          f"recall at {desired_percentage:.0%}": float(recall),
          f"f1_score at {desired_percentage:.0%}": float(f1)}

print("ROC AUC Score:", roc_auc)
print("F1 Score:", f1)
print("Recall Score:", recall)

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(app_test_feat_target.values, predictions)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.savefig("Performance_roc_curve.png")
plt.show()

# Create a DataFrame with predicted probabilities and true labels
df_results = pd.DataFrame({'Probability': predictions, 'Default': app_test_feat_target.values})

# Sort instances based on predicted probabilities
df_results = df_results.sort_values(by='Probability', ascending=False)

# Divide the sorted instances into quantiles (e.g., deciles)
num_quantiles = 10
df_results['Quantile'] = pd.qcut(df_results['Probability'], q=num_quantiles, labels=False, duplicates='drop')

# Calculate the percentage of defaults in each quantile
quantile_defaults = df_results.groupby('Quantile')['Default'].mean() * 100

# Plot the results
plt.figure(figsize=(10, 6))
plt.bar(quantile_defaults.index, quantile_defaults.values, color='blue', alpha=0.7)
plt.xlabel('Quantile of predicted probabilities')
plt.ylabel('Percentage of Defaults')
plt.title('Percentage of Defaults by Quantile of Predicted Probabilities')
plt.xticks(ticks=quantile_defaults.index, labels=[f'Q{i + 1}' for i in quantile_defaults.index])
plt.savefig("Performance_Percentage_of_Defaults_by_Quantile.png")
#plt.show()


### Feature importance

In [None]:
def plot_feature_importance(model, column_names):
    # Extract feature importance (coefficients) and their absolute values
    coefficients = model.coef_[0]
    feature_importance = np.abs(coefficients)

    # Create a DataFrame for easier visualization
    importance_df = pd.DataFrame({
        'Feature': column_names,
        'Importance': feature_importance
    }).sort_values(by='Importance', ascending=False)

    # Plot feature importance
    plt.figure(figsize=(10, 6))
    plt.barh(importance_df['Feature'], importance_df['Importance'], align='center')
    plt.gca().invert_yaxis()  # Reverse the order for better visualization
    plt.title('Feature Importance in Logistic Regression')
    plt.xlabel('Importance (Absolute Coefficient Value)')
    plt.ylabel('Feature')

    # Adjust layout to avoid cutting labels
    plt.tight_layout()
    plt.savefig('Feature Importance.png')
    plt.show()

In [None]:
plot_feature_importance(logistic_regression, train_no_missing.columns)

# 🌟 Once done with your regular data science work -> Auto document your entire notebook

In [None]:
autolog.generate_doc(note="Baseline model logistic regression", capture_schema_only=False)