<a href="https://colab.research.google.com/github/rajilsaj/FICOchallenge/blob/main/notebooks/Week_5_DataPreparation_forModelTraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FICO Educational Analytics Challenge © Fair Isaac 2025**

Copyright 2025 FICO licensed under CC BY-NC-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/

# Week 5: Data Preparation for Model Training

The purpose of this notebook is to prepare datasets for model training and evaluation by splitting the given input data into training, validation, and test subsets. Proper dataset splitting is a critical step in building robust models, as it ensures fair evaluation and helps prevent overfitting.

- **Train dataset:** Used to learn patterns from the data and update the model’s parameters during fine-tuning.

- **Validation dataset:** Used to evaluate performance while training, tune hyperparameters, and prevent overfitting.

- **Test dataset:** Used only after training to evaluate the model’s final performance and generalization ability on unseen data.

This notebook creates Train, Validation and Test datasets from the input conversations dataset with a 60:20:20 split. Feel free to experiment with different ratios.

### Expected File Structure

This notebook expects you to have the following file structure inside of **MyDrive**:

```
MyDrive
    └── FICO Analytic Challenge
        └── Data
            └── conversations.csv
```

After running this notebook, you will have three additional datasets in the Data folder:

```
MyDrive
    └── FICO Analytic Challenge
        └── Data
            └── conversations.csv
            └── conversations_train.csv
            └── conversations_test.csv
            └── conversations_validation.csv
```


## Import Libraries and Set up Folder Paths

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [4]:
import os
import sys
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive/', force_remount=True)

# Base path for your project
path = '/content/drive/MyDrive/FICO Analytic Challenge/'

# Folder that holds the dataset
data = 'Data'

# Path to the "Data" folder
data_path = os.path.join(path, data)

Mounted at /content/drive/


## Input Data

This section carries out the following operations:

1. Reads the input conversation dataset

2. Maps the intents to numerical labels and creates lookup dictionaries

3. Creates Train, Validation and Test datasets with 60:20:20 split

4. Creates a DatasetDict with Train, Validation and Test datasets to use in the training process

When training or fine-tuning a model, the data is typically divided into three subsets: train, validation, and test, as described at the top of this notebook.


Assign the name of your input dataset to the `conversations_dataset_name` variable.

In [5]:
# Name of the input dataset - update as needed
conversations_dataset_name = 'conversations.csv'

In [6]:
# Read the dataset, print the shape and sample records
conversations_data_path = os.path.join(data_path, conversations_dataset_name)
df = pd.read_csv(conversations_data_path)
print(df.shape)
df.head()

(225, 5)


Unnamed: 0,intent,scenario,conversation_text,sentiment,user_speech_type
0,FALLBACK,The chatbot reaches out about a delinquent per...,"\n\nBot: Hello, this is [Bank Name] customer s...",neutral,typos
1,FALLBACK,The chatbot reaches out about a delinquent per...,"\n\nBot: Good day, this is [Bank Name] custome...",neutral,professional
2,FALLBACK,The chatbot reaches out about a delinquent per...,"\n\nBot: Hi there, this is [Bank Name] reachin...",angry,casual
3,FALLBACK,The chatbot reaches out about a delinquent per...,"\n\nBot: Hi there, this is an automated call f...",confused,casual
4,FALLBACK,The chatbot reaches out about a delinquent per...,"\n\nBot: Hi there, this is a representative fr...",angry,typos


In [7]:
# Filter the conversation text and intent
df = df[['intent', 'conversation_text']]
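
Step 2 of the list above (mapping intents to numerical labels and creating lookup dictionaries) could be sketched roughly as follows. This is a minimal illustration, not the notebook's own implementation: the helper name `build_label_maps` and the intent names other than `FALLBACK` are assumptions made up for the example.

```python
# Hypothetical helper (not part of the original notebook):
# builds intent -> id and id -> intent lookup dictionaries.
def build_label_maps(intents):
    """Return (label2id, id2label) for the unique intents.

    Intents are sorted so the mapping is the same on every run.
    """
    unique_intents = sorted(set(intents))
    label2id = {intent: idx for idx, intent in enumerate(unique_intents)}
    id2label = {idx: intent for intent, idx in label2id.items()}
    return label2id, id2label


# Made-up intent values for illustration; only FALLBACK appears in the sample output above.
example_intents = ["FALLBACK", "PAYMENT", "FALLBACK", "DISPUTE"]
label2id, id2label = build_label_maps(example_intents)
print(label2id)
print(id2label)
```

In the notebook itself, you would likely pass `df["intent"]` to the helper and then add a numeric column with something like `df["intent"].map(label2id)`.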

<font color="Blue">The code below creates Train, Validation and Test datasets with a 60:20:20 split. Feel free to experiment with different ratios.</font>

In [8]:
# Split the dataset into train and test datasets in an 80:20 split
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df["intent"], random_state=42)

# Split the train dataset into train and validation datasets in a 75:25 split. Overall, train is 60% (75% of 80%) and validation is 20% (25% of 80%)
df_train, df_val = train_test_split(df_train, test_size=0.25, stratify=df_train["intent"], random_state=42)
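
If you experiment with other ratios, the second `test_size` has to be recomputed, because it is a fraction of the data remaining after the first split rather than of the full dataset. The sketch below shows that arithmetic; the helper name `two_stage_test_sizes` is hypothetical and not part of the notebook or of scikit-learn.

```python
import math


# Hypothetical helper: converts overall train/validation/test fractions
# into the two test_size values needed for a two-stage split.
def two_stage_test_sizes(train_frac, val_frac, test_frac):
    """Return (first_test_size, second_test_size).

    first_test_size  : fraction of the full data held out as the test set.
    second_test_size : fraction of the remaining data held out as validation.
    """
    if not math.isclose(train_frac + val_frac + test_frac, 1.0):
        raise ValueError("Fractions must sum to 1.0")
    first_test_size = test_frac
    second_test_size = val_frac / (train_frac + val_frac)
    return first_test_size, second_test_size


# The 60:20:20 split used in this notebook
first, second = two_stage_test_sizes(0.6, 0.2, 0.2)
print(round(first, 4), round(second, 4))
```

For a 70:15:15 split, for instance, the same helper gives a second-stage fraction of 0.15 / 0.85, which is roughly 0.18 rather than 0.15.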

In [10]:
# Save the datasets
df_train.to_csv(os.path.join(data_path, 'conversations_train.csv'), index=False)
df_val.to_csv(os.path.join(data_path, 'conversations_validation.csv'), index=False)
df_test.to_csv(os.path.join(data_path, 'conversations_test.csv'), index=False)
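
As an optional sanity check, you may want to confirm that stratification kept the intent proportions similar across the subsets. The following is a small sketch using only the standard library; the function names and the toy labels are assumptions for illustration.

```python
from collections import Counter


# Hypothetical helpers for comparing class balance between two label collections.
def class_proportions(labels):
    """Return {label: share of all labels} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_proportion_gap(full_labels, subset_labels):
    """Largest absolute difference in class share between the full data and a subset."""
    full = class_proportions(full_labels)
    subset = class_proportions(subset_labels)
    return max(abs(full[label] - subset.get(label, 0.0)) for label in full)


# Toy labels: the subset keeps the same 60/40 balance as the full data
full_labels = ["A"] * 6 + ["B"] * 4
subset_labels = ["A"] * 3 + ["B"] * 2
gap = max_proportion_gap(full_labels, subset_labels)
print(gap)
```

In this notebook the call would look something like `max_proportion_gap(df["intent"], df_train["intent"])`; a small gap suggests the subset preserves the overall intent distribution.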