# Kaggle Challenge - Learning Equality

https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The goal of this competition is to streamline the process of matching educational content to specific topics in a curriculum. You will develop an accurate and efficient model trained on a library of K-12 educational materials that have been organized into a variety of topic taxonomies. These materials are in diverse languages, and cover a wide range of topics, particularly in STEM (Science, Technology, Engineering, and Mathematics).

Your work will enable students and educators to more readily access relevant educational content to support and supplement learning.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~
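
A minimal sketch of how predictions could be serialized into this format, using only the standard library. The `predictions` dictionary and the helper name `to_submission_csv` are hypothetical; the ids are shortened from the sample above for illustration only.

```python
import csv
import io

# Hypothetical predictions: topic id -> list of recommended content ids
predictions = {
    "t_00004da3a1b2": ["c_1108dd0c7a5d", "c_376c5a8eb028"],
    "t_00069b63a70a": ["c_11a1dc0bfb99"],
}


def to_submission_csv(preds):
    """Serialize predictions as CSV: a header row, then one row per topic
    with its content ids joined by single spaces."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["topic_id", "content_ids"])
    for topic_id, content_ids in preds.items():
        writer.writerow([topic_id, " ".join(content_ids)])
    return buffer.getvalue()


submission_text = to_submission_csv(predictions)
print(submission_text)
```

In a Kaggle notebook the same text would typically be written to a file in the working directory rather than printed.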

## Scoring
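Submissions are scored by the mean F2 score: an F2 score is computed for each topic and then averaged. F2 weights recall more heavily than precision.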
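
A rough sketch of the metric, assuming the usual F-beta definition with beta = 2, computed per topic and then averaged. The function names, the toy ids, and the handling of topics with no correct prediction (scored as 0) are assumptions, not the official scoring code.

```python
def f2_score(true_ids, predicted_ids):
    """F2 for a single topic: recall counts four times as much as precision."""
    true_set, pred_set = set(true_ids), set(predicted_ids)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0  # assumption: no correct prediction scores zero
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 5 * precision * recall / (4 * precision + recall)


def mean_f2(truth, preds):
    """Average the per-topic F2 over every topic in the ground truth."""
    return sum(f2_score(truth[t], preds.get(t, [])) for t in truth) / len(truth)


# Toy example with made-up ids
truth = {"t_a": ["c_1", "c_2"], "t_b": ["c_3"]}
preds = {"t_a": ["c_1", "c_9"], "t_b": ["c_3"]}
score = mean_f2(truth, preds)
print(score)
```

Because recall dominates, predicting a few extra contents per topic usually costs less than missing a correct one.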

## Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{1}{ \text{Benchmark} - \max\text{F2} }\text{F2} + \frac{1}{32400}\text{RuntimeSeconds} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum F2 of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.
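
The formula can be checked with a few lines of Python. All numbers below are hypothetical. Since the benchmark scores below the best submission, the denominator `Benchmark - maxF2` is negative, so a higher F2 lowers (improves) the efficiency score, while runtime raises it; 32400 seconds is 9 hours.

```python
def efficiency_score(f2, benchmark_f2, max_f2, runtime_seconds):
    """Efficiency = F2 / (Benchmark - maxF2) + RuntimeSeconds / 32400.
    Lower is better. Assumes benchmark_f2 != max_f2."""
    return f2 / (benchmark_f2 - max_f2) + runtime_seconds / 32400


# Hypothetical values: same F2, different runtimes
score_fast = efficiency_score(f2=0.60, benchmark_f2=0.20, max_f2=0.70, runtime_seconds=3240)
score_slow = efficiency_score(f2=0.60, benchmark_f2=0.20, max_f2=0.70, runtime_seconds=32400)
print(score_fast, score_slow)
```

With equal F2, the faster submission gets the lower score and therefore ranks better.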

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

# Data

## Imports

In [None]:
!pip install -U sentence-transformers

In [None]:
import pandas as pd
import numpy as np
import string
from sentence_transformers import SentenceTransformer, util

# Data Collection

## Load dataframes

In [None]:
#drive_path = '/content/drive/My Drive/Colab Notebooks/learning_equality/'
drive_path = '/kaggle/input/learning-equality-curriculum-recommendations/'
dataset_path = '/kaggle/input/learning-equality-files/'
work_path = '/kaggle/working/'
model_path = '/kaggle/input/learning-equality-files/ST-all-MiniLM-L6-v2-trained'

# load 'topics' data into pandas dataframe
df_topics = pd.read_csv(f'{drive_path}topics.csv', index_col=0).fillna({"title": "", "description": ""})
print ("\nLoaded 'df_topics'")

In [None]:
# load 'content' data into pandas dataframe
df_content = pd.read_csv(f'{drive_path}content.csv', index_col=0).fillna("")
print ("\nLoaded 'df_content'")

In [None]:
# load 'correlations' data into pandas dataframe
df_corr = pd.read_csv(f'{drive_path}correlations.csv', index_col=0)
print ("\nLoaded 'df_corr'")

In [None]:
# load 'topic_breadcrumbs' data into pandas dataframe
df_topic_breadcrumbs = pd.read_csv(f'{dataset_path}topic_breadcrumbs.csv', index_col=0)
print ("\nLoaded 'df_topic_breadcrumbs'")

In [None]:
# Restructure 'df_corr' (correlations): explode the target column to be more practical to use - one topic -> one content
y = df_corr.copy()
y['content_ids'] = y.content_ids.str.split(' ')
y = y.explode('content_ids')
y.reset_index(inplace=True)
print ("\nCreated expanded 'y'")
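
To make the restructuring concrete, here is the same split / explode / reset_index sequence on a toy frame. The ids and the `toy_` names are made up; the index is assumed to be named `topic_id`, as it would be after loading correlations.csv with `index_col=0`.

```python
import pandas as pd

# Toy stand-in for correlations.csv (ids are made up)
toy_corr = pd.DataFrame(
    {"content_ids": ["c_1 c_2 c_3", "c_4"]},
    index=pd.Index(["t_a", "t_b"], name="topic_id"),
)

toy_y = toy_corr.copy()
# "c_1 c_2 c_3" -> ["c_1", "c_2", "c_3"]
toy_y["content_ids"] = toy_y["content_ids"].str.split(" ")
# one row per (topic, content) pair; the topic index is repeated, then turned into a column
toy_y = toy_y.explode("content_ids").reset_index()
print(toy_y)
```

The two input rows become four rows, one for each topic-content correlation.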

## Data Cleaning

## Data cleaning params

In [None]:
topic_cols = ['title_x', 'description_x', 'topic_breadcrumbs']
content_cols = ['title_y', 'description_y', 'kind', 'text']
cat_for_val = 'aligned'
print ("\nLoaded parameters to clean data")

## Data cleaning functions

In [None]:
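# Clean up text
def clean_text(text_col):
    """
    Replace punctuation and special chars in a dataframe column with spaces
    """
    # \W matches every non-word character (punctuation, \n, \t, ...)
    text_col = text_col.str.replace(r'\W', ' ', regex=True)
    # '_' counts as a word character, so remove leftover punctuation literally;
    # regex=False avoids errors on regex metacharacters such as '(' or '*'
    for punct in string.punctuation:
        text_col = text_col.str.replace(punct, ' ', regex=False)
    return text_col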
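
A small check of what the two cleaning passes do, using the standard `re` module in place of the pandas string methods (an assumption made to keep the example dependency-free). The sample string is made up.

```python
import re
import string

sample = "Intro_to (Algebra):\n\tpart-1"

# Pass 1: \W matches anything that is not a letter, digit or underscore
step1 = re.sub(r"\W", " ", sample)

# Pass 2: remaining punctuation (here only '_') is removed literally.
# Characters such as '(' are regex metacharacters, hence re.escape.
step2 = step1
for punct in string.punctuation:
    step2 = re.sub(re.escape(punct), " ", step2)

# An unescaped '(' is not a valid pattern
try:
    re.compile("(")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

print(repr(step1))
print(step2.split())
print(unescaped_ok)
```

The underscore survives the first pass because it counts as a word character, which is why a second, literal pass over `string.punctuation` is still useful.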

## Features

*   Filter by category (do not train on 'aligned'; keep it for validation only)
*   Clean strings: punctuation and special chars (\n\t...)
*   Merge columns in 'df_topics' from 'topic_cols' into 1 sentence (Sentence1)
*   Merge columns in 'df_content' from 'content_cols' into 1 sentence (Sentence2)
*   Merge both datasets using y (correlations.csv)

Merging the two sentence columns through the correlations yields one pair of matching sentences for each topic-content correlation kept from y.
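
The steps above can be sketched on toy data. Every frame, id and column value here is made up, and the column set is reduced for brevity; the inner join is an assumption chosen so that correlations of filtered-out topics are dropped.

```python
import pandas as pd

toy_topics = pd.DataFrame({
    "topic_id": ["t_a", "t_b"],
    "title": ["Fractions", "Cells"],
    "description": ["Adding fractions", ""],
    "category": ["source", "aligned"],
})
toy_content = pd.DataFrame({
    "content_id": ["c_1", "c_2", "c_3"],
    "title": ["Fraction basics", "Common denominators", "Cell structure"],
    "kind": ["video", "exercise", "document"],
})
toy_y = pd.DataFrame({
    "topic_id": ["t_a", "t_a", "t_b"],
    "content_ids": ["c_1", "c_2", "c_3"],
})

# Keep only topics that are not 'aligned', then attach them to the correlations
kept_topics = toy_topics[toy_topics["category"] != "aligned"]
pairs = toy_y.merge(kept_topics, on="topic_id", how="inner")
pairs = pairs.merge(toy_content, left_on="content_ids", right_on="content_id", how="left")

# The shared column name 'title' is suffixed: title_x (topic) and title_y (content)
pairs["sentence1"] = pairs[["title_x", "description"]].apply(lambda row: ".".join(row), axis=1)
pairs["sentence2"] = pairs[["title_y", "kind"]].apply(lambda row: ".".join(row), axis=1)
print(pairs[["sentence1", "sentence2"]])
```

Only the two correlations of the non-aligned topic remain, each turned into a (sentence1, sentence2) pair.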

In [None]:
# Merge 'y' with 'df_topics', 'df_content' and 'df_topic_breadcrumbs'
print ("\nMerging dataframes 'df_topics', 'df_content' and 'y'...")
features = df_topics.copy()[df_topics.category != cat_for_val].reset_index().rename(columns={'id': 'topic_id'})
# inner join: a left join would keep the correlations of the filtered-out 'aligned' topics
features = y.merge (features, how='inner', left_on='topic_id', right_on='topic_id')
features = features.merge (df_content, how='left', left_on='content_ids', right_on='id')
features = features.merge (df_topic_breadcrumbs.reset_index(), how='left', left_on='topic_id', right_on='index')
#Cleaning tasks
print ("\nCleaning data...")
for col in topic_cols:
    features[col] = clean_text(features[col])
for col in content_cols:
    features[col] = clean_text(features[col])
#Create 'sentence1' column
print ("\nCreating sentences from 'df_topics'...")
features['sentence1'] = features[topic_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
#Create 'sentence2' column
print ("\nCreating sentences from 'df_contents'...")
features['sentence2'] = features[content_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
#Clean unnecessary columns
features = features[['sentence1', 'sentence2']]
print ("\nCreated 'features' to train model")

# Model

## Model params

In [None]:
backbone = 'all-MiniLM-L6-v2' #explore also msmarco models and multilanguage
dataset_limit = 10
batch_size = 64
shuffle = True
num_epochs = 2
# warmup is counted in optimizer steps (batches), not sentences: 20% of all training steps
warmup_steps = int(len(features) / batch_size * num_epochs * 0.2)
print (f"\nLoaded model training parameters (warmup_steps={warmup_steps})...")
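
Warmup is measured in optimizer steps (one step per batch). As a sketch of what the warmup does, the function below mimics a linear warmup followed by a linear decay, which is understood to be the shape of the default 'WarmupLinear' scheduler of `model.fit`; the function name and the dataset size are assumptions.

```python
import math


def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step:
    rises linearly from 0 to 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


num_pairs = 6400  # hypothetical number of training pairs
batch_size = 64
num_epochs = 2

steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * 0.2)

for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, linear_warmup_factor(step, warmup_steps, total_steps))
```

If the warmup length exceeds the total number of steps, the learning rate never reaches its full value, which is why it should be derived from the number of batches.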

# Sentence Transformer train

- Using 'all-MiniLM-L6-v2' because the competition discussions suggest it performs best; other backbones may be tried
- Training on pairs of matching sentences taken from the correlations
- Sentence1: topic title + topic description + topic breadcrumbs
- Sentence2: content title + content description + kind + content text
- Target: no explicit label is passed. Every pair is a match (1); MultipleNegativesRankingLoss treats the other contents in the same batch as non-matches (0)
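
The training cell relies on `losses.MultipleNegativesRankingLoss`, which needs only matching pairs. Below is a NumPy sketch of the idea, not the library code: each topic embedding is scored against every content embedding in the batch, and cross-entropy pushes the matching content (the diagonal) to the top. The cosine similarity and the scale of 20 mirror what are understood to be the library defaults; the embeddings are made up.

```python
import numpy as np


def multiple_negatives_ranking_loss(topic_emb, content_emb, scale=20.0):
    """In-batch negatives: row i of topic_emb matches row i of content_emb,
    every other row of content_emb serves as a negative for topic i."""
    topic_emb = topic_emb / np.linalg.norm(topic_emb, axis=1, keepdims=True)
    content_emb = content_emb / np.linalg.norm(content_emb, axis=1, keepdims=True)
    scores = scale * topic_emb @ content_emb.T  # (batch, batch) scaled cosine similarities
    # numerically stable log-softmax over each row
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # the correct content for topic i sits on the diagonal
    return float(-np.mean(np.diag(log_probs)))


# Made-up embeddings: 4 topics whose matching contents are identical vectors
topics = np.eye(4)
loss_aligned = multiple_negatives_ranking_loss(topics, np.eye(4))
# Same contents, but paired with the wrong topics
loss_shuffled = multiple_negatives_ranking_loss(topics, np.roll(np.eye(4), 1, axis=0))
print(loss_aligned, loss_shuffled)
```

One consequence of this design is that larger batches provide more negatives per topic, and that two contents of the same topic landing in one batch act as false negatives.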

In [None]:
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

print (f"\nLoading model {backbone}...")
model = SentenceTransformer(backbone)
train_sentences = []
X_train = features
matches = len(X_train)

print ("Converting 'features' to proper format...")
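# iterate over the two columns directly; row-by-row .iloc lookups are slow on large frames
for sentence1, sentence2 in zip(X_train['sentence1'], X_train['sentence2']):
    train_sentences.append(InputExample(texts=[sentence1, sentence2]))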

train_dataloader = DataLoader(train_sentences, shuffle=shuffle, batch_size=batch_size)

#print ("Setting multi GPU process...")
#pool = model.start_multi_process_pool()

print ("\nDefining model loss function...")
train_loss = losses.MultipleNegativesRankingLoss(model=model)
print (f"\nAll set to train model")

In [None]:
print (f"\nTraining model on {matches} sentence pairs with {warmup_steps} warmup steps and {type(train_loss).__name__}...")
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps) 

print (f"\nSaving model ST-{backbone}-trained...")
model.save (f"{work_path}ST-{backbone}-trained")
print ("Model saved.")