# Kaggle Challenge - Learning Equality

https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The goal of this competition is to streamline the process of matching educational content to specific topics in a curriculum. You will develop an accurate and efficient model trained on a library of K-12 educational materials that have been organized into a variety of topic taxonomies. These materials are in diverse languages, and cover a wide range of topics, particularly in STEM (Science, Technology, Engineering, and Mathematics).

Your work will enable students and educators to more readily access relevant educational content to support and supplement learning.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~
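A minimal sketch of writing a file in this format with the standard library. The helper name `write_submission` and the toy `predictions` dict are assumptions for illustration; only the header and the space-delimited layout come from the description above.

```python
import csv
import io

def write_submission(predictions, handle):
    """Write {topic_id: [content_id, ...]} as a two-column CSV with a header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["topic_id", "content_ids"])
    for topic_id, content_ids in predictions.items():
        # content ids for one topic are joined with single spaces
        writer.writerow([topic_id, " ".join(content_ids)])

# Toy predictions reusing ids from the sample above
predictions = {
    "t_00069b63a70a": ["c_11a1dc0bfb99"],
    "t_00068291e9a4": ["c_639ea2ef9c95", "c_89ce9367be10"],
}
buffer = io.StringIO()
write_submission(predictions, buffer)
print(buffer.getvalue())
```

For a real submission, the `StringIO` buffer would be replaced by a file opened with `open(path, "w", newline="")`.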

## Scoring
Mean F2 score
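
As a rough sketch of the metric, assuming the F2 score is computed per topic on sets of content ids and then averaged over topics. The function names and toy ids below are hypothetical.

```python
def f2_score(true_ids, pred_ids):
    """F-beta with beta=2 for one topic, computed on sets of content ids."""
    true_ids, pred_ids = set(true_ids), set(pred_ids)
    tp = len(true_ids & pred_ids)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_ids)
    recall = tp / len(true_ids)
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 2
    return 5 * precision * recall / (4 * precision + recall)

def mean_f2(truth, predictions):
    """Average the per-topic F2 over every topic in `truth`."""
    scores = [f2_score(ids, predictions.get(topic, [])) for topic, ids in truth.items()]
    return sum(scores) / len(scores)

# Toy data: t_1 misses one true item, t_2 adds one wrong item
truth = {"t_1": ["c_a", "c_b"], "t_2": ["c_c"]}
predictions = {"t_1": ["c_a"], "t_2": ["c_c", "c_d"]}
print(round(mean_f2(truth, predictions), 4))
```

With beta = 2, recall weighs more than precision, so in this toy case the missed item on `t_1` costs more than the extra item on `t_2`.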

## Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{1}{ \text{Benchmark} - \max\text{F2} }\text{F2} + \frac{1}{32400}\text{RuntimeSeconds} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum F2 of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.
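
A small worked example of the formula, evaluated exactly as written. All numbers below are hypothetical and only illustrate how runtime moves the score.

```python
def efficiency_score(f2, benchmark, max_f2, runtime_seconds):
    """Evaluate the efficiency formula exactly as written above."""
    return f2 / (benchmark - max_f2) + runtime_seconds / 32400

# Hypothetical values, for illustration only
slow = efficiency_score(f2=0.60, benchmark=0.10, max_f2=0.70, runtime_seconds=16200)
fast = efficiency_score(f2=0.60, benchmark=0.10, max_f2=0.70, runtime_seconds=3240)
print(slow, fast)
```

Assuming the benchmark scores below the best submission, the first term is negative, so both a higher F2 and a shorter runtime lower (improve) the efficiency score. The constant 32400 seconds corresponds to 9 hours.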

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

# Data

## Imports

In [1]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=eed851c92276b2ce08c4645e49760107b1c0cb68b0d5b9e7f321f8866a7b5c65
  Stored in directory: /root/.cache/pip/wheels/bf/06/fb/d59c1e5bd1dac7f6cf61ec0036cc3a10ab8fecaa6b2c3d3ee9
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import fbeta_score, precision_score, recall_score, make_scorer
from sklearn.dummy import DummyClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Data Collection

## Helper functions and classes for data traversal (breadcrumbs = topic context)

In [3]:
# define some helper functions and classes to aid with data traversal

class ContentItem:
    """
    Class to handle content relations (breadcrumbs)
    """
    def __init__(self, content_id):
        self.id = content_id

    @property
    def topics(self):
        return [Topic(topic_id) for topic_id in df_topics.loc[df_corr[df_corr.content_ids.str.contains(self.id)].index].index]

    def get_all_breadcrumbs(self, separator=" >> ", include_root=True):
        breadcrumbs = []
        title = self.title
        for topic in self.topics:
            new_breadcrumb = topic.get_breadcrumbs(separator=separator, include_root=include_root)
            if new_breadcrumb:
                new_breadcrumb = new_breadcrumb + separator + title
            else:
                new_breadcrumb = title
            breadcrumbs.append(new_breadcrumb)
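        return breadcrumbs

    def __getattr__(self, name):
        # fall back to the content's row in 'df_content' (title, kind, ...)
        return df_content.loc[self.id][name]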

class Topic:
    """
    Class to handle topics relations (breadcrumbs)
    """
    def __init__(self, topic_id):
        self.id = topic_id

    @property
    def parent(self):
        parent_id = df_topics.loc[self.id].parent
        if pd.isna(parent_id):
            return None
        else:
            return Topic(parent_id)

    @property
    def ancestors(self):
        ancestors = []
        parent = self.parent
        while parent is not None:
            ancestors.append(parent)
            parent = parent.parent
        return ancestors

    @property
    def content(self):
        if self.id in df_corr.index:
            return [ContentItem(content_id) for content_id in df_corr.loc[self.id].content_ids.split()]
        else:
            return tuple([]) if self.has_content else []

    def get_breadcrumbs(self, separator=" >> ", include_self=True, include_root=True):
        ancestors = self.ancestors
        if include_self:
            ancestors = [self] + ancestors
        if not include_root:
            ancestors = ancestors[:-1]
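        return separator.join(reversed([a.title for a in ancestors]))

    def __getattr__(self, name):
        # fall back to the topic's row in 'df_topics' (title, has_content, ...)
        return df_topics.loc[self.id][name]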

## Load dataframes

In [4]:
#drive_path = '/content/drive/My Drive/Colab Notebooks/learning_equality/'
drive_path = '/kaggle/input/learning-equality-curriculum-recommendations/'
dataset_path = '/kaggle/input/learning-equality-files/'
work_path = '/kaggle/working/'

# load 'topics' data into pandas dataframe
df_topics = pd.read_csv(f'{drive_path}topics.csv', index_col=0).fillna({"title": "", "description": ""})
print ("\nLoaded 'df_topics'")


Loaded 'df_topics'


In [5]:
# load 'content' data into pandas dataframe
df_content = pd.read_csv(f'{drive_path}content.csv', index_col=0).fillna("")
print ("\nLoaded 'df_content'")


Loaded 'df_content'


In [6]:
# load 'correlations' data into pandas dataframe
df_corr = pd.read_csv(f'{drive_path}correlations.csv', index_col=0)
print ("\nLoaded 'df_corr'")


Loaded 'df_corr'


In [7]:
# load 'topic_breadcrumbs' data into pandas dataframe
df_topic_breadcrumbs = pd.read_csv(f'{dataset_path}topic_breadcrumbs.csv', index_col=0)
print ("\nLoaded 'df_topic_breadcrumbs'")


Loaded 'df_topic_breadcrumbs'


In [8]:
# Restructure 'df_corr' (correlations): explode the target column to be more practical to use - one topic -> one content
y = df_corr.copy()
y['content_ids'] = y.content_ids.str.split(' ')
y = y.explode('content_ids')
y.reset_index(inplace=True)
print ("\nCreated expanded 'y'")


Created expanded 'y'


## Create breadcrumbs for topics

In [9]:
#%%time
# - Create breadcrumbs for topics and save in csv file
#  these command lines ran for 1h20m without finishing:
#       df_X['topic_id'].apply (lambda topic: Topic(topic).get_breadcrumbs()))
#       df_X['content_ids'].apply (lambda content: ContentItem(content).get_all_breadcrumbs())
# with a for loop + list append it took 5m for the topics

#topic_breadcrumbs=[]
#topic_ids=[]
#for i, topic in enumerate(df_topics.reset_index()['id']):
#    topic_ids.append (topic)
#    topic_breadcrumbs.append (Topic(topic).get_breadcrumbs(separator=' '))
#    print (i, ' : ', topic, ' - ', topic_breadcrumbs[-1])

# Save 'topic_breadcrumbs'
#df_topic_breadcrumbs = pd.DataFrame (topic_breadcrumbs, columns=['breadcrumbs'], index=topic_ids)
#df_topic_breadcrumbs.to_csv(f'{drive_path}topic_breadcrumbs.csv')
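
One possible way to avoid the slow traversal is to resolve each topic once and reuse its parent's breadcrumb. This is only a sketch on plain dicts, not what the notebook ran; `build_breadcrumbs`, `parents` and `titles` are hypothetical names, and with the real data the dicts would have to be built from 'df_topics' with missing parents mapped to `None`.

```python
def build_breadcrumbs(parents, titles, separator=" >> "):
    """Return {topic_id: breadcrumb string} for every topic in `parents`."""
    cache = {}

    def resolve(topic_id):
        if topic_id not in cache:
            parent_id = parents[topic_id]
            if parent_id is None:
                # root topic: the breadcrumb is just its own title
                cache[topic_id] = titles[topic_id]
            else:
                # reuse the parent's breadcrumb instead of re-walking the tree
                cache[topic_id] = resolve(parent_id) + separator + titles[topic_id]
        return cache[topic_id]

    return {topic_id: resolve(topic_id) for topic_id in parents}

# Toy hierarchy (hypothetical ids and titles)
parents = {"t_root": None, "t_math": "t_root", "t_algebra": "t_math"}
titles = {"t_root": "Channel", "t_math": "Math", "t_algebra": "Algebra"}
breadcrumbs = build_breadcrumbs(parents, titles)
print(breadcrumbs["t_algebra"])
```

Each topic is looked up in a dict instead of going through `df_topics.loc` for every ancestor, which is where most of the time in the class-based version is likely spent.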

## Data Cleaning

## Data cleaning params

In [10]:
levels = {0: 'zero', 1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five', 6: 'six', 7: 'seven', 8: 'eight', 9: 'nine', 10: 'ten'}
langs_t = pd.DataFrame (df_topics['language'].unique(), columns=['lang']) 
langs_c = pd.DataFrame (df_content['language'].unique(), columns=['lang']) 
langs = langs_t.merge (langs_c, how='left')
# Output: ['bg', 'en', 'pt', 'gu', 'my', 'zh', 'ar', 'te', 'es', 'fr', 'sw', 'mr', 'hi', 'bn', 'fil', 'ru', 'it', 'or', 'pnb', 'km', 'as', 'kn','ur', 'pl', 'ta', 'swa', 'tr', 'mul']
topic_cols = ['title_x', 'description_x', 'topic_breadcrumbs']
content_cols = ['title_y', 'description_y', 'kind', 'text']
cat_for_val = 'aligned'
print ("\nLoaded parameters to clean data")


Loaded parameters to clean data


## Data cleaning functions

In [11]:
# Filter by language
def filter_lang(X, lang):
    """
    Filter dataframes by 'language'
    """
    return X[X.language_x == lang].copy()


# Clean up text
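def clean_text(text_col):
    """
    Clean punctuation and special chars from a dataframe column
    """
    # replace every non-word character (punctuation, \n, \t, ...) with a space
    text_col = text_col.str.replace(r'\W', ' ', regex=True)
    # '\W' keeps '_' (a word character); replace remaining punctuation literally,
    # since chars like '(' or '*' are not valid regex patterns on their own
    for punct in string.punctuation:
        text_col = text_col.str.replace(punct, ' ', regex=False)
    return text_col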
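
To see what this cleaning does to a single string, here is a small standard-library sketch. `clean_string` is a hypothetical helper, and the whitespace-collapsing step is an extra that 'clean_text' above does not perform.

```python
import re

def clean_string(text):
    """Replace non-word characters and underscores with spaces, then collapse whitespace."""
    text = re.sub(r"[\W_]", " ", text)
    # extra step (not in 'clean_text'): squeeze runs of spaces into one
    return re.sub(r"\s+", " ", text).strip()

print(clean_string("Fractions:\n\tadding & subtracting (part_1)"))
```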

## Features

*   Filter by category (do not train on 'aligned'; keep it for validation only)
*   Clean strings: punctuation and special chars (\n\t...)
*   Merge columns in 'df_topics' from 'topic_cols' into 1 sentence (Sentence1)
*   Merge columns in 'df_content' from 'content_cols' into 1 sentence (Sentence2)
*   Merge both datasets using y (correlations.csv)

By merging both sentence columns through the correlations, we get one pair of matching sentences (Sentence1, Sentence2) per row of y
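
The pairing idea can be sketched on toy data with plain dicts. The rows below are hypothetical stand-ins for topics.csv, content.csv and correlations.csv, and `build_pairs` is an illustrative name rather than part of the notebook.

```python
# Toy stand-ins for topics.csv, content.csv and correlations.csv (hypothetical rows)
topics = {"t_1": {"title": "Fractions", "description": "Adding fractions"}}
content = {
    "c_a": {"title": "Fraction video", "kind": "video"},
    "c_b": {"title": "Fraction quiz", "kind": "exercise"},
}
correlations = {"t_1": "c_a c_b"}

def build_pairs(topics, content, correlations):
    """Return one (sentence1, sentence2) tuple per topic-content match."""
    pairs = []
    for topic_id, content_ids in correlations.items():
        sentence1 = ".".join(topics[topic_id].values())
        # "explode" the space-delimited ids: one row per matched content item
        for content_id in content_ids.split():
            sentence2 = ".".join(content[content_id].values())
            pairs.append((sentence1, sentence2))
    return pairs

pairs = build_pairs(topics, content, correlations)
for pair in pairs:
    print(pair)
```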

In [12]:
# Merge 'df_topics' with 'topics_breadcrumbs' and 'df_content'
print ("\nMerging dataframes 'df_topics', 'df_contents' and 'y'...")
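features = df_topics.copy()[df_topics.category != cat_for_val].reset_index().rename(columns={'id': 'topic_id'})
# note: how='left' keeps every row of 'y', so pairs whose topic is in the 'aligned' category
# stay in 'features' (with NaN topic columns); how='inner' would drop them
features = y.merge (features, how='left', left_on='topic_id', right_on='topic_id')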
features = features.merge (df_content, how='left', left_on='content_ids', right_on='id')
features = features.merge (df_topic_breadcrumbs.reset_index(), how='left', left_on='topic_id', right_on='index')
#Cleaning tasks
print ("\nCleaning data...")
for col in topic_cols:
    features[col] = clean_text(features[col])
for col in content_cols:
    features[col] = clean_text(features[col])
#Create 'sentence1' column
print ("\nCreating sentences from 'df_topics'...")
features['sentence1'] = features[topic_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
#Create 'sentence2' column
print ("\nCreating sentences from 'df_contents'...")
features['sentence2'] = features[content_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
#Clean unnecessary columns
features = features[['sentence1', 'sentence2']]
print ("\nCreated 'features' to train model")


Merging dataframes 'df_topics', 'df_contents' and 'y'...

Cleaning data...

Creating sentences from 'df_topics'...

Creating sentences from 'df_contents'...

Created 'features' to train model


# Model

## Scoring

Scoring is based on the mean F2 score

In [13]:
def model_scoring (y_test, y_pred, average='weighted'):
    """
    Calculate precision, recall and f2-score for y_test and y_pred
    """
    precision = precision_score(y_test['content_ids'], y_pred['content_id'], average=average)
    print ('Precision:', precision)
    recall = recall_score(y_test['content_ids'], y_pred['content_id'], average=average)
    print ('Recall:', recall)
    F2macro = fbeta_score(y_test['content_ids'], y_pred['content_id'], beta=2, average='macro')
    print ('F2 macro:', F2macro)
    F2micro = fbeta_score(y_test['content_ids'], y_pred['content_id'], beta=2, average='micro')
    print ('F2 micro:', F2micro)
    return precision, recall, F2macro, F2micro

## Model params

In [14]:
backbone = 'all-MiniLM-L6-v2' #explore also msmarco models and multilanguage
dataset_limit = 10
batch_size = 64
shuffle = True
num_epochs = 2
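# note: warmup is counted in optimizer steps (batches), while this value is derived from the number of sentence pairs
warmup_steps = int(len(features) * num_epochs * 0.2)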
warmup_steps
print ("\nLoaded model training parameters...")


Loaded model training parameters...


# Sentence Transformer train

- Using 'all-MiniLM-L6-v2', which the competition discussions suggest is a strong choice; other models may be tried
- Using pairs of sentences (topic, content)
- Sentence1: topic title + topic description + topic breadcrumbs
- Sentence2: content title + content description + kind + content text
- Target: correlations (1 if it's a match, 0 if there's no match). Only matching pairs are passed to training; MultipleNegativesRankingLoss treats the other contents in the same batch as non-matches
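
To make the in-batch negatives idea concrete, here is a pure-Python sketch of the kind of loss MultipleNegativesRankingLoss computes: scaled cosine similarities between every topic and every content in the batch, followed by cross-entropy where the correct "class" for row i is content i. The function names, the toy 2-d vectors and the scale factor of 20 are assumptions for illustration; the library works on tensors and model outputs.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(topic_vecs, content_vecs, scale=20.0):
    """Mean cross-entropy where row i's correct 'class' is content i."""
    losses = []
    for i, topic_vec in enumerate(topic_vecs):
        # similarity of topic i to every content item in the batch
        scores = [scale * cosine(topic_vec, c) for c in content_vecs]
        # numerically stable log-sum-exp
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)

# Toy 2-d embeddings (hypothetical): pair i is (topic_vecs[i], content_vecs[i])
topic_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
print(in_batch_negatives_loss(topic_vecs, aligned))
print(in_batch_negatives_loss(topic_vecs, swapped))
```

The loss is close to zero when each topic is most similar to its own content and grows when another content in the batch is closer. Larger batches therefore give each pair more negatives to be ranked against.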

In [15]:
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

print (f"\nLoading model {backbone}...")
model = SentenceTransformer(backbone)
train_sentences = []
X_train = features
matches = len(X_train)

print ("Converting 'features' to proper format...")
for i in range(matches):
    train_sentences.append(InputExample(texts=[X_train.iloc[i, 0], X_train.iloc[i, 1]]))

train_dataloader = DataLoader(train_sentences, shuffle=shuffle, batch_size=batch_size)

#print ("Setting multi GPU process...")
#pool = model.start_multi_process_pool()

print ("\nDefining model loss function...")
train_loss = losses.MultipleNegativesRankingLoss(model=model)
print (f"\nAll set to train model")


Loading model all-MiniLM-L6-v2...


Converting 'features' to proper format...

Defining model loss function...

All set to train model


In [16]:
print (f"\nTraining model with {warmup_steps} sentences and {train_loss}...")
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps) 

print (f"\nSaving model ST-{backbone}-trained...")
model.save (f"{work_path}ST-{backbone}-trained")
print ("Model saved.")


Training model with 111967 sentences and MultipleNegativesRankingLoss(
  (model): SentenceTransformer(
    (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
    (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
    (2): Normalize()
  )
  (cross_entropy_loss): CrossEntropyLoss()
)...


Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/4374 [00:00<?, ?it/s]

Iteration:   0%|          | 0/4374 [00:00<?, ?it/s]


Saving model ST-all-MiniLM-L6-v2-trained...
Model saved.


## Sentence Transformer predictions

# Submissions

In [17]:
# load 'sample_submission' data into pandas dataframe
df_sample = pd.read_csv(f'{drive_path}sample_submission.csv')
df_sample

| | topic_id | content_ids |
|---|---|---|
| 0 | t_00004da3a1b2 | c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c... |
| 1 | t_00068291e9a4 | c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c... |
| 2 | t_00069b63a70a | c_11a1dc0bfb99 |
| 3 | t_0006d41a73a8 | c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c... |
| 4 | t_4054df11a74e | c_3695c5dc1df6 c_f2d184a98231 |
