<a href="https://colab.research.google.com/github/sofiammatias/learning-equality/blob/main/Learning_Equality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle Challenge - Learning Equality

https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The goal of this competition is to streamline the process of matching educational content to specific topics in a curriculum. You will develop an accurate and efficient model trained on a library of K-12 educational materials that have been organized into a variety of topic taxonomies. These materials are in diverse languages, and cover a wide range of topics, particularly in STEM (Science, Technology, Engineering, and Mathematics).

Your work will enable students and educators to more readily access relevant educational content to support and supplement learning.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~
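A minimal sketch of producing a file in this format, using only the standard library. The `write_submission` helper and the `predictions` dictionary are hypothetical and are not part of the notebook; the IDs are reused from the sample above.

```python
import csv
import io


def write_submission(predictions, out):
    """Write {topic_id: [content_id, ...]} as a two-column submission CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["topic_id", "content_ids"])
    for topic_id, content_ids in predictions.items():
        # Content IDs go into one field, separated by single spaces
        writer.writerow([topic_id, " ".join(content_ids)])


# Hypothetical predictions for two topics
predictions = {
    "t_00004da3a1b2": ["c_1108dd0c7a5d", "c_376c5a8eb028"],
    "t_00069b63a70a": ["c_11a1dc0bfb99"],
}

buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the `StringIO` buffer.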

## Scoring
Mean F2 score
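
As a rough illustration of the metric, the sketch below computes an F2 score per topic (F-beta with beta = 2, which weights recall more heavily than precision) and averages over topics. Averaging per topic is an assumption here, and `f2_score`, `mean_f2`, and the toy IDs are hypothetical names rather than competition code.

```python
def f2_score(true_ids, pred_ids):
    """F-beta score with beta = 2 for one topic's predicted content IDs."""
    true_ids, pred_ids = set(true_ids), set(pred_ids)
    tp = len(true_ids & pred_ids)  # correctly predicted content IDs
    if tp == 0:
        return 0.0
    precision = tp / len(pred_ids)
    recall = tp / len(true_ids)
    # (1 + beta^2) * P * R / (beta^2 * P + R) with beta = 2
    return 5 * precision * recall / (4 * precision + recall)


def mean_f2(truth, preds):
    """Average the per-topic F2 over every topic in the ground truth."""
    return sum(f2_score(truth[t], preds.get(t, [])) for t in truth) / len(truth)


# Toy data: topic t_a misses one item, topic t_b adds one wrong item
truth = {"t_a": ["c_1", "c_2"], "t_b": ["c_3"]}
preds = {"t_a": ["c_1"], "t_b": ["c_3", "c_4"]}
score = mean_f2(truth, preds)
print(round(score, 4))
```

In the toy data, the missed item (t_a) costs more than the extra wrong item (t_b), which is the recall bias that F2 is meant to express.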

## Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{1}{ \text{Benchmark} - \max\text{F2} }\text{F2} + \frac{1}{32400}\text{RuntimeSeconds} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum F2 of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.
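
The formula can be checked with a few lines of Python. This is only a sketch: the function name `efficiency_score` and all input values below are made up for illustration and are not real leaderboard numbers.

```python
def efficiency_score(f2, benchmark, max_f2, runtime_seconds):
    """Efficiency = F2 / (Benchmark - maxF2) + RuntimeSeconds / 32400."""
    return f2 / (benchmark - max_f2) + runtime_seconds / 32400


# Hypothetical values: the benchmark scores below the best submission,
# so (benchmark - max_f2) is negative and a higher F2 lowers the score.
score = efficiency_score(f2=0.6, benchmark=0.2, max_f2=0.7, runtime_seconds=3240)
print(round(score, 4))
```

The constant 32400 is nine hours expressed in seconds, so the runtime term reaches 1 for a submission that runs for nine hours.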

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

# Data

## Imports

In [1]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=d351a259ea9671f16c02b6fd95ab5304c3e5e1b9dce3ca232981f0ee8a1e2597
  Stored in directory: /root/.cache/pip/wheels/bf/06/fb/d59c1e5bd1dac7f6cf61ec0036cc3a10ab8fecaa6b2c3d3ee9
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2

In [14]:
import pandas as pd
import numpy as np
import string
import torch
from sentence_transformers import SentenceTransformer, util

# Data Collection

## Load dataframes

In [3]:
#drive_path = '/content/drive/My Drive/Colab Notebooks/learning_equality/'
drive_path = '/kaggle/input/learning-equality-curriculum-recommendations/'
dataset_path = '/kaggle/input/learning-equality-files/'
work_path = '/kaggle/working/'
model_path = '/kaggle/input/learning-equality-files/ST-all-MiniLM-L6-v2-trained/ST-all-MiniLM-L6-v2-trained'

# load 'topics' data into pandas dataframe
df_topics = pd.read_csv(f'{drive_path}topics.csv', index_col=0).fillna({"title": "", "description": ""})
print (f"\nLoaded 'df_topics'")
df_topics


Loaded 'df_topics'


id,title,description,channel,category,level,language,parent,has_content
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True
...,...,...,...,...,...,...,...,...
t_fffb0bf2801d,4.3 Graph of functions,,e77b55,aligned,4,en,t_676e6a1a4dc7,False
t_fffbe1d5d43c,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True
t_fffe14f1be1e,Lección 7,,6e90a7,aligned,6,es,t_d448c707984d,True
t_fffe811a6da9,تحديد العلاقة بين الإحداثيّات القطبية والإحداث...,5b9e5ca86571f90499ea987f,9fd860,source,2,ar,t_5b4f3ba4eb7d,True


In [4]:
# load 'content' data into pandas dataframe
df_content = pd.read_csv(f'{drive_path}content.csv', index_col=0).fillna("")
print (f"\nLoaded 'df_content'")
df_content


Loaded 'df_content'


id,title,description,kind,text,language,copyright_holder,license
c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA
...,...,...,...,...,...,...,...
c_fffcbdd4de8b,2. 12: Diffusion,,html5,What will eventually happen to these dyes?\n\n...,en,CSU and Merlot,CC BY-NC-SA
c_fffe15a2d069,Sommare facendo gruppi da 10,Sal somma 5+68 spezzando il 5 in un 2 e un 3.\n\n,video,,it,,
c_fffed7b0d13a,Introdução à subtração,Sal fala sobre o que significa subtrair. Os ex...,video,,pt,,
c_ffff04ba7ac7,SA of a Cone,,video,,en,,


In [5]:
# load 'correlations' data into pandas dataframe
df_corr = pd.read_csv(f'{drive_path}correlations.csv', index_col=0)
print (f"\nLoaded 'df_corr'")
df_corr


Loaded 'df_corr'


topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4
...,...
t_fff830472691,c_61fb63326e5d c_8f224e321c87
t_fff9e5407d13,c_026db653a269 c_0fb048a6412c c_20de77522603 c...
t_fffbe1d5d43c,c_46f852a49c08 c_6659207b25d5
t_fffe14f1be1e,c_cece166bad6a


In [6]:
# load 'topic_breadcrumbs' data into pandas dataframe
df_topic_breadcrumbs = pd.read_csv(f'{dataset_path}topic_breadcrumbs.csv', index_col=0)
print (f"\nLoaded 'df_topic_breadcrumbs'")
df_topic_breadcrumbs


Loaded 'df_topic_breadcrumbs'


Unnamed: 0,topic_breadcrumbs
t_00004da3a1b2,Khan Academy (български език) Наука Физика Отк...
t_000095e03056,Ghana JHS Curriculum (in progress) Junior High...
t_00068291e9a4,Khan Academy (Português (Brasil)) Matemática p...
t_00069b63a70a,MIT Blossoms Engineering Flow Charts: Logical ...
t_0006d41a73a8,Khan Academy (български език) Математика Алгеб...
...,...
t_fffb0bf2801d,Camara Education Ethiopia Secondary Science an...
t_fffbe1d5d43c,Khan Academy (Kiswahili) Hisabati Class 9 (Ind...
t_fffe14f1be1e,CREE Para el Estudiante I Ciclo 01 Primero Esp...
t_fffe811a6da9,Edraak (العربيّة) الهندسة وعلم المثلثات تحديد ...


In [7]:
# Restructure 'df_corr' (correlations): split the space-delimited target column and explode it into one row per (topic, content) pair
y = df_corr.copy()
y['content_ids'] = y.content_ids.str.split(' ')
y = y.explode('content_ids')
y.reset_index(inplace=True)
print (f"\nLoaded 'df_corr'")
y


Loaded 'df_corr'


Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
...,...,...
279914,t_fff9e5407d13,c_d64037a72376
279915,t_fffbe1d5d43c,c_46f852a49c08
279916,t_fffbe1d5d43c,c_6659207b25d5
279917,t_fffe14f1be1e,c_cece166bad6a


# Data Cleaning

## Data cleaning params

In [8]:
topic_cols = ['title', 'description', 'topic_breadcrumbs']
content_cols = ['title', 'description', 'kind', 'text']
cat_for_val = 'aligned'
print (f"\nLoaded cleaning parameters")


Loaded cleaning parameters


## Data cleaning functions

In [9]:
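# Clean up text
def clean_text(text_col):
    """
    Replace punctuation and special characters in a dataframe column with spaces.
    """
    # \W matches every non-word character (punctuation and whitespace such as \n, \t).
    # The underscore counts as a word character, so it is listed explicitly.
    # One raw-string regex also avoids per-character patterns such as '(',
    # which are not valid regular expressions on their own.
    return text_col.str.replace(r'[\W_]', ' ', regex=True)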
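
The net effect of `clean_text` on a single string can be sketched with the standard `re` module: every non-word character, plus the underscore, becomes a space, while accented and non-Latin letters are kept. `clean_string` is a hypothetical helper used only for this illustration; the notebook itself works on whole pandas columns.

```python
import re


def clean_string(text):
    """Replace every non-word character and underscore with a space."""
    return re.sub(r"[\W_]", " ", text)


# File-name style title taken from the content table
cleaned = clean_string("geometry-m3-topic-a-overview.pdf")
print(cleaned)

# Letters with diacritics are word characters and pass through unchanged
accented = clean_string("Entradas e saídas de uma função")
print(accented)
```

Each removed character is replaced one-for-one, so runs of punctuation or newlines turn into runs of spaces rather than a single space.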

## Topics

*   Filter by language (start with English, then repeat the whole process for each language, saving the model between languages)
*   Clean strings: punctuation and special characters (\n, \t, ...)



In [10]:
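print (f"\nCreating and cleaning topic features...")
# Exclude topics whose category equals cat_for_val ('aligned')
topics_features = df_topics[df_topics.category != cat_for_val].copy()
# Attach breadcrumbs by index. Note: how='outer' also keeps the breadcrumb rows of the
# topics excluded above; their other columns are NaN (see t_000095e03056 in the output)
topics_features = topics_features.merge (df_topic_breadcrumbs, how='outer', right_index=True, left_index=True)
for col in topic_cols:
    topics_features[col] = clean_text(topics_features[col])
display (topics_features.head())
print (f"\nCreating topic sentences...")
# Join the non-null text columns of each row into one sentence, separated by '.'
topics_sentences = topics_features[topic_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
print (f"\nCreated 'topic_features' and 'topics_sentences'")
topics_sentences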


Creating and cleaning topic features...


Unnamed: 0,title,description,channel,category,level,language,parent,has_content,topic_breadcrumbs
t_00004da3a1b2,Откриването на резисторите,Изследване на материали които предизвикват на...,000cf7,source,4.0,bg,t_16e29365b50d,True,Khan Academy български език Наука Физика Отк...
t_000095e03056,,,,,,,,,Ghana JHS Curriculum in progress Junior High...
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções,8e286a,source,4.0,pt,t_d14b6c2a2b70,True,Khan Academy Português Brasil Matemática p...
t_00069b63a70a,Transcripts,,6e3ba4,source,3.0,en,t_4054df11a74e,True,MIT Blossoms Engineering Flow Charts Logical ...
t_0006d41a73a8,Графики на експоненциални функции Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4.0,bg,t_e2452e21d252,True,Khan Academy български език Математика Алгеб...



Creating topic sentences...

Created 'topic_features' and 'topics_sentences'


t_00004da3a1b2    Откриването на резисторите.Изследване на матер...
t_000095e03056    Ghana JHS Curriculum  in progress  Junior High...
t_00068291e9a4    Entradas e saídas de uma função.Entenda um pou...
t_00069b63a70a    Transcripts..MIT Blossoms Engineering Flow Cha...
t_0006d41a73a8    Графики на експоненциални функции  Алгебра 2 н...
                                        ...                        
t_fffb0bf2801d    Camara Education Ethiopia Secondary Science an...
t_fffbe1d5d43c    Inscribed shapes problem solving.Use propertie...
t_fffe14f1be1e    CREE Para el Estudiante I Ciclo 01 Primero Esp...
t_fffe811a6da9    تحديد العلاقة بين الإحداثي ات القطبية والإحداث...
t_fffe88835149    Kolibri Library for Kenya Form 1 Physics 8 0 R...
Length: 76972, dtype: object
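
Some sentences above contain a doubled separator, for example `Transcripts..MIT Blossoms ...`. Missing titles and descriptions were filled with empty strings when the data was loaded, and `dropna()` removes only nulls, so an empty description still contributes a separator. The plain-Python sketch below reproduces the join on one row and shows a hypothetical variant that skips empty fields; `join_fields` and the shortened breadcrumb value are illustrative and are not part of the notebook.

```python
def join_fields(fields, skip_empty=False):
    """Join the text fields of one row with '.' as the separator."""
    # Mirror dropna(): only None (null) values are removed by default
    parts = [str(field) for field in fields if field is not None]
    if skip_empty:
        # Optional variant: also drop empty or whitespace-only strings
        parts = [part for part in parts if part.strip()]
    return ".".join(parts)


# title, description (empty), breadcrumbs (shortened for illustration)
row = ["Transcripts", "", "MIT Blossoms Engineering"]

as_in_notebook = join_fields(row)
without_empty = join_fields(row, skip_empty=True)
print(as_in_notebook)
print(without_empty)
```

Whether the doubled separator matters for the embedding quality is not something this notebook measures, so the variant is only an option to consider.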

## Contents


*   Filter by language (start with English, then repeat the whole process for each language, saving the model between languages)
*   Clean strings: punctuation and special characters (\n, \t, ...)



In [11]:
print (f"\nCreating and cleaning content features...")
# 'copyright_holder' is not used as a feature, so drop it
content_features = df_content.drop(columns=['copyright_holder']).copy()
for col in content_cols:
    content_features[col] = clean_text(content_features[col])
display (content_features.head())
print (f"\nCreating and cleaning content features...")
# Join the text columns of each row into one sentence, separated by '.'
content_sentences = content_features[content_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
print (f"\nCreated 'content_features' and 'content_sentences'")
content_sentences


Creating and cleaning content features...


id,title,description,kind,text,language,license
c_00002381196d,Sumar números de varios dígitos 48 029 233 930,Suma 48 029 233 930 mediante el algoritmo está...,video,,es,
c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120,video,,it,
c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda,video,,es,
c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,Nado de aproximação Saber nadar nas ondas se...,pt,CC BY-NC-ND
c_00016694ea2a,geometry m3 topic a overview pdf,geometry m3 topic a overview pdf,document,Estándares Comunes del Estado de Nueva York P...,es,CC BY-NC-SA



Creating and cleaning content features...

Created 'content_features' and 'content_sentences'


id
c_00002381196d    Sumar números de varios dígitos  48 029 233 93...
c_000087304a9e    Trovare i fattori di un numero.Sal trova i fat...
c_0000ad142ddb    Sumar curvas de demanda.Cómo añadir curvas de ...
c_0000c03adc8d    Nado de aproximação.Neste vídeo você vai apren...
c_00016694ea2a    geometry m3 topic a overview pdf.geometry m3 t...
                                        ...                        
c_fffcbdd4de8b    2  12  Diffusion..html5.What will eventually h...
c_fffe15a2d069    Sommare facendo gruppi da 10.Sal somma 5 68 sp...
c_fffed7b0d13a    Introdução à subtração.Sal fala sobre o que si...
c_ffff04ba7ac7                                 SA of a Cone..video.
c_ffffe5254266                                     The Jats..video.
Length: 154047, dtype: object

# Get Embeddings for Sentence Transformer

## Model params

In [12]:
dataset_limit = 10
print (f"\nLoaded model parameters")


Loaded model parameters


## Get embeddings from loaded model

*   Load the model trained in Learning-Equality-ST-train_SM
*   Create the embeddings as PyTorch tensors
*   Save the tensors to `.pt` files with `torch.save`


In [26]:
# Creating and saving embeddings
print (f"\nLoading model...")
model = SentenceTransformer(model_path)
print (f"\nCreating 'topics_sentences' embeddings...")
# Pass plain lists of strings so encoding does not depend on the Series index
topics_embeds = model.encode(topics_sentences.tolist(), convert_to_tensor=True)
print (f"\nCreating 'content_sentences' embeddings...")
content_embeds = model.encode(content_sentences.tolist(), convert_to_tensor=True)
print (f"\nPreparing to save embeddings...")
# Save under work_path so the download links created with FileLink resolve
torch.save (topics_embeds, f'{work_path}topics_embeds.pt')
torch.save (content_embeds, f'{work_path}content_embeds.pt')
print (f"\nSaved 'topics_embeds.pt' and 'content_embeds.pt'")


Loading model...

Creating 'topics_sentences' embeddings...


Batches:   0%|          | 0/2406 [00:00<?, ?it/s]


Creating 'content_sentences' embeddings...


Batches:   0%|          | 0/4814 [00:00<?, ?it/s]


Preparing to save embeddings...

Saved 'topics_embeds.pt' and 'content_embeds.pt'


In [23]:
from IPython.display import FileLink

display(FileLink(f'{work_path}topics_embeds.pt'))
display(FileLink(f'{work_path}content_embeds.pt'))