# Kaggle Challenge - Learning Equality

https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The goal of this competition is to streamline the process of matching educational content to specific topics in a curriculum. You will develop an accurate and efficient model trained on a library of K-12 educational materials that have been organized into a variety of topic taxonomies. These materials are in diverse languages, and cover a wide range of topics, particularly in STEM (Science, Technology, Engineering, and Mathematics).

Your work will enable students and educators to more readily access relevant educational content to support and supplement learning.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~

## Scoring
Mean F2 score

## Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{1}{ \text{Benchmark} - \max\text{F2} }\text{F2} + \frac{1}{32400}\text{RuntimeSeconds} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum  of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

# How To Solve This Challenge

## Model Train

* Calculate embeddings: for topics (title, description) and contents (title, description and text), use SentenceTransformer. Split sentences by language.
* Use KNN model: train model with content embeddings and use topic embeddings to predict content matches, use k=10, 20, 30, 50
* Calculate F2 to choose best k.
* Submit predictions
* Set X_train, X_test, y_train, y_test: split by category (use 'aligned' for validation), set has_content = True for validation. Check dimensions
* Use all features for X, including topic_id and content_id, topic_title and content_title. Use the KNN predictions to "mount" X and y. Check dimensions.
* Use correlations to get y: if KNN topic-content match with correlations, y is 1, else is 0
* Use SVM RBF or KNN again for multiclass classification
* Use RandomizedSearch to get best hyperparameters. Use model score "recall". Split the dataset in languages for training: train model for each language individually
* Calculate F2 and see if we've got improvements from KNN.
* Submit predictions

## Submissions

* Use submission sample and get topics and contents
* Apply KNN model to get topic-content matches. Get contents per topic
* Filter "good matches" with second classification model
* Compare predicts with submission sample and calculate F2

## Efficiency

* Use KNN and 2nd model without GPU and check if it takes a long time to calculate sample submissions

# Data

## Imports

In [1]:
drive_path = '/kaggle/input/learning-equality-curriculum-recommendations/'
dataset_path = '/kaggle/input/learning-equality-files/'
work_path = '/kaggle/working/'
model_path = '/kaggle/input/learning-equality-files/ST-all-MiniLM-L6-v2-trained/ST-all-MiniLM-L6-v2-trained'

In [2]:
import sys 
sys.path.append("../input/learning-equality-files/sentence-transformers-2.2.2/sentence-transformers-2.2.2") 
import sentence_transformers

In [3]:
import pandas as pd
import numpy as np
import string
import torch
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import precision_score, recall_score, fbeta_score
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_validate

# Data Collection

## Load dataframes

In [4]:
# load 'topics' data into pandas dataframe
df_topics = pd.read_csv(f'{drive_path}topics.csv', index_col=0).fillna({"title": "", "description": ""})
print (f"\nLoaded 'df_topics'")
df_topics


Loaded 'df_topics'


Unnamed: 0_level_0,title,description,channel,category,level,language,parent,has_content
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True
...,...,...,...,...,...,...,...,...
t_fffb0bf2801d,4.3 Graph of functions,,e77b55,aligned,4,en,t_676e6a1a4dc7,False
t_fffbe1d5d43c,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True
t_fffe14f1be1e,Lección 7,,6e90a7,aligned,6,es,t_d448c707984d,True
t_fffe811a6da9,تحديد العلاقة بين الإحداثيّات القطبية والإحداث...,5b9e5ca86571f90499ea987f,9fd860,source,2,ar,t_5b4f3ba4eb7d,True


In [5]:
# load 'content' data into pandas dataframe
df_content = pd.read_csv(f'{drive_path}content.csv', index_col=0).fillna("")
print (f"\nLoaded 'df_content'")
df_content


Loaded 'df_content'


Unnamed: 0_level_0,title,description,kind,text,language,copyright_holder,license
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA
...,...,...,...,...,...,...,...
c_fffcbdd4de8b,2. 12: Diffusion,,html5,What will eventually happen to these dyes?\n\n...,en,CSU and Merlot,CC BY-NC-SA
c_fffe15a2d069,Sommare facendo gruppi da 10,Sal somma 5+68 spezzando il 5 in un 2 e un 3.\n\n,video,,it,,
c_fffed7b0d13a,Introdução à subtração,Sal fala sobre o que significa subtrair. Os ex...,video,,pt,,
c_ffff04ba7ac7,SA of a Cone,,video,,en,,


In [6]:
# load 'correlations' data into pandas dataframe
df_corr = pd.read_csv(f'{drive_path}correlations.csv', index_col=0)
print (f"\nLoaded 'df_corr'")
df_corr


Loaded 'df_corr'


Unnamed: 0_level_0,content_ids
topic_id,Unnamed: 1_level_1
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4
...,...
t_fff830472691,c_61fb63326e5d c_8f224e321c87
t_fff9e5407d13,c_026db653a269 c_0fb048a6412c c_20de77522603 c...
t_fffbe1d5d43c,c_46f852a49c08 c_6659207b25d5
t_fffe14f1be1e,c_cece166bad6a


In [7]:
# load 'topic_breadcrumbs' data into pandas dataframe
df_topic_breadcrumbs = pd.read_csv(f'{dataset_path}topic_breadcrumbs.csv', index_col=0)
print (f"\nLoaded 'df_topic_breadcrumbs'")
df_topic_breadcrumbs


Loaded 'df_topic_breadcrumbs'


Unnamed: 0,topic_breadcrumbs
t_00004da3a1b2,Khan Academy (български език) Наука Физика Отк...
t_000095e03056,Ghana JHS Curriculum (in progress) Junior High...
t_00068291e9a4,Khan Academy (Português (Brasil)) Matemática p...
t_00069b63a70a,MIT Blossoms Engineering Flow Charts: Logical ...
t_0006d41a73a8,Khan Academy (български език) Математика Алгеб...
...,...
t_fffb0bf2801d,Camara Education Ethiopia Secondary Science an...
t_fffbe1d5d43c,Khan Academy (Kiswahili) Hisabati Class 9 (Ind...
t_fffe14f1be1e,CREE Para el Estudiante I Ciclo 01 Primero Esp...
t_fffe811a6da9,Edraak (العربيّة) الهندسة وعلم المثلثات تحديد ...


In [8]:
# Restructure 'df_corr' (correlations): explode the target column to be more practical to use - one topic -> one content
y = df_corr.copy()
y['content_ids'] = y.content_ids.str.split(' ')
y = y.explode('content_ids')
y.reset_index(inplace=True)
print (f"\nLoaded 'df_corr'")
y


Loaded 'df_corr'


Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
...,...,...
279914,t_fff9e5407d13,c_d64037a72376
279915,t_fffbe1d5d43c,c_46f852a49c08
279916,t_fffbe1d5d43c,c_6659207b25d5
279917,t_fffe14f1be1e,c_cece166bad6a


## Data Cleaning

## Data cleaning params

In [9]:
levels = {1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five', 6: 'six', 7: 'seven', 8: 'eight',
          9: 'nine', 10: 'ten', 0: 'zero'}
langs = df_topics['language'].value_counts (ascending=True).reset_index()['index'].to_list()
topics_cols = ['title', 'description', 'topic_breadcrumbs']
content_cols = ['title', 'description', 'text']
cat_for_val = 'aligned'
print (f"\nLoaded cleaning parameters")


Loaded cleaning parameters


## Data cleaning functions

In [10]:
# Clean up text
def clean_text(text_col):
    """
    Clean ponctuation and special chars from a dataframe column
    """
    punctuations = string.punctuation
    text_col = text_col.str.replace('\W', ' ', regex=True)
    for punct in string.punctuation:
        text_col = text_col.str.replace(punct, ' ', regex=True)
    return text_col

## Topics

*   Changing 'level' values from integers to strings
*   Clean strings: ponctuation and special chars (\n\t...)
*   Sort values by language

In [11]:
def build_topic_features (topics_df, df_topics_breadcrumbs, levels, topic_cols):
    """
    Create 'topics_features' from df_topics, clean parameters and functions
    """
    print (f"\nCreating and cleaning topic features...")
    topics_features = topics_df.copy()
    topics_features = topics_features.replace ({'level': levels})
    topics_features = topics_features.merge (df_topic_breadcrumbs, how='outer', right_index=True, left_index=True)
    for col in topics_cols:
        topics_features[col] = clean_text(topics_features[col])
    topics_features['sentences'] = topics_features[topics_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
    topics_features = topics_features.drop(columns=['parent'] + topics_cols) 
    print (f"\nCreated 'topic_features'")
    display (topics_features.head())
    
    return topics_features


## Contents


*   Remove columns 'copyright_holder' and 'language' (for 'language' assume that topic-content match with correlations is same language)
*   Clean strings: ponctuation and special chars (\n\t...)

In [12]:
def build_content_features (content_df, content_cols):
    """
    Create 'content_features' from df_content, clean parameters and functions
    """
    print (f"\nCreating and cleaning content features...")
    content_features = content_df.copy()
    for col in content_cols:
        content_features[col] = clean_text(content_features[col])
    content_features['sentences'] =  content_features[content_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
    content_features = content_features.drop(columns=['copyright_holder'] + content_cols)
    print (f"\nCreated 'content_features'")
    display (content_features.head())
    
    return content_features

# Model training

### Models params

In [13]:
#General
langs_limit = len(langs)
# Verify device type (cpu or gpu)
device = 'cpu'
if torch.cuda.is_available():
    device = torch.device('cuda:0')
#Sentence Transformer
all_mini = 'all-MiniLM-L6-v2'
paraphrase = 'paraphrase-multilingual-MiniLM-L12-v2'
trained_model = '/kaggle/input/learning-equality-files/ST-all-MiniLM-L6-v2-trained/ST-all-MiniLM-L6-v2-trained'
# KNN
k = [5,10,20,30,50]
alg = 'kd_tree' #'ball_tree','brute'
leaf = 10
print (f"\nLoaded models parameters")


Loaded models parameters


### Scoring functions

In [14]:
def model_scoring (y_test, y_pred, average='macro'):
    """
    Calculate precision, recall and f2-score for y_test and y_pred
    """
    precision = precision_score(y_test['content_ids'], y_pred['content_ids'], average=average)
    print ('Precision:', precision)
    recall = recall_score(y_test['content_ids'], y_pred['content_ids'], average=average)
    print ('Recall:', recall)
    F2macro = fbeta_score(y_test['content_ids'], y_pred['content_ids'], beta=2, average='macro')
    print ('F2 macro:', F2macro)
    F2micro = fbeta_score(y_test['content_ids'], y_pred['content_ids'], beta=2, average='micro')
    print ('F2 micro:', F2micro)

In [15]:
def calculate_Fscore(prediction_df, actual_df):
    
    prediction_df['content_ids'] = prediction_df.content_ids.str.split(' ')
    prediction_df.columns=['topic_id', 'content_ids_pred']
    actual_df['content_ids'] = actual_df.content_ids.str.split(' ')
    actual_df.columns=['topic_id', 'content_ids_actual']
    df = pd.merge(prediction_df, actual_df, how='inner', on='topic_id')
    df['correct_pred'] = df[['content_ids_pred', 'content_ids_actual']].apply(lambda x: len([d for d in x[0] if d in x[1]]), axis=1)
    df['precision'] = df['correct_pred']/(df.content_ids_actual.str.len() + 1e-7)
    df['recall'] = df['correct_pred']/(df.content_ids_pred.str.len() + 1e-7)
    for beta in [0.5, 1, 2]:
        df['f'+str(beta)] = ((1 + beta**2) * df['precision'] * df['recall'])/((beta**2 * df['precision']) + df['recall'] + 1e-7) 
    
    return df

In [16]:
example_pred = [{ 'topic_id': 't1', 'content_ids': 'c1 c2 c3 c10 c12 c13 c14'}, { 'topic_id': 't2', 'content_ids': 'c1 c2 c3 c14'}]
example_actual = [{ 'topic_id': 't1', 'content_ids': 'c1 c2 c3 c4 c5'}, { 'topic_id': 't2', 'content_ids': 'c2 c4 c5'}]
calculate_Fscore(pd.DataFrame(example_pred), pd.DataFrame(example_actual))

Unnamed: 0,topic_id,content_ids_pred,content_ids_actual,correct_pred,precision,recall,f0.5,f1,f2
0,t1,"[c1, c2, c3, c10, c12, c13, c14]","[c1, c2, c3, c4, c5]",3,0.6,0.428571,0.555555,0.5,0.454545
1,t2,"[c1, c2, c3, c14]","[c2, c4, c5]",1,0.333333,0.25,0.3125,0.285714,0.263158


## Get embeddings

* Use output as pytorch tensor and normalize tensor embeddings
* Convert to numpy array

In [17]:
def get_embeddings(topics_features, content_features, langs, langs_limit):
    """
    Calculate embeddings for both topics and content
    From text in columns:
    topic_cols=['title','description', 'topic_breadcrumbs'] 
    content_cols=['title','description', 'text']
    """
    device = 'cpu'
    if torch.cuda.is_available():
        device = torch.device('cuda:0')

    encoder = SentenceTransformer(trained_model)
    topics_id_for_embeddings=[]
    topics_embeddings=[]
    content_ids_for_embeddings=[]
    content_embeddings=[]
    topics_aux = []
    content_aux = []

    for i, lang in enumerate(langs[:langs_limit]):
        print (f'\nGetting embeddings for {i+1}:{lang} from {langs_limit} languages')
        # topics embeddings
        topics_sentences = topics_features[topics_features.language == lang]['sentences']
        topics_aux = encoder.encode(topics_sentences, convert_to_tensor=True, show_progress_bar=True)
        topics_id_for_embeddings.append (topics_sentences.index)
        topics_embeddings.append (util.normalize_embeddings(topics_aux.to(device)).detach().cpu().numpy())
        # content embeddings
        content_sentences = content_features[content_features.language == lang]['sentences']
        if len(content_sentences) == 0:
            content_embeddings.append (np.array([0]).reshape(1,-1)) 
            content_ids_for_embeddings.append (content_sentences.index)
            continue
        else:
            content_ids_for_embeddings.append (content_sentences.index)
            content_aux = encoder.encode(content_sentences, convert_to_tensor=True, show_progress_bar=True)
            content_embeddings.append (util.normalize_embeddings(content_aux.to(device)).detach().cpu().numpy())
    
    return topics_embeddings, content_embeddings, topics_id_for_embeddings, content_ids_for_embeddings

## Apply Nearest Neighbours

* Use unsupervised NN. Use GPU when available, otherwise use sklearn NN

### topic-content predictions functions (df's)

In [18]:
def get_preds(list_aux, topics_id_for_embeddings):
    """
    df_preds = Dataframe with topic_id's and KNN chosen content_ids
    """
    df = pd.DataFrame(list_aux)
    df['topic_id'] = topics_id_for_embeddings
    df = df.explode(['content_ids','topic_id'])
    df.content_ids = df.content_ids.str.join(' ')
    df = df.iloc[:,[0,2,1]].reset_index().drop(columns=['language', 'index'])
    df = df.fillna('')
    return df

In [19]:
def get_true(df_corr, df):
    """
    Ground truth
    df_true = Dataframe with same topic_id's from 'get_preds' function but content_id's taken from correlations, 
    for score calculations,
    """
    df_true = df_corr.copy().reset_index().merge (df, how='right', on='topic_id')[['topic_id', 'content_ids_x']].rename(columns={'content_ids_x': 'content_ids'})
    df_true = df_true.fillna('')
    return df_true

### Calculate k neighbors for contents per topic and F2 score

In [20]:
%%time
def k_neighbors (k, topics_embeddings, content_embeddings, topics_id_for_embeddings, content_ids_for_embeddings, df_corr, langs, lang_lim):
    """
    Calculate k neighbors of contents:id's for each topic_id
    """
    device = 'cpu'
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    if device == 'cpu':
        from sklearn.neighbors import NearestNeighbors
    else:
        import cudf
        import cuml
        from cuml.neighbors import NearestNeighbors

    print ('\nSetting up KNN model...')
    # Model
    min_sample = min(len(content_embeddings[i]) for i in range(len(content_embeddings)))
    if k > min_sample:
        k = min_sample
    print ('\n K value:', k)
    model = NearestNeighbors(n_neighbors=k, 
                             n_jobs=-1)

    print ('\nFitting KNN model...')
    # Fit and Predictions
    list_aux=[]
    topics_count = 0
    for lang in range(lang_lim):
        print (f'\nGetting content_ids for topics in {lang+1}:{langs[lang]} from {lang_lim} languages')
        topics_count = topics_count + len(topics_embeddings[lang])
        if content_embeddings[lang].size != 1:
            nbrs = model.fit(content_embeddings[lang])
            _, indices = nbrs.kneighbors(topics_embeddings[lang])
            content_id_list = [content_ids_for_embeddings[1][val] for i, val in enumerate (indices)]
            list_aux.append (dict (language=langs[lang], 
                                   content_ids=content_id_list))
        else:
            list_aux.append (dict (language=langs[lang], content_ids=np.zeros(len(topics_id_for_embeddings[lang]))))

    print ('\nCalculating scores...')
    #Define dataframes for predictions and ground truth
    df_preds = get_preds(list_aux, topics_id_for_embeddings)
    df_true = get_true(df_corr, df_preds)
    fscore = calculate_Fscore(df_preds.copy(), df_true.copy())
    print ('\nCorrect predictions:', fscore.correct_pred.sum())
    print (fscore.f2.mean())
    #print ('\n\nF2 score by sklearn functions:')
    #model_scoring (df_preds.copy(), df_true.copy(), average='micro')
    print ('\n')
    
    return model, df_preds, df_true, fscore

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 10.5 µs


## K parameter vs F2 score

In [21]:
Encoder = paraphrase
Language_limit = 5
K0 = 5
F2_score0 = 0.14187983285626535
K1 = 10
F2_score1 = 0.14168308552036216
K2 = 20
F2_score2 = 0.13348882325195702
K3 = 30
F2_score3 = 0.1252348976238163
Language_limit = 20
K0 = 5
F2_score0 = 0.060680956642008434
k1 = 10
F2_score0 = 0.04720607798794654
k2 = 20
F2_score1 = 0.0348693757018806
k5 = 100
F2_score5 = 0.016168764671525976
Encoder = trained_model
Language_limit = 10
k1 = 2 #limited by the sample dataset
F2_score1 = 0.32051279617850204


# Submissions



In [22]:
# load 'sample_submission' data into pandas dataframe
df_sample_subs = pd.read_csv(f'{drive_path}sample_submission.csv', index_col=0)
df_sample_subs

Unnamed: 0_level_0,content_ids
topic_id,Unnamed: 1_level_1
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_4054df11a74e,c_3695c5dc1df6 c_f2d184a98231


In [23]:
# Get features data from existing df's
sample_subs = df_sample_subs.copy()
sample_subs['content_ids'] = sample_subs.content_ids.str.split(' ')
sample_subs = sample_subs.explode('content_ids')
df_subs_topics = df_topics.reset_index().merge (df_sample_subs.reset_index(), how='inner', left_on='id', right_on='topic_id').fillna('')
df_subs_topics = df_subs_topics.set_index('topic_id').drop(columns=['id']).fillna('')
df_subs_contents = df_content.reset_index().merge (sample_subs, how='inner', left_on='id', right_on='content_ids')
df_subs_contents = df_subs_contents.set_index('content_ids').drop(columns=['id']).fillna('')

#Getting parameters
langs = df_subs_topics['language'].value_counts (ascending=True).reset_index()['index'].to_list()
lang_limit = len(langs)
K = k[1]

print ('\n* Calculating topics_features dataframe...\n')
sub_topics_features = build_topic_features (df_subs_topics, df_topic_breadcrumbs, levels, topics_cols)

print ('\n* Calculating content_features dataframe...\n')
sub_contents_features = build_content_features (df_subs_contents, content_cols)

print ('\n* Calculating embeddings for both topics and contents text columns...\n')
sub_topics_embeds, sub_contents_embeds, topic_id_embeds, content_ids_embeds = get_embeddings(sub_topics_features, sub_contents_features, langs, lang_limit)

print ('\n* Calculating k neighnors predictions...\n')
model, preds, ground_truth, fscore = k_neighbors (K, sub_topics_embeds, sub_contents_embeds, topic_id_embeds, content_ids_embeds, df_corr, langs, lang_limit)

print ('\n All finished!\n')
print ('\n*** Predictions ***\n')
display(preds)
print ('\n*** Score ***\n')
display(fscore)


* Calculating topics_features dataframe...


Creating and cleaning topic features...

Created 'topic_features'


Unnamed: 0,channel,category,level,language,has_content,content_ids,sentences
t_00004da3a1b2,000cf7,source,four,bg,True,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...,Откриването на резисторите.Изследване на матер...
t_000095e03056,,,,,,,Ghana JHS Curriculum in progress Junior High...
t_00068291e9a4,8e286a,source,four,pt,True,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...,Entradas e saídas de uma função.Entenda um pou...
t_00069b63a70a,6e3ba4,source,three,en,True,c_11a1dc0bfb99,Transcripts..MIT Blossoms Engineering Flow Cha...
t_0006d41a73a8,000cf7,source,four,bg,True,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...,Графики на експоненциални функции Алгебра 2 н...



* Calculating content_features dataframe...


Creating and cleaning content features...

Created 'content_features'


Unnamed: 0_level_0,kind,language,license,sentences
content_ids,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
c_0c6473c3480d,video,bg,CC BY-NC-SA,Чертане на показателни финкции.Сал чертае y 2...
c_1108dd0c7a5d,video,bg,,Молив като резистор.Моливът причинява промяна ...
c_11a1dc0bfb99,document,en,CC BY-NC-SA,Flow Charts Logical Written Transcript of ...
c_1c57a1316568,video,bg,CC BY-NC-SA,Графики на показателни функции стар пример .С...
c_3695c5dc1df6,html5,en,CC BY-NC-SA,Additional Resources for Flow Charts Logical ...



* Calculating embeddings for both topics and contents text columns...


Getting embeddings for 1:pt from 3 languages


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Getting embeddings for 2:bg from 3 languages


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Getting embeddings for 3:en from 3 languages


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


* Calculating k neighnors predictions...


Setting up KNN model...

 K value: 2

Fitting KNN model...

Getting content_ids for topics in 1:pt from 3 languages

Getting content_ids for topics in 2:bg from 3 languages

Getting content_ids for topics in 3:en from 3 languages

Calculating scores...

Correct predictions: 4
0.32051279617850204



 All finished!


*** Predictions ***



Unnamed: 0,topic_id,content_ids
0,t_00068291e9a4,c_376c5a8eb028 c_0c6473c3480d
1,t_00004da3a1b2,c_1108dd0c7a5d c_5bc0e1e2cba0
2,t_0006d41a73a8,c_1c57a1316568 c_5e375cf14c47
3,t_00069b63a70a,c_1108dd0c7a5d c_0c6473c3480d
4,t_4054df11a74e,c_1108dd0c7a5d c_0c6473c3480d



*** Score ***



Unnamed: 0,topic_id,content_ids_pred,content_ids_actual,correct_pred,precision,recall,f0.5,f1,f2
0,t_00068291e9a4,"[c_376c5a8eb028, c_0c6473c3480d]","[c_639ea2ef9c95, c_89ce9367be10, c_ac1672cdcd2...",0,0.0,0.0,0.0,0.0,0.0
1,t_00004da3a1b2,"[c_1108dd0c7a5d, c_5bc0e1e2cba0]","[c_1108dd0c7a5d, c_376c5a8eb028, c_5bc0e1e2cba...",2,0.5,1.0,0.555555,0.666667,0.833333
2,t_0006d41a73a8,"[c_1c57a1316568, c_5e375cf14c47]","[c_0c6473c3480d, c_1c57a1316568, c_5e375cf14c4...",2,0.4,1.0,0.454545,0.571429,0.769231
3,t_00069b63a70a,"[c_1108dd0c7a5d, c_0c6473c3480d]",[c_11a1dc0bfb99],0,0.0,0.0,0.0,0.0,0.0
4,t_4054df11a74e,"[c_1108dd0c7a5d, c_0c6473c3480d]","[c_3695c5dc1df6, c_f2d184a98231]",0,0.0,0.0,0.0,0.0,0.0


In [24]:
import pickle
pickle.dump(model, open(f'ST-KNN-{k[1]}', 'wb'))
preds.to_csv('submission.csv', index=False)