<a href="https://www.kaggle.com/code/sofiamatias/learning-equality-model2-predicts-sm?scriptVersionId=118133220" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Kaggle Challenge - Learning Equality

https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The goal of this competition is to streamline the process of matching educational content to specific topics in a curriculum. You will develop an accurate and efficient model trained on a library of K-12 educational materials that have been organized into a variety of topic taxonomies. These materials are in diverse languages, and cover a wide range of topics, particularly in STEM (Science, Technology, Engineering, and Mathematics).

Your work will enable students and educators to more readily access relevant educational content to support and supplement learning.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~

## Scoring
Mean F2 score

## Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{1}{ \text{Benchmark} - \max\text{F2} }\text{F2} + \frac{1}{32400}\text{RuntimeSeconds} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum  of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

# How To Solve This Challenge

## Model Train

* Calculate embeddings: for topics (title, description) and contents (title, description and text), use SentenceTransformer. Split sentences by language.
* Use KNN model: train model with content embeddings and use topic embeddings to predict content matches, use k=10, 20, 30, 50
* Calculate F2 to choose best k.
* Submit predictions
* Set X_train, X_test, y_train, y_test: split by category (use 'aligned' for validation), set has_content = True for validation. Check dimensions
* Use all features for X, including topic_id and content_id, topic_title and content_title. Use the KNN predictions to "mount" X and y. Check dimensions.
* Use correlations to get y: if KNN topic-content match with correlations, y is 1, else is 0
* Use SVM RBF , KNN or another classifier for binary classification
* Use RandomizedSearch to get best hyperparameters. Use model score "recall". Split the dataset in languages for training: train model for each language individually
* Calculate F2 and see if we've got improvements from KNN.
* Submit predictions

## Submissions

* Use submission sample and get topics and contents
* Apply KNN model to get topic-content matches. Get contents per topic
* Filter "good matches" with second classification model
* Compare predicts with submission sample and calculate F2

## Efficiency

* Use KNN and 2nd model without GPU and check if it takes a long time to calculate sample submissions

# Data

## Imports

In [1]:
import sys, os
sys.path.append("../input/sentence-transformer-package/sentence-transformers-2.2.2/sentence-transformers-2.2.2") 
import sentence_transformers

In [2]:
!pip install -q /kaggle/input/loguru-lib-ds/loguru-0.5.3-py3-none-any.whl

[0m

In [3]:
import pandas as pd
import numpy as np
import string
import datetime
import torch
from pathlib import Path
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import precision_score, recall_score, fbeta_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_validate, train_test_split

In [4]:
DATA_PATH = Path.cwd().parent.resolve()
drive_path = DATA_PATH / 'input'
files_path = DATA_PATH / 'input/learning-equality-curriculum-recommendations'
dataset_path = drive_path / 'learningequalityfiles'
work_path = DATA_PATH / 'working'
model_path = '/kaggle/input/sentence-transformer-package/ST-all-MiniLM-L6-v2-trained/ST-all-MiniLM-L6-v2-trained'

# Data Collection

## Load dataframes

In [5]:
def read_file_in_chunks(filename):
        
    list_chunks = []
    chunksize = 10000
    with pd.read_csv(filename, chunksize=chunksize, index_col=0) as reader:
        for chunk in reader:
            list_chunks.append(chunk)
    df = pd.concat(list_chunks)
    return df

In [6]:
# loading dataframes from existing .csv files

print (f"\nLoading dataframes...")

df_content = pd.DataFrame([])
df_topics = pd.DataFrame([])
df_corr = pd.DataFrame([])
df_sample = pd.DataFrame([])
for path, subdirs, files in os.walk(drive_path):
    for file in files:
        if file.endswith(".csv") == True:
            filepath = os.path.join(path, file)
            df = read_file_in_chunks(filepath)
            if set(['copyright_holder', 'kind']).issubset(df.columns) and df_content.empty:
                df_content = df.fillna({"title": "", "description": ""})
                display(df_content.head())
                print (f"\nLoaded 'df_content' from {filepath} with shape {df_content.shape}")
            elif set(['channel', 'parent']).issubset(df.columns):
                df_topics = df.fillna({"title": "", "description": ""})
                display(df_topics.head())
                print (f"\nLoaded 'df_topics' from {filepath} with shape {df_topics.shape}")
            elif ('content_ids' in df.columns) and ('correlation' in file) and (df_corr.empty):
                df_corr = df
                display(df_corr.head())
                print (f"\nLoaded 'df_corr' from {filepath}  with shape {df_corr.shape}")
            if ('sample' in file):
                df_sample = df
                display(df_sample)
                print (f"\nLoaded 'df_sample' from {filepath}")


Loading dataframes...


Unnamed: 0_level_0,content_ids
topic_id,Unnamed: 1_level_1
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_4054df11a74e,c_3695c5dc1df6 c_f2d184a98231



Loaded 'df_sample' from /kaggle/input/learning-equality-curriculum-recommendations/sample_submission.csv


Unnamed: 0_level_0,title,description,channel,category,level,language,parent,has_content
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True



Loaded 'df_topics' from /kaggle/input/learning-equality-curriculum-recommendations/topics.csv with shape (76972, 8)


Unnamed: 0_level_0,content_ids
topic_id,Unnamed: 1_level_1
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4



Loaded 'df_corr' from /kaggle/input/learning-equality-curriculum-recommendations/correlations.csv  with shape (61517, 1)


Unnamed: 0_level_0,title,description,kind,text,language,copyright_holder,license
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA



Loaded 'df_content' from /kaggle/input/learning-equality-curriculum-recommendations/content.csv with shape (154047, 7)


In [7]:
#Sampling 'df_topics' and 'df_content' with sample_submission
if ~df_sample.empty:
    df_topics = df_topics.merge(df_sample, how='inner', left_index=True, right_index=True)
    print (f"\nFiltered 'df_topics' according to 'sample_submission'")
    display (df_topics)

#Sampling with limited number of rows
#num_samples = 20000
#df_topics = df_topics.sample(n=num_samples//2)
#df_content = df_content.sample(n=num_samples)
#print (f"\nSampled {num_samples//2} rows from 'df_topics' for model training")
#display (df_content)
display (df_topics)

if df_content.empty and df_topics.empty:
    print ('Error: there are no topics or content files to work with')
    quit()


Filtered 'df_topics' according to 'sample_submission'


Unnamed: 0,title,description,channel,category,level,language,parent,has_content,content_ids
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True,c_11a1dc0bfb99
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_4054df11a74e,Flow Charts: Logical Thinking?,This lesson is focused on flow charts. It supp...,6e3ba4,source,2,en,t_acbbd893e6af,True,c_3695c5dc1df6 c_f2d184a98231


Unnamed: 0,title,description,channel,category,level,language,parent,has_content,content_ids
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True,c_11a1dc0bfb99
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_4054df11a74e,Flow Charts: Logical Thinking?,This lesson is focused on flow charts. It supp...,6e3ba4,source,2,en,t_acbbd893e6af,True,c_3695c5dc1df6 c_f2d184a98231


In [8]:
# load 'correlations' data into pandas dataframe
# Restructure 'df_corr' (correlations): explode the target column to be more practical to use - one topic -> one content
y = df_corr.copy()
y['content_ids'] = y.content_ids.str.split(' ')
y = y.explode('content_ids')
y.reset_index(inplace=True)
print (f"\nChanged 'df_corr' to exploded 'y'")
display(y)


Changed 'df_corr' to exploded 'y'


Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
...,...,...
279914,t_fff9e5407d13,c_d64037a72376
279915,t_fffbe1d5d43c,c_46f852a49c08
279916,t_fffbe1d5d43c,c_6659207b25d5
279917,t_fffe14f1be1e,c_cece166bad6a


## Create Topic Breadcrumbs Functions

In [9]:
# define some helper functions and classes to aid with data traversal

def get_topic_breadcrumbs (df_topic):
    print (f"Creating breadcrumbs...")
    topic_breadcrumbs=[]
    topic_ids=[]
    if 'topic_id' == df_topics.index.name:
        col_name = 'topic_id'
    else:
        col_name = 'id'
    for i, topic in enumerate(df_topic.index):
        if (i % 5000 == 0) and (i > 0):
            print (f'Created {i} breadcrumbs...')
        if df_topic.loc[topic].parent in df_topic.index:
            topic_ids.append (topic)
            topic_breadcrumbs.append (Topic(topic).get_breadcrumbs(separator=' '))
    print (f"Creating dataframe from breadcrumbs 'df_topics_breadcrumbs'...")
    df_topics_breadcrumbs = pd.DataFrame (topic_breadcrumbs, 
                                          index = topic_ids,
                                          columns = ['topic_breadcrumbs'])
    return df_topics_breadcrumbs


def print_markdown(md):
    display(Markdown(md))

class Topic:
    def __init__(self, topic_id):
        self.id = topic_id

    @property
    def parent(self):
        parent_id = df_topics.loc[self.id].parent
        if pd.isna(parent_id):
            return None
        else:
            return Topic(parent_id)

    @property
    def ancestors(self):
        ancestors = []
        parent = self.parent
        while (parent is not None) and (parent.id in df_topics.index):
            ancestors.append(parent)
            parent = parent.parent
        return ancestors

    @property
    def siblings(self):
        if not self.parent:
            return []
        else:
            return [topic for topic in self.parent.children if topic != self]

    @property
    def content(self):
        if self.id in df_corr.index:
            return [ContentItem(content_id) for content_id in df_corr.loc[self.id].content_ids.split()]
        else:
            return tuple([]) if self.has_content else []

    def get_breadcrumbs(self, separator=" >> ", include_self=True, include_root=True):
        ancestors = self.ancestors
        if include_self:
            ancestors = [self] + ancestors
        if not include_root:
            ancestors = ancestors[:-1]
        return separator.join(reversed([a.title for a in ancestors]))

    @property
    def children(self):
        return [Topic(child_id) for child_id in df_topics[df_topics.parent == self.id].index]

    def subtree_markdown(self, depth=0):
        markdown = "  " * depth + "- " + self.title + "\n"
        for child in self.children:
            markdown += child.subtree_markdown(depth=depth + 1)
        for content in self.content:
            markdown += ("  " * (depth + 1) + "- " + "[" + content.kind.title() + "] " + content.title) + "\n"
        return markdown

    def __eq__(self, other):
        if not isinstance(other, Topic):
            return False
        return self.id == other.id

    def __getattr__(self, name):
        return df_topics.loc[self.id][name]

    def __str__(self):
        return self.title
    
    def __repr__(self):
        return f"<Topic(id={self.id}, title=\"{self.title}\")>"


class ContentItem:
    def __init__(self, content_id):
        self.id = content_id

    @property
    def topics(self):
        return [Topic(topic_id) for topic_id in df_topics.loc[df_corr[df_corr.content_ids.str.contains(self.id)].index].index]

    def __getattr__(self, name):
        return content_df.loc[self.id][name]

    def __str__(self):
        return self.title
    
    def __repr__(self):
        return f"<ContentItem(id={self.id}, title=\"{self.title}\")>"

    def __eq__(self, other):
        if not isinstance(other, ContentItem):
            return False
        return self.id == other.id

    def get_all_breadcrumbs(self, separator=" >> ", include_root=True):
        breadcrumbs = []
        for topic in self.topics:
            new_breadcrumb = topic.get_breadcrumbs(separator=separator, include_root=include_root)
            if new_breadcrumb:
                new_breadcrumb = new_breadcrumb + separator + self.title
            else:
                new_breadcrumb = self.title
            breadcrumbs.append(new_breadcrumb)
        return breadcrumbs

## Data Cleaning

## Data cleaning params

In [10]:
levels = {1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five', 6: 'six', 7: 'seven', 8: 'eight',
          9: 'nine', 10: 'ten', 0: 'zero'}
topics_cols = ['title', 'description', 'topic_breadcrumbs']
content_cols = ['title', 'description', 'text']
cat_for_val = 'aligned'
print (f"\nLoaded cleaning parameters")


Loaded cleaning parameters


## Data cleaning functions

In [11]:
# Clean up text
def clean_text(text_col):
    """
    Clean ponctuation and special chars from a dataframe column
    """
    punctuations = string.punctuation
    text_col = text_col.str.replace('\W', ' ', regex=True)
    for punct in string.punctuation:
        text_col = text_col.str.replace(punct, ' ', regex=True)
    return text_col

### Topics

*   Changing 'level' values from integers to strings
*   Clean strings: ponctuation and special chars (\n\t...)
*   Sort values by language

In [12]:
def build_topic_features (topics_df, df_topics_breadcrumbs, levels, topic_cols):
    """
    Create 'topics_features' from df_topics, clean parameters and functions
    """
    print (f"\nCreating and cleaning topic features...")
    topics_features = topics_df.copy()
    topics_features = topics_features.replace ({'level': levels})
    topics_features = topics_features.merge (df_topic_breadcrumbs, how='outer', right_index=True, left_index=True)
    for col in topics_cols:
        topics_features[col] = clean_text(topics_features[col])
    topics_features['sentences'] = topics_features[topics_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
    topics_features = topics_features.drop(columns=['parent'] + topics_cols) 
    print (f"\nCreated 'topic_features'")
    display (topics_features.head())
    
    return topics_features


### Contents


*   Remove columns 'copyright_holder' and 'language' (for 'language' assume that topic-content match with correlations is same language)
*   Clean strings: ponctuation and special chars (\n\t...)

In [13]:
def build_content_features (content_df, content_cols):
    """
    Create 'content_features' from df_content, clean parameters and functions
    """
    print (f"\nCreating and cleaning content features...")
    content_features = content_df.copy()
    for col in content_cols:
        content_features[col] = clean_text(content_features[col])
    content_features['sentences'] =  content_features[content_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
    content_features = content_features.drop(columns=['copyright_holder', 'license'] + content_cols)
    print (f"\nCreated 'content_features'")
    display (content_features.head())
    
    return content_features

# Model Train

### Scoring functions

In [14]:
def calculate_Fscore(pred_df, act_df):
    
    if pred_df.empty or act_df.empty:
        print ('\nOne or both dataframes are empty. Abort F2score calculation.')
        return None
    prediction_df=pred_df.copy()
    actual_df = act_df.copy()
    prediction_df['content_ids'] = prediction_df.content_ids.str.split(' ')
    prediction_df.columns=['topic_id', 'content_ids_pred']
    actual_df['content_ids'] = actual_df.content_ids.str.split(' ')
    actual_df.columns=['topic_id', 'content_ids_actual']
    df = pd.merge(prediction_df, actual_df, how='inner', on='topic_id')
    df['correct_pred'] = df[['content_ids_pred', 'content_ids_actual']].apply(lambda x: len([d for d in x[0] if d in x[1]]), axis=1)
    df_target = df.apply(lambda x: [(x[0], d) for d in x[1] if d in x[2]], axis=1).tolist()
    df['precision'] = df['correct_pred']/(df.content_ids_actual.str.len() + 1e-7)
    df['recall'] = df['correct_pred']/(df.content_ids_pred.str.len() + 1e-7)
    for beta in [0.5, 1, 2]:
        df['f'+str(beta)] = ((1 + beta**2) * df['precision'] * df['recall'])/((beta**2 * df['precision']) + df['recall'] + 1e-7) 
    print ('\nF2score calculation finished.')

    return df, df_target

## Calculate Embeddings Functions

* Use output as pytorch tensor and normalize tensor embeddings
* Convert to numpy array

In [15]:
def get_embeddings(topics_features, content_features):
    """
    Calculate embeddings for both topics and content
    From text in columns:
    topic_cols=['title','description', 'topic_breadcrumbs'] 
    content_cols=['title','description', 'text']
    """
    device = 'cpu'
    if torch.cuda.is_available():
        device = torch.device('cuda:0')

    encoder = SentenceTransformer(paraphrase)
    topics_id_for_embeddings=[]
    topics_embeddings=[]
    content_ids_for_embeddings=[]
    content_embeddings=[]
    topics_aux = []
    content_aux = []

    topics_sentences = topics_features['sentences']
    print (f'\nGetting embeddings for {len(topics_sentences)} sentences from topics')
    # topics embeddings
    topics_aux = encoder.encode(topics_sentences, convert_to_tensor=True)
    topics_id_for_embeddings = topics_sentences.index
    topics_embeddings = util.normalize_embeddings(topics_aux.to(device)).detach().cpu().numpy()
    # content embeddings
    content_sentences = content_features['sentences']
    print (f'\nGetting embeddings for {len(content_sentences)} sentences from contents')
    content_ids_for_embeddings = content_sentences.index
    content_aux = encoder.encode(content_sentences, convert_to_tensor=True)
    content_embeddings = util.normalize_embeddings(content_aux.to(device)).detach().cpu().numpy()
    
    return topics_embeddings, content_embeddings, topics_id_for_embeddings, content_ids_for_embeddings

### Apply Sentence similarity scoring Function

In [16]:
# Calculating matches with similarity score
def similarity_matches (topic_ids_for_embeddings, content_ids_for_embeddings, content_embeds, topics_embeds, device, content_count_per_topic):

    list_aux = []
    start = datetime.datetime.now()
    print (f"\nCalculating matches for {len(topics_embeds)} topic sentences...")

    if torch.is_tensor(content_embeds):
        corpus_embeddings = content_embeds
    else:
        corpus_embeddings = torch.tensor(content_embeds)
    if torch.is_tensor(topics_embeds):
        query_embeddings = topics_embeds
    else:
        query_embeddings = torch.tensor(topics_embeds)

    if device != torch.device('cuda:0'):
        print ('\nGPU is not active. Similarity calculation aborted.')
        return None

    # we use util.semantic_search to perform cosine similarty + topk
    corpus_embeddings = corpus_embeddings.to('cuda')
    corpus_embeddings = util.normalize_embeddings(corpus_embeddings)
    query_embeddings = query_embeddings.to('cuda')
    query_embeddings = util.normalize_embeddings(query_embeddings)
    i = 0
    for query in query_embeddings:
        hits = util.semantic_search(query, corpus_embeddings, score_function=util.cos_sim, top_k=content_count_per_topic)
        for hit in hits[0]:
            list_aux.append (dict(topic_id = topic_ids_for_embeddings[i], content_ids = content_ids_for_embeddings[hit['corpus_id']], score = hit['score']))
        i += 1
        if i%1000 == 0:
            print (i, ' topics matched.')
    end  = datetime.datetime.now()
    delta = end - start
    print ("Finished matching in:", delta.total_seconds(), ' s')
    df_preds = pd.DataFrame (list_aux)
           
    return df_preds

In [17]:
def get_preds(df_preds):
    """
    df_preds = Dataframe with topic_id's and KNN chosen content_ids
    """
    df.content_ids = df.content_ids.str.join(' ')
    df = df.fillna('')
    return df

In [18]:
def get_true(df_corr, df):
    """
    Ground truth
    df_true = Dataframe with same topic_id's from 'get_preds' function but content_id's taken from correlations, 
    for score calculations. Returns the format defined in 'correlations.csv' but as dataframe.
    """
    df_true = df_corr.copy().reset_index().merge (df_preds, how='right', on='topic_id').iloc[:,[0,1]].rename(columns={'content_ids_x': 'content_ids'})
    df_true = df_true.fillna('')
    return df_true

# Setup Features and Target Functions

* Set X_train, X_test, y_train, y_test: split by category (use 'aligned' for validation); set has_content = True for validation. Check dimensions
* Use all features for X, including topic_id and content_id, topic_title and content_title
* Use the KNN predictions to "mount" X and y. Check dimensions.
* Use correlations to get y: if KNN topic-content match with correlations, y is 1, else is 0
* Use LGBMClassifier for multiclass classification

In [19]:
def get_target(preds):
    """
    Get a binary target from comparing similarity predictions y_preds with ground truth 'y' from 'df_corr'
    """
    preds_aux = preds.copy()
    df_num_rows = 400
    y_int=[]
    for i in range(len(preds_aux)//df_num_rows):
        print (f'\nCalculating preds: {i+1} of {len(preds_aux)//df_num_rows} iterations...\n')
        max_r = df_num_rows*(i+1)
        if max_r > len(preds_aux):
            max_r = len(preds_aux)
        min_r = df_num_rows*i
        df_aux = preds_aux.iloc[min_r:max_r]
        df_aux.iloc[:,1] = df_aux.content_ids.str.split(' ')
        df_aux = df_aux.explode('content_ids')
        df_aux = df_aux.merge (y, on='topic_id', how='left')
        df_aux['match'] = (df_aux.iloc[:,1] == df_aux.iloc[:,2]).astype (int) 
        y_int.append (df_aux.groupby(['topic_id', 'content_ids_x']).sum().reset_index())
        target = pd.concat(y_int)
        target.to_csv('target.csv', index=False)

    return target

In [20]:
def get_features (y_aux, topics_features, content_features):
    """ 
    Get features for classification. Using topic_id, topic_title, content_id and content_title, 
    among remaining class features. Returns Xcolumns=X. and y_target.
    """
    print ("\n* Defining features for X...\n")

    df_num_rows = 100000
    list_X=[]
    iterations = len(y_aux)//df_num_rows
    if iterations == 0:
        iterations = 1
    for i in range(iterations):
        print (f'\nCalculating features: {i+1} of {iterations} iterations...\n')
        max_r = df_num_rows*(i+1)
        if max_r > len(y_aux):
            max_r = len(y_aux)
        min_r = df_num_rows*i
        X = y_aux[min_r:max_r]
        topic_feat = topics_features.reset_index()
        # Merging 'topic_features', 'content_features' and 'matching_preds' altogether. 
        # Cleaning unnecessary columns from merges  
        X = X.merge (topic_feat, how='left', left_on=X.columns[0], right_on=topic_feat.columns[0])
        X['topic_title'] = X.sentences.str.split('.').str[0]
        X.drop(columns=X.iloc[:,[2,3,-2,-3]].columns.tolist(), inplace=True)
        content_aux = contents_features.reset_index().drop(columns=contents_features.iloc[:,1].name) 
        X = X.merge (content_aux, how='left', left_on=X.iloc[:,1].name, right_on=contents_features.reset_index().iloc[:,0].name)
        X['content_title'] = X.iloc[:,-1].str.split('.').str[0]
        X.drop(columns=X.iloc[:,[-2, -4, -6]].columns.tolist(), inplace=True)
        X.columns = X.columns.str.replace('_x', '')
        X = X.sort_values('language')
        print ("\nPartial X defined.")
        # Appending partial dataframe to list
        list_X.append(X)
    X = pd.concat (list_X)
    print ("\nX is defined.")
    return X

## Parameters

In [21]:
#General
device = 'cpu'
if torch.cuda.is_available():
    device = torch.device('cuda:0')
#Sentence Transformer
all_mini = 'all-MiniLM-L6-v2'
paraphrase = '/kaggle/input/sentence-embedding-models/paraphrase-MiniLM-L12-v2'
trained_model = model_path
# Similarity
content_per_topic = 100
print (f"\nLoaded models parameters")


Loaded models parameters


In [22]:
#Getting topics breadcrumbs or topics context
print ("\n* Getting 'topics_breadcrumbs' dataframe...\n")
df_topic_breadcrumbs = get_topic_breadcrumbs (df_topics)

#Cleaning data and getting topics and contents features
print ('\n* Calculating topics_features dataframe...\n')
topics_features = build_topic_features (df_topics, df_topic_breadcrumbs, levels, topics_cols)

print ('\n* Calculating content_features dataframe...\n')
contents_features = build_content_features (df_content, content_cols)


* Getting 'topics_breadcrumbs' dataframe...

Creating breadcrumbs...
Creating dataframe from breadcrumbs 'df_topics_breadcrumbs'...

* Calculating topics_features dataframe...


Creating and cleaning topic features...

Created 'topic_features'


Unnamed: 0,channel,category,level,language,has_content,content_ids,sentences
t_00004da3a1b2,000cf7,source,four,bg,True,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...,Откриването на резисторите.Изследване на матер...
t_00068291e9a4,8e286a,source,four,pt,True,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...,Entradas e saídas de uma função.Entenda um pou...
t_00069b63a70a,6e3ba4,source,three,en,True,c_11a1dc0bfb99,Transcripts..Flow Charts Logical Thinking Tr...
t_0006d41a73a8,000cf7,source,four,bg,True,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...,Графики на експоненциални функции Алгебра 2 н...
t_4054df11a74e,6e3ba4,source,two,en,True,c_3695c5dc1df6 c_f2d184a98231,Flow Charts Logical Thinking .This lesson is ...



* Calculating content_features dataframe...


Creating and cleaning content features...

Created 'content_features'


Unnamed: 0_level_0,kind,language,sentences
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
c_00002381196d,video,es,Sumar números de varios dígitos 48 029 233 93...
c_000087304a9e,video,it,Trovare i fattori di un numero.Sal trova i fat...
c_0000ad142ddb,video,es,Sumar curvas de demanda.Cómo añadir curvas de ...
c_0000c03adc8d,document,pt,Nado de aproximação.Neste vídeo você vai apren...
c_00016694ea2a,document,es,geometry m3 topic a overview pdf.geometry m3 t...


In [23]:
df_topic_breadcrumbs

Unnamed: 0,topic_breadcrumbs
t_00069b63a70a,Flow Charts: Logical Thinking? Transcripts


In [24]:
use_saved_embeds = False
if use_saved_embeds:
    #Loading previously saved embeddings
    print ('\n* Loading embeddings for both topics and contents...\n')
    topic_load_path = dataset_path / 'topic_embeds.npy'
    with open(topic_load_path, "rb") as f:
        topics_embeds = np.load(f)
    content_load_path = dataset_path / 'content_embeds.npy'
    with open(content_load_path, "rb") as f:
        contents_embeds = np.load(f)
    print ('\n Embeddings loaded.\n')
    topic_id_embeds = df_topics.index.tolist()
    content_ids_embeds = df_content.index.tolist()     

else:
    #Calculating embeddings from text
    print ('\n* Calculating embeddings for both topics and contents text columns...\n')
    topics_embeds, contents_embeds, topic_id_embeds, content_ids_embeds = get_embeddings(topics_features, contents_features)

    #Saving embeddings
    topic_save_path = work_path / 'topic_embeds.npy'
    with open(topic_save_path, "wb") as f:
        np.save(f, topics_embeds)
    print(f"Embeddings saved to {topic_save_path}")
    contents_save_path = work_path / 'content_embeds.npy'
    with open(contents_save_path, "wb") as f:
        np.save(f, contents_embeds)
    print(f"Embeddings saved to {contents_save_path}")


* Calculating embeddings for both topics and contents text columns...


Getting embeddings for 5 sentences from topics


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Getting embeddings for 154047 sentences from contents


Batches:   0%|          | 0/4814 [00:00<?, ?it/s]

Embeddings saved to /kaggle/working/topic_embeds.npy
Embeddings saved to /kaggle/working/content_embeds.npy


# Getting topics-contents from sentence similarity

In [25]:
#Calculating similar pairs by cosine_similarity function from sentence transformers
print('\n * Calculating sentence similarity predictions...\n\n')
preds = similarity_matches (topic_id_embeds, content_ids_embeds, contents_embeds, topics_embeds, device, content_per_topic)
print('\n First predictions dataframe\n\n')
display(preds)
preds.to_csv('cos_sim_predictions.csv')


 * Calculating sentence similarity predictions...



Calculating matches for 5 topic sentences...
Finished matching in: 0.240051  s

 First predictions dataframe




Unnamed: 0,topic_id,content_ids,score
0,t_00004da3a1b2,c_86adbb25971c,0.970608
1,t_00004da3a1b2,c_ab0ff31d7e92,0.968809
2,t_00004da3a1b2,c_72fbb1e5ac13,0.966928
3,t_00004da3a1b2,c_d5d9c2e82c51,0.964031
4,t_00004da3a1b2,c_8661941795f3,0.962870
...,...,...,...
495,t_4054df11a74e,c_b624fb4c0c31,0.666454
496,t_4054df11a74e,c_766ab825a30e,0.666193
497,t_4054df11a74e,c_298e7207f501,0.665992
498,t_4054df11a74e,c_a1569dea4b4b,0.665694


## Calculating score for 

In [26]:
print('\n Calculating F2 score for first predictions...\n\n')

# Getting 'df_preds' with some precautions as 'preds' shape can explode to 
# number of topics x number of similarities (200) getting easily in millions of lines
# Since we are concatenating all contents per topic in one row for score calculation
# the dataframe 'df_preds' gets again only the number of topics as size.
df_num_rows = 2500000
list_preds=[]
iterations = len(preds)//df_num_rows
if iterations == 0:
    iterations = 1
for i in range(iterations):
    print (f'\nCalculating y_target: {i+1} of {iterations} iterations...\n')
    max_r = df_num_rows*(i+1)
    if max_r > len(preds):
        max_r = len(preds)
    min_r = df_num_rows*i 
    df_preds_aux = preds.copy()[min_r:max_r]
    df_preds_aux = df_preds_aux.iloc[:, [0,1]]
    df_preds_aux = df_preds_aux.set_index('topic_id').groupby(level=0).agg(lambda x: ' '.join(x)).reset_index()
    list_preds.append (df_preds_aux)
df_preds=pd.concat (list_preds)
display(df_preds)

# Calculating fscore and validating topics-contents matches from df_preds (first predictions) with 
# df_corr (correlations.csv)
# Results: a dataframe with scoring; a list with true positive predicted matches
fscore, _ = calculate_Fscore(df_preds.copy(), df_corr.copy().reset_index())
#Showing scoring results
print ('\nTopics to match content:', len(df_preds))
print ('\nCorrect predictions:', fscore.correct_pred.sum())
print ('\nF2 Score:', fscore.f2.mean())
print ('\n')
    
print ('\n*** Score dataframe for similarity matches only ***\n')
display(fscore)
    
print ('\n*** First predictions with sentences similarity ***\n')
display(preds)


 Calculating F2 score for first predictions...



Calculating y_target: 1 of 1 iterations...



Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_86adbb25971c c_ab0ff31d7e92 c_72fbb1e5ac13 c...
1,t_00068291e9a4,c_c8bcd26a5f3e c_ebb7fdf10a7e c_0392fd5c4421 c...
2,t_00069b63a70a,c_870b480bd246 c_594f640ff57f c_ffa424db45d8 c...
3,t_0006d41a73a8,c_1cddbc8c1069 c_e2d9686c8a6b c_451043364ae0 c...
4,t_4054df11a74e,c_f2d184a98231 c_81c0baea8ec5 c_6850adbf2b9a c...



F2score calculation finished.

Topics to match content: 5

Correct predictions: 5

F2 Score: 0.012394754295981556



*** Score dataframe for similarity matches only ***



Unnamed: 0,topic_id,content_ids_pred,content_ids_actual,correct_pred,precision,recall,f0.5,f1,f2
0,t_00004da3a1b2,"[c_86adbb25971c, c_ab0ff31d7e92, c_72fbb1e5ac1...","[c_1108dd0c7a5d, c_376c5a8eb028, c_5bc0e1e2cba...",1,0.25,0.01,0.043103,0.019231,0.012376
1,t_00068291e9a4,"[c_c8bcd26a5f3e, c_ebb7fdf10a7e, c_0392fd5c442...","[c_639ea2ef9c95, c_89ce9367be10, c_ac1672cdcd2...",1,0.25,0.01,0.043103,0.019231,0.012376
2,t_00069b63a70a,"[c_870b480bd246, c_594f640ff57f, c_ffa424db45d...",[c_11a1dc0bfb99],0,0.0,0.0,0.0,0.0,0.0
3,t_0006d41a73a8,"[c_1cddbc8c1069, c_e2d9686c8a6b, c_451043364ae...","[c_0c6473c3480d, c_1c57a1316568, c_5e375cf14c4...",1,0.2,0.01,0.041667,0.019048,0.012346
4,t_4054df11a74e,"[c_f2d184a98231, c_81c0baea8ec5, c_6850adbf2b9...","[c_3695c5dc1df6, c_f2d184a98231]",2,1.0,0.02,0.092593,0.039216,0.024876



*** First predictions with sentences similarity ***



Unnamed: 0,topic_id,content_ids,score
0,t_00004da3a1b2,c_86adbb25971c,0.970608
1,t_00004da3a1b2,c_ab0ff31d7e92,0.968809
2,t_00004da3a1b2,c_72fbb1e5ac13,0.966928
3,t_00004da3a1b2,c_d5d9c2e82c51,0.964031
4,t_00004da3a1b2,c_8661941795f3,0.962870
...,...,...,...
495,t_4054df11a74e,c_b624fb4c0c31,0.666454
496,t_4054df11a74e,c_766ab825a30e,0.666193
497,t_4054df11a74e,c_298e7207f501,0.665992
498,t_4054df11a74e,c_a1569dea4b4b,0.665694


# Setup Features and Target

In [27]:
print ('\n* Calculating datasets X and y_target...\n\n')

X_test = get_features (preds, topics_features, contents_features)
print ('\nDisplaying datasets:\n\n')
display (X_test.head())
print (f'X_test shape{X_test.shape}')


* Calculating datasets X and y_target...



* Defining features for X...


Calculating features: 1 of 1 iterations...


Partial X defined.

X is defined.

Displaying datasets:




Unnamed: 0,topic_id,content_ids,channel,category,level,language,topic_title,kind,content_title
0,t_00004da3a1b2,c_86adbb25971c,000cf7,source,four,bg,Откриването на резисторите,exercise,Разбиране за стойността на израза
328,t_0006d41a73a8,c_af751845b3ac,000cf7,source,four,bg,Графики на експоненциални функции Алгебра 2 н...,video,Преглед на свойствата на триъгълник
329,t_0006d41a73a8,c_d44a4bb17f13,000cf7,source,four,bg,Графики на експоненциални функции Алгебра 2 н...,video,Пресечни точки от таблица
330,t_0006d41a73a8,c_75339525203f,000cf7,source,four,bg,Графики на експоненциални функции Алгебра 2 н...,exercise,Чертане на квадратни функции във вида с отделе...
331,t_0006d41a73a8,c_89b340ad4968,000cf7,source,four,bg,Графики на експоненциални функции Алгебра 2 н...,exercise,Загрявка чертане на квадратни функции в разло...


X_test shape(500, 9)


In [28]:
# save out features and target
DATA_PATH = Path.cwd().parent.resolve()
work_path = DATA_PATH / 'working'

x_test_pth = work_path / "x_test.csv"
x_test_pth.parent.mkdir(exist_ok=True, parents=True)

X_test.to_csv (x_test_pth, index=False)
print (f"\nFeatures set saved in file at {x_test_pth}.\n\n")


Features set saved in file at /kaggle/working/x_test.csv.




# Using LGBMClassifier

As a classifier, the choice is LightGBM's LGBMClassifier.

It has the advantage of not needing hot-encoding and being 8-times faster than the same OneHotEncoder from sklearn.

Below, the %%writefile line magic is used to write out a training script. Then ! to run a shell command from the jupyter notebook, and kick off the script. This is due to a [bug](https://github.com/microsoft/LightGBM/issues/4229).

We'll use the typer package to make our script easy to run in the command line.

# Validation predictions

In [29]:
%%writefile predict_gbm_model.py
import lightgbm as lgb

import joblib
from loguru import logger
import numpy as np
import pandas as pd
from pathlib import Path
import typer

DATA_PATH = Path.cwd().parent.resolve()
MODEL_DIR = DATA_PATH / 'input/learningequalityfiles'
DATA_DIR = DATA_PATH / 'working'

def main(
    model_weights_path = MODEL_DIR / "lgb_classifier.txt",
    features_path = DATA_DIR / "x_test.csv",
    preds_save_path = DATA_DIR / "y_preds.npy",
):
    """
    Generate predictions with a LightGBM model using weights saved at model_weights_path
    and features saved at features_path. Save out predictions to preds_save_path.
    """
    # load model weights
    lgb_model = joblib.load(model_weights_path)
    logger.info(f"Loaded model {lgb_model} from {model_weights_path}")

    # load the features
    X_val = pd.read_csv (features_path, dtype="category")
    logger.info(f"Loaded features of shape {X_val.shape} from {features_path}")

    # generate predictions
    y_preds = lgb_model.predict(X_val)

    # save out predictions
    with open(preds_save_path, "wb") as f:
        np.save(f, y_preds)
    logger.success(f"Predictions saved to {preds_save_path}")


if __name__ == "__main__":
    typer.run(main)

Writing predict_gbm_model.py


In [30]:
!python predict_gbm_model.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[32m2023-02-03 16:17:39.265[0m | [1mINFO    [0m | [36m__main__[0m:[36mmain[0m:[36m25[0m - [1mLoaded model LGBMClassifier(learning_rate=1, n_estimators=300, num_leaves=50,
               objective='binary', random_state=10) from /kaggle/input/learningequalityfiles/lgb_classifier.txt[0m
[32m2023-02-03 16:17:39.276[0m | [1mINFO    [0m | [36m__main__[0m:[36mmain[0m:[36m29[0m - [1mLoaded features of shape (500, 9) from /kaggle/working/x_test.csv[0m
[32m2023-02-03 16:17:39.337[0m | [32m[1mSUCCESS [0m | [36m__main__[0m:[36mmain[0m:[36m37[0m - [32m[1mPredictions saved to /kaggle/working/y_preds.npy[0m


# Model Scoring

In [31]:
preds_pth = work_path / "y_preds.npy"
with open(preds_pth, "rb") as f:
    y_preds = np.load(f).astype(int)
display(y_preds)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,

In [32]:
df_y_preds = X_test[y_preds.astype(bool)].iloc[:, [0,1]]
y_final_preds=df_y_preds.set_index('topic_id').groupby(level=0).agg(lambda x: ' '.join(x)).reset_index()
print ("\n This is the current model prediction for the input given:\n\n")
print (y_preds)
print (y_final_preds)


 This is the current model prediction for the input given:


[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

# Submissions

In [33]:
#import pickle
#pickle.dump(model, open(f'ST-KNN-{k[1]}', 'wb'))
y_final_preds.to_csv('submission.csv', index=False)