<a href="https://colab.research.google.com/github/sofiammatias/learning-equality/blob/main/Learning_Equality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle Challenge - Learning Equality

https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The goal of this competition is to streamline the process of matching educational content to specific topics in a curriculum. You will develop an accurate and efficient model trained on a library of K-12 educational materials that have been organized into a variety of topic taxonomies. These materials are in diverse languages, and cover a wide range of topics, particularly in STEM (Science, Technology, Engineering, and Mathematics).

Your work will enable students and educators to more readily access relevant educational content to support and supplement learning.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~
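
The format above can be produced with the standard `csv` module. The following is a minimal sketch, assuming predictions are held in a dictionary mapping each topic id to a list of content ids; the function name `write_submission` and the `predictions` dictionary are hypothetical and not part of the competition material.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write a header plus one 'topic_id,content_ids' row per topic."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["topic_id", "content_ids"])
    for topic_id, content_ids in predictions.items():
        # content ids are space-delimited inside a single CSV field
        writer.writerow([topic_id, " ".join(content_ids)])


# Hypothetical predictions, reusing ids from the sample above
predictions = {
    "t_00004da3a1b2": ["c_1108dd0c7a5d", "c_376c5a8eb028"],
    "t_00069b63a70a": ["c_11a1dc0bfb99"],
}

buffer = io.StringIO()
write_submission(predictions, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `StringIO` buffer writes the same rows to disk.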

## Scoring
Mean F2 score

## Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{1}{ \text{Benchmark} - \max\text{F2} }\text{F2} + \frac{1}{32400}\text{RuntimeSeconds} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum F2 of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.
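
As a rough illustration of the formula, the sketch below evaluates it for made-up numbers. The function name `efficiency_score` and all input values are hypothetical. It assumes the best submission beats the benchmark, so the denominator is negative and a higher F2 lowers (improves) the score, while a longer runtime raises it.

```python
def efficiency_score(f2, benchmark_f2, max_f2, runtime_seconds):
    """Efficiency = F2 / (Benchmark - maxF2) + RuntimeSeconds / 32400 (lower is better)."""
    denominator = benchmark_f2 - max_f2
    if denominator == 0:
        raise ValueError("Benchmark and maxF2 must differ")
    return f2 / denominator + runtime_seconds / 32400


# Hypothetical numbers, chosen only to show the direction of each term
fast = efficiency_score(f2=0.5, benchmark_f2=0.1, max_f2=0.6, runtime_seconds=3240)
slow = efficiency_score(f2=0.5, benchmark_f2=0.1, max_f2=0.6, runtime_seconds=32400)
print(fast, slow)  # the slower run gets the larger (worse) score
```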

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

# Data

## Imports

In [1]:
!pip install -U sentence-transformers


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.metrics import fbeta_score, precision_score, recall_score, make_scorer
from sklearn.dummy import DummyClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Data Collection

## Helper functions and classes for data traversal (breadcrumbs, topic context)

In [5]:
# define some helper functions and classes to aid with data traversal

class ContentItem:
    """
    Class to handle content relations (breadcrumbs)
    """
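    def __init__(self, content_id):
        self.id = content_id

    @property
    def title(self):
        # Title lookup in the content table; needed by get_all_breadcrumbs
        return df_content.loc[self.id, 'title']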

    @property
    def topics(self):
        return [Topic(topic_id) for topic_id in df_topics.loc[df_corr[df_corr.content_ids.str.contains(self.id)].index].index]

    def get_all_breadcrumbs(self, separator=" >> ", include_root=True):
        breadcrumbs = []
        title = self.title
        for topic in self.topics:
            new_breadcrumb = topic.get_breadcrumbs(separator=separator, include_root=include_root)
            if new_breadcrumb:
                new_breadcrumb = new_breadcrumb + separator + title
            else:
                new_breadcrumb = title
            breadcrumbs.append(new_breadcrumb)
        return breadcrumbs

class Topic:
    """
    Class to handle topics relations (breadcrumbs)
    """
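    def __init__(self, topic_id):
        self.id = topic_id

    @property
    def title(self):
        # Title lookup in the topics table; needed by get_breadcrumbs
        return df_topics.loc[self.id, 'title']

    @property
    def has_content(self):
        # Flag lookup in the topics table; needed by the content property
        return bool(df_topics.loc[self.id, 'has_content'])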

    @property
    def parent(self):
        parent_id = df_topics.loc[self.id].parent
        if pd.isna(parent_id):
            return None
        else:
            return Topic(parent_id)

    @property
    def ancestors(self):
        ancestors = []
        parent = self.parent
        while parent is not None:
            ancestors.append(parent)
            parent = parent.parent
        return ancestors

    @property
    def content(self):
        if self.id in df_corr.index:
            return [ContentItem(content_id) for content_id in df_corr.loc[self.id].content_ids.split()]
        else:
            return tuple([]) if self.has_content else []

    def get_breadcrumbs(self, separator=" >> ", include_self=True, include_root=True):
        ancestors = self.ancestors
        if include_self:
            ancestors = [self] + ancestors
        if not include_root:
            ancestors = ancestors[:-1]
        return separator.join(reversed([a.title for a in ancestors]))

## Load dataframes

In [5]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [6]:
drive_path = '/content/drive/My Drive/Colab Notebooks/learning_equality/'
#drive_path = '/kaggle/input'

# load 'topics' data into pandas dataframe
df_topics = pd.read_csv(f'{drive_path}topics.csv', index_col=0).fillna({"title": "", "description": ""})
df_topics

id,title,description,channel,category,level,language,parent,has_content
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True
...,...,...,...,...,...,...,...,...
t_fffb0bf2801d,4.3 Graph of functions,,e77b55,aligned,4,en,t_676e6a1a4dc7,False
t_fffbe1d5d43c,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True
t_fffe14f1be1e,Lección 7,,6e90a7,aligned,6,es,t_d448c707984d,True
t_fffe811a6da9,تحديد العلاقة بين الإحداثيّات القطبية والإحداث...,5b9e5ca86571f90499ea987f,9fd860,source,2,ar,t_5b4f3ba4eb7d,True


In [7]:
# load 'content' data into pandas dataframe
df_content = pd.read_csv(f'{drive_path}content.csv', index_col=0).fillna("")
df_content

id,title,description,kind,text,language,copyright_holder,license
c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA
...,...,...,...,...,...,...,...
c_fffcbdd4de8b,2. 12: Diffusion,,html5,What will eventually happen to these dyes?\n\n...,en,CSU and Merlot,CC BY-NC-SA
c_fffe15a2d069,Sommare facendo gruppi da 10,Sal somma 5+68 spezzando il 5 in un 2 e un 3.\n\n,video,,it,,
c_fffed7b0d13a,Introdução à subtração,Sal fala sobre o que significa subtrair. Os ex...,video,,pt,,
c_ffff04ba7ac7,SA of a Cone,,video,,en,,


In [8]:
# load 'correlations' data into pandas dataframe
df_corr = pd.read_csv(f'{drive_path}correlations.csv', index_col=0)
df_corr

topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4
...,...
t_fff830472691,c_61fb63326e5d c_8f224e321c87
t_fff9e5407d13,c_026db653a269 c_0fb048a6412c c_20de77522603 c...
t_fffbe1d5d43c,c_46f852a49c08 c_6659207b25d5
t_fffe14f1be1e,c_cece166bad6a


In [9]:
# load 'topic_breadcrumbs' data into pandas dataframe
df_topic_breadcrumbs = pd.read_csv(f'{drive_path}topic_breadcrumbs.csv', index_col=0)
df_topic_breadcrumbs

Unnamed: 0,topic_breadcrumbs
t_00004da3a1b2,Khan Academy (български език) Наука Физика Отк...
t_000095e03056,Ghana JHS Curriculum (in progress) Junior High...
t_00068291e9a4,Khan Academy (Português (Brasil)) Matemática p...
t_00069b63a70a,MIT Blossoms Engineering Flow Charts: Logical ...
t_0006d41a73a8,Khan Academy (български език) Математика Алгеб...
...,...
t_fffb0bf2801d,Camara Education Ethiopia Secondary Science an...
t_fffbe1d5d43c,Khan Academy (Kiswahili) Hisabati Class 9 (Ind...
t_fffe14f1be1e,CREE Para el Estudiante I Ciclo 01 Primero Esp...
t_fffe811a6da9,Edraak (العربيّة) الهندسة وعلم المثلثات تحديد ...


In [None]:
# load 'content_breadcrumbs' data into pandas dataframe
#df_content_breadcrumbs = pd.read_csv(f'{drive_path}content_breadcrumbs.csv', index_col=0)
#df_content_breadcrumbs



In [10]:
# Restructure 'df_corr' (correlations): explode the target column to be more practical to use - one topic -> one content
y = df_corr.copy()
y['content_ids'] = y.content_ids.str.split(' ')
y = y.explode('content_ids')
y.reset_index(inplace=True)
y

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
...,...,...
279914,t_fff9e5407d13,c_d64037a72376
279915,t_fffbe1d5d43c,c_46f852a49c08
279916,t_fffbe1d5d43c,c_6659207b25d5
279917,t_fffe14f1be1e,c_cece166bad6a


## Create breadcrumbs for topics

In [None]:
%%time
# - Create breadcrumbs for topics and save them to a csv file
#   These two calls ran for 1h20m without finishing:
#       df_X['topic_id'].apply (lambda topic: Topic(topic).get_breadcrumbs())
#       df_X['content_ids'].apply (lambda content: ContentItem(content).get_all_breadcrumbs())
#   A for loop with list append took 5m for the topics

topic_breadcrumbs=[]
topic_ids=[]
for i, topic in enumerate(df_topics.reset_index()['id']):
  topic_ids.append (topic)
  topic_breadcrumbs.append (Topic(topic).get_breadcrumbs(separator=' '))
  print (i, ' : ', topic, ' - ', topic_breadcrumbs[-1])

# Save 'topic_breadcrumbs'
# (pd.DataFrame takes 'index', not 'index_col'; the column name matches the
#  'topic_breadcrumbs' column expected when the file is loaded back)
df_topic_breadcrumbs = pd.DataFrame ({'topic_breadcrumbs': topic_breadcrumbs}, index=topic_ids)
df_topic_breadcrumbs.to_csv(f'{drive_path}topic_breadcrumbs.csv')
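
Most of the time in the loop above goes into repeated `df_topics.loc` lookups while walking up the tree. One possible speed-up, sketched below under the assumption that titles and parents have first been copied into plain dictionaries, is to cache each breadcrumb so every ancestor chain is resolved only once. The function `build_breadcrumbs` and the toy data are hypothetical and have not been timed against the real dataset.

```python
def build_breadcrumbs(titles, parents, separator=" "):
    """Return {topic_id: 'root ... parent self'} using a cache of resolved chains."""
    cache = {}

    def crumb(topic_id):
        if topic_id in cache:
            return cache[topic_id]
        parent = parents.get(topic_id)
        # treat None and NaN (NaN != NaN) as "no parent"
        if parent is None or parent != parent:
            result = titles[topic_id]
        else:
            result = crumb(parent) + separator + titles[topic_id]
        cache[topic_id] = result
        return result

    return {topic_id: crumb(topic_id) for topic_id in titles}


# Toy tree with hypothetical ids: root -> algebra -> logs
titles = {"t_root": "Math", "t_alg": "Algebra", "t_log": "Logarithms"}
parents = {"t_root": None, "t_alg": "t_root", "t_log": "t_alg"}
breadcrumbs = build_breadcrumbs(titles, parents, separator=" >> ")
print(breadcrumbs["t_log"])
```

With the real data, the two dictionaries could be taken from `df_topics['title'].to_dict()` and `df_topics['parent'].to_dict()`.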

## Data Cleaning

## Data cleaning params

In [68]:
levels = {0: 'zero', 1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five', 6: 'six', 7: 'seven', 8: 'eight', 9: 'nine', 10: 'ten'}
langs_t = pd.DataFrame (df_topics['language'].unique(), columns=['lang']) 
langs_c = pd.DataFrame (df_content['language'].unique(), columns=['lang']) 
langs = langs_t.merge (langs_c, how='left')
# Output: ['bg', 'en', 'pt', 'gu', 'my', 'zh', 'ar', 'te', 'es', 'fr', 'sw', 'mr', 'hi', 'bn', 'fil', 'ru', 'it', 'or', 'pnb', 'km', 'as', 'kn','ur', 'pl', 'ta', 'swa', 'tr', 'mul']
topic_cols = ['title', 'description', 'topic_breadcrumbs']
content_cols = ['title_content', 'description_content', 'kind', 'text_content', 'license']

## Data cleaning functions

In [62]:
# Filter by language
def filter_lang(X, lang):
  """
  Filter dataframes by 'language'
  """
  return X[X.language == lang].copy()


# Clean up text
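def clean_text(text_col):
  """
  Replace punctuation and special characters in a dataframe column with spaces
  """
  # \W matches every non-word character (including \n and \t). The underscore
  # counts as a word character, so it is listed explicitly. This single pattern
  # covers both the \W pass and the loop over string.punctuation, whose
  # single-character patterns fail with regex=True in recent pandas versions.
  return text_col.str.replace(r'[\W_]', ' ', regex=True)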

id
c_00019840d110    LEARNING OBJECTIVES  Outline the three major p...
c_00027d03ca7d    4 3   BPA  Answer the following questions with...
c_000425df0161    Introduction  The perfect gas equation of stat...
c_00046806ad8a                                                     
c_000751f58836    Angle A is a circumscribed angle on circle O  ...
                                        ...                        
c_fffa90b024e3                                                     
c_fffbfc3d60c6                                                     
c_fffcbdd4de8b    What will eventually happen to these dyes   Th...
c_ffff04ba7ac7                                                     
c_ffffe5254266                                                     
Name: text, Length: 65939, dtype: object

## Topics


*   Change 'level' into strings instead of numbers
*   Filter by language (start with English, then loop the whole process over all languages, saving the model between languages)
*   Clean strings: punctuation and special characters (\n, \t, ...)
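
The three steps listed above can be sketched on a toy table as follows. The helper `prepare_topics`, the toy rows, and the two-entry level mapping are hypothetical; this is an illustration of the intended order of operations, not the notebook's actual pipeline.

```python
import pandas as pd

# Hypothetical two-row stand-in for df_topics
toy_topics = pd.DataFrame(
    {
        "title": ["Unit 3.3 Enlargements", "Entradas e saídas de uma função"],
        "level": [2, 4],
        "language": ["en", "pt"],
    },
    index=["t_a", "t_b"],
)
level_names = {2: "two", 4: "four"}  # subset of the 'levels' mapping


def prepare_topics(df, lang, level_names):
    out = df[df["language"] == lang].copy()            # filter by language
    out["level"] = out["level"].map(level_names)       # numbers -> words
    # punctuation, underscores and control characters -> spaces
    out["title"] = out["title"].str.replace(r"[\W_]", " ", regex=True)
    return out


prepared = prepare_topics(toy_topics, "en", level_names)
print(prepared)
```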



In [80]:
topics_features = df_topics.copy()
topics_features = topics_features.merge (df_topic_breadcrumbs, how='outer', right_index=True, left_index=True)
topics_features = filter_lang (topics_features, langs.iloc[1, 0])
for col in topic_cols:
  topics_features[col] = clean_text(topics_features[col])
display (topics_features.head())
topics_sentences = topics_features[topic_cols].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)

id,title,description,channel,category,level,language,parent,has_content,topic_breadcrumbs
t_000095e03056,Unit 3 3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False,Ghana JHS Curriculum in progress Junior High...
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True,MIT Blossoms Engineering Flow Charts Logical ...
t_0008a1bd84ba,12 20 Bird Reproduction,,ebc86c,supplemental,5,en,t_c44ac9711007,True,Libretext Open Educational Resource Library Bo...
t_000c0c854f0b,Nepal,,f2e966,source,1,en,t_f9c4ef0d6290,False,Touchable Earth en Nepal
t_000d1fb3f2f5,2 1 2 Logarithms,,e77b55,aligned,5,en,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...


pandas.core.series.Series

## Contents


*   Remove 'copyright_holder'
*   Filter by language (start with English, then loop the whole process over all languages, saving the model between languages)
*   Clean strings: punctuation and special characters (\n, \t, ...)



In [11]:
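content_features = df_content.copy().drop(columns=['copyright_holder'])
content_features = filter_lang (content_features, langs.iloc[1, 0])
# 'content_cols' holds the renamed '*_content' columns of the merged df_X,
# so the raw column names of df_content are listed here instead
raw_content_cols = ['title', 'description', 'kind', 'text', 'license']
for col in raw_content_cols:
  content_features[col] = clean_text(content_features[col])
display (content_features.head())
# Same construction as topics_sentences: one comma-joined string per content item
content_sentences = content_features[raw_content_cols].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
content_sentences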




df_X = y.copy()
topics = df_topics.copy()
topics = topics.merge (df_topic_breadcrumbs, how='outer', right_index=True, left_index=True)
topics['level'] = topics['level'].replace (levels)
topics = topics.reset_index().rename(columns={'id': 'topic_id'})
df_X = df_X.merge (topics, on='topic_id', how='outer')                    
df_X = df_X.merge (df_content.reset_index().rename(columns={'id': 'content_ids'}), on='content_ids', how='left')
# 'langs' is a dataframe of language codes; pick one code ('en') as in the cells above
df_X = df_X[df_X.language_x == langs.iloc[1, 0]].copy()
df_X.fillna("", inplace=True)
y_lang = df_X[['topic_id', 'content_ids']]
df_X.drop (columns = ['copyright_holder', 'language_x', 'language_y'], inplace = True)
df_X.rename(columns={'title_x': 'title_topic', 'title_y': 'title_content', 'description_x': 'description_topic', 'description_y': 'description_content', 'text': 'text_content'}, inplace = True)
df_X['text_content'] = df_X['text_content'].str.replace(r'\W', ' ', regex=True)
df_X

Unnamed: 0,topic_id,content_ids,title_topic,description_topic,channel,category,level,parent,has_content,topic_breadcrumbs,title_content,description_content,kind,text_content,license
8,t_00069b63a70a,c_11a1dc0bfb99,Transcripts,,6e3ba4,source,three,t_4054df11a74e,True,MIT Blossoms Engineering Flow Charts: Logical ...,Flow Charts: Logical..: Written Transcript of ...,"Written Transcript of this video lesson, Engli...",document,Español Título de la lección ...,CC BY-NC-SA
17,t_0008a1bd84ba,c_7ff92a954a3d,12. 20: Bird Reproduction,,ebc86c,supplemental,five,t_c44ac9711007,True,Libretext Open Educational Resource Library Bo...,12. 20: Bird Reproduction,,html5,Is this pair of birds actually a couple Ye...,CC BY-NC-SA
18,t_0008a1bd84ba,c_8790b074383e,12. 20: Bird Reproduction,,ebc86c,supplemental,five,t_c44ac9711007,True,Libretext Open Educational Resource Library Bo...,Astounding Mating Dance Birds of Paradise -- H...,The Birds of Paradise from BBC's outstanding P...,video,,
19,t_000d1fb3f2f5,c_07f1d0eec4b2,2.1.2 - Logarithms,,e77b55,aligned,five,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...,Proof of the logarithm change of base rule,Sal proves the logarithmic change of base rule...,video,What I want to do in this video is prove the c...,CC BY-NC-SA
20,t_000d1fb3f2f5,c_15a6fb858696,2.1.2 - Logarithms,,e77b55,aligned,five,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...,Intro to logarithm properties (2 of 2),Sal introduces the logarithm identities for mu...,video,PROFESSOR Welcome back I m going to show you...,CC BY-NC-SA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295369,t_ffe63e436874,,Orientation to the school,,2293a4,aligned,three,t_0e5ea30f4489,False,Ubongo Kids (Malawi) Standard One Introduction...,,,,,
295370,t_ffe9f9b46542,,Vocabulary,,b4897d,supplemental,three,t_e05c0dbc0667,False,BB-English Maharashtra Board 7 English Vocabulary,,,,,
295371,t_fff80f4eee89,,Polynomial division,"After we have added, subtracted, and multiplie...",6b09a4,supplemental,five,t_e93f4fad3893,False,Fiji Y10 Mathematics Y12 Mathematics Algebra R...,,,,,
295372,t_fffb0bf2801d,,4.3 Graph of functions,,e77b55,aligned,four,t_676e6a1a4dc7,False,Camara Education Ethiopia Secondary Science an...,,,,,


## Create 'topic_sentences'

## Sentence transformer

In [21]:
backbone = 'all-MiniLM-L6-v2' #explore also msmarco models and multilanguage

def SentenceTransf (backbone, X):
  import time
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer(backbone)
  start = time.time()
  print ('=== SENTENCE-TRANSFORMER ==',backbone, '===')
  stopper = 5  # debug limit: stop after this many rows
  frames = []
  # running model along all X rows (each column value is encoded separately)
  for i, (_, row) in enumerate(X.iterrows()):
    if i == stopper:
      break
    print ('Transforming row ', i)
    sentence = row.astype(str).values.tolist()
    embeddings = model.encode(sentence)
    # one output row per input row, one embedding vector per column
    frames.append(pd.DataFrame(zip(embeddings)).transpose())
  # DataFrame.append was removed in pandas 2.0; concatenate once instead
  X_embeddings = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame([])
  elapsed = time.time() - start
  return X_embeddings, elapsed
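
Encoding row by row calls the model once per row and leaves a vector inside every cell, which scikit-learn estimators cannot consume directly. A possible alternative, sketched below, encodes all cells in a single call and concatenates the per-column vectors into one numeric matrix. The helper `embed_columns` is hypothetical, and it assumes `encode` behaves like `SentenceTransformer.encode`, taking a list of strings and returning a 2-D array. A stand-in `fake_encode` is used here so the example runs without downloading a model.

```python
import numpy as np
import pandas as pd


def embed_columns(X, encode):
    """Encode every cell of X at once; return an (n_rows, n_cols * dim) float matrix."""
    n_rows, n_cols = X.shape
    # row-major flattening keeps the cells of one row next to each other
    flat = X.astype(str).to_numpy().ravel().tolist()
    vectors = np.asarray(encode(flat), dtype=float)
    return vectors.reshape(n_rows, n_cols * vectors.shape[1])


def fake_encode(texts):
    # Stand-in for a sentence encoder: 3 toy features per text
    return np.array([[len(t), t.count(" "), 1.0] for t in texts])


toy = pd.DataFrame({"title": ["a b", "cd"], "description": ["", "xyz"]})
matrix = embed_columns(toy, fake_encode)
print(matrix.shape)
```

With the real model, the call would be along the lines of `embed_columns(X_train_topics, SentenceTransformer(backbone).encode)`.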

In [78]:
cat_for_val = 'aligned'
df_X[(df_X.category == cat_for_val) & (df_X.has_content == True)]

Unnamed: 0,topic_id,content_ids,title_topic,description_topic,channel,category,level,parent,has_content,topic_breadcrumbs,title_content,description_content,kind,text_content,license
19,t_000d1fb3f2f5,c_07f1d0eec4b2,2.1.2 - Logarithms,,e77b55,aligned,five,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...,Proof of the logarithm change of base rule,Sal proves the logarithmic change of base rule...,video,What I want to do in this video is prove the c...,CC BY-NC-SA
20,t_000d1fb3f2f5,c_15a6fb858696,2.1.2 - Logarithms,,e77b55,aligned,five,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...,Intro to logarithm properties (2 of 2),Sal introduces the logarithm identities for mu...,video,PROFESSOR Welcome back I m going to show you...,CC BY-NC-SA
21,t_000d1fb3f2f5,c_175e9db3fc44,2.1.2 - Logarithms,,e77b55,aligned,five,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...,Solve exponential equations using logarithms: ...,Solve exponential equations that have 2 or oth...,exercise,,
22,t_000d1fb3f2f5,c_1c2e804fa58a,2.1.2 - Logarithms,,e77b55,aligned,five,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...,Using the logarithmic power rule,Sal rewrites log₅(x³) as 3log₅(x).\n\n,video,We re asked to simplify log base 5 of x to the...,CC BY-NC-SA
23,t_000d1fb3f2f5,c_1ec97b588bce,2.1.2 - Logarithms,,e77b55,aligned,five,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...,Evaluating logarithms (advanced),"Sal evaluates log₂(8), log₈(2), log₂(⅛), and l...",video,Let s give ourselves a little bit more practic...,CC BY-NC-SA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279888,t_fff51448598c,c_ab59efb09f73,Comparative advantage and the gains from trade,Comparative advantage is the ability to produc...,2ee29d,aligned,four,t_9d906e167aea,True,Khan Academy (English - CBSE India Curriculum)...,Comparative advantage and absolute advantage,Trade benefits both agents when each specializ...,video,What I want to do in this video is make sure w...,CC BY-NC-SA
279889,t_fff51448598c,c_cef3d40736d3,Comparative advantage and the gains from trade,Comparative advantage is the ability to produc...,2ee29d,aligned,four,t_9d906e167aea,True,Khan Academy (English - CBSE India Curriculum)...,Comparative advantage worked example,In this video we work through an example of a ...,video,Instructor The countries of Kalos and Joht...,CC BY-NC-SA
279890,t_fff51448598c,c_d706cb35bd02,Comparative advantage and the gains from trade,Comparative advantage is the ability to produc...,2ee29d,aligned,four,t_9d906e167aea,True,Khan Academy (English - CBSE India Curriculum)...,Comparative advantage and the gains from trade,Practice what you have learned about calculati...,exercise,,
279891,t_fff51448598c,c_e346107f23d9,Comparative advantage and the gains from trade,Comparative advantage is the ability to produc...,2ee29d,aligned,four,t_9d906e167aea,True,Khan Academy (English - CBSE India Curriculum)...,Terms of trade and the gains from trade,"In this video, we explore how we can use oppor...",video,Instructor Let s imagine a very simple wor...,CC BY-NC-SA


# Model

## Get X and y

In [24]:
cat_for_val = 'aligned'
X_val_all = df_X[(df_X.category == cat_for_val) & (df_X.has_content == True)].copy()
y_val_all = df_X[(df_X.category == cat_for_val) & (df_X.has_content == True)].copy().loc[:, ('topic_id', 'content_ids')]
X_train_all = df_X[df_X.category != cat_for_val].copy()
y_train_all = df_X[df_X.category != cat_for_val].copy().loc[:, ('topic_id', 'content_ids')]

In [25]:
# - Using category 'aligned' for validation and only topics with content
# - For X_topics, using columns ['topic_id', 'content_ids', 'title_topic', 'description_topic', 'channel', 'category', 'level', 'parent', 'has_content','topic_breadcrumbs']
# X_topics = Dropping columns ['topic_id', 'content_ids', 'title_content', 'description_content', 'kind', 'text_content', 'license'] from df_X
#
# - For X_contents, using columns ['topic_id', 'content_ids', 'title_content', 'description_content', 'kind', 'text_content', 'license']
# X_contents = Dropping columns ['topic_id', 'content_ids', 'title_topic', 'description_topic', 'channel', 'category', 'level', 'parent', 'has_content','topic_breadcrumbs'] from df_X
X_val_topics = X_val_all.drop(columns = ['topic_id', 'content_ids', 'title_content', 'description_content', 'kind', 'text_content', 'license'])
X_train_topics = X_train_all.drop(columns = ['topic_id', 'content_ids', 'title_content', 'description_content', 'kind', 'text_content', 'license'])

X_val_content = X_val_all.drop(columns = ['topic_id', 'content_ids', 'title_topic', 'description_topic', 'channel', 'category', 'level', 'parent', 'has_content','topic_breadcrumbs'])
X_train_content = X_train_all.drop(columns = ['topic_id', 'content_ids', 'title_topic', 'description_topic', 'channel', 'category', 'level', 'parent', 'has_content','topic_breadcrumbs'])

In [26]:
limit = 5
X_train_topics, time_train = SentenceTransf (backbone, X_train_topics[:limit])
print ('Time to preprocess X_train', time_train)
X_train_topics

=== SENTENCE-TRANSFORMER == all-MiniLM-L6-v2 ===
Transforming row  0
Transforming row  1
Transforming row  2
Transforming row  3
Transforming row  4
Time to preprocess X_train 1.0213096141815186


Unnamed: 0,0,1,2,3,4,5,6,7
0,"[-0.06976322, 0.02155305, -0.055510722, -0.030...","[-0.11883834, 0.04829865, -0.0025481011, -0.01...","[-0.05365038, -0.017775202, -0.06904105, -0.01...","[0.027087495, 0.037186757, 0.004761678, 0.0243...","[0.04360136, -0.008855685, -0.102738105, -0.01...","[-0.037534688, -0.0034662618, -0.043352548, -0...","[0.07581233, 0.026906624, -0.0056490395, -0.05...","[-0.0072098854, 0.00516788, 0.0687713, -0.0511..."
1,"[-0.008430471, 0.059289858, -0.0068703187, -0....","[-0.11883834, 0.04829865, -0.0025481011, -0.01...","[-0.07810093, 0.0113532655, -0.04234103, -0.01...","[-0.058500007, -0.02762344, -0.032580882, 0.02...","[-0.010091712, 0.020252606, 0.0098109115, 0.01...","[-0.06953352, 0.030418746, -0.059676144, -0.06...","[0.07581233, 0.026906624, -0.0056490395, -0.05...","[0.040176038, 0.021153286, 0.008287823, -0.021..."
2,"[-0.008430471, 0.059289858, -0.0068703187, -0....","[-0.11883834, 0.04829865, -0.0025481011, -0.01...","[-0.07810093, 0.0113532655, -0.04234103, -0.01...","[-0.058500007, -0.02762344, -0.032580882, 0.02...","[-0.010091712, 0.020252606, 0.0098109115, 0.01...","[-0.06953352, 0.030418746, -0.059676144, -0.06...","[0.07581233, 0.026906624, -0.0056490395, -0.05...","[0.040176038, 0.021153286, 0.008287823, -0.021..."
3,"[0.02941689, 0.060091794, -0.00026553357, -0.0...","[-0.11883834, 0.04829865, -0.0025481011, -0.01...","[-0.060675383, -0.042284127, -0.07454348, 0.02...","[0.027087495, 0.037186757, 0.004761678, 0.0243...","[0.018897688, 0.013223383, -0.08231031, 0.0274...","[-0.09829158, 0.039294507, -0.04985952, -0.049...","[0.07581233, 0.026906624, -0.0056490395, -0.05...","[0.037725728, 0.064541526, -0.01960373, -0.055..."
4,"[0.005997027, 0.025607703, -0.008715419, -0.06...","[-0.11883834, 0.04829865, -0.0025481011, -0.01...","[-0.060675383, -0.042284127, -0.07454348, 0.02...","[0.027087495, 0.037186757, 0.004761678, 0.0243...","[0.018897688, 0.013223383, -0.08231031, 0.0274...","[-0.06703734, 0.03091835, -0.042463597, -0.055...","[0.07581233, 0.026906624, -0.0056490395, -0.05...","[-0.023256205, 0.0508107, -0.011230599, -0.059..."


In [57]:
print (X_val.info())
print (X_train.shape)
print (y_val.shape)
print (y_train.shape)
X_val[0]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5 non-null      object
 1   1       5 non-null      object
 2   2       5 non-null      object
 3   3       5 non-null      object
 4   4       5 non-null      object
 5   5       5 non-null      object
 6   6       5 non-null      object
 7   7       5 non-null      object
 8   8       5 non-null      object
 9   9       5 non-null      object
 10  10      5 non-null      object
 11  11      5 non-null      object
 12  12      5 non-null      object
dtypes: object(13)
memory usage: 648.0+ bytes
None
(5, 13)
(5, 2)
(5, 2)


0    [0.004427435, 0.026911903, 0.049479138, 0.0162...
1    [0.004427435, 0.026911903, 0.049479138, 0.0162...
2    [0.004427435, 0.026911903, 0.049479138, 0.0162...
3    [0.004427435, 0.026911903, 0.049479138, 0.0162...
4    [0.004427435, 0.026911903, 0.049479138, 0.0162...
Name: 0, dtype: object

## Scoring

Scoring is based on the mean F2 score.

In [45]:
def model_scoring (y_test, y_pred, average='weighted'):
  """
   Calculate precision, recall and f2-score for y_test and y_pred
  """
  Precision = []
  Precision.append (precision_score(y_test['topic_id'], y_pred[:,0], average=average))
  Precision.append (precision_score(y_test['content_ids'], y_pred[:,1], average=average))
  precision_mean = np.mean(Precision)
  print ('Precision:', precision_mean)
  Recall = []
  Recall.append (recall_score(y_test['topic_id'], y_pred[:,0], average=average))
  Recall.append (recall_score(y_test['content_ids'], y_pred[:,1], average=average))
  recall_mean = np.mean(Recall)
  print ('Recall:', recall_mean)
  F2macro = []
  F2macro.append (fbeta_score(y_test['topic_id'], y_pred[:,0], beta=2, average='macro'))
  F2macro.append (fbeta_score(y_test['content_ids'], y_pred[:,1], beta=2, average='macro'))
  F2macro_mean = np.mean(F2macro)
  print ('F2 macro:', F2macro_mean)
  F2micro = []
  F2micro.append (fbeta_score(y_test['topic_id'], y_pred[:,0], beta=2, average='micro'))
  F2micro.append (fbeta_score(y_test['content_ids'], y_pred[:,1], beta=2, average='micro'))
  F2micro_mean = np.mean(F2micro)
  print ('F2 micro:', F2micro_mean)
  return precision_mean, recall_mean, F2macro_mean, F2micro_mean
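
The function above scores topic ids and content ids as two independent label columns. For comparison, here is a minimal sketch of a per-topic mean F2, assuming the competition metric is computed on the set of content ids predicted for each topic and then averaged over topics. The function names and the toy ids are hypothetical.

```python
def f2_score(true_ids, predicted_ids):
    """F-beta with beta=2 on two collections of ids (recall weighted 4x precision)."""
    true_ids, predicted_ids = set(true_ids), set(predicted_ids)
    true_positives = len(true_ids & predicted_ids)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_ids)
    recall = true_positives / len(true_ids)
    return 5 * precision * recall / (4 * precision + recall)


def mean_f2(y_true, y_pred):
    """Average F2 over topics; both arguments map topic_id -> 'c_a c_b ...'."""
    scores = [
        f2_score(ids.split(), y_pred.get(topic_id, "").split())
        for topic_id, ids in y_true.items()
    ]
    return sum(scores) / len(scores)


# Toy example: one half-right topic and one fully correct topic
truth = {"t_1": "c_1 c_2", "t_2": "c_3"}
guess = {"t_1": "c_1 c_4", "t_2": "c_3"}
score = mean_f2(truth, guess)
print(score)
```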

## Baseline Score

In [46]:
%%time
dummy = DummyClassifier (strategy = 'stratified')
multi_target_dummy = MultiOutputClassifier(dummy, n_jobs=-1)
multi_target_dummy.fit(X_train, y_train)

CPU times: user 11.5 ms, sys: 894 µs, total: 12.4 ms
Wall time: 46.7 ms




MultiOutputClassifier(estimator=DummyClassifier(strategy='stratified'),
                      n_jobs=-1)

In [47]:
%%time
y_pred = multi_target_dummy.predict (X_val)
y_pred.shape

CPU times: user 8.24 ms, sys: 0 ns, total: 8.24 ms
Wall time: 15.4 ms


(5, 2)

In [48]:
model_scoring (y_train, y_pred)

Precision: 0.03333333333333333
Recall: 0.1
F2 macro: 0.08928571428571429
F2 micro: 0.1




(0.03333333333333333, 0.1, 0.08928571428571429, 0.1)

## GradientBoosting

In [49]:
%%time
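from sklearn.ensemble import GradientBoostingClassifier

# MultiOutputClassifier wraps a classifier and the targets are string ids,
# so GradientBoostingClassifier replaces GradientBoostingRegressor here
model = GradientBoostingClassifier(
    n_estimators=10, 
    learning_rate=0.1,
    max_depth=3
)
multi_target_GB = MultiOutputClassifier(model, n_jobs=-1)
multi_target_GB.fit(X_train, y_train)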

ValueError: ignored

In [None]:
%%time
y_pred = multi_target_GB.predict (X_val)
y_pred

CPU times: user 9.18 ms, sys: 1.96 ms, total: 11.1 ms
Wall time: 19 ms


array([['t_001a1575f24a', 'c_433f60c8c551'],
       ['t_0008a1bd84ba', 'c_433f60c8c551'],
       ['t_0008a1bd84ba', 'c_e1d73acfff8e'],
       ['t_0008a1bd84ba', 'c_11a1dc0bfb99'],
       ['t_0016fa92caf4', 'c_8790b074383e']], dtype=object)

In [None]:
model_scoring (y_train, y_pred)

## Random Forest

## XGBoost

## GradientBoosting

# Submissions

In [None]:
# load 'sample_submission' data into pandas dataframe
df_sample = pd.read_csv(f'{drive_path}sample_submission.csv')
df_sample

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
1,t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
2,t_00069b63a70a,c_11a1dc0bfb99
3,t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
4,t_4054df11a74e,c_3695c5dc1df6 c_f2d184a98231
