<a href="https://colab.research.google.com/github/sofiammatias/learning-equality/blob/main/Learning_Equality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle Challenge - Learning Equality
https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The National Football League (NFL) is back with another Big Data Bowl, where contestants use Next Gen Stats player tracking data to generate actionable, creative, and novel stats. Previous iterations have considered running backs, defensive backs, and special teams, and have generated metrics that have been used on television and by NFL teams. In this year’s competition, you’ll have more subtle performances to consider—and potentially more players to measure.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~

## Scoring
Mean F2 score

# Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{1}{ \text{Benchmark} - \max\text{F2} }\text{F2} + \frac{1}{32400}\text{RuntimeSeconds} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum  of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

# Data

## Imports

In [109]:
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [110]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.metrics import fbeta_score, precision_score, recall_score, make_scorer
from sklearn.dummy import DummyClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Data Collection

## Helper functions and classes for data traversal (breadcrumbs, topic context)

In [None]:
# define some helper functions and classes to aid with data traversal

class ContentItem:
    """
    Class to handle content relations (breadcrumbs)
    """
    def __init__(self, content_id):
        self.id = content_id

    @property
    def topics(self):
        return [Topic(topic_id) for topic_id in df_topics.loc[df_corr[df_corr.content_ids.str.contains(self.id)].index].index]

    def __getattr__(self, name):
        return df_content.loc[self.id][name]

    def __str__(self):
        return self.title

    def __repr__(self):
        return f"<ContentItem(id={self.id}, title=\"{self.title}\")>"

    def __eq__(self, other):
        if not isinstance(other, ContentItem):
            return False
        return self.id == other.id

    def get_all_breadcrumbs(self, separator=" >> ", include_root=True):
        breadcrumbs = []
        title = self.title
        for topic in self.topics:
            new_breadcrumb = topic.get_breadcrumbs(separator=separator, include_root=include_root)
            if new_breadcrumb:
                new_breadcrumb = new_breadcrumb + separator + title
            else:
                new_breadcrumb = title
            breadcrumbs.append(new_breadcrumb)
        return breadcrumbs

class Topic:
    """
    Class to handle topics relations (breadcrumbs)
    """
    def __init__(self, topic_id):
        self.id = topic_id

    @property
    def parent(self):
        parent_id = df_topics.loc[self.id].parent
        if pd.isna(parent_id):
            return None
        else:
            return Topic(parent_id)

    @property
    def ancestors(self):
        ancestors = []
        parent = self.parent
        while parent is not None:
            ancestors.append(parent)
            parent = parent.parent
        return ancestors

    @property
    def siblings(self):
        if not self.parent:
            return []
        else:
            return [topic for topic in self.parent.children if topic != self]

    @property
    def content(self):
        if self.id in df_corr.index:
            return [ContentItem(content_id) for content_id in df_corr.loc[self.id].content_ids.split()]
        else:
            return tuple([]) if self.has_content else []

    def get_breadcrumbs(self, separator=" >> ", include_self=True, include_root=True):
        ancestors = self.ancestors
        if include_self:
            ancestors = [self] + ancestors
        if not include_root:
            ancestors = ancestors[:-1]
        return separator.join(reversed([a.title for a in ancestors]))

    @property
    def children(self):
        return [Topic(child_id) for child_id in df_topics[df_topics.parent == self.id].index]

    def subtree_markdown(self, depth=0):
        markdown = "  " * depth + "- " + self.title + "\n"
        for child in self.children:
            markdown += child.subtree_markdown(depth=depth + 1)
        for content in self.content:
            markdown += ("  " * (depth + 1) + "- " + "[" + content.kind.title() + "] " + content.title) + "\n"
        return markdown

    def __eq__(self, other):
        if not isinstance(other, Topic):
            return False
        return self.id == other.id

    def __getattr__(self, name):
        return df_topics.loc[self.id][name]

    def __str__(self):
        return self.title
    
    def __repr__(self):
        return f"<Topic(id={self.id}, title=\"{self.title}\")>"

## Load dataframes

In [None]:
drive_path = '/content/drive/My Drive/Colab Notebooks/learning_equality/'

# load 'topics' data into pandas dataframe
df_topics = pd.read_csv(f'{drive_path}topics.csv', index_col=0).fillna({"title": "", "description": ""})
df_topics

Unnamed: 0_level_0,title,description,channel,category,level,language,parent,has_content
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True
...,...,...,...,...,...,...,...,...
t_fffb0bf2801d,4.3 Graph of functions,,e77b55,aligned,4,en,t_676e6a1a4dc7,False
t_fffbe1d5d43c,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True
t_fffe14f1be1e,Lección 7,,6e90a7,aligned,6,es,t_d448c707984d,True
t_fffe811a6da9,تحديد العلاقة بين الإحداثيّات القطبية والإحداث...,5b9e5ca86571f90499ea987f,9fd860,source,2,ar,t_5b4f3ba4eb7d,True


In [None]:
# load 'content' data into pandas dataframe
df_content = pd.read_csv(f'{drive_path}content.csv', index_col=0).fillna("")
df_content

Unnamed: 0_level_0,title,description,kind,text,language,copyright_holder,license
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA
...,...,...,...,...,...,...,...
c_fffcbdd4de8b,2. 12: Diffusion,,html5,What will eventually happen to these dyes?\n\n...,en,CSU and Merlot,CC BY-NC-SA
c_fffe15a2d069,Sommare facendo gruppi da 10,Sal somma 5+68 spezzando il 5 in un 2 e un 3.\n\n,video,,it,,
c_fffed7b0d13a,Introdução à subtração,Sal fala sobre o que significa subtrair. Os ex...,video,,pt,,
c_ffff04ba7ac7,SA of a Cone,,video,,en,,


In [None]:
# load 'correlations' data into pandas dataframe
df_corr = pd.read_csv(f'{drive_path}correlations.csv', index_col=0)
df_corr

Unnamed: 0_level_0,content_ids
topic_id,Unnamed: 1_level_1
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4
...,...
t_fff830472691,c_61fb63326e5d c_8f224e321c87
t_fff9e5407d13,c_026db653a269 c_0fb048a6412c c_20de77522603 c...
t_fffbe1d5d43c,c_46f852a49c08 c_6659207b25d5
t_fffe14f1be1e,c_cece166bad6a


In [None]:
# load 'topic_breadcrumbs' data into pandas dataframe
df_topic_breadcrumbs = pd.read_csv(f'{drive_path}topic_breadcrumbs.csv', index_col=0)
df_topic_breadcrumbs

Unnamed: 0,topic_breadcrumbs
t_00004da3a1b2,Khan Academy (български език) Наука Физика Отк...
t_000095e03056,Ghana JHS Curriculum (in progress) Junior High...
t_00068291e9a4,Khan Academy (Português (Brasil)) Matemática p...
t_00069b63a70a,MIT Blossoms Engineering Flow Charts: Logical ...
t_0006d41a73a8,Khan Academy (български език) Математика Алгеб...
...,...
t_fffb0bf2801d,Camara Education Ethiopia Secondary Science an...
t_fffbe1d5d43c,Khan Academy (Kiswahili) Hisabati Class 9 (Ind...
t_fffe14f1be1e,CREE Para el Estudiante I Ciclo 01 Primero Esp...
t_fffe811a6da9,Edraak (العربيّة) الهندسة وعلم المثلثات تحديد ...


In [None]:
df_topic_breadcrumbs['topic_breadcrumbs']

t_00004da3a1b2    Khan Academy (български език) Наука Физика Отк...
t_000095e03056    Ghana JHS Curriculum (in progress) Junior High...
t_00068291e9a4    Khan Academy (Português (Brasil)) Matemática p...
t_00069b63a70a    MIT Blossoms Engineering Flow Charts: Logical ...
t_0006d41a73a8    Khan Academy (български език) Математика Алгеб...
                                        ...                        
t_fffb0bf2801d    Camara Education Ethiopia Secondary Science an...
t_fffbe1d5d43c    Khan Academy (Kiswahili) Hisabati Class 9 (Ind...
t_fffe14f1be1e    CREE Para el Estudiante I Ciclo 01 Primero Esp...
t_fffe811a6da9    Edraak (العربيّة) الهندسة وعلم المثلثات تحديد ...
t_fffe88835149    Kolibri Library for Kenya Form 1 Physics 8.0 R...
Name: topic_breadcrumbs, Length: 76972, dtype: object

In [None]:
# load 'content_breadcrumbs' data into pandas dataframe
#df_content_breadcrumbs = pd.read_csv(f'{drive_path}content_breadcrumbs.csv', index_col=0)
#df_content_breadcrumbs

Unnamed: 0,topic_breadcrumbs
t_00004da3a1b2,Khan Academy (български език) Наука Физика Отк...
t_000095e03056,Ghana JHS Curriculum (in progress) Junior High...
t_00068291e9a4,Khan Academy (Português (Brasil)) Matemática p...
t_00069b63a70a,MIT Blossoms Engineering Flow Charts: Logical ...
t_0006d41a73a8,Khan Academy (български език) Математика Алгеб...
...,...
t_fffb0bf2801d,Camara Education Ethiopia Secondary Science an...
t_fffbe1d5d43c,Khan Academy (Kiswahili) Hisabati Class 9 (Ind...
t_fffe14f1be1e,CREE Para el Estudiante I Ciclo 01 Primero Esp...
t_fffe811a6da9,Edraak (العربيّة) الهندسة وعلم المثلثات تحديد ...


In [None]:
# Restructure 'df_corr' (correlations): explode the target column to be more practical to use - one topic -> one content
y = df_corr.copy()
y['content_ids'] = y.content_ids.str.split(' ')
y = y.explode('content_ids')
y.reset_index(inplace=True)
y

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
...,...,...
279914,t_fff9e5407d13,c_d64037a72376
279915,t_fffbe1d5d43c,c_46f852a49c08
279916,t_fffbe1d5d43c,c_6659207b25d5
279917,t_fffe14f1be1e,c_cece166bad6a


## Create breadcrumbs for topics

In [None]:
%%time
# - Create breadcrumbs for topics and save in csv file
#  these command lines were spending 1h20m without finishingt:
#       df_X['topic_id'].apply (lambda topic: Topic(topic).get_breadcrumbs()))
#       df_X['content_ids'].apply (lambda content: ContentItem(content).get_all_breadcrumbs())
# with for loop + list append it took 5m for topic

topic_breadcrumbs=[]
topic_ids=[]
for i, topic in enumerate(df_topics.reset_index()['id']):
  topic_ids.append (topic)
  topic_breadcrumbs.append (Topic(topic).get_breadcrumbs(separator=' '))
  print (i, ' : ', topic, ' - ', topic_breadcrumbs[-1])

# Save 'topic_breadcrumbs'
df_topic_breadcrumbs = pd.DataFrame (topic_breadcrumbs, columns='breadcrumbs', index_col=topic_ids)
df_topic_breadcrumbs.to_csv(f'{drive_path}topic_breadcrumbs.csv')

In [None]:
#%%time
# - Create breadcrumbs for content and save in csv file
# - In alternative use, but it seems to take a long time
#       df_X['content_ids'].apply (lambda content: ContentItem(content).get_all_breadcrumbs())
# - Use this ONLY if prediction results are not good

# Create list
#content_breadcrumbs=[]
#content_ids=[]
#for i, content in enumerate(df_content.reset_index()['id']):
#  content_ids.append (content)
#  content_breadcrumbs.append (ContentItem(content).get_all_breadcrumbs(separator=' '))
#  print (i, ' : ', content, ' - ', content_breadcrumbs[-1])

# Save 'content_breadcrumbs'
# df_content_breadcrumbs = pd.DataFrame (content_breadcrumbs, columns='breadcrumbs', index_col=content_ids)
# df_content_breadcrumbs.to_csv(f'{drive_path}content_breadcrumbs.csv')

## Data Cleaning

In [None]:
# - Remove column 'copyright_holder' as it doesn't add to features
# - Remove column 'has_content'
# - Change 'level' column integers into strings (integers may give a sense of "order")
# - Filter by language: start with english only and consider training the model with various datasets, filtered by language
# - Remove columns 'language_x' and 'language_y' after filtering
# - Clean 'text' column from special chars

levels = {0: 'zero', 1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five', 6: 'six', 7: 'seven', 8: 'eight', 9: 'nine', 10: 'ten'}
langs = df_topics['language'].unique()[1] # ['bg', 'en', 'pt', 'gu', 'my', 'zh', 'ar', 'te', 'es', 'fr', 'sw', 'mr', 'hi', 'bn', 'fil', 'ru', 'it', 'or', 'pnb', 'km', 'as', 'kn','ur', 'pl', 'ta', 'swa', 'tr', 'mul']

df_X = y.copy()
topics = df_topics.copy()
topics = topics.merge (df_topic_breadcrumbs, how='outer', right_index=True, left_index=True)
topics['level'] = topics['level'].replace (levels)
topics = topics.reset_index().rename(columns={'id': 'topic_id'})
df_X = df_X.merge (topics, on='topic_id', how='outer')
contents = df_topics.reset_index().rename(columns={'id': 'topic_id'})
#contents['content_breadcrumbs'] = content_breadcrumbs                    
df_X = df_X.merge (df_content.reset_index().rename(columns={'id': 'content_ids'}), on='content_ids', how='left')
df_X = df_X[df_X.language_x == langs].copy()
df_X.fillna("", inplace=True)
y_lang = df_X[['topic_id', 'content_ids']]
df_X.drop (columns = ['copyright_holder', 'has_content', 'language_x', 'language_y'], inplace = True)
df_X.rename(columns={'title_x': 'title_topic', 'title_y': 'title_content', 'description_x': 'description_topic', 'description_y': 'description_content', 'text': 'text_content'}, inplace = True)
df_X['text_content'] = df_X['text_content'].str.replace('\W', ' ', regex=True)
X = df_X.drop (columns = ['topic_id', 'content_ids', 'has_content'])
df_X

Unnamed: 0,topic_id,content_ids,title_topic,description_topic,channel,category,level,parent,has_content,topic_breadcrumbs,title_content,description_content,kind,text_content,license
8,t_00069b63a70a,c_11a1dc0bfb99,Transcripts,,6e3ba4,source,three,t_4054df11a74e,True,MIT Blossoms Engineering Flow Charts: Logical ...,Flow Charts: Logical..: Written Transcript of ...,"Written Transcript of this video lesson, Engli...",document,Español Título de la lección ...,CC BY-NC-SA
17,t_0008a1bd84ba,c_7ff92a954a3d,12. 20: Bird Reproduction,,ebc86c,supplemental,five,t_c44ac9711007,True,Libretext Open Educational Resource Library Bo...,12. 20: Bird Reproduction,,html5,Is this pair of birds actually a couple Ye...,CC BY-NC-SA
18,t_0008a1bd84ba,c_8790b074383e,12. 20: Bird Reproduction,,ebc86c,supplemental,five,t_c44ac9711007,True,Libretext Open Educational Resource Library Bo...,Astounding Mating Dance Birds of Paradise -- H...,The Birds of Paradise from BBC's outstanding P...,video,,
19,t_000d1fb3f2f5,c_07f1d0eec4b2,2.1.2 - Logarithms,,e77b55,aligned,five,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...,Proof of the logarithm change of base rule,Sal proves the logarithmic change of base rule...,video,What I want to do in this video is prove the c...,CC BY-NC-SA
20,t_000d1fb3f2f5,c_15a6fb858696,2.1.2 - Logarithms,,e77b55,aligned,five,t_b897d168db90,True,Camara Education Ethiopia Secondary Science an...,Intro to logarithm properties (2 of 2),Sal introduces the logarithm identities for mu...,video,PROFESSOR Welcome back I m going to show you...,CC BY-NC-SA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295369,t_ffe63e436874,,Orientation to the school,,2293a4,aligned,three,t_0e5ea30f4489,False,Ubongo Kids (Malawi) Standard One Introduction...,,,,,
295370,t_ffe9f9b46542,,Vocabulary,,b4897d,supplemental,three,t_e05c0dbc0667,False,BB-English Maharashtra Board 7 English Vocabulary,,,,,
295371,t_fff80f4eee89,,Polynomial division,"After we have added, subtracted, and multiplie...",6b09a4,supplemental,five,t_e93f4fad3893,False,Fiji Y10 Mathematics Y12 Mathematics Algebra R...,,,,,
295372,t_fffb0bf2801d,,4.3 Graph of functions,,e77b55,aligned,four,t_676e6a1a4dc7,False,Camara Education Ethiopia Secondary Science an...,,,,,


## Sentence transformer

In [161]:
# For X, using columns ['title_topic', 'description_topic', 'channel', 'category', 'level', 'parent', 'has_content', 'topic_breadcrumbs', 'title_content',
#                       'description_content', 'kind', 'text_content', 'license']
# Dropping columns ['parent', 'text_content', 'license']
X = X.drop(columns = ['parent', 'text_content', 'license'])

In [167]:
%%time
backbone = 'all-MiniLM-L6-v2' #explore also msmarco models and multilanguage

def SentenceTransf (backbone, X):
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer(backbone)
  print ('=== SENTENCE-TRANSFORMER ==',backbone, '===')
  stopper = 5
  embeds = []
  #initializing model
  sentence = X.iloc[0].astype(str).values.tolist()
  X_embeddings = pd.DataFrame([])
  embeddings = model.encode(sentence)
  i = 0
  # running model along all X rows
  for _, row in X.iterrows():
    print ('Transforming row ', i)
    sentence = row.astype(str).values.tolist()
    embeddings = model.encode(sentence)
    X_embeddings = X_embeddings.append(pd.DataFrame(zip(embeddings)).transpose(), ignore_index=True)
    i = i+1
    if i == stopper:
      break

  return X_embeddings

CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 13.6 µs


# Model

## Get X and y

In [178]:
cat_for_val = 'aligned'
X_val_all = df_X[df_X.category == cat_for_val].copy().drop(columns=['topic_id', 'content_ids'])
y_val_all = df_X[df_X.category == cat_for_val].copy().loc[:, ('topic_id', 'content_ids')]
X_train_all = df_X[df_X.category != cat_for_val].copy().drop(columns=['topic_id', 'content_ids'])
y_train_all = df_X[df_X.category != cat_for_val].copy().loc[:, ('topic_id', 'content_ids')]

In [179]:
limit = 5
X_train = SentenceTransf (backbone, X_train_all[:limit])
y_train = y_train_all[:limit]
X_val = SentenceTransf (backbone, X_val_all[:limit])
y_val = y_val_all[:limit]

=== SENTENCE-TRANSFORMER == all-MiniLM-L6-v2 ===
Transforming row  0
Transforming row  1
Transforming row  2
Transforming row  3
Transforming row  4
=== SENTENCE-TRANSFORMER == all-MiniLM-L6-v2 ===
Transforming row  0
Transforming row  1
Transforming row  2
Transforming row  3
Transforming row  4


In [174]:
X_val.shape

(5, 13)

## Scoring

Scoring is based in mean F2 score

In [None]:
def model_scoring (y_test, y_pred, average='weighted'):
  """
   Calculate precision, recall and f2-score for y_test and y_pred
  """
  Precision = []
  Precision.append (precision_score(y_test['topic_id'], y_pred[:,0], average=average))
  Precision.append (precision_score(y_test['content_ids'], y_pred[:,1], average=average))
  precision_mean = np.mean(Precision)
  print ('Precision:', precision_mean)
  Recall = []
  Recall.append (recall_score(y_test['topic_id'], y_pred[:,0], average=average))
  Recall.append (recall_score(y_test['content_ids'], y_pred[:,1], average=average))
  recall_mean = np.mean(Recall)
  print ('Recall:', recall_mean)
  F2macro = []
  F2macro.append (fbeta_score(y_test['topic_id'], y_pred[:,0], beta=2, average='macro'))
  F2macro.append (fbeta_score(y_test['content_ids'], y_pred[:,1], beta=2, average='macro'))
  F2macro_mean = np.mean(F2macro)
  print ('F2 macro:', F2macro_mean)
  F2micro = []
  F2micro.append (fbeta_score(y_test['topic_id'], y_pred[:,0], beta=2, average='micro'))
  F2micro.append (fbeta_score(y_test['content_ids'], y_pred[:,1], beta=2, average='micro'))
  F2micro_mean = np.mean(F2micro)
  print ('F2 micro:', F2micro_mean)
  return precision_mean, recall_mean, F2macro_mean, F2micro_mean

## Baseline Score

In [180]:
%%time
dummy = DummyClassifier (strategy = 'stratified')
multi_target_dummy = MultiOutputClassifier(dummy, n_jobs=-1)
multi_target_dummy.fit(X_train, y_train)

CPU times: user 42.1 ms, sys: 265 ms, total: 307 ms
Wall time: 1.81 s




MultiOutputClassifier(estimator=DummyClassifier(strategy='stratified'),
                      n_jobs=-1)

In [181]:
%%time
y_pred = multi_target_dummy.predict (X_val)
y_pred

CPU times: user 9.18 ms, sys: 1.96 ms, total: 11.1 ms
Wall time: 19 ms


array([['t_001a1575f24a', 'c_433f60c8c551'],
       ['t_0008a1bd84ba', 'c_433f60c8c551'],
       ['t_0008a1bd84ba', 'c_e1d73acfff8e'],
       ['t_0008a1bd84ba', 'c_11a1dc0bfb99'],
       ['t_0016fa92caf4', 'c_8790b074383e']], dtype=object)

In [182]:
model_scoring (y_train, y_pred)

Precision: 0.13333333333333333
Recall: 0.2
F2 macro: 0.11363636363636363
F2 micro: 0.2


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


(0.13333333333333333, 0.2, 0.11363636363636363, 0.2)

## SVM + GridSearch

In [None]:
%%time
y_pred = multi_target_dummy.predict (X_val)
y_pred

CPU times: user 9.18 ms, sys: 1.96 ms, total: 11.1 ms
Wall time: 19 ms


array([['t_001a1575f24a', 'c_433f60c8c551'],
       ['t_0008a1bd84ba', 'c_433f60c8c551'],
       ['t_0008a1bd84ba', 'c_e1d73acfff8e'],
       ['t_0008a1bd84ba', 'c_11a1dc0bfb99'],
       ['t_0016fa92caf4', 'c_8790b074383e']], dtype=object)

## Random Forest

## XsBoost

## GradientBoosting

# Submissions

In [186]:
# load 'correlations' data into pandas dataframe
df_sample = pd.read_csv(f'{drive_path}sample_submission.csv')
df_sample

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
1,t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
2,t_00069b63a70a,c_11a1dc0bfb99
3,t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
4,t_4054df11a74e,c_3695c5dc1df6 c_f2d184a98231
