<a href="https://colab.research.google.com/github/sofiammatias/learning-equality/blob/main/Learning_Equality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle Challenge - Learning Equality

https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The goal of this competition is to streamline the process of matching educational content to specific topics in a curriculum. You will develop an accurate and efficient model trained on a library of K-12 educational materials that have been organized into a variety of topic taxonomies. These materials are in diverse languages, and cover a wide range of topics, particularly in STEM (Science, Technology, Engineering, and Mathematics).

Your work will enable students and educators to more readily access relevant educational content to support and supplement learning.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~
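
As a minimal sketch of how a file in this format could be produced, the snippet below collapses long-format predictions (one row per topic-content pair) into one space-delimited row per topic. The `predictions` frame, its `content_id` column and the `to_submission` helper are illustrative assumptions, not part of the competition data or of this notebook.

```python
import pandas as pd

# Hypothetical long-format predictions: one row per (topic, content) pair.
predictions = pd.DataFrame({
    "topic_id": ["t_00004da3a1b2", "t_00004da3a1b2", "t_00069b63a70a"],
    "content_id": ["c_1108dd0c7a5d", "c_376c5a8eb028", "c_11a1dc0bfb99"],
})


def to_submission(pred: pd.DataFrame) -> pd.DataFrame:
    """Collapse long-format predictions into one space-delimited row per topic."""
    return (
        pred.groupby("topic_id", sort=True)["content_id"]
        .agg(" ".join)            # join all content ids of a topic with spaces
        .rename("content_ids")    # column name required by the submission format
        .reset_index()
    )


submission = to_submission(predictions)
# Render as CSV text with a header and no index column.
print(submission.to_csv(index=False))
```

Writing the same frame with `submission.to_csv("submission.csv", index=False)` would give the header line followed by one row per topic.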

## Scoring
Mean F2 score

## Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{1}{ \text{Benchmark} - \max\text{F2} }\text{F2} + \frac{1}{32400}\text{RuntimeSeconds} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum F2 of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.
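
The efficiency formula above can be checked with a few lines of Python. This is only a sketch: the function name and the numbers are made up for illustration and are not real leaderboard values. Note that when the benchmark scores below the best submission, the first term is negative, so a higher F2 lowers (improves) the score.

```python
def efficiency_score(f2: float, benchmark: float, max_f2: float,
                     runtime_seconds: float) -> float:
    """Efficiency = F2 / (Benchmark - maxF2) + RuntimeSeconds / 32400 (lower is better)."""
    return f2 / (benchmark - max_f2) + runtime_seconds / 32400


# Illustrative numbers only: same F2, different runtimes.
fast = efficiency_score(f2=0.5, benchmark=0.1, max_f2=0.7, runtime_seconds=3240)
slow = efficiency_score(f2=0.5, benchmark=0.1, max_f2=0.7, runtime_seconds=32400)
print(fast, slow)
```

With these inputs the faster run gets the lower score, since the runtime term grows linearly up to 1 at 32400 seconds (9 hours).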

# Data

## Imports

In [37]:
!pip install -U sentence-transformers


In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import fbeta_score, precision_score, recall_score, make_scorer
from sklearn.dummy import DummyClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Data Collection

## Helper functions and classes for data traversal (breadcrumbs = topic context)

In [39]:
# define some helper functions and classes to aid with data traversal

class ContentItem:
    """
    Class to handle content relations (breadcrumbs)
    """
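    def __init__(self, content_id):
        self.id = content_id

    def __getattr__(self, name):
        # fall back to the content item's row in df_content for attributes such as 'title'
        return df_content.loc[self.id][name]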

    @property
    def topics(self):
        return [Topic(topic_id) for topic_id in df_topics.loc[df_corr[df_corr.content_ids.str.contains(self.id)].index].index]

    def get_all_breadcrumbs(self, separator=" >> ", include_root=True):
        breadcrumbs = []
        title = self.title
        for topic in self.topics:
            new_breadcrumb = topic.get_breadcrumbs(separator=separator, include_root=include_root)
            if new_breadcrumb:
                new_breadcrumb = new_breadcrumb + separator + title
            else:
                new_breadcrumb = title
            breadcrumbs.append(new_breadcrumb)
        return breadcrumbs

class Topic:
    """
    Class to handle topics relations (breadcrumbs)
    """
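    def __init__(self, topic_id):
        self.id = topic_id

    def __getattr__(self, name):
        # fall back to the topic's row in df_topics for attributes such as 'title' and 'has_content'
        return df_topics.loc[self.id][name]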

    @property
    def parent(self):
        parent_id = df_topics.loc[self.id].parent
        if pd.isna(parent_id):
            return None
        else:
            return Topic(parent_id)

    @property
    def ancestors(self):
        ancestors = []
        parent = self.parent
        while parent is not None:
            ancestors.append(parent)
            parent = parent.parent
        return ancestors

    @property
    def content(self):
        if self.id in df_corr.index:
            return [ContentItem(content_id) for content_id in df_corr.loc[self.id].content_ids.split()]
        else:
            return tuple([]) if self.has_content else []

    def get_breadcrumbs(self, separator=" >> ", include_self=True, include_root=True):
        ancestors = self.ancestors
        if include_self:
            ancestors = [self] + ancestors
        if not include_root:
            ancestors = ancestors[:-1]
        return separator.join(reversed([a.title for a in ancestors]))

## Load dataframes

In [57]:
#drive_path = '/content/drive/My Drive/Colab Notebooks/learning_equality/'
drive_path = '/kaggle/input/learning-equality-curriculum-recommendations/'
dataset_path = '/kaggle/input/learning-equality-files/'
output_path = '/kaggle/output/'
work_path = '/kaggle/working/'

# load 'topics' data into pandas dataframe
df_topics = pd.read_csv(f'{drive_path}topics.csv', index_col=0).fillna({"title": "", "description": ""})
df_topics

id,title,description,channel,category,level,language,parent,has_content
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True
...,...,...,...,...,...,...,...,...
t_fffb0bf2801d,4.3 Graph of functions,,e77b55,aligned,4,en,t_676e6a1a4dc7,False
t_fffbe1d5d43c,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True
t_fffe14f1be1e,Lección 7,,6e90a7,aligned,6,es,t_d448c707984d,True
t_fffe811a6da9,تحديد العلاقة بين الإحداثيّات القطبية والإحداث...,5b9e5ca86571f90499ea987f,9fd860,source,2,ar,t_5b4f3ba4eb7d,True


In [41]:
# load 'content' data into pandas dataframe
df_content = pd.read_csv(f'{drive_path}content.csv', index_col=0).fillna("")
df_content

id,title,description,kind,text,language,copyright_holder,license
c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA
...,...,...,...,...,...,...,...
c_fffcbdd4de8b,2. 12: Diffusion,,html5,What will eventually happen to these dyes?\n\n...,en,CSU and Merlot,CC BY-NC-SA
c_fffe15a2d069,Sommare facendo gruppi da 10,Sal somma 5+68 spezzando il 5 in un 2 e un 3.\n\n,video,,it,,
c_fffed7b0d13a,Introdução à subtração,Sal fala sobre o que significa subtrair. Os ex...,video,,pt,,
c_ffff04ba7ac7,SA of a Cone,,video,,en,,


In [42]:
# load 'correlations' data into pandas dataframe
df_corr = pd.read_csv(f'{drive_path}correlations.csv', index_col=0)
df_corr

topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4
...,...
t_fff830472691,c_61fb63326e5d c_8f224e321c87
t_fff9e5407d13,c_026db653a269 c_0fb048a6412c c_20de77522603 c...
t_fffbe1d5d43c,c_46f852a49c08 c_6659207b25d5
t_fffe14f1be1e,c_cece166bad6a


In [43]:
# load 'topic_breadcrumbs' data into pandas dataframe
df_topic_breadcrumbs = pd.read_csv(f'{dataset_path}topic_breadcrumbs.csv', index_col=0)
df_topic_breadcrumbs

,topic_breadcrumbs
t_00004da3a1b2,Khan Academy (български език) Наука Физика Отк...
t_000095e03056,Ghana JHS Curriculum (in progress) Junior High...
t_00068291e9a4,Khan Academy (Português (Brasil)) Matemática p...
t_00069b63a70a,MIT Blossoms Engineering Flow Charts: Logical ...
t_0006d41a73a8,Khan Academy (български език) Математика Алгеб...
...,...
t_fffb0bf2801d,Camara Education Ethiopia Secondary Science an...
t_fffbe1d5d43c,Khan Academy (Kiswahili) Hisabati Class 9 (Ind...
t_fffe14f1be1e,CREE Para el Estudiante I Ciclo 01 Primero Esp...
t_fffe811a6da9,Edraak (العربيّة) الهندسة وعلم المثلثات تحديد ...


In [44]:
# Restructure 'df_corr' (correlations): explode the target column to be more practical to use - one topic -> one content
y = df_corr.copy()
y['content_ids'] = y.content_ids.str.split(' ')
y = y.explode('content_ids')
y.reset_index(inplace=True)
y

,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
...,...,...
279914,t_fff9e5407d13,c_d64037a72376
279915,t_fffbe1d5d43c,c_46f852a49c08
279916,t_fffbe1d5d43c,c_6659207b25d5
279917,t_fffe14f1be1e,c_cece166bad6a


## Create breadcrumbs for topics

In [45]:
#%%time
# - Create breadcrumbs for topics and save in csv file
#  these commands ran for 1h20m without finishing:
#       df_X['topic_id'].apply (lambda topic: Topic(topic).get_breadcrumbs()))
#       df_X['content_ids'].apply (lambda content: ContentItem(content).get_all_breadcrumbs())
# with a for loop + list append it took 5m for the topics

#topic_breadcrumbs=[]
#topic_ids=[]
#for i, topic in enumerate(df_topics.reset_index()['id']):
#    topic_ids.append (topic)
#    topic_breadcrumbs.append (Topic(topic).get_breadcrumbs(separator=' '))
#    print (i, ' : ', topic, ' - ', topic_breadcrumbs[-1])

# Save 'topic_breadcrumbs'
#df_topic_breadcrumbs = pd.DataFrame ({'topic_breadcrumbs': topic_breadcrumbs}, index=topic_ids)
#df_topic_breadcrumbs.to_csv(f'{drive_path}topic_breadcrumbs.csv')
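
One possible way to avoid repeated `df_topics.loc` lookups when building breadcrumbs is to walk plain dictionaries and cache each resolved path. The sketch below assumes a frame with `title` and `parent` columns indexed by topic id, like `df_topics`. The toy ids and titles and the `build_breadcrumbs` helper are hypothetical, and the approach has not been timed on the full dataset.

```python
import pandas as pd

# Toy stand-in for df_topics (hypothetical ids and titles); the root has no parent.
toy_topics = pd.DataFrame(
    {"title": ["Channel", "Algebra", "Functions"],
     "parent": [None, "t_root", "t_alg"]},
    index=["t_root", "t_alg", "t_fun"],
)


def build_breadcrumbs(topics: pd.DataFrame, separator: str = " ") -> pd.Series:
    """Return 'root ... topic' title paths, resolving each topic only once."""
    titles = topics["title"].to_dict()
    parents = topics["parent"].to_dict()
    cache = {}

    def resolve(topic_id):
        if topic_id in cache:
            return cache[topic_id]
        parent_id = parents.get(topic_id)
        if pd.isna(parent_id):
            # root topic: the breadcrumb is just its own title
            crumb = titles[topic_id]
        else:
            # reuse the parent's (possibly cached) breadcrumb
            crumb = resolve(parent_id) + separator + titles[topic_id]
        cache[topic_id] = crumb
        return crumb

    return pd.Series({tid: resolve(tid) for tid in topics.index},
                     name="topic_breadcrumbs")


breadcrumbs = build_breadcrumbs(toy_topics)
print(breadcrumbs)
```

The resulting order (root first, topic last) mirrors `Topic.get_breadcrumbs(separator=' ')` with its default arguments.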

# Data Cleaning

## Data cleaning params

In [46]:
levels = {0: 'zero', 1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five', 6: 'six', 7: 'seven', 8: 'eight', 9: 'nine', 10: 'ten'}
langs_t = pd.DataFrame (df_topics['language'].unique(), columns=['lang']) 
langs_c = pd.DataFrame (df_content['language'].unique(), columns=['lang']) 
langs = langs_t.merge (langs_c, how='left')
# Output: ['bg', 'en', 'pt', 'gu', 'my', 'zh', 'ar', 'te', 'es', 'fr', 'sw', 'mr', 'hi', 'bn', 'fil', 'ru', 'it', 'or', 'pnb', 'km', 'as', 'kn','ur', 'pl', 'ta', 'swa', 'tr', 'mul']
topic_cols = ['title', 'description', 'topic_breadcrumbs']
content_cols = ['title', 'description', 'kind', 'text']
cat_for_val = 'aligned'

## Data cleaning functions

In [47]:
# Filter by language
def filter_lang(X, lang):
    """
    Filter dataframes by 'language'
    """
    return X[X.language == lang].copy()


# Clean up text
def clean_text(text_col):
    """
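    Replace punctuation and special characters in a dataframe column with spaces
    """
    # \W matches every non-word character (punctuation, newlines, tabs, ...);
    # '_' is the only string.punctuation character that counts as a word character.
    # A single character class also avoids passing raw punctuation such as '(' to the regex engine.
    text_col = text_col.str.replace(r'[\W_]', ' ', regex=True)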
    return text_col
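
As a small illustration of the cleaning step, the sketch below applies a single character class to a toy Series. The sample strings are made up, and the whitespace-collapsing step at the end is an optional extra that the notebook itself does not perform.

```python
import pandas as pd

sample = pd.Series(["Unit 3.3: Enlargements\n", "graph_of (functions)"])

# One pass: every non-word character and '_' becomes a single space.
cleaned = sample.str.replace(r"[\W_]", " ", regex=True)

# Optional extra step (not in the notebook): collapse runs of whitespace and trim.
normalized = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

print(normalized.tolist())
```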

## Topics


*   Change 'level' into strings instead of numbers
*   Filter by language (start with English, then repeat the whole process for each language, saving the model between languages)
*   Clean strings: punctuation and special chars (\n\t...)



In [48]:
topics_features = df_topics.copy()[df_topics.category != cat_for_val]
topics_features = topics_features.merge (df_topic_breadcrumbs, how='outer', right_index=True, left_index=True)
topics_features = filter_lang (topics_features, langs.iloc[1, 0])
for col in topic_cols:
    topics_features[col] = clean_text(topics_features[col])
display (topics_features.head())
topics_sentences = topics_features[topic_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)

,title,description,channel,category,level,language,parent,has_content,topic_breadcrumbs
t_00069b63a70a,Transcripts,,6e3ba4,source,3.0,en,t_4054df11a74e,True,MIT Blossoms Engineering Flow Charts Logical ...
t_0008a1bd84ba,12 20 Bird Reproduction,,ebc86c,supplemental,5.0,en,t_c44ac9711007,True,Libretext Open Educational Resource Library Bo...
t_000c0c854f0b,Nepal,,f2e966,source,1.0,en,t_f9c4ef0d6290,False,Touchable Earth en Nepal
t_001139e52dbe,Addition and Subtraction to 20,,fef095,source,5.0,en,t_f078c5cdba92,False,K 12 Math Elementary Math Grade 1 Addition and...
t_001477394250,Part II Computational Thinking Advanced,,c6dab5,source,2.0,en,t_9a2cd1d5f242,False,CSpathshala English Standard II Part II Co...


## Contents


*   Remove 'copyright_holder'
*   Filter by language (start with English, then repeat the whole process for each language, saving the model between languages)
*   Clean strings: punctuation and special chars (\n\t...)



In [49]:
content_features = df_content.copy().drop(columns=['copyright_holder'])
content_features = filter_lang (content_features, langs.iloc[1, 0])
for col in content_cols:
    content_features[col] = clean_text(content_features[col])
display (content_features.head())
content_sentences =  content_features[content_cols].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)

id,title,description,kind,text,language,license
c_00019840d110,5 12E Regulation of the Calvin Cycle,,html5,LEARNING OBJECTIVES Outline the three major p...,en,CC BY-NC-SA
c_00027d03ca7d,4 E Genomes and Chromosomes Exercises,,html5,4 3 BPA Answer the following questions with...,en,CC BY-NC-SA
c_000425df0161,9 3 Van Der Waals Forces between Atoms,,html5,Introduction The perfect gas equation of stat...,en,CC BY-NC-SA
c_00046806ad8a,Compare multi digit numbers,Use your place value skills to practice compar...,exercise,,en,
c_000751f58836,Tangents of circles problem example 2,Sal finds a missing angle using the property t...,video,Angle A is a circumscribed angle on circle O ...,en,CC BY-NC-SA


# Model

## Model params

In [50]:
backbone = 'all-MiniLM-L6-v2' #explore also msmarco models and multilanguage
dataset_limit = 10000
max_contents_per_topic = 5

## Sentence transformer function

In [51]:
def SentenceTransf (backbone, df):
    """
    Encode the first 'dataset_limit' sentences of a pandas Series with
    SentenceTransformer and return the matching index along with the embeddings
    """
    import datetime

    start = datetime.datetime.now()
    model = SentenceTransformer(backbone)
    print ('=== SENTENCE-TRANSFORMER ==',backbone, '===')
    # keep index and embeddings aligned: both cover only the first 'dataset_limit' rows
    sentences = df[:dataset_limit]
    embeds = []
    for i, sentence in enumerate(sentences):
        print ('Calculate embeddings for sentence ', i)
        embedding = model.encode(sentence)
        embeds.append(embedding)
    end = datetime.datetime.now()
    delta = (end - start).total_seconds()
    print ('Embeddings total time:', delta, ' s')

    return sentences.index, embeds

## Apply sentence transformer

*   Apply to 'topics_sentences' and 'contents_sentences'
*   Calculate 'similarity score'
*   Merge 'topics_sentences' and 'contents_sentences' based on similarity score
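
The similarity and ranking steps listed above can be sketched with plain NumPy once the embeddings exist. The toy 2-d vectors, the content ids and the `top_k_matches` helper are illustrative assumptions. The notebook itself uses `util.cos_sim` from sentence-transformers for the similarity.

```python
import numpy as np


def top_k_matches(topic_embeds, content_embeds, content_ids, k=5):
    """Return, for each topic, the k most similar content ids with their scores."""
    t = np.asarray(topic_embeds, dtype=float)
    c = np.asarray(content_embeds, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sims = t @ c.T                              # shape: (n_topics, n_contents)
    k = min(k, sims.shape[1])
    order = np.argsort(-sims, axis=1)[:, :k]    # best match first
    return [
        [(content_ids[j], float(sims[i, j])) for j in row]
        for i, row in enumerate(order)
    ]


# Toy 2-d embeddings (hypothetical values) just to show the ranking.
topics = [[1.0, 0.0]]
contents = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
matches = top_k_matches(topics, contents, ["c_a", "c_b", "c_c"], k=2)
print(matches)
```

Computing the full similarity matrix in one product avoids building and sorting a dataframe per topic, at the cost of holding an `n_topics x n_contents` array in memory.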


In [58]:
topics_index, topics_embeds = SentenceTransf (backbone, topics_sentences)
contents_index, contents_embeds = SentenceTransf (backbone, content_sentences)
list_tmp = []
for i, embed in enumerate (topics_embeds):
    cos_sim = util.cos_sim (embed, contents_embeds).cpu().detach().numpy()
    # rank every content item by its similarity to the current topic
    df_tmp = pd.DataFrame (zip (contents_index, cos_sim[0]), columns=['content_id', 'score'])
    df_tmp.sort_values (by='score', ascending = False, inplace=True)
    #print(f"Topic {topics_index[i]} matches content: {df_tmp.iloc[0,0]} with similarity score: {df_tmp.iloc[0,1]}")
    #print(f"Topic {topics_index[i]} matches content: {df_tmp.iloc[1,0]} with similarity score: {df_tmp.iloc[1,1]}")
    # keep the top 'max_contents_per_topic' matches (row j, not row 0 every time)
    for j in range (min (max_contents_per_topic, len (df_tmp))):
        list_tmp.append (dict(topic_id = topics_index[i], content_id = df_tmp.iloc[j,0], score = df_tmp.iloc[j,1]))
y_aux = pd.DataFrame (list_tmp)
y_aux.to_csv (f"{work_path}embeds.csv", index=False)

=== SENTENCE-TRANSFORMER == all-MiniLM-L6-v2 ===
Calculate embeddings for sentence  0
Calculate embeddings for sentence  1
Calculate embeddings for sentence  2
Calculate embeddings for sentence  3
Calculate embeddings for sentence  4
Embeddings total time: 0.544683  s
=== SENTENCE-TRANSFORMER == all-MiniLM-L6-v2 ===
Calculate embeddings for sentence  0
Calculate embeddings for sentence  1
Calculate embeddings for sentence  2
Calculate embeddings for sentence  3
Calculate embeddings for sentence  4
Embeddings total time: 0.816624  s


In [59]:
from IPython.display import FileLink
import os

os.chdir(r'/kaggle/working')

FileLink(r'embeds.csv')

## Get 'topic_content_found' binary column
'topic_content_found' is a binary column that records whether sentence_transformer found the same match as the ground truth.
It is obtained by merging 'y_aux' (the topic-content matches from sentence_transformer) with 'y' (the ground truth from 'correlations.csv') and comparing, for each topic, whether the content ids match: the label is 1 if they do and 0 otherwise.
The similarity score is kept to indicate how "close" sentence_transformer judged each topic-content pair to be.

Merging the topic features and content features into this dataframe provides the features for a binary classification problem, with 'topic_content_found' as the target (y). The classifier to be used is KNN.

Sentence_transformer + classifier model will be evaluated for topics with category = 'aligned'
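
A minimal sketch of the labelling idea on toy data follows. It is a variant that labels each candidate pair directly, by merging on both ids with pandas' `indicator` option, rather than merging on 'topic_id' alone. The frames, ids and scores are made up, and the ground-truth column is named `content_id` here for simplicity (the notebook's exploded `y` calls it `content_ids`).

```python
import pandas as pd

# Toy ground truth: one row per correct (topic, content) pair.
truth = pd.DataFrame({
    "topic_id": ["t_1", "t_1", "t_2"],
    "content_id": ["c_a", "c_b", "c_c"],
})
# Toy retrieval output: candidate pairs with a similarity score.
candidates = pd.DataFrame({
    "topic_id": ["t_1", "t_1", "t_2"],
    "content_id": ["c_a", "c_x", "c_y"],
    "score": [0.91, 0.80, 0.42],
})

# A candidate is a positive example if the same (topic, content) pair is in the truth.
labelled = candidates.merge(truth, on=["topic_id", "content_id"],
                            how="left", indicator=True)
labelled["topic_content_found"] = (labelled["_merge"] == "both").astype(int)
labelled = labelled.drop(columns="_merge")
print(labelled)
```

Each candidate keeps its score, so the resulting frame can serve as the starting point for the binary classification features described above.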

In [None]:
y_final = y.merge (y_aux, on='topic_id', how='left')
y_final['topic_content_found'] = (y_final.content_ids == y_final.content_id).astype(int)
y_final

## Get X and y

In [None]:

X_val_all = df_X[(df_X.category == cat_for_val) & (df_X.has_content == True)].copy()
y_val_all = df_X[(df_X.category == cat_for_val) & (df_X.has_content == True)].copy().loc[:, ('topic_id', 'content_ids')]
X_train_all = df_X[df_X.category != cat_for_val].copy()
y_train_all = df_X[df_X.category != cat_for_val].copy().loc[:, ('topic_id', 'content_ids')]

In [None]:
# - Using category 'aligned' for validation and only topics with content
# - For X_topics, using columns ['topic_id', 'content_ids', 'title_topic', 'description_topic', 'channel', 'category', 'level', 'parent', 'has_content','topic_breadcrumbs']
# X_topics = Dropping columns ['topic_id', 'content_ids', 'title_content', 'description_content', 'kind', 'text_content', 'license'] from df_X
#
# - For X_contents, using columns ['topic_id', 'content_ids', 'title_content', 'description_content', 'kind', 'text_content', 'license']
# X_contents = Dropping columns ['topic_id', 'content_ids', 'title_topic', 'description_topic', 'channel', 'category', 'level', 'parent', 'has_content','topic_breadcrumbs'] from df_X
X_val_topics = X_val_all.drop(columns = ['topic_id', 'content_ids', 'title_content', 'description_content', 'kind', 'text_content', 'license'])
X_train_topics = X_train_all.drop(columns = ['topic_id', 'content_ids', 'title_content', 'description_content', 'kind', 'text_content', 'license'])

X_val_content = X_val_all.drop(columns = ['topic_id', 'content_ids', 'title_topic', 'description_topic', 'channel', 'category', 'level', 'parent', 'has_content','topic_breadcrumbs'])
X_train_content = X_train_all.drop(columns = ['topic_id', 'content_ids', 'title_topic', 'description_topic', 'channel', 'category', 'level', 'parent', 'has_content','topic_breadcrumbs'])

In [None]:

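# build one sentence per row from the topic text columns, as done for 'topics_sentences' above
train_sentences = X_train_topics[['title_topic', 'description_topic', 'topic_breadcrumbs']].apply(lambda x: '.'.join(x.dropna().astype(str)), axis=1)
# SentenceTransf applies 'dataset_limit' itself, returns (index, embeddings) and prints its own timing
train_index, X_train_topics = SentenceTransf (backbone, train_sentences)
X_train_topics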

In [None]:
print (X_val.info())
print (X_train.shape)
print (y_val.shape)
print (y_train.shape)
X_val[0]

## Scoring

Scoring is based on the mean F2 score.

In [None]:
def model_scoring (y_test, y_pred, average='weighted'):
  """
   Calculate precision, recall and f2-score for y_test and y_pred
  """
  Precision = []
  Precision.append (precision_score(y_test['topic_id'], y_pred[:,0], average=average))
  Precision.append (precision_score(y_test['content_ids'], y_pred[:,1], average=average))
  precision_mean = np.mean(Precision)
  print ('Precision:', precision_mean)
  Recall = []
  Recall.append (recall_score(y_test['topic_id'], y_pred[:,0], average=average))
  Recall.append (recall_score(y_test['content_ids'], y_pred[:,1], average=average))
  recall_mean = np.mean(Recall)
  print ('Recall:', recall_mean)
  F2macro = []
  F2macro.append (fbeta_score(y_test['topic_id'], y_pred[:,0], beta=2, average='macro'))
  F2macro.append (fbeta_score(y_test['content_ids'], y_pred[:,1], beta=2, average='macro'))
  F2macro_mean = np.mean(F2macro)
  print ('F2 macro:', F2macro_mean)
  F2micro = []
  F2micro.append (fbeta_score(y_test['topic_id'], y_pred[:,0], beta=2, average='micro'))
  F2micro.append (fbeta_score(y_test['content_ids'], y_pred[:,1], beta=2, average='micro'))
  F2micro_mean = np.mean(F2micro)
  print ('F2 micro:', F2micro_mean)
  return precision_mean, recall_mean, F2macro_mean, F2micro_mean
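
For comparison with the function above, the competition metric can also be sketched directly on sets of content ids: an F2 score is computed per topic and then averaged over topics. The helper names and the toy dictionaries below are assumptions for illustration. Both inputs map a topic id to a space-delimited string of content ids, as in the submission format.

```python
def f2_for_topic(true_ids, pred_ids, beta=2.0):
    """F-beta score between the true and predicted content ids of one topic."""
    true_set, pred_set = set(true_ids), set(pred_ids)
    tp = len(true_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_f2(truth, preds):
    """Average the per-topic F2 over all topics in the ground truth."""
    scores = [
        f2_for_topic(truth[t].split(), preds.get(t, "").split())
        for t in truth
    ]
    return sum(scores) / len(scores)


# Toy example with made-up ids.
truth = {"t_1": "c_a c_b", "t_2": "c_c"}
preds = {"t_1": "c_a c_x", "t_2": "c_c"}
score = mean_f2(truth, preds)
print(score)
```

Because beta is 2, recall weighs more than precision, so predicting a few extra candidates per topic costs less than missing a correct one.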

## Baseline Score

In [None]:
%%time
dummy = DummyClassifier (strategy = 'stratified')
multi_target_dummy = MultiOutputClassifier(dummy, n_jobs=-1)
multi_target_dummy.fit(X_train, y_train)

In [None]:
%%time
y_pred = multi_target_dummy.predict (X_val)
y_pred.shape

In [None]:
model_scoring (y_val, y_pred)

## GradientBoosting

In [None]:
%%time
from sklearn.ensemble import GradientBoostingClassifier

# MultiOutputClassifier expects a classifier, so use the classifier variant
model = GradientBoostingClassifier(
    n_estimators=10,
    learning_rate=0.1,
    max_depth=3
)
multi_target_GB = MultiOutputClassifier(model, n_jobs=-1)
multi_target_GB.fit(X_train, y_train)

In [None]:
%%time
y_pred = multi_target_GB.predict (X_val)
y_pred

In [None]:
model_scoring (y_val, y_pred)

## Random Forest

## XGBoost

# Submissions

In [None]:
# load the sample submission into pandas dataframe
df_sample = pd.read_csv(f'{drive_path}sample_submission.csv')
df_sample