<a href="https://colab.research.google.com/github/sofiammatias/learning-equality/blob/main/Learning_Equality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle Challenge - Learning Equality
https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The goal of this competition is to streamline the process of matching educational content to specific topics in a curriculum. Contestants develop an accurate and efficient model, trained on a library of K-12 educational materials organized into a variety of topic taxonomies, that recommends content items for each topic. The materials are in diverse languages and cover a wide range of subjects, particularly STEM.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~
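
As a minimal sketch of this format, assuming predictions are held as (topic_id, content_id) pairs, the rows could be assembled roughly as follows. The helper names `build_submission_rows` and `write_submission` and the sample pairs are illustrative only and are not part of the competition kit.

```python
import csv
import io
from collections import defaultdict


def build_submission_rows(pairs):
    """Group (topic_id, content_id) pairs into one space-delimited row per topic."""
    grouped = defaultdict(list)
    for topic_id, content_id in pairs:
        if content_id not in grouped[topic_id]:  # keep each content id once per topic
            grouped[topic_id].append(content_id)
    return [(topic_id, " ".join(ids)) for topic_id, ids in sorted(grouped.items())]


def write_submission(pairs, handle):
    """Write the header and one row per topic to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["topic_id", "content_ids"])
    writer.writerows(build_submission_rows(pairs))


# Hypothetical predictions, reusing ids from the sample above
pairs = [
    ("t_00069b63a70a", "c_11a1dc0bfb99"),
    ("t_00004da3a1b2", "c_1108dd0c7a5d"),
    ("t_00004da3a1b2", "c_376c5a8eb028"),
]
buffer = io.StringIO()
write_submission(pairs, buffer)
print(buffer.getvalue())
```

Writing to a handle rather than a path keeps the sketch testable in memory; passing `open('submission.csv', 'w', newline='')` instead would produce the file itself.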

## Scoring
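Submissions are evaluated on their mean F2 score: an F2 score is computed for each predicted row (topic), and the scores are then averaged.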
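
To make the metric concrete, here is a small sketch that assumes F2 is computed per topic on sets of content ids and then averaged over topics. The function names and the toy ids (`t_a`, `c_1`, ...) are made up for illustration.

```python
def f2_score(true_ids, pred_ids):
    """F-beta with beta=2 on two collections of ids (recall weighted above precision)."""
    true_set, pred_set = set(true_ids), set(pred_ids)
    tp = len(true_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    return 5 * precision * recall / (4 * precision + recall)


def mean_f2(truth, preds):
    """Average the per-topic F2; topics missing from preds score 0."""
    scores = [
        f2_score(truth[topic].split(), preds.get(topic, "").split())
        for topic in truth
    ]
    return sum(scores) / len(scores)


# Hypothetical ground truth and predictions in the submission format
truth = {"t_a": "c_1 c_2", "t_b": "c_3"}
preds = {"t_a": "c_1 c_9", "t_b": "c_3"}
print(mean_f2(truth, preds))  # t_a scores 0.5, t_b scores 1.0
```

Because beta is 2, a missed content id costs more than a spurious one, so predicting a few extra candidates per topic is usually cheaper than omitting a correct one.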

# Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{1}{ \text{Benchmark} - \text{maxF2} }\text{F2} + \frac{1}{32400}\text{RuntimeSeconds} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum F2 score of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.
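
As a worked illustration of the formula, the sketch below evaluates it for made-up inputs; the function name `efficiency_score` and all numeric values are hypothetical and do not come from the leaderboard.

```python
def efficiency_score(f2, benchmark, max_f2, runtime_seconds):
    """Efficiency = F2 / (Benchmark - maxF2) + RuntimeSeconds / 32400."""
    if benchmark == max_f2:
        raise ValueError("Benchmark and maxF2 must differ")
    return f2 / (benchmark - max_f2) + runtime_seconds / 32400


# Hypothetical values: benchmark F2 of 0.1, best F2 of 0.6
fast = efficiency_score(f2=0.5, benchmark=0.1, max_f2=0.6, runtime_seconds=3240)
slow = efficiency_score(f2=0.5, benchmark=0.1, max_f2=0.6, runtime_seconds=32400)
print(fast, slow)
```

When the benchmark scores below the best submission, the first term is negative, so a higher F2 lowers (improves) the score, while every additional second of runtime raises it. The constant 32400 seconds corresponds to 9 hours.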

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

# Data

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.dummy import DummyClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Data Collection

In [None]:
drive_path = '/content/drive/My Drive/Colab Notebooks/learning_equality/'

# load 'topics' data into pandas dataframe
df_topics = pd.read_csv(f'{drive_path}topics.csv', index_col=0).fillna({"title": "", "description": ""})
df_topics

id,title,description,channel,category,level,language,parent,has_content
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True
...,...,...,...,...,...,...,...,...
t_fffb0bf2801d,4.3 Graph of functions,,e77b55,aligned,4,en,t_676e6a1a4dc7,False
t_fffbe1d5d43c,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True
t_fffe14f1be1e,Lección 7,,6e90a7,aligned,6,es,t_d448c707984d,True
t_fffe811a6da9,تحديد العلاقة بين الإحداثيّات القطبية والإحداث...,5b9e5ca86571f90499ea987f,9fd860,source,2,ar,t_5b4f3ba4eb7d,True


In [None]:
# load 'content' data into pandas dataframe
df_content = pd.read_csv(f'{drive_path}content.csv', index_col=0).fillna("")
df_content

id,title,description,kind,text,language,copyright_holder,license
c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA
...,...,...,...,...,...,...,...
c_fffcbdd4de8b,2. 12: Diffusion,,html5,What will eventually happen to these dyes?\n\n...,en,CSU and Merlot,CC BY-NC-SA
c_fffe15a2d069,Sommare facendo gruppi da 10,Sal somma 5+68 spezzando il 5 in un 2 e un 3.\n\n,video,,it,,
c_fffed7b0d13a,Introdução à subtração,Sal fala sobre o que significa subtrair. Os ex...,video,,pt,,
c_ffff04ba7ac7,SA of a Cone,,video,,en,,


In [None]:
# load 'correlations' data into pandas dataframe
df_corr = pd.read_csv(f'{drive_path}correlations.csv', index_col=0)
df_corr

topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4
...,...
t_fff830472691,c_61fb63326e5d c_8f224e321c87
t_fff9e5407d13,c_026db653a269 c_0fb048a6412c c_20de77522603 c...
t_fffbe1d5d43c,c_46f852a49c08 c_6659207b25d5
t_fffe14f1be1e,c_cece166bad6a


In [None]:
# Restructure 'df_corr' (correlations): explode the target column to be more practical to use - one topic -> one content
y = df_corr.copy()
y['content_ids'] = y.content_ids.str.split(' ')
y = y.explode('content_ids')
y.reset_index(inplace=True)
y

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
...,...,...
279914,t_fff9e5407d13,c_d64037a72376
279915,t_fffbe1d5d43c,c_46f852a49c08
279916,t_fffbe1d5d43c,c_6659207b25d5
279917,t_fffe14f1be1e,c_cece166bad6a


## Data Cleaning

In [None]:
# - Remove column 'copyright_holder' as it doesn't add to features
# - Remove column 'has_content' (every row here comes from 'correlations', so its topic has content)
# - Filter by language: start with English only and consider training the model on separate datasets, one per language
# - Remove columns 'language_x' (topic language) and 'language_y' (content language) after filtering

# Languages in df_topics: ['bg', 'en', 'pt', 'gu', 'my', 'zh', 'ar', 'te', 'es', 'fr', 'sw', 'mr', 'hi', 'bn', 'fil', 'ru', 'it', 'or', 'pnb', 'km', 'as', 'kn','ur', 'pl', 'ta', 'swa', 'tr', 'mul']
langs = 'en' # select English by name; the position returned by unique() depends on row order
df_X = y.copy()
df_X = df_X.merge(df_topics.reset_index().rename(columns={'id': 'topic_id'}), on='topic_id', how='left')
df_X = df_X.merge(df_content.reset_index().rename(columns={'id': 'content_ids'}), on='content_ids', how='left')
df_X = df_X[df_X.language_x == langs].copy()
y_lang = df_X[['topic_id', 'content_ids']]
df_X.drop(columns=['copyright_holder', 'has_content', 'language_x', 'language_y'], inplace=True)
df_X.rename(columns={'title_x': 'title_topic', 'title_y': 'title_content', 'description_x': 'description_topic', 'description_y': 'description_content'}, inplace=True)
df_X

Unnamed: 0,topic_id,content_ids,title_topic,description_topic,channel,category,level,parent,title_content,description_content,kind,text,license
8,t_00069b63a70a,c_11a1dc0bfb99,Transcripts,,6e3ba4,source,3,t_4054df11a74e,Flow Charts: Logical..: Written Transcript of ...,"Written Transcript of this video lesson, Engli...",document,Español\t\n \nTítulo\t\n de\t\n la\t\n lec...,CC BY-NC-SA
17,t_0008a1bd84ba,c_7ff92a954a3d,12. 20: Bird Reproduction,,ebc86c,supplemental,5,t_c44ac9711007,12. 20: Bird Reproduction,,html5,Is this pair of birds actually a “couple”?\n\n...,CC BY-NC-SA
18,t_0008a1bd84ba,c_8790b074383e,12. 20: Bird Reproduction,,ebc86c,supplemental,5,t_c44ac9711007,Astounding Mating Dance Birds of Paradise -- H...,The Birds of Paradise from BBC's outstanding P...,video,,
19,t_000d1fb3f2f5,c_07f1d0eec4b2,2.1.2 - Logarithms,,e77b55,aligned,5,t_b897d168db90,Proof of the logarithm change of base rule,Sal proves the logarithmic change of base rule...,video,What I want to do in this video is prove the c...,CC BY-NC-SA
20,t_000d1fb3f2f5,c_15a6fb858696,2.1.2 - Logarithms,,e77b55,aligned,5,t_b897d168db90,Intro to logarithm properties (2 of 2),Sal introduces the logarithm identities for mu...,video,PROFESSOR: Welcome back. I'm going to show you...,CC BY-NC-SA
...,...,...,...,...,...,...,...,...,...,...,...,...,...
279894,t_fff5da49c4d3,c_4a0305f5876d,Business Writing,,735876,source,3,t_c4c783ae6cc1,Vegetarian Lunch Options at Bay High (Email),,html5,Jump to navigation\n\nVegetarian Lunch Options...,CC BY-NC-SA
279895,t_fff5da49c4d3,c_cbceabf1d5d6,Business Writing,,735876,source,3,t_c4c783ae6cc1,Mid-Project Report on History Paper (Email),,html5,Jump to navigation\n\nMid-Project Report on Hi...,CC BY-NC-SA
279896,t_fff7782561f4,c_bca8280a9ad1,Introduction,"In certain situations, comparison by division ...",d5fb04,supplemental,3,t_2a4dc28b0431,Introduction to ratios,What a ratio is. Simple ratio problems.,video,Let's try to learn a thing or two about ratios...,CC BY-NC-SA
279903,t_fff830472691,c_61fb63326e5d,Scalar Projections,,fef095,source,4,t_c75d6acecf78,Scalar Projections,,html5,,


## Helper functions and classes for data traversal (breadcrumbs, topic context)

In [None]:
# define some helper functions and classes to aid with data traversal

class ContentItem:
    """
    Class to handle content relations (breadcrumbs)
    """
    def __init__(self, content_id):
        self.id = content_id

    @property
    def topics(self):
        return [Topic(topic_id) for topic_id in df_topics.loc[df_corr[df_corr.content_ids.str.contains(self.id)].index].index]

    def __getattr__(self, name):
        return df_content.loc[self.id][name]

    def __str__(self):
        return self.title

    def __repr__(self):
        return f"<ContentItem(id={self.id}, title=\"{self.title}\")>"

    def __eq__(self, other):
        if not isinstance(other, ContentItem):
            return False
        return self.id == other.id

    def get_all_breadcrumbs(self, separator=" >> ", include_root=True):
        breadcrumbs = []
        for topic in self.topics:
            new_breadcrumb = topic.get_breadcrumbs(separator=separator, include_root=include_root)
            if new_breadcrumb:
                new_breadcrumb = new_breadcrumb + separator + self.title
            else:
                new_breadcrumb = self.title
            breadcrumbs.append(new_breadcrumb)
        return breadcrumbs

class Topic:
    """
    Class to handle topics relations (breadcrumbs)
    """
    def __init__(self, topic_id):
        self.id = topic_id

    @property
    def parent(self):
        parent_id = df_topics.loc[self.id].parent
        if pd.isna(parent_id):
            return None
        else:
            return Topic(parent_id)

    @property
    def ancestors(self):
        ancestors = []
        parent = self.parent
        while parent is not None:
            ancestors.append(parent)
            parent = parent.parent
        return ancestors

    @property
    def siblings(self):
        if not self.parent:
            return []
        else:
            return [topic for topic in self.parent.children if topic != self]

    @property
    def content(self):
        if self.id in df_corr.index:
            return [ContentItem(content_id) for content_id in df_corr.loc[self.id].content_ids.split()]
        else:
            return tuple([]) if self.has_content else []

    def get_breadcrumbs(self, separator=" >> ", include_self=True, include_root=True):
        ancestors = self.ancestors
        if include_self:
            ancestors = [self] + ancestors
        if not include_root:
            ancestors = ancestors[:-1]
        return separator.join(reversed([a.title for a in ancestors]))

    @property
    def children(self):
        return [Topic(child_id) for child_id in df_topics[df_topics.parent == self.id].index]

    def subtree_markdown(self, depth=0):
        markdown = "  " * depth + "- " + self.title + "\n"
        for child in self.children:
            markdown += child.subtree_markdown(depth=depth + 1)
        for content in self.content:
            markdown += ("  " * (depth + 1) + "- " + "[" + content.kind.title() + "] " + content.title) + "\n"
        return markdown

    def __eq__(self, other):
        if not isinstance(other, Topic):
            return False
        return self.id == other.id

    def __getattr__(self, name):
        return df_topics.loc[self.id][name]

    def __str__(self):
        return self.title
    
    def __repr__(self):
        return f"<Topic(id={self.id}, title=\"{self.title}\")>"
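
The breadcrumb traversal can be illustrated without the competition data. The following is a minimal sketch that assumes the parent links and titles are available as plain dictionaries; the function `breadcrumbs` and the three toy topics are hypothetical stand-ins for `Topic.get_breadcrumbs` and `df_topics`.

```python
def breadcrumbs(topic_id, parents, titles, separator=" >> "):
    """Walk parent links up to the root and join the titles root-first."""
    chain = []
    seen = set()  # guards against accidental cycles in the parent links
    current = topic_id
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(titles[current])
        current = parents.get(current)
    return separator.join(reversed(chain))


# Hypothetical three-level topic tree
parents = {"t_root": None, "t_unit": "t_root", "t_leaf": "t_unit"}
titles = {"t_root": "Math", "t_unit": "Algebra", "t_leaf": "Logarithms"}

print(breadcrumbs("t_leaf", parents, titles))
# Math >> Algebra >> Logarithms
```

Precomputing such dictionaries once avoids repeated `df_topics.loc[...]` lookups when breadcrumbs are needed for many topics.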

In [None]:
# Feature engineering
# - Create breadcrumbs: the titles from the root topic down to each topic, joined into one string

# Topic takes a single topic_id, so build one Topic per unique id
for topic_id in df_X['topic_id'].unique():
    topic = Topic(topic_id)
    print("Breadcrumbs:\t" + topic.get_breadcrumbs(separator=' '))

# Model

## Preprocessing

## Baseline Score

In [None]:
# df_X and y_lang were already restricted to English in Data Cleaning
X_ex = df_X[:20]
y_ex = y_lang[:20]
y_ex

Unnamed: 0,topic_id,content_ids
8,t_00069b63a70a,c_11a1dc0bfb99
17,t_0008a1bd84ba,c_7ff92a954a3d
18,t_0008a1bd84ba,c_8790b074383e
19,t_000d1fb3f2f5,c_07f1d0eec4b2
20,t_000d1fb3f2f5,c_15a6fb858696
21,t_000d1fb3f2f5,c_175e9db3fc44
22,t_000d1fb3f2f5,c_1c2e804fa58a
23,t_000d1fb3f2f5,c_1ec97b588bce
24,t_000d1fb3f2f5,c_247c609418f3
25,t_000d1fb3f2f5,c_3a2bf4a358da


In [None]:
%%time
dummy = DummyClassifier (strategy = 'stratified')
multi_target_dummy = MultiOutputClassifier(dummy, n_jobs=-1)
multi_target_dummy.fit(X_ex, y_ex)

CPU times: user 33.6 ms, sys: 154 ms, total: 188 ms
Wall time: 1.66 s




MultiOutputClassifier(estimator=DummyClassifier(strategy='stratified'),
                      n_jobs=-1)

In [None]:
%%time
y_pred = multi_target_dummy.predict (X_ex)
y_pred

CPU times: user 6.31 ms, sys: 0 ns, total: 6.31 ms
Wall time: 13.9 ms


array([['t_000d1fb3f2f5', 'c_247c609418f3'],
       ['t_000d1fb3f2f5', 'c_7cc189e7acb0'],
       ['t_000d1fb3f2f5', 'c_4aea397ae55e'],
       ['t_000d1fb3f2f5', 'c_94e49adc276f'],
       ['t_000d1fb3f2f5', 'c_7ff92a954a3d'],
       ['t_000d1fb3f2f5', 'c_7868f655c31e'],
       ['t_000d1fb3f2f5', 'c_7868f655c31e'],
       ['t_000d1fb3f2f5', 'c_5299d6442ed8'],
       ['t_0008a1bd84ba', 'c_3a2bf4a358da'],
       ['t_000d1fb3f2f5', 'c_4aea397ae55e'],
       ['t_000d1fb3f2f5', 'c_1c2e804fa58a'],
       ['t_000d1fb3f2f5', 'c_4698dc0a94dc'],
       ['t_000d1fb3f2f5', 'c_7ff92a954a3d'],
       ['t_000d1fb3f2f5', 'c_1c2e804fa58a'],
       ['t_000d1fb3f2f5', 'c_11a1dc0bfb99'],
       ['t_000d1fb3f2f5', 'c_7cc189e7acb0'],
       ['t_0008a1bd84ba', 'c_1ec97b588bce'],
       ['t_000d1fb3f2f5', 'c_6e5ae1f2bb90'],
       ['t_0008a1bd84ba', 'c_1c2e804fa58a'],
       ['t_0008a1bd84ba', 'c_451671e513d2']], dtype=object)

In [None]:
# Calculate the f2-score
F2macro = []
F2macro.append (fbeta_score(y_ex['topic_id'], y_pred[:,0], beta=2, average='macro'))
F2macro.append (fbeta_score(y_ex['content_ids'], y_pred[:,1], beta=2, average='macro'))
print ('F2 macro:', np.mean(F2macro))
F2micro = []
F2micro.append (fbeta_score(y_ex['topic_id'], y_pred[:,0], beta=2, average='micro'))
F2micro.append (fbeta_score(y_ex['content_ids'], y_pred[:,1], beta=2, average='micro'))
print ('F2 micro:', np.mean (F2micro))

F2 macro: 0.15396825396825398
F2 micro: 0.3500000000000001


## Decision Tree

### Preprocessing - Feature Engineering

In [None]:
%%time


In [None]:
%%time
model = DecisionTreeClassifier ()
#model = RandomForestClassifier(n_estimators = 10, criterion = 'gini', max_depth = 3, random_state=1)
multi_target_model = MultiOutputClassifier(model, n_jobs=-1)
multi_target_model.fit(X_ex, y_ex)

ValueError: ignored

In [None]:
# Use grid search for models: use with caution! :)
# Assumes X_train, y_train and X_test hold numeric features and a single target column

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 2)]
# Using num=10 will significantly increase run time
max_features = ['sqrt', 'log2'] # 'auto' is not accepted by recent scikit-learn releases
max_depth = [int(x) for x in np.linspace(10, 110, num = 2)]
#max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
# Create the parameter grid
grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
# The target is multiclass, so fbeta_score needs an explicit averaging mode
ftwo_scorer = make_scorer(fbeta_score, beta=2, average='micro')

# 'n_estimators' and 'bootstrap' are random forest parameters, so search over a RandomForestClassifier
rf_model = RandomForestClassifier(random_state=1)
grid_search = GridSearchCV(estimator = rf_model, param_grid = grid, cv = 3, scoring=ftwo_scorer, n_jobs = -1, verbose = 1)

# Train the classifier
best_grid = grid_search.fit(X_train, y_train)

# Make predictions
grid_preds = best_grid.predict(X_test)

print("The best hyperparameters found during the grid search are:")
print(best_grid.best_params_)

