<a href="https://colab.research.google.com/github/sofiammatias/learning-equality/blob/main/Learning_Equality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle Challenge - Learning Equality
https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/overview

## Goal of the Competition

The goal of this competition is to match educational content to topics in a curriculum. Given a topic from a K-12 topic tree, the model must recommend the content items (videos, documents, exercises, and so on) that belong under it. Topics and content come in many languages, as the `language` columns in the data below show.

## Submission File

For each **topic_id** in the test set, you must predict a space-delimited list of recommended **content_ids** for that topic. The file should contain a header and have the following format:

~~~
topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c_76231f9d0b5e
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c_ebb7fdf10a7e
t_00069b63a70a,c_11a1dc0bfb99
...
~~~
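
As a minimal sketch of producing this format, the snippet below collapses long-form predictions (one row per topic/content pair, the same shape as the exploded `y_train` used later in this notebook) into one row per topic. The frame `pred_long` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical predictions in long form: one row per (topic, content) pair.
pred_long = pd.DataFrame({
    "topic_id": ["t_00004da3a1b2", "t_00004da3a1b2", "t_00069b63a70a"],
    "content_ids": ["c_1108dd0c7a5d", "c_376c5a8eb028", "c_11a1dc0bfb99"],
})

# Collapse to one row per topic with a space-delimited list of content ids.
submission = (
    pred_long.groupby("topic_id", sort=True)["content_ids"]
    .agg(" ".join)
    .reset_index()
)

# Render as CSV text with the required header; pass a path to write a file instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

This is the inverse of the `str.split` + `explode` step applied to `correlations.csv` further down.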

## Scoring
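Submissions are scored by the mean F2 score: an F2 score is computed for each topic's predicted list and then averaged over all topics. F2 weights recall more heavily than precision.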
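
A minimal sketch of this metric, assuming it is computed per topic on sets of content ids and then averaged. The helper names `f2_for_topic` and `mean_f2` and the toy ids are hypothetical, not part of the competition code.

```python
def f2_for_topic(true_ids, pred_ids):
    """F2 for one topic, treating both lists as sets of content ids."""
    true_set, pred_set = set(true_ids), set(pred_ids)
    tp = len(true_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    # F-beta with beta = 2: (1 + 2^2) * P * R / (2^2 * P + R)
    return 5 * precision * recall / (4 * precision + recall)


def mean_f2(truth, preds):
    """Average the per-topic F2; a topic missing from preds scores 0."""
    scores = [
        f2_for_topic(truth[t].split(), preds.get(t, "").split())
        for t in truth
    ]
    return sum(scores) / len(scores)


# Toy data in the submission format (space-delimited ids per topic).
truth = {"t_a": "c_1 c_2", "t_b": "c_3"}
preds = {"t_a": "c_1 c_9", "t_b": "c_3"}
score = mean_f2(truth, preds)
print(score)
```

Here topic `t_a` has precision 0.5 and recall 0.5 (F2 = 0.5), and `t_b` is a perfect match (F2 = 1.0), so the mean is 0.75.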

# Evaluation Metric - Efficiency Scoring
We compute a submission's efficiency score by:

\begin{equation} \text{Efficiency} = \frac{\text{F2}}{\text{Benchmark} - \max \text{F2}} + \frac{\text{RuntimeSeconds}}{32400} \end{equation}


where **F2** is the submission's score on the main competition metric, **Benchmark** is the score of the benchmark sample_submission.csv, **maxF2** is the maximum F2 of all submissions on the Private Leaderboard, and **RuntimeSeconds** is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.
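
To make the formula concrete, here is a small sketch that evaluates it for made-up numbers. The function name `efficiency` and all input values are hypothetical. Note that 32400 seconds is 9 hours, and that when the benchmark scores below the best submission the first term is negative, so a higher F2 lowers (improves) the efficiency score.

```python
def efficiency(f2, benchmark, max_f2, runtime_seconds):
    """Efficiency score as defined above; lower is better."""
    if benchmark == max_f2:
        raise ValueError("Benchmark and maxF2 must differ to avoid division by zero")
    return f2 / (benchmark - max_f2) + runtime_seconds / 32400


# Hypothetical values: benchmark 0.1, best leaderboard F2 0.7,
# this submission scores 0.6 and runs for 54 minutes (3240 s).
eff = efficiency(f2=0.6, benchmark=0.1, max_f2=0.7, runtime_seconds=3240)
print(eff)  # roughly -0.9: about -1.0 from the F2 term, +0.1 from runtime

# Same F2, but a 9-hour run adds a full 1.0 to the score.
eff_slow = efficiency(f2=0.6, benchmark=0.1, max_f2=0.7, runtime_seconds=32400)
```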

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

# Data

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.dummy import DummyClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Data Collection

In [11]:
drive_path = '/content/drive/My Drive/Colab Notebooks/learning_equality/'

# load the data into pandas dataframes
df_topics = pd.read_csv(f'{drive_path}topics.csv').fillna({"title": "", "description": ""})
df_topics

Unnamed: 0,id,title,description,channel,category,level,language,parent,has_content
0,t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
1,t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
2,t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
3,t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
4,t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True
...,...,...,...,...,...,...,...,...,...
76967,t_fffb0bf2801d,4.3 Graph of functions,,e77b55,aligned,4,en,t_676e6a1a4dc7,False
76968,t_fffbe1d5d43c,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True
76969,t_fffe14f1be1e,Lección 7,,6e90a7,aligned,6,es,t_d448c707984d,True
76970,t_fffe811a6da9,تحديد العلاقة بين الإحداثيّات القطبية والإحداث...,5b9e5ca86571f90499ea987f,9fd860,source,2,ar,t_5b4f3ba4eb7d,True


In [10]:
df_content = pd.read_csv(f'{drive_path}content.csv').fillna("")
df_content

Unnamed: 0,id,title,description,kind,text,language,copyright_holder,license
0,c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
1,c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
2,c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
3,c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
4,c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA
...,...,...,...,...,...,...,...,...
154042,c_fffcbdd4de8b,2. 12: Diffusion,,html5,What will eventually happen to these dyes?\n\n...,en,CSU and Merlot,CC BY-NC-SA
154043,c_fffe15a2d069,Sommare facendo gruppi da 10,Sal somma 5+68 spezzando il 5 in un 2 e un 3.\n\n,video,,it,,
154044,c_fffed7b0d13a,Introdução à subtração,Sal fala sobre o que significa subtrair. Os ex...,video,,pt,,
154045,c_ffff04ba7ac7,SA of a Cone,,video,,en,,


In [5]:
df_corr = pd.read_csv(f'{drive_path}correlations.csv', index_col=0)
df_corr

topic_id,content_ids
t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
t_00069b63a70a,c_11a1dc0bfb99
t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4
...,...
t_fff830472691,c_61fb63326e5d c_8f224e321c87
t_fff9e5407d13,c_026db653a269 c_0fb048a6412c c_20de77522603 c...
t_fffbe1d5d43c,c_46f852a49c08 c_6659207b25d5
t_fffe14f1be1e,c_cece166bad6a


In [6]:
# Explode the target column to be more practical to use
df_corr['content_ids'] = df_corr.content_ids.str.split(' ')
y_train = df_corr.explode('content_ids')
y_train.reset_index(inplace=True)
y_train

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
...,...,...
279914,t_fff9e5407d13,c_d64037a72376
279915,t_fffbe1d5d43c,c_46f852a49c08
279916,t_fffbe1d5d43c,c_6659207b25d5
279917,t_fffe14f1be1e,c_cece166bad6a


In [12]:
# Create X_train
# - Remove column 'copyright_holder'
# - Filter by language: start with English only and consider training the model on separate per-language datasets

# Join each (topic, content) pair with the topic and content metadata
df_X = y_train.merge(df_topics.rename(columns={'id': 'topic_id'}), on='topic_id', how='left')
df_X = df_X.merge(df_content.rename(columns={'id': 'content_ids'}), on='content_ids', how='left')
df_X = df_X.drop(columns='copyright_holder')
# pandas suffixes overlapping columns with _x (topic side) and _y (content side)
df_X = df_X.rename(columns={'title_x': 'title_topic', 'title_y': 'title_content', 'description_x': 'description_topic', 'description_y': 'description_content', 'language_x': 'language'})
X_train = df_X.drop(columns=['topic_id', 'content_ids'])
display(X_train)

# Topic languages present in the training pairs (computed only after X_train exists)
langs = X_train['language'].unique()
langs

Unnamed: 0,title_topic,description_topic,channel,category,level,language,parent,has_content,title_content,description_content,kind,text,language_y,license
0,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,Молив като резистор,"Моливът причинява промяна в отклонението, подо...",video,,bg,
1,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,Да чуем променливото съпротивление,Тук чертаем линия на лист хартия и я използвам...,video,,bg,
2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,Променлив резистор (реостат) с графит от молив,Използваме сърцевината на молива (неговия граф...,video,,bg,
3,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,Последователно свързване на галваничен елемент...,"Защо отклонението се променя, когато се свърже...",video,,bg,
4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True,Dados e resultados de funções: gráficos,Encontre todas as entradas que correspondem a ...,exercise,,pt,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279914,NA_U06 - El periódico,,71fd51,supplemental,2,es,t_5bd8f6ae9f7d,True,Introducción: El periódico,,html5,,es,
279915,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True,Proof: Right triangles inscribed in circles -d...,Proof showing that a triangle inscribed in a c...,video,,sw,
279916,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True,Area of inscribed equilateral triangle -dubbed...,A worked example of finding the area of an equ...,video,,sw,
279917,Lección 7,,6e90a7,aligned,6,es,t_d448c707984d,True,Juego con las palabras,,document,,es,


array(['bg', 'pt', 'en', 'gu', 'zh', 'ar', 'te', 'es', 'fr', 'sw', 'mr',
       'hi', 'fil', 'bn', 'my', 'it', 'ru', 'or', 'pnb', 'km', 'as', 'kn',
       'pl', 'ta', 'swa', 'ur', 'tr'], dtype=object)

# Model

## Preprocessing

## Baseline Score

In [8]:
# Small English-only sample; the topic language column is named 'language' after the rename above
en_mask = X_train['language'] == 'en'
X_ex = X_train[en_mask][:20]
y_ex = y_train[en_mask][:20]
y_ex

Unnamed: 0,topic_id,content_ids
8,t_00069b63a70a,c_11a1dc0bfb99
17,t_0008a1bd84ba,c_7ff92a954a3d
18,t_0008a1bd84ba,c_8790b074383e
19,t_000d1fb3f2f5,c_07f1d0eec4b2
20,t_000d1fb3f2f5,c_15a6fb858696
21,t_000d1fb3f2f5,c_175e9db3fc44
22,t_000d1fb3f2f5,c_1c2e804fa58a
23,t_000d1fb3f2f5,c_1ec97b588bce
24,t_000d1fb3f2f5,c_247c609418f3
25,t_000d1fb3f2f5,c_3a2bf4a358da


In [9]:
%%time
dummy = DummyClassifier (strategy = 'stratified')
multi_target_dummy = MultiOutputClassifier(dummy, n_jobs=-1)
multi_target_dummy.fit(X_ex, y_ex)

CPU times: user 33.6 ms, sys: 154 ms, total: 188 ms
Wall time: 1.66 s




MultiOutputClassifier(estimator=DummyClassifier(strategy='stratified'),
                      n_jobs=-1)

In [11]:
%%time
y_pred = multi_target_dummy.predict (X_ex)
y_pred

CPU times: user 6.31 ms, sys: 0 ns, total: 6.31 ms
Wall time: 13.9 ms


array([['t_000d1fb3f2f5', 'c_247c609418f3'],
       ['t_000d1fb3f2f5', 'c_7cc189e7acb0'],
       ['t_000d1fb3f2f5', 'c_4aea397ae55e'],
       ['t_000d1fb3f2f5', 'c_94e49adc276f'],
       ['t_000d1fb3f2f5', 'c_7ff92a954a3d'],
       ['t_000d1fb3f2f5', 'c_7868f655c31e'],
       ['t_000d1fb3f2f5', 'c_7868f655c31e'],
       ['t_000d1fb3f2f5', 'c_5299d6442ed8'],
       ['t_0008a1bd84ba', 'c_3a2bf4a358da'],
       ['t_000d1fb3f2f5', 'c_4aea397ae55e'],
       ['t_000d1fb3f2f5', 'c_1c2e804fa58a'],
       ['t_000d1fb3f2f5', 'c_4698dc0a94dc'],
       ['t_000d1fb3f2f5', 'c_7ff92a954a3d'],
       ['t_000d1fb3f2f5', 'c_1c2e804fa58a'],
       ['t_000d1fb3f2f5', 'c_11a1dc0bfb99'],
       ['t_000d1fb3f2f5', 'c_7cc189e7acb0'],
       ['t_0008a1bd84ba', 'c_1ec97b588bce'],
       ['t_000d1fb3f2f5', 'c_6e5ae1f2bb90'],
       ['t_0008a1bd84ba', 'c_1c2e804fa58a'],
       ['t_0008a1bd84ba', 'c_451671e513d2']], dtype=object)

In [12]:
# Calculate the f2-score
F2macro = []
F2macro.append (fbeta_score(y_ex['topic_id'], y_pred[:,0], beta=2, average='macro'))
F2macro.append (fbeta_score(y_ex['content_ids'], y_pred[:,1], beta=2, average='macro'))
print ('F2 macro:', np.mean(F2macro))
F2micro = []
F2micro.append (fbeta_score(y_ex['topic_id'], y_pred[:,0], beta=2, average='micro'))
F2micro.append (fbeta_score(y_ex['content_ids'], y_pred[:,1], beta=2, average='micro'))
print ('F2 micro:', np.mean (F2micro))

F2 macro: 0.15396825396825398
F2 micro: 0.3500000000000001


## Decision Tree

### Preprocessing - Feature Engineering

In [None]:
%%time


In [18]:
%%time
model = DecisionTreeClassifier ()
#model = RandomForestClassifier(n_estimators = 10, criterion = 'gini', max_depth = 3, random_state=1)
multi_target_model = MultiOutputClassifier(model, n_jobs=-1)
multi_target_model.fit(X_ex, y_ex)

ValueError: ignored
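
The error is most likely raised because `X_ex` still holds raw strings (titles, channel ids, categories), while scikit-learn trees expect numeric features. The sketch below shows one possible fix on a made-up toy frame: integer-encode a few categorical columns with `OrdinalEncoder` before fitting. The column choice, the toy values and the single-column target are assumptions for illustration, not the notebook's final feature set.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for a few categorical columns of X_train (values are made up).
toy_X = pd.DataFrame({
    "channel": ["000cf7", "8e286a", "6e3ba4", "000cf7"],
    "category": ["source", "source", "source", "aligned"],
    "kind": ["video", "exercise", "document", "video"],
    "level": [4, 4, 3, 2],
})
toy_y = ["c_aaa", "c_bbb", "c_ccc", "c_ddd"]  # hypothetical content ids

# Map each string category to an integer code; unseen categories become -1.
cat_cols = ["channel", "category", "kind"]
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_enc = toy_X.copy()
X_enc[cat_cols] = encoder.fit_transform(toy_X[cat_cols])

# With purely numeric features the tree fits without a ValueError.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_enc, toy_y)
toy_pred = tree.predict(X_enc)
print(list(toy_pred))
```

Ordinal codes impose an arbitrary order on the categories, which trees usually tolerate. Free-text columns such as titles and descriptions would need a different treatment (for example a text vectorizer), which this sketch does not cover.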

In [None]:
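# Use grid search for models: use with caution! :)
# The grid below holds random-forest hyperparameters (n_estimators, bootstrap),
# so the estimator is a RandomForestClassifier, not the DecisionTreeClassifier above.
# Assumes X_train is numerically encoded and y_train is a single label column;
# as loaded above, both still contain raw strings.

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=2)]
# Using num=10 will significantly increase run time
max_features = ['sqrt', 'log2']  # 'auto' was deprecated and later removed from scikit-learn
max_depth = [int(x) for x in np.linspace(10, 110, num=2)]
#max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
# Create the parameter grid
grid = {'n_estimators': n_estimators,
        'max_features': max_features,
        'max_depth': max_depth,
        'min_samples_split': min_samples_split,
        'min_samples_leaf': min_samples_leaf,
        'bootstrap': bootstrap}
# Multiclass targets need an explicit averaging mode for F-beta
ftwo_scorer = make_scorer(fbeta_score, beta=2, average='macro')

grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=1), param_grid=grid, cv=3, scoring=ftwo_scorer, n_jobs=-1, verbose=1)

# Train the classifier
best_grid = grid_search.fit(X_train, y_train)

# Make predictions (X_test is not defined in this notebook yet; build it from the test topics first)
grid_preds = best_grid.predict(X_test)

print("The best hyperparameters found during the grid search are:")
print(best_grid.best_params_)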
