# About this notebook

JMTC-20 was my first competition. I joined near the end date and did not get a medal (30 places below the bronze line).

After the competition ended, I read the write-up posted by the [1st-place](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/160862) team. Their impressive ideas and extensive experiments with different techniques made me want to learn by reproducing their result, or at least part of it.

They said they would release their code soon, but so far only the post-processing part is public. I therefore implemented my own set of monolingual models, reaching <span style="color:red">lb.9508</span>. I hope this notebook and the earlier ones linked below help other beginners working on JMTC-20.

<h3>Public Score milestone</h3>

* [Basic XLM-R model with balanced data](https://www.kaggle.com/mint101/basic-xlm-r-lb-9442-intro?scriptVersionId=39822772):  <span style="color:red">(.942X - .9442)</span>
* Ensemble of XLM-R models:  <span style="color:red">(.9455)</span>
* [Pseudo-labeling on XLM-R](https://www.kaggle.com/mint101/example-code-of-pseudo-label-on-xlm-r) for one round:  <span style="color:red">(.9462)</span>
* [Transfer to monolingual models](https://www.kaggle.com/mint101/transfer-to-monolingual-mix):  <span style="color:red">(.9467-.9473)</span>
* Combine all monolingual models:  <span style="color:red">(.9500)</span>
* Adjusting according to [4th-place](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/160980):  <span style="color:red">(.9508)</span>
* Mix with [simple ensemble of public kernels from before the end date](https://www.kaggle.com/mint101/lb-9482-by-simple-public-result-bf-end-ensemble): <span style="color:red"> (.9514) </span>

<br/>
    
 * Mix with the public versions of the [1st](https://www.kaggle.com/rafiko1/1st-place-jigsaw-post-processing-example/output) and [2nd](https://www.kaggle.com/xiwuhan/jmtc-2nd-place-solution?scriptVersionId=37463887) place results (just for fun): <span style="color:red">(.9557)</span>
    
    
<u>I transferred the XLM-R results to the monolingual models and combined them only once. The combined result could be used to train XLM-R again and then be transferred again. According to the [1st-place](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/160862) write-up, this loop is feasible and may give another .001+ boost. I stopped here because I ran out of TPU quota.</u>

In [None]:
import os
import numpy as np 
import pandas as pd 

from scipy.special import softmax

In [None]:
path = "../input/jigsaw-multilingual-toxic-comment-classification/"
record = "../input/buffer/"

base = record + "submission-9462.csv"
monos = ["submission-it-9467.csv",
         "submission-pt-9470.csv",
         "submission-es-9467.csv",
         "submission-tr-9470.csv",
         "submission-fr-9473.csv",]

get_lang = lambda x: x.split('-')[1]

In [None]:
test = pd.read_csv(path + "test.csv")
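# Map each language code to the ids of its test rows.
# Grouping by a plain string (not a one-element list) keeps the keys as "it", "fr", ... in recent pandas.
dic_ids = {k:v.id for k,v in test.groupby("lang")}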

# Combine all monolingual models with base XLM-R

In [None]:
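sub = pd.read_csv(base)
for m in monos:
    res = pd.read_csv(record + m)
    ids = dic_ids[get_lang(m)]
    # Overwrite the base XLM-R predictions for this language with the monolingual ones.
    # The ids are used as row labels, which assumes they match the default row index.
    sub.loc[ids,"toxic"] = res.toxic[ids]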

sub.head()

**Here, we can obtain <span style="color:red"> lb.9500</span>**
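
The per-language overwrite above uses the test ids as row labels, which works only while the `id` column coincides with the default row index. Below is a minimal sketch of the same step keyed on `id` explicitly. The toy frames and the helper name `overwrite_by_lang` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def overwrite_by_lang(base, mono, test, lang):
    """Return a copy of `base` whose `toxic` values for rows of `lang`
    are taken from `mono`, matching rows on the `id` column."""
    out = base.copy()
    lang_ids = test.loc[test["lang"] == lang, "id"]
    mono_scores = mono.set_index("id")["toxic"]
    mask = out["id"].isin(lang_ids)
    # Look up each selected id in the monolingual submission.
    out.loc[mask, "toxic"] = out.loc[mask, "id"].map(mono_scores).values
    return out

# Toy data (hypothetical values); the monolingual file is in a different row order.
test = pd.DataFrame({"id": [10, 11, 12, 13], "lang": ["it", "fr", "it", "tr"]})
base = pd.DataFrame({"id": [10, 11, 12, 13], "toxic": [0.1, 0.2, 0.3, 0.4]})
mono_it = pd.DataFrame({"id": [13, 12, 11, 10], "toxic": [0.9, 0.8, 0.7, 0.6]})

combined = overwrite_by_lang(base, mono_it, test, "it")
print(combined)
```

Only the Italian rows (ids 10 and 12) change here; the other rows keep the base predictions, and the result does not depend on the row order of either file.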

# Adjustment according to [4th-place](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/160980)

In [None]:
adj = {
    "fr":1.04,
    "es":1.06,
    "pt":.96,
    "it":.97,
    "tr":.98,
}
# Scale the predictions of each language by its factor.
for lang,v in adj.items():
    ids = dic_ids[lang]
    sub.loc[ids,"toxic"] *= v

sub.head()

**We get a .0008 boost to <span style="color:red"> lb.9508</span>**
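
Why can a per-language constant help at all? Assuming the leaderboard metric is ROC AUC computed over all languages pooled together, multiplying one language's scores by a positive factor leaves the ranking inside that language unchanged but shifts it relative to the other languages. The sketch below illustrates the mechanism on toy data that was constructed by hand to show an improvement. The labels, scores and the `auc` helper are assumptions for illustration and say nothing about the real test set.

```python
import numpy as np

def auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked
    correctly; tied pairs count as half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy data: hypothetical labels and scores for two languages.
lang = np.array(["fr", "fr", "fr", "pt", "pt", "pt"])
y = np.array([1, 0, 0, 1, 0, 0])
score = np.array([0.50, 0.30, 0.10, 0.80, 0.52, 0.20])

factors = {"fr": 1.04, "pt": 0.96}  # the two matching entries of `adj`
adjusted = score * np.array([factors[l] for l in lang])

auc_before = auc(y, score)     # pooled over both languages
auc_after = auc(y, adjusted)
print(auc_before, auc_after)

# Within each language the ranking, and so the AUC, is unchanged.
for l in factors:
    m = lang == l
    assert auc(y[m], score[m]) == auc(y[m], adjusted[m])
```

In this toy case the French positive (0.50) initially ranks below a Portuguese negative (0.52); after scaling, the order flips and the pooled AUC rises, while each language's own AUC stays the same.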

# Start Ensemble

In [None]:
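# Softmax over 1/(1-score): a higher public score gets a larger weight.
weight = lambda x: softmax(1/(1-x))

def mix_result(subs,pbs):
    # Weighted average of the `toxic` columns; rows must be in the same order in every frame.
    toxics = np.array([df.toxic.values for df in subs])
    w = weight(np.array(pbs))  # one weight per submission, from its public score
    print(["{:.3f}".format(i) for i in w])
    return toxics.T@w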
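
To get a feel for how strongly this weighting favors the better submission, here is a small self-contained sketch that evaluates the same formula, softmax of `1/(1-score)`, for the score pairs used later in this notebook. The helper name `score_weights` is an assumption; it reimplements the softmax with NumPy so the block runs without SciPy.

```python
import numpy as np

def score_weights(public_scores):
    """Softmax of 1 / (1 - score), the same formula as `weight` above."""
    z = 1.0 / (1.0 - np.asarray(public_scores, dtype=float))
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

w_two = score_weights([0.9508, 0.9482])
w_three = score_weights([0.9514, 0.9522, 0.9550])
print(np.round(w_two, 3))
print(np.round(w_three, 3))
```

Because `1/(1-score)` grows quickly as the score approaches 1, a small gap in public score becomes a large gap in weight: with scores .9508 and .9482, the first submission receives roughly three quarters of the total weight.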

<h3>Ensemble with kernels publicly available before the end date</h3>

https://www.kaggle.com/mint101/lb-9482-by-simple-public-result-bf-end-ensemble is my simple ensemble of kernels that were publicly available before the end date. During the competition I used the pre-adjustment version (lb.9473).

In [None]:
sub1 = pd.read_csv(record+"submission-public-mix-9482.csv")
sub["toxic"] = mix_result([sub,sub1],[.9508,.9482])
sub.head()

**Another .0006 boost to <span style="color:red"> lb.9514</span>**
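
`mix_result` stacks the raw `toxic` arrays, so it silently assumes every submission file lists the rows in the same order. A hedged sketch of how one might align a file to a reference by `id` before blending follows. The helper name `align_to` and the toy frames are assumptions for illustration.

```python
import pandas as pd

def align_to(reference, other):
    """Reorder `other` so its rows follow the `id` order of `reference`.
    Raises KeyError if an id from `reference` is missing in `other`."""
    return other.set_index("id").loc[reference["id"]].reset_index()

# Toy data (hypothetical values): the second file is shuffled.
ref = pd.DataFrame({"id": [0, 1, 2], "toxic": [0.1, 0.5, 0.9]})
shuffled = pd.DataFrame({"id": [2, 0, 1], "toxic": [0.8, 0.2, 0.4]})

aligned = align_to(ref, shuffled)
print(aligned)
```

After alignment, the frames can be passed to `mix_result` in the usual way.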

<h3>Ensemble with public 1st and 2nd kernels</h3>

In [None]:
sub1 = pd.read_csv(record+"submission-1st-place-9550.csv")
sub2 = pd.read_csv(record+"submission-2nd-place-9522.csv")
sub["toxic"] = mix_result([sub,sub2,sub1],[.9514,.9522,.9550])
sub.head()

**We can achieve <span style="color:red"> lb.9557</span> here.**

1. Ensemble with 2nd (.9522) alone:  <span style="color:red"> lb.9535</span>
2. Ensemble with 1st (.9550) alone:  <span style="color:red"> lb.9553</span>
3. Blend 1st (.9550) with 2nd (.9522):  <span style="color:red"> lb.9556</span>

In [None]:
sub.to_csv('submission.csv', index=False)

# Further improvement

1. Try variable padding, as in the [4th-place](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/160980) solution
2. Try augmentation and further corpus generation with pseudo-labels, as in the [2nd-place](https://www.kaggle.com/xiwuhan/jmtc-2nd-place-solution?scriptVersionId=37463887) solution
3. Try to fine-tune like [Jigsaw20 XLM-R lb0.9487 singel model](https://www.kaggle.com/hmendonca/jigsaw20-xlm-r-lb0-9487-singel-model)

.......