<a href="https://colab.research.google.com/github/rizavelioglu/hateful_memes-hate_detectron/blob/main/notebooks/%5BGitHub%5Dreproduce_submissions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color='Aqua'> <b> Team HateDetectron - Phase-2 Submissions </b> </font>

---
This notebook only reproduces the Phase-2 submissions of the team `HateDetectron`: it loads the final models and generates predictions for the test set. See the [end2end-process notebook](https://colab.research.google.com/drive/1O0m0j9_NBInzdo3K04jD19IyOhBR1I8i?usp=sharing) for the whole approach in detail: how the models are trained, how the image features are extracted, which datasets are used, etc.

---
**Author:**\
<font color='Wheat'>
    <b>
        Riza Velioglu
    </b>
</font>

**Contact:**

<center>
<a href="http://rizavelioglu.github.io/"><img src="https://drive.google.com/uc?id=1SWc-ryZf7xxZ_g7AdU_vn2Y451IcCisw" width="200"></a>

[Webpage](http://rizavelioglu.github.io/)
</center>

In [None]:
#@markdown ---
#@title <h1><b> <font color='lightblue'> Running the whole notebook at once! </font> <font color='red'> --Action required!-- </font></b></h1> { run: "auto" }
#@markdown Please download the `Hateful Memes Dataset` from the official challenge webpage: https://hatefulmemeschallenge.com/#download. 
#@markdown <br> After you fill in the form, the `hateful_memes.zip` file is downloaded; it contains all the required data, including the images.
#@markdown <br><br>Please define the following variables:
#@markdown - `PATH_TO_ZIP_FILE`: the full path of the downloaded `.zip` file. **e.g.** `"/content/drive/MyDrive/hateful_memes.zip"`
#@markdown - `HOME`: the working directory for this project, where the code, features and models are stored. **e.g.**
#@markdown  For <font color='orange'> Linux </font> users it can be: `"/home/project_folder"`. For <font color='yellow'> Colab </font> it would be: `"/content"`


PATH_TO_ZIP_FILE = '' #@param {type:"string"}
HOME = '' #@param {type:"string"}

#@markdown Then run all the cells:
#@markdown - <font color='yellow'> Colab </font>: **"Runtime" > "Run All"** *OR* `Ctrl+F9`
#@markdown - <font color='orange'> Jupyter Notebook </font>: **"Cell" > "Run All"**
#@markdown ---
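
Since every later cell depends on these two variables, it can help to check them before running the whole notebook. The following is only a minimal sketch: the helper name `check_inputs` is hypothetical and not part of the original notebook.

```python
import os
import zipfile


def check_inputs(path_to_zip_file, home):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not part of the original notebook.
    """
    problems = []
    if not path_to_zip_file:
        problems.append("PATH_TO_ZIP_FILE is empty")
    elif not os.path.isfile(path_to_zip_file):
        problems.append(f"no file at {path_to_zip_file!r}")
    elif not zipfile.is_zipfile(path_to_zip_file):
        problems.append(f"{path_to_zip_file!r} is not a valid .zip archive")
    if not home:
        problems.append("HOME is empty")
    elif not os.path.isdir(home):
        problems.append(f"HOME directory {home!r} does not exist")
    return problems


# Usage inside the notebook (after filling in the form above):
# for problem in check_inputs(PATH_TO_ZIP_FILE, HOME):
#     print("[ERROR]", problem)
```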

# Table of Contents

<details><summary>
<font color='Tan'> I. Installation of MMF & dependencies </font></summary>

- install MMF dependencies
- install MMF from source
</details>

<details><summary>
<font color='Tan'> II. Download the dataset & convert it into MMF format </font></summary>

- download Hateful Memes (HM) dataset
- convert HM into MMF format (unzip and place the dataset to a specific location)
- remove unnecessary data to keep the disk clean
</details>


<details><summary>
<font color='Tan'> III. Feature Extraction </font></summary>

- download image features (.zip files) for 2 datasets:  HM and [Memotion](https://www.kaggle.com/williamscott701/memotion-dataset-7k) [( see paper )](https://arxiv.org/pdf/2008.03781.pdf)
- extract the two `.zip` files and merge them into one folder
</details>

<details><summary>
<font color='Tan'> IV. Validating 'fine-tuned' VisualBERT models on `dev_unseen.jsonl` </font></summary>

- download fine-tuned models that were used in Phase-2 submission
- evaluate them on 'dev_unseen' data
</details>


<details><summary>
<font color='Tan'> V. Generate predictions for the Challenge (`test_unseen.jsonl`) </font></summary>

- generate predictions for 2 submissions in Phase-2
</details>

## <font color='magenta'> <b> I. Installation of MMF & dependencies </b> </font>

In [None]:
import os
os.chdir(HOME)
os.getcwd()

'/content'

In [None]:
# Install the pinned versions of `torch` and `torchvision` before installing mmf (otherwise the mmf installation causes an issue)
!pip install torch==1.6.0 torchvision==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.6.0
  Downloading https://download.pytorch.org/whl/cu92/torch-1.6.0%2Bcu92-cp37-cp37m-linux_x86_64.whl (552.8 MB)
Collecting torchvision==0.7.0
  Downloading https://download.pytorch.org/whl/cu92/torchvision-0.7.0%2Bcu92-cp37-cp37m-linux_x86_64.whl (5.8 MB)
Installing collected packages: torch, torchvision
  Attempting uninstall: torch
    Found existing installation: torch 1.10.0+cu111
    Uninstalling torch-1.10.0+cu111:
      Successfully uninstalled torch-1.10.0+cu111
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.11.1+cu111
    Uninstalling torchvision-0.11.1+cu111:
      Successfully uninstalled torchvision-0.11.1+cu111
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 

#### *Install MMF from source* 


In [None]:
# Clone the following repo where mmf does not install default image features, 
# since we will use our own features
!git clone --branch no_feats --config core.symlinks=true https://github.com/rizavelioglu/mmf.git

Cloning into 'mmf'...
remote: Enumerating objects: 16730, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 16730 (delta 1), reused 2 (delta 0), pack-reused 16724
Receiving objects: 100% (16730/16730), 12.79 MiB | 10.11 MiB/s, done.
Resolving deltas: 100% (10763/10763), done.


In [None]:
os.chdir(os.path.join(HOME, "mmf"))

In [None]:
!pip install --editable .

Obtaining file:///content/mmf
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting GitPython==3.1.0
  Downloading GitPython-3.1.0-py3-none-any.whl (450 kB)
Collecting torchtext==0.5.0
  Downloading torchtext-0.5.0-py3-none-any.whl (73 kB)
Collecting demjson==2.2.4
  Downloading demjson-2.2.4.tar.gz (131 kB)
Collecting nltk==3.4.5
  Downloading nltk-3.4.5.zip (1.5 MB)
Collecting fasttext==0.9.1
  Downloading fasttext-0.9.1.tar.gz (57 kB)
Collecting transformers==3.4.0
  Downloading transformers-3.4.0-py3-none-any.whl (1.3 MB)

---
## <font color='magenta'> <b> II. Download the dataset & convert it into *MMF* format </b> </font>

In [None]:
# Copy the dataset into the mmf folder under the name expected by the next cells
!cp "$PATH_TO_ZIP_FILE" "$HOME/mmf/hateful_memes.zip"

In [None]:
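# Add the mmf folder to the Python path (PYTHONPATH may be unset outside Colab)
mmf_dir = os.path.join(HOME, "mmf")
os.environ['PYTHONPATH'] = os.pathsep.join(filter(None, [os.environ.get('PYTHONPATH'), mmf_dir]))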

In [None]:
!mmf_convert_hm --zip_file="hateful_memes.zip"

Data folder is /root/.cache/torch/mmf/data
Zip path is hateful_memes.zip
Copying hateful_memes.zip
Unzipping hateful_memes.zip
Extracting the zip can take time. Sit back and relax.
Moving train.jsonl
Moving dev_seen.jsonl
Moving test_seen.jsonl
Moving dev_unseen.jsonl
Moving test_unseen.jsonl
Moving img


In [None]:
# Free up disk space by removing the .zip files
!rm -rf /root/.cache/torch/mmf/data/datasets/hateful_memes/defaults/images/hateful_memes.zip
!rm -rf $HOME/mmf/hateful_memes.zip
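
Before moving on, one may want to confirm that the conversion placed all annotation files where MMF expects them. The sketch below is not part of the original notebook: `missing_annotations` is a hypothetical helper, and the annotations folder is an assumption derived from the data folder printed by `mmf_convert_hm` and the relative annotation paths used later in this notebook.

```python
import os

# Assumed location (data folder printed above + relative annotation path used below)
ANNOTATIONS_DIR = "/root/.cache/torch/mmf/data/datasets/hateful_memes/defaults/annotations"

# The files listed as "Moving ..." in the output of `mmf_convert_hm`
EXPECTED_FILES = [
    "train.jsonl",
    "dev_seen.jsonl",
    "test_seen.jsonl",
    "dev_unseen.jsonl",
    "test_unseen.jsonl",
]


def missing_annotations(annotations_dir, expected=EXPECTED_FILES):
    """Return the expected annotation files that are absent from `annotations_dir`."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(annotations_dir, name))
    ]


# Usage inside the notebook:
# print(missing_annotations(ANNOTATIONS_DIR) or "All annotation files are in place.")
```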

---
## <font color='magenta'> <b> III. Feature Extraction </b> </font>

### <font color='lightgreen'> <b> Collect 'pre-extracted' features </b> </font>

There are 2 .zip files that store the extracted image features: one for <font color='Salmon'>Hateful Memes</font>, and one for <font color='Salmon'> Memotion Dataset</font>.

The following cell downloads both .zip files into the `$HOME` directory:
- [hateful_memes_features](https://drive.google.com/file/d/1YGigTCQQlVvS726YuECTMx0He8p0Xle2/view?usp=sharing)
- [memotion_features](https://drive.google.com/file/d/11o35vKEMDQjvHV42aYMzwjEVlIDZIe1a/view?usp=sharing)
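
The `wget` commands in this notebook do not use the share links above directly; they use a direct-download URL built from the file ID inside each link. As a rough sketch of that relationship (the helper names are hypothetical, and the confirmation token that Google Drive requires for large files is not handled here; the `wget` command takes care of it):

```python
import re


def drive_file_id(share_link):
    """Extract the file ID from a Google Drive share link of the form .../file/d/<ID>/view."""
    match = re.search(r"/file/d/([0-9A-Za-z_-]+)", share_link)
    if match is None:
        raise ValueError(f"no Drive file ID found in {share_link!r}")
    return match.group(1)


def direct_download_url(file_id):
    """Build the direct-download URL pattern used by the wget commands in this notebook."""
    return f"https://docs.google.com/uc?export=download&id={file_id}"


link = "https://drive.google.com/file/d/1YGigTCQQlVvS726YuECTMx0He8p0Xle2/view?usp=sharing"
print(direct_download_url(drive_file_id(link)))
```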

In [None]:
os.chdir(HOME)
# download HM dataset features
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1YGigTCQQlVvS726YuECTMx0He8p0Xle2' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1YGigTCQQlVvS726YuECTMx0He8p0Xle2" -O 'feats_hm.zip' && rm -rf /tmp/cookies.txt
# download Memotion dataset features
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=11o35vKEMDQjvHV42aYMzwjEVlIDZIe1a' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=11o35vKEMDQjvHV42aYMzwjEVlIDZIe1a" -O 'feats_memotion.zip' && rm -rf /tmp/cookies.txt

--2022-04-21 16:00:34--  https://docs.google.com/uc?export=download&confirm=t&id=1YGigTCQQlVvS726YuECTMx0He8p0Xle2
Resolving docs.google.com (docs.google.com)... 142.251.12.138, 142.251.12.139, 142.251.12.101, ...
Connecting to docs.google.com (docs.google.com)|142.251.12.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0o-6o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/4uh0p4j60kjoqe34jutld88ejbqhg1l7/1650556800000/15631633617501527889/*/1YGigTCQQlVvS726YuECTMx0He8p0Xle2?e=download [following]
--2022-04-21 16:00:34--  https://doc-0o-6o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/4uh0p4j60kjoqe34jutld88ejbqhg1l7/1650556800000/15631633617501527889/*/1YGigTCQQlVvS726YuECTMx0He8p0Xle2?e=download
Resolving doc-0o-6o-docs.googleusercontent.com (doc-0o-6o-docs.googleusercontent.com)... 74.125.68.132, 2404:6800:4003:c02::84
Connecting to doc-0o-6o-docs.googleusercontent.com (doc-0o-6o-

In [None]:
# Unzip features for Memotion dataset & remove it to free disk
!unzip -q $HOME/feats_memotion.zip -d $HOME/features/
!rm -rf $HOME/feats_memotion.zip
# Unzip features for HM dataset & remove it to free disk
!unzip -q $HOME/feats_hm.zip -d $HOME/features/
!rm -rf $HOME/feats_hm.zip
# Move Memotion features into the same folder as HM features
!mv $HOME/features/feats_memotion/*.npy $HOME/features/feats_hm/
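
The merge step can also be written in Python, which makes it easy to detect a file-name clash between the two feature sets instead of silently overwriting a file. This is only a sketch of an alternative to the `mv` command above; `merge_features` is a hypothetical helper and not part of the original notebook.

```python
import os
import shutil


def merge_features(src_dir, dst_dir):
    """Move every .npy file from `src_dir` into `dst_dir` and return how many were moved.

    Raises FileExistsError instead of overwriting a feature file that already exists.
    """
    moved = 0
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith(".npy"):
            continue
        target = os.path.join(dst_dir, name)
        if os.path.exists(target):
            raise FileExistsError(f"{name} already exists in {dst_dir}")
        shutil.move(os.path.join(src_dir, name), target)
        moved += 1
    return moved


# Usage inside the notebook (instead of the `mv` command):
# n = merge_features(os.path.join(HOME, "features/feats_memotion"),
#                    os.path.join(HOME, "features/feats_hm"))
# print(f"[INFO] Moved {n} Memotion feature files.")
```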

---
## <font color='magenta'> <b> IV. Validating 'fine-tuned' VisualBERT models on `dev_unseen.jsonl`</b> </font>

### <font color='Violet'> <b> Submission#1 </b> </font>

|            | ROC-AUC | Accuracy |    Dataset   |
|------------|:-------:|:--------:|:------------:|
|Submission#1| $0.7555$| $0.7352$ | `dev_unseen` |
|Submission#1| $0.8108$| $0.7650$ | `test_unseen`|
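
For readers unfamiliar with the two metrics in these tables, here is a small, dependency-free illustration of their definitions. It is not how MMF computes them internally; the function names and the toy numbers are made up for this example. ROC-AUC is computed as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counting as one half.

```python
def accuracy(labels, predictions):
    """Fraction of data points whose predicted class equals the true class."""
    correct = sum(int(y == p) for y, p in zip(labels, predictions))
    return correct / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive is scored higher than a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy example (made-up numbers): 4 memes, true labels and predicted probabilities
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
predictions = [int(s >= 0.5) for s in scores]

print(roc_auc(labels, scores))        # 0.75
print(accuracy(labels, predictions))  # 0.5
```

The quadratic pair loop keeps the definition visible; for real evaluation a library implementation is the better choice.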

The following cell downloads the fine-tuned model from [this link](https://drive.google.com/file/d/1NOX2lJkbK7sKRsg4_y_KUcLamknowsu2/view?usp=sharing) to the `$HOME` directory:

In [None]:
"""
Uncomment it if needed
"""

# os.chdir(HOME)
# # Download the fine-tuned model
# !wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1NOX2lJkbK7sKRsg4_y_KUcLamknowsu2' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1NOX2lJkbK7sKRsg4_y_KUcLamknowsu2" -O 'submission#1.zip' && rm -rf /tmp/cookies.txt
# # unzip the model
# !unzip -qq $HOME/submission#1.zip -d $HOME/submission#1
# # remove the .zip after unzipping to free the disk
# !rm -rf $HOME/submission#1.zip

--2020-12-01 11:52:08--  https://docs.google.com/uc?export=download&confirm=bgVo&id=1NOX2lJkbK7sKRsg4_y_KUcLamknowsu2
Resolving docs.google.com (docs.google.com)... 172.217.2.110, 2607:f8b0:4004:80a::200e
Connecting to docs.google.com (docs.google.com)|172.217.2.110|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0c-8k-docs.googleusercontent.com/docs/securesc/v8qt23aghtn6s5u56li7348eoei3pipj/0s788egsk8ukafuun5kffq6q4q0pi94i/1606823475000/15631633617501527889/17850159486476952229Z/1NOX2lJkbK7sKRsg4_y_KUcLamknowsu2?e=download [following]
--2020-12-01 11:52:08--  https://doc-0c-8k-docs.googleusercontent.com/docs/securesc/v8qt23aghtn6s5u56li7348eoei3pipj/0s788egsk8ukafuun5kffq6q4q0pi94i/1606823475000/15631633617501527889/17850159486476952229Z/1NOX2lJkbK7sKRsg4_y_KUcLamknowsu2?e=download
Resolving doc-0c-8k-docs.googleusercontent.com (doc-0c-8k-docs.googleusercontent.com)... 172.217.15.65, 2607:f8b0:4004:810::2001
Connecting to doc-0c-8

In [None]:
"""
Uncomment it if needed
"""

# # Validate the model on the dev_unseen data
# os.chdir(HOME)
# # where checkpoint is
# ckpt_dir = os.path.join(HOME, "submission#1/best.ckpt")
# feats_dir = os.path.join(HOME, "features/feats_hm")

# !mmf_run config="projects/visual_bert/configs/hateful_memes/defaults.yaml" \
#     model="visual_bert" \
#     dataset=hateful_memes \
#     run_type=val \
#     checkpoint.resume_file=$ckpt_dir \
#     checkpoint.reset.optimizer=True \
#     dataset_config.hateful_memes.annotations.val[0]=hateful_memes/defaults/annotations/dev_unseen.jsonl \
#     dataset_config.hateful_memes.annotations.test[0]=hateful_memes/defaults/annotations/test_unseen.jsonl \
#     dataset_config.hateful_memes.features.train[0]=$feats_dir \
#     dataset_config.hateful_memes.features.val[0]=$feats_dir \
#     dataset_config.hateful_memes.features.test[0]=$feats_dir \

2020-11-26 14:15:02.201805: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
See the compact keys issue for more details: https://github.com/omry/omegaconf/issues/152
2020-11-26T14:15:05 | mmf.utils.configuration: Overriding option config to projects/visual_bert/configs/hateful_memes/defaults.yaml
2020-11-26T14:15:05 | mmf.utils.configuration: Overriding option model to visual_bert
2020-11-26T14:15:05 | mmf.utils.configuration: Overriding option datasets to hateful_memes
2020-11-26T14:15:05 | mmf.utils.configuration: Overriding option run_type to val
2020-11-26T14:15:05 | mmf.utils.configuration: Overriding option checkpoint.resume_file to /content/submission#1/best.ckpt
2020-11-26T14:15:05 | mmf.utils.configuration: Overriding option checkpoint.reset.optimizer to True
2020-11-26T14:15:05 | mmf.utils.configuration: Overriding option dataset_config.hateful_

### <font color='Violet'> <b> Submission#2 </b> </font>

|            | ROC-AUC | Accuracy |    Dataset   |
|------------|:-------:|:--------:|:------------:|
|Submission#2| $0.7757$| $0.7315$ | `dev_unseen` |
|Submission#2| $0.8268$| $0.7805$ | `test_unseen`|

The following cell downloads the fine-tuned model from [this link](https://drive.google.com/file/d/1To4L0on-Us-DHFn53b21lYYcl6RgaDCB/view?usp=sharing) to the `$HOME` directory:

In [None]:
"""
Uncomment it if needed
"""

# os.chdir(HOME)
# # Download the fine-tuned model
# !wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1To4L0on-Us-DHFn53b21lYYcl6RgaDCB' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1To4L0on-Us-DHFn53b21lYYcl6RgaDCB" -O 'submission#2.zip' && rm -rf /tmp/cookies.txt
# # Unzip the model
# !unzip -q $HOME/submission#2.zip -d $HOME/submission#2
# # remove the .zip after unzipping to free the disk
# !rm -rf $HOME/submission#2.zip

--2020-12-01 12:09:50--  https://docs.google.com/uc?export=download&confirm=G9D8&id=1To4L0on-Us-DHFn53b21lYYcl6RgaDCB
Resolving docs.google.com (docs.google.com)... 172.217.2.110, 2607:f8b0:4004:80a::200e
Connecting to docs.google.com (docs.google.com)|172.217.2.110|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-ac-docs.googleusercontent.com/docs/securesc/p2koco7jmelk51cclteicg7jkiakppuc/7qt6bevk35a4fhu1thg5pa1slplt4n9n/1606824525000/15631633617501527889/08426210042794117368Z/1To4L0on-Us-DHFn53b21lYYcl6RgaDCB?e=download [following]
--2020-12-01 12:09:50--  https://doc-14-ac-docs.googleusercontent.com/docs/securesc/p2koco7jmelk51cclteicg7jkiakppuc/7qt6bevk35a4fhu1thg5pa1slplt4n9n/1606824525000/15631633617501527889/08426210042794117368Z/1To4L0on-Us-DHFn53b21lYYcl6RgaDCB?e=download
Resolving doc-14-ac-docs.googleusercontent.com (doc-14-ac-docs.googleusercontent.com)... 172.217.15.65, 2607:f8b0:4004:810::2001
Connecting to doc-14-a

In [None]:
"""
Uncomment it if needed
"""

# os.chdir(HOME)
# # where checkpoint is
# ckpt_dir = os.path.join(HOME, "submission#2/best.ckpt")
# feats_dir = os.path.join(HOME, "features/feats_hm")

# !mmf_run config="projects/visual_bert/configs/hateful_memes/defaults.yaml" \
#     model="visual_bert" \
#     dataset=hateful_memes \
#     run_type=val \
#     checkpoint.resume_file=$ckpt_dir \
#     checkpoint.reset.optimizer=True \
#     dataset_config.hateful_memes.annotations.val[0]=hateful_memes/defaults/annotations/dev_unseen.jsonl \
#     dataset_config.hateful_memes.annotations.test[0]=hateful_memes/defaults/annotations/test_unseen.jsonl \
#     dataset_config.hateful_memes.features.train[0]=$feats_dir \
#     dataset_config.hateful_memes.features.val[0]=$feats_dir \
#     dataset_config.hateful_memes.features.test[0]=$feats_dir \

See the compact keys issue for more details: https://github.com/omry/omegaconf/issues/152
2022-04-21T16:10:52 | mmf.utils.configuration: Overriding option config to projects/visual_bert/configs/hateful_memes/defaults.yaml
2022-04-21T16:10:52 | mmf.utils.configuration: Overriding option model to visual_bert
2022-04-21T16:10:52 | mmf.utils.configuration: Overriding option datasets to hateful_memes
2022-04-21T16:10:52 | mmf.utils.configuration: Overriding option run_type to val
2022-04-21T16:10:52 | mmf.utils.configuration: Overriding option checkpoint.resume_file to /content/submission#2/best.ckpt
2022-04-21T16:10:52 | mmf.utils.configuration: Overriding option checkpoint.reset.optimizer to True
2022-04-21T16:10:52 | mmf.utils.configuration: Overriding option dataset_config.hateful_memes.annotations.val[0] to hateful_memes/defaults/annotations/dev_unseen.jsonl
2022-04-21T16:10:52 | mmf.utils.configuration: Overriding

### <font color='Violet'> <b> Submission#3 ($Best\ Submission$)</b> </font>

|            | ROC-AUC | Accuracy |    Dataset   |
|------------|:-------:|:--------:|:------------:|
|Submission#3| $-$     | $-$      | `dev_unseen` |
|Submission#3| $0.8518$| $0.8050$ | `test_unseen`|

After a hyper-parameter search, we ended up with multiple models with different ROC-AUC scores on the `dev_unseen` dataset. We sorted them by ROC-AUC score and kept every model with a score of `0.76` or higher (the threshold was chosen arbitrarily). The following figure shows these $27$ models with their ROC-AUC scores and hyper-parameters ([see this document for all the model scores, 60+ models in total](https://docs.google.com/spreadsheets/d/11m2p7vNxHhZWumkFNvv6d94HcqG77DItRX7MuGfOtWA/edit?usp=sharing)).

<details><summary>
<font color='Tan'> Figure 1: ROC-AUC scores (on `dev_unseen`) of different VisualBERT models </font></summary>

<img src="https://drive.google.com/uc?id=10WUBnSO5L5O44c8WCxHRA_iFZudH73lF" width="1000"> 

</details>

Then, predictions are collected from each of the $27$ models and the `Majority Voting` technique is applied: the `class` of a data point is the class predicted by the majority of the models.

> This technique is a form of [Ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) and corresponds to a hard [Voting classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html). It is related to, but not the same as, [Bootstrap Aggregating (BAGGING)](https://en.wikipedia.org/wiki/Bootstrap_aggregating#:~:text=Bootstrap%20aggregating%2C%20also%20called%20bagging,and%20helps%20to%20avoid%20overfitting.), which additionally trains each model on a bootstrap sample of the data.

Please see [this document](https://drive.google.com/file/d/1vjUsMqaqjZdoNj0w989RX7GKkPod-CGo/view?usp=sharing) to see all the model predictions for `test_unseen` and how the technique is applied.

**Note:**  The `proba` column in the submission, which stands for the probability of a data point belonging to a class, is set to the maximum probability among all of the $27$ models if the majority-voted class is $Class\ 1$, and to the minimum if it is $Class\ 0$.
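
To make the voting rule and the `proba` rule concrete, here is a minimal sketch for a single data point. The function name `majority_vote` and the toy numbers (3 models instead of $27$) are made up for illustration.

```python
def majority_vote(probas, labels):
    """Combine the predictions of several models for ONE data point.

    probas: probability predicted by each model
    labels: class (0 or 1) predicted by each model
    Returns (voted_label, voted_proba).
    """
    if len(probas) != len(labels) or not labels:
        raise ValueError("need one probability and one label per model")
    votes_for_class_1 = sum(labels)
    if votes_for_class_1 > len(labels) / 2:
        # Class 1 wins: report the maximum probability among the models
        return 1, max(probas)
    # Class 0 wins: report the minimum probability among the models
    return 0, min(probas)


# Toy example with 3 models
print(majority_vote([0.9, 0.6, 0.2], [1, 1, 0]))  # (1, 0.9)
print(majority_vote([0.7, 0.3, 0.1], [1, 0, 0]))  # (0, 0.1)
```

With an odd number of models, such as the $27$ used here, a tie cannot occur; in this sketch an even split would fall back to $Class\ 0$.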

See Part <font color='magenta'> <b> V. Generate predictions for the Challenge (`test_unseen.jsonl`) </b> </font> for the code!

---
## <font color='magenta'> <b> V. Generate predictions for the Challenge (`test_unseen.jsonl`) </b> </font>

### <font color='Thistle'> <b> Submission#1 </b> </font>

In [None]:
"""
Uncomment it if needed
"""

# os.chdir(HOME)
# # where checkpoint is
# ckpt_dir = os.path.join(HOME, "submission#1/best.ckpt")
# feats_dir = os.path.join(HOME, "features/feats_hm")

# !mmf_predict config="projects/visual_bert/configs/hateful_memes/defaults.yaml" \
#     model="visual_bert" \
#     dataset=hateful_memes \
#     run_type=test \
#     checkpoint.resume_file=$ckpt_dir \
#     checkpoint.reset.optimizer=True \
#     dataset_config.hateful_memes.annotations.val[0]=hateful_memes/defaults/annotations/dev_unseen.jsonl \
#     dataset_config.hateful_memes.annotations.test[0]=hateful_memes/defaults/annotations/test_unseen.jsonl \
#     dataset_config.hateful_memes.features.train[0]=$feats_dir \
#     dataset_config.hateful_memes.features.val[0]=$feats_dir \
#     dataset_config.hateful_memes.features.test[0]=$feats_dir \

2020-11-26 14:26:12.878826: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
See the compact keys issue for more details: https://github.com/omry/omegaconf/issues/152
2020-11-26T14:26:16 | mmf.utils.configuration: Overriding option config to projects/visual_bert/configs/hateful_memes/defaults.yaml
2020-11-26T14:26:16 | mmf.utils.configuration: Overriding option model to visual_bert
2020-11-26T14:26:16 | mmf.utils.configuration: Overriding option datasets to hateful_memes
2020-11-26T14:26:16 | mmf.utils.configuration: Overriding option run_type to test
2020-11-26T14:26:16 | mmf.utils.configuration: Overriding option checkpoint.resume_file to /content/submission#1/best.ckpt
2020-11-26T14:26:16 | mmf.utils.configuration: Overriding option checkpoint.reset.optimizer to True
2020-11-26T14:26:16 | mmf.utils.configuration: Overriding option dataset_config.hateful

### <font color='Thistle'> <b> Submission#2 </b> </font>

In [None]:
"""
Uncomment it if needed
"""

# os.chdir(HOME)
# # where checkpoint is
# ckpt_dir = os.path.join(HOME, "submission#2/best.ckpt")
# feats_dir = os.path.join(HOME, "features/feats_hm")

# !mmf_predict config="projects/visual_bert/configs/hateful_memes/defaults.yaml" \
#     model="visual_bert" \
#     dataset=hateful_memes \
#     run_type=test \
#     checkpoint.resume_file=$ckpt_dir \
#     checkpoint.reset.optimizer=True \
#     dataset_config.hateful_memes.annotations.val[0]=hateful_memes/defaults/annotations/dev_unseen.jsonl \
#     dataset_config.hateful_memes.annotations.test[0]=hateful_memes/defaults/annotations/test_unseen.jsonl \
#     dataset_config.hateful_memes.features.train[0]=$feats_dir \
#     dataset_config.hateful_memes.features.val[0]=$feats_dir \
#     dataset_config.hateful_memes.features.test[0]=$feats_dir \

2020-11-26 14:28:04.106941: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
See the compact keys issue for more details: https://github.com/omry/omegaconf/issues/152
2020-11-26T14:28:07 | mmf.utils.configuration: Overriding option config to projects/visual_bert/configs/hateful_memes/defaults.yaml
2020-11-26T14:28:07 | mmf.utils.configuration: Overriding option model to visual_bert
2020-11-26T14:28:07 | mmf.utils.configuration: Overriding option datasets to hateful_memes
2020-11-26T14:28:07 | mmf.utils.configuration: Overriding option run_type to test
2020-11-26T14:28:07 | mmf.utils.configuration: Overriding option checkpoint.resume_file to /content/submission#2/best.ckpt
2020-11-26T14:28:07 | mmf.utils.configuration: Overriding option checkpoint.reset.optimizer to True
2020-11-26T14:28:07 | mmf.utils.configuration: Overriding option dataset_config.hateful

### <font color='Thistle'> <b> Submission#3  ($Best\ Submission$)</b> </font>


In [None]:
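# Create the working folder (no error if it already exists, e.g. when re-running the cell)
os.makedirs(os.path.join(HOME, "sub3"), exist_ok=True)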
os.chdir(os.path.join(HOME, "sub3"))
!git clone https://github.com/rizavelioglu/hateful_memes-hate_detectron.git
!cp hateful_memes-hate_detectron/utils/generate_submission.sh .
!chmod +x generate_submission.sh

Cloning into 'hateful_memes-hate_detectron'...
remote: Enumerating objects: 39, done.[K
remote: Counting objects: 100% (39/39), done.[K
remote: Compressing objects: 100% (28/28), done.[K
remote: Total 39 (delta 14), reused 35 (delta 10), pack-reused 0[K
Unpacking objects: 100% (39/39), done.


Download 27 models in a `.7z` file and extract them all

In [None]:
# Download the .7z file which includes all the 27 models
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1D1nehiowEHMxJwijybfuTiC1835wZaHk' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1D1nehiowEHMxJwijybfuTiC1835wZaHk" -O 'majority_voting_models.7z' && rm -rf /tmp/cookies.txt
# Extract the .7z file to get the models
# !p7zip -d 'majority_voting_models.7z'
# !sudo apt-get install p7zip-full p7zip-rar
!7z e 'majority_voting_models.7z'

Generate submission for each model

In [None]:
import subprocess

# Every extracted checkpoint is one model of the ensemble
models = sorted(i for i in os.listdir(".") if i.endswith(".ckpt"))
feats_dir = os.path.join(HOME, "features/feats_hm")

print(f"[INFO] Getting predictions for {len(models)} models! This might take long..")
for model in models:
    # Execute the bash script which gets predictions for 'test_unseen' data
    result = subprocess.run(["./generate_submission.sh", model, feats_dir])
    if result.returncode != 0:
        print(f"[WARN] generate_submission.sh failed for {model} (exit code {result.returncode})")

Apply majority voting

In [None]:
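import numpy as np
import pandas as pd

# Store all the prediction folders
folders = sorted(i for i in os.listdir("save/preds") if i.startswith("hateful_memes"))

# Read one prediction file (columns: id, proba, label) per model
frames = []
for folder in folders:
    reports_dir = f"save/preds/{folder}/reports/"
    csv_files = [i for i in os.listdir(reports_dir) if i.endswith(".csv")]
    if not csv_files:
        print(f"[WARN] No prediction file found in {reports_dir}, skipping it.")
        continue
    frames.append(pd.read_csv(os.path.join(reports_dir, csv_files[0])))

if not frames:
    raise RuntimeError("No prediction files found in 'save/preds'.")
n_models = len(frames)
# assert n_models == 27

# One row per data point, one column per model (all files share the same row order)
probas = np.column_stack([df["proba"] for df in frames])
labels = np.column_stack([df["label"] for df in frames])

# Class 1 if more than half of the models voted for it, Class 0 otherwise
majority = labels.sum(axis=1) > n_models / 2

# Create the submission: max. probability for Class 1, min. probability for Class 0
submission = frames[0][["id"]].copy()
submission["proba"] = np.where(majority, probas.max(axis=1), probas.min(axis=1))
submission["label"] = majority.astype(int)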

Sort the submission according to the submission template & save the final submission file to `submission#3.csv`

In [None]:
os.chdir(os.path.join(HOME, "sub3"))
# Download the Phase2 submission template
# NOTE: this is a pre-signed S3 link (X-Amz-Expires=86400, i.e. valid for 24 hours after it was issued).
# If the download fails, obtain the template from the competition page and save it here under the same name.
!wget -O submission_format_phase_2.csv  "https://drivendata-prod.s3.amazonaws.com/data/70/public/submission_format_phase_2.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYVI2LMPSY%2F20201201%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201201T023533Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=04330cf22c33f1817cac29509178d2c11a07d620e8237241c5088f4fd25df2b3"
template = pd.read_csv("submission_format_phase_2.csv")
# Sort the 'submission' file
submission = submission.set_index('id')
submission = submission.reindex(index=template['id'])
submission = submission.reset_index()
# Save submission file
submission.to_csv(f"{HOME}/submission#3.csv", index=False)

--2020-12-01 12:46:36--  https://drivendata-prod.s3.amazonaws.com/data/70/public/submission_format_phase_2.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYVI2LMPSY%2F20201201%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201201T023533Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=04330cf22c33f1817cac29509178d2c11a07d620e8237241c5088f4fd25df2b3
Resolving drivendata-prod.s3.amazonaws.com (drivendata-prod.s3.amazonaws.com)... 52.216.115.59
Connecting to drivendata-prod.s3.amazonaws.com (drivendata-prod.s3.amazonaws.com)|52.216.115.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23807 (23K) [text/csv]
Saving to: ‘submission_format_phase_2.csv’


2020-12-01 12:46:36 (14.2 MB/s) - ‘submission_format_phase_2.csv’ saved [23807/23807]
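
Before uploading, the final file can be sanity-checked against the template. The following is a minimal sketch using only the standard library; `check_submission` is a hypothetical helper, and it assumes that both files write the `id` values in the same textual form.

```python
import csv


def _is_probability(text):
    """True if `text` parses as a number in [0, 1] (NaN and empty values fail)."""
    try:
        return 0.0 <= float(text) <= 1.0
    except ValueError:
        return False


def _is_binary(text):
    """True if `text` parses as 0 or 1 (NaN and empty values fail)."""
    try:
        return float(text) in (0.0, 1.0)
    except ValueError:
        return False


def check_submission(submission_path, template_path):
    """Return a list of problems found in the submission; an empty list means OK."""
    with open(submission_path, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(template_path, newline="") as f:
        template_ids = [row["id"] for row in csv.DictReader(f)]

    problems = []
    if [row["id"] for row in rows] != template_ids:
        problems.append("ids do not match the template order")
    for row in rows:
        if not _is_probability(row["proba"]):
            problems.append(f"id {row['id']}: proba is not in [0, 1]")
        if not _is_binary(row["label"]):
            problems.append(f"id {row['id']}: label is not 0 or 1")
    return problems


# Usage inside the notebook:
# print(check_submission(f"{HOME}/submission#3.csv", "submission_format_phase_2.csv") or "OK")
```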

