Still a **WIP**. I will try to finish it before the end of August.

A notebook where I will explore some of the best tools offered by [HuggingFace](https://huggingface.co/) and apply them to the 
competition's data.

Let's start by exploring the dataset.

# Explore the training data

First, let's check the training data.

## Basic exploration

Let's start with basic exploration, i.e. loading the data, checking the 
columns, their content, and so on.
For that, we will mainly use the beloved Pandas.

In [None]:
import pandas as pd

Let's load the training data first.

In [None]:
train = pd.read_csv("../input/commonlitreadabilityprize/train.csv")

In [None]:
print(train.loc[0, "excerpt"])

In [None]:
train.loc[0, "target"]

In [None]:
train.head(2).T

In [None]:
train["target"].plot(kind="hist", bins=100)

In [None]:
train["standard_error"].plot(kind="hist", bins=100)

=> This is a hard example.

## NLP-specific exploration

Now that we have a better understanding of the dataset, we 
can explore its content, particularly the textual column.
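
As a first look at the textual column, here is a minimal sketch that computes a few surface statistics for one excerpt. The function name `text_stats`, the regex-based word and sentence splitting, and the toy sentence are all assumptions made for illustration; this is a rough proxy, not a real tokenizer.

```python
import re
from statistics import mean

# Naive splitters: good enough for a first look, not a real tokenizer.
_WORD_RE = re.compile(r"[A-Za-z']+")
_SENTENCE_RE = re.compile(r"[.!?]+")


def text_stats(excerpt: str) -> dict:
    """Return a few surface statistics for one excerpt."""
    words = _WORD_RE.findall(excerpt)
    # Count the non-empty chunks between sentence-ending punctuation marks.
    sentences = [s for s in _SENTENCE_RE.split(excerpt) if s.strip()]
    return {
        "n_chars": len(excerpt),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "mean_word_len": mean(len(w) for w in words) if words else 0.0,
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }


# Hypothetical toy excerpt, only to show the shape of the output.
sample = "The cat sat on the mat. It was warm! Why move?"
stats = text_stats(sample)
print(stats)
```

On the real data, something like `pd.DataFrame(train["excerpt"].map(text_stats).tolist())` should give one row of statistics per excerpt, which can then be compared against `target`.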

## Gensim data exploration

Let's use gensim to explore the data.

In [None]:
# TODO: Use some gensim and/or spacy to explore the dataset?

# What is HuggingFace?

![huggingface logo](https://huggingface.co/front/assets/huggingface_logo.svg)

In case you have been living in a cave recently, HuggingFace is **THE** NLP company.

They offer many NLP services and a lot of open-source code.

Most of you might know them for their most popular library: [transformers](https://github.com/huggingface/transformers).

The library was started around ?

Let's explore this library further.

## Transformers

This is one of the best NLP deep learning libraries available. It was started when the 
transformers revolution began, but it now includes models for almost any NLP task and recent paper.

Let's see some code:

In [None]:
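import torch


class Dataset:
    """Tokenize excerpts on the fly for use with a PyTorch DataLoader."""

    def __init__(self, excerpt, tokenizer, max_len):
        self.excerpt = excerpt      # array-like of raw texts
        self.tokenizer = tokenizer  # HuggingFace tokenizer
        self.max_len = max_len      # pad/truncate every excerpt to this many tokens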

    def __len__(self):
        return len(self.excerpt)

    def __getitem__(self, item):
        text = str(self.excerpt[item])
        inputs = self.tokenizer(
            text, 
            max_length=self.max_len, 
            padding="max_length", 
            truncation=True
        )

        ids = inputs["input_ids"]
        mask = inputs["attention_mask"]

        return {
            "input_ids": torch.tensor(ids, dtype=torch.long),
            "attention_mask": torch.tensor(mask, dtype=torch.long),
        }
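
To make the `padding="max_length"` and `truncation=True` arguments above more concrete, here is a toy sketch of what they do to a single sequence of token ids. The helper `pad_and_truncate`, the pad id `0`, and the token ids are hypothetical. A real tokenizer also keeps its special end token when truncating, which this simplified version does not.

```python
def pad_and_truncate(token_ids, max_len, pad_id=0):
    """Mimic padding="max_length" with truncation=True for one sequence."""
    ids = list(token_ids)[:max_len]   # truncation: drop tokens past max_len
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_len - len(ids)
    ids = ids + [pad_id] * n_pad      # padding: fill up to max_len
    mask = mask + [0] * n_pad         # 0 marks padding the model should ignore
    return {"input_ids": ids, "attention_mask": mask}


# Hypothetical token ids: one sequence shorter and one longer than max_len.
short_seq = pad_and_truncate([101, 7592, 102], max_len=5)
long_seq = pad_and_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)
print(short_seq)
print(long_seq)
```

Either way, every item has exactly `max_len` ids, which is what lets the `DataLoader` stack them into one tensor per batch.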

In [None]:
import torch
import pandas as pd
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def generate_predictions(model_path, max_len):
    """Predict a readability score for every excerpt in test.csv."""
    # Fall back to CPU so the cell also runs without a GPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model.to(device)
    model.eval()  # disable dropout for inference

    df = pd.read_csv("../input/commonlitreadabilityprize/test.csv")

    dataset = Dataset(excerpt=df.excerpt.values, tokenizer=tokenizer, max_len=max_len)
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=32, num_workers=4, pin_memory=True, shuffle=False
    )

    final_output = []

    with torch.no_grad():  # no gradients needed at inference time
        for data in data_loader:
            data = {key: value.to(device) for key, value in data.items()}
            output = model(**data)
            # Flatten the logits to one score per excerpt.
            # This assumes the checkpoint has a single output (regression head).
            output = output.logits.detach().cpu().numpy().ravel().tolist()
            final_output.extend(output)

    torch.cuda.empty_cache()
    return np.array(final_output)


## Datasets

Another popular HuggingFace library is [datasets](https://github.com/huggingface/datasets).

## AutoNLP

Next, we will explore a new tool offered by HuggingFace: AutoNLP.

It works via a CLI and can be used to automatically train an NLP model on a given task and dataset.

Let's see how it works on the dataset.

First, you need to get an invitation and then an API key. Note that the product is still in beta as of the end of May 2021. 

Then, here are the main steps (from [Abhishek Thakur](https://www.kaggle.com/abhishek) himself in this [discussion](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/237795)):
    
    
- Step-1: Login: `autonlp login --api-key YOUR_HF_API_KEY`

- Step-2: Create a project. You can choose any name: `autonlp create_project --name readability --language en --task single_column_regression --max_models 50`

- Step-3: Upload training data: `autonlp upload --project readability --split train --col_mapping excerpt:text,target:target --files ~/datasets/read/train.csv`

- Step-4: Upload validation data: `autonlp upload --project readability --split valid --col_mapping excerpt:text,target:target --files ~/datasets/read/valid.csv`

- Step-5: Train models: `autonlp train --project readability`
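
Steps 3 and 4 above expect two separate files, while the competition ships a single `train.csv`. Here is a minimal sketch of one way to produce them, assuming a random 80/20 split. The helper `split_csv`, the split ratio, the seed, and the demo file names are all assumptions, not part of the AutoNLP workflow itself.

```python
import csv
import os
import random
import tempfile


def split_csv(src_path, train_path, valid_path, valid_fraction=0.2, seed=42):
    """Shuffle the rows of src_path and write a train file and a valid file."""
    # newline="" lets the csv module handle line breaks inside quoted excerpts.
    with open(src_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    random.Random(seed).shuffle(rows)
    n_valid = int(len(rows) * valid_fraction)
    splits = {valid_path: rows[:n_valid], train_path: rows[n_valid:]}

    for path, subset in splits.items():
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # both files keep the original columns
            writer.writerows(subset)
    return len(rows) - n_valid, n_valid


# Demo on a hypothetical 10-row file; on Kaggle, src would be train.csv.
demo_dir = tempfile.mkdtemp()
src = os.path.join(demo_dir, "all.csv")
with open(src, "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["id", "excerpt", "target"])
    w.writerows([[i, f"Excerpt {i},\nwith a line break.", i / 10] for i in range(10)])

n_train, n_valid = split_csv(
    src, os.path.join(demo_dir, "train.csv"), os.path.join(demo_dir, "valid.csv")
)
print(n_train, n_valid)
```

Both output files keep the `excerpt` and `target` columns, so the `--col_mapping excerpt:text,target:target` argument from the steps above applies to them unchanged.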

Let's apply these steps.

## NLP COURSE

A brand-new course from the HuggingFace team has just come out (June 2021).

<img src="https://huggingface.co/course/static/chapter1/transformers_chrono.png">

For now, the first 4 chapters are available, and the remaining 8 will be released by the end of 2021.
If you have some time and want to learn more about modern NLP, and transformers in particular, give it a try.

# Online services

These are less well known but valuable if you want to prototype quickly and/or deploy a model in production.

In what follows, some screenshots: 

## TPUs with HuggingFace?

It is possible to use TPUs with HuggingFace of course. Let's see how it could be done...

## TPUs with JAX?

# Additional resources

- A very good analysis of what HuggingFace is: https://marksaroufim.substack.com/p/huggingface
- Another great EDA and a baseline by Andrada Olteanu: https://www.kaggle.com/andradaolteanu/i-commonlit-explore-xgbrf-repeatedfold-model
- AutoClasses example by Abhishek Thakur: https://www.kaggle.com/abhishek/yum-yum-yum
- Explanation of how the target is computed: https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240886#1318829
- AutoNLP discussion: https://www.kaggle.com/c/commonlitreadabilityprize/discussion/237795