## Using our fine-tuned classifier to detect child-directed websites

The code below uses the fine-tuned model to classify child-directed websites. You can either fine-tune your own model by following the instructions in the [fine-tuning.ipynb](https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/blob/main/classifier/fine-tuning.ipynb) notebook, or download the fine-tuned model files from [the release page](https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1).

Here's what each step involves:

1. **Loading the Model and Tokenizer**:
   - **Model**: The fine-tuned model is loaded from its saved directory.
   - **Tokenizer**: The tokenizer that was used during the fine-tuning phase is also loaded. The tokenizer converts text data into a format that the model can understand (i.e., tokenized tensors).

2. **Creating a Classification Pipeline**:
   - A pipeline is created using the `pipeline` function from the `transformers` library. This pipeline encapsulates the model and tokenizer into a single object which simplifies the process of making predictions.
   - The pipeline is configured for text classification. It runs on the GPU when one is available and falls back to the CPU otherwise. Truncation is enabled so that text exceeding the model's maximum input length is cut to fit.

3. **Running Predictions on Sample Text Data**:
   - With the pipeline ready, you can pass in sample text to get predictions. The pipeline outputs the predicted class label along with a confidence score. In the paper, the confidence scores were used to prioritize the websites most likely to be child-directed for manual labeling.
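
The prioritization described in step 3 can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the paper: it assumes that `LABEL_1` is the child-directed class (as the sample output in this notebook suggests) and that the model is a two-class classifier whose scores sum to one. The helper names `child_directed_score` and `rank_for_labeling` are hypothetical, and the predictions in the usage lines are made-up values in the shape the pipeline returns.

```python
from typing import Dict, List, Tuple

# Assumption: LABEL_1 denotes the child-directed class.
CHILD_LABEL = "LABEL_1"


def child_directed_score(prediction: Dict[str, object]) -> float:
    """Turn a top-1 prediction into a child-directed score in [0, 1]."""
    score = float(prediction["score"])
    # Assumption: two classes, so the other class has probability 1 - score.
    return score if prediction["label"] == CHILD_LABEL else 1.0 - score


def rank_for_labeling(
    texts: List[str], predictions: List[Dict[str, object]]
) -> List[Tuple[float, str]]:
    """Sort texts so the most likely child-directed ones come first."""
    scored = [(child_directed_score(p), t) for t, p in zip(texts, predictions)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up predictions, one dict per input text.
example_texts = ["news homepage", "toddler games"]
example_predictions = [
    {"label": "LABEL_0", "score": 0.9},
    {"label": "LABEL_1", "score": 0.8},
]
for score, text in rank_for_labeling(example_texts, example_predictions):
    print(f"{score:.2f}  {text}")
```

Converting every prediction to a single child-directed score makes items with different predicted labels comparable, so one sorted list can drive the manual review order.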


### Download the fine-tuned model and the tokenizer
- You can skip this step if you already fine-tuned the model yourself and obtained the model weights and the tokenizer.
- If you don't have `wget` installed, you can manually download the files from [the release page](https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1).
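
If `wget` is unavailable, the same download could be scripted with Python's standard library. The sketch below is one possible approach, not part of the original notebook: `file_url`, `download_model_files`, and `MODEL_FILES` are hypothetical names, and the file list simply mirrors the `wget` commands in this section. It assumes the files are reachable at the base URL; if the server responds with an error status, `urllib.request.urlretrieve` raises `urllib.error.HTTPError`.

```python
import urllib.request
from pathlib import Path

BASE_URL = (
    "https://github.com/targeted-and-troublesome/"
    "targeted-and-troublesome-crawler/releases/download/v0.1"
)

# Same files that the wget commands in this section fetch.
MODEL_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "trainer_state.json",
]


def file_url(name: str, base_url: str = BASE_URL) -> str:
    """Build the download URL for one model file."""
    return f"{base_url.rstrip('/')}/{name}"


def download_model_files(target_dir: str = "model", base_url: str = BASE_URL) -> Path:
    """Download each model file into target_dir, skipping files already present."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name in MODEL_FILES:
        destination = target / name
        if destination.exists():
            continue  # keep files that were already downloaded
        urllib.request.urlretrieve(file_url(name, base_url), str(destination))
    return target
```

Calling `download_model_files()` would place the files in a local `model` directory, which is the path the pipeline cell expects.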


In [1]:
%env BASE_URL=https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1

# create the model directory
! mkdir -p model

! wget ${BASE_URL}/model.safetensors -P model
! wget ${BASE_URL}/config.json -P model
! wget ${BASE_URL}/tokenizer.json -P model
! wget ${BASE_URL}/tokenizer_config.json -P model
! wget ${BASE_URL}/trainer_state.json -P model

! ls model


env: BASE_URL=https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1
--2024-05-20 19:05:37--  https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1/model.safetensors
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-05-20 19:05:38 ERROR 404: Not Found.

--2024-05-20 19:05:38--  https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1/config.json
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-05-20 19:05:38 ERROR 404: Not Found.

--2024-05-20 19:05:38--  https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1/tokenizer.json
Resolving github.com (github.com

### Create the classification pipeline

In [2]:
import torch
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# model weights and tokenizer files should be in this directory
MODEL_PATH = 'model'

# Load the model and tokenizer
MODEL = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, local_files_only=True)
TOKENIZER = AutoTokenizer.from_pretrained(MODEL_PATH, local_files_only=True)


def create_pipeline(model, tokenizer):
    """Create a text classification pipeline."""
    # device index 0 selects the first GPU; -1 selects the CPU
    device = 0 if torch.cuda.is_available() else -1
    if device == 0:
        print(f"Using GPU: {torch.cuda.get_device_name(0)}")
    else:
        print("Using CPU")

    try:
        return pipeline("text-classification", model=model,
                        tokenizer=tokenizer, device=device, truncation=True)
    except Exception as e:
        print(f"Error creating the pipeline: {e}")
        return None


webpage_classifier = create_pipeline(MODEL, TOKENIZER)

Using CPU


### Example usage

In [3]:
# sample "title. description" strings to classify

TEST_TITLE_DESCRIPTIONS = [
    "Online Toddler Games and Online Games for Kids. Online games for toddlers, preschool kids and babies.",
    "Play the best games for children of all ages! Free games made for 2 - 3 - 4 - 5 years old.",
    "Accueil - Jeux éducatifs gratuits en ligne. Jeux éducatifs gratuits pour enfants de maternelle et primaire.",
    "Kinderspiele - Spiele Kinderspiele online auf Jetztspielen.de. Geh auf Abenteuer, verbessere deine Mathefähigkeiten und mehr in unserer tollen Sammlung an Kinderspielen. Wir haben alles von Lern- bis zu Musikspielen.",
    "Juegos para niños y niñas de 2 a 3 años. Juegos educativos para niños y niñas de 2 a 3 años. Juegos de aprendizaje para niños y niñas de 2 a 3 años.",
    # not child-directed:
    "Internet Archive: Digital Library of Free & Borrowable Books, Movies, Music & Wayback Machine",
    "Home - BBC News. Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.",
    "NOS Nieuws - Nieuws, Sport en Evenementen | Nederlandse Omroep Stichting. Altijd. Overal. Met de NOS blijf je altijd en overal op de hoogte van het laatste nieuws",
    ]



for title_desc in TEST_TITLE_DESCRIPTIONS:
    result = webpage_classifier(title_desc)
    print(f"Result: {result} {title_desc}")


Result: [{'label': 'LABEL_1', 'score': 0.9773780703544617}] Online Toddler Games and Online Games for Kids. Online games for toddlers, preschool kids and babies.
Result: [{'label': 'LABEL_1', 'score': 0.9791836738586426}] Play the best games for children of all ages! Free games made for 2 - 3 - 4 - 5 years old.
Result: [{'label': 'LABEL_1', 'score': 0.9812673330307007}] Accueil - Jeux éducatifs gratuits en ligne. Jeux éducatifs gratuits pour enfants de maternelle et primaire.
Result: [{'label': 'LABEL_1', 'score': 0.9803311228752136}] Kinderspiele - Spiele Kinderspiele online auf Jetztspielen.de. Geh auf Abenteuer, verbessere deine Mathefähigkeiten und mehr in unserer tollen Sammlung an Kinderspielen. Wir haben alles von Lern- bis zu Musikspielen.
Result: [{'label': 'LABEL_1', 'score': 0.9803564548492432}] Juegos para niños y niñas de 2 a 3 años. Juegos educativos para niños y niñas de 2 a 3 años. Juegos de aprendizaje para niños y niñas de 2 a 3 años.
Result: [{'label': 'LABEL_0', 'sc