## Fine-tuning or using the model to detect child-directed websites

To classify child-directed websites with our fine-tuned model, follow these steps:

**Load Model and Tokenizer**: Download the fine-tuned model files from [the release page](https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1), or fine-tune the model on your own data using the [fine-tuning notebook](https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/blob/main/classifier/fine-tuning.ipynb).


**Create a Classification Pipeline**: Use the `transformers` library to create a text classification pipeline.

**Run Predictions**: For each input, the pipeline outputs the predicted class label along with a confidence score. In the paper, the confidence scores were used to prioritize the websites most likely to be child-directed for manual labeling.
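
The prioritization step can be illustrated with a small, self-contained sketch. It is not the procedure from the paper, only one plausible way to order candidates. It assumes that `LABEL_1` denotes the child-directed class and that the classifier is binary, so the score of the other class is `1 - score`. The function names and the sample predictions are hypothetical.

```python
def child_directed_probability(prediction, positive_label="LABEL_1"):
    """Convert a top-1 prediction into a score for the positive class.

    Assumes a binary classifier whose two class scores sum to 1.
    """
    score = prediction["score"]
    return score if prediction["label"] == positive_label else 1.0 - score


def rank_for_manual_labeling(texts, predictions, positive_label="LABEL_1"):
    """Sort texts so the most likely child-directed ones come first."""
    scored = [
        (child_directed_probability(pred, positive_label), text)
        for text, pred in zip(texts, predictions)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical predictions, shaped like the dicts the pipeline returns
texts = ["kids games", "news site", "toddler songs"]
predictions = [
    {"label": "LABEL_1", "score": 0.97},
    {"label": "LABEL_0", "score": 0.99},
    {"label": "LABEL_1", "score": 0.60},
]

for probability, text in rank_for_manual_labeling(texts, predictions):
    print(f"{probability:.2f}  {text}")
```

Converting every prediction to a positive-class score first keeps confident `LABEL_0` results at the bottom of the list instead of mixing them with confident `LABEL_1` results.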


### Download the fine-tuned model and the tokenizer
- You can skip this step if you have already fine-tuned or downloaded the model and tokenizer files.
- If you don't have `wget` installed, you can manually download the files from [the release page](https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1).
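
If `wget` is unavailable, the same files could also be fetched from Python. The following is a minimal sketch using only the standard library. The helper names `missing_files` and `download_model_files` are hypothetical, and the file list simply mirrors the files named in the download cell of this notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = ("https://github.com/targeted-and-troublesome/"
            "targeted-and-troublesome-crawler/releases/download/v0.1")
MODEL_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "trainer_state.json",
]


def missing_files(model_dir, filenames=MODEL_FILES):
    """Return the file names that are not yet present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in filenames if not (model_dir / name).exists()]


def download_model_files(model_dir="model", base_url=BASE_URL,
                         filenames=MODEL_FILES):
    """Download only the missing files (similar in spirit to `wget -nc`)."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    for name in missing_files(model_dir, filenames):
        urlretrieve(f"{base_url}/{name}", str(model_dir / name))
    return sorted(path.name for path in model_dir.iterdir())


# Usage (performs network requests, so it is left commented out here):
# print(download_model_files("model"))
```

Like `wget -nc`, the sketch leaves existing files untouched, so it is safe to rerun after a partial download. It does not verify file integrity.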


In [1]:
%%bash
# Set environment variables
export BASE_URL=https://github.com/targeted-and-troublesome/targeted-and-troublesome-crawler/releases/download/v0.1
export MODEL_DIR=model

# Create the model directory
mkdir -p ${MODEL_DIR}

# Download files if they do not already exist
wget -nc ${BASE_URL}/model.safetensors -P ${MODEL_DIR}
wget -nc ${BASE_URL}/config.json -P ${MODEL_DIR}
wget -nc ${BASE_URL}/tokenizer.json -P ${MODEL_DIR}
wget -nc ${BASE_URL}/tokenizer_config.json -P ${MODEL_DIR}
wget -nc ${BASE_URL}/trainer_state.json -P ${MODEL_DIR}

# List the contents of the model directory
ls ${MODEL_DIR}


File ‘model/model.safetensors’ already there; not retrieving.

File ‘model/config.json’ already there; not retrieving.

File ‘model/tokenizer.json’ already there; not retrieving.

File ‘model/tokenizer_config.json’ already there; not retrieving.

File ‘model/trainer_state.json’ already there; not retrieving.



config.json
model.safetensors
tokenizer_config.json
tokenizer.json
trainer_state.json


### Create the classification pipeline

In [2]:
import torch
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# model weights and tokenizer files should be in this directory
MODEL_PATH = 'model'

# Load the model and tokenizer
MODEL = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, local_files_only=True)
TOKENIZER = AutoTokenizer.from_pretrained(MODEL_PATH, local_files_only=True)


def create_pipeline(model, tokenizer):
    """Create a text classification pipeline on the GPU if available, else on the CPU."""

    # Device index 0 selects the first CUDA GPU; -1 selects the CPU
    device = 0 if torch.cuda.is_available() else -1

    if device == 0:
        print(f"Using GPU: {torch.cuda.get_device_name(0)}")
    else:
        print("Using CPU")

    try:
        # truncation=True cuts inputs that exceed the model's maximum length
        return pipeline("text-classification", model=model,
                        tokenizer=tokenizer, device=device, truncation=True)
    except Exception as e:
        print(f"Error creating the pipeline: {e}")
        return None


webpage_classifier = create_pipeline(MODEL, TOKENIZER)

Using CPU
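
When many title and description strings need to be classified, a `transformers` pipeline can be called with a list of texts and a `batch_size` argument instead of one string at a time. The sketch below wraps that call and attaches readable label names. The helper `classify_batch`, the `LABEL_NAMES` mapping (which assumes `LABEL_1` is the child-directed class), and the `fake_classifier` stand-in are all hypothetical. The stand-in exists only so the example runs without the model files.

```python
# Assumed mapping: LABEL_1 = child-directed, LABEL_0 = not child-directed
LABEL_NAMES = {"LABEL_0": "not child-directed", "LABEL_1": "child-directed"}


def classify_batch(classifier, texts, batch_size=8):
    """Classify several texts in one call and attach readable label names."""
    results = classifier(texts, batch_size=batch_size)
    return [
        {
            "text": text,
            "label": LABEL_NAMES.get(result["label"], result["label"]),
            "score": result["score"],
        }
        for text, result in zip(texts, results)
    ]


def fake_classifier(texts, batch_size=8):
    """Stand-in for the real pipeline; returns one dict per input text."""
    return [
        {"label": "LABEL_1" if "kids" in text.lower() else "LABEL_0",
         "score": 0.9}
        for text in texts
    ]


for row in classify_batch(fake_classifier, ["Games for kids", "World news"]):
    print(row)
```

With the real pipeline, the call would be `classify_batch(webpage_classifier, texts)`. A suitable `batch_size` depends on the available memory, so the default of 8 is only a starting point.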


### Example usage

In [6]:
MIXED_TITLE_DESCRIPTIONS = [
    # child-related:
    "Online Toddler Games and Online Games for Kids. Online games for toddlers, preschool kids and babies.",
    "Play the best games for children of all ages! Free games made for 2 - 3 - 4 - 5 years old.",
    "Accueil - Jeux éducatifs gratuits en ligne. Jeux éducatifs gratuits pour enfants de maternelle et primaire.",
    "Kinderspiele - Spiele Kinderspiele online auf Jetztspielen.de. Geh auf Abenteuer, verbessere deine Mathefähigkeiten und mehr in unserer tollen Sammlung an Kinderspielen. Wir haben alles von Lern- bis zu Musikspielen.",
    "Juegos para niños y niñas de 2 a 3 años. Juegos educativos para niños y niñas de 2 a 3 años. Juegos de aprendizaje para niños y niñas de 2 a 3 años.",
    # non-child related:
    "Internet Archive: Digital Library of Free & Borrowable Books, Movies, Music & Wayback Machine",
    "Home - BBC News. Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.",
    "NOS Nieuws - Nieuws, Sport en Evenementen | Nederlandse Omroep Stichting. Altijd. Overal. Met de NOS blijf je altijd en overal op de hoogte van het laatste nieuws",
    ]



for title_desc in MIXED_TITLE_DESCRIPTIONS:
    result = webpage_classifier(title_desc)
    label = result[0]['label']
    score = result[0]['score']
    print(f"Result: {label} ({score:.4f}) - {title_desc}")

Result: LABEL_1 (0.9774) - Online Toddler Games and Online Games for Kids. Online games for toddlers, preschool kids and babies.
Result: LABEL_1 (0.9792) - Play the best games for children of all ages! Free games made for 2 - 3 - 4 - 5 years old.
Result: LABEL_1 (0.9813) - Accueil - Jeux éducatifs gratuits en ligne. Jeux éducatifs gratuits pour enfants de maternelle et primaire.
Result: LABEL_1 (0.9803) - Kinderspiele - Spiele Kinderspiele online auf Jetztspielen.de. Geh auf Abenteuer, verbessere deine Mathefähigkeiten und mehr in unserer tollen Sammlung an Kinderspielen. Wir haben alles von Lern- bis zu Musikspielen.
Result: LABEL_1 (0.9804) - Juegos para niños y niñas de 2 a 3 años. Juegos educativos para niños y niñas de 2 a 3 años. Juegos de aprendizaje para niños y niñas de 2 a 3 años.
Result: LABEL_0 (0.9759) - Internet Archive: Digital Library of Free & Borrowable Books, Movies, Music & Wayback Machine
Result: LABEL_0 (0.9966) - Home - BBC News. Visit BBC News for up-to-the-minu