<a href="https://colab.research.google.com/github/vinayprabhu/hate_scaling/blob/main/code/4_Walkthrough_Pysentimiento_400M_2Ben.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GOAL: Calculate the NSFW scores and Pysentimiento hate speech scores for a sample of the LAION-2B-En dataset

TLDR:

- Step-1: Download the LAION datasets

- Step-2: Download the data-assets from [here](https://hal.cse.msu.edu/assets/data/papers/hate_detect_laion_400m_2B-en.zip) and unzip them into a local directory ```./hate_detect_laion_400m_2B-en```. This should consist of 641 files (detailed below).

- Step-3: Download the summary data-frame from [here](https://raw.githubusercontent.com/vinayprabhu/hate_scaling/main/data/nlp_hate/df_summary_filewise_400M_2B.csv) that allows one to contextualize and index the data-assets from Step-2
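
Step-2 can also be scripted from Python. The following is a minimal sketch, assuming the zip archive has already been downloaded to disk; `extract_assets` is a hypothetical helper name that is not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_assets(zip_path, out_dir="./hate_detect_laion_400m_2B-en"):
    """Unzip the data-assets archive and return the extracted file paths.

    `extract_assets` is an illustrative helper, not part of the notebook.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(out)
    # Keep regular files only; directories are skipped.
    return sorted(p for p in out.rglob("*") if p.is_file())
```

Calling `len(extract_assets("hate_detect_laion_400m_2B-en.zip"))` gives a count that can be compared against the 641 files mentioned in Step-2. If the archive already contains a top-level folder, the files will be nested one level deeper than `out_dir`.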

# 0: Standard imports and mounting the directory

In [None]:
from psutil import virtual_memory
# Make sure to run it on a high-memory instance
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

import numpy as np
import os
import pandas as pd
from pysentimiento import create_analyzer
# from tqdm import tqdm_notebook as tqdm
from tqdm.notebook import tqdm
import csv
#%matplotlib inline

from scipy.linalg import block_diag
#import seaborn as sns
# Numpy aesthetics
np.set_printoptions(suppress=True)
from collections import Counter
#from IPython.display import set_matplotlib_formats
#set_matplotlib_formats('retina')

import itertools
%precision 6
#############################################
import sys
import importlib
importlib.reload(sys)

In [None]:
import torch
import clip
from PIL import Image
import requests
from io import BytesIO

In [None]:
torch.cuda.is_available()

In [None]:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

In [None]:
def load_safety_model(clip_model="ViT-L/14"):
    """load the safety model"""
    import autokeras as ak  # pylint: disable=import-outside-toplevel
    from tensorflow.keras.models import load_model  # pylint: disable=import-outside-toplevel

    cache_folder = "./NSFW-cache"

    if clip_model == "ViT-L/14":
        model_dir = cache_folder + "/clip_autokeras_binary_nsfw"
        dim = 768
    else:
        raise ValueError("Unknown clip model")
    if not os.path.exists(model_dir):
        os.makedirs(cache_folder, exist_ok=True)

        from urllib.request import urlretrieve  # pylint: disable=import-outside-toplevel

        path_to_zip_file = cache_folder + "/clip_autokeras_binary_nsfw.zip"
        if clip_model == "ViT-L/14":
            url_model = "https://raw.githubusercontent.com/LAION-AI/CLIP-based-NSFW-Detector/main/clip_autokeras_binary_nsfw.zip"
        elif clip_model == "ViT-B/32":
            url_model = (
                "https://raw.githubusercontent.com/LAION-AI/CLIP-based-NSFW-Detector/main/clip_autokeras_nsfw_b32.zip"
            )
        else:
            raise ValueError("Unknown model {}".format(clip_model))  # pylint: disable=consider-using-f-string
        urlretrieve(url_model, path_to_zip_file)
        import zipfile  # pylint: disable=import-outside-toplevel

        with zipfile.ZipFile(path_to_zip_file, "r") as zip_ref:
            zip_ref.extractall(cache_folder)

    loaded_model = load_model(model_dir, custom_objects=ak.CUSTOM_OBJECTS, compile=False)
    
    return loaded_model


# 1: Download the LAION datasets

Source: https://laion.ai/laion-400-open-dataset/


*We produced the dataset in several formats to address the various use cases*: 
- A 50GB url+caption metadata dataset in parquet files. This can be used to compute statistics and redownload part of the dataset
- A 10TB webdataset with 256×256 images, captions and metadata. This is a full version of the dataset, that can be used directly for training
- A 1TB set of the 400M text and image clip embeddings, useful to rebuild new knn indices
- Two 4GB knn indices allowing to easily search in the dataset + two higher quality 16GB knn indices (running in the webdemo)

URL and caption metadata dataset:

We provide 32 parquet files of size around 1GB (total 50GB) with the image URLs, the associated texts and additional metadata in the following format:

SAMPLE_ID | URL | TEXT | LICENSE | NSFW | similarity | WIDTH | HEIGHT

where

- SAMPLE_ID: A unique identifier
- LICENSE: If a Creative Commons license could be extracted from the image data, we name it here, e.g. “creativecommons.org/licenses/by-nc-sa/3.0/”; otherwise you’ll find a “?” here
- NSFW: CLIP had been used to estimate if the image has NSFW content. The estimation has been pretty conservative, reducing the number of false negatives at the cost of more false positives. Possible values are “UNLIKELY”, “UNSURE” and “NSFW”
- similarity: Value of the cosine similarity between the text and image embedding
- WIDTH and HEIGHT: image size as the image was embedded. Originals that were larger than 4K size were resized to 4K

*This metadata dataset is best used to redownload the whole dataset or a subset of it. The img2dataset tool can be used to efficiently download such subsets*.
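
To get a feel for the `NSFW` column described above, one can tally how often each tag occurs. This is a minimal sketch on toy values (not real LAION statistics); `nsfw_tag_shares` is a hypothetical helper name.

```python
from collections import Counter


def nsfw_tag_shares(tags):
    """Return {tag: fraction of rows} for an iterable of NSFW tag strings."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Toy values for illustration only.
example = ["UNLIKELY", "UNLIKELY", "UNSURE", "NSFW"]
print(nsfw_tag_shares(example))
```

With a metadata parquet file loaded into a dataframe `df`, the same helper can be applied as `nsfw_tag_shares(df["NSFW"])`; missing values, if any, would show up as their own key.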

Source of the parquet files:
https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/


```
!wget http://3080.rom1504.fr/cah/cah_dataframe_unique/part-00000-4d76554c-2d66-4112-9420-0bb9d725a79d-c000.snappy.parquet
!wget https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
!wget -m -np -c -U "eye02" -w 2 -R "index.html*" "https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/"

# LAION-2B-En
!git lfs install
!git clone https://huggingface.co/datasets/laion/laion2B-en
```



After downloading the datasets, your dir-tree should look like:
```
the-eye.eu
├── robots.txt
└── public
    └── AI
        └── cah
            └── laion400m-met-release
                ├── laion400m-meta
                │   ├── part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
                │   ├── part-00001-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
                │   ├──  ...

```
```
LAION-2Ben
├── laion2B-en
│   ├── .git
│   ├── .gitattributes
│   ├── README.md
│   ├── part-00026-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet
│   ├── part-00056-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet
│   ├──  ...
```

# 2: Download the summary dataframe

The two datasets combined have 160 parquet files.

- LAION-400M is split into 32 parquet files
- LAION-2B-En has 128 parquet files

Now, let us download the summary dataframe, which allows us to navigate the assets, from [here](https://raw.githubusercontent.com/vinayprabhu/hate_scaling/main/data/nlp_hate/df_summary_filewise_400M_2B.csv)

In [None]:
url_summary='https://raw.githubusercontent.com/vinayprabhu/hate_scaling/main/data/nlp_hate/df_summary_filewise_400M_2B.csv'
df_parquet=pd.read_csv(url_summary)
df_parquet

In [None]:
parquet_list=df_parquet.file_loc.values
df_parquet.groupby('dataset')['file_size_GB'].describe(), df_parquet.groupby('dataset')['file_size_GB'].sum()

In [None]:
#parquet_list_400m=parquet_list[0:32]
parquet_list_2b=parquet_list[32:]

In [None]:
parquet_list_2b[2]

In [None]:
!pip install --quiet pytictoc
from pytictoc import TicToc
t = TicToc()

Now, let us look at what the _raw_ parquet files look like:

In [None]:
t.tic()
df_2B = pd.read_parquet(parquet_list_2b[0])
print(df_2B.shape)
t.toc()
df_2B.head()

In [None]:
df_2B['TEXT'] = df_2B['TEXT'].fillna('')

# 3: Calculate NSFW and hate speech scores

In [None]:
safety_model = load_safety_model()

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
model, preprocess = clip.load("ViT-L/14", device=device)

In [None]:
def normalized(a, axis=-1, order=2):
    import numpy as np  # pylint: disable=import-outside-toplevel
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)
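
A quick check of what `normalized` does, on toy vectors rather than real CLIP embeddings. The function is repeated here only so that the snippet runs on its own.

```python
import numpy as np


def normalized(a, axis=-1, order=2):
    # Same logic as the notebook's helper: divide each row by its L2 norm,
    # leaving all-zero rows untouched (their norm is replaced by 1).
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)


emb = np.array([[3.0, 4.0], [0.0, 0.0]])  # toy "embeddings"
unit = normalized(emb)
print(unit)                           # first row becomes [0.6, 0.8]
print(np.linalg.norm(unit, axis=-1))  # row norms: 1 for the first row, 0 for the zero row
```

The zero-norm guard avoids a division by zero, so a degenerate all-zero embedding passes through unchanged instead of turning into NaNs.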

In [None]:
analyzer = create_analyzer(task="hate_speech", lang="en")

In [None]:
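# IDs already processed in an earlier run; empty here, so every row is scored.
existing_ids = set()

def get_nsfw_hate_check(row):
    """Score one row: NSFW score for the image, hate speech probabilities for the alt-text."""
    image_url = row['URL']
    alt_text = row['TEXT']
    id_sample = row['SAMPLE_ID']
    nsfw_value = np.nan  # stays NaN if the image cannot be fetched or decoded
    if id_sample not in existing_ids:
        try:
            response = requests.get(image_url, timeout=5)
            img = preprocess(Image.open(BytesIO(response.content))).unsqueeze(0).to(device)
            with torch.no_grad():
                image_features = model.encode_image(img)
            # CLIP features may be float16 on GPU; cast to float32 on the CPU before normalizing.
            emb = normalized(image_features.cpu().numpy().astype('float32'))
            # predict() returns an array; keep the single score as a plain float.
            nsfw_value = float(np.ravel(safety_model.predict(emb))[0])
        except requests.RequestException:
            print('response error')
        except Exception:
            print('image error')
        # The hate speech score depends only on the alt-text, so it is computed even if the image failed.
        hate = analyzer.predict(alt_text).probas
        csvwriter.writerow([id_sample, hate, nsfw_value])
        csvfile.flush()
    return nsfw_value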
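
`get_nsfw_hate_check` skips any `SAMPLE_ID` found in `existing_ids`. One way to populate that set when resuming an interrupted run is to read the IDs back from the CSV written so far. This is a sketch under the assumption that the CSV has no header row and stores the ID in its first column, as in the `writerow` call above; `load_existing_ids` is a hypothetical helper name.

```python
import csv
import os


def load_existing_ids(csv_path):
    """Collect first-column IDs from a previous run's CSV (empty set if no file)."""
    ids = set()
    if not os.path.exists(csv_path):
        return ids
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if row:  # skip blank lines
                ids.add(row[0])
    return ids
```

Two caveats apply. The IDs come back as strings, so the membership test would need to compare like with like (for example `str(id_sample)`). Also, reopening the CSV in `'w'` mode truncates it, so a resumed run would have to open the file in append mode (`'a'`) instead.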

In [None]:
os.makedirs('nsfw', exist_ok=True)  # the output folder must exist before the CSV is opened
with open('nsfw/laion0.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    df_2B['NSFW_VALUE'] = df_2B.apply(get_nsfw_hate_check, axis=1)

In [None]:
df_2B.to_parquet('./nsfw/laion0.parquet')
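
The CSV stores the hate speech probabilities as the string form of a dict. A minimal sketch for turning such a cell back into numbers, assuming the values were written as plain Python floats and that the labels are `hateful`, `targeted` and `aggressive` (the labels pysentimiento reports for its `hate_speech` task); `parse_hate_probas` and the sample cell are illustrative only.

```python
import ast


def parse_hate_probas(cell):
    """Turn a stringified probability dict from the CSV back into a dict of floats."""
    probas = ast.literal_eval(cell)
    return {label: float(p) for label, p in probas.items()}


# Made-up cell for illustration; real values come from the CSV written above.
cell = "{'hateful': 0.02, 'targeted': 0.01, 'aggressive': 0.03}"
probas = parse_hate_probas(cell)
print(max(probas, key=probas.get))  # label with the highest probability
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, which is safer when reading text files back in.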