<a href="https://colab.research.google.com/github/vinayprabhu/hate_scaling/blob/main/code/4_Walkthrough_Pysentimiento_400M_2Ben.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GOAL: Walk through the data-assets generated via the Pysentimiento experiments

TLDR:

- Step-1: Download the LAION datasets

- Step-2: Download the data-assets from 
[here](https://hal.cse.msu.edu/assets/data/papers/hate_detect_laion_400m_2B-en.zip) and unzip them into a local directory ```./hate_detect_laion_400m_2B-en```
This should consist of 641 files (detailed below)

- Step-3: Download the summary data-frame from [here](https://raw.githubusercontent.com/vinayprabhu/hate_scaling/main/data/nlp_hate/df_summary_filewise_400M_2B.csv) that allows one to contextualize and index the data-assets from Step-2

# 0: Standard imports and mounting the directory

In [1]:
from psutil import virtual_memory
# Make sure to run it on a high-memory instance
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
%matplotlib inline

from scipy.linalg import block_diag
import seaborn as sns
# Numpy aesthetics
np.set_printoptions(suppress=True)
from collections import Counter
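# IPython.display.set_matplotlib_formats is deprecated; use matplotlib_inline instead
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('retina')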

import itertools
%precision 6
#############################################
import sys
import importlib
importlib.reload(sys)
#######################
from google.colab import drive
drive.flush_and_unmount()
import os
drive.mount('/gdrive', force_remount=True)
# Enter your own proj_dir here
proj_dir='/gdrive/My Drive/Colab Notebooks/0_laion_dataset/'
os.chdir(proj_dir)

Your runtime has 54.8 gigabytes of available RAM



Mounted at /gdrive


# 1: Download the LAION datasets

Source: https://laion.ai/laion-400-open-dataset/


*We produced the dataset in several formats to address the various use cases*: 
- A 50GB url+caption metadata dataset in parquet files. This can be used to compute statistics and redownload part of the dataset
- A 10TB webdataset with 256×256 images, captions and metadata. This is a full version of the dataset, that can be used directly for training
- A 1TB set of the 400M text and image clip embeddings, useful to rebuild new knn indices
- Two 4GB knn indices allowing to easily search in the dataset + two higher quality 16GB knn indices (running in the webdemo)

URL and caption metadata dataset.

We provide 32 parquet files of size around 1GB (total 50GB) with the image URLs, the associated texts and additional metadata in the following format:

SAMPLE_ID | URL | TEXT | LICENSE | NSFW | similarity | WIDTH | HEIGHT

where

- SAMPLE_ID:   A unique identifier
- LICENSE:   If a Creative Commons License could be extracted from the image data, we name it here like e.g. “creativecommons.org/licenses/by-nc-sa/3.0/” – otherwise you’ll find it here a “?”
- NSFW: CLIP had been used to estimate if the image has NSFW content. The estimation has been pretty conservative, reducing the number of false negatives at the cost of more false positives. Possible values are “UNLIKELY”, “UNSURE” and “NSFW”
- similarity: Value of the cosine similarity between the text and image embedding
- WIDTH and HEIGHT: image size as the image was embedded. Originals that were larger than 4K size were resized to 4K

*This metadata dataset is best used to redownload the whole dataset or a subset of it. The img2dataset tool can be used to efficiently download such subsets*.

Source of the parquet files:
https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/


```
!wget http://3080.rom1504.fr/cah/cah_dataframe_unique/part-00000-4d76554c-2d66-4112-9420-0bb9d725a79d-c000.snappy.parquet
!wget https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
!wget -m -np -c -U "eye02" -w 2 -R "index.html*" "https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/"

# LAION-2B-En
!git lfs install
!git clone https://huggingface.co/datasets/laion/laion2B-en
```



After downloading the datasets, your dir-tree should look like:
```
the-eye.eu
├── robots.txt
└── public
    └── AI
        └── cah
            └── laion400m-met-release
                ├── laion400m-meta
                │   ├── part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
                │   ├── part-00001-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
                │   ├──  ...

```
```
LAION-2Ben
├── laion2B-en
│   ├── .git
│   ├── .gitattributes
│   ├── README.md
│   ├── part-00026-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet
│   ├── part-00056-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet
│   ├──  ...
```

# 2: Download the summary dataframe

The two datasets combined have 160 parquet files.

- LAION-400M is split into 32 parquet files
- LAION-2B-En has 128 parquet files

Now, let us download the summary dataframe that allows us to navigate the assets from [here](https://raw.githubusercontent.com/vinayprabhu/hate_scaling/main/data/nlp_hate/df_summary_filewise_400M_2B.csv)

In [2]:
url_summary='https://raw.githubusercontent.com/vinayprabhu/hate_scaling/main/data/nlp_hate/df_summary_filewise_400M_2B.csv'
df_parquet=pd.read_csv(url_summary)
df_parquet

Unnamed: 0,dataset,file_id,file_size_GB,file_loc,P_hateful,P_targeted,P_aggressive,file_ind
0,400m,400m_0,1.6794,./the-eye.eu/public/AI/cah/laion400m-met-relea...,0.294,0.097,0.017,laion400m-meta_00000
1,400m,400m_1,1.6800,./the-eye.eu/public/AI/cah/laion400m-met-relea...,0.300,0.069,0.012,laion400m-meta_00001
2,400m,400m_2,1.6792,./the-eye.eu/public/AI/cah/laion400m-met-relea...,0.306,0.080,0.011,laion400m-meta_00002
3,400m,400m_3,1.6797,./the-eye.eu/public/AI/cah/laion400m-met-relea...,0.298,0.084,0.014,laion400m-meta_00004
4,400m,400m_4,1.6797,./the-eye.eu/public/AI/cah/laion400m-met-relea...,0.303,0.087,0.016,laion400m-meta_00003
...,...,...,...,...,...,...,...,...
155,2B,2B_123,2.5123,./LAION-2Ben/laion2B-en/part-00122-5114fd87-29...,0.359,0.122,0.027,laion2B-en_00122
156,2B,2B_124,2.5122,./LAION-2Ben/laion2B-en/part-00104-5114fd87-29...,0.339,0.126,0.013,laion2B-en_00104
157,2B,2B_125,2.5120,./LAION-2Ben/laion2B-en/part-00102-5114fd87-29...,0.372,0.135,0.032,laion2B-en_00102
158,2B,2B_126,2.5120,./LAION-2Ben/laion2B-en/part-00113-5114fd87-29...,0.371,0.098,0.012,laion2B-en_00113


In [3]:
parquet_list=df_parquet.file_loc.values
df_parquet.groupby('dataset')['file_size_GB'].describe(), df_parquet.groupby('dataset')['file_size_GB'].sum()

(         count      mean       std     min       25%     50%     75%     max
 dataset                                                                     
 2B       128.0  2.512508  0.000356  2.5116  2.512275  2.5125  2.5128  2.5132
 400m      32.0  1.679731  0.000330  1.6792  1.679500  1.6797  1.6800  1.6806,
 dataset
 2B      321.6010
 400m     53.7514
 Name: file_size_GB, dtype: float64)

In [4]:
parquet_list_400m=parquet_list[0:32]
parquet_list_2b=parquet_list[32:]

In [5]:
!pip install --quiet pytictoc
from pytictoc import TicToc
t = TicToc()

Now, let us see what the _raw_ parquet files look like:

In [6]:
t.tic()
df_400m_0 = pd.read_parquet(parquet_list_400m[0])
print(df_400m_0.shape)
t.toc()
df_400m_0.head(4)

(12933524, 8)
Elapsed time is 15.372087 seconds.


Unnamed: 0,SAMPLE_ID,URL,TEXT,HEIGHT,WIDTH,LICENSE,NSFW,similarity
0,1581282000000.0,http://media.rightmove.co.uk/148k/147518/58718...,View EPC Rating Graph for this property,109.0,100.0,?,UNSURE,0.312813
1,1060015000000.0,https://thumbs.ebaystatic.com/images/g/DYEAAOS...,Silverline Air Framing Nailer 90mm 10 - 12 Gau...,225.0,225.0,?,UNLIKELY,0.312485
2,3372497000000.0,https://farm1.staticflickr.com/784/40182677504...,Anhui Mountains,800.0,514.0,?,UNLIKELY,0.316512
3,382020000000.0,https://t2.ftcdn.net/jpg/00/58/35/35/240_F_583...,Acute pain in a woman knee,257.0,240.0,?,UNLIKELY,0.344278


# 3: Navigating the data-assets

We have curated $(N_{parquet} \times 4)+1 = 641$ meta-dataset files generated for the 400m and 2B-en datasets using [this](https://github.com/vinayprabhu/hate_scaling/blob/main/code/2_Pysentimiento_400M_2Ben.ipynb) notebook and have shared the assets as a single .zip file 
[here](https://hal.cse.msu.edu/assets/data/papers/hate_detect_laion_400m_2B-en.zip).

Download this file and unzip it into a local dir: ```RESULT_DIR='hate_detect_400m_2B-en'```
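
A minimal sketch of the download-and-unzip step, using only the Python standard library. The helper names (`download_assets`, `unzip_assets`) are illustrative and not part of the original notebook, and the internal layout of the archive is an assumption, so the file count is reported rather than enforced:

```python
import urllib.request
import zipfile
from pathlib import Path

ZIP_URL = "https://hal.cse.msu.edu/assets/data/papers/hate_detect_laion_400m_2B-en.zip"


def download_assets(url: str, zip_path: str) -> Path:
    """Download the archive to zip_path unless it already exists (hypothetical helper)."""
    target = Path(zip_path)
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target


def unzip_assets(zip_path, dest_dir) -> int:
    """Extract the archive into dest_dir and return the number of .npy files found."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # Count recursively, since the archive may contain a top-level folder.
    return sum(1 for _ in dest.rglob("*.npy"))
```

For example, `unzip_assets(download_assets(ZIP_URL, 'assets.zip'), 'hate_detect_400m_2B-en')` should return 641 if the archive matches the description in this section. If the archive unpacks into a nested folder, `RESULT_DIR` may need to point at that folder instead.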

## 3b: The 641 files were generated during our random sampling experiment whose steps were:

- Generate 0.1 million random indices per parquet file
- Save these 0.1 million random indices of the associated parquet files
- Parse the parquet file and extract the image (alt)textual descriptions pertaining to these 1e5 random indices as a numpy tensor
- Pass the alt-text tensor through the hate-analyzer
- Compute stats and save the results
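
The first three steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the generating code: the captions are made up, the sample size is reduced from 1e5 to 100, sampling without replacement is assumed, and `np.random.default_rng` is used here rather than the `np.random.seed` call that appears in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the TEXT column of one parquet file (hypothetical data).
texts = np.array([f"caption number {i}" for i in range(1000)], dtype=object)

n_samples = 100  # the notebook uses 1e5 per parquet file

# Steps 1-2: draw random row indices (without replacement, an assumption).
ind_random = rng.choice(len(texts), size=n_samples, replace=False)

# Step 3: extract the alt-text at those indices as a numpy array of strings.
texts_sampled = texts[ind_random].astype(str)

print(ind_random.shape, texts_sampled.shape)  # (100,) (100,)
```

Saving `ind_random` alongside `texts_sampled` is what makes the sample reproducible: the same rows can later be re-extracted from the raw parquet file and compared.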


To summarize, we have generated $(N_{parquet} \times 4)+1 = 641$ files for the two datasets.

The types of meta-dataset files are:


- ```index_random_{ind_i}.npy```: $N_{parquet}$ random-index files, each containing 0.1 million random indices into the rows of the $ind\_i^{th}$ parquet file (in ```parquet_list```). Shape: ```(100000,)```
- ```prob_hate_{ind_i}.npy```: $N_{parquet}$ _hate-probability-matrix_ files of shape ```(100000, 3)```, holding the detector outputs for the 0.1 million randomly indexed rows of the $ind\_i^{th}$ parquet file.
- ```qfr_file_{ind_i}.npy```: $N_{parquet}$ _quality-failure-rate_ files of shape ```(3,)``` containing the percentage of the 0.1 million randomly indexed alt-text samples in the $ind\_i^{th}$ parquet file for which the pysentimiento detector returned a P_hateful/P_targeted/P_aggressive value > 0.5 (computed as ```np.mean(res_mat_i>0.5,axis=0)*100``` in the generating notebook linked above)
- ```alt_text_{ind_i}.npy```: $N_{parquet}$ _alt-text_ files of shape ```(100000, 1)``` holding the TEXT field of the 0.1 million randomly indexed rows of the $ind\_i^{th}$ parquet file.
- ```qfr_400m_2Ben.npy```: A single numpy file of shape ($N_{parquet}$, 3) that contains the parquet-file-level hate-content statistics, one row per parquet file.
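
To make the quality-failure-rate definition concrete, here is a small worked example that applies the same thresholding expression to a made-up 4 x 3 probability matrix (the values are illustrative only):

```python
import numpy as np

# Toy probability matrix: one row per sample, columns are
# [P_hateful, P_targeted, P_aggressive] (values are made up).
prob_hate = np.array([
    [0.90, 0.70, 0.10],
    [0.20, 0.10, 0.05],
    [0.60, 0.30, 0.55],
    [0.40, 0.20, 0.10],
])

# Percentage of samples whose score exceeds 0.5, computed per column.
qfr_file = np.mean(prob_hate > 0.5, axis=0) * 100
print(qfr_file)  # [50. 25. 25.]
```

Two of the four rows exceed 0.5 in the first column and one row does in each of the other two, hence 50%, 25% and 25%.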

```
# Code reference:

  np.save(f'./{RESULT_DIR}/index_random_{ind_i}.npy',ind_random_i)
  np.save(f'./{RESULT_DIR}/prob_hate_{ind_i}.npy',prob_hate_i)
  np.save(f'./{RESULT_DIR}/qfr_file_{ind_i}.npy',qfr_file_i)
  np.save(f'./{RESULT_DIR}/alt_text_{ind_i}.npy',texts_np_i)
  
np.save(f'./{RESULT_DIR}/qfr_400m_2Ben.npy',qfr_all)
```

Now, before running the verification code that shows what these data-assets contain, let us install Pysentimiento and get acquainted with it.

In [7]:
!pip install --upgrade accelerate
!pip install --quiet pysentimiento==0.5.2
##################################
import sys
print("Python version")
# See: https://github.com/pysentimiento/pysentimiento/issues/50
print (sys.version)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Python version
3.10.11 (main, Apr  5 2023, 14:15:10) [GCC 9.4.0]


## 3c: Getting acquainted with ```pysentimiento```:

Let us initialize the pysentimiento analyzer and feed in an example to see what it produces ...

In [8]:
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="hate_speech", lang="en")
df_400m_0.TEXT.values[0],analyzer.predict(df_400m_0.TEXT.values[0])
# ('View EPC Rating Graph for this property',
#  AnalyzerOutput(output=[], probas={hateful: 0.015, targeted: 0.012, aggressive: 0.011}))

('View EPC Rating Graph for this property',
 AnalyzerOutput(output=[], probas={hateful: 0.015, targeted: 0.012, aggressive: 0.011}))
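
The verification cell later in this notebook turns a list of such outputs into an `(N, 3)` matrix by flattening the `probas` values. The sketch below shows that reshaping step in isolation. `FakeOutput` is a hypothetical stand-in that only mimics the `probas` attribute, so the example runs without pysentimiento, and it assumes every `probas` dict lists its keys in the same order (hateful, targeted, aggressive).

```python
import itertools
from dataclasses import dataclass, field

import numpy as np


@dataclass
class FakeOutput:
    """Hypothetical stand-in exposing only the `probas` dict of an analyzer output."""
    probas: dict = field(default_factory=dict)


results = [
    FakeOutput({"hateful": 0.015, "targeted": 0.012, "aggressive": 0.011}),
    FakeOutput({"hateful": 0.800, "targeted": 0.100, "aggressive": 0.300}),
]

# Flatten the per-sample dict values, then reshape to (n_samples, 3).
flat = list(itertools.chain.from_iterable(r.probas.values() for r in results))
prob_hate = np.array(flat).reshape(len(results), 3)
print(prob_hate.shape)  # (2, 3)
```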

In [9]:
# The inits:
RESULT_DIR='hate_detect_400m_2B-en'
np.random.seed(42)
N_samples=int(1e5) # Number of samples you have chosen to randomly sample
N_parquet=len(parquet_list)
N_samples,N_parquet

(100000, 160)

Now, let us run a quick survey of the data assets created.

In [10]:
pd.Series([f.split('_')[0] for f in os.listdir(f'{RESULT_DIR}')]).value_counts()  

qfr      161
alt      160
index    160
prob     160
dtype: int64
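
Prefix counts confirm the totals but not which files are present. A possible stricter check, sketched below, builds the expected file names from the naming scheme and reports any that are missing. The helper names are illustrative, and the `file_inds` pattern is an assumption extrapolated from the `file_ind` values shown in `df_parquet`; in practice one would pass `df_parquet.file_ind` instead.

```python
import os

PREFIXES = ("index_random", "prob_hate", "qfr_file", "alt_text")


def expected_files(file_inds):
    """Names implied by the naming scheme: four per parquet file plus one summary."""
    names = {f"{p}_{ind_i}.npy" for p in PREFIXES for ind_i in file_inds}
    names.add("qfr_400m_2Ben.npy")
    return names


def missing_files(result_dir, file_inds):
    """Return the expected file names that are absent from result_dir."""
    return sorted(expected_files(file_inds) - set(os.listdir(result_dir)))


# Assumed pattern for file_ind (hypothetical; use df_parquet.file_ind in practice).
file_inds = [f"laion400m-meta_{i:05d}" for i in range(32)]
file_inds += [f"laion2B-en_{i:05d}" for i in range(128)]
print(len(expected_files(file_inds)))  # 641
```

An empty list from `missing_files(RESULT_DIR, df_parquet.file_ind)` would indicate that all 641 assets are in place.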

# 4: Last step - Verification

This entails manually reloading a randomly chosen raw parquet file, re-running the NLP hate classifier, and comparing its output to the data-assets already curated.

In [11]:
df_parquet.head(4)

Unnamed: 0,dataset,file_id,file_size_GB,file_loc,P_hateful,P_targeted,P_aggressive,file_ind
0,400m,400m_0,1.6794,./the-eye.eu/public/AI/cah/laion400m-met-relea...,0.294,0.097,0.017,laion400m-meta_00000
1,400m,400m_1,1.68,./the-eye.eu/public/AI/cah/laion400m-met-relea...,0.3,0.069,0.012,laion400m-meta_00001
2,400m,400m_2,1.6792,./the-eye.eu/public/AI/cah/laion400m-met-relea...,0.306,0.08,0.011,laion400m-meta_00002
3,400m,400m_3,1.6797,./the-eye.eu/public/AI/cah/laion400m-met-relea...,0.298,0.084,0.014,laion400m-meta_00004
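
The `file_ind` values appear to combine the parent directory name with the part number from the parquet file name, which is what the next cell reconstructs by hand. A minimal sketch of that mapping, with `file_ind_from_path` as an illustrative helper name and example paths assembled from the directory trees shown earlier:

```python
from pathlib import PurePosixPath


def file_ind_from_path(file_loc: str) -> str:
    """Build '<parent-dir>_<part-number>' from a parquet path."""
    path = PurePosixPath(file_loc)
    part_number = path.name.split("-")[1]  # 'part-00000-...' -> '00000'
    return f"{path.parent.name}_{part_number}"


example = ("./the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/"
           "part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet")
print(file_ind_from_path(example))  # laion400m-meta_00000
```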


In [12]:
# 1: Extract the parquet list from the file_loc column
parquet_list = df_parquet.file_loc.values[32:]  # plain array, so the positional index below is valid
# 2: Pick a random file
ind_random_2b=np.random.choice(len(parquet_list),1,replace=False)[0]
file_interested=parquet_list[ind_random_2b]
# 3: Convert file name to ``ind_i`` to fetch the pre-computed results (This should be the same as stored in the file_ind column)
ind_i_manual=file_interested.split('/')[-2]+'_'+file_interested.split('/')[-1].split('-')[1]
ind_i= df_parquet.loc[df_parquet.file_loc==file_interested,'file_ind'].values[0]
print(f'Check-1: The manual created index variable is the same as the fetched value? {ind_i==ind_i_manual}')
# 4: Use this ind_i to fetch the pertinent pre-computed results
ind_vec=np.load(f'./{RESULT_DIR}/index_random_{ind_i}.npy') # The 0.1 million random indices sampled from this parquet file
alt_text_file=np.load(f'./{RESULT_DIR}/alt_text_{ind_i}.npy',allow_pickle=True) # The 0.1 million alt-text sentences in the 0.1 million random indices sampled from this parquet file
prob_hate_file=np.load(f'./{RESULT_DIR}/prob_hate_{ind_i}.npy') # The 0.1 million x 3 matrix of Pysentimiento outputs
qfr_file=np.load(f'./{RESULT_DIR}/qfr_file_{ind_i}.npy') # the Quality-Failure-Rate  data pertaining to the chosen parquet file
############################################################
# 5: Now, let us manually re-read the file and run the inference again to verify.
t = TicToc()
import pyarrow.parquet as pq
t.tic()
df_i=pq.read_table(file_interested,columns=['TEXT']).to_pandas()
t.toc()
# 6: Extract the text-description from these indices
texts_np_i = df_i.iloc[ind_vec].TEXT.astype(str).values[0:N_samples]
check_alt_text=texts_np_i==alt_text_file
print(f'Check-2: The manually extracted alt-text is the same as the fetched alt-text vector? {(check_alt_text).mean()==1}')
# 7: Analyze the textual-content ( 1x3 o/p [P_hateful, P_targeted, P_aggressive])
del df_i
import gc
gc.collect()
t.tic()
results_i=analyzer.predict(texts_np_i)
t.toc()
# 8: Compute the results
prob_hate_recomputed=np.array(list(itertools.chain.from_iterable(x.probas.values() for x in results_i))).reshape(N_samples,3)
qfr_recomputed=np.mean(prob_hate_recomputed>0.5,axis=0)*100
# 9: Final Verification:
np.testing.assert_allclose(prob_hate_recomputed, prob_hate_file,rtol=1e-03,), np.testing.assert_allclose(qfr_recomputed,qfr_file,rtol=1e-03,)

Check-1: The manual created index variable is the same as the fetched value? True
Elapsed time is 32.536706 seconds.
Check-2: The manually extracted alt-text is the same as the fetched alt-text vector? 1.0


Elapsed time is 371.504717 seconds.


(None, None)
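
The `(None, None)` output means both checks passed: `np.testing.assert_allclose` returns `None` on success and raises `AssertionError` otherwise. The sketch below illustrates how the `rtol=1e-03` tolerance behaves on made-up values; the arrays are illustrative and not taken from the data-assets.

```python
import numpy as np

stored = np.array([[0.015, 0.012, 0.011],
                   [0.800, 0.100, 0.300]])

# A re-run that differs only by small floating-point noise passes the check.
recomputed = stored * (1 + 1e-5)
np.testing.assert_allclose(recomputed, stored, rtol=1e-3)

# A larger relative deviation (1% here) raises AssertionError.
drifted = stored * 1.01
try:
    np.testing.assert_allclose(drifted, stored, rtol=1e-3)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
print(mismatch_detected)  # True
```

A relative tolerance is a reasonable choice here because re-running the classifier on different hardware or library versions can change the probabilities in the last few decimal places.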