## C4 Dataset

[C4 Dataset](https://huggingface.co/datasets/allenai/c4)

C4 (Colossal Clean Crawled Corpus) is a colossal, cleaned version of Common Crawl's web crawl corpus, created by Google. It is based on the Common Crawl dataset: https://commoncrawl.org.

We use the version of Google's C4 dataset processed by the Allen Institute for AI (AllenAI).
They prepared five variants of the data: `en`, `en.noclean`, `en.noblocklist`, `realnewslike`, and `multilingual (mC4)`.

For reference, these are the sizes of the variants:

- `en`: 305GB
- `en.noclean`: 2.3TB
- `en.noblocklist`: 380GB
- `realnewslike`: 15GB
- `multilingual (mC4)`: 9.7TB (108 subsets, one per language)

The `en.noblocklist` variant is exactly the same as the `en` variant, except that the so-called "badwords filter" is turned off. That filter removes all documents that contain words from the lists at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.
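To make the effect of such a filter concrete, here is a minimal sketch of a document-level blocklist filter. It is not the actual C4 or AllenAI implementation: the function names, the whole-word matching rule, and the placeholder blocklist entries are all assumptions made for illustration.

```python
import re

# Placeholder entries (hypothetical); the real lists live in the LDNOOBW repository.
EXAMPLE_BLOCKLIST = {"badword", "worseword"}

def contains_blocked_word(text, blocklist):
    """Return True if any blocklist entry appears as a whole word in text."""
    # Lowercase and split into word tokens (assumed matching rule).
    tokens = set(re.findall(r"\w+", text.lower()))
    return not tokens.isdisjoint(blocklist)

def drop_blocked_documents(documents, blocklist):
    """Keep only the documents that contain no blocked word."""
    return [doc for doc in documents if not contains_blocked_word(doc, blocklist)]

sample = ["A perfectly clean page.", "This page has a BadWord in it."]
kept = drop_blocked_documents(sample, EXAMPLE_BLOCKLIST)
print(kept)
```

Note that a single match removes the whole document, which is why `en` is noticeably smaller than `en.noblocklist`. This token-based sketch would not catch multi-word entries.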

In [1]:
!pip3 install datasets


We download 4 of the 1024 training files of the `en` variant, each about 318 MB compressed.

In [53]:
from datasets import load_dataset

c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0102*-of-01024.json.gz")

Downloading data:   0%|          | 0.00/317M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/318M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/320M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/318M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]
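As a quick sanity check on why the pattern above selects exactly 4 files, the sketch below enumerates shard names and matches them against the same glob locally. It assumes the shards are named with a zero-padded five-digit index from `00000` to `01023` (inferred from the pattern, not verified against the repository), and it uses `fnmatchcase` as a stand-in for however `datasets` resolves globs.

```python
from fnmatch import fnmatchcase

NUM_SHARDS = 1024
PATTERN = "en/c4-train.0102*-of-01024.json.gz"

# Assumed naming scheme for the training shards of the `en` variant.
shard_names = [
    f"en/c4-train.{i:05d}-of-{NUM_SHARDS:05d}.json.gz" for i in range(NUM_SHARDS)
]

# Keep only the shard names that the glob pattern would select.
matched = [name for name in shard_names if fnmatchcase(name, PATTERN)]

for name in matched:
    print(name)
```

Changing the prefix in the pattern (for example `0101*`) would select a different group of shards of similar size.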

In [54]:
# Collect the URL and the text of every document in the train split
urls = [x['url'] for x in c4_subset["train"]]
documents = [x['text'] for x in c4_subset["train"]]

print(f"Number of documents: {len(urls)}")
print(f"Number of characters: {sum(len(x) for x in documents)}")

Number of documents: 1425269
Number of characters: 3065881920


In [40]:
# Drop the reference to the loaded dataset; only `urls` and `documents` are used from here on
del c4_subset

In [65]:
print(documents[0], urls[0])

Liposuction has remained one of the most popular cosmetic surgeries for years as people turn to their doctors to remove the fat that diet and exercise can't seem to touch. Recently, there has been a trend towards less invasive aesthetic options as lasers and fillers replace facelifts and laser lipo takes center stage with traditional liposuction. There are two devices, both currently undergoing FDA testing, which could replace fat reduction surgery altogether. They are Zerona and Zeltiq, and they are pain free treatments to cut the fat.
Made by Erchonia, the Zerona laser is not exactly a new concept. Newport Beach Zerona physician Dr. Thomas Barnes has been using this low level laser therapy device in his office for several years. Zerona was originally developed to assist in the performance of traditional lipo. "I was involved with the company that developed Zerona...," says Dr. Barnes. "I've been using this since 2001, thousands of cases. We know it causes fat to leak from the cells w

Let's compute the number of distinct domains.

In [78]:
def get_domain(url):
    # Strip the scheme, then keep everything before the first "/" (the host part)
    url = url.replace("https://", "").replace("http://", "")
    domain = url.split("/")[0]
    return domain
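The string-based helper above is enough for the URLs in this subset. As an alternative sketch, the standard library's `urllib.parse.urlsplit` can extract the host while also lowercasing it and dropping any port. The name `get_domain_urlsplit` is hypothetical and is not used elsewhere in this notebook.

```python
from urllib.parse import urlsplit

def get_domain_urlsplit(url):
    """Return the lowercased host of a URL, without scheme, port, or path."""
    # urlsplit only recognises the host when a scheme or a leading "//" is present.
    if "://" not in url:
        url = "//" + url
    host = urlsplit(url).hostname
    return host or ""

print(get_domain_urlsplit("https://americanhealthandbeauty.com/articles/2704/example"))
print(get_domain_urlsplit("HTTP://Example.COM:8080/index.html"))
```

Because this variant normalises case and ports, it could report slightly fewer distinct domains than the simple string version on messy input.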

In [62]:
domains = {}

for url in urls:
    domain = get_domain(url)
    if domain not in domains:
        domains[domain] = 0
    domains[domain] += 1

print(f"Number of domains: {len(domains)}")
print(f"Number of domains with at least 4 pages: {sum(1 for count in domains.values() if count >= 4)}")

Number of domains: 841520
Number of domains with at least 4 pages: 49076
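The same tally can be written more compactly with `collections.Counter`. The sketch below is self-contained and runs on a few made-up URLs (the sample URLs and the threshold of 2 are assumptions chosen so the example stays small); it is an alternative formulation, not the code that produced the numbers above.

```python
from collections import Counter

def get_domain(url):
    # Same logic as the helper defined earlier, repeated so the sketch runs on its own.
    url = url.replace("https://", "").replace("http://", "")
    return url.split("/")[0]

# Made-up URLs for illustration only.
sample_urls = [
    "https://a.com/1",
    "http://a.com/2",
    "https://b.org/",
    "https://c.net/x",
]

# Counter maps each domain to the number of pages seen for it.
domain_counts = Counter(get_domain(u) for u in sample_urls)

num_domains = len(domain_counts)
num_with_two = sum(1 for count in domain_counts.values() if count >= 2)

print(f"Number of domains: {num_domains}")
print(f"Number of domains with at least 2 pages: {num_with_two}")
print(domain_counts.most_common(1))
```

`Counter.most_common` also makes it easy to inspect which domains contribute the most pages.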


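Reorder the documents by sorting on URL. The helpers below reverse the domain components (for example, `example.com` becomes `com.example`) to build the sort key.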

In [76]:
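def get_domain_rev(domain):
    # Reverse the dot-separated parts: "www.example.com" -> "com.example.www"
    return ".".join(domain.split(".")[::-1])

def get_url_rev(url):
    # Strip the scheme, reverse the domain, and keep the path unchanged
    url = url.replace("https://", "").replace("http://", "")
    split = url.split("/")
    return get_domain_rev(split[0]) + "/" + "/".join(split[1:])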

In [80]:
print(urls[0])
print(get_url_rev(urls[0]))

https://americanhealthandbeauty.com/articles/2704/non-surgical-fat-reduction--zerona-vs-zeltiq
com.americanhealthandbeauty/articles/2704/non-surgical-fat-reduction--zerona-vs-zeltiq
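One way the reversed URL could serve as a sort key is sketched below. It computes a permutation of indices so that URLs and documents can be reordered together. The sample URLs, the sample documents, and the variable names are assumptions for illustration, and `get_url_rev` is restated in an equivalent form so the block runs on its own.

```python
def get_url_rev(url):
    # Equivalent to the helper above: strip the scheme, reverse the domain parts.
    url = url.replace("https://", "").replace("http://", "")
    host, _, path = url.partition("/")
    return ".".join(host.split(".")[::-1]) + "/" + path

# Made-up data for illustration only.
sample_urls = [
    "https://blog.example.com/post",
    "http://other.org/a",
    "https://example.com/about",
    "https://www.example.com/",
]
sample_documents = ["blog post", "other page", "about page", "home page"]

# Sort indices by the reversed URL, then apply the same order to both lists.
order = sorted(range(len(sample_urls)), key=lambda i: get_url_rev(sample_urls[i]))
sorted_urls = [sample_urls[i] for i in order]
sorted_documents = [sample_documents[i] for i in order]

for url in sorted_urls:
    print(get_url_rev(url), "<-", url)
```

With this key, pages from `example.com` and its subdomains end up next to each other, while a plain sort on the original URL would group them by scheme and by the leftmost subdomain label.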
