# Data Loading and Validation for AI Datasets

## Purpose
This notebook focuses on how AI datasets are loaded from files and validated
before they are used for annotation, quality checks, or model training.

The emphasis is on:
- JSON and CSV files
- Safe file handling
- Multilingual and RTL-friendly workflows
- Detecting structural issues early


## What is Data Loading?

Data loading is the process of reading datasets from storage
(such as JSON or CSV files) into memory so they can be inspected,
validated, and processed using code.

In AI workflows, data loading is the first step before:
- annotation
- quality assurance
- training
- evaluation
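
The code cells in this notebook load JSON, so here is a minimal sketch of the same loading step for CSV, the other format named in the Purpose section. It assumes a file with `text`, `language`, and `label` columns. The helper name `load_csv_records` and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import tempfile
from pathlib import Path


def load_csv_records(path):
    """Read a CSV file into a list of dictionaries, one per row."""
    # newline="" is what the csv module documentation recommends for file objects;
    # encoding="utf-8" keeps non-Latin scripts such as Urdu and Arabic intact.
    with open(path, "r", encoding="utf-8", newline="") as file:
        return list(csv.DictReader(file))


# Hypothetical sample file, written to a temporary directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_annotations.csv"
    sample_path.write_text(
        "text,language,label\nHello,en,greeting\nسلام,ur,greeting\n",
        encoding="utf-8",
    )
    records = load_csv_records(sample_path)
    print(len(records), records[0]["label"])  # 2 greeting
```

Each row becomes a dictionary keyed by the header line, so CSV records can be inspected the same way as JSON records. Note that every CSV value is read as a string.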


## What is JSON?

JSON (JavaScript Object Notation) is a structured data format used to store
and exchange data in a clear and predictable way.

In AI systems, JSON is commonly used to store:
- annotated datasets
- metadata
- configuration files
- prompt–response pairs

JSON is independent of any programming language, supports Unicode text, and is easy for both humans and machines to read.


## Understanding JSON Structure

A typical AI annotation dataset in JSON is an array of objects:

- The outer `[]` represents a collection of records (the dataset)
- Each `{}` represents one data record
- Each record contains key–value pairs describing the data

This structure allows datasets to scale from one record to millions
without changing the format.
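
To make the structure described above concrete, here is a small sketch that parses a two-record dataset from a string. The records themselves are hypothetical sample data; the field names `text`, `language`, and `label` mirror the ones the code cells in this notebook access.

```python
import json

# Hypothetical two-record dataset, for illustration only.
sample_json = """
[
  {"text": "Good morning", "language": "en", "label": "greeting"},
  {"text": "صباح الخير", "language": "ar", "label": "greeting"}
]
"""

sample_dataset = json.loads(sample_json)

print(type(sample_dataset).__name__)     # list  (the outer [])
print(type(sample_dataset[0]).__name__)  # dict  (one {} record)
print(sample_dataset[1]["language"])     # ar
```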


## Why JSON is Used in AI Data Annotation

JSON is used in AI data annotation because it stores structured records clearly.
Each record can contain text, language metadata, and labels in a consistent format.

This consistency makes JSON ideal for:
- validation
- quality checks
- automation
- AI training pipelines


In [2]:
import json


In [None]:
with open("data/annotations.json", "r", encoding="utf-8") as file:
    dataset = json.load(file)
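
The cell above assumes the file exists and contains valid JSON. One possible way to make loading safer is sketched below: it reports a missing file, malformed JSON, or an unexpected top-level type with a readable message. The function name `load_json_dataset` and the temporary sample files are hypothetical, not part of the original workflow.

```python
import json
import tempfile
from pathlib import Path


def load_json_dataset(path):
    """Load a JSON dataset and fail with a readable message if something is off."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset file not found: {path}")
    try:
        with path.open("r", encoding="utf-8") as file:
            data = json.load(file)
    except json.JSONDecodeError as error:
        raise ValueError(
            f"{path} is not valid JSON (line {error.lineno}, column {error.colno})"
        ) from error
    # This notebook expects a list of records at the top level.
    if not isinstance(data, list):
        raise ValueError(f"Expected a list of records, got {type(data).__name__}")
    return data


# Hypothetical sample files, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    good = Path(tmp_dir) / "good.json"
    good.write_text('[{"text": "Hello", "language": "en", "label": "greeting"}]', encoding="utf-8")
    bad = Path(tmp_dir) / "bad.json"
    bad.write_text('[{"text": "Hello"', encoding="utf-8")  # truncated on purpose

    loaded = load_json_dataset(good)
    try:
        load_json_dataset(bad)
    except ValueError as error:
        bad_message = str(error)
        print(bad_message)
```

Raising early with the file path in the message makes it easier to tell a broken file from a broken pipeline step.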


In [None]:
type(dataset)

In [None]:
len(dataset)

In [None]:
type(dataset[0])

In [None]:
# Print one line per record; .get() avoids a KeyError if a field is missing
for item in dataset:
    print(item.get("language"), item.get("label"), "→", item.get("text"))


In [None]:
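# Show only Urdu ("ur") and Arabic ("ar") records to check that RTL text displays correctly
for item in dataset:
    if item.get("language") in ("ur", "ar"):
        print(item.get("text"))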
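
The filter above trusts the `language` field. As a cross-check, the text itself can be inspected for right-to-left characters. This is a minimal sketch using the standard library's `unicodedata.bidirectional`; the helper name `contains_rtl` is hypothetical, and treating only the classes `R` and `AL` as RTL is a simplifying assumption.

```python
import unicodedata


def contains_rtl(text):
    """Return True if any character has a right-to-left bidirectional class."""
    # "R" is the class for right-to-left letters such as Hebrew,
    # "AL" is the class for Arabic letters (also used for Urdu).
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


print(contains_rtl("Hello"))       # False
print(contains_rtl("سلام"))        # True
print(contains_rtl("Hello سلام"))  # True
```

A record whose `language` is `"ur"` or `"ar"` but whose text contains no RTL characters may be mislabeled and is worth a manual look.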


In [None]:
# Flag records that are missing a required field before they reach annotation or training
for item in dataset:
    if "text" not in item or "label" not in item:
        print("Invalid annotation:", item)
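
The check above prints invalid records as it finds them. A slightly more structured variant, sketched here under assumptions, collects each problem with its record index so the results can be counted or logged. The required-key list (including `language`), the function name `find_invalid_records`, and the sample records are all hypothetical.

```python
REQUIRED_KEYS = ("text", "language", "label")  # assumed schema, adjust as needed


def find_invalid_records(records):
    """Return a list of (index, reason) pairs for records that fail basic checks."""
    problems = []
    for index, item in enumerate(records):
        if not isinstance(item, dict):
            problems.append((index, "record is not an object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            problems.append((index, f"missing keys: {', '.join(missing)}"))
        elif not isinstance(item["text"], str) or not item["text"].strip():
            problems.append((index, "text is empty or not a string"))
    return problems


# Hypothetical sample data covering one valid record and three kinds of failure.
sample = [
    {"text": "Hello", "language": "en", "label": "greeting"},
    {"text": "سلام", "language": "ur"},
    {"text": "   ", "language": "en", "label": "greeting"},
    "not a record",
]

problems = find_invalid_records(sample)
for index, reason in problems:
    print(f"Record {index}: {reason}")
```

Returning the problems instead of printing them lets the caller decide whether to stop, skip the bad records, or write a report.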


## What I Learned

- JSON is a widely used format for AI datasets and annotations
- Loading data safely means opening files in a `with` block and setting `encoding="utf-8"` explicitly
- `json.load` converts JSON arrays and objects into Python lists and dictionaries
- Early inspection and validation prevent downstream errors
