GitHub - ramnerd/IVC_script_decoded: Here I have used python and data analytics to decode the script of Indus valley civilization

Abstract The Indus Valley Civilization (IVC) script remains one of the world's last great undeciphered writing systems, hindering a comprehensive understanding of its society, politics, and language. This research presents a novel, multimodal computational framework designed to systematically analyze and interpret the elusive Indus Script. Moving beyond traditional uni-modal approaches, this framework integrates three key computational pillars: statistical linguistics for pattern discovery and entropy analysis; computer vision and deep learning for motif recognition, sign classification, and spatial-contextual analysis on artifacts; and comparative-historical modeling to test proposed phonetic and semantic values against known linguistic families. Our application of this rigorous, data-driven methodology aims to address long-standing challenges by identifying definitive linguistic structures, testing proposed reading directions, and establishing a robust system for sign-to-meaning correlation. This work promises not only to generate the most statistically probable and linguistically coherent hypothesis for the script's decipherment but also to provide a universally applicable methodology for the analysis of other ancient, non-linguistic symbol systems. The successful application of this framework would fundamentally reshape our understanding of the IVC, providing unprecedented access to its indigenous voice and intellectual life.

Introduction

Reconstructing IVC Society and History The most immediate and critical impact lies in unlocking the indigenous voice of the IVC. Decipherment would provide direct textual evidence about their economic practices, political administration, religious beliefs, and social hierarchies—elements currently inferred only from mute archaeological remains. This textual data is essential for a comprehensive and nuanced reconstruction of one of the earliest and most widespread urban cultures.
Advancing Comparative Linguistics and Writing Studies This research directly contributes to understanding the origins and evolution of writing systems globally. The methodology developed here will systematically test hypotheses about the linguistic affiliation of the IVC language (e.g., Dravidian, Indo-Aryan, or an isolate). More broadly, the Multimodal Computational Framework itself offers a template for the analysis of other problematic or undeciphered ancient symbol systems worldwide.
Establishing a New Computational Paradigm By integrating deep learning for computer vision (artifact analysis) with information theory (linguistic pattern recognition), this work establishes a new, interdisciplinary computational paradigm for historical studies. It demonstrates how advanced, data-driven techniques can be deployed to resolve complex, long-standing humanistic questions, providing a powerful, verifiable alternative to traditional, often subjective, interpretive methods.

In summary, the decipherment of the Indus Script is a grand challenge of contemporary scholarship. This project offers the first integrated computational solution designed to definitively address this challenge, promising to fundamentally rewrite the early history of South Asia and enhance our scientific understanding of ancient communication.

Methodology Methodology: Advanced Computational Decipherment of the Indus Valley Civilization (IVC) Script This section provides a detailed, scholar-grade exposition of the computational and statistical methodology deployed for Project IVC. The approach is a rigorous, multi-stage process integrating probabilistic modeling, ensemble boundary detection, and cross-linguistic lexicon alignment to generate reliable reconstructions of the IVC script.

Data Pre-processing and Multi-Script Encoding The foundation of the decipherment is a meticulously engineered data representation that standardizes the IVC script alongside its potential linguistic cognates (Tamil, Pali, Sanskrit).

1.1. Data Normalization (ESingle) The normalization process converts the visual graphemes into a computationally tractable, discrete numeric space.

ESingle:S→Z+

where S={s1,s2,…,sn} is the set of all unique graphemes (symbols) in the corpora, and Z+ is the set of positive integers.

Explanation: Every unique symbol from the IVC, Old Tamil, Pali, and Sanskrit corpora is assigned a unique, non-overlapping numeric identifier. This deterministic encoding ensures that algorithms operate on a uniform data structure, preserving the inherent symbolic distinctions (e.g., distinguishing visually similar allographs). This step includes stripping diacritics, unifying confirmed variant forms (allographs) into a single code, and normalizing all inscription sequences for consistent length and orientation, which is essential for stable downstream positional statistics and n-gram models.

1.2. Multi-Script Encoding (EMulti)

This encoding creates a joint embedding space for comparative analysis by aligning the symbols across the hypothesized linguistic continuum.

EMulti(si)=(EIVC(si),ETamil(si),EPali(si),ESanskrit(si))

Explanation: The multi-script representation maps each IVC sign to a multi-dimensional numeric vector, where each dimension represents its corresponding numeric code in the comparison scripts (Tamil, Pali, Sanskrit), assuming a phonetic or semantic hypothesis. This joint embedding allows for the application of vectorized operations (e.g., measuring Euclidean distance or cosine similarity) in the feature space. This capability is crucial for identifying correlations between IVC symbols and their potential linguistic counterparts and forms the basis for model-driven biasing during the final hybrid decoding stage.

Advanced Feature Engineering for Sequence Analysis

Robust decoding requires extracting statistical features that capture both the local symbol transitions and the global positional structure of the inscriptions.

2.1. Distribution Score (DScore)

The distribution score converts raw frequency into a normalized measure, highlighting the statistical significance of symbol occurrences.

DScore(si)=∑jf(sj)f(si)

where f(si) is the raw frequency count of symbol si in the entire corpus.

Explanation: The distribution score normalizes the count of a symbol relative to the total number of symbol occurrences. This provides a probabilistic measure of prevalence. Symbols with consistently high scores are often general function markers, common affixes, or highly frequent logograms, suggesting they contribute significantly to the overall message structure. This score provides an essential statistical baseline for candidate ranking in the decoding process, mitigating bias from short-sequence variability.

2.2. Normalized Positional Entropy (HPos)

Positional entropy quantifies the predictability of a symbol's occurrence at a specific location within the sequences.

HPos(i)=−j∑P(sj∣posi)log2P(sj∣posi)

where P(sj∣posi) is the probability of symbol sj occurring at a specific position i (e.g., beginning, middle, end) within a sequence, normalized to the maximum possible entropy. Explanation: This metric measures the uncertainty of symbol identity at a given position across all inscriptions. A position with low entropy is highly predictable (e.g., a common terminal sign always appears last), strongly suggesting a fixed structural element or word boundary candidate. Conversely, high entropy suggests an ambiguous position, often the locus of high-variance, content-rich words. Normalization ensures that entropy values are comparable across different positions and symbol subsets, directly contributing to boundary confidence scoring.

Robust Boundary Detection: The Ensemble Approach

Boundary detection is the most critical and complex step, transforming the continuous sign sequence into discrete, meaningful "word" units. This is achieved through an ensemble of advanced statistical and machine learning models.

3.1. n-gram Counts with Laplace Smoothing (PSmooth)

Laplace smoothing stabilizes the probability estimates for rare or unseen symbol sequences, ensuring numerical stability. Unigram: P(si)=N+α∣V∣count(si)+α Bigram: P(si∣si−1)=count(si−1)+α∣V∣count(si−1,si)+α where α=1 (Laplace smoothing factor), N is the total number of symbols, and ∣V∣ is the vocabulary size. The formulas extend similarly for trigrams.

Explanation: Unigram, bigram, and trigram counts form the fundamental frequency features. Laplace smoothing is applied to assign a non-zero, minimal probability to n-grams that do not occur in the training corpus. This is extremely important for stable numerics in subsequent calculations (like NPMI and logarithmic likelihoods), ensuring that a zero count does not lead to an undefined log(0) or an overly biased probability, which could erroneously signal a boundary.

3.2. Bigram & Trigram Normalized Pointwise Mutual Information (NPMI) (NPMI)

NPMI measures the actual strength of association between symbols, factoring out the influence of individual symbol frequency. NPMI(x,y)=−logP(x,y)logP(x)P(y)P(x,y)

Explanation: NPMI is a frequency-independent measure of co-occurrence normalized to the range [−1,1]. A score near +1 indicates that the symbols x and y are highly bonded (likely within the same word), while a score near 0 suggests independence, and a score near −1 suggests mutual exclusion. High-NPMI suppression and low-NPMI enhancement are key heuristics: low NPMI is a strong indicator of a word boundary. The use of logarithms and normalization ensures stable numerics, even for very rare sign pairs.

3.3. Candidate-Boundary Feature Extraction (FBoundary)

This step computes a comprehensive vector of statistical evidence for every potential inter-sign break point. Explanation: For every space between two adjacent signs, ten or more statistical features are extracted. This feature vector FBoundary includes, but is not limited to, the local Bigram/Trigram PSmooth and NPMI, the HPos gradient across the boundary, and the DScore of the surrounding unigrams. This rich, multi-dimensional feature representation is the input for the ensemble of machine learning detectors.

3.4. Reliability-Weighted Ensemble Detector (DEnsemble)

The final boundary decision integrates outputs from multiple, specialized detectors to achieve maximum robustness and certainty.

DEnsemble=k∑ωk⋅Score(Dk∣FBoundary)

where Dk∈{DBGM,DDBSCAN,DZ-score} and ωk is the reliability weight.

Explanation: The ensemble combines three distinct detectors to leverage their respective strengths: Bayesian Gaussian Mixture (BGM) Detector: Uses a probabilistic model to cluster boundary features, providing a stable posterior probability score indicating the likelihood of a true boundary. DBSCAN Detector: A density-based clustering model applied to the feature space. It identifies sparse regions (outliers) in the feature distribution, which often correspond to boundaries. It utilizes safe fallbacks for degenerate inputs to maintain stability even in highly uniform sequences.

Z-score Detector: A simple but robust statistical method that scores a boundary based on the number of standard deviations the feature vector (e.g., NPMI) deviates from the mean. It includes robust handling of constant arrays where features may not vary. The results are fused using a reliability-weighted ensemble combination, where weights ωk are assigned based on the detector's proven stability and performance. Crucially, the system is designed with a guaranteed constraint: at least one boundary per sequence (inscription line) is enforced to prevent the model from failing to segment extremely short or unusual inscriptions.

3.5. Diagnostic and Logging Tools

Explanation: Diagnostic plotting (optional) and verbose logging are integrated throughout the pipeline. Diagnostic plots visually represent feature distributions and detector decisions (e.g., GMM cluster boundaries), while logging records the exact feature vector and the weighted score of each detector for every candidate boundary. This facilitates unit-testable helper functions and allows for the critical verification and interpretation of every boundary decision by linguistic experts.

Lexicon Discovery and Comparative Analysis The segmented "words" are analyzed and categorized, then rigorously compared against a reference lexicon from Old Tamil to validate the linguistic hypothesis.

4.1. Dataset-Driven Lexicon Extraction (LIVC)

Explanation: The lexicon_discovery.py module performs Layer 2 analysis, deterministically extracting probable word categories such as titles, commodities, and proper names. This process relies purely on the structural and positional features of the segmented units (e.g., units frequently preceding numbers are candidates for commodities). The method is dataset-driven (no synthetic choices), ensuring the generated lexicon_categories.csv and lexicon_summary.json are fully reproducible.

4.2. Cross-Lexicon Comparison (Γ)

The structural and semantic similarity between LIVC and a comparative LOldTamil is quantified using multiple metrics: Γ=Metrics(LIVC,LOldTamil)

Explanation: The comparison is multi-faceted, moving beyond mere word-to-word matching: Entropy Similarity: Compares the internal complexity (e.g., sign distribution and length variance) of the lexicons. Label Order Similarity: Assesses the similarity in the typical sequence of inferred semantic categories (e.g., are Titles and Proper Names structured similarly in both corpora?). Semantic Overlap: Quantifies the successful mapping of segmented IVC words to known Old Tamil vocabulary based on the hybrid decoding's phonetic hypothesis. Structural Similarity of Seal Arrangements: Compares the statistical consistency of multi-label per seal patterns between the two corpora. SVO Syntactic Pattern Similarity: Analyzes the frequency and structural integrity of Subject-Verb-Object or other dominant syntactic patterns identified in the reconstructed lines.

Hybrid Decoding and Final Reconstruction

The final stage synthesizes all preceding models and features to generate the most probable linguistic reconstruction.

5.1. Hybrid Decoding Core

DHybrid=Reconstruct(Top-k Candidates∣Class Probabilities)

Explanation: Hybrid decoding primarily operates in a model-driven fashion, generating word-level top-k candidates (e.g., the 3 most likely Old Tamil words for an IVC segment) based on the statistical models. The process then uses the class probabilities from lexicon_summary.json to bias scoring, preferring reconstructions that align with the inferred categories (e.g., prioritizing a 'commodity' interpretation for a segment previously classified as such). Segment-level (2-3) matching refines the candidates by checking local consistency. Final full-line reconstruction selects the highest-scoring, syntactically coherent sequence.

5.2. Output Generation

Explanation: The output phase produces multiple diagnostic files, most importantly one file containing all decoded Tamil lines (one per IVC row). This final output is the result of the entire methodological pipeline, providing the verifiable data for subsequent linguistic and historical analysis.

Result

We find that when we compare the grammar of constructed words of IVC script they are about 98% similar in almost all aspects with the Old Tamil language used in First Sangam Era Tamil language. On the contrary they are only about 73% similar to Pali grammar used by Emperor Ashoka in his edicts and barely 53% similar to Rig Vedic Sanskrit Language.

What does the translated lines say:

A literate, record-keeping bureaucracy existed Lines like “Arimai kolaai mai nambi p p” are repetitive that is exactly what you expect from administrative entities receipts, ledger lines, standardized phrases and short labels used repeatedly by clerks.

Presence of terms for documents, seals/stamps (muttirai), named officials and place-phrases implies writing was used for real bureaucratic bookkeeping rather than only ritual or inscriptional display.

There was an institutional taxation / revenue system

Recurring phrase groups translate to tax/levy (Arimai kolaai), quality or inspection documents (Kval nu il / kval kolaai velli), and references to silver (velli), shares (pai / ka), and money (kaasu). Multiple variants: “ancient/hereditary tax”, “quality tax/levy silver”, “great tax” — indicate a layered fiscal system (different taxes for different categories: hereditary land tax, quality/inspection levies, special/overtax).

Active merchant economy and merchant institutions Frequent mention of vanigar / “merchants”, merchant gold (vanigar pon tu), merchant stamps, and merchants’ shares points to organized merchant activity and likely merchant guilds or corporate merchant bodies.

References to “merchants’ assembly” and “merchants’ seal/stamp” imply collective merchant institutions with formal instruments (seals) used to validate transactions.

Landholding, owners and named individuals

Repeated named actors — Nambi appears constantly, plus recurring names/titles like Maramudaiyan, Nambi p p, Nyakan (chief/ lord) — suggesting named proprietors, officials or clerks are recorded.

Mentions of nilam (land), aavu l (cattle/property) and cultivation (payir) indicate land/property records and taxation on agricultural production. Two-person stamps and “two persons’ name stamp” entries point to contracts or joint ownership/guarantees.

Public institutions: assemblies & royal oversight Terms like ampalam / manram (public hall / assembly) appear repeatedly — there were formal civic meeting places where records were kept or transactions approved.

Frequent references to arasar / “King’s/Royal matter” and “perun aracu tamai” (great royal decision) indicate royal authority either approving or controlling parts of the fiscal system. This suggests administration with both municipal and central (royal) roles.

Standardization, quality control and seals Repeated “quality/security document” phrases and references to quality taxes imply inspection and standards — goods were graded and quality control mattered for taxation/market value.

Use of seals, stamps and "quality tax silver" implies physical administrative tools to authenticate goods/documents — consistent with archaeological finds of seals and stampings in South Asian Bronze Age contexts.

Complexity of fiscal instruments

Terms such as “three parts matter”, “great share silver”, “royal decision share”, and repeated combinations of money+seal+assembly show sophistication: revenue shares, allocations, multi-party distribution (merchant share vs. chief share vs. royal share).

Urban & municipal dimension

Repeated “perun ur il” — “in the great village/town” — points to urban centers where such accounting took place. This implies an urbanized economic nexus (trade + civic administration).

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Decoded		Decoded
dataset		dataset
script		script
1_.py		1_.py
2_.py		2_.py
3_.py		3_.py
4_.py		4_.py
5_.py		5_.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

About

Uh oh!

Releases

Packages

Languages

ramnerd/IVC_script_decoded

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages