This work presents an annotated corpus of Indian court decision documents, with 10 coarse-grained classes and 30 fine-grained classes, as a benchmark dataset for constructing a knowledge graph. It also presents a knowledge graph of Indian court case documents, constructed using a rule-based approach to Named Entity Recognition (NER) and Relation Extraction (RE).
The legal documents for creating the corpus were collected from 'Indian Kanoon', an online search engine for Indian legal documents. The data from the text files were split into sentences, tokenized word by word and annotated with POS tags using spaCy. Named legal entities were identified manually from these tokens and tagged with domain-specific tags in the CoNLL-2003 format. The dataset is provided in three different encoding schemes of the CoNLL-2003 format, namely BILOU (B-Beginning, I-Internal, L-Last, O-Outside, U-Unit), IOB (I-Inside, O-Outside, B-Beginning) and IOBES (I-Inside, O-Outside, B-Beginning, E-End, S-Single). The dataset is published on FigShare under a CC BY 4.0 licence, with the DOI: https://doi.org/10.6084/m9.figshare.19719088.v4
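To make the relationship between the three encoding schemes concrete, here is a minimal sketch that converts IOB tags to BILOU and then to IOBES. It assumes the IOB2 convention (every entity starts with a `B-` tag), which the description above does not confirm. The function names, the example tokens and the tag names `COURT` and `LOCATION` are illustrative and not taken from the dataset files.

```python
def iob_to_bilou(tags):
    """Convert IOB2 tags (every entity starts with B-) to BILOU tags."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label  # does the same entity continue?
        if prefix == "B":
            out.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "L-") + label)
    return out


def bilou_to_iobes(tags):
    """BILOU and IOBES differ only in naming: L becomes E, U becomes S."""
    rename = {"L": "E", "U": "S"}
    return [t if t == "O" else rename.get(t[0], t[0]) + t[1:] for t in tags]


# Invented example; COURT and LOCATION stand in for the corpus's own tag names.
tokens = ["Punjab", "and", "Haryana", "High", "Court", "at", "Chandigarh"]
iob = ["B-COURT", "I-COURT", "I-COURT", "I-COURT", "I-COURT", "O", "B-LOCATION"]
bilou = iob_to_bilou(iob)
iobes = bilou_to_iobes(bilou)
for token, a, b, c in zip(tokens, iob, bilou, iobes):
    print(token, a, b, c)
```

The only information BILOU and IOBES add over IOB is whether a token closes an entity or forms one on its own, so the conversion needs to look ahead by just one tag.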
To support the domain-specific tags, two levels of semantic classes were defined, namely coarse-grained and fine-grained, consisting of 10 and 30 classes respectively. The coarse-grained classes are the more general semantic classes for the legal domain: Court, Party, CourtDecision, Document, Jurisdiction, Location, CaseType, Author, CourtOfficial, and DateOfJudgment.
The two major steps in constructing the knowledge graph are Named Entity Recognition (NER) and Relation Extraction (RE). The legal entities identified in the corpus with reference to the NyOn Ontology are combined with the extracted relations to construct the Knowledge Graph (KG).
Entity extraction is carried out largely with regular expressions and trigger words. Examples of the rules used to extract the entities "JURISDICTION" and "LOCATION" are given below.
import re

# `contents` holds the text of one document; `data` collects "TAG $ value" lines.

# JURISDICTION: the word immediately preceding the trigger word
jur = re.search(r"(\w+\W+){1}(JURISDICTION)", contents)
if jur:
    data.append("JURISDICTION $ " + jur.group(0))

# LOCATION: the word (4+ characters) preceding "High Court"
loc = re.search(r"(\w{4,}) (High Court)", contents)
if loc:
    if loc.group(1).lower() == 'pradesh':
        # Two-word state names such as "... Pradesh": capture the word before it too
        loc = re.search(r"(\w{4,}) (Pradesh)", contents, re.IGNORECASE)
        data.append("LOCATION $ " + loc.group())
    elif loc.group(1).lower() == 'kashmir':
        data.append("LOCATION $ Jammu and Kashmir")
    elif loc.group(1).lower() == 'haryana':
        data.append("LOCATION $ Punjab and Haryana")
    else:
        data.append("LOCATION $ " + loc.group(1))

An example of the output file from the NER phase for the case "KEWAL KRISHAN VS. STATE OF PUNJAB" dated 06/03/1962 is given below.
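To try the two rules outside the full pipeline, the following is a minimal sketch that wraps them in a function and runs them on an invented header snippet. The function name `extract_entities`, the guard on the second search and the sample text are assumptions made for illustration. They are not part of the original scripts, and the sample is not taken from the case cited above.

```python
import re


def extract_entities(contents):
    """Apply the JURISDICTION and LOCATION rules to one document's text."""
    data = []

    # JURISDICTION: the word immediately preceding the trigger word.
    jur = re.search(r"(\w+\W+){1}(JURISDICTION)", contents)
    if jur:
        data.append("JURISDICTION $ " + jur.group(0))

    # LOCATION: the word (4+ characters) preceding "High Court".
    loc = re.search(r"(\w{4,}) (High Court)", contents)
    if loc:
        name = loc.group(1).lower()
        if name == "pradesh":
            # Two-word state names: also capture the word before "Pradesh".
            state = re.search(r"(\w{4,}) (Pradesh)", contents, re.IGNORECASE)
            if state:  # guard added in this sketch
                data.append("LOCATION $ " + state.group())
        elif name == "kashmir":
            data.append("LOCATION $ Jammu and Kashmir")
        elif name == "haryana":
            data.append("LOCATION $ Punjab and Haryana")
        else:
            data.append("LOCATION $ " + loc.group(1))
    return data


# Invented sample text, used only to exercise the rules.
sample = (
    "CIVIL APPELLATE JURISDICTION\n"
    "Appeal from the judgment of the Punjab and Haryana High Court."
)
for line in extract_entities(sample):
    print(line)
```

Because each rule keys on a fixed trigger phrase, the special cases for "Pradesh", "Kashmir" and "Haryana" are needed to recover multi-word names that a single preceding word cannot capture.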
The relation extraction phase identifies the relations between the entities extracted in the NER phase. The NyOn Ontology is consulted to identify the relations that can hold between the extracted entities. Since the output of the NER phase contains no sentences, switch-case style rules (a chain of conditionals keyed on the entity tag) are used to annotate the relations between the extracted entities. An example of a Python rule for extracting and annotating relations is given below.
# `output` accumulates the relation text; `temp` is the tag of the current row
# and tok_ent['Entity'][i] is its extracted value.

# CASE_NAME
if 'FILE_NAME' in temp:
    output += '\n' + tok_ent['Entity'][i] + '\n'
    CASENAME = ids[index]
    output += 'CASE hasCaseId ' + CASENAME + '\n'
    index += 1

# BENCH
if 'BENCH' in temp:
    output += CASENAME + ' hasCourtOfficial Judge' + '\n'
    # The bench entity is a comma-separated list of judge names
    judges = tok_ent['Entity'][i].split(",")
    for judge in judges:
        output += 'Judge hasName ' + judge + '\n'

The triples obtained after annotating the entities with the corresponding relations for the case "KEWAL KRISHAN VS. STATE OF PUNJAB" dated 06/03/1962 are given below.
The triples were formed by annotating the entities obtained from NER with the relations extracted in the RE phase. The output file from the RE phase is passed through a Python script to generate the RDF (.ttl) file. The constructed triples were stored in a triple store (Apache Jena Fuseki) and visualized using GraphDB. The generated RDF for all 50 documents is given here, and the knowledge graph visualized through GraphDB is given below.
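As a rough illustration of the step from RE output to a Turtle file, the sketch below turns "subject predicate object" lines into Turtle statements. It is not the project's conversion script. The `ex:` namespace, the function names and the sample lines are hypothetical, every object is emitted as a plain string literal for simplicity, and lines are recognised as relations only if the second word starts with `has`, an assumption based on the relation names shown above.

```python
import re

# Hypothetical namespace, used only for this sketch.
PREFIX = "@prefix ex: <http://example.org/legal#> ."


def local_name(text):
    """Make a string safe to use as the local part of a prefixed name."""
    return re.sub(r"\W+", "_", text.strip())


def literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = text.strip().replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_turtle(lines):
    """Turn 'subject predicate object' lines into Turtle statements."""
    statements = [PREFIX, ""]
    for line in lines:
        parts = line.strip().split(" ", 2)
        # Skip blank lines and case-title lines (assumed: predicates start with "has").
        if len(parts) != 3 or not parts[1].startswith("has"):
            continue
        subject, predicate, obj = parts
        statements.append(
            "ex:%s ex:%s %s ." % (local_name(subject), local_name(predicate), literal(obj))
        )
    return "\n".join(statements)


# Invented RE output lines in the "subject predicate object" shape.
re_output = [
    "SOME PARTY VS. ANOTHER PARTY",
    "CASE hasCaseId case_001",
    "case_001 hasCourtOfficial Judge",
    "Judge hasName J. Example",
]
print(to_turtle(re_output))
```

A fuller converter would map objects such as `Judge` to resources rather than literals, guided by the ontology. The sketch keeps everything as literals so that the output stays valid Turtle without that mapping.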
The triples were tested against competency questions using SPARQL queries. Screenshots of the competency questions, the corresponding SPARQL queries and their outputs are attached below.
-
List all the cases from the month of September.
Query:
Output:
-
What is the total number of cases filed under case type 'criminal'?
Query:
Output:
-
List all the cases with Vivian Bose/V. Bose as a judge.
Query:
Output:
-
What is the total number of cases in which the judgement on the appeal is 'Accepted'?
Query:
Output:
-
What is the date of judgement for the case "THE RT. REV. B.P. SUGANDHARBISHOP IN MEDAK VS. SMT. D. DOROTHY DAYASHEELAEBENESER on 12/04/1996"?
Query:
Output:
-
List all the cases under 'Appellant Jurisdiction'.
Query:
Output:
-
Name the petitioner of case with "CASE NO.: Appeal (civil) 294 of 1955".
Query:
Output:
-
List all the cases involving Union of India as one of the parties.
Query:

Output:
-
How many appeals have been 'rejected' by the judge 'Vivian Bose/V. Bose'?
Query:

Output:
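As a rough illustration of what a competency question checks, the sketch below takes the second question in the list (the number of cases of type 'criminal') and shows one possible SPARQL formulation alongside a plain-Python equivalent over a few invented triples. The `ex:` namespace, the predicate name `hasCaseType` and the case identifiers are assumptions. The actual queries and vocabulary used with the triple store may differ.

```python
# Hypothetical SPARQL for "total number of cases filed under case type
# 'criminal'"; the ex: namespace and ex:hasCaseType are assumed names.
QUERY = """
PREFIX ex: <http://example.org/legal#>
SELECT (COUNT(DISTINCT ?case) AS ?total)
WHERE { ?case ex:hasCaseType "criminal" . }
"""

# Invented triples standing in for the contents of the triple store.
triples = [
    ("case_001", "hasCaseType", "criminal"),
    ("case_002", "hasCaseType", "civil"),
    ("case_003", "hasCaseType", "criminal"),
    ("case_003", "hasCourtOfficial", "Judge"),
]


def count_cases_of_type(triples, case_type):
    """Plain-Python counterpart of the COUNT(DISTINCT ?case) query above."""
    cases = {s for s, p, o in triples if p == "hasCaseType" and o == case_type}
    return len(cases)


print(count_cases_of_type(triples, "criminal"))  # 2
```

The Python function only mirrors the logic of the query so that it can be checked without a running endpoint. Against a store such as Apache Jena Fuseki, the query text itself would be submitted instead.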
This work is supported by the IHUB-ANUBHUTI-IIITD FOUNDATION set up under the NM-ICPS scheme of the Department of Science and Technology, India. We thank Mr. Vaibhav Vats, Advocate, Punjab and Haryana High Court, Chandigarh for providing his valuable reviews for the dataset.
Pariskhit Kamat (National Institute of Technology, Kurukshetra, India) pariskhit_52010088@nitkkr.ac.in
Shubham Kalson (National Institute of Technology, Kurukshetra, India) shubham_52010087@nitkkr.ac.in
Suraj S (National Institute of Technology, Kurukshetra, India) suraj_52010085@nitkkr.ac.in
Pooja Harde (National Institute of Technology, Kurukshetra, India) pmharde29@gmail.com
Nandana Mihindukulasooriya (IBM Research, Dublin, Ireland) nandana@ibm.com
Dr. Sarika Jain (National Institute of Technology, Kurukshetra, India) jasarika@nitkkr.ac.in
