# Extracting all information from extracted texts of Acts

The goal is to extract every section's number, title and text, as well as the grouping of sections into chapters.

In [1]:
import os
import re
from collections import defaultdict
from copy import deepcopy

In [2]:
path = "/home/workboots/Datasets/IndiaCode/CentralActs/converted_texts/186045.txt"

Loading the data and stripping any preceding or succeeding whitespace

In [3]:
with open(path, 'r') as f:
    txt = f.read()

In [4]:
txt = txt.strip()
txt = bytes(txt, 'utf-8').decode('utf-8', 'ignore')
txt = txt.replace('\x0c', '')

Setting the name of the Act. Here, it is done manually. Later the name will be given by a dictionary having the names of all acts.

In [5]:
title = "The Indian Penal Code"
date = 1860

Creating the regex to find the name of the Act in the text. Usually the text mentions the name only twice:
- Once at the beginning, right before the 'ARRANGEMENT OF SECTIONS'
- Once right before the main body of the Act

Splitting along the name therefore divides the document into two parts. The first part, which follows a very regular structure, is used to extract the numbers and titles of the sections as well as their chapters. This information is then used to extract the text of each section from the second part.

In [6]:
r = re.compile(rf"{title},?\s*({date})?", flags = re.I | re.DOTALL)

In [7]:
parts = r.split(txt, maxsplit=2)

In [8]:
len(parts)

5

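Ideally the split yields exactly 2 text segments. In practice the raw list is longer (5 here): `re.split` also returns the contents of the capturing group for every match (`None` where the optional date did not match), it can return empty strings, and noisy conversion may add further pieces. After dropping the empty entries, the first half of the remaining parts is treated as the 'ARRANGEMENT OF SECTIONS' and the second half as the actual text. **However, if fewer than 2 parts remain, we are unable to proceed.**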
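To see why the raw split can contain more than two entries, here is a minimal sketch on a made-up miniature Act. The names `toy_title`, `toy_year` and `toy_txt` are hypothetical and not part of the notebook; the non-capturing variant is shown only as one possible alternative, not as the approach used here.

```python
import re

# Hypothetical miniature Act; the title and year are made up for illustration.
toy_title, toy_year = "The Toy Act", 1999
toy_txt = (
    "THE TOY ACT, 1999\nARRANGEMENT OF SECTIONS\n1. Short title.\n"
    "THE TOY ACT\n1. Short title.This Act may be called the Toy Act."
)

# Capturing group: every split also contributes the group's value (or None).
capturing = re.compile(rf"{re.escape(toy_title)},?\s*({toy_year})?", flags=re.I)
raw = capturing.split(toy_txt, maxsplit=2)

# Non-capturing group: only the text between matches is returned.
non_capturing = re.compile(rf"{re.escape(toy_title)},?\s*(?:{toy_year})?", flags=re.I)
segments = list(filter(None, non_capturing.split(toy_txt, maxsplit=2)))

print(len(raw), len(segments))
```

In this toy case the raw list has five entries (an empty prefix, the captured year, the arrangement, `None`, and the body), while the non-capturing variant leaves exactly two non-empty segments.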

In [9]:
parts = list(filter(None, parts))

In [10]:
len(parts)

2

In [11]:
if len(parts) < 2:
    print("Error")
else:
    if len(parts) %2 == 0:
        arrangement = " ".join(parts[:len(parts)//2])
        info = " ".join(parts[len(parts)//2:])
    else:
        arrangement = " ".join(parts[:len(parts)//2+1])
        info = " ".join(parts[len(parts)//2+1:])

In [36]:
print([len(p) for p in parts])

[38367, 458275]


## Extracting section numbers, titles and chapter break-up from the first half of the Act

Initial preprocessing to remove spurious elements. Done after observing the text.

In [12]:
arrangement = re.sub(r"SECTIONS?", '', arrangement)
arrangement = re.sub(r"[^\w./,\s-]", '', arrangement)
arrangement = re.sub(r"\n+\d+\s*\n+", '', arrangement, flags=re.DOTALL)

Removing all capital letter words as these are meaningless for extracting the section numbers and titles

Information is extracted by first splitting the arrangement along the 'CHAPTER' token. Then, for each resulting chunk, a section regex is used to obtain the number and title of every section in it.
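The two steps can be sketched on a small, made-up arrangement. The text in `toy_arrangement` and the names `section_re`, `chapter_name_re` and `toy_info` are assumptions for illustration; the regular expressions are the same ones used in the cells that follow.

```python
import re

# Hypothetical, already-cleaned arrangement text (not taken from the real Act).
toy_arrangement = """
CHAPTER I
INTRODUCTION
1. Title and extent.
2. Punishment of offences.
CHAPTER II
GENERAL EXPLANATIONS
6. Definitions.
29A. Electronic record.
"""

section_re = re.compile(r"(?P<num>[0-9]+)(?P<alpha>(?:[A-Z]+)?).?\s*(?P<title>.*)")
chapter_name_re = re.compile(r"([A-Z\s]+)")

toy_info = {}
# Step 1: one chunk per chapter (the text before the first CHAPTER is chunk 0).
for chunk in re.split(r"CHAPTER\s+[MCLXVI]+", toy_arrangement):
    name = chapter_name_re.search(chunk)
    chapter = re.sub(r"\s+", " ", name.group(1)).strip() if name else ""
    # Step 2: every "<number><letters>. <title>" line becomes one entry.
    for m in section_re.finditer(chunk):
        key = m["num"] + m["alpha"]
        toy_info.setdefault(key, {"title": m["title"].strip().strip("."),
                                  "chapter": chapter})

for key, value in toy_info.items():
    print(key, value)
```

Keys are kept as strings so that lettered sections such as `29A` sit alongside purely numeric ones.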

In [14]:
chapters = re.split(r"CHAPTER\s+[MCLXVI]+", arrangement, flags = re.DOTALL)

In [15]:
section_num_parse_regex = r"(?P<num>[0-9]+)(?P<alpha>(?:[A-Z]+)?).?\s*(?P<title>.*)"
r = re.compile(section_num_parse_regex)
chap_name_regex = r"([A-Z\s]+)"
c = re.compile(chap_name_regex)

In [16]:
act_info = {}

In [17]:
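for chap in chapters:
    chap_name = c.search(chap)
    # The first run of capitals/whitespace in a chunk is taken as the chapter name.
    # str.replace() does not understand regexes, so re.sub() collapses the whitespace.
    chap_text = chap_name.group(1).replace("PREAMBLE", '') if chap_name else ''
    chap_text = re.sub(r"\s+", ' ', chap_text).strip()
    for match in r.finditer(chap):
        dct = match.groupdict()
        key = dct["num"] + dct["alpha"]
        # Keep only the first occurrence of each section number.
        if key not in act_info:
            act_info[key] = {"title": dct["title"].strip().strip("."),
                             "chapter": chap_text}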

In [18]:
if int(list(act_info.keys())[0]) != 1:
    del act_info[list(act_info.keys())[0]]

In [19]:
act_info

{'1': {'title': 'Title and extent of operation of the Code',
  'chapter': 'INTRODUCTION'},
 '2': {'title': 'Punishment of offences committed within India',
  'chapter': 'INTRODUCTION'},
 '3': {'title': 'Punishment of offences committed beyond, but which by law may be tried within, India',
  'chapter': 'INTRODUCTION'},
 '4': {'title': 'Extension of Code to extra-territorial offences',
  'chapter': 'INTRODUCTION'},
 '5': {'title': 'Certain laws not to be affected by this Act',
  'chapter': 'INTRODUCTION'},
 '6': {'title': 'Definitions in the Code to be understood subject to exceptions',
  'chapter': 'GENERAL EXPLANATIONS'},
 '7': {'title': 'Sense of expression once explained',
  'chapter': 'GENERAL EXPLANATIONS'},
 '8': {'title': 'Gender', 'chapter': 'GENERAL EXPLANATIONS'},
 '9': {'title': 'Number', 'chapter': 'GENERAL EXPLANATIONS'},
 '10': {'title': 'Man. Woman', 'chapter': 'GENERAL EXPLANATIONS'},
 '11': {'title': 'Person', 'chapter': 'GENERAL EXPLANATIONS'},
 '12': {'title': 'Public

## Extraction of section texts

The actual texts are contained in the second half of the text.

Carrying out some initial preprocessing

In [44]:
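# Unwrap numbered bracket markers such as 1[...] and keep the bracketed text
info = re.sub(r"\d+\[(.*?)\]", r"\1", info, flags=re.DOTALL)
# Drop any numbered openers left without a closing bracket
info = re.sub(r"\d+\[", '', info)
# Drop runs of asterisks (optionally numbered) together with surrounding whitespace
info = re.sub(r"\s*(\d)*(\*)+\s*", '', info)
# Drop all-caps words
info = re.sub(r"\b[A-Z]+\b", '', info)
# Keep only word characters, whitespace and . / , -
info = re.sub(r"[^\w./,\s-]", '', info)
# Drop numbers standing alone on a line (likely page numbers)
info = re.sub(r"\n+\d+\s*\n+", '', info, flags=re.DOTALL)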

In [45]:
print(info)

 . 45  18601 

  

 

6th October, 1860. 

Preamble.  it  is  expedient  to  provide  a  general  Penal  Code  for  India  It  is 

enacted as follows 

1. Title and extent of operation of the Code.This Act shall be called the Indian Penal Code, and 

shall extend to the whole of India. 

2.  Punishment  of  offences  committed  within  India.Every  person  shall  be  liable  to  punishment 
under this Code and not otherwise for every act or omission contrary to the provisions thereof, of which he 
shall be guilty within India.  

3. Punishment of offences committed beyond, but which by law may be tried within, India.Any 
person liable, by any Indian law, to be tried for an offence committed beyond India shall be dealt with 
according to the provisions of this Code for any act committed beyond  India in the same manner as if 
such act had been committed within India. 

4. Extension of Code to extra-territorial offences.The provisions of this Code apply also to any 

offence committed b

We extract the texts of a section by using the title of that section and of the next one. The titles of two consecutive sections are used as the bookending markers for a particular text.
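The bookending idea can be sketched on a toy body text. Everything here (`toy_body`, `toy_titles`, `title_pattern`, `toy_texts`) is hypothetical, and the sketch deliberately differs from the cells below in two ways that are assumptions rather than the notebook's method: titles are passed through `re.escape`, and the capture is lazy (`.*?`) so it stops at the first occurrence of the next title.

```python
import re

# Hypothetical cleaned body text and section titles, made up for illustration.
toy_body = (
    "1. Short title.This Act may be called the Toy Act.\n\n"
    "2.  Definitions.In this Act, words\nhave their usual meaning.\n"
)
toy_titles = {"1": "Short title", "2": "Definitions"}


def title_pattern(num, title):
    """Regex for '<num>. <title>.' that tolerates any amount of whitespace."""
    words = r"\s+".join(re.escape(word) for word in title.split())
    return rf"\b{re.escape(num)}\.?\s+{words}\.?"


keys = list(toy_titles)
patterns = [title_pattern(key, toy_titles[key]) for key in keys]

toy_texts = {}
# Pair each title with the next one; the last section runs to the end (\Z).
for key, start, end in zip(keys, patterns, patterns[1:] + [r"\Z"]):
    m = re.search(rf"{start}(.*?)(?:{end})", toy_body, flags=re.I | re.DOTALL)
    toy_texts[key] = re.sub(r"\s+", " ", m.group(1)).strip() if m else ""

for key, text in toy_texts.items():
    print(key, "->", text)
```

A lazy capture avoids swallowing later sections when a title reappears further down, at the cost of cutting a section short if its own text happens to quote the next title.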

In [46]:
titles = []
for num in act_info:
    text = act_info[num]["title"].split()
    text = r"\s+".join(text)
    titles.append(rf"{num}.?\s+{text}")

In [47]:
titles

['1.?\\s+Title\\s+and\\s+extent\\s+of\\s+operation\\s+of\\s+the\\s+Code',
 '2.?\\s+Punishment\\s+of\\s+offences\\s+committed\\s+within\\s+India',
 '3.?\\s+Punishment\\s+of\\s+offences\\s+committed\\s+beyond,\\s+but\\s+which\\s+by\\s+law\\s+may\\s+be\\s+tried\\s+within,\\s+India',
 '4.?\\s+Extension\\s+of\\s+Code\\s+to\\s+extra-territorial\\s+offences',
 '5.?\\s+Certain\\s+laws\\s+not\\s+to\\s+be\\s+affected\\s+by\\s+this\\s+Act',
 '6.?\\s+Definitions\\s+in\\s+the\\s+Code\\s+to\\s+be\\s+understood\\s+subject\\s+to\\s+exceptions',
 '7.?\\s+Sense\\s+of\\s+expression\\s+once\\s+explained',
 '8.?\\s+Gender',
 '9.?\\s+Number',
 '10.?\\s+Man.\\s+Woman',
 '11.?\\s+Person',
 '12.?\\s+Public',
 '13.?\\s+Omitted',
 '14.?\\s+Servant\\s+of\\s+Government',
 '15.?\\s+Repealed',
 '16.?\\s+Repealed',
 '17.?\\s+Government',
 '18.?\\s+India',
 '19.?\\s+Judge',
 '20.?\\s+Court\\s+of\\s+Justice',
 '21.?\\s+Public\\s+servant',
 '22.?\\s+Moveable\\s+property',
 '23.?\\s+Wrongful\\s+gain',
 '24.?\\s+Dishonest

In [48]:
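for i, t1, t2 in zip(act_info, titles, titles[1:] + ['']):
    print(t1)
    # `i` is the section key (a string such as '29A'), not a position, so the
    # last section is recognised by its empty end marker and runs to the end of the text.
    if t2 == '':
        t2 = r"\Z"
    # (.*) is greedy: the capture extends to the last occurrence of the next title.
    section_re = re.compile(rf"{t1}(.*){t2}", flags=re.I | re.DOTALL)
    act_info[i]["text"] = ""
    # If the pattern matches more than once, the last match wins.
    for m in section_re.finditer(info):
        act_info[i]["text"] = re.sub(r"\s+", ' ', m.group(1)).strip()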

1.?\s+Title\s+and\s+extent\s+of\s+operation\s+of\s+the\s+Code
2.?\s+Punishment\s+of\s+offences\s+committed\s+within\s+India
3.?\s+Punishment\s+of\s+offences\s+committed\s+beyond,\s+but\s+which\s+by\s+law\s+may\s+be\s+tried\s+within,\s+India
4.?\s+Extension\s+of\s+Code\s+to\s+extra-territorial\s+offences
5.?\s+Certain\s+laws\s+not\s+to\s+be\s+affected\s+by\s+this\s+Act
6.?\s+Definitions\s+in\s+the\s+Code\s+to\s+be\s+understood\s+subject\s+to\s+exceptions
7.?\s+Sense\s+of\s+expression\s+once\s+explained
8.?\s+Gender
9.?\s+Number
10.?\s+Man.\s+Woman
11.?\s+Person
12.?\s+Public
13.?\s+Omitted
14.?\s+Servant\s+of\s+Government
15.?\s+Repealed
16.?\s+Repealed
17.?\s+Government
18.?\s+India
19.?\s+Judge
20.?\s+Court\s+of\s+Justice
21.?\s+Public\s+servant
22.?\s+Moveable\s+property
23.?\s+Wrongful\s+gain
24.?\s+Dishonestly
25.?\s+Fraudulently
26.?\s+Reason\s+to\s+believe
27.?\s+Property\s+in\s+possession\s+of\s+wife,\s+clerk\s+or\s+servant
28.?\s+Counterfeit
29.?\s+Document
29A.?\s+Electronic\s
