# Extracting all information from the extracted PDF texts of Acts

The goal is to get all section numbers, titles and texts, as well as the grouping of sections into chapters.

In [564]:
import os
import re
from collections import defaultdict

In [565]:
path = "/home/workboots/Datasets/IndiaCode/CentralActs/converted_texts/186045.txt"

Loading the data and stripping any leading or trailing whitespace

In [566]:
with open(path, 'r') as f:
    txt = f.read()

In [567]:
txt = txt.strip()

Setting the name of the Act. Here, it is done manually. Later the name will be given by a dictionary having the names of all acts.

In [568]:
title = "The Indian Penal Code"
date = 1860

Creating the regex to find the name of the Act in the text. The text usually mentions the name only twice:
- Once at the beginning, right before the 'ARRANGEMENT OF SECTIONS'
- Once right before the main body of the Act

Splitting on the name therefore divides the document into two parts. The first part, which follows a regular structure, is used to extract the numbers and titles of the sections as well as their chapters. This information is then used to extract the text of each section from the second part.

In [569]:
r = re.compile(rf"{re.escape(title)},?\s*({date})?", flags = re.I | re.DOTALL)

In [570]:
parts = r.split(txt, maxsplit=2)

In [571]:
len(parts)

5

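Ideally the split yields exactly 2 useful segments. The raw list is longer because the pattern contains a capturing group: for every split, `re.split` also returns the text captured by the group (or `None` when the optional group did not match), and an empty string appears when the text begins with the title. Filtering out the empty and `None` entries removes these. Because of noisy conversion, more than 2 segments may still remain; this is handled by treating the first half of the parts as the 'ARRANGEMENT OF SECTIONS' and the second half as the actual text. **However, if fewer than 2 segments remain, we are unable to proceed.**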

In [572]:
parts = list(filter(None, parts))

In [573]:
len(parts)

2
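
The effect of the capturing group on the split can be reproduced on a toy string. The sketch below is only an illustration: the helper `split_on_title`, its `capture_year` switch and the made-up "Toy Act" text are assumptions, not part of the notebook. It suggests that a non-capturing group keeps the year out of the result, so that filtering leaves exactly the two parts.

```python
import re

def split_on_title(text, title, year, capture_year=False):
    """Split `text` on the Act title, optionally followed by a comma and the year."""
    # Hypothetical helper: the only difference between the two modes is the group type
    year_group = f"({year})" if capture_year else f"(?:{year})"
    pattern = re.compile(rf"{re.escape(title)},?\s*{year_group}?", flags=re.I)
    return pattern.split(text, maxsplit=2)

# Made-up text that mentions the title twice, as the real Acts usually do
toy = ("The Toy Act, 1999 ARRANGEMENT OF SECTIONS 1. Short title. "
       "The Toy Act, 1999 1. Short title.-This Act is a made-up example.")

with_group = split_on_title(toy, "The Toy Act", 1999, capture_year=True)
without_group = split_on_title(toy, "The Toy Act", 1999)

# Captured years are truthy strings, so filter(None, ...) does not remove them
kept_with_group = list(filter(None, with_group))
kept_without_group = list(filter(None, without_group))

print(len(with_group), len(kept_with_group), len(kept_without_group))
```

In the capturing variant the matched year stays in the list after filtering, which would shift the halves used in the next step.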

In [574]:
if len(parts) < 2:
    print("Error")
else:
    if len(parts) %2 == 0:
        arrangement = " ".join(parts[:len(parts)//2])
        info = " ".join(parts[len(parts)//2:])
    else:
        arrangement = " ".join(parts[:len(parts)//2+1])
        info = " ".join(parts[len(parts)//2+1:])
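
The halving rule can also be written as a small function, which makes the failure case explicit instead of printing a message and continuing. This is a sketch under the assumption that raising an exception is acceptable here; `split_halves` is a hypothetical name.

```python
def split_halves(parts):
    """Join the first half of `parts` into the arrangement and the rest into the body."""
    if len(parts) < 2:
        raise ValueError("expected at least two segments around the Act title")
    # With an odd count the extra segment goes to the arrangement, as in the cell above
    mid = (len(parts) + 1) // 2
    return " ".join(parts[:mid]), " ".join(parts[mid:])

arrangement_demo, info_demo = split_halves(["arrangement text", "body text"])
print(arrangement_demo, "|", info_demo)
```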

In [575]:
print([len(p) for p in parts])

[38379, 458380]


## Extracting section numbers, titles and chapter break-up from the first half of the Act

Initial preprocessing to remove spurious elements. Done after observing the text.

In [576]:
arrangement = re.sub(r"SECTIONS?", '', arrangement, flags=re.I)
arrangement = re.sub(r"[^\w./,\s-]", '', arrangement)
arrangement = re.sub(r"\n+\d+\s*\n+", '', arrangement, flags=re.DOTALL)

Removing all fully capitalised words, as these carry no information for extracting the section numbers and titles

In [498]:
for_sec = re.sub(r"\b[A-Z]+\b", "", arrangement)

Using a regex pattern to get the section numbers and titles

In [502]:
section_num_parse_regex = r"(?P<num>[0-9]+)(?P<alpha>(?:[A-Z]+)?).?\s*(?P<title>.*)"

In [503]:
r = re.compile(section_num_parse_regex)

In [504]:
matches = r.finditer(arrangement)

In [505]:
act_info = {}

In [506]:
for match in matches:
    dct = match.groupdict()
    print(dct)
    # Keep only the first occurrence of each section number
    key = dct["num"] + dct["alpha"]
    if key not in act_info:
        act_info[key] = {"title": dct["title"].strip().strip(".")}

{'num': '1932', 'alpha': '', 'title': '________ '}
{'num': '1', 'alpha': '', 'title': 'Short title, extent and commencement. '}
{'num': '2', 'alpha': '', 'title': 'Definitions. '}
{'num': '3', 'alpha': '', 'title': 'Application of provisions of Act 9 of 1872. '}
{'num': '4', 'alpha': '', 'title': 'Definition of partnership, partner, firm and firm name. '}
{'num': '5', 'alpha': '', 'title': 'Partnership not created by status. '}
{'num': '6', 'alpha': '', 'title': 'Mode of determining existence of partnership. '}
{'num': '7', 'alpha': '', 'title': 'Partnership at will. '}
{'num': '8', 'alpha': '', 'title': 'Particular partnership. '}
{'num': '9', 'alpha': '', 'title': 'General duties of partners. '}
{'num': '10', 'alpha': '', 'title': 'Duty to indemnify for loss caused by fraud.  '}
{'num': '11', 'alpha': '', 'title': 'Determination of rights and duties of partners by contract between the partners. '}
{'num': '12', 'alpha': '', 'title': 'The conduct of the business. '}
{'num': '13', 'alp
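
To see what the three named groups capture, the same pattern can be run on a few made-up lines, including a section number with a letter suffix. This is a minimal sketch; `parse_sections` and the toy text are illustrative assumptions.

```python
import re

# Same pattern as in the notebook: digits, optional capital-letter suffix, then the title
SECTION_RE = re.compile(r"(?P<num>[0-9]+)(?P<alpha>(?:[A-Z]+)?).?\s*(?P<title>.*)")

def parse_sections(arrangement):
    """Map each section key (number plus suffix) to its title, keeping the first hit."""
    sections = {}
    for m in SECTION_RE.finditer(arrangement):
        key = m["num"] + m["alpha"]
        sections.setdefault(key, {"title": m["title"].strip().strip(".")})
    return sections

toy_arrangement = (
    "1. Short title.\n"
    "2. Definitions.\n"
    "53A. Construction of reference to transportation.\n"
)
toy_sections = parse_sections(toy_arrangement)
print(toy_sections)
```

Because `.` does not match a newline without `re.DOTALL`, each title stops at the end of its line.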

Sometimes the first entry is spurious (for example, a year picked up as a section number). Check for this and remove it

In [511]:
first_key = next(iter(act_info))
if first_key != "1":
    del act_info[first_key]

In [512]:
act_info

{'1': {'title': 'Short title, extent and commencement'},
 '2': {'title': 'Definitions'},
 '3': {'title': 'Application of provisions of Act 9 of 1872'},
 '4': {'title': 'Definition of partnership, partner, firm and firm name'},
 '5': {'title': 'Partnership not created by status'},
 '6': {'title': 'Mode of determining existence of partnership'},
 '7': {'title': 'Partnership at will'},
 '8': {'title': 'Particular partnership'},
 '9': {'title': 'General duties of partners'},
 '10': {'title': 'Duty to indemnify for loss caused by fraud'},
 '11': {'title': 'Determination of rights and duties of partners by contract between the partners'},
 '12': {'title': 'The conduct of the business'},
 '13': {'title': 'Mutual rights and liabilities'},
 '14': {'title': 'The property of the firm'},
 '15': {'title': 'Application of the property of the firm'},
 '16': {'title': 'Personal profits earned by partners'},
 '17': {'title': 'Rights and duties of partners'},
 '18': {'title': 'Partner to be agent of the

The arrangement is first split on the 'CHAPTER' token (followed by a Roman numeral). For each resulting chunk, the chapter name is read from the leading run of capital letters, and the section regex is then used to obtain the section numbers and titles.

In [611]:
chapters = re.split(r"CHAPTER\s+[MCLXVI]+", arrangement, flags = re.DOTALL)

In [612]:
section_num_parse_regex = r"(?P<num>[0-9]+)(?P<alpha>(?:[A-Z]+)?).?\s*(?P<title>.*)"
r = re.compile(section_num_parse_regex)
chap_name_regex = r"([A-Z\s]+)"
c = re.compile(chap_name_regex)

In [613]:
act_info = {}

In [614]:
for chap in chapters:
    # The chapter name is the first run of capital letters (and whitespace) in the chunk
    chap_name = c.search(chap)
    chap_text = chap_name.group(1).replace("PREAMBLE", "") if chap_name else ""
    # Collapse line breaks and repeated spaces inside multi-line chapter names
    chap_text = re.sub(r"\s+", " ", chap_text).strip()
    for match in r.finditer(chap):
        dct = match.groupdict()
        key = dct["num"] + dct["alpha"]
        if key not in act_info:
            act_info[key] = {"title": dct["title"].strip().strip("."),
                             "chapter": chap_text}
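
The chapter grouping can be checked end to end on a toy arrangement. The sketch below reuses the notebook's three patterns; the function name `group_sections_by_chapter` and the sample text are assumptions made for illustration, and real arrangements are noisier than this.

```python
import re

CHAPTER_SPLIT_RE = re.compile(r"CHAPTER\s+[MCLXVI]+")
CHAPTER_NAME_RE = re.compile(r"([A-Z\s]+)")
SECTION_RE = re.compile(r"(?P<num>[0-9]+)(?P<alpha>(?:[A-Z]+)?).?\s*(?P<title>.*)")

def group_sections_by_chapter(arrangement):
    """Return {section key: {"title": ..., "chapter": ...}} for a cleaned arrangement."""
    act_info = {}
    for chunk in CHAPTER_SPLIT_RE.split(arrangement):
        name = CHAPTER_NAME_RE.search(chunk)
        raw = name.group(1).replace("PREAMBLE", "") if name else ""
        # Join multi-line chapter names into a single line
        chapter = " ".join(raw.split())
        for m in SECTION_RE.finditer(chunk):
            key = m["num"] + m["alpha"]
            act_info.setdefault(
                key, {"title": m["title"].strip().strip("."), "chapter": chapter}
            )
    return act_info

toy_arrangement = (
    "PREAMBLE\n"
    "CHAPTER I\nINTRODUCTION\n"
    "1. Title and extent.\n2. Punishment of offences.\n"
    "CHAPTER II\nGENERAL\nEXPLANATIONS\n"
    "6. Definitions in the Code.\n"
)
toy_info = group_sections_by_chapter(toy_arrangement)
print(toy_info)
```

The chapter name spread over two lines comes back as a single string, which is the case the whitespace normalisation is meant to cover.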

In [615]:
first_key = next(iter(act_info))
if first_key != "1":
    del act_info[first_key]

In [616]:
act_info

{'1': {'title': 'Title and extent of operation of the Code',
  'chapter': 'INTRODUCTION'},
 '2': {'title': 'Punishment of offences committed within India',
  'chapter': 'INTRODUCTION'},
 '3': {'title': 'Punishment of offences committed beyond, but which by law may be tried within, India',
  'chapter': 'INTRODUCTION'},
 '4': {'title': 'Extension of Code to extra-territorial offences',
  'chapter': 'INTRODUCTION'},
 '5': {'title': 'Certain laws not to be affected by this Act',
  'chapter': 'INTRODUCTION'},
 '6': {'title': 'Definitions in the Code to be understood subject to exceptions',
  'chapter': 'GENERAL EXPLANATIONS'},
 '7': {'title': 'Sense of expression once explained',
  'chapter': 'GENERAL EXPLANATIONS'},
 '8': {'title': 'Gender', 'chapter': 'GENERAL EXPLANATIONS'},
 '9': {'title': 'Number', 'chapter': 'GENERAL EXPLANATIONS'},
 '10': {'title': 'Man. Woman', 'chapter': 'GENERAL EXPLANATIONS'},
 '11': {'title': 'Person', 'chapter': 'GENERAL EXPLANATIONS'},
 '12': {'title': 'Public

## Extraction of section texts

In [351]:
info = parts[1]

In [352]:
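# Unwrap bracketed insertions tagged with a footnote number, e.g. 1[text] -> text
info = re.sub(r"\d+\[(.*?)\]", r"\1", info, flags=re.DOTALL)
# Drop any leftover opening markers such as 2[ that had no closing bracket
info = re.sub(r"\d+\[", '', info)
# Drop runs of asterisks, optionally preceded by footnote digits, e.g. 3***
info = re.sub(r"(\d)*(\*)+\s*", '', info)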
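
The three substitutions appear to target footnote-style markers left behind by the PDF conversion. A small sketch on made-up strings shows their combined effect; the wrapper name `strip_markers` and the sample strings are assumptions.

```python
import re

def strip_markers(text):
    """Apply the notebook's three clean-up substitutions in order."""
    text = re.sub(r"\d+\[(.*?)\]", r"\1", text, flags=re.DOTALL)  # 2[text] -> text
    text = re.sub(r"\d+\[", "", text)                             # dangling 4[ -> removed
    text = re.sub(r"(\d)*(\*)+\s*", "", text)                     # 3*** -> removed
    return text

sample = "shall extend to the whole of India 2[except the State] 3***."
cleaned = strip_markers(sample)
print(cleaned)

dangling = strip_markers("4[Provided that")
print(dangling)
```

Note that the space before a removed marker is kept, which is why the body text contains fragments such as `India .`.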

In [353]:
titles = [act_info[num]["title"] for num in act_info]

In [354]:
print(info)

ACT NO. 45 OF 18601 

CHAPTER I 

INTRODUCTION 

[6th October, 1860.] 

Preamble.—WHEREAS  it  is  expedient  to  provide  a  general  Penal  Code  for  India;  It  is 

enacted as follows:— 

1. Title and extent of operation of the Code.—This Act shall be called the Indian Penal Code, and 

shall extend to the whole of India . 

2.  Punishment  of  offences  committed  within  India.—Every  person  shall  be  liable  to  punishment 
under this Code and not otherwise for every act or omission contrary to the provisions thereof, of which he 
shall be guilty within India .  

3. Punishment of offences committed beyond, but which by law may be tried within, India.—Any 
person liable, by any Indian law, to be tried for an offence committed beyond India shall be dealt with 
according to the provisions of this Code for any act committed beyond  India in the same manner as if 
such act had been committed within India. 

4. Extension of Code to extra-territorial offences.—The provisions of thi

In [375]:
info

'ACT NO. 45 OF 18601 \n\nCHAPTER I \n\nINTRODUCTION \n\n[6th October, 1860.] \n\nPreamble.—WHEREAS  it  is  expedient  to  provide  a  general  Penal  Code  for  India;  It  is \n\nenacted as follows:— \n\n1. Title and extent of operation of the Code.—This Act shall be called the Indian Penal Code, and \n\nshall extend to the whole of India . \n\n2.  Punishment  of  offences  committed  within  India.—Every  person  shall  be  liable  to  punishment \nunder this Code and not otherwise for every act or omission contrary to the provisions thereof, of which he \nshall be guilty within India .  \n\n3. Punishment of offences committed beyond, but which by law may be tried within, India.—Any \nperson liable, by any Indian law, to be tried for an offence committed beyond India shall be dealt with \naccording to the provisions of this Code for any act committed beyond  India in the same manner as if \nsuch act had been committed within India. \n\n4. Extension of Code to extra-territorial offen

In [311]:
titles

['Title and extent of operation of the Code',
 'Punishment of offences committed within India',
 'Punishment of offences committed beyond, but which by law may be tried within, India',
 'Extension of Code to extra-territorial offences',
 'Certain laws not to be affected by this Act',
 'Definitions in the Code to be understood subject to exceptions',
 'Sense of expression once explained',
 'Gender',
 'Number',
 'Man. Woman',
 'Person',
 'Public',
 'Omitted',
 'Servant of Government',
 'Repealed',
 'Repealed',
 'Government',
 'India',
 'Judge',
 'Court of Justice',
 'Public servant',
 'Moveable property',
 'Wrongful gain',
 'Dishonestly',
 'Fraudulently',
 'Reason to believe',
 'Property in possession of wife, clerk or servant',
 'Counterfeit',
 'Document',
 'Electronic record',
 'Valuable security',
 'A will',
 'Words referring to acts include illegal omissions',
 'Act',
 'Acts done by several persons in furtherance of common intention',
 'When such an act is criminal by reason of its b

In [374]:
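# Pair every section with the title of the section that follows it
for num, t1, t2 in zip(act_info, titles, titles[1:]):
    print(num)
    # Escape the number and titles so characters such as "." are matched literally,
    # and use a non-greedy group so the match stops at the first occurrence of the next title
    r = re.compile(rf"{re.escape(num)}\.?\s*{re.escape(t1)}(.*?)(?={re.escape(t2)})", flags = re.I | re.DOTALL)
    for m in r.finditer(info):
        print(m.groups())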

1
2
3
('.—Any \nperson liable, by any Indian law, to be tried for an offence committed beyond India shall be dealt with \naccording to the provisions of this Code for any act committed beyond  India in the same manner as if \nsuch act had been committed within India. \n\n4. ',)
4
('.—The provisions of this Code apply also to any \n\noffence committed by— \n\n(1) any citizen of India in any place without and beyond India; \n(2) any person on any ship or aircraft registered in India wherever it may be. \n(3) any person in any place without and beyond India committing offence targeting a computer \n\nresource located in India. \n\nExplanation.—In this section— \n\nwould be punishable under this Code; \n\n \n\n(a) the word “offence” includes every act committed outside India which, if committed in India,  \n\n                                                           \n1. The Indian Penal Code has been extended to Berar by the Berar Laws Act, 1941 (4 of 1941) and has been declared in force
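
Sections 1 and 2 print no match above, most likely because the body text contains doubled spaces while the titles from the arrangement use single spaces. One possible way around this, sketched below under that assumption, is to let every gap between title words match any run of whitespace. The helpers `flexible` and `extract_section_texts` and the toy inputs are hypothetical and have not been run against the full Act.

```python
import re

def flexible(text):
    """Escape `text` for a regex while letting each gap match any whitespace run."""
    return r"\s+".join(re.escape(tok) for tok in text.split())

def heading(num, title):
    """Pattern for a section heading such as '2.  Punishment  of  offences'."""
    return rf"(?<!\w){re.escape(num)}\.\s*{flexible(title)}"

def extract_section_texts(info, act_info):
    """Return {section key: text between its heading and the next heading}."""
    keys = list(act_info)
    texts = {}
    for i, num in enumerate(keys):
        start = heading(num, act_info[num]["title"])
        if i + 1 < len(keys):
            nxt = keys[i + 1]
            end = rf"(?={heading(nxt, act_info[nxt]['title'])})"
        else:
            end = r"\Z"  # the last section runs to the end of the text
        m = re.search(start + r"(.*?)" + end, info, flags=re.I | re.DOTALL)
        if m:
            texts[num] = " ".join(m.group(1).split())
    return texts

toy_act_info = {
    "1": {"title": "Title and extent of operation of the Code"},
    "2": {"title": "Punishment of offences committed within India"},
}
toy_body = (
    "1. Title and extent of operation of the Code.-This Act shall be called the "
    "Indian Penal Code, and \n\nshall extend to the whole of India . \n\n"
    "2.  Punishment  of  offences  committed  within  India.-Every  person  shall  be  liable. \n"
)
toy_texts = extract_section_texts(toy_body, toy_act_info)
print(toy_texts)
```

Anchoring the end of each section on the next section's number as well as its title reduces the chance of stopping at a title that is merely quoted inside another section.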
