1. Textual Features:
    - `Bag-of-Words (BoW)`:
        - Break down the text into individual words.
        - Create a vocabulary of unique words.
        - Represent documents as vectors of word frequencies.
    - `TF-IDF`:
        - Weigh words based on their importance within a document and the entire corpus.
    - `Word Embeddings`:
        - Capture semantic and syntactic relationships between words.
        - Consider using pre-trained embeddings like Word2Vec or BERT.
        
2. Legal-Specific Features:
    - `Legal ACTS Code Extraction`:
        - While ACTS/IPC codes might not be directly applicable to this judgment, you could extract relevant sections of the Indian Income Tax Act and Bombay Municipal Act.
    - `Legal Citation Analysis`:
        - Analyze the frequency and types of citations to identify legal themes and arguments.
    - `Named Entity Recognition (NER)`:
        - Identify key entities like "municipal property tax," "urban immovable property tax," "Section 9(1)(iv)," etc.
    
3. Structural Features:
    - `Document Structure`: 
        - Analyze the structure of the judgment, including the introduction, arguments, and conclusion.
    - `Sentence Length and Complexity`: 
        - Calculate the average sentence length and lexical diversity.
    - `Part-of-Speech Tagging`:
        - Identify the parts of speech (nouns, verbs, adjectives, etc.) to understand the grammatical structure.
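
As a rough illustration of the legal-specific features above, statute section references such as "Section 9(1)(iv)" can be pulled out with a regular expression. This is only a minimal sketch: the pattern, the helper name `extract_section_refs`, and the sample sentence are assumptions made for illustration, and real judgments will likely need a broader pattern.

```python
import re

# Hypothetical pattern: the word "Section" (any case of the first letter),
# a section number with an optional capital-letter suffix, and any number of
# bracketed sub-clauses such as (1) or (iv).
SECTION_RE = re.compile(r'\b[Ss]ection\s+(\d+[A-Z]?(?:\s*\([0-9a-zA-Z]+\))*)')

def extract_section_refs(text):
    """Return the section identifiers cited in `text`, in order of appearance."""
    # Strip internal whitespace so "9 (1) (iv)" and "9(1)(iv)" compare equal
    return [re.sub(r'\s+', '', ref) for ref in SECTION_RE.findall(text)]

# Made-up sentence, used only to exercise the pattern
sample = "The deduction is claimed under Section 9(1)(iv) read with section 10 of the Act."
print(extract_section_refs(sample))  # ['9(1)(iv)', '10']
```

Counting these identifiers per judgment would give one simple citation-frequency feature.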

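The structural features listed above can be approximated with very little code. The sketch below computes average sentence length and a type-token ratio as one possible measure of lexical diversity. The function name `structural_stats` and the naive sentence splitter are assumptions; a splitter this simple will mis-split abbreviations such as "No." that are common in judgments.

```python
import re

def structural_stats(text):
    """Return (average sentence length in words, type-token ratio)."""
    # Naive split: whitespace that follows '.', '!' or '?'
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return 0.0, 0.0
    avg_len = len(words) / len(sentences)
    # Type-token ratio: distinct words divided by total words
    ttr = len(set(words)) / len(words)
    return avg_len, ttr

# Made-up text, used only for illustration
avg_len, ttr = structural_stats(
    "The appeal is allowed. The order of the High Court is set aside."
)
print(avg_len, round(ttr, 3))
```
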
In [30]:
import numpy as np
import pandas as pd
import re

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer

import spacy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Load the spaCy NER model
nlp = spacy.load("en_core_web_sm")

# Download the word list (if not already downloaded) before it is used
nltk.download('words')

# Create a set of English words for quick lookup
english_words = set(words.words())

# Download the remaining NLTK data (if not already downloaded)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\VICTUS\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\VICTUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\VICTUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\VICTUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Preprocessing Raw Data


In [8]:
# Import the data set
df = pd.read_csv('../dataset/judgement_text.csv')
df.head()


Unnamed: 0,Text
0,Appeal No. LXVI of 1949.\nAppeal from the High...
1,XXIX of 1950.\nApplication under article 32 of...
2,XXXVII of 1950.\nApplication under article 32 ...
3,No. XVI of 1950.\nAppli cation under article 3...
4,Civil Appeal No. 8 of 1951.\nAppeal from the j...


In [9]:
# Function that performs basic cleaning, tokenization, stop-word removal, and lemmatization on the text data.
# Takes a long time to run on the full dataset (7,000+ documents).
def preprocess_text(text):
    """
    Preprocesses text data for NER tasks in legal documents.

    Args:
        text (str): The text to be preprocessed.

    Returns:
        list: A list of preprocessed tokens.
    """

    # Lowercase conversion
    text = text.lower()

    # Remove extra whitespace characters
    text = re.sub(r'\s+', ' ', text)

    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return tokens

# Apply preprocessing to the text column
df['text'] = df['Text'].apply(preprocess_text)

df.head()

Unnamed: 0,Text,text
0,Appeal No. LXVI of 1949.\nAppeal from the High...,"[appeal, ., lxvi, 1949., appeal, high, court, ..."
1,XXIX of 1950.\nApplication under article 32 of...,"[xxix, 1950., application, article, 32, consti..."
2,XXXVII of 1950.\nApplication under article 32 ...,"[xxxvii, 1950., application, article, 32, cons..."
3,No. XVI of 1950.\nAppli cation under article 3...,"[., xvi, 1950., appli, cation, article, 32, co..."
4,Civil Appeal No. 8 of 1951.\nAppeal from the j...,"[civil, appeal, ., 8, 1951., appeal, judgment,..."


In [17]:
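# Inspect the token list of one document
df['text'][1]

# Join each token list back into a single space-separated string
# (the tokens are already lowercased by preprocess_text)
df['c_text'] = df['text'].apply(lambda x: ' '.join(x))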



In [18]:
df

Unnamed: 0,Text,text,c_text
0,Appeal No. LXVI of 1949.\nAppeal from the High...,"[appeal, ., lxvi, 1949., appeal, high, court, ...",appeal . lxvi 1949. appeal high court judicatu...
1,XXIX of 1950.\nApplication under article 32 of...,"[xxix, 1950., application, article, 32, consti...",xxix 1950. application article 32 constitution...
2,XXXVII of 1950.\nApplication under article 32 ...,"[xxxvii, 1950., application, article, 32, cons...",xxxvii 1950. application article 32 constituti...
3,No. XVI of 1950.\nAppli cation under article 3...,"[., xvi, 1950., appli, cation, article, 32, co...",. xvi 1950. appli cation article 32 constituti...
4,Civil Appeal No. 8 of 1951.\nAppeal from the j...,"[civil, appeal, ., 8, 1951., appeal, judgment,...",civil appeal . 8 1951. appeal judgment decree ...
...,...,...,...
7125,Appeal No. 1690 of 1993.\nFrom the Judgment an...,"[appeal, ., 1690, 1993., judgment, order, date...",appeal . 1690 1993. judgment order dated 14.2....
7126,Appeal Nos.\n2919 20 of 1981.\nFrom the Judgme...,"[appeal, no, ., 2919, 20, 1981., judgment, ord...",appeal no . 2919 20 1981. judgment order dated...
7127,Appeal No. 1695 of 1993.\nFrom the Judgment an...,"[appeal, ., 1695, 1993., judgment, order, date...",appeal . 1695 1993. judgment order dated 5.4.1...
7128,Appeal No. 228 (NT) of 1987.\nFrom the Judgmen...,"[appeal, ., 228, (, nt, ), 1987., judgment, or...",appeal . 228 ( nt ) 1987. judgment order dated...


### Bag-of-Words (BoW):
The **Bag-of-Words (BoW)** model is a fundamental technique in Natural Language Processing (NLP) for representing text data. It breaks text into individual words (tokens), disregards grammar, word order, and semantics, and `focuses purely on the presence or frequency of words` in each document. The result is a document-term matrix with one row vector per document, holding the `count of each word from a fixed vocabulary across the documents`.
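
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix on two made-up documents. It illustrates the representation only and is not the scikit-learn implementation; the helper name `bag_of_words` and the toy documents are assumptions.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (sorted vocabulary, one count vector per document)."""
    tokenized = [doc.split() for doc in docs]
    # Fixed vocabulary: every distinct token in the corpus, sorted
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["appeal high court", "appeal allowed appeal"])
print(vocab)    # ['allowed', 'appeal', 'court', 'high']
print(vectors)  # [[0, 1, 1, 1], [1, 2, 0, 0]]
```

Word order is lost: "appeal allowed appeal" and "allowed appeal appeal" would produce the same vector.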

In [28]:
# implementation

# Custom filtering reduced the vocabulary from about 110,000 columns to about 98,000.
# Intended filters: numbers, special characters, Roman numerals, and person names found by NER.
def custom_tokenizer(text):
    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove special characters (keep only letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Drop tokens that start with a non-letter. After the substitution above
    # only letters remain, so this is a safeguard that normally removes nothing.
    tokens = text.split()  # not named `words`, to avoid shadowing nltk.corpus.words
    filtered_words = [word for word in tokens if not re.match(r'^[^a-zA-Z]', word)]

    # Remove Roman numerals (e.g., I, II, III, IV, V, ...). The pattern is
    # uppercase-only, so tokens in already lowercased input (such as "lxvi") pass through.
    filtered_words = [word for word in filtered_words if not re.match(r'^[MDCLXVI]+$', word)]

    text = ' '.join(filtered_words)

    # Run spaCy NER and collect PERSON entities as lowercased entity strings
    doc = nlp(text)
    named_entities = {ent.text.lower() for ent in doc.ents if ent.label_ == "PERSON"}

    # Keep a token if it is an English dictionary word, or if it does not
    # exactly match a PERSON entity string. Multi-word names are stored as
    # whole strings, so their individual tokens are not removed here.
    tokens = text.split()
    filtered_tokens = [
        word for word in tokens
        if word.lower() in english_words or word.lower() not in named_entities
    ]

    return filtered_tokens

# Initialize CountVectorizer with the custom tokenizer
vectorizer = CountVectorizer(tokenizer=custom_tokenizer)

# fit_transform the text data to BoW representation
X = vectorizer.fit_transform(df['c_text'])

# Convert the result to a DataFrame for better visualization
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Display the Bag-of-Words representation
print(bow_df)




      a  aa  aaa  aaaa  aaainst  aabdual  aac  aaccorded  aachen  aacwas  ...  \
0     0   0    0     0        0        0    0          0       0       0  ...   
1     4   0    0     0        0        0    0          0       0       0  ...   
2     0   0    0     0        0        0    0          0       0       0  ...   
3     1   0    0     0        0        0    0          0       0       0  ...   
4     0   0    0     0        0        0    0          0       0       0  ...   
...  ..  ..  ...   ...      ...      ...  ...        ...     ...     ...  ...   
7125  0   0    0     0        0        0    0          0       0       0  ...   
7126  0   0    0     0        0        0    0          0       0       0  ...   
7127  8   0    0     0        0        0    0          0       0       0  ...   
7128  0   0    0     0        0        0    0          0       0       0  ...   
7129  1   0    0     0        0        0    0          0       0       0  ...   

      zuluete  zunzuwada  z
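
The Roman-numeral step in `custom_tokenizer` matches uppercase letters only, while `c_text` is already lowercased, so tokens such as `lxvi` and `xxxvii` stay in the vocabulary. Simply making that pattern case-insensitive would also drop ordinary words made of the same letters, for example "civil". One possible alternative, sketched below under the assumption that only well-formed numerals from 1 to 3999 should be removed, is a stricter pattern. The names `ROMAN_RE`, `is_roman_numeral`, and `drop_roman_numerals` are hypothetical and not part of the notebook.

```python
import re

# Strict, case-insensitive pattern for well-formed Roman numerals (1 to 3999):
# thousands, hundreds, tens, ones.
ROMAN_RE = re.compile(
    r'm{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})',
    re.IGNORECASE,
)

def is_roman_numeral(token):
    # The pattern can match the empty string, so require a non-empty token
    return bool(token) and ROMAN_RE.fullmatch(token) is not None

def drop_roman_numerals(tokens):
    """Return the tokens that are not well-formed Roman numerals."""
    return [tok for tok in tokens if not is_roman_numeral(tok)]

print(drop_roman_numerals(["appeal", "lxvi", "civil", "xxxvii", "court", "mix"]))
# ['appeal', 'civil', 'court']
```

Note that "mix" is dropped because it is also a valid numeral (1009). If such words matter, they could be protected with a dictionary lookup like the `english_words` check used in the tokenizer.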

In [36]:
# Create a separate column of tokens produced by the custom tokenizer

df['filtered_text'] = df['c_text'].apply(custom_tokenizer)

In [40]:
df['c_filtered_text'] = df['filtered_text'].apply(lambda x: ' '.join(x))

df

Unnamed: 0,Text,text,c_text,filtered_text,c_filtered_text
0,Appeal No. LXVI of 1949.\nAppeal from the High...,"[appeal, ., lxvi, 1949., appeal, high, court, ...",appeal . lxvi 1949. appeal high court judicatu...,"[appeal, lxvi, appeal, high, court, judicature...",appeal lxvi appeal high court judicature refer...
1,XXIX of 1950.\nApplication under article 32 of...,"[xxix, 1950., application, article, 32, consti...",xxix 1950. application article 32 constitution...,"[application, article, constitution, india, wr...",application article constitution india writ ce...
2,XXXVII of 1950.\nApplication under article 32 ...,"[xxxvii, 1950., application, article, 32, cons...",xxxvii 1950. application article 32 constituti...,"[xxxvii, application, article, constitution, i...",xxxvii application article constitution india ...
3,No. XVI of 1950.\nAppli cation under article 3...,"[., xvi, 1950., appli, cation, article, 32, co...",. xvi 1950. appli cation article 32 constituti...,"[xvi, appli, cation, article, constitution, wr...",xvi appli cation article constitution writ pro...
4,Civil Appeal No. 8 of 1951.\nAppeal from the j...,"[civil, appeal, ., 8, 1951., appeal, judgment,...",civil appeal . 8 1951. appeal judgment decree ...,"[civil, appeal, appeal, judgment, decree, date...",civil appeal appeal judgment decree dated th o...
...,...,...,...,...,...
7125,Appeal No. 1690 of 1993.\nFrom the Judgment an...,"[appeal, ., 1690, 1993., judgment, order, date...",appeal . 1690 1993. judgment order dated 14.2....,"[appeal, judgment, order, dated, central, admi...",appeal judgment order dated central administra...
7126,Appeal Nos.\n2919 20 of 1981.\nFrom the Judgme...,"[appeal, no, ., 2919, 20, 1981., judgment, ord...",appeal no . 2919 20 1981. judgment order dated...,"[appeal, no, judgment, order, dated, calcutta,...",appeal no judgment order dated calcutta high c...
7127,Appeal No. 1695 of 1993.\nFrom the Judgment an...,"[appeal, ., 1695, 1993., judgment, order, date...",appeal . 1695 1993. judgment order dated 5.4.1...,"[appeal, judgment, order, dated, bombay, high,...",appeal judgment order dated bombay high court ...
7128,Appeal No. 228 (NT) of 1987.\nFrom the Judgmen...,"[appeal, ., 228, (, nt, ), 1987., judgment, or...",appeal . 228 ( nt ) 1987. judgment order dated...,"[appeal, nt, judgment, order, dated, high, cou...",appeal nt judgment order dated high court strp...


### TF-IDF: Definition, Example, and Choice Rationale

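TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is composed of two components: term frequency (TF), which measures how often a word occurs in a document, and inverse document frequency (IDF), which down-weights words that occur in many documents of the corpus.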
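
The two components, term frequency (TF) and inverse document frequency (IDF), can be worked through by hand on a toy corpus. The sketch below is a pure-Python illustration, not the scikit-learn code. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which is the variant scikit-learn documents as the `TfidfVectorizer` default. The function name `tfidf` and the two documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for whitespace-tokenised docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts in this document
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(["appeal court", "appeal tax"])
print(vocab)
print([round(v, 3) for v in rows[0]])
```

In the first document "court" receives a higher weight than "appeal", because "appeal" occurs in every document and is therefore less distinctive.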

In [41]:
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents to TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(df['c_filtered_text'])

# Convert the result to a DataFrame for better visualization
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display the TF-IDF representation
print(tfidf_df)

       aa  aaa  aaaa  aaainst  aabdual  aac  aaccorded  aachen  aacwas  aad  \
0     0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   
1     0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   
2     0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   
3     0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   
4     0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   
...   ...  ...   ...      ...      ...  ...        ...     ...     ...  ...   
7125  0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   
7126  0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   
7127  0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   
7128  0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   
7129  0.0  0.0   0.0      0.0      0.0  0.0        0.0     0.0     0.0  0.0   

      ...  zuluete  zunzuwada  zure  zurich  zuripe