## Code for analysis of Legal System Dataset
This notebook is an analysis of legal system data from the **Bombay High Court (BHC)** and the **National Company Law Tribunal (NCLT) - Mumbai**.

#### **Purpose**
The primary goal is to provide insights into case progression, hearing frequency, and the time it takes for matters to reach their first hearing or be disposed of.

#### **Key Analysis Performed**
The notebook runs the same set of analyses first on the BHC data and then on the NCLT data:
- **Case and Hearing Counts**: Provides an overview of the number of unique cases and total hearings per case category (e.g., "Commercial Suits," "Suits").
- **Median Hearings to Disposal**: Calculates the median number of hearings required for a case to be resolved, broken down by case type.
- **Time to First Hearing**: Analyzes the time from a case's filing to its first hearing at the "main_matter_id" level (the `main_matter_filing_no` column in the code), which represents a case family. This is visualized using a **Kaplan-Meier** curve, plotted as the cumulative share of cases that have had a first hearing.
- **Time to Disposal**: Estimates the share of cases disposed of over time since filing, again using a **Kaplan-Meier** curve.

To replicate or extend this analysis, ensure you have the correct data files available and run the notebook cells in sequential order.
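
As a rough illustration of what the Kaplan-Meier step computes, the sketch below implements the estimator by hand in plain Python on made-up durations. The function name `kaplan_meier_cumulative` and the toy data are assumptions for illustration only; the notebook itself relies on `lifelines.KaplanMeierFitter`. Cases without an event (censored cases) still count as "at risk" until their last known time, so the estimate can differ from the naive fraction of cases with an event.

```python
def kaplan_meier_cumulative(durations, observed):
    """Return [(t, 1 - S(t))] at each distinct event time (hypothetical helper)."""
    survival = 1.0
    curve = []
    event_times = sorted({d for d, e in zip(durations, observed) if e})
    for t in event_times:
        # Everyone whose duration is at least t is still at risk just before t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if e and d == t)
        survival *= 1 - events / at_risk
        curve.append((t, 1 - survival))
    return curve


# Toy data: months since filing; False means censored (no event seen yet).
durations = [1, 2, 2, 3, 4, 6, 6]
observed = [True, True, False, True, False, False, False]

curve = kaplan_meier_cumulative(durations, observed)
naive_share = sum(observed) / len(observed)

for t, share in curve:
    print(f"month {t}: {share:.1%} estimated to have had the event")
print(f"naive share with an event: {naive_share:.1%}")
```

On this toy input the Kaplan-Meier estimate at the last event time is 13/28 (about 46%), while the naive share is 3/7 (about 43%), because the case censored at month 2 leaves the risk set without counting as a failure to get a hearing.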

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from lifelines import KaplanMeierFitter
import warnings
import zipfile
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
warnings.filterwarnings('ignore')

In [2]:
# Read from xkdr_bhcDataset.zip
with zipfile.ZipFile("data/xkdr_bhcDataset.zip") as bhc_zip:
    with bhc_zip.open("xkdr_bhc_matters.csv") as f:
        bhc_matters = pd.read_csv(f)
    with bhc_zip.open("xkdr_bhc_hearings.csv") as f:
        bhc_hearings = pd.read_csv(f)

# Read from xkdr_ncltmDataset.zip
with zipfile.ZipFile("data/xkdr_ncltmDataset.zip") as ncltm_zip:
    with ncltm_zip.open("xkdr_ncltm_matters.csv") as f:
        nclt_matters = pd.read_csv(f)
    with ncltm_zip.open("xkdr_ncltm_hearings.csv") as f:
        nclt_hearings = pd.read_csv(f)

# Bombay High Court

In [3]:
# Description: Count unique filings per case_category (number of cases) and number of hearings for each category. 
bhc_case_counts = (
    bhc_matters.groupby("case_category")["filing_no"].nunique().reset_index()
)
bhc_case_counts.columns = ["case_category", "no_of_cases"]

bhc_hearing_counts = (
    bhc_hearings.groupby("case_category").size().reset_index(name="no_of_hearings")
)

bhc_summary = (
    bhc_case_counts.merge(bhc_hearing_counts, on="case_category", how="outer")
    .fillna(0)
)

bhc_summary["case_category"] = pd.Categorical(bhc_summary["case_category"])
bhc_summary = bhc_summary.sort_values("case_category")
bhc_summary["no_of_cases"] = bhc_summary["no_of_cases"].astype(int)
bhc_summary["no_of_hearings"] = bhc_summary["no_of_hearings"].astype(int)

print(bhc_summary)


      case_category  no_of_cases  no_of_hearings
0  Commercial Suits         2123            6874
1             Suits         3419           12475
2     Summary Suits          111             431


#### Median number of hearings to disposal by case category

In [4]:
merged_df = bhc_matters.merge(
    bhc_hearings,
    on=['filing_no', 'case_category', 'court_name']
)

merged_df = merged_df[merged_df['case_status'] == 'Disposed'] # To count number of hearings to disposal

hearing_counts = merged_df.groupby(['case_category', 'main_matter_filing_no'])['hearing_date'].nunique()
median_hearings_by_category = hearing_counts.groupby('case_category').median()
print(median_hearings_by_category)


case_category
Commercial Suits    3.0
Suits               2.0
Summary Suits       4.5
Name: hearing_date, dtype: float64
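
The medians above come from counting distinct hearing dates per case family and then taking the median of those counts within each category. The plain-Python sketch below reproduces that logic on made-up rows; the family labels and dates are hypothetical. It also shows why a median such as 4.5 can appear when a category has an even number of families.

```python
from collections import defaultdict
from statistics import median

# (case_category, case_family, hearing_date) rows for disposed cases; toy data.
rows = [
    ("Suits", "A", "2022-01-05"),
    ("Suits", "A", "2022-01-05"),  # same date listed twice counts once
    ("Suits", "A", "2022-02-10"),
    ("Suits", "B", "2022-03-01"),
    ("Summary Suits", "C", "2022-01-03"),
    ("Summary Suits", "C", "2022-02-03"),
    ("Summary Suits", "C", "2022-03-03"),
    ("Summary Suits", "C", "2022-04-03"),
    ("Summary Suits", "D", "2022-01-04"),
    ("Summary Suits", "D", "2022-02-04"),
    ("Summary Suits", "D", "2022-03-04"),
    ("Summary Suits", "D", "2022-04-04"),
    ("Summary Suits", "D", "2022-05-04"),
]

# Step 1: distinct hearing dates per (category, family).
dates_by_family = defaultdict(set)
for category, family, hearing_date in rows:
    dates_by_family[(category, family)].add(hearing_date)

# Step 2: collect the per-family counts under each category.
counts_by_category = defaultdict(list)
for (category, _family), dates in dates_by_family.items():
    counts_by_category[category].append(len(dates))

# Step 3: median of the per-family counts.
median_by_category = {c: median(v) for c, v in counts_by_category.items()}
print(median_by_category)
```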


#### Time to First Hearing within 6 months

The analysis calculates the time to first hearing at the level of main_matter_id (the `main_matter_filing_no` column in the code) rather than individual matter_ids (the `filing_no` column), reflecting the concept of case families. A main_matter_id represents a family or root case, grouping related sub-cases or matters under a single entity. The code takes the earliest hearing date across all matters within a case family, which shows when the family first came before the court. A first hearing counts as an event only if it falls within 6 months of filing; every other family is treated as censored at its last update date, capped at 6 months.
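
The family-level aggregation described above can be sketched in plain Python as follows. The records are made up, and the choice to take the filing date from the first matter listed in each family mirrors the `"first"` aggregation used in the notebook's code; treat this as an illustration under those assumptions rather than the notebook's implementation.

```python
from datetime import date

# Toy matters: F1 and F2 belong to the family rooted at F1; F3 stands alone.
matters = [
    {"filing_no": "F1", "main_matter_filing_no": "F1", "filing_date": date(2021, 1, 10)},
    {"filing_no": "F2", "main_matter_filing_no": "F1", "filing_date": date(2021, 2, 1)},
    {"filing_no": "F3", "main_matter_filing_no": "F3", "filing_date": date(2021, 3, 5)},
]
hearings = [
    ("F1", date(2021, 4, 1)),
    ("F2", date(2021, 3, 15)),
    ("F2", date(2021, 5, 1)),
]

# Earliest hearing per individual matter.
first_by_filing = {}
for filing_no, hearing_date in hearings:
    previous = first_by_filing.get(filing_no)
    if previous is None or hearing_date < previous:
        first_by_filing[filing_no] = hearing_date

# Roll matters up to their family: first filing date seen, earliest hearing overall.
families = {}
for matter in matters:
    family = families.setdefault(
        matter["main_matter_filing_no"],
        {"filing_date": matter["filing_date"], "first_hearing_date": None},
    )
    hearing = first_by_filing.get(matter["filing_no"])
    if hearing is not None and (
        family["first_hearing_date"] is None or hearing < family["first_hearing_date"]
    ):
        family["first_hearing_date"] = hearing

# Days from filing to first hearing; None when the family has no hearing yet.
days_to_first_hearing = {
    key: (f["first_hearing_date"] - f["filing_date"]).days
    if f["first_hearing_date"] is not None
    else None
    for key, f in families.items()
}
print(days_to_first_hearing)
```

Here the family rooted at F1 gets its first hearing from sub-matter F2, which was heard earlier than the root matter itself, and F3 has no hearing and would be treated as censored.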

In [6]:
bhc_hearings["hearing_date"] = pd.to_datetime(bhc_hearings["hearing_date"], errors="coerce")
bhc_matters["filing_date"] = pd.to_datetime(bhc_matters["filing_date"], errors="coerce")
bhc_matters["updated_on"] = pd.to_datetime(bhc_matters["updated_on"], errors="coerce")

first_hearings_temp = (
    bhc_hearings.groupby("filing_no")["hearing_date"].min().reset_index()
    .rename(columns={"hearing_date": "first_hearing_date"})
)

first_hearings = (
    bhc_matters[["filing_no", "main_matter_filing_no", "filing_date", "case_category", "updated_on"]]
    .merge(first_hearings_temp, on="filing_no", how="left")
    .groupby("main_matter_filing_no")
    .agg({
        "first_hearing_date": "min",
        "filing_date": "first",
        "case_category": "first",
        "updated_on": "max"
    })
    .reset_index()
)


first_hearings["event"] = False
first_hearings["time_to_hearing"] = np.nan  # float column of months; pd.NaT would create a datetime column


six_months = 6  
time_diff = (first_hearings["first_hearing_date"] - first_hearings["filing_date"]).dt.days / 30.44

valid_hearing_mask = (
    first_hearings["first_hearing_date"].notna() &
    first_hearings["filing_date"].notna()
)

event_mask = (
    valid_hearing_mask &
    (time_diff >= 0) &  
    (time_diff <= six_months)
)

if event_mask.any():
    first_hearings.loc[event_mask, "event"] = True
    first_hearings.loc[event_mask, "time_to_hearing"] = (
        (first_hearings.loc[event_mask, "first_hearing_date"] - first_hearings.loc[event_mask, "filing_date"]).dt.days / 30.44
    )

censored_mask = (
    ~first_hearings["event"] &
    first_hearings["filing_date"].notna() &
    first_hearings["updated_on"].notna()
)
if censored_mask.any():
    first_hearings.loc[censored_mask, "time_to_hearing"] = (
        (first_hearings.loc[censored_mask, "updated_on"] - first_hearings.loc[censored_mask, "filing_date"]).dt.days / 30.44
    ).clip(upper=six_months)


km_data = first_hearings.dropna(subset=["time_to_hearing", "event", "case_category"])
fig = go.Figure()
kmf = KaplanMeierFitter()

for category, group in km_data.groupby("case_category"):
    if len(group) > 0:
        kmf.fit(group["time_to_hearing"], event_observed=group["event"], label=category)
        time = kmf.survival_function_.index
        prob_hearing = 1 - kmf.survival_function_[category]
        fig.add_trace(
            go.Scatter(
                x=time,
                y=prob_hearing,
                mode="lines",
                name=category,
                line=dict(width=2),
                showlegend=True,
                hovertemplate=f"<b>{category}</b><br>Months: %{{x:.1f}}<br>% of cases that got a first hearing: %{{y:.2%}}<extra></extra>"
            )
        )

fig.update_layout(
    xaxis_title="Months from Filing",
    yaxis_title="% of cases",
    xaxis=dict(range=[0, 6], gridcolor="rgba(0,0,0,0.3)", showgrid=True, zeroline=False),
    yaxis=dict(gridcolor="rgba(0,0,0,0.3)", showgrid=True, zeroline=False, tickformat=".0%"),
    legend=dict(
        orientation="h",
        y=-0.2,
        x=0.5,
        xanchor="center",
        yanchor="top"
    ),
    template="simple_white",
    hovermode="closest", 
    showlegend=True,
    margin=dict(l=0, r=0, t=40, b=0),  
    width=None,
    height=500
)

# fig.write_html(
#     "images/bhc_time_to_first_hearing.html",
#     config={"displaylogo": False, "modeBarButtonsToRemove": ["toImage"]},
#     include_plotlyjs="cdn"
# )

fig.show()


print("\nSummary statistics by case category:")
for category, group in km_data.groupby("case_category"):
    hearing_count = group["event"].sum()
    total_count = len(group)
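    # Median of the recorded durations, including censored ones capped at 6 months (not the Kaplan-Meier median).
    median_duration = group["time_to_hearing"].median()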
    print(f"{category}: {int(hearing_count)}/{total_count} with first hearing, "
          f"median time to first hearing: {median_duration:.2f} months")

print("\n% of cases that got a first hearing within 6 months of filing:")
for category, group in km_data.groupby("case_category"):
    if len(group) > 0:
        kmf.fit(group["time_to_hearing"], event_observed=group["event"], label=category)
        try:
            prob_at_6_months = kmf.cumulative_density_at_times([6]).iloc[0]
            print(f"{category}: {prob_at_6_months:.1%}")
        except Exception as e:
            print(f"{category}: Could not calculate probability at exactly 6 months ({e})")



Summary statistics by case category:
Commercial Suits: 535/933 with first hearing, median time to first hearing: 3.42 months
Suits: 725/1426 with first hearing, median time to first hearing: 4.22 months
Summary Suits: 20/49 with first hearing, median time to first hearing: 6.00 months

% of cases that got a first hearing within 6 months of filing:
Commercial Suits: 57.5%
Suits: 51.1%
Summary Suits: 40.9%


#### Time to Disposal

In [7]:
bhc_matters = bhc_matters.copy()
bhc_matters["disposal_date"] = pd.to_datetime(bhc_matters["disposal_date"], errors="coerce")
bhc_matters["updated_on"] = pd.to_datetime(bhc_matters["updated_on"], errors="coerce")
bhc_matters["filing_date"] = pd.to_datetime(bhc_matters["filing_date"], errors="coerce")

bhc_matters["event"] = bhc_matters["case_status"] == "Disposed"
bhc_matters["duration"] = np.nan  # numeric column of day counts; pd.NaT would create a datetime column

disposed_mask = (
    bhc_matters["event"] &
    bhc_matters["disposal_date"].notna() &
    bhc_matters["filing_date"].notna()
)
if disposed_mask.any():
    bhc_matters.loc[disposed_mask, "duration"] = (
        bhc_matters.loc[disposed_mask, "disposal_date"] - bhc_matters.loc[disposed_mask, "filing_date"]
    ).dt.days

non_disposed_mask = (
    ~bhc_matters["event"] &
    bhc_matters["updated_on"].notna() &
    bhc_matters["filing_date"].notna()
)
if non_disposed_mask.any():
    bhc_matters.loc[non_disposed_mask, "duration"] = (
        bhc_matters.loc[non_disposed_mask, "updated_on"] - bhc_matters.loc[non_disposed_mask, "filing_date"]
    ).dt.days

bhc_matters["duration_years"] = pd.to_numeric(bhc_matters["duration"], errors='coerce') / 365.25
bhc_matters["duration_years"] = bhc_matters["duration_years"].clip(upper=3.5)

km_data = bhc_matters.dropna(subset=["duration_years", "event", "case_category"])

kmf = KaplanMeierFitter()
fig = go.Figure()

for category, group in km_data.groupby("case_category"):
    if len(group) > 0:
        kmf.fit(group["duration_years"], event_observed=group["event"], label=category)
        x_vals = kmf.survival_function_.index
        y_vals = 1 - kmf.survival_function_[category] 

        fig.add_trace(go.Scatter(
            x=x_vals,
            y=y_vals,
            mode='lines',
            name=category,
            hovertemplate=f"<b>{category}</b><br>Years: %{{x:.2f}}<br>% of cases disposed: %{{y:.2%}}<extra></extra>"
        ))

fig.update_layout(
    xaxis_title="Years from Filing",
    yaxis_title="% of cases",
    xaxis=dict(range=[0, 3.25], showgrid=True, zeroline=True),
    yaxis=dict(tickformat=".0%", showgrid=True, zeroline=False),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=-0.2,
        xanchor="center",
        x=0.5
    ),
    template="simple_white",
    width=None,
    height=500,
    margin=dict(l=0, r=0, t=10, b=0)  
)

# fig.write_html("images/bhc_time_to_disposal.html", config={"displaylogo": False, "modeBarButtonsToRemove": ["toImage"]})

fig.show()

print("\nSummary statistics by case category:")
for category, group in km_data.groupby("case_category"):
    disposed_count = group["event"].sum()
    total_count = len(group)
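    # Median of the recorded durations, including censored ones capped at 3.5 years (not the Kaplan-Meier median).
    median_duration = group["duration_years"].median()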
    print(f"{category}: {disposed_count}/{total_count} disposed, median duration: {median_duration:.2f} years")

print("\nProbability of disposal within 3 years of filing:")
for category, group in km_data.groupby("case_category"):
    if len(group) > 0:
        kmf.fit(group["duration_years"], event_observed=group["event"], label=category)
        try:
            prob_at_3_year = kmf.cumulative_density_at_times([3.0]).iloc[0]
            print(f"{category}: {prob_at_3_year:.1%}")
        except Exception as e:
            print(f"{category}: Could not calculate probability at 3 years ({e})")



Summary statistics by case category:
Commercial Suits: 874/2120 disposed, median duration: 1.30 years
Suits: 1221/3417 disposed, median duration: 1.12 years
Summary Suits: 61/111 disposed, median duration: 0.97 years

Probability of disposal within 3 years of filing:
Commercial Suits: 53.1%
Suits: 49.5%
Summary Suits: 69.2%


# NCLT

In [8]:
# Description: Count unique filings per case_category (number of cases) and number of hearings for each category. 
nclt_case_counts = (
    nclt_matters.groupby("case_category")["filing_no"].nunique().reset_index()
)
nclt_case_counts.columns = ["case_category", "no_of_cases"]

nclt_hearing_counts = (
    nclt_hearings.groupby("case_category").size().reset_index(name="no_of_hearings")
)

nclt_summary = (
    nclt_case_counts.merge(nclt_hearing_counts, on="case_category", how="outer")
    .fillna(0)
)

nclt_summary["case_category"] = pd.Categorical(nclt_summary["case_category"])
nclt_summary = nclt_summary.sort_values("case_category")
nclt_summary["no_of_cases"] = nclt_summary["no_of_cases"].astype(int)
nclt_summary["no_of_hearings"] = nclt_summary["no_of_hearings"].astype(int)

print(nclt_summary)


  case_category  no_of_cases  no_of_hearings
0           IBC         7346           47353


#### Median number of hearings to disposal

In [None]:
merged_df_nclt = nclt_matters.merge(
    nclt_hearings,
    on=['filing_no', 'case_category', 'court_name']
)
merged_df_nclt = merged_df_nclt[merged_df_nclt['case_status'].isin(['Disposed', 'Dispose'])]

hearing_counts_nclt = merged_df_nclt.groupby(['case_category', 'main_matter_filing_no'])['hearing_date'].nunique()

median_hearings_by_category = hearing_counts_nclt.groupby('case_category').median()
print(median_hearings_by_category)

case_category
IBC    8.0
Name: hearing_date, dtype: float64


#### Time to first hearing

In [9]:
nclt_hearings["hearing_date"] = pd.to_datetime(nclt_hearings["hearing_date"], errors="coerce")
nclt_matters["filing_date"] = pd.to_datetime(nclt_matters["filing_date"], errors="coerce")
nclt_matters["updated_on"] = pd.to_datetime(nclt_matters["updated_on"], errors="coerce")

first_hearings_temp_initial = (
    nclt_hearings.groupby("filing_no")["hearing_date"].min().reset_index()
    .rename(columns={"hearing_date": "first_hearing_date"})
)

initial_merge = (
    nclt_matters[["filing_no", "main_matter_filing_no", "filing_date", "case_category", "updated_on"]]
    .merge(first_hearings_temp_initial, on="filing_no", how="left")
)

problematic_cases = initial_merge[
    (initial_merge["first_hearing_date"] < initial_merge["filing_date"]) &
    initial_merge["first_hearing_date"].notna() &
    initial_merge["filing_date"].notna()
]

if len(problematic_cases) > 0:
    problematic_filing_nos = problematic_cases["filing_no"].tolist()
    nclt_hearings_clean = nclt_hearings[~nclt_hearings["filing_no"].isin(problematic_filing_nos)]
else:
    print("No problematic cases found - proceeding with original data")
    nclt_hearings_clean = nclt_hearings

first_hearings_temp_nclt = (
    nclt_hearings_clean.groupby("filing_no")["hearing_date"].min().reset_index()
    .rename(columns={"hearing_date": "first_hearing_date"})
)

first_hearings_nclt = (
    nclt_matters[["filing_no", "main_matter_filing_no", "filing_date", "case_category", "updated_on"]]
    .merge(first_hearings_temp_nclt, on="filing_no", how="left")
    .groupby("main_matter_filing_no")
    .agg({
        "first_hearing_date": "min",
        "filing_date": "first",
        "case_category": "first",
        "updated_on": "max"
    })
    .reset_index()
)

first_hearings_nclt["event"] = False
first_hearings_nclt["time_to_hearing"] = np.nan  # float column of months; pd.NaT would create a datetime column

six_months = 6  

event_mask_nclt = (
    first_hearings_nclt["first_hearing_date"].notna() &
    first_hearings_nclt["filing_date"].notna() &
    ((first_hearings_nclt["first_hearing_date"] - first_hearings_nclt["filing_date"]).dt.days / 30.44 <= six_months)
)
if event_mask_nclt.any():
    first_hearings_nclt.loc[event_mask_nclt, "event"] = True
    first_hearings_nclt.loc[event_mask_nclt, "time_to_hearing"] = (
        (first_hearings_nclt.loc[event_mask_nclt, "first_hearing_date"] - first_hearings_nclt.loc[event_mask_nclt, "filing_date"]).dt.days / 30.44
    )

censored_mask_nclt = (
    ~first_hearings_nclt["event"] &
    first_hearings_nclt["filing_date"].notna() &
    first_hearings_nclt["updated_on"].notna()
)
if censored_mask_nclt.any():
    first_hearings_nclt.loc[censored_mask_nclt, "time_to_hearing"] = (
        (first_hearings_nclt.loc[censored_mask_nclt, "updated_on"] - first_hearings_nclt.loc[censored_mask_nclt, "filing_date"]).dt.days / 30.44
    ).clip(upper=six_months)

km_data_nclt = first_hearings_nclt.dropna(subset=["time_to_hearing", "event", "case_category"])
negative_times = km_data_nclt["time_to_hearing"] < 0
if negative_times.any():
    km_data_nclt = km_data_nclt[km_data_nclt["time_to_hearing"] >= 0]

kmf = KaplanMeierFitter()
fig = go.Figure()

for category, group in km_data_nclt.groupby("case_category"):
    if len(group) > 0:
        kmf.fit(group["time_to_hearing"], event_observed=group["event"], label=category)
        x_vals = kmf.survival_function_.index
        y_vals = 1 - kmf.survival_function_[category] 

        fig.add_trace(go.Scatter(
            x=x_vals,
            y=y_vals,
            mode='lines',
            name=category,
            showlegend=True,
            hovertemplate=f"<b>{category}</b><br>Months: %{{x:.1f}}<br>% of cases that got a first hearing: %{{y:.2%}}<extra></extra>"
        ))

fig.add_shape(
    type="line",
    x0=6, x1=6,
    y0=0, y1=1,
    line=dict(color="black", width=1, dash="dash"),
)

fig.update_layout(
    autosize=True,
    width=None,
    height=500,
    margin=dict(l=0, r=0, t=10, b=10),
    xaxis=dict(
        title="Months from Filing",
        range=[0, 6],
        showgrid=True
    ),
    yaxis=dict(
        title="% of cases",
        tickformat=".0%",
        showgrid=True
    ),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=-0.2,
        xanchor="center",
        x=0.5
    ),
    template="simple_white",
    hovermode="closest"
)

# fig.write_html("images/nclt_time_to_first_hearing.html", full_html=True, include_plotlyjs="cdn",
#                config={"responsive": True, "modeBarButtonsToRemove": ["toImage"]})

fig.show()

print("Summary statistics by case category:")
for category, group in km_data_nclt.groupby("case_category"):
    hearing_count = group["event"].sum()
    total_count = len(group)
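    # Median of the recorded durations, including censored ones capped at 6 months (not the Kaplan-Meier median).
    median_duration = group["time_to_hearing"].median()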
    print(f"{category}: {int(hearing_count)}/{total_count} with first hearing, "
          f"median time to first hearing: {median_duration:.2f} months")

print("\n% of cases that got a first hearing within 6 months:")
for category, group in km_data_nclt.groupby("case_category"):
    if len(group) > 0:
        kmf.fit(group["time_to_hearing"], event_observed=group["event"], label=category)
        try:
            prob_at_6_months = kmf.cumulative_density_at_times([6]).iloc[0]
            print(f"{category}: {prob_at_6_months:.1%}")
        except Exception as e:
            print(f"{category}: Could not calculate probability at exactly 6 months ({e})")


Summary statistics by case category:
IBC: 2012/2219 with first hearing, median time to first hearing: 1.74 months

% of cases that got a first hearing within 6 months:
IBC: 90.7%


#### Time to Disposal

In [11]:
nclt_matters = nclt_matters.copy()
nclt_matters["disposal_date"] = pd.to_datetime(nclt_matters["disposal_date"], errors="coerce")
nclt_matters["updated_on"] = pd.to_datetime(nclt_matters["updated_on"], errors="coerce")
nclt_matters["filing_date"] = pd.to_datetime(nclt_matters["filing_date"], errors="coerce")

# The NCLT data records disposed cases under two spellings of the status.
nclt_matters["event"] = nclt_matters["case_status"].isin(["Disposed", "Dispose"])

nclt_matters["duration"] = np.nan  # numeric column of day counts; pd.NaT would create a datetime column

disposed_mask_nclt = (
    nclt_matters["event"] &
    nclt_matters["disposal_date"].notna() &
    nclt_matters["filing_date"].notna()
)
if disposed_mask_nclt.any():
    nclt_matters.loc[disposed_mask_nclt, "duration"] = (
        nclt_matters.loc[disposed_mask_nclt, "disposal_date"] - nclt_matters.loc[disposed_mask_nclt, "filing_date"]
    ).dt.days

non_disposed_mask_nclt = (
    ~nclt_matters["event"] &
    nclt_matters["updated_on"].notna() &
    nclt_matters["filing_date"].notna()
) # If not disposed: duration = updated_on - filing_date (censored)

if non_disposed_mask_nclt.any():
    nclt_matters.loc[non_disposed_mask_nclt, "duration"] = (
        nclt_matters.loc[non_disposed_mask_nclt, "updated_on"] - nclt_matters.loc[non_disposed_mask_nclt, "filing_date"]
    ).dt.days

nclt_matters["duration_years"] = pd.to_numeric(nclt_matters["duration"], errors='coerce') / 365.25
nclt_matters["duration_years"] = nclt_matters["duration_years"].clip(upper=3.5)

km_data = nclt_matters.dropna(subset=["duration_years", "event", "case_category"])

kmf = KaplanMeierFitter()
fig = go.Figure()

for category, group in km_data.groupby("case_category"):
    if len(group) > 0:
        kmf.fit(group["duration_years"], event_observed=group["event"], label=category)
        x_vals = kmf.survival_function_.index
        y_vals = 1 - kmf.survival_function_[category] 

        fig.add_trace(go.Scatter(
            x=x_vals,
            y=y_vals,
            mode='lines',
            name=category,
            showlegend=True,
            hovertemplate=f"<b>{category}</b><br>Years: %{{x:.2f}}<br>% of cases that got disposed: %{{y:.2%}}<extra></extra>"
        ))

fig.update_layout(
    autosize=True,
    width=None,
    height=500,
    margin=dict(l=0, r=0, t=10, b=10),
    xaxis=dict(title="Years from Filing", range=[0, 3.5], showgrid=True),
    yaxis=dict(title="% of cases", range=[0, 1], tickformat=".0%", showgrid=True),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=-0.2,
        xanchor="center",
        x=0.5
    ),
    template="simple_white"
)

# fig.write_html("images/nclt_time_to_disposal.html", full_html=True, include_plotlyjs="cdn",
#                config={"responsive": True, "modeBarButtonsToRemove": ["toImage"]})

fig.show()

print("\nSummary statistics by case category:")
for category, group in km_data.groupby("case_category"):
    disposed_count = group["event"].sum()
    total_count = len(group)
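    # Median of the recorded durations, including censored ones capped at 3.5 years (not the Kaplan-Meier median).
    median_duration = group["duration_years"].median()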
    print(f"{category}: {disposed_count}/{total_count} disposed, median duration: {median_duration:.2f} years")

print("\n% of cases that got disposed within 3 years of filing:")
for category, group in km_data.groupby("case_category"):
    if len(group) > 0:
        kmf.fit(group["duration_years"], event_observed=group["event"], label=category)
        try:
            prob_at_3_year = kmf.cumulative_density_at_times([3.0]).iloc[0]
            print(f"{category}: {prob_at_3_year:.1%}")
        except Exception as e:
            print(f"{category}: Could not calculate probability at 3 years ({e})")



Summary statistics by case category:
IBC: 2737/5264 disposed, median duration: 0.73 years

% of cases that got disposed within 3 years of filing:
IBC: 54.6%
