In [94]:
import numpy as np
import pandas as pd 
import fasttext


The dataset was initially preprocessed using Mixtral 8×7B, a large language model developed by [Mistral AI](https://arxiv.org/abs/2401.04088). Mixtral 8×7B is a sparse Mixture-of-Experts model: each layer contains eight feed-forward expert blocks, and a router selects two of them for every input token. Because the attention layers are shared across experts, the model has about 47B parameters in total (not 8 × 7B = 56B) and uses only about 13B of them per token, which lets it achieve strong performance while keeping inference costs low. This model was used to generate the structured summaries (the summary and impact columns shown below) from the raw announcement text.
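
To make the routing idea concrete, here is a minimal NumPy sketch of top-2 expert selection for a single token. It is a toy under stated assumptions, not Mixtral's implementation: the dimensions, the random weights, and the `tanh` experts are made up for illustration (the real experts are much larger gated feed-forward blocks), and `top2_moe_layer` is a hypothetical name.

```python
import numpy as np

def top2_moe_layer(x, gate_w, expert_ws):
    """Toy sparse MoE layer: route one token vector to its top-2 experts."""
    logits = x @ gate_w                       # one router logit per expert
    top2 = np.argsort(logits)[-2:]            # indices of the two highest logits
    weights = np.exp(logits[top2] - logits[top2].max())
    weights /= weights.sum()                  # softmax over the two selected logits
    # Only the two selected experts are evaluated; the others are skipped.
    out = sum(w * np.tanh(x @ expert_ws[i]) for w, i in zip(weights, top2))
    return out, top2, weights

# Hypothetical toy sizes and random weights, for illustration only
rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
x = rng.normal(size=d_model)
gate_w = rng.normal(size=(d_model, n_experts))
expert_ws = rng.normal(size=(n_experts, d_model, d_model))

out, chosen, weights = top2_moe_layer(x, gate_w, expert_ws)
print("experts used:", chosen, "mixing weights:", weights)
```

Only two of the eight expert matrices are touched per token, which is where the inference savings come from.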

In [95]:
data=pd.read_csv("dataset.csv")
data.head()

Unnamed: 0,Date,Subject,Content,ParaphrasedSubject,CompactedSummary,DetailedSummary,Impact
0,3-Mar-24,BAWAN,Bawan Co. announces the board of director’s de...,Bawan Co. Declares Cash Dividends for Second H...,Bawan Co. announces the distribution of cash d...,Bawan Co. has announced its board of directors...,Shareholders who meet the eligibility criteria...
1,3-Mar-24,SABIC AGRI-NUTRIENTS,Addendum Announcement from SABIC Agri-Nutrient...,SABIC Agri-Nutrients Company Extends MoU with ...,SABIC Agri-Nutrients Company and Saudi Agricu...,SABIC Agri-Nutrients Company has announced an...,The extension of the MoU between SABIC Agri-N...
2,3-Mar-24,GAS,Gas Arabian Services Co. Announces Contract Si...,GAS Arabian Services Co. Inks Contract with Sa...,GAS Arabian Services Co. has signed a contrac...,GAS Arabian Services Company has announced the...,This contract marks a significant milestone fo...
3,3-Mar-24,GAS,Gas Arabian Services Co. Announces Contract Si...,GAS Arabian Services Co. Inks Contract with Sa...,GAS Arabian Services Co. has signed a contrac...,GAS Arabian Services Company has announced the...,This contract marks a significant milestone fo...
4,3-Mar-24,ADVANCED,ADVANCED PETROCHEMICAL COMPANY ANNOUNCES THE L...,Advanced Petrochemical Company Resumes Propyle...,Advanced Petrochemical Company has resumed op...,Advanced Petrochemical Company has announced t...,The resumption of operations at Advanced Petro...


In [96]:
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True, errors='coerce')
# data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%y' )


# data['Date']



In [97]:
data['Date'].isna().mean()


np.float64(0.000543773790103317)

In [98]:
data['CompactedSummary'][0]

'Bawan Co. announces the distribution of cash dividends for the second half of 2023, with a dividend per share of 0.85, totaling 51 million, payable to shareholders of record on April 3, 2024.'

In [99]:
data['Impact'][0]

'Shareholders who meet the eligibility criteria can expect to receive cash dividends from Bawan Co. for the second half of 2023. To facilitate seamless dividend disbursement, stakeholders are advised to update their personal information and link their bank accounts to their investment portfolios.'

This sets up a teacher-student (distillation) workflow:

- The teacher model (Mixtral) produced the summaries.
- A student model, often smaller and faster, learns to mimic them.

The student can then improve beyond the teacher by:

- filtering low-quality labels
- adding factuality constraints
- adding domain-specific evaluation (numbers/entities)
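
As one possible form of the domain-specific evaluation mentioned in the list above, the sketch below flags numbers that appear in a summary but not in its source text. The function names, the regular expression, and the example strings are assumptions made for illustration; they are not part of the dataset or of any library.

```python
import re

# Digits with optional "." or "," groups, e.g. 0.85 or 51,000,000
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def extract_numbers(text):
    """Return the set of numeric tokens in text, with commas removed."""
    return {m.group().replace(",", "") for m in NUM_RE.finditer(str(text))}

def unsupported_numbers(content, summary):
    """Numbers that appear in the summary but nowhere in the source content."""
    return extract_numbers(summary) - extract_numbers(content)

# Hypothetical example strings
content = "The board approved a dividend of SAR 0.85 per share, totaling SAR 51,000,000."
good = "Dividend of 0.85 per share, 51,000,000 in total."
bad = "Dividend of 0.95 per share."

print(unsupported_numbers(content, good))  # set()
print(unsupported_numbers(content, bad))   # {'0.95'}
```

This check is deliberately strict: a summary that rewrites 51,000,000 as "51 million" would be flagged, so a real filter would probably need number normalization before comparing.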


## DATASET AUDIT

---

### 1 Dataset shape & columns

In [100]:
print("Shape:", data.shape)
print("\nColumns:")
for col in data.columns:
    print("-", col)


Shape: (1839, 7)

Columns:
- Date
- Subject
- Content
- ParaphrasedSubject
- CompactedSummary
- DetailedSummary
- Impact


### 2 Missing values & empty strings

This step checks for possible "red flags" in the LLM-generated labels (Mixtral 8x7B). Specifically, it is a sanity check for:
- Missing Impact or Summary
- Empty Content rows
- Any column with >5% missing

In [101]:
missing = (
    data.isna().sum()
    .to_frame("NaN_count")
    .assign(
        empty_string_count=lambda df: [
            (data[col].astype(str).str.strip() == "").sum()
            for col in data.columns
        ]
    )
)

missing["total_missing"] = missing["NaN_count"] + missing["empty_string_count"]
missing["percent_missing"] = (
    missing["total_missing"] / data.shape[0] * 100
)
missing


Unnamed: 0,NaN_count,empty_string_count,total_missing,percent_missing
Date,1,0,1,0.054377
Subject,0,0,0,0.0
Content,0,1,1,0.054377
ParaphrasedSubject,0,0,0,0.0
CompactedSummary,28,0,28,1.522567
DetailedSummary,43,0,43,2.338227
Impact,44,0,44,2.392605
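
The third red flag (any column with more than 5% missing) can be turned into an explicit boolean column. The sketch below wraps the same NaN and empty-string logic in a hypothetical helper, `missing_report`, and runs it on a made-up toy frame; the 5% default mirrors the threshold stated above.

```python
import numpy as np
import pandas as pd

def missing_report(df, threshold_pct=5.0):
    """Percent of NaN or whitespace-only cells per column, with a red-flag marker."""
    nan_count = df.isna().sum()
    empty_count = df.apply(lambda col: (col.astype(str).str.strip() == "").sum())
    report = pd.DataFrame({"NaN_count": nan_count, "empty_string_count": empty_count})
    report["percent_missing"] = (nan_count + empty_count) / len(df) * 100
    report["red_flag"] = report["percent_missing"] > threshold_pct
    return report

# Toy data for illustration only
toy = pd.DataFrame({
    "Content": ["a", " ", "c", "d"],
    "Impact": ["x", np.nan, "z", "w"],
    "Subject": ["s1", "s2", "s3", "s4"],
})
report = missing_report(toy, threshold_pct=5.0)
print(report)
```

In the table above, no column exceeds 5%, so none would be flagged by this rule.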


### 3 Date audit

In [102]:
# Range and distribution of dates 

print("Date range:")
print(data["Date"].min(), "→", data["Date"].max())

print("\nTop dates by frequency:")
data["Date"].value_counts().head(10)


# Note: the earliest date is 2024-03-03, so the dataset covers only about three months of 2024.

Date range:
2024-03-03 00:00:00 → 2024-06-10 00:00:00

Top dates by frequency:


Date
2024-06-09    59
2024-06-05    55
2024-05-16    53
2024-05-23    53
2024-05-13    51
2024-04-04    50
2024-05-19    50
2024-05-15    49
2024-05-14    49
2024-03-28    48
Name: count, dtype: int64

In [103]:
# Missing / invalid dates

data["Date"].isna().sum()

# I believe these NaN correspond 


np.int64(1)

In [104]:
# Articles per day (important for leakage later)

articles_per_day = data.groupby("Date").size()
articles_per_day.describe()


count    62.000000
mean     29.645161
std      15.711640
min       1.000000
25%      21.000000
50%      31.500000
75%      41.750000
max      59.000000
dtype: float64
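
Since many articles share a date, one way to limit leakage later is to split by calendar day rather than by row, so that no date ends up on both sides. The following is a minimal sketch under that assumption; `chronological_split`, the 20% default, and the toy frame are hypothetical and not part of the notebook's pipeline.

```python
import pandas as pd

def chronological_split(df, date_col="Date", test_frac=0.2):
    """Split by calendar day so that no date appears in both train and test."""
    days = sorted(df[date_col].dropna().unique())
    n_test_days = max(1, int(round(len(days) * test_frac)))
    cutoff = days[-n_test_days]
    # Rows with a missing date satisfy neither comparison and are dropped.
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Toy data for illustration only
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2024-03-03", "2024-03-03", "2024-03-04", "2024-03-05", "2024-03-06"]
    ),
    "Subject": ["A", "B", "C", "D", "E"],
})
train, test = chronological_split(toy, test_frac=0.25)
print(len(train), len(test))
```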

### 4 Text length statistics (VERY important)

We will use these statistics to:
- choose max token lengths
- decide truncation strategy
- define what “compact” really means

In [105]:
def word_count(s):
    # Whitespace word count. Note: str(NaN) is "nan", so a missing value counts as 1 word.
    return len(str(s).split())

text_cols = [
    "Subject",
    "Content",
    "CompactedSummary",
    "DetailedSummary",
    "Impact",
]

length_stats = {}

for col in text_cols:
    length_stats[col] = data[col].apply(word_count)

length_df = pd.DataFrame(length_stats)
length_df.describe(percentiles=[.1, .25, .5, .75, .9, .95])


Unnamed: 0,Subject,Content,CompactedSummary,DetailedSummary,Impact
count,1839.0,1839.0,1839.0,1839.0,1839.0
mean,5.383361,330.088091,32.311039,121.947254,48.319195
std,7.717705,317.096013,13.241173,64.625639,14.347689
min,1.0,0.0,1.0,1.0,1.0
10%,1.0,22.8,18.0,48.0,34.0
25%,1.0,121.0,24.0,74.0,40.0
50%,2.0,221.0,30.0,115.0,48.0
75%,7.0,447.5,39.0,159.0,56.0
90%,19.0,788.2,49.0,205.0,65.0
95%,23.0,985.1,57.0,235.0,71.0
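
To connect these percentiles to a truncation decision, it can help to ask what fraction of documents fits entirely within a given length budget. The sketch below does this for word counts; `coverage_at_budget`, the budgets, and the toy counts are illustrative assumptions. Subword tokenizers usually produce more tokens than words, so a real budget should be checked with the tokenizer of the chosen model.

```python
def coverage_at_budget(word_counts, budgets):
    """Fraction of documents whose word count fits within each budget."""
    counts = list(word_counts)
    return {b: sum(c <= b for c in counts) / len(counts) for b in budgets}

# Toy word counts for illustration only
toy_counts = [20, 120, 220, 450, 800, 1000]
coverage = coverage_at_budget(toy_counts, budgets=[256, 512, 1024])
print(coverage)
```

On the notebook's data, the same function could be called with `length_df["Content"]` in place of `toy_counts`.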


### 5 Extreme outliers (long / short)

We’ll later decide whether to:

- truncate
- chunk
- or filter

In [106]:
# Very short or empty content
data.loc[length_df["Content"] < 50, ["Date", "Subject", "Content"]].head(5)


Unnamed: 0,Date,Subject,Content
50,2024-03-05,AMERICANA,Americana Restaurants International PLC Announ...
70,2024-03-05,The Securities Depository Center Company (Edaa...,The Securities Depository Center Company (Edaa...
91,2024-03-06,NBM,"No English translation, kindly refer to the Ar..."
97,2024-03-07,The Securities Depository Center Company (Edaa...,The Securities Depository Center Company (Edaa...
157,2024-03-11,The Securities Depository Center (Edaa) Announ...,The Securities Depository Center Company (Edaa...


In [107]:
# Extremely long content (PDF dumps?)

data.loc[length_df["Content"] > 3000, ["Date", "Subject"]].head(5)


Unnamed: 0,Date,Subject
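
If chunking turns out to be the preferred option for long documents, a simple approach is a sliding window over words with a small overlap. This is a sketch of that idea only; `chunk_words` and its default sizes are hypothetical, and a production version would more likely count model tokens than words.

```python
def chunk_words(text, max_words=512, overlap=64):
    """Split text into word windows of at most max_words, overlapping by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = str(text).split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Toy text of ten words: w0 ... w9
toy_text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(toy_text, max_words=4, overlap=1)
print(chunks)  # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```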


### 6 Duplicate detection

In [108]:
# 6.1 Exact duplicate contents
print(f'We have {data.duplicated(subset=["Content"]).sum()} duplicate content entries.')
data[data.duplicated(subset=["Content"], keep=False)][
    ["Date", "Subject"]
].head(10)


We have 178 duplicate content entries.


Unnamed: 0,Date,Subject
91,2024-03-06,NBM
159,2024-03-11,NBM
160,2024-03-11,NBM
238,2024-03-22,Resume trading on SHUAA shares after disclosin...
239,2024-03-22,Resume trading on NIH shares after disclosing ...
240,2024-03-22,"Reminder: Today, 22/03/2024 is the ex-dividend..."
270,2024-03-25,Suspend trading on ASNIC shares starting from ...
271,2024-03-25,Suspend trading on ITHMR shares starting from ...
287,2024-03-26,Resume trading on ITHMR shares after disclosin...
288,2024-03-26,Resume trading on ASNIC shares after disclosin...
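
Exact matching misses duplicates that differ only in case or spacing. A slightly more tolerant sketch is shown below: it builds a normalized key and keeps the first row per key. `drop_duplicate_content` and the toy rows are assumptions for illustration. Whether duplicates should be dropped at all depends on whether their labels differ, which this sketch does not check.

```python
import pandas as pd

def drop_duplicate_content(df, text_col="Content"):
    """Keep the first row for each distinct text after lowercasing and collapsing whitespace."""
    key = (
        df[text_col]
        .fillna("")
        .astype(str)
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    return df.loc[~key.duplicated(keep="first")].reset_index(drop=True)

# Toy data for illustration only
toy = pd.DataFrame({
    "Subject": ["NBM", "NBM", "GAS"],
    "Content": ["No English translation.", "no english  translation. ", "Contract signed."],
})
deduped = drop_duplicate_content(toy)
print(deduped)
```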


In [109]:
# Same subject + same date
# data.duplicated(subset=["Date", "Subject"]).sum()


### 7 Summary leakage check (extractiveness)
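We want to know whether the summaries are copy-pasted from the content. The overlap ratio is the fraction of summary words that also appear in the content.

Interpretation:

- ~0.4–0.6 $\rightarrow$ mixed abstractive
- ~0.8 and above $\rightarrow$ mostly extractive
- very low $\rightarrow$ possible hallucination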

In [110]:
def overlap_ratio(text, summary):
    text_words = set(str(text).lower().split())
    summary_words = str(summary).lower().split()
    if len(summary_words) == 0:
        return 0
    overlap = sum(w in text_words for w in summary_words)
    return overlap / len(summary_words)

data["compact_overlap"] = data.apply(
    lambda r: overlap_ratio(r["Content"], r["CompactedSummary"]),
    axis=1
)

data["detailed_overlap"] = data.apply(
    lambda r: overlap_ratio(r["Content"], r["DetailedSummary"]),
    axis=1
)

data[["compact_overlap", "detailed_overlap"]].describe()


Unnamed: 0,compact_overlap,detailed_overlap
count,1839.0,1839.0
mean,0.627678,0.636109
std,0.244103,0.251458
min,0.0,0.0
25%,0.6,0.61808
50%,0.7,0.716495
75%,0.769231,0.783328
max,1.0,0.964476
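
Word-level overlap ignores word order, so a heavily reworded summary can still score high. A stricter complementary measure is n-gram overlap, sketched below for bigrams. This is a swapped-in variant for illustration, not the metric used above; `ngram_overlap` and the example strings are hypothetical.

```python
def ngram_overlap(text, summary, n=2):
    """Fraction of summary n-grams that also occur in the source text."""
    def ngrams(s):
        words = str(s).lower().split()
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    text_ngrams = set(ngrams(text))
    return sum(g in text_ngrams for g in summary_ngrams) / len(summary_ngrams)

# Hypothetical example strings
source = "the company signed a contract with the ministry"
copied = "the company signed a contract"
reworded = "contract signed by company"

print(ngram_overlap(source, copied))    # 1.0
print(ngram_overlap(source, reworded))  # 0.0
```

The reworded example shares three of its four words with the source but none of its bigrams, which is the kind of case the word-level ratio cannot separate from copying.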


### 8 Impact field sanity check
Impact should:

- be short
- be interpretative (not a copy of the summary)
- mention consequences


We’ll later check:

- vagueness
- modal verbs (“may”, “could”)
- market language consistency

In [111]:
length_df["Impact"].describe()


count    1839.000000
mean       48.319195
std        14.347689
min         1.000000
25%        40.000000
50%        48.000000
75%        56.000000
max       116.000000
Name: Impact, dtype: float64

In [112]:
data[["Subject", "Impact"]].sample(5, random_state=42)


Unnamed: 0,Subject,Impact
1556,MULKIA,The electronic voting process offers sharehold...
1157,BALADY,This addendum does not indicate any financial ...
352,Resume trading on DRC shares after disclosing ...,The resumption of trading on DRC shares indica...
1018,Resume trading on ORIENTTKAFUL shares after di...,The resumption of trading on ORIENTTKAFUL shar...
1393,The Saudi Exchange announces that the fluctuat...,This adjustment in fluctuation limits for Alkh...
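
For the planned vagueness check, one simple starting point is the share of modal verbs in each Impact text. The sketch below is only a heuristic under stated assumptions: the word list, the name `hedge_rate`, and the two example sentences are made up for illustration.

```python
import re

# Assumed list of modal/hedging verbs; adjust to the domain as needed
HEDGE_RE = re.compile(r"\b(?:may|might|could|would|should|can)\b", re.IGNORECASE)

def hedge_rate(text):
    """Share of words in text that are modal/hedging verbs (0.0 for empty text)."""
    words = str(text).split()
    if not words:
        return 0.0
    return len(HEDGE_RE.findall(str(text))) / len(words)

# Hypothetical example sentences
firm = "The resumption of trading indicates that the disclosure was completed."
vague = "This may affect investors and could influence the share price."

print(hedge_rate(firm))   # 0.0
print(hedge_rate(vague))  # 0.2
```

Applied to the notebook's data, this could be mapped over `data["Impact"]` after handling missing values.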


### 9 Language check (quick heuristic)

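The heuristic measures the share of rows whose content contains at least one common English stopword.

- If close to 1 → English-dominated
- If clearly below 1 → mixed languages; we’ll need language filtering.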

In [113]:
# Non-capturing group (?:...) avoids the pandas "match groups" UserWarning
data["Content"].str.contains(r"\b(?:the|and|is|with|for)\b", case=False).mean()



np.float64(0.8983143012506797)
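
The check above is binary per row, so a single English word in an otherwise non-English announcement is enough to pass. A slightly finer-grained variant is the ratio of stopwords among Latin-alphabet tokens, sketched below with the same five stopwords. The function name and the example strings are hypothetical, and this remains a rough heuristic rather than real language identification.

```python
import re

STOPWORDS = {"the", "and", "is", "with", "for"}

def english_stopword_ratio(text):
    """Share of Latin-alphabet tokens that are common English stopwords."""
    tokens = re.findall(r"[a-z]+", str(text).lower())
    if not tokens:
        return 0.0
    return sum(t in STOPWORDS for t in tokens) / len(tokens)

# Hypothetical example strings
english = "The board of directors announces the distribution of dividends for the year"
arabic = "تعلن الشركة عن توزيع أرباح"

print(english_stopword_ratio(english))
print(english_stopword_ratio(arabic))  # 0.0
```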

In [114]:
data.shape[0]

1839

#### 9.1 Language cleaning

In [115]:
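# Pretrained fastText language-identification model; the file must be available locally
model = fasttext.load_model("lid.176.bin")

# Single-text helper for ad hoc checks; the filtering below uses one batched predict call instead
def is_english(text):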
    if not isinstance(text, str) or not text.strip():
        return False

    try:
        labels, probs = model.predict(
            text.replace("\n", " "),
            k=1
        )
        return labels[0] == "__label__en" and probs[0] > 0.8
    except ValueError:
        return False

texts = (
    data["Content"]
    .fillna("")
    .astype(str)
    .str.replace("\n", " ", regex=False)
    .tolist()
)

# One batched call over all rows
labels, probs = model.predict(texts, k=1)

# Keep rows predicted as English with probability above 0.8
mask = [
    lbl[0] == "__label__en" and pr[0] > 0.8
    for lbl, pr in zip(labels, probs)
]

data = data.loc[mask].reset_index(drop=True)
data["Content"].str.contains(r"\b(?:the|and|is|with|for)\b", case=False).mean()



np.float64(0.9974635383639823)

In [116]:
print(f"After filtering non-English content, we have {data.shape[0]} rows left.")

After filtering non-English content, we have 1577 rows left.


### 10 Final audit snapshot

In [117]:
audit_summary = {
    "num_rows": len(data),
    "date_min": data["Date"].min(),
    "date_max": data["Date"].max(),
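    # Caution: length_df was built before the language filter, so the four
    # averages below still describe all 1839 rows, not the filtered data.
    "avg_content_words": length_df["Content"].mean(),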
    "avg_compact_words": length_df["CompactedSummary"].mean(),
    "avg_detailed_words": length_df["DetailedSummary"].mean(),
    "avg_impact_words": length_df["Impact"].mean(),
    "duplicate_contents": data.duplicated(subset=["Content"]).sum(),
}

pd.Series(audit_summary)


num_rows                             1577
date_min              2024-03-03 00:00:00
date_max              2024-06-10 00:00:00
avg_content_words              330.088091
avg_compact_words               32.311039
avg_detailed_words             121.947254
avg_impact_words                48.319195
duplicate_contents                      9
dtype: object
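
The four averages in this snapshot match the pre-filter values exactly because `length_df` was computed before the language filter. One way to avoid that is to recompute the lengths on whatever frame is current, as in the sketch below. The helper `length_means`, the NaN-aware `word_count`, and the toy frame are assumptions for illustration; results on the real data are not shown here.

```python
import pandas as pd

def word_count(s):
    """Whitespace word count; missing values count as zero words."""
    return 0 if pd.isna(s) else len(str(s).split())

def length_means(df, text_cols):
    """Mean word count per column, computed on the rows of df as it currently stands."""
    return {f"avg_{col}_words": df[col].map(word_count).mean() for col in text_cols}

# Toy data for illustration only
toy = pd.DataFrame({
    "Content": ["one two three four", "خمسة ستة", "five six"],
    "Impact": ["short note", None, "another short note"],
})
filtered = toy.iloc[[0, 2]].reset_index(drop=True)  # stand-in for the language filter

means = length_means(filtered, ["Content", "Impact"])
print(means)
```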