# Machine Learning Invoice Classification System (Advanced)

This notebook builds an ML pipeline that classifies invoices as Approve, Reject, or Review Further.

**Key Improvements**:
- **Realistic Data Ingestion**: Handles Currency, Requestors, and Descriptions.
- **Smart Matching**: Exact PO-number match first, then a fallback on same supplier with amount within 2%.
- **Advanced Feature Engineering**:
    - **Supplier Behavior**: Tenure, Risk Score, Historical Approval Rate.
    - **PO Variance**: Amount Variance, Currency Mismatch, Days Since PO.
    - **Text Analysis**: TF-IDF Vectorization -> SVD (Latent Semantic Analysis) for description features.
- **Gemini Integration**: Generates a one-sentence explanation for a flagged invoice.

In [2]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import classification_report
import google.generativeai as genai
import os
from dotenv import load_dotenv

# Load Env
load_dotenv('.env')
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')
if GEMINI_API_KEY:
    genai.configure(api_key=GEMINI_API_KEY)
    print("Gemini Configured.")
else:
    print("Gemini API Key missing.")

Gemini Configured.


## 1. Data Ingestion & Enrichment

In [3]:
base_path = "dataset/"
suppliers = pd.read_csv(f"{base_path}suppliers.csv")
pos = pd.read_csv(f"{base_path}purchase_orders.csv")
history = pd.read_csv(f"{base_path}invoices_history.csv")
new_invoices = pd.read_csv(f"{base_path}invoices_new.csv")

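# Pre-calculate Supplier Approval History (Simulating historical knowledge)
# In a real system, this would be an aggregation of PAST invoices only.
# For this demo, we use the full history to build a lookup, so the rate also
# reflects invoices that later land in the test split (label leakage).
supplier_stats = history.groupby('Supplier_ID')['Status'].apply(lambda x: (x == 'Approve').mean()).reset_index()
supplier_stats.columns = ['Supplier_ID', 'Supplier_Approval_Rate']
suppliers = suppliers.merge(supplier_stats, on='Supplier_ID', how='left')
# Suppliers with no history get a neutral 0.5. Fill only this column: a bare
# .fillna(0.5) on the merged frame would also overwrite NaNs in Risk_Score etc.
suppliers['Supplier_Approval_Rate'] = suppliers['Supplier_Approval_Rate'].fillna(0.5)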

print("Data Loaded.")

Data Loaded.
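
The comment in the cell above notes that a real system would aggregate past invoices only. A minimal sketch of that idea follows, assuming the history can be ordered by an `Invoice_Date` column. The toy rows, the `past_approval_rate` helper, and the 0.5 prior for a supplier's first invoice are illustrative assumptions, not part of the dataset or the pipeline.

```python
import pandas as pd

# Toy history; values are illustrative only.
toy_history = pd.DataFrame({
    "Supplier_ID": ["S1", "S1", "S1", "S2", "S2"],
    "Invoice_Date": pd.to_datetime(
        ["2024-01-05", "2024-02-01", "2024-03-10", "2024-01-20", "2024-02-15"]
    ),
    "Status": ["Approve", "Reject", "Approve", "Approve", "Approve"],
})

def past_approval_rate(history, prior=0.5):
    """Approval rate per row, using only that supplier's earlier invoices."""
    ordered = history.sort_values(["Supplier_ID", "Invoice_Date"]).copy()
    approved = (ordered["Status"] == "Approve").astype(float)
    grouped = approved.groupby(ordered["Supplier_ID"])
    # Running total minus the current row = approvals strictly before this invoice.
    prior_approvals = grouped.cumsum() - approved
    prior_count = grouped.cumcount()
    # No earlier invoices -> fall back to the neutral prior.
    rate = (prior_approvals / prior_count).where(prior_count > 0, prior)
    ordered["Supplier_Approval_Rate"] = rate
    return ordered.sort_index()

toy_with_rate = past_approval_rate(toy_history)
print(toy_with_rate[["Supplier_ID", "Invoice_Date", "Supplier_Approval_Rate"]])
```

Because each row only sees earlier invoices, the feature never includes the label of the invoice it describes, which is what the lookup built from the full history cannot guarantee.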


## 2. Advanced PO Matching

In [4]:
def match_po(inv_row, pos_df):
    # 1. Strict
    if pd.notna(inv_row['PO_Number']) and inv_row['PO_Number'] != "":
        match = pos_df[pos_df['PO_Number'] == inv_row['PO_Number']]
        if not match.empty:
            return match.iloc[0], "Strict"
            
    # 2. Smart (Supplier + Amount +/- 2%)
    candidates = pos_df[pos_df['Supplier_ID'] == inv_row['Supplier_ID']]
    if not candidates.empty:
        candidates = candidates.copy()
        candidates['Diff'] = abs(candidates['PO_Amount'] - inv_row['Invoice_Amount']) / candidates['PO_Amount']
        best = candidates[candidates['Diff'] < 0.02].sort_values('Diff').head(1)
        if not best.empty:
            return best.iloc[0], "Smart"
            
    return None, "None"

def enrich_data(invoices, pos_df, suppliers_df):
    enriched = []
    for _, row in invoices.iterrows():
        po, match_type = match_po(row, pos_df)
        d = row.to_dict()
        d['Match_Type'] = match_type
        
        if po is not None:
            d['PO_Amount'] = po['PO_Amount']
            d['PO_Currency'] = po['Currency']
            d['PO_Date'] = po['PO_Date']
            d['PO_Desc'] = po['Description']
        else:
            d['PO_Amount'] = 0
            d['PO_Currency'] = "UNK"
            d['PO_Date'] = None
            d['PO_Desc'] = ""
            
        # Supplier Meta
        sup = suppliers_df[suppliers_df['Supplier_ID'] == row['Supplier_ID']]
        if not sup.empty:
            d['Sup_Risk'] = sup.iloc[0]['Risk_Score']
            d['Sup_Tenure'] = sup.iloc[0]['Tenure_Days']
            d['Sup_Approval_Rate'] = sup.iloc[0]['Supplier_Approval_Rate']
        else:
            d['Sup_Risk'] = 50
            d['Sup_Tenure'] = 0
            d['Sup_Approval_Rate'] = 0.5
            
        enriched.append(d)
    return pd.DataFrame(enriched)

df_train_raw = enrich_data(history, pos, suppliers)
df_new_raw = enrich_data(new_invoices, pos, suppliers)
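
To make the "Smart" fallback concrete, here is a small standalone sketch of the same supplier-plus-amount rule on toy purchase orders. The `closest_po_within_tolerance` helper and the toy rows are hypothetical. The filter on `PO_Amount > 0` is an added assumption so the relative difference is always defined; `match_po` itself does not filter.

```python
import pandas as pd

def closest_po_within_tolerance(invoice_amount, supplier_id, pos_df, tolerance=0.02):
    """Return the supplier's PO closest in amount to the invoice, if within tolerance."""
    candidates = pos_df[(pos_df["Supplier_ID"] == supplier_id) & (pos_df["PO_Amount"] > 0)]
    if candidates.empty:
        return None
    # Relative difference against the PO amount, as in match_po.
    rel_diff = (candidates["PO_Amount"] - invoice_amount).abs() / candidates["PO_Amount"]
    best_idx = rel_diff.idxmin()
    if rel_diff.loc[best_idx] < tolerance:
        return candidates.loc[best_idx]
    return None

toy_pos = pd.DataFrame({
    "PO_Number": ["PO-1", "PO-2", "PO-3"],
    "Supplier_ID": ["S1", "S1", "S2"],
    "PO_Amount": [1000.0, 5000.0, 1000.0],
})

hit = closest_po_within_tolerance(1015.0, "S1", toy_pos)   # 1.5% away from PO-1
miss = closest_po_within_tolerance(1100.0, "S1", toy_pos)  # 10% away from PO-1
print(hit["PO_Number"] if hit is not None else None)
print(miss)
```

The first invoice falls inside the 2% band and matches `PO-1`. The second is too far from every PO of that supplier, so it stays unmatched.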

## 3. Text Analysis (TF-IDF + SVD)
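TF-IDF turns each invoice description into a sparse term-weight vector (up to 500 terms), and truncated SVD compresses that vector to 5 dense components. Both steps are fit on the training descriptions only and then reused to transform new invoices.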

In [5]:
# Initialize Vectorizer and SVD
tfidf = TfidfVectorizer(max_features=500, stop_words='english')
svd = TruncatedSVD(n_components=5, random_state=42) # Reduce text to 5 features

# Fit on Training Descriptions
text_train = df_train_raw['Description'].fillna("")
tfidf_matrix = tfidf.fit_transform(text_train)
svd_features = svd.fit_transform(tfidf_matrix)

# Create DataFrame for Text Features
text_cols = [f'Text_SVD_{i}' for i in range(5)]
df_text_features = pd.DataFrame(svd_features, columns=text_cols)
df_train_raw = pd.concat([df_train_raw, df_text_features], axis=1)

# Transform New Data using same pipeline
text_new = df_new_raw['Description'].fillna("")
tfidf_new = tfidf.transform(text_new)
svd_new = svd.transform(tfidf_new)
df_text_new = pd.DataFrame(svd_new, columns=text_cols)
df_new_raw = pd.concat([df_new_raw, df_text_new], axis=1)

print("Text Analysis Complete.")

Text Analysis Complete.
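
The fit-on-train, transform-on-new pattern used above can be seen on a handful of made-up descriptions. This is only a sketch: the toy strings and the choice of 2 components are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Made-up descriptions, not taken from the dataset.
toy_train = [
    "software development consultation",
    "software license renewal",
    "office furniture delivery",
    "office chairs and desks",
]
toy_new = ["consultation on software migration"]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_svd = TruncatedSVD(n_components=2, random_state=42)

# Vocabulary and SVD components are learned from the training text only.
train_dense = toy_svd.fit_transform(toy_tfidf.fit_transform(toy_train))
# New text is projected into the same space; unseen words are ignored.
new_dense = toy_svd.transform(toy_tfidf.transform(toy_new))

print(train_dense.shape)
print(new_dense.shape)
```

Every description ends up with the same number of columns regardless of its length, which is what lets the text features sit next to the numeric ones in a single feature matrix.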


## 4. Comprehensive Feature Engineering

In [6]:
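def engineered_features(df, categories=None):
    """Add variance, date, and categorical-code features.

    On the training call, leave `categories` as None so the category lists are
    learned from `df`. Pass the returned dict when transforming new data so the
    integer codes mean the same thing in both sets.
    """
    # Variance
    df['Amt_Variance'] = np.where(df['PO_Amount'] > 0, (df['Invoice_Amount'] - df['PO_Amount'])/df['PO_Amount'], 0)
    df['Currency_Mismatch'] = (df['Currency'] != df['PO_Currency']).astype(int)
    
    # Date Variance (-1 marks invoices with no matched PO date)
    df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'])
    df['PO_Date'] = pd.to_datetime(df['PO_Date'])
    df['Days_Since_PO'] = (df['Invoice_Date'] - df['PO_Date']).dt.days.fillna(-1)
    
    # Categorical Encoding: sorted category lists reproduce LabelEncoder's codes
    # on the training data; values unseen in training are encoded as -1.
    fit = categories is None
    if fit:
        categories = {}
    for col, code_col in [('Department', 'Dept_Code'),
                          ('Requestor', 'Requestor_Code'),
                          ('Match_Type', 'Match_Type_Code')]:
        values = df[col].astype(str)
        if fit:
            categories[col] = sorted(values.unique())
        df[code_col] = pd.Categorical(values, categories=categories[col]).codes
    
    return df, categories

In [7]:
df_train_final, cat_levels = engineered_features(df_train_raw)
df_new_final, _ = engineered_features(df_new_raw, cat_levels)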

features = ['Invoice_Amount', 'Sup_Risk', 'Sup_Tenure', 'Sup_Approval_Rate',
            'Amt_Variance', 'Currency_Mismatch', 'Days_Since_PO', 
            'Dept_Code', 'Match_Type_Code'] + text_cols

print("Features ready:", features)

Features ready: ['Invoice_Amount', 'Sup_Risk', 'Sup_Tenure', 'Sup_Approval_Rate', 'Amt_Variance', 'Currency_Mismatch', 'Days_Since_PO', 'Dept_Code', 'Match_Type_Code', 'Text_SVD_0', 'Text_SVD_1', 'Text_SVD_2', 'Text_SVD_3', 'Text_SVD_4']
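
One thing to watch with integer-coded categories such as `Dept_Code` and `Match_Type_Code`: the code assigned to a value depends on which values were present at fit time. The sketch below uses made-up department names to show how fitting a fresh `LabelEncoder` per dataset can shift the codes, and how reusing the training categories through `pd.Categorical` keeps them stable.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up department names for illustration.
train_depts = pd.Series(["Finance", "IT", "Marketing"])
new_depts = pd.Series(["IT", "Marketing"])

# A fresh encoder per dataset: "IT" gets a different code in each.
code_in_train = dict(zip(train_depts, LabelEncoder().fit_transform(train_depts)))["IT"]
code_in_new = dict(zip(new_depts, LabelEncoder().fit_transform(new_depts)))["IT"]
print(code_in_train, code_in_new)

# Reusing the training categories keeps codes stable; unseen values map to -1.
train_levels = sorted(train_depts.unique())
stable_codes = pd.Categorical(["IT", "Marketing", "Legal"], categories=train_levels).codes
print(list(stable_codes))
```

A tree model trained with `IT` encoded as 1 would read a 0 at inference time as `Finance`, so the category lists learned on the training data need to be carried over to new data.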


## 5. Model Training

In [8]:
target = 'Status'
le_target = LabelEncoder()
y = le_target.fit_transform(df_train_final[target])
X = df_train_final[features]
mapping = dict(zip(le_target.transform(le_target.classes_), le_target.classes_))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = lgb.LGBMClassifier(random_state=42)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test), target_names=le_target.classes_))

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000253 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1001
[LightGBM] [Info] Number of data points in the train set: 1200, number of used features: 14
[LightGBM] [Info] Start training from score -0.492931
[LightGBM] [Info] Start training from score -1.728785
[LightGBM] [Info] Start training from score -1.552743
                precision    recall  f1-score   support

       Approve       0.99      1.00      1.00       179
        Reject       1.00      0.91      0.95        57
Review Further       0.94      1.00      0.97        64

      accuracy                           0.98       300
     macro avg       0.98      0.97      0.97       300
  weighted avg       0.98      0.98      0.98       300
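
The report above comes from a single 80/20 split. A stratified k-fold run gives a spread of scores and keeps the class proportions equal across folds. The sketch below shows the mechanics on synthetic data. `LogisticRegression` is only a stand-in to keep the example light, and a scikit-learn compatible classifier such as the `LGBMClassifier` used above could be passed in its place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for X and y; this is not the invoice dataset.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Each fold preserves the overall class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring="f1_macro"
)
print(len(scores), round(float(np.mean(scores)), 3), round(float(np.std(scores)), 3))
```

Macro F1 is used here because it weights the three classes equally, which matters when Approve outnumbers the other two classes as it does in the report.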



## 6. Inference & Explanations

In [None]:
# Predict on New Data
X_new = df_new_final[features]
probs = model.predict_proba(X_new)
preds = model.predict(X_new)

df_new_final['Predicted_Status'] = le_target.inverse_transform(preds)
df_new_final['Confidence'] = probs.max(axis=1)

# Find Risky Items
risky = df_new_final[df_new_final['Predicted_Status'].isin(['Reject', 'Review Further'])]
print("Risky Invoices Found:", len(risky))

if not risky.empty:
    sample = risky.iloc[0]
    
    # Construct Summary for Gemini
    prompt = f"""
    Role: Accounts Payable Auditor.
    Task: Explain the rejection risk concisely.
    
    Invoice: {sample['Invoice_ID']}
    Description: {sample['Description']}
    Amount: {sample['Invoice_Amount']} {sample['Currency']}
    Supplier Risk: {sample['Sup_Risk']} (High is bad)
    Approval History: {sample['Sup_Approval_Rate']:.0%}
    Variance from PO: {sample['Amt_Variance']:.1%}
    Match Type: {sample['Match_Type']}
    Model Prediction: {sample['Predicted_Status']} (Conf: {sample['Confidence']:.2f})
    
    Output a 1-sentence 'Reason for Rejection/Review' for the UI card.
    """
    
    print(prompt)
    if GEMINI_API_KEY:
        model_gen = genai.GenerativeModel('gemini-2.5-flash')
        try:
            res = model_gen.generate_content(prompt)
            print("\n--- DECISION CARD EXPLANATION ---")
            print(res.text)
        except Exception as e:
            print(e)

Risky Invoices Found: 10

    Role: Accounts Payable Auditor.
    Task: Explain the rejection risk concisely.

    Invoice: INV-NEW-0014
    Description: Software Development Consultation - Q3
    Amount: 13372.7 USD
    Supplier Risk: 91 (High is bad)
    Approval History: 60%
    Variance from PO: 60.0%
    Match Type: Strict
    Model Prediction: Reject (Conf: 1.00)

    Output a 1-sentence 'Reason for Rejection/Review' for the UI card.
    
429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit. 
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_input_token_count, limit: 0, model: gemini-2.0-flash-exp
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 0, model: gemini-2.0-flash-exp
* Quota exc
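
When the Gemini call fails, as in the quota error above, the UI card could still show a rule-based sentence built from the same fields the prompt uses. This is a minimal sketch: the `fallback_reason` helper and its thresholds (`variance_limit`, `risk_limit`) are hypothetical and not part of the pipeline above.

```python
def fallback_reason(inv, variance_limit=0.10, risk_limit=75):
    """Build a one-sentence reason from the fields used in the prompt.

    The thresholds are illustrative assumptions, not values from this notebook.
    """
    factors = []
    if abs(inv["Amt_Variance"]) > variance_limit:
        factors.append(f"invoice amount deviates {inv['Amt_Variance']:.0%} from the PO")
    if inv["Sup_Risk"] > risk_limit:
        factors.append(f"supplier risk score is {inv['Sup_Risk']}")
    if inv["Match_Type"] == "None":
        factors.append("no matching PO was found")
    if not factors:
        factors.append("the model flagged it without a single dominant factor")
    return f"{inv['Predicted_Status']}: " + "; ".join(factors) + "."

# Field values copied from the sample invoice printed in the prompt above.
example = {
    "Amt_Variance": 0.60,
    "Sup_Risk": 91,
    "Match_Type": "Strict",
    "Predicted_Status": "Reject",
}
print(fallback_reason(example))
```

A helper like this could be called from the `except` branch so that a failed API request degrades to a plain explanation instead of an empty card.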