
In the oil and gas industry, companies often maintain extensive databases of technical reports, maintenance logs, safety incident reports, research papers, and operational manuals. When engineers and other professionals search for information, they need the most relevant documents to appear at the top of the search results to make quick and informed decisions. However, the initial retrieval system may not always provide the best ordering of documents due to the use of basic ranking methods. A reranker can be employed to refine these results, ensuring that the most relevant and useful documents are prioritized.

**Goal:**
Improve the relevance of search results by re-ranking initially retrieved documents, bringing the most critical and pertinent information to the top of the list.

**Initial Ranking System:**

Use an existing search engine or retrieval system (such as Elasticsearch or Lucene) to retrieve an initial list of candidate documents for a query.
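
When no search engine is at hand, the first-stage retrieval can be approximated with a small BM25 scorer. The following is a minimal sketch, not a description of how Elasticsearch or Lucene work internally; the names `tokenize` and `bm25_rank`, the defaults `k1=1.5` and `b=0.75`, the regex tokenizer, and the non-negative IDF variant are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption): lowercase alphanumeric runs only
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return (score, document) pairs sorted by descending BM25 score."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scored = []
    for doc, toks in zip(documents, tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant, so very common terms never score below zero
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

documents = [
    "Report on oil spill in the Gulf of Mexico.",
    "Routine maintenance for offshore oil rig.",
    "Incident report: Pipeline leak detected.",
    "Guide to maintaining pipeline integrity.",
    "Safety measures for oil spill response.",
]
candidates = bm25_rank("oil spill", documents)
for score, doc in candidates:
    print(f"{score:.3f}  {doc}")
```

The ordered `candidates` list plays the role of the initial ranking that the reranker later refines.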

In [3]:
import pandas as pd

# Example dataset
data = {
    'query': ["oil spill", "pipeline maintenance", "safety incident", "safety report", "oil handling"],
    'document': [
        "Report on oil spill in the Gulf of Mexico.",
        "Routine maintenance for offshore oil rig.",
        "Incident report: Pipeline leak detected.",
        "Guide to maintaining pipeline integrity.",
        "Safety measures for oil spill response."
    ],
    'relevance': [1, 0, 1, 0, 1]  # Example relevance labels
}

df = pd.DataFrame(data)
df

| | query | document | relevance |
|---|---|---|---|
| 0 | oil spill | Report on oil spill in the Gulf of Mexico. | 1 |
| 1 | pipeline maintenance | Routine maintenance for offshore oil rig. | 0 |
| 2 | safety incident | Incident report: Pipeline leak detected. | 1 |
| 3 | safety report | Guide to maintaining pipeline integrity. | 0 |
| 4 | oil handling | Safety measures for oil spill response. | 1 |


**Feature Extraction for Reranking:**

Extract richer features from the query-document pairs to be used by the reranker. These features could include:

- Textual similarity scores (e.g., cosine similarity).
- Metadata features (e.g., document date, author).
- Contextual embeddings (e.g., BERT embeddings of query and document).
- Domain-specific features (e.g., keywords like "oil spill", "pipeline maintenance").
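
To make the feature types above concrete, here is a minimal sketch that computes a few per-pair features with the standard library only. The function names (`term_counts`, `cosine_similarity`, `pair_features`), the feature names, and the `DOMAIN_KEYWORDS` tuple are hypothetical and chosen for illustration; metadata and embedding features are left out because they depend on data and models not shown here.

```python
import math
import re
from collections import Counter

# Hypothetical domain keyword list, for illustration only
DOMAIN_KEYWORDS = ("oil spill", "pipeline maintenance")

def term_counts(text):
    # Bag-of-words counts over lowercase alphanumeric tokens
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    # Cosine similarity between two sparse term-count vectors
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pair_features(query, document):
    """Build a small feature dictionary for one (query, document) pair."""
    q, d = term_counts(query), term_counts(document)
    doc_lower = document.lower()
    return {
        # Textual similarity score
        "cosine": cosine_similarity(q, d),
        # Fraction of distinct query terms that appear in the document
        "query_term_coverage": sum(1 for t in q if t in d) / len(q) if q else 0.0,
        # Domain-specific feature: number of known key phrases in the document
        "keyword_hits": sum(1 for kw in DOMAIN_KEYWORDS if kw in doc_lower),
    }

features = pair_features("oil spill", "Safety measures for oil spill response.")
print(features)
```

Each dictionary can be turned into one row of the feature matrix, alongside or instead of the TF-IDF vectors used in this notebook.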

In [4]:
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Fit one TF-IDF vocabulary on queries and documents together, so that
# document terms that never occur in a query are not silently dropped
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(pd.concat([df['query'], df['document']]))
X_queries = vectorizer.transform(df['query'])
X_documents = vectorizer.transform(df['document'])

# Combine query and document features side by side (this is a simplified example)
X = sp.hstack([X_queries, X_documents])

y = df['relevance']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


**Reranker Model:**

Train a machine learning model using these features to predict the relevance score for each document. Suitable models include:

- Logistic Regression
- Gradient Boosting Machines (e.g., XGBoost)
- Neural Networks (e.g., BERT-based models)

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train a simple logistic regression model for reranking
reranker = LogisticRegression()
reranker.fit(X_train, y_train)

# Evaluate the model
y_pred = reranker.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.00      0.00      0.00       1.0
           1       0.00      0.00      0.00       0.0

    accuracy                           0.00       1.0
   macro avg       0.00      0.00      0.00       1.0
weighted avg       0.00      0.00      0.00       1.0

With only five labeled pairs, the test split holds a single example (support 1), so these scores say nothing about reranking quality. A meaningful evaluation needs a much larger labeled set.





**Training the Reranker:**

Use a labeled dataset where the relevance of documents to queries is known. The training process involves learning to reorder documents to maximize relevance.
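
One way to check whether a reordering actually moves relevant documents upward is a rank-aware metric such as NDCG. The sketch below is an illustration under assumptions, not part of the original pipeline: the helper names `dcg_at_k` and `ndcg_at_k` and the two label lists are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: relevant items count less the lower they rank
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (best possible) ordering
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels, listed in the order the documents were returned
before_rerank = [0, 1, 0, 1, 1]
after_rerank = [1, 1, 1, 0, 0]

print("NDCG@5 before:", round(ndcg_at_k(before_rerank, 5), 3))
print("NDCG@5 after: ", round(ndcg_at_k(after_rerank, 5), 3))
```

A reranker that is doing its job should raise this score relative to the initial ordering on held-out queries.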


**Deployment:**

Implement the reranker as part of the search pipeline so that documents are reranked in real time as users perform searches.

In [None]:
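from flask import Flask, request, jsonify
import joblib
import scipy.sparse as sp

app = Flask(__name__)

# Save the trained model together with the fitted vectorizer,
# since the service must apply the same TF-IDF vocabulary
joblib.dump(reranker, 'reranker_model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

# Load the model and vectorizer, as the service would at startup
model = joblib.load('reranker_model.pkl')
vectorizer = joblib.load('vectorizer.pkl')

@app.route('/rerank', methods=['POST'])
def rerank():
    data = request.get_json(force=True)
    query = data['query']
    documents = data['documents']

    # Nothing to rank: avoid stacking an empty list of matrices
    if not documents:
        return jsonify({'ranked_documents': []})

    # Transform the query and documents using the same vectorizer
    query_vec = vectorizer.transform([query])
    docs_vec = vectorizer.transform(documents)
    # Repeat the query row once per document so each row is a (query, document) pair
    X_rerank = sp.hstack([sp.vstack([query_vec] * len(documents)), docs_vec])

    # Predict relevance scores (probability of the positive class)
    scores = model.predict_proba(X_rerank)[:, 1]
    # Sort on the score only, so ties are not broken by comparing document text
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    ranked_docs = [doc for _, doc in ranked]

    return jsonify({'ranked_documents': ranked_docs})

if __name__ == '__main__':
    # debug=True is intended for local development only
    app.run(debug=True)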


To use the reranker, send a POST request to the /rerank endpoint with a query and a list of documents to be reranked:

In [None]:
curl -X POST -H "Content-Type: application/json" -d '{"query": "oil spill", "documents": ["Report on oil spill in the Gulf of Mexico.", "Routine maintenance for offshore oil rig.", "Safety measures for oil spill response."]}' http://127.0.0.1:5000/rerank
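
The same request can be issued from Python. This is a minimal sketch using only the standard library; the helper names `build_rerank_request` and `rerank_remote` are hypothetical, and the URL assumes the Flask development server from this notebook is running locally on its default port. Only the request is built here; `rerank_remote` sends it and needs the server to be up.

```python
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:5000/rerank"  # assumes the local Flask dev server

def build_rerank_request(query, documents, url=DEFAULT_URL):
    # Encode the payload in the JSON shape the /rerank endpoint expects
    payload = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def rerank_remote(query, documents, url=DEFAULT_URL):
    # Send the request and return the reranked document list
    request_obj = build_rerank_request(query, documents, url)
    with urllib.request.urlopen(request_obj, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))["ranked_documents"]

example_request = build_rerank_request(
    "oil spill",
    [
        "Report on oil spill in the Gulf of Mexico.",
        "Routine maintenance for offshore oil rig.",
        "Safety measures for oil spill response.",
    ],
)
print(example_request.get_method(), example_request.full_url)
```

With the server running, `rerank_remote("oil spill", [...])` returns the documents in the order chosen by the reranker.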
