
This notebook builds a text classifier for incident reports, using pandas for data handling, scikit-learn for machine learning, and Flask for deployment. It walks through reading the data, preprocessing, model training, hyperparameter tuning, and deployment.

Assumptions:

Incident reports are stored in a CSV file with two columns: report (the text of the incident report) and category (the label).

The categories are "Equipment Failure", "Safety Hazard", "Environmental Issue", "Maintenance Required", and "Other".

In [None]:
import pandas as pd

# Read the data
data = pd.read_csv('incident_reports.csv')
print(data.head())
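
The assumptions listed earlier can be checked right after loading, before any training happens. The following is a minimal sketch, not part of the original notebook: the helper name `validate_reports` and the in-memory sample are hypothetical, and dropping rows with a missing report or label is one possible policy among several.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = {"report", "category"}
EXPECTED_CATEGORIES = {
    "Equipment Failure",
    "Safety Hazard",
    "Environmental Issue",
    "Maintenance Required",
    "Other",
}


def validate_reports(df):
    """Return a cleaned copy of df, raising ValueError if the schema is wrong."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # Drop rows that have no report text or no label (assumed policy)
    df = df.dropna(subset=["report", "category"])
    unknown = set(df["category"]) - EXPECTED_CATEGORIES
    if unknown:
        raise ValueError(f"Unexpected categories: {sorted(unknown)}")
    return df


# Hypothetical sample standing in for incident_reports.csv
sample_csv = io.StringIO(
    "report,category\n"
    "Oil leak detected in pump station,Equipment Failure\n"
    "Exposed wiring near walkway,Safety Hazard\n"
    ",Other\n"
)
clean = validate_reports(pd.read_csv(sample_csv))
print(len(clean))
```

With the real file, the same helper could be applied as `data = validate_reports(data)`.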


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# Separate the report text (features) from the category labels
X = data['report']
y = data['category']

# Encode the labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
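
The `stratify=y` argument in the split above keeps the class proportions the same in the training and test sets, which matters when some incident categories are rare. A small sketch with made-up, imbalanced labels (not the incident data) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 3:1 class imbalance
y = np.array([0] * 30 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of each class in the two splits
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)
print(train_share, test_share)
```

Both splits keep the 75/25 ratio of the full label array. Without `stratify`, a small test set could end up with very few examples of a rare class.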


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Define a pipeline combining a text feature extractor with a classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear'))
])

# Define hyperparameters for tuning
param_grid = {
    'tfidf__max_df': [0.75, 1.0],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated accuracy
# (accuracy is the default scorer for classifiers)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.3f}")

# Evaluate on test set
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))
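
To get a sense of how much work the grid search above does, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. This sketch only repeats the grid from the cell above and multiplies by the number of folds; it does not train anything.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
param_grid = {
    'tfidf__max_df': [0.75, 1.0],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
}

n_candidates = len(ParameterGrid(param_grid))  # every combination of the three lists
n_folds = 5                                    # matches cv=5
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

There are 2 × 2 × 3 = 12 candidates and 60 cross-validation fits, plus one final refit of the best candidate on the full training set, which `GridSearchCV` performs by default.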


Model Deployment with Flask

In [None]:
from flask import Flask, request, jsonify

app = Flask(__name__)

# Use the best model from the grid search
model = grid_search.best_estimator_

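@app.route('/predict', methods=['POST'])
def predict():
    # silent=True returns None instead of raising on malformed JSON
    payload = request.get_json(force=True, silent=True)
    if not isinstance(payload, dict) or 'report' not in payload:
        return jsonify({'error': "Request body must be JSON with a 'report' field"}), 400
    prediction = model.predict([payload['report']])
    predicted_category = label_encoder.inverse_transform(prediction)[0]
    return jsonify({'category': str(predicted_category)})

if __name__ == '__main__':
    # The debug reloader does not work inside a notebook kernel, so disable it here
    app.run(debug=True, use_reloader=False)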


Save the Model and Label Encoder

In [None]:
import joblib

# Save the model and label encoder
joblib.dump(grid_search.best_estimator_, 'incident_classifier_model.pkl')
joblib.dump(label_encoder, 'label_encoder.pkl')
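
Before relying on the saved files, it can be worth confirming that a reloaded model predicts exactly like the one in memory. The sketch below is self-contained and uses a tiny made-up training set and a temporary directory rather than the notebook's `grid_search` object; only the file name matches the cell above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data so the sketch runs on its own
reports = [
    "Pump motor stopped working",
    "Conveyor belt motor failure",
    "Chemical spill near the river",
    "Oil spill contaminated the soil",
]
labels = [
    "Equipment Failure",
    "Equipment Failure",
    "Environmental Issue",
    "Environmental Issue",
]

model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear')),
])
model.fit(reports, labels)

# Save to a temporary directory, then load the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'incident_classifier_model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = list(model.predict(reports)) == list(restored.predict(reports))
print(same_predictions)
```

Files written by joblib are pickle-based, so they should be loaded with the same scikit-learn version that wrote them, and only from trusted sources.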


Loading the Model in Flask

Modify the Flask application to load the saved model and label encoder, and save it as app.py so it can run as a standalone script.

In [None]:
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the model and label encoder
model = joblib.load('incident_classifier_model.pkl')
label_encoder = joblib.load('label_encoder.pkl')

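@app.route('/predict', methods=['POST'])
def predict():
    # silent=True returns None instead of raising on malformed JSON
    payload = request.get_json(force=True, silent=True)
    if not isinstance(payload, dict) or 'report' not in payload:
        return jsonify({'error': "Request body must be JSON with a 'report' field"}), 400
    prediction = model.predict([payload['report']])
    predicted_category = label_encoder.inverse_transform(prediction)[0]
    return jsonify({'category': str(predicted_category)})

if __name__ == '__main__':
    # debug=True is for local development only; turn it off in production
    app.run(debug=True)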


Running the Flask Application

To run the Flask application, execute the script from a terminal. The server blocks until it is stopped, so it is best run outside the notebook:

In [None]:
python app.py


You can then send POST requests with incident reports to the /predict endpoint to get classifications.

Example POST Request

Using curl or a tool like Postman, send a request to the Flask app:

In [None]:
curl -X POST -H "Content-Type: application/json" -d '{"report": "Oil leak detected in pump station"}' http://127.0.0.1:5000/predict
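
The endpoint can also be exercised without starting a server by using Flask's built-in test client. The sketch below is an illustration under stated assumptions: it trains a tiny pipeline on made-up reports in place of the saved model, and it omits input validation to stay short.

```python
from flask import Flask, request, jsonify
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the trained model
reports = [
    "Pump motor stopped working",
    "Conveyor belt motor failure",
    "Chemical spill near the river",
    "Oil spill contaminated the soil",
]
categories = [
    "Equipment Failure",
    "Equipment Failure",
    "Environmental Issue",
    "Environmental Issue",
]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories)
model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear')),
]).fit(reports, y)

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def predict():
    payload = request.get_json(force=True)
    prediction = model.predict([payload['report']])
    predicted_category = label_encoder.inverse_transform(prediction)[0]
    return jsonify({'category': str(predicted_category)})


# Post the same payload as the curl example, in-process
with app.test_client() as client:
    response = client.post(
        '/predict', json={'report': 'Oil leak detected in pump station'}
    )
    status = response.status_code
    body = response.get_json()

print(status, body)
```

The response carries a single `category` key, the same shape the curl request receives from the running server.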
