<a href="https://colab.research.google.com/github/ummeamunira/NLP-LLM/blob/main/Text-classification/NLP_Text_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook builds a text classifier for incident reports. We'll use pandas for data handling, scikit-learn for machine learning, and Flask for deployment. The example covers reading the data, preprocessing, model training, hyperparameter tuning, and deployment.

Assumptions:

Incident reports are stored in a CSV file. The model uses two of its columns: report (the text of the incident report) and category (the label). Other columns, such as ID, Date, and Severity, are not used for training.

The sample data below uses three categories: "Safety", "Equipment Failure", and "Environmental".
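The assumptions above can be checked before any training starts. The following is a minimal sketch, not part of the original notebook: the helper name `check_incident_data` and the `min_per_class` threshold are hypothetical, chosen because a stratified train/test split needs at least two examples of every category.

```python
import pandas as pd

# Column names taken from the assumptions above.
REQUIRED_COLUMNS = ("report", "category")


def check_incident_data(df, min_per_class=2):
    """Return a cleaned copy of df, or raise ValueError if it cannot be used.

    min_per_class is a hypothetical threshold: a stratified split needs
    at least two examples of every category.
    """
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # Drop rows where either the text or the label is absent.
    cleaned = df.dropna(subset=list(REQUIRED_COLUMNS)).reset_index(drop=True)

    # Reject categories that are too rare to split.
    counts = cleaned["category"].value_counts()
    rare = counts[counts < min_per_class]
    if not rare.empty:
        raise ValueError(f"Too few examples for: {list(rare.index)}")
    return cleaned


# Toy frame for illustration only; the third row has no report text.
toy = pd.DataFrame({
    "report": ["Oil spill in storage area", "Minor spillage of coolant", None,
               "Conveyor belt malfunctioned", "Generator breakdown"],
    "category": ["Environmental", "Environmental", "Safety",
                 "Equipment Failure", "Equipment Failure"],
})
checked = check_incident_data(toy)
print(checked["category"].value_counts())
```

Running a check like this right after `pd.read_csv` surfaces schema problems early, instead of as a less obvious error inside `train_test_split`.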

In [None]:
import pandas as pd

# Sample data
data = {
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    "Date": [
        "2024-01-15", "2024-02-10", "2024-03-05", "2024-01-20", "2024-02-14",
        "2024-03-08", "2024-01-25", "2024-02-18", "2024-03-12", "2024-01-30",
        "2024-02-22", "2024-03-15", "2024-01-28", "2024-02-25", "2024-03-18"
    ],
    "category": [
        "Safety", "Safety", "Safety", "Equipment Failure", "Equipment Failure",
        "Equipment Failure", "Environmental", "Environmental", "Environmental",
        "Safety", "Equipment Failure", "Environmental", "Safety", "Equipment Failure",
        "Environmental"
    ],
    "report": [
        "Worker slipped on wet floor, minor injury reported.",
        "Employee not wearing proper protective equipment, no injury.",
        "Fall from ladder, major injury, hospitalization required.",
        "Conveyor belt malfunctioned, production halted for 2 hours.",
        "Generator breakdown caused power outage in section B.",
        "Air compressor failure, minor impact on operations.",
        "Oil spill in storage area, contained within 30 minutes.",
        "Chemical leak detected in waste disposal unit, no external contamination reported.",
        "Excessive smoke emissions from furnace stack, environmental team notified.",
        "Forklift accident, operator bruised but no major injuries.",
        "Hydraulic system failure in press machine, repairs took 4 hours.",
        "Unauthorized waste disposal, regulatory authorities informed.",
        "Minor burn injury due to contact with hot surface.",
        "Conveyor motor overheating, maintenance required.",
        "Minor spillage of coolant, cleaned up with no further issues."
    ],
    "Severity": [
        "Low", "Medium", "High", "Medium", "High",
        "Low", "Medium", "Low", "High",
        "Medium", "High", "High", "Low", "Medium", "Low"
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Save DataFrame to CSV (the next cell reads this same file name)
df.to_csv('incident_reports.csv', index=False)

print("CSV file saved successfully.")


CSV file saved successfully.


In [None]:
import pandas as pd

# Read the data
data = pd.read_csv('incident_reports.csv')
print(data.head())


   IncidentID        Date           Location  \
0           1  2023-05-01        Warehouse 1   
1           2  2023-05-03  Production Line 2   
2           3  2023-05-05        Warehouse 2   
3           4  2023-05-07             Office   
4           5  2023-05-10  Production Line 1   

                                              report    category  
0         Worker slipped and fell due to a wet floor      Safety  
1       Machine malfunction caused a production halt   Equipment  
2       Fire alarm triggered due to electrical fault      Safety  
3  Employee experienced a minor electric shock wh...  Electrical  
4      Worker injured hand while operating machinery      Safety  


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# Select the report text as features and the category as the target
X = data['report']
y = data['category']

# Encode the labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [None]:
y_train

array([0, 1, 2, 0, 2, 1, 0, 2, 2, 0, 1, 1])
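The integers in `y_train` are the codes that `LabelEncoder` assigns to the category names. As a small illustration (refitting a fresh encoder on the three category names from the sample data rather than reusing the notebook's variable), the mapping can be inspected like this:

```python
from sklearn.preprocessing import LabelEncoder

# The three category names used in the sample data.
categories = ["Safety", "Equipment Failure", "Environmental"]

label_encoder = LabelEncoder()
label_encoder.fit(categories)

# classes_ is sorted, so the position of each name is its integer code.
code_to_name = {code: str(name) for code, name in enumerate(label_encoder.classes_)}
print(code_to_name)

# inverse_transform turns codes back into readable labels.
decoded = label_encoder.inverse_transform([0, 1, 2, 0])
print(list(decoded))
```

Because the classes are sorted alphabetically, 0 stands for "Environmental", 1 for "Equipment Failure", and 2 for "Safety".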

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Define a pipeline combining a text feature extractor with a classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear'))
])

# Define hyperparameters for tuning
param_grid = {
    'tfidf__max_df': [0.75, 1.0],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")

# Evaluate on test set
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))


Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best parameters: {'clf__C': 10, 'tfidf__max_df': 0.75, 'tfidf__ngram_range': (1, 1)}
Best cross-validation score: 0.5833333333333334
                   precision    recall  f1-score   support

    Environmental       1.00      1.00      1.00         1
Equipment Failure       1.00      1.00      1.00         1
           Safety       1.00      1.00      1.00         1

         accuracy                           1.00         3
        macro avg       1.00      1.00      1.00         3
     weighted avg       1.00      1.00      1.00         3
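The test set above holds only three reports, so a perfect score says little about how the model generalizes. One possible alternative, sketched here under assumptions, is to collect out-of-fold predictions for every report with `cross_val_predict`. The sketch uses nine reports copied from the sample data, fixes `C=10` from the grid search result instead of re-tuning inside each fold, and swaps in scikit-learn's default solver rather than `liblinear`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline

# Nine reports copied from the sample data, three per category.
reports = [
    "Worker slipped on wet floor, minor injury reported.",
    "Fall from ladder, major injury, hospitalization required.",
    "Minor burn injury due to contact with hot surface.",
    "Conveyor belt malfunctioned, production halted for 2 hours.",
    "Generator breakdown caused power outage in section B.",
    "Hydraulic system failure in press machine, repairs took 4 hours.",
    "Oil spill in storage area, contained within 30 minutes.",
    "Excessive smoke emissions from furnace stack, environmental team notified.",
    "Minor spillage of coolant, cleaned up with no further issues.",
]
labels = ["Safety"] * 3 + ["Equipment Failure"] * 3 + ["Environmental"] * 3

# Fixed hyperparameters (assumption): no tuning happens inside the folds.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(C=10, max_iter=1000)),
])

# Each report is predicted by a model that never saw it during fitting.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
predictions = cross_val_predict(pipeline, reports, labels, cv=cv)

print(classification_report(labels, predictions, zero_division=0))
```

With this approach every report contributes to the evaluation, which gives a less optimistic picture than a three-sample hold-out set on a dataset this small.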



Model Deployment with Flask

In [None]:
from flask import Flask, request, jsonify

app = Flask(__name__)

# Use the best model from the grid search
model = grid_search.best_estimator_

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    report = data['report']
    prediction = model.predict([report])
    predicted_category = label_encoder.inverse_transform(prediction)[0]
    return jsonify({'category': predicted_category})

if __name__ == '__main__':
    app.run(debug=True)


 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with stat


Save and Load the Model

In [None]:
import joblib

# Save the model and label encoder
joblib.dump(grid_search.best_estimator_, 'incident_classifier_model.pkl')
joblib.dump(label_encoder, 'label_encoder.pkl')


['label_encoder.pkl']
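Before relying on saved files, it can help to confirm that a reloaded model predicts exactly as the in-memory one does. The sketch below is an illustration under assumptions: it trains a small stand-in pipeline on six reports from the sample data, and it stores the model and the label encoder together in one file (the name `incident_classifier_bundle.pkl` is hypothetical) so the two cannot drift apart.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Small stand-in training set taken from the sample data.
reports = [
    "Worker slipped on wet floor, minor injury reported.",
    "Fall from ladder, major injury, hospitalization required.",
    "Conveyor belt malfunctioned, production halted for 2 hours.",
    "Generator breakdown caused power outage in section B.",
    "Oil spill in storage area, contained within 30 minutes.",
    "Minor spillage of coolant, cleaned up with no further issues.",
]
labels = ["Safety", "Safety", "Equipment Failure",
          "Equipment Failure", "Environmental", "Environmental"]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(reports, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; a temporary directory keeps the sketch tidy.
    path = os.path.join(tmp_dir, "incident_classifier_bundle.pkl")
    joblib.dump({"model": model, "label_encoder": label_encoder}, path)
    bundle = joblib.load(path)

restored_model = bundle["model"]
restored_encoder = bundle["label_encoder"]

# The reloaded pipeline should reproduce the original predictions.
same = (restored_model.predict(reports) == model.predict(reports)).all()
print("Predictions match after reload:", same)
```

Files written by joblib are pickles, so they should only be loaded from trusted sources and with a compatible scikit-learn version.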

Loading the Model in Flask

Modify the Flask application to load the saved model and label encoder.

In [None]:
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the model and label encoder
model = joblib.load('incident_classifier_model.pkl')
label_encoder = joblib.load('label_encoder.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    report = data['report']
    prediction = model.predict([report])
    predicted_category = label_encoder.inverse_transform(prediction)[0]
    return jsonify({'category': predicted_category})

if __name__ == '__main__':
    app.run(debug=True)


 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with stat
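The `/predict` handler above reads `data['report']` directly, so a request without that key would fail with a server error instead of a clear message. The following sketch shows one way to validate the payload and to exercise the endpoint with Flask's test client, which needs no running server. It is an illustration under assumptions: the stand-in model is trained inline on string labels (so no label encoder is needed), and the error message wording is made up.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in model trained on a few reports from the sample data.
reports = [
    "Worker slipped on wet floor, minor injury reported.",
    "Fall from ladder, major injury, hospitalization required.",
    "Conveyor belt malfunctioned, production halted for 2 hours.",
    "Generator breakdown caused power outage in section B.",
    "Oil spill in storage area, contained within 30 minutes.",
    "Minor spillage of coolant, cleaned up with no further issues.",
]
labels = ["Safety", "Safety", "Equipment Failure",
          "Equipment Failure", "Environmental", "Environmental"]
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(reports, labels)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a bad or missing body.
    payload = request.get_json(silent=True)
    report = payload.get("report") if isinstance(payload, dict) else None
    if not isinstance(report, str) or not report.strip():
        return jsonify({"error": "Expected JSON with a non-empty 'report' string."}), 400
    category = model.predict([report])[0]
    return jsonify({"category": str(category)})


# Exercise the endpoint in-process, without starting a server.
client = app.test_client()
ok = client.post("/predict", json={"report": "Oil leak detected in pump station"})
bad = client.post("/predict", json={"text": "wrong key"})
print(ok.status_code, ok.get_json())
print(bad.status_code, bad.get_json())
```

The test client also avoids the reloader restart that `app.run(debug=True)` triggers, which makes it convenient inside a notebook.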


Running the Flask Application

To run the Flask application outside the notebook, save the code above as app.py and execute the script from a terminal:

In [None]:
python app.py


You can then send POST requests with incident reports to the /predict endpoint to get classifications.

Example POST Request

Using curl or a tool like Postman, send a request to the Flask app:

In [None]:
curl -X POST -H "Content-Type: application/json" -d '{"report": "Oil leak detected in pump station"}' http://127.0.0.1:5000/predict
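The same request can be sent from Python using only the standard library. This is a sketch, not part of the original notebook: the helper names `build_predict_request` and `classify` are hypothetical, and the URL is the address shown in the Flask output above. Building the request is kept separate from sending it so that the payload can be inspected without a running server.

```python
import json
import urllib.request

# Address taken from the Flask startup output above.
PREDICT_URL = "http://127.0.0.1:5000/predict"


def build_predict_request(report, url=PREDICT_URL):
    """Build the POST request that mirrors the curl command."""
    body = json.dumps({"report": report}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(report, url=PREDICT_URL, timeout=5):
    """Send the report to the running Flask app and return its category."""
    req = build_predict_request(report, url)
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["category"]


# Inspect the request without sending it.
req = build_predict_request("Oil leak detected in pump station")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# With the Flask app running, the call below would return the prediction:
# print(classify("Oil leak detected in pump station"))
```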
