<a href="https://colab.research.google.com/github/swagatskalita092/Random-Forest-pipeline/blob/main/RandomForest__model_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import Required Libraries

In [3]:
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report


Load the Cleaned Dataset

In [4]:
file_path = "/content/cleaned_dataset.csv"  # Ensure the cleaned dataset is uploaded
df = pd.read_csv(file_path)


In [5]:
print("📌 Dataset Overview:")
print(df.info())

📌 Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72289 entries, 0 to 72288
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   issue_url                 72289 non-null  object
 1   issue_label               72289 non-null  object
 2   issue_created_at          72289 non-null  object
 3   issue_author_association  72289 non-null  object
 4   repository_url            72289 non-null  object
 5   issue_title               72289 non-null  object
 6   issue_body                72289 non-null  object
 7   processed_issue_body      72212 non-null  object
dtypes: object(8)
memory usage: 4.4+ MB
None


In [6]:
print("\n❌ Missing Values in Cleaned Data:")
print(df.isnull().sum())


❌ Missing Values in Cleaned Data:
issue_url                    0
issue_label                  0
issue_created_at             0
issue_author_association     0
repository_url               0
issue_title                  0
issue_body                   0
processed_issue_body        77
dtype: int64
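The 77 missing values in `processed_issue_body` matter because a TF-IDF vectorizer expects string documents rather than NaN. A minimal sketch of detecting and filling such gaps on a toy Series follows; the sample strings and the empty-string fill value are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the processed_issue_body column (values are made up)
toy_bodies = pd.Series(["crash on startup", None, "add dark mode", float("nan")])

# Count the missing entries, as df.isnull().sum() does per column
n_missing = int(toy_bodies.isnull().sum())

# Replace missing entries with an empty string so every row is a valid document
filled_bodies = toy_bodies.fillna("")

print(f"Missing before fill: {n_missing}")
print(f"Missing after fill: {int(filled_bodies.isnull().sum())}")
```

Dropping the affected rows is another option; filling keeps the row count unchanged, so the labels stay aligned without re-indexing.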


Define Features (X) and Target Labels (y)

In [7]:
X = df["processed_issue_body"]  # Preprocessed text (77 rows are still NaN; they are filled before training)
y = df["issue_label"]  # Target labels (classification output)

Split Data into Training (80%) and Testing (20%) Sets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
print(f"\n✅ Training Data: {len(X_train)} samples")
print(f"✅ Testing Data: {len(X_test)} samples")


✅ Training Data: 57831 samples
✅ Testing Data: 14458 samples
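The split sizes above can be checked by hand. The sketch below assumes the test share is rounded up and the training set takes the remainder, which is consistent with the counts printed above.

```python
import math

n_samples = 72289  # rows reported by df.info()
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is the ceiling of test_size * n_samples
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"Training Data: {n_train} samples")
print(f"Testing Data: {n_test} samples")
```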


Build Random Forest Model Pipeline

In [11]:
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),  # Convert text to numerical form using TF-IDF
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))  # Train Random Forest Classifier
])
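To illustrate what the `tfidf` step computes, here is a simplified pure-Python sketch of smoothed TF-IDF weighting with L2 normalization, the scheme scikit-learn documents as its default. The toy documents and the whitespace tokenizer are assumptions; the real vectorizer also lowercases text, applies its own token pattern, and with `max_features=5000` keeps only the most frequent terms in the corpus.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = [
    "app crashes on startup",
    "add dark mode to app",
    "how to configure the app",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(tokens)
    weights = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_vector = tfidf_vector(tokenized[0])
for term, weight in sorted(first_vector.items()):
    print(f"{term}: {weight:.3f}")
```

A term that occurs in every document, such as `app` here, gets the minimum IDF of 1 and so contributes less than rarer terms like `crashes`.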

Train the Model

In [13]:
# Redefine the features, replacing missing text with a placeholder string
X = df["processed_issue_body"].fillna("No description available")
y = df["issue_label"]  # Target labels (classification output)

# Redo the 80/20 split so the training and test sets use the filled text
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 🔹 Confirm that no NaN values remain after the fill
print(f"\n🔍 Missing Values in X_train: {X_train.isnull().sum()}")
print(f"🔍 Missing Values in X_test: {X_test.isnull().sum()}")

# 🔹 Defensive cleanup: a no-op here, because X was already filled above
X_train = X_train.fillna("")
X_test = X_test.fillna("")



🔍 Missing Values in X_train: 0
🔍 Missing Values in X_test: 0


In [14]:
pipeline.fit(X_train, y_train)

In [15]:
print("\n✅ Random Forest Model Training Completed!")



✅ Random Forest Model Training Completed!


In [16]:
# Model evaluation: predict labels for the held-out test set
y_pred = pipeline.predict(X_test)

Calculate Accuracy

In [18]:
accuracy = accuracy_score(y_test, y_pred)
print(f"\n Model Accuracy: {accuracy:.4f}")



 Model Accuracy: 0.7352
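Accuracy is the fraction of test samples whose predicted label equals the true label. A small sketch of that definition follows; the helper name `accuracy` and the toy labels are hypothetical and only mirror what `accuracy_score` reports by default.

```python
def accuracy(y_true, y_pred):
    """Return the share of positions where the prediction matches the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels (made up): two of the four predictions are correct
toy_true = ["bug", "bug", "enhancement", "question"]
toy_pred = ["bug", "enhancement", "enhancement", "bug"]

toy_accuracy = accuracy(toy_true, toy_pred)
print(f"Toy accuracy: {toy_accuracy:.4f}")
```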


Classification Report

In [19]:
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred))


📊 Classification Report:
              precision    recall  f1-score   support

         bug       0.75      0.81      0.78      7236
 enhancement       0.72      0.78      0.75      5935
    question       0.62      0.09      0.16      1287

    accuracy                           0.74     14458
   macro avg       0.70      0.56      0.56     14458
weighted avg       0.73      0.74      0.71     14458
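The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows: the macro average weights each class equally, while the weighted average weights each class by its support. The sketch below copies the rounded figures from the report above, so recomputed values may differ from the printed ones in the last digit. It also derives a majority-class baseline (always predicting `bug`) as a reference point; that baseline is an addition for comparison, not part of the original report.

```python
# Per-class figures copied from the classification report above
report = {
    "bug":         {"precision": 0.75, "recall": 0.81, "f1": 0.78, "support": 7236},
    "enhancement": {"precision": 0.72, "recall": 0.78, "f1": 0.75, "support": 5935},
    "question":    {"precision": 0.62, "recall": 0.09, "f1": 0.16, "support": 1287},
}
total_support = sum(row["support"] for row in report.values())

def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)

def weighted_avg(metric):
    """Mean over classes weighted by the number of true samples per class."""
    return sum(row[metric] * row["support"] for row in report.values()) / total_support

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(row["support"] for row in report.values()) / total_support

for metric in ("precision", "recall", "f1"):
    print(f"{metric}: macro={macro_avg(metric):.2f}, weighted={weighted_avg(metric):.2f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

The low recall for `question` (0.09) pulls the macro averages well below the weighted ones, because that class has only 1287 of the 14458 test samples.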



Save the Model

In [20]:
model_path = "/content/random_forest_issue_classifier.pkl"
joblib.dump(pipeline, model_path)
print(f"\n Model saved at {model_path}")


 Model saved at /content/random_forest_issue_classifier.pkl
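A saved pipeline can later be restored with `joblib.load` and used for prediction without retraining. The sketch below shows the round trip on a tiny stand-in pipeline; the toy texts, labels, file name, and temporary directory are assumptions so the example runs on its own rather than depending on `/content/random_forest_issue_classifier.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy training data (made up for illustration)
toy_texts = [
    "app crashes with error on startup",
    "null pointer error when saving file",
    "please add dark mode support",
    "add option to export as pdf",
]
toy_labels = ["bug", "bug", "enhancement", "enhancement"]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
toy_pipeline.fit(toy_texts, toy_labels)

# Save to a temporary location, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_issue_classifier.pkl")
    joblib.dump(toy_pipeline, toy_path)
    reloaded_pipeline = joblib.load(toy_path)

# The reloaded pipeline accepts raw text, since the vectorizer is stored with it
new_issues = ["error when opening the settings page"]
original_pred = list(toy_pipeline.predict(new_issues))
reloaded_pred = list(reloaded_pipeline.predict(new_issues))
print(f"Original: {original_pred}, reloaded: {reloaded_pred}")
```

Because the file is a pickle, it is generally safest to load it with the same scikit-learn version that produced it, and only from a trusted source.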
