<h1>Rails Issues Classifier</h1>
<p>This project aims to fine-tune a classifier that predicts the labels of Ruby on Rails issues.</p>
<p>Project sections:</p>
<ul>
<li>Initialize the project (read and process the data)</li>
<li>Test pre-trained models</li>
<li>Fine-tune the best pre-trained model</li>
</ul>

<h2>Section 1: Initialize the Project</h2>

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/rails-issues/rails_issues.csv


In [108]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, pipeline
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay, classification_report, precision_recall_fscore_support, accuracy_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from torch import nn
import pandas as pd
import numpy as np
import torch
import ast

In [4]:
file_path = '/kaggle/input/rails-issues/rails_issues.csv'
df = pd.read_csv(file_path)
df

Unnamed: 0,id,number,title,description,labels,comments,author,created_at,state
0,2661216314,53649,db:drop db:create db:migrate - run only differ...,"When I do `db:drop`, the **database is deleted...",[],2,kubatron117,2024-11-15T08:33:54Z,open
1,2660906300,53648,Active model serializer is not working on Rails 8,### Steps to reproduce\r\n1. install `active_m...,[],5,anthonylee1994,2024-11-15T06:16:31Z,closed
2,2660517937,53646,Help me pleas,,[],0,Madesupra27,2024-11-15T02:14:12Z,closed
3,2660462567,53645,Skip annotation in the page title,### Steps to reproduce\r\nWhen rendering a vie...,[],0,huda-kh,2024-11-15T01:13:39Z,open
4,2658844435,53638,find_each with model repressing a view raise e...,Hello 👋 \r\n\r\nBumping from rails 7.2 to rail...,[],6,aandrieu,2024-11-14T13:34:39Z,closed
...,...,...,...,...,...,...,...,...,...
5341,420300746,35597,Rails bi-directional accepts_nested_attributes...,### Steps to reproduce\r\nLet's say I have 2 m...,"['activerecord', 'stale']",6,hashwin,2019-03-13T03:29:42Z,closed
5342,420244243,35594,Fix random test failure in Active Record i18n ...,This test fails:\r\n\r\n```\r\n# Running:\r\n\...,['good first issue'],4,kaspth,2019-03-12T23:04:28Z,closed
5343,420214598,35592,Unable to use partial locals when using 'rende...,Consider the following two ways of rendering a...,[],1,taylorthurlow,2019-03-12T21:29:48Z,closed
5344,420207317,35590,ActionCable requests High Response Time,High Response Time on `/cable` endpoint when u...,[],6,cesar3030,2019-03-12T21:10:28Z,closed


In [5]:
# Convert the 'labels' column from string representations to actual Python lists
df['labels'] = df['labels'].apply(ast.literal_eval)

# Convert 'created_at' column to datetime format
df['created_at'] = pd.to_datetime(df['created_at'])
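<p>A minimal sketch of what the <code>ast.literal_eval</code> step does. The raw values below are hypothetical but shaped like the <code>labels</code> column as it comes out of the CSV, where each list is stored as a string.</p>

```python
import ast

# Hypothetical raw values, shaped like the 'labels' column as read from the CSV
raw_labels = ["[]", "['activerecord', 'stale']", "['good first issue']"]

# literal_eval parses Python literals only; it does not execute arbitrary code
parsed = [ast.literal_eval(value) for value in raw_labels]

print(parsed)
print(type(raw_labels[1]).__name__, "->", type(parsed[1]).__name__)
```

<p>After this conversion, <code>len(x)</code> counts labels rather than characters, which is what the later filtering relies on.</p>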

<p style='color:red'>We keep only rows that have at least one label and a non-empty description, because the classifier takes the description as input and must output a label.</p>

In [99]:
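# Keep rows with at least one label and a non-blank description.
# fillna('') matters here: str(NaN) is 'nan', which would pass a plain length check.
df_non_empty_labels = df[
    (df['labels'].apply(len) > 0)
    & (df['description'].fillna('').astype(str).str.strip() != '')
]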

df_non_empty_labels

Unnamed: 0,id,number,title,description,labels,comments,author,created_at,state
5,2658345715,53637,Two get requests in a functional test,### Steps to reproduce\r\n\r\nI am developing ...,[more-information-needed],1,Mathiezz,2024-11-14 10:26:55+00:00,open
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,"[activerecord, attached PR, With reproduction ...",14,jasonperrone,2024-11-13 15:35:17+00:00,closed
16,2648880719,53590,"Polymorphic where clause accepts any object, e...",When using a wrapper class that exposes the id...,"[activerecord, attached PR, With reproduction ...",0,nyikos-zoltan,2024-11-11 10:27:28+00:00,open
23,2645883411,53568,[Getting Started] Tutorial suggest modifying m...,### Steps to reproduce\r\n<!-- (Guidelines for...,"[docs, good first issue]",5,ml4nC3,2024-11-09 10:02:21+00:00,closed
24,2645266168,53565,ActiveRecord::QueryLogs tags being appended to...,### Steps to reproduce\r\n<!-- (Guidelines for...,"[activerecord, attached PR, With reproduction ...",0,mnordin,2024-11-08 22:14:51+00:00,closed
...,...,...,...,...,...,...,...,...,...
5337,420861596,35608,Make ActiveJob::Base more composable,### Steps to reproduce\r\nCurrently I'm using ...,[activejob],1,al3rez,2019-03-14 07:05:17+00:00,closed
5338,420836686,35606,6.0.0.beta3: render partial: wrong number of a...,### Steps to reproduce\r\nAfter updating to 6....,[actionview],3,olegantonyan,2019-03-14 05:21:56+00:00,closed
5340,420418447,35600,Setting `inverse_of` on `belongs_to polymorphi...,### Steps to reproduce\r\nhttps://gist.github....,"[activerecord, stale]",3,allenwu1973,2019-03-13 10:18:09+00:00,closed
5341,420300746,35597,Rails bi-directional accepts_nested_attributes...,### Steps to reproduce\r\nLet's say I have 2 m...,"[activerecord, stale]",6,hashwin,2019-03-13 03:29:42+00:00,closed


In [None]:
# The classifier outputs a single label, so split each multi-label row into one row per label
df_non_empty_labels_exploded = df_non_empty_labels.explode('labels')
df_non_empty_labels_exploded

Unnamed: 0,id,number,title,description,labels,comments,author,created_at,state
5,2658345715,53637,Two get requests in a functional test,### Steps to reproduce\r\n\r\nI am developing ...,more-information-needed,1,Mathiezz,2024-11-14 10:26:55+00:00,open
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,activerecord,14,jasonperrone,2024-11-13 15:35:17+00:00,closed
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,attached PR,14,jasonperrone,2024-11-13 15:35:17+00:00,closed
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,With reproduction steps,14,jasonperrone,2024-11-13 15:35:17+00:00,closed
16,2648880719,53590,"Polymorphic where clause accepts any object, e...",When using a wrapper class that exposes the id...,activerecord,0,nyikos-zoltan,2024-11-11 10:27:28+00:00,open
...,...,...,...,...,...,...,...,...,...
5340,420418447,35600,Setting `inverse_of` on `belongs_to polymorphi...,### Steps to reproduce\r\nhttps://gist.github....,activerecord,3,allenwu1973,2019-03-13 10:18:09+00:00,closed
5340,420418447,35600,Setting `inverse_of` on `belongs_to polymorphi...,### Steps to reproduce\r\nhttps://gist.github....,stale,3,allenwu1973,2019-03-13 10:18:09+00:00,closed
5341,420300746,35597,Rails bi-directional accepts_nested_attributes...,### Steps to reproduce\r\nLet's say I have 2 m...,activerecord,6,hashwin,2019-03-13 03:29:42+00:00,closed
5341,420300746,35597,Rails bi-directional accepts_nested_attributes...,### Steps to reproduce\r\nLet's say I have 2 m...,stale,6,hashwin,2019-03-13 03:29:42+00:00,closed
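<p>To make the effect of <code>explode</code> concrete, here is a small sketch on a hypothetical two-row frame (the issue numbers and labels are made up for illustration).</p>

```python
import pandas as pd

# Hypothetical toy frame: issue 1 has two labels, issue 2 has one
toy = pd.DataFrame({
    "number": [1, 2],
    "labels": [["activerecord", "stale"], ["docs"]],
})

# Each list element becomes its own row; the other columns are repeated
exploded = toy.explode("labels")
print(exploded)
```

<p>Note that the original index is repeated for rows that come from the same issue (here <code>0, 0, 1</code>), as in the output above. A later <code>.loc[index, ...]</code> assignment therefore touches every row that shares that index label; call <code>reset_index(drop=True)</code> if a unique index is needed.</p>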


<h2>Section 2: Test Pre-Trained Models</h2>
<p>In this section I tested multiple models using <b>zero-shot classification</b>.</p>
<p>Zero-shot classification is the task of assigning text to categories that the classifier was not trained on.</p>
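<p>As a rough illustration of the idea, NLI-based zero-shot classifiers turn each candidate label into a hypothesis sentence, score how strongly the text entails it, and normalize the scores across labels. The sketch below assumes that scheme; <code>toy_entailment_logit</code> is a hypothetical stand-in for a real NLI model, and the template string is the commonly used default rather than something taken from this notebook.</p>

```python
import math


def zero_shot_scores(text, candidate_labels, entailment_logit,
                     hypothesis_template="This example is {}."):
    """Rank labels by how strongly `text` entails each label's hypothesis."""
    logits = [entailment_logit(text, hypothesis_template.format(label))
              for label in candidate_labels]
    # Softmax over the per-label entailment logits (single-label setting)
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(candidate_labels, probs),
                  key=lambda pair: pair[1], reverse=True)


def toy_entailment_logit(premise, hypothesis):
    # Hypothetical scorer: counts words shared by premise and hypothesis
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().rstrip(".").split())
    return float(len(premise_words & hypothesis_words))


ranked = zero_shot_scores(
    "activerecord raises an error on find_each",
    ["activerecord", "docs", "stale"],
    toy_entailment_logit,
)
print(ranked[0][0])
```

<p>With a real model, the quality of the ranking depends on the checkpoint having been trained for NLI, so that an "entailment" output actually exists.</p>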

<p style="color:red">The GPU configuration code was generated by ChatGPT.</p>

In [None]:
def model_classify_pretrained(df_to_classify, model_name = 'roberta-base'):
    df_to_classify_copy = df_to_classify.copy()
    
    device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

    
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Check if multiple GPUs are available
    if torch.cuda.device_count() > 1:
        print(f"Using {torch.cuda.device_count()} GPUs!")
        model = nn.DataParallel(model)  # Wrap the model in DataParallel

    model.to(device)

    # Initialize the classifier pipeline with GPU support if available
    classifier = pipeline(
        'zero-shot-classification',
        model=model.module if isinstance(model, nn.DataParallel) else model,
        tokenizer=tokenizer,
        device=device
    )

    candidate_labels = df_to_classify_copy['labels'].unique().tolist()

    descriptions = df_to_classify_copy['description'].fillna('').tolist()

    non_empty_descriptions = [desc for desc in descriptions if desc.strip()]


    batch_size = 32 
    results = []
    for i in tqdm(range(0, len(non_empty_descriptions), batch_size), desc="Processing in Batches"):
        batch = non_empty_descriptions[i:i + batch_size]
        result = classifier(batch, candidate_labels, clean_up_tokenization_spaces=False)
        results.extend(result)

    predicted_labels = [res['labels'][0] if res['labels'] else 'No Description' for res in results]
    
    count = 0
    for index, row in df_to_classify_copy.iterrows():
        try:
            if pd.notna(row['description']) and row['description'].strip():
                df_to_classify_copy.loc[index, 'predicted_label'] = predicted_labels[count]
            else:
                df_to_classify_copy.loc[index, 'predicted_label'] = 'No Description'
                continue

        except Exception:  # e.g. a non-string description; treat it as missing
            df_to_classify_copy.loc[index, 'predicted_label'] = 'No Description'
            continue

        count += 1
    return df_to_classify_copy

<h2>In this part I used <b>roberta-base</b> to classify each description into one of the existing labels</h2>

In [None]:
df_all_labels_classification = model_classify_pretrained(df_to_classify=df_non_empty_labels_exploded)

Unnamed: 0,id,number,title,description,labels,comments,author,created_at,state,predicted_label
5,2658345715,53637,Two get requests in a functional test,### Steps to reproduce\r\n\r\nI am developing ...,more-information-needed,1,Mathiezz,2024-11-14 10:26:55+00:00,open,more-information-needed
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,activerecord,14,jasonperrone,2024-11-13 15:35:17+00:00,closed,good first issue
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,attached PR,14,jasonperrone,2024-11-13 15:35:17+00:00,closed,good first issue
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,With reproduction steps,14,jasonperrone,2024-11-13 15:35:17+00:00,closed,good first issue
16,2648880719,53590,"Polymorphic where clause accepts any object, e...",When using a wrapper class that exposes the id...,activerecord,0,nyikos-zoltan,2024-11-11 10:27:28+00:00,open,more-information-needed
...,...,...,...,...,...,...,...,...,...,...
5340,420418447,35600,Setting `inverse_of` on `belongs_to polymorphi...,### Steps to reproduce\r\nhttps://gist.github....,activerecord,3,allenwu1973,2019-03-13 10:18:09+00:00,closed,actionmailbox
5340,420418447,35600,Setting `inverse_of` on `belongs_to polymorphi...,### Steps to reproduce\r\nhttps://gist.github....,stale,3,allenwu1973,2019-03-13 10:18:09+00:00,closed,actionmailbox
5341,420300746,35597,Rails bi-directional accepts_nested_attributes...,### Steps to reproduce\r\nLet's say I have 2 m...,activerecord,6,hashwin,2019-03-13 03:29:42+00:00,closed,more-information-needed
5341,420300746,35597,Rails bi-directional accepts_nested_attributes...,### Steps to reproduce\r\nLet's say I have 2 m...,stale,6,hashwin,2019-03-13 03:29:42+00:00,closed,more-information-needed


In [41]:
filtered_df = df_all_labels_classification[df_all_labels_classification['predicted_label'] == df_all_labels_classification['labels']]
filtered_df

Unnamed: 0,id,number,title,description,labels,comments,author,created_at,state,predicted_label
5,2658345715,53637,Two get requests in a functional test,### Steps to reproduce\r\n\r\nI am developing ...,more-information-needed,1,Mathiezz,2024-11-14 10:26:55+00:00,open,more-information-needed
221,2498156424,52754,params.deep_transform_keys! doesn't work with ...,### Steps to reproduce\r\nApplicationControlle...,more-information-needed,2,fandan,2024-08-30 20:51:10+00:00,closed,more-information-needed
674,2131057210,51052,New find_or_create_by behavior in Rails 7.1 ca...,"In https://github.com/rails/rails/pull/45720, ...",stale,7,stanhu,2024-02-12 21:59:02+00:00,closed,stale
772,2073779810,50691,uninitialized constant autoloading error expli...,i am using ruby 3.2.2 and rails version 7.0.1 ...,more-information-needed,2,AbdulWahab552,2024-01-10 07:41:02+00:00,closed,more-information-needed
918,2002750106,50118,prepend option is no longer supported for afte...,"As a result of #46992 (@tenderlove), with new ...",With reproduction steps,3,jrochkind,2023-11-20 18:17:01+00:00,open,With reproduction steps
...,...,...,...,...,...,...,...,...,...,...
4885,481092185,36942,Refused to load the stylesheet,I'm using Rails and React with Webpacker. My a...,stale,1,ghost,2019-08-15 10:33:04+00:00,closed,stale
5039,457587885,36514,Active Storage direct upload to disk fails whe...,### Steps to reproduce\r\nour application is a...,stale,3,cushingw,2019-06-18 16:45:16+00:00,closed,stale
5089,451591546,36392,ActiveStorage::Blob: analyzed metadata is miss...,"Hello,\r\n\r\n### Steps to reproduce\r\nCreate...",stale,2,michaelnwani,2019-06-03 17:03:35+00:00,closed,stale
5291,424880101,35744,Impossible to create column decimal/numeric wi...,### Steps to reproduce\r\n\r\n```\r\nclass Cre...,needs work,2,MathieuDerelle,2019-03-25 12:39:25+00:00,closed,needs work
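<p>The filtered frame above lists the matching rows; a single score can be derived from the same comparison. The sketch below uses hypothetical toy data in the same shape as the exploded frame. Because every exploded copy of an issue has the same description, it receives the same prediction, so at most one of its true labels can match. A per-issue "any label matched" rate is therefore shown alongside the per-row rate.</p>

```python
import pandas as pd

# Hypothetical toy results: issue 7 has two true labels but one prediction
toy = pd.DataFrame(
    {
        "labels": ["activerecord", "stale", "stale", "docs"],
        "predicted_label": ["stale", "stale", "stale", "activerecord"],
    },
    index=[7, 7, 9, 12],
)

matches = toy["predicted_label"] == toy["labels"]

# Share of exploded rows whose prediction equals the row's label
row_accuracy = matches.mean()
# Share of issues (grouped by the repeated index) with at least one matching label
issue_hit_rate = matches.groupby(level=0).any().mean()

print(f"row accuracy: {row_accuracy:.2f}")
print(f"issue hit rate: {issue_hit_rate:.2f}")
```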


In [None]:
df_all_labels_classification.to_csv('/kaggle/working/df_all_labels_classification.csv', index=False)

<p>Restricting the data to the top 3 categories only:</p>

In [None]:
# Note: despite its name, top_four holds the three top labels
top_four = ['activerecord', 'With reproduction steps', 'stale']
df_top_four_labels = df_non_empty_labels_exploded[df_non_empty_labels_exploded['labels'].isin(top_four)]

df_top_four_labels

Unnamed: 0,id,number,title,description,labels,comments,author,created_at,state
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,activerecord,14,jasonperrone,2024-11-13 15:35:17+00:00,closed
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,With reproduction steps,14,jasonperrone,2024-11-13 15:35:17+00:00,closed
16,2648880719,53590,"Polymorphic where clause accepts any object, e...",When using a wrapper class that exposes the id...,activerecord,0,nyikos-zoltan,2024-11-11 10:27:28+00:00,open
16,2648880719,53590,"Polymorphic where clause accepts any object, e...",When using a wrapper class that exposes the id...,With reproduction steps,0,nyikos-zoltan,2024-11-11 10:27:28+00:00,open
24,2645266168,53565,ActiveRecord::QueryLogs tags being appended to...,### Steps to reproduce\r\n<!-- (Guidelines for...,activerecord,0,mnordin,2024-11-08 22:14:51+00:00,closed
...,...,...,...,...,...,...,...,...,...
5336,420870021,35609,AutLoad Broken on Windows with Ruby 2.6,Rails autoloading is broken on Windows due to ...,stale,4,cfis,2019-03-14 07:34:53+00:00,closed
5340,420418447,35600,Setting `inverse_of` on `belongs_to polymorphi...,### Steps to reproduce\r\nhttps://gist.github....,activerecord,3,allenwu1973,2019-03-13 10:18:09+00:00,closed
5340,420418447,35600,Setting `inverse_of` on `belongs_to polymorphi...,### Steps to reproduce\r\nhttps://gist.github....,stale,3,allenwu1973,2019-03-13 10:18:09+00:00,closed
5341,420300746,35597,Rails bi-directional accepts_nested_attributes...,### Steps to reproduce\r\nLet's say I have 2 m...,activerecord,6,hashwin,2019-03-13 03:29:42+00:00,closed


<h3>In this part I used <b>AntoineMC/distilbart-mnli-github-issues</b>, which is a version of <b>facebook/bart-large-mnli</b> fine-tuned on GitHub issues</h3>
<p style="color:red">This model's classification head supports only 3 categories.</p>

In [94]:
df_top_four_labels_result = model_classify_pretrained(df_top_four_labels,model_name = 'AntoineMC/distilbart-mnli-github-issues')

Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
Processing in Batches: 100%|██████████| 69/69 [01:22<00:00,  1.20s/it]
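<p>The warning above comes from the zero-shot pipeline looking for an "entailment" entry in the model's <code>label2id</code> mapping. The sketch below is an approximation of that lookup, not the library's actual code, and the <code>generic_mapping</code> is a hypothetical example: the real config of this model was not inspected here.</p>

```python
def find_entailment_id(label2id):
    """Return the id of the first label whose name starts with 'entail', else -1."""
    for label, idx in label2id.items():
        if label.lower().startswith("entail"):
            return idx
    return -1


# A typical NLI label mapping (illustrative)
nli_mapping = {"contradiction": 0, "neutral": 1, "entailment": 2}
# Hypothetical generic mapping with no descriptive label names
generic_mapping = {"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2}

print(find_entailment_id(nli_mapping))
print(find_entailment_id(generic_mapping))
```

<p>When the lookup falls back to -1, the pipeline presumably reads the last output as the entailment score, so the predictions are only meaningful if the last class of the head really plays that role. It is worth checking the model's config before trusting the results.</p>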


In [95]:
filtered_df = df_top_four_labels_result[df_top_four_labels_result['predicted_label'] == df_top_four_labels_result['labels']]
filtered_df

Unnamed: 0,id,number,title,description,labels,comments,author,created_at,state,predicted_label
7,2655908276,53624,Using DATABASE_URL to refer to config in datab...,### Steps to reproduce\r\nSince upgrading to a...,With reproduction steps,14,jasonperrone,2024-11-13 15:35:17+00:00,closed,With reproduction steps
16,2648880719,53590,"Polymorphic where clause accepts any object, e...",When using a wrapper class that exposes the id...,With reproduction steps,0,nyikos-zoltan,2024-11-11 10:27:28+00:00,open,With reproduction steps
24,2645266168,53565,ActiveRecord::QueryLogs tags being appended to...,### Steps to reproduce\r\n<!-- (Guidelines for...,With reproduction steps,0,mnordin,2024-11-08 22:14:51+00:00,closed,With reproduction steps
44,2617542880,53468,`#merge` doesn't merge `#with`,### Steps to reproduce\r\n\r\n```ruby\r\n# fro...,With reproduction steps,0,tim-semba,2024-10-28 07:23:25+00:00,closed,With reproduction steps
153,2545847185,53031,Previously-valid queries now result in `System...,### Steps to reproduce\r\n\r\nI upgraded a Rai...,With reproduction steps,0,matthew-healy,2024-09-24 16:24:04+00:00,closed,With reproduction steps
...,...,...,...,...,...,...,...,...,...,...
5321,421869562,35648,"Could not find ""javascript/channel.coffee"" in ...",### Steps to reproduce\r\n\r\n- Have a Rails 5...,stale,4,localhostdotdev,2019-03-17 01:04:33+00:00,closed,stale
5327,421703321,35630,PROPOSAL to return frozen values from Rails.ap...,We had a serious issue in production that was ...,stale,3,koenhandekyn,2019-03-15 21:02:39+00:00,closed,stale
5328,421622004,35625,MemCacheStore uses a TTL (expires_in) longer t...,The maximum expiration period supported by mem...,stale,6,johnmaxwell,2019-03-15 17:13:50+00:00,closed,stale
5336,420870021,35609,AutLoad Broken on Windows with Ruby 2.6,Rails autoloading is broken on Windows due to ...,stale,4,cfis,2019-03-14 07:34:53+00:00,closed,stale


<h2>Section 3: Fine-Tune the Model</h2>
<p style= "color:red">I chose <b>AntoineMC/distilbart-mnli-github-issues</b> because it was fine-tuned on a dataset for a similar purpose and my own dataset is too small.</p>

<p style="color:red">Note: ChatGPT generated part of this function while helping me fix some runtime errors.</p>

In [None]:
def fine_tune_and_classify(
    df_to_classify, model_name='roberta-base', save_dir='/kaggle/working/fine_tuned_model', epochs=2, batch_size=16
):
    df_copy = df_to_classify.copy()

    # Check device
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=df_copy['labels'].nunique(),
        ignore_mismatched_sizes=True 
    )

    model.config.problem_type = "single_label_classification"

    model.to(device)

    # Split data into train and test sets
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        df_copy['description'].fillna('').tolist(),
        pd.Categorical(df_copy['labels']).codes,  # Convert labels to integer indices
        test_size=0.2,
        random_state=42
    )

    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512)

    class Dataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
            return item

    train_dataset = Dataset(train_encodings, train_labels)
    test_dataset = Dataset(test_encodings, test_labels)


    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss"
    )

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)

        precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
        accuracy = accuracy_score(labels, predictions)

        return {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1
        }

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics
    )

    trainer.train()

    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    print(f"Model and tokenizer saved to {save_dir}.")

    classifier = pipeline(
        'text-classification',
        model=trainer.model,
        tokenizer=tokenizer,
        device=0 if device == "cuda" else -1
    )

    descriptions = df_copy['description'].fillna('').tolist()
    batch_results = []
    for i in tqdm(range(0, len(descriptions), batch_size), desc="Classifying"):
        batch = descriptions[i:i + batch_size]
        predictions = classifier(batch, truncation=True, padding=True)
        batch_results.extend(predictions)

    df_copy['predicted_label'] = [pred['label'] for pred in batch_results]
    return df_copy
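<p>The function encodes labels with <code>pd.Categorical(...).codes</code>, so the model is trained on integer ids. The sketch below shows one way to recover the label names from those ids. It assumes the pipeline emits default names of the form <code>LABEL_&lt;id&gt;</code>, which may not hold for every checkpoint; the sample labels and <code>raw_predictions</code> are hypothetical.</p>

```python
import pandas as pd

# Hypothetical sample of the 'labels' column
labels = pd.Series(["stale", "activerecord", "With reproduction steps", "stale"])

# Categories are sorted, and .codes are positions in that sorted list
categorical = pd.Categorical(labels)
id2label = dict(enumerate(categorical.categories))
label2id = {name: idx for idx, name in id2label.items()}

# Hypothetical pipeline outputs using the default LABEL_<id> naming
raw_predictions = [{"label": "LABEL_2"}, {"label": "LABEL_0"}]
decoded = [id2label[int(pred["label"].split("_")[-1])] for pred in raw_predictions]

print(id2label)
print(decoded)
```

<p>An alternative is to pass <code>id2label</code> and <code>label2id</code> to <code>from_pretrained</code> when loading the model, so that the saved config carries the names and the pipeline can return them directly.</p>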


In [102]:
import os
os.environ["WANDB_MODE"] = "disabled"

In [109]:
result_fine_tune_and_classify = fine_tune_and_classify(df_top_four_labels,model_name = 'AntoineMC/distilbart-mnli-github-issues')
print(result_fine_tune_and_classify)


A ConfigError was raised whilst setting the number of model parameters in Weights & Biases config.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.9904,0.959929,0.512472,0.447592,0.512472,0.475406
2,0.9196,0.893587,0.646259,0.568022,0.646259,0.599139


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Model and tokenizer saved to /kaggle/working/fine_tuned_model.


Classifying: 100%|██████████| 138/138 [00:27<00:00,  4.99it/s]






In [110]:
result_fine_tune_and_classify.to_csv('/kaggle/working/df_top_four_fine_tune_classification.csv', index=False)
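
<p>A note on the training log of the fine-tuning run: the Accuracy and Recall columns are identical in both epochs (0.512472 and 0.646259). This is expected with <code>average='weighted'</code>, because weighting each class's recall by its support reduces to overall accuracy. The sketch below checks this on hypothetical class ids, using only the standard library.</p>

```python
from collections import Counter

# Hypothetical true and predicted class ids for six validation examples
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

total = len(y_true)
support = Counter(y_true)  # number of true examples per class
true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)

accuracy = sum(true_positives.values()) / total
# Per-class recall (TP / support), weighted by each class's share of the data
weighted_recall = sum(
    (support[c] / total) * (true_positives[c] / support[c]) for c in support
)

print(accuracy, weighted_recall)
```

<p>Since the two columns carry the same information, the weighted precision and F1 columns are the more informative ones to compare across epochs.</p>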