# Simple Recommendation Model

To get started, we will use only a very small portion of the cleaned `pr_df` dataset for now.

In [1]:
import polars as pl

# load a subset of the dataset
pr_df = pl.read_parquet("data/intermediate_data/pr_df_clean_issues.parquet")

print(f"Dataset shape: {pr_df.shape}")
print(f'The column names: {pr_df.columns}')


Dataset shape: (10000, 14)
The column names: ['repo', 'parent_repo', 'child_repo', 'issue_id', 'issue_number', 'issue', 'text_size', 'usernames', 'users', 'mock_number', 'issue_title', 'issue_comments', 'issue_title_clean', 'issue_comments_clean']


Now let's vectorize the text. We will take the same approach as before, just with a smaller dataset to avoid catapulting my puny computer into the abyss of RAM hell.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorize the "issue_title_clean" column
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(pr_df['issue_title_clean'])

print(f'tf-idf matrix shape: {tfidf_matrix.shape}')

tf-idf matrix shape: (10000, 13209)


## Build a Simple Recommendation Function

We will compute similarities for a single query or item and recommend the most similar ones.
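
Cosine similarity compares the direction of two vectors and ignores their length. As a rough sketch on hypothetical toy vectors (stand-ins for tf-idf rows, not actual dataset rows), the score from `cosine_similarity` matches the dot product divided by the product of the norms:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical toy vectors standing in for tf-idf rows
toy_query = np.array([[1.0, 1.0, 0.0]])
toy_items = np.array([
    [1.0, 1.0, 0.0],   # same direction as the query
    [1.0, 0.0, 0.0],   # shares one term with the query
    [0.0, 0.0, 1.0],   # shares no term with the query
])

toy_scores = cosine_similarity(toy_query, toy_items).flatten()

# the same thing by hand: dot product over the product of the norms
manual_scores = (toy_items @ toy_query.ravel()) / (
    np.linalg.norm(toy_items, axis=1) * np.linalg.norm(toy_query)
)

print(np.round(toy_scores, 3))
print(np.allclose(toy_scores, manual_scores))
```

Since `TfidfVectorizer` L2-normalizes each row by default, cosine similarity on the tf-idf matrix reduces to a plain dot product between rows.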

In [3]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def recommend_issues(query_idx, tfidf_matrix, pr_df, top_n = 5):
    """
    recommend the most similar issues based on a given query.
    
    parameters:
    - query_idx: index of the query issue
    - tfidf_matrix: tf-idf matrix
    - pr_df: dataframe of the dataset, contains the issues
    - top_n: number of top similar issues to return
    
    returns:
    - list of top_n similar issues in tuples (index, similarity_score, title)
    """

    # compute cosine similarity for the query
    query_vector = tfidf_matrix[query_idx]
    similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

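    # get the top N most similar items, dropping the query itself by index
    # (it is not guaranteed to sort first when other rows tie with it)
    ranked = np.argsort(similarities)[::-1]
    top_indices = ranked[ranked != query_idx][:top_n]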

    recommendations = [
        (
            idx,
            similarities[idx],
            pr_df['issue_title_clean'][int(idx)]
        ) for idx in top_indices
    ]

    return recommendations

In [6]:
# example: recommend similar issues for the issue at row 1000
query_idx = 1000
recommendations = recommend_issues(query_idx, tfidf_matrix, pr_df, top_n = 5)

for idx, score, title in recommendations:
    print(f'index: {idx}, similarity: {score:.2f}, title: {title}')

index: 9558, similarity: 0.52, title: Upgraded dependencies
index: 99, similarity: 0.45, title: Upgraded to Kong 2.0.3
index: 5603, similarity: 0.43, title: Upgraded Guice to 4.0
index: 1847, similarity: 0.32, title: upgraded dependecy/plugin versions to latest
index: 8939, similarity: 0.31, title: Upgraded timescaledb to 17.4pg12


Not sure how informative this is. The titles of the most similar issues all share a single keyword such as `WIP` (or `Upgraded`, as in the output above)... which makes sense, but there is not much one can do with this, at least not with such a simple model.