**MODEL OPTIMIZATION**

1. The baseline model used only CountVectorizer. In this model I will use TF-IDF, so that a term's weight reflects how distinctive it is across the corpus and not just how often it appears.
2. I will also tune some hyperparameters, such as "max_features" and "ngram_range".
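
To make the difference between the two vectorizers concrete, here is a minimal sketch on a hypothetical three-document toy corpus (not the project data). It assumes scikit-learn's default settings and only compares how the two vectorizers score a common word against a rare word in the same document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "game" appears in every document, "chess" in only one.
toy_docs = [
    "strategy game chess",
    "racing game cars",
    "puzzle game blocks",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs)

# Raw counts treat both words the same in the first document.
count_game = counts[0, count_vec.vocabulary_["game"]]
count_chess = counts[0, count_vec.vocabulary_["chess"]]

# TF-IDF down-weights the word that occurs in every document.
tfidf_game = weights[0, tfidf_vec.vocabulary_["game"]]
tfidf_chess = weights[0, tfidf_vec.vocabulary_["chess"]]

print(f"counts: game={count_game}, chess={count_chess}")
print(f"tf-idf: game={tfidf_game:.3f}, chess={tfidf_chess:.3f}")
```

With raw counts both words look equally important, while TF-IDF gives the rarer word the larger weight, which is the behaviour this model relies on.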

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import time

In [2]:
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))

if project_root not in sys.path:
    sys.path.append(project_root)

from src import config
RAW_DATA_PATH = config.RAW_DATA_PATH
PROCESSED_DATA_PATH = config.PROCESSED_DATA_PATH

In [3]:
df = pd.read_csv(PROCESSED_DATA_PATH)

In [4]:
df['description'] = df['description'].fillna('')
df['tags'] = df['tags'].fillna('')

In [5]:
df["combined_text"] = df["description"] + " " + df["tags"]

**TRYING DIFFERENT 'max_features' VALUES**

In [6]:
features_to_try = [1000, 5000, 10000]
results = []

In [7]:
test_idx = 100

In [10]:
for n_features in features_to_try:
    print(f"Max features: {n_features} in progress...")
    start_time = time.time()

    # Fit TF-IDF with the vocabulary capped at n_features terms
    tfidf = TfidfVectorizer(stop_words="english", max_features=n_features)
    tfidf_matrix = tfidf.fit_transform(df["combined_text"])

    # Rows are L2-normalized by default, so the linear kernel equals cosine similarity
    cosine_sim = linear_kernel(tfidf_matrix[test_idx:test_idx+1], tfidf_matrix)

    elapsed = time.time() - start_time
    results.append({"features": n_features,
                    "time": elapsed})

    print(f"Completed. Time: {elapsed:.2f} sec")

Max features: 1000 in progress...
Completed. Time: 1.78 sec
Max features: 5000 in progress...
Completed. Time: 1.82 sec
Max features: 10000 in progress...
Completed. Time: 1.75 sec


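Increasing max_features from 1,000 to 10,000 did not meaningfully change the processing time, which stayed around 1.7–1.8 seconds. This is expected: the vectorizer tokenizes and counts the whole corpus in every run and only afterwards keeps the most frequent terms, so a larger vocabulary cap adds no noticeable overhead. The timings say nothing about recommendation quality, only about cost.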
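
A small sketch of what `max_features` does may help explain the flat timings. It uses a hypothetical toy corpus standing in for `df["combined_text"]` and assumes scikit-learn's default tokenizer: the capped vectorizer still sees every term while fitting, then keeps only the terms with the highest total count across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df["combined_text"].
toy_docs = [
    "open world adventure game",
    "adventure puzzle game",
    "racing game with open tracks",
    "card game",
]

# Without a cap, every token becomes a feature.
full = TfidfVectorizer().fit(toy_docs)

# With a cap, only the most frequent terms across the corpus are kept.
capped = TfidfVectorizer(max_features=3).fit(toy_docs)

full_vocab = set(full.vocabulary_)
capped_vocab = set(capped.vocabulary_)

print("full vocabulary size:", len(full_vocab))
print("capped vocabulary:", sorted(capped_vocab))
```

Because the cap is applied after counting, a low value mainly reduces the matrix width and memory use rather than the fitting time.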

**N-GRAM RANGE**

In [11]:
tfidf_bigram = TfidfVectorizer(stop_words="english", max_features=10000, ngram_range=(1, 2))
tfidf_matrix_bigram = tfidf_bigram.fit_transform(df["combined_text"])

In [12]:
print("Bigram matrix shape:", tfidf_matrix_bigram.shape)

Bigram matrix shape: (70948, 10000)
