## Problem Statement
The goal of this proof of concept (PoC) is to build a **blog recommendation system** that helps users find blogs similar to the ones they like. The system should:
- Analyze blog content to understand key topics and words.
- Identify and suggest blogs that have similar content.
- Give more importance to recent blogs so that newer content is recommended first.
- Ensure recommendations come from the same category as the selected blog.
- Allow users to interact and explore recommendations through a simple terminal interface.


In [1]:
!pip install -r requirements.txt

Collecting anyio==4.8.0 (from -r requirements.txt (line 2))
  Downloading anyio-4.8.0-py3-none-any.whl.metadata (4.6 kB)
Collecting certifi==2025.1.31 (from -r requirements.txt (line 3))
  Downloading certifi-2025.1.31-py3-none-any.whl.metadata (2.5 kB)
Collecting colorama==0.4.6 (from -r requirements.txt (line 6))
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting exceptiongroup==1.2.2 (from -r requirements.txt (line 8))
  Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB)
Collecting fastapi==0.115.8 (from -r requirements.txt (line 9))
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting fsspec==2024.12.0 (from -r requirements.txt (line 11))
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting huggingface-hub==0.28.1 (from -r requirements.txt (line 15))
  Downloading huggingface_hub-0.28.1-py3-none-any.whl.metadata (13 kB)
Collecting numpy==2.2.2 (from -r requirements.txt (line 23))
  Downloading 

## Blog Recommendation System Explanation

### 1. TF-IDF is used to convert blog content into numerical vectors.
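- **TF-IDF (Term Frequency-Inverse Document Frequency)** converts each blog into a numerical vector in which a term's weight grows with how often it appears in that blog and shrinks with how many other blogs contain it.
- English stop words are removed and the vocabulary is capped at the 5,000 most frequent terms, so blogs are compared on their distinctive vocabulary rather than on exact keyword matches.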

### 2. Cosine similarity with NearestNeighbors finds similar blogs.
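- The **NearestNeighbors model** is fitted with `metric='cosine'`, so it returns the **cosine distance** (1 minus the cosine similarity) between the TF-IDF vectors of two blogs.
- For each selected blog the 10 nearest neighbors are retrieved; a smaller distance (a cosine similarity closer to 1) means the blogs are more alike, and the code converts back to a similarity with `1 - distance`.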
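
A minimal sketch of the neighbor search on hand-made 2-D vectors may make the distance-to-similarity conversion easier to follow. The `toy_vectors` array is hypothetical and stands in for TF-IDF rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D vectors standing in for TF-IDF rows.
toy_vectors = np.array([
    [1.0, 0.0],  # query
    [3.0, 1.0],  # longer vector pointing in nearly the same direction
    [1.0, 1.0],  # 45 degrees away from the query
    [0.0, 1.0],  # orthogonal to the query
])

toy_nn = NearestNeighbors(n_neighbors=3, metric="cosine")
toy_nn.fit(toy_vectors)

# kneighbors returns cosine distances; the query itself comes back first.
distances, indices = toy_nn.kneighbors(toy_vectors[[0]])
similarities = 1 - distances[0]

for i, sim in zip(indices[0], similarities):
    print(f"row {i}: cosine similarity {sim:.3f}")
```

Because the query is part of the fitted data, it is always its own nearest neighbor, which is why the notebook skips any neighbor whose `blog_id` equals the selected one.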

### 3. Time decay prioritizes recent blogs.
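- A **time decay factor** of `exp(-age_in_days / 30)` is computed for every blog, where the age is measured from the most recent `scrape_time` in the dataset.
- The factor is 1 for the newest blogs and shrinks toward 0 for older ones; multiplying it with the cosine similarity gives newer blogs more weight in the final ranking.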
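
The decay can be checked on a few made-up timestamps. This sketch assumes the same 30-day scale as the notebook; the dates in `toy_times` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical scrape times (not taken from the dataset).
toy_times = pd.to_datetime(pd.Series(["2025-03-01", "2025-02-01", "2024-12-01"]))

newest = toy_times.max()
age_days = (newest - toy_times).dt.days  # whole days since the newest entry
decay = np.exp(-age_days / 30)           # 1.0 for the newest entry, smaller when older

print(pd.DataFrame({"age_days": age_days, "time_decay": decay.round(3)}))
```

With a scale of 30 days the weight halves roughly every 21 days (30 × ln 2 ≈ 20.8), so a blog scraped three months before the newest one keeps only about 5% of its similarity score.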

### 4. Topic filtering ensures relevant recommendations.
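- The system only recommends blogs that share the **same topic** as the selected blog.
- The filter is applied after the nearest-neighbor search, to the 10 retrieved neighbors. It keeps off-topic content out of the results, but it can leave fewer than 3 recommendations when most neighbors belong to other topics.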
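
The filter-then-rank logic can be sketched on a small hand-made candidate table. The `candidates` frame and its values are hypothetical; only the scoring rule `(1 - distance) * time_decay` is taken from the notebook.

```python
import pandas as pd

# Hypothetical neighbors returned for a selected blog whose topic is "ai".
candidates = pd.DataFrame({
    "blog_id": [101, 102, 103, 104],
    "topic": ["ai", "nlp", "ai", "ai"],
    "cosine_distance": [0.20, 0.10, 0.40, 0.30],
    "time_decay": [0.50, 1.00, 1.00, 0.10],
})
selected_topic = "ai"

# Keep only same-topic neighbors, then weight similarity by recency.
same_topic = candidates[candidates["topic"] == selected_topic].copy()
same_topic["adjusted_score"] = (
    (1 - same_topic["cosine_distance"]) * same_topic["time_decay"]
)

top = same_topic.sort_values("adjusted_score", ascending=False).head(3)
print(top[["blog_id", "adjusted_score"]])
```

In this toy example blog 102 is the closest neighbor but is dropped because its topic differs, and blog 103 outranks the more similar blog 101 because it is newer.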

### 5. Terminal-based interaction allows users to explore blog recommendations dynamically.
- Users can **input a blog ID** to get recommendations in an interactive terminal environment.  
- The system provides real-time recommendations, allowing users to explore similar content efficiently.  



In [12]:
import pandas as pd
import numpy as np
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity


file_path = "/content/medium_blog_data.csv"
df = pd.read_csv(file_path)
df['scrape_time'] = pd.to_datetime(df['scrape_time'])
df['blog_content'] = df['blog_content'].fillna('')

In [13]:
topic_counts_dict = df['topic'].value_counts().to_dict()
topic_counts_dict

{'ai': 736,
 'blockchain': 644,
 'cybersecurity': 642,
 'web-development': 635,
 'data-analysis': 594,
 'cloud-computing': 589,
 'security': 527,
 'web3': 471,
 'machine-learning': 467,
 'nlp': 453,
 'data-science': 444,
 'deep-learning': 430,
 'android': 426,
 'dev-ops': 384,
 'information-security': 374,
 'image-processing': 354,
 'flutter': 343,
 'backend': 341,
 'cloud-services': 339,
 'Cryptocurrency': 331,
 'app-development': 322,
 'backend-development': 312,
 'Software-Development': 309}

In [18]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(df['blog_content'])

# Nearest Neighbors Model
nn_model = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='auto')
nn_model.fit(tfidf_matrix)

# Normalize timestamps for time decay
max_time = df['scrape_time'].max()
df['time_decay'] = df['scrape_time'].apply(lambda x: np.exp(-(max_time - x).days / 30))

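def get_recommendations_advanced(selected_blog_id, previous_recommendations=None):
    # Avoid a mutable default argument; None means nothing has been recommended yet.
    if previous_recommendations is None:
        previous_recommendations = set()
    idx = df[df['blog_id'] == selected_blog_id].index[0]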
    distances, indices = nn_model.kneighbors(tfidf_matrix[idx], n_neighbors=10)
    selected_blog = df.iloc[idx]
    selected_topic = selected_blog['topic']
    recommendations = []

    for i, distance in zip(indices[0], distances[0]):
        recommended_blog = df.iloc[i]
        if recommended_blog['blog_id'] != selected_blog_id and recommended_blog['blog_id'] not in previous_recommendations:
            if recommended_blog['topic'] == selected_topic:
                adjusted_score = (1 - distance) * recommended_blog['time_decay']
                recommendations.append((recommended_blog['blog_id'], adjusted_score))

    recommendations = sorted(recommendations, key=lambda x: x[1], reverse=True)[:3]
    return [r[0] for r in recommendations]

def terminal_blog_recommendation_system():
    print("Welcome to the Terminal-Based Blog Recommendation System!")
    initial_recommendations = df.sample(3)['blog_id'].tolist()
    previous_recommendations = set(initial_recommendations)
    print("\nHere are some blogs you might like:")
    for blog_id in initial_recommendations:
        blog = df[df['blog_id'] == blog_id].iloc[0]
        print(f"- [{blog_id}] {blog['blog_title']} (Topic: {blog['topic']})")

    while True:
        user_input = input("\nEnter a blog ID to get recommendations (or type 'exit' to quit): ").strip()
        if user_input.lower() == 'exit':
            print("Thank you for using the recommendation system! Goodbye.")
            break
        try:
            selected_blog_id = int(user_input)
            if selected_blog_id not in df['blog_id'].values:
                print("Invalid blog ID. Please try again.")
                continue
            recommended_blog_ids = get_recommendations_advanced(selected_blog_id, previous_recommendations)
            if not recommended_blog_ids:
                print("No similar blogs found. Try selecting a different blog.")
                continue
            print("\nBased on your selection, you might like:")
            for blog_id in recommended_blog_ids:
                blog = df[df['blog_id'] == blog_id].iloc[0]
                print(f"- [{blog_id}] {blog['blog_title']} (Topic: {blog['topic']})")
            previous_recommendations.update(recommended_blog_ids)
        except ValueError:
            print("Invalid input. Please enter a valid blog ID or 'exit' to quit.")

if __name__ == "__main__":
    terminal_blog_recommendation_system()


Welcome to the Terminal-Based Blog Recommendation System!

Here are some blogs you might like:
- [2129] Unleashing the Potential of Histogram Segmentation for Image Segmentation” (Topic: image-processing)
- [1862] Understanding SQL Joins: A Beginner’s Guide with Code Examples (Topic: data-analysis)
- [9600] How to improve assertions in Playwright by adding custom matchers (Topic: Software-Development)

Enter a blog ID to get recommendations (or type 'exit' to quit): 2129

Based on your selection, you might like:
- [10232] Histogram Equalisation From Scratch in Python (Topic: image-processing)
- [2259] What is Image Segmentation? | Image Processing #9 (Topic: image-processing)
- [7056] Image Segmentation in Python (Topic: image-processing)

Enter a blog ID to get recommendations (or type 'exit' to quit): 2259

Based on your selection, you might like:
- [2204] K-Means Clustering for Image Segmentation: An Introduction (Topic: image-processing)
- [7075] Deep Learning for Medical Image Seg

In [21]:

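def precision_at_k(selected_blog_id, recommended_blog_ids, k=3):
    # Fraction of the k slots filled by blogs that share the selected blog's topic.
    # get_recommendations_advanced already filters by topic, so for its output
    # this equals len(recommended_blog_ids) / k.
    selected_topic = df[df['blog_id'] == selected_blog_id]['topic'].values[0]
    relevant_count = sum(df[df['blog_id'] == rec_id]['topic'].values[0] == selected_topic for rec_id in recommended_blog_ids)
    return relevant_count / k


def mean_average_precision(sample_size=100):
    # Despite the name, this is the mean of precision@k over a random sample
    # of blogs, not mean average precision (MAP).
    sample_blogs = df.sample(sample_size)['blog_id'].tolist()
    precisions = [precision_at_k(blog_id, get_recommendations_advanced(blog_id)) for blog_id in sample_blogs]
    return np.mean(precisions)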


def diversity_score(sample_size=100):
    sample_blogs = df.sample(sample_size)['blog_id'].tolist()
    diversity_scores = []

    for blog_id in sample_blogs:
        recommended_blog_ids = get_recommendations_advanced(blog_id)
        if len(recommended_blog_ids) < 2:
            continue


        indices = [df[df['blog_id'] == rec_id].index[0] for rec_id in recommended_blog_ids]
        sim_matrix = cosine_similarity(tfidf_matrix[indices])
        avg_similarity = np.mean(sim_matrix[np.triu_indices(len(indices), k=1)])
        diversity_scores.append(1 - avg_similarity)

    return np.mean(diversity_scores) if diversity_scores else None


precision_k = mean_average_precision()
diversity = diversity_score()

print(precision_k, diversity)


0.65 0.7448700973372376


## Evaluation Metrics

The two computed metrics evaluate different aspects of the blog recommendation system:

### **Precision at K (0.65)**

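- Precision at K measures the fraction of the top **K** recommendation slots that are filled by blogs sharing the selected blog's topic.
- A value of **0.65** means that, averaged over 100 randomly sampled blogs, **65% of the 3 slots** were filled with same-topic blogs.
- Because `get_recommendations_advanced` already discards neighbors from other topics, every returned blog counts as relevant. The score drops below 1.0 only when fewer than 3 same-topic neighbors survive the filter (about 1.95 recommendations per query on average), so it reflects how often the list is filled rather than topical accuracy.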
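
A small sketch shows how the list length drives this metric when every recommendation is already on-topic. The helper `precision_at_k_sketch` is a hypothetical, dataframe-free stand-in for the notebook's `precision_at_k`.

```python
def precision_at_k_sketch(selected_topic, recommended_topics, k=3):
    """Share of the k slots filled by recommendations with the selected topic."""
    relevant = sum(topic == selected_topic for topic in recommended_topics[:k])
    return relevant / k


# All recommendations are on-topic in each case; only the list length changes.
full_list = precision_at_k_sketch("ai", ["ai", "ai", "ai"])
short_list = precision_at_k_sketch("ai", ["ai", "ai"])
empty_list = precision_at_k_sketch("ai", [])

print(full_list, short_list, empty_list)
```

Dividing by the number of recommendations actually returned, and reporting the fill rate separately, would be one way to separate the two effects.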

### **Diversity Score (0.74)**

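- Diversity is computed as 1 minus the mean pairwise cosine similarity between the TF-IDF vectors of the recommended blogs. Queries that return fewer than two recommendations are skipped.
- A score of **0.74** corresponds to an average pairwise similarity of roughly 0.26, so the recommended blogs overlap only modestly in vocabulary.
- Since all recommendations share the selected blog's topic, this suggests the lists are **varied within a topic** rather than near-duplicates of each other.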
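
The diversity calculation can be reproduced on three hand-made vectors. The `toy_recs` array is hypothetical; the formula mirrors the one in `diversity_score`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical vectors for three recommended blogs.
toy_recs = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])

sim_matrix = cosine_similarity(toy_recs)

# Take each unordered pair once: the strict upper triangle of the matrix.
pairwise = sim_matrix[np.triu_indices(len(toy_recs), k=1)]
diversity = 1 - pairwise.mean()

print(round(float(diversity), 3))
```

A diversity of 0 would mean all recommendations point in the same direction, while 1 would mean they share no weighted terms at all.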

