## Fine-Tuning an LLM for Movie Recommendations

#### Goals
- Improve the model's relevance in generating recommendations.
- Address challenges of domain-specific language and sparse user data.
- Create a deployable recommendation pipeline.

#### Requirements

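- A pre-trained LLM with downloadable weights (e.g., GPT-2, GPT-J, or another open-source model). Hosted models such as GPT-3 and GPT-4 are not distributed as weights, so they cannot be loaded with the local tooling used here.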
- A fine-tuning framework such as Hugging Face Transformers.
- A movie dataset (e.g., MovieLens or IMDb).
- Compute resources (GPU recommended).

In [None]:
pip install transformers datasets accelerate peft promptify

#### Step 1: Prepare the Dataset

Use a dataset containing:

- Movie descriptions (e.g., summaries, genres, metadata).
- User-item interaction data (e.g., ratings, reviews, watch history).

Input-Output Pair Formatting:

- Input: A prompt with user context, such as:

`User preferences: [list of liked movies]. Recommend 3 movies based on their taste.`

- Output: A list of relevant recommendations or a response explaining the recommendations.
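
The pair format described above can be sketched as a small helper. The function name `build_example` and the sample titles are illustrative assumptions, not part of any library or of the original pipeline.

```python
def build_example(liked_movies, recommendations, n=3):
    """Format one (prompt, response) training pair as plain strings."""
    prompt = (
        f"User preferences: {', '.join(liked_movies)}. "
        f"Recommend {n} movies based on their taste."
    )
    # Keep at most n target titles so the response matches the request
    response = ", ".join(recommendations[:n])
    return {"prompt": prompt, "response": response}


# Illustrative titles only
example = build_example(
    liked_movies=["Toy Story (1995)", "Heat (1995)"],
    recommendations=["Jumanji (1995)", "Casino (1995)", "Babe (1995)", "Clueless (1995)"],
)
print(example["prompt"])
print(example["response"])
```

Storing the response as a single string rather than a Python list keeps it directly usable as tokenizer input.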

In [None]:
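import io
import urllib.request
import zipfile

import pandas as pd

# MovieLens is distributed as a zip archive, so download it once and read the CSVs from it
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
with urllib.request.urlopen(url) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

ratings = pd.read_csv(archive.open("ml-latest-small/ratings.csv"))
movies = pd.read_csv(archive.open("ml-latest-small/movies.csv"))

# Merge ratings with movie metadata
data = ratings.merge(movies, on='movieId')
data.head()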

#### Step 2: Preprocess the Data

- Clean and deduplicate entries.
- Tokenize using a tokenizer matching the pre-trained LLM.

In [None]:
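from transformers import AutoTokenizer

# "gpt-4" is not on the Hugging Face Hub; use the tokenizer of the model being fine-tuned
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Clean and deduplicate entries
data = data.dropna(subset=["title", "genres"]).drop_duplicates(subset=["userId", "movieId"])

# Build the text from columns that exist in the merged frame, then tokenize it
data["input"] = data["title"] + " | " + data["genres"]
data["input_ids"] = data["input"].apply(
    lambda x: tokenizer.encode(x, truncation=True, max_length=64, padding="max_length")
)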


#### Step 3: Format Input-Output Pairs

Create a prompt-response dataset for fine-tuning.

In [None]:
import random

random.seed(42)  # Make the sampling reproducible

# Format prompts and responses
def create_prompt_response(group):
    movies = list(group['title'])
    random.shuffle(movies)
    liked_movies = movies[:3]  # Randomly pick up to 3 liked movies
    prompt = f"User preferences: {', '.join(liked_movies)}. Recommend 3 movies based on their taste."
    # Placeholder targets: other movies this user rated, disjoint from the prompt
    recommendations = movies[3:6]
    # Store the response as a string so it can be tokenized and split later
    return pd.Series({'prompt': prompt, 'response': ', '.join(recommendations)})

formatted_data = data.groupby('userId').apply(create_prompt_response).reset_index(drop=True)
formatted_data.head()
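
The random placeholder targets above could be replaced by a time-based split. The sketch below is one possible approach, not the notebook's method: it assumes a rating of 4.0 or higher means "liked" and holds out each user's most recent liked movies as targets. The function name `split_history` and both parameters are hypothetical.

```python
def split_history(rated, like_threshold=4.0, n_targets=3):
    """Split one user's ratings into prompt history and held-out targets.

    rated: list of (title, rating, timestamp) tuples, as in MovieLens ratings.
    """
    # Keep only liked movies, ordered from oldest to newest
    liked = sorted((r for r in rated if r[1] >= like_threshold), key=lambda r: r[2])
    if len(liked) <= n_targets:
        return [], []  # Not enough liked movies to build a pair
    history = [title for title, _, _ in liked[:-n_targets]]
    targets = [title for title, _, _ in liked[-n_targets:]]
    return history, targets


# Toy data: (title, rating, timestamp)
ratings_for_user = [
    ("A", 5.0, 10), ("B", 2.0, 20), ("C", 4.0, 30),
    ("D", 4.5, 40), ("E", 5.0, 50), ("F", 3.0, 60), ("G", 4.0, 70),
]
history, targets = split_history(ratings_for_user)
print(history, targets)
```

Holding out the latest liked items keeps prompt and targets disjoint and avoids training on information from a user's future.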


#### Step 4: Fine-Tuning Methods

**a. Full Fine-Tuning**

In [None]:
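from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# Convert the DataFrame to a Dataset and hold out a test split
dataset = Dataset.from_pandas(formatted_data).train_test_split(test_size=0.1, seed=42)

# Tokenize dataset: a causal LM trains on prompt + response as one sequence
def tokenize(batch):
    texts = [f"{p}\n{r}{tokenizer.eos_token}" for p, r in zip(batch['prompt'], batch['response'])]
    return tokenizer(texts, truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./full_finetune",
    eval_strategy="epoch",  # Named evaluation_strategy in older transformers releases
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    save_total_limit=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    # Pads each batch and copies input_ids into labels for next-token prediction
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Train the model
trainer.train()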
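
A common refinement, not used in the cell above, is to compute the loss only on the response tokens so the model is not trained to reproduce the prompt. The sketch below shows the idea on plain lists of token ids. The helper name `build_labels` is hypothetical, and the token ids are made up.

```python
IGNORE_INDEX = -100  # Label value that PyTorch's cross-entropy loss skips by default


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response ids and mask the prompt positions in the labels."""
    input_ids = prompt_ids + response_ids
    # Prompt positions are ignored by the loss; response positions keep their ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


input_ids, labels = build_labels(prompt_ids=[11, 12, 13], response_ids=[21, 22])
print(input_ids)
print(labels)
```

Causal language models in Transformers shift the labels internally, so `labels` has the same length as `input_ids` here.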


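**b. LoRA (Parameter-Efficient Fine-Tuning)**

LoRA freezes the pre-trained weights and trains small low-rank adapter matrices injected into selected layers, so it needs far less memory and compute than full fine-tuning.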
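
To give a sense of scale, the sketch below counts trainable parameters for one adapted weight matrix. It assumes the adapter is attached only to GPT-2 small's fused attention projection (768 inputs, 2304 outputs, 12 layers), which is an assumption about where the adapters go, and it uses the rank `r=8` from the configuration in this section. The helper name is hypothetical.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumed shapes: GPT-2 small fused attention projection, 12 transformer layers
d_in, d_out, r, n_layers = 768, 2304, 8, 12

full_per_layer = d_in * d_out
lora_per_layer = lora_trainable_params(d_in, d_out, r)

print(f"Full weight per layer: {full_per_layer:,}")
print(f"LoRA factors per layer: {lora_per_layer:,}")
print(f"LoRA total over {n_layers} layers: {lora_per_layer * n_layers:,}")
```

Under these assumptions the adapters add about 1.4% of the parameters of the matrices they modify.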


In [None]:
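from peft import get_peft_model, LoraConfig, TaskType

# Start from a fresh base model so LoRA is not stacked on the fully fine-tuned weights
base_model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # Rank of the low-rank update matrices
    lora_alpha=32,     # Scaling factor applied to the update
    lora_dropout=0.1,
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # Only the adapter weights are trainable

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    data_collator=trainer.data_collator,  # Reuse the collator from the previous run
)

# Train the model
trainer.train()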


**c. Prompt Tuning**

Prompt tuning keeps the base model frozen and learns a small set of continuous "virtual token" embeddings that are prepended to every input.
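
The mechanism can be illustrated without any ML library. The toy sketch below uses plain Python lists as embeddings. The function names and the tiny sizes are illustrative assumptions, and a real implementation would update the soft prompt by gradient descent.

```python
import random


def init_soft_prompt(num_virtual_tokens, hidden_size, seed=0):
    """Create the only trainable parameters: one vector per virtual token."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
            for _ in range(num_virtual_tokens)]


def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Return the sequence the frozen model would see: virtual tokens, then real tokens."""
    return soft_prompt + token_embeddings


hidden_size = 4  # Toy size; GPT-2 small uses 768
soft_prompt = init_soft_prompt(num_virtual_tokens=2, hidden_size=hidden_size)
token_embeddings = [[0.0] * hidden_size for _ in range(3)]  # Stand-in for 3 real tokens

sequence = prepend_soft_prompt(soft_prompt, token_embeddings)
trainable = len(soft_prompt) * hidden_size
print(len(sequence), trainable)
```

The trainable parameter count is just the number of virtual tokens times the hidden size, which is why prompt tuning is the lightest of the three methods.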


In [None]:
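from peft import PromptTuningConfig, TaskType, get_peft_model

# PromptTuningConfig lives in peft; num_virtual_tokens sets the length of the learned prompt
prompt_config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
prompt_tuning_model = get_peft_model(AutoModelForCausalLM.from_pretrained(model_name), prompt_config)

# Train the virtual tokens so the later comparison is not against an untrained prompt
prompt_trainer = Trainer(model=prompt_tuning_model, args=training_args, train_dataset=dataset['train'],
                         eval_dataset=dataset['test'], data_collator=trainer.data_collator)
prompt_trainer.train()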


#### Step 5: Evaluate Dataset Diversity

Compare performance on the full (diverse) evaluation data against a small (limited) subset to see how sensitive the model is to data coverage.
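
One way to define "diverse" and "limited" explicitly is by how many distinct genres a user has rated. The sketch below is a hypothetical criterion, not the notebook's. It relies on MovieLens storing genres as `|`-separated strings, and the threshold `min_genres` is an arbitrary assumption.

```python
def genre_diversity(genre_strings):
    """Count distinct genres across a user's movies (MovieLens separates genres with '|')."""
    genres = set()
    for s in genre_strings:
        genres.update(g for g in s.split("|") if g)
    return len(genres)


def split_by_diversity(user_genres, min_genres=4):
    """Partition user ids into diverse and limited groups by distinct genre count."""
    diverse, limited = [], []
    for user_id, genre_strings in user_genres.items():
        target = diverse if genre_diversity(genre_strings) >= min_genres else limited
        target.append(user_id)
    return diverse, limited


# Toy data: user id -> genre strings of the movies that user rated
user_genres = {
    1: ["Action|Thriller", "Comedy|Romance", "Drama"],
    2: ["Comedy", "Comedy|Romance"],
}
diverse_users, limited_users = split_by_diversity(user_genres)
print(diverse_users, limited_users)
```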

In [None]:
# Subset: a 10% random sample of the tokenized test split
test_split = dataset['test']
limited_data = test_split.shuffle(seed=42).select(range(max(1, len(test_split) // 10)))

# Evaluate the most recently trained model on the full split and on the small subset.
# To compare training regimes, repeat the fine-tuning step on a limited training subset first.
diverse_results = trainer.evaluate(eval_dataset=test_split)
limited_results = trainer.evaluate(eval_dataset=limited_data)
print(f"Diverse Data Results: {diverse_results}")
print(f"Limited Data Results: {limited_results}")


#### Step 6: Analyze and Compare

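Compare the three methods using ranking metrics such as Precision@K and Recall@K. Precision@K is the fraction of the top K recommendations that are relevant; Recall@K is the fraction of all relevant items that appear in the top K.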
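
A small worked example with made-up item labels shows how the two metrics behave as K grows. The helper name `metrics_at_k` is illustrative, and precision is divided by K here, which is one common convention.

```python
def metrics_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for one ranked list."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k, hits / len(relevant)


recommended = ["A", "B", "C", "D", "E"]  # Ranked model output
relevant = ["B", "E", "F"]               # Ground-truth items

for k in (3, 5):
    precision, recall = metrics_at_k(recommended, relevant, k)
    print(f"k={k}: precision={precision:.2f}, recall={recall:.2f}")
```

At K=3 only "B" is a hit, so both metrics are 1/3. At K=5, "E" is also found, giving a precision of 2/5 and a recall of 2/3.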

In [None]:
import numpy as np
import pandas as pd

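def precision_at_k(recommended, relevant, k):
    """Calculate Precision@K: hits in the top K divided by K."""
    if k <= 0:
        return 0.0
    recommended_at_k = recommended[:k]
    relevant_set = set(relevant)
    # Dividing by k avoids a ZeroDivisionError when the model returns no items
    precision = len(set(recommended_at_k) & relevant_set) / k
    return precision

def recall_at_k(recommended, relevant, k):
    """Calculate Recall@K: hits in the top K divided by the number of relevant items."""
    recommended_at_k = recommended[:k]
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # Nothing to recall for this sample
    recall = len(set(recommended_at_k) & relevant_set) / len(relevant_set)
    return recall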

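import re

def split_titles(text):
    """Split a title list on commas that follow ')', since titles like 'Matrix, The (1999)' contain commas."""
    return [t.strip() for t in re.split(r"(?<=\)),\s*", text) if t.strip()]

def evaluate_model(model, tokenizer, test_dataset, k_values=(3, 5, 10)):
    """Evaluate a model using Precision@K and Recall@K."""
    precision_scores = {k: [] for k in k_values}
    recall_scores = {k: [] for k in k_values}
    device = next(model.parameters()).device  # The model may be on GPU after training
    model.eval()

    for sample in test_dataset:
        prompt = sample['prompt']
        response = sample['response']
        ground_truth = split_titles(response) if isinstance(response, str) else list(response)  # Relevant items
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
        # generate() returns prompt + continuation; keep only the newly generated tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        recommended = split_titles(tokenizer.decode(new_tokens, skip_special_tokens=True))

        for k in k_values:
            precision_scores[k].append(precision_at_k(recommended, ground_truth, k))
            recall_scores[k].append(recall_at_k(recommended, ground_truth, k))

    avg_precision = {k: np.mean(precision_scores[k]) for k in k_values}
    avg_recall = {k: np.mean(recall_scores[k]) for k in k_values}
    return avg_precision, avg_recall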

# Define K values
k_values = [3, 5, 10]

# Evaluate Full Fine-Tuning
full_finetune_precision, full_finetune_recall = evaluate_model(model, tokenizer, dataset['test'], k_values)

# Evaluate LoRA
lora_precision, lora_recall = evaluate_model(peft_model, tokenizer, dataset['test'], k_values)

# Evaluate Prompt Tuning
prompt_tuning_precision, prompt_tuning_recall = evaluate_model(prompt_tuning_model, tokenizer, dataset['test'], k_values)

# Aggregate results: one row per (method, K) so every column is filled
method_scores = {
    "Full Fine-Tuning": (full_finetune_precision, full_finetune_recall),
    "LoRA": (lora_precision, lora_recall),
    "Prompt Tuning": (prompt_tuning_precision, prompt_tuning_recall),
}
results = [
    {"Method": method, "K": k, "Precision@K": precision[k], "Recall@K": recall[k]}
    for method, (precision, recall) in method_scores.items()
    for k in k_values
]

# Convert Results to DataFrame
results_df = pd.DataFrame(results)
print(results_df)
