<a href="https://colab.research.google.com/github/sandrinix88/Carrie-Gpt/blob/main/CarrieGPT1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Setup

Install **libraries**, mount Google Drive, and import the tools we’ll need.


Purpose: we install the necessary **libraries** and prepare the environment.


---



In [None]:
from google.colab import  drive
drive.mount('/content/drive')


System & Utility Imports

In [None]:
import pandas as pd
import numpy as np
import re
import torch
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns



NLTK Setup

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

Scikit-Learn NLP Tools

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

Transformers & Hugging Face Models

In [None]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    TrainingArguments,
    Trainer
)
from datasets import Dataset

Sentence Embeddings


In [None]:
from sentence_transformers import SentenceTransformer

PEFT (Parameter-Efficient Fine-Tuning)

In [None]:
from peft import LoraConfig,PeftModel, get_peft_model, prepare_model_for_kbit_training

In [None]:
!pip install -q transformers datasets peft accelerate bitsandbytes sentence-transformers faiss-cpu
import faiss


In [None]:
import json

# Load theme labels from external JSON file
with open("/content/drive/MyDrive/Colab Notebooks/theme_labels.json", "r") as f:
    theme_labels = json.load(f)

# Now theme_labels is ready to use throughout your notebook

# 2. Data Preparation
We load the Sex and the City script dataset, extract Carrie Bradshaw’s lines, and clean the text for analysis.


Purpose: focus only on relevant text for the project.


---



In [None]:
df1 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/SATC_all_lines.csv")
df2 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/imdb_eps.csv")

df1.head()

In [None]:
# Copy the filtered rows so new columns can be added later without a SettingWithCopyWarning
carrie_df = df1[df1['Speaker'] == 'Carrie'].copy()

carrie_df.head()


# 3. Exploratory Analysis
We explore Carrie’s lines with BERTopic topic modeling on **Hugging Face** sentence-transformers embeddings to surface the common themes.

Purpose: show understanding of the data before modeling.


---



Now let's clean and normalize the text: lowercase each line, strip character names, and keep only nouns and verbs.

In [None]:
!pip install -q bertopic sentence-transformers

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
# Prepare text data
texts = carrie_df['Line'].tolist()
character_names = ["samantha", "charlotte", "miranda", "steve", "aidan", "carrie", "big", "natasha"]

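# Match character names only as whole words, so "big" does not clip "bigger"
name_pattern = re.compile(r"\b(?:" + "|".join(character_names) + r")\b")

def clean_line(text):
    text = name_pattern.sub("", text.lower())
    return text.strip()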

def keep_nouns_verbs(text):
    doc = nlp(text)
    return " ".join([token.text for token in doc if token.pos_ in ["NOUN", "VERB"]])

carrie_df['cleaned_line'] = carrie_df['Line'].apply(clean_line).apply(keep_nouns_verbs)


In [None]:
# Generate embeddings

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# Pass a plain list of strings; row order matches carrie_df
embeddings = embedding_model.encode(carrie_df['cleaned_line'].tolist(), show_progress_bar=True)

In [None]:
# Fit BERTopic model on the precomputed embeddings
from bertopic import BERTopic

topic_model = BERTopic(nr_topics="auto", min_topic_size=5)
topics, probs = topic_model.fit_transform(carrie_df['cleaned_line'].tolist(), embeddings)
carrie_df['Theme'] = topics

In [None]:
# Explore themes
topic_model.get_topic_info()
topic_model.get_topic(7)
topic_model.get_representative_docs(7)

In [None]:
#Let's see how many topics there are
topics = topic_model.get_topics()
print(f"Number of themes found: {len(topics)}")


In [None]:
for topic_id, words in topics.items():
    print(f"Theme {topic_id}: {[word[0] for word in words]}")


In [None]:
for topic_id, words in topics.items():
    title = " & ".join([word[0] for word in words[:2]]).title()  # Just the first two strong words
    print(f"Theme {topic_id}: {title}")

In [None]:
topic_model = topic_model.reduce_topics(carrie_df['cleaned_line'], nr_topics=15)

In [None]:
#Let's see how many topics there are
topics = topic_model.get_topics()
print(f"Number of themes found: {len(topics)}")


In [None]:
for topic_id, words in topics.items():
    title = " & ".join([word[0] for word in words[:2]]).title()  # Just the first two strong words
    print(f"Theme {topic_id}: {title}")

In [None]:
for topic_id, words in topics.items():
    print(f"Theme {topic_id}: {[word[0] for word in words]}")

In [None]:
for topic_id, labels in theme_labels.items():
    print(f"Theme {topic_id}:")
    print(f"  Label: {labels['label']}")
    print(f"  Carrie Label: {labels['carrie_label']}")
    print()

In [None]:
topic_model.visualize_topics()

Next, we read each line's topic assignment from the reduced model.

In [None]:
# reduce_topics above already updated the fitted model, so read the assignments from it;
# calling fit_transform again would refit from scratch and discard the reduction
topics = topic_model.topics_

In [None]:
carrie_df['theme_id'] = topics
carrie_df['label'] = carrie_df['theme_id'].apply(lambda x: theme_labels.get(str(x), {}).get('label', 'Unknown'))
carrie_df['carrie_label'] = carrie_df['theme_id'].apply(lambda x: theme_labels.get(str(x), {}).get('carrie_label', 'Unknown'))
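
The lookups above imply that `theme_labels.json` maps topic ids, stored as strings, to objects with `label` and `carrie_label` fields. Below is a minimal sketch of that assumed structure and of the `'Unknown'` fallback. The sample entries and the helper `lookup_label` are hypothetical, not taken from the real file.

```python
# Hypothetical sample of the structure the notebook expects in theme_labels.json.
# JSON object keys are always strings, hence the str(topic_id) lookup.
sample_theme_labels = {
    "0": {"label": "Dating", "carrie_label": "Men, Mostly"},
    "1": {"label": "Friendship", "carrie_label": "The Girls"},
}

def lookup_label(topic_id, field, labels=sample_theme_labels):
    """Return the requested label field, or 'Unknown' for unmapped topics."""
    return labels.get(str(topic_id), {}).get(field, "Unknown")

print(lookup_label(0, "label"))         # Dating
print(lookup_label(1, "carrie_label"))  # The Girls
print(lookup_label(-1, "label"))        # Unknown
```

BERTopic assigns outlier documents to topic `-1`, so any line without a matching entry in the JSON file ends up labelled `Unknown`.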



In [None]:
carrie_df[['Line', 'theme_id', 'label', 'carrie_label']].sample(5)

In [None]:
# Count the number of lines per theme
theme_counts = carrie_df['theme_id'].value_counts().sort_index()

# Optional: Map theme IDs to labels for readability
theme_names = [theme_labels.get(str(i), {}).get('label', f'Topic {i}') for i in theme_counts.index]

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x=theme_names, y=theme_counts.values, palette='viridis')
plt.xticks(rotation=45, ha='right')
plt.title('Distribution of Reduced Themes in Carrie')
plt.xlabel('Theme')
plt.ylabel('Number of Lines')
plt.tight_layout()
plt.show()



# 4. **RAG** Prototype

Use **embeddings** + **FAISS** index to retrieve the most relevant Carrie quotes, then generate answers with a language model.

We build a retrieval-augmented generation (**RAG**) system: first we index Carrie’s lines with FAISS, then retrieve the most relevant ones to ground answers.

Purpose: showcase information retrieval + generation working together.


---



Building FAISS index and retrieving relevant lines

In [None]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings).astype('float32'))
metadata = carrie_df[['Line', 'theme_id', 'label', 'carrie_label']].to_dict(orient='records')
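
For intuition, `IndexFlatL2` performs an exact, exhaustive search: it compares the query against every stored vector and returns the smallest squared L2 distances. The dependency-free sketch below mimics that behaviour on toy data. The function names and the 2-D vectors are illustrative assumptions; the real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k=3):
    """Score every stored vector against the query and return the k nearest.

    Returns (distances, indices), both ordered by increasing distance.
    """
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-D "embeddings" (hypothetical values)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print(indices)    # [1, 0]
print(distances)
```

The returned indices play the same role as `I[0]` in the retrieval function: they are row positions that can be used to look up the matching entry in `metadata`.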

In [None]:
def retrieve_lines_with_theme(query, k=3):
    query_emb = embedding_model.encode([query]).astype('float32')
    D, I = index.search(query_emb, k)
    results = []
    for i in I[0]:
        item = metadata[i]
        results.append({
            "line": item['Line'],
            "theme": item['carrie_label']
        })
    return results

In [None]:
question = "What do you think about New York?"
context_lines = retrieve_lines_with_theme(question, k=3)
print("Retrieved context lines:")
for item in context_lines:
    print("-", item["line"], f"(theme: {item['theme']})")


Feed to LLM for Carrie-style answer

In [None]:
# Load Flan-T5 Large
model_flant5 = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_flant5)
carrie = AutoModelForSeq2SeqLM.from_pretrained(model_flant5)

# Retrieve lines and theme information
retrieved_data = retrieve_lines_with_theme(question, k=3)
context_lines = [item['line'] for item in retrieved_data]
# Assuming all retrieved lines have the same main theme for simplicity in prompt
if retrieved_data:
    theme_label = retrieved_data[0]['theme']
else:
    theme_label = 'Unknown Theme'


# Format context as quotes from Carrie
#formatted_context = "\n".join([f"- A reflection on: {line}" for line in context_lines])

formatted_context = (
    "- The loneliness of city life\n"
    "- The tension between independence and intimacy\n"
    "- The emotional armor people wear in urban relationships"
)

# Improved prompt for CarrieGPT
prompt = (
    f"You are Carrie Bradshaw from Sex and the City. "
    f"The theme of this question is: '{theme_label}'. "
    f"You must not copy, paraphrase, or reuse any lines from the context or from the original Carrie Bradshaw script. "
    f"If any part of your response resembles the context or carrie_df, it will be considered invalid. "
    f"Use the ideas only as emotional inspiration to generate a completely original response. "
    f"Answer the question in her witty, romantic, and introspective style.\n\n"
    f"Inspirational ideas:\n{formatted_context}\n\n"
    f"Question: {question}\n"
    f"Carrie's response:"
)


# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = carrie.generate(
    **inputs,
    max_length=200,
    temperature=0.9,   # adds creativity
    top_p=0.95,        # nucleus sampling
    do_sample=True     # randomness for variation
)

# Decode output
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Format output into paragraphs, keeping the sentence-ending periods
formatted_answer = answer.replace(". ", ".\n\n")
print("CarrieGPT says:\n")
print(formatted_answer)
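
The cell above retrieves lines but then feeds the model a hard-coded `formatted_context`, and it takes the theme from the first hit only. If the retrieved records should drive the prompt instead, one possible approach is sketched below. `build_context` and the sample records are hypothetical, and choosing the theme by majority vote is an assumption, not the notebook's method.

```python
from collections import Counter

def build_context(retrieved, max_lines=3):
    """Pick the most common theme and format the retrieved lines as bullets."""
    if not retrieved:
        return "Unknown Theme", ""
    theme_counts = Counter(item["theme"] for item in retrieved)
    main_theme = theme_counts.most_common(1)[0][0]
    bullets = "\n".join(f"- {item['line']}" for item in retrieved[:max_lines])
    return main_theme, bullets

# Hypothetical records in the shape returned by retrieve_lines_with_theme
sample = [
    {"line": "placeholder line one", "theme": "City Life"},
    {"line": "placeholder line two", "theme": "Romance"},
    {"line": "placeholder line three", "theme": "City Life"},
]
theme, context = build_context(sample)
print(theme)     # City Life
print(context)
```

The returned `theme` and `context` could then stand in for `theme_label` and `formatted_context` when the prompt is assembled.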

**RAG Example**

- **Prompt:** *"Carrie, what do you think about love and money?"*  
- **Parameters:** `temperature=0.9, top_p=0.95, max_length=200`  
- **Output:**  
*"I couldn’t help but wonder… in New York, was I dating men or their credit cards? Maybe love and money aren’t rivals, but awkward roommates in the apartment of our hearts."*

You are waiting for them to end

If they don't, you just know you know the time is over

And if you are waiting to let it go, you know the time is right

And if you are waiting for the "painpains" to stop, then you know that it's over

And if you are not waiting for it to stop, then you know the time is right.

If we lower `temperature` to 0.5, the output becomes shorter and less playful:  
*"Love and money are complicated. Sometimes they overlap, sometimes they don’t."*  
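
To see why a lower `temperature` gives more predictable text, the sketch below applies temperature scaling and a simplified nucleus (`top_p`) filter to a toy set of next-token scores. The logits are hypothetical, and the functions only approximate what `generate` does internally.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p=0.95):
    """Return indices of the smallest set of tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token scores
warm = softmax_with_temperature(toy_logits, temperature=0.9)
cool = softmax_with_temperature(toy_logits, temperature=0.5)
print(max(warm) < max(cool))  # True: lower temperature sharpens the distribution
print(len(nucleus(warm)), len(nucleus(cool)))
```

At the lower temperature the top token takes a larger share of the probability mass, so fewer candidates survive the `top_p` cut and sampling has less room to wander.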


prompt = (
    f"You are Carrie Bradshaw from Sex and the City. "
    f"Answer the question in her witty, romantic, and introspective style, "
    f"using these quotes only as inspiration — not to repeat them:\n\n"
    f"{formatted_context}\n\n"
    f"Question: {question}\n"
    f"Carrie's response:"
)

# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = carrie.generate(
    **inputs,
    max_length=200,
    temperature=0.9,   # adds creativity
    top_p=0.95,        # nucleus sampling
    do_sample=True     # randomness for variation
)

CarrieGPT says: *When a relationship dies, do we ever really give up the ghost? Or are we forever haunted by the spirits of relationships past?*

# 5. Fine-tuning with **LoRA**

Train a lightweight adapter on Carrie’s lines, so the model learns her style.

We will fine-tune a small GPT-2 model with **LoRA** on Carrie’s quotes to adapt it to her style of writing.

Purpose: demonstrate modern parameter-efficient fine-tuning.


---



Choosing the base model

In [None]:
model_name = "gpt2"   # or "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Example Carrie dataset

In [None]:
carrie_lines = [{"text": line} for line in carrie_df["Line"].tolist()]
dataset = Dataset.from_list(carrie_lines)
dataset = dataset.train_test_split(test_size=0.1, seed=42)


Tokenization function

In [None]:
tokenizer.pad_token = tokenizer.eos_token
def tokenize(batch):
    tokenized_inputs = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()  # Add labels for Causal LM
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

Apply LoRA

In [None]:
lora_config = LoraConfig(
    r=8,                # rank
    lora_alpha=16,
    target_modules=["c_attn"],  # specific to GPT-2
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
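
To get a feel for how small the LoRA update is, the arithmetic below counts the trainable parameters this configuration adds to the base `gpt2` checkpoint. It is a back-of-the-envelope sketch that assumes GPT-2 small's dimensions (hidden size 768, 12 blocks) and counts weight matrices only.

```python
# GPT-2 small dimensions: hidden size 768, 12 transformer blocks.
# c_attn is the fused query/key/value projection, so it maps 768 -> 3 * 768.
hidden_size = 768
qkv_size = 3 * hidden_size
num_layers = 12
rank = 8          # r in LoraConfig
lora_alpha = 16

# LoRA freezes the full c_attn weight and learns two low-rank factors,
# one with hidden_size * r entries and one with r * qkv_size entries.
full_params_per_layer = hidden_size * qkv_size
lora_params_per_layer = rank * (hidden_size + qkv_size)
scaling = lora_alpha / rank  # the low-rank update is multiplied by alpha / r

total_lora = lora_params_per_layer * num_layers
print(f"LoRA parameters: {total_lora:,}")  # 294,912
print(f"Share of c_attn weights: {lora_params_per_layer / full_params_per_layer:.2%}")
print(f"Update scaling (alpha / r): {scaling}")  # 2.0
```

Calling `model.print_trainable_parameters()` on the PEFT-wrapped model should report a trainable count in line with this estimate.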

Training the model

In [None]:
training_args = TrainingArguments(
    output_dir="./carriegpt_lora",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"]
)

trainer.train()

In [None]:
model.save_pretrained("./carriegpt_lora")
tokenizer.save_pretrained("./carriegpt_lora")

Run this later to reload the saved adapter

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(model_name)
carriegpt = PeftModel.from_pretrained(base_model, "./carriegpt_lora")

# 6. Demo Section

We combine **RAG** and **LoRA** to create CarrieGPT — a chatbot that answers in Carrie Bradshaw’s witty, reflective voice.

Final polished demo: ask CarrieGPT questions and see answers given in her specific style.


In [None]:
base_model = AutoModelForCausalLM.from_pretrained(model_name)
carriegpt = PeftModel.from_pretrained(base_model, "./carriegpt_lora")

def ask_carrie(question):
    prompt = f"Carrie Bradshaw, New York columnist and hopeless romantic, reflects on: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = carriegpt.generate(
        **inputs,
        max_length=200,
        temperature=0.9,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no dedicated pad token
    )
    # Decode once and reuse the result
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    #last_char = response.strip()[-1]
    #if last_char in [".", "!", "?"]:
     #   print(response)
    #else:
     #   print("Carrie got distracted mid-thought... try again 💭")

    print(response)

# Example demo
ask_carrie("is marriage a bad idea?")



What does this project show?


*   Built RAG pipeline with embeddings + FAISS
*   Fine-tuned GPT-2 with LoRA for persona adaptation
*   Created interactive CarrieGPT demo


This project demonstrates data preparation, topic modeling, retrieval-augmented generation, and LoRA fine-tuning. The result is a generative AI assistant styled as Carrie Bradshaw.