<a href="https://colab.research.google.com/github/shagufta24/Personalized_Writing_Assistant/blob/main/Summarize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q -U bitsandbytes

In [None]:
import csv
import json
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from tqdm import tqdm

In [None]:
review_prompt = ("Summarize the following product review in imperative form, in one to three sentences, "
              "ensuring the summary captures key sentiments, features, and feedback (positive or negative) about the product. "
              "Avoid copying the review verbatim but retain the essence. The summary should look like an instruction for writing a similar review "
              "and start with 'Instruction:'. \nHere is the product review: \n")

blog_prompt = ("Summarize the following blog post in imperative form by capturing its main theme, key events, and personal reflections. "
              "Your output should look like a detailed instruction for writing the blog post and should start with 'Instruction:'. "
              "\nHere is the blog post: \n")

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
The token `my_token` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `my_token`


In [None]:
# Load the Mistral 7B Instruct model and tokenizer
model_name = "mistralai/Mistral-7B-Instruct-v0.3"

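bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # use the NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16   # run computations in bfloat16
)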

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, add_bos_token=True, trust_remote_code=True)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
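
As a rough back-of-the-envelope check on why the 4-bit configuration fits on a single Colab GPU, the sketch below estimates the memory taken by the quantized weights. It is not part of the original notebook. The parameter count (about 7 billion), the block size of 64 with one 32-bit scaling constant per block, and the double-quantization layout (8-bit constants regrouped in blocks of 256, following the QLoRA paper) are all assumptions rather than values read from the checkpoint, and layers that stay unquantized are ignored.

```python
def quantized_weight_gib(n_params, bits_per_weight=4.0, block_size=64,
                         double_quant=True, dq_block_size=256):
    """Estimate weight memory in GiB for block-wise low-bit quantization."""
    if double_quant:
        # Assumed layout: one 8-bit constant per block, plus one 32-bit
        # constant shared by every group of `dq_block_size` blocks.
        overhead_bits = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # Assumed layout: one 32-bit scaling constant per block.
        overhead_bits = 32 / block_size
    total_bits = n_params * (bits_per_weight + overhead_bits)
    return total_bits / 8 / 1024**3


N_PARAMS = 7e9  # assumption: roughly 7 billion parameters

for dq in (False, True):
    size = quantized_weight_gib(N_PARAMS, double_quant=dq)
    print(f"4-bit, double_quant={dq}: {size:.2f} GiB")

# Unquantized 16-bit weights for comparison
print(f"bf16 baseline: {N_PARAMS * 16 / 8 / 1024**3:.2f} GiB")
```

The estimate covers weights only; activations and the key-value cache used during `generate` add to the actual GPU footprint.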

In [None]:
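def generate_inst(text, prompt=review_prompt):
  # Build the model input; pass prompt=blog_prompt when summarizing blog posts
  inp = tokenizer(prompt + text, return_tensors="pt").to(model.device)
  model.eval()
  with torch.no_grad():
    out_ids = model.generate(**inp, max_new_tokens=100, repetition_penalty=1.15)
  # Decode only the newly generated tokens, so the 'Instruction:' inside the prompt is never matched
  new_tokens = out_ids[0][inp["input_ids"].shape[1]:]
  output = tokenizer.decode(new_tokens, skip_special_tokens=True)
  # Keep the text after the first 'Instruction:' marker; fall back to the whole output if it is missing
  parts = output.split("Instruction:", 1)
  return (parts[1] if len(parts) > 1 else output).strip()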

In [None]:
generate_inst("This is a test")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


'Write a concise review for the Fitbit Charge 4 that highlights its sleek design, comfortable'
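
Because generation stops after `max_new_tokens=100`, a summary can end mid-sentence, as the output above appears to. One possible post-processing step is sketched below with a hypothetical helper, `trim_to_last_sentence`, which is not part of the original notebook. It keeps only complete sentences and leaves the text untouched when no sentence boundary is found.

```python
import re


def trim_to_last_sentence(text):
    """Drop a trailing sentence fragment; return the text unchanged if no sentence ends."""
    text = text.strip()
    # Match '.', '!' or '?' only when followed by whitespace or the end of the
    # string, so decimals such as "4.5" are not treated as sentence endings.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text
    return text[:matches[-1].end()]


print(trim_to_last_sentence("Write a short review. Mention the battery and"))
```

Applied to the summary returned by `generate_inst`, this would trade a little content for cleaner instructions; whether that trade is worthwhile depends on how the summaries are used downstream.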

In [None]:
def summarize(input_path, output_path, limit=400):
    # Rows are written in CSV format regardless of the output file extension
    with open(input_path, "r", newline='') as csv_file, open(output_path, "w", newline='') as csv_out_file:
        reader = csv.DictReader(csv_file)
        fieldnames = ['summary', 'original_text']
        writer = csv.DictWriter(csv_out_file, fieldnames=fieldnames)
        writer.writeheader()

        for i, row in enumerate(tqdm(reader, total=limit, desc="Processing rows")):
            # Stop after `limit` rows so the count matches the progress bar total
            if i >= limit: break
            summary = generate_inst(row['text'])
            writer.writerow({'summary': summary, 'original_text': row['text']})
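
Since later cells name their output files `.jsonl` and `json` is imported but never used, a JSON Lines writer may be what was intended. The sketch below is one possible variant, not the notebook's own method: the function name `summarize_to_jsonl` and the `summarize_fn` parameter are hypothetical, and the input is assumed to be a CSV with a `text` column. It uses `itertools.islice` to cap the number of rows and writes one JSON object per line.

```python
import csv
import json
from itertools import islice


def summarize_to_jsonl(input_path, output_path, summarize_fn, limit=400):
    """Read up to `limit` rows from a CSV and write one JSON object per line."""
    with open(input_path, "r", newline="", encoding="utf-8") as csv_file, \
         open(output_path, "w", encoding="utf-8") as out_file:
        reader = csv.DictReader(csv_file)
        # islice yields at most `limit` rows, so no manual counter is needed
        for row in islice(reader, limit):
            record = {
                "summary": summarize_fn(row["text"]),
                "original_text": row["text"],
            }
            out_file.write(json.dumps(record, ensure_ascii=False) + "\n")
```

In the notebook this could be called as `summarize_to_jsonl("train_blog_user1.csv", "train_blog_user1.jsonl", generate_inst)`. Passing the summarizer as an argument also makes the function easy to try with a cheap stand-in before running the model.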

In [None]:
train_csv = "./train/user_1_train.csv"
test_csv = "./test/user_1_test.csv"

In [None]:
summarize(test_csv, 'result_user_100_test.csv')

Processing rows:   0%|          | 0/400 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   0%|          | 1/400 [00:07<50:09,  7.54s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   0%|          | 2/400 [00:15<52:44,  7.95s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   1%|          | 3/400 [00:22<48:12,  7.29s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   1%|          | 4/400 [00:30<50:20,  7.63s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   1%|▏         | 5/400 [00:38<51:31,  7.83s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   2%|▏         | 6/400 [00:46<51:45,  7.88s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   2%|▏         | 7/400 [00:54<52:28,  8.01s/it

In [None]:
# Input CSV file paths
train_csv = "train_blog_user1.csv"
# test_csv = "test_blog_user1.csv"

# Output file paths (summarize() writes CSV-formatted rows despite the .jsonl extension)
train_jsonl = "train_blog_user1.jsonl"
# test_jsonl = "test_blog_user1.jsonl"

# Caution: generate_inst() uses review_prompt here, so blog_prompt is not applied
summarize(train_csv, train_jsonl)
# summarize(test_csv, test_jsonl)

Processing rows:   0%|          | 0/400 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   0%|          | 1/400 [00:08<55:04,  8.28s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   0%|          | 2/400 [00:16<54:54,  8.28s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   1%|          | 3/400 [00:24<54:48,  8.28s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   1%|          | 4/400 [00:33<54:33,  8.27s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   1%|▏         | 5/400 [00:41<54:26,  8.27s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   2%|▏         | 6/400 [00:49<54:26,  8.29s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Processing rows:   2%|▏         | 7/400 [00:57<54:13,  8.28s/it