<a href="https://colab.research.google.com/github/samiha-mahin/NLP/blob/main/Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers datasets torch scikit-learn




# **Create the dataset**

In [None]:
import pandas as pd

data = {
    "text": [
        "I keep reliving the accident every night",
        "Loud noises make me panic",
        "I feel anxious in crowded places",
        "I cannot sleep because of nightmares",
        "I feel happy today",
        "I enjoy spending time with friends",
        "Life feels peaceful",
        "I love watching movies"
    ],
    "label": [1,1,1,1,0,0,0,0]
}

df = pd.DataFrame(data)
df


| | text | label |
|---|---|---|
| 0 | I keep reliving the accident every night | 1 |
| 1 | Loud noises make me panic | 1 |
| 2 | I feel anxious in crowded places | 1 |
| 3 | I cannot sleep because of nightmares | 1 |
| 4 | I feel happy today | 0 |
| 5 | I enjoy spending time with friends | 0 |
| 6 | Life feels peaceful | 0 |
| 7 | I love watching movies | 0 |


# **Tokenization using BERT**

In [None]:
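from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Pad to the longest text in the batch and truncate to the model's maximum length.
    return tokenizer(batch["text"], padding=True, truncation=True)

# With batched=True all 8 rows reach tokenize() in one call,
# so every row ends up with the same padded length.
dataset = Dataset.from_pandas(df)
dataset = dataset.map(tokenize, batched=True)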


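To make the effect of `padding=True` and `truncation=True` concrete, the toy sketch below pads a batch of token-id lists to the longest sequence and builds the matching attention mask. It is an illustration only: `pad_batch`, the ids `7` and `8`, and `max_length=5` are made up for the example, and the real tokenizer also handles WordPiece splitting and keeps the closing `[SEP]` token when it truncates.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (and optionally truncate) token-id lists to a common length."""
    if max_length is not None:
        # Simplified truncation: just cut the tail of each sequence.
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased; 7 and 8 are placeholders.
batch = pad_batch([[101, 7, 8, 102], [101, 7, 102]], max_length=5)
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 7, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model treat the shorter sentence as three tokens even though its row has four entries.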

### **Train/Test Split**

In [None]:
# 8 examples -> 6 for training, 2 for testing; the seed makes the split reproducible.
dataset = dataset.train_test_split(test_size=0.2, seed=42)
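With only eight examples, a purely random split can place both test examples in the same class. The sketch below shows one way to pick indices so that each class appears in the test set. It uses only the standard library, and `stratified_split` is a hypothetical helper written for this illustration, not part of `datasets`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with every class represented in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take at least one example per class for the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1, 1, 1, 1, 0, 0, 0, 0]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))        # 6 2
print(sorted(labels[i] for i in test_idx))  # [0, 1]
```

The resulting index lists could then be passed to `Dataset.select` to build the two subsets.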

# **Load Model for Classification**

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)



BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


### **Training**

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=3
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"]
)

trainer.train()



TrainOutput(global_step=6, training_loss=0.611238956451416, metrics={'train_runtime': 24.6626, 'train_samples_per_second': 0.73, 'train_steps_per_second': 0.243, 'total_flos': 101749978440.0, 'train_loss': 0.611238956451416, 'epoch': 3.0})
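The `Trainer` above receives an `eval_dataset` but no metric function, so the run reports only the training loss. A minimal sketch of an accuracy function in the shape accepted by the `compute_metrics` argument of `Trainer` might look like this. The logits and labels in the example call are made up.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) pair passed in at evaluation time."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# Made-up logits for two examples: the first predicts class 1, the second class 0.
fake_logits = np.array([[0.2, 1.5], [0.9, -0.3]])
fake_labels = np.array([1, 1])
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.5}
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` and then calling `trainer.evaluate()` would report this value on the held-out split. With two test examples the accuracy can only be 0, 0.5, or 1, so it is a sanity check rather than a reliable estimate.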

# **Test the Model**

In [None]:
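import torch

text = "I wake up sweating from nightmares"

# Keep the inputs on the same device as the model (Trainer may have moved it to a GPU).
inputs = tokenizer(text, return_tensors="pt").to(model.device)

model.eval()  # switch off dropout for inference
with torch.no_grad():
    outputs = model(**inputs)

pred = torch.argmax(outputs.logits, dim=-1).item()

print("Prediction:", "PTSD" if pred == 1 else "No PTSD")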


Prediction: PTSD
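The `argmax` above reports only the winning class and hides how confident the model is, which matters for a classifier fine-tuned on six sentences. Applying a softmax to the logits gives class probabilities. The sketch below uses plain Python and hypothetical logits, not values produced by the trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [No PTSD, PTSD].
probs = softmax([-0.1, 0.3])
print([round(p, 3) for p in probs])  # [0.401, 0.599]
```

On the real output, `torch.softmax(outputs.logits, dim=-1)` performs the same computation. A probability near 0.5 would suggest the prediction is close to a coin flip.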


# **RAG Agent (Retrieval-Augmented Generation)**

In [None]:
!pip install transformers sentence-transformers faiss-cpu



# **Create Sample Dataset**

In [None]:
documents = [
    "Patient experiences recurring nightmares about a car accident.",
    "Loud sounds trigger panic attacks and sweating.",
    "Avoids crowded places due to anxiety.",
    "Reports feeling calm and relaxed recently.",
    "Enjoys social activities and hobbies.",
    "Experiences flashbacks when seeing reminders of trauma."
]


# **Convert Text → Embeddings**

We use a Sentence-BERT model, `all-MiniLM-L6-v2`, to turn each document into a fixed-length vector (384 dimensions for this model).

In [None]:
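from sentence_transformers import SentenceTransformer

# Note: this rebinds `model`, replacing the fine-tuned BERT classifier from the earlier cells.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents)  # NumPy array, one row per document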



BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



# **Store in Vector Database**

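We use FAISS, a library for fast similarity search. `IndexFlatL2` stores the vectors unchanged and performs exact (brute-force) search by L2 distance.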

In [None]:
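import faiss
import numpy as np

dimension = embeddings.shape[1]  # length of each embedding vector
index = faiss.IndexFlatL2(dimension)  # exact L2 search, no training step needed
index.add(np.asarray(embeddings, dtype="float32"))  # FAISS expects float32 vectors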
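For a handful of documents, what a flat L2 index computes can be written directly in NumPy: the squared Euclidean distance from the query to every stored vector, then the `k` smallest. The sketch below illustrates that idea on toy 2-D vectors. `flat_l2_search` is a hypothetical name, and this is not how FAISS is implemented internally.

```python
import numpy as np

def flat_l2_search(database, queries, k):
    """Exact k-nearest-neighbour search by squared L2 distance."""
    # Pairwise squared distances, shape (n_queries, n_database).
    diffs = queries[:, None, :] - database[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # Indices of the k closest database vectors for each query.
    order = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, order, axis=1), order

db = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
q = np.array([[0.9, 0.0]], dtype="float32")
D, I = flat_l2_search(db, q, k=2)
print(I)  # [[1 0]]
```

The return values mirror the `(D, I)` pair from `index.search`: distances first, then indices into the document list.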


# **Agent Function (Search + Answer)**

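This function is the retrieval half of the agent: it embeds the query, looks up the two nearest documents in the FAISS index, and prints them. It does not yet generate an answer.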

In [None]:
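def agent(query):
    # Embed the query with the same model used for the documents.
    query_embedding = model.encode([query])

    # D holds the distances, I the indices of the k nearest documents.
    D, I = index.search(np.asarray(query_embedding, dtype="float32"), k=2)

    results = [documents[i] for i in I[0]]

    print("Relevant info from dataset:")
    for r in results:
        print("-", r)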


# **Ask the Agent**

In [None]:
agent("nightmares and flashbacks")


Relevant info from dataset:
- Experiences flashbacks when seeing reminders of trauma.
- Patient experiences recurring nightmares about a car accident.
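
The agent above stops at retrieval. In a full RAG pipeline the retrieved passages are usually placed into a prompt for a language model, which then writes the answer. The sketch below shows only that prompt-assembly step. `build_prompt` and the template wording are assumptions made for illustration, and no model is called.

```python
def build_prompt(query, retrieved_docs):
    """Assemble a prompt that grounds the answer in the retrieved passages."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_prompt(
    "nightmares and flashbacks",
    [
        "Experiences flashbacks when seeing reminders of trauma.",
        "Patient experiences recurring nightmares about a car accident.",
    ],
)
print(prompt)
```

The resulting string could be passed to any text-generation model. Which model to use, and how to phrase the instruction, is left open here.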
