# Welcome to Colab!

## Explore the Gemini API
The Gemini API gives you access to Gemini models created by Google DeepMind. Gemini models are built from the ground up to be multimodal, so you can reason seamlessly across text, images, code, and audio.

**How to get started?**
*  Go to [Google AI Studio](https://aistudio.google.com/) and log in with your Google account.
*  [Create an API key](https://aistudio.google.com/app/apikey).
* Use a quickstart for [Python](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Get_started.ipynb), or call the REST API using [curl](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/rest/Prompting_REST.ipynb).

**Discover Gemini's advanced capabilities**
*  Play with Gemini [multimodal outputs](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Image-out.ipynb), mixing text and images in an iterative way.
*  Discover the [multimodal Live API](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Get_started_LiveAPI.ipynb) (demo [here](https://aistudio.google.com/live)).
*  Learn how to [analyze images and detect items in your pictures](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Spatial_understanding.ipynb) using Gemini (bonus: there's a [3D version](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/Spatial_understanding_3d.ipynb) as well!).
*  Unlock the power of the [Gemini thinking model](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Get_started_thinking.ipynb), capable of solving complex tasks with its inner thoughts.
      
**Explore complex use cases**
*  Use [Gemini grounding capabilities](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/Search_grounding_for_research_report.ipynb) to create a report on a company based on what the model can find on the internet.
*  Extract [invoices and form data from PDF](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/Pdf_structured_outputs_on_invoices_and_forms.ipynb) in a structured way.
*  Create [illustrations based on a whole book](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/Book_illustration.ipynb) using Gemini's large context window and Imagen.

To learn more, check out the [Gemini cookbook](https://github.com/google-gemini/cookbook) or visit the [Gemini API documentation](https://ai.google.dev/docs/).


In [2]:
# Cell 2: Install required libraries
!pip install transformers datasets torch scikit-learn pandas

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_c

In [3]:
# Cell 3: Load and preprocess data
import pandas as pd

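# Load dataset (replace the path with your own CSV)
df = pd.read_csv("/content/sample_data/cleaned_merged_bias_sentences.csv")

# Fix mojibake: a UTF-8 curly apostrophe that was decoded as Windows-1252
df["text"] = df["text"].str.replace("â€™", "'", regex=False)

# Clean text: lowercase, then drop everything except letters, apostrophes, and spaces
df["text"] = df["text"].str.lower().str.replace("[^a-zA-Z' ]", "", regex=True)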

# Check classes
print("Class distribution:")
print(df["bias"].value_counts())

Class distribution:
bias
political             200
corporate_consumer    200
disability            200
age                   200
poverty               200
religious             200
gender                200
Name: count, dtype: int64
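
The cleaning above is applied to the training CSV only, while `generate_bias_report` later in the notebook passes raw text straight to the tokenizer. One way to keep the two consistent is to wrap the steps in a single function and call it in both places. This is a minimal sketch using only the standard library; `clean_text` is a hypothetical helper name, not something the notebook defines.

```python
import re

# Hypothetical helper mirroring the cleaning steps from Cell 3.
_NOT_ALLOWED = re.compile(r"[^a-zA-Z' ]")

def clean_text(text: str) -> str:
    """Repair the mojibake apostrophe, lowercase, and keep letters, apostrophes, spaces."""
    text = text.replace("â€™", "'")   # same literal replacement as Cell 3
    text = text.lower()
    return _NOT_ALLOWED.sub("", text)

print(repr(clean_text("Itâ€™s GREAT!!! 100%")))
```

Digits and punctuation are dropped but the spaces around them are kept, so cleaned strings can contain doubled or trailing spaces. The tokenizer is generally insensitive to that.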


In [4]:
# Cell 4: Split data
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["bias"], random_state=42
)

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")

Training samples: 1120
Validation samples: 280
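
The counts follow from the class sizes: 7 classes × 200 sentences = 1400 rows, and `stratify` keeps the 80/20 ratio inside every class, so each class contributes 160 training and 40 validation rows. The sketch below reproduces that arithmetic with a toy stratified split in plain Python. It illustrates the idea only and is not scikit-learn's implementation; `stratified_split` is a hypothetical name.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with the same class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_val = round(len(indices) * test_size)   # per-class validation share
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

classes = ["age", "corporate_consumer", "disability", "gender",
           "political", "poverty", "religious"]
labels = [c for c in classes for _ in range(200)]   # 200 rows per class, as in the notebook

train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))                  # 1120 280
print(Counter(labels[i] for i in val_idx))           # 40 per class
```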


In [5]:
# Cell 5: Tokenize text
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

train_encodings = tokenizer(
    train_df["text"].tolist(),
    truncation=True,
    padding=True,
    max_length=128
)

val_encodings = tokenizer(
    val_df["text"].tolist(),
    truncation=True,
    padding=True,
    max_length=128
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [6]:
# Cell 6: Create PyTorch datasets
import torch
from torch.utils.data import Dataset
from sklearn.preprocessing import LabelEncoder

class BiasDataset(Dataset):
    """Wraps tokenizer encodings and integer labels for use with Trainer."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example: the tokenizer fields plus the integer label
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        # Trainer and DataLoader need the dataset size to build batches
        return len(self.labels)

# Encode labels
le = LabelEncoder()
train_labels = le.fit_transform(train_df["bias"])
val_labels = le.transform(val_df["bias"])

# Create datasets
train_dataset = BiasDataset(train_encodings, train_labels)
val_dataset = BiasDataset(val_encodings, val_labels)
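
`LabelEncoder` assigns integer ids in sorted order of the class names, which is why the classification report and the bias report further down list the categories alphabetically. The plain-Python sketch below shows the equivalent mapping for the seven class names from the class distribution printed earlier; `label_to_id` and `id_to_label` are illustrative names.

```python
# Class names as they appear in the "bias" column of the dataset.
raw_classes = ["political", "corporate_consumer", "disability", "age",
               "poverty", "religious", "gender"]

# Sorted unique names -> consecutive integer ids, as LabelEncoder.fit does.
classes = sorted(set(raw_classes))
label_to_id = {label: i for i, label in enumerate(classes)}
id_to_label = {i: label for label, i in label_to_id.items()}

for i, label in id_to_label.items():
    print(i, label)
```

Any display mapping keyed by class name has to use these exact strings (for example `corporate_consumer` and `religious`), otherwise the lookup silently misses.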

In [7]:
# Cell 7: Initialize model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=7
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
# Cell 8: Configure training
from transformers import TrainingArguments, Trainer
import numpy as np
from torch.nn import CrossEntropyLoss

# Calculate inverse-frequency class weights (all equal here, since the classes are balanced)
class_counts = np.bincount(train_labels)
class_weights = torch.tensor(1. / class_counts, dtype=torch.float)

# Trainer subclass that applies the class weights in the loss
class CustomTrainer(Trainer):
    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) passed by newer transformers versions
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        loss_fct = CrossEntropyLoss(weight=class_weights.to(labels.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    report_to="none"
)
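
The class weights in this cell are `1 / count` per class. With 160 training samples in every class, all seven weights are identical, so the weighted loss should equal the ordinary cross-entropy. The weights only change anything on an imbalanced dataset. The sketch below checks this in plain Python on made-up logits for three classes. It assumes the weighted-mean reduction documented for `torch.nn.CrossEntropyLoss` (sum of `w[y] * loss` divided by sum of `w[y]`); the function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted mean of per-example losses: sum(w[y] * loss) / sum(w[y])."""
    num, den = 0.0, 0.0
    for logits, y in zip(batch_logits, targets):
        loss = -log_softmax(logits)[y]
        num += weights[y] * loss
        den += weights[y]
    return num / den

# Made-up logits and targets for a 3-class toy batch.
batch_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0], [1.0, 1.0, 1.0]]
targets = [0, 1, 2, 1]

plain = weighted_cross_entropy(batch_logits, targets, [1.0, 1.0, 1.0])
balanced = weighted_cross_entropy(batch_logits, targets, [1 / 160] * 3)
skewed = weighted_cross_entropy(batch_logits, targets, [1 / 160, 1 / 20, 1 / 160])

print(f"plain={plain:.6f} balanced={balanced:.6f} skewed={skewed:.6f}")
```

Equal weights cancel in the ratio, so `plain` and `balanced` agree. `skewed` differs because the rarer class 1 counts for more.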



In [9]:
# Cell 9: Run training
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.704795
2,No log,0.110989
3,No log,0.036947
4,No log,0.024623
5,No log,0.021999


TrainOutput(global_step=350, training_loss=0.3899507359095982, metrics={'train_runtime': 23.1439, 'train_samples_per_second': 241.965, 'train_steps_per_second': 15.123, 'total_flos': 27530835931200.0, 'train_loss': 0.3899507359095982, 'epoch': 5.0})
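
The `No log` entries in the Training Loss column are most likely a logging-interval effect rather than a problem with training. The run takes 350 optimizer steps in total, while `logging_steps` defaults to 500 as far as the library documentation states, so no training loss is recorded before the run ends. The arithmetic, assuming a single device and no gradient accumulation:

```python
import math

train_samples = 1120      # from the split in Cell 4
batch_size = 16           # per_device_train_batch_size
epochs = 5                # num_train_epochs
default_logging_steps = 500   # assumed TrainingArguments default

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)           # 70 350
print(total_steps < default_logging_steps)    # True: no training loss gets logged
```

The total matches `global_step=350` in the output above. Passing a smaller value such as `logging_steps=10` to `TrainingArguments` should make the training loss appear in the table.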

In [10]:
# Cell 10: Evaluate model
from sklearn.metrics import classification_report

predictions = trainer.predict(val_dataset)
preds = np.argmax(predictions.predictions, axis=-1)

print("\nClassification Report:")
print(classification_report(val_labels, preds, target_names=le.classes_))


Classification Report:
                    precision    recall  f1-score   support

               age       1.00      1.00      1.00        40
corporate_consumer       1.00      1.00      1.00        40
        disability       1.00      1.00      1.00        40
            gender       1.00      1.00      1.00        40
         political       1.00      1.00      1.00        40
           poverty       1.00      1.00      1.00        40
         religious       1.00      1.00      1.00        40

          accuracy                           1.00       280
         macro avg       1.00      1.00      1.00       280
      weighted avg       1.00      1.00      1.00       280
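
A perfect score on all seven classes can be genuine for a small, well-separated dataset, but it can also come from identical or near-identical sentences landing on both sides of the split. The notebook does not say which applies here, so an overlap check is a cheap thing to rule out. This is a minimal sketch on made-up sentences; `find_overlap` is a hypothetical helper.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def find_overlap(train_texts, val_texts):
    """Return validation texts whose normalized form also occurs in training."""
    train_set = {normalize(t) for t in train_texts}
    return [t for t in val_texts if normalize(t) in train_set]

# Made-up example sentences, not taken from the dataset.
train_texts = ["Older workers cannot learn new tools", "The company hid the fees"]
val_texts = ["older workers  cannot learn new tools", "Voters on that side are fools"]

leaked = find_overlap(train_texts, val_texts)
print(f"{len(leaked)} of {len(val_texts)} validation texts also appear in training")
```

In the notebook this could be called as `find_overlap(train_df["text"], val_df["text"])`. A large overlap would suggest deduplicating before the split.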



In [11]:
def generate_bias_report(text, threshold=0.3):
    """Return a text table of per-category softmax probabilities for `text`."""
    # Move model to appropriate device and disable dropout for inference
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert to probabilities
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].cpu().numpy()

    # Map label-encoder classes to display names; keys must match le.classes_ exactly
    categories = {
        'political': 'Political Bias',
        'gender': 'Gender Bias',
        'religious': 'Religion Bias',
        'age': 'Age Bias',
        'poverty': 'Poverty Bias',
        'corporate_consumer': 'Consumer Protection Bias',
        'disability': 'Disability Bias'
    }

    # Generate report
    report = []
    for label, prob in zip(le.classes_, probs):
        display_name = categories.get(label, label.title() + " Bias")
        if prob >= threshold:
            status = f"✅ Detected ({prob:.2f})"
        else:
            status = f"⚠️ Low Confidence ({prob:.2f})"

        # Formatting for consistent column widths
        display_name = display_name.ljust(25)
        report.append(f"{display_name} {status}")

    # Add header, padded to 26 so it lines up with the "name + space" rows
    header = "Potential Bias Categories".ljust(26) + "Detection Status\n" + "-"*50
    return "\n".join([header] + report)



In [12]:
test_text = "The legal implications of mistakenly selling an expired food product, followed by a customer's false claims, involve a nuanced interpretation of the Indian Penal Code, focusing on intent, negligence, and the specific circumstances of the case. It's crucial to distinguish between intentional acts and genuine mistakes, as well as to consider the legal consequences of making false claims. The application of the law must consider these factors to ensure a fair and just outcome."
print("Bias Analysis Report")
print(generate_bias_report(test_text))

Bias Analysis Report
Potential Bias Categories Detection Status
--------------------------------------------------
Age Bias                  ⚠️ Low Confidence (0.01)
Consumer Protection Bias  ✅ Detected (0.89)
Disability Bias           ⚠️ Low Confidence (0.01)
Gender Bias               ⚠️ Low Confidence (0.01)
Political Bias            ⚠️ Low Confidence (0.06)
Poverty Bias              ⚠️ Low Confidence (0.01)
Religion Bias             ⚠️ Low Confidence (0.01)
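
The report applies a threshold to softmax probabilities, which always sum to 1 across the seven categories. That has two consequences: the scores rank categories against each other rather than scoring each one independently, and with `threshold=0.3` at most three categories can ever be flagged at once. The plain-Python sketch below shows the computation on made-up logits, which are not the model's actual outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["age", "corporate_consumer", "disability", "gender",
          "political", "poverty", "religious"]
logits = [0.1, 4.5, 0.2, 0.0, 1.8, 0.1, 0.3]   # made-up values for illustration
threshold = 0.3

probs = softmax(logits)
detected = [label for label, p in zip(labels, probs) if p >= threshold]

for label, p in zip(labels, probs):
    print(f"{label:<20}{p:.2f}")
print("detected:", detected)
```

Because every input receives probabilities that sum to 1, a neutral text still has its probability mass spread over the seven bias categories. The model has no "no bias" class to fall back on.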


In [13]:
# Save all necessary components
save_dir = "./bias_detection_model"

# Save model
model.save_pretrained(save_dir)
# Save tokenizer
tokenizer.save_pretrained(save_dir)
# Save label encoder
import pickle
with open(f"{save_dir}/label_encoder.pkl", "wb") as f:
    pickle.dump(le, f)

# Save inference script template
with open(f"{save_dir}/inference_example.py", "w") as f:
    f.write('''
from pathlib import Path
import pickle

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Resolve paths relative to this script, which is saved inside the model directory
MODEL_DIR = Path(__file__).resolve().parent

def load_model():
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    # Only unpickle label_encoder.pkl if it comes from a trusted source
    with open(MODEL_DIR / "label_encoder.pkl", "rb") as f:
        le = pickle.load(f)
    return model, tokenizer, le

def predict(text):
    model, tokenizer, le = load_model()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    # Index of the most probable class, as a plain Python int
    pred_idx = int(torch.argmax(probs, dim=-1).item())
    return le.inverse_transform([pred_idx])[0]

if __name__ == "__main__":
    text = input("Enter text to analyze: ")
    print("Predicted bias:", predict(text))
''')

# Download files
from google.colab import files
!zip -r model.zip ./bias_detection_model
files.download("model.zip")

  adding: bias_detection_model/ (stored 0%)
  adding: bias_detection_model/special_tokens_map.json (deflated 42%)
  adding: bias_detection_model/tokenizer_config.json (deflated 75%)
  adding: bias_detection_model/inference_example.py (deflated 52%)
  adding: bias_detection_model/vocab.txt (deflated 53%)
  adding: bias_detection_model/model.safetensors (deflated 8%)
  adding: bias_detection_model/label_encoder.pkl (deflated 20%)
  adding: bias_detection_model/tokenizer.json (deflated 71%)
  adding: bias_detection_model/config.json (deflated 53%)

