# Oracle Dataset Exploration

This notebook explores the Oracle dataset stored in an Excel file located at:
```
/Users/tayebekavousi/Desktop/github_sa/datasets/binary/oracle.xlsx
```

We will:
1. Display the first 10 rows of the dataset.
2. Sample 3 rows per class (`is_toxic` labels).
3. Visualize the distribution of classes.
4. Analyze text lengths (word count) and display statistics.
5. Generate a word cloud for each class.
6. Prepare a PyTorch `Dataset` and `DataLoader` for fine-tuning.


## 1. Importing Required Libraries

We import necessary libraries including `pandas`, `numpy`, `matplotlib`, `wordcloud`, and `torch`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import torch
from torch.utils.data import Dataset, DataLoader

# Enable inline plotting
%matplotlib inline

## 2. Loading the Oracle Dataset from Excel

We load the dataset from the Excel file path given above. The file must exist at that location, and reading `.xlsx` files with pandas requires the `openpyxl` package.

In [None]:
file_path = "/Users/tayebekavousi/Desktop/github_sa/datasets/binary/oracle.xlsx"  # Path to the Oracle dataset

try:
    df = pd.read_excel(file_path)
except Exception as e:
    # Re-raise so later cells do not fail with a confusing NameError on `df`
    print("Error loading dataset:", e)
    raise
else:
    print("Oracle dataset loaded successfully!")
    print("Dataset shape:", df.shape)
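
Most of the later cells check for the `message` and `is_toxic` columns before running. As a minimal sketch, the same check could be done once right after loading. The helper name `missing_columns` is hypothetical (it is not part of pandas), and a toy DataFrame stands in for the Excel file:

```python
import pandas as pd

def missing_columns(frame, required=("message", "is_toxic")):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the Oracle data (values are illustrative only)
toy_df = pd.DataFrame({"message": ["hello", "go away"], "is_toxic": [0, 1]})

print(missing_columns(toy_df))                            # no columns missing
print(missing_columns(toy_df.drop(columns="is_toxic")))   # label column missing
```

An empty list means both columns are present and the remaining cells can run as written.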

## 3. Display the First 10 Rows

This cell displays the first 10 rows of the Oracle dataset for a quick preview.

In [None]:
print("Head of the Oracle dataset (10 rows):")
display(df.head(10))

## 4. Sampling 3 Rows from Each Class

We sample 3 rows per class (based on the `is_toxic` label) to inspect representative entries.

In [None]:
if 'is_toxic' in df.columns:
    try:
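        # Sample each class separately; capping n at the group size avoids a
        # ValueError when a class has fewer than 3 rows
        sample_df = pd.concat(
            [group.sample(n=min(3, len(group)), random_state=42)
             for _, group in df.groupby('is_toxic')],
            ignore_index=True,
        )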
        print("Sample (3 rows per class):")
        display(sample_df)
    except ValueError as e:
        print("Sampling error:", e)
else:
    print("Column 'is_toxic' not found in the dataset.")
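
Recent pandas versions (1.1 and later) also provide a built-in per-group sampler, `DataFrameGroupBy.sample`. The sketch below shows it on a toy DataFrame; the column names mirror the notebook, but the data is made up for illustration:

```python
import pandas as pd

# Toy data: 5 rows per class (illustrative values only)
toy_df = pd.DataFrame({
    "message": [f"msg {i}" for i in range(10)],
    "is_toxic": [0, 1] * 5,
})

# Draw 3 rows from every class; without replacement this raises a
# ValueError if any class has fewer than 3 rows
toy_sample = toy_df.groupby("is_toxic").sample(n=3, random_state=42)
print(toy_sample["is_toxic"].value_counts())
```

This form keeps the label column and the original row index, which can make it easier to trace sampled rows back to the full dataset.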

## 5. Distribution Analysis of Classes

We plot a bar chart to visualize the distribution of `is_toxic` classes.

In [None]:
if 'is_toxic' in df.columns:
    class_counts = df['is_toxic'].value_counts()
    
    plt.figure(figsize=(6, 4))
    class_counts.plot(kind='bar', color='skyblue', edgecolor='black')
    plt.title("Distribution of is_toxic Classes")
    plt.xlabel("is_toxic Label")
    plt.ylabel("Count")
    plt.xticks(rotation=0)
    plt.show()
else:
    print("Column 'is_toxic' not found in the dataset.")
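
Alongside the bar chart, relative class shares can make imbalance easier to judge. A minimal sketch on a toy label Series (the values are illustrative, not taken from the Oracle dataset):

```python
import pandas as pd

# Toy labels: three non-toxic rows and one toxic row
toy_labels = pd.Series([0, 0, 0, 1], name="is_toxic")

# normalize=True returns fractions of the total instead of raw counts
class_share = toy_labels.value_counts(normalize=True)
print(class_share)

# Ratio of the largest class to the smallest class
imbalance_ratio = class_share.max() / class_share.min()
print("Imbalance ratio:", imbalance_ratio)
```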

## 6. Text Length (Word) Analysis

We compute the word count for each message, print summary statistics (average, median, maximum, minimum), and display a histogram of word counts.

In [None]:
if 'message' in df.columns:
    # Treat missing messages as empty so NaN is not counted as the one-word string "nan"
    df['word_count'] = df['message'].fillna("").astype(str).apply(lambda x: len(x.split()))
    
    avg_length = df['word_count'].mean()
    median_length = df['word_count'].median()
    max_length = df['word_count'].max()
    min_length = df['word_count'].min()
    
    print("Text Length Analysis (in words):")
    print(f"Average Length: {avg_length:.2f}")
    print(f"Median Length: {median_length}")
    print(f"Maximum Length: {max_length}")
    print(f"Minimum Length: {min_length}")
    
    plt.figure(figsize=(8, 5))
    plt.hist(df['word_count'], bins=30, color='lightgreen', edgecolor='black')
    plt.title("Histogram of Message Word Counts")
    plt.xlabel("Word Count")
    plt.ylabel("Frequency")
    plt.show()
else:
    print("Column 'message' not found in the dataset.")
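
The statistics above cover the whole dataset. If the two classes differ in length, a per-class breakdown may be more informative. The following is a sketch on toy data, assuming the same `message` and `is_toxic` column names:

```python
import pandas as pd

# Toy data with one missing message (illustrative values only)
toy_df = pd.DataFrame({
    "message": ["you are great", "go away now please", "thanks", None],
    "is_toxic": [0, 1, 0, 1],
})

# Missing messages count as zero words
toy_df["word_count"] = toy_df["message"].fillna("").str.split().str.len()

# One row of summary statistics per class
per_class = toy_df.groupby("is_toxic")["word_count"].agg(["mean", "median", "max", "min"])
print(per_class)
```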

## 7. Word Cloud for Each Class

We generate a word cloud for each `is_toxic` class to visualize the word distribution separately.

In [None]:
if 'message' in df.columns and 'is_toxic' in df.columns:
    for class_value in df['is_toxic'].unique():
        subset_df = df[df['is_toxic'] == class_value]
        text_combined = " ".join(subset_df['message'].dropna().astype(str).tolist())
        
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_combined)
        
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.title(f"Word Cloud for is_toxic = {class_value}")
        plt.axis("off")
        plt.show()
else:
    print("Either 'message' or 'is_toxic' column not found in the dataset.")
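
A word cloud shows relative frequency only visually. For exact counts, a plain frequency table can complement it. The sketch below uses only the standard library; `top_words` is a hypothetical helper, and it does no stopword filtering or punctuation handling:

```python
from collections import Counter

def top_words(messages, n=3):
    """Return the n most frequent lowercase, whitespace-separated tokens."""
    counts = Counter()
    for message in messages:
        counts.update(str(message).lower().split())
    return counts.most_common(n)

# Toy messages standing in for one class of the dataset
toy_messages = ["You are great", "you are kind", "you rock"]
print(top_words(toy_messages, n=2))  # [('you', 3), ('are', 2)]
```

Applied per class, for example to `subset_df['message'].dropna()`, this would give a numeric counterpart to each word cloud.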

## 8. Preparing a PyTorch Dataset and DataLoader

We define a custom PyTorch `Dataset` class and create a `DataLoader` to prepare the data for fine-tuning.

In [None]:
class OracleDataset(Dataset):
    def __init__(self, dataframe, text_column="message", label_column="is_toxic", transform=None):
        self.data = dataframe
        self.text_column = text_column
        self.label_column = label_column
        self.transform = transform
        
        if label_column in self.data.columns:
            # Map each distinct label to an integer class index
            labels = sorted(self.data[label_column].dropna().unique())
            self.label_mapping = {label: idx for idx, label in enumerate(labels)}
        else:
            self.label_mapping = {}
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Cast to str so non-string messages can still be batched
        sample = {"text": str(row[self.text_column])}
        # Add the label only when it exists; default collation cannot batch None
        if self.label_column in self.data.columns:
            sample["label"] = self.label_mapping[row[self.label_column]]
        
        if self.transform:
            sample = self.transform(sample)
        
        return sample

# Instantiate the dataset
dataset = OracleDataset(df, text_column="message", label_column="is_toxic")

# Create a DataLoader (batch size = 4 for demonstration)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Display the first batch
for batch in dataloader:
    print("Sample batch from DataLoader:")
    print(batch)
    break
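
For fine-tuning, batches usually need labels as a tensor while texts stay as a list for a tokenizer. One way to control this is a custom collate function. The sketch below is an assumption about how that could look, not part of the original notebook; `collate_text_batch` is a hypothetical name, and it assumes integer-like labels:

```python
import torch

def collate_text_batch(samples):
    """Batch a list of {"text", "label"} dicts: texts stay a list, labels become a tensor."""
    texts = [str(s["text"]) for s in samples]
    labels = torch.tensor([int(s["label"]) for s in samples], dtype=torch.long)
    return {"text": texts, "label": labels}

# Toy samples shaped like the items returned by OracleDataset.__getitem__
toy_samples = [{"text": "hello", "label": 0}, {"text": "go away", "label": 1}]
toy_batch = collate_text_batch(toy_samples)
print(toy_batch["text"])
print(toy_batch["label"].shape)
```

The function can be passed to the loader as `DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_text_batch)`.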