# Notebook Overview
This notebook focuses on data exploration and preparation for sentiment analysis of Yelp reviews.

## Key objectives:
- Data Loading and Initial Assessment
- Distribution Analysis
- Text Analysis
- Preprocessing Pipeline Validation
- Data Quality Assessment
- Feature Analysis

# Imports
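This cell imports the libraries and project modules used for data processing, visualization, and model training.
The `os.chdir('../')` call moves the working directory up one level to the project root so that relative paths resolve consistently. Run it once per kernel session, because the path is relative and each re-run moves up another level.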

In [None]:
import os
os.chdir('../')  # Moving up one directory to the root
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from data.data_processing import DataProcessor, TextSignals, SarcasmAugmenter
from utils.dataVisualizer import DataVisualizer
from models.sentiment_model import ModelTrainer

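One way to make the directory change safe to re-run is to locate the root explicitly instead of stepping up by a fixed amount. The following is a minimal sketch, assuming the project root can be recognized by the `data`, `utils`, and `models` packages imported above. `find_project_root` is a hypothetical helper and is not part of the project's code.

```python
import os
from pathlib import Path


def find_project_root(start, markers=("data", "utils", "models")):
    """Walk upwards from `start` until a directory containing all markers is found.

    `markers` is an assumption: the package directories imported by the notebook.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if all((candidate / name).is_dir() for name in markers):
            return candidate
    raise FileNotFoundError(f"No directory containing {markers} at or above {start}")


# Idempotent usage: running this twice leaves the working directory unchanged.
# os.chdir(find_project_root(Path.cwd()))
```

Because the helper returns the same directory no matter how deep the starting point is, re-running the cell does not drift further up the tree.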
# Object Initialization
This cell initializes instances of the following classes:
- DataProcessor: handles data preprocessing
- DataVisualizer: handles data visualization
- ModelTrainer: handles training tasks

In [None]:
dataProcessor = DataProcessor()
dataVisualizer = DataVisualizer(data_processor=dataProcessor)
trainer = ModelTrainer()

# Loading and Initial Analysis
The dataset is loaded, and initial distribution analyses for ratings and sentiments are performed.
This step helps understand the structure and balance of the raw data.

In [None]:
df = dataProcessor.load_data()
print("\nInitial Distribution Analysis:")
dataVisualizer.analyze_ratings_distribution(df)
dataVisualizer.analyze_sentiment_distribution(df)

# Data Preparation
This cell prepares a balanced dataset by stratifying classes and splitting the data into training,
validation, and testing sets. The returned dictionary holds the three splits under `dataframes` and the generated model inputs under `model_inputs`.

In [None]:
data = dataProcessor.prepare_data()
train_df = data['dataframes']['train']
val_df = data['dataframes']['val']
test_df = data['dataframes']['test']
model_inputs = data['model_inputs']

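The implementation of `prepare_data()` is not shown here, so the following is only an illustrative sketch of what a stratified three-way split does in general. The 80/10/10 fractions, the seed, and the `stratified_split` name are all assumptions, and any class balancing the real method performs is not reproduced.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved.

    Hypothetical helper: fractions and seed are illustrative defaults.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class only
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Toy labels standing in for a sentiment column
labels = ["neg"] * 50 + ["neu"] * 30 + ["pos"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting each class separately is what keeps the class ratios roughly equal across the three sets.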
# Split Distribution Analysis
Sentiment and sarcasm distributions are analyzed across training, validation, and testing splits.
This checks that each class is represented consistently in every split.

In [None]:
for split_name, split_df in [('Training', train_df), ('Validation', val_df), ('Test', test_df)]:
    print(f"\n{split_name} Set Analysis:")
    print(f"Total samples: {len(split_df)}")
    print("\nSentiment Distribution:")
    print(split_df['sentiment'].value_counts().sort_index())
    print("\nSarcasm Distribution:")
    print(split_df['is_sarcastic'].value_counts())

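Raw counts are hard to compare across splits of different sizes, so proportions can be easier to read. The sketch below uses toy labels in place of `split_df['sentiment']`. `class_proportions` is a hypothetical helper, and the label values are invented for illustration.

```python
from collections import Counter


def class_proportions(labels):
    """Map each class to its share of the split, rounded to 3 decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: round(n / total, 3) for cls, n in sorted(counts.items())}


# Toy labels standing in for split_df['sentiment'] (values are made up)
toy_splits = {
    "Training": [0, 0, 1, 1, 2, 2, 2, 2],
    "Validation": [0, 1, 2, 2],
}
proportions = {name: class_proportions(lbls) for name, lbls in toy_splits.items()}
for name, props in proportions.items():
    print(name, props)
```

On a pandas Series, `value_counts(normalize=True)` gives the same proportions directly.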
# Text Length Analysis
Text lengths in each dataset split are analyzed to identify variations and patterns.
This step is crucial for defining suitable input length constraints for the model.

In [None]:
print("\nText Length Analysis Across Splits:")
for split_name, split_df in [('Training', train_df), ('Validation', val_df), ('Test', test_df)]:
    print(f"\n{split_name} Set Text Lengths:")
    dataVisualizer.analyze_text_lengths(split_df['text'])

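The output of `analyze_text_lengths` is not shown, so as a rough illustration of the kind of summary that informs a length limit, the sketch below computes word-count percentiles with the nearest-rank method. `length_summary` and the chosen percentiles are assumptions.

```python
import math


def length_summary(texts, percentiles=(50, 95)):
    """Summarize word counts using nearest-rank percentiles (hypothetical helper)."""
    lengths = sorted(len(t.split()) for t in texts)
    summary = {"min": lengths[0], "max": lengths[-1]}
    for p in percentiles:
        rank = max(1, math.ceil(p / 100 * len(lengths)))  # 1-based nearest rank
        summary[f"p{p}"] = lengths[rank - 1]
    return summary


# Toy texts with 1 to 10 words each
toy_texts = [" ".join(["word"] * n) for n in range(1, 11)]
summary = length_summary(toy_texts)
print(summary)
```

A high percentile such as the 95th is often a more useful guide than the maximum, which a single very long review can inflate.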
# Token Length Analysis and MAX_LENGTH Recommendation
Tokenized data lengths are analyzed to determine a recommended MAX_LENGTH value for input truncation.
The suggested length is rounded up to the next multiple of 16 and capped at 512 tokens.

In [None]:
encoded_data = trainer.prepare_dataset(train_df['text'])
suggested_length = dataVisualizer.analyze_token_lengths(encoded_data)
MAX_LENGTH = min(512, (suggested_length + 15) // 16 * 16)
print(f"\nRecommended MAX_LENGTH: {MAX_LENGTH}")

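The rounding expression in the cell above can be checked in isolation. The sketch below wraps the same arithmetic in a hypothetical `round_up_to_multiple` function and assumes `suggested_length` is an integer, since the return type of `analyze_token_lengths` is not shown.

```python
def round_up_to_multiple(n, multiple=16, cap=512):
    """Round n up to the next multiple of `multiple`, never exceeding `cap`."""
    return min(cap, (n + multiple - 1) // multiple * multiple)


for suggested in (100, 128, 129, 600):
    print(suggested, "->", round_up_to_multiple(suggested))
# 100 -> 112
# 128 -> 128
# 129 -> 144
# 600 -> 512
```

Values that are already a multiple of 16 are left unchanged, and anything above the cap collapses to 512.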
# Word Distribution Visualization
Word clouds are generated to visualize the most frequent words in the training data.
This helps identify key terms and potential biases in the data.

In [None]:
dataVisualizer.visualize_wordclouds(train_df)

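A word cloud is a visual rendering of word frequencies, so a plain frequency table can serve as a cross-check. This is a minimal sketch with an invented stopword list and toy reviews. `top_words` is hypothetical and does not reflect how `visualize_wordclouds` is implemented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the project's list)
STOPWORDS = frozenset({"the", "a", "and", "was", "is"})


def top_words(texts, n=3, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    tokens = (w for t in texts for w in re.findall(r"[a-z']+", t.lower()))
    return Counter(w for w in tokens if w not in stopwords).most_common(n)


reviews = ["The food was great", "Great service and great food", "The wait was long"]
print(top_words(reviews, n=2))  # [('great', 3), ('food', 2)]
```

Running the same tally per sentiment class makes it easier to spot terms that dominate only one class.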
# Sample Reviews Analysis
This cell displays a sample of processed reviews to inspect preprocessing quality.
It helps verify that the pipeline handles text properly and removes unwanted artifacts.

In [None]:
dataVisualizer.display_processed_reviews(train_df, num_samples=10)

# Text Signals Analysis
Text signal features like word count, punctuation usage, and sentiment indicators are analyzed.
This step helps identify patterns and anomalies in textual data.

In [None]:
print("\nText Signals Analysis for Training Set:")
dataVisualizer.analyze_text_signals(train_df)

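The exact features computed by the project's `TextSignals` class are not shown here. As an illustration of the kind of signal described above, the sketch below derives a few simple ones from a single string. The function name and feature names are hypothetical.

```python
def basic_text_signals(text):
    """Compute a few surface-level signals for one review (illustrative only)."""
    words = text.split()
    letters = [c for c in text if c.isalpha()]
    uppercase = sum(c.isupper() for c in letters)
    return {
        "word_count": len(words),
        "exclamation_count": text.count("!"),
        "question_count": text.count("?"),
        # Share of alphabetic characters that are uppercase; 0.0 for empty input
        "uppercase_ratio": uppercase / len(letters) if letters else 0.0,
    }


signals = basic_text_signals("GREAT food, really?!")
print(signals)
```

Signals like heavy capitalization or stacked punctuation are sometimes associated with emphasis or sarcasm, which is why they are worth inspecting alongside the sentiment labels.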
# Data Quality Checks
The training, validation, and testing sets are checked for null values and duplicate rows.
This verifies data integrity and quality before model training.

In [None]:
print("Data Quality Checks:")
for split_name, split_df in [('Training', train_df), ('Validation', val_df), ('Test', test_df)]:
    print(f"\n{split_name} Set:")
    print("Null values:")
    print(split_df.isnull().sum())
    print(f"Duplicate rows: {split_df.duplicated().sum()}")