# 01. Data Exploration and Analysis

**Objective**: Understand the Financial PhraseBank sentiment dataset through exploratory data analysis.

## 📋 Tasks for you to complete:
1. Load and explore the Financial PhraseBank dataset
2. Analyze sentiment distribution
3. Examine text characteristics (length, vocabulary, etc.)
4. Identify data quality issues
5. Visualize key insights

## 🎯 Learning Goals:
- Understand the characteristics of financial text data
- Assess data quality
- Gather baseline insights for model development

## Setup and Imports

In [None]:
# TODO: Import necessary libraries
# Hint: pandas, numpy, matplotlib, seaborn, plotly
# Also consider: nltk, wordcloud, collections

import pandas as pd
import numpy as np
# Add your imports here

# Set up plotting
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # requires matplotlib >= 3.6; older releases name this style 'seaborn'
%matplotlib inline

## Data Loading

**Your Task**: Download and load the Financial PhraseBank dataset
- Dataset URL: https://www.researchgate.net/publication/251231107_FinancialPhraseBank-v10
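- Alternative: create a small synthetic dataset for now (see the Data Loading hint at the end of this notebook); scikit-learn's bundled datasets do not include financial sentiment data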

In [None]:
# TODO: Load the financial sentiment dataset
# For now, you can create a sample dataset or download the real one

# Sample data structure for reference:
# columns: ['text', 'sentiment']
# sentiment values: 'negative', 'neutral', 'positive'

# df = pd.read_csv('../data/raw/financial_phrasebank.csv')
# print(f"Dataset shape: {df.shape}")
# print(f"Columns: {df.columns.tolist()}")

## Basic Data Exploration

In [None]:
# TODO: Explore basic dataset information
# 1. Display first few rows
# 2. Check data types
# 3. Look for missing values
# 4. Basic statistics

# Your code here

## Sentiment Distribution Analysis

In [None]:
# TODO: Analyze sentiment label distribution
# 1. Count distribution of sentiment classes
# 2. Calculate percentages
# 3. Create visualizations (bar plot, pie chart)
# 4. Check for class imbalance

# Your code here

## Text Characteristics Analysis

In [None]:
# TODO: Analyze text characteristics
# 1. Text length distribution (character and word count)
# 2. Vocabulary size and most common words
# 3. Average sentence length by sentiment
# 4. Word clouds for each sentiment class

# Your code here

## Data Quality Assessment

In [None]:
# TODO: Assess data quality
# 1. Check for duplicate texts
# 2. Identify very short or very long texts
# 3. Look for special characters, HTML tags, etc.
# 4. Identify potential noise in the data

# Your code here
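
**Hint**: The checks listed above can be sketched as boolean flag columns, which keeps every row visible until you decide what to drop. This is a minimal sketch on made-up rows; `quality_df`, the three-word threshold, and the HTML pattern are all assumptions to adapt to your data.

```python
import pandas as pd

# Made-up rows that each trigger one of the checks
quality_df = pd.DataFrame({'text': [
    'Company profits exceeded expectations this quarter',
    'Company profits exceeded expectations this quarter',  # exact duplicate
    'Up.',                                                 # very short
    '<p>Revenue remained stable &amp; margins held</p>',   # HTML residue
    '   ',                                                 # whitespace only
]})

# Compare texts after trimming and lowercasing so near-identical rows match
normalized = quality_df['text'].str.strip().str.lower()
quality_df['is_duplicate'] = normalized.duplicated(keep='first')
quality_df['is_empty'] = normalized.eq('')

# Word count via whitespace split; the threshold of 3 is an arbitrary assumption
quality_df['word_count'] = quality_df['text'].str.split().str.len()
quality_df['is_short'] = quality_df['word_count'] < 3

# Rough pattern for HTML tags and named entities such as &amp;
quality_df['has_html'] = quality_df['text'].str.contains(r'<[^>]+>|&[a-z]+;', regex=True)

# Number of rows flagged by each check
print(quality_df[['is_duplicate', 'is_empty', 'is_short', 'has_html']].sum())
```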

## Financial Domain Analysis

In [None]:
# TODO: Analyze financial domain-specific characteristics
# 1. Extract financial terms and entities
# 2. Analyze sentiment patterns around financial keywords
# 3. Identify common financial phrases
# 4. Look for temporal patterns if dates are available

# Your code here
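
**Hint**: One possible starting point for step 2 is to count how often each sentiment label appears in texts that mention a given keyword. The sketch below uses made-up sentences and a hypothetical starter list `keywords`; neither comes from the real dataset, and plain keyword matching is only a rough proxy for proper term or entity extraction.

```python
import re

import pandas as pd

# Made-up sentences standing in for the real dataset
domain_df = pd.DataFrame({
    'text': [
        'Company profits exceeded expectations this quarter',
        'Stock prices fell sharply amid market uncertainty',
        'Revenue remained stable compared to last year',
        'Profits fell as revenue declined',
    ],
    'sentiment': ['positive', 'negative', 'neutral', 'negative'],
})

# Hypothetical starter keyword list; extend it for your data
keywords = ['profit', 'revenue', 'stock', 'loss']
sentiments = ['negative', 'neutral', 'positive']

table = {}
for kw in keywords:
    # Word boundary plus optional suffix, so "profit" also matches "profits"
    pattern = rf'\b{re.escape(kw)}\w*'
    mask = domain_df['text'].str.contains(pattern, case=False, regex=True)
    # Count labels among matching texts; labels with no matches become 0
    table[kw] = (domain_df.loc[mask, 'sentiment']
                 .value_counts()
                 .reindex(sentiments, fill_value=0))

# Rows are keywords, columns are sentiment labels
keyword_sentiment = pd.DataFrame(table).T
print(keyword_sentiment)
```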

## Key Insights and Conclusions

**Your Task**: Summarize your findings and their implications for model development

### Questions to answer:
1. What is the distribution of sentiment classes?
2. Are there any data quality issues to address?
3. What are the key characteristics of financial text?
4. What preprocessing steps will be needed?
5. What challenges do you anticipate for the model?

### Next Steps:
- Data preprocessing strategy
- Model selection considerations
- Evaluation metrics planning

## 💡 Implementation Hints:

### Data Loading:
```python
# If you don't have the dataset yet, create sample data:
sample_data = {
    'text': [
        'Company profits exceeded expectations this quarter',
        'Stock prices fell sharply amid market uncertainty',
        'Revenue remained stable compared to last year'
    ],
    'sentiment': ['positive', 'negative', 'neutral']
}
df = pd.DataFrame(sample_data)
```
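
If you do download the real dataset, the distributed files are typically plain text rather than CSV. The sketch below assumes the commonly seen layout of one `sentence@label` pair per line in Latin-1 encoding (for example a file named `Sentences_AllAgree.txt`); verify this against your copy before relying on it. `parse_phrasebank_lines` is a hypothetical helper, not part of any library, and the demo lines are made up.

```python
import pandas as pd

def parse_phrasebank_lines(lines):
    """Turn 'sentence@label' lines into a DataFrame with text/sentiment columns."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the LAST '@' so an '@' inside the sentence is preserved
        text, sep, label = line.rpartition('@')
        if not sep:
            continue  # no separator found; skip the malformed line
        rows.append({'text': text.strip(), 'sentiment': label.strip().lower()})
    return pd.DataFrame(rows, columns=['text', 'sentiment'])

# Made-up lines in the assumed format
demo_lines = [
    'Quarterly profit rose compared to last year .@positive',
    'The company has no plans to relocate .@neutral',
    '',
    'a line without a label',
]
demo_df = parse_phrasebank_lines(demo_lines)
print(demo_df)

# With the real file (path and encoding are assumptions):
# with open('../data/raw/Sentences_AllAgree.txt', encoding='latin-1') as f:
#     df = parse_phrasebank_lines(f)
```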

### Text Analysis:
```python
# Text length analysis
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
```
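
For vocabulary size, most common words, and average length by sentiment, a minimal sketch using `collections.Counter` and `groupby` follows. It reuses the three sample sentences from the Data Loading hint under the assumed name `sample_df`; the regex tokenizer is deliberately simple and is only one of several reasonable choices (nltk offers richer ones).

```python
import re
from collections import Counter

import pandas as pd

sample_df = pd.DataFrame({
    'text': [
        'Company profits exceeded expectations this quarter',
        'Stock prices fell sharply amid market uncertainty',
        'Revenue remained stable compared to last year',
    ],
    'sentiment': ['positive', 'negative', 'neutral'],
})

def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r'[a-z]+', text.lower())

# Count every token across the whole corpus
token_counts = Counter(tok for text in sample_df['text'] for tok in tokenize(text))
print(f"Vocabulary size: {len(token_counts)}")
print(token_counts.most_common(5))

# Average whitespace-delimited word count per sentiment class
sample_df['word_count'] = sample_df['text'].str.split().str.len()
avg_words = sample_df.groupby('sentiment')['word_count'].mean()
print(avg_words)
```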

### Visualization:
```python
# Sentiment distribution: one bar per class, with labelled axes
plt.figure(figsize=(8, 6))
df['sentiment'].value_counts().plot(kind='bar')
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Number of texts')
plt.show()
```
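
To put numbers next to the bar chart, class percentages and a simple imbalance ratio can be computed with `value_counts`. The labels below are made up for illustration, and the ratio of largest to smallest class is just one rough indicator, not a standard threshold.

```python
import pandas as pd

# Made-up label column: 6 neutral, 3 positive, 1 negative
labels = pd.Series(['neutral'] * 6 + ['positive'] * 3 + ['negative'] * 1,
                   name='sentiment')

counts = labels.value_counts()
# normalize=True returns fractions; convert them to percentages
percentages = labels.value_counts(normalize=True).mul(100).round(1)

summary = pd.DataFrame({'count': counts, 'percent': percentages})
print(summary)

# Ratio of the largest class to the smallest class
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio (largest / smallest class): {imbalance_ratio:.1f}")
```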