# Exploratory Data Analysis (EDA)

## Objective
Explore the request management dataset to understand:
- Data structure and quality
- Distribution of categories and priorities
- Text characteristics (title, description)
- Potential features for ML models

## Data Source
- `contracts/mock-data/requests.csv`
- `contracts/mock-data/requests.json`

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Jupyter display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
%matplotlib inline

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load Data

In [None]:
# TODO: Update path based on your directory structure
data_path = Path('../../../contracts/mock-data/requests.csv')

# Load data
df = pd.read_csv(data_path)

print(f"Loaded {len(df)} requests")
df.head()

## 2. Data Overview

In [None]:
# Dataset info
df.info()

In [None]:
# Statistical summary
df.describe(include='all')

In [None]:
# Check for missing values
missing = df.isnull().sum()
missing[missing > 0]

## 3. Target Variable Analysis

### Category Distribution

In [None]:
# TODO: Adjust 'category' field name if different
category_counts = df['category'].value_counts()
print(category_counts)

# Visualize
plt.figure(figsize=(10, 6))
category_counts.plot(kind='bar')
plt.title('Request Category Distribution')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**Observations**:
- TODO: Note class imbalance if present
- TODO: Identify majority and minority classes
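
One way to put a number on class imbalance is the ratio between the largest and smallest class shares. The sketch below uses made-up labels as a stand-in for `df['category']`; the label values and the variable names `shares` and `imbalance_ratio` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy labels standing in for df['category']; the values are made up.
labels = pd.Series(['access', 'access', 'access', 'hardware', 'hardware', 'billing'])

# Relative frequency of each class, largest first
shares = labels.value_counts(normalize=True)

# Ratio of the largest class share to the smallest; 1.0 means perfectly balanced
imbalance_ratio = shares.max() / shares.min()

print(shares.round(2))
print(f"Majority class: {shares.idxmax()}, minority class: {shares.idxmin()}")
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

What counts as a problematic ratio depends on the model and metric, so treat the number as a prompt for the observations above rather than a threshold.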

### Priority Distribution

In [None]:
# Priority distribution
priority_counts = df['priority'].value_counts().sort_index()
print(priority_counts)

plt.figure(figsize=(8, 5))
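# Colors are applied in sorted index order and assume four priority levels;
# adjust the list if the labels sort differently or the count differs
priority_counts.plot(kind='bar', color=['red', 'orange', 'blue', 'green'])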
plt.title('Request Priority Distribution')
plt.xlabel('Priority')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

### Category vs. Priority Cross-tabulation

In [None]:
crosstab = pd.crosstab(df['category'], df['priority'])
print(crosstab)

# Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(crosstab, annot=True, fmt='d', cmap='YlOrRd')
plt.title('Category vs. Priority Heatmap')
plt.tight_layout()
plt.show()

## 4. Text Data Analysis

### Title Analysis

In [None]:
# Title length distribution
df['title_length'] = df['title'].str.len()

plt.figure(figsize=(10, 5))
df['title_length'].hist(bins=30)
plt.title('Distribution of Title Lengths')
plt.xlabel('Characters')
plt.ylabel('Frequency')
plt.axvline(df['title_length'].mean(), color='red', linestyle='--', label=f'Mean: {df["title_length"].mean():.1f}')
plt.legend()
plt.show()

print("Title length stats:")
print(df['title_length'].describe())

### Description Analysis

In [None]:
# Description length (word count)
df['desc_word_count'] = df['description'].str.split().str.len()

plt.figure(figsize=(10, 5))
df['desc_word_count'].hist(bins=30)
plt.title('Distribution of Description Word Count')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.axvline(df['desc_word_count'].mean(), color='red', linestyle='--', label=f'Mean: {df["desc_word_count"].mean():.1f}')
plt.legend()
plt.show()

### Sample Requests by Category

In [None]:
# Sample up to two requests from each category to understand text patterns
for category, group in df.groupby('category'):
    print(f"\n=== {category} ===")
    samples = group.sample(min(2, len(group)))
    for _, row in samples.iterrows():
        print(f"Title: {row['title']}")
        # str() guards against missing descriptions, which pandas stores as float NaN
        print(f"Description: {str(row['description'])[:100]}...")
        print(f"Priority: {row['priority']}")
        print()

## 5. Temporal Analysis (if timestamp available)

In [None]:
# TODO: If createdAt field exists, analyze temporal patterns
# df['createdAt'] = pd.to_datetime(df['createdAt'])
# df['created_date'] = df['createdAt'].dt.date
# df.groupby('created_date').size().plot(kind='line')
# plt.title('Requests Over Time')
# plt.show()
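
If a `createdAt` column is present, the time-based breakdown could go beyond daily counts to hour of day and day of week. The sketch below runs on a toy frame rather than on `df`; the ISO 8601 timestamp format and the column names `hour`, `day_of_week`, and `is_weekend` are assumptions for illustration.

```python
import pandas as pd

# Toy frame; 'createdAt' is assumed to hold ISO 8601 timestamps.
toy = pd.DataFrame({
    'createdAt': ['2024-01-01T09:15:00', '2024-01-01T17:40:00', '2024-01-06T11:05:00'],
})

# errors='coerce' turns unparseable values into NaT instead of raising
toy['createdAt'] = pd.to_datetime(toy['createdAt'], errors='coerce')

# Calendar features derived from the timestamp
toy['hour'] = toy['createdAt'].dt.hour
toy['day_of_week'] = toy['createdAt'].dt.day_name()
toy['is_weekend'] = toy['createdAt'].dt.dayofweek >= 5  # Monday=0 ... Sunday=6

print(toy)
```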

## 6. Feature Ideas

Based on the EDA, potential features for ML models include:

**Text Features**:
- TF-IDF vectors from title + description
- Word embeddings (Word2Vec, GloVe, BERT)
- N-grams (unigrams, bigrams)

**Derived Features**:
- Title length
- Description length (words, characters)
- Presence of keywords (e.g., "urgent", "help", "broken")
- Sentiment score

**Metadata** (if available):
- Time of day submitted
- Day of week
- Requester department/role

TODO: Document additional feature ideas
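
Several of the derived features listed above can be computed with pandas string methods alone. The following is a minimal sketch on a toy frame, assuming `title` and `description` columns as used earlier in this notebook; the feature names (`desc_char_count`, `has_urgent`, and so on) are hypothetical.

```python
import pandas as pd

# Toy rows; the second description is deliberately missing.
toy = pd.DataFrame({
    'title': ['Printer broken', 'Need help with VPN'],
    'description': ['The printer on floor 2 is broken. Urgent!', None],
})

# Treat missing text as empty so the string operations below do not yield NaN
title = toy['title'].fillna('')
desc = toy['description'].fillna('')
text = (title + ' ' + desc).str.lower()

features = pd.DataFrame({
    'title_length': title.str.len(),
    'desc_char_count': desc.str.len(),
    'desc_word_count': desc.str.split().str.len(),
})

# Plain substring match on the lowercased text, one boolean column per keyword
for kw in ['urgent', 'help', 'broken']:
    features[f'has_{kw}'] = text.str.contains(kw, regex=False)

print(features)
```

Substring matching also fires on longer words such as "helpful", so a word-boundary regular expression may be preferable once real text is inspected.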

## 7. Data Quality Issues

**Findings**:
- TODO: List any data quality concerns
- TODO: Missing values?
- TODO: Duplicates?
- TODO: Class imbalance?
- TODO: Text quality (typos, incomplete sentences)?
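
Some of the checks listed above can be scripted before filling in the findings. This is a rough sketch on a toy frame, assuming the `title`, `description`, and `category` columns used elsewhere in this notebook; treating a repeated title and description pair as a duplicate is one possible definition, not an established rule for this dataset.

```python
import pandas as pd

# Toy rows with one duplicate, one whitespace-only description, one missing title
toy = pd.DataFrame({
    'title': ['Reset password', 'Reset password', 'Laptop will not boot', None],
    'description': ['Locked out of account', 'Locked out of account', '   ', 'No title given'],
    'category': ['access', 'access', 'hardware', 'access'],
})

# Share of missing values per column
missing_share = toy.isnull().mean()

# Rows whose title and description both repeat an earlier row
n_duplicates = int(toy.duplicated(subset=['title', 'description']).sum())

# Descriptions that are missing or contain only whitespace
n_blank_desc = int((toy['description'].fillna('').str.strip() == '').sum())

print(missing_share)
print(f"Duplicate rows: {n_duplicates}")
print(f"Blank descriptions: {n_blank_desc}")
```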

## 8. Next Steps

1. Proceed to `02-feature-engineering.ipynb`
2. Implement text preprocessing pipeline
3. Extract features (TF-IDF as baseline)
4. Build train/test splits
5. Train initial models in `03-model-training.ipynb`
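
For the split in step 4, keeping category proportions equal in train and test matters if the classes turn out to be imbalanced. The sketch below shows one way to do a stratified split with pandas alone on a toy frame; the 80/20 ratio and the seed are arbitrary assumptions, and scikit-learn's `train_test_split` with its `stratify` argument is the more common choice.

```python
import pandas as pd

# Toy frame with two equally sized categories
toy = pd.DataFrame({
    'title': [f'request {i}' for i in range(20)],
    'category': ['access'] * 10 + ['hardware'] * 10,
})

# Draw 20% of each category for the test set; the rest becomes the training set
test = toy.groupby('category', group_keys=False).sample(frac=0.2, random_state=42)
train = toy.drop(test.index)

print(train['category'].value_counts())
print(test['category'].value_counts())
```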

---
**References**:
- Data Dictionary: `contracts/data-models/field-dictionary.md`
- Priority Definitions: `contracts/data-models/priority-definitions.md`