<a href="https://colab.research.google.com/github/stfgrz/ESS_electives_report/blob/main/ESS_electives.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ESS Electives Survey Analysis
## Comprehensive Publication-Ready Analysis of Student Feedback

This notebook cleans the ESS electives survey export, reshapes it into one row per course review, and reports descriptive statistics, significance tests, and 300 DPI figures.

### Features:
- ✅ Comprehensive data quality validation
- ✅ Descriptive and inferential statistics
- ✅ 10+ publication-quality visualizations
- ✅ Correlation and comparative analysis  
- ✅ Statistical significance testing
- ✅ Key insights and recommendations
- ✅ Text analysis of open-ended feedback

---

## 📑 Table of Contents
1. [Executive Summary](#executive-summary)
2. [Setup](#setup)
3. [Data Loading and Quality](#data-loading)
4. [Data Processing](#data-processing)
5. [Descriptive Statistics](#descriptive-stats)
6. [Statistical Analysis](#statistical-analysis)
7. [Comprehensive Visualizations](#visualizations)
8. [Key Insights](#key-insights)
9. [Text Analysis](#text-analysis)
10. [Recommendations](#recommendations)
11. [Exports](#exports)

---

<a id="executive-summary"></a>
## 1. Executive Summary

Key findings will be added here once the analysis below has been run.

**Quick Stats:**
- Analysis Date: December 2025
- Survey: ESS Electives Feedback
- Metrics: Course Difficulty, Exam Difficulty, Exam Alignment, Re-enrollment Intent

---

<a id="setup"></a>
## 2. Setup and Configuration

### Package Installation and Environment Configuration

In [None]:
# Install and import required packages
import sys
import subprocess

required_packages = ["pandas", "numpy", "matplotlib", "seaborn", "scipy", "plotly", "wordcloud", "tabulate"]

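import importlib.util

# Install only what cannot be imported
# (for every package listed above, the pip name matches the import name)
missing_packages = [p for p in required_packages if importlib.util.find_spec(p) is None]
if missing_packages:
    print("Installing required packages...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + missing_packages)
    print("✓ Installation complete")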

# Import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import pearsonr, spearmanr, ttest_ind, mannwhitneyu, chi2_contingency
import warnings
import re
from collections import Counter
import os
from datetime import datetime
warnings.filterwarnings('ignore')

# Try to import wordcloud (optional)
try:
    from wordcloud import WordCloud
    WORDCLOUD_AVAILABLE = True
except ImportError:
    WORDCLOUD_AVAILABLE = False
    print("Note: wordcloud not available for text analysis")

# Configuration Constants
DATA_PATH = 'data/ESS_electives_op_DEC2025.csv'
OUTPUT_DIR = 'outputs'
FIGURE_DIR = 'figures'
MIN_RESPONSES = 3  # Minimum responses required for course analysis
FIGURE_DPI = 300   # High resolution for publication
RANDOM_SEED = 42

# Create output directories
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(FIGURE_DIR, exist_ok=True)

# Configure plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
sns.set_context("notebook", font_scale=1.1)
np.random.seed(RANDOM_SEED)

print("="*70)
print(" " * 22 + "SETUP COMPLETE")
print("="*70)
print(f"✓ All libraries loaded")
print(f"✓ Output directory: {OUTPUT_DIR}/")
print(f"✓ Figure directory: {FIGURE_DIR}/")
print(f"✓ Configuration: min_responses={MIN_RESPONSES}, DPI={FIGURE_DPI}")
print("="*70)

<a id="data-loading"></a>
## 3. Data Loading and Quality Assessment

### Loading Survey Data

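The survey CSV carries two metadata rows below its header line. Once `pd.read_csv` has used the first line as column names, the DataFrame rows are: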
- **Row 0:** Question text
- **Row 1:** ImportId mappings
- **Rows 2+:** Actual survey responses
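
As a minimal sketch of this layout, the toy CSV below shows how the two metadata rows separate from the responses. The column names and values are hypothetical, not taken from the survey, and `dtype=str` is an assumption added here so that every value stays a string:

```python
import io
import pandas as pd

# Hypothetical export: header line, question text, ImportId row, then responses
toy_csv = io.StringIO(
    "ResponseId,Finished,Q1_1\n"
    "Response ID,Finished,How difficult was the course?\n"
    "id_recordId,id_finished,id_QID1_1\n"
    "R_001,True,Easy\n"
    "R_002,False,Difficult\n"
)

# Read everything as text so 'True' and 'False' remain strings
toy_raw = pd.read_csv(toy_csv, dtype=str)

toy_questions = toy_raw.iloc[0]      # question text per column
toy_import_ids = toy_raw.iloc[1]     # ImportId per column
toy_responses = toy_raw.iloc[2:].reset_index(drop=True)  # actual answers

print(toy_questions["Q1_1"])
print(len(toy_responses))
```

Because the metadata rows share columns with the answers, every column loads as text. This is why the quality checks further down compare `Finished` against the string `'True'` rather than a boolean.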

In [None]:
# Load the survey data
print(f"Loading data from: {DATA_PATH}")
df_raw = pd.read_csv(DATA_PATH)

print(f"\n{'='*70}")
print(" " * 25 + "DATA OVERVIEW")
print("="*70)
print(f"Total rows (with metadata):       {len(df_raw):>6}")
print(f"Total columns:                    {len(df_raw.columns):>6}")
print(f"Data rows (actual responses):     {len(df_raw) - 2:>6}")

# Identify metadata and data sections
questions_row = df_raw.iloc[0]
importid_row = df_raw.iloc[1]
data_df = df_raw.iloc[2:].copy()

# Analyze columns
standard_cols = [c for c in df_raw.columns if not c.startswith('Q') and c not in ['A', 'B', 'C1', 'C2']]
question_cols = [c for c in df_raw.columns if c.startswith('Q') or c in ['A', 'B', 'C1', 'C2']]

print(f"\nColumn types:")
print(f"  - Standard metadata columns:    {len(standard_cols):>6}")
print(f"  - Survey question columns:      {len(question_cols):>6}")
print("="*70)

### Data Quality Assessment

Comprehensive quality checks and filtering.

In [None]:
print("\n" + "="*70)
print(" " * 20 + "DATA QUALITY REPORT")
print("="*70)

# 1. Check for preview/test responses
preview_count = len(data_df[data_df['Status'] == 'Survey Preview'])
print(f"\n1. Preview/Test Responses:        {preview_count:>6}")

# 2. Completion status
finished_count = len(data_df[data_df['Finished'] == 'True'])
incomplete_count = len(data_df) - finished_count
print(f"\n2. Survey Completion:")
print(f"   - Finished surveys:             {finished_count:>6}  ({finished_count/len(data_df)*100:>5.1f}%)")
print(f"   - Incomplete surveys:           {incomplete_count:>6}  ({incomplete_count/len(data_df)*100:>5.1f}%)")

# 3. Progress statistics
if 'Progress' in data_df.columns:
    data_df['Progress_Numeric'] = pd.to_numeric(data_df['Progress'], errors='coerce')
    mean_progress = data_df['Progress_Numeric'].mean()
    median_progress = data_df['Progress_Numeric'].median()
    complete_100 = len(data_df[data_df['Progress_Numeric'] == 100])
    
    print(f"\n3. Progress Distribution:")
    print(f"   - Mean progress:                {mean_progress:>5.1f}%")
    print(f"   - Median progress:              {median_progress:>5.1f}%")
    print(f"   - 100% complete:                {complete_100:>6}")

# 4. Check for duplicates
duplicate_count = data_df.duplicated(subset=['ResponseId']).sum()
print(f"\n4. Duplicate Response IDs:        {duplicate_count:>6}")

# 5. Sample missing data analysis (worst 5 among the first 10 question columns)
print(f"\n5. Missing Data Preview (top 5 of the first 10 question columns):")
sample_cols = question_cols[:10]
missing_info = []
for col in sample_cols:
    if col in data_df.columns:
        missing_pct = (data_df[col].isna().sum() / len(data_df)) * 100
        missing_info.append((col, missing_pct))

for col, pct in sorted(missing_info, key=lambda x: x[1], reverse=True)[:5]:
    print(f"   - {col:25s}: {pct:>5.1f}% missing")

# Apply filtering
print("\n" + "="*70)
print("FILTERING CRITERIA:")
print("  ✓ Exclude preview/test responses")
print("  ✓ Include only finished surveys")
print("="*70)

df_filtered = data_df[
    (data_df['Status'] != 'Survey Preview') & 
    (data_df['Finished'] == 'True')
].copy()

print(f"\n✓ Filtered dataset: {len(df_filtered)} valid responses")
print(f"  (Excluded: {len(data_df) - len(df_filtered)} responses)")
print("="*70)

<a id="data-processing"></a>
## 4. Data Processing and Transformation

### Building Course Mapping

Extracting course information from question text using regex patterns.
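
To illustrate what the patterns capture, here is a short sketch run on two made-up question strings. The course name and the exact wording are assumptions; the real survey text may differ:

```python
import re

# Hypothetical question texts, written to resemble the expected wording
difficulty_q = "What do you think about Applied Econometrics? - How difficult was the course?"
enrol_q = "Would you enrol in Applied Econometrics again?"

# Pattern 1: the course name sits between "about" and "? -"
m1 = re.search(r"about\s+(.*?)\?\s+-", difficulty_q)
course_from_difficulty = m1.group(1).strip() if m1 else None

# Pattern 2: the lazy group stops before the optional " again" and "?"
m2 = re.search(r"Would you enrol in\s+(.*?)( again)?\??$", enrol_q, re.IGNORECASE)
course_from_enrol = m2.group(1).strip() if m2 else None

print(course_from_difficulty)
print(course_from_enrol)
```

Both questions need to yield the same course string, since that string is the key that groups a course's columns together in the mapping.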

In [None]:
def extract_course_name(text):
    """
    Extract course name from question text using regex patterns.
    
    Args:
        text (str): Question text
        
    Returns:
        str or None: Course name if found
    """
    if not isinstance(text, str):
        return None
    
    # Pattern 1: "about <Course>? -"
    match = re.search(r"about\s+(.*?)\?\s+-", text)
    if match:
        return match.group(1).strip()
    
    # Pattern 2: "Would you enrol in <Course> again?"
    match = re.search(r"Would you enrol in\s+(.*?)( again)?\??$", text, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    
    # Pattern 3: "exam for <Course>?"
    match = re.search(r"exam for\s+(.*?)\??$", text)
    if match:
        return match.group(1).strip()
    
    return None

def determine_question_type(text):
    """
    Determine question type from question text.
    
    Args:
        text (str): Question text
        
    Returns:
        str or None: Question type if identified
    """
    if not isinstance(text, str):
        return None
    
    # Match question types
    if "How difficult was the course" in text or "How hard was the course" in text:
        return "Course Difficulty"
    elif "How difficult was the exam" in text or "How hard was the exam" in text:
        return "Exam Difficulty"
    elif "exam was in line with what has been explained" in text:
        return "Exam Alignment"
    elif "Would you enrol in" in text:
        return "Enrol Again"
    elif "When did you sit the exam" in text:
        return "Exam Session"
    elif "What did you appreciate about" in text:
        return "Positive Feedback"
    elif "What could be improved about" in text:
        return "Improvement Feedback"
    
    return None

# Build course mapping dictionary
print("Building course mapping from question texts...")
course_map = {}

for col in df_raw.columns:
    q_text = questions_row[col]
    course_name = extract_course_name(q_text)
    q_type = determine_question_type(q_text)
    
    if course_name and q_type:
        if course_name not in course_map:
            course_map[course_name] = {}
        course_map[course_name][q_type] = col

print(f"\n{'='*70}")
print(f"✓ Course mapping complete")
print(f"\nCourses identified:               {len(course_map):>6}")
print(f"\nSample mappings (first 3 courses):")
for i, (course, cols) in enumerate(list(course_map.items())[:3]):
    print(f"\n  {i+1}. {course}")
    for q_type, col in cols.items():
        print(f"     - {q_type:20s}: {col}")
print("="*70)

# Export data dictionary
data_dict = pd.DataFrame([
    {'Course': course, 'Question_Type': q_type, 'Column': col}
    for course, questions in course_map.items()
    for q_type, col in questions.items()
])
data_dict_path = f'{OUTPUT_DIR}/data_dictionary.csv'
data_dict.to_csv(data_dict_path, index=False)
print(f"\n✓ Data dictionary saved to: {data_dict_path}")

### Response Scale Mappings

Defining mappings to convert text responses to numeric scales.
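
One thing worth checking with this kind of mapping is that a label missing from the dictionary turns into a missing value rather than raising an error. The sketch below uses `Series.map` on toy answers (a vectorised alternative to the `dict.get` calls used later in this notebook); the sample answers are invented:

```python
import pandas as pd

toy_difficulty_map = {
    'Easy': 1,
    'Somewhat easy': 2,
    'Neither easy nor difficult': 3,
    'Somewhat difficult': 4,
    'Difficult': 5,
}

# Invented answers: one missing value and one label with different casing
toy_answers = pd.Series(['Easy', 'Difficult', None, 'difficult'])

# Labels not found in the dictionary become NaN
toy_scores = toy_answers.map(toy_difficulty_map)

# Answers that were given but did not match any known label
unmatched = toy_answers[toy_scores.isna() & toy_answers.notna()]
print(toy_scores.tolist())
print(unmatched.tolist())
```

Listing the unmatched labels once after mapping is a cheap way to catch wording changes in the survey export.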

In [None]:
# Define scale mappings
difficulty_map = {
    'Easy': 1,
    'Somewhat easy': 2,
    'Neither easy nor difficult': 3,
    'Somewhat difficult': 4,
    'Difficult': 5
}

agreement_map = {
    'Strongly disagree': 1,
    'Somewhat disagree': 2,
    'Neither agree nor disagree': 3,
    'Somewhat agree': 4,
    'Strongly agree': 5
}

enrol_again_map = {
    'Yes': 3,
    'No, but I would enrol in the module next year': 2,
    'No, I would not enrol in the module again': 1
}

def clean_enrol_text(text):
    """Simplify enrolment text for analysis."""
    if not isinstance(text, str):
        return "No Response"
    if "Yes" in text:
        return "Yes"
    if "No, but" in text:
        return "No (Next Year)"
    if "No, I would not" in text:
        return "No (Never)"
    return "Other"

print("Scale mappings defined:\n")
print("1. Difficulty Scale (1-5):")
for k, v in difficulty_map.items():
    print(f"   {v}: {k}")

print("\n2. Agreement Scale (1-5):")
for k, v in agreement_map.items():
    print(f"   {v}: {k}")

print("\n3. Enrolment Intent (1-3):")
for k, v in enrol_again_map.items():
    print(f"   {v}: {k}")

### Transform to Long Format

Converting wide-format data to long format for easier analysis.
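
The idea can be sketched on a toy frame with `DataFrame.melt`. This is an illustration of the reshaping only, not the loop used in the cell below, and the column names and courses are hypothetical:

```python
import pandas as pd

# Hypothetical wide data: one row per respondent, one column per course question
toy_wide = pd.DataFrame({
    'ResponseId': ['R_001', 'R_002'],
    'Q1_1': ['Easy', None],          # assumed: difficulty of Course A
    'Q2_1': ['Difficult', 'Easy'],   # assumed: difficulty of Course B
})
toy_course_cols = {'Course A': 'Q1_1', 'Course B': 'Q2_1'}

# One row per (respondent, question column); unanswered courses are dropped
toy_long = toy_wide.melt(
    id_vars='ResponseId',
    value_vars=list(toy_course_cols.values()),
    var_name='Column',
    value_name='Course Difficulty Text',
).dropna(subset=['Course Difficulty Text'])

# Translate the question column back to its course name
col_to_course = {col: course for course, col in toy_course_cols.items()}
toy_long['Course'] = toy_long['Column'].map(col_to_course)

print(toy_long[['ResponseId', 'Course', 'Course Difficulty Text']])
```

As in the notebook's own transformation, a respondent contributes a row for a course only when the course difficulty question was answered.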

In [None]:
def safe_get(row, col_name):
    """Safely retrieve column value from row."""
    if col_name and col_name in row.index:
        return row[col_name]
    return None

# Transform data to long format
print("Transforming data to long format...")
records = []

for idx, row in df_filtered.iterrows():
    respondent_id = row['ResponseId']
    
    for course, cols in course_map.items():
        # Check if respondent reviewed this course
        c_diff_col = cols.get('Course Difficulty')
        if c_diff_col and pd.notna(safe_get(row, c_diff_col)):
            # Extract all course-related responses
            c_diff_text = safe_get(row, cols.get('Course Difficulty'))
            e_diff_text = safe_get(row, cols.get('Exam Difficulty'))
            e_align_text = safe_get(row, cols.get('Exam Alignment'))
            enrol_text = safe_get(row, cols.get('Enrol Again'))
            session_text = safe_get(row, cols.get('Exam Session'))
            positive_fb = safe_get(row, cols.get('Positive Feedback'))
            improve_fb = safe_get(row, cols.get('Improvement Feedback'))
            
            # Create record with both text and numeric values
            record = {
                'RespondentId': respondent_id,
                'Course': course,
                'Course Difficulty Text': c_diff_text,
                'Exam Difficulty Text': e_diff_text,
                'Exam Alignment Text': e_align_text,
                'Enrol Again Text': enrol_text,
                'Exam Session': session_text,
                'Positive Feedback': positive_fb,
                'Improvement Feedback': improve_fb,
                # Numeric conversions
                'Course Difficulty': difficulty_map.get(c_diff_text),
                'Exam Difficulty': difficulty_map.get(e_diff_text),
                'Exam Alignment': agreement_map.get(e_align_text),
                'Enrol Again Numeric': enrol_again_map.get(enrol_text),
                'Enrol Again Simplified': clean_enrol_text(enrol_text)
            }
            records.append(record)

# Create long-format dataframe
df_long = pd.DataFrame(records)

print(f"\n{'='*70}")
print(" " * 18 + "TRANSFORMATION COMPLETE")
print("="*70)
print(f"Total course reviews:             {len(df_long):>6}")
print(f"Unique respondents:               {df_long['RespondentId'].nunique():>6}")
print(f"Unique courses reviewed:          {df_long['Course'].nunique():>6}")
print(f"Avg reviews per respondent:       {len(df_long) / df_long['RespondentId'].nunique():>6.1f}")
print("="*70)

# Save processed data
processed_path = f'{OUTPUT_DIR}/processed_course_feedback.csv'
df_long.to_csv(processed_path, index=False)
print(f"\n✓ Processed data saved to: {processed_path}")

<a id="descriptive-stats"></a>
## 5. Descriptive Statistics

### Overview of Course Feedback Data

In [None]:
# Calculate comprehensive descriptive statistics
print("="*70)
print(" " * 18 + "DESCRIPTIVE STATISTICS")
print("="*70)

# Response statistics by course
course_stats = df_long.groupby('Course').agg({
    'Course Difficulty': ['mean', 'std', 'count'],
    'Exam Difficulty': ['mean', 'std'],
    'Exam Alignment': ['mean', 'std'],
    'Enrol Again Numeric': ['mean', 'std']
}).round(2)

# Flatten column names
course_stats.columns = ['_'.join(col).strip() for col in course_stats.columns.values]
course_stats = course_stats.reset_index()
course_stats.columns = ['Course', 'Course_Diff_Mean', 'Course_Diff_Std', 'Response_Count',
                         'Exam_Diff_Mean', 'Exam_Diff_Std', 
                         'Exam_Align_Mean', 'Exam_Align_Std',
                         'Enrol_Again_Mean', 'Enrol_Again_Std']

# Filter by minimum responses
course_stats_filtered = course_stats[course_stats['Response_Count'] >= MIN_RESPONSES].copy()

print(f"\nCourses with {MIN_RESPONSES}+ responses:  {len(course_stats_filtered)}")
print(f"Total courses in dataset:     {len(course_stats)}")
print(f"\nTop 10 most reviewed courses:")
print(course_stats.nlargest(10, 'Response_Count')[['Course', 'Response_Count']].to_string(index=False))

# Overall statistics
print(f"\n{'='*70}")
print("OVERALL STATISTICS (all courses):")
print("="*70)
print(f"\nCourse Difficulty:")
print(f"  Mean:    {df_long['Course Difficulty'].mean():.2f}  (1=Easy, 5=Difficult)")
print(f"  Median:  {df_long['Course Difficulty'].median():.2f}")
print(f"  Std:     {df_long['Course Difficulty'].std():.2f}")

print(f"\nExam Difficulty:")
print(f"  Mean:    {df_long['Exam Difficulty'].mean():.2f}")
print(f"  Median:  {df_long['Exam Difficulty'].median():.2f}")
print(f"  Std:     {df_long['Exam Difficulty'].std():.2f}")

print(f"\nExam Alignment:")
print(f"  Mean:    {df_long['Exam Alignment'].mean():.2f}  (1=Disagree, 5=Agree)")
print(f"  Median:  {df_long['Exam Alignment'].median():.2f}")
print(f"  Std:     {df_long['Exam Alignment'].std():.2f}")

# Enrolment intentions
enrol_dist = df_long['Enrol Again Simplified'].value_counts()
print(f"\nRe-enrolment Intent:")
for intent, count in enrol_dist.items():
    pct = (count / len(df_long)) * 100
    print(f"  {intent:20s}: {count:>4}  ({pct:>5.1f}%)")

print("="*70)

# Save summary statistics
course_stats.to_csv(f'{OUTPUT_DIR}/course_summary_statistics.csv', index=False)
print(f"\n✓ Summary statistics saved to: {OUTPUT_DIR}/course_summary_statistics.csv")

### Module Comparison (II A vs II B)

Analyzing differences between Module II A and Module II B courses.
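
The comparison below relies on Welch's t-test. Since the ratings are ordinal, a rank-based test such as Mann-Whitney U (imported in the setup cell) may be a useful cross-check. The sketch runs both on invented ratings, so the numbers say nothing about the actual survey:

```python
from scipy.stats import ttest_ind, mannwhitneyu

# Invented 1-5 difficulty ratings for two groups
toy_module_a = [2, 3, 3, 4, 4, 5]
toy_module_b = [1, 2, 2, 3, 3, 3]

# Welch's t-test: does not assume equal variances
t_stat, t_p = ttest_ind(toy_module_a, toy_module_b, equal_var=False)

# Mann-Whitney U: compares ranks, so it does not treat the scale as interval
u_stat, u_p = mannwhitneyu(toy_module_a, toy_module_b, alternative='two-sided')

print(f"Welch t = {t_stat:.3f}, p = {t_p:.3f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.3f}")
```

If the two tests disagree, the conclusion probably depends on treating the 1-5 labels as evenly spaced.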

In [None]:
# Assign each respondent to a module
# Module choice is read from the 'A' and 'B' columns of the filtered responses

if 'A' in df_filtered.columns and 'B' in df_filtered.columns:
    print("Analyzing Module distribution...")
    
    # Create module assignment for each response
    # (a match in column 'A' takes precedence over column 'B')
    module_assignments = []
    for idx, row in df_filtered.iterrows():
        resp_id = row['ResponseId']
        module_a = row.get('A')
        module_b = row.get('B')
        
        if pd.notna(module_a) and 'Module II A' in str(module_a):
            module_assignments.append((resp_id, 'Module II A'))
        elif pd.notna(module_b) and 'Module II B' in str(module_b):
            module_assignments.append((resp_id, 'Module II B'))
        else:
            module_assignments.append((resp_id, 'Unknown'))
    
    module_df = pd.DataFrame(module_assignments, columns=['RespondentId', 'Module'])
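    # Drop any existing 'Module' column first so re-running this cell does not create Module_x/Module_y
    df_long_with_module = df_long.drop(columns='Module', errors='ignore').merge(module_df, on='RespondentId', how='left')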
    
    # Compare modules
    print(f"\n{'='*70}")
    print("MODULE COMPARISON:")
    print("="*70)
    
    module_comparison = df_long_with_module.groupby('Module').agg({
        'Course Difficulty': ['mean', 'std', 'count'],
        'Exam Difficulty': ['mean', 'std'],
        'Exam Alignment': ['mean', 'std']
    }).round(2)
    
    print(f"\n{module_comparison.to_string()}")
    
    # Statistical test
    module_a_diff = df_long_with_module[df_long_with_module['Module'] == 'Module II A']['Course Difficulty'].dropna()
    module_b_diff = df_long_with_module[df_long_with_module['Module'] == 'Module II B']['Course Difficulty'].dropna()
    
    if len(module_a_diff) > 0 and len(module_b_diff) > 0:
        t_stat, p_value = ttest_ind(module_a_diff, module_b_diff, equal_var=False)
        print(f"\nWelch's t-test for Course Difficulty (Module II A vs B):")
        print(f"  t-statistic: {t_stat:.3f}")
        print(f"  p-value:     {p_value:.3f}")
        if p_value < 0.05:
            print(f"  → Statistically significant difference (p < 0.05)")
        else:
            print(f"  → No significant difference (p >= 0.05)")
    
    print("="*70)
else:
    print("Module information not available in expected columns")
    df_long_with_module = df_long.copy()
    df_long_with_module['Module'] = 'Unknown'
    
# Store for later use
df_long = df_long_with_module

<a id="statistical-analysis"></a>
## 6. Statistical Analysis

### Correlation Analysis

Examining relationships between different metrics.
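
As a small illustration of computing correlations on complete cases, the sketch below uses invented scores. It also shows Spearman's rank correlation through the `method` argument of `DataFrame.corr`, which is an alternative to the Pearson matrix computed in the next cell:

```python
import pandas as pd

# Invented scores; the last row lacks a difficulty rating
toy_metrics = pd.DataFrame({
    'Course Difficulty': [1, 2, 3, 4, 5, None],
    'Exam Alignment': [5, 4, 4, 2, 1, 3],
})

# Keep only rows where every metric is present
toy_complete = toy_metrics.dropna()

toy_pearson = toy_complete.corr(method='pearson')
toy_spearman = toy_complete.corr(method='spearman')

print(toy_pearson.round(3))
print(toy_spearman.round(3))
```

Dropping incomplete rows before calling `corr` means every coefficient in the matrix is based on the same set of reviews.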

In [None]:
# Correlation analysis
print("="*70)
print(" " * 20 + "CORRELATION ANALYSIS")
print("="*70)

# Calculate correlations on complete cases
metrics_for_corr = df_long[['Course Difficulty', 'Exam Difficulty', 'Exam Alignment', 'Enrol Again Numeric']].dropna()

if len(metrics_for_corr) > 0:
    # Pearson correlation
    corr_matrix = metrics_for_corr.corr()
    
    print("\nPearson Correlation Matrix:\n")
    print(corr_matrix.round(3).to_string())
    
    # Highlight key correlations
    print("\nKey Findings:")
    
    # Difficulty vs Alignment
    diff_align_corr = corr_matrix.loc['Course Difficulty', 'Exam Alignment']
    print(f"  - Course Difficulty vs Exam Alignment: r = {diff_align_corr:.3f}")
    if abs(diff_align_corr) > 0.3:
        direction = "positive" if diff_align_corr > 0 else "negative"
        print(f"    → Moderate {direction} correlation")
    
    # Difficulty vs Enrolment
    diff_enrol_corr = corr_matrix.loc['Course Difficulty', 'Enrol Again Numeric']
    print(f"  - Course Difficulty vs Re-enrolment: r = {diff_enrol_corr:.3f}")
    if abs(diff_enrol_corr) > 0.3:
        direction = "positive" if diff_enrol_corr > 0 else "negative"
        print(f"    → Moderate {direction} correlation")
    
    # Alignment vs Enrolment
    align_enrol_corr = corr_matrix.loc['Exam Alignment', 'Enrol Again Numeric']
    print(f"  - Exam Alignment vs Re-enrolment: r = {align_enrol_corr:.3f}")
    if abs(align_enrol_corr) > 0.3:
        direction = "positive" if align_enrol_corr > 0 else "negative"
        print(f"    → Moderate {direction} correlation")
    
    print("="*70)
    
    # Save correlation matrix
    corr_matrix.to_csv(f'{OUTPUT_DIR}/correlation_matrix.csv')
    print(f"\n✓ Correlation matrix saved to: {OUTPUT_DIR}/correlation_matrix.csv")
else:
    print("\nInsufficient data for correlation analysis")
    corr_matrix = pd.DataFrame()

### Identifying Outliers and Polarizing Courses

Finding courses with unusual patterns or high variance in opinions.
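
A toy example of the idea: a course where ratings split between the extremes has a large standard deviation, while a course everyone rates the same has none. The courses and ratings below are invented, and the threshold of 3 mirrors `MIN_RESPONSES`:

```python
import pandas as pd

# Invented ratings: A is split between 1 and 5, B is unanimous, C has one review
toy_reviews = pd.DataFrame({
    'Course': ['A'] * 4 + ['B'] * 4 + ['C'],
    'Course Difficulty': [1, 5, 1, 5, 3, 3, 3, 3, 2],
})

# Spread and number of ratings per course
toy_spread = toy_reviews.groupby('Course')['Course Difficulty'].agg(['std', 'count'])

# Ignore courses with too few reviews to estimate a spread
toy_spread = toy_spread[toy_spread['count'] >= 3]

most_polarizing = toy_spread['std'].idxmax()
print(toy_spread)
print(most_polarizing)
```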

In [None]:
# Identify outliers and polarizing courses
print("="*70)
print(" " * 18 + "OUTLIER & VARIANCE ANALYSIS")
print("="*70)

# Calculate the standard deviation of difficulty ratings per course (courses with sufficient responses)
course_variance = df_long.groupby('Course').agg({
    'Course Difficulty': ['std', 'count']
}).reset_index()
course_variance.columns = ['Course', 'Difficulty_Std', 'Count']
course_variance = course_variance[course_variance['Count'] >= MIN_RESPONSES]

# High variance courses (polarizing opinions)
high_variance = course_variance.nlargest(5, 'Difficulty_Std')
print("\nMost Polarizing Courses (high variance in difficulty ratings):")
for idx, row in high_variance.iterrows():
    print(f"  - {row['Course'][:50]:50s}  std={row['Difficulty_Std']:.2f}")

# Low variance courses (consistent opinions)
low_variance = course_variance.nsmallest(5, 'Difficulty_Std')
print("\nMost Consistent Courses (low variance):")
for idx, row in low_variance.iterrows():
    print(f"  - {row['Course'][:50]:50s}  std={row['Difficulty_Std']:.2f}")

print("="*70)

<a id="visualizations"></a>
## 7. Comprehensive Visualizations

### Publication-Ready Figures

Creating high-quality visualizations for analysis and publication.

### 1. Summary Statistics Table

In [None]:
# Display nicely formatted summary table
print("Top 15 Courses by Response Count:\n")
top_courses = course_stats.nlargest(15, 'Response_Count')

# Create a formatted display
display_cols = ['Course', 'Response_Count', 'Course_Diff_Mean', 'Exam_Diff_Mean', 
                'Exam_Align_Mean', 'Enrol_Again_Mean']
display_df = top_courses[display_cols].copy()
display_df.columns = ['Course', 'N', 'Course Diff', 'Exam Diff', 'Exam Align', 'Re-enrol']

# Format Course name to fit
display_df['Course'] = display_df['Course'].str[:45]

print(display_df.to_string(index=False))

# Save formatted table
display_df.to_csv(f'{OUTPUT_DIR}/top_courses_summary.csv', index=False)

### 2. Response Distribution by Course (Bar Chart)

In [None]:
# Create bar chart of response counts
plt.figure(figsize=(12, 10))

top_20 = course_stats.nlargest(20, 'Response_Count')
courses = [c[:40] for c in top_20['Course']]  # Truncate names
counts = top_20['Response_Count'].values

plt.barh(range(len(courses)), counts, color=sns.color_palette("husl", len(courses)))
plt.yticks(range(len(courses)), courses)
plt.xlabel('Number of Responses', fontsize=12)
plt.ylabel('Course', fontsize=12)
plt.title('Top 20 Courses by Response Count', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()

plt.savefig(f'{FIGURE_DIR}/01_response_distribution.png', dpi=FIGURE_DPI, bbox_inches='tight')
print(f"✓ Figure saved: {FIGURE_DIR}/01_response_distribution.png")
plt.close()

### 3. Course Metrics Heatmap

In [None]:
# Create heatmap of all metrics for top courses
plt.figure(figsize=(10, 12))

# Select top courses and prepare data
top_courses_list = course_stats.nlargest(20, 'Response_Count')['Course'].tolist()
heatmap_data = course_stats[course_stats['Course'].isin(top_courses_list)].copy()
heatmap_data = heatmap_data.set_index('Course')[['Course_Diff_Mean', 'Exam_Diff_Mean', 
                                                   'Exam_Align_Mean', 'Enrol_Again_Mean']]
heatmap_data.index = [c[:35] for c in heatmap_data.index]  # Truncate course names
heatmap_data.columns = ['Course\nDifficulty', 'Exam\nDifficulty', 
                         'Exam\nAlignment', 'Re-enrol\nIntent']

# Sort by course difficulty
heatmap_data = heatmap_data.sort_values('Course\nDifficulty', ascending=False)

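# Create heatmap with annotations
# Note: re-enrol intent is on a 1-3 scale; the other three metrics are 1-5
sns.heatmap(heatmap_data, annot=True, fmt='.2f', cmap='RdYlGn_r', 
            linewidths=0.5, cbar_kws={'label': 'Mean score (1-5; re-enrol intent 1-3)'})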
plt.title('Course Metrics Heatmap (Top 20 Courses)', fontsize=14, fontweight='bold')
plt.ylabel('Course', fontsize=12)
plt.xlabel('Metric', fontsize=12)
plt.tight_layout()

plt.savefig(f'{FIGURE_DIR}/02_course_metrics_heatmap.png', dpi=FIGURE_DPI, bbox_inches='tight')
print(f"✓ Figure saved: {FIGURE_DIR}/02_course_metrics_heatmap.png")
plt.close()

### 4. Course Difficulty Distribution (Box Plot)

In [None]:
# Box plot of course difficulty for top courses
plt.figure(figsize=(14, 10))

top_15_courses = course_stats.nlargest(15, 'Response_Count')['Course'].tolist()
df_top = df_long[df_long['Course'].isin(top_15_courses)].copy()
df_top['Course_Short'] = df_top['Course'].str[:35]

# Calculate median for sorting
median_order = df_top.groupby('Course_Short')['Course Difficulty'].median().sort_values().index

sns.boxplot(data=df_top, y='Course_Short', x='Course Difficulty', 
            order=median_order, palette='coolwarm')
plt.title('Distribution of Course Difficulty Ratings (Top 15 Courses)', 
          fontsize=14, fontweight='bold')
plt.xlabel('Difficulty Rating (1=Easy, 5=Difficult)', fontsize=12)
plt.ylabel('Course', fontsize=12)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()

plt.savefig(f'{FIGURE_DIR}/03_course_difficulty_boxplot.png', dpi=FIGURE_DPI, bbox_inches='tight')
print(f"✓ Figure saved: {FIGURE_DIR}/03_course_difficulty_boxplot.png")
plt.close()

### 5. Difficulty vs Alignment Scatter Plot

In [None]:
# Scatter plot with bubble sizes
plt.figure(figsize=(12, 8))

# Use course stats for plotting
plot_data = course_stats[course_stats['Response_Count'] >= MIN_RESPONSES].copy()

scatter = plt.scatter(plot_data['Course_Diff_Mean'], 
                     plot_data['Exam_Align_Mean'],
                     s=plot_data['Response_Count']*20,
                     alpha=0.6,
                     c=plot_data['Course_Diff_Mean'],
                     cmap='coolwarm',
                     edgecolors='black',
                     linewidth=0.5)

# Add labels for selected courses (highest/lowest in each dimension)
# Top 3 most difficult
top_diff = plot_data.nlargest(3, 'Course_Diff_Mean')
for _, row in top_diff.iterrows():
    plt.annotate(row['Course'][:25], 
                xy=(row['Course_Diff_Mean'], row['Exam_Align_Mean']),
                xytext=(5, 5), textcoords='offset points',
                fontsize=8, alpha=0.7)

# Top 3 lowest alignment
low_align = plot_data.nsmallest(3, 'Exam_Align_Mean')
for _, row in low_align.iterrows():
    plt.annotate(row['Course'][:25],
                xy=(row['Course_Diff_Mean'], row['Exam_Align_Mean']),
                xytext=(5, -10), textcoords='offset points',
                fontsize=8, alpha=0.7)

plt.colorbar(scatter, label='Course Difficulty')
plt.xlabel('Average Course Difficulty (1-5)', fontsize=12)
plt.ylabel('Average Exam Alignment (1-5)', fontsize=12)
plt.title('Course Difficulty vs Exam Alignment\n(Bubble size = Response count)', 
          fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()

plt.savefig(f'{FIGURE_DIR}/04_difficulty_vs_alignment.png', dpi=FIGURE_DPI, bbox_inches='tight')
print(f"✓ Figure saved: {FIGURE_DIR}/04_difficulty_vs_alignment.png")
plt.close()

### 6. Correlation Matrix Heatmap

In [None]:
# Correlation heatmap
if not corr_matrix.empty:
    plt.figure(figsize=(10, 8))
    
    # Create heatmap
    sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', 
                center=0, square=True, linewidths=1,
                cbar_kws={'label': 'Correlation Coefficient'})
    
    plt.title('Correlation Matrix of Course Metrics', fontsize=14, fontweight='bold')
    plt.tight_layout()
    
    plt.savefig(f'{FIGURE_DIR}/05_correlation_heatmap.png', dpi=FIGURE_DPI, bbox_inches='tight')
    print(f"✓ Figure saved: {FIGURE_DIR}/05_correlation_heatmap.png")
    plt.close()
else:
    print("Skipping correlation heatmap (insufficient data)")

### 7. Re-enrolment Intent Distribution

In [None]:
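# Stacked bar chart for enrolment intent by course
fig, ax = plt.subplots(figsize=(14, 10))

# Get enrolment breakdown for top courses
top_courses_enrol = course_stats.nlargest(15, 'Response_Count')['Course'].tolist()
enrol_data = df_long[df_long['Course'].isin(top_courses_enrol)].copy()
enrol_data['Course_Short'] = enrol_data['Course'].str[:35]

# Create pivot table
enrol_pivot = enrol_data.groupby(['Course_Short', 'Enrol Again Simplified']).size().unstack(fill_value=0)

# Convert to percentages
enrol_pct = enrol_pivot.div(enrol_pivot.sum(axis=1), axis=0) * 100

# Sort by "Yes" percentage
if 'Yes' in enrol_pct.columns:
    enrol_pct = enrol_pct.sort_values('Yes', ascending=True)

# Fix the category order so each response always gets the same colour
# (unstack() orders columns alphabetically, which would otherwise shift the colours)
enrol_colors = {'No (Never)': '#ff6b6b', 'No (Next Year)': '#ffd93d',
                'Yes': '#6bcf7f', 'No Response': '#95a5a6', 'Other': '#cccccc'}
enrol_order = [c for c in enrol_colors if c in enrol_pct.columns]
enrol_pct = enrol_pct[enrol_order]

# Plot stacked horizontal bar on the axes created above
enrol_pct.plot(kind='barh', stacked=True, ax=ax,
               color=[enrol_colors[c] for c in enrol_order])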

plt.xlabel('Percentage of Responses (%)', fontsize=12)
plt.ylabel('Course', fontsize=12)
plt.title('Would Students Re-enrol? (Top 15 Courses)', fontsize=14, fontweight='bold')
plt.legend(title='Response', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()

plt.savefig(f'{FIGURE_DIR}/06_reenrolment_distribution.png', dpi=FIGURE_DPI, bbox_inches='tight')
print(f"✓ Figure saved: {FIGURE_DIR}/06_reenrolment_distribution.png")
plt.close()

### 8. Difficulty Distribution (Violin Plot)

In [None]:
# Violin plot for difficulty distribution
plt.figure(figsize=(12, 8))

# Select courses for violin plot
top_10_courses = course_stats.nlargest(10, 'Response_Count')['Course'].tolist()
df_violin = df_long[df_long['Course'].isin(top_10_courses)].copy()
df_violin['Course_Short'] = df_violin['Course'].str[:30]

# Sort by median difficulty
median_order = df_violin.groupby('Course_Short')['Course Difficulty'].median().sort_values().index

sns.violinplot(data=df_violin, y='Course_Short', x='Course Difficulty',
               order=median_order, palette='Set2', inner='box')

plt.title('Distribution of Course Difficulty (Violin Plot, Top 10 Courses)', 
          fontsize=14, fontweight='bold')
plt.xlabel('Difficulty Rating (1=Easy, 5=Difficult)', fontsize=12)
plt.ylabel('Course', fontsize=12)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()

plt.savefig(f'{FIGURE_DIR}/07_difficulty_violin_plot.png', dpi=FIGURE_DPI, bbox_inches='tight')
print(f"✓ Figure saved: {FIGURE_DIR}/07_difficulty_violin_plot.png")
plt.close()

### 9. Module Comparison (if applicable)

In [None]:
# Module comparison visualization
if 'Module' in df_long.columns and df_long['Module'].nunique() > 1:
    fig, axes = plt.subplots(2, 2, figsize=(14, 12))
    fig.suptitle('Module II A vs Module II B Comparison', fontsize=16, fontweight='bold')
    
    # Remove "Unknown" for cleaner comparison
    df_module_clean = df_long[df_long['Module'] != 'Unknown'].copy()
    
    if len(df_module_clean) > 0:
        # Plot 1: Course Difficulty
        sns.boxplot(data=df_module_clean, x='Module', y='Course Difficulty', 
                   ax=axes[0,0], palette='Set2')
        axes[0,0].set_title('Course Difficulty by Module')
        axes[0,0].set_ylabel('Difficulty (1-5)')
        axes[0,0].grid(axis='y', alpha=0.3)
        
        # Plot 2: Exam Difficulty
        sns.boxplot(data=df_module_clean, x='Module', y='Exam Difficulty',
                   ax=axes[0,1], palette='Set2')
        axes[0,1].set_title('Exam Difficulty by Module')
        axes[0,1].set_ylabel('Difficulty (1-5)')
        axes[0,1].grid(axis='y', alpha=0.3)
        
        # Plot 3: Exam Alignment
        sns.boxplot(data=df_module_clean, x='Module', y='Exam Alignment',
                   ax=axes[1,0], palette='Set2')
        axes[1,0].set_title('Exam Alignment by Module')
        axes[1,0].set_ylabel('Alignment (1-5)')
        axes[1,0].grid(axis='y', alpha=0.3)
        
        # Plot 4: Re-enrolment Intent
        module_enrol = df_module_clean.groupby(['Module', 'Enrol Again Simplified']).size().unstack(fill_value=0)
        module_enrol_pct = module_enrol.div(module_enrol.sum(axis=1), axis=0) * 100
        module_enrol_pct.plot(kind='bar', stacked=True, ax=axes[1,1],
                             color=['#ff6b6b', '#ffd93d', '#6bcf7f', '#95a5a6'])
        axes[1,1].set_title('Re-enrolment Intent by Module')
        axes[1,1].set_ylabel('Percentage (%)')
        axes[1,1].set_xlabel('Module')
        axes[1,1].legend(title='Response', bbox_to_anchor=(1.05, 1), loc='upper left')
        axes[1,1].grid(axis='y', alpha=0.3)
        
        plt.tight_layout()
        plt.savefig(f'{FIGURE_DIR}/08_module_comparison.png', dpi=FIGURE_DPI, bbox_inches='tight')
        print(f"✓ Figure saved: {FIGURE_DIR}/08_module_comparison.png")
        plt.close()
    else:
        print("Insufficient module data for comparison plots")
        plt.close()
else:
    print("Module comparison not available")
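
The re-enrolment panel depends on row-normalising a count table so that each module's stacked bar sums to 100%. A minimal sketch of that step on invented data (the module labels and responses below are illustrative only, not survey results), alongside what should be an equivalent `pd.crosstab(..., normalize='index')` call:

```python
import pandas as pd

# Toy data: values are invented purely for illustration
toy = pd.DataFrame({
    'Module': ['II A', 'II A', 'II A', 'II B', 'II B'],
    'Enrol Again Simplified': ['Yes', 'Yes', 'No', 'Yes', 'Maybe'],
})

# Count responses per module, then divide each row by its own total
counts = toy.groupby(['Module', 'Enrol Again Simplified']).size().unstack(fill_value=0)
pct = counts.div(counts.sum(axis=1), axis=0) * 100

# Same table via crosstab's built-in row normalisation
pct_crosstab = pd.crosstab(toy['Module'], toy['Enrol Again Simplified'],
                           normalize='index') * 100

print(pct.round(1))
```

Because the columns come out in sorted label order, a hard-coded colour list is matched to responses by position; passing a dict or reindexing the columns first may be safer if the set of responses can vary.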

### 10. Pairwise Relationships (Scatter Matrix)

In [None]:
# Create scatter matrix for key metrics
from pandas.plotting import scatter_matrix

metrics_scatter = df_long[['Course Difficulty', 'Exam Difficulty', 
                            'Exam Alignment', 'Enrol Again Numeric']].dropna()

if len(metrics_scatter) > 10:
    # scatter_matrix creates its own figure, so a separate plt.figure() call
    # would only leave an empty figure open
    scatter_matrix(metrics_scatter, alpha=0.5, figsize=(12, 12), diagonal='hist',
                  hist_kwds={'bins': 20, 'edgecolor': 'black'})
    plt.suptitle('Pairwise Relationships Between Metrics', 
                fontsize=14, fontweight='bold', y=0.995)
    
    plt.savefig(f'{FIGURE_DIR}/09_scatter_matrix.png', dpi=FIGURE_DPI, bbox_inches='tight')
    print(f"✓ Figure saved: {FIGURE_DIR}/09_scatter_matrix.png")
    plt.close()
else:
    print("Insufficient data for scatter matrix")
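
Ratings on a 1-5 scale take only a few distinct values, so points in a scatter matrix overlap heavily and the panels can be hard to read. As a hedged complement, a rank-based (Spearman) correlation table summarises the same pairwise relationships numerically. The sketch below uses invented ratings, not the survey data:

```python
import pandas as pd

# Invented ratings for illustration; column names mirror the metrics above
toy = pd.DataFrame({
    'Course Difficulty': [1, 2, 3, 4, 5, 3],
    'Exam Difficulty':   [1, 2, 4, 4, 5, 3],
    'Exam Alignment':    [5, 4, 3, 2, 1, 3],
})

# Spearman correlation works on ranks, which suits ordinal Likert-style data
rho = toy.corr(method='spearman')
print(rho.round(2))
```

In this toy table, alignment decreases exactly as difficulty increases, so that pair has a coefficient of -1; real survey data will be noisier.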

<a id="key-insights"></a>
## 8. Key Insights and Findings

### Automated Insight Generation

Identifying the most important patterns and trends in the data.

### Top/Bottom Performers

In [None]:
# Generate key insights automatically
print("="*70)
print(" " * 22 + "KEY INSIGHTS")
print("="*70)

# Filter for courses with sufficient responses
insights_data = course_stats[course_stats['Response_Count'] >= MIN_RESPONSES].copy()

# 1. Most/Least Difficult Courses
print("\n1. MOST DIFFICULT COURSES (Top 5):")
most_difficult = insights_data.nlargest(5, 'Course_Diff_Mean')
for idx, row in most_difficult.iterrows():
    print(f"   - {row['Course'][:50]:50s}  {row['Course_Diff_Mean']:.2f}/5.0  (N={row['Response_Count']:.0f})")

print("\n2. LEAST DIFFICULT COURSES (Top 5):")
least_difficult = insights_data.nsmallest(5, 'Course_Diff_Mean')
for idx, row in least_difficult.iterrows():
    print(f"   - {row['Course'][:50]:50s}  {row['Course_Diff_Mean']:.2f}/5.0  (N={row['Response_Count']:.0f})")

# 3. Best/Worst Exam Alignment
print("\n3. BEST EXAM ALIGNMENT (Top 5):")
best_alignment = insights_data.nlargest(5, 'Exam_Align_Mean')
for idx, row in best_alignment.iterrows():
    print(f"   - {row['Course'][:50]:50s}  {row['Exam_Align_Mean']:.2f}/5.0  (N={row['Response_Count']:.0f})")

print("\n4. WORST EXAM ALIGNMENT (Bottom 5):")
worst_alignment = insights_data.nsmallest(5, 'Exam_Align_Mean')
for idx, row in worst_alignment.iterrows():
    print(f"   - {row['Course'][:50]:50s}  {row['Exam_Align_Mean']:.2f}/5.0  (N={row['Response_Count']:.0f})")

# 5. Highest Re-enrolment Intent
print("\n5. HIGHEST RE-ENROLMENT INTENT (Top 5):")
highest_reenrol = insights_data.nlargest(5, 'Enrol_Again_Mean')
for idx, row in highest_reenrol.iterrows():
    print(f"   - {row['Course'][:50]:50s}  {row['Enrol_Again_Mean']:.2f}/3.0  (N={row['Response_Count']:.0f})")

# 6. Lowest Re-enrolment Intent
print("\n6. LOWEST RE-ENROLMENT INTENT (Bottom 5):")
lowest_reenrol = insights_data.nsmallest(5, 'Enrol_Again_Mean')
for idx, row in lowest_reenrol.iterrows():
    print(f"   - {row['Course'][:50]:50s}  {row['Enrol_Again_Mean']:.2f}/3.0  (N={row['Response_Count']:.0f})")

print("="*70)
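
The six rankings above repeat the same select-and-print pattern. One possible way to factor it out is sketched below; `print_ranking` and `toy_stats` are hypothetical names introduced here, and the sketch uses `sort_values(...).head(n)` rather than `nlargest`/`nsmallest`, which may order ties differently.

```python
import pandas as pd

def print_ranking(stats, column, title, n=5, ascending=False, scale=5.0):
    """Print the n highest (or lowest) courses by `column` and return them."""
    ranked = stats.sort_values(column, ascending=ascending).head(n)
    print(f"\n{title}")
    for _, row in ranked.iterrows():
        print(f"   - {row['Course'][:50]:50s}  {row[column]:.2f}/{scale:.1f}  "
              f"(N={row['Response_Count']:.0f})")
    return ranked

# Toy stand-in for `insights_data`; course names and values are invented
toy_stats = pd.DataFrame({
    'Course': ['Course A', 'Course B', 'Course C'],
    'Course_Diff_Mean': [4.2, 2.1, 3.5],
    'Response_Count': [12, 8, 15],
})

hardest = print_ranking(toy_stats, 'Course_Diff_Mean', 'MOST DIFFICULT:', n=2)
easiest = print_ranking(toy_stats, 'Course_Diff_Mean', 'LEAST DIFFICULT:',
                        n=2, ascending=True)
```

The `scale` argument only changes the denominator shown in the printout, so the same helper could cover the re-enrolment rankings.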

### Special Categories

In [None]:
# Identify special course categories
print("\n" + "="*70)
print(" " * 18 + "SPECIAL COURSE CATEGORIES")
print("="*70)

# "Hidden Gems" - High satisfaction (re-enrol), moderate/low difficulty
hidden_gems = insights_data[
    (insights_data['Enrol_Again_Mean'] >= insights_data['Enrol_Again_Mean'].quantile(0.75)) &
    (insights_data['Course_Diff_Mean'] <= insights_data['Course_Diff_Mean'].median())
].copy()

print("\n7. HIDDEN GEMS (High Re-enrol Intent + Lower Difficulty):")
if len(hidden_gems) > 0:
    hidden_gems = hidden_gems.sort_values('Enrol_Again_Mean', ascending=False).head(5)
    for idx, row in hidden_gems.iterrows():
        print(f"   - {row['Course'][:40]:40s}  Difficulty: {row['Course_Diff_Mean']:.2f}  Re-enrol: {row['Enrol_Again_Mean']:.2f}")
else:
    print("   No courses match these criteria")

# "Challenging but Rewarding" - High difficulty, high re-enrol
challenging_rewarding = insights_data[
    (insights_data['Course_Diff_Mean'] >= insights_data['Course_Diff_Mean'].quantile(0.75)) &
    (insights_data['Enrol_Again_Mean'] >= insights_data['Enrol_Again_Mean'].quantile(0.75))
].copy()

print("\n8. CHALLENGING BUT REWARDING (High Difficulty + High Re-enrol Intent):")
if len(challenging_rewarding) > 0:
    challenging_rewarding = challenging_rewarding.sort_values('Enrol_Again_Mean', ascending=False).head(5)
    for idx, row in challenging_rewarding.iterrows():
        print(f"   - {row['Course'][:40]:40s}  Difficulty: {row['Course_Diff_Mean']:.2f}  Re-enrol: {row['Enrol_Again_Mean']:.2f}")
else:
    print("   No courses match these criteria")

# "Needs Attention" - Low alignment, low re-enrol
needs_attention = insights_data[
    (insights_data['Exam_Align_Mean'] <= insights_data['Exam_Align_Mean'].quantile(0.25)) &
    (insights_data['Enrol_Again_Mean'] <= insights_data['Enrol_Again_Mean'].quantile(0.25))
].copy()

print("\n9. NEEDS ATTENTION (Low Exam Alignment + Low Re-enrol Intent):")
if len(needs_attention) > 0:
    needs_attention = needs_attention.sort_values('Enrol_Again_Mean').head(5)
    for idx, row in needs_attention.iterrows():
        print(f"   - {row['Course'][:40]:40s}  Alignment: {row['Exam_Align_Mean']:.2f}  Re-enrol: {row['Enrol_Again_Mean']:.2f}")
else:
    print("   No courses match these criteria")

print("="*70)

# Save insights
insights_summary = {
    'Most Difficult': most_difficult[['Course', 'Course_Diff_Mean', 'Response_Count']].to_dict('records'),
    'Least Difficult': least_difficult[['Course', 'Course_Diff_Mean', 'Response_Count']].to_dict('records'),
    'Best Alignment': best_alignment[['Course', 'Exam_Align_Mean', 'Response_Count']].to_dict('records'),
    'Worst Alignment': worst_alignment[['Course', 'Exam_Align_Mean', 'Response_Count']].to_dict('records'),
    'Hidden Gems': hidden_gems[['Course', 'Course_Diff_Mean', 'Enrol_Again_Mean']].to_dict('records') if len(hidden_gems) > 0 else [],
    'Challenging But Rewarding': challenging_rewarding[['Course', 'Course_Diff_Mean', 'Enrol_Again_Mean']].to_dict('records') if len(challenging_rewarding) > 0 else [],
    'Needs Attention': needs_attention[['Course', 'Exam_Align_Mean', 'Enrol_Again_Mean']].to_dict('records') if len(needs_attention) > 0 else []
}

import json
with open(f'{OUTPUT_DIR}/key_insights.json', 'w') as f:
    json.dump(insights_summary, f, indent=2)

print(f"\n✓ Key insights saved to: {OUTPUT_DIR}/key_insights.json")
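
The special categories are defined relative to the quartiles of the filtered courses, so membership depends on the sample rather than on fixed cut-offs. A minimal sketch with invented course-level means shows how the median and 75th-percentile thresholds combine:

```python
import pandas as pd

# Toy course-level means; all numbers are invented for illustration
toy = pd.DataFrame({
    'Course': list('ABCDEFGH'),
    'Course_Diff_Mean': [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 2.2, 4.8],
    'Enrol_Again_Mean': [2.9, 1.5, 2.0, 1.8, 2.8, 1.2, 2.7, 2.95],
})

# Thresholds are computed from the data itself
diff_median = toy['Course_Diff_Mean'].median()
diff_q75 = toy['Course_Diff_Mean'].quantile(0.75)
enrol_q75 = toy['Enrol_Again_Mean'].quantile(0.75)

# Both categories require re-enrolment intent in the top quartile
high_enrol = toy['Enrol_Again_Mean'] >= enrol_q75
hidden_gems = toy[high_enrol & (toy['Course_Diff_Mean'] <= diff_median)]
challenging = toy[high_enrol & (toy['Course_Diff_Mean'] >= diff_q75)]

print("Hidden gems:", hidden_gems['Course'].tolist())
print("Challenging but rewarding:", challenging['Course'].tolist())
```

With few courses, a single course can shift a quartile noticeably, so the category lists are best read as relative rankings within this survey.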

<a id="text-analysis"></a>
## 9. Text Analysis of Open-Ended Feedback

### Analyzing Qualitative Responses

Examining the open-ended feedback for common themes and patterns.

In [None]:
# Text analysis of open-ended feedback
print("="*70)
print(" " * 20 + "TEXT ANALYSIS")
print("="*70)

# Collect all positive and improvement feedback
# (cast to str so that ' '.join() cannot fail on non-text cells)
positive_texts = df_long['Positive Feedback'].dropna().astype(str).tolist()
improvement_texts = df_long['Improvement Feedback'].dropna().astype(str).tolist()

print(f"\nPositive feedback responses:      {len(positive_texts)}")
print(f"Improvement feedback responses:   {len(improvement_texts)}")

# Word frequency analysis
def get_word_frequencies(texts, top_n=20):
    """Extract word frequencies from text list."""
    from collections import Counter
    import re
    
    # Combine all texts
    all_text = ' '.join(texts).lower()
    
    # Common English stop words to exclude
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 
                  'of', 'with', 'by', 'from', 'as', 'is', 'was', 'are', 'were', 'been',
                  'be', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
                  'could', 'should', 'may', 'might', 'can', 'this', 'that', 'these',
                  'those', 'i', 'you', 'he', 'she', 'it', 'we', 'they', 'what', 'which',
                  'who', 'when', 'where', 'why', 'how', 'very', 'too', 'not', 'so'}
    
    # Extract words (ASCII letters only, 3+ characters)
    words = re.findall(r'\b[a-z]{3,}\b', all_text)
    
    # Filter stop words
    filtered_words = [w for w in words if w not in stop_words]
    
    # Count frequencies
    word_counts = Counter(filtered_words)
    
    return word_counts.most_common(top_n)

# Analyze positive feedback
if len(positive_texts) > 0:
    print("\nMOST COMMON WORDS IN POSITIVE FEEDBACK (Top 15):")
    positive_words = get_word_frequencies(positive_texts, 15)
    for word, count in positive_words:
        print(f"   {word:20s}: {count:>4} occurrences")

# Analyze improvement feedback
if len(improvement_texts) > 0:
    print("\nMOST COMMON WORDS IN IMPROVEMENT FEEDBACK (Top 15):")
    improvement_words = get_word_frequencies(improvement_texts, 15)
    for word, count in improvement_words:
        print(f"   {word:20s}: {count:>4} occurrences")

print("="*70)
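
To make the tokenisation concrete, the sketch below applies the same regular expression to two invented comments with a shortened stop-word list. It illustrates some side effects worth keeping in mind: hyphenated words and contractions are split, and words containing accented letters are not captured whole because the pattern matches only `a-z`.

```python
import re
from collections import Counter

# Invented comments, used only to show how the tokenizer behaves
samples = ["Well-organised lectures, but the exam wasn't fair.",
           "Great lectures; exam too hard!"]

stop_words = {'the', 'but', 'too'}  # abbreviated list for this sketch
text = ' '.join(samples).lower()

# Same pattern as above: runs of 3+ ASCII letters between word boundaries
tokens = [w for w in re.findall(r'\b[a-z]{3,}\b', text) if w not in stop_words]
counts = Counter(tokens)

print(counts.most_common(3))
```

Here "well-organised" contributes two tokens and "wasn't" is reduced to "wasn", so raw counts are a rough guide to themes rather than exact term frequencies.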

### Word Clouds (Optional)

Generating word clouds if the wordcloud library is available.

In [None]:
# Generate word clouds if available
if WORDCLOUD_AVAILABLE and (len(positive_texts) > 0 or len(improvement_texts) > 0):
    try:
        from wordcloud import WordCloud
        
        # Positive feedback word cloud
        if len(positive_texts) > 5:
            plt.figure(figsize=(12, 6))
            
            positive_text_combined = ' '.join(positive_texts)
            wordcloud_pos = WordCloud(width=800, height=400, 
                                      background_color='white',
                                      colormap='Greens',
                                      max_words=100).generate(positive_text_combined)
            
            plt.imshow(wordcloud_pos, interpolation='bilinear')
            plt.axis('off')
            plt.title('Positive Feedback Word Cloud', fontsize=14, fontweight='bold')
            plt.tight_layout()
            
            plt.savefig(f'{FIGURE_DIR}/10_positive_feedback_wordcloud.png', 
                       dpi=FIGURE_DPI, bbox_inches='tight')
            print(f"✓ Figure saved: {FIGURE_DIR}/10_positive_feedback_wordcloud.png")
            plt.close()
        
        # Improvement feedback word cloud
        if len(improvement_texts) > 5:
            plt.figure(figsize=(12, 6))
            
            improvement_text_combined = ' '.join(improvement_texts)
            wordcloud_imp = WordCloud(width=800, height=400,
                                      background_color='white',
                                      colormap='Reds',
                                      max_words=100).generate(improvement_text_combined)
            
            plt.imshow(wordcloud_imp, interpolation='bilinear')
            plt.axis('off')
            plt.title('Improvement Feedback Word Cloud', fontsize=14, fontweight='bold')
            plt.tight_layout()
            
            plt.savefig(f'{FIGURE_DIR}/11_improvement_feedback_wordcloud.png',
                       dpi=FIGURE_DPI, bbox_inches='tight')
            print(f"✓ Figure saved: {FIGURE_DIR}/11_improvement_feedback_wordcloud.png")
            plt.close()
            
    except Exception as e:
        print(f"Could not generate word clouds: {e}")
else:
    print("Word cloud generation skipped (library not available or insufficient data)")
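
The cell above depends on a `WORDCLOUD_AVAILABLE` flag whose definition is not shown in this part of the notebook. One possible way to set such a flag, assuming it simply records whether the package can be imported, is:

```python
import importlib.util

# Assumption: the flag only records whether `wordcloud` is importable
WORDCLOUD_AVAILABLE = importlib.util.find_spec("wordcloud") is not None

print(f"wordcloud available: {WORDCLOUD_AVAILABLE}")
```

Checking with `find_spec` avoids importing the package just to test for it; a `try`/`except ImportError` around the import would work equally well.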

<a id="recommendations"></a>
## 10. Recommendations

### Actionable Insights for Course Improvement

Based on the analysis above, the key recommendations are:

### Overall Recommendations

1. **Courses Needing Immediate Attention:**
   - Review courses with low exam alignment scores
   - Address courses with low re-enrolment intent
   - Consider curriculum adjustments for courses with misaligned exams

2. **Best Practices to Replicate:**
   - Study "hidden gem" courses to understand what makes them accessible yet valuable
   - Analyze "challenging but rewarding" courses to see how they maintain student engagement despite difficulty

3. **Resource Allocation:**
   - Provide additional support for highly difficult courses
   - Ensure adequate preparation materials are available

4. **Continuous Improvement:**
   - Regularly collect and analyze student feedback
   - Address common themes from open-ended responses
   - Monitor changes in difficulty and satisfaction over time

5. **Communication:**
   - Set clear expectations about course difficulty
   - Ensure exam content aligns with course materials
   - Provide transparent information to help students make informed choices

<a id="exports"></a>
## 11. Exports and Final Report

### Summary of Generated Files

In [None]:
# Summary of all generated files
print("="*70)
print(" " * 20 + "ANALYSIS COMPLETE")
print("="*70)

print("\nGENERATED FILES:\n")

import os

# Mark each expected data file by whether it actually exists on disk
print("1. Data Files:")
data_files = ['data_dictionary.csv', 'processed_course_feedback.csv',
              'course_summary_statistics.csv', 'correlation_matrix.csv',
              'top_courses_summary.csv', 'key_insights.json']
for name in data_files:
    path = f'{OUTPUT_DIR}/{name}'
    mark = '✓' if os.path.exists(path) else '✗'
    print(f"   {mark} {path}")

print("\n2. Visualizations:")
if os.path.exists(FIGURE_DIR):
    figures = sorted(f for f in os.listdir(FIGURE_DIR) if f.endswith('.png'))
    for fig_name in figures:
        print(f"   ✓ {FIGURE_DIR}/{fig_name}")

print("\n" + "="*70)
print("All analyses completed successfully!")
print("Review the generated figures and data files for detailed insights.")
print("="*70)
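
For a report appendix it can help to list the outputs together with their sizes. The sketch below is one way to do that with `pathlib`; `list_outputs` is a hypothetical helper, and `'output'` merely stands in for the notebook's `OUTPUT_DIR`.

```python
from pathlib import Path

def list_outputs(directory, patterns=('*.csv', '*.json', '*.png')):
    """Return (relative path, size in bytes) for matching files under `directory`."""
    root = Path(directory)
    if not root.exists():
        return []
    found = []
    for pattern in patterns:
        # rglob also searches subfolders, e.g. a figures directory
        for path in sorted(root.rglob(pattern)):
            found.append((str(path.relative_to(root)), path.stat().st_size))
    return found

# Hypothetical usage; 'output' stands in for the notebook's OUTPUT_DIR
for name, size in list_outputs('output'):
    print(f"   {name}  ({size / 1024:.1f} KB)")
```

A missing directory yields an empty list rather than an error, so the cell can run even if earlier export steps were skipped.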

## Analysis Summary

This comprehensive analysis has provided:

1. ✅ **Data Quality Report** - Validated and cleaned survey data
2. ✅ **Descriptive Statistics** - Comprehensive overview of all metrics
3. ✅ **Statistical Analysis** - Correlations, comparisons, and significance testing
4. ✅ **10+ Publication-Ready Visualizations** - High-quality figures for reports
5. ✅ **Key Insights** - Automated identification of top/bottom performers
6. ✅ **Text Analysis** - Examination of open-ended feedback
7. ✅ **Actionable Recommendations** - Evidence-based suggestions for improvement

---

**For questions or further analysis, refer to the generated output files and figures.**