# Statistics and Visualization Assignment (100 points)

In this assignment, you'll explore the relationship between summary statistics and data visualization using a mystery dataset. You'll discover why it's crucial to look beyond basic statistical measures when analyzing data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Part 1: Data Loading and Initial Analysis (15 points)

First, load the mystery dataset and examine its basic properties. The dataset is provided in 'mystery_data.tsv'.

In [None]:
def load_and_display_data():
    """Load the mystery dataset and return the x/y columns of its 'dino' subset."""
    # Read the tab-separated file
    mystery_data = pd.read_csv('mystery_data.tsv', sep='\t')
    # Keep only the rows labeled 'dino' and only the x and y columns
    dino_data = mystery_data[mystery_data['dataset'] == 'dino'][['x', 'y']]
    return dino_data

# Load and display the data
mystery_data = load_and_display_data()
print("First few rows of our mystery dataset:")
mystery_data.head()
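
Examining basic properties can go a little further than `head()`. The sketch below is not part of the original assignment: it collects the row count, column names, and per-column missing-value counts. The helper name `basic_properties` and the small stand-in frame are hypothetical; in the notebook you would pass `mystery_data` instead.

```python
import pandas as pd


def basic_properties(df):
    """Return row count, column names, and missing-value counts for df."""
    return {
        "n_rows": len(df),
        "columns": list(df.columns),
        "n_missing": df.isna().sum().to_dict(),
    }


# Hypothetical stand-in for the real data; replace with mystery_data.
example = pd.DataFrame({"x": [1.0, 2.0, None], "y": [4.0, 5.0, 6.0]})
props = basic_properties(example)
print(props)
```

A non-zero entry in `n_missing` would be worth resolving before computing any statistics, since pandas skips missing values by default in `mean()` and `std()`.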

## Part 2: Summary Statistics (30 points)

Calculate key statistical measures for the dataset. Make sure to include:
- Mean and standard deviation for x and y
- Correlation between x and y
- Any other statistics you think might be informative

In [None]:
def calculate_statistics(data):
    """Calculate and display key statistics for the dataset."""
    stats = {}
    
    # Calculate basic summary statistics
    stats['mean_x'] = data['x'].mean()
    stats['mean_y'] = data['y'].mean()
    stats['std_x'] = data['x'].std()
    stats['std_y'] = data['y'].std()
    stats['correlation'] = data['x'].corr(data['y'])
    
    print("\nSummary Statistics:")
    print(f"X Mean: {stats['mean_x']:.2f}")
    print(f"Y Mean: {stats['mean_y']:.2f}")
    print(f"X Standard Deviation: {stats['std_x']:.2f}")
    print(f"Y Standard Deviation: {stats['std_y']:.2f}")
    print(f"Correlation: {stats['correlation']:.2f}")
    
    return stats

stats_results = calculate_statistics(mystery_data)
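
For the "any other statistics" item, one option is to add order statistics, which the mean and standard deviation do not capture. This is a minimal sketch, assuming a DataFrame with numeric `x` and `y` columns; the helper name `extra_statistics` and the sample values are illustrative only.

```python
import pandas as pd


def extra_statistics(data):
    """Return min, quartiles, median, and max for the x and y columns."""
    # quantile() with a list returns one row per requested quantile.
    summary = data[["x", "y"]].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
    summary.index = ["min", "q1", "median", "q3", "max"]
    return summary


# Illustrative values, not the assignment data.
sample = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]})
extra = extra_statistics(sample)
print(extra)
```

In the notebook, calling `extra_statistics(mystery_data)` would give a five-number summary to set beside the means and standard deviations.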

## Part 3: Written Analysis (20 points)

Based on the summary statistics you calculated above:
1. What type of relationship do you expect between the x and y variables?
2. Sketch what you think the data might look like when plotted.
3. What conclusions might you draw about this dataset based only on these statistics?
4. What additional statistical measures might be helpful?

Based on the summary statistics alone, one might expect:
1. No clear linear relationship between x and y, given the near-zero correlation: most likely a diffuse, unstructured cloud of points
2. A fairly wide spread in both variables, based on the standard deviations
3. Points centered around (54.26, 47.83), based on the means
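
A near-zero correlation only rules out a linear trend; it does not guarantee an unstructured cloud. The small worked example below uses made-up points on a parabola (not the assignment data), where y is fully determined by x yet the Pearson correlation is zero.

```python
import numpy as np

# Made-up points: y is an exact function of x (a parabola).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Pearson correlation is the off-diagonal entry of the 2x2 matrix.
r = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {r:.2f}")
```

The positive and negative halves of the parabola cancel in the covariance, which is why the coefficient vanishes even though the structure is obvious in a plot.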



## Part 4: Data Visualization (35 points)

Create a comprehensive visualization of the dataset. Your visualization should include:
- A scatter plot of x vs y
- Appropriate figure size and styling
- Clear titles and labels
- A text box showing the summary statistics

In [None]:
def create_visualization(data, stats):
    """Create and display a comprehensive visualization of the dataset."""
    plt.figure(figsize=(10, 8))
    
    # Create scatter plot
    plt.scatter(data['x'], data['y'], alpha=0.5, color='green')
    
    # Add title and labels
    plt.title("The Mystery Dataset Revealed!", fontsize=14)
    plt.xlabel("X Values")
    plt.ylabel("Y Values")
    
    # Add grid
    plt.grid(True, alpha=0.3)
    
    # Add stats textbox
    stats_text = "Summary Statistics:\n"
    stats_text += f"Mean X: {stats['mean_x']:.2f}\n"
    stats_text += f"Mean Y: {stats['mean_y']:.2f}\n"
    stats_text += f"Std X: {stats['std_x']:.2f}\n"
    stats_text += f"Std Y: {stats['std_y']:.2f}\n"
    stats_text += f"Correlation: {stats['correlation']:.2f}"
    
    plt.text(0.02, 0.98, stats_text,
             transform=plt.gca().transAxes,
             bbox=dict(facecolor='white', alpha=0.8),
             verticalalignment='top',
             fontsize=10)
    
    plt.tight_layout()
    plt.show()

create_visualization(mystery_data, stats_results)

## Extra Credit: Reflection (10 bonus points)

After completing your visualization:
1. Compare your expectations from Part 3 with the actual visualization
2. What surprised you most about this exercise?
3. What are the implications for data analysis practices?

The visualization reveals that summary statistics can be deeply misleading!
The data actually forms the shape of a dinosaur, even though its summary
statistics suggest nothing more than a simple point cloud.

This is a classic example of why we should always visualize our data,
not just compute summary statistics.
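
One way to make this concrete is to compute the same statistics for several point sets and compare them side by side. The sketch below uses two made-up shapes (a square and the same square rotated 45 degrees), labeled in a `dataset` column like the one used in the loading code. The helper name `summarize_by_group` and the points themselves are illustrative and are not taken from `mystery_data.tsv`.

```python
import numpy as np
import pandas as pd


def summarize_by_group(df, group_col="dataset"):
    """Return mean, std, and x-y correlation for each group in df."""
    rows = {}
    for name, group in df.groupby(group_col):
        rows[name] = {
            "mean_x": group["x"].mean(),
            "mean_y": group["y"].mean(),
            "std_x": group["x"].std(),
            "std_y": group["y"].std(),
            "correlation": group["x"].corr(group["y"]),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up shapes: corners of a square, and the same square rotated 45 degrees.
r2 = np.sqrt(2.0)
shapes = pd.DataFrame({
    "dataset": ["square"] * 4 + ["diamond"] * 4,
    "x": [1.0, 1.0, -1.0, -1.0, r2, -r2, 0.0, 0.0],
    "y": [1.0, -1.0, 1.0, -1.0, 0.0, 0.0, r2, -r2],
})
summary = summarize_by_group(shapes)
print(summary.round(2))
```

Both rows of `summary` agree on every statistic even though the two shapes look different when plotted. If the TSV file contains other values in its `dataset` column, the same helper could be applied to the unfiltered frame to check whether those groups share the dinosaur's statistics.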