# Datasaurus Assignment Solutions and Grading Guide

This notebook contains solutions and grading guidelines for each part of the assignment. Total points: 100

General grading philosophy:
- Reward both correct functionality and good coding practices
- Give partial credit for correct concepts even if implementation is flawed
- Value insightful analysis even if technical execution isn't perfect

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os

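# Get all files in the current directory.
# os.listdir returns names in arbitrary order, so sort them to keep dataset numbering reproducible.
fileNames = sorted(os.listdir('.'))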
fileNames

## Part 1: Data Loading and Processing (15 points)

### List Comprehension Solution (5 points)

Grading breakdown:
- 2 points: Correct use of list comprehension syntax
- 1.5 points: Correct use of startswith('mystery_data')
- 1.5 points: Correct use of endswith('.tsv')

Partial credit:
- 2 points if using the correct methods but with a regular for loop
- 1 point if filtering correctly for only one condition
- 0.5 points for an attempt with incorrect methods

In [None]:
mystery_fileNames = [fN for fN in fileNames if fN.startswith('mystery_data') and fN.endswith('.tsv')]
mystery_fileNames
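For reference when applying the partial-credit rule above, a for-loop submission that uses the correct string methods might look roughly like the sketch below. The file names in `example_names` are hypothetical and exist only to show that both versions select the same files.

```python
# Hypothetical directory listing, used only for illustration
example_names = ['mystery_data_1.tsv', 'mystery_data_2.tsv',
                 'mystery_data_notes.txt', 'other_data.tsv', 'README.md']

# For-loop version a student might submit (partial credit under the rubric)
selected_loop = []
for fN in example_names:
    if fN.startswith('mystery_data') and fN.endswith('.tsv'):
        selected_loop.append(fN)

# List-comprehension version (full credit)
selected_comp = [fN for fN in example_names
                 if fN.startswith('mystery_data') and fN.endswith('.tsv')]

print(selected_loop == selected_comp)
```

Both approaches filter on the same two conditions; the rubric distinguishes them only on syntax.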

### Loading Function Solution (5 points)

Grading breakdown:
- 3 points: Correct use of pd.read_csv()
- 2 points: Correct specification of sep='\t'

Partial credit:
- 2 points if they use read_csv without specifying the separator
- 1 point if they attempt to read the file but use the wrong method

In [None]:
def load_tsv_data(filename):
    """Load a TSV file and return a pandas DataFrame."""
    return pd.read_csv(filename, sep='\t')
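A quick way to recognize a submission that omitted the separator is to look at the shape of the resulting DataFrame. The sketch below uses a small, hypothetical in-memory TSV string (`tsv_text`) rather than the assignment files.

```python
import io
import pandas as pd

# Hypothetical two-column TSV content, used only for illustration
tsv_text = "x\ty\n1.0\t2.0\n3.0\t4.0\n"

with_sep = pd.read_csv(io.StringIO(tsv_text), sep='\t')
without_sep = pd.read_csv(io.StringIO(tsv_text))  # default separator is ','

print(with_sep.shape)      # two columns: x and y
print(without_sep.shape)   # one column holding the unsplit text
```

Without `sep='\t'`, each row is read as a single string column, so later code that accesses `df['x']` would raise a `KeyError`.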

### Loading Multiple Datasets Solution (5 points)

Grading breakdown:
- 2 points: Correct initialization of empty list
- 3 points: Correct loading and appending in loop

Partial credit:
- 2 points if they load but don't store correctly
- 1 point if they attempt iteration but don't load

In [None]:
datasets = []  # will hold all DataFrames

for filename in mystery_fileNames:
    df = load_tsv_data(filename)
    datasets.append(df)

print(f"Loaded {len(datasets)} datasets")
print(f"Each dataset shape: {datasets[0].shape}")

## Part 2: Summary Statistics (30 points)

Grading breakdown:
- 10 points: Correct list comprehension structure with enumerate
- 5 points: Correct calculation of means
- 5 points: Correct calculation of standard deviations
- 5 points: Correct calculation of correlation
- 5 points: Correct calculation of min/max values

Partial credit:
- 15 points if using a for loop instead of a list comprehension
- Deduct 2 points for each missing statistic
- 5 points if structure is correct but calculations are wrong
- 10 points if calculations are correct but structure is wrong

In [None]:
all_stats = [
    {
        'dataset': f'dataset_{i+1}',
        'mean_x': df['x'].mean(),
        'mean_y': df['y'].mean(),
        'std_x': df['x'].std(),
        'std_y': df['y'].std(),
        'correlation': df['x'].corr(df['y']),
        'min_x': df['x'].min(),
        'max_x': df['x'].max(),
        'min_y': df['y'].min(),
        'max_y': df['y'].max()
    }
    for i, df in enumerate(datasets)
]

stats_df = pd.DataFrame(all_stats).round(3)
display(stats_df)
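When a student's standard deviations differ slightly from the solution values, one possible cause is the choice of estimator: pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), while NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below illustrates the gap on a small, hypothetical series; whether to deduct points for this is left to the grader.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
s = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = s.std()                    # pandas default: ddof=1
population_std = np.std(s.to_numpy())   # NumPy default: ddof=0

print(round(sample_std, 4), round(population_std, 4))
```

Passing `ddof=0` to `Series.std()` reproduces the NumPy value, which can help confirm the source of a discrepancy.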

## Part 3: Written Analysis (10 points)

Based on the summary statistics you calculated above:
1. What type of relationship do you expect between x and y variables?
2. Sketch what you think the data might look like when plotted.
3. What conclusions might you draw about this dataset based only on these statistics?
4. What additional statistical measures might be helpful?


**Example of full-credit response:**
1. Based on the near-zero correlations (around -0.06), we would expect little to no linear relationship between x and y variables. The data might appear as a random cloud of points.

2. The similar means, standard deviations, and ranges across all datasets suggest these are very similar or possibly identical datasets. The x values consistently center around 54 with std dev ~17, while y values center around 48 with std dev ~27.

3. Additional helpful measures might include:
   - Quartiles or percentiles to understand distribution shape
   - Tests for normality
   - Non-linear correlation measures
   - Measures of modality

**Grading breakdown:**
- 4 points: Correct interpretation of correlation
- 3 points: Recognition of similarity across datasets
- 3 points: Thoughtful suggestions for additional measures

**Partial credit:**
- Deduct 1-2 points for shallow analysis
- Minimum 2 points for any reasonable attempt
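To judge whether a student's suggested additional measures are sensible, it can help to see what two of them from the example response look like in pandas. This is a minimal sketch on hypothetical data (`df_demo`), not the assignment datasets. The rank-based correlation is computed here as the Pearson correlation of the ranks, which is the definition of Spearman's coefficient.

```python
import pandas as pd

# Hypothetical data with a monotonic but non-linear relationship
df_demo = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df_demo['y'] = df_demo['x'] ** 3

# Quartiles describe the shape of the distribution beyond mean and std
quartiles = df_demo['x'].quantile([0.25, 0.5, 0.75])

# Pearson measures linear association only
pearson = df_demo['x'].corr(df_demo['y'])

# Spearman correlation: Pearson correlation applied to the ranks
spearman = df_demo['x'].rank().corr(df_demo['y'].rank())

print(quartiles.tolist())
print(round(pearson, 3), round(spearman, 3))
```

Here the rank-based measure reports a perfect monotonic relationship while the Pearson value stays below 1, the kind of difference a student might cite when proposing non-linear correlation measures.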

## Part 4: Data Visualization (35 points)

Grading breakdown:
- 5 points: Correct layout calculation
- 5 points: Proper figure sizing
- 5 points: Global min/max calculation
- 10 points: Correct subplot creation and iteration
- 5 points: Adding statistics to plots
- 5 points: Consistent scaling across plots

Partial credit:
- Deduct 5 points if scales aren't consistent
- Deduct 5 points if statistics are missing
- Deduct 3 points if grid/labels are missing
- Half credit if plots are created but poorly formatted
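The "correct layout calculation" item refers to computing enough rows to hold every dataset. The solution uses integer ceiling division; a student who uses `math.ceil` instead arrives at the same result. The helper name `rows_needed` below is hypothetical and used only for this sketch.

```python
import math

def rows_needed(n_items, n_cols):
    """Number of grid rows required to place n_items in n_cols columns."""
    return (n_items + n_cols - 1) // n_cols

for n_items in (1, 3, 4, 12, 13):
    # Integer ceiling division agrees with rounding the quotient up
    assert rows_needed(n_items, 3) == math.ceil(n_items / 3)
    print(n_items, rows_needed(n_items, 3))
```

Either formulation should receive full credit for the layout item, since both avoid dropping the final partial row.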

In [None]:
# Calculate layout
n_datasets = len(datasets)
n_cols = 3
n_rows = (n_datasets + n_cols - 1) // n_cols

# Create figure
plt.figure(figsize=(15, 5 * n_rows))

# Calculate global min/max
all_x = pd.concat([df['x'] for df in datasets])
all_y = pd.concat([df['y'] for df in datasets])
x_min, x_max = all_x.min(), all_x.max()
y_min, y_max = all_y.min(), all_y.max()

# Create subplots
for i, df in enumerate(datasets, 1):
    plt.subplot(n_rows, n_cols, i)

    # Create scatter plot
    plt.scatter(df['x'], df['y'], alpha=0.5)

    # Add statistics text box
    stats = stats_df.iloc[i-1]
    stats_text = f"μx={stats['mean_x']:.1f}\n"
    stats_text += f"μy={stats['mean_y']:.1f}\n"
    stats_text += f"ρ={stats['correlation']:.2f}"

    plt.text(0.05, 0.95, stats_text,
             transform=plt.gca().transAxes,
             bbox=dict(facecolor='white', alpha=0.8),
             verticalalignment='top')

    # Set consistent scales
    plt.xlim(x_min - 1, x_max + 1)
    plt.ylim(y_min - 1, y_max + 1)

    plt.title(f'Dataset {i}')
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
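Some students may reach consistent scaling through matplotlib's object-oriented interface rather than by computing global limits. The sketch below is one such alternative, not the reference solution: it relies on `sharex=True` and `sharey=True` in `plt.subplots` and uses hypothetical stand-in data (`demo_datasets`).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for the loaded DataFrames, used only for illustration
demo_datasets = [
    pd.DataFrame({'x': [0, 10, 20], 'y': [5, 15, 25]}),
    pd.DataFrame({'x': [40, 50, 60], 'y': [0, 50, 100]}),
]

demo_cols = 3
demo_rows = (len(demo_datasets) + demo_cols - 1) // demo_cols

# Shared axes keep every panel on the same x and y limits
fig, axes = plt.subplots(demo_rows, demo_cols,
                         figsize=(5 * demo_cols, 5 * demo_rows),
                         sharex=True, sharey=True, squeeze=False)

for i, (ax, df_demo) in enumerate(zip(axes.ravel(), demo_datasets), 1):
    ax.scatter(df_demo['x'], df_demo['y'], alpha=0.5)
    ax.set_title(f'Dataset {i}')
    ax.grid(True, alpha=0.3)

# Hide panels that have no dataset assigned
for ax in axes.ravel()[len(demo_datasets):]:
    ax.set_visible(False)

fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. A submission along these lines satisfies the consistent-scaling criterion even though it never computes global min/max values explicitly, so graders may need to decide how to award the separate "Global min/max calculation" points.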

## Reflection (10 points)

**Example of full-credit response:**

1. The visualizations reveal that despite having nearly identical summary statistics, each dataset forms a completely different pattern. This is a striking demonstration of why visualization is crucial in data analysis.

2. Consistent scaling is essential here because:
   - It allows direct comparison between datasets
   - It prevents misleading interpretations
   - It reveals the true relative sizes of patterns

3. The most surprising aspect is how such different patterns can share the same summary statistics. This demonstrates that summary statistics alone can hide important patterns in data.

4. Implications for data analysis:
   - Always visualize data before drawing conclusions
   - Don't rely solely on summary statistics
   - Consider multiple approaches to understanding data
   - Be aware that different visualization techniques might reveal different patterns

**Grading breakdown:**
- 3 points: Recognition of the disconnect between statistics and visualization
- 3 points: Understanding of scaling importance
- 4 points: Thoughtful reflection on implications

**Partial credit:**
- Half credit for surface-level observations without deeper insight
- Deduct 2 points for each major point that is missing
- Full credit requires specific examples or insights
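The disconnect between summary statistics and visual pattern described in the example response can be reproduced on a toy scale. The sketch below uses two constructed point sets (a square and the same square rotated by 45 degrees), not the assignment data, and the helper `summarize` is hypothetical. Both sets share the same means, standard deviations, and correlation even though the points sit in different places.

```python
import numpy as np
import pandas as pd

r = np.sqrt(2)

# Constructed point sets, used only for illustration
square = pd.DataFrame({'x': [1, 1, -1, -1], 'y': [1, -1, 1, -1]}, dtype=float)
diamond = pd.DataFrame({'x': [r, 0, -r, 0], 'y': [0, r, 0, -r]})

def summarize(df):
    """Return the same summary statistics used in Part 2 (without min/max)."""
    return {
        'mean_x': df['x'].mean(), 'mean_y': df['y'].mean(),
        'std_x': df['x'].std(), 'std_y': df['y'].std(),
        'correlation': df['x'].corr(df['y']),
    }

toy_stats = pd.DataFrame([summarize(square), summarize(diamond)],
                         index=['square', 'diamond']).round(3)
print(toy_stats)
```

A student response that offers a concrete illustration of this kind would meet the "specific examples or insights" requirement for full credit.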