# Assignment 1
# Statistics and Visualization (100 points)

In this assignment, you'll explore a collection of datasets that demonstrate why visualization is crucial in data analysis and how relying solely on summary statistics can be misleading. You'll work with multiple files containing x-y coordinate data that share similar statistical properties but tell very different visual stories.

## Part 1: Data Loading and Processing (15 points)

In your working directory, you have several TSV (tab-separated values) files. It's common in data analysis to work with several related files that share a name stem. You can get a list of all files in your working directory using the os library. Run the cell below and take a look at the files.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import os

# Get all files in current directory
fileNames = os.listdir('.')
fileNames

['.config',
 'mystery_data04.tsv',
 'mystery_data07.tsv',
 'mystery_data03.tsv',
 'mystery_data12.tsv',
 'mystery_data05.tsv',
 'mystery_data10.tsv',
 'mystery_data02.tsv',
 'mystery_data09.tsv',
 'mystery_data11.tsv',
 'mystery_data06.tsv',
 'mystery_data01.tsv',
 'netflix_cleaned_data.csv',
 'mystery_data13.tsv',
 'mystery_data.tsv',
 'mystery_data08.tsv',
 'sample_data']

We want to work with all the files whose names begin with 'mystery' and end with '.tsv'. Note that all of those file names are strings in a list called fileNames. Let's use what we know about string methods and list comprehensions to keep just the file names we want.

Hint: We can use string methods such as .startswith() and .endswith() in the following list comprehension.
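
As a rough illustration of the hint, the sketch below filters a made-up list of names with both string methods. The names in `example_files` and the variable `report_csvs` are hypothetical and are not part of the assignment data.

```python
# Hypothetical file names, used only for illustration
example_files = ["report_2022.csv", "notes.txt", "report_2021.csv", "report_draft.txt"]

# Keep names that start with "report" AND end with ".csv".
# sorted() gives a reproducible order, since directory listings
# are not guaranteed to come back in any particular order.
report_csvs = sorted(
    name for name in example_files
    if name.startswith("report") and name.endswith(".csv")
)
print(report_csvs)
```

Note that os.listdir returns names in arbitrary order, so sorting the filtered list is one way to make later numbering (dataset_1, dataset_2, ...) line up with the file names.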

In [None]:
#fill in the blank to generate a list with just the desired fileNames
mystery_fileNames = [fN for fN in fileNames if _____ and ________]
mystery_fileNames

Next, let's write a function that loads a TSV file and returns a pandas DataFrame.
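
For orientation, here is a minimal sketch of reading tab-separated text with pandas. It reads from an in-memory string rather than a file, and the contents of `tsv_text` are made up; the only point is that `pd.read_csv` accepts a `sep` argument.

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for a file on disk
tsv_text = "x\ty\n1.0\t2.0\n3.0\t4.5\n"

# sep="\t" tells read_csv that columns are separated by tabs
example_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(example_df.shape)
```

The same call works with a file name in place of the `io.StringIO` object.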

In [3]:
def load_tsv_data(filename):
    """
    Load a TSV file and return a pandas DataFrame.

    Parameters:
    -----------
    filename : str
        Name of TSV file to read

    Returns:
    --------
    pandas.DataFrame
        DataFrame containing data from TSV file

    Example:
    --------
    >>> df = load_tsv_data('mystery_data01.tsv')
    >>> print(df.head())
       x      y
    0  54.3  47.8
    1  53.1  48.9
    """
    # YOUR CODE HERE
    raise NotImplementedError()

Now we need to use our function to load all the relevant datasets from the mystery_fileNames list.

Note: In many real-world data science applications where the source file names contain
important metadata or information, we would use a dictionary to maintain the connection
between the data and its source file. For example:
    datasets = {'sales_Minnesota': df1, 'sales_Michigan': df2}

In contexts where the sequence matters, such as monthly or yearly datasets, or when you're doing batch processing, using a list data structure is preferred for ease of iteration.

For this exercise, since we're just working with mystery datasets, we'll use a list
for simpler iteration through our data.
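
To make the trade-off concrete, the sketch below stores the same two tiny DataFrames both ways. The names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical source names and tiny made-up DataFrames
names = ["sales_Minnesota", "sales_Michigan"]
frames = [
    pd.DataFrame({"x": [1, 2], "y": [3, 4]}),
    pd.DataFrame({"x": [5, 6], "y": [7, 8]}),
]

# Dictionary: each DataFrame stays attached to its source name
by_name = dict(zip(names, frames))

# List: only the position identifies each DataFrame
as_list = list(by_name.values())

for name, df in by_name.items():
    print(name, df.shape)
```

With the dictionary you can ask for `by_name["sales_Michigan"]` directly; with the list you would need to remember that it sits at index 1.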



In [None]:
datasets = []  # this empty list will hold all our DataFrames

# Load each dataset into our list
for filename in mystery_fileNames:
    # YOUR CODE HERE
    raise NotImplementedError()

# Let's see what we loaded
print(f"Loaded {len(datasets)} datasets")
print(f"Each dataset shape: {datasets[0].shape}")  # checking first dataset's dimensions

## Part 2: Summary Statistics (30 points)

Let's look at key statistical measures for each dataset including:
- Mean and standard deviation for x and y coordinates
- Correlation between x and y
- Min and max values to understand the range of our data

To do this, create a list comprehension that calculates these statistics for each dataset.
Hint: Each item in the list should be a dictionary containing the statistics for one dataset.
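
The pattern in the hint can be sketched on toy data as follows. The DataFrames in `toy_frames` are made up, and only a few of the requested statistics are shown; this is an illustration of the shape of the result, not a full solution.

```python
import pandas as pd

# Two tiny made-up datasets
toy_frames = [
    pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]}),
    pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [3.0, 2.0, 1.0]}),
]

# One dictionary per dataset; enumerate(..., start=1) numbers them from 1
toy_stats = [
    {
        "dataset": f"toy_{i}",
        "mean_x": df["x"].mean(),
        "std_x": df["x"].std(),  # pandas uses the sample std (ddof=1) by default
        "correlation": df["x"].corr(df["y"]),  # Pearson correlation
    }
    for i, df in enumerate(toy_frames, start=1)
]
print(pd.DataFrame(toy_stats))
```

Because every dictionary has the same keys, passing the list to `pd.DataFrame` yields one row per dataset and one column per statistic.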

In [None]:
# Your code should create a list where each item is a dictionary of statistics for one dataset
# Use enumerate(datasets) to number each dataset as dataset_1, dataset_2, etc.

# YOUR CODE HERE: replace the empty list with your list comprehension
all_stats = []
raise NotImplementedError()

"""
Expected output format:
[
   {
       'dataset': 'dataset_1',
       'mean_x': 54.266,
       'mean_y': 47.835,
       'std_x': 16.762,
       'std_y': 26.935,
       'correlation': -0.064,
       'min_x': 15.34,
       'max_x': 91.638,
       'min_y': 0.0,
       'max_y': 105.373
   },
   {
       'dataset': 'dataset_2',
       ...
   },
   ...
]
"""

# Convert to DataFrame and round for cleaner display
stats_df = pd.DataFrame(all_stats).round(3)

# Display the results
print("Summary Statistics for all datasets:")
display(stats_df)



## Part 3: Written Analysis (10 points)

Based on the summary statistics you calculated above:
1. What type of relationship do you expect between the x and y variables?

2. What conclusions might you draw about these datasets based only on these statistics?

3. What additional statistical measures might be helpful?

YOUR ANSWER HERE

## Part 4: Data Visualization (35 points)

Now let's visualize all our datasets!
We want to create a grid of scatter plots where:
- Each dataset gets its own subplot which is a scatterplot of x vs y
- All plots should have the same scale (hint: look at global min/max across all data)
- Each subplot should include the dataset's statistics in a text box
- The overall figure should be titled and well-labeled
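
The requirements above can be sketched on synthetic data as shown below. Everything here is an assumption made for illustration: the random `toy_frames`, the single-row layout, and the choice to show only the mean of x in the text box.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data with deliberately different ranges
rng = np.random.default_rng(0)
toy_frames = [
    pd.DataFrame({"x": rng.uniform(0, s, 30), "y": rng.uniform(0, s, 30)})
    for s in (10, 50, 100)
]

# Global limits so every subplot uses the same scale
x_min = min(df["x"].min() for df in toy_frames)
x_max = max(df["x"].max() for df in toy_frames)
y_min = min(df["y"].min() for df in toy_frames)
y_max = max(df["y"].max() for df in toy_frames)

fig, axes = plt.subplots(1, len(toy_frames), figsize=(12, 4))
for i, (ax, df) in enumerate(zip(axes, toy_frames), start=1):
    ax.scatter(df["x"], df["y"], s=10)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_title(f"toy_{i}")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    # Text box placed in axes coordinates (0-1), top-left corner
    ax.text(
        0.05, 0.95, f"mean x = {df['x'].mean():.1f}",
        transform=ax.transAxes, va="top",
        bbox=dict(boxstyle="round", facecolor="white", alpha=0.8),
    )

fig.suptitle("Toy datasets on a shared scale")
fig.tight_layout()
```

With shared limits, the first toy dataset visibly occupies a small corner of its panel, which would be hidden if each subplot chose its own scale. When the grid has more than one row, `plt.subplots` returns a 2-D array of axes, and `axes.flat` is one way to iterate over it.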

In [None]:
# First, let's determine our plot layout
n_datasets = len(datasets)
n_cols = 3  # you might want to adjust this
n_rows = (n_datasets + n_cols - 1) // n_cols

# YOUR CODE HERE: Create the figure and subplots
raise NotImplementedError()

"""
Expected steps:
1. Create figure and subplots with appropriate size
2. Find global min/max values for consistent scaling
3. For each dataset:
   - Create scatter plot
   - Add relevant statistics
   - Set titles and labels
4. Adjust layout and display
"""

plt.tight_layout()
plt.show()

## Reflection (10 points)

After completing your visualization:
1. How do these visualizations compare to what you expected from the statistics?

2. Why might it be important to use the same scale for all plots?

3. What surprised you most about this exercise? What story do these visualizations tell that the statistics didn't?

4. What are the implications for data analysis practices?

YOUR ANSWER HERE