# Week 2, Day 2: File I/O and Data Processing

## Programming Concept: File Input/Output

Today we'll learn about **File I/O** (Input/Output): how to read data from files and write data to files. This is crucial for working with real-world datasets and for persisting your program's results.

### Key Programming Concepts:
- **File handles**: Objects that represent connections to files
- **Context managers**: Using `with` statements for safe file handling
- **Text vs binary modes**: Different ways to read/write file content
- **File paths**: Absolute vs relative paths to locate files
- **Exception handling**: Dealing with file errors gracefully
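
The concepts listed above can be seen together in one short sketch. This is only an illustration: the file names `notes.txt` and `missing.txt` are made up, and everything happens in a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

# File paths: a bare file name is relative to the current working directory
relative = Path("participants.csv")
absolute = relative.resolve()  # the same location, spelled out from the root

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "notes.txt"  # hypothetical file name

    # Context manager + text mode: strings in, strings out
    with open(path, "w", encoding="utf-8") as file:
        file.write("first line\nsecond line\n")
    handle_closed = file.closed  # the handle is closed once the with-block ends

    with open(path, "r", encoding="utf-8") as file:
        text = file.read()

    # Binary mode: raw bytes instead of decoded text
    with open(path, "rb") as file:
        raw = file.read()

    # Exception handling: opening a missing file raises FileNotFoundError
    try:
        with open(Path(tmp_dir) / "missing.txt", encoding="utf-8") as file:
            missing = file.read()
    except FileNotFoundError:
        missing = None

print(type(text).__name__, type(raw).__name__, handle_closed, missing)
print(relative.is_absolute(), absolute.is_absolute())
```

Passing `encoding="utf-8"` explicitly is optional, but it keeps the behaviour the same across operating systems.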

### Why File I/O Matters:
- **Data persistence**: Save your work and results
- **Large datasets**: Process data too big to type manually
- **Automation**: Read configuration files, process batches of data
- **Integration**: Exchange data with other programs and systems
- **Real-world applications**: Almost all software works with files

## Exercise 1: Basic File Reading

Let's start by creating and reading a simple text file: a small CSV of participant data.

In [1]:
# First, let's create a sample data file to work with
sample_data = """Participant_ID,Age,Score,Group
P001,25,85,A
P002,32,92,B
P003,28,78,A
P004,29,96,B
P005,31,88,A"""

# Write this data to a file
with open('participants.csv', 'w') as file:
    file.write(sample_data)

print("Sample data file created: participants.csv")

Sample data file created: participants.csv


In [4]:
# Task 1a: Read the entire file
# Open the file and read all its contents
# Print the contents to see what we're working with
with open('participants.csv') as file:
    contents = file.read()
    print(contents)


Participant_ID,Age,Score,Group
P001,25,85,A
P002,32,92,B
P003,28,78,A
P004,29,96,B
P005,31,88,A


In [8]:
# Task 1b: Read file line by line
# Read the file one line at a time
# Print each line with its line number
with open('participants.csv') as file:
    for line_number, line in enumerate(file, start=1):
        print(f"Line {line_number}: {line.strip()}")


Line 1: Participant_ID,Age,Score,Group
Line 2: P001,25,85,A
Line 3: P002,32,92,B
Line 4: P003,28,78,A
Line 5: P004,29,96,B
Line 6: P005,31,88,A


In [None]:
# Task 1c: Parse CSV data
# Read the file and convert it to a list of dictionaries
# Each participant should be a dictionary with keys: Participant_ID, Age, Score, Group
# Hint: The first line is the header with column names
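
One possible approach to Task 1c uses the standard library's `csv.DictReader`, which treats the first line as the header. In this sketch `io.StringIO` stands in for an open file so the example is self-contained; a handle from `open('participants.csv', newline='')` can be passed to `DictReader` in the same way.

```python
import csv
import io

# Same layout as participants.csv; io.StringIO stands in for an open file
csv_text = "Participant_ID,Age,Score,Group\nP001,25,85,A\nP002,32,92,B\n"

participants = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # DictReader yields every value as a string, so convert the numbers
    row["Age"] = int(row["Age"])
    row["Score"] = int(row["Score"])
    participants.append(row)

print(participants[0])
```

Splitting each line on commas by hand also works for this file, but it breaks as soon as a field contains a quoted comma, which is why the `csv` module is usually the safer choice.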



## Exercise 2: Writing Files and Data Processing

Now let's process our data and write results to new files.

In [None]:
# Task 2a: Calculate and save statistics
# From the participant data, calculate:
# - Average age
# - Average score
# - Average score by group
# Write these statistics to a new file called 'statistics.txt'
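
A minimal sketch of Task 2a, assuming the participants have already been parsed into dictionaries with numeric `Age` and `Score` values. The output wording and the one-decimal formatting are arbitrary choices, and the file is written to a temporary directory here rather than next to the notebook.

```python
import tempfile
from collections import defaultdict
from pathlib import Path
from statistics import mean

participants = [
    {"Participant_ID": "P001", "Age": 25, "Score": 85, "Group": "A"},
    {"Participant_ID": "P002", "Age": 32, "Score": 92, "Group": "B"},
    {"Participant_ID": "P003", "Age": 28, "Score": 78, "Group": "A"},
    {"Participant_ID": "P004", "Age": 29, "Score": 96, "Group": "B"},
    {"Participant_ID": "P005", "Age": 31, "Score": 88, "Group": "A"},
]

# Collect the scores belonging to each group
scores_by_group = defaultdict(list)
for p in participants:
    scores_by_group[p["Group"]].append(p["Score"])

average_age = mean(p["Age"] for p in participants)
average_score = mean(p["Score"] for p in participants)
group_averages = {g: mean(s) for g, s in sorted(scores_by_group.items())}

lines = [f"Average age: {average_age:.1f}", f"Average score: {average_score:.1f}"]
lines += [f"Average score, group {g}: {avg:.1f}" for g, avg in group_averages.items()]

with tempfile.TemporaryDirectory() as tmp_dir:
    stats_path = Path(tmp_dir) / "statistics.txt"
    with open(stats_path, "w", encoding="utf-8") as file:
        file.write("\n".join(lines) + "\n")
    report = stats_path.read_text(encoding="utf-8")

print(report)
```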



In [None]:
# Task 2b: Filter and save high performers
# Create a new CSV file with only participants who scored 85 or higher
# Save it as 'high_performers.csv' with the same format as the original
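
Filtering and writing can be sketched with `csv.DictWriter`, which writes the header and rows in a fixed column order. The rows below are a three-participant subset of the sample data, kept as strings the way `csv.DictReader` would return them, and the output goes to a temporary directory.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"Participant_ID": "P001", "Age": "25", "Score": "85", "Group": "A"},
    {"Participant_ID": "P002", "Age": "32", "Score": "92", "Group": "B"},
    {"Participant_ID": "P003", "Age": "28", "Score": "78", "Group": "A"},
]
fieldnames = ["Participant_ID", "Age", "Score", "Group"]

# Keep only participants who scored 85 or higher
high_performers = [row for row in rows if int(row["Score"]) >= 85]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "high_performers.csv"
    # newline="" lets the csv module control line endings itself
    with open(out_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(high_performers)

    # Read the file back to check what was written
    with open(out_path, newline="", encoding="utf-8") as file:
        written = list(csv.reader(file))

print(written)
```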



In [None]:
# Task 2c: Append new data
# Add three new participants to the original file:
# P006,27,91,B
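# P007,33,79,A
# P008,26,94,B
# Use append mode to add them without overwriting existing data
# Note: participants.csv was written without a trailing newline, so begin the appended text with '\n'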
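
A sketch of appending safely. Mode `'a'` adds to the end of the file, but if the last existing line has no trailing newline (as with the sample data written earlier), the first new row would be glued onto it. The check below guards against that; the two-line stand-in file is just for illustration.

```python
import tempfile
from pathlib import Path

new_rows = ["P006,27,91,B", "P007,33,79,A", "P008,26,94,B"]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "participants.csv"
    # Stand-in for the original file; note there is no trailing newline
    csv_path.write_text("Participant_ID,Age,Score,Group\nP005,31,88,A", encoding="utf-8")

    needs_newline = not csv_path.read_text(encoding="utf-8").endswith("\n")

    # Mode 'a' places every write at the end of the existing content
    with open(csv_path, "a", encoding="utf-8") as file:
        if needs_newline:
            file.write("\n")
        for row in new_rows:
            file.write(row + "\n")

    appended_lines = csv_path.read_text(encoding="utf-8").splitlines()

print(appended_lines)
```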



## Exercise 3: Working with Different File Formats

Real-world data comes in many formats. Let's work with JSON and handle different file structures.

In [None]:
import json

# Create sample JSON data
experiment_data = {
    "experiment_id": "EXP_001",
    "date": "2025-01-08",
    "participants": [
        {"id": "P001", "condition": "control", "response_time": 245, "accuracy": 0.92},
        {"id": "P002", "condition": "treatment", "response_time": 198, "accuracy": 0.95},
        {"id": "P003", "condition": "control", "response_time": 267, "accuracy": 0.88},
        {"id": "P004", "condition": "treatment", "response_time": 223, "accuracy": 0.93}
    ],
    "settings": {
        "trials_per_participant": 50,
        "stimulus_duration": 200,
        "inter_trial_interval": 500
    }
}

# Save to JSON file
with open('experiment.json', 'w') as file:
    json.dump(experiment_data, file, indent=2)

print("JSON experiment data created")

In [None]:
# Task 3a: Load and explore JSON data
# Read the JSON file and explore its structure
# Print the experiment ID, date, and number of participants
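
Loading JSON mirrors saving it: `json.load` takes an open file and returns ordinary Python dictionaries and lists. The sketch below writes a trimmed-down version of the experiment data to a temporary file first so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path

sample = {
    "experiment_id": "EXP_001",
    "date": "2025-01-08",
    "participants": [{"id": "P001"}, {"id": "P002"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "experiment.json"
    with open(json_path, "w", encoding="utf-8") as file:
        json.dump(sample, file, indent=2)

    # json.load parses the whole file into nested dicts and lists
    with open(json_path, encoding="utf-8") as file:
        experiment = json.load(file)

top_level_keys = list(experiment.keys())
print(experiment["experiment_id"], experiment["date"], len(experiment["participants"]))
print(top_level_keys)
```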



In [None]:
# Task 3b: Analyze condition differences
# Calculate average response time and accuracy for each condition
# Save the results to a new JSON file called 'condition_analysis.json'
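
One way to approach Task 3b is to group participants by condition and then average each group. The key names in `analysis` (`n`, `mean_response_time`, `mean_accuracy`) are my own choice, not something the exercise prescribes. The sketch produces the JSON text with `json.dumps`; `json.dump(analysis, file, indent=2)` writes the same text to an open file.

```python
import json
from collections import defaultdict
from statistics import mean

participants = [
    {"id": "P001", "condition": "control", "response_time": 245, "accuracy": 0.92},
    {"id": "P002", "condition": "treatment", "response_time": 198, "accuracy": 0.95},
    {"id": "P003", "condition": "control", "response_time": 267, "accuracy": 0.88},
    {"id": "P004", "condition": "treatment", "response_time": 223, "accuracy": 0.93},
]

# Group the participant records by condition
by_condition = defaultdict(list)
for p in participants:
    by_condition[p["condition"]].append(p)

analysis = {
    condition: {
        "n": len(group),
        "mean_response_time": mean(p["response_time"] for p in group),
        "mean_accuracy": round(mean(p["accuracy"] for p in group), 3),
    }
    for condition, group in by_condition.items()
}

analysis_json = json.dumps(analysis, indent=2)
print(analysis_json)
```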



In [None]:
# Task 3c: Convert JSON to CSV
# Extract the participant data and convert it to CSV format
# Save as 'experiment_participants.csv'
# Include columns: id, condition, response_time, accuracy
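
Because each participant in the JSON data is already a dictionary, `csv.DictWriter` can write the records directly. In this sketch an `io.StringIO` buffer stands in for `open('experiment_participants.csv', 'w', newline='')` so the result can be inspected without touching the disk.

```python
import csv
import io

participants = [
    {"id": "P001", "condition": "control", "response_time": 245, "accuracy": 0.92},
    {"id": "P002", "condition": "treatment", "response_time": 198, "accuracy": 0.95},
]
fieldnames = ["id", "condition", "response_time", "accuracy"]

buffer = io.StringIO()  # stands in for a file opened in write mode
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(participants)

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```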



## Exercise 4: Error Handling and File Management

Real applications need to handle file errors gracefully and manage file operations safely.

In [None]:
# Task 4a: Safe file reading with error handling
# Write a function that safely reads a file and returns its contents
# If the file doesn't exist, return None and print a helpful message
# Test it with both existing and non-existing files

def safe_read_file(filename):
    # Your code here
    pass

# Test the function


In [None]:
# Task 4b: File existence checking
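# Write a function that checks whether a config file exists before trying to read it
# If it exists, read and return its contents; if not, write default_config to the file and return it
# The on-disk format is up to you (JSON suits a dictionary like the one below)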

import os

def get_or_create_config(filename, default_config):
    # Your code here
    pass

# Test with a config file
default_settings = {"theme": "dark", "font_size": 12, "auto_save": True}
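
A sketch of `get_or_create_config`, assuming the config is stored as JSON (the exercise does not fix a format). Checking `os.path.exists` and then opening the file is not atomic, which is fine for a single-user script but worth knowing about if several programs share the file.

```python
import json
import os
import tempfile

def get_or_create_config(filename, default_config):
    """Load the JSON config if present; otherwise create it from the defaults."""
    if os.path.exists(filename):
        with open(filename, encoding="utf-8") as file:
            return json.load(file)
    with open(filename, "w", encoding="utf-8") as file:
        json.dump(default_config, file, indent=2)
    return dict(default_config)  # a copy, so callers cannot mutate the defaults

default_settings = {"theme": "dark", "font_size": 12, "auto_save": True}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    first = get_or_create_config(config_path, default_settings)      # creates the file
    created = os.path.exists(config_path)
    second = get_or_create_config(config_path, {"theme": "light"})   # reads the existing file

print(first, created, second)
```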


In [None]:
# Task 4c: Backup and versioning
# Write a function that creates a backup of a file before modifying it
# The backup should have a timestamp in the filename

from datetime import datetime

def backup_and_write(filename, new_content):
    # Your code here
    pass

# Test the backup function
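
One way `backup_and_write` could work: copy the existing file to a timestamped name with `shutil.copy2`, then overwrite the original. The `name_YYYYMMDD_HHMMSS.ext` naming scheme and the return value (the backup path, or `None` if there was nothing to back up) are assumptions made for this sketch.

```python
import os
import shutil
import tempfile
from datetime import datetime

def backup_and_write(filename, new_content):
    """Back up filename (if it exists) under a timestamped name, then overwrite it."""
    backup_name = None
    if os.path.exists(filename):
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        root, ext = os.path.splitext(filename)
        backup_name = f"{root}_{timestamp}{ext}"
        shutil.copy2(filename, backup_name)  # copies content and metadata
    with open(filename, "w", encoding="utf-8") as file:
        file.write(new_content)
    return backup_name

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "results.txt")
    first_backup = backup_and_write(target, "version 1")   # nothing to back up yet
    second_backup = backup_and_write(target, "version 2")  # backs up "version 1"
    with open(second_backup, encoding="utf-8") as file:
        backup_text = file.read()
    with open(target, encoding="utf-8") as file:
        current_text = file.read()

print(first_backup, backup_text, current_text)
```

With a one-second timestamp resolution, two backups made within the same second would share a name and the later one would overwrite the earlier; adding microseconds (`%f`) is one way around that.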


## Exercise 5: Data Pipeline Project

Let's build a complete data processing pipeline that reads raw data, processes it, and outputs results.

In [None]:
# Create sample raw data files to process
raw_data_files = {
    'day1_results.txt': "P001:85:23.5\nP002:92:18.7\nP003:78:28.1",
    'day2_results.txt': "P001:88:22.1\nP002:89:19.3\nP003:82:26.8",
    'day3_results.txt': "P001:91:21.0\nP002:94:17.9\nP003:85:25.2"
}

# Create the raw data files
for filename, content in raw_data_files.items():
    with open(filename, 'w') as file:
        file.write(content)

print("Raw data files created")

In [None]:
# Task 5a: Data parsing function
# Write a function that reads a raw data file and returns structured data
# Format: participant_id:score:time (separated by colons)
# Return a list of dictionaries with keys: participant, score, time

def parse_raw_data(filename):
    # Your code here
    pass

# Test with one file
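
A sketch of `parse_raw_data` for the colon-separated format described above. It skips blank lines and converts the score to `int` and the time to `float`; whether malformed lines should be skipped or raise an error is left open here (this version lets the `ValueError` propagate).

```python
import os
import tempfile

def parse_raw_data(filename):
    """Parse 'participant_id:score:time' lines into a list of dictionaries."""
    records = []
    with open(filename, encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            participant, score, time = line.split(":")
            records.append(
                {"participant": participant, "score": int(score), "time": float(time)}
            )
    return records

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "day1_results.txt")
    with open(raw_path, "w", encoding="utf-8") as file:
        file.write("P001:85:23.5\nP002:92:18.7\nP003:78:28.1")
    day1 = parse_raw_data(raw_path)

print(day1)
```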


In [None]:
# Task 5b: Combine multiple files
# Process all raw data files and combine them into a single dataset
# Add a 'day' field to track which file each record came from
# Save the combined data as 'combined_results.json'
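
Combining the files mostly comes down to tagging each record with its day. In this sketch the day number is pulled out of the file name with a regular expression, which assumes names of the form `dayN_results.txt`. To keep the example short it parses the content strings directly and builds the JSON text with `json.dumps`; in the notebook you would read each file and pass an open `combined_results.json` to `json.dump` instead.

```python
import json
import re

raw_data_files = {
    "day1_results.txt": "P001:85:23.5\nP002:92:18.7",
    "day2_results.txt": "P001:88:22.1\nP002:89:19.3",
}

combined = []
for filename, content in sorted(raw_data_files.items()):
    # Assumes the day number is embedded in the name, e.g. 'day2_results.txt'
    day = int(re.search(r"day(\d+)", filename).group(1))
    for line in content.splitlines():
        participant, score, time = line.split(":")
        combined.append(
            {"day": day, "participant": participant, "score": int(score), "time": float(time)}
        )

combined_json = json.dumps(combined, indent=2)
print(len(combined), combined[0])
```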



In [None]:
# Task 5c: Generate summary report
# Create a summary report with:
# - Participant progress over days (average score improvement)
# - Best and worst performing participants
# - Day-by-day statistics
# Save as both 'summary_report.txt' (human-readable) and 'summary_data.json' (machine-readable)
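
The summary can be sketched as follows. Two interpretations here are assumptions: "improvement" is taken as the average score change per day between the first and last day, and "best" and "worst" are judged by mean score across days. The `scores` dictionary holds each participant's scores in day order, taken from the raw data files. The sketch builds the report text and JSON in memory; writing them out uses the same `with open(...)` pattern as earlier tasks.

```python
import json
from statistics import mean

# Scores per participant in day order (day 1, day 2, day 3)
scores = {"P001": [85, 88, 91], "P002": [92, 89, 94], "P003": [78, 82, 85]}

per_participant = {
    pid: {
        "mean_score": round(mean(s), 2),
        "improvement_per_day": (s[-1] - s[0]) / (len(s) - 1),
    }
    for pid, s in scores.items()
}
best = max(per_participant, key=lambda pid: per_participant[pid]["mean_score"])
worst = min(per_participant, key=lambda pid: per_participant[pid]["mean_score"])

# zip(*...) regroups the scores so that each tuple holds one day's results
daily_means = {
    f"day{index}": round(mean(day_scores), 2)
    for index, day_scores in enumerate(zip(*scores.values()), start=1)
}

summary = {"participants": per_participant, "best": best, "worst": worst, "daily_means": daily_means}

report_lines = [f"Best: {best}", f"Worst: {worst}"]
report_lines += [f"{day}: mean score {value}" for day, value in daily_means.items()]
report_text = "\n".join(report_lines)   # content for summary_report.txt
summary_json = json.dumps(summary, indent=2)  # content for summary_data.json

print(report_text)
```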



## Challenge: Advanced File Operations

For those ready to tackle more complex file handling:

In [None]:
# Challenge 1: Log file parser
# Create a function that parses log files with timestamps and extracts specific events
# Sample log format: "2025-01-08 10:30:45 - INFO - Participant P001 completed trial 15"

sample_log = """2025-01-08 10:30:45 - INFO - Experiment started
2025-01-08 10:31:12 - INFO - Participant P001 completed trial 1
2025-01-08 10:31:45 - WARNING - Participant P002 timeout on trial 1
2025-01-08 10:32:10 - INFO - Participant P001 completed trial 2
2025-01-08 10:32:33 - ERROR - System error during trial recording
2025-01-08 10:33:01 - INFO - Participant P002 completed trial 2"""

with open('experiment.log', 'w') as file:
    file.write(sample_log)

# Write a function to extract all events for a specific participant
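
A regular expression is one reasonable way to split each log line into timestamp, level, and message. The function name `extract_participant_events` and the shape of the returned dictionaries are illustrative choices. The function accepts any iterable of lines, so an open file handle works as well as the list used here.

```python
import re
from datetime import datetime

LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (?P<level>[A-Z]+) - (?P<message>.*)$"
)

def extract_participant_events(lines, participant_id):
    """Return the parsed log entries whose message mentions the given participant."""
    wanted = re.compile(rf"\bParticipant {re.escape(participant_id)}\b")
    events = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the expected format
        if wanted.search(match["message"]):
            events.append(
                {
                    "timestamp": datetime.strptime(match["timestamp"], "%Y-%m-%d %H:%M:%S"),
                    "level": match["level"],
                    "message": match["message"],
                }
            )
    return events

sample_lines = [
    "2025-01-08 10:30:45 - INFO - Experiment started",
    "2025-01-08 10:31:12 - INFO - Participant P001 completed trial 1",
    "2025-01-08 10:31:45 - WARNING - Participant P002 timeout on trial 1",
    "2025-01-08 10:32:10 - INFO - Participant P001 completed trial 2",
]

p001_events = extract_participant_events(sample_lines, "P001")
p002_events = extract_participant_events(sample_lines, "P002")
print(len(p001_events), p002_events[0]["level"])
```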


In [None]:
# Challenge 2: Configuration file manager
# Create a class that manages configuration files with validation
# Should support getting/setting values, saving changes, and reverting to defaults

class ConfigManager:
    def __init__(self, config_file):
        # Your code here
        pass
    
    def get(self, key, default=None):
        # Your code here
        pass
    
    def set(self, key, value):
        # Your code here
        pass
    
    def save(self):
        # Your code here
        pass

# Test the configuration manager
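
One possible design for the challenge, stored as JSON. The `defaults` argument and the `reset()` method are my assumptions about how "reverting to defaults" could work, and the validation rule (a new value must match the type of the default for that key) is just one simple option.

```python
import json
import os
import tempfile

class ConfigManager:
    def __init__(self, config_file, defaults=None):
        self.config_file = config_file
        self.defaults = dict(defaults or {})
        self.values = dict(self.defaults)
        # Values saved on disk override the defaults
        if os.path.exists(config_file):
            with open(config_file, encoding="utf-8") as file:
                self.values.update(json.load(file))

    def get(self, key, default=None):
        return self.values.get(key, default)

    def set(self, key, value):
        # Simple validation: known keys must keep the type of their default
        if key in self.defaults and not isinstance(value, type(self.defaults[key])):
            expected = type(self.defaults[key]).__name__
            raise TypeError(f"{key!r} expects a value of type {expected}")
        self.values[key] = value

    def save(self):
        with open(self.config_file, "w", encoding="utf-8") as file:
            json.dump(self.values, file, indent=2)

    def reset(self):
        """Revert the in-memory values to the defaults (call save() to persist)."""
        self.values = dict(self.defaults)

defaults = {"theme": "dark", "font_size": 12}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "settings.json")
    config = ConfigManager(path, defaults=defaults)
    config.set("font_size", 14)
    config.save()

    reloaded_size = ConfigManager(path, defaults=defaults).get("font_size")

    try:
        config.set("font_size", "large")  # wrong type, should be rejected
        rejected = False
    except TypeError:
        rejected = True

    config.reset()
    reset_size = config.get("font_size")

print(reloaded_size, rejected, reset_size)
```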
