In [None]:
# In Colab, run this cell first to set up the file structure!
%cd /content
!rm -rf MOL518-Intro-to-Data-Analysis

!git clone https://github.com/shaevitz/MOL518-Intro-to-Data-Analysis.git
%cd MOL518-Intro-to-Data-Analysis/Lecture_5

# Lecture 5: Working with files and folders

In scientific computing, you will often need to work with many files organized into folders (also called directories). If you have experience using the command line (Terminal on Mac or Command Prompt/PowerShell on Windows), this will feel familiar. If you have only ever navigated files by clicking through Finder (Mac) or File Explorer (Windows), this programmatic approach may feel new, but it is a powerful skill that will save you time when processing large datasets.

The goals of this lecture are to:

- Learn about the working directory
- Build paths using `pathlib.Path`
- List the contents of a folder from inside Python
- Find groups of files using patterns
- Loop over many files and compute summary statistics

## Imports

We will need two new imports:

- `pathlib.Path`, a class for working with file paths
- `csv`, a standard-library module for writing CSV files

In [2]:
import numpy as np
from pathlib import Path
import csv

## Navigating the file system

### Files and folders
- A **folder** (also called a directory) can contain files and other folders.
- A **file** holds content, such as numbers in a CSV file, text in a document, or pixels in an image.

### File extensions
A file extension is the suffix after the last dot in a filename. It tells you what kind of content is stored in the file.

- `growth_curve.csv` has the extension `.csv`. A comma-separated values (CSV) file is a plain-text format for storing tabular data: each row is one data record, and columns are separated by commas (or by other characters such as tabs or semicolons).
- A `.jpg` file contains an image ...
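
In Python, the extension can be read from the `suffix` attribute of a `Path`. The sketch below uses made-up filenames purely for illustration; none of the files need to exist, because `suffix` only inspects the name.

```python
from pathlib import Path

# Hypothetical filenames, used only for illustration (no files need to exist)
names = ["growth_curve.csv", "plate_photo.jpg", "archive.tar.gz", "README"]

suffixes = {}
for name in names:
    # .suffix is the part from the last dot onward ("" if there is no dot)
    suffixes[name] = Path(name).suffix
    print(name, "->", repr(suffixes[name]))
```

Note that only the part after the last dot counts, so `archive.tar.gz` has the suffix `.gz`.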


### Paths
A **path** is a description of where a file or folder resides within the computer system.

- A **relative path** is interpreted relative to your current working directory.
  - Example: `data/raw/experiment.csv` (same on Mac and Windows when using Python's pathlib)
  - Example: `../Lecture_4/data.csv` (go up one folder first)

- An **absolute path** describes a location starting from the top of the file system.
  - Mac example: `/Users/username/Documents/project/data.csv`
  - Windows example: `C:\Users\username\Documents\project\data.csv`

> It is often useful to use relative paths because they work across different computers/systems as long as the project folder structure is consistent.
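
As a small illustration of the difference, the sketch below builds the relative path from the example above and turns it into an absolute one by joining it onto the working directory. The path is only an example and does not need to exist on your machine.

```python
from pathlib import Path

# Relative path from the example above (it does not need to exist)
rel = Path("data") / "raw" / "experiment.csv"
print(rel.is_absolute())        # False

# Joining onto the working directory gives an absolute path
abs_path = Path.cwd() / rel
print(abs_path.is_absolute())   # True
print(abs_path)

# ".." means "go up one folder"
up = Path("..") / "Lecture_4" / "data.csv"
print(up.parts)                 # ('..', 'Lecture_4', 'data.csv')
```

The printed absolute path will differ from computer to computer, which is exactly why relative paths are easier to share.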


## The working directory

When Python opens a file using a relative path, what is it relative to? That reference point is called the **working directory**. You can think of it as the folder where Python is currently working.

If the working directory is not what you expect, your code will fail to find files even if they exist. In Colab, the working directory can reset if you restart the runtime or open the notebook in a different way.

> A good habit is to print the current working directory before you load data, to make sure you are where you think you are.


In [4]:
# Get the current working directory as a Path object.
Path.cwd()


PosixPath('/Users/jshaevitz/Library/CloudStorage/Dropbox/Documents/Teaching/MOL518_Programming/MOL518-Intro-to-Data-Analysis/Lecture_5')

If you ran the setup cell at the top of this notebook, your working directory should end with:

`/content/MOL518-Intro-to-Data-Analysis/Lecture_5`
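
One way to turn this habit into code is to compare the name of the working directory with the name you expect. This is only a sketch: the expected folder name `Lecture_5` is taken from the setup cell, and `in_expected_folder` is a variable introduced here for illustration.

```python
from pathlib import Path

cwd = Path.cwd()
print("Working directory:", cwd)

# Assumption: the lecture folder is named "Lecture_5", as in the setup cell
expected_name = "Lecture_5"

# .name is the last component of the path
in_expected_folder = (cwd.name == expected_name)

if not in_expected_folder:
    print("Warning: expected to be in a folder named", expected_name)
```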


## Listing the contents of a folder

You can also explore the contents of a folder from inside Python.

`Path.iterdir()` yields the contents of a folder as **Path objects**. It returns an iterator rather than a list, so we use the `list()` function to convert the result to a list that we can print.

In [8]:
# List the contents of the current working directory
entries = list(Path.cwd().iterdir())
entries

[PosixPath('/Users/jshaevitz/Library/CloudStorage/Dropbox/Documents/Teaching/MOL518_Programming/MOL518-Intro-to-Data-Analysis/Lecture_5/.DS_Store'),
 PosixPath('/Users/jshaevitz/Library/CloudStorage/Dropbox/Documents/Teaching/MOL518_Programming/MOL518-Intro-to-Data-Analysis/Lecture_5/MOL518_Lecture5.ipynb'),
 PosixPath('/Users/jshaevitz/Library/CloudStorage/Dropbox/Documents/Teaching/MOL518_Programming/MOL518-Intro-to-Data-Analysis/Lecture_5/data')]

You should see folders such as `data` in the list. The exact set depends on your repository layout.

### Exercise 1
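Find the `data` folder in the printed list above. What other files and folders do you see alongside it? Note that the list shows the contents of the working directory (`Lecture_5`), not the top level of the repository.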


## Building paths with `pathlib.Path`

In earlier lectures you passed filenames to functions as strings. That works, but it becomes fragile when you start combining folders and filenames.

`pathlib.Path` lets you build paths in a way that works across operating systems.

A key feature is that you can join path pieces using `/`.

For example, `Path("data") / "raw"` means: a folder named `raw` inside a folder named `data`.


In [None]:
raw_dir = Path("data") / "raw"
raw_dir


Before we do anything with a path, we should check whether it exists.

- `.exists()` tells you whether something is there
- `.is_dir()` tells you whether it is a folder
- `.is_file()` tells you whether it is a file


In [None]:
raw_dir.exists(), raw_dir.is_dir()


If this prints `(False, False)`, then Python cannot find `data/raw` from your current working directory.

In that case, stop and check:

- Did you run the setup cell at the top of the notebook?
- Are you in the correct lecture folder?
- Does the repository contain the `data/raw` folder?
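
To see how `.exists()`, `.is_dir()`, and `.is_file()` differ, here is a self-contained sketch that builds a throwaway folder with the standard `tempfile` module. The folder and file names are made up and are not part of the lecture data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "raw"
    folder.mkdir()

    file_path = folder / "example.csv"
    file_path.write_text("time,od\n0,0.05\n")

    missing = folder / "not_here.csv"   # never created

    # Each tuple is (exists, is_dir, is_file)
    checks = {
        "folder": (folder.exists(), folder.is_dir(), folder.is_file()),
        "file": (file_path.exists(), file_path.is_dir(), file_path.is_file()),
        "missing": (missing.exists(), missing.is_dir(), missing.is_file()),
    }

for label, result in checks.items():
    print(label, result)
```

A path that does not exist returns `False` for all three checks, which is the `(False, False)` situation described above.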


## The growth curve dataset layout

For this lecture, the growth curves are organized like this:

- `data/raw/DrugName/DrugName_rep1.csv`
- `data/raw/DrugName/DrugName_rep2.csv`

So the folder name tells you the drug condition, and each file inside is one replicate.

Let us list the drug folders.


In [None]:
drug_folders = sorted([p for p in raw_dir.iterdir() if p.is_dir()])

# Print a clean list of folder names
for p in drug_folders:
    print(p.name)


## Finding files with patterns using `glob`

Often you want "all CSV files in this folder".

`Path.glob()` finds paths inside a folder whose names match a pattern.

- `*.csv` means: any filename that ends in `.csv` (the `*` matches any run of characters)

This does not load the files. It only finds their paths.

Let us choose one drug folder and see what is inside.


In [None]:
# Pick the first drug folder in the list
example_drug_dir = drug_folders[0]
example_drug_dir


In [None]:
example_csv_files = sorted(list(example_drug_dir.glob("*.csv")))

print("Number of CSV files found:", len(example_csv_files))
for p in example_csv_files:
    print(p.name)


Printing the paths before loading anything is a safety habit.

If you find zero files, you should fix the path or the pattern before proceeding.
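
If you would rather have the code stop on its own when nothing is found, one option is a small helper that raises an error. The function name `find_csv_files` is hypothetical (it is not part of `pathlib`), and the files below are created in a temporary folder just for the demonstration.

```python
import tempfile
from pathlib import Path

def find_csv_files(folder):
    """Return a sorted list of CSV paths in folder, or raise if none are found."""
    files = sorted(folder.glob("*.csv"))
    if len(files) == 0:
        raise FileNotFoundError(f"No CSV files found in {folder}")
    return files

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Made-up files: two CSVs and one file that should not match
    (folder / "DrugA_rep2.csv").write_text("time,od\n")
    (folder / "DrugA_rep1.csv").write_text("time,od\n")
    (folder / "notes.txt").write_text("not a CSV\n")

    found_names = [p.name for p in find_csv_files(folder)]

print(found_names)
```

Sorting matters because `glob` does not promise any particular order.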


## Loading one growth curve file

Each growth curve CSV contains two columns:

- Time
- Optical density (OD)

The first row is a header, so we tell NumPy to skip it.

In this lecture we will compute a few simple values using tools you already know:

- The replicate number (from the filename)
- The total number of time points
- The last OD value

We will use the last OD values from both replicates to compute one number per drug: the average last OD across replicates.

We will also compute the total elapsed time as a sanity check, but we will not put it into the per drug tables.


In [None]:
# Load a single replicate file
path = example_csv_files[0]

data = np.loadtxt(path, delimiter=",", skiprows=1)

time = data[:, 0]
od = data[:, 1]

print("File:", path.name)
print("Number of time points:", len(time))
print("First time:", time[0])
print("Last time:", time[-1])
print("Last OD:", od[-1])
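
To see what `skiprows=1` does without touching the real data, the sketch below loads a tiny table from a string instead of a file. The header names and the numbers are made up for illustration; they are not values from the dataset.

```python
import io
import numpy as np

# Made-up text in the same two-column layout (time, OD); not real measurements
text = "time,od\n0,0.05\n1,0.10\n2,0.21\n"

# np.loadtxt accepts a file-like object as well as a path.
# skiprows=1 skips the header line, which cannot be converted to numbers.
data = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)
print(data.shape)   # (3, 2)

time = data[:, 0]   # first column
od = data[:, 1]     # second column

print("Last time:", time[-1])
print("Last OD:", od[-1])
```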


## Extracting metadata from filenames

In real workflows, important information often lives in folder names and filenames.

Our files follow a simple naming pattern:

`DrugName_rep1.csv`

The part before `.csv` is called the **stem**.

We can access pieces of a path like this:

- `path.name` gives the full filename
- `path.stem` gives the filename without the extension

We will extract the replicate number from the stem.


In [None]:
stem = path.stem
stem


In [None]:
# A helper function to parse replicate numbers from stems like "Ampicillin_rep2"

def replicate_number_from_stem(stem):
    # We expect something like "..._rep2"
    if "_rep" not in stem:
        return None
    rep_part = stem.split("_rep")[-1]

    # Convert to an integer if possible
    try:
        return int(rep_part)
    except ValueError:
        return None

replicate_number_from_stem(stem)


If the function returns `None`, that means the filename did not match the expected pattern.

That is not a disaster. It is a signal that your assumptions about the dataset naming were wrong.

In batch workflows, naming conventions matter.
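
Folder names carry metadata too. The sketch below pulls both the drug name and the replicate number out of a single path. The path is hypothetical and only follows the `DrugName/DrugName_rep1.csv` layout described earlier; the file does not need to exist.

```python
from pathlib import Path

# Hypothetical path following the layout data/raw/DrugName/DrugName_rep1.csv
p = Path("data") / "raw" / "Ampicillin" / "Ampicillin_rep2.csv"

print(p.name)          # Ampicillin_rep2.csv
print(p.stem)          # Ampicillin_rep2
print(p.suffix)        # .csv
print(p.parent.name)   # Ampicillin

# Split the stem once, from the right, at "_rep"
drug_from_stem, rep_text = p.stem.rsplit("_rep", 1)
rep = int(rep_text)

print(drug_from_stem, rep)
```

Comparing `p.parent.name` with `drug_from_stem` is a cheap way to check that a file sits in the folder its name says it should.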


## A small per file summary function

Batch processing becomes much easier if you can describe what you do to one file.

We will write a function that takes a file path and returns a dictionary of results.

This does not introduce new ideas. It just packages existing code so we can reuse it.


In [None]:
def summarize_growth_curve(csv_path):
    """Load one growth curve CSV and return a summary dictionary."""

    data = np.loadtxt(csv_path, delimiter=",", skiprows=1)

    time = data[:, 0]
    od = data[:, 1]

    rep_num = replicate_number_from_stem(csv_path.stem)
    last_od = od[-1]
    n_time_points = len(time)
    elapsed_time = time[-1] - time[0]

    return {
        "replicate": rep_num,
        "last_od": last_od,
        "n_time_points": n_time_points,
        "elapsed_time": elapsed_time,
        "file": csv_path.name,
    }

# Try it on one file
summarize_growth_curve(path)


## Batch processing all replicates for one drug

Now we scale up from one file to many.

We will:

1. Find all replicate CSV files for a drug folder
2. Summarize each file
3. Create a small table that has one row per replicate
4. Compute the average last OD across replicates

In the full batch processing section, we will save one replicate table per drug to disk.
That table will include:

- Replicate number
- Total number of time points
- Average last OD for that drug (the same number repeated for each replicate row)


In [None]:
drug_dir = example_drug_dir
replicate_files = sorted(list(drug_dir.glob("*.csv")))

replicate_rows = []

for csv_path in replicate_files:
    row = summarize_growth_curve(csv_path)
    replicate_rows.append(row)

print("Drug:", drug_dir.name)
print("Number of replicate files:", len(replicate_rows))

# Print the table in a readable way
for row in replicate_rows:
    print(row)


In [None]:
# Compute the average last OD across replicates
last_ods = [row["last_od"] for row in replicate_rows]

avg_last_od = sum(last_ods) / len(last_ods)
print("Average last OD for", drug_dir.name, ":", avg_last_od)


This average is a simple summary of growth under that drug condition.

In later lectures you will learn more sophisticated summaries, but the workflow pattern will stay the same.
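
NumPy, which this notebook already imports, offers `np.mean` as an equivalent way to compute the average. A minimal sketch with made-up OD values (not taken from the dataset):

```python
import numpy as np

# Made-up last OD values for two replicates, for illustration only
last_ods = [0.82, 0.78]

avg_by_hand = sum(last_ods) / len(last_ods)
avg_numpy = np.mean(last_ods)

print(avg_by_hand, avg_numpy)
```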


## Writing output tables to disk

In a real workflow, you do not want results only printed to the screen.

You want to save them so you can:

- Load them later
- Share them
- Plot them

We will save outputs in a separate folder.

We will never overwrite the raw data.


In [None]:
processed_dir = Path("data") / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)
processed_dir
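
The two arguments to `mkdir` are worth a closer look. The sketch below repeats the same call inside a temporary folder so that nothing in the lecture directory is touched; the variable names are chosen here for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "processed"

    # parents=True also creates the missing "data" folder
    target.mkdir(parents=True, exist_ok=True)
    created = target.is_dir()

    # exist_ok=True makes a second call harmless
    target.mkdir(parents=True, exist_ok=True)

    # Without exist_ok=True, mkdir on an existing folder raises an error
    try:
        target.mkdir()
        raised = False
    except FileExistsError:
        raised = True

print("Folder created:", created)
print("Plain mkdir raised FileExistsError:", raised)
```

This is why the cell above can be rerun safely as many times as you like.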


In [None]:
# Write the per replicate table for this drug
output_path = processed_dir / f"{drug_dir.name}_replicate_table.csv"

fieldnames = ["replicate", "last_od", "n_time_points", "elapsed_time", "file"]

with open(output_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(replicate_rows)

print("Wrote:", output_path)
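
Loading a saved table later works the same way in reverse, using `csv.DictReader`. The sketch below writes a small made-up table to a temporary folder and reads it back; the rows and the filename are illustrative only.

```python
import csv
import tempfile
from pathlib import Path

# Made-up rows with the same kind of columns as the replicate table
rows = [
    {"replicate": 1, "last_od": 0.82},
    {"replicate": 2, "last_od": 0.79},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "example_table.csv"

    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["replicate", "last_od"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the table back: one dictionary per row, keyed by the header
    with open(out_path, newline="") as f:
        loaded = list(csv.DictReader(f))

print(loaded[0])

# Values come back as strings, so convert before doing arithmetic
last_ods = [float(r["last_od"]) for r in loaded]
print(last_ods)
```

The conversion step is easy to forget: `csv` stores everything as text, so numbers must be turned back into `int` or `float` after loading.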


## Full batch processing: all drugs

Now we do the same procedure for every drug folder.

For each drug we will:

- Find the replicate files
- Compute a small summary for each replicate
- Compute the average last OD across the replicates
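- Save a table for that drug in `data/processed` (this replaces the example table written in the previous section, because it uses the same filename)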

Each per drug table will have one row per replicate with these columns:

- `replicate`
- `n_time_points`
- `avg_last_od` (the average across replicates for that drug)

We will also build an overall summary table that has one row per drug.


In [None]:
verbose = True

drug_summary_rows = []
failed_files = []

for drug_dir in drug_folders:
    replicate_files = sorted(list(drug_dir.glob("*.csv")))

    replicate_rows = []

    for csv_path in replicate_files:
        try:
            row = summarize_growth_curve(csv_path)
            replicate_rows.append(row)
        except Exception as e:
            failed_files.append({"drug": drug_dir.name, "file": csv_path.name, "error": str(e)})
            if verbose:
                print("Failed:", drug_dir.name, csv_path.name)
                print("Error:", e)

    if len(replicate_rows) == 0:
        if verbose:
            print("No replicate files found for", drug_dir.name)
        continue

    # Compute average last OD across replicates for this drug
    last_ods = [row["last_od"] for row in replicate_rows]
    avg_last_od = sum(last_ods) / len(last_ods)

    # Build a per drug table where each row is one replicate.
    # Each row includes the drug average, repeated, because it is a per drug value.
    per_drug_table_rows = []
    for row in replicate_rows:
        per_drug_table_rows.append({
            "replicate": row["replicate"],
            "n_time_points": row["n_time_points"],
            "avg_last_od": avg_last_od,
        })

    # Save the per drug replicate table
    per_drug_output = processed_dir / f"{drug_dir.name}_replicate_table.csv"
    fieldnames = ["replicate", "n_time_points", "avg_last_od"]

    with open(per_drug_output, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(per_drug_table_rows)

    # Add one row to the overall drug summary table
    drug_summary_rows.append({
        "drug": drug_dir.name,
        "n_replicates": len(replicate_rows),
        "avg_last_od": avg_last_od,
    })

    if verbose:
        print("Processed", drug_dir.name, "with", len(replicate_rows), "replicates")


In [None]:
# Write the overall drug summary table
summary_output = processed_dir / "drug_summary_table.csv"

with open(summary_output, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["drug", "n_replicates", "avg_last_od"])
    writer.writeheader()
    writer.writerows(drug_summary_rows)

print("Wrote:", summary_output)
print("Number of drugs summarized:", len(drug_summary_rows))

if len(failed_files) > 0:
    print("\nSome files failed to process. Here are the failures:")
    for item in failed_files:
        print(item)


If you want to double check your results, open one of the saved CSV files in `data/processed`.

The most important outcome of this lecture is not the specific numbers, but the workflow:

- Find files
- Loop
- Summarize
- Save outputs


## Reporting failures

In batch processing, one bad file should not crash everything.

At the same time, you should never silently ignore failures.

We kept a list of failed files, if any. Let us print it.


In [None]:
if len(failed_files) == 0:
    print("No failures detected.")
else:
    print("Failures:")
    for item in failed_files:
        print(item)
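
To watch the pattern work on a deliberately broken input, the sketch below processes two in-memory "files", one of which contains a non-numeric entry. The file names and contents are made up. Unlike the batch loop above, which catches any `Exception`, this sketch catches only `ValueError`, the error `np.loadtxt` raises for text it cannot convert.

```python
import io
import numpy as np

# Made-up file contents: the second one has a non-numeric entry
fake_files = {
    "good_rep1.csv": "time,od\n0,0.05\n1,0.10\n",
    "bad_rep2.csv": "time,od\n0,0.05\n1,oops\n",
}

results = {}   # file name -> last OD
failed = []    # one dictionary per failure

for name, text in fake_files.items():
    try:
        data = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)
        results[name] = data[-1, 1]
    except ValueError as e:
        # Record the failure and keep going with the next file
        failed.append({"file": name, "error": str(e)})

print("Succeeded:", list(results))
print("Failed:", [item["file"] for item in failed])
```

The good file is still summarized, and the bad one is recorded rather than silently dropped.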


## Exercises

### Exercise 2
Modify `summarize_growth_curve` so it also returns the first OD value.

### Exercise 3
For each drug, check whether the two replicates have the same number of time points. If not, print a warning.

### Exercise 4
Create a new summary table that includes the elapsed time for each drug. One simple choice is to average the elapsed times across replicates.

Work slowly and use the habits from this lecture.

- Print the paths you are using
- Print the number of files found
- Print a few intermediate results before writing files


## Common failure modes

- Forgetting that relative paths depend on the working directory
- Typing folder names that do not match the actual names on disk
- Using `glob` and not checking whether you found the files you expected
- Confusing a `Path` object with the file contents
- Writing outputs into the wrong folder and then not being able to find them

If you get stuck, the most important debugging step is still the simplest.

Print what you think you are using, then compare it to what is actually happening.
