In [None]:
# In Colab, run this cell first to set up the file structure!
%cd /content
!rm -rf MOL518-Intro-to-Data-Analysis

!git clone https://github.com/shaevitz/MOL518-Intro-to-Data-Analysis.git
%cd MOL518-Intro-to-Data-Analysis/Lecture_5

# Lecture 5: Working with files and folders

In scientific computing, you will often need to work with many files organized into folders (also called directories). If you have experience using the command line (Terminal on Mac or Command Prompt/PowerShell on Windows), this will feel familiar. If you have only ever navigated files by clicking through Finder (Mac) or File Explorer (Windows), this programmatic approach may feel new, but it is a powerful skill that will save you time when processing large datasets.

The goals of this lecture are to:

- Learn about the working directory
- Build paths using `pathlib.Path`
- List the contents of a folder from inside Python
- Find groups of files using patterns
- Loop over many files and compute summary statistics

## Imports

We will need two new imports:

- `pathlib.Path` for working with file paths
- `csv` module for writing csv files

In [2]:
import numpy as np
from pathlib import Path
import csv

## Navigating the file system

### Files and folders
- A **folder** (also called a directory) can contain files and other folders.
- A **file** holds content, such as numbers in a CSV file, text in a document, or pixels in an image.

### File extensions
A file extension is the suffix after the last dot in a filename and is used to give you information about what is stored in the file.

- `growth_curve.csv` has the extension `.csv`. A Comma-Separated Values (CSV) file is a plain text format for storing tabular data, where each row is a data record and columns are separated by commas (or by other characters such as tabs or semicolons).
- A `.jpg` file contains an image ...
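
As a small sketch of how Python sees extensions, `pathlib` exposes the extension through the `.suffix` attribute. The filename `colony_photo.jpg` is made up for illustration, and nothing is read from disk, so neither file needs to exist.

```python
from pathlib import Path

# Filenames used only to illustrate .suffix; the .jpg name is hypothetical
csv_file = Path("growth_curve.csv")
image_file = Path("colony_photo.jpg")

# .suffix is the text after the last dot, including the dot itself
csv_suffix = csv_file.suffix
image_suffix = image_file.suffix

print(csv_suffix)
print(image_suffix)
```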


### Paths
A **path** is a description of where a file or folder resides within the computer system.

- A **relative path** is interpreted relative to your current working directory.
  - Example: `data/raw/experiment.csv` (same on Mac and Windows when using Python's pathlib)
  - Example: `../Lecture_4/data.csv` (go up one folder first)

- An **absolute path** describes a location starting from the top of the file system.
  - Mac example: `/Users/username/Documents/project/data.csv`
  - Windows example: `C:\Users\username\Documents\project\data.csv`

> It is often useful to use relative paths because they work across different computers/systems as long as the project folder structure is consistent.
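
Here is a quick sketch of how to tell the two kinds of path apart in Python. `is_absolute()` and `parts` are both part of `pathlib`. The example paths are the relative ones from the list above and are never opened, so they do not need to exist on your computer.

```python
from pathlib import Path

relative_path = Path("data/raw/experiment.csv")
up_one_folder = Path("../Lecture_4/data.csv")

# A relative path does not start from the top of the file system
is_abs = relative_path.is_absolute()

# .parts splits a path into its pieces; ".." means "go up one folder"
up_parts = up_one_folder.parts

print(is_abs)
print(up_parts)
```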


## The working directory

When Python opens a file using a relative path, what is it relative to? That reference point is called the **working directory**. You can think of it as the folder where Python is currently working.

If the working directory is not what you expect, your code will fail to find files even if they exist. In Colab, the working directory can reset if you restart the runtime or open the notebook in a different way.

> A good habit is to print the current working directory before you load data, to make sure you are where you think you are.

`Path` is a class that represents file and folder paths. Classes are collections of related functionality bundled together. We've seen this before: in Lecture 2 we used `np.array()` to create NumPy arrays, and in Lecture 3 we used methods like `.plot()` on matplotlib objects.

`Path.cwd()` is a method of the `Path` class that returns the current working directory as a `Path` object. The parentheses `()` mean we are *calling* the method to execute it.


In [23]:
# Get the current working directory as a Path object.
Path.cwd()

PosixPath('/Users/jshaevitz/Library/CloudStorage/Dropbox/Documents/Teaching/MOL518_Programming/MOL518-Intro-to-Data-Analysis/Lecture_5')

If you ran the setup cell at the top of this notebook, your working directory should look like:

`/content/MOL518-Intro-to-Data-Analysis/Lecture_5`
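
To see how a relative path depends on the working directory, the sketch below turns the relative path `data` into an absolute one with `.absolute()`, which simply places the working directory in front of it. It does not check whether a `data` folder actually exists, and the output will differ from computer to computer.

```python
from pathlib import Path

working_dir = Path.cwd()
relative_path = Path("data")

# .absolute() prepends the current working directory to a relative path
absolute_path = relative_path.absolute()

print("Working directory:", working_dir)
print("Relative path:    ", relative_path)
print("Interpreted as:   ", absolute_path)
```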


## Listing the contents of a folder

We can also explore the contents of a folder from inside Python.

The `Path.iterdir()` method lists the contents of a folder. It yields **Path objects** one at a time, in no guaranteed order, so we use the `list()` function to collect them into a list we can print.

In [26]:
# List the contents of the current working directory
entries = list(Path.cwd().iterdir())
entries

[PosixPath('/Users/jshaevitz/Library/CloudStorage/Dropbox/Documents/Teaching/MOL518_Programming/MOL518-Intro-to-Data-Analysis/Lecture_5/MOL518_Lecture5.ipynb'),
 PosixPath('/Users/jshaevitz/Library/CloudStorage/Dropbox/Documents/Teaching/MOL518_Programming/MOL518-Intro-to-Data-Analysis/Lecture_5/data')]

## Building paths with `pathlib.Path`

In earlier lectures we have written the filename directly into the code cells. That works, but it becomes tedious if you have many files and fragile when you start combining folders and filenames.

`pathlib.Path` lets you build paths in a way that works across operating systems.

You can join path pieces using `/`. For example, `Path("data") / "ecoli_drug_curves"` means: a folder named `ecoli_drug_curves` inside a folder named `data`.


In [17]:
data_dir = Path("data") / "ecoli_drug_curves"
data_dir

PosixPath('data/ecoli_drug_curves')

Before we do anything with a path, it's good to check whether it exists:

- `.exists()` tells you whether something is there
- `.is_dir()` tells you whether it is a folder
- `.is_file()` tells you whether it is a file



In [18]:
data_dir.exists(), data_dir.is_dir(), data_dir.is_file()

(True, True, False)

## Navigating a complex directory structure

For this lecture, I have put the antibiotic growth curve replicates into folders based on the drug name via the following convention:

- `data/ecoli_drug_curves/DrugName/DrugName_rep1.csv`
- `data/ecoli_drug_curves/DrugName/DrugName_rep2.csv`

We'll start by listing all the folders inside `data/ecoli_drug_curves/` that represent the different drugs.


In [27]:
list(data_dir.iterdir())

[PosixPath('data/ecoli_drug_curves/Ampicillin'),
 PosixPath('data/ecoli_drug_curves/Trimethoprim'),
 PosixPath('data/ecoli_drug_curves/Novabiocin'),
 PosixPath('data/ecoli_drug_curves/Gentamycin'),
 PosixPath('data/ecoli_drug_curves/Rifampicin'),
 PosixPath('data/ecoli_drug_curves/Chloramphenicol')]

Let's make a nicer list alphabetized by drug name. 

First, we will use a `for` loop to go through the contents of `data_dir` and keep only the items that are folders (not files). Then we will sort the list alphabetically.

Note that the `.name` attribute of a Path object returns only the final component of the path, i.e. just the filename or the last folder name, without any of the parent directories.

In [39]:
# Step 1: Build a list of only the folders (not files) in data_dir
drug_folders_list = []
for p in data_dir.iterdir():
    if p.is_dir():
        drug_folders_list.append(p) # appends the current folder to the list

# Step 2: Sort them alphabetically by name
drug_folders = sorted(drug_folders_list)

# Step 3: Print a clean list of folder names
print("Drug folders found:")
for p in drug_folders:
    print(p.name)

Drug folders found:
Ampicillin
Chloramphenicol
Gentamycin
Novabiocin
Rifampicin
Trimethoprim


Above we used the `sorted` function, which takes a list or other sequence and returns a new list with the same items arranged in order (alphabetically for text, numerically for numbers). `Path` objects are sorted by their path, so the folders come out alphabetized by drug name, which makes the output easier to read.
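
A short sketch of `sorted` on its own, using made-up lists rather than the lecture data. Note that `sorted` leaves the original list untouched and that `Path` objects can be sorted just like strings.

```python
from pathlib import Path

# Text is sorted alphabetically
drug_names = ["Rifampicin", "Ampicillin", "Gentamycin"]
sorted_names = sorted(drug_names)

# Numbers are sorted from smallest to largest (made-up values)
readings = [0.31, 0.05, 0.12]
sorted_readings = sorted(readings)

# Path objects are sorted by their path
paths = [Path("data") / "Rifampicin", Path("data") / "Ampicillin"]
sorted_path_names = [p.name for p in sorted(paths)]

print(sorted_names)
print(drug_names)  # the original list is unchanged
print(sorted_readings)
print(sorted_path_names)
```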

## Finding files with patterns using `glob`

Sometimes you want to load all files of a particular type from a folder, like "all CSV files in this directory" or "all PNG images from this experiment".

The `glob` method finds file paths that match a pattern using **wildcards**:

- `*.csv` means: any filename that ends in `.csv`
- `data_*.txt` means: any filename that starts with `data_` and ends in `.txt`
- `*` by itself means: everything

Wildcards are a quick way to describe groups of files without listing each one by hand. Note that the `glob` method only finds the paths; it does not load or open the files.
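
The three patterns above can be tried out safely in a temporary folder. This is only a sketch: the folder and the four placeholder files are created by the code and deleted again when the `with` block ends, and the filenames are invented for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Create a few empty placeholder files (hypothetical names)
    for filename in ["data_01.txt", "data_02.txt", "notes.txt", "curve.csv"]:
        (folder / filename).write_text("placeholder")

    # Each glob call returns matching paths; we keep just the filenames
    csv_names = sorted([p.name for p in folder.glob("*.csv")])
    data_txt_names = sorted([p.name for p in folder.glob("data_*.txt")])
    all_names = sorted([p.name for p in folder.glob("*")])

print(csv_names)
print(data_txt_names)
print(all_names)
```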

Let's list all the CSV files inside the first drug folder.

In [29]:
# Pick the first drug folder in the list
example_drug_dir = drug_folders[0]
example_drug_dir

PosixPath('data/ecoli_drug_curves/Ampicillin')

If we list all the files, we see there are some non-CSV files with other information in them.

In [35]:
list(example_drug_dir.iterdir())

[PosixPath('data/ecoli_drug_curves/Ampicillin/OtherInfo.txt'),
 PosixPath('data/ecoli_drug_curves/Ampicillin/Ampicillin_rep2.csv'),
 PosixPath('data/ecoli_drug_curves/Ampicillin/Ampicillin_rep1.csv')]

We can use `*.csv` to find only the CSV files:

In [37]:
example_csv_files = sorted(list(example_drug_dir.glob("*.csv")))

print("Number of CSV files found:", len(example_csv_files))
for p in example_csv_files:
    print(p.name)  # .name gives just the filename, without the folders

Number of CSV files found: 2
Ampicillin_rep1.csv
Ampicillin_rep2.csv


## Loading one growth curve file

Each growth curve CSV contains two columns as before, Time and OD. The first row is a header, so we need to remember to tell NumPy to skip it.

Here is example code that loads the first file in `example_drug_dir` and prints some statistics.

In [42]:
data = np.loadtxt(example_csv_files[0], delimiter=",", skiprows=1)

time = data[:, 0]
od = data[:, 1]

print("File:", example_csv_files[0].name)
print("Number of time points:", len(time))
print("First time:", time[0])
print("Last time:", time[-1])
print("First OD:", od[0])
print("Last OD:", od[-1])


File: Ampicillin_rep1.csv
Number of time points: 97
First time: 0.0
Last time: 960.0
First OD: 0.116
Last OD: 0.114


### Exercise 1

Write code that loads both CSV files in `example_drug_dir` and calculates the average of their last OD measurements. Use paths and a `for` loop over the files, as we have been discussing, rather than hard-coding the filenames.

In [None]:
# Your code here







File: Ampicillin_rep1.csv, Last OD: 0.114
File: Ampicillin_rep2.csv, Last OD: 0.115

Average last OD: 0.1145


## Extracting metadata from filenames

In real workflows, important information often lives in folder names and filenames.

Our files follow a simple naming pattern:

`DrugName_rep1.csv`

The part before `.csv` is called the **stem**.

We can access pieces of a path like this:

- `path.name` gives the full filename
- `path.stem` gives the filename without the extension

Below we print the Path, name, and stem of `example_csv_files[0]`:


In [46]:
example_csv_files[0], example_csv_files[0].name, example_csv_files[0].stem

(PosixPath('data/ecoli_drug_curves/Ampicillin/Ampicillin_rep1.csv'),
 'Ampicillin_rep1.csv',
 'Ampicillin_rep1')
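
Once we have the stem, ordinary string methods can pull the metadata out of it. The sketch below assumes every file follows the `DrugName_rep1.csv` convention exactly; a file named differently would need different handling. The path is built by hand here, so no file is opened.

```python
from pathlib import Path

example_path = Path("data") / "ecoli_drug_curves" / "Ampicillin" / "Ampicillin_rep1.csv"

# Split the stem "Ampicillin_rep1" at the last "_rep"
drug_name, replicate_text = example_path.stem.rsplit("_rep", 1)
replicate_number = int(replicate_text)

# The enclosing folder name carries the drug name too
folder_name = example_path.parent.name

print(drug_name, replicate_number, folder_name)
```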

## Batch processing all replicates for one drug

Now we scale up from one file to many.

We will:

1. Find all replicate CSV files for a drug folder
2. Summarize each file
3. Create a small table that has one row per replicate
4. Compute the average last OD across replicates

In the full batch processing section, we will save one replicate table per drug to disk.
That table will include:

- Replicate number
- Total number of time points
- Average last OD for that drug (the same number repeated for each replicate row)
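
The cells below call a helper named `summarize_growth_curve`, whose definition does not appear in this section. Here is one minimal sketch of what it might look like. The dictionary keys are an assumption, chosen to match the column names used when the table is written later (`replicate`, `last_od`, `n_time_points`, `elapsed_time`, `file`), and the replicate number is assumed to follow `_rep` in the filename. The small check at the end uses a made-up file, not real data.

```python
import tempfile
from pathlib import Path

import numpy as np


def summarize_growth_curve(csv_path):
    """Summarize one growth curve CSV (a sketch; the returned keys are assumed)."""
    # Skip the header row, then split the two columns
    data = np.loadtxt(csv_path, delimiter=",", skiprows=1)
    time = data[:, 0]
    od = data[:, 1]

    # Filenames look like DrugName_rep1, so the number follows "_rep"
    replicate = int(csv_path.stem.rsplit("_rep", 1)[1])

    return {
        "replicate": replicate,
        "last_od": float(od[-1]),
        "n_time_points": len(time),
        "elapsed_time": float(time[-1] - time[0]),
        "file": csv_path.name,
    }


# Quick check on a tiny made-up file (hypothetical name and values)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "DemoDrug_rep1.csv"
    demo_path.write_text("Time,OD\n0,0.10\n10,0.12\n20,0.15\n")
    demo_row = summarize_growth_curve(demo_path)

print(demo_row)
```

Returning a dictionary keeps each value labeled, which is what lets `csv.DictWriter` match values to columns by name.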


In [None]:
drug_dir = example_drug_dir
replicate_files = sorted(list(drug_dir.glob("*.csv")))

replicate_rows = []

for csv_path in replicate_files:
    row = summarize_growth_curve(csv_path)
    replicate_rows.append(row)

print("Drug:", drug_dir.name)
print("Number of replicate files:", len(replicate_rows))

# Print the table in a readable way
for row in replicate_rows:
    print(row)


In [None]:
# Compute the average last OD across replicates
last_ods = [row["last_od"] for row in replicate_rows]

avg_last_od = sum(last_ods) / len(last_ods)
print("Average last OD for", drug_dir.name, ":", avg_last_od)


This average is a simple summary of growth under that drug condition.

In later lectures you will learn more sophisticated summaries, but the workflow pattern will stay the same.


## Writing output tables to disk

In a real workflow, you do not want results only printed to the screen.

You want to save them so you can:

- Load them later
- Share them
- Plot them

We will save outputs in a separate folder.

We will never overwrite the raw data.
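
To illustrate the "load them later" step, here is a minimal round trip sketch: write a small table with `csv.DictWriter`, then read it back with `csv.DictReader`. The rows are made up and everything happens in a temporary folder that is deleted afterwards, so no lecture data is touched. One thing to notice is that the `csv` module reads every value back as text.

```python
import csv
import tempfile
from pathlib import Path

# Made-up rows standing in for real results
rows = [
    {"replicate": 1, "last_od": 0.25},
    {"replicate": 2, "last_od": 0.75},
]

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "processed"
    output_dir.mkdir(parents=True, exist_ok=True)
    table_path = output_dir / "demo_table.csv"

    # Write the table: one header row, then one row per dictionary
    with open(table_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["replicate", "last_od"])
        writer.writeheader()
        writer.writerows(rows)

    # Load the table back; each row becomes a dictionary of strings
    with open(table_path, newline="") as f:
        loaded_rows = list(csv.DictReader(f))

print(loaded_rows)

# Convert text back to numbers before doing any math
loaded_last_ods = [float(row["last_od"]) for row in loaded_rows]
print(loaded_last_ods)
```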


In [None]:
processed_dir = Path("data") / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)
processed_dir


In [None]:
# Write the per replicate table for this drug
output_path = processed_dir / f"{drug_dir.name}_replicate_table.csv"

fieldnames = ["replicate", "last_od", "n_time_points", "elapsed_time", "file"]

with open(output_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(replicate_rows)

print("Wrote:", output_path)


## Full batch processing: all drugs

Now we do the same procedure for every drug folder.

For each drug we will:

- Find the replicate files
- Compute a small summary for each replicate
- Compute the average last OD across the replicates
- Save a table for that drug in `data/processed`

Each per drug table will have one row per replicate with these columns:

- `replicate`
- `n_time_points`
- `avg_last_od` (the average across replicates for that drug)

We will also build an overall summary table that has one row per drug.


In [None]:
verbose = True

drug_summary_rows = []
failed_files = []

for drug_dir in drug_folders:
    replicate_files = sorted(list(drug_dir.glob("*.csv")))

    replicate_rows = []

    for csv_path in replicate_files:
        try:
            row = summarize_growth_curve(csv_path)
            replicate_rows.append(row)
        except Exception as e:
            failed_files.append({"drug": drug_dir.name, "file": csv_path.name, "error": str(e)})
            if verbose:
                print("Failed:", drug_dir.name, csv_path.name)
                print("Error:", e)

    if len(replicate_rows) == 0:
        if verbose:
            print("No replicate files found for", drug_dir.name)
        continue

    # Compute average last OD across replicates for this drug
    last_ods = [row["last_od"] for row in replicate_rows]
    avg_last_od = sum(last_ods) / len(last_ods)

    # Build a per drug table where each row is one replicate.
    # Each row includes the drug average, repeated, because it is a per drug value.
    per_drug_table_rows = []
    for row in replicate_rows:
        per_drug_table_rows.append({
            "replicate": row["replicate"],
            "n_time_points": row["n_time_points"],
            "avg_last_od": avg_last_od,
        })

    # Save the per drug replicate table
    per_drug_output = processed_dir / f"{drug_dir.name}_replicate_table.csv"
    fieldnames = ["replicate", "n_time_points", "avg_last_od"]

    with open(per_drug_output, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(per_drug_table_rows)

    # Add one row to the overall drug summary table
    drug_summary_rows.append({
        "drug": drug_dir.name,
        "n_replicates": len(replicate_rows),
        "avg_last_od": avg_last_od,
    })

    if verbose:
        print("Processed", drug_dir.name, "with", len(replicate_rows), "replicates")


In [None]:
# Write the overall drug summary table
summary_output = processed_dir / "drug_summary_table.csv"

with open(summary_output, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["drug", "n_replicates", "avg_last_od"])
    writer.writeheader()
    writer.writerows(drug_summary_rows)

print("Wrote:", summary_output)
print("Number of drugs summarized:", len(drug_summary_rows))

if len(failed_files) > 0:
    print("\nSome files failed to process. Here are the failures:")
    for item in failed_files:
        print(item)


If you want to double check your results, open one of the saved CSV files in `data/processed`.

The most important outcome of this lecture is not the specific numbers, but the workflow:

- Find files
- Loop
- Summarize
- Save outputs


## Reporting failures

In batch processing, one bad file should not crash everything.

At the same time, you should never silently ignore failures.

We kept a list of failed files, if any. Let us print it.


In [None]:
if len(failed_files) == 0:
    print("No failures detected.")
else:
    print("Failures:")
    for item in failed_files:
        print(item)


## Exercises

### Exercise 2
Modify `summarize_growth_curve` so it also returns the first OD value.

### Exercise 3
For each drug, check whether the two replicates have the same number of time points. If not, print a warning.

### Exercise 4
Create a new summary table that includes the elapsed time for each drug. One simple choice is to average the elapsed times across replicates.

Work slowly and use the habits from this lecture.

- Print the paths you are using
- Print the number of files found
- Print a few intermediate results before writing files


## Common failure modes

- Forgetting that relative paths depend on the working directory
- Typing folder names that do not match the actual names on disk
- Using `glob` and not checking whether you found the files you expected
- Confusing a `Path` object with the file contents
- Writing outputs into the wrong folder and then not being able to find them

If you get stuck, the most important debugging step is still the simplest.

Print what you think you are using, then compare it to what is actually happening.
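
One way to build that habit into your code is a small helper that prints where it is looking and refuses to continue when nothing is found. This is only a sketch: `find_csv_files` is a hypothetical name, not something defined elsewhere in the lecture, and the demonstration runs on an empty temporary folder so that the failure is triggered on purpose.

```python
import tempfile
from pathlib import Path


def find_csv_files(folder):
    """Hypothetical helper: return sorted CSV paths, or fail loudly if none exist."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder.absolute()}")

    csv_files = sorted(folder.glob("*.csv"))

    # Print what we are actually using before going any further
    print("Looking in:", folder.absolute())
    print("CSV files found:", len(csv_files))

    if len(csv_files) == 0:
        raise FileNotFoundError(f"No CSV files in {folder.absolute()}")
    return csv_files


# Demonstrate the failure on an empty temporary folder
with tempfile.TemporaryDirectory() as tmp:
    try:
        find_csv_files(tmp)
        message = "found files"
    except FileNotFoundError:
        message = "caught: no CSV files"

print(message)
```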
