In [None]:
# In Colab, run this cell first to set up the file structure!
%cd /content
!rm -rf MOL518-Intro-to-Data-Analysis

!git clone https://github.com/shaevitz/MOL518-Intro-to-Data-Analysis.git
%cd MOL518-Intro-to-Data-Analysis/Lecture_5

# Lecture 5: Working with files and folders

In scientific computing, you will often need to work with many files organized into folders (also called directories). If you have experience using the command line (Terminal on Mac or Command Prompt/PowerShell on Windows), this will feel familiar. If you have only ever navigated files by clicking through Finder (Mac) or File Explorer (Windows), this programmatic approach may feel new, but it is a powerful skill that will save you time when processing large datasets.

The goals of this lecture are to:

- Learn about the working directory
- Build paths using `pathlib.Path`
- List the contents of a folder from inside Python
- Find groups of files using patterns
- Loop over many files and compute summary statistics

## Imports

We will need two new imports from Python's standard library:

- `pathlib.Path` for working with file paths
- the `csv` module for writing CSV files

In [None]:
import numpy as np
from pathlib import Path
import csv

## Navigating the file system

### Files and folders
- A **folder** (also called a directory) can contain files and other folders.
- A **file** holds content, such as numbers in a CSV file, text in a document, or pixels in an image.

### File extensions
A file extension is the suffix after the last dot in a filename and is used to give you information about what is stored in the file.

- `growth_curve.csv` has the extension `.csv`. A `Comma-Separated Values` file is a plain text file format for storing tabular data, where each row is a data record and columns are separated by commas (or other characters like tabs or semi-colons).
- A `.jpg` file contains an image ...
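
As a small sketch of how extensions look from inside Python, the `.suffix` attribute of a `pathlib.Path` object returns the extension. The filenames `colony_photo.jpg` and `backup.tar.gz` are made up for illustration, and none of these files needs to exist on disk.

```python
from pathlib import Path

# Illustrative filenames; Path objects can describe files that do not exist.
csv_path = Path("growth_curve.csv")
image_path = Path("colony_photo.jpg")   # hypothetical image file
archive_path = Path("backup.tar.gz")    # hypothetical name with two dots

print(csv_path.suffix)      # .csv
print(image_path.suffix)    # .jpg
print(archive_path.suffix)  # .gz (only the part after the last dot)
```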


### Paths
A **path** is a description of where a file or folder resides within the computer system.

- A **relative path** is interpreted relative to your current working directory.
  - Example: `data/raw/experiment.csv` (same on Mac and Windows when using Python's pathlib)
  - Example: `../Lecture_4/data.csv` (go up one folder first)

- An **absolute path** describes a location starting from the top of the file system.
  - Mac example: `/Users/username/Documents/project/data.csv`
  - Windows example: `C:\Users\username\Documents\project\data.csv`

> It is often useful to use relative paths because they work across different computers/systems as long as the project folder structure is consistent.
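
The difference between the two kinds of path can be checked in code. This is a minimal sketch that assumes the `Path` class and its `Path.cwd()` method, both introduced later in this lecture. The paths are only descriptions, so the files do not need to exist.

```python
from pathlib import Path

# A relative path: no starting point is given, so it depends on where you are.
relative = Path("data") / "raw" / "experiment.csv"
print(relative.is_absolute())   # False

# Joining it onto the current working directory produces an absolute path.
absolute = Path.cwd() / relative
print(absolute.is_absolute())   # True

# ".." means "go up one folder"; .parts shows the individual pieces.
up_one = Path("..") / "Lecture_4" / "data.csv"
print(up_one.parts)             # ('..', 'Lecture_4', 'data.csv')
```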


## The working directory

When Python opens a file using a relative path, what is it relative to? That reference point is called the **working directory**. You can think of it as the folder where Python is currently working.

If the working directory is not what you expect, your code will fail to find files even if they exist. In Colab, the working directory can reset if you restart the runtime or open the notebook in a different way.

> A good habit is to print the current working directory before you load data, to make sure you are where you think you are.

`Path` is a class that represents file and folder paths. Classes are collections of related functionality bundled together. We've seen this before: in Lecture 2 we used `np.array()` to create NumPy arrays, and in Lecture 3 we used methods like `.plot()` on matplotlib objects.

`Path.cwd()` is a method (a function that belongs to a class) of the `Path` class. It returns the current working directory as a Path object. The parentheses `()` mean we are *calling* the method to execute it.


In [None]:
# Get the current working directory as a Path object.
Path.cwd()

If you ran the setup cell at the top of this notebook, your working directory should look like:

`/content/MOL518-Intro-to-Data-Analysis/Lecture_5`


## Listing the contents of a folder

We can also explore the contents of a folder from inside Python.

The `iterdir()` method lists the contents of a folder. It returns an iterator that yields **Path objects** one at a time, in no guaranteed order. We will use the `list()` function to convert it to a list so we can print it out.

In [None]:
# List the contents of the current working directory
entries = list(Path.cwd().iterdir())
entries

## Building paths with `pathlib.Path`

In earlier lectures we have written the filename directly into the code cells. That works, but it becomes tedious if you have many files and fragile when you start combining folders and filenames.

`pathlib.Path` lets you build paths in a way that works across operating systems.

You can join path pieces using `/`. For example, `Path("data") / "ecoli_drug_curves"` means: a folder named `ecoli_drug_curves` inside a folder named `data`.


In [None]:
data_dir = Path("data") / "ecoli_drug_curves"
data_dir

Before we do anything with a path, it's good to check whether it exists:

- `.exists()` tells you whether something is there
- `.is_dir()` tells you whether it is a folder
- `.is_file()` tells you whether it is a file



In [None]:
data_dir.exists(), data_dir.is_dir(), data_dir.is_file()
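
One way to use these three checks together is a small helper that reports what is at a path before you try to load it. This is only a sketch: the function name `describe` is hypothetical, and the example builds a throwaway folder with the standard `tempfile` module so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def describe(path):
    """Return a short label for what (if anything) is at `path`."""
    if not path.exists():
        return "missing"
    if path.is_dir():
        return "folder"
    return "file"


# Build a temporary folder containing one small CSV file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    file_path = folder / "example.csv"
    file_path.write_text("Time,OD\n0,0.05\n")

    labels = [
        describe(folder),               # the folder itself
        describe(file_path),            # a file we just created
        describe(folder / "nope.csv"),  # a path with nothing there
    ]

print(labels)  # ['folder', 'file', 'missing']
```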

## Navigating a complex directory structure

For this lecture, I have put the antibiotic growth curve replicates into folders based on the drug name via the following convention:

- `data/ecoli_drug_curves/DrugName/DrugName_rep1.csv`
- `data/ecoli_drug_curves/DrugName/DrugName_rep2.csv`

We'll start by listing everything inside `data/ecoli_drug_curves`, where each folder represents a different drug.


In [None]:
list(data_dir.iterdir())

Let's make a nicer list alphabetized by drug name. 

First, we will use a `for` loop to go through the contents of `data_dir` and keep only the items that are folders (not files). Then we will sort the list alphabetically.

Note that the `.name` attribute of a Path object returns only the final component of the path, i.e. just the filename or the last folder name, without any of the parent directories.

In [None]:
# Step 1: Build a list of only the folders (not files) in data_dir
drug_folders_list = []
for p in data_dir.iterdir():
    if p.is_dir():
        drug_folders_list.append(p) # appends the current folder to the list

# Step 2: Sort them alphabetically by name
drug_folders = sorted(drug_folders_list)

# Step 3: Print a clean list of folder names
print("Drug folders found:")
for p in drug_folders:
    print(p.name)

Above we used the `sorted` function which takes a list or sequence and returns a new list with the same items arranged in order (alphabetically for text, numerically for numbers). This makes the output easier to read.
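
A quick sketch of how `sorted` behaves, using made-up names rather than the contents of the course data folder. Note one common surprise: text is compared character by character, so `rep10` sorts before `rep2`. The `key` argument shown at the end is one possible workaround, not the only one.

```python
# Illustrative drug names (not read from the data folder).
drug_names = ["Tetracycline", "Ampicillin", "Kanamycin"]
sorted_names = sorted(drug_names)
print(sorted_names)  # ['Ampicillin', 'Kanamycin', 'Tetracycline']
print(drug_names)    # the original list is unchanged

# Text sorting compares character by character, not by numeric value.
rep_names = ["rep2", "rep10", "rep1"]
print(sorted(rep_names))  # ['rep1', 'rep10', 'rep2']

# Sorting by the number after "rep" instead of by the text itself.
by_number = sorted(rep_names, key=lambda name: int(name[len("rep"):]))
print(by_number)  # ['rep1', 'rep2', 'rep10']
```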

## Finding files with patterns using `glob`

Sometimes you want to load all files of a particular type from a folder, like "all CSV files in this directory" or "all PNG images from this experiment".

The `glob` method finds file paths that match a pattern using **wildcards**:

- `*.csv` means: any filename that ends in `.csv`
- `data_*.txt` means: any filename that starts with `data_` and ends in `.txt`
- `*` by itself means: everything

Wildcards are a quick way to describe groups of files without listing each one by hand. Note that the `glob` method only finds the paths; it does not load or open the files.
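
To see the three patterns above in action without depending on the course data, here is a sketch that creates a few empty files in a temporary folder and matches them. The filenames are invented for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Create four empty files with made-up names.
    for name in ["data_a.txt", "data_b.txt", "notes.txt", "curve.csv"]:
        (folder / name).touch()

    # glob returns paths in no particular order, so we sort the names.
    csv_names = sorted(p.name for p in folder.glob("*.csv"))
    data_txt_names = sorted(p.name for p in folder.glob("data_*.txt"))
    all_names = sorted(p.name for p in folder.glob("*"))

print(csv_names)       # ['curve.csv']
print(data_txt_names)  # ['data_a.txt', 'data_b.txt']
print(all_names)       # ['curve.csv', 'data_a.txt', 'data_b.txt', 'notes.txt']
```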

Let's list all the CSV files inside the first drug folder.

In [None]:
# Pick the first drug folder in the list
example_drug_dir = drug_folders[0]
example_drug_dir

If we list all the files, we see there are some non-CSV files with other information in them.

In [None]:
list(example_drug_dir.iterdir())

We can use `*.csv` to find only the CSV files:

In [None]:
example_csv_files = sorted(list(example_drug_dir.glob("*.csv")))

print("Number of CSV files found:", len(example_csv_files))
for p in example_csv_files:
    print(p.name) # Note, here name prints the filename

## Loading one growth curve file

Each growth curve CSV contains two columns as before, Time and OD. The first row is a header, so we need to remember to tell NumPy to skip it.

Here is example code that loads the first file in `example_drug_dir` and prints some statistics.

In [None]:
data = np.loadtxt(example_csv_files[0], delimiter=",", skiprows=1)

time = data[:, 0]
od = data[:, 1]

print("File:", example_csv_files[0].name)
print("Number of time points:", len(time))
print("First time:", time[0])
print("Last time:", time[-1])
print("First OD:", od[0])
print("Last OD:", od[-1])

### Exercise 1

Write code to load both CSV files in `example_drug_dir` and calculate the average of their last OD measurements. Use paths and a `for` loop over the files, as we have been discussing, rather than hard-coding the file names.

In [None]:
# Your code here







### Extracting metadata from filenames

In real workflows, important information often lives in folder names and filenames.

Our files follow a simple naming pattern:

`DrugName_rep1.csv`

The part before `.csv` is called the **stem**.

We can access pieces of a path like this:

- `path.name` gives the full filename
- `path.stem` gives the filename without the extension

Below we print the Path, name, and stem of `example_csv_files[0]`:


In [None]:
example_csv_files[0], example_csv_files[0].name, example_csv_files[0].stem
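
Once you have the stem, ordinary string methods can pull out the pieces. The sketch below uses the placeholder name `DrugName_rep1.csv` from the naming convention above, and assumes the replicate label always follows the last underscore and always starts with `rep`. Real filenames may need different handling.

```python
from pathlib import Path

# Placeholder path following the naming convention (no file is opened).
path = Path("data") / "ecoli_drug_curves" / "DrugName" / "DrugName_rep1.csv"

# Split the stem once, at the last underscore.
drug_name, rep_label = path.stem.rsplit("_", 1)

# Drop the "rep" prefix and convert the rest to an integer.
rep_number = int(rep_label[len("rep"):])

print(drug_name, rep_label, rep_number)  # DrugName rep1 1

# The enclosing folder name carries the same information.
print(path.parent.name)  # DrugName
```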

## Batch processing multiple files

We will now use `for` loops to go systematically through all the folders in the `ecoli_drug_curves` folder and compile a list of the number of replicates in each one.

In [None]:
replicate_counts = []

for i in range(len(drug_folders)):
    # Find all CSV files in this drug folder
    replicate_files = list(drug_folders[i].glob("*.csv"))

    # Store the count in a simple list
    replicate_counts.append(len(replicate_files))

print("Number of replicates by drug:")
for i in range(len(drug_folders)):
    print(drug_folders[i].name + ": " + str(replicate_counts[i]))
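
The same counting can be written without index variables by looping over the folders directly and pairing names with counts using `zip`. This is an alternative sketch, not a replacement for the cell above. It builds a tiny fake folder layout (`DrugA`, `DrugB` are invented names) in a temporary directory so that it runs on its own.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Build a fake layout: two drug folders with different replicate counts.
    for drug, n_reps in [("DrugA", 2), ("DrugB", 3)]:
        (root / drug).mkdir()
        for rep in range(1, n_reps + 1):
            (root / drug / f"{drug}_rep{rep}.csv").touch()

    # Keep only folders, sorted, then count the CSV files in each one.
    folders = sorted(p for p in root.iterdir() if p.is_dir())
    counts = [len(list(folder.glob("*.csv"))) for folder in folders]

    # Pair each folder name with its count.
    summary = {folder.name: count for folder, count in zip(folders, counts)}

print(summary)  # {'DrugA': 2, 'DrugB': 3}
```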

### Exercise 2

Write code to go through all the drug folders and plot the final OD measurement averaged over the two replicates and display a summary of the results.

In [None]:
# Your code goes here





## Writing output tables to disk

We will often want to write our results out to a file. It's a good idea to put these files into a new folder to keep them separate from the raw data.

First, let's make a folder called `processed` inside the `data` folder. Once we make the Path object, we can call its `mkdir` method to create the new folder. The `parents=True` argument creates any missing parent folders, and `exist_ok=True` prevents an error if the folder already exists.

In [None]:
processed_dir = Path("data") / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)
processed_dir
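
To see what the two arguments protect against, here is a sketch that works in a temporary directory so it does not touch your real `data` folder. The variable names are chosen for this example only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "processed"

    # "data" does not exist yet, so parents=True is needed to create it too.
    target.mkdir(parents=True, exist_ok=True)

    # Running the same line again is harmless because of exist_ok=True.
    target.mkdir(parents=True, exist_ok=True)
    created = target.is_dir()

    # Without exist_ok=True, a second mkdir raises FileExistsError.
    try:
        target.mkdir()
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)  # True True
```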

Let's write a simple table of replicate counts into the `processed` folder. Each row will have the drug name and how many replicate CSV files were found.

We will use `csv.writer`, which writes one row at a time. First we write the header with `writerow(["drug", "n_replicates"])`, then in a loop we call `writerow([...])` for each drug. The `open` function creates the file (in write mode `"w"`, which overwrites any existing file of that name), and `newline=""` prevents extra blank lines on some systems.

In [None]:
replicate_counts_path = processed_dir / "replicate_counts.csv"

with open(replicate_counts_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["drug", "n_replicates"])
    for i in range(len(drug_folders)):
        writer.writerow([drug_folders[i].name, replicate_counts[i]])

print("Wrote:", replicate_counts_path)
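
A simple way to check a file you have just written is to read it back. The sketch below writes a small table with invented drug names to a temporary folder and reads it with `csv.reader`, the reading counterpart of `csv.writer`. Notice that every value comes back as a string, including the counts.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "replicate_counts.csv"

    # Write a header and two rows (made-up values).
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["drug", "n_replicates"])
        writer.writerow(["DrugA", 2])
        writer.writerow(["DrugB", 3])

    # Read the file back; each row becomes a list of strings.
    with open(out_path, newline="") as f:
        rows = list(csv.reader(f))

print(rows)
# [['drug', 'n_replicates'], ['DrugA', '2'], ['DrugB', '3']]
```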

## Introduction to pandas

So far we have used `pathlib` for navigating folders and the `csv` module for writing data files. These tools work, but when working with tabular data (rows and columns like a spreadsheet), there is a more powerful and convenient tool: **pandas**.

Pandas is the standard Python library for working with tabular data. It provides:

- **DataFrames**: A structure similar to a spreadsheet with named columns
- **Easy CSV reading and writing**: Handles headers, mixed data types, and missing values automatically
- **Intuitive column access**: Use names instead of numerical indices
- **Powerful filtering and grouping**: Select rows based on conditions, aggregate data by categories
- **Integration with plotting**: DataFrames can plot themselves

We won't cover pandas in depth in this lecture, but let's see how it can simplify some of the tasks we just did. Think of this as a preview of a tool you will use often in data analysis.
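
As a taste of the filtering and grouping mentioned above, here is a minimal sketch on a small hand-made table. The drug names and OD values are invented for illustration and are not taken from the course data.

```python
import pandas as pd

# A tiny made-up table: one row per replicate.
df_demo = pd.DataFrame({
    "drug": ["DrugA", "DrugA", "DrugB", "DrugB"],
    "final_od": [0.2, 0.4, 1.0, 1.2],
})

# Filtering: keep only the rows where the final OD is above 0.5.
high_od = df_demo[df_demo["final_od"] > 0.5]
print(high_od)

# Grouping: average the final OD separately for each drug.
mean_by_drug = df_demo.groupby("drug")["final_od"].mean()
print(mean_by_drug)
```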


### Importing pandas

Like NumPy and matplotlib, pandas is not part of core Python and must be imported. The standard convention is to import it as `pd`.


In [None]:
import pandas as pd


### Reading CSV files with pandas

Earlier in this course we used `np.loadtxt` to read CSV files. This works, but has limitations:

- You must manually skip header rows with `skiprows=1`
- All data must be numeric (strings cause errors)
- You access columns by numerical index (was Time column 0 or 1?)

Pandas makes this much easier. Let's load one of the growth curve CSV files we saw earlier.


In [None]:
# Load a growth curve file using pandas into a dataframe called `df`
df = pd.read_csv(example_csv_files[0])

print("File loaded:", example_csv_files[0])
print("\nFirst few rows:")
df.head()


Notice that:

- Pandas automatically recognized and used the header row
- The output displays as a nice table
- `.head()` shows the first 5 rows by default

The object `df` is a **DataFrame**, which is like a table where each column has a name.


### Accessing columns by name

With NumPy arrays, we had to remember that Time was column 0 and OD was column 1. With pandas, we can use the column names directly.


In [None]:
# Access columns by name
time = df['Time']
od = df['OD']

print("First time:", time.iloc[0])
print("Last time:", time.iloc[-1])
print("First OD:", od.iloc[0])
print("Last OD:", od.iloc[-1])


Much more readable! `df['Time']` says "get the Time column" in plain English.

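If you need a NumPy array from a pandas column (for example, to use with matplotlib), you can convert it with `.values`. The pandas documentation recommends the equivalent method `.to_numpy()` for new code, so you will see both forms.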


In [None]:
time_array = df['Time'].values
od_array = df['OD'].values

print("Type of df['Time']:", type(df['Time']))
print("Type of df['Time'].values:", type(time_array))


### Writing CSV files with pandas

Earlier we used the `csv` module to write our replicate counts table. It took several lines and required a loop. With pandas, we can do the same thing much more simply.

First, let's build a DataFrame from our drug names and replicate counts:


In [None]:
# Create a DataFrame from our drug data
results_df = pd.DataFrame({
    'drug': [f.name for f in drug_folders],
    'n_replicates': replicate_counts
})

print("DataFrame contents:")
results_df


Now we can write it to a CSV file with a single line:


In [None]:
# Write to CSV using pandas
pandas_output_path = processed_dir / "replicate_counts_pandas.csv"
results_df.to_csv(pandas_output_path, index=False)

print("Wrote:", pandas_output_path)


Compare this to the `csv` module version we wrote earlier:

```python
# The csv module way (5 lines)
with open(replicate_counts_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["drug", "n_replicates"])
    for i in range(len(drug_folders)):
        writer.writerow([drug_folders[i].name, replicate_counts[i]])

# The pandas way (1 line, after creating the DataFrame)
results_df.to_csv(pandas_output_path, index=False)
```

The `index=False` parameter tells pandas not to write row numbers to the file.
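
To see the effect of `index=False`, you can call `to_csv` without a path, in which case it returns the CSV text as a string instead of writing a file. The table below uses invented drug names, a sketch for comparison only.

```python
import pandas as pd

table = pd.DataFrame({"drug": ["DrugA", "DrugB"], "n_replicates": [2, 3]})

# With no path given, to_csv returns the text it would have written.
with_index = table.to_csv().splitlines()
without_index = table.to_csv(index=False).splitlines()

print(with_index)
# [',drug,n_replicates', '0,DrugA,2', '1,DrugB,3']
print(without_index)
# ['drug,n_replicates', 'DrugA,2', 'DrugB,3']
```

With the default setting, each row starts with its row number and the header gains an unnamed first column, which is rarely what you want in a results table.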


### Combining pathlib and pandas

One powerful pattern is to use **pathlib for navigation** and **pandas for data**.

Here's an example that loads all CSV files from a drug folder and computes the average final OD:


In [None]:
# Use pathlib to find files
drug_folder = drug_folders[0]
csv_files = list(drug_folder.glob("*.csv"))

# Use pandas to load and process data
final_ods = []
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    final_od = df['OD'].iloc[-1]  # Last OD value
    final_ods.append(final_od)

average_final_od = np.mean(final_ods)

print(f"Drug: {drug_folder.name}")
print(f"Number of replicates: {len(csv_files)}")
print(f"Final OD values: {final_ods}")
print(f"Average final OD: {average_final_od:.3f}")
