In [2]:
%pip install pandas

In [3]:
import pandas as pd


# Extracting Metadata from Strings



In neuroscience, we often work with large datasets whose file naming conventions encode crucial metadata, which helps in finding the relevant files for a given analysis. String manipulation (here, extracting structured data from text written in a machine-readable pattern) makes it possible to pull out this information efficiently and streamline data processing workflows.




## Extracting Metadata from Fixed-Length Strings using String Slicing

| Code | Description |
| :--- | :--- |
| **Indexing by Position (i.e. "Slicing" a String)** |   |
| **`"BonnKölnAachen"[:4]`** | Extracts the first four characters 'Bonn' |
| **`"BonnKölnAachen"[4:8]`** | Extracts the characters from position 4 to 7, resulting in 'Köln' |
| **`"BonnKölnAachen"[8:]`** | Extracts all characters from position 8 onwards, resulting in 'Aachen' |
| **`"BonnKölnAachen"[4:6]`** | Extracts the characters from position 4 to 5, resulting in 'Kö' |
| **`"BonnKölnAachen"[6:8]`** | Extracts the characters from position 6 to 7, resulting in 'ln' |
| **`"BonnKölnAachen"[-6:]`** | Extracts the last six characters, resulting in 'Aachen' |

A slice has the form `text[start:stop]`: the character at `start` is included, the character at `stop` is excluded, and leaving out either one extends the slice to the beginning or the end of the string. Negative positions count backwards from the end of the string.
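
As a small illustration of these slicing rules, here is a minimal sketch on a made-up fixed-width label. The label and its layout (3-letter site code, 4-digit year, 2-digit run number) are assumptions chosen for this example, not a convention used elsewhere in this notebook.

```python
# Hypothetical fixed-width label: 3-letter site code, 4-digit year, 2-digit run
label = "TUE202407"

site = label[:3]        # positions 0, 1, 2
year = int(label[3:7])  # positions 3 to 6; the stop position 7 is excluded
run = int(label[-2:])   # the last two characters, "07", converted to the int 7

print(site, year, run)
```

Because every field has a known width, the slice boundaries can be worked out once by adding up the field widths and then reused for every label that follows the same layout.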

**Exercises**

This researcher had a rule for her filenames: she would store session metadata in **fixed-length** strings, with information always in the same order:
  - **Subject Name**: 6 Characters
  - **Date**: 8 Characters
  - **Treatment Group**: 7 Characters
  - **Session Number**: 5 Characters ("sess" and then the number)

That way, when she later needed the information, she could extract it from the filename just by slicing it!

Use the following filename to extract the requested data:

In [4]:
fname = "Arthur20241008controlsess1.txt"   # Filename convention: Subject, Date, Group, Session

**Example**: What subject name's data is in this file?

In [5]:
fname[:6]

'Arthur'

What group is this subject in?

In [8]:
fname[14:21]

'control'

What Session number was this?  (Note: after extracting the number, turn it from a string into an int with the `int()` function.)

In [9]:
int(fname[25])

1

Extract all four metadata variables from the following filename and put them into their own variables. Note that the subject has fewer than 6 characters in their name, so it is padded with underscores. After slicing the data, you can replace the underscore characters with empty strings by using the `replace()` method on strings (e.g. `"name__".replace('_', '')`):

In [None]:
fname = "Joe___20241009experimsess1.txt"  # Filename convention: Subject, Date, Group, Session

('Joe', '20241009', 'experim', 1)

Make a dictionary with the keys "Subject", "Date", "Group", and "SessionNum" with the data from this filename:

In [None]:
fname = "Arthur20241008controlsess1.txt"   # Filename convention: Subject, Date, Group, Session

{'Subject': 'Arthur', 'Date': '20241008', 'Group': 'control', 'SessionNum': 1}

Building a table of metadata usually has the following steps, which can be done in a loop:

1. Extract data into a dictionary
2. Append the dictionary into a list of dictionaries
3. Change the list of dictionaries into a data frame (the table)

**Example**: The loop below extracts the data from each filename into a session dictionary. The original filename is kept in its own column, to make finding the file later simpler:

In [7]:
fnames = ["a2.txt", "b3.txt"]

In [8]:
all_sessions = []
for fname in fnames:
    session = {
        "Letter": fname[0],
        "Number": int(fname[1]),
        "Filename": fname,
    }
    all_sessions.append(session)

all_sessions

[{'Letter': 'a', 'Number': 2, 'Filename': 'a2.txt'},
 {'Letter': 'b', 'Number': 3, 'Filename': 'b3.txt'}]

**Example**: Use the Pandas library to turn this list of dictionaries into a table:

In [None]:
# %pip install pandas

In [9]:
import pandas as pd
df = pd.DataFrame(all_sessions)
df

|   | Letter | Number | Filename |
| :--- | :--- | :--- | :--- |
| 0 | a | 2 | a2.txt |
| 1 | b | 3 | b3.txt |



**Exercise**: Fill in the missing data extraction code for the filenames below to make a session table. Include the original filename in its own column, to make finding the file later simpler:


In [10]:
fnames = ["Arthur20241008controlsess1.txt", "Joseph20241009controlsess1.txt", "Arthur20241010treatmesess2.txt", "Joseph20241011controlsess2.txt"]
fnames

['Arthur20241008controlsess1.txt',
 'Joseph20241009controlsess1.txt',
 'Arthur20241010treatmesess2.txt',
 'Joseph20241011controlsess2.txt']

## Extracting Metadata from Variable-Length, Character-Separated Strings using String Splitting

This section covers a more flexible filename convention: variable-length, character-separated strings. It is useful when the length of a metadata field varies, as with names of different lengths. Each piece of metadata in the filename is separated by a specific character (like an underscore "_"), so a field can be any length as long as it does not contain the separator itself. For example:

`<Subject>_<Date>_<SessionCondition>_<SessionNum>.<FileExtension>`

The filename convention here uses underscores to separate different data elements and a dot to denote the file extension. For example, a filename like "Joe_20230101_Control_01.txt" is easily parsed into its constituent parts: subject name, date, session condition, and session number. You'll learn to use the `split` method in Python, which divides a string into a list of substrings based on a specified separator.


| Code | Description |
| :--- | :--- |
| **`values = "hello_world".split('_')`** | Splits the string "hello_world" at underscores, resulting in a list: ['hello', 'world'] |
| **`hello = "hello_world".split('_')[0]`** | Splits "hello_world" at underscores and takes the first element, resulting in 'hello' |
| **`world = "hello world".split(' ')[1]`** | Splits "hello world" at spaces and takes the second element, resulting in 'world' |
| **`hello, world = "hello world".split(' ')`** | Splits "hello world" at spaces and assigns the elements to variables 'hello' and 'world' |
| **`basename, extension = "filename.txt".split('.')`** | Splits "filename.txt" at the dot and assigns the elements to 'basename' and 'extension', resulting in 'filename' and 'txt' |
| **`hello, *rest = "hello dog cat bunny cow".split(' ')`** | Splits "hello dog cat bunny cow" at spaces, assigns 'hello' to the first variable and the rest of the elements to 'rest' as a list |

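One caveat with `basename, extension = "filename.txt".split('.')` is that it raises a `ValueError` when the filename contains more than one dot, because the split then produces more than two pieces. The sketch below shows one possible way around this, using `rsplit` with a maximum of one split so that only the last dot is used. The filename is a made-up example.

```python
# Hypothetical filename in which the subject name itself contains a dot
fname = "Dr.Smith_20241008_control_1.txt"

# rsplit(".", 1) splits once, starting from the right, so only the last dot is used
base, ext = fname.rsplit(".", 1)

# Unpack the four underscore-separated fields into their own variables
subject, date, group, sess_num = base.split("_")
sess_num = int(sess_num)  # split() always returns strings, so convert explicitly

print(subject, date, group, sess_num, ext)
```

The same idea applies to the field separator: if a field could contain an underscore, a different separator character would be needed for the convention to stay unambiguous.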

**Exercises**

**Example**: The filename convention here is `<Subject>_<Date>_<Group>_<SessionNum>.<FileExtension>`.  Extract the date from this filename into its own variable:

In [3]:
fname = "Arthur_20241008_control_1.txt"

In [4]:
base, ext = fname.split('.')
data = base.split('_')
date = data[1]
date

'20241008'

Extract the Group from this filename into its own variable:

In [1]:
fname = "Arthur_20241008_control_1.txt"

Extract all the data from this filename into a dictionary:

In [2]:
fname = "Arthur_20241008_control_1.txt"

Use the filenames below to extract data into a session metadata table in a for-loop (feel free to copy-paste and adjust the solution from the previous section!) Include the original filename in its own column, to make finding the file later simpler:

In [5]:
fnames = ["Arthur_20241008_control_1.txt", "Josephine_20241009_control_1.txt", "Arthur_20241010_treatment_2.txt", "Joseph_20241011_control_2.txt"]
fnames

['Arthur_20241008_control_1.txt',
 'Josephine_20241009_control_1.txt',
 'Arthur_20241010_treatment_2.txt',
 'Joseph_20241011_control_2.txt']

## Self-Describing Metadata: Getting Key-Values Directly from a String

### Searching the String for Patterns using index()

In this section, we focus on extracting self-describing metadata from strings using pattern searching, a technique especially useful in scenarios where data is embedded within a string in a predictable manner. This method is crucial when dealing with filenames or text data where specific metadata follows a known pattern or a set keyword. Neuroscience researchers often encounter such situations, for instance, when filenames or data entries include coded information like session numbers or participant IDs embedded within them.

The table below shows how to use the `index()` method to find a specific pattern in a string and then slice out the relevant information:

| Code | Description |
| :--- | :--- |
| **`idx = "JoeSess1".index("Sess")`** | Finds the index of the substring "Sess" in the string "JoeSess1", storing the position in `idx` |
| **`sessNum = "JoeSess1"[idx+4 : idx+5]`** | Extracts the session number following "Sess" by slicing from `idx+4` to `idx+5`, resulting in '1' |
| **`idx = "Data202302_experiment".index("2023")`** | Finds the index of the year "2023" in the string, useful for extracting the year data |
| **`year = "Data202302_experiment"[idx : idx+4]`** | Extracts the year "2023" from the string by slicing from the found index |
| **`idx = "experiment_control_groupB".index("group")`** | Finds the index of "group" in the string, indicating the start of group information |
| **`group = "experiment_control_groupB"[idx+5:]`** | Extracts the group identifier 'B' from the string after "group" |

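The find-then-slice pattern from the table can be wrapped in a small helper. The sketch below is only an illustration: the function name `value_after` and the label format are hypothetical. It also shows that `index()` raises a `ValueError` when the pattern is missing, whereas the related `find()` method returns `-1`.

```python
def value_after(text, key, stop):
    """Return the text between `key` and the next occurrence of `stop`."""
    start = text.index(key) + len(key)  # position just after the key
    end = text.index(stop, start)       # first `stop` at or after `start`
    return text[start:end]

label = "mouse12_depth=350um_rate=30Hz"  # hypothetical label

depth = int(value_after(label, "depth=", "um"))
rate = int(value_after(label, "rate=", "Hz"))
print(depth, rate)

# index() raises ValueError for a missing pattern; find() returns -1 instead
try:
    label.index("gain=")
    has_gain = True
except ValueError:
    has_gain = False
print(has_gain, label.find("gain="))
```

Passing `start` as the second argument to `index()` makes the search for the stop pattern begin after the key, which avoids matching an earlier occurrence of the same characters.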

**Exercises**

The following filenames have a different file naming convention:

`<SessionID>_<BrainRegion>-d1=<ImageHeightInPixels>,d2=<ImageWidthInPixels>.<FileExtension>`

**Example**: Using the index to find the `d1=` section from this filename, extract the image height:

In [5]:
fname = "242_CA1-d1=720,d2=1080.tif"

In [None]:
start_idx = fname.index("d1=") + len("d1=")
end_idx = fname.index(",")
height = int(fname[start_idx:end_idx])
height

Using the index to find the `d2=` section from this filename, extract the image width:

In [25]:
fname = "2045_CA3-d1=1080,d2=720.tif"

720

Using the index to find the `_` section from this filename, extract the brain region:

In [7]:
fname = "24_DG-d1=720,d2=720.tif"

Extract all the data from the following filenames in a loop to build a session table.

In [46]:
fnames = ["242_CA1-d1=720,d2=1080.tif", "2045_CA3-d1=1080,d2=720.tif", "24_DG-d1=720,d2=720.tif", "52313_CA1-d1=720,d2=720.tif", "4_DG-d1=1080,d2=1080.tif"]
fnames

['242_CA1-d1=720,d2=1080.tif',
 '2045_CA3-d1=1080,d2=720.tif',
 '24_DG-d1=720,d2=720.tif',
 '52313_CA1-d1=720,d2=720.tif',
 '4_DG-d1=1080,d2=1080.tif']

### Variable-Length Data on Variable Keys: Using a Double-Separator to Store Keys Directly in the Filename

**Introduction:**
Extracting the key-value pairs in a filename can be fully automated when they use a double-separator method. This technique is particularly useful when dealing with variable-length data and keys, a common scenario in scientific data management, including neuroscience research. By embedding key-value pairs in the filename itself, researchers can create self-descriptive files that contain crucial metadata in an organized and accessible format.

**`"sess=232_subj=Bill_grp=Control.txt"`**

In this method, filenames are constructed using two separators: one to separate different metadata elements (e.g., '_') and another to distinguish between keys and their corresponding values (e.g., '='). For example, in the filename above, each underscore separates different metadata items, and the equals sign distinguishes the key (e.g., 'sess', 'subj', 'grp') from its value. 

Here, we'll practice splitting these filenames to extract each key-value pair and store them in a Python dictionary. This practice is invaluable for organizing data in a way that is both human-readable and easily parsed programmatically, streamlining data analysis and retrieval.

**Reference Table:**

| Code | Description |
| :--- | :--- |
| **`base, ext = fname.split('.')`** | Splits the filename at the dot to separate the base name from the file extension |
| **`data = {}`** | Initializes an empty dictionary to store the extracted metadata |
| **`for item in items:`** | Starts a for-loop, iterating over each item in a sequence |
| **`key, value = item.split('=')`** | Splits one item at the equals sign into its key and its value |
| **`data[key] = value`** | Assigns the value to its respective key in the dictionary |


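As an alternative to filling the dictionary step by step, the built-in `dict()` accepts a sequence of key-value pairs, so the splitting can be written in one expression. This is a minimal sketch on a made-up filename, not a replacement for the loop-based approach in the table.

```python
fname = "animal=M07_day=3_task=maze.csv"  # hypothetical filename

base, ext = fname.rsplit(".", 1)

# Each item.split("=") returns a [key, value] pair; dict() builds a dictionary from the pairs
data = dict(item.split("=") for item in base.split("_"))

# All values come out as strings, so numeric ones are converted explicitly
data["day"] = int(data["day"])

print(data)
```

Both versions assume that every item contains exactly one key-value separator; an item without one would raise a `ValueError`.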



**Exercises**

**Example**: Extract all the data from the filename:

In [18]:
fname = "sess=232_subj=Bill_grp=Control.txt"

In [17]:
base, ext = fname.split('.')
data = {}
for item in base.split('_'):
    key, value = item.split('=')
    data[key] = value

data

{'sess': '232', 'subj': 'Bill', 'grp': 'Control'}

Extract all the data from the filename:

In [21]:
fname = "day-22 clinic-Tuebingen room-3.dat"

Extract all the data from the following filenames in a loop to build a session table. Include the original filename in its own column, to make finding the file later simpler:

In [58]:
fnames = ["sessId-11_height-720_width-1028_region-DG.tif", "sessId-13_height-720_width-720.tif", "height-720_width-1028_region-DG_sessId-110.tif", "height-720_width-1028_region-DG_sessId-110_quality-bad.tif"]
fnames

['sessId-11_height-720_width-1028_region-DG.tif',
 'sessId-13_height-720_width-720.tif',
 'height-720_width-1028_region-DG_sessId-110.tif',
 'height-720_width-1028_region-DG_sessId-110_quality-bad.tif']

## (Demo) Making Data Model Contracts Explicit With Schemas

In scientific research, particularly in fields like neuroscience, it's crucial to have a clear understanding of the data structure you're working with. A schema, or a data model contract, serves as a blueprint for the data, outlining its format and the relationships between different data elements. By defining these contracts explicitly, you ensure that your data adheres to a specific structure, which facilitates more efficient and error-free data processing.


In this demonstration, we explore the use of Python's built-in `namedtuple` feature from the `collections` module to create explicit schemas. A `namedtuple` allows you to create tuple-like objects that are accessible via named fields, making your code more self-documenting and easy to understand.  

The example below shows how this works:


In [26]:
from collections import namedtuple

# The Schema
MetadataModel = namedtuple("MetadataModel", "subject date group sess_num")

# Extracting the data
fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')

# Putting the data into the schema
data_tuple = MetadataModel(*base.split('_'))
data_tuple


MetadataModel(subject='Arthur', date='20241008', group='control', sess_num='1')

Named tuples can be converted to dictionaries using the `_asdict()` method.

In [8]:
data_dict = data_tuple._asdict()
data_dict

{'subject': 'Arthur', 'date': '20241008', 'group': 'control', 'sess_num': '1'}

Python comes with several built-in utilities for making these schemas; below is a reference comparing three of them.  Very handy for writing well-documented code! Each of these is a way to create structured data types, but they have different features and use cases:

| Feature/Tool | `collections.namedtuple` | `typing.NamedTuple` | `dataclasses.dataclass` |
| :--- | :--- | :--- | :--- |
| **Module** | `collections` | `typing` | `dataclasses` |
| **Basic Use** | Creates tuple-like objects with named fields | Extends `namedtuple` with type hints | Creates classes with built-in methods for handling data |
| **Syntax** | `Point = namedtuple('Point', ['x', 'y'])` | `class Point(NamedTuple): x: int; y: int` | `@dataclass class Point: x: int; y: int` |
| **Mutability** | Immutable | Immutable | Mutable by default, can be made immutable |
| **Type Annotations** | Not supported natively | Supports type annotations | Supports type annotations |
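| **Default Values** | Supported through the `defaults` argument (Python 3.7+) | Supports default values | Supports default values |
| **Inheritance** | The generated class can be subclassed, for example to add methods | Cannot be combined with arbitrary other base classes in the class definition; the resulting class can be subclassed | Can inherit from other classes |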
| **Field Ordering** | Maintains order of fields | Maintains order of fields | Maintains order of fields |
| **Methods** | Tuple methods plus helpers such as `_asdict()` and `_replace()` | Same helpers, and can define additional methods | Can define methods, and comes with built-in methods like `__init__`, `__repr__`, etc. |
| **Use Case** | Simple use cases where a lightweight, immutable container is needed | When you need immutable containers with type hinting | Ideal for more complex data structures requiring mutability and additional functionality |

Each of these tools serves a different purpose:

- `collections.namedtuple` is great for when you need a simple, lightweight container with named fields.
- `typing.NamedTuple` is useful for a similar purpose but with the added benefit of type hints.
- `dataclasses.dataclass` is more suited for complex data structures where you might need mutability, default values, and built-in methods for common tasks.

Choosing the right tool depends on your specific needs, especially in terms of complexity, mutability, and the requirement for type hinting.
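
To make the comparison more concrete, here is a minimal sketch of the same session schema written with `typing.NamedTuple` and with `dataclasses.dataclass`. The class names `SessionTuple` and `SessionRecord` are made up for this illustration. Note that the type annotations document the intended types but are not enforced at runtime, so the session number is still converted with `int()` by hand.

```python
from dataclasses import dataclass
from typing import NamedTuple


class SessionTuple(NamedTuple):  # hypothetical name: immutable schema with type hints
    subject: str
    date: str
    group: str
    sess_num: int = 1  # default value


@dataclass
class SessionRecord:  # hypothetical name: mutable schema with type hints
    subject: str
    date: str
    group: str
    sess_num: int = 1


fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split(".")
subject, date, group, sess_num = base.split("_")

as_tuple = SessionTuple(subject, date, group, int(sess_num))
as_record = SessionRecord(subject, date, group, int(sess_num))

# A dataclass instance is mutable by default; a NamedTuple instance is not
as_record.group = "treatment"
try:
    as_tuple.group = "treatment"
except AttributeError:
    print("NamedTuple fields cannot be reassigned")

print(as_tuple)
print(as_record)
```

Like `collections.namedtuple`, the `typing.NamedTuple` version still offers `_asdict()`, so it can be turned into a dictionary row for a session table in the same way.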