# RSE GitHub Repo Classifier

## Data Prep

Load the annotated training dataset from the project, then use a thread pool to request the README content for each repository in the `github_link` column.
Store the results in a new dataframe.

In [None]:
from soft_search.data import load_soft_search_2022_training

df = load_soft_search_2022_training()
df.sample(3)

In [None]:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException
from tqdm.contrib.concurrent import thread_map
import time
from random import randint

def _process_row(row):
    # Request GitHub page
    try:
        github_page = requests.get(row.github_link, timeout=30)
        github_page.raise_for_status()
    except RequestException:
        # Covers HTTP errors as well as connection failures and timeouts
        row["readme_text"] = None
        return row

    # Find and scrape README text
    soup = BeautifulSoup(github_page.content, "html.parser")
    readme_container = soup.find(id="readme")
    if readme_container is None:
        # Placeholder string (not null), so a later dropna() keeps this row
        row["readme_text"] = "no readme available"
        return row

    # Add the readme text to the row
    row["readme_text"] = readme_container.text
    # Pause for a random 0-5 seconds to space out requests
    time.sleep(randint(0, 5))
    return row

In [None]:
import pandas as pd

rows = thread_map(_process_row, [row for _, row in df.iterrows()], total=len(df))
df = pd.DataFrame(rows)
df.to_parquet("github-readme-with-software-prediction.parquet")
df.sample(3)
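`thread_map` from `tqdm` is driven by `concurrent.futures.ThreadPoolExecutor` and adds a progress bar. As a rough sketch of the same pattern using only the standard library, the example below maps a function over a list of links on worker threads. The `fetch_readme_stub` function and the `example-org` links are hypothetical stand-ins, so no network request is made.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_readme_stub(link):
    # Hypothetical stand-in for the real HTTP request: no network access,
    # it only derives a fake README string from the repository name.
    repo_name = link.rstrip("/").split("/")[-1]
    return f"README for {repo_name}"


# Made-up links, used only for illustration
links = [
    "https://github.com/example-org/repo-a",
    "https://github.com/example-org/repo-b",
    "https://github.com/example-org/repo-c",
]

# executor.map runs the calls on worker threads but yields results
# in the same order as the inputs, so they line up with the rows.
with ThreadPoolExecutor(max_workers=2) as executor:
    readmes = list(executor.map(fetch_readme_stub, links))

for link, readme in zip(links, readmes):
    print(link, "->", readme)
```

Because results come back in input order, they can be passed straight to `pd.DataFrame` without re-sorting, which is what the cell above relies on.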

## Model Training

I wrote a package called `lazy-text-classifiers` a while back that lets me train several different models in a single go and see what works best.

I'm installing it locally here so I can make minor additions and changes (adding and removing certain models).
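To make the "train several models and see what works best" idea concrete, here is a minimal pure-Python sketch of that loop. It is not the `lazy-text-classifiers` API: the two toy classifiers (`fit_majority`, `fit_keyword`), the label names, and the data are all made up for illustration.

```python
from collections import Counter


def fit_majority(texts, labels):
    # Baseline: always predict the most common training label.
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda text: most_common


def fit_keyword(texts, labels):
    # Rule-based: predict "software" when an install hint appears.
    # This toy model ignores the training data entirely.
    return lambda text: (
        "software" if "pip install" in text.lower() else "not-software"
    )


def accuracy(model, texts, labels):
    correct = sum(model(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Made-up toy data, only to exercise the loop
train_x = ["pip install foo", "lecture notes", "slides", "pip install bar", "essay"]
train_y = ["software", "not-software", "not-software", "software", "not-software"]
test_x = ["Run pip install baz", "course syllabus", "pip install qux", "paper draft"]
test_y = ["software", "not-software", "software", "not-software"]

# Fit every candidate on the same split and collect one result row per model
candidates = {"majority": fit_majority, "keyword": fit_keyword}
results = []
for name, fit in candidates.items():
    model = fit(train_x, train_y)
    results.append({"model": name, "accuracy": accuracy(model, test_x, test_y)})

# Best model first
results.sort(key=lambda row: row["accuracy"], reverse=True)
for row in results:
    print(row)
```

The key design point is that every candidate sees the same train and test split, so the metric rows are directly comparable.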

In [None]:
!pip install -e ../../personal/lazy-text-classifiers/

In [1]:
import pandas as pd

df = pd.read_parquet("github-readme-with-software-prediction.parquet")
df = df.dropna(subset=["label", "readme_text"])
df.readme_text = df.readme_text.str.strip()
df.sample(3)

| | label | github_link | nsf_award_id | nsf_award_link | from_template_repo | is_a_fork | abstract_text | project_outcomes | readme_text |
|---|---|---|---|---|---|---|---|---|---|
| 1377 | software-predicted | https://github.com/junyanz/VON | 1524817 | https://www.nsf.gov/awardsearch/showAward?AWD_... | False | False | The goal of this work is to develop a set of t... | It is an exciting time for computer vision. Wi... | Visual Object Networks\nExample results\nMore ... |
| 115 | software-not-predicted | https://github.com/sugwg/sn-core-bounce-pe | 1836702 | https://www.nsf.gov/awardsearch/showAward?AWD_... | False | False | The NSF's Advanced LIGO and the European Virgo... | Albert Einstein predicted gravitational waves,... | Inferring physical properties of stellar colla... |
| 977 | software-predicted | https://github.com/IBPA/SBROME | 1146926 | https://www.nsf.gov/awardsearch/showAward?AWD_... | False | False | Synthetic Biology is a nascent field with appl... | The design of biological circuits requires to ... | SBROME\nWhat is SBROME?\nDependencies\nInstall... |


In [2]:
from lazy_text_classifiers import LazyTextClassifiers
from sklearn.model_selection import train_test_split
import pandas as pd
import random
import numpy as np

random.seed(20220420)
np.random.seed(20220420)

# README text is the input and the annotated label is the target
# `x` should be an iterable of strings
# `y` should be an iterable of string labels
x = df.readme_text
y = df.label

# Split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.2,
    random_state=20220420,
)

print(len(x_train), len(x_test), len(y_train), len(y_test))

# Init and fit all models
ltc = LazyTextClassifiers(random_state=12)
results = ltc.fit(x_train, x_test, y_train, y_test)
results

732 184 732 184
Initializing model: 'tfidf-logit'
Fitting model: 'tfidf-logit'
Testing model: 'tfidf-logit'
'tfidf-logit' eval results: {'model': 'tfidf-logit', 'accuracy': 0.8858695652173914, 'balanced_accuracy': 0.8864285714285715, 'precision': 0.8866056662390455, 'recall': 0.8858695652173914, 'f1': 0.8860016959193393, 'time': 23.11274494600002}
Initializing model: 'semantic-logit-distilbert-sst2'


No sentence-transformers model found with name /home/eva/.cache/torch/sentence_transformers/distilbert-base-uncased-finetuned-sst-2-english. Creating a new one with MEAN pooling.


Fitting model: 'semantic-logit-distilbert-sst2'
Testing model: 'semantic-logit-distilbert-sst2'
'semantic-logit-distilbert-sst2' eval results: {'model': 'semantic-logit-distilbert-sst2', 'accuracy': 0.8206521739130435, 'balanced_accuracy': 0.8197619047619047, 'precision': 0.8208607817303468, 'recall': 0.8206521739130435, 'f1': 0.8207321661045128, 'time': 24.197274980000657}


| | model | accuracy | balanced_accuracy | precision | recall | f1 | time |
|---|---|---|---|---|---|---|---|
| 0 | tfidf-logit | 0.88587 | 0.886429 | 0.886606 | 0.88587 | 0.886002 | 23.112745 |
| 1 | semantic-logit-distilbert-sst2 | 0.820652 | 0.819762 | 0.820861 | 0.820652 | 0.820732 | 24.197275 |
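A note on reading these metrics: for both models `recall` is identical to `accuracy` (0.88587 and 0.820652). That is what a support-weighted average of per-class recall produces, since the class weights cancel the per-class denominators. Whether `lazy-text-classifiers` uses weighted averaging is an assumption inferred from the numbers, not something confirmed here. The sketch below also assumes `balanced_accuracy` is the unweighted mean of per-class recall, as in scikit-learn's `balanced_accuracy_score`. The labels are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    # For each class: correctly predicted examples / all examples of that class
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


def balanced_accuracy(y_true, y_pred):
    # Unweighted mean of per-class recall
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def weighted_recall(y_true, y_pred):
    # Per-class recall weighted by how many examples each class has
    support = Counter(y_true)
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls[label] * support[label] for label in support) / len(y_true)


# Made-up labels: 6 "software" and 2 "not-software" examples
y_true = ["software"] * 6 + ["not-software"] * 2
y_pred = ["software"] * 5 + ["not-software"] + ["not-software", "software"]

print("accuracy:", round(accuracy(y_true, y_pred), 4))
print("weighted recall:", round(weighted_recall(y_true, y_pred), 4))
print("balanced accuracy:", round(balanced_accuracy(y_true, y_pred), 4))
```

On imbalanced data like this toy example, balanced accuracy drops below plain accuracy because the smaller class counts equally. In the results above the two values are close, which suggests the test split is roughly balanced between the two labels.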


In [3]:
semantic_logit = ltc.fit_models["semantic-logit-distilbert-sst2"]
tfidf_logit = ltc.fit_models["tfidf-logit"]

## Eval on Repos I Created

In [4]:
SOFT_SEARCH_README = """
# soft-search

[![Build Status](https://github.com/si2-urssi/eager/workflows/CI/badge.svg)](https://github.com/si2-urssi/eager/actions)
[![Documentation](https://github.com/si2-urssi/eager/workflows/Documentation/badge.svg)](https://si2-urssi.github.io/eager)

searching for software promises in grant applications

---

## Installation

**Stable Release:** `pip install soft-search`<br>
**Development Head:** `pip install git+https://github.com/si2-urssi/eager.git`

This repository contains the library code and the paper generation code
created for our paper [Searching for Software in NSF Awards](https://si2-urssi.github.io/eager/_static/paper.html).

### Abstract
Software is an important tool for scholarly work, but software produced for research is in many cases not easily identifiable or discoverable. A potential first step in linking research and software is software identification. In this paper we present two datasets to study the identification and production of research software. The first dataset contains almost 1000 human labeled annotations of software production from National Science Foundation (NSF) awarded research projects. We use this dataset to train models that  predict software production. Our second dataset is created by applying the trained predictive models across the abstracts and project outcomes reports for all NSF funded projects between the years of 2010 and 2023. The result is an inferred dataset of software production for over 150,000 NSF awards. We release the Soft-Search dataset to aid in identifying and understanding research software production: https://github.com/si2-urssi/eager


## The Soft-Search Inferred Dataset

Please download the 500MB Soft-Search Inferred dataset from
[Google Drive](https://drive.google.com/file/d/1k0jvs47bCWT18GHOMXY6EdG5MIDdCiM2/view?usp=share_link).

The dataset is shared as a `parquet` file and can be read in Python with
```python
import pandas as pd

nsf_soft_search = pd.read_parquet("nsf-soft-search-2022.parquet")
```

Please view the
[Parquet R Documentation](https://arrow.apache.org/docs/r/reference/read_parquet.html)
for information regarding reading the dataset in R.

## Quickstart

1. Load our best model (the "TF-IDF Vectorizer Logistic Regression Model")
2. Pull award abstract texts from the NSF API
3. Predict if the award will produce software using the abstract text for each award

```python
from soft_search.constants import NSFFields, NSFPrograms
from soft_search.label import (
  load_tfidf_logit_for_prediction_from_abstract,
  load_tfidf_logit_for_prediction_from_outcomes,
)
from soft_search.nsf import get_nsf_dataset

# Load the abstract model
pipeline = load_tfidf_logit_for_prediction_from_abstract()
# or load the outcomes model
# pipeline = load_tfidf_logit_for_prediction_from_outcomes()

# Pull data
data = get_nsf_dataset(
  start_date="2022-10-01",
  end_date="2023-01-01",
  program_name=NSFPrograms.Mathematical_and_Physical_Sciences,
  dataset_fields=[
    NSFFields.id_,
    NSFFields.abstractText,
    NSFFields.projectOutComesReport,
  ],
  require_project_outcomes_doc=False,
)

# Predict
data["prediction_from_abstract"] = pipeline.predict(data[NSFFields.abstractText])
print(data[["id", "prediction_from_abstract"]])

#           id prediction_from_abstract
# 0    2238468   software-not-predicted
# 1    2239561   software-not-predicted
```

### Annotated Training Data

```python
from soft_search.data import load_soft_search_2022_training

df = load_soft_search_2022_training()
```

### Reproducible Models

| predictive_source 	| model                  	| accuracy 	| precision 	| recall   	| f1       	|
|-------------------	|------------------------	|----------	|-----------	|----------	|----------	|
| project-outcomes  	| tfidf-logit            	| 0.744898 	| 0.745106  	| 0.744898 	| 0.744925 	|
| project-outcomes  	| fine-tuned-transformer 	| 0.673469 	| 0.637931  	| 0.770833 	| 0.698113 	|
| abstract-text     	| tfidf-logit            	| 0.673913 	| 0.673960  	| 0.673913 	| 0.673217 	|
| abstract-text     	| fine-tuned-transformer 	| 0.635870 	| 0.607843  	| 0.696629 	| 0.649215 	|
| project-outcomes  	| semantic-logit         	| 0.632653 	| 0.632568  	| 0.632653 	| 0.632347 	|
| abstract-text     	| semantic-logit         	| 0.630435 	| 0.630156  	| 0.630435 	| 0.629997 	|
| abstract-text     	| regex                  	| 0.516304 	| 0.514612  	| 0.516304 	| 0.513610 	|
| project-outcomes  	| regex                  	| 0.510204 	| 0.507086  	| 0.510204 	| 0.481559 	|

To train and evaluate all of our models you can run the following:

```bash
pip install soft-search

fit-and-eval-all-models
```

Also available directly in Python

```python
from soft_search.label.model_selection import fit_and_eval_all_models

results = fit_and_eval_all_models()
```

## Annotated Dataset Creation

1. We queried GitHub for repositories with references to NSF Awards.
  - We specifically queried for the terms: "National Science Foundation", "NSF Award",
    "NSF Grant", "Supported by the NSF", and "Supported by NSF". This script is available
    with the command `get-github-repositories-with-nsf-ref`. The code for the script is
    available at the following link:
    https://github.com/si2-urssi/eager/blob/main/soft_search/bin/get_github_repositories_with_nsf_ref.py
  - Note: the `get-github-repositories-with-nsf-ref` script produces a directory of CSV
    files. This is useful for paginated queries and protecting against potential crashes
    but the actual stored data in the repo (and the data we use going forward) is
    the a DataFrame with all of these chunks concatenated together and duplicate GitHub
    repositories removed.
  - Because the `get-github-repositories-with-nsf-ref` script depends on the returned
    data from GitHub themselves, we have archived the data produced by the original run
    of this script to the repository and made it available as follows:
    ```python
    from soft_search.data import load_github_repos_with_nsf_refs_2022

    data = load_github_repos_with_nsf_refs_2022()
    ```
2. We manually labeled each of the discovered repositories as "software"
   or "not software" and cleaned up the dataset to only include awards 
   which have a valid NSF Award ID.
  - A script was written to find all NSF Award IDs within a repositories README.md file
    and check that each NSF Award ID found was valid (if we could successfully query
    that award ID using the NSF API). Only valid NSF Award IDs were kept and therefore,
    only GitHub repositories which contained valid NSF Award IDs were kept in the
    dataset. This script is available with the command
    `find-nsf-award-ids-in-github-readmes-and-link`. The code for the script is
    available at the following link:
    https://github.com/si2-urssi/eager/blob/main/soft_search/bin/find_nsf_award_ids_in_github_readmes_and_link.py
  - A function was written to merge all of the manual annotations and the NSF Award IDs
    found. This function also stored the cleaned and prepared data to the project data
    directory. The code for this function is available at the following link:
    https://github.com/si2-urssi/eager/blob/main/soft_search/data/soft_search_2022.py#L143
  - The manually labeled, cleaned, prepared, and stored data is made available with the
    following code:
     ```python
     from soft_search.data import load_soft_search_2022_training

     data = load_soft_search_2022_training()
     ```
  - Prior to the manual annotation process, we conducted multiple rounds of
    annotation trials to ensure we had agreement on our labeling definitions.
    The final annotation trial results which resulted in an inter-rater
    reliability (Fleiss Kappa score) of 0.8918 (near perfect) is available
    via the following function:
    ```python
    from soft_search.data import load_soft_search_2022_training_irr

    data = load_soft_search_2022_training_irr()
    ```
    Additionally, the code for calculating the Fleiss Kappa Statistic
    is available at the following link:
    https://github.com/si2-urssi/eager/blob/main/soft_search/data/irr.py


## Documentation

For full package documentation please visit [si2-urssi.github.io/eager](https://si2-urssi.github.io/eager).

## Development

See [CONTRIBUTING.md](CONTRIBUTING.md) for information related to developing the code.

**MIT License**
"""

AICSIMAGEIO_README = """
# AICSImageIO

[![Build Status](https://github.com/AllenCellModeling/aicsimageio/workflows/Build%20Main/badge.svg)](https://github.com/AllenCellModeling/aicsimageio/actions)
[![Documentation](https://github.com/AllenCellModeling/aicsimageio/workflows/Documentation/badge.svg)](https://AllenCellModeling.github.io/aicsimageio/)
[![Code Coverage](https://codecov.io/gh/AllenCellModeling/aicsimageio/branch/main/graph/badge.svg)](https://app.codecov.io/gh/AllenCellModeling/aicsimageio/branch/main)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4906608.svg)](https://doi.org/10.5281/zenodo.4906608)

Image Reading, Metadata Conversion, and Image Writing for Microscopy Images in Pure Python

---

## Features

-   Supports reading metadata and imaging data for:
    -   `OME-TIFF`
    -   `TIFF`
    -   `ND2` -- (`pip install aicsimageio[nd2]`)
    -   `DV` -- (`pip install aicsimageio[dv]`)
    -   `CZI` -- (`pip install aicspylibczi>=3.1.1 fsspec>=2022.8.0`)
    -   `LIF` -- (`pip install readlif>=0.6.4`)
    -   `PNG`, `GIF`, [etc.](https://github.com/imageio/imageio) -- (`pip install aicsimageio[base-imageio]`)
    -   Files supported by [Bio-Formats](https://docs.openmicroscopy.org/bio-formats/latest/supported-formats.html) -- (`pip install aicsimageio bioformats_jar`) (Note: requires `java` and `maven`, see below for details.)
-   Supports writing metadata and imaging data for:
    -   `OME-TIFF`
    -   `PNG`, `GIF`, [etc.](https://github.com/imageio/imageio) -- (`pip install aicsimageio[base-imageio]`)
-   Supports reading and writing to
    [fsspec](https://github.com/intake/filesystem_spec) supported file systems
    wherever possible:

    -   Local paths (i.e. `my-file.png`)
    -   HTTP URLs (i.e. `https://my-domain.com/my-file.png`)
    -   [s3fs](https://github.com/dask/s3fs) (i.e. `s3://my-bucket/my-file.png`)
    -   [gcsfs](https://github.com/dask/gcsfs) (i.e. `gcs://my-bucket/my-file.png`)

    See [Cloud IO Support](#cloud-io-support) for more details.

## Installation

**Stable Release:** `pip install aicsimageio`<br>
**Development Head:** `pip install git+https://github.com/AllenCellModeling/aicsimageio.git`

AICSImageIO is supported on Windows, Mac, and Ubuntu.
For other platforms, you will likely need to build from source.

#### Extra Format Installation

TIFF and OME-TIFF reading and writing is always available after
installing `aicsimageio`, but extra supported formats can be
optionally installed using `[...]` syntax.

-   For a single additional supported format (e.g. ND2): `pip install aicsimageio[nd2]`
-   For a single additional supported format (e.g. ND2), development head: `pip install "aicsimageio[nd2] @ git+https://github.com/AllenCellModeling/aicsimageio.git"`
-   For a single additional supported format (e.g. ND2), specific tag (e.g. `v4.0.0.dev6`): `pip install "aicsimageio[nd2] @ git+https://github.com/AllenCellModeling/aicsimageio.git@v4.0.0.dev6"`
-   For faster OME-TIFF reading with tile tags: `pip install aicsimageio[bfio]`
-   For multiple additional supported formats: `pip install aicsimageio[base-imageio,nd2]`
-   For all additional supported (and openly licensed) formats: `pip install aicsimageio[all]`
-   Due to the GPL license, LIF support is not included with the `[all]` extra, and must be installed manually with `pip install aicsimageio readlif>=0.6.4`
-   Due to the GPL license, CZI support is not included with the `[all]` extra, and must be installed manually with `pip install aicsimageio aicspylibczi>=3.1.1 fsspec>=2022.8.0`
-   Due to the GPL license, Bio-Formats support is not included with the `[all]` extra, and must be installed manually with `pip install aicsimageio bioformats_jar`. **Important!!** Bio-Formats support also requires a `java` and `mvn` executable in the environment. The simplest method is to install `bioformats_jar` from conda: `conda install -c conda-forge bioformats_jar` (which will additionally bring `openjdk` and `maven` packages).

## Documentation

For full package documentation please visit
[allencellmodeling.github.io/aicsimageio](https://allencellmodeling.github.io/aicsimageio/index.html).

## Quickstart

### Full Image Reading

If your image fits in memory:

```python
from aicsimageio import AICSImage

# Get an AICSImage object
img = AICSImage("my_file.tiff")  # selects the first scene found
img.data  # returns 5D TCZYX numpy array
img.xarray_data  # returns 5D TCZYX xarray data array backed by numpy
img.dims  # returns a Dimensions object
img.dims.order  # returns string "TCZYX"
img.dims.X  # returns size of X dimension
img.shape  # returns tuple of dimension sizes in TCZYX order
img.get_image_data("CZYX", T=0)  # returns 4D CZYX numpy array

# Get the id of the current operating scene
img.current_scene

# Get a list valid scene ids
img.scenes

# Change scene using name
img.set_scene("Image:1")
# Or by scene index
img.set_scene(1)

# Use the same operations on a different scene
# ...
```

#### Full Image Reading Notes

The `.data` and `.xarray_data` properties will load the whole scene into memory.
The `.get_image_data` function will load the whole scene into memory and then retrieve
the specified chunk.

### Delayed Image Reading

If your image doesn't fit in memory:

```python
from aicsimageio import AICSImage

# Get an AICSImage object
img = AICSImage("my_file.tiff")  # selects the first scene found
img.dask_data  # returns 5D TCZYX dask array
img.xarray_dask_data  # returns 5D TCZYX xarray data array backed by dask array
img.dims  # returns a Dimensions object
img.dims.order  # returns string "TCZYX"
img.dims.X  # returns size of X dimension
img.shape  # returns tuple of dimension sizes in TCZYX order

# Pull only a specific chunk in-memory
lazy_t0 = img.get_image_dask_data("CZYX", T=0)  # returns out-of-memory 4D dask array
t0 = lazy_t0.compute()  # returns in-memory 4D numpy array

# Get the id of the current operating scene
img.current_scene

# Get a list valid scene ids
img.scenes

# Change scene using name
img.set_scene("Image:1")
# Or by scene index
img.set_scene(1)

# Use the same operations on a different scene
# ...
```

#### Delayed Image Reading Notes

The `.dask_data` and `.xarray_dask_data` properties and the `.get_image_dask_data`
function will not load any piece of the imaging data into memory until you specifically
call `.compute` on the returned Dask array. In doing so, you will only then load the
selected chunk in-memory.

### Mosaic Image Reading

Read stitched data or single tiles as a dimension.

Readers that support mosaic tile stitching:

-   `LifReader`
-   `CziReader`

#### AICSImage

If the file format reader supports stitching mosaic tiles together, the
`AICSImage` object will default to stitching the tiles back together.

```python
img = AICSImage("very-large-mosaic.lif")
img.dims.order  # T, C, Z, big Y, big X, (S optional)
img.dask_data  # Dask chunks fall on tile boundaries, pull YX chunks out of the image
```

This behavior can be manually turned off:

```python
img = AICSImage("very-large-mosaic.lif", reconstruct_mosaic=False)
img.dims.order  # M (tile index), T, C, Z, small Y, small X, (S optional)
img.dask_data  # Chunks use normal ZYX
```

If the reader does not support stitching tiles together the M tile index will be
available on the `AICSImage` object:

```python
img = AICSImage("some-unsupported-mosaic-stitching-format.ext")
img.dims.order  # M (tile index), T, C, Z, small Y, small X, (S optional)
img.dask_data  # Chunks use normal ZYX
```

#### Reader

If the file format reader detects mosaic tiles in the image, the `Reader` object
will store the tiles as a dimension.

If tile stitching is implemented, the `Reader` can also return the stitched image.

```python
reader = LifReader("ver-large-mosaic.lif")
reader.dims.order  # M, T, C, Z, tile size Y, tile size X, (S optional)
reader.dask_data  # normal operations, can use M dimension to select individual tiles
reader.mosaic_dask_data  # returns stitched mosaic - T, C, Z, big Y, big, X, (S optional)
```

#### Single Tile Absolute Positioning

There are functions available on both the `AICSImage` and `Reader` objects
to help with single tile positioning:

```python
img = AICSImage("very-large-mosaic.lif")
img.mosaic_tile_dims  # Returns a Dimensions object with just Y and X dim sizes
img.mosaic_tile_dims.Y  # 512 (for example)

# Get the tile start indices (top left corner of tile)
y_start_index, x_start_index = img.get_mosaic_tile_position(12)
```

### Metadata Reading

```python
from aicsimageio import AICSImage

# Get an AICSImage object
img = AICSImage("my_file.tiff")  # selects the first scene found
img.metadata  # returns the metadata object for this file format (XML, JSON, etc.)
img.channel_names  # returns a list of string channel names found in the metadata
img.physical_pixel_sizes.Z  # returns the Z dimension pixel size as found in the metadata
img.physical_pixel_sizes.Y  # returns the Y dimension pixel size as found in the metadata
img.physical_pixel_sizes.X  # returns the X dimension pixel size as found in the metadata
```

### Xarray Coordinate Plane Attachment

If `aicsimageio` finds coordinate information for the spatial-temporal dimensions of
the image in metadata, you can use
[xarray](http://xarray.pydata.org/en/stable/index.html) for indexing by coordinates.

```python
from aicsimageio import AICSImage

# Get an AICSImage object
img = AICSImage("my_file.ome.tiff")

# Get the first ten seconds (not frames)
first_ten_seconds = img.xarray_data.loc[:10]  # returns an xarray.DataArray

# Get the first ten major units (usually micrometers, not indices) in Z
first_ten_mm_in_z = img.xarray_data.loc[:, :, :10]

# Get the first ten major units (usually micrometers, not indices) in Y
first_ten_mm_in_y = img.xarray_data.loc[:, :, :, :10]

# Get the first ten major units (usually micrometers, not indices) in X
first_ten_mm_in_x = img.xarray_data.loc[:, :, :, :, :10]
```

See `xarray`
["Indexing and Selecting Data" Documentation](http://xarray.pydata.org/en/stable/indexing.html)
for more information.

### Cloud IO Support

[File-System Specification (fsspec)](https://github.com/intake/filesystem_spec) allows
for common object storage services (S3, GCS, etc.) to act like normal filesystems by
following the same base specification across them all. AICSImageIO utilizes this
standard specification to make it possible to read directly from remote resources when
the specification is installed.

```python
from aicsimageio import AICSImage

# Get an AICSImage object
img = AICSImage("http://my-website.com/my_file.tiff")
img = AICSImage("s3://my-bucket/my_file.tiff")
img = AICSImage("gcs://my-bucket/my_file.tiff")

# Or read with specific filesystem creation arguments
img = AICSImage("s3://my-bucket/my_file.tiff", fs_kwargs=dict(anon=True))
img = AICSImage("gcs://my-bucket/my_file.tiff", fs_kwargs=dict(anon=True))

# All other normal operations work just fine
```

Remote reading requires that the file-system specification implementation for the
target backend is installed.

-   For `s3`: `pip install s3fs`
-   For `gs`: `pip install gcsfs`

See the [list of known implementations](https://filesystem-spec.readthedocs.io/en/latest/?badge=latest#implementations).

### Saving to OME-TIFF

The simpliest method to save your image as an OME-TIFF file with key pieces of
metadata is to use the `save` function.

```python
from aicsimageio import AICSImage

AICSImage("my_file.czi").save("my_file.ome.tiff")
```

**Note:** By default `aicsimageio` will generate only a portion of metadata to pass
along from the reader to the OME model. This function currently does not do a full
metadata translation.

For finer grain customization of the metadata, scenes, or if you want to save an array
as an OME-TIFF, the writer class can also be used to customize as needed.

```python
import numpy as np
from aicsimageio.writers import OmeTiffWriter

image = np.random.rand(10, 3, 1024, 2048)
OmeTiffWriter.save(image, "file.ome.tif", dim_order="ZCYX")
```

See
[OmeTiffWriter documentation](./aicsimageio.writers.html#aicsimageio.writers.ome_tiff_writer.OmeTiffWriter.save)
for more details.

#### Other Writers

In most cases, `AICSImage.save` is usually a good default but there are other image
writers available. For more information, please refer to
[our writers documentation](https://allencellmodeling.github.io/aicsimageio/aicsimageio.writers.html).

## Benchmarks

AICSImageIO is benchmarked using [asv](https://asv.readthedocs.io/en/stable/).
You can find the benchmark results for every commit to `main` starting at the 4.0
release on our
[benchmarks page](https://AllenCellModeling.github.io/aicsimageio/_benchmarks/index.html).

## Development

See our
[developer resources](https://allencellmodeling.github.io/aicsimageio/developer_resources)
for information related to developing the code.

## Citation

If you find `aicsimageio` useful, please cite this repository as:

> Eva Maxfield Brown, Dan Toloudis, Jamie Sherman, Madison Swain-Bowden, Talley Lambert, AICSImageIO Contributors (2021). AICSImageIO: Image Reading, Metadata Conversion, and Image Writing for Microscopy Images in Pure Python [Computer software]. GitHub. https://github.com/AllenCellModeling/aicsimageio

bibtex:

```bibtex
@misc{aicsimageio,
  author    = {Brown, Eva Maxfield and Toloudis, Dan and Sherman, Jamie and Swain-Bowden, Madison and Lambert, Talley and {AICSImageIO Contributors}},
  title     = {AICSImageIO: Image Reading, Metadata Conversion, and Image Writing for Microscopy Images in Pure Python},
  year      = {2021},
  publisher = {GitHub},
  url       = {https://github.com/AllenCellModeling/aicsimageio}
}
```

_Free software: BSD-3-Clause_

_(The LIF component is licensed under GPLv3 and is not included in this package)_
_(The Bio-Formats component is licensed under GPLv2 and is not included in this package)_
_(The CZI component is licensed under GPLv3 and is not included in this package)_
"""

CDP_DATA_README = """
# cdp-data

[![Build Status](https://github.com/CouncilDataProject/cdp-data/workflows/CI/badge.svg)](https://github.com/CouncilDataProject/cdp-data/actions)
[![Documentation](https://github.com/CouncilDataProject/cdp-data/workflows/Documentation/badge.svg)](https://CouncilDataProject.github.io/cdp-data)

Data Utilities and Processing Generalized for All CDP Instances

---

![Keywords over time in Seattle, Portland, and Oakland](https://raw.githubusercontent.com/CouncilDataProject/cdp-data/main/docs/_static/header-keywords-over-time.png)

## Installation

**Stable Release:** `pip install cdp-data`<br>
**Development Head:** `pip install git+https://github.com/CouncilDataProject/cdp-data.git`

## Documentation

For full package documentation please visit [councildataproject.github.io/cdp-data](https://councildataproject.github.io/cdp-data).

## Quickstart

### Pulling Datasets

Install basics: `pip install cdp-data`

#### Transcripts and Session Data

```python
from cdp_data import CDPInstances, datasets

ds = datasets.get_session_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    start_datetime="2021-01-01",
    store_transcript=True,
)
```

##### Transcript Schema and Usage

It may be useful to look at our
[transcript model documentation](https://councildataproject.org/cdp-backend/transcript_model.html).

Transcripts can be read into memory and processed as an object:

```python
from cdp_backend.pipeline.transcript_model import Transcript

# Read the file as a Transcript object
with open("transcript.json", "r") as open_f:
    transcript = Transcript.from_json(open_f.read())

# Navigate the object
for sentence in transcript.sentences:
    if "clerk" in sentence.text.lower():
        print(f"{sentence.index}, {sentence.start_time}: '{sentence.text}')
```

If you do not want to do this processing in Python or prefer to work with
a DataFrame, you can convert transcripts to DataFrames like so:

```python
from cdp_data import datasets

# assume that transcript is the same transcript as the prior code snippet
sentences = datasets.convert_transcript_to_dataframe(transcript)
```

You can also do this conversion (and storage of the coverted transcript) for
all transcripts in a session dataset during dataset construction with the
`store_transcript_as_csv` parameter.

```python
from cdp_data import CDPInstances, datasets

ds = datasets.get_session_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    start_datetime="2021-01-01",
    store_transcript=True,
    store_transcript_as_csv=True,
)
```

This will store the transcript for each session as both JSON and CSV.

#### Voting Data

```python
from cdp_data import CDPInstances, datasets

ds = dataset.get_vote_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    start_datetime="2021-01-01",
)
```

#### Data Definitions and Schema

Please refer to our
[database schema](https://councildataproject.org/cdp-backend/database_schema.html)
and our
[database model definitions](https://councildataproject.org/cdp-backend/cdp_backend.database.html#module-cdp_backend.database.models)
for more information on CDP generated and archived data is structured.

#### Saving Datasets

Because we heavily rely on our database models for database interaction,
in many cases, we default to returning the full `fireo.models.Model` object
as column values.

These objects cannot be immediately stored to disk so we provide a helper to
replace all model objects with their database IDs for storage.

This can be done directly if you already have a dataset you have been working with:

```python
from cdp_data import datasets

# data should be a pandas dataframe
dataset.save_dataset(data, "data.csv")
```

Or this can be premptively be done during dataset construction:

```python
from cdp_data import CDPInstances, dataset

# both get_session_dataset and get_vote_dataset
# have a `replace_py_objects` parameter
sessions = datasets.get_session_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    replace_py_objects=True,
)

votes = datasets.get_vote_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    replace_py_objects=True,
)
```

### Plotting and Analysis

Install plotting support: `pip install cdp-data[plot]`

#### Ngram Usage over Time

```python
from cdp_data import CDPInstances, keywords, plots

ngram_usage = keywords.compute_ngram_usage_history(
    CDPInstances.Seattle,
    start_datetime="2022-03-01",
    end_datetime="2022-10-01",
)
grid = plots.plot_ngram_usage_histories(
    ["police", "housing", "transportation"],
    ngram_usage,
    lmplot_kws=dict(  # extra plotting params
        col="ngram",
        hue="ngram",
        scatter_kws={"alpha": 0.2},
        aspect=1.6,
    ),
)
grid.savefig("seattle-keywords-over-time.png")
```

![Seattle keyword usage over time](https://raw.githubusercontent.com/CouncilDataProject/cdp-data/main/docs/_static/seattle-keywords-over-time.png)

## Development

See [CONTRIBUTING.md](CONTRIBUTING.md) for information related to developing the code.

**MIT license**
"""

CDP_OAKLAND_README = """
# CDP - Oakland

[![Infrastructure Deployment Status](https://github.com/CouncilDataProject/oakland/workflows/Infrastructure/badge.svg)](https://github.com/CouncilDataProject/oakland/actions?query=workflow%3A%22Infrastructure%22)
[![Event Processing Pipeline](https://github.com/CouncilDataProject/oakland/workflows/Event%20Gather/badge.svg)](https://github.com/CouncilDataProject/oakland/actions?query=workflow%3A%22Event+Gather%22)
[![Event Index Pipeline](https://github.com/CouncilDataProject/oakland/workflows/Event%20Index/badge.svg)](https://github.com/CouncilDataProject/oakland/actions?query=workflow%3A%22Event+Index%22)
[![Web Deployment Status](https://github.com/CouncilDataProject/oakland/workflows/Web%20App/badge.svg)](https://councildataproject.github.io/oakland)
[![Repo Build Status](https://github.com/CouncilDataProject/oakland/workflows/Build%20Main/badge.svg)](https://github.com/CouncilDataProject/oakland/actions?query=workflow%3A%22Build+Main%22)

---

## Council Data Project

Council Data Project is an open-source project dedicated to providing journalists, activists, researchers, and all members of each community we serve with the tools they need to stay informed and hold their Council Members accountable.

For more information about Council Data Project, please visit [our website](https://councildataproject.org/).

## Instance Information

This repo serves the municipality: **Oakland**

### Python Access

Install:

`pip install cdp-backend`

Quickstart:

```python
from cdp_backend.database import models as db_models
from cdp_backend.pipeline.transcript_model import Transcript
import fireo
from gcsfs import GCSFileSystem
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

# Connect to the database
fireo.connection(client=Client(
    project="cdp-oakland-ba81c097",
    credentials=AnonymousCredentials()
))

# Read from the database
five_people = list(db_models.Person.collection.fetch(5))

# Connect to the file store
fs = GCSFileSystem(project="cdp-oakland-ba81c097", token="anon")

# Read a transcript's details from the database
transcript_model = list(db_models.Transcript.collection.fetch(1))[0]

# Read the transcript directly from the file store
with fs.open(transcript_model.file_ref.get().uri, "r") as open_resource:
    transcript = Transcript.from_json(open_resource.read())

# OR download and store the transcript locally with `get`
fs.get(transcript_model.file_ref.get().uri, "local-transcript.json")
# Then read the transcript from your local machine
with open("local-transcript.json", "r") as open_resource:
    transcript = Transcript.from_json(open_resource.read())
```

-   See the [CDP Database Schema](https://councildataproject.org/cdp-backend/database_schema.html)
    for a Council Data Project database schema diagram.
-   See the [FireO documentation](https://octabyte.io/FireO/)
    to learn how to construct queries using CDP database models.
-   See the [GCSFS documentation](https://gcsfs.readthedocs.io/en/latest/index.html)
    to learn how to retrieve files from the file store.

## Contributing

If you wish to contribute to CDP please note that the best method to do so is to contribute to the upstream libraries that compose the CDP Instances themselves. These are detailed below.

-   [cdp-backend](https://github.com/CouncilDataProject/cdp-backend): Contains all the database models, data processing pipelines, and infrastructure-as-code for CDP deployments. Contributions here will be available to all CDP Instances. Entirely written in Python.
-   [cdp-frontend](https://github.com/CouncilDataProject/cdp-frontend): Contains all of the components used by the web apps to be hosted on GitHub Pages. Contributions here will be available to all CDP Instances. Entirely written in TypeScript and React.
-   [cookiecutter-cdp-deployment](https://github.com/CouncilDataProject/cookiecutter-cdp-deployment): The repo used to generate new CDP Instance deployments. Like this repo!
-   [councildataproject.org](https://github.com/CouncilDataProject/councildataproject.github.io): Our landing page! Contributions here should largely be text changes and admin updates.

## Instance Admin Documentation

You can find documentation on how to customize, update, and maintain this CDP instance
in the
[admin-docs directory](https://github.com/CouncilDataProject/oakland/tree/main/admin-docs).

## License

CDP software is licensed under a [MIT License](./LICENSE).

Content produced by this instance is available under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
"""

LAZY_TEXT_CLASSIFIERS_README = """
# lazy-text-classifiers

[![Build Status](https://github.com/evamaxfield/lazy-text-classifiers/workflows/CI/badge.svg)](https://github.com/evamaxfield/lazy-text-classifiers/actions)
[![Documentation](https://github.com/evamaxfield/lazy-text-classifiers/workflows/Documentation/badge.svg)](https://evamaxfield.github.io/lazy-text-classifiers)

Build and test a variety of text binary or multi-class classification models.

---

## Installation

**Stable Release:** `pip install lazy-text-classifiers`<br>
**Development Head:** `pip install git+https://github.com/evamaxfield/lazy-text-classifiers.git`

## Quickstart

```python
from lazy_text_classifiers import LazyTextClassifiers
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

# Example data from sklearn
# `x` should be an iterable of strings
# `y` should be an iterable of string labels
data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
x = data.data[:1000]
y = data.target[:1000]
y = [data.target_names[id_] for id_ in y]

# Split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.4,
    random_state=12,
)

# Init and fit all models
ltc = LazyTextClassifiers(random_state=12)
results = ltc.fit(x_train, x_test, y_train, y_test)

# Results is a dataframe
# | model                  |   accuracy |   balanced_accuracy |   precision |   recall |       f1 |    time |
# |:-----------------------|-----------:|--------------------:|------------:|---------:|---------:|--------:|
# | semantic-logit         |    0.73    |            0.725162 |    0.734887 |  0.73    | 0.728247 |  13.742 |
# | tfidf-logit            |    0.70625 |            0.700126 |    0.709781 |  0.70625 | 0.702073 | 187.217 |
# | fine-tuned-transformer |    0.11125 |            0.1118   |    0.10998  |  0.11125 | 0.109288 | 220.105 |

# Get a specific model
semantic_logit = ltc.fit_models["semantic-logit"]
# either a scikit-learn Pipeline or a custom Transformer wrapper class

# All models have a `save` function which will store into the normal format
# * pickle for scikit-learn pipelines
# * torch model directory for Transformers
```

## Documentation

For full package documentation please visit [evamaxfield.github.io/lazy-text-classifiers](https://evamaxfield.github.io/lazy-text-classifiers).

## Acknowledgements

This package was heavily inspired by [lazypredict](https://github.com/shankarpandala/lazypredict).

## Development

See [CONTRIBUTING.md](CONTRIBUTING.md) for information related to developing the code.

**MIT License**
"""

SPEAKERBOX_README = """
# speakerbox

[![Build Status](https://github.com/CouncilDataProject/speakerbox/workflows/CI/badge.svg)](https://github.com/CouncilDataProject/speakerbox/actions)
[![Documentation](https://github.com/CouncilDataProject/speakerbox/workflows/Documentation/badge.svg)](https://CouncilDataProject.github.io/speakerbox)
[![status](https://joss.theoj.org/papers/49cfcef1769c812ce4ff2e388a5c7641/status.svg)](https://joss.theoj.org/papers/49cfcef1769c812ce4ff2e388a5c7641)

Few-Shot Multi-Recording Speaker Identification Transformer Fine-Tuning and Application

---

## Installation

**Stable Release:** `pip install speakerbox`<br>
**Development Head:** `pip install git+https://github.com/CouncilDataProject/speakerbox.git`

## Documentation

For full package documentation please visit [councildataproject.github.io/speakerbox](https://councildataproject.github.io/speakerbox).

## Example Usage Video

[![screenshot from example usage youtube video](https://raw.githubusercontent.com/CouncilDataProject/speakerbox/main/docs/_static/images/speakerbox-example-video-screenshot.png)](https://youtu.be/SK2oVqSKPTE)

Link: [https://youtu.be/SK2oVqSKPTE](https://youtu.be/SK2oVqSKPTE)

In the example video, we use the Speakerbox library to quickly annotate a 
dataset of audio clips from the show 
[The West Wing](https://en.wikipedia.org/wiki/The_West_Wing) 
and train a speaker identification model to identify three of 
the show's characters (President Bartlet, Charlie Young, and Leo McGarry).

## Problem

Given a set of multi-speaker recordings:

```
example/
├── 0.wav
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── 5.wav
```

Where each recording has some or all of a set of speakers, for example:

-   0.wav -- contains speakers: A, B, C
-   1.wav -- contains speakers: A, C
-   2.wav -- contains speakers: B, C
-   3.wav -- contains speakers: A, B, C
-   4.wav -- contains speakers: A, B, C
-   5.wav -- contains speakers: A, B, C

You want to train a model to classify portions of audio as one of the N known speakers
in future recordings not included in your original training set.

`f(audio) -> [(start_time, end_time, speaker), (start_time, end_time, speaker), ...]`

e.g. `f(audio) -> [(2.4, 10.5, "A"), (10.8, 14.1, "D"), (14.8, 22.7, "B"), ...]`

The `speakerbox` library contains methods for both generating datasets for annotation
and for utilizing multiple audio annotation schemes to train such a model.

![Typical workflow to prepare a speaker identification dataset and fine-tune a new model using tools provided from the Speakerbox library. The user starts with a collection of audio files that include portions of speech from the speakers they want to train a model to identify. The `diarize_and_split_audio` function will create a new directory with the same name as the audio file, diarize the audio file, and finally, sort the audio portions produced from diarization into sub-directories within this new directory. The user should then manually rename each of the produced sub-directories to the correct speaker identifier (i.e. the speaker's name or a unique id) and additionally remove any incorrectly diarized or mislabeled portions of audio. Finally, the user can prepare training, evaluation, and testing datasets (via the `expand_labeled_diarized_audio_dir_to_dataset` and `preprocess_dataset` functions) and fine-tune a new speaker identification model (via the `train` function).](https://raw.githubusercontent.com/CouncilDataProject/speakerbox/main/docs/_static/images/workflow.png)

The following table shows model performance results as the dataset size increases:

| dataset_size   | mean_accuracy   | mean_precision   | mean_recall   | mean_training_duration_seconds   |
|:---------------|----------------:|-----------------:|--------------:|---------------------------------:|
| 15-minutes     | 0.874 ± 0.029   | 0.881 ± 0.037    | 0.874 ± 0.029 | 101 ± 1                          |
| 30-minutes     | 0.929 ± 0.006   | 0.94 ± 0.007     | 0.929 ± 0.006 | 186 ± 3                          |
| 60-minutes     | 0.937 ± 0.02    | 0.94 ± 0.017     | 0.937 ± 0.02  | 453 ± 7                          |

All results reported are the average of five model training and evaluation trials for each
of the different dataset sizes. All models were fine-tuned using an NVIDIA GTX 1070 TI.

**Note:** this table can be reproduced in ~1 hour using an NVIDIA GTX 1070 TI by:

Installing the example data download dependency:

```bash
pip install speakerbox[example_data]
```

Then running the following commands in Python:

```python
from speakerbox.examples import (
    download_preprocessed_example_data,
    train_and_eval_all_example_models,
)

# Download and unpack the preprocessed example data
dataset = download_preprocessed_example_data()

# Train and eval models with different subsets of the data
results = train_and_eval_all_example_models(dataset)
```

## Workflow

### Diarization

We quickly generate an annotated dataset by first diarizing (or clustering based
on the features of speaker audio) portions of larger audio files and splitting each
of the clusters into its own directory that you can then manually clean up
(by removing incorrectly clustered audio segments).

#### Notes

-   It is recommended to have each larger audio file named with a unique id that
    can be used to act as a "recording id".
-   Diarization time depends on machine resources and may take a long time -- one
    potential recommendation is to run a diarization script overnight and clean up the
    produced annotations the following day.
-   During this process audio will be duplicated in the form of smaller audio clips --
    ensure you have enough space on your machine to complete this process before
    you begin.
-   Clustering accuracy depends on how many speakers there are, how distinct their
    voices are, and how much the speakers talk over one another.
-   If possible, try to find recordings where speakers have a roughly uniform distribution
    of speaking durations.

⚠️ To use the diarization portions of `speakerbox` you need to complete the
following steps: ⚠️

1. Visit [hf.co/pyannote/speaker-diarization](https://hf.co/pyannote/speaker-diarization)
   and accept user conditions.
2. Visit [hf.co/pyannote/segmentation](https://hf.co/pyannote/segmentation)
   and accept user conditions.
3. Visit [hf.co/settings/tokens](https://hf.co/settings/tokens) to create an access token
   (only if you had to complete 1.)

**Diarize a single file:**

```python
from speakerbox import preprocess

# The token can also be provided via the `HUGGINGFACE_TOKEN` environment variable.
diarized_and_split_audio_dir = preprocess.diarize_and_split_audio(
    "0.wav",
    hf_token="token-from-hugging-face",
)
```

**Diarize all files in a directory:**

```python
from speakerbox import preprocess
from pathlib import Path
from tqdm import tqdm

# Iterate over all 'wav' format files in a directory called 'data'
for audio_file in tqdm(list(Path("data").glob("*.wav"))):
    # The token can also be provided via the `HUGGINGFACE_TOKEN` environment variable.
    diarized_and_split_audio_dir = preprocess.diarize_and_split_audio(
        audio_file,
        # Create a new directory to place all created sub-directories within
        storage_dir=f"diarized-audio/{audio_file.stem}",
        hf_token="token-from-hugging-face",
    )
```

### Cleaning

Diarization will produce a directory structure organized by unlabeled speakers with
the audio clips that were clustered together.

For example, if `"0.wav"` had three speakers, the produced directory structure may look
like the following tree:

```
0/
├── SPEAKER_00
│   ├── 567-12928.wav
│   ├── ...
│   └── 76192-82901.wav
├── SPEAKER_01
│   ├── 34123-38918.wav
│   ├── ...
│   └── 88212-89111.wav
└── SPEAKER_02
    ├── ...
    └── 53998-62821.wav
```

We leave it to you as a user to then go through these directories and remove any audio
clips that were incorrectly clustered together as well as renaming the sub-directories
to their correct speaker labels. For example, labelled sub-directories may look like
the following tree:

```
0/
├── A
│   ├── 567-12928.wav
│   ├── ...
│   └── 76192-82901.wav
├── B
│   ├── 34123-38918.wav
│   ├── ...
│   └── 88212-89111.wav
└── D
    ├── ...
    └── 53998-62821.wav
```

#### Notes

-   Most operating systems have an audio playback application to queue an entire directory
    of audio files as a playlist for playback. This makes it easy to listen to a whole
    unlabeled sub-directory (i.e. "SPEAKER_00") at a time and pause playback and remove
    files from the directory which were incorrectly clustered.
-   If any clips have overlapping speakers, it is up to you as a user if you want to
    remove those clips or keep them and properly label them with the speaker you wish to
    associate them with.

### Training Preparation

Once you have annotated what you think are enough recordings, you can try preparing
a dataset for training.

The following functions will prepare the audio for training by:

1. Finding all labeled audio clips in the provided directories
2. Chunking all found audio clips into smaller duration clips _(parametrizable)_
3. Checking that the provided annotated dataset meets the following conditions:
    1. There is enough data such that the training, test, and validation subsets all
       contain different recording ids.
    2. There is enough data such that the training, test, and validation subsets each
       contain all labels present in the whole dataset.

#### Notes

-   During this process audio will be duplicated in the form of smaller audio clips --
    ensure you have enough space on your machine to complete this process before
    you begin.
-   Directory names are used as recording ids during dataset construction.

```python
from speakerbox import preprocess

dataset = preprocess.expand_labeled_diarized_audio_dir_to_dataset(
    labeled_diarized_audio_dir=[
        "0/",  # The cleaned and checked audio clips for recording id 0
        "1/",  # ... recording id 1
        "2/",  # ... recording id 2
        "3/",  # ... recording id 3
        "4/",  # ... recording id 4
        "5/",  # ... recording id 5
    ]
)

dataset_dict, value_counts = preprocess.prepare_dataset(
    dataset,
    # good if you have large variation in number of data points for each label
    equalize_data_within_splits=True,
    # set seed to get a reproducible data split
    seed=60,
)

# You can print the value_counts dataframe to see how many audio clips of each label
# (speaker) are present in each data subset.
value_counts
```

### Model Training and Evaluation

Once you have your dataset prepared and available, you can provide it directly to the
training function to begin training a new model.

The `eval_model` function will store a file called `results.md` with the accuracy,
precision, and recall of the model and additionally store a file called
`validation-confusion.png` which is a
[confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

#### Notes

-   The model (and evaluation metrics) will be stored in a new directory called
    `trained-speakerbox` _(parametrizable)_.
-   Training time depends on how much data you have annotated and provided.
-   It is recommended to train with an NVIDIA GPU with CUDA available to speed up
    the training process.
-   Speakerbox has only been tested on English-language audio and the base model for
    fine-tuning was trained on English-language audio. We provide no guarantees as to
    its effectiveness on non-English-language audio. If you try Speakerbox with
    non-English-language audio, please let us know!

```python
from speakerbox import train, eval_model

# dataset_dict comes from previous preparation step
train(dataset_dict)

eval_model(dataset_dict["valid"])
```

## Model Inference

Once you have a trained model, you can use it against a new audio file:

```python
from speakerbox import apply

annotation = apply(
    "new-audio.wav",
    "path-to-my-model-directory/",
)
```

The apply function returns a
[pyannote.core.Annotation](http://pyannote.github.io/pyannote-core/structure.html#annotation).

## Development

See [CONTRIBUTING.md](CONTRIBUTING.md) for information related to developing the code.

## Citation

```bibtex
@article{Brown2023,
    doi = {10.21105/joss.05132},
    url = {https://doi.org/10.21105/joss.05132},
    year = {2023},
    publisher = {The Open Journal},
    volume = {8},
    number = {83},
    pages = {5132},
    author = {Eva Maxfield Brown and To Huynh and Nicholas Weber},
    title = {Speakerbox: Few-Shot Learning for Speaker Identification with Transformers},
    journal = {Journal of Open Source Software}
} 
```

**MIT License**
"""

In [5]:
READMES = {
    "soft-search": SOFT_SEARCH_README,
    "aicsimageio": AICSIMAGEIO_README,
    "cdp-data": CDP_DATA_README,
    "cdp-oakland": CDP_OAKLAND_README,
    "speakerbox": SPEAKERBOX_README,
    "lazy-text-classifiers": LAZY_TEXT_CLASSIFIERS_README,
}

# Predict
semantic_results = semantic_logit.predict(list(READMES.values()))
tfidf_results = tfidf_logit.predict(list(READMES.values()))

print("SEMANTIC RESULTS")
for short_name, result in zip(READMES, semantic_results):
    print(f"'{short_name}': {result}")

print("TFIDF RESULTS")
for short_name, result in zip(READMES, tfidf_results):
    print(f"'{short_name}': {result}")

SEMANTIC RESULTS
'soft-search': software-not-predicted
'aicsimageio': software-not-predicted
'cdp-data': software-not-predicted
'cdp-oakland': software-predicted
'speakerbox': software-not-predicted
'lazy-text-classifiers': software-not-predicted
TFIDF RESULTS
'soft-search': software-not-predicted
'aicsimageio': software-predicted
'cdp-data': software-not-predicted
'cdp-oakland': software-predicted
'speakerbox': software-predicted
'lazy-text-classifiers': software-predicted
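
The two printed lists are easier to compare side by side. The sketch below copies the predictions printed above into a small dataframe and counts where the models agree. It is only an illustration: the `comparison` and `positives` names are made up here, and reading a low positive count as "bad" assumes that all six repositories should count as software.

```python
import pandas as pd

# Predictions copied verbatim from the printed output above
names = [
    "soft-search",
    "aicsimageio",
    "cdp-data",
    "cdp-oakland",
    "speakerbox",
    "lazy-text-classifiers",
]
semantic = [
    "software-not-predicted",
    "software-not-predicted",
    "software-not-predicted",
    "software-predicted",
    "software-not-predicted",
    "software-not-predicted",
]
tfidf = [
    "software-not-predicted",
    "software-predicted",
    "software-not-predicted",
    "software-predicted",
    "software-predicted",
    "software-predicted",
]

# One row per README, one column per model
comparison = pd.DataFrame({"semantic": semantic, "tfidf": tfidf}, index=names)

# Flag the rows where both models return the same label
comparison["agree"] = comparison["semantic"] == comparison["tfidf"]

# Count how many READMEs each model labels as software
positives = (comparison[["semantic", "tfidf"]] == "software-predicted").sum()

print(comparison)
print(positives)
```

In the notebook itself, the hard-coded lists could be replaced with `list(READMES)`, `semantic_results`, and `tfidf_results`.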


## Trying To Understand Why It's Bad

`eli5` has a problem with scikit-learn 1.3.0: https://github.com/TeamHG-Memex/eli5/issues/425

In [None]:
!pip install "scikit-learn>=1,<1.3"

In [6]:
import eli5

eli5.show_weights(tfidf_logit, top=40)

TfidfVectorizer(ngram_range=(1, 2), stop_words='english',
                strip_accents='unicode')


Weight?,Feature
+2.182,package
+1.355,installation
+1.250,version
+1.233,gpu
+1.175,install
+1.044,video
+1.033,examples
+0.998,library
+0.987,example
+0.972,tool
