# Pipeline, return_all_scores

* After a detour through the `transformers` source code, here is the simple solution:

In [2]:
from racebert import RaceBERT

* Below is a modified version of `RaceBERT.predict_race`. Note that it only applies to `transformers 4.19`.
* In `4.20`, the modified line should be `results = self.race_pipeline(names, top_k=5)` (not tested).
* Relevant code:
    - [TextClassificationPipeline.postprocess](https://github.com/huggingface/transformers/blob/d0acc9537829e7d067edbb791473bbceb2ecf056/src/transformers/pipelines/text_classification.py#L165)
    - [TextClassificationPipeline.\_\_call__](https://github.com/huggingface/transformers/blob/d0acc9537829e7d067edbb791473bbceb2ecf056/src/transformers/pipelines/text_classification.py#L104)
    - [Pipeline.\_\_call__](https://github.com/huggingface/transformers/blob/4975002df50c472cbb6f8ac3580e475f570606ab/src/transformers/pipelines/base.py#L972)
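
* The version note above could be turned into a small switch. This is only a sketch: `all_scores_kwargs` is a hypothetical helper (not part of RaceBERT or `transformers`), and the `4.20` boundary and `top_k=5` come straight from the untested note above.

```python
from typing import Dict, Union


def all_scores_kwargs(transformers_version: str) -> Dict[str, Union[bool, int]]:
    """Pick the pipeline keyword argument that asks for all label scores.

    Hypothetical helper: the 4.20 cut-off and top_k=5 follow the untested
    note above, they are not verified against the library.
    """
    # Compare only (major, minor), so "4.19.2" or "4.20.0.dev0" also parse.
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    if (major, minor) >= (4, 20):
        return {"top_k": 5}
    return {"return_all_scores": True}


print(all_scores_kwargs("4.19.2"))
print(all_scores_kwargs("4.20.0"))
```

* Inside the modified function this would be used as `self.race_pipeline(names, **all_scores_kwargs(transformers.__version__))`.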

In [3]:
# enables the `str | List[str]` annotation in predict_race_all below
from __future__ import annotations
from typing import List, Dict

def predict_race_all(self, names: str | List[str]) -> List[Dict]:
    # normalize a single name, or each name in a list
    if isinstance(names, str):
        names = self.normalize_name(names, strategy="first LAST")
    else:
        names = [self.normalize_name(x, strategy="first LAST") for x in names]

    # modified here: ask the pipeline for the scores of all labels
    results = self.race_pipeline(names, return_all_scores=True)
    # note: the above works for transformers 4.19; in 4.20 return_all_scores has been deprecated, use top_k as below (not tested)
    # results = self.race_pipeline(names, top_k=5)
    
    return results

In [4]:
RaceBERT.predict_race_all = predict_race_all

In [5]:
raceBert = RaceBERT()

In [6]:
raceBert.predict_race_all("Daniel Xia")

[[{'label': 'nh_white', 'score': 0.057758793234825134},
  {'label': 'hispanic', 'score': 0.0033216052688658237},
  {'label': 'nh_black', 'score': 0.002415939699858427},
  {'label': 'api', 'score': 0.9329151511192322},
  {'label': 'aian', 'score': 0.0035884305834770203}]]

* The same edit would also apply to `predict_ethnicity`.
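
* The nested list returned above is a bit awkward to read. A minimal sketch for flattening it, using the output from `In [6]` as input; `scores_by_label` and `top_label` are hypothetical helper names, not part of RaceBERT:

```python
from typing import Dict, List


def scores_by_label(scores: List[Dict]) -> Dict[str, float]:
    """Turn one name's list of {'label', 'score'} dicts into {label: score}."""
    return {entry["label"]: entry["score"] for entry in scores}


def top_label(scores: List[Dict]) -> str:
    """Return the label with the highest score for one name."""
    return max(scores, key=lambda entry: entry["score"])["label"]


# Output of predict_race_all("Daniel Xia") copied from the cell above.
results = [[{'label': 'nh_white', 'score': 0.057758793234825134},
            {'label': 'hispanic', 'score': 0.0033216052688658237},
            {'label': 'nh_black', 'score': 0.002415939699858427},
            {'label': 'api', 'score': 0.9329151511192322},
            {'label': 'aian', 'score': 0.0035884305834770203}]]

# One inner list per input name; a single string gives a single inner list.
first = results[0]
print(top_label(first))                # api
print(scores_by_label(first)["api"])   # 0.9329151511192322
```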

# Remove defined race categories

* Yes, I would think so. 
* See the updated [tleitch/raceImpute/preprocess_data.py](https://github.com/tleitch/raceImpute/blob/main/models/preprocess_data.py) (small modifications from [raceBERT/models/process_raw_data.py](https://github.com/parasurama/raceBERT/blob/55f7eff322f5cdd8714bc1ff25e882ae6a00d9b9/models/process_raw_data.py)):
    - `cat_race_map` is now a function argument. As long as the race names in `cat_race_map` and `include_race` match, we should be good,
    - provided that every race value in the data has a key-value pair in `cat_race_map`, even the values we don't want to include.
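
* The two conditions above can be checked before preprocessing. This is a sketch under assumptions: `check_race_mapping` is a hypothetical helper (not in either repository), `cat_race_map` is assumed to map raw category codes to race names, and the codes below are made up for illustration.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def check_race_mapping(
    race_values: Iterable[Hashable],
    cat_race_map: Dict[Hashable, str],
    include_race: Iterable[str],
) -> Tuple[Set[Hashable], Set[str]]:
    """Return (race values with no key in the map, included names absent from the map)."""
    unmapped_values = set(race_values) - set(cat_race_map)
    unknown_names = set(include_race) - set(cat_race_map.values())
    return unmapped_values, unknown_names


# Hypothetical codes; the label names follow the model output shown earlier.
cat_race_map = {1: "nh_white", 2: "nh_black", 3: "hispanic", 4: "api", 5: "aian", 9: "other"}
include_race = ["nh_white", "nh_black", "hispanic", "api"]  # "aian" and "other" dropped
race_values = [1, 2, 3, 4, 5, 9, 1]  # stands in for the race column of the data

unmapped_values, unknown_names = check_race_mapping(race_values, cat_race_map, include_race)
print(unmapped_values, unknown_names)  # both empty when the mapping is complete
```

* Both sets being empty means every value in the data is mapped (including the ones to be dropped) and every name in `include_race` actually exists in `cat_race_map`.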