## Acknowledge example
This notebook walks through the main workflow described in the article. To replicate it with your own data, replace the path_to_file variable with the path to your own file so that it is read into pandas; the remaining steps can then be followed as they are. The article discusses the motivation for these methods in more depth.

In [12]:
from acknowledge.acknowledge import few_shot, instruction
from acknowledge.completions import *

import pandas as pd
import pickle
import numpy as np
import json
#for nicer df display
from IPython.display import display, HTML
#simple data exploration
from itertools import chain

### Loading data
The article and this example use Web of Science data. Depending on how you download it, the funding text column may be named differently. The workflow also works with other data, as long as you use the correct column names throughout the rest of the notebook.
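
Because the column name can vary between exports, a small lookup helper can save some trial and error. The sketch below is not part of the acknowledge package: `find_text_column` is a hypothetical helper, and the candidate names (including the "FX" tag) are assumptions that you should adjust to your own export.

```python
def find_text_column(columns, candidates=("Funding Text", "FX")):
    """Return the first column matching a candidate name, ignoring case and surrounding spaces."""
    # Map a normalized version of each column name back to the original name.
    normalized = {str(name).strip().lower(): name for name in columns}
    for candidate in candidates:
        match = normalized.get(candidate.strip().lower())
        if match is not None:
            return match
    raise KeyError("None of {} found in columns: {}".format(candidates, list(columns)))


# Example with plain column names; with pandas you would pass df.columns instead.
text_column = find_text_column(["Article Title", "funding text ", "Publication Year"])
print(repr(text_column))
```

The returned name can then be used wherever the notebook uses 'Funding Text'.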

In [2]:
#load csv into pandas
path_to_file='path-to-file.csv'

df = pd.read_csv(path_to_file)
df.shape

(1482, 92)

In [22]:
#the texts to extract
df[['Funding Text']].head()

Unnamed: 0,Funding Text
0,We thank ESRF for the synchrotron radiation be...
1,Financial support from the European Union with...
2,We gratefully acknowledge the ESRF for grantin...
3,K.C. benefited from support of the Deutsche Fo...
4,This work was supported by the U.S. National S...


### Few shot and instruction methods
Both methods use the same keys (see below). The difference is how we prompt the models. 
1. The few shot method maps three examples of prompts and completions to teach the model what to extract.
2. The instruction method simply tells the model what to do. 

In [20]:
#keys/information to extract from text
keys

'funding agencies (FUND), grant numbers (GRNB), facilities or universities or labs (INFRA), beamlines (BEAM), individuals with full names (IND), and type of assistance (TYPE)'

### Few-shot learning
Maps three example prompts to their expected completions.

In [4]:
#few_shot module
few_shot??

In [29]:
#completions used for few_shot method, see prompt_2, prompt_3 with their completions for more examples
prompt_1

'Open Access funding provided by Projekt DEAL. The authors gratefully acknowledge the support of Dr. Olaf Medenbach for performing the pre-preparation and Mr. Maximilian Kruth, Forschungszentrum Julich, Ernst Ruska-Centre for Microscopy and Spectroscopy with Electrons (ER-C) for the fine-tuning of the SXNT sample. The authors would like to thank Dr. Robert Mucke, Forschungszentrum Julich, Institute of Energy and Climate Research: Materials Synthesis and Processing (IEK-1) for the data management and assisting with the analyses of the tomograms. We gratefully acknowledge the German Federal Ministry of Education and Research (BMBF) for financing of the project Verbundvorhaben SOFC Degradation (Proposal Number 03SF0494A) and the ESRF for its financial support under MA-3254 experiment at the ID16B beamline.'

In [30]:
completion_1  # expected completion for prompt_1

{'FUND': 'Projekt DEAL|German Federal Ministry of Education and Research (BMBF)|ESRF',
 'IND': 'Dr. Olaf Medenbach|Mr. Maximilian Kruth|Dr. Robert Mucke',
 'INFRA': 'Forschungszentrum Julich|Ernst Ruska-Centre for Microscopy and Spectroscopy with Electrons (ER-C)|Institute of Energy and Climate Research: Materials Synthesis and Processing (IEK-1)',
 'BEAM': 'ID16B',
 'GRNB': '03SF0494A',
 'TYPE': 'open access funding|performing preparation|fine-tuning of the SXNT sample|data management and assisting with analysis|financing the project|financial support'}

In [32]:
prompt='\n\n\n[Keys]:{}\n\n\n[Text1]:{}\n\n[Output1]:{}\n\n[Text2]:{}\n\n[Output2]:{}\n\n[Text3]:{}\n\n[Output3]:{}\n\n[Text4]:{}\n\n[Output4]:'.format(
                          keys, #keys  
                          prompt_1,#trainx_1  
                          json.dumps(completion_1),#trainy_1  
                          prompt_2, #trainx_2  
                          json.dumps(completion_2), #trainy_2  
                          prompt_3, #trainx_3   
                          json.dumps(completion_3), #trainy_3  
                          '[NEW_PROMPT]'#testx_i  
            )
#print(prompt) #uncomment to see how the model takes in this prompt
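
The template above can also be assembled programmatically, which makes it easier to see where each new text replaces the '[NEW_PROMPT]' placeholder. This is a minimal sketch, not the package's internal code: `build_few_shot_prompt` is a hypothetical helper and the short example pairs are made up for illustration.

```python
import json


def build_few_shot_prompt(keys, examples, new_text):
    """Assemble the few-shot prompt from keys, (text, completion) pairs, and the text to process."""
    prompt = "\n\n\n[Keys]:{}\n".format(keys)
    # One [TextN]/[OutputN] pair per worked example, with the completion serialized as JSON.
    for i, (text, completion) in enumerate(examples, start=1):
        prompt += "\n\n[Text{}]:{}\n\n[Output{}]:{}".format(i, text, i, json.dumps(completion))
    # The final pair is left open so the model writes the last output.
    n = len(examples) + 1
    prompt += "\n\n[Text{}]:{}\n\n[Output{}]:".format(n, new_text, n)
    return prompt


# Made-up example pairs; the notebook uses prompt_1..prompt_3 and completion_1..completion_3.
demo_examples = [
    ("We thank ESRF for beam time.", {"FUND": "NaN", "INFRA": "ESRF"}),
    ("Funded by DFG.", {"FUND": "DFG", "INFRA": "NaN"}),
    ("Supported by ANR.", {"FUND": "ANR", "INFRA": "NaN"}),
]
demo_prompt = build_few_shot_prompt("funding agencies (FUND)", demo_examples, "A new funding text.")
print(demo_prompt)
```

With three examples, the result follows the same layout as the format string in the cell above.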

### Instruction (or Instruct)
Gives the prompt in the form of an instruction.

In [35]:
#instruction module
instruction??

In [36]:
#instructions for instruction module
instructions + '[NEW_PROMPT]'

'Extract the following data: funding agencies (FUND), grant numbers (GRNB), facilities or universities or labs (INFRA), beamlines (BEAM), individuals with full names (IND), and type of assistance (TYPE) in JSON format, use "NaN" if none found. From the following text: [NEW_PROMPT]'
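
Since the instruction asks for JSON with "NaN" for missing values, a completion can be normalized into a dictionary with the six keys before it is stored. The sketch below is an assumption about how one might do this, not the package's own parsing: `parse_completion` is a hypothetical helper, and joining lists with a pipe is a design choice made here to match the a|b|c layout used in the structured output.

```python
import json

KEYS = ("FUND", "GRNB", "INFRA", "BEAM", "IND", "TYPE")


def parse_completion(raw, keys=KEYS, missing="NaN"):
    """Parse a JSON completion into a dict with exactly the expected keys."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        data = {}  # unparsable or non-string output: treat every key as missing
    if not isinstance(data, dict):
        data = {}
    parsed = {}
    for key in keys:
        value = data.get(key, missing)
        if isinstance(value, list):
            # Join list answers with the pipe separator used elsewhere in the notebook.
            value = "|".join(str(item) for item in value)
        parsed[key] = missing if value is None else str(value)
    return parsed


print(parse_completion('{"FUND": ["ESRF", "DFG"], "BEAM": "ID16B"}'))
```

Keys the model leaves out, and completions that are not valid JSON, fall back to "NaN" so every row ends up with the same columns.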

### Running the model
The preliminary evaluation in the article found that **fsl** with **gpt-3.5-turbo** was the most accurate combination, so it is the **default**. It is also cheaper to run than InstructGPT. GPT-4 came out while this article was being finalised. Limited testing with fsl showed no obvious signs of improvement. However, it is possible that it will perform better with instructions.

#### Checkpoints
This script implements checkpoints with pickle. If the run is stopped or interrupted, simply re-run the block to resume processing where it left off. The file is saved in the checkpoints folder, and its name is built from a prefix you set below (checkpoint_name), the model version, and the method, for example v2_gpt-3.5-turbo_fsl.pickle.
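
The general pattern behind pickle-based checkpoints can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: `checkpoint_filename` and `process_with_checkpoint` are hypothetical names, the file name pattern is inferred from the "Saving to ..." message printed by the script, and `str.upper` stands in for the model call.

```python
import os
import pickle
import tempfile


def checkpoint_filename(prefix, model, method):
    """Build a name such as 'v2_gpt-3.5-turbo_fsl.pickle'."""
    return "{}_{}_{}.pickle".format(prefix, model, method)


def process_with_checkpoint(texts, process, path):
    """Process texts in order, saving after each one so that a re-run resumes where it stopped."""
    results = []
    if os.path.exists(path):
        with open(path, "rb") as f:
            results = pickle.load(f)  # results from an earlier, possibly interrupted run
    for text in texts[len(results):]:  # skip the texts that are already processed
        results.append(process(text))
        with open(path, "wb") as f:
            pickle.dump(results, f)
    return results


# Demo in a temporary folder, with str.upper standing in for the model call.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, checkpoint_filename("v2", "gpt-3.5-turbo", "fsl"))
    first_run = process_with_checkpoint(["a", "b"], str.upper, path)
    second_run = process_with_checkpoint(["a", "b", "c"], str.upper, path)  # only "c" is processed
    print(second_run)
```

Saving after every text costs some disk writes, but it means an interruption loses at most the text being processed at that moment.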

In [38]:
#runs the model, saves error list and index if any (only available for few_shot with gpt-3.5-)
# other model options 'text-davinci-003', 'gpt-3.5-turbo', 'ada', 'gpt-4'
#to run with instruction method, simply change the function name to instruction, or use the block below

few_shot(#using few_shot
    df, #pandas dataframe
    'Funding Text', #column name with text 
    checkpoint_name='v2', #name for your checkpoint
    model='gpt-3.5-turbo', #model version to use
    delay=0 #delay in seconds between texts
)

Using gpt-3.5-turbo with the following parameters:
max_tokens=500
stop=

[Output]:
temperature=0
Dataset has 1482 rows and 141488 words
Resuming from checkpoint
Saving to v2_gpt-3.5-turbo_fsl.pickle


0it [00:00, ?it/s]


'Finished!'

The script tells you the checkpoint name: "Saving to v2_gpt-3.5-turbo_fsl.pickle"

In [41]:
#to run with instruction method, simply uncomment below

#instruction(#using instruction
#    df, #pandas dataframe
#    'Funding Text', #column name with text 
#    checkpoint_name='v2', #name for your checkpoint
#    model='gpt-3.5-turbo', #model version to use
#    delay=0 #delay in seconds between texts
#)

In [40]:
#check for errors again
path_errors='errors/error_list.pickle'

with open(path_errors, 'rb') as f:
    error_list = pickle.load(f)
error_list

[]

### Inspecting data
Load the checkpoint with pickle to inspect the data. You can copy and paste the checkpoint name printed above into the path below, making sure the path points to the checkpoints folder.
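
If you are unsure which checkpoint files exist, you can list them first. A minimal sketch, assuming the checkpoints are stored as .pickle files in a folder named checkpoints relative to the notebook; `list_checkpoints` is a hypothetical helper.

```python
from pathlib import Path


def list_checkpoints(folder="checkpoints"):
    """Return the names of the pickle files in the folder, sorted alphabetically."""
    return sorted(path.name for path in Path(folder).glob("*.pickle"))


# Prints the available checkpoint names (an empty list if there are none).
print(list_checkpoints())
```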

In [42]:
#loads checkpoint in pickle format

path_to_checkpoint = 'checkpoints/#your-checkpoint-here#'

with open(path_to_checkpoint, 'rb') as f:
    checkpoint = pickle.load(f)

The texts have been structured into 6 columns, based on the keys provided to the model.

In [11]:
#structured acknowledgements data
acknowledge = pd.DataFrame(checkpoint)
print(acknowledge.shape)
acknowledge.head()

(1482, 6)


Unnamed: 0,FUND,IND,INFRA,BEAM,GRNB,TYPE
0,French Research Agency|German Science Foundati...,,ESRF,synchrotron radiation,ANR-14-CE35-0028-01|DFG Ta259/12|NANOTRANSMED|...,acknowledgement of support|fellowship|PhD fell...
1,European Union NFFA|Helmholtz Associations Ini...,PT-DESY,,FIB dual beam,NFFA 654360|5K13WC3|18-41-06001,financial support|use of instrument|support
2,Russian Science Foundation,Patrick Merz|Horst Bormann|Gohil Takur,ESRF|Nuclear Resonance beamline ID18,,17-72-20200|HC-2440,granting beam time|synthesizing <SUP>57</SUP>F...
3,Deutsche Forschungsgemeinschaft (DFG)|Max Plan...,K.C.,SOLEIL|ESRF|NSRRC,synchrotron,SE 1441/1-2|SFB 1143 (project-id 247310070)|Gr...,benefited from support|partially supported|gra...
4,U.S. National Science Foundation|U.S. Departme...,,Advanced Photon Source|Argonne National Labora...,x-ray microdiffraction,DMR-1609545|NaN|51802057|DD45001017|2016YFA030...,financial support|development of instrumentati...


Here is the first text as an example:

In [25]:
#example, original text
df[['Funding Text']].iloc[0].to_list()

['We thank ESRF for the synchrotron radiation beam time. We thank the French Research Agency (grant ANR-14-CE35-0028-01 to M.P.K.), the German Science Foundation (DFG Ta259/12 to M.T.), the INTERREG V Upper Rhine Program (NANOTRANSMED to M.P.K. and M.T.), and the Japan Society for the Promotion of Science (16K05515 to A.Y.). M.V. thanks DFG (GRK1114, EcTop2) for the fellowship. S.M. is thankful to Konrad-Adenauer Foundation and X.L. thanks ANR for PhD fellowships. W.A. thanks the Alexander von Humboldt Foundation. M.T. thanks Nakatani Foundation for support.']

The following is extracted and structured into tabular data, with a pipe separator and no spaces, like so: a|b|c
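
A pipe-separated cell can be turned back into a Python list when you need the individual items. This is a small sketch rather than part of the package: `split_cell` is a hypothetical helper, and dropping the "NaN" placeholder is a choice made here for illustration.

```python
def split_cell(cell, separator="|", missing="NaN"):
    """Split a pipe-separated cell into a list, dropping empty items and the missing-value placeholder."""
    if not isinstance(cell, str):
        return []  # e.g. a float NaN from pandas
    items = (item.strip() for item in cell.split(separator))
    return [item for item in items if item and item != missing]


print(split_cell("DMR-1609545|NaN|51802057"))  # → ['DMR-1609545', '51802057']
```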

In [46]:
#extracts following table
display(HTML(acknowledge.iloc[[0]].to_html()))

Unnamed: 0,FUND,IND,INFRA,BEAM,GRNB,TYPE
0,"French Research Agency|German Science Foundation|INTERREG V Upper Rhine Program|Japan Society for the Promotion of Science|DFG (GRK1114, EcTop2)|Konrad-Adenauer Foundation|Alexander von Humboldt Foundation|Nakatani Foundation",,ESRF,synchrotron radiation,ANR-14-CE35-0028-01|DFG Ta259/12|NANOTRANSMED|16K05515,acknowledgement of support|fellowship|PhD fellowships


The structured data can be further processed and explored. Below, the most common types of acknowledgement are counted. To explore another column, change the column variable.

In [64]:
column='TYPE' #change column name here
n=10 #top n to see below

# split each pipe-separated cell once, then flatten the resulting lists
split_values = acknowledge[column].str.split('|')
x_list = list(chain.from_iterable(split_values))

sample = pd.Series(x_list)

print('{}: {} observations'.format(column, sample.shape[0]))
print('{} are unique'.format(len(sample.unique())))
print('\n...of which, the top {} are:'.format(n))
sample.value_counts().head(n) # showing the top n results

TYPE: 5726 observations
2588 are unique

...of which, the top 10 are:


financial support               668
support                         391
funding                         338
NaN                             135
technical support                52
technical assistance             49
fruitful discussions             42
acknowledgement                  37
research support                 37
assistance in using beamline     37
dtype: int64
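
The counts above include the "NaN" placeholder and treat differently capitalised labels as separate values. One possible refinement, sketched here with the standard library only, is to skip the placeholder and lowercase each item before counting. `top_values` is a hypothetical helper and the sample cells are made up for illustration.

```python
from collections import Counter


def top_values(cells, n=10, separator="|", missing="NaN"):
    """Count the items across pipe-separated cells, ignoring case and the missing-value placeholder."""
    counts = Counter()
    for cell in cells:
        if not isinstance(cell, str):
            continue  # skip non-text cells such as a float NaN
        for item in cell.split(separator):
            item = item.strip().lower()
            if item and item != missing.lower():
                counts[item] += 1
    return counts.most_common(n)


# Made-up cells; with the notebook's data you could pass acknowledge['TYPE'] instead.
sample_cells = ["financial support|use of instrument", "Financial support|NaN", "support"]
print(top_values(sample_cells, n=3))
```

Whether to merge near-duplicates such as "support" and "financial support" is a separate analytical decision that this sketch does not attempt.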