# Previous work 
*The previous notebook covered exploratory data analysis of the competition data, its metadata, and other datasets. It also covered the birds' geographical distribution and the information scraped for each species.*

**BIRDNET - Extensive EDA:** https://www.kaggle.com/code/virajkadam/birdclef-2024-extenstive-eda/


*In this notebook, we use a language model (Phi-3-mini, 3.8B parameters) together with the scraped bird information to extract key details that might be useful for this competition. The extracted data can then be used to create various soft labels.*

# Imports

In [27]:
import pickle 
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from tqdm import tqdm
from copy import deepcopy
import json
import pandas as pd
torch.random.manual_seed(0)

<torch._C.Generator at 0x7c90898172f0>

In [5]:
class CFG: 
    model_checkpoint="microsoft/Phi-3-mini-4k-instruct"
    
    generation_args = { "top_k":10,
                       "top_p":0.95,
                        "num_return_sequences":1,
                        "max_new_tokens": 300,
                        "return_full_text": False,
                        "temperature": 0.1,
                        "do_sample": True,
                        }
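
To make the roles of `temperature`, `top_k` and `top_p` concrete, here is a minimal pure-Python sketch of how such filters typically reshape a next-token distribution. It is an illustration under simplifying assumptions (a toy five-token vocabulary, and the helper name `filter_distribution` is made up for this example), not the `transformers` implementation.

```python
import math

def filter_distribution(logits, temperature=0.1, top_k=10, top_p=0.95):
    """Return the renormalised sampling distribution over tokens (toy sketch)."""
    # 1. Temperature: dividing logits by a value < 1 sharpens the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # 2. Top-k: keep only the k highest-scoring tokens.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability).
    max_logit = kept[0][1]
    exps = [(tok, math.exp(val - max_logit)) for tok, val in kept]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    norm = sum(p for _, p in nucleus)
    return {tok: p / norm for tok, p in nucleus}

# Toy logits for five made-up tokens.
toy_logits = {"a": 2.0, "b": 1.5, "c": 0.5, "d": 0.0, "e": -1.0}
sharp = filter_distribution(toy_logits, temperature=0.1)
flat = filter_distribution(toy_logits, temperature=1.0)
```

With `temperature=0.1`, nearly all probability mass lands on the top token, so sampling behaves almost greedily. That is a reasonable fit for an extraction task, where variety in the output is not wanted.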
    
    

# Loading bird information data

In [6]:
with open("/kaggle/input/birdclef-2024-extenstive-eda/bird_info_dict.pkl",'rb') as f:
    bird_info_dict = pickle.load(f)
    
len(bird_info_dict)

182

In [7]:
list(bird_info_dict.items())[-3:]

[('yebbab1',
  '\n\nA gregarious and noisy species endemic to southern India and Sri Lanka. Very similar to Jungle Babbler, with which it is often confused. On Yellow-billed Babbler, note the pale spot in front of the eyes and a pale crown that contrast with the dark throat and cheeks, giving it a pale capped and blank-faced appearance. Also note pale blue eyes and a pale base to the tail. Prefers drier habitats than the Jungle Babbler, though they can sometimes be seen in the same vicinity.\nPOWERED BY MERLIN\n'),
 ('yebbul3',
  '\n\nYellow-green bulbul with a clean face and dark eyes. Inhabits foothill and lower montane forests; often in moist areas. Commonly encountered in pairs or small flocks, but will also join mixed-species flocks. Typically sticks to cover. Song is comprised of cheery whistles mixed with slurred warbles. Calls include warbles, clear “weeeep” notes, and short nasal notes.\nPOWERED BY MERLIN\n'),
 ('zitcis1',
  '\n\nVery small, with a distinctive short, rounded t
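
The descriptions shown above carry leading newlines and a trailing `POWERED BY MERLIN` line. A small cleaning step, sketched below, could strip both and flag species whose description ends up empty before any prompt is built. The helper names and the second sample entry are hypothetical; whether the real dictionary contains empty entries is not shown here.

```python
FOOTER = "POWERED BY MERLIN"  # trailer seen in the scraped descriptions above

def clean_description(text):
    """Strip surrounding whitespace and the trailing attribution line."""
    cleaned = text.strip()
    if cleaned.endswith(FOOTER):
        cleaned = cleaned[: -len(FOOTER)].rstrip()
    return cleaned

def find_empty_descriptions(info_dict):
    """Return species codes whose description is empty after cleaning."""
    return [code for code, text in info_dict.items() if not clean_description(text)]

# Toy stand-in for bird_info_dict; the second entry is a hypothetical empty scrape.
sample = {
    "yebbul3": "\n\nYellow-green bulbul with a clean face and dark eyes.\nPOWERED BY MERLIN\n",
    "xxxxxx1": "\n\n\nPOWERED BY MERLIN\n",
}
cleaned = {code: clean_description(text) for code, text in sample.items()}
empty_codes = find_empty_descriptions(sample)
```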

# Prompt to extract specific information from the scraped bird information.

In [8]:
system_prompt = "You are an information extraction model that extracts the information the user requests from the context the user provides. Your output is always in JSON format."

In [9]:
extraction_template = """Extract the following details about the bird from the provided context, delimited by triple single quotes, and format them into a JSON object. The description will be provided in the Input section.

Note that:
- If the relevant information for a field does not exist in the given context, return an empty string for that field.
- Please extract relevant information strictly from the context provided.
- An example of the task to be done is provided in the Example section.

###Field_name-Description:
1. **song_vocalization**: Identify and describe the type of song or singing patterns the bird exhibits. Include details about the rhythm, pitch, and repetition.
2. **habitat**: Describe the bird's natural environment or specific locations where it is typically found. Include information about the vegetation, climate, and geographic regions.
3. **size**: Provide measurements or descriptions of the bird's physical dimensions. Include details about its length, wingspan, and weight if available.
4. **appearance**: Describe the bird's visual characteristics. Include information about its coloration, plumage patterns, distinctive markings, and overall shape.
5. **call_vocalization**: Identify and describe the types of calls the bird makes. Include specific sounds, frequencies, and purposes (e.g., alarms, communication).
6. **similar_species**: List species that are mentioned as similar to the bird being described. Include specific attributes or features that make them comparable.

### JSON Output Format
```json
{output_json}
```

###Example
{example}

###Input
Context: '''
{bird_info_context}
'''
"""

In [10]:
example_0="""

### Example Description and JSON Output

**Example Bird Description**:
"The Red-winged Blackbird is known for its distinctive song, which is a loud, musical 'conk-la-ree!' It inhabits wetlands, marshes, and open fields. The bird measures about 7-9 inches in length with a wingspan of approximately 12-16 inches. It has striking black plumage with bright red and yellow shoulder patches. The bird's call is a sharp 'check!' similar to that of other blackbird species. It is often confused with the Tricolored Blackbird, which has similar shoulder patches but a different song pattern."

**JSON Output**:
```json
{
  "song_vocalization": "A loud, musical 'conk-la-ree!'",
  "habitat": "Wetlands, marshes, and open fields",
  "size": "Measures about 7-9 inches in length with a wingspan of approximately 12-16 inches",
  "appearance": "Striking black plumage with bright red and yellow shoulder patches",
  "call_vocalization": "A sharp 'check!'",
  "similar_species": "Tricolored Blackbird with similar shoulder patches but a different song pattern"
}
```
"""

output_json = """{
  "song_vocalization": "Description of the bird's song vocalizations",
  "habitat": "Description of the bird's habitat",
  "size": "Description of the bird's size",
  "appearance": "Description of the bird's appearance",
  "call_vocalization": "Description of the bird's call vocalizations",
  "similar_species": "Description of similar species and their attributes"
                }"""
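
The template is later filled with `str.format`. The JSON braces in `example_0` and `output_json` cause no trouble because they arrive as substituted values; only braces written directly in the template text would need doubling. The sketch below shows both cases with shortened stand-in strings (not the notebook's actual prompt).

```python
# Shortened stand-ins for extraction_template and output_json (illustrative only).
template = (
    "### JSON Output Format\n{output_json}\n"
    "###Input\nContext: '''\n{bird_info_context}\n'''"
)
schema = '{"habitat": "Description of the bird\'s habitat"}'

# Braces inside substituted values are copied through untouched.
prompt = template.format(output_json=schema, bird_info_context="Inhabits wetlands.")

# Braces written directly in a template must be doubled to survive str.format.
inline_template = 'Return {{"habitat": "..."}} for: {bird_info_context}'
inline_prompt = inline_template.format(bird_info_context="Inhabits wetlands.")
```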


# Phi-3-mini-4k-instruct

In [11]:
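# Model: load the checkpoint onto the GPU; the dtype is taken from the checkpoint config.
model = AutoModelForCausalLM.from_pretrained(
    CFG.model_checkpoint,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Tokenizer: load from the same checkpoint as the model.
tokenizer = AutoTokenizer.from_pretrained(CFG.model_checkpoint)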


In [12]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

In [13]:
def get_chat_json(system_message: str,
                  user_messages: list[str]) -> list[dict]:
    """Convert the user instruction and system prompt to the standard template."""
    chat_buf = []
    
    chat_buf.append({"role": "system", "content": system_message})
    
    for instr in user_messages:
        chat_buf.append({"role": "user", "content": instr})
        
    return chat_buf


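def get_bird_info_fields(bird_name: str,
                         generation_pipeline=pipe,
                         bird_info_dict: dict = bird_info_dict,
                         system_prompt: str = system_prompt,
                         output_json: str = output_json) -> list[dict]:
    """Run the extraction prompt for one species and return the raw pipeline output."""
    # Fall back to an empty context if the species has no scraped description.
    bird_info = bird_info_dict.get(bird_name, "")

    # Fill the template with the few-shot example, the JSON schema and the description.
    bird_ex_temp = extraction_template.format(example=example_0,
                                              bird_info_context=bird_info,
                                              output_json=output_json)

    prompt_messages = get_chat_json(system_prompt, [bird_ex_temp])

    bird_info_extracted = generation_pipeline(prompt_messages, **CFG.generation_args)

    return bird_info_extracted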
    
    

**Extract the bird information from the textual information provided**

In [14]:
def pickle_obj(obj,path):
    with open(path,"wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
        
    return 

In [15]:
def process_output(asst_inp:str):
    processed = json.loads(asst_inp.split("```json")[1].split("```")[0])
    return processed
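
`process_output` assumes every reply contains a `json` code fence; when it does not, the `split(...)[1]` lookup raises `IndexError` ("list index out of range"). A more defensive variant is sketched below. The name `parse_json_reply` and the fallback strategy are assumptions made for illustration, not part of the original notebook.

```python
import json
import re

# Matches a fenced block, with or without the "json" language tag.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)

def parse_json_reply(reply):
    """Return the first JSON object found in `reply`, or None if none parses."""
    candidates = [m.group(1) for m in FENCE_RE.finditer(reply)]

    # Fallback: the outermost {...} span, for replies without a code fence.
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        candidates.append(reply[start : end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None

fenced = ' ```json\n{"habitat": "Wetlands", "size": ""}\n```'
bare = 'Here you go: {"habitat": "Wetlands"}'
```

Returning `None` instead of raising lets the caller decide whether to retry, log the species, or skip it.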

In [16]:
%%time

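all_bird_info_dict = {}
not_parsed = []  # (bird_name, raw_output) pairs for species that failed

for n, bird_name in enumerate(tqdm(bird_info_dict.keys()), start=1):
    derived_bird_info = None  # reset so a failure never records the previous bird's output
    try:
        derived_bird_info = get_bird_info_fields(bird_name)
        derived_bird_info_parsed = process_output(derived_bird_info[0]['generated_text'])

        # Print every 10th result as a spot check.
        if n % 10 == 0:
            print(f"{bird_name} \n {derived_bird_info}", end="-"*50)

        # Keep the original description next to the extracted fields.
        derived_bird_info_parsed["info"] = bird_info_dict[bird_name]
        all_bird_info_dict[bird_name] = derived_bird_info_parsed
    except Exception as e:
        print(e)
        not_parsed.append((bird_name, derived_bird_info))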
        


  5%|▌         | 10/182 [01:16<20:44,  7.24s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


barswa 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "Dry, scratchy\'svit svit\' calls",\n  "habitat": "Any open habitat, especially large fields and wetlands",\n  "size": "Fairly large, long, forked tail, and dark rump",\n  "appearance": "Iridescent navy-blue above with a rich orange throat and forehead, underparts vary from bright buffy-orange to whitish",\n  "call_vocalization": "Dry, scratchy\'svit svit\' calls",\n  "similar_species": "Welcome Swallow, Angolan Swallow, Pacific Swallow"\n}\n```'}]--------------------------------------------------

 11%|█         | 20/182 [02:32<21:45,  8.06s/it]

blnmon1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "A bland, unremarkable \'wi-wi-wi.\'",\n  "habitat": "Inhabits both dry and wet forested areas from sea level up into foothills",\n  "size": "",\n  "appearance": "Beautiful azure-blue male with a black nape patch and a thin black gorget. Female lacks the black nape patch and gorget markings, with gray-brown wings and bright blue only on the head. Both male and female show black feathers on top of the bill and grayish-white underparts.",\n  "call_vocalization": "Harsh nasal \'wheh\' and an explosive \'whi-cheh!\'",\n  "similar_species": ""\n}\n```'}]--------------------------------------------------

 16%|█▋        | 30/182 [03:50<18:57,  7.49s/it]

brwjac1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "",\n  "habitat": "Well-vegetated wetlands, especially ponds and lakes with floating vegetation",\n  "size": "",\n  "appearance": "Adult’s dark neck and bright white eyebrows, Juvenile’s white cheek and orangish neck",\n  "call_vocalization": "",\n  "similar_species": "Pheasant-tailed Jacana and Bronze-winged Jacana"\n}\n```'}]--------------------------------------------------

 22%|██▏       | 40/182 [05:02<17:35,  7.43s/it]

comkin1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "High, shrill whistled call",\n  "habitat": "Rivers, streams, lakes, and ponds with small fish; fresh or brackish habitats",\n  "size": "Not specified",\n  "appearance": "Beautiful little blue-and-orange bird with a long, pointed bill",\n  "call_vocalization": "High, shrill whistled call",\n  "similar_species": "Not specified"\n}\n```'}]--------------------------------------------------

 27%|██▋       | 50/182 [06:25<19:48,  9.01s/it]

crfbar1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "Its call is a fast paced \'kuk-kuk-kuk\', doubtfully separable from that of a Coppersmith Barbet",\n  "habitat": "Wetter jungles of the Western Ghats",\n  "size": "",\n  "appearance": "Chunky green barbet with diagnostic red over its entire face, forehead, throat, and breast. It has a diffuse blue band on the sides of the head and breast, while a thick black traverse band separates the blue parts from the red. Juvenile is duller green, lacks the prominent face pattern of the adult, sports a gray band instead of blue, and shows some yellow among the red feathers.",\n  "call_vocalization": "A fast paced \'kuk-kuk-kuk\'",\n  "similar_species": "Coppersmith Barbet, with its call being slightly faster than that of the Malabar Barbet"\n}\n```'}]--------------------------------------------------

 33%|███▎      | 60/182 [07:51<17:56,  8.82s/it]

gargan 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "",\n  "habitat": "Wetlands with reedy and other fringing vegetation, migrates to Africa and southern Asia in winter, associates with other ducks on open water and tidal mudflats",\n  "size": "",\n  "appearance": "Handsome male with big white eyebrow on dark brown head, pale gray flanks, and striking pale gray forewing in flight; female similar to Green-winged Teal but with a more contrasting head pattern, paler eyebrow, darker eyestripe, and unstreaked whitish throat",\n  "call_vocalization": "",\n  "similar_species": "Female Green-winged Teal"\n}\n```'}]--------------------------------------------------

 35%|███▌      | 64/182 [08:16<12:26,  6.33s/it]

list index out of range


 38%|███▊      | 70/182 [09:06<14:20,  7.68s/it]

grenig1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "Loud \'bee-AHWEE!\'",\n  "habitat": "Forests, forest edges, and fields and plains with some tree cover",\n  "size": "",\n  "appearance": "Unique, feline-looking nightjar with distinct ears and complex shades of browns",\n  "call_vocalization": "Loud \'bee-AHWEE!\'",\n  "similar_species": ""\n}\n```'}]--------------------------------------------------

 44%|████▍     | 80/182 [10:34<15:35,  9.17s/it]

heswoo1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "Twangy \'cheeYEWW\' as well as high-pitched \'chee\' and \'chi’lee\' calls",\n  "habitat": "Lowland and foothill forests, dense wet areas with bamboo and teak, can also be found in drier forests",\n  "size": "",\n  "appearance": "Small and unique black-and-cream woodpecker with an oversized crested head and short tail. Male has a dark forehead, whereas female has a white one. Spots on the wing coverts",\n  "call_vocalization": "Twangy \'cheeYEWW\', high-pitched \'chee\', and \'chi’lee\' calls",\n  "similar_species": ""\n}\n```'}]--------------------------------------------------

 49%|████▉     | 90/182 [11:55<13:01,  8.50s/it]

inpher1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "",\n  "habitat": "Aquatic habitats across the Indian subcontinent",\n  "size": "",\n  "appearance": "Adults in breeding plumage have a dark reddish brown back with a yellowish head, neck, and breast. In nonbreeding plumage, they are virtually indistinguishable from nonbreeding Chinese Pond-Heron. In flight, adults appear surprisingly white due to their strikingly white wings, underparts, and tail.",\n  "call_vocalization": "",\n  "similar_species": "Nonbreeding Chinese Pond-Heron"\n}\n```'}]--------------------------------------------------

 51%|█████     | 93/182 [12:14<09:43,  6.55s/it]

list index out of range


 55%|█████▍    | 100/182 [13:12<11:08,  8.16s/it]

kerlau2 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "A melodious \'pe-kee-ke\' song and several other chatters intermingled with musical notes",\n  "habitat": "Stunted montane forests, tea estates, and wooded second growth",\n  "size": "",\n  "appearance": "Dull olive-brown upperparts, a grey-streaked white breast, and a rufous-chestnut lower breast and belly with a white eyebrow flaring in front of the eye over a dark eye-line",\n  "call_vocalization": "Songs and calls usually the first indication of its presence",\n  "similar_species": ""\n}\n```'}]--------------------------------------------------

 60%|██████    | 110/182 [14:31<08:55,  7.44s/it]

litswi1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "Constant descending trill, particularly near the nest",\n  "habitat": "Natural cliff settings and urban areas where it nests on buildings",\n  "size": "Small swift with a square-tipped tail, broad outer wings",\n  "appearance": "White on the rump extending to the sides, contrasting white throat, lacks a white belly",\n  "call_vocalization": "Constant descending trill",\n  "similar_species": "Superficially similar to house-martins"\n}\n```'}]--------------------------------------------------

 66%|██████▌   | 120/182 [15:50<07:23,  7.15s/it]

niwpig1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "Occasionally gives low sonorous coos",\n  "habitat": "Wet forests in hilly and montane areas",\n  "size": "",\n  "appearance": "Uncommon smooth gray pigeon with dark rusty-red wings and a black-and-white barred neck pattern",\n  "call_vocalization": "Low sonorous coos",\n  "similar_species": ""\n}\n```'}]--------------------------------------------------

 71%|███████▏  | 130/182 [17:05<06:39,  7.69s/it]

plapri1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "A repeated series of insectlike trills",\n  "habitat": "Various habitats, especially farms and wetlands, but avoids woodlands with a canopy",\n  "size": "",\n  "appearance": "Drab gray-brown plumage with a pale, wide eyebrow and a long tail that appears loosely attached to the body",\n  "call_vocalization": "Various buzzing and clinking notes, and snapping sounds with its wings",\n  "similar_species": ""\n}\n```'}]--------------------------------------------------

 77%|███████▋  | 140/182 [18:32<06:09,  8.80s/it]

revbul 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "Variety of chirps and whistles",\n  "habitat": "Scrubby edge habitat in urban and rural areas, introduced to Kuwait, Qatar, United Arab Emirates, Oman, and some Polynesian islands including Hawaii",\n  "size": "Medium-sized bird with a black crest and a white rump",\n  "appearance": "Dark, sleek, black crest, white rump, red color under the tail",\n  "call_vocalization": "Variety of chirps and whistles",\n  "similar_species": ""\n}\n```'}]--------------------------------------------------

 82%|████████▏ | 150/182 [19:51<04:19,  8.10s/it]

sbeowl1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "Deep echoing hoots",\n  "habitat": "Dense wet lowland and hill forests",\n  "size": "Silver-gray adult with a large yellow bill, black eyes, and sweeping sidewards-facing \'horns\'",\n  "appearance": "Silver-gray adult with a large yellow bill, black eyes, and sweeping sidewards-facing \'horns\'; juvenile is much paler, almost white, with shorter tufts",\n  "call_vocalization": "Deep echoing hoots",\n  "similar_species": ""\n}\n```'}]--------------------------------------------------

 88%|████████▊ | 160/182 [21:14<03:09,  8.60s/it]

sttwoo1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "",\n  "habitat": "Open forest from lowlands up into the foothills, usually in drier areas",\n  "size": "",\n  "appearance": "Lime-green woodpecker with a gray face and finely scaled underparts. Crown is blood red in males and black in females.",\n  "call_vocalization": "",\n  "similar_species": "Streak-breasted Woodpecker with pale eyes, denser streaking, and a thinner pale moustache stripe. Laced Woodpecker with less streaking on the underparts and a warm yellowish suffusion to the green on the neck. Scaly-bellied Woodpecker with clean throat and breast."\n}\n```'}]--------------------------------------------------

 93%|█████████▎| 170/182 [22:29<01:27,  7.30s/it]

whbtre1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "Noisy, giving hooting, chattering, and grating calls",\n  "habitat": "Humid hill forests in the Western Ghats, both primary and secondary patches, away from human habitation",\n  "size": "",\n  "appearance": "Black face and white underparts, White-bellied",\n  "call_vocalization": "Hooting, chattering, and grating calls",\n  "similar_species": "Rufous Treepie"\n}\n```'}]--------------------------------------------------

 99%|█████████▉| 180/182 [23:49<00:15,  7.73s/it]

yebbab1 
 [{'generated_text': ' ```json\n{\n  "song_vocalization": "",\n  "habitat": "Southern India and Sri Lanka, prefers drier habitats than the Jungle Babbler",\n  "size": "",\n  "appearance": "Pale spot in front of the eyes, pale crown, dark throat and cheeks, pale capped and blank-faced appearance, pale blue eyes, pale base to the tail",\n  "call_vocalization": "",\n  "similar_species": "Jungle Babbler, which it is often confused with"\n}\n```'}]--------------------------------------------------

100%|██████████| 182/182 [24:05<00:00,  7.94s/it]

CPU times: user 23min 26s, sys: 39.4 s, total: 24min 6s
Wall time: 24min 5s
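
Some replies in the log above use a placeholder such as `"Not specified"` (see `comkin1`) even though the prompt asks for an empty string. A light post-processing pass, sketched below, could normalise these and guarantee that all six fields exist. The placeholder list and the helper name are assumptions; adjust them to whatever the model actually emits.

```python
EXPECTED_FIELDS = (
    "song_vocalization", "habitat", "size",
    "appearance", "call_vocalization", "similar_species",
)
# Placeholder strings treated as "missing"; this list is an assumption.
PLACEHOLDERS = {"not specified", "n/a", "none", "unknown"}

def normalize_fields(record):
    """Return a copy with all expected keys present and placeholders blanked."""
    normalized = dict(record)
    for field in EXPECTED_FIELDS:
        value = str(normalized.get(field) or "").strip()
        if value.lower() in PLACEHOLDERS:
            value = ""
        normalized[field] = value
    return normalized

# Shortened example in the shape of one parsed reply.
raw = {"habitat": "Rivers, streams, lakes, and ponds", "size": "Not specified"}
clean = normalize_fields(raw)
```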





# Handling empty values

In [35]:
missing_bird_info = {"integr":"A stocky egret that is larger than Eastern Cattle and Little Egrets, but smaller than the Great Egret. Often confusing and difficult to identify, this bird is truly intermediate between potential confusion species. Things to look for include a yellow bill that is shorter than Great Egret but longer than Eastern Cattle Egret, a relatively rounded head, and shorter legs than Great Egret, approaching Eastern Cattle Egret. Like most egret species, associates with all types of water bodies including wet fields, often flocking with other egrets.",
                    "grbeat1":"A small, dainty bee-eater of open woodlands, farmland, and sparse human habitation. Mainly green with a bluish throat, slender black eye mask and throat band, rufous or slightly rusty crown and back of the neck, and central tail streamers. Juveniles are duller than adults and lack the long tail feathers. In flight, note the rusty underwing with a black trailing edge. Flies gracefully as it catches insects and other prey. Calls are a series of gentle liquid trills or more staccato notes."}

bird_info_dict.update(missing_bird_info)

#saving updated dictionary
pickle_obj(bird_info_dict,"./bird_info_dict_updated.pkl")

In [36]:
for bird_name in missing_bird_info.keys():
    derived_bird_info = get_bird_info_fields(bird_name)
    derived_bird_info_parsed = process_output(derived_bird_info[0]['generated_text'])
    derived_bird_info_parsed["info"] = bird_info_dict[bird_name]
    all_bird_info_dict[bird_name]= derived_bird_info_parsed
    print(derived_bird_info_parsed)

{'song_vocalization': '', 'habitat': 'Associates with all types of water bodies including wet fields, often flocking with other egrets', 'size': 'Stocky egret that is larger than Eastern Cattle and Little Egrets, but smaller than the Great Egret', 'appearance': 'Yellow bill that is shorter than Great Egret but longer than Eastern Cattle Egret, a relatively rounded head, and shorter legs than Great Egret, approaching Eastern Cattle Egret', 'call_vocalization': '', 'similar_species': 'Eastern Cattle Egret, Little Egret, and Great Egret', 'info': 'A stocky egret that is larger than Eastern Cattle and Little Egrets, but smaller than the Great Egret. Often confusing and difficult to identify, this bird is truly intermediate between potential confusion species. Things to look for include a yellow bill that is shorter than Great Egret but longer than Eastern Cattle Egret, a relatively rounded head, and shorter legs than Great Egret, approaching Eastern Cattle Egret. Like most egret species, a

In [37]:
pickle_obj(all_bird_info_dict,"./all_bird_info_dict.pkl")
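
As one possible reading of "soft labels", the sketch below spreads a small share of a species' label mass over other species whose extracted text overlaps with it. Everything here is an assumption made for illustration: the token-overlap (Jaccard) similarity, the `self_weight` of 0.8, the helper names, and the `hypoth1` species code. It is not the scheme used in the competition.

```python
import re

def tokens(text):
    """Lowercased word tokens of a description."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def soft_labels(info, field, target, self_weight=0.8):
    """Spread (1 - self_weight) of the label mass over species with similar text."""
    target_tokens = tokens(info[target].get(field, ""))
    sims = {
        code: jaccard(target_tokens, tokens(rec.get(field, "")))
        for code, rec in info.items() if code != target
    }
    total = sum(sims.values())
    if total == 0:
        return {target: 1.0}  # nothing similar: fall back to a hard label
    labels = {code: (1 - self_weight) * s / total for code, s in sims.items() if s > 0}
    labels[target] = self_weight
    return labels

# Toy records in the shape of all_bird_info_dict (values shortened).
toy = {
    "niwpig1": {"call_vocalization": "Low sonorous coos"},
    "sbeowl1": {"call_vocalization": "Deep echoing hoots"},
    "hypoth1": {"call_vocalization": "Low coos"},  # hypothetical species code
}
labels = soft_labels(toy, "call_vocalization", "niwpig1")
```

Here `niwpig1` keeps 0.8 of the mass and the remainder goes to `hypoth1`, the only entry sharing call-description words with it.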

# Next notebook in the series.

Birdclef2024 - Extracting Spectrograms and Exploratory Analysis of Spectrograms: https://www.kaggle.com/code/virajkadam/birdclef2024-extracting-spectograms-and-eda/