# spaCy & Cohere Location + Accomplishment Mapping
In this notebook, we go over location and accomplishment mapping with the LLM provider [Cohere](https://cohere.com/) and the NLP library [spaCy](https://spacy.io/). We go through the PDFs in the pdfs_for_dss folder, find all GPEs (geopolitical entities, i.e. locations) with spaCy, and then use Cohere to pick the relevant locations out of each batch. We also list each PDF's accomplishments with Cohere and plot the relevant locations with Google Maps.

The following code cell runs the necessary installs.

In [None]:
!pip install cohere
!pip install langchain
!pip install pypdf
!pip install tiktoken
!pip install chromadb
!pip install pdfminer.six

import cohere
import os
import time
import re
import pandas as pd
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import VectorDBQA
from langchain.llms import Cohere
from langchain.embeddings import CohereEmbeddings
from google.colab import drive
import spacy
from pdfminer.high_level import extract_text



Set the Cohere API key and mount Google Drive. You may have to change the folder path for your specific case!

In [None]:
os.environ["COHERE_API_KEY"] = "enter api key here"

drive.mount('/content/drive')

folder_path = "/content/drive/MyDrive/WWF x DSS/pdfs_for_dss"

Mounted at /content/drive
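Pasting a key straight into a cell makes it easy to share by accident. As a rough sketch of an alternative, the helper below reads the key from an environment variable and only prompts when it is missing. `get_api_key` is a hypothetical helper name, not part of Cohere or Colab.

```python
import getpass
import os

def get_api_key(name, prompt=getpass.getpass):
    """Return the key stored in environment variable `name`, prompting once if it is missing."""
    key = os.environ.get(name)
    if not key:
        # Nothing stored yet: ask without echoing, then cache it for later cells
        key = prompt(f"Enter value for {name}: ").strip()
        os.environ[name] = key
    return key

# Possible usage in this notebook (prompts only on the first run):
# get_api_key("COHERE_API_KEY")
```

Because the key ends up in `os.environ`, later cells that read `COHERE_API_KEY` from the environment should keep working unchanged.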


These are the good PDF files I will run the pipeline on (you can change the list accordingly).



In [None]:
good_pdf_files = [
    "4bd9978c-41b5-4a4d-bc4b-0a50a15c9d8d.pdf",
    "7adc790c-cde9-4581-80cc-4b196a1d5dfa.pdf",
    "1646f364-e360-42d8-9c0a-0a84f63e0265.pdf",
    "057385c7-73d0-4ffe-8a0e-02b6521e781d.pdf",
    "ea2b521c-7a6e-42d0-935b-ad7265b30a28.pdf",
    "feca6b34-bee5-46df-a34e-dbf33dd672a1.pdf",
    "f65582c6-adbb-4de7-9096-4df08064916d.pdf",
    "76d11bb3-328b-4e51-a56c-dbbc52832f14.pdf",
    "3a7665cf-ea44-4525-8669-c38064489f88.pdf",
    "3a5b9550-0d83-49b2-b2af-1a55e27f9e69.pdf",
    "5a35c033-666e-4dc8-aacf-d6ba9fa37d8f.pdf",
    "0694c9ab-1b8f-4bb2-9907-8ddaa7ee3da2.pdf",
    "39f6fc9b-9aff-4922-8c47-2fdac28c2238.pdf",
    "d86e6c0d-88c5-41c7-8ccf-9094dd99c3ad.pdf",
    "cc130fcb-e047-404a-b918-1b09c61bb638.pdf",
    "e23bb1cf-bec2-406b-9e86-d8c98d616961.pdf",
    "0271e0ce-5672-47c0-8ef8-b69b3c66428f.pdf",
    "4ab943c3-9fff-4c37-9ccf-a99b38fac8dc.pdf",
    "51edb2b9-0d58-452b-a662-ed2bc3e94b16.pdf"
]

Now we use spaCy to extract all the locations.

In [None]:
# Load SpaCy's NER model
nlp = spacy.load("en_core_web_sm")

# Function to extract locations or GPEs using Spacy
def extract_geolocations(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "GPE"]


# List to hold the data for creating DataFrame later
data_for_df = []

# Iterate over PDF filenames to extract locations, make sure good_pdf_files exists and folder_path is the path to the pdf
for pdf_name in good_pdf_files:
    pdf_path = os.path.join(folder_path, pdf_name)

    # Extract text from PDF
    text = extract_text(pdf_path)

    # Extract locations from text
    locations = extract_geolocations(text)

    # Append the results to our list
    data_for_df.append({"PDF Name": pdf_name, "Locations": locations})

# Create a DataFrame, add the data into it
df_locations = pd.DataFrame(data_for_df)

# Show the DataFrame, here's all our locations as a list
df_locations

Unnamed: 0,PDF Name,Locations
0,4bd9978c-41b5-4a4d-bc4b-0a50a15c9d8d.pdf,"[Nepal, Mongolia, Pakistan, Mongolia, Pakistan..."
1,7adc790c-cde9-4581-80cc-4b196a1d5dfa.pdf,"[US, BMDS, spills6, Oregon, WASHINGTON, US, Or..."
2,1646f364-e360-42d8-9c0a-0a84f63e0265.pdf,"[Mexico, Belize, Guatemala, Honduras, China, G..."
3,057385c7-73d0-4ffe-8a0e-02b6521e781d.pdf,"[Honduras, Mexico, Belize, Guatemala, Honduras..."
4,ea2b521c-7a6e-42d0-935b-ad7265b30a28.pdf,"[COMMUNITIES, US, KYRGYZ, Pakistan, UK, the Un..."
5,feca6b34-bee5-46df-a34e-dbf33dd672a1.pdf,"[US, Bering, Chukchi, Washington, DC, US]"
6,f65582c6-adbb-4de7-9096-4df08064916d.pdf,"[Bhutan, India, Mongolia, Nepal, Pakistan, Nep..."
7,76d11bb3-328b-4e51-a56c-dbbc52832f14.pdf,"[COMMUNITIES, the United States \n\nAgency for..."
8,3a7665cf-ea44-4525-8669-c38064489f88.pdf,"[US, US, US, US, US, US, Netherlands, US, US, ..."
9,3a5b9550-0d83-49b2-b2af-1a55e27f9e69.pdf,"[US, UK, US, US, California, Hawai’i, Virginia..."
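The raw lists above contain many repeats (`US, US, US, ...`) and some mentions with embedded line breaks. A minimal sketch of one way to tidy them, assuming each entry of `Locations` is a plain list of strings. `normalize_locations` is a hypothetical helper, not part of spaCy.

```python
import re
from collections import Counter

def normalize_locations(mentions):
    """Collapse whitespace in each mention and count how often each one occurs."""
    cleaned = (re.sub(r"\s+", " ", m).strip() for m in mentions)
    return Counter(m for m in cleaned if m)

# Sample mentions shaped like the spaCy output above
sample = ["US", "US", "the United States \n\nAgency", "Nepal", "US", "Nepal"]
counts = normalize_locations(sample)
print(counts.most_common(2))
```

The counts could also serve as a cheap relevance signal, since a location mentioned once is more likely to be noise than one mentioned dozens of times.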


In the following code cell, we find the accomplishments for every PDF.

In [None]:
# Creating the text splitter for us, you can change chunk size if needed
text_splitter = CharacterTextSplitter(chunk_size=5000, chunk_overlap=0)

data = []
# Some pdfs are problematic so we have to deal with them on the side
problematic_pdfs = []

pdf_files_processed = 0
# Unfortunately the Cohere API has a limit of 5 API calls per minute here (spaCy runs locally), so I need to keep track of calls
start_time = time.time()
time_last = 0
api_call_no = 0

for file_name in good_pdf_files:
    if file_name.endswith('.pdf'):
        # This will not execute the API call if I've exceeded my limit per min
        if api_call_no >= 5:
            time.sleep(60)
            api_call_no = 0
        try:
            full_path = os.path.join(folder_path, file_name)
            loader = PyPDFLoader(full_path)
            pages = loader.load_and_split()
            docs = text_splitter.split_documents(pages)
            embeddings = CohereEmbeddings()
            db_new = Chroma.from_documents(docs, embeddings)
            # You can change the chain_type if you want to experiment
            qa = VectorDBQA.from_chain_type(llm=Cohere(), chain_type="stuff", vectorstore=db_new)

            api_call_no = api_call_no + 1

            if api_call_no >= 5:
                time.sleep(60)
                api_call_no = 0
            # Query I'm using for accomplishments
            query_loc_2 = "Identify the key accomplishments/performance metrics/results of this pdf. The results should pertain to the work of WWF in 1 or multiple regions. Limit your response to 3 sentences."
            result_loc_2 = qa.run(query = query_loc_2, temperature = 0)

            api_call_no = api_call_no + 1

        except Exception as e:
            print(f"An unexpected error occurred while processing {file_name}: {e}")
            problematic_pdfs.append(file_name)

        # Append the pdf name and result to the data list, but only when the query succeeded.
        # (With `finally`, a failed PDF hit an undefined or stale result_loc_2.)
        else:
            data.append([file_name, result_loc_2])
            pdf_files_processed += 1
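The counter-and-sleep pattern in the loop above can be pulled out into a small reusable object. This is only a sketch of a sliding-window limiter, assuming the limit really is 5 calls per 60 seconds. `CallBudget` is a hypothetical name, and the `clock` and `sleep` parameters exist only so the behaviour can be checked without actually waiting.

```python
import time
from collections import deque

class CallBudget:
    """Allow at most `max_calls` calls inside any `period`-second window."""

    def __init__(self, max_calls=5, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()  # timestamps of the most recent calls

    def wait(self):
        """Call immediately before each API request."""
        now = self.clock()
        # Forget calls that have already left the sliding window
        while self.stamps and now - self.stamps[0] >= self.period:
            self.stamps.popleft()
        if len(self.stamps) >= self.max_calls:
            # Budget used up: sleep until the oldest remembered call expires
            self.sleep(self.period - (now - self.stamps[0]))
            self.stamps.popleft()
            now = self.clock()
        self.stamps.append(now)

# Possible usage: budget = CallBudget(), then budget.wait() before each Cohere call
```

Compared with sleeping a full 60 seconds every fifth call, this only waits for whatever is left of the window, which can save time when building the embeddings is itself slow.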

Now we handle the problematic PDFs for accomplishments.

In [None]:
# Define the special character removal function, which we use to handle the problematic PDFs.
# These PDFs don't work because they are too large, but removing
# special characters helps in this case.
def remove_special_characters(text):
    cleaned_text = re.sub(r'[^\w\s]', ' ', text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    return cleaned_text

text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
all_docs = []

full_path = os.path.join(folder_path, problematic_pdfs[0])
loader = PyPDFLoader(full_path)
pages = loader.load_and_split()

cleaned_pages = []
for page in pages:
    if hasattr(page, 'page_content') and page.page_content:
        # Clean the page content
        page.page_content = remove_special_characters(page.page_content)
        cleaned_pages.append(page)  # Append the cleaned page to the new list
    else:
        cleaned_pages.append(page)
docs = text_splitter.split_documents(cleaned_pages)
all_docs.extend(docs)

embeddings = CohereEmbeddings()

db_new = Chroma.from_documents(docs, embeddings)
qa = VectorDBQA.from_chain_type(llm=Cohere(), chain_type="stuff", vectorstore=db_new)
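To see what the cleaning step does to a page, here is a small standalone sketch. `clean_text` is a hypothetical variant of `remove_special_characters` that also trims leading and trailing spaces; the sample string is made up.

```python
import re

def clean_text(text):
    """Replace punctuation with spaces, collapse whitespace, and trim the ends."""
    no_punct = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", no_punct).strip()

sample = "WWF-US: 12% increase\n(2020-2021)"
print(clean_text(sample))          # WWF US 12 increase 2020 2021
print(clean_text("Amur-Heilong"))  # Amur Heilong
```

Note the side effect on hyphenated place names: `Amur-Heilong` becomes `Amur Heilong`, so names returned by the LLM after cleaning may no longer match the spaCy output exactly.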




In [None]:
query_prob = "Identify the key accomplishments/performance metrics/results of this pdf. The results should pertain to the work of WWF in 1 or multiple regions. Limit your response to 3 sentences."
result_prob = qa.run(query = query_prob, temperature = 0)

data.append([problematic_pdfs[0], result_prob])

In [None]:
locations_df = pd.DataFrame(data, columns=["PDF Name", "Accomplishments"])
locations_df

Unnamed: 0,PDF Name,Accomplishments
0,4bd9978c-41b5-4a4d-bc4b-0a50a15c9d8d.pdf,"Looking at the entire PDF, it appears that WW..."
1,7adc790c-cde9-4581-80cc-4b196a1d5dfa.pdf,Looking at the accomplishments of WWF in the ...
2,1646f364-e360-42d8-9c0a-0a84f63e0265.pdf,Looking at the PDF it looks like WWF's key ac...
3,057385c7-73d0-4ffe-8a0e-02b6521e781d.pdf,Here are some of the key accomplishments and ...
4,ea2b521c-7a6e-42d0-935b-ad7265b30a28.pdf,Looking at the PDF it seems like WWF's bigges...
5,feca6b34-bee5-46df-a34e-dbf33dd672a1.pdf,Looking at the PDF it looks like most of thes...
6,f65582c6-adbb-4de7-9096-4df08064916d.pdf,Looking at the PDF it looks like WWF's key ac...
7,76d11bb3-328b-4e51-a56c-dbbc52832f14.pdf,Looking at the annual report for WWF-US for 2...
8,3a7665cf-ea44-4525-8669-c38064489f88.pdf,Here are some of the key accomplishments/perf...
9,3a5b9550-0d83-49b2-b2af-1a55e27f9e69.pdf,Here are the key accomplishments/performance ...


Joining the accomplishments and locations back together.

In [None]:
df_locations['PDF Name'] = df_locations['PDF Name'].astype(str)
locations_df['PDF Name'] = locations_df['PDF Name'].astype(str)

combined_df = locations_df.merge(df_locations, on="PDF Name", how="left")

combined_df


Unnamed: 0,PDF Name,Accomplishments,Locations
0,4bd9978c-41b5-4a4d-bc4b-0a50a15c9d8d.pdf,"Looking at the entire PDF, it appears that WW...","[Nepal, Mongolia, Pakistan, Mongolia, Pakistan..."
1,7adc790c-cde9-4581-80cc-4b196a1d5dfa.pdf,Looking at the accomplishments of WWF in the ...,"[US, BMDS, spills6, Oregon, WASHINGTON, US, Or..."
2,1646f364-e360-42d8-9c0a-0a84f63e0265.pdf,Looking at the PDF it looks like WWF's key ac...,"[Mexico, Belize, Guatemala, Honduras, China, G..."
3,057385c7-73d0-4ffe-8a0e-02b6521e781d.pdf,Here are some of the key accomplishments and ...,"[Honduras, Mexico, Belize, Guatemala, Honduras..."
4,ea2b521c-7a6e-42d0-935b-ad7265b30a28.pdf,Looking at the PDF it seems like WWF's bigges...,"[COMMUNITIES, US, KYRGYZ, Pakistan, UK, the Un..."
5,feca6b34-bee5-46df-a34e-dbf33dd672a1.pdf,Looking at the PDF it looks like most of thes...,"[US, Bering, Chukchi, Washington, DC, US]"
6,f65582c6-adbb-4de7-9096-4df08064916d.pdf,Looking at the PDF it looks like WWF's key ac...,"[Bhutan, India, Mongolia, Nepal, Pakistan, Nep..."
7,76d11bb3-328b-4e51-a56c-dbbc52832f14.pdf,Looking at the annual report for WWF-US for 2...,"[COMMUNITIES, the United States \n\nAgency for..."
8,3a7665cf-ea44-4525-8669-c38064489f88.pdf,Here are some of the key accomplishments/perf...,"[US, US, US, US, US, US, Netherlands, US, US, ..."
9,3a5b9550-0d83-49b2-b2af-1a55e27f9e69.pdf,Here are the key accomplishments/performance ...,"[US, UK, US, US, California, Hawai’i, Virginia..."


In [None]:
# Assuming combined_df is your existing DataFrame and folder_path is defined as your PDF directory.
# Note: text_splitter is whichever splitter was defined last (chunk_size=100 if the problematic-PDF cell was run).

# Define a function to load, clean, split, and create embeddings from a PDF document
def process_pdf(full_path):
    loader = PyPDFLoader(full_path)
    pages = loader.load_and_split()
    cleaned_pages = []
    for page in pages:
        if hasattr(page, 'page_content') and page.page_content:  # Make sure the attribute exists and is not None
            # Clean the page content
            page.page_content = remove_special_characters(page.page_content)
            cleaned_pages.append(page)  # Append the cleaned page to the new list
        else:
            # Handle pages with no content or no 'page_content' attribute as you see fit
            cleaned_pages.append(page)
    # Split the cleaned pages before embedding
    docs = text_splitter.split_documents(cleaned_pages)
    embeddings = CohereEmbeddings()
    return Chroma.from_documents(docs, embeddings)

api_call_count = 0

# Prepare the new column for the DataFrame
combined_df["Relevant Locations"] = None  # Initialize the new column

for index, row in combined_df.iterrows():
    # Update the API call count and pause if needed
    if api_call_count >= 5:
        time.sleep(60)  # Wait for 60 seconds after every 5 API calls
        api_call_count = 0  # Reset the counter

    locations_list = row["Locations"]
    location_string = ", ".join(set(locations_list))
    full_path = os.path.join(folder_path, row["PDF Name"])
    db_new = process_pdf(full_path)

    qa = VectorDBQA.from_chain_type(llm=Cohere(), chain_type="stuff", vectorstore=db_new)

    query = f"From the following locations: {location_string}, tell me the most relevant locations in terms of the text you were trained with. Your response should be formatted as the exact locations separated by commas."

    # Execute the query and add the result to the DataFrame
    result = qa.run(query=query, temperature=0)
    combined_df.at[index, "Relevant Locations"] = result

    api_call_count += 1  # Increment the API call count



In [None]:
combined_df

Unnamed: 0,PDF Name,Accomplishments,Locations,Relevant Locations
0,4bd9978c-41b5-4a4d-bc4b-0a50a15c9d8d.pdf,"Looking at the entire PDF, it appears that WW...","[Nepal, Mongolia, Pakistan, Mongolia, Pakistan...","North Sikkim, Gurudongmar Lake, India, Bhutan..."
1,7adc790c-cde9-4581-80cc-4b196a1d5dfa.pdf,Looking at the accomplishments of WWF in the ...,"[US, BMDS, spills6, Oregon, WASHINGTON, US, Or...","US, BMDS, China, Australia, Mexico, Belize, G..."
2,1646f364-e360-42d8-9c0a-0a84f63e0265.pdf,Looking at the PDF it looks like WWF's key ac...,"[Mexico, Belize, Guatemala, Honduras, China, G...","Amur-Heilong, China, Russia, Mesoamerican Ree..."
3,057385c7-73d0-4ffe-8a0e-02b6521e781d.pdf,Here are some of the key accomplishments and ...,"[Honduras, Mexico, Belize, Guatemala, Honduras...","Amur-Heilong, Great Barrier Reef, Mesoamerica..."
4,ea2b521c-7a6e-42d0-935b-ad7265b30a28.pdf,Looking at the PDF it seems like WWF's bigges...,"[COMMUNITIES, US, KYRGYZ, Pakistan, UK, the Un...","Kyrgyz Republic:, Kyrgyz Republic: Production..."
5,feca6b34-bee5-46df-a34e-dbf33dd672a1.pdf,Looking at the PDF it looks like most of thes...,"[US, Bering, Chukchi, Washington, DC, US]","US, Chukchi, Bering, Washington, DC"
6,f65582c6-adbb-4de7-9096-4df08064916d.pdf,Looking at the PDF it looks like WWF's key ac...,"[Bhutan, India, Mongolia, Nepal, Pakistan, Nep...","Mongolia, Australia, China, Russia, Mexico, B..."
7,76d11bb3-328b-4e51-a56c-dbbc52832f14.pdf,Looking at the annual report for WWF-US for 2...,"[COMMUNITIES, the United States \n\nAgency for...","GSLEP, Nepal, India, Pakistan, Mongolia, Kyrg..."
8,3a7665cf-ea44-4525-8669-c38064489f88.pdf,Here are some of the key accomplishments/perf...,"[US, US, US, US, US, US, Netherlands, US, US, ...","Beijing, Tianjin, Shanghai, Shenyang, Nanjing..."
9,3a5b9550-0d83-49b2-b2af-1a55e27f9e69.pdf,Here are the key accomplishments/performance ...,"[US, UK, US, US, California, Hawai’i, Virginia...","J.D., C.C., Gunter, UK, US, Hawai’i, L.K., U...."
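Several rows above show the LLM returning names that spaCy never found (for example `J.D.` or `Beijing`) or entries with stray punctuation (`Kyrgyz Republic:`). One way to guard against that, sketched under the assumption that the answer is a comma-separated string, is to keep only items that match the candidate list. `parse_relevant` is a hypothetical helper.

```python
def _key(name):
    """Normalise a name for comparison: trim spaces and edge punctuation, lowercase."""
    return name.strip().strip(".:;").lower()

def parse_relevant(response, candidates):
    """Split a comma-separated LLM answer into (matched candidates, unmatched items)."""
    known = {_key(c): c.strip() for c in candidates}
    kept, unknown = [], []
    for item in response.split(","):
        key = _key(item)
        if not key:
            continue  # skip empty pieces such as a trailing comma
        match = known.get(key)
        if match is None:
            unknown.append(item.strip())
        elif match not in kept:
            kept.append(match)
    return kept, unknown

candidates = ["US", "Bering", "Chukchi", "Washington", "DC", "US"]
kept, unknown = parse_relevant("US, Chukchi, Bering, Washington, DC, Atlantis.", candidates)
print(kept)     # ['US', 'Chukchi', 'Bering', 'Washington', 'DC']
print(unknown)  # ['Atlantis.']
```

Splitting on commas still breaks up names like `Washington, DC`, so this is a filter for obvious hallucinations rather than a full parser.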


In [None]:
!pip install googlemaps
!pip install gmaps
import googlemaps
from googlemaps import Client as GoogleMaps
import gmaps

gmaps_client = googlemaps.Client(key="enter api key here")


In [None]:
!jupyter nbextension enable --py --sys-prefix gmaps

Enabling notebook extension jupyter-gmaps/extension...
Paths used for configuration of notebook: 
    	/usr/etc/jupyter/nbconfig/notebook.json
Paths used for configuration of notebook: 
    	
      - Validating: OK
Paths used for configuration of notebook: 
    	/usr/etc/jupyter/nbconfig/notebook.json


In [None]:
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
ca = gmaps_client.geocode("California")
ca


[{'address_components': [{'long_name': 'California',
    'short_name': 'CA',
    'types': ['administrative_area_level_1', 'political']},
   {'long_name': 'United States',
    'short_name': 'US',
    'types': ['country', 'political']}],
  'formatted_address': 'California, USA',
  'geometry': {'bounds': {'northeast': {'lat': 42.009503, 'lng': -114.131211},
    'southwest': {'lat': 32.528832, 'lng': -124.482003}},
   'location': {'lat': 36.778261, 'lng': -119.4179324},
   'location_type': 'APPROXIMATE',
   'viewport': {'northeast': {'lat': 42.009503, 'lng': -114.131211},
    'southwest': {'lat': 32.528832, 'lng': -124.482003}}},
  'place_id': 'ChIJPV4oX_65j4ARVW8IJ6IJUYs',
  'types': ['administrative_area_level_1', 'political']}]
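The response above is a list of matches, each with its coordinates nested under `geometry` and then `location`. A minimal sketch for pulling out the first match's coordinates, assuming the response keeps that shape. `first_lat_lng` is a hypothetical helper and the sample is a trimmed copy of the output above.

```python
def first_lat_lng(geocode_result):
    """Return (lat, lng) from the first match in a geocode response, or None."""
    if not isinstance(geocode_result, list) or not geocode_result:
        return None
    first = geocode_result[0]
    if not isinstance(first, dict):
        return None
    location = first.get("geometry", {}).get("location")
    if not location:
        return None
    return location["lat"], location["lng"]

# Trimmed copy of the California response shown above
sample = [{
    "formatted_address": "California, USA",
    "geometry": {"location": {"lat": 36.778261, "lng": -119.4179324}},
}]
print(first_lat_lng(sample))  # (36.778261, -119.4179324)
```

Returning `None` for anything unexpected keeps the caller simple: an empty list, an error string, and a match without geometry are all handled the same way.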

In [None]:
import collections.abc

# If any module or package imports Iterable from collections instead of collections.abc,
# this will serve as a compatibility patch.
collections.Iterable = collections.abc.Iterable


In [None]:
# Initialize the Google Maps client with your API key
gmaps_client = googlemaps.Client(key="enter api key here")

# A function to safely geocode a location and handle errors
def safe_geocode(location):
    try:
        # Try geocoding the location
        return gmaps_client.geocode(location)
    except Exception as e:
        # If an error occurs, return a formatted error string
        return f"Unable to Parse: {location} - {str(e)}"

# Create a new column 'Geocode Info' initialized with empty lists
combined_df['Geocode Info'] = [[] for _ in range(len(combined_df))]

# Iterate over the DataFrame rows
for index, row in combined_df.iterrows():
    # Split 'Relevant Locations' into a list
    locations = [loc.strip() for loc in row['Relevant Locations'].split(',')]
    geocode_info = []

    # Geocode each location
    for location in locations:
        result = safe_geocode(location)
        # Check if result is an error message or valid geocode information
        if isinstance(result, str) and result.startswith("Unable to Parse"):
            geocode_info.append(result)
        elif result:
            # If valid geocode information is available, extract relevant details
            geocode_info.append(result)
        else:
            geocode_info.append(f"Unable to Parse: {location} - No result")

    # Update the 'Geocode Info' column with the geocode information
    combined_df.at[index, 'Geocode Info'] = geocode_info

combined_df


Unnamed: 0,PDF Name,Accomplishments,Locations,Relevant Locations,Geocode Info
0,4bd9978c-41b5-4a4d-bc4b-0a50a15c9d8d.pdf,"Looking at the entire PDF, it appears that WW...","[Nepal, Mongolia, Pakistan, Mongolia, Pakistan...","North Sikkim, Gurudongmar Lake, India, Bhutan...",[[{'address_components': [{'long_name': 'North...
1,7adc790c-cde9-4581-80cc-4b196a1d5dfa.pdf,Looking at the accomplishments of WWF in the ...,"[US, BMDS, spills6, Oregon, WASHINGTON, US, Or...","US, BMDS, China, Australia, Mexico, Belize, G...",[[{'address_components': [{'long_name': 'Unite...
2,1646f364-e360-42d8-9c0a-0a84f63e0265.pdf,Looking at the PDF it looks like WWF's key ac...,"[Mexico, Belize, Guatemala, Honduras, China, G...","Amur-Heilong, China, Russia, Mesoamerican Ree...",[[{'address_components': [{'long_name': 'Amur ...
3,057385c7-73d0-4ffe-8a0e-02b6521e781d.pdf,Here are some of the key accomplishments and ...,"[Honduras, Mexico, Belize, Guatemala, Honduras...","Amur-Heilong, Great Barrier Reef, Mesoamerica...",[[{'address_components': [{'long_name': 'Amur ...
4,ea2b521c-7a6e-42d0-935b-ad7265b30a28.pdf,Looking at the PDF it seems like WWF's bigges...,"[COMMUNITIES, US, KYRGYZ, Pakistan, UK, the Un...","Kyrgyz Republic:, Kyrgyz Republic: Production...",[[{'address_components': [{'long_name': 'Kyrgy...
5,feca6b34-bee5-46df-a34e-dbf33dd672a1.pdf,Looking at the PDF it looks like most of thes...,"[US, Bering, Chukchi, Washington, DC, US]","US, Chukchi, Bering, Washington, DC",[[{'address_components': [{'long_name': 'Unite...
6,f65582c6-adbb-4de7-9096-4df08064916d.pdf,Looking at the PDF it looks like WWF's key ac...,"[Bhutan, India, Mongolia, Nepal, Pakistan, Nep...","Mongolia, Australia, China, Russia, Mexico, B...",[[{'address_components': [{'long_name': 'Mongo...
7,76d11bb3-328b-4e51-a56c-dbbc52832f14.pdf,Looking at the annual report for WWF-US for 2...,"[COMMUNITIES, the United States \n\nAgency for...","GSLEP, Nepal, India, Pakistan, Mongolia, Kyrg...","[Unable to Parse: GSLEP - No result, [{'addres..."
8,3a7665cf-ea44-4525-8669-c38064489f88.pdf,Here are some of the key accomplishments/perf...,"[US, US, US, US, US, US, Netherlands, US, US, ...","Beijing, Tianjin, Shanghai, Shenyang, Nanjing...",[[{'address_components': [{'long_name': 'Beiji...
9,3a5b9550-0d83-49b2-b2af-1a55e27f9e69.pdf,Here are the key accomplishments/performance ...,"[US, UK, US, US, California, Hawai’i, Virginia...","J.D., C.C., Gunter, UK, US, Hawai’i, L.K., U....","[Unable to Parse: J.D. - No result, Unable to ..."
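The same place names come up in several PDFs (Mongolia, China, Mesoamerican Reef, and so on), so the loop above geocodes them more than once. A rough sketch of a cache that wraps any geocoding function. `make_cached_geocoder` and `fake_geocode` are hypothetical names, and the fake stands in for `gmaps_client.geocode` so the example runs without a key.

```python
def make_cached_geocoder(geocode):
    """Wrap `geocode` so each distinct location name is only looked up once."""
    cache = {}

    def cached(location):
        key = location.strip().lower()
        if key not in cache:
            cache[key] = geocode(location)
        return cache[key]

    return cached

# Stand-in for gmaps_client.geocode that records every real lookup
calls = []
def fake_geocode(location):
    calls.append(location)
    return [{"geometry": {"location": {"lat": 0.0, "lng": 0.0}}}]

lookup = make_cached_geocoder(fake_geocode)
lookup("Mesoamerican Reef")
lookup(" mesoamerican reef ")
print(len(calls))  # 1
```

In the notebook, something like `lookup = make_cached_geocoder(gmaps_client.geocode)` could then be called from `safe_geocode`; lookups that raise are not cached, so they would be retried next time.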


In [None]:
marker_locations = []

for pdf_gmap_info in combined_df["Geocode Info"]:
    # Check if pdf_gmap_info is a list (one entry per geocoded location)
    if isinstance(pdf_gmap_info, list):
        for loc in pdf_gmap_info:  # Iterate through the list
            # A valid geocode result is a non-empty list of dictionaries;
            # failed lookups are stored as "Unable to Parse: ..." strings
            if isinstance(loc, list) and loc and 'geometry' in loc[0]:
                location = loc[0]['geometry']['location']  # Get the location dictionary
                latitude = location['lat']
                longitude = location['lng']
                marker_locations.append((latitude, longitude))
            # Anything else (e.g. an error string) is reported and skipped
            else:
                print(f"Invalid location format: {loc}")
    else:
        print(f"Non-list geocode info: {pdf_gmap_info}")  # Handle non-list geocode info

# Create the marker layer using the accumulated locations
markers = gmaps.marker_layer(marker_locations)

# Assume you have some valid entries to get the initial map center
if marker_locations:  # Check if there's at least one valid location to center the map
    first_lat, first_long = marker_locations[0]  # Use the first valid location
else:
    first_lat, first_long = (0, 0)  # Default to (0,0) if no valid locations

# Create the map object
fig = gmaps.figure(center=(first_lat, first_long), zoom_level=6)
fig.add_layer(markers)
fig


Invalid location format: Unable to Parse: BMDS - No result
Invalid location format: Unable to Parse: Mesoamerican Reef - No result
Invalid location format: Unable to Parse: Mesoamerican Reef - No result
Invalid location format: Unable to Parse: Chukchi - No result
Invalid location format: Unable to Parse: Bering - No result
Invalid location format: Unable to Parse: DC - No result
Invalid location format: Unable to Parse: GSLEP - No result
Invalid location format: Unable to Parse: J.D. - No result
Invalid location format: Unable to Parse: C.C. - No result
Invalid location format: Unable to Parse: Gunter - No result
Invalid location format: Unable to Parse: L.K. - No result
Invalid location format: Unable to Parse: DR - No result
Invalid location format: Unable to Parse: Srepok Wildlife - No result
Invalid location format: Unable to Parse: Mae Wong - No result
Invalid location format: Unable to Parse: forests1 - No result
Invalid location format: Unable to Parse: GREATER MEKONG CONTEXT -

Figure(layout=FigureLayout(height='420px'))
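The map above is centred on whichever marker happens to come first, which can leave most markers off-screen when the locations span several continents. As a rough alternative, the sketch below centres on the plain average of all markers. `map_center` is a hypothetical helper, and a simple average ignores the wrap-around at longitude ±180, so treat it as an approximation.

```python
def map_center(points):
    """Return the mean (lat, lng) of `points`, or (0.0, 0.0) if there are none."""
    if not points:
        return (0.0, 0.0)
    lat = sum(p[0] for p in points) / len(points)
    lng = sum(p[1] for p in points) / len(points)
    return (lat, lng)

print(map_center([(10.0, 20.0), (30.0, 40.0)]))  # (20.0, 30.0)
```

With a centre like this, a lower zoom level than 6 is probably needed to see everything at once.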

In [None]:
combined_df.to_csv('cohere_results.csv', index=False)
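One thing to watch with the CSV: list-valued columns such as `Locations` and `Geocode Info` are written as their string form, so they come back as plain strings when the file is reloaded. A minimal sketch of a round trip that survives, using only the standard library and a made-up row, is to store such columns as JSON.

```python
import csv
import io
import json

# Made-up row shaped like combined_df (a list-valued "Locations" column)
rows = [{"PDF Name": "example.pdf", "Locations": ["Nepal", "India"]}]

# Write: serialise the list column as JSON text
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["PDF Name", "Locations"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "Locations": json.dumps(row["Locations"])})

# Read: parse the JSON text back into a list
buffer.seek(0)
restored = [
    {**record, "Locations": json.loads(record["Locations"])}
    for record in csv.DictReader(buffer)
]
print(restored == rows)  # True
```

The same idea should carry over to pandas by applying `json.dumps` to the list columns before `to_csv` and `json.loads` after `read_csv`.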