# Getting Started with the Signal AI API using Python

Note: run `pip install -r requirements.txt` to install the dependencies for this notebook.

In [None]:
!pip install -r requirements.txt

In [None]:
import backoff
import requests
import json
import os
import pandas as pd
import matplotlib
import math
import datetime
from tqdm import tqdm

# Introduction

Welcome to the Signal AI API data science tutorial. 
In this notebook we will explore how to interact with the Signal AI API using Python.

This Getting Started guide is designed to complement the full documentation for the API which can be found at:

https://api.signal-ai.com/docs


## What can I do with the Signal AI API?

The Signal AI API provides a new and interesting way to search and explore news stories by examining which entities and topics appear in which stories, and when.

### Entities
A set of people, organisations, locations, substances, diseases and products identified by the Signal AI system. 
Mentions of entities are tracked along with their saliency (prominence) and sentiment (positive, neutral or negative).

### Topics
Over 300 topics (or themes), from Health to Blockchain, relevant to businesses.

### Documents
Documents are news stories with associated metadata such as publication date, language, media type, publication source and title. 
Each document is supplied with a list of entities and topics.
The full text for each news story is not currently available through the API.

### Queries

The following are examples of questions one could answer using the Signal AI API:

> How many documents mentioned a company in relation to the Environment this year?

> Do mentions of a company have more positive or negative sentiment when a particular individual or location is also mentioned? How has this evolved over time? Does this depend on the publication source?

> Which topics, locations, organisations are most frequently mentioned alongside a particular individual?

> How many documents mention entity A but not entity B?

During this tutorial we'll gradually build up the knowledge to answer these questions.


## Outline

Firstly we will explore some basic usage of the API such as authentication and the pagination system.
Then we'll explore the different topics and entities available before building some search queries.

We'll be using the popular requests library to communicate with the API and pandas to manipulate and visualise the data, but prior experience with these libraries is not required.

Before we start, let's make sure this machine can contact the Signal AI API using the requests library.

In [None]:
response = requests.get("https://api.signal-ai.com")
if response.ok:
    print('Successfully sent a request to the Signal AI API')
else:
    print('Error: Cannot communicate with the Signal AI API')

# Authentication

You will need a `client_id` and `client_secret` to gain access to the API.
The code below assumes they have been set as the environment variables `SIGNAL_API_CLIENT_ID` and `SIGNAL_API_CLIENT_SECRET` respectively.

Using your credentials you can request a temporary access token from the API using the url:

`https://api.signal-ai.com/auth/token`

Since we will be using this token a lot, let's create a small function that authenticates against the Signal AI API using the requests library.

Later we'll add a function called `request` that sends queries to the API with the temporary access token. Keep in mind that this token is only valid for 24 hours, so you may need to call `authenticate` again from time to time to request a new one.

In [None]:
def authenticate(client_id, client_secret, url = "https://api.signal-ai.com"):
    """ obtain a temporary access token using user credentials """
    token_url = f'{url}/auth/token'
    payload = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret
    }
    response = requests.post(token_url, data=payload)
    return response.json().get("access_token")

Once authenticated the token will last for 24 hours.

In [None]:
TEMP_ACCESS_TOKEN = authenticate(os.environ['SIGNAL_API_CLIENT_ID'], os.environ['SIGNAL_API_CLIENT_SECRET'])
if TEMP_ACCESS_TOKEN:
    print('Congratulations! You have an access token, it will last for 24 hours before you will need to reauthenticate by repeating this step')
else:
    print('Error: Perhaps the credentials are incorrect?')
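
Since the token expires after 24 hours, a long-running script may want to remember when the token was issued and refresh it shortly before it lapses. The following is a minimal sketch of that idea, not part of any Signal AI client: the `TokenCache` class, its `fetch_token` argument and the one-hour safety margin are assumptions made for illustration.

```python
import time


class TokenCache:
    """Hypothetical helper: cache a token and refresh it before it expires."""

    def __init__(self, fetch_token, lifetime_s=24 * 60 * 60,
                 margin_s=60 * 60, clock=time.time):
        self._fetch_token = fetch_token  # callable returning a fresh token
        self._lifetime_s = lifetime_s    # tokens are valid for 24 hours
        self._margin_s = margin_s        # refresh this early (assumed value)
        self._clock = clock              # injectable so the logic can be tested
        self._token = None
        self._issued_at = None

    def get(self):
        """Return the cached token, fetching a new one if missing or near expiry."""
        now = self._clock()
        stale = (
            self._token is None
            or now - self._issued_at >= self._lifetime_s - self._margin_s
        )
        if stale:
            self._token = self._fetch_token()
            self._issued_at = now
        return self._token
```

In this notebook, `fetch_token` could be a small lambda that calls `authenticate` with the two environment variables.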

## Making a Simple Request

Here's an example request showing how to use the access token.
The results will always come back as JSON.
We'll explore the specific endpoints and parameters in more detail later.
Notice the `next-cursor` key in the results.
By default, ten results are returned in each response; to get the next ten we need to paginate.

In [None]:
response = requests.get(
    # call the entities endpoint
    'https://api.signal-ai.com/entities',
    params={
        # Look for entities containing the token 'Environment'
        'name':'Environment',
        # Limit the search to organisations
        'type':'organisation'
    },
    # include the access token in the header
    headers={
        "Authorization": f'Bearer {TEMP_ACCESS_TOKEN}',
        "Content-Type": "application/json",
    }
)

response.json()

# Rate Limiting and Retries

Sometimes a request needs to be repeated, for example when the usage limits of the API are exceeded or there is a brief connection issue.
The backoff library is an easy way to do this.
It's a good idea to use backoff if you are making a lot of requests in a script or function; otherwise a single error might cause it to terminate.
Let's use backoff and put together everything so far in a simple function.

In [None]:
# Custom error for 429 HTTP errors. 
# We will raise it when the API returns this error meaning that we hit a rate limit 
class RateLimitError(Exception):
    pass

# retry requests up to 10 times with an exponentially increasing wait time
# (max_value caps the wait in seconds, max_tries limits the number of attempts)
@backoff.on_exception(
    backoff.expo, 
    RateLimitError,
    max_value=10,
    max_tries=10)
def request(method, endpoint, params=None, json=None):
    """ Make requests using a temporary access token """
    
    response = requests.request(
        method,
        f'https://api.signal-ai.com/{endpoint}',
        params=params,
        json=json,
        headers={
            "Authorization": f'Bearer {TEMP_ACCESS_TOKEN}',
            "Content-Type": "application/json",
        },
    )
    
    # check if we hit the rate limit
    if response.status_code == 429: 
        raise RateLimitError
    
    # in all cases raise an exception if the status is not success
    response.raise_for_status()
    return response


request('GET', 'entities', {'name': 'Environment'}).json()
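
To make the retry behaviour concrete, here is a rough standard-library sketch of what an exponential backoff loop does. It is an illustration rather than the backoff library's implementation: the helper `retry_with_backoff` and its parameters are hypothetical, and it adds no random jitter, which the backoff library applies by default.

```python
import time


class RateLimitError(Exception):
    """Stand-in for the error raised on an HTTP 429 response."""


def retry_with_backoff(func, max_tries=5, max_wait=10, sleep=time.sleep):
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_tries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_tries - 1:
                raise  # out of attempts: let the caller see the error
            # wait 1, 2, 4, 8, ... seconds, but never more than max_wait
            sleep(min(2 ** attempt, max_wait))
```

Passing `sleep` as an argument keeps the sketch easy to try out without actually waiting.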

# Pagination

In the previous example the response contained a `next-cursor` key. 
By default, responses contain 10 items and the `next-cursor` allows us to get the next page of 10 items.
If there is no `next-cursor` key then there are no more items remaining.
To get the next page, use the token provided by `next-cursor` in your next request as the `from-cursor`.


Example:

In [None]:
# Pagination Example:

page_0 = request('GET', 'entities', {'name': 'Environment'})

page_1 = requests.get(
    # call the entities endpoint
    'https://api.signal-ai.com/entities',
    params={
        # Look for entities containing the token 'Environment' (same search as page_0)
        'name':'Environment',
        # 'from-cursor' is found under the 'next-cursor' key in the previous response
        'from-cursor': page_0.json().get('next-cursor')
    },
    # include the access token in the header
    headers={
        "Authorization": f'Bearer {TEMP_ACCESS_TOKEN}',
        "Content-Type": "application/json",
    }
)

page_1.json()
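
The same two-step pattern generalises to a loop: keep requesting with the latest `next-cursor` until a response arrives without one. The sketch below shows that loop in isolation against an in-memory stand-in for the API, so `iter_pages`, `fake_fetch_page` and the cursor values are invented for the example; only the `next-cursor` and `from-cursor` keys come from the API.

```python
def iter_pages(fetch_page, params):
    """Yield successive pages until one has no 'next-cursor' key."""
    params = dict(params)  # copy so the caller's dict is not modified
    while True:
        page = fetch_page(params)
        yield page
        cursor = page.get('next-cursor')
        if not cursor:
            return  # no cursor means this was the final page
        params['from-cursor'] = cursor


# In-memory stand-in for the API: pages keyed by the cursor that reaches them
FAKE_PAGES = {
    None: {'entities': ['a', 'b'], 'next-cursor': 'c1'},
    'c1': {'entities': ['c', 'd'], 'next-cursor': 'c2'},
    'c2': {'entities': ['e']},
}


def fake_fetch_page(params):
    return FAKE_PAGES[params.get('from-cursor')]


entities = []
for page in iter_pages(fake_fetch_page, {'name': 'Environment'}):
    entities.extend(page['entities'])
print(entities)  # ['a', 'b', 'c', 'd', 'e']
```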

Because we will frequently need to iterate over pages returned from the API, let's create a class called Paginate to manage this for us. 

In [None]:
from urllib.parse import urlparse, parse_qs

def response_to_url(response) -> str:
    """ get the url from a response """
    obj = urlparse(response.request.url)
    return f"{obj.scheme}://{obj.netloc}{obj.path}"

def response_to_params(response) -> dict:
    """ get the params from a response """
    obj = urlparse(response.request.url)
    return parse_qs(obj.query)

def response_to_body(response) -> dict:
    body = response.request.body
    return json.loads(body.decode()) if body is not None else {}


class Paginate:
    """ 
    A class to iterate over API responses, one page at a time
    """
    
    def __init__(self, response):
        self.response = response

    def __iter__(self):
        return self
    
    def _get(self, response):
        """ get the next get request from the previous one """
        nxt = response.json().get('next-cursor', None)
        # the absence of next-cursor signifies we have reached the final page
        if not nxt:
            return None
        params = response_to_params(response)
        params['from-cursor'] = nxt
        return requests.request(
                'GET',
                response_to_url(response),
                headers=response.request.headers,
                params=params,
        )
    
    def _post(self, response):
        nxt = response.json().get('next-cursor', None)
        if not nxt:
            return None
        body = response_to_body(response)
        body['from-cursor'] = nxt
        return requests.request(
                'POST',
                response_to_url(response),
                headers=response.request.headers,
                json=body,
        )
        
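    # retry up to 10 times with an exponentially increasing wait time (capped at 10 seconds)
    @backoff.on_exception(backoff.expo, RateLimitError, max_value=10, max_tries=10)
    def __next__(self):
        """ get the next response from the previous one """
        response = self.response

        # Check if we have reached the final page. Compare against None explicitly:
        # a Response object is falsy whenever its status code signals an error
        if response is None:
            raise StopIteration()
        
        # check if we hit the rate limit
        if response.status_code == 429: 
            # re-send the same request so that the retry inspects a fresh response
            self.response = requests.Session().send(response.request)
            raise RateLimitError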

        # check if the response is valid
        response.raise_for_status()
        
        
        method = response.request.method
        if method == 'GET':
            self.response = self._get(response)
        elif method == 'POST':
            self.response = self._post(response)
        else:
            raise ValueError(f'{method} method not supported')

        return response
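
The helper functions above rely on `urllib.parse` to take a request URL apart. A short standalone illustration of what they return follows; the URL and cursor value are only examples. Note that `parse_qs` maps each parameter to a list of values, which requests accepts when the parameters are sent again.

```python
from urllib.parse import urlparse, parse_qs

url = 'https://api.signal-ai.com/entities?name=Environment&type=organisation'
parts = urlparse(url)

# Rebuild the URL without its query string
base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'
# Parse the query string into a dict of lists
params = parse_qs(parts.query)

print(base_url)  # https://api.signal-ai.com/entities
print(params)    # {'name': ['Environment'], 'type': ['organisation']}

# Adding the cursor for the next page is then a plain dict update
params['from-cursor'] = 'example-cursor'
```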


Now we can easily loop over pages in a pythonic way. 
Below we get all the results for our search by iterating over all the pages.

In [None]:
results = []
response = request('GET', 'entities', {'name': 'Environment'})
for page in Paginate(response):
    results.extend(page.json()['entities'])
print(f'{len(results)} results found')

# The Discovery Endpoints

We now have all we need to explore all of the API endpoints in detail.
We'll start with the discovery endpoints since they are the simplest, and illustrate their usage using code.
Use these endpoints to discover which entities and topics are in the system:
we will use the IDs returned in our subsequent searches. 

In [None]:
"""
Entities are named people, organisations, locations, substances, diseases and products
which all have a unique identifier within the Signal AI API.
"""

def get_entity(uuid):
    """ Get a specific entity by its unique identifier """
    return request('GET', f'entities/{uuid}').json().get('entity')

def search_entities(name: str = None, typ: str = None, size: int = None):
    """
    Search for entities using any combination of name and type

    name: Any entity whose name contains this search term will match
    typ: Enum: "person" "organisation" "location" "substance" "disease" "product"
    size: number of entities per response (affects page size, not search results)
    """
    response = request('GET', 'entities', {'name': name, 'type': typ, 'size': size})
    results = []
    for page in Paginate(response):
        results.extend(page.json().get('entities'))
    return results

We can search for entities with any combination of `name` and `type` fields.

In [None]:
search_entities(name="Conservation", typ="organisation")[:5]

In [None]:
# get a specific entity by id
get_entity('6a29ecbc-a8e8-3a73-b172-5efe4901b8e4')
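
Because a name search can return several partial matches, it can help to pick out the ID of the exact entity you want before using it in later searches. The helper below is a hypothetical convenience that assumes each returned record has at least `id`, `name` and `type` fields; the sample records are made up.

```python
def find_entity_id(entities, name, typ=None):
    """Return the id of the first entity whose name matches exactly (case-insensitive)."""
    for entity in entities:
        if entity.get('name', '').lower() != name.lower():
            continue
        if typ is not None and entity.get('type') != typ:
            continue
        return entity.get('id')
    return None  # nothing matched


# Made-up records standing in for the output of search_entities
sample = [
    {'id': 'id-1', 'name': 'Conservation International', 'type': 'organisation'},
    {'id': 'id-2', 'name': 'Greenpeace', 'type': 'organisation'},
]
print(find_entity_id(sample, 'greenpeace'))  # id-2
```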

Topics and Sources have very similar interfaces.

In [None]:
""" 
Signal AI experts have trained over 300 topics (or themes), 
from Health to Blockchain to provide clients an easy way 
to track emerging trends relevant to their businesses. """

def get_topic(uuid):
    """ Get a topic by id """
    return request('GET', f'topics/{uuid}').json().get('topic')

def search_topics(name: str = None, size=10, private=False):
    """
    Search A.I. trained topics by name

    name: Any topic whose name contains this search term will match
    size: number of topics per request (affects performance, not search results)
    private: Only return topics which are private to your organisation
    """
    response = request('GET', 'topics', {'name': name, 'size': size, 'private': str(private).lower()})
    results = []
    for page in Paginate(response):
        results.extend(page.json().get('topics'))
    return results

In [None]:
search_topics(name='Energy')

In [None]:
def get_sources(uuid):
    """ Get a publication source by id """
    return request('GET', f'sources/{uuid}').json().get('source')

def search_sources(
    name: str = None, size: int = None, country: str = None,
    region: str = None, subregion: str = None
):
    """
    Search publication sources

    name: Any publication whose name contains this search term will match
    size: number of sources per request (affects performance, not search results)
    country: Country name
    region: Region name
    subregion: Subregion name
    """
    response = request(
        'GET',
        'sources',
        {
            'name': name, 'size': size, 'country': country,
            'region': region, 'subregion': subregion
        }
    )
    results = []
    for page in Paginate(response):
        results.extend(page.json().get('sources'))
    return results

In [None]:
search_sources(name='Guardian', country='United Kingdom')[:5]

# Search Endpoint

The search endpoint allows the user to search for documents.
Each document is tagged with metadata such as date of publication, language and all of the topics, sources and entities mentioned in the document.
In addition, the saliency and sentiment of each mention are available.
This makes the search endpoint a good way to explore when different tags are mentioned as well as how they occur with other topics and entities.

The search endpoint is very powerful and offers a lot of freedom.
To see all the possibilities it's best to look at the API docs.
Here we will explore some examples to get a flavour for what is possible.
The query needs to be provided as a JSON object and represents a filter for which documents are relevant.
Keep in mind that very broad queries might take a long time to return all the results.

Make sure you look at the documentation if you want to take full advantage of this endpoint:

https://api.signal-ai.com/docs#tag/Content-search

In [None]:
def search_documents(query):
    documents = []
    response = request('POST', 'search', json=query)
    total = response.json().get('stats').get('total')
    if total == 0:
        return []
    n_pages = total/len(response.json().get('documents'))
    # Use a progress bar, big queries may take some time
    for page in tqdm(Paginate(response), total=math.ceil(n_pages)):
        documents.extend(page.json()['documents'])
    return documents

Let's look at an example query and the results.

## Example 1: Searching for stories about the environment

In [None]:
search_topics(name='environment')

In [None]:
todays = datetime.datetime.today().strftime('%Y-%m-%d')
three_days_ago = (datetime.datetime.today() - datetime.timedelta(days=3)).strftime('%Y-%m-%d')
a_month_ago = (datetime.datetime.today() - datetime.timedelta(days=30)).strftime('%Y-%m-%d')

In [None]:
query = {
    'where': {
        # documents from the last 3 days
        "published-at": {"gt": three_days_ago, "lte": todays},
        'topics': {
            'id': {
                # id for the environment topic found using the discovery endpoints
                'eq': '7a162a73-0062-4772-9dc0-252dd862dad0'
            },
        },
    },
    'size': 500
}
documents = search_documents(query)
print(f'{len(documents)} documents found')
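
For longer periods, one way to keep each request manageable is to split the date range into shorter windows and run one query per window. This is a sketch of the date arithmetic only: `date_windows` is a hypothetical helper, and whether smaller windows actually speed things up, as well as how the API treats the boundary dates, are assumptions to check against your own queries.

```python
import datetime


def date_windows(start, end, days=7):
    """Split the period from start to end into consecutive date-string pairs."""
    step = datetime.timedelta(days=days)
    windows = []
    cursor = start
    while cursor < end:
        # the last window is shortened so it never runs past the end date
        window_end = min(cursor + step, end)
        windows.append((cursor.strftime('%Y-%m-%d'), window_end.strftime('%Y-%m-%d')))
        cursor = window_end
    return windows


windows = date_windows(datetime.date(2021, 1, 1), datetime.date(2021, 1, 20))
for gt, lte in windows:
    # each pair can be used as {"gt": gt, "lte": lte} in the "published-at" filter
    print(gt, lte)
```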

## Example 2: Search for stories about Greenpeace

In [None]:
search_entities(name='Greenpeace')

In [None]:
query = {
    'where': {
        "published-at": {"gt": three_days_ago , "lt": todays},
        'topics': {
            'id': {
                'eq': '7a162a73-0062-4772-9dc0-252dd862dad0'
            },
        },
        'entities': {
            'id': {
                # id for greenpeace
                'eq': 'f09d0747-d36d-400e-8aca-a1b5d51c65f7'
            }
        }
    },
    'size': 500
}
documents = search_documents(query)
print(f'{len(documents)} documents found')
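
The queries in these examples share the same shape, so it can be convenient to assemble them with a small function. The helper below is a hypothetical convenience, not part of the API; it only uses keys that appear in this notebook (`published-at`, `topics`, `entities`, `eq`, `all`, `size`).

```python
def build_query(start, end, topic_id=None, entity_ids=None, size=500):
    """Assemble a search query dict like the ones written by hand in the examples."""
    where = {'published-at': {'gt': start, 'lte': end}}
    if topic_id:
        where['topics'] = {'id': {'eq': topic_id}}
    if entity_ids:
        if len(entity_ids) == 1:
            # a single entity: match it directly
            where['entities'] = {'id': {'eq': entity_ids[0]}}
        else:
            # several entities: require all of them to be mentioned
            where['entities'] = {'id': {'all': list(entity_ids)}}
    return {'where': where, 'size': size}


query = build_query(
    '2021-01-01', '2021-01-04',
    topic_id='7a162a73-0062-4772-9dc0-252dd862dad0',
    entity_ids=['f09d0747-d36d-400e-8aca-a1b5d51c65f7'],
)
```

The resulting dict can be passed to `search_documents` in the same way as the handwritten queries.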

## Example 3: Sentiment about BP when Greta Thunberg is mentioned

In [None]:

query = {
    'where': {
        # search over the last month
        "published-at": {"gt": a_month_ago, "lte": todays},
        'topics': {
            'id': {
                'eq': '7a162a73-0062-4772-9dc0-252dd862dad0'
            },
        },
        # both BP and Greta Thunberg
        'entities': {
            'id': {
                'all': [
                    '52e28982-5bd9-40d4-ab8f-7cba471f598c',
                    '2bf240bf-bbc2-4416-910a-608a3fdd967d'
                ]
            }
        }
    },
    'size': 500
}
documents = search_documents(query)
print(f'{len(documents)} documents found')

entities_df = pd.DataFrame.from_records([e for d in documents for e in d.get('entities', [])]).set_index('id')
entities_df[
    # Filter for salient mentions
    (entities_df['salient']==True)
].loc['52e28982-5bd9-40d4-ab8f-7cba471f598c'].sentiment.hist()
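
If a plot is not needed, the same filtering can be done with the standard library. The sketch below counts sentiment values for salient mentions of one entity; the documents are made up, and representing sentiment as the strings `positive`, `neutral` and `negative` is an assumption about the response format.

```python
from collections import Counter


def sentiment_counts(documents, entity_id):
    """Count sentiment values over salient mentions of entity_id."""
    counts = Counter()
    for doc in documents:
        for entity in doc.get('entities', []):
            if entity.get('id') == entity_id and entity.get('salient'):
                counts[entity.get('sentiment')] += 1
    return counts


# Made-up documents with the fields used in this notebook
docs = [
    {'entities': [{'id': 'bp', 'salient': True, 'sentiment': 'negative'},
                  {'id': 'greta', 'salient': True, 'sentiment': 'positive'}]},
    {'entities': [{'id': 'bp', 'salient': False, 'sentiment': 'neutral'}]},
    {'entities': [{'id': 'bp', 'salient': True, 'sentiment': 'negative'}]},
]
print(sentiment_counts(docs, 'bp'))  # Counter({'negative': 2})
```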

## Example 4: Sentiment about BP when Mexico is mentioned

In [None]:
search_entities(name='Mexico', typ='location')[:3]

query = {
    'where': {
        # search over the last month
        "published-at": {"gt": a_month_ago, "lte": todays},
        'topics': {
            'id': {
                'eq': '7a162a73-0062-4772-9dc0-252dd862dad0'
            },
        },
        # both BP and Mexico
        'entities': {
            'id': {
                # to look for either entity use the key 'or'
                'all': [
                    '52e28982-5bd9-40d4-ab8f-7cba471f598c',
                    '995444c6-eae3-400c-9457-0393b51efe0e'
                ]
            }
        }
    },
    'size': 500
}
documents = search_documents(query)
print(f'{len(documents)} documents found')

entities_df = pd.DataFrame.from_records([e for d in documents for e in d.get('entities', [])]).set_index('id')
entities_df[
    # Filter for salient mentions
    (entities_df['salient']==True)
].loc['52e28982-5bd9-40d4-ab8f-7cba471f598c'].sentiment.hist()

## Example 5: People and Organisations mentioned alongside Greta Thunberg 

In [None]:
query = {
    'where': {
        "published-at": {"gt": three_days_ago, "lte": todays},
        'topics': {
            'id': {
                'eq': '7a162a73-0062-4772-9dc0-252dd862dad0'
            },
        },
        'entities': {
            'id': {
                'eq': '2bf240bf-bbc2-4416-910a-608a3fdd967d'
            }
        }
    },
    'size': 500
}
documents = search_documents(query)
print(f'{len(documents)} documents found')
entities_df = pd.DataFrame.from_records([e for d in documents for e in d.get('entities', [])])
entities_df.head()

In [None]:
entities_df[entities_df['type'] == 'person']['name'].value_counts().head(20)

In [None]:
entities_df[entities_df['type'] == 'organisation']['name'].value_counts().head(20)
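
The introduction also asked how many documents mention entity A but not entity B. One way to answer this, sketched below, is to search for entity A and then filter the returned documents on the client side. The function name and sample documents are invented for illustration; the API may also support exclusion directly in the query, so check the documentation before relying on this approach.

```python
def mentions_a_not_b(documents, id_a, id_b):
    """Return the documents that mention id_a and do not mention id_b."""
    selected = []
    for doc in documents:
        # collect the ids of every entity tagged on this document
        ids = {entity.get('id') for entity in doc.get('entities', [])}
        if id_a in ids and id_b not in ids:
            selected.append(doc)
    return selected


# Made-up documents: only the second mentions A without B
docs = [
    {'title': 'one', 'entities': [{'id': 'A'}, {'id': 'B'}]},
    {'title': 'two', 'entities': [{'id': 'A'}]},
    {'title': 'three', 'entities': [{'id': 'C'}]},
]
print(len(mentions_a_not_b(docs, 'A', 'B')))  # 1
```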