# TextProbe Quickstart Guide

TextProbe is an **all-in-one** Text Analysis API.  The API is organized into four core endpoints which return a rich analysis of your text across many dimensions:
1. **Entities Endpoint**:  [extracts entities](https://en.wikipedia.org/wiki/Named-entity_recognition) like people, groups, and places with automatic [entity linking](https://en.wikipedia.org/wiki/Entity_linking) to the [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) knowledge base. Also extracts [measurements](https://en.wikipedia.org/wiki/Measured_quantity#:~:text=In%20a%20physical%20setting%20a,the%20context%20of%20quantum%20mechanics) (e.g., *10 cm* from "pencil is 10cm") and measured things (e.g., *pencil* from "pencil is 10cm").
2. **Relations Endpoint**: [extracts relations](https://en.wikipedia.org/wiki/Relationship_extraction) between entities, including certain causal relationships.  Also extracts quotations (speaker->said->quote).
3. **Topics Endpoint**: [auto-categorizes](https://en.wikipedia.org/wiki/Document_classification) text by subject matter. Also extracts salient concepts or subthemes as represented by [keyphrases](https://en.wikipedia.org/wiki/Keyword_extraction).
4. **Feelings Endpoint**: detects [emotion](https://en.wikipedia.org/wiki/Emotion_recognition) (e.g., joy, anger, fear, sadness, neutral) and [sentiment](https://en.wikipedia.org/wiki/Sentiment_analysis) (i.e., positive vs. negative vs. neutral), including [entity sentiment](https://www.semanticscholar.org/paper/Entity-Based-Sentiment-Analysis-on-Twitter-Batra-Rao/b6595c1bb5ae1bd6ea1fd9bf561796dd84c25295).


TextProbe is a REST API that accepts JSON requests via both POST and GET.  Thus, you can access our API with any programming language (e.g., Python, PHP, Node.js, C#) and with very little coding required. We will use Python in this notebook and send requests using the `requests` library. Let's begin by installing `requests`.

In [1]:
!pip install -q requests # install requests if not already installed
import requests          # import requests

## STEP 1: Getting an API Key

TextProbe currently offers our API through [RapidAPI](https://rapidapi.com/marketplace).  A single API request can be used to perform all analyses listed above.  Our API can be accessed for free for up to 500 requests per day with no credit card required. 

To get started, you can:
1. [Register an account](https://rapidapi.com/marketplace) with RapidAPI for free
2. [Subscribe](https://rapidapi.com/textprobe/api/textprobe/pricing) to the TextProbe API.  Choose the Basic Plan to try it out for free.
3. [Go here](https://rapidapi.com/textprobe/api/textprobe/endpoints) and make note of your API key. It will be the value associated with the `x-rapidapi-key` field on the lower right of the screen under **Code Snippets**.

The RapidAPI site also shows example code for accessing the API through other languages and libraries such as Node.js.

Once you obtain a free API key, enter it below:

In [2]:
API_KEY = 'ENTER YOUR API KEY HERE'

## STEP 2:  Define Request Function

Let's define a convenience function to help send our requests to different TextProbe endpoints:

In [3]:
BASE_URL = 'https://textprobe.p.rapidapi.com'

def post_request(doc, endpoint, **kwargs):
    """POST `doc` (plus any optional arguments) as JSON to a TextProbe endpoint."""
    kwargs['text'] = doc
    r = requests.post(url=BASE_URL + endpoint, json=kwargs,
                      headers={'x-rapidapi-key': API_KEY})
    print('status code: %s' % (r.status_code))
    # Report any 5xx response as a server error instead of parsing it as JSON.
    if 500 <= r.status_code < 600:
        return {"detail": "Internal Server Error"}
    return r.json()
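
Because `post_request` returns a placeholder dictionary on a server error, a caller may want to retry such requests. The following is a minimal sketch of one way to do that, not something TextProbe or RapidAPI provides: `call_with_retry`, `max_attempts`, `base_delay`, and the `send` callable are all hypothetical names introduced here for illustration.

```python
import time

def call_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    send: zero-argument callable returning (status_code, data).
    All names here are illustrative and not part of the TextProbe API.
    """
    status, data = None, None
    for attempt in range(max_attempts):
        status, data = send()
        if not 500 <= status < 600:
            return status, data
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return status, data
```

With `requests`, `send` could be a small closure that posts to an endpoint and returns the response's status code together with its parsed body.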

## STEP 3:   Analyze Your Text

Here is the text we will analyze in this notebook, which consists of two paragraphs (i.e., two sets of sentences separated by a blank line).  When sending text to TextProbe, we recommend that you retain this kind of paragraph structure if possible.

In [4]:
document = """
Artifical intelligence remains a hot area in the technology sector.
In 2014, Google bought London-based artificial intelligence company DeepMind with an acquisition
price was more than $500 million. Facebook was also in talks to buy the startup in late 2013.
DeepMind had confirmed the acquisition, but couldn’t disclose deal terms.
CMU Professor Larry Wasserman once wrote that "the startup is trying to build a system that thinks."

As tech giants increase their AI acquisitions, some are concerned about the harmful effects of AI
and the role of Big Tech in society in general. Indeed, many we interviewed now feel that Facebook is 
"oficially evil" due to policies that force the sharing of personal data among other transgressions.  
Many are critical of Apple for other reasons, but happily continue to buy their iPhones and MacBooks 
with 3.2GHz M1 processors.
"""

We will now send the above text to each of our four endpoints for analysis.  Let's begin with the **Entities** endpoint.

### Entities Endpoint

Let's now send our text to the **/entities** endpoint using our `post_request` convenience function defined above.

In [5]:
result = post_request(document, '/entities')

status code: 200


The **/entities** endpoint returns a dictionary with three main keys.
1. `sentences`: the original text split into sentences with slight normalizations to facilitate analyses
2. `entities`: extracted entities like people, groups, and places
3. `measurements`: extracted measured quantities along with properties and attributes being measured

In addition, all endpoints will return a `version` and `message` field, which return the API version and any messages from the API, respectively.

In [6]:
result.keys()

dict_keys(['sentences', 'entities', 'measurements', 'version', 'message'])

Let's go through each of the three main keys.

#### `sentences`

The value for `sentences` is the original text split into sentences with slight normalizations necessary for analysis.  For example, you'll notice that **3.2GHz** in the last sentence was normalized to **3.2 GHz** (with a space separating the quantity and measurement unit).

In [7]:
result['sentences']

['Artifical intelligence remains a hot area in the technology sector.',
 'In 2014, Google bought London-based artificial intelligence company DeepMind with an acquisition price was more than $500 million.',
 'Facebook was also in talks to buy the startup in late 2013.',
 "DeepMind had confirmed the acquisition, but couldn't disclose deal terms.",
 'CMU Professor Larry Wasserman once wrote that "the startup is trying to build a system that thinks."',
 'As tech giants increase their AI acquisitions, some are concerned about the harmful effects of AI and the role of Big Tech in society in general.',
 'Indeed, many we interviewed now feel that Facebook is  "oficially evil" due to policies that force the sharing of personal data among other transgressions.',
 'Many are critical of Apple for other reasons, but happily continue to buy their iPhones and MacBooks  with 3.2 GHz M1 processors.']

#### `entities`

The `entities` field of the response is a list of extracted entities (e.g., people, groups, places).  The `entity_type` field shows the categorization of each entity (e.g., **Person**, **Group**, **Place**, **Product**).  The `sentence_id` is the zero-based index of the sentence (in `sentences`) from which the entity was extracted, and the `start` and `end` fields give the character span of the entity within that sentence.  Entities are automatically disambiguated and linked to the [Wikidata knowledge base](https://www.wikidata.org) when possible.  For instance, the first entity, **Google**, is linked to entry [Q95](https://www.wikidata.org/wiki/Q95) in Wikidata, which can be accessed to retrieve more information on the company.

In [8]:
result['entities']

[{'entity': 'Google',
  'start': 9,
  'end': 15,
  'entity_type': 'Group',
  'wikidata_id': 'Q95',
  'wikidata_name': 'Google',
  'sentence_id': 1},
 {'entity': 'London',
  'entity_type': 'Place',
  'start': 23,
  'end': 29,
  'sentence_id': 1},
 {'entity': 'DeepMind',
  'start': 68,
  'end': 76,
  'entity_type': 'Group',
  'wikidata_id': 'Q15733006',
  'wikidata_name': 'DeepMind',
  'sentence_id': 1},
 {'entity': 'more than $500 million',
  'entity_type': 'Money',
  'start': 107,
  'end': 129,
  'sentence_id': 1},
 {'entity': 'DeepMind',
  'start': 0,
  'end': 8,
  'entity_type': 'Group',
  'wikidata_id': 'Q15733006',
  'wikidata_name': 'DeepMind',
  'sentence_id': 3},
 {'entity': 'CMU',
  'entity_type': 'Group',
  'start': 0,
  'end': 3,
  'sentence_id': 4},
 {'entity': 'Larry Wasserman',
  'start': 14,
  'end': 29,
  'entity_type': 'Person',
  'wikidata_id': 'Q6489856',
  'wikidata_name': 'Larry A. Wasserman',
  'sentence_id': 4},
 {'entity': 'AI',
  'entity_type': 'Group',
  'start
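
As a small illustration of working with these records, the sketch below groups entities by `entity_type` and builds a Wikidata URL from `wikidata_id`. The helper names `group_by_type` and `wikidata_url` are hypothetical, and the sample records are abridged copies of the entries shown above.

```python
from collections import defaultdict

# Abridged records copied from the response shown above.
entities = [
    {'entity': 'Google', 'entity_type': 'Group', 'wikidata_id': 'Q95', 'sentence_id': 1},
    {'entity': 'London', 'entity_type': 'Place', 'sentence_id': 1},
    {'entity': 'DeepMind', 'entity_type': 'Group', 'wikidata_id': 'Q15733006', 'sentence_id': 1},
    {'entity': 'Larry Wasserman', 'entity_type': 'Person', 'wikidata_id': 'Q6489856', 'sentence_id': 4},
]

def group_by_type(entities):
    """Map each entity_type to the distinct entity strings of that type."""
    groups = defaultdict(list)
    for ent in entities:
        if ent['entity'] not in groups[ent['entity_type']]:
            groups[ent['entity_type']].append(ent['entity'])
    return dict(groups)

def wikidata_url(ent):
    """Return the Wikidata page URL for a linked entity, or None if it is unlinked."""
    qid = ent.get('wikidata_id')
    return 'https://www.wikidata.org/wiki/' + qid if qid else None

print(group_by_type(entities))
print([wikidata_url(e) for e in entities])
```

Entities without a `wikidata_id` (such as London in this response) simply yield `None`.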

The **/entities** endpoint accepts the optional argument `enable_freebase`.  If `True`, a richer, pipe-delimited list of Freebase types will be assigned to entities where possible.  For instance, here are the Freebase types for Google:

In [9]:
result = post_request(document, '/entities', enable_freebase=True)
print(result['entities'][0])

status code: 200
{'entity': 'Google', 'start': 9, 'end': 15, 'entity_type': 'Group', 'wikidata_id': 'Q95', 'wikidata_name': 'Google', 'freebase_types': '/organization/organization_member | /influence/influence_node | /business/consumer_company | /computer/operating_system_developer | /conferences/conference_sponsor | /internet/website_owner | /venture_capital/venture_funded_company | /business/brand | /dataworld/data_provider | /computer/computer_manufacturer_brand | /education/educational_institution | /travel/hotel_grading_authority | /freebase/list | /venture_capital/venture_investor | /business/business_operation | /law/litigant | /organization/organization_founder | /freebase/freebase_interest_group | /book/book_subject | /computer/software_developer | /award/ranked_item | /organization/organization | /business/sponsor | /award/award_presenting_organization | /award/award_winner | /business/customer | /organization/organization_partnership | /business/issuer | /law/patent_assignee
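
Since `freebase_types` arrives as one pipe-delimited string, it is usually easier to work with once split into a list. A minimal sketch follows; `parse_freebase_types` is a hypothetical helper, and the sample string is abridged to the first three types shown above.

```python
def parse_freebase_types(ent):
    """Split the pipe-delimited freebase_types string into a list of type paths."""
    raw = ent.get('freebase_types', '')
    return [t.strip() for t in raw.split('|') if t.strip()]

# Abridged example using the first three types from the output above.
google = {
    'entity': 'Google',
    'freebase_types': '/organization/organization_member | /influence/influence_node | /business/consumer_company',
}
types = parse_freebase_types(google)
print(types)
print('/business/consumer_company' in types)
```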

#### `measurements`

The `measurements` field of the response will contain a list of the extracted measured quantities.  Our measurement extractor can discover measurements even when corrupted through PDF-to-text conversions (e.g., lost exponents and distorted scientific notation). 

In addition to the extracted measured quantity, each returned dictionary may also include an extra entry, `thing_measured`, which contains the object, property, or attribute being measured (when it can be determined). In this case, we see that it is the **M1 processors** that are being measured (at **3.2 GHz**).

In [10]:
result['measurements']

[{'measurement': '3.2 GHz',
  'units': 'GHz',
  'number': 3.2,
  'start': 107,
  'end': 114,
  'sentence_id': 7,
  'thing_measured': 'M1 processors'}]
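
Because `number` and `units` are returned separately, a measurement can be converted to a base unit on the client side. The sketch below assumes the unit string is an SI prefix followed by the base unit (as in `GHz`); the `SI_PREFIXES` table and `to_base_unit` helper are illustrative and not part of the API.

```python
# Illustrative SI prefix table; extend as needed for your own data.
SI_PREFIXES = {'': 1.0, 'k': 1e3, 'M': 1e6, 'G': 1e9, 'T': 1e12}

def to_base_unit(measurement, base_unit):
    """Convert a measurement dict to base_unit, or return None if the units do not match."""
    units = measurement['units']
    if not units.endswith(base_unit):
        return None
    prefix = units[:len(units) - len(base_unit)]
    factor = SI_PREFIXES.get(prefix)
    return None if factor is None else measurement['number'] * factor

# Record copied from the response shown above (offsets omitted).
m = {'measurement': '3.2 GHz', 'units': 'GHz', 'number': 3.2,
     'thing_measured': 'M1 processors'}
print(m['thing_measured'], to_base_unit(m, 'Hz'))
```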

### Relations Endpoint

The **/relations** endpoint extracts relations among entities from the text supplied.

In [11]:
result = post_request(document, '/relations')

status code: 200


The response from **/relations** contains two main keys: `relations` and `quotations` (in addition to `version` and `message`):

In [12]:
result.keys()

dict_keys(['relations', 'quotations', 'version', 'message'])

#### `relations`
The `relations` field contains a list of relations discovered in the supplied text. In this case, we have extracted Merger/Acquisition activity (i.e., Google acquires DeepMind), which can be useful for a variety of downstream tasks.  The sentence from which each relation was extracted is included in the `_sentence` field.

In [13]:
result['relations']

[{'subject': 'Google',
  'predicate': 'bought',
  'object': 'London - based artificial intelligence company DeepMind',
  'subject_type': 'Group',
  'object_type': 'Group',
  '_sentence': 'In 2014, Google bought London-based artificial intelligence company DeepMind with an acquisition price was more than $500 million.',
  'subject_id': 'Q95',
  'object_id': 'Q15733006'}]
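
For downstream use, each relation can be reduced to a (subject, predicate, object) triple. The sketch below prefers the Wikidata IDs when they are present and falls back to the raw text otherwise; `to_triples` is a hypothetical helper, and the record is copied from the output above (without `_sentence`).

```python
relations = [{
    'subject': 'Google',
    'predicate': 'bought',
    'object': 'London - based artificial intelligence company DeepMind',
    'subject_type': 'Group',
    'object_type': 'Group',
    'subject_id': 'Q95',
    'object_id': 'Q15733006',
}]

def to_triples(relations):
    """Return (subject, predicate, object) tuples, using Wikidata IDs when available."""
    triples = []
    for rel in relations:
        subj = rel.get('subject_id') or rel['subject']
        obj = rel.get('object_id') or rel['object']
        triples.append((subj, rel['predicate'], obj))
    return triples

print(to_triples(relations))
```

Using the IDs rather than the surface strings means that differently worded mentions of the same entity collapse to the same node.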

#### `quotations`
The `quotations` entry in the response includes a list of extracted quotations. (Quotations can be viewed as a kind of relation between a speaker and a statement.)  Each quotation entry includes a `polarity` field, which indicates the positive (closer to 1.0) or negative (closer to -1.0) sentiment of the quote.  In this case, the quotation is neutral (i.e., `polarity=0.0`).

In [14]:
result['quotations']

[{'speaker': 'Professor Larry Wasserman',
  'action': 'wrote',
  'quote': '"the startup is trying to build a system that thinks."',
  'polarity': 0.0,
  'subject_id': 'Q6489856'}]

As shown in the examples above, TextProbe will automatically link entities discovered in the subject and object of a relation to the Wikidata knowledgebase.  For instance, the speaker of the above quote, Larry Wasserman, is linked to the appropriate Wikidata entry ([Q6489856](https://www.wikidata.org/wiki/Q6489856)).  

In addition, although not illustrated in this example, TextProbe can optionally perform [coreference resolution](https://en.wikipedia.org/wiki/Coreference) over the first ten sentences in the text. That is, words like "he", "she", and "they" are replaced with the entities that they represent. This behavior is controlled by the `enable_coref` argument to the **/relations** endpoint (`enable_coref=True` turns it on and `enable_coref=False` turns it off). We will show an example of this below in the **Tips and Tricks** section.

### Topics Endpoint

The **/topics** endpoint extracts information related to the underlying subject matter of the text.

In [15]:
result = post_request(document, '/topics')

status code: 200


The response from **/topics** contains two main keys: `categories` and `keywords` (in addition to `version` and `message`):

In [16]:
result.keys()

dict_keys(['categories', 'keywords', 'version', 'message'])

#### `categories`
The **/topics** endpoint categorizes the text into one or more of about 40 different topic categories.  The `categories` field contains the predicted categories for the text. Here, we see the text is primarily classified into **Tech**, which makes sense for this text.


In [17]:
result['categories']

{'Business/Finance': 0.124656, 'Tech': 0.775994}
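
When only the dominant topics matter, the scores can be ranked and filtered. This is a minimal sketch; `top_categories` and its `min_score` cutoff are assumptions made for illustration, not API parameters.

```python
# Scores copied from the output above.
categories = {'Business/Finance': 0.124656, 'Tech': 0.775994}

def top_categories(categories, min_score=0.5):
    """Return (category, score) pairs at or above min_score, highest score first."""
    ranked = sorted(categories.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= min_score]

print(top_categories(categories))
print(top_categories(categories, min_score=0.1))
```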

#### `keywords`
Extracted keyphrases representing important concepts and subthemes in the text are available in the `keywords` field of the response. Here, we can see **artificial intelligence** as the top salient concept in the text.

In [18]:
result['keywords']

['artifical intelligence',
 'AI',
 'acquisition',
 'Facebook',
 'startup',
 'Google',
 'London',
 'CMU',
 'Professor']

### Feelings Endpoint

The **/feelings** endpoint extracts information related to the underlying emotions and sentiments expressed in the text.  

In [19]:
result = post_request(document, '/feelings')

status code: 200


The response from **/feelings** will return up to five main fields (in addition to `message` and `version`):

In [20]:
result.keys()

dict_keys(['emotion_prediction', 'emotion_scores', 'sentiment_prediction', 'sentiment_scores', 'entity_sentiments', 'version', 'message'])

#### `sentiment_prediction` and `sentiment_scores`
The `sentiment_prediction` field contains a string indicating whether the supplied text as a whole is `Positive`, `Negative`, or `Neutral`.  In this case, the overall sentiment of the text is neutral (the positive and negative scores are nearly equal), since the article mostly makes objective statements about AI acquisitions.  The `sentiment_scores` field shows the probability scores behind the prediction.

In [21]:
print(result['sentiment_prediction'])
print(result['sentiment_scores'])

Neutral
{'Positive': 0.52265625, 'Negative': 0.47734374999999996}


#### `emotion_prediction` and `emotion_scores`

TextProbe can also discover the overall emotional tone of the text. The `emotion_prediction` field categorizes the text into one of five emotional categories including a *Neutral* category.  The `emotion_scores` show the predicted probabilities for each emotion.  In this case, due to the concerns about Big Tech expressed in the text, the predicted emotion of the text is **Fear**.

In [22]:
print(result['emotion_prediction'])
print(result['emotion_scores'])

Fear
{'Anger': 0.003734, 'Fear': 0.98739, 'Joy': 0.001806, 'Neutral': 0.002218, 'Sadness': 0.004853}
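
In this example, `emotion_prediction` matches the highest-scoring entry in `emotion_scores`. The sketch below recovers the label and its score from the dictionary; `top_label` is a hypothetical helper, and the assumption that the prediction is always the top score is based only on this one response. The same rule would not reproduce the `Neutral` sentiment prediction shown earlier, since `sentiment_scores` has no `Neutral` entry.

```python
# Scores copied from the output above.
emotion_scores = {'Anger': 0.003734, 'Fear': 0.98739, 'Joy': 0.001806,
                  'Neutral': 0.002218, 'Sadness': 0.004853}

def top_label(scores):
    """Return the (label, score) pair with the highest score."""
    label = max(scores, key=scores.get)
    return label, scores[label]

print(top_label(emotion_scores))
```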


#### `entity_sentiments`

The **/feelings** endpoint can also detect sentiments towards specific entities, which, when available, are stored in the `entity_sentiments` field.

In [23]:
print(result['entity_sentiments'])

[{'entity': 'Facebook', 'polarity': -0.6597, 'descriptor': 'evil'}]


While the overall sentiment is neutral, we see here that there is negative sentiment expressed specifically towards Facebook buried in the text, as contained in the `entity_sentiments` field.  As with quotation extraction, the sentiment score is included as `polarity`. Entity sentiments are currently limited to adjectival forms of sentiment where the entity is the subject (e.g., Barack Obama was a great President).
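
To surface such buried negative mentions across many documents, the `entity_sentiments` list can be filtered by `polarity`. A minimal sketch follows; `negative_entities` and the `threshold` value of -0.5 are assumptions chosen for illustration.

```python
# Record copied from the output above.
entity_sentiments = [{'entity': 'Facebook', 'polarity': -0.6597, 'descriptor': 'evil'}]

def negative_entities(entity_sentiments, threshold=-0.5):
    """Return entries whose polarity is at or below threshold, most negative first."""
    hits = [s for s in entity_sentiments if s['polarity'] <= threshold]
    return sorted(hits, key=lambda s: s['polarity'])

for hit in negative_entities(entity_sentiments):
    print('%s (%s): %.4f' % (hit['entity'], hit['descriptor'], hit['polarity']))
```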

## Tips and Tricks

Here we include some additional tips.

### Splitting up Larger Texts

Each request to TextProbe can be a maximum of 10KB (about 10K ASCII characters).  If your document is larger than 10KB, you can split it up into 10KB chunks and send each chunk as a separate API request. Let's create a document of over 10KB by duplicating the document used above.

In [24]:
large_document = "\n".join([document] * 25)

In [25]:
len(large_document.encode('utf-8'))

21799

At 21799 bytes, `large_document` is over the 10KB limit.  We can split it into smaller chunks by using the [nltk library](https://pypi.org/project/nltk/) to split the document into sentences.  The function `text2chunks` shown below will split a large text document into chunks of text that are <= 10KB and will also try to prevent paragraphs from being split across chunks whenever possible.

In [26]:
!pip install -q nltk==3.5

In [27]:
import nltk
from nltk import sent_tokenize
nltk.download('punkt', quiet=True)  # tokenizer data needed by sent_tokenize

def nbytes(text):
    """Size of text in bytes when UTF-8 encoded."""
    return len(text.encode('utf-8', 'ignore'))

def text2chunks(document, limit=10000):
    """Split document into chunks smaller than limit bytes, keeping paragraphs whole when possible."""
    chunks = []
    texts = ''  # text of the chunk currently being built

    def add(piece):
        nonlocal texts
        # If the piece does not fit, close the current chunk and start a new one with it.
        if texts and nbytes(texts) + nbytes(piece) >= limit:
            chunks.append(texts)
            texts = ''
        texts += piece

    for paragraph in filter(lambda x : x != '', document.split('\n\n')):
        sent_list = sent_tokenize(paragraph)
        sentences = "\n\n%s" % (' '.join(sent_list))
        if nbytes(sentences) < limit:
            add(sentences)  # the whole paragraph fits in one chunk
        else:
            # The paragraph alone exceeds the limit: add it sentence by sentence.
            for i, sent in enumerate(sent_list):
                add(('\n\n' if i == 0 else ' ') + sent)
    if texts: chunks.append(texts)
    return chunks

Let's use `text2chunks` to split our large document:

In [28]:
chunks = text2chunks(large_document)

In [29]:
print("# of chunks: %s" % len(chunks))
for i, chunk in enumerate(chunks):
    print("length of chunk %s: %s bytes" % (i+1, len(chunk.encode('utf-8', 'ignore'))))

# of chunks: 3
length of chunk 1: 9570 bytes
length of chunk 2: 9570 bytes
length of chunk 3: 2610 bytes


Each chunk of text can then be sent to TextProbe for analysis as a separate API request.
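
After the chunks are analyzed separately, the per-chunk responses usually need to be combined. The sketch below merges **/entities** responses and shifts each `sentence_id` so that it indexes into the combined `sentences` list. `merge_entity_results` is a hypothetical helper, and it assumes (as in the responses shown earlier) that `sentence_id` is a zero-based index into that response's own `sentences`.

```python
def merge_entity_results(results):
    """Combine per-chunk /entities responses into one, re-basing sentence_id.

    results: list of response dicts in chunk order. Missing keys (for example
    in an error response) are treated as empty lists.
    """
    merged = {'sentences': [], 'entities': [], 'measurements': []}
    for res in results:
        offset = len(merged['sentences'])  # sentences contributed by earlier chunks
        merged['sentences'].extend(res.get('sentences', []))
        for key in ('entities', 'measurements'):
            for item in res.get(key, []):
                shifted = dict(item)  # copy so the original response is untouched
                shifted['sentence_id'] = item['sentence_id'] + offset
                merged[key].append(shifted)
    return merged

# Two small hand-written responses standing in for real API output.
first = {'sentences': ['A b.', 'Google c.'],
         'entities': [{'entity': 'Google', 'start': 0, 'end': 6, 'sentence_id': 1}],
         'measurements': []}
second = {'sentences': ['Apple d.'],
          'entities': [{'entity': 'Apple', 'start': 0, 'end': 5, 'sentence_id': 0}],
          'measurements': []}
merged = merge_entity_results([first, second])
print([(e['entity'], e['sentence_id']) for e in merged['entities']])
```

With the functions defined in this notebook, the input list could be built as `[post_request(c, '/entities') for c in chunks]`.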

### Viewing Entities Within Sentence Context
An extracted entity (or measurement) can be localized using a combination of the `sentence_id` and the `start` and `end` character indices.  For instance, in the cells below, we will view an entity (Google) and a measurement (3.2 GHz) both within their respective sentence contexts.  We will highlight the extraction in blue using the `highlight_entity` function.

In [30]:
def highlight_entity(ent, sentence):
    def colortext(x):
        return '\033[34m' + str(x) + '\033[0m'
    start = ent['start']
    end = ent['end']
    print(sentence[:start] + colortext(sentence[start:end]) + sentence[end:])

In [31]:
result = post_request(document, '/entities')
entity = result['entities'][0]
sentence_of_entity = result['sentences'][entity['sentence_id']]
highlight_entity(entity, sentence_of_entity)

status code: 200
In 2014, **Google** bought London-based artificial intelligence company DeepMind with an acquisition price was more than $500 million.


In [32]:
entity = result['measurements'][0]
sentence_of_entity = result['sentences'][entity['sentence_id']]
highlight_entity(entity, sentence_of_entity)

Many are critical of Apple for other reasons, but happily continue to buy their iPhones and MacBooks  with **3.2 GHz** M1 processors.
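
Where terminal colors are not available (for example, when writing results to a log file), the same offsets can be used with plain-text markers. The sketch below also checks that the offsets select the expected text. `mark_span` is a hypothetical helper, and the check assumes that the `entity` (or `measurement`) string equals the text at `start:end`, which holds for the records shown in this notebook.

```python
def mark_span(ent, sentence, left='[', right=']'):
    """Return sentence with the extracted span wrapped in plain-text markers.

    Raises ValueError if the offsets do not select the extracted text.
    """
    start, end = ent['start'], ent['end']
    span = sentence[start:end]
    expected = ent.get('entity', ent.get('measurement'))
    if expected is not None and span != expected:
        raise ValueError('offsets select %r, expected %r' % (span, expected))
    return sentence[:start] + left + span + right + sentence[end:]

# Sentence and record copied from the /entities response shown earlier.
sentence = ('In 2014, Google bought London-based artificial intelligence company '
            'DeepMind with an acquisition price was more than $500 million.')
google = {'entity': 'Google', 'start': 9, 'end': 15, 'sentence_id': 1}
print(mark_span(google, sentence))
```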


### Coreference Resolution

As mentioned above, the **/relations** and **/feelings** endpoints accept an optional `enable_coref` argument. When set to `True`, TextProbe will perform [coreference resolution](https://en.wikipedia.org/wiki/Coreference) over the first ten sentences in the supplied text.  Consider the following text:

In [33]:
some_text =  """
President Obama had a yacht. He loved sailing it. 
I enjoyed hanging out on his boat because he was a great entertainer. 
He once told me, "Sailing is the only thing that   relaxes me."
"""

Let's run this text through both the **/relations** endpoint and the **/feelings** endpoint with `enable_coref=True` and observe results.

In [34]:
result_relations = post_request(some_text, '/relations', enable_coref=True)
result_feelings = post_request(some_text, '/feelings', enable_coref=True)
print(result_relations['relations'])
print()
print(result_relations['quotations'])
print()
print(result_feelings['entity_sentiments'])

status code: 200
status code: 200
[{'subject': 'President Obama', 'predicate': 'was', 'object': 'a great entertainer', 'subject_type': 'Person', 'object_type': '', '_sentence': 'I enjoyed hanging out on President Obama boat because President Obama was a great entertainer.', 'subject_id': 'Q76'}]

[{'speaker': 'President Obama', 'action': 'told', 'quote': '"Sailing is the only thing that   relaxes me."', 'polarity': 0.0, 'subject_id': 'Q76'}]

[{'entity': 'Obama', 'polarity': 0.6249, 'descriptor': 'great'}]


Notice that the discovered relations, quotations, and entity sentiments related to Obama all involve sentences that only mention "he" in the original text. Obama is never explicitly mentioned in these sentences.   With `enable_coref=True`, TextProbe automatically replaces these references allowing for the correct extractions and assignments of these relations.

Finally, we note that, when invoking an endpoint for the first time (or with changed parameters like `enable_coref=True`), you might experience a slight initial delay due to the endpoint being "warmed up". Subsequent invocations will be faster.  

## Conclusion

Here we have given a quick introduction to getting up and running with the TextProbe API. You should now be ready to explore your own data.

If you need support or have questions, please email **support@textprobe.com**.