# Text Analytics

<img src="https://github.com/retkowsky/images/blob/master/acs.png?raw=true">

Documentation: https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/

## 1) Language

> https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-language-detection

In [1]:
subscription_key = "<your-subscription-key>"  # replace with your own Text Analytics key; never publish a real key
assert subscription_key

In [2]:
text_analytics_base_url = "https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.0"

In [3]:
language_api_url = text_analytics_base_url + "/languages"
print(language_api_url)

https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.0/languages
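Every operation in this notebook uses the same base URL with a different suffix. As a minimal sketch, the concatenation could be wrapped in a small helper. The function name `text_analytics_url` and its parameters are hypothetical and are not part of any SDK; the URL pattern is simply the one used in the cells above.

```python
# Hypothetical helper: the name and parameters are illustrative only.
def text_analytics_url(region, operation, version="v2.0"):
    """Build the REST URL for one Text Analytics operation in a given region."""
    base = "https://{0}.api.cognitive.microsoft.com/text/analytics/{1}".format(region, version)
    return base + "/" + operation.lstrip("/")


# The four operations used in this notebook.
for operation in ("languages", "sentiment", "keyPhrases", "entities"):
    print(text_analytics_url("westeurope", operation))
```

Keeping the region and version as parameters makes it easier to point the same notebook at a resource created in another region.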


In [4]:
documents = { 'documents': [
    { 'id': '1', 'text': 'This is a document written in English.' },
    { 'id': '2', 'text': 'Este es un document escrito en Español.' },
    { 'id': '3', 'text': '这是一个用中文写的文件' },
    { 'id': '4', 'text': 'Ceci est une présentation du service cognitif Azure Text Analytics.' },
    { 'id': '5', 'text': 'सुप्रभात। आप ठीक तो हैं न?' }
    ] }
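The payload above is written by hand. When the texts come from a list, a sketch like the following could build the same `{'documents': [...]}` structure and reject obviously invalid input before anything is sent. The helper name `make_documents` is hypothetical, and the two limits are assumptions for illustration; check the service documentation for the values that apply to your API version and tier.

```python
# Assumed limits, for illustration only; verify them against the service documentation.
MAX_DOCUMENTS = 1000
MAX_CHARS = 5120


def make_documents(texts, language=None):
    """Wrap plain strings in the request payload, using sequential string ids."""
    if len(texts) > MAX_DOCUMENTS:
        raise ValueError("too many documents: {0}".format(len(texts)))
    items = []
    for index, text in enumerate(texts, start=1):
        if not text.strip():
            raise ValueError("document {0} is empty".format(index))
        if len(text) > MAX_CHARS:
            raise ValueError("document {0} is too long".format(index))
        item = {"id": str(index), "text": text}
        if language is not None:
            item["language"] = language  # optional language hint
        items.append(item)
    return {"documents": items}


payload = make_documents(["This is a document written in English.", "Ceci est un test."])
print(payload)
```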

In [5]:
import requests
from pprint import pprint
headers   = {"Ocp-Apim-Subscription-Key": subscription_key}
response  = requests.post(language_api_url, headers=headers, json=documents)
languages = response.json()
pprint(languages)

{'documents': [{'detectedLanguages': [{'iso6391Name': 'en',
                                       'name': 'English',
                                       'score': 1.0}],
                'id': '1'},
               {'detectedLanguages': [{'iso6391Name': 'es',
                                       'name': 'Spanish',
                                       'score': 1.0}],
                'id': '2'},
               {'detectedLanguages': [{'iso6391Name': 'zh_chs',
                                       'name': 'Chinese_Simplified',
                                       'score': 1.0}],
                'id': '3'},
               {'detectedLanguages': [{'iso6391Name': 'fr',
                                       'name': 'French',
                                       'score': 1.0}],
                'id': '4'},
               {'detectedLanguages': [{'iso6391Name': 'hi',
                                       'name': 'Hindi',
                                       'score': 1.0}],
                'id': '5'}],
 'errors': []}

In [6]:
from IPython.display import HTML
table = []
for document in languages["documents"]:
    text  = next(filter(lambda d: d["id"] == document["id"], documents["documents"]))["text"]
    langs = ", ".join(["{0}({1})".format(lang["name"], lang["score"]) for lang in document["detectedLanguages"]])
    table.append("<tr><td>{0}</td><td>{1}</td></tr>".format(text, langs))
HTML("<table><tr><th>Text</th><th>Detected languages(scores)</th></tr>{0}</table>".format("\n".join(table)))

Text,Detected languages(scores)
This is a document written in English.,English(1.0)
Este es un document escrito en Español.,Spanish(1.0)
这是一个用中文写的文件,Chinese_Simplified(1.0)
Ceci est une présentation du service cognitif Azure Text Analytics.,French(1.0)
सुप्रभात। आप ठीक तो हैं न?,Hindi(1.0)
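Each document in the response carries a list of `detectedLanguages`. For downstream processing it is often enough to keep only the highest-scoring candidate per document. The following is a minimal sketch of that step; `top_languages` is a hypothetical helper, and `sample` is a trimmed copy of the response structure shown above rather than a live API result.

```python
def top_languages(response):
    """Map each document id to its best (name, ISO 639-1 code, score) candidate."""
    result = {}
    for document in response.get("documents", []):
        candidates = document.get("detectedLanguages", [])
        if not candidates:
            continue  # nothing detected for this document
        best = max(candidates, key=lambda lang: lang["score"])
        result[document["id"]] = (best["name"], best["iso6391Name"], best["score"])
    return result


# Trimmed sample in the same shape as the response printed above.
sample = {
    "documents": [
        {"id": "1", "detectedLanguages": [{"name": "English", "iso6391Name": "en", "score": 1.0}]},
        {"id": "4", "detectedLanguages": [{"name": "French", "iso6391Name": "fr", "score": 1.0}]},
    ],
    "errors": [],
}
print(top_languages(sample))
```

The resulting codes can be passed as the `language` field of the sentiment and key phrase requests in the following sections.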


## 2) Sentiment

Text Analytics uses a machine learning classification algorithm to generate a sentiment score between 0 and 1. **Scores closer to 1 indicate positive sentiment, while scores closer to 0 indicate negative sentiment.** The model is pretrained with an extensive body of text with sentiment associations. Currently, it is not possible to provide your own training data. The model uses a combination of techniques during text analysis, including text processing, part-of-speech analysis, word placement, and word associations. For more information about the algorithm, see Introducing Text Analytics.

Sentiment analysis is performed on the entire document, as opposed to extracting sentiment for a particular entity in the text. In practice, scoring accuracy tends to improve when documents contain one or two sentences rather than a large block of text. During an objectivity assessment phase, the model determines whether a document as a whole is objective or contains sentiment. A document that is mostly objective does not progress to the sentiment detection phase, resulting in a score of 0.50 with no further processing. For documents continuing in the pipeline, the next phase generates a score above or below 0.50, depending on the degree of sentiment detected in the document.

> https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-sentiment-analysis
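Since the service returns a number rather than a label, the score usually has to be mapped to a category by the caller. The sketch below treats a band around 0.5 as neutral, following the description above. The function name `sentiment_label` and the `margin` of 0.1 are assumptions chosen for illustration; they are not defined by the service.

```python
def sentiment_label(score, margin=0.1):
    """Turn a 0..1 sentiment score into 'negative', 'neutral' or 'positive'.

    The width of the neutral band (margin) is an arbitrary choice, not a service setting.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1, got {0}".format(score))
    if score > 0.5 + margin:
        return "positive"
    if score < 0.5 - margin:
        return "negative"
    return "neutral"


# Illustrative scores covering the three bands.
for score in (0.97, 0.5, 0.0):
    print(score, sentiment_label(score))
```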

In [7]:
sentiment_api_url = text_analytics_base_url + "/sentiment"
print(sentiment_api_url)

https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment


In [8]:
documents = {'documents' : [
  {'id': '1', 'language': 'en', 'text': 'I had a wonderful experience! The rooms were wonderful and the staff was helpful.'},
  {'id': '2', 'language': 'fr', 'text': "Je suis très content de mon voyage en Australie. Je reviendrai."},
  {'id': '3', 'language': 'fr', 'text': "Le restaurant n'était vraiment pas bon et trop cher en plus."}
]}

In [9]:
headers   = {"Ocp-Apim-Subscription-Key": subscription_key}
response  = requests.post(sentiment_api_url, headers=headers, json=documents)
sentiments = response.json()
pprint(sentiments)

{'documents': [{'id': '1', 'score': 0.9734485149383545},
               {'id': '2', 'score': 0.7375986576080322},
               {'id': '3', 'score': 0.0}],
 'errors': []}


> Scores closer to 1 indicate positive sentiment, while scores closer to 0 indicate negative sentiment. 
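The response only contains ids and scores, so reading the results requires joining them back to the request by `id`. A minimal sketch of that join follows. `merge_sentiments` is a hypothetical helper, and the shape of the entries in `errors` (an `id` plus a `message`) is an assumption, since the run above returned an empty list.

```python
def merge_sentiments(request, response):
    """Join request texts with response scores by document id.

    Assumes each entry in response['errors'] has an 'id' and, optionally, a 'message'.
    """
    scores = {doc["id"]: doc["score"] for doc in response.get("documents", [])}
    errors = {err["id"]: err.get("message", "") for err in response.get("errors", [])}
    rows = []
    for doc in request["documents"]:
        rows.append({
            "id": doc["id"],
            "text": doc["text"],
            "score": scores.get(doc["id"]),   # None if no score came back
            "error": errors.get(doc["id"]),   # None if no error was reported
        })
    return rows


# Illustrative data: the second document has a made-up error entry.
request = {"documents": [
    {"id": "1", "language": "en", "text": "I had a wonderful experience!"},
    {"id": "2", "language": "xx", "text": "Some text."},
]}
response = {
    "documents": [{"id": "1", "score": 0.97}],
    "errors": [{"id": "2", "message": "hypothetical error message"}],
}
for row in merge_sentiments(request, response):
    print(row)
```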

## 3) Key phrases
**The Key Phrase Extraction API** evaluates unstructured text, and for each JSON document, returns a list of **key phrases.**

This capability is useful if you need to quickly identify the main points in a collection of documents. For example, given input text "The food was delicious and there were wonderful staff", the service returns the main talking points: "food" and "wonderful staff".

Currently, Key Phrase Extraction supports English, German, Spanish, and Japanese. Other languages are in preview. For more information, see Supported languages.

> https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-keyword-extraction

In [10]:
key_phrase_api_url = text_analytics_base_url + "/keyPhrases"
print(key_phrase_api_url)

https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases


In [11]:
documents = {'documents' : [
  {'id': '1', 'language': 'en', 'text': 'I had a wonderful experience! The rooms were wonderful and the staff was helpful.'},
  {'id': '2', 'language': 'en', 'text': 'I had a terrible time at the hotel. The staff was rude and the food was awful.'},  
  {'id': '3', 'language': 'fr', 'text': "Le restaurant n'était vraiment pas bon et la vue pas terrible."},
  {'id': '4', 'language': 'fr', 'text': "Ceci est une présentation des services cognitifs Azure"},  
]}
headers   = {'Ocp-Apim-Subscription-Key': subscription_key}
response  = requests.post(key_phrase_api_url, headers=headers, json=documents)
key_phrases = response.json()
pprint(key_phrases)

{'documents': [{'id': '1',
                'keyPhrases': ['wonderful experience', 'staff', 'rooms']},
               {'id': '2',
                'keyPhrases': ['food', 'terrible time', 'hotel', 'staff']},
               {'id': '3', 'keyPhrases': ['vue', 'restaurant']},
               {'id': '4',
                'keyPhrases': ['présentation des services cognitifs']}],
 'errors': []}


In [12]:
from IPython.display import HTML
table = []
for document in key_phrases["documents"]:
    text    = next(filter(lambda d: d["id"] == document["id"], documents["documents"]))["text"]    
    phrases = ",".join(document["keyPhrases"])
    table.append("<tr><td>{0}</td><td>{1}</td></tr>".format(text, phrases))
HTML("<table><tr><th>Text</th><th>Key phrases</th></tr>{0}</table>".format("\n".join(table)))

Text,Key phrases
I had a wonderful experience! The rooms were wonderful and the staff was helpful.,"wonderful experience,staff,rooms"
I had a terrible time at the hotel. The staff was rude and the food was awful.,"food,terrible time,hotel,staff"
Le restaurant n'était vraiment pas bon et la vue pas terrible.,"vue,restaurant"
Ceci est une présentation des services cognitifs Azure,présentation des services cognitifs
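To identify the main points across a collection rather than per document, the extracted phrases can be counted over all documents. This is a sketch of one possible aggregation, not something the service does itself. `phrase_counts` is a hypothetical helper, and `sample` reuses the first two documents of the response shown above.

```python
from collections import Counter


def phrase_counts(response):
    """Count in how many documents each key phrase appears (case-insensitive)."""
    counts = Counter()
    for document in response.get("documents", []):
        # A set ensures a phrase is counted at most once per document.
        counts.update({phrase.lower() for phrase in document.get("keyPhrases", [])})
    return counts


# Two documents copied from the response printed above.
sample = {"documents": [
    {"id": "1", "keyPhrases": ["wonderful experience", "staff", "rooms"]},
    {"id": "2", "keyPhrases": ["food", "terrible time", "hotel", "staff"]},
], "errors": []}

for phrase, count in phrase_counts(sample).most_common(3):
    print(phrase, count)
```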


## 4) Entities
The Entity Recognition API takes unstructured text, and for each JSON document, returns a list of **disambiguated entities** with links to more information on the web (Wikipedia and Bing).

> https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking
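Each returned entity carries a `matches` list, and each match has an `offset` and a `length` into the submitted text. The sketch below slices the original text with those values and checks the result against the reported `text`. It assumes offsets count Python string characters (code points), which may not hold for text containing characters outside the Basic Multilingual Plane or combining sequences. `check_matches` is a hypothetical helper and the sample data is hand-written, not an API result.

```python
def check_matches(text, entities):
    """Yield (entity name, sliced span, ok) for every match of every entity.

    ok is True when slicing text by offset/length reproduces the reported match text.
    """
    for entity in entities:
        for match in entity["matches"]:
            start = match["offset"]
            span = text[start:start + match["length"]]
            yield entity["name"], span, span == match["text"]


# Hand-written sample in the same shape as the service response.
text = "Microsoft has its headquarters in Redmond, Washington."
entities = [
    {"name": "Microsoft",
     "matches": [{"text": "Microsoft", "offset": 0, "length": 9}]},
    {"name": "Redmond, Washington",
     "matches": [{"text": "Redmond, Washington", "offset": 34, "length": 19}]},
]

for name, span, ok in check_matches(text, entities):
    print(name, repr(span), ok)
```

If `ok` comes back `False` for some match, the offsets are probably expressed in a different unit than Python string indices and need converting before slicing.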

In [13]:
entity_linking_api_url = text_analytics_base_url + "/entities"
print(entity_linking_api_url)

https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.0/entities


In [17]:
documents = {'documents' : [
  {'id': '1', 'text': " Microsoft Corporation (/ˈmaɪkroʊsɒft/ MY-kroh-soft) is an American multinational technology company with headquarters in Redmond, Washington. It develops, manufactures, licenses, supports, and sells computer software, consumer electronics, personal computers, and related services. Its best known software products are the Microsoft Windows line of operating systems, the Microsoft Office suite, and the Internet Explorer and Edge web browsers. Its flagship hardware products are the Xbox video game consoles and the Microsoft Surface lineup of touchscreen personal computers. Microsoft ranked No. 21 in the 2020 Fortune 500 rankings of the largest United States corporations by total revenue;[3] it was the world's largest software maker by revenue as of 2016.[4] It is considered one of the Big Five companies in the U.S. information technology industry, along with Google, Apple, Amazon, and Facebook."},
]}

In [18]:
headers   = {"Ocp-Apim-Subscription-Key": subscription_key}
response  = requests.post(entity_linking_api_url, headers=headers, json=documents)
entities = response.json()

In [19]:
print(entities)

{'documents': [{'id': '1', 'entities': [{'name': 'Facebook', 'matches': [{'text': 'Facebook', 'offset': 896, 'length': 8}], 'wikipediaLanguage': 'en', 'wikipediaId': 'Facebook', 'wikipediaUrl': 'https://en.wikipedia.org/wiki/Facebook', 'bingId': '4bc8f781-7083-d1a0-f781-9466e0eb62e7'}, {'name': 'Microsoft', 'matches': [{'text': 'Microsoft Corporation', 'offset': 1, 'length': 21}, {'text': 'Microsoft', 'offset': 578, 'length': 9}], 'wikipediaLanguage': 'en', 'wikipediaId': 'Microsoft', 'wikipediaUrl': 'https://en.wikipedia.org/wiki/Microsoft', 'bingId': 'a093e9b9-90f5-a3d5-c4b8-5855e1b01f85'}, {'name': 'Redmond, Washington', 'matches': [{'text': 'Redmond, Washington', 'offset': 122, 'length': 19}], 'wikipediaLanguage': 'en', 'wikipediaId': 'Redmond, Washington', 'wikipediaUrl': 'https://en.wikipedia.org/wiki/Redmond,_Washington', 'bingId': '8769d4c0-b645-70ac-03ec-6eebabf6d26e'}, {'name': 'Microsoft Surface', 'matches': [{'text': 'Microsoft Surface', 'offset': 518, 'length': 17}], 'wiki