## Microsoft Concept Graph

[Microsoft Concept Graph](https://concept.research.microsoft.com/) is a large taxonomy of terms mined from the internet, with `is-a` relations between concepts. 

Concept Graph is available in two forms:
 * Large text file for download
 * REST API

Statistics:
 * 5401933 unique concepts
 * 12551613 unique instances
 * 87603947 `is-a` relations
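
The downloadable form can be processed offline. Below is a minimal sketch of how `is-a` counts could be turned into per-instance concept probabilities. It assumes a tab-separated layout with one `concept<TAB>instance<TAB>count` relation per line; that layout, the sample rows, and the function names `load_relations` and `concept_probs` are illustrative assumptions, and the relative-frequency estimate is not necessarily what the web service computes.

```python
import io
from collections import defaultdict

# Made-up sample rows standing in for the downloaded file;
# the tab-separated layout (concept, instance, count) is an assumption.
SAMPLE = (
    "company\tmicrosoft\t6\n"
    "brand\tmicrosoft\t3\n"
    "vendor\tmicrosoft\t1\n"
    "company\tapple\t4\n"
)

def load_relations(lines):
    """Build {instance: {concept: count}} from tab-separated lines."""
    table = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        concept, instance, count = parts
        table[instance][concept] = table[instance].get(concept, 0) + int(count)
    return table

def concept_probs(table, instance, top_k=10):
    """Estimate P(concept | instance) as count / total count for the instance."""
    counts = table.get(instance, {})
    total = sum(counts.values())
    if total == 0:
        return {}
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return {concept: n / total for concept, n in ranked}

relations = load_relations(io.StringIO(SAMPLE))
probs = concept_probs(relations, "microsoft")
```

To use a real file, pass an open file handle to `load_relations` instead of the in-memory sample. Given the size quoted above, holding the whole table in memory may not be practical, so filtering while reading is likely to be needed.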

## Using the Web Service

The web service offers several calls for estimating the probability that a term belongs to different concepts. More information is available in the [API documentation](https://concept.research.microsoft.com/Home/Api).
A sample URL to call: `https://concept.research.microsoft.com/api/Concept/ScoreByProb?instance=microsoft&topK=10`

In [3]:
import urllib.request
import urllib.parse
import json
import ssl

def http(x):
    # Disable HTTPS certificate verification globally (insecure; only for a quick demo)
    ssl._create_default_https_context = ssl._create_unverified_context
    response = urllib.request.urlopen(x)
    data = response.read()
    return data.decode('utf-8')

def query(x):
    # URL-encode the term and substitute it into the `instance` parameter
    return json.loads(http("https://concept.research.microsoft.com/api/Concept/ScoreByProb?instance={0}&topK=10".format(urllib.parse.quote(x))))

query('microsoft')

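URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>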

Let's try to categorize news titles using parent concepts. To get news titles, we will use the [NewsApi.org](http://newsapi.org) service. You need your own API key to use it: go to the website and register for the free developer plan.

In [20]:
newsapi_key = '<your API key here>'
def get_news(country='us'):
    res = json.loads(http("https://newsapi.org/v2/top-headlines?country={0}&apiKey={1}".format(country,newsapi_key)))
    return res['articles']

all_titles = [x['title'] for x in get_news('us')+get_news('gb')]

In [21]:
all_titles

['Covid-19 Live Updates: Vaccines and Boosters News - The New York Times',
 'Ukrainians Flee Mariupol as Russian Forces Push to Take Port City - The Wall Street Journal',
 'Bond Yields Jump, Stock Futures Rise After Powell Says Fed Is Ready to Be More Aggressive - The Wall Street Journal',
 'Putin critic Alexei Navalny found guilty by Russian court - New York Post ',
 "Supreme Court nominee Ketanji Brown Jackson will face questions at confirmation hearing's second day - CNN",
 '2 teachers killed at Swedish high school, student arrested - ABC News',
 'Clues to Covid-19’s Next Moves Come From Sewers - The Wall Street Journal',
 'Republicans to roll dice by grilling Jackson over child-pornography sentencing decisions | TheHill - The Hill',
 '‘Clear sign’ Putin considering using chemical weapons in Ukraine, claims President Biden - The Independent',
 'NASA confirms there are 5,000 planets outside our solar system - Daily Mail',
 "US stocks whipsawed overnight after Fed Chair Powell's remar

First, we want to extract nouns from the news titles. We will use the `TextBlob` library, which simplifies many typical NLP tasks like this one.

In [15]:
import sys
!{sys.executable} -m pip install textblob
!{sys.executable} -m textblob.download_corpora
from textblob import TextBlob

Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\dmitryso\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dmitryso\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dmitryso\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dmitryso\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\dmitryso\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\dmitryso\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is a

In [22]:
w = {}
for x in all_titles:
    for n in TextBlob(x).noun_phrases:
        if n in w:
            w[n].append(x)
        else:
            w[n]=[x]
{ x:len(w[x]) for x in w.keys()}

{'covid-19 live updates': 1,
 'vaccines': 1,
 'boosters': 1,
 'york': 4,
 'ukrainians flee mariupol': 1,
 'forces push': 1,
 'port city': 1,
 'wall street journal': 3,
 'bond yields': 1,
 'futures rise': 1,
 'powell says fed': 1,
 'ready': 1,
 'be': 1,
 'aggressive': 1,
 'putin': 3,
 'alexei navalny': 2,
 'russian': 2,
 'supreme court nominee': 1,
 'ketanji brown jackson': 1,
 "confirmation hearing 's": 1,
 'cnn': 1,
 'swedish': 1,
 'high school': 1,
 'abc': 1,
 'clues': 1,
 'covid-19': 1,
 '’ s': 2,
 'moves': 1,
 'sewers': 1,
 'roll dice': 1,
 'jackson': 1,
 'decisions |': 1,
 'thehill': 1,
 'clear': 2,
 'chemical weapons': 2,
 'ukraine': 3,
 'claims president': 2,
 'biden': 2,
 'nasa': 2,
 'solar system': 2,
 'daily mail': 3,
 'us stocks': 1,
 'fed chair powell': 1,
 "'s remarks": 1,
 'fox': 1,
 "'we 've": 1,
 'tests': 1,
 'covid': 1,
 'politico': 1,
 'duchess': 1,
 'cambridge': 1,
 'swaps khaki jungle gear': 1,
 'vampire': 1,
 'wife': 1,
 'belize': 1,
 'china': 2,
 'flight recorders

We can see that noun phrases do not give us large thematic groups. Let's replace them with more general terms obtained from the concept graph. This will take some time, because we make a REST call for each noun phrase.

In [23]:
w = {}
for x in all_titles:
    for noun in TextBlob(x).noun_phrases:
        terms = query(noun)  # query() URL-encodes the term itself
        for term in [u for u in terms.keys() if terms[u]>0.1]:
            if term in w:
                w[term].append(x)
            else:
                w[term]=[x]
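
Since the same noun phrase (for example a publication name) occurs in many titles, one way to cut down the number of REST calls is to cache results per phrase. The following is a minimal sketch using `functools.lru_cache`; `make_cached` and `fake_query` are hypothetical names, and in the notebook the wrapped function would be `query`.

```python
from functools import lru_cache

def make_cached(fetch, maxsize=None):
    """Wrap a one-argument lookup so repeated terms are fetched only once."""
    @lru_cache(maxsize=maxsize)
    def cached(term):
        return fetch(term)
    return cached

# Stand-in for query(); records how often the underlying lookup actually runs
lookups = []
def fake_query(term):
    lookups.append(term)
    return {"publication": 0.5}

cached_query = make_cached(fake_query)
for phrase in ["daily mail", "putin", "daily mail", "daily mail"]:
    cached_query(phrase)
```

Here the underlying lookup runs only for the two distinct phrases. The cached dictionary is shared between calls, so callers should treat it as read-only.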

In [24]:
{ x:len(w[x]) for x in w.keys() if len(w[x])>3}

{'city': 9,
 'brand': 4,
 'place': 9,
 'town': 4,
 'factor': 4,
 'film': 4,
 'nation': 11,
 'state': 5,
 'person': 4,
 'organization': 5,
 'publication': 10,
 'market': 5,
 'economy': 4,
 'company': 6,
 'newspaper': 6,
 'relationship': 6}
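
Because a title is appended once for every noun phrase that maps to a concept, the same title can be counted more than once under that concept. A minimal sketch of a grouping that keeps each title only once per concept is shown below; `group_titles` and the sample input are hypothetical and stand in for the loop over `all_titles`.

```python
from collections import defaultdict

def group_titles(title_concepts):
    """Map each concept to its distinct titles, preserving first-seen order."""
    groups = defaultdict(list)
    for title, concepts in title_concepts:
        for concept in concepts:
            if title not in groups[concept]:
                groups[concept].append(title)
    return dict(groups)

# Hypothetical input: each title paired with the concepts found for its noun phrases;
# 'nation' is listed twice for the first title because two phrases mapped to it.
sample = [
    ("Russia stops talks with Japan", ["nation", "economy", "nation"]),
    ("China plane crash", ["nation"]),
]
groups = group_titles(sample)

# Concepts ordered from largest to smallest group
sizes = sorted(((len(titles), concept) for concept, titles in groups.items()), reverse=True)
```

With deduplicated groups, the counts reflect the number of distinct titles per concept rather than the number of matching noun phrases.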

In [27]:
print('\nECONOMY:\n'+'\n'.join(w['economy']))
print('\nNATION:\n'+'\n'.join(w['nation']))
print('\nPERSON:\n'+'\n'.join(w['person']))


ECONOMY:
China searches for victims, flight recorders after first plane crash in 12 years - Reuters
Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español
China plane crash – live: Search for survivors continues as witness describes moment flight fell from sky - The Independent
UK prepares to nationalize Russia natural gas giant Gazprom's retail unit - Business Insider

NATION:
‘Clear sign’ Putin considering using chemical weapons in Ukraine, claims President Biden - The Independent
Duchess of Cambridge swaps khaki jungle gear for Vampire's Wife dress on Belize trip - Daily Mail
China searches for victims, flight recorders after first plane crash in 12 years - Reuters
Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español
Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español
China plane crash – live: Search for survivors continues as witness describes moment flight 