## Microsoft Concept Graph

[Microsoft Concept Graph](https://concept.research.microsoft.com/) is a large taxonomy of terms mined from the internet, with `is-a` relations between concepts. 

Context Graph is available in two forms:
 * Large text file for download
 * REST API

Statistics:
 * 5401933 unique concepts, 
 * 12551613 unique instances
 * 87603947 `is-a` relations

## Using Web Service

Web service offers different calls to estimate probability of a concept belonging to different groups. More info is available [here](https://concept.research.microsoft.com/Home/Api).
Here is the sample URL to call: `https://concept.research.microsoft.com/api/Concept/ScoreByProb?instance=microsoft&topK=10`

In [1]:
import urllib
import json
import ssl

def http(x):
    ssl._create_default_https_context = ssl._create_unverified_context
    response = urllib.request.urlopen(x)
    data = response.read()
    return data.decode('utf-8')

def query(x):
    return json.loads(http("https://concept.research.microsoft.com/api/Concept/ScoreByProb?instance={}&topK=10".format(urllib.parse.quote(x))))

query('microsoft')

URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>

Let's try to categorize the news titles using parent concepts. To get news titles, we will use [NewsApi.org](http://newsapi.org) service. You need to obtain your own API key in order to use the service - go to the web site and register for free developer plan.

In [None]:
# newsapi_key = '123912da12d1481da171c53ad6f069ea'
# def get_news(country='us'):
#     res = json.loads(http("https://newsapi.org/v2/top-headlines?country={0}&apiKey={1}".format(country,newsapi_key)))
#     return res['articles']

# all_titles = [x['title'] for x in get_news('us')+get_news('gb')]

import requests

newsapi_key = '123912da12d1481da171c53ad6f069ea'

def get_news(country='us'):
    url = f"https://newsapi.org/v2/top-headlines?country={country}&apiKey={newsapi_key}"
    try:
        res = requests.get(url, timeout=10)
        res.raise_for_status()
        return res.json()['articles']
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return []

# def get_news(country='us'):
#     url = f"https://newsapi.org/v2/top-headlines?country={country}&apiKey={newsapi_key}"
#     response = requests.get(url, timeout=10)
#     response.raise_for_status()  # Raises HTTPError if the request failed
#     return response.json()['articles']

# Example: get combined US and UK headlines
all_titles = [x['title'] for x in get_news('us') + get_news('gb')]
print(all_titles[:5])  # Print first 5 titles to confirm


['US and EU close in on 15% tariff deal - Financial Times', 'Top UN court says countries can sue each other over climate change - BBC', '2025 NFL training camp: Anthony Richardson, Travis Etienne among veterans who could lose starting roles - NFL.com', "Pixel 10 Pro Fold leaks in official-looking renders with only Google's two best Pro colors [Gallery] - 9to5Google", 'Idaho murders: Bryan Kohberger to face sentencing - BBC']


In [12]:
all_titles

['US and EU close in on 15% tariff deal - Financial Times',
 'Top UN court says countries can sue each other over climate change - BBC',
 '2025 NFL training camp: Anthony Richardson, Travis Etienne among veterans who could lose starting roles - NFL.com',
 "Pixel 10 Pro Fold leaks in official-looking renders with only Google's two best Pro colors [Gallery] - 9to5Google",
 'Idaho murders: Bryan Kohberger to face sentencing - BBC',
 "‘Fit and Healthy’ Dad of Four, 57, Gets Random Whiffs of 'Strange, Sweet Caramel Smell.' It's a Fatal Sign - AOL.com",
 'Patient dies of brain-eating amoeba in South Carolina, hospital confirms - CBS News',
 'Trump says he is ‘thinking about’ nixing capital gains tax on home sales. Here’s what that could mean for homeowners - CNN',
 "Home Sales Fall as Prices Hit Record High. Mortgage Rates Are Keeping the Market Stuck. - Barron's",
 'Israeli forces have killed over 1,000 aid-seekers in Gaza since May, the U.N. says - NPR',
 'Stevie and Lindsey’s ‘Buckingham 

First of all, we want to be able to extract nouns from news titles. We will use `TextBlob` library to do this, which simplifies a lot of typical NLP tasks like this.

In [13]:
import sys
!{sys.executable} -m pip install textblob
!{sys.executable} -m textblob.download_corpora
from textblob import TextBlob

Defaulting to user installation because normal site-packages is not writeable
Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\sophi\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sophi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sophi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\sophi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\sophi\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\sophi\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is alr

In [14]:
w = {}
for x in all_titles:
    for n in TextBlob(x).noun_phrases:
        if n in w:
            w[n].append(x)
        else:
            w[n]=[x]
{ x:len(w[x]) for x in w.keys()}

{'eu': 1,
 '% tariff': 1,
 'financial': 1,
 'top': 1,
 'un court': 1,
 'bbc': 2,
 'nfl': 1,
 'training camp': 1,
 'anthony richardson': 1,
 'travis etienne': 1,
 'nfl.com': 1,
 'pixel': 1,
 'pro fold': 1,
 'google': 1,
 'pro': 1,
 'colors [ gallery ]': 1,
 'idaho': 1,
 'bryan kohberger': 1,
 'fit': 1,
 'healthy': 1,
 'dad': 1,
 'whiffs': 1,
 'caramel smell': 1,
 'fatal sign': 1,
 'aol.com': 1,
 'patient': 1,
 'carolina': 1,
 'hospital confirms': 1,
 'cbs news': 1,
 'trump': 1,
 'capital gains tax': 1,
 '’ s': 1,
 'cnn': 1,
 'prices': 1,
 'record': 1,
 'rates': 1,
 'keeping': 1,
 'market stuck': 1,
 'barron': 1,
 'israeli': 1,
 'gaza': 1,
 'may': 1,
 'u.n.': 1,
 'npr': 1,
 'stevie': 1,
 'lindsey': 1,
 '’ s ‘': 1,
 'buckingham nicks': 1,
 'reissue': 1,
 'decades': 1,
 'print': 1,
 'rolling stone': 1,
 'potus': 1,
 'furious ’': 1,
 'white house': 1,
 'epstein': 2,
 'politico': 1,
 'exclusive': 2,
 'uber': 1,
 "new 'women": 1,
 'preferences': 1,
 'pilot program': 1,
 'us drivers': 1,
 'abc

We can see that nouns do not give us large thematic groups. Let's substitute nouns by more general terms obtained from the concept graph. This will take some time, because we are doing REST call for each noun phrase.

In [None]:
# w = {}
# for x in all_titles[:10]:
#     for noun in TextBlob(x).noun_phrases:
#         try:
#             terms = query(noun.replace(' ', '%20'))
#             if terms:  # Make sure it's not None
#                 for term in [u for u in terms.keys() if terms[u] > 0.1]:
#                     w.setdefault(term, []).append(x)
#         except Exception as e:
#             print(f"Skipping '{noun}' due to error: {e}")


# w = {}
# for x in all_titles[:10]:
#     for noun in TextBlob(x).noun_phrases:
#         terms = query(noun.replace(' ', '%20'))
#         if terms:  # Only proceed if query returned valid data
#             for term in [u for u in terms.keys() if terms[u] > 0.1]:
#                 if term in w:
#                     w[term].append(x)
#                 else:
#                     w[term] = [x]
all_titles = [
    "Microsoft announces new AI initiative",
    "UK government faces economic challenges",
    "NASA plans Mars mission"
]
w = {}
for x in all_titles:
    for noun in TextBlob(x).noun_phrases:
        terms = query(noun.replace(' ','%20'))
        for term in [u for u in terms.keys() if terms[u]>0.1]:
            if term in w:
                w[term].append(x)
            else:
                w[term]=[x]

TypeError: '>' not supported between instances of 'str' and 'float'

In [24]:
{ x:len(w[x]) for x in w.keys() if len(w[x])>3}

{'city': 9,
 'brand': 4,
 'place': 9,
 'town': 4,
 'factor': 4,
 'film': 4,
 'nation': 11,
 'state': 5,
 'person': 4,
 'organization': 5,
 'publication': 10,
 'market': 5,
 'economy': 4,
 'company': 6,
 'newspaper': 6,
 'relationship': 6}

In [27]:
print('\nECONOMY:\n'+'\n'.join(w['economy']))
print('\nNATION:\n'+'\n'.join(w['nation']))
print('\nPERSON:\n'+'\n'.join(w['person']))


ECONOMY:
China searches for victims, flight recorders after first plane crash in 12 years - Reuters
Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español
China plane crash – live: Search for survivors continues as witness describes moment flight fell from sky - The Independent
UK prepares to nationalize Russia natural gas giant Gazprom's retail unit - Business Insider

NATION:
‘Clear sign’ Putin considering using chemical weapons in Ukraine, claims President Biden - The Independent
Duchess of Cambridge swaps khaki jungle gear for Vampire's Wife dress on Belize trip - Daily Mail
China searches for victims, flight recorders after first plane crash in 12 years - Reuters
Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español
Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español
China plane crash – live: Search for survivors continues as witness describes moment flight 