# Distant Supervision

In this notebook we acquire sources of distant supervision for our models.

## WikiData Programming Languages

For the Snorkel example in Chapter 5, we build a programming language extractor from the titles and bodies of Stack Overflow questions. Here we generate the file that example uses by querying WikiData with SPARQL for a list of programming languages. We then use these language names to label positive examples of programming languages in posts, which gives us training data for our discriminative/network extractor model.

The following SPARQL query returns the English labels of all WikiData items that are an [instance of (P31)](https://www.wikidata.org/wiki/Property:P31) [programming language (Q9143)](https://www.wikidata.org/wiki/Q9143).

We `SELECT DISTINCT` the item and its label, then filter the label's language to English so that an item does not appear once for every language it has a label in.

In [23]:
import requests

url = 'https://query.wikidata.org/sparql'
query = """
# Get all programming language names from English sources
SELECT DISTINCT ?item ?item_label
WHERE {
 ?item wdt:P31 wd:Q9143 # P31:instances of Q9143:programming language
 ; rdfs:label ?item_label .
  
  FILTER (LANG(?item_label) = "en"). # English only
}

"""
r = requests.get(url, params = {'format': 'json', 'query': query})
r.raise_for_status() # Fail early on an HTTP error instead of parsing an error page
data = r.json()

In [24]:
import json
print(json.dumps(data, indent=4, sort_keys=True))

{
    "head": {
        "vars": [
            "item",
            "item_label"
        ]
    },
    "results": {
        "bindings": [
            {
                "item": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q121572"
                },
                "item_label": {
                    "type": "literal",
                    "value": "literal",
                    "xml:lang": "en"
                }
            },
            {
                "item": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q169422"
                },
                "item_label": {
                    "type": "literal",
                    "value": "MagicPlot",
                    "xml:lang": "en"
                }
            },
            {
                "item": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q19528"
                },
          

## Extract the Language Labels from nested JSON

Nested JSON is a pain to work with in `DataFrames`, so we un-nest it, retaining only what we need.

In [21]:
languages = [
    {
        'name': x['item_label']['value'],
        'kb_url': x['item']['value'],
        'kb_id': x['item']['value'].split('/')[-1], # Get the ID
    }
    for x in data['results']['bindings']
]

# Filter out an erroneous language
languages = list(
    filter(
        lambda x: x['kb_id'] != 'Q25111344', 
        languages
    )
)

print(f'There were {len(languages):,} languages returned.\n')

for l in languages[0:10]:
    print(l)

There were 1,413 languages returned.

{'name': 'PHP', 'kb_url': 'http://www.wikidata.org/entity/Q59', 'kb_id': 'Q59'}
{'name': 'Java', 'kb_url': 'http://www.wikidata.org/entity/Q251', 'kb_id': 'Q251'}
{'name': 'BCPL', 'kb_url': 'http://www.wikidata.org/entity/Q810009', 'kb_id': 'Q810009'}
{'name': 'BeanShell', 'kb_url': 'http://www.wikidata.org/entity/Q812964', 'kb_id': 'Q812964'}
{'name': 'Script.NET', 'kb_url': 'http://www.wikidata.org/entity/Q820978', 'kb_id': 'Q820978'}
{'name': 'newLISP', 'kb_url': 'http://www.wikidata.org/entity/Q827233', 'kb_id': 'Q827233'}
{'name': 'Befunge', 'kb_url': 'http://www.wikidata.org/entity/Q814269', 'kb_id': 'Q814269'}
{'name': 'Clean', 'kb_url': 'http://www.wikidata.org/entity/Q377986', 'kb_id': 'Q377986'}
{'name': 'Whitespace', 'kb_url': 'http://www.wikidata.org/entity/Q378222', 'kb_id': 'Q378222'}
{'name': 'XL', 'kb_url': 'http://www.wikidata.org/entity/Q368880', 'kb_id': 'Q368880'}


## Write Languages to Disk as JSON Lines

In [22]:
import jsonlines

with jsonlines.open('../data/programming_languages.jsonl', mode='w') as writer:
    writer.write_all(languages)
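
As a quick sanity check, the file can be read back one record per line. The sketch below uses only the standard library's `json` module rather than `jsonlines`, and writes two sample records (copied from the output above) to a temporary file. The helper names `write_jsonl` and `read_jsonl` and the temporary path are illustrative assumptions, not part of the original notebook.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Two records copied from the output above, used here as sample data
sample = [
    {'name': 'PHP', 'kb_url': 'http://www.wikidata.org/entity/Q59', 'kb_id': 'Q59'},
    {'name': 'Java', 'kb_url': 'http://www.wikidata.org/entity/Q251', 'kb_id': 'Q251'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_languages.jsonl')
    write_jsonl(sample_path, sample)
    restored = read_jsonl(sample_path)

# True when the records survive the round trip unchanged
print(restored == sample)
```

The same `read_jsonl` helper could be pointed at `../data/programming_languages.jsonl` to confirm that the number of records matches the count printed earlier.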

## Get a List of Operating Systems for Negative LFs

In [28]:
import requests

url = 'https://query.wikidata.org/sparql'
query = """
# Get all operating system names from English sources
SELECT DISTINCT ?item ?item_label
WHERE {
 ?item wdt:P31 wd:Q9135 # P31:instances of Q9135:operating system
 ; rdfs:label ?item_label .
  
  FILTER (LANG(?item_label) = "en"). 
}

"""
r = requests.get(url, params = {'format': 'json', 'query': query})
r.raise_for_status() # Fail early on an HTTP error instead of parsing an error page
data = r.json()
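
The two queries in this notebook differ only in the class ID (Q9143 for programming languages, Q9135 for operating systems). Here is a minimal sketch of a helper that builds the query for any class. `build_instance_query` is a hypothetical name, not something the original notebook defines, and the sketch only constructs the query string without contacting the endpoint.

```python
import re


def build_instance_query(class_qid, lang='en'):
    """Build a SPARQL query for the labels of all instances of a WikiData class.

    Hypothetical helper that mirrors the two hand-written queries in this notebook.
    """
    # Validate inputs so that arbitrary text cannot be spliced into the query
    if not re.fullmatch(r'Q[1-9]\d*', class_qid):
        raise ValueError(f'Not a WikiData item ID: {class_qid!r}')
    if not re.fullmatch(r'[a-z]{2,3}(-[a-z0-9]+)*', lang):
        raise ValueError(f'Not a language code: {lang!r}')

    return f"""
SELECT DISTINCT ?item ?item_label
WHERE {{
  ?item wdt:P31 wd:{class_qid} ;   # P31: instance of
        rdfs:label ?item_label .
  FILTER (LANG(?item_label) = "{lang}") .
}}
"""


language_query = build_instance_query('Q9143')  # programming languages
os_query = build_instance_query('Q9135')        # operating systems
print(os_query)
```

Either string could be passed as the `query` parameter of the `requests.get` call used above.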

In [29]:
programs = [
    {
        'name': x['item_label']['value'],
        'kb_url': x['item']['value'],
        'kb_id': x['item']['value'].split('/')[-1], # Get the ID
    }
    for x in data['results']['bindings']
]

print(f'There were {len(programs):,} programs returned.\n')

for l in programs[0:10]:
    print(l)

There were 967 programs returned.

{'name': 'Microsoft Windows', 'kb_url': 'http://www.wikidata.org/entity/Q1406', 'kb_id': 'Q1406'}
{'name': 'Windows 8', 'kb_url': 'http://www.wikidata.org/entity/Q5046', 'kb_id': 'Q5046'}
{'name': 'Apple DOS', 'kb_url': 'http://www.wikidata.org/entity/Q621234', 'kb_id': 'Q621234'}
{'name': 'IRIX', 'kb_url': 'http://www.wikidata.org/entity/Q627611', 'kb_id': 'Q627611'}
{'name': 'Apple ProDOS', 'kb_url': 'http://www.wikidata.org/entity/Q621328', 'kb_id': 'Q621328'}
{'name': 'Apple CP/M', 'kb_url': 'http://www.wikidata.org/entity/Q621220', 'kb_id': 'Q621220'}
{'name': 'MUSIC/SP', 'kb_url': 'http://www.wikidata.org/entity/Q629795', 'kb_id': 'Q629795'}
{'name': 'UNIX System V', 'kb_url': 'http://www.wikidata.org/entity/Q633192', 'kb_id': 'Q633192'}
{'name': 'Novell Open Enterprise Server', 'kb_url': 'http://www.wikidata.org/entity/Q1197839', 'kb_id': 'Q1197839'}
{'name': 'SunOS', 'kb_url': 'http://www.wikidata.org/entity/Q1208460', 'kb_id': 'Q1208460'}


In [30]:
import jsonlines

with jsonlines.open('../data/operating_systems.jsonl', mode='w') as writer:
    writer.write_all(programs)

## Conclusion

Now we are ready to use our programming languages and operating systems in our labeling functions (LFs) in the Snorkel notebook!
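
To illustrate how these lists might feed a labeling function, here is a minimal, library-free sketch of keyword matching. The label constants, the function name `label_token`, and the small sample sets are assumptions made for illustration. In the Snorkel notebook this logic would be wrapped in Snorkel's own labeling function machinery, and the names would be loaded from the JSON Lines files written above.

```python
# Hypothetical label values for illustration; the Snorkel notebook defines its own
ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1

# Small samples taken from the outputs above; the real sets would be
# loaded from programming_languages.jsonl and operating_systems.jsonl
language_names = {'PHP', 'Java', 'BCPL', 'Befunge'}
os_names = {'IRIX', 'SunOS', 'Windows 8'}


def label_token(token):
    """Vote POSITIVE for known language names, NEGATIVE for OS names, else ABSTAIN."""
    if token in language_names:
        return POSITIVE
    if token in os_names:
        return NEGATIVE
    return ABSTAIN


tokens = ['Java', 'on', 'SunOS']
print([label_token(t) for t in tokens])  # [1, -1, 0]
```

Exact, case-sensitive matching is a deliberate simplification. Names such as `Clean` and `Whitespace` in the language list are also ordinary English words, so a real LF would likely need surrounding context to avoid false positives.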