# Distant Supervision

In this notebook we acquire sources of distant supervision for our models.

## WikiData Programming Languages

For the Snorkel example in Chapter 5, we build a programming language extractor from the titles and bodies of Stack Overflow questions. Here we generate the file used in that example by querying WikiData with SPARQL for a list of programming languages. We then use these language names to label positive examples of programming languages in posts, which we use to train our discriminative/network extractor model.

The following SPARQL query returns the English names of all WikiData items that are [instances of (P31)](https://www.wikidata.org/wiki/Property:P31) a [programming language (Q9143)](https://www.wikidata.org/wiki/Q9143).

We `SELECT DISTINCT` the item and its label, then filter the label's language to English so that labels in other languages do not produce duplicate rows.

In [1]:
import requests

url = 'https://query.wikidata.org/sparql'
query = """
# Get all programming language names from English sources
SELECT DISTINCT ?item ?item_label
WHERE {
 ?item wdt:P31 wd:Q9143 # instances of programming language
 ; rdfs:label ?item_label .
  
  FILTER (LANG(?item_label) = "en"). 
}

"""
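# Send the query to the WikiData SPARQL endpoint and request JSON results
r = requests.get(url, params={'format': 'json', 'query': query})
r.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page
data = r.json()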

In [16]:
data

{'head': {'vars': ['item', 'item_label']},
 'results': {'bindings': [{'item': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q59'},
    'item_label': {'xml:lang': 'en', 'type': 'literal', 'value': 'PHP'}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q251'},
    'item_label': {'xml:lang': 'en', 'type': 'literal', 'value': 'Java'}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q810009'},
    'item_label': {'xml:lang': 'en', 'type': 'literal', 'value': 'BCPL'}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q812964'},
    'item_label': {'xml:lang': 'en', 'type': 'literal', 'value': 'BeanShell'}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q820978'},
    'item_label': {'xml:lang': 'en',
     'type': 'literal',
     'value': 'Script.NET'}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q827233'},
    'item_label': {'xml:lang': 'en', 'type': 'literal', 'valu

## Extract the Language Labels from Nested JSON

In [10]:
languages = [x['item_label']['value'] for x in data['results']['bindings']]
languages[0:20]

['PHP',
 'Java',
 'BCPL',
 'BeanShell',
 'Script.NET',
 'newLISP',
 'Befunge',
 'Clean',
 'Whitespace',
 'XL',
 'Pharo',
 'SolidThinking Embed',
 'Kent Recursive Calculator',
 'Oberon',
 'Emacs Lisp',
 'GT.M',
 'REBOL',
 'Embedded SQL',
 'Turbo Basic',
 'Puck']
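
Because `SELECT DISTINCT` applies to (item, label) pairs, two different items may share the same English label, so the extracted list can contain repeats. The sketch below shows one way to drop blanks and repeats while keeping first-seen order. The `unique_labels` helper and the `sample_bindings` data (including the `Q0000` entity) are made up for illustration and are not part of the original notebook.

```python
# Hypothetical sample mimicking the shape of data['results']['bindings']
sample_bindings = [
    {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q59'},
     'item_label': {'xml:lang': 'en', 'type': 'literal', 'value': 'PHP'}},
    {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q251'},
     'item_label': {'xml:lang': 'en', 'type': 'literal', 'value': 'Java'}},
    # Made-up second item that reuses the label 'Java'
    {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q0000'},
     'item_label': {'xml:lang': 'en', 'type': 'literal', 'value': 'Java'}},
]


def unique_labels(bindings):
    """Return labels in first-seen order, skipping blanks and repeats."""
    seen = set()
    labels = []
    for binding in bindings:
        label = binding['item_label']['value'].strip()
        if label and label not in seen:
            seen.add(label)
            labels.append(label)
    return labels


unique_labels(sample_bindings)
```

In the notebook itself, the same helper could be applied to `data['results']['bindings']` in place of the list comprehension.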

## Write Languages to Disk as CSV

In [15]:
import csv

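# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('../data/programming_languages.csv', 'w', newline='', encoding='utf-8') as f: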
    writer = csv.writer(f)
    for language in languages:
        writer.writerow([language])
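
A quick way to check the file is to read it back and compare it with the list that was written. The following is a minimal sketch of that round trip; it writes to a temporary directory and uses a short `sample_languages` list, both of which are assumptions made so that the example runs on its own rather than depending on `../data/`.

```python
import csv
import os
import tempfile

# Stand-in for the `languages` list built above
sample_languages = ['PHP', 'Java', 'Emacs Lisp']

# Write one language per row, as in the cell above, but to a temporary path
path = os.path.join(tempfile.mkdtemp(), 'programming_languages.csv')
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for language in sample_languages:
        writer.writerow([language])

# Read the rows back, taking the single column of each non-empty row
with open(path, newline='', encoding='utf-8') as f:
    loaded = [row[0] for row in csv.reader(f) if row]

assert loaded == sample_languages
```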

## Conclusion

Now we are ready to use our programming languages in our labeling functions (LFs) in the Snorkel notebook!
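
As a rough illustration of that next step, the sketch below shows how a list of language names might back a keyword-style labeling rule. It is plain Python and deliberately does not use the Snorkel API; the names `POSITIVE`, `ABSTAIN`, `build_language_pattern`, and `lf_mentions_language` are assumptions for this example, and the actual labeling functions in the Snorkel notebook may look quite different.

```python
import re

# Assumed label values for illustration only
POSITIVE = 1
ABSTAIN = -1


def build_language_pattern(languages):
    """Compile one regex that matches any language name as a whole token."""
    # Longest names first so a multi-word name wins over a shorter overlapping one
    names = sorted(set(languages), key=len, reverse=True)
    alternatives = '|'.join(re.escape(name) for name in names)
    # Lookarounds instead of \b so names ending in symbols (e.g. 'C++') still work
    return re.compile(r'(?<!\w)(?:' + alternatives + r')(?!\w)')


def lf_mentions_language(text, pattern):
    """Vote POSITIVE if the text mentions a known language, otherwise abstain."""
    return POSITIVE if pattern.search(text) else ABSTAIN


pattern = build_language_pattern(['PHP', 'Java', 'Emacs Lisp'])
lf_mentions_language('How do I parse JSON in PHP?', pattern)   # a known language appears
lf_mentions_language('Why is my JavaScript slow?', pattern)    # 'Java' is only a prefix here
```

Matching is case-sensitive and token-bounded, which keeps `Java` from firing on `JavaScript`. Names that are also ordinary English words, such as `Clean` or `Whitespace`, would still produce noisy matches and may need extra filtering.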