# Hindi Wiki Article Generation Bot



### Problem Definition
We want to write a script that generates a Wikipedia page in our native language. In my case, this means generating information in Hindi. The page should contain relevant information on the selected topic, with all or most of its content in Hindi.
The Wikidata knowledge graph is used for this purpose.

### Tools and libraries
Wikidata is a knowledge graph whose entities are linked together by properties. These can be retrieved with SPARQL queries through the qwikidata API in Python. Wikidata can be accessed freely, and each entity has a label, a description and a set of properties (known as claims) associated with it. The useful thing about these entities and properties is that the label, and sometimes the description, is available in Hindi and can be accessed easily through the API. This makes it possible to collect and combine new information in our native language even when that information is not present in any explicit context such as a Wikipedia page.

### Solution Approach
The idea is to aggregate the information available for an entity, and its relationships with other entities, in the native language. The entities and links can be queried in English, and the corresponding Hindi label can then be taken, if present, to represent the aggregated information as text. To demonstrate this, we choose to create biographies of people. For simplicity, I have kept the results as simple key-value pairs, with both key and value in Hindi. Converting the aggregated information into a more Wikipedia-like form using statements is straightforward. For example, it is easy enough to convert 'place_of_birth':'name' to He/She was born at 'name'.
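
The key-to-sentence conversion mentioned above could be sketched roughly as follows. This is only an illustration: the `TEMPLATES` dictionary and the `to_sentences` helper are hypothetical names that are not part of the notebook, and the templates are written in English for readability; Hindi templates would be substituted in practice.

```python
# Hypothetical sentence templates, keyed by the same names used in the
# biography dictionary of this notebook.
TEMPLATES = {
    "place_of_birth": "{name} was born in {value}.",
    "date_of_birth": "{name} was born on {value}.",
    "profession": "{name} worked as {value}.",
}

def to_sentences(name, facts, templates=TEMPLATES):
    """Turn {key: [values]} into sentences; keys without a template are skipped."""
    sentences = []
    for key, values in facts.items():
        template = templates.get(key)
        if template is None or not values:
            continue
        sentences.append(template.format(name=name, value=", ".join(values)))
    return sentences

facts = {
    "place_of_birth": ["Varanasi"],
    "profession": ["writer", "novelist"],
    "spouse": [],  # no value retrieved, so no sentence is produced
}
for sentence in to_sentences("Premchand", facts):
    print(sentence)
```

Keeping the templates in a dictionary means that a key without a template is simply skipped, so new properties can be added one at a time.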

In [5]:
from qwikidata.sparql import return_sparql_query_results
from qwikidata.entity import WikidataItem, WikidataLexeme, WikidataProperty
from qwikidata.linked_data_interface import get_entity_dict_from_api

In Wikidata, each property is represented by a P[NUMBER] identifier and each entity by a Q[NUMBER] identifier. Properties are essentially descriptive words, such as the ones used as keys in the dictionary below. The information retrieval is divided into two parts. In the first part, we retrieve the most notable information about someone's biography explicitly, using SPARQL queries. In the second part, all the claims (properties) in the entity response are looked up again to generate additional property-value pairs. The separation is made so that the user can choose explicitly which property values to query with SPARQL, which is not possible in the second case because the set of claims is fixed. Accordingly, the property values that are queried explicitly are listed below in the biography dict. The remaining values are generated from the entity structure itself.
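
As a rough illustration of the second part, the sketch below walks the `claims` section of an entity dict and collects the Q identifiers that a property points to. The `sample_entity` fragment is hand-written to mimic the shape of the API response, and its identifier `Q12345` and file name are placeholders, not real data. The helper name `claim_entity_ids` is also hypothetical.

```python
# Hand-written fragment in the shape of an entity dict; values are placeholders.
sample_entity = {
    "claims": {
        "P19": [  # place of birth: the value is a dict carrying an entity id
            {"mainsnak": {"datavalue": {"type": "wikibase-entityid",
                                        "value": {"id": "Q12345"}}}}
        ],
        "P569": [  # date of birth: the value is a dict without an "id" key
            {"mainsnak": {"datavalue": {"type": "time",
                                        "value": {"time": "+1880-07-31T00:00:00Z"}}}}
        ],
        "P18": [  # image: the value is a plain string
            {"mainsnak": {"datavalue": {"type": "string",
                                        "value": "Example.jpg"}}}
        ],
    }
}

def claim_entity_ids(entity, pid):
    """Return the entity ids referenced by property `pid`, ignoring other value types."""
    ids = []
    for claim in entity.get("claims", {}).get(pid, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and value.get("id"):
            ids.append(value["id"])
    return ids

print(claim_entity_ids(sample_entity, "P19"))
print(claim_entity_ids(sample_entity, "P569"))
```

Only values that reference another entity can be resolved to a Hindi label this way; dates and plain strings carry no identifier to look up.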

### Program

In [66]:
biography = {
    "nationality": "P27",
    "gender":"P21",
#     "image": "P18",
    "residence":"P551",
    "place_of_birth":"P19",
    "date_of_birth":"P569",
    "profession": "P106",
    "notable_works": "P800",
    "education": "P69",
    "positions":"P39",
    "awards": "P166",
    "spouse": "P26",
}

translation = {
    "name":"नाम",
    "description":"विवरण",
    "image": "चित्र",
    "gender":"लिंग",
    "residence":"निवास",
    "place_of_birth":"जन्म स्थान",
    "date_of_birth":"जन्म की तारीख",
    "profession": "व्यवसाय",
    "notable_works": "उल्लेखनीय कार्य",
    "education": "शिक्षा",
    "positions": "पद",
    "awards": "पुरस्कार",
    "spouse": "पति या पत्नी",
    "other_available_information":"अन्य उपलब्ध जानकारी",
    "main_info": "मुख्य जानकारी",
    "nationality": "राष्ट्रीयता"
}

In [68]:
#calls qwikidata get entity function for an entity id
def getEntityInfo(eid):
    return get_entity_dict_from_api(eid)

#extract the name of the entity or the property value in native language
def extractName(info):
    return info.get('labels', {}).get('hi', {}).get('value', "")

#extract the description of the entity or the property value in native language
def extractDescription(info):
    return info.get('descriptions', {}).get('hi', {}).get('value', "")

#print the name and the description
def printNameAndDescription(info, trans):
    name = extractName(info)
    if name != "":
        print(trans['name'] + ":", name)
    desc = extractDescription(info)
    if desc != "":
        print(trans['description'] + ":", desc)
        
#print the information present in the entity itself
def printOtherInfo(entity, bio):
    #get the claims subdict
#     print(entity['claims'])
    for p in entity['claims'].keys():
        if p not in bio.values():
            #get information on the property in Hindi
            ent_info = getEntityInfo(p)
            name, desc = extractName(ent_info), extractDescription(ent_info)
            if name == "":
                continue
            value = ""
            #for every property in the claims subdict get information on the corresponding values
            for data in entity.get('claims', {}).get(p, []):
                res = data.get('mainsnak',{}).get('datavalue', {}).get('value', {})
                if type(res) == dict:
                    info_id = res.get('id', "")
                    if info_id == "":
                        continue
                    info = getEntityInfo(info_id)
                    pname, pdesc = extractName(info), extractDescription(info)
                    if pname == "":
                        continue
                    if pdesc != "":
                        pname += str(f'({pdesc})')
                    value += pname + ","
            #only print property name and value if the value is present in native language
            if value != "":
                print(name + ": " + value)
    
def getBiography(wd, bio, trans):
    #get entity from api
    entity_info = getEntityInfo(wd)
    #print name and description
    printNameAndDescription(entity_info, trans)
    #explicitly query using sparql to get main biography data
    print("----------------------",trans['main_info'],"----------------------")
    for key, wdt in bio.items():
        sparql_query = f"SELECT ?entity ?entityLabel ?entityDescription WHERE {{ wd:{wd} wdt:{wdt} ?entity; SERVICE wikibase:label {{ bd:serviceParam wikibase:language \"hi\". }} }}"
        s, v = trans[key] + ": ", ""
        res = return_sparql_query_results(sparql_query)
        for binding in res['results']['bindings']:
            value = binding.get('entityLabel', {}).get('value', "")
            #skip empty labels and labels that fall back to a raw Q-id when no Hindi label exists
            if value != '' and 'Q' not in value:
                v += value + ","
        if v != "":
            print(s + v)
    #query and print the rest of the data already present in the entity
    print("----------------------",trans['other_available_information'],"----------------------")
    printOtherInfo(entity_info, bio)
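
`printOtherInfo` calls `getEntityInfo` once for every property and every value, so the same identifier may be fetched repeatedly across runs. One possible way to reduce the number of requests is to cache the lookups. The sketch below is an assumption rather than part of the original program: `make_cached_fetcher` is a hypothetical helper, and `fake_fetch` stands in for `get_entity_dict_from_api` so that the example runs offline.

```python
from functools import lru_cache

def make_cached_fetcher(fetch, maxsize=1024):
    """Wrap a one-argument fetch function so each id is fetched at most once."""
    @lru_cache(maxsize=maxsize)
    def cached(eid):
        return fetch(eid)
    return cached

# Stand-in for get_entity_dict_from_api; it records which ids were requested.
calls = []
def fake_fetch(eid):
    calls.append(eid)
    return {"id": eid}

get_entity_cached = make_cached_fetcher(fake_fetch)
for eid in ["P31", "Q5", "P31", "Q5"]:
    get_entity_cached(eid)

print(calls)  # only the first request for each id reaches fake_fetch
```

In the notebook, this would amount to passing `get_entity_dict_from_api` to `make_cached_fetcher` and returning the cached result from `getEntityInfo`. The cached dict is shared between callers, so it should be treated as read-only.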

### Examples

The user supplies the ID of the person whose biography should be generated in their language. The entity ID is a string of the form Q[NUMBER]. This unique ID can be found through the Wikidata search page. The following are some example biographies of people for whom a Hindi biography may not exist.
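
Since the ID is typed in by hand, a small check before calling the API can catch obvious mistakes. The following is a minimal sketch; the helper name `is_item_id` is hypothetical, and the pattern assumes that item IDs are the letter Q followed by a positive integer without leading zeros.

```python
import re

# "Q" followed by a positive integer with no leading zero (assumed format).
_ITEM_ID = re.compile(r"Q[1-9][0-9]*")

def is_item_id(text):
    """Return True if `text` looks like a Wikidata item id such as 'Q315577'."""
    return bool(_ITEM_ID.fullmatch(text))

for candidate in ["Q315577", "P27", "q42", "Q"]:
    print(candidate, is_item_id(candidate))
```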

In [70]:
#information about Japanese game designer Hideo Kojima
getBiography('Q315577', biography, translation)

---------------------- मुख्य जानकारी ----------------------
रा: जापान,
लिंग: पुरुष,
जन्म स्थान: सेतागया-कू,
जन्म की तारीख: 1963-08-24T00:00:00Z,
व्यवसाय: पटकथा लेखक,कंप्यूटर वैज्ञानिक,
---------------------- अन्य उपलब्ध जानकारी ----------------------
का उदहारण है: मनुष्य(होमो-सैपीयंज़ स्तनपायी जो दो पैर पर चलता है),



In [71]:
#information about Narendra Modi, Prime Minister of India
getBiography('Q1058', biography, translation)

नाम: नरेन्द्र मोदी
विवरण: भारत के प्रधानमंत्री
---------------------- मुख्य जानकारी ----------------------
रा: भारत,
लिंग: पुरुष,
निवास: ७, लोक कल्याण मार्ग,
जन्म स्थान: वड़नगर,
जन्म की तारीख: 1950-09-17T00:00:00Z,
व्यवसाय: राजनीतिज्ञ,
शिक्षा: गुजरात विश्वविद्यालय,दिल्ली विश्वविद्यालय,
पद: भारत का प्रधानमन्त्री,जनशिकायत मंत्रालय, भारत सरकार,गुजरात विधान सभा के सदस्य,
पुरस्कार: सीएनएन-आईबीएन इंडियन ऑफ़ द इयर,


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
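
The `JSONDecodeError` above is probably raised when a request returns a body that is not JSON, which can happen when a request is throttled or times out; this is an assumption, as the full traceback is not shown. A simple mitigation is to retry the failing call after a short pause. The sketch below uses a hypothetical helper, `call_with_retries`, and a fake `flaky` function in place of the real query so that it runs offline.

```python
import json
import time

def call_with_retries(func, *args, retries=3, delay=1.0):
    """Call func(*args), retrying when the response cannot be parsed as JSON."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return func(*args)
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            last_error = err
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise last_error

# Fake query that fails twice before succeeding, standing in for the real call.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise json.JSONDecodeError("Expecting value", "", 0)
    return {"results": {"bindings": []}}

result = call_with_retries(flaky, retries=3, delay=0)
print(result, attempts["n"])
```

In `getBiography`, the call to `return_sparql_query_results` could be routed through such a wrapper, with a non-zero `delay` to give the service time to recover.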

In [6]:
#information about Gary Kildall inventor of CP/M Operating system
getBiography('Q92627', biography, translation)

---------------------- मुख्य जानकारी ----------------------
लिंग: पुरुष,
जन्म स्थान: सीऐटल,
जन्म की तारीख: 1942-05-19T00:00:00Z,
व्यवसाय: व्यापारी,कंप्यूटर वैज्ञानिक,प्रोग्रामर,
---------------------- अन्य उपलब्ध जानकारी ----------------------
नागरिकता: संयुक्त राज्य अमेरिका(उत्तर अमेरिका में एक संघीय गणतन्त्र),
उदहारण है: मनुष्य(होमो-सैपीयंज़ स्तनपायी जो दो पैर पर चलता है),
मृत्यु का स्थान: मॉन्टेरी,
मौत का कारण: दुर्घटना,
कार्य क्षेत्र: संगणक विज्ञान,
बोली या लेखी भाषा: अंग्रेज़ी भाषा(भाषा),


In [7]:
#information about Zuzana Čaputová, president of Slovakia
getBiography('Q26273268', biography, translation)

नाम: ज़ुज़ाना कैपुटोवा
---------------------- मुख्य जानकारी ----------------------
लिंग: महिला,
जन्म स्थान: ब्रातिस्लावा,
जन्म की तारीख: 1973-06-21T00:00:00Z,
व्यवसाय: वक़ील,राजनीतिज्ञ,विधिवेत्ता,पर्यावरणविद्,
पद: स्लोवाकिया के राष्ट्रपति,
---------------------- अन्य उपलब्ध जानकारी ----------------------
उदहारण है: मनुष्य(होमो-सैपीयंज़ स्तनपायी जो दो पैर पर चलता है),
नागरिकता: स्लोवाकिया,चेकोस्लोवाकिया,
बोली या लेखी भाषा: स्लोवाक भाषा,रूसी भाषा,अंग्रेज़ी भाषा(भाषा),


Even though the information retrieved may seem sparse, it is close to exhaustive, because much of the linked property information for these people is not available in Hindi. This is mostly because none of the three is a household name in India, so little information about their work has been translated into Hindi. As a comparison, let's look at the information retrieved for the famed Hindi novelist 'Munshi Premchand'.

In [8]:
getBiography('Q174152', biography, translation)

नाम: प्रेमचंद
विवरण: भारतीय हिंदी के लेखक
---------------------- मुख्य जानकारी ----------------------
लिंग: पुरुष,
जन्म स्थान: वाराणसी,
जन्म की तारीख: 1880-07-31T00:00:00Z,
व्यवसाय: पटकथा लेखक,लेखक,उपन्यासकार,
उल्लेखनीय कार्य: गोदान,सेवासदन,कर्मभूमि,शतरंज के खिलाड़ी,
---------------------- अन्य उपलब्ध जानकारी ----------------------
मृत्यु का स्थान: वाराणसी(विश्व के सर्वाधिक पौराणिक नगरों में से एक जिसे बनारस नाम से भी जाना जाता है),
नागरिकता: ब्रिटिश राज,
उदहारण है: मनुष्य(होमो-सैपीयंज़ स्तनपायी जो दो पैर पर चलता है),
संतान: अमृत राय(हिन्दी लेखक),
बोली या लेखी भाषा: हिन्दुस्तानी भाषा,हिन्दी(भारतीय भाषा),
धर्म: हिन्दू धर्म(धार्मिक धर्म),


In [9]:
getBiography('Q1797306', biography, translation)

नाम: नैनीताल जिला
विवरण: उत्तराखण्ड का जिला
---------------------- मुख्य जानकारी ----------------------
---------------------- अन्य उपलब्ध जानकारी ----------------------
उदहारण है: भारत के ज़िला,
प्रशासनिक इकाई में है: कुमाऊँ मण्डल,उत्तर प्रदेश(भारत का सर्वाधिक जनसंख्या वाला  राज्य),संयुक्त प्रांत(उत्तर प्रदेश का भूतपूर्व नाम),संयुक्त प्रान्त आगरा व अवध,
देश: भारत(विश्व का सबसे बड़ा संघीय गणतन्त्र),
विषय की मुख्य श्रेणी: श्रेणी:नैनीताल जिला(विकिमीडिया श्रेणी),
सीमा लगती है: पौड़ी गढ़वाल ज़िला(उत्तराखण्ड का जिला),उधम सिंह नगर जिला(उत्तराखण्ड का जिला),
राजधानी: नैनीताल,
