# Getting Data from Wikipedia

[Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page), Wikipedia's sister project for structured data, provides a powerful API for fetching data out of the world's largest encyclopedia. There's a [web interface](https://query.wikidata.org/) for querying it in [SPARQL syntax](https://www.w3.org/TR/rdf-sparql-query/).

I'm using two Python 3 libraries:

- [qwikidata](https://pypi.org/project/qwikidata/) 
- [wikidata](https://pypi.org/project/Wikidata/)


Every item in Wikidata is an entity with labels and properties. In step 1, I have to get the entity ID for each city.

In [6]:
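from qwikidata.sparql import return_sparql_query_results


def get_entity_id_for_city_by_name(city_name):
    """
    Look up the Wikidata entity ID of a city by (part of) its English label.

    :param city_name: name of the city, e.g. "Venice"
    :return: string Wikidata entity_id, e.g. "Q641"
    """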
    sparql_query = """
    SELECT ?item ?itemLabel
    WHERE {{ 
        ?item wdt:P31/wdt:P279* wd:Q515 .
      ?item rdfs:label ?itemLabel. 
      FILTER(CONTAINS(LCASE(?itemLabel), "{}"@en)). 
    }} limit 1
    
    """.format(city_name.lower())
    res = return_sparql_query_results(sparql_query)
    url = res.get("results").get("bindings")[0].get('item').get('value')

    return url.replace('http://www.wikidata.org/entity/', '')


city = "Venice"
entity_id = get_entity_id_for_city_by_name(city_name=city)
print("Entity Id for", city, "is", entity_id)

Entity Id for Venice is Q641
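
The query above filters with `CONTAINS(LCASE(...))`, which has to be evaluated against every candidate label and may therefore be slow. As a minimal sketch of an alternative (not the approach used in this notebook), the query string can be built with an exact, language-tagged label match. The helper names `build_city_query` and `entity_id_from_uri` are hypothetical, and an exact match is case-sensitive, so `"venice"` would not find `"Venice"`.

```python
def build_city_query(city_name: str, lang: str = "en") -> str:
    """Build a SPARQL query that matches a city by its exact label (hypothetical helper)."""
    # Escape backslashes and double quotes so the name cannot break out of the string literal
    escaped = city_name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
    SELECT ?item
    WHERE {{
      ?item rdfs:label "{escaped}"@{lang} .
      ?item wdt:P31/wdt:P279* wd:Q515 .
    }} LIMIT 1
    """


def entity_id_from_uri(uri: str) -> str:
    """Return the trailing path segment of an entity URI, e.g. 'Q641'."""
    return uri.rsplit("/", 1)[-1]


query = build_city_query("Venice")
entity_id_example = entity_id_from_uri("http://www.wikidata.org/entity/Q641")
```

The resulting `query` string could be passed to `return_sparql_query_results` in the same way as the original query.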


Wikidata returns a lot of interesting information about Venice. Visit [https://www.wikidata.org/wiki/Q641](https://www.wikidata.org/wiki/Q641) to get a feel for the data.

Unfortunately, the API is not very reliable: instead of results it sometimes returns "Bad Gateway". I'm using cached data to ensure reliable results in my notebook.
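
The caching code itself is not shown in this notebook, so the following is only a sketch of how such a cache could look, assuming one JSON file per entity ID. The function name `cached_entity` and the `fetch` parameter are assumptions made for illustration; in the notebook, `fetch` would be something like `get_entity_dict_from_api`.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_entity(entity_id: str, fetch: Callable[[str], dict], cache_dir: Path) -> dict:
    """Return the entity dict from a JSON file cache, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{entity_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    # If fetch raises (e.g. on "Bad Gateway"), nothing is written to the cache
    entity = fetch(entity_id)
    cache_file.write_text(json.dumps(entity), encoding="utf-8")
    return entity


# Stand-in for the real API call, so the sketch runs without network access
calls = []


def fake_fetch(entity_id: str) -> dict:
    calls.append(entity_id)
    return {"id": entity_id, "labels": {"en": {"value": "Venice"}}}


cache_dir = Path(tempfile.mkdtemp())
first = cached_entity("Q641", fake_fetch, cache_dir)
second = cached_entity("Q641", fake_fetch, cache_dir)  # served from disk
```

After the second call, `calls` still contains a single entry, because the repeated lookup never reaches the fetcher.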

I wrote a class that collects methods to retrieve all the information relevant to my capstone project: **Name of the City**, **Population**, **Area**, **Population Density**, **Coordinates**, and **Boroughs**.


In [7]:
from qwikidata.linked_data_interface import get_entity_dict_from_api
import pandas as pd
from wikidata.client import Client


class WikiDataWrapper:
    def __init__(self, entity_id):
        self.entity_id = entity_id
        self.entity = get_entity_dict_from_api(entity_id)

    def get_name(self):
        return self.entity.get('labels').get("en").get("value")

    def get_image_from_entity_dict(self):
        """
        Returns url to the full image path
        @see https://stackoverflow.com/a/34402875
        :return:
        """
        client = Client()
        entity = client.get(self.entity_id, load=True)

        image_prop = client.get("P18")
        image = entity[image_prop]
        return image.image_url

    def get_coordinate_location(self):
        """
        :return: lat, lon
        """
        coordinates = self.entity.get("claims").get("P625")[0].get('mainsnak').get('datavalue').get('value')
        return coordinates.get('latitude'), coordinates.get('longitude')

    def get_population(self):
        """
        https://www.wikidata.org/wiki/Property:P1082 population
        :return:
        """
        # [-1] picks the last listed statement (assumed, not guaranteed, to be the most recent)
        population = self.entity.get("claims").get("P1082")[-1].get('mainsnak').get('datavalue').get('value').get(
            "amount")
        return int(population)

    def get_area(self):
        """
        https://www.wikidata.org/wiki/Property:P2046 area, e.g. 415.9 km^2 for Venice
        :return: area as float; the unit of the statement is not checked
        """
        area = self.entity.get("claims").get("P2046")[-1].get('mainsnak').get('datavalue').get('value').get(
            "amount")
        return float(area)

    def get_population_density(self):
        return self.get_population() / self.get_area()

    def get_boroughs(self):
        """

        The historic centre of Venice (Q641) has 6 boroughs (sestieri):
            - Cannaregio (including San Michele),
            - San Polo,
            - Dorsoduro (including Giudecca and Sacca Fisola),
            - Santa Croce,
            - San Marco (including San Giorgio Maggiore) and
            - Castello (including San Pietro di Castello and Sant'Elena).
        Note: for Q641, P150 returns six municipalities (Municipalità), not these sestieri.
        https://www.wikidata.org/wiki/Property:P150 contains administrative territorial entity
        :return:
        """
        claims = self.entity.get("claims").get("P150")
        # @Todo Q_TRIER = "Q3138" has NO P150
        if claims is None:
            print(self.entity_id, "has no P150")
            return []

        # Collect the entity IDs of all contained administrative entities
        borough_ids = [
            item.get("mainsnak").get("datavalue").get("value").get("id")
            for item in claims
        ]

        boroughs = []
        for borough_id in borough_ids:
            entity = get_entity_dict_from_api(borough_id)
            labels = entity.get("labels")
            english_label = labels.get("en")
            if english_label is None:
                # No English label: fall back to the first available language
                key = next(iter(labels))
                borough_name = labels.get(key).get("value")
            else:
                borough_name = english_label.get("value")
            coordinates = entity.get("claims").get("P625")[0].get("mainsnak").get("datavalue").get("value")
            lat, lon = coordinates.get("latitude"), coordinates.get("longitude")

            boroughs.append({"Name": borough_name, "Lat": lat, "Lon": lon})

        return boroughs

    def get_series_for_data_frame(self):
        lat, lon = self.get_coordinate_location()
        return {'Name': self.get_name(),
                "EntityID": self.entity_id,
                'Population': self.get_population(),
                'Area': self.get_area(),
                'PopulationDensity': self.get_population_density(),
                'Lat': lat,
                'Lon': lon,
                'Image': self.get_image_from_entity_dict(),
                "Bouroughs": self.get_boroughs()
                }
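
Every getter in the class walks the same nested path: `claims`, then the property ID, then a list of statements, then `mainsnak`, `datavalue` and `value`. As a hedged sketch, that path could be factored into one helper that returns `None` instead of raising when a level is missing. The name `claim_value` is hypothetical, and `sample_entity` is a hand-made dict that only mimics the shape accessed above, using the coordinates shown for Venice in this notebook.

```python
from typing import Any, Optional


def claim_value(entity: dict, property_id: str, index: int = 0) -> Optional[Any]:
    """Return the value of one statement for a property, or None if it is missing."""
    statements = entity.get("claims", {}).get(property_id) or []
    try:
        statement = statements[index]
    except IndexError:
        # Property absent, or fewer statements than requested
        return None
    # Statements without a concrete value carry no "datavalue" key
    return statement.get("mainsnak", {}).get("datavalue", {}).get("value")


# Hand-made sample in the shape used by the class above (not a real API response)
sample_entity = {
    "claims": {
        "P625": [
            {"mainsnak": {"datavalue": {"value": {"latitude": 45.439722, "longitude": 12.331944}}}}
        ]
    }
}

coords = claim_value(sample_entity, "P625")
missing = claim_value(sample_entity, "P150")  # no such property in the sample
```

With a helper like this, a city without `P150` (the Trier case noted in the `@Todo`) and a borough without coordinates would be handled the same way.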


### Now we can extract some interesting information with basic Python 3 and present it

In [68]:
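wrapper = WikiDataWrapper(entity_id)
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of row dicts
# (column order and integer dtypes may differ from the output recorded with an older pandas)
df_cities = pd.DataFrame([wrapper.get_series_for_data_frame()])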

df_cities.set_index("EntityID",inplace=True)
df_cities.head()


| EntityID | Area | Bouroughs | Image | Lat | Lon | Name | Population | PopulationDensity |
|---|---|---|---|---|---|---|---|---|
| Q641 | 415.9 | [{'Name': 'Municipality 1 Venezia-Murano-Buran... | https://upload.wikimedia.org/wikipedia/commons... | 45.439722 | 12.331944 | Venice | 261321.0 | 628.326521 |


In [76]:
from IPython.display import Image, display
print(df_cities.loc['Q641']['Name'])
print("Population of the City", df_cities.loc['Q641']['Population'])
print("Area", df_cities.loc['Q641']['Area'], "km^2")
print("Population Density", round(df_cities.loc['Q641']['PopulationDensity'],2) ,  "Humans / km^2 " )

display(Image(url= df_cities.loc['Q641']['Image']))

Venice
Population of the City 261321.0
Area 415.9 km^2
Population Density 628.33 Humans / km^2 


# Let's identify the Boroughs of Venice

In [109]:
for b in df_cities.loc['Q641']['Bouroughs']:
    print(b['Name'])

Municipality 1 Venezia-Murano-Burano
Municipality Lido-PellestrinaLido-Pellestrina
Municipalità di Chirignago-Zelarino
Municipalità di Favaro Veneto
Municipalità di Marghera
Municipalità di Mestre-Carpenedo


In [112]:
import folium # map rendering library

# Named venice_map to avoid shadowing the built-in map()
venice_map = folium.Map(location=[df_cities.loc['Q641']['Lat'], df_cities.loc['Q641']['Lon']], zoom_start=12)

# Add one circle marker per borough, labelled with its name
for b in df_cities.loc['Q641']['Bouroughs']:
    label = folium.Popup(b['Name'], parse_html=True)
    folium.CircleMarker(
        [b.get("Lat"), b.get("Lon")],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(venice_map)
venice_map