# Getting Data from Wikipedia

Wikipedia provides an powerful API called [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) for fetching data out of the worlds largest Encyclopedia. There's a [Webinterface](https://query.wikidata.org/) for Querying the API in [SPARQL Syntax](https://www.w3.org/TR/rdf-sparql-query/)

I'm using 2 Python3 Libraries 

- [qwikidata](https://pypi.org/project/qwikidata/) 
- [wikidata](https://pypi.org/project/Wikidata/)


Every item in the Wikipdia is an Entity with labels + properties. In step 1 I have to get Entity Id's for each City 

In [58]:
from qwikidata.sparql import return_sparql_query_results

def get_entity_id_for_city_by_name(city_name):
    """
    
    :param city_name: 
    :return: string Wikidata entity_id 
    """
    sparql_query = """
    SELECT ?item ?itemLabel
    WHERE {{ 
        ?item wdt:P31/wdt:P279* wd:Q515 .
      ?item rdfs:label ?itemLabel. 
      FILTER(CONTAINS(LCASE(?itemLabel), "{}"@en)). 
    }} limit 1
    
    """.format(city_name.lower())
    res = return_sparql_query_results(sparql_query)
    url = res.get("results").get("bindings")[0].get('item').get('value')

    return url.replace('http://www.wikidata.org/entity/', '')


city = "Venice"
entity_id = get_entity_id_for_city_by_name(city_name=city)
print("Entity Id for", city, "is", entity_id)

Entity Id for Venice is Q641


Wikidata returns a bunch of interesting informations about Venice - please visit the url [https://www.wikidata.org/wiki/Q641](https://www.wikidata.org/wiki/Q641) to get an intuition about the datasets.

The API is unfortunately not very reliable - Instead of results it sometimes returns "Bad Gateway". I'm using cached data to ensure reliable results in my notebook.

I wrote a Class that collects methods to retrieve all relevant informations like **Name of the City**, **Population**, **Area**, **Population Density**, **Coordinates**, **Bouroughs** for my capstone project.


In [59]:
from qwikidata.linked_data_interface import get_entity_dict_from_api
import pandas as pd
from wikidata.client import Client
class WikiDataWrapper:
    def __init__(self, entity_id):
        self.entity_id = entity_id
        self.entity = get_entity_dict_from_api(entity_id)

    def get_name(self):
        return self.entity.get('labels').get("en").get("value")

    def get_image_from_entity_dict(self):
        """
        Returns url to the full image path
        @see https://stackoverflow.com/a/34402875
        :return:
        """
        client = Client()
        entity = client.get(self.entity_id, load=True)

        image_prop = client.get("P18")
        image = entity[image_prop]
        return image.image_url

    def get_coordinate_location(self):
        """
        :return: lat, lon
        """
        property = self.entity.get("claims").get("P625")[0].get('mainsnak').get('datavalue').get('value')
        return property.get('latitude'), property.get('longitude')

    def get_population(self):
        """
        https://www.wikidata.org/wiki/Property:P1082 population
        :return:
        """
        property = self.entity.get("claims").get("P1082")[-1].get('mainsnak').get('datavalue').get('value').get(
            "amount")
        return int(property)

    def get_area(self):
        """
        https://www.wikidata.org/wiki/Property:P2046 area 415,9 m^2
        :return:
        """
        property = self.entity.get("claims").get("P2046")[-1].get('mainsnak').get('datavalue').get('value').get(
            "amount")
        return float(property)

    def get_population_density(self):
        return self.get_population() / self.get_area()

    def get_boroughs(self):
        """

        Venice Q641 has  6 boroughs
            - Cannaregio (including San Michele),
            - San Polo,
            - Dorsoduro (including Giudecca and Sacca Fisola),
            - Santa Croce,
            - San Marco (including San Giorgio Maggiore) and
            - Castello (including San Pietro di Castello and Sant'Elena).
         https://www.wikidata.org/wiki/Property:P150        contains administrative territorial entity
        :return:
        """
        bourough_ids = []
        bouroughs = []

        property = self.entity.get("claims").get("P150")
        # @Todo Q_TRIER = "Q3138" has NO P150
        if property is None:

            print(self.entity_id, "has no P150")
            lat, lon = self.get_coordinate_location()
            bouroughs.append({"Name": self.get_name(), "Lat": lat, "Lon": lon})
            return bouroughs


        for item in property:
            entity = item.get("mainsnak").get("datavalue").get('value').get('id')
            bourough_ids.append(entity)

        for entity_id in bourough_ids:
            entity = get_entity_dict_from_api(entity_id)
            english_label = entity.get('labels').get("en")
            if None is english_label:
                key = next(iter(entity.get('labels')))
                bourough_name = entity.get('labels').get(key).get("value")
            else:
                bourough_name = entity.get('labels').get("en").get("value")
            property = entity.get("claims").get("P625")[0].get('mainsnak').get('datavalue').get('value')
            lat, lon = property.get('latitude'), property.get('longitude')

            bouroughs.append({"Name": bourough_name, "Lat": lat, "Lon": lon})

        return bouroughs

    def get_series_for_data_frame(self):
        lat, lon = self.get_coordinate_location()
        return {'Name': self.get_name(),
                "EntityID": self.entity_id,
                'Population': self.get_population(),
                'Area': self.get_area(),
                'PopulationDensity': self.get_population_density(),
                'Lat': lat,
                'Lon': lon,
                'Image': self.get_image_from_entity_dict(),
                "Bouroughs": self.get_boroughs()
                }



### Now we are able to extract some interesting informations with basic python3 and present them

In [60]:
wrapper = WikiDataWrapper(entity_id)
df_cities = pd.DataFrame()
df_cities = df_cities.append(wrapper.get_series_for_data_frame(), ignore_index=True)

df_cities.set_index("EntityID",inplace=True)
df_cities.head()

Unnamed: 0_level_0,Area,Bouroughs,Image,Lat,Lon,Name,Population,PopulationDensity
EntityID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Q641,415.9,[{'Name': 'Municipality 1 Venezia-Murano-Buran...,https://upload.wikimedia.org/wikipedia/commons...,45.439722,12.331944,Venice,261321.0,628.326521


In [61]:
from IPython.display import Image
from IPython.core.display import HTML
print(df_cities.loc['Q641']['Name'])
print("Population of the City", df_cities.loc['Q641']['Population'])
print("Area", df_cities.loc['Q641']['Area'], "km^2")
print("Population Density", round(df_cities.loc['Q641']['PopulationDensity'],2) ,  "Humans / km^2 " )

display(Image(url= df_cities.loc['Q641']['Image']))

Venice
Population of the City 261321.0
Area 415.9 km^2
Population Density 628.33 Humans / km^2 


# Let's identify the Bouroughs of Venice

In [62]:
for b in  (df_cities.loc['Q641']['Bouroughs']):
    print(b['Name'])

Municipality 1 Venezia-Murano-Burano
Municipality Lido-PellestrinaLido-Pellestrina
Municipalità di Chirignago-Zelarino
Municipalità di Favaro Veneto
Municipalità di Marghera
Municipalità di Mestre-Carpenedo


In [63]:
import folium # map rendering library

map = folium.Map(location=[df_cities.loc['Q641']['Lat'], df_cities.loc['Q641']['Lon']], zoom_start=12)

for b in  (df_cities.loc['Q641']['Bouroughs']):
    label = '{}'.format(b['Name'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [b.get("Lat"), b.get("Lon")],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map)     
map

## Fetching  Foursquare Data for Venice

*Note: I'm using a simple filesystem cache and Foursquare API credentials are not part of this Repository. If you want to run the code you have to copy the file* **credentials.py.dist** *to credentials.py and enter your credentials*




In [64]:
from credentials import CLIENT_ID, CLIENT_SECRET
import os.path as path
import requests
import json
import pandas as pd
from IPython.display import FileLink, FileLinks

VERSION = '20180605'  # Foursquare API version


def foursquare_explore_venues(lat, lon, radius=2000, limit=5000):
    cache_key = "venues-explore_lat={}-lon={}-radius={}-limit={}".format(lat, lon, radius, limit)
    cache_file_name = "./data_tmp/" + cache_key + ".json"
    if path.exists(cache_file_name):
        with open(cache_file_name, 'r') as f:
            return json.load(f)

    else:
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lon,
            radius,
            limit)
        print("cache miss for ", cache_key)
        r = requests.get(url)
        with open(cache_file_name, "wb") as f:
            f.write(r.content)

        with open(cache_file_name, 'r') as f:
            return json.load(f)



bouroughs = [
    {'Name': 'Municipality 1 Venezia-Murano-Burano', 'Lat': 45.436944444444, 'Lon': 12.345833333333},
    {'Name': 'Municipality Lido-PellestrinaLido-Pellestrina', 'Lat': 45.41323, 'Lon': 12.36713},
    {'Name': 'Municipalità di Chirignago-Zelarino', 'Lat': 45.484444, 'Lon': 12.188611},
    {'Name': 'Municipalità di Favaro Veneto', 'Lat': 45.504444, 'Lon': 12.281944},
    {'Name': 'Municipalità di Marghera', 'Lat': 45.475833, 'Lon': 12.224722},
    {'Name': 'Municipalità di Mestre-Carpenedo', 'Lat': 45.493889, 'Lon': 12.241389},
]

venues_list = []
for b in bouroughs:

    results= foursquare_explore_venues(b['Lat'], b['Lon'])["response"]['groups'][0]['items']



    venues_list.append([(
        b['Name'],
        b['Lat'],
        b['Lon'],
        v['venue']['name'],
        v['venue']['location']['lat'],
        v['venue']['location']['lng'],
        v['venue']['categories'][0]['name']) for v in results])

nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = ['Neighborhood',
                         'Neighborhood Latitude',
                         'Neighborhood Longitude',
                         'Venue',
                         'Venue Latitude',
                         'Venue Longitude',
                         'Venue Category']

from IPython.display import FileLink, FileLinks

nearby_venues.to_excel('data-venice.xlsx', index=False)

FileLink('data-venice.xlsx')



# Categorize Venues
I want to label Venues by relevant and not relevant for tourits.
'Furniture / Home Store' or  'Department Store' are not relevant for tourists but a Restaurant or Gift Shop is. 



In [65]:
# lets take a look at the unique Venue categories
nearby_venues['Venue Category'].unique()



array(['Ice Cream Shop', 'Used Bookstore', 'Park', 'Italian Restaurant',
       'Wine Shop', 'Bakery', 'Hotel', 'Plaza', 'Museum',
       'History Museum', 'Historic Site', 'Deli / Bodega', 'Public Art',
       'Gourmet Shop', 'Church', 'Chocolate Shop', 'Bridge',
       'Scenic Lookout', 'Arts & Crafts Store', 'Department Store',
       'Pizza Place', 'Wine Bar', 'Furniture / Home Store', 'Canal',
       'Outdoors & Recreation', 'Café', 'Boutique', 'Theater',
       'Opera House', 'Art Gallery', 'Bar', 'American Restaurant',
       'Sandwich Place', 'Veneto Restaurant', 'Gastropub', 'Bay',
       'Breakfast Spot', 'Snack Place', 'Food', 'Resort', 'Beach',
       'Seafood Restaurant', 'Movie Theater', 'Sculpture Garden',
       'Multiplex', 'Airport', 'Lounge', 'Cocktail Bar', 'Event Space',
       'Grocery Store', 'Indian Restaurant', 'Boat or Ferry',
       'Harbor / Marina', 'Pub', 'Chinese Restaurant', 'Restaurant',
       'Toll Plaza', 'Bed & Breakfast', 'Gym', 'Farm',
       'Col

In [66]:
# We define a list of relevant categories and label the data

relevant = [
    'Ice Cream Shop'
    #   'Used Bookstore',
    'Italian Restaurant'
    'Park',
    'Bakery'
    'Hotel'
    'Wine Shop'
    'Museum'
    'Plaza',
    'Arts & Crafts Store'
    'Historic Site'
    'Church'
    'Deli / Bodega',
    'History Museum'
    'Bridge'
    'Public Art'
    'Gourmet Shop',
    'Chocolate Shop'
    'Snack Place'
    'Scenic Lookout'
    'Pizza Place',
    'Winery'
    #  'Furniture / Home Store'
    #   'Department Store'
    'Café',
    'Wine Bar'
    'Seafood Restaurant'
    'Brewery'
    'Concert Hall'
    'Bar',
    #   'Outdoors & Recreation'
    'Canal'
    'Gift Shop'
    'Pub',
    'Breakfast Spot'
    'Gastropub'
    'Resort'
    'Food'
    'Movie Theater',
    'Beach'
    'Multiplex'
    #  'Grocery Store'
    'Lounge',
    'Indian Restaurant'
    'Restaurant'
    #    'Convenience Store',
    'Soccer Stadium'
    #    'Basketball Court'
    'Cocktail Bar'
    'Gym',
    'Sandwich Place'
    'Jazz Club'
    'Food & Drink Shop'
    'Music Venue',
    # 'Supermarket'
    'Hotel Bar'
    'Juice Bar'
    #   'Electronics Store',
    'Hostel'
    'Bistro'
    'Platform'
    'Shop & Service'
    #   'Stadium',
    'Bookstore'
    'Pastry Shop'
    'Noodle House'
    'Theater',
    #  'Gym / Fitness Center'
    'Art Gallery'
    #   'Comic Shop',
    'Chinese Restaurant'
    'Cupcake Shop'
    'Asian Restaurant',
    'Steakhouse'
    'Mediterranean Restaurant'
    'Pool'
    #   'Clothing Store',
    'Miscellaneous Shop'
    'Dessert Shop'
    'Mobile Phone Shop',
    'Fountain'
    'Smoke Shop'
]

def label_relevant_categories(row, relevant_categories):
    row['is_touristic'] = False
    if row['Venue Category'] in str(relevant_categories):
        row['is_touristic'] = True
    return row


nearby_venues = nearby_venues.apply(lambda x: label_relevant_categories(x, relevant), axis=1, )
value_counts= nearby_venues['is_touristic'].value_counts()
print(value_counts)

# Assign the labels to the cities df
df_cities.at["Q641", "non_touristic"] = value_counts[0]
df_cities.at["Q641", "touristic"] = value_counts[1]

df_cities.head()

df_cities_venice = df_cities



True     353
False     70
Name: is_touristic, dtype: int64


In [67]:
nearby_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,is_touristic
0,Municipality 1 Venezia-Murano-Burano,45.436944,12.345833,La Mela Verde,45.435710,12.344047,Ice Cream Shop,True
1,Municipality 1 Venezia-Murano-Burano,45.436944,12.345833,Libreria Acqua Alta,45.437883,12.342292,Used Bookstore,False
2,Municipality 1 Venezia-Murano-Burano,45.436944,12.345833,St. Michael Square,45.433716,12.346318,Park,True
3,Municipality 1 Venezia-Murano-Burano,45.436944,12.345833,Taverna Scalinetto,45.434404,12.346473,Italian Restaurant,True
4,Municipality 1 Venezia-Murano-Burano,45.436944,12.345833,Covino,45.434570,12.347570,Italian Restaurant,True
...,...,...,...,...,...,...,...,...
418,Municipalità di Mestre-Carpenedo,45.493889,12.241389,Supermercato Alì,45.484662,12.232716,Grocery Store,False
419,Municipalità di Mestre-Carpenedo,45.493889,12.241389,Macao,45.495339,12.243267,Bar,True
420,Municipalità di Mestre-Carpenedo,45.493889,12.241389,Hotel Martello,45.477579,12.230604,Hotel,True
421,Municipalità di Mestre-Carpenedo,45.493889,12.241389,Sai Ke Sushi,45.490692,12.247410,Asian Restaurant,True


In [68]:
import folium # map rendering library

map = folium.Map(location=[df_cities.loc['Q641']['Lat'], df_cities.loc['Q641']['Lon']], zoom_start=12)

for b in  (df_cities.loc['Q641']['Bouroughs']):
    label = '{}'.format(b['Name'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [b.get("Lat"), b.get("Lon")],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map)     
    
for _,  r in nearby_venues.iterrows():
    color = "green" 
    if r['is_touristic'] == True:
        color = "blue"
    
    label = '{} category: {}'.format(r['Venue'], r['Venue Category'])
    folium.CircleMarker(
        [r['Venue Latitude'], r['Venue Longitude']],
        radius=5,
        popup=label,
        color=None,
        fill=True,
        fill_color=color,
        fill_opacity=0.7,
        parse_html=False).add_to(map)    
    
map

# Analysis

there are a lot of problems


The center of tourism in venice is the island in the center of the map. But there's a lot of whitespace. 

This can be caused by:
- Limitations of the Free API plan
- Missing data
- Wrong usage of the API by me
- Not enough data entered by users

The Relation of Touristic to Non-Touristic Venues is a bit shady:

True     353
False     70

the next step is to collect data for further cities and compare the relations

# Get Data for further Cities from Wikidata

In [69]:
import os.path 

cities = ["Q3138", "Q1724",  "Q1726",   "Q1492", "Q727", "Q1218", "Q2103", "Q2066", "Q60", "Q35765",
          "Q23768", "Q641"]

path_name = "./data_tmp/cities_with_wikidata.json"
df_cities = pd.DataFrame()
if not os.path.exists(path_name):
    print("Cache miss")
    for city in cities:
        wrapper = WikiDataWrapper(city)
        df_cities = df_cities.append(wrapper.get_series_for_data_frame(), ignore_index=True)

    df_cities.to_json(path_or_buf=path_name)
with open(path_name, 'r') as f:
    df_cities = pd.read_json(path_name)


df_cities

Unnamed: 0,Area,Bouroughs,EntityID,Image,Lat,Lon,Name,Population,PopulationDensity
0,117.06,"[{'Name': 'Trier', 'Lat': 49.7566666667, 'Lon'...",Q3138,https://upload.wikimedia.org/wikipedia/commons...,49.756667,6.641389,Trier,100338,857.150179
1,167.52,"[{'Name': 'Saarbrücken', 'Lat': 49.2333333333,...",Q1724,https://upload.wikimedia.org/wikipedia/commons...,49.233333,7.0,Saarbrücken,205336,1225.74021
2,310.71,"[{'Name': 'Altstadt-Lehel', 'Lat': 48.1361, 'L...",Q1726,https://upload.wikimedia.org/wikipedia/commons...,48.137194,11.5755,Munich,1314865,4231.807795
3,101.3,"[{'Name': 'Ciutat Vella', 'Lat': 41.3808333333...",Q1492,https://upload.wikimedia.org/wikipedia/commons...,41.3825,2.176944,Barcelona,1636762,16157.57157
4,219.0,"[{'Name': 'Amsterdam', 'Lat': 52.3833333333, '...",Q727,https://upload.wikimedia.org/wikipedia/commons...,52.383333,4.9,Amsterdam,860124,3927.506849
5,125.42,"[{'Name': 'Jerusalem', 'Lat': 31.7833333333, '...",Q1218,https://upload.wikimedia.org/wikipedia/commons...,31.783333,35.216667,Jerusalem,919438,7330.872269
6,145.66,"[{'Name': 'Bochum-Mitte (district)', 'Lat': 51...",Q2103,https://upload.wikimedia.org/wikipedia/commons...,51.483333,7.216667,Bochum,364628,2503.281615
7,210.34,"[{'Name': 'Essen', 'Lat': 51.4508333333, 'Lon'...",Q2066,https://upload.wikimedia.org/wikipedia/commons...,51.450833,7.013056,Essen,677568,3221.298849
8,1214.0,"[{'Name': 'Manhattan', 'Lat': 40.7283333333, '...",Q60,https://upload.wikimedia.org/wikipedia/commons...,40.67,-73.94,New York City,8398748,6918.243822
9,223000000.0,"[{'Name': 'Miyakojima-ku', 'Lat': 34.701277777...",Q35765,https://upload.wikimedia.org/wikipedia/commons...,34.693611,135.501944,Ōsaka,2665314,0.011952


# Get the Touristic / Non Touristic counts

In [70]:

path_name = "./data_tmp/cities_with_touristic_label.json"
if not os.path.exists(path_name):


    for id , city in df_cities.iterrows():
        venues_list = []

        for   b in city['Bouroughs']:
            results = foursquare_explore_venues(b['Lat'], b['Lon'])["response"]['groups'][0]['items']
            venues_list.append([(
                b['Name'],
                b['Lat'],
                b['Lon'],
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                v['venue']['categories'][0]['name']) for v in results])
        nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
        nearby_venues.columns = ['Neighborhood',
                                 'Neighborhood Latitude',
                                 'Neighborhood Longitude',
                                 'Venue',
                                 'Venue Latitude',
                                 'Venue Longitude',
                                 'Venue Category']
        nearby_venues = nearby_venues.apply(lambda x: label_relevant_categories(x, relevant), axis=1, )
        value_counts = nearby_venues['is_touristic'].value_counts()

        df_cities.at[id , "non_touristic"] = value_counts[0]
        df_cities.at[id, "touristic"] = value_counts[1]
        
    df_cities.to_json(path_or_buf=path_name)
with open(path_name, 'r') as f:
    df_cities = pd.read_json(path_name)
        
        

In [71]:


def calculate_relation(row):
    row['score'] = row['touristic']/  row['non_touristic'] 
    return row 

df_cities = df_cities.apply(lambda x : calculate_relation(x), axis=1)



df_cities["zscore"] = (df_cities["score"] - df_cities['score'].mean())/df_cities['score'].std(ddof=0)
df = df_cities[["Name", "zscore"]]
df.sort_values(by=['zscore'], ascending=False)

Unnamed: 0,Name,zscore
11,Venice,3.086572
4,Amsterdam,0.419817
5,Jerusalem,0.347567
10,Las Vegas,-0.109458
7,Essen,-0.198324
3,Barcelona,-0.298467
1,Saarbrücken,-0.317738
0,Trier,-0.408492
8,New York City,-0.487549
6,Bochum,-0.501531
