# Manually adding cities

### This notebook adds to our city_dataset the capital cities that our data extraction functions could not retrieve.

1. Check which cities cannot be obtained with our Numbeo function (for socio-economic data).
2. If a city's socio-economic data is missing from Numbeo, we unfortunately have to drop that city.
3. Check which of the new cities are present in climatedata.eu.
4. For cities with missing climate data, fill the values in manually, since many weather data sources are available online.
5. Lastly, add the latitude and longitude for each city.

In [1]:
import pandas as pd
import numpy as np

In [2]:
city_data = pd.read_json('../data/Combined_data.json')
city_data.head()

Unnamed: 0,city,autumn_high,autumn_prec_days,autumn_sun_hrs,spring_high,spring_prec_days,spring_sun_hrs,summer_high,summer_prec_days,summer_sun_hrs,...,cost_of_living,health_care,pollution,property_income_ratio,purchasing_power,safety,traffic_time,quality_of_life,lat,lng
0,Amsterdam,14,17,98,13,15,166,21,13,203,...,84.18,69.45,30.79,10.98,81.63,67.32,29.88,168.38,52.35,4.916667
1,Athens,23,8,212,20,9,235,31,2,347,...,59.28,56.17,57.3,12.75,40.69,50.49,37.98,119.84,37.983333,23.733333
2,Belgrade,18,7,153,18,9,182,26,8,266,...,40.49,53.69,63.57,22.22,34.87,62.02,35.89,107.89,44.833333,20.5
3,Berlin,13,9,109,13,9,167,23,10,219,...,67.41,69.68,39.45,9.63,98.54,58.92,34.06,164.83,52.516667,13.4
4,Bratislava,15,6,148,15,7,214,25,7,292,...,50.81,57.17,41.12,13.37,61.82,68.68,30.89,147.54,48.15,17.116667


In [3]:
capitals = ['Amsterdam', 'Andorra la Vella', 'Athens', 'Belgrade', 'Berlin', 'Bern', 
            'Bratislava', 'Brussels', 'Bucharest', 'Budapest', 'Chisinau', 'Copenhagen', 
            'Dublin', 'Helsinki', 'Kyiv', 'Lisbon', 'Ljubljana', 'London', 'Luxembourg', 
            'Madrid', 'Minsk', 'Monaco', 'Moscow', 'Nicosia', 'Nuuk', 'Oslo', 'Paris', 
            'Podgorica', 'Prague', 'Pristina', 'Reykjavik', 'Riga', 'Rome', 'San Marino', 
            'Sarajevo', 'Skopje', 'Sofia', 'Stockholm', 'Tallinn', 'Tirana', 'Vaduz', 
            'Valletta', 'Vatican City','Vienna', 'Vilnius', 'Warsaw', 'Zagreb']

Checking which cities are still missing from the city_dataset

In [4]:
city_data_list = list(city_data.city)
missing_capitals = []

for capital in capitals:
    if capital not in city_data_list:
        missing_capitals.append(capital)


In [5]:
missing_capitals

['Andorra la Vella',
 'Bern',
 'Chisinau',
 'Kyiv',
 'Minsk',
 'Monaco',
 'Moscow',
 'Nuuk',
 'Podgorica',
 'Prague',
 'Pristina',
 'San Marino',
 'Vaduz',
 'Vatican City']
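
The same membership check can be written more compactly. The following is a minimal sketch on shortened sample lists (not the full `capitals` list or the real `city_data` column); converting the known cities to a set makes each lookup constant time while the list comprehension keeps the original order.

```python
# Shortened sample data, for illustration only
capitals = ['Amsterdam', 'Bern', 'Kyiv', 'Prague']
city_data_list = ['Amsterdam', 'Athens']

# A set gives fast membership tests; the comprehension preserves order
known_cities = set(city_data_list)
missing_capitals = [capital for capital in capitals if capital not in known_cities]

print(missing_capitals)  # ['Bern', 'Kyiv', 'Prague']
```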

### 1.1 Getting the socio-economic data

In [6]:
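def get_city_info(city_list) -> tuple:
    """Scrape Numbeo quality-of-life tables; return (DataFrame, list of failed cities)."""
    tables = []
    errors = []

    # iterate over list of cities to obtain table data from Numbeo
    for city in city_list:
        try:
            # the index table is expected at position 3 on the page
            table = pd.read_html(f'https://www.numbeo.com/quality-of-life/in/{city}')[3]
            table = (table.assign(city=city)
                      .drop(columns=[2])
                      .rename(columns={0: 'category', 1: 'numeral'})
                      .drop([table.index[8]]))
            tables.append(table)

        except Exception:
            # any failure (bad URL, missing table, unexpected layout) flags the city for manual review
            errors.append(city)

    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    df = pd.concat(tables, ignore_index=True) if tables else pd.DataFrame()

    return df, errors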

In [7]:
# call the function once and unpack both results, so Numbeo is not scraped twice
capitals_sc, errors = get_city_info(missing_capitals)

In [8]:
capitals_sc

Unnamed: 0,category,numeral,city
0,Purchasing Power Index,26.95,Chisinau
1,Safety Index,54.76,Chisinau
2,Health Care Index,51.93,Chisinau
3,Climate Index,76.91,Chisinau
4,Cost of Living Index,34.66,Chisinau
...,...,...,...
67,Cost of Living Index,29,Pristina
68,Property Price to Income Ratio,10.06,Pristina
69,Traffic Commute Time Index,21.92,Pristina
70,Pollution Index,71.18,Pristina


### 1.2 The cities in the error list will have to be checked manually.

In [9]:
errors

['Andorra la Vella', 'Bern', 'Kyiv', 'San Marino', 'Vaduz', 'Vatican City']

We'll need to manually add the data for these cities, but if they are not present in Numbeo, they will have to be dropped.

- Andorra la Vella (deleted: the error occurred because Numbeo hyphenates the name, but the page is also missing the climate index and the quality of life index)
- Bern (added, HTML table at a different index)
- Kyiv (added, spelled 'Kiev' in Numbeo)
- San Marino (deleted, no data in Numbeo)
- Vaduz (deleted, no data in Numbeo)
- Vatican City (deleted, no data in Numbeo)
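
Two of the failures above come from how the city name appears in the Numbeo URL rather than from missing data. A minimal sketch of normalising names before building the URL follows. The `NUMBEO_NAME_OVERRIDES` dictionary and the `numbeo_url` helper are hypothetical names introduced here, and the exact slug Numbeo expects (including capitalisation) is an assumption that should be checked against the live site.

```python
from urllib.parse import quote

# Hypothetical override table: our spelling -> spelling used by Numbeo
NUMBEO_NAME_OVERRIDES = {'Kyiv': 'Kiev'}


def numbeo_url(city):
    """Build a quality-of-life URL, applying spelling overrides and hyphenation."""
    name = NUMBEO_NAME_OVERRIDES.get(city, city)
    # Assumption: multi-word names are joined with hyphens in the URL
    slug = quote(name.replace(' ', '-'))
    return f'https://www.numbeo.com/quality-of-life/in/{slug}'


print(numbeo_url('Kyiv'))
print(numbeo_url('Andorra la Vella'))
```

Keeping the overrides in one dictionary leaves the original city names untouched in the dataset, so later joins on the `city` column are not affected.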

Adding 'Bern' by updating the function to take the table at the new index.

In [10]:
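def manual_city_info(city_list) -> tuple:
    """Same as get_city_info, but reads the table at index 4 instead of 3."""
    tables = []
    errors = []

    # iterate over list of cities to obtain table data from Numbeo
    for city in city_list:
        try:
            # for Bern, the index table sits one position later on the page
            table = pd.read_html(f'https://www.numbeo.com/quality-of-life/in/{city}')[4]
            table = (table.assign(city=city)
                      .drop(columns=[2])
                      .rename(columns={0: 'category', 1: 'numeral'})
                      .drop([table.index[8]]))
            tables.append(table)

        except Exception:
            # any failure (bad URL, missing table, unexpected layout) flags the city
            errors.append(city)

    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    df = pd.concat(tables, ignore_index=True) if tables else pd.DataFrame()

    return df, errors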
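
Because the index table sat at position 3 for most cities but at position 4 for Bern, a fixed position is fragile. One possible alternative, sketched here under assumptions, is to select the table by its content. The helper name `pick_index_table` and the marker text are hypothetical choices, and the tables below are synthetic stand-ins for the list that `pd.read_html` returns, not real Numbeo output.

```python
import pandas as pd


def pick_index_table(tables, marker='Purchasing Power Index'):
    """Return the first table whose first column contains `marker`, else None."""
    for table in tables:
        if table.shape[1] == 0:
            continue  # skip tables without columns
        first_col = table.iloc[:, 0].astype(str)
        if first_col.str.contains(marker, regex=False).any():
            return table
    return None


# Synthetic stand-ins for the tables found on a page
tables = [
    pd.DataFrame({0: ['Sign in', 'Menu']}),
    pd.DataFrame({0: ['Purchasing Power Index', 'Safety Index'],
                  1: [128.75, 80.61],
                  2: ['n/a', 'n/a']}),
]

table = pick_index_table(tables)
print(table)
```

With a helper like this, a single scraping function could serve both the regular cities and Bern, at the cost of depending on the category label staying the same.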

In [11]:
Bern_sc = manual_city_info(errors)[0]

In [12]:
Bern_sc

Unnamed: 0,category,numeral,city
0,Purchasing Power Index,128.75,Bern
1,Safety Index,80.61,Bern
2,Health Care Index,74.76,Bern
3,Climate Index,75.97,Bern
4,Cost of Living Index,117.96,Bern
5,Property Price to Income Ratio,7.05,Bern
6,Traffic Commute Time Index,19.77,Bern
7,Pollution Index,11.11,Bern
8,ƒ Quality of Life Index:,210.9,Bern
