## Budget Calculator App API

This notebook serves as a model for the app API.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Initial Requests 

This code runs each time the application starts. 

In [None]:
base_url = 'https://www.numbeo.com/cost-of-living/historical-data-city-selector'

In [None]:
page = requests.get(base_url)
numbeo_city_soup = BeautifulSoup(page.content, "html.parser")
results = numbeo_city_soup.find('table', class_='related_links')
print(results)  # print the matched table itself; calling results() would run find_all() on it

In [None]:
list_cities = results.find_all('a')
list_cities[0]

In [None]:
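# "City, State, Country" becomes "(City) State"; "City, Country" stays "City"
city_name = lambda x: f"({x[0].strip()}) {x[1].strip()}" if len(x) > 2 else x[0].strip()

# Build one record per link: display name, country, and the URL of the city's data page
city_dict = lambda x: {'City':city_name(x.text.split(",")), 'Country':x.text.split(",")[-1].strip(), 'Url':x["href"]}
city_pages = [city_dict(city) for city in list_cities]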

In [None]:
df = pd.DataFrame(city_pages)  # one row per city: name, country, and URL of its data page
df.head()
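
The parsing step above can be tried offline on a small piece of markup. This is only a sketch: the HTML snippet, the `example.com` URLs, and the helper name `parse_city_link` are assumptions made for illustration, and the real Numbeo page may be structured differently.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup imitating the city selector table; not copied from the real page.
SAMPLE_HTML = """
<table class="related_links">
  <tr><td><a href="https://example.com/johannesburg">Johannesburg, South Africa</a></td></tr>
  <tr><td><a href="https://example.com/portland-or">Portland, OR, United States</a></td></tr>
</table>
"""

def parse_city_link(link):
    """Turn one <a> tag into a City/Country/Url record (hypothetical helper)."""
    parts = [part.strip() for part in link.text.split(",")]
    # Three parts means "City, State, Country"; keep the state to tell cities apart.
    city = f"({parts[0]}) {parts[1]}" if len(parts) > 2 else parts[0]
    return {"City": city, "Country": parts[-1], "Url": link["href"]}

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="related_links")
records = [parse_city_link(a) for a in table.find_all("a")]
sample_df = pd.DataFrame(records)
print(sample_df)
```

Working from a static snippet like this makes it possible to check the name handling without sending a request each time.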

## User Interaction

This part of the code runs when the user enters search criteria via the UI.

In [None]:
country_selection = input("Select Country")
country_slice = df[df["Country"] == country_selection]

In [None]:
country_slice

Once the data has been sliced by country, the user can select a city. Some city names are duplicated across countries or states. Avoiding this ambiguity by choosing the country first and then the city is essential to ensuring the app provides accurate information.
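
One way to make the country-then-city selection explicit is to filter on both columns and insist on exactly one match. The sketch below uses made-up rows and placeholder URLs, and `find_city_url` is a hypothetical helper rather than part of the notebook.

```python
import pandas as pd

# Made-up rows for illustration; the URLs are placeholders, not real Numbeo links.
cities = pd.DataFrame([
    {"City": "Birmingham", "Country": "United Kingdom", "Url": "https://example.com/birmingham-uk"},
    {"City": "Birmingham", "Country": "United States", "Url": "https://example.com/birmingham-us"},
    {"City": "London", "Country": "United Kingdom", "Url": "https://example.com/london"},
])

def find_city_url(frame, country, city):
    """Return the URL of the single row matching both country and city."""
    matches = frame[(frame["Country"] == country) & (frame["City"] == city)]
    if len(matches) != 1:
        raise LookupError(f"expected one match for {city}, {country}; found {len(matches)}")
    return matches["Url"].iloc[0]

print(find_city_url(cities, "United Kingdom", "Birmingham"))
```

Raising an error when there is no match, or more than one, surfaces ambiguous input instead of silently returning the first row.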

In [None]:
def get_tables(city):
    # Look up the data-page URL of the first row whose City matches
    url = df.loc[df['City'] == city, 'Url'].iloc[0]
    page = requests.get(url)
    one_city_soup = BeautifulSoup(page.content, "html.parser")
    # The tables of interest are taken from the third 'innerWidth' div
    inner_width = one_city_soup.find_all('div', class_='innerWidth')
    results = inner_width[2].find_all('table')
    return results

In [None]:
city = "Johannesburg"
data = get_tables(city)

In [None]:
data[0]


The function below takes the raw tables and groups them into cost categories in which the user can track their expenditure.

In [94]:
def categorize_data(tables):
    from io import StringIO
    # Parse each HTML table into a dataframe; read_html returns a list, so take its first item
    reader_converter = lambda x: pd.read_html(StringIO(str(x)))[0]
    df_list = [reader_converter(table) for table in tables]
    # Join related tables side by side, then drop repeated columns (such as "Year") via a double transpose
    market = pd.concat([df_list[2], df_list[3], df_list[4]], axis=1).T.drop_duplicates().T
    leisure = pd.concat([df_list[0], df_list[12]], axis=1).T.drop_duplicates().T
    rental = df_list[5]
    public_transport = df_list[9]
    utilities = df_list[11]
    clothing = df_list[13]
    category_frames = [market, leisure, rental, public_transport, utilities, clothing]
    return [frame.set_index("Year") for frame in category_frames]

The categories are indexed in the returned list as follows:

0 - Market
1 - Leisure
2 - Rental
3 - Public Transport
4 - Utilities
5 - Clothing
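
The concatenate-then-deduplicate step is easier to follow on a toy example. The two small frames below are made up and merely stand in for scraped price tables.

```python
import pandas as pd

# Toy tables standing in for two scraped price tables; the values are made up.
groceries = pd.DataFrame({"Year": [2023, 2022], "Milk": [18.5, 17.0]})
bakery = pd.DataFrame({"Year": [2023, 2022], "Bread": [16.0, 15.5]})

# Placing the tables side by side leaves two identical "Year" columns.
side_by_side = pd.concat([groceries, bakery], axis=1)

# Transposing turns columns into rows, so drop_duplicates() removes the
# repeated "Year" column; transposing back restores the original layout.
combined = side_by_side.T.drop_duplicates().T.set_index("Year")
print(combined)
```

Two caveats apply to this approach: any two columns that happen to hold identical values are collapsed, not just "Year", and the round trip through the transpose converts every column to a common dtype, which is why the years come back as floats here.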



In [95]:
categorized_data = categorize_data(data)

The function below cleans the categorized data, replacing the '-' placeholders with nulls and converting the values to floats. This is essential for the next step: interpolation.

In [128]:
def clean_data(frames):
    cleaned = []
    for frame in frames:
        # '-' marks a missing value; astype returns a new frame, so collect the result
        cleaned.append(frame.replace({'-': np.nan}).astype(float))
    return cleaned
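
A quick check on a made-up frame shows what the cleaning step is meant to produce. The column names and values here are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame in which "-" marks a missing price, read in as text.
raw = pd.DataFrame(
    {"Milk": ["18.5", "-", "16.0"], "Bread": ["-", "15.5", "15.0"]},
    index=pd.Index([2023, 2022, 2021], name="Year"),
)

# replace() and astype() both return new frames, so the result must be kept.
cleaned = raw.replace({"-": np.nan}).astype(float)
print(cleaned.dtypes)
print(cleaned.isna().sum())
```

Each column ends up as a float column with one null where the placeholder used to be, which is the shape `interpolate()` expects.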

In [129]:
cleaned_data = clean_data(categorized_data)  # separate name so the function is not overwritten

Each dataframe is interpolated linearly, filling each null from the surrounding values in the same column.

In [137]:
cleaned_data[0].interpolate()

| Year | Meal, Inexpensive Restaurant | Meal for 2 People, Mid-range Restaurant, Three-course | McMeal at McDonalds (or Equivalent Combo Meal) | Fitness Club, Monthly Fee for 1 Adult | Tennis Court Rent (1 Hour on Weekend) | Cinema, International Release, 1 Seat |
|---|---|---|---|---|---|---|
| 2023.0 | 150.0 | 700.0 | 80.0 | 605.04 | 287.5 | 122.5 |
| 2022.0 | 150.0 | 600.0 | 80.0 | 579.24 | 147.0 | 100.0 |
| 2021.0 | 160.0 | 600.0 | 64.0 | 512.8 | 151.01 | 98.5 |
| 2020.0 | 150.0 | 600.0 | 57.0 | 478.82 | 163.33 | 96.5 |
| 2019.0 | 135.0 | 500.0 | 55.0 | 566.57 | 172.65 | 90.0 |
| 2018.0 | 130.0 | 500.0 | 60.0 | 478.61 | 141.4 | 80.0 |
| 2017.0 | 120.0 | 500.0 | 55.0 | 450.18 | 163.33 | 75.0 |
| 2016.0 | 100.0 | 450.0 | 51.0 | 446.61 | 98.12 | 70.0 |
| 2015.0 | 100.0 | 400.0 | 50.0 | 398.55 | 100.91 | 65.0 |
| 2014.0 | 85.0 | 400.0 | 50.0 | 376.67 | 129.25 | 60.0 |
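
To make the linear fill concrete, here is a small sketch with made-up prices. It relies on the default behaviour of `DataFrame.interpolate()`, which fills in row order and treats the rows as equally spaced.

```python
import numpy as np
import pandas as pd

# Made-up prices with one gap in the middle, one at the end, and one at the start.
prices = pd.DataFrame(
    {
        "Milk": [18.0, np.nan, 14.0],
        "Bread": [16.0, 15.0, np.nan],
        "Eggs": [np.nan, 30.0, 28.0],
    },
    index=pd.Index([2023, 2022, 2021], name="Year"),
)

filled = prices.interpolate()  # method="linear" is the default
print(filled)
```

With these defaults the middle gap in "Milk" becomes the midpoint of its neighbours, the trailing gap in "Bread" repeats the last known value, and the leading gap in "Eggs" stays null because there is no earlier row to fill from.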


Once interpolated, the data can be plotted as a time series.
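
A minimal plotting sketch, assuming matplotlib is available alongside pandas. The frame below is made up; in the notebook it would be one of the interpolated category frames.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up, already interpolated prices indexed by year.
prices = pd.DataFrame(
    {"Milk": [18.0, 16.0, 14.0], "Bread": [16.0, 15.0, 15.0]},
    index=pd.Index([2023, 2022, 2021], name="Year"),
)

fig, ax = plt.subplots()
# Sort so that time runs left to right, then draw one line per column.
prices.sort_index().plot(ax=ax, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Price (local currency)")
ax.set_title("Prices over time")
```

Sorting the index first matters because the scraped tables list the most recent year at the top.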