## Budget Calculator App API

This notebook serves as a model for the app API

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Initial Requests 

This code runs each time the application starts. 

In [2]:
base_url = 'https://www.numbeo.com/cost-of-living/historical-data-city-selector'

In [None]:
page = requests.get(base_url)
numbeo_city_soup = BeautifulSoup(page.content, "html.parser")
results = numbeo_city_soup.find('table', class_='related_links')
print(results())

In [4]:
list_cities = results.find_all('a')
list_cities[0]

<a href="https://www.numbeo.com/cost-of-living/city-history/in/Aachen">Aachen, Germany</a>

In [5]:
city_name = lambda x: f"({x[0]}) {x[1]}" if len(x) > 2 else x[0]

city_dict = lambda x: {'City':city_name(x.text.split(",")), 'Country':x.text.split(",")[-1].strip(), 'Url':x["href"]}
city_pages = [city_dict(city) for city in list_cities]

In [6]:
df = pd.DataFrame(city_pages) # creates a dataframe with all cities, their country and page urls for their data tables 
df.head()

Unnamed: 0,City,Country,Url
0,Aachen,Germany,https://www.numbeo.com/cost-of-living/city-his...
1,Aalborg,Denmark,https://www.numbeo.com/cost-of-living/city-his...
2,Aarhus,Denmark,https://www.numbeo.com/cost-of-living/city-his...
3,Abbotsford,Canada,https://www.numbeo.com/cost-of-living/city-his...
4,Aberdeen,United Kingdom,https://www.numbeo.com/cost-of-living/city-his...


## User Interaction

This part of the code kicks in when the user inputs search criterion via the UI. Country and city are case sensitive. Appropriate measures should be taken in the app logic to prevent link brakage due to incorrect user entries.

In [7]:
country_selection = input("Select Country")
country_slice = df[df["Country"] == country_selection]

In [8]:
country_slice

Unnamed: 0,City,Country,Url
169,Cape Town,South Africa,https://www.numbeo.com/cost-of-living/city-his...
256,Durban,South Africa,https://www.numbeo.com/cost-of-living/city-his...
396,Johannesburg,South Africa,https://www.numbeo.com/cost-of-living/city-his...
673,Port Elizabeth,South Africa,https://www.numbeo.com/cost-of-living/city-his...
686,Pretoria,South Africa,https://www.numbeo.com/cost-of-living/city-his...


Once the data has been sliced by country, the user can select a city. Some cities are duplicated accross countries or states. Being able to avoid ambiguity by defining country then city is an essential step in ensuring the app provides accurate information.

In [9]:
def get_tables(city):
    from io import StringIO
    page = requests.get(df.iloc[df.loc[df['City'] == city].index[0]]["Url"])
    one_city_soup = BeautifulSoup(page.content, "html.parser")
    inner_width = one_city_soup.find_all('div', class_='innerWidth')
    results = inner_width[2].find_all('table')
    return results

In [10]:
city = input("Enter City")
data = get_tables(city)

In [11]:
data[0]


<table class="stripe row-border order-column compact" id="tier_1">
<thead>
<tr>
<th><div class="font_in_table_headers">Year</div></th><th><div class="font_in_table_headers">Meal, Inexpensive Restaurant</div></th><th><div class="font_in_table_headers">Meal for 2 People, <br/>Mid-range Restaurant, Three-course</div></th><th><div class="font_in_table_headers">McMeal at McDonalds <br/>(or Equivalent Combo Meal)</div></th></tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2023</td>
<td style="white-space: nowrap; text-align: right">170.00</td>
<td style="white-space: nowrap; text-align: right">700.00</td>
<td style="white-space: nowrap; text-align: right">80.00</td>
</tr>
<tr>
<td style="text-align: right">2022</td>
<td style="white-space: nowrap; text-align: right">150.00</td>
<td style="white-space: nowrap; text-align: right">600.00</td>
<td style="white-space: nowrap; text-align: right">70.00</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="white-space: nowrap; t

The below function takes the raw tables and combines them into categories of cost groups which the user will be able to track their expenditure in.

In [12]:
def categorize_data(tables):
    from io import StringIO
    reader_converter = lambda x: pd.DataFrame(pd.read_html(StringIO(str(x)))[0])
    df_list = [reader_converter(table) for table in tables]
    market = pd.concat([df_list[2], df_list[3], df_list[4]], axis=1).T.drop_duplicates().T
    leisure = pd.concat([df_list[0], df_list[12]], axis=1).T.drop_duplicates().T
    rental = df_list[5]
    public_transport = df_list[9]
    utillities = df_list[11]
    clothing = df_list[13]
    category_frames = [market, leisure, rental, public_transport, utillities, clothing]
    return [frame.set_index("Year") for frame in category_frames]

The index code for categories is as follows:

0 - Market  
1 - Leisure  
2 - Rental  
3 - Public Transport  
4 - Utilities  
5 - Clothing

In [13]:
categorized_data = categorize_data(data)

The below function cleans the categorized data, removing nulls and changing datatypes to floats. This is essential for the next step: interpolation.

In [14]:
def clean_data(frames):
    for frame in frames:
        frame.replace({'-': np.nan}, inplace=True)
        frame = frame.astype(float)
    return(frames)

In [15]:
clean_data = clean_data(categorized_data)

Each dataframe is interpolated in a linear manner, filling the nulls according to the progression or sequence of values accross the column. 

In [29]:
clean_data[1].interpolate()

  clean_data[1].interpolate()


Unnamed: 0_level_0,"Meal, Inexpensive Restaurant","Meal for 2 People, Mid-range Restaurant, Three-course",McMeal at McDonalds (or Equivalent Combo Meal),"Fitness Club, Monthly Fee for 1 Adult",Tennis Court Rent (1 Hour on Weekend),"Cinema, International Release, 1 Seat"
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2023,170.0,700.0,80.0,595.27,87.5,107.5
2022,150.0,600.0,70.0,728.82,188.33,100.0
2021,150.0,600.0,65.0,541.35,155.62,100.0
2020,120.0,550.0,60.0,514.19,179.44,100.0
2019,140.0,600.0,60.0,567.01,126.47,85.0
2018,120.0,500.0,60.0,476.75,132.62,80.0
2017,100.0,460.0,57.5,499.09,80.0,75.0
2016,100.0,450.0,55.0,530.39,95.0,67.25
2015,90.0,400.0,45.0,433.25,81.67,65.0
2014,80.0,300.0,45.0,440.91,,60.0


Once interpolated, the data can be ploted as a time series (line plot showing values from 2011 to 2023 for each feature).

## Helper Functions

#### Market Average 

In [17]:
market = clean_data[0].astype(float).interpolate()
p_mark = round(market.loc[2023].sum() * 2, 2)

#### Leisure Average

In [18]:
leisure = clean_data[1].astype(float).interpolate()
p_leis = round(leisure.loc[2023].sum() / 3, 2)
p_leis 


#### Rental Average

In [19]:
rental = clean_data[2].astype(float).interpolate()
p_rent = round(rental.loc[2023].mean(), 2)
p_rent

#### Public Transport 

In [20]:
public_transport = clean_data[3].astype(float).interpolate()
p_trans = public_transport.loc[2023].sum()
p_trans

#### Utilites 

In [21]:
utilities = clean_data[4].astype(float).interpolate()
p_utils = round(utilities.loc[2023].astype(float).sum() / 4, 2)
p_utils 

#### Clothing

In [22]:
clothing = clean_data[5].astype(float).interpolate()
p_cloth = round(clothing.loc[2023].sum() / 2, 2)
p_cloth

In [23]:
total = p_cloth + p_utils + p_trans + p_rent + p_leis + p_mark
round(total, 2)

20935.09

In [24]:
import os

def save_df(df_list, city): 
    categories=["Market","Leisure","Rental","Transport", "Utilities","Clothing"]
    # Create directory if it doesn't exist
    directory = f'data/processed/{city}'
    if not os.path.exists(directory):
        os.makedirs(directory)
    for index, frame in enumerate(df_list):
        # Ensure the title index is within bounds
        frame.astype(float).interpolate()
        if index < len(categories):
            # Save DataFrame to CSV
            frame.to_csv(f'{directory}/{categories[index]}.csv', sep=',', index=False, encoding='utf-8')

In [25]:
save_df(clean_data, 'cape-town')
