## Budget Calculator App API

This notebook serves as a model for the app API

In [28]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Initial Requests 

This code runs each time the application starts. 

In [29]:
base_url = 'https://www.numbeo.com/cost-of-living/historical-data-city-selector'

In [30]:
page = requests.get(base_url)
numbeo_city_soup = BeautifulSoup(page.content, "html.parser")
results = numbeo_city_soup.find('table', class_='related_links')
print(results())

[<tr><td style="width: 20%">
<h4>A</h4>
<a href="https://www.numbeo.com/cost-of-living/city-history/in/Aachen">Aachen, Germany</a><br/>
<a href="https://www.numbeo.com/cost-of-living/city-history/in/Aalborg">Aalborg, Denmark</a><br/>
<a href="https://www.numbeo.com/cost-of-living/city-history/in/Aarhus-Denmark">Aarhus, Denmark</a><br/>
<a href="https://www.numbeo.com/cost-of-living/city-history/in/Abbotsford">Abbotsford, Canada</a><br/>
<a href="https://www.numbeo.com/cost-of-living/city-history/in/Aberdeen">Aberdeen, United Kingdom</a><br/>
<a href="https://www.numbeo.com/cost-of-living/city-history/in/Abidjan">Abidjan, Ivory Coast</a><br/>
<a href="https://www.numbeo.com/cost-of-living/city-history/in/Abu-Dhabi">Abu Dhabi, United Arab Emirates</a><br/>
<a href="https://www.numbeo.com/cost-of-living/city-history/in/Abuja">Abuja, Nigeria</a><br/>
<a href="https://www.numbeo.com/cost-of-living/city-history/in/Accra">Accra, Ghana</a><br/>
<a href="https://www.numbeo.com/cost-of-living/ci

In [31]:
list_cities = results.find_all('a')
list_cities[0]

<a href="https://www.numbeo.com/cost-of-living/city-history/in/Aachen">Aachen, Germany</a>

In [32]:
city_name = lambda x: f"({x[0]}) {x[1]}" if len(x) > 2 else x[0]

city_dict = lambda x: {'City':city_name(x.text.split(",")), 'Country':x.text.split(",")[-1].strip(), 'Url':x["href"]}
city_pages = [city_dict(city) for city in list_cities]

In [33]:
df = pd.DataFrame(city_pages) # creates a dataframe with all cities, their country and page urls for their data tables 
df.head()

Unnamed: 0,City,Country,Url
0,Aachen,Germany,https://www.numbeo.com/cost-of-living/city-his...
1,Aalborg,Denmark,https://www.numbeo.com/cost-of-living/city-his...
2,Aarhus,Denmark,https://www.numbeo.com/cost-of-living/city-his...
3,Abbotsford,Canada,https://www.numbeo.com/cost-of-living/city-his...
4,Aberdeen,United Kingdom,https://www.numbeo.com/cost-of-living/city-his...


## User Interaction

This part of the code kicks in when the user inputs search criterion via the UI. Country and city are case sensitive. Appropriate measures should be taken in the app logic to prevent link brakage due to incorrect user entries.

In [34]:
country_selection = input("Select Country")
country_slice = df[df["Country"] == country_selection]

In [35]:
country_slice

Unnamed: 0,City,Country,Url
574,Nairobi,Kenya,https://www.numbeo.com/cost-of-living/city-his...


Once the data has been sliced by country, the user can select a city. Some cities are duplicated accross countries or states. Being able to avoid ambiguity by defining country then city is an essential step in ensuring the app provides accurate information.

In [36]:
def get_tables(city):
    from io import StringIO
    page = requests.get(df.iloc[df.loc[df['City'] == city].index[0]]["Url"])
    one_city_soup = BeautifulSoup(page.content, "html.parser")
    inner_width = one_city_soup.find_all('div', class_='innerWidth')
    results = inner_width[2].find_all('table')
    return results

In [37]:
city = input("Enter City")
data = get_tables(city)

In [38]:
data[0]


<table class="stripe row-border order-column compact" id="tier_1">
<thead>
<tr>
<th><div class="font_in_table_headers">Year</div></th><th><div class="font_in_table_headers">Meal, Inexpensive Restaurant</div></th><th><div class="font_in_table_headers">Meal for 2 People, <br/>Mid-range Restaurant, Three-course</div></th><th><div class="font_in_table_headers">McMeal at McDonalds <br/>(or Equivalent Combo Meal)</div></th></tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2023</td>
<td style="white-space: nowrap; text-align: right">500.00</td>
<td style="white-space: nowrap; text-align: right">4000.00</td>
<td style="white-space: nowrap; text-align: right">850.00</td>
</tr>
<tr>
<td style="text-align: right">2022</td>
<td style="white-space: nowrap; text-align: right">600.00</td>
<td style="white-space: nowrap; text-align: right">4750.00</td>
<td style="white-space: nowrap; text-align: right">650.00</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="white-space: nowra

The below function takes the raw tables and combines them into categories of cost groups which the user will be able to track their expenditure in.

In [39]:
def categorize_data(tables):
    from io import StringIO
    reader_converter = lambda x: pd.DataFrame(pd.read_html(StringIO(str(x)))[0])
    df_list = [reader_converter(table) for table in tables]
    market = pd.concat([df_list[2], df_list[3], df_list[4]], axis=1).T.drop_duplicates().T
    leisure = pd.concat([df_list[0], df_list[12]], axis=1).T.drop_duplicates().T
    rental = df_list[5]
    public_transport = df_list[9]
    utillities = df_list[11]
    clothing = df_list[13]
    category_frames = [market, leisure, rental, public_transport, utillities, clothing]
    return category_frames

The index code for categories is as follows:

0 - Market  
1 - Leisure  
2 - Rental  
3 - Public Transport  
4 - Utilities  
5 - Clothing

In [40]:
categorized_data = categorize_data(data)

The below function cleans the categorized data, removing nulls and changing datatypes to floats. This is essential for the next step: interpolation.

In [41]:
def clean_data(frames):
    for frame in frames:
        frame.replace({'-': np.nan}, inplace=True)
        frame = frame.astype(float)
    return(frames)

In [42]:
clean_data = clean_data(categorized_data)

Each dataframe is interpolated in a linear manner, filling the nulls according to the progression or sequence of values accross the column. 

In [43]:
clean_data[1].interpolate()

  clean_data[1].interpolate()


Unnamed: 0,Year,"Meal, Inexpensive Restaurant","Meal for 2 People, Mid-range Restaurant, Three-course",McMeal at McDonalds (or Equivalent Combo Meal),"Fitness Club, Monthly Fee for 1 Adult",Tennis Court Rent (1 Hour on Weekend),"Cinema, International Release, 1 Seat"
0,2023,500.0,4000.0,850.0,5771.08,2278.82,800.0
1,2022,600.0,4750.0,650.0,5947.37,,850.0
2,2021,500.0,3500.0,700.0,5730.77,1750.0,700.0
3,2020,500.0,3750.0,675.2,5441.18,,800.0
4,2019,500.0,3000.0,702.9,5522.49,802.38,800.0
5,2018,500.0,3000.0,700.0,6650.0,780.0,800.0
6,2017,500.0,2500.0,500.0,5472.22,500.0,700.0
7,2016,500.0,2500.0,600.0,6194.44,660.0,600.0
8,2015,400.0,2750.0,850.0,7045.45,,800.0
9,2014,400.0,3000.0,650.0,7509.36,766.67,500.0


Once interpolated, the data can be ploted as a time series (line plot showing values from 2011 to 2023 for each feature).

## Helper Functions

#### Market Average 

In [44]:
market = clean_data[0].astype(float).interpolate()
p_mark = round(market.loc[0].sum() * 2, 2)

#### Leisure Average

In [45]:
leisure = clean_data[1].astype(float).interpolate()
p_leis = round(leisure.loc[0].sum() / 3, 2)
p_leis 


5407.63

#### Rental Average

In [46]:
rental = clean_data[2].astype(float).interpolate()
p_rent = round(rental.loc[0].mean(), 2)
p_rent

58542.23

#### Public Transport 

In [47]:
public_transport = clean_data[3].astype(float).interpolate()
p_trans = public_transport.loc[0].sum()
p_trans

7023.0

#### Utilites 

In [48]:
utilities = clean_data[4].astype(float).interpolate()
p_utils = round(utilities.loc[0].astype(float).sum() / 4, 2)
p_utils 

3125.14

#### Clothing

In [49]:
clothing = clean_data[5].astype(float).interpolate()
p_cloth = round(clothing.loc[0].sum() / 2, 2)
p_cloth

10325.9

In [50]:
total = p_cloth + p_utils + p_trans + p_rent + p_leis + p_mark
round(total, 2)

102761.32

In [51]:
import os

def save_df(df_list, city): 
    categories=["Market","Leisure","Rental","Transport", "Utilities","Clothing"]
    # Create directory if it doesn't exist
    directory = f'data/processed/{city}'
    if not os.path.exists(directory):
        os.makedirs(directory)
    for index, frame in enumerate(df_list):
        # Ensure the title index is within bounds
        frame.astype(float).interpolate()
        if index < len(categories):
            # Save DataFrame to CSV
            frame.to_csv(f'{directory}/{categories[index]}.csv', sep=',', index=False, encoding='utf-8')

In [52]:
save_df(clean_data, 'nairobi')
