# Project 3

Hello! For Project 3, I'm looking to explore the relationship between the number of McDonald's outlets in different countries and the adult obesity rate in those countries in 2022. This was inspired by an article exploring a similar topic in Denver [https://medium.com/@iantdraves/mcdonalds-footprint-in-denver-exploring-correlation-to-obesity-1ac2ff5098c9].



Dataset(s) to be used: [https://www.kaggle.com/datasets/forveryou/mcdonalds-stores-data and https://ourworldindata.org/grapher/share-of-adults-defined-as-obese?country=USA~GBR~BRA~IND]

Analysis question: Is there a correlation between the number of McDonald's outlets in a country and the adult obesity rate in that country?

Columns to be used to merge/join them: the two datasets are joined on "country". The columns kept from each dataset are:

[McDonald's Dataset] ["name", "country", "subdivision", "city", "latitude", "longitude"]

[Our World in Data] ["country", "obesity_rate"]

Hypothesis: [Countries with more McDonald’s restaurants tend to have higher obesity rates. However, we may have a reverse causality issue here - countries with higher obesity rates may also tend to have more McDonald’s restaurants.]

Part 1: McDonald's Dataset - This dataset includes a record for every McDonald's store location as of September 2022.

In [4]:
# First, let's import our pandas package and load our data
import pandas as pd


mcdonalds = pd.read_csv(
    "https://raw.githubusercontent.com/s-najiihah/computing-in-context/refs/heads/main/mcdonalds.csv"
)
mcdonalds

Unnamed: 0,name,storeid,country,subdivision,city,address,postcode,telephone,runhours,latitude,longitude,services
0,McDonald's,nodata,Andorra,Andorra la Vella,Andorra la Vella,"Av. Meritxell, 32 AD500 Andorra la Vella",AD500,885 885,8:00-23:00,42.507875,1.524985,Terrace/Accessibility/Birthday/McCafé/Breakfast
1,McDonald's - Pas de la Casa,nodata,Andorra,Encamp,El Pas de la Casa,"Plaça dels Coprínceps, 23 AD200 Pas de la Casa",AD200,880 512,9:00-21:00,42.543500,1.732089,McDrive/Accessibility/Breakfast/Dessert/Birthd...
2,McDonald's - Av. Tarragona,nodata,Andorra,Andorra la Vella,Andorra la Vella,"Av. of Tarragona, 49 AD500 Andorra la Vella",AD500,800 400,11:00-24:00/11:00-01:00,42.505142,1.527843,McDelivery/Terrace/McDrive/Accessibility/McCaf...
3,McDonald's - Epizen,nodata,Andorra,Sant Julià de Lòria,Sant Julià de Lòria,Carretera General CG1 - Epizen Shopping Center...,AD600,742 400,09:00-22:00,42.459812,1.489049,McDelivery/Terrace/McDrive/Accessibility/Pakri...
4,McDonald's Andorra - Meritxell,nodata,Andorra,Andorra la Vella,Andorra la Vella,"Av. Meritxell, 105 AD500 Andorra la Vella",AD500,726 396,11:00-23:00,42.508503,1.533468,McDrive/TakeAway/McCafé/Breakfast/Accessibility
...,...,...,...,...,...,...,...,...,...,...,...,...
39190,McDonald's Nguyễn Huệ,nodata,Vietnam,Ho Chi Minh City,Ho Chi Minh City,"123 Nguyễn Huệ, Quận 1, HCM",71006,(028) 3914 0888,08:00-23:00,10.774377,106.700709,Wifi/Birthday/McDelivery
39191,McDonald's Vivo City,nodata,Vietnam,Ho Chi Minh City,Ho Chi Minh City,"1058 Đại lộ Nguyễn Văn Linh, Quận 7, HCM",72915,(028) 3775 3000,08:00-22:00,10.729858,106.701097,Parking/Wifi/Birthday/McDelivery
39192,McDonald’s Satra Phạm Hùng,nodata,Vietnam,Ho Chi Minh City,Ho Chi Minh City,"Tầng trệt, TTTM Centre Mall Phạm Hùng, C6/27 P...",73009,(028) 3758 8000,10:00-22:00/08:00-22:00,10.733498,106.672177,Wifi/Birthday/McDelivery
39193,McDonald's Giga Mall,nodata,Vietnam,Ho Chi Minh City,Ho Chi Minh City,"240-242 Phạm Văn Đồng, Quận Thủ Đức, HCM",71409,(028) 36203644,09:30:22:00/09:00-22:00,10.818210,106.687410,Parking/Wifi/Birthday/McDelivery


That's a lot of data! Let's keep only the columns we need: name, country, subdivision, city, latitude and longitude.

In [5]:
mcdonalds_clean = mcdonalds[["name", "country", "subdivision", "city", "latitude", "longitude"]]
mcdonalds_clean


Unnamed: 0,name,country,subdivision,city,latitude,longitude
0,McDonald's,Andorra,Andorra la Vella,Andorra la Vella,42.507875,1.524985
1,McDonald's - Pas de la Casa,Andorra,Encamp,El Pas de la Casa,42.543500,1.732089
2,McDonald's - Av. Tarragona,Andorra,Andorra la Vella,Andorra la Vella,42.505142,1.527843
3,McDonald's - Epizen,Andorra,Sant Julià de Lòria,Sant Julià de Lòria,42.459812,1.489049
4,McDonald's Andorra - Meritxell,Andorra,Andorra la Vella,Andorra la Vella,42.508503,1.533468
...,...,...,...,...,...,...
39190,McDonald's Nguyễn Huệ,Vietnam,Ho Chi Minh City,Ho Chi Minh City,10.774377,106.700709
39191,McDonald's Vivo City,Vietnam,Ho Chi Minh City,Ho Chi Minh City,10.729858,106.701097
39192,McDonald’s Satra Phạm Hùng,Vietnam,Ho Chi Minh City,Ho Chi Minh City,10.733498,106.672177
39193,McDonald's Giga Mall,Vietnam,Ho Chi Minh City,Ho Chi Minh City,10.818210,106.687410


Before we move on to our next dataset, I want to inspect the values under 'country' so that we can compare them with the country values in the next dataset.

In [6]:
mcdonalds_country = mcdonalds_clean["country"].unique()
mcdonalds_country

array(['Andorra', 'Argentina', 'Aruba', 'Australia', 'Austria',
       'Azerbaijan', 'Bahamas', 'Bahrain', 'Belarus', 'Belgium', 'Brazil',
       'Brunei', 'Bulgaria', 'Canada', 'Chile', 'China', 'Colombia',
       'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czech', 'Denmark',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Estonia',
       'Fiji', 'Finland', 'France', 'Georgia', 'Germany', 'Greece',
       'Guatemala', 'Honduras', 'Hungary', 'India', 'Indonesia',
       'Ireland', 'Israel', 'Italy', 'Japan', 'Kazakhstan', 'Kuwait',
       'Latvia', 'Lebanon', 'Liechtenstein', 'Lithuania', 'Luxembourg',
       'Malaysia', 'Malta', 'Mauritius', 'Mexico', 'Moldova', 'Monaco',
       'Morocco', 'Netherlands', 'New Zealand', 'Nicaragua', 'Norway',
       'Oman', 'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines',
       'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia', 'Samoa',
       'Saudi Arabia', 'Serbia', 'Singapore', 'Slovakia', 'Slovenia',
       'South Afric

Part 2: Next, let's get our obesity data at country level. This data is retrieved from 'Our World in Data'.

In [7]:
# Let's load our data and inspect
obesity_adults_2022 = pd.read_csv(
    "https://ourworldindata.org/grapher/share-of-adults-defined-as-obese.csv?v=1&csvType=filtered&useColumnShortNames=true&country=USA~GBR~BRA~IND&overlay=download-data"
)

obesity_adults_2022

Unnamed: 0,Entity,Code,Year,prevalence_of_obesity_among_adults__bmi__gt__30__crude_estimate__pct__sex_both_sexes__age_group_18plus__years_of_age,time
0,Afghanistan,AFG,2022,17.59419,2022
1,Albania,ALB,2022,26.58335,2022
2,Algeria,DZA,2022,24.24859,2022
3,Andorra,AND,2022,20.47006,2022
4,Angola,AGO,2022,10.53502,2022
...,...,...,...,...,...
188,Venezuela,VEN,2022,22.83876,2022
189,Vietnam,VNM,2022,2.08170,2022
190,Yemen,YEM,2022,11.55534,2022
191,Zambia,ZMB,2022,9.41690,2022


Looking great. Now let's keep only the columns we want from this dataset so that we can combine it with the McDonald's dataset later on. Let's also shorten the long column name and rename "Entity" to "country" so that it matches our McDonald's dataset.

In [8]:
# First, let's clean up our column names
obesity_adults_2022.rename(
    columns={
        "prevalence_of_obesity_among_adults__bmi__gt__30__crude_estimate__pct__sex_both_sexes__age_group_18plus__years_of_age": "obesity_rate",
        "Entity" : "country"
    },
    inplace=True,
)

# Next, let's inspect
obesity_adults_2022


Unnamed: 0,country,Code,Year,obesity_rate,time
0,Afghanistan,AFG,2022,17.59419,2022
1,Albania,ALB,2022,26.58335,2022
2,Algeria,DZA,2022,24.24859,2022
3,Andorra,AND,2022,20.47006,2022
4,Angola,AGO,2022,10.53502,2022
...,...,...,...,...,...
188,Venezuela,VEN,2022,22.83876,2022
189,Vietnam,VNM,2022,2.08170,2022
190,Yemen,YEM,2022,11.55534,2022
191,Zambia,ZMB,2022,9.41690,2022


In [9]:
# Next, let's keep only columns we want, i.e. country and obesity rate

obesity_adults_2022_clean = obesity_adults_2022[["country", "obesity_rate"]]
obesity_adults_2022_clean

Unnamed: 0,country,obesity_rate
0,Afghanistan,17.59419
1,Albania,26.58335
2,Algeria,24.24859
3,Andorra,20.47006
4,Angola,10.53502
...,...,...
188,Venezuela,22.83876
189,Vietnam,2.08170
190,Yemen,11.55534
191,Zambia,9.41690


Similarly, let's inspect the values under the country column so that we can check them against the McDonald's dataset.

In [10]:
obesity_country = obesity_adults_2022_clean["country"].unique()
obesity_country

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia',
       'Cameroon', 'Canada', 'Cape Verde', 'Central African Republic',
       'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus',
       'Czechia', 'Democratic Republic of Congo', 'Denmark', 'Djibouti',
       'Dominica', 'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Eswatini', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon',
       'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Greenland',
       'Grenada', 'Guatemala', 'Guinea',

That's a lot of countries! To make sure that the country names in the McDonald's dataset match those in the obesity dataset, I used AI (ChatGPT) to generate code that checks for inconsistencies, so that I can correct any mismatches before merging the datasets. This helps me avoid missing or incorrect data in the analysis.

In [11]:
# Finding countries in McDonald's dataset not in obesity dataset
missing_in_obesity = [c for c in mcdonalds_country if c not in obesity_country]
print(
    "Countries in McDonald's dataset but missing in obesity dataset:",
    missing_in_obesity,
)

# Finding countries in obesity dataset not in McDonald's dataset
missing_in_mcd = [c for c in obesity_country if c not in mcdonalds_country]
print("Countries in obesity dataset but missing in McDonald's dataset:", missing_in_mcd)


Countries in McDonald's dataset but missing in obesity dataset: ['Aruba', 'Czech', 'Liechtenstein', 'Monaco']
Countries in obesity dataset but missing in McDonald's dataset: ['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda', 'Armenia', 'Bangladesh', 'Barbados', 'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Cape Verde', 'Central African Republic', 'Chad', 'Comoros', 'Congo', "Cote d'Ivoire", 'Czechia', 'Democratic Republic of Congo', 'Djibouti', 'Dominica', 'East Timor', 'Equatorial Guinea', 'Eritrea', 'Eswatini', 'Ethiopia', 'Gabon', 'Gambia', 'Ghana', 'Greenland', 'Grenada', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Iceland', 'Iran', 'Iraq', 'Jamaica', 'Jordan', 'Kenya', 'Kiribati', 'Kyrgyzstan', 'Laos', 'Lesotho', 'Liberia', 'Libya', 'Madagascar', 'Malawi', 'Maldives', 'Mali', 'Marshall Islands', 'Mauritania', 'Micronesia (country)', 'Mongolia', 'Montenegro', 'Mozambique', 'My
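The same check can also be written with Python sets, which makes the two directions of the comparison easier to read. This is only a sketch on tiny made-up lists (the variable names `mcd_countries` and `obesity_countries` are placeholders, not the notebook's arrays):

```python
import pandas as pd

# Hypothetical miniature versions of the two country columns
mcd_countries = pd.Series(["Andorra", "Czech", "Monaco", "Vietnam"])
obesity_countries = pd.Series(["Andorra", "Czechia", "Iceland", "Vietnam"])

# Set difference: names present on one side but not the other
only_in_mcd = sorted(set(mcd_countries) - set(obesity_countries))
only_in_obesity = sorted(set(obesity_countries) - set(mcd_countries))

print("Only in McDonald's data:", only_in_mcd)
print("Only in obesity data:", only_in_obesity)
```

Sorting the result keeps the printed lists in a stable, alphabetical order, which helps when scanning for near-duplicates such as two spellings of the same country.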

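Look at that: there is an inconsistency between the country names. The McDonald's dataset uses 'Czech' while the obesity dataset uses 'Czechia'. Let's correct that below. Aruba, Liechtenstein and Monaco appear only in the McDonald's dataset, so they have no obesity rate to match against.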

(On another note: I didn't know that there wasn't a McDonald's in Iceland)

In [12]:
mcdonalds_clean.replace(
    {
        "Czech": "Czechia",  # match obesity dataset
        # Add other replacements if needed
    },
    inplace=True,
)

# Validating
print("McDonald's countries after renaming:", mcdonalds_clean["country"].unique())


McDonald's countries after renaming: ['Andorra' 'Argentina' 'Aruba' 'Australia' 'Austria' 'Azerbaijan'
 'Bahamas' 'Bahrain' 'Belarus' 'Belgium' 'Brazil' 'Brunei' 'Bulgaria'
 'Canada' 'Chile' 'China' 'Colombia' 'Costa Rica' 'Croatia' 'Cuba'
 'Cyprus' 'Czechia' 'Denmark' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Estonia' 'Fiji' 'Finland' 'France' 'Georgia' 'Germany'
 'Greece' 'Guatemala' 'Honduras' 'Hungary' 'India' 'Indonesia' 'Ireland'
 'Israel' 'Italy' 'Japan' 'Kazakhstan' 'Kuwait' 'Latvia' 'Lebanon'
 'Liechtenstein' 'Lithuania' 'Luxembourg' 'Malaysia' 'Malta' 'Mauritius'
 'Mexico' 'Moldova' 'Monaco' 'Morocco' 'Netherlands' 'New Zealand'
 'Nicaragua' 'Norway' 'Oman' 'Pakistan' 'Panama' 'Paraguay' 'Peru'
 'Philippines' 'Poland' 'Portugal' 'Qatar' 'Romania' 'Russia' 'Samoa'
 'Saudi Arabia' 'Serbia' 'Singapore' 'Slovakia' 'Slovenia' 'South Africa'
 'South Korea' 'Spain' 'Sri Lanka' 'Suriname' 'Sweden' 'Switzerland'
 'Thailand' 'Turkey' 'Ukraine' 'United Arab Emirates' 'Unite



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
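The warning above appears because `mcdonalds_clean` was created by selecting columns from `mcdonalds` and was then modified in place. One common way to avoid it, sketched here on a small made-up table rather than the real data, is to take an explicit `.copy()` when selecting the columns and to assign the result of `replace` back instead of using `inplace=True`:

```python
import pandas as pd

# Hypothetical stand-in for the full McDonald's table
mcdonalds = pd.DataFrame(
    {
        "name": ["Store A", "Store B"],
        "country": ["Czech", "Andorra"],
        "telephone": ["000", "111"],
    }
)

# .copy() makes mcdonalds_clean an independent DataFrame
mcdonalds_clean = mcdonalds[["name", "country"]].copy()

# Limit the replacement to the country column and assign the result back
mcdonalds_clean["country"] = mcdonalds_clean["country"].replace({"Czech": "Czechia"})

print(mcdonalds_clean["country"].unique())
```

Restricting the replacement to the `country` column also avoids accidentally renaming a city or store that happens to contain the same value.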



Part 3: Combining datasets.
Looking good so far. We have cleaned both our McDonald's and obesity rate datasets for 2022. Next, let's combine them so that the data is ready for our map.

In [13]:
combined_dataset = pd.merge(mcdonalds_clean, obesity_adults_2022_clean, on="country")
combined_dataset


Unnamed: 0,name,country,subdivision,city,latitude,longitude,obesity_rate
0,McDonald's,Andorra,Andorra la Vella,Andorra la Vella,42.507875,1.524985,20.47006
1,McDonald's - Pas de la Casa,Andorra,Encamp,El Pas de la Casa,42.543500,1.732089,20.47006
2,McDonald's - Av. Tarragona,Andorra,Andorra la Vella,Andorra la Vella,42.505142,1.527843,20.47006
3,McDonald's - Epizen,Andorra,Sant Julià de Lòria,Sant Julià de Lòria,42.459812,1.489049,20.47006
4,McDonald's Andorra - Meritxell,Andorra,Andorra la Vella,Andorra la Vella,42.508503,1.533468,20.47006
...,...,...,...,...,...,...,...
39185,McDonald's Nguyễn Huệ,Vietnam,Ho Chi Minh City,Ho Chi Minh City,10.774377,106.700709,2.08170
39186,McDonald's Vivo City,Vietnam,Ho Chi Minh City,Ho Chi Minh City,10.729858,106.701097,2.08170
39187,McDonald’s Satra Phạm Hùng,Vietnam,Ho Chi Minh City,Ho Chi Minh City,10.733498,106.672177,2.08170
39188,McDonald's Giga Mall,Vietnam,Ho Chi Minh City,Ho Chi Minh City,10.818210,106.687410,2.08170
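The merged table ends at index 39188 while the cleaned McDonald's table ends at 39193, so the default inner join dropped a few store rows whose country has no match in the obesity data. A possible way to see which rows those are is a left join with `indicator=True`. The sketch below uses a tiny made-up `stores` table together with three obesity rates copied from the output above:

```python
import pandas as pd

# Hypothetical store rows; Monaco has no entry in the obesity table
stores = pd.DataFrame(
    {
        "name": ["Store A", "Store B", "Store C"],
        "country": ["Andorra", "Monaco", "Vietnam"],
    }
)
obesity = pd.DataFrame(
    {
        "country": ["Afghanistan", "Andorra", "Vietnam"],
        "obesity_rate": [17.59419, 20.47006, 2.08170],
    }
)

# how="left" keeps every store; _merge records whether a match was found
checked = stores.merge(obesity, on="country", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(unmatched["country"].value_counts())
```

Rows flagged `left_only` are the ones an inner join would silently discard, so this is a quick way to confirm that only the expected countries are lost.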


Part 4: Creating our choropleth map

This seems like it could get quite confusing, so let's take it step by step, adding one layer at a time.

In [14]:
import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"


In [15]:
# First, we need to import plotly
import plotly.express as px

# Create a base choropleth with only obesity rates
fig = px.choropleth(
    obesity_adults_2022_clean,
    locations="country",
    locationmode="country names",
    color="obesity_rate",
)
fig.show()


The library used by the *country names* `locationmode` option is changing in an upcoming version. Country names in existing plots may not work in the new version. To ensure consistent behavior, consider setting `locationmode` to *ISO-3*.



In [16]:
# Base Choropleth
fig = px.choropleth(
    obesity_adults_2022_clean,
    locations="country",
    locationmode="country names",
    color="obesity_rate",
    color_continuous_scale="Reds",
    title="Adult Obesity Rates by Country with McDonald's Locations",
)

# Add McDonald's locations
fig.add_scattergeo(
    lon=mcdonalds_clean["longitude"],
    lat=mcdonalds_clean["latitude"],
    text=mcdonalds_clean["name"],  # hover info
    mode="markers",
    marker=dict(size=3, color="blue", opacity=0.7),
    name="McDonald's",
)

fig.show()






This does not seem to be telling us much... let's try aggregating the McDonald's locations per country instead and sizing the markers by the number of outlets in each country. (Note: I also had to ask ChatGPT how to move forward with this exercise.)

In [17]:
# Next, plot choropleth and McDonald's markers. Initially I did this without resizing the markers, but the circles were too big.
# I had to depend on AI a LOT to help guide me to generate the following data visualisations.

import numpy as np

mcd_per_country = (
    combined_dataset.groupby("country")
    .agg(
        num_mcd=("name", "count"),
        avg_lat=("latitude", "mean"),
        avg_lon=("longitude", "mean"),
    )
    .reset_index()
)

# Use square root for initial scaling
mcd_per_country["marker_size"] = np.sqrt(mcd_per_country["num_mcd"])

# Then normalize to a fixed range, e.g., 5–30
min_size = 5
max_size = 30
mcd_per_country["marker_size"] = (
    mcd_per_country["marker_size"] - mcd_per_country["marker_size"].min()
) / (mcd_per_country["marker_size"].max() - mcd_per_country["marker_size"].min()) * (
    max_size - min_size
) + min_size

fig = px.choropleth(
    obesity_adults_2022_clean,
    locations="country",
    locationmode="country names",
    color="obesity_rate",
    color_continuous_scale="Reds",
    title="Adult Obesity Rates by Country with McDonald's Locations",
)

fig.update_coloraxes(colorbar_title="Obesity Rate")

fig.add_scattergeo(
    lon=mcd_per_country["avg_lon"],
    lat=mcd_per_country["avg_lat"],
    text=mcd_per_country["country"]
    + ": "
    + mcd_per_country["num_mcd"].astype(str)
    + " McDonald's",
    mode="markers",
    marker=dict(
        size=mcd_per_country["marker_size"],
        color="blue",
        opacity=0.7,
        line=dict(width=0),
    ),
    name="McDonald's per country",
)

fig.show()
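The marker scaling in the cell above (square root first, then a linear rescale to the 5 to 30 range) can be pulled into a small helper to make the arithmetic easier to follow. This is a sketch only: `scale_marker_sizes` is a hypothetical function name, the counts are made up, and the branch for identical counts is an extra safeguard that the cell above does not have.

```python
import numpy as np
import pandas as pd


def scale_marker_sizes(counts, min_size=5, max_size=30):
    """Square-root the counts, then rescale linearly to [min_size, max_size]."""
    roots = np.sqrt(counts.astype(float))
    spread = roots.max() - roots.min()
    if spread == 0:
        # All counts are equal: avoid dividing by zero, use the midpoint size
        return pd.Series((min_size + max_size) / 2, index=counts.index)
    return (roots - roots.min()) / spread * (max_size - min_size) + min_size


# Made-up store counts for three placeholder countries
counts = pd.Series([1, 100, 10000], index=["A", "B", "C"])
sizes = scale_marker_sizes(counts)
print(sizes)
```

The square root compresses the gap between small and very large counts, so the biggest country does not swamp the map, while the linear rescale keeps every marker within a readable size range.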






Conclusion:

Based on the choropleth map above, there seems to be some correlation between the number of McDonald's outlets in a country (illustrated by the size of the markers) and the obesity rate in that country.

Does our hypothesis hold? Maybe. As highlighted in the hypothesis, which came first? Is McDonald's driving up obesity rates, or are higher obesity rates driving up demand for fast food and therefore for McDonald's?
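The map gives a visual impression only. To put a number on the analysis question, one option is to count stores per country, join the obesity rate, and compute a Pearson correlation coefficient. The sketch below runs on entirely made-up rows (countries "A", "B", "C" and their rates are placeholders), so its result says nothing about the real data:

```python
import pandas as pd

# Made-up store rows and obesity rates, for illustration only
stores = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "B", "C"],
        "name": ["s1", "s2", "s3", "s4", "s5", "s6"],
    }
)
obesity = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "obesity_rate": [30.0, 10.0, 20.0],
    }
)

# One row per country: store count plus obesity rate
per_country = (
    stores.groupby("country")
    .agg(num_mcd=("name", "count"))
    .reset_index()
    .merge(obesity, on="country")
)

# Series.corr defaults to the Pearson correlation coefficient
r = per_country["num_mcd"].corr(per_country["obesity_rate"])
print(f"Pearson r = {r:.2f}")
```

A value near 1 or -1 would indicate a strong linear relationship and a value near 0 a weak one. Either way, the coefficient describes association only and does not settle the reverse causality question raised above.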

Thank you!