# DSCI 503 - Project 01

### Seif Kungulio

# Introduction

For this data analysis project, I will use Python to explore the Gapminder dataset, a collection of socioeconomic indicators covering 184 countries over 219 years, from 1800 to 2018. Using pandas together with basic Python lists and loops, I will look for trends and patterns in population, life expectancy, per capita GDP, and income inequality. The goal is to summarize how these indicators differ across countries and continental regions, with most of the analysis focused on the most recent year in the data, 2018.

The dataset is structured as a dataframe, with each row representing a single country during a specific year, and each column representing various data points about these countries. The dataframe includes the following columns:

- **country**: Names of 184 countries, each appearing 219 times—once for each year data was collected.
- **continent**: Continental region of each country, categorized into 'africa', 'americas', 'asia', and 'europe'.
- **year**: Year the data for that record was collected.
- **population**: Population of a given country in a particular year.
- **life_exp**: Average life expectancy in a given country for each year.
- **gdp_per_cap**: Per capita GDP of a country for a specific year, calculated as the total GDP divided by the population, indicating the average wealth of the country's citizens.
- **gini**: Gini score of a country for each year, measuring economic inequality on a scale from 0 to 100.
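
The structure described above (one row per country per year) implies 184 × 219 = 40,296 rows. The following is a minimal sketch of how that structure could be checked using plain Python lists; the helper name `check_balanced_panel` and the toy data are illustrative assumptions, not part of the project code.

```python
def check_balanced_panel(countries, years):
    """Return True if every country appears exactly once for every year."""
    pairs = list(zip(countries, years))
    n_countries = len(set(countries))
    n_years = len(set(years))
    # No duplicate (country, year) pairs, and every combination is present
    return len(set(pairs)) == len(pairs) == n_countries * n_years

# Toy data standing in for the real columns (values are illustrative only)
toy_countries = ['A', 'B', 'A', 'B', 'A', 'B']
toy_years = [1800, 1800, 1801, 1801, 1802, 1802]

balanced = check_balanced_panel(toy_countries, toy_years)
expected_rows = 184 * 219  # row count implied by the description above
print(balanced, expected_rows)  # True 40296
```

With the real data, the same function could be called on the `country` and `year` columns converted to lists.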

In [1]:
import pandas as pd
df = pd.read_csv('gapminder_data.txt', sep='\t')
df.head(10)


| Unnamed: 0 | country | year | continent | population | life_exp | gdp_per_cap | gini |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1800 | asia | 3280000 | 28.2 | 603 | 30.5 |
| 1 | Albania | 1800 | europe | 410000 | 35.4 | 667 | 38.9 |
| 2 | Algeria | 1800 | africa | 2500000 | 28.8 | 715 | 56.2 |
| 3 | Angola | 1800 | africa | 1570000 | 27.0 | 618 | 57.2 |
| 4 | Antigua and Barbuda | 1800 | americas | 37000 | 33.5 | 757 | 40.0 |
| 5 | Argentina | 1800 | americas | 534000 | 33.2 | 1510 | 47.7 |
| 6 | Armenia | 1800 | europe | 413000 | 34.0 | 514 | 31.5 |
| 7 | Australia | 1800 | asia | 351000 | 34.0 | 814 | 38.7 |
| 8 | Austria | 1800 | europe | 3210000 | 34.4 | 1850 | 33.4 |
| 9 | Azerbaijan | 1800 | europe | 880000 | 29.2 | 775 | 70.5 |


In [2]:
country = df.country.to_list()
continent = df.continent.to_list()
year = df.year.to_list()
population = df.population.to_list()
life_exp = df.life_exp.to_list()
pcgdp = df.gdp_per_cap.to_list()
gini = df.gini.to_list()


In [3]:
N = 9

print('Country:          ', country[N])
print('Continent:        ', continent[N])
print('Year:             ', year[N])
print('Population:       ', population[N])
print('Life Expectancy:  ', life_exp[N])
print('Per Capita GDP:   ', pcgdp[N])
print('Gini Index:       ', gini[N])


Country:           Azerbaijan
Continent:         europe
Year:              1800
Population:        880000
Life Expectancy:   29.2
Per Capita GDP:    775
Gini Index:        70.5


# Part 1: Displaying Past 20 Years of US Data

In this section, I will retrieve and display the data for the United States from the Gapminder dataframe, focusing on the twenty most recent years in the dataset (1999 to 2018). The information will include the year, country, population, life expectancy, GDP per capita, and Gini index.

In [4]:
# Columns to display, in output order
desired_columns = ['year', 'country', 'population', 'life_exp', 'gdp_per_cap', 'gini']

# Print the header
print("Year \tCountry \tPopulation\tLExp \tpcGDP \tGini")
print("------------------------------------------------------------")

# Keep only the United States rows from 1999 onward
us_recent = df[(df['country'] == 'United States') & (df['year'] >= 1999)]

# Print the selected columns of each row as tab-separated values
for index, row in us_recent.iterrows():
    print('\t'.join(str(row[col]) for col in desired_columns))
        

Year 	Country 	Population	LExp 	pcGDP 	Gini
------------------------------------------------------------
1999	United States	279000000	76.8	44700	40.5
2000	United States	282000000	76.9	46000	40.5
2001	United States	285000000	77.0	46000	40.5
2002	United States	288000000	77.1	46400	40.5
2003	United States	290000000	77.3	47300	40.5
2004	United States	293000000	77.5	48600	40.6
2005	United States	295000000	77.7	49800	40.7
2006	United States	298000000	77.9	50600	40.8
2007	United States	301000000	78.1	51000	40.8
2008	United States	303000000	78.3	50400	40.8
2009	United States	306000000	78.6	48600	40.7
2010	United States	309000000	78.7	49400	40.7
2011	United States	311000000	78.8	49800	40.7
2012	United States	313000000	78.9	50500	40.8
2013	United States	316000000	78.9	51000	41.0
2014	United States	318000000	78.9	51800	41.2
2015	United States	320000000	78.8	52800	41.3
2016	United States	322000000	78.8	53300	41.4
2017	United States	324000000	79.0	54200	41.5
2018	United States	327000000	79.1	54900	

# Part 2: Selecting the 2018 Data

In this part of the project, I will extract and analyze the data specifically for the year 2018. This will involve creating separate lists for each relevant variable, filtered to include only records from 2018.

In [5]:
# Empty lists to store data for 2018
country_2018 = []
population_2018 = []
continent_2018 = []
life_exp_2018 = []
pcgdp_2018 = []
gini_2018 = []

# Loop through each row index
for i in range(len(df)):
    row = df.iloc[i] # Access the row using its index
    if row['year'] == 2018: # Check if the year is 2018
        country_2018.append(row['country'])
        population_2018.append(row['population'])
        continent_2018.append(row['continent'])
        life_exp_2018.append(row['life_exp'])
        pcgdp_2018.append(row['gdp_per_cap'])
        gini_2018.append(row['gini'])

# Calculate total population in 2018
global_population_2018 = sum(population_2018)

# Print the result
print("The global population in 2018 was", global_population_2018)


The global population in 2018 was 7595200200
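
The row-by-row filter above can also be written against the parallel lists created in cell 2. The following is a minimal sketch under that assumption; the helper name `total_for_year` and the toy values are hypothetical and only illustrate the idea.

```python
def total_for_year(years, values, target_year):
    """Sum the entries of `values` whose matching entry in `years` equals target_year."""
    return sum(v for y, v in zip(years, values) if y == target_year)

# Toy data standing in for the `year` and `population` lists (illustrative values)
toy_year = [2017, 2017, 2018, 2018]
toy_population = [100, 200, 110, 210]

print(total_for_year(toy_year, toy_population, 2018))  # 320
```

Because `zip` pairs the lists position by position, this relies on the lists keeping the same row order as the dataframe, which holds when they are all built from the same dataframe as in cell 2.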


# Part 3: Identifying Countries with Largest and Smallest Populations

In this part, I will identify the countries with the largest and smallest populations in 2018. This involves sorting the 2018 population data to find the ten largest and ten smallest values; to keep the output short, each cell prints only the first two matching records.

In [6]:
# Filter the DataFrame for the year 2018
df_2018 = df[df['year'] == 2018]

# Create a list of populations in 2018
population_2018 = df_2018['population'].tolist()

# Create a sorted copy of the population list
sorted_population_2018 = sorted(population_2018, reverse=True)

# Determine the population of the country with the tenth largest population
tenth_largest_population = sorted_population_2018[9]

# Print the header lines
print("Countries with Largest Populations in 2018")
print("------------------------------------------------------")

# Counter for the number of rows printed so far
counter = 0

# Loop through the 2018 rows in dataframe order and print each country whose
# population is at least the tenth largest value
for index, row in df_2018.iterrows():
    if row['population'] >= tenth_largest_population:
        print(f"The population of {row['country']} in 2018 was {row['population']}.")
        counter += 1 # Count the printed row
        if counter == 2: # Stop after two rows to keep the output short
            break


Countries with Largest Populations in 2018
------------------------------------------------------
The population of Bangladesh in 2018 was 166000000.
The population of Brazil in 2018 was 211000000.


In [7]:
# Filter the data for the year 2018
df_2018 = df[df['year'] == 2018]

# Sort the data by population and select the ten countries with the smallest populations
sorted_df = df_2018.sort_values(by='population')
smallest_populations = sorted_df.head(10)

# Determine the population of the country with the tenth smallest population
tenth_smallest_population = smallest_populations.iloc[-1]['population']

# Print the header and separator
print("Countries with Smallest Populations in 2018")
print("------------------------------------------------------")

# Counter for the number of rows printed so far
counter = 0

# Loop through the ten smallest countries, from smallest upward, and print the information
for _, row in smallest_populations.iterrows():
    print(f"The population of {row['country']} in 2018 was {row['population']}.")
    counter += 1 # Count the printed row
    if counter == 2: # Stop after two rows to keep the output short
        break


Countries with Smallest Populations in 2018
------------------------------------------------------
The population of Seychelles in 2018 was 95200.
The population of Antigua and Barbuda in 2018 was 103000.
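
The cell for the largest populations prints qualifying rows in dataframe order, so its output is not ranked by size. If a ranked listing is wanted, one option is to sort (name, value) pairs directly. The following is a minimal sketch; `rank_by_value` and the toy data are hypothetical names and values used only for illustration.

```python
def rank_by_value(names, values, n, largest=True):
    """Return the n (name, value) pairs with the largest or smallest values, in ranked order."""
    pairs = sorted(zip(names, values), key=lambda pair: pair[1], reverse=largest)
    return pairs[:n]

# Toy data standing in for the 2018 country and population lists (illustrative values)
toy_country = ['A', 'B', 'C', 'D']
toy_population = [50, 10, 40, 20]

top_two = rank_by_value(toy_country, toy_population, 2)
bottom_two = rank_by_value(toy_country, toy_population, 2, largest=False)

for name, pop in top_two:
    print(f"The population of {name} in 2018 was {pop}.")
```

Sorting with a `key` on the value alone avoids comparing country names when two populations are equal.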


# Part 4: Identifying Countries with Highest and Lowest Life Expectancies

In this section, I will identify and display the countries with the highest and lowest life expectancies in 2018. The goal is to provide a clear comparison of global health and longevity by highlighting these extremes.

In [8]:
# Filter the data for the year 2018
df_2018 = df[df['year'] == 2018]

# Sort the data by life expectancy in descending order
df_sorted = df_2018.sort_values(by='life_exp', ascending=False)

# Select the top 10 countries with the highest life expectancy
top_10_countries = df_sorted.head(10)

# Print the information in the required format
print("Countries with Highest Life Expectancy in 2018")
print("------------------------------------------------------")

# Counter for the number of rows printed so far
counter = 0

for index, row in top_10_countries.iterrows():
    print(f"The life expectancy of {row['country']} in 2018 was {row['life_exp']}.")
    counter += 1 # Count the printed row
    if counter == 2: # Stop after two rows to keep the output short
        break


Countries with Highest Life Expectancy in 2018
------------------------------------------------------
The life expectancy of Japan in 2018 was 84.2.
The life expectancy of Singapore in 2018 was 84.0.


In [9]:
# Filter the data for the year 2018
df_2018 = df[df['year'] == 2018]

# Sort the data by life expectancy in ascending order
df_2018_sorted = df_2018.sort_values(by='life_exp')

# Select the top ten countries with the lowest life expectancies
lowest_life_expectancy_countries = df_2018_sorted.head(10)

# Print the required output
print("Countries with Lowest Life Expectancy in 2018")
print("------------------------------------------------------")

# Counter for the number of rows printed so far
counter = 0

for index, row in lowest_life_expectancy_countries.iterrows():
    print(f"The life expectancy of {row['country']} in 2018 was {row['life_exp']}.")
    counter += 1 # Count the printed row
    if counter == 2: # Stop after two rows to keep the output short
        break


Countries with Lowest Life Expectancy in 2018
------------------------------------------------------
The life expectancy of Lesotho in 2018 was 51.1.
The life expectancy of Central African Republic in 2018 was 51.6.


# Part 5: Calculating GDP by Country

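In this part of the project, I will work with the 2018 GDP data. Because the dataset stores per capita GDP, a country's total GDP is its per capita GDP multiplied by its population; the cells below operate on the per capita values directly.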

In [10]:
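# Create an empty list to store the 2018 per capita GDP values
pcgdp_values_2018 = []

# Loop through each row of the dataframe and collect the 2018 values.
# Note: gdp_per_cap is a per capita figure, so this sum is not a total GDP;
# a country's total GDP would be gdp_per_cap multiplied by population.
for index, row in df.iterrows():
    if row["year"] == 2018:
        pcgdp_values_2018.append(row["gdp_per_cap"])

# Sum the per capita GDP values
pcgdp_sum_2018 = sum(pcgdp_values_2018)

# Print the result
print(f"The sum of per capita GDP values across countries in 2018 was ${pcgdp_sum_2018}.")

The sum of per capita GDP values across countries in 2018 was $3297010.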
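
Since `gdp_per_cap` is a per capita figure, a total GDP requires weighting each value by the country's population before summing. The following is a minimal sketch of that calculation on toy numbers; the function name `total_gdp` and the values are illustrative assumptions, and no result for the real dataset is implied.

```python
def total_gdp(per_capita_gdp, population):
    """Sum per capita GDP times population across all countries."""
    return sum(g * p for g, p in zip(per_capita_gdp, population))

# Toy data standing in for the 2018 per capita GDP and population lists
toy_pcgdp = [1000, 50000]
toy_population = [2_000_000, 1_000_000]

toy_total = total_gdp(toy_pcgdp, toy_population)
print(f"The total GDP of the toy countries is ${toy_total}.")  # $52000000000
```

With the lists built in Part 2, the same function could be applied to `pcgdp_2018` and `population_2018`.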


In [11]:
# Filter the DataFrame to include only the data for the year 2018
df_2018 = df[df['year'] == 2018]

# Find the maximum and minimum per capita GDP values in 2018
max_pcgdp = df_2018['gdp_per_cap'].max()
min_pcgdp = df_2018['gdp_per_cap'].min()

# Look up the country names for the maximum and minimum values
country_max_pcgdp = df_2018[df_2018['gdp_per_cap'] == max_pcgdp].iloc[0]['country']
country_min_pcgdp = df_2018[df_2018['gdp_per_cap'] == min_pcgdp].iloc[0]['country']

# Print the results
print(f"The country with the highest per capita GDP in 2018 was {country_max_pcgdp} with a per capita GDP of {max_pcgdp}.")
print(f"The country with the lowest per capita GDP in 2018 was {country_min_pcgdp} with a per capita GDP of {min_pcgdp}.")


The country with the highest per capita GDP in 2018 was Qatar with a per capita GDP of 121000.
The country with the lowest per capita GDP in 2018 was Somalia with a per capita GDP of 629.


# Part 6: Grouping by Continent

In this part of the project, I will calculate the total population, the population-weighted average per capita GDP, and the population-weighted average life expectancy for each of the four continental regions in 2018.

In [12]:
# Continent names
continents = ["africa", "americas", "asia", "europe"]

# Empty lists for results
pcgdp_2018_by_continent = []
pop_2018_by_continent = []
life_exp_2018_by_continent = []

# Loop through continents (the loop variable is named region so that it does
# not overwrite the continent list created earlier)
for region in continents:
    temp_life_exp = 0
    temp_pop = 0
    temp_gdp = 0

    # Loop through each row and accumulate the 2018 values for this region
    for index, row in df.iterrows():
        if row["year"] == 2018 and row["continent"].lower() == region:
            temp_pop += row["population"]
            temp_gdp += row["gdp_per_cap"] * row["population"]  # Total GDP (per capita GDP weighted by population)
            temp_life_exp += row["life_exp"] * row["population"]  # Population-weighted sum of life expectancy

    # Calculate averages and round values
    if temp_pop > 0:
        temp_life_exp /= temp_pop
        temp_gdp /= temp_pop
        temp_gdp = round(temp_gdp)
        temp_life_exp = round(temp_life_exp, 1)

    # Append results to lists
    pcgdp_2018_by_continent.append(temp_gdp)
    pop_2018_by_continent.append(temp_pop)
    life_exp_2018_by_continent.append(temp_life_exp)

# Print results in formatted table
print(f"{'Continent':<10} {'Population':>10} {'pcGDP':>10} {'Life Exp':>10}")
print("-" * 43)
for i in range(len(continents)):
    print(f"{continents[i].capitalize():<10} {pop_2018_by_continent[i]:>10} {pcgdp_2018_by_continent[i]:>10} {life_exp_2018_by_continent[i]:>10.1f}")


Continent  Population      pcGDP   Life Exp
-------------------------------------------
Africa     1287150200       4700       65.9
Americas   1010978000      28708       77.5
Asia       4455113000      12705       73.2
Europe      841959000      31534       78.4
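
The regional figures above are population-weighted averages: for a region, the weighted mean of a quantity x is the sum of x times population divided by the sum of population. The following is a minimal sketch that accumulates these sums for all groups in a single pass over parallel lists; the helper name `weighted_mean_by_group` and the toy data are hypothetical and not taken from the project.

```python
from collections import defaultdict

def weighted_mean_by_group(groups, values, weights):
    """Return {group: sum(value * weight) / sum(weight)} for each group."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for g, v, w in zip(groups, values, weights):
        weighted_sum[g] += v * w
        weight_total[g] += w
    # Skip groups whose total weight is zero to avoid dividing by zero
    return {g: weighted_sum[g] / weight_total[g]
            for g in weighted_sum if weight_total[g] > 0}

# Toy data: two regions, life expectancy values, and populations as weights
toy_region = ['x', 'x', 'y']
toy_life_exp = [70.0, 80.0, 60.0]
toy_pop = [1, 3, 5]

result = weighted_mean_by_group(toy_region, toy_life_exp, toy_pop)
print(result)  # {'x': 77.5, 'y': 60.0}
```

A single pass like this visits each row once, whereas the nested loops above scan the full dataframe once per region.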
