In [64]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import plotly.express as px

# PART 1

### Question 1

#### Read in the 'gapminder_clean.csv' data as a 'pandas' 'DataFrame'.

In [65]:
# Read in gapminder_clean.csv data as a pandas DataFrame
df = pd.read_csv('gapminder_clean.csv')

# Drop rows with a missing gdpPercap, then rows with a missing CO2 emissions value
df_g = df.dropna(subset=['gdpPercap'])
df_gc = df_g.dropna(subset=['CO2 emissions (metric tons per capita)'])


### Question 2

#### Filter the data to include only rows where 'Year' is '1962' and then make a scatter plot comparing 'CO2 emissions (metric tons per capita)' and 'gdpPercap' for the filtered data.

In [66]:
# Filter the data
df_1962 = df_gc[df_gc['Year'] == 1962]

fig = px.scatter(df_1962, x='gdpPercap', y='CO2 emissions (metric tons per capita)', log_x=True, log_y=True)
fig.show()


##### Explaining the code:
We first use the pandas library (pd) to read "gapminder_clean.csv" into a DataFrame called 'df'. We then drop rows with missing values (NaNs) in two steps: 'df_g' keeps only the rows that have a value in the 'gdpPercap' column, and 'df_gc' additionally requires a value in the 'CO2 emissions (metric tons per capita)' column, so 'df_gc' is complete in both columns of interest. Next, we filter 'df_gc' to the rows where the 'Year' column equals 1962 and store the result in a new DataFrame, 'df_1962'. Finally, we create a scatter plot of the filtered data with Plotly Express, with 'gdpPercap' on the x-axis and 'CO2 emissions (metric tons per capita)' on the y-axis. Both axes are log-scaled to make the plot easier to read, and we then show the plot.

### Question 3

#### On the filtered data, calculate the correlation of 'CO2 emissions (metric tons per capita)' and 'gdpPercap'. What is the correlation and associated p value?

In [67]:
correlation, p = stats.pearsonr(df_1962['CO2 emissions (metric tons per capita)'], df_1962['gdpPercap'])

print(f'The correlation is {correlation:.4f}, and the associated p-value is {p}.')

The correlation is 0.9261, and the associated p-value is 1.1286792210038754e-46.


##### Explaining the code:
We use the pearsonr() function from SciPy's stats module (scipy.stats) to calculate the Pearson correlation coefficient between 'CO2 emissions (metric tons per capita)' and 'gdpPercap' in the 1962 data. We pass in the two columns and get back the correlation coefficient and its p-value. Here the coefficient of 0.9261 indicates a strong positive linear relationship.
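To make the statistic concrete, here is a minimal sketch of the formula behind the Pearson coefficient, written in plain Python on made-up numbers. The helper name `pearson_r` and the toy values are illustrative assumptions and are not part of the analysis above. Unlike `stats.pearsonr`, this sketch does not compute a p-value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values, for illustration only
gdp = [1.0, 2.0, 3.0, 4.0]
co2_up = [2.0, 4.0, 6.0, 8.0]      # rises exactly in step with gdp
co2_down = [8.0, 6.0, 4.0, 2.0]    # falls exactly in step with gdp

r_up = pearson_r(gdp, co2_up)
r_down = pearson_r(gdp, co2_down)
print(r_up, r_down)
```

A perfectly increasing linear relationship gives a coefficient of 1 and a perfectly decreasing one gives -1. Real data such as the 1962 subset falls somewhere in between.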

### Question 4

#### On the unfiltered data, answer "In what year is the correlation between 'CO2 emissions (metric tons per capita)' and 'gdpPercap' the strongest?" Filter the dataset to that year for the next step...

In [68]:
# Determine year with the strongest correlation
df_year = df_gc.groupby('Year')

# Apply pearsonr to each year's rows, keeping only the correlation coefficient
corr = df_year.apply(lambda x: stats.pearsonr(x['CO2 emissions (metric tons per capita)'], x['gdpPercap'])[0])

# Find the strongest correlation
max_corr = corr.abs().idxmax()

print(f"The year with the strongest correlation is {max_corr}, with a correlation of {corr.abs().max():.4f}")

df_filtered = df[df['Year'] == max_corr]

The year with the strongest correlation is 1967, with a correlation of 0.9388


##### Explaining the code:
We first group the 'df_gc' DataFrame by year using groupby() and assign the resulting GroupBy object to 'df_year'. We then use apply() to run pearsonr() on each year's rows, keeping only the correlation coefficient (the first element of the result), which gives one coefficient per year. We take absolute values so that a strong negative correlation would also count, and use idxmax() to find the year with the largest magnitude, which we assign to 'max_corr'. Despite its name, 'max_corr' holds the year, not the coefficient. Finally, we create a new DataFrame, 'df_filtered', that contains only the rows of the original 'df' for that year.
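The role of the absolute value can be seen in a small sketch. The coefficients below are hypothetical, not taken from the dataset, and plain Python dictionaries stand in for the pandas Series used in the analysis.

```python
# Hypothetical per-year correlation coefficients (not taken from the dataset)
corr_by_year = {1962: 0.60, 1967: -0.90, 1972: 0.75}

# Largest raw value: overlooks the strong negative correlation in 1967
largest_raw = max(corr_by_year, key=corr_by_year.get)

# Largest magnitude: the selection that corr.abs().idxmax() performs
strongest = max(corr_by_year, key=lambda year: abs(corr_by_year[year]))

print(largest_raw, strongest)
```

With these made-up values, the raw maximum picks 1972 while the magnitude-based choice picks 1967, which is why the analysis takes the absolute value before calling idxmax().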

### Question 5

#### Using 'plotly' or 'bokeh', create an interactive scatter plot comparing 'CO2 emissions (metric tons per capita)' and 'gdpPercap', where the point size is determined by 'pop' (population) and the color is determined by the 'continent'.

In [69]:
# Create the scatter plot
fig = px.scatter(df_filtered, x="gdpPercap", y="CO2 emissions (metric tons per capita)", size="pop", color="continent", log_x=True, log_y=True)

fig.show()

##### Explaining the code:
Here, we create a scatter plot with the scatter() function from Plotly Express (px). We pass in the filtered DataFrame 'df_filtered' as the first argument and specify which columns to use for the x and y axes. The 'size' parameter scales each marker by the 'pop' column (population), and the 'color' parameter colors the markers by the 'continent' column. We log-scale both axes for easier visualization of the plot.

# PART 2

### Question 1

#### What is the relationship between 'continent' and 'Energy use (kg of oil equivalent per capita)'? (Stats test needed)

--------------------------------------------------------------------------
##### Before we begin:

To analyze the relationship between continent and "Energy use (kg of oil equivalent per capita)", we can use a one-way analysis of variance (ANOVA) test.

The one-way ANOVA test compares the means of three or more groups to determine whether there is a statistically significant difference between them. Our groups happen to be the continents, and the means we are comparing are those of the continents' respective energy use.

The null hypothesis expects there to be no difference between the means of the different continents in terms of energy use, while the alternative hypothesis is that there is a significant difference between the means.

The F-statistic is the ratio of the variance between the group means to the variance within the groups. We compare it to a critical value from the F-distribution, which depends on the significance level and on the degrees of freedom: k - 1 between groups and N - k within groups, for k groups and N observations in total. If the F-statistic is greater than the critical value at our chosen level of significance (we will use alpha = 0.05), we reject the null hypothesis and conclude that there is a significant difference between the means of the different continents.
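As a rough illustration of how the F-statistic is assembled, the sketch below computes it by hand for three small made-up groups. The function name `one_way_f` and the toy numbers are assumptions for illustration. The analysis itself relies on `stats.f_oneway`, which also returns the p-value.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-groups sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-groups sum of squares: spread of observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Made-up groups, for illustration only
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_f(toy_groups)
print(f_stat, df_b, df_w)
```

For these toy groups the between-groups mean square is 13 and the within-groups mean square is 1, so F is 13 with 2 and 6 degrees of freedom.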

In [70]:
# Drop NaNs from the energy use subset
df_e = df.dropna(subset=['Energy use (kg of oil equivalent per capita)'])

# Filter for the continents and their respective energy usage
asia = df_e[df_e['continent'] == 'Asia']['Energy use (kg of oil equivalent per capita)']
europe = df_e[df_e['continent'] == 'Europe']['Energy use (kg of oil equivalent per capita)']
africa = df_e[df_e['continent'] == 'Africa']['Energy use (kg of oil equivalent per capita)']
americas = df_e[df_e['continent'] == 'Americas']['Energy use (kg of oil equivalent per capita)']
oceania = df_e[df_e['continent'] == 'Oceania']['Energy use (kg of oil equivalent per capita)']

# Get sample sizes for each continent
n1 = len(asia)
n2 = len(europe)
n3 = len(africa)
n4 = len(americas)
n5 = len(oceania)

k = 5    # number of groups

N = n1 + n2 + n3 + n4 + n5   # total sample size

df_between = k - 1   # degrees of freedom for between-groups variance

df_total = N - 1   # total degrees of freedom (so that df_within below equals N - k)

df_within = df_total - df_between   # degrees of freedom for within-groups variance

alpha = 0.05  # specify the level of significance

# use the f.ppf() function to find the critical F-value
critical_value = stats.f.ppf(1 - alpha, df_between, df_within)

# use the f_oneway() function to run the oneway ANOVA on our dataset
f, p = stats.f_oneway(asia, europe, africa, americas, oceania)

print(f'f-statistic: {f:.4f}\ncritical f-statistic: {critical_value:.4f}\np-value: {p}\n')
print(f'\nf-statistic > critical f-statistic: {f > critical_value}')

f-statistic: 51.4592
critical f-statistic: 2.3825
p-value: 8.527003487154367e-39


f-statistic > critical f-statistic: True


In [71]:
# Boxplot visualization
fig = px.box(df_e, x='continent', y='Energy use (kg of oil equivalent per capita)', log_y=True)

# Show the plot
fig.show()

##### Explaining the code:
First, we drop all rows with NaN values in the 'Energy use (kg of oil equivalent per capita)' column and assign the resulting DataFrame to 'df_e'. We then filter 'df_e' by continent and assign each continent's energy-use column (a pandas Series) to its own variable: 'asia', 'europe', 'africa', 'americas', and 'oceania'. From the sample sizes we compute the total sample size and the between-groups and within-groups degrees of freedom, which give the critical F-value at alpha = 0.05. We then run the one-way ANOVA with f_oneway() to obtain the F-statistic and the p-value. Finally, we visualize the data with a box plot on a log-scaled y-axis, and we can now analyze our F-statistic, our critical F-value, and our p-value.

##### Results
We can conclude that mean energy use differs significantly between the continents: our F-statistic (51.4592) is far greater than the critical value (about 2.38). Equivalently, our p-value is far below our significance level of 0.05, so a difference this large would be very unlikely to occur by chance if the continent means were truly equal.

### Question 2
#### Is there a significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' in the years after 1990? (Stats test needed)

-------------------------------------------------------------------------
##### Before we begin:
To test whether there is a significant difference between Europe and Asia with respect to "Imports of goods and services (% of GDP)" in the years after 1990, we will use a two-sample t-test.

The t-test allows us to compare the means of two samples and determine whether any observed difference is statistically significant. In our case, the samples are the import values for Europe and Asia. Our null hypothesis is that there is no difference in mean imports of goods and services between Europe and Asia, while our alternative hypothesis is that the means differ.

If the p-value from the t-test is less than our chosen significance level (we will once again use 0.05), we reject the null hypothesis; if the p-value is greater than the significance level, we fail to reject the null hypothesis.
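The statistic itself can be sketched by hand. The version below is Student's two-sample t with a pooled variance, which is what `stats.ttest_ind` computes under its default equal-variance setting. The function name `pooled_t` and the sample values are made up for illustration, and the sketch stops short of the p-value.

```python
import math

def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance, plus its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    # Unbiased sample variances
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, dof

# Made-up samples, for illustration only
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 3, 4, 5, 6]
t_stat, dof = pooled_t(sample_a, sample_b)
print(t_stat, dof)
```

The sign of t only reflects the order of the arguments: it is negative here because the first sample has the smaller mean.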

In [72]:
# Filter df to include only data for the years after 1990
df_after1990 = df[df['Year'] > 1990].dropna(subset=['Imports of goods and services (% of GDP)'])


# Filter the resulting df to include imports data for Europe and Asia
df_europe = df_after1990[df_after1990['continent'] == 'Europe']['Imports of goods and services (% of GDP)']
df_asia = df_after1990[df_after1990['continent'] == 'Asia']['Imports of goods and services (% of GDP)']

# Calculate the test statistic
t, p = stats.ttest_ind(df_europe, df_asia)

print(f't-statistic: {t}\np-value: {p}')

t-statistic: -1.4185256887958868
p-value: 0.15751969325554196


In [73]:
# Create a dataframe with the imports data for Europe and Asia
df_imports = df_after1990[df_after1990['continent'].isin(['Europe', 'Asia'])]

# Create a histogram of the imports data by continent
fig = px.histogram(df_imports, x='Imports of goods and services (% of GDP)', 
                   color='continent', marginal='box',
                   nbins=20, opacity=0.7)

# Show the plot
fig.show()

##### Explaining the code:
First, we filter the DataFrame 'df' to only include data for the years after 1990 and drop any rows with null values in the column 'Imports of goods and services (% of GDP)'. The resulting DataFrame is assigned to 'df_after1990'. We then filter 'df_after1990' by continent and keep only the imports column, which gives two pandas Series, 'df_europe' and 'df_asia'. We calculate the t-statistic and the corresponding p-value using the ttest_ind() function from SciPy's stats module (scipy.stats), passing in the two Series. Finally, we visualize the data with a histogram to check that our results make sense.

##### Results:
We cannot conclude that imports of goods and services differ between Europe and Asia after 1990: our p-value (about 0.158) is greater than our chosen significance level of 0.05, so we fail to reject the null hypothesis. A difference of the observed size could plausibly arise by chance.

### Question 3
#### What is the country (or countries) that has the highest 'Population density (people per sq. km of land area)' across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

In [74]:
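# Rows with a missing density need no explicit drop here: rank() leaves NaN
# values unranked (na_option='keep') and mean() skips them.

# Rank countries by population density within each year (rank 1 = densest)
df["pop_density_rank"] = df.groupby("Year")["Population density (people per sq. km of land area)"].rank(ascending=False)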

# Calculate the mean ranking across all years for each country
avg_rank = df.groupby("Country Name")["pop_density_rank"].mean()

# Find the country or countries with the highest mean ranking
max_rank = avg_rank[avg_rank == avg_rank.min()]

print(f'The country (or countries) with the highest mean ranking based on population density is/are {list(max_rank.index)} with a ranking of {avg_rank.min()} out of {len(avg_rank)} available countries.')

The country (or countries) with the highest mean ranking based on population density is/are ['Macao SAR, China', 'Monaco'] with a ranking of 1.5 out of 263 available countries.


In [75]:
# Calculate the mean population density for each country
avg_density = df.groupby("Country Name")["Population density (people per sq. km of land area)"].mean().reset_index()
avg_density = avg_density.sort_values(by="Population density (people per sq. km of land area)", ascending=False)

# Create a bar chart showing the mean population density for each country
fig = px.bar(avg_density, y="Country Name", x="Population density (people per sq. km of land area)",
             labels={"Country Name": "Country", "Population density (people per sq. km of land area)": "Mean population density"},
             title="Mean Population Density by Country", log_x=True)
fig.show()


##### Explaining the code:
We rank countries by population density within each year (rank 1 is the densest) and store the ranks in a new column. Rows with a missing density do not affect the result, because rank() leaves missing values unranked and mean() skips them. We then calculate each country's mean rank across all years and select the country or countries with the best (lowest) mean rank. Macao SAR, China and Monaco tie with a mean rank of 1.5. We also create a bar chart of each country's mean population density across the years to get a better sense of the ranking. The x-axis is log-scaled for easier visualization.
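The rank-then-average logic can be sketched on a toy example in plain Python. The country labels and densities below are made up, and the sketch assumes no ties within a year (pandas' rank() would assign tied values their average rank by default).

```python
from collections import defaultdict

# Made-up densities: {year: {country: density}}, for illustration only
density = {
    2000: {"A": 500.0, "B": 300.0, "C": 10.0},
    2005: {"A": 400.0, "B": 450.0, "C": 12.0},
}

ranks = defaultdict(list)
for year, by_country in density.items():
    # Sort densest first, so rank 1 is the most densely populated country that year
    ordered = sorted(by_country, key=by_country.get, reverse=True)
    for position, country in enumerate(ordered, start=1):
        ranks[country].append(position)

# Mean rank per country across years; the lowest mean rank is the "highest" ranking
mean_rank = {country: sum(r) / len(r) for country, r in ranks.items()}
best = min(mean_rank.values())
top_countries = sorted(c for c, r in mean_rank.items() if r == best)
print(mean_rank, top_countries)
```

In this toy data, A and B swap first and second place between the two years, so both end up with a mean rank of 1.5 and share the top spot.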

### Question 4
#### What country (or countries) has shown the greatest increase in 'Life expectancy at birth, total (years)' between 1962 and 2007?

In [76]:
# Get the dataframe for the years between 1962 and 2007
df_year = df[df['Year'] <= 2007]
df_year = df_year[df_year['Year'] >= 1962]

# Pivot the dataframe to have separate columns for each year
df_pivot = df_year.pivot(index='Country Name', columns='Year', values='Life expectancy at birth, total (years)')

# Calculate the difference in life expectancy
df_pivot['diff'] = df_pivot[2007] - df_pivot[1962]

# Find the largest increase in life expectancy (NaN differences are skipped)
max_increase = np.max(df_pivot['diff'])

# Find the country (or countries) that correspond to the max increase in life expectancy
highest_increase_countries = df_pivot[df_pivot['diff'] == max_increase].index.tolist()

print(f'The country (or countries) with the greatest increase in life expectancy is/are {highest_increase_countries}, with an increase in expectancy of {max_increase:.3f} years.')

The country (or countries) with the greatest increase in life expectancy is/are ['Maldives'], with an increase in expectancy of 36.916 years.


In [77]:
# Sort the countries by increase in life expectancy in descending order
df_pivot_sorted = df_pivot.sort_values(by='diff', ascending=False)

# Create a bar chart of the difference in life expectancy between 1962 and 2007 for each country
fig = px.bar(df_pivot_sorted, x='diff', y=df_pivot_sorted.index, orientation='h', text='diff')

# Title and axis
fig.update_layout(title='Difference in Life Expectancy between 1962 and 2007 by Country',
                  xaxis_title='Increase in Life Expectancy (years)', yaxis_title='Country')

# Show the plot
fig.show()

##### Explaining the code:
First, we create a new DataFrame containing only data from the years 1962 to 2007. We then pivot the DataFrame to have separate columns for each year, and calculate the differences in life expectancy between 1962 and 2007 for each country. We then find the country or countries with the greatest increase in life expectancy, and print out the result. We sort the countries by the increase in life expectancy in descending order, and create an interactive bar chart for easy visualization. 
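The reshape-and-subtract step can be sketched without pandas. The records below are made up, and a plain dictionary plays the role of the pivoted table. One difference is labeled in the comments: this sketch skips countries that lack either endpoint year, whereas the pandas version leaves their difference as NaN.

```python
# Made-up long-format records: (country, year, life expectancy), for illustration only
records = [
    ("A", 1962, 40.0), ("A", 2007, 70.0),
    ("B", 1962, 60.0), ("B", 2007, 80.0),
    ("C", 1962, 55.0),                      # no 2007 value
]

# Long -> wide: one entry per country, keyed by year (the role pivot() plays above)
wide = {}
for country, year, value in records:
    wide.setdefault(country, {})[year] = value

# Difference between the endpoint years; countries missing either year are skipped
diff = {c: v[2007] - v[1962] for c, v in wide.items() if 1962 in v and 2007 in v}

max_increase = max(diff.values())
winners = [c for c, d in diff.items() if d == max_increase]
print(winners, max_increase)
```

Comparing every difference against the maximum, rather than taking a single argmax, keeps all countries that tie for the largest increase.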