# WEEK 03
# Encounter 05 - Aggregation and Groupby
# Project Challenge - Solve aggregation and groupby One-Liners

## Task Description

Using the gapminder_total dataset, solve the following tasks with pandas one-liners:

 1. Read in data:
>`df = pd.read_csv('../data/gapminder_total.csv')`

 2. What is the median population in the data set?

 3. How often does each continent appear in the data set?

 4. Which continent has the lowest average fertility rate overall?

 5. What was the average life expectancy in Europe in 2015? 
    **Hint:** first filter for 2015 then apply groupby.

 6. How many countries does each continent have in the dataset?
    **Hint:** filter for one year and count

 7. What is the average population of a European country in 1976 compared to 2015?
    **Hint:** once again filter for the year in question and do each year separately to compare

In [1]:
import pandas as pd

In [2]:
# 1. Read in data:

df = pd.read_csv('../data/gapminder_total.csv')

# converting 'year' and 'population' datatype to 'Int64'
df['year'] = df['year'].astype('Int64')
df['population'] = df['population'].astype('Int64')

df

,country,year,life expectancy,continent,population,fertility
0,Afghanistan,1950,26.85,Asia,7752118,7.67
1,Afghanistan,1951,27.13,Asia,7839426,7.67
2,Afghanistan,1952,27.67,Asia,7934798,7.67
3,Afghanistan,1953,28.19,Asia,8038312,7.67
4,Afghanistan,1954,28.73,Asia,8150037,7.67
...,...,...,...,...,...,...
16970,Turks and Caicos Islands,2015,,,34339,
16971,Tuvalu,2015,,,9916,
16972,Wallis et Futuna,2015,,,13151,
16973,Curaçao,2015,,,157203,
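The conversion to `'Int64'` above uses pandas' nullable integer dtype (note the capital `I`). One likely reason to prefer it over plain `int64` is that it tolerates missing values. The sketch below illustrates this on a small hypothetical column; the values are made up for the example and are not read from `gapminder_total.csv`.

```python
import pandas as pd

# Hypothetical toy column (not the real dataset): integer-valued floats with one gap.
population = pd.Series([7752118.0, None, 34339.0], name="population")

# The nullable 'Int64' dtype keeps the gap as <NA> instead of
# forcing the whole column to stay float.
population_int = population.astype("Int64")

print(population_int.dtype)         # Int64
print(population_int.isna().sum())  # 1
```

A plain `astype('int64')` would raise on the same column because NaN cannot be represented as a NumPy integer.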


In [3]:
# 2. What is the median population in the data set?

df['population'].agg('median')

3047769.0
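As a small aside, `agg('median')` on a single column should give the same number as calling `.median()` directly. The sketch below checks this on a hypothetical four-row frame (the numbers are invented for illustration) and also shows why the median is less sensitive to one very large value than the mean.

```python
import pandas as pd

# Hypothetical toy data with one outlier; not taken from the gapminder file.
toy = pd.DataFrame({"population": [10, 20, 30, 1000]})

via_agg = toy["population"].agg("median")   # string alias, as in the cell above
via_method = toy["population"].median()     # direct method call
mean_value = toy["population"].mean()

print(via_agg, via_method)  # 25.0 25.0
print(mean_value)           # 265.0
```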

In [4]:
# 3. How often does each continent appear in the data set?

# V1
#df_continent_popularity = pd.DataFrame(df['continent'].value_counts())

# V2
df_continent_popularity = df.groupby('continent')[['continent']].agg('count')
df_continent_popularity.columns = ['count']
df_continent_popularity

continent,count
Africa,3354
Asia,2618
Australia and Oceania,634
Europe,2713
North America,1303
South America,804
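Both V1 and V2 skip rows whose `continent` is missing, such as the rows without a continent in the preview of `df` above. The following sketch, on a hypothetical four-row frame, compares `value_counts()`, `groupby(...).size()` and `value_counts(dropna=False)`; the toy rows are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy rows; the last one has no continent assigned.
toy = pd.DataFrame({
    "country": ["Albania", "Algeria", "Angola", "Tuvalu"],
    "continent": ["Europe", "Africa", "Africa", None],
})

# value_counts() ignores missing values by default.
counts_vc = toy["continent"].value_counts()

# groupby() also drops missing group keys by default.
counts_gb = toy.groupby("continent").size()

# dropna=False makes the rows without a continent visible.
counts_with_missing = toy["continent"].value_counts(dropna=False)

print(counts_vc.sum())            # 3
print(counts_with_missing.sum())  # 4
```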


In [5]:
# 4. Which continent has the lowest average fertility rate overall?

# calculating average (mean) fertility rate for each continent
df_avg_fertility_by_continent = df.groupby('continent')[['fertility']].agg('mean')
df_avg_fertility_by_continent.columns = ['avg_fertility']
df_avg_fertility_by_continent

continent,avg_fertility
Africa,5.931345
Asia,4.673862
Australia and Oceania,4.682172
Europe,2.169754
North America,4.002329
South America,4.077235


In [6]:
# getting the lowest (min) average fertility and its continent

# V1
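# select the column first so idxmin() returns the label directly;
# positional [0] on a label-indexed Series is deprecated in recent pandas
df_avg_fertility_by_continent['avg_fertility'].idxmin()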

'Europe'

In [7]:
# V2

mask = df_avg_fertility_by_continent['avg_fertility'] == df_avg_fertility_by_continent['avg_fertility'].min()
df_avg_fertility_by_continent[mask]

continent,avg_fertility
Europe,2.169754


In [8]:
# getting the index value for this result
mask = df_avg_fertility_by_continent['avg_fertility'] == df_avg_fertility_by_continent['avg_fertility'].min()
df_avg_fertility_by_continent[mask].index[0]

'Europe'
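Task 4 can also be written as a single chained expression, in the spirit of the one-liner challenge. This is a minimal sketch on hypothetical toy data; the fertility values are invented and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Africa", "Africa"],
    "fertility": [2.0, 1.5, 6.0, 5.0],
})

# Selecting a single column yields a Series, so idxmin() returns
# the index label (the continent) of the smallest mean directly.
lowest = toy.groupby("continent")["fertility"].mean().idxmin()

print(lowest)  # Europe
```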

In [9]:
# 5. What was the average life expectancy in Europe in 2015? 
# Hint: first filter for 2015 then apply groupby.

mask_2015 = df['year'] == 2015

df_2015 = df[mask_2015]
df_2015

,country,year,life expectancy,continent,population,fertility
65,Afghanistan,2015,53.8,Asia,32526562,4.47
132,Albania,2015,78.0,Europe,2896679,1.78
199,Algeria,2015,76.4,Africa,39666519,2.71
266,Angola,2015,59.6,Africa,25021974,5.65
333,Antigua and Barbuda,2015,76.4,North America,91818,2.06
...,...,...,...,...,...,...
16970,Turks and Caicos Islands,2015,,,34339,
16971,Tuvalu,2015,,,9916,
16972,Wallis et Futuna,2015,,,13151,
16973,Curaçao,2015,,,157203,


In [10]:
# grouping data by continent
df_2015.groupby('continent')[['life expectancy']].agg('mean').loc[['Europe']]

continent,life expectancy
Europe,78.902439


In [11]:
# V2
# single brackets select a Series, so mean() returns a scalar and no positional [0] is needed
df[(df['year'] == 2015) & (df['continent'] == 'Europe')]['life expectancy'].mean()

78.90243902439025
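A further variant for task 5 is to pass the boolean mask and the column name to `.loc` in one step, which avoids chaining two `[]` lookups. The sketch below uses a hypothetical four-row frame; the life expectancy values are assumptions for the example.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "country": ["Albania", "Austria", "Algeria", "Albania"],
    "year": [2015, 2015, 2015, 1976],
    "continent": ["Europe", "Europe", "Africa", "Europe"],
    "life expectancy": [78.0, 81.0, 76.4, 68.0],
})

# Row filter and column selection in a single .loc call.
mask = (toy["year"] == 2015) & (toy["continent"] == "Europe")
avg_life = toy.loc[mask, "life expectancy"].mean()

print(avg_life)  # 79.5
```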

In [12]:
# 6. How many countries does each continent have in the dataset?
# Hint: filter for one year and count

mask2 = (df['year'] == 2015)
df[mask2].groupby('continent')[['country']].agg('count')

continent,country
Africa,50
Asia,39
Australia and Oceania,10
Europe,41
North America,20
South America,12
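Filtering for one year and counting relies on every country having exactly one row in that year. An alternative that does not depend on the chosen year is `nunique()`, sketched here on hypothetical toy data. Note that, as with `count`, countries without a continent would not appear in any group.

```python
import pandas as pd

# Hypothetical toy rows: two European countries, one African country,
# with differing numbers of yearly records.
toy = pd.DataFrame({
    "country": ["Albania", "Albania", "Austria", "Algeria", "Algeria"],
    "year": [1976, 2015, 2015, 1976, 2015],
    "continent": ["Europe", "Europe", "Europe", "Africa", "Africa"],
})

# Count distinct country names per continent across all years.
countries_per_continent = toy.groupby("continent")["country"].nunique()

print(countries_per_continent["Europe"])  # 2
print(countries_per_continent["Africa"])  # 1
```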


In [13]:
# 7. What is the average population of a European country in 1976 compared to 2015?
# Hint: once again filter for the year in question and do each year separately to compare

mask_1976 = df['year'] == 1976
df_p1 = df[mask_1976].groupby('continent')[['population']].agg('mean')

# renaming column 'population' to 'population_in_1976'
df_p1.columns = ['population_in_1976']

# getting particular aggregated row for index_label = 'Europe'
df_p1 = df_p1.loc[['Europe']]

# resetting index to get 'continent' as a column
df_p1.reset_index(inplace=True)
df_p1

,continent,population_in_1976
0,Europe,13840493.8


In [14]:
mask_2015 = df['year'] == 2015
df_p2 = df[mask_2015].groupby('continent')[['population']].agg('mean')

# renaming column 'population' to 'population_in_2015'
df_p2.columns = ['population_in_2015']

# getting particular aggregated row for index_label = 'Europe'
df_p2 = df_p2.loc[['Europe']]

# resetting index to get 'continent' as a column
df_p2.reset_index(inplace=True)
df_p2

,continent,population_in_2015
0,Europe,14755548.829268


In [15]:
# combining results together

pd.merge(left=df_p1, right=df_p2, how='inner', on='continent')

,continent,population_in_1976,population_in_2015
0,Europe,13840493.8,14755548.829268


In [16]:
# V2 - hard to compare values for a particular country

mask_complex = ((df['year'] == 1976) | (df['year'] == 2015)) & (df['continent'] == 'Europe')
df_task_7 = df[mask_complex].groupby('year')[['population']].agg('mean')

# new column: each year's mean as a share of the sum of both yearly means
df_task_7['population_normalized'] = df_task_7['population']/df_task_7['population'].sum()
df_task_7


year,population,population_normalized
1976,13840493.8,0.484
2015,14755548.829268,0.516
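The two-year comparison can also be expressed with `isin` and a ratio of the two means, which may be easier to read than the normalized shares. The sketch below runs on hypothetical toy data; the population figures are invented and `growth` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows; populations are made up for the example.
toy = pd.DataFrame({
    "country": ["Albania", "Austria", "Albania", "Austria", "Algeria"],
    "year": [1976, 1976, 2015, 2015, 2015],
    "continent": ["Europe", "Europe", "Europe", "Europe", "Africa"],
    "population": [2_000_000, 8_000_000, 3_000_000, 9_000_000, 40_000_000],
})

# Keep only European rows for the two years of interest.
europe = toy[(toy["continent"] == "Europe") & toy["year"].isin([1976, 2015])]

# Mean population per year, then the 2015 mean relative to the 1976 mean.
avg_by_year = europe.groupby("year")["population"].mean()
growth = avg_by_year.loc[2015] / avg_by_year.loc[1976]

print(avg_by_year.loc[1976])  # 5000000.0
print(avg_by_year.loc[2015])  # 6000000.0
print(growth)                 # 1.2
```

A ratio above 1 means the average European country in the toy data was more populous in 2015 than in 1976.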
