### Introduction

In this brief study I look for a correlation (or the absence of one) between a country's COVID-19 vaccination level and the following social indicators:
* Health expenditure (% of GDP)
* Physicians (per 1000 pop.)
* Education expenditure (% of GDP)
* Live births per woman
* Infant mortality rate (per 1000 live births)
* Female life expectancy at birth

To measure the vaccination level of a country I use the *total_vaccinations_per_hundred* values from the [COVID-19 World Vaccination Progress](https://www.kaggle.com/gpreda/covid-world-vaccination-progress) dataset. The social indicators for each country come from the [Country Statistics - UNData](https://www.kaggle.com/sudalairajkumar/undata-country-profiles/) dataset (the data is from 2017 when available, otherwise the most recent data prior to that year).

### Data preparation
The selected columns from both the *country_profile_variables.csv* and *country_vaccinations.csv* datasets are joined to form the analysis dataset. Since *country_vaccinations.csv* contains daily data, I analyze only the date with the maximum number of data entries.
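
The date selection and the join described above can be sketched on toy data. The frames `toy_vacs` and `toy_profiles` and all of their values are made up for illustration; only the column names mirror the real datasets.

```python
import pandas as pd

# Toy stand-in for country_vaccinations.csv (values are invented)
toy_vacs = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B'],
    'date': pd.to_datetime(['2021-02-01', '2021-02-01', '2021-02-01',
                            '2021-02-02', '2021-02-02']),
    'total_vaccinations_per_hundred': [1.0, 2.0, 3.0, 1.5, 2.5],
})
# Toy stand-in for country_profile_variables.csv, indexed by country
toy_profiles = pd.DataFrame(
    {'Health: Total expenditure (% of GDP)': [5.0, 7.5]},
    index=pd.Index(['A', 'B'], name='country'),
)

# The date that occurs most often, i.e. the day with the most country rows
toy_date = toy_vacs['date'].value_counts().idxmax()

# Keep that day's rows, index them by country and left-join the profile columns
toy_sample = toy_vacs.loc[toy_vacs['date'] == toy_date].set_index('country')
toy_joined = toy_sample.join(toy_profiles)

print(toy_date.date())
print(toy_joined)
```

Country `C` has no profile row, so its profile column is NaN after the join. The same happens in the real data for any country whose name is not present in both datasets.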

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

FONT_SIZE = 16
MARKER_COLOR = '#B22222'

# preprocess Countries dataset
countries = pd.read_csv('../input/undata-country-profiles/country_profile_variables.csv',
                        index_col='country')
country_demography_cols = [
    'Population in thousands (2017)',
    'Population density (per km2, 2017)',
    'Sex ratio (m per 100 f, 2017)',
    'Population growth rate (average annual %)',
    'Urban population (% of total population)',
    'Urban population growth rate (average annual %)',
    'Fertility rate, total (live births per woman)',
    'Life expectancy at birth (females/males, years)',
    'Population age distribution (0-14 / 60+ years, %)',
    'Infant mortality rate (per 1000 live births',
    'Health: Total expenditure (% of GDP)',
    'Health: Physicians (per 1000 pop.)',
    'Education: Government expenditure (% of GDP)',
    'Education: Primary gross enrol. ratio (f/m per 100 pop.)',
    'Education: Secondary gross enrol. ratio (f/m per 100 pop.)',
    'Education: Tertiary gross enrol. ratio (f/m per 100 pop.)',
    'Seats held by women in national parliaments %',
    'Mobile-cellular subscriptions (per 100 inhabitants)',
    'Individuals using the Internet (per 100 inhabitants)',
]
demography_df = countries[country_demography_cols]

# preprocess Vaccinations dataset
vacs = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations.csv',
                   parse_dates=['date'])
sample_date = vacs.date.value_counts().idxmax()  # the date with maximum data entries
sample_vacs = vacs.loc[vacs.date == sample_date].set_index('country')
sample_vacs_cols = [
    'total_vaccinations',
    'people_vaccinated',
    'people_fully_vaccinated',
    'daily_vaccinations_raw',
    'daily_vaccinations',
    'total_vaccinations_per_hundred',
    'people_vaccinated_per_hundred',
    'people_fully_vaccinated_per_hundred',
    'daily_vaccinations_per_million',
    'vaccines',
]
sample_vacs = sample_vacs[sample_vacs_cols]

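# create joined dataset (left join on the country index; countries without
# a matching row in demography_df get NaN in the demography columns)
join_df = sample_vacs.join(demography_df)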

In [None]:
print('The date with maximum data entries:', sample_date.strftime('%d %B %Y'))

### Scatter plots
Before plotting, the data were cleaned of NaNs, negative values and placeholder values such as '...'. Rows with more than 80 total vaccinations per hundred were also excluded as incorrect. In addition, columns of *object* datatype were converted to *float64*.
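
The cleaning steps above could also be collected in one small helper. This is only a sketch: `clean_numeric` is a hypothetical name, and it uses `pd.to_numeric(..., errors='coerce')` rather than the explicit `'...'` filter plus `astype('float64')` used in the cells below. It also assumes that every negative value is a missing-data marker.

```python
import pandas as pd

def clean_numeric(series):
    """Return the series as float64 with placeholders and negatives set to NaN."""
    # errors='coerce' turns non-numeric strings such as '...' into NaN
    values = pd.to_numeric(series, errors='coerce')
    # Assumption: negative values are missing-data markers, so mask them too
    return values.where(values >= 0)

# Invented sample values covering each case mentioned above
toy = pd.Series(['2.5', '...', '-99', None, '0.3'])
cleaned = clean_numeric(toy)
print(cleaned)
```

A call to `dropna()` on the result then leaves only the rows that are safe to plot.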

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20,8))

# Health: Total expenditure (% of GDP) vs. Total vaccinations per hundred
x_col = 'Health: Total expenditure (% of GDP)'
y_col = 'total_vaccinations_per_hundred'
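# remove incorrect data: negative x values and y values above 80
# (rows with NaN in either column fail the comparisons and are dropped too)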
plot_df = join_df.loc[(join_df[x_col] >= 0) & (join_df[y_col] <= 80)]
ax[0].scatter(plot_df[x_col], plot_df[y_col], s=100, c=MARKER_COLOR, alpha=0.7)
ax[0].set_xlabel('Health expenditure (% of GDP)', fontsize=FONT_SIZE)
ax[0].set_ylabel('Total vaccinations per hundred', fontsize=FONT_SIZE)
ax[0].grid()

# Health: Physicians (per 1000 pop.) vs. Total vaccinations per hundred
x_col = 'Health: Physicians (per 1000 pop.)'
y_col = 'total_vaccinations_per_hundred'
plot_df = join_df[[x_col, y_col]].dropna()
# some x_col values are the placeholder '...'; remove those rows
plot_df = plot_df.loc[plot_df[x_col] != '...'].copy()
# x_col values are strings; convert them to float64
plot_df[x_col] = plot_df[x_col].astype('float64')
# remove incorrect data
plot_df = plot_df.loc[(plot_df[x_col] >= 0) & (plot_df[y_col] <= 80)]
ax[1].scatter(plot_df[x_col], plot_df[y_col], s=100, c=MARKER_COLOR, alpha=0.7)
ax[1].set_xlabel('Physicians (per 1000 pop.)', fontsize=FONT_SIZE)
ax[1].grid()

# Education: Government expenditure (% of GDP) vs. Total vaccinations per hundred
x_col = 'Education: Government expenditure (% of GDP)'
y_col = 'total_vaccinations_per_hundred'
plot_df = join_df[[x_col, y_col]].dropna()
# some x_col values are the placeholder '...'; remove those rows
plot_df = plot_df.loc[plot_df[x_col] != '...'].copy()
# x_col values are strings; convert them to float64
plot_df[x_col] = plot_df[x_col].astype('float64')
# remove incorrect data
plot_df = plot_df.loc[(plot_df[x_col] >= 0) & (plot_df[y_col] <= 80)]
ax[2].scatter(plot_df[x_col], plot_df[y_col], s=100, c=MARKER_COLOR, alpha=0.7)
ax[2].set_xlabel('Education expenditure (% of GDP)', fontsize=FONT_SIZE)
ax[2].grid()

plt.show()

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20,8))

# Fertility rate, total (live births per woman) vs. Total vaccinations per hundred
x_col = 'Fertility rate, total (live births per woman)'
y_col = 'total_vaccinations_per_hundred'
plot_df = join_df[[x_col, y_col]].dropna()
# some x_col values are the placeholder '...'; remove those rows
plot_df = plot_df.loc[plot_df[x_col] != '...'].copy()
# x_col values are strings; convert them to float64
plot_df[x_col] = plot_df[x_col].astype('float64')
# remove incorrect data
plot_df = plot_df.loc[(plot_df[x_col] >= 0) & (plot_df[y_col] <= 80)]
ax[0].scatter(plot_df[x_col], plot_df[y_col], s=100, c=MARKER_COLOR, alpha=0.7)
ax[0].set_xlabel('Live births per woman', fontsize=FONT_SIZE)
ax[0].set_ylabel('Total vaccinations per hundred', fontsize=FONT_SIZE)
ax[0].grid()

# Infant mortality rate (per 1000 live births) vs. Total vaccinations per hundred
x_col = 'Infant mortality rate (per 1000 live births'
y_col = 'total_vaccinations_per_hundred'
plot_df = join_df[[x_col, y_col]].dropna()
# x_col values are strings; convert them to float64
plot_df[x_col] = plot_df[x_col].astype('float64')
# remove incorrect data
plot_df = plot_df.loc[(plot_df[x_col] >= 0) & (plot_df[y_col] <= 80)]
ax[1].scatter(plot_df[x_col], plot_df[y_col], s=100, c=MARKER_COLOR, alpha=0.7)
ax[1].set_xlabel('Infant mortality rate (per 1000 live births)', fontsize=FONT_SIZE)
ax[1].grid()

# Life expectancy at birth vs. Total vaccinations per hundred
x_col = 'Life expectancy at birth (females/males, years)'
y_col = 'total_vaccinations_per_hundred'
plot_df = join_df[[x_col, y_col]].dropna()
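# split the 'females/males' string into two columns (0: females, 1: males)
life_df = plot_df[x_col].str.split('/', expand=True)
# drop rows where either value is the placeholder '...'
life_df = life_df.loc[(life_df[0] != '...') & (life_df[1] != '...')]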
life_df = life_df.astype('float64')
x2_col = 'Female life expectancy at birth'
plot_df[x2_col] = life_df[0]
# remove incorrect data
plot_df = plot_df.loc[(plot_df[x2_col] >= 0) & (plot_df[y_col] <= 80)]
ax[2].scatter(plot_df[x2_col], plot_df[y_col], s=100, c=MARKER_COLOR, alpha=0.7)
ax[2].set_xlabel('Female life expectancy at birth (years)', fontsize=FONT_SIZE)
ax[2].grid()

plt.show()

### Conclusion
1. The first three plots show no apparent correlation between a country's vaccination level and social indicators such as health expenditure, education expenditure and the number of physicians.
2. The last three plots support the following intuitive assumptions:
   * Fewer live births per woman -> Higher country development level (contraception, women's rights and education) -> Higher country vaccination level
   * Lower infant mortality rate -> Better country health system -> Higher country vaccination level
   * Longer life expectancy at birth -> Better country health system -> Higher country vaccination level
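
These conclusions are read off the scatter plots by eye. One possible way to put a number on them is a rank correlation coefficient. The sketch below is not part of the analysis above: `rank_correlation` is a hypothetical helper, the choice of Spearman's coefficient is an assumption, and the toy values are invented.

```python
import pandas as pd

def rank_correlation(df, x_col, y_col):
    """Spearman rank correlation of two columns, ignoring rows with missing values."""
    # Coerce placeholders such as '...' to NaN, then drop incomplete rows
    pair = df[[x_col, y_col]].apply(pd.to_numeric, errors='coerce').dropna()
    # Spearman's coefficient is the Pearson correlation of the ranks
    return pair[x_col].rank().corr(pair[y_col].rank())

# Made-up numbers for illustration only; they are not taken from either dataset
toy = pd.DataFrame({
    'fertility': [1.5, 2.0, 3.5, 5.0, '...'],
    'vaccinations_per_hundred': [30.0, 12.0, 4.0, 0.5, 8.0],
})
rho = rank_correlation(toy, 'fertility', 'vaccinations_per_hundred')
print(round(rho, 2))
```

Values near -1 or 1 indicate a strong monotonic relationship, and values near 0 indicate none. On the real data it could be called as `rank_correlation(join_df, 'Fertility rate, total (live births per woman)', 'total_vaccinations_per_hundred')`. The helper does not apply the negative-value and above-80 filters used for the plots, so those rows would need to be removed first.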