![](./Ridge_Plot_First_Wave_vs_Second_Wave.jpg)

![](./Life_Expectancy_at_65_vs_Total_Deaths_per_100k_Population_Latest_Data_2021_01_14.jpg)

# UK COVID-19 with Demographics Data

Hi all - this is a follow-up to my [UK COVID-19 analysis](https://www.kaggle.com/vascodegama/uk-covid-19-analysis) notebook, which I've been building with the [UK COVID-19 Dataset](https://www.kaggle.com/vascodegama/uk-covid19-data), scraped from the official UK government COVID-19 reporting website.

That notebook highlighted some quite stark differences in outcomes across different regions of the UK, so the objective here is to determine whether any key demographic characteristics are related to the total cumulative number of deaths per 100,000 people for each local authority or 'borough' in the UK.

The supporting data is scraped from the Office for National Statistics (notebook) and currently includes:
* Population Density (per sq. km)
* OADR - Old Age Dependency Ratio
* Life_Expectancy_at_65 - Life Expectancy of Population currently aged 65 years (2014)
* IMD - Indices of Multiple Deprivation (Overall Score)
* IMD Health Score - Indices of Multiple Deprivation (Health Score)

The data covers either all UK boroughs or, where that is not available, only England & Wales.
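
Because some indicators only cover part of the UK, attaching them to the borough table by area code leaves gaps for the boroughs that are not covered. The snippet below is a minimal sketch of that behaviour using made-up area codes and values (none of them come from the real datasets); it shows that an unmatched code simply becomes `NaN`, which is worth checking before running any regression.

```python
import pandas as pd

# Hypothetical toy data: names, codes and values are made up for illustration.
boroughs = pd.DataFrame({
    "areaName": ["Borough A", "Borough B", "Borough C"],
    "areaCode": ["E0001", "E0002", "S0001"],
})

# An indicator that only covers some areas: there is no row for "S0001".
oadr = pd.DataFrame({
    "areaCode": ["E0001", "E0002"],
    "OADR_2020": [250.0, 310.0],
})

# Build an areaCode -> value lookup and attach it as a new column.
lookup = oadr.set_index("areaCode")["OADR_2020"]
boroughs["OADR"] = boroughs["areaCode"].map(lookup)

# Boroughs outside the indicator's coverage come back as NaN.
coverage = boroughs["OADR"].notna().mean()
print(boroughs)
print(f"Share of boroughs with an OADR value: {coverage:.0%}")
```

Checking the coverage share for each indicator makes it clear how many boroughs actually feed into each regression.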

I would really like to get some data on household occupancy but can't find it for UK boroughs - if anyone knows of a source, please let me know!

# Imports
----------------------------------------------------------

In [None]:
# General Imports
import numpy as np 
import pandas as pd      
import matplotlib.pyplot as plt
import seaborn as sns
import math
from datetime import date, timedelta
import scipy.stats

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load Data and Create DataFrames
----------------------------------------------------------------------------

In [None]:
# Load CSV data to pandas DataFrames

UK_TOTAL_DATA = pd.read_csv('../input/uk-covid19-data/UK_National_Total_COVID_Dataset.csv', index_col='date',parse_dates=True) 
DEVOLVED_NATION_DATA = pd.read_csv('../input/uk-covid19-data/UK_Devolved_Nations_COVID_Dataset.csv', index_col='date',parse_dates=True)
ENGLAND_REGIONS_DATA = pd.read_csv('../input/uk-covid19-data/England_Regions_COVID_Dataset.csv', index_col='date',parse_dates=True)
UK_LOCAL_AUTHORITY_DATA = pd.read_csv('../input/uk-covid19-data/UK_Local_Authority_UTLA_COVID_Dataset.csv', index_col='date',parse_dates=True)


# Demographic Data

POPULATION_DATA = pd.read_csv('../input/uk-covid19-data/NEW_Official_Population_Data_ONS_mid-2019.csv',index_col='Unnamed: 0')
OADR_DATA = pd.read_csv('../input/uk-demographics-dataset/Old_Age_Dependency_Ratios_England.csv',index_col='Unnamed: 0')
LE_DATA = pd.read_csv('../input/uk-demographics-dataset/Life_Expectancy_Age65_England_Wales_Boroughs.csv',index_col='Unnamed: 0')
IMD_DATA = pd.read_csv('../input/uk-demographics-dataset/Indices_Multiple_Deprevation_England_Boroughs.csv',index_col='Unnamed: 0')
IMD_HEALTH_DATA = pd.read_csv('../input/uk-demographics-dataset/IMD_Health_Deprivation_and_Disability_England_Boroughs.csv',index_col='Unnamed: 0')

# Combine dataframes
UK_AND_NATIONS = pd.concat([UK_TOTAL_DATA,DEVOLVED_NATION_DATA])
ENGLAND_AND_REGIONS = pd.concat([DEVOLVED_NATION_DATA[DEVOLVED_NATION_DATA['areaName'] == 'England'],ENGLAND_REGIONS_DATA])
UK_AND_UTLAS = pd.concat([UK_TOTAL_DATA,UK_LOCAL_AUTHORITY_DATA])

In [None]:
# Add population density data to UK_LOCAL_AUTHORITY_DATA dataframe
mapping = dict(POPULATION_DATA[['areaCode', 'Population Density (per sq. km)']].values)
UK_LOCAL_AUTHORITY_DATA['Population Density (per sq. km)'] = UK_LOCAL_AUTHORITY_DATA['areaCode'].map(mapping)

# Add population data to UK_LOCAL_AUTHORITY_DATA dataframe
MAPPING_POP = dict(POPULATION_DATA[['areaCode', 'Population']].values)
UK_LOCAL_AUTHORITY_DATA['Population'] = UK_LOCAL_AUTHORITY_DATA['areaCode'].map(MAPPING_POP)

# Add OADR data to UK_LOCAL_AUTHORITY_DATA dataframe
MAPPING_OADR = dict(OADR_DATA[['areaCode', 'OADR_2020']].values)
UK_LOCAL_AUTHORITY_DATA['OADR'] = UK_LOCAL_AUTHORITY_DATA['areaCode'].map(MAPPING_OADR)

# Add Life Expectancy data to UK_LOCAL_AUTHORITY_DATA dataframe
MAPPING_LE = dict(LE_DATA[['areaCode', 'LE_Total_Age_65']].values)
UK_LOCAL_AUTHORITY_DATA['Life_Expectancy_at_65'] = UK_LOCAL_AUTHORITY_DATA['areaCode'].map(MAPPING_LE)

# Add IMD data to UK_LOCAL_AUTHORITY_DATA dataframe
MAPPING_IMD = dict(IMD_DATA[['areaCode', 'IMD']].values)
UK_LOCAL_AUTHORITY_DATA['IMD'] = UK_LOCAL_AUTHORITY_DATA['areaCode'].map(MAPPING_IMD)

# Add IMD HEALTH SCORE data to UK_LOCAL_AUTHORITY_DATA dataframe
MAPPING_IMD_HEALTH = dict(IMD_HEALTH_DATA[['areaCode', 'IMD_Health_Score']].values)
UK_LOCAL_AUTHORITY_DATA['IMD_Health_Score'] = UK_LOCAL_AUTHORITY_DATA['areaCode'].map(MAPPING_IMD_HEALTH)

# Add IMD HEALTH RANK data to UK_LOCAL_AUTHORITY_DATA dataframe
MAPPING_IMD_HEALTH_RANK = dict(IMD_HEALTH_DATA[['areaCode', 'IMD_Health_Rank']].values)
UK_LOCAL_AUTHORITY_DATA['IMD_Health_Rank'] = UK_LOCAL_AUTHORITY_DATA['areaCode'].map(MAPPING_IMD_HEALTH_RANK)

In [None]:
LB = '''Hackney and City of London
Westminster
Kensington and Chelsea
Hammersmith and Fulham
Wandsworth
Lambeth
Southwark
Tower Hamlets
Islington
Camden
Brent
Ealing
Hounslow
Richmond upon Thames
Kingston upon Thames
Merton
Sutton
Croydon
Bromley
Lewisham
Greenwich
Bexley
Havering
Barking and Dagenham
Redbridge
Newham
Waltham Forest
Haringey
Enfield
Barnet
Harrow
Hillingdon
'''.split("\n")[0:-1]

In [None]:
# London borough list
LONDON_BOROUGHS = list(set(LB).intersection(UK_LOCAL_AUTHORITY_DATA['areaName'].unique()))
# Greater Manchester borough list
MANC_BOROUGHS = ['Bolton', 'Bury', 'Oldham', 'Rochdale', 'Stockport', 'Tameside',
                 'Trafford', 'Wigan', 'Manchester', 'Salford']
# West Midlands borough list
WM_BOROUGHS = ['Birmingham', 'Coventry', 'Wolverhampton', 'Dudley', 'Sandwell', 'Solihull', 'Walsall']

# UK Boroughs COVID-19 Cumulative Death-Rate per Capita vs. Demographics DataFrame Creation
-----------------------------------------------------------------

In [None]:
# Create dataframe containing latest total cumulative deaths per 100,000 population and demographics

UK_BOROUGHS = pd.DataFrame()
UK_BOROUGHS['areaName'] = [i for i in UK_LOCAL_AUTHORITY_DATA.loc['2020-03-30']['areaName']]
UK_BOROUGHS['areaCode'] = [i for i in UK_LOCAL_AUTHORITY_DATA.loc['2020-03-30']['areaCode']]
UK_BOROUGHS['MARCH_30'] = [i for i in UK_LOCAL_AUTHORITY_DATA.loc['2020-03-30']['cumDeaths28DaysByDeathDateRate']]
UK_BOROUGHS['MARCH_30'] = UK_BOROUGHS['MARCH_30'].fillna(0)
UK_BOROUGHS['Total_Deaths_per_100k_Population_Latest_Data'] = [i for i in UK_LOCAL_AUTHORITY_DATA.loc[UK_AND_NATIONS.index[-2]]['cumDeaths28DaysByDeathDateRate']]
UK_BOROUGHS['Region'] = ['London' if i in LONDON_BOROUGHS else ('Greater Manchester' if i in MANC_BOROUGHS else ('West Midlands' if i in WM_BOROUGHS else 'Rest of UK')) for i in UK_LOCAL_AUTHORITY_DATA.loc['2020-03-30']['areaName']]

UK_BOROUGHS['Population Density (per sq. km)'] = UK_BOROUGHS['areaCode'].map(mapping)
UK_BOROUGHS['OADR'] = UK_BOROUGHS['areaCode'].map(MAPPING_OADR)
UK_BOROUGHS['Life_Expectancy_at_65'] = UK_BOROUGHS['areaCode'].map(MAPPING_LE)
UK_BOROUGHS['IMD'] = UK_BOROUGHS['areaCode'].map(MAPPING_IMD)
UK_BOROUGHS['IMD_Health_Score'] = UK_BOROUGHS['areaCode'].map(MAPPING_IMD_HEALTH)
UK_BOROUGHS['IMD_Health_Rank'] = UK_BOROUGHS['areaCode'].map(MAPPING_IMD_HEALTH_RANK)


# Linear Regression and Plot Functions
----------------------------------------------

In [None]:
def reg_r_sq(data,x,y):
    '''Function to calculate the R squared value of a linear regression of y on x'''
    
    # LinReg Calc
    slope, intercept, r, p, se = scipy.stats.linregress(data[data[x].notnull()][x], data[data[x].notnull()][y])
    
    return round(r**2,2)

In [None]:
def reg_plot(data,x,y):
    '''Function to calculate and plot linear regressions'''
    
    # LinReg Calc
    slope, intercept, r, p, se = scipy.stats.linregress(data[data[x].notnull()][x], data[data[x].notnull()][y])
    REG_RESULT = 'Linear Regression:\ny = {}x + {}\nR\u00b2 = {}\np = {}'.format(round(slope,2),
                                                                                 round(intercept,2),
                                                                                 round(r**2,2),
                                                                                 round(p,3))

    print('Linear Regressions for relationship between {} and {} for English Boroughs:'.format(x,y))
    print(scipy.stats.linregress(data[data[x].notnull()][x], data[data[x].notnull()][y]),'\n')
    
    
    # Create Plot
    f, ax = plt.subplots(figsize=(12,6))
    sns.scatterplot(data=data, x=x, y=y, hue="Region", style="Region")
    
    plt.title('Relationship between {} and {} for UK Boroughs on {}'.format(x,y,UK_AND_NATIONS.index[-2].strftime("%d-%b-%y")))
    ax.set_xlabel(x)
    ax.set_ylabel('{}'.format(y))

    # Plot Best Fit Line
    REG_RANGE = np.array([data[x].min(),data[x].max()])
    plt.plot(REG_RANGE, slope*REG_RANGE + intercept,linestyle='--')

    # Annotate LinReg result
    ax.annotate(REG_RESULT,xy=(0.5,0.5),xytext=(.8,.03),xycoords = 'axes fraction')

    plt.savefig('{}_vs_{}_{}.jpg'.format(x,y,UK_AND_NATIONS.index[-2].strftime("%Y_%m_%d")),dpi=300)
    plt.show()
    

# Linear Regressions
---------------------------------------

In [None]:
# Run Linear Regressions

for x in ['IMD','IMD_Health_Score','OADR', 'Life_Expectancy_at_65','Population Density (per sq. km)']:
    reg_plot(UK_BOROUGHS,x,'Total_Deaths_per_100k_Population_Latest_Data')
    
    

In [None]:
# Create DF with R squared values

# Build the rows first, then create the DataFrame in one go
# (DataFrame.append was removed in pandas 2.0)
R_SQ_DF = pd.DataFrame(
    [{'Feature': x, 'R_squared': reg_r_sq(UK_BOROUGHS, x, 'Total_Deaths_per_100k_Population_Latest_Data')}
     for x in ['IMD','IMD_Health_Score','OADR', 'Life_Expectancy_at_65','Population Density (per sq. km)']],
    columns=['Feature','R_squared'])

In [None]:
# Plot R squared values

fig, ax = plt.subplots(figsize=(12,5))
sns.barplot(x=R_SQ_DF['Feature'],y=R_SQ_DF['R_squared'],color='#0099ff',edgecolor='black',ax=ax)
ax.grid(axis='y',linestyle=':', linewidth='0.5')
ax.set_xlabel('',size='large')
ax.set_ylabel('R\u00b2 (Pearson)',size='large')
plt.title('Pearson R\u00b2 for Linear Relationships between Demographic Indicators and\nTotal Cumulative Death Rates per Capita for UK Boroughs on {}'.format(UK_AND_NATIONS.index[-2].strftime("%Y-%m-%d")))
plt.xticks(ticks=range(5),
          labels=['Indices of Multiple\nDeprivation (IMD)',
                  'IMD Health Score',
                  'Old-Age Dependency\nRatio',
                  'Life Expectancy at\nAge 65',
                  'Population Density'])
plt.savefig('LinRegs_R_Sq_Values.jpg',dpi=300)
plt.show()

# OADR Outliers

Old-Age Dependency Ratio hasn't shown a relationship to cumulative death rates when looking at all UK boroughs for which data is available (England). This is a little surprising, as it is well known that COVID-19 has disproportionately affected older people.

Below I have removed some of the outlier boroughs with very high OADR and relatively low COVID-19 death rates. In almost every case the borough is either an island or very rural, and has a very low population density. Torbay is the exception: its population density is above the UK median, although it is in the South-West, which is England's least affected region. Torbay has a reputation for being a place to retire to, so it should be commended for having much lower than average deaths from COVID-19 - although I don't know if this is down to good fortune or good judgement.

After removing these outliers the R squared value rises to 0.34 (written 12th Jan 2021); however, it goes to show that predicting the impact of COVID-19 using demographic data is difficult. There may be other demographic indicators out there that better predict the regional differences, but the weak nature of these relationships speaks to how indiscriminate the virus and its effects have been right across the UK.
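
To illustrate why a handful of boroughs can mask a relationship, here is a small sketch on synthetic data (not the notebook's figures). It assumes 50 areas where the outcome rises with the feature, plus 5 high-feature, low-outcome areas standing in for outliers, and compares the R squared of a simple linear regression with and without them. The `r_squared` helper is a hypothetical name used only for this example.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# Synthetic stand-in data: 50 areas where y rises with x ...
x_main = rng.uniform(150, 300, size=50)
y_main = 0.5 * x_main + rng.normal(0, 15, size=50)
# ... plus 5 areas with a high x but a low y, acting as outliers.
x_out = rng.uniform(320, 400, size=5)
y_out = rng.uniform(40, 80, size=5)

x_all = np.concatenate([x_main, x_out])
y_all = np.concatenate([y_main, y_out])


def r_squared(x, y):
    """Return R squared of a simple linear regression of y on x."""
    return scipy.stats.linregress(x, y).rvalue ** 2


r2_all = r_squared(x_all, y_all)
r2_trimmed = r_squared(x_main, y_main)
print(f"R squared with outliers:    {r2_all:.2f}")
print(f"R squared without outliers: {r2_trimmed:.2f}")
```

The flip side is that removing points always needs a justification independent of the fit itself, which is why the outliers here are described by their geography rather than just by their residuals.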

In [None]:
# Outliers - all have low population density except Torbay, which is in the South West (so far the least affected region in England)

OADR_OUTLIERS = UK_BOROUGHS[(UK_BOROUGHS['OADR']>300) & (UK_BOROUGHS['Total_Deaths_per_100k_Population_Latest_Data']<120)]['areaName']
print('UK Boroughs with High OADR and Relatively Low COVID-19 Death-Rates:\n',OADR_OUTLIERS)

In [None]:
# Remove OADR outlier boroughs
OADR_OL_DF = UK_BOROUGHS.query('areaName not in @OADR_OUTLIERS')

# Run Lin Reg plot function
reg_plot(OADR_OL_DF,'OADR','Total_Deaths_per_100k_Population_Latest_Data')

# Multiple Linear Regression
----------------------------------------------------

A multiple linear regression was run on the 4 features with at least some relationship to total cumulative deaths.

The outliers identified in the OADR analysis have been removed, for the same reasons given in the section above. (This increases the R squared of the resulting multiple regression by nearly 50%.)
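
The features are standardised before fitting so that their coefficients can be compared with each other. A minimal sketch on synthetic data (feature scales and coefficients are invented for illustration) shows the two things standardisation does and does not change: the coefficients are rescaled by each feature's standard deviation, while the R squared of the fit stays exactly the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic features on very different scales (illustrative only).
X = np.column_stack([
    rng.uniform(150, 400, size=100),  # something like a ratio per 1,000 people
    rng.uniform(17, 23, size=100),    # something like years of life expectancy
])
y = 0.2 * X[:, 0] - 10.0 * X[:, 1] + rng.normal(0, 5, size=100)

# Fit once on the raw features and once on standardised features.
raw_fit = LinearRegression().fit(X, y)
X_std = StandardScaler().fit_transform(X)
std_fit = LinearRegression().fit(X_std, y)

r2_raw = raw_fit.score(X, y)
r2_std = std_fit.score(X_std, y)

print("Raw coefficients:         ", raw_fit.coef_.round(2))
print("Standardised coefficients:", std_fit.coef_.round(2))
print("R squared (raw, standardised):", round(r2_raw, 4), round(r2_std, 4))
```

A standardised coefficient reads as "change in the outcome per one standard deviation of the feature", which is what makes the coefficients printed later comparable across indicators.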

In [None]:
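# Keep only boroughs with complete data for every regression input (LinearRegression cannot fit on NaN values)
MR_COLS = ['OADR', 'Life_Expectancy_at_65','IMD','IMD_Health_Score','Total_Deaths_per_100k_Population_Latest_Data']
MR_DATA = OADR_OL_DF.dropna(subset=MR_COLS).drop(['areaName','areaCode','Region'],axis=1)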

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer

X = MR_DATA[['OADR', 'Life_Expectancy_at_65','IMD','IMD_Health_Score']]
y = MR_DATA['Total_Deaths_per_100k_Population_Latest_Data']


scaler = StandardScaler()
X_standard = scaler.fit_transform(X)


In [None]:
from scipy.stats import kurtosis,kurtosistest, skew

for i,j in zip(range(4),['OADR', 'Life_Expectancy_at_65','IMD','IMD_Health_Score']):
    print(j,'Kurtosis:',kurtosis((MR_DATA[j]),nan_policy='omit'))
    print(j,'Skew:',skew((MR_DATA[j]),nan_policy='omit'))
    sns.distplot(pd.DataFrame(X_standard)[i])

In [None]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X_standard,y)

print('Multiple Regression Coefficients:')
for i,j in zip(['OADR', 'Life_Expectancy_at_65','IMD','IMD_Health_Score'],regr.coef_):
    print(i,':',round(j,1))

In [None]:
print('Multiple Regression: R\u00b2 = ',round(regr.score(X_standard, y),2))

# Conclusion
---------------------------------------------------

There seem to be mild but statistically significant relationships between the total cumulative death rates per capita in UK boroughs and some key demographic indicators, particularly Life Expectancy at 65 and IMD Health Score, which is a measure of local health deprivation.
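
A relationship can be statistically significant while still explaining only a modest share of the variation, which is the situation described above. The sketch below uses synthetic data and an assumed 0.05 significance threshold (the notebook does not state one) to show how the p-value and R squared from `scipy.stats.linregress` answer two different questions.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)

# Synthetic data: one feature related to the outcome, one unrelated.
n = 150
related = rng.normal(0, 1, size=n)
unrelated = rng.normal(0, 1, size=n)
outcome = 2.0 * related + rng.normal(0, 3, size=n)

ALPHA = 0.05  # assumed significance threshold, not taken from the notebook

results = {}
for name, feature in [("related", related), ("unrelated", unrelated)]:
    fit = scipy.stats.linregress(feature, outcome)
    results[name] = {"r_squared": fit.rvalue ** 2, "p_value": fit.pvalue}
    verdict = "significant" if fit.pvalue < ALPHA else "not significant"
    print(f"{name}: R squared = {fit.rvalue ** 2:.2f}, "
          f"p = {fit.pvalue:.3g} ({verdict})")
```

The p-value says how unlikely the observed slope would be if there were no linear relationship at all, while R squared says how much of the spread in the outcome the feature accounts for.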

Perhaps surprisingly, at the time of this analysis there was no significant relationship to population density.

Also, in order to see a relationship to Old-Age Dependency Ratio, 10 outliers with high OADR and low death rates were omitted. We know that age is a key risk factor, but this shows that relatively low deaths have been possible even with a larger relative population of older people. It seems here that geographic isolation (rather than population density itself) is a significant factor, although there may be other influential factors.

The multiple regression using the 4 statistically significant features showed a reasonably strong R squared value of 0.47.

Whether or not these relationships are strong enough to warrant influencing policy or vaccine prioritisation is difficult to say. I haven't seen as much being suggested by any senior medical officials or politicians in the UK - but if someone has, please let me know!

If you have any comments or questions, please let me know!