# **Intro**

In early 2020, the WHO declared COVID-19 a pandemic, under the definition of a pandemic that it had revised in 2009 to remove the phrase "with enormous numbers of deaths and illness":
![Screenshot from webcrawler text, as the official doc was deleted by the WHO.](https://i.ibb.co/8YHpWv9/Pandemic-def.jpg)

This triggered a race to develop a vaccine. After a median of about two months of phase 3 follow-up, vaccines, some of them based on mRNA technology that had never before been authorized for use in humans, were [authorized for emergency use](https://www.fda.gov/vaccines-blood-biologics/vaccines/emergency-use-authorization-vaccines-explained) while their phase 3 trials were still in progress (officially until 2023).

This development urges us to analyse the data currently available.

# **Hypothesis**
A hypothesis that I would like to put to the test is the claim that the benefits of vaccination outweigh the risks. To do that with the data currently available, we are going to analyse how vaccination allocations correlate with the number of deaths in the following month. We will now listen to the unforgiving answers from the data itself.

# **Method**

* **Step 1:**
Import libraries, declare functions and variables.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from datetime import datetime
from dateutil.relativedelta import relativedelta
from scipy.stats import pearsonr
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import math

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

#Gets the sum of deaths by cause for a state, from the vaccination week date through number_of_months later plus 7 days:
def get_monthly_increment(state, start_date, number_of_months):
    end_date = start_date + relativedelta(months = number_of_months) + relativedelta(days = 7)
    df_all_records_date_range = df_death_count_state_only[df_death_count_state_only[column_name_date] >= start_date]
    df_all_records_date_range = df_all_records_date_range[df_all_records_date_range[column_name_jurisdiction_deaths] == state]
    df_all_records_date_range = df_all_records_date_range[df_all_records_date_range[column_name_date] <= end_date]
    df_all_records_date_range = df_all_records_date_range[[column_name_disease_all_cause, column_name_disease_natural_cause, column_name_disease_septicemia, column_name_disease_malignant_neoplasms, column_name_disease_diabetes, column_name_disease_alzheimer, column_name_disease_influenza_pneumonia, column_name_disease__lower_respiratory, column_name_disease_respiratory, column_name_disease_nephritis, column_name_disease_non_classified, column_name_disease_non_cardiovascular, column_name_disease_non_cerebrovascular, column_name_disease_non_covid_multiple, column_name_disease_non_covid_underlying]].sum(axis=0)
    return df_all_records_date_range

#rounds decimals:
def round_decimals_up(number:float, decimals:int=2):
    """
    Returns a value rounded up to a specific number of decimal places.
    """
    if not isinstance(decimals, int):
        raise TypeError("decimal places must be an integer")
    elif decimals < 0:
        raise ValueError("decimal places has to be 0 or more")
    elif decimals == 0:
        return math.ceil(number)

    factor = 10 ** decimals
    return math.ceil(number * factor) / factor

#filename_death_count_1 = 'Weekly_Counts_of_Deaths_by_State_and_Select_Causes__2014-2019.csv'
filename_death_count = '../input/covid-19-vaccines-and-deaths-by-state/Weekly_Provisional_Counts_of_Deaths_by_State_and_Select_Causes__2020-2021.csv'
filename_vaxxed_janssen = '../input/covid-19-vaccines-and-deaths-by-state/COVID-19_Vaccine_Distribution_Allocations_by_Jurisdiction_-_Janssen.csv'
filename_vaxxed_moderna = '../input/covid-19-vaccines-and-deaths-by-state/COVID-19_Vaccine_Distribution_Allocations_by_Jurisdiction_-_Moderna.csv'
filename_vaxxed_pfizer = '../input/covid-19-vaccines-and-deaths-by-state/COVID-19_Vaccine_Distribution_Allocations_by_Jurisdiction_-_Pfizer.csv'

#Column names original files:
column_name_jurisdiction_vaxxed = 'Jurisdiction'
column_name_week_vaxxed = 'Week of Allocations'
column_name_first_dose_vaxxed = '1st Dose Allocations'
column_name_second_dose_vaxxed = '2nd Dose Allocations'

column_name_week_deaths = 'Week Ending Date'
column_name_jurisdiction_deaths = 'Jurisdiction of Occurrence'

#Diseases columns:
column_name_disease_all_cause = 'All Cause'
column_name_disease_natural_cause = 'Natural Cause'
column_name_disease_septicemia = 'Septicemia (A40-A41)'
column_name_disease_malignant_neoplasms = 'Malignant neoplasms (C00-C97)'
column_name_disease_diabetes = 'Diabetes mellitus (E10-E14)'
column_name_disease_alzheimer = 'Alzheimer disease (G30)'
column_name_disease_influenza_pneumonia = 'Influenza and pneumonia (J09-J18)'
column_name_disease__lower_respiratory = 'Chronic lower respiratory diseases (J40-J47)'
column_name_disease_respiratory = 'Other diseases of respiratory system (J00-J06,J30-J39,J67,J70-J98)'
column_name_disease_nephritis = 'Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27)'
column_name_disease_non_classified = 'Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified (R00-R99)'
column_name_disease_non_cardiovascular = 'Diseases of heart (I00-I09,I11,I13,I20-I51)'
column_name_disease_non_cerebrovascular = 'Cerebrovascular diseases (I60-I69)'
column_name_disease_non_covid_multiple = 'COVID-19 (U071, Multiple Cause of Death)'
column_name_disease_non_covid_underlying = 'COVID-19 (U071, Underlying Cause of Death)'

#New column names:
column_name_date = 'Date'

#columns for monthly death totals:
column_name_increment_monthly_all_cause = 'MI All Causes'
column_name_increment_monthly_natural_cause = 'MI Natural Causes'
column_name_increment_monthly_septicemia = 'MI Septicemia'
column_name_increment_monthly_malignant_neoplasms = 'MI Malignant Neoplasms'
column_name_increment_monthly_diabetes = 'MI Diabetes'
column_name_increment_monthly_alzheimer = 'MI Alzheimer'
column_name_increment_monthly_influenza_pneumonia = 'MI Influenza and Pneumonia'
column_name_increment_monthly__lower_respiratory = 'MI Chronic lower respiratory diseases'
column_name_increment_monthly_respiratory = 'MI Other respiratory diseases'
column_name_increment_monthly_nephritis = 'MI Nephrosis Like'
column_name_increment_monthly_non_classified = 'MI Not Classified Causes'
column_name_increment_monthly_non_cardiovascular = 'MI Cardiovascular'
column_name_increment_monthly_non_cerebrovascular = 'MI Cerebrovascular'
column_name_increment_monthly_non_covid_multiple = 'MI COVID-19 Multiple Causes'
column_name_increment_monthly_non_covid_underlying = 'MI COVID-19 as Underlying Cause'

* **Step 2:**
Import all of the files we are going to be using (raw files from the CDC).

In [None]:
df_raw_death_count = pd.read_csv(filename_death_count, encoding = 'latin')
df_raw_vaxxed_janssen = pd.read_csv(filename_vaxxed_janssen, encoding = 'latin')
df_raw_vaxxed_moderna = pd.read_csv(filename_vaxxed_moderna, encoding = 'latin')
df_raw_vaxxed_pfizer = pd.read_csv(filename_vaxxed_pfizer, encoding = 'latin')

* **Step 3:**
**Data preprocessing:**
* We are going to use the vaccines from all vendors together (Pfizer, Moderna and Janssen), because the goal is not to evaluate them separately.
* Remove the country-level records in order to analyse only by State, as that gives us the necessary variability across States with different vaccination allocation efforts.
* Remove the last month and a half of vaccination allocation records, because we want to correlate allocations with deaths in the following month, and the death count dataset doesn't have that data yet (it is updated in near real time, so it will only be available a month from now).
* Add 2 months of zero-dose records before the first allocations, so that 2 months of pre-vaccination death counts are included for higher correlation precision.
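
As a minimal sketch of the first preprocessing step, the snippet below stacks per-vendor allocation tables and sums the doses per state and week. The tables and their values are made up for illustration, and it assumes a single-dose vendor table may have no second-dose column.

```python
import pandas as pd

# Toy allocation tables standing in for the vendor files (values are made up).
janssen = pd.DataFrame({
    "Jurisdiction": ["Ohio"],
    "Week of Allocations": ["03/01/2021"],
    "1st Dose Allocations": [100],
})
moderna = pd.DataFrame({
    "Jurisdiction": ["Ohio", "Utah"],
    "Week of Allocations": ["03/01/2021", "03/01/2021"],
    "1st Dose Allocations": [200, 50],
    "2nd Dose Allocations": [200, 50],
})

# Stack the vendors, then sum doses per state and week.
# A missing second-dose value becomes NaN on concat and is skipped by sum().
combined = (
    pd.concat([janssen, moderna])
    .groupby(["Jurisdiction", "Week of Allocations"])
    .sum()
    .reset_index()
)
print(combined)
```

After this step each state and week appears once, regardless of how many vendors shipped doses that week.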

In [None]:
#puts together all vaccine types (janssen, moderna and pfizer), groups them by state:
df_vaxxed = pd.concat([df_raw_vaxxed_janssen, df_raw_vaxxed_moderna, df_raw_vaxxed_pfizer])
df_vaxxed = df_vaxxed.groupby(by=[column_name_jurisdiction_vaxxed, column_name_week_vaxxed]).sum().reset_index()

#Removes country-level records (.copy() so the Date column can be added later without a SettingWithCopyWarning):
df_death_count_state_only = df_raw_death_count[df_raw_death_count[column_name_jurisdiction_deaths] != 'United States'].copy()

#Adds a date column with corresponding datetime objects instead of string dates:
df_vaxxed[column_name_date] = pd.to_datetime(df_vaxxed[column_name_week_vaxxed], format='%m/%d/%Y')

#removes the last 1 and 1/2 months of vaccination records, because there is no death data for them yet and they would bias the results:
max_date = max(df_vaxxed[column_name_date]) - relativedelta(months = 1) - relativedelta(days = 15)
df_vaxxed = df_vaxxed[df_vaxxed[column_name_date] < max_date]

#For each state, adds 8 weekly zero-dose records (about 2 months) before its first allocation, in order to get pre-vaccination death counts:
states = df_vaxxed[column_name_jurisdiction_vaxxed].unique()

pre_vax_rows = []
for state in states:
    min_date = df_vaxxed.loc[df_vaxxed[column_name_jurisdiction_vaxxed] == state, column_name_date].min()
    #8 records of 1 week each, i.e. about 2 months of pre-vaccination records
    for i in range(1, 9):
        week_date = min_date - pd.Timedelta(weeks = i)
        pre_vax_rows.append({column_name_jurisdiction_vaxxed:state, column_name_date:week_date, column_name_week_vaxxed:week_date.strftime("%m/%d/%Y"), column_name_first_dose_vaxxed:0, column_name_second_dose_vaxxed:0})

#DataFrame.append was removed in pandas 2.0, so the new rows are added with pd.concat:
df_vaxxed = pd.concat([df_vaxxed, pd.DataFrame(pre_vax_rows)], ignore_index=True)
        
#Adds a date column with corresponding datetime objects instead of string dates:
df_death_count_state_only[column_name_date] = pd.to_datetime(df_death_count_state_only[column_name_week_deaths], format='%Y-%m-%d')

* **Step 4:**
**Data preprocessing:** 

We are going to add the sum of all deaths by different causes to each weekly vaccination record, for the state.
So, for each vaccination weekly record, we include a whole month of accumulated death count.
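
To make the window concrete, here is a minimal sketch on toy data of the kind of sum that `get_monthly_increment` computes for one cause of death. The function name `sum_deaths_in_window` and the values are hypothetical; the window runs from the allocation date through one month plus 7 days later, inclusive.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Toy weekly death counts for one state (values are made up).
deaths = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-01-02", "2021-01-09", "2021-01-30", "2021-02-13", "2021-03-06"]
    ),
    "All Cause": [10, 20, 30, 40, 50],
})

def sum_deaths_in_window(df, start_date, number_of_months=1):
    """Sum deaths from start_date through start_date + N months + 7 days."""
    end_date = start_date + relativedelta(months=number_of_months) + relativedelta(days=7)
    in_window = (df["Date"] >= start_date) & (df["Date"] <= end_date)
    return df.loc[in_window, "All Cause"].sum()

# Window: 2021-01-04 through 2021-02-11, so only the 01-09 and 01-30 weeks count.
total = sum_deaths_in_window(deaths, pd.Timestamp("2021-01-04"))
print(total)
```

Note that consecutive weekly allocation records get overlapping windows, so the same death counts contribute to several rows.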

In [None]:
#calculate for monthly death count after vaccination
df_vaxxed[[column_name_increment_monthly_all_cause, column_name_increment_monthly_natural_cause, column_name_increment_monthly_septicemia, column_name_increment_monthly_malignant_neoplasms, column_name_increment_monthly_diabetes, column_name_increment_monthly_alzheimer, column_name_increment_monthly_influenza_pneumonia, column_name_increment_monthly__lower_respiratory, column_name_increment_monthly_respiratory, column_name_increment_monthly_nephritis, column_name_increment_monthly_non_classified, column_name_increment_monthly_non_cardiovascular, column_name_increment_monthly_non_cerebrovascular, column_name_increment_monthly_non_covid_multiple, column_name_increment_monthly_non_covid_underlying]] = df_vaxxed.apply(lambda x: get_monthly_increment(x[column_name_jurisdiction_vaxxed], x[column_name_date], 1), axis=1)

column_name_vaxxed_total = 'Total Vaccinations'
df_vaxxed[column_name_vaxxed_total] = df_vaxxed[column_name_first_dose_vaxxed] + df_vaxxed[column_name_second_dose_vaxxed]

* **Step 5:**
Now let's figure out the correlation:

In [None]:
# Only compute Pearson product-moment correlations between the death count
# columns and the -Total Vaccinations- target column
exclude_columns = [column_name_vaxxed_total, column_name_first_dose_vaxxed, column_name_second_dose_vaxxed, column_name_jurisdiction_vaxxed, column_name_week_vaxxed, column_name_date]
feature_target_corr_rows = []

for col in df_vaxxed:
    if col not in exclude_columns:
        #computes the correlation once and unpacks the coefficient and its p-value:
        correlation, p_value = pearsonr(df_vaxxed[col], df_vaxxed[column_name_vaxxed_total])
        feature_target_corr_rows.append({'Variable':col, 'Correlation':correlation, 'p-value':round_decimals_up(p_value, 4)})

#DataFrame.append was removed in pandas 2.0, so the table is built from the collected rows:
feature_target_corr = pd.DataFrame(feature_target_corr_rows, columns = ['Variable', 'Correlation', 'p-value'])

feature_target_corr = feature_target_corr.sort_values('Correlation', ascending = False)
feature_target_corr['Correlation'] = feature_target_corr['Correlation']*100

* **Step 6:**
We are going to do some visualizations of our findings:

In [None]:
#Plots correlation data of Vaccination Total by different disease death count:
plt.figure(figsize = (12,10))
sns.barplot(data = feature_target_corr, y='Variable', x='Correlation', orient = 'h').set_title('Correlation between Vaccination Dose and Death Cause')

plt.show()

That gives us an initial taste of what the data is saying. On this scale Pearson's correlation coefficient ranges from -100% to 100%; as a rough rule of thumb I treat values around 30%-40% as weak and everything above 50% as correlated.
This initial analysis tells us that the number of vaccination doses is correlated with all causes of death except Septicemia and COVID-19, and is **significantly** correlated with "Not classified Causes". Keep in mind that these are raw per-state counts, so part of the correlation reflects state population: larger states are allocated more doses and also record more deaths.
Let's dig deeper...
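
Because the correlations above are computed on raw per-state counts, it helps to see how much correlation a shared scale factor can produce by itself. The sketch below uses purely synthetic numbers (not the CDC data): hypothetical state populations, plus per-capita dose and death rates drawn independently of each other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical state populations, from small to large states.
population = rng.uniform(0.5e6, 40e6, size=50)

# Two per-capita rates drawn independently of each other (made-up ranges).
doses_per_capita = rng.uniform(0.01, 0.03, size=50)
deaths_per_capita = rng.uniform(0.0005, 0.0010, size=50)

# Raw counts both scale with population.
doses = population * doses_per_capita
deaths = population * deaths_per_capita

r_counts = np.corrcoef(doses, deaths)[0, 1]
r_rates = np.corrcoef(doses_per_capita, deaths_per_capita)[0, 1]
print(f"r on raw counts:       {r_counts:.2f}")
print(f"r on per-capita rates: {r_rates:.2f}")
```

In this synthetic setup the raw counts correlate strongly even though the underlying rates are unrelated, which is why per-capita rates are the safer basis for comparing states.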

In [None]:
# Dark theme seems fitting...
sns.set_theme(style="darkgrid")

# Fitting k-means clustering in order to get some indication of the groups
kmeans = KMeans(n_clusters=3
                , random_state=1).fit(df_vaxxed[[column_name_vaxxed_total, column_name_increment_monthly_non_classified]])
# This creates a new column called cluster which has cluster number for each row respectively.
df_vaxxed['cluster'] = kmeans.labels_

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

fig.suptitle('Death Stats by Vaccination Doses')

sns.scatterplot(data=df_vaxxed, x=column_name_vaxxed_total, y=column_name_increment_monthly_non_classified, ax=axes[0, 0], hue='cluster')
sns.scatterplot(data=df_vaxxed, x=column_name_vaxxed_total, y=column_name_increment_monthly_malignant_neoplasms, ax=axes[0, 1], hue='cluster')
sns.scatterplot(data=df_vaxxed, x=column_name_vaxxed_total, y=column_name_increment_monthly_non_cerebrovascular, ax=axes[0, 2], hue='cluster')
sns.scatterplot(data=df_vaxxed, x=column_name_vaxxed_total, y=column_name_increment_monthly_non_cardiovascular, ax=axes[1, 0], hue='cluster')
sns.scatterplot(data=df_vaxxed, x=column_name_vaxxed_total, y=column_name_increment_monthly__lower_respiratory, ax=axes[1, 1], hue='cluster')
sns.scatterplot(data=df_vaxxed, x=column_name_vaxxed_total, y=column_name_increment_monthly_diabetes, ax=axes[1, 2], hue='cluster')

plt.show()
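
The clustering cell above fits k-means on two columns only (total doses and unclassified deaths) and reuses those labels to color every panel. A minimal sketch of how the labels are assigned, on made-up points that form two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy (doses, deaths) pairs forming two well-separated groups (values are made up).
points = np.array([[0, 0], [1, 1], [100, 100], [101, 101]])

# n_init is set explicitly because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(points)
labels = kmeans.labels_
print(labels)
```

The cluster numbers themselves are arbitrary; only which points share a label is meaningful.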

# Conclusion #1

Statistical analysis shows that the number of COVID vaccine doses allocated is **strongly** correlated with unclassified deaths. From the visualization alone we can tell that with 0 doses, the maximum number of unclassified deaths in the month following the allocation week is around 500 for any state. As the number of allocated doses increases, the number of unclassified deaths in the following month increases by 300% or more.

It'd be interesting to see what more experienced data analysts can make of this data.

# Evaluating different regression models for deaths prediction
* **Linear Regression:**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import r2_score,mean_squared_error

In [None]:
#Setting the value for X and Y
x = df_vaxxed[[column_name_vaxxed_total]].values
y = df_vaxxed[column_name_increment_monthly_non_classified].values

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

In [None]:
model = LinearRegression() 
model.fit(x_train, y_train)

In [None]:
y_pred = model.predict(x_test)

In [None]:
plt.scatter(x, y, s = 10)

# predicted values
plt.plot(x_test, y_pred, color='r')
plt.show()

In [None]:
print(r2_score(y_test, y_pred))

**R² is about 0.6363, i.e. the simple Linear Regression model explains about 63.63% of the variance in unclassified deaths on the test set**
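
For reference, the score printed by `r2_score` is the coefficient of determination. A minimal sketch of the same formula written out by hand, with a hypothetical helper `r_squared` and made-up numbers:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                   # exact predictions
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predicts the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])       # worse than predicting the mean
print(perfect, mean_only, worse)
```

A score of 1 means exact predictions, 0 means no better than always predicting the mean, and negative values mean worse than that.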

In [None]:
# 0 means the model is perfect. Therefore the value should be as close to 0 as possible
meanAbErr = metrics.mean_absolute_error(y_test, y_pred)
meanSqErr = metrics.mean_squared_error(y_test, y_pred)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)
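
The three error metrics above can also be written out directly, which makes their units easier to read. A minimal sketch with made-up values:

```python
import numpy as np

# Toy observed and predicted monthly death counts (values are made up).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 2.0, 2.0, 6.0])

errors = y_true - y_pred
mean_abs_err = np.mean(np.abs(errors))     # average size of a miss, in deaths
mean_sq_err = np.mean(errors ** 2)         # squares the misses, so large ones dominate
root_mean_sq_err = np.sqrt(mean_sq_err)    # back in the same units as the target

print(mean_abs_err, mean_sq_err, root_mean_sq_err)
```

Since MAE and RMSE are in the units of the target, they are easiest to judge against the typical size of the monthly death counts being predicted.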

* **Decision Tree Regression:**

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
model = DecisionTreeRegressor() 
model.fit(x_train, y_train)

In [None]:
y_pred = model.predict(x_test)

In [None]:
# 0 means the model is perfect. Therefore the value should be as close to 0 as possible
meanAbErr = metrics.mean_absolute_error(y_test, y_pred)
meanSqErr = metrics.mean_squared_error(y_test, y_pred)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

In [None]:
plt.figure()
plt.scatter(x, y, s=20, edgecolor="black",
            c="darkorange", label="data")
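#sorts the test points by x so the lines are drawn left to right instead of zigzagging:
order = x_test[:, 0].argsort()
plt.plot(x_test[order], y_test[order], color="cornflowerblue",
         label="Test Set", linewidth=2)
plt.plot(x_test[order], y_pred[order], color="yellowgreen", label="Predicted", linewidth=2)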
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

In [None]:
print(r2_score(y_test, y_pred))

**R² is about 0.6223, i.e. the Decision Tree Regression model explains about 62.23% of the variance in unclassified deaths on the test set**

* **Random Forest Regression:**

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
model = RandomForestRegressor(n_estimators=300, max_depth=15, random_state=0) 
model.fit(x_train, y_train)

In [None]:
y_pred = model.predict(x_test)

In [None]:
# 0 means the model is perfect. Therefore the value should be as close to 0 as possible
meanAbErr = metrics.mean_absolute_error(y_test, y_pred)
meanSqErr = metrics.mean_squared_error(y_test, y_pred)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

In [None]:
plt.figure()
plt.scatter(x, y, s=20, edgecolor="black",
            c="darkorange", label="data")
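#sorts the test points by x so the lines are drawn left to right instead of zigzagging:
order = x_test[:, 0].argsort()
plt.plot(x_test[order], y_test[order], color="cornflowerblue",
         label="Test Set", linewidth=2)
plt.plot(x_test[order], y_pred[order], color="yellowgreen", label="Predicted", linewidth=2)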
plt.xlabel("data")
plt.ylabel("target")
plt.title("Random Forest Regression")
plt.legend()
plt.show()

In [None]:
print(r2_score(y_test, y_pred))

**R² is about 0.6714, i.e. the Random Forest Regression model explains about 67.14% of the variance in unclassified deaths on the test set**

# Using our predictor to measure deaths increment per million people.

In [None]:
from array import array

In [None]:
model.fit(x, y)

In [None]:
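#1 dose stands in for "no vaccination"; 1,000,000 doses is the high-allocation comparison point.
x_new = np.array([1]).reshape(-1, 1)
deaths_non_classified_no_vax = model.predict(x_new)
#Tree-based models do not extrapolate, so this prediction is only meaningful if 1,000,000 lies within the range of the training data.
x_new = np.array([1000000]).reshape(-1, 1)
deaths_non_classified_vaxxed = model.predict(x_new)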
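
One property to keep in mind when predicting at a fixed value such as 1,000,000 doses: a random forest predicts piecewise-constant values, so outside the range of its training data it simply repeats what it predicts at the edge. A minimal sketch on toy data (not the notebook's data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data: y grows linearly with x, but x only covers 0..9.
x_train = np.arange(10).reshape(-1, 1)
y_train = 2.0 * np.arange(10)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(x_train, y_train)

# Far outside the training range the forest cannot follow the trend:
# it returns the same value as at the largest x it was trained on.
far_prediction = forest.predict(np.array([[1_000_000]]))[0]
edge_prediction = forest.predict(np.array([[9]]))[0]
print(far_prediction, edge_prediction, y_train.max())
```

So the predicted value at 1,000,000 doses reflects the highest-allocation records in the data, not a trend projected from them.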

Model for Tumors:

In [None]:
y = df_vaxxed[column_name_increment_monthly_malignant_neoplasms].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(r2_score(y_test, y_pred))

In [None]:
# 0 means the model is perfect. Therefore the value should be as close to 0 as possible
meanAbErr = metrics.mean_absolute_error(y_test, y_pred)
meanSqErr = metrics.mean_squared_error(y_test, y_pred)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

In [None]:
x_new = np.array([1]).reshape(-1, 1)
deaths_non_tumor_no_vax = model.predict(x_new)
x_new = np.array([1000000]).reshape(-1, 1)
deaths_non_tumor_vaxxed = model.predict(x_new)

Model for cerebrovascular

In [None]:
y = df_vaxxed[column_name_increment_monthly_non_cerebrovascular].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(r2_score(y_test, y_pred))

In [None]:
# 0 means the model is perfect. Therefore the value should be as close to 0 as possible
meanAbErr = metrics.mean_absolute_error(y_test, y_pred)
meanSqErr = metrics.mean_squared_error(y_test, y_pred)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

In [None]:
x_new = np.array([1]).reshape(-1, 1)
deaths_non_brain_no_vax = model.predict(x_new)
x_new = np.array([1000000]).reshape(-1, 1)
deaths_non_brain_vaxxed = model.predict(x_new)

Model for cardiovascular

In [None]:
y = df_vaxxed[column_name_increment_monthly_non_cardiovascular].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(r2_score(y_test, y_pred))

In [None]:
# 0 means the model is perfect. Therefore the value should be as close to 0 as possible
meanAbErr = metrics.mean_absolute_error(y_test, y_pred)
meanSqErr = metrics.mean_squared_error(y_test, y_pred)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

In [None]:
x_new = np.array([1]).reshape(-1, 1)
deaths_non_heart_no_vax = model.predict(x_new)
x_new = np.array([1000000]).reshape(-1, 1)
deaths_non_heart_vaxxed = model.predict(x_new)

Model for lower respiratory diseases

In [None]:
y = df_vaxxed[column_name_increment_monthly__lower_respiratory].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(r2_score(y_test, y_pred))

In [None]:
# 0 means the model is perfect. Therefore the value should be as close to 0 as possible
meanAbErr = metrics.mean_absolute_error(y_test, y_pred)
meanSqErr = metrics.mean_squared_error(y_test, y_pred)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

In [None]:
x_new = np.array([1]).reshape(-1, 1)
deaths_non_lungs_no_vax = model.predict(x_new)
x_new = np.array([1000000]).reshape(-1, 1)
deaths_non_lungs_vaxxed = model.predict(x_new)

Model for diabetes

In [None]:
y = df_vaxxed[column_name_increment_monthly_diabetes].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(r2_score(y_test, y_pred))

In [None]:
# 0 means the model is perfect. Therefore the value should be as close to 0 as possible
meanAbErr = metrics.mean_absolute_error(y_test, y_pred)
meanSqErr = metrics.mean_squared_error(y_test, y_pred)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

In [None]:
x_new = np.array([1]).reshape(-1, 1)
deaths_non_diabetes_no_vax = model.predict(x_new)
x_new = np.array([1000000]).reshape(-1, 1)
deaths_non_diabetes_vaxxed = model.predict(x_new)

In [None]:
total_deaths_no_vax = deaths_non_classified_no_vax + deaths_non_tumor_no_vax + deaths_non_brain_no_vax + deaths_non_heart_no_vax + deaths_non_lungs_no_vax + deaths_non_diabetes_no_vax
total_deaths_vaxxed = deaths_non_classified_vaxxed + deaths_non_tumor_vaxxed + deaths_non_brain_vaxxed + deaths_non_heart_vaxxed + deaths_non_lungs_vaxxed + deaths_non_diabetes_vaxxed
delta = total_deaths_vaxxed - total_deaths_no_vax
print(delta*100/1000000)

In [None]:
delta_non_classified = deaths_non_classified_vaxxed- deaths_non_classified_no_vax
print(delta_non_classified*100/delta)

Given this model: 
* We are only considering the causes of death with the highest correlation: not classified, tumors, cerebrovascular, cardiovascular, chronic lower respiratory diseases and diabetes.
* We are completely excluding: nephrosis, natural causes, influenza and pneumonia, Alzheimer, septicemia, other respiratory diseases and COVID.
* We have a fair margin of error on the correlated causes.
* Any potential vaccine-related deaths from the excluded causes are not included.

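# **Conclusion:** In this model, state-weeks with 1,000,000 allocated doses are followed by about 1.09 more deaths per 100 allocated doses in the next month than state-weeks with almost no doses, and about 4% of that difference is in unclassified deaths (R00-R99, causes not elsewhere classified). This is a state-level association, not a measured per-person death rate from vaccination.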

# Covid 19 stats from official sources:

* [67.6% of people](https://www.reuters.com/world/india/india-govt-survey-shows-two-thirds-have-coronavirus-antibodies-2021-07-20/) surveyed in India had coronavirus antibodies, which I take here as natural immunity against COVID-19.
* 15.6% of people who get infected are asymptomatic. ([source](https://pubmed.ncbi.nlm.nih.gov/32691881/)).
* COVID-19 mortality rate in the USA: 1.8% at its highest (all ages). ([source](https://coronavirus.jhu.edu/data/mortality))
* 4.5% of all deaths occurred in patients under the age of 65 who did not have an underlying medical condition ([source](https://www.worldometers.info/coronavirus/coronavirus-death-rate/)).

# Conclusions: 
* Multiplying the figures above (and treating them as independent), the chance of getting a COVID-19 infection with symptoms is 27.35%.
* On the same basis, if you don't have an underlying condition and aren't vaccinated, your chance of dying with COVID is 0.023%. If you have an underlying condition and are unvaccinated, your chance of dying with COVID is 0.5%.
* The regression model above associates 1.09 additional deaths **in the following month** with every 100 doses allocated; my corresponding figure for people without an underlying condition is 0.058%. These are state-level associations between allocations and death counts, not measured individual risks of dying from the vaccine.
* I won't quantify the chances of other side effects from the COVID vaccine, as accurate data is not available right now.
* The chances of long-term side effects from the COVID vaccine, and of deaths after the first month, are **unknown at this time**, as no such data exists in these datasets.
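
The COVID-19 percentages in the list above can be reproduced, to rounding, by multiplying the statistics quoted from the sources. A minimal sketch of that arithmetic, under the strong assumption that the statistics are independent and describe the same population (they come from different studies and countries, so this is not established); the variable names are illustrative:

```python
# Figures quoted in the list of sources above.
antibody_share = 0.676        # share with antibodies (survey in India)
asymptomatic_share = 0.156    # share of infections without symptoms
mortality_rate = 0.018        # highest quoted US mortality rate
no_condition_share = 0.045    # share of deaths: under 65, no underlying condition

# ASSUMPTION: the figures are independent and describe the same population.
symptomatic = (1 - antibody_share) * (1 - asymptomatic_share)
dying = symptomatic * mortality_rate
dying_no_condition = dying * no_condition_share

print(f"symptomatic infection: {symptomatic:.2%}")
print(f"dying: {dying:.3%}")
print(f"dying, no underlying condition: {dying_no_condition:.4%}")
```

The results depend entirely on that independence assumption, so they are best read as rough orders of magnitude.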

# The main conclusion is: taken at face value, these figures would put the chance of dying around 152% higher with vaccination, **considering only the first month after vaccination**. However, the vaccination figures come from state-level correlations between dose allocations and death counts, both of which grow with state population, so the comparison does not show that vaccination caused those deaths. Mid-to-long term effects are unknown at this time.