#Lessons from AIDS for the COVID-19 Pandemic, by William A. Haseltine 

This article was originally published with the title "What We Learned from AIDS" in Scientific American 323, 4, 35-41 (October 2020)

doi:10.1038/scientificamerican1020-35

WE CAN LEARN FROM PARALLELS BETWEEN THE CORONAVIRUS AND HIV CRISES

Like all viruses, coronaviruses are expert code crackers. SARS-CoV-2 has certainly cracked ours. Think of this virus as an intelligent biological machine continuously running DNA experiments to adapt to the ecological niche it inhabits. This virus has caused a pandemic in large part because it acted on three of our most human vulnerabilities.

#Human issues: Our biological defenses, our clustering patterns of social behavior and our simmering political divides.

Several lessons from the long battle with HIV, the human immunodeficiency virus that causes AIDS, suggest what may lie ahead. HIV/AIDS is one of the worst scourges humans have encountered. As a code cracker, HIV is an expert. By the end of 2019 the global death toll from this virus was roughly 33 million people. In all, 76 million people have been infected, and scientists estimate another 1.7 million people acquire the virus every year.
https://www.scientificamerican.com/article/lessons-from-aids-for-the-covid-19-pandemic/

![](https://www.unaids.org/sites/default/files/covid_hiv_cover_960_0.jpg)unaids.org

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
import plotly.express as px

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#Just as with AIDS and cancer, we will need a combination of medicines to treat this disease

Of the nearly 38 million people currently living with HIV/AIDS, 25 million are receiving full antiretroviral treatments that prevent disease and suppress the virus so well they are unlikely to pass it along. I would wager that another 25 million or more infections never happened, primarily in sub-Saharan Africa, because these treatments became available in most countries.

From fighting this epic war against AIDS, doctors, virologists, epidemiologists and public health experts have learned crucial lessons that we can apply to the battle we are currently waging. For instance, we saw that vaccines are never a guarantee but that treatments can be our most important weapon. We discovered that human behavior plays a vital role in any disease-fighting effort and that we cannot overlook human nature. We have also seen how critical it is to build on knowledge and tools gained fighting earlier outbreaks—a strategy only possible if we continue funding research in between pandemics.
https://www.scientificamerican.com/article/lessons-from-aids-for-the-covid-19-pandemic/

In [None]:
nRowsRead = 1000 # specify 'None' if want to read whole file
df = pd.read_csv('../input/cusersmarildownloadssurveillancecsv/surveillance.csv', delimiter=';', encoding = "ISO-8859-1", nrows = nRowsRead)
df.dataframeName = 'surveillance.csv'
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')
df.head()

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSLd7XlU6z9ornp1KchLi5__B0vc9rMbGfZhw&usqp=CAU)msfaccess.org

#Column Names

"tbhiv_sentin_cil","Non-routine HIV surveillance",,"Lower limit (95% confidence interval) of prevalence (%) of HIV in TB patients estimated using sentinel sites"

"tbhiv_sentin_ciu","Non-routine HIV surveillance",,"Upper limit (95% confidence interval) of prevalence (%) of HIV in TB patients estimated using sentinel sites"

"tbhiv_sentin_prev","Non-routine HIV surveillance",,"Prevalence (%) of HIV in TB patients estimated using sentinel sites"

"tbhiv_sentin_yr","Non-routine HIV surveillance",,"Year in which prevalence of HIV in TB patients was estimated using sentinel sites"

"tbhiv_surv_cil","Non-routine HIV surveillance",,"Lower limit (95% confidence interval) of prevalence (%) of HIV in TB patients estimated using nationwide representative surveys"

"tbhiv_surv_ciu","Non-routine HIV surveillance",,"Upper limit (95% confidence interval) of prevalence (%) of HIV in TB patients estimated using nationwide representative surveys"

"tbhiv_surv_prev","Non-routine HIV surveillance",,"Prevalence (%) of HIV in TB patients estimated using nationwide representative surveys"

"tbhiv_surv_yr","Non-routine HIV surveillance",,"Year in which prevalence of HIV in TB patients was estimated using nationwide representative surveys"

https://gitlab.procc.fiocruz.br/oswaldo/DS_2017/blob/4e8f97ebbd15bdba60c2e3c0d91298a24801a2c5/exemplos/TB_data_dictionary_2017-10-23.csv

In [None]:
df.isnull().sum()

#VACCINE CHALLENGES

Unlike the hit-and-run polio virus, which evokes long-term immunity after an infection, HIV is a “catch it and keep it” virus: if you are infected, the pathogen stays in your body until it destroys the immune system, leaving you undefended against even mild infections. Moreover, HIV continually evolves, a shrewd opponent seeking ways to elude our immune responses. Although this does not mean a vaccine is impossible, it certainly meant developing one, especially when the virus hit in the 1980s, would not be easy. “Unfortunately, no one can predict with certainty that an AIDS vaccine can ever be made,” “That is not to say it is impossible to make such a vaccine, only that we are not certain of success.” More than 30 years later there still is no effective vaccine to prevent HIV infection.

#Covid-19 reinfection by the same Strain? Transient immune protection? 

From what we have seen of SARS-CoV-2, it interacts with our immune system in complex ways, resembling polio in some of its behavior and HIV in others. We know from nearly 60 years of observing coronaviruses that a body's immune system can clear them. That seems to be generally the case for SARS-CoV-2 as well. But the cold-causing coronaviruses, just like HIV, also have their tricks. Infection from one of them never seems to confer immunity to reinfection or symptoms by the same strain of virus—that is why the same cold viruses return each season. 

These coronaviruses are not a hit-and-run virus like polio or a catch-it-and-keep-it virus like HIV. I call them “get it and forget it” viruses—once cleared, your body tends to forget it ever fought this foe. Early studies with SARS-CoV-2 suggest it might behave much like its cousins, raising transient immune protection.

The path to a SARS-CoV-2 vaccine may be filled with obstacles. Whereas some people with COVID-19 make neutralizing antibodies that can clear the virus, not everybody does. Whether a vaccine will stimulate such antibodies in everyone is still unknown. Moreover, we do not know how long those antibodies can protect someone from infection. It may be two or three years before we will have the data to tell us and any confidence in the outcome.
https://www.scientificamerican.com/article/lessons-from-aids-for-the-covid-19-pandemic/

In [None]:
# categorical features with missing values
categorical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes=='O']
print(categorical_nan)

#Vaccines Success: Will it work? Will it be safe? Will it be long lasting?

Another challenge is how this virus enters the body: through the nasal mucosal membranes. No COVID-19 vaccine currently in development has shown an ability to prevent infection through the nose. In nonhuman primates, some vaccines can prevent the disease from spreading efficiently to the lungs. But those studies do not tell us much about how the same drug will work in humans; the disease in our species is very different from what it is in monkeys, which do not become noticeably ill.

We learned with HIV that attempts to prevent virus entry altogether do not work well—not for HIV and not for many other viruses, including influenza and even polio. Vaccines act more like fire alarms: rather than preventing fires from breaking out, they call the immune system for help once a fire has ignited.

The hopes of the world rest on a COVID-19 vaccine. It seems likely that scientists will announce a “success” sometime this year, but success is not as simple as it sounds. As I write, officials in Russia have reported approving a COVID-19 vaccine. Will it work? Will it be safe? Will it be long lasting? No one will be able to provide convincing answers to these questions for any forthcoming vaccine soon, perhaps not for at least several years.
https://www.scientificamerican.com/article/lessons-from-aids-for-the-covid-19-pandemic/

In [None]:
# replacing missing values in categorical features
for feature in categorical_nan:
    df[feature] = df[feature].fillna('None')

In [None]:
df[categorical_nan].isna().sum()

#FOCUS ON TREATMENTS - Resistance to Single Drugs

"Thirty thousand volunteers around the world participate in networks built by the National Institutes of Health for studies of new HIV vaccine candidates, and these networks are being tapped for initial testing of COVID-19 vaccines, too."

"The gold standard for AIDS treatment now is called antiretroviral therapy—essentially patients take a cocktail of at least three different drugs that attack the HIV virus in different ways. The strategy is based on earlier success we had in fighting cancer." 

"We already know resistance to single drugs will bedevil COVID-19 treatments. We have seen resistance to single, anti-SARS-CoV-2 drugs develop rapidly in early lab studies. Just as with AIDS and cancer, we need a combination of medicines to treat this disease. The goal of the biotechnology and pharmaceutical industries now is to develop an array of highly potent and specific drugs, each of which targets a different function of the virus. Decades of research on HIV has shown the way and gives us confidence in our eventual success."
https://www.scientificamerican.com/article/lessons-from-aids-for-the-covid-19-pandemic/

In [None]:
# Let's first handle numerical features with NaN values
numerical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes!='O']
numerical_nan

#HUMAN BEHAVIOR: Know your Risk, Take Necessary Precautions

People were NOT ignorant of how HIV is caught; the problem was that they did not change their lifestyle accordingly.

There is likewise a sexual dynamic to COVID-19 that often goes unmentioned. It is part of what is driving people out of their homes and into bars and parties. Anyone with a craving for a beer can quench their thirst in the safety of their own home, but gratification comes less easily for other desires, especially when one is young, single and living alone. Our public health strategies should not ignore this fact.

The same lessons we learned in the midst of the HIV epidemic to help young people change their behaviors apply today to COVID-19: know your risk, know your partners and take necessary precautions. Many young people operate under the false assumption that even if they become infected, they will not become severely ill. Not only is this belief untrue, but even people with asymptomatic infections can suffer serious, lasting damage. But the more people understand the risk (younger people especially) the greater likelihood they will take the steps necessary to protect themselves and others. We saw this happen with AIDS.
https://www.scientificamerican.com/article/lessons-from-aids-for-the-covid-19-pandemic/

In [None]:
df[numerical_nan].isna().sum()

#FUNDING

Governments and industry pulled the plug on coronavirus research funding in 2006 after the first SARS (severe acute respiratory syndrome) pandemic faded away and again in the years immediately following the MERS (Middle East respiratory syndrome, also caused by a coronavirus) outbreak when it seemed to be controllable.

Funding agencies everywhere, not just in the U.S. but in China, Japan, Singapore, Hong Kong and the Middle East (countries affected by SARS and MERS) underestimated the threat of coronaviruses. Despite clear, persistent, highly vocal warnings from many of those who battled SARS and MERS up close, funding dried up. 

The development of promising anti-SARS and MERS drugs, which might have been active against SARS-CoV-2 as well, was left unfinished for lack of money.

With 776,000 dead and 22 million infected globally as of mid-August, we have every motive to accelerate funding. The U.S. quickly opened the funding spigots last spring for research to hasten discoveries of vaccines and drugs. But will it be enough?

In Ronald Reagan's first administration, funding for HIV research was scarce; few scientists were willing to stake their careers on deciphering the molecular biology.

When the money flowed, outstanding scientists signed on. William A. Haseltine helped to design this first congressionally funded AIDS research program with Anthony Fauci, the doctor now leading the United States' fight against COVID-19. (And if there is one person in the world who has made the greatest contribution to the prevention and treatment of AIDS, that person is Fauci.)
https://www.scientificamerican.com/article/lessons-from-aids-for-the-covid-19-pandemic/

In [None]:
## Replacing the numerical Missing Values

for feature in numerical_nan:
    ## We will replace by using median since there are outliers
    median_value=df[feature].median()
    
    # Assign the result back; fillna(..., inplace=True) on a column selection may not update df
    df[feature] = df[feature].fillna(median_value)
    
df[numerical_nan].isnull().sum()
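The comment in the cell above chooses the median because of outliers. A minimal sketch of why that matters, using a made-up list of prevalence values (the numbers are illustrative and are not taken from surveillance.csv):

```python
import statistics

# Hypothetical prevalence values (%) with one extreme outlier; not real data
values = [1.0, 2.0, 2.0, 3.0, 90.0]

mean_value = statistics.mean(values)      # pulled upward by the single outlier
median_value = statistics.median(values)  # stays with the bulk of the data

print(f"mean = {mean_value}, median = {median_value}")
```

Filling missing entries with the mean here would insert a value far from most observations, while the median stays close to them.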

#Never let down our guard when it comes to Funding Antivirus Research.

Our tool set for virus and pharmaceutical research has improved enormously in the past 36 years since HIV was discovered. This is one reason to be confident that we will have effective antiviral drugs for treating COVID-19 infections by next year, if not sooner.

What used to take us five or 10 years in the 1980s and 1990s in many cases now can be done in five or 10 months. We can rapidly identify and synthesize chemicals to predict which drugs will be effective.

We can do cryoelectron microscopy to probe virus structures and simulate molecule-by-molecule interactions in a matter of weeks—something that used to take years. The lesson is to never let down our guard when it comes to funding antivirus research. We would have no hope of beating COVID-19 if it were not for the molecular biology gains we made during earlier virus battles. What we learn this time around will help us out during the next pandemic, but we must keep the money coming.
https://www.scientificamerican.com/article/lessons-from-aids-for-the-covid-19-pandemic/

In [None]:
from sklearn.preprocessing import LabelEncoder

#fill in mean for floats
for c in df.columns:
    if df[c].dtype=='float16' or  df[c].dtype=='float32' or  df[c].dtype=='float64':
        # fillna returns a new Series, so assign it back to the column
        df[c] = df[c].fillna(df[c].mean())

#fill in -999 for categoricals
df = df.fillna(-999)
# Label Encoding
for f in df.columns:
    if df[f].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(df[f].values))
        df[f] = lbl.transform(list(df[f].values))
        
print('Labelling done.')
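To make the label-encoding step above concrete, here is a small sketch on a hypothetical list of category values (the values are assumptions chosen for illustration, not columns read from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, for illustration only
values = ["AFR", "EUR", "AFR", "SEA"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

print(list(encoder.classes_))  # categories in sorted order
print(list(codes))             # position of each value within classes_

# The mapping is reversible
restored = list(encoder.inverse_transform(codes))
```

The integer codes follow the sorted order of the categories, so linear models may read an ordering into them that the categories do not have.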

In [None]:
df = pd.get_dummies(df)

#What story will be recounted about the COVID-19 Pandemic?

We are just beginning to glimpse what the long-term toll of COVID-19 might be. This is a new virus, so we will not have a clearer idea until after a few years, but we know it will be very high. We have barely scratched the surface of coronavirus molecular biology. What story will our children and grandchildren recount about our successes as scientists and as a society, and our failures, to contain this pandemic—the worst we have faced in 100 years?

We may be in for a protracted pandemic, or we may get lucky with effective treatments and vaccines soon. But we have been here before, facing an unknown viral enemy, and we can lean on lessons we have learned. This is not the first and will not be the last global epidemic.
https://www.scientificamerican.com/article/lessons-from-aids-for-the-covid-19-pandemic/

In [None]:
#Code by Puru Behl https://www.kaggle.com/accountstatus/mt-cars-data-analysis

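# distplot is deprecated in recent seaborn; histplot with kde=True and stat='density' gives a comparable plot
sns.histplot(df['tbhiv_surv_prev'], kde=True, stat='density')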
plt.axvline(df['tbhiv_surv_prev'].values.mean(), color='red', linestyle='dashed', linewidth=1)
plt.title('TB & HIV Surveillance Prevalence Distribution')

In [None]:
#Code by Firat Gonen https://www.kaggle.com/frtgnn/elo-eda-lgbm/notebook 

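plt.figure(figsize=(10, 6))
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

# distplot is deprecated in recent seaborn; histplot with kde=True and stat='density' gives a comparable plot
sns.histplot(df['tbhiv_surv_prev'], kde=True, stat='density')
plt.title('TB & HIV Surveillance Prevalence Distribution')
sns.despine()  # remove the top and right spines once the axes exist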

In [None]:
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

In [None]:
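# Target: prevalence (%) of HIV in TB patients from nationwide representative surveys.
# Note: df still contains this column, so the models below are fit on features that include the target.
y_train = df.tbhiv_surv_prev.values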

In [None]:
#Validation function
n_folds = 5

def rmsle_cv(model):
    # Pass the splitter itself; get_n_splits() only returns the fold count and drops the shuffle and seed
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    rmse= np.sqrt(-cross_val_score(model, df.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)
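One detail worth checking in a validation helper like the one above is what gets passed as `cv`. The sketch below uses synthetic data and a plain `LinearRegression`, both assumptions made only for illustration. It shows that `KFold.get_n_splits` returns just an integer, whereas passing the `KFold` object keeps the shuffled, seeded folds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

splitter = KFold(n_splits=5, shuffle=True, random_state=42)

# get_n_splits() returns only the number of folds
n_splits_only = splitter.get_n_splits(X)

# Passing the splitter object preserves shuffle=True and the random seed
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=splitter)
rmse_per_fold = np.sqrt(-scores)

print(n_splits_only, rmse_per_fold.shape)
```

With an integer `cv`, scikit-learn builds its own unshuffled folds, which can matter when the rows are ordered (for example by country or year).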

In [None]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))

In [None]:
#Elastic Net Regression : again made robust to outliers
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))

In [None]:
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)

In [None]:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)

In [None]:
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, verbosity=0,   # 'silent' is deprecated in favor of 'verbosity'
                             random_state =7, n_jobs = -1)    # 'n_jobs' replaces the older 'nthread'

In [None]:
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

#Base models scores

In [None]:
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [None]:
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [None]:
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [None]:
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [None]:
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [None]:
score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

In [None]:
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
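The averaging step in `predict` comes down to stacking each model's predictions as columns and taking the row-wise mean. A minimal sketch with two stand-in regressors on synthetic data (the models, hyperparameters and data are assumptions for illustration, not the notebook's tuned models):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=40)

models = [Lasso(alpha=0.001), Ridge(alpha=1.0)]
for model in models:
    model.fit(X, y)

# One column per model, one row per sample
stacked = np.column_stack([model.predict(X) for model in models])

# Row-wise mean gives the averaged prediction for each sample
averaged = stacked.mean(axis=1)

print(stacked.shape, averaged.shape)
```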

#Averaged base models score

In [None]:
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [None]:
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)
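The out-of-fold idea used in `fit` can also be sketched with scikit-learn's `cross_val_predict`, which returns, for every row, a prediction from a model that did not see that row during training. The data and the choice of base and meta models below are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data, for illustration only
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

kfold = KFold(n_splits=5, shuffle=True, random_state=156)
base_models = [Ridge(alpha=1.0), Lasso(alpha=0.001)]

# Each column holds one base model's predictions on rows it was not trained on
oof = np.column_stack([cross_val_predict(m, X, y, cv=kfold) for m in base_models])

# The meta-model learns how to combine the base models from those columns
meta_model = LinearRegression().fit(oof, y)

print(oof.shape, meta_model.coef_)
```

Training the meta-model on out-of-fold predictions rather than in-sample predictions keeps it from simply rewarding the base model that overfits most.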

#Stacking Averaged models Score

In [None]:
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))

#Ensembling StackedRegressor, XGBoost and LightGBM

In [None]:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
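The helper above is named `rmsle`, but as written it returns a plain root mean squared error; it only amounts to RMSLE when both inputs have already been log-transformed. A short sketch of the difference, with hypothetical helper names and made-up numbers:

```python
import math

def rmse_plain(y_true, y_pred):
    """Root mean squared error on the original scale."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def rmsle_log1p(y_true, y_pred):
    """RMSLE: the same formula applied after log1p on both inputs."""
    return rmse_plain([math.log1p(t) for t in y_true],
                      [math.log1p(p) for p in y_pred])

# Made-up values, for illustration only
y_true = [0.0, 10.0, 100.0]
y_pred = [0.0, 20.0, 90.0]

print(rmse_plain(y_true, y_pred), rmsle_log1p(y_true, y_pred))
```

The two metrics weight errors differently: RMSE treats an error of 10 the same everywhere, while RMSLE penalizes it more on small true values.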

#Final Training and Prediction StackedRegressor

In [None]:
stacked_averaged_models.fit(df.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(df.values)
# The target was never log1p-transformed in this notebook, so predictions are used as-is (no np.expm1)
stacked_pred = stacked_averaged_models.predict(df.values)
print(rmsle(y_train, stacked_train_pred))

In [None]:
model_xgb.fit(df, y_train)
xgb_train_pred = model_xgb.predict(df)
# The target was never log1p-transformed in this notebook, so predictions are used as-is (no np.expm1)
xgb_pred = model_xgb.predict(df)
print(rmsle(y_train, xgb_train_pred))

In [None]:
pip install --upgrade 'lightgbm>=3.0.0'

#The snippet below fixed the issue with LGB (feature names with JSON characters that resulted in an error)

LightGBMError: Do not support non-ASCII (Json) characters in feature name.

The new version of LGBM didn't fix that.

In [None]:
import re
df = df.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
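A quick way to see what that regular expression does is to run it on a few hypothetical column names (invented here for illustration; they are not the dataset's columns):

```python
import re

# Hypothetical column names, for illustration only
columns = ["tbhiv_surv_prev", "country (name)", "g_whoregion:AFR", 'note"x"']

# Keep letters, digits and underscores; drop everything else
cleaned = [re.sub('[^A-Za-z0-9_]+', '', c) for c in columns]
print(cleaned)

# Different original names can collapse to the same cleaned name, so check for clashes
has_duplicates = len(set(cleaned)) != len(cleaned)
print(has_duplicates)
```

If `has_duplicates` were true on real data, the renamed frame would carry repeated column labels, which is worth resolving before fitting.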

In [None]:
model_lgb.fit(df, y_train)
lgb_train_pred = model_lgb.predict(df)
# The target was never log1p-transformed in this notebook, so predictions are used as-is (no np.expm1)
lgb_pred = model_lgb.predict(df)
print(rmsle(y_train, lgb_train_pred))

In [None]:
'''RMSE on the entire train data when averaging'''

print('RMSE score on train data:')
print(rmsle(y_train,stacked_train_pred*0.70 +
               xgb_train_pred*0.15 + lgb_train_pred*0.15 ))
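The weighted blend above is a dot product between the per-model predictions and a weight vector that sums to 1. A small sketch with made-up prediction arrays (the numbers and the `blend` name are assumptions for illustration):

```python
import numpy as np

# Made-up targets and per-model predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0])
stacked_p = np.array([1.0, 2.0, 4.0])
xgb_p = np.array([2.0, 2.0, 3.0])
lgb_p = np.array([1.0, 3.0, 3.0])

# Same weights as in the cell above
weights = np.array([0.70, 0.15, 0.15])

# One column per model, then a weighted sum across columns
blend = np.column_stack([stacked_p, xgb_p, lgb_p]) @ weights

blend_rmse = np.sqrt(np.mean((y_true - blend) ** 2))
print(blend, blend_rmse)
```

Keeping the weights in one array makes it easy to confirm they sum to 1 and to try other combinations.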

#Ensemble prediction

In [None]:
ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15

In [None]:
test_ID = df['tbhiv_surv_prev']

In [None]:
#Code by Olga Belitskaya https://www.kaggle.com/olgabelitskaya/sequential-data/comments
from IPython.display import display,HTML
c1,c2,f1,f2,fs1,fs2=\
'#2B3A67','#42a7f5','Akronim','Smokum',30,15
def dhtml(string,fontcolor=c1,font=f1,fontsize=fs1):
    display(HTML("""<style>
    @import 'https://fonts.googleapis.com/css?family="""\
    +font+"""&effect=3d-float';</style>
    <h1 class='font-effect-3d-float' style='font-family:"""+\
    font+"""; color:"""+fontcolor+"""; font-size:"""+\
    str(fontsize)+"""px;'>%s</h1>"""%string))
    
    
dhtml('Marília Prata, @mpwolke was Here.' )