# The BCG Strain Pool: Diversity Matters, by Daria Bottai and Roland Brosch

Mol Ther. 2016 Feb; 24(2): 201–203. - Published online 2016 Feb 24. doi: 10.1038/mt.2016.18

The overall interpretation of BCG vaccine efficacy and the resulting recommendations are further complicated by the fact that BCG is not a single, pharmacologically well-defined vaccine but, rather, a pool of different BCG daughter strains that have acquired phenotypic and genotypic variations during decades of in vitro culturing in different laboratories. As reported in this issue of Molecular Therapy, Zhang and colleagues have now compared phenotypic and genotypic information of 13 different BCG strains with data on their virulence and vaccine efficacy in severe combined immunodeficient (SCID) and BALB/c mice, respectively.

The authors concluded that the distinct levels of virulence of the various strains might be linked to strain-specific duplications and deletions of genomic regions. Moreover, the authors observed a general trend whereby BCG strains showing higher virulence in SCID mice induced better protection against a Mycobacterium tuberculosis challenge in BALB/c mice relative to less virulent BCG strains. These observations have important implications for current BCG vaccination programs and are of particular relevance for both ongoing and future alternative TB vaccine development approaches. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4817828/

![](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4817828/bin/mt201618f1.jpg) 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4817828/


The figure above shows the genealogy of BCG daughter strains. Comparative genomic analyses identified several gene-specific single-nucleotide polymorphisms (SNPs) as well as large-sequence polymorphisms (genomic deletions, tandem duplications, and IS6110 insertion sequences), both in BCG vaccines relative to virulent strains of M. bovis and M. tuberculosis and among the different BCG daughter strains. The scheme shows the regions of difference (RD), insertions (in), deletions (del), and SNPs that differentiate the various BCG strains. The brown and blue dashed ellipses indicate the tandem duplications DU1 (restricted to BCG Pasteur) and DU2 (present in the BCG substrains in four possible forms), which enable classification of BCG strains into four major lineages.

Note that in other phylogenies (e.g., ref. 10), BCG China/BCG Beijing belongs to a cluster closely related to BCG Danish. It seems likely that two different groups of BCG exist that are both named BCG China or BCG Beijing. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4817828/

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
import plotly.express as px
import nltk
import string
from nltk.corpus import stopwords
import re

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

BCG, an attenuated anti-TB vaccine, is one of the most well-known examples of globally used vaccines developed in the twentieth century. Originally obtained by Albert Calmette and Camille Guérin in the early 1920s at the Institut Pasteur of Lille by serially passaging a virulent Mycobacterium bovis isolate on potato slices soaked in glycerol and ox bile, it remains one of the most widely used vaccines today (more than 120 million doses each year). 

Additional mutations in some individual BCG strains might also have contributed to the specific virulence phenotypes of certain strains. As an example, the low virulence shown by BCG Glaxo might have arisen from mutations leading to a defect in the synthesis of phthiocerol dimycocerosate and phenolic glycolipids, which are important virulence-associated lipids in tubercle bacilli. 

BCG Prague, for example, shows a frameshift mutation in PhoP, and has been found to be one of the least virulent BCG strains tested.

# Although BCG shows good efficacy in preventing disseminated forms of TB in young children, its widespread use has not prevented the pandemic spread of the disease.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4817828/

In [None]:
df = pd.read_csv('../input/hackathon/task_2-BCG_world_atlas_data-bcg_strain-7July2020.csv', encoding='utf8')
df.head()

![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTf76Ddd8qPJmlVNhnxzHmJyu3jPAcoQiMenQ&usqp=CAU)
intechopen.com - Mosaic Structure as the Main Feature of Mycobacterium bovis BCG Genomes

In [None]:
df.isnull().sum()

Handling Missing Values (Categorical Variables)

In [None]:
# categorical (object-dtype) features with at least one missing value
categorical_nan = [feature for feature in df.columns if df[feature].isna().sum() > 0 and df[feature].dtypes == 'O']
print(categorical_nan)

![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSulPLA1Bkobxn2zPABKEyQ0owLmpLgRxYOlw&usqp=CAU)
dcvmn.org

In [None]:
# replacing missing values in categorical features
for feature in categorical_nan:
    df[feature] = df[feature].fillna('None')

In [None]:
df[categorical_nan].isna().sum()

In [None]:
# Numerical features with at least one missing value (left disabled)
#numerical_nan = [feature for feature in df.columns if df[feature].isna().sum() > 0 and df[feature].dtypes != 'O']
#numerical_nan

In [None]:
#df[numerical_nan].isna().sum()

In [None]:
# Replacing the numerical missing values (left disabled)
# The median is used because there are outliers.
#for feature in numerical_nan:
#    median_value = df[feature].median()
#    df[feature] = df[feature].fillna(median_value)
#
#df[numerical_nan].isnull().sum()
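
The disabled cells above outline median imputation for numerical columns. A minimal sketch of that idea on a toy frame follows. The column names and values are hypothetical and are not taken from the BCG atlas file, and `pd.api.types.is_numeric_dtype` is used here in place of the `dtypes != 'O'` check.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are hypothetical, not from the BCG atlas file.
toy = pd.DataFrame({
    "doses": [1.0, np.nan, 2.0, 100.0, 1.0],
    "year": [1990.0, 2001.0, np.nan, 1985.0, 2010.0],
    "strain": ["Danish", None, "Japan", "Russia", "Danish"],
})

# Numerical columns with at least one missing value
numerical_nan = [
    col for col in toy.columns
    if pd.api.types.is_numeric_dtype(toy[col]) and toy[col].isna().any()
]

# The median is less sensitive to outliers (such as 100.0 above) than the mean
for col in numerical_nan:
    toy[col] = toy[col].fillna(toy[col].median())

# Remaining missing values per imputed column
print(toy[numerical_nan].isna().sum())
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True`, which can behave unexpectedly on column selections in recent pandas versions.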

In [None]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)

def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='black',
        stopwords=stopwords,
        max_words=200,
        colormap='Set2',
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(' '.join(data.astype(str)))  # use every value, not the truncated Series repr

    fig = plt.figure(1, figsize=(15, 15))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(df['bcg_strain_original'])

In [None]:
cnt_srs = df['vaccination_timing'].value_counts().head()
trace = go.Bar(
    y=cnt_srs.index[::-1],
    x=cnt_srs.values[::-1],
    orientation = 'h',
    marker=dict(
        color=cnt_srs.values[::-1],
        colorscale = 'Blues',
        reversescale = True
    ),
)

layout = dict(
    title='Vaccination Timing Distribution',
    )
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="vaccination_timing")

In [None]:
df['vaccination_timing_length']=df['vaccination_timing'].apply(len)

In [None]:
sns.set(font_scale=2.0)

g = sns.FacetGrid(df,col='bcg_strain_id',height=5)
g.map(plt.hist,'vaccination_timing_length')

In [None]:
plt.figure(figsize=(10,8))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
ax = sns.countplot(x='bcg_strain_t_cell_grp_3', data=df)
ax.set_xlabel(xlabel="bcg_strain_t_cell_grp_3", fontsize=17)
ax.set_ylabel(ylabel='count', fontsize=17)
ax.set_title('BCG Strain Lymphocyte Group 3', fontsize=17)
ax.tick_params(labelsize=13)

In [None]:
sns.set(font_scale=1.4)
plt.figure(figsize = (10,5))
# numeric_only=True skips text columns, which df.corr() rejects in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), cmap='summer', annot=True, linewidths=.5)

In [None]:
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

all_text=df['vaccination_timing']
train_text=df['vaccination_timing']
y=df['is_bcg_mandatory_for_all_children']

In [None]:
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)

In [None]:
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',  # stop_words is omitted: it only applies when analyzer='word'
    ngram_range=(2, 6),
    max_features=50000)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)

train_features = hstack([train_char_features, train_word_features])
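
To make the feature stacking above easier to inspect, here is a small sketch on made-up timing strings. The texts and the shorter `ngram_range` are assumptions chosen for illustration, not values from the dataset or the notebook's settings.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical timing strings, not rows from the dataset
texts = ["at birth", "after birth", "at school entry", "at birth and at 6 years"]

# Word-level and character-level TF-IDF, mirroring the two vectorizers above
word_vec = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", sublinear_tf=True)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), sublinear_tf=True)

word_features = word_vec.fit_transform(texts)
char_features = char_vec.fit_transform(texts)

# hstack concatenates columns; the row count (one row per document) is unchanged
features = hstack([char_features, word_features]).tocsr()

print(word_features.shape, char_features.shape, features.shape)
```

Converting the stacked matrix to CSR with `.tocsr()` keeps it row-indexable, which is convenient for later splitting into train and test rows.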

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_features, y,test_size=0.3,random_state=101)

In [None]:
import xgboost as xgb
# Use a separate name so the classifier does not shadow the xgboost module
xgb_model = xgb.XGBClassifier()
##xgb_model.fit(X_train, y_train)

![](https://media3.giphy.com/media/l3V0BVDTyuMzwpS1i/200.webp?cid=790b76117cd34d5455523e45a862c46eddd0032988573aa0&rid=200.webp)
As Richard Hendricks said (in the sitcom Silicon Valley, season 6): sorry, XGBoost, I don't have time to wait for you. (In fact, he put it in dirtier words.)
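
Since the XGBoost fit is left disabled, a lighter model can serve as a quick baseline. The sketch below swaps in a logistic regression on character TF-IDF features and scores it with `cross_val_score`. This is a substitute technique, not what the notebook ran, and the texts and labels are hypothetical toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: timing text and a yes/no "mandatory" label
texts = ["at birth", "after birth", "at birth", "at school entry",
         "none", "at birth", "none", "at school entry"]
labels = ["yes", "yes", "yes", "no", "no", "yes", "no", "no"]

# The pipeline refits the vectorizer inside each fold, so no test text leaks into training
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)

# With an integer cv and a classifier, scikit-learn uses stratified folds
scores = cross_val_score(baseline, texts, labels, cv=2, scoring="accuracy")
print(scores.mean())
```

On the real data, `texts` would be `df['vaccination_timing']` and `labels` would be `df['is_bcg_mandatory_for_all_children']`, assuming both columns are free of missing values at that point.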

That's it. Kaggle Notebook Runner: Marília Prata @mpwolke