#Mycobacterium tuberculosis Culture Filtrate Proteins plus CpG Oligodeoxynucleotides Confer Protection to Mycobacterium bovis
#BCG-Primed Mice by Inhibiting Interleukin-4 Secretion.

Authors: Denise Morais da Fonseca, Celio Lopes Silva, Pryscilla Fanini Wowk, Marina Oliveira e Paula,
Simone Gusmão Ramos, Cynthia Horn, Gilles Marchal, and Vânia Luiza Deperon Bonato.


Culture filtrate proteins (CFP) are potential targets for tuberculosis vaccine development. The authors previously showed that, despite the high level of gamma interferon (IFN-γ) production elicited by homologous immunization with CFP plus CpG oligodeoxynucleotides (CFP/CpG), they did not observe protection when these mice were challenged with Mycobacterium tuberculosis.

In order to use the IFN-γ-inducing ability of CFP antigens, in this study they evaluated a prime-boost heterologous immunization based on CFP/CpG to boost Mycobacterium bovis BCG vaccination in order to find an immunization schedule that could induce protection. Heterologous BCG-CFP/CpG immunization provided significant protection against experimental tuberculosis, and this protection was sustained during the late phase of infection and was even better than that conferred by a single BCG immunization.

The protection was associated with high levels of antigen-specific IFN-γ and interleukin-17 (IL-17) and low IL-4 production. The deleterious role of IL-4 was confirmed when IL-4 knockout mice vaccinated with CFP/CpG showed consistent protection similar to that elicited by BCG-CFP/CpG heterologous immunization.

These findings show that a single dose of CFP/CpG can represent a new strategy to boost the protection conferred by BCG vaccination. Moreover, different immunological parameters, such as IFN-γ and IL-17 and tightly regulated IL-4 secretion, seem to contribute to the efficacy of this tuberculosis vaccine.

https://www.arca.fiocruz.br/bitstream/icict/39057/2/ve_Fonseca_Denise_etal_INI_2009.pdf
https://www.arca.fiocruz.br/handle/icict/39057

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import cv2

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

![](https://iai.asm.org/content/iai/77/12/5311/F1.medium.gif)
#Protection against experimental TB conferred by the different vaccination schedules.

BALB/c mice (n = 8 to 10) were immunized subcutaneously with a single dose of BCG (BCG group), three doses of CFP/CpG (CFP/CpG group) at 7-day intervals, one dose of BCG followed by one dose of CFP/CpG after 15 days (BCG-CFP/CpG group), or one dose of CFP/CpG followed by one dose of BCG after 15 days (CFP/CpG-BCG group). Sixty days after vaccination, mice were challenged with a virulent strain of M. tuberculosis. At 30 (A) or 70 (B) days after infection, the lungs were processed for the CFU assay and histopathological analysis. Bacterial load is expressed as log10 CFU/g of lung from the means ± standard deviations of serial dilutions individually counted for each group. Results are from experiments repeated twice. #, P < 0.05 versus other groups; *, P < 0.05 versus nonimmunized, infected mice (MTB group); &, P < 0.05 versus the BCG-vaccinated group. For histological representation of the lungs from infected mice, sections (5 μm) of 30-day or 70-day infected lungs were stained with hematoxylin and eosin. Magnification, ×50.

https://iai.asm.org/content/77/12/5311
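
The caption reports bacterial load as log10 CFU/g of lung. As a rough illustration of that arithmetic, the sketch below converts a colony count from one plated dilution into log10 CFU per gram. The function name, its parameters, and the numbers are hypothetical and are not taken from the paper; this is generic plate-count arithmetic, not necessarily the authors' protocol.

```python
import math

def log10_cfu_per_gram(colonies, dilution_factor, plated_ml, homogenate_ml, tissue_g):
    """Convert a plate count into log10 CFU per gram of tissue (generic arithmetic)."""
    if colonies <= 0:
        raise ValueError("need at least one colony to take a logarithm")
    cfu_per_ml = colonies * dilution_factor / plated_ml   # CFU per ml of undiluted homogenate
    cfu_per_g = cfu_per_ml * homogenate_ml / tissue_g     # scale to the whole sample, per gram
    return math.log10(cfu_per_g)

# Hypothetical example: 50 colonies on the 1:1000 plate, 0.1 ml plated,
# 1 ml of homogenate prepared from 0.5 g of lung.
print(log10_cfu_per_gram(50, 1000, 0.1, 1.0, 0.5))
```

Working on the log scale keeps group means comparable even when raw counts differ by orders of magnitude.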

#Codes from Paul Mooney https://www.kaggle.com/paultimothymooney/what-is-inside-of-the-mueller-report/notebook

In [None]:
# Import Python Packages
# PyTesseract and Tika-Python for OCR
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import shutil
import PIL
import os
from os import walk
from shutil import copytree, ignore_patterns
from collections import Counter
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
from wand.image import Image as Img
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 500)
#mueller_report = pd.read_csv('../input/data-science-cheat-sheets/Interview Questions/AI Questions.pdf') # one row per line

In [None]:
# Define helper function for plotting word clouds
def wordCloudFunction(df,column,numWords):
    # adapted from https://www.kaggle.com/benhamner/most-common-forum-topic-words
    topic_words = [ z.lower() for y in
                       [ x.split() for x in df[column] if isinstance(x, str)]
                       for z in y]
    word_count_dict = dict(Counter(topic_words))
    popular_words = sorted(word_count_dict, key = word_count_dict.get, reverse = True)
    stop_words = set(stopwords.words("english"))  # build the stop-word set once instead of per word
    popular_words_nonstop = [w for w in popular_words if w not in stop_words]
    word_string=str(popular_words_nonstop)
    wordcloud = WordCloud(stopwords=STOPWORDS,
                          background_color='white',
                          max_words=numWords,
                          width=1000,height=1000,
                         ).generate(word_string)
    plt.clf()
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

In [None]:
# Define helper function for plotting word bar graphs
def wordBarGraphFunction(df,column,title):
    # adapted from https://www.kaggle.com/benhamner/most-common-forum-topic-words
    topic_words = [ z.lower() for y in
                       [ x.split() for x in df[column] if isinstance(x, str)]
                       for z in y]
    word_count_dict = dict(Counter(topic_words))
    popular_words = sorted(word_count_dict, key = word_count_dict.get, reverse = True)
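    stop_words = set(stopwords.words("english"))  # build the stop-word set once instead of per word
    popular_words_nonstop = [w for w in popular_words if w not in stop_words]
    top_words = popular_words_nonstop[0:50]  # may hold fewer than 50 words on a short page
    plt.barh(range(len(top_words)), [word_count_dict[w] for w in reversed(top_words)])
    plt.yticks([x + 0.5 for x in range(len(top_words))], list(reversed(top_words)))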
    plt.title(title)
    plt.show()
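
Both helpers above share the same counting step: split each line into words, lower-case them, drop stop words, and rank by frequency. A minimal sketch of that step alone, assuming a tiny stand-in stop-word set instead of the NLTK list (the function name and sample lines are hypothetical):

```python
from collections import Counter

def top_words(lines, stop_words, n):
    """Return the n most frequent lower-cased words that are not stop words."""
    words = [w.lower() for line in lines if isinstance(line, str) for w in line.split()]
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

# Hypothetical input standing in for OCR output; None mimics a non-string cell.
sample_lines = ["BCG vaccine and the BCG boost", "the BCG vaccine", None]
print(top_words(sample_lines, {"and", "the"}, 2))
```

`Counter.most_common` does the sorting that the helpers perform with `sorted(..., key=word_count_dict.get, reverse=True)`.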

In [None]:
# Preview the data folder
inputFolder = '../input/'
for root, directories, filenames in os.walk(inputFolder):
    for filename in filenames: 
        print(os.path.join(root,filename))
        
# Move data to folder with read/write access
outputFolder = '/kaggle/working/pdfs/'
shutil.copytree(inputFolder,outputFolder,ignore=ignore_patterns('*.db'))
for root, directories, filenames in os.walk(outputFolder, topdown=False):
    for file in filenames:
        try:
            shutil.move(os.path.join(root, file), outputFolder)
        except OSError:
            pass
print(os.listdir(outputFolder))

In [None]:
# Look at page 3 (ImageMagick page indices are zero-based, so '[3]' selects the fourth page of the file)
pdf = os.path.join(outputFolder,'immunity.pdf[3]')
with Img(filename=pdf, resolution=300) as img:
    img.compression_quality = 99
    img.convert("RGBA").save(filename='/kaggle/working/immunity.jpg') # intro page to preview later

#PDF to CSV

Convert Page 3 of PDF to CSV (Method 1 of 2: PyTesseract)

In [None]:
# Parse a PDF file and convert it to CSV using PyTesseract
import pytesseract
pdfimage = Image.open('/kaggle/working/immunity.jpg')
text = pytesseract.image_to_string(pdfimage)  
df = pd.DataFrame([text.split('\n')])

In [None]:
# Plot WordCloud of page 3
plt.figure(figsize=(10,10))
wordCloudFunction(df.T,0,10000000)
plt.figure(figsize=(10,10))
wordBarGraphFunction(df.T,0,"Most Common Words on Page 3 of BCG/CFP Immunity")

In [None]:
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

In [None]:
from colorama import Fore, Style

nRowsRead = 1000 # specify None to read the whole file
# Only the first 1000 rows are loaded for this preview; the full file contains more rows
df = pd.read_csv('../input/hackathon/task_2-owid-covid-data-22_September_2020.csv', delimiter=',', nrows = nRowsRead)
df.dataframeName = 'task_2-owid-covid-data-22_September_2020.csv'
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')
print(Fore.CYAN + 'Data shape: ',Style.RESET_ALL,df.shape)
df.head()

In [None]:
#Code by Md Redwan Karim Sony https://www.kaggle.com/redwankarimsony/space-missions-data-eda-temporal-analysis

# Calculating
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column': df.columns,
                                 'percent': percent_missing})
missing_value_df.sort_values('percent', inplace=True)
missing_value_df.reset_index(drop=True, inplace=True)
missing_value_df = missing_value_df[missing_value_df['percent']>0]

# Plotting
fig = px.bar(
    missing_value_df, 
    x='percent', 
    y="column", 
    orientation='h', 
    title='Columns with Missing Values', 
    height=200, 
    width=600
)
fig.show()
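
The percentage calculation above can be checked on a small frame. This is a minimal sketch with a hypothetical toy DataFrame (the column names are illustrative, not from the dataset); `isnull().mean()` gives the same fraction as `isnull().sum() / len(df)`.

```python
import pandas as pd

# Hypothetical toy frame with gaps in two of its three columns.
toy = pd.DataFrame({
    "cases": [1.0, None, 3.0, None],
    "deaths": [0.0, 1.0, 2.0, 3.0],
    "region": ["x", None, "z", "w"],
})

# isnull().mean() is the fraction of missing entries per column.
percent_missing = toy.isnull().mean() * 100

# Keep only columns that actually have gaps, smallest share first.
report = percent_missing[percent_missing > 0].sort_values()
print(report)
```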

#Handling Missing Values

In [None]:
# categorical features with missing values
categorical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes=='O']
print(categorical_nan)

In [None]:
# Let's handle numerical features with NaN values
numerical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes!='O']
numerical_nan

In [None]:
df[numerical_nan].isna().sum()

In [None]:
## Replacing the numerical Missing Values

for feature in numerical_nan:
    ## We will replace by using median since there are outliers
    median_value=df[feature].median()
    
    df[feature] = df[feature].fillna(median_value)  # assign back; in-place fillna on a column selection is unreliable in recent pandas
    
df[numerical_nan].isnull().sum()

#There are still many missing values.
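
For reference, the median replacement used in the loop above can be illustrated without pandas. The sketch below is a simplified stand-in with hypothetical values; it also shows why the median is preferred when outliers are present, since a single large value pulls the mean far more than the median.

```python
from statistics import mean, median

def fill_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    mid = median(observed)
    return [mid if v is None else v for v in values]

column = [1, 2, None, 100]        # hypothetical column with one outlier and one gap
print(fill_with_median(column))   # the gap takes the median of 1, 2, 100
print(mean([1, 2, 100]))          # the mean is dragged upward by the outlier
```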

In [None]:
cols_to_drop=['new_tests','total_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'tests_per_case', 'positive_rate', 'tests_units']
df=df.drop(cols_to_drop,axis=1)
df.columns

In [None]:
from sklearn.preprocessing import LabelEncoder

#fill in mean for floats
for c in df.columns:
    if df[c].dtype=='float16' or  df[c].dtype=='float32' or  df[c].dtype=='float64':
        df[c] = df[c].fillna(df[c].mean())  # fillna returns a new Series, so assign it back

#fill in -999 for categoricals
df = df.fillna(-999)
# Label Encoding
for f in df.columns:
    if df[f].dtype=='object': 
        lbl = LabelEncoder()
        values = df[f].astype(str)  # cast to str so the -999 placeholder and text share one type
        lbl.fit(values)
        df[f] = lbl.transform(values)
        
print('Labelling done.')
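
To make the encoding step concrete, here is a plain-Python sketch of the idea behind `LabelEncoder`: each distinct value is replaced by its position in the sorted list of distinct values. The function name and sample values are hypothetical, and this is a simplified stand-in rather than scikit-learn's implementation.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[v] for v in values]

classes, codes = label_encode(["Europe", "Asia", "Europe", "Africa"])
print(classes)
print(codes)
```

The resulting integers carry an alphabetical order that has no meaning for the data, which is worth remembering when reading correlations or charts built on encoded columns.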

In [None]:
df = pd.get_dummies(df)

In [None]:
import seaborn as sbn

correlation=df.corr()
plt.figure(figsize=(15,15))
sbn.heatmap(correlation,annot=True,cmap=plt.cm.Greens)

#Shap Codes by rossinEndrew https://www.kaggle.com/endrewrossin/fast-initial-lightgbm-model-to-detect-exam-result/comments

In [None]:
import shap
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
import random

In [None]:
SEED = 99
random.seed(SEED)
np.random.seed(SEED)

In [None]:
dfmodel = df.copy()

# read the "object" columns and use labelEncoder to transform to numeric
for col in dfmodel.columns[dfmodel.dtypes == 'object']:
    le = LabelEncoder()
    dfmodel[col] = dfmodel[col].astype(str)
    le.fit(dfmodel[col])
    dfmodel[col] = le.transform(dfmodel[col])

In [None]:
#change columns names to alphanumeric
dfmodel.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in dfmodel.columns]

In [None]:
X = dfmodel.drop(['stringency_index','population_density'], axis = 1)
y = dfmodel['stringency_index']

#I had already covered `total_deaths` and `aged_65_older`, so here I look at how `stringency_index` works with `population_density`

In [None]:
lgb_params = {
                    'objective':'binary',
                    'metric':'auc',
                    'n_jobs':-1,
                    'learning_rate':0.005,
                    'num_leaves': 20,
                    'max_depth':-1,
                    'subsample':0.9,
                    'n_estimators':2500,
                    'seed': SEED,
                    'early_stopping_rounds':100, 
                }
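
LightGBM's documentation describes the `binary` objective as requiring labels in {0, 1}, whereas `stringency_index` is a continuous score. If a binary target is what is wanted, one option is to threshold the score first. This is an assumption on my part, not something the notebook does; the function name, the sample values, and the choice of the median as cutoff are all hypothetical.

```python
from statistics import median

def binarize(values, threshold):
    """Return 1 for values strictly above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

scores = [11.1, 45.4, 67.6, 88.9, 25.0]   # hypothetical stringency values
cutoff = median(scores)                    # hypothetical choice of threshold
print(cutoff, binarize(scores, cutoff))
```

A regression objective would be the other route if the continuous score itself is the quantity of interest.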

In [None]:
# choose the number of folds, and create a variable to store the auc values and the iteration values.
K = 5
folds = KFold(K, shuffle = True, random_state = SEED)
best_scorecv= 0
best_iteration=0

# Separate the data into folds, create train and validation dataframes, train the model and calculate the mean AUC.
for fold , (train_index,test_index) in enumerate(folds.split(X, y)):
    print('Fold:',fold+1)
          
    X_traincv, X_testcv = X.iloc[train_index], X.iloc[test_index]
    y_traincv, y_testcv = y.iloc[train_index], y.iloc[test_index]
    
    train_data = lgb.Dataset(X_traincv, y_traincv)
    val_data   = lgb.Dataset(X_testcv, y_testcv)
    
    LGBM = lgb.train(lgb_params, train_data, valid_sets=[train_data,val_data], verbose_eval=250)
    best_scorecv += LGBM.best_score['valid_1']['auc']
    best_iteration += LGBM.best_iteration

best_scorecv /= K
best_iteration /= K
print('\n Mean AUC score:', best_scorecv)
print('\n Mean best iteration:', best_iteration)
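
The loop above relies on `KFold` to produce the splits and then averages the per-fold score. A minimal plain-Python sketch of that pattern follows; it does not reproduce scikit-learn's exact shuffling, and the function names and the stand-in scorer are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]        # k roughly equal slices
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def mean_cv_score(n_samples, k, seed, score_fold):
    """Average a per-fold score, as the loop above does with the AUC."""
    scores = [score_fold(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / k

# Stand-in scorer: the fraction of samples held out in each fold.
print(mean_cv_score(10, 5, 99, lambda tr, te: len(te) / 10))
```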

#Final Model

Modify the hyperparameters to use the best iteration value and train the final model.

In [None]:
lgb_params = {
                    'objective':'binary',
                    'metric':'auc',
                    'n_jobs':-1,
                    'learning_rate':0.05,
                    'num_leaves': 20,
                    'max_depth':-1,
                    'subsample':0.9,
                    'n_estimators':round(best_iteration),
                    'seed': SEED,
                    'early_stopping_rounds':None, 
                }

train_data_final = lgb.Dataset(X, y)
LGBM = lgb.train(lgb_params, train_data_final)  # train on the full dataset, not the last CV fold

In [None]:
print(LGBM)

In [None]:
# telling which model to use
explainer = shap.TreeExplainer(LGBM)
# Calculating the Shap values of X features
shap_values = explainer.shap_values(X)

In [None]:
shap.summary_plot(shap_values[1], X, plot_type="bar")

In [None]:
shap.summary_plot(shap_values[1], X)
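
SHAP values build on Shapley values from cooperative game theory: each feature receives its average marginal contribution over all subsets of the other features. `TreeExplainer` uses a tree-specific algorithm to get there; the brute-force sketch below only illustrates the underlying definition on a hypothetical three-feature payoff function and is not how the `shap` library computes them.

```python
from itertools import combinations
from math import factorial

def shapley_values(n_features, value):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    players = range(n_features)
    phi = []
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(len(others) + 1):
            # weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_value(s):
    """Hypothetical payoff: features 0 and 1 contribute alone and also interact."""
    return 1.0 * (0 in s) + 2.0 * (1 in s) + 4.0 * (0 in s and 1 in s)

print(shapley_values(3, toy_value))
```

The interaction term is split evenly between features 0 and 1, the unused feature 2 gets zero, and the values sum to the payoff of the full feature set.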

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.RdBu(np.linspace(0,1,20))
df["stringency_index"].value_counts().sort_values(ascending=False).head(20).plot.pie(y="population_density",colors=color,autopct="%0.1f%%")
plt.title("View of the Pandemic by Stringency Index")
plt.axis("off")
plt.show()

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.flag(np.linspace(0,1,20))
df["aged_65_older"].value_counts().sort_values(ascending=False).head(10).plot.pie(y="total_cases",colors=color,autopct="%0.1f%%")
plt.title("View of the Pandemic by Age 65 older")
plt.axis("off")
plt.show()

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.ocean(np.linspace(0,1,20))
df["female_smokers"].value_counts().sort_values(ascending=False).head(10).plot.pie(y="new_cases_smoothed",colors=color,autopct="%0.1f%%")
plt.title("View of the Pandemic by Female Smokers")
plt.axis("off")
plt.show()

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.Oranges(np.linspace(0,1,20))
df["diabetes_prevalence"].value_counts().sort_values(ascending=False).head(20).plot.pie(y="new_cases_per_million",colors=color,autopct="%0.1f%%")
plt.title("View of the Pandemic by Diabetes Prevalence")
plt.axis("off")
plt.show()

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.Blues(np.linspace(0,1,20))
df["extreme_poverty"].value_counts().sort_values(ascending=False).head(20).plot.pie(y="new_deaths_per_million",colors=color,autopct="%0.1f%%")
plt.title("View of the Pandemic by Extreme Poverty")
plt.axis("off")
plt.show()

In [None]:
import plotly.offline as pyo
import plotly.graph_objs as go
lowerdf = df.groupby('cardiovasc_death_rate').size()/df['total_deaths_per_million'].count()*100
labels = lowerdf.index
values = lowerdf.values

# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values,marker_colors = px.colors.sequential.Darkmint, hole=.6)])
fig.show()

In [None]:
fig = px.pie(df,
             values="stringency_index",
             names="population_density",
             template="seaborn")
fig.update_traces(rotation=90, pull=0.05, textinfo="percent+label")
fig.show()

In [None]:
fig = px.pie(df,
             values="aged_65_older",
             names="continent",
             template="seaborn")
fig.update_traces(rotation=90, pull=0.05, textinfo="percent+label")
fig.show()

In [None]:
px.bar(df, x = 'stringency_index', y = 'population_density', color = 'continent',orientation='h' , title='Covid-19 Stringency Index vs Population Density',  height = 500 )

In [None]:
px.bar(df, x = 'total_deaths', y = 'aged_65_older', color = 'continent',orientation='h' , title='Covid-19 Deaths by Age',  height = 500 )

In [None]:
px.histogram(df, x='total_deaths', range_x=[-5, 50], color='continent')

#Only three Continents? It's a Pandemic, there should be more continents.

In [None]:
fig = px.bar(df, x= "continent", y= "total_deaths", color_discrete_sequence=['crimson'], title="Total Covid-19 Deaths by Continent")
fig.show()

In [None]:
# Count Plot
plt.style.use("classic")
plt.figure(figsize=(8, 6))
sns.countplot(x=df['continent'], palette='RdBu', **{'hatch':'/','linewidth':3})  # pass the column as x; recent seaborn versions expect keyword arguments here
plt.xlabel("Continents")
plt.ylabel("Count")
plt.title("View of the Pandemic by Continents")
plt.xticks(rotation=45, fontsize=8)
plt.show()

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.rainbow(np.linspace(0,1,20))
df["continent"].value_counts().sort_values(ascending=False).head(10).plot.pie(y="total_deaths",colors=color,autopct="%0.1f%%")
plt.title("View of the Pandemic by Continents")
plt.axis("off")
plt.show()

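#Why only three continents? A likely cause is that only the first 1000 rows were loaded above (nRowsRead = 1000), so most countries never reach the DataFrame.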

In [None]:
fig = px.bar(df, x= "stringency_index", y= "population_density", color_discrete_sequence=['#2B3A67'], title="Stringency Index vs. Population density")
fig.show()

In [None]:
ls ../input/hackathon/task_1-google_search_txt_files_v2/PM/

In [None]:
SaintPierreMiquelon = '../input/hackathon/task_1-google_search_txt_files_v2/PM/Saint Pierre and Miquelon-en-result-108-original.txt'

In [None]:
with open(SaintPierreMiquelon, 'r', encoding='utf-8', errors='ignore') as f:
    text = f.read()  # the context manager closes the file after reading

In [None]:
print(text[:2500])

#"Bacille Calmette-Guérin (BCG) remains the only effective vaccine against disseminated TB, but its inability to confer complete protection against pulmonary TB in adolescents and adults calls for an urgent need to develop new and better vaccines. There is also a need to identify markers of disease protection and develop novel drugs."

Words from the text printed above (nanosymposium at the Institute of Infectious Disease and Molecular Medicine at the University of Cape Town).

In [None]:
#Code by Olga Belitskaya https://www.kaggle.com/olgabelitskaya/sequential-data/comments
from IPython.display import display,HTML
c1,c2,f1,f2,fs1,fs2=\
'#eb3434','#eb3446','Akronim','Smokum',30,15
def dhtml(string,fontcolor=c1,font=f1,fontsize=fs1):
    display(HTML("""<style>
    @import 'https://fonts.googleapis.com/css?family="""\
    +font+"""&effect=3d-float';</style>
    <h1 class='font-effect-3d-float' style='font-family:"""+\
    font+"""; color:"""+fontcolor+"""; font-size:"""+\
    str(fontsize)+"""px;'>%s</h1>"""%string))
    
    
dhtml('Thanks for your patience – please keep coming back to see my improvements, @mpwolke Was Here.' )