# Motivation
The manually reviewed file contains manually labeled data recording which snippet answers which question. This can be used to build a supervised dataset for each of the 8 questions. The file contains only positive examples, i.e. snippets that answer a question, but we can extract negative examples from the rest of the text.

This notebook creates a labeled dataset for the first question ("Which is the First Year of the BCG Policy?") and experiments with different classifiers on it.
The dataset is highly imbalanced, as we expect the real data to be: only a small number of snippets will be a valid answer to any particular question.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
pd.set_option('display.max_colwidth', None)  # show full snippet text in DataFrame output

In [None]:
path = '/kaggle/input/bcg-manually-reviewed-cleaned'
file = f'{path}/manually_reviewed_cleaned.csv'
df = pd.read_csv(file, encoding = "ISO-8859-1")

## Analyse questions

Here we check how many snippets we have for each question.

In [None]:
question_names = ['first_year','last_year','is_mandatory','timing','strain','has_revaccinations','revaccination_timing','location', 'manufacturer', 'company', 'groups']

df.columns = ['alpha_2_code', 'country', 'url', 'filename', 'is_pdf','Comments',
              'Snippet'] + question_names + ['snippet_len', 'text_len']

In [None]:
df.info()

The 'first_year' question has the most snippets (21), which is why we chose it for this experiment.

For some questions there are very few or even no snippets at all.

In [None]:
df_fy = df[df['first_year'].notna()][['alpha_2_code','country','url', 'filename','first_year']].reset_index()

In [None]:
f"Working with {df_fy.shape[0]} positive examples."

## Create negative examples

The manually reviewed dataset gives us the positive examples. To gather negative ones, we process the full texts that contain the positive snippets and keep only the sentences that are sufficiently different from the positive snippet. We measure similarity as the cosine similarity between TF-IDF vectors and treat sentences scoring below 0.1 as negatives.
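
As a minimal sketch of this filtering idea, the toy example below vectorizes a few made-up sentences with TF-IDF and keeps those whose cosine similarity to a positive snippet falls below a threshold. The sentences, the `positive` string, and the `threshold` name are illustrative assumptions and are not taken from the dataset; only the 0.1 cut-off mirrors the notebook cell.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical, not taken from the dataset).
positive = "BCG vaccination was introduced in 1950."
candidates = [
    "BCG vaccination was introduced in the country in 1950.",
    "The ministry publishes annual reports on road safety.",
    "Tuberculosis incidence has declined over several decades.",
]

# Fit TF-IDF on the candidate sentences, then project the positive
# snippet into the same vocabulary.
vectorizer = TfidfVectorizer()
candidate_matrix = vectorizer.fit_transform(candidates)
positive_vector = vectorizer.transform([positive])

# One similarity score per candidate sentence.
similarities = cosine_similarity(positive_vector, candidate_matrix)[0]

threshold = 0.1  # same cut-off as the notebook cell
negatives = [s for s, sim in zip(candidates, similarities) if sim < threshold]
print(negatives)
```

The first candidate shares most of its words with the positive snippet and is therefore excluded, while the two unrelated sentences share no vocabulary with it and are kept as negatives.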

In [None]:
def read_text(row):
    # Build the path to the source text file for this row.
    code = row['alpha_2_code']
    # Strip the extension if present; it is re-added when building the path.
    name = row['filename'].replace('.txt', '')
    text_path = f'/kaggle/input/hackathon/task_1-google_search_txt_files_v2/{code}/{name}.txt'

    with open(text_path, 'r') as f:
        return f.read()

import spacy
nlp = spacy.load('en_core_web_sm')

def get_snippets(text):
    '''
        Returns sentences in the text which contain more than 5 whitespace-separated words and at least one verb.
    '''
    return [sent.text.strip() for sent in nlp(text).sents
            if len(sent.text.strip().split()) > 5 and any(token.pos_ == 'VERB' for token in sent)]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import tqdm

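negative_examples = []

for _, row in tqdm.tqdm(df_fy.iterrows(), total=len(df_fy)):
    text = read_text(row)
    snippets = get_snippets(text)
    if not snippets:
        # TfidfVectorizer cannot be fitted on an empty list.
        continue

    # Fit TF-IDF on the sentences of this document only.
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(snippets)

    # Similarity of every sentence to the positive snippet.
    sim = cosine_similarity(tfidf_vectorizer.transform([row['first_year']]), tfidf_matrix)[0]

    # Keep only sentences that are clearly different from the positive snippet.
    low_sim = [snippet for snippet, s in zip(snippets, sim) if s < 0.1]
    negative_examples.extend(low_sim)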

Now let's combine the positive and negative examples:

In [None]:
# Stack positives (label 1) and negatives (label 0) into one frame.
df_data = pd.concat([
    pd.DataFrame({'snippet': df_fy['first_year'], 'label': 1}),
    pd.DataFrame({'snippet': negative_examples, 'label': 0}),
], ignore_index=True)

In [None]:
f"Working with {df_data.shape[0]} examples in total."

We remove snippets longer than 350 characters, since the task imposes this limit.

In [None]:
df_data['snippet_len'] = df_data['snippet'].apply(len)
df_data.drop(df_data[df_data['snippet_len'] > 350].index, inplace=True)

In [None]:
f"Working with {df_data.shape[0]} examples in total."

In [None]:
df_data['label'].value_counts(normalize=True)

As mentioned, the dataset is highly imbalanced.
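
One common way to compensate for this kind of imbalance, and the one the classifiers in the next section rely on through `class_weight='balanced'`, is to weight each class inversely to its frequency. The sketch below reproduces scikit-learn's documented formula `n_samples / (n_classes * np.bincount(y))` on made-up labels; the 95/5 split is an assumption for illustration, not the real class ratio of this dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 95 negatives and 5 positives.
y = np.array([0] * 95 + [1] * 5)

# The 'balanced' heuristic: n_samples / (n_classes * bincount(y)).
n_classes = 2
manual_weights = len(y) / (n_classes * np.bincount(y))

# The same weights computed by the library helper.
library_weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y
)

print(manual_weights)
print(library_weights)
```

With these toy labels, each positive example counts as much as roughly 19 negative ones during training, so the rare class is not simply ignored.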

## Model exploration

In [None]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.model_selection import cross_val_score

from sklearn.pipeline import Pipeline

from sklearn.dummy import DummyClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_data['snippet'], df_data['label'], test_size=0.33, random_state=42, stratify=df_data['label'])

In [None]:
C = 1
clfs = [
    ('Dummy', DummyClassifier(strategy='prior')),
    ('RF', RandomForestClassifier(n_estimators=100, max_depth=2, class_weight='balanced')),
    ('LogReg', LogisticRegression(random_state=0, solver='lbfgs', class_weight='balanced')),
    ('SVM SGD', SGDClassifier(max_iter=1000, tol=1e-3, class_weight='balanced')),
    ('SVM linear', SVC(kernel='linear', C=C, class_weight='balanced')),
    ('SVM RBF', SVC(kernel='rbf', C=C, class_weight='balanced')),
    ('SVM Poly', SVC(kernel='poly', C=C, class_weight='balanced')),
]


for name, clf in clfs:
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', clf),
    ])

    # 5-fold cross-validation on the training split; 'Test' below is the held-out fold.
    scores = cross_validate(pipeline, X_train, y_train, cv=5,
                            scoring=('accuracy', 'f1', 'roc_auc'), return_train_score=True)

    for label, key in [('ACC', 'accuracy'), ('F1', 'f1'), ('AUC', 'roc_auc')]:
        print("{:10s} {:5s} | Train: {:.3f}, Test: {:.3f}".format(
            name, label, np.mean(scores[f'train_{key}']), np.mean(scores[f'test_{key}'])))

    print()

To choose a model, we compare three cross-validated metrics: accuracy, F1 score, and ROC AUC. Logistic Regression appears to have the highest scores on all of them.
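
Accuracy alone can be misleading on imbalanced data, which is why F1 and AUC are reported alongside it. The sketch below illustrates this on made-up labels (an assumed 95/5 split, not the notebook's data) with a baseline that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical ground truth: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A baseline that always predicts the majority (negative) class.
y_pred = np.zeros_like(y_true)
# Constant scores carry no ranking information.
y_score = np.full(len(y_true), 0.05)

acc = accuracy_score(y_true, y_pred)            # high despite finding no positives
f1 = f1_score(y_true, y_pred, zero_division=0)  # zero: no positive was retrieved
auc = roc_auc_score(y_true, y_score)            # chance level for constant scores

print(f"ACC: {acc:.3f}, F1: {f1:.3f}, AUC: {auc:.3f}")
```

A model is only worth considering if it clearly beats such a baseline on F1 and AUC, not merely on accuracy.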

## Model evaluation

In [None]:
from sklearn.metrics import accuracy_score, f1_score, classification_report

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(random_state=0, solver='lbfgs', class_weight='balanced')),
])
    
pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)
pred_proba = pipeline.predict_proba(X_test)[:,1]

In [None]:
result = pd.DataFrame({'proba': pred_proba, 'label': y_test, 'text': X_test})

In [None]:
result.sort_values('proba', ascending=False).head(10)

In [None]:
accuracy_score(y_test, pred)

In [None]:
f1_score(y_test, pred)

In [None]:
print(classification_report(y_test, pred))

The results look promising: on the held-out test set, the classifier extracts 2 valid answers with no false positives.
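
Counts like these can be read directly off a confusion matrix. The sketch below uses made-up labels and predictions as stand-ins (in the notebook, `y_test` and `pred` would take their place); the numbers are illustrative assumptions and are not the notebook's actual results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions standing in for y_test and pred.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0])

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of true positives that were found

print(f"TP: {tp}, FP: {fp}, FN: {fn}, TN: {tn}")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
```

Looking at recall alongside precision shows how many valid answers were missed, which a "no false positives" summary does not reveal on its own.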