This notebook filters the Kaggle Quora training set, retaining only questions in which more than 20% of the unique whitespace-separated tokens appear in a list of [biomedical terms](https://github.com/glutanimate/wordlist-medicalterms-en).


In [3]:
import pandas as pd

In [None]:
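# download Kaggle training questions
# note: Kaggle downloads normally require a logged-in session, so this
# unauthenticated request may return a login page rather than the CSV
!wget https://www.kaggle.com/c/quora-insincere-questions-classification/download/train.csv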

In [4]:
# load Kaggle training questions (other question sets in the same format work too)
quora_train = pd.read_csv('/Users/jlzhou/Documents/med277/train.csv')

# load medical terms, one per line
with open('/Users/jlzhou/Documents/med277/wordlist-medicalterms-en/wordlist.txt') as f:
    med_glossary = f.read().splitlines()
# convert glossary to a set for fast membership tests and set intersection
med_set = set(med_glossary)

In [5]:
# function for filtering questions
def filter_questions(df):
    # get set of unique whitespace-separated tokens in each question ("bag of words")
    # note: this adds a 'bow' column to the input df in place
    df['bow'] = df.question_text.apply(lambda x: set(x.split()))
    
    # keep questions in which more than 20% of the unique tokens are medical terms
    ix_to_keep = []
    for ix,row in df.iterrows():
        if len(row.bow & med_set) > 0.2*len(row.bow):
            ix_to_keep.append(ix)
    
    # subset df by index label (iterrows yields labels, not positions)
    med_df = df.loc[ix_to_keep,:]
    return med_df
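
Because `x.split()` keeps case and trailing punctuation, tokens such as `time?` or `Why` can only match the word list if they appear there in exactly that form. The sketch below shows one possible normalization step. The helper names `tokenize` and `medical_fraction` and the two-word `toy_terms` set are hypothetical, and the sketch assumes the word list is lowercase, which is not verified here.

```python
import string

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation, drop empties."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return {tok for tok in tokens if tok}

def medical_fraction(text, terms):
    """Fraction of a question's unique tokens that appear in `terms`."""
    bow = tokenize(text)
    if not bow:
        return 0.0
    return len(bow & terms) / len(bow)

# hypothetical stand-in for med_set, for illustration only
toy_terms = {"fever", "insulin"}

print(tokenize("Is Insulin safe? Fever, too."))
print(medical_fraction("Is Insulin safe? Fever, too.", toy_terms))
```

Swapping this tokenizer into `filter_questions` would change which questions pass the 20% threshold, so the counts reported in this notebook would no longer apply.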

In [7]:
quora_train_med = filter_questions(quora_train)
quora_train_med.head()

Unnamed: 0,qid,question_text,target,bow
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0,"{does, Why, time?, space, Does, geometry?, vel..."
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0,"{How, von, hemispheres?, Magdeburg, used, the,..."
7,0000559f875832745e2e,Is it crazy if I wash or wipe my groceries off...,0,"{I, everywhere., my, off?, wipe, Germs, crazy,..."
13,000092a90bcfbfe8cd88,Can we use our external hard disk as a OS as w...,0,"{OS, well, be, a, our, external, affected?, fo..."
16,0000b8e1279eaa0a7062,How difficult is it to find a good instructor ...,0,"{How, take, you?, is, near, a, good, to, instr..."


Let's take a look at one of the questions remaining after filtering:

In [10]:
quora_train_med.iloc[0,:].question_text

'Why does velocity affect time? Does velocity affect space geometry?'

Which of the words in this question are in the set of medical terms?

In [9]:
quora_train_med.iloc[0,:].bow&med_set

{'affect', 'space', 'velocity'}
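
To see why this physics question passes the filter, the 20% criterion can be worked through by hand. The sketch below hard-codes the three matched terms from the output above instead of loading the full word list.

```python
question = 'Why does velocity affect time? Does velocity affect space geometry?'

# same tokenization as filter_questions: unique whitespace-separated tokens
bow = set(question.split())

# overlap with med_set, copied from the cell output above
matched = {'affect', 'space', 'velocity'}

fraction = len(matched) / len(bow)
print(len(bow))        # 8 unique tokens
print(fraction)        # 0.375
print(fraction > 0.2)  # True, so the question is retained
```

Three of the eight unique tokens are everyday words that also appear in the medical word list, which is enough to clear the threshold.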

In [16]:
print('%d questions in initial set of questions' % quora_train.shape[0])
print('%d questions in filtered set of questions' % quora_train_med.shape[0])
print('%d insincere (target) questions in initial set' % sum(quora_train.target))
print('%d insincere (target) questions in filtered set' % sum(quora_train_med.target))

1306122 questions in initial set of questions
547610 questions in filtered set of questions
80810 insincere (target) questions in initial set
28314 insincere (target) questions in filtered set
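
The counts above are easier to compare as proportions. The sketch below recomputes them from the printed numbers; the variable names are illustrative.

```python
# counts copied from the output above
n_initial, n_filtered = 1306122, 547610
t_initial, t_filtered = 80810, 28314

retained = n_filtered / n_initial       # share of questions that pass the filter
rate_initial = t_initial / n_initial    # insincere rate before filtering
rate_filtered = t_filtered / n_filtered # insincere rate after filtering

print('%.1f%% of questions retained' % (100 * retained))                 # 41.9%
print('%.1f%% insincere in initial set' % (100 * rate_initial))          # 6.2%
print('%.1f%% insincere in filtered set' % (100 * rate_filtered))        # 5.2%
```

The filter keeps roughly two in five questions, and the insincere rate in the filtered set is about one percentage point lower than in the initial set.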
