## Classification of Covid 19 tweets to assertain what can be learnt about public knowledge on symptoms, treatment and prevention of the virus

this project is an effort to contribute to health surveillance in order to help medical professionals coalate useful information to aid in intervention measures in the situation of an outbreak

The objectives of this project are:

1. implementation of bert algorithm 
2. aid in labelling large data
3. information gathering


### Data processing

In [1]:
import pandas as pd
import numpy as np

In [3]:
# load covid literature data
covid_literature_df = pd.read_csv('C:/Users/Sammy/2021/final-project/data/metadata.csv', usecols=['title','abstract','authors','doi','publish_time','pdf_json_files'])
np.random.seed()


  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
covid_literature_df.head()

Unnamed: 0,title,doi,abstract,publish_time,authors,pdf_json_files
0,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",document_parses/pdf_json/d1aafb70c066a2068b027...
1,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,Inflammatory diseases of the respiratory tract...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",document_parses/pdf_json/6b0567729c2143a66d737...
2,Surfactant protein-D and pulmonary host defense,10.1186/rr19,Surfactant protein-D (SP-D) participates in th...,2000-08-25,"Crouch, Erika C",document_parses/pdf_json/06ced00a5fc04215949aa...
3,Role of endothelin-1 in lung disease,10.1186/rr44,Endothelin-1 (ET-1) is a 21 amino acid peptide...,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",document_parses/pdf_json/348055649b6b8cf2b9a37...
4,Gene expression in epithelial cells in respons...,10.1186/rr61,Respiratory syncytial virus (RSV) and pneumoni...,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",document_parses/pdf_json/5f48792a5fa08bed9f560...


In [11]:
#missing values
missing_values = covid_literature_df.isnull().sum()
missing_values

title                216
doi               180633
abstract          115815
publish_time         220
authors            12283
pdf_json_files    274650
dtype: int64

In [10]:
#percentage of missing data
all_cells = np.product(covid_literature_df.shape)
all_missing = missing_values.sum()
percentage_missing = (all_missing/all_cells)*100
print(percentage_missing)

23.501964478366585


In [21]:
#drop duplicates and fill missing values
covid_literature_df.duplicated()
covid_literature_df.drop_duplicates(keep='first', inplace=True)
covid_literature_df.fillna('no text available', inplace = True)

In [30]:
covid_literature_df['abstract'] = (covid_literature_df['title']+ ' ' + covid_literature_df['abstract']).apply(lambda row: row.strip())

In [31]:
covid_literature_df['abstract'] = covid_literature_df['abstract'].apply(lambda row: row.replace('no text available',''))

In [35]:
#reduce the dataset to literature discussing covid specific topics
def find_covid_lit(df):
    df1 = df[df['abstract'].str.contains('covid')]
    df2 = df[df['abstract'].str.contains('-cov-2')]
    df3 = df[df['abstract'].str.contains('cov2')]
    df4 = df[df['abstract'].str.contains('ncov')]
    df5 = df[df['abstract'].str.contains('corona')]
    
    data =[df1,df2,df3,df4,df5]
    df = pd.concat(data)
    df=df.drop_duplicates(subset='title', keep="first")
    return df
    
df=find_covid_lit(covid_literature_df)
print (df.shape)
df.head()   

(70814, 6)


Unnamed: 0,title,doi,abstract,publish_time,authors,pdf_json_files
11852,"Scope, quality, and inclusivity of clinical gu...",10.1136/bmj.m2371,"Scope, quality, and inclusivity of clinical gu...",2020-06-12,no text available,no text available
11975,COVID-KOP: Integrating Emerging COVID-19 Data ...,10.26434/chemrxiv.12462623,COVID-KOP: Integrating Emerging COVID-19 Data ...,2020-06-18,"Korn, Daniel; Bobrowski, Tesia; Li, Michael; K...",no text available
12752,Assessment of workers’ personal vulnerability ...,10.1093/occmed/kqaa150,Assessment of workers’ personal vulnerability ...,2020-08-06,"Coggon, David; Croft, Peter; Cullinan, Paul; W...",document_parses/pdf_json/f829194cf74c2dc93ddce...
14292,REDIAL-2020: A Suite of Machine Learning Model...,10.26434/chemrxiv.12915779,REDIAL-2020: A Suite of Machine Learning Model...,2020-09-16,"KC, Govinda; Bocci, Giovanni; Verma, Srijan; H...",document_parses/pdf_json/eaeb169ec208d2921026b...
26291,Joint Commission releases COVID‐19 resources,10.1002/mhw.32331,Joint Commission releases COVID‐19 resources J...,2020-04-17,no text available,no text available


### labeling the literature data based on terms in the abstract

In [43]:
import functools
from nltk import PorterStemmer

# converts terms to morphological root eg. smoked and smoking to smok
def stemmer(words):
    stemmer = PorterStemmer()
    tokens=[]
    for w in words:
        tokens.append(stemmer.stem(w))
    return tokens


def search_dataframe(df,search_terms):   
    search_words=stemmer(search_terms)
    df1=df[functools.reduce(lambda a, b: a&b, (df['abstract'].str.contains(s) for s in search_terms))]
    return df1


# Get best results from abstract
def get_sentences(df1,search_terms,str1):
    df_table = pd.DataFrame(columns = ["pub_date","authors","title","excerpt",'label'])
    search_terms=stemmer(search_terms)
    for index, row in df1.iterrows():
        
        pub_sentence=''
        sentences_used=0
        #break apart the absracrt to sentence level
        sentences = row['abstract'].split('. ')
        #loop through the sentences of the abstract
        highligts=[]
        for sentence in sentences:
            # missing lets the system know if all the words are in the sentence
            missing=0
            #loop through the words of sentence
            for word in search_terms:
                #if keyword missing change missing variable
                if word not in sentence:
                    missing=1
                    
            # after all sentences processed show the sentences not missing keywords
            if missing==0 and len(sentence)<1000 and sentence!='':
                sentence=sentence.capitalize()
                if sentence[len(sentence)-1]!='.':
                    sentence=sentence+'.'
                pub_sentence=pub_sentence+sentence
                
        if pub_sentence!='':
            sentence=pub_sentence
            sentences_used=sentences_used+1
            authors=row["authors"].split(" ")            
            title=row["title"]                       
            to_append = [row['publish_time'],authors[0]+' et al.',title,sentence,str1]
            df_length = len(df_table)
            df_table.loc[df_length] = to_append
    return df_table

In [44]:
#list of labels
labels = ['transmission','treatment','symptoms']
# list of search terms
search=[['transmission','infection'],['prevent','treat','quarantine'],['symptom','asymptomatic']]
q=0
dfi_table = ['','','']

for search_terms in search:
    str1=''
    str1 = labels[q]
    
    #search the dataframe for all words
    df1=search_dataframe(df,search_terms)
   
    # get best sentences
    df_table=get_sentences(df1,search_terms,str1)    
    
    length=df_table.shape[0]

    if length<1:
        print ("No reliable answer could be located in the literature")
    else:
        display(df_table.head())
    dfi_table[q] = df_table
    q=q+1
    
print ('done')

Unnamed: 0,pub_date,authors,title,excerpt,label
0,2021-01-04,"Palas, et al.",Pediatric E.N.T. emergencies during COVID-19 p...,Comprehensively we recommend intervention only...,transmission
1,2020-12-31,"Coleman, et al.",1112. #EducationInTheTimeofCOVID: Using Twitte...,"Topics included public health & prevention, vi...",transmission
2,2020,"Alsohime, et al.",COVID-19 infection prevalence in pediatric pop...,The early identification of sars-cov-2 infecti...,transmission
3,2020,"Khan, et al.",Plant derived antiviral products for potential...,"However, transmission and infectivity rate of ...",transmission
4,2020-04-23,"Flaxman, et al.",Estimating the number of infections and the im...,Our model estimates these changes by calculati...,transmission


Unnamed: 0,pub_date,authors,title,excerpt,label
0,2020-05-27,Velarde-Ruiz et al.,Manifestaciones hepáticas y repercusion en el ...,"The recommendations for those patients, in add...",treatment
1,2020-05-26,"Kawasaki, et al.",Highly sensitive quantitative and rapid immuno...,As pathogens such as the influenza virus and s...,treatment
2,2021-01-01,"Berekaa, et al.","Insights into the COVID-19 pandemic: Origin, p...","Currently, there are no treatments for this in...",treatment
3,2020-05-01,"Fan, et al.",[Discussion and prospect of infusion of NK cel...,Although quarantine can effectively prevent an...,treatment
4,2021,"Berekaa, et al.","Insights into the COVID-19 pandemic: Origin, p...","Currently, there are no treatments for this in...",treatment


Unnamed: 0,pub_date,authors,title,excerpt,label
0,2020-09-30,"Sironi, et al.",Anthropological analysis on recent covid-19 in...,There was also incorrect medical information (...,symptoms
1,2020,"Huamaní, et al.",Estimated conditions to control the covid-19 p...,Materials and methods outbreak si mulations fo...,symptoms
2,2020-11-11,"Asadollahi-Amin, et al.",Postoperative COVID-19 pneumonia in an asympto...,Postoperative covid-19 pneumonia in an asympto...,symptoms
3,2020-09-17,"Trevisanuto, et al.",Coronavirus infection in neonates: a systemati...,"One out of four neonates was asymptomatic, and...",symptoms
4,2020-12-15,"Grosso, et al.",Suppression of Covid-19 outbreak among healthc...,The 46% of the positive tested were asymptomatic.,symptoms


done


In [47]:
covid_data = pd.concat([dfi_table[0],dfi_table[1],dfi_table[2]],axis = 0)
covid_data.reset_index()

Unnamed: 0,index,pub_date,authors,title,excerpt,label
0,0,2021-01-04,"Palas, et al.",Pediatric E.N.T. emergencies during COVID-19 p...,Comprehensively we recommend intervention only...,transmission
1,1,2020-12-31,"Coleman, et al.",1112. #EducationInTheTimeofCOVID: Using Twitte...,"Topics included public health & prevention, vi...",transmission
2,2,2020,"Alsohime, et al.",COVID-19 infection prevalence in pediatric pop...,The early identification of sars-cov-2 infecti...,transmission
3,3,2020,"Khan, et al.",Plant derived antiviral products for potential...,"However, transmission and infectivity rate of ...",transmission
4,4,2020-04-23,"Flaxman, et al.",Estimating the number of infections and the im...,Our model estimates these changes by calculati...,transmission
5,5,2020,"Fan, et al.",Influence of covid-19 on cerebrovascular disea...,"Currently, the most reasonable and effective w...",transmission
6,6,2020,"Girma, et al.",Knowledge and precautionary behavioral practic...,The final multiple linear regression analysis ...,transmission
7,7,2020,"Hua, et al.",Epidemiological features and medical care-seek...,Results: of the 205 patients with covid-19 inf...,transmission
8,8,2020,"Chen, et al.",Influence of covid-19 event on air quality and...,Correlation analysis shows that daily covid-19...,transmission
9,9,2020,Singh et al.,Modification of Neurosurgical Practice during ...,"All the residents, faculty and nursing staff r...",transmission


In [50]:
covid_data.sample(20)

Unnamed: 0,pub_date,authors,title,excerpt,label
386,2020,"Arawomo, et al.",Coronavirus Disease 2019 (COVID-19): Clinical ...,The mode of transmission via droplet infection...,transmission
1164,2020-04-07,"Roy, et al.","COVID-19 pandemic: Impact of lockdown, contact...","Covid-19 pandemic: impact of lockdown, contact...",transmission
120,2021-01-06,"Hoepler, et al.",Clinical and Angiographic Features in Three CO...,"All patients were female (median age, 67 years...",symptoms
1854,2020-06-20,"Kutsuna, et al.",SARS-CoV-2 screening test for Japanese returne...,This study was to evaluate the effectiveness o...,symptoms
1399,2020,"Ji, et al.",Lockdown Contained the Spread of 2019 Novel Co...,"Moreover, 50 asymptomatic infections were iden...",symptoms
491,2020,"Garg, et al.",Primary Health Care Facility Preparedness for ...,"Under these circumstances, the preparedness of...",transmission
374,2013,"Hijawi, et al.","Novel coronavirus infections in Jordan, April ...",This paper describes the epidemiological findi...,transmission
318,2020-03-25,"Yu, et al.",[Several suggestions of operation for colorect...,2019-ncov virus can be transmitted by asymptom...,symptoms
1254,2020-06-01,"MacIntyre, et al.",HUMAN CORONAVIRUS DATA FROM FOUR CLINICAL TRIA...,No transmissions to close contacts occurred wh...,transmission
1855,2020-12-11,"Yi, et al.",Characterizing the Dynamic of COVID-19 with a ...,Different from the traditional epidemic models...,symptoms


In [24]:
tweet_data = pd.read_csv('C:/Users/Sammy/2021/final-project/data/tweets/tweets.csv')

In [27]:
print(tweet_data.head())
tweet_data.shape

                                                text         label
0  The question we need urgent answer to is how l...  transmission
1  it takes like 2 weeks for symptoms to start sh...       symptom
2  Fort Bend County has a confirmed Coronavirus p...    prevention
3  China confirms 170 deaths as coronavirus sprea...       symptom
4  Excellent discussion from @PascalJabbourMD on ...   health risk


(50, 2)

In [26]:
tweet_data['label'].value_counts(normalize=False)

prevention      23
transmission    11
treatment        7
symptom          7
health risk      2
Name: label, dtype: int64

####  Defining our base model

In [28]:
from sklearn.model_selection import train_test_split

In [None]:
tweet = tweet_data['text'].values
y_label = tweet_data['label'].values

In [None]:
tweet_train, tweet_test, tweet_y_train, tweet_y_test = train_test_split(tweet, y_label,test_size = 0.25, random_state = 42)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(tweet_train)

X_train = vectorizer.transform(tweet_train)
X_test = vectorizer.transform(tweet_test)

In [None]:
len(tweet)
#X_train

In [None]:
# y_train = pd.factorize(tweet_y_train)
# y_test = pd.factorize(tweet_y_test)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier 

#### Decision tree classifier model

In [None]:
classifier = DecisionTreeClassifier(max_depth = 2)
classifier.fit(X_train,tweet_y_train)
score = classifier.score(X_test,tweet_y_test)
print("Accuracy:", score)

predictions = classifier.predict(X_test) 
  
# creating a confusion matrix 
cm = confusion_matrix(tweet_y_test, predictions) 
cm

#### Logistic regression model

In [None]:
lr_classifier = LogisticRegression()
lr_classifier.fit(X_train,y_train)
lr_score = lr_classifier.score(X_test,y_test)
print("Accuracy:", lr_score)

predictions = lr_classifier.predict(X_test) 
  
# creating a confusion matrix 
lr_cm = confusion_matrix(y_test, predictions) 
lr_cm

### Implementing BERT with keras

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

input_dim = X_train.shape[1]  # Number of features

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))