# MuMiN large data preparation

In this notebook we will process the MuMiN large dataset in the following way:

- Extract the tweet text, label and language
- Perform basic text cleaning
- Translate the tweet texts into Bahasa Indonesia
- Create the training, test, and validation splits

Note that since we only care about the tweet text, we will be omitting images and articles.

# Extract necessary features

## Load the dataset

In [1]:
from pathlib import Path
from mumin import MuminDataset

# Set file names and paths
data_dir = Path("../../data/mumin_archive")
dataset_file = "mumin-medium_no-images.zip"

# Load the compiled dataset
size = "medium"
dataset_path = data_dir.joinpath(dataset_file)
include_tweet_images = False
include_articles = False
dataset = MuminDataset(
    dataset_path=dataset_path,
    size=size,
    include_tweet_images=include_tweet_images,
    include_articles=include_articles,
)
dataset.compile()

2022-07-28 20:26:35,765 [INFO] Loading dataset


MuminDataset(num_nodes=805,586, num_relations=1,061,640, size='medium', compiled=True, bearer_token_available=False)

## Join claims with their tweets

Since we're focusing on COVID-19 misinformation, we will first filter out the claims that aren't about COVID-19 before joining them.

In [2]:
# Get tweets, claims and the relations between them
tweets = dataset.nodes["tweet"].dropna()
claims = dataset.nodes["claim"]
rels = dataset.rels[("tweet", "discusses", "claim")]

# Filter claims; non-capturing groups avoid the pandas "match groups" warning
covid_pattern = r"corona(?:.*virus)?|covid(?:.*19)?"
covid_mask = (claims.keywords.str.contains(covid_pattern, na=False)
              | claims.cluster_keywords.str.contains(covid_pattern, na=False))
claims_filtered = claims.loc[covid_mask, :]

# Join tweets and claims on rels
tc = (tweets.merge(rels, left_index=True, right_on='src')
            .merge(claims_filtered, left_on='tgt', right_index=True)
            .reset_index(drop=True))
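
As a quick check of the keyword filter, the sketch below applies the same pattern (written with non-capturing groups) to a few made-up keyword strings, using only the standard `re` module. The sample strings and the `is_covid` helper are hypothetical and are not taken from MuMiN.

```python
import re

# Same pattern as the claim filter, written with non-capturing groups.
COVID_PATTERN = re.compile(r"corona(?:.*virus)?|covid(?:.*19)?")


def is_covid(keywords: str) -> bool:
    """Return True if the keyword string mentions "corona" or "covid"."""
    return COVID_PATTERN.search(keywords) is not None


# Hypothetical keyword strings, for illustration only.
samples = [
    "corona virus reaching lungs remains",
    "covid 19 treatments recommended",
    "election fraud ballots",
]
flags = [is_covid(sample) for sample in samples]
print(flags)
```

Because both optional tails may match nothing, the pattern behaves like a plain "contains `corona` or `covid`" test. The match is case-sensitive, which seems fine as long as the keyword columns are lowercase, as they are in the rows shown in this notebook.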



In [3]:
tc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4483 entries, 0 to 4482
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   tweet_id          4483 non-null   uint64        
 1   text              4483 non-null   object        
 2   created_at        4483 non-null   datetime64[ns]
 3   lang              4483 non-null   category      
 4   source            4483 non-null   object        
 5   num_retweets      4483 non-null   uint64        
 6   num_replies       4483 non-null   uint64        
 7   num_quote_tweets  4483 non-null   uint64        
 8   src               4483 non-null   int64         
 9   tgt               4483 non-null   int64         
 10  embedding         4483 non-null   object        
 11  label             4483 non-null   category      
 12  reviewers         4483 non-null   object        
 13  date              4483 non-null   datetime64[ns]
 14  language          4483 n

In [4]:
tc.head()

| | tweet_id | text | created_at | lang | source | num_retweets | num_replies | num_quote_tweets | src | tgt | ... | label | reviewers | date | language | keywords | cluster_keywords | cluster | train_mask | val_mask | test_mask |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1243046281326534661 | To keep our upper respiratory tract healthy in... | 2020-03-26 05:25:02 | en | Hootsuite Inc. | 96 | 6 | 6 | 0 | 0 | ... | misinformation | [observador.pt] | 2020-03-15 12:30:21 | pt | corona virus reaching lungs remains | coronavirus china covid 19 treatments recommended | 0 | True | False | False |
| 1 | 1243148522209161217 | Gargling salt water does not 'kill' coronaviru... | 2020-03-26 12:11:18 | en | Twitter for iPhone | 7 | 0 | 0 | 1 | 0 | ... | misinformation | [observador.pt] | 2020-03-15 12:30:21 | pt | corona virus reaching lungs remains | coronavirus china covid 19 treatments recommended | 0 | True | False | False |
| 2 | 1238795119572049920 | कॉरोना वायरस फेफड़ों में जाने से पहले तीन-चार ... | 2020-03-14 11:52:26 | hi | Twitter for Android | 6 | 0 | 1 | 2 | 0 | ... | misinformation | [observador.pt] | 2020-03-15 12:30:21 | pt | corona virus reaching lungs remains | coronavirus china covid 19 treatments recommended | 0 | True | False | False |
| 3 | 1238947475471454220 | Antes de llegar a los pulmones dura 4 días en ... | 2020-03-14 21:57:51 | es | Twitter for Android | 8 | 3 | 0 | 3 | 0 | ... | misinformation | [observador.pt] | 2020-03-15 12:30:21 | pt | corona virus reaching lungs remains | coronavirus china covid 19 treatments recommended | 0 | True | False | False |
| 4 | 1239128401115516929 | So they say the first symptons are #coughing\n... | 2020-03-15 09:56:47 | en | Twitter for Android | 10 | 2 | 1 | 4 | 0 | ... | misinformation | [observador.pt] | 2020-03-15 12:30:21 | pt | corona virus reaching lungs remains | coronavirus china covid 19 treatments recommended | 0 | True | False | False |


## Get the necessary features

Here we extract the following features:

- `text`
- `label`
- `lang`

We don't need any of the `*mask` columns since we're going to create our own splits.

In [4]:
features = ["text", "label", "lang"]
# Take an explicit copy so later in-place edits don't trigger SettingWithCopyWarning
dataset_clean = tc[features].copy()

## Remove records with `zxx` language code

We'll remove all records with the `zxx` language code as it stands for "no linguistic content". If we actually look at the records with this language code, it's clear that they are all URLs.

In [7]:
# View all records with "zxx" language code
dataset_clean.loc[dataset_clean.lang == "zxx", :]

| | text | label | lang |
|---|---|---|---|
| 236 | https://t.co/DIOtokZ5JZ | misinformation | zxx |
| 349 | https://t.co/zwKqF4qur7 | misinformation | zxx |
| 546 | https://t.co/HeZ2S7sk5o | misinformation | zxx |
| 772 | https://t.co/ds4fqa6FiK | misinformation | zxx |
| 1060 | https://t.co/xI5YfL0DBu | misinformation | zxx |
| 1068 | https://t.co/xI5YfL0DBu | misinformation | zxx |
| 1074 | https://t.co/xI5YfL0DBu | misinformation | zxx |
| 1685 | https://t.co/QRB1suNRA4 | misinformation | zxx |
| 1686 | https://t.co/QRB1suNRA4 | misinformation | zxx |
| 1687 | https://t.co/QRB1suNRA4 | misinformation | zxx |


In [8]:
# Remove all records with "zxx" language code
dataset_clean = dataset_clean.loc[dataset_clean.lang != "zxx", :].reset_index(drop=True)

In [9]:
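# Write the current state of the dataset to a csv file
# Note that this is just a temporary code block
out_dir = Path("data")
out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing directories
outfile = "mumin_medium-raw.csv"
dataset_clean.to_csv(out_dir.joinpath(outfile), index=False)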

# Clean dataset

Here we perform only the following operations:

- Encode the labels
- Translate the text
- Substitute URLs, mentions, and (maybe) hashtags with an indicative token
- Remove emojis
- Normalize whitespace
- Remove any other unnecessary characters that I happen to notice

All of the more in-depth text cleaning operations will be done once the text has been translated.

## Encode labels

Here we encode the labels in the following way:

- `misinformation`: 1
- `factual`: 0

In [14]:
encodings = {"misinformation": 1, "factual": 0}
# Assign the result back instead of calling replace(inplace=True) on a column selection
dataset_clean["label"] = dataset_clean["label"].astype(str).map(encodings)



In [7]:
dataset_clean.label.value_counts()

1    17625
0      600
Name: label, dtype: int64

## Replace URLs and mentions

URLs and mentions both carry useful information, so it's not clear that we should remove them completely. For now, I'll simply replace URLs with `<URL>` and mentions with `<USER>`.

In [15]:
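# Raw strings plus an explicit regex=True keep the patterns working across pandas versions
dataset_clean["text"] = dataset_clean["text"].str.replace(r"http\S+", "<URL>", regex=True)
dataset_clean["text"] = dataset_clean["text"].str.replace(r"@\S+", "<USER>", regex=True)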
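
To see what these two substitutions do to a single tweet, here is a minimal sketch using the standard `re` module with the same two patterns. The `mask_urls_and_mentions` helper and the example tweet are made up for illustration.

```python
import re

URL_RE = re.compile(r"http\S+")   # anything starting with "http" up to the next whitespace
MENTION_RE = re.compile(r"@\S+")  # "@" followed by any run of non-whitespace characters


def mask_urls_and_mentions(text: str) -> str:
    """Replace URLs with <URL> and mentions with <USER>."""
    text = URL_RE.sub("<URL>", text)
    return MENTION_RE.sub("<USER>", text)


# Hypothetical tweet, for illustration only.
example = "@who says gargling does not help https://t.co/abc123"
masked = mask_urls_and_mentions(example)
print(masked)
```

Note that `@\S+` also swallows punctuation attached to a handle (for example the comma in `@who,`) and would match the tail of an e-mail address. A stricter pattern such as `@\w+` might be preferable, depending on how clean the tokens need to be.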



## Remove whitespace literal characters

Here we remove stray newline and tab characters.

In [66]:
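# Replace newlines/tabs with a space (deleting them glues words together), drop "|", collapse whitespace
text_test = dataset_clean.text.str.replace(r"[\n\t]+", " ", regex=True).str.replace("|", "", regex=False).str.replace(r"\s+", " ", regex=True)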
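
One thing to watch for in this step: deleting a newline outright joins the words on either side of it, whereas turning it into a space keeps them apart. The sketch below contrasts the two on a made-up string; `normalize_whitespace` is a hypothetical helper, not part of the notebook.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Drop "|" characters and collapse every whitespace run into one space."""
    text = text.replace("|", "")
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw tweet containing a newline, a tab and a stray "|".
raw = "first symptoms are #coughing\nThe virus\t stays | here  "

glued = re.sub(r"\n|\t", "", raw)      # deletes the newline: words run together
spaced = normalize_whitespace(raw)     # keeps a single space between words
print(glued)
print(spaced)
```

Since `\s` already matches newlines and tabs, the single `\s+` substitution covers them without a separate pass.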



In [67]:
text_test.to_list()[:5]

["To keep our upper respiratory tract healthy in this time of #coronavirus, let's all gargle warm water with salt (preferably himalayan or sea salt), some ginger and apple cider vinegar first thing in the morning and before we sleep. Spit the water out after. #LetsFightCovid19 <URL>",
 "Gargling salt water does not 'kill' coronavirus in your throat Metro News <URL>",
 'कॉरोना वायरस फेफड़ों में जाने से पहले तीन-चार दिन तक गले में रहता है...उससे खांसी व कफ की शिकायत भी रहती है...ऐसी स्थिति में अगर गर्म पानी में नमक डालकर गरारे किए जाएं...और नींबू का सेवन किया जाए तो इस बीमारी से बचा जा सकता है...और अनेकों जिंदगी बच सकती है।🕉️👋🌷🌹💐🌴 <URL>',
 'Antes de llegar a los pulmones dura 4 días en la garganta, en ese punto la persona infectadas empiezan a toser y a tener dolor de garganta. Deben tomar mucha agua y hacer gárgaras de agua tibia con sal o vinagre esto eliminará el CORONAVIRUS retuitear pues pueden salvar alguien. <URL>',
 'So they say the first symptons are #coughingThe virus stays in 

## Remove emojis

In [None]:
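# Strip common emoji ranges only; an ASCII round-trip would also delete Hindi text, accented letters, etc.
dataset_clean["text"] = dataset_clean["text"].str.replace("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]+", "", regex=True)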
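
Since the tweets are multilingual and have not been translated yet, it is worth checking what an emoji filter does to non-Latin text. The sketch below compares an ASCII encode/decode round-trip with a regex over a few emoji code point ranges. The ranges are an approximation chosen for this example (they are not an exhaustive emoji definition), and `strip_emojis` and the sample string are hypothetical.

```python
import re

# Approximate emoji ranges (an assumption, not a complete list).
EMOJI_RE = re.compile(
    "[\U0001F000-\U0001FAFF"  # pictographs, emoticons, transport and related symbol blocks
    "\u2600-\u27BF"           # miscellaneous symbols and dingbats
    "\uFE0F]+"                # emoji variation selector
)


def strip_emojis(text: str) -> str:
    """Remove characters in the emoji ranges above and leave everything else."""
    return EMOJI_RE.sub("", text)


# Hypothetical mixed-script sample, for illustration only.
sample = "बीमारी से बचा जा सकता है🌷🌹 gárgaras 👋"

cleaned = strip_emojis(sample)
ascii_only = sample.encode("ascii", "ignore").decode("ascii")
print(repr(cleaned))
print(repr(ascii_only))
```

The regex version keeps the Devanagari text and the accented `á`, while the ASCII round-trip drops both along with the emojis.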

## Remove other unnecessary characters

In [None]:
dataset_clean.text = dataset_clean.text.str.replace("|", "", regex=False)

## Remove excess whitespace

In [None]:
dataset_clean["text"] = dataset_clean["text"].str.replace(r"\s+", " ", regex=True).str.strip()

# Translate text

This will be done later.

# Create splits

In [14]:
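# Create our own stratified splits. The *mask columns were dropped earlier,
# so they cannot be queried here. The 80/10/10 ratio and the seed are assumptions.
from sklearn.model_selection import train_test_split

train, rest = train_test_split(dataset_clean, test_size=0.2, stratify=dataset_clean["label"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=42)

# Write them to csv files
features = ["text", "label", "lang"]
output_files = ["mumin_large-train.csv", "mumin_large-test.csv", "mumin_large-validation.csv"]
splits = [train, test, val]
for split, output_file in zip(splits, output_files):
    split.to_csv(data_dir.joinpath(output_file), columns=features, index=False)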
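
Given the class imbalance seen in the label counts earlier (17,625 vs 600), one option when creating our own splits is to stratify on the label so that each split keeps roughly the same ratio. The sketch below illustrates the idea with the standard library only. The `stratified_split` helper, the 80/10/10 fractions, the seed and the toy labels are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split row positions into train/val/test while preserving label proportions."""
    rng = random.Random(seed)

    # Group row positions by their label.
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    train, val, test = [], [], []
    for positions in by_label.values():
        rng.shuffle(positions)
        n_train = round(len(positions) * fractions[0])
        n_val = round(len(positions) * fractions[1])
        train += positions[:n_train]
        val += positions[n_train:n_train + n_val]
        test += positions[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# Toy labels with roughly the same imbalance as the encoded dataset.
labels = [1] * 970 + [0] * 30
train_idx, val_idx, test_idx = stratified_split(labels)

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    n_factual = sum(1 for i in idx if labels[i] == 0)
    print(name, len(idx), n_factual)
```

The returned positions could then be used to select rows, for example with `dataset_clean.iloc[train_idx]`.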