# MuMiN large dataset prep

Here we process the data in the following ways:

- Extract the tweets and their associated labels
- Clean the tweet text
- Translate the text of each tweet into Bahasa Indonesia
- Create the splits as defined by the authors of the dataset

# Extract the tweets and their labels

## Load the dataset

In [1]:
from pathlib import Path
from mumin import MuminDataset

# Set file paths and names
data_dir = Path("data")
data_file = "mumin-large_no-images.zip"

# Load the data
# Note: don't need the bearer token here as we're loading a compiled dataset
size = "large"
dataset_path = data_dir.joinpath(data_file)
include_tweet_images = False

dataset = MuminDataset(size=size, dataset_path=dataset_path, include_tweet_images=include_tweet_images)
dataset.compile()

2022-07-03 19:42:59,839 [INFO] Loading dataset


MuminDataset(num_nodes=1,636,198, num_relations=2,394,768, size='large', compiled=True, bearer_token_available=False)

## Join the claims and tweets into a single dataframe

In [2]:
# Get tweets
tweets = dataset.nodes["tweet"]
tweets.dropna(inplace=True) # Remove deleted tweets

# Get claims and reference indices
claims = dataset.nodes["claim"]
tc_ref = dataset.rels[("tweet", "discusses", "claim")]

# Join claims and tweets
tweet_claim = (tweets.merge(tc_ref, left_index=True, right_on='src')
                     .merge(claims, left_on='tgt', right_index=True)
                     .reset_index(drop=True))
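
To make the double merge easier to follow, here is a minimal sketch on toy data. The frames and their values are made up for illustration; only the `src`/`tgt` column names and the merge arguments mirror the cell above.

```python
import pandas as pd

# Toy stand-ins for the tweet nodes, claim nodes and relation table (made-up values)
toy_tweets = pd.DataFrame({"text": ["tweet a", "tweet b", "tweet c"]})
toy_claims = pd.DataFrame({"label": ["misinformation", "factual"]})
# Each row links a tweet (src = row index in toy_tweets) to a claim (tgt = row index in toy_claims)
toy_rels = pd.DataFrame({"src": [0, 1, 2], "tgt": [0, 0, 1]})

# Tweet index -> src, then tgt -> claim index, as in the cell above
toy_joined = (toy_tweets.merge(toy_rels, left_index=True, right_on="src")
                        .merge(toy_claims, left_on="tgt", right_index=True)
                        .reset_index(drop=True))
print(toy_joined[["text", "src", "tgt", "label"]])
```

Each tweet ends up on one row together with the label of the claim it discusses. Both merges are inner joins by default, so tweets with no relation row are dropped.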

In [3]:
tweet_claim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39001 entries, 0 to 39000
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   tweet_id          39001 non-null  uint64        
 1   text              39001 non-null  object        
 2   created_at        39001 non-null  datetime64[ns]
 3   lang              39001 non-null  category      
 4   source            39001 non-null  object        
 5   num_retweets      39001 non-null  uint64        
 6   num_replies       39001 non-null  uint64        
 7   num_quote_tweets  39001 non-null  uint64        
 8   src               39001 non-null  int64         
 9   tgt               39001 non-null  int64         
 10  embedding         39001 non-null  object        
 11  label             39001 non-null  category      
 12  reviewers         39001 non-null  object        
 13  date              39001 non-null  datetime64[ns]
 14  language          3900

In [4]:
tweet_claim.head()

Unnamed: 0,tweet_id,text,created_at,lang,source,num_retweets,num_replies,num_quote_tweets,src,tgt,...,label,reviewers,date,language,keywords,cluster_keywords,cluster,train_mask,val_mask,test_mask
0,1243046281326534661,To keep our upper respiratory tract healthy in...,2020-03-26 05:25:02,en,Hootsuite Inc.,96,6,6,0,0,...,misinformation,[observador.pt],2020-03-15 12:30:21,pt,corona virus reaching lungs remains,coronavirus china covid 19 treatments recommended,0,True,False,False
1,1243148522209161217,Gargling salt water does not 'kill' coronaviru...,2020-03-26 12:11:18,en,Twitter for iPhone,7,0,0,1,0,...,misinformation,[observador.pt],2020-03-15 12:30:21,pt,corona virus reaching lungs remains,coronavirus china covid 19 treatments recommended,0,True,False,False
2,1238795119572049920,कॉरोना वायरस फेफड़ों में जाने से पहले तीन-चार ...,2020-03-14 11:52:26,hi,Twitter for Android,6,0,1,2,0,...,misinformation,[observador.pt],2020-03-15 12:30:21,pt,corona virus reaching lungs remains,coronavirus china covid 19 treatments recommended,0,True,False,False
3,1238947475471454220,Antes de llegar a los pulmones dura 4 días en ...,2020-03-14 21:57:51,es,Twitter for Android,8,3,0,3,0,...,misinformation,[observador.pt],2020-03-15 12:30:21,pt,corona virus reaching lungs remains,coronavirus china covid 19 treatments recommended,0,True,False,False
4,1239128401115516929,So they say the first symptons are #coughing\n...,2020-03-15 09:56:47,en,Twitter for Android,10,2,1,4,0,...,misinformation,[observador.pt],2020-03-15 12:30:21,pt,corona virus reaching lungs remains,coronavirus china covid 19 treatments recommended,0,True,False,False


In [5]:
# Get the tweets, labels and language
data_clean = tweet_claim[["text", "label", "lang", "train_mask", "val_mask", "test_mask"]]

# Clean the dataset

## Encode labels

Here we simply perform the following encoding:

- `misinformation`: 0
- `factual`: 1

In [6]:
label_encodings = {"misinformation": 0, "factual": 1}
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
data_clean = data_clean.assign(label=data_clean.label.astype(str).map(label_encodings))


In [7]:
data_clean.label.value_counts()

0    37330
1     1671
Name: label, dtype: int64
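
The counts show that the two classes are heavily imbalanced. As a quick worked example (not part of the original notebook run), the share of each class can be computed directly from the counts printed above:

```python
# Counts copied from the value_counts() output above
counts = {0: 37330, 1: 1671}  # 0 = misinformation, 1 = factual
total = sum(counts.values())

# Fraction of rows in each class
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
```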

## Remove whitespace from the text

## Remove stopwords

I'm not sure I should do this step at all. I first need to see how the tokenizer I'll be using works, specifically the one from the `transformers` library.

## Remove mentions and URLs from the text

I'm not entirely sure I should remove these, as they could carry additional information. I'll talk to Pak Dhomas about it later. For now I'll take `yarakyrychenko`'s approach and replace URLs with `<URL>` and mentions with `<USER>`: both placeholders stay untranslated yet still indicate that a URL or mention was present.

In [8]:
#data_clean.text.str.replace(r"#\S*", "", regex=True)
# Raw-string patterns with regex=True: since pandas 2.0, str.replace treats the pattern as a literal string by default
clean_text = data_clean.text.str.replace(r"http\S+", "<URL>", case=False, regex=True)
clean_text = clean_text.str.replace(r"@\S+", "<USER>", case=False, regex=True)
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
data_clean = data_clean.assign(text=clean_text)
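
To see what the two patterns do to a single tweet, here is a small standalone sketch using only the standard `re` module. The helper name `replace_urls_and_mentions` and the sample tweet are made up for illustration.

```python
import re

# Same patterns as in the cell above; the URL pattern is matched case-insensitively
URL_PATTERN = re.compile(r"http\S+", flags=re.IGNORECASE)
MENTION_PATTERN = re.compile(r"@\S+")


def replace_urls_and_mentions(text: str) -> str:
    """Swap URLs for <URL> and mentions for <USER> (hypothetical helper)."""
    text = URL_PATTERN.sub("<URL>", text)
    return MENTION_PATTERN.sub("<USER>", text)


# Made-up example tweet
sample = "Thanks @who for the update: https://example.org/covid #coronavirus"
cleaned = replace_urls_and_mentions(sample)
print(cleaned)
```

Note that `\S+` is greedy, so punctuation directly attached to a mention or URL (for example `@who:`) is swallowed into the placeholder as well.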


# Translate the tweet text

I'll handle this later, as at the time of writing there seems to be an issue with `py-googletrans`; the cells below have not been run. I'll probably end up using the official Google translation API.

In [None]:
from googletrans import Translator

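# Define translate function
def translate(text, lang, translator):
    """Translate text into Indonesian ('id'); Indonesian text is returned unchanged."""
    if lang == 'id':
        return text
    try:
        translation = translator.translate(text, src=lang, dest='id')
    except Exception:
        # If translating with the tweet's language code fails,
        # let the translator auto-detect the source language instead
        translation = translator.translate(text, dest='id')
    return translation.text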
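
Since the translation service is not usable right now, the control flow of `translate` can still be checked offline with a stand-in translator. This is only a sketch: `FakeTranslator` and `FakeResult` are hypothetical classes that mimic just the `translate(text, src=..., dest=...)` call and the `.text` attribute used above, and the function is restated so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    text: str


class FakeTranslator:
    """Hypothetical stand-in; mimics only the calls used by translate()."""

    def translate(self, text, dest="en", src="auto"):
        if src == "xx":  # simulate a language code the service rejects
            raise ValueError(f"invalid source language: {src}")
        return FakeResult(text=f"[{src}->{dest}] {text}")


def translate(text, lang, translator):
    """Same logic as the cell above, restated for a self-contained sketch."""
    if lang == "id":
        return text
    try:
        return translator.translate(text, src=lang, dest="id").text
    except Exception:
        # Fall back to automatic source-language detection
        return translator.translate(text, dest="id").text


fake = FakeTranslator()
already_id = translate("halo dunia", "id", fake)
normal = translate("hello world", "en", fake)
fallback = translate("hello world", "xx", fake)
print(already_id, normal, fallback, sep="\n")
```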

In [None]:
tr = Translator()
tr.translate('poop')

In [None]:
# Try the translate function on the first tweet
text = data_clean.text.to_list()[0]
lang = data_clean.lang.to_list()[0]
translate(text, lang, tr)

In [None]:
# Translate tweets
tr = Translator()
#test = data_clean.apply(lambda x: translate(x["text"], x["lang"], tr), axis=1)
tr_text = []
for text, lang in zip(data_clean.text.to_list(), data_clean.lang.to_list()):
    tr_text.append(translate(text, lang, tr))
tr_text[:5]

# Create the splits

In [10]:
# Get training, test, and validation sets
train = data_clean.query('train_mask == True')
val = data_clean.query('val_mask == True')
test = data_clean.query('test_mask == True')
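
Before saving, it may be worth checking that the three masks actually partition the rows. The sketch below does this on a made-up toy frame; it assumes each row is meant to belong to exactly one split, which has not been verified here for the joined `data_clean` frame.

```python
import pandas as pd

# Toy frame with the same mask columns as data_clean (values are made up)
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "train_mask": [True, True, False, False],
    "val_mask": [False, False, True, False],
    "test_mask": [False, False, False, True],
})

mask_cols = ["train_mask", "val_mask", "test_mask"]

# Count how many masks are True per row; exactly one is expected
masks_per_row = toy[mask_cols].sum(axis=1)
is_partition = bool((masks_per_row == 1).all())

# Number of rows that fall into each split
split_sizes = {col: int(toy[col].sum()) for col in mask_cols}
print(is_partition, split_sizes)
```

Running the same check on `data_clean` would flag rows that fall into no split or into several.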

In [14]:
# Save the splits to csv files
output_dir = Path("data")
output_files = ["mumin_large-train.csv", "mumin_large-test.csv", "mumin_large-validation.csv"]
splits = [train, test, val]
for split, output_file in zip(splits, output_files):
    split_features = ["text", "label", "lang"]
    split.to_csv(output_dir.joinpath(output_file), index=False, columns=split_features)