Files Used:
* altlex_dev.tsv - data annotated by graduate students
* altlex_gold.tsv - data annotated via CrowdFlower

Formatting: All data is tab-separated.

Columns: prevWords, altlex, currWords, and label.<br>
prevWords - text before the altlex<br>
altlex - the altlex itself<br>
currWords - text after the altlex<br>
label - 0, 1, or 2 representing none, reason, or result, respectively<br>
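
As a minimal sketch of this layout, the snippet below parses a tiny in-memory sample with the four documented columns and maps the integer labels to their names. The two rows are made up for illustration and are not taken from the dataset; `label_names` is a hypothetical helper mapping.

```python
import io

import pandas as pd

# Two made-up rows in the documented layout (not taken from the dataset).
sample_tsv = (
    "prevWords\taltlex\tcurrWords\tlabel\n"
    "The valve failed ,\tresulting in\ta leak .\t2\n"
    "The plant was\tbuilt in\t1969 .\t0\n"
)

# The files are tab-separated, so sep must be set explicitly.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Map the integer labels to the names listed above.
label_names = {0: "none", 1: "reason", 2: "result"}
sample["label_name"] = sample["label"].map(label_names)

print(sample[["altlex", "label", "label_name"]])
```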

1. REASON: Effect-Signal-Cause: He did this TO PROTECT that.
2. RESULT: Cause-Signal-Effect: That thing he did LED TO this.
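
The two orderings can be sketched as a small tagging function. This follows the convention used in the conversion cells of this notebook, where `ARG0` marks the cause and `ARG1` the effect; `tag_pair` is a hypothetical helper name and not part of the original code.

```python
def tag_pair(prev_words: str, altlex: str, curr_words: str, label: int) -> str:
    """Wrap the text before and after the altlex in ARG0 (cause) / ARG1 (effect) tags."""
    prev_words, altlex, curr_words = prev_words.strip(), altlex.strip(), curr_words.strip()
    if label == 2:
        # RESULT: cause, signal, effect
        return f"<ARG0>{prev_words}</ARG0> {altlex} <ARG1>{curr_words}</ARG1>"
    # REASON: effect, signal, cause.
    # The notebook uses the same layout for label 0, where the pair label is 0 anyway.
    return f"<ARG1>{prev_words}</ARG1> {altlex} <ARG0>{curr_words}</ARG0>"


print(tag_pair("He did this", "to protect", "that .", 1))
print(tag_pair("That thing he did", "led to", "this .", 2))
```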
  
Files Not Used:
* altlex_train_paraphrases.tsv - paraphrases derived from English and Simple Wikipedia
* altlex_train.tsv - distantly labeled data with causal and non-causal connectives
* altlex_train_bootstrapped.tsv - distantly labeled data after two rounds of bootstrapping

# Link to original Dataset
https://github.com/chridey/altlex/tree/master

### One File

In [1]:
import pandas as pd
import numpy as np
import re
import os
from sklearn.model_selection import train_test_split
import pickle
import json

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
data_dir = r"/content/drive/MyDrive/Colab Notebooks/NLP/project/PC-TES-PROJECT/causality-guided-Transformer/data/altlex_data/splits"
all_files = ['altlex_dev.tsv', 'altlex_gold.tsv']

file = all_files[1]
print(os.path.splitext(file)[0])
fn = os.path.join(data_dir, file)
original_data = pd.read_csv(fn, sep='\t')
original_data

altlex_gold


Unnamed: 0,prevWords,altlex,currWords,label
0,A government affidavit in 2006 stated that the...,caused,"558,125 injuries , including 38,478 temporary ...",2
1,The Indian government and local activists argu...,caused,a backflow of water into a MIC tank triggering...,2
2,The UCIL factory was,built in,1969 to produce the pesticide Sevin ( UCC 's b...,0
3,"In a panic , he removed his mask , inhaling a ...",which resulted in,his death 72 hours later .,2
4,"In August 1982 , a chemical engineer came into...",resulting in,burns over 30 percent of his body .,2
...,...,...,...,...
606,Then he turned into Angelus and when he was fi...,changed,"back ( after months , by Willow ) , Buffy stab...",0
607,"In Asia , the United States",led,the occupation of Japan and administrated Japa...,2
608,"In April 1940 , Germany invaded Denmark and No...",to protect,"shipments of iron ore from Sweden , which the ...",1
609,Musical analysis of a composition aims at achi...,leading to,more meaningful hearing and a greater apprecia...,2


In [4]:
corpus = 'altlex'
cols = ['corpus','doc_id','sent_id','eg_id','index','text','text_w_pairs','seq_label','pair_label','context','num_sents']

data = []
for i, row in original_data.iterrows():

    sentid = str(i) # Not relevant for this dataset
    num_eg_for_this_sentid = 0 # Not relevant for this dataset
    identifiers = [corpus,os.path.basename(fn),str(sentid),str(num_eg_for_this_sentid)]
    unique_index = '_'.join(identifiers)

    text = str(row['prevWords']).strip()+' '+str(row['altlex']).strip()+' '+str(row['currWords']).strip()
    label = int(row['label'])
    seq_label = pair_label = 1 if label!=0 else 0

    if label == 2:  # RESULT: CAUSE->EFFECT || ARG0->ARG1 (prevWords is the cause)
        text_w_pairs = '<ARG0>'+str(row['prevWords']).strip()+'</ARG0> '+str(row['altlex']).strip()+' <ARG1>'+str(row['currWords']).strip()+'</ARG1>'
    else:  # REASON: EFFECT->CAUSE || ARG1->ARG0 (also applied to label 0, where pair_label is 0)
        text_w_pairs = '<ARG1>'+str(row['prevWords']).strip()+'</ARG1> '+str(row['altlex']).strip()+' <ARG0>'+str(row['currWords']).strip()+'</ARG0>'

    data.append(
        identifiers+[
            unique_index,
            text.strip(),
            text_w_pairs.strip(),
            seq_label,
            pair_label,
            '',1
        ]
    )

data = pd.DataFrame(data, columns=cols)
data

Unnamed: 0,corpus,doc_id,sent_id,eg_id,index,text,text_w_pairs,seq_label,pair_label,context,num_sents
0,altlex,altlex_gold.tsv,0,0,altlex_altlex_gold.tsv_0_0,A government affidavit in 2006 stated that the...,<ARG0>A government affidavit in 2006 stated th...,1,1,,1
1,altlex,altlex_gold.tsv,1,0,altlex_altlex_gold.tsv_1_0,The Indian government and local activists argu...,<ARG0>The Indian government and local activist...,1,1,,1
2,altlex,altlex_gold.tsv,2,0,altlex_altlex_gold.tsv_2_0,The UCIL factory was built in 1969 to produce ...,<ARG1>The UCIL factory was</ARG1> built in <AR...,0,0,,1
3,altlex,altlex_gold.tsv,3,0,altlex_altlex_gold.tsv_3_0,"In a panic , he removed his mask , inhaling a ...","<ARG0>In a panic , he removed his mask , inhal...",1,1,,1
4,altlex,altlex_gold.tsv,4,0,altlex_altlex_gold.tsv_4_0,"In August 1982 , a chemical engineer came into...","<ARG0>In August 1982 , a chemical engineer cam...",1,1,,1
...,...,...,...,...,...,...,...,...,...,...,...
606,altlex,altlex_gold.tsv,606,0,altlex_altlex_gold.tsv_606_0,Then he turned into Angelus and when he was fi...,<ARG1>Then he turned into Angelus and when he ...,0,0,,1
607,altlex,altlex_gold.tsv,607,0,altlex_altlex_gold.tsv_607_0,"In Asia , the United States led the occupation...","<ARG0>In Asia , the United States</ARG0> led <...",1,1,,1
608,altlex,altlex_gold.tsv,608,0,altlex_altlex_gold.tsv_608_0,"In April 1940 , Germany invaded Denmark and No...","<ARG1>In April 1940 , Germany invaded Denmark ...",1,1,,1
609,altlex,altlex_gold.tsv,609,0,altlex_altlex_gold.tsv_609_0,Musical analysis of a composition aims at achi...,<ARG0>Musical analysis of a composition aims a...,1,1,,1


In [5]:
# Number of rows whose text repeats an earlier row (first occurrences are not counted)
sum(data.duplicated(subset=['text']))

34

In [6]:
data[data.duplicated(subset=['text'], keep=False)].sort_values(by=['text'])

Unnamed: 0,corpus,doc_id,sent_id,eg_id,index,text,text_w_pairs,seq_label,pair_label,context,num_sents
89,altlex,altlex_gold.tsv,89,0,altlex_altlex_gold.tsv_89_0,Although first feared that Nalgae would cause ...,<ARG0>Although first feared that Nalgae would<...,1,1,,1
90,altlex,altlex_gold.tsv,90,0,altlex_altlex_gold.tsv_90_0,Although first feared that Nalgae would cause ...,<ARG0>Although first feared that Nalgae would ...,1,1,,1
128,altlex,altlex_gold.tsv,128,0,altlex_altlex_gold.tsv_128_0,DR6 is highly expressed in the human brain reg...,<ARG0>DR6 is highly expressed in the human bra...,1,1,,1
127,altlex,altlex_gold.tsv,127,0,altlex_altlex_gold.tsv_127_0,DR6 is highly expressed in the human brain reg...,<ARG0>DR6 is highly expressed in the human bra...,1,1,,1
212,altlex,altlex_gold.tsv,212,0,altlex_altlex_gold.tsv_212_0,Each circuit represents a bit ( binary digit )...,<ARG0>Each circuit represents a bit ( binary d...,1,1,,1
...,...,...,...,...,...,...,...,...,...,...,...
34,altlex,altlex_gold.tsv,34,0,altlex_altlex_gold.tsv_34_0,The seismicity of central and eastern Asia is ...,<ARG1>The seismicity of central and eastern As...,1,1,,1
172,altlex,altlex_gold.tsv,172,0,altlex_altlex_gold.tsv_172_0,Then he turned into Angelus and when he was fi...,<ARG1>Then he turned into Angelus and when he ...,1,1,,1
606,altlex,altlex_gold.tsv,606,0,altlex_altlex_gold.tsv_606_0,Then he turned into Angelus and when he was fi...,<ARG1>Then he turned into Angelus and when he ...,0,0,,1
151,altlex,altlex_gold.tsv,151,0,altlex_altlex_gold.tsv_151_0,Vampires Spike and Drusilla ( weakened from a ...,<ARG1>Vampires Spike and Drusilla ( weakened f...,1,1,,1
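
The two duplicate checks above differ only in the `keep` argument of `DataFrame.duplicated`. As a minimal sketch on a made-up frame (the texts and labels here are illustrative, not from the dataset), the snippet below shows the difference and also lists texts whose copies carry different labels, which is the situation visible for rows 172 and 606 in the table.

```python
import pandas as pd

# Toy data: one sentence appears three times with conflicting labels.
toy = pd.DataFrame({
    "text": ["a b c", "a b c", "d e f", "a b c"],
    "pair_label": [1, 0, 1, 1],
})

# Default keep='first': the first occurrence is not flagged,
# so this counts only the extra copies.
extra_copies = int(toy.duplicated(subset=["text"]).sum())

# keep=False flags every member of a duplicate group, including the first.
all_members = toy[toy.duplicated(subset=["text"], keep=False)]

# Texts whose copies disagree on the label.
labels_per_text = toy.groupby("text")["pair_label"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print(extra_copies, len(all_members), conflicting)
```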


### Mass Run

In [7]:
import pandas as pd
import numpy as np
from collections import defaultdict
import re
import os

data_dir = r"/content/drive/MyDrive/Colab Notebooks/NLP/project/PC-TES-PROJECT/causality-guided-Transformer/data/altlex_data/splits"
all_filenames = ['altlex_dev.tsv', 'altlex_gold.tsv']
cols = ['corpus','doc_id','sent_id','eg_id','index','text','text_w_pairs','seq_label','pair_label','context','num_sents']


def run(file):
    # Open File
    fn = os.path.join(data_dir, file)
    original_data = pd.read_csv(fn, sep='\t')
    original_data = original_data.dropna(subset=['prevWords','currWords'])
    original_data['text'] = [
        str(p).strip()+' '+str(a).strip()+' '+str(c).strip() \
        for p,a,c in zip(original_data['prevWords'], original_data['altlex'], original_data['currWords'])
    ]

    # Add SentId: rows with identical text share one integer ID, numbered in order of first appearance
    original_data['sentid'] = original_data[['text']].sum(axis=1).map(hash)
    hashmap = {s:i for i,s in enumerate(original_data['sentid'].unique())}
    original_data['sentid'] = original_data['sentid'].apply(lambda x: hashmap[x])

    # Format
    data = []
    corpus = 'altlex'
    sentid_counter = defaultdict(int)
    for i, row in original_data.iterrows():

        sentid = str(row['sentid']) # Always single sentences for AltLex corpus
        num_eg_for_this_sentid = sentid_counter[sentid]
        identifiers = [corpus,os.path.basename(fn),str(sentid),str(num_eg_for_this_sentid)]
        unique_index = '_'.join(identifiers)

        text = row['text']
        label = int(row['label'])
        seq_label = pair_label = 1 if label!=0 else 0

        if label == 2:  # RESULT: CAUSE->EFFECT || ARG0->ARG1 (prevWords is the cause)
            text_w_pairs = '<ARG0>'+str(row['prevWords']).strip()+'</ARG0> '+str(row['altlex']).strip()+' <ARG1>'+str(row['currWords']).strip()+'</ARG1>'
        else:  # REASON: EFFECT->CAUSE || ARG1->ARG0 (also applied to label 0, where pair_label is 0)
            text_w_pairs = '<ARG1>'+str(row['prevWords']).strip()+'</ARG1> '+str(row['altlex']).strip()+' <ARG0>'+str(row['currWords']).strip()+'</ARG0>'

        data.append(
            identifiers+[
                unique_index,
                text.strip(),
                text_w_pairs.strip(),
                seq_label,
                pair_label,
                '',1
            ]
        )
        sentid_counter[sentid]+=1

    data = pd.DataFrame(data, columns=cols)
    data['sent_id'] = data['sent_id'].astype(str)
    data['seq_label'] = data.groupby(['corpus','doc_id','sent_id'])['seq_label'].transform('max')

    return data

run(all_filenames[0])

Unnamed: 0,corpus,doc_id,sent_id,eg_id,index,text,text_w_pairs,seq_label,pair_label,context,num_sents
0,altlex,altlex_dev.tsv,0,0,altlex_altlex_dev.tsv_0_0,"The Bhopal disaster , also referred to as the ...","<ARG1>The Bhopal disaster , also referred to</...",0,0,,1
1,altlex,altlex_dev.tsv,1,0,altlex_altlex_dev.tsv_1_0,"In addition , several vent gas scrubbers had b...","<ARG1>In addition , several vent gas scrubbers...",0,0,,1
2,altlex,altlex_dev.tsv,2,0,altlex_altlex_dev.tsv_2_0,Union Carbide organized a team of internationa...,<ARG1>Union Carbide organized a team of intern...,0,0,,1
3,altlex,altlex_dev.tsv,3,0,altlex_altlex_dev.tsv_3_0,"Following an appeal of this decision , the U.S...","<ARG1>Following an appeal of this decision , t...",0,0,,1
4,altlex,altlex_dev.tsv,4,0,altlex_altlex_dev.tsv_4_0,The U.S. Supreme Court refused to hear an appe...,<ARG0>The U.S. Supreme Court refused to hear a...,1,1,,1
...,...,...,...,...,...,...,...,...,...,...,...
411,altlex,altlex_dev.tsv,396,0,altlex_altlex_dev.tsv_396_0,Bad economies and people wanting to rule thems...,<ARG1>Bad economies and people wanting to rule...,1,1,,1
412,altlex,altlex_dev.tsv,397,0,altlex_altlex_dev.tsv_397_0,The United States became richer than any other...,<ARG1>The United States became richer than any...,0,0,,1
413,altlex,altlex_dev.tsv,398,0,altlex_altlex_dev.tsv_398_0,"Between 1942 and 1945 , Roosevelt signed an or...","<ARG1>Between 1942</ARG1> and <ARG0>1945 , Roo...",0,0,,1
414,altlex,altlex_dev.tsv,399,0,altlex_altlex_dev.tsv_399_0,"The Resistance , the group of people who fough...","<ARG1>The Resistance , the group of people who...",0,0,,1
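
In `run`, `sent_id` is obtained by hashing the text and numbering the unique hashes in order of appearance, and `eg_id` comes from a running counter per sentence. The sketch below is an alternative formulation, not the notebook's own method: it assumes `pd.factorize` and `groupby(...).cumcount()` are acceptable substitutes and uses a made-up frame.

```python
import pandas as pd

# Toy texts; "s1" occurs three times and "s2" twice.
toy = pd.DataFrame({"text": ["s1", "s2", "s1", "s3", "s2", "s1"]})

# factorize assigns integer codes in order of first appearance,
# so identical texts receive the same sentence ID.
toy["sent_id"] = pd.factorize(toy["text"])[0]

# cumcount numbers the rows within each sentence: 0, 1, 2, ...
toy["eg_id"] = toy.groupby("sent_id").cumcount()

print(toy)
```

Because no hash is involved, this variant cannot be affected by hash collisions between different texts.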


In [8]:
data = pd.DataFrame()
for counter, file in enumerate(all_filenames):

    df = run(file)
    data = pd.concat([data, df])

data

Unnamed: 0,corpus,doc_id,sent_id,eg_id,index,text,text_w_pairs,seq_label,pair_label,context,num_sents
0,altlex,altlex_dev.tsv,0,0,altlex_altlex_dev.tsv_0_0,"The Bhopal disaster , also referred to as the ...","<ARG1>The Bhopal disaster , also referred to</...",0,0,,1
1,altlex,altlex_dev.tsv,1,0,altlex_altlex_dev.tsv_1_0,"In addition , several vent gas scrubbers had b...","<ARG1>In addition , several vent gas scrubbers...",0,0,,1
2,altlex,altlex_dev.tsv,2,0,altlex_altlex_dev.tsv_2_0,Union Carbide organized a team of internationa...,<ARG1>Union Carbide organized a team of intern...,0,0,,1
3,altlex,altlex_dev.tsv,3,0,altlex_altlex_dev.tsv_3_0,"Following an appeal of this decision , the U.S...","<ARG1>Following an appeal of this decision , t...",0,0,,1
4,altlex,altlex_dev.tsv,4,0,altlex_altlex_dev.tsv_4_0,The U.S. Supreme Court refused to hear an appe...,<ARG0>The U.S. Supreme Court refused to hear a...,1,1,,1
...,...,...,...,...,...,...,...,...,...,...,...
606,altlex,altlex_gold.tsv,162,1,altlex_altlex_gold.tsv_162_1,Then he turned into Angelus and when he was fi...,<ARG1>Then he turned into Angelus and when he ...,1,0,,1
607,altlex,altlex_gold.tsv,574,0,altlex_altlex_gold.tsv_574_0,"In Asia , the United States led the occupation...","<ARG0>In Asia , the United States</ARG0> led <...",1,1,,1
608,altlex,altlex_gold.tsv,575,0,altlex_altlex_gold.tsv_575_0,"In April 1940 , Germany invaded Denmark and No...","<ARG1>In April 1940 , Germany invaded Denmark ...",1,1,,1
609,altlex,altlex_gold.tsv,576,0,altlex_altlex_gold.tsv_576_0,Musical analysis of a composition aims at achi...,<ARG0>Musical analysis of a composition aims a...,1,1,,1


In [9]:
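# Rows in causal sentences (seq_label == 1) whose own pair is non-causal (pair_label == 0).
# Each comparison needs its own parentheses because & binds tighter than ==.
data.loc[(data['seq_label'] == 1) & (data['pair_label'] == 0)]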

Unnamed: 0,corpus,doc_id,sent_id,eg_id,index,text,text_w_pairs,seq_label,pair_label,context,num_sents
192,altlex,altlex_dev.tsv,181,1,altlex_altlex_dev.tsv_181_1,The resulting lack of coverage over the island...,<ARG1>The resulting lack of coverage over the ...,1,0,,1
228,altlex,altlex_dev.tsv,217,0,altlex_altlex_dev.tsv_217_0,"Camulodunum was burned to the ground , as well...","<ARG1>Camulodunum was burned to the ground , a...",1,0,,1
308,altlex,altlex_dev.tsv,294,1,altlex_altlex_dev.tsv_294_1,"But despite its unsuitability , and the availa...","<ARG1>But despite its unsuitability , and the ...",1,0,,1
77,altlex,altlex_gold.tsv,75,0,altlex_altlex_gold.tsv_75_0,"Over the next two days , the system drifted no...","<ARG1>Over the next two days , the system drif...",1,0,,1
142,altlex,altlex_gold.tsv,133,1,altlex_altlex_gold.tsv_133_1,The clinical symptoms of AD usually occurs aft...,<ARG1>The clinical symptoms of AD usually occu...,1,0,,1
204,altlex,altlex_gold.tsv,191,1,altlex_altlex_gold.tsv_191_1,"Furthermore , jump instructions may be made to...","<ARG1>Furthermore , jump instructions may be m...",1,0,,1
213,altlex,altlex_gold.tsv,199,1,altlex_altlex_gold.tsv_199_1,Each circuit represents a bit ( binary digit )...,<ARG1>Each circuit represents a bit ( binary d...,1,0,,1
224,altlex,altlex_gold.tsv,209,1,altlex_altlex_gold.tsv_209_1,Many projects try to send working computers to...,<ARG1>Many projects try to send working comput...,1,0,,1
231,altlex,altlex_gold.tsv,215,0,altlex_altlex_gold.tsv_215_0,"On October 25 at 1:45 am EDT , Kennedy respond...","<ARG1>On October 25 at 1:45 am EDT , Kennedy r...",1,0,,1
354,altlex,altlex_gold.tsv,334,0,altlex_altlex_gold.tsv_334_0,Many commanders on both sides knew that such w...,<ARG1>Many commanders on both sides knew that ...,1,0,,1
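
Rows like these exist because `seq_label` is raised to the maximum over all examples that share a sentence, while `pair_label` stays specific to each example. A minimal sketch of that relationship on made-up data:

```python
import pandas as pd

# Toy data: sentence "0" has one causal and one non-causal example.
toy = pd.DataFrame({
    "sent_id": ["0", "0", "1", "2"],
    "pair_label": [1, 0, 0, 1],
})

# A sentence counts as causal if any of its examples is causal.
toy["seq_label"] = toy.groupby("sent_id")["pair_label"].transform("max")

# Causal sentence, non-causal pair; both comparisons are parenthesised.
mask = (toy["seq_label"] == 1) & (toy["pair_label"] == 0)

print(toy[mask])
```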


In [10]:
from collections import Counter

print('All Examples')
print('Seq Level:', Counter(data['seq_label'])) # if sentence level causality exists
print('Pair Level:',Counter(data['pair_label'])) # if ARG0-ARG1 pair level causality exists
#print('Seq Level (Unique):',Counter(data.drop_duplicates(subset=['corpus','doc_id','sent_id'])['seq_label']))

print('\nSingle Sentence Examples')
print('Seq Level:', Counter(data.loc[data['num_sents']==1,'seq_label'])) # if sentence level causality exists
print('Pair Level:', Counter(data.loc[data['num_sents']==1,'pair_label'])) # if ARG0-ARG1 pair level causality exists
#print('Seq Level (Unique):',Counter(data.loc[data['num_sents']==1].drop_duplicates(subset=['corpus','doc_id','sent_id'])['seq_label']))

All Examples
Seq Level: Counter({0: 571, 1: 456})
Pair Level: Counter({0: 585, 1: 442})

Single Sentence Examples
Seq Level: Counter({0: 571, 1: 456})
Pair Level: Counter({0: 585, 1: 442})


# Save data as CSV

In [11]:
directory_path = '/content/drive/MyDrive/Colab Notebooks/NLP/project/PC-TES-PROJECT/causality-guided-Transformer/data/altlex_data/splits/'

In [12]:
data.to_csv(directory_path + 'altlex.csv', index=False, encoding='utf-8-sig')

# Split Data into train, test, dev (60/20/20)

In [13]:
new_data = pd.read_csv(directory_path + 'altlex.csv')

In [14]:
new_data

Unnamed: 0,corpus,doc_id,sent_id,eg_id,index,text,text_w_pairs,seq_label,pair_label,context,num_sents
0,altlex,altlex_dev.tsv,0,0,altlex_altlex_dev.tsv_0_0,"The Bhopal disaster , also referred to as the ...","<ARG1>The Bhopal disaster , also referred to</...",0,0,,1
1,altlex,altlex_dev.tsv,1,0,altlex_altlex_dev.tsv_1_0,"In addition , several vent gas scrubbers had b...","<ARG1>In addition , several vent gas scrubbers...",0,0,,1
2,altlex,altlex_dev.tsv,2,0,altlex_altlex_dev.tsv_2_0,Union Carbide organized a team of internationa...,<ARG1>Union Carbide organized a team of intern...,0,0,,1
3,altlex,altlex_dev.tsv,3,0,altlex_altlex_dev.tsv_3_0,"Following an appeal of this decision , the U.S...","<ARG1>Following an appeal of this decision , t...",0,0,,1
4,altlex,altlex_dev.tsv,4,0,altlex_altlex_dev.tsv_4_0,The U.S. Supreme Court refused to hear an appe...,<ARG0>The U.S. Supreme Court refused to hear a...,1,1,,1
...,...,...,...,...,...,...,...,...,...,...,...
1022,altlex,altlex_gold.tsv,162,1,altlex_altlex_gold.tsv_162_1,Then he turned into Angelus and when he was fi...,<ARG1>Then he turned into Angelus and when he ...,1,0,,1
1023,altlex,altlex_gold.tsv,574,0,altlex_altlex_gold.tsv_574_0,"In Asia , the United States led the occupation...","<ARG0>In Asia , the United States</ARG0> led <...",1,1,,1
1024,altlex,altlex_gold.tsv,575,0,altlex_altlex_gold.tsv_575_0,"In April 1940 , Germany invaded Denmark and No...","<ARG1>In April 1940 , Germany invaded Denmark ...",1,1,,1
1025,altlex,altlex_gold.tsv,576,0,altlex_altlex_gold.tsv_576_0,Musical analysis of a composition aims at achi...,<ARG0>Musical analysis of a composition aims a...,1,1,,1


In [15]:
new_data['text_w_pairs'][0]

"<ARG1>The Bhopal disaster , also referred to</ARG1> as <ARG0>the Bhopal gas tragedy , was a gas leak incident in India , considered the world 's worst industrial disaster .</ARG0>"

In [16]:
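# NOTE: df is the last frame returned by run() in the loop above (altlex_gold.tsv only);
# new_data, which holds both files, is not used for the split.
X = df[['sent_id', 'text']].copy()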

In [17]:
X

Unnamed: 0,sent_id,text
0,0,A government affidavit in 2006 stated that the...
1,1,The Indian government and local activists argu...
2,2,The UCIL factory was built in 1969 to produce ...
3,3,"In a panic , he removed his mask , inhaling a ..."
4,4,"In August 1982 , a chemical engineer came into..."
...,...,...
606,162,Then he turned into Angelus and when he was fi...
607,574,"In Asia , the United States led the occupation..."
608,575,"In April 1940 , Germany invaded Denmark and No..."
609,576,Musical analysis of a composition aims at achi...


In [18]:
Y = df['seq_label'].copy()

In [19]:
Y

Unnamed: 0,seq_label
0,1
1,1
2,0
3,1
4,1
...,...
606,1
607,1
608,1
609,1


In [20]:
# 60% train, 40% held out for test and dev
X_train, X_temp, y_train, y_temp = train_test_split(X, Y, test_size=0.4, random_state=42)


In [21]:
# Split the held-out 40% in half: 20% test, 20% dev
X_test, X_dev, y_test, y_dev = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
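
The two calls above split at the row level, so several rows that share a `sent_id` can end up in different partitions. One possible alternative, sketched here under the assumption that sentence-level leakage matters for the downstream task, is to split the unique sentence IDs and then select rows by ID. The frame is made up, and this is not what the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 20 sentences with 2 rows each.
toy = pd.DataFrame({
    "sent_id": [str(i // 2) for i in range(40)],
    "text": [f"sentence {i // 2}" for i in range(40)],
})

# Split the unique IDs 60/40, then the held-out 40% in half.
ids = toy["sent_id"].unique()
train_ids, temp_ids = train_test_split(ids, test_size=0.4, random_state=42)
test_ids, dev_ids = train_test_split(temp_ids, test_size=0.5, random_state=42)

# Select rows by ID so that all copies of a sentence stay together.
train = toy[toy["sent_id"].isin(train_ids)]
test = toy[toy["sent_id"].isin(test_ids)]
dev = toy[toy["sent_id"].isin(dev_ids)]

print(len(train), len(test), len(dev))
```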

In [22]:
X_train

Unnamed: 0,sent_id,text
356,335,British Field Marshal Sir Douglas Haig wrote i...
595,563,The Yugoslav Committee was formed in Paris on ...
245,228,Kennedy then sent an official letter to Khrush...
42,40,"According to Chinese state officials , the qua..."
428,406,Although many of the Axis 's crimes were broug...
...,...,...
71,69,"Later that day , the storm strengthened over l..."
106,99,"In an advisory , the JTWC reported that there ..."
270,253,In addition to fighting the Second Barons ' Wa...
435,413,"Upon taking power , James immediately made pea..."


In [23]:
altlex_seq_train = {}

# Iterate through the X_train DataFrame
for index, row in X_train.iterrows():
    altlex_seq_train[row['sent_id']] = row['text']

altlex_seq_train

{'335': 'British Field Marshal Sir Douglas Haig wrote in his diary : `` My officers and I were aware that such weapons would cause harm to women and children living in nearby towns , as strong winds were common in the battlefront .',
 '563': 'The Yugoslav Committee was formed in Paris on 30 April 1915 but shortly moved its office to London ; Trumbić led the Committee .',
 '228': 'Kennedy then sent an official letter to Khrushchev agreeing to the conditions of the first letter and not mentioning the second .',
 '40': 'According to Chinese state officials , the quake caused 69,180 known deaths including 68,636 in Sichuan province ; 18,498 people are listed as missing , and 374,176 injured , but these figures may further increase as more reports come in .',
 '406': "Although many of the Axis 's crimes were brought to the first international court , crimes caused by the Allies were not .",
 '389': "The German Government led by Adolf Hitler and the Nazi Party was responsible for the Holocau
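
Because the dictionary is keyed by `sent_id`, rows that share an ID collapse into a single entry, so the dictionary can be shorter than the split it was built from. This is harmless here, since the ID is derived from the text and the values are therefore identical. A minimal sketch on made-up rows, using `dict(zip(...))` as an equivalent of the row-by-row loop:

```python
import pandas as pd

# Toy split in which one sentence occurs twice under the same ID.
toy = pd.DataFrame({
    "sent_id": ["7", "7", "8"],
    "text": ["same sentence", "same sentence", "another sentence"],
})

# Later rows overwrite earlier ones that use the same key.
seq = dict(zip(toy["sent_id"], toy["text"]))

print(len(toy), len(seq))
```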

In [24]:
X_test

Unnamed: 0,sent_id,text
592,561,Some of the chemicals lead to birth defects .
603,571,The Wars of the Roses ended with the victory o...
24,24,"In 2002 , The Yes Men issued a fake press rele..."
429,407,German concentration camps and Soviet gulags c...
530,502,The Allies continued by starting its own offen...
...,...,...
558,529,Germany surrendered to the Western Allies on 7...
290,272,The English merchants holding plantations in t...
482,457,The activists worked on organising the gas vic...
370,348,"Likewise , the art of Paul Nash , John Nash , ..."


In [25]:
altlex_seq_test = {}

# Iterate through the X_test DataFrame
for index, row in X_test.iterrows():
    altlex_seq_test[row['sent_id']] = row['text']

altlex_seq_test

{'561': 'Some of the chemicals lead to birth defects .',
 '571': 'The Wars of the Roses ended with the victory of Henry Tudor , who became king Henry VII of England , at the Battle of Bosworth Field in 1485 , where the Yorkist king , Richard III was killed .',
 '24': "In 2002 , The Yes Men issued a fake press release explaining why Dow refused to take responsibility for the disaster and started up a website , at `` DowEthics.com '' , designed to look like the real Dow website , but containing hoax information .",
 '407': 'German concentration camps and Soviet gulags caused a lot of death .',
 '502': 'The Allies continued by starting its own offensive , which drove the Axis west across Libya a few months later , just after the Anglo-American invasion of the French North Africa , forcing it to join the Allies .',
 '32': 'Earthquakes of this size have the potential to cause extensive damage and loss of life .',
 '286': 'On 28 July , the Austro-Hungarians declared war on Serbia and subsequ

In [26]:
X_dev

Unnamed: 0,sent_id,text
131,123,Alzheimer 's disease has been identified as a ...
483,458,"25 years later , LeMay still believed that `` ..."
88,85,Due to land interaction and colder sea surface...
479,455,Contrary to British fears of a revolt in India...
271,254,He conquered Wales and attempted to use a succ...
...,...,...
110,103,"On December 13 , the low pressure area quickly..."
369,347,Historian Samuel Hynes explained : This has be...
551,523,The sailors ' revolt which then ensued in the ...
354,334,Many commanders on both sides knew that such w...


In [27]:
altlex_seq_dev = {}

# Iterate through the X_dev DataFrame
for index, row in X_dev.iterrows():
    altlex_seq_dev[row['sent_id']] = row['text']

altlex_seq_dev

{'123': "Alzheimer 's disease has been identified as a protein misfolding disease ( proteopathy ) , caused by plaque accumulation of abnormally folded amyloid beta protein , and tau protein in the brain .",
 '458': "25 years later , LeMay still believed that `` We could have gotten not only the missiles out of Cuba , we could have gotten the Communists out of Cuba at that time . ''",
 '85': 'Due to land interaction and colder sea surface temperature in the South China Sea , the JMA downgraded Nalgae to a severe tropical storm on October 2 , and then a tropical storm late on October 3 .',
 '455': 'Contrary to British fears of a revolt in India , the outbreak of the war saw an unprecedented outpouring of loyalty and goodwill towards Britain .',
 '254': 'He conquered Wales and attempted to use a succession dispute to gain control of the Kingdom of Scotland , though this developed into a costly and drawn-out military campaign .',
 '262': 'He quickly reached an understanding with the French

# JSON files

In [28]:
# Save the dictionary to a JSON file
with open(directory_path + "altlex_seq_train.json", "w") as outfile:
    json.dump(altlex_seq_train, outfile)

In [29]:
# Save the dictionary to a JSON file
with open(directory_path + "altlex_seq_test.json", "w") as outfile:
    json.dump(altlex_seq_test, outfile)

In [30]:
# Save the dictionary to a JSON file
with open(directory_path + "altlex_seq_dev.json", "w") as outfile:
    json.dump(altlex_seq_dev, outfile)
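
To check that a saved file can be read back, a round trip along the following lines may help. It writes to a temporary directory rather than the Drive path, the dictionary contents are made up, and `ensure_ascii=False` is an optional choice (not used in the cells above) that keeps characters such as "ć" readable in the file.

```python
import json
import os
import tempfile

# Toy dictionary in the same shape as the saved ones: sent_id -> text.
seq = {"335": "first sentence", "563": "Trumbić led the Committee ."}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_seq.json")
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(seq, outfile, ensure_ascii=False)
    with open(path, encoding="utf-8") as infile:
        restored = json.load(infile)

print(restored == seq)

# JSON object keys are always strings, so integer keys do not survive a round trip.
print(json.loads(json.dumps({1: "a"})))
```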

# Pickle files

In [31]:
# Save the dictionary to a pickle file
with open(directory_path + 'altlex_seq_train.pkl', 'wb') as f:
    pickle.dump(altlex_seq_train, f)

In [32]:
# Save the dictionary to a pickle file
with open(directory_path + 'altlex_seq_test.pkl', 'wb') as f:
    pickle.dump(altlex_seq_test, f)

In [33]:
# Save the dictionary to a pickle file
with open(directory_path + 'altlex_seq_dev.pkl', 'wb') as f:
    pickle.dump(altlex_seq_dev, f)
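
A pickle file can be verified with the same kind of round trip. The sketch below uses a temporary directory and a made-up dictionary; unlike JSON, pickle preserves the Python types of the keys. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Toy dictionary with one integer key and one string key.
seq = {335: "first sentence", "563": "second sentence"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_seq.pkl")
    with open(path, "wb") as f:
        pickle.dump(seq, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Key types are preserved: 335 is still an int after loading.
print(restored == seq, sorted(type(k).__name__ for k in restored))
```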