# Increasing training dataset for NLP Tasks
##### By Ruben Seoane

**Brief:**
By using freely available NLG solutions (backtranslation, paraphrasing, article spinning), we expect to generate additional text snippets from the existing dataset without the need for further human labeling. The generated snippets should maintain the conceptual relationships between entities, but not necessarily the same sentence structure. Our hypothesis is that this helps the DL model learn relationships from higher-level meaning rather than from specific words.


In [None]:
# Import the training set and drop existing duplicates
import pandas as pd

dataset = pd.read_csv("train.csv")
dataset.drop_duplicates(inplace=True)

In [None]:
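# Collect the unique company names from both company columns
# (pd.concat is used because Series.append was removed in pandas 2.0)
companies = pd.concat([dataset["company1"], dataset["company2"]]).value_counts().keys()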

_We observed that the training data contained several instances where words, including company names, were concatenated together, so we padded each company name with spaces:_

In [None]:
# Add spaces around each company name so it is separated from adjacent words
# (str.replace returns a new Series, so the result has to be assigned back)
for company in companies:
    dataset["snippet"] = dataset["snippet"].str.replace(
        company, " " + company + " ", regex=False
    )
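To make the effect of the padding step concrete, here is a minimal sketch on toy data. The company names and the snippet are invented for illustration, and the final whitespace cleanup is an optional extra that is not part of the original notebook.

```python
import pandas as pd

# Toy data: names and snippet are made up for illustration only.
toy = pd.DataFrame({"snippet": ["AcmeCorp acquiredGlobex last year"]})
toy_companies = ["AcmeCorp", "Globex"]

for name in toy_companies:
    # regex=False treats the name as a literal string, so characters
    # such as "." or "&" in a company name are not read as a pattern.
    toy["snippet"] = toy["snippet"].str.replace(name, f" {name} ", regex=False)

# Optional cleanup: collapse the doubled spaces the padding introduces.
toy["snippet"] = toy["snippet"].str.replace(r"\s+", " ", regex=True).str.strip()

print(toy["snippet"].iloc[0])
```

One caveat of literal replacement: if one company name is a substring of another, the longer name will be split as well, so the results are worth spot-checking.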

**_Paraphrasing only needs to be applied to the fourth column, "snippet":_**

In [None]:
# Create a new file with the "snippet" column only
keep_col = ['snippet']
snippet_col = dataset[keep_col]
snippet_col.to_csv('snippet_only_original.csv', header=False, index=False)

_After trying several "article spinning" APIs, and trying to set up a function to perform "backtranslation" (English-->French, French-->English...) to generate semantic variations, I found the service "https://www.prepostseo.com/free-online-article-rewriter" to be an easier approach. However, after testing with increasingly large text files, I found that the service's 10 MB limit required the modified dataset (17 MB) to be split into two parts._

In [None]:
import csv

# A - Read the CSV into a list of rows (each row is a one-element list)
with open('snippet_only_original.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    snippet_list = list(reader)

# B - Split the list in two
snippet_l_one = snippet_list[:40000]
snippet_l_two = snippet_list[40000:]
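The fixed split at row 40000 works for this dataset. As an alternative, a size-based split could be sketched as below, so that each part stays under an upload limit regardless of row length. The helper name `split_by_size` and its parameters are hypothetical and not part of the original notebook; it expects plain strings, for example `[row[0] for row in snippet_list]`.

```python
def split_by_size(rows, max_bytes):
    """Greedily pack rows into chunks whose UTF-8 size stays within max_bytes."""
    chunks, current, current_size = [], [], 0
    for row in rows:
        # +1 accounts for the newline written after each row.
        row_size = len(row.encode("utf-8")) + 1
        # Start a new chunk when adding this row would exceed the limit.
        if current and current_size + row_size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(row)
        current_size += row_size
    if current:
        chunks.append(current)
    return chunks


# Tiny demonstration with a 10-byte limit.
demo = split_by_size(["aaaa", "bbbb", "cc"], max_bytes=10)
print(demo)
```

A single row larger than `max_bytes` still ends up in a chunk of its own, so the limit is only guaranteed when every row fits individually.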

In [None]:
# Export lists as CSV file for rewriting process
df = pd.DataFrame(snippet_l_one, columns=["column"])
df.to_csv('snippet_A.csv', index=False)

df = pd.DataFrame(snippet_l_two, columns=["column"])
df.to_csv('snippet_B.csv', index=False)

_I decided to export the lists above as CSV instead of directly to .TXT so that I could review the results and then use Excel to save them manually as Tab Delimited and Unicode Text files, because the delimiter used (tab, space or " ") changed the results returned by the text spinning tool._
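For reference, the manual Excel step could probably be scripted with pandas. The sketch below is an assumption-laden alternative, not what was done here: the helper name `export_tab_delimited` and the demo file name are hypothetical, and `utf-16` is chosen on the assumption that it matches Excel's "Unicode Text" format, which should be checked against what the rewriting tool accepts.

```python
import os
import tempfile

import pandas as pd


def export_tab_delimited(rows, path, encoding="utf-16"):
    """Write a list of one-element rows as tab-delimited text without a header."""
    df = pd.DataFrame(rows, columns=["column"])
    # sep="\t" gives a tab-delimited file; header/index are omitted so the
    # file contains only the snippets, one per line.
    df.to_csv(path, sep="\t", index=False, header=False, encoding=encoding)


# Demonstration on two made-up snippets, written to a temporary folder.
demo_path = os.path.join(tempfile.mkdtemp(), "TAB_text_demo.txt")
export_tab_delimited([["first snippet"], ["second snippet"]], demo_path)
```

Snippets that themselves contain tabs, quotes or line breaks will be quoted by `to_csv`, which may or may not suit the tool, so a manual review of the output is still advisable.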

_At this point, the generated .txt files, "TAB_text_snippet_A.txt" and "TAB_text_snippet_B.txt", were submitted to the site mentioned above, which caused the browser (both Chrome and Firefox) to crash multiple times. In previous attempts, uploading the first 1000 rows (1000 sentences or paragraphs) required 100-110 minutes of processing. If processing time grows linearly with the number of rows, the full dataset could require around 8,800 minutes, or ~147 hours. Given the limited time available, and since we have not been able to access any cloud provider to test whether this task could be accelerated, I will focus on the 30% of the dataset that has True "Is Parent" relationships, in order to balance the proportion between negative or neutral text snippets and those containing true ownership relationships._
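The extrapolation can be reproduced with a few lines of arithmetic. Note that the total of 80,000 rows is an assumption, back-derived from the 8,800-minute figure at 110 minutes per 1000 rows; the notebook does not state the exact row count. The function name `estimate_minutes` is hypothetical.

```python
def estimate_minutes(total_rows, sample_rows, sample_minutes):
    """Linear extrapolation: scale the sample's processing time to total_rows."""
    return total_rows / sample_rows * sample_minutes


# ASSUMPTION: 80,000 rows in total (inferred, not stated in the notebook).
ASSUMED_TOTAL_ROWS = 80_000

# Upper bound of the observed 100-110 minutes per 1000 rows.
full_minutes = estimate_minutes(ASSUMED_TOTAL_ROWS, 1_000, 110)
full_hours = full_minutes / 60

# The same estimate for a 30% subset of the assumed total (24,000 rows).
subset_minutes = estimate_minutes(24_000, 1_000, 110)

print(f"full dataset: {full_minutes:.0f} min (~{full_hours:.0f} h)")
print(f"30% subset:   {subset_minutes:.0f} min (~{subset_minutes / 60:.0f} h)")
```

Under these assumptions, restricting the run to the 30% subset cuts the estimate roughly in proportion, which is the motivation for the narrower focus.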