In [1]:
import pandas as pd

In [2]:
dataset = pd.read_csv("../data/train.csv")

In [3]:
dataset.drop_duplicates(inplace=True)

In [4]:
dataset.shape

(79383, 4)

In [5]:
dataset[:5]

Unnamed: 0,company1,company2,is_parent,snippet
0,Sprint_Corporation,Verizon_Communications,False,1 wireless carrier Verizon_Communications (NY...
1,Sprint_Corporation,Verizon_Communications,False,"While AT&T, Sprint_Corporation, and T-Mobile ..."
2,Sprint_Corporation,Verizon_Communications,False,"\nAT&T, Sprint_Corporation, and Verizon_Commun..."
3,Alexa_Internet,Amazon.com,False,Logitech addsAmazon.comn'sAlexa_Interneta skil...
4,Alexa_Internet,Amazon.com,False,\nLogitech has announced a new version of the ...


### Why do we preprocess?
In short, we found the following problems in the text:
* Words adjacent to an organisation name are concatenated directly to it (1)
* Models can overfit on company names (2)
* Some organisations do not appear in their snippet (3)

Next we briefly explain how we address each of these cases:

1. The solution to the first problem is rather easy: add spaces around each known company name and the tokenizer will do the rest.
2. The second problem was trickier, because our models could learn to detect the organisation entities themselves without grasping the context. Furthermore, if we added the organisations as features to the model, their order would matter, because the *parent_of* relation is not symmetric. We considered two solutions. The first was to add as a feature a function h: R x R -> R for which h(a, b) != h(b, a). The second, more elegant approach was to replace the two companies we are looking for with constant strings, in our case company1 and company2. This works because the model cannot learn the exact names of the organisations, while the order of the relation is preserved.
3. We solved the last problem by ignoring those samples when calculating the distance feature in our classical machine learning models.
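
To make the aliasing idea concrete, here is a minimal sketch on a single made-up snippet. The helper name `alias_pair` and the example text are assumptions for illustration only, and the whitespace collapsing is an extra step that the notebook's own `preprocess` does not perform. It shows that swapping the pair order yields a different aliased text, so the direction of the relation is kept while the real names are hidden.

```python
def alias_pair(snippet, company1, company2):
    """Replace the two company mentions with padded placeholder tokens."""
    aliased = (snippet
               .replace(company1, " company1 ")
               .replace(company2, " company2 "))
    # Collapse the extra whitespace introduced by the padding, then lowercase.
    return " ".join(aliased.split()).lower()


# Hypothetical snippet in which the names are glued to neighbouring words.
snippet = "Logitech addsAmazon.com'sAlexa_Internetskill"

forward = alias_pair(snippet, "Alexa_Internet", "Amazon.com")
backward = alias_pair(snippet, "Amazon.com", "Alexa_Internet")

print(forward)   # logitech adds company2 's company1 skill
print(backward)  # logitech adds company1 's company2 skill
```

Because the placeholders are padded with spaces, a whitespace tokenizer sees `company1` and `company2` as separate tokens even when the original names were concatenated with other words.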

In [7]:
# Alias each pair as company1/company2 to fix the pair-order problem stated in the paper.

def preprocess(dataset):
    aliased_snippet = []
    # All distinct company names from both columns.
    # pd.concat is used because Series.append was removed in pandas 2.0.
    companies = pd.concat([dataset["company1"], dataset["company2"]]).value_counts().keys()
    for i in range(dataset.shape[0]):
        current_row = dataset.iloc[i]
        snippet = current_row["snippet"]
        # Pad every known company name with spaces, because in some samples
        # the neighbouring words are concatenated to it.
        # str.replace returns a new string, so the result has to be assigned.
        for company in companies:
            snippet = snippet.replace(company, ' ' + company + ' ')

        # Replace the pair of interest with order-preserving placeholders.
        aliased_snippet.append(snippet
                               .replace(current_row["company1"], ' company1 ')
                               .replace(current_row["company2"], ' company2 '))
    dataset['aliased_snippet'] = aliased_snippet

    dataset['aliased_snippet'] = dataset['aliased_snippet'].str.lower()
    print(companies.shape)
    return dataset

In [9]:
dataset = preprocess(dataset)
dataset.shape

(451,)


(79383, 5)

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# Split the training data into train/dev/test with a 70/20/10 ratio

train, other = train_test_split(dataset, stratify=dataset["is_parent"],test_size=0.3,random_state=26)
train.shape, other.shape

((55568, 5), (23815, 5))

In [12]:
train["is_parent"].value_counts()

False    39038
True     16530
Name: is_parent, dtype: int64

In [13]:
other["is_parent"].value_counts()

False    16730
True      7085
Name: is_parent, dtype: int64

In [14]:
dev,test = train_test_split(other, stratify=other["is_parent"], test_size=(1/3), random_state=26)
dev.shape, test.shape

((15876, 5), (7939, 5))

Let's check whether we split it correctly.

In [15]:
def in_percent(ratio):
    return ratio*100

print(in_percent(train.shape[0]/dataset.shape[0]))
print(in_percent(dev.shape[0]/dataset.shape[0]))
print(in_percent(test.shape[0]/dataset.shape[0]))

69.99987402844438
19.999244170666262
10.00088180088936
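
The 70/20/10 ratio follows from the two-stage split: 30% is held out first, and one third of that 30% becomes the test set (0.3 × 1/3 = 0.1). The sketch below reproduces the sizes with plain arithmetic. The helper `split_sizes` is hypothetical, and it assumes that the held-out part of each split is rounded up, which is consistent with the shapes printed above.

```python
import math


def split_sizes(n_samples, first_test_size=0.3, second_test_size=1 / 3):
    """Return (train, dev, test) sizes for a two-stage split."""
    # Assumption: the held-out part is rounded up, the rest is the remainder.
    other = math.ceil(first_test_size * n_samples)
    train = n_samples - other
    test = math.ceil(second_test_size * other)
    dev = other - test
    return train, dev, test


train_n, dev_n, test_n = split_sizes(79383)
print(train_n, dev_n, test_n)  # 55568 15876 7939
```

The percentages are not exactly 70, 20 and 10 because 79383 is not divisible by 10, so each partition is off by a fraction of a sample.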


In [16]:
train.to_csv("split/train.csv")
dev.to_csv("split/dev.csv")
test.to_csv("split/test.csv")

In [17]:
train["is_parent"].value_counts()

False    39038
True     16530
Name: is_parent, dtype: int64

In [18]:
dev["is_parent"].value_counts()

False    11153
True      4723
Name: is_parent, dtype: int64

In [19]:
test["is_parent"].value_counts()

False    5577
True     2362
Name: is_parent, dtype: int64

### Now let's preprocess the unlabeled test set so that we can use it as a corpus for more words and prepare it as input for the models

In [10]:
onto_test = pd.read_csv("./test/test.csv")

In [14]:
onto_test.shape

(20591, 5)

In [11]:
onto_test[:5]

Unnamed: 0,company1,company2,is_parent,snippet
0,Ford_Motor_Company,Holden,,95s to top the sheets ahead of Kiwi Fabian Cou...
1,Ford_Motor_Company,Holden,,95s to top the sheets ahead of Kiwi Fabian Cou...
2,Apple_Inc.,HBO,,\nGamers who want to access HBO Now on the Xbo...
3,Apple_Inc.,HBO,,\nHBO first launched its standalone subscripti...
4,Apple_Inc.,HBO,,\nHBO first launched its standalone subscripti...


In [12]:
onto_test = preprocess(onto_test)

(279,)


In [13]:
onto_test[:5]

Unnamed: 0,company1,company2,is_parent,snippet,aliased_snippet
0,Ford_Motor_Company,Holden,,95s to top the sheets ahead of Kiwi Fabian Cou...,95s to top the sheets ahead of kiwi fabian cou...
1,Ford_Motor_Company,Holden,,95s to top the sheets ahead of Kiwi Fabian Cou...,95s to top the sheets ahead of kiwi fabian cou...
2,Apple_Inc.,HBO,,\nGamers who want to access HBO Now on the Xbo...,\ngamers who want to access company2 now on ...
3,Apple_Inc.,HBO,,\nHBO first launched its standalone subscripti...,\n company2 first launched its standalone sub...
4,Apple_Inc.,HBO,,\nHBO first launched its standalone subscripti...,\n company2 first launched its standalone sub...
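
One way the aliased unlabeled snippets could serve as a corpus for more words is to extend the vocabulary built from the labeled data. The following is only a sketch under that assumption: `build_vocabulary`, the whitespace tokenization, and both example snippets are hypothetical and not taken from the notebook.

```python
from collections import Counter


def build_vocabulary(snippets, min_count=1):
    """Collect whitespace-separated tokens that occur at least min_count times."""
    counts = Counter()
    for text in snippets:
        counts.update(text.split())
    return {token for token, count in counts.items() if count >= min_count}


# Hypothetical aliased snippets standing in for the labeled and unlabeled sets.
labeled = ["company1 is a subsidiary of company2"]
unlabeled = ["company2 first launched its standalone service"]

vocab_labeled = build_vocabulary(labeled)
vocab_all = build_vocabulary(labeled + unlabeled)

# Words that only the unlabeled corpus contributes.
new_words = sorted(vocab_all - vocab_labeled)
print(new_words)  # ['first', 'its', 'launched', 'service', 'standalone']
```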


In [24]:
%mkdir processed

In [26]:
onto_test.to_csv("processed/test.csv", index_label=False)