# Checking for relationships between entities

The best result came from the Random Forest model with TF-IDF features: **Accuracy of 92.8% with a weighted F1-Score of 0.92** on the held-out test split.

The Random Forest model with spaCy word vectors yields an **Accuracy of 87.3% with a weighted F1-Score of 0.85.** On the unseen prediction set, the TF-IDF model makes mostly plausible predictions.

In [None]:
# Importing the required libraries

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score  # used by every evaluation cell below
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer

from imblearn.pipeline import Pipeline as imb_Pipeline
from imblearn.over_sampling import RandomOverSampler

import spacy
from xgboost import XGBClassifier
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")  # required by stopwords.words("english") below
nltk.download("wordnet")
nltk.download("omw-1.4")

import warnings
warnings.filterwarnings("ignore")

In [2]:
training_set = pd.read_excel("partnership_training_data.xlsx")
print(training_set.shape)
training_set.head()

(1180, 2)


|  | text_between_orgs | class |
|---|---|---|
| 0 | to a good supply partnership but nothing to wr... | Garbage |
| 1 | , based in | Garbage |
| 2 | , will take a | Garbage |
| 3 | , | Garbage |
| 4 | project off the coast of | Garbage |


In [3]:
# Checking for missing values:
training_set.isna().sum()

text_between_orgs    1
class                0
dtype: int64

In [4]:
training_set.dropna(inplace=True)

In [5]:
# Checking for Duplicate entries:
training_set.duplicated().sum()

0

In [6]:
# Checking for Class Imbalance
training_set["class"].value_counts()

Garbage                934
Partner                 62
Joint Venture           62
Acquisition             42
Merger                  38
Investor                21
Signed an agreement     12
Subsidiary               8
Name: class, dtype: int64
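
The counts above show a heavy skew towards `Garbage`, which is why the pipelines later in this notebook include a `RandomOverSampler` step. The following is a minimal hand-rolled sketch of the idea behind random oversampling, not imbalanced-learn's actual implementation; the helper name `random_oversample` is hypothetical and the class counts are copied from the output above.

```python
import random
from collections import Counter

# Class counts copied from the value_counts() output above.
class_counts = {
    "Garbage": 934,
    "Partner": 62,
    "Joint Venture": 62,
    "Acquisition": 42,
    "Merger": 38,
    "Investor": 21,
    "Signed an agreement": 12,
    "Subsidiary": 8,
}


def random_oversample(labels, seed=0):
    """Return row indices in which every class is padded, by sampling with
    replacement, up to the size of the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    resampled = []
    for idxs in by_class.values():
        resampled.extend(idxs)  # keep every original row
        # Fill the gap to the majority size with random duplicates.
        resampled.extend(rng.choices(idxs, k=target - len(idxs)))
    return resampled


labels = [name for name, n in class_counts.items() for _ in range(n)]
resampled_idx = random_oversample(labels)
balanced_counts = Counter(labels[i] for i in resampled_idx)

majority_share = class_counts["Garbage"] / len(labels)
print(f"Majority share before resampling: {majority_share:.1%}")
print(balanced_counts)
```

Before resampling, `Garbage` makes up roughly 79% of the rows, so a classifier can score well on accuracy while ignoring the rare classes. Inside an imbalanced-learn `Pipeline`, sampler steps are applied only during `fit`, so the test split keeps its original distribution.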

In [7]:
pred_set = pd.read_excel("partnership_prediction_data.xlsx")

## Modelling

In [8]:
X = training_set.drop("class", axis=1)
y = training_set["class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)


In [9]:
X_train.head()

|  | text_between_orgs |
|---|---|
| 870 | sponsors such as |
| 339 | management infrastructure Vehicle segment |
| 848 | commenced the production and marketing marketi... |
| 985 | joint venture between |
| 839 | , Transport and Tourism . The unit of vertical... |


### Baseline Model

In [10]:
vectorizer = TfidfVectorizer(stop_words=stopwords.words("english"))

label_enc = LabelEncoder()
y_train = label_enc.fit_transform(y_train)
y_test = label_enc.transform(y_test)

col_transformer = ColumnTransformer(
    transformers=[
        ("tfidf_1", vectorizer, "text_between_orgs"),
    ]
)
oversample = RandomOverSampler(random_state=0)
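
To make the TF-IDF step less of a black box, here is a small pure-Python sketch of the weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which matches scikit-learn's documented defaults. It simplifies in two ways: it splits on whitespace and does not remove stop words, so the numbers will not match `TfidfVectorizer` exactly. The toy corpus is adapted from phrases that appear in this notebook, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

# Toy corpus adapted from phrases seen in the data; illustration only.
docs = [
    "joint venture between",
    "has agreed to partner with",
    "in partnership with",
]


def tfidf(docs):
    """Smoothed TF-IDF with L2 normalisation, one dict of weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the second document, `with` receives a lower weight than `partner` because it also occurs in the third document. This is the property that lets rare, class-specific words such as "venture" or "acquire" stand out.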

In [11]:
model = RandomForestClassifier(random_state=0, n_estimators=100, n_jobs=-1)

clf_best = imb_Pipeline(
    steps=[
        ("prep", col_transformer),
        ("oversampling", oversample),
        ("model", model)
        ]
    )

clf_best.fit(X_train, y_train)
y_pred = clf_best.predict(X_test)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy score: 92.80000000000001
Classification Report: 
               precision    recall  f1-score   support

           0       0.80      1.00      0.89         8
           1       0.97      0.95      0.96       187
           2       0.67      0.50      0.57         4
           3       0.85      0.85      0.85        13
           4       0.78      0.88      0.82         8
           5       0.75      1.00      0.86        12
           6       0.00      0.00      0.00         2
           7       1.00      1.00      1.00         2

    accuracy                           0.93       236
   macro avg       0.73      0.77      0.74       236
weighted avg       0.92      0.93      0.92       236
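
The gap between the macro and weighted averages in this report comes from the class imbalance. The sketch below recomputes both averages from the per-class rows above and adds the accuracy of a trivial classifier that always predicts the largest class. The inputs are the rounded values printed in the report, so the recomputed averages can differ from the report in the last digit.

```python
# Per-class F1 and support copied from the classification report above.
f1 = [0.89, 0.96, 0.57, 0.85, 0.82, 0.86, 0.00, 1.00]
support = [8, 187, 4, 13, 8, 12, 2, 2]

total = sum(support)

# Macro average: every class counts equally, however rare it is.
macro_f1 = sum(f1) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

# Accuracy of always predicting the largest class in the test split.
majority_baseline = max(support) / total

print(f"macro F1:          {macro_f1:.2f}")
print(f"weighted F1:       {weighted_f1:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

Because class 1 accounts for 187 of the 236 test rows, the weighted average mostly reflects that one class. The macro average is the more informative number for the rare relationship types.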



In [12]:
vectorizer = TfidfVectorizer(stop_words=stopwords.words("english"))

# y_train and y_test were already label-encoded in the baseline cell above.
# Re-fitting the encoder on the integer labels would discard the class names.

col_transformer = ColumnTransformer(
    transformers=[
        ("tfidf_1", vectorizer, "text_between_orgs"),
    ]
)
oversample = RandomOverSampler(random_state=0)

In [13]:
model = LogisticRegression(random_state=0, solver="liblinear", C=10)

clf = imb_Pipeline(
    steps=[
        ("prep", col_transformer),
        ("oversampling", oversample),
        ("model", model)
        ]
    )

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred)) 

Accuracy score: 91.9
Classification Report: 
               precision    recall  f1-score   support

           0       0.88      0.88      0.88         8
           1       0.97      0.93      0.95       187
           2       0.80      1.00      0.89         4
           3       0.79      0.85      0.81        13
           4       0.78      0.88      0.82         8
           5       0.67      1.00      0.80        12
           6       0.00      0.00      0.00         2
           7       1.00      1.00      1.00         2

    accuracy                           0.92       236
   macro avg       0.73      0.82      0.77       236
weighted avg       0.93      0.92      0.92       236



## Using word vectors

In [14]:
nlp = spacy.load('en_core_web_lg')
stop_words = nlp.Defaults.stop_words

# Remove Stopwords from train and test
f = 'text_between_orgs'
train_texts = [' '.join([t for t in text.split() if(t.lower() not in stop_words)]) for text in X_train[f]]
test_texts = [' '.join([t for t in text.split() if(t.lower() not in stop_words)]) for text in X_test[f]]

# Get dataframes with text converted to spaCy vectors
tr_df = pd.DataFrame([list(nlp(text).vector) for text in train_texts])
te_df = pd.DataFrame([list(nlp(text).vector) for text in test_texts])
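
spaCy's `Doc.vector` defaults to the average of the token vectors, so each phrase is reduced to a single fixed-length vector regardless of word order. The sketch below illustrates that mean pooling with made-up 3-dimensional vectors (`en_core_web_lg` uses 300 dimensions). The vectors, the `mean_pool` helper, and the choice to treat unknown tokens as zero vectors are assumptions for illustration.

```python
# Hypothetical 3-dimensional token vectors; real spaCy vectors are much larger.
toy_vectors = {
    "joint": [0.2, 0.7, -0.1],
    "venture": [0.4, 0.5, 0.3],
}


def mean_pool(tokens, vectors, dim=3):
    """Average the token vectors of a phrase into one vector.

    Unknown tokens are treated as zero vectors (an assumption of this
    sketch), and an empty phrase maps to the zero vector.
    """
    if not tokens:
        return [0.0] * dim
    zero = [0.0] * dim
    stacked = [vectors.get(token, zero) for token in tokens]
    return [sum(column) / len(stacked) for column in zip(*stacked)]


print(mean_pool(["joint", "venture"], toy_vectors))
print(mean_pool(["venture", "joint"], toy_vectors))  # same result: order is lost
print(mean_pool([], toy_vectors))
```

Averaging discards word order, and very short phrases such as a lone comma carry little signal once stop words are removed.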

In [15]:
oversample = RandomOverSampler(random_state=0)

model = RandomForestClassifier(random_state=0)

clf = imb_Pipeline(
    steps=[
        ("oversampling", oversample),
        ("model", model)
        ]
    )

clf.fit(tr_df, y_train)
y_pred = clf.predict(te_df)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred)) 

Accuracy score: 87.3
Classification Report: 
               precision    recall  f1-score   support

           0       1.00      0.62      0.77         8
           1       0.87      0.99      0.93       187
           2       1.00      0.25      0.40         4
           3       0.86      0.46      0.60        13
           4       0.67      0.25      0.36         8
           5       1.00      0.50      0.67        12
           6       0.00      0.00      0.00         2
           7       1.00      0.50      0.67         2

    accuracy                           0.87       236
   macro avg       0.80      0.45      0.55       236
weighted avg       0.87      0.87      0.85       236



## Sanity check using Prediction set

In [19]:
# Rename the column to match the name the fitted ColumnTransformer expects.
col_check = pred_set[["Predicate"]].rename(columns={"Predicate": "text_between_orgs"})
pred_set["Predicted_class"] = clf_best.predict(col_check)

In [20]:
pred_set.Predicted_class = pred_set.Predicted_class.map(
    {
        0: "Acquisition",
        1: "Garbage",
        2: "Investor",
        3: "Joint Venture",
        4: "Merger",
        5: "Partner",
        6: "Signed an agreement",
        7: "Subsidiary",
    }
)
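
The integer codes in the mapping above follow from how `LabelEncoder` assigns them: classes are sorted and numbered from 0. As a quick check, the sketch below rebuilds the same mapping in plain Python from the class names listed earlier. The variable name `code_to_name` is hypothetical.

```python
# Class names as they appear in the training data.
classes = [
    "Garbage",
    "Partner",
    "Joint Venture",
    "Acquisition",
    "Merger",
    "Investor",
    "Signed an agreement",
    "Subsidiary",
]

# Sorting the unique labels and numbering them from 0 reproduces the codes.
code_to_name = dict(enumerate(sorted(set(classes))))

for code, name in code_to_name.items():
    print(code, name)
```

When the fitted encoder still holds the original string classes, `label_enc.inverse_transform(predictions)` gives the same result without a hand-written dictionary.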


In [22]:
pred_set.head(10)

|  | Entity | Predicate | Entity_2 | Predicted_class |
|---|---|---|---|---|
| 0 | Spirit Energy | has agreed to partner with | Neptune Energy | Partner |
| 1 | Air Products | joint venture in India , called | INOX Air Products | Joint Venture |
| 2 | Air Products | acquired a 50 % equity stake in | Industrial Oxygen Company Ltd | Acquisition |
| 3 | Ares Management Corporation | announced that a subsidiary of | Ares | Subsidiary |
| 4 | Ares | has entered into a definitive agreement with a... | BrightSphere Investment Group | Subsidiary |
| 5 | Landmark Investment Holdings LP | to acquire 100 % of | Landmark Partners | Garbage |
| 6 | Conagra Brands | will acquire all outstanding shares of | Pinnacle Foods | Garbage |
| 7 | Goldman Sachs | , in partnership with | Santander Bank | Partner |
| 8 | Fifth Street Finance | announced that its | the Board of Directors | Acquisition |
| 9 | IronPlanet | ® jointly announced that they have entered int... | Ritchie Bros | Joint Venture |


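The Random Forest model with TF-IDF features performed best, with an **Accuracy of 92.8% and a weighted F1-Score of 0.92.**

The Random Forest model with word vectors yields an **Accuracy of 87.3% with a weighted F1-Score of 0.85.** On the unseen prediction set, the TF-IDF model makes mostly good predictions, although rows 5 and 6 show acquisition phrases ("to acquire 100 % of", "will acquire all outstanding shares of") being labelled Garbage.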