# Checking for risk relationships between entities

Using all the text columns with separate TF-IDF vectorizers gives the best result: a random forest reaches an **accuracy of 0.77 with a weighted F1-score of 0.76**. Performance drops when the TF-IDF features are replaced with spaCy word vectors, where the accuracy falls to 0.71.

Creating a new feature by extracting the text between the two given entities lowers performance further: the best model reaches an **accuracy of only 0.61**.

In [None]:
# Importing the required libraries

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer

from imblearn.pipeline import Pipeline as imb_Pipeline
from imblearn.over_sampling import RandomOverSampler

import spacy
from xgboost import XGBClassifier
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')  # needed by stopwords.words('english') below
nltk.download('wordnet')
nltk.download('omw-1.4')

import warnings
warnings.filterwarnings('ignore')

In [5]:
training_set = pd.read_excel("risk_rel_data_tagged.xlsx")
print(training_set.shape)
training_set.head()

(498, 6)


Unnamed: 0,title,link,Extracted_Sents,Risks,Organizations,Relationship_Tag
0,Transition phase of corruption?,https://arunachaltimes.in/index.php/2022/10/09...,"As per Transparency international, corruption ...",corruption,transparency,0
1,Stellantis Faces $300 Million Fine for Emissio...,https://autos.yahoo.com/stellantis-faces-300-m...,Photo credit: is on the hook for up to $300 mi...,criminal charges,fca,1
2,Stellantis Faces $300 Million Fine for Emissio...,https://autos.yahoo.com/stellantis-faces-300-m...,Photo credit: is on the hook for up to $300 mi...,polluting technologies,fca,1
3,Stellantis Faces $300 Million Fine for Emissio...,https://autos.yahoo.com/stellantis-faces-300-m...,The automaker pled guilty in June to wire frau...,violating the clean air act,fbi,0
4,Stellantis Faces $300 Million Fine for Emissio...,https://autos.yahoo.com/stellantis-faces-300-m...,"Now, the merged Stellantis group is on the hoo...",fines,stellantis,1


In [6]:
# Checking for missing values:
training_set.isna().sum()

title               0
link                0
Extracted_Sents     0
Risks               0
Organizations       0
Relationship_Tag    0
dtype: int64

In [7]:
training_set.dropna(inplace=True)

In [8]:
# Checking for Duplicate entries:
training_set.duplicated().sum()

0

In [10]:
# Checking for Class Imbalance
training_set["Relationship_Tag"].value_counts()

0    273
1    225
Name: Relationship_Tag, dtype: int64
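
The classes are fairly balanced, so a classifier that always predicts the majority class gives a useful floor for the accuracies reported later. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 273, 1: 225}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = class_counts[majority_label] / total

print(f"Majority class: {majority_label}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model that scores close to this value is doing little better than always guessing the majority class.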

## Modelling without feature engineering

In [26]:
X = training_set.drop('Relationship_Tag', axis=1)
y = training_set['Relationship_Tag']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

In [27]:
X_train.head()

Unnamed: 0,title,link,Extracted_Sents,Risks,Organizations
125,How companies handle criminal charges: Trump O...,https://finance.yahoo.com/news/companies-handl...,"For example, pharmaceutical companies convicte...",pharmaceutical companies convicted of felonies,medicare
466,Former CEO of Volkswagen AG Charged with Consp...,https://www.justice.gov/opa/pr/former-ceo-volk...,The indictment further alleges that Winterkorn...,perpetrate the fraud,vw
270,FCA guilty in labor corruption scandal as auto...,https://news.yahoo.com/fca-guilty-labor-corrup...,"The party included liquor, more than $7,000 wo...",convicted,uaw
301,Stellantis admits guilt to criminal conspiracy...,https://news.yahoo.com/stellantis-admits-guilt...,"The company, known then as Fiat Chrysler Autom...",cheating,fiat chrysler automobiles
18,Stellantis Faces $300 Million Fine for Emissio...,https://autos.yahoo.com/stellantis-faces-300-m...,"Now, the merged Stellantis group is on the hoo...",pleading guilty,environmental protection agency


In [28]:
X_test.columns

Index(['title', 'link', 'Extracted_Sents', 'Risks', 'Organizations'], dtype='object')

In [29]:
y_train

125    0
466    1
270    1
301    1
18     0
      ..
177    0
453    1
422    1
437    0
124    1
Name: Relationship_Tag, Length: 398, dtype: int64

### Baseline

In [30]:
vectorizer1 = TfidfVectorizer(
    stop_words=stopwords.words('english')
    )
vectorizer2 = TfidfVectorizer(
    stop_words=stopwords.words('english')
    )
vectorizer3 = TfidfVectorizer(
    stop_words=stopwords.words('english')
    )

col_transformer = ColumnTransformer(
    transformers= [
        ("tfidf_1", vectorizer1, "Extracted_Sents"),
        ("tfidf_2", vectorizer2, "Risks"),
        ("tfidf_3", vectorizer3, "Organizations")
        ]
    )

model = RandomForestClassifier(random_state=0, n_estimators=100, n_jobs=-1)

clf_best = imb_Pipeline(
    steps=[
        ("prep", col_transformer),
        ("model", model)
        ]
    )

clf_best.fit(X_train, y_train)
y_pred = clf_best.predict(X_test)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy score: 77.0
Classification Report: 
               precision    recall  f1-score   support

           0       0.74      0.89      0.81        55
           1       0.82      0.62      0.71        45

    accuracy                           0.77       100
   macro avg       0.78      0.76      0.76       100
weighted avg       0.78      0.77      0.76       100
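
The per-class numbers in this report are consistent with one particular confusion matrix. The counts below are inferred from the rounded precision, recall, and support values above, not read from the notebook, so treat them as a reconstruction. The sketch recomputes the report metrics from those counts to show how the accuracy and the weighted F1-score relate:

```python
# Confusion-matrix counts inferred from the report above (an assumption,
# chosen so that the rounded metrics match). Names read true<label>_pred<label>.
true0_pred0, true0_pred1 = 49, 6    # support for class 0 is 55
true1_pred0, true1_pred1 = 17, 28   # support for class 1 is 45


def precision_recall_f1(tp, fp, fn):
    """Return precision, recall and F1 for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p0, r0, f1_0 = precision_recall_f1(tp=true0_pred0, fp=true1_pred0, fn=true0_pred1)
p1, r1, f1_1 = precision_recall_f1(tp=true1_pred1, fp=true0_pred1, fn=true1_pred0)

support0 = true0_pred0 + true0_pred1
support1 = true1_pred0 + true1_pred1
n_samples = support0 + support1

# Accuracy is the share of correct predictions over all test rows.
accuracy = (true0_pred0 + true1_pred1) / n_samples
# Weighted F1 averages the per-class F1 scores, weighted by class support.
weighted_f1 = (support0 * f1_0 + support1 * f1_1) / n_samples

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f} weighted_f1={weighted_f1:.2f}")
```

Under these inferred counts, most of the errors are class 1 rows predicted as class 0, which matches the low recall for class 1 in the report.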



In [31]:
vectorizer1 = TfidfVectorizer(
    stop_words=stopwords.words('english')
    )
vectorizer2 = TfidfVectorizer(
    stop_words=stopwords.words('english')
    )
vectorizer3 = TfidfVectorizer(
    stop_words=stopwords.words('english')
    )

col_transformer = ColumnTransformer(
    transformers= [
        ("tfidf_1", vectorizer1, "Extracted_Sents"),
        ("tfidf_2", vectorizer2, "Risks"),
        ("tfidf_3", vectorizer3, "Organizations")
        ]
    )

model = LogisticRegression(random_state=0, solver="liblinear", C=10)
clf_best = imb_Pipeline(
    steps=[
        ("prep", col_transformer),
        ("model", model)
        ]
    )

clf_best.fit(X_train, y_train)
y_pred = clf_best.predict(X_test)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy score: 76.0
Classification Report: 
               precision    recall  f1-score   support

           0       0.76      0.82      0.79        55
           1       0.76      0.69      0.72        45

    accuracy                           0.76       100
   macro avg       0.76      0.75      0.76       100
weighted avg       0.76      0.76      0.76       100



## Using word vectors

In [32]:
X_train.columns

Index(['title', 'link', 'Extracted_Sents', 'Risks', 'Organizations'], dtype='object')

In [41]:
nlp = spacy.load('en_core_web_lg')
stop_words = nlp.Defaults.stop_words

# Remove stopwords from train and test.
# Note: the three columns are concatenated without a separator, so the last
# word of one column is fused with the first word of the next.
train_texts = [' '.join([t for t in text.split() if(t.lower() not in stop_words)]) for text in X_train['Extracted_Sents'] + X_train['Risks'] +  X_train['Organizations']]
test_texts = [' '.join([t for t in text.split() if(t.lower() not in stop_words)]) for text in X_test['Extracted_Sents'] + X_test['Risks'] + X_test['Organizations']]

# Get dataframes with text converted to spaCy vectors
tr_df = pd.DataFrame([list(nlp(text).vector) for text in train_texts])
te_df = pd.DataFrame([list(nlp(text).vector) for text in test_texts])
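
spaCy documents `Doc.vector` as defaulting to an average of the token vectors, so each text above is reduced to a single fixed-length vector regardless of its length. A toy illustration of that pooling in plain Python follows. The two-dimensional embedding table is made up for the example, and treating unknown tokens as zero vectors is an assumption of this sketch rather than a statement about spaCy's internals:

```python
# Hypothetical 2-dimensional embeddings; real en_core_web_lg vectors are much larger.
toy_vectors = {
    "fraud": [1.0, 0.0],
    "fine": [0.0, 1.0],
}
DIM = 2


def mean_pool(tokens):
    """Average the token vectors into one document vector."""
    if not tokens:
        return [0.0] * DIM
    # Assumption of this sketch: tokens without an embedding count as zeros.
    vectors = [toy_vectors.get(token, [0.0] * DIM) for token in tokens]
    # zip(*vectors) groups the values dimension by dimension.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(mean_pool(["fraud", "fine"]))
print(mean_pool(["fraud", "unknownword"]))
```

Averaging can dilute the few informative tokens in a long sentence, which is one possible reason the word-vector models trail the TF-IDF models here.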

In [42]:
model = RandomForestClassifier(random_state=0)

clf = imb_Pipeline(
    steps=[
        ("model", model)
        ]
    )

clf.fit(tr_df, y_train)
y_pred = clf.predict(te_df)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred)) 

Accuracy score: 71.0
Classification Report: 
               precision    recall  f1-score   support

           0       0.71      0.80      0.75        55
           1       0.71      0.60      0.65        45

    accuracy                           0.71       100
   macro avg       0.71      0.70      0.70       100
weighted avg       0.71      0.71      0.71       100



In [43]:
model = LogisticRegression(random_state=0)

clf = imb_Pipeline(
    steps=[
        ("model", model)
        ]
    )

clf.fit(tr_df, y_train)
y_pred = clf.predict(te_df)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred)) 

Accuracy score: 67.0
Classification Report: 
               precision    recall  f1-score   support

           0       0.67      0.78      0.72        55
           1       0.67      0.53      0.59        45

    accuracy                           0.67       100
   macro avg       0.67      0.66      0.66       100
weighted avg       0.67      0.67      0.66       100



## Creating new features

In [66]:
training_set.columns

Index(['title', 'link', 'Extracted_Sents', 'Risks', 'Organizations',
       'Relationship_Tag'],
      dtype='object')

In [67]:
cleaned_df = pd.DataFrame(columns=["Text"])

In [68]:
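def get_text_btwn_substrings(test_str, sub1, sub2):
    """Return the text of test_str between the first occurrences of sub1 and sub2.

    The search is case-sensitive; str.find returns -1 when a substring is
    missing, in which case the slice below does not mark a real boundary.
    """
    # getting index of substrings
    idx1 = test_str.find(sub1)
    idx2 = test_str.find(sub2)

    # length of substring 1 is added to
    # get string from next character
    res = test_str[idx1 + len(sub1) + 1: idx2]
    return res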

In [69]:
for ext_text, sub1, sub2 in training_set[['Extracted_Sents', 'Risks', 'Organizations']].itertuples(index=False):
    cleaned_df.loc[len(cleaned_df.index)] = [get_text_btwn_substrings(ext_text, sub1, sub2)]

In [70]:
cleaned_df

Unnamed: 0,Text
0,"is more prevalent in developing countries, esp..."
1,stemming from the creation and coverup of poll...
2,
3,n June to wire fraud and violating the Clean A...
4,and forfeited money judgments after pleading g...
...,...
493,that the firm conspired with two separate but ...
494,which comes amid an epidemic of prescription ...
495,that FedEx illegally distributed controlled su...
496,distributed controlled substances and -- inclu...
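
Row 3 above starts mid-word ("n June"). That is what the slice produces when `str.find` does not locate the risk phrase: it returns -1, so the slice starts at offset `len(sub1)` instead of at the end of a real match. The `Risks` and `Organizations` values are lowercase while the sentences keep their original casing, so misses of this kind are likely. A hedged sketch of a more defensive helper follows. `text_between` is a hypothetical name, the example sentences are invented, and using it would change the extracted feature and therefore the results reported in this notebook:

```python
def text_between(text, first, second):
    """Return the text between two phrases, matched case-insensitively.

    Returns an empty string when either phrase is missing. Assumes that
    lowercasing does not change the string length (true for ASCII text).
    """
    lowered = text.lower()
    i = lowered.find(first.lower())
    j = lowered.find(second.lower())
    if i == -1 or j == -1:
        return ""
    # Order the two matches so the slice works whichever phrase comes first.
    (start_a, len_a), (start_b, _) = sorted([(i, len(first)), (j, len(second))])
    return text[start_a + len_a:start_b].strip()


# Invented example sentences, not rows from the dataset.
print(text_between("Acme Corp was fined for Wire Fraud last year.", "wire fraud", "acme corp"))
print(text_between("Acme Corp was fined last year.", "bribery", "acme corp"))
```

Returning an empty string for a miss makes those rows easy to count or drop before modelling, instead of passing arbitrary fragments to the vectorizer.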


In [72]:
cleaned_df["target"] = training_set.Relationship_Tag

In [73]:
cleaned_df

Unnamed: 0,Text,target
0,"is more prevalent in developing countries, esp...",0
1,stemming from the creation and coverup of poll...,1
2,,1
3,n June to wire fraud and violating the Clean A...,0
4,and forfeited money judgments after pleading g...,1
...,...,...
493,that the firm conspired with two separate but ...,1
494,which comes amid an epidemic of prescription ...,1
495,that FedEx illegally distributed controlled su...,1
496,distributed controlled substances and -- inclu...,1


## Modelling after generating new features

In [77]:
X = cleaned_df.drop('target', axis=1)
y = cleaned_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

In [78]:
X_train.head()

Unnamed: 0,Text
125,may not be able to do business with the U.S. g...
466,and deceive U.S. regulators
270,in the corruption also approved spending more ...
301,on U.S. emissions tests
18,to intentionally cheating on federal emissions...


In [79]:
X_test.columns

Index(['Text'], dtype='object')

### Baseline

In [81]:
vectorizer = TfidfVectorizer(
    stop_words=stopwords.words('english')
    )

col_transformer = ColumnTransformer(
    transformers= [
        ("tfidf_1", vectorizer, "Text"),
        ]
    )

model = RandomForestClassifier(random_state=0, n_estimators=100, n_jobs=-1)

clf_best = imb_Pipeline(
    steps=[
        ("prep", col_transformer),
        ("model", model)
        ]
    )

clf_best.fit(X_train, y_train)
y_pred = clf_best.predict(X_test)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy score: 60.0
Classification Report: 
               precision    recall  f1-score   support

           0       0.61      0.75      0.67        55
           1       0.58      0.42      0.49        45

    accuracy                           0.60       100
   macro avg       0.59      0.58      0.58       100
weighted avg       0.60      0.60      0.59       100



In [82]:
vectorizer = TfidfVectorizer(
    stop_words=stopwords.words('english')
    )

col_transformer = ColumnTransformer(
    transformers= [
        ("tfidf_1", vectorizer, "Text"),
        ]
    )

model = LogisticRegression(random_state=0, solver="liblinear", C=10)

clf_best = imb_Pipeline(
    steps=[
        ("prep", col_transformer),
        ("model", model)
        ]
    )

clf_best.fit(X_train, y_train)
y_pred = clf_best.predict(X_test)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy score: 61.0
Classification Report: 
               precision    recall  f1-score   support

           0       0.62      0.73      0.67        55
           1       0.58      0.47      0.52        45

    accuracy                           0.61       100
   macro avg       0.60      0.60      0.60       100
weighted avg       0.61      0.61      0.60       100



## Using word vectors

In [87]:
nlp = spacy.load('en_core_web_lg')
stop_words = nlp.Defaults.stop_words

# Remove Stopwords from train and test
f = 'Text'
train_texts = [' '.join([t for t in text.split() if(t.lower() not in stop_words)]) for text in X_train[f]]
test_texts = [' '.join([t for t in text.split() if(t.lower() not in stop_words)]) for text in X_test[f]]

# Get dataframes with text converted to spaCy vectors
tr_df = pd.DataFrame([list(nlp(text).vector) for text in train_texts])
te_df = pd.DataFrame([list(nlp(text).vector) for text in test_texts])

In [88]:
oversample = RandomOverSampler(random_state=0)

model = RandomForestClassifier(random_state=0)

clf = imb_Pipeline(
    steps=[
        ("oversampling", oversample),
        ("model", model)
        ]
    )

clf.fit(tr_df, y_train)
y_pred = clf.predict(te_df)

acc = round(accuracy_score(y_test, y_pred), 3)
print(f"Accuracy score: {acc*100}")
print("Classification Report: \n", classification_report(y_test, y_pred)) 

Accuracy score: 57.99999999999999
Classification Report: 
               precision    recall  f1-score   support

           0       0.60      0.69      0.64        55
           1       0.54      0.44      0.49        45

    accuracy                           0.58       100
   macro avg       0.57      0.57      0.57       100
weighted avg       0.57      0.58      0.57       100
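
The `57.99999999999999` printed above is a floating-point artifact of multiplying the rounded accuracy by 100, not a different score. A small sketch of two ways to print it cleanly, using the 0.58 accuracy from the report:

```python
acc = 0.58  # accuracy from the report above

# Multiplying by 100 exposes binary floating-point representation error.
print(acc * 100)

# Let the format specification do the scaling and rounding instead.
print(f"Accuracy score: {acc:.1%}")
print(f"Accuracy score: {acc * 100:.1f}")
```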


