# Preprocessing

## Imports

In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Load Data

In [27]:
train_non_adverse = pd.read_csv("data/nam.csv")
train_adverse = pd.read_csv("data/am.csv")
train_random_additional = pd.read_csv("data/random.csv")
am_additional = pd.read_csv("data/am_additional.csv")

Keep only rows with one of the three known labels (`nam`, `am`, `random`) and drop rows where `title` or `article` is missing.

In [43]:
train_concat = pd.concat([train_non_adverse, train_adverse, train_random_additional, am_additional], ignore_index=True)
train_filtered = train_concat[(train_concat.label == "nam") | (train_concat.label == "am") | (train_concat.label == "random")]
train_filtered = train_filtered.dropna(subset=['title', 'article'])
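
The same filter can be written more compactly with `Series.isin`. The sketch below runs on a small made-up frame standing in for `train_concat`; the names `toy`, `known` and `filtered` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for train_concat (values are made up for illustration).
toy = pd.DataFrame({
    "label": ["am", "nam", "random", None, "other"],
    "title": ["t1", "t2", None, "t4", "t5"],
    "article": ["a1", "a2", "a3", "a4", "a5"],
})

# Keep rows whose label is one of the three known values ...
known = toy[toy.label.isin(["nam", "am", "random"])]
# ... then drop rows with a missing title or article.
filtered = known.dropna(subset=["title", "article"])

print(filtered.index.tolist())
```

Note that the surviving rows keep their original index labels, which is also why the index of `train_filtered` has gaps.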

In [44]:
train_filtered.describe()

Unnamed: 0,source,entity_name,entity_type,explanation,label,url,article,full_response,assessor,title
count,1195,1082,1086,665,1594,1296,1594,1196,728,1594
unique,9,793,17,381,3,1296,1594,1196,10,1582
top,Sebastien,Donald Trump,individual,corruption,am,https://www.financialfraudsternews.com/en/comp...,The number of people convicted for criminal of...,[{'query': {'id': '1605047554739-2fe438585304f...,Carel,"Virginia airline founder charged with fraud, t..."
freq,284,10,345,33,801,1,1,1,220,2


In [45]:
train_filtered.head(3)

Unnamed: 0,source,entity_name,entity_type,explanation,label,url,article,full_response,assessor,title
0,Darya,Kevin Morais,individual,award for bravery towards corruption,nam,https://www.nst.com.my/news/nation/2020/02/564...,PUTRAJAYA: The late senior deputy public prose...,[{'query': {'id': '1605373858510-22ed9e7516161...,Shakshi,Kevin Morais named as recipient of Internation...
1,Carel,,,Bribery law,nam,https://www.gov.uk/government/publications/bri...,Details\n\nThe Bribery Act 2010 creates a new ...,[{'query': {'id': '1605369929260-65690a650021b...,Dan,Bribery Act 2010 guidance
2,Carel,,,Abstract of Insider Trading article,nam,https://www.jstor.org/stable/3666053?seq=1,Abstract\n\nTrading by corporate insiders has ...,[{'query': {'id': '1605369732277-abfa02170cbde...,Dan,For independent researchers


In [46]:
train_filtered.label.value_counts()

am        801
nam       397
random    396
Name: label, dtype: int64

There are 3 label types:

* `am` - adverse media.
* `nam` - non-adverse media: covers topics related to adverse media (tax evasion, corruption, ...) without accusing anyone, e.g. an article about how to stop corruption.
* `random` - an article that doesn't fall into either category described above.

## Functions

Here we add functions that provide different data sets according to our needs.

### Get full data (internal use)

Used by `getTrainData` function

In [48]:
def getData():
    train_data = pd.DataFrame([])
    train_data['text'] = train_filtered.article
    train_data['label'] = train_filtered.label.map(dict(am=1, nam=0, random=2))

    return train_data.copy()

### Get first n sentences (internal use)

Used by `getTrainData` function

In [49]:
def get_n_sentences(n):
    train_data_sentences = pd.DataFrame([])
    train_data_sentences['text'] = train_filtered.article.apply(lambda a: " ".join(nltk.sent_tokenize(a)[0:n]))
    train_data_sentences['label'] = train_filtered.label.map(dict(am=1, nam=0, random=2))
    
    return train_data_sentences.copy()
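
To illustrate what `get_n_sentences` does to each article, here is a sketch of truncating a text to its first n sentences. It uses a naive regular-expression splitter as a stand-in for `nltk.sent_tokenize` so that it runs without the punkt data; the helper name `first_n_sentences` and the sample text are hypothetical.

```python
import re

def first_n_sentences(text, n):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    # nltk.sent_tokenize handles abbreviations and other edge cases far better.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

article = "The court ruled on Monday. An appeal is expected. No date has been set."
print(first_n_sentences(article, 1))
print(first_n_sentences(article, 2))
print(repr(first_n_sentences(article, 0)))
```

With `n = 0` the slice is empty and the result is an empty string, which is why `getTrainData` treats `n_sentences=0` as "title only".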

### Full row by index (external use)

Useful if you want the full row with all columns, e.g. `url`.

This can be used when inspecting misclassified rows.

In [51]:
def getFullRowByIndex(idx):
    return train_filtered.loc[idx]
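
A possible way to use this lookup when inspecting misclassified rows is sketched below. It assumes predictions are held in a Series that shares the index of the training frame; the frame `full` and the Series `y_true` and `y_pred` are made-up stand-ins, not part of this notebook.

```python
import pandas as pd

# Toy stand-in for train_filtered (made-up rows, non-contiguous index).
full = pd.DataFrame(
    {"url": ["u0", "u1", "u2"], "label": ["nam", "am", "am"]},
    index=[0, 5, 9],
)

# Hypothetical true labels and model predictions, indexed like the full frame.
y_true = pd.Series([0, 1, 1], index=full.index)
y_pred = pd.Series([0, 0, 1], index=full.index)

# Index labels of the rows the model got wrong.
wrong_idx = y_true.index[y_true != y_pred]

# Same lookup as getFullRowByIndex: .loc selects by index label, not position.
misclassified = full.loc[wrong_idx]
print(misclassified)
```

Because `.loc` works on index labels, this only gives the right rows if the index survives any shuffling or splitting done before prediction.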

### Get train data (external use)

Returns a data set according to the function arguments.
Joins title and article into one column.

Transforms `label` to an integer (`nam` = 0, `am` = 1, `random` = 2).

Arguments:
* `include_random` - include rows labeled `random` in the data set
* `random_as_2` - return `random` labels as 2 (by default they are returned as 0)
* `shuffle` - shuffle the data set
* `no_title` - do not include the title
* `n_sentences` - return only the first n sentences of the article (-1 returns all of them, 0 returns only the title)

Returns:
* pandas data frame with 2 columns: `text`, `label`.

In [50]:
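def getTrainData(include_random=False, random_as_2=False, shuffle=False, no_title=False, n_sentences=-1):
    # n_sentences >= 0 keeps only the first n sentences; a negative value keeps the full article
    if n_sentences >= 0:
        td = get_n_sentences(n_sentences)
    else:
        td = getData()

    if not no_title:
        if n_sentences == 0:
            # No article sentences requested: the text is the title alone
            td['text'] = train_filtered.title
        else:
            # Prepend the title, separated from the body by ".\n"
            td['text'] = pd.DataFrame({ 'title': train_filtered.title, 'article': td.text }).agg('.\n'.join, axis=1)

    if not include_random:
        # Copy so the label assignment below never writes to a slice of another frame
        td = td.loc[td['label'] != 2].copy()

    if not random_as_2:
        # Fold the random class into the negative class (label 0)
        td.loc[td['label'] == 2, 'label'] = 0

    if shuffle:
        td = td.sample(frac=1)

    return td.copy()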
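
The interaction between `include_random` and `random_as_2` is easiest to see on a toy frame. The sketch below mirrors only the label logic of `getTrainData`; the helper `handle_random` and the frame `toy` are hypothetical and exist just for this illustration.

```python
import pandas as pd

# Toy frame with the integer labels used above: nam=0, am=1, random=2.
toy = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [1, 0, 2, 2]})

def handle_random(td, include_random=False, random_as_2=False):
    # Hypothetical helper mirroring only the label logic of getTrainData.
    td = td.copy()
    if not include_random:
        # Drop the random rows entirely.
        td = td.loc[td["label"] != 2].copy()
    if not random_as_2:
        # Any remaining random rows are relabeled as 0.
        td.loc[td["label"] == 2, "label"] = 0
    return td

print(handle_random(toy)["label"].tolist())
print(handle_random(toy, include_random=True)["label"].tolist())
print(handle_random(toy, include_random=True, random_as_2=True)["label"].tolist())
```

Setting `random_as_2=True` has no visible effect unless `include_random=True` as well, since the random rows are otherwise dropped first.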

#### Examples of `getTrainData` function

##### Get all data

In [64]:
getTrainData(include_random=True)

Unnamed: 0,text,label
0,Kevin Morais named as recipient of Internation...,0
1,Bribery Act 2010 guidance.\nDetails\n\nThe Bri...,0
2,For independent researchers.\nAbstract\n\nTrad...,0
3,Global FinTech Company Implements Automated & ...,0
4,Pope Francis commits to clean finances amid sc...,0
...,...,...
1692,U.S indicts Venezuelan in kickback scheme link...,1
1693,Spokane health clinic owner charged with $5 mi...,1
1694,FirstEnergy credit rating downgraded to “junk”...,1
1695,Former Tangipahoa Parish Sheriff’s Office empl...,1
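
The `.\n` between title and body in the output above comes from the row-wise join in `getTrainData`. A minimal sketch of that join on a made-up row (the strings are illustrative, not taken from the data set):

```python
import pandas as pd

# Made-up title and body, only to show the join format.
parts = pd.DataFrame({
    "title": ["Example title"],
    "article": ["First sentence. Second sentence."],
})

# Row-wise join of the two string columns with ".\n" as separator.
text = parts.agg(".\n".join, axis=1)
print(repr(text[0]))
```

The join fails on missing values, which is one reason rows without a `title` or `article` are dropped during loading.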


##### Get just title

In [59]:
getTrainData(include_random=True, n_sentences=0)

Unnamed: 0,text,label
0,Kevin Morais named as recipient of Internation...,0
1,Bribery Act 2010 guidance,0
2,For independent researchers,0
3,Global FinTech Company Implements Automated & ...,0
4,Pope Francis commits to clean finances amid sc...,0
...,...,...
1692,U.S indicts Venezuelan in kickback scheme link...,1
1693,Spokane health clinic owner charged with $5 mi...,1
1694,FirstEnergy credit rating downgraded to “junk”...,1
1695,Former Tangipahoa Parish Sheriff’s Office empl...,1


##### Get just article body

In [60]:
getTrainData(include_random=True, no_title=True)

Unnamed: 0,text,label
0,PUTRAJAYA: The late senior deputy public prose...,0
1,Details\n\nThe Bribery Act 2010 creates a new ...,0
2,Abstract\n\nTrading by corporate insiders has ...,0
3,As FinTech organizations grow and expand their...,0
4,Pope Francis told European anti-money launderi...,0
...,...,...
1692,A dual Venezuelan-Italian citizen who controll...,1
1693,"The owner of a health clinic based in Spokane,...",1
1694,FirstEnergy‘s credit rating has been downgrade...,1
1695,A Tangipahoa Parish Sheriff’s Office employee ...,1


##### Get title + first sentence

In [61]:
getTrainData(include_random=True, n_sentences=1)

Unnamed: 0,text,label
0,Kevin Morais named as recipient of Internation...,0
1,Bribery Act 2010 guidance.\nDetails\n\nThe Bri...,0
2,For independent researchers.\nAbstract\n\nTrad...,0
3,Global FinTech Company Implements Automated & ...,0
4,Pope Francis commits to clean finances amid sc...,0
...,...,...
1692,U.S indicts Venezuelan in kickback scheme link...,1
1693,Spokane health clinic owner charged with $5 mi...,1
1694,FirstEnergy credit rating downgraded to “junk”...,1
1695,Former Tangipahoa Parish Sheriff’s Office empl...,1
