## 1. Sources 

1. Tutorial: Text Classification in Python Using spaCy: https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/
2. A Comprehensive Guide to Understand and Implement Text Classification in Python: https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/



In [8]:
# Import necessary packages
import os
import glob

import numpy as np
import pandas as pd

from sklearn.svm import SVC

In [10]:
# libraries for dataset preparation, feature engineering, model training

from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import xgboost, textblob, string
#from tensorflow import keras
#from keras.preprocessing import text, sequence
#from keras import layers, models, optimizers

#end goal:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

## 1. Dataset preparation

The first step is dataset preparation: loading the dataset and performing basic pre-processing. The dataset is then split into training and validation sets.

In [11]:
#Load the training dataset 
events = pd.read_csv('../events/group_2_labelled.csv')
events = events.loc[events['Near Miss Event'].notna(), ]

# event_mapper = {val: enum for enum, val in zip(*events.event_id.factorize())}
# file_mapper = {val: enum for enum, val in zip(*events.filename.factorize())}

events.head()



Unnamed: 0.1,Unnamed: 0,event_id,filename,sentence_idx,sentence_text,n_trigger_words,trigger_words_in_sentence,trigger_words_in_event,event_text,ORE_DEPOSIT,ROCK,MINERAL,STRAT,LOCATION,TIMESCALE,event_label,reviewed,Near Miss Event,Key trigger phrase
0,0,a081752_anrep2008eraheedy2103_15107355_16,a081752_anrep2008eraheedy2103_15107355.json,16,mineral occurrences and exploration potential ...,1,['potential'],['potential'],"bibliography bunting ja 1986, geology of the e...",[],['granite'],[],[],"['nabberu basin', 'western australia', 'wester...",[],0,True,False,
1,1,a075210_buck_a_ el12_1_2007_11292066_235,a075210_buck_a_ el12_1_2007_11292066.json,235,further drilling in coming years will further ...,1,['further drilling'],['further drilling'],a summary of the coal tonnages within el12 1 i...,[],['coal'],[],[],[],[],0,True,False,
2,2,a075210_buck_a_ el12_1_2007_11292066_246,a075210_buck_a_ el12_1_2007_11292066.json,246,the tenement was applied for on the 12 1 2005 ...,1,['possible'],['possible'],"keywords: ac drilling, diamond core drilling, ...",[],['ash'],"['diamond', 'sulphur']",[],"['muja', 'collie', 'ewington', 'collie']",[],0,True,False,
3,3,a080379_e80_2574_08atr_12876104_4,a080379_e80_2574_08atr_12876104.json,4,the east kimberley halls creek orogen is widel...,2,"['potential', 'mineralisation']","['potential', 'mineralisation', 'broad']",if this work is positive drill testing of anom...,['pge'],[],"['gold', 'sulphide']",[],"['kimberley', 'halls creek orogen', 'australia']",[],0,True,False,
4,4,a080379_e80_2574_08atr_12876104_10,a080379_e80_2574_08atr_12876104.json,10,this belt contains the portimo and penikat int...,2,"['mineralisation', 'potential']","['mineralisation', 'potential', 'mineralisatio...",the hco has a number of similarities to the to...,"['pge', 'pge']",[],[],[],[],[],0,True,False,


In [12]:
events.loc[0, 'event_text'] #just to check 

'bibliography bunting ja 1986, geology of the eastern part of the nabberu basin western australia. west australian geological survey bulletin 131 130p geological survey of western australia 2005. mineral occurrences and exploration potential of the eraheedy area western australia. location: the granite peak project is located about 150km north of wiluna.'

In [13]:
#Create the training data with event text and labels 
X = events['event_text']
ylabels = events['Near Miss Event'].astype(int)


In [14]:
ylabels[0] #just to check 

0

In [15]:

# split the dataset into training and validation datasets. 
# The test set will be the remaining data files that have not been labelled 
X_train, X_valid, y_train, y_valid = train_test_split(X, ylabels, test_size=0.3)
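
With a small labelled set, a purely random split can shift the class ratio between the training and validation sets and will differ on every run. The sketch below shows one possible variation using the `stratify` and `random_state` arguments of `train_test_split`. The toy DataFrame is made up for illustration and only borrows the column names used above; it is not the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the labelled events (hypothetical data, not the real file)
toy = pd.DataFrame({
    "event_text": [f"event text {i}" for i in range(50)],
    "Near Miss Event": [0] * 30 + [1] * 20,
})

X_toy = toy["event_text"]
y_toy = toy["Near Miss Event"].astype(int)

# stratify keeps the class ratio roughly equal in both splits;
# random_state makes the split repeatable between runs
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print(y_tr.value_counts().to_dict())
print(y_va.value_counts().to_dict())
```

Whether stratification is wanted here is a judgment call; the split in the cell above does not use it.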

## 2. Feature Engineering 

The next step is the feature engineering step. In this step, raw text data will be transformed into feature vectors and new features will be created using the existing dataset. We will implement the following different ideas in order to obtain relevant features from our dataset.


* 2.1 Count Vectors as features
* 2.2 TF-IDF Vectors as features
    * Word level
    * N-Gram level
    * Character level
* 2.3 Word Embeddings as features
* 2.4 Text / NLP based features
* 2.5 Topic Models as features
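
As a rough illustration of the three TF-IDF levels listed above, the following sketch builds one `TfidfVectorizer` per level. The three sentences are invented for the example, and the n-gram ranges are assumptions rather than settings taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, for illustration only
docs = [
    "further drilling is planned",
    "no further work is planned",
    "drilling intersected gold mineralisation",
]

# Word level: one feature per word
tfidf_word = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}")
# N-gram level: sequences of two or three consecutive words
tfidf_ngram = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", ngram_range=(2, 3))
# Character level: sequences of two or three consecutive characters
tfidf_char = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))

shapes = {}
for name, vec in [("word", tfidf_word), ("ngram", tfidf_ngram), ("char", tfidf_char)]:
    matrix = vec.fit_transform(docs)
    shapes[name] = matrix.shape
    print(name, matrix.shape)
```

Each matrix has one row per document; only the number of columns changes with the level.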


## 2.1 Count Vectors as features
Count Vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.
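
To make the matrix layout concrete, here is a minimal sketch on two invented documents. It uses the default `CountVectorizer` tokenizer, not the spaCy tokenizer defined below.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only
docs = ["gold gold potential", "coal potential"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

# Recover the column order from the fitted vocabulary (term -> column index)
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

table = pd.DataFrame(counts.toarray(), columns=terms, index=["doc_0", "doc_1"])
print(table)
```

The cell for `doc_0` and `gold` holds 2 because that term appears twice in the first document.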

In [18]:
import string

import spacy
from spacy import displacy

from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin

# Create our list of punctuation marks
punctuations = string.punctuation

# Load the large English model (tagger, parser, NER and word vectors)
nlp = spacy.load('en_core_web_lg')

# Create our list of stopwords
stop_words = STOP_WORDS

# Blank English pipeline, used here only as a tokenizer
parser = English()

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

In [26]:
#create a count vectorizer # From Daniel # Could we use utilscharlie???

count_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
#tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [28]:
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
# X_train is already a Series of event texts, so fit on it directly
count_vect.fit(X_train)

# transform the training and validation data using count vectorizer object
xtrain_count =  count_vect.transform(X_train)
xvalid_count =  count_vect.transform(X_valid)

## 3. Model Building
The final step in the text classification framework is to train a classifier using the features created in the previous step. Many machine learning models can be used for this. We will implement the following classifiers:

1. Linear Classifier 
2. Naive Bayes Classifier
3. Support Vector Machine
4. Bagging Models
5. Boosting Models
6. Shallow Neural Networks
7. Deep Neural Networks
    * Convolutional Neural Network (CNN)
    * Long Short-Term Memory (LSTM)
    * Gated Recurrent Unit (GRU)
    * Bidirectional RNN
    * Recurrent Convolutional Neural Network (RCNN)
    * Other Variants of Deep Neural Networks
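
Several of the simpler classifiers in this list share the same scikit-learn interface, so they can be swapped into one pipeline. The sketch below tries three of them on count vectors. The texts, labels, and the `candidates` dictionary are hypothetical and exist only to show the pattern; they say nothing about how these models perform on the real events.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training examples (1 = near miss, 0 = not), for illustration only
texts = [
    "further drilling is recommended to test the anomaly",
    "mineralisation remains open at depth",
    "potential for gold mineralisation along strike",
    "anomalous results warrant follow up drilling",
    "the tenement was surrendered with no further work",
    "bibliography and list of references",
    "the project is located north of the town",
    "no significant results were returned",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical set of candidate models sharing the fit/predict interface
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

predictions = {}
for name, clf in candidates.items():
    # Same vectorizer step each time; only the final estimator changes
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(texts, labels)
    predictions[name] = int(model.predict(["further drilling is planned"])[0])
    print(name, predictions[name])
```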

## 3.1. Linear Classifier (Logistic Regression Model)
### 3.1.1. Linear Classifier on Count Vectors


In [20]:

# Logistic Regression Classifier
classifier = LogisticRegression()

# Create pipeline using bag of words(?)
pipe = Pipeline([('vectorizer', count_vector), ('classifier', classifier)])

# model generation
pipe.fit(X_train, y_train)

Pipeline(steps=[('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x0000022D3F645EE0>)),
                ('classifier', LogisticRegression())])

In [22]:
print(f'Class balance for train set:\n{y_train.value_counts()}\n')
print(f'Class balance for validation set:\n{y_valid.value_counts()}')

Class balance for train set:
0    76
1    64
Name: Near Miss Event, dtype: int64

Class balance for validation set:
1    31
0    30
Name: Near Miss Event, dtype: int64


In [24]:

# Predicting on the validation set
predicted = pipe.predict(X_valid)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_valid, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_valid, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_valid, predicted))

Logistic Regression Accuracy: 0.7377049180327869
Logistic Regression Precision: 0.8947368421052632
Logistic Regression Recall: 0.5483870967741935
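
The three scores can be tied back to confusion-matrix counts. The sketch below takes the validation class counts (31 positive, 30 negative) from the class-balance output above. The true-positive and false-positive counts are not printed by the notebook; they are inferred here as the values consistent with the reported precision and recall, so treat them as an assumption.

```python
# Validation-set class counts, taken from the class-balance output above
n_pos, n_neg = 31, 30

# Inferred counts (assumption): chosen to be consistent with the reported scores
tp, fp = 17, 2
fn = n_pos - tp  # positives the model missed
tn = n_neg - fp  # negatives the model got right

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted near misses, how many are real
recall = tp / (tp + fn)     # of the real near misses, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# prints: accuracy=0.7377 precision=0.8947 recall=0.5484
```

Under these counts the model rarely raises a false alarm (2 of 19 predictions) but misses 14 of the 31 near-miss events, which is why precision is high and recall is low.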
