# Baseline classifiers

Pipeline for predicting classes of dataset references

Author: Jinseok Kim

Editor: Sara Lafia (adapting for competition)

04/23/2021

< To Do >
- [x] drop datasets with fewer than 5 citances
- [x] use citances (sentences containing citations) rather than labels for training/testing 
- [ ] add dummy model for comparison
- [ ] add pipeline, gridsearch function for comparing and selecting best models
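
The two open items above could be approached along the following lines. This is a minimal sketch on toy data, not the notebook's implementation: the sentences and labels are invented for illustration, the pipeline steps, parameter grid, and scoring metric are assumptions, and `MultinomialNB` only stands in for whichever classifier is eventually chosen.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy citances (hypothetical): 1 = mentions a dataset, 0 = does not
texts = [
    "we use data from the national education survey",
    "data come from the agricultural resource management survey",
    "the sample is drawn from the census survey data",
    "survey data were collected from a national sample",
    "results are discussed in the next section",
    "this paper is organized as follows",
    "we thank the reviewers for their comments",
    "future research could address this limitation",
    "table two reports the estimates",
    "the model converges after a few iterations",
    "see the appendix for details",
    "the authors declare no conflict of interest",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Dummy baseline: always predicts the most frequent class
dummy = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", DummyClassifier(strategy="most_frequent")),
])
dummy_scores = cross_val_score(dummy, texts, labels, cv=2, scoring="balanced_accuracy")

# Candidate model: vectorizer, TF-IDF weighting, and classifier in one pipeline
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Assumed grid; step names prefix the parameter names
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}
search = GridSearchCV(pipe, param_grid, cv=2, scoring="balanced_accuracy")
search.fit(texts, labels)

print("Dummy balanced accuracy:", dummy_scores.mean())
print("Best parameters:", search.best_params_)
print("Best balanced accuracy:", search.best_score_)
```

Balanced accuracy is used here because plain accuracy rewards a majority-class guesser on imbalanced data; a model is only worth keeping if it clearly beats the dummy score on the same folds.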

< NOTE >
1. Procedure: import, preprocessing, training, predicting (six classifier options in `fit_model`), model evaluation
2. Probability prediction output: each prediction produces a probability score between 0 and 1  
3. Data splitting before training: 80% training and 20% test data   
4. Class imbalance: apply a sampling method (undersampling or oversampling)  
   Install the 'imbalanced-learn' library by typing the following in a command line interface (cmd)  
   A. Check that pip is installed: pip --version or pip3 --version  
   B. Install the package: pip install imbalanced-learn or pip3 install imbalanced-learn  
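
The splitting and sampling notes above combine into one practical point: the split comes first, and only the training portion is resampled, so the test set keeps the original class balance. A minimal sketch on a toy frame follows. The frame and its contents are invented, and the oversampling is written by hand with pandas purely to show the idea; the notebook itself relies on `RandomOverSampler` from imbalanced-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 positive and 90 negative sentences
toy = pd.DataFrame({
    "sent": ["sentence {}".format(i) for i in range(100)],
    "isMatch": [1] * 10 + [0] * 90,
})

# Stratified 80/20 split keeps the 1:9 class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    toy[["sent"]], toy["isMatch"],
    test_size=0.2, stratify=toy["isMatch"], random_state=42,
)

# Hand-rolled random oversampling of the training part only:
# duplicate minority rows (with replacement) until every class
# has as many rows as the largest class
train = X_train.assign(isMatch=y_train)
n_major = train["isMatch"].value_counts().max()
parts = []
for _, grp in train.groupby("isMatch"):
    parts.append(grp)
    if len(grp) < n_major:
        extra = grp.sample(n=n_major - len(grp), replace=True, random_state=42)
        parts.append(extra)
balanced = pd.concat(parts, ignore_index=True)

print(y_test.value_counts())               # test set is left untouched
print(balanced["isMatch"].value_counts())  # training classes are now equal in size
```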

In [3]:
import glob
import nltk
import re
import json
import pandas as pd

from nltk.tokenize import sent_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
def clean_text(txt):
    """
    Convert to lowercase, remove special characters, and punctuation.
    """
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

In [1]:
path = '../data/'
DATA_DIR = '/nfs/turbo/hrg/coleridge/'

## Data
Load tokenized training sentences generated in `prep-sentences` notebook

In [25]:
df = pd.read_csv(DATA_DIR+'train_sentences_all.csv', index_col=0)
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1116344 entries, 195 to 6788807
Data columns (total 18 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   Id             1116344 non-null  object
 1   dataset_label  1116344 non-null  object
 2   section_title  1116344 non-null  object
 3   sent           1116344 non-null  object
 4   match          1116344 non-null  bool  
 5   sent_clean     1116344 non-null  object
 6   clean_section  1116344 non-null  object
 7   indicator      422515 non-null   object
 8   hasData        1116344 non-null  int64 
 9   hasEdu         1116344 non-null  int64 
 10  hasSample      1116344 non-null  int64 
 11  hasNational    1116344 non-null  int64 
 12  hasSurvey      1116344 non-null  int64 
 13  inIntro        1116344 non-null  int64 
 14  inMethod       1116344 non-null  int64 
 15  inLimit        1116344 non-null  int64 
 16  inConcl        1116344 non-null  int64 
 17  inAckno        1116344 no

In [26]:
df['isMatch'] = df['match'].astype('category').cat.codes

df = df.drop(columns=['dataset_label', 
                      'section_title', 
                      'match',
                      'sent_clean', 
                      'clean_section', 
                      'indicator'])

df['isMatch'].value_counts()

0    1101069
1      15275
Name: isMatch, dtype: int64

In [27]:
df.head(5)

Unnamed: 0,Id,sent,hasData,hasEdu,hasSample,hasNational,hasSurvey,inIntro,inMethod,inLimit,inConcl,inAckno,isMatch
195,2f26f645-3dec-485d-b68d-f013c9e05e60,"Overall, the results presented in this Issue B...",0,0,0,0,0,0,0,0,0,0,0
196,2f26f645-3dec-485d-b68d-f013c9e05e60,These estimates offer an early look at the pos...,0,0,0,0,0,0,0,0,0,0,0
197,2f26f645-3dec-485d-b68d-f013c9e05e60,Some dropouts who were enrolled in a postsecon...,0,1,0,0,0,0,0,0,0,0,0
198,2f26f645-3dec-485d-b68d-f013c9e05e60,Future research could use NELS respondents' po...,0,0,0,0,0,0,0,0,0,0,0
199,2f26f645-3dec-485d-b68d-f013c9e05e60,Table 2 .,0,0,0,0,0,0,0,0,0,0,0


## Functions

In [28]:
def load_data(filename):
    """
    Read in input file and load data
    
    filename: csv file
    
    """
    
    # Load input file
    df = pd.read_csv(filename, encoding='utf-8')  # if this raises encoding errors, try encoding='iso-8859-1'

    print("Total No of Rows: ", df.shape[0])
    print("Total No of Columns: ", df.shape[1])
    
    print('\nTraining & Test Size(row, colum):')
    df.iloc[:, -1].value_counts()

    # Split data into training and test (Test size: 0.2, stratify turned on)
    X, y = df.iloc[:, :-1], df.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    print('X_train: {}\nX_test: {}\ny_train: {}\ny_test: {}'.format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))
    
    return X_train, X_test, y_train, y_test

In [29]:
def sample_data(X_train, y_train, sampling=0, sample_method='over'):
    
    """
       Resample training data to address class imbalance
       
       X_train: X train data
       y_train: y train data
       sampling: indicator of sampling on (1) or off (0)
       sample_method: 'over' (oversampling) or 'under' (undersampling)
       
    """
    
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    
    if sampling:
        # select a sampling method
        if sample_method == 'over':
            oversample = RandomOverSampler(random_state=42)
            X_over, y_over = oversample.fit_resample(X_train, y_train)
            print('\nOversampled Data (class, Rows):\n{}'.format(y_over.value_counts()))
            X_train_sam, y_train_sam = X_over, y_over
            
        elif sample_method == 'under':
            undersample = RandomUnderSampler(random_state=42)
            X_under, y_under = undersample.fit_resample(X_train, y_train)
            print('\nUndersampled Data (class,Rows):\n{}'.format(y_under.value_counts()))
            X_train_sam, y_train_sam = X_under, y_under
            
        else:
            # fail early instead of hitting an undefined variable at return
            raise ValueError("sample_method must be 'over' or 'under', got {!r}".format(sample_method))
    else:
        X_train_sam, y_train_sam = X_train, y_train      
        print('\nNo Sampling Performed\n')
    
    return X_train_sam, y_train_sam

In [13]:
def preprocess_data(X_data_raw):
    
    """
       Preprocess data with lowercase conversion, punctuation removal, stop-word removal, tokenization, stemming, and joining
       
       X_data_raw: X data in dataframe
       
    """
    # select the second column of X (expected to hold the sentence text)
    X_data = X_data_raw.iloc[:, 1]
    
    # 1. convert all characters to lowercase
    X_data = X_data.map(lambda x: str(x).lower())
     
    # 2. replace non-alphabetical characters with spaces
    X_data = X_data.str.replace("[^a-zA-Z]", " ", regex=True)
    
    # 3. remove English stopwords (requires the NLTK 'stopwords' resource)
    stop_english = set(stopwords.words('english'))
    X_data = X_data.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_english]))
    
    # optional: remove custom words
    #words_to_remove = {'type', 'here', 'words'} # insert the words you want to remove
    #X_data = X_data.apply(lambda x: ' '.join([word for word in x.split() if word not in words_to_remove]))

    # 4. drop single-character words
    X_data = X_data.apply(lambda x: ' '.join([word for word in x.split() if len(word) > 1]))
        
    # 5. word tokenize (requires the NLTK tokenizer resource)
    X_data = X_data.apply(nltk.word_tokenize)
    
    # 6. stemming
    stemmer = PorterStemmer()
    X_data = X_data.apply(lambda x: [stemmer.stem(y) for y in x])
    
    # ngram
    #X_data = X_data.apply(lambda x: list(nltk.ngrams(x, 2))) # you can modify n by replacing 2 with other integers
    #X_data = X_data.apply(lambda x: [''.join(i) for i in x])
    #print (X_data.iloc[0, 1])
    
    # join by spaces
    X_data = X_data.apply(lambda x: ' '.join(x))
       
    return X_data

In [8]:
def fit_model(X_train, y_train, model='DT'):
    
    """
      Model fitting with options of classifiers:
      decision tree, svm, knn, naive bayes, random forest, and gradient boosting
      
      X_train: X train data
      y_train: y train data
      model: name of classifier
      
    """
    
    if model=='DT':
        DT = DecisionTreeClassifier(max_depth=2)
        model = DT.fit(X_train, y_train)
    elif model=='SVM':
        SVM = SVC(kernel='linear', probability=True)  
        model = SVM.fit(X_train, y_train)
    elif model=='KNN':
        KNN = KNeighborsClassifier(n_neighbors=7)  
        model = KNN.fit(X_train, y_train)
    elif model=='NB':
        NB = MultinomialNB()
        model = NB.fit(X_train, y_train)
    elif model=='RF':
        RF = RandomForestClassifier(max_depth=2, random_state=0)
        model = RF.fit(X_train, y_train)
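    elif model=='GB':
        GB = GradientBoostingClassifier()
        model = GB.fit(X_train, y_train)
    else:
        # without this branch an unknown name would be returned as a plain string
        raise ValueError("Unknown model name: {!r}".format(model))
    
    return model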

In [9]:
def evaluate_model(y_test, y_pred, eval_model=0):
    
    """
      evaluate model performance
      
      y_test: y test data
      y_pred: predicted y values
      eval_model: indicator of whether this function is on or off
      
    """
    
    if eval_model:
        print('\nConfusion Matrix:\n')
        print(confusion_matrix(y_test, y_pred))
    
        print('\nClassification Report:\n')
        print(classification_report(y_test, y_pred))

In [10]:
def predict_proba(model, X_test_trans, X_test, y_test, y_pred, proba_file):
    
    """
       Predict probability of each class
       
       model: trained model with selected classifier
       X_test_trans: X test data preprocessed
       X_test: original X test data
       y_test: original y test data
       y_pred: predicted y values
       proba_file: output file of probability scores
       
    """
    
    ## Compute probability
    y_prob = model.predict_proba(X_test_trans)
    df_prob = pd.DataFrame(data=y_prob, columns=model.classes_)
    result = pd.concat([X_test.reset_index(drop=True), df_prob], axis=1, ignore_index=False)

    ## Add predicted class to output
    result['pred'] = pd.Series(y_pred)

    ## Add actual class to output 
    y_test = y_test.reset_index(drop=True)
    result['act'] = y_test

    ## Save output
    result.to_csv(proba_file, encoding='utf-8-sig', index=False, header=True)

In [46]:
def main(input_file, sample_on, sample_type, model_method, eval_on, proba_file):
    
    """
       Main function for processing data, model fitting, and prediction
       
       input_file: input file (split into training and test data by load_data)
       sample_on: indicator of sampling on or off
       sample_type: sample type to choose if sample_on is 1
       model_method: name of classifier to be applied for model fitting
       eval_on: indicator of model evaluation on or off
       proba_file: name of output file of probability
       
    """
    
    ## 1. Load data
    X_train, X_test, y_train, y_test = load_data(input_file)
    
    print('\nOriginal Data (class, rows):\n{}'.format(y_train.value_counts()))
    
    ## 2. Sampling 
    X_train_samp, y_train_samp = sample_data(X_train, y_train, sampling=sample_on, sample_method=sample_type)
    
    ## 3. Preprocessing 
    X_train_pro = preprocess_data(X_train_samp)
    
    count_vect = CountVectorizer()
    counts = count_vect.fit_transform(X_train_pro)
    transformer = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts)
    X_train_transformed = transformer.transform(counts)
    
    X_train_trans = X_train_transformed
    y_train_trans = y_train_samp

    ## 4. Model Fitting
    model = fit_model(X_train_trans, y_train_trans, model=model_method)
    
    ## 5. Prediction
    # Transform X_test data
    X_test_pro = preprocess_data(X_test)
    counts_test = count_vect.transform(X_test_pro)
    X_test_trans = transformer.transform(counts_test)
    
    # Predict output
    y_pred = model.predict(X_test_trans)
    
    ## 6. Evaluating model performance
    evaluate_model(y_test, y_pred, eval_model=eval_on)
    
    ## 7. Probability prediction    
    predict_proba(model, X_test_trans, X_test, y_test, y_pred, proba_file=proba_file)
    print("\nOutput file:'" + proba_file + "' Created")

In [None]:
if __name__ == "__main__":
    
    ## Define parameter values
    # Filename of input dataset
    input_file=path+"training_labels.csv" #input_file="dbpedia.csv" 
    
    sampling_on=1             # 0 for no sampling; 1 for sampling
    sampling_type='over'      # Use when sampling_on=1; 'over'(oversampling), 'under'(undersampling)

    model_type='SVM'          # 'DT'(Decision Tree); 'SVM'(SVM); 'KNN'(KNeighbors); 'NB'(Naive Bayes);
                              # 'RF'(Random Forest); 'GB'(Gradient Boosting)
    
    eval_on=1                 # 0 for no; 1 for yes (display confusion matrix/classification report)
    
    output_file = path + "output/proba_" + model_type + ".csv"  # Filename for probability output (path already ends with '/')
    
    
    ## Main function
    main(input_file=input_file, 
         sample_on=sampling_on, 
         sample_type=sampling_type, 
         model_method=model_type, 
         eval_on=eval_on, 
         proba_file=output_file)

Total No of Rows:  42990
Total No of Columns:  4

Training & Test Size(row, colum):
X_train: (34392, 3)
X_test: (8598, 3)
y_train: (34392,)
y_test: (8598,)

Original Data (class, rows):
2     27281
4      1271
10      598
31      506
19      506
11      434
7       406
17      359
14      358
5       351
24      348
1       306
26      298
3       296
28      248
9       119
27      110
32       93
6        69
20       68
8        65
21       55
23       50
12       36
30       33
22       29
16       28
15       20
18       18
25       18
29       10
13        5
Name: class, dtype: int64

Oversampled Data (class, Rows):
32    27281
31    27281
2     27281
3     27281
4     27281
5     27281
6     27281
7     27281
8     27281
9     27281
10    27281
11    27281
12    27281
13    27281
14    27281
15    27281
16    27281
17    27281
18    27281
19    27281
20    27281
21    27281
22    27281
23    27281
24    27281
25    27281
26    27281
27    27281
28    27281
29    27281
30    27281

## Error analysis

To interpret errors, inspect the records from a given class.

In [41]:
df.loc[df['class'] == 1]

Unnamed: 0,id,text,class
75392,fe9543b3-7b2f-4820-8bef-63234813d5ec,Data from the USDA Agricultural Resource Manag...,1
75393,fe9543b3-7b2f-4820-8bef-63234813d5ec,Data from the USDA Agricultural Resource Manag...,1
75418,fe9543b3-7b2f-4820-8bef-63234813d5ec,"Furthermore, analysis of the data from the USD...",1
75419,fe9543b3-7b2f-4820-8bef-63234813d5ec,"Furthermore, analysis of the data from the USD...",1
75434,fe9543b3-7b2f-4820-8bef-63234813d5ec,"Furthermore, analysis of the data from the USD...",1
...,...,...,...
679261,f6afe4c3-fce7-442c-bf33-a1507fad5a50,The data are cross-sectional and come from the...,1
679262,f6afe4c3-fce7-442c-bf33-a1507fad5a50,The data are cross-sectional and come from the...,1
679389,f6afe4c3-fce7-442c-bf33-a1507fad5a50,I use data from the USDA's Agricultural Resour...,1
679390,f6afe4c3-fce7-442c-bf33-a1507fad5a50,I use data from the USDA's Agricultural Resour...,1
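
Beyond browsing one class, the probability output written by `predict_proba` carries both the predicted (`pred`) and actual (`act`) class, so misclassified rows can be pulled out directly. The sketch below uses a small invented frame in place of the saved CSV; the `text` values and class numbers are hypothetical, and only the `pred` and `act` column names come from the function above.

```python
import pandas as pd

# Hypothetical stand-in for the CSV written by predict_proba()
result = pd.DataFrame({
    "text": [
        "Data from the USDA survey are used",
        "We thank the reviewers",
        "The sample includes farm households",
        "Table 2 .",
    ],
    "pred": [1, 2, 2, 2],
    "act": [1, 2, 1, 2],
})

# Rows where the predicted class differs from the actual class
errors = result[result["pred"] != result["act"]]

# Actual-by-predicted counts show which classes get confused with which
confusions = pd.crosstab(
    result["act"], result["pred"],
    rownames=["actual"], colnames=["predicted"],
)

print(errors)
print(confusions)
```

Reading the misclassified sentences alongside the crosstab helps separate systematic confusions between two classes from one-off mistakes.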
