# Overview
- feedback-prize-2021 competition.
- Rewrites the features and the logistic regression from a public notebook.

TODO
- Switch from CountVectorizer to TF-IDF.
- Switch the model to LightGBM.
- Fill in the "No Class" spans.
- Can the algorithm, which currently only joins sentences, be changed so that it can also split sentences?

Ref.
- [Expanding Sentences Window - Logistic Regression | Kaggle (@samir95)](https://www.kaggle.com/samir95/expanding-sentences-window-logistic-regression)

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

SEED = 4121995

np.random.seed(SEED)

# Introduction

The approach I'll use in this notebook is to build a bare-minimum training and validation set, plus an algorithm that maximizes the likelihood of an n-gram of sentences (or words) being one of the discourse types, using a RandomForestClassifier or a NaiveBayesClassifier on TFIDF or CountVectorized datasets.

#### I'll do this without exploring the dataset, as I'm more interested in making a validation pipeline and an algorithm that I have in mind.

# Creating training and validation set

In [2]:
train_df = pd.read_csv('/kaggle/input/feedback-prize-2021/train.csv')
train_df.head()

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
1,423A1CA112E2,1622628000000.0,230.0,312.0,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2,423A1CA112E2,1622628000000.0,313.0,401.0,Some certain areas in the United States ban ph...,Evidence,Evidence 1,60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
3,423A1CA112E2,1622628000000.0,402.0,758.0,"When people have phones, they know about certa...",Evidence,Evidence 2,76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 9...
4,423A1CA112E2,1622628000000.0,759.0,886.0,Driving is one of the way how to get around. P...,Claim,Claim 1,139 140 141 142 143 144 145 146 147 148 149 15...


I'll make a validation set with a number of essays that is equal to the number of essays in the test set.

In [3]:
!ls '/kaggle/input/feedback-prize-2021/test' | wc -l

5


In [4]:
!ls '/kaggle/input/feedback-prize-2021/train' | wc -l

15594


The training data has 15594 essays, which means that we can possibly spare more than 5 essays. But I don't know if we should.

The regular approach would be to split the dataset into an 80/20 training/validation split. <s>but until now this notebook hasn't made any rigorous algorithm, so I'll stick with 5 essays just to make it run.</s> 

<s>After I make sure that the pipeline is working, I can look into making a robust validation set that reflects public leaderboard score, and indicates how well we can do in the private dataset.</s>

Evaluating one essay takes around 113 ms, so in order to minimize waiting and still be able to evaluate results, I'll use around 50 essays so that evaluation doesn't exceed 10 seconds.

In [5]:
# Randomly pick n essay ids and use them as the validation set
def split_essays(train_df, n):
    if isinstance(n, float):
        n = int(len(train_df.id.unique()) * n)
    val_ids = np.random.choice(train_df.id.unique(), n, False)
    train_df, val_df = train_df[~train_df.id.isin(val_ids)], train_df[train_df.id.isin(val_ids)]
    return train_df, val_df

In [6]:
train_df, val_df = split_essays(train_df, 100)
train_df.shape, val_df.shape

((143379, 8), (914, 8))

Now of course I don't want to use the whole training dataset right now, so I'll make another function to sample a small part for development purposes. I can use the previous function that I made, but I'll discard the training set it creates.

In [7]:
# _, dev_df = split_essays(train_df, 10000)
dev_df = train_df
dev_df.shape

(143379, 8)

One thing that we need to add to the training set is the paragraphs or blocks of text that have no classification. We should then give them a label so that the classifier is also trained to predict gaps in the text.
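
The idea can be sketched on plain lists before touching the DataFrame. The snippet below is a minimal illustration, not the notebook's implementation: `find_gap_tokens` is a hypothetical helper that takes the labeled `predictionstring` values of one essay, in order, and returns the token indices that fall between consecutive labeled spans. Like the cell below, it ignores text before the first span and after the last one.

```python
def find_gap_tokens(prediction_strings):
    """Return one list of token indices per gap between consecutive labeled spans."""
    gaps = []
    previous_last = None
    for prediction_string in prediction_strings:
        tokens = [int(token) for token in prediction_string.split()]
        # A gap exists when the current span does not start right after the previous one.
        if previous_last is not None and tokens[0] > previous_last + 1:
            gaps.append(list(range(previous_last + 1, tokens[0])))
        previous_last = tokens[-1]
    return gaps


labeled_spans = ["0 1 2 3", "4 5", "9 10 11"]
print(find_gap_tokens(labeled_spans))  # [[6, 7, 8]]
```

Each returned list could then be joined with spaces to form the `predictionstring` of a `Gap` row.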

In [8]:
# Fill in text spans that have no label

# These functions are inspired by this amazing notebook 
# https://www.kaggle.com/erikbruin/nlp-on-student-writing-eda

def get_unique_ids(df):
    return df.id.unique()

def filter_essay(df, essay_id):
    return df.query('id == @essay_id').reset_index(drop=True)

def read_essay_txt(essay_id, path='train'):
    essay_file_path = f"../input/feedback-prize-2021/{path}/{essay_id}.txt"
    with open(essay_file_path, 'r') as essay_file:
        return essay_file.read()
        
def add_gap_rows_essay(df, essay_id, path):
    
    essay_df = filter_essay(df, essay_id)
    essay_txt = read_essay_txt(essay_id, path)
    
    for index, row in essay_df.iterrows():
        if index == essay_df.index[0]: 
            continue
            
        current_discourse_start = int(row['discourse_start'])
        current_discourse_end = int(row['discourse_end'])
        previous_discourse_start = int(essay_df.loc[index - 1, 'discourse_start'])
        previous_discourse_end = int(essay_df.loc[index - 1, 'discourse_end'])

        if previous_discourse_end != current_discourse_start - 1 and previous_discourse_end != current_discourse_start:
            current_predstring = row['predictionstring']
            previous_predstring = essay_df.loc[index - 1, 'predictionstring']

            current_predstring_first_token = int(current_predstring.split()[0])
            previous_predstring_last_token = int(previous_predstring.split()[-1])
            
            gap_tokens_list = np.arange(previous_predstring_last_token + 1,
                                        current_predstring_first_token).tolist()

            gap_row = {}  
            gap_row['id'] = row['id']
            gap_row['discourse_id'] = row['discourse_id']
            gap_row['discourse_start'] = previous_discourse_end + 1
            gap_row['discourse_end'] = current_discourse_start - 1
            gap_row['discourse_text'] = essay_txt[previous_discourse_end+1: current_discourse_start]
            gap_row['discourse_type'] = 'Gap'
            gap_row['discourse_type_num'] = 'Gap'
            gap_row['predictionstring'] = ' '.join([str(token) for token in gap_tokens_list])
            
            # DataFrame.append was removed in pandas 2.0; pd.concat adds the same row
            essay_df = pd.concat([essay_df, pd.DataFrame([gap_row])], ignore_index=True)
    
    essay_df = essay_df.sort_values('discourse_start').reset_index(drop=True)
    return essay_df

def add_gap_rows_df(df, path):
    new_df = None
    essay_ids = get_unique_ids(df)
    
    for essay_id in essay_ids:
        essay_df = add_gap_rows_essay(df, essay_id, path)
        new_df = pd.concat([new_df, essay_df], axis=0, ignore_index=True)
    
    return new_df           
        

In [9]:
%%time

# Testing on one discourse
add_gap_rows_essay(dev_df, dev_df.id.values[30], 'train')

CPU times: user 15.6 ms, sys: 950 µs, total: 16.6 ms
Wall time: 30 ms


Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring
0,E05C7F5C1156,1622838000000.0,0.0,455.0,People are debating whether if drivers should ...,Lead,Lead 1,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18...
1,E05C7F5C1156,1622838000000.0,455.0,527.0,I also think that you shouldn't use your phone...,Position,Position 1,77 78 79 80 81 82 83 84 85 86 87 88 89
2,E05C7F5C1156,1622838000000.0,528.0,568.0,because it can cause vehicle collisions,Claim,Claim 1,90 91 92 93 94 95
3,E05C7F5C1156,1622838000000.0,569.0,588.0,"slow reaction time,",Claim,Claim 2,96 97 98
4,E05C7F5C1156,1622838000000.0,589.0,609.0,and fatal injuries.,Claim,Claim 3,99 100 101
5,E05C7F5C1156,1622838000000.0,610.0,781.0,"herefore, driving can cause many accidents tha...",Gap,Gap,102 103 104 105 106 107 108 109 110 111 112 11...
6,E05C7F5C1156,1622838000000.0,782.0,937.0,\nThe first reason why the use of cell phones ...,Claim,Claim 4,133 134 135 136 137 138 139 140 141 142 143 14...
7,E05C7F5C1156,1622838000000.0,937.0,1403.0,Most vehicle collisions happen when the drive...,Evidence,Evidence 1,158 159 160 161 162 163 164 165 166 167 168 16...
8,E05C7F5C1156,1622838000000.0,1404.0,1506.0,The second reason why you shouldn't operate a ...,Claim,Claim 5,234 235 236 237 238 239 240 241 242 243 244 24...
9,E05C7F5C1156,1622838000000.0,1507.0,2042.0,Reaction time is the measure of how quickly an...,Evidence,Evidence 2,251 252 253 254 255 256 257 258 259 260 261 26...


In [10]:
%%time

# 8min 29s
dev_df = add_gap_rows_df(dev_df, 'train')

CPU times: user 4min 48s, sys: 25.5 s, total: 5min 14s
Wall time: 6min 5s


In [11]:
%%time 

# 1sec
val_df = add_gap_rows_df(val_df, 'train')

CPU times: user 772 ms, sys: 13.1 ms, total: 785 ms
Wall time: 1.1 s


It seems that gaps aren't always full sentences. Since the model will always be passed TFIDF or count-vectorized sentences, dropping gap rows that aren't full sentences or paragraphs could be beneficial, unless the same approach is applied to n-grams of words instead of sentences.
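
One possible way to decide which gap rows to drop is a simple surface heuristic. The sketch below is only an assumption about what "full sentence" could mean here: `looks_like_full_sentence` and its `min_words` parameter are hypothetical and are not part of the notebook.

```python
def looks_like_full_sentence(text, min_words=3):
    """Heuristic: starts with an uppercase letter, ends with . ! or ?, and has enough words."""
    stripped = text.strip()
    if not stripped or len(stripped.split()) < min_words:
        return False
    return stripped[0].isupper() and stripped[-1] in ".!?"


gap_texts = [
    "herefore, driving can cause many accidents.",  # fragment: starts mid-word
    "slow reaction time,",                          # fragment: ends with a comma
    "This happens every single day.",               # looks like a sentence
]
kept = [text for text in gap_texts if looks_like_full_sentence(text)]
print(kept)
```

A filter like this would be applied only to rows whose `discourse_type` is `Gap`, leaving the labeled rows untouched.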

# The Algorithm

What I have in mind is a simple algorithm that maximizes the likelihood that the TFIDF or Bag of Words representation of a sequence of sentences (or words) belongs to one of the presented discourse types.

I don't know much about NLP, but this is the easiest way I can think about. 

For example, let's say that we have an essay. The algorithm would split this essay into sentences, then classify the first sentence alone and take its maximum prediction as a baseline.

The algorithm would then add another sentence, classify the two sentences together, and then add another and another. Hypothetically, since I will be training the simple model on blocks of discourse types, the likelihood would increase as sentences are added until it starts decreasing, and then we can select the block that maximized the likelihood of a correct prediction according to the model.

Then the algorithm would iterate again from the sentence following the previously predicted block of text.

The core of this approach can be used with any complex model, but I don't know if it would be suitable or not, since probably deep learning models could be capable of more without this brute forcing attempt, but I could be wrong since I don't know much about NLP.

And of course this approach is naive, since it won't learn about the context of the essay, and hence won't be able to predict two evidence sentences following each other for example, and might consider them as one.
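
The expanding-window idea described above can be sketched with a toy scorer standing in for the trained model. Everything here is illustrative: `expanding_window_segments`, `toy_score`, and the `patience` parameter are hypothetical names, and the scorer simply prefers blocks of about eight words. It is a simplification, not the `predict_essay` function defined later.

```python
def expanding_window_segments(sentences, score_block, patience=2):
    """Greedily group sentences into blocks that maximize the scorer's confidence."""
    segments = []
    start = 0
    while start < len(sentences):
        best_end, best_label, best_conf = start + 1, None, -1.0
        misses = 0
        end = start + 1
        # Grow the window one sentence at a time until confidence stops improving.
        while end <= len(sentences) and misses < patience:
            label, conf = score_block(" ".join(sentences[start:end]))
            if conf > best_conf:
                best_end, best_label, best_conf = end, label, conf
            else:
                misses += 1
            end += 1
        segments.append((best_label, sentences[start:best_end]))
        start = best_end  # continue from the sentence after the chosen block
    return segments


def toy_score(text):
    """Stand-in for a classifier: returns (label, confidence)."""
    n_words = len(text.split())
    label = "Claim" if "should" in text else "Evidence"
    return label, 1.0 / (1.0 + abs(n_words - 8))


sentences = [
    "Phones should be banned while driving.",
    "They distract drivers.",
    "Studies show reaction time drops sharply.",
    "Crash rates rise as a result.",
]
for label, block in expanding_window_segments(sentences, toy_score):
    print(label, "|", " ".join(block))
```

Note that the last two sentences come out as separate blocks only because of the toy scorer's length preference; a model with no notion of context has no principled way to split or merge adjacent blocks of the same type.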

In [12]:
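# For example, given an essay, the algorithm splits it into sentences,
# classifies the first sentence alone, and takes its maximum prediction as the baseline.
#
# It then adds another sentence, classifies the two sentences together, then adds another, and so on.
# Since the simple model is trained on blocks of discourse types,
# the likelihood should increase as sentences are added and then start to decrease,
# so we can select the block that maximizes the likelihood of a correct prediction according to the model.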

## Preprocessing the dataset

TFIDF or Bag of Words could be used, <s>but I'll use a simple Bag of Words right now.</s> and I'll use TFIDF.
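
To make the difference between raw counts and TFIDF concrete, here is a small pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents for `TfidfTransformer` with its default settings; `tfidf_rows` itself is a hypothetical helper and uses whitespace tokenization only.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


rows = tfidf_rows(["phones distract drivers", "phones help drivers navigate"])
print(rows[0])
```

Terms that occur in every document ("phones", "drivers") end up with lower weights than terms specific to one document ("distract"), which a plain Bag of Words cannot express.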

The data provides two features regarding the position of the text: discourse_start and discourse_end. I want to use them, but I think it would be easier to use another feature, the start token. Since my model uses Bag of Words, the start token is easier to plug into the prediction pipeline of the algorithm.

In [13]:
# Add starting token feature
dev_df['start_token'] = dev_df['predictionstring'].str.split().str[-1].shift(1).fillna(0).astype(int) + 1
val_df['start_token'] = val_df['predictionstring'].str.split().str[-1].shift(1).fillna(0).astype(int) + 1

dev_df.head(5)

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring,start_token
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...,1
1,423A1CA112E2,1622628000000.0,230.0,312.0,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59,45
2,423A1CA112E2,1622628000000.0,313.0,401.0,Some certain areas in the United States ban ph...,Evidence,Evidence 1,60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75,60
3,423A1CA112E2,1622628000000.0,402.0,758.0,"When people have phones, they know about certa...",Evidence,Evidence 2,76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 9...,76
4,423A1CA112E2,1622628000000.0,759.0,886.0,Driving is one of the way how to get around. P...,Claim,Claim 1,139 140 141 142 143 144 145 146 147 148 149 15...,139
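
Because `shift(1)` in the cell above runs over the whole frame, the first row of every essay after the first one inherits the last token of the previous essay. The sketch below shows one way to reset the counter at essay boundaries. It works on plain tuples, and `start_tokens_per_essay` is a hypothetical helper rather than something the notebook defines.

```python
def start_tokens_per_essay(rows):
    """rows: list of (essay_id, predictionstring) in document order."""
    starts = []
    previous_id, previous_last = None, 0
    for essay_id, prediction_string in rows:
        if essay_id != previous_id:
            previous_last = 0  # reset at every essay boundary
        starts.append(previous_last + 1)
        previous_last = int(prediction_string.split()[-1])
        previous_id = essay_id
    return starts


rows = [("A", "1 2 3"), ("A", "4 5"), ("B", "0 1"), ("B", "2 3 4")]
print(start_tokens_per_essay(rows))  # [1, 4, 1, 2]
```

In pandas, the same reset can be obtained by grouping on `id` before calling `shift(1)`.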


In [14]:
%%time

# Add full text length feature
train_essays = os.listdir('/kaggle/input/feedback-prize-2021/train/')
test_essays = os.listdir('/kaggle/input/feedback-prize-2021/test/')

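# os.path.splitext drops the '.txt' extension; str.rstrip('.txt') strips characters, not a suffix
train_essays_length = {os.path.splitext(id_)[0]: len(read_essay_txt(os.path.splitext(id_)[0]).split()) for id_ in train_essays}
test_essays_length = {os.path.splitext(id_)[0]: len(read_essay_txt(os.path.splitext(id_)[0], 'test').split()) for id_ in test_essays}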

CPU times: user 1.23 s, sys: 571 ms, total: 1.8 s
Wall time: 8.43 s


In [15]:
dev_df['len_essay'] = dev_df['id'].map(train_essays_length)
val_df['len_essay'] = val_df['id'].map(train_essays_length)
dev_df.head(10)

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring,start_token,len_essay
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...,1,379
1,423A1CA112E2,1622628000000.0,230.0,312.0,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59,45,379
2,423A1CA112E2,1622628000000.0,313.0,401.0,Some certain areas in the United States ban ph...,Evidence,Evidence 1,60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75,60,379
3,423A1CA112E2,1622628000000.0,402.0,758.0,"When people have phones, they know about certa...",Evidence,Evidence 2,76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 9...,76,379
4,423A1CA112E2,1622628000000.0,759.0,886.0,Driving is one of the way how to get around. P...,Claim,Claim 1,139 140 141 142 143 144 145 146 147 148 149 15...,139,379
5,423A1CA112E2,1622628000000.0,887.0,1150.0,That's why there's a thing that's called no te...,Evidence,Evidence 3,163 164 165 166 167 168 169 170 171 172 173 17...,163,379
6,423A1CA112E2,1622628000000.0,1151.0,1533.0,Sometimes on the news there is either an accid...,Evidence,Evidence 4,211 212 213 214 215 216 217 218 219 220 221 22...,211,379
7,423A1CA112E2,1622628000000.0,1534.0,1602.0,Phones are fine to use and it's also the best ...,Claim,Claim 2,282 283 284 285 286 287 288 289 290 291 292 29...,282,379
8,423A1CA112E2,1622628000000.0,1603.0,1890.0,If you go through a problem and you can't find...,Evidence,Evidence 5,297 298 299 300 301 302 303 304 305 306 307 30...,297,379
9,423A1CA112E2,1622628000000.0,1891.0,2027.0,The news always updated when people do somethi...,Concluding Statement,Concluding Statement 1,355 356 357 358 359 360 361 362 363 364 365 36...,355,379


Now I'll make a transformer which expects starting token feature and text, then it uses them to calculate the ending token.
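
As a worked example of the position features that the transformer in the next cell returns, the function below applies the same formulas to a single row. `position_features` is a hypothetical standalone helper, and the numbers are made up for illustration.

```python
def position_features(start_token, text, len_essay):
    """Mirror the four ratios returned by the ExtraFeatures transformer for one row."""
    end_token = start_token + len(text.split())
    return {
        "percent_read_before": start_token / len_essay,
        "percent_read_now": end_token / len_essay,
        "percent_remaining": (len_essay - end_token) / len_essay,
        "percent_sentence": (end_token - start_token + 1) / len_essay,
    }


# A 5-word block starting at token 11 of a 100-token essay.
features = position_features(11, "phones can distract careless drivers", 100)
print(features)
```

All four values are fractions of the essay length, so they stay comparable between short and long essays.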

In [16]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD

class ExtraFeatures(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X.copy()  # work on a copy so the caller's DataFrame is not mutated
        X['end_token'] = X['start_token'] + X['discourse_text'].str.split().str.len()
        X['tokens_from_start_to_end'] = X['end_token'] - X['start_token'] + 1
        X['tokens_from_start_to_finish'] = X['len_essay'] - X['start_token'] + 1
        X['tokens_from_end_to_finish'] = X['len_essay'] - X['end_token'] + 1
        X['percent_read_before'] = X['start_token'] / X['len_essay']
        X['percent_read_now'] = X['end_token'] / X['len_essay']
        X['percent_remaining'] = (X['len_essay'] - X['end_token']) / X['len_essay']
        X['percent_sentence'] = (X['end_token'] - X['start_token'] + 1) / X['len_essay']
        
        feats = ['percent_read_before', 'percent_read_now', 'percent_remaining', 'percent_sentence']
        return X[feats].values
    

def preprocess_data(X, pipeline=None):
    if not pipeline:
        
        vectorizer_pipeline = Pipeline([
            ('vectorizer', CountVectorizer(ngram_range=(1, 3), max_features=100000)),
#             ('tfidf', TfidfTransformer()),
#             ('pca', TruncatedSVD(n_components=0.9))
        ])
                
        pipeline = FeatureUnion([
            ('extra_features', ColumnTransformer([('end_token', ExtraFeatures(), ['start_token', 'len_essay', 'discourse_text'])])),
            ('vectorizer', ColumnTransformer([('vectorizer', vectorizer_pipeline, 'discourse_text')]))
        ])

        pipeline.fit(X)
        
    X = pipeline.transform(X)
    return X, pipeline


def encode_labels(y, encoder=None):
    if not encoder:
        encoder = LabelEncoder()
        encoder.fit(y)
        
    y = encoder.transform(y)
    return y, encoder

In [17]:
dev_df.head(1)

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring,start_token,len_essay
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...,1,379


In [18]:
X_train, y_train = dev_df[['start_token', 'len_essay', 'discourse_text']], dev_df['discourse_type']
X_val, y_val = val_df[['start_token', 'len_essay', 'discourse_text']], val_df['discourse_type']

X_train, pipeline = preprocess_data(X_train)
X_val, _ = preprocess_data(X_val, pipeline)

y_train, encoder = encode_labels(y_train)
y_val, _ = encode_labels(y_val, encoder)

In [19]:
dev_df.head(2)

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring,start_token,len_essay
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...,1,379
1,423A1CA112E2,1622628000000.0,230.0,312.0,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59,45,379


In [20]:
print(X_train.shape, X_val.shape)
print(pipeline)
print(type(pipeline))
# Inspect the CountVectorizer
print(dir(pipeline.transformer_list[1][1].transformers[0][1].steps[0][1]))
print(pipeline.transformer_list[1][1].transformers[0][1].steps[0][1].get_params())

(168222, 100004) (1054, 100004)
FeatureUnion(transformer_list=[('extra_features',
                                ColumnTransformer(transformers=[('end_token',
                                                                 ExtraFeatures(),
                                                                 ['start_token',
                                                                  'len_essay',
                                                                  'discourse_text'])])),
                               ('vectorizer',
                                ColumnTransformer(transformers=[('vectorizer',
                                                                 Pipeline(steps=[('vectorizer',
                                                                                  CountVectorizer(max_features=100000,
                                                                                                  ngram_range=(1,
                                                       

## Training the core model

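Since I like Random Forests, I'd normally stick with an RF classifier. In this version, the cells below train a LightGBM model instead, and the logistic regression cell is disabled.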

In [21]:
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression

In [22]:
%%time

# 1min 34s
if False:
    model = LogisticRegression(C=1, dual=True, solver='liblinear')
    model.fit(X_train, y_train)

    # if hasattr(model, 'oob_score_'): print(model.oob_score_)

    # model = MultinomialNB()
    # model.fit(X_train, y_train)
    
    # print(model.score(X_train, y_train))
    # print(model.score(X_val, y_val))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.68 µs


In [23]:
%%time


params_lgb = {
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_class': len(set(y_train)),
    'n_estimators': 100,
    'random_state': 0,
    'learning_rate': 0.05,
    'early_stopping_rounds': 50,
    'subsample': 0.6,
    'subsample_freq': 1,
    'colsample_bytree': 0.4,
    'reg_alpha': 10.0,
    'reg_lambda': 1e-1,
    'min_child_weight': 256,
    'min_child_samples': 4,
    'device': 'cpu',
}

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_val, y_val)

result = {}
model_lgb = lgb.train(params_lgb,
                      lgb_train,
                      valid_sets=[lgb_train, lgb_eval],
                      valid_names=['Train', 'Valid'],
                      verbose_eval=100,
                      num_boost_round=500,
                      evals_result=result)

X_val_pred = model_lgb.predict(X_val, num_iteration=model_lgb.best_iteration)



You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 257655
[LightGBM] [Info] Number of data points in the train set: 168222, number of used features: 100003
[LightGBM] [Info] Start training from score -1.215524
[LightGBM] [Info] Start training from score -2.528464
[LightGBM] [Info] Start training from score -3.371400
[LightGBM] [Info] Start training from score -1.309464
[LightGBM] [Info] Start training from score -1.912709
[LightGBM] [Info] Start training from score -2.900121
[LightGBM] [Info] Start training from score -2.396060
[LightGBM] [Info] Start training from score -3.665740
Training until validation scores don't improve for 50 rounds
[100]	Train's multi_logloss: 0.864077	Valid's multi_logloss: 0.940988
Did not meet early stopping. Best iteration is:
[100]	Train's multi_logloss: 0.864077	Valid's multi_logloss: 0.940988
CPU times: user 1h 37min 30s, sys: 6.53 s, total: 1h 37min 37s


# Prediction

## Designing The Algorithm

The algorithm should take a full text and then output a dataframe of classes and prediction strings (tokens). 

In [24]:
# Add columns holding the next discourse_type and the next id
dev_df['next_discourse_type'] = dev_df['discourse_type'].shift(-1)
dev_df['next_id'] = dev_df['id'].shift(-1)

dev_df.loc[
    dev_df['next_id'] != dev_df['id'],
    'next_discourse_type'
] = 'NaN'

dev_df.head(3)

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring,start_token,len_essay,next_discourse_type,next_id
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...,1,379,Position,423A1CA112E2
1,423A1CA112E2,1622628000000.0,230.0,312.0,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59,45,379,Evidence,423A1CA112E2
2,423A1CA112E2,1622628000000.0,313.0,401.0,Some certain areas in the United States ban ph...,Evidence,Evidence 1,60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75,60,379,Evidence,423A1CA112E2


In [25]:
# Transition probability matrix
discourse_next = pd.pivot_table(data=dev_df, index='discourse_type', columns='next_discourse_type', aggfunc='size')
discourse_next = discourse_next.apply(lambda x: x / discourse_next.sum(axis=1))
discourse_next = discourse_next[encoder.classes_].fillna(0)
discourse_next

next_discourse_type,Claim,Concluding Statement,Counterclaim,Evidence,Gap,Lead,Position,Rebuttal
discourse_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Claim,0.198729,0.018021,0.020226,0.559825,0.189268,0.0,0.008259,4e-05
Concluding Statement,0.005439,0.000224,0.007675,0.00231,0.00529,0.0,0.017063,0.0
Counterclaim,0.033928,0.025446,0.011425,0.224338,0.070106,0.0,0.011425,0.618141
Evidence,0.383538,0.187629,0.072995,0.092394,0.201304,6.6e-05,0.016537,0.010393
Gap,0.5414,0.098539,0.022783,0.277422,0.0,0.0,0.050034,0.009822
Lead,0.091734,0.0,0.016099,0.04322,0.108806,0.000108,0.7396,0.000108
Position,0.460153,0.032243,0.021278,0.171986,0.279943,0.000196,6.5e-05,0.000196
Rebuttal,0.159154,0.210967,0.052045,0.412872,0.113151,0.0,0.013708,0.002556
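
The same kind of transition table can be sketched without pandas. Note one deliberate difference, which is an assumption of this sketch: it normalizes over observed successors only, whereas the cell above also counts the end-of-essay transition in the denominator before dropping that column (which is why some rows above sum to much less than 1). `transition_probabilities` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def transition_probabilities(essays):
    """essays: list of label sequences, one per essay. Returns {current: {next: probability}}."""
    counts = defaultdict(Counter)
    for labels in essays:
        # Pair every label with the one that follows it inside the same essay.
        for current, following in zip(labels, labels[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for current, following in counts.items()
    }


essays = [
    ["Lead", "Position", "Claim", "Evidence"],
    ["Lead", "Claim", "Evidence", "Claim"],
]
table = transition_probabilities(essays)
print(table["Lead"])   # {'Position': 0.5, 'Claim': 0.5}
print(table["Claim"])  # {'Evidence': 1.0}
```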


In [26]:
# Convert the transition probability matrix into a dict
expectations = {}
for discourse_type, row in discourse_next.iterrows():
    expectations[discourse_type] = row.values

expectations

{'Claim': array([1.98729128e-01, 1.80207268e-02, 2.02257101e-02, 5.59825205e-01,
        1.89267745e-01, 0.00000000e+00, 8.25866458e-03, 4.00906048e-05]),
 'Concluding Statement': array([0.00543924, 0.00022353, 0.00767454, 0.00230981, 0.00529022,
        0.        , 0.01706281, 0.        ]),
 'Counterclaim': array([0.03392764, 0.02544573, 0.01142461, 0.22433789, 0.07010559,
        0.        , 0.01142461, 0.6181409 ]),
 'Evidence': array([3.83538116e-01, 1.87629365e-01, 7.29951116e-02, 9.23944158e-02,
        2.01303563e-01, 6.60589246e-05, 1.65367508e-02, 1.03932708e-02]),
 'Gap': array([0.54139999, 0.09853882, 0.02278308, 0.27742221, 0.        ,
        0.        , 0.05003421, 0.00982168]),
 'Lead': array([9.17341977e-02, 0.00000000e+00, 1.60994057e-02, 4.32198811e-02,
        1.08806051e-01, 1.08049703e-04, 7.39600216e-01, 1.08049703e-04]),
 'Position': array([4.60152732e-01, 3.22433262e-02, 2.12779845e-02, 1.71986163e-01,
        2.79942562e-01, 1.95809673e-04, 6.52698910e-05, 1.95

In [27]:
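# This does something similar to a hidden Markov model:
# using the transition probability matrix from the training data, it takes into account
# which discourse_type is likely to follow the previously predicted discourse_type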

import nltk

def predict_text(model, data, pipeline):
    # return model.predict_proba(pipeline.transform(data))
    # Use the model passed in rather than the global model_lgb
    return model.predict(pipeline.transform(data), num_iteration=model.best_iteration)

def predict_essay(model, essay_id, path, pipeline, max_iter, print_results=False):
    essay_txt = read_essay_txt(essay_id, path)
    essay_sentences = nltk.sent_tokenize(essay_txt)
    essay_preds = {}
    essay_preds['id'] = essay_id
    essay_preds['discourse_type'] = []
    essay_preds['predictionstring'] = []
    
#     print(len(essay_sentences))
#     print(len(essay_txt.split()))
    
    start_token = 0
    end_token = 0
    start_sent = 0
    end_sent = 1
    iter_bad = 0
    max_pred = 0
    
    start_end_sents, max_preds, argmax_preds = [], [], [] 
    stop = False
    
    prev_type = None
    
    while not stop: 
        data = {}
        data['start_token'] = [start_token + 1]
        data['len_essay'] = train_essays_length[essay_id] if path == 'train' else test_essays_length[essay_id]
        data['discourse_text'] = [' '.join(essay_sentences[start_sent:end_sent])]
        data = pd.DataFrame(data)[['start_token', 'len_essay', 'discourse_text']]
        preds = predict_text(model, data, pipeline)
        
#         print('Before update:', np.argmax(preds))
        if prev_type:
            preds = update_preds(preds, prev_type)
            
#         print('After update:', np.argmax(preds), prev_type)

        if preds.max() >= max_pred:
            max_pred = preds.max()
            start_end_sents.append([start_sent, end_sent])
            max_preds.append(max_pred)
            argmax_preds.append(preds.argmax())

        else:
            iter_bad += 1
        
#         print(start_sent, end_sent, encoder.inverse_transform([preds.argmax()])[0], preds.max(), max_pred, iter_bad)
        end_sent += 1
            
#         print(start_end_sents, max_preds)
                         
        if iter_bad >= max_iter or end_sent > len(essay_sentences):
            best_pred = np.argmax(max_preds)
            best_start_end = start_end_sents[best_pred]
            merged_sentence = ' '.join(essay_sentences[best_start_end[0]: best_start_end[-1]])
            end_token = len(merged_sentence.split()) + end_token
            prediction_string = ' '.join([str(token) for token in range(start_token, end_token)])
            
            essay_preds['discourse_type'].append(encoder.inverse_transform([argmax_preds[best_pred]])[0])
            essay_preds['predictionstring'].append(prediction_string)
            
            if print_results: print('MATCH ------- \n', merged_sentence, '\n\n', encoder.inverse_transform([argmax_preds[best_pred]])[0], '\n\n')
            
            start_token = end_token
            start_sent = best_start_end[-1]
            
#             end_token = start_token + 1
            end_sent = start_sent + 1
            
            iter_bad = 0
            max_pred = 0
            
            prev_type = encoder.inverse_transform([argmax_preds[best_pred]])[0]
            
            start_end_sents, max_preds, argmax_preds = [], [], []
            
            
        if start_sent == len(essay_sentences):
            stop = True
    
    return essay_preds

def update_preds(preds, prev_type):
    prev_type_expec = expectations[prev_type]
    return preds * prev_type_expec

def predict_df(model, df, path, pipeline, max_iter=5):
    essay_ids = df['id'].unique()
    preds_df = None
    for essay_id in essay_ids:
        essay_preds = predict_essay(model, essay_id, path, pipeline, max_iter)
        preds_df = pd.concat([preds_df, pd.DataFrame(essay_preds)], axis=0)
        
    return preds_df
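
The reweighting step in `update_preds` can be illustrated on plain lists. The sketch below is a variation, not the notebook's function: it additionally renormalizes the result and falls back to the raw probabilities when the prior zeroes everything out. Both behaviors are assumptions of this example, as is the name `apply_transition_prior`.

```python
def apply_transition_prior(probabilities, transition_prior):
    """Multiply class probabilities by the prior for the previous type, then renormalize."""
    weighted = [p * t for p, t in zip(probabilities, transition_prior)]
    total = sum(weighted)
    if total == 0:
        # The prior rules out every class; keep the model's own probabilities.
        return list(probabilities)
    return [w / total for w in weighted]


model_probs = [0.5, 0.3, 0.2]   # the model alone prefers class 0
prior = [0.2, 0.6, 0.2]         # class 1 usually follows the previous type
adjusted = apply_transition_prior(model_probs, prior)
print(adjusted)
```

Here the prior shifts the top prediction from class 0 to class 1, which is the effect the transition table is meant to have on consecutive blocks.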

Let's test the algorithm now with one essay.

In [28]:
val_df.id.unique()

array(['570D8769BE33', 'B18FB042DEDD', 'B5DE2FAE1DB5', '0FB84B0726F5',
       '656F48B15786', '8ABA260B3B98', '1E8C298D92CB', '63001194BE9C',
       '535D71E8C000', '0F92CE19A137', '2E1266682F4A', '91A7412303BD',
       'CD10A87243D1', '6BFE834C4FD5', '16EAB34FDC87', 'CC50C8238771',
       'BE80D73EC131', '714F75BBE481', 'E70DA4483D3F', '130B05555BA1',
       '983273B60F84', '003FDC7E6F20', 'A5DA0D5410DE', 'D69416DBA6F4',
       '0A6E7B9813A4', '1B241587A2A6', '476E07AEA488', 'F324DBEBCAFA',
       'C60B2BD410BD', '64551673D7DB', 'A8A25F94F5B6', '10F1ACB5559E',
       '64EE7AB92E51', 'D2A7CA1EEFC5', '25FA01853A35', 'C237996BF240',
       'E66C66D98801', '9AFE16DB63D8', '4441EEF0E6FD', '1627A070AEBB',
       '04F277F6562D', '32D988FC1AA1', 'AA3A35044E77', '0CE521F8D172',
       '3B053B2F03A3', '8DBEE30EF6A7', '318E29BC6F06', '649807ACDD9F',
       'E3BA4F414829', '85EB8DC94BD0', '0B4FAC7A4A8B', 'B88301B4E6D7',
       'B15EA1EE9302', 'B23941DFCB72', '1AC3332D2C41', '520B38139E21',
      

In [29]:
%%time

_ = predict_essay(model_lgb, '570D8769BE33', 'train', pipeline, 2, print_results=True)

MATCH ------- 
 Should drivers be able to use a phone in any capacity while operating a vehicle or should drivers not be able to use their phones while operating a vehicle? Using a phone while driving could be helpful in various ways and also could be used in harmful ways. Cell phones can take the attention of a driver causing a risk of danger. A cell phone helps drivers with immediate roadside assistance, in case of a serious car accident. Cell phone's allow drivers to get directions turn by turn with the helping of GPS technology. Phone cameras are very important in a case of a hit and run accident, it allows a person to take a picture of the license plate. 

 Evidence 


MATCH ------- 
 Some insurance companies may require a picture of an accident. Traffic reports update on a phone in minutes, accidents, highway closings and also an estimated travel time to their destination. 

 Claim 


MATCH ------- 
 An activity that grabs the attention from a driver is Distracted Driving. Being 

# Evaluation

According to the competition evaluation page:

1. For each sample, all ground truths and predictions for a given class are compared.
2. If the overlap between the ground truth and prediction is >= 0.5, and the overlap between the prediction and the ground truth >= 0.5, the prediction is a match and considered a true positive. If multiple matches exist, the match with the highest pair of overlaps is taken.
3. Any unmatched ground truths are false negatives and any unmatched predictions are false positives.
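To make rule 2 concrete, here is a small worked example of the two overlap fractions. The names `overlap_pair` and `is_match` are hypothetical helpers written for this illustration only (they are not part of the competition code), and both spans are assumed to be non-empty lists of word indices.

```python
def overlap_pair(prediction, truth):
    """Return (share of the truth covered, share of the prediction covered)."""
    pred_set, truth_set = set(prediction), set(truth)
    common = len(pred_set & truth_set)
    return common / len(truth_set), common / len(pred_set)


def is_match(prediction, truth, threshold=0.5):
    """Rule 2: both overlap fractions must reach the threshold."""
    truth_cov, pred_cov = overlap_pair(prediction, truth)
    return truth_cov >= threshold and pred_cov >= threshold


truth = list(range(0, 10))       # ground-truth word indices 0..9
close_pred = list(range(5, 15))  # shares 5 of its 10 words with the truth
wide_pred = list(range(0, 30))   # covers the whole truth but is three times as long

print(overlap_pair(close_pred, truth), is_match(close_pred, truth))
print(overlap_pair(wide_pred, truth), is_match(wide_pred, truth))
```

The second prediction covers the ground truth completely, yet only a third of it lies inside the ground truth, so it is not a match. This is why overly long predicted spans are penalised.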


In [30]:
%%time

# 30.5s
val_preds_df = predict_df(model_lgb, val_df, 'train', pipeline, 2)
val_preds_df.head()

CPU times: user 57.7 s, sys: 209 ms, total: 57.9 s
Wall time: 30.5 s


Unnamed: 0,id,discourse_type,predictionstring
0,570D8769BE33,Evidence,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18...
1,570D8769BE33,Claim,120 121 122 123 124 125 126 127 128 129 130 13...
2,570D8769BE33,Evidence,150 151 152 153 154 155 156 157 158 159 160 16...
3,570D8769BE33,Claim,244 245 246 247 248 249 250 251 252 253 254 25...
4,570D8769BE33,Evidence,271 272 273 274 275 276 277 278 279 280 281 28...


In [31]:
def evaluate_df(df, pred_df):
    essay_ids = df['id'].unique()
    f1_scores = []
    for essay_id in essay_ids:
        f1_score = evaluate_essay(df, pred_df, essay_id)
        f1_scores.append(f1_score)
    return np.mean(f1_scores)
        
def evaluate_essay(df, pred_df, essay_id, print_results=False):
    essay_df = filter_essay(df, essay_id)
    pred_essay_df = filter_essay(pred_df, essay_id)
    # Predicted gaps are not scored.
    pred_essay_df = pred_essay_df.loc[pred_essay_df['discourse_type'] != 'Gap', :]
    f1_scores = []
    # Classes come from the whole df, so every class (including 'Gap') enters the mean for every essay.
    for class_ in df['discourse_type'].unique():
        f1_score = evaluate_class(essay_df, pred_essay_df, class_, print_results)
        f1_scores.append(f1_score)
        
    return np.mean(f1_scores)
        
def evaluate_class(df, pred_df, class_, print_results):
    class_df = filter_class(df, class_)
    pred_class_df = filter_class(pred_df, class_)
    truths = class_df['predictionstring'].str.split(' ').tolist()
    predictions = pred_class_df['predictionstring'].str.split(' ').tolist()
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    matched_truths_idx = []
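    for prediction in predictions:
        for i, truth in enumerate(truths):
            if test_overlap(prediction, truth):
                true_positives += 1
                matched_truths_idx.append(i)
            else:
                # Counted once per non-matching (prediction, truth) pair, not once per prediction.
                false_positives += 1
        # Drop matched truths so that a later prediction cannot match them again.
        truths = remove_from_list(truths, matched_truths_idx)
        matched_truths_idx = []

    # Truths that were never matched are false negatives.
    false_negatives = len(truths)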
    
    f1_score = calculate_f1(true_positives, false_positives, false_negatives)
    
    if print_results: print(class_, f1_score)
        
    return f1_score


def filter_class(df, class_):
    return df.query('discourse_type == @class_')
                
        
def test_overlap(prediction, truth):
    # A match requires each span to cover at least half of the other.
    prediction_set = set(prediction)
    truth_set = set(truth)
    return (overlap_fraction(prediction_set, truth_set) >= 0.5
            and overlap_fraction(truth_set, prediction_set) >= 0.5)
    
    
def overlap_fraction(set1, set2):
    return len(set1.intersection(set2)) / len(set1)
    
    
def remove_from_list(list_, idx):
    return [x for i, x in enumerate(list_) if i not in idx]

def calculate_f1(true_positives, false_positives, false_negatives):
    precision = calculate_precision(true_positives, false_positives)
    recall = calculate_recall(true_positives, false_negatives)
    return 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    
def calculate_precision(true_positives, false_positives):
    return true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0

def calculate_recall(true_positives, false_negatives):
    return true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
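`evaluate_class` above increments `false_positives` inside the inner loop, so one prediction can add several false positives. Rule 3 reads more like "one false positive per unmatched prediction". Below is a minimal sketch of a one-to-one matcher under that reading. `count_matches` is a hypothetical name, and ranking candidate pairs by their smaller overlap first and their larger overlap second is my own interpretation of "the highest pair of overlaps", not something the evaluation page spells out.

```python
def count_matches(predictions, truths, threshold=0.5):
    """Return (true_positives, false_positives, false_negatives) for one class in one essay.

    predictions and truths are lists of non-empty token-index lists
    (strings or ints both work, since only set membership is used).
    """
    candidates = []
    for p_idx, prediction in enumerate(predictions):
        pred_set = set(prediction)
        for t_idx, truth in enumerate(truths):
            truth_set = set(truth)
            common = len(pred_set & truth_set)
            truth_cov = common / len(truth_set)
            pred_cov = common / len(pred_set)
            if truth_cov >= threshold and pred_cov >= threshold:
                candidates.append(
                    (min(truth_cov, pred_cov), max(truth_cov, pred_cov), p_idx, t_idx)
                )

    # Assumption: best pair = largest smaller overlap, ties broken by the larger overlap.
    candidates.sort(key=lambda c: (c[0], c[1]), reverse=True)

    used_preds, used_truths = set(), set()
    for _, _, p_idx, t_idx in candidates:
        # Each prediction and each truth takes part in at most one match.
        if p_idx not in used_preds and t_idx not in used_truths:
            used_preds.add(p_idx)
            used_truths.add(t_idx)

    true_positives = len(used_preds)
    false_positives = len(predictions) - true_positives
    false_negatives = len(truths) - true_positives
    return true_positives, false_positives, false_negatives


truths = [list(range(0, 10)), list(range(20, 30))]
predictions = [list(range(0, 10)), list(range(40, 50)), list(range(2, 10))]
print(count_matches(predictions, truths))
```

In the toy data, two predictions overlap the first truth well enough to match, but only the better one is counted as a true positive; the other becomes a false positive along with the prediction that matches nothing.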

In [32]:
%%time

# Test one essay
evaluate_essay(val_df, val_preds_df, '570D8769BE33', True)

Lead 0
Claim 0
Evidence 0.125
Counterclaim 0
Position 0
Concluding Statement 1.0
Rebuttal 0
Gap 0
CPU times: user 42.1 ms, sys: 1.03 ms, total: 43.2 ms
Wall time: 40.4 ms


0.140625

In [33]:
%%time

evaluate_df(val_df, val_preds_df)

CPU times: user 3.2 s, sys: 28.2 ms, total: 3.23 s
Wall time: 3.19 s


0.08505312253106372
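Note that `evaluate_df` averages a per-essay mean of per-class F1 scores. As far as I recall, the evaluation page instead pools TP/FP/FN per class over all essays and then takes the macro F1 across classes; that sentence is not quoted above, so treat it as an assumption. A minimal sketch of the pooled variant, with the hypothetical helper `macro_f1_from_counts` and made-up counts for illustration:

```python
from collections import defaultdict


def macro_f1_from_counts(per_essay_counts):
    """Pool (class_name, tp, fp, fn) tuples per class, then macro-average F1 over classes."""
    totals = defaultdict(lambda: [0, 0, 0])
    for class_name, tp, fp, fn in per_essay_counts:
        totals[class_name][0] += tp
        totals[class_name][1] += fp
        totals[class_name][2] += fn

    f1_by_class = {}
    for class_name, (tp, fp, fn) in totals.items():
        # F1 = 2TP / (2TP + FP + FN), which equals 2PR / (P + R).
        denominator = 2 * tp + fp + fn
        f1_by_class[class_name] = 2 * tp / denominator if denominator > 0 else 0.0

    macro = sum(f1_by_class.values()) / len(f1_by_class) if f1_by_class else 0.0
    return macro, f1_by_class


# Illustrative counts only, not results from this notebook.
counts = [
    ("Claim", 1, 1, 0),     # essay A
    ("Evidence", 2, 0, 2),  # essay A
    ("Claim", 1, 0, 1),     # essay B
]
macro, by_class = macro_f1_from_counts(counts)
print(macro, by_class)
```

With pooling, a class that is absent from an essay contributes nothing for that essay, instead of an F1 of 0 that pulls the essay-level mean down.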

# Submission

In [34]:
test_ids = !ls '/kaggle/input/feedback-prize-2021/test'
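# str.rstrip('.txt') strips any trailing '.', 't' or 'x' characters, not the suffix; splitext removes only the extension.
test_ids = [os.path.splitext(id_)[0] for id_ in test_ids]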
test_df = pd.DataFrame({'id': test_ids})
test_df['len_essay'] = test_df['id'].map(test_essays_length)

In [35]:
%%time

submission = predict_df(model_lgb, test_df, 'test', pipeline, 2)
submission.head()

CPU times: user 5.16 s, sys: 19 ms, total: 5.18 s
Wall time: 2.61 s


Unnamed: 0,id,discourse_type,predictionstring
0,0FB0700DAF44,Claim,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1,0FB0700DAF44,Evidence,16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 3...
2,0FB0700DAF44,Claim,249 250 251 252 253 254 255 256 257 258 259 26...
3,0FB0700DAF44,Evidence,262 263 264 265 266 267 268 269 270 271 272 27...
4,0FB0700DAF44,Claim,432 433 434 435 436 437 438 439 440 441 442 44...


In [36]:
submission.columns = ['id', 'class', 'predictionstring']

In [37]:
# Drop gaps
submission = submission.loc[submission['class'] != 'Gap', :]

In [38]:
submission.to_csv('submission.csv', index=False)