In [1]:
import os

from IPython.display import display, Markdown

In [2]:
REPO_PATH = os.path.dirname(os.path.dirname(os.path.dirname(os.getcwd())))
TASK_PATH = os.path.join(REPO_PATH, "tasks", "06-language-as-sequence")

In [3]:
def show_markdown(path):
    with open(path, 'r') as md:
        content = md.read()
    display(Markdown(content))

In [4]:
show_markdown(os.path.join(TASK_PATH, "06-language-as-sequence.md"))

# Language as a Sequence

## Run-on Sentences

### 1. Domain

This week you will work on an error-correction task.

A run-on sentence is two or more sentences joined together without proper punctuation. The error is often mechanical, made when typing quickly, but it can also come from poor knowledge of the language. It is especially common in online communication.

For example:
```
Thanks for talking to me let's meet again tomorrow Bye.
```

This sentence is actually three sentences glued together. The correct version:
```
Thanks for talking to me. Let's meet again tomorrow. Bye.
```

Detecting run-on sentences matters beyond error correction. The error also degrades the quality of entity recognition, machine translation, sentiment target detection, and so on.

More information and examples can be found at:
- <http://www.bristol.ac.uk/arts/exercises/grammar/grammar_tutorial/page_37.htm>
- <https://www.english-grammar-revolution.com/run-on-sentence.html>
- <https://www.quickanddirtytips.com/education/grammar/what-are-run-on-sentences>

### 2. Classifier

Data:
- Generate training data for the model from open corpora. The training data will be a set of glued sentences. Keep in mind that several sentences can be glued together (usually 2, but sometimes 3-4), and that the first word of the following sentence may be capitalized or lowercase.
- Find a publicly available n-gram database, or build your own, at the level of words or parts of speech. Note that open n-gram databases usually contain statistics collected within sentences, so they may lack n-grams that span sentence boundaries.

Testing:
- Write a baseline solution and a metric for measuring quality.
- For testing, use the corpus [run-on-test.json](run-on-test.json). Corpus format:
```
[
  [
    ["Thanks", false],
    ["for", false],
    ["talking", false],
    ["to", false],
    ["me", true],
    ["let", false],
    ["'s", false],
    ["meet", false],
    ["again", false],
    ["tomorrow", true],
    ["Bye", false],
    [".", false]
  ],
...
]
```

`true` marks a word after which a period must be added. The test corpus contains 200 tokenized sentences (~4,700 tokens). 3% of the tokens have class `true`, and the rest have class `false`. Note that the corpus is already tokenized.
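A labeled sample in this format can be turned back into punctuated tokens by emitting a period after every token marked `true`. The sketch below is only an illustration of the format; `insert_periods` is a hypothetical helper name, not part of the task.

```python
def insert_periods(sample):
    """Return the token list with "." inserted after each token labeled True.

    `sample` is one corpus entry: a list of [token, label] pairs.
    """
    restored = []
    for token, ends_sentence in sample:
        restored.append(token)
        if ends_sentence:
            restored.append(".")
    return restored


sample = [["to", False], ["me", True], ["let", False], ["'s", False]]
print(insert_periods(sample))
```

Capitalizing the word that follows each inserted period is left out here, since the corpus labels only say where the period goes.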

Classifier:
- Select features that influence whether a word is at a sentence boundary. For example:
  - right/left context;
  - how the word is written;
  - grammatical features (can a sentence end with a conjunction?);
  - n-grams (do this word and the next one often occur together? are these two parts of speech likely to be adjacent?);
  - depth in the syntax tree or the lowest common ancestor;
  - your own ideas.
- Build a classifier based on logistic regression or conditional random fields (CRF) that sequentially annotates the words of a sentence as sentence-final or not.
- Try to improve the classifier's quality by changing the set or combination of features.
- **Important:** while improving the classifier, check its quality on your own data (train/test split or cross-validation).
- Determine the final quality of the classifier on the test set.

Record your observations and results in a separate file.

### Useful links

- [CRF tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html)
- [Google ngrams](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) (and [how to download](https://pypi.org/project/google-ngram-downloader/))
- [Google syntactic ngrams](http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html)
- [1 mln of 2/3/4/5-ngrams from COCA](https://www.ngrams.info/download_coca.asp)

### Grading

100% for the task.

### Deadline

18.04.2020


## Data preparation

- Simple English Wikipedia dump: <https://dumps.wikimedia.org/simplewiki/latest/>
- Brown corpus: <https://www.kaggle.com/nltkdata/brown-corpus>
- Another English corpus: <https://www.kaggle.com/espn56/english-corpus>

In [5]:
import re
import json
import spacy
import numpy as np
import pandas as pd

from tqdm import tqdm
from bs4 import BeautifulSoup

In [6]:
nlp = spacy.load("en_core_web_md")

### Simple wiki

In [7]:
# ! python WikiExtractor.py simplewiki-latest-pages-articles-multistream.xml -o wiki_articles

In [8]:
# lst = []

# for folder in os.listdir('wiki_articles'):
#     for fn in tqdm(os.listdir(f"wiki_articles/{folder}")):
#         with open(f'wiki_articles/{folder}/{fn}') as file_:
#             res = file_.readlines()
#             res = "".join(res)
#             res = BeautifulSoup(res, 'lxml')
#             res = list(filter(lambda x: len(x.split()) > 10, res.get_text().split("\n")))
#             lst.extend(res)

# df = pd.DataFrame(lst, columns=['text'])
# df.to_csv("simple_wiki.txt", index=False)

In [9]:
df_wiki = pd.read_csv("simple_wiki.txt")
df_wiki.shape

(428708, 1)

In [10]:
df_wiki.head()

Unnamed: 0,text
0,Jean Bercher (known as Dauberval or D'Auberval...
1,Cri-Cri is a fictional talking cricket. The ch...
2,The character was created by Gabilondo Soler w...
3,It was made into a movie that was released on ...
4,"Baldwin I († 879), also known as ""Baldwin Iron..."


### Test data

In [11]:
with open(os.path.join(TASK_PATH, 'run-on-test.json')) as file_:
    test_js = json.load(file_)

In [12]:
tmp = " ".join([item[0] for item in test_js[1]])

##### number of sentence splits in one sample

In [13]:
pd.Series([sum([int(item[1] == True) for item in sample]) for sample in test_js]).value_counts()

1    145
0     50
2      5
dtype: int64

###### class distribution

In [14]:
lst = []
for sample in test_js:
    lst.extend([int(item[1]) for item in sample])
pd.Series(lst).value_counts(normalize=False)

0    4542
1     155
dtype: int64

###### upper/lower letter after dot

In [15]:
up_count = 0
low_count = 0

for sample in test_js:
    for index, item in enumerate(sample):
        if item[1] == True:
            if sample[index+1][0][0].isupper():
                up_count += 1
            else:
                low_count += 1
print(up_count, low_count)

75 80
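The counts above show that only 75 of the 155 boundaries are followed by a capitalized token, so a rule based on capitalization alone can recover just under half of them at best. A minimal sketch of such a baseline follows; `capital_baseline` is a hypothetical name, and the rule will also fire on proper nouns inside a sentence.

```python
def capital_baseline(tokens):
    """Predict a boundary after token i when token i+1 starts with an uppercase letter."""
    predictions = []
    for i in range(len(tokens)):
        # The last token has no successor, so it is never marked as a boundary.
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        predictions.append(bool(following) and following[0].isupper())
    return predictions


tokens = ["to", "me", "let", "tomorrow", "Bye", "."]
print(capital_baseline(tokens))
```

On this fragment the rule finds the boundary before "Bye" but misses the one before the lowercase "let", which is the kind of case the contextual features are meant to cover.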


### data preprocessing

In [16]:
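def prepare_train(text, zipped=False):
    """Glue the sentences of `text` into one run-on sample.

    Returns tokens and boolean labels; True marks the last token of every
    sentence except the final one.
    """
    doc = nlp(text)
    sents = list(doc.sents)
    tokens = []
    labels = []
    for index, sen in enumerate(sents):
        tkns = [token.text for token in sen]
        # Randomly lowercase the first word of every non-initial sentence.
        if index > 0 and np.random.rand() > 0.5:
            tkns[0] = tkns[0].lower()
        # Drop the last token (assumed to be closing punctuation) of every sentence but the last.
        if index != len(sents)-1:
            tkns = tkns[:-1]
        tokens.extend(tkns)
        lbls = [False] * (len(tkns) - 1) + [True]
        labels.extend(lbls)
    # The end of the whole sample is not a run-on boundary.
    labels[-1] = False
    # Sanity check: exactly one boundary per pair of glued sentences.
    if sum(labels) != len(sents) - 1:
        raise ValueError(f"there is a problem with sentence=*{text}*")
    if zipped:
        return list(zip(tokens, labels))
    return tokens, labels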
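The gluing logic can be illustrated without spaCy on sentences that are already split and tokenized. This is a sketch under that assumption, not the notebook's pipeline; `glue_sentences`, `lower_prob` and `seed` are hypothetical names, and each sentence is assumed to end with a punctuation token.

```python
import random


def glue_sentences(sentences, lower_prob=0.5, seed=0):
    """Join tokenized sentences into one run-on sample with boundary labels."""
    rng = random.Random(seed)
    tokens, labels = [], []
    for index, sentence in enumerate(sentences):
        sent = list(sentence)
        # Optionally lowercase the first word of a non-initial sentence.
        if index > 0 and rng.random() < lower_prob:
            sent[0] = sent[0].lower()
        # Drop the closing punctuation of every sentence except the last.
        if index != len(sentences) - 1:
            sent = sent[:-1]
        if not sent:
            continue  # nothing left to label
        tokens.extend(sent)
        labels.extend([False] * (len(sent) - 1) + [True])
    if labels:
        labels[-1] = False  # the end of the sample is not a boundary
    return tokens, labels


sentences = [["Thanks", "for", "talking", "."], ["Let", "'s", "meet", "."]]
print(glue_sentences(sentences, lower_prob=0.0))
```

Using a seeded `random.Random` instead of the global NumPy generator keeps the generated samples reproducible between runs.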

In [17]:
def make_features(tokens):
    text = " ".join(tokens).replace("-", "")
    text = re.sub(r".*(\")\w", "", text)
#     for item in ["-", ",", "."]:
#         text = text.replace(" {item} ", "{item}")
    for k, v in [("'m", "am"), ("'s", "is"), ("'re", "are"), ("'ve", "have")]:
        text = text.replace(k, v)
#     print(text)

    text = "start_sent " + text + " end_sent" #+ " end_end_sent"
    doc = nlp(text)
    doc_range = len(doc)
    lst = []
    for i in range(1, doc_range - 1):
        prev_token = doc[i-1]
        cur_token = doc[i]
        next_token = doc[i+1]
#         next_next_token = doc[i+2]
        
        features = {
            "prev_token_pos": prev_token.pos_,
            "prev_token_lemma_": prev_token.lemma_,
            "prev_token_ent_iob": prev_token.ent_iob,
            "prev_token_is_alpha": prev_token.is_alpha,
            "prev_token_is_digit": prev_token.is_digit,
            "prev_token_is_lower": prev_token.is_lower,
            "prev_token_is_upper": prev_token.is_upper,
            "prev_token_is_title": prev_token.is_title,
            "prev_token_is_punct": prev_token.is_punct,
            "prev_token_dep_": prev_token.is_punct,  # NOTE: stores is_punct again, not prev_token.dep_
            "prev_token_start_sen": int(prev_token.text == 'start_sent'),
            "prev_token_num": i-1,
            
            "cur_token_pos": cur_token.pos_,
            "cur_token_lemma_": cur_token.lemma_,
            "cur_token_ent_iob": cur_token.ent_iob,
            "cur_token_is_alpha": cur_token.is_alpha,
            "cur_token_is_digit": cur_token.is_digit,
            "cur_token_is_lower": cur_token.is_lower,
            "cur_token_is_upper": cur_token.is_upper,
            "cur_token_is_title": cur_token.is_title,
            "cur_token_is_punct": cur_token.is_punct,
            "cur_token_dep_": cur_token.is_punct,  # NOTE: stores is_punct again, not cur_token.dep_
            "cur_token_num": i,
            
            "next_token_pos": next_token.pos_,
            "next_token_lemma_": next_token.lemma_,
            "next_token_ent_iob": next_token.ent_iob,
            "next_token_is_alpha": next_token.is_alpha,
            "next_token_is_digit": next_token.is_digit,
            "next_token_is_lower": next_token.is_lower,
            "next_token_is_upper": next_token.is_upper,
            "next_token_is_title": next_token.is_title,
            "next_token_is_punct": next_token.is_punct,
            "next_token_dep_": next_token.is_punct,  # NOTE: stores is_punct again, not next_token.dep_
            "next_token_end_sen": int(next_token.text == 'end_sent'),
            "next_token_num": i+1,
            
#             "next_next_token_pos": next_next_token.pos_,
#             "next_next_token_lemma_": next_next_token.lemma_,
#             "next_next_token_ent_iob": next_next_token.ent_iob,
#             "next_next_token_is_alpha": next_next_token.is_alpha,
#             "next_next_token_is_digit": next_next_token.is_digit,
#             "next_next_token_is_lower": next_next_token.is_lower,
#             "next_next_token_is_upper": next_next_token.is_upper,
#             "next_next_token_is_title": next_next_token.is_title,
#             "next_next_token_is_punct": next_next_token.is_punct,
#             "next_next_token_dep_": next_next_token.is_punct,
#             "next_next_token_end_sen": int(next_next_token.text == 'end_end_sent')
        }
        lst.append(features)
    return lst
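The prev/cur/next window used above can be sketched with surface features only, working directly on the given token list. This is a simplified illustration, not the spaCy-based extractor: `window_features`, `surface` and `PAD` are hypothetical names, and no POS tags, lemmas, or entity tags are produced.

```python
PAD = "<pad>"  # hypothetical sentinel standing in for start_sent / end_sent


def surface(prefix, token):
    """Orthographic features of one token, keyed by its window position."""
    return {
        f"{prefix}_lower": token.lower(),
        f"{prefix}_is_title": token.istitle(),
        f"{prefix}_is_upper": token.isupper(),
        f"{prefix}_is_alpha": token.isalpha(),
        f"{prefix}_is_digit": token.isdigit(),
    }


def window_features(tokens):
    """Build one feature dict per input token from a 3-token window."""
    padded = [PAD] + list(tokens) + [PAD]
    rows = []
    for i in range(1, len(padded) - 1):
        row = {}
        row.update(surface("prev", padded[i - 1]))
        row.update(surface("cur", padded[i]))
        row.update(surface("next", padded[i + 1]))
        row["prev_is_pad"] = padded[i - 1] == PAD
        row["next_is_pad"] = padded[i + 1] == PAD
        rows.append(row)
    return rows


rows = window_features(["me", "Let", "'s"])
print(len(rows), rows[0]["next_is_title"])
```

Because the window runs over the original token list rather than over a re-tokenized string, this variant always yields exactly one row per token, so the rows line up with the labels.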
        
    

###### train data sentence count

In [18]:
# lst = []

# for x in tqdm(df_wiki['text'].values):
#     lst.append([sen for sen in nlp(x).sents])

# df_wiki['sentences'] = lst
# df_wiki['sen_num'] = df_wiki.sentences.map(len)

In [19]:
df_wiki = pd.read_csv('simple_wiki.csv')

In [20]:
df = df_wiki.loc[df_wiki.sen_num < 4].sample(20000)
df['text'] = df['text'].map(lambda x: re.sub(r'\s+', " ", re.sub(r"(\(.+?\))", " ", x)))
print(df.shape)

(20000, 2)


In [21]:
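train_frames = []
failed_sentences = []

for text in tqdm(df.text.values):
    try:
        tokens, labels = prepare_train(text)
        tmp_df = pd.DataFrame(make_features(tokens))
        # Raises when re-tokenization in make_features changes the token count.
        tmp_df['target'] = labels
        train_frames.append(tmp_df)
    except Exception:
        failed_sentences.append(text)

# Concatenate once: DataFrame.append in a loop is quadratic and was removed in pandas 2.0.
train_df = pd.concat(train_frames)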

100%|██████████| 20000/20000 [37:06<00:00,  8.98it/s]


In [22]:
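# NOTE: the denominator is the number of token rows in train_df, not the number of texts (len(df)).
print(f"Percentage of failed sentences: {len(failed_sentences)/ train_df.shape[0]*100:.2f} %")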

Percentage of failed sentences: 0.01 %


In [23]:
test_frames = []

for sample in tqdm(test_js):
    try:
        tokens, labels = [item[0] for item in sample], [item[1] for item in sample]
        tmp_df = pd.DataFrame(make_features(tokens))
        tmp_df['target'] = labels
        test_frames.append(tmp_df)
    except Exception:
        # Stop at the first sample whose features do not align with its labels.
        print(sample)
        break

test_df = pd.concat(test_frames)

100%|██████████| 200/200 [00:02<00:00, 80.89it/s]


In [24]:
train_df.target.value_counts(normalize=True)

False    0.971872
True     0.028128
Name: target, dtype: float64

In [25]:
test_df.target.value_counts(normalize=True)

False    0.967
True     0.033
Name: target, dtype: float64

In [26]:
# train_df.columns

In [27]:
train_df[[item for item in train_df.columns if item.startswith('cur')]+['target']].head()

Unnamed: 0,cur_token_pos,cur_token_lemma_,cur_token_ent_iob,cur_token_is_alpha,cur_token_is_digit,cur_token_is_lower,cur_token_is_upper,cur_token_is_title,cur_token_is_punct,cur_token_dep_,cur_token_num,target
0,ADJ,french,3,True,False,False,False,True,False,False,1,False
1,NOUN,troop,2,True,False,True,False,False,False,False,2,False
2,VERB,land,2,True,False,True,False,False,False,False,3,False
3,ADP,on,2,True,False,True,False,False,False,False,4,False
4,PROPN,Elba,3,True,False,False,False,True,False,False,5,False


### Train model

In [28]:
from sklearn.metrics import classification_report, f1_score, make_scorer

from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, KFold, StratifiedKFold

In [29]:
RANDOM_STATE = 0

In [30]:
# train_df.to_csv('train_df.csv', index=False)

In [31]:
X_train, X_dev, y_train, y_dev = train_test_split(train_df.drop(columns='target'), train_df['target'],
                                                  test_size=0.2,
                                                  random_state=RANDOM_STATE,
                                                  stratify=train_df['target'])
X_test, y_test = test_df.drop(columns='target'), test_df['target']

In [32]:
print(X_train.shape[0], X_dev.shape[0])

542605 135652


In [33]:
vectorizer = DictVectorizer()
X_train_vec = vectorizer.fit_transform(X_train.to_dict('records'))
X_dev_vec = vectorizer.transform(X_dev.to_dict('records'))
X_test_vec = vectorizer.transform(X_test.to_dict('records'))

In [34]:
class_weights = (1 / y_train.value_counts(normalize=True)).to_dict()
class_weights

{False: 1.0289413152350557, True: 35.552679858472025}
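The weights above are the inverse of each class's relative frequency, so the rare `True` class counts roughly 35 times more than `False`. The same computation in plain Python, as a sketch; `inverse_frequency_weights` is a hypothetical name:

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total_count / class_count (1 / relative frequency)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / count for cls, count in counts.items()}


labels = [False, False, False, True]
print(inverse_frequency_weights(labels))
```

scikit-learn's `class_weight='balanced'` option uses `n_samples / (n_classes * count)`, which differs from these weights only by the constant factor `1 / n_classes`.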

#### solo lr/svc model

In [35]:
def f1_macro(y_true, y_pred):
    return f1_score(y_true, y_pred, average='macro')

def train_eval(clf):
    clf.fit(X_train_vec, y_train)
    y_pred = clf.predict(X_test_vec)
    print("f1 macro:", f1_macro(y_test, y_pred))
#     print(clf)
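Macro F1 is the unweighted mean of the per-class F1 scores, so the rare `True` class contributes as much as the dominant `False` class. A hand-rolled sketch of the metric for illustration; `f1_for_class` and `macro_f1` are hypothetical names, and the notebook itself relies on `f1_score(..., average='macro')`.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating `cls` as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that occur."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [True, False, False, False]
y_pred = [True, True, False, False]
print(macro_f1(y_true, y_pred))
```

In this toy case the `True` class has F1 = 2/3 and the `False` class has F1 = 4/5, so a single false positive already pulls the macro average well below the accuracy of 0.75 on the majority class alone.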

In [36]:
reg_interval = [1, 5, 10, 25]

for C in reg_interval:
    print(C)
#     train_eval(LinearSVC(C=C, class_weight=class_weights, max_iter=2000))
    train_eval(LogisticRegression(C=C, class_weight=class_weights, max_iter=2000))

1


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


f1 macro: 0.7403618599476937
5




f1 macro: 0.7486201933888015
10




f1 macro: 0.7551480120196254
25
f1 macro: 0.7566385138706597




In [37]:
# model = LinearSVC(C=10, class_weight=class_weights, max_iter=2000)
# model = LogisticRegression(C=10, solver='liblinear', penalty='l1', class_weight=class_weights, max_iter=2000)
model = LogisticRegression(C=10, class_weight=class_weights, max_iter=2000)


model.fit(X_train_vec, y_train);

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [38]:
y_pred = model.predict(X_dev_vec)

print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

       False       0.99      0.97      0.98    131836
        True       0.44      0.79      0.57      3816

    accuracy                           0.97    135652
   macro avg       0.72      0.88      0.77    135652
weighted avg       0.98      0.97      0.97    135652



In [39]:
y_pred = model.predict(X_test_vec)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.99      0.96      0.97      4542
        True       0.39      0.85      0.54       155

    accuracy                           0.95      4697
   macro avg       0.69      0.90      0.76      4697
weighted avg       0.97      0.95      0.96      4697



In [40]:
import eli5

eli5.show_weights(model, vec=vectorizer)



Weight?,Feature
+19.233,prev_token_lemma_=GCIE
+16.891,prev_token_lemma_=François
+16.706,prev_token_lemma_=trouble
+16.550,prev_token_lemma_=Silas
+15.746,next_token_lemma_=bomis
+14.947,next_token_lemma_=kingman
+14.625,next_token_lemma_=hoc
+14.125,next_token_lemma_=sequel
+13.706,prev_token_lemma_=Len
+13.532,prev_token_lemma_=Stendal


#### pipeline with svc

In [None]:
dict_vect = DictVectorizer()

pipeline_svc = Pipeline([
    ("main_union", FeatureUnion([
        ("pipe1", Pipeline([
            ('dict_vect', dict_vect),
        ])),
        ("pipe2", Pipeline([
            ('dict_vect', dict_vect),
            ("SVD", TruncatedSVD())
        ])),
    ])),
    ('LinearSVC', LinearSVC(class_weight=class_weights))
])

distributions = {
    "LinearSVC__C": [1, 5, 10, 20],
    "main_union__pipe2__SVD__n_components": [200, 300, 400, 500]
}

svc_pipe = RandomizedSearchCV(pipeline_svc,
                         distributions,
                         random_state=RANDOM_STATE,
                         scoring=make_scorer(f1_macro),
                         n_iter=10,
                         cv=5,
                         verbose=5,
                         n_jobs=-1)
search = svc_pipe.fit(X_train.to_dict('records'), y_train)
print(search.best_params_, search.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


In [None]:
y_pred = svc_pipe.predict(X_dev.to_dict('records'))

print(classification_report(y_dev, y_pred))

In [None]:
y_pred = svc_pipe.predict(X_test.to_dict('records'))

print(classification_report(y_test, y_pred))

#### pipeline with lr

In [None]:
dict_vect = DictVectorizer()

pipeline_lr = Pipeline([
    ("main_union", FeatureUnion([
        ("pipe1", Pipeline([
            ('dict_vect', dict_vect),
        ])),
        ("pipe2", Pipeline([
            ('dict_vect', dict_vect),
            ("SVD", TruncatedSVD())
        ])),
    ])),
#     ("LogReg", LogisticRegression(max_iter=1000, class_weight=class_weights))
    ("LogReg", LogisticRegression(max_iter=2000, solver='liblinear', class_weight=class_weights))
])

distributions = {
    "LogReg__C": [1, 5, 10, 20],
    "LogReg__penalty": ["l1"],
    "main_union__pipe2__SVD__n_components": [200, 300, 400, 500]
}

lr_pipe = RandomizedSearchCV(pipeline_lr,
                         distributions,
                         random_state=RANDOM_STATE,
                         scoring=make_scorer(f1_macro),
                         n_iter=10,
                         cv=5,
                         verbose=5,
                         n_jobs=-1)
search = lr_pipe.fit(X_train.to_dict('records'), y_train)
print(search.best_params_, search.best_score_)

In [None]:
y_pred = lr_pipe.predict(X_dev.to_dict('records'))

print(classification_report(y_dev, y_pred))

In [None]:
y_pred = lr_pipe.predict(X_test.to_dict('records'))

print(classification_report(y_test, y_pred))

#### lightgbm solo model

In [42]:
from lightgbm import LGBMClassifier, plot_importance

In [43]:
def lgb_fscore(y_true, y_pred):
    y_pred = np.round(y_pred)
    res = f1_score(y_true, y_pred, average='macro')
    return 'macro_f1', res, True

In [48]:
params = {
    'objective': 'binary',
    'num_rounds': 5000,
    'max_depth': -1,
    'learning_rate': 0.01,
    'num_leaves': 31,
    'verbose': 100,
    'early_stopping_rounds': 200,
    'min_data_in_leaf': 30,
    'lambda_l2': 0.9,
    'feature_fraction': 0.5,
    'metric': 'custom',
    'imbalance': True,  # NOTE: not a LightGBM parameter name (the documented one is is_unbalance), so it has no effect
    'random_state': RANDOM_STATE
}


lgb_clf = LGBMClassifier(**params)

In [49]:
lgb_clf.fit(
    X=X_train_vec,
    y=y_train,
    eval_set=[(X_dev_vec, y_dev)],
    verbose=params['verbose'],
    eval_metric=lgb_fscore,
#     sample_weight=y_train.map(class_weights).values
)

Training until validation scores don't improve for 200 rounds
[100]	valid_0's macro_f1: 0.763776
[200]	valid_0's macro_f1: 0.857911
[300]	valid_0's macro_f1: 0.879627
[400]	valid_0's macro_f1: 0.884272
[500]	valid_0's macro_f1: 0.88722
[600]	valid_0's macro_f1: 0.889048
[700]	valid_0's macro_f1: 0.890434
[800]	valid_0's macro_f1: 0.892111
[900]	valid_0's macro_f1: 0.892357
[1000]	valid_0's macro_f1: 0.894231
[1100]	valid_0's macro_f1: 0.894774
[1200]	valid_0's macro_f1: 0.895287
[1300]	valid_0's macro_f1: 0.895437
[1400]	valid_0's macro_f1: 0.895557
[1500]	valid_0's macro_f1: 0.896606
[1600]	valid_0's macro_f1: 0.897205
[1700]	valid_0's macro_f1: 0.897503
[1800]	valid_0's macro_f1: 0.897772
[1900]	valid_0's macro_f1: 0.898159
[2000]	valid_0's macro_f1: 0.898547
[2100]	valid_0's macro_f1: 0.898637
[2200]	valid_0's macro_f1: 0.898844
[2300]	valid_0's macro_f1: 0.898904
[2400]	valid_0's macro_f1: 0.899111
[2500]	valid_0's macro_f1: 0.89935
[2600]	valid_0's macro_f1: 0.899678
[2700]	valid_

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               early_stopping_rounds=200, feature_fraction=0.5, imbalance=True,
               importance_type='split', lambda_l2=0.9, learning_rate=0.01,
               max_depth=-1, metric='custom', min_child_samples=20,
               min_child_weight=0.001, min_data_in_leaf=30, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, num_rounds=5000,
               objective='binary', random_state=0, reg_alpha=0.0,
               reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0, verbose=100)

In [50]:
y_pred = lgb_clf.predict(X_dev_vec)

print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

       False       0.99      1.00      0.99    131836
        True       0.91      0.72      0.81      3816

    accuracy                           0.99    135652
   macro avg       0.95      0.86      0.90    135652
weighted avg       0.99      0.99      0.99    135652



In [51]:
y_pred = lgb_clf.predict(X_test_vec)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.99      0.99      0.99      4542
        True       0.81      0.69      0.75       155

    accuracy                           0.98      4697
   macro avg       0.90      0.84      0.87      4697
weighted avg       0.98      0.98      0.98      4697



#### lightgbm out-of-fold

In [52]:
use_sample_weight = True
N_FOLDS = 4
num_threads = 12
SVD_n_comp = 300

In [53]:
params = {
    'objective': 'binary',
    'num_rounds': 5000,
    'max_depth': -1,
    'learning_rate': 0.01,
    'num_leaves': 31,
    'verbose': 100,
    'early_stopping_rounds': 200,
    'min_data_in_leaf': 30,
    'lambda_l2': 0.9,
    'feature_fraction': 0.5,
    'metric': 'custom',
    'imbalance': True,  # NOTE: not a LightGBM parameter name (the documented one is is_unbalance), so it has no effect
    'random_state': RANDOM_STATE
}


classifier = LGBMClassifier(**params)

In [54]:
strategy = StratifiedKFold(n_splits=N_FOLDS, random_state=RANDOM_STATE, shuffle=True)

In [55]:
train = train_df

sample_weight = train_df.target.map(class_weights).values

test = test_df

In [56]:
pred_oof = np.zeros(len(train), dtype=np.float32)
pred_test = np.zeros((len(test), 2, N_FOLDS), dtype=np.float32)
fold_metrics = np.zeros(N_FOLDS)

In [57]:
dict_vect = DictVectorizer()

for i, (tr_ind, val_ind) in enumerate(strategy.split(X=np.ones(len(train)), y=train['target'])):
    print(f'Fold: {i + 1}\n\tTrain len: {len(tr_ind)}\n\tVal len: {len(val_ind)}')
    pipe = Pipeline([
            ('dict_vect', dict_vect),
            ("SVD", TruncatedSVD(n_components=SVD_n_comp))
        ])
    pipe.fit(train.iloc[tr_ind].drop(columns='target').to_dict('records'))
    
    X = pipe.transform(train.iloc[tr_ind].drop(columns='target').copy().to_dict('records'))
    y = train.iloc[tr_ind]['target'].copy()
    X_val = pipe.transform(train.iloc[val_ind].drop(columns='target').copy().to_dict('records'))
    y_val = train.iloc[val_ind]['target'].copy()
    X_test_ = pipe.transform(test.drop(columns='target').to_dict('records'))
    
    # fit model
    print('\tFITTING MODEL...')
    classifier.fit(
        X=X,
        y=y,
        eval_set=[(X_val, y_val)],
        early_stopping_rounds=params['early_stopping_rounds'],
        verbose=params['verbose'],
        eval_metric=lgb_fscore,
        sample_weight=sample_weight[tr_ind] if use_sample_weight else None,
    )
    # predict OOF val
    print('\tPREDICT OOF...')
    pred_oof[val_ind] = classifier.predict(X_val, num_threads=num_threads)
    # predict test
    print('\tPREDICTING TEST...')
    pred_test[..., i] = classifier.predict_proba(
        X_test_, num_threads=num_threads)
    fold_metrics[i] = f1_macro(y_val, pred_oof[val_ind])
    print(f'\tFold score: {fold_metrics[i]}')

Fold: 1
	Train len: 508692
	Val len: 169565
	FITTING MODEL...




Training until validation scores don't improve for 200 rounds
[100]	valid_0's macro_f1: 0.728239
[200]	valid_0's macro_f1: 0.726031
Early stopping, best iteration is:
[78]	valid_0's macro_f1: 0.728298
	PREDICT OOF...
	PREDICTING TEST...
	Fold score: 0.7282982717086455
Fold: 2
	Train len: 508693
	Val len: 169564
	FITTING MODEL...




Training until validation scores don't improve for 200 rounds
[100]	valid_0's macro_f1: 0.723507
[200]	valid_0's macro_f1: 0.725145
[300]	valid_0's macro_f1: 0.72675
[400]	valid_0's macro_f1: 0.731281
[500]	valid_0's macro_f1: 0.733882
[600]	valid_0's macro_f1: 0.73667
[700]	valid_0's macro_f1: 0.739541
[800]	valid_0's macro_f1: 0.742379
[900]	valid_0's macro_f1: 0.744117
[1000]	valid_0's macro_f1: 0.746279
[1100]	valid_0's macro_f1: 0.748865
[1200]	valid_0's macro_f1: 0.751438
[1300]	valid_0's macro_f1: 0.754446
[1400]	valid_0's macro_f1: 0.756635
[1500]	valid_0's macro_f1: 0.759155
[1600]	valid_0's macro_f1: 0.761133
[1700]	valid_0's macro_f1: 0.762852
[1800]	valid_0's macro_f1: 0.764939
[1900]	valid_0's macro_f1: 0.767273
[2000]	valid_0's macro_f1: 0.769663
[2100]	valid_0's macro_f1: 0.772109
[2200]	valid_0's macro_f1: 0.774647
[2300]	valid_0's macro_f1: 0.777027
[2400]	valid_0's macro_f1: 0.778573
[2500]	valid_0's macro_f1: 0.780926
[2600]	valid_0's macro_f1: 0.783346
[2700]	valid_



Training until validation scores don't improve for 200 rounds
[100]	valid_0's macro_f1: 0.728072
[200]	valid_0's macro_f1: 0.72799
Early stopping, best iteration is:
[48]	valid_0's macro_f1: 0.733978
	PREDICT OOF...
	PREDICTING TEST...
	Fold score: 0.7339781986136253
Fold: 4
	Train len: 508693
	Val len: 169564
	FITTING MODEL...




Training until validation scores don't improve for 200 rounds
[100]	valid_0's macro_f1: 0.722005
[200]	valid_0's macro_f1: 0.724913
[300]	valid_0's macro_f1: 0.724944
[400]	valid_0's macro_f1: 0.724871
[500]	valid_0's macro_f1: 0.727131
[600]	valid_0's macro_f1: 0.731255
[700]	valid_0's macro_f1: 0.733473
[800]	valid_0's macro_f1: 0.736143
[900]	valid_0's macro_f1: 0.739605
[1000]	valid_0's macro_f1: 0.742444
[1100]	valid_0's macro_f1: 0.744458
[1200]	valid_0's macro_f1: 0.746942
[1300]	valid_0's macro_f1: 0.749609
[1400]	valid_0's macro_f1: 0.752439
[1500]	valid_0's macro_f1: 0.754606
[1600]	valid_0's macro_f1: 0.756617
[1700]	valid_0's macro_f1: 0.758341
[1800]	valid_0's macro_f1: 0.760655
[1900]	valid_0's macro_f1: 0.763066
[2000]	valid_0's macro_f1: 0.765127
[2100]	valid_0's macro_f1: 0.767326
[2200]	valid_0's macro_f1: 0.7702
[2300]	valid_0's macro_f1: 0.772894
[2400]	valid_0's macro_f1: 0.774957
[2500]	valid_0's macro_f1: 0.777247
[2600]	valid_0's macro_f1: 0.779463
[2700]	valid_

In [58]:
print('Total score: ', f1_macro(train['target'], pred_oof))

Total score:  0.7711281899568938


In [59]:
y_pred_raw = pred_test.mean(axis=-1)
y_pred = y_pred_raw.argmax(axis=1).astype(np.int32)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.99      0.96      0.98      4542
        True       0.45      0.84      0.58       155

    accuracy                           0.96      4697
   macro avg       0.72      0.90      0.78      4697
weighted avg       0.98      0.96      0.97      4697
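The test prediction above averages the per-fold class probabilities and then picks the more probable class. A minimal pure-Python sketch of the same idea, assuming each fold yields a `[p_false, p_true]` pair per sample; `average_folds` is a hypothetical name:

```python
def average_folds(fold_probs):
    """Average class probabilities over folds and pick the likelier class.

    fold_probs[k][i] is [p_false, p_true] from fold k for sample i.
    """
    n_folds = len(fold_probs)
    n_samples = len(fold_probs[0])
    averaged = [
        [sum(fold_probs[k][i][c] for k in range(n_folds)) / n_folds for c in range(2)]
        for i in range(n_samples)
    ]
    # Ties go to class 0, matching argmax, which returns the first maximum.
    labels = [int(p_true > p_false) for p_false, p_true in averaged]
    return averaged, labels


folds = [
    [[0.9, 0.1], [0.4, 0.6]],  # fold 1
    [[0.7, 0.3], [0.2, 0.8]],  # fold 2
]
averaged, labels = average_folds(folds)
print(labels)
```

Averaging probabilities rather than hard votes lets a confident fold outweigh an uncertain one, which matters here because the folds stopped at very different iterations.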

