In [118]:
import os

from IPython.display import display, Markdown

In [2]:
REPO_PATH = os.path.dirname(os.path.dirname(os.path.dirname(os.getcwd())))
TASK_PATH = os.path.join(REPO_PATH, "tasks", "06-language-as-sequence")

In [3]:
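def show_markdown(path):
    """Render a Markdown file inline in the notebook."""
    # The task description is not ASCII, so the encoding is set explicitly
    with open(path, 'r', encoding='utf-8') as md:
        content = md.read()
    display(Markdown(content))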

In [4]:
show_markdown(os.path.join(TASK_PATH, "06-language-as-sequence.md"))

# Language as a Sequence

## Run-on Sentences

### 1. Domain

This week you will work on an error-correction task.

A run-on sentence is a sentence made of two or more sentences joined without proper punctuation. The error is often made mechanically when typing quickly, but it also comes from insufficient knowledge of the language. It is especially common in online communication.

For example:
```
Thanks for talking to me let's meet again tomorrow Bye.
```

This sentence is actually three sentences run together. The correct version:
```
Thanks for talking to me. Let's meet again tomorrow. Bye.
```

Detecting run-on sentences matters for more than error correction. The error also degrades the quality of named entity recognition, machine translation, sentiment target detection, and similar tasks.

More information and examples can be found at these links:
- <http://www.bristol.ac.uk/arts/exercises/grammar/grammar_tutorial/page_37.htm>
- <https://www.english-grammar-revolution.com/run-on-sentence.html>
- <https://www.quickanddirtytips.com/education/grammar/what-are-run-on-sentences>

### 2. Classifier

Data:
- Generate training data for the model from open corpora. The training data will be a set of run-on sentences. Keep in mind that several sentences may be joined (usually 2, but sometimes 3-4), and that the first word of the following sentence may start with an uppercase or a lowercase letter.
- Find a publicly available database of n-grams at the word or part-of-speech level, or collect one yourself. Note that public n-gram databases usually contain statistics gathered within sentences, so they may not contain n-grams that span a sentence boundary.
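
Since ready-made n-gram tables rarely cover sentence boundaries, one option is to count cross-boundary bigrams yourself. The following is a minimal sketch, assuming the corpus has already been split into tokenized sentences; the function name `boundary_bigrams` and the toy corpus are illustrative and not part of the assignment.

```python
from collections import Counter


def boundary_bigrams(paragraphs):
    """Count (last word of a sentence, first word of the next sentence) pairs.

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    sentences; each sentence is a list of word tokens without the final period.
    """
    counts = Counter()
    for sentences in paragraphs:
        # Pair every sentence with the one that follows it
        for left, right in zip(sentences, sentences[1:]):
            if left and right:
                counts[(left[-1].lower(), right[0].lower())] += 1
    return counts


toy_corpus = [
    [["Thanks", "for", "talking", "to", "me"],
     ["Let", "'s", "meet", "again", "tomorrow"],
     ["Bye"]],
]
counts = boundary_bigrams(toy_corpus)
print(counts.most_common())
```

Counts like these could be compared with within-sentence bigram counts to estimate how likely a given word pair is to straddle a boundary.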

Testing:
- Write a baseline solution and a metric for evaluating quality.
- For testing, use the corpus [run-on-test.json](run-on-test.json). Corpus format:
```
[
  [
    ["Thanks", false],
    ["for", false],
    ["talking", false],
    ["to", false],
    ["me", true],
    ["let", false],
    ["'s", false],
    ["meet", false],
    ["again", false],
    ["tomorrow", true],
    ["Bye", false],
    [".", false]
  ],
...
]
```

`true` marks a word after which a period must be added. The test corpus contains 200 tokenized sentences (~4,700 tokens). 3% of the tokens have class `true` and the rest have class `false`. Note that the corpus is already tokenized.
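
To make the label semantics concrete, here is a small sketch that turns one labelled sample back into corrected text. The helper name `restore_sentences` is hypothetical, and capitalizing the word after an inserted period is an assumption made for readability; the corpus format itself only says where the period goes.

```python
def restore_sentences(sample):
    """Rebuild text from [token, label] pairs, adding a period after `true` tokens."""
    out = []
    capitalize_next = False
    for token, ends_sentence in sample:
        if capitalize_next:
            # Assumption: the word after an inserted period is capitalized
            token = token[:1].upper() + token[1:]
            capitalize_next = False
        out.append(token)
        if ends_sentence:
            out.append(".")
            capitalize_next = True
    # Tokens are joined with spaces; no detokenization is attempted
    return " ".join(out)


sample = [["Thanks", False], ["for", False], ["talking", False], ["to", False],
          ["me", True], ["let", False], ["'s", False], ["meet", False],
          ["again", False], ["tomorrow", True], ["Bye", False], [".", False]]
print(restore_sentences(sample))
# prints: Thanks for talking to me . Let 's meet again tomorrow . Bye .
```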

Classifier:
- Identify features that influence whether a word is at a sentence boundary. For example:
  - right/left context;
  - the spelling of the word;
  - grammatical features (can a sentence end with a conjunction?);
  - n-grams (do this word and the next one often occur together? are these two parts of speech likely to be adjacent?);
  - depth of the syntactic tree or the lowest common ancestor;
  - your own ideas.
- Build a classifier based on logistic regression or conditional random fields (CRF) that labels the words of a sentence sequentially as sentence-final or not.
- Try to improve the classifier's quality by changing the set or combination of features.
- **Important:** while improving the classifier, check its quality on your own data (train/test split or cross-validation).
- Determine the final quality of the classifier on the test set.
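
The context and spelling features from the list above can be sketched without any NLP library. This is only an illustration: the function name `token_features`, the feature names, and the `<s>` / `</s>` padding symbols are assumptions. Dictionaries of this shape are the kind of input that scikit-learn's `DictVectorizer` accepts.

```python
def token_features(tokens, i):
    """Surface features for tokens[i] and its left/right neighbours."""
    def describe(prefix, token):
        return {
            f"{prefix}_lower": token.lower(),
            f"{prefix}_is_title": token.istitle(),
            f"{prefix}_is_upper": token.isupper(),
            f"{prefix}_is_alpha": token.isalpha(),
        }

    features = describe("cur", tokens[i])
    # Padding symbols stand in for the missing neighbours at the edges
    prev_token = tokens[i - 1] if i > 0 else "<s>"
    next_token = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
    features.update(describe("prev", prev_token))
    features.update(describe("next", next_token))
    return features


tokens = ["Thanks", "for", "talking", "to", "me", "let", "'s", "meet"]
rows = [token_features(tokens, i) for i in range(len(tokens))]
print(rows[4])
```

Grammatical and syntactic features (part of speech, dependency depth) would need a parser on top of this.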

Write down your observations and results in a separate file.

### Useful links

- [CRF tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html)
- [Google ngrams](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) (and [how to download](https://pypi.org/project/google-ngram-downloader/))
- [Google syntactic ngrams](http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html)
- [1 mln of 2/3/4/5-ngrams from COCA](https://www.ngrams.info/download_coca.asp)

### Grading

100% for the assignment.

### Deadline

18.04.2020


## Data preparation

- Wiki: <https://dumps.wikimedia.org/simplewiki/latest/>
- Brown: <https://www.kaggle.com/nltkdata/brown-corpus>
- Some corpus: <https://www.kaggle.com/espn56/english-corpus>

In [139]:
from tqdm import tqdm
import pandas as pd
import numpy as np
import json
import re

### Simple wiki

In [6]:
# ! python WikiExtractor.py simplewiki-latest-pages-articles-multistream.xml -o wiki_articles

In [7]:
# lst = []

# for folder in os.listdir('wiki_articles'):
#     for fn in tqdm(os.listdir(f"wiki_articles/{folder}")):
#         with open(f'wiki_articles/{folder}/{fn}') as file_:
#             res = file_.readlines()
#             res = "".join(res)
#             res = BeautifulSoup(res, 'lxml')
#             res = list(filter(lambda x: len(x.split()) > 10, res.get_text().split("\n")))
#             lst.extend(res)

# df = pd.DataFrame(lst, columns=['text'])
# df.to_csv("simple_wiki.txt", index=False)

In [8]:
df_wiki = pd.read_csv("simple_wiki.txt")
df_wiki.shape

(428708, 1)

In [9]:
df_wiki.head()

Unnamed: 0,text
0,Jean Bercher (known as Dauberval or D'Auberval...
1,Cri-Cri is a fictional talking cricket. The ch...
2,The character was created by Gabilondo Soler w...
3,It was made into a movie that was released on ...
4,"Baldwin I († 879), also known as ""Baldwin Iron..."


### Brown

In [None]:
# df_brown = pd.read_csv("brown.csv")[['tokenized_text']]
# df_brown.shape

In [None]:
# df_brown.head()

### Some unknown corpus

In [None]:
# with open("corpus.txt") as file_:
#     res = file_.readlines()
    
# lst = []
# tmp = ''

# for item in res:
#     if not item.strip().endswith("."):
#         lst.append(tmp)
#         tmp = ''
#     else:
#         tmp += " " + item.strip()

# df_small = pd.DataFrame(lst, columns=['text'])
# df_small['text'] = df_small['text'].map(lambda x: x.strip())
# df_small = df_small.loc[df_small.text.str.len() > 0]

# df_small.to_csv("small_corpus.txt", index=False)

In [20]:
df_small = pd.read_csv("small_corpus.txt")

In [21]:
df_small.head()

Unnamed: 0,text
0,"When the shouting ended , the bill passed , 11..."
1,"-- A Houston teacher , now serving in the Legi..."
2,-- Principals of the 13 schools in the Denton ...
3,"The monthly cost of ADC to more than 100,000 r..."
4,Several defendants in the Summerdale police bu...


In [None]:
df_small.sample().values[0][0]

### Test data

In [188]:
with open(os.path.join(TASK_PATH, 'run-on-test.json')) as file_:
    test_js = json.load(file_)

In [40]:
tmp = " ".join([item[0] for item in test_js[1]])

Number of sentence splits per sample:

In [203]:
pd.Series([sum(label for _, label in sample) for sample in test_js]).value_counts()

1    145
0     50
2      5
dtype: int64

In [215]:
lst = []
for sample in test_js:
    lst.extend([int(item[1]) for item in sample])
pd.Series(lst).value_counts(normalize=False)

0    4542
1     155
dtype: int64
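
With counts this skewed, accuracy says little. The arithmetic below is a small worked check using only the two counts printed above: it shows what a classifier that always predicts `False` would score, which is why macro F1 is the more informative metric here.

```python
negatives, positives = 4542, 155  # token counts printed above
total = negatives + positives

positive_share = positives / total
majority_accuracy = negatives / total

# All-False baseline: recall on False is 1.0, precision equals majority_accuracy
f1_false = 2 * majority_accuracy * 1.0 / (majority_accuracy + 1.0)
f1_true = 0.0  # no True token is ever predicted
macro_f1 = (f1_false + f1_true) / 2

print(f"positive share:    {positive_share:.3f}")
print(f"majority accuracy: {majority_accuracy:.3f}")
print(f"macro F1:          {macro_f1:.3f}")
```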

In [214]:
# Count whether the word that follows a sentence boundary is capitalized
up_count = 0
low_count = 0

for sample in test_js:
    for index, (token, is_boundary) in enumerate(sample):
        if is_boundary:
            if sample[index + 1][0][0].isupper():
                up_count += 1
            else:
                low_count += 1
print(up_count, low_count)

75 80
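
Only 75 of the 155 boundaries are followed by a capitalized word, so a capitalization rule alone can recover at most about half of them. The sketch below shows such a rule as a possible baseline, together with a token-level metric for the `True` class. The names `capital_baseline` and `precision_recall_f1` are illustrative, and excluding the pronoun "I" is an assumption made to limit false alarms.

```python
def capital_baseline(tokens):
    """Predict a boundary after token i when token i+1 starts with an uppercase letter."""
    preds = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        nxt = tokens[i + 1]
        # "I" is always capitalized, so it is excluded (an assumption of this sketch)
        if nxt[:1].isupper() and nxt != "I":
            preds[i] = True
    return preds


def precision_recall_f1(gold, pred):
    """Token-level scores for the True (boundary) class."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


sample = [["Thanks", False], ["for", False], ["talking", False], ["to", False],
          ["me", True], ["let", False], ["'s", False], ["meet", False],
          ["again", False], ["tomorrow", True], ["Bye", False], [".", False]]
tokens = [token for token, _ in sample]
gold = [label for _, label in sample]
pred = capital_baseline(tokens)
print(precision_recall_f1(gold, pred))
```

On this sample the rule finds the boundary before "Bye" and misses the one before the lowercase "let". On real text it would also fire on proper nouns in the middle of a sentence.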


In [525]:
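frames = []

for sample in tqdm(test_js):
    try:
        tokens, labels = [item[0] for item in sample], [item[1] for item in sample]
        # make_features is defined in a later cell (the notebook was run out of order)
        tmp_df = pd.DataFrame(make_features(tokens))
        tmp_df['target'] = labels
        frames.append(tmp_df)
    except Exception:
        print(sample)
        break

# DataFrame.append was removed in pandas 2.0; concatenate once instead
test_df = pd.concat(frames)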

100%|██████████| 200/200 [00:02<00:00, 79.28it/s]


In [500]:
print(len(tokens))

17


In [501]:
print(" ".join(tokens).replace(" 'm ", "'m "))

It 's hard I'm so proud of how we 've managed to balance and sacrifice .


In [502]:
nlp('doublecheck')[0].pos_

'VERB'

In [507]:
test_df

Unnamed: 0,prev_token_pos,prev_token_lemma_,prev_token_ent_iob,prev_token_is_alpha,prev_token_is_digit,prev_token_is_lower,prev_token_is_upper,prev_token_is_title,prev_token_is_punct,prev_token_dep_,...,next_token_ent_iob,next_token_is_alpha,next_token_is_digit,next_token_is_lower,next_token_is_upper,next_token_is_title,next_token_is_punct,next_token_dep_,next_token_end_sen,target
0,PROPN,start_sent,2,False,False,True,False,False,False,False,...,2,True,False,True,False,False,False,False,0,False
1,PRON,-PRON-,2,True,False,False,False,True,False,False,...,2,True,False,True,False,False,False,False,0,False
2,AUX,be,2,True,False,True,False,False,False,False,...,2,True,False,True,False,False,False,False,0,False
3,VERB,know,2,True,False,True,False,False,False,False,...,2,True,False,True,False,False,False,False,0,False
4,ADP,for,2,True,False,True,False,False,False,False,...,3,True,False,False,False,True,False,False,0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27,NOUN,monk,2,True,False,True,False,False,False,False,...,2,True,False,True,False,False,False,False,0,False
28,PUNCT,",",2,False,False,False,False,False,True,True,...,2,True,False,True,False,False,False,False,0,False
29,CCONJ,and,2,True,False,True,False,False,False,False,...,3,True,False,True,False,False,False,False,0,False
30,DET,an,2,True,False,True,False,False,False,False,...,2,False,False,False,False,False,True,True,0,False


In [504]:
pd.DataFrame(make_features(tokens))[['cur_token_lemma_']]

Unnamed: 0,cur_token_lemma_
0,-PRON-
1,be
2,hard
3,-PRON-
4,be
5,so
6,proud
7,of
8,how
9,-PRON-


In [422]:
len(tokens)

17

In [416]:
# train_df = pd.DataFrame()
# failed_sentences = []

# for text in tqdm(df.text.values):
#     try:
#         tokens, labels = prepare_train(text)
#         tmp_df = pd.DataFrame(make_features(tokens))
#         tmp_df['target'] = labels
#         train_df = train_df.append(tmp_df)
#     except:
#         failed_sentences.append(text)

100%|██████████| 10000/10000 [10:45<00:00, 15.48it/s]


In [456]:
pd.DataFrame(make_features(tokens))

Now , they say they believe that will likely happen during the current April-to-June period .


Unnamed: 0,prev_token_pos,prev_token_lemma_,prev_token_ent_iob,prev_token_is_alpha,prev_token_is_digit,prev_token_is_lower,prev_token_is_upper,prev_token_is_title,prev_token_is_punct,prev_token_dep_,...,next_token_lemma_,next_token_ent_iob,next_token_is_alpha,next_token_is_digit,next_token_is_lower,next_token_is_upper,next_token_is_title,next_token_is_punct,next_token_dep_,next_token_end_sen
0,PROPN,start_sent,2,False,False,True,False,False,False,False,...,",",2,False,False,False,False,False,True,True,0
1,ADV,now,2,True,False,False,False,True,False,False,...,-PRON-,2,True,False,True,False,False,False,False,0
2,PUNCT,",",2,False,False,False,False,False,True,True,...,say,2,True,False,True,False,False,False,False,0
3,PRON,-PRON-,2,True,False,True,False,False,False,False,...,-PRON-,2,True,False,True,False,False,False,False,0
4,VERB,say,2,True,False,True,False,False,False,False,...,believe,2,True,False,True,False,False,False,False,0
5,PRON,-PRON-,2,True,False,True,False,False,False,False,...,that,2,True,False,True,False,False,False,False,0
6,VERB,believe,2,True,False,True,False,False,False,False,...,will,2,True,False,True,False,False,False,False,0
7,DET,that,2,True,False,True,False,False,False,False,...,likely,2,True,False,True,False,False,False,False,0
8,VERB,will,2,True,False,True,False,False,False,False,...,happen,2,True,False,True,False,False,False,False,0
9,ADV,likely,2,True,False,True,False,False,False,False,...,during,2,True,False,True,False,False,False,False,0


In [414]:
len(tokens)

17

In [13]:
import spacy

In [469]:
nlp = spacy.load("en_core_web_md")

In [109]:
lst = []

for x in tqdm(df_wiki['text'].values):
    lst.append([sen for sen in nlp(x).sents])

100%|██████████| 428708/428708 [59:04<00:00, 120.94it/s] 


In [110]:
df_wiki['sentences'] = lst

In [112]:
df_wiki['sen_num'] = df_wiki.sentences.map(len)

In [121]:
df_wiki.loc[df_wiki.sen_num > 1].text.values[1]

'Cri-Cri is a fictional talking cricket. The character was created by Francisco Gabilondo Soler and introduced in 1934 on his own musical radio show in Mexico. Cri-Cri is known as "el grillito cantor", which is Spanish for "the singing cricket".'

In [123]:
# df_wiki.sen_num.value_counts()

In [127]:
# df = df_wiki.loc[df_wiki.sen_num < 4].sample(1000)

In [132]:
# df.shape

## temp

In [690]:
df = df_wiki.loc[df_wiki.sen_num < 4].sample(5000)

In [691]:
df['text'] = df['text'].map(lambda x: re.sub(r'\s+', " ", re.sub(r"(\(.+?\))", " ", x)))

In [692]:
doc = nlp(df.text.values[1])

In [693]:
df.text.values[1]

"Paul O'Neill was an American music composer, lyricist, producer, and songwriter."

In [694]:
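def prepare_train(text, zipped=False):
    """Glue the sentences of `text` into one run-on sample.

    Returns tokens and boolean labels; True marks the last token of a
    sentence that is followed by another sentence.
    """
    doc = nlp(text)
    sents = list(doc.sents)
    tokens = []
    labels = []
    for index, sen in enumerate(sents):
        tkns = [token.text for token in sen]
        # Lowercase the first word of a non-initial sentence half of the time
        if index > 0 and np.random.rand() > 0.5:
            tkns[0] = tkns[0].lower()
        # Drop the final token (assumed to be the period) of every sentence but the last
        if index != len(sents)-1:
            tkns = tkns[:-1]
        tokens.extend(tkns)
        lbls = [False] * (len(tkns) - 1) + [True]
        labels.extend(lbls)
    # The end of the whole sample is not a boundary to restore
    labels[-1] = False
    if sum(labels) != len(sents) - 1:
        raise ValueError(f"there is a problem with sentence=*{text}*")
    if zipped:
        return list(zip(tokens, labels))
    else:
        return tokens, labels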

In [695]:
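def make_features(tokens):
    """Build prev/current/next token features with spaCy for every token.

    Note: the text is re-tokenized by spaCy after normalization, so the number
    of rows can differ from len(tokens). The *_dep_ features below store
    is_punct rather than the dependency label (token.dep_).
    """
    # Drop hyphens and expand common clitics before parsing
    text = " ".join(tokens).replace("-", "")
    text = re.sub(r".*(\")\w", "", text)
#     for item in ["-", ",", "."]:
#         text = text.replace(" {item} ", "{item}")
    for k, v in [("'m", "am"), ("'s", "is"), ("'re", "are"), ("'ve", "have")]:
        text = text.replace(k, v)
#     print(text)

    # Sentinel tokens give the first and last word a neighbour
    text = "start_sent " + text + " end_sent"
    doc = nlp(text)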
    doc_range = len(doc)
    lst = []
    for i in range(1, doc_range - 1):
        prev_token = doc[i-1]
        cur_token = doc[i]
        next_token = doc[i+1]
        
        features = {
            "prev_token_pos": prev_token.pos_,
            "prev_token_lemma_": prev_token.lemma_,
            "prev_token_ent_iob": prev_token.ent_iob,
            "prev_token_is_alpha": prev_token.is_alpha,
            "prev_token_is_digit": prev_token.is_digit,
            "prev_token_is_lower": prev_token.is_lower,
            "prev_token_is_upper": prev_token.is_upper,
            "prev_token_is_title": prev_token.is_title,
            "prev_token_is_punct": prev_token.is_punct,
            "prev_token_dep_": prev_token.is_punct,
            "prev_token_start_sen": int(prev_token.text == 'start_sent'),
            
            "cur_token_pos": cur_token.pos_,
            "cur_token_lemma_": cur_token.lemma_,
            "cur_token_ent_iob": cur_token.ent_iob,
            "cur_token_is_alpha": cur_token.is_alpha,
            "cur_token_is_digit": cur_token.is_digit,
            "cur_token_is_lower": cur_token.is_lower,
            "cur_token_is_upper": cur_token.is_upper,
            "cur_token_is_title": cur_token.is_title,
            "cur_token_is_punct": cur_token.is_punct,
            "cur_token_dep_": cur_token.is_punct,
            
            "next_token_pos": next_token.pos_,
            "next_token_lemma_": next_token.lemma_,
            "next_token_ent_iob": next_token.ent_iob,
            "next_token_is_alpha": next_token.is_alpha,
            "next_token_is_digit": next_token.is_digit,
            "next_token_is_lower": next_token.is_lower,
            "next_token_is_upper": next_token.is_upper,
            "next_token_is_title": next_token.is_title,
            "next_token_is_punct": next_token.is_punct,
            "next_token_dep_": next_token.is_punct,
            "next_token_end_sen": int(next_token.text == 'end_sent')
        }
        lst.append(features)
    return lst
        
    

In [696]:
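frames = []
failed_sentences = []

for text in tqdm(df.text.values):
    try:
        tokens, labels = prepare_train(text)
        tmp_df = pd.DataFrame(make_features(tokens))
        tmp_df['target'] = labels
        frames.append(tmp_df)
    except Exception:
        # Samples whose features and labels do not line up are set aside
        failed_sentences.append(text)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
train_df = pd.concat(frames)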

100%|██████████| 5000/5000 [03:16<00:00, 25.48it/s]


In [697]:
var = """Originally Elgar wanted to write three oratorios which would belong together. "The Apostles" is the first one, the second one became "The Kingdom" but the third one, which would have been about the Last Judgement, was never written."""
var1 = """Ellery Cory Stowell was a professor of international law at Columbia University and then American University in Washington, D.C. He represented the United States at The Hague Convention of 1907 and the London Naval Conference of 1909."""

In [698]:
train_df.target.value_counts()

False    161533
True       4729
Name: target, dtype: int64

In [699]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score

In [700]:
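def train_eval(clf):
    """Fit clf on the training split and print macro F1.

    Note: scoring uses X_test_vec / y_test (the run-on test corpus), not the dev split.
    """
    clf.fit(X_train_vec, y_train)
    y_pred = clf.predict(X_test_vec)
    print("f1 macro:", f1_score(y_test, y_pred, average='macro'))
#     print(clf)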

In [701]:
X_train, X_dev, y_train, y_dev = train_test_split(
    train_df.drop(columns='target'), train_df['target'],
    test_size=0.2,
    random_state=42,
    stratify=train_df['target'],
)
X_test, y_test = test_df.drop(columns='target'), test_df['target']

In [702]:
class_weights = (1 / y_train.value_counts(normalize=True)).to_dict()
class_weights

{False: 1.0292742946465883, True: 35.159661644197726}
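
The weights above are the reciprocal of each class's share of the training labels, so the rare `True` class counts roughly 35 times more than it would unweighted. The same computation on a toy label list, as a sketch (the helper name `inverse_frequency_weights` is illustrative):

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by 1 / (its share of the labels)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in counts.items()}


# Toy data: 3% positives, similar in spirit to the real label distribution
toy_labels = [False] * 97 + [True] * 3
weights = inverse_frequency_weights(toy_labels)
print(weights)
```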

In [703]:
vectorizer = DictVectorizer()
X_train_vec = vectorizer.fit_transform(X_train.to_dict('records'))
X_dev_vec = vectorizer.transform(X_dev.to_dict('records'))
X_test_vec = vectorizer.transform(X_test.to_dict('records'))

In [704]:
# test_df

In [705]:
reg_interval = [0.1, 0.5, 1, 5, 10, 100]

# Sweep the regularization strength C
for C in reg_interval:
    print(C)
    train_eval(LinearSVC(C=C, class_weight=class_weights, max_iter=1000))
#     train_eval(LogisticRegression(C=C, class_weight=class_weights, max_iter=2000))

0.1




f1 macro: 0.7599012530795769
0.5
f1 macro: 0.7693961849150968
1
f1 macro: 0.7774686971235194
5
f1 macro: 0.7825159602608847
10
f1 macro: 0.8138202994141503
100
f1 macro: 0.7915501708605157


In [706]:
# model = LinearSVC(C=10, class_weight=class_weights)
model = LogisticRegression(C=10, class_weight=class_weights)

model.fit(X_train_vec, y_train);

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [707]:
y_pred = model.predict(X_dev_vec)

print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

       False       0.99      0.97      0.98     32307
        True       0.46      0.75      0.57       946

    accuracy                           0.97     33253
   macro avg       0.73      0.86      0.78     33253
weighted avg       0.98      0.97      0.97     33253



In [708]:
y_pred = model.predict(X_test_vec)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.99      0.96      0.98      4542
        True       0.41      0.79      0.54       155

    accuracy                           0.96      4697
   macro avg       0.70      0.88      0.76      4697
weighted avg       0.97      0.96      0.96      4697



In [607]:
def lgb_fscore(y_true, y_pred):
    y_pred = np.round(y_pred)
    res = f1_score(y_true, y_pred, average='macro')
    return 'macro_f1', res, True

In [571]:
from lightgbm import LGBMClassifier

In [584]:
class_weights[True]

35.600525624178715

In [757]:
params = {
    'objective': 'binary',
    'num_rounds': 5000,
    'max_depth': -1, #  8
    'learning_rate': 0.01,  #  0.007
    'num_leaves': 16, # was 127
    'verbose': 100,
    'early_stopping_rounds': 200,
    'min_data_in_leaf': 30,
    'lambda_l2': 0.9,
    'feature_fraction': 1, #  0.8
    'metric': 'custom',
    'imbalance': True  # NOTE: LightGBM documents this flag as 'is_unbalance'; this key is probably ignored
}


lgb_clf = LGBMClassifier(**params)

In [758]:
lgb_clf.fit(
    X=X_train_vec,
    y=y_train,
    eval_set=[(X_dev_vec, y_dev)],
#        early_stopping_rounds=params['early_stopping_rounds'],
    verbose=params['verbose'],
    eval_metric=lgb_fscore,
#     sample_weight=y_train.map(class_weights).values
)

Training until validation scores don't improve for 200 rounds
[100]	valid_0's macro_f1: 0.795061
[200]	valid_0's macro_f1: 0.848706
[300]	valid_0's macro_f1: 0.872714
[400]	valid_0's macro_f1: 0.877121
[500]	valid_0's macro_f1: 0.878378
[600]	valid_0's macro_f1: 0.879151
[700]	valid_0's macro_f1: 0.881881
[800]	valid_0's macro_f1: 0.883404
[900]	valid_0's macro_f1: 0.884541
[1000]	valid_0's macro_f1: 0.885218
[1100]	valid_0's macro_f1: 0.886827
[1200]	valid_0's macro_f1: 0.887202
[1300]	valid_0's macro_f1: 0.887818
[1400]	valid_0's macro_f1: 0.888192
Early stopping, best iteration is:
[1244]	valid_0's macro_f1: 0.888192


LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               early_stopping_rounds=200, feature_fraction=1, imbalance=True,
               importance_type='split', lambda_l2=0.9, learning_rate=0.01,
               max_depth=-1, metric='custom', min_child_samples=20,
               min_child_weight=0.001, min_data_in_leaf=30, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=16, num_rounds=5000,
               objective='binary', random_state=None, reg_alpha=0.0,
               reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0, verbose=100)

In [759]:
y_pred = lgb_clf.predict(X_dev_vec)

print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

       False       0.99      1.00      0.99     32307
        True       0.91      0.69      0.78       946

    accuracy                           0.99     33253
   macro avg       0.95      0.84      0.89     33253
weighted avg       0.99      0.99      0.99     33253



In [760]:
y_pred = lgb_clf.predict(X_test_vec)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.99      0.99      0.99      4542
        True       0.81      0.72      0.76       155

    accuracy                           0.99      4697
   macro avg       0.90      0.86      0.88      4697
weighted avg       0.98      0.99      0.98      4697
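
The F1 values in the two test-set reports can be checked by hand from the rounded precision and recall of the `True` class. This is a small worked example, not part of the original notebook; because the inputs are already rounded to two decimals, the results are approximate.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall of the True class on the test set, from the reports above
logreg_f1 = f1(0.41, 0.79)
lgbm_f1 = f1(0.81, 0.72)
print(round(logreg_f1, 2), round(lgbm_f1, 2))
# prints: 0.54 0.76
```

The gradient-boosted model trades some recall (0.72 vs. 0.79) for much higher precision (0.81 vs. 0.41), which is what lifts its F1 on the boundary class.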



In [None]:
sample_text = [token for token, _ in prepare_train(sample, zipped=True)]

In [298]:
sample = df.text.values[1]

In [302]:
print(sample)
prepare_train(sample, zipped=True)

Near Phoenix, rainfall from the storm caused the Narrows Dam, a small earthen dam, to fail. In other locations in Arizona, California, Nevada, and Utah, more than occurred in a few localized areas, sometimes with precipitation comparable to the entire local yearly average rainfall. Flooding was also reported in Somerton, San Diego, El Centro, Palm Springs and Indio, while 12,000 people lost power in Yuma, as well as Los Angeles and southwestern Utah.


[('Near', False),
 ('Phoenix', False),
 (',', False),
 ('rainfall', False),
 ('from', False),
 ('the', False),
 ('storm', False),
 ('caused', False),
 ('the', False),
 ('Narrows', False),
 ('Dam', False),
 (',', False),
 ('a', False),
 ('small', False),
 ('earthen', False),
 ('dam', False),
 (',', False),
 ('to', False),
 ('fail', True),
 ('In', False),
 ('other', False),
 ('locations', False),
 ('in', False),
 ('Arizona', False),
 (',', False),
 ('California', False),
 (',', False),
 ('Nevada', False),
 (',', False),
 ('and', False),
 ('Utah', False),
 (',', False),
 ('more', False),
 ('than', False),
 ('occurred', False),
 ('in', False),
 ('a', False),
 ('few', False),
 ('localized', False),
 ('areas', False),
 (',', False),
 ('sometimes', False),
 ('with', False),
 ('precipitation', False),
 ('comparable', False),
 ('to', False),
 ('the', False),
 ('entire', False),
 ('local', False),
 ('yearly', False),
 ('average', False),
 ('rainfall', True),
 ('flooding', False),
 ('was', False)

In [269]:
# lst = []

# for sen in doc.sents:
# #     lst = []
# #     for index, token in enumerate(sen):
# #         lst.append((token.text, token.pos_, False))
    
#     print([(token.text, token.pos_) for token in sen])
#     print()

In [240]:
np.random.rand()

0.8503172740143496

In [103]:
tmp = df_wiki.sample().text.values[0]
# tmp = df_small.text.values[1]

In [119]:
doc = nlp(tmp)

In [120]:
for sen in doc.sents:
    print(sen)
    print()
    for token in sen:
        print(token.text, token.lemma_, token.pos_)

All bryozoa have a lophophore.

All all DET
bryozoa bryozoa PROPN
have have AUX
a a DET
lophophore lophophore NOUN
. . PUNCT
This is a ring of ten tentacles surrounding the mouth, each tentacle covered with cilia.

This this DET
is be AUX
a a DET
ring ring NOUN
of of ADP
ten ten NUM
tentacles tentacle NOUN
surrounding surround VERB
the the DET
mouth mouth NOUN
, , PUNCT
each each DET
tentacle tentacle NOUN
covered cover VERB
with with ADP
cilia cilia NOUN
. . PUNCT
When feeding, the zooid extends the lophophore outwards; when resting it is withdrawn into the mouth to protect it from predators.

When when ADV
feeding feed VERB
, , PUNCT
the the DET
zooid zooid PROPN
extends extend VERB
the the DET
lophophore lophophore NOUN
outwards outward VERB
; ; PUNCT
when when ADV
resting rest VERB
it -PRON- PRON
is be AUX
withdrawn withdraw VERB
into into ADP
the the DET
mouth mouth NOUN
to to PART
protect protect VERB
it -PRON- PRON
from from ADP
predators predator NOUN
. . PUNCT


In [67]:
for sen in doc.sents:
    print(sen)
    print()
    for token in sen:
        print(token.text, token.lemma_, token.pos_)

Opel and Vauxhall make the same vehicle but with different names, The names are: Opel Vivaro and Vauxhall Vivaro.

Opel Opel PROPN
and and CCONJ
Vauxhall Vauxhall PROPN
make make VERB
the the DET
same same ADJ
vehicle vehicle NOUN
but but CCONJ
with with ADP
different different ADJ
names name NOUN
, , PUNCT
The the DET
names name NOUN
are be AUX
: : PUNCT
Opel Opel PROPN
Vivaro Vivaro PROPN
and and CCONJ
Vauxhall Vauxhall PROPN
Vivaro Vivaro PROPN
. . PUNCT
