In [1]:
import spacy

Let's try the task of determining the topic of a text. We will consider two texts similar in topic if they share more words (not counting prepositions and conjunctions) than other texts do. Our topic-detection program will keep several ready-made texts (sufficiently large ones!) with known topics in its base: choose the texts (and topics) yourself; 5-6 will be enough.

What should the program do? At launch you give it the name of a new file containing the text to classify; it opens the file, processes it, and compares it with the texts in its base. The new text gets the topic of whichever base text shares the most words with it. Obviously, you will need to discard some words from the texts (think about how to do this; there are in fact several possible approaches), and also lemmatize, or at least apply stemming.
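The comparison described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `STOP_WORDS`, `content_words`, `guess_topic` and the toy corpus are hypothetical names and data, the stop list is deliberately tiny, and plain lowercasing stands in for real lemmatization or stemming (a lemmatizer can be passed in through the `normalize` parameter).

```python
import re
from typing import Callable, Dict, Set

# Tiny illustrative stop list (an assumption); a real one would be much longer.
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'in', 'on', 'of', 'to', 'is', 'are'}

def content_words(text: str, normalize: Callable[[str], str] = str.lower) -> Set[str]:
    """Return the set of normalized words in `text`, minus stop words."""
    words = (normalize(w) for w in re.findall(r'\w+', text))
    return {w for w in words if w not in STOP_WORDS}

def guess_topic(new_text: str, corpus: Dict[str, str]) -> str:
    """Pick the topic whose reference text shares the most words with `new_text`."""
    new_words = content_words(new_text)
    return max(corpus, key=lambda topic: len(new_words & content_words(corpus[topic])))

# Toy corpus: topic name -> reference text (hypothetical data).
corpus = {
    'cooking': 'Boil the pasta and add salt to the sauce',
    'football': 'The striker scored a goal in the final match',
}
print(guess_topic('Add more salt and stir the sauce', corpus))
```

Comparing sets rather than raw counts keeps one very frequent word from dominating the score; whether that is the right choice depends on how long the reference texts are.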

Some prepositions in Russian can govern different cases (for example, "я еду в Лондон", "I am going to London", vs "я живу в Лондоне", "I live in London"). Let's analyze these prepositions and their cases. You need to:

compile a list of such prepositions (the Russian Grammar of 1980, РГ-80, will help)
take a sufficiently large text (a long work of fiction will do)
run morphological analysis on this text
count how often each case occurs on the word that follows the preposition.

In [32]:
from natasha import (
    Segmenter,
    MorphVocab,

    NewsEmbedding,
    NewsMorphTagger,
    NewsSyntaxParser,

    Doc
)

In [26]:
import pandas as pd

from pandas import DataFrame

def count_stats(text, segmenter, morph_tagger, syntax_parser):
  doc = Doc(text)

  doc.segment(segmenter)
  doc.tag_morph(morph_tagger)
  doc.parse_syntax(syntax_parser)

  id2token = {token.id: token for token in doc.tokens}    # for fast lookup by head_id
  # only primary (non-derived) prepositions were included
  dangerous_guys = {'в', 'на', 'о', 'по', 'под', 'с', 'меж', 'между', 'за'}

  rows = []   # collected first, then turned into a table in one go
  for token in doc.tokens:
    if token.text.lower() not in dangerous_guys:
      continue
    parent = id2token.get(token.head_id)
    if parent is None:
      continue  # the token has no head
    # store the preposition, its head word, and the head's case
    rows.append((token.text.lower(), parent.text, parent.feats.get('Case')))

  return DataFrame(rows, columns=['text', 'parent_text', 'parent_case'])

In [29]:
with open('C:\\Users\\verid\\OneDrive\\Документы\\mag2023\\CompLing\\abstr_mondaysaturday.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# the text is "Monday Begins on Saturday" ("Понедельник начинается в субботу")

In [33]:
segmenter = Segmenter()
morph_vocab = MorphVocab()

emb = NewsEmbedding()
morph_tagger = NewsMorphTagger(emb)
syntax_parser = NewsSyntaxParser(emb)

stats = count_stats(text, segmenter, morph_tagger, syntax_parser)

In [34]:
stats_without_text = stats.drop(columns='parent_text') # to simplify the output a little
stats_without_text.groupby(['text', 'parent_case']).value_counts().to_frame()

# The cases that should not appear in the table at all (for example, the nominative,
# which no preposition governs) turn out to be morphological/syntactic tagging errors.
# The bulk of the counts, however, falls into the cells of the correct cases,
# for example prepositional (Loc) and accusative (Acc) for the preposition "в".

text  parent_case  count
в     Acc            677
в     Dat              1
в     Gen             47
в     Ins             12
в     Loc            835
в     Nom              8
за    Acc            121
за    Gen              4
за    Ins             82
за    Nom              7
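To put a number on "the bulk of the counts falls into the correct cases", one can compute the share of expected cases for each preposition. The sketch below uses only the rows visible in the output above; the `expected` mapping (в: accusative and prepositional, за: accusative and instrumental) and the helper name `expected_share` are assumptions added for illustration.

```python
# Counts copied from the output table above (only "в" and "за" are shown there).
counts = {
    'в': {'Acc': 677, 'Dat': 1, 'Gen': 47, 'Ins': 12, 'Loc': 835, 'Nom': 8},
    'за': {'Acc': 121, 'Gen': 4, 'Ins': 82, 'Nom': 7},
}

# Cases each preposition is expected to govern (an assumption based on
# standard Russian grammar, not something taken from the table itself).
expected = {'в': {'Acc', 'Loc'}, 'за': {'Acc', 'Ins'}}

def expected_share(prep):
    """Fraction of occurrences of `prep` whose head carries an expected case."""
    total = sum(counts[prep].values())
    ok = sum(n for case, n in counts[prep].items() if case in expected[prep])
    return ok / total

for prep in counts:
    print(prep, round(expected_share(prep), 3))
# в 0.957
# за 0.949
```

Roughly 95% of the heads carry an expected case for both prepositions, which gives a rough upper bound on how often the tagger or parser went wrong on these rows.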


Take any sufficiently long text (preferably a news text). Using any tool you know, extract the named entities from this text and print them as lists grouped by category (i.e. persons together, locations together, organizations together).

In [24]:
nlp = spacy.load('en_core_web_lg')
doc = nlp('''Boris Johnson could return as Prime Minister under astonishing plans being hatched by Tory MPs – with a 'dream ticket' leadership tie-up with Nigel Farage even being considered.
The Mail on Sunday has spoken to multiple Conservative MPs who believe that bringing back the former Premier is the only way to save the party from an Election wipeout.
It comes as Rishi Sunak faces a crunch vote on Tuesday on his flagship Rwanda migrants plan, with whips using threats and blandishments to try to quell a revolt – allegedly even offering peerages to potential rebels if they toe the line.
This newspaper can reveal that Mr Sunak's Tory enemies have drawn up what they crudely call an 'Advent calendar of s**t' to further destabilise the Prime Minister following his sacking of Suella Braverman as Home Secretary and the resignation of Robert Jenrick as Immigration Minister over attempts to salvage the Rwanda plan.
The MPs intend to rebel in Commons votes and make increasingly outspoken interventions, with No 10 nervously braced for further Ministerial resignations. One plotter admitted the intention was to 'crash' the Sunak Government and install a leader who could close the gap with Sir Keir Starmer's Labour party.
The MPs are panicking about polling figures which show Tory support sinking, with many voters turning to the Reform Party, the successor to Nigel Farage's Brexit Party. Its fortunes have been boosted by Mr Farage's successful run on ITV reality show I'm A Celebrity.
This newspaper can reveal that Rishi Sunak's Tory enemies have drawn up what they crudely call an 'Advent calendar of s**t' to further destabilise the Prime Minister
The Tory rebels argue that Mr Johnson is the only Conservative with the pulling power to neutralise Mr Farage's impact, particularly in the Red Wall seats in the Midlands and the North which he took from Labour at the 2019 General Election. Although Trade Secretary Kemi Badenoch has emerged as another leading contender among MPs.
No 10, however, insists Mr Sunak will see off the plotters and lead the Tories into the next Election.
Last night, a spokesman for Mr Johnson would not be drawn on his political ambitions, and denied the existence of any plans to team up with Mr Farage. He said: 'Boris Johnson is currently writing a book and is supporting the Government.'
A source close to Mr Farage insisted that any pact between the two heavyweights would 'soon end in tears'. However, it is understood that MPs have privately urged the pair to talk.
Mr Sunak faces a test of his authority on Tuesday when MPs will vote on the principle of whether to tighten the law to try to salvage his plan to dispatch Channel migrants to Rwanda.
Mr Jenrick quit because he thought the legislation did not go far enough. However, MPs will not have the chance to debate and vote on potentially divisive amendments until the New Year, limiting their opportunities for rebellion.
Ms Braverman denies scheming to bring down Mr Sunak, claiming that she hopes he will lead the party into the next Election.
Neither Mr Johnson nor Mr Farage is currently in Parliament, but Boris's supporters believe that if an MP quit a safe seat before the Election to make way for Mr Johnson, Tory high command would be unable to block it.
A leadership contest would then be triggered if at least 53 letters of no confidence in Mr Sunak were sent to Sir Graham Brady, chair of the backbench 1922 Committee.
Another suggestion is that a Johnson ally, such as former Home Secretary Priti Patel, could be installed as a caretaker Prime Minister, with Mr Johnson standing for a safe seat at the Election, then stepping back in to No 10. If Reform remained a threat, a deal could be struck by giving Mr Farage, the party's honorary president, and Richard Tice, its leader, places in the Lords and key Ministerial positions.
But Reform Party officials said their aim was to kill off the Conservatives. One told The Mail on Sunday: 'When Nigel gets back [from the ITV reality show] he's going to start dominating the agenda. Within about six to eight weeks we'll be polling in the high teens, and the Tories will start to slip below 20 per cent.
'At that point between five and ten MPs will realise the game's up, and defect to us. Then it's game over. We're looking at the last majority Tory administration of our lifetime. We're going to destroy them.'
Party donors are already starting to switch. The co-owners of Bristol Ports, who have donated more than £640,000 to the Tories since 2001, recently gave £100,000 to Reform.
A source close to Farage said any pact between him and Boris would 'soon end in tears'
A source close to Farage said any pact between him and Boris would 'soon end in tears'
Mr Farage has been able to reach millions of voters through the hit ITV programme, which concludes today. Yesterday, the Reform Party emailed subscribers begging them to go 'against the establishment' and vote for Farage to be crowned King of the Jungle. Mr Tice wrote: 'Our man has been superb standing for Brexit in the face of Remainer campmates who have constantly challenged his views on air.'
One Tory MP said: 'When Farage comes back he's going to be all over the airwaves, and he's going to have us in his sights.'
Another said: 'Reform are going to kill us, so we have to buy Farage off. The plan is we get him into the Lords, give him some brief like we did with Cameron – maybe even Home Secretary – then go to the country with the dream team.
'It may not be enough to win, but it would definitely re-energise our base, shake up the debate and give Starmer something to think about.'
Surprisingly, Mr Johnson's supporters in the parliamentary party include MPs who helped to oust him from Downing Street last year following a revolt over scandals including Partygate.
One Red Wall MP told the MoS: 'I came out early to say he had to go. But I think we have to think outside the box now. Whatever you feel about him, one thing no one can question is his effectiveness as a campaigner. And we need that now, we're staring at obliteration.'
Mr Johnson's stock is perceived to have risen following his performance at the Covid Inquiry, which one supporter said showed he can be serious and 'on top of the detail'.
Mr Johnson spent months preparing with his close aide Lord Kempsell for his appearance, during which he apologised for the 'pain and the loss and the suffering' that victims of Covid and their families went through.
However, one ex-Cabinet Minister warned Mr Johnson's comeback could be thwarted by his old rival Lord Cameron, the Foreign Secretary. But they added: 'That said, if Boris were still in the Commons, he would be back already. There would be a coronation. Just look at the polls. Boris is the best campaigner we have by a mile. If you want the prospect of votes at the next Election, Boris is your man.'
The MP added that replacing Mr Sunak with anyone else currently in the Commons would 'make things worse rather than better'.''')

# heading to print -> spaCy entity label
categories = {
    'Persons': 'PERSON',
    'Organizations': 'ORG',
    'Locations': 'GPE',
    'Nationalities, religious or political groups': 'NORP',
}
for title, label in categories.items():
    print(f'{title}: {[ent.text for ent in doc.ents if ent.label_ == label]}')

Persons: ['Boris Johnson', 'Nigel Farage', 'Rishi Sunak', 'Sunak', 'Suella Braverman', 'Robert Jenrick', 'Keir Starmer', "Nigel Farage's", 'Farage', "Rishi Sunak's", 'Johnson', 'Farage', 'Kemi Badenoch', 'Sunak', 'Johnson', 'Farage', 'Boris Johnson', 'Farage', 'Sunak', 'Jenrick', 'Braverman', 'Sunak', 'Johnson', 'Farage', 'Boris', 'Johnson', 'Sunak', 'Graham Brady', 'Johnson', 'Priti Patel', 'Johnson', 'Farage', 'Richard Tice', 'Nigel', 'Boris', 'Boris', 'Farage', 'Tice', 'Cameron', 'Johnson', 'Johnson', 'Johnson', 'Kempsell', 'Johnson', 'Cameron', 'Boris', 'Boris', 'Boris', 'Sunak']
Organizations: ['Home', 'Commons', 'Ministerial', 'the Sunak Government', 'Labour party', 'the Reform Party', 'Brexit Party', 'ITV', 'Red Wall', 'Labour', 'Trade', 'Parliament', '1922 Committee', 'Home', 'Reform', 'Lords', 'Reform Party', 'Mail', 'ITV', 'Bristol Ports', 'Reform', 'Farage', 'Farage', 'ITV', 'the Reform Party', 'Farage', 'Remainer', 'Farage', 'Farage', 'Lords', 'Home', 'Starmer', 'Partygate', 
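
The lists above repeat the same names many times. A possible post-processing step, sketched here with the standard library only, groups entities by label and keeps each distinct string once in first-seen order. The helper name `group_entities` and the short `pairs` sample are assumptions; with spaCy the pairs would come from `((ent.text, ent.label_) for ent in doc.ents)`.

```python
from collections import defaultdict

def group_entities(pairs):
    """Group (text, label) pairs by label, dropping repeats but keeping first-seen order."""
    grouped = defaultdict(dict)
    for text, label in pairs:
        grouped[label][text] = None   # dict keys act as an ordered set
    return {label: list(texts) for label, texts in grouped.items()}

# A few (text, label) pairs in the style of the output above (illustrative sample).
pairs = [
    ('Boris Johnson', 'PERSON'),
    ('Nigel Farage', 'PERSON'),
    ('ITV', 'ORG'),
    ('Boris Johnson', 'PERSON'),
    ('ITV', 'ORG'),
]
print(group_entities(pairs))
```

This removes exact duplicates only; variants such as 'Farage' and "Nigel Farage's" would still need normalization or coreference handling to be merged.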