Цель проекта - выявление трендов на определённых рынках с последующим построении стратегии компании с учетом этих трендов. 

Методология - поиск устойчивых словосочетаний в научных статьях, частота употребления которых растет. 
Также выявление нелинейных закономерностей для определения более непредсказуемых всплесков популярности различных тематик. 

1. парсинг научных статей 
2. очистка от html кода и удаление общеупотребимых слов(словарь пополнялся в ходе обучения модели)  и объединение текстов в единой коррпус
3. разбивка на словосочетания(триграммы и биграммы) 
4. рассчет частотности словосочетаний по годам
5. формирование признаков(на которых в итоге обучалась модель): частотность, общий размер корпуса(для нормализации), количество статей, в которых встречается словосочетание, общее количество статей
6. целевая переменная 1 или 0, 1 - тренд, 0 - тренд (консультанты вручную разметили около 10 000 тысяч словосочетаний) 
7. обучение модели машинного обучения для предсказаний: тренд не тренд
8. модель верно предсказывает около 75% из словосочетаний, которые точно не тренды 
9. модель верно предсказывает около 65% из словосочетаний, которые точно тренды
10. производится отбор после предсказаний модели на основании анализа динамики частотности

В итоге цель была достигнута, модель отсеивает большую часть словосочетаний, которые не относятся к возможным трендам и облегчает работу консультантов, 
выдавая вероятные тренды, которые потом, конечно анализирует человек и под конретную задачу бизнеса формирует наиболее вероятные тренды, а на осовании этого различные 
варианты стратегии. 

Перспективы модели 
1. точность будет расти при увеличении объема разметки
2. точность будет расти при добавлении новых общеупотребимых слов 
3. так же возможен рост модели при многократном предсказании модели, используя уже готовые предсказания предыдущей модели. это просев 
возможен как 'вручную', так будет легче контролировать процесс и с помощью анасмаблевых моделей(например градиентный бустинг)

In [None]:
import json
import pandas as pd
import re
import numpy as np

from collections import Counter

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from nltk.util import bigrams, trigrams
from nltk.probability import FreqDist

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import confusion_matrix


pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.options.display.max_columns=None
pd.options.mode.chained_assignment = None

In [2]:
# загружаю новые статьи
new_articles = pd.read_json('C:/Users/a.safonov/future_world/parsed_articles_new/article_content_mdpi_full3.json', orient='records')

In [3]:
new_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6289 entries, 0 to 6288
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   link    6289 non-null   object
 1   full    6289 non-null   object
 2   dates   6289 non-null   object
dtypes: object(3)
memory usage: 147.5+ KB


In [4]:
new_articles[['link','dates']].head(10)

Unnamed: 0,link,dates
0,/2079-7737/12/5/719,"[Received: 28 April 2023, Accepted: 8 May 2023, Published: 15 May 2023]"
1,/2304-6732/10/10/1090,"[Received: 24 August 2023, Revised: 16 September 2023, Accepted: 25 September 2023, Published: 28 September 2023]"
2,/1996-1944/16/18/6146,"[Received: 17 August 2023, Revised: 3 September 2023, Accepted: 7 September 2023, Published: 10 September 2023]"
3,/1996-1944/16/13/4569,"[Received: 17 May 2023, Revised: 1 June 2023, Accepted: 1 June 2023, Published: 24 June 2023]"
4,/1996-1944/16/11/4106,"[Received: 7 October 2022, Revised: 29 October 2022, Accepted: 1 November 2022, Published: 31 May 2023]"
5,/2304-6732/10/5/510,"[Received: 17 March 2023, Revised: 21 April 2023, Accepted: 26 April 2023, Published: 27 April 2023]"
6,/2076-3417/13/8/5132,"[Received: 22 March 2023, Revised: 9 April 2023, Accepted: 18 April 2023, Published: 20 April 2023]"
7,/2304-6732/10/4/447,"[Received: 23 February 2023, Revised: 5 April 2023, Accepted: 9 April 2023, Published: 13 April 2023]"
8,/1996-1944/16/7/2908,"[Received: 25 January 2023, Revised: 22 March 2023, Accepted: 31 March 2023, Published: 6 April 2023]"
9,/1424-8220/23/5/2784,"[Received: 2 December 2022, Revised: 14 February 2023, Accepted: 23 February 2023, Published: 3 March 2023]"


In [5]:
# Извлечение даты Published 
new_articles['dates'] = new_articles['dates'].apply(lambda x: ' '.join(map(str, x)))

In [6]:

def extract_published_date(text):
    match = re.search(r'Published: (\d+ \w+ \d+)', text)
    return match.group(1) if match else None

new_articles['published_date'] = new_articles['dates'].apply(extract_published_date)


In [7]:

# Разделение даты на месяц и год
new_articles[['day', 'month', 'year']] = new_articles['published_date'].str.split(expand=True)

new_articles.drop(['published_date','day'], axis=1, inplace=True)

In [8]:
new_articles_for_merge=new_articles[['year','full']]

In [9]:
# объединяю полные тексты статей
abstracts_doaj_mdpi = pd.read_json('C:/Users/a.safonov/future_world/abstracts_doaj.json', orient='records')


In [10]:
abstracts_doaj_mdpi ['date'] = abstracts_doaj_mdpi ['source'].str.extract(r'(\b\w+ \d{4})')

In [11]:
abstracts_doaj_mdpi[['month', 'year']] = abstracts_doaj_mdpi['date'].str.split(expand=True)

In [12]:

# датафрейм: ссылки и дата
abstract_link_year_df=abstracts_doaj_mdpi[['title','month','year','link']]
abstract_link_year_df=abstract_link_year_df.drop_duplicates(['link'])
abstract_link_year_df=abstract_link_year_df.dropna()
abstract_link_year_df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 9726 entries, 0 to 9948
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   9726 non-null   object
 1   month   9726 non-null   object
 2   year    9726 non-null   object
 3   link    9726 non-null   object
dtypes: object(4)
memory usage: 379.9+ KB


In [13]:

# загружаю и склеиваю все тексты статей
full_doaj_mdpi = pd.read_json('C:/Users/a.safonov/future_world/article_content_doaj_mdpi_full2.json', orient='records')
full_doaj_front = pd.read_json('C:/Users/a.safonov/future_world/article_content_doaj_front_full2.json', orient='records')
full_doaj_dovepress = pd.read_json('C:/Users/a.safonov/future_world/article_content_doaj_dovepress_full2.json', orient='records')
full_doaj_mfd=pd.concat([full_doaj_mdpi,full_doaj_front,full_doaj_dovepress])
full_doaj_mfd=full_doaj_mfd.drop_duplicates(['link'])


In [14]:

# добавляю к статьям год
full_doaj=pd.merge(full_doaj_mfd,abstract_link_year_df,how='left',on='link')
full_doaj=full_doaj[['title','month','year','full']]

In [15]:
full_doaj_for_merge=full_doaj[['year','full']]
#new_articles_for_merge

In [16]:
# объединияю с новыми статьями 
full_doaj=pd.concat([full_doaj_for_merge,new_articles_for_merge])

In [17]:
full_doaj=full_doaj.dropna()
full_doaj.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8555 entries, 0 to 6288
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    8555 non-null   object
 1   full    8555 non-null   object
dtypes: object(2)
memory usage: 200.5+ KB


In [18]:
# оставляю данные начиная с 19 года
years19=['2019','2020','2021','2022','2023']
full_doaj=full_doaj.query('year in @years19')
full_doaj['year'].value_counts()

year
2023    4837
2022    2380
2021     589
2020     283
2019     162
Name: count, dtype: int64

In [19]:
# перевожу каждую статью в строку, то есть перевожу элементы списка в одну строку
full_doaj['full'] = full_doaj['full'].apply(lambda x: ' '.join(map(str, x)))

In [20]:
full_doaj['full']=full_doaj['full'].str.lower()
# full_doaj['month']=full_doaj['month'].str.lower()

In [21]:
# очищаю тексты
common_words = [
    'in', 'on', 'at', 'under', 'over', 'through', 'with', 'without', 'to', 'for',
    'by', 'from', 'between', 'among',
    'and', 'but', 'or', 'so', 'because', 'although', 'while', 'if', 'when', 'as',
    'i', 'you', 'he', 'she', 'it', 'we', 'they',
    'my', 'your', 'his', 'her', 'its', 'our', 'their',
    'this', 'that', 'these', 'those',
    'must', 'have', 'do',
    'say', 'get', 'make', 'go', 'know', 'take', 'see', 'come', 'think', 'look', 'want',
    'give', 'use', 'find', 'tell', 'ask', 'work', 'seem', 'feel', 'try', 'leave', 'call',
    'table', 'chair', 'book', 'house', 'car', 'friend', 'family', 'time', 'way', 'day',
    'man', 'woman', 'child', 'money', 'job', 'place', 'food', 'water', 'air', 'life',
    'is', 'are', 'could', 'make', 'was', 'were', 'have', 'has', 'had', 'do', 'does',
    'such', 'this', 'these', 'that', 'those', 'itself', 'myself', 'yourself', 'himself', 'herself',
    'oneself', 'themselves', 'each', 'other', 'what', 'which', 'who', 'whom', 'where', 'whose',
    'whichever', 'whoever', 'whomever', 'whatever',
    'more', 'most', 'less', 'least',
    'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten',
    'first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth',
    'am', 'is', 'are', 'was', 'were', 'been', 'being',
    'have', 'has', 'had', 'having',
    'do', 'does', 'did', 'doing',
    'can', 'could',
    'will', 'shall', 'would', 'should',
    'may', 'might', 'must',
    'go', 'went', 'gone', 'going',
    'make', 'made', 'making',
    'take', 'took', 'taken', 'taking',
    'find', 'found', 'finding',
    'say', 'said', 'saying',
    'get', 'got', 'gotten', 'getting',
    'run', 'ran', 'running',
    'sing', 'sang', 'sung', 'singing',
    'child', 'children',
    'man', 'men',
    'woman', 'women',
    'person', 'people',
    'mouse', 'mice',
    'foot', 'feet',
    'goose', 'geese',
    'tooth', 'teeth',
    'child', 'children',
    'this', 'these', 'those',
    'that', 'those',
    'itself', 'themselves',
    'each', 'others', 'whats', 'whiches', 'whos', 'whoms', 'wheres', 'whoses',
    'whichever', 'whoever', 'whomever', 'whatever','the',
    "all", "publications", "review", "reviews", "authors", "articles", "participants", "also",
    "current", "sample", "scopus", "cited"
]

In [22]:

def cleaning_full(row):
    text=row[1]
    
    # Удаление символов, состоящих из 2 или менее знаков
    pattern_2 = r'\b\w{1,2}\b'
    text = re.sub(pattern_2, '', text)

    text = re.sub(r'\n', '', text)

    # удаляем общеупотребимые слова
    pattern_common = r'\b(?:' + '|'.join(re.escape(word) for word in common_words) + r')\b'
    text = re.sub(pattern_common, '', text, flags=re.IGNORECASE)

    # Удаление знаков препинания
    text = re.sub(r'[^\w\s]', '', text)

    # Удаление кусков HTML кода (если они представлены в угловых скобках)
    text = re.sub(r'<[^>]+>', '', text)

    # удаление лишних пробелов
    text = re.sub(r'\s+', ' ', text)

    return text


In [23]:

full_doaj['full']=full_doaj.apply(cleaning_full,axis=1)

In [24]:
# разбиваю каждую статью на триграммы 
def articles_trigrams(row):
    text=row['full']
    art_words = text.lower().split()
    article_trigram = list(trigrams(art_words))
    trigram_strings = [' '.join(trigram) for trigram in article_trigram]
    return trigram_strings

full_doaj['article_trigrams']= full_doaj.apply(articles_trigrams,axis=1)

full_doaj.info()   
    

<class 'pandas.core.frame.DataFrame'>
Index: 8251 entries, 0 to 6288
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   year              8251 non-null   object
 1   full              8251 non-null   object
 2   article_trigrams  8251 non-null   object
dtypes: object(3)
memory usage: 257.8+ KB


In [25]:
# все статьи в формате str
# триграммы для каждой статью представляют собой список, где каждая триграмма элемент списка 

In [26]:
# суммирую все тексты по годам
full_doaj_years=pd.pivot_table(full_doaj,
               index=['year'],
               values=['article_trigrams'],
               aggfunc=sum).reset_index()

In [27]:
# считаю сколько статей по каждому году
full_doaj_articles=pd.pivot_table(full_doaj,
               index=['year'],
               values=['full'],
               aggfunc='count').reset_index()
full_doaj_articles.columns=['year','articles_for_year']
full_doaj_years=pd.merge(full_doaj_years,full_doaj_articles, on='year', how='left')

In [28]:
full_doaj_years.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   year               5 non-null      object
 1   article_trigrams   5 non-null      object
 2   articles_for_year  5 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 252.0+ bytes


In [29]:
# full_doaj_years
# article_trigrams - список триграмм по каждому году в нижнем регистре, где каждый элемент списка одна триграмма
# articles_for_year - количество статей за год 

In [30]:
# Создайте пустой датафрейм для триграмм
trigram_result_df = pd.DataFrame()

# Проход по строкам датафрейма full_doaj_mfd_2015
for index, row in full_doaj_years.iterrows():
    year = row[0]
    trigram_list = row[1] # список уже готовых триграмм
    articles_number=row[2]

    # Создание счетчика для подсчета количества каждой триграммы
    trigram_counter = FreqDist(trigram_list)
    
    trigram_df = pd.DataFrame({
        'trigram': list(trigram_counter.keys()),
        'trigram_count{}'.format(year): trigram_counter.values(),
        #'trigrams_all{}'.format(year): len(trigram_list),
        #'article_count{}'.format(year): [0] * len(trigram_counter.values()),
        #'articles_all{}'.format(year): articles_number
    })
    # Объедините датафреймы по триграммам в общий датафрейм
    if trigram_result_df.empty:
        trigram_result_df = trigram_df
    else:
        trigram_result_df = trigram_result_df.merge(trigram_df, on='trigram', how='outer')

In [31]:
trigram_result_df = trigram_result_df.fillna(0)

In [32]:
trigram_result_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9466581 entries, 0 to 9466580
Data columns (total 6 columns):
 #   Column             Dtype  
---  ------             -----  
 0   trigram            object 
 1   trigram_count2019  float64
 2   trigram_count2020  float64
 3   trigram_count2021  float64
 4   trigram_count2022  float64
 5   trigram_count2023  float64
dtypes: float64(5), object(1)
memory usage: 433.3+ MB


In [33]:
trigram_result_df['trigram_count']=trigram_result_df['trigram_count2019']+trigram_result_df['trigram_count2020']+trigram_result_df['trigram_count2021']+trigram_result_df['trigram_count2022']+trigram_result_df['trigram_count2023']

In [34]:
trigram_result_df_10=trigram_result_df.query('trigram_count>=10')

In [36]:
# создание признака - количество статей по году в которые входит каждая триграмма

In [37]:
full_doaj=full_doaj.reset_index(drop=True)

In [38]:
full_doaj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8251 entries, 0 to 8250
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   year              8251 non-null   object
 1   full              8251 non-null   object
 2   article_trigrams  8251 non-null   object
dtypes: object(3)
memory usage: 193.5+ KB


In [39]:
# датафрейм, в котором только год и текст статьи, разбитый на триграммы 
full_doaj_for_counting_articles=full_doaj[['year','article_trigrams']]
full_doaj_for_counting_articles['year'] = full_doaj_for_counting_articles['year'].astype(int)

In [40]:
dlgf=full_doaj_for_counting_articles.loc[1,'article_trigrams']

In [41]:
# Развернем списки в столбце 'article_trigrams' в отдельные строки, где индекс указывает на статью которой принадлежит данная триграмма 
full_doaj_expanded = full_doaj_for_counting_articles.explode('article_trigrams').reset_index(names='article_mark')


In [42]:
full_doaj_expanded.head(10)

Unnamed: 0,article_mark,year,article_trigrams
0,0,2023,given massive increase
1,0,2023,massive increase vehicle
2,0,2023,increase vehicle traffic
3,0,2023,vehicle traffic worldwide
4,0,2023,traffic worldwide issues
5,0,2023,worldwide issues road
6,0,2023,issues road safety
7,0,2023,road safety traffic
8,0,2023,safety traffic congestion
9,0,2023,traffic congestion emissions


In [43]:
len(full_doaj_expanded['article_mark'].unique())

8251

In [44]:
trigram_counts_by_year=pd.pivot_table(full_doaj_expanded,
               index=['article_trigrams'],
               values=['article_mark'],
               columns=['year'],
               aggfunc=[lambda x:len(x.unique())],
               fill_value=0).reset_index()

In [45]:
trigram_counts_by_year.head(1)

Unnamed: 0_level_0,article_trigrams,<lambda>,<lambda>,<lambda>,<lambda>,<lambda>
Unnamed: 0_level_1,Unnamed: 1_level_1,article_mark,article_mark,article_mark,article_mark,article_mark
year,Unnamed: 1_level_2,2019,2020,2021,2022,2023
0,000 000 000,1,0,0,1,1


In [46]:
years=['2019','2020','2021','2022','2023']

In [47]:
trigram_counts_by_year.columns = ['trigram'] + [f'article_count{year}' for year in years]

trigram_counts_by_year.sort_values(['article_count2019', 'article_count2020','article_count2021','article_count2022','article_count2023'],ascending=False).head(2)

Unnamed: 0,trigram,article_count2019,article_count2020,article_count2021,article_count2022,article_count2023
6374467,potential conflict interest,56,54,145,252,233
586451,any commercial financial,56,54,144,244,223


In [48]:
trigram_counts_by_year.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9466581 entries, 0 to 9466580
Data columns (total 6 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   trigram            object
 1   article_count2019  int64 
 2   article_count2020  int64 
 3   article_count2021  int64 
 4   article_count2022  int64 
 5   article_count2023  int64 
dtypes: int64(5), object(1)
memory usage: 433.3+ MB


In [49]:
trigram_result_df_10.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15853 entries, 6 to 9449778
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   trigram            15853 non-null  object 
 1   trigram_count2019  15853 non-null  float64
 2   trigram_count2020  15853 non-null  float64
 3   trigram_count2021  15853 non-null  float64
 4   trigram_count2022  15853 non-null  float64
 5   trigram_count2023  15853 non-null  float64
 6   trigram_count      15853 non-null  float64
dtypes: float64(6), object(1)
memory usage: 990.8+ KB


In [50]:
# Объединим данные
trigram_article_count = pd.merge(trigram_result_df, trigram_counts_by_year, on='trigram', how='left')

In [51]:
trigram_article_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9466581 entries, 0 to 9466580
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   trigram            object 
 1   trigram_count2019  float64
 2   trigram_count2020  float64
 3   trigram_count2021  float64
 4   trigram_count2022  float64
 5   trigram_count2023  float64
 6   trigram_count      float64
 7   article_count2019  int64  
 8   article_count2020  int64  
 9   article_count2021  int64  
 10  article_count2022  int64  
 11  article_count2023  int64  
dtypes: float64(6), int64(5), object(1)
memory usage: 866.7+ MB


In [52]:
trigram_article_count.sort_values(['trigram_count2019', 'trigram_count2020','trigram_count2021','trigram_count2022','trigram_count2023'],ascending=False).head(1)

Unnamed: 0,trigram,trigram_count2019,trigram_count2020,trigram_count2021,trigram_count2022,trigram_count2023,trigram_count,article_count2019,article_count2020,article_count2021,article_count2022,article_count2023
60299,potential conflict interest,56.0,54.0,145.0,252.0,234.0,741.0,56,54,145,252,233


In [53]:
# ДОБАВЛЯЮ КОЛИЧЕСТВО ТРИГРАММ И КОЛИЧЕСТВО СТАТЕЙ В КАЖДОМ ГОДУ 

In [54]:
full_doaj_years.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   year               5 non-null      object
 1   article_trigrams   5 non-null      object
 2   articles_for_year  5 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 252.0+ bytes


In [55]:
full_doaj_years['articles_for_year_len']=full_doaj_years.apply(lambda x:len(x['article_trigrams']),axis=1)

In [56]:
full_doaj_years[['year','articles_for_year','articles_for_year_len']]

Unnamed: 0,year,articles_for_year,articles_for_year_len
0,2019,162,369760
1,2020,283,533984
2,2021,589,967450
3,2022,2380,3019176
4,2023,4837,5958874


In [57]:
# все триграммы во всех статьях
trigrams_list_for_df = trigram_article_count['trigram']

# Создание словаря для данных в датафрейме
data = {'trigram': trigrams_list_for_df}

# Добавление столбцов с количеством триграмм и статей для каждого года
years = ['2019', '2020', '2021', '2022', '2023']
trigrams_all=list(full_doaj_years['articles_for_year'])
articles_all=list(full_doaj_years['articles_for_year_len'])

In [58]:
for year, tri, art in zip(years, trigrams_all, articles_all):
    data[f'trigrams_all{year}'] = [tri] * len(trigrams_list_for_df)
    data[f'articles_all{year}'] = [art] * len(trigrams_list_for_df)

# Создание датафрейма
trigrams_articles_all = pd.DataFrame(data)
trigrams_articles_all.head()

Unnamed: 0,trigram,trigrams_all2019,articles_all2019,trigrams_all2020,articles_all2020,trigrams_all2021,articles_all2021,trigrams_all2022,articles_all2022,trigrams_all2023,articles_all2023
0,essay discuss conversational,162,369760,283,533984,589,967450,2380,3019176,4837,5958874
1,discuss conversational systems,162,369760,283,533984,589,967450,2380,3019176,4837,5958874
2,conversational systems called,162,369760,283,533984,589,967450,2380,3019176,4837,5958874
3,systems called chatbots,162,369760,283,533984,589,967450,2380,3019176,4837,5958874
4,called chatbots natural,162,369760,283,533984,589,967450,2380,3019176,4837,5958874


In [59]:
trigrams_articles_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9466581 entries, 0 to 9466580
Data columns (total 11 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   trigram           object
 1   trigrams_all2019  int64 
 2   articles_all2019  int64 
 3   trigrams_all2020  int64 
 4   articles_all2020  int64 
 5   trigrams_all2021  int64 
 6   articles_all2021  int64 
 7   trigrams_all2022  int64 
 8   articles_all2022  int64 
 9   trigrams_all2023  int64 
 10  articles_all2023  int64 
dtypes: int64(10), object(1)
memory usage: 794.5+ MB


In [60]:
# датафрейм с встречаемостью по каждому году и количеству вхождений в статьи и общим размеров корпуса каждого года и количеством статей 
trigrams_articles_for_years=pd.merge(trigram_article_count,trigrams_articles_all,how='left',on='trigram')

In [61]:
trigrams_articles_for_years.tail(5)

Unnamed: 0,trigram,trigram_count2019,trigram_count2020,trigram_count2021,trigram_count2022,trigram_count2023,trigram_count,article_count2019,article_count2020,article_count2021,article_count2022,article_count2023,trigrams_all2019,articles_all2019,trigrams_all2020,articles_all2020,trigrams_all2021,articles_all2021,trigrams_all2022,articles_all2022,trigrams_all2023,articles_all2023
9466576,1987smith january 2021,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,0,1,162,369760,283,533984,589,967450,2380,3019176,4837,5958874
9466577,ethical restrictions protect,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,0,1,162,369760,283,533984,589,967450,2380,3019176,4837,5958874
9466578,restrictions protect identity,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,0,1,162,369760,283,533984,589,967450,2380,3019176,4837,5958874
9466579,protect identity privacy,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,0,1,162,369760,283,533984,589,967450,2380,3019176,4837,5958874
9466580,identity privacy declare,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,0,1,162,369760,283,533984,589,967450,2380,3019176,4837,5958874


In [62]:
trigram_result_df['trigram_count']=trigram_result_df['trigram_count2019']+trigram_result_df['trigram_count2020']+trigram_result_df['trigram_count2021']+trigram_result_df['trigram_count2022']+trigram_result_df['trigram_count2023']

In [63]:
trigram_result_df_10=trigram_result_df.query('trigram_count>=10')

In [64]:
trigrams_articles_for_years=trigrams_articles_for_years.drop('trigram_count',axis=1)

In [65]:
# загрузка размеченных данных 
trigram_marked= pd.read_excel('C:/Users/a.safonov/future_world/balanced_model/500-500.xlsx', dtype=str)
trigram_marked.to_csv("C:/Users/a.safonov/future_world/balanced_model/\.xlsx", index=False, sep=';')


In [66]:
trigram_marked['mark'].value_counts()

mark
1    500
4    500
Name: count, dtype: int64

In [67]:
# МОДЕЛЬ

In [68]:
trigrams_articles_for_years.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9466581 entries, 0 to 9466580
Data columns (total 21 columns):
 #   Column             Dtype  
---  ------             -----  
 0   trigram            object 
 1   trigram_count2019  float64
 2   trigram_count2020  float64
 3   trigram_count2021  float64
 4   trigram_count2022  float64
 5   trigram_count2023  float64
 6   article_count2019  int64  
 7   article_count2020  int64  
 8   article_count2021  int64  
 9   article_count2022  int64  
 10  article_count2023  int64  
 11  trigrams_all2019   int64  
 12  articles_all2019   int64  
 13  trigrams_all2020   int64  
 14  articles_all2020   int64  
 15  trigrams_all2021   int64  
 16  articles_all2021   int64  
 17  trigrams_all2022   int64  
 18  articles_all2022   int64  
 19  trigrams_all2023   int64  
 20  articles_all2023   int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 1.5+ GB


In [69]:
trigrams_articles_for_years.to_csv('C:/Users/a.safonov/future_world/balanced_model/trigrams_for_model_9m.csv', index=False)

In [70]:
# объединение признаков и таргета 
trigrams_for_model=pd.merge(trigrams_articles_for_years,trigram_marked,how='inner',on='trigram')

In [71]:
trigrams_for_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   trigram            999 non-null    object 
 1   trigram_count2019  999 non-null    float64
 2   trigram_count2020  999 non-null    float64
 3   trigram_count2021  999 non-null    float64
 4   trigram_count2022  999 non-null    float64
 5   trigram_count2023  999 non-null    float64
 6   article_count2019  999 non-null    int64  
 7   article_count2020  999 non-null    int64  
 8   article_count2021  999 non-null    int64  
 9   article_count2022  999 non-null    int64  
 10  article_count2023  999 non-null    int64  
 11  trigrams_all2019   999 non-null    int64  
 12  articles_all2019   999 non-null    int64  
 13  trigrams_all2020   999 non-null    int64  
 14  articles_all2020   999 non-null    int64  
 15  trigrams_all2021   999 non-null    int64  
 16  articles_all2021   999 non

In [72]:
trigrams_for_model.head(3)

Unnamed: 0,trigram,trigram_count2019,trigram_count2020,trigram_count2021,trigram_count2022,trigram_count2023,article_count2019,article_count2020,article_count2021,article_count2022,article_count2023,trigrams_all2019,articles_all2019,trigrams_all2020,articles_all2020,trigrams_all2021,articles_all2021,trigrams_all2022,articles_all2022,trigrams_all2023,articles_all2023,mark
0,language processing nlp,2.0,1.0,3.0,9.0,34.0,2,1,3,9,31,162,369760,283,533984,589,967450,2380,3019176,4837,5958874,1
1,study collection analyses,5.0,10.0,21.0,108.0,256.0,5,10,21,107,255,162,369760,283,533984,589,967450,2380,3019176,4837,5958874,4
2,machine learning model,4.0,0.0,7.0,40.0,62.0,3,0,4,23,45,162,369760,283,533984,589,967450,2380,3019176,4837,5958874,1


In [73]:
# размеченные триграммы, которые нужно будет убрать при тестировании 
trigrams_for_remove=list(trigrams_for_model['trigram'])

In [74]:
len(trigrams_for_remove)

999

In [75]:
# X - признаки, y - целевая переменная
X = trigrams_for_model.drop(['mark', 'trigram'], axis=1)
y = trigrams_for_model['mark']

# Разбивка данных на тренировочный и тестовый набор
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Рассчет весов классов для учета дисбаланса
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

# Определение модели RandomForestClassifier с учетом весов классов
rf_model = RandomForestClassifier(class_weight=dict(zip(np.unique(y_train), class_weights)))

# Задание сетки параметров для перебора
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Создание объекта GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='f1_macro', n_jobs=-1)

# Обучение модели с перебором гиперпараметров
grid_search.fit(X_train, y_train)

# Вывод лучших параметров и соответствующей F-меры
print("Лучшие параметры:", grid_search.best_params_)
print("Лучшая F-мера:", grid_search.best_score_)

# Получение предсказаний на тестовом наборе
y_pred = grid_search.predict(X_test)


Лучшие параметры: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}
Лучшая F-мера: 0.740776393830663


In [76]:
preds_forrest=pd.Series(y_pred)
answers_forrest=pd.Series(y_test).reset_index(drop=True)

In [77]:
conf_matrix=confusion_matrix(answers_forrest,preds_forrest)

In [78]:
conf_matrix

array([[62, 35],
       [28, 75]], dtype=int64)

In [None]:
pred_forrest.head(1)

In [None]:

# Оценка F-меры модели
f1_macro = classification_report(y_test, y_pred, target_names=['1','4'], output_dict=True)['macro avg']['f1-score']
print("F-мера на тестовом наборе:", f1_macro)

# Рассчет и вывод accuracy модели
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy на тестовом наборе:", accuracy)


In [None]:
y_probs = grid_search.predict_proba(X_test)
roc_auc = roc_auc_score(y_test, y_pred)
print("AUC-ROC на тестовом наборе:", roc_auc)

In [None]:
# random forrest предсказываем для > 3 000 000 шинглов

In [None]:
# все триграммы с 19 по 23 кроме тех, по которым проводилось обучение 
trigram_result_test=trigrams_articles_for_years.query('trigram not in @trigrams_for_remove')

In [None]:
trigram_result_test.head()

In [None]:
trigram_result_test_5=trigram_result_test.query('trigram_count>=5')

In [None]:

# оставляю только признаки для модели 
trigrams_for_prediction = trigram_result_test_5.drop(['trigram'], axis=1)


In [None]:
# отделяю сами триграммы, чтобы после предикта объединить 
trigrams_after_prediction=trigram_result_test_5['trigram'].reset_index(drop=True)


In [None]:
trigrams_after_prediction.info()

In [None]:
predictions_series.head()

In [None]:

# предсказываем для всех триграмм(кроме выборки для обучения) 
new_predictions = grid_search.predict(trigrams_for_prediction)
# соединяю предсказания модели с самими триграммами 
predictions_series=pd.Series(new_predictions)
predictions_df=pd.concat([trigrams_after_prediction, predictions_series], axis=1)
predictions_df.columns=['trigram','prediction']

# предсказания модели для > 3 000 000 триграмм 
predictions_df.info()
