### Imports

In [None]:
!pip install -q opendatasets

In [None]:
import opendatasets as od
import pandas as pd
import networkx as nx
import plotly.graph_objects as go

For this work we used the Kensho Derived Wikimedia Dataset - English Wikipedia corpus and Wikidata knowledge graph for NLP (link: https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data )\
The dataset contains a large amount of information derived from Wikidata.\
Files used in this work:\
statements - contains:

*   source_item_id - ID of the first entity (the subject)
*   target_item_id - ID of the second entity (the object)
*   edge_property_id - ID of the relation (Wikidata property)

item - contains:

*   item_id - ID of the entity
*   en_label - English name of the entity
*   en_description - short Wikidata description of the entity
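
To make the two-file layout concrete, here is a minimal sketch that rebuilds a few rows by hand and resolves each statement into a readable (source, property, target) triple. The row values are copied from the outputs shown later in this notebook; the variable names `labels` and `triples` are illustrative only.

```python
import pandas as pd

# Toy rows shaped like statements.csv (values taken from this notebook's outputs).
statements = pd.DataFrame(
    {
        "source_item_id": [16, 17, 20],
        "edge_property_id": [30, 30, 30],
        "target_item_id": [49, 48, 46],
    }
)

# Toy rows shaped like item.csv.
item = pd.DataFrame(
    {
        "item_id": [16, 17, 20, 46, 48, 49],
        "en_label": ["Canada", "Japan", "Norway", "Europe", "Asia", "North America"],
    }
)

# Map numeric entity IDs to their English labels.
labels = item.set_index("item_id")["en_label"]

# Resolve every statement into (source label, property ID, target label).
triples = [
    (labels.loc[s], p, labels.loc[t])
    for s, p, t in statements.itertuples(index=False, name=None)
]
print(triples)
```

Each statement is one edge of the knowledge graph, and `item` supplies the human-readable attributes of its endpoints.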



Kaggle login credentials:\
username: alicenet\
key: (redacted; supply your own Kaggle API key when prompted)


In [None]:
od.download("https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data", files=['item.csv', 'statements.csv'], force=True)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: alicenet
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data
Downloading kensho-derived-wikimedia-data.zip to ./kensho-derived-wikimedia-data


100%|██████████| 8.16G/8.16G [01:07<00:00, 130MB/s]





### 1. Build a knowledge base in any subject area by assembling a collection of documents (for example, publications or files on some portal, or a set of files on a computer or server, etc.) and build a knowledge graph over these documents

In [None]:
statements = pd.read_csv('/content/kensho-derived-wikimedia-data/statements.csv', nrows=30000)
statements.head()

In [None]:
item = pd.read_csv('/content/kensho-derived-wikimedia-data/item.csv')

In [None]:
item.head()

Unnamed: 0,item_id,en_label,en_description
0,1,Universe,totality of space and all contents
1,2,Earth,third planet from the Sun in the Solar System
2,3,life,matter capable of extracting energy from the e...
3,4,death,permanent cessation of vital functions
4,5,human,"common name of Homo sapiens, unique extant spe..."


In [None]:
filtered_statements = statements.loc[(statements['edge_property_id'] == 1376) | (statements['edge_property_id'] == 30)]
filtered_statements

Unnamed: 0,source_item_id,edge_property_id,target_item_id
243,16,30,49
565,17,30,48
909,20,30,46
910,20,30,51
1107,21,30,46
...,...,...,...
29987,456,1376,46130
29988,456,1376,535140
29989,456,1376,16665897
29990,456,1376,18338206
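
The same selection can be written with `isin`, which scales better when more properties are added. The sketch below runs on a hand-made frame: the first two rows are copied from the output above, and the third row (property 31) is hypothetical filler added only so that the filter has something to drop. Taking an explicit `.copy()` is an assumption about intent: it lets new columns be added to the result without pandas warning about writing to a slice.

```python
import pandas as pd

statements = pd.DataFrame(
    {
        "source_item_id": [16, 456, 1],
        "edge_property_id": [30, 1376, 31],  # 31 is a hypothetical filler row
        "target_item_id": [49, 46130, 2],
    }
)

# Wikidata property IDs kept in the graph: P30 (continent) and P1376 (capital of).
WANTED_PROPERTIES = [30, 1376]

# Boolean mask via isin; .copy() detaches the result from the original frame.
filtered = statements[statements["edge_property_id"].isin(WANTED_PROPERTIES)].copy()

# Adding a column to the copy is now safe.
filtered["source_link"] = (
    "https://www.wikidata.org/wiki/Q" + filtered["source_item_id"].astype(str)
)
print(filtered)
```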


In [None]:
def add_link(value):
    return 'https://www.wikidata.org/wiki/Q' + str(value)

In [None]:
# Work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning.
filtered_statements = filtered_statements.copy()
filtered_statements['source_link'] = filtered_statements['source_item_id'].apply(add_link)
filtered_statements['target_link'] = filtered_statements['target_item_id'].apply(add_link)
# Attach the name and description of the source entity.
merged_table = pd.merge(filtered_statements, item, left_on='source_item_id', right_on='item_id', how='left')
merged_table.drop(columns=['source_item_id'], inplace=True)
merged_table.rename(columns={'en_label': 'source_name', 'en_description': 'source_description'}, inplace=True)
# Attach the name and description of the target entity.
merged_table = pd.merge(merged_table, item, left_on='target_item_id', right_on='item_id', how='left')
merged_table.drop(columns=['target_item_id', 'item_id_x', 'item_id_y'], inplace=True)
merged_table.rename(columns={'en_label': 'target_name', 'en_description': 'target_description'}, inplace=True)


In [None]:
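# Wikidata property IDs: P30 = continent, P1376 = capital of.
# Assign the result back instead of calling replace(..., inplace=True) on a column selection.
merged_table['edge_property_id'] = merged_table['edge_property_id'].replace({30:'continent of', 1376:'capital of'})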
merged_table.head()

Unnamed: 0,edge_property_id,source_link,target_link,source_name,source_description,target_name,target_description
0,continent of,https://www.wikidata.org/wiki/Q16,https://www.wikidata.org/wiki/Q49,Canada,country in North America,North America,continent on the Earth's northwestern quadrant
1,continent of,https://www.wikidata.org/wiki/Q17,https://www.wikidata.org/wiki/Q48,Japan,constitutional monarchy in East Asia,Asia,"continent, mainly on the Earth's northeastern ..."
2,continent of,https://www.wikidata.org/wiki/Q20,https://www.wikidata.org/wiki/Q46,Norway,constitutional monarchy in Northern Europe,Europe,"continent on Earth, mainly on the northeastern..."
3,continent of,https://www.wikidata.org/wiki/Q20,https://www.wikidata.org/wiki/Q51,Norway,constitutional monarchy in Northern Europe,Antarctica,polar continent
4,continent of,https://www.wikidata.org/wiki/Q21,https://www.wikidata.org/wiki/Q46,England,"country in north-west Europe, part of the Unit...",Europe,"continent on Earth, mainly on the northeastern..."
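
The two merges above follow the same pattern, so they can be folded into one helper. This is only a sketch on toy rows taken from the outputs in this notebook; the helper name `attach_labels` is hypothetical and not part of the original code.

```python
import pandas as pd

edges = pd.DataFrame(
    {
        "source_item_id": [16, 20],
        "edge_property_id": [30, 30],
        "target_item_id": [49, 46],
    }
)
item = pd.DataFrame(
    {
        "item_id": [16, 20, 46, 49],
        "en_label": ["Canada", "Norway", "Europe", "North America"],
        "en_description": [
            "country in North America",
            "constitutional monarchy in Northern Europe",
            "continent on Earth, mainly on the northeastern quadrant, i.e. north-western Eurasia",
            "continent on the Earth's northwestern quadrant",
        ],
    }
)


def attach_labels(edges, item, side):
    """Left-join item labels onto one end ('source' or 'target') of each edge."""
    renamed = item.rename(
        columns={
            "item_id": f"{side}_item_id",
            "en_label": f"{side}_name",
            "en_description": f"{side}_description",
        }
    )
    # Renaming first avoids the item_id_x / item_id_y suffix columns.
    return edges.merge(renamed, on=f"{side}_item_id", how="left")


table = attach_labels(attach_labels(edges, item, "source"), item, "target")
print(table[["source_name", "edge_property_id", "target_name"]])
```

A left join keeps every statement even when an entity has no row in `item`; in that case the name and description come back as `NaN`.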


In [None]:
G = nx.MultiGraph()
for _, row in merged_table.iterrows():
  G.add_node(row['source_name'], link=row['source_link'], description=row['source_description'])
  G.add_node(row['target_name'], link=row['target_link'], description=row['target_description'])
  G.add_edge(row['source_name'], row['target_name'], label=row['edge_property_id'])
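
`nx.MultiGraph` is undirected, so the direction of a relation (for example, which node is the capital and which is the country) is not stored. If direction matters, one option is a `nx.MultiDiGraph` built straight from the edge table with `nx.from_pandas_edgelist`. The sketch below uses a toy table with a `label` column; that column name is an assumption made for brevity, since the notebook keeps the relation name in `edge_property_id`.

```python
import networkx as nx
import pandas as pd

# Toy edge table; the relations mirror rows seen in this notebook's outputs.
table = pd.DataFrame(
    {
        "source_name": ["Norway", "Norway", "Warsaw"],
        "target_name": ["Europe", "Antarctica", "Poland"],
        "label": ["continent of", "continent of", "capital of"],
    }
)

# Directed multigraph: each row becomes one edge source -> target carrying its label.
D = nx.from_pandas_edgelist(
    table,
    source="source_name",
    target="target_name",
    edge_attr="label",
    create_using=nx.MultiDiGraph,
)

print(list(D.successors("Norway")))    # nodes that Norway points to
print(list(D.predecessors("Poland")))  # nodes that point to Poland
```

Node attributes such as `link` and `description` are not set by `from_pandas_edgelist`; they would still need to be added separately, for example with `nx.set_node_attributes`.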

In [None]:
def get_plt(G):
  pos = nx.fruchterman_reingold_layout(G, k=0.5)
  edge_traces = []
  for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_trace = go.Scatter(
      x=[x0, x1, None],
      y=[y0, y1, None],
      mode='lines+markers',
      line=dict(width=0.5, color='gray'),
      marker=dict(
        symbol="arrow-bar-up",
        size=10,
        angleref="previous",
      ),
      hoverinfo='none'
    )
    edge_traces.append(edge_trace)

  node_trace = go.Scatter(
    x=[pos[node][0] for node in G.nodes()],
    y=[pos[node][1] for node in G.nodes()],
    mode='markers+text',
    marker=dict(size=10, color='lightblue'),
    text=[node for node in G.nodes()],
    # Plotly hover labels use <br> for line breaks; '\n' is not rendered.
    hovertext=[f'Title: {node}<br>Description: ' + str(G.nodes[node]['description']) + '<br>Link: ' + str(G.nodes[node]['link']) for node in G.nodes()],
    textposition='top center',
    textfont=dict(size=7)
  )

  edge_label_trace = go.Scatter(
    x=[(pos[edge[0]][0] + pos[edge[1]][0]) / 2 for edge in G.edges()],
    y=[(pos[edge[0]][1] + pos[edge[1]][1]) / 2 for edge in G.edges()],
    mode='text',
    # Read each edge's own label; indexing key 0 would repeat the first label for parallel edges.
    text=[label for _, _, label in G.edges(data='label')],
    textposition='middle center',
    hoverinfo='none',
    textfont=dict(size=7)
  )

  layout = go.Layout(
    title='Knowledge Graph',
    title_font_size=16,
    title_x=0.5,
    showlegend=False,
    hovermode='closest',
    margin=dict(b=20, l=5, r=5, t=40),
    xaxis_visible=False,
    yaxis_visible=False
  )

  fig = go.Figure(data=edge_traces + [node_trace, edge_label_trace], layout=layout)
  fig.show()

### 2. Implement a service that produces a brief description of each document in the knowledge base (abstract, summary)

In [None]:
get_plt(G)

### 3. Implement a query-based search service within the knowledge base

In [None]:
def get_description(node_name):
  # Print the stored attributes of a node together with its neighbours in the graph.
  print(f'Title of document: {node_name}\n' +
  f'Description: {G.nodes[node_name]["description"]}\n' +
  f'Link: {G.nodes[node_name]["link"]}\n' +
  f'Connected with: {list(G.neighbors(node_name))}')

In [None]:
get_description('Europe')

Title of document: Europe
Description: continent on Earth, mainly on the northeastern quadrant, i.e. north-western Eurasia
Link: https://www.wikidata.org/wiki/Q46
Connected with: ['Norway', 'England', 'Scotland', 'Wales', 'Northern Ireland', 'Ireland', 'Hungary', 'Spain', 'Belgium', 'Luxembourg', 'Finland', 'Sweden', 'Denmark', 'Poland', 'Lithuania', 'Italy', 'Switzerland', 'Austria', 'Greece', 'Turkey', 'Portugal', 'Netherlands', 'Courrendlin', 'Bern', 'Geneva', 'Zürich', 'London', 'Paris', 'France', 'United Kingdom', 'Russia', 'Yorkshire', 'Germany', 'Belarus', 'Iceland', 'Estonia', 'Latvia', 'Ukraine', 'Czech Republic', 'Slovakia', 'Slovenia', 'Vilnius', 'Moldova', 'Romania', 'Bulgaria', 'North Macedonia', 'Albania', 'Croatia', 'Bosnia and Herzegovina', 'Azerbaijan', 'Andorra', 'Republic of Cyprus', 'Georgia', 'Wallonia', 'Kazakhstan', 'Malta', 'Flanders', 'Monaco', 'Montenegro', 'Vatican City', 'San Marino', 'Brussels', 'Brussels-Capital Region', 'Poznań', 'Warsaw', 'Eurovision So

In [None]:
get_description('Asia')

Title of document: Asia
Description: continent, mainly on the Earth's northeastern quadrant
Link: https://www.wikidata.org/wiki/Q48
Connected with: ['Japan', 'Turkey', 'Egypt', "People's Republic of China", 'Russia', 'Azerbaijan', 'Republic of Cyprus', 'Georgia', 'Kazakhstan', 'Indonesia', 'Uzbekistan', 'Tashkent', 'Singapore', 'Bahrain', 'Armenia', 'Istanbul', 'North Korea', 'Cambodia']


In [None]:
get_description('Poland')

Title of document: Poland
Description: republic in Central Europe
Link: https://www.wikidata.org/wiki/Q36
Connected with: ['Europe', 'Warsaw']
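
`get_description` needs the exact node name. For free-text queries, one possible approach is a case-insensitive substring match over node names and descriptions. The sketch below is an assumption about what such a search could look like, not part of the original notebook; the function name `search_nodes` is hypothetical, and the toy graph reuses three nodes from the outputs above.

```python
import networkx as nx


def search_nodes(graph, query):
    """Return names of nodes whose name or description contains the query."""
    needle = query.casefold()
    hits = []
    for name, attrs in graph.nodes(data=True):
        # str() guards against non-string values such as NaN descriptions.
        haystack = f"{name} {attrs.get('description', '')}".casefold()
        if needle in haystack:
            hits.append(name)
    return hits


# Toy graph with attributes copied from the outputs above.
toy = nx.MultiGraph()
toy.add_node("Poland", description="republic in Central Europe",
             link="https://www.wikidata.org/wiki/Q36")
toy.add_node("Asia", description="continent, mainly on the Earth's northeastern quadrant",
             link="https://www.wikidata.org/wiki/Q48")
toy.add_node("Europe",
             description="continent on Earth, mainly on the northeastern quadrant, "
                         "i.e. north-western Eurasia",
             link="https://www.wikidata.org/wiki/Q46")

print(search_nodes(toy, "europe"))
print(search_nodes(toy, "continent"))
```

Each returned name can then be passed to `get_description` to print the full record. A linear scan like this is adequate for a graph of this size; a larger knowledge base would likely need an index.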
