# 1. Preparing Telegram forwards data

This section processes the initial dataset: forwards between 100 political Telegram channels, provided by the TgStat service.

Our goal is to explore this data and build three graphs from it:

* simple graph
* graph with weighted edges to estimate the strength of connection between channels
* bipartite graph with weighted edges

The idea of a bipartite graph with weighted edges comes from Castells' communication model. In this model, a communicator in a network can act as both an author and a distributor, which is why we evaluate every node in both roles.
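
As a rough illustration of these two roles, the sketch below counts, for each channel, how often it acts as a distributor (it forwards or mentions someone) and how often as an author (it is forwarded or mentioned). The toy records and the helper name `role_counts` are hypothetical; only the `channel` and `mentioned_channel` keys follow the dataset.

```python
from collections import Counter

def role_counts(records):
    """Return {channel: (times as distributor, times as author)}."""
    as_distributor = Counter(r['channel'] for r in records)
    as_author = Counter(r['mentioned_channel'] for r in records)
    channels = set(as_distributor) | set(as_author)
    # a Counter returns 0 for channels that never appear in a role
    return {c: (as_distributor[c], as_author[c]) for c in channels}

# toy records (hypothetical), shaped like forwards[i]['row']
toy = [
    {'channel': 'a', 'mentioned_channel': 'b'},
    {'channel': 'a', 'mentioned_channel': 'c'},
    {'channel': 'b', 'mentioned_channel': 'a'},
]
roles = role_counts(toy)
print(roles['a'])  # (2, 1)
```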

### Import required packages

In [1]:
import networkx as nx
from networkx.algorithms import bipartite
import json
import collections

## 1.1 Forwards data

### Import the data and look at its structure

In [2]:
with open('../data/forwards_e_2019.jsonl') as f:
    # drop the last two characters and append ']' so the content parses as one JSON array
    text = f.read()[0:-2]+']'
    forwards = json.loads(text)

print("Number of forwards:", len(forwards))

features = ", ".join(list(forwards[0]['row'].keys()))
print("Available features:", features)

Number of forwards: 58341
Available features: channel, mentioned_channel, mention_type, post_date, post_id, link, text, post_date_h
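
The slice-and-append step above suggests that the file is a JSON array ending with a trailing separator instead of a closing bracket. A slightly more defensive variant, sketched under that assumption, strips trailing whitespace and a trailing comma rather than a fixed number of characters. The helper `parse_unterminated_array` and the sample string are hypothetical, and the real file may be laid out differently.

```python
import json

def parse_unterminated_array(text):
    """Parse a JSON array that may lack its closing bracket and end with a comma."""
    body = text.rstrip()
    try:
        # already a complete JSON document
        return json.loads(body)
    except json.JSONDecodeError:
        # drop a trailing comma, if any, and close the array
        return json.loads(body.rstrip(',') + ']')

# hypothetical sample in the assumed layout
sample = '[{"row": {"channel": "a"}},\n{"row": {"channel": "b"}},\n'
rows = parse_unterminated_array(sample)
print(len(rows))  # 2
```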


### Fields description

* **channel** - the name of the channel that forwarded the message
* **mentioned_channel** - the author of the message
* **mention_type** - a forward of a message (*"forward"*), a forward with an added post (*"post"*), or a mention in an original message (*"channel"*)
* **post_date** - UNIX timestamp of the *mention*
* **post_id** - ID of the post in the channel
* **link** - link to the post in the channel
* **text** - text of the mentioning message
* **post_date_h** - human-readable date of the *mention*
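
To show how the two date fields relate, here is a small sketch that converts a UNIX timestamp into a readable string. It assumes `post_date` is given in seconds and formats the result in UTC; the timezone behind `post_date_h` is not stated here, so the output may be offset from that field. The helper name `to_readable` is hypothetical.

```python
from datetime import datetime, timezone

def to_readable(post_date):
    """Format a UNIX timestamp (seconds) as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    moment = datetime.fromtimestamp(post_date, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

print(to_readable(0))  # 1970-01-01 00:00:00
```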

### Time period

In [3]:
# pick the earliest and latest records directly instead of sorting the list twice
first = min(forwards, key=lambda x: x['row']['post_date'])
last = max(forwards, key=lambda x: x['row']['post_date'])
print("The first message of", first['row']['post_date_h'])
print("The last message of", last['row']['post_date_h'])

The first message of 2018-12-31 21:03:28
The last message of 2019-10-17 09:15:27


### Plain forwards vs. channel mentions vs. forwards with posts

In [4]:
print("The number of forwards in data:", len([f for f in forwards if f['row']['mention_type'] == 'forward']))
print("The number of forwards with posts:", len([f for f in forwards if f['row']['mention_type'] == 'post']))
print("The number of channel mentions:", len([f for f in forwards if f['row']['mention_type'] == 'channel']))

The number of forwards in data: 24225
The number of forwards with posts: 16521
The number of channel mentions: 17595
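
The three counts add up to 58341, the total number of records, so every record carries one of these three types. The same tally can be obtained in a single pass with `collections.Counter`. The sketch below uses hypothetical toy records shaped like the entries of `forwards`; on the real data the generator would iterate over `forwards` instead.

```python
from collections import Counter

# toy records (hypothetical), shaped like the entries of `forwards`
toy_forwards = [
    {'row': {'mention_type': 'forward'}},
    {'row': {'mention_type': 'post'}},
    {'row': {'mention_type': 'forward'}},
    {'row': {'mention_type': 'channel'}},
]

# one pass over the data, one counter per mention type
type_counts = Counter(f['row']['mention_type'] for f in toy_forwards)
print(type_counts['forward'], type_counts['post'], type_counts['channel'])  # 2 1 1
```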


#### An example of forward
A forward is a message that was simply reposted from the author's channel.

In [5]:
example_forward = [f for f in forwards if f['row']['mention_type'] == 'forward'][101]
print("Forwarder:", example_forward['row']['channel'])
print("Mentioned channel:", example_forward['row']['mentioned_channel'])
print("Link to the message:", example_forward['row']['link'])

Forwarder: teory_elit
Mentioned channel: redzion
Link to the message: https://tgstat.ru/channel/teory_elit/8235


#### An example of channel mention
A channel mention is a mention of a channel's name in a message.

In [6]:
example_forward = [f for f in forwards if f['row']['mention_type'] == 'channel'][2]
print("Forwarder:", example_forward['row']['channel'])
print("Mentioned channel:", example_forward['row']['mentioned_channel'])
print("Link to the message:", example_forward['row']['link'])

Forwarder: mediatech
Mentioned channel: criminalru
Link to the message: https://tgstat.ru/channel/mediatech/6657


#### An example of forward with post
A forward with post means that the forwarder wrote something of their own on top of the forwarded message.

In [7]:
example_forward = [f for f in forwards if f['row']['mention_type'] == 'post'][0]
print("Forwarder:", example_forward['row']['channel'])
print("Mentioned channel:", example_forward['row']['mentioned_channel'])
print("Link to the message:", example_forward['row']['link'])

Forwarder: master_pera
Mentioned channel: seryikardinal
Link to the message: https://tgstat.ru/channel/master_pera/2857


## 1.2 Simple graph

In [8]:
nodes = []
edges = []

# collect unique nodes and every (forwarder, mentioned channel) pair
for forward in forwards:
    row = forward['row']
    if row['channel'] not in nodes:
        nodes.append(row['channel'])
    if row['mentioned_channel'] not in nodes:
        nodes.append(row['mentioned_channel'])

    edges.append((row['channel'], row['mentioned_channel']))
    
# create a simple undirected graph (repeated and reciprocal pairs collapse into one edge)
G_simple = nx.Graph()
G_simple.add_nodes_from(nodes)
G_simple.add_edges_from(edges)

# show statistics
print("Number of nodes in the simple network:", G_simple.number_of_nodes())
print("Number of edges in the graph:", G_simple.number_of_edges())

# save this graph
nx.write_gpickle(G_simple, "graphs/1.2_simple_graph.gpickle", protocol=4)

Number of nodes in the simple network: 100
Number of edges in the graph: 3028


In [10]:
G_simple.nodes()

NodeView(('boilerroomchannel', 'criminalru', 'russica2', 'PolitBulka', 'SergeyKolyasnikov', 'fcpeshka', 'kaktovottak', 'mediatech', 'kremlebezBashennik', 'Sandymustache', 'stalin_gulag', 'kashinguru', 'otsuka_bld', 'SerpomPo', 'aavst55', 'ruredmantis', 'prbezposhady', 'popyachsa', 'navalny', 'vibornyk', 'bloodysx', 'redzion', 'greenserpent', 'russicatop', 'fuckyouthatswhy', 'Gubery', 'kstati_p', 'imnotbozhena', 'teory_elit', 'niemandswasser', 'margaritasimonyan', 'ekvinokurova', 'akitilop', 'krasniydom', 'MedvedevVesti', 'politjoystic', 'scienpolicy', 'SolovievLive', 'Ivorytowers', 'solarstorm', 'russianfuture', 'docpro', 'MayorFSB', 'go338', 'vybora', 'kononenkome', 'apostolaki_the_cat', 'kpotupchik', 'postposttruth', 'kbrvdvkr', 'gayasylum', 'rlz_the_kraken', 'politburo2', 'skabeeva', 'obrazbuduschego', 'tikandelaki', 'apologia', 'mysly', 'zakuliska', 'PlushevChannel', 'mig41', 'pzdcofficial', 'tvjihad', 'russiaelections', 'staraya', 'maester', 'antiskrepa', 'kremlin_mother_expert', 

## 1.3 Simple graph with weighted edges

In [57]:
# count how many times each (forwarder, mentioned channel) pair occurs;
# Counter keeps first-occurrence order, so the result matches a manual count
pair_counts = collections.Counter(edges)
weighted_edges = [(source, target, number) for (source, target), number in pair_counts.items()]

In [59]:
# find maximum weight
max_weight = max([int(edge[2]) for edge in weighted_edges])

for e in weighted_edges:
    if int(e[2]) == max_weight:
        print("Channels with maximum connections (weight):", e[0], "&", e[1])
        print("Number of connections between them:", e[2])
        break

Channels with maximum connections (weight): krasniydom & margaritasimonyan
Number of connections between them: 573


In [60]:
# create weighted graph
G_weighted = nx.Graph()
G_weighted.add_weighted_edges_from(weighted_edges)

# show statistics
print("Number of nodes in the weighted network:", G_weighted.number_of_nodes())
print("Number of edges in the weighted graph:", G_weighted.number_of_edges())

# save this graph
nx.write_gpickle(G_weighted, "graphs/1.3_weighted_graph.gpickle", protocol=4)

Number of nodes in the weighted network: 100
Number of edges in the weighted graph: 3028
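
Because `nx.Graph` is undirected, the pairs (a, b) and (b, a) refer to the same edge, and `add_weighted_edges_from` keeps whichever weight is written last. If the weight is meant to reflect connections in both directions, one option is to merge the two directions before building the graph. The sketch below shows this on hypothetical toy pairs; the helper name `undirected_weights` is also hypothetical, and this is not what the cells above do.

```python
from collections import Counter

def undirected_weights(pairs):
    """Sum occurrences of each pair regardless of direction."""
    # sorting each pair maps (a, b) and (b, a) to the same key
    counts = Counter(tuple(sorted(pair)) for pair in pairs)
    return [(u, v, w) for (u, v), w in counts.items()]

# toy pairs (hypothetical): a and b reference each other three times in total
toy_edges = [('a', 'b'), ('b', 'a'), ('a', 'b'), ('a', 'c')]
merged = undirected_weights(toy_edges)
print(merged)  # [('a', 'b', 3), ('a', 'c', 1)]
```

The resulting triples could be passed to `add_weighted_edges_from` in place of `weighted_edges`.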


## 1.4 Bipartite graph with weighted edges

In [61]:
forw = set()
ment = set()
edges_raw = list()

# create sets of forwarders and mentioned channels; write edges between them
for forward in forwards:
    forwarder = forward['row']['channel'] + '__forw'
    mentioned = forward['row']['mentioned_channel'] + '__ment'
    forw.add(forwarder)
    ment.add(mentioned)
    edges_raw.append((forwarder, mentioned))

# create weighted edges: (forwarder, mentioned channel, number of occurrences)
edges_count = collections.Counter(edges_raw)
edges = [(source, target, count) for (source, target), count in edges_count.items()]

# statistics
forw = list(forw)
ment = list(ment)
edges = list(edges)

print("The number of forwarders:", len(forw))
print("The number of mentioned channels:", len(ment))
print("The number of edges between them:", len(edges))

The number of forwarders: 99
The number of mentioned channels: 100
The number of edges between them: 4643
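
The counts show 99 forwarders against 100 mentioned channels, so one channel is mentioned but never appears as a forwarder. A set difference on the base names would identify it. The sketch below uses hypothetical toy sets, and `strip_suffix` is a hypothetical helper.

```python
def strip_suffix(name, suffix):
    """Remove the role suffix from a node name, if present."""
    return name[:-len(suffix)] if name.endswith(suffix) else name

# toy node sets (hypothetical), named like the `forw` and `ment` entries
toy_forw = {'a__forw', 'b__forw'}
toy_ment = {'a__ment', 'b__ment', 'c__ment'}

forwarders = {strip_suffix(n, '__forw') for n in toy_forw}
mentioned = {strip_suffix(n, '__ment') for n in toy_ment}

# channels that are mentioned but never forward anything
never_forward = mentioned - forwarders
print(never_forward)  # {'c'}
```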


In [67]:
# create bipartite graph with entities: forwarders and mentioned
G_bipartite = nx.Graph()
G_bipartite.add_nodes_from(forw, bipartite='forwarders')
G_bipartite.add_nodes_from(ment, bipartite='mentioned')
G_bipartite.add_weighted_edges_from(edges)

# save this graph
nx.write_gpickle(G_bipartite, "graphs/1.4_bipartite_graph.gpickle", protocol=4)
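
Note that `nx.write_gpickle` was removed in NetworkX 3.0, so the save calls in this section only run on older versions. On newer versions the standard `pickle` module can serve the same purpose. The sketch below is a stand-in rather than the notebook's method: `save_pickle` and `load_pickle` are hypothetical helper names, and a plain dictionary is used in place of a graph so the example runs without NetworkX.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path, protocol=4):
    """Serialize any picklable object (for example a graph) to a file."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=protocol)

def load_pickle(path):
    """Load an object previously written by save_pickle."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# stand-in for a graph: adjacency dictionary with edge weights (hypothetical data)
toy_graph = {'a': {'b': 3}, 'b': {'a': 3}}
path = os.path.join(tempfile.mkdtemp(), 'toy_graph.pickle')
save_pickle(toy_graph, path)
restored = load_pickle(path)
print(restored == toy_graph)  # True
```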