# Potential query workflow

This notebook shows examples of Telegram database queries.

### Sentiment analysis of Ukraine-Russia conflict

Let's imagine we want to compare sentiment about the Russia-Ukraine war across three samples, each covering one week. We chose these three samples because each coincides with an important phase or event (such as the pre-war period, the beginning of the war, or an important battle).

To start, we assume that all of our initial input chats were related to the Russia-Ukraine conflict.

First, we are going to filter our dataset by date and retrieve the specific fields that will help us understand our topic better. Then we are going to save our data.

In [1]:
# Connect to database
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["Telegram-1-crawler"]

# Connect to the collection and see details
messages_collection = db["messages"]

collection_details = db.command("collstats", "messages")
print(f"Documents count: {collection_details['count']}")
print(f"Collection size: {collection_details['size']}")
print(f"Average document size: {collection_details['avgObjSize']}")
print(f"Index size: {collection_details['totalIndexSize']}")

Documents count: 1964128
Collection size: 4108442565
Average document size: 2091
Index size: 78028800


In [3]:
# Let's check indices, to see which fields we can easily query
indices = messages_collection.list_indexes()
for index in indices:
    print(index)

SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')])
SON([('v', 2), ('key', SON([('date', 1)])), ('name', 'date_1')])
SON([('v', 2), ('key', SON([('peer_id.channel_id', 1)])), ('name', 'peer_id.channel_id_1')])


In [6]:
# Looks like we are good! We have an index on the date field, so date-range queries will be efficient.
# Let's define our dates
from datetime import datetime

date_start1 = datetime(2021, 12, 15, 0, 0, 0)  # Prewar
date_end1 = datetime(2021, 12, 22, 0, 0, 0)

date_start2 = datetime(2022, 2, 24, 0, 0, 0)  # Beginning of the war
date_end2 = datetime(2022, 3, 2, 0, 0, 0)

date_start3 = datetime(2023, 6, 8, 0, 0, 0)  # During the war
date_end3 = datetime(2023, 6, 15, 0, 0, 0)

# Let's count the number of documents
count1 = messages_collection.count_documents(
    {"date": {"$gte": date_start1, "$lte": date_end1}}
)
print(count1)
count2 = messages_collection.count_documents(
    {"date": {"$gte": date_start2, "$lte": date_end2}}
)
print(count2)
count3 = messages_collection.count_documents(
    {"date": {"$gte": date_start3, "$lte": date_end3}}
)
print(count3)

12581
15124
10112
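
Note that the filters above use `$lte` on the end date, so a message stamped exactly at midnight on the end date is included, and the second window (24 February to 2 March 2022) spans six days rather than seven. As a minimal sketch, assuming half-open windows are acceptable for the analysis, each window could be derived from its start date instead. The helper name `week_window` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timedelta


def week_window(start, days=7):
    """Build a half-open date filter: start <= date < start + days."""
    return {"date": {"$gte": start, "$lt": start + timedelta(days=days)}}


# Start dates taken from the cell above
starts = {
    "prewar": datetime(2021, 12, 15),
    "beginning": datetime(2022, 2, 24),
    "during": datetime(2023, 6, 8),
}
windows = {label: week_window(start) for label, start in starts.items()}

# The combined filter can be passed to find() or count_documents()
combined_query = {"$or": list(windows.values())}
```

With this approach the counts would differ slightly from the ones printed above, because the second window would be one day longer.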


In [16]:
import pprint

# Let's define our query
query = {
    "$or": [
        {"date": {"$gte": date_start1, "$lte": date_end1}},
        {"date": {"$gte": date_start2, "$lte": date_end2}},
        {"date": {"$gte": date_start3, "$lte": date_end3}},
    ]
}


# Let's check one message to see its fields
one_message = messages_collection.find_one(query)
pprint.pprint(one_message)

{'_': 'Message',
 '_id': ObjectId('65c0fe7f71c2d614315d33b6'),
 'date': datetime.datetime(2021, 12, 15, 0, 0, 10),
 'edit_date': None,
 'edit_hide': False,
 'entities': [],
 'forwards': 467,
 'from_id': {'_': 'PeerChannel', 'channel_id': 1440361410},
 'from_scheduled': False,
 'fwd_from': {'_': 'MessageFwdHeader',
              'channel_post': 1608,
              'date': datetime.datetime(2021, 12, 14, 21, 45, 6),
              'from_id': {'_': 'PeerChannel', 'channel_id': 1395659502},
              'from_name': None,
              'imported': False,
              'post_author': None,
              'psa_type': None,
              'saved_from_msg_id': 24393,
              'saved_from_peer': {'_': 'PeerChannel',
                                  'channel_id': 1440361410}},
 'grouped_id': None,
 'id': 125645,
 'legacy': False,
 'media': None,
 'media_unread': False,
 'mentioned': False,
 'message': 'Why does anyone comply!?\n'
            '\n'
            '“…the medical literature for the

In [27]:
# After thoroughly examining our output, we have decided not to keep all of the fields
# Be cautious here: some fields have different nested key-value pairs across documents, so the safest option is to project on the top-level key of a nested field

cursor = messages_collection.find(
    query, {"message", "date", "forwards", "from_id", "peer_id.channel_id"}
)

our_subset = list(cursor)
print(len(our_subset))
pprint.pprint(our_subset[0:2], indent=4)

38267
[   {   '_id': ObjectId('65c0fe7f71c2d614315d33b6'),
        'date': datetime.datetime(2021, 12, 15, 0, 0, 10),
        'forwards': 467,
        'from_id': {'_': 'PeerChannel', 'channel_id': 1440361410},
        'message': 'Why does anyone comply!?\n'
                   '\n'
                   '“…the medical literature for the past forty-five years has '
                   'been consistent: masks are useless in preventing the '
                   'spread of disease and, if anything, are unsanitary objects '
                   'that themselves spread bacteria and viruses.”',
        'peer_id': {'channel_id': 1437926327}},
    {   '_id': ObjectId('65c0b8e8c648e7d6e90bb0c9'),
        'date': datetime.datetime(2021, 12, 15, 0, 0, 25),
        'forwards': 190,
        'from_id': None,
        'message': 'GOP Makes Huge Gains on Generic Congressional Ballot',
        'peer_id': {'channel_id': 1462338131}}]


In [30]:
# Finally, let's load our subset into a DataFrame
import pandas as pd

df = pd.DataFrame(our_subset)
df.head()

| | _id | peer_id | date | message | from_id | forwards |
|---|---|---|---|---|---|---|
| 0 | 65c0fe7f71c2d614315d33b6 | {'channel_id': 1437926327} | 2021-12-15 00:00:10 | Why does anyone comply!?\n\n“…the medical lite... | {'_': 'PeerChannel', 'channel_id': 1440361410} | 467.0 |
| 1 | 65c0b8e8c648e7d6e90bb0c9 | {'channel_id': 1462338131} | 2021-12-15 00:00:25 | GOP Makes Huge Gains on Generic Congressional ... | | 190.0 |
| 2 | 65bbc5468ac0e993045f2109 | {'channel_id': 1194730261} | 2021-12-15 00:00:55 | The good Germans fled to Argentina and America... | | 3.0 |
| 3 | 65c13e8371c2d614316c0222 | {'channel_id': 1425836445} | 2021-12-15 00:01:01 | https://childrenshealthdefense.org/defender/st... | | 229.0 |
| 4 | 65bbc5468ac0e993045f2108 | {'channel_id': 1194730261} | 2021-12-15 00:02:56 | Radical Americanism. Just shit and shame on ou... | | 3.0 |
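
The `peer_id` and `from_id` columns above still hold nested dictionaries, which are awkward to group or join on. As a rough sketch, the documents could be flattened before building the DataFrame. The helper `flatten_message` and the column names `from_channel_id` and `peer_channel_id` are hypothetical. The sketch also assumes that only `channel_id` is of interest, so a `from_id` of another peer type would simply yield `None`.

```python
def flatten_message(doc):
    """Return a flat copy of a projected message document."""
    from_id = doc.get("from_id") or {}  # from_id can be None
    peer_id = doc.get("peer_id") or {}
    return {
        "_id": doc.get("_id"),
        "date": doc.get("date"),
        "message": doc.get("message"),
        "forwards": doc.get("forwards"),
        "from_channel_id": from_id.get("channel_id"),
        "peer_channel_id": peer_id.get("channel_id"),
    }


# Two documents shaped like the output above (ids shown as plain strings)
sample = [
    {
        "_id": "65c0fe7f71c2d614315d33b6",
        "forwards": 467,
        "from_id": {"_": "PeerChannel", "channel_id": 1440361410},
        "message": "Why does anyone comply!?",
        "peer_id": {"channel_id": 1437926327},
    },
    {
        "_id": "65c0b8e8c648e7d6e90bb0c9",
        "forwards": 190,
        "from_id": None,
        "message": "GOP Makes Huge Gains on Generic Congressional Ballot",
        "peer_id": {"channel_id": 1462338131},
    },
]
flat_rows = [flatten_message(doc) for doc in sample]
# pd.DataFrame(flat_rows) would then have one scalar column per field
```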


### Querying a large dataset

For this example, let's imagine that the initial input chats of our crawler were chosen randomly, so we do not know what kind of chats and messages we have. On top of that, let's assume our crawler ran for 10 months with 10 accounts, leaving us with one billion messages.

Here we need to use MongoDB's indexing capabilities and projections to retrieve just a subset of the data. We can also use our knowledge of the fields that connect our collections to meaningfully reduce our messages to only those from relevant chats.
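
To get a feeling for why indexes and projections matter at this scale, here is a back-of-the-envelope estimate based on the `collstats` figures printed earlier in this notebook. It assumes, as a simplification, that document and index sizes grow linearly with the number of documents, which may not hold in practice.

```python
# Statistics reported by collstats earlier in this notebook (sizes in bytes)
doc_count = 1_964_128
avg_doc_size = 2_091
total_index_size = 78_028_800

# The hypothetical one-billion-message crawl
target_docs = 1_000_000_000

# Assumption: sizes scale linearly with the number of documents
est_data_bytes = target_docs * avg_doc_size
est_index_bytes = total_index_size / doc_count * target_docs

print(f"Estimated data size:  {est_data_bytes / 1e12:.2f} TB")
print(f"Estimated index size: {est_index_bytes / 1e9:.1f} GB")
```

Under these assumptions the documents alone would take roughly 2 TB, while the three existing indexes would take on the order of 40 GB, so index-backed queries are the only realistic way to explore such a collection.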

The workflow is as follows:

1. Search the `full_chat.about` field for the word "National".
2. Merge the matching chats with the links found in them.
In [32]:
# Let's see the structure of one chat

chats_collection = db["chats"]
one_chat = chats_collection.find_one()
pprint.pprint(one_chat)

{'_': 'ChatFull',
 '_id': ObjectId('65bbc1c08ac0e993045de66c'),
 'chats': [{'_': 'Channel',
            'access_hash': 1219837807675274486,
            'admin_rights': None,
            'banned_rights': None,
            'broadcast': True,
            'call_active': False,
            'call_not_empty': False,
            'creator': False,
            'date': datetime.datetime(2017, 6, 23, 19, 1, 4),
            'default_banned_rights': None,
            'fake': False,
            'forum': False,
            'gigagroup': False,
            'has_geo': False,
            'has_link': False,
            'id': 1123235006,
            'join_request': False,
            'join_to_send': False,
            'left': True,
            'megagroup': False,
            'min': False,
            'noforwards': False,
            'participants_count': None,
            'photo': {'_': 'ChatPhoto',
                      'dc_id': 4,
                      'has_video': False,
                      'photo_id':

In [38]:
# Now we have two options: either create a text index on the about field, or use a regular expression, which is slower
# As we do not have too many documents to search here, we are going to use a regex
# chats_collection.create_index([("full_chat.about", "text")])  # We could create the index like this
import re

search_term = "National"
regex_query = {"full_chat.about": {"$regex": re.compile(search_term, re.IGNORECASE)}}

# Let's count
num_chats = chats_collection.count_documents(regex_query)
print(num_chats)

31
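
The search term above is a plain word, but a term containing regex metacharacters (for example a `t.me/` link) would need escaping to be matched literally. A minimal sketch of a reusable filter builder follows. The function name `about_regex_filter` is hypothetical, and the sketch uses MongoDB's `$options: "i"` flag instead of a compiled Python pattern, which should be equivalent for case-insensitive matching.

```python
import re


def about_regex_filter(term, field="full_chat.about"):
    """Case-insensitive substring filter with the search term escaped."""
    return {field: {"$regex": re.escape(term), "$options": "i"}}


regex_filter = about_regex_filter("National")

# A term with regex metacharacters is matched literally
link_filter = about_regex_filter("t.me/")
link_pattern = link_filter["full_chat.about"]["$regex"]
```

The resulting dictionaries can be passed to `count_documents()` or `find()` in the same way as `regex_query`.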


In [47]:
# Yay! We have found 31 chats. Let's see the about section for a few of them
chats = chats_collection.find(regex_query, {"full_chat.about"}).limit(5)
for chat in chats:
    print(chat)

{'_id': ObjectId('65bbc1cf8ac0e993045de9a7'), 'full_chat': {'about': 'Collection of movies created and watched during the reign of National Socialist Germany. \n\nHeil Hitler!'}}
{'_id': ObjectId('65bbc2088ac0e993045df86a'), 'full_chat': {'about': 'A National Socialist Publishing Organization. \n\nChat:\nt.me/InvisibleEmpireChat\n\nIEP is operated by Co-owners:\nEditor-in-Chief: \n@ZakalKampf\nChief Design Officer:\n@FriendlyFather\n\nIn association with the NSCC.\n\nInvisibleEmpirePublishing.com'}}
{'_id': ObjectId('65bbc3738ac0e993045e6d81'), 'full_chat': {'about': 'Strictly a National Socialist Organization\n\nFor subscribers of Invisible Empire Publishing. Here you can chat with one another respectfully. Submit requests for books you would like us to do, and more.\n\nIEP is operated by:\n@danielzakal\n@foundingfather'}}
{'_id': ObjectId('65bbc3f88ac0e993045ea05e'), 'full_chat': {'about': "A National Socialist Publishing Organization. \n\nChief Editor @ZakalKampf\nChief Artist @Frie

In [54]:
# Or like this:
chats = chats_collection.find(regex_query, {"full_chat.about"}).limit(5)
for chat in chats:
    print(chat.get("full_chat", {}).get("about"), "\n", 50 * "-")

Collection of movies created and watched during the reign of National Socialist Germany. 

Heil Hitler! 
 --------------------------------------------------
A National Socialist Publishing Organization. 

Chat:
t.me/InvisibleEmpireChat

IEP is operated by Co-owners:
Editor-in-Chief: 
@ZakalKampf
Chief Design Officer:
@FriendlyFather

In association with the NSCC.

InvisibleEmpirePublishing.com 
 --------------------------------------------------
Strictly a National Socialist Organization

For subscribers of Invisible Empire Publishing. Here you can chat with one another respectfully. Submit requests for books you would like us to do, and more.

IEP is operated by:
@danielzakal
@foundingfather 
 --------------------------------------------------
A National Socialist Publishing Organization. 

Chief Editor @ZakalKampf
Chief Artist @FriendlyFather

In Association with the NSCC and it's members. 
 --------------------------------------------------
We are a National Socialist activist organ

In [84]:
# Now, let's find the network of these chats!
chats = chats_collection.find(regex_query, {"full_chat.id"})
initial_chats = []
for chat in chats:
    initial_chats.append(chat.get("full_chat", {}).get("id"))
print(len(initial_chats))
print(initial_chats)

31
[1327284437, 1635259727, 1648865616, 1569004724, 1303089864, 1304292577, 2128933650, 2121286470, 1833522557, 1916625330, 1718632902, 1718632902, 1275396146, 1354301573, 1317863577, 1232052560, 1588658373, 1406566662, 1790738811, 1406206361, 1249128539, 1471677550, 1527481557, 1334871294, 1189709213, 1409428304, 1961330547, 1977076704, 1649886663, 1744570766, 1187219939]
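
The printed list contains the id 1718632902 twice, so the 31 matching documents correspond to fewer distinct chats. If distinct ids are wanted at this stage, one option is to deduplicate while keeping the original order. The helper `unique_in_order` is hypothetical, and only a short excerpt of the list is used here.

```python
def unique_in_order(values):
    """Drop duplicates while keeping the first occurrence of each value."""
    return list(dict.fromkeys(values))


# Excerpt of the list printed above, including the repeated id
ids_excerpt = [1916625330, 1718632902, 1718632902, 1275396146]
unique_ids = unique_in_order(ids_excerpt)
```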


In [85]:
network_collection = db["network"]
query = {"chat_id": {"$in": initial_chats}}
one = network_collection.find_one(query)

pprint.pprint(one)

{'_id': ObjectId('65bbc1d18ac0e993045dea5b'),
 'chat_id': 1327284437,
 'fwd_from': [],
 'iteration_num': 1,
 'linked': 1296890338,
 'mentions_with_at': ['TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
                      'TheThirdReichCinema',
              

In [86]:
cursor = network_collection.find(query)

all_chats = []
for net in cursor:
    all_chats.extend(net["fwd_from"])
    all_chats.extend(net["mentions_with_at"])
    all_chats.extend(net["mentions_with_tdotme"])
    all_chats.append(net["linked"])
print(len(all_chats))
print(len(list(set(all_chats))))

32877
4917
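
In the sample network document above, `linked` is a numeric chat id while `mentions_with_at` holds usernames, so `all_chats` mixes integers and strings. Only the integers can match `peer_id.channel_id` later on. As a hedged sketch, the two kinds of links could be collected separately. The function `collect_links` is hypothetical, and it assumes that any of the four fields may be missing or `None` in some documents.

```python
def collect_links(network_docs):
    """Split the links found in network documents into ids and usernames."""
    ids, usernames = set(), set()
    list_fields = ("fwd_from", "mentions_with_at", "mentions_with_tdotme")
    for net in network_docs:
        candidates = []
        for field in list_fields:
            candidates.extend(net.get(field) or [])
        candidates.append(net.get("linked"))
        for value in candidates:
            if isinstance(value, int):
                ids.add(value)
            elif isinstance(value, str) and value:
                usernames.add(value)
    return ids, usernames


# A document shaped like the sample above, plus one with missing fields
docs = [
    {
        "chat_id": 1327284437,
        "fwd_from": [],
        "linked": 1296890338,
        "mentions_with_at": ["TheThirdReichCinema", "TheThirdReichCinema"],
        "mentions_with_tdotme": [],
    },
    {"chat_id": 1635259727, "linked": None},
]
linked_ids, linked_usernames = collect_links(docs)
```

Resolving the usernames to chat ids would need an extra lookup (for example against the chats collection), which is not shown here.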


In [87]:
# So, we found 4917 distinct links to other chats in our 31 chats
# Let's merge our two lists

all_chats.extend(initial_chats)
print(len(all_chats))
all_chats = list(set(all_chats))
print(len(all_chats))

32908
4925


In [83]:
# Now we see that 8 of our initial chats were not mentioned or referred to in any of the 31 initial chats
# Let's get our messages
query = {"peer_id.channel_id": {"$in": all_chats}}

num_docs = messages_collection.count_documents(query)
print(num_docs)

1054541


We got just over one million documents!

We cannot be sure that all of these messages belong to the desired channels, so we may want to do some further filtering.

It is important to note that some of the input chats (2 or 3) that initiated this crawling session had "Nazi"/"right-wing" content, which explains why we got this many messages. It is hard to tell how many messages we would get if the input chats had been chosen randomly, or if we had one billion randomly fetched messages (randomly meaning without any consideration of how the input chats relate to each other).
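
One possible starting point for the further filtering mentioned above is to see which channels contribute most of the messages. The sketch below only builds an aggregation pipeline. The function name `top_channels_pipeline` and the output field `n_messages` are hypothetical, and the pipeline relies on the existing `peer_id.channel_id` index for its `$match` stage.

```python
def top_channels_pipeline(channel_ids, limit=20):
    """Build a pipeline that counts messages per channel, largest first."""
    return [
        {"$match": {"peer_id.channel_id": {"$in": list(channel_ids)}}},
        {"$group": {"_id": "$peer_id.channel_id", "n_messages": {"$sum": 1}}},
        {"$sort": {"n_messages": -1}},
        {"$limit": limit},
    ]


# Two of the initial chat ids, used here only as an illustration
pipeline = top_channels_pipeline([1327284437, 1635259727], limit=5)

# With a live connection, the pipeline could be run like this:
# for row in messages_collection.aggregate(pipeline):
#     print(row)
```

Channels with unexpectedly high counts could then be inspected by hand and removed from `all_chats` before fetching the messages.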