<a href="https://colab.research.google.com/github/r2barati/TREC23-CrisisFACTS/blob/main/TREC_23_CrisisFACTS_First_Prototype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install --upgrade git+https://github.com/allenai/ir_datasets.git@crisisfacts # install ir_datasets (crisisfacts branch)

credentials = {
    "institution": "Toronto Metropolitan University", # University, Company or Public Agency Name
    "contactname": "Reza Barati, Aary Kartha", # Your Name
    "email": "rezabarati@gmail.com, aaryaman.kartha@torontomu.ca", # A contact email address
    "institutiontype": "Research" # Either 'Research', 'Industry', or 'Public Sector'
}

# Write this to a file so it can be read when needed
import json
import os

home_dir = os.path.expanduser('~')

!mkdir -p ~/.ir_datasets/auth/
with open(home_dir + '/.ir_datasets/auth/crisisfacts.json', 'w') as f:
    json.dump(credentials, f)

# Event numbers as a list
eventNoList = [
    "001", # Lilac Wildfire 2017
    "002", # Cranston Wildfire 2018
    "003", # Holy Wildfire 2018
    "004", # Hurricane Florence 2018
    "005", # 2018 Maryland Flood
    "006", # Saddleridge Wildfire 2019
    "007", # Hurricane Laura 2020
    "008", # Hurricane Sally 2020
    "009", # Beirut Explosion, 2020
    "010", # Houston Explosion, 2020
    "011", # Rutherford TN Floods, 2020
    "012", # TN Derecho, 2020
    "013", # Edenville Dam Fail, 2020
    "014", # Hurricane Dorian, 2019
    "015", # Kincade Wildfire, 2019
    "016", # Easter Tornado Outbreak, 2020
    "017", # Tornado Outbreak, 2020 Apr
    "018", # Tornado Outbreak, 2020 March
]

import requests

# Gets the list of days for a specified event number, e.g. '001'
def getDaysForEventNo(eventNo):

    # We will download a file containing the day list for an event
    url = "http://trecis.org/CrisisFACTs/CrisisFACTS-"+eventNo+".requests.json"

    # Download the list and parse it as JSON, failing loudly on an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    dayList = response.json()

    # Each day object in the returned list contains the following fields
    #   {
    #      "eventID" : "CrisisFACTS-001",
    #      "requestID" : "CrisisFACTS-001-r3",
    #      "dateString" : "2017-12-07",
    #      "startUnixTimestamp" : 1512604800,
    #      "endUnixTimestamp" : 1512691199
    #   }

    return dayList

for day in getDaysForEventNo(eventNoList[0]):
    print(day["dateString"])

eventsMeta = {}

for eventNo in eventNoList: # for each event
    dailyInfo = getDaysForEventNo(eventNo) # get the list of days
    eventsMeta[eventNo]= dailyInfo

    print("Event "+eventNo)
    for day in dailyInfo: # for each day
        print("  crisisfacts/"+eventNo+"/"+day["dateString"], "-->", day["requestID"]) # construct the request string

    print()

import ir_datasets

# download the first day for event 001 (this is a lazy call, it won't download until we first request a document from the stream)
dataset = ir_datasets.load('crisisfacts/001/2017-12-07')

for item in dataset.docs_iter()[:10]: # create an iterator over the stream containing the first 10 items
    print(item)

# download the second day for event 009, first 2023 event
dataset = ir_datasets.load('crisisfacts/009/2020-08-04')

for item in dataset.docs_iter()[:10]: # create an iterator over the stream containing the first 10 items
    print(item)

import pandas as pd

# Convert the stream of items to a Pandas Dataframe
itemsAsDataFrame = pd.DataFrame(dataset.docs_iter())

# Create a filter expression
is_reddit =  itemsAsDataFrame['source_type']=="Reddit"

# Apply our filter
itemsAsDataFrame[is_reddit]

# Create a filter expression
is_twitter =  itemsAsDataFrame['source_type']=="Twitter"

# Apply our filter
itemsAsDataFrame[is_twitter]

# Create a filter expression
is_fb =  itemsAsDataFrame['source_type']=="Facebook"

# Apply our filter
itemsAsDataFrame[is_fb]

# Create a filter expression
is_news =  itemsAsDataFrame['source_type']=="News"

# Apply our filter
itemsAsDataFrame[is_news]

import pandas as pd

pd.DataFrame(dataset.queries_iter())

In [None]:
!pip install python-terrier # install pyTerrier

import pyterrier as pt

# Initialize pyTerrier if not started
if not pt.started():
    pt.init()

# Ask pyTerrier to download the dataset, the 'irds:' header tells pyTerrier to use ir_datasets as the data source
pyTerrierDataset = pt.get_dataset('irds:crisisfacts/009/2020-08-04')

# To create the index, we use an 'indexer', which iterates over the documents in the collection and adds them to the index
# The parameters of this call are:
#  Index Storage Path: "None" (some index types write to disk, this would be the directory to write to)
#  Index Type: type=pt.index.IndexingType(3) (Type 3 is a Memory Index)
#  Meta Index Fields: meta=['docno', 'text'] (The index also can store raw fields so they can be attached to the search results, this specifies what fields to store)
#  Meta Index Lengths: meta_lengths=[40, 200] (pyTerrier allocates a fixed amount of storage space per field, how many characters should this be?)
indexer = pt.IterDictIndexer("None", type=pt.index.IndexingType(3), meta=['docno', 'text'], meta_lengths=[40, 200])

# Trigger the indexing process
index = indexer.index(pyTerrierDataset.get_corpus_iter())

retriever = pt.BatchRetrieve(index, wmodel="DFReeKLIM", metadata=["docno", "text"])

pd.DataFrame(retriever.search("injuries"))

# All of the code above is provided by the CrisisFACTS project at https://colab.research.google.com/github/crisisfacts/utilities/blob/main/00-Data/00-CrisisFACTS.Downloader.ipynb

In [None]:
!pip install gensim

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import numpy as np
# Note: sklearn's cosine_similarity is not imported here, because a local
# cosine_similarity function is defined below and would shadow it

# Prepare the data for Doc2Vec
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(itemsAsDataFrame['text'].str.split())]

# Train a Doc2Vec model
model = Doc2Vec(documents, vector_size=100, window=2, min_count=1, workers=4)

# Generate embeddings for all documents
document_embeddings = np.array([model.infer_vector(doc.words) for doc in documents])

# Function to calculate cosine similarity between two vectors
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Generate an embedding for a query
query = "injuries"
query_embedding = model.infer_vector(query.split())

# Calculate similarity scores between the query and all documents
similarity_scores = [cosine_similarity(query_embedding, doc_embedding) for doc_embedding in document_embeddings]

# Print the top 10 documents with the highest similarity scores
top_docs = np.argsort(similarity_scores)[::-1][:10]
print(itemsAsDataFrame.iloc[top_docs])
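
The per-document similarity loop above can also be expressed as a single matrix-vector product. The following is a minimal sketch, not the notebook's own method: `top_k_by_cosine` is a hypothetical helper, and it assumes the document embeddings are stacked row-wise in a 2-D NumPy array, as `document_embeddings` is.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, k=10):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)

    # Product of each document norm with the query norm
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)

    # Dot products for all documents at once; all-zero vectors get similarity 0
    scores = np.divide(
        doc_matrix @ query_vec,
        denom,
        out=np.zeros(len(doc_matrix)),
        where=denom > 0,
    )

    # Indices of the k highest scores, best first
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

With the variables defined earlier, `top_k_by_cosine(query_embedding, document_embeddings)` should give the same ordering as `np.argsort(similarity_scores)[::-1][:10]`, apart from ties and floating-point rounding.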

In [None]:
!pip install bert-extractive-summarizer

from summarizer import Summarizer

# Instantiate a Summarizer model (named separately so it does not overwrite the Doc2Vec `model`)
summarizer = Summarizer()

# Let's say we want to summarize the top 10 documents related to "injuries"
top_docs_text = itemsAsDataFrame.iloc[top_docs]['text']

# Concatenate the text of the top documents into one string
text_to_summarize = ' '.join(top_docs_text)

# Use the model to summarize the text
summary = summarizer(text_to_summarize)

print(summary)


In [None]:
import json

def process_request(request):
    # This is a placeholder function.
    return [
        {"doc_id": "doc1", "score": 0.9},
        {"doc_id": "doc2", "score": 0.8},
        # ... more documents ...
    ]

results = []
for eventNo in eventNoList:
    dailyInfo = getDaysForEventNo(eventNo)
    for day in dailyInfo:
        request_id = day["requestID"]
        ranked_docs = process_request(request_id)
        for rank, doc in enumerate(ranked_docs, start=1):
            result = {
                "run_id": "run1",  # placeholder
                "event_id": day["eventID"],
                "request_id": request_id,
                "doc_id": doc["doc_id"],
                "rank": rank,
                "score": doc["score"],
                "run_type": "automatic"  # placeholder
            }
            results.append(result)

# Write the results to the output file
with open('submission.json', 'w') as f:
    for result in results:
        json.dump(result, f)
        f.write('\n')  # write a newline character after each JSON object

with open('submission.json', 'r') as f:
    for line in f:
        print(line, end='')  # each line already ends with a newline
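
The placeholder `process_request` above returns fixed documents. As a sketch of how real retrieval output could be mapped to the same row layout, the hypothetical helpers below take `(doc_id, score)` pairs and produce ranked rows. The field names mirror the placeholder loop above and are an assumption, not a confirmed copy of the official CrisisFACTS submission schema.

```python
import json

def to_submission_rows(hits, event_id, request_id, run_id="run1", k=10):
    """Convert (doc_id, score) pairs into ranked submission rows.

    Field names follow the placeholder loop in the notebook (an assumption).
    """
    # Highest score first, keeping only the top k
    ranked = sorted(hits, key=lambda pair: pair[1], reverse=True)[:k]
    return [
        {
            "run_id": run_id,
            "event_id": event_id,
            "request_id": request_id,
            "doc_id": doc_id,
            "rank": rank,
            "score": float(score),
            "run_type": "automatic",
        }
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]

def write_json_lines(rows, path):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

For a PyTerrier result DataFrame `df`, the pairs could be supplied as `zip(df["docno"], df["score"])`.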


In [None]:
import numpy as np

def precision_at_k(r, k):
    """Score is precision @ k
    Relevance is binary (nonzero is relevant).
    """
    assert k >= 1
    r = np.asarray(r)[:k] != 0
    return np.mean(r)

def dcg_at_k(r, k):
    """Score is discounted cumulative gain (dcg)
    Relevance is positive real values.
    """
    r = np.asarray(r)[:k]
    if r.size:
        return np.sum(r / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    """Score is normalized discounted cumulative gain (ndcg)"""
    dcg_max = dcg_at_k(sorted(r, reverse=True), k)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k) / dcg_max

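def expected_reciprocal_rank(r):
    """ERR is the expected reciprocal rank (cascade model)"""
    max_grade = max(r)
    p = 1.0    # probability that the user reaches the current rank
    ERR = 0.0
    for i in range(len(r)):
        rank = i+1
        R = (2**r[i]-1) / 2**max_grade  # probability that this document satisfies the user
        ERR += p * R / rank             # the user stops here with probability p * R
        p *= 1 - R                      # otherwise the user continues to the next rank
    return ERR

# Assumption for relevance scores for a query
relevance_scores = [3, 2, 3, 0, 0, 1, 2, 3, 2, 0]

# Calculate the metrics:
print("P@10:", precision_at_k(relevance_scores, 10))
print("nDCG@10:", ndcg_at_k(relevance_scores, 10))
print("ERR:", round(expected_reciprocal_rank(relevance_scores), 4))


P@10: 0.7
nDCG@10: 0.9183997457184155
ERR: 0.9225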


This pipeline is our first prototype for the TREC 2023 CrisisFACTS challenge, which aims to develop systems that can effectively retrieve and analyze information about crisis events. The specific requirements of the challenge can be found on the [CrisisFACTS website](https://crisisfacts.github.io/).

Here's a breakdown of what each section does:

1. **Installation and Authentication**: The code begins by installing the `crisisfacts` branch of the `ir_datasets` package, a collection of information retrieval datasets, and setting up authentication for the CrisisFACTS dataset. The credentials are written to `~/.ir_datasets/auth/crisisfacts.json` so they can be read when needed.
Provided by the Project website at: https://colab.research.google.com/github/crisisfacts/utilities/blob/main/00-Data/00-CrisisFACTS.Downloader.ipynb

2. **Event List and Metadata Retrieval**: The code then defines a list of 18 event numbers, each representing a different crisis event. For each event, it downloads the list of daily requests, where each entry carries the event ID, a request ID, the date string, and start and end Unix timestamps.
Provided by the Project website at: https://colab.research.google.com/github/crisisfacts/utilities/blob/main/00-Data/00-CrisisFACTS.Downloader.ipynb

3. **Data Loading and Exploration**: The code loads one day of data for two events (`crisisfacts/001/2017-12-07` and `crisisfacts/009/2020-08-04`) and prints the first 10 items of each. It then converts the second stream into a pandas DataFrame, filters it by source type (Reddit, Twitter, Facebook, News), and lists the queries for that day.
Provided by the Project website at: https://colab.research.google.com/github/crisisfacts/utilities/blob/main/00-Data/00-CrisisFACTS.Downloader.ipynb

4. **Indexing and Retrieval**: The code uses PyTerrier (the `python-terrier` package) to build an in-memory index over the documents, storing the `docno` and `text` fields as metadata. It then retrieves documents for the query "injuries" using the `BatchRetrieve` class with the `DFReeKLIM` weighting model.
Provided by the Project website at: https://colab.research.google.com/github/crisisfacts/utilities/blob/main/00-Data/00-CrisisFACTS.Downloader.ipynb

5. **Document Embedding and Similarity Calculation**: The code uses the `gensim` package to train a Doc2Vec model on the text of the documents. It then calculates the cosine similarity between a query and all documents, and prints out the top 10 most similar documents.

6. **Summarization**: The code uses the `bert-extractive-summarizer` package to summarize the text of the top 10 most similar documents.

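7. **Result Generation and Submission**: The code writes a file, `submission.json`, with one JSON object per line, giving the ranked list of documents for each request. In this prototype the ranking comes from a placeholder `process_request` function that returns fixed documents, so the file only illustrates the output format intended for submission to the TREC 2023 CrisisFACTS challenge.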

8. **Evaluation Metrics**: The code defines several functions for calculating information retrieval evaluation metrics, such as precision at k, normalized discounted cumulative gain (nDCG) at k, and expected reciprocal rank (ERR).
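
Steps 2 and 3 are linked by the dataset identifier: each daily request maps to an `ir_datasets` name of the form `crisisfacts/<event number>/<date>`. A small sketch of that mapping follows, assuming day objects with the `requestID` and `dateString` fields listed in the notebook; `dataset_ids_for_event` is a hypothetical helper, not part of `ir_datasets`.

```python
def dataset_ids_for_event(event_no, day_list):
    """Map each requestID to its ir_datasets identifier."""
    return {
        day["requestID"]: f"crisisfacts/{event_no}/{day['dateString']}"
        for day in day_list
    }

# Example day object, copied from the field listing in the notebook
days = [
    {
        "eventID": "CrisisFACTS-001",
        "requestID": "CrisisFACTS-001-r3",
        "dateString": "2017-12-07",
        "startUnixTimestamp": 1512604800,
        "endUnixTimestamp": 1512691199,
    }
]

ids = dataset_ids_for_event("001", days)
print(ids)
```

Each resulting identifier can then be passed to `ir_datasets.load`, as the notebook does for `crisisfacts/001/2017-12-07`.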
