<a href="https://colab.research.google.com/github/sush104/Crisis_Event_Ranking_and_Summarization/blob/main/CrisisFACTs_Sushant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TREC CrisisFACTs Track 2022 Tutorial

This notebook illustrates how to download the TREC 2022 CrisisFACTs event streams along with the information needs for each one.

**Part 1: Installing Needed Packages**

Before we can get the data, we need to install some packages to handle the download process and to enable some analysis we will do later in this tutorial. In particular, we are going to install two main packages:


*   ir_datasets (https://github.com/allenai/ir_datasets): A Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc. We can use this to download the raw event streams and the information needs for each.
*   pyTerrier (https://pyterrier.readthedocs.io/en/latest/): A Python wrapper around the Terrier IR Platform (a search engine in a box). We will use this to produce a searchable index for each day of a crisis event, so we can retrieve (hopefully) relevant content for different information needs.



In [22]:
!pip install -q git+https://github.com/allenai/ir_datasets.git@crisisfacts # install ir_datasets (crisisfacts branch)
!pip install -q python-terrier # install pyTerrier

**Import required libraries** 

In [23]:
import pandas as pd
import json
import pyterrier as pt
import ir_datasets
from urllib.request import urlopen
import gensim.downloader as api

**Part 2: Initializing Your Credentials**

To download any part of the CrisisFACTs dataset, you must provide a set of contact details. The reason for this is two-fold: 1) the terms of service of some of the platforms from which we have sourced data (like Twitter) require us to do so, and 2) it allows us to collect statistics on how many people are making use of the data we provide.

**GDPR Statement**: By downloading the CrisisFACTs datasets, you agree to the University of Glasgow processing your personal data, as defined by the EU General Data Protection Regulation (GDPR) - your name and email in this case. Queries about data processing and access/deletion requests should be sent to [me via email](http://www.dcs.gla.ac.uk/~richardm/Home/Contact.html). We will store your data for as long as the track is on-going and up-to 2 years beyond that. I may contact you using the details provided to notify you about changes in the datasets or track, to provide information or ask you questions about your participation or otherwise contact you about topics relevant to emergency management. We may collate statistics from the provided information that will be published, but we will not release individual names or email addresses. 

Rather than entering these details every time you request the dataset, it's more efficient to set them once up front, so fill in your details below:

In [24]:
credentials = {
    "institution": "University of Foo", # University, Company or Public Agency Name
    "contactname": "Foo Bar", # Your Name
    "email": "foo@bar.edu", # A contact email address
    "institutiontype": "Research" # Either 'Research', 'Industry', or 'Public Sector'
}

# Write this to a file so it can be read when needed
!mkdir -p /root/.ir_datasets/auth/
with open('/root/.ir_datasets/auth/crisisfacts.json', 'w') as f:
    json.dump(credentials, f)
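
The cell above hard-codes `/root`, which is the home directory on Colab. Outside Colab, a minimal sketch along the following lines resolves the location from the current user's home directory instead. It assumes that ir_datasets looks for the file under `<home>/.ir_datasets/auth/`, as the Colab path suggests; `write_credentials` and its `home` parameter are hypothetical names introduced only for this illustration.

```python
import json
import tempfile
from pathlib import Path

def write_credentials(credentials, home=None):
    """Write the credentials dict to <home>/.ir_datasets/auth/crisisfacts.json."""
    # Assumption: the auth file lives under the user's home directory
    home = Path(home) if home is not None else Path.home()
    auth_dir = home / ".ir_datasets" / "auth"
    auth_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    target = auth_dir / "crisisfacts.json"
    with open(target, "w") as f:
        json.dump(credentials, f)
    return target

# Demonstration against a throwaway directory so nothing real is overwritten
example = {
    "institution": "University of Foo",
    "contactname": "Foo Bar",
    "email": "foo@bar.edu",
    "institutiontype": "Research",
}
with tempfile.TemporaryDirectory() as tmp:
    path = write_credentials(example, home=tmp)
    saved = json.loads(path.read_text())
    print(saved == example)
```

Calling `write_credentials(credentials)` without `home` would write to the real home directory of whoever runs the notebook.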

**Part 3: Understanding the structure of the CrisisFACTs Dataset**

The CrisisFACTs dataset is divided into events, representing real-world crises. Each event is given an identifier, e.g. 'CrisisFACTS-001' is the Lilac Wildfire from 2017. We sometimes refer to the event number or 'eventNo', which is the last three digits of the event identifier, e.g. '001'. There are 8 events for CrisisFACTs 2022:

In [25]:
# Event numbers as a list
eventNoList = [
          "001", # Lilac Wildfire 2017
          "002", # Cranston Wildfire 2018
          "003", # Holy Wildfire 2018
          "004", # Hurricane Florence 2018
          "005", # 2018 Maryland Flood
          "006", # Saddleridge Wildfire 2019
          "007", # Hurricane Laura 2020
          "008" # Hurricane Sally 2020
]

Each event has a duration, i.e. it lasts for a number of days. In the CrisisFACTs track, you need to produce a timeline summary for each day for a set of events. You can get the list of days for an event as shown below (example is for event "001", i.e. the Lilac Wildfire 2017):

In [26]:
# Gets the list of days for a specified event number, e.g. '001'
def getDaysForEventNo(eventNo):

  # We will download a file containing the day list for an event
  url = "http://trecis.org/CrisisFACTs/CrisisFACTS-"+eventNo+".requests.json"

  # Download the list and parse as JSON
  dayList = json.loads(urlopen(url).read())

  # Each day object in the returned list contains the following fields:
  #   {
  #      "eventID" : "CrisisFACTS-001",
  #      "requestID" : "CrisisFACTS-001-r3",
  #      "dateString" : "2017-12-07",
  #      "startUnixTimestamp" : 1512604800,
  #      "endUnixTimestamp" : 1512691199
  #   }

  return dayList

# for day in getDaysForEventNo(eventNoList[0]):
#   print(day)
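
The two timestamp fields of a day object delimit that day in UTC. The sketch below converts them to readable datetimes and checks whether an arbitrary timestamp falls inside the day. The day object is copied from the field listing in the comment above; `day_window` and `in_window` are hypothetical helper names, not part of the track's tooling.

```python
from datetime import datetime, timezone

# One day object, copied from the field listing in the cell above
day = {
    "eventID": "CrisisFACTS-001",
    "requestID": "CrisisFACTS-001-r3",
    "dateString": "2017-12-07",
    "startUnixTimestamp": 1512604800,
    "endUnixTimestamp": 1512691199,
}

def day_window(day):
    """Return the (start, end) of a day object as timezone-aware UTC datetimes."""
    start = datetime.fromtimestamp(day["startUnixTimestamp"], tz=timezone.utc)
    end = datetime.fromtimestamp(day["endUnixTimestamp"], tz=timezone.utc)
    return start, end

def in_window(unix_timestamp, day):
    """True if a timestamp (in seconds) falls inside the day, bounds included."""
    return day["startUnixTimestamp"] <= unix_timestamp <= day["endUnixTimestamp"]

start, end = day_window(day)
print(start.isoformat(), end.isoformat())
print(in_window(1512624253, day))
```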

For each day, we collected related content to the event from the following sources:


*   **Twitter**: We are re-using tweets collected as part of the TREC Incident Streams track (http://trecis.org). These tweets were crawled by keyword, and as such most are likely to be relevant to the event, but are not necessarily good candidates for inclusion into a summary of what is happening.
*   **Reddit**: Discussions regarding what happens during events also occur on the forum platform Reddit. We collected Reddit threads relevant to each event, and include both the original submission and subsequent comments within those threads.
*   **News**: Traditional news agencies are often a good source of information during an emergency and so we have also included a small number of news articles collected during each event as well.
*   **Facebook**: We collected Facebook/Meta posts from public pages that are relevant to each event using CrowdTangle. We cannot share the content of these posts, however, we have included the post and page ids of this content within the stream for those who have access to the CrowdTangle API and can retrieve this data separately. 

Because these sources have different formatting and characteristics, we reformatted this data into a list of standardized 'stream items', where a stream item contains:


*   **event**: The identifier of the event, e.g. 'CrisisFACTS-001'
*   **streamID**: A unique identifier for the stream item. This will generally be of the form 'CrisisFACTS-\<eventNo\>-\<source\>-\<postID\>-\<sentenceID\>', e.g. CrisisFACTS-001-Twitter-15712-0.
*   **unixTimestamp**: This is the time that the content was originally posted, expressed as a unix timestamp in seconds (UTC timezone).
*   **text**: The text of the stream item. The maximum length of a stream item is 200 characters. 
*   **sourceType**: A string denoting the source, i.e. either Twitter, Reddit, News or Facebook.
*   **source**: This is the original post content formatted as JSON.

Since some types of content are longer than others (compare a news article vs. a tweet for instance), we perform sentence segmentation on long-form content, so one input post might form multiple stream items. In these cases, the 'sentenceID' component of the streamID denotes the number of the sentence in the source content.
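
Because the streamID encodes the event, source, post and sentence, it can be split back into its components. The following is a minimal sketch that assumes the general form given above; IDs that deviate from that form are not handled, and `StreamID` and `parse_stream_id` are hypothetical names.

```python
from typing import NamedTuple

class StreamID(NamedTuple):
    event: str        # e.g. 'CrisisFACTS-001'
    source: str       # e.g. 'Twitter'
    post_id: str      # identifier of the original post
    sentence_id: int  # position of the sentence within the post

def parse_stream_id(stream_id):
    """Split 'CrisisFACTS-<eventNo>-<source>-<postID>-<sentenceID>' into parts."""
    prefix, event_no, source, rest = stream_id.split("-", 3)
    # The sentence number is always the last component
    post_id, sentence_id = rest.rsplit("-", 1)
    return StreamID(f"{prefix}-{event_no}", source, post_id, int(sentence_id))

parsed = parse_stream_id("CrisisFACTS-001-Twitter-15712-0")
print(parsed)
```

Grouping stream items by `(event, source, post_id)` and sorting by `sentence_id` would then reassemble the sentences of a long-form post in order.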


The dataset is structured by day and event. To access the stream items for a particular \<event,day\> pair we generate a request string specifying the day and event we want, of the form:

*   '**crisisfacts/\<eventNo\>/\<day\>**'

For instance, we could generate request strings for all CrisisFACTs \<event,day\> pairs as follows:

In [27]:

cols = ["Event", "Facts_Date", "Requist_ID"]
rows = []

for eventNo in eventNoList: # for each event
  dayList = getDaysForEventNo(eventNo) # get the list of days
  for day in dayList: # for each day
    rows.append({
        'Event': "Event "+eventNo,
        'Facts_Date': "crisisfacts/"+eventNo+"/"+day["dateString"], # the request string
        'Requist_ID': day["requestID"]
    })

# Build the DataFrame once at the end; DataFrame.append was removed in pandas 2.0
stream_df = pd.DataFrame(rows, columns=cols)

In [28]:
stream_df

Unnamed: 0,Event,Facts_Date,Requist_ID
0,Event 001,crisisfacts/001/2017-12-07,CrisisFACTS-001-r3
1,Event 001,crisisfacts/001/2017-12-08,CrisisFACTS-001-r4
2,Event 001,crisisfacts/001/2017-12-09,CrisisFACTS-001-r5
3,Event 001,crisisfacts/001/2017-12-10,CrisisFACTS-001-r6
4,Event 001,crisisfacts/001/2017-12-11,CrisisFACTS-001-r7
5,Event 001,crisisfacts/001/2017-12-12,CrisisFACTS-001-r8
6,Event 001,crisisfacts/001/2017-12-13,CrisisFACTS-001-r9
7,Event 001,crisisfacts/001/2017-12-14,CrisisFACTS-001-r10
8,Event 001,crisisfacts/001/2017-12-15,CrisisFACTS-001-r11
9,Event 002,crisisfacts/002/2018-07-25,CrisisFACTS-002-r1


Group the rows by event, so that each event name can be passed to its own dataset at summarization time:

In [29]:
event_name_df = stream_df.groupby('Event')
event_name_df.first()

Unnamed: 0_level_0,Facts_Date,Requist_ID
Event,Unnamed: 1_level_1,Unnamed: 2_level_1
Event 001,crisisfacts/001/2017-12-07,CrisisFACTS-001-r3
Event 002,crisisfacts/002/2018-07-25,CrisisFACTS-002-r1
Event 003,crisisfacts/003/2018-08-06,CrisisFACTS-003-r5
Event 004,crisisfacts/004/2018-09-01,CrisisFACTS-004-r8
Event 005,crisisfacts/005/2018-05-27,CrisisFACTS-005-r3
Event 006,crisisfacts/006/2019-10-10,CrisisFACTS-006-r4
Event 007,crisisfacts/007/2020-08-27,CrisisFACTS-007-r13
Event 008,crisisfacts/008/2020-09-11,CrisisFACTS-008-r3


In [30]:
len(event_name_df["Event"])

8

Now that we know the request strings for each event and day, we can download the associated stream for each via ir_datasets:

In [31]:

# download the first day for event 001 (this is a lazy call, it won't download until we first request a document from the stream)
# dataset = ir_datasets.load('crisisfacts/001/2017-12-07')

# for item in dataset.docs_iter()[:10]: # create an iterator over the stream containing the first 10 items
#   print(item)

In [32]:
len(stream_df)

55

In [33]:
# for i in range(0, 1):
#   print("Event: "+stream_df['Event'][i])
#   print("Facts_date: "+stream_df['Facts_Date'][i])
#   dataset = ir_datasets.load(stream_df['Facts_Date'][i])
  # for item in dataset.docs_iter()[:2]: # create an iterator over the stream containing the first 10 items
  #   print(item)

As we can see, the first stream items are tweets, and not all of them are relevant, particularly at the beginning of the event. If we want to find content of other types, we can filter by the source_type field.

In [34]:
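# Load the first day of event 001 (the load calls in the cells above are commented out)
dataset = ir_datasets.load('crisisfacts/001/2017-12-07')

# Convert the stream of items to a Pandas DataFrame
itemsAsDataFrame = pd.DataFrame(dataset.docs_iter())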

# Create a filter expression
is_reddit =  itemsAsDataFrame['source_type']=="Reddit"
is_twitter =  itemsAsDataFrame['source_type']=="Twitter"
is_news =  itemsAsDataFrame['source_type']=="News"
is_facebook =  itemsAsDataFrame['source_type']=="Facebook"
# Apply our filter
itemsAsDataFrame[is_facebook]

Unnamed: 0,doc_id,event,text,source,source_type,unix_timestamp
1188,CrisisFACTS-001-Facebook-0-0,CrisisFACTS-001,,{'pageName': 'Kellyville Fire Department/Women...,Facebook,1512624253
1191,CrisisFACTS-001-Facebook-1-0,CrisisFACTS-001,,"{'pageID': 663003590422565, 'postID': 16043004...",Facebook,1512624364
1192,CrisisFACTS-001-Facebook-1-1,CrisisFACTS-001,,"{'pageName': 'Times of San Diego', 'username':...",Facebook,1512624364
1214,CrisisFACTS-001-Facebook-2-0,CrisisFACTS-001,,"{'pageName': 'The Vista Press Online', 'userna...",Facebook,1512625075
1215,CrisisFACTS-001-Facebook-2-1,CrisisFACTS-001,,"{'pageID': 110068055823871, 'postID': 85585502...",Facebook,1512625075
...,...,...,...,...,...,...
7251,CrisisFACTS-001-Facebook-540-1,CrisisFACTS-001,,"{'pageID': 166794293342433, 'postID': 16724727...",Facebook,1512691084
7252,CrisisFACTS-001-Facebook-541-0,CrisisFACTS-001,,"{'pageID': 110419376353, 'postID': 10155904416...",Facebook,1512691086
7268,CrisisFACTS-001-Facebook-542-0,CrisisFACTS-001,,"{'pageName': 'Soumada Khan', 'username': 'Soum...",Facebook,1512691140
7269,CrisisFACTS-001-Facebook-542-1,CrisisFACTS-001,,"{'pageID': 100057778012514, 'postID': 90004676...",Facebook,1512691140
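
The same kind of filtering can be done without pandas, which may be convenient when iterating over a stream item by item. The sketch below uses a handful of hypothetical items that only mimic the `doc_id` and `source_type` fields shown in the output above; `count_by_source` and `filter_by_source` are illustrative names.

```python
from collections import Counter

# Hypothetical items using the field names shown in the output above
items = [
    {"doc_id": "CrisisFACTS-001-Twitter-15712-0", "source_type": "Twitter"},
    {"doc_id": "CrisisFACTS-001-News-19-11", "source_type": "News"},
    {"doc_id": "CrisisFACTS-001-Facebook-0-0", "source_type": "Facebook"},
    {"doc_id": "CrisisFACTS-001-Twitter-512-0", "source_type": "Twitter"},
]

def count_by_source(items):
    """Count how many stream items come from each source."""
    return Counter(item["source_type"] for item in items)

def filter_by_source(items, source_type):
    """Keep only the items from one source, preserving stream order."""
    return [item for item in items if item["source_type"] == source_type]

counts = count_by_source(items)
tweets = filter_by_source(items, "Twitter")
print(counts)
print([item["doc_id"] for item in tweets])
```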


# Getting the Crisis Responder Information Needs / Queries

Clearly not all of the information in the input stream for each day will be useful for an emergency responder, or even be relevant. Hence it makes sense that we filter these streams down based on what the emergency responder cares about. Our task is focused on producing timeline summaries containing similar information to what might be entered into an after action report, similar to an ICS 209 form: 
* https://training.fema.gov/emiweb/is/icsresource/assets/ics%20forms/ics%20form%20209,%20incident%20status%20summary%20(v3).pdf

But how can we express this information need in a way that a computer can understand? To make it easier for participant systems to integrate content relevant to the event, we have manually constructed a set of queries that encapsulate this information need. These queries are in effect questions that an emergency responder might ask when writing their after action report.

These queries are included as part of each day of the CrisisFACTs dataset, and you can access them as follows:

In [35]:
# pd.DataFrame(dataset.queries_iter())

## Initialize pyTerrier library


In [36]:
# Initialize pyTerrier if not started
if not pt.started():
  pt.init()

# Searching for Relevant Content

At this point you know how to get the data streams that you are to summarize, and you know what ideally should be included within your summary. This is the minimum that you need to tackle the CrisisFACTs task. However, one of the reasons that we integrated the CrisisFACTs datasets into ir_datasets is that it provides you with a plug-and-play means to perform text search of the content for a day via pyTerrier. This is useful as an initial step to find content that is relevant to the emergency responder information needs.

Before we get into creating our search engine, it's worth providing a very broad overview of how a (text) search engine works. At its core, a search engine produces a data structure called an index from your document set. This index makes it really fast to identify documents containing a particular query term. To create our index, we need to provide the input documents, as well as specify what fields in the document contain text that we want to be searchable. We can take one of the request strings for an \<event,day\> pair and ask pyTerrier to create an index for us:
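
Before handing this over to pyTerrier, the idea of an index can be made concrete with a toy inverted index in plain Python. This is only an illustrative sketch, not how Terrier stores its index: it maps each term to the set of documents containing it, with naive whitespace tokenisation. The documents and the names `build_inverted_index` and `lookup` are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased term to the set of docnos whose text contains it."""
    index = defaultdict(set)
    for docno, text in docs.items():
        for term in text.lower().split():
            index[term].add(docno)
    return index

def lookup(index, query):
    """Return the docnos that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical documents standing in for stream items
docs = {
    "d1": "airport closed due to fire",
    "d2": "fire near the airport",
    "d3": "shelters open tonight",
}
index_toy = build_inverted_index(docs)
print(sorted(lookup(index_toy, "airport")))
print(sorted(lookup(index_toy, "airport closed")))
```

The lookup only touches the postings of the query terms, which is why an index makes retrieval fast compared to scanning every document.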

In [37]:
#Pre-trained model
%time model = api.load("glove-twitter-25")

CPU times: user 1min 16s, sys: 6.12 s, total: 1min 22s
Wall time: 1min 31s


In [40]:


# Query expansion using the pre-trained word vectors loaded above
def w2v_qexp(q):
  tokens = set(q["query"].to_string(index=False).split())
  print(q["query"])
  expandedquery = ""
  count = 0

  for token in tokens:
    # `token in model` works across gensim versions (model.vocab was removed in gensim 4)
    if token.isalnum() and token in model:
        similar = dict(model.most_similar(token))

        # Append every alphanumeric neighbour term to the expansion
        for neighbour in similar.keys():
          if neighbour.isalnum():
            expandedquery += " " + str(neighbour)
            count += 1

  q["query"] += expandedquery
  q["score"] += (count*0.1)/len(tokens)
  print(q["score"])
  print(q["query"])
  print("\n")

  return q
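
The expansion step can be tried without downloading any vectors. The sketch below replaces `model.most_similar` with a small hypothetical neighbour table, so the terms and similarity values are invented for illustration. Unlike the function above, it keeps the original token order and skips duplicate expansion terms, which is a design choice of this sketch rather than something the notebook does.

```python
# Hypothetical neighbour table standing in for model.most_similar()
NEIGHBOURS = {
    "fire": [("wildfire", 0.9), ("blaze", 0.8), ("#fire", 0.7)],
    "closed": [("shut", 0.85)],
}

def expand_query(query, neighbours):
    """Append the alphanumeric neighbours of each query token to the query."""
    tokens = query.split()
    expansion = []
    for token in tokens:
        for term, _similarity in neighbours.get(token, []):
            # Skip hashtags and other non-alphanumeric terms, and avoid repeats
            if term.isalnum() and term not in tokens and term not in expansion:
                expansion.append(term)
    return " ".join(tokens + expansion)

expanded = expand_query("fire closed", NEIGHBOURS)
print(expanded)
```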

In [41]:
file_dict = {}
w2v_ranked_df = pd.DataFrame()

for file in stream_df["Event"]:
  for i in range(0,1):
    
    #Printing event no and fact date
    print("Event: "+stream_df['Event'][i])
    print("Facts_date: "+stream_df['Facts_Date'][i])
    print("Requist ID: "+stream_df["Requist_ID"][i])
    add_to_file = stream_df['Facts_Date'][i]
    req_id = stream_df["Requist_ID"][i]
    #Loading the desired dataset from ir_datasets
    # dataset = ir_datasets.load(stream_df['Facts_Date'][i])

    # Ask pyTerrier to download the dataset, the 'irds:' header tells pyTerrier to use ir_datasets as the data source
    pyTerrierDataset = pt.get_dataset('irds:'+stream_df['Facts_Date'][i])

    # To create the index, we use an 'indexer', which iterates over the documents in the collection and adds them to the index
    indexer = pt.index.IterDictIndexer("None", type=pt.index.IndexingType(3), meta=['docno', 'text'], meta_lengths=[40, 200])

    # we give the dataset get_corpus_iter() directly to the indexer
    # while specifying the fields to index and the metadata to record
    index_ref = indexer.index(pyTerrierDataset.get_corpus_iter(), meta=('docno', 'text',))

    #Creating an index
    index = pt.IndexFactory.of(index_ref)
    

    topics = pyTerrierDataset.get_topics(variant='indicative_terms')
    # corpus = pyTerrierDataset.get_corpus_iter()
    
    # for item in dataset.docs_iter()[:2]: # create an iterator over the stream containing the first 10 items
    #   print(item)

    #Baseline Model using TF_IDF
    retriever = pt.BatchRetrieve(index, wmodel="TF_IDF", metadata=["docno", "text"])
    # retriever = pt.BatchRetrieve(index, wmodel="DFReeKLIM", metadata=["docno", "text"])

    print("Ranking after Baseline model: ")
    baseline_ranked_df = pd.DataFrame(retriever.transform(topics))
    print(baseline_ranked_df)

    ##Query Expansion
    BM25 = pt.BatchRetrieve(index, controls = {"wmodel": "BM25"},  metadata=["docno", "text"])
    bo1 = BM25 >> pt.rewrite.Bo1QueryExpansion(index) >> BM25
    # klq = pt.rewrite.KLQueryExpansion(index)

    retriever = pt.BatchRetrieve(index, wmodel="DFReeKLIM", metadata=["docno", "text"])
    pipelineQE = (retriever % 100) >> bo1 >> (retriever % 100)

    print("Ranking after Query Expansion using Bo1 model: ")
    QE_ranked_df = pd.DataFrame(pipelineQE.transform(topics))
    print(QE_ranked_df)

    #Multi-stage retrieval
    #this ranker will make the candidate set of documents for each query
    BM25 = pt.BatchRetrieve(index, controls = {"wmodel": "BM25"},  metadata=["docno", "text"])

    #these rankers will be used to re-score the BM25 candidates with TF_IDF and PL2
    TF_IDF =  pt.BatchRetrieve(index, controls = {"wmodel": "TF_IDF"},  metadata=["docno", "text"])
    PL2 =  pt.BatchRetrieve(index, controls = {"wmodel": "PL2"},  metadata=["docno", "text"])

    #Applying pipeline on BM25 candidate set on top of TF_IDF and PL2
    pipe = BM25 >> (TF_IDF ** PL2)
    fbr = pt.FeaturesBatchRetrieve(index, controls = {"wmodel": "BM25"}, features=["SAMPLE", "WMODEL:TF_IDF", "WMODEL:PL2"], metadata=["docno", "text"]) 

    pipe = BM25 >> (pt.transformer.IdentityTransformer() ** TF_IDF ** PL2)

    #Making the retrieval process faster using compile() from pyterrier
    pipe_fast = pipe.compile()
    print("Ranking after Multistage retrieval model: ")
    custom_ranked_df = (pipe_fast %100).transform(topics)
    print(custom_ranked_df)

    #Applying word2vec model on top of previously defined pipeline
    test_pipeline = retriever >> pt.apply.by_query(w2v_qexp) >> retriever

    #Re-ranking the document using word2vec
    # data = test_pipeline.transform(topics)
    w2v_ranked_df = pd.concat([w2v_ranked_df, test_pipeline.transform(topics)]) # DataFrame.append was removed in pandas 2.0
    # w2v_ranked_df["Original Query"] = topics["query"][i]
    w2v_ranked_df["Facts Date"] = add_to_file
    # w2v_ranked_df["RequistID"] = req_id
    #Selecting 2 top events across four streams using all queries.
    print("After word2vec retrieval model: ")
  
  # w2v_ranked_df = pd.concat(w2v_ranked_df)  
  key = file
  df = file
  file_dict[key] = df
  pd.DataFrame(w2v_ranked_df.groupby('qid').head(2)).to_csv("Crisis_"+str(file_dict[key])+".csv")
  break  
  # Trigger the indexing process
  # index = indexer.index(pyTerrierDataset.get_corpus_iter())

Event: Event 001
Facts_date: crisisfacts/001/2017-12-07
Requist ID: CrisisFACTS-001-r3


crisisfacts/001/2017-12-07 documents: 0it [00:00, ?it/s]



Ranking after Baseline model: 



                            qid  docid                            docno  \
0      CrisisFACTS-General-q001   1513  CrisisFACTS-001-Twitter-48717-0   
1      CrisisFACTS-General-q001   4001   CrisisFACTS-001-Twitter-5831-0   
2      CrisisFACTS-General-q001   3510    CrisisFACTS-001-Twitter-512-0   
3      CrisisFACTS-General-q001    532  CrisisFACTS-001-Twitter-22676-0   
4      CrisisFACTS-General-q001    364  CrisisFACTS-001-Twitter-49822-0   
...                         ...    ...                              ...   
8996  CrisisFACTS-Wildfire-q006    338       CrisisFACTS-001-News-19-11   
8997  CrisisFACTS-Wildfire-q006    339       CrisisFACTS-001-News-19-12   
8998  CrisisFACTS-Wildfire-q006    336        CrisisFACTS-001-News-19-9   
8999  CrisisFACTS-Wildfire-q006    340       CrisisFACTS-001-News-19-13   
9000  CrisisFACTS-Wildfire-q006    341       CrisisFACTS-001-News-19-14   

                                                   text  rank     score  \
0     Good to be home ag

In [55]:
# Best score achieved for each query, used to order queries by their strongest hit
w2v_ranked_df['max'] = w2v_ranked_df.groupby('qid')['score'].transform('max')

w2v_ranked_df = w2v_ranked_df.sort_values(["max","score"], ascending=False).drop('max', axis=1)

w2v_ranked_df.loc[w2v_ranked_df["score"] > 10].groupby("qid").head(1)

Unnamed: 0,qid,docid,docno,text,rank,score,query,Facts Date
7157,CrisisFACTS-Wildfire-q001,1156,CrisisFACTS-001-Twitter-27628-0,make it or break it finals next week but ice s...,0,38.168168,acres size buildings lakes branches barrels po...,crisisfacts/001/2017-12-07
6993,CrisisFACTS-General-q043,3642,CrisisFACTS-001-Twitter-20059-0,Cause baby We're just reckless kids trying to ...,0,37.130202,flooding delays warnings disruption warning fl...,crisisfacts/001/2017-12-07
1165,CrisisFACTS-General-q010,3167,CrisisFACTS-001-Twitter-18807-0,#BREAKING: A person was killed Thursday by a f...,0,36.92808,killed dead hell walking heard fucking bad kil...,crisisfacts/001/2017-12-07
1147,CrisisFACTS-General-q009,1686,CrisisFACTS-001-News-26-18,A San Diego County Sheriff's Department deputy...,0,36.140224,injury injured concussion injuries groin minor...,crisisfacts/001/2017-12-07
3304,CrisisFACTS-General-q026,1702,CrisisFACTS-001-News-26-34,Here are the road closures as of 7:00 p.m. Go...,0,32.424334,tree block road closures cancellations closing...,crisisfacts/001/2017-12-07
7014,CrisisFACTS-General-q045,2260,CrisisFACTS-001-News-32-4,Other hazards to be aware of are trees and pol...,0,31.353416,fuel hazard waste infectious chemical carbon s...,crisisfacts/001/2017-12-07
2993,CrisisFACTS-General-q021,935,CrisisFACTS-001-Twitter-42097-0,I swear it's a ritual for me to drink a big as...,0,28.709712,food water transport transit service transport...,crisisfacts/001/2017-12-07
3084,CrisisFACTS-General-q023,935,CrisisFACTS-001-Twitter-42097-0,I swear it's a ritual for me to drink a big as...,0,28.709712,food water sandbags ieds sedatives scavenging ...,crisisfacts/001/2017-12-07
2609,CrisisFACTS-General-q017,2616,CrisisFACTS-001-Twitter-7555-0,@UPS stands for U Package Somewhere but obviou...,0,27.484543,are there goods needing delivered wanting help...,crisisfacts/001/2017-12-07
6842,CrisisFACTS-General-q040,2816,CrisisFACTS-001-News-38-6,"We are in no way near the end of this, warned ...",0,26.982145,pio public information officer business manage...,crisisfacts/001/2017-12-07


## Checking stats of current event

In [None]:
print('Event: crisisfacts/001/2017-12-07')
print(index.getCollectionStatistics().toString())

In [None]:
topics = pyTerrierDataset.get_topics(variant='indicative_terms')
# corpus = pyTerrierDataset.get_corpus_iter()


In [None]:
topics

Unnamed: 0,qid,query
0,CrisisFACTS-General-q001,airport closed
1,CrisisFACTS-General-q002,rail closed
2,CrisisFACTS-General-q003,water supply
3,CrisisFACTS-General-q004,firefighters on duty
4,CrisisFACTS-General-q005,evacuated
5,CrisisFACTS-General-q006,shelters
6,CrisisFACTS-General-q007,missing
7,CrisisFACTS-General-q008,trapped
8,CrisisFACTS-General-q009,injury injured
9,CrisisFACTS-General-q010,killed dead


Now that we have an index, we can issue queries to it like you would do to a web search engine. Since this is our index, we have control over how we want scoring of the items to happen. Each item is scored using what is known as a weighting model. This is a function that produces a score based on the number of query terms the document contains, in combination with statistics of the documents in the dataset. Different weighting models are optimised for different types of documents. For instance, the classical BM25 model was designed for web pages.
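
To illustrate what a weighting model computes, here is a deliberately simplified textbook TF-IDF scorer in plain Python. It is a sketch of the general idea (term frequency in the document multiplied by a collection-level rarity weight), not the exact formula that Terrier uses for its `TF_IDF` or `BM25` models. The documents and the name `tf_idf_scores` are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Rank documents by the sum over query terms of tf * log(N / df)."""
    tokenised = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in tokenised.values():
        df.update(set(terms))

    scores = {}
    for docno, terms in tokenised.items():
        tf = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if tf[term] > 0:
                score += tf[term] * math.log(n_docs / df[term])
        if score > 0:
            scores[docno] = score
    # Highest score first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical documents standing in for stream items
docs = {
    "d1": "airport closed airport evacuated",
    "d2": "fire near the airport",
    "d3": "shelters open tonight",
}
ranked = tf_idf_scores("airport closed", docs)
for docno, score in ranked:
    print(docno, round(score, 4))
```

Here `d1` ranks first because it mentions "airport" twice and also contains the rarer term "closed", while `d3` matches no query term and is not returned at all.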

In pyTerrier, we create a retriever object that will execute our queries. We pass the index to the retriever along with the weighting model we want to be used. We can also specify any raw fields we stored in the index for an item that we want to be attached to the search result, such as the original text:

# Baseline Model using TF-IDF


In [None]:
retriever = pt.BatchRetrieve(index, wmodel="TF_IDF", metadata=["docno", "text"])
# retriever = pt.BatchRetrieve(index, wmodel="DFReeKLIM", metadata=["docno", "text"])

ranked_df = pd.DataFrame(retriever.transform(topics))
ranked_df



Unnamed: 0,qid,docid,docno,text,rank,score,query
0,CrisisFACTS-General-q001,1513,CrisisFACTS-001-Twitter-48717-0,Good to be home again! @ San Diego Internation...,0,5.959746,have airports closed
1,CrisisFACTS-General-q001,4001,CrisisFACTS-001-Twitter-5831-0,#LilacFire Smokey here in Oceanside above the ...,1,5.698283,have airports closed
2,CrisisFACTS-General-q001,3510,CrisisFACTS-001-Twitter-512-0,#lilacFire photos from Palomar airport rd Carl...,2,5.458798,have airports closed
3,CrisisFACTS-General-q001,532,CrisisFACTS-001-Twitter-22676-0,Back to Cali . . . #california #sandiego # # @...,3,5.238631,have airports closed
4,CrisisFACTS-General-q001,364,CrisisFACTS-001-Twitter-49822-0,I'm at San Diego International Airport in San ...,4,5.035535,have airports closed
...,...,...,...,...,...,...,...
12914,CrisisFACTS-Wildfire-q006,725,CrisisFACTS-001-Twitter-30710-0,"We have a few fires, some heading our way, wav...",995,1.272743,what is the fire containment level
12915,CrisisFACTS-Wildfire-q006,886,CrisisFACTS-001-Twitter-43707-0,Scooter clips ft new song!!!!!!!!!F the world ...,996,1.272743,what is the fire containment level
12916,CrisisFACTS-Wildfire-q006,930,CrisisFACTS-001-Twitter-32751-0,Great tips from CAL_FIRE! Winds are already st...,997,1.272743,what is the fire containment level
12917,CrisisFACTS-Wildfire-q006,980,CrisisFACTS-001-Twitter-11909-0,@gretchenbostrom Thank you. Safe and sound her...,998,1.272743,what is the fire containment level


In [None]:
# ranked_df.to_json('tf_idf-ranked.json', orient='records', lines=True)

# Query Expansion

In [None]:
##Query Expansion
BM25 = pt.BatchRetrieve(index, controls = {"wmodel": "BM25"},  metadata=["docno", "text"])
bo1 = BM25 >> pt.rewrite.Bo1QueryExpansion(index) >> BM25
# klq = pt.rewrite.KLQueryExpansion(index)

retriever = pt.BatchRetrieve(index, wmodel="DFReeKLIM", metadata=["docno", "text"])
pipelineQE = (retriever % 100) >> bo1 >> (retriever % 100)

In [None]:
pd.DataFrame(pipelineQE.transform(topics))



Unnamed: 0,qid,docid,docno,text,rank,score,query
0,CrisisFACTS-General-q001,3510,CrisisFACTS-001-Twitter-512-0,#lilacFire photos from Palomar airport rd Carl...,0,8.422256,applypipeline:off airport^1.845727872 close^1....
1,CrisisFACTS-General-q001,4001,CrisisFACTS-001-Twitter-5831-0,#LilacFire Smokey here in Oceanside above the ...,1,8.407218,applypipeline:off airport^1.845727872 close^1....
2,CrisisFACTS-General-q001,364,CrisisFACTS-001-Twitter-49822-0,I'm at San Diego International Airport in San ...,2,8.320512,applypipeline:off airport^1.845727872 close^1....
3,CrisisFACTS-General-q001,3712,CrisisFACTS-001-Twitter-23239-0,@justin_ternes Northeast of El Camino Real and...,3,8.312118,applypipeline:off airport^1.845727872 close^1....
4,CrisisFACTS-General-q001,532,CrisisFACTS-001-Twitter-22676-0,Back to Cali . . . #california #sandiego # # @...,4,8.265814,applypipeline:off airport^1.845727872 close^1....
...,...,...,...,...,...,...,...
48981,CrisisFACTS-Wildfire-q006,133,CrisisFACTS-001-News-8-23,"The blaze, reported about 1:15 p.m. near Los A...",95,2.973472,applypipeline:off fire^1.081699906 contain^1.0...
48982,CrisisFACTS-Wildfire-q006,2812,CrisisFACTS-001-News-38-2,"By 8 p.m., the Lilac fire had grown to 4,100 a...",96,2.973472,applypipeline:off fire^1.081699906 contain^1.0...
48983,CrisisFACTS-Wildfire-q006,3552,CrisisFACTS-001-Twitter-22533-0,#LilacFire The fire is now 100-150 acres &amp;...,97,2.958183,applypipeline:off fire^1.081699906 contain^1.0...
48984,CrisisFACTS-Wildfire-q006,3636,CrisisFACTS-001-Twitter-31377-0,#LilacFire The fire is now 100-150 acres &amp...,98,2.958183,applypipeline:off fire^1.081699906 contain^1.0...


# Multi-stage Retrieval

In [None]:
#this ranker will make the candidate set of documents for each query
BM25 = pt.BatchRetrieve(index, controls = {"wmodel": "BM25"},  metadata=["docno", "text"])

#these rankers will be used to re-score the BM25 candidates with TF_IDF and PL2
TF_IDF =  pt.BatchRetrieve(index, controls = {"wmodel": "TF_IDF"},  metadata=["docno", "text"])
PL2 =  pt.BatchRetrieve(index, controls = {"wmodel": "PL2"},  metadata=["docno", "text"])

In [None]:
pipe = BM25 >> (TF_IDF ** PL2)
pipe.transform(topics)



Unnamed: 0,qid,docid,docno,text,rank,score,query,features
0,CrisisFACTS-General-q001,1513,CrisisFACTS-001-Twitter-48717-0,Good to be home again! @ San Diego Internation...,0,10.840907,airport closed,"[5.959746115803786, 5.434001669923279]"
1,CrisisFACTS-General-q001,4001,CrisisFACTS-001-Twitter-5831-0,#LilacFire Smokey here in Oceanside above the ...,1,10.365301,airport closed,"[5.698283354422201, 5.181268577133161]"
2,CrisisFACTS-General-q001,3510,CrisisFACTS-001-Twitter-512-0,#lilacFire photos from Palomar airport rd Carl...,2,9.929671,airport closed,"[5.458797921543436, 4.952882979095842]"
3,CrisisFACTS-General-q001,532,CrisisFACTS-001-Twitter-22676-0,Back to Cali . . . #california #sandiego # # @...,3,9.529182,airport closed,"[5.2386306112173004, 4.744836590959179]"
4,CrisisFACTS-General-q001,364,CrisisFACTS-001-Twitter-49822-0,I'm at San Diego International Airport in San ...,4,9.159746,airport closed,"[5.035534595592607, 4.554075159436155]"
...,...,...,...,...,...,...,...,...
8996,CrisisFACTS-Wildfire-q006,338,CrisisFACTS-001-News-19-11,Creek fire in northeast San Fernando Valley 1...,184,3.288778,containment,"[1.8208355675518322, 1.497431333559856]"
8997,CrisisFACTS-Wildfire-q006,339,CrisisFACTS-001-News-19-12,https://t.co/ZMx1UKojdE pic.twitter.com/Tptsg8...,185,3.208158,containment,"[1.7762005299317922, 1.451470494726682]"
8998,CrisisFACTS-Wildfire-q006,336,CrisisFACTS-001-News-19-9,Thomas fire in Ventura and Santa Barbara count...,186,3.131397,containment,"[1.733701453690955, 1.4075047602952049]"
8999,CrisisFACTS-Wildfire-q006,340,CrisisFACTS-001-News-19-13,"Lilac fire in San Diego County 4,100 acres bu...",187,3.058223,containment,"[1.693188600436249, 1.3653893495946197]"


In [None]:
fbr = pt.FeaturesBatchRetrieve(index, controls = {"wmodel": "BM25"}, features=["SAMPLE", "WMODEL:TF_IDF", "WMODEL:PL2"], metadata=["docno", "text"]) 

pipe = BM25 >> (pt.transformer.IdentityTransformer() ** TF_IDF ** PL2)
#look at the top 100 results
(fbr %100).transform(topics)



Unnamed: 0,qid,query,docid,rank,features,docno,text,score
0,CrisisFACTS-General-q001,airport closed,1513,0,"[10.840907078668243, 1.8888947026336904, 1.535...",CrisisFACTS-001-Twitter-48717-0,Good to be home again! @ San Diego Internation...,10.840907
1,CrisisFACTS-General-q001,airport closed,4001,1,"[10.365300660946287, 2.034966369838934, 1.6832...",CrisisFACTS-001-Twitter-5831-0,#LilacFire Smokey here in Oceanside above the ...,10.365301
2,CrisisFACTS-General-q001,airport closed,3510,2,"[9.929671479084204, 1.983828675112109, 1.63176...",CrisisFACTS-001-Twitter-512-0,#lilacFire photos from Palomar airport rd Carl...,9.929671
3,CrisisFACTS-General-q001,airport closed,532,3,"[9.529182379946793, 2.0888102021206136, 1.7371...",CrisisFACTS-001-Twitter-22676-0,Back to Cali . . . #california #sandiego # # @...,9.529182
4,CrisisFACTS-General-q001,airport closed,364,4,"[9.159746335079618, 2.034966369838934, 1.68321...",CrisisFACTS-001-Twitter-49822-0,I'm at San Diego International Airport in San ...,9.159746
...,...,...,...,...,...,...,...,...
8907,CrisisFACTS-Wildfire-q006,containment,3390,95,"[4.392621516780349, 3.3247012032612195, 3.0397...",CrisisFACTS-001-Twitter-44500-0,RT @nbcsandiego: #BREAKING: The #LilacFire has...,4.392622
8908,CrisisFACTS-Wildfire-q006,containment,3454,96,"[4.392621516780349, 2.4319800925912967, 2.1147...",CrisisFACTS-001-Twitter-35144-0,#BREAKING: The #LilacFire burning in the Bonsa...,4.392622
8909,CrisisFACTS-Wildfire-q006,containment,3581,97,"[4.392621516780349, 3.484589789337811, 3.21781...",CrisisFACTS-001-Twitter-15242-0,#LilacFire as seen from #CSUSM. It's burning n...,4.392622
8910,CrisisFACTS-Wildfire-q006,containment,3992,98,"[4.392621516780349, 2.3530039613044473, 2.0354...",CrisisFACTS-001-Twitter-18545-0,#LilacFire updates: --Between 100-150 acres bu...,4.392622
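
The `% 100` operator used above is PyTerrier's rank cutoff: it keeps the 100 best-ranked rows for each query, which is why the `rank` column restarts at 0 for every `qid`. The following is a minimal pure-Python sketch of that per-query behaviour. `rank_cutoff` and the toy rows are hypothetical illustrations, not PyTerrier code.

```python
from collections import defaultdict

def rank_cutoff(rows, k):
    """Keep the k best-ranked rows per qid (rank 0 is the best)."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["qid"]].append(row)
    kept = []
    for qid, group in by_query.items():
        group.sort(key=lambda r: r["rank"])
        kept.extend(group[:k])
    return kept

# Toy ranked results for two queries
rows = [
    {"qid": "q1", "docno": "d1", "rank": 0},
    {"qid": "q1", "docno": "d2", "rank": 1},
    {"qid": "q1", "docno": "d3", "rank": 2},
    {"qid": "q2", "docno": "d4", "rank": 0},
]
top2 = rank_cutoff(rows, 2)
```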


In [None]:
pipe_fast = pipe.compile()
custom_rankedd_df = (pipe_fast %100).transform(topics)

Applying 8 rules


In [None]:
pipe_fast.transform('evacuated')



Unnamed: 0,qid,docid,docno,text,rank,score,query,features
0,1,7082,CrisisFACTS-001-Twitter-26352-0,Temporary evacuation information #LilacFire #F...,0,5.252715,evacuated,"[5.252714720050483, 3.0617048531517845, 3.0979..."
1,1,6504,CrisisFACTS-001-Twitter-44011-0,Updated evacuation map for #LilacFire. Yellow ...,1,4.973554,evacuated,"[4.973553979278057, 2.8989875078579646, 2.8829..."
2,1,5900,CrisisFACTS-001-Twitter-15149-0,If ordered to evacuate please evacuate. #prayf...,2,4.954042,evacuated,"[4.954042338964, 2.8876145536759172, 2.8429577..."
3,1,5628,CrisisFACTS-001-Twitter-15975-0,"Please, please, please think of the animals as...",3,4.817091,evacuated,"[4.817091054547189, 2.8077883239087145, 2.7333..."
4,1,5664,CrisisFACTS-001-Twitter-13722-0,Please tag your horses if you are planning to ...,4,4.817091,evacuated,"[4.817091054547189, 2.8077883239087145, 2.7333..."
...,...,...,...,...,...,...,...,...
594,1,2261,CrisisFACTS-001-News-32-5,EVACUATION SHELTERS EL CAJON: Bostonia Park &...,594,2.189437,evacuated,"[2.1894366011694317, 1.2761798469428722, 0.986..."
595,1,108,CrisisFACTS-001-News-7-14,Evacuation centers accepting pets: Palomar Col...,595,2.135766,evacuated,"[2.1357658651615203, 1.2448962228236335, 0.954..."
596,1,2266,CrisisFACTS-001-News-32-10,People who need assistance evacuating large an...,596,2.135766,evacuated,"[2.1357658651615203, 1.2448962228236335, 0.954..."
597,1,2930,CrisisFACTS-001-News-39-7,SAN DIEGO AIR QUALITY LEVELS Evacuation order...,597,2.084663,evacuated,"[2.0846634840922142, 1.2151096426519998, 0.922..."


## Loading the ground-truth JSON file into a DataFrame for training

In [43]:
from google.colab import drive
drive.mount('/content/drive')
root_path = 'drive/My Drive/Project_2022/'

# import pandas as pd
# facts_df = pd.read_json('CrisisFACTs-2022.json', orient='records')

# facts_df.head(5)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import json
import pandas as pd

# facts_df = pd.read_json('CrisisFACTs-2022.json')
# facts_df.head(1)

with open(root_path +'CrisisFACTs-2022.json') as f:
    d = json.load(f)

# flatten the third event record into a one-row DataFrame
# (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize)
data_df = pd.json_normalize(d[2])
data_df.head(10)




Unnamed: 0,event,eventID,summaryRequests,allFacts,factsByRequest.CrisisFACTS-003-r9,factsByRequest.CrisisFACTS-003-r6,factsByRequest.CrisisFACTS-003-r10,factsByRequest.CrisisFACTS-003-r8,factsByRequest.CrisisFACTS-003-r7,factsByRequest.CrisisFACTS-003-r11,factsByRequest.CrisisFACTS-003-r5
0,Holy Wildfire 2018,CrisisFACTS-003,"[{'eventID': 'CrisisFACTS-003', 'requestID': '...","[{'eventID': 'CrisisFACTS-003', 'event': 'Holy...","[{'eventID': 'CrisisFACTS-003', 'event': 'Holy...","[{'eventID': 'CrisisFACTS-003', 'event': 'Holy...","[{'eventID': 'CrisisFACTS-003', 'event': 'Holy...","[{'eventID': 'CrisisFACTS-003', 'event': 'Holy...","[{'eventID': 'CrisisFACTS-003', 'event': 'Holy...","[{'eventID': 'CrisisFACTS-003', 'event': 'Holy...","[{'eventID': 'CrisisFACTS-003', 'event': 'Holy..."
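
The dotted column names in the output above (for example `factsByRequest.CrisisFACTS-003-r9`) come from flattening nested dictionaries, while list values such as `allFacts` stay in a single cell. The following rough stdlib sketch shows that flattening. The `flatten` helper and the record are made up for illustration.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys; lists stay intact."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Made-up record shaped like one entry of the ground-truth file
record = {
    "event": "Example Wildfire",
    "eventID": "EXAMPLE-001",
    "allFacts": [{"fact": "a"}, {"fact": "b"}],
    "factsByRequest": {"EXAMPLE-001-r1": [{"fact": "a"}]},
}
flat = flatten(record)
```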


In [44]:
import json 

with open(root_path +'CrisisFACTs-2022.json') as f:
    json_data = json.load(f)

In [54]:
def split_in_files(json_data, amount):
    step = len(json_data) // amount
    pos = 0
    for i in range(amount - 1):
        with open('output_file{}.json'.format(i+1), 'w') as file:
            json.dump(json_data[pos:pos+step], file)
            pos += step
    # last one
    with open('output_file{}.json'.format(amount), 'w') as file:
        json.dump(json_data[pos:], file)

split_in_files(json_data, len(event_name_df))

TypeError: ignored
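
`split_in_files` slices `json_data`, which only works when it is a list. The following is a hedged sketch of the same idea. It separates chunking from file writing and spreads any remainder over the first chunks. `chunk_evenly` and `write_chunks` are hypothetical names, and the sketch is not claimed to resolve the `TypeError` shown above.

```python
import json
import os
import tempfile

def chunk_evenly(items, amount):
    """Split a list into `amount` consecutive chunks whose sizes differ by at most one."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    base, extra = divmod(len(items), amount)
    chunks, pos = [], 0
    for i in range(amount):
        size = base + (1 if i < extra else 0)
        chunks.append(items[pos:pos + size])
        pos += size
    return chunks

def write_chunks(items, amount, out_dir):
    """Write each chunk to output_file<N>.json inside out_dir and return the paths."""
    paths = []
    for i, chunk in enumerate(chunk_evenly(items, amount), start=1):
        path = os.path.join(out_dir, "output_file{}.json".format(i))
        with open(path, "w") as f:
            json.dump(chunk, f)
        paths.append(path)
    return paths

# Usage on toy data: ten records split into three files
with tempfile.TemporaryDirectory() as tmp:
    paths = write_chunks(list(range(10)), 3, tmp)
    sizes = []
    for p in paths:
        with open(p) as f:
            sizes.append(len(json.load(f)))
```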

# Word Embedding using word2vec

# Custom model

In [None]:
import gensim

In [None]:
clean_text = custom_rankedd_df.text.apply(gensim.utils.simple_preprocess)

In [None]:
clean_text

0       [good, to, be, home, again, san, diego, intern...
1       [lilacfire, smokey, here, in, oceanside, above...
2       [lilacfire, photos, from, palomar, airport, rd...
3       [back, to, cali, california, sandiego, san, di...
4       [at, san, diego, international, airport, in, s...
                              ...                        
8907    [rt, nbcsandiego, breaking, the, lilacfire, ha...
8908    [breaking, the, lilacfire, burning, in, the, b...
8909    [lilacfire, as, seen, from, csusm, it, burning...
8910    [lilacfire, updates, between, acres, burned, c...
8911    [breaking, news, the, lilacfire, has, exploded...
Name: text, Length: 3364, dtype: object
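
The token lists above are lowercased, contain no digits or punctuation, and lack one-letter tokens. That matches what `gensim.utils.simple_preprocess` does with its default length limits. The following regex-based approximation makes that behaviour concrete. `approx_simple_preprocess` is a hypothetical stand-in and may differ from gensim on edge cases.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep runs of non-digit word characters, drop very short or long tokens."""
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = approx_simple_preprocess(
    "I'm at San Diego International Airport, 100-150 acres burned"
)
```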

In [None]:
# untrained Word2Vec model: context window of 3 words, ignore words seen fewer than 2 times
custome_model = gensim.models.Word2Vec(
    window = 3,
    min_count = 2,
    workers = 4
)

In [None]:
#Building vocab
custome_model.build_vocab(clean_text, progress_per=1000)

In [None]:
custome_model.epochs

5

In [None]:
#Train the model
custome_model.train(clean_text, total_examples=custome_model.corpus_count, epochs=custome_model.epochs)

(208654, 292110)

In [None]:
#saving model
custome_model.save('./custom_model.model')

In [None]:
# query the trained word vectors through .wv (calling similarity on the model itself is deprecated)
custome_model.wv.similarity(w1="evacuated", w2="water")


0.99966383

# Pre-trained model

In [None]:
# %time model = api.load("glove-wiki-gigaword-300")
%time model = api.load("glove-twitter-25")
# %time model = api.load('glove-twitter-200')

CPU times: user 54 s, sys: 4.87 s, total: 58.9 s
Wall time: 1min 1s


In [None]:
emb = model.get_vector("airport")
print(emb.shape)
print(emb)

(25,)
[-2.0865e+00 -2.1981e-03  8.4930e-01 -1.4564e+00 -8.3844e-01 -8.0157e-01
  4.4099e-01  3.7413e-01  1.4009e+00  4.0926e-01 -1.0263e-01  7.7931e-01
 -2.8374e+00  5.7789e-01  7.0094e-01 -9.8445e-01  2.0104e-01  3.6623e-01
 -1.0395e+00 -2.7583e-01 -1.6308e-01 -1.4543e+00  4.2966e-01 -1.1305e+00
 -2.5316e-01]


In [None]:
model.most_similar("evacuated")

[('diverted', 0.8607070446014404),
 ('raided', 0.8510200381278992),
 ('offices', 0.8429232239723206),
 ('firefighters', 0.8372986316680908),
 ('transported', 0.8277961015701294),
 ('factories', 0.8258089423179626),
 ('courthouse', 0.8183614015579224),
 ('halted', 0.814586341381073),
 ('demolished', 0.8140581250190735),
 ('disrupted', 0.8140448331832886)]

In [None]:
model.similarity(w1="evacuated", w2="water")

0.37149754
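
`similarity` returns the cosine of the angle between two word vectors. The custom model scores "evacuated" and "water" at about 0.9997, while the pre-trained GloVe vectors give about 0.3715. One possible reading is that the small custom corpus has not yet separated the word vectors much. The following is a minimal sketch of the computation on made-up vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score (approximately) 1, orthogonal vectors score 0
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```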

## Word2vec query expansion function

In [None]:

def w2v_qexp(q):
  # q holds the retrieved rows of one query, so every row carries the same query string
  tokens = set(q["query"].iloc[0].split())
  print(q["query"])
  expandedquery = ""
  count = 0

  for token in tokens:
    # expand only alphanumeric tokens that the embedding model knows
    if token in model and token.isalnum():
        similar = dict(model.most_similar(token))

        # append each alphanumeric neighbour to the query
        for term in similar:
          if term.isalnum():
            expandedquery += " " + str(term)
            count += 1

  q["query"] += expandedquery
  # boost every score by 0.1 per added term, normalised by the number of query tokens
  q["score"] += (count*0.1)/len(tokens)
  print(q["score"])
  print(q["query"])
  print("\n")

  return q


test_pipeline = pipe_fast >> pt.apply.by_query(w2v_qexp) >> pipe_fast
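
`w2v_qexp` appends the nearest embedding neighbours of each query token to the query and raises every score by `0.1 * count / len(tokens)`. With two tokens and ten alphanumeric neighbours each, the boost is `0.1 * 20 / 2 = 1.0`. The following self-contained sketch shows the expansion step. It uses a hypothetical hand-written neighbour table in place of `model.most_similar`.

```python
# Hypothetical neighbour table standing in for model.most_similar
NEIGHBOURS = {
    "airport": ["terminal", "flight"],
    "closed": ["shut", "re-opened"],
}

def expand_query(query, neighbours, boost_per_term=0.1):
    """Return the expanded query and the score boost described above."""
    tokens = list(dict.fromkeys(query.split()))  # de-duplicate, keep order
    added = []
    for token in tokens:
        if not token.isalnum():
            continue
        for term in neighbours.get(token, []):
            if term.isalnum():  # skip terms with punctuation, e.g. "re-opened"
                added.append(term)
    expanded = " ".join([query] + added)
    boost = boost_per_term * len(added) / len(tokens)
    return expanded, boost

expanded, boost = expand_query("airport closed", NEIGHBOURS)
```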


In [None]:
ranked_doc = test_pipeline.transform(topics)

0      airport closed
1      airport closed
2      airport closed
3      airport closed
4      airport closed
            ...      
140    airport closed
141    airport closed
142    airport closed
143    airport closed
144    airport closed
Name: query, Length: 145, dtype: object
0      11.840907
1      11.365301
2      10.929671
3      10.529182
4      10.159746
         ...    
140     4.692898
141     4.692898
142     4.692898
143     4.600097
144     4.427819
Name: score, Length: 145, dtype: float64
0      airport closed bangkok beijing shanghai headin...
1      airport closed bangkok beijing shanghai headin...
2      airport closed bangkok beijing shanghai headin...
3      airport closed bangkok beijing shanghai headin...
4      airport closed bangkok beijing shanghai headin...
                             ...                        
140    airport closed bangkok beijing shanghai headin...
141    airport closed bangkok beijing shanghai headin...
142    airport closed bangkok beij

In [None]:
ranked_df.head()

Unnamed: 0,qid,docid,docno,text,rank,score,query
0,CrisisFACTS-General-q001,1513,CrisisFACTS-001-Twitter-48717-0,Good to be home again! @ San Diego Internation...,0,5.959746,have airports closed
1,CrisisFACTS-General-q001,4001,CrisisFACTS-001-Twitter-5831-0,#LilacFire Smokey here in Oceanside above the ...,1,5.698283,have airports closed
2,CrisisFACTS-General-q001,3510,CrisisFACTS-001-Twitter-512-0,#lilacFire photos from Palomar airport rd Carl...,2,5.458798,have airports closed
3,CrisisFACTS-General-q001,532,CrisisFACTS-001-Twitter-22676-0,Back to Cali . . . #california #sandiego # # @...,3,5.238631,have airports closed
4,CrisisFACTS-General-q001,364,CrisisFACTS-001-Twitter-49822-0,I'm at San Diego International Airport in San ...,4,5.035535,have airports closed


In [None]:
pd.DataFrame(ranked_df.groupby('qid').head(2))

Unnamed: 0,qid,docid,docno,text,rank,score,query,features
0,CrisisFACTS-General-q001,1678,CrisisFACTS-001-News-26-10,Eastbound traffic on North River Road at Leon ...,0,13.847139,airport closed bangkok beijing shanghai headin...,"[13.84713925144419, 7.217716770633265, 6.48456..."
1,CrisisFACTS-General-q001,2256,CrisisFACTS-001-News-32-0,ROAD AND FREEWAY CLOSURES All roadways that w...,1,10.402066,airport closed bangkok beijing shanghai headin...,"[10.402066411423325, 5.405415268092923, 4.8264..."
145,CrisisFACTS-General-q002,1678,CrisisFACTS-001-News-26-10,Eastbound traffic on North River Road at Leon ...,0,11.064795,rail closed vehicle construction aircraft flee...,"[11.064794855692089, 5.747287032707115, 5.1472..."
146,CrisisFACTS-General-q002,2256,CrisisFACTS-001-News-32-0,ROAD AND FREEWAY CLOSURES All roadways that w...,1,10.402066,rail closed vehicle construction aircraft flee...,"[10.402066411423325, 5.405415268092923, 4.8264..."
280,CrisisFACTS-General-q003,2844,CrisisFACTS-001-News-38-34,"Tankers and helicopters, including those owned...",0,19.259518,water supply salt burning light bottle glass d...,"[19.259518288743802, 10.561789957412142, 9.295..."
...,...,...,...,...,...,...,...,...
8100,CrisisFACTS-Wildfire-q004,2929,CrisisFACTS-001-News-39-6,"Damages: At least 20 structures destroyed, unk...",1,21.519059,homes destroyed damaged carried trapped beaten...,"[21.51905945647335, 11.692402818804773, 10.857..."
8394,CrisisFACTS-Wildfire-q005,4203,CrisisFACTS-001-Twitter-40518-0,#NBC7s non-stop team coverage is coming up on ...,0,11.870070,acres per hour hours early weeks days monday l...,"[11.870070282240444, 6.316481244202238, 5.5797..."
8395,CrisisFACTS-Wildfire-q005,3394,CrisisFACTS-001-Twitter-23431-0,#Lilacfire grows to 150 acres in about an hour...,1,11.129582,acres per hour hours early weeks days monday l...,"[11.129582476904263, 5.964893091012373, 5.2577..."
8812,CrisisFACTS-Wildfire-q006,4846,CrisisFACTS-001-Twitter-10308-0,am i reading that right.... 0% CONTAINED!?!? m...,0,7.235807,containment fastening occupancy odometer flood...,"[7.2358070714371765, 4.006113134114233, 3.8556..."


# Re-ranking using machine learning, i.e. Learning to Rank

In [None]:
bm25_cands = pt.BatchRetrieve(index, wmodel="BM25")
dph_cands = pt.BatchRetrieve(index, wmodel="DPH")
all_cands = bm25_cands | dph_cands

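# two features per candidate: BM25F, and BM25 on the SDM-rewritten query
# (parentheses are needed because ** binds more tightly than >> in Python)
all_features = all_cands >> (
    pt.BatchRetrieve(index, wmodel="BM25F") **
    (pt.rewrite.SDM() >> pt.BatchRetrieve(index, wmodel="BM25"))
    )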

import xgboost as xgb
params = {'objective': 'rank:ndcg', 
          'learning_rate': 0.1, 
          'gamma': 1.0, 'min_child_weight': 0.1,
          'max_depth': 6,
          'verbose': 2,
          'random_state': 42 
         }
lambdamart = pt.ltr.apply_learned_model(xgb.sklearn.XGBRanker(**params), form='ltr')
final_pipe = all_features >> lambdamart
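
In the pipeline above, `|` merges the candidate sets of two retrievers and `**` gives each candidate one feature per branch. The following toy sketch shows those two semantics on plain Python data. The function names and scores are hypothetical, and this is not how PyTerrier implements the operators.

```python
def set_union(results_a, results_b):
    """Union of candidate (qid, docno) pairs, without duplicates."""
    seen, merged = set(), []
    for qid, docno in results_a + results_b:
        if (qid, docno) not in seen:
            seen.add((qid, docno))
            merged.append((qid, docno))
    return merged

def feature_union(candidates, *scorers):
    """Attach one feature per scorer to every candidate."""
    return [
        {"qid": qid, "docno": docno, "features": [s(qid, docno) for s in scorers]}
        for qid, docno in candidates
    ]

bm25_like = [("q1", "d1"), ("q1", "d2")]
dph_like = [("q1", "d2"), ("q1", "d3")]
candidates = set_union(bm25_like, dph_like)

# Made-up scoring functions standing in for two weighting models
score_a = lambda qid, docno: float(len(docno))
score_b = lambda qid, docno: float(docno[-1])
rows = feature_union(candidates, score_a, score_b)
```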

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
dataset1 = pt.datasets.get_dataset("vaswani")
indexref1 = dataset1.get_index()
topics1 = dataset1.get_topics()
qrels1 = dataset1.get_qrels()

Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index
Downloading vaswani topics to /root/.pyterrier/corpora/vaswani/query-text.trec
Downloading vaswani qrels to /root/.pyterrier/corpora/vaswani/qrels

In [None]:
fbr = pt.FeaturesBatchRetrieve(indexref1, controls = {"wmodel": "BM25"}, features=["WMODEL:TF_IDF", "WMODEL:PL2"]) 
# the top 2 results
(fbr %2).search("chemical")

Unnamed: 0,qid,query,docid,rank,features,docno,score
0,1,chemical,10702,0,"[1.9972714735280614, 1.590216305943686]",10703,13.472012
1,1,chemical,1055,1,"[2.5168371014881425, 2.1297038460724336]",1056,12.517082


In [None]:
# BaselineLTR = fbr >> pt.pipelines.LTR_pipeline(RandomForestRegressor(n_estimators=400))
import numpy as np
# 60% / 20% / 20% split of the topics into train, validation and test sets
train_topics, valid_topics, test_topics = np.split(topics1, [int(.6*len(topics1)), int(.8*len(topics1))])

import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor

# LambdaMART ranker; the retriever's metadata columns are not a LightGBM parameter
lmart_l = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=10)

lmart_l_pipe = fbr >> pt.ltr.apply_learned_model(lmart_l, form="ltr")
# train on the training topics and monitor NDCG on the validation topics
lmart_l_pipe.fit(train_topics, qrels1, valid_topics, qrels1)





[1]	valid_0's ndcg@1: 0.526316
[2]	valid_0's ndcg@1: 0.368421
[3]	valid_0's ndcg@1: 0.368421
[4]	valid_0's ndcg@1: 0.315789
[5]	valid_0's ndcg@1: 0.263158
[6]	valid_0's ndcg@1: 0.263158
[7]	valid_0's ndcg@1: 0.263158
[8]	valid_0's ndcg@1: 0.263158
[9]	valid_0's ndcg@1: 0.263158
[10]	valid_0's ndcg@1: 0.210526
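
The training log above reports `ndcg@1` on the validation topics after each boosting iteration. With binary relevance labels, NDCG@1 for a query is 1 when the top result is relevant and 0 otherwise, so values such as 0.526316 are consistent with 10 of 19 validation queries having a relevant top result. The following is a minimal sketch of NDCG@k using the common `2^rel - 1` gain and `log2` discount. LightGBM's gain settings are configurable, so treat it as an illustration and not a reproduction of its numbers.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k labels of a ranked list."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (label-sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of one query's results, in ranked order
ranked_labels = [0, 1, 1]
ndcg1 = ndcg_at_k(ranked_labels, 1)
ndcg3 = ndcg_at_k(ranked_labels, 3)
```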


In [None]:
pd.DataFrame(lmart_l_pipe.search('chemical'))

Unnamed: 0,qid,query,docid,features,docno,score,rank
0,1,chemical,10702,"[1.9972714735280614, 1.590216305943686]",10703,0.191864,0
2,1,chemical,4885,"[2.3728103620913967, 1.983371320830459]",4886,0.052304,1
1,1,chemical,1055,"[2.5168371014881425, 2.1297038460724336]",1056,0.032782,2
6,1,chemical,10138,"[3.4834162139007545, 3.0708350019597836]",10139,-0.118626,3
3,1,chemical,6278,"[4.127360117137546, 3.67541591591923]",6279,-0.180456,4
4,1,chemical,1139,"[4.733102135127965, 4.239937921699505]",1140,-0.296689,5
14,1,chemical,4911,"[4.892671258538515, 4.389142244890826]",4912,-0.314971,6
5,1,chemical,8765,"[5.063375014663324, 4.549375729649994]",8766,-0.327094,7
7,1,chemical,2519,"[5.063375014663324, 4.549375729649994]",2520,-0.327094,8
8,1,chemical,2557,"[4.97655971570293, 4.4677913542010534]",2558,-0.327094,9


In [None]:
from collections import Counter
def slash_counter(text):
  # count the '/' characters in a document's text
  counter = Counter(text)
  return counter['/']

In [None]:
improved_pl2_pipeline = pipe_fast >> pt.apply.doc_score(lambda row: slash_counter(row['text']), verbose=True)


In [None]:
improved_pl2_pipeline.transform('evacuated').head()



pt.apply.doc_score:   0%|          | 0/599 [00:00<?, ?d/s]

Unnamed: 0,qid,docid,docno,text,score,query,features,rank
0,1,7082,CrisisFACTS-001-Twitter-26352-0,Temporary evacuation information #LilacFire #F...,3,evacuated,"[5.252714720050483, 3.0617048531517845, 3.0979...",44
1,1,6504,CrisisFACTS-001-Twitter-44011-0,Updated evacuation map for #LilacFire. Yellow ...,3,evacuated,"[4.973553979278057, 2.8989875078579646, 2.8829...",45
2,1,5900,CrisisFACTS-001-Twitter-15149-0,If ordered to evacuate please evacuate. #prayf...,3,evacuated,"[4.954042338964, 2.8876145536759172, 2.8429577...",46
3,1,5628,CrisisFACTS-001-Twitter-15975-0,"Please, please, please think of the animals as...",3,evacuated,"[4.817091054547189, 2.8077883239087145, 2.7333...",47
4,1,5664,CrisisFACTS-001-Twitter-13722-0,Please tag your horses if you are planning to ...,3,evacuated,"[4.817091054547189, 2.8077883239087145, 2.7333...",48
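
`pt.apply.doc_score` replaces each document's score with the value returned by the supplied function, here the number of `/` characters in the text, and the ranks are then recomputed from the new scores. That is why the rows above all carry score 3 and ranks in the 40s. The following small sketch shows that rescoring on made-up rows. `rescore` is a hypothetical helper, and unlike the output above it also sorts the rows by rank.

```python
def rescore(rows, score_fn):
    """Replace each row's score with score_fn(row) and re-rank by descending score."""
    rescored = [dict(row, score=score_fn(row)) for row in rows]
    ordered = sorted(rescored, key=lambda r: r["score"], reverse=True)
    for rank, row in enumerate(ordered):
        row["rank"] = rank
    return ordered

# Made-up retrieved rows with their original retrieval scores
rows = [
    {"docno": "d1", "text": "no links here", "score": 5.2},
    {"docno": "d2", "text": "map at https://example.org/a/b", "score": 4.9},
    {"docno": "d3", "text": "see https://example.org", "score": 4.8},
]
reranked = rescore(rows, lambda row: row["text"].count("/"))
```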
