<a href="https://colab.research.google.com/github/r2barati/TREC23-CrisisFACTS/blob/main/Simple_Hybrid_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Requirements from the last meeting with the Professor (Aug 4th)

## Suggested Procedure
* Literature review for short-text information retrieval / classification. This will help us the most, I believe, since the methods are very different. News articles are different from short texts.
* Data stats: what is the data distribution among data sources and crisis types?
* Data quality: which data source has the highest quality? Should we use different sources for cross-validation if computer memory does not permit the full computation?
* Focus: can we focus on some types of crises, if some crises dominate the data?
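
The data-stats item can start from a simple tally. The sketch below assumes each document has been reduced to a dict carrying the `source_type` field that appears in the CrisisFACTS records loaded later in this notebook; the three sample records and the helper name `source_distribution` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample records; real ones would come from dataset.docs_iter()
docs = [
    {"event": "CrisisFACTS-001", "source_type": "Twitter"},
    {"event": "CrisisFACTS-001", "source_type": "News"},
    {"event": "CrisisFACTS-001", "source_type": "Twitter"},
]

def source_distribution(records):
    """Return the share of documents contributed by each source type."""
    counts = Counter(r["source_type"] for r in records)
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}

distribution = source_distribution(docs)
print(distribution)
```

The same tally keyed on `event` instead of `source_type` would show whether a few crises dominate the data.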

## Submission
Regarding the submissions: there are five runs, so each run can have a different focus.
We can use the 2022 data as training data and test our methods on the 2023 data.

## Timeline
* 1-2 weeks on data analysis + short text method
* 1 week on method selection
* 1 week on tuning

In [None]:
!pip install --upgrade git+https://github.com/allenai/ir_datasets.git@crisisfacts # install ir_datasets (crisisfacts branch)


Collecting git+https://github.com/allenai/ir_datasets.git@crisisfacts
  Cloning https://github.com/allenai/ir_datasets.git (to revision crisisfacts) to /tmp/pip-req-build-s8yy3dyo
  Running command git clone --filter=blob:none --quiet https://github.com/allenai/ir_datasets.git /tmp/pip-req-build-s8yy3dyo
  Running command git checkout -b crisisfacts --track origin/crisisfacts
  Switched to a new branch 'crisisfacts'
  Branch 'crisisfacts' set up to track remote branch 'crisisfacts' from 'origin'.
  Resolved https://github.com/allenai/ir_datasets.git to commit e2359e24c9546e2a62284cd1aec6138295bb5ec5
  Preparing metadata (setup.py) ... done
Collecting trec-car-tools>=2.5.4 (from ir-datasets==0.5.2)
  Downloading trec_car_tools-2.6-py3-none-any.whl (8.4 kB)
Collecting lz4>=3.1.1 (from ir-datasets==0.5.2)
  Downloading lz4-4.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
# Set up credentials
credentials = {
    "institution": "<Toronto Metropolitan University>", # University, Company or Public Agency Name
    "contactname": "<Reza Barati, Aary Kartha>", # Your Name
    "email": "<rezabarati@gmail.com, aaryaman.kartha@torontomu.ca>", # A contact email address
    "institutiontype": "<Research>" # Either 'Research', 'Industry', or 'Public Sector'
}

# Write the credentials to a file so ir_datasets can read them when needed
import json
import os

auth_dir = os.path.join(os.path.expanduser('~'), '.ir_datasets', 'auth')
os.makedirs(auth_dir, exist_ok=True)  # same effect as `mkdir -p`

with open(os.path.join(auth_dir, 'crisisfacts.json'), 'w') as f:
    json.dump(credentials, f)


In [None]:
# Define the event numbers
eventNoList = [
    "001", # Lilac Wildfire 2017
]

In [None]:
# Define the function to get days for an event
import requests

# Gets the list of days for a specified event number, e.g. '001'
def getDaysForEventNo(eventNo):

    # We will download a file containing the day list for an event
    url = "http://trecis.org/CrisisFACTs/CrisisFACTS-"+eventNo+".requests.json"

    # Download the list and parse as JSON
    dayList = requests.get(url).json()

    # Return the list of days (nothing is printed here).
    # Each day object contains the following fields:
    #   {
    #      "eventID" : "CrisisFACTS-001",
    #      "requestID" : "CrisisFACTS-001-r3",
    #      "dateString" : "2017-12-07",
    #      "startUnixTimestamp" : 1512604800,
    #      "endUnixTimestamp" : 1512691199
    #   }

    return dayList
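
The timestamps in each day object can be sanity-checked against `dateString`. This is a small sketch that reuses the sample values from the comment above and assumes the timestamps are in UTC, which those sample values are consistent with; the helper name `to_utc` is introduced here for illustration.

```python
from datetime import datetime, timezone

# Sample day object copied from the comment in getDaysForEventNo
day = {
    "requestID": "CrisisFACTS-001-r3",
    "dateString": "2017-12-07",
    "startUnixTimestamp": 1512604800,
    "endUnixTimestamp": 1512691199,
}

def to_utc(ts):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

start = to_utc(day["startUnixTimestamp"])
end = to_utc(day["endUnixTimestamp"])
print(start.isoformat(), "->", end.isoformat())

# The request window should span exactly one calendar day (inclusive bounds)
window_seconds = day["endUnixTimestamp"] - day["startUnixTimestamp"] + 1
```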

In [None]:
# Use the function to get the days for the first event in eventNoList.
# The same call works for any other event number (there are 0-17).
for day in getDaysForEventNo(eventNoList[0]):
    print(day["dateString"])
    # print(day)  # uncomment to see the full summary-request JSON object


2017-12-07
2017-12-08
2017-12-09
2017-12-10
2017-12-11
2017-12-12
2017-12-13
2017-12-14
2017-12-15


In [None]:
# Get days for all events
eventsMeta = {}

for eventNo in eventNoList: # for each event
    dailyInfo = getDaysForEventNo(eventNo) # get the list of days
    eventsMeta[eventNo]= dailyInfo

    print("Event "+eventNo)
    for day in dailyInfo: # for each day
        print("  crisisfacts/"+eventNo+"/"+day["dateString"], "-->", day["requestID"]) # construct the request string

    print()

Event 001
  crisisfacts/001/2017-12-07 --> CrisisFACTS-001-r3
  crisisfacts/001/2017-12-08 --> CrisisFACTS-001-r4
  crisisfacts/001/2017-12-09 --> CrisisFACTS-001-r5
  crisisfacts/001/2017-12-10 --> CrisisFACTS-001-r6
  crisisfacts/001/2017-12-11 --> CrisisFACTS-001-r7
  crisisfacts/001/2017-12-12 --> CrisisFACTS-001-r8
  crisisfacts/001/2017-12-13 --> CrisisFACTS-001-r9
  crisisfacts/001/2017-12-14 --> CrisisFACTS-001-r10
  crisisfacts/001/2017-12-15 --> CrisisFACTS-001-r11



In [None]:
# Download and print the first document for the first day (2017-12-07) of event 001
import ir_datasets

dataset = ir_datasets.load('crisisfacts/001/2017-12-07')

for item in dataset.docs_iter()[:1]:
    print(item)

[INFO] [starting] building docstore
[INFO] [starting] requesting access key
[INFO] [finished] requesting access key [1.16s]
docs_iter: 7288doc [00:10, 704.80doc/s]




[INFO] [finished] docs_iter: [00:10] [7288doc] [704.58doc/s]
[INFO] [finished] building docstore [10.37s]


In [None]:
# Convert the stream of items to a Pandas DataFrame and filter by source type
import pandas as pd

# Convert the stream of items to a Pandas DataFrame
itemsAsDataFrame = pd.DataFrame(dataset.docs_iter())

# Save a copy of the first 10,000 rows
itemsAsDataFrame10000 = itemsAsDataFrame.head(10000).copy()
itemsAsDataFrame10000.to_csv("2017-12-07-10000.csv", index=False)

# Boolean filters, one per source type
is_reddit = itemsAsDataFrame['source_type'] == "Reddit"
is_twitter = itemsAsDataFrame['source_type'] == "Twitter"
is_fb = itemsAsDataFrame['source_type'] == "Facebook"
is_news = itemsAsDataFrame['source_type'] == "News"

# Apply each filter and keep up to 10,000 rows per source
itemsAsDataFrame10000_Reddit = itemsAsDataFrame[is_reddit].head(10000).copy()
itemsAsDataFrame_Twitter = itemsAsDataFrame[is_twitter].head(10000).copy()
itemsAsDataFrame10000_fb = itemsAsDataFrame[is_fb].head(10000).copy()
itemsAsDataFrame10000_news = itemsAsDataFrame[is_news].head(10000).copy()

itemsAsDataFrame.head(5)


Unnamed: 0,doc_id,event,text,source,source_type,unix_timestamp
0,CrisisFACTS-001-News-5-0,CrisisFACTS-001,Live updates: San Diego County fire is 92 perc...,{'url': 'http://www.sandiegouniontribune.com/n...,News,1512604800
1,CrisisFACTS-001-News-5-1,CrisisFACTS-001,"The Lilac fire now 92 percent contained, Cal F...",{'url': 'http://www.sandiegouniontribune.com/n...,News,1512604800
2,CrisisFACTS-001-News-5-2,CrisisFACTS-001,The county of San Diego has opened a Local Ass...,{'url': 'http://www.sandiegouniontribune.com/n...,News,1512604800
3,CrisisFACTS-001-News-5-3,CrisisFACTS-001,The center is at the Vista branch library on 7...,{'url': 'http://www.sandiegouniontribune.com/n...,News,1512604800
4,CrisisFACTS-001-News-5-4,CrisisFACTS-001,Homeowners also will be able to get informatio...,{'url': 'http://www.sandiegouniontribune.com/n...,News,1512604800


In [None]:
itemsAsDataFrame[is_twitter].head(10)

Unnamed: 0,doc_id,event,text,source,source_type,unix_timestamp
351,CrisisFACTS-001-Twitter-15712-0,CrisisFACTS-001,The homie tell me meet him at a time and this ...,{'created_at': 'Thu Dec 07 00:00:32 +0000 2017...,Twitter,1512604832
352,CrisisFACTS-001-Twitter-31905-0,CrisisFACTS-001,A couple tattoos from the other day. Thank yo...,{'created_at': 'Thu Dec 07 00:00:55 +0000 2017...,Twitter,1512604855
353,CrisisFACTS-001-Twitter-14023-0,CrisisFACTS-001,Big increase in the wind plus drop in humidity...,{'created_at': 'Thu Dec 07 00:01:16 +0000 2017...,Twitter,1512604876
354,CrisisFACTS-001-Twitter-31850-0,CrisisFACTS-001,Blue is the color of the shirt of the man i lo...,{'created_at': 'Thu Dec 07 00:02:00 +0000 2017...,Twitter,1512604920
355,CrisisFACTS-001-Twitter-27052-0,CrisisFACTS-001,Prayers go out to you all! From surviving 2 ma...,{'created_at': 'Thu Dec 07 00:02:57 +0000 2017...,Twitter,1512604977
356,CrisisFACTS-001-Twitter-17243-0,CrisisFACTS-001,@Rockinchick69 @doriemarie468 @fauxcin @nobama...,{'created_at': 'Thu Dec 07 00:03:00 +0000 2017...,Twitter,1512604980
357,CrisisFACTS-001-Twitter-20552-0,CrisisFACTS-001,JOIN US at https://t.co/33v6kC6gAO—We’re Talki...,{'created_at': 'Thu Dec 07 00:03:58 +0000 2017...,Twitter,1512605038
358,CrisisFACTS-001-Twitter-12517-0,CrisisFACTS-001,Back at my apartments for the day (@ Ballpark ...,{'created_at': 'Thu Dec 07 00:04:14 +0000 2017...,Twitter,1512605054
359,CrisisFACTS-001-Twitter-43197-0,CrisisFACTS-001,Affordable Ice Maker repairs near #CollegeArea...,{'created_at': 'Thu Dec 07 00:04:19 +0000 2017...,Twitter,1512605059
360,CrisisFACTS-001-Twitter-4592-0,CrisisFACTS-001,"We are next! Be Safe San Diego! Be diligent, a...",{'created_at': 'Thu Dec 07 00:04:35 +0000 2017...,Twitter,1512605075


In [None]:
columns_to_remove = ['doc_id', 'event', 'source', 'source_type', 'unix_timestamp']
itemsAsDataFrame_Twitter = itemsAsDataFrame_Twitter.drop(columns=columns_to_remove)

In [None]:
itemsAsDataFrame_Twitter

Unnamed: 0,doc_id,event,text,source,source_type,unix_timestamp
351,CrisisFACTS-001-Twitter-15712-0,CrisisFACTS-001,The homie tell me meet him at a time and this ...,{'created_at': 'Thu Dec 07 00:00:32 +0000 2017...,Twitter,1512604832
352,CrisisFACTS-001-Twitter-31905-0,CrisisFACTS-001,A couple tattoos from the other day. Thank yo...,{'created_at': 'Thu Dec 07 00:00:55 +0000 2017...,Twitter,1512604855
353,CrisisFACTS-001-Twitter-14023-0,CrisisFACTS-001,Big increase in the wind plus drop in humidity...,{'created_at': 'Thu Dec 07 00:01:16 +0000 2017...,Twitter,1512604876
354,CrisisFACTS-001-Twitter-31850-0,CrisisFACTS-001,Blue is the color of the shirt of the man i lo...,{'created_at': 'Thu Dec 07 00:02:00 +0000 2017...,Twitter,1512604920
355,CrisisFACTS-001-Twitter-27052-0,CrisisFACTS-001,Prayers go out to you all! From surviving 2 ma...,{'created_at': 'Thu Dec 07 00:02:57 +0000 2017...,Twitter,1512604977
...,...,...,...,...,...,...
7283,CrisisFACTS-001-Twitter-14176-0,CrisisFACTS-001,HEY SAN DIEGO WHAT ABOUT WE ALL TURN ON OUR SP...,{'created_at': 'Thu Dec 07 23:59:48 +0000 2017...,Twitter,1512691188
7284,CrisisFACTS-001-Twitter-13471-0,CrisisFACTS-001,"Oh good, fires down by my parents in San Diego...",{'created_at': 'Thu Dec 07 23:59:49 +0000 2017...,Twitter,1512691189
7285,CrisisFACTS-001-Twitter-34602-0,CrisisFACTS-001,"#LilacFire grows to 2,000 acres, is 0% contain...",{'created_at': 'Thu Dec 07 23:59:54 +0000 2017...,Twitter,1512691194
7286,CrisisFACTS-001-Twitter-45905-0,CrisisFACTS-001,#LilacFire Is the 76 closed both directions r...,{'created_at': 'Thu Dec 07 23:59:55 +0000 2017...,Twitter,1512691195




In [None]:
import re

def clean_text(text):
    # Remove URLs (http\S+ also matches https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove all non-ASCII characters (emoji, curly quotes, ellipsis characters, ...)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Optionally, remove digits (uncomment the next line to drop numbers)
    # text = re.sub(r'\d+', '', text)
    return text

# Apply the cleaning function to the 'text' column
itemsAsDataFrame_Twitter['text'] = itemsAsDataFrame_Twitter['text'].apply(clean_text)
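
A quick check of what the cleaning step does to a single string. The function below mirrors the cell above (URL removal followed by non-ASCII removal) so that the snippet runs on its own; the sample tweet is made up in the style of the dataset.

```python
import re

def clean_text(text):
    # Drop URLs first, then any run of non-ASCII characters
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    return text

# Made-up tweet: hashtag, a Unicode ellipsis, and a shortened link
sample = "#LilacFire update\u2026 https://t.co/example"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that the space that preceded the link is left behind; calling `.strip()` on the result removes it if trailing whitespace matters.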

In [None]:
itemsAsDataFrame_Twitter

Unnamed: 0,text
351,The homie tell me meet him at a time and this ...
352,A couple tattoos from the other day. Thank yo...
353,Big increase in the wind plus drop in humidity...
354,Blue is the color of the shirt of the man i lo...
355,Prayers go out to you all! From surviving 2 ma...
...,...
7283,HEY SAN DIEGO WHAT ABOUT WE ALL TURN ON OUR SP...
7284,"Oh good, fires down by my parents in San Diego..."
7285,"#LilacFire grows to 2,000 acres, is 0% contain..."
7286,#LilacFire Is the 76 closed both directions r...


In [None]:
from google.colab import data_table

In [None]:
User_Profiles_Event_Definition = pd.DataFrame(dataset.queries_iter())

[INFO] [starting] requesting access key
[INFO] [finished] requesting access key [1.25s]


In [None]:
User_Profiles_Event_Definition

Unnamed: 0,query_id,text,indicative_terms,trecis_category_mapping,event_id,event_title,event_dataset,event_description,event_trecis_id,event_type,event_url
0,CrisisFACTS-General-q001,Have airports closed,airport closed,Report-Factoid,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire
1,CrisisFACTS-General-q002,Have railways closed,rail closed,Report-Factoid,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire
2,CrisisFACTS-General-q003,Have water supplied been contaminated,water supply,Report-EmergingThreats,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire
3,CrisisFACTS-General-q004,How many firefighters are active,firefighters on-duty,Report-Factoid,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire
4,CrisisFACTS-General-q005,How many people are affected,evacuated,Report-Factoid,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire
5,CrisisFACTS-General-q006,How many people are in shelters,shelters,Report-Factoid,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire
6,CrisisFACTS-General-q007,How many people are missing,missing,Report-Factoid,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire
7,CrisisFACTS-General-q008,How many people are trapped,trapped,Request-SearchAndRescue,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire
8,CrisisFACTS-General-q009,How many people have been injured,injury injured,Report-Factoid,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire
9,CrisisFACTS-General-q010,How many people have been killed,killed dead,Report-Factoid,CrisisFACTS-001,Lilac Wildfire 2017,2017_12_07_lilac_wildfire.2017,The Lilac Fire was a fire that burned in north...,TRECIS-CTIT-H-092,Wildfire,https://en.wikipedia.org/wiki/Lilac_Fire


## Simple Hybrid Model
Our initial model is designed for testing purposes and focuses on the fundamentals of information retrieval.

### Text Vectorization
We employ two text vectorization techniques:

* TF-IDF (Term Frequency-Inverse Document Frequency): This method transforms the text into a numerical form by considering the importance of terms within the text and across a collection of documents.

* Word2Vec: A neural network-based approach that represents words as vectors, capturing semantic relationships.

### Similarity Computation
We compute the cosine similarity between queries and documents using the vectorized forms. This measure identifies the documents most relevant to a given query.

### Results Retrieval
The top 5 most similar documents are retrieved for each query; these represent the information most relevant to the disaster.
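
For reference, the cosine similarity between a query vector $q$ and a document vector $d$ (symbols chosen here for illustration) is:

$$
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}
$$

For TF-IDF vectors, whose entries are non-negative, the value lies between 0 and 1; for averaged Word2Vec vectors it can range from -1 to 1.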

#### First, for the one-day dataset, cleaned and filtered to show only Twitter content

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec
import numpy as np

# Combine queries and Twitter documents
all_texts = User_Profiles_Event_Definition['text'].tolist() + itemsAsDataFrame_Twitter['text'].tolist()

# Apply TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(all_texts)

# Split the TF-IDF matrix into queries and documents
queries_tfidf = tfidf_matrix[:len(User_Profiles_Event_Definition)]
documents_tfidf = tfidf_matrix[len(User_Profiles_Event_Definition):]

# Compute cosine similarity using TF-IDF
similarity_matrix_tfidf = cosine_similarity(queries_tfidf, documents_tfidf)

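# Train Word2Vec model. The constructor already builds the vocabulary and trains
# on `sentences`; the explicit train() call below then runs 10 further epochs.
sentences = [text.split() for text in all_texts]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
word2vec_model.train(sentences, total_examples=len(sentences), epochs=10)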

# Compute average Word2Vec vectors for queries and documents
def average_word_vectors(text):
    vectors = [word2vec_model.wv[word] for word in text.split() if word in word2vec_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(word2vec_model.vector_size)

queries_w2v = np.array([average_word_vectors(query) for query in User_Profiles_Event_Definition['text']])
documents_w2v = np.array([average_word_vectors(doc) for doc in itemsAsDataFrame_Twitter['text']])

# Compute cosine similarity using Word2Vec
similarity_matrix_w2v = cosine_similarity(queries_w2v, documents_w2v)

# Get top 5 indices of most similar documents for each query (using TF-IDF)
top_indices_tfidf = np.argsort(-similarity_matrix_tfidf, axis=1)[:, :5]

# Retrieve and print the corresponding documents for each query (using TF-IDF)
for query_idx, indices in enumerate(top_indices_tfidf):
    query_text = User_Profiles_Event_Definition.iloc[query_idx]['text']
    print(f"Query: {query_text}")
    print("Top 5 Relevant Documents:")
    for idx in indices:
        document_text = itemsAsDataFrame_Twitter.iloc[idx]['text']
        print(f"  - {document_text}")
    print("\n" + "="*50 + "\n")




Query: Have airports closed
Top 5 Relevant Documents:
  - I-15 SB two lanes closed w/3 mile backup.  76 closed from old 395 W to Gird. Old 395 closed from 76 S to 15.… https://t.co/qHyVtyg60L
  - #LilacFire UPDATE: Fire is currently at over 150 acres. 2 lanes of SB 15 closed. 76 at Hwy 395 to Gird Road closed.… https://t.co/nCNQG3DuDA
  - #LilacFire  Route 76 closed down.  SB 15 3 lanes closed.  150 acres now.  that's from 10 acres in the last 45 min.… https://t.co/TjtQGCs9s0
  - EB 76 closed at East Vista Way due to #LilacFire  WB only open from Gird to the West.  Both directions of 76 closed… https://t.co/U4bvKGkihP
  - The Latest: Fire in San Diego County triggers evacuations: Today News for #PortCharlotte: Authorities have closed a… https://t.co/0urWF7ercX


Query: Have railways closed
Top 5 Relevant Documents:
  - I-15 SB two lanes closed w/3 mile backup.  76 closed from old 395 W to Gird. Old 395 closed from 76 S to 15.… https://t.co/qHyVtyg60L
  - #LilacFire UPDATE: Fire is curr

In [None]:
results = []

# Define weights for combining TF-IDF and Word2Vec similarity
weight_tfidf = 0.5
weight_w2v = 0.5

for query_idx, indices in enumerate(top_indices_tfidf):
    query_data = User_Profiles_Event_Definition.iloc[query_idx]
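    for idx in indices:
        # itemsAsDataFrame_Twitter keeps only the 'text' column once the metadata
        # columns are dropped, so fetch the full row (original text, timestamp,
        # source type) from itemsAsDataFrame via the shared index label
        document = itemsAsDataFrame.loc[itemsAsDataFrame_Twitter.index[idx]]
        # Compute the importance score as a weighted combination of TF-IDF and Word2Vec similarity
        importance_score = weight_tfidf * similarity_matrix_tfidf[query_idx, idx] + weight_w2v * similarity_matrix_w2v[query_idx, idx]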
        fact = {
            "requestID": query_data['event_id'],
            "factText": document['text'],
            "unixTimestamp": document['unix_timestamp'],
            "importance": importance_score,
            "sources": [document['source_type']],
            "streamID": query_data['event_id'],
            "informationNeeds": [query_data['trecis_category_mapping']]
        }
        results.append(fact)
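
The cell above scores only the documents that TF-IDF already placed in its top 5. One possible variation, shown here as a sketch rather than as the method used in this notebook, is to combine the two similarity scores first and rank on the result. Plain lists stand in for one row of each similarity matrix, the numbers are invented, and `hybrid_rank` is a hypothetical helper name.

```python
def hybrid_rank(tfidf_scores, w2v_scores, weight_tfidf=0.5, weight_w2v=0.5, k=5):
    """Rank document positions by a weighted sum of two similarity scores."""
    combined = [
        weight_tfidf * t + weight_w2v * w
        for t, w in zip(tfidf_scores, w2v_scores)
    ]
    # Sort positions by descending combined score and keep the first k
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return [(i, combined[i]) for i in order[:k]]

# Invented similarity rows for one query against four documents
tfidf_row = [0.10, 0.60, 0.00, 0.30]
w2v_row = [0.90, 0.50, 0.20, 0.40]

ranking = hybrid_rank(tfidf_row, w2v_row, k=2)
print(ranking)
```

With equal weights, a document that TF-IDF ranks low can still surface if its Word2Vec similarity is high, which is the point of combining the two signals before ranking.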

In [None]:
for result in results[:200]:
    print(result)

{'requestID': 'CrisisFACTS-001', 'factText': 'I-15 SB two lanes closed w/3 mile backup.  76 closed from old 395 W to Gird. Old 395 closed from 76 S to 15.… https://t.co/qHyVtyg60L', 'unixTimestamp': 1512678914, 'importance': 0.6067307453693106, 'sources': ['Twitter'], 'streamID': 'CrisisFACTS-001', 'informationNeeds': ['Report-Factoid']}
{'requestID': 'CrisisFACTS-001', 'factText': '#LilacFire UPDATE: Fire is currently at over 150 acres. 2 lanes of SB 15 closed. 76 at Hwy 395 to Gird Road closed.… https://t.co/nCNQG3DuDA', 'unixTimestamp': 1512679259, 'importance': 0.5787352740710063, 'sources': ['Twitter'], 'streamID': 'CrisisFACTS-001', 'informationNeeds': ['Report-Factoid']}
{'requestID': 'CrisisFACTS-001', 'factText': "#LilacFire  Route 76 closed down.  SB 15 3 lanes closed.  150 acres now.  that's from 10 acres in the last 45 min.… https://t.co/TjtQGCs9s0", 'unixTimestamp': 1512679536, 'importance': 0.5795094582143426, 'sources': ['Twitter'], 'streamID': 'CrisisFACTS-001', 'inform

In [None]:
import json
import gzip
import numpy as np

# Function to convert NumPy types to Python native types
def convert_types(obj):
    if isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, dict):
        return {key: convert_types(value) for key, value in obj.items()}
    elif isinstance(obj, list):
        return [convert_types(item) for item in obj]
    else:
        return obj

# Convert the results, applying convert_types to each fact in the list
converted_results = [convert_types(fact) for fact in results]

# Open a new gzip file for writing
with gzip.open("submission_file.json.gz", "wt", encoding="utf-8") as file:
    # Write each result object as a separate line
    for result in converted_results:
        file.write(json.dumps(result) + "\n")
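
Before submitting, it can be worth reading the file back to confirm that every line parses as JSON. The sketch below is a self-contained round trip in a temporary directory; the helper names and the single sample fact are hypothetical, with keys borrowed from the results built above.

```python
import gzip
import json
import os
import tempfile

def write_jsonl_gz(path, records):
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl_gz(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical fact using a few of the keys from the results list
facts = [{"requestID": "CrisisFACTS-001", "factText": "example", "importance": 0.5}]

path = os.path.join(tempfile.mkdtemp(), "submission_file.json.gz")
write_jsonl_gz(path, facts)
loaded = read_jsonl_gz(path)
print(len(loaded), "facts read back")
```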



#### Now for the whole one-day event data, unfiltered and uncleaned

In [None]:
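# Re-rank against the full, uncleaned one-day dataset. The positions in
# top_indices_tfidf refer to rows of itemsAsDataFrame_Twitter, so they cannot be
# reused to index itemsAsDataFrame; TF-IDF is fitted again on all documents.
query_texts = User_Profiles_Event_Definition['text'].tolist()
full_tfidf = TfidfVectorizer().fit_transform(query_texts + itemsAsDataFrame['text'].tolist())
similarity_full = cosine_similarity(full_tfidf[:len(query_texts)], full_tfidf[len(query_texts):])
top_indices_full = np.argsort(-similarity_full, axis=1)[:, :5]

# Retrieve and print the corresponding documents for each query (using TF-IDF)
for query_idx, indices in enumerate(top_indices_full):
    query_text = query_texts[query_idx]
    print(f"Query: {query_text}")
    print("Top 5 Relevant Documents (TF-IDF):")
    for idx in indices:
        document_text = itemsAsDataFrame.iloc[idx]['text']
        print(f"  - {document_text}")
    print("\n" + "="*50 + "\n")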


Query: Have airports closed
Top 5 Relevant Documents (TF-IDF):
  - i have to get a -11.3 on my math final to keep my A in the class im dead should i even go to my final
  - 2015 I did some dope pop up shops. Promoted my first show but then fell into depression. Went to 2 mental hospital’… https://t.co/V9FJITvCtk
  - San Diego native, and I don’t ever remember Santa Ana winds like this ever before. Hurricane strength, dry as a bon… https://t.co/4OqegE658O
  - Wooohoo @nbcsandiego and @CALFIRESANDIEGO made #LilacFire a trending topic in the US. You can trust the robot.
  - @watchesdotcom @Yotpo Switzerland BINGER  🇨🇭 - Special Price  ➤ https://t.co/6HtmB7k7Hz   Pearl Harbor Sen. Al Fran… https://t.co/lw3AZX0xCW


Query: Have railways closed
Top 5 Relevant Documents (TF-IDF):
  - i have to get a -11.3 on my math final to keep my A in the class im dead should i even go to my final
  - 2015 I did some dope pop up shops. Promoted my first show but then fell into depression. Went to 2 mental 

Below is an explanation of the code, broken down step by step.

### Text Vectorization

```python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(all_texts)
```
- **Purpose**: Applies TF-IDF vectorization to all the texts, converting them into numerical form.
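
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF. It assumes the weighting that scikit-learn documents as its default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation), but it uses a plain whitespace tokenizer, so it is an illustration and not a drop-in replacement for `TfidfVectorizer`. The three toy texts and the function name `tfidf_vectors` are invented.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return (vocabulary, list of L2-normalised TF-IDF vectors)."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of texts containing each term
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors

texts = ["roads closed", "roads open", "airports closed today"]
vocab, vectors = tfidf_vectors(texts)
print(vocab)
```

In the second text, "open" appears in only one document while "roads" appears in two, so "open" receives the larger weight.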

```python
from gensim.models import Word2Vec
sentences = [text.split() for text in all_texts]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
word2vec_model.train(sentences, total_examples=len(sentences), epochs=10)
```
- **Purpose**: Trains a Word2Vec model on the text, creating word vectors that capture semantic meaning.
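
The averaging step that turns word vectors into a single text vector can be illustrated without training a model. In this sketch a hand-written dictionary of 2-dimensional vectors stands in for `word2vec_model.wv` (the values are invented), and words missing from it are skipped, as in `average_word_vectors`.

```python
# Invented 2-d embeddings standing in for a trained Word2Vec vocabulary
toy_vectors = {
    "roads": [1.0, 0.0],
    "closed": [0.0, 1.0],
    "fire": [1.0, 1.0],
}

def average_vector(text, vectors, size=2):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[word] for word in text.split() if word in vectors]
    if not known:
        return [0.0] * size
    return [sum(component) / len(known) for component in zip(*known)]

print(average_vector("roads closed", toy_vectors))
print(average_vector("unknown words only", toy_vectors))
```

Averaging discards word order, so "roads closed" and "closed roads" map to the same vector.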

### Similarity Computation

```python
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix_tfidf = cosine_similarity(queries_tfidf, documents_tfidf)
similarity_matrix_w2v = cosine_similarity(queries_w2v, documents_w2v)
```
- **Purpose**: Computes cosine similarity between queries and documents, identifying relevant information.
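
As a sketch of what the call computes, the snippet below builds the same kind of query-by-document matrix in plain Python. The function names and the toy vectors are assumptions made for illustration; returning 0.0 for an all-zero vector is a choice made here to avoid dividing by zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(queries, documents):
    """Rows are queries, columns are documents."""
    return [[cosine(q, d) for d in documents] for q in queries]

queries = [[1.0, 0.0]]
documents = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
matrix = similarity_matrix(queries, documents)
print(matrix)
```

The first document points in the same direction as the query and scores 1 regardless of its length, which is why cosine similarity suits texts of different sizes.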

### Results Retrieval

```python
top_indices_tfidf = np.argsort(-similarity_matrix_tfidf, axis=1)[:, :5]
for query_idx, indices in enumerate(top_indices_tfidf):
    # Print the top 5 most relevant documents for each query
    for idx in indices:
        print(itemsAsDataFrame_Twitter.iloc[idx]['text'])
```
- **Purpose**: Retrieves, for each query, the 5 documents with the highest TF-IDF cosine similarity.
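
`np.argsort` orders every document in a row before the slice keeps five. When only the top few positions are needed, a partial selection gives the same positions; the sketch below uses the standard library's `heapq.nlargest` on one invented row of scores. It is an alternative illustration with a hypothetical helper name, not the code used above.

```python
import heapq

def top_k_indices(scores, k=5):
    """Return positions of the k highest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

# Invented similarity scores for one query against six documents
row = [0.12, 0.80, 0.05, 0.43, 0.67, 0.31]
print(top_k_indices(row, k=3))
```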