# Data Preprocessing Dev

1. **Load and Preprocess Evidence Data**:

- *Data Structure*: The dataset, `evidence_df`, contains two columns: `evidence_id` and `evidence_paragraph`.
- *Objective*: Fit a TF-IDF model on all evidence paragraphs. The fitted model is then used to retrieve the most relevant evidence for a given input claim.

2. **TF-IDF for Evidence Retrieval**:

- *Preprocessing*: Clean and preprocess the evidence paragraphs to optimize them for TF-IDF vectorization (e.g., removing stopwords, punctuation, and normalizing text).
- *Vectorization*: Apply TF-IDF vectorization to the preprocessed evidence paragraphs to create a matrix representing the importance of terms in each document.
- *Similarity Calculation*: When a new claim is received, convert it into a TF-IDF vector using the same fitted vectorizer and compute its cosine similarity against every row of the TF-IDF matrix to find the most relevant evidence paragraphs.
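As a rough illustration of the vectorization and similarity steps above, the sketch below fits a `TfidfVectorizer` on a tiny toy corpus and ranks the documents against a claim. The corpus, the claim string, and the default vectorizer settings are assumptions made for illustration; the notebook's own retriever uses different settings and a much larger corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the evidence paragraphs (illustrative only).
evidence = [
    "south australia has the highest retail price for electricity",
    "polar bears live in the arctic",
    "carbon dioxide traps heat in the atmosphere",
]

# Fit on the evidence once; reuse the same fitted vectorizer for every claim.
vectorizer = TfidfVectorizer()
evidence_matrix = vectorizer.fit_transform(evidence)  # sparse, (n_docs, n_terms)

claim = "south australia has the most expensive electricity"
claim_vector = vectorizer.transform([claim])

# One similarity score per evidence paragraph, then rank from high to low.
scores = cosine_similarity(claim_vector, evidence_matrix).flatten()
ranked = scores.argsort()[::-1]
best_index = int(ranked[0])
print(best_index, evidence[best_index])
```

Fitting only on the evidence and calling `transform` on the claim keeps both in the same vocabulary space; claim terms that never occur in the evidence are simply ignored.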

3. **Construct an Evidence List**:

- *Relevance*: Based on the similarity scores, select the top-ranked evidence paragraphs. This list is passed on for further processing and classification.

4. **Concatenate Claim and Evidences**:

- *Integration*: Concatenate the input claim with its top-ranked evidence paragraphs into a single text block (paragraph). This concatenated text serves as the context for the claim.
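A minimal sketch of this concatenation step is shown below. The helper name `build_context` and the single-space separator are assumptions for illustration; the notebook does not prescribe a particular separator.

```python
from typing import List


def build_context(claim: str, evidences: List[str], sep: str = " ") -> str:
    """Join a claim and its retrieved evidence paragraphs into one text block.

    Empty or whitespace-only evidence strings are skipped.
    """
    parts = [claim.strip()]
    parts.extend(e.strip() for e in evidences if e and e.strip())
    return sep.join(parts)


context = build_context(
    "South Australia has the most expensive electricity in the world.",
    ["South Australia has the highest retail price for electricity in the country."],
)
print(context)
```

Placing the claim first keeps it at a fixed position in the block, which can be convenient if a later model truncates long inputs.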

5. **Word2Vec Model Training and Application**:

- *Model Building*: Build a Word2Vec model from scratch using PyTorch to learn word embeddings from the concatenated text of claims and their relevant evidences.
- *Usage*: The trained Word2Vec model can be used to convert words or phrases from the claims and evidences into vectors, which can then be utilized for various tasks such as classification, clustering, or further similarity measurements.
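The notebook does not say which Word2Vec variant is intended. Assuming the skip-gram formulation, the data-preparation half can be sketched as below: each token is paired with its neighbours inside a fixed window, and those (center, context) pairs would then feed an embedding model written in PyTorch. The function names and the window size are hypothetical.

```python
from typing import Dict, List, Tuple


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Map each distinct token to an integer id (sorted for determinism)."""
    return {word: idx for idx, word in enumerate(sorted(set(tokens)))}


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token and its window neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "south australia has expensive electricity".split()
vocab = build_vocab(tokens)
pairs = skipgram_pairs(tokens, window=1)
print(len(vocab), pairs[:3])
```

In a full training loop the string pairs would be converted to id pairs through `vocab` before being batched.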

6. **Classification**:

- *Approach*: Use the embeddings from the Word2Vec model along with additional features (if necessary) to classify the claim into one of four predefined categories.
- *Model Selection*: Depending on the complexity and nature of the classification, choose an appropriate machine learning or deep learning model. This could be a simple logistic regression, a support vector machine, or a more complex neural network.
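One common way to turn word embeddings into a fixed-length input for any of these classifiers is to average the vectors of the tokens in the text. The sketch below assumes that approach; the `mean_pool` helper and the toy two-dimensional embeddings are hypothetical and only show the shape of the computation.

```python
from typing import Dict, List


def mean_pool(tokens: List[str], embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embeddings of known tokens; unknown tokens are skipped.

    Returns a zero vector of length `dim` when no token has an embedding.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy embeddings (illustrative values only).
toy_embeddings = {"south": [1.0, 3.0], "australia": [3.0, 5.0]}
features = mean_pool(["south", "australia", "unseen"], toy_embeddings, dim=2)
print(features)
```

The resulting feature vector has the same length for every claim, so it can be passed to logistic regression, an SVM, or a small neural network without further reshaping.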

**Considerations for Implementation**:
- *Modularity*: Each step should be encapsulated within its own class or function to ensure modularity and ease of maintenance.
- *Scalability*: Design the system to handle increases in data volume efficiently, possibly by optimizing data handling and processing.
- *Extensibility*: Allow for easy updates and modifications, such as adding new preprocessing steps, changing the classification model, or adjusting the number of top evidences retrieved.

## 1. Load and Preprocess Evidence Data

- *Data Structure*: The dataset, `evidence_df`, contains two columns: `evidence_id` and `evidence_paragraph` (in the code below they appear as `index` and `paragraph`).
- *Objective*: Fit a TF-IDF model on all evidence paragraphs. The fitted model is then used to retrieve the most relevant evidence for a given input claim.

In [1]:
from dataclasses import dataclass
import pandas as pd
import logging
from typing import Optional
from pathlib import Path

# Configure logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger(__name__)

In [2]:
@dataclass
class DataLoader:
    file_path: Path

    def load_data(self) -> Optional[pd.DataFrame]:
        """
        Loads the data from the specified JSON file path using a Path object.
        Attempts to read a JSON file into a pandas DataFrame.
        Logs a warning and returns None if the operation fails.
        
        :return: Optional[pd.DataFrame] - A pandas DataFrame if successful, None otherwise.
        """
        try:
            if not self.file_path.exists():
                logger.warning(f"The file {self.file_path} does not exist.")
                return None
            
            data = pd.read_json(self.file_path, orient='index')
            logger.info("Data loaded successfully.")
            return data
        except Exception as e:
            logger.warning(f"An error occurred while loading the data from {self.file_path}: {e}")
            return None

## 2. TF-IDF for Evidence Retrieval

Text preprocessing is an essential step in our pipeline, aiming to transform raw text into a format that's more analyzable and meaningful. This step involves cleaning the text, reducing words to their base or root form, removing irrelevant characters and words that do not contribute to the semantic meaning of the text, and enriching the text to retain more information.

### Approach
Our preprocessing workflow integrates several techniques to refine the text data comprehensively:

1. **Contraction Expansion**: Converts contractions (e.g., "isn't" to "is not") into their expanded forms to standardize text and improve analysis accuracy.
2. **Tokenization and Lowercasing**: Splits text into individual words or tokens and transforms all text to lowercase to ensure consistency and avoid duplication based on case differences.
3. **Special Characters Removal**: Deletes non-word characters (e.g., punctuation) to focus on the textual content, followed by removing duplicated spaces to clean the text further.
4. **Part-of-Speech Tagging and Named Entity Recognition (NER)**: Tags each token with its part of speech and chunks named entities (e.g., "South Australia") so that they can be kept as single tokens. In the implementation below, the chunked tokens are computed but the final text is built from the POS-tagged tokens, so named entities appear as separate lowercase words in the logged output.
5. **Text Enrichment**: Adds WordNet-derived terms based on the part of speech: a synonym and up to two hyponyms for nouns, and antonyms for verbs. Terms that are already present are skipped, so the content is enriched without altering the original meaning significantly.
6. **Stop Words Removal and Lemmatization**: The class initializes a stop-word set and a `WordNetLemmatizer`, but `preprocess_text` as written applies neither; the logged outputs still contain common words such as "the" and "in".
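The lowercasing and special-character steps can be sketched on their own with the standard library. The regular expressions are the ones used in `preprocess_text`; the helper name `basic_clean` is hypothetical, and the sketch leaves out contraction expansion, tagging, and enrichment.

```python
import re


def basic_clean(text: str) -> str:
    """Lowercase, drop non-word characters, and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation and brackets
    text = re.sub(r"\s{2,}", " ", text)   # collapse runs of whitespace
    return text.strip()


print(basic_clean("[South Australia]  has the most expensive electricity."))
```

Note that `\w` also matches underscores and digits, so tokens such as `co2` survive this cleanup.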

### Implementation
The `preprocess_text` method encapsulates the preprocessing steps, taking a string of text as input. This method is a critical component of the `TfidfEvidenceRetriever` class, which preprocesses both the query and the documents in the corpus to ensure a standardized format for TF-IDF vectorization.

The `preprocess` method orchestrates the preprocessing of the entire dataset stored in a DataFrame. It utilizes the tqdm library to display a progress bar, providing real-time feedback on the preprocessing status, and prepares the text for subsequent TF-IDF vectorization and similarity comparison.

### Example
Consider the claim: "[South Australia] has the most expensive electricity in the world." During preprocessing:

1. **Contraction Expansion**: No contractions are present, so the text is unchanged.
2. **Tokenization & Part-of-Speech Tagging**: Splits the text into tokens and tags them.
3. **Named Entity Recognition (NER)**: Chunks "South Australia" as a named entity; as written, the chunked tokens are not carried into the final text.
4. **Text Enrichment**: Adds WordNet terms ahead of the word they relate to: "lack", "abstain", and "refuse" (antonyms) before "has", "electrical energy" before "electricity", and "universe" before "world".
5. **Lowercasing & Special Characters Removal**: Lowercases the tokens, strips the brackets and the final period, and collapses duplicated spaces.

The logged result is `south australia lack abstain refuse has the most expensive electrical energy electricity in universe world`. Stop words such as "the" and "in" are retained, and "South Australia" appears as two lowercase tokens.

In [3]:
import logging
import pickle
import re
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path

import contractions
import nltk
import numpy as np
import pandas as pd
from nltk import ne_chunk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.tree import Tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.auto import tqdm

# Download necessary NLTK resources
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt', quiet=True)
# ne_chunk relies on these two resources (names as used by NLTK 3.8 and earlier)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)
nltk.download('stopwords', quiet=True)

True

In [4]:
@dataclass
class TfidfEvidenceRetriever:
    evidence_path: Path
    vectorizer: TfidfVectorizer = field(default_factory=lambda: TfidfVectorizer(ngram_range=(1, 3), max_df=0.85, min_df=2))
    tfidf_matrix: np.ndarray = None
    stop_words: set = field(default_factory=lambda: set(stopwords.words('english')))
    similarity_threshold: float = 0.5
    lemmatizer: WordNetLemmatizer = field(default_factory=WordNetLemmatizer)

    def __post_init__(self):
        self.evidence_df = pd.read_json(self.evidence_path, orient='index') # .head(100000)
        self.evidence_df.columns = ['paragraph']
        self.evidence_df.reset_index(inplace=True)
        self.preprocess()

    def get_wordnet_pos(self, treebank_tag):
        """Converts treebank POS tags to WordNet POS tags."""
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return None

    def enrich_text(self, word, pos, existing_tokens):
        """Enriches text based on POS, avoiding duplicates."""
        enrichments = []
        synsets = wordnet.synsets(word, pos=pos)
        if not synsets:
            return enrichments

        # For nouns: add at least one synonym and two hyponyms
        if pos == wordnet.NOUN:
            synonyms_added, hyponyms_added = 0, 0
            for synset in synsets:
                if synonyms_added < 1:
                    for lemma in synset.lemmas():
                        lemma_name = lemma.name().replace('_', ' ').lower()
                        if lemma_name != word and lemma_name not in existing_tokens:
                            enrichments.append(lemma_name)
                            synonyms_added += 1
                            break  # Break after adding one synonym
                for hyponym in synset.hyponyms():
                    for lemma in hyponym.lemmas():
                        lemma_name = lemma.name().replace('_', ' ').lower()
                        if lemma_name != word and lemma_name not in existing_tokens and hyponyms_added < 2:
                            enrichments.append(lemma_name)
                            hyponyms_added += 1
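                            # Note: `existing_tokens` is the caller's `seen_tokens` set, so marking
                            # the hyponym as seen here makes the caller's duplicate check skip it.
                            existing_tokens.add(lemma_name)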
                        if hyponyms_added >= 2:
                            break
                    if hyponyms_added >= 2:
                        break

        # For verbs: add an antonym
        if pos == wordnet.VERB:
            for synset in synsets:
                for lemma in synset.lemmas():
                    if lemma.antonyms():
                        antonym_name = lemma.antonyms()[0].name().replace('_', ' ').lower()
                        if antonym_name not in existing_tokens:
                            enrichments.append(antonym_name)
                            break

        return enrichments

    def traverse_tree(self, tree):
        final_tokens = []
        for subtree in tree:
            if isinstance(subtree, Tree):
                ne_token = "_".join(word for word, tag in subtree.leaves())
                final_tokens.append(ne_token)
            else:
                final_tokens.append(subtree[0])
        return final_tokens
    
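    def preprocess_text(self, text: str) -> str:
        """Expands contractions, tokenizes and POS-tags the text, adds WordNet
        enrichments, lowercases tokens, and strips punctuation and extra spaces."""
        try:
            text = contractions.fix(text)  # Expand contractions
        except Exception:
            # Fall back to the raw text if contraction expansion fails.
            pass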
        
        tokens = word_tokenize(text)
        tagged_tokens = pos_tag(tokens)
        
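        # Note: the NER-chunked tokens are computed here but are not used below;
        # the final text is built from `tagged_tokens`.
        ne_tree = ne_chunk(tagged_tokens)
        
        processed_tokens = self.traverse_tree(ne_tree)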
        
        seen_tokens = set()
        final_tokens = []
        for token, tag in tagged_tokens:
            wordnet_pos = self.get_wordnet_pos(tag)  # Convert POS tag to a WordNet POS tag.
            if wordnet_pos:  # Only enrich if a valid WordNet POS tag is available.
                enrichments = self.enrich_text(token, wordnet_pos, seen_tokens)
                for enrichment in enrichments:
                    if enrichment not in seen_tokens:
                        final_tokens.append(enrichment)
                        seen_tokens.add(enrichment)
            # Add the original token if not already added.
            token_lower = token.lower()
            if token_lower not in seen_tokens:
                final_tokens.append(token_lower)
                seen_tokens.add(token_lower)
        
        final_text = ' '.join(final_tokens)
        final_text = re.sub(r'[^\w\s]', '', final_text)
        final_text = re.sub(r'\s{2,}', ' ', final_text)
        
        return final_text.strip(' ')

    def preprocess(self):
        logger.info("Starting preprocessing of paragraphs.")
        processed_paragraphs = []
        for paragraph in tqdm(self.evidence_df['paragraph'], desc="Preprocessing paragraphs"):
            try:
                processed_paragraph = self.preprocess_text(paragraph)
                processed_paragraphs.append(processed_paragraph)
            except IndexError as e:
                logger.error(f"Error processing paragraph: {paragraph}")
                raise e
        self.evidence_df['processed_paragraph'] = processed_paragraphs
        self.tfidf_matrix = self.vectorizer.fit_transform(self.evidence_df['processed_paragraph'])
        logger.info("Preprocessing complete.")

    def find_relevant_evidences(self, query: str) -> pd.DataFrame:
        processed_query = self.preprocess_text(query)
        logger.info(f"Processed claim: {processed_query}")
        query_tfidf = self.vectorizer.transform([processed_query])
        cosine_similarities = cosine_similarity(query_tfidf, self.tfidf_matrix).flatten()

        sorted_indices = np.argsort(cosine_similarities)[::-1]
        if cosine_similarities[sorted_indices[0]] < self.similarity_threshold:
            most_relevant_index = [sorted_indices[0]]
            logger.info("No evidences above the threshold. Returning the most relevant evidence.")
            return self.evidence_df.iloc[most_relevant_index]

        relevant_indices = [index for index in sorted_indices if cosine_similarities[index] >= self.similarity_threshold][:6]
        logger.info(f"Found {len(relevant_indices)} relevant evidences.")
        return self.evidence_df.iloc[relevant_indices]
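The selection rule in `find_relevant_evidences` (keep at most six paragraphs at or above the threshold, otherwise fall back to the single best match) can be isolated as a small pure function. The sketch below mirrors that logic on a plain list of scores; the name `select_indices` and the `top_k` parameter are hypothetical.

```python
from typing import List, Sequence


def select_indices(scores: Sequence[float], threshold: float, top_k: int = 6) -> List[int]:
    """Return indices of the highest-scoring items at or above `threshold`.

    At most `top_k` indices are returned, best first. If nothing reaches the
    threshold, the single best index is returned instead.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if not order:
        return []
    above = [i for i in order if scores[i] >= threshold][:top_k]
    return above if above else [order[0]]


print(select_indices([0.1, 0.6, 0.4, 0.9], threshold=0.5))
print(select_indices([0.1, 0.2], threshold=0.5))
```

Keeping the rule separate from the vectorizer makes it easy to unit-test different thresholds, such as the 0.28, 0.35, and 0.5 values tried in the cells below.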

In [5]:
# Build the retriever: loads the evidence JSON, preprocesses every paragraph, and fits the TF-IDF vectorizer
retriever = TfidfEvidenceRetriever(Path('../data/evidence.json'))

2024-05-13 02:02:41 - INFO - Starting preprocessing of paragraphs.


Preprocessing paragraphs:   0%|          | 0/1208827 [00:00<?, ?it/s]

2024-05-13 02:55:22 - INFO - Preprocessing complete.


In [11]:
retriever.similarity_threshold=0.35

with open('tfidf_evidence_retriever_v2.pkl', 'wb') as file:
    pickle.dump(retriever, file)

print("Retriever object saved successfully.")

Retriever object saved successfully.


In [6]:
list(retriever.evidence_df.iloc[67732])

['evidence-67732',
 '[citation needed] South Australia has the highest retail price for electricity in the country.',
 'commendation citation obviate needed south australia lack abstain refuse has the highest retail monetary value price for electrical energy electricity in state country']

In [10]:
# Example claim to retrieve evidence for
claim = "[South Australia] has the most expensive electricity in the world."
retriever.similarity_threshold=0.35
relevant_indices = retriever.find_relevant_evidences(claim)
relevant_indices

2024-05-13 02:55:56 - INFO - Processed claim: south australia lack abstain refuse has the most expensive electrical energy electricity in universe world
2024-05-13 02:55:58 - INFO - Found 2 relevant evidences.


| | index | paragraph | processed_paragraph |
|---|---|---|---|
| 67732 | evidence-67732 | [citation needed] South Australia has the high... | commendation citation obviate needed south aus... |
| 572512 | evidence-572512 | "South Australia has the highest power prices ... | south australia lack abstain refuse has the hi... |


In [18]:
retriever.similarity_threshold=0.28
def get_evidences(claim_text):
    relevant_indices = retriever.find_relevant_evidences(claim_text)
    return list(relevant_indices['index'])

get_evidences(claim)

2024-05-13 03:03:51 - INFO - Processed claim: south australia lack abstain refuse has the most expensive electrical energy electricity in universe world
2024-05-13 03:03:54 - INFO - Found 2 relevant evidences.


['evidence-67732', 'evidence-572512']

In [19]:
claims_df = pd.read_json('../data/dev-claims.json', orient='index')
claims_df['evidences'] = claims_df['claim_text'].apply(get_evidences)
claims_df.to_json('../data/claims_prediction.json', index=True)

2024-05-13 03:04:00 - INFO - Processed claim: south australia lack abstain refuse has the most expensive electrical energy electricity in universe world
2024-05-13 03:04:02 - INFO - Found 2 relevant evidences.
2024-05-13 03:04:02 - INFO - Processed claim: when 3 per penny cent of total annual global emission emissions c carbon dioxide differ are from world humans and australia produces 13 centime this then no sum amount emanation reduction here will lack abstain refuse have any consequence effect on clime climate
2024-05-13 03:04:04 - INFO - No evidences above the threshold. Returning the most relevant evidence.
2024-05-13 03:04:04 - INFO - Processed claim: this means that the universe world differ is now 1c warmer than it was in preindustrial multiplication times
2024-05-13 03:04:06 - INFO - Found 2 relevant evidences.
2024-05-13 03:04:06 - INFO - Processed claim: as it dematerialize happens zika may also differ be a good theoretical account model of the second worrying consequence ef

In [20]:
# Example claim to retrieve evidence for
claim = "Satellite measurements of infrared spectra over the past 40 years observe less energy escaping to space at the wavelengths associated with CO2."
retriever.similarity_threshold=0.5
relevant_indices = retriever.find_relevant_evidences(claim)
relevant_indices

2024-05-13 03:08:51 - INFO - Processed claim: satellite measurement measurements of infrared spectrum spectra over the past 40 old age years observe less free energy energy escaping to infinite space at wavelength wavelengths dissociate associated with carbon dioxide co2
2024-05-13 03:08:54 - INFO - No evidences above the threshold. Returning the most relevant evidence.


| | index | paragraph | processed_paragraph |
|---|---|---|---|
| 668884 | evidence-668884 | CO2). | carbon dioxide co2 |


In [None]:
#with open('tfidf_evidence_retriever.pkl', 'rb') as file:
#    loaded_retriever = pickle.load(file)
#logger.info("Retriever object loaded successfully.")

In [37]:
# Example claim to retrieve evidence for
claim = "when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia prod­uces 1.3 per cent of this 3 per cent, then no amount of emissions reductio­n here will have any effect on global climate."
retriever.similarity_threshold=0.5
relevant_indices = retriever.find_relevant_evidences(claim)
pd.DataFrame(retriever.evidence_df.iloc[relevant_indices]['paragraph'])

2024-05-12 21:43:34 - INFO - Finding relevant evidences for the given query: 'when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia prod­uces 1.3 per cent of this 3 per cent, then no amount of emissions reductio­n here will have any effect on global climate.'
2024-05-12 21:43:35 - INFO - Found 2 relevant evidences.


| | paragraph |
|---|---|
| 1140012 | Developing countries with the highest rate of ... |
| 78654 | ", as opposed to "per cent". |


In [20]:
# Example claim to retrieve evidence for
claim = "when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia prod­uces 1.3 per cent of this 3 per cent, then no amount of emissions reductio­n here will have any effect on global climate."
retriever.similarity_threshold=0.5
relevant_indices = retriever.find_relevant_evidences(claim)
pd.DataFrame(retriever.evidence_df.iloc[relevant_indices]['paragraph'])

2024-05-12 20:51:53 - INFO - Finding relevant evidences for the given query: 'when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia prod­uces 1.3 per cent of this 3 per cent, then no amount of emissions reductio­n here will have any effect on global climate.'
2024-05-12 20:51:53 - INFO - No evidences above the threshold. Returning the most relevant evidence.


| | paragraph |
|---|---|
| 5606 | Their reported relationship appeared to accoun... |
