# Data Preprocessing Dev

1. **Load and Preprocess Evidence Data**:

- *Data Structure*: Your dataset, `evidence_df`, contains two columns: `evidence_id` and `evidence_paragraph`.
- *Objective*: Use all evidence paragraphs to train a TF-IDF model. This model will be used to retrieve the most relevant evidences for a given input claim.

2. **TF-IDF for Evidence Retrieval**:

- *Preprocessing*: Clean and preprocess the evidence paragraphs to optimize them for TF-IDF vectorization (e.g., removing stopwords, punctuation, and normalizing text).
- *Vectorization*: Apply TF-IDF vectorization to the preprocessed evidence paragraphs to create a matrix representing the importance of terms in each document.
- *Similarity Calculation*: When a new claim is received, convert it into a TF-IDF vector using the same vectorizer and calculate its cosine similarity against the TF-IDF matrix to find the most relevant evidences.

3. **Construct an Evidence List**:

- *Relevance*: Based on the similarity scores, select the top relevant evidences. This list will be used for further processing and classification.

4. **Concatenate Claim and Evidences**:

- *Integration*: Concatenate the input claim with its corresponding top relevant evidences into a single text block (paragraph). This concatenated text serves as a comprehensive context for the claim.

5. **Word2Vec Model Training and Application**:

- *Model Building*: Build a Word2Vec model from scratch using PyTorch to learn word embeddings from the concatenated text of claims and their relevant evidences.
- *Usage*: The trained Word2Vec model can be used to convert words or phrases from the claims and evidences into vectors, which can then be utilized for various tasks such as classification, clustering, or further similarity measurements.

6. **Classification**:

- *Approach*: Use the embeddings from the Word2Vec model along with additional features (if necessary) to classify the claim into one of four predefined categories.
- *Model Selection*: Depending on the complexity and nature of the classification, choose an appropriate machine learning or deep learning model. This could be a simple logistic regression, a support vector machine, or a more complex neural network.
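Step 5 of the outline is not implemented in this section. As a rough illustration of its first stage, the sketch below generates the (center, context) training pairs that a skip-gram Word2Vec model would consume. It assumes the skip-gram variant; the function name `skipgram_pairs`, the `window` parameter, and the sample tokens are hypothetical, and no PyTorch model is trained here.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical concatenated claim + evidence text, already preprocessed.
tokens = "south_australia expensive electricity retail price".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Each pair would then be mapped to vocabulary indices and fed to an embedding layer; a larger `window` trades topical similarity against syntactic similarity.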

**Considerations for Implementation**:
- *Modularity*: Each step should be encapsulated within its class or function to ensure modularity and ease of maintenance.
- *Scalability*: Design the system to handle increases in data volume efficiently, possibly by optimizing data handling and processing.
- *Extensibility*: Allow for easy updates and modifications, such as adding new preprocessing steps, changing the classification model, or adjusting the number of top evidences retrieved.

## 1. Load and Preprocess Evidence Data

- *Data Structure*: Your dataset, `evidence_df`, contains two columns: the evidence ID and the evidence paragraph (named `index` and `paragraph` once the retriever in section 2 has loaded the file).
- *Objective*: Use all evidence paragraphs to train a TF-IDF model. This model will be used to retrieve the most relevant evidences for a given input claim.
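To make the objective concrete, here is a minimal, library-free sketch of fitting TF-IDF weights on a few evidence paragraphs and ranking them against a claim by cosine similarity. It is an illustration under stated assumptions, not the notebook's implementation: the helper names (`fit_idf`, `tfidf_vector`, `cosine`) and the toy data are hypothetical, it uses unigrams only, and the smoothed IDF formula is one common choice.

```python
import math
from collections import Counter
from typing import Dict, List


def fit_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Term count times IDF; terms unseen during fitting are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical, already-preprocessed evidence paragraphs and claim.
evidence = [
    "south australia state animal".split(),
    "south australia high retail price electricity country".split(),
    "copeville settlement south australia".split(),
]
claim = "south australia expensive electricity world".split()

idf = fit_idf(evidence)
claim_vec = tfidf_vector(claim, idf)
scores = [cosine(claim_vec, tfidf_vector(doc, idf)) for doc in evidence]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Terms that occur in every paragraph ("south", "australia") get the lowest weight, so the paragraph that also shares the rarer term "electricity" ranks first.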

In [1]:
from dataclasses import dataclass
import pandas as pd
import logging
from typing import Optional
from pathlib import Path

# Configure logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger(__name__)

In [2]:
@dataclass
class DataLoader:
    file_path: Path

    def load_data(self) -> Optional[pd.DataFrame]:
        """
        Loads the data from the specified JSON file path using a Path object.
        Attempts to read a JSON file into a pandas DataFrame.
        Logs an error and returns None if the operation fails.
        
        :return: Optional[pd.DataFrame] - A pandas DataFrame if successful, None otherwise.
        """
        try:
            if not self.file_path.exists():
                logger.error(f"The file {self.file_path} does not exist.")
                return None
            
            data = pd.read_json(self.file_path, orient='index')
            logger.info("Data loaded successfully.")
            return data
        except Exception as e:
            # Logged at error level to match the docstring's contract.
            logger.error(f"An error occurred while loading the data from {self.file_path}: {e}")
            return None
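`orient='index'` treats the top-level JSON keys as row labels. The sketch below shows the file layout this implies, assuming the file maps each evidence ID directly to its paragraph string (the IDs and texts here are hypothetical, and the standard `json` module stands in for pandas).

```python
import json

# Hypothetical miniature of the evidence file: IDs as keys, paragraphs as values.
raw = '{"evidence-0": "First paragraph.", "evidence-1": "Second paragraph."}'

# One record per top-level key, mirroring one DataFrame row per evidence ID.
records = [
    {"evidence_id": evidence_id, "evidence_paragraph": paragraph}
    for evidence_id, paragraph in json.loads(raw).items()
]
print(records[0])
```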

## 2. TF-IDF for Evidence Retrieval

Text preprocessing is a critical step in our pipeline, aiming to transform raw text into a more analyzable and meaningful format. This step involves cleaning the text, reducing words to their base or root form, and removing irrelevant characters and words that do not contribute to the semantic meaning of the text.

### Approach
Our preprocessing workflow integrates several techniques to refine the text data:

1. **Contraction Expansion**: Converts contractions (e.g., "isn't" to "is not") to their expanded form to standardize text and improve analysis accuracy.
2. **Lowercasing**: Transforms all text to lowercase to ensure consistency and avoid duplication based on case differences.
3. **Special Characters Removal**: Deletes non-word characters (e.g., punctuation) to focus on the textual content.
4. **Tokenization**: Splits text into individual words or tokens, facilitating further processing like part-of-speech tagging.
5. **Part-of-Speech Tagging**: Identifies the grammatical parts of speech of each word, which helps in lemmatization.
6. **Lemmatization**: Reduces words to their base or dictionary form, considering the word's part-of-speech to ensure that the root word (lemma) is a valid word.
7. **Stop Words Removal**: Eliminates commonly used words (e.g., "the", "is") that usually have little to no semantic value in the context of text analysis.
8. **Named Entity Recognition (NER)**: Identifies and preserves named entities (e.g., "South Australia") as unique tokens. This is crucial for maintaining the specificity of geographical locations, organizations, and individuals in the text.
9. **Contextual Token Support**: Enhances the representation of text by considering the context around important words or named entities. This approach helps in capturing the semantic meaning more effectively.
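A stripped-down sketch of four of these steps (lowercasing, contraction expansion, special-character removal, and stop-word removal) using only the standard library. The stop list and contraction map are tiny hand-picked stand-ins, and POS tagging, lemmatization, and NER are left out; the name `simple_preprocess` is hypothetical.

```python
import re
from typing import List

# Tiny illustrative stop list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "in", "has", "most", "a", "an", "of"}
# Illustrative contraction map; the notebook delegates this to the `contractions` package.
CONTRACTIONS = {"isn't": "is not", "doesn't": "does not"}


def simple_preprocess(text: str) -> List[str]:
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation and brackets
    tokens = text.split()                 # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


print(simple_preprocess("[South Australia] has the most expensive electricity in the world."))
```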

### Implementation
The `preprocess_text` method encapsulates the preprocessing steps, taking a string of text and an index as inputs. The index is intended for logging progress at specified intervals; the version shown here accepts it but does not yet use it, so progress is visible only through the progress bar in `preprocess`.

During the preprocessing, after tokenization and part-of-speech tagging, we perform **Named Entity Recognition (NER)** using NLTK's `ne_chunk`. Named entities are combined into single tokens (e.g., "New York" becomes "new_york", since the text has already been lowercased) and placed ahead of the remaining tokens, which are lemmatized and filtered for stop words.
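The merging step can be illustrated without running a chunker. In this sketch the chunker output is a hand-written stand-in, where a nested list plays the role of one named-entity subtree; `merge_entities` and the `Chunk` alias are hypothetical names, and tokens are lowercased as in the rest of the pipeline.

```python
from typing import List, Union

Chunk = Union[str, List[str]]  # a plain token, or the words of one named entity


def merge_entities(chunks: List[Chunk]) -> List[str]:
    """Join the words of each named entity with underscores."""
    merged = []
    for chunk in chunks:
        if isinstance(chunk, list):
            merged.append("_".join(word.lower() for word in chunk))
        else:
            merged.append(chunk.lower())
    return merged


# Hand-written stand-in for chunker output; no NER model is run here.
chunks = [["New", "York"], "is", "larger", "than", ["South", "Australia"]]
print(merge_entities(chunks))
```

The underscore is a word character in Python regular expressions, so word-based token patterns keep a token such as `south_australia` as a single term.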

Additionally, we incorporate **contextual token support** by examining the context around key terms and named entities. In the code shown here, the closest mechanism is the vectorizer's `ngram_range=(1, 3)` setting, which adds two- and three-word sequences as features alongside single words. This lets the system weigh specific phrases, not just isolated terms, when retrieving evidence.

The `preprocess` method orchestrates the preprocessing of the entire dataset and then fits the TF-IDF vectorizer on the processed paragraphs. It uses the tqdm library to display a progress bar, providing real-time feedback on the preprocessing status.

### Example
Consider the claim: "[South Australia] has the most expensive electricity in the world." During preprocessing:

1. **Contraction Expansion & Lowercasing**: No contractions present; the whole claim is lowercased, so "South Australia" becomes "south australia".
2. **Special Characters Removal**: The square brackets and the final period are removed.
3. **Tokenization & Part-of-Speech Tagging**: Splits the text into tokens and tags them.
4. **Named Entity Recognition (NER)**: When the chunker recognizes "south australia" as a named entity, it is kept as the single token "south_australia".
5. **Lemmatization & Stop Words Removal**: "has", "the", "most", and "in" are removed as stop words; "expensive", "electricity", and "world" are kept, and lemmatization leaves them unchanged.

The preprocessing results in a focused representation of the claim, highlighting the key components and preserving the named entity "South Australia" for precise evidence retrieval.


In [4]:
import re
import numpy as np
import pandas as pd
from dataclasses import dataclass, field
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.chunk import ne_chunk
from nltk.tag import pos_tag
import contractions
import logging
from pathlib import Path
from tqdm.auto import tqdm

nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)  # required by WordNetLemmatizer

True

In [9]:
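@dataclass
class TfidfEvidenceRetriever:
    evidence_path: Path
    # default_factory gives every retriever its own vectorizer and lemmatizer
    # instead of one instance shared through the class definition.
    vectorizer: TfidfVectorizer = field(
        default_factory=lambda: TfidfVectorizer(ngram_range=(1, 3), max_df=0.85, min_df=2))
    tfidf_matrix: object = None  # SciPy sparse matrix, set by preprocess()
    stop_words: set = field(default_factory=lambda: set(stopwords.words('english')))
    similarity_threshold: float = 0.45
    lemmatizer: WordNetLemmatizer = field(default_factory=WordNetLemmatizer)

    def __post_init__(self):
        self.evidence_df = pd.read_json(self.evidence_path, orient='index')
        # Validate before renaming: the old check ran after the rename and could never fail.
        if self.evidence_df.shape[1] != 1:
            logger.error("Evidence file must yield exactly one paragraph column.")
            raise ValueError("Evidence file must yield exactly one paragraph column.")
        self.evidence_df.columns = ['paragraph']
        self.evidence_df.reset_index(inplace=True)  # evidence IDs move to an 'index' column
        self.preprocess()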

    def get_wordnet_pos(self, treebank_tag):
        """Converts treebank POS tags to WordNet POS tags."""
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN  # Default to noun

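    def preprocess_text(self, text: str, index: int) -> str:
        # `index` is accepted for progress logging but is not used yet.
        text = contractions.fix(text.lower())
        text = re.sub(r'[^\w\s]', '', text)
        tokens = word_tokenize(text)
        tagged_tokens = pos_tag(tokens)
        # Note: the text is already lowercased here, and NLTK's chunker tends to
        # rely on capitalization, so it may recognize few entities.
        chunked_nes = ne_chunk(tagged_tokens)
        ne_word_lists = [[w for w, t in ne.leaves()] for ne in chunked_nes if isinstance(ne, nltk.Tree)]
        nes = ["_".join(words) for words in ne_word_lists]
        # Filter on the individual entity words: a joined token such as
        # "south_australia" never equals a single word, so `word not in nes` missed them.
        ne_words = {w for words in ne_word_lists for w in words}
        lemmatized_tokens = [self.lemmatizer.lemmatize(word, self.get_wordnet_pos(tag)) for word, tag in tagged_tokens if word not in self.stop_words and word not in ne_words]
        combined_tokens = nes + lemmatized_tokens
        return ' '.join(combined_tokens)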

    def preprocess(self):
        logger.info("Starting preprocessing of paragraphs.")
        processed_paragraphs = [self.preprocess_text(paragraph, i) for i, paragraph in tqdm(enumerate(self.evidence_df['paragraph']), total=self.evidence_df.shape[0])]
        self.evidence_df['processed_paragraph'] = processed_paragraphs
        logger.info("Vectorizing processed paragraphs.")
        self.tfidf_matrix = self.vectorizer.fit_transform(self.evidence_df['processed_paragraph'])
        logger.info("Preprocessing complete.")

    def find_relevant_evidences(self, query: str) -> list:
        logger.info(f"Finding relevant evidences for the given query: '{query}'")
        processed_query = self.preprocess_text(query, -1)  # -1 index since it's just a single query
        query_tfidf = self.vectorizer.transform([processed_query])
        cosine_similarities = cosine_similarity(query_tfidf, self.tfidf_matrix).flatten()
        
        if cosine_similarities.max() < self.similarity_threshold:
            most_relevant_index = [np.argmax(cosine_similarities)]
            logger.info("No evidences above the threshold. Returning the most relevant evidence.")
            return most_relevant_index
        
        relevant_indices = [index for index, similarity in enumerate(cosine_similarities) if similarity >= self.similarity_threshold]
        logger.info(f"Found {len(relevant_indices)} relevant evidences.")
        return relevant_indices
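`find_relevant_evidences` returns row indices in corpus order. Steps 3 and 4 of the outline (building the evidence list and concatenating it with the claim) are not shown in this section, so the following is only a sketch of how they might look. It differs from the method above in two labeled ways: it ranks by score and caps the list at `top_k`. The function names, the `top_k` parameter, and the sample data are hypothetical.

```python
from typing import List, Sequence


def select_evidence(scores: Sequence[float], threshold: float, top_k: int) -> List[int]:
    """Indices scoring >= threshold, best first, capped at top_k.

    Falls back to the single best index when nothing clears the threshold,
    mirroring the fallback in `find_relevant_evidences`.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = [i for i in ranked if scores[i] >= threshold][:top_k]
    return kept or ranked[:1]


def build_context(claim: str, paragraphs: Sequence[str], indices: Sequence[int]) -> str:
    """Concatenate the claim with the selected evidence paragraphs."""
    return " ".join([claim] + [paragraphs[i] for i in indices])


# Hypothetical scores and paragraphs for illustration only.
paragraphs = ["Evidence A.", "Evidence B.", "Evidence C."]
scores = [0.20, 0.62, 0.48]
indices = select_evidence(scores, threshold=0.45, top_k=2)
context = build_context("Some claim.", paragraphs, indices)
print(context)
```

Capping the list keeps the concatenated context bounded when a low threshold lets many paragraphs through.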

In [None]:
# The retriever loads the evidence file and runs preprocess() in __post_init__,
# so no separate preprocess() call is needed.
retriever = TfidfEvidenceRetriever(Path('../data/evidence.json'))
evidence_df = retriever.evidence_df  # used by the cells below

2024-05-12 20:02:23 - INFO - Starting preprocessing of paragraphs.


  0%|          | 0/1208827 [00:00<?, ?it/s]

In [122]:
# Example claim to retrieve evidence for
claim = "when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia produces 1.3 per cent of this 3 per cent, then no amount of emissions reduction here will have any effect on global climate."
retriever.similarity_threshold=0.5
relevant_indices = retriever.find_relevant_evidences(claim)
pd.DataFrame(evidence_df.iloc[relevant_indices]['paragraph'])

2024-05-12 19:35:03 - INFO - Finding relevant evidences for the given query: when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia produces 1.3 per cent of this 3 per cent, then no amount of emissions reduction here will have any effect on global climate.
2024-05-12 19:35:04 - INFO - Found 2 relevant evidences.


|         | paragraph |
|---------|-----------|
| 78654   | ", as opposed to "per cent". |
| 1140012 | Developing countries with the highest rate of ... |


In [156]:
# Example claim to retrieve evidence for
claim = "[South Australia] has the most expensive electricity in the world."
retriever.similarity_threshold=0.30
relevant_indices = retriever.find_relevant_evidences(claim)
pd.DataFrame(evidence_df.iloc[relevant_indices]['paragraph'])

2024-05-12 19:45:35 - INFO - Finding relevant evidences for the given query: [South Australia] has the most expensive electricity in the world.
2024-05-12 19:45:35 - INFO - Found 2 relevant evidences.


|         | paragraph |
|---------|-----------|
| 995049  | The Attorney-General of South Australia is the... |
| 1095235 | '' For the place in Adelaide, South Australia,... |


In [134]:
list(pd.DataFrame(evidence_df.iloc[relevant_indices]['paragraph'])['paragraph'])

['It is the state animal of South Australia.',
 'It is found in South Australia and Western Australia.',
 "The District Court of South Australia is South Australia 's principal trial court.",
 'Power FM (South Australia), a radio station in South Australia, Australia',
 'The Cabinet of South Australia is the chief policy-making organ of the Government of South Australia.',
 'Copeville is a settlement in South Australia.',
 'Australia',
 'The South Australia Colonisation Act 1834 (4 & 5 Will.',
 "The Attorney-General of South Australia is the member of the Government of South Australia responsible for South Australia 's system of law and justice.",
 "'' For the place in Adelaide, South Australia, see Glynde, South Australia.",
 'Port Vincent, South Australia, Australia',
 'Mitcham, South Australia, a suburb of Adelaide, South Australia']

In [150]:
# Example claim to retrieve evidence for
claim = "when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia produces 1.3 per cent of this 3 per cent, then no amount of emissions reduction here will have any effect on global climate."
retriever.similarity_threshold=0.5
relevant_indices = retriever.find_relevant_evidences(claim)
list(pd.DataFrame(evidence_df.iloc[relevant_indices]['paragraph'])['paragraph'])

2024-05-12 19:40:56 - INFO - Finding relevant evidences for the given query: when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia produces 1.3 per cent of this 3 per cent, then no amount of emissions reduction here will have any effect on global climate.
2024-05-12 19:40:56 - INFO - Found 2 relevant evidences.


['", as opposed to "per cent".',
 'Developing countries with the highest rate of women who have been cut are Somalia (with 98 per cent of women affected), Guinea (96 per cent), Djibouti (93 per cent), Egypt (91 per cent), Eritrea (89 per cent), Mali (89 per cent), Sierra Leone (88 per cent), Sudan (88 per cent), Gambia (76 per cent), Burkina Faso (76 per cent), and Ethiopia (74 per cent).']

In [152]:
list(evidence_df[evidence_df['index'] == 'evidence-67732']['paragraph'])

['[citation needed] South Australia has the highest retail price for electricity in the country.']