# Data Preprocessing Dev

1. **Load and Preprocess Evidence Data**:

- *Data Structure*: Your dataset, `evidence_df`, contains two columns: `evidence_id` and `evidence_paragraph`.
- *Objective*: Fit a TF-IDF model on all evidence paragraphs. The fitted model is then used to retrieve the evidence most relevant to a given input claim.

2. **TF-IDF for Evidence Retrieval**:

- *Preprocessing*: Clean and preprocess the evidence paragraphs to optimize them for TF-IDF vectorization (e.g., removing stopwords, punctuation, and normalizing text).
- *Vectorization*: Apply TF-IDF vectorization to the preprocessed evidence paragraphs to create a matrix representing the importance of terms in each document.
- *Similarity Calculation*: When a new claim is received, convert it into a TF-IDF vector using the same fitted vectorizer, then compute its cosine similarity against every row of the TF-IDF matrix to find the most relevant evidence paragraphs.
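The three bullets above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `STOPWORDS`, `preprocess`, `TfidfRetriever`, and the sample evidence are hypothetical names and data, and the smoothed IDF formula is one common choice. In practice a library vectorizer such as scikit-learn's `TfidfVectorizer` would likely replace the hand-written parts.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be longer).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase, drop punctuation, and remove stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


class TfidfRetriever:
    """Fits IDF weights on evidence paragraphs and scores claims against them."""

    def __init__(self, evidence):
        # evidence: dict mapping evidence_id -> evidence_paragraph
        self.ids = list(evidence)
        docs = [preprocess(evidence[i]) for i in self.ids]
        n_docs = len(docs)
        doc_freq = Counter(term for doc in docs for term in set(doc))
        # Smoothed IDF, so every known term gets a positive weight.
        self.idf = {
            t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()
        }
        self.doc_vectors = [self._vectorize(doc) for doc in docs]

    def _vectorize(self, tokens):
        """Return an L2-normalized sparse TF-IDF vector as a dict."""
        counts = Counter(t for t in tokens if t in self.idf)
        vec = {t: c * self.idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def score(self, claim):
        """Cosine similarity between the claim and every evidence paragraph."""
        q = self._vectorize(preprocess(claim))
        # Vectors are unit length, so the dot product equals the cosine.
        return {
            eid: sum(w * doc.get(t, 0.0) for t, w in q.items())
            for eid, doc in zip(self.ids, self.doc_vectors)
        }


evidence = {
    "evidence-0": "Global sea levels are rising as ice sheets melt.",
    "evidence-1": "Bananas are a good source of potassium.",
    "evidence-2": "Melting ice sheets contribute to rising sea levels.",
}
retriever = TfidfRetriever(evidence)
scores = retriever.score("Sea levels rise because ice sheets are melting.")
```

Note that the claim is vectorized with the IDF weights learned from the evidence, so claim terms that never appear in the evidence are simply ignored.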

3. **Construct an Evidence List**:

- *Relevance*: Based on the similarity scores, select the top-k most relevant evidence paragraphs. This list is used for further processing and classification.
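As a rough sketch of the selection step, assuming the similarity scores are available as a dict from evidence ID to score: `top_k_evidence`, its `k` and `min_score` parameters, and the sample scores are all hypothetical.

```python
import heapq


def top_k_evidence(scores, k=3, min_score=0.0):
    """Return up to k (evidence_id, score) pairs, best first, above min_score."""
    candidates = [(eid, s) for eid, s in scores.items() if s > min_score]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Made-up similarity scores for illustration.
scores = {"evidence-0": 0.52, "evidence-1": 0.0, "evidence-2": 0.82, "evidence-3": 0.10}
top = top_k_evidence(scores, k=2)
```

Filtering on `min_score` keeps evidence with zero similarity out of the list even when fewer than `k` candidates remain.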

4. **Concatenate Claim and Evidences**:

- *Integration*: Concatenate the input claim with its top relevant evidence paragraphs into a single text block (paragraph). This concatenated text serves as the context for the claim.
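One possible shape for the concatenation step is shown here. `build_context` and its `separator` parameter are hypothetical; a plain space is assumed as the default so the result reads as one paragraph, but an explicit marker could be passed instead.

```python
def build_context(claim, evidence_texts, separator=" "):
    """Join the claim and its retrieved evidence into one text block."""
    parts = [claim.strip()]
    # Skip evidence entries that are empty or whitespace only.
    parts += [text.strip() for text in evidence_texts if text.strip()]
    return separator.join(parts)


context = build_context(
    "Sea levels are rising.",
    ["Melting ice sheets contribute to rising sea levels.", "   "],
)
```

Putting the claim first keeps its position fixed regardless of how many evidence paragraphs follow.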

5. **Word2Vec Model Training and Application**:

- *Model Building*: Build a Word2Vec model from scratch using PyTorch to learn word embeddings from the concatenated text of claims and their relevant evidence.
- *Usage*: The trained Word2Vec model converts words or phrases from the claims and evidence into vectors, which can then be used for tasks such as classification, clustering, or further similarity measurements.
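Before any model can be trained, the concatenated text has to be turned into training examples. The sketch below covers only that data-preparation step and assumes the skip-gram variant of Word2Vec, which the outline does not specify. `build_vocab`, `skipgram_pairs`, and the toy corpus are hypothetical, and the PyTorch model that would consume the pairs is deliberately left out.

```python
from collections import Counter


def build_vocab(token_lists, min_count=1):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: idx for idx, tok in enumerate(kept)}


def skipgram_pairs(tokens, vocab, window=2):
    """Yield (center_index, context_index) pairs within the window."""
    ids = [vocab[t] for t in tokens if t in vocab]
    for i, center in enumerate(ids):
        lo, hi = max(0, i - window), min(len(ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, ids[j]


# Toy tokenized corpus standing in for the concatenated claim and evidence text.
corpus = [["sea", "levels", "are", "rising"], ["ice", "sheets", "are", "melting"]]
vocab = build_vocab(corpus)
pairs = list(skipgram_pairs(corpus[0], vocab, window=1))
```

Each pair is one training example: the model is asked to predict the context word from the center word, and the learned input weights become the word embeddings.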

6. **Classification**:

- *Approach*: Use the embeddings from the Word2Vec model along with additional features (if necessary) to classify the claim into one of four predefined categories.
- *Model Selection*: Depending on the complexity and nature of the classification, choose an appropriate machine learning or deep learning model. This could be a simple logistic regression, a support vector machine, or a more complex neural network.
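Whichever classifier is chosen, it needs a fixed-length input per claim. Averaging the word vectors (mean pooling) is one simple way to get there; it is an assumption of this sketch, not something the outline prescribes. `mean_pool` and the toy embeddings are hypothetical stand-ins for the trained Word2Vec vectors.

```python
def mean_pool(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Toy 3-dimensional embeddings, standing in for trained Word2Vec vectors.
embeddings = {
    "sea": [1.0, 0.0, 2.0],
    "ice": [3.0, 2.0, 0.0],
}
features = mean_pool(["sea", "ice", "unknown"], embeddings, dim=3)
```

The resulting vector can be passed as-is to a logistic regression, SVM, or neural network, optionally concatenated with additional features.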

**Considerations for Implementation**:
- *Modularity*: Each step should be encapsulated within its own class or function to ensure modularity and ease of maintenance.
- *Scalability*: Design the system to handle growing data volumes efficiently, for example by keeping the TF-IDF matrix sparse and processing claims in batches.
- *Extensibility*: Allow for easy updates and modifications, such as adding new preprocessing steps, changing the classification model, or adjusting the number of top evidence paragraphs retrieved.
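To make the modularity and extensibility points concrete, here is one possible way to chain independent steps. The `Pipeline` class and the step names are hypothetical; the real steps would be the retrieval, concatenation, embedding, and classification components described above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Pipeline:
    """Runs named steps in order; each step takes a value and returns a new one."""

    steps: List[Tuple[str, Callable[[Any], Any]]] = field(default_factory=list)

    def add(self, name, func):
        self.steps.append((name, func))
        return self  # returning self allows chained calls

    def run(self, value):
        for _name, func in self.steps:
            value = func(value)
        return value


# Two trivial steps, used only to show how steps are registered and run.
pipeline = Pipeline().add("lowercase", str.lower).add("tokenize", str.split)
result = pipeline.run("Sea Levels Are Rising")
```

Because each step is just a callable, swapping the classifier or inserting an extra preprocessing step only changes how the pipeline is assembled, not the other steps.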

## 1. Load and Preprocess Evidence Data

import pandas as pd
from pathlib import Path
from typing import Optional, Union

class DataLoader:
    def __init__(self, file_path: Union[str, Path]):
        """
        Initializes the DataLoader with the path to the dataset.
        :param file_path: str or Path, path to the dataset in JSON format.
        """
        self.file_path = file_path

    def load_data(self) -> Optional[pd.DataFrame]:
        """
        Loads the data from the specified JSON file path.
        :return: DataFrame with the loaded data, or None if loading failed.
        """
        try:
            data = pd.read_json(self.file_path)
            print("Data loaded successfully.")
            return data
        except Exception as e:
            # Report the problem and return None explicitly so callers can check for it.
            print(f"An error occurred while loading the data: {e}")
            return None