

**Theodoros Lambrou**

This notebook explains the utility functions I contributed to `utils.py` for this assignment, in addition to creating the project structure and building the baseline pipeline. These functions handle basic preprocessing, lexical similarity, early semantic similarity, and data splitting.
They formed the foundation for preprocessing and baseline modeling in this assignment, and were designed for simplicity and interpretability, in line with the course guidelines.

---

### `clean_text(text)`
Cleans raw text by lowercasing, removing punctuation, and normalizing whitespace. This is essential for ensuring that downstream NLP features aren't affected by superficial formatting.

**Example:**
```python
clean_text("  This is an EXAMPLE !!  ")  # Output: 'this is an example'
```
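
For illustration, the three steps described above can be sketched as follows. This is a minimal stand-in, not the actual `utils.py` code: the name `clean_text_sketch` is hypothetical, and it assumes "punctuation" means the ASCII characters in `string.punctuation`.

```python
import re
import string


def clean_text_sketch(text):
    """Lowercase, strip punctuation, and collapse whitespace (illustrative only)."""
    text = str(text).lower()
    # Assumption: "punctuation" means the ASCII set in string.punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into one space and trim both ends
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_sketch("  This is an EXAMPLE !!  "))  # prints: this is an example
```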

---

### `jaccard_similarity(q1, q2)`
Calculates the Jaccard similarity between the sets of words in two questions: the number of shared words divided by the number of distinct words across both questions. This metric measures lexical overlap, a simple and interpretable feature.

**Example:**
```python
jaccard_similarity("what is ai", "what is artificial intelligence")
```
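
To make the metric concrete, here is a small sketch of word-set Jaccard similarity. The name `jaccard_sketch`, whitespace tokenization, and the choice to return 0.0 for two empty questions are assumptions for illustration and may differ from the version in `utils.py`.

```python
def jaccard_sketch(q1, q2):
    """Word-set Jaccard similarity: |A intersect B| / |A union B| (illustrative only)."""
    words1, words2 = set(q1.split()), set(q2.split())
    union = words1 | words2
    if not union:
        # Assumption: two empty questions score 0.0 rather than raising an error
        return 0.0
    return len(words1 & words2) / len(union)


# Shared words: {what, is}; distinct words overall: 5, so the score is 2 / 5
print(jaccard_sketch("what is ai", "what is artificial intelligence"))  # prints: 0.4
```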

---

### `extract_basic_features(df)`
Takes a DataFrame of question pairs, applies `clean_text()` to both questions, and computes their Jaccard similarity. Returns a new DataFrame with a single column: `'jaccard'`.

This was used to build our baseline logistic regression model.
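
A rough sketch of how such a feature extractor might look is shown below. The column names `question1` and `question2` are assumptions (they are not stated above), and the helpers `_clean` and `_jaccard` are simplified stand-ins for `clean_text()` and `jaccard_similarity()`.

```python
import re
import string

import pandas as pd


def _clean(text):
    # Stand-in for clean_text(): lowercase, drop ASCII punctuation, normalize spaces
    text = str(text).lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def _jaccard(q1, q2):
    # Stand-in for jaccard_similarity(): overlap of whitespace-separated word sets
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def extract_basic_features_sketch(df, col1="question1", col2="question2"):
    """Return a one-column DataFrame ('jaccard') aligned with df's index."""
    scores = [_jaccard(_clean(q1), _clean(q2)) for q1, q2 in zip(df[col1], df[col2])]
    return pd.DataFrame({"jaccard": scores}, index=df.index)


pairs = pd.DataFrame({
    "question1": ["What is AI?", "How do I cook rice?"],
    "question2": ["what is ai", "Where is Paris?"],
})
features = extract_basic_features_sketch(pairs)
print(features)
```

Keeping the output as a DataFrame with a named column means it can be passed directly as the feature matrix to a scikit-learn style classifier.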

---

### `tfidf_cosine_similarity(q1, q2, vectorizer=None)`
Computes cosine similarity between two questions using TF-IDF vectors. Optionally accepts a shared vectorizer.

Used to capture an early form of semantic similarity based on token frequency and importance.
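
The sketch below shows one way this could work with scikit-learn. It is an assumption about the behavior, not the actual implementation: in particular, fitting a fresh `TfidfVectorizer` on just the two questions when no vectorizer is passed, and expecting a passed vectorizer to be already fitted, are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_cosine_sketch(q1, q2, vectorizer=None):
    """Cosine similarity of the TF-IDF vectors of two questions (illustrative only)."""
    if vectorizer is None:
        # Assumption: with no shared vectorizer, fit a fresh one on this pair only
        vectors = TfidfVectorizer().fit_transform([q1, q2])
    else:
        # Assumption: a shared vectorizer was already fitted on a larger corpus
        vectors = vectorizer.transform([q1, q2])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])


# Per-pair fit: IDF weights come from these two questions alone
print(tfidf_cosine_sketch("what is ai", "what is artificial intelligence"))

# Shared vectorizer: IDF weights come from the whole (toy) corpus
corpus = ["what is ai", "what is artificial intelligence", "how do i cook rice"]
shared = TfidfVectorizer().fit(corpus)
print(tfidf_cosine_sketch(corpus[0], corpus[1], vectorizer=shared))
```

Passing a shared vectorizer makes scores comparable across pairs, since every pair is weighted with the same IDF values.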

---

### `load_and_split_data()`
Implements the required data split logic from the assignment guide, with one addition:

- If no path is given, it defaults to: `~/Datasets/QuoraQuestionPairs/quora_data.csv`
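
The default-path behavior could be handled along these lines. This sketch covers only the path fallback, not the split logic itself, and the names `DEFAULT_DATA_PATH` and `resolve_data_path` are hypothetical.

```python
from pathlib import Path

# Default location taken from the note above
DEFAULT_DATA_PATH = "~/Datasets/QuoraQuestionPairs/quora_data.csv"


def resolve_data_path(path=None):
    """Return the given path, or the default dataset location, with ~ expanded."""
    chosen = path if path is not None else DEFAULT_DATA_PATH
    # expanduser() turns a leading "~" into the current user's home directory
    return Path(chosen).expanduser()


print(resolve_data_path())                 # default location under the home directory
print(resolve_data_path("data/quora.csv")) # explicit path is used as given
```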

