<a href="https://colab.research.google.com/github/tasinfrancesco/Practical_ML_PSL/blob/main/NB10_Starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab Work 10 : NLP Basics

This notebook builds on the tenth lecture of Foundations of Machine Learning. We'll focus on some *traditional* NLP techniques, meaning those that do not rely on any *transformer* architecture.

Important note: the steps shown here are not always the most efficient or the most "industry-approved." Their main purpose is pedagogical. So don't panic if something looks suboptimal: it's meant to be.

If you have questions (theoretical or practical), don't hesitate to bug your lecturer.

We will try to predict whether a tweet was written by Donald Trump (before his account was banned) or by an AI. To build this dataset, we started from an existing collection of Donald Trump's tweets and manually asked several models to write imitations. More details on how the dataset was made are given in a separate notebook.
Let's load the dataset.

In [3]:
import pandas as pd

df = pd.read_csv("Donald_or_AI_train.csv")
df.head()

Tweets can be challenging for NLP techniques:
* They are short snippets of text: we kept only tweets below 150 characters
* They have Twitter/X specifics: the "@" and "#" characters can carry meaning
* They may contain emojis

We will first focus on some cleaning before diving into the modelling.

## Data cleaning

To decide which cleaning steps to apply, we suggest looking at some real and fake tweets first.

In [2]:
from random import sample

# Label human-written tweets (no model name) as "Original".
df["writer"] = df["model"].fillna("Original")

# Print ten randomly chosen tweets along with their writer.
indexes = sample(range(df.shape[0]), k=10)
for index in indexes:
    tweet = df["content"].iloc[index]
    writer = df["writer"].iloc[index]
    print(f"[{writer}] {tweet}")
    print("-" * 25)

After several runs of the cell above, we can see some points to clean:
* Some tweets end with a "\n" character
* Some tweets are entirely wrapped in double quotes
* Some tweets start with ". " followed by a quote, which is unnecessary to keep
* Some tweets contain double spaces, which are unnecessary to keep

Besides these formatting issues, we also note that the deepseek-r1 model wrote very long tweets. Let's display one:

In [None]:
print(df.loc[df["model"] == "deepseek-r1:1.5b", "content"].values[1])

The DeepSeek-R1 model is a **reasoning model**, meaning it *thinks* before answering. The only part we need here is the part outside of the thinking process.

**Task** : Given the discussion above, write a `clean_tweet` function. It should also lowercase all characters.
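As a starting point, here is a minimal sketch of what such a function could look like. It is not the reference solution. It assumes that the reasoning part of a DeepSeek-R1 answer is wrapped in `<think>...</think>` tags, which should be checked against the tweet printed above, and the order of the steps is one choice among several.

```python
import re


def clean_tweet(text: str) -> str:
    """Hypothetical cleaning sketch following the points listed above."""
    # Assumption: the model's reasoning is wrapped in <think>...</think> tags.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Remove surrounding whitespace, including a trailing "\n".
    text = text.strip()
    # Drop a leading ". " when it comes right before a double quote.
    text = re.sub(r'^\.\s+(?=")', "", text)
    # Remove double quotes that wrap the whole tweet.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    # Collapse any run of whitespace (double spaces, newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Lowercase everything, as requested in the task.
    return text.lower()
```

It can then be applied column-wise, for example with `df["content"].apply(clean_tweet)`, storing the result in a new column so that the raw text stays available.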

Now that we have *cleaner* tweets to work on, we need to build features from them.

## Exploration and feature engineering

**Task** : Create the following columns. Only the first one is built on the raw tweets; the others shall be derived from the cleaned version.
* `uppercase_ratio` : the proportion of uppercase characters in the whole text. The `isupper` string method may help.
* `character_count` : the number of characters in the text
* `word_count` : the number of words in the text
* `avg_word_length` : the average length of the words in the tweet
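The four columns above could be computed along these lines. This is only a sketch: the helper names and the `clean_content` column name are hypothetical, words are simply split on whitespace, and the two example tweets are made up for illustration.

```python
import pandas as pd


def uppercase_ratio(text: str) -> float:
    """Share of uppercase characters among all characters (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(char.isupper() for char in text) / len(text)


def add_text_features(frame: pd.DataFrame,
                      raw_col: str = "content",
                      clean_col: str = "clean_content") -> pd.DataFrame:
    """Return a copy of `frame` with the four statistics added.

    `clean_col` is a hypothetical column holding the cleaned tweets.
    """
    frame = frame.copy()
    # Built on the raw text, since the cleaned text is lowercased.
    frame["uppercase_ratio"] = frame[raw_col].apply(uppercase_ratio)
    frame["character_count"] = frame[clean_col].str.len()
    # Assumption: a "word" is any whitespace-separated token.
    words = frame[clean_col].str.split()
    frame["word_count"] = words.str.len()
    frame["avg_word_length"] = words.apply(
        lambda tokens: sum(len(token) for token in tokens) / len(tokens) if tokens else 0.0
    )
    return frame


# Made-up examples, not taken from the dataset.
toy = pd.DataFrame({
    "content": ["MAKE America Great!", "so sad"],
    "clean_content": ["make america great!", "so sad"],
})
features = add_text_features(toy)
print(features[["uppercase_ratio", "character_count", "word_count", "avg_word_length"]])
```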

We would like to see whether the indicators we just built already help identify AI-generated tweets.

**Task** : Using seaborn's [`histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html) function with the appropriate parameters, explore the columns. We shall use the *hue* parameter with either the target column `generated` or the `writer`.

## First modelling

With so little work, can we already perform a classification?

**Task** : Train a [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) with a [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) in a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). We shall use the [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function to measure performance. Here, we'll use the [`f1_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) metric, so we will probably need the [`make_scorer`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) function.
Bonus: use a [`TunedThresholdClassifierCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TunedThresholdClassifierCV.html).
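The pieces named in the task fit together roughly as follows. This sketch runs on synthetic numbers standing in for the four numeric columns, so the printed score says nothing about the real dataset; the step names and `cv=5` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four numeric features (not the real data).
rng = np.random.default_rng(0)
X_num = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 4)),
    rng.normal(1.5, 1.0, size=(50, 4)),
])
y = np.array([0] * 50 + [1] * 50)

numeric_model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# The scaler is refit on each training fold, which avoids leaking
# statistics from the validation fold.
scores = cross_val_score(
    numeric_model, X_num, y, cv=5, scoring=make_scorer(f1_score)
)
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```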

Performance is already quite good without using the text *directly*.

## NLP modelling

But it should improve once we use the text itself.

**Task** : Still using a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and the [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function, now use the [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class with English stop words and display the results.
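A text-only pipeline could be assembled as in the sketch below. The eight sentences are invented placeholders, not tweets from the dataset, and `cv=2` is only there because the toy sample is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Invented placeholder texts (already cleaned and lowercased) and labels.
texts = [
    "great rally tonight in ohio",
    "the media is so unfair, sad",
    "big win for our country today",
    "thank you florida, amazing crowd",
    "our economy is doing tremendously well, folks",
    "we are making progress every single day",
    "nobody builds better walls, believe me",
    "tremendous support from wonderful people everywhere",
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

text_model = Pipeline([
    # stop_words="english" uses scikit-learn's built-in English list.
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# The vectorizer consumes the raw strings directly, so X is a 1D list.
scores = cross_val_score(text_model, texts, y, cv=2, scoring="f1")
print(scores)
```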

That is better! But we could expect stronger performance if we tuned the vectorizer a bit.

**Task** : Using the [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class, find better parameters for the vectorizer.
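Inside a pipeline, vectorizer parameters are addressed with the `<step name>__<parameter>` syntax. The sketch below shows the mechanics on invented placeholder texts; the step name `tfidf` and the grid values are illustrative assumptions, and a useful grid for the real data will likely look different.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented placeholder texts and labels, not taken from the dataset.
texts = [
    "great rally tonight in ohio",
    "the media is so unfair, sad",
    "big win for our country today",
    "thank you florida, amazing crowd",
    "our economy is doing tremendously well, folks",
    "we are making progress every single day",
    "nobody builds better walls, believe me",
    "tremendous support from wonderful people everywhere",
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

text_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: keys follow the "<step>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [False, True],
    "tfidf__stop_words": ["english", None],
}

search = GridSearchCV(text_model, param_grid, scoring="f1", cv=2)
search.fit(texts, y)
print(search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the data with the selected parameters.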

However, this time we only used the words, not the statistics we computed earlier.

## Third modelling : combining approaches

**Task** : Using the [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer), rewrite the pipeline so that it uses both numeric and text features.

As the [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) expects a 1D array, we first need a [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) that flattens the input using the `squeeze` method.
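The combination described above might be wired as in the following sketch. The column name `clean_content`, the helper `squeeze_column`, and the toy tweets are all hypothetical; the numeric columns are recomputed inline only to keep the example self-contained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Invented placeholder tweets, not taken from the dataset.
toy = pd.DataFrame({
    "content": [
        "GREAT rally tonight in Ohio!",
        "The media is so unfair, SAD!",
        "Big win for our country today",
        "Thank you Florida, amazing crowd",
        "Our economy is doing tremendously well, folks",
        "We are making progress every single day",
        "Nobody builds better walls, believe me",
        "Tremendous support from wonderful people everywhere",
    ],
    "generated": [0, 0, 0, 0, 1, 1, 1, 1],
})
toy["clean_content"] = toy["content"].str.lower()
toy["uppercase_ratio"] = toy["content"].apply(
    lambda text: sum(char.isupper() for char in text) / len(text)
)
toy["character_count"] = toy["clean_content"].str.len()
toy["word_count"] = toy["clean_content"].str.split().str.len()
toy["avg_word_length"] = (
    toy["clean_content"].str.replace(" ", "").str.len() / toy["word_count"]
)

numeric_cols = ["uppercase_ratio", "character_count", "word_count", "avg_word_length"]
text_cols = ["clean_content"]  # a list, so a one-column DataFrame is passed on


def squeeze_column(frame: pd.DataFrame) -> pd.Series:
    """Turn a one-column DataFrame into the 1D Series the vectorizer expects."""
    return frame.squeeze(axis=1)


text_branch = Pipeline([
    ("flatten", FunctionTransformer(squeeze_column)),
    ("tfidf", TfidfVectorizer(stop_words="english")),
])

preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), numeric_cols),
    ("text", text_branch, text_cols),
])

combined_model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

combined_model.fit(toy, toy["generated"])
predictions = combined_model.predict(toy)
print(predictions)
```

The whole object can be passed to `cross_val_score` or `GridSearchCV` like the earlier pipelines; vectorizer parameters are then reached through nested names such as `preprocess__text__tfidf__ngram_range`.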

It is better! Time to fit it, then use the test set and submit to the [Kaggle competition](https://www.kaggle.com/t/db2bae0e9d814baa96a0468f021cd3f2).

**Task** : Rewrite your feature engineering pipeline and use it to submit.

Now it is up to you to improve the performance! Here are some guidelines:
1. Make the dataset *cleaner* : there is probably still some work to do
2. Build better features : more useful statistics can be extracted
3. Find the most suitable model : note that we didn't tune the classifier, only the vectorizer...