Distinguishing between human-written and AI-generated text using deep learning and natural language processing (NLP).
AI Text Detection is a data science project that explores the ability to distinguish between texts written by humans and those generated by artificial intelligence models.
Using multiple neural architectures — from simple dense networks to Transformer-based encoders — this project aims to identify stylistic and semantic patterns that reveal the authorship of a given text.
Developed as part of the Data Science and Artificial Intelligence degree at Universidad Politécnica de Madrid, this project evaluates the performance, robustness, and generalization capabilities of modern text classification models.
- Develop and evaluate deep learning models for classifying texts by authorship (human vs AI).
- Compare classical vectorization techniques (Bag of Words, TF-IDF) with modern embedding-based representations.
- Assess model robustness when presented with unseen texts.
- Identify the most reliable architecture for real-world AI text detection applications.
The workflow follows a supervised learning approach and includes the following steps:
- Data Preprocessing
  - Cleaning, normalization, and balancing of the dataset (a minimal sketch follows this list).
  - Equal number of human-written and AI-generated samples (from the Kaggle dataset).
- Exploratory Data Analysis
  - Analysis of word frequency, text length distribution, and stylistic features.
  - Extraction of sequence length and vocabulary size for model input configuration.
- Vectorization
  - Bag of Words (BoW): basic token-counting representation.
  - TF-IDF with bigrams: term-importance weighting.
  - Integer Encoding: tokenized representation for deep networks (all three representations are sketched after this list).
- Model Design and Training
  - Four architectures were trained and compared (model sketches follow this list):

  | Model | Description | Vectorization | Accuracy (Test) |
  |-------|-------------|---------------|-----------------|
  | 1️⃣ Simple Dense NN | Basic dense network with ReLU and Sigmoid layers | Bag of Words | 99.75% |
  | 2️⃣ Simple Dense NN | Same as above, using TF-IDF bigrams | TF-IDF | 99.69% |
  | 3️⃣ Transformer Encoder | Embedding + TransformerEncoder (2 heads) | Integer Encoding | 99.66% |
  | 4️⃣ Transformer Encoder + PositionalEmbedding | Adds positional context | Integer Encoding | 99.3% |
- Evaluation
  - Metrics: Accuracy.
  - Loss: Binary Crossentropy.
  - Validation with unseen human and AI-generated texts (e.g., Wikipedia, The Lord of the Rings, CNN articles, academic papers).
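The steps above can be illustrated with short sketches. First, preprocessing and balancing: this assumes the CSV exposes a `text` column and a binary `generated` label (0 = human, 1 = AI), as in the Kaggle dataset; the notebook's exact cleaning rules may differ.

```python
import re

import pandas as pd

# Assumed schema: a "text" column and a binary "generated" label (0 = human, 1 = AI).
df = pd.read_csv("data/AI_Human_reduced.csv")

def clean_text(text: str) -> str:
    """Lowercase, keep letters only, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

df["text"] = df["text"].astype(str).map(clean_text)

# Balance classes by downsampling the majority class to the minority class size.
n_per_class = df["generated"].value_counts().min()
balanced = (
    df.groupby("generated", group_keys=False)
      .apply(lambda g: g.sample(n=n_per_class, random_state=42))
      .sample(frac=1.0, random_state=42)   # shuffle rows
      .reset_index(drop=True)
)
print(balanced["generated"].value_counts())
```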
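Next, the three representations from the Vectorization step. Vocabulary size and sequence length are placeholders here; the notebook derives them during the exploratory analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from tensorflow.keras.layers import TextVectorization

texts = balanced["text"].tolist()                      # cleaned texts from the preprocessing sketch
labels = balanced["generated"].astype("float32").to_numpy()

# 1) Bag of Words: raw token counts.
bow = CountVectorizer(max_features=20_000)             # vocabulary size is a placeholder
X_bow = bow.fit_transform(texts)

# 2) TF-IDF with unigrams and bigrams: term-importance weighting.
tfidf = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)

# 3) Integer encoding for the Transformer models: fixed-length sequences of token ids.
int_vectorizer = TextVectorization(
    max_tokens=20_000,              # vocabulary size (placeholder)
    output_mode="int",
    output_sequence_length=256,     # sequence length (placeholder; taken from the EDA step)
)
int_vectorizer.adapt(texts)
X_int = int_vectorizer(texts)

print(X_bow.shape, X_tfidf.shape, X_int.shape)
```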
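Models 1 and 2 in the table are small dense classifiers fitted on the BoW or TF-IDF matrices. The layer widths below are assumptions, not the notebook's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A small fully connected classifier on top of the BoW vectors (swap in X_tfidf for model 2).
dense_model = tf.keras.Sequential([
    layers.Input(shape=(X_bow.shape[1],)),
    layers.Dense(64, activation="relu"),      # width is an assumption
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),    # P(text is AI-generated)
])
dense_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Sparse scikit-learn matrices should be densified (or batched) before feeding Keras.
dense_model.fit(X_bow.toarray(), labels, validation_split=0.2, epochs=5, batch_size=32)
```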
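Models 3 and 4 pair an embedding with a Transformer encoder. The sketch below approximates model 4 (learned positional embedding, 2 attention heads) using standard Keras layers rather than the notebook's own `TransformerEncoder`/`PositionalEmbedding` classes, and compiles with the loss and metric listed under Evaluation.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN, EMBED_DIM, NUM_HEADS, FF_DIM = 20_000, 256, 64, 2, 64  # placeholder sizes

class PositionalEmbedding(layers.Layer):
    """Token embedding plus a learned position embedding (the extra ingredient in model 4)."""
    def __init__(self, seq_len, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(seq_len, embed_dim)

    def call(self, inputs):
        positions = tf.range(start=0, limit=tf.shape(inputs)[-1], delta=1)
        return self.token_emb(inputs) + self.pos_emb(positions)

inputs = layers.Input(shape=(SEQ_LEN,), dtype="int64")
x = PositionalEmbedding(SEQ_LEN, VOCAB_SIZE, EMBED_DIM)(inputs)

# One encoder block: 2-head self-attention and a feed-forward projection, each with residual + LayerNorm.
attn_out = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)(x, x)
x = layers.LayerNormalization()(x + attn_out)
ff_out = layers.Dense(FF_DIM, activation="relu")(x)
x = layers.LayerNormalization()(x + layers.Dense(EMBED_DIM)(ff_out))

# Pool over the sequence and classify human (0) vs AI (1).
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

transformer_model = tf.keras.Model(inputs, outputs)
transformer_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
transformer_model.summary()
# transformer_model.fit(X_int, labels, validation_split=0.2, epochs=5, batch_size=32)
```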
All models achieved very high accuracy (>99%) on the test set, showing strong learning capacity.
However, when exposed to unseen data, only the Transformer-based models showed consistent and reliable behavior.
Best Performing Model:
- Transformer Encoder with Positional Embedding
- Balanced performance across human and AI-generated samples.
- Demonstrated robustness and generalization in real-world scenarios.
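To probe the best model on new text, one can load the saved `models/model_4.keras` and run a prediction. This sketch assumes the model expects integer-encoded sequences from the same fitted `TextVectorization` layer used in training; if the notebook's custom layers are needed, they must be passed through `custom_objects` when loading.

```python
import tensorflow as tf

# Hypothetical unseen paragraph; replace with any text to classify.
sample = "The rapid progress of generative models has changed how readers judge online text."

# model_4.keras is the saved Transformer with positional embedding (see the repo tree below).
best_model = tf.keras.models.load_model("models/model_4.keras")

encoded = int_vectorizer([sample])               # same fitted TextVectorization as in training
prob_ai = float(best_model.predict(encoded, verbose=0)[0][0])

label = "AI-generated" if prob_ai >= 0.5 else "Human-written"
print(f"{label} (P(AI) = {prob_ai:.3f})")
```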
Performance summary:
- Dense models → High accuracy, but limited semantic understanding.
- Transformer models → Slightly lower accuracy on test data, but significantly better generalization.
- Transformer architectures outperform traditional dense models in detecting text authorship due to their ability to capture contextual and positional semantics.
- The Positional Embedding Transformer emerged as the most robust model overall.
- Future improvements could include:
  - Expanding vocabulary size.
  - Keeping stopwords during training.
  - Adding dense or MultiHeadAttention layers.
  - Generating synthetic AI texts to further balance the dataset.
- Python 3.10
- TensorFlow / Keras
- NumPy, Pandas, Scikit-learn
- Matplotlib, Seaborn
- Jupyter Notebook
```
AI-Text-Detection/
│
├── data/
│   ├── AI_Human_reduced.csv
│   └── README.md
│
├── docs/
│   └── Memoria.pdf
│
├── models/
│   ├── model_1_history.pkl
│   ├── model_1.keras
│   ├── model_2_history.pkl
│   ├── model_2.keras
│   ├── model_3_history.pkl
│   ├── model_3.keras
│   ├── model_4_history.pkl
│   └── model_4.keras
│
├── notebooks/
│   ├── AI_vs_Human.ipynb
│   └── AI_vs_Human.pdf
│
├── src/
│   └── create_reduced_dataset.py
│
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt
```
📂 The data/ folder contains a reduced, balanced version of the original dataset (from Kaggle) for reproducibility. The complete dataset (~ 1 GB) can be obtained here.
- Vaswani et al. (2017). Attention Is All You Need – arXiv.
- Robertson, S. (2004). Understanding Inverse Document Frequency: On Theoretical Arguments for IDF.
- Gerami, S. (2024). AI vs Human Text Dataset. Kaggle.
- Almeida, F., & Xexéo, G. (2023). Word Embeddings: A Survey – arXiv.
- Tolkien, J. R. R. (2005). The Lord of the Rings, HarperCollins.
- Collinson, S. (2015). CNN: How Donald Trump Took the Republican Party by Storm.
Raúl Andrino
Universidad Politécnica de Madrid
📧 Contact: raulandrino90@gmail.com
📘 Project for: Data Science and Artificial Intelligence Degree
This project is licensed under the MIT License — feel free to use, modify, and share for educational or research purposes.
⭐ If you find this project useful, consider giving it a star on GitHub!