# Anonymize WhatsApp messages

I wrote this script to help with another project: analyzing the WhatsApp messages between me and my partner. Since I wanted to upload the analysis, I needed to anonymize the text while keeping the structure intact. That is how this script came about.

This script takes a WhatsApp chat log (.txt file), keeps the date, time, and name on each line as they are, and replaces every character of the message text (including spaces) with "x".

The structure WhatsApp uses is:
> dd.mm.yy, hh:mm - name: message

Here's an example:
> 21.01.24, 10:19 - Friend: Hi, how are you doing? :)

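Everything up to and including the ":" after the name is therefore kept as is, while the rest is masked. Lines that do not start with this header, such as the continuation lines of a multi-line message, are masked in full.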
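To make the split concrete, here is a minimal sketch that applies the same idea to the single example line. The helper name `mask_line` is made up for this illustration and is not part of the script itself.

```python
import re

# Header: "dd.mm.yy, hh:mm - Name:" (up to the first colon after the name)
HEADER = re.compile(r"^(\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2} - .*?:)(.*)$")

def mask_line(line):
    """Keep the header, replace every other character with 'x'."""
    m = HEADER.match(line)
    if m is None:
        # No header found: treat it as a continuation line and mask all of it
        return "x" * len(line)
    return m.group(1) + "x" * len(m.group(2))

example = "21.01.24, 10:19 - Friend: Hi, how are you doing? :)"
print(mask_line(example))
```

The masked line has the same length as the original, so message lengths can still be measured afterwards.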


This way, the data can be analyzed for frequency of writing, author distribution, message length etc. while maintaining privacy.
By default, it takes "chat.txt" as input and outputs "chat_anonymized.txt".
The file I uploaded here was generated using ChatGPT to make it look similar to actual files.
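As a rough illustration of the kind of analysis that still works on the anonymized output, the sketch below counts messages per author and computes the average masked length. The function name `summarize` and the sample lines are invented for this example. It assumes the same header format as above and counts continuation lines towards the previous author's message.

```python
import re
from collections import Counter, defaultdict

# Capture the name and the masked message separately
HEADER = re.compile(r"^\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2} - (.*?):(.*)$")

def summarize(lines):
    """Return {author: (message_count, average_masked_length)}."""
    counts = Counter()
    total_chars = defaultdict(int)
    author = None
    for raw in lines:
        line = raw.rstrip("\n")
        m = HEADER.match(line)
        if m:
            author = m.group(1)
            counts[author] += 1
            total_chars[author] += len(m.group(2))
        elif author is not None:
            # Continuation line: add its length to the current author's total
            total_chars[author] += len(line)
    return {a: (counts[a], total_chars[a] / counts[a]) for a in counts}

# Hypothetical anonymized lines, not taken from a real chat
sample = [
    "21.01.24, 10:19 - Friend:xxxxx\n",
    "21.01.24, 10:20 - Me:xxx\n",
    "xxxx\n",
    "21.01.24, 10:21 - Friend:xxx\n",
]
print(summarize(sample))
```

To run it on a real file, pass an open file handle instead of the list, for example `summarize(open("chat_anonymized.txt", encoding="utf-8"))`. Note that the lengths include the space after the colon, because that space is masked as well.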

If this script is of any use to anyone, go ahead and have fun with it.


import re

# Capture prefix: "dd.mm.yy, hh:mm - Name:"
WHATSAPP_HEADER = re.compile(
    r"^(\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2} - .*?:)(.*)$"
)

def anonymize_chat(input_file, output_file):
    with open(input_file, "r", encoding="utf-8") as f:
        lines = f.readlines()

    with open(output_file, "w", encoding="utf-8") as f:
        for line in lines:
            line = line.rstrip("\n")

            m = WHATSAPP_HEADER.match(line)
            if m:
                prefix = m.group(1)          # keep this
                message = m.group(2)         # anonymize this

                anonymized = "x" * len(message)
                f.write(prefix + anonymized + "\n")
            else:
                # continuation line → mask whole line
                f.write("x" * len(line) + "\n")

anonymize_chat("chat.txt", "chat_anonymized.txt")
