<a href="https://colab.research.google.com/github/vishwarajanand/pistoBot/blob/master/pistoBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🤖 pistoBot**

> Create an AI that tries to chat like you.<br>
> by [Simone Guardati](https://www.linkedin.com/in/simone-guardati/)


## 🥜 In a nutshell
1. Get your WhatsApp and Telegram data
2. Parse it into an ML-ready dataset
3. Train a GPT-2 model
4. Chat with the model

**🔗 Resources**
- Chat parser - [github](https://github.com/GuardatiSimone/messaging-chat-parser)
- pistoBot - [github](https://github.com/GuardatiSimone/pistoBot)
- pistoBot website - [link](https://guardatisimone.github.io/pistoBot-website/)

<br>

**⚠️ Warning**
- It's always better not to run random scripts on personal information (like personal chat messages).
- I guarantee there is no hidden agenda, but you can always check (and use) the source code directly: <br>
[messaging-chat-parser](https://github.com/GuardatiSimone/messaging-chat-parser) and [pistobot](https://github.com/GuardatiSimone/pistoBot)


------
## 0️⃣ Init

In [None]:
!nvidia-smi # a P100 GPU is suggested

In [None]:
import os
!git clone --quiet https://github.com/GuardatiSimone/messaging-chat-parser.git
!pip install -q -r messaging-chat-parser/requirements.txt
!git clone --quiet https://github.com/GuardatiSimone/pistoBot.git

-----
## 1️⃣ Get the data
> Export your chats from WhatsApp and Telegram and upload them to the `./messaging-chat-parser/data/chat_raw/` folder of the notebook


- **WhatsApp**
    - _.txt_ files exported from one or more chats - [how](https://faq.whatsapp.com/en/android/23756533/) (under the "Export chat history" section)
        - don't export images
        - don't export group chats
    - place all txt files in `./messaging-chat-parser/data/chat_raw/whatsapp/*.txt`
- **Telegram** 
    - _.json_ file with the Telegram dump - [how](https://telegram.org/blog/export-and-more)
        - don't export images
        - choose "machine-readable JSON"
    - copy the JSON file to `./messaging-chat-parser/data/chat_raw/telegram/telegram_dump.json`, renaming it as needed
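
Before parsing, it can help to confirm that the uploads landed where the list above says they should. The following is a minimal sketch, not part of the parser: the helper name `find_raw_chats` is hypothetical, and it only assumes the folder layout described above.

```python
from pathlib import Path

# Folder layout described above; the helper itself is a hypothetical convenience.
RAW_DIR = Path("./messaging-chat-parser/data/chat_raw")


def find_raw_chats(raw_dir=RAW_DIR):
    """Return (list of WhatsApp .txt exports, Telegram dump path or None)."""
    raw_dir = Path(raw_dir)
    whatsapp_dir = raw_dir / "whatsapp"
    telegram_dump = raw_dir / "telegram" / "telegram_dump.json"

    # Only .txt files count as WhatsApp exports; images and other files are ignored.
    whatsapp_files = sorted(whatsapp_dir.glob("*.txt")) if whatsapp_dir.is_dir() else []
    if not telegram_dump.is_file():
        telegram_dump = None
    return whatsapp_files, telegram_dump


whatsapp_files, telegram_dump = find_raw_chats()
print(f"WhatsApp exports found: {len(whatsapp_files)}")
print(f"Telegram dump found: {telegram_dump is not None}")
```

If either check comes back empty, re-upload the files before running the parsing cells.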


------

## 2️⃣ Parse the data

**WhatsApp**
- Set the `whatsapp_user_name` variable
    - Take it from any line of an exported WhatsApp chat. 
    <br> e.g. from the line: <br> 
     _12/12/19, 08:40 - `whatsapp_user_name`: bla bla bla_ 
- Datetime:
    - WhatsApp and Telegram format datetimes differently.
    - The two formats are listed in the next cell with their Italian default values. If your data uses different formats, change those values accordingly.
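
To check both settings against your own export before parsing, you can split one line by hand. This is a rough sketch, not the parser's actual logic: the regular expression assumes the `<datetime> - <user name>: <message>` layout shown in the example line, and `Mario Rossi` is a made-up user name.

```python
import re
from datetime import datetime

# Default (Italian) formats used in this notebook.
whatsapp_datetime_format = "%d/%m/%y, %H:%M"
telegram_datetime_format = "%Y-%m-%dT%H:%M:%S"

# Assumed layout of an exported WhatsApp line: "<datetime> - <user name>: <message>"
LINE_PATTERN = re.compile(r"^(?P<when>.+?) - (?P<user>[^:]+): (?P<text>.*)$")


def split_whatsapp_line(line, time_format=whatsapp_datetime_format):
    """Return (datetime, user name, text), or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    try:
        when = datetime.strptime(match.group("when"), time_format)
    except ValueError:
        return None  # the timestamp does not follow time_format
    return when, match.group("user"), match.group("text")


# "Mario Rossi" is a hypothetical user name used only for illustration.
print(split_whatsapp_line("12/12/19, 08:40 - Mario Rossi: bla bla bla"))
print(datetime.strptime("2019-12-12T08:40:00", telegram_datetime_format))
```

If the first call prints `None` for a line copied from your export, the datetime format most likely needs adjusting.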
        

In [None]:
whatsapp_user_name = None
whatsapp_datetime_format = "%d/%m/%y, %H:%M"
telegram_datetime_format = "%Y-%m-%dT%H:%M:%S"

In [None]:
# Whatsapp
print("> [WHATSAPP] start parsing...")
assert whatsapp_user_name is not None, "[!] WhatsApp user name not set"
!cd messaging-chat-parser && python ./src/whatsapp_parser.py --session_token "<|endoftext|>" --user_name "$whatsapp_user_name" --time_format "$whatsapp_datetime_format"
print("> [WHATSAPP] parsing completed!\n\n")
print("----------------------------------")

# Telegram
print("> [TELEGRAM] start parsing...")
assert os.path.exists("./messaging-chat-parser/data/chat_raw/telegram/telegram_dump.json"), "[!] `telegram_dump.json` not found"
!cd messaging-chat-parser && python ./src/telegram_parser.py --session_token "<|endoftext|>" --time_format "$telegram_datetime_format"
print("> [TELEGRAM] parsing completed!\n\n")
print("----------------------------------")

# Join Telegram and Whatsapp data
!cd messaging-chat-parser/ && python ./src/joiner.py
with open('./messaging-chat-parser/data/chat_parsed/all-messages.txt') as training_file:
    training_data_lines = sum(1 for _ in training_file)
print(f"> [PARSER] Training file lines: {training_data_lines}")
print("----------------------------------")

# Check data size
if training_data_lines < 100000:
    print(f"[WARNING] Insufficient training data ({training_data_lines} < 100K): exporting more chats is recommended")

------
## 3️⃣ Train a GPT-2 model

- ⏳ The following cell could take **up to 10 hours**
    - An estimate of the total time will be printed


In [None]:
!cp ./messaging-chat-parser/data/chat_parsed/all-messages.txt ./pistoBot/data/inputs/chat_parsed/all-messages-endoftext.txt
!cd pistoBot/colab/ && bash run_training.sh gpt2-scratch

---
## 4️⃣ Chat with the model

Load the model

In [None]:
from aitextgen import aitextgen
from pprint import pprint

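models_dir = "./pistoBot/data/models_trained/"
# Keep only model folders (this skips `.gitkeep`) and pick the first one by name
folders = sorted(f for f in os.listdir(models_dir) if os.path.isdir(os.path.join(models_dir, f)))
assert folders, "[!] No trained model found"
folder_name = folders[0]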

model_path = os.path.join(".", "pistoBot", "data", "models_trained", folder_name, "pytorch_model.bin")
config_path = os.path.join(".", "pistoBot", "data", "models_trained", folder_name,"config.json")
vocab_path = os.path.join(".", "pistoBot", "data", "models_trained", folder_name,"aitextgen-vocab.json")
merges_path = os.path.join(".", "pistoBot", "data", "models_trained", folder_name,"aitextgen-merges.txt")

ai = aitextgen(model=model_path, 
               config=config_path,
               vocab_file=vocab_path,
               merges_file=merges_path,
               to_gpu=True)

### 💬 Interactive mode
> Chat with the model one message at a time

- Run the following cell and use the prompt (✍) to write your messages
- The chat messages show two tags:
    - **[others]** tag: messages written by the user
    - **[me]** tag: messages generated by the model
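
As a rough illustration of how the two tags are used, the sketch below builds a prompt from tagged lines and picks out the line that follows the existing chat. The helper names `build_prompt` and `extract_reply` are hypothetical, and the "generated" text is a hand-written stand-in for model output.

```python
OTHERS_TAG = "[others]"
ME_TAG = "[me]"


def build_prompt(chat):
    """Join the tagged chat lines into the text handed to the model."""
    return " ".join(chat)


def extract_reply(generated_text, chat):
    """Return the first line after the existing chat, or None if it is missing."""
    lines = generated_text.split("\n")
    position = len(chat)  # one line per chat entry, so the reply is the next line
    return lines[position] if position < len(lines) else None


chat = [f"{OTHERS_TAG} hi\n"]
prompt = build_prompt(chat)

# Hand-written stand-in for model output: the prompt followed by new lines.
generated = prompt + f"{ME_TAG} hi, how are you?\n{OTHERS_TAG} fine\n"
print(extract_reply(generated, chat))
```

Only the first new line is kept as the reply; anything the model generates after it is discarded.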

<br>

- Error _Max temperature reached_:
    - **Solution**: re-run the cell
    - Reason: under the hood, the program keeps increasing the _temperature_ value until the model generates a message that starts with the "[me]" tag. The error is raised if the maximum value is reached without success.
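
The retry behaviour can be sketched in isolation. In the example below, `generate_reply` is a hypothetical callable standing in for the model, and `fake_generate` is a toy substitute that only produces a "[me]" line at higher temperatures. The step size and limits mirror the values used in this notebook.

```python
def reply_with_rising_temperature(generate_reply, start_temperature=0.7,
                                  max_temperature=3.0, step=0.1):
    """Call generate_reply(temperature) until the result starts with "[me]"."""
    temperature = start_temperature
    while True:
        reply = generate_reply(temperature)
        if reply.startswith("[me]"):
            return reply, temperature
        if temperature >= max_temperature:
            raise RuntimeError("Max temperature reached")
        temperature = round(temperature + step, 2)  # rounding avoids float drift


# Toy stand-in for the model: it only answers as "[me]" from 0.9 upwards.
def fake_generate(temperature):
    return "[me] hello" if temperature >= 0.9 else "[others] hello"


print(reply_with_rising_temperature(fake_generate))
```

Because generation is random, re-running the cell after the error restarts from the lowest temperature and often succeeds.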


In [None]:
chat = []
start_temperature = 0.7
max_temperature = 3.0

for _ in range(5):
    new_line = "[others] " + input("✍") + '\n'
    chat.append(new_line)
    
    me_token = False
    temperature = start_temperature
    input_network = ' '.join(chat)
    
    while not me_token:
        text = ai.generate(prompt=input_network, 
                           return_as_list=True, 
                           temperature=temperature)
        text = text[0] # batch of 1

        text = text.split('\n')
        chat_pos = len(chat)
        network_reply = text[chat_pos]

        if network_reply.startswith('[me]'):
            me_token = True
            network_reply = text[chat_pos] + '\n'
            chat.append(network_reply)
        else:
            if temperature >= max_temperature:
                raise RuntimeError("Max temperature reached")
            temperature += 0.1
    # print(f'temperature exit: {temperature}')
    print('Chat:')
    pprint(chat)
    print('---------------------')
    