# Word cloud from exported Telegram chat history

The Telegram Desktop app can [export the history of a group chat](https://telegram.org/blog/export-and-more). The exported data consists of HTML files that contain the chat messages and, depending on the export settings, folders that contain other content sent to the chat, such as files and videos. In this notebook we create a word cloud from the exported chat history. The notebook runs on `Node.js`, since JavaScript feels like a natural fit for extracting data from HTML files. [IJavascript](https://github.com/n-riesco/ijavascript) provides a JavaScript kernel for Jupyter Notebook.

The exported data contains quite a few links and some [HTML entities](https://www.w3schools.com/html/html_entities.asp) that must be cleaned up. The following functions are used to parse the messages.

In [1]:
/* Searches for links in message with regex and parses each link. */
const parseMessage = message => {
    let out = message
    const regex = /<a\b.*?<\/a>/g // lazy match, so each link is matched separately
    const foundHyperlinks = message.match(regex)

    if (foundHyperlinks) {
        foundHyperlinks.forEach(l => out = out.replace(l, parseHyperlink(l)))
    }
    
    const htmlEntities = {
        "<br/>": "\n",
        "&lt;": "<",
        "&gt;": ">",
        "<em>": "",
        "</em>": "",
        "<strong>": "",
        "</strong>": "",
        "&apos;": "\'",
        "&quot;": "\""
    }
    
    for (const entity in htmlEntities) {
        out = out.replace(new RegExp(entity, 'g'), htmlEntities[entity]) // replace all
    }

    return out
}

/* Parses different link types in message */
const parseHyperlink = link => {
  // bot commands
  if (link.startsWith("<a href onclick=\"return ShowBotCommand")) {
    return "/" + link.split("&quot;")[1]
  // mentioning someone in the chat
  } else if (link.startsWith("<a href=\"https://t.me/")) {
    return link.split(">")[1].split("<")[0]
  // phone numbers, hashtags, external URLs: remove these
  } else {
    return ""
  }
}
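
One detail worth checking when matching links with a regular expression is greediness. The sketch below uses a made-up sample message (not taken from a real export) to show that a greedy `.*` merges two links into a single match, while a lazy `.*?` matches each link on its own. The variable names are only for illustration.

```javascript
// Hypothetical sample message with two mentions.
const sample = 'hi <a href="https://t.me/alice">@alice</a> and <a href="https://t.me/bob">@bob</a>!'

// Greedy: ".*" runs to the last "</a>" in the message.
const greedy = sample.match(/<a.*<\/a>/g)
// Lazy: ".*?" stops at the first "</a>" after each "<a".
const lazy = sample.match(/<a\b.*?<\/a>/g)

console.log(greedy.length) // 1
console.log(lazy.length)   // 2

// Extract the visible text of each link, as parseHyperlink does for mentions.
const names = lazy.map(l => l.split(">")[1].split("<")[0])
console.log(names) // [ '@alice', '@bob' ]
```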

The exported data is located in the `data` folder. We loop over all HTML files, convert each file to a DOM object, and search for all elements with the `text` class. These are the messages sent to the chat. The number of occurrences of each word is saved to the `result` object.
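
The counting step can be sketched in isolation as follows. This is a minimal illustration, not the notebook's exact code: `countWords` is a hypothetical helper name, and it uses a `Map` instead of a plain object, which avoids collisions with inherited property names such as `constructor`.

```javascript
// Count lowercase, whitespace-separated tokens across a list of messages.
// `countWords` is a hypothetical helper, not part of the notebook.
const countWords = messages => {
    const counts = new Map()
    for (const msg of messages) {
        for (const word of msg.toLowerCase().split(/\s+/)) {
            if (!word) continue // leading or trailing whitespace yields empty tokens
            counts.set(word, (counts.get(word) || 0) + 1)
        }
    }
    return counts
}

// Made-up sample messages.
const counts = countWords(["Moi moi", "moi  kaikki :D"])
console.log(counts.get("moi")) // 3
console.log(counts.get(":d"))  // 1
```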

Next, the suitably formatted result is passed to an HTML template that loads `script.js`, which contains the algorithm that creates the word cloud. The script is based on [this snippet](https://codepen.io/stevn/pen/JdwNgw). The resulting HTML file is both saved and rendered in the notebook. So while we use `Node` for preprocessing, we leave creating the final image to the browser. Note that executing the script takes some time, and your browser will be unresponsive while it renders, especially if you are creating a larger word cloud.
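
To make "suitably formatted" concrete: the template expects a `words` array of `{ word, freq }` objects, where `freq` is used as the font size. The sketch below builds such an array from made-up counts and serializes it with `JSON.stringify`, which is one way (an assumption here, not necessarily the only one) to make sure quotes inside words are escaped.

```javascript
// Hypothetical counts; in the notebook these come from the parsed messages.
const counts = { moi: 16, kaikki: 4, ":D": 9 }

// Font size in pixels grows with the square root of the count,
// so very frequent words do not dwarf everything else.
const weight = n => 2 * Math.sqrt(n)

const words = Object.entries(counts)
    .sort((a, b) => b[1] - a[1]) // most frequent first
    .slice(0, 100)               // keep at most the top 100
    .map(([word, n]) => ({ word, freq: weight(n) }))

// JSON is valid JavaScript, so this string can be embedded in a <script> tag.
const serialized = JSON.stringify(words)
console.log(serialized)
// [{"word":"moi","freq":8},{"word":":D","freq":6},{"word":"kaikki","freq":4}]
```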

In [2]:
const fs = require('fs')
const DomParser = require('dom-parser')
const parser = new DomParser()

In [3]:
const dir = fs.readdirSync("data")
const files = dir.filter(x => x.match(/messages[0-9]*\.html/ig))
const result = Object.create(null) // no inherited keys, so words like "constructor" are counted correctly

for (const file of files) {
    const contents = fs.readFileSync(`data/${file}`, 'utf8')
    const dom = parser.parseFromString(contents)
    const messages = dom.getElementsByClassName("text")
    const messagesParsed = messages
        .map(x => x.innerHTML.trim())
        .map(parseMessage)
        .filter(x => x) // filter empty strings

    for (const msg of messagesParsed) {
        for (const word of msg.toLowerCase().split(/\s+/)) {
            if (word in result) {
                result[word] += 1
            } else {
                result[word] = 1
            }
        }
    }  
}

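// Lowercasing turned ":D" into ":d". Restore the capitalized form: since ":D" is
// far more common than ":d" in this chat, this does not change the meaning.
for (const [lower, upper] of [[":d", ":D"], [":dd", ":DD"]]) {
    if (lower in result) {
        result[upper] = result[lower]
        delete result[lower]
    }
}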

// remove forms of the verb "olla" (to be) and some conjunctions and pronouns
const wordsToRemove = [
    "on", "oo", "oot", "oon", "oli", "onks", "ollu", "olla", "ois", "onko", "ollaa", "olis",
    "olin", "ollaan", "oisko", "oliks",
    "että", "et", "jotta", "koska", "kun", "ku", "jos", "vaikka", "vaik", "kuin", "kunnes",
    "ja", "sekä", "eli", "tai", "vai", "mutta", "mut", "sillä", "vaa", "vaan",
    "mul", "mä", "mulla", "mäki", "mulle", "mua", "mun",
    "sä", "sun",
    "meil", "me",
    "te",
    "ne", "niitä", "noi", "nää", "niit",
    "toi", "tämä", "täs", "tos", "tän", "tästä", "tänne", "tätä", "tota", "tää",
    "se", "sen", "sitä", "siitä", "siihen", "siellä", "siinä", "siin", "siit",
    "joka", "jotka", "mikä", "kuka"]
wordsToRemove.forEach(word => delete result[word])
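
As an alternative to deleting keys in place, the same filtering can be written without mutating the counts. The sketch below uses made-up counts and a shortened stop-word list; `stopWords` and `filtered` are illustrative names only.

```javascript
// Hypothetical counts and a shortened stop-word list for illustration.
const counts = { ja: 40, on: 35, kahvi: 7, sauna: 5 }
const stopWords = new Set(["ja", "on", "se"])

// Keep only the entries whose word is not a stop word.
const filtered = Object.fromEntries(
    Object.entries(counts).filter(([word]) => !stopWords.has(word))
)
console.log(filtered) // { kahvi: 7, sauna: 5 }
```

A `Set` gives constant-time lookups, and the original counts stay available if the stop-word list is adjusted later.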

In [5]:
const weight = x => 2 * Math.sqrt(x) // return font size in pixels for word

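const resultSorted = Object
    .entries(result)
    .sort((a, b) => b[1] - a[1]) // most frequent first
// JSON.stringify escapes quotes and backslashes that may occur in words
const resultString = JSON.stringify(
    resultSorted.slice(0, 100).map(([word, count]) => ({ word, freq: weight(count) })))
    .replace(/</g, "\\u003c") // keeps a "</script>" inside a word from closing the tag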
const resultHTML =`
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>Telegram chat word cloud</title>
  <style type="text/css" media="screen">
  /* NotoColorEmoji font for emojis https://github.com/googlefonts/noto-emoji */
  @font-face {
    font-family: NotoColorEmoji;
    src:url("./font/NotoColorEmoji.ttf");
  }
  #word-cloud{
    font-family: Calibri, NotoColorEmoji;
    height: 800px;
    width: 800px;
    margin: 0 auto;
  }
  </style>
</head>
<body>
  <!-- This is based on https://codepen.io/stevn/pen/JdwNgw with some modifications. -->
  <div id="word-cloud"></div>
  <script> var words = ${resultString} </script>
  <script src="./script.js"></script>
</body>
</html>`

// save and display result
fs.writeFile('result.html', resultHTML, (err) => { if (err) throw err })
$$.html(resultHTML)