# WhatsApp Chat History Data Visualization

This is a *work-in-progress* notebook for prototyping my chat log data visualization tool. Code here has been minimally refactored - the intent is to develop an MVP.

### Imports

Let's start off by importing our bread-and-butter data and visualization libraries:

In [733]:
# Data and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

We will also import the custom dictionaries defined in `Chat-History-User-Defined.ipynb`, which are used in the text normalization process. Feel free to edit the dictionaries to suit the normalization your own chat text needs.

In [734]:
%run Chat-History-User-Defined.ipynb

### Data Extraction

We'll read the WhatsApp chat log (exported from an iOS device) into a dataframe and make a deep copy on which to try out all of our preprocessing. <br>
**Note:** For privacy purposes, the chat log available on GitHub is much shorter and consists of dummy text, while maintaining WhatsApp's export style.

In [735]:
# Read in Whatsapp chat log to a dataframe

imported_messages = pd.read_csv('chat.txt', delimiter='\n', skiprows=[0], names = ['text_raw'])
imported_messages.head()

Unnamed: 0,text_raw
0,"[2020-02-28, 2:55:53 AM] User 1: From outside ..."
1,"[2020-02-28, 2:55:53 AM] User 2: and lool lool..."
2,"[2020-02-28, 2:56:07 AM] User 2: \ which told ..."
3,"[2020-02-28, 2:56:08 AM] User 2: hah lmaoooo w..."
4,"[2020-02-28, 2:56:27 AM] User 1: 😯 Far away we..."


In [736]:
imported_messages.tail()

Unnamed: 0,text_raw
29,"[2020-03-08, 05:20:32 PM] User 2: At the momen..."
30,"[2020-03-08, 05:45:56 PM] User 1: lool but the..."
31,made it impossible for me to tell what it was ...
32,"[2020-03-08, 05:50:34 PM] User 2: I could, how..."
33,"[2020-03-08, 05:52:12 PM] User 2: https://towa..."


**Note:** Using '\n' as the delimiter results in messages with embedded line breaks escaping to new rows in the dataframe. These rows will not have the '*\[datetime\] username: text*' pattern seen in other rows, so we need to handle these appropriately when separating out datetimes and usernames.
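
An alternative to patching these rows afterwards is to glue continuation lines back onto their parent message while reading the file. The following is only a minimal sketch of that idea, not what this notebook does: the `merge_continuations` helper and the `NEW_MESSAGE` timestamp pattern are hypothetical names, and the pattern simply assumes the export format shown in the output above.

```python
import re

# Assumed shape of a new message line: '[YYYY-MM-DD, H:MM:SS AM/PM] ' at the start,
# optionally preceded by the hidden left-to-right mark '\u200e'.
NEW_MESSAGE = re.compile(r'^\u200e?\[\d{4}-\d{2}-\d{2}, \d{1,2}:\d{2}:\d{2} [AP]M\] ')

def merge_continuations(lines):
    """Join lines that do not start with a timestamp onto the previous message."""
    merged = []
    for line in lines:
        line = line.rstrip('\n')
        if NEW_MESSAGE.match(line) or not merged:
            merged.append(line)           # start of a new message
        else:
            merged[-1] += '\n' + line     # continuation of the previous message
    return merged

sample = [
    "[2020-02-29, 6:00:23 PM] User 1: ☺ Twelve struck,",
    "and one and two and three,",
    "[2020-02-29, 6:15:12 PM] User 1: \u200evideo omitted",
]
merged_sample = merge_continuations(sample)
print(len(merged_sample))  # 2
```

With this approach each message stays in a single row, at the cost of keeping embedded newlines inside the message text.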

In [737]:
# Deepcopy into a working dataframe for preprocessing / cleaning

messages = imported_messages.copy(deep=True)
messages.iloc[7:11]

Unnamed: 0,text_raw
7,"[2020-02-29, 6:00:23 PM] User 1: ☺ Twelve struck,"
8,"and one and two and three,"
9,and still we sat waiting silently for whatever...
10,"[2020-02-29, 6:15:12 PM] User 1: ‎video omitted"


### Preprocessing Non-Text Data

Alright, now let's work on extracting the datetime and username fields by leveraging [regular expressions](https://jakevdp.github.io/WhirlwindTourOfPython/14-strings-and-regular-expressions.html). Let's start with some useful libraries.

In [738]:
# Libraries to help us handle dates/times
import datetime as dt
from pytz import timezone

# Library for regular expressions
import regex

# Library to handle emojis in text
import emoji

We'll define a helper function to aid us with extracting usernames and datetimes, then apply it to our dataframe.

In [739]:
# Function to extract datetime and username as text
def extract_datetime_username(text):
    """
    Note:   Requires regex module to be imported
    Input:  String of text which may contain '[...]' text pattern
    Output: Tuple of the following: (String to the right of the ': ' text pattern      OR original text string ,
                                     String with contents of the '[...]' text pattern  OR NaN , 
                                     String between the '[...]' and ': ' text patterns OR NaN )
    """
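    # Regex to find a leading '[...]' pattern in text, optionally preceded by the hidden
    # '\u200e' mark. Anchoring at the start means square brackets inside a message body
    # are not mistaken for a timestamp.
    date_time = regex.match(r'\u200e?\[(.*?)\]', text)
    
    # Output based on pattern search result
    if date_time:
        # Split only on the first occurrence, so that '] ' or ': ' inside the
        # message body does not truncate the text
        text_remainder = text.split("] ", 1)[1]
        text_username = text_remainder.split(": ", 1)
        return (text_username[1], date_time.group(1), text_username[0])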
    else:
        return (text, np.nan, np.nan)
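
The cases worth checking for a helper like this are message bodies that themselves contain ': ' or square brackets. As a hedged sketch, the same extraction can be written as one regular expression with named groups, using only the standard library. `MESSAGE` and `parse_line` are hypothetical names, and `None` stands in for the NaN values used above.

```python
import re

# Assumed line shape: '[date_time] username: text'
MESSAGE = re.compile(
    r'^\u200e?\[(?P<date_time>[^\]]+)\] (?P<username>[^:]+): (?P<text>.*)$',
    re.DOTALL,
)

def parse_line(line):
    """Return (text, date_time, username); the last two are None for continuation lines."""
    match = MESSAGE.match(line)
    if match is None:
        return (line, None, None)
    return (match.group('text'), match.group('date_time'), match.group('username'))

tricky = "[2020-02-28, 2:55:53 AM] User 1: Note: see [this] link"
print(parse_line(tricky))
print(parse_line("and one and two and three,"))
```

Because the username group cannot contain a colon, the first ': ' after the timestamp always separates the username from the text, and anything after it is kept as part of the message.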

In [740]:
# Apply the function to our dataframe and extract out the datetimes and usernames
messages['text_raw'], messages['date_time'], messages['username'] = zip(*messages['text_raw'].apply(extract_datetime_username))
messages.head()

Unnamed: 0,text_raw,date_time,username
0,From outside came the occasional cry of a nigh...,"2020-02-28, 2:55:53 AM",User 1
1,and lool lool lool lool once at our very windo...,"2020-02-28, 2:55:53 AM",User 2
2,\ which told us that the cheetah was indeed at...,"2020-02-28, 2:56:07 AM",User 2
3,hah lmaoooo wooowww,"2020-02-28, 2:56:08 AM",User 2
4,😯 Far away we could hear the deep tones of the...,"2020-02-28, 2:56:27 AM",User 1


In [741]:
# Check to ensure functionality is as intended on rows with embedded line breaks
messages.iloc[7:11]

Unnamed: 0,text_raw,date_time,username
7,"☺ Twelve struck,","2020-02-29, 6:00:23 PM",User 1
8,"and one and two and three,",,
9,and still we sat waiting silently for whatever...,,
10,‎video omitted,"2020-02-29, 6:15:12 PM",User 1


Before we proceed, let's verify if there are any rows of text with none / NaN values.

In [742]:
messages[messages['text_raw'].isna()]

Unnamed: 0,text_raw,date_time,username


Now we can fill in the NaN values in the 'date_time' and 'username' columns by considering those messages to have been sent by the user in the row above, at the time in the row above.

In [743]:
# Forward-fill from the row above (equivalent to fillna(method='ffill'), which newer pandas releases deprecate)
messages.ffill(inplace=True)
messages.iloc[7:11]

Unnamed: 0,text_raw,date_time,username
7,"☺ Twelve struck,","2020-02-29, 6:00:23 PM",User 1
8,"and one and two and three,","2020-02-29, 6:00:23 PM",User 1
9,and still we sat waiting silently for whatever...,"2020-02-29, 6:00:23 PM",User 1
10,‎video omitted,"2020-02-29, 6:15:12 PM",User 1


Let's leverage Python's `datetime` module to convert our date_time column from a string to handy datetime objects (localized in my case to Toronto, Canada).

In [744]:
local_timezone = timezone('America/Toronto')

In [745]:
messages['date_time'] = messages['date_time'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d, %I:%M:%S %p'))
messages['date_time'] = messages['date_time'].apply(lambda x: local_timezone.localize(x))
messages.head()

Unnamed: 0,text_raw,date_time,username
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1
1,and lool lool lool lool once at our very windo...,2020-02-28 02:55:53-05:00,User 2
2,\ which told us that the cheetah was indeed at...,2020-02-28 02:56:07-05:00,User 2
3,hah lmaoooo wooowww,2020-02-28 02:56:08-05:00,User 2
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1
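
As a quick sanity check of the format string, the sketch below parses two timestamps copied from the export using only the standard library. It shows that `%I` accepts both the unpadded ('2:55:53') and zero-padded ('05:20:32') hour styles that appear in the log. The values are still timezone-naive at this point; localization is the separate step done with `pytz` above.

```python
import datetime as dt

FORMAT = '%Y-%m-%d, %I:%M:%S %p'

# Both hour styles appear in the exported chat log
early = dt.datetime.strptime('2020-02-28, 2:55:53 AM', FORMAT)
late = dt.datetime.strptime('2020-03-08, 05:20:32 PM', FORMAT)

print(early.isoformat())  # 2020-02-28T02:55:53
print(late.isoformat())   # 2020-03-08T17:20:32
```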


### Preprocessing Text Data

Now we're ready to work directly on the text data and make it easier to extract insights from. We'll begin with library imports, primarily from NLTK.

In [746]:
# Python String Library
import string

# NLTK for all our language processing needs (tokenization and stopword removal - lemmatization if our data ends up too 'muddy')
import nltk
#from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords 
#from nltk.stem import WordNetLemmatizer 
#from normalise import normalise

In [747]:
# If needed:
# nltk.download()

For reference, let's outline what our [text preprocessing](https://www.kdnuggets.com/2019/04/text-preprocessing-nlp-machine-learning.html) pipeline will look like. <br><br>
**Pipeline:**
1. Tokenize each message
2. Separate out emojis (**Note:** This might be revisited if/when I begin investigating sentiment analysis)                    
3. Categorize each message type (text / picture / video / link)
4. Clean text messages
5. Normalize user-specific slang in text messages
6. Remove stopwords from text messages

**Step 1:**<br>
Let's [tokenize](https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/) our corpus of text messages! In this case, we'll use NLTK's `TweetTokenizer`, since it is capable of splitting up emoji groupings and identifying URLs as single tokens (see [here](https://towardsdatascience.com/an-introduction-to-tweettokenizer-for-processing-tweets-9879389f8fe7) for a brief comparison between NLTK tokenizers).

In [748]:
tokenizer = TweetTokenizer()
messages['text_processed'] = messages['text_raw'].apply(lambda x: tokenizer.tokenize(x))
messages.tail()

Unnamed: 0,text_raw,date_time,username,text_processed
29,At the moment when Holmes struck the light I h...,2020-03-08 17:20:32-04:00,User 2,"[At, the, moment, when, Holmes, struck, the, l..."
30,lool but the sudden glare lool lool flashing i...,2020-03-08 17:45:56-04:00,User 1,"[lool, but, the, sudden, glare, lool, lool, fl..."
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,"[made, it, impossible, for, me, to, tell, what..."
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[I, could, ,, however, ,, see, that, his, face..."
33,https://towardsdatascience.com/an-introduction...,2020-03-08 17:52:12-04:00,User 2,[https://towardsdatascience.com/an-introductio...


**Step 2:** <br>
Let's now extract out the emojis from the text - through a helper function defined below. This will simplify the workload involved in extracting out emoji-related insights in our visualizations down the line. <br><br>
**Note:** I had originally modified a solution found [here](https://stackoverflow.com/questions/49113909/split-and-count-emojis-and-words-in-a-given-string-in-python?noredirect=1&lq=1) to create a (very un-optimized!) solution that extracted emojis directly from the un-tokenized messages. However, this was before I discovered the magic of `TweetTokenizer`!
[This site](https://www.regular-expressions.info/) and [this post](https://stackoverflow.com/questions/9928505/what-does-the-expression-x-match-when-inside-a-regex)  provided a lot of insight into understanding and using regex (despite it not being needed as heavily).

In [749]:
# Function for separating out emojis from the tokenized corpus
def extract_emojis(tokenlist):
    '''
    Input:  List of tokenized strings (utf-8), containing emojis 
    Output: Tuple of the following: (list of non-emoji tokens from the input, list of all emoji tokens from the input)
    '''
    
    list_emojis = []
    list_text = []
    
    list_emojis = [token for token in tokenlist if any(char in emoji.UNICODE_EMOJI for char in token)]
    list_text = [token for token in tokenlist if token not in list_emojis]
    
    return (list_text, list_emojis)

#   **For reference, here's the original emoji list creation code without the one-liner list comprehension
#    list_emojis = []
#    for token in tokenlist:
#        if any(char in emoji.UNICODE_EMOJI for char in token):  
#            list_emojis += [token]
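
The function above scans the emoji list again for every token when building `list_text`. A single-pass partition avoids that, and taking the emoji test as a parameter keeps it independent of any particular emoji lookup. The sketch below is an illustration under assumptions: `partition_tokens` and `looks_like_emoji` are hypothetical names, and the Unicode-category check is only a rough stand-in for a proper emoji lookup.

```python
import unicodedata

def looks_like_emoji(token):
    # Rough heuristic (assumption): a token counts as an emoji token if any of its
    # characters is in the Unicode 'Symbol, other' (So) category.
    return any(unicodedata.category(char) == 'So' for char in token)

def partition_tokens(tokenlist, is_emoji_token=looks_like_emoji):
    """Split tokens into (non-emoji tokens, emoji tokens) in one pass."""
    list_text, list_emojis = [], []
    for token in tokenlist:
        if is_emoji_token(token):
            list_emojis.append(token)
        else:
            list_text.append(token)
    return (list_text, list_emojis)

print(partition_tokens(['But', 'I', 'saw', 'nothing', 'lmaoo', '.', '😯']))
```

The predicate used in `extract_emojis` could be passed in as `is_emoji_token` instead of the heuristic.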

In [750]:
# Apply it to our dataframe and extract out the emojis
messages['text_processed'], messages['emojis'] = zip(*messages['text_processed'].apply(extract_emojis))
messages.tail(10)

Unnamed: 0,text_raw,date_time,username,text_processed,emojis
24,"struck a match, and lashed furiously with his ...",2020-03-07 00:09:48-05:00,User 1,"[struck, a, match, ,, and, lashed, furiously, ...",[😆]
25,😯👀,2020-03-07 00:09:56-05:00,User 1,[],"[😯, 👀]"
26,"""You see it, Watson?"" he yelled. ""You see it?""",2020-03-08 17:10:08-04:00,User 1,"["", You, see, it, ,, Watson, ?, "", he, yelled,...",[]
27,But I saw nothing lmaoo. 😯,2020-03-08 17:10:29-04:00,User 1,"[But, I, saw, nothing, lmaoo, .]",[😯]
28,‎image omitted,2020-03-08 17:15:24-04:00,User 1,"[‎, image, omitted]",[]
29,At the moment when Holmes struck the light I h...,2020-03-08 17:20:32-04:00,User 2,"[At, the, moment, when, Holmes, struck, the, l...",[]
30,lool but the sudden glare lool lool flashing i...,2020-03-08 17:45:56-04:00,User 1,"[lool, but, the, sudden, glare, lool, lool, fl...",[]
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,"[made, it, impossible, for, me, to, tell, what...",[]
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[I, could, ,, however, ,, see, that, his, face...","[😭, 😭, 😭]"
33,https://towardsdatascience.com/an-introduction...,2020-03-08 17:52:12-04:00,User 2,[https://towardsdatascience.com/an-introductio...,[]


**Step 3:**<br>
Since the WhatsApp chat log is text-only, any image or video media is represented as a message with the text *'image omitted'* or *'video omitted'* in it. In addition, shared links are sent as an individual message with the *'https://'* prefix and kept as a single token by `TweetTokenizer`. We can use this information to categorize message types, and to empty the respective '*text_processed*' field for images and videos (since they aren't 'real' text messages).<br>
**Note:** The 'image omitted' and 'video omitted' text is preceded by a non-printable Unicode typesetting character ([\u200e](https://www.fileformat.info/info/unicode/char/200e/index.htm)), which is why it is tokenized into 3 tokens rather than the expected 2. Hidden / nonstandard characters such as this will be cleaned out in the next step.

In [751]:
messages['text_raw'].iloc[10][0]

'\u200e'

In [752]:
# Function for categorizing messages into types (Warning, this function isn't very pythonic...)
def categorize_message(tokenlist):
    '''
    Input:  Tokenized text string - i.e. list of words
    Output: Tuple of the following: (String indicating type of message - 'image', 'video', 'link' or 'text',
                                     appropriate output token list for the message type)
    '''
    # Identify links based on prefix
    if (len(tokenlist) == 1):
        if (tokenlist[0][:8] == 'https://'):
            return ('link', tokenlist)
    
    # Identify images / video by default WhatsApp message ("\u200e image omitted" or "\u200e video omitted")
    elif len(tokenlist) == 3:
        if (tokenlist[2] == 'omitted'):
            
            if (tokenlist[1] == 'image'):
                return ('image', [])
            
            elif (tokenlist[1] == 'video'):
                return ('video', [])
    
    # Default is text
    return ('text', tokenlist)
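
The token-count checks above depend on how the tokenizer happens to split the placeholder text. As a hedged alternative sketch, the same categories can be derived from the raw message string, which sidesteps the hidden '\u200e' character. `categorize_raw` is a hypothetical helper, and accepting 'http://' as well as 'https://' is an assumption beyond what the function above does.

```python
def categorize_raw(text):
    """Classify a raw message string as 'image', 'video', 'link' or 'text'."""
    stripped = text.replace('\u200e', '').strip()

    # Media placeholders written by the WhatsApp export
    if stripped in ('image omitted', 'video omitted'):
        return stripped.split()[0]

    # A message that is a single URL and nothing else
    if stripped.startswith(('http://', 'https://')) and ' ' not in stripped:
        return 'link'

    return 'text'

print(categorize_raw('\u200evideo omitted'))   # video
print(categorize_raw('hah lmaoooo wooowww'))   # text
```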

In [753]:
# Apply function to dataframe
messages['msg_type'], messages['text_processed'] = zip(*messages['text_processed'].apply(categorize_message))
messages.head()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,msg_type
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,"[From, outside, came, the, occasional, cry, of...",[],text
1,and lool lool lool lool once at our very windo...,2020-02-28 02:55:53-05:00,User 2,"[and, lool, lool, lool, lool, once, at, our, v...","[😯, 😯, 😯]",text
2,\ which told us that the cheetah was indeed at...,2020-02-28 02:56:07-05:00,User 2,"[\, which, told, us, that, the, cheetah, was, ...",[],text
3,hah lmaoooo wooowww,2020-02-28 02:56:08-05:00,User 2,"[hah, lmaoooo, wooowww]",[],text
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1,"[Far, away, we, could, hear, the, deep, tones,...","[😯, ☺]",text


In [754]:
print(messages['msg_type'].iloc[10], messages['text_raw'].iloc[10])

video ‎video omitted


In [755]:
messages.tail()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,msg_type
29,At the moment when Holmes struck the light I h...,2020-03-08 17:20:32-04:00,User 2,"[At, the, moment, when, Holmes, struck, the, l...",[],text
30,lool but the sudden glare lool lool flashing i...,2020-03-08 17:45:56-04:00,User 1,"[lool, but, the, sudden, glare, lool, lool, fl...",[],text
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,"[made, it, impossible, for, me, to, tell, what...",[],text
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[I, could, ,, however, ,, see, that, his, face...","[😭, 😭, 😭]",text
33,https://towardsdatascience.com/an-introduction...,2020-03-08 17:52:12-04:00,User 2,[https://towardsdatascience.com/an-introductio...,[],link


**Step 4:** <br>
Let's clean up the text by lowercasing every token, stripping leading / trailing whitespace, and dropping tokens that contain no alphanumeric characters (such as standalone punctuation and hidden characters like '\u200e'). Punctuation inside a token is kept, which is why 'night-bird' survives intact in the outputs further down. <br>
**Note:** `str.isalnum` is Unicode-aware, so letters from non-Latin scripts count as alphanumeric and are kept. Symbol-only tokens are dropped, however, which might cause unintended behaviour if such symbols carry meaning in your message corpus.

In [756]:
# Define function for cleaning up text (lowercasing all text + stripping whitespace + dropping tokens with no alphanumeric characters)
def clean_text(tokenlist):
    tokenlist_clean = []
    
    for token_raw in tokenlist:
        token = token_raw.strip().lower()
        token_clean = "".join(c for c in token if str.isalnum(c))
        
        if len(token_clean) > 0:
            tokenlist_clean.append(token)
            
    return tokenlist_clean
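
As written, `clean_text` appends the lowercased token rather than the stripped `token_clean`, so characters inside a token are left alone and only tokens without any alphanumeric character disappear. The sketch below contrasts that behaviour with a stricter variant that also strips characters inside tokens. Both helper names are hypothetical, and the stricter variant is an assumption about what one might want, not something the notebook uses.

```python
def clean_keep_inner(tokenlist):
    """Mirrors clean_text: drop tokens that have no alphanumeric characters."""
    cleaned = []
    for token_raw in tokenlist:
        token = token_raw.strip().lower()
        if any(c.isalnum() for c in token):
            cleaned.append(token)
    return cleaned

def clean_strip_inner(tokenlist):
    """Stricter variant: also remove non-alphanumeric characters inside tokens."""
    cleaned = []
    for token_raw in tokenlist:
        token = "".join(c for c in token_raw.strip().lower() if c.isalnum())
        if token:
            cleaned.append(token)
    return cleaned

tokens = ['\u200e', 'Night-Bird', ',', ' Holmes ']
print(clean_keep_inner(tokens))   # ['night-bird', 'holmes']
print(clean_strip_inner(tokens))  # ['nightbird', 'holmes']
```

Keeping inner punctuation preserves hyphenated words and contractions, while the stricter variant merges them into a single run of letters.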

In [757]:
# Apply it to our dataframe, skipping any 'link' messages to preserve the URL formatting, but cleaning the other message types
messages['text_processed'] = np.where(messages['msg_type'] != 'link',
                                      messages['text_processed'].apply(clean_text),
                                      messages['text_processed'])

messages.tail()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,msg_type
29,At the moment when Holmes struck the light I h...,2020-03-08 17:20:32-04:00,User 2,"[at, the, moment, when, holmes, struck, the, l...",[],text
30,lool but the sudden glare lool lool flashing i...,2020-03-08 17:45:56-04:00,User 1,"[lool, but, the, sudden, glare, lool, lool, fl...",[],text
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,"[made, it, impossible, for, me, to, tell, what...",[],text
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[i, could, however, see, that, his, face, was,...","[😭, 😭, 😭]",text
33,https://towardsdatascience.com/an-introduction...,2020-03-08 17:52:12-04:00,User 2,[https://towardsdatascience.com/an-introductio...,[],link


In [758]:
print(messages['text_processed'].iloc[-1])

['https://towardsdatascience.com/an-introduction-to-tweettokenizer-for-processing-tweets-9879389f8fe7']


**Step 5:** <br>
Next, let's normalize our corpus to manage any slang and slang variants. The initial implementation attempted to use the `normalise` library found [here](https://github.com/EFord36/normalise), but its processing runtime was impractically long for this dataset. As an alternative, I defined my own 'custom' normalizer to handle a user's 'chat-specific' slang. The normalizer primarily corrects slang words with repeated characters, as outlined in the comments below. The user can also define their own slang dictionary mappings in the auxiliary `Chat-History-User-Defined.ipynb` notebook. <br><br>
**Note**: `TweetTokenizer` has a `reduce_len` parameter which accomplishes a similar functionality ([see here](https://www.nltk.org/api/nltk.tokenize.html) for details), but it treats words under the '3 repeated characters' limit as unique words, and is not user-customizable.

In [759]:
# 'Custom' normalizer:

# 1) Define sets of 'slang' strings based on type of expected non-normalization, for example:
# --- end-letter repeats (omgggg -> omg)
# --- combination repeats (wooowww -> wow)
# --- mid-letter repeats (looool -> lol)         [implemented via separate 'custom' dict instead]
# --- two-letter repeats (hahahaha -> haha)      [implemented via separate 'custom' dict instead]
# 2) Generate dicts which map slang variant to its 'root' form.
# --- Will be specific for each set and have an 'upper bound' number of repeats
# 3) Combine to a master 'normalization' dict. Define the custom normalization function to parse through the dict and evaluate feasibility of scaling up

# Expected runtime: O(m) to build the lookup dict plus O(n) for the lookups (dict membership is O(1) on average),
# where n = number of tokens in corpus, m = number of keys in master normalization lookup dict
# Potentially m will also be at least 1-2 orders of magnitude smaller than with the `normalise` library

# Custom / User defined:
# --- 'base' slang words and their 'type'
# --- 'upper bound' on number of repeats for dict generation

In [760]:
# Utility function to generate dicts of slang strings based on their type
def generate_slang_variants(SLANG_DICT, LIMIT = 5):
    """
    Input:    LIMIT:            Integer defining maximum number of repeated letters.
                                Default is 5 (i.e, 5 repeated single / double letters in the variant)
              SLANG_DICT:       Dictionary of string: integer pairs.
                                Strings are slang 'base' words. Integers are the slang variant 'type' defined below:
                                1 == Repeating single final letter (eg. lmaooooo -> lmao)
                                2 == Repeating double final letters (eg. wooooowwwww -> wow)
                                
    Output:   slang_lookup:     Dictionary of slang-variant : slang-root pairs
    """
    slang_lookup = {}
    
    for root in SLANG_DICT: #root[-2:]
                
        if SLANG_DICT[root] == 1:
            # Case 1: Repeat the single final letter up to limit. Add to final dictionary with root as value
            for i in range(LIMIT):
                variant = root + (root[-1]*i)
                slang_lookup.update({variant : root})
            
        elif SLANG_DICT[root] == 2:
            # Case 2: Generate combinations of both final letters up to limit. Add to final dictionary with root as value
            for i in range(1, LIMIT+1):
                for j in range(1, LIMIT+1):
                    variant = root[:-2] + (root[-2]*i) + (root[-1]*j)
                    slang_lookup.update({variant : root})
    
    return slang_lookup


In [761]:
# Primary function to normalize slang variants to their 'base' slang form
def normalize_slang(tokenlist, slang_lookup):
    """
    Input:    tokenlist:               Tokenized corpus to be parsed and normalized
              slang_lookup:            Dictionary of slang-variant : slang-root pairs

    Output:   tokenlist_normalized:    Normalized, tokenized corpus
    """
    tokenlist_normalized = []
    
    for token in tokenlist:
        if token in slang_lookup:
            tokenlist_normalized.append(slang_lookup[token])
        else:
            tokenlist_normalized.append(token)
    
    return tokenlist_normalized
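
To see the two pieces working together, here is a compact, self-contained sketch of the same idea: build the variant lookup, then map each token through it. The slang entries are hypothetical stand-ins, since the real `SLANG_DICT` and `SLANG_SPECIAL_CASES` live in the auxiliary notebook, and `build_lookup` is a condensed rewrite for illustration rather than the function above.

```python
def build_lookup(slang_types, limit=5):
    """Map repeated-letter variants to their root (type 1: last letter, type 2: last two letters)."""
    lookup = {}
    for root, kind in slang_types.items():
        if kind == 1:
            for i in range(limit):
                lookup[root + root[-1] * i] = root
        elif kind == 2:
            for i in range(1, limit + 1):
                for j in range(1, limit + 1):
                    lookup[root[:-2] + root[-2] * i + root[-1] * j] = root
    return lookup

# Hypothetical slang definitions for illustration only
example_lookup = build_lookup({'lmao': 1, 'wow': 2})
example_lookup.update({'hah': 'hahaha'})  # hypothetical special case

tokens = ['hah', 'lmaoooo', 'wooowww', 'holmes']
normalized = [example_lookup.get(token, token) for token in tokens]
print(normalized)  # ['hahaha', 'lmao', 'wow', 'holmes']
```

Each token costs one dictionary lookup, so the work grows with the number of tokens rather than with the size of the lookup table.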


In [762]:
# Generate final dictionary mapping slang variants to their 'base' slang word.
slang_lookup = generate_slang_variants(SLANG_DICT, VARIANT_LIMIT)

# 'Custom' slang variants are added to this final dictionary as well
slang_lookup.update(SLANG_SPECIAL_CASES)

In [763]:
# Apply function to dataframe and normalize slang on non-link messages
messages['text_processed'] = np.where(messages['msg_type'] != 'link',
                                      messages['text_processed'].apply(lambda x: normalize_slang(x, slang_lookup)),
                                      messages['text_processed'])
#messages['text_processed'] = messages['text_processed'].apply(lambda x: normalize_slang(x, slang_lookup))
messages.tail()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,msg_type
29,At the moment when Holmes struck the light I h...,2020-03-08 17:20:32-04:00,User 2,"[at, the, moment, when, holmes, struck, the, l...",[],text
30,lool but the sudden glare lool lool flashing i...,2020-03-08 17:45:56-04:00,User 1,"[lool, but, the, sudden, glare, lool, lool, fl...",[],text
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,"[made, it, impossible, for, me, to, tell, what...",[],text
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[i, could, however, see, that, his, face, was,...","[😭, 😭, 😭]",text
33,https://towardsdatascience.com/an-introduction...,2020-03-08 17:52:12-04:00,User 2,[https://towardsdatascience.com/an-introductio...,[],link


**Step 6:**<br>
Finally, let's remove common [stopwords](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/) from our tokenized and cleaned (non-link) messages. This will ensure our data isn't polluted by common-use words. The user can also define additional stopwords in the `Chat-History-User-Defined.ipynb` notebook.

In [764]:
# Define standard set of stopwords
stopwords_set = set(stopwords.words('english'))

# Add user-defined custom stopwords to set
stopwords_set = stopwords_set | STOPWORDS_EXTRA

# List comprehension helper function to remove stopwords
def remove_stopwords(tokenlist):
    return [word.lower() for word in tokenlist if word.lower() not in stopwords_set]
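
The set union is what lets user-defined words extend the standard list. A tiny sketch with stand-in sets shows the effect on one tokenized message; the word lists here are hypothetical and far smaller than NLTK's English list or the real `STOPWORDS_EXTRA`.

```python
# Stand-ins for the NLTK list and the user-defined extras (assumptions for illustration)
BASE_STOPWORDS = {'from', 'the', 'of', 'a'}
EXTRA_STOPWORDS = {'lool'}

stopword_set = BASE_STOPWORDS | EXTRA_STOPWORDS

def drop_stopwords(tokenlist):
    """Lowercase each token and keep it only if it is not a stopword."""
    return [word.lower() for word in tokenlist if word.lower() not in stopword_set]

tokens = ['From', 'outside', 'came', 'the', 'occasional', 'cry', 'of', 'a', 'night-bird', 'lool']
filtered = drop_stopwords(tokens)
print(filtered)  # ['outside', 'came', 'occasional', 'cry', 'night-bird']
```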

In [765]:
# Apply function to dataframe and remove stopwords on non-link messages
messages['text_processed'] = np.where(messages['msg_type'] != 'link',
                                      messages['text_processed'].apply(remove_stopwords),
                                      messages['text_processed'])
#messages['text_processed'] = messages['text_processed'].apply(remove_stopwords)
messages.head()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,msg_type
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,"[outside, came, occasional, cry, night-bird]",[],text
1,and lool lool lool lool once at our very windo...,2020-02-28 02:55:53-05:00,User 2,"[window, long, drawn, catlike, whine]","[😯, 😯, 😯]",text
2,\ which told us that the cheetah was indeed at...,2020-02-28 02:56:07-05:00,User 2,"[told, us, cheetah, indeed, liberty]",[],text
3,hah lmaoooo wooowww,2020-02-28 02:56:08-05:00,User 2,"[hahaha, lmao, wow]",[],text
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1,"[far, away, could, hear, deep, tones, parish, ...","[😯, ☺]",text


In [766]:
messages.tail()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,msg_type
29,At the moment when Holmes struck the light I h...,2020-03-08 17:20:32-04:00,User 2,"[moment, holmes, struck, light, heard, low, cl...",[],text
30,lool but the sudden glare lool lool flashing i...,2020-03-08 17:45:56-04:00,User 1,"[sudden, glare, flashing, weary, eyes]",[],text
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,"[made, impossible, tell, friend, lashed, savag...",[],text
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[could, however, see, face, deadly, pale, fill...","[😭, 😭, 😭]",text
33,https://towardsdatascience.com/an-introduction...,2020-03-08 17:52:12-04:00,User 2,[https://towardsdatascience.com/an-introductio...,[],link


In [767]:
print(messages['text_processed'].iloc[-1])

['https://towardsdatascience.com/an-introduction-to-tweettokenizer-for-processing-tweets-9879389f8fe7']


### Data Analysis and Visualization

Now for the fun part! Let's make a list of potential data visualizations, in the context of a two-person WhatsApp conversation:
1. General stats (message totals per person, image / media totals per person, most messages in a day, longest message length, etc.)
2. Daily (or weekly, if too noisy) time series for number of messages. Can annotate based on known life events.
3. Most used words, broken down by person. Can highlight 'key' words.
4. Most used emojis, broken down by person.
5. Longest word used by each person.
6. Average message quantity by day of week - can also extend to identify which hours were the busiest
7. Longest consecutive days without messages sent by either person

Let's keep track of which dataframes can be used for each visualization - and create new ones if needed (TBD if not yet investigated / implemented)!
1. TBD
2. `messages`
3. `text_expanded_by_user`
4. `emoji_expanded_by_user`
5. `text_expanded_by_user`
6. `messages`
7. `messages`
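
Item 7 in the first list (the longest run of consecutive days without messages) only needs the message timestamps. The sketch below shows one way to compute it with the standard library. `longest_silent_gap` is a hypothetical helper and the sample timestamps are made up for illustration; in the notebook, the values of `messages['date_time']` would be the input.

```python
import datetime as dt

def longest_silent_gap(timestamps):
    """Return (silent_days, last_active_date, next_active_date) for the longest quiet stretch."""
    active_days = sorted({ts.date() for ts in timestamps})
    best = (0, None, None)
    for previous, current in zip(active_days, active_days[1:]):
        silent_days = (current - previous).days - 1  # full days with no messages in between
        if silent_days > best[0]:
            best = (silent_days, previous, current)
    return best

# Made-up timestamps for illustration
sample = [
    dt.datetime(2020, 2, 28, 2, 55),
    dt.datetime(2020, 2, 29, 18, 0),
    dt.datetime(2020, 3, 7, 0, 9),
    dt.datetime(2020, 3, 8, 17, 10),
]
gap = longest_silent_gap(sample)
print(gap)  # (6, datetime.date(2020, 2, 29), datetime.date(2020, 3, 7))
```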

Creating 'by user' visualizations requires us to have access to the individual tokens in `messages['text_processed']`. We can explode out each token to a new row and preserve usernames (in a new dataframe) as follows:

In [768]:
text_expanded = messages[['username','text_processed']].explode('text_processed').dropna()
text_expanded.head()

Unnamed: 0,username,text_processed
0,User 1,outside
0,User 1,came
0,User 1,occasional
0,User 1,cry
0,User 1,night-bird


We need to pivot this data such that we obtain rows for counts of each unique word, broken down per user. This can be done as follows:

In [769]:
# Pivot
text_expanded_by_user = text_expanded.pivot_table(text_expanded, 
                                                  index='text_processed', 
                                                  columns='username', 
                                                  aggfunc=len).fillna(0)

# Create and populate a 'total' column
text_expanded_by_user['total'] = 0

for column in text_expanded_by_user:
    if column != 'total':
        text_expanded_by_user['total'] += text_expanded_by_user[column]

# Sort by most common words
text_expanded_by_user.sort_values(by='total', ascending=False, inplace=True)
        
# Re-index dataframe and fix column naming
text_expanded_by_user = text_expanded_by_user.reset_index()
text_expanded_by_user.rename_axis(None, axis=1, inplace=True)

text_expanded_by_user.head(60)

Unnamed: 0,text_processed,User 1,User 2,total
0,lmao,4.0,1.0,5.0
1,hahaha,3.0,1.0,4.0
2,see,2.0,1.0,3.0
3,sound,2.0,1.0,3.0
4,struck,2.0,1.0,3.0
5,heard,1.0,2.0,3.0
6,gentle,1.0,1.0,2.0
7,smell,1.0,1.0,2.0
8,could,1.0,1.0,2.0
9,hour,1.0,1.0,2.0
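
The per-user counts in this table can be cross-checked without pandas. The following is a minimal sketch using `collections.Counter`; the `count_words_by_user` helper and the sample rows are hypothetical, with each row standing for one message as a (username, token list) pair.

```python
from collections import Counter, defaultdict

def count_words_by_user(rows):
    """Return {word: Counter({username: count})} from (username, tokens) pairs."""
    counts = defaultdict(Counter)
    for username, tokens in rows:
        for token in tokens:
            counts[token][username] += 1
    return counts

# Made-up rows for illustration
rows = [
    ('User 1', ['lmao', 'struck']),
    ('User 2', ['lmao']),
    ('User 1', ['lmao']),
]
counts = count_words_by_user(rows)

# Totals per word, most common first
totals = sorted(((sum(c.values()), word) for word, c in counts.items()), reverse=True)
print(totals[0])  # (3, 'lmao')
```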


We can repeat this process to obtain a similar dataframe for emojis. **Note:** We need to 'de-emojize' before pivoting, since many identical emojis are represented by different Unicode sequences. To display them as emojis in visualizations, we must 're-emojize' them afterwards.

In [770]:
emoji_expanded = messages[['username','emojis']].explode('emojis').dropna()
emoji_expanded.head()

Unnamed: 0,username,emojis
1,User 2,😯
1,User 2,😯
1,User 2,😯
4,User 1,😯
4,User 1,☺


In [771]:
# De-emoji in-progress dataframe
emoji_expanded['emojis'] = emoji_expanded['emojis'].apply(emoji.demojize)

# Pivot
emoji_expanded_by_user = emoji_expanded.pivot_table(emoji_expanded, 
                                                  index='emojis', 
                                                  columns='username', 
                                                  aggfunc=len).fillna(0)

# Create and populate a 'total' column
emoji_expanded_by_user['total'] = 0

for column in emoji_expanded_by_user:
    if column != 'total':
        emoji_expanded_by_user['total'] += emoji_expanded_by_user[column]

# Sort by most common emojis
emoji_expanded_by_user.sort_values(by='total', ascending=False, inplace=True)

# Re-index dataframe and fix column naming
emoji_expanded_by_user = emoji_expanded_by_user.reset_index()
emoji_expanded_by_user.rename_axis(None, axis=1, inplace=True)

emoji_expanded_by_user.head(60)

Unnamed: 0,emojis,User 1,User 2,total
0,:hushed_face:,7.0,4.0,11.0
1,:loudly_crying_face:,0.0,7.0,7.0
2,:eyes:,2.0,0.0,2.0
3,:grinning_squinting_face:,2.0,0.0,2.0
4,:smiling_face:,2.0,0.0,2.0
5,:frowning_face:,1.0,0.0,1.0
6,:grinning_face_with_smiling_eyes:,0.0,1.0,1.0


In [772]:
emoji.emojize(emoji_expanded_by_user['emojis'][0])

'😯'

We now have two more dataframes that can be fed into some of our target visualizations!

1. a. Message totals per person: