# WhatsApp Chat History Data Visualization

This is a *work-in-progress* notebook for prototyping my chat log data visualization tool.

### Imports

Let's start off by importing our bread-and-butter data and visualization libraries:

In [160]:
# Data and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Data Extraction

We'll read the WhatsApp chat log (exported from an iOS device) into a dataframe, then make a deep copy on which to try out all of our preprocessing. <br>
**Note:** For privacy purposes, the chat log available on GitHub is much shorter and consists of dummy text, while preserving WhatsApp's export style.

In [161]:
# Read the WhatsApp chat log into a dataframe (one row per line of the export)

imported_messages = pd.read_csv('chat.txt', delimiter='\n', skiprows=[0], names = ['text_raw'])
imported_messages.head()

Unnamed: 0,text_raw
0,"[2020-02-28, 2:55:53 AM] User 1: From outside ..."
1,"[2020-02-28, 2:55:53 AM] User 2: and once at o..."
2,"[2020-02-28, 2:56:07 AM] User 2: \n which told..."
3,"[2020-02-28, 2:56:08 AM] User 2: 😯"
4,"[2020-02-28, 2:56:27 AM] User 1: 😯 Far away we..."


**Note:** Using '\n' as the delimiter means that messages with embedded line breaks spill over onto new rows of the dataframe. These continuation rows will not have the '*\[datetime\] username: text*' pattern seen in other rows, so we need to handle them appropriately when separating out datetimes and usernames.
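
To make the distinction concrete, here's a minimal sketch (standard library only) that flags which lines start a new message and which are continuation lines. The `HEADER` pattern and the sample `lines` are assumptions based on the rows shown above, not part of the notebook's pipeline.

```python
import re

# Assumed header shape, based on the sample rows: "[datetime] username: text"
HEADER = re.compile(r"^\[[^\]]+\] [^:]+: ")

# Hypothetical sample mirroring rows 7-10 of the chat log
lines = [
    "[2020-02-29, 6:00:23 PM] User 1: Twelve struck,",
    "and one and two and three,",
    "[2020-02-29, 6:15:12 PM] User 1: video omitted",
]

# True for lines that begin a new message, False for continuation lines
is_header = [bool(HEADER.match(line)) for line in lines]
print(is_header)
```

Only the lines flagged `False` need their datetime and username borrowed from an earlier row.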

In [162]:
# Deepcopy into a working dataframe for preprocessing / cleaning

messages = imported_messages.copy(deep=True)
messages.iloc[7:11]

Unnamed: 0,text_raw
7,"[2020-02-29, 6:00:23 PM] User 1: ☺ Twelve struck,"
8,"and one and two and three,"
9,and still we sat waiting silently for whatever...
10,"[2020-02-29, 6:15:12 PM] User 1: ‎video omitted"


### Preprocessing Non-Text Data

Alright, now let's work on extracting the datetime and username fields by leveraging [regular expressions](https://jakevdp.github.io/WhirlwindTourOfPython/14-strings-and-regular-expressions.html). Let's start with some useful libraries.

In [163]:
# Libraries to help us handle dates/times
import datetime as dt
from pytz import timezone

# Library for regular expressions
import regex

# Library to handle emojis in text
import emoji

We'll define a helper function to aid us with extracting usernames and datetimes, then apply it to our dataframe.

In [164]:
# Function to extract datetime and username as text
def extract_datetime_username(text):
    """
    Note:   Requires regex module to be imported
    Input:  String of text which may contain '[...]' text pattern
    Output: Tuple of the following: (String to the right of the ': ' text pattern      OR original text string ,
                                     String with contents of the '[...]' text pattern  OR NaN , 
                                     String between the '[...]' and ': ' text patterns OR NaN )
    """
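    # Regex to find the first '[...]' pattern in text (non-greedy, so brackets
    # inside the message body are not swallowed into the datetime)
    date_time = regex.search(r'\[(.*?)\]', text)
    
    # Output based on pattern search result
    if date_time:
        # maxsplit=1 keeps any later '] ' or ': ' inside the message text intact
        text_remainder = text.split("] ", 1)[1]
        text_username = text_remainder.split(": ", 1)
        return (text_username[1], date_time.group(1), text_username[0])
    else:
        return (text, np.nan, np.nan)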

In [165]:
# Apply the function to our dataframe and extract out the datetimes and usernames
messages['text_raw'], messages['date_time'], messages['username'] = zip(*messages['text_raw'].apply(extract_datetime_username))
messages.head()

Unnamed: 0,text_raw,date_time,username
0,From outside came the occasional cry of a nigh...,"2020-02-28, 2:55:53 AM",User 1
1,and once at our very window a long drawn catli...,"2020-02-28, 2:55:53 AM",User 2
2,\n which told us that the cheetah was indeed a...,"2020-02-28, 2:56:07 AM",User 2
3,😯,"2020-02-28, 2:56:08 AM",User 2
4,😯 Far away we could hear the deep tones of the...,"2020-02-28, 2:56:27 AM",User 1


In [166]:
# Check to ensure functionality is as intended on rows with embedded line breaks
messages.iloc[7:11]

Unnamed: 0,text_raw,date_time,username
7,"☺ Twelve struck,","2020-02-29, 6:00:23 PM",User 1
8,"and one and two and three,",,
9,and still we sat waiting silently for whatever...,,
10,‎video omitted,"2020-02-29, 6:15:12 PM",User 1


Before we proceed, let's check whether any rows have missing (None / NaN) text.

In [167]:
messages[messages['text_raw'].isna()]

Unnamed: 0,text_raw,date_time,username


Now we can fill in the NaN values in the 'date_time' and 'username' columns by considering those messages to have been sent by the user in the row above, at the time in the row above.
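
As a rough illustration of what forward-filling does, here's a plain-Python sketch. The `forward_fill` helper is hypothetical and only mimics the idea; the notebook itself relies on pandas for this step.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical columns mirroring rows 7-10 (None stands in for NaN)
date_times = ["2020-02-29, 6:00:23 PM", None, None, "2020-02-29, 6:15:12 PM"]
usernames = ["User 1", None, None, "User 1"]

filled_date_times = forward_fill(date_times)
filled_usernames = forward_fill(usernames)
print(filled_date_times)
print(filled_usernames)
```

A missing value at the very top of a column would stay missing, since there is nothing above it to copy from.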

In [168]:
messages.ffill(inplace=True)
messages.iloc[7:11]

Unnamed: 0,text_raw,date_time,username
7,"☺ Twelve struck,","2020-02-29, 6:00:23 PM",User 1
8,"and one and two and three,","2020-02-29, 6:00:23 PM",User 1
9,and still we sat waiting silently for whatever...,"2020-02-29, 6:00:23 PM",User 1
10,‎video omitted,"2020-02-29, 6:15:12 PM",User 1


Let's leverage Python's `datetime` module to convert our date_time column from a string to handy datetime objects (localized in my case to Toronto, Canada).
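
Here's a small sketch of the parsing half of this step, assuming the timestamp layout seen in the sample rows. It shows how the 12-hour clock with AM/PM maps onto 24-hour values; the `FORMAT` name is just for illustration.

```python
from datetime import datetime

# Format assumed from the sample rows, e.g. "2020-02-29, 6:00:23 PM"
FORMAT = "%Y-%m-%d, %I:%M:%S %p"

evening = datetime.strptime("2020-02-29, 6:00:23 PM", FORMAT)
midnight = datetime.strptime("2020-02-28, 12:05:00 AM", FORMAT)

print(evening.isoformat())   # 2020-02-29T18:00:23
print(midnight.isoformat())  # 2020-02-28T00:05:00
print(evening.tzinfo)        # None: the parsed value is timezone-naive
```

The parsed objects carry no timezone, which is why a separate localization step is needed afterwards.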

In [169]:
local_timezone = timezone('America/Toronto')

In [170]:
messages['date_time'] = messages['date_time'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d, %I:%M:%S %p'))
messages['date_time'] = messages['date_time'].apply(lambda x: local_timezone.localize(x))
messages.head()

Unnamed: 0,text_raw,date_time,username
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2
3,😯,2020-02-28 02:56:08-05:00,User 2
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1


Now for emojis: let's create another helper function to extract emojis into a separate dataframe column.<br>
**Note:** This was a more complicated endeavour than I had initially anticipated. I modified a solution found [here](https://stackoverflow.com/questions/49113909/split-and-count-emojis-and-words-in-a-given-string-in-python?noredirect=1&lq=1) to create a (very un-optimized!) solution. [This](https://www.regular-expressions.info/) site and [this](https://stackoverflow.com/questions/9928505/what-does-the-expression-x-match-when-inside-a-regex) post also helped!
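
To see why this gets complicated, here's a standard-library sketch showing that one visible emoji can be made of several code points. The family emoji below is a hand-picked example, not taken from the chat log.

```python
import unicodedata

# One visible "family" emoji built from three emojis glued by zero-width joiners
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# Iterating character by character splits the emoji into its parts
names = [unicodedata.name(ch) for ch in family]

print(len(family))  # 5
print(names)
```

Splitting on individual characters would count this as three emojis plus two invisible joiners, so the text needs to be split into user-perceived characters instead.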

In [171]:
# Define function to extract and process emojis
def extract_emojis(text):
    '''
    Input:  String (utf-8 encoding) containing emojis 
    Output: Tuple of the following: (original string with emojis removed, list of all emojis found in the string)
    '''
    # Use regex to split up our string. '\X' captures composite unicode emojis as a single emoji
    data = regex.findall(r'\X',text)    
    
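    # Create a list of all emojis present in data**
    # (relies on the flat `emoji.UNICODE_EMOJI` dict from older releases of the emoji package;
    #  newer releases expose `emoji.EMOJI_DATA` instead)
    all_emojis = [symbol for symbol in data if any(char in emoji.UNICODE_EMOJI for char in symbol)]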
            
    # Remove emojis from the given text
    for emjs in all_emojis:
        text = text.replace(emjs, '') 

    return (text, all_emojis)


#   **For reference, here's the original emoji list creation code without the one-liner list comprehension
#    all_emojis = []
#    for word in data:
#        if any(char in emoji.UNICODE_EMOJI for char in word):  
#            all_emojis += [word]

In [172]:
# Apply it to our dataframe and extract out the emojis
messages['text_raw'], messages['emojis'] = zip(*messages['text_raw'].apply(extract_emojis))
messages.head()

Unnamed: 0,text_raw,date_time,username,emojis
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,[]
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,[😯]
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,[]
3,,2020-02-28 02:56:08-05:00,User 2,[😯]
4,Far away we could hear the deep tones of the ...,2020-02-28 02:56:27-05:00,User 1,"[😯, ☺]"


### Preprocessing Text Data

Now we're ready to work directly on the text data and make it easier to extract insights from. We'll begin with library imports, primarily from NLTK.

In [173]:
# Python String Library
import string

# NLTK for all our language processing needs (tokenization and stopword removal, plus lemmatization if our data ends up too 'muddy')
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords 
#from nltk.stem import WordNetLemmatizer 

In [174]:
# If needed (only the stopwords corpus is used below):
# nltk.download('stopwords')

For reference, let's outline what our text preprocessing pipeline will look like. <br><br>
**Pipeline:**
1. Remove non-printable characters
2. Tokenize each message
3. Categorize each message type (text / picture / video) (*Note: voice call missed / video call missed not implemented yet*)
4. Remove stopwords from each message

**Step 1:** <br>
Since we've removed all the emojis, let's clean up the text further by removing 'non-printable' characters (co-opting [this solution](https://stackoverflow.com/questions/1342000/how-to-make-the-python-interpreter-correctly-handle-non-ascii-characters-in-stri)). <br>
**Note:** `str.isprintable` keeps printable non-ASCII letters, but it drops every control, format, and separator character other than the ASCII space (newlines, tabs, non-breaking spaces, zero-width joiners, and so on). This might cause unintended behaviour if the message corpus contains scripts that rely on such characters.
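
Here's a quick standalone check of what this filter keeps and drops. The sample strings are made up for illustration, and the helper is repeated so that the snippet runs on its own.

```python
def remove_non_printables(s):
    # Keep only the characters Python considers printable
    return "".join(ch for ch in s if ch.isprintable())

# Hypothetical samples: label -> raw string
samples = {
    "left-to-right mark": "\u200evideo omitted",
    "non-breaking space": "a\u00a0b",
    "newline": "a\nb",
    "accented letter": "caf\u00e9",
}

cleaned = {label: remove_non_printables(raw) for label, raw in samples.items()}
for label, text in cleaned.items():
    print(f"{label}: {text!r}")
```

The invisible left-to-right mark that precedes *'video omitted'* in the export is removed, and the accented letter survives.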

In [175]:
# Define function for removing non-printable characters
def remove_non_printables(s):
    return "".join(x for x in s if x.isprintable())

In [176]:
# Apply it to our dataframe to clean the raw text data
messages['text_raw'] = messages['text_raw'].apply(remove_non_printables)
messages.iloc[10]['text_raw']

'video omitted'

Awesome! Now before we process our text data any further, let's calculate the raw text string length:

In [177]:
messages['text_str_length'] = messages['text_raw'].apply(len)
messages.head()

Unnamed: 0,text_raw,date_time,username,emojis,text_str_length
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,[],52
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,[😯],55
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,[],56
3,,2020-02-28 02:56:08-05:00,User 2,[😯],0
4,Far away we could hear the deep tones of the ...,2020-02-28 02:56:27-05:00,User 1,"[😯, ☺]",59


In [178]:
messages.iloc[7:11]

Unnamed: 0,text_raw,date_time,username,emojis,text_str_length
7,"Twelve struck,",2020-02-29 18:00:23-05:00,User 1,[☺],15
8,"and one and two and three,",2020-02-29 18:00:23-05:00,User 1,[],26
9,and still we sat waiting silently for whatever...,2020-02-29 18:00:23-05:00,User 1,[],61
10,video omitted,2020-02-29 18:15:12-05:00,User 1,[],13


**Step 2:**<br>
Now, let's [tokenize](https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/) our corpus of text messages! In this case, we'll use a [regex tokenizer](https://kite.com/python/answers/how-to-remove-all-punctuation-marks-with-nltk-in-python) to ignore punctuation and only select words.
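
For intuition, here's a standard-library sketch of the same idea, using `re.findall` with the `\w+` pattern as a stand-in for the NLTK tokenizer. The `tokenize` helper and the sample strings are illustrative assumptions.

```python
import re

def tokenize(text):
    # Collect every run of word characters; punctuation and spaces act as separators
    return re.findall(r"\w+", text)

plain = tokenize("I could, however, see")
contraction = tokenize("don't")
literal_backslash_n = tokenize(r"\n which told us")

print(plain)                # ['I', 'could', 'however', 'see']
print(contraction)          # ['don', 't']
print(literal_backslash_n)  # ['n', 'which', 'told', 'us']
```

Two side effects are worth keeping in mind: contractions are split at the apostrophe, and a literal backslash followed by `n` in the text leaves behind a stray `n` token.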

In [179]:
tokenizer = RegexpTokenizer(r"\w+")
messages['text_processed'] = messages['text_raw'].apply(tokenizer.tokenize)
messages.head()

Unnamed: 0,text_raw,date_time,username,emojis,text_str_length,text_processed
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,[],52,"[From, outside, came, the, occasional, cry, of..."
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,[😯],55,"[and, once, at, our, very, window, a, long, dr..."
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,[],56,"[n, which, told, us, that, the, cheetah, was, ..."
3,,2020-02-28 02:56:08-05:00,User 2,[😯],0,[]
4,Far away we could hear the deep tones of the ...,2020-02-28 02:56:27-05:00,User 1,"[😯, ☺]",59,"[Far, away, we, could, hear, the, deep, tones,..."


**Step 3:**<br>
Since the WhatsApp chat log is text-only, any image or video media is represented as a message with the text *'image omitted'* or *'video omitted'* in it. We can extract this information into a separate column and empty the respective `text_processed` field (since these aren't 'real' messages).

In [180]:
# Function for finding image and video messages
def find_message_type(tokenlist):
    '''
    Input:  Tokenized text string (i.e. list of words)
    Output: Tuple of the following: (String indicating type of message ['image', 'video', or 'text'],
                                     appropriate output token list for the message type)
    '''
    if len(tokenlist) == 2:
        if (tokenlist[1] == 'omitted'):
            
            if (tokenlist[0] == 'image'):
                return ('image', [])
            
            elif (tokenlist[0] == 'video'):
                return ('video', [])
                        
    return ('text', tokenlist)

In [181]:
messages['msg_type'], messages['text_processed'] = zip(*messages['text_processed'].apply(find_message_type))
messages.head()

Unnamed: 0,text_raw,date_time,username,emojis,text_str_length,text_processed,msg_type
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,[],52,"[From, outside, came, the, occasional, cry, of...",text
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,[😯],55,"[and, once, at, our, very, window, a, long, dr...",text
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,[],56,"[n, which, told, us, that, the, cheetah, was, ...",text
3,,2020-02-28 02:56:08-05:00,User 2,[😯],0,[],text
4,Far away we could hear the deep tones of the ...,2020-02-28 02:56:27-05:00,User 1,"[😯, ☺]",59,"[Far, away, we, could, hear, the, deep, tones,...",text


In [182]:
messages.tail()

Unnamed: 0,text_raw,date_time,username,emojis,text_str_length,text_processed,msg_type
30,but the sudden glare flashing into my weary eyes,2020-03-08 17:45:56-04:00,User 1,[],48,"[but, the, sudden, glare, flashing, into, my, ...",text
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,[],84,"[made, it, impossible, for, me, to, tell, what...",text
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[😭, 😭, 😭]",88,"[I, could, however, see, that, his, face, was,...",text
33,Missed video call,2020-03-08 18:20:13-04:00,User 2,[],17,"[Missed, video, call]",text
34,Missed voice call,2020-03-08 18:22:23-04:00,User 2,[],17,"[Missed, voice, call]",text


In [183]:
print(messages['msg_type'].iloc[10], messages['text_processed'].iloc[10])

video []


**Step 4:**<br>
Finally, let's remove common [stopwords](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/) from our tokenized corpus. **Note:** Our helper function will also turn our text to all lowercase.
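
Here's a minimal sketch of the filtering logic on its own. `SAMPLE_STOPWORDS` is a hand-picked, hypothetical subset used purely for illustration; the notebook uses NLTK's full English list.

```python
# Hypothetical subset of stopwords, for illustration only
SAMPLE_STOPWORDS = {"but", "the", "into", "my"}

def remove_stopwords(tokenlist, stopword_set):
    # Lowercase every token, then drop the ones found in the stopword set
    return [word.lower() for word in tokenlist if word.lower() not in stopword_set]

tokens = ["But", "the", "sudden", "glare", "flashing", "into", "my", "weary", "eyes"]
filtered = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(filtered)  # ['sudden', 'glare', 'flashing', 'weary', 'eyes']
```

Lowercasing before the membership test matters here: without it, a capitalized stopword such as "But" would slip through.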

In [184]:
# Define the set of stopwords (a set gives fast membership checks)
stopwords_list = set(stopwords.words('english'))
print(stopwords_list)

# List comprehension helper function to remove stopwords
def remove_stopwords(tokenlist):
    return [word.lower() for word in tokenlist if word.lower() not in stopwords_list]

{"that'll", 'shouldn', 'that', 'itself', 'yours', 'be', 'which', 'haven', 'should', "aren't", 'some', "isn't", 'wasn', "doesn't", 'her', 'their', "should've", 'herself', 'your', 'she', 'above', "wasn't", "didn't", 'shan', 'a', 'yourselves', 'having', 'his', 'him', 'theirs', 'have', 't', "you're", 'did', 'for', 'myself', 'couldn', 'who', 'my', 'all', "don't", 'is', 'mustn', 'are', 'so', 'has', 'where', 'he', 'm', 'hasn', 'd', 'y', 'under', 'by', 'am', 'or', 'few', "wouldn't", 'what', 'other', 'how', 's', 'here', 'doesn', 'll', 'too', 'aren', 'but', 'against', 'own', 'o', 'about', 'we', 'himself', 'there', 'can', 'being', 'these', 'nor', "mustn't", 'ours', 'while', 'on', 'don', 'again', 'because', 'isn', 'will', 'me', 'now', "hadn't", 'each', 'was', 'ma', 'ain', 'do', "you'd", 'were', 'when', 'needn', 'down', 'does', 'our', 'into', 'it', 'more', 'with', 'in', 'than', "she's", 'didn', "shouldn't", "haven't", 'further', 'off', 'such', 'to', 'from', 'through', 'the', 'yourself', "it's", 'bo

In [185]:
messages['text_processed'] = messages['text_processed'].apply(remove_stopwords)
messages.tail()

Unnamed: 0,text_raw,date_time,username,emojis,text_str_length,text_processed,msg_type
30,but the sudden glare flashing into my weary eyes,2020-03-08 17:45:56-04:00,User 1,[],48,"[sudden, glare, flashing, weary, eyes]",text
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,[],84,"[made, impossible, tell, friend, lashed, savag...",text
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[😭, 😭, 😭]",88,"[could, however, see, face, deadly, pale, fill...",text
33,Missed video call,2020-03-08 18:20:13-04:00,User 2,[],17,"[missed, video, call]",text
34,Missed voice call,2020-03-08 18:22:23-04:00,User 2,[],17,"[missed, voice, call]",text


### Exploratory Data Analysis

In [186]:
text_processed_master = messages['text_processed'].explode().value_counts()
text_processed_master.head(10)

struck    3
sound     3
heard     3
see       3
lashed    2
call      2
holmes    2
long      2
missed    2
smell     2
Name: text_processed, dtype: int64