# WhatsApp Chat History Data Visualization

This is a *work-in-progress* notebook for prototyping my chat log data visualization tool.

### Imports

Let's start off by importing our bread-and-butter data and visualization libraries:

In [574]:
# Data and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Data Extraction

We'll read the WhatsApp chat log (exported from an iOS device) into a dataframe and make a deep copy on which to try out all of our preprocessing. <br>
**Note:** For privacy purposes, the chat log available on GitHub is much shorter and consists of dummy text, but it maintains WhatsApp's export style.

In [575]:
# Read in WhatsApp chat log to a dataframe

imported_messages = pd.read_csv('chat_demo.txt', delimiter='\n', skiprows=[0], names = ['text_original'])
imported_messages.head()

Unnamed: 0,text_original
0,"[2020-02-28, 2:55:53 AM] User 1: From outside ..."
1,"[2020-02-28, 2:55:53 AM] User 2: and once at o..."
2,"[2020-02-28, 2:56:07 AM] User 2: \n which told..."
3,"[2020-02-28, 2:56:08 AM] User 2: 😯"
4,"[2020-02-28, 2:56:27 AM] User 1: 😯 Far away we..."


**Note:** Using '\n' as the delimiter causes messages with embedded line breaks to spill over into new rows of the dataframe. These rows will not have the '*\[datetime\] username: text*' pattern seen in other rows, so we need to handle them appropriately when separating out datetimes and usernames.
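
As a hedged aside: some newer pandas releases reject '\n' as a `read_csv` delimiter, so here is a minimal sketch of reading the export line by line with plain Python instead. The helper name `read_chat_lines` and the in-memory sample text are hypothetical and only there for illustration.

```python
import io

def read_chat_lines(handle):
    """Return the chat log as a list of strings, one per physical line of the export."""
    lines = [line.rstrip('\r\n') for line in handle]
    # Drop the first line (mirroring skiprows=[0]) and any blank lines
    return [line for line in lines[1:] if line.strip()]

# Hypothetical in-memory stand-in for the exported chat file
sample_export = io.StringIO(
    "[2020-02-28, 2:55:00 AM] User 1: first line, skipped\n"
    "[2020-02-29, 6:00:23 PM] User 1: Twelve struck,\n"
    "and one and two and three,\n"
    "\n"
    "[2020-02-29, 6:15:12 PM] User 1: video omitted\n"
)
lines = read_chat_lines(sample_export)
print(lines)
```

For a real file, the same function could be called on `open('chat_demo.txt', encoding='utf-8')`, and `pd.DataFrame({'text_original': lines})` would then give a single-column dataframe like the one above. Continuation lines still land in their own rows, just as they do with the '\n' delimiter.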

In [576]:
# Deep copy into a working dataframe for preprocessing / cleaning

messages = imported_messages.copy(deep=True)
messages.iloc[7:11]

Unnamed: 0,text_original
7,"[2020-02-29, 6:00:23 PM] User 1: ☺ Twelve struck,"
8,"and one and two and three,"
9,and still we sat waiting silently for whatever...
10,"[2020-02-29, 6:15:12 PM] User 1: video omitted"


### Preprocessing Non-Text Data

Alright, now let's work on extracting the datetime and username fields by leveraging [regular expressions](https://jakevdp.github.io/WhirlwindTourOfPython/14-strings-and-regular-expressions.html). Let's start with some useful libraries.

In [577]:
# Libraries to help us handle dates/times
import datetime as dt
from pytz import timezone

# Library for regular expressions
import regex

# Library to handle emojis in text
import emoji

We'll define a helper function to aid us with extracting usernames and datetimes, then apply it to our dataframe.

In [578]:
# Function to extract datetime and username as text
def extract_datetime_username(text):
    """
    Note:   Requires regex module to be imported
    Input:  String of text which may contain '[...]' text pattern
    Output: Tuple of the following: (String to the right of the ': ' text pattern      OR original text string ,
                                     String with contents of the '[...]' text pattern  OR NaN , 
                                     String between the '[...]' and ': ' text patterns OR NaN )
    """
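    # Regex to find the first '[...]' pattern in text
    # (non-greedy, so a later '[...]' inside the message body is not picked up instead)
    date_time = regex.search(r'\[(.*?)\]', text)
    
    # Output based on pattern search result
    if date_time:
        # Split only on the first occurrence so that '] ' or ': ' inside the message is preserved
        text_remainder = text.split("] ", 1)[1]
        text_username = text_remainder.split(": ", 1)
        return (text_username[1], date_time.group(1), text_username[0])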
    else:
        return (text, np.nan, np.nan)

In [579]:
# Apply the function to our dataframe and extract out the datetimes and usernames
messages['text_original'], messages['date_time'], messages['username'] = zip(*messages['text_original'].apply(extract_datetime_username))
messages.head()

Unnamed: 0,text_original,date_time,username
0,From outside came the occasional cry of a nigh...,"2020-02-28, 2:55:53 AM",User 1
1,and once at our very window a long drawn catli...,"2020-02-28, 2:55:53 AM",User 2
2,\n which told us that the cheetah was indeed a...,"2020-02-28, 2:56:07 AM",User 2
3,😯,"2020-02-28, 2:56:08 AM",User 2
4,😯 Far away we could hear the deep tones of the...,"2020-02-28, 2:56:27 AM",User 1


In [580]:
# Check to ensure functionality is as intended on rows with embedded line breaks
messages.iloc[7:11]

Unnamed: 0,text_original,date_time,username
7,"☺ Twelve struck,","2020-02-29, 6:00:23 PM",User 1
8,"and one and two and three,",,
9,and still we sat waiting silently for whatever...,,
10,video omitted,"2020-02-29, 6:15:12 PM",User 1
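
For comparison, here is a minimal sketch of the same extraction done with one anchored pattern and named groups, using only the standard-library `re` module. The names `LINE_PATTERN` and `parse_line` are hypothetical, and the pattern assumes the '[datetime] username: text' layout seen in the demo export.

```python
import re

# Assumed layout: '[datetime] username: text', anchored at the start of the line
LINE_PATTERN = re.compile(
    r'^\[(?P<date_time>[^\]]+)\] (?P<username>[^:]+): (?P<text>.*)$',
    re.DOTALL,
)

def parse_line(line):
    """Return (text, date_time, username); the last two are None for continuation lines."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return (line, None, None)
    return (match.group('text'), match.group('date_time'), match.group('username'))

header = parse_line("[2020-02-29, 6:00:23 PM] User 1: note: see [1]")
continuation = parse_line("and one and two and three,")
print(header)
print(continuation)
```

Because the pattern is anchored to the start of the line, brackets or colons that appear later in the message body do not affect which parts are treated as the datetime and the username.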


Before we proceed, let's check whether any rows have None / NaN values in the text column.

In [581]:
messages[messages['text_original'].isna()]

Unnamed: 0,text_original,date_time,username


Now we can fill in the NaN values in the 'date_time' and 'username' columns by considering those messages to have been sent by the user in the row above, at the time in the row above.

In [582]:
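# Forward-fill only the two columns that can be NaN (fillna(method='ffill') is deprecated in newer pandas)
messages[['date_time', 'username']] = messages[['date_time', 'username']].ffill()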
messages.iloc[7:11]

Unnamed: 0,text_original,date_time,username
7,"☺ Twelve struck,","2020-02-29, 6:00:23 PM",User 1
8,"and one and two and three,","2020-02-29, 6:00:23 PM",User 1
9,and still we sat waiting silently for whatever...,"2020-02-29, 6:00:23 PM",User 1
10,video omitted,"2020-02-29, 6:15:12 PM",User 1
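
If the continuation rows ever need to be stitched back into a single message, one possible approach is to number the messages before forward-filling and then group on that number. This is only a sketch on a hypothetical `demo` frame; the `message_id` column is an illustrative name and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataframe before forward-filling
demo = pd.DataFrame({
    'text_original': ['Twelve struck,', 'and one and two and three,',
                      'and still we sat waiting', 'video omitted'],
    'date_time': ['2020-02-29, 6:00:23 PM', np.nan, np.nan, '2020-02-29, 6:15:12 PM'],
    'username': ['User 1', np.nan, np.nan, 'User 1'],
})

# A new message starts wherever a datetime was extracted; continuation rows share its id
demo['message_id'] = demo['date_time'].notna().cumsum()

# Fill the gaps from the row above, then join the text of each message
demo[['date_time', 'username']] = demo[['date_time', 'username']].ffill()
merged = demo.groupby('message_id', as_index=False).agg(
    {'text_original': '\n'.join, 'date_time': 'first', 'username': 'first'}
)
print(merged)
```

Numbering has to happen before the fill, because once the NaN values are gone the continuation rows can no longer be told apart from separate messages sent at the same second.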


Let's leverage Python's `datetime` module (together with `pytz`) to convert our date_time column from strings to timezone-aware datetime objects (localized in my case to Toronto, Canada).

In [583]:
local_timezone = timezone('America/Toronto')

In [584]:
messages['date_time'] = messages['date_time'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d, %I:%M:%S %p'))
messages['date_time'] = messages['date_time'].apply(lambda x: local_timezone.localize(x))
messages.head()

Unnamed: 0,text_original,date_time,username
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2
3,😯,2020-02-28 02:56:08-05:00,User 2
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1
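
The same conversion can also be sketched in vectorized form with pandas' own datetime tools. The `stamps` series below is a hypothetical stand-in for the date_time column, and the format string is the one used above.

```python
import pandas as pd

# Hypothetical stand-in for the string-valued date_time column
stamps = pd.Series(['2020-02-28, 2:55:53 AM', '2020-02-29, 6:00:23 PM'])

# Parse all strings in one call, then attach the local timezone
parsed = pd.to_datetime(stamps, format='%Y-%m-%d, %I:%M:%S %p')
localized = parsed.dt.tz_localize('America/Toronto')
print(localized)
```

For timestamps that fall inside a daylight-saving transition, `tz_localize` accepts `ambiguous` and `nonexistent` arguments; which setting is appropriate depends on the chat log, so none is chosen here.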


Now for emojis: let's create another helper function to extract emojis into a separate dataframe column.<br>
**Note:** This was a more complicated endeavour than I had initially anticipated. I modified a solution found [here](https://stackoverflow.com/questions/49113909/split-and-count-emojis-and-words-in-a-given-string-in-python?noredirect=1&lq=1) to create a (very un-optimized!) solution. [This](https://www.regular-expressions.info/) site and [this](https://stackoverflow.com/questions/9928505/what-does-the-expression-x-match-when-inside-a-regex) post also helped!

In [585]:
# Define function to extract and process emojis
def extract_emojis(text):
    '''
    Input:  String (utf-8 encoding) containing emojis 
    Output: Tuple of the following: (original string with emojis removed, list of all emojis found in the string)
    '''
    # Use regex to split up our string. '\X' captures composite unicode emojis as a single emoji
    data = regex.findall(r'\X',text)    
    
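    # Create a list of all emojis present in data**
    # Note: this relies on emoji.UNICODE_EMOJI being a flat dict keyed by emoji character, as in older
    # releases of the emoji package; 2.x releases replace it with emoji.EMOJI_DATA
    all_emojis = [symbol for symbol in data if any(char in emoji.UNICODE_EMOJI for char in symbol)]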
            
    # Remove emojis from the given text
    for emjs in all_emojis:
        text = text.replace(emjs, '') 

    return (text, all_emojis)


#   **For reference, here's the original emoji list creation code without the one-liner list comprehension
#    all_emojis = []
#    for word in data:
#        if any(char in emoji.UNICODE_EMOJI for char in word):  
#            all_emojis += [word]

In [586]:
# Apply it to our dataframe and extract out the emojis
messages['text_original'], messages['emojis'] = zip(*messages['text_original'].apply(extract_emojis))
messages.head()

Unnamed: 0,text_original,date_time,username,emojis
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,[]
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,[😯]
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,[]
3,,2020-02-28 02:56:08-05:00,User 2,[😯]
4,Far away we could hear the deep tones of the ...,2020-02-28 02:56:27-05:00,User 1,"[😯, ☺]"


Awesome! Now before we process our text data any further, let's calculate the raw text string length:

In [587]:
messages['text_str_length'] = messages['text_original'].apply(len)
messages.head()

Unnamed: 0,text_original,date_time,username,emojis,text_str_length
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,[],52
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,[😯],55
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,[],56
3,,2020-02-28 02:56:08-05:00,User 2,[😯],0
4,Far away we could hear the deep tones of the ...,2020-02-28 02:56:27-05:00,User 1,"[😯, ☺]",59


In [588]:
messages.iloc[7:11]

Unnamed: 0,text_original,date_time,username,emojis,text_str_length
7,"Twelve struck,",2020-02-29 18:00:23-05:00,User 1,[☺],15
8,"and one and two and three,",2020-02-29 18:00:23-05:00,User 1,[],26
9,and still we sat waiting silently for whatever...,2020-02-29 18:00:23-05:00,User 1,[],61
10,video omitted,2020-02-29 18:15:12-05:00,User 1,[],13


### Preprocessing Text Data

Now we're ready to work directly on the text data and make it easier to extract insights from. We'll begin with library imports, primarily from NLTK.

In [589]:
# NLTK for all our language processing needs (tokenization, lemmatization and stopword removal)
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords 

In [590]:
# If needed (nltk itself must be imported before calling the downloader):
# import nltk
# nltk.download()

First step: [Tokenizing](https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/) our corpus of text messages!

In [591]:
messages['text_processed'] = messages['text_original'].apply(word_tokenize)
messages.head()

Unnamed: 0,text_original,date_time,username,emojis,text_str_length,text_processed
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,[],52,"[From, outside, came, the, occasional, cry, of..."
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,[😯],55,"[and, once, at, our, very, window, a, long, dr..."
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,[],56,"[\n, which, told, us, that, the, cheetah, was,..."
3,,2020-02-28 02:56:08-05:00,User 2,[😯],0,[]
4,Far away we could hear the deep tones of the ...,2020-02-28 02:56:27-05:00,User 1,"[😯, ☺]",59,"[Far, away, we, could, hear, the, deep, tones,..."


In [592]:
# Note for future reference: can leverage 'image omitted' and 'video omitted' to get picture / video counts
# Filter rows on text_str_length to narrow the search (13 for 'video omitted' in the demo log above)
# Can also handle 'Missed voice call' similarly
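
A minimal sketch of how that media count might look, matching on the placeholder text directly instead of its length. The `demo` frame and the `MEDIA_PLACEHOLDERS` list are hypothetical. Stripping an invisible left-to-right mark ('\u200e') is an assumption about what some exports place before placeholders; it is not something the demo log shows.

```python
import pandas as pd

# Hypothetical miniature of the messages dataframe
demo = pd.DataFrame({
    'username': ['User 1', 'User 1', 'User 2', 'User 2'],
    'text_original': ['video omitted', 'Twelve struck,', '\u200eimage omitted', 'image omitted'],
})

# Placeholder strings taken from the note above
MEDIA_PLACEHOLDERS = ['image omitted', 'video omitted', 'Missed voice call']

# Assumption: drop a possible invisible left-to-right mark and surrounding whitespace
cleaned = demo['text_original'].str.replace('\u200e', '', regex=False).str.strip()
is_media = cleaned.isin(MEDIA_PLACEHOLDERS)

# Count each placeholder type per user
media = demo.assign(kind=cleaned)[is_media]
media_counts = media.groupby(['username', 'kind']).size()
print(media_counts)
```

Matching on the text avoids miscounting ordinary messages that happen to have the same length as a placeholder.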