# WhatsApp Chat History Data Visualization

This is a *work-in-progress* notebook for prototyping my chat log data visualization tool.

### Imports

Let's start off by importing our bread-and-butter data and visualization libraries:

In [74]:
# Data and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

We will also import the custom dictionaries defined in `Chat-History-Custom-Functs.ipynb` (which will be used in the text normalization process). Feel free to edit the dictionaries based on the desired normalization in your text.

In [75]:
%run Chat-History-Custom-Functs.ipynb

### Data Extraction

We'll read the WhatsApp chat log (exported from an iOS device) into a dataframe and make a deep copy on which to try out all of our preprocessing. <br>
**Note:** For privacy purposes, the chat log available on GitHub is much shorter and consists of dummy text, although it keeps WhatsApp's export style.

In [76]:
# Read in WhatsApp chat log to a dataframe

imported_messages = pd.read_csv('chat.txt', delimiter='\n', skiprows=[0], names = ['text_raw'])
imported_messages.head()

Unnamed: 0,text_raw
0,"[2020-02-28, 2:55:53 AM] User 1: From outside ..."
1,"[2020-02-28, 2:55:53 AM] User 2: and once at o..."
2,"[2020-02-28, 2:56:07 AM] User 2: \n which told..."
3,"[2020-02-28, 2:56:08 AM] User 2: hah lmaoooo w..."
4,"[2020-02-28, 2:56:27 AM] User 1: 😯 Far away we..."


**Note:** Using '\n' as the delimiter results in messages with embedded line breaks escaping to new rows in the dataframe. These rows will not have the '*\[datetime\] username: text*' pattern seen in other rows, so we need to handle these appropriately when separating out datetimes and usernames.
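One way to handle those stray rows, sketched here as an alternative to the forward-fill used later in this notebook, is to glue continuation lines back onto the message they belong to before building the dataframe. The `MESSAGE_START` pattern and the `merge_continuation_lines` name are my own assumptions, based on the timestamp style seen in the export above, so adjust the pattern if your export looks different.

```python
import re

# Assumed timestamp prefix of an export line, e.g. "[2020-02-29, 6:00:23 PM] ".
# An optional left-to-right mark (U+200E) before the bracket is tolerated.
MESSAGE_START = re.compile(
    r"^\u200e?\[\d{4}-\d{2}-\d{2}, \d{1,2}:\d{2}:\d{2} [AP]M\] "
)

def merge_continuation_lines(lines):
    """Join lines that lack a timestamp prefix onto the preceding message."""
    merged = []
    for line in lines:
        if MESSAGE_START.match(line) or not merged:
            # A new message starts here (or there is nothing to attach to yet)
            merged.append(line)
        else:
            # Continuation of the previous message: restore the line break
            merged[-1] += "\n" + line
    return merged

sample_lines = [
    "[2020-02-29, 6:00:23 PM] User 1: Twelve struck,",
    "and one and two and three,",
    "[2020-02-29, 6:15:12 PM] User 1: video omitted",
]
for message in merge_continuation_lines(sample_lines):
    print(repr(message))
```

The trade-off: merged messages keep their original length and line breaks, whereas the row-per-line approach in this notebook treats every line as its own message.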

In [77]:
# Deep copy into a working dataframe for preprocessing / cleaning

messages = imported_messages.copy(deep=True)
messages.iloc[7:11]

Unnamed: 0,text_raw
7,"[2020-02-29, 6:00:23 PM] User 1: ☺ Twelve struck,"
8,"and one and two and three,"
9,and still we sat waiting silently for whatever...
10,"[2020-02-29, 6:15:12 PM] User 1: ‎video omitted"


### Preprocessing Non-Text Data

Alright, now let's work on extracting the datetime and username fields by leveraging [regular expressions](https://jakevdp.github.io/WhirlwindTourOfPython/14-strings-and-regular-expressions.html). Let's start with some useful libraries.

In [78]:
# Libraries to help us handle dates/times
import datetime as dt
from pytz import timezone

# Library for regular expressions
import regex

# Library to handle emojis in text
import emoji

We'll define a helper function to aid us with extracting usernames and datetimes, then apply it to our dataframe.

In [79]:
# Function to extract datetime and username as text
def extract_datetime_username(text):
    """
    Note:   Requires regex module to be imported
    Input:  String of text which may contain '[...]' text pattern
    Output: Tuple of the following: (String to the right of the ': ' text pattern      OR original text string ,
                                     String with contents of the '[...]' text pattern  OR NaN , 
                                     String between the '[...]' and ': ' text patterns OR NaN )
    """
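    # Regex to find a leading '[...]' pattern (anchored at the start, so brackets inside a message body are ignored)
    # An optional left-to-right mark (U+200E) before the bracket is tolerated, in case the export contains one
    date_time = regex.match(r'\u200e?\[(.*?)\]', text)
    
    # Output based on pattern search result
    if date_time:
        # Split only once, so any '] ' or ': ' inside the message body is preserved
        text_remainder = text.split("] ", 1)[1]
        text_username = text_remainder.split(": ", 1)
        return (text_username[1], date_time.group(1), text_username[0])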
    else:
        return (text, np.nan, np.nan)

In [80]:
# Apply the function to our dataframe and extract out the datetimes and usernames
messages['text_raw'], messages['date_time'], messages['username'] = zip(*messages['text_raw'].apply(extract_datetime_username))
messages.head()

Unnamed: 0,text_raw,date_time,username
0,From outside came the occasional cry of a nigh...,"2020-02-28, 2:55:53 AM",User 1
1,and once at our very window a long drawn catli...,"2020-02-28, 2:55:53 AM",User 2
2,\n which told us that the cheetah was indeed a...,"2020-02-28, 2:56:07 AM",User 2
3,hah lmaoooo wooowww,"2020-02-28, 2:56:08 AM",User 2
4,😯 Far away we could hear the deep tones of the...,"2020-02-28, 2:56:27 AM",User 1


In [81]:
# Check to ensure functionality is as intended on rows with embedded line breaks
messages.iloc[7:11]

Unnamed: 0,text_raw,date_time,username
7,"☺ Twelve struck,","2020-02-29, 6:00:23 PM",User 1
8,"and one and two and three,",,
9,and still we sat waiting silently for whatever...,,
10,‎video omitted,"2020-02-29, 6:15:12 PM",User 1


Before we proceed, let's check whether any rows have missing (None / NaN) text.

In [82]:
messages[messages['text_raw'].isna()]

Unnamed: 0,text_raw,date_time,username


Now we can fill in the NaN values in the 'date_time' and 'username' columns by considering those messages to have been sent by the user in the row above, at the time in the row above.

In [83]:
# DataFrame.ffill() replaces fillna(method='ffill'), which recent pandas releases deprecate
messages.ffill(inplace=True)
messages.iloc[7:11]

Unnamed: 0,text_raw,date_time,username
7,"☺ Twelve struck,","2020-02-29, 6:00:23 PM",User 1
8,"and one and two and three,","2020-02-29, 6:00:23 PM",User 1
9,and still we sat waiting silently for whatever...,"2020-02-29, 6:00:23 PM",User 1
10,‎video omitted,"2020-02-29, 6:15:12 PM",User 1
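To make the forward-fill idea concrete, here is a minimal plain-Python sketch of what it does to a single column. The `forward_fill` helper is hypothetical and uses `None` as the missing-value placeholder, whereas the dataframe above uses NaN.

```python
def forward_fill(values, missing=None):
    """Replace each missing entry with the most recent non-missing one."""
    filled = []
    last_seen = missing
    for value in values:
        if value is not missing:
            # Remember the latest real value
            last_seen = value
        # Missing entries inherit whatever was seen last
        filled.append(last_seen)
    return filled

usernames = ["User 1", None, None, "User 2"]
print(forward_fill(usernames))  # ['User 1', 'User 1', 'User 1', 'User 2']
```

Note that a missing value at the very start has nothing to inherit from and stays missing, which is also how the dataframe version behaves.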


Let's leverage Python's `datetime` module to convert our date_time column from a string to handy datetime objects (localized in my case to Toronto, Canada).

In [84]:
local_timezone = timezone('America/Toronto')

In [85]:
messages['date_time'] = messages['date_time'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d, %I:%M:%S %p'))
messages['date_time'] = messages['date_time'].apply(lambda x: local_timezone.localize(x))
messages.head()

Unnamed: 0,text_raw,date_time,username
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2
3,hah lmaoooo wooowww,2020-02-28 02:56:08-05:00,User 2
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1
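For reference, here is a small standalone sketch of how the format string maps the export's 12-hour timestamps onto datetime objects. The `parse_timestamp` wrapper and `TIMESTAMP_FORMAT` constant are names I made up for illustration; the sketch stops short of the timezone step and returns naive datetimes.

```python
import datetime as dt

# Same format string as used above: date, comma, 12-hour clock with AM/PM
TIMESTAMP_FORMAT = "%Y-%m-%d, %I:%M:%S %p"

def parse_timestamp(raw):
    """Return a naive datetime, or None when the string does not match."""
    try:
        return dt.datetime.strptime(raw, TIMESTAMP_FORMAT)
    except ValueError:
        return None

print(parse_timestamp("2020-02-29, 6:00:23 PM"))   # 2020-02-29 18:00:23
print(parse_timestamp("2020-02-28, 12:05:00 AM"))  # 2020-02-28 00:05:00
print(parse_timestamp("video omitted"))            # None
```

Returning `None` instead of raising makes it easier to spot rows whose timestamp did not survive extraction.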


Now for emojis: let's create another helper function to extract emojis into a separate dataframe column.<br>
**Note:** This was a more complicated endeavour than I had initially anticipated. I modified a solution found [here](https://stackoverflow.com/questions/49113909/split-and-count-emojis-and-words-in-a-given-string-in-python?noredirect=1&lq=1) to create a (very un-optimized!) solution. [This](https://www.regular-expressions.info/) site and [this](https://stackoverflow.com/questions/9928505/what-does-the-expression-x-match-when-inside-a-regex) post also helped!

In [86]:
# Define function to extract and process emojis
def extract_emojis(text):
    '''
    Input:  String (utf-8 encoding) containing emojis 
    Output: Tuple of the following: (original string with emojis removed, list of all emojis found in the string)
    '''
    # Use regex to split up our string. '\X' captures composite unicode emojis as a single emoji
    data = regex.findall(r'\X',text)    
    
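    # Create a list of all emojis present in data**
    # Note: this was written against an older release of the emoji package; emoji 2.0 and later
    # no longer provide UNICODE_EMOJI (EMOJI_DATA takes its place)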
    all_emojis = [symbol for symbol in data if any(char in emoji.UNICODE_EMOJI for char in symbol)]
            
    # Remove emojis from the given text
    for emjs in all_emojis:
        text = text.replace(emjs, '') 

    return (text, all_emojis)


#   **For reference, here's the original emoji list creation code without the one-liner list comprehension
#    all_emojis = []
#    for word in data:
#        if any(char in emoji.UNICODE_EMOJI for char in word):  
#            all_emojis += [word]

In [87]:
# Apply it to our dataframe and extract out the emojis
messages['text_processed'], messages['emojis'] = zip(*messages['text_raw'].apply(extract_emojis))
messages.head()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,From outside came the occasional cry of a nigh...,[]
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,and once at our very window a long drawn catli...,"[😯, 😯, 😯]"
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,\n which told us that the cheetah was indeed a...,[]
3,hah lmaoooo wooowww,2020-02-28 02:56:08-05:00,User 2,hah lmaoooo wooowww,[]
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1,Far away we could hear the deep tones of the ...,"[😯, ☺]"
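If the `regex` and `emoji` packages are not available, a much rougher approximation is possible with the standard library alone. The sketch below is not what this notebook uses: it treats every character in the Unicode category 'So' (Symbol, other) as an emoji, which also catches symbols like © and does not keep composite emojis (skin tones, joined sequences) together. The `extract_symbols` name is hypothetical.

```python
import unicodedata

def extract_symbols(text):
    """Split text into (text without 'So' characters, list of 'So' characters)."""
    # 'So' = Symbol, other: covers most pictographic emojis, plus some non-emoji symbols
    symbols = [ch for ch in text if unicodedata.category(ch) == "So"]
    remainder = "".join(ch for ch in text if unicodedata.category(ch) != "So")
    return (remainder, symbols)

print(extract_symbols("😯 Far away ☺"))
```

As in the function above, the space that followed a removed emoji is left in place, so a later strip or tokenization step still has to deal with it.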


### Preprocessing Text Data

Now we're ready to work directly on the text data and make it easier to extract insights from. We'll begin with library imports, primarily from NLTK.

In [88]:
# Python String Library
import string

# NLTK for all our language processing needs (tokenization and stopword removal - lemmatization if our data ends up too 'muddy')
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords 
#from nltk.stem import WordNetLemmatizer 
from normalise import normalise

In [89]:
# If needed:
# nltk.download()

For reference, let's outline what our [text preprocessing](https://www.kdnuggets.com/2019/04/text-preprocessing-nlp-machine-learning.html) pipeline will look like. <br><br>
**Pipeline:**
1. Remove noise from each message
2. Tokenize each message
3. Categorize each message type (text / picture / video) (*Note: voice call missed / video call missed not implemented yet*)
4. Normalize the corpus of messages
5. Remove stopwords from each message

**Step 1:** <br>
Since we've removed all the emojis, let's clean up the text further by lowercasing all text and removing 'non-printable' characters (co-opting [this solution](https://stackoverflow.com/questions/1342000/how-to-make-the-python-interpreter-correctly-handle-non-ascii-characters-in-stri)). <br>
**Note:** `str.isprintable` works on Unicode categories, so printable non-ASCII letters (including **non-Latin** scripts) are kept; what gets dropped are control and format characters such as line breaks, tabs and the invisible mark in front of *'video omitted'* in the raw export. This might still cause unintended behaviour if the message corpus relies on invisible joining characters.

In [90]:
# Define function for lowercasing all text
def convert_lowercase(s):
    return s.lower()

# Define function for removing non-printable characters
def remove_non_printables(s):
    return "".join(x for x in s if str.isprintable(x))

In [91]:
# Apply it to our dataframe to clean the raw text data
messages['text_processed'] = messages['text_processed'].apply(remove_non_printables)
messages['text_processed'] = messages['text_processed'].apply(convert_lowercase)
messages.iloc[10]['text_processed']

'video omitted'
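To see which characters this filter actually touches, here is a quick standalone check. I'm assuming the invisible character in front of *'video omitted'* is the left-to-right mark U+200E; the other two strings are made-up examples.

```python
def remove_non_printables(s):
    # Keep only characters that Python considers printable
    return "".join(ch for ch in s if ch.isprintable())

print(repr(remove_non_printables("\u200evideo omitted")))  # 'video omitted'
print(repr(remove_non_printables("line one\nline two")))   # 'line oneline two'
print(repr(remove_non_printables("привет ☺")))             # 'привет ☺'
```

Line breaks are removed without leaving a space behind, so words on either side of one get fused. That is harmless here because each dataframe row holds a single line.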

Awesome! Now before we process our text data any further, let's store each message's length (since this is the best time to capture a 'true-to-typed-out' character count, excluding emojis):

In [92]:
messages['text_str_length'] = messages['text_processed'].apply(len)
messages.head()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,text_str_length
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,from outside came the occasional cry of a nigh...,[],52
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,and once at our very window a long drawn catli...,"[😯, 😯, 😯]",55
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,\n which told us that the cheetah was indeed a...,[],56
3,hah lmaoooo wooowww,2020-02-28 02:56:08-05:00,User 2,hah lmaoooo wooowww,[],19
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1,far away we could hear the deep tones of the ...,"[😯, ☺]",59


In [93]:
messages.iloc[7:11]

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,text_str_length
7,"☺ Twelve struck,",2020-02-29 18:00:23-05:00,User 1,"twelve struck,",[☺],15
8,"and one and two and three,",2020-02-29 18:00:23-05:00,User 1,"and one and two and three,",[],26
9,and still we sat waiting silently for whatever...,2020-02-29 18:00:23-05:00,User 1,and still we sat waiting silently for whatever...,[],61
10,‎video omitted,2020-02-29 18:15:12-05:00,User 1,video omitted,[],13


**Step 2:**<br>
Now, let's [tokenize](https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/) our corpus of text messages! In this case, we'll use a [regex tokenizer](https://kite.com/python/answers/how-to-remove-all-punctuation-marks-with-nltk-in-python) to ignore punctuation and only select words.

In [94]:
tokenizer = nltk.RegexpTokenizer(r"\w+")
messages['text_processed'] = messages['text_processed'].apply(lambda x: tokenizer.tokenize(x))
messages.head()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,text_str_length
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,"[from, outside, came, the, occasional, cry, of...",[],52
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,"[and, once, at, our, very, window, a, long, dr...","[😯, 😯, 😯]",55
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,"[n, which, told, us, that, the, cheetah, was, ...",[],56
3,hah lmaoooo wooowww,2020-02-28 02:56:08-05:00,User 2,"[hah, lmaoooo, wooowww]",[],19
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1,"[far, away, we, could, hear, the, deep, tones,...","[😯, ☺]",59
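For a feel of what the `\w+` pattern keeps and drops, here is a standard-library sketch that should behave the same way for this pattern (it simply collects the regex matches). The `tokenize_words` name is my own.

```python
import re

def tokenize_words(text):
    # Every maximal run of letters, digits or underscores becomes a token
    return re.findall(r"\w+", text)

print(tokenize_words("hah lmaoooo, wooowww!"))  # ['hah', 'lmaoooo', 'wooowww']
print(tokenize_words("don't stop"))             # ['don', 't', 'stop']
```

Apostrophes count as punctuation, so contractions are split into fragments like `don` and `t`. The NLTK stopword list used further down contains these fragments, so they get filtered out later.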


**Step 3:**<br>
Since the WhatsApp chat log is text-only, any image or video media is represented as a message with the text *'image omitted'* or *'video omitted'* in it. We can extract this information into a separate column and empty the respective `text_processed` field (since they aren't 'real' messages).

In [95]:
# Function for finding image and video messages
def find_message_type(tokenlist):
    '''
    Input:  Tokenized text string (i.e. list of words)
    Output: Tuple of the following: (String indicating type of message ['image', 'video', or 'text'],
                                     appropriate output token list for the message type)
    '''
    if len(tokenlist) == 2:
        if (tokenlist[1] == 'omitted'):
            
            if (tokenlist[0] == 'image'):
                return ('image', [])
            
            elif (tokenlist[0] == 'video'):
                return ('video', [])
                        
    return ('text', tokenlist)

In [96]:
messages['msg_type'], messages['text_processed'] = zip(*messages['text_processed'].apply(find_message_type))
messages.head()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,text_str_length,msg_type
0,From outside came the occasional cry of a nigh...,2020-02-28 02:55:53-05:00,User 1,"[from, outside, came, the, occasional, cry, of...",[],52,text
1,and once at our very window a long drawn catli...,2020-02-28 02:55:53-05:00,User 2,"[and, once, at, our, very, window, a, long, dr...","[😯, 😯, 😯]",55,text
2,\n which told us that the cheetah was indeed a...,2020-02-28 02:56:07-05:00,User 2,"[n, which, told, us, that, the, cheetah, was, ...",[],56,text
3,hah lmaoooo wooowww,2020-02-28 02:56:08-05:00,User 2,"[hah, lmaoooo, wooowww]",[],19,text
4,😯 Far away we could hear the deep tones of the...,2020-02-28 02:56:27-05:00,User 1,"[far, away, we, could, hear, the, deep, tones,...","[😯, ☺]",59,text


In [97]:
messages.tail()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,text_str_length,msg_type
30,but the sudden glare flashing into my weary eyes,2020-03-08 17:45:56-04:00,User 1,"[but, the, sudden, glare, flashing, into, my, ...",[],48,text
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,"[made, it, impossible, for, me, to, tell, what...",[],84,text
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[i, could, however, see, that, his, face, was,...","[😭, 😭, 😭]",88,text
33,‎Missed video call,2020-03-08 18:20:13-04:00,User 2,"[missed, video, call]",[],17,text
34,‎Missed voice call,2020-03-08 18:22:23-04:00,User 2,"[missed, voice, call]",[],17,text


In [98]:
print(messages['msg_type'].iloc[10], messages['text_processed'].iloc[10])

video []
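The missed voice / video call messages are not handled yet. One possible way to extend the idea, sketched below, is a lookup table keyed by the full token sequence. The `MESSAGE_TYPES` table, the `find_message_type_extended` name and the `missed_*_call` labels are hypothetical, not part of the notebook's pipeline.

```python
# Hypothetical lookup from a complete token sequence to a message-type label
MESSAGE_TYPES = {
    ("image", "omitted"): "image",
    ("video", "omitted"): "video",
    ("missed", "voice", "call"): "missed_voice_call",
    ("missed", "video", "call"): "missed_video_call",
}

def find_message_type_extended(tokenlist):
    """Return (message type, tokens), emptying the tokens for non-text messages."""
    msg_type = MESSAGE_TYPES.get(tuple(tokenlist))
    if msg_type is None:
        # Anything not in the table is treated as a normal text message
        return ("text", tokenlist)
    return (msg_type, [])

print(find_message_type_extended(["missed", "voice", "call"]))
print(find_message_type_extended(["hah", "lmaoooo", "wooowww"]))
```

Like the original function, this misclassifies a user who literally types one of these phrases as a whole message; checking the raw text for the invisible leading mark would be one way to tell them apart.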


**Step 4:** <br>
Next, let's [normalize](https://github.com/EFord36/normalise) our corpus to get rid of any spelling errors for common words. 

<br> **Note:** we can also leverage this to define a custom dictionary of common 'chat-specific' slang. The intent is to clump together similar words as much as possible (for example, treating *'lol'* and *'loool'* as the same word *'lol'*). The default `normalise()` method appears to handle cases of formal English words with repeated single characters well, such as the previous example. However, we need to add the 'root' slang to the [custom abbreviation dictionary](https://towardsdatascience.com/nlp-text-preprocessing-and-cleaning-pipeline-in-python-3bafaf54ac35) since it attempts to change the root words to a 'correct' English word. So **YMMV** based on the amount and complexity of slang in the message corpus.

<br> **Update 01APR20: `normalise` was dropped because its performance was abysmal; I wrote my own custom slang normalization function instead.**  

In [99]:
# 'Custom' normalizer:

# 1) Define sets of 'slang' strings based on type of expected non-normalization, for example:
# --- end-letter repeats (omgggg -> omg)
# --- mid-letter repeats (looool -> lol)         [implemented via separate 'custom' dict instead]
# --- combination repeats (wooowww -> wow)
# --- two-letter repeats (hahahaha -> haha)      [implemented via separate 'custom' dict instead]
# 2) Generate dicts which map slang variant to its 'root' form.
# --- Will be specific for each set and have an 'upper bound' number of repeats
# 3) Combine to a master 'normalization' dict. Define the custom normalization function to parse through the dict and evaluate feasibility of scaling up

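# Expected runtime: scanning every key for every token would be O(n*m), where n = number of tokens in corpus, m = number of keys in master normalization lookup dict
# Using dict membership tests instead (as normalize_slang below does) brings this down to O(n) on average
# Either way, m will potentially be at least 1-2 orders of magnitude smaller than using `normalise` library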

# Custom / User defined:
# --- 'base' slang words and their 'type'
# --- 'upper bound' on number of repeats for dict generation

In [100]:
# Utility function to generate dicts of slang strings based on their type
def generate_slang_variants(slang_dict, LIMIT = 5):
    """
    Input:    LIMIT:            Integer defining maximum number of repeated letters.
                                Default is 5 (i.e, 5 repeated single / double letters in the variant)
              slang_dict:       Dictionary of string: integer pairs.
                                Strings are slang 'base' words. Integers are the slang variant 'type' defined below:
                                1 == Repeating single final letter (eg. lmaooo -> lmao)
                                2 == Repeating double final letters (eg. wooowww -> wow)
                                
    Output:   slang_lookup:     Dictionary of slang-variant : slang-root pairs
    """
    slang_lookup = {}
    
    for root in slang_dict: #root[-2:]
                
        if slang_dict[root] == 1:
            # Case 1: Repeat the single final letter up to limit. Add to final dictionary with root as value
            for i in range(LIMIT):
                variant = root + (root[-1]*i)
                slang_lookup.update({variant : root})
            
        elif slang_dict[root] == 2:
            # Case 2: Generate combinations of both final letters up to limit. Add to final dictionary with root as value
            for i in range(1, LIMIT+1):
                for j in range(1, LIMIT+1):
                    variant = root[:-2] + (root[-2]*i) + (root[-1]*j)
                    slang_lookup.update({variant : root})
    
    return slang_lookup


In [101]:
# Primary function to normalize slang variants to their 'base' slang form
def normalize_slang(tokenlist, slang_lookup):
    """
    Input:    tokenlist:    Tokenized corpus to be parsed and normalized
              slang_lookup: Dictionary of slang-variant : slang-root pairs

    Output:   tokenlist_normalized:    Normalized, tokenized corpus
    """
    tokenlist_normalized = []
    
    for token in tokenlist:
        if token in slang_lookup:
            tokenlist_normalized.append(slang_lookup[token])
        else:
            tokenlist_normalized.append(token)
    
    return tokenlist_normalized


In [102]:
testmessage = messages['text_processed'].iloc[3]
print(testmessage)

['hah', 'lmaoooo', 'wooowww']


In [103]:
# Generate final dictionary mapping slang variants to their 'base' slang word.
slang_lookup = generate_slang_variants(slang_dict, VARIANT_LIMIT)

# 'Custom' slang variants are added to this final dictionary as well
slang_lookup.update(slang_special_cases)
print(slang_lookup)

{'lmao': 'lmao', 'lmaoo': 'lmao', 'lmaooo': 'lmao', 'lmaoooo': 'lmao', 'lmaooooo': 'lmao', 'wow': 'wow', 'woww': 'wow', 'wowww': 'wow', 'wowwww': 'wow', 'wowwwww': 'wow', 'woow': 'wow', 'wooww': 'wow', 'woowww': 'wow', 'woowwww': 'wow', 'woowwwww': 'wow', 'wooow': 'wow', 'woooww': 'wow', 'wooowww': 'wow', 'wooowwww': 'wow', 'wooowwwww': 'wow', 'woooow': 'wow', 'wooooww': 'wow', 'woooowww': 'wow', 'woooowwww': 'wow', 'woooowwwww': 'wow', 'wooooow': 'wow', 'woooooww': 'wow', 'wooooowww': 'wow', 'wooooowwww': 'wow', 'wooooowwwww': 'wow', 'ha': 'hahaha', 'hah': 'hahaha', 'haha': 'hahaha'}


In [104]:
testmessage_norm = normalize_slang(testmessage, slang_lookup)
print(testmessage_norm)

['hahaha', 'lmao', 'wow']
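As a point of comparison, repeated letters can also be collapsed with a single regular expression instead of a generated lookup dict. This is a different technique from the one used in this notebook, and the `collapse_repeats` name is my own: it shrinks any run of three or more identical characters down to one.

```python
import re

# A word character followed by at least two more copies of itself
REPEATED_RUN = re.compile(r"(\w)\1{2,}")

def collapse_repeats(token):
    """Collapse runs of 3+ identical characters to a single character."""
    return REPEATED_RUN.sub(r"\1", token)

for token in ["lmaoooo", "wooowww", "omgggg", "hahaha", "cooool"]:
    print(token, "->", collapse_repeats(token))
```

It needs no per-word setup, but it has no notion of the 'root' word: `cooool` becomes `col` rather than `cool`, and two-letter repeats like `hahaha` are untouched. The dict-based approach trades more setup for that control.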


In [105]:
messages['text_processed'] = messages['text_processed'].apply(lambda x: normalize_slang(x, slang_lookup))
messages.tail()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,text_str_length,msg_type
30,but the sudden glare flashing into my weary eyes,2020-03-08 17:45:56-04:00,User 1,"[but, the, sudden, glare, flashing, into, my, ...",[],48,text
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,"[made, it, impossible, for, me, to, tell, what...",[],84,text
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[i, could, however, see, that, his, face, was,...","[😭, 😭, 😭]",88,text
33,‎Missed video call,2020-03-08 18:20:13-04:00,User 2,"[missed, video, call]",[],17,text
34,‎Missed voice call,2020-03-08 18:22:23-04:00,User 2,"[missed, voice, call]",[],17,text


**Step 5:**<br>
Next, let's remove common [stopwords](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/) from our tokenized corpus. **Note:** Our helper function will also turn our text to all lowercase.

In [106]:
# Define set of stopwords (a set keeps membership checks fast)
stopwords_list = set(stopwords.words('english'))
print(stopwords_list)

# List comprehension helper function to remove stopwords
def remove_stopwords(tokenlist):
    return [word.lower() for word in tokenlist if word.lower() not in stopwords_list]

{'how', 'than', "should've", 't', 'doesn', 'some', 'yours', 'only', 'needn', 'his', 'their', 'now', 'her', "hadn't", "hasn't", 'am', 'a', 'is', 'shouldn', 'you', 's', "she's", 'yourself', 'be', 'if', 'll', 'further', 'while', 'ma', 'after', 'theirs', 'we', 'in', 'both', "mightn't", 'herself', 'so', 'themselves', 'during', 'once', 'its', 'myself', "weren't", 'hadn', 'most', "you've", 'against', 'all', "doesn't", 'him', 'itself', 'do', 'own', 'wasn', 'they', 'it', 'nor', 'those', "isn't", 'isn', 'couldn', 'more', 'didn', 'm', 'mustn', 'weren', 'have', 'me', 'been', 'between', 'will', 'y', "wasn't", 'each', 'not', "you're", "you'll", 'being', 'before', "couldn't", 'by', 'did', 'where', "wouldn't", 'haven', 'at', 'that', 'then', 'as', "didn't", 'or', 're', 'he', 'no', 'having', 'into', 'yourselves', 'with', 'any', 'ours', 'were', 'these', 'o', 'our', "that'll", 'when', 'again', "won't", 'mightn', 'but', 'on', "mustn't", 'she', 'don', 'down', 've', 'i', 'should', 'does', 'few', 'from', 'who

In [107]:
messages['text_processed'] = messages['text_processed'].apply(remove_stopwords)
messages.tail()

Unnamed: 0,text_raw,date_time,username,text_processed,emojis,text_str_length,msg_type
30,but the sudden glare flashing into my weary eyes,2020-03-08 17:45:56-04:00,User 1,"[sudden, glare, flashing, weary, eyes]",[],48,text
31,made it impossible for me to tell what it was ...,2020-03-08 17:45:56-04:00,User 1,"[made, impossible, tell, friend, lashed, savag...",[],84,text
32,"I could, however, see that his face was deadly...",2020-03-08 17:50:34-04:00,User 2,"[could, however, see, face, deadly, pale, fill...","[😭, 😭, 😭]",88,text
33,‎Missed video call,2020-03-08 18:20:13-04:00,User 2,"[missed, video, call]",[],17,text
34,‎Missed voice call,2020-03-08 18:22:23-04:00,User 2,"[missed, voice, call]",[],17,text
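Here is the same filtering step on a single message, as a standalone sketch that doesn't need the NLTK corpus download. `EXAMPLE_STOPWORDS` is a tiny hand-picked subset for illustration only, and this version takes the stopword set as a parameter rather than reading a global.

```python
# Tiny illustrative subset; the real pipeline uses NLTK's full English list
EXAMPLE_STOPWORDS = {"but", "the", "into", "my"}

def remove_stopwords(tokenlist, stopword_set):
    # Lowercase each token, then keep it only if it is not a stopword
    return [word.lower() for word in tokenlist if word.lower() not in stopword_set]

tokens = ["but", "the", "sudden", "glare", "flashing", "into", "my", "weary", "eyes"]
print(remove_stopwords(tokens, EXAMPLE_STOPWORDS))
# ['sudden', 'glare', 'flashing', 'weary', 'eyes']
```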


### Exploratory Data Analysis

In [111]:
text_processed_master = messages['text_processed'].explode().value_counts()
text_processed_master.head()

lmao      5
hahaha    4
heard     3
see       3
sound     3
Name: text_processed, dtype: int64

In [112]:
emoji_master = messages['emojis'].explode().value_counts()
emoji_master.head()

😯    11
😭     7
😆     2
👀     2
☺     2
Name: emojis, dtype: int64
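For anyone wanting the same counts without pandas, flattening the per-message lists and counting with `collections.Counter` gives an equivalent tally. The `count_items` helper and the sample data are made up for illustration.

```python
from collections import Counter
from itertools import chain

def count_items(list_column):
    """Count every item across a sequence of lists (empty lists add nothing)."""
    return Counter(chain.from_iterable(list_column))

token_lists = [["hahaha", "lmao", "wow"], [], ["lmao", "heard"], ["lmao"]]
counts = count_items(token_lists)
print(counts.most_common(1))  # [('lmao', 3)]
```

Empty lists (for example media messages) simply contribute nothing to the tally, so they never show up in the result.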