In [1]:
# Importing the required libraries
import numpy as np
import pandas as pd
import re
import json
from bs4 import BeautifulSoup

### Step 1: Import and Parse the .json files into the python environment.

Python's built-in json library enables us to parse the dataset. We observe that the JSON files map naturally onto Python objects: they consist of dictionaries, strings, numbers, lists and nested lists.
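As a minimal sketch of this correspondence, the snippet below parses a small hand-written JSON string with json.loads and inspects the resulting Python types. The record is hypothetical and only mimics the shape of one email entry; it is not taken from the dataset.

```python
import json

# Hypothetical, hand-written record that mimics the shape of one email entry
sample = '[{"subject": "Website Update", "isRead": true, "categories": [], "body": {"contentType": "html", "content": "<p>Hi</p>"}}]'

records = json.loads(sample)   # JSON array  -> Python list
first = records[0]             # JSON object -> Python dict

print(type(records).__name__, type(first).__name__)
print(first["isRead"], first["body"]["contentType"])
```

JSON true becomes Python True, and nested objects become nested dictionaries, which is why the 'body' entry can later be indexed like any other dictionary.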

### Description of the dataset

The dataset contains the details of 469 emails from one inbox. It is a list of dictionaries: each dictionary corresponds to one email, and each key of a dictionary holds a specific piece of information about that mail.

In [2]:
# To import the file in this manner, ensure that the data is in the same folder as the Jupyter (.ipynb) notebook
# This is one of the 4 files; it contains all the mails from a particular inbox
# The with-block closes the file automatically once it has been read
with open("anupriya@moduleq.com.emails.txt") as sourceFile:
    json_data = json.load(sourceFile)

In [3]:
print(len(json_data))

469


There are 469 mails in this list.

**In order to study the structure of each dictionary in this list, let us look at the keys of one of them.**

In [4]:
json_data[1].keys()

dict_keys(['@odata.etag', 'id', 'createdDateTime', 'lastModifiedDateTime', 'changeKey', 'categories', 'receivedDateTime', 'sentDateTime', 'hasAttachments', 'internetMessageId', 'subject', 'bodyPreview', 'importance', 'parentFolderId', 'conversationId', 'isDeliveryReceiptRequested', 'isReadReceiptRequested', 'isRead', 'isDraft', 'webLink', 'inferenceClassification', 'body', 'sender', 'from', 'toRecipients', 'ccRecipients', 'bccRecipients', 'replyTo'])

**Based on the set of keys printed above, we observe that from an Automatic Keyphrase Extraction perspective, many of the keys (details of the mail) are not relevant and can be excluded from further processing steps.**

Let us therefore convert this list of dictionaries to a pandas dataframe and then exclude the non-relevant information (columns).

In [5]:
import pandas as pd
email_dataframe = pd.DataFrame(json_data)

In [6]:
email_dataframe.shape

(469, 28)

Among the 28 columns, we retain only the subject line ('subject') and the content of the mail ('body'). The remaining columns, such as the send date and time, the attachment flag and the importance, do not contribute to Automatic Keyphrase Extraction, so we exclude them from the dataframe.

In [7]:
email_dataframe.columns.values

array(['@odata.etag', 'bccRecipients', 'body', 'bodyPreview',
       'categories', 'ccRecipients', 'changeKey', 'conversationId',
       'createdDateTime', 'from', 'hasAttachments', 'id', 'importance',
       'inferenceClassification', 'internetMessageId',
       'isDeliveryReceiptRequested', 'isDraft', 'isRead',
       'isReadReceiptRequested', 'lastModifiedDateTime', 'parentFolderId',
       'receivedDateTime', 'replyTo', 'sender', 'sentDateTime', 'subject',
       'toRecipients', 'webLink'], dtype=object)

In [8]:
# Considering only the reduced dataframe with only the required columns - 
reduced_dataframe = email_dataframe[['body','subject']]
reduced_dataframe.head(15)

Unnamed: 0,body,subject
0,"{'contentType': 'html', 'content': '<html> <h...",[JIRA] (MQ-482) Email fetch not working locally
1,"{'contentType': 'html', 'content': '<html> <h...",[JIRA] (MQ-525) Improve monitoring capability
2,"{'contentType': 'html', 'content': '<html> <h...",[JIRA] (MQ-525) Improve monitoring capability
3,"{'contentType': 'html', 'content': '<html> <h...",[JIRA] (MQ-539) Capture and log user input to Q
4,"{'contentType': 'html', 'content': '<html> <h...",[Confluence] MQ.ai > 2017-03-15
5,"{'contentType': 'html', 'content': '<html> <h...",Re: [moduleQ/MQ.ai] Changed 'Create priority' ...
6,"{'contentType': 'html', 'content': '<html> <h...",Re: [moduleQ/MQ.ai] Fixed truncation for morni...
7,"{'contentType': 'html', 'content': '<html> <h...",Re: [moduleQ/MQ.ai] MQ-527 duplicate key value...
8,"{'contentType': 'html', 'content': '<html> <h...",[moduleQ/MQ.ai] Feature/mq 554 own service url...
9,"{'contentType': 'html', 'content': '<html> <h...",[JIRA] (MQ-539) Capture and log user input to Q


**Based on the dataframe, we observe that the entry in the column 'body' is by itself a dictionary, with two keys. The first key 'contentType' can be excluded, as it redundantly states that the content type is html. The second key 'content' is what we need to focus on.**
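Because 'content' is a named key, the HTML can be read by key name rather than by its position in the dictionary. The sketch below shows this on two hypothetical records shaped like the rows above; the variable names are illustrative only.

```python
# Hypothetical records shaped like the rows of the dataframe above
records = [
    {"subject": "Website Update",
     "body": {"contentType": "html", "content": "<html><body>Hello</body></html>"}},
    {"subject": "Re: Website Update",
     "body": {"contentType": "html", "content": "<html><body>Thanks</body></html>"}},
]

# Read the HTML by key name rather than by its position in the dictionary
html_bodies = [record["body"]["content"] for record in records]

# Confirm that every body really is HTML before discarding 'contentType'
content_types = {record["body"]["contentType"] for record in records}

print(content_types)
print(html_bodies[0])
```

Checking the set of content types first is a cheap way to confirm that 'contentType' carries no extra information before it is dropped.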

**The second column of the dataframe is the subject of the mail, from which prefixes such as "[JIRA] (MQ-XXX)" and "Re: [moduleQ/MQ.ai]" are to be excluded.**
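One possible way to strip such prefixes is a single regular expression, sketched below. The pattern and the helper name strip_subject_prefix are assumptions made for illustration, based only on the subject lines visible in the table above; other inboxes may need different rules.

```python
import re

# Optional leading "Re: ", then a bracketed tag such as [JIRA] or [moduleQ/MQ.ai],
# then an optional JIRA issue key in parentheses, e.g. (MQ-482)
PREFIX = re.compile(r'^(?:Re:\s*)?\[[^\]]+\]\s*(?:\(MQ-\d+\)\s*)?')

def strip_subject_prefix(subject):
    """Return the subject without its tracker/repository prefix."""
    return PREFIX.sub('', subject)

subjects = [
    '[JIRA] (MQ-482) Email fetch not working locally',
    'Re: [moduleQ/MQ.ai] MQ-527 duplicate key value violates unique constraint (#219)',
    'Morning Briefing questions',
]
for s in subjects:
    print(strip_subject_prefix(s))
```

Subjects without a bracketed tag are returned unchanged, so a plain reply such as 'Re: Website Update' keeps its 'Re:' under this particular pattern.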

As a trial operation, let us work on the content of just the first row (first mail) in the dataframe. 

In [9]:
email_dataframe['body'][0]

{'content': '<html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\r\n<meta content="text/html; charset=utf-8">\r\n<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">\r\n<base href="https://moduleq.atlassian.net">\r\n</head>\r\n<body class="jira" style="color:#333333; font-family:Arial,sans-serif; font-size:14px; line-height:1.429">\r\n<table id="background-table" cellpadding="0" cellspacing="0" width="100%" bgcolor="#f5f5f5" style="border-collapse:collapse; background-color:#f5f5f5; border-collapse:collapse">\r\n<tbody>\r\n<tr>\r\n<td id="header-pattern-container" style="padding:0; border-collapse:collapse; padding:10px 20px">\r\n<table id="header-pattern" cellspacing="0" cellpadding="0" border="0" style="border-collapse:collapse">\r\n<tbody>\r\n<tr>\r\n<td id="header-avatar-image-container" valign="top" width="32" style="padding:0; border-collapse:collapse; vertical-align:top; width:32px; padding-right:8px">\r\n<

In [10]:
dictionary = email_dataframe['body'][0]

keys = list(dictionary.keys())
values = list(dictionary.values())

In [11]:
# values[0] contains the first value, corresponding to the key-value pair {'contentType': 'html'}
# values[1] contains the HTML text to be analyzed (the value of the key 'content')

In [12]:
items = dictionary.items()

In [13]:
keys, values = zip(*dictionary.items())

In [14]:
# Let us use Beautiful Soup to parse this HTML content
htmltext = dictionary['content']  # same value as values[1], but read by key rather than by position
soup = BeautifulSoup(htmltext, 'lxml')
cleaned_text = soup.text
cleaned_text

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nVasyl Rubinskyi\ncommented on \n MQ-482\n\n\n\n\n\n\n\n\n\n\n\n\r\n\xa0\n\n\n\n\n\n\n\nRe: Email fetch\r\n not working locally \n\n\n\n\n\n\n\n\n\n\n\nRyan Curd Seems to be fixed. I guess it may be closed.\r\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAdd Comment\n\n\n\n\n\n\n\n\n\n\n\n\r\n\xa0\n\n\n\n\n\n\n\n\n\n\n\r\nThis message was sent by Atlassian JIRA (v1000.824.2#100035-sha1:a97671d)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

We observe that all the HTML markup has been removed from the text, but whitespace characters such as "\n", "\r" and "\xa0" remain. Let us remove them using the str.replace() method.

In [15]:
# Remove line breaks, and turn non-breaking spaces into ordinary spaces
cleaned_text = cleaned_text.replace('\n','')
cleaned_text = cleaned_text.replace('\r','')
cleaned_text = cleaned_text.replace('\xa0',' ')
# Remove the JIRA footer
cleaned_text = cleaned_text.replace('Add Comment This message was sent by Atlassian JIRA (v1000.824.2#100035-sha1:a97671d)','')
# Remove leftover tags, footers and signatures seen in this inbox
# (raw strings keep regex escapes such as \) intact)
cleaned_text = re.sub(r'<.*?>', '', cleaned_text)
cleaned_text = re.sub(r'Add Comment.*?a97671d\)','',cleaned_text)
cleaned_text = re.sub(r'Sherry.*?artificial intelligence. ','',cleaned_text)
cleaned_text = re.sub(r'Margo Poda.*?@ModuleQ ','',cleaned_text)
cleaned_text = re.sub(r'Sent from my iPhone.*?[A-Za-z]+','',cleaned_text)

In [16]:
cleaned_text

'Vasyl Rubinskyicommented on  MQ-482 Re: Email fetch not working locally Ryan Curd Seems to be fixed. I guess it may be closed.'
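Note that 'Rubinskyi' and 'commented' have run together in the output above, because the newline between them was deleted rather than replaced by a space. A possible alternative, sketched here on a shortened copy of the raw text, is to collapse every run of whitespace into a single space. This is an assumption about what a cleaner result would look like, not the approach used in the rest of this notebook.

```python
import re

# Shortened copy of the raw text produced by Beautiful Soup above
raw_text = 'Vasyl Rubinskyi\ncommented on \n MQ-482\n\n\r\n\xa0\n\nRe: Email fetch\r\n not working locally \n'

# \s matches newlines, carriage returns, tabs and the non-breaking space \xa0,
# so one substitution replaces every run of whitespace with a single space
normalized = re.sub(r'\s+', ' ', raw_text).strip()
print(normalized)
```

If this variant were adopted, the footer and signature patterns would need to be rechecked against the new spacing.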

**The above operation shows that we can use the Beautiful Soup library to clean the *content* of the mail. Let us now operationalize these steps for all 469 mails using a for loop.**

In [17]:
list_of_cleaned_mails = []
for body in email_dataframe['body']:
    # Each body is a dictionary; the HTML of the mail is stored under the key 'content'
    soup = BeautifulSoup(body['content'], 'lxml')
    cleaned_text = soup.text
    # Remove line breaks and tabs, and turn non-breaking spaces into ordinary spaces
    cleaned_text = cleaned_text.replace('\n','')
    cleaned_text = cleaned_text.replace('\r','')
    cleaned_text = cleaned_text.replace('\t','')
    cleaned_text = cleaned_text.replace('\xa0',' ')
    # Remove footers and signatures that recur in this inbox
    # (raw strings keep regex escapes such as \) intact)
    cleaned_text = cleaned_text.replace('Reply to this email directly, view it on GitHub, or mute the thread.','')
    cleaned_text = re.sub(r'<.*?>', '', cleaned_text)
    cleaned_text = re.sub(r'Add Comment.*?a97671d\)','',cleaned_text)
    cleaned_text = re.sub(r'Sherry.*?artificial intelligence. ','',cleaned_text)
    cleaned_text = re.sub(r'Margo Poda.*?@ModuleQ ','',cleaned_text)
    cleaned_text = re.sub(r'Sent from my iPhone.*?[A-Za-z]+','',cleaned_text)

    list_of_cleaned_mails.append(cleaned_text)
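The regular-expression substitutions in the loop above could also be collected into one helper, so that a new footer pattern is added in a single place. The following is a minimal sketch under that assumption; the names BOILERPLATE_PATTERNS and strip_boilerplate are hypothetical, and only three of the patterns are reproduced.

```python
import re

# Patterns copied from the loop above; the list is specific to this inbox
BOILERPLATE_PATTERNS = [
    re.compile(r'<.*?>'),
    re.compile(r'Add Comment.*?a97671d\)'),
    re.compile(r'Sent from my iPhone.*?[A-Za-z]+'),
]

def strip_boilerplate(text, patterns=BOILERPLATE_PATTERNS):
    """Apply each pattern in turn and delete whatever it matches."""
    for pattern in patterns:
        text = pattern.sub('', text)
    return text

sample = 'Seems to be fixed. Add Comment This message was sent by Atlassian JIRA (v1000.824.2#100035-sha1:a97671d)'
print(repr(strip_boilerplate(sample)))
```

Compiling the patterns once also avoids re-parsing them for each of the 469 mails.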
    

In [18]:
# Include this list as an additional column in the pandas dataframe
# (assign returns a new dataframe, which avoids pandas' SettingWithCopyWarning)
reduced_dataframe = reduced_dataframe.assign(cleaned_content=list_of_cleaned_mails)


In [19]:
len(list_of_cleaned_mails)

469

In [20]:
# Let us view some sample entries of the column: cleaned_content to get a sense of the type of communication
list_of_cleaned_mails[0:150]

['Vasyl Rubinskyicommented on  MQ-482 Re: Email fetch not working locally Ryan Curd Seems to be fixed. I guess it may be closed.',
 'Vasyl Rubinskyiupdated an issue  ModuleQ /MQ-525 Improve monitoring capability Change By:Vasyl RubinskyiEpic Link:MQ-559 ',
 'Vasyl Rubinskyiassigned an issue to Vasily Korytov  ModuleQ /MQ-525 Improve monitoring capability Change By:Vasyl RubinskyiAssignee:Anupriya AnkolekarVasily Korytov',
 'Vasyl Rubinskyiassigned an issue to Nikolay Borovenskiy  ModuleQ /MQ-539 Capture and log user input to Q Change By:Vasyl RubinskyiAssignee:Nikolay Borovenskiy',
 'Vasyl Rubinskyi created a page 2017-03-15Date15 Mar 2017 Attendees@Yury Apollov @Anton Koltsov @Nikolay Borovenskiy @Vasily Korytov @David Brunner [Administrator] @Peter Taraba @Vasyl Rubinskyi @Ryan Curd GoalsClarify where we are Discussion itemsWhoItemNotes@Yury Apollov TodayMQ-550 - (  ) InvestigateTomorrowProblems@Anton Koltsov TodayMQ-560 - (  ) To do. TomorrowProblems@Nikolay Borovenskiy TodayMQ-536 

In [21]:
# Looking at the subject lines:
list_of_subjects = list(reduced_dataframe['subject'][0:30])
list_of_subjects

['[JIRA] (MQ-482) Email fetch not working locally',
 '[JIRA] (MQ-525) Improve monitoring capability',
 '[JIRA] (MQ-525) Improve monitoring capability',
 '[JIRA] (MQ-539) Capture and log user input to Q',
 '[Confluence] MQ.ai > 2017-03-15',
 "Re: [moduleQ/MQ.ai] Changed 'Create priority' to 'Confirm priority' and added UTC timezon… (#233)",
 "Re: [moduleQ/MQ.ai] Fixed truncation for morning briefing detail's email subject and people (#232)",
 'Re: [moduleQ/MQ.ai] MQ-527 duplicate key value violates unique constraint (#219)',
 '[moduleQ/MQ.ai] Feature/mq 554 own service url for each user (#234)',
 '[JIRA] (MQ-539) Capture and log user input to Q',
 '[JIRA] (MQ-539) Capture and log user input to Q',
 '[JIRA] (MQ-482) Email fetch not working locally',
 'Re: [moduleQ/MQ.ai] Feature/mq 554 own service url for each user (#234)',
 'Morning Briefing questions',
 '[JIRA] (MQ-539) Capture and log user input to Q',
 'Website Update',
 'Re: Website Update',
 '[JIRA] (MQ-539) Capture and log user in

In [22]:
reduced_dataframe['cleaned_content'][0:15]

0     Vasyl Rubinskyicommented on  MQ-482 Re: Email ...
1     Vasyl Rubinskyiupdated an issue  ModuleQ /MQ-5...
2     Vasyl Rubinskyiassigned an issue to Vasily Kor...
3     Vasyl Rubinskyiassigned an issue to Nikolay Bo...
4     Vasyl Rubinskyi created a page 2017-03-15Date1...
5     Merged #233.—You are receiving this because yo...
6     Merged #232.—You are receiving this because yo...
7     Closed #219.—You are receiving this because yo...
8     You can view, comment on, or merge this pull r...
9     Nikolay Borovenskiyupdated  MQ-539 ModuleQ /MQ...
10    Nikolay Borovenskiycommented on  MQ-539 Re: Ca...
11    Ryan Curddeleted an issue  ModuleQ /MQ-482 Ema...
12    Merged #234.—You are receiving this because yo...
13    Anupriya, I just had some questions about info...
14    Nikolay Borovenskiycommented on  MQ-539 Re: Ca...
Name: cleaned_content, dtype: object

**Based on the basic cleaning operations carried out on the mails in the previous notebook, a set of 16 mails has been selected for analysis.**

Note to self: I had to perform some cleaning steps manually at the end to get the mail content into its cleanest form. This was just a trial exercise; in future, aim to automate these cleaning operations in code.

In [23]:
cleaned_mail_text = ["Anupriya, I just had some questions about information in the morning briefing. For reference here is mine for today with numbers that correspond to my questions. On the main priority listing card, Do the numbers represent the importance of the priority, its ranking? If so, it seems to me like Microsoft should be in the number one priority based on more emails and a more recent upward trend. Do the priorities reorder themselves based on trends?Also the text below (which is unfortunately cut off here) says 'green arrows' and should probably read 'green numbers'. On the individual priority card, there are two numbers under activity (42 emails and 57 total). The total number here does not match the email number from the first card. I assume this is a bug. What is the scope of the first number that it is less than the total? Is it the past week only? It might be good to put a qualifier in there to help not confuse people. I assume the gray band is the next upcoming meeting in that priority. Again, we might want to consider noting that so people know what it's telling them. The date and time are wrong. It appears to only display the current date and time in GMT. I assume this is another bug.a. I assume I have two entries under one email because it was somehow split, correct? I can’t seem to find an email between just me and Vasyl on for this thread though.b. Similar to new priority notifications, should my name be appearing here or should it be omitted. I know it's a lot of questions, so thanks in advance for your patience. Ryan",

 "Nikolay Borovenskiy commented on  MQ-539 Re: Capture and log user input to Q. It is fine for me use database, you have my voice to follow this approach. I hope that very soon we will have an admin panel and will be able to display this information very friendly. But I must know which one approach to do. So I am waiting your decision.",

 "Session decrease today – and that is to be expected (so far, it's early, but I'll be gone tonight so I'll check in tomorrow).",

 "Monthly product updates, pro-tips, and events from the Sentry team. Product updates for March 2017 Happy March! We'll try not to be too offended if you want to unsubscribe from our monthly product updates. Feel free to remove yourself by clicking the link in the email footer. Here are some updates we've recently shipped at Sentry. Product updates iOS Reprocessing Have you ever been in a situation where your app crashes before you've uploaded your debug symbols to us? Well, we're happy to say Sentry has gotten a whole lot better here. You can now enable a feature we call reprocessing (in your project settings), which will delay parsing your errors until you've uploaded the relevant symbols. —Armin Learn more More iOS updates Okay Armin, we get it, reprocessing will make debugging easier. But only because of the big update we made to our Swift SDK. Make sure you check out ourdocs and update Swift. —DanielBilling Update We changed our pricing model in January for all new Sentry users. If you’re currently on a paid legacy plan, hold tight! You'll hear from us in the coming months about the migration process to the new plan.—Jess A. Psst...We have a cool new update coming to releases. Stay tuned! —Jess M. Inside Sentry updatesDodging S3 Downtime with Nginx and HAProxy How we reduced our S3 bandwidth costs by 70% while gaining more performance and reliability. - Matt Learn more ICYMI Error handling in Node. jsI gave a talk in the March SFNode meetup at our office. You can check out the recording of this talkhere or just go through the slides to learn more. — Lewis Filtering exceptions Olark, one of our exceptional customers, wrote this great blog post on how they get the most out of Sentry. If you have your own way of filtering and want to share, send us a quick email. Pro-Tips Be sure to update your Javascript SDKWe updated our browser JavaScript SDK to prevent sending the same event back-to-back. "
 "We've also made other fixes to surface higher quality errors in some situations.",

 "Nikolay Borovenskiy commented on  MQ-539 Re: Capture and log user input to Q Anupriya Ankolekar Yes, they will. We'll do foreign key for each user. We need to think about how to show this information. I hold in my head only admin panel as instrument to show it. or use tools to get access to data base. These are mostly facts we know but backed up here. It is good coverage of MSFT Teams.",

 "It is a useful number. Some people that we have spoken to get more then 300 a day. Those were typically consulting or sales managers. Of course many of those were internal emails because they worked in large organizations. By this article internal email volume decreased also. Teams makes that happen. We improve the experience in Teams with external emails by surfacing priority communications.",

 "My experience. I had to delete Steve Vigo as I noticed he got in the way. MSFT needs to sort that out. Once I signed into Teams and engaged it did the binding which just spun. I watched it for a bit then I looked at Teams and it in fact was working fine. So it did work but as a user, I thought it did not ... initially. Another point on the priority identification. I got 3 recommended in this order - EY – I accepted Royal Coffee – I accepted but I don’t understand why it came up. I have no meeting set up with them at all. Pressly – I did have one meeting set up and a few emails. It did not recommend Microsoft in the first 3 which is most alarming. I have meetings set up in the near past and future and lots of emails with them. It is most of my email traffic.",

 "Steve was not showing up when I logged in. The priorities were not right. Even though I accepted Royal Cup Coffee it did not fit a priority as I thought we had defined it. I have no meeting set up with them and I haven't had a meeting for maybe 4 months … or more.  Microsoft should have been right below EY or ahead of EY and it didn't show up in the first 3. Pressly was one meeting and not more then 3 emails. Those 3 emails are in one chain. There is only one person on the Pressly email and meeting. There are many people on the EY and MSFT meetings and emails. So it seems a little random to me.",

 "Interesting! ... Now I do remember your mentioning this number before for consulting and sales managers. The magnitude of that wrt how Q will perform is now starting to sink in for me. Something to keep in mind as we go through the next iteration of Q's design. You're right that if a target organization is already using Teams, then when they start using Q, the email traffic should be smaller. I'm mostly thinking about the initial experience, if they are new to both Q and Teams when we start working with them. Once they have put up with us for a while, they are more likely to stay.And hopefully not all of the 300 emails were priority communications, so our notifications will definitely present a smaller set. Still, we may ultimately need to become clever about which kinds of notifications we show even for accepted priorities or allow users to set manual filters if they want to.",

 "Okay, it sounds like the ranking is really off for you. I am working on coming up with changes to the clustering algorithm first and then for the ranking of these priorities. David and I were thinking of a rule-based approach to the initial clustering and ranking. I'm making a list of rules to apply and will get back to you for verification/validation of these. In the meantime, feel free to tell me anything else you find odd. I want to address as many issues as possible.",

 'Thanks Margo for keeping us updated. We have a few random new users, so some are enticed enough to add our bot. :) Hope you are having fun!',

 "Hi Ryan, No worries about the length, only got to it a bit later, cause I wanted to answer properly. :) 1. Yes, the order should correspond to their importance. The priorities currently do not reorder themselves, so they are in undefined order. Once the priority dashboard is in place, people will be able to reorder priorities and then this ranking should correspond to their user-defined ranking.a. It is indeed wrong and should say 'marked in green', but the developers missed that new text I think. It is a simple fix in MQ.ai/moduleq/bot/views/templates/bulletin/summary.xml, so I will do it (has been on my todo list for a while.)2. Yes and yes. Those numbers have been an issue. It should say 42 emails this week and 57 emails in total. That was specified originally, but got lost. The number 57 should appear in the summary view too. This is definitely a problem. The numbers are calculated differently: for the summary, it is read from the json query (I have to check how that is calculated), for the detail view, it is a sum of the numbers used to generate the activity graph. So, I guess they will always be close, but they really should be the same.3. Yes, the upcoming meeting might be more obvious with a clear line saying 'Upcoming meeting'. I hadn\'t noticed the date and time problems. They are definitely wrong and not having timezone there is an issue!4. Just looks like a bug to me. Email addressees are not supposed to be split over two lines. I don\'t see why that should happen.5. Although this is a different case than new priority notifications, omitting the user\'s name (which will always be present) saves space, so I suggest removing it here as well. That said, for new email notifications which contain a snippet of the new email, it made sense to keep the user in there, because there we\'re presenting the email there as a purer object, which has the user as an attribute as well. "
 "That said, this is fuzzy reasoning, so I can be persuaded either way.Btw, I can work on 1a, 2a, 3 and 4 (you won\'t need to create separate JIRA items for each of them). Just let me know.Also, as a heads up, although not a prerequisite here, it would help immensely for design of the morning bulletin to look into headless Chrome (https://moduleq.atlassian.net/browse/MQ-503). This may be something for Vasily when he is relatively unburdened. Changing designs is onerous in our current setup, because the html rendering engine we use does not support the latest CSS standards. Both Nikolay and me have been reluctant (and slow) to make non-essential changes for this reason. Morning Briefing questions Anupriya, I just had some questions about information in the morning briefing. For reference here is mine for today with numbers that correspond to my questions. On the main priority listing card, Do the numbers represent the importance of the priority, its ranking? If so, it seems to me like Microsoft should be in the number one priority based on more emails and a more recent upward trend. Do the priorities reorder themselves based on trends?Also the text below (which is unfortunately cut off here) says 'green arrows' and should probably read ‘green numbers’. On the individual priority card, there are two numbers under activity (42 emails and 57 total). The total number here does not match the email number from the first card. I assume this is a bug. What is the scope of the first number that it is less than the total? Is it the past week only? It might be good to put a qualifier in there to help not confuse people. I assume the gray band is the next upcoming meeting in that priority. Again, we might want to consider noting that so people know what it's telling them. The date and time are wrong. It appears to only display the current date and time in GMT. I assume this is another bug. a. I assume I have two entries under one email because it was somehow split, correct? "
 "I can’t seem to find an email between just me and Vasyl on for this thread though. b. Similar to new priority notifications, should my name be appearing here or should it be omitted. I know it's a lot of questions, so thanks in advance for your patience. Ryan",

 "Yes, good data points! In terms of email volume, that range seems about right for moderately busy professionals (200 per day x 20 work days per month = 4K).  It may be low for senior people. I'd expect them to be in the 5K - 10K range, and some higher.",

 "Hi Team, We bet you love SF restaurants and their great food, so we wanted to invite everyone at ModuleQ to try our dine-in app. Allset at restaurants near your office. Our mission is to help busy professionals enjoy a better lunch break. We’ve gotten great feedback from employees of local companies and built up a strong following in San Francisco, New York City, and Chicago. Everyone at ModuleQ can get a $50 credit and try Allset at their own convenience by using the codeTEAMTRY50 ($10 off first five orders). Simply pass this email along to your team. Thanks! Kate If you’re not interested, you can let me know by clicking click here. Thanks!",

 "Thanks Anupriya. If you want to work on the tasks you certainly can. Otherwise, if you have other items you want to focus on I can assign it to Nikolay. For number 5, since neither of us has strong opinions about it we'll just leave it as is.",

 "So as much as I hate to call Q an assistant we might want to. It is a common category. David you can call this now if you want to but I can take some to work through possibilities. I think we need to connect with a category that has strong search results."]

Let us first combine these mails into a single string and then explore basic processing steps related to word frequency.

In [24]:
# Combine the cleaned mails into a single string, separated by spaces
email_string = " ".join(cleaned_mail_text)
#print(email_string)

# Excluding the (curly) apostrophes
email_string = email_string.replace("’","")

In [25]:
import string
import nltk
import re
from collections import Counter

In [26]:
# This can be done using nltk as follows:
from nltk.tokenize import sent_tokenize, word_tokenize
lowcase_email_string = email_string.lower() # Convert all words to lowercase, since we will not distinguish between uppercase and lowercase
# Let us also remove all the punctuation
# (re.escape keeps characters such as ] and \ literal inside the character class)
lowcase_email_nopunct_string = re.sub('[' + re.escape(string.punctuation) + ']', '', lowcase_email_string)
words = word_tokenize(lowcase_email_nopunct_string)

In [27]:
#tokens = get_tokens()
count = Counter(words)
print(count.most_common(40))

[('the', 100), ('to', 84), ('i', 56), ('and', 52), ('a', 48), ('it', 47), ('in', 46), ('is', 41), ('for', 33), ('of', 32), ('be', 24), ('that', 23), ('this', 23), ('we', 23), ('on', 22), ('so', 20), ('you', 20), ('have', 19), ('priority', 18), ('not', 18), ('email', 18), ('should', 17), ('are', 16), ('as', 15), ('here', 14), ('more', 14), ('emails', 14), ('with', 12), ('if', 12), ('me', 12), ('number', 12), ('new', 12), ('your', 12), ('will', 12), ('our', 12), ('they', 12), ('up', 12), ('numbers', 11), ('one', 10), ('which', 10)]


In [28]:
# We notice that most of the commonly occurring words are stopwords, which do not assist in keyphrase extraction
# Let us exclude these stopwords and then study the frequency again
# nltk has a built-in list of stopwords which can be used as a reference
from nltk.corpus import stopwords

In [29]:
# Viewing a few examples
stopWords = list(stopwords.words('english'))
print("Total stopwords recorded in the nltk library: ",len(stopWords))
stopWords[0:20]

Total stopwords recorded in the nltk library:  179


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

In [30]:
# Keep only the tokens that are not stopwords; a set makes each membership test fast
stopWords = set(stopwords.words('english'))
filtered_email_content = [w for w in words if w not in stopWords]
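The lowercasing, punctuation removal, tokenizing and stopword filtering steps above can be traced on a single sentence. The sketch below uses only the standard library: it splits on whitespace instead of calling word_tokenize, and TOY_STOPWORDS is a tiny hypothetical set, far shorter than NLTK's English list.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop word set; NLTK's English list is much longer
TOY_STOPWORDS = {'the', 'to', 'i', 'and', 'a', 'it', 'in', 'is', 'be', 'this'}

sentence = "The date and time are wrong. I assume this is another bug in the email card."

lowered = sentence.lower()
no_punct = re.sub('[' + re.escape(string.punctuation) + ']', '', lowered)
tokens = no_punct.split()   # whitespace split instead of word_tokenize
content_words = [t for t in tokens if t not in TOY_STOPWORDS]

print(Counter(tokens).most_common(1))
print(content_words)
```

Before filtering, the most frequent token is 'the'; after filtering, only content-bearing words remain, which mirrors the change between the two frequency lists in this notebook.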

In [31]:
count = Counter(filtered_email_content)
print(count.most_common(40))

[('priority', 18), ('email', 18), ('emails', 14), ('number', 12), ('new', 12), ('numbers', 11), ('one', 10), ('meeting', 10), ('want', 10), ('first', 8), ('assume', 8), ('people', 8), ('questions', 7), ('priorities', 7), ('total', 7), ('know', 7), ('teams', 7), ('card', 6), ('ranking', 6), ('might', 6), ('notifications', 6), ('thanks', 6), ('user', 6), ('q', 6), ('updates', 6), ('sentry', 6), ('well', 6), ('us', 6), ('get', 6), ('yes', 6), ('3', 6), ('set', 6), ('green', 5), ('two', 5), ('date', 5), ('time', 5), ('update', 5), ('work', 5), ('anupriya', 4), ('information', 4)]


In [32]:
# Combining the tokens back to a string
filtered_mail_content_string = " ".join(filtered_email_content)

Notice above that email and emails have been counted as separate words, although ideally they would be treated as the same word; likewise number and numbers. Reducing each inflected form to its dictionary form (its lemma) is called lemmatization, and it can be carried out directly using a Python library.
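To make the effect concrete, the sketch below merges a few of the counts from the stopword-filtered output above using a small hand-written lemma table. TOY_LEMMAS is hypothetical and only illustrates the idea; it is not how a real lemmatizer looks words up.

```python
from collections import Counter

# Counts copied from the stopword-filtered output above
counts = Counter({'priority': 18, 'email': 18, 'emails': 14,
                  'number': 12, 'numbers': 11, 'priorities': 7})

# Hypothetical hand-written lemma table, for illustration only
TOY_LEMMAS = {'emails': 'email', 'numbers': 'number', 'priorities': 'priority'}

merged = Counter()
for word, n in counts.items():
    # Words without an entry in the table are their own lemma
    merged[TOY_LEMMAS.get(word, word)] += n

print(merged.most_common())
# → [('email', 32), ('priority', 25), ('number', 23)]
```

Merging the inflected forms moves email ahead of priority, which is the same reordering that appears in the lemmatized frequency list.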

In [33]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# lemmatize() treats each word as a noun unless a part of speech is passed,
# which is why 'us' appears as 'u' in the output below
lemmatized_words_list = [wordnet_lemmatizer.lemmatize(word) for word in filtered_email_content]
count = Counter(lemmatized_words_list)
print(count.most_common(50))

[('email', 32), ('priority', 25), ('number', 23), ('meeting', 12), ('new', 12), ('update', 11), ('one', 10), ('want', 10), ('user', 10), ('team', 10), ('first', 8), ('assume', 8), ('people', 8), ('question', 7), ('total', 7), ('know', 7), ('q', 7), ('card', 6), ('ranking', 6), ('might', 6), ('notification', 6), ('thanks', 6), ('sentry', 6), ('well', 6), ('u', 6), ('get', 6), ('yes', 6), ('3', 6), ('set', 6), ('say', 5), ('green', 5), ('two', 5), ('date', 5), ('time', 5), ('need', 5), ('work', 5), ('anupriya', 4), ('information', 4), ('morning', 4), ('correspond', 4), ('seems', 4), ('like', 4), ('microsoft', 4), ('based', 4), ('reorder', 4), ('57', 4), ('bug', 4), ('good', 4), ('help', 4), ('upcoming', 4)]


**Note that, according to the literature, it is inappropriate to conclude that the most frequently occurring words are the keywords/keyphrases.**

In [34]:
# Sample stemming operation using root word navigate

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
#stemmer.stem(string)
list_of_words = ['navigate','navigated','navigating','navigator']
for word in list_of_words:
    print(stemmer.stem(word))
#stemmer.stem('cookery')

navig
navig
navig
navig


One observation is that stemming simply cuts off suffixes, so the resulting stems (such as navig) are often not real words. If we are looking to extract key phrases, we might end up losing the phrase itself, so stemming might not be a great tool in this case. Let us now try out lemmatization.