# Classifying strings by tense, subject and time objects
## By Amir ElTabakh
## February 13, 2022

[Link to Upwork Proposal](https://www.upwork.com/ab/proposals/1490854526188195841)

In this document I use NLP techniques to classify strings by tense (past, present, future).

I will use the Natural Language Toolkit (NLTK) library to tag the parts of speech in each sentence/string, and then score the sentence for each tense by counting the relevant tags. There are many part-of-speech tags, but below is a guide to the grammar that is relevant: each pattern maps a sequence of Penn Treebank verb tags to a tense.

In [1]:
grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous:         {<MD><VB><VBG>}
Future_Perfect:            {<MD><VB><VBN>}
Past_Perfect_Continuous:   {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite:         {<MD><VB>}
Past_Continuous:           {<VBD><VBG>}
Past_Perfect:              {<VBD><VBN>}
Present_Continuous:        {<VBZ|VBP><VBG>}
Present_Perfect:           {<VBZ|VBP><VBN>}
Past_Indefinite:           {<VBD>}
Present_Indefinite:        {<VBZ>|<VBP>}
"""
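
NLTK's `RegexpParser` accepts chunk grammars in this format. As a library-free illustration of what the patterns mean, the sketch below matches the same tag sequences against an already-tagged sentence, trying the most specific patterns first. The names `TENSE_PATTERNS` and `find_tense_chunks` are hypothetical, and the simple left-to-right matching is an approximation for illustration, not NLTK's chunking algorithm.

```python
# Tense patterns from the grammar above, most specific first.
# Each pattern is a sequence of sets of acceptable POS tags.
TENSE_PATTERNS = [
    ("Future_Perfect_Continuous", [{"MD"}, {"VB"}, {"VBN"}, {"VBG"}]),
    ("Future_Continuous", [{"MD"}, {"VB"}, {"VBG"}]),
    ("Future_Perfect", [{"MD"}, {"VB"}, {"VBN"}]),
    ("Past_Perfect_Continuous", [{"VBD"}, {"VBN"}, {"VBG"}]),
    ("Present_Perfect_Continuous", [{"VBP", "VBZ"}, {"VBN"}, {"VBG"}]),
    ("Future_Indefinite", [{"MD"}, {"VB"}]),
    ("Past_Continuous", [{"VBD"}, {"VBG"}]),
    ("Past_Perfect", [{"VBD"}, {"VBN"}]),
    ("Present_Continuous", [{"VBP", "VBZ"}, {"VBG"}]),
    ("Present_Perfect", [{"VBP", "VBZ"}, {"VBN"}]),
    ("Past_Indefinite", [{"VBD"}]),
    ("Present_Indefinite", [{"VBP", "VBZ"}]),
]


def find_tense_chunks(tagged):
    """Return a list of (pattern_name, words) for each verb group found."""
    chunks = []
    i = 0
    while i < len(tagged):
        for name, pattern in TENSE_PATTERNS:
            window = tagged[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                tag in allowed for (_, tag), allowed in zip(window, pattern)
            ):
                chunks.append((name, [word for word, _ in window]))
                i += len(pattern)  # skip past the matched verb group
                break
        else:
            i += 1  # no pattern starts at this token; move on
    return chunks


# (word, tag) pairs in the format returned by pos_tag, taken from the
# tagged output of "I will do my homework tomorrow." in this document.
tagged = [("I", "PRP"), ("will", "MD"), ("do", "VB"), ("my", "PRP$"),
          ("homework", "NN"), ("tomorrow", "NN"), (".", ".")]
print(find_tense_chunks(tagged))  # [('Future_Indefinite', ['will', 'do'])]
```

Ordering the patterns from longest to shortest matters here: otherwise `Future_Indefinite` would claim the first two tokens of a longer group such as `<MD><VB><VBG>`.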

In [2]:
from nltk import word_tokenize, pos_tag
text = "I will do my homework tomorrow." 

tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
tagged

[('I', 'PRP'),
 ('will', 'MD'),
 ('do', 'VB'),
 ('my', 'PRP$'),
 ('homework', 'NN'),
 ('tomorrow', 'NN'),
 ('.', '.')]

In [3]:
from nltk import word_tokenize, pos_tag

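def determine_tense_input(sentence):
    """Count the POS tags associated with each tense in a sentence."""
    text = word_tokenize(sentence)
    tagged = pos_tag(text)

    # Note: "VBN" counts toward both future and past; "VBF" and "VBC" are not
    # Penn Treebank tags, so the default English tagger does not emit them.
    tense = {}
    tense["future"] = len([word for word in tagged if word[1] in ["MD", "VBF", "VBC", "VB", "VBN"]])
    tense["present"] = len([word for word in tagged if word[1] in ["VBP", "VBZ","VBG"]])
    tense["past"] = len([word for word in tagged if word[1] in ["VBD", "VBN"]])
    return tense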

In [7]:
text = "did you get my email?" 
print(determine_tense_input(text))
text = word_tokenize(text)
tagged = pos_tag(text)
print(tagged)

{'future': 1, 'present': 0, 'past': 1}
[('did', 'VBD'), ('you', 'PRP'), ('get', 'VB'), ('my', 'PRP$'), ('email', 'NN'), ('?', '.')]


In [5]:
text = "Hello, how are you doing?"
determine_tense_input(text)

{'future': 0, 'present': 2, 'past': 0}

In [6]:
text = "I am currently programming in Python." 
determine_tense_input(text)

{'future': 0, 'present': 2, 'past': 0}

In [7]:
text = "Tomorrow I will be programming in Python."
determine_tense_input(text)

{'future': 2, 'present': 1, 'past': 0}

In [8]:
# determine the tense of the string
text = "We will attend the meeting in one hour." 
scores = determine_tense_input(text)
max(scores, key=scores.get)

'future'
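
Note that `max` returns the first key holding the highest value, so when two tenses tie, the dictionary's insertion order (`future`, `present`, `past`) decides the result. The sketch below makes ties explicit instead. `pick_tense` and the `"ambiguous"` label are hypothetical names introduced for illustration, not part of the approach above.

```python
def pick_tense(scores):
    """Return the top-scoring tense, or "ambiguous" if the top score is tied or zero."""
    best = max(scores.values())
    winners = [tense for tense, count in scores.items() if count == best]
    if best == 0 or len(winners) > 1:
        return "ambiguous"
    return winners[0]


# Scores reported above for "did you get my email?"
tied = {"future": 1, "present": 0, "past": 1}
print(max(tied, key=tied.get))  # 'future': the first key with the top score
print(pick_tense(tied))         # 'ambiguous'

# Scores reported above for "Tomorrow I will be programming in Python."
clear = {"future": 2, "present": 1, "past": 0}
print(pick_tense(clear))        # 'future'
```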

## Identifying the Subject

The function below counts candidate subjects: tokens tagged as personal pronouns (`PRP`), singular proper nouns (`NNP`), or plural nouns (`NNS`).

In [9]:
# function to determine count of subjects

def determine_subject_input(sentence):
    text = word_tokenize(sentence)
    tagged = pos_tag(text)

    subject = {}
    subject["Count"] = len([word for word in tagged if word[1] in ["PRP", "NNP", "NNS"]])  
    return(subject)

In [10]:
text = "We will attend the meeting in one hour." 
determine_subject_input(text)

{'Count': 1}

In [11]:
text = "Tomorrow the homework will be complete." 
determine_subject_input(text)

{'Count': 0}

## Identifying time-related objects in a string

In [8]:
# function to count time-related objects, using proper nouns (NNP) and cardinal numbers (CD) as a proxy

def determine_time_input(sentence):
    text = word_tokenize(sentence)
    tagged = pos_tag(text)

    time_count = len([word for word in tagged if word[1] in ["NNP", "CD"]])  
    return(time_count)

In [13]:
sentence = "So, and given the complete scenario, this is what we'll be doing for one of the biggest financial industry."
text = word_tokenize(sentence)
print(pos_tag(text))


[('So', 'RB'), (',', ','), ('and', 'CC'), ('given', 'VBN'), ('the', 'DT'), ('complete', 'JJ'), ('scenario', 'NN'), (',', ','), ('this', 'DT'), ('is', 'VBZ'), ('what', 'WP'), ('we', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('doing', 'VBG'), ('for', 'IN'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('biggest', 'JJS'), ('financial', 'JJ'), ('industry', 'NN'), ('.', '.')]


In [9]:
text = "Did you get my email?"
print(determine_time_input(text))
print(pos_tag(word_tokenize(text)))

1
[('Did', 'NNP'), ('you', 'PRP'), ('get', 'VB'), ('my', 'PRP$'), ('email', 'NN'), ('?', '.')]


In [15]:
text = "I will submit the work"
print(determine_time_input(text))
print(pos_tag(word_tokenize(text)))

0
[('I', 'PRP'), ('will', 'MD'), ('submit', 'VB'), ('the', 'DT'), ('work', 'NN')]


In [16]:
text = "I want you to complete the task."
print(determine_time_input(text))
print(pos_tag(word_tokenize(text)))

0
[('I', 'PRP'), ('want', 'VBP'), ('you', 'PRP'), ('to', 'TO'), ('complete', 'VB'), ('the', 'DT'), ('task', 'NN'), ('.', '.')]


In [17]:
text = "I want you to complete the task by Wednesday."
print(determine_time_input(text))
print(pos_tag(word_tokenize(text)))

1
[('I', 'PRP'), ('want', 'VBP'), ('you', 'PRP'), ('to', 'TO'), ('complete', 'VB'), ('the', 'DT'), ('task', 'NN'), ('by', 'IN'), ('Wednesday', 'NNP'), ('.', '.')]


In [18]:
text = "I want you to complete the task as soon as possible."
print(determine_time_input(text))
print(pos_tag(word_tokenize(text)))

0
[('I', 'PRP'), ('want', 'VBP'), ('you', 'PRP'), ('to', 'TO'), ('complete', 'VB'), ('the', 'DT'), ('task', 'NN'), ('as', 'RB'), ('soon', 'RB'), ('as', 'IN'), ('possible', 'JJ'), ('.', '.')]
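
The outputs in this document point to two gaps in using `NNP` and `CD` as a proxy: a capitalized "Did" is tagged `NNP` and counted as a time object, while "tomorrow" is tagged `NN` and is not counted. One possible complement, sketched below under the assumption that a small hand-picked vocabulary is acceptable, is to check tokens against a list of time words. `TIME_WORDS` and `count_time_words` are hypothetical, and the word list is illustrative rather than exhaustive.

```python
import re

# Illustrative, non-exhaustive vocabulary of time-related words (an assumption).
TIME_WORDS = {
    "today", "tomorrow", "yesterday", "tonight", "morning", "afternoon",
    "evening", "hour", "hours", "minute", "minutes", "week", "month", "year",
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
}


def count_time_words(sentence):
    """Count tokens that appear in TIME_WORDS, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(1 for token in tokens if token in TIME_WORDS)


print(count_time_words("Did you get my email?"))                          # 0
print(count_time_words("I want you to complete the task by Wednesday."))  # 1
print(count_time_words("can you submit the code by tomorrow."))           # 1
```

A lexicon like this does not depend on capitalization or on the tagger, but it only finds the words it was given, so it is best seen as a complement to the tag-based count rather than a replacement.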


## Identifying date and time objects in a string using datefinder

In [None]:
import datefinder

text = "I will do the assignment Sunday."

matches = datefinder.find_dates(text)

for match in matches:
    print(match)

In [49]:
def determine_datefinder_input(sentence):
    # find_dates returns a generator, which has no len(); count by iterating
    matches = datefinder.find_dates(sentence)

    date_time_count = sum(1 for _ in matches)
    return date_time_count

In [60]:
text = "Will meet to discuss the plans for Tuesday at 7 PM."

matches = datefinder.find_dates(text)

for match in matches:
    print(match)

#determine_datefinder_input(text)

2022-02-22 19:00:00


## All together

Below is one function that combines the steps above. It identifies the tense, the count of subjects, and the count of time objects for each string, and it can filter the dataframe for rows that meet all of the following:

- Tense is Future
- Count of subjects > 0
- Count of time objects > 0

To filter the dataframe, set the `filter_strings` parameter to `True` (the default). To keep every row, set it to `False`.
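
As a library-free illustration of the three conditions, the sketch below applies them to plain dictionaries. The helper name `is_action_string` is hypothetical, and the sample values are copied from the example output shown later in this document.

```python
def is_action_string(row):
    """True when a row is future tense with at least one subject and one time object."""
    return (
        row["tense"] == "future"
        and row["subject_count"] > 0
        and row["time_count"] > 0
    )


rows = [
    {"text": "I will close the window.",
     "tense": "future", "subject_count": 1, "time_count": 0},
    {"text": "We will attend the meeting in one hour.",
     "tense": "future", "subject_count": 1, "time_count": 1},
    {"text": "I did the assignment yesterday.",
     "tense": "past", "subject_count": 1, "time_count": 0},
]

kept = [row["text"] for row in rows if is_action_string(row)]
print(kept)  # ['We will attend the meeting in one hour.']
```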

In [19]:
# generating mock dataset
import pandas as pd

list_of_strings = ["So, and given the complete scenario, this is what we'll be doing for one of the biggest financial industry.",
                   'I will close the window.',
                   'We will attend the meeting in one hour.',
                   'I did the assignment yesterday.',
                   'I walked the dog in the morning.',
                   'My cat is stepping on the keyboard.',
                   'Could you repeat that?',
                   'Would you like to get food with me tomorrow?',
                   'can you submit the code.',
                   'can you submit the code by tomorrow.',
                   'I will inform tomorrow.',
                   'I will inform.',
                   'I want you to complete the task as soon as possible.',
                   'I want you to complete the task.',
                   'I want you to complete the task by Wednesday.',
                   'Hello, how are you doing?',
                   'I am fine, thank you!',
                   'No matter the industry your business is in, an important team meeting topic is customer stories – whether good or bad.',
                   'You need to know what your customer is saying, listen to any recent reviews a customer has left, or read an email from a customer out loud to the team.',
                   'You will need to know what your customer is saying, listen to any recent reviews a customer has left, or read an email from a customer out loud to the team.',
                   'Then, see if you can take the customer story and turn it into ways to improve or anything that may need to be changed in the future.',
                   'It’s not always smooth sailing, and sometimes the team will run into a roadblock, challenge, or bottleneck.',
                   'When new products are being announced, released, or if they’re still in progress, this is a great topic of discussion for a meeting.',
                   'When you announce new products next week, be sure to talk about it because this is a great topic of discussion for a meeting.',
                   'Be sure you submit the work at 10:30 PM.',
                   'Amir can you submit the work.',
                   'Amir can you submit the work tomorrow.']

df_test = pd.DataFrame({'text': list_of_strings})

df_test

| | text |
|---|---|
| 0 | So, and given the complete scenario, this is w... |
| 1 | I will close the window. |
| 2 | We will attend the meeting in one hour. |
| 3 | I did the assignment yesterday. |
| 4 | I walked the dog in the morning. |
| 5 | My cat is stepping on the keyboard. |
| 6 | Could you repeat that? |
| 7 | Would you like to get food with me tomorrow? |
| 8 | can you submit the code. |
| 9 | can you submit the code by tomorrow. |


In [20]:
# pip install dependencies
#!pip install pandas
#!pip install nltk

In [21]:
# import csv

#df = pd.read_csv(r"C:/...")

#df.head()

In [22]:
from nltk import word_tokenize, pos_tag

# determine tense counts of string
def determine_tense_input(sentence):
    text = word_tokenize(sentence)
    tagged = pos_tag(text)

    tense = {}
    tense["future"] = len([word for word in tagged if word[1] in ["MD", "VBF", "VBC", "VB", "VBN"]])
    tense["present"] = len([word for word in tagged if word[1] in ["VBP", "VBZ","VBG"]])
    tense["past"] = len([word for word in tagged if word[1] in ["VBD", "VBN"]]) 
    return(tense)

# get count of time objects
def determine_time_input(sentence):
    text = word_tokenize(sentence)
    tagged = pos_tag(text)

    time_count = len([word for word in tagged if word[1] in ["NNP", "CD"]])  
    return(time_count)

# get count of subject objects
def determine_subject_input(sentence):
    text = word_tokenize(sentence)
    tagged = pos_tag(text)

    subject_count = len([word for word in tagged if word[1] in ["PRP", "NNP", "NNS"]])  
    return(subject_count)


def get_action_strings(df, text_col_name, filter_strings = True):
    """
    This function takes three arguments as input:
    - df: Your dataframe containing a column of strings
    - text_col_name: The name of the column that contains the strings
    - filter_strings: True or False (default True); if True, keep only the rows where:
    -- tense == 'future'
    -- subject_count > 0
    -- time_count > 0

    This function returns a copy of the dataframe with three new columns:
    - tense: 'past', 'present', or 'future'
    - subject_count: Number of subjects present in string
    - time_count: Number of time-related objects in string
    """
    
    # lists to store the tense, subject count, and time-object count of each string
    tense_col = []
    subject_count = []
    time_object_count = []

    # process each string in the text column
    for text in df[text_col_name]:
        # Tense: pick the tense with the highest tag count
        tense_dict = determine_tense_input(text)
        tense_col.append(max(tense_dict, key=tense_dict.get))

        # Subject count
        subject_count.append(determine_subject_input(text))

        # Time object count
        time_object_count.append(determine_time_input(text))
    
    # add column of tense's to df
    df = df.assign(tense = tense_col)
    
    # add column of subject counts to df
    df = df.assign(subject_count = subject_count)
    
    # add column of time-related object count
    df = df.assign(time_count = time_object_count)
    
    # FILTER
    if filter_strings:
        # keep only rows classified as future tense
        df = df[df["tense"] == "future"]

        # keep only rows that contain at least one subject
        df = df[df["subject_count"] > 0]

        # keep only rows that contain at least one time object
        df = df[df["time_count"] > 0]
    
    
    # return the updated df
    return df
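
Each helper above tokenizes and tags the sentence on its own, so every string is tagged at least three times. One possible refinement is to tag once and derive all three values from the same tag list. The sketch below works on an already-tagged sentence (the output of `pos_tag`) so that it runs without NLTK. `summarize_tags` and the tag-set constants are hypothetical names; the tag sets mirror the helpers above, leaving out `VBF` and `VBC`, which are not Penn Treebank tags.

```python
FUTURE_TAGS = {"MD", "VB", "VBN"}
PRESENT_TAGS = {"VBP", "VBZ", "VBG"}
PAST_TAGS = {"VBD", "VBN"}
SUBJECT_TAGS = {"PRP", "NNP", "NNS"}
TIME_TAGS = {"NNP", "CD"}


def summarize_tags(tagged):
    """Derive tense, subject count, and time-object count from one tagged sentence."""
    tags = [tag for _, tag in tagged]
    scores = {
        "future": sum(tag in FUTURE_TAGS for tag in tags),
        "present": sum(tag in PRESENT_TAGS for tag in tags),
        "past": sum(tag in PAST_TAGS for tag in tags),
    }
    return {
        "tense": max(scores, key=scores.get),  # ties go to the first key listed
        "subject_count": sum(tag in SUBJECT_TAGS for tag in tags),
        "time_count": sum(tag in TIME_TAGS for tag in tags),
    }


# Tags copied from the output shown earlier for this sentence.
tagged = [("I", "PRP"), ("want", "VBP"), ("you", "PRP"), ("to", "TO"),
          ("complete", "VB"), ("the", "DT"), ("task", "NN"), ("by", "IN"),
          ("Wednesday", "NNP"), (".", ".")]
print(summarize_tags(tagged))
# {'tense': 'future', 'subject_count': 3, 'time_count': 1}
```

In this example `future` and `present` both score 1, and the sentence is labeled `future` only because that key comes first.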

In [23]:
# example
df_example = get_action_strings(df = df_test, text_col_name = "text", filter_strings = False)
df_example

| | text | tense | subject_count | time_count |
|---|---|---|---|---|
| 0 | So, and given the complete scenario, this is w... | future | 1 | 1 |
| 1 | I will close the window. | future | 1 | 0 |
| 2 | We will attend the meeting in one hour. | future | 1 | 1 |
| 3 | I did the assignment yesterday. | past | 1 | 0 |
| 4 | I walked the dog in the morning. | past | 1 | 0 |
| 5 | My cat is stepping on the keyboard. | present | 0 | 0 |
| 6 | Could you repeat that? | present | 2 | 1 |
| 7 | Would you like to get food with me tomorrow? | future | 2 | 0 |
| 8 | can you submit the code. | future | 1 | 0 |
| 9 | can you submit the code by tomorrow. | future | 1 | 0 |


In [24]:
# example
df_example = get_action_strings(df = df_test, text_col_name = "text", filter_strings = True)
df_example

| | text | tense | subject_count | time_count |
|---|---|---|---|---|
| 0 | So, and given the complete scenario, this is w... | future | 1 | 1 |
| 2 | We will attend the meeting in one hour. | future | 1 | 1 |
| 14 | I want you to complete the task by Wednesday. | future | 3 | 1 |
| 24 | Be sure you submit the work at 10:30 PM. | future | 2 | 2 |
| 25 | Amir can you submit the work. | future | 2 | 1 |
| 26 | Amir can you submit the work tomorrow. | future | 2 | 1 |


In [25]:
# pip installing datefinder
!pip install datefinder

Collecting datefinder
  Downloading datefinder-0.7.1-py2.py3-none-any.whl (10 kB)
Installing collected packages: datefinder
Successfully installed datefinder-0.7.1




In [34]:
import datefinder

text = "I will do the assignment Sunday."

matches = datefinder.find_dates(text)

for match in matches:
    print(match)

2022-02-20 00:00:00


In [48]:
text = "Six"

tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
tagged

[('Six', 'NN')]