## Taking control of happiness: what actions drive happiness?
#### *A preliminary review of the actions/verbs behind happy moments*

HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon's Mechanical Turk. In this notebook, we extract verbs from the text that represent the actions that made people happy during the reflection period.

#### **Step 0: Load required libraries**

From the packages' descriptions:

+ `pandas` is an open source data analysis and manipulation tool;
+ `numpy` is an open source project that enables numerical computing with Python;
+ `spacy` is a free open-source library for Natural Language Processing in Python;

In [1]:
import spacy
import pandas as pd
import numpy as np
from spacy import displacy
from spacy.matcher import DependencyMatcher
from spacy.matcher import Matcher
import nltk
from nltk.tokenize import sent_tokenize
import re

#### **Step 1: Load data to be processed**

Description of key datasets:

+ `hb_df` is the main table with all responses from HappyDB's database;
+ `demo_df` is the self-reported demographic data of respondents;

In [2]:
url_hb = 'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv'
url_demo = 'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'

In [3]:
hb_df = pd.read_csv(url_hb)
demo_df = pd.read_csv(url_demo)

In [4]:
hb_df.head() # check data

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection
1,27674,2,24h,I was happy when my son got 90% marks in his e...,I was happy when my son got 90% marks in his e...,True,1,,affection
2,27675,1936,24h,I went to the gym this morning and did yoga.,I went to the gym this morning and did yoga.,True,1,,exercise
3,27676,206,24h,We had a serious talk with some friends of our...,We had a serious talk with some friends of our...,True,2,bonding,bonding
4,27677,6227,24h,I went with grandchildren to butterfly display...,I went with grandchildren to butterfly display...,True,1,,affection


In [5]:
demo_df.head() # check data

Unnamed: 0,wid,age,country,gender,marital,parenthood
0,1,37.0,USA,m,married,y
1,2,29.0,IND,m,married,y
2,3,25.0,IND,m,single,n
3,4,32.0,USA,m,married,y
4,5,29.0,USA,m,married,y


In Steps 2 and 3, we use only two columns from *hb_df*, 'hmid' and 'cleaned_hm', stored in a pandas DataFrame called 'text_corpus'. This reduces the memory and compute needed to process the text.

In [6]:
text_corpus = hb_df[['hmid','cleaned_hm']]

In [7]:
text_corpus.head()

Unnamed: 0,hmid,cleaned_hm
0,27673,I went on a successful date with someone I fel...
1,27674,I was happy when my son got 90% marks in his e...
2,27675,I went to the gym this morning and did yoga.
3,27676,We had a serious talk with some friends of our...
4,27677,I went with grandchildren to butterfly display...


#### **Step 2: Process data and check attributes**

**Lemmatization, POS mapping and syntactic dependency mapping**: In this step, we process the text data to extract the lemma, POS tag and parsed dependency label of each token in a response. While we don't use these attributes directly in our analysis, this step is crucial for understanding the underlying structure of the data.

Because of memory constraints, we check the structure for a random sample of rows from 'text_corpus' rather than for the full corpus.

In [8]:
nlp = spacy.load("en_core_web_sm") # Load spaCy English model and store it in an object 'nlp'

In [9]:
def process_text(text):
    doc = nlp(text)
    token_texts = [token.text for token in doc]
    token_pos = [token.pos_ for token in doc]
    token_lemmas = [token.lemma_ for token in doc]
    token_deps = [token.dep_ for token in doc]

    return token_texts, token_pos, token_lemmas, token_deps

In the next step, we take a small sample from the 'text_corpus' dataframe to check POS mapping, lemmatization and, most importantly, the parsed dependencies in the text. This sample helps us understand the structure of sentences in the HappyDB corpus and plays a crucial role in defining the patterns used to extract action items in subsequent steps.

In [10]:
s_size = 50
np.random.seed(8957)
sample_df = text_corpus[['hmid', 'cleaned_hm']].sample(n=s_size, replace=False)
sample_df['index'] = np.arange(1, s_size + 1)
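
As an alternative to seeding NumPy's global random state, `DataFrame.sample` accepts a `random_state` argument that keeps the seed local to the call. The sketch below uses a small toy dataframe (`toy_corpus` is a hypothetical stand-in for 'text_corpus'), and it is not guaranteed to select the same rows as the cell above.

```python
import pandas as pd

# Hypothetical stand-in for 'text_corpus'
toy_corpus = pd.DataFrame({
    "hmid": range(1, 11),
    "cleaned_hm": [f"happy moment {i}" for i in range(1, 11)],
})

# Passing the seed to sample() makes the draw repeatable
# without touching NumPy's global random state.
sample_a = toy_corpus.sample(n=3, replace=False, random_state=8957)
sample_b = toy_corpus.sample(n=3, replace=False, random_state=8957)

print(sample_a.equals(sample_b))
```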

In [11]:
# Apply the function to the 'cleaned_hm' column to extract attributes
sample_df[['token_text', 'token_pos', 'token_lemma', 'token_dep']] = sample_df['cleaned_hm'].apply(process_text).apply(pd.Series)
sample_df.head()

Unnamed: 0,hmid,cleaned_hm,index,token_text,token_pos,token_lemma,token_dep
59270,87249,I had In N Out for the first time in over thre...,1,"[I, had, In, N, Out, for, the, first, time, in...","[PRON, VERB, ADP, NUM, PROPN, ADP, DET, ADJ, N...","[I, have, in, n, Out, for, the, first, time, i...","[nsubj, ROOT, prep, pobj, dobj, prep, det, amo..."
82373,110490,I got some household chores done.,2,"[I, got, some, household, chores, done, .]","[PRON, VERB, DET, NOUN, NOUN, VERB, PUNCT]","[I, get, some, household, chore, do, .]","[nsubj, ROOT, det, compound, nsubj, ccomp, punct]"
56684,84650,I filed my taxes and did not owe any money. I ...,3,"[I, filed, my, taxes, and, did, not, owe, any,...","[PRON, VERB, PRON, NOUN, CCONJ, AUX, PART, VER...","[I, file, my, taxis, and, do, not, owe, any, m...","[nsubj, ROOT, poss, dobj, cc, aux, neg, conj, ..."
16592,44362,My 6 year old finished her first grade reading...,4,"[My, 6, year, old, finished, her, first, grade...","[PRON, NUM, NOUN, ADJ, VERB, PRON, ADJ, NOUN, ...","[my, 6, year, old, finish, her, first, grade, ...","[poss, nummod, npadvmod, nsubj, ROOT, poss, am..."
56094,84055,My daughter had lost her first tooth and she w...,5,"[My, daughter, had, lost, her, first, tooth, a...","[PRON, NOUN, AUX, VERB, PRON, ADJ, NOUN, CCONJ...","[my, daughter, have, lose, her, first, tooth, ...","[poss, nsubj, aux, ROOT, poss, amod, dobj, cc,..."


Next, we visualize a few dependency parses using the displaCy visualizer. This gives us an idea of the dependencies that exist in a sentence. Based on these examples, we'll create the dependency patterns that represent 'actions' or verbs.

In [12]:
txt1 = sample_df['cleaned_hm'].iloc[46]
txt2 = sample_df['cleaned_hm'].iloc[35]
txt3 = sample_df['cleaned_hm'].iloc[21]
txt4 = sample_df['cleaned_hm'].iloc[9]
txt5 = sample_df['cleaned_hm'].iloc[5]

In [14]:
doc1 = nlp(txt1)
displacy.render(doc1, style='dep', jupyter=True, options={'distance': 130})

In [15]:
doc2 = nlp(txt2)
displacy.render(doc2, style='dep', jupyter=True, options={'distance': 130})

In [16]:
doc3 = nlp(txt3)
displacy.render(doc3, style='dep', jupyter=True, options={'distance': 130})

In [17]:
doc4 = nlp(txt4)
displacy.render(doc4, style='dep', jupyter=True, options={'distance': 130})

In [18]:
doc5 = nlp(txt5)
displacy.render(doc5, style='dep', jupyter=True, options={'distance': 130})

As this is a preliminary attempt at extracting verbs or actions, we limit ourselves to a visual check to determine patterns. A more exhaustive and accurate method would need an extensive list of patterns to distinguish actions performed as an agent from actions experienced. Instead of dependency matching, we might also opt for more sophisticated methods, such as using transformers for advanced text summarization.

Based on the visualizations above, there are two key patterns that emerge:
1. Root verb > direct object 'dobj'. E.g., 'Watching', 'shows' or 'seeing', 'movies'
2. Root verb > direct object > preposition > object of a preposition 'pobj'. E.g., 'Celebrating', 'birthday', 'with', 'family'
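
To see how the two patterns behave, they can be tried on a hand-annotated sentence, which avoids loading a trained model. This is only a sketch: the sentence, its annotations and the names `toy_doc`, `verb_dobj` and `verb_dobj_prep_pobj` are illustrative assumptions, and a blank English pipeline is used purely to obtain a vocabulary.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab  # no trained model needed for this sketch

# Hand-annotated toy sentence: heads are absolute token indices
toy_doc = Doc(
    vocab,
    words=["I", "made", "recipe", "for", "bread"],
    pos=["PRON", "VERB", "NOUN", "ADP", "NOUN"],
    heads=[1, 1, 1, 2, 3],
    deps=["nsubj", "ROOT", "dobj", "prep", "pobj"],
)

# Pattern 1: root verb > direct object
verb_dobj = [
    {"RIGHT_ID": "root_verb", "RIGHT_ATTRS": {"DEP": "ROOT", "POS": "VERB"}},
    {"LEFT_ID": "root_verb", "REL_OP": ">", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
]
# Pattern 2: root verb > direct object > preposition > object of preposition
verb_dobj_prep_pobj = verb_dobj + [
    {"LEFT_ID": "object", "REL_OP": ">", "RIGHT_ID": "prep",
     "RIGHT_ATTRS": {"DEP": "prep"}},
    {"LEFT_ID": "prep", "REL_OP": ">", "RIGHT_ID": "pobj",
     "RIGHT_ATTRS": {"DEP": "pobj"}},
]

toy_matcher = DependencyMatcher(vocab)
toy_matcher.add("verb_dobj", [verb_dobj])
toy_matcher.add("verb_dobj_prep_pobj", [verb_dobj_prep_pobj])

# Map each pattern name to the phrase it matched
phrases = {
    vocab.strings[match_id]: " - ".join(toy_doc[i].text for i in token_ids)
    for match_id, token_ids in toy_matcher(toy_doc)
}
print(phrases)
```

Both patterns fire on the same sentence, which is why a single response can yield more than one phrase.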

#### **Step 3: Extract main actions / phrases from each response**

In this step, we use two techniques to extract information on actions performed by individuals: dependency matching to extract phrases based on dependency patterns, and a POS-based check to extract other information.

**Step 3.1: Identify 'root' verbs or actions using Dependency matching**

In [19]:
# Pattern 1: Root verb > direct object 'dobj'

pattern_1 = [
    {
        "RIGHT_ID": "root_verb",
        "RIGHT_ATTRS": {"dep": "ROOT", "pos": "VERB"}  # Match ROOT verb
    },
    {
        "LEFT_ID": "root_verb",
        "REL_OP": ">",
        "RIGHT_ID": "object",
        "RIGHT_ATTRS": {"dep": "dobj"}  # Match dependent tokens (dobj)
    }
]

# Pattern 2: Root verb > direct object > preposition > object of a preposition 'pobj'

pattern_2 = [
    {
        "RIGHT_ID": "root_verb",
        "RIGHT_ATTRS": {"dep": "ROOT", "pos": "VERB"}  # Match ROOT verb
    },
    {
        "LEFT_ID": "root_verb",
        "REL_OP": ">",
        "RIGHT_ID": "object",
        "RIGHT_ATTRS": {"dep": "dobj"}  # Match dobj dependency
    },
    {
        "LEFT_ID": "object",
        "REL_OP": ">",
        "RIGHT_ID": "prep",
        "RIGHT_ATTRS": {"dep": "prep"}  # Match prep dependency
    },
    {
        "LEFT_ID": "prep",
        "REL_OP": ">",
        "RIGHT_ID": "pobj",
        "RIGHT_ATTRS": {}  # Empty attrs match any child of the preposition (in practice, usually its 'pobj')
    }
]

In [20]:
# Instantiate the matcher and assign a unique name to each pattern
# Pattern 2 is added first so that the longer phrase is listed first;
# the matcher still returns matches for both patterns, so duplicates are handled later
matcher = DependencyMatcher(nlp.vocab)
matcher.add("root_dobj_prep_pobj_pattern_test", [pattern_2])
matcher.add("root_object_pattern_test", [pattern_1])

In [21]:
# Create an empty list to store results
results = []

In [22]:
for idx in text_corpus.index:
    text = text_corpus.at[idx, 'cleaned_hm']
    doc = nlp(text)
    matches = matcher(doc)

    # Map identifier 'hmid' to each row
    hmid = text_corpus.at[idx, 'hmid']

    for match_id, token_ids in matches:
        chain = " - ".join([doc[token_id].text for token_id in token_ids])
        results.append({"hmid": hmid, "Index": idx, "Root_Object_Phrase": chain})
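
Calling `nlp(text)` once per row can be slow on a corpus of this size. One common option is `nlp.pipe`, which processes texts in batches and, with `as_tuples=True`, carries an identifier alongside each text. The sketch below is an assumption about how the loop could be restructured, not what the notebook runs: it uses a blank tokenizer-only pipeline (`nlp_demo`) and two toy rows so that it runs without a trained model.

```python
import spacy

nlp_demo = spacy.blank("en")  # tokenizer only; the notebook would use its loaded model

# (text, hmid) pairs; with as_tuples=True the second element is passed through unchanged
toy_rows = [
    ("I went to the gym this morning and did yoga.", 27675),
    ("I meditated last night.", 27678),
]

collected = []
for doc, hmid in nlp_demo.pipe(toy_rows, as_tuples=True, batch_size=1000):
    # In the notebook, matcher(doc) would be called here instead
    collected.append({"hmid": hmid, "n_tokens": len(doc)})

print(collected)
```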

In [34]:
# Create a DataFrame from the results list
results_df = pd.DataFrame(results)

# Print the DataFrame
results_df.head()

Unnamed: 0,hmid,Index,Root_Object_Phrase
0,27676,3,had - talk - with - friends
1,27676,3,had - talk
2,27679,6,made - recipe - for - bread
3,27679,6,made - recipe
4,27680,7,got - gift


This dataframe ('results_df') is likely to contain multiple phrases for a single response. Both patterns can match the same verb (as in rows 0 and 1 above), and when someone writes several sentences, the DependencyMatcher extracts an 'action' from each sentence that matches. Let's check whether any response ('hmid') appears more than once.

In [30]:
# Check for repeated 'hmid' values (not for fully identical rows)
if results_df['hmid'].duplicated().any():
    duplicatedRows = results_df[results_df['hmid'].duplicated()]
    print('\nDuplicated rows in the sheet:\n', duplicatedRows.head(10))
else:
    print('No duplicates')


Duplicated rows in the sheet:
      hmid  Index Root_Object_Phrase
1   27676      3         had - talk
3   27679      6      made - recipe
6   27682      9    Watching - wars
8   27684     11    completed - run
16  27704     31    consumed - bowl
19  27708     35      formed - team
29  27722     49       ate - dinner
34  27727     54  sealed - position
36  27728     55       made - bunch
39  27731     58         had - chat


Now that we've confirmed the presence of duplicates, we need to decide how to proceed. We could:
1. Keep the duplicate actions/phrases. If we later use these actions in visualizations alongside each person's demographic characteristics, this may give misleading group-wise results, because one response would be counted several times under different actions.
2. Remove the duplicate actions/phrases. Done blindly, this might remove the 'main' action taken by the person. If we assume that the longest matched string is the most likely to represent the main action, we can remove duplicates conditionally and keep only the longest phrase per response.

As the next part of this project is focused on visualizations, we will follow option 2 and conditionally remove duplicates.

In [None]:
# Convert 'hmid' column to string type so the variable is treated as an identifier.
results_df['hmid'] = results_df['hmid'].astype(str)

# Define function to find the longest phrase/action
def max_length_row(group):
    return group.loc[group['Root_Object_Phrase'].str.len().idxmax()]

# Group by 'hmid' and iteratively apply the 'max_length_row' function
results_df_uniq = results_df.groupby('hmid', group_keys=False, as_index=False).apply(max_length_row)
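
The same "keep the longest phrase per 'hmid'" rule can also be written without `groupby(...).apply`, by taking the index of the maximum string length within each group. This is a sketch of an alternative on toy data, not the code the notebook ran; `toy_phrases` and `toy_uniq` are hypothetical names.

```python
import pandas as pd

toy_phrases = pd.DataFrame({
    "hmid": [27676, 27676, 27679, 27679, 27680],
    "Root_Object_Phrase": [
        "had - talk - with - friends",
        "had - talk",
        "made - recipe - for - bread",
        "made - recipe",
        "got - gift",
    ],
})

# Length of each phrase, then the row label of the longest phrase per 'hmid'
lengths = toy_phrases["Root_Object_Phrase"].str.len()
longest_idx = lengths.groupby(toy_phrases["hmid"]).idxmax()

# Keep one row per response
toy_uniq = toy_phrases.loc[longest_idx].reset_index(drop=True)
print(toy_uniq)
```

When two phrases of a response have the same length, `idxmax` keeps the first one it encounters.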

In [33]:
# Check results
results_df_uniq.head()

Unnamed: 0,hmid,Index,Root_Object_Phrase
0,100000,71930,bought - earrings
1,100001,71931,took - tour
2,100003,71933,awarded - employee - of - month
3,100004,71934,made - plans
4,100005,71935,got - job - at - hospital


The patterns revealed some interesting verbs. However, they currently extract verbs from responses without distinguishing whether the action was initiated or controlled by the person (agent vs. experiencer). For example, a statement such as "I ran for 5 miles" represents an action *taken* by the person. On the other hand, a statement such as "My brother bought me a gift." represents an action *experienced* by the person (i.e., not initiated by them). In this project, we aim to identify the former.

To achieve this, we simply identify and keep sentences in a response that start with 'I', 'i', or 'I'm', or that start with a verb (such as 'Went to the gym').

*Note: This method is a preliminary approach to focusing on the subject of the event or verb. Many responses use incomplete phrases or complex structures that this rule does not capture. Further analysis is needed to find more robust approaches.*
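
The pronoun half of this rule does not need a parser. As a rough illustration only, a regular expression can flag sentences that open with 'I' or 'I'm'; the verb-initial half still depends on POS tags and is not covered here. The names `FIRST_PERSON` and `starts_with_first_person` are hypothetical.

```python
import re

# Matches a sentence that opens with "I" as a whole word, optionally followed by 'm
# (straight or curly apostrophe), ignoring case and leading whitespace.
FIRST_PERSON = re.compile(r"^\s*i(?:['’]m)?\b", flags=re.IGNORECASE)

def starts_with_first_person(sentence: str) -> bool:
    """Return True when the sentence begins with 'I' / 'i' / 'I'm'."""
    return FIRST_PERSON.match(sentence) is not None

examples = [
    "I ran for 5 miles",
    "I'm happy that I finished my project.",
    "It was a sunny day.",
    "My brother bought me a gift.",
]
first_person_flags = {s: starts_with_first_person(s) for s in examples}
print(first_person_flags)
```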

**Step 3.2: Extract 'I' pronoun for each row, if present, and store separately**

In [None]:

# Function to check if a sentence starts with 'I'/'i' (including "I'm") or a verb
def starts_with_I_or_verb(sentence):
    doc = nlp(sentence)
    first_word = doc[0].text.lower()
    # spaCy's tokenizer splits "I'm" into "I" + "'m", so the 'i' check already covers it
    return (first_word in ['i', "i'm"] or doc[0].pos_ == 'VERB')

# Create empty list to store results
results_i = []

for idx, (text, hmid) in enumerate(zip(text_corpus['cleaned_hm'], text_corpus['hmid'])):
    # Tokenize 'cleaned_hm' into sentences and get POS mappings
    doc = nlp(text)

    for i, sent in enumerate(doc.sents):
        # Flag the sentence 'Yes' when it starts with 'I' or a verb, otherwise 'No'
        flag = 'Yes' if starts_with_I_or_verb(sent.text) else 'No'
        results_i.append({
            "hmid": hmid,
            "Index": idx,
            "Sentence Index": i,
            "Sentence": sent.text,
            "Starts with I or Verb": flag
        })

# Convert the list of dictionaries to a DataFrame
results_dF_with_I = pd.DataFrame(results_i)

In [43]:
# Print the resulting DataFrame
results_dF_with_I.head()

Unnamed: 0,hmid,Index,Sentence Index,Sentence,Starts with I or Verb
0,27673,0,0,I went on a successful date with someone I fel...,Yes
1,27674,1,0,I was happy when my son got 90% marks in his e...,Yes
2,27675,2,0,I went to the gym this morning and did yoga.,Yes
3,27676,3,0,We had a serious talk with some friends of our...,No
4,27676,3,1,They understood and we had a good evening hang...,No


In [44]:
results_dF_with_I_yes = results_dF_with_I[results_dF_with_I['Starts with I or Verb'] == 'Yes']
results_dF_with_I_yes.head()

Unnamed: 0,hmid,Index,Sentence Index,Sentence,Starts with I or Verb
0,27673,0,0,I went on a successful date with someone I fel...,Yes
1,27674,1,0,I was happy when my son got 90% marks in his e...,Yes
2,27675,2,0,I went to the gym this morning and did yoga.,Yes
5,27677,4,0,I went with grandchildren to butterfly display...,Yes
6,27678,5,0,I meditated last night.,Yes


We now have two main datasets:
* *results_df_uniq*: A dataframe with one action word or phrase per HMID, extracted using dependency matching
* *results_dF_with_I_yes*: A dataframe with all sentences (and their HMID) that start with 'I'/'I'm' or a verb.

Next, we combine these dataframes and keep only the action items that satisfy the condition in *results_dF_with_I_yes* (i.e., action items mapped to responses containing a sentence that starts with 'I'/'I'm' or a verb).

#### **Step 4: Merge all datasets**

In [46]:
# Convert 'hmid' back to int in results_df_uniq so it matches the key type in results_dF_with_I_yes
results_df_uniq['hmid'] = results_df_uniq['hmid'].astype(int)

In [47]:
merged_df = results_df_uniq.merge(results_dF_with_I_yes, on='hmid', how='left')[['hmid', 'Root_Object_Phrase', 'Sentence', 'Starts with I or Verb']]

In [69]:
filtered_merged_df = merged_df[merged_df['Starts with I or Verb'] == 'Yes']

filtered_merged_df.head()

Unnamed: 0,hmid,Root_Object_Phrase,Sentence,Starts with I or Verb
0,100000,bought - earrings,I bought cute earrings,Yes
3,100004,made - plans,I made plans to meet up with a girl I like.,Yes
5,100006,found - clothes,I found some new clothes that were on sale and...,Yes
6,100008,had - visit - with - him,I got to see my brother for the first time in ...,Yes
7,100008,had - visit - with - him,I had a good visit with him.,Yes
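
The left merge followed by the 'Yes' filter keeps the same rows as an inner merge on 'hmid', because phrases without a matching sentence end up with a missing flag and are filtered out. A minimal sketch on toy data (the dataframe names are hypothetical stand-ins for the ones above):

```python
import pandas as pd

toy_actions = pd.DataFrame({
    "hmid": [100000, 100001, 100004],
    "Root_Object_Phrase": ["bought - earrings", "took - tour", "made - plans"],
})
toy_agent_sentences = pd.DataFrame({
    "hmid": [100000, 100004],
    "Sentence": ["I bought cute earrings",
                 "I made plans to meet up with a girl I like."],
    "Starts with I or Verb": ["Yes", "Yes"],
})

# Approach used in the notebook: left merge, then filter on the flag
left_then_filter = toy_actions.merge(toy_agent_sentences, on="hmid", how="left")
left_then_filter = left_then_filter[left_then_filter["Starts with I or Verb"] == "Yes"]

# Equivalent single step: inner merge
inner = toy_actions.merge(toy_agent_sentences, on="hmid", how="inner")

# Same rows; only the index labels differ before the reset
same_rows = left_then_filter.reset_index(drop=True).equals(inner)
print(same_rows)
```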


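The dataframe *filtered_merged_df* contains the action words or phrases *only* for responses with sentences that start with 'I' or a verb (i.e., actions initiated by the person). Because phrases are mapped at the response level ('hmid'), a response with several qualifying sentences repeats its phrase on each of them (see 'hmid' 100008 above). Now, we merge this dataframe with the *hb_df* and *demo_df* dataframes and drop the unnecessary columns.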

In [52]:
final_merged_df = filtered_merged_df.merge(hb_df, on = 'hmid', how = 'left') # Left join as we want to only keep 'moments' for which we have successfully identified action phrases.

In [53]:
final_merged_df = final_merged_df.merge(demo_df, on = 'wid', how = 'left')
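
Since each respondent ('wid') is expected to appear once in the demographic table, the merge can be asked to verify this with pandas' `validate` argument, which raises an error instead of silently duplicating rows. A sketch on toy data, assuming one demographic row per 'wid' (the dataframe names are hypothetical):

```python
import pandas as pd

toy_moments = pd.DataFrame({"hmid": [100000, 100004, 100008],
                            "wid": [884, 2905, 271]})
toy_demo = pd.DataFrame({"wid": [271, 884, 2905],
                         "age": [36.0, 22.0, 20.0]})

# Many moments may share a respondent, but each 'wid' must be unique in toy_demo
checked = toy_moments.merge(toy_demo, on="wid", how="left", validate="many_to_one")

# A duplicated respondent in the demographic table now fails loudly
toy_demo_dupe = pd.concat([toy_demo, toy_demo.iloc[[0]]], ignore_index=True)
try:
    toy_moments.merge(toy_demo_dupe, on="wid", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(checked)
print(raised)
```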

In [56]:
final_merged_df.head(5) #check data

Unnamed: 0,hmid,Root_Object_Phrase,Sentence,Starts with I or Verb,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,age,country,gender,marital,parenthood
0,100000,bought - earrings,I bought cute earrings,Yes,884,3m,I bought cute earrings,I bought cute earrings,True,1,achievement,achievement,22.0,USA,f,married,n
1,100004,made - plans,I made plans to meet up with a girl I like.,Yes,2905,3m,I made plans to meet up with a girl I like.,I made plans to meet up with a girl I like.,True,1,,affection,20.0,USA,m,single,n
2,100006,found - clothes,I found some new clothes that were on sale and...,Yes,340,3m,I found some new clothes that were on sale and...,I found some new clothes that were on sale and...,True,1,achievement,achievement,43.0,USA,f,married,y
3,100008,had - visit - with - him,I got to see my brother for the first time in ...,Yes,271,3m,I got to see my brother for the first time in ...,I got to see my brother for the first time in ...,True,2,,affection,36.0,USA,m,married,y
4,100008,had - visit - with - him,I had a good visit with him.,Yes,271,3m,I got to see my brother for the first time in ...,I got to see my brother for the first time in ...,True,2,,affection,36.0,USA,m,married,y


In [61]:
print(final_merged_df.columns.tolist()) # see all columns

['hmid', 'Root_Object_Phrase', 'Sentence', 'Starts with I or Verb', 'wid', 'reflection_period', 'original_hm', 'cleaned_hm', 'modified', 'num_sentence', 'ground_truth_category', 'predicted_category', 'age', 'country', 'gender', 'marital', 'parenthood']


In [64]:
# Keep only relevant columns

final_df = final_merged_df[
    ['hmid', 'Root_Object_Phrase', 'Sentence', 'cleaned_hm', 'Starts with I or Verb', 'wid', 'reflection_period','predicted_category',
     'age', 'country', 'gender', 'marital', 'parenthood']
]

In [65]:
final_df.head() #check results

Unnamed: 0,hmid,Root_Object_Phrase,Sentence,cleaned_hm,Starts with I or Verb,wid,reflection_period,predicted_category,age,country,gender,marital,parenthood
0,100000,bought - earrings,I bought cute earrings,I bought cute earrings,Yes,884,3m,achievement,22.0,USA,f,married,n
1,100004,made - plans,I made plans to meet up with a girl I like.,I made plans to meet up with a girl I like.,Yes,2905,3m,affection,20.0,USA,m,single,n
2,100006,found - clothes,I found some new clothes that were on sale and...,I found some new clothes that were on sale and...,Yes,340,3m,achievement,43.0,USA,f,married,y
3,100008,had - visit - with - him,I got to see my brother for the first time in ...,I got to see my brother for the first time in ...,Yes,271,3m,affection,36.0,USA,m,married,y
4,100008,had - visit - with - him,I had a good visit with him.,I got to see my brother for the first time in ...,Yes,271,3m,affection,36.0,USA,m,married,y


In [66]:
from google.colab import files

In [67]:
final_df.to_csv('final_df_Sept20.csv', encoding = 'utf-8-sig')

In [68]:
files.download('final_df_Sept20.csv')
