## Taking control of happiness: what actions drive happiness?
#### *A preliminary review of the actions/verbs that lie behind happy moments*

HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon's Mechanical Turk. In this notebook, we extract verbs from the text that represent the actions that made people happy at the time of reflection.

#### **Step 0: Load required libraries**

From the packages' descriptions:

+ `pandas` is an open source data analysis and manipulation tool;
+ `numpy` is an open source project that enables numerical computing with Python;
+ `spacy` is a free open-source library for Natural Language Processing in Python;

In [1]:
import spacy
import pandas as pd
import numpy as np
from spacy import displacy
from spacy.matcher import DependencyMatcher
from spacy.matcher import Matcher

#### **Step 1: Load data to be processed**

Description of key datasets:

+ `hb_df` is the main table with all responses from HappyDB's database;
+ `demo_df` is the self-reported demographic data of respondents;

In [2]:
url_hb = 'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv'
url_demo = 'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'

In [3]:
hb_df = pd.read_csv(url_hb)
demo_df = pd.read_csv(url_demo)

In [4]:
hb_df.head() # check data

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection
1,27674,2,24h,I was happy when my son got 90% marks in his e...,I was happy when my son got 90% marks in his e...,True,1,,affection
2,27675,1936,24h,I went to the gym this morning and did yoga.,I went to the gym this morning and did yoga.,True,1,,exercise
3,27676,206,24h,We had a serious talk with some friends of our...,We had a serious talk with some friends of our...,True,2,bonding,bonding
4,27677,6227,24h,I went with grandchildren to butterfly display...,I went with grandchildren to butterfly display...,True,1,,affection


In [5]:
demo_df.head() # check data

Unnamed: 0,wid,age,country,gender,marital,parenthood
0,1,37.0,USA,m,married,y
1,2,29.0,IND,m,married,y
2,3,25.0,IND,m,single,n
3,4,32.0,USA,m,married,y
4,5,29.0,USA,m,married,y


From Step 2 onward, we use only two columns from *hb_df*, 'hmid' and 'cleaned_hm', stored in a pandas DataFrame called 'text_corpus'. This reduces the memory and compute needed to analyze the data.

In [6]:
text_corpus = hb_df[['hmid','cleaned_hm']]

In [7]:
text_corpus.head()

Unnamed: 0,hmid,cleaned_hm
0,27673,I went on a successful date with someone I fel...
1,27674,I was happy when my son got 90% marks in his e...
2,27675,I went to the gym this morning and did yoga.
3,27676,We had a serious talk with some friends of our...
4,27677,I went with grandchildren to butterfly display...
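
As a rough check of the saving from working with two columns, the sketch below compares the memory footprint of a small frame with that of its two-column subset. The rows are copied from outputs shown in this notebook; `demo_full`, `demo_subset` and the byte counters are illustrative names, not part of the analysis.

```python
import pandas as pd

# Two complete responses copied from the notebook outputs (illustrative subset).
demo_full = pd.DataFrame({
    "hmid": [27675, 27682],
    "wid": [1936, 4833],
    "reflection_period": ["24h", "24h"],
    "original_hm": [
        "I went to the gym this morning and did yoga.",
        "Watching cupcake wars with my three teen children",
    ],
    "cleaned_hm": [
        "I went to the gym this morning and did yoga.",
        "Watching cupcake wars with my three teen children",
    ],
    "predicted_category": ["exercise", "affection"],
})

# .copy() makes the subset independent of the original frame.
demo_subset = demo_full[["hmid", "cleaned_hm"]].copy()

# deep=True counts the string contents, not just the object pointers.
full_bytes = int(demo_full.memory_usage(deep=True).sum())
subset_bytes = int(demo_subset.memory_usage(deep=True).sum())
print(f"full: {full_bytes} bytes, subset: {subset_bytes} bytes")
```

Taking an explicit `.copy()` also avoids pandas warnings if columns are later added to the subset.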


#### **Step 2: Process data and check attributes**

**Lemmatization, POS mapping and syntactic dependency mapping**: In this step, we process the text data to extract the lemma, part-of-speech (POS) tag and parsed dependency label of each token in a response. While we don't use these attributes directly in our analysis, this step is crucial for understanding the underlying structure of the data.

Because of memory constraints, we check the structure for a random sample of 50 responses from 'text_corpus'.

In [8]:
nlp = spacy.load("en_core_web_sm") # Load spaCy English model and store it in an object 'nlp'

In [None]:
def process_text(text):
    doc = nlp(text)
    token_texts = [token.text for token in doc]
    token_pos = [token.pos_ for token in doc]
    token_lemmas = [token.lemma_ for token in doc]
    token_deps = [token.dep_ for token in doc]

    return token_texts, token_pos, token_lemmas, token_deps

In [None]:
s_size = 50
np.random.seed(8957)
sample_df = text_corpus[['hmid', 'cleaned_hm']].sample(n=s_size, replace=False)
sample_df['index'] = np.arange(1, s_size + 1)
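
`DataFrame.sample` also accepts a `random_state` argument, which makes the draw reproducible without relying on NumPy's global seed. A minimal sketch, assuming a stand-in frame `corpus_demo` with placeholder text (not HappyDB data):

```python
import pandas as pd

# Stand-in corpus with placeholder text; only the sampling call matters here.
corpus_demo = pd.DataFrame({
    "hmid": list(range(27673, 27693)),
    "cleaned_hm": [f"placeholder moment {i}" for i in range(20)],
})

# The same random_state yields the same rows on every call.
sample_a = corpus_demo.sample(n=5, replace=False, random_state=8957)
sample_b = corpus_demo.sample(n=5, replace=False, random_state=8957)

print(sample_a.equals(sample_b))
```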

In [None]:
# Apply the function to the 'cleaned_hm' column to extract attributes
sample_df[['token_text', 'token_pos', 'token_lemma', 'token_dep']] = sample_df['cleaned_hm'].apply(process_text).apply(pd.Series)
sample_df.head()

Unnamed: 0,hmid,cleaned_hm,index,token_text,token_pos,token_lemma,token_dep
59270,87249,I had In N Out for the first time in over thre...,1,"[I, had, In, N, Out, for, the, first, time, in...","[PRON, VERB, ADP, NUM, PROPN, ADP, DET, ADJ, N...","[I, have, in, n, Out, for, the, first, time, i...","[nsubj, ROOT, prep, pobj, dobj, prep, det, amo..."
82373,110490,I got some household chores done.,2,"[I, got, some, household, chores, done, .]","[PRON, VERB, DET, NOUN, NOUN, VERB, PUNCT]","[I, get, some, household, chore, do, .]","[nsubj, ROOT, det, compound, nsubj, ccomp, punct]"
56684,84650,I filed my taxes and did not owe any money. I ...,3,"[I, filed, my, taxes, and, did, not, owe, any,...","[PRON, VERB, PRON, NOUN, CCONJ, AUX, PART, VER...","[I, file, my, taxis, and, do, not, owe, any, m...","[nsubj, ROOT, poss, dobj, cc, aux, neg, conj, ..."
16592,44362,My 6 year old finished her first grade reading...,4,"[My, 6, year, old, finished, her, first, grade...","[PRON, NUM, NOUN, ADJ, VERB, PRON, ADJ, NOUN, ...","[my, 6, year, old, finish, her, first, grade, ...","[poss, nummod, npadvmod, nsubj, ROOT, poss, am..."
56094,84055,My daughter had lost her first tooth and she w...,5,"[My, daughter, had, lost, her, first, tooth, a...","[PRON, NOUN, AUX, VERB, PRON, ADJ, NOUN, CCONJ...","[my, daughter, have, lose, her, first, tooth, ...","[poss, nsubj, aux, ROOT, poss, amod, dobj, cc,..."


Next, we visualize a few dependencies using the displaCy visualizer. This gives an idea of the dependencies that exist in a sentence. This part can be skipped if not needed.

In [None]:
sample_txt = sample_df['cleaned_hm'].iloc[13]

doc = nlp(sample_txt)

In [None]:
displacy.serve(doc, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
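
`displacy.serve` starts a local web server and blocks the cell until it is stopped. Inside a notebook, `displacy.render` is an alternative that returns (or displays) the diagram directly. The sketch below is an illustration under assumptions: it builds a small `Doc` with a hand-written parse so that it runs without downloading a model, whereas the notebook would pass its own parsed `doc`.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Hand-annotated parse (assumed); a trained pipeline would normally supply it.
vocab = spacy.blank("en").vocab
doc_demo = Doc(
    vocab,
    words=["I", "went", "to", "the", "gym"],
    heads=[1, 1, 1, 4, 2],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "prep", "det", "pobj"],
    pos=["PRON", "VERB", "ADP", "DET", "NOUN"],
)

# jupyter=False returns the SVG markup as a string; in a notebook,
# jupyter=True displays the diagram inline instead.
svg = displacy.render(doc_demo, style="dep", jupyter=False)
print(svg[:60])
```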


#### **Step 3: Extract main actions / phrases from each response**

In this step, we explore two techniques to extract information on actions performed by individuals: dependency matching and information extraction.

**Step 3.1: Identify 'root' verbs or actions using Dependency matching**

In [9]:
# Define patterns - 'root' verb and related direct object OR attributes
# (attribute names and the IN operator are written in upper case, as in the spaCy docs)

pattern = [
    {
        "RIGHT_ID": "root_verb",
        "RIGHT_ATTRS": {"DEP": "ROOT", "POS": "VERB"}  # Match ROOT verb
    },
    {
        "LEFT_ID": "root_verb",
        "REL_OP": ">",
        "RIGHT_ID": "object",
        "RIGHT_ATTRS": {"DEP": {"IN": ["dobj", "attr"]}}  # Match dependent tokens
    }
]

In [10]:
# Instantiate the matcher and assign a unique name to the pattern
matcher = DependencyMatcher(nlp.vocab)
matcher.add("root_object_pattern_test", [pattern])

In [11]:
# Create an empty list to store results
results = []

In [12]:
for idx in text_corpus.index:
    text = text_corpus.at[idx, 'cleaned_hm']
    doc = nlp(text)
    matches = matcher(doc)

    # Map identifier 'hmid' to each row
    hmid = text_corpus.at[idx, 'hmid']

    for match_id, token_ids in matches:
        chain = " - ".join([doc[token_id].text for token_id in token_ids])
        results.append({"hmid": hmid, "Index": idx, "Root_Object_Phrase": chain})
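
Calling `nlp` once per row works, but spaCy can also process texts in batches through `nlp.pipe`, and `as_tuples=True` carries an identifier alongside each text. The sketch below is an assumption about how the loop could be restructured, not the notebook's method. It uses a blank English pipeline (tokenizer only) so that it runs without a model download; the notebook would pass its own `nlp` and apply `matcher` to each `doc`.

```python
import spacy

# Tokenizer-only pipeline (assumption, to keep the sketch self-contained).
nlp_demo = spacy.blank("en")

# (text, hmid) pairs copied from the notebook outputs.
pairs = [
    ("I went to the gym this morning and did yoga.", 27675),
    ("Watching cupcake wars with my three teen children", 27682),
]

token_counts = {}
# as_tuples=True yields (doc, context) so each doc stays linked to its hmid.
for doc, hmid in nlp_demo.pipe(pairs, as_tuples=True, batch_size=64):
    token_counts[hmid] = len(doc)

print(token_counts)
```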

In [14]:
# Create a DataFrame from the results list
results_df = pd.DataFrame(results)

# Print the DataFrame
results_df.head(20)

Unnamed: 0,hmid,Index,Root_Object_Phrase
0,27676,3,had - talk
1,27679,6,made - recipe
2,27680,7,got - gift
3,27682,9,Watching - wars
4,27684,11,completed - run
5,27686,13,shorting - Gold
6,27687,14,take - while
7,27689,16,helped - neighbour
8,27693,20,Got - A
9,27694,21,called - me


The pattern revealed some interesting verbs. However, it currently extracts verbs from responses without distinguishing whether the action was initiated or controlled by the person (agent) or merely experienced by them (experiencer).

To approximate this distinction, we simply check whether the pronoun 'I' is present in a response and treat it as mapped to the root verb.

*Note: This is a preliminary approach to identifying who is the focus of the event or verb. Many responses are incomplete phrases that omit the subject of the action (such as "Gave an interview today.").*
*Further analysis is needed to find more robust approaches.*
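
One possible refinement, shown here only as a sketch and not as the method used in this notebook, is to require that 'I' is the grammatical subject (`nsubj`) of the root verb rather than just present somewhere in the response. The parses below are hand-annotated assumptions so the example runs without a model; `agent_pattern`, `agent_matcher` and `agent_phrases` are hypothetical names.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-annotated parses (assumed); a trained model would normally supply these.
doc_agent = Doc(
    vocab,
    words=["I", "made", "a", "new", "recipe"],
    heads=[1, 1, 4, 4, 1],
    deps=["nsubj", "ROOT", "det", "amod", "dobj"],
    pos=["PRON", "VERB", "DET", "ADJ", "NOUN"],
)
doc_other = Doc(
    vocab,
    words=["My", "friend", "called", "me"],
    heads=[1, 2, 2, 2],
    deps=["poss", "nsubj", "ROOT", "dobj"],
    pos=["PRON", "NOUN", "VERB", "PRON"],
)

# ROOT verb with an 'I' subject and a direct object or attribute.
agent_pattern = [
    {"RIGHT_ID": "root_verb", "RIGHT_ATTRS": {"DEP": "ROOT", "POS": "VERB"}},
    {"LEFT_ID": "root_verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj", "LOWER": "i"}},
    {"LEFT_ID": "root_verb", "REL_OP": ">", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": {"IN": ["dobj", "attr"]}}},
]

agent_matcher = DependencyMatcher(vocab)
agent_matcher.add("i_root_object", [agent_pattern])

def agent_phrases(doc):
    """Return (subject, verb, object) text triples for each match."""
    triples = []
    # Token ids come back in the order the pattern nodes were declared.
    for _, (verb_i, subj_i, obj_i) in agent_matcher(doc):
        triples.append((doc[subj_i].text, doc[verb_i].text, doc[obj_i].text))
    return triples

print(agent_phrases(doc_agent))
print(agent_phrases(doc_other))
```

This variant would still miss subjectless responses such as "Gave an interview today.", so it narrows the problem rather than solving it.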

**Step 3.2: Extract 'I' pronoun for each row, if present, and store separately**

In [15]:
# Create two separate Series, one for each column in 'text_corpus'
text_column = text_corpus['cleaned_hm']
hmid_column = text_corpus['hmid']

In [17]:
# Create an empty list to store results
results_ie = []

In [18]:
# Process data: record whether each response contains the pronoun 'I'
for idx, (text, hmid) in enumerate(zip(text_column, hmid_column)):
    doc = nlp(text)
    pronoun = None

    for token in doc:
        if token.text == 'I' and token.pos_ == 'PRON':
            pronoun = token.text
            break  # one occurrence is enough

    # Create a result dictionary
    result = {
        "hmid": hmid,
        "Index": idx,
        "Pronoun": pronoun
    }

    results_ie.append(result)

In [19]:
result_ie_df = pd.DataFrame(results_ie)
result_ie_df.head(10)

Unnamed: 0,hmid,Index,Pronoun
0,27673,0,I
1,27674,1,I
2,27675,2,I
3,27676,3,
4,27677,4,I
5,27678,5,I
6,27679,6,I
7,27680,7,I
8,27681,8,I
9,27682,9,


#### **Step 4: Merge all datasets**

In [36]:
merged_df = results_df.merge(result_ie_df, on = 'hmid', how = 'inner')

In [37]:
merged_df.head() # check data

Unnamed: 0,hmid,Index_x,Root_Object_Phrase,Index_y,Pronoun
0,27676,3,had - talk,3,
1,27679,6,made - recipe,6,I
2,27680,7,got - gift,7,I
3,27682,9,Watching - wars,9,
4,27684,11,completed - run,11,I
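
Because a response can produce more than one root-object match, 'hmid' may repeat in `results_df` while it should appear once in `result_ie_df`. pandas can check that assumption during the merge through the `validate` argument. A minimal sketch with stand-in frames; the third row of `matches_demo` is made up to illustrate a repeated 'hmid':

```python
import pandas as pd

# Stand-in for results_df: 'hmid' 27679 appears twice (second phrase is made up).
matches_demo = pd.DataFrame({
    "hmid": [27676, 27679, 27679],
    "Root_Object_Phrase": ["had - talk", "made - recipe", "made - bread"],
})
# Stand-in for result_ie_df: one row per 'hmid'.
pronoun_demo = pd.DataFrame({"hmid": [27676, 27679], "Pronoun": [None, "I"]})

# validate="many_to_one" raises if 'hmid' is duplicated on the right side.
merged_demo = matches_demo.merge(
    pronoun_demo, on="hmid", how="inner", validate="many_to_one"
)

# Show the check failing when the right side does contain a duplicate key.
pronoun_dupes = pd.concat([pronoun_demo, pronoun_demo.iloc[[0]]], ignore_index=True)
try:
    matches_demo.merge(pronoun_dupes, on="hmid", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged_demo), duplicate_detected)
```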


In [38]:
merged_df = merged_df.merge(hb_df, on = 'hmid', how = 'left')

In [39]:
merged_df = merged_df.merge(demo_df, on = 'wid', how = 'left')

In [40]:
merged_df.head() #check data

Unnamed: 0,hmid,Index_x,Root_Object_Phrase,Index_y,Pronoun,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,age,country,gender,marital,parenthood
0,27676,3,had - talk,3,,206,24h,We had a serious talk with some friends of our...,We had a serious talk with some friends of our...,True,2,bonding,bonding,28,DNK,f,married,n
1,27679,6,made - recipe,6,I,195,24h,"I made a new recipe for peasant bread, and it ...","I made a new recipe for peasant bread, and it ...",True,1,,achievement,30,USA,m,single,n
2,27680,7,got - gift,7,I,740,24h,I got gift from my elder brother which was rea...,I got gift from my elder brother which was rea...,True,1,,affection,23,IND,m,single,n
3,27682,9,Watching - wars,9,,4833,24h,Watching cupcake wars with my three teen children,Watching cupcake wars with my three teen children,True,1,,affection,41,USA,f,married,y
4,27684,11,completed - run,11,I,78,24h,I completed my 5 miles run without break. It m...,I completed my 5 miles run without break. It m...,True,2,,exercise,28,USA,f,married,y


In [41]:
print(merged_df.columns.tolist())

['hmid', 'Index_x', 'Root_Object_Phrase', 'Index_y', 'Pronoun', 'wid', 'reflection_period', 'original_hm', 'cleaned_hm', 'modified', 'num_sentence', 'ground_truth_category', 'predicted_category', 'age', 'country', 'gender', 'marital', 'parenthood']


In [42]:
# Keep only relevant columns
final_df = merged_df[
    ['hmid', 'Root_Object_Phrase', 'Pronoun', 'wid', 'reflection_period','predicted_category',
     'age', 'country', 'gender', 'marital', 'parenthood']
]
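
A possible next step, sketched here under assumptions rather than taken from the notebook, is to split 'Root_Object_Phrase' back into its verb and object so that verbs can be counted. The rows are copied from the output of Step 3.1; `phrases_demo`, `root_verb` and `object` are illustrative names.

```python
import pandas as pd

# Rows copied from the Step 3.1 output above.
phrases_demo = pd.DataFrame({
    "hmid": [27676, 27679, 27680, 27684, 27693],
    "Root_Object_Phrase": [
        "had - talk", "made - recipe", "got - gift", "completed - run", "Got - A",
    ],
})

# Split on the " - " separator used when the phrase was built.
parts = phrases_demo["Root_Object_Phrase"].str.split(" - ", n=1, expand=True)
phrases_demo["root_verb"] = parts[0].str.lower()  # lower-case so 'Got' and 'got' agree
phrases_demo["object"] = parts[1]

verb_counts = phrases_demo["root_verb"].value_counts()
print(verb_counts)
```

Counting lemmas instead of surface forms (for example 'get' rather than 'got') would need the lemma to be stored at extraction time.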