# Question

HappyDB is a corpus of 100,000+ crowd-sourced happy moments. Use this corpus to answer the following question: when people are asked to reflect on happy moments, who is mentioned most often: spouse, parents, children, friends, or someone else?
HappyDB is available on GitHub [1]. The cleaned data set is located at [2]. HappyDB also provides a “people dictionary” [3], which is a lexicon of common social relationships. If you can find, or create, a better lexicon of social relationships, please feel free to use it, and explain why it is a better lexicon for this task.
Write a Python script in a Jupyter Notebook and submit the .ipynb file, including your code and your explanations in comments.

# Step 1
Read the CSV files for cleaned_hm and people-dict.

In [76]:
# import pandas package
import pandas as pd

# import countVectorizer because we will use it to find frequency of words
from sklearn.feature_extraction.text import CountVectorizer

# import nltk
# from nltk.tokenize import sent_tokenize
# from nltk.tokenize import word_tokenize
# import re
# from nltk.tokenize import TreebankWordTokenizer
# from nltk.tokenize import WordPunctTokenizer
# from nltk.tokenize import WhitespaceTokenizer

In [77]:
# read cleaned_hm csv file
cleanedHm = pd.read_csv("/Users/shivangi/Downloads/cleaned_hm.csv")

# read people-dict csv file without header because this file does not have a header
peopleDict = pd.read_csv("/Users/shivangi/Downloads/people-dict.csv", header = None)

In [78]:
# take a look at the dataframe
cleanedHm.head()

|   | hmid | wid | reflection_period | original_hm | cleaned_hm | modified | num_sentence | ground_truth_category | predicted_category |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 27673 | 2053 | 24h | I went on a successful date with someone I fel... | I went on a successful date with someone I fel... | True | 1 | NaN | affection |
| 1 | 27674 | 2 | 24h | I was happy when my son got 90% marks in his e... | I was happy when my son got 90% marks in his e... | True | 1 | NaN | affection |
| 2 | 27675 | 1936 | 24h | I went to the gym this morning and did yoga. | I went to the gym this morning and did yoga. | True | 1 | NaN | exercise |
| 3 | 27676 | 206 | 24h | We had a serious talk with some friends of our... | We had a serious talk with some friends of our... | True | 2 | bonding | bonding |
| 4 | 27677 | 6227 | 24h | I went with grandchildren to butterfly display... | I went with grandchildren to butterfly display... | True | 1 | NaN | affection |


In [79]:
# take a look at peopleDict
peopleDict.head()

|   | 0 |
|---|---|
| 0 | aunt |
| 1 | auntie |
| 2 | aunties |
| 3 | aunts |
| 4 | aunty |


# Step 2
Vectorize. We'll use count vectorization since we want to know how many times a word appears in the column.

In [81]:
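# numpy is needed to flatten the summed sparse matrix
import numpy as np

# define the CountVectorizer
cv = CountVectorizer(encoding = 'latin-1')
# fit the vocabulary and build the sparse document-term matrix
vecs = cv.fit_transform(cleanedHm['cleaned_hm'])
# get word list (get_feature_names() was removed in scikit-learn 1.2)
word_list = cv.get_feature_names_out()
# get count list by summing each column without densifying the matrix
count_list = np.asarray(vecs.sum(axis=0)).ravel()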



In [82]:
freq = dict(zip(word_list,count_list))
print(freq.get('couples')) # word frequency
print(cv.vocabulary_.get('couples')) # word index, not frequency

17
5642


# Step 3
Add the frequency of each relevant word to the peopleDict dataframe.

In [83]:
# rename the column to a more meaningful value
peopleDict.rename(columns = {0:'relation'}, inplace = True)

In [84]:
# create a list of word frequencies of all words in peopleDict so we can later add this list to the peopleDict df
wordFrequency = []
for word in peopleDict['relation']:
    wordFrequency.append(freq.get(word))

In [85]:
# find the type of list that got created
type(wordFrequency)

list

In [86]:
# add word frequency as a column to the peopleDict dataframe
peopleDict['wordFrequency'] = wordFrequency

In [87]:
# check the peopleDict dataframe
peopleDict.head()

|   | relation | wordFrequency |
|---|---|---|
| 0 | aunt | 222.0 |
| 1 | auntie | 2.0 |
| 2 | aunties | NaN |
| 3 | aunts | 26.0 |
| 4 | aunty | 26.0 |
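The missing value for "aunties" arises because `freq.get(word)` returns `None` for a word that never occurs in the corpus. As a hedged alternative to the loop, the lookup can be written with `Series.map`, with absent words set to an explicit zero. The `toy_freq` and `toy_people` objects below are small stand-ins for the notebook's `freq` and `peopleDict`, using the counts shown in the table above.

```python
import pandas as pd

# Stand-ins for the notebook's `freq` dict and `peopleDict` frame
toy_freq = {"aunt": 222, "auntie": 2, "aunts": 26}
toy_people = pd.DataFrame({"relation": ["aunt", "auntie", "aunties", "aunts"]})

# map() looks each relation up in the dict; words missing from the dict
# become NaN, which fillna(0) turns into an explicit zero count
toy_people["wordFrequency"] = (
    toy_people["relation"].map(toy_freq).fillna(0).astype(int)
)
print(toy_people)
```

Filling with zero also keeps the column as integers rather than floats, which may be easier to read when the counts are reported.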


# Step 4
Get the most used relation in peopleDict.

In [88]:
print(peopleDict[peopleDict.wordFrequency == peopleDict.wordFrequency.max()])

   relation  wordFrequency
59   friend         6166.0
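Because the question compares several kinds of relationships, it may help to look at more than the single maximum. A minimal sketch using `DataFrame.nlargest` follows; the small `toy_people` frame is an assumption that reuses only counts already shown in this notebook, and the same call could be applied to `peopleDict`.

```python
import pandas as pd

# Small stand-in for peopleDict, using counts shown earlier in the notebook
toy_people = pd.DataFrame({
    "relation": ["aunt", "auntie", "aunts", "friend"],
    "wordFrequency": [222.0, 2.0, 26.0, 6166.0],
})

# nlargest returns the n rows with the highest values, sorted descending
top3 = toy_people.nlargest(3, "wordFrequency")
print(top3)
```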


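Therefore, "friend" is the most frequent people-dictionary word in these happy moments: it appears 6,166 times in the cleaned_hm column. This count covers only the exact token "friend"; other word forms are separate dictionary entries with their own counts, as the aunt and aunts rows in Step 3 show.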
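Since the question is phrased in terms of groups (spouse, parents, children, friends), one possible next step is to sum the counts of related words per group. The sketch below is only an illustration: the `category_of` mapping and all counts in `toy_people` are hypothetical, and a real analysis would need a hand-curated mapping over the full people dictionary.

```python
import pandas as pd

# Hypothetical counts and grouping, for illustration only
toy_people = pd.DataFrame({
    "relation": ["friend", "friends", "wife", "husband", "son", "daughter", "mom"],
    "wordFrequency": [10, 4, 6, 5, 3, 3, 2],
})
category_of = {
    "friend": "friends", "friends": "friends",
    "wife": "spouse", "husband": "spouse",
    "son": "children", "daughter": "children",
    "mom": "parents",
}

# Words without a mapping fall into a catch-all "other" group
toy_people["category"] = toy_people["relation"].map(category_of).fillna("other")

# Sum the word counts within each group and rank the groups
by_category = (
    toy_people.groupby("category")["wordFrequency"].sum().sort_values(ascending=False)
)
print(by_category)
```

Grouping this way compares relationship types rather than individual spellings, so a group whose mentions are spread over many words is not undercounted.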