## Assignment 1: Happy Moment Word Frequency

[HappyDB](https://raw.githubusercontent.com/megagonlabs/HappyDB/master/happydb/data/cleaned_hm.csv) is a crowd-sourced collection of more than 100,000 happy moments, intended to help researchers understand how people express happiness by means of NLP techniques. The goal of this assignment is to identify the groups of people mentioned most often in these happy moments. Based on the person list provided by an open-source dictionary ([people-dict](https://github.com/megagonlabs/HappyDB/blob/master/happydb/data/topic_dict/people-dict.csv)), I have classified those individuals into eight groups:

<ol>
<li>Parent (biological parents and step-parents)</li>
<li>Spouse (people who are or were in a marriage, civil union, or common-law marriage)</li>
<li>Children (biological children and step-children)</li>
<li>Affinity (people who are or were in a romantic relationship)</li>
<li>Sibling (full siblings and step-siblings)</li>
<li>Relative (family members outside the immediate family)</li>
<li>Friend (people who are or were in a non-romantic relationship)</li>
<li>Others (people not covered by the groups above)</li>
</ol>

In [1]:
import re
import pandas as pd
import string
import nltk
from nltk.tokenize import word_tokenize 
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk import FreqDist
from nltk.corpus import stopwords

In [2]:
# Import Dataset
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
reviews = pd.read_csv("cleaned_hm.csv", on_bad_lines="skip", index_col=False, dtype=str, encoding="utf8")
hm = reviews['cleaned_hm'].to_list()
# The database contains 100,535 sentences.
print("The number of sentences in the cleaned_hm document:", len(hm))

# Load the English stopword list; add "i" and keep "my" and "them" in the text
sw = set(stopwords.words('english'))
sw.add("i")
sw.remove("my")
sw.remove("them")

# Read the people dictionary provided with HappyDB
people = pd.read_csv("people-dict.csv", on_bad_lines="skip", index_col=False, dtype=str, encoding="utf8", names=['individual'])
print("The number of individual in the people_dict:", len(people["individual"]))

The number of sentences in the cleaned_hm document: 100535
The number of individual in the people_dict: 239


The next step is data preprocessing: lowercase conversion, stopword removal, and punctuation removal. Hyphens are deliberately kept when punctuation is removed, so that tokens such as "mother-in-law" stay intact.

In [3]:
def text_process(mess):
    return_list = []
    for line in mess:
        wordsList = nltk.word_tokenize(line)
        wordsList = [x.lower() for x in wordsList]
        wordsList = [w for w in wordsList if w not in sw]
        # Replace punctuation with a space, keeping hyphens and apostrophes
        wordsList = [re.sub(r"[^\w\s\-']+", ' ', w) for w in wordsList]
        # Drop tokens that were pure punctuation (compare by value, not identity)
        wordsList = [x for x in wordsList if x != " "]
        return_list.append(wordsList)
    return return_list

unigram = text_process(hm)
unigram_list = []
for each in unigram:
    for item in each:
        unigram_list.append(item)
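
The effect of the punctuation rule can be checked on a hand-tokenized sentence. The snippet below is a minimal sketch that mirrors the steps of `text_process` without NLTK; the helper name `clean_tokens`, the sample tokens, and the toy stopword set are assumptions made for illustration only.

```python
import re

# Same idea as the character class in text_process: keep word characters,
# whitespace, hyphens, and apostrophes; replace everything else with a space.
PUNCT = re.compile(r"[^\w\s\-']+")

def clean_tokens(tokens, stop_words):
    """Lowercase, drop stopwords, and blank out punctuation (hyphens survive)."""
    lowered = [t.lower() for t in tokens]
    kept = [t for t in lowered if t not in stop_words]
    cleaned = [PUNCT.sub(" ", t) for t in kept]
    # Tokens that consisted only of punctuation are now a single space
    return [t for t in cleaned if t != " "]

# Hypothetical, hand-tokenized example (not taken from HappyDB)
sample = ["I", "visited", "my", "Mother-in-law", ",", "and", "it", "was", "great", "!"]
toy_stopwords = {"i", "and", "it", "was"}
print(clean_tokens(sample, toy_stopwords))
# ['visited', 'my', 'mother-in-law', 'great']
```

The hyphenated token survives as one unit, which is what the later in-law patterns rely on.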

In [4]:
# unigram_list contains the tokens derived from the happy-moment text.
unigram_list[:10]

['went',
 'successful',
 'date',
 'someone',
 'felt',
 'sympathy',
 'connection',
 'happy',
 'my',
 'son']

Using the FreqDist class from the NLTK package, we can count how many times each token occurs in the corpus.

In [5]:
fdist = FreqDist(unigram_list)
fdistkeys = list(fdist.keys())
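
NLTK's `FreqDist` is a subclass of `collections.Counter`, so the counting step can be sketched with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

# Toy token list standing in for unigram_list (illustrative data only)
tokens = ["my", "son", "happy", "my", "my", "son"]

counts = Counter(tokens)

print(counts["son"])          # 2
print(counts.most_common(1))  # [('my', 3)]
print(counts["daughter"])     # 0, missing tokens count as zero
```

Iterating over `counts.items()` yields `(word, count)` pairs, which is the shape that `get_list` expects from `fdist.items()`.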

Based on the three open-source dictionaries, one can build a word list that identifies "parent"-related words. Specifically, the list contains:

<ul>
<li>
   
**Father related words**: "father", "fathers", "dad", "dads", "daddy", "daddies", "father-in-law", "papa", "pappa", "old man"</li>
<li> 
   
**Mother related words**: "mother", "mothers", "mum", "mums", "mom", "moms", "momma", "mummy", "mummies", "mommy", "mommies", "mamma", "mommie", "mother-in-law"</li>
<li> 

**Parent related words**: "parent", "parents", "step-parent", "stepparent" </li>
<li> 

**Others**: "step-father", "step-mother", "stepmother", "stepfather", "stepmom", "stepdad"</li>
</ul>

Given the number of words listed above, regular expressions are an efficient way to match topic-specific tokens. With this approach, the HappyDB collection contains 6,590 occurrences of parent-related words. Looking closely at the result, "mother" has the highest frequency (1,717), followed by its short form "mom" (1,473). Interestingly, the combined frequency of "dad" and "father" (1,646) is roughly half that of "mom" and "mother" (3,190), which suggests that people mention their mothers more often than their fathers in their happy moments.
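
To see what the anchored patterns accept and reject, the two parent patterns can be tried on a handful of tokens. This is a small sketch; the helper `is_parent_word` and the sample tokens are assumptions for illustration.

```python
import re

# The two parent patterns used in this notebook, compiled once
parent_pattern = re.compile(
    r"^(step)?(-)?(father|mother|dad|papa|pappa|mom|mum|parent|daddie|daddy"
    r"|mummy|mommy|momma|mamma|mama|mommie)s?$"
)
in_law_pattern = re.compile(r"^(father|mother)s?.*?(law)$")

def is_parent_word(token):
    """True if the whole token matches either parent pattern."""
    return bool(parent_pattern.search(token) or in_law_pattern.search(token))

samples = ["mother", "step-dad", "moms", "mother-in-law", "grandmother", "motherboard"]
matches = [t for t in samples if is_parent_word(t)]
print(matches)
# ['mother', 'step-dad', 'moms', 'mother-in-law']
```

Because of the `^` and `$` anchors, "grandmother" and "motherboard" are rejected even though they contain "mother".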

In [6]:
# Return a dataframe of the (word, count) pairs whose word matches any of the given regular expressions
def get_list(compare_list, dic_list, type_of):
    list_count = []
    for (word, count) in compare_list:
        for match in dic_list:
            if re.search(match, word):
                list_count.append((word, count, type_of))
    # Build the dataframe once, after all words have been checked
    df = pd.DataFrame(list_count, columns=['word', 'frequency', 'type']).sort_values(by=['frequency'], ascending=False)
    return df

In [7]:
parent = ["^(step)?(-)?(father|mother|dad|papa|pappa|mom|mum|parent|daddie|daddy|mummy|mommy|momma|mamma|mama|mommie)s?$", 
          "^(father|mother)s?.*?(law)$"]

parent_count_df = get_list(fdist.items(), parent, "Parent")
print("Total: ", sum(parent_count_df["frequency"]))
parent_count_df

Total:  6590


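| | word | frequency | type |
|---|---|---|---|
| 3 | mother | 1717 | Parent |
| 2 | mom | 1473 | Parent |
| 4 | parents | 1119 | Parent |
| 5 | dad | 912 | Parent |
| 6 | father | 734 | Parent |
| 12 | daddy | 113 | Parent |
| 13 | mothers | 112 | Parent |
| 11 | parent | 97 | Parent |
| 20 | mommy | 56 | Parent |
| 1 | mama | 47 | Parent |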


Similarly, we can obtain the word frequencies for the "children" group. Children-related words occur 11,098 times in total. The frequency of "son" (3,482) is close to that of "daughter" (3,216); "baby" and "kids" rank third and fourth in the list.

In [8]:
children = ["^(step)?(-)?(daughter|son|kid|child|children|baby|babie|bby|bae)s?$"]

children_count_df = get_list(fdist.items(), children, "Children")
print("Total: ", sum(children_count_df["frequency"]))
children_count_df

Total:  11098


| | word | frequency | type |
|---|---|---|---|
| 0 | son | 3482 | Children |
| 2 | daughter | 3216 | Children |
| 4 | baby | 1226 | Children |
| 3 | kids | 1129 | Children |
| 6 | child | 663 | Children |
| 1 | children | 591 | Children |
| 5 | kid | 342 | Children |
| 9 | daughters | 192 | Children |
| 7 | sons | 155 | Children |
| 10 | babies | 53 | Children |


For spouse, the dictionary contains "spouse", "spouses", "wife", "wifes", "wives", "husband", "husbands", "wifey", "hubby", "ex husband", "ex wife", "ex-husband", "ex-wife". The frequency of "wife" is slightly higher than that of "husband". It is counterintuitive to find plural forms of "husband" and "wife". One possible interpretation is that possessive forms were not split from the base noun during word tokenization.

In [9]:
spouse = ["^(ex)?(-)?(spouse|wife|wive|wifey|husband|hubby)s?$"]

spouse_count_df = get_list(fdist.items(), spouse, "Spouse")
print("Total: ", sum(spouse_count_df["frequency"]))
spouse_count_df

Total:  5807


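| | word | frequency | type |
|---|---|---|---|
| 2 | wife | 2776 | Spouse |
| 0 | husband | 2713 | Spouse |
| 1 | spouse | 225 | Spouse |
| 4 | hubby | 40 | Spouse |
| 5 | husbands | 23 | Spouse |
| 9 | ex-husband | 9 | Spouse |
| 3 | ex-wife | 8 | Spouse |
| 6 | wifes | 8 | Spouse |
| 7 | spouses | 2 | Spouse |
| 8 | wives | 2 | Spouse |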


For affinity, the dictionary contains "fiance", "fiancee", "fiancé", "fiancée", "ex-girlfriend", "ex-boyfriend", "girlfriend", "girlfriends", "boyfriend", "boyfriends", "lover", "lovers", "my crush", "bf", "gf", "babe", "babes". It is interesting that people tend to use "baby" for their children but "babe" for their lovers.

In [10]:
affinity = ["^(ex)?(-)?(fiance|fiancee|fiancé|fiancée|girlfriend|boyfriend|lover|bf|gf|babe)s?$"]

affinity_count_df = get_list(fdist.items(), affinity, "Affinity")
print("Total: ", sum(affinity_count_df["frequency"]))
affinity_count_df

Total:  3855


| | word | frequency | type |
|---|---|---|---|
| 1 | girlfriend | 1959 | Affinity |
| 2 | boyfriend | 1252 | Affinity |
| 4 | fiance | 262 | Affinity |
| 0 | fiancee | 128 | Affinity |
| 3 | lover | 97 | Affinity |
| 5 | gf | 55 | Affinity |
| 8 | girlfriends | 43 | Affinity |
| 6 | boyfriends | 16 | Affinity |
| 7 | ex-girlfriend | 12 | Affinity |
| 13 | lovers | 12 | Affinity |


For siblings, the dictionary contains formal words like "brother" and "sister" as well as abbreviations like "bro" and "sis". In addition, "stepbrother", "stepsister", "halfbrother", and "halfsister" have also been considered during matching.

In [11]:
sibling = ["^(step)?(-)?(brother|sister|sis|bro|sibling)s?$"]

sibling_count_df = get_list(fdist.items(), sibling, "Sibling")
print("Total: ", sum(sibling_count_df["frequency"]))
sibling_count_df

Total:  3597


| | word | frequency | type |
|---|---|---|---|
| 1 | sister | 1767 | Sibling |
| 0 | brother | 1458 | Sibling |
| 3 | sisters | 132 | Sibling |
| 4 | brothers | 123 | Sibling |
| 2 | siblings | 56 | Sibling |
| 5 | sibling | 32 | Sibling |
| 6 | bros | 12 | Sibling |
| 9 | sis | 9 | Sibling |
| 7 | bro | 4 | Sibling |
| 10 | stepbrother | 2 | Sibling |


Given the wide range of nouns for relatives, the dictionary covers as many relative-related words as possible, ranging from "aunt" and "uncle" to "grandparent" and "grandchildren". Words like "family" and "families" are frequently mentioned in the document but ambiguous, since they do not identify a specific person; the pattern below nevertheless matches them, and "family" leads the list with 4,623 occurrences. Among the unambiguous words, "cousin" ranks first with 467 occurrences. "uncle", "niece", and "nephew" also have relatively high frequencies, all above 400.

In [12]:
relative = ["^(great-)?(grand)(-)?(son|parent|pa|ma|father|mother|kid|daughter|children|child)s?$",
            "^(family|familie|cousin|aunt|auntie|aunty|uncle|niece|nephew|granny|grannie)s?$",
            "^(son|sister|brother|daughter)s?.*?(law)$",
            "^(in)-?(law)s?$"]

relative_count_df = get_list(fdist.items(), relative, "Relative")
print("Total: ", sum(relative_count_df["frequency"]))
relative_count_df

Total:  8365


| | word | frequency | type |
|---|---|---|---|
| 1 | family | 4623 | Relative |
| 8 | cousin | 467 | Relative |
| 4 | uncle | 450 | Relative |
| 12 | niece | 428 | Relative |
| 6 | nephew | 405 | Relative |
| 25 | grandma | 253 | Relative |
| 2 | grandmother | 239 | Relative |
| 10 | aunt | 219 | Relative |
| 13 | cousins | 159 | Relative |
| 7 | granddaughter | 143 | Relative |


As for friends, various words can denote friendship. The most formal are "friend" and "friends". Informal words include "dude", "mate", "buddy", "pal", "bestie", "fella", and, on rare occasions, "lad". Abbreviations should not be neglected either; in this case, only "bbf" (which stands for best friend forever) has been included.
<br>
According to the result, the word "friend" occurs 6,119 times in the whole document, followed by its plural "friends" (4,692 times). "roommate" ranks third with only 172 occurrences. Most of the other words have relatively low frequencies, below 10.

In [13]:
friend = ["^(bbf|friend|dude|bestie|lad|buddy|buddie|pal|fella)s?$",
          "^(best|house|room|team|class|play|lab|college|school|work_place)*mate(s?)$"]

friend_count_df = get_list(fdist.items(), friend, "Friend")
print("Total: ", sum(friend_count_df["frequency"]))
friend_count_df

Total:  11351


| | word | frequency | type |
|---|---|---|---|
| 1 | friend | 6119 | Friend |
| 0 | friends | 4692 | Friend |
| 5 | roommate | 172 | Friend |
| 3 | mate | 60 | Friend |
| 13 | buddy | 53 | Friend |
| 6 | roommates | 44 | Friend |
| 9 | classmates | 39 | Friend |
| 8 | mates | 35 | Friend |
| 2 | buddies | 32 | Friend |
| 11 | classmate | 29 | Friend |


The "Others" group complements the groups above. Its top three words are "them", "people", and "girl". "Them" acts only as a personal pronoun and does not refer to any specific kind of person.

In [14]:
others = ["^(boy|girl|colleague|folk|guy|infant|neighbor|neighbour|newborn|partner|people|person|ppl|preschooler|them|toddler|woman|chick|brunette|blonde|cutie|blond|hottie|women|men|man|peep|coworker|co-woker|tenn|teen|teenager|customer|client|stranger|teacher|celebrity|celebritie|professor|adult|bloke|kiddo|redhead|lady|ladie|eldest)s?$",
          "^(every|some)(body|one|1)$",
          "^(actor|actress)(s|es)?$"]
others_count_df = get_list(fdist.items(), others, "Others")
print("Total: ", sum(others_count_df["frequency"]))
others_count_df

Total:  11970


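| | word | frequency | type |
|---|---|---|---|
| 7 | them | 2386 | Others |
| 4 | people | 1348 | Others |
| 3 | girl | 832 | Others |
| 18 | person | 831 | Others |
| 0 | someone | 806 | Others |
| 8 | everyone | 469 | Others |
| 13 | man | 467 | Others |
| 10 | partner | 349 | Others |
| 23 | boy | 345 | Others |
| 21 | neighbor | 333 | Others |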


In addition to the unigrams, we also need the frequencies of bigram phrases where necessary.

In [15]:
hm_bi_sentence = [" ".join(x) for x in unigram]

In [16]:
def listToString(lists):  
    string = ""  
    for each in lists:  
        string += each 
        string += " "
    return string

doc = listToString(hm_bi_sentence)
finder = BigramCollocationFinder.from_words(doc.split())
bigram_lists = []
for k,v in finder.ngram_fd.items():
    bigram_lists.append((k[0]+ " " + k[1],v))
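
Because the cell above joins all moments into one string before searching for bigrams, the last token of one moment and the first token of the next are also counted as a bigram. The sketch below shows a possible per-moment alternative using only the standard library; the function name `count_bigrams` and the two toy moments are assumptions for illustration, not part of the original analysis.

```python
from collections import Counter

def count_bigrams(sentences):
    """Count adjacent token pairs within each token list separately."""
    counts = Counter()
    for tokens in sentences:
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

# Two toy, already-preprocessed moments (illustrative data only)
sentences = [["met", "old", "man"], ["best", "friend", "visited"]]

per_sentence = count_bigrams(sentences)

# Flattening first reproduces the cross-moment pair "man best"
flat = [token for tokens in sentences for token in tokens]
across = count_bigrams([flat])

print("man best" in per_sentence)  # False
print("man best" in across)        # True
```

For the anchored patterns used in this notebook, such spurious pairs only matter if they happen to match a pattern, so the effect on the reported counts is likely small.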

In the bigram analysis of the parent group, three cases need to be considered: "father in-law(s)", "mother in-law(s)", and "old man". According to the result, the phrase "old man" appears 31 times. <br>
Because "mother" and "father" would be counted twice if phrases like "mother in-law" and "father in-law" were added to the matching list, the frequencies of those phrases should be subtracted from the frequencies of "mother" and "father" respectively. For "mother", subtracting 4 from 1,717 gives a final value of 1,713. The final value of "father" is obtained in the same way from its unigram frequency of 734.

In [17]:
# parent unigrams and bigrams
parent_bi = ["^(old man)$"]
bigram_count_parents_df = get_list(bigram_lists, parent_bi, "Parent")
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
parents_df = pd.concat([bigram_count_parents_df, parent_count_df]).sort_values(by="frequency", ascending=False)
 
print("Total: ", sum(parents_df["frequency"]))
parents_df

Total:  6621


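| | word | frequency | type |
|---|---|---|---|
| 3 | mother | 1717 | Parent |
| 2 | mom | 1473 | Parent |
| 4 | parents | 1119 | Parent |
| 5 | dad | 912 | Parent |
| 6 | father | 734 | Parent |
| 12 | daddy | 113 | Parent |
| 13 | mothers | 112 | Parent |
| 11 | parent | 97 | Parent |
| 20 | mommy | 56 | Parent |
| 1 | mama | 47 | Parent |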


One way to distinguish the occasions on which people use "baby" for their children from those on which they use it for their lovers is to analyze the frequency of certain bigrams, such as "elder son", "eldest son", and "second son". Because "son" would be counted twice if these bigrams were added to the matching list, the total frequency of the son-related bigrams (144) should be subtracted from that of the unigram "son" (3,482). The final frequency of "son" is therefore 3,338. The same adjustment is applied to the other duplicated words.
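
The subtraction can also be derived from the bigram counts instead of being typed in by hand. The following is a minimal sketch under the assumption that every matched bigram ends in the unigram it overlaps with; the helper name `subtract_bigram_overlap` and the toy counts are hypothetical and are not HappyDB figures.

```python
from collections import Counter

def subtract_bigram_overlap(unigram_counts, bigram_counts):
    """Subtract each bigram's count from the unigram that is its last word."""
    overlap = Counter()
    for bigram, count in bigram_counts.items():
        head = bigram.split()[-1]  # e.g. "eldest son" -> "son"
        overlap[head] += count
    # Counter returns 0 for words without any overlapping bigram
    return {word: count - overlap[word] for word, count in unigram_counts.items()}

# Toy counts for illustration only
unigrams = {"son": 10, "child": 6, "kids": 5}
bigrams = {"eldest son": 3, "second son": 2, "first child": 4}

adjusted = subtract_bigram_overlap(unigrams, bigrams)
print(adjusted)
# {'son': 5, 'child': 2, 'kids': 5}
```

With the figures reported in this notebook, the same arithmetic gives 3482 - 144 = 3338 for "son".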


In [18]:
# children unigrams and bigrams
# raw string so that \s is passed to the regex engine unchanged
children_bi = [r"(elder|older|eldest|younger|youngest|only|first|second|last)\s(baby|babie|bae|child|children|daughter|son|kid|bby)s?$"]

bigram_count_children_df = get_list(bigram_lists, children_bi, "Children")
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
children_df = pd.concat([bigram_count_children_df, children_count_df]).sort_values(by="frequency", ascending=False)
# subtract the duplicate count from the corresponding unigram frequency value
children_df.loc[children_df["word"] == "child", ["frequency"]] = (663 - 156)
children_df.loc[children_df["word"] == "baby", ["frequency"]] = (1226 - 28)
children_df.loc[children_df["word"] == "children", ["frequency"]] = (591 - 5)
children_df.loc[children_df["word"] == "daughter", ["frequency"]] = (3216 - 115)
children_df.loc[children_df["word"] == "son", ["frequency"]] = (3482 - 144)
children_df.loc[children_df["word"] == "kid", ["frequency"]] = (342 - 7)

print("Total: ", sum(children_df["frequency"]))
children_df

Total:  11098


| | word | frequency | type |
|---|---|---|---|
| 0 | son | 3338 | Children |
| 2 | daughter | 3101 | Children |
| 4 | baby | 1198 | Children |
| 3 | kids | 1129 | Children |
| 6 | child | 507 | Children |
| 1 | children | 586 | Children |
| 5 | kid | 335 | Children |
| 9 | daughters | 192 | Children |
| 7 | sons | 155 | Children |
| 9 | first child | 81 | Children |


In [19]:
# friend unigrams and bigrams
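# raw string so that \s is passed to the regex engine unchanged
friend_bi = [r"^(best|close|good|childhood)\s(friend|bestie|bff|buddie|buddy|dude|pal|fella|lad)s?$"]

bigram_count_friend_df = get_list(bigram_lists, friend_bi, "Friend")
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
friend_df = pd.concat([bigram_count_friend_df, friend_count_df]).sort_values(by="frequency", ascending=False)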
# subtract the duplicate count from the corresponding unigram frequency value
friend_df.loc[friend_df["word"] == "friend", ["frequency"]] = (6119 - 1238)
friend_df.loc[friend_df["word"] == "friends", ["frequency"]] = (4692 - 398)
friend_df.loc[friend_df["word"] == "buddy", ["frequency"]] = (53 - 7)

print("Total: ", sum(friend_df["frequency"]))
friend_df

Total:  11352


| | word | frequency | type |
|---|---|---|---|
| 1 | friend | 4881 | Friend |
| 0 | friends | 4294 | Friend |
| 1 | best friend | 790 | Friend |
| 6 | good friend | 210 | Friend |
| 5 | best friends | 182 | Friend |
| 5 | roommate | 172 | Friend |
| 7 | close friend | 149 | Friend |
| 2 | close friends | 102 | Friend |
| 3 | good friends | 99 | Friend |
| 4 | childhood friend | 89 | Friend |


People also use "my crush" to refer to their romantic partner. One point worth mentioning is the difference between a "girlfriend" and a "girl friend": normally, a girlfriend is someone with whom one has a romantic relationship, while a girl friend is a female friend in a purely platonic relationship.

In [20]:
# affinity unigrams and bigrams
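# raw string so that \s is passed to the regex engine unchanged
affinity_bi = [r"^(my|his|her)\s(crush|ex)s?$"]

bigram_count_affinity_df = get_list(bigram_lists, affinity_bi, "Affinity")
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
affinity_df = pd.concat([bigram_count_affinity_df, affinity_count_df]).sort_values(by="frequency", ascending=False)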
print("Total: ", sum(affinity_df["frequency"]))
affinity_df

Total:  4016


| | word | frequency | type |
|---|---|---|---|
| 1 | girlfriend | 1959 | Affinity |
| 2 | boyfriend | 1252 | Affinity |
| 4 | fiance | 262 | Affinity |
| 0 | fiancee | 128 | Affinity |
| 1 | my ex | 98 | Affinity |
| 3 | lover | 97 | Affinity |
| 0 | my crush | 63 | Affinity |
| 5 | gf | 55 | Affinity |
| 8 | girlfriends | 43 | Affinity |
| 6 | boyfriends | 16 | Affinity |


In [21]:
# sibling unigrams and bigrams
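# raw string so that \s is passed to the regex engine unchanged
sibling_bi = [r"^(elder|older|younger|twin)\s(brother|sister|sis|bro|sibling)s?$"]

bigram_count_sibling_df = get_list(bigram_lists, sibling_bi, "Sibling")
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
sibling_df = pd.concat([bigram_count_sibling_df, sibling_count_df]).sort_values(by="frequency", ascending=False)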
# subtract the duplicate count from the corresponding unigram frequency value
sibling_df.loc[sibling_df["word"] == "brother", ["frequency"]] = (1458 - 88)
sibling_df.loc[sibling_df["word"] == "sister", ["frequency"]] = (1767 - 52)
sibling_df.loc[sibling_df["word"] == "brothers", ["frequency"]] = (123 - 6)
sibling_df.loc[sibling_df["word"] == "sisters", ["frequency"]] = (132 - 1)

print("Total: ", sum(sibling_df["frequency"]))
sibling_df

Total:  3600


| | word | frequency | type |
|---|---|---|---|
| 1 | sister | 1715 | Sibling |
| 0 | brother | 1370 | Sibling |
| 3 | sisters | 131 | Sibling |
| 4 | brothers | 117 | Sibling |
| 2 | siblings | 56 | Sibling |
| 3 | younger brother | 46 | Sibling |
| 5 | sibling | 32 | Sibling |
| 1 | older brother | 24 | Sibling |
| 2 | younger sister | 23 | Sibling |
| 0 | elder brother | 18 | Sibling |


According to the frequency values, the people mentioned most frequently in happy moments are friends, followed by family and children.<br>
Specifically, the most frequently mentioned word is "friend", with a frequency of 4,881. The second is "family" (4,623 occurrences), followed by "friends" (4,294). "Son" and "daughter" rank fourth and fifth in the list. Among the words in the Affinity group, "girlfriend" is the only one in the top 10.

In [23]:
types = ["Parent", "Children", "Spouse", "Affinity", "Sibling", "Relatives", "Friend", "Others"]
dic_list = [parents_df, children_df, spouse_count_df, affinity_df, sibling_df, relative_count_df, friend_df, others_count_df]

sum_df = pd.concat(dic_list).sort_values(by="frequency", ascending=False)
sum_df[:10]

| | word | frequency | type |
|---|---|---|---|
| 1 | friend | 4881 | Friend |
| 1 | family | 4623 | Relative |
| 0 | friends | 4294 | Friend |
| 0 | son | 3338 | Children |
| 2 | daughter | 3101 | Children |
| 2 | wife | 2776 | Spouse |
| 0 | husband | 2713 | Spouse |
| 7 | them | 2386 | Others |
| 1 | girlfriend | 1959 | Affinity |
| 3 | mother | 1717 | Parent |


The limitations are conspicuous. The results are reliable only under the ideal circumstance that each moment mentions exactly one subject (or individual). In reality, however, most of the sentences mention more than one individual. For example, the record "I had dinner with my girlfriend and my girlfriend's family" mentions three subjects in one sentence: "I", "girlfriend", and "girlfriend's family". When the frequency of each subject is counted, a word like "girlfriend" is counted twice because it appears twice in that sentence. Moreover, the comprehensiveness of the dictionary used to create the regular expression rules affects the accuracy of the frequency analysis.
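
One possible way to soften the double-counting problem is to count, for each group, the number of moments that mention it at least once instead of the number of matching tokens. The sketch below illustrates the idea; the function name `moments_per_group`, the simplified patterns (small subsets of the ones used above), and the hand-tokenized moments are all assumptions for illustration.

```python
import re

# Simplified subsets of the notebook's patterns, for illustration only
GROUP_PATTERNS = {
    "Affinity": re.compile(r"^(girlfriend|boyfriend)s?$"),
    "Relative": re.compile(r"^(family|familie|cousin)s?$"),
}

def moments_per_group(tokenized_moments, group_patterns):
    """Count how many moments mention each group at least once."""
    totals = {group: 0 for group in group_patterns}
    for tokens in tokenized_moments:
        for group, pattern in group_patterns.items():
            if any(pattern.search(token) for token in tokens):
                totals[group] += 1
    return totals

# Hand-tokenized toy moments; the first repeats "girlfriend"
moments = [
    ["dinner", "my", "girlfriend", "my", "girlfriend", "family"],
    ["cousin", "visited"],
]

print(moments_per_group(moments, GROUP_PATTERNS))
# {'Affinity': 1, 'Relative': 2}
```

Here the repeated "girlfriend" contributes only once to the Affinity group, while a moment that mentions several groups still counts toward each of them.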