# **Exploratory Data Analysis Essentials**

By: Tiffany Aihara

### **Background:** 

Online dating has grown in popularity in recent years. Each dater creates a profile that is shared with people nearby. Because dating profiles combine text and images, online daters struggle with “managing” their online presence (Ellison et al., 2006). In the initial stages of an interaction, self-presentation strategies, or “first impressions,” strongly influence whether the interaction is considered “successful” (Derlega et al., 1987). Ellison et al. (2006) highlight the importance of portraying the ideal self in one’s profile: most participants in their study created profiles that described not only who they are today but also their potential, future selves. Online daters make self-presentational choices: what information to disclose, how to disclose it, and whether to engage in “deception” (Hancock & Toma, 2009). Deception within the online dating community often takes the form of exaggerated information that emphasizes status or physical attractiveness (Guadagno et al., 2012). 


Online daters are typically guided by two forces. The first is self-enhancement, which reflects a dater’s desire to “appear as attractive as possible” to a potential match. The second is authenticity, which reflects the need to be seen as honest in one's self-depiction (Hancock & Toma, 2009). The “accuracy” of an online dater’s profile signals authenticity, that is, being perceived as honest by potential matches. 

### Interest to Question

My initial interest was to explore whether online daters choose authenticity or attractiveness, or how much of both, when constructing their profile (self-presentation). From this initial interest, I developed the following questions: *How do people in the online dating community view profiles? What type of "advice" is given? On the spectrum between authenticity and attractiveness, where does this advice lie? Does it reveal any broader cultural or societal norms?* 

**Research Aim:** To identify whether online dating advice (from the dating pool) reflects authenticity (personality, unedited/unfiltered) or attractiveness.

**Research Question(s):** What types of "advice" do daters receive from a public profile review? Is attractiveness or authenticity viewed as "more proactive" among the online dating community? 

**Note (After Data Collection):** Sentiment analysis libraries tend to focus on whether a text is positive, negative, or neutral. For the purpose of this project, I have decided to treat "positive" as constructive or helpful advice and "negative" as less helpful advice. 

### Finding & Collecting Data

To answer the research question, I will be exploring the *Hinge Dating App* Reddit page. The *Hinge Dating App Reddit Page* has "flairs," which are individual topic filters. The page has the flairs help, discussion, profile review, daily thread, dating questions, and app questions. I have filtered the data to **only search** through **profile reviews.**

The data will only look at the first 100 submissions. 

In [1]:
import praw
import reddit_info 
import pandas as pd
from string import punctuation 
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import re
from collections import Counter
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [2]:
reddit = praw.Reddit(user_agent='Online Dating Profiles', 
                    client_id = reddit_info.reddit_id, client_secret = reddit_info.secret, 
                    username = reddit_info.username, password = reddit_info.password)

In [3]:
subreddit =  reddit.subreddit('hingeapp')
submissions = subreddit.top(limit=100)

In [4]:
profile_reviews = {'id': [], 'title': [], 'link': [], 'date': []}

for submission in subreddit.search('flair:"Profile Review"', syntax='lucene'):
    profile_reviews['id'].append(submission.id) 
    profile_reviews['title'].append(submission.title)
    profile_reviews['link'].append(submission.permalink)
    profile_reviews['date'].append(submission.created_utc)
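The `created_utc` values stored under `'date'` are Unix timestamps in seconds, which are hard to read at a glance. A minimal sketch of converting them to readable UTC strings is below. The helper name `utc_seconds_to_iso` and the sample values are hypothetical and are not part of the original notebook.

```python
from datetime import datetime, timezone

def utc_seconds_to_iso(created_utc: float) -> str:
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()

# Hypothetical sample values standing in for profile_reviews['date']
sample_dates = [0, 86400]
readable_dates = [utc_seconds_to_iso(ts) for ts in sample_dates]
print(readable_dates)
```

The same helper could be applied to `profile_reviews['date']` before the data is written out, so that the dates are readable in the saved file.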

In [5]:
result = [] 
for each_id in profile_reviews['id']: 
    for each_comment in reddit.submission(each_id).comments[1:]: 
        result.append(each_comment.body)

**Why is it comments[1:]?** The first few comments in `result` contained the following sentences: "Are you looking for something serious or casual? How long have you been on Hinge? How many likes/matches are you getting on average?" These questions are posted by a bot at the top of each thread. To remove them, the first comment of each thread (index 0) was skipped. 
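Skipping index 0 only works when the bot comment is actually first in the thread. A hedged alternative is to filter by author name instead of position. The sketch below uses a hypothetical `FakeComment` class in place of real Reddit comments, and it assumes the bot posts as `AutoModerator`. With PRAW, the author is an object rather than a string (and can be `None` for deleted accounts), so the comparison would need its name.

```python
from dataclasses import dataclass

@dataclass
class FakeComment:
    """Hypothetical stand-in for a Reddit comment (illustration only)."""
    author: str
    body: str

# Assumption: the account name the bot posts under
BOT_AUTHORS = {"AutoModerator"}

def drop_bot_comments(comments, bot_authors=BOT_AUTHORS):
    """Return the bodies of all comments not written by a known bot."""
    return [c.body for c in comments if c.author not in bot_authors]

thread = [
    FakeComment("user_a", "Lead with your second photo."),
    FakeComment("AutoModerator", "Are you looking for something serious or casual?"),
    FakeComment("user_b", "Very niche"),
]
human_bodies = drop_bot_comments(thread)
print(human_bodies)
```

Filtering by author keeps every human comment even when the bot comment is not the first one in the thread.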

### Filtering & Cleaning the Data
To get the word frequency, characters that were not numbers or letters were removed. I could have used a lemmatizer to reduce each word to its base form. However, I was not confident in that approach, so I used a simple word tokenizer instead. 

In [6]:
def remove_punctuation(word): 
    #Lowercases the text and removes every character that is not a letter, digit, or whitespace
    word = word.lower() 
    new_word = re.sub(r'[^A-Za-z0-9\s]', '', word)
    return new_word
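To make the cleaning step concrete, here is a small self-contained sketch of the same idea with one addition: it also collapses newlines and repeated spaces into single spaces. The function name `normalize_comment` and the sample sentence are illustrative assumptions, not the notebook's own code.

```python
import re

def normalize_comment(text: str) -> str:
    """Lowercase, strip non-alphanumeric characters, and collapse whitespace."""
    lowered = text.lower()
    # Keep only letters, digits, and whitespace
    letters_only = re.sub(r"[^a-z0-9\s]", "", lowered)
    # split() with no arguments splits on any run of whitespace, including newlines
    return " ".join(letters_only.split())

print(normalize_comment("You're quite handsome,\nand I think   you need more photos!"))
```

Note that removing apostrophes turns "You're" into "youre", which is why tokens such as `youre` appear in the word counts later on.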

In [7]:
sample = [remove_punctuation(x) for x in result]
stop_words = set(stopwords.words('english'))

In [8]:
cleaned_data = [word.replace('\n', ' ') for word in sample] 
corpus = '' 

for each_sentence in cleaned_data: 
    corpus += each_sentence + ' '

In [9]:
cleaned_data[:5] #first five items of the cleaned data (visually see what this looks like)

['haha is this serious or a joke',
 'last selfie looks the most human definitely nix it  also biblically accurate angels are a bit trite at this point theyre all over the internet as memes id come up with something even more demonic and singular for your third prompt if you really want to bag some bad succubi',
 'hey man for all those judging you my friends brother who is very goth and eclectic married his very goth and eclectic girlfriend a couple years ago the bridesmaids held daggers instead of bouquets there was some sort of flame when they kissed oh and the bride was walkeddown the aisle in a coffin they are very happy and cute together whatever makes you happy youre more genuine with yourself than most people on the app more power to you',
 'very niche',
 'removed']

In [10]:
sample_token = word_tokenize(corpus)
clean_token = [word for word in sample_token if word not in stop_words]

In [11]:
clean_token[:10] #printing only the first 10 tokenize words

['haha',
 'serious',
 'joke',
 'last',
 'selfie',
 'looks',
 'human',
 'definitely',
 'nix',
 'also']

In [15]:
word_appearance = {'word': [], 'count': []}
counter_data = [Counter(clean_token)]

for counter in counter_data: 
    for items in counter.items(): 
        word_appearance['word'].append(items[0])
        word_appearance['count'].append(items[1]) 

### Frequent Words & Advice - Type Using Sentiment Score
To get the five most frequent words, I sorted the counts in the dictionary *word_appearance*. Since Reddit posts change **every day**, the numbers in *top_five* have to be updated each time the data is collected again. Note that sorting the count list in place breaks the pairing between each word and its count, so the cell that builds *word_appearance* must be **re-run** after sorting to **reset** the lists so that each word and its count share the same index again.

In [13]:
word_appearance['count'].sort(reverse = True)

In [17]:
top_five = [795, 661, 639, 529, 521]
#word_appearance['count'][:5]
#make sure to refresh (re-run) the code above!! 

In [18]:
result_id = []
for i in range(len(top_five)):  
    result_id.append(word_appearance['count'].index(top_five[i]))

result_words = [] 
for i in range(len(result_id)): 
    result_words.append(word_appearance['word'][result_id[i]])
    
topfive_occur = {'Word': result_words, 'Count/Appearance': top_five}
search_words = topfive_occur['Word']

In [19]:
#Top Word Occurrence (5)
search_words

['like', 'youre', 'profile', 'would', 'good']
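The sort, hard-coded counts, and index lookups above could be replaced by `Counter.most_common`, which returns word and count pairs already sorted by frequency. The sketch below is an alternative rather than the notebook's method, and the token list is a made-up example.

```python
from collections import Counter

# Hypothetical tokens standing in for clean_token
tokens = ["like", "profile", "like", "good", "photo", "like", "profile"]

counts = Counter(tokens)
# most_common(n) returns the n most frequent (word, count) pairs, highest first
top_words = counts.most_common(2)

# Same shape as the topfive_occur dictionary used above
top_occur = {
    "Word": [word for word, _ in top_words],
    "Count/Appearance": [count for _, count in top_words],
}
print(top_occur)
```

Because each word stays paired with its count, nothing needs to be re-run or updated by hand when the Reddit data changes.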

In [20]:
def check_keyword(keywords: 'list of words', body: 'titletext'): 
    #Returns a comma-separated string of the keywords found in the body (substring match)
    body = body.lower() 
    result = '' 
    for every_word in keywords: 
        if every_word in body: 
            result += every_word + ', '
        else: 
            pass
    return result
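Because `check_keyword` tests substrings, a keyword such as `like` also matches inside "dislike", and `youre` does not match a raw "You're". A hedged sketch of whole-word matching is shown below. The function name `find_keywords` is hypothetical, and stripping apostrophes before matching is an assumption made here so that raw comments line up with the cleaned tokens.

```python
import re

def find_keywords(keywords, body):
    """Return the keywords that appear in body as whole words."""
    # Drop straight and curly apostrophes so "You're" becomes "youre"
    cleaned = re.sub(r"['’]", "", body.lower())
    words_in_body = set(re.findall(r"[a-z0-9]+", cleaned))
    return [kw for kw in keywords if kw in words_in_body]

search = ["like", "youre", "good"]
print(find_keywords(search, "You're great, but I dislike the goodbye pun."))
print(find_keywords(search, "I like it, good job"))
```

Returning a list instead of a comma-joined string also makes it easier to count or filter keywords later.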

In [21]:
topword_data_sample = {'Keyword': [], "Text": [], 'Polarity Score': []}
for each_comment in result: 
    if check_keyword(search_words, each_comment) != '': 
        topword_data_sample['Keyword'].append(check_keyword(search_words, each_comment)) 
        topword_data_sample['Text'].append(each_comment)
    else: 
        continue 

**Sentiment Score:** The dictionary *topword_data_sample* holds the keyword(s), text, and polarity score for each comment. A keyword is one of the *top_five* most used words. The text is the comment that contains the keyword. The polarity score records whether the comment is classified as positive, negative, or neutral. 

To get the sentiment score, I looked at the "Text" only. Since the appended text is the "uncleaned" version, it needs to be filtered and cleaned. NLTK's SentimentIntensityAnalyzer (VADER) is a built-in tool that measures how positive and how negative a piece of text is. It is often used to analyze tweets and reviews. 

In [22]:
topword_data_sample['Text']

topword_text = [remove_punctuation(x) for x in topword_data_sample['Text']]
cleaned_topword = [word.replace('\n', ' ') for word in topword_text] 

In [23]:
def sentiment_analyze(text): 
    #text should be a cleaned text 
    score = SentimentIntensityAnalyzer().polarity_scores(text)
    neg = score['neg']
    pos = score['pos']
    if neg > pos: 
        topword_data_sample['Polarity Score'].append('Negative')
    elif pos > neg: 
        topword_data_sample['Polarity Score'].append("Positive")
    else: 
        topword_data_sample['Polarity Score'].append("Neutral")

In [24]:
for each_sentence in topword_data_sample['Text']: 
    sentiment_analyze(each_sentence)
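The function above compares the `neg` and `pos` shares directly. Another common convention with VADER is to label text from the `compound` score using a threshold of 0.05 on either side of zero. The sketch below shows that rule as a pure function. It is an alternative under stated assumptions, not the method used for the results in this notebook, and the score dictionaries are hand-written examples in the shape that `polarity_scores` returns.

```python
def label_from_compound(scores: dict, threshold: float = 0.05) -> str:
    """Label a VADER-style score dict using its 'compound' value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Hand-written example dicts (not real analyzer output)
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
labels = [label_from_compound(s) for s in examples]
print(labels)

# With NLTK, the dict would come from a single analyzer created once and reused:
# analyzer = SentimentIntensityAnalyzer()
# label_from_compound(analyzer.polarity_scores(text))
```

Returning the label instead of appending to a global dictionary keeps the function reusable, and creating the analyzer once avoids rebuilding it for every comment.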

The dictionary was put into a DataFrame and later stored in a CSV file (which is in the GitHub repository). 

For a quick look, here are the first few comments and how each one was classified. 

In [25]:
df = pd.DataFrame(topword_data_sample) 
df.head(5)

| | Keyword | Text | Polarity Score |
|---|---|---|---|
| 0 | like, would, good, | Very chaotic and it's good to let your matches... | Positive |
| 1 | like, would, good, | Honestly I think it's good that you're swervin... | Positive |
| 2 | profile, good, | What are you looking for on Hinge? Based on yo... | Positive |
| 3 | good, | For the automods questions:\n1. Serious or cas... | Positive |
| 4 | would, | As a woman who is into black metal, this made ... | Positive |


In [26]:
df['Polarity Score'].value_counts() 

Positive    1038
Negative     119
Neutral       32
Name: Polarity Score, dtype: int64
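The raw counts are easier to compare as percentages. The short calculation below uses the counts printed above. In pandas, `value_counts(normalize=True)` would give the same shares as fractions.

```python
# Counts copied from the value_counts() output above
polarity_counts = {"Positive": 1038, "Negative": 119, "Neutral": 32}

total = sum(polarity_counts.values())
# Share of each label as a percentage, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in polarity_counts.items()}

print(total)
print(shares)
```

Roughly 87% of the keyword-bearing comments were labeled positive, about 10% negative, and under 3% neutral.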

In [27]:
df.to_csv('hingeapp_advice.csv')

### Final Thoughts
At first glance, the polarity scores show mostly positive feedback for users. Taken at face value, this suggests that the profile review advice is constructive and helpful for those who post and ask for it. 

After viewing the CSV file, I noticed that some of the advice marked "positive" was actually "negative." The sentiment analysis fails to pick up on tone (specifically sarcasm), which could affect the measured level of helpfulness. 

Although this section answers what type of advice is given and how helpful it is, it does not yet answer whether authenticity or attractiveness plays a role. I want to explore this project further and look for word lists (datasets) of synonyms or words that point toward attraction or authenticity. 

I will be exploring this for a bit and will paste the results here (in the updated repository).

### Attempt to Make A Sentiment Score
**Goal:** Count the attraction and authenticity phrases in each comment and label the comment with whichever category scores higher. If neither category applies, the comment is labeled as neither. 

**Method:** The tokenized words (from above) are used to measure whether a post's advice leans more toward attraction or authenticity. I took a few words and placed them in one of two groups: words associated with "personality/persona/authenticity" and words associated with "attraction/looks/physical appearance." 

**Note:** This sentiment will only be tested with the first 10 results from the most used words dictionary. 

In [28]:
authenticity = ['prompts', 'prompt', 'joke', 'accurate', 'happy', 'laugh', 'hobbies', 
               'hobby', 'passionate', 'casual', 'unique', 'honest', 'rounded', 'honesty', 'friendly', 
               'sociable', 'nice', 'chill', 'interests', 'interest', 'adventure', 'personality', 'write', 
               'prompts', 'bio', 'voice', 'boring', 'bland', 'dry']

attraction = ['pic', 'photo', 'good looking', 'picture', 'look', 'pictures', 'swipe', 
             'cute', 'blurry', 'selfie', 'hot', 'beautiful', 'perfection', 'attractive', 'attract', 
             'swipe right', 'blonde', 'brunette', 'eyes', 'hair', 'height', 'looks', 'better photo', 'appearance', 
             'look bad', 'ugly'] 

neither = ['like', 'love', 'profile', 'good', 'really', 'matches', 'natural'] 

In [29]:
def check_authenticity(autlist, bodytext): 
    #Returns the score count of the number of appearances 
    score = 0 
    for every_word in autlist: 
        if every_word in bodytext: 
            score += 1 
        else: 
            continue 
    return score 

def check_attraction(attlist, bodytext): 
    #Returns the score count of the number of appearances 
    score = 0 
    for every_word in attlist: 
        if every_word in bodytext: 
            score += 1
        else: 
            continue 
    return score 

def check_neutral(neither, bodytext): 
    #Note: as written this always returns 0, because the test checks each word against the list itself and adds 0
    score = 0 
    for each_word in neither: 
        if each_word in neither: 
            score += 0 
        else: 
            continue 
    return score 
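The three functions above share one pattern and rely on substring tests, so `'hot'` also matches inside "photo". The sketch below folds them into one helper that matches whole words or phrases and reports ties explicitly. The names `count_terms` and `classify` and the shortened word lists are hypothetical. Unlike the cell above, a tie is labeled "Inconclusive" here instead of defaulting to the first category.

```python
import re

def count_terms(terms, text):
    """Count how many distinct terms occur in text as whole words or phrases."""
    lowered = text.lower()
    return sum(
        1
        for term in set(terms)
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )

def classify(text, lexicons):
    """Return (label, scores); the label is 'Inconclusive' on a tie or no match."""
    scores = {label: count_terms(terms, text) for label, terms in lexicons.items()}
    best = max(scores.values())
    winners = [label for label, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return "Inconclusive", scores
    return winners[0], scores

# Shortened, illustrative word lists
lexicons = {
    "Authentic": ["prompt", "hobbies", "personality"],
    "Attraction": ["photo", "hot", "selfie"],
}
label, scores = classify("Move your last photo to the first slot.", lexicons)
print(label, scores)
```

Here "photo" counts once for the attraction list, while `hot` is not counted because it only appears inside another word.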
    

In [32]:
au_score = 0 
att_score = 0
nu = 0


measure_post = df[:10].copy()  #copy so that a new column can be added without a SettingWithCopyWarning
results = []

for every_sentence in topword_data_sample['Text'][:10]: 
    au_score = check_authenticity(authenticity, every_sentence)
    att_score = check_attraction(attraction, every_sentence) 
    nu = check_neutral(neither, every_sentence) 
    
    if au_score >= att_score and au_score >= nu: 
        results.append('Authentic')
        #print('authentic')
    elif att_score >= au_score and att_score >= nu: 
        results.append('Attraction')
    elif nu >= att_score and nu >= au_score: 
        results.append('Neutral')
    else: 
        results.append("Inconclusive")
    

Although the test words used to measure authenticity, attraction, and neutrality are a small pool, they did produce some results. The functions count how many words from each list appear in a comment and compare the scores across categories. This small test ignores tone and sentence structure.


For example, row 7 of the table below shows the Redditor commenting on the dater's images and physical self-presentation. In terms of the initial research question, the advice for improving a dater's match rate is split fairly evenly between the two categories. Reviewers provide helpful tips for improving self-presentation, both in photos (looks) and in prompts (personality).   

I think this is a great start on something that I am interested in exploring further! 

In [37]:
measure_post['Authentic/Attraction Level'] = results 
measure_post

| | Keyword | Text | Polarity Score | Authentic/Attraction Level |
|---|---|---|---|---|
| 0 | like, would, good, | Very chaotic and it's good to let your matches... | Positive | Attraction |
| 1 | like, would, good, | Honestly I think it's good that you're swervin... | Positive | Attraction |
| 2 | profile, good, | What are you looking for on Hinge? Based on yo... | Positive | Attraction |
| 3 | good, | For the automods questions:\n1. Serious or cas... | Positive | Authentic |
| 4 | would, | As a woman who is into black metal, this made ... | Positive | Authentic |
| 5 | like, | I like the pre-coffee me pic in the woods. Is ... | Positive | Attraction |
| 6 | like, | Move your last photo to the first slot. If idk... | Negative | Attraction |
| 7 | like, profile, | You're quite handsome, and I think you need mo... | Positive | Attraction |
| 8 | like, profile, would, good, | It’s clear that your aim is to attract a speci... | Positive | Authentic |
| 9 | like, | Genuinely curious, how many matches do you get... | Positive | Authentic |


### Works Cited
Derlega, V., Winstead, B., Wong, P., & Greenspan, M. (1987). Self-disclosure and relationship development: An attributional analysis. In M. E. Roloff & G. R. Miller (Eds.), Interpersonal Processes: New Directions in Communication Research (pp. 172–187). Thousand Oaks, CA: Sage.

Ellison, N., Heino, R., & Gibbs, J. (2006). Managing Impressions Online: Self-Presentation Processes in the Online Dating Environment. Journal of Computer-Mediated Communication, 11(2), 415–441. https://doi.org/10.1111/j.1083-6101.2006.00020.

Guadagno, R. E., Okdie, B. M., & Kruse, S. A. (2012). Dating deception: Gender, online dating, and exaggerated self-presentation. Computers in Human Behavior, 28(2), 642–647. https://doi.org/10.1016/j.chb.2011.11.010

Hancock, J. T., & Toma, C. L. (2009). Putting Your Best Face Forward: The Accuracy of Online Dating Photographs. Journal of Communication, 59(2), 367–386. https://doi.org/10.1111/j.1460-2466.2009.01420.x