# Get Personalities of Programmers of Different Programming Languages

This is the code I use to compute the average personality of programmers who work in different programming languages. My intuition is that different languages attract different types of programmers, so here is a rough way to test it.

The data comes from the StackSample dataset on Kaggle: [https://www.kaggle.com/stackoverflow/stacksample/version/1](https://www.kaggle.com/stackoverflow/stacksample/version/1). If you want to replicate the analysis, please use that dataset.

First I import all the libraries that I am going to use.

In [1]:
from ibm_watson import PersonalityInsightsV3
from bs4 import BeautifulSoup
import collections
import requests
import pandas
import json
import csv
import os

In the next part, I set some constants that I use below. To get the personalities, I use IBM's Personality Insights ([https://www.ibm.com/watson/services/personality-insights/](https://www.ibm.com/watson/services/personality-insights/)). If you want to replicate it, you need to get the keys from IBM. It is free for up to 1000 calls/month (and this script uses fewer than 1000).

In [2]:
folder_personality = "personality"

In [3]:
credentials = {
  "apikey": "",
  "iam_apikey_description": "",
  "iam_apikey_name": "",
  "iam_role_crn": "",
  "iam_serviceid_crn": "",
  "url": "https://gateway-fra.watsonplatform.net/personality-insights/api"
}
api_link = credentials["url"]
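
Rather than pasting the key into the notebook, one option is to read it from the environment. The sketch below assumes two hypothetical variable names, PI_APIKEY and PI_URL; they are not part of the original script or of the IBM SDK.

```python
import os

DEFAULT_URL = "https://gateway-fra.watsonplatform.net/personality-insights/api"

def load_credentials(env=None):
    """Build the credentials dict from environment variables.

    PI_APIKEY and PI_URL are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "apikey": env.get("PI_APIKEY", ""),
        "url": env.get("PI_URL", DEFAULT_URL),
    }

# Passing a plain dict makes the function easy to try without touching os.environ.
example = load_credentials({"PI_APIKEY": "my-secret-key"})
print(example["url"])
```

Only the apikey and url entries are read later in this notebook, so the sketch leaves the other fields out.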

First I need the tags for each question. I parse the tags file, save every tag in a list, and build a dictionary that maps each question to its tags.

In [4]:
question_tags = collections.defaultdict(list)
all_tags = []

In [5]:
with open("Tags.csv") as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    for question, tag in reader:
        if not question == "Id":
            question_tags[question].append(tag)
            all_tags.append(tag)
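
To make the resulting structures concrete, here is the same parsing loop run on a tiny in-memory sample. The three rows are made up for illustration and only mimic the Id,Tag layout of Tags.csv; skipping the header with next() is a small variation on the comparison used above.

```python
import collections
import csv
import io

# Hypothetical three-row sample in the same layout as Tags.csv (Id,Tag).
sample_csv = io.StringIO("Id,Tag\n80,flex\n80,actionscript-3\n90,svn\n")

question_tags = collections.defaultdict(list)
all_tags = []

reader = csv.reader(sample_csv, delimiter=",", quotechar='"')
next(reader)  # skip the header row instead of comparing every row to "Id"
for question, tag in reader:
    question_tags[question].append(tag)  # question id -> list of its tags
    all_tags.append(tag)                 # flat list, used later for counting

print(dict(question_tags))
print(all_tags)
```

Question ids stay strings here, which matches how they are looked up later when reading Questions.csv and Answers.csv.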

Since not every tag is a programming language, I need a way to filter them. I used the Stack Overflow 2019 survey to get the names of the most popular languages. The script below scrapes them, and I then save them in the list langs.

In [6]:
url_lang = "https://insights.stackoverflow.com/survey/2019#technology"
data = requests.get(url_lang)
data_html = BeautifulSoup(data.text, "html.parser")
data_html = data_html.find_all(id="technology-most-popular-technologies-all-respondents")
data_html = [l.lower() for l in data_html[0].get_text().split("\n") if l.strip() and "%" not in l][:-1]

In [7]:
langs = []
for lang in data_html:
    if "/" in lang:
        lang = lang.split("/")
        langs += lang
    else:
        langs.append(lang)
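
The splitting step can be sketched as a small function. The input rows below are hypothetical examples of combined survey entries, not the scraped output, and the strip() call is an addition that guards against stray spaces around the slash.

```python
def split_languages(entries):
    """Split survey entries such as 'html/css' into separate lowercase names."""
    langs = []
    for entry in entries:
        for part in entry.split("/"):
            part = part.strip().lower()  # strip() is an addition in this sketch
            if part:
                langs.append(part)
    return langs

# Hypothetical survey rows used only for illustration.
print(split_languages(["JavaScript", "HTML/CSS", "Bash/Shell/PowerShell"]))
```

An entry without a slash passes through unchanged, so the function covers both branches of the loop above.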

Here I go through all the tags and filter first the tags and then the questions. I first take the 1000 most popular tags on Stack Overflow and then keep only the questions that have at least one tag that is both among the 1000 most popular and in the language list. I do the same with the answers.

In [8]:
tags_to_check = {tag for tag, count in collections.Counter(all_tags).most_common(1000) if tag.lower() in langs}

In [9]:
question_tags = dict([(q, list(set(tags).intersection(tags_to_check))) for q, tags in question_tags.items() if len(set(tags).intersection(tags_to_check)) > 0])
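
The two filtering steps can be tried on toy data with a sketch like the following. The function name filter_question_tags and the sample questions are made up for illustration; the kept tags are sorted only to make the output deterministic.

```python
import collections

def filter_question_tags(question_tags, langs, top_n=1000):
    """Keep questions that carry at least one popular language tag."""
    counts = collections.Counter(
        tag for tags in question_tags.values() for tag in tags
    )
    lang_set = set(langs)
    # Tags that are both among the top_n most frequent and in the language list.
    tags_to_check = {
        tag for tag, _ in counts.most_common(top_n) if tag.lower() in lang_set
    }
    filtered = {}
    for question, tags in question_tags.items():
        kept = sorted(set(tags) & tags_to_check)
        if kept:  # drop questions with no language tag at all
            filtered[question] = kept
    return filtered

# Hypothetical questions and tags.
toy = {"1": ["python", "pandas"], "2": ["svn"], "3": ["java", "python"]}
print(filter_question_tags(toy, ["python", "java"]))
```

Question "2" disappears because none of its tags is a language, and "pandas" is removed from question "1" for the same reason.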

In [10]:
pairs_questions = []

In [11]:
with open("Questions.csv", "r", -1, "latin-1") as read:
    reader = csv.reader(read, delimiter=',', quotechar='"')
    for question, user, _, _, _, _, _ in reader:
        if question in question_tags:
            pairs_questions.append([question, user])

In [12]:
pairs_questions = dict(pairs_questions)

In [13]:
pairs_answers = []

In [14]:
with open("Answers.csv", "r", -1, "latin-1") as read:
    reader = csv.reader(read, delimiter=',', quotechar='"')
    for _, user, _, question, _, _ in reader:
        if question in question_tags:
            pairs_answers.append([question, user])

In [15]:
pairs_answers = dict(pairs_answers)

I then combine the question and answer tags for each user. This lets me keep only the users who have more than half of their questions and answers in one language, which limits cross-contamination between different personalities. I also drop users with 10 or fewer of these tags in total.

In [16]:
users_tags = collections.defaultdict(list)

In [17]:
for question, user in pairs_answers.items():
    users_tags[user] += question_tags[question]

In [18]:
for question, user in pairs_questions.items():
    users_tags[user] += question_tags[question]

In [19]:
users_with_lang = []
for user, tags in users_tags.items():
    if len(tags) > 10:
        most_common_tag_info = collections.Counter(tags).most_common(1)[0]
        most_common_tag = most_common_tag_info[0]
        tag_count = most_common_tag_info[1]
        if tag_count > len(tags) * 0.5:
            users_with_lang.append([user, most_common_tag])

In [20]:
users_with_lang = dict(users_with_lang)
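
The per-user rule can be isolated in a small function, which makes the two thresholds easy to test. This is a sketch with hypothetical parameter names (min_tags, min_share); the default values mirror the ones used in the loop above.

```python
import collections

def dominant_language(tags, min_tags=10, min_share=0.5):
    """Return the user's dominant tag, or None if the user is filtered out."""
    if len(tags) <= min_tags:
        return None  # too little activity to say anything
    tag, count = collections.Counter(tags).most_common(1)[0]
    # The top tag must cover strictly more than min_share of all tags.
    return tag if count > len(tags) * min_share else None

print(dominant_language(["python"] * 8 + ["java"] * 3))  # python
print(dominant_language(["python"] * 6 + ["java"] * 6))  # None
print(dominant_language(["python"] * 10))                # None
```

The second call returns None because an exact 50/50 split does not exceed the share threshold, and the third because exactly 10 tags is not more than 10.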

Here I check how many users have each language as their most common one. I then keep the 15 languages with the most users.

In [21]:
number_of_users_with_each_lang = collections.defaultdict(int)
for user, lang in users_with_lang.items():
    number_of_users_with_each_lang[lang] += 1

In [22]:
lang_counts = pandas.DataFrame(list(number_of_users_with_each_lang.items()), columns=["Lang", "Count"])
lang_for_personality = set(lang_counts.sort_values(by="Count", ascending=False).head(15)["Lang"])
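
The counting and top-15 selection above can also be done without pandas. The sketch below uses collections.Counter; the function name top_languages and the sample users are hypothetical, and languages tied at the cutoff may be ordered differently than by the pandas sort.

```python
import collections

def top_languages(users_with_lang, n=15):
    """Return the n languages that are the dominant one for the most users."""
    counts = collections.Counter(users_with_lang.values())
    return {lang for lang, _ in counts.most_common(n)}

# Hypothetical user -> language mapping.
toy_users = {"1": "python", "2": "python", "3": "python",
             "4": "java", "5": "java", "6": "c"}
print(top_languages(toy_users, n=2))
```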

Next I collect the questions and answers of the users who have one of these 15 languages as their most common one.

In [23]:
user_with_text = collections.defaultdict(str)

In [24]:
with open("Questions.csv", "r", -1, "latin-1") as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    for _, user, _, _, _, title, body in reader:
        if user in users_with_lang:
            if users_with_lang[user] in lang_for_personality:
                user_with_text[user] += BeautifulSoup(title, "html.parser").get_text() + " "
                user_with_text[user] += BeautifulSoup(body, "html.parser").get_text() + " "

In [25]:
with open("Answers.csv", "r", -1, "latin-1") as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    for _, user, _, _, _, body in reader:
        if user in users_with_lang:
            if users_with_lang[user] in lang_for_personality:
                user_with_text[user] += BeautifulSoup(body, "html.parser").get_text() + " "

Next I check the length of the 50th longest text for each language. This lets me select the 50 users with the longest texts in each language.

In [26]:
longest_texts_per_lang = collections.defaultdict(list)

In [27]:
for user, lang in users_with_lang.items():
    if lang in lang_for_personality:
        longest_texts_per_lang[lang].append(len(user_with_text[user]))

In [28]:
longest_texts_per_lang = dict([(lang, sorted(counts)[-50:][0]) for lang, counts in longest_texts_per_lang.items()])
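
The expression sorted(counts)[-50:][0] is compact, so here is a small worked example of what it returns. The helper name length_cutoff and the sample lengths are made up for illustration.

```python
def length_cutoff(lengths, top_k=50):
    """Length of the top_k-th longest text, or of the shortest if there are fewer."""
    # Sort ascending, keep the last top_k values, take the smallest of those.
    return sorted(lengths)[-top_k:][0]

lengths = [120, 300, 80, 450, 200]
print(length_cutoff(lengths, top_k=3))   # 200
print(length_cutoff(lengths, top_k=50))  # 80
```

With fewer than top_k texts, the slice keeps the whole list, so the cutoff falls back to the shortest text and every user in that language passes. If several users tie exactly at the cutoff length, a later comparison with >= selects all of them, so slightly more than top_k users can pass.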

In [29]:
longest_texts_per_lang

{'c++': 165235,
 'c#': 192436,
 'ruby': 45695,
 'java': 185852,
 'c': 59555,
 'javascript': 137794,
 'sql': 116272,
 'php': 101734,
 'scala': 27057,
 'python': 127637,
 'objective-c': 79656,
 'r': 54280,
 'vba': 33050,
 'swift': 30400,
 'css': 28698}

Next I send these texts to IBM's server, get back each user's personality profile and save it. You can find the results in the folder personality.

In [30]:
personality_insights = PersonalityInsightsV3(version="2017-10-13", iam_apikey=credentials["apikey"], url=api_link)

In [31]:
for user, text in user_with_text.items():
    lang = users_with_lang[user]
    if lang in longest_texts_per_lang:
        if len(text) >= longest_texts_per_lang[lang]:
            filename = lang + "_" + user + ".json"
            profile = personality_insights.profile(content=text, accept="application/json", content_language="en", 
                                                   accept_language="en", raw_scores=True, consumption_preferences=True,
                                                   content_type="text/plain"
            ).get_result()
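            # Create the output folder on first use so the write cannot fail.
            os.makedirs(folder_personality, exist_ok=True)
            with open(os.path.join(folder_personality, filename), "w") as f:
                json.dump(profile, f)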
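
Once the profiles are saved, the per-language averages could be computed along these lines. This is only a sketch: the function name average_big_five is hypothetical, and it assumes that each saved profile has a top-level "personality" list whose entries carry "name" and "percentile" fields, and that files are named lang_user.json as written above.

```python
import collections
import json
import os

def average_big_five(folder):
    """Average Big Five percentiles per language from <lang>_<user>.json files."""
    sums = collections.defaultdict(lambda: collections.defaultdict(float))
    counts = collections.Counter()
    for filename in os.listdir(folder):
        if not filename.endswith(".json"):
            continue
        # Split on the last underscore: everything before it is the language.
        lang = filename[:-len(".json")].rsplit("_", 1)[0]
        with open(os.path.join(folder, filename)) as f:
            profile = json.load(f)
        # Assumed layout: profile["personality"] is a list of trait dicts.
        for trait in profile.get("personality", []):
            sums[lang][trait["name"]] += trait["percentile"]
        counts[lang] += 1
    return {
        lang: {name: total / counts[lang] for name, total in traits.items()}
        for lang, traits in sums.items()
    }

# Only runs if the output folder from the step above exists.
if os.path.isdir("personality"):
    for lang, traits in sorted(average_big_five("personality").items()):
        print(lang, traits)
```

Averaging percentiles is one possible choice; the raw_score field requested in the call above could be averaged the same way.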
            