# Introduction

## Big Five Personality Traits

According to Wikipedia, 
> In psychological trait theory, the Big Five personality traits, also known as the five-factor model (FFM) and the OCEAN model, is a suggested taxonomy, or grouping, for personality traits.
>
> The theory identifies five factors:
>
> * openness to experience (inventive/curious vs. consistent/cautious)
> * conscientiousness (efficient/organized vs. extravagant/careless)
> * extraversion (outgoing/energetic vs. solitary/reserved)
> * agreeableness (friendly/compassionate vs. challenging/callous)
> * neuroticism (sensitive/nervous vs. resilient/confident)

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Wiki-grafik_peats-de_big_five_ENG.png/493px-Wiki-grafik_peats-de_big_five_ENG.png)
source: https://en.wikipedia.org/wiki/File:Wiki-grafik_peats-de_big_five_ENG.png

## Dataset

According to the codebook supplied with the dataset:

>This data was collected (2016-2018) through an interactive on-line personality test.
>The personality test was constructed with the "Big-Five Factor Markers" from the IPIP. https://ipip.ori.org/newBigFive5broadKey.htm
>Participants were informed that their responses would be recorded and used for research at the beginning of the test, and asked to confirm their consent at the end of the test.

The interactive on-line personality test can be found here: https://openpsychometrics.org/tests/IPIP-BFFM/. The test was presented as a single web page containing 50 questions (10 per trait), each rated on a five-point scale using radio buttons.

The dataset has 1,015,342 rows. Answers and time spent on each question are provided in the dataset. In addition, some information about the user's device has been collected:
* timestamp when the survey was started
* device's screen width and height
* location information: country, approximate latitude and approximate longitude.

**Let's dig into this dataset with some EDA.**

# Data Importation

In [None]:
!pip install country-converter
!pip install pycountry-convert

In [None]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import plotly.graph_objects as go
import plotly.express as px
import country_converter as coco
import pycountry_convert as pycoco
sns.set_style("darkgrid")

In [None]:
data = pd.read_csv("../input/big-five-personality-test/IPIP-FFM-data-8Nov2018/data-final.csv", sep = "\t")

## Dataset information

In [None]:
print("Dataset shape:", data.shape)

In [None]:
data.tail()

## Data quality

### Missing values

In [None]:
s = data.isnull().sum()
print(s[s != 0])

There are missing values in 105 out of 110 columns. It appears that the missing values come from the same observations.

In [None]:
print(s[s != 0].value_counts())
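One way to check that the gaps sit on the same observations is to compare the number of rows holding at least one missing value with the largest per-column count: if both match, the gaps in every column fall within the same set of rows. The sketch below uses a tiny hand-made table in plain Python (`rows` and its values are invented for illustration); on the real frame, the equivalent checks would be `data.isnull().any(axis = 1).sum()` and `data.isnull().sum().max()`.

```python
# Toy table: each dict is one observation, None marks a missing value.
# All values are invented for illustration only.
rows = [
    {"EXT1": 4, "EXT2": 2, "country": "FR"},
    {"EXT1": None, "EXT2": None, "country": "US"},
    {"EXT1": 3, "EXT2": 5, "country": "GB"},
    {"EXT1": None, "EXT2": None, "country": "DE"},
]

columns = list(rows[0].keys())

# Missing values per column (what data.isnull().sum() reports)
missing_per_column = {c: sum(r[c] is None for r in rows) for c in columns}

# Number of rows with at least one missing value
rows_with_missing = sum(any(r[c] is None for c in columns) for r in rows)

# If both numbers match, the gaps of every column sit in the same rows
same_rows = rows_with_missing == max(missing_per_column.values())
print(missing_per_column, rows_with_missing, same_rows)
```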

We can afford to discard these rows because we have more than one million rows in the dataset.

In [None]:
data = data.dropna()

In [None]:
print("Dataset (new) shape:", data.shape)

In [None]:
s = data.isnull().sum()
print(s[s != 0])

Rows with missing values have been discarded successfully.

Let's pay attention to the type of columns.

### Columns type

In [None]:
data.info()

The main columns containing answers (EXT1, etc.) are supposed to hold integers, since the scale has 5 steps (from 1 to 5).

Let's check that those columns don't contain any fractional values.

First of all, we're going to create a list of column names containing the answers from the test. Personality traits are labeled as:

* EXT: Extroversion
* EST: Neuroticism
* AGR: Agreeableness
* CSN: Conscientiousness
* OPN: Openness

Each trait is measured through ten questions.

In [None]:
personality_traits = ["EXT", "AGR", "CSN", "EST", "OPN"]
answer_columns = [trait + str(number) for trait in personality_traits for number in range(1, 11)]
print(answer_columns)

Let's see if there's a difference between values represented as integers and as floats.

In [None]:
(data[answer_columns] != data[answer_columns].astype(int)).sum()

Values can be converted to integers without any difference. Using a small integer type (`int8`, which is more than enough for a 0 to 5 scale) saves a lot of memory, and the EDA will be cleaner later.

In [None]:
data[answer_columns] = data[answer_columns].astype(np.int8)

According to the codebook, given latitudes and longitudes (`lat_appx_lots_of_err` and `long_appx_lots_of_err` columns) are very inaccurate, so both columns will be dropped. User location will be based on countries (`country` column, using the ISO 3166-1 alpha-2 standard).

In [None]:
data.drop(["lat_appx_lots_of_err", "long_appx_lots_of_err"], axis = 1, inplace = True)

### Outliers handling

Let's take a look at the answers given to the 50 questions.

In [None]:
data[answer_columns].apply(pd.Series.value_counts)

The question scale goes from 1 to 5. The zero value isn't supposed to exist; it probably means that the question wasn't answered: the user didn't click on any radio button in the corresponding row.

Observations containing at least one "0" in these columns will be discarded.

In [None]:
data = data[(data[answer_columns] != 0).all(axis = 1)]

The time spent on each question (xxxx_E columns) is recorded in milliseconds, let's convert it to seconds.

In [None]:
answer_columns_time = [trait + str(number) + "_E" for trait in personality_traits for number in range(1, 11)]
print(answer_columns_time)

In [None]:
data[answer_columns_time] = data[answer_columns_time].apply(lambda x: x / 1000)
data[answer_columns_time].describe()

Some values are very high and can be treated as outliers: users took a break in the middle of the test, or there was a technical issue while tracking mouse clicks.

Either way, let's discard outliers using arbitrary limits. If the time spent on any question is 30 seconds or more, the row will be deleted. Moreover, if any time is zero or negative, the row will also be removed.

The resulting dataset will only be used for the next section; everywhere else, the entire dataset (as cleaned up to this cell) will be used.

In [None]:
data_time = data[((data[answer_columns_time] < 30) & (data[answer_columns_time] > 0)).all(axis = 1)]

In [None]:
print("Dataset (new) shape:", data_time.shape)
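To make the row-level rule explicit, here is a minimal plain-Python sketch of the same filter on three made-up respondents (the `times` values and the `keep` helper are hypothetical, not part of the dataset): a row survives only if every one of its response times lies strictly between the two limits.

```python
# Response times in seconds for three hypothetical respondents
times = {
    "a": [2.1, 3.4, 1.8],
    "b": [2.5, 412.0, 3.0],   # long pause on one question
    "c": [1.9, -0.4, 2.2],    # negative time: tracking glitch
}

def keep(row, low=0, high=30):
    """Keep a row only if every response time lies strictly between low and high."""
    return all(low < t < high for t in row)

kept = [name for name, row in times.items() if keep(row)]
print(kept)
```

A single slow or invalid answer is enough to drop the whole respondent, which is why the filtered dataset is noticeably smaller than the full one.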

# Exploratory Data Analysis

## Response time analysis

To begin this EDA, let's study the response time per question. We're going to use the restricted dataset (response time below 30 seconds).

In [None]:
df_response = pd.melt(data_time[answer_columns_time])
df_response["trait"] = df_response["variable"].str.slice(0, 3)

In [None]:
fig, axs = plt.subplots(ncols = 2, nrows = 3, figsize = (18, 18))

sns.boxplot(x = "variable", y = "value", data = df_response[df_response["trait"] == "EXT"], 
            showfliers = False, ax = axs[0, 0]).set_title("Extroversion")
sns.boxplot(x = "variable", y = "value", data = df_response[df_response["trait"] == "AGR"], 
            showfliers = False, ax = axs[0, 1]).set_title("Agreeableness")
sns.boxplot(x = "variable", y = "value", data = df_response[df_response["trait"] == "CSN"], 
            showfliers = False, ax = axs[1, 0]).set_title("Conscientiousness")
sns.boxplot(x = "variable", y = "value", data = df_response[df_response["trait"] == "EST"], 
            showfliers = False, ax = axs[1, 1]).set_title("Neuroticism")
sns.boxplot(x = "variable", y = "value", data = df_response[df_response["trait"] == "OPN"], 
            showfliers = False, ax = axs[2, 0]).set_title("Openness")

fig.delaxes(axs[2, 1])

for ax in axs.flat:
    ax.set(xlabel = None, ylabel = "Response time (seconds)")

Overall, the response time distribution is pretty much the same for most of the questions.

The EXT1 question (`I am the life of the party.`) has very high response times because it is the first question in the list. Users are probably taking time to discover how the test works and are scrolling through the list first.

A higher average response time for other questions can be explained by comprehension issues or simply longer sentences. Let's find the correlation between the average response time and the number of words in the question.

In [None]:
with open("../input/big-five-personality-test/IPIP-FFM-data-8Nov2018/codebook.txt") as f:
    lines = f.readlines()
questions = lines[7:57]
questions = [x.replace("\n", "").split("\t") for x in questions]
questions = pd.DataFrame.from_records(questions, columns = ["code", "question"])
questions["wc"] = [len(x) for x in questions["question"].str.split()]
questions["lc"] = [len(x) for x in questions["question"]]
print(questions.head())

Let's merge both dataframes now.

In [None]:
mean_response_time = data_time[answer_columns_time].mean()
df_mean_response_time = pd.DataFrame({"time": mean_response_time})
df_mean_response_time["code"] = df_mean_response_time.index.str.replace("_E", "")

df_mean_response_time = df_mean_response_time.merge(questions, on = "code")
print(df_mean_response_time.head())

The first question (`EXT1`) is excluded from the computation because its response times are biased, as explained earlier.

In [None]:
print("Pearson correlation coefficient between response time and words count:\n",
      np.corrcoef(df_mean_response_time["time"][1:], df_mean_response_time["wc"][1:])[0, 1])
print("Pearson correlation coefficient between response time and letters count:\n",
      np.corrcoef(df_mean_response_time["time"][1:], df_mean_response_time["lc"][1:])[0, 1])

The correlation coefficient is pretty high, no surprise here. However, we can try to figure out whether some questions slowed users down (comprehension issues) by looking at observations lying far above a simple OLS regression line.
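The regression idea can be sketched as follows. This is a minimal plain-Python illustration on invented numbers (the question codes, word counts and times below are hypothetical, as is the `ols_fit` helper); on the real table, `np.polyfit(df_mean_response_time["wc"][1:], df_mean_response_time["time"][1:], 1)` would return the same slope and intercept.

```python
def ols_fit(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (word count, mean response time in seconds) pairs
codes = ["Q1", "Q2", "Q3", "Q4", "Q5"]
word_count = [3, 5, 6, 8, 10]
mean_time = [3.0, 4.0, 6.5, 5.5, 6.5]

slope, intercept = ols_fit(word_count, mean_time)

# Residual = observed time minus the time predicted from the word count
residuals = {c: t - (slope * w + intercept)
             for c, w, t in zip(codes, word_count, mean_time)}

# The largest positive residual flags the question that took longer than its length predicts
slowest = max(residuals, key=residuals.get)
print(slowest, round(residuals[slowest], 2))
```

Questions with a large positive residual are the candidates worth reading again for wording or comprehension issues.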

The following plot shows how the mean response time evolves from the first question to the last.

In [None]:
plt.figure(figsize = (20, 10))

for i in range(0, 41, 10):
    # plotting a vertical line for each personality trait
    plt.axvline(x = i, color = "black", alpha = 0.5)
    
sns.lineplot(data = mean_response_time, sort = False, linewidth = 4, drawstyle = "steps-pre")

plt.xticks(rotation = 90)
plt.grid(axis = "y")

There's a funny pattern from `EST5` to `EST9`. If you look at the corresponding questions, they are all positively keyed, simple questions that ask for almost redundant information.

The response time analysis ends here. The time-filtered dataset can now be discarded.

In [None]:
del data_time

## Location analysis

Let's find out where the respondents are coming from.

We have country codes in the `country` column of the dataset. Let's convert them from the ISO 3166-1 alpha-2 standard to alpha-3 (e.g. `FR` becomes `FRA`) to be able to match countries with their location in a plotly map without using GPS information.

Additional columns will be created:

* `country_name`: contains the country short name (e.g. France)
* `continent`: contains the name of the continent the country belongs to (this information will be used later).

Replacement dictionaries will be used to convert standards.

In [None]:
iso2 = list(data["country"].unique())

# Codes excluded from the continent lookup
unknown_iso2 = ["NONE", "SX", "TL", "AQ"]
known_iso2 = [code for code in iso2 if code not in unknown_iso2]

iso3 = coco.convert(names = iso2, to = "ISO3")
continent = pd.Series([pycoco.country_alpha2_to_continent_code(code) for code in known_iso2])
short_name = coco.convert(names = iso2, to = "name_short")

dict_continent_name = {
    'NA': 'North America',
    'SA': 'South America', 
    'AS': 'Asia',
    'OC': 'Oceania',
    'EU': 'Europe',
    'AF': 'Africa'
}

continent = continent.replace(dict_continent_name)
dict_country = dict(zip(iso2, iso3))
dict_short_name = dict(zip(iso3, short_name))
# Pair continents with known_iso2 (not iso2) so that codes and continents stay aligned
dict_continent = dict(zip(known_iso2, continent))

# Keep the original two-letter code under an explicit name
data["country_iso2"] = data["country"]
data["country_iso3"] = data["country"].replace(dict_country)
data["country_name"] = data["country_iso3"].replace(dict_short_name)
data["continent"] = data["country"].replace(dict_continent)
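Replacement dictionaries built with `zip` pair each code with the value found at the same position, so both sequences have to cover the same codes in the same order. The plain-Python sketch below illustrates this with a hand-written lookup (the `continent_of` table and the short code list are illustrative stand-ins for the converter calls), and shows what happens when unknown codes are filtered out of one side only.

```python
codes = ["FR", "NONE", "US", "JP"]
continent_of = {"FR": "Europe", "US": "North America", "JP": "Asia"}

# Only the codes that the lookup knows about
known = [c for c in codes if c in continent_of]
continents = [continent_of[c] for c in known]

# Misaligned: the full code list is zipped with the filtered continents
wrong = dict(zip(codes, continents))
# Aligned: both sides come from the same filtered list
right = dict(zip(known, continents))

print(wrong)
print(right)
```

In the misaligned version every code after the first unknown one silently receives its neighbour's continent, and no error is raised.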

Then, the number of observations per country is computed in order to display it on a choropleth map.

In [None]:
country_table = data["country_iso3"].value_counts()
country_table = country_table.to_frame("count")
country_table["country_iso3"] = country_table.index
country_table["country_name"] = country_table["country_iso3"].replace(dict_short_name)
country_table["hover_text"] = country_table["country_name"] + '<br>' + \
    country_table["count"].apply("{:,}".format) + " obs."
print(country_table)

In [None]:
fig = go.Figure(data = go.Choropleth(
    locations = country_table.index,
    z = np.log(country_table["count"]),
    text = country_table["hover_text"],
    showscale = False,
    colorscale = "Reds",
    hoverinfo = "text",
    marker_line_color = "darkgray"
))

fig.update_layout(title = "Number of responses per country (logarithmic color scale)")
fig.show()

(Do not hesitate to move your mouse over the map in order to see the exact number of responses).

The United States is by far the country with the most responses (471,912 obs.). Africa has a low coverage rate.
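The logarithmic color scale matters precisely because of this imbalance. As a rough sketch (only the United States count comes from the map; "Country B" and "Country C" and their counts are hypothetical), compare how much of the color range each country would use on a linear scale and on a log scale:

```python
import math

# Only the United States figure is taken from the dataset; the others are made up
counts = {"United States": 471_912, "Country B": 5_000, "Country C": 60}
largest = max(counts.values())

# Fraction of the color range used by each country on both scales
linear = {name: n / largest for name, n in counts.items()}
logged = {name: math.log(n) / math.log(largest) for name, n in counts.items()}

for name in counts:
    print(name, round(linear[name], 4), round(logged[name], 4))
```

On the linear scale the smaller countries would be almost indistinguishable from zero, whereas the log scale spreads them over the color range.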

## Personality traits analysis

### Trait global score

Personality trait scores can be obtained by aggregating answers. According to the test documentation, questions can be positively keyed or negatively keyed.

| Code  | Question                                                 | Key |    | Code  | Question                                                 | Key |
|-------|:---------------------------------------------------------|-----|----|-------|:---------------------------------------------------------|-----|
| EXT1	| I am the life of the party.                              | (+) | \| | EST1  | I get stressed out easily.                               | (+) |
| EXT2	| I don't talk a lot.                                      | (-) | \| | EST2  | I am relaxed most of the time.                           | (-) |
| EXT3	| I feel comfortable around people.                        | (+) | \| | EST3  | I worry about things.                                    | (+) |
| EXT4	| I keep in the background.                                | (-) | \| | EST4  | I seldom feel blue.                                      | (-) |
| EXT5	| I start conversations.                                   | (+) | \| | EST5  | I am easily disturbed.                                   | (+) |
| EXT6	| I have little to say.                                    | (-) | \| | EST6  | I get upset easily.                                      | (+) |
| EXT7	| I talk to a lot of different people at parties.          | (+) | \| | EST7  | I change my mood a lot.                                  | (+) |
| EXT8	| I don't like to draw attention to myself.                | (-) | \| | EST8  | I have frequent mood swings.                             | (+) |
| EXT9	| I don't mind being the center of attention.              | (+) | \| | EST9  | I get irritated easily.                                  | (+) |
| EXT10	| I am quiet around strangers.                             | (-) | \| | EST10 | I often feel blue.                                       | (+) |


| Code  | Question                                                 | Key |    | Code  | Question                                                 | Key |
|-------|:---------------------------------------------------------|-----|----|-------|:---------------------------------------------------------|-----|
| AGR1	| I feel little concern for others.                        | (-) | \| | CSN1  | I am always prepared.                                    | (+) |
| AGR2	| I am interested in people.                               | (+) | \| | CSN2  | I leave my belongings around.                            | (-) |
| AGR3	| I insult people.                                         | (-) | \| | CSN3  | I pay attention to details.                              | (+) |
| AGR4	| I sympathize with others' feelings.                      | (+) | \| | CSN4  | I make a mess of things.                                 | (-) |
| AGR5	| I am not interested in other people's problems.          | (-) | \| | CSN5  | I get chores done right away.                            | (+) |
| AGR6	| I have a soft heart.                                     | (+) | \| | CSN6  | I often forget to put things back in their proper place. | (-) |
| AGR7	| I am not really interested in others.                    | (-) | \| | CSN7  | I like order.                                            | (+) |
| AGR8	| I take time out for others.                              | (+) | \| | CSN8  | I shirk my duties.                                       | (-) |
| AGR9	| I feel others' emotions.                                 | (+) | \| | CSN9  | I follow a schedule.                                     | (+) |
| AGR10	| I make people feel at ease.                              | (+) | \| | CSN10 | I am exacting in my work.                                | (+) |


| Code  | Question                                                 | Key |
|-------|:---------------------------------------------------------|-----|
| OPN1  | I have a rich vocabulary.                                | (+) |
| OPN2  | I have difficulty understanding abstract ideas.          | (-) |
| OPN3  | I have a vivid imagination.                              | (+) |
| OPN4  | I am not interested in abstract ideas.                   | (-) |
| OPN5  | I have excellent ideas.                                  | (+) |
| OPN6  | I do not have a good imagination.                        | (-) |
| OPN7  | I am quick to understand things.                         | (+) |
| OPN8  | I use difficult words.                                   | (+) |
| OPN9  | I spend time reflecting on things.                       | (+) |
| OPN10 | I am full of ideas.                                      | (+) |

First, let's rescale values (1, ..., 5) to (-2, ..., 2), so that trait scores can be compared with each other.

In [None]:
data[answer_columns] = data[answer_columns].apply(lambda x: x - 3)

Then, let's aggregate values to get a score for each personality trait, according to the positive or negative keys listed in the tables above.

In [None]:
data["EXT"] = data["EXT1"] - data["EXT2"] + data["EXT3"] - data["EXT4"] + \
    data["EXT5"] - data["EXT6"] + data["EXT7"] - data["EXT8"] + data["EXT9"] - data["EXT10"]

data["EST"] = data["EST1"] - data["EST2"] + data["EST3"] - data["EST4"] + \
    data["EST5"] + data["EST6"] + data["EST7"] + data["EST8"] + data["EST9"] + data["EST10"]

data["AGR"] = - data["AGR1"] + data["AGR2"] - data["AGR3"] + data["AGR4"] - \
    data["AGR5"] + data["AGR6"] - data["AGR7"] + data["AGR8"] + data["AGR9"] + data["AGR10"]

data["CSN"] = data["CSN1"] - data["CSN2"] + data["CSN3"] - data["CSN4"] + \
    data["CSN5"] - data["CSN6"] + data["CSN7"] - data["CSN8"] + data["CSN9"] + data["CSN10"]

# OPN8 ("I use difficult words.") is positively keyed in the IPIP scoring key
data["OPN"] = data["OPN1"] - data["OPN2"] + data["OPN3"] - data["OPN4"] + \
    data["OPN5"] - data["OPN6"] + data["OPN7"] + data["OPN8"] + data["OPN9"] + data["OPN10"]
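The keyed aggregation can also be written as a small generic function. The sketch below is an illustration rather than the notebook's own code: `EXT_KEYS`, `trait_score` and the `respondent` answers are hypothetical names and values, and the function works on raw answers (1 to 5), centring them itself instead of relying on the rescaling cell.

```python
# Sign of each Extroversion item, following the (+)/(-) keys of the EXT table
EXT_KEYS = {
    "EXT1": +1, "EXT2": -1, "EXT3": +1, "EXT4": -1, "EXT5": +1,
    "EXT6": -1, "EXT7": +1, "EXT8": -1, "EXT9": +1, "EXT10": -1,
}

def trait_score(answers, keys):
    """Sum centred answers (raw 1..5 minus 3), flipping negatively keyed items."""
    return sum(sign * (answers[item] - 3) for item, sign in keys.items())

# A hypothetical respondent who answers like a strong extrovert:
# 5 on every positively keyed item, 1 on every negatively keyed item
respondent = {item: (5 if sign > 0 else 1) for item, sign in EXT_KEYS.items()}
neutral = {item: 3 for item in EXT_KEYS}

print(trait_score(respondent, EXT_KEYS))
print(trait_score(neutral, EXT_KEYS))
```

With ten items per trait, scores range from -20 to 20, and a respondent who stays neutral on every item scores 0.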

We can plot the score distributions for each personality trait.

In [None]:
fig, axs = plt.subplots(ncols = 2, nrows = 3, figsize = (18, 18))
sns.distplot(data["EXT"], bins = 40, kde = False, 
             ax = axs[0, 0], color = sns.color_palette()[0]).set_title("Extroversion")
sns.distplot(data["EST"], bins = 40, kde = False, 
             ax = axs[0, 1], color = sns.color_palette()[1]).set_title("Neuroticism")
sns.distplot(data["AGR"], bins = 40, kde = False, 
             ax = axs[1, 0], color = sns.color_palette()[2]).set_title("Agreeableness")
sns.distplot(data["CSN"], bins = 40, kde = False, 
             ax = axs[1, 1], color = sns.color_palette()[3]).set_title("Conscientiousness")
sns.distplot(data["OPN"], bins = 40, kde = False, 
             ax = axs[2, 0], color = sns.color_palette()[4]).set_title("Openness")

fig.delaxes(axs[2, 1])
for ax in axs.flat:
    ax.set(xlabel = None, ylabel = "Count")
    
plt.show()

### Correlations

Are personality traits correlated?

In [None]:
correlation = data[personality_traits].corr()
print(correlation)

In [None]:
mask = np.zeros_like(correlation)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    sns.heatmap(correlation, mask = mask, vmax = .3, cmap = "RdYlBu")
plt.show()

Extroversion and Agreeableness go in the same direction (+0.30). However, Neuroticism has a negative correlation with Extroversion and Conscientiousness (-0.22 and -0.23).
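As a reminder of what `DataFrame.corr()` computes by default, here is a minimal plain-Python sketch of the Pearson coefficient. The three short score lists are made up for illustration and do not come from the dataset; `est` is deliberately the exact opposite of `ext`.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical trait scores for five respondents
ext = [4, -2, 10, 0, 6]
agr = [8, 2, 12, 6, 4]
est = [-4, 2, -10, 0, -6]

print(round(pearson(ext, agr), 2))  # positive: both traits move together
print(round(pearson(ext, est), 2))  # negative: the traits move in opposite directions
```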

### Values pattern

Do some questions have specific patterns (mostly extreme values? neutral? ...)?

In [None]:
df_answers = pd.melt(data[answer_columns])
df_answers["trait"] = df_answers["variable"].str.slice(0, 3)
df_answers = df_answers.groupby(["variable", "value"]).count()
df_answers.reset_index(inplace = True)
df_answers = df_answers.rename(columns = {"trait": "count"})
df_answers["trait"] = df_answers["variable"].str.slice(0, 3)

In [None]:
fig, axs = plt.subplots(ncols = 2, nrows = 3, figsize = (18, 18))

sns.scatterplot(x = "variable", y = "value", size = "count", 
                color = sns.color_palette()[0], data = df_answers[df_answers["trait"] == "EXT"], 
                sizes = (100, 700), ax = axs[0, 0], legend = None).set_title("Extroversion")

sns.scatterplot(x = "variable", y = "value", size = "count", 
                color = sns.color_palette()[1], data = df_answers[df_answers["trait"] == "EST"], 
                sizes = (100, 700), ax = axs[0, 1], legend = None).set_title("Neuroticism")

sns.scatterplot(x = "variable", y = "value", size = "count", 
                color = sns.color_palette()[2], data = df_answers[df_answers["trait"] == "AGR"], 
                sizes = (100, 700), ax = axs[1, 0], legend = None).set_title("Agreeableness")

sns.scatterplot(x = "variable", y = "value", size = "count", 
                color = sns.color_palette()[3], data = df_answers[df_answers["trait"] == "CSN"], 
                sizes = (100, 700), ax = axs[1, 1], legend = None).set_title("Conscientiousness")

sns.scatterplot(x = "variable", y = "value", size = "count", 
                color = sns.color_palette()[4], data = df_answers[df_answers["trait"] == "OPN"], 
                sizes = (100, 700), ax = axs[2, 0], legend = None).set_title("Openness")

fig.delaxes(axs[2, 1])

for ax in axs.flat:
    ax.set(xlabel = None, ylabel = "Value")

plt.setp(axs, yticks = range(-2, 3))
plt.show()

For the Agreeableness and Openness traits, respondents seem to relate strongly to the questions: they rarely stay neutral and prefer ticking extreme values on the scale. Both traits have a left-skewed, roughly bell-shaped score distribution. The pattern is noisier for the other traits.

### Continental analysis

Let's try to figure out if there are differences in personality traits between continents.

In [None]:
fig, axs = plt.subplots(ncols = 2, nrows = 3, figsize = (18, 18))
list_continent = ["Asia", "Africa", "Oceania", "North America", "South America", "Europe"]
for continent in list_continent:
    g = sns.distplot(data[data["continent"] == continent]["EXT"], bins = 40,
                     hist = False, ax = axs[0, 0], label = continent).set_title("Extroversion")
    g = sns.distplot(data[data["continent"] == continent]["EST"], bins = 40, 
                     hist = False, ax = axs[0, 1], label = continent).set_title("Neuroticism")
    g = sns.distplot(data[data["continent"] == continent]["AGR"], bins = 40, 
                     hist = False, ax = axs[1, 0], label = continent).set_title("Agreeableness")
    g = sns.distplot(data[data["continent"] == continent]["CSN"], bins = 40, 
                     hist = False, ax = axs[1, 1], label = continent).set_title("Conscientiousness")
    g = sns.distplot(data[data["continent"] == continent]["OPN"], bins = 40, 
                     hist = False, ax = axs[2, 0], label = continent).set_title("Openness")

fig.delaxes(axs[2, 1])
for ax in axs.flat:
    ax.set(xlabel = None)
plt.legend()
plt.show()

Asia stands out on every plot. Small differences can also be noticed for Europe.

### Time analysis

First of all, let's see how many users took the test per day. The dataset starts in March 2016 and ends in November 2018.

In [None]:
data["date"] = data["dateload"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")).apply(lambda x: x.date())
data["year"] = data["date"].apply(lambda x: x.year)
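For reference, the format string used above maps one timestamp as follows. The `raw` value is a hypothetical example of the `dateload` format, not a row taken from the dataset.

```python
import datetime

raw = "2016-03-03 02:01:01"  # hypothetical dateload value
parsed = datetime.datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

print(parsed.date(), parsed.year)
```

On a frame of this size, a vectorized alternative would be `pd.to_datetime(data["dateload"])` followed by the `.dt.date` and `.dt.year` accessors, which avoids calling `strptime` once per row.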

In [None]:
date_table = data["date"].value_counts()

plt.figure(figsize = (12, 10))
sns.lineplot(x = date_table.index, y = date_table)
plt.show()

There are a few huge spikes, but there's no point trying to figure out the reason behind them. In any case, the number of responses didn't change much from 2016/05 to 2017/01. However, an increasing trend appears from 2018 onwards.

Is there any significant shift in users' scores over time?

In [None]:
fig, axs = plt.subplots(ncols = 2, nrows = 3, figsize = (18, 18))
for year in [2016, 2017, 2018]:
    g = sns.distplot(data[data["year"] == year]["EXT"], bins = 40, 
                 hist = False, label = str(year), ax = axs[0, 0]).set_title("Extroversion")
    g = sns.distplot(data[data["year"] == year]["EST"], bins = 40, 
                 hist = False, label = str(year), ax = axs[0, 1]).set_title("Neuroticism")
    g = sns.distplot(data[data["year"] == year]["AGR"], bins = 40, 
                 hist = False, label = str(year), ax = axs[1, 0]).set_title("Agreeableness")
    g = sns.distplot(data[data["year"] == year]["CSN"], bins = 40, 
                 hist = False, label = str(year), ax = axs[1, 1]).set_title("Conscientiousness")
    g = sns.distplot(data[data["year"] == year]["OPN"], bins = 40, 
                 hist = False, label = str(year), ax = axs[2, 0]).set_title("Openness")
    
fig.delaxes(axs[2, 1])

for ax in axs.flat:
    ax.set(xlabel = None)

plt.show()

Nothing important to report: no obvious shift appears over the 2016-2018 period. It would have been interesting to compare the situation before and after some kind of world crisis, since people might become more nervous, less open, etc.

### Country analysis

Finally, let's plot the average of each personality trait on a map for each country.

In [None]:
traits_per_country = data.groupby("country_iso3").agg({"EXT": "mean",
                                                      "EST": "mean",
                                                      "AGR": "mean",
                                                      "CSN": "mean",
                                                      "OPN": ["mean", "size"]})
# Column names must follow the aggregation order above (EXT, EST, AGR, CSN, OPN),
# which differs from the order of the personality_traits list
traits_per_country.columns = ["EXT", "EST", "AGR", "CSN", "OPN", "count"]
traits_per_country["country_iso3"] = traits_per_country.index
traits_per_country["country_name"] = traits_per_country["country_iso3"].replace(dict_short_name)

Countries with 50 answers or fewer will be discarded so that small samples do not distort the results.

In [None]:
traits_per_country = traits_per_country[traits_per_country["count"] > 50]
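The group, average and filter steps can be sketched in plain Python as follows. The records and the tiny `MIN_COUNT` threshold are invented for illustration (the notebook uses 50); labelling each aggregate explicitly by name avoids any dependence on column order.

```python
from collections import defaultdict

# (country, extroversion score) pairs, invented for illustration
records = [("FRA", 4), ("FRA", -2), ("FRA", 1), ("USA", 6), ("USA", 0), ("ISL", 18)]
MIN_COUNT = 2  # stand-in for the notebook's threshold of 50

# Group scores by country
scores = defaultdict(list)
for country, score in records:
    scores[country].append(score)

# Keep only countries with strictly more than MIN_COUNT answers
summary = {
    country: {"EXT": sum(values) / len(values), "count": len(values)}
    for country, values in scores.items()
    if len(values) > MIN_COUNT
}
print(summary)
```

The single very high score from "ISL" never reaches the map, which is the point of the threshold: one respondent should not define a country's color.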

In [None]:
fig = go.Figure(data = go.Choropleth(
    locations = traits_per_country.index,
    z = traits_per_country["EXT"],
    text = traits_per_country["country_name"],
    colorscale = px.colors.diverging.Portland_r,
    marker_line_color = "darkgray"
))

fig.update_layout(title = "Extroversion (higher means more extroverted)")
fig.show()

It would be interesting to combine those data with socio-demographic data (living conditions, ...).

In [None]:
fig = go.Figure(data = go.Choropleth(
    locations = traits_per_country.index,
    z = traits_per_country["EST"],
    text = traits_per_country["country_name"],
    colorscale = px.colors.diverging.Portland_r,
    marker_line_color = "darkgray"
))

fig.update_layout(title = "Neuroticism (higher means more stressed)")
fig.show()

South America seems to score high on neuroticism.

In [None]:
fig = go.Figure(data = go.Choropleth(
    locations = traits_per_country.index,
    z = traits_per_country["AGR"],
    text = traits_per_country["country_name"],
    colorscale = px.colors.diverging.Portland_r,
    marker_line_color = "darkgray"
))

fig.update_layout(title = "Agreeableness (higher means more agreeable)")
fig.show()

Surprisingly, Canada has a pretty low average score on Agreeableness. However, it is important to note that each region of the world has its own frame of reference. Canadians are known around the world for being very friendly and nice, but they might rate themselves as average because they're used to it.

In [None]:
fig = go.Figure(data = go.Choropleth(
    locations = traits_per_country.index,
    z = traits_per_country["CSN"],
    text = traits_per_country["country_name"],
    colorscale = px.colors.diverging.Portland_r,
    marker_line_color = "darkgray"
))

fig.update_layout(title = "Conscientiousness (higher means more conscientious)")
fig.show()

In [None]:
fig = go.Figure(data = go.Choropleth(
    locations = traits_per_country.index,
    z = traits_per_country["OPN"],
    text = traits_per_country["country_name"],
    colorscale = px.colors.diverging.Portland_r,
    marker_line_color = "darkgray"
))

fig.update_layout(title = "Openness (higher means more open)")
fig.show()

There's a clear separation between Eastern and Western populations. People in the Western world rate themselves as more open, whereas the openness score is globally lower in Eastern countries.

Once again, it is important to keep in mind that each country or region may use the rating scale differently.

# Conclusion

Through this EDA, we studied the dataset from different angles: response times, time series, geographical analysis (continents, countries). Furthermore, comparisons have been made between countries, continents and years for each personality trait.

We could explore this dataset further and exploit it better by matching it with socio-demographic data. Regressions could be run in that case.

Moreover, unsupervised algorithms could be applied to this dataset in order to find clusters. Based on the results, we would be able to describe each cluster so that people can recognize themselves in the results.

If you've enjoyed this notebook, do not hesitate to upvote it, and feel free to ask questions or share feedback and new ideas.