In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Big 5 personality


In psychological trait theory, the Big Five personality traits, also known as the five-factor model (FFM) and the OCEAN model, are a suggested taxonomy, or grouping, for personality traits, developed from the 1980s onwards. When factor analysis (a statistical technique) is applied to personality survey data, some words used to describe aspects of personality are often applied to the same person.
The theory of the Big Five personality traits claims that we can describe ourselves with five main characteristics: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism.

* openness to experience (inventive/curious vs. consistent/cautious)
* conscientiousness (efficient/organized vs. extravagant/careless)
* extraversion (outgoing/energetic vs. solitary/reserved)
* agreeableness (friendly/compassionate vs. challenging/callous)
* neuroticism (sensitive/nervous vs. resilient/confident)

This data set was collected (2016-2018) through an interactive on-line personality test. The personality test was constructed from the IPIP.

Each item was answered on a five-point scale labeled 1=Disagree, 3=Neutral, 5=Agree.

You can find more info about each question in the data set link.

In this study I will analyse the data set and use the unsupervised learning algorithm K-Means to cluster the participants.

Resources:

* https://www.youtube.com/watch?v=IB1FVbo8TSs
* https://ipip.ori.org/newBigFive5broadKey.htm
* https://www.kaggle.com/tunguz/big-five-personality-test

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
import pycountry

from datetime import datetime
from shapely.geometry import Point
from geopandas import GeoDataFrame

## Loading the dataset

In [None]:
# List the files shipped with the dataset
print(os.listdir('../input/big-five-personality-test/IPIP-FFM-data-8Nov2018'))
path = r'../input/big-five-personality-test/IPIP-FFM-data-8Nov2018/data-final.csv'
df_full = pd.read_csv(path, sep='\t')
pd.options.display.max_columns = 999
df_full.head()

In [None]:
df = df_full.copy() # reset df
df.head()

In [None]:
df.country.values

In [None]:
print(f'Number of countries: {len(set(df.country.values))}')
print('Number of participants: ', len(df))

In [None]:
print('How many missing values? ', df.isnull().values.sum())
df.dropna(inplace=True)
print('Number of participants after removing missing values: ', len(df))

In [None]:
country_dict = {i.alpha_2: i.alpha_3 for i in pycountry.countries}
countries = pd.DataFrame(df.country.value_counts()).T\
              .drop('NONE', axis=1)\
              .rename(columns=country_dict, index={'country': 'count'})
countries

In [None]:
# Participants' nationality distribution
# Name the counts column explicitly so the code does not depend on the pandas version
countries = df['country'].value_counts().rename('participants').to_frame()
countries_4000 = countries[countries['participants'] >= 4000]
plt.figure(figsize=(15,5))
sns.barplot(data=countries_4000, x=countries_4000.index, y='participants')
plt.title('Countries With More Than 4000 Participants')
plt.ylabel('Participants');

## Plotting world map

In [None]:
lat_cor = pd.to_numeric(df['lat_appx_lots_of_err'], errors='coerce')
long_cor = pd.to_numeric(df['long_appx_lots_of_err'], errors='coerce')

In [None]:
geometry = [Point(xy) for xy in zip(long_cor, lat_cor)]

In [None]:
gdf = GeoDataFrame(df, geometry=geometry)

In [None]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
gdf.plot(ax=world.plot(figsize=(40, 6)), marker='o', color='red', markersize=5);

## Analysing based on date

In [None]:
# Parse the timestamp column once, then extract year and month
dateload = pd.to_datetime(df['dateload'], format='%Y-%m-%d %H:%M:%S')
df['year'] = dateload.dt.year
df['month'] = dateload.dt.month

In [None]:
extract_date = df[['year', 'month','IPC']]
date_analysis = pd.DataFrame(extract_date.groupby(['year', 'month']).agg('count')).reset_index()

In [None]:
plt.figure(figsize=(15, 6))
sns.barplot(x="year", hue="month", y="IPC", data=date_analysis)
plt.show()

## Visualizing the questions and answers

In [None]:
# Defining a function to visualize the questions and answers distribution
def vis_questions(groupname, questions, color):
    plt.figure(figsize=(40,60))
    for i in range(1, 11):
        plt.subplot(10,5,i)
        plt.hist(df[groupname[i-1]], bins=14, color= color, alpha=.5)
        plt.title(questions[groupname[i-1]], fontsize=24)

In [None]:
ext_questions = {'EXT1' : 'I am the life of the party.', 
                 'EXT2':'I don\'t talk a lot.', 
                 'EXT3':'I feel comfortable around people.', 
                 'EXT4':'I keep in the background.', 
                 'EXT5':'I start conversations.',
                 'EXT6':'I have little to say.', 
                 'EXT7':'I talk to a lot of different people at parties.', 
                 'EXT8':'I don\'t like to draw attention to myself.',
                 'EXT9':'I don\'t mind being the center of attention.',
                 'EXT10':'I am quiet around strangers.'}

EXT = [column for column in df if column.startswith('EXT')]

In [None]:
print('Q&As Related to Extroversion Personality')
vis_questions(EXT, ext_questions, 'orange')

In [None]:
est_questions = {'EST1' :'I get stressed out easily.', 
                 'EST2' :'I am relaxed most of the time.', 
                 'EST3' :'I worry about things.', 
                 'EST4' :'I seldom feel blue.', 
                 'EST5' :'I am easily disturbed.',
                 'EST6' :'I get upset easily.', 
                 'EST7' :'I change my mood a lot.', 
                 'EST8' :'I have frequent mood swings.',
                 'EST9' :'I get irritated easily',
                 'EST10':'I often feel blue.'}

EST = [column for column in df if column.startswith('EST')]

In [None]:
print('Q&As Related to Neuroticism Personality')
vis_questions(EST, est_questions, 'black')

In [None]:
agr_questions = {'AGR1' :'I feel little concern for others.', 
                 'AGR2' :'I am interested in people.', 
                 'AGR3' :'I insult people.', 
                 'AGR4' :'I sympathize with others\' feelings.', 
                 'AGR5' :'I am not interested in other people\'s problems.',
                 'AGR6' :'I have a soft heart', 
                 'AGR7' :'I am not really interested in others.', 
                 'AGR8' :'I take time out for others',
                 'AGR9' :'I feel others\' emotions',
                 'AGR10':'I make people feel at ease.'}

AGR = [column for column in df if column.startswith('AGR')]

In [None]:
print('Q&As Related to Agreeable Personality')
vis_questions(AGR, agr_questions, 'red')

In [None]:
csn_questions = {'CSN1' :'I am always prepared.', 
                 'CSN2' :'I leave my belongings around.', 
                 'CSN3' :'I pay attention to details', 
                 'CSN4' :'I make a mess of things.', 
                 'CSN5' :'I get chores done right away.',
                 'CSN6' :'I often forget to put things back in their proper place.', 
                 'CSN7' :'I like order.', 
                 'CSN8' :'I shirk my duties.',
                 'CSN9' :'I follow a schedule.',
                 'CSN10':'I am exacting in my work.'}

CSN = [column for column in df if column.startswith('CSN')]

In [None]:
print('Q&As Related to Conscientious Personality')
vis_questions(CSN, csn_questions, 'purple')

In [None]:
opn_questions = {'OPN1' :'I have a rich vocabulary.', 
                 'OPN2' :'I have difficulty understanding abstract ideas.', 
                 'OPN3' :'I have a vivid imagination.', 
                 'OPN4' :'I am not interested in abstract ideas.', 
                 'OPN5' :'I have excellent ideas',
                 'OPN6' :'I do not have a good imagination.', 
                 'OPN7' :'I am quick to understand things.', 
                 'OPN8' :'I use difficult words.',
                 'OPN9' :'I spend time reflecting on things.',
                 'OPN10':'I am full of ideas.'}

OPN = [column for column in df if column.startswith('OPN')]

In [None]:
print('Q&As Related to Open Personality')
vis_questions(OPN, opn_questions, 'blue')

## India's personality

In [None]:
india = df[df.country =='IN']
india.head()

In [None]:
# Defining a function to visualize the questions and answers distribution
def india_response(groupname, questions, color):
    plt.figure(figsize=(40,60))
    for i in range(1, 11):
        plt.subplot(10,5,i)
        plt.hist(india[groupname[i-1]], bins=14, color= color, alpha=.5)
        plt.title(questions[groupname[i-1]], fontsize=24)

In [None]:
print('India\'s response to Extroversion Personality')
india_response(EXT, ext_questions, 'orange')

In [None]:
print('India\'s response to Neuroticism Personality')
india_response(EST, est_questions, 'black')

In [None]:
print('India\'s response to Agreeable Personality')
india_response(AGR, agr_questions, 'red')

In [None]:
print('India\'s response to Conscientious Personality')
india_response(CSN, csn_questions, 'purple')

In [None]:
print('India\'s response to Open Personality')
india_response(OPN, opn_questions, 'blue')

In [None]:
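# Keep only the first 50 columns (the answers); reassign instead of
# modifying a slice of df in place, which triggers SettingWithCopyWarning
india = india.drop(columns=india.columns[50:])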
india

In [None]:
# Count how often each answer value occurs in every question column
response_count = india.apply(pd.Series.value_counts)
response_count

In [None]:
data = df.copy()
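# Keep only the 50 answer columns and `country`:
# first drop positions 50-106, then everything after position 50
data.drop(data.columns[50:107], axis=1, inplace=True)
data.drop(data.columns[51:], axis=1, inplace=True)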
data

In [None]:
# For ease of calculation, let's scale all the values to the range 0-1 and take the first 5000 rows as a sample
from sklearn.preprocessing import MinMaxScaler

df1 = data.drop('country', axis=1)
columns = list(df1.columns)

scaler = MinMaxScaler(feature_range=(0,1))
df1 = scaler.fit_transform(df1)
df1 = pd.DataFrame(df1, columns=columns)
df_sample = df1[:5000]

In [None]:
# Visualize the elbow
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

kmeans = KMeans()
visualizer = KElbowVisualizer(kmeans, k=(2,15))
visualizer.fit(df_sample)
visualizer.show()  # poof() was renamed to show() in newer yellowbrick releases

As the elbow plot suggests, 5 clusters look like a reasonable choice for this data set, and we already know the test is built around five personality traits.
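
The elbow idea can also be checked by hand. The following is a minimal sketch, not the notebook's actual run: it assumes a small synthetic data set (`X`, three made-up blobs) instead of the survey sample, fits K-Means for several values of k, and records the inertia (within-cluster sum of squared distances). The "elbow" is the k after which the inertia stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled survey sample (an assumption, not the real data):
# three well-separated blobs in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((100, 2)) for c in centers])

# Inertia = within-cluster sum of squared distances to the nearest centroid.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On this toy data the drop from k=2 to k=3 is much larger than any later drop, which is the pattern to look for. Real survey answers are usually far less clear-cut.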

In [None]:
# Creating K-means Cluster Model
from sklearn.cluster import KMeans

# I use the unscaled data but without the country column
df_model = data.drop('country', axis=1)

# I define 5 clusters and fit my model
kmeans = KMeans(n_clusters=5)
k_fit = kmeans.fit(df_model)

In [None]:
# Predicting the Clusters
pd.options.display.max_columns = 10
predictions = k_fit.labels_
df_model['Clusters'] = predictions
df_model.head()

In [None]:
chart = df_model.Clusters.value_counts()
chart

In [None]:
plt.pie(chart,labels=chart)

Let's group the results according to clusters. That way we can investigate the average answer to each question for each cluster.

This gives us an intuition about how our model groups people.

In [None]:
pd.options.display.max_columns = 150
df_model.groupby('Clusters').mean()

In [None]:
# Summing up the different questions groups
col_list = list(df_model)
ext = col_list[0:10]
est = col_list[10:20]
agr = col_list[20:30]
csn = col_list[30:40]
opn = col_list[40:50]

data_sums = pd.DataFrame()
data_sums['extroversion'] = df_model[ext].sum(axis=1)/10
data_sums['neurotic'] = df_model[est].sum(axis=1)/10
data_sums['agreeable'] = df_model[agr].sum(axis=1)/10
data_sums['conscientious'] = df_model[csn].sum(axis=1)/10
data_sums['open'] = df_model[opn].sum(axis=1)/10
data_sums['clusters'] = predictions
data_sums.groupby('clusters').mean()
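
One caveat about these averages: they are taken over the raw answers, yet several items are negatively worded (for example 'I don't talk a lot.' under extraversion), so agreeing with them points away from the trait. The sketch below shows one common way to handle this, reverse-scoring. It assumes a 1-5 scale and assumes that EXT2, EXT4, EXT6, EXT8 and EXT10 are the negatively keyed items (inferred from their wording; the IPIP key linked in the resources should be checked). The `toy` answers and the name `reverse_keyed` are hypothetical.

```python
import pandas as pd

# Hypothetical answers of two respondents to the ten extraversion items (1-5 scale).
ext_cols = [f'EXT{i}' for i in range(1, 11)]
toy = pd.DataFrame(
    [[5, 1, 5, 1, 5, 1, 5, 1, 5, 1],   # consistently extraverted answers
     [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]],  # neutral on every item
    columns=ext_cols,
)

# Assumption: the negatively worded items are the even-numbered ones.
reverse_keyed = ['EXT2', 'EXT4', 'EXT6', 'EXT8', 'EXT10']

scored = toy.copy()
# On a 1-5 scale, reversing maps 1<->5 and 2<->4, and leaves 3 unchanged.
scored[reverse_keyed] = 6 - scored[reverse_keyed]

raw_mean = toy.mean(axis=1)
keyed_mean = scored.mean(axis=1)
print(pd.DataFrame({'raw_mean': raw_mean, 'keyed_mean': keyed_mean}))
```

The first respondent answers in a consistently extraverted way, but the raw mean looks neutral because the negatively worded items cancel the positive ones. After reverse-scoring, the mean reflects the trait.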

In [None]:
# Visualizing the means for each cluster
dataclusters = data_sums.groupby('clusters').mean()
plt.figure(figsize=(22,3))
for i in range(0, 5):
    plt.subplot(1,5,i+1)
    # Row i holds the five trait means of cluster i
    plt.bar(dataclusters.columns, dataclusters.iloc[i, :], color='green', alpha=0.2)
    plt.plot(dataclusters.columns, dataclusters.iloc[i, :], color='red')
    plt.title('Cluster ' + str(i))
    plt.xticks(rotation=45)
    plt.ylim(0,4);

In [None]:
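# To visualize the clusters in a 2D graph I will use PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
# Fit on the answers only; the 'Clusters' label column must not leak into the projection
pca_fit = pca.fit_transform(df_model.drop('Clusters', axis=1))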

df_pca = pd.DataFrame(data=pca_fit, columns=['PCA1', 'PCA2'])
df_pca['Clusters'] = predictions
df_pca.head()

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(data=df_pca, x='PCA1', y='PCA2', hue='Clusters', palette='Set2', alpha=0.8)
plt.title('Personality Clusters after PCA');
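
A 2D projection only shows part of the structure, so it helps to check how much variance the two components keep. The following is a minimal sketch on synthetic data (the matrix `X` is made up, not the survey answers); `explained_variance_ratio_` is the scikit-learn attribute holding the share of variance per component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the answer matrix (an assumption, not the survey data):
# 500 rows, 10 columns, most of the variance in 2 latent directions.
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 2))
loadings = rng.standard_normal((2, 10))
X = latent @ loadings + 0.1 * rng.standard_normal((500, 10))

pca = PCA(n_components=2)
coords = pca.fit_transform(X)

# Fraction of total variance captured by each of the two components.
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the two ratios add up to only a small fraction on the real data, overlapping clusters in the scatter plot may still be separated in the remaining dimensions.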

In [None]:
my_data = pd.read_csv('../input/ownset/big5.csv')
my_data

In [None]:
my_personality = k_fit.predict(my_data)
print('My Personality Cluster: ', my_personality)
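
For a fitted K-Means model, `predict` assigns a new row to the cluster whose centroid is nearest in Euclidean distance. The sketch below reproduces that by hand on made-up data; the answer matrix `X` and the respondent `new_point` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-5 answers of 200 respondents to 5 questions.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(200, 5)).astype(float)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Hypothetical new respondent.
new_point = np.array([[5.0, 1.0, 3.0, 4.0, 2.0]])

# Squared Euclidean distance from the new point to every centroid.
distances = ((model.cluster_centers_ - new_point) ** 2).sum(axis=1)
manual_label = int(np.argmin(distances))
predicted_label = int(model.predict(new_point)[0])
print(manual_label, predicted_label)
```

Because the cluster numbers themselves are arbitrary and can change between runs, the label is only meaningful next to the cluster means computed from the same fitted model.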

In [None]:
# Summing up my own question groups
col_list = list(my_data)
ext = col_list[0:10]
est = col_list[10:20]
agr = col_list[20:30]
csn = col_list[30:40]
opn = col_list[40:50]

my_sums = pd.DataFrame()
my_sums['extroversion'] = my_data[ext].sum(axis=1)/10
my_sums['neurotic'] = my_data[est].sum(axis=1)/10
my_sums['agreeable'] = my_data[agr].sum(axis=1)/10
my_sums['conscientious'] = my_data[csn].sum(axis=1)/10
my_sums['open'] = my_data[opn].sum(axis=1)/10
my_sums['cluster'] = my_personality
print('Sum of my question groups')
my_sums

In [None]:
my_sum = my_sums.drop('cluster', axis=1)
plt.bar(my_sum.columns, my_sum.iloc[0,:], color='green', alpha=0.2)
plt.plot(my_sum.columns, my_sum.iloc[0,:], color='red')
plt.title(f'Cluster {my_personality[0]}')  # use the predicted cluster rather than a hard-coded number
plt.xticks(rotation=45)
plt.ylim(0,4);

## Check your score against the following traits

## Conscientiousness
Conscientiousness describes a careful, detail-oriented nature.

**High score**
If you score high on conscientiousness, you likely:

* keep things in order
* come prepared to school or work
* are goal-driven
* are persistent

If you are a conscientious person, you might follow a regular schedule and have a knack for keeping track of details. You likely deliberate over options and work hard to achieve your goals. Coworkers and friends might see you as a reliable, fair person.

You may tend to micromanage situations or tasks. You might also be cautious or difficult to please.

**Low score**
A low score on conscientiousness might mean you:

* are less organized
* complete tasks in a less structured way
* take things as they come
* finish things at the last minute
* are impulsive

A low conscientiousness score might mean you prefer a setting without structure. You may prefer doing things at your own pace to working on a deadline. This might make you appear unreliable to others.



## Agreeableness
Agreeableness refers to a desire to keep things running smoothly.

**High score**
A high score in agreeableness might mean you:

* are always ready to help out
* are caring and honest
* are interested in the people around you
* believe the best about others

If you score high in agreeableness, you’re likely helpful and cooperative. Your loved ones may often turn to you for help. People might see you as trustworthy. You may be the person others seek when they’re trying to resolve a disagreement.

In some situations, you might be a little too trusting or willing to compromise. Try to balance your knack for pleasing others with self-advocacy.

**Low score**
A low agreeableness score might mean you:

* are stubborn
* find it difficult to forgive mistakes
* are self-centered
* have less compassion for others

A low agreeableness score may mean you tend to hold grudges. You might also be less sympathetic with others. But you are also likely to avoid the pitfalls of comparing yourself to others or caring about what others think of you.

## Neuroticism
Neuroticism describes a tendency to have unsettling thoughts and feelings.

**High score**
A high score in neuroticism can mean you:

* often feel vulnerable or insecure
* get stressed easily
* struggle with difficult situations
* have mood swings

If you score high on neuroticism, you may blame yourself when things go wrong. You might also get frustrated with yourself easily, especially if you make a mistake. Chances are, you’re also prone to worrying.

But you’re likely also more introspective than others, which helps you to examine and understand your feelings.

**Low score**
If you score low on neuroticism, you likely:

* keep calm in stressful situations
* are more optimistic
* worry less
* have a more stable mood

A low neuroticism score can mean you’re confident. You may have more resilience and find it easy to keep calm under stress. Relaxation might also come more easily to you. Try to keep in mind that this might not be as easy for those around you, so be patient.

## Openness
Openness, or openness to experience, refers to a sense of curiosity about others and the world.

**High score**
If you scored high on openness, you might:

* enjoy trying new things
* be more creative
* have a good imagination
* be willing to consider new ideas

A high score on openness can mean you have broad interests. You may enjoy solving problems with new methods and find it easy to think about things in different ways. Being open to new ideas may help you adjust easily to change.

Just make sure to keep an eye out for any situations where you might need to establish boundaries, whether that be with family members or your work-life balance.

**Low score**
A low openness score might mean you:

* prefer to do things in a familiar way
* avoid change
* are more traditional in your thinking

A low openness score can mean you consider concepts in straightforward ways. Others likely see you as being grounded and down-to-earth.

## Extraversion
Extraversion refers to the energy you draw from social interactions.

**High score**
A high extraversion score might mean you:

* seek excitement or adventure
* make friends easily
* speak without thinking
* enjoy being active with others

If you score high on extraversion, you might consider yourself an extrovert. You might enjoy attention and feel recharged after spending time with friends. You likely feel your best when in a large group of people.

On the other hand, you may have trouble spending long periods of time alone.

**Low score**
A low extraversion score can mean you:

* have a hard time making small talk or introducing yourself
* feel worn out after socializing
* avoid large groups
* are more reserved

A low extraversion score can mean you prefer to spend time alone or with a small group of close friends. You might also be a more private person when it comes to sharing details about your life. This might come across as standoffish to others.

For more details, check out the following link:
[https://www.healthline.com/health/big-five-personality-traits#takeaway](https://www.healthline.com/health/big-five-personality-traits#takeaway)
