In [None]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 200)
from umap import UMAP
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
from wordcloud import WordCloud
from sklearn.cluster import KMeans
from matplotlib_venn import venn3

# Identifying subsets of the data science community in the 2019 Kaggle survey

This year's survey challenge relies on telling a story about a subset of the data science community represented in the survey:
> The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration.


In previous years, [some winners](https://www.kaggle.com/mhajabri/africai) had a great narrative and others found [interesting subsets](https://www.kaggle.com/robikscube/a-tale-of-4-kaggler-types-by-ide-use-2018-survey#The-4-Types-of-Kagglers-(by-IDE-use)) of the community to tell a story about. In this kernel I show some methods for identifying a subset of the DS community represented in the 2019 Kaggle DS/ML survey.

In [None]:
# load the survey results
df = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')

# drop the first row containing the questions

df.drop(0, inplace=True)

# Method 1: Using 'domain' knowledge

The most direct way to extract subsets of the survey respondents is to use your domain knowledge to extract subsets you believe will create an interesting story. 

Below I show several examples of using domain knowledge and how you can extract these using pandas.

### Example 1.1: Boolean masking

You are aware that there is a large gender imbalance in data science and wish to perform a study related to this. After reading some other [Kaggle kernels](https://www.kaggle.com/parulpandey/geek-girls-rising-myth-or-reality-wip) you find out that India has the smallest gender imbalance of any responding nation. You therefore decide to tell a story about female respondents in India, investigating the employment, income and education of female respondents in the nation with the smallest gender imbalance.

This can be performed with simple boolean masking as follows:

In [None]:
# create boolean masks of people who identify as Female and who are located in India
gender_mask = df['Q2']=='Female'
country_mask = df['Q3']=='India'

df_subset = df[gender_mask & country_mask]

# you can then do your analysis; let's look at the mode of each column:
df_subset.mode().iloc[0:1,] # keep only the first row of modes; later rows are mostly NaN padding

#### Perform some analysis on our identified community subset (Word cloud)

Just from looking at the mode of each question we can already draw some insights; for example, the biggest fraction of female respondents from India are Students.
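
To put a number on an observation like the one about Students, one option is `value_counts(normalize=True)` on the job-title column. The sketch below uses a made-up toy frame in place of `df_subset`, so the rows and the resulting shares are purely illustrative and are not survey results.

```python
import pandas as pd

# Toy stand-in for df_subset; these rows are invented for illustration only
toy_subset = pd.DataFrame({
    'Q5': ['Student', 'Student', 'Data Scientist', 'Software Engineer', None],
})

# Share of each job title among respondents who answered Q5
# (missing answers are excluded by default)
job_share = toy_subset['Q5'].value_counts(normalize=True)
print(job_share)
```

On the real data the same call on `df_subset['Q5']` would show how dominant the most common answer actually is, which the mode alone does not reveal.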

You can then progress and look at the subset in detail, for example using a word cloud:

In [None]:
jobcloud = WordCloud(background_color='white').generate(" ".join(df_subset['Q5'].dropna()))

fig = plt.figure()
plt.imshow(jobcloud)
plt.axis('off')
fig.set_size_inches(10,7)

### Example 1.2: Advanced boolean masking

You are a European who spent several years of your life backpacking around South America and now, as a Data Scientist back living in Europe, you are considering migrating to South America to live and work. Motivated by this, you have decided to use the Kaggle DS/ML survey to investigate what the data science community looks like for South Americans and evaluate your potential prospects.

While you could string together a large number of conditionals (as in Example 1.1) to extract this subset, you can take advantage of some other pandas methods to simplify your indexing:

In [None]:
# Make a list of all countries in SA
south_american_countries = ['Brazil', 'Colombia', 'Argentina','Peru','Venezuela',
                            'Chile', 'Ecuador', 'Bolivia', 'Paraguay', 'Uruguay',
                            'Guyana', 'Suriname', 'French Guiana']

# create boolean mask and use it
SA_mask = df['Q3'].isin(south_american_countries)
df_subset = df[SA_mask]

# again look at the mode of our subset:
df_subset.mode().iloc[0:1,] # iloc is a hack to drop some NaN columns
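
As a quick check of what `.isin` replaces, the sketch below builds the same mask twice on a toy frame: once by chaining `==` comparisons with `|`, and once with `.isin`. The toy rows and the shortened country list are assumptions made for illustration.

```python
import pandas as pd

# Toy country column; values are invented for illustration only
toy = pd.DataFrame({'Q3': ['Brazil', 'India', 'Chile', 'France', 'Peru']})
countries = ['Brazil', 'Chile', 'Peru']

# Chained conditionals: one comparison per country, combined with |
chained_mask = (toy['Q3'] == 'Brazil') | (toy['Q3'] == 'Chile') | (toy['Q3'] == 'Peru')

# The same selection expressed with a single .isin call
isin_mask = toy['Q3'].isin(countries)

print(chained_mask.equals(isin_mask))
print(toy[isin_mask])
```

The two masks select the same rows; `.isin` simply scales better as the list of countries grows.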

### Perform some analysis on our identified subset community: Bar chart

Again, we can extract some insights from this analysis: The highest number of South American respondents were from Brazil and the largest job type was Data Scientist.

With our subset, one can now move on and do more analysis. This time let's plot a bar graph of ages:

In [None]:
# count number of respondents in each age group
age_group_cnt = df_subset.groupby('Q1')['Q2'].count()

# plot the graph
fig, ax = plt.subplots()
ax.bar(age_group_cnt.index, age_group_cnt.values, color='gray', alpha=0.80)

# add axis labels etc.
ax.set_xlabel('Age Group')
ax.set_ylabel('Number of respondents')
ax.set_title('Age of respondents in South America')
_ = plt.setp(ax.get_xticklabels(), rotation=90)
fig.set_size_inches(6,4)

### Example 1.3: Creating new high-level groups

You are an economically minded person and are interested in looking at two subsets of the data science community: those on a low income and those on a high income. As an economics expert you believe you have a reliable mapping from reported income bracket to a 'low', 'medium' or 'high' income, so you can use the pandas Series `.map` method to convert the reported income bracket to an income category:

In [None]:
salary_mapping = {'$0-999':'low', '1,000-1,999':'low', 
                  '10,000-14,999':'low', '100,000-124,999':'high',
                  '125,000-149,999':'high', '15,000-19,999':'low', 
                  '150,000-199,999':'high', '2,000-2,999':'low',
                  '20,000-24,999':'low', '200,000-249,999':'high', 
                  '25,000-29,999':'medium', '250,000-299,999':'high',
                  '3,000-3,999':'low','30,000-39,999':'medium',
                  '300,000-500,000':'high', '4,000-4,999':'low',
                  '40,000-49,999':'medium', '5,000-7,499':'low', 
                  '50,000-59,999':'medium', '60,000-69,999':'medium',
                  '7,500-9,999':'low', '70,000-79,999':'medium', 
                  '80,000-89,999':'medium', '90,000-99,999':'medium',
                  '> $500,000':'high'}

# create new column for the income group and convert the old salary
df['income_group'] = df['Q10'].map(salary_mapping)

# check the number of respondents in each income group
df.groupby('income_group')['Q1'].count()
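
One detail of `.map` with a dictionary is worth keeping in mind: any value that is missing or not present in the mapping becomes `NaN`, and `groupby(...).count()` then silently leaves those rows out. The sketch below shows this on a toy Series; the two-entry mapping is a cut-down, illustrative version of `salary_mapping`.

```python
import pandas as pd

# Toy salary answers: two valid brackets, one missing answer, one unmapped value
toy_salary = pd.Series(['$0-999', '100,000-124,999', None, 'not in the mapping'])
toy_mapping = {'$0-999': 'low', '100,000-124,999': 'high'}

# Values without a key in the dictionary come back as NaN
toy_group = toy_salary.map(toy_mapping)
print(toy_group.tolist())

# dropna=False makes the unmapped rows visible in the counts
print(toy_group.value_counts(dropna=False))
```

Passing `dropna=False` to `value_counts` is a simple way to see how many respondents did not receive an income group.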

In [None]:
# Using our new group, we can then use boolean masking (Example 1.1) to look at a particular earning bracket:

income_mask = df['income_group'] == 'high'

df_subset = df[income_mask]
df_subset.mode().iloc[0:1,]


Looking at the mode we can already get some valuable insights; for example we can see that the highest number of 'high' earners are located in the United States.

### Let's look at the gender distribution across our new 'income_group' feature

This time let's create a pie chart of the gender distribution for each value of the new 'income_group' feature.

What you will see is a clear gender difference between the income categories: a smaller proportion of high-earning respondents identify as Female.

In [None]:
df.loc[~df['Q2'].isin(['Male', 'Female']), 'Q2'] = 'Other'

fig, axes = plt.subplots(nrows=1, ncols=3)

for ax, income in zip(axes, ['low','medium','high']):
    df_income = df[df['income_group']==income]
    gender_count = df_income.groupby('Q2')['Q1'].count()
    ax.pie(gender_count.values, labels=gender_count.index, autopct='%.1f',
           colors=['#a6d99c','#b19cd9','#d9d09c'])
    ax.set_title(income.capitalize() + ' income')

fig.set_size_inches(12,5)
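
The proportions drawn in the pie charts can also be read off as a table. A minimal sketch, assuming a toy frame with the same two column names (`income_group` and `Q2`) and invented rows, uses `pd.crosstab` with `normalize='index'` so that each income group sums to 1:

```python
import pandas as pd

# Toy data with the same column names as the notebook; rows are invented
toy = pd.DataFrame({
    'income_group': ['low', 'low', 'low', 'high', 'high', 'medium'],
    'Q2': ['Female', 'Male', 'Male', 'Male', 'Male', 'Female'],
})

# Row-normalised cross-tabulation: share of each gender within each income group
gender_share = pd.crosstab(toy['income_group'], toy['Q2'], normalize='index')
print(gender_share.round(2))
```

A table like this makes it easier to quote exact shares alongside the charts.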

# Method 2: Using KMeans clustering

Last year's [2nd place entry](https://www.kaggle.com/robikscube/a-tale-of-4-kaggler-types-by-ide-use-2018-survey) by Rob Mulla used KMeans clustering to great effect to identify 4 types of Kagglers by their IDE use. In this section I follow this work to show you how you can identify subsets in this year's Data Science survey challenge.


In this example I cluster on respondents' *media use* (Q12 of the 2019 survey), but you can change this to use any question (or multiple questions) to identify subsets of the data science community of interest to you.

### 2.1 First prepare the data (using a one-hot encoded representation of the media question)

In [None]:
# get a df with just media questions in it
columns_to_cluster = df.columns.str.contains('Q12')
df_media = df.loc[:, columns_to_cluster]

# convert it to a binary df
df_media = pd.get_dummies(df_media).iloc[:,0:10]

# clean up the column names
new_col_names = [col.split('_')[-1].split('(')[0].strip() for col in df_media.columns]
df_media.columns = new_col_names

# optionally drop anyone who didn't select any media interaction
drop_mask = ~(df_media.sum(axis=1)==0)
df_media = df_media[drop_mask]
df_subset = df[drop_mask].copy()  # copy so that new columns can be added later without modifying a view of df
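
To make the encoding step concrete, the sketch below applies the same `get_dummies` call and column clean-up to a tiny toy frame. It assumes, as in the survey file, that each multi-select option lives in its own column holding either the option text or a missing value; the two column names and option strings here are illustrative.

```python
import pandas as pd

# Toy multi-select question: one column per option, option text or missing
toy = pd.DataFrame({
    'Q12_Part_1': ['Twitter (data science influencers)', None, None],
    'Q12_Part_4': ['Kaggle (forums, blog, social media, etc)',
                   'Kaggle (forums, blog, social media, etc)', None],
})

# One-hot encode: a selected option becomes 1, a missing value becomes 0
encoded = pd.get_dummies(toy, dtype=int)

# Same clean-up as above: keep the text after the last '_' and before any '('
encoded.columns = [col.split('_')[-1].split('(')[0].strip() for col in encoded.columns]

# Respondents who selected at least one option
answered = encoded.sum(axis=1) > 0
print(encoded)
print(answered.tolist())
```

The third toy respondent selected nothing, so they would be removed by the optional drop step.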

### 2.2 Now perform the K-means clustering and inspect

For now we use 3 clusters, but this number will need to be tuned depending on the depth of analysis you have in mind.

You can see that the clusters are not completely clean, but a subset of media usages dominate each cluster.

In [None]:
# perform the clustering
y_pred = KMeans(n_clusters=3, random_state=42, max_iter=10000).fit_predict(df_media.values)

# add the cluster identification to the df

df_media['cluster_number'] = y_pred
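
If you want a rough guide when choosing the number of clusters, one common approach (not used in this kernel) is to compare the KMeans inertia and the silhouette score over a range of `k`. The sketch below does this on synthetic binary data, which is an assumption made for illustration; on the real data you would pass the one-hot media matrix instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic binary "answers": 300 respondents x 6 options, illustration only
X = (rng.random((300, 6)) < 0.4).astype(int)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    # inertia: within-cluster sum of squares; silhouette: separation in [-1, 1]
    scores[k] = (model.inertia_, silhouette_score(X, model.labels_))

for k, (inertia, sil) in scores.items():
    print(k, round(inertia, 1), round(sil, 3))
```

Neither number gives a definitive answer for survey data like this, so the interpretability of the resulting groups should still guide the final choice.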

In [None]:
# inspect the different clustering
cluster_sizes = df_media.groupby('cluster_number').sum()
cluster_sizes

### 2.3 Plot venn diagrams of each cluster

As in the [original notebook](https://www.kaggle.com/robikscube/a-tale-of-4-kaggler-types-by-ide-use-2018-survey#The-4-Types-of-Kagglers-(by-IDE-use)) we can use a Venn diagram to visualise the clusters.


We can see that cluster 1 features people who primarily use Kaggle, Blogs or YouTube to engage with data science media. Respondents in cluster 2 mainly use either Kaggle or YouTube but do sometimes engage with Journal publications. In cluster 3, people mainly use Kaggle and Blogs but also use Journal publications.

Not only do this clustering and graphic tell you about the state of data science media engagement (respondents mainly use Kaggle, Blogs and YouTube but also use Journal publications for their media access), they also allow you to identify different sub-groups of the community:
>- Cluster 1: People who use Kaggle, YouTube and Blogs in equal amounts.
- Cluster 2: People who mainly use Kaggle and YouTube and some Journal publications.
- Cluster 3: People who mainly use Kaggle and Blogs and some Journal publications.


In [None]:
fig, axes = plt.subplots(1, 3)

for i, (ax, cluster) in enumerate(zip(axes, cluster_sizes.index)):
    # get the top three media types used in the cluster
    values = cluster_sizes.loc[cluster,:]
    top_three = values.sort_values()[-3:]
    top_3_names = list(top_three.index)
    # build (not selected, selected) mask pairs for each of the top three media types;
    # note that these masks are evaluated over every row of df_media
    masks = [(df_media[name]==0, df_media[name]==1) for name in top_3_names]
    venn3(subsets=(len(df_media.loc[masks[0][1] & masks[1][0] & masks[2][0]]),
                   len(df_media.loc[masks[0][0] & masks[1][1] & masks[2][0]]),
                   len(df_media.loc[masks[0][1] & masks[1][1] & masks[2][0]]),
                   len(df_media.loc[masks[0][0] & masks[1][0] & masks[2][1]]),
                   len(df_media.loc[masks[0][1] & masks[1][0] & masks[2][1]]),
                   len(df_media.loc[masks[0][0] & masks[1][1] & masks[2][1]]),
                   len(df_media.loc[masks[0][1] & masks[1][1] & masks[2][1]])),
          set_labels=(top_3_names[0], top_3_names[1], top_3_names[2]), 
          ax=ax)
    # add titles to the plots
    ax.set_title(f'Cluster {i+1}', fontsize=14)

fig.set_size_inches(20,8)
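
The seven numbers passed to `venn3` are the sizes of the regions in the order (A only, B only, A and B, C only, A and C, B and C, all three). The sketch below wraps that counting in a hypothetical helper, `venn3_counts`, and applies it to the rows of a single cluster. Restricting the counts to one cluster is an assumption about how the diagrams could be read per cluster; the toy data is invented.

```python
import pandas as pd

def venn3_counts(frame, names):
    """Return the seven venn3 region sizes (Abc, aBc, ABc, abC, AbC, aBC, ABC)."""
    a, b, c = [frame[name] == 1 for name in names]
    return (
        int((a & ~b & ~c).sum()),   # A only
        int((~a & b & ~c).sum()),   # B only
        int((a & b & ~c).sum()),    # A and B, not C
        int((~a & ~b & c).sum()),   # C only
        int((a & ~b & c).sum()),    # A and C, not B
        int((~a & b & c).sum()),    # B and C, not A
        int((a & b & c).sum()),     # all three
    )

# Toy one-hot media table with a cluster label; values are invented
toy = pd.DataFrame({
    'Kaggle':  [1, 1, 0, 1],
    'Blogs':   [0, 1, 1, 1],
    'YouTube': [0, 0, 0, 1],
    'cluster_number': [0, 0, 1, 1],
})
names = ['Kaggle', 'Blogs', 'YouTube']

# Counts over every respondent versus counts within cluster 0 only
all_counts = venn3_counts(toy, names)
cluster0_counts = venn3_counts(toy[toy['cluster_number'] == 0], names)
print(all_counts)
print(cluster0_counts)
```

Either tuple can be passed directly as the `subsets` argument of `venn3`.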

### 2.4 Use your clusters in your analysis

Going forward, you can use these cluster identifications as subsets of the data science community and compare them to create your survey story.

For example, below we look at the most frequently occurring answer to each question for clusters 2 and 3. From visual inspection we can see that in cluster 2 (primarily Kaggle, YouTube and Journal publications) the most common respondent is a Student, and in cluster 3 (Kaggle, Blogs and Journal publications) the most common respondent is a Data Scientist.

In [None]:
# assign clusters to original df
df_subset['clusters'] = y_pred+1

In [None]:
# look at mode of cluster 2
df_subset[df_subset['clusters']==2].mode()

In [None]:
# look at mode of cluster 3
df_subset[df_subset['clusters']==3].mode()
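
Rather than inspecting each cluster's mode table by eye, the most common answer per cluster can be pulled out in one step with `groupby` and `agg`. A minimal sketch on a toy frame with the same `clusters` and `Q5` column names (rows invented for illustration):

```python
import pandas as pd

# Toy cluster assignments and job titles; values are invented
toy = pd.DataFrame({
    'clusters': [2, 2, 2, 3, 3, 3],
    'Q5': ['Student', 'Student', 'Data Scientist',
           'Data Scientist', 'Data Scientist', 'Student'],
})

# Most frequent job title within each cluster
top_job = toy.groupby('clusters')['Q5'].agg(lambda s: s.mode().iloc[0])
print(top_job)
```

The same pattern works for any other question column you want to compare across clusters.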

# Method 3: Using UMAP

While you can generate additional masks using the technique outlined in Method 1 to start investigating more and more niche subgroups of the data science community, another option is to cluster the data and see what subgroups exist 'naturally'. You can then select some (or one) of these sub-groups to focus your analysis around.

For this example, I will use [UMAP](https://umap-learn.readthedocs.io/en/latest/), a manifold learning technique which can produce visualisations similar to the t-SNE algorithm. We will produce a 2-dimensional embedding of the dataset using UMAP and use this to identify some subgroups of the community.

In [None]:
# select just a subsection of the questions to cluster on
questions_to_use = ['Q4','Q5','Q6','Q14']
# row 0 (the question text) was already dropped above; the try/except keeps this cell safe to re-run
try:
    df.drop(0, inplace=True, axis=0)
except KeyError:
    print('Row 0 does not exist')

In [None]:
# one hot encode the questions
encoded_df = pd.get_dummies(df[questions_to_use])

# make the column names more readable
stripped_columns = [col.split('_')[1] if not col.endswith('Other') else col for col in encoded_df.columns]
encoded_df.columns = stripped_columns

### 3.1 Perform the UMAP embedding

We will use a large value for the `n_neighbors` parameter to preserve more of the global structure. This comes at the cost of a longer run time.

In [None]:
umap_params = {'metric':'hamming',   # Hamming distance suits binary (one-hot) features
               'n_neighbors':500,    # focus more on global structure
               'random_state':1,     # use the random seeds to keep output reproducible
               'transform_seed':1
              }
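
For intuition about the `hamming` metric: for two one-hot encoded respondents it is the fraction of positions in which their vectors differ. The sketch below computes this with NumPy on two made-up rows; it assumes the normalised definition (number of differing entries divided by vector length).

```python
import numpy as np

# Two made-up one-hot encoded respondents
a = np.array([1, 0, 0, 1, 0])
b = np.array([1, 1, 0, 0, 0])

# Fraction of positions where the two vectors disagree
hamming = np.mean(a != b)
print(hamming)  # 0.4, since 2 of the 5 positions differ
```

Respondents who gave identical answers to all encoded questions have distance 0, which is why they tend to collapse into tight groups in the embedding.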

In [None]:
embedder = UMAP(**umap_params)

X_embedded = embedder.fit_transform(encoded_df)

# add the coords in the 2D embedded space  for each instance
encoded_df['x'] = X_embedded[:,0]
encoded_df['y'] = X_embedded[:,1]

# add income
encoded_df['income_group'] = df['income_group']


### 3.2 Plot the overall embedding and investigate

In the following I plot a basic scatter graph of the 2D embeddings and colour it by whether the respondent is a Data Scientist or not. Hover over the individual instances to see what job, education level, company size and primary analysis software they used. How do the clusters form?

In [None]:
fig = px.scatter(encoded_df, x="x", y="y", hover_data=encoded_df.columns[:-2], color='Data Scientist')
fig.show()

### 3.3 Zoom in on a cluster of interest centred at (17.6, -48) - what subgroup does this cluster correspond to?
*(You can do this by interacting with the above graph, but I will do it explicitly in the next cell)*

---



You can see that this cluster corresponds to Data scientists with a Master's degree who work for small companies (0-49 employees). This is not a 'sub-group' you would necessarily think of using a domain-knowledge approach only!

Having found this cluster, we can see it features a lot of respondents and would therefore be worth investigating further. We could look at Data Scientists with Master's degrees who work for small companies and compare them with Data Scientists with Master's degrees who work at large companies, asking questions such as:
>- How do their earnings differ?
- Do they use different software?
- Do they have different age distributions?


In [None]:
fig = px.scatter(encoded_df, x="x", y="y", hover_data=encoded_df.columns[:-2], color='Data Scientist')
fig.layout.xaxis.range = (16.5,19)
fig.layout.yaxis.range = (-50,-47)
fig.show()
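
Once a region of the embedding looks interesting, its members can be extracted with the same boolean masking used in Method 1, this time on the `x` and `y` coordinates. The sketch below uses a toy embedding with invented coordinates and the axis ranges from the cell above as the bounding box.

```python
import pandas as pd

# Toy embedding; coordinates and labels are invented for illustration
toy_embedding = pd.DataFrame({
    'x': [17.2, 18.1, 3.0, 17.9],
    'y': [-48.5, -47.5, 10.0, -60.0],
    'Data Scientist': [1, 1, 0, 1],
})

# Keep only the points that fall inside the bounding box of the cluster
in_box = toy_embedding['x'].between(16.5, 19) & toy_embedding['y'].between(-50, -47)
cluster_members = toy_embedding[in_box]
print(cluster_members)
```

Because `encoded_df` shares its index with `df`, the index of the selected rows could then be used to pull the full survey answers for that subgroup.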

### 3.4 Using colour to look for additional patterns

Another option you can try is colouring the clusters by something they were *not* clustered on (e.g., the income group feature we made in Method 1) to see if any interesting patterns jump out (e.g., does income correlate with job type?). This can help inform which subsets of the survey respondents could be interesting to investigate.

For example, in the following figure I replot the 2-dimensional embedding of the data and colour it by income group.


Several things jump out:

>1. In the top right of the figure, respondents with no income information have been clustered together (even though income was not used in the embedding). Looking at the instances in detail shows these are students and people who are not employed. The individual clusters within the group correspond to different highest education levels (e.g., 'Bachelors' or 'Masters').
2. Low income respondents make up the biggest fraction of respondents.
3. No other structure is obvious. This suggests that the data we clustered on may not correlate strongly with income.

In [None]:
encoded_df.loc[encoded_df['income_group'].isna(), 'income_group'] = 'No information'
fig = px.scatter(encoded_df, x="x", y="y", hover_data=encoded_df.columns[:-2], color='income_group', opacity=0.5)
fig.show()

## Conclusion

I have shown three different methods of identifying subsets of the data science community in the 2019 Kaggle DS/ML survey. Domain knowledge can be used to identify subsets of potential interest and generally creates 'clear-cut' groups (Male vs Female, Country 1 vs Country 2). KMeans clustering can find subsets of the community that are not so obvious to the naked eye, but the groups are not necessarily that 'clean'. UMAP produces very clean groups which arise naturally in the data, but often with complex definitions (e.g., Master's students who earn a lot of money and use Python).

Going further, I hope you can try different iterations of the methods shown in this kernel to identify subsets of the DS community represented in the 2019 Kaggle survey which can be used to create your analytics story.