In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib_venn import venn2,venn2_circles
import seaborn as sns


df=pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')

#Splitting DataFrame
Iran=df[df['Q3']=='Iran, Islamic Republic of...'].copy() # copy, so the rename below does not write to a slice of df
Iran['Q3']='Iran'
Israel=df[df['Q3']=='Israel']
Saudi=df[df['Q3']=='Saudi Arabia']

#Merging
Countries=[Iran,Israel,Saudi]
Country_Names=['Iran','Israel','Saudi Arabia']
n_participants={Country_Names[i]:len(Countries[i]) for i in range(3)}
mydf=pd.concat(Countries)

#Styling
colors=[['#ffb1ab','#ff7d7d','#fa2616','#850000'],['#c4d3ff','#60c6fc','#3666ff','#000485'],['#bdffbf','#8cff93', '#00ff0f','#008507']]
palette=[c[2] for c in colors]

# [2019 Kaggle ML & DS Survey](https://www.kaggle.com/c/kaggle-survey-2019)
## Iran & Israel & Saudi Arabia
## ایران / ישראל / العربیة السعودیة

| | | |
|----:|:----------:|:--------|
|![Iran](http://s7.picofile.com/file/8378252834/iran_flag_3d_round_icon_256_1_.png)|![Israel](http://s7.picofile.com/file/8378252842/israel_flag_3d_round_icon_256_1_.png)|![Saudi Arabia](http://s7.picofile.com/file/8378252868/saudi_arabia_flag_3d_round_icon_256_1_.png)|


<br/><br/>
<br/><br/>

In recent years, Iran, Israel, and Saudi Arabia (in alphabetical order!) have been rivals of a sort. The politicians at each vertex of this triangle accuse the others of terrorism and genocide.

<br/><br/>


![Good, Bad, Ugly](https://i.giphy.com/media/b3mSVYbDLvow8/giphy.webp)

<br/><br/>


But politics and military power aside, this time let's dig into the data gathered by this competition and compare the three countries in terms of data science. Bear in mind that we only have each participant's country of residence, not the country of origin... and of course, many of them have migrated to other countries. Also take into account that residents of Iran (together with Crimea, Cuba, Syria, North Korea, and Sudan) are banned from the competition, subject to U.S. export controls or sanctions. So Iran might be somewhat underrepresented in the data.

So, let's jump in.

<br/><br/>
<br/><br/>

### Number of Participants
First things first, let's see how many participants we have from each country.



In [None]:
plt.figure(figsize=(10,6))
serie=pd.Series(n_participants)
mybar=plt.bar(serie.index,serie.values,color=palette)
sns.despine(top=True, right=True, left=False, bottom=False)
_=plt.title('Number of Participants')

Mmmm,

Iran and Israel each have almost 100 participants, while Saudi Arabia has exactly 50. And we are lucky, since countries with fewer than 50 participants are merged into the group 'Others'. As the data is not balanced, we sometimes normalize it to examine the percentage of members with a specific feature. But raw numbers carry information too, so for some quantities we plot both the counts and the percentages.
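
To make the normalization idea concrete, here is a minimal sketch on made-up counts (the country labels and numbers are placeholders, not survey results). Dividing each row by its own total turns raw counts into percentages that can be compared across groups of different sizes.

```python
import pandas as pd

# Toy counts, NOT survey results: rows are countries, columns are answer options.
counts = pd.DataFrame(
    {"Yes": [30, 10], "No": [70, 40]},
    index=["Country A", "Country B"],
)

# Divide each row by its own total so groups of different size become comparable.
percentages = counts.div(counts.sum(axis=1), axis=0) * 100
print(percentages.round(1))
```

Country A has three times as many "Yes" answers in raw numbers, yet the shares (30% versus 20%) are much closer, which is why both views are worth plotting.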

<br/><br/>

### Gender

Let's take a look at the gender breakdown. As you have probably already seen in other notebooks, women's participation rate is still far lower than men's, at around 16%. For the sake of brevity, I omitted the other gender categories and used concentric doughnut plots to compare the percentages.

As you can see in the graph below, Iran has the highest rate, almost twice the world average. Saudi Arabia also exceeds the worldwide average by a meaningful margin. One reason for this popularity in these countries could be limits on women's access to some other kinds of jobs. In any case, *cultural differences* probably play an important role here.

In [None]:
gender=mydf.groupby(['Q3','Q2']).count()['Q1'].unstack()
gen_pen=np.round(100*gender['Female']/gender.sum(axis=1),2)
gen_pen=pd.DataFrame(gen_pen).reset_index().reindex([2,1,0])
gen_pen.columns=['country','percentage']


# © https://towardsdatascience.com/donut-plot-with-matplotlib-python-be3451f22704
r=.7
rd=.3
startingRadius = r + (rd* (len(gen_pen)-1))
plt.figure(figsize=(15,8))


for index, row in gen_pen.iterrows():
    country = row["country"]
    percentage = row["percentage"]
    textLabel = country + ' ' + str(percentage)+'%'
    remainingPie = 100 - percentage

    donut_sizes = [remainingPie, percentage]

    plt.text(0.01, startingRadius - 0.18, textLabel, horizontalalignment='center', verticalalignment='center')
    plt.pie(donut_sizes, radius=startingRadius, startangle=90, colors=colors[index][::2],
            wedgeprops={"edgecolor": "white", 'linewidth': 1})

    startingRadius-=rd
    

plt.axis('equal')
plt.title('Women\'s Participation Rate')

circle = plt.Circle(xy=(0, 0), radius=0.40, facecolor='white')
_=plt.gca().add_artist(circle)

### Age

I grouped ages into three categories: under 30, 30-39, and 40+. As concentric plots can be confusing with more than two categories, I have separated them this time. I will do the same for 'Educational Degree', 'Number of Employees', and 'Salaries'.

As you can see, the plots show that Iran has a younger generation of data scientists: almost *two-thirds* of them are under 30. This is more surprising when you notice that the figure is almost *one-sixth* for Israel and *one-third* for Saudi Arabia. Of course, Iran and Saudi Arabia have younger populations than Israel, but one reason for this significant difference might be the rising popularity of data science in these countries. Another reason, as I mentioned before, could be **brain drain**: the older generation might have left the country after reaching 30. It would be nice to have the participants' country of origin to examine this hypothesis.

-------------
*(You may uncomment one line of code to show the percentages as well; I preferred to avoid crowding the plots.)*



In [None]:
def pie_plotter(data,title,col_dir=1):

    plt.figure(figsize=(16,5))
    plt.suptitle(title)
    for i,country_name in enumerate(Country_Names):
        plt.subplot(1,3,i+1)
        plt.pie(data.loc[country_name],labels=data.columns,colors=colors[i][col_dir-1::col_dir],
            wedgeprops={"edgecolor": "white", 'linewidth': 1},shadow=True)
         #,autopct='%1.0f%%', pctdistance=.8, labeldistance=1.1)
        # Uncomment the style arguments above to show percentages
        
        circle = plt.Circle(xy=(0, 0), radius=0.60, facecolor='white')
        plt.xlabel(country_name)
        _=plt.gca().add_artist(circle)

In [None]:
def age_categorizer(age):
    # `age` is the first digit of the age bracket, e.g. 2 for '25-29'
    if age<3: 
        return '0-29'
    elif age==3:
        return '30-39'
    else:
        return '40+'

mydf['Age_Cat']=mydf['Q1'].apply(lambda x:age_categorizer (int(x[0])))
age_df=mydf.groupby(['Q3','Age_Cat']).count()['Q4'].unstack()

In [None]:
pie_plotter(age_df,'Age Decomposition',-1)

### Educational degrees

I did the same thing for educational degrees, combining all the other categories with Bachelor's. Two notable facts in these graphs are the share of Master's degrees in Iran and of Bachelor's degrees in Saudi Arabia. It seems that it is *easier to find a job* in Saudi Arabia with just a Bachelor's diploma than in Iran or Israel.

In [None]:
temp=mydf.groupby(['Q3','Q4']).count()['Q1'].unstack()
temp['Other']=temp.drop(['Doctoral degree','Master’s degree'],axis=1).fillna(0).sum(axis=1)
degree_df=temp[['Other','Master’s degree','Doctoral degree']]
degree_df.columns=['B.Sc./Other','M.Sc.','PhD']

In [None]:
pie_plotter(degree_df,'Education Decomposition')

### Companies Size

Looking at the doughnut plots below, you can see that more than half of Iranian data scientists work in small companies with fewer than 50 employees, while for Saudi Arabia it is the other way around: more than half of them work in companies with more than 1000 employees.

By the way, one needs to be careful when deriving insights from this quantity. For example, knowing the percentage of people working in big companies does not easily give an estimate of the number of big companies, because each big company may have several data scientists participating in the same survey. That is not surprising, as the participation of two colleagues is strongly positively correlated. *(It somehow reminds me of **[Impossible Bet](https://www.youtube.com/watch?v=eivGlBKlK6M)**; the expected size of the cycle containing a random person is larger than the expected size of a random cycle, since bigger cycles contain more people!)*

Another thing that might be worth noticing is that there may be some kind of standard/threshold for a company to hire a data scientist. It might be that only big companies hire data scientists in Saudi Arabia. This hypothesis fits the fact that the number of participants from Saudi Arabia was much lower than from Iran and Israel, but at the same time, it slightly contradicts our previous insight about how easy it is to find a job there.
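
The size-bias argument can be checked with a few lines of arithmetic. This is only a sketch on hypothetical company sizes (the list and the "big company" threshold of 10 are made up for illustration): the average size over companies differs a lot from the average size seen by a randomly chosen person.

```python
# Hypothetical numbers of data scientists per company, chosen only for illustration.
company_sizes = [1, 1, 1, 1, 16]
total_people = sum(company_sizes)

# Average size when every company is equally likely to be picked.
mean_per_company = sum(company_sizes) / len(company_sizes)

# Average size seen by a randomly picked person: a company with s people is
# picked with probability proportional to s, so each size is weighted by itself.
mean_per_person = sum(s * s for s in company_sizes) / total_people

# Share of companies that are "big" versus share of people working in them
# (the threshold of 10 is an arbitrary assumption for this toy example).
share_big_companies = sum(1 for s in company_sizes if s >= 10) / len(company_sizes)
share_people_in_big = sum(s for s in company_sizes if s >= 10) / total_people

print(mean_per_company, mean_per_person)          # 4.0 13.0
print(share_big_companies, share_people_in_big)   # 0.2 0.8
```

Here only one company in five is big, yet 80% of the people work in it, so a survey of people says little about the number of big companies.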

In [None]:
temp=mydf.groupby(['Q3','Q6']).count()['Q1'].unstack()

temp['50-999']=temp['50-249 employees']+temp['250-999 employees']
emp_df=temp[['0-49 employees','50-999','1000-9,999 employees','> 10,000 employees']]
emp_df.columns=['0-49','50-1k','1k-10k','10k+']

In [None]:
pie_plotter(emp_df,'Number of Employees Decomposition')

### Salaries

This section shows, I believe, the most significant difference between these countries. Almost half of Israeli data scientists earn more than 100,000\$ per year; the same is true for only 11 percent of Saudis, and there is not even a single Iranian making more than 100,000\$ per year! More surprisingly, more than half of the Iranians earn less than 1,000\$ per year. Since most of them are young, it might be that they are doing internships... But something really important here is the recent currency devaluation in Iran. Indeed, converted to dollars, people's salaries have been divided by 3 in a short period. But even after multiplying back by 3, salaries there might be similar to those in Saudi Arabia yet still far from those in Israel.

In [None]:
s1=['$0-999']
s2=['1,000-1,999','2,000-2,999','3,000-3,999','4,000-4,999','5,000-7,499','7,500-9,999','10,000-14,999','15,000-19,999','20,000-24,999','25,000-29,999']
s3=['30,000-39,999','40,000-49,999','50,000-59,999','60,000-69,999','70,000-79,999','80,000-89,999', '90,000-99,999']
s4=['100,000-124,999','125,000-149,999','150,000-199,999','200,000-249,999','> $500,000']

S=[s1,s2,s3,s4]
S_n=['s1','s2','s3','s4']

temp=mydf.groupby(['Q3','Q10']).count()['Q1'].unstack()

for i,s in enumerate(S):
    temp[S_n[i]]=temp[s].sum(axis=1)
    
sal_df=temp[S_n]
sal_df.columns=['1k-','1k-30k','30k-100k','100k+']

In [None]:
pie_plotter(sal_df,'Salary Decomposition')

<br/><br/>

## Multi-choice Questions
----------------

Let's also take a look at multiple-choice questions, including two general questions 'source of learning' and 'favorite programming languages' followed by two more technical questions 'favorite ML algorithm' and 'favorite database product'.

Note that for the general questions, I used two bar plots: one to compare the number of participants, and the other to compare the percentages. I mainly did this because Saudi Arabia had about half as many participants as the others, so I normalized the counts to compare the breakdowns. Also note that since the participants were allowed to choose multiple options, these percentages don't sum to 100. For the technical questions, I omitted the percentages to avoid crowding the plots. You can easily bring them back by calling the function 'multi_plotter'. You may also use the last argument of this function to rotate the labels if they overlap.

Finally, I would like to mention that there is a clever way to preprocess this data, as the answers are spread across different columns. You can find the labels by grabbing the unique value in each column, then apply the not-null function to the whole table and sum each column grouped by country. If you don't mind the styling, calling the built-in pandas plotting gives you a fairly good visualization of the data... But to make it more pleasing, I used another trick to be able to use Seaborn's barplot.
<br/><br/>
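
To see the preprocessing trick in isolation, here is a minimal sketch on a toy table (the rows, country codes, and option texts are made up; only the `Q13_Part_N` column layout mimics the survey file). It also shows why the percentages need not sum to 100.

```python
import pandas as pd

# Toy responses, NOT survey data. A cell holds the option text if the
# participant ticked it and is empty otherwise.
toy = pd.DataFrame({
    "Q3": ["A", "A", "B"],
    "Q13_Part_1": ["Coursera", None, "Coursera"],
    "Q13_Part_2": ["Udemy", "Udemy", None],
})

parts = toy[[c for c in toy.columns if "Q13_Part" in c]]
# The only non-null value in each column is that option's label.
labels = [parts[c].dropna().unique()[0] for c in parts.columns]

# True where an option was ticked; summing per country counts the ticks.
selected = parts.notnull()
selected.columns = labels
counts = selected.groupby(toy["Q3"]).sum()

# Share of each country's participants who ticked each option.
percent = counts.div(toy["Q3"].value_counts(), axis=0) * 100
print(counts)
print(percent)
```

In country "A", one of two participants ticked Coursera and both ticked Udemy, so the row reads 50% and 100%, which adds up to 150%.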
### Source of Learning

I have sorted the categories by the total number of participants across these countries. As you can see, Coursera is the most popular one; in particular, it dominates the other sources in Israel, while in Saudi Arabia, for example, Udemy and Udacity seem more popular.

In [None]:
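def preprocess(question):
    # Each option of a multi-select question lives in its own '<question>_Part_N' column
    cols=[col for col in mydf.columns if ('{}_Part'.format(question) in col)]
    temp=mydf[cols]
    # The most frequent (and only) non-null value in each column is the option's label
    labels=temp.describe().loc['top']
    notnull=temp.notnull()
    notnull.columns=labels
    notnull['Country']=mydf['Q3']
    return notnull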
    

def multi_handler(question):
    notnull=preprocess(question)
    return notnull.groupby('Country').sum().transpose()

#  If you don't mind about styling, you may summarize the code section below in just one line!
# _=multi_handler('Q13').plot(kind='barh',color=palette)

In [None]:
def percent_handler(table):
    for c in Country_Names:
        table[c]=100*table[c]/n_participants[c]
    return table
        


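def bar_handler(table,is_percentage=0):
    # Work on a copy and sum only the country columns, so a second call on the same table still sorts correctly
    table=table.copy()
    table['sum']=table[Country_Names].sum(axis=1)
    table=table.reset_index().sort_values('sum',ascending=False)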
    n=len(table)

    cats=list(table['top'])*3
    hue=['Iran']*n+['Israel']*n+['Saudi Arabia']*n
    vals=list(table['Iran'])+list(table['Israel'])+list(table['Saudi Arabia'])
    
    last_label='Percentage of Participants' if is_percentage else 'Number of Participants'
    bardata=pd.DataFrame(zip(cats,hue,vals),columns=['Categories','hue',last_label])
    bardata['Categories']=bardata['Categories'].apply(lambda x: x.split('(')[0])

    return bardata

def bar_plotter(ax,data):
    sns.barplot(ax=ax,data=data,y='Categories',x=data.columns[-1],hue='hue',palette=palette)
    
def multi_plotter(question,title,rotation=False):
    multi_output=multi_handler(question)
    
    _,ax=plt.subplots(1,2,figsize = (16, 5), dpi=300)
    
    data=bar_handler(multi_output)
    bar_plotter(ax[0],data)
    
    data=bar_handler(percent_handler(multi_output),1)
    bar_plotter(ax[1],data)
    ax[0].set_ylabel('')
    ax[1].set_ylabel('')
    ax[0].legend().set_visible(False)
    if rotation:
        
        ax[0].set_yticklabels(ax[0].get_yticklabels(),rotation=15)
        ax[1].set_yticklabels(ax[1].get_yticklabels(),rotation=15)
    
    plt.suptitle(title)
    #plt.subplots_adjust(wspace=wspace)
    sns.despine(top=True, right=True, left=False, bottom=False)
    
    
def single_plotter(question,title):
    data=bar_handler(multi_handler(question))
    plt.figure(figsize=(14,8))
    bar_plotter(None,data)
    plt.title(title)
    plt.ylabel('')
    sns.despine(top=True, right=True, left=False, bottom=False)

In [None]:
multi_plotter('Q13','Source of Learning')

### Programming Language

Python heavily dominates the other languages, as might be expected. SQL, R, and MATLAB come next, in that order. It is worth noticing that R, MATLAB, C++, and especially Java are more popular in Iran and Saudi Arabia than in Israel. One reason might be the number of data scientists in these countries who come from other backgrounds, such as statistics, engineering, and software development.

Also note that Bash is meaningfully less popular in Iran than in Israel and Saudi Arabia, which makes sense given our previous observation about the size of the companies that participants work in.

In [None]:
multi_plotter('Q18','Favorite Programming Language')

### Machine Learning

As you can see, simple algorithms like Linear/Logistic Regression are the most popular. But again, take into account that this question was multiple-choice and probably most of the companies are using simple algorithms somewhere somehow.

The plot below also shows different tastes in ML across these countries. Neural networks, including CNNs, DNNs, and RNNs, are more popular in Iran, while boosting algorithms and random forests are more favored in Israel.

In [None]:
single_plotter('Q24','Favorite Machine Learning Algorithm')

### Relational Database

As you can see, software like Oracle and Microsoft SQL Server is more popular in Saudi Arabia, while open-source software is more common in Israel. Also notice that Iran has no participant using Azure or AWS. Besides the fact that Iranian data scientists work in smaller companies, another reason may be the banking sanctions against them, which make it hard to use these systems.

In [None]:
single_plotter('Q34','Favorite Database Product')

Another notable fact in this plot is the peak of 'None' for Israel. It seems that the Israeli participants were more careful in filling out the form, while Iranians and Saudis simply skipped the question. Inspired by this observation, let's do some fun plotting before finishing this notebook.

<br/><br/>

### Number of Selected Databases 

Each participant had several options to choose from, and could even skip the question without choosing any. So, let's see how many choices everyone made!
As you can see, almost 80% of Iranians just skipped the question without choosing 'None'. This number is 60% for Saudi Arabia and 50% for Israel.
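
The difference between skipping a question and explicitly choosing 'None' is easy to see on a toy table. The sketch below uses made-up rows and hypothetical column names (only the `Q34_Part_N` pattern follows the survey layout): counting the non-null cells per row gives the number of options each participant ticked, and a count of zero means the question was skipped.

```python
import pandas as pd

# Toy responses, NOT survey data; one row per participant.
toy = pd.DataFrame({
    "Q34_Part_1": ["MySQL", None, None, "MySQL"],
    "Q34_Part_2": ["SQLite", None, None, None],
    "Q34_Part_3": [None, "None", None, None],  # the explicit 'None' option
})

# Number of options ticked by each participant (0 means the question was skipped).
n_selected = toy.notnull().sum(axis=1)

# How many participants ticked 0, 1, 2, ... options.
distribution = n_selected.value_counts().sort_index()
share_skipped = (n_selected == 0).mean()

print(distribution)
print(share_skipped)  # 0.25
```

The second participant ticked 'None' and therefore counts as one selection, while the third left everything empty and is the only one counted as having skipped.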



In [None]:
def num_of_chioce(question,title):

    notnull= preprocess(question)
    plt.figure(figsize=(16,4))
    for i,country_name in enumerate(Country_Names):
        plt.suptitle(title)
        plt.subplot(1,3,i+1)
        data=notnull[notnull['Country']==country_name].drop('Country',axis=1).sum(axis=1)
        ax=sns.countplot(x=data,color=palette[i])    
        ax.set_xlabel(country_name)
        ax.set_ylabel('')
        sns.despine(top=True, right=True, left=False, bottom=False)

In [None]:
num_of_chioce('Q34','Number of DB, Selected by each Participants')

### Time Spent

Given that Iranians tended to skip question 34, let's plot the time each participant spent on the survey. As there were some outliers, I only kept those who finished the survey within 1 hour. Plotting the distributions, you can see that the red curve has an earlier peak and a shorter tail, while the green one has a longer tail and the latest peak.

As you can see, most of the participants finished in less than half an hour, and almost half of them spent at most 10 minutes on the survey.

In [None]:
plt.figure(figsize=(15,6))
for i,country in enumerate([Iran,Israel,Saudi]):

    time=country['Time from Start to Finish (seconds)'].apply(int)
    time=time[time<3600]
    sns.kdeplot(time,color=colors[i][2]) # same density curve as the deprecated distplot(hist=False)

_=plt.legend(Country_Names)

### Venn Diagram

A Venn diagram is always fun when you are dealing with multiple-choice questions. So, why not? :)

Even though it is possible to take the idea further and, for example, define some kind of correlation between different products in each country, I skipped it as it wasn't that essential.
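
For readers who want to take the idea further, here is a minimal sketch on hypothetical participant ids (the sets are made up). It computes the three region sizes that `venn2` expects in its `subsets` argument, namely (only first, only second, both), and the Jaccard index as one simple, assumed choice of overlap measure between two products.

```python
# Hypothetical participant ids per product, for illustration only.
mysql_users = {1, 2, 3, 4}
mssql_users = {3, 4, 5}

# Region sizes of a two-set Venn diagram: (only first, only second, both).
only_mysql = len(mysql_users - mssql_users)
only_mssql = len(mssql_users - mysql_users)
both = len(mysql_users & mssql_users)
venn_subsets = (only_mysql, only_mssql, both)

# Jaccard index: shared users divided by users of either product.
jaccard = both / len(mysql_users | mssql_users)

print(venn_subsets)  # (2, 1, 2)
print(jaccard)       # 0.4
```

A Jaccard index of 0 means no participant uses both products and 1 means the two user groups are identical, so comparing it across countries would be one way to quantify what the Venn diagrams show.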

In [None]:
def venn_data_creator(question,cat1,cat2):
    # One boolean column per option, plus a flag for participants who picked both
    notnull=preprocess(question)
    notnull['Both']=notnull[cat1]&notnull[cat2]
    return notnull[['Country',cat1,cat2,'Both']]
    
def venn_plotter(question,cat1,cat2):
    plt.figure(figsize=(16,8))
    venn_data=venn_data_creator(question,cat1,cat2)
    for i,country_name in enumerate(Country_Names):
        data=venn_data[venn_data['Country']==country_name][[cat1,cat2,'Both']].sum(axis=0)
        # venn2 expects (only cat1, only cat2, both), not the raw set sizes
        both=int(data['Both'])
        values=(int(data[cat1])-both,int(data[cat2])-both,both)
        ax=plt.subplot(1,3,i+1)
        ax.set_title(country_name)
        venn2(subsets = values,set_labels=(cat1,cat2),set_colors=['#ff6f00','#e417ff'])
        venn2_circles(subsets = values, linewidth=1,color=palette[i])


In [None]:
venn_plotter(question='Q34',cat1='MySQL',cat2='Microsoft SQL Server')

## Summary

Let's be as brief as possible and sum everything up in a table.

|🇮🇷|🇮🇱|🇸🇦|
|:----:|:-----:|:-----:|
| Lowest Salaries |Highest Salaries| |
|Highest Masters|Highest PhDs|Highest Bachelors|
| Youngest + Highest women rate| Oldest + Lowest women rate| |
|Smallest Companies||Biggest Companies|
||Coursera|Udemy/Udacity|
|Neural Networks|Random Forest/Boosting||
||Open Source DB|Closed Source DB|



--------------------------

## DIY

Finally, I would like to mention that I tried to encapsulate the functions for multi-choice questions, so that you can easily reuse them to draw the same plots for other columns. Here is a list of the functions and some questions to try.


|Function|Arguments|
|:--------:|:--------:|
|multi_plotter|(Question,Title)|
|single_plotter|(Question,Title)|
|num_of_chioce|(Question,Title)|
|venn_plotter|(Question,Category1,Category2)|

And here are some question examples:

|Label|Question|
|:--------:|:--------:|
|Q14|Favorite Primary Tool?|
|Q16|Favorite IDE?|
|Q17|Hosted Notebook Products?|
|Q20|Data Visualization Libraries?|
|Q27|NLP methods?|
|Q28|Machine Learning Frameworks?|
|Q29|Cloud Computing Platform?|
|Q31|Big Data Product?|

<br/><br/>

*Hope you enjoyed the notebook. Feel free to share your comments and insights with me.*

<b> Stay Curious! :)</b>
