## *Exploratory Analysis of the 2019 Kaggle ML and DS Survey* 
# Helping Small Charities Thrive
<img style="height: 320px;" align="left" src="https://www.publicdomainpictures.net/pictures/290000/velka/charity-donation.jpg">

The 2019 Kaggle Machine Learning and Data Science survey data has been published, and now we have another opportunity for great insights with the most comprehensive dataset available on the state of Machine Learning and Data Science.  Let's do some good!

It's fitting that the deadline for this challenge comes on the eve of [GivingTuesday](https://www.givingtuesday.org/), a day created during the busy holiday season to focus on giving back. Charities use this day to remind people not to forget the causes that need their money, time and attention. The people who start or work at these charities run the gamut from full-time professionals devoted to fundraising to individuals who simply have a passionate interest in helping a cause that's dear to them. Their knowledge of how to raise money varies too, in terms of experience, tools and resources. Perhaps there is no greater area where machine learning and data science can be employed to help so many achieve so much for the good of others.

# Challenges for Charities

2018 was a tough year for many charities. According to an [article](https://www.philanthropy.com/article/Gifts-to-Charity-Dropped-17/246511) published earlier this year by The Chronicle of Philanthropy, the [Giving USA Foundation](https://givingusa.org/) estimates a total decline in 2018 U.S. giving of over 7 billion dollars.  Individual donors, the largest group of U.S. givers, alone account for an estimated decline of over 10 billion dollars:  

In [None]:
import pandas as pd 
import seaborn as sns
sns.set()
giving_data = [ # from https://www.philanthropy.com/article/Gifts-to-Charity-Dropped-17/246511
    [ 2018, 292_090 ],
    [ 2017, 302_510 ],
    [ 2016, 292_300 ],
    [ 2015, 280_430 ],
    [ 2014, 267_560 ],
    [ 2013, 261_320 ],
    [ 2012, 267_280 ],
    [ 2011, 238_790 ],
    [ 2010, 239_520 ],
    [ 2009, 235_000 ],
    [ 2008, 249_310 ],
    [ 2007, 282_240 ]
]
giving_df = pd.DataFrame(giving_data, columns=['Year','Individual Giving'])
giving_df = giving_df[::-1]  # reverse data 
ax = giving_df.plot.barh(x=0,y=1,rot=0,figsize=(12,6),legend=False, width=.8)
ax.set_title('Individual Giving for U.S. Charities', fontsize=18)
for p in ax.patches:
    # cast to int so the ',d' format also works when matplotlib reports the bar width as a float
    ax.annotate('$'+format(int(p.get_width())*1_000_000,',d'), (p.get_width() - 64_000, p.get_y()+.25), color='white', weight='bold')  
ax.axes.get_xaxis().set_ticklabels([])
_ = 0 # be quiet, matplotlib

For individual giving, the reported decline of over 10 billion dollars in 2018 is the biggest since the back-to-back declines of over 32 billion dollars in 2008 and over 14 billion dollars in 2009.  The financial crisis of 2007-2008 and the subsequent Great Recession are likely factors in those earlier declines.  A 2018 decline in giving might be a sign that the economy is heading towards a recession, but there are other reasons to consider.    
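
As a quick check, the year-over-year changes can be derived from the same figures used in the chart above. This is only a minimal sketch; the names `giving`, `yoy_change` and `declines` are introduced here for illustration.

```python
import pandas as pd

# Individual giving in millions of dollars, copied from the chart data above.
giving = pd.Series(
    {2007: 282_240, 2008: 249_310, 2009: 235_000, 2010: 239_520,
     2011: 238_790, 2012: 267_280, 2013: 261_320, 2014: 267_560,
     2015: 280_430, 2016: 292_300, 2017: 302_510, 2018: 292_090},
    name='Individual Giving',
)

# Change relative to the previous year; negative values are declines.
yoy_change = giving.diff().dropna().astype(int)
declines = yoy_change[yoy_change < 0]
print(declines)
```

Sorting `declines` would rank the years by the size of their drop.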

One reason giving may be down is the 2017 tax changes. According to [Forbes magazine](https://www.forbes.com/sites/nextavenue/2019/06/18/charitable-giving-took-a-hit-due-to-tax-reform/#3184c5b6f6ff): 
> The chief reason: the doubling of the standard deduction (to $24,000 for married couples filing jointly) would mean it wouldn’t pay for many to itemize on their 2018 tax returns, and without a charitable contribution deduction, they’d be less inclined to give. New data shows the fear seems to have been proven out. Typically, individual giving will track with growth in GPD [Gross Domestic Product], which was pretty robust in 2018, and with growth in household income, which was also strong,’ said Rick Dunham, chair of the Giving USA Foundation. The tax changes may be why individual giving was flat overall.’

Charities may also feel the effects of a bad stock market, as they did in 2018:

<img align="left" style="margin-left: -20px; height: 300px;" 
src="https://ei.marketwatch.com/Multimedia/2018/10/24/Photos/ZH/MW-GS378_stock__20181024171646_ZH.jpg?uuid=1ceb36f0-d7d2-11e8-97f7-ac162d7bc1f7">

Also, a growing number of charities are competing for scarce resources. According to [Bloomerang](https://bloomerang.co/blog/should-hundreds-of-new-nonprofits-be-created-each-year/):
> The number of 501c3 organizations grew by 29.7% from 2003 through 2013.

They listed other issues facing nonprofits. 

- Most new nonprofits overlap services and missions with existing orgs who are better equipped to properly deliver
- The majority of new nonprofits never even reach $100,000 in annual revenue and may be unable to truly perform the mission intended
- Underfunded nonprofits may do as much harm as good
- Proper staffing is hard for brand new nonprofits with tiny budgets
- Founders may not be equipped to lead if successful
- Volunteers, who are often critical, may be spread too thin

They stated the odds of survival after 5 years for small nonprofits are below 2-3%.  This means all those people who take the time, effort and risk to help a cause they're passionate about won’t be able to do so.

If that wasn't challenging enough, [recent reductions](https://patimes.org/bracing-for-government-funding-cuts-a-call-to-action-for-nonprofit-leaders/) in government funding have placed further demands on charities.  

> when experiencing uncertainty or declining revenues, state and local government agencies that provide financial support to nonprofits may significantly reduce funding, in turn requiring nonprofits take on more responsibilities with fewer resources.

The collective challenges are significant. Charitable giving in the U.S. is on the decline, and charities are suffering as a result.

# How Can Machine Learning & Data Science Help?

With these challenges, charities need new ways to improve their fundraising results.  Here's a look at some of the opportunities for improvement and where it might make sense to apply Machine Learning and Data Science:

### Donor Retention 
The [Fundraising Effectiveness Project](http://afpfep.org/blog/fundraising-effectiveness-project-quarterly-fundraising-report-for-q4-2018/) has been collecting data on donor retention since 2006, and they have prioritized donor retention as a primary measure to improve fundraising effectiveness. Unfortunately, the situation has not improved greatly over the years: overall YTD Donor Retention at the end of 2018 fell 6.3% to 44.5%. Smaller charities are more likely to lose donors than larger ones because they are not equipped with the resources to keep them.

<img align="left" src="http://afpfep.org/wp-content/uploads/2019/02/fepq42018.png">


We have some opportunities to make it better with data science. 

- An automatic monthly giving program is perhaps the most effective way to increase donor retention, yet many charities have still not put one into practice.  The capability is easy to acquire, but signing up donors takes time.  In this case, a supervised machine learning algorithm could be used to identify the donors that are the most likely to sign up for an automated monthly giving program so we can streamline their adoption.  
- Supervised machine learning could also be used to model the probability that a donor will lapse, so a charity can prioritize their follow-ups and minimize losses. 
- A model for determining the next best action to take with a donor can be trained from a history of other donor actions with their results.  The lack of action history and their results might make this harder.  A charity should consider the use of a donor management system so that contact history and results can be captured for training.    
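
The lapse-probability idea from the list above could be sketched as follows. This is an illustration under loud assumptions: the donor data is synthetic, the two features (`months_since_last_gift`, `gifts_last_two_years`) are hypothetical, and logistic regression from scikit-learn is just one reasonable choice of supervised model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, made-up donor history for illustration only (not survey or charity data).
rng = np.random.default_rng(0)
n_donors = 500
months_since_last_gift = rng.uniform(0, 36, n_donors)
gifts_last_two_years = rng.integers(1, 12, n_donors)

# Assumed ground truth for the toy data: long gaps raise the odds of lapsing,
# frequent gifts lower them.
logit = 0.15 * months_since_last_gift - 0.4 * gifts_last_two_years
lapsed = (rng.uniform(size=n_donors) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([months_since_last_gift, gifts_last_two_years])
model = LogisticRegression(max_iter=1000).fit(X, lapsed)

# Score two hypothetical donors: [months since last gift, gifts in last two years].
candidates = np.array([[30.0, 1.0], [2.0, 10.0]])
lapse_probability = model.predict_proba(candidates)[:, 1]
print(lapse_probability)
```

A charity could sort its donors by `lapse_probability` and follow up with the highest-risk ones first.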

### Donor Acquisition
According to [DonorSearch](https://www.donorsearch.net/), one major gift can be the difference between a charity meeting its goal and that same organization coming up short. Successful charities understand this fact and know the significance of running a top-tier program. As with Donor Retention, we have [some opportunities](https://www.donorsearch.net/artificial-intelligence-for-nonprofits/) to use Machine Learning and Data Science to make it better. Here are some others:

- Supervised learning could be used to model the lifetime value of acquired donors, so that candidate donors can be prioritized based on their value. This is even more powerful if the model can be trained with donors from multiple charities.  
- Unsupervised machine learning could be used to cluster segments of candidates based on their similarity to each other, so a charity can strategize and make effective solicitation campaigns.   
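
The clustering idea from the list above might look like the sketch below. Everything here is an assumption made for illustration: the candidate features are invented, the data is randomly generated, and k-means with two clusters is only one possible unsupervised method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up candidate features for illustration only:
# [estimated annual income in $1000s, past gifts to similar causes].
rng = np.random.default_rng(1)
occasional = np.column_stack([rng.normal(50, 5, 40), rng.normal(1, 0.5, 40)])
committed = np.column_stack([rng.normal(120, 10, 40), rng.normal(8, 1, 40)])
candidates = np.vstack([occasional, committed])

# Standardize so that both features contribute on a comparable scale.
scaled = StandardScaler().fit_transform(candidates)

# Group the candidates into two segments.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(np.bincount(segments))
```

Each resulting segment could then receive its own solicitation campaign.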

### Gift Optimization
Charities could improve their growth in giving by applying Machine Learning and Data Science to optimize when, and how much, to ask of each donor.

- Supervised machine learning could be used to train a model with past ask amounts and their results, so that we can predict ask amounts that maximize contributions and minimize losses. This is especially effective with a monthly giving program.  
- A model for determining the next best action to take could also work toward increasing gift amounts. 
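
Once a model predicts how likely a donor is to accept each ask amount, the final selection step is simple arithmetic. The sketch below assumes such predictions already exist; the probabilities in `predicted_acceptance` are invented for illustration and stand in for a trained model's output.

```python
# Hypothetical model output: probability that a donor accepts each ask amount (in dollars).
# These numbers are made up for illustration; a trained model would supply them.
predicted_acceptance = {25: 0.60, 50: 0.45, 100: 0.20, 250: 0.05}


def expected_gift(amount, probability):
    """Expected contribution of a single ask: amount times chance of acceptance."""
    return amount * probability


# Pick the ask amount with the highest expected contribution.
best_ask = max(predicted_acceptance, key=lambda a: expected_gift(a, predicted_acceptance[a]))
best_expected = expected_gift(best_ask, predicted_acceptance[best_ask])
print(best_ask, best_expected)
```

With these made-up numbers the middle ask wins: asking for more lowers the acceptance rate faster than it raises the gift.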

So there it is.  We have more than enough opportunities to make good use of Machine Learning and Data Science. Now let's explore how people in organizations are approaching their data science practice.  Specifically, we want to explore and understand the adoption, costs, team composition, roles, practices and tools involved in applying Machine Learning and Data Science at a small organization like a charity. 

# How are small organizations applying Machine Learning and Data Science?
<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/16394/logos/header.png" align="left">

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pandas.plotting as pp
from IPython.display import display, HTML
import os
import warnings
warnings.filterwarnings('always')
def print_files():
    files = [] 
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            files.append(os.path.join(dirname, filename))
    files.sort()
    for file in files:
        print(file)
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 2000)
    pd.set_option('display.float_format', '{:20,.2f}'.format)
    pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated in newer pandas)
    x = x.style.set_properties(**{'text-align': 'left'})
    display(x) # print(x)
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')
    pd.reset_option('display.width')
    pd.reset_option('display.float_format')
    pd.reset_option('display.max_colwidth')
# kaggle ML and DS survey - 2019
k19mr = pd.read_csv("/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv",skiprows=[1])
k19mh = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv',nrows=1)
k19mh = pd.Series(k19mh.transpose()[0]) # questions keyed by column 
k19tr = pd.read_csv("/kaggle/input/kaggle-survey-2019/other_text_responses.csv",skiprows=[1])
k19th = pd.read_csv("/kaggle/input/kaggle-survey-2019/other_text_responses.csv",nrows=1)
k19th = pd.Series(k19th.transpose()[0]) # questions keyed by column 
k19sr = pd.read_csv("/kaggle/input/kaggle-survey-2019/survey_schema.csv",skiprows=[1])
k19sh = pd.read_csv("/kaggle/input/kaggle-survey-2019/survey_schema.csv",nrows=1)
k19sh = pd.Series(k19sh.transpose()[0]) # questions keyed by column 
k19qh = pd.read_csv("/kaggle/input/kaggle-survey-2019/questions_only.csv")
k19qh = pd.Series(k19qh.transpose()[0]) # questions keyed by column 
k19mr_usa = (k19mr['Q3'] == 'United States of America')
k19mr_mid = (k19mr['Q6'] == '50-249 employees')
sp = k19mr[['Q1','Q2','Q3','Q4','Q5','Q6','Q7','Q10','Q23']]
orders = {
    1: None,
    2: None,
    3: None,
    4: ["No formal education past high school", "Professional degree", "Some college/university study without earning a bachelor’s degree", "Bachelor’s degree", "Master’s degree", "Doctoral degree", "I prefer not to answer"],
    5: None,
    6: ["0-49 employees", "50-249 employees", "250-999 employees", "1000-9,999 employees", "> 10,000 employees"],
    7: ["0", "1-2", "3-4", "5-9", "10-14", "15-19", "20+"],
    8: ["No (we do not use ML methods)", "We are exploring ML methods (and may one day put a model into production)", "We use ML methods for generating insights (but do not put working models into production)", "We recently started using ML methods (i.e., models in production for less than 2 years)", "We have well established ML methods (i.e., models in production for more than 2 years)", "I do not know"],
    10: ["$0-999", "1,000-1,999", "2,000-2,999", "3,000-3,999", "4,000-4,999", "5,000-7,499", "7,500-9,999", "10,000-14,999", "15,000-19,999", "20,000-24,999", "25,000-29,999", "30,000-39,999", "40,000-49,999", "50,000-59,999", "60,000-69,999", "70,000-79,999", "80,000-89,999", "90,000-99,999", "100,000-124,999", "125,000-149,999", "150,000-199,999", "200,000-249,999", "250,000-299,999", "300,000-500,000", "> $500,000"],
    11: ["$0 (USD)", "$1-$99", "$100-$999", "$1000-$9,999", "$10,000-$99,999", "> $100,000 ($USD)"],
    15: ["I have never written code", "< 1 years", "1-2 years", "3-5 years", "5-10 years", "10-20 years", "20+ years"],
    19: None,
    22: ["Never", "Once", "2-5 times", "6-24 times", "> 25 times"],
    23: ["< 1 years", "1-2 years", "2-3 years", "3-4 years", "4-5 years", "5-10 years", "10-15 years", "20+ years"]    
}

In [None]:
### Nathan starts here 
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from IPython.display import display

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
%matplotlib inline
import seaborn as sns
sns.set()

# Graphics in retina format are more sharp and legible
%config InlineBackend.figure_format = 'retina' 

# Increase the default plot size and set the color scheme
plt.rcParams['figure.figsize'] = 8, 5
plt.rcParams['image.cmap'] = 'viridis'


import plotly.offline as py
import pycountry

py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

from plotly.offline import init_notebook_mode, iplot 
init_notebook_mode(connected=True)

import folium 
from folium import plugins

import re

colors = ["steelblue","dodgerblue","lightskyblue","powderblue","deepskyblue","cyan","darkturquoise","paleturquoise","turquoise"]

#Importing the 2019 Dataset
df_2019 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
df_2019.columns = df_2019.iloc[0]
df_2019=df_2019.drop([0])
pd.options.display.max_columns = None

#Importing the 2018 Dataset
df_2018 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
df_2018.columns = df_2018.iloc[0]
df_2018=df_2018.drop([0])

#Importing the 2017 Dataset
df_2017=pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv',encoding='ISO-8859-1')

#Removing everyone whose completion time rounds to 4 minutes or less
less3 = df_2019[round(df_2019.iloc[:,0].astype(int) / 60) <= 4].index
df_2019 = df_2019.drop(less3, axis=0)

less3 = df_2018[round(df_2018.iloc[:,0].astype(int) / 60) <= 4].index
df_2018 = df_2018.drop(less3, axis=0)
display(df_2017)
df_2017.columns.tolist().index('Tenure')



#Creating a smaller subset of the data
companyInfo17 = df_2017[df_2017['Country'] == 'United States'].iloc[:,[1,54,8,56]]
companyInfo18 = df_2018[df_2018['In which country do you currently reside?'] == 'United States of America'].iloc[:,[4,5,7,127]]
companyInfo17.columns = companyInfo18.columns = ['Country', 'Degree', 'Title','Experience']
USA_2019 = df_2019[df_2019['In which country do you currently reside?'] == 'United States of America']
companyInfo19 = USA_2019.iloc[:,[4,5,6,8,9,10,11,12,13,14,15,16,17,18,20,21,55]].copy()
#companyInfo18 = df_2018.iloc[:,[4,5,7,10,11,12,13,14,15,16,17,18,20,21]].copy()
#Renaming Columns
cols = ['Country', 'Degree', 'Title', 'Size of Company', 'Size of Team', 'Machine Learning Methods', 'Analyze and understand data to influence product or business decisions', 'Build and_or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data','Build prototypes to explore applying machine learning to new areas', 'Build and/or run a machine learning service that operationally improves my product or workflows', 'Experimentation and iteration to improve existing ML models', 'Do research that advances the state of the art of machine learning', 'None', 'Other', 'Compensation', 'Money Spent on Product','Experience']
companyInfo19.columns = cols
med_companyInfo19 = companyInfo19[companyInfo19['Size of Company']== '250-999 employees']

#Help 2017 Titles Match
changeF = ['Software Developer/Software Engineer', 'Scientist/Researcher', 'Researcher']
changeT = ['Software Engineer', 'Research Scientist', 'Research Scientist']
companyInfo17 = companyInfo17.replace(changeF,changeT)

display(companyInfo19)


The [2019 Kaggle ML & DS Survey](https://www.kaggle.com/c/kaggle-survey-2019) provides the most comprehensive dataset available on the state of machine learning and data science, and can help us answer important questions about how organizations are applying Machine Learning and Data Science.  Let's explore a few of them... 


## How big are data science teams?

So how big are the teams working on data science initiatives? And how does it relate to organization size? 

The Kaggle survey included the following questions:

> 6. What is the size of the company where you are employed?
> 7. Approximately how many individuals are responsible for data science workloads at your place of business?

As we might expect, larger companies tend to have larger teams.  And on average, smaller organizations tend to have between 3 and 6 people working in Data Science.   
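
Because the survey answers are ranges rather than exact numbers, an average needs a representative value for each range. The sketch below shows that idea on a few made-up responses. The midpoints match the ones used in the cell that follows, while the `responses` table and its column names are hypothetical.

```python
import pandas as pd

# Representative value per answer bucket (midpoints; the open-ended '20+' is floored at 20,
# so means for groups with many large teams are underestimates).
team_size_midpoints = {'0': 0, '1-2': 1.5, '3-4': 3.5, '5-9': 7, '10-14': 12, '15-19': 17, '20+': 20}

# Made-up responses for illustration only.
responses = pd.DataFrame({
    'company_size': ['0-49 employees', '0-49 employees', '> 10,000 employees', '> 10,000 employees'],
    'team_size': ['1-2', '3-4', '20+', '5-9'],
})

# Convert each bucket to its midpoint, then average within each company size.
responses['team_size_numeric'] = responses['team_size'].map(team_size_midpoints)
mean_team_size = responses.groupby('company_size')['team_size_numeric'].mean()
print(mean_team_size)
```

Any answer missing from the mapping becomes NaN and is skipped by `mean()`.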

In [None]:
##AVG TEAM SIZE BY COMPANY SIZE

# Map each answer bucket to a representative number (bucket midpoint; '20+' is floored at 20).
# Answers that are missing or not listed become NaN and are ignored by mean().
teamSizeMidpoints = {'20+': 20, '15-19': 17, '10-14': 12, '5-9': 7, '3-4': 3.5, '1-2': 1.5, '0': 0}
numericalTeamSizes = companyInfo19['Size of Team'].map(teamSizeMidpoints)

companyInfo19['numericalTeamSizes'] = numericalTeamSizes
meanTmSz=[companyInfo19[companyInfo19['Size of Company'] == companySize].numericalTeamSizes.mean() for companySize in companyInfo19['Size of Company'].unique()]

companySizes = companyInfo19['Size of Company'].unique()
correctOrder = [5, 1, 2, 4, 0]
chartData = np.vstack(([meanTmSz[i] for i in correctOrder],[companySizes[i] for i in correctOrder]))

fig = go.Figure([go.Bar(x=chartData[1,:], y=chartData[0,:], hovertemplate = '<i>Company Size: %{x} </i> <br> Mean Team Size: %{y} <extra></extra>')])
fig.update_xaxes(title_text='Company Size')
fig.update_yaxes(title_text='Mean Data Science Team Size')
fig.update_layout(
    hoverlabel_align = 'right', 
    title = "Data Science Team Size vs. Company Size")

fig.show()

So how are team sizes distributed across different organization sizes?  The following chart shows the breakdown of team sizes in the U.S. within each of the surveyed organization sizes.  A particularly interesting insight from this chart is how common small teams are at larger organizations. As for the smaller organizations, the majority appear to use Data Science teams of up to 4 individuals.  
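
One way to put a number on "the majority" is a row-normalized cross-tabulation. The sketch below uses a handful of made-up responses rather than the survey data. The column names `Q6` and `Q7` mirror the survey questions, and `small_team_share` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up responses for illustration only; in the survey, Q6 is organization size and Q7 is team size.
toy = pd.DataFrame({
    'Q6': ['0-49 employees'] * 4 + ['> 10,000 employees'] * 4,
    'Q7': ['1-2', '1-2', '3-4', '20+', '1-2', '20+', '20+', '5-9'],
})

# normalize='index' turns each row (organization size) into shares that sum to 1.
shares = pd.crosstab(toy['Q6'], toy['Q7'], normalize='index')

# Share of respondents per organization size reporting a team of at most 4 people.
small_team_share = shares.reindex(columns=['0', '1-2', '3-4'], fill_value=0).sum(axis=1)
print(small_team_share)
```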

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
cdata = pd.DataFrame(sp[k19mr_usa])
company_size_category = pd.api.types.CategoricalDtype(categories=orders[6][::-1], ordered=True)
ml_team_size_category = pd.api.types.CategoricalDtype(categories=orders[7], ordered=True)
cdata['Organization Size'] = cdata['Q6'].astype(company_size_category)
cdata['ML Team Size'] = cdata['Q7'].astype(ml_team_size_category)
del cdata['Q6']
del cdata['Q7']
df2 = cdata.groupby(['Organization Size', 'ML Team Size'])['Organization Size'].count().unstack('ML Team Size').fillna(0)
df2.plot(kind='barh', stacked=True, figsize=(14,5));

So now we have a better idea of how big the teams are based on the size of the organization.  Next we'll take a look at the various positions on these teams. 

## Which positions are being staffed for data science?

The obvious answer to this question, and the largest single group of respondents, is Data Scientists.  But how big a part do they actually play?  And what other positions are working with them?  Let's take a look at the survey data from North America. 

Data Scientists appear to make up approximately a third of the positions reported.  Other prominent positions represented are those in Engineer and Analyst positions.  Data Science is famously represented as requiring the convergence of Subject Matter, Programming, and Mathematics expertise.  These requirements correlate to the primary positions of Analysts (Subject Matter), Engineers (Programming), and Scientists (Mathematics).  And while it's more likely that a lot of the smaller teams from 1-4 members don't have a Data Scientist at all, the likelihood of having one appears to increase with team size.    

In [None]:
## Nathan continued
##TITLE BY TEAM SIZE SUNBURST PLOT
#Get titles for each teamsize
titles = companyInfo19.iloc[:,2].dropna().unique()
teamSizes = companyInfo19.iloc[:,4].dropna().unique()
teamCounts = []
i = 0
titleCounts = np.zeros((teamSizes.__len__(), titles.__len__()))
#titlesByTeamSize
for team in teamSizes:
    tempSubset = companyInfo19[companyInfo19['Size of Team'] == team]
    teamCounts.append(tempSubset.iloc[:,4].count())
    tempList = []
    for title in titles:
        tempList.append(tempSubset[tempSubset['Title'] == title].iloc[:,2].count())  
    titleCounts[i,:] = tempList
    i += 1
#Get data in the correct format for a sunburst plot
import plotly.graph_objects as go
centerText = 'North American Companies'
labels1 = np.concatenate(('Team Size: ' + teamSizes, titles, titles, titles, titles, titles, titles, titles), axis=0)
parents1 = np.concatenate((np.repeat(centerText,7),np.repeat("20+",12),np.repeat("1-2",12),\
                            np.repeat("10-14",12),np.repeat("3-4",12),np.repeat("5-9",12),np.repeat("15-19",12),np.repeat("0",12)), axis=0)

values1 = np.concatenate((np.sum(titleCounts,axis=1), titleCounts[0,:], titleCounts[1,:], titleCounts[2,:], titleCounts[3,:], titleCounts[4,:], titleCounts[5,:], titleCounts[6,:]), axis=0)

ids1 = np.concatenate((teamSizes, ['20+' + title for title in titles], ['1-2' + title for title in titles], ['10-14' + title for title in titles],\
                       ['3-4' + title for title in titles], ['5-9' + title for title in titles], ['15-19' + title for title in titles], ['0' + title for title in titles]), axis=0)


sunburst = pd.DataFrame({'Ids': np.insert(ids1,0,centerText),
                        'Labels': np.insert(labels1,0,centerText),
                        'Parents': np.insert(parents1,0,""),
                       'Values': np.insert(values1,0,np.sum(titleCounts))})
# Share of each node within its parent, in percent (the root has no parent, so its share is NaN)
sunburst['Percents'] = Percents = sunburst.Values/sunburst.Parents.map(sunburst.set_index('Ids').Values)*100

#RemoveSlices with zero values
sunburst = sunburst[sunburst['Values']!=0]


#Plot Data
fig =go.Figure(go.Sunburst(
    ids = sunburst.Ids,
    labels = sunburst.Labels,
    parents = sunburst.Parents,
    values = sunburst.Values,
    branchvalues = "total",
    hovertemplate='<b>%{label} </b> <br> Responses: %{value}<br> <extra></extra>',
))
fig.update_layout(margin = dict(t=0, l=0, r=0, b=0))
fig.update_layout(showlegend=True)
fig.show()

When you further filter this chart to only include medium-size companies (250-999 employees), the team size of 3-4 members becomes the most common.  In addition, the Statistician and DBA/Database Engineer positions disappear.

In [None]:
##TITLE BY TEAM SIZE SUNBURST PLOT FOR MEDIUM COMPANIES ONLY
#Get titles for each teamsize
titles = med_companyInfo19.iloc[:,2].dropna().unique()
teamSizes = med_companyInfo19.iloc[:,4].dropna().unique()
teamCounts = []
i = 0
titleCounts = np.zeros((teamSizes.__len__(), titles.__len__()))
#titlesByTeamSize
for team in teamSizes:
    tempSubset = med_companyInfo19[med_companyInfo19['Size of Team'] == team]
    teamCounts.append(tempSubset.iloc[:,4].count())
    tempList = []
    for title in titles:
        tempList.append(tempSubset[tempSubset['Title'] == title].iloc[:,2].count())  
    titleCounts[i,:] = tempList
    i += 1


#Get data in the correct format for a sunburst plot
import plotly.graph_objects as go
centerText = 'Mid-Size<br>North American Companies'
titleNum = len(titles)
teamSizeNum = len(teamSizes)

# Build the hierarchy from the data itself so that labels, parents, values and ids stay aligned
# regardless of the order (or number) of team sizes and titles present in the subset
labels1 = np.concatenate(('Team Size: ' + teamSizes, np.tile(titles, teamSizeNum)), axis=0)

parents1 = np.concatenate((np.repeat(centerText, teamSizeNum), np.repeat(teamSizes, titleNum)), axis=0)

values1 = np.concatenate((np.sum(titleCounts, axis=1), titleCounts.reshape(-1)), axis=0)

ids1 = np.concatenate((teamSizes, np.asarray([[team + title for title in titles] for team in teamSizes]).reshape(-1)), axis=0)


sunburst = pd.DataFrame({'Ids': np.insert(ids1,0,centerText),
                        'Labels': np.insert(labels1,0,centerText),
                        'Parents': np.insert(parents1,0,""),
                       'Values': np.insert(values1,0,np.sum(titleCounts))})
# Share of each node within its parent, in percent (the root has no parent, so its share is NaN)
sunburst['Percents'] = Percents = sunburst.Values/sunburst.Parents.map(sunburst.set_index('Ids').Values)*100

#RemoveSlices with zero values
sunburst = sunburst[sunburst['Values']!=0]


#Plot Data
fig =go.Figure(go.Sunburst(
    ids = sunburst.Ids,
    labels = sunburst.Labels,
    parents = sunburst.Parents,
    values = sunburst.Values,
    branchvalues = "total",
    hovertemplate='<b>%{label} </b> <br> Responses: %{value}<br> <extra></extra>',
))
fig.update_layout(margin = dict(t=0, l=0, r=0, b=0))
fig.update_layout(showlegend=True)
fig.show()

So how has the breakdown of these positions changed over the past few years?  Machine Learning and Data Science are rapidly growing fields.  Let's get some insight into the matter.  

This chart shows the response rates for various data science job positions over the past three years based on the U.S. Kaggle survey responses. To make this chart, title options that had no counterpart in the other years were eliminated. Additionally, the 2017 survey's titles 'Scientist/Researcher' and 'Researcher' were both folded into the 'Research Scientist' category for that year (this may explain the 10% drop in that category from 2017 to 2019).

In general, we see here that "Data Scientist" is by far the most common job title among this demographic.  Notice how both Scientist positions have increased in percent while the Engineers and Analysts were comparatively flat.  
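
The title harmonization and percentage calculation can be sketched on a few made-up responses. The mapping mirrors the replacements applied to the 2017 data earlier in this notebook, while the `titles_2017` sample itself is invented for illustration.

```python
import pandas as pd

# Made-up 2017-style titles for illustration only.
titles_2017 = pd.Series(['Data Scientist', 'Scientist/Researcher', 'Researcher',
                         'Software Developer/Software Engineer', 'Data Scientist'])

# Fold older labels into the categories used by the later surveys.
title_map = {
    'Software Developer/Software Engineer': 'Software Engineer',
    'Scientist/Researcher': 'Research Scientist',
    'Researcher': 'Research Scientist',
}
harmonized = titles_2017.replace(title_map)

# Share of each title among all non-missing responses, in percent.
title_percent = harmonized.value_counts(normalize=True) * 100
print(title_percent)
```

Merging two source categories into one, as done for 'Research Scientist', inflates that category relative to years where only a single label existed, which is worth keeping in mind when reading the trend lines.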

In [None]:
titlesToCount = ['DBA/Database Engineer', 'Statistician', 'Data Scientist', 'Software Engineer', 'Data Analyst', 'Research Scientist', 'Business Analyst']


titleCount19 = [companyInfo19[companyInfo19.Title == title].Title.count()/companyInfo19.Title.count()*100 for title in titlesToCount]
titleCount18 = [companyInfo18[companyInfo18.Title == title].Title.count()/companyInfo18.Title.count()*100 for title in titlesToCount]
titleCount17 = [companyInfo17[companyInfo17.Title == title].Title.count()/companyInfo17.Title.count()*100 for title in titlesToCount]

titleCountByYear = pd.DataFrame([titleCount17,titleCount18, titleCount19], columns = titlesToCount)
titleCountByYear.index = [2017,2018,2019]

    
fig = go.Figure()
for title in titlesToCount:
    fig.add_trace(go.Scatter(x=[2017, 2018, 2019], y=titleCountByYear[title],
                             mode='lines',
                             name=title,
                            ))
fig.update_xaxes(title_text='Survey Year', dtick=1)
fig.update_yaxes(title_text='Response Frequency (Percent)')
fig.update_layout(
    hoverlabel_align = 'right', 
    title = "Title Response Frequency by Year")
   

fig.show()

Now let's take a look at the relationship between organization size and position.  If we assume that a disproportionately high percentage of the smallest companies are consulting firms (which seems like a fair assumption), we can see that software engineers are a much bigger deal in the biggest companies and, conversely, data analysts are not quite as common.

In [None]:
##TITLE BY COMPANY SIZE SUNBURST PLOT
#Get titles for each company size
titles = companyInfo19.iloc[:,2].dropna().unique()
companySizes = companyInfo19['Size of Company'].dropna().unique()

correctOrder = [4,1,2,3,0]
companySizes = [companySizes[i] for i in correctOrder]

teamCounts = []
i = 0
titleCounts = np.zeros((companySizes.__len__(), titles.__len__()))
#titlesByTeamSize
for size in companySizes:
    tempSubset = companyInfo19[companyInfo19['Size of Company'] == size]
    teamCounts.append(tempSubset.iloc[:,3].count())
    tempList = []
    for title in titles:
        tempList.append(tempSubset[tempSubset['Title'] == title].iloc[:,2].count())  
    titleCounts[i,:] = tempList
    i += 1


#Get data in the correct format for a sunburst plot
import plotly.graph_objects as go
centerText = 'North American Companies'
rows = len(companySizes)
cols = len(titles)

labels1 = np.concatenate((companySizes, np.tile(titles,rows)), axis=0)

parents1 = np.concatenate((np.repeat(centerText,rows),np.repeat(companySizes,cols)), axis=0)

values1 = np.concatenate((np.sum(titleCounts,axis=1), np.asarray(titleCounts).reshape(-1)), axis=0)

ids1 = np.concatenate((companySizes, np.asarray([[company + title for title in titles]for company in companySizes]).reshape(-1)), axis=0)



sunburst = pd.DataFrame({'Ids': np.insert(ids1,0,centerText),
                        'Labels': np.insert(labels1,0,centerText),
                        'Parents': np.insert(parents1,0,""),
                       'Values': np.insert(values1,0,np.sum(titleCounts))})
#sunburst['Percents'] = sunburst.Values/[sunburst[sunburst.Ids == parent].Values for parent in sunburst.Parents]*100

#RemoveSlices with zero values
sunburst = sunburst[sunburst['Values']!=0]


#Plot Data
fig =go.Figure(go.Sunburst(
    ids = sunburst.Ids,
    labels = sunburst.Labels,
    parents = sunburst.Parents,
    values = sunburst.Values,
    branchvalues = "total",
    hovertemplate = '<b>%{label}</b><br><br>Responses: %{value}<extra></extra>'
   # hovertemplate =
    #'<b>%{label}</b>'+
    #'<br>Percent: %{hovertext}<br>'+ 
    #'Responses: %{value}<br> <extra></extra>',
    #hovertext = ['{}%'.format(i) for i in sunburst.Percents]
))
# Leave some room at the top so the title is not squeezed out by a zero margin
fig.update_layout(title = 'Role Prevalence by Company Size', margin = dict(t=40, l=0, r=0, b=0))
fig.show()
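
The hover text above reports raw response counts only. If each slice's share of its parent ring is also of interest, one possible approach is to look up every row's parent value by id. The sketch below uses a small made-up frame with the same `Ids`, `Parents` and `Values` columns as `sunburst`; the ids and numbers are illustrative assumptions, not survey results.

```python
import pandas as pd

# Toy hierarchy with the same columns as the sunburst frame (values are made up)
toy = pd.DataFrame({
    "Ids": ["All", "Small", "Large", "SmallAnalyst", "SmallEngineer", "LargeAnalyst"],
    "Parents": ["", "All", "All", "Small", "Small", "Large"],
    "Values": [10, 6, 4, 4, 2, 4],
})

# Look up each row's parent total by id; the root has no parent and gets NaN
parent_values = toy["Parents"].map(toy.set_index("Ids")["Values"])

# Share of the parent slice in percent; the root is treated as 100%
toy["Percents"] = (toy["Values"] / parent_values * 100).fillna(100).round(1)
print(toy)
```

A column built this way could then feed the `hovertext` argument, assuming the ids are unique.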

Now we have a better idea of the positions held by the individuals who practice Data Science.  But what kind of experience do they have?

## How experienced are the individuals?

Data Science is a rapidly growing field.  What kind of real experience do the individuals working in it have?  The following chart shows the average years of experience for respondents at each organization size.  Interestingly, the smallest organizations have the second-highest average.

In [None]:
#Mean years of experience by company size

numericalYearsExperience = []
for years in companyInfo19.Experience:
    if years == 'I have never written code':
        numericalYearsExperience.append(0)
    elif years == '3-5 years':
        numericalYearsExperience.append(4)
    elif years == '< 1 years':
        numericalYearsExperience.append(0.5)
    elif years == '1-2 years':
        numericalYearsExperience.append(1.5)
    elif years == '5-10 years':
        numericalYearsExperience.append(7.5)
    elif years == '10-20 years':
        numericalYearsExperience.append(15)
    elif years == '20+ years':
        numericalYearsExperience.append(20)
    else:
        numericalYearsExperience.append(np.nan)
        
companyInfo19['numericalYearsExperience'] = numericalYearsExperience
meanYearsExperience=[companyInfo19[companyInfo19['Size of Company'] == companySize].numericalYearsExperience.mean() for companySize in companyInfo19['Size of Company'].unique()]

companySizes = companyInfo19['Size of Company'].unique()
correctOrder = [5, 1, 2, 4, 0]
# Keep sizes and means in separate lists: np.vstack would coerce the numeric means to strings
orderedSizes = [companySizes[i] for i in correctOrder]
orderedMeans = [meanYearsExperience[i] for i in correctOrder]

fig = go.Figure([go.Bar(x=orderedSizes, y=orderedMeans, hovertemplate = '<i>Company Size: %{x} </i> <br> Mean Data Science Experience: %{y:.1f} <extra></extra>')])
fig.update_xaxes(title_text='Company Size')
fig.update_yaxes(title_text='Mean Data Science Experience (years)')
fig.update_layout(
    hoverlabel_align = 'right', 
    title = "Data Science Experience vs. Company Size")

fig.show()
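
The if/elif chain above can also be written as a lookup table followed by a group-wise mean. The following is a minimal sketch of that idea, not the code used for the chart: the midpoints are copied from the chain above, while the toy rows and the `Small` and `Large` size labels are placeholders made up for illustration.

```python
import pandas as pd

# Midpoints copied from the if/elif chain above
experience_midpoints = {
    'I have never written code': 0,
    '< 1 years': 0.5,
    '1-2 years': 1.5,
    '3-5 years': 4,
    '5-10 years': 7.5,
    '10-20 years': 15,
    '20+ years': 20,
}

# Made-up responses; 'Small' and 'Large' are placeholder size labels
toy = pd.DataFrame({
    'Size of Company': ['Small', 'Small', 'Large', 'Large', 'Large'],
    'Experience': ['1-2 years', '5-10 years', '3-5 years', '20+ years', None],
})

# Missing or unlisted answers become NaN, like the else branch above
toy['numericalYearsExperience'] = toy['Experience'].map(experience_midpoints)

# mean() skips NaN, so the missing answer does not affect the average
mean_by_size = toy.groupby('Size of Company')['numericalYearsExperience'].mean()
print(mean_by_size)
```

Grouping by the size column also avoids indexing into `unique()` by position, which depends on the order in which the sizes happen to appear in the data.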

So let's take a look at how experience is distributed across company size.  The next chart reveals that half of the respondents have 3 or fewer years of experience!  This is a classic example of the median (~3) being different from the average (~7).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
cdata = pd.DataFrame(sp[k19mr_usa])
company_size_category = pd.api.types.CategoricalDtype(categories=orders[6][::-1], ordered=True)
experience_category = pd.api.types.CategoricalDtype(categories=orders[23], ordered=True)
cdata['Company Size'] = cdata['Q6'].astype(company_size_category)
cdata['Experience'] = cdata['Q23'].astype(experience_category)
del cdata['Q6']
del cdata['Q23']
df2 = cdata.groupby(['Company Size', 'Experience'])['Company Size'].count().unstack('Experience').fillna(0)
df2.plot(kind='barh', stacked=True, figsize=(14,5));
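
To see why the mean and the median can disagree for a right-skewed distribution like this one, here is a small sketch on made-up numbers. The values are illustrative bin midpoints, not survey data: many newcomers and a few veterans.

```python
import pandas as pd

# Made-up experience values in years (bin midpoints), skewed toward newcomers
years = pd.Series([0.5, 0.5, 1.5, 1.5, 1.5, 4, 4, 15, 20, 20])

print("mean:  ", years.mean())    # pulled upward by the few veterans
print("median:", years.median())  # the middle of the sorted responses
```

A handful of long-tenured respondents is enough to lift the mean well above the median, which is why the two summaries tell different stories here.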

Now let's take a look at how experience has changed.  The average data science experience of respondents has stayed roughly the same over the last 3 years. This would suggest that the number of newcomers is enough to offset the inevitable aging of the current population. Unfortunately, the largest experience option in the 2017 survey was 'More than 10 years', so every longer option in the 2018 and 2019 surveys was capped at 10 to keep the data comparable across survey years.

In [None]:
companyInfo19['numericalYearsExperienceCompat'] = companyInfo19.numericalYearsExperience.replace([15,20],10)

numericalYearsExperience = []
for years in companyInfo18.Experience:
    if years in ('I have never written code but I want to learn', 'I have never written code and I do not want to learn'):
        numericalYearsExperience.append(0)
    elif years == '3-5 years':
        numericalYearsExperience.append(4)
    elif years == '< 1 years':
        numericalYearsExperience.append(0.5)
    elif years == '1-2 years':
        numericalYearsExperience.append(1.5)
    elif years == '5-10 years':
        numericalYearsExperience.append(7.5)
    elif years == '10-20 years':
        numericalYearsExperience.append(10)
    elif years in ('20-30 years', '30-40 years', '40+ years'):
        numericalYearsExperience.append(10)
    else:
        numericalYearsExperience.append(np.nan)        
companyInfo18['numericalYearsExperience'] = numericalYearsExperience

numericalYearsExperience = []
for years in companyInfo17.Experience:
    if years == 'I don\'t write code to analyze data':
        numericalYearsExperience.append(0)
    elif years == '3 to 5 years':
        numericalYearsExperience.append(4)
    elif years == 'Less than a year':
        numericalYearsExperience.append(0.5)
    elif years == '1 to 2 years':
        numericalYearsExperience.append(1.5)
    elif years == '6 to 10 years':
        numericalYearsExperience.append(7.5)
    elif years == 'More than 10 years':
        numericalYearsExperience.append(10)
    else:
        numericalYearsExperience.append(np.nan)        
companyInfo17['numericalYearsExperience'] = numericalYearsExperience

    
fig = go.Figure()
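meanExperienceByYear = [
    companyInfo17['numericalYearsExperience'].mean(),
    companyInfo18['numericalYearsExperience'].mean(),
    companyInfo19['numericalYearsExperienceCompat'].mean(),
]
# Name the trace explicitly; `title` was only a leftover loop variable from an earlier cell
fig.add_trace(go.Scatter(x=[2017, 2018, 2019], y=meanExperienceByYear, mode='lines', name='Mean experience'))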
fig.update_xaxes(title_text='Survey Year', dtick=1)
fig.update_yaxes(title_text='Mean Data Science Experience (Years)', range = [0.1, 10])
fig.update_layout(
    hoverlabel_align = 'right', 
    title = "Data Science Experience by Survey Year")
   

fig.show()
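
The capping described above, where every value over 10 years is set to 10, can also be expressed with `Series.clip`. This is a minimal sketch on made-up values, shown as an alternative to the `replace([15,20],10)` call rather than as the code behind the chart.

```python
import pandas as pd

# Made-up numeric experience values, including the 15- and 20-year midpoints
experience_2019 = pd.Series([0.5, 4, 7.5, 15, 20])

# Cap at 10 so the values line up with the 2017 top option, 'More than 10 years'
experience_compat = experience_2019.clip(upper=10)
print(experience_compat.tolist())
```

Unlike replacing specific values, clipping keeps working if another midpoint above 10 is introduced later.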

OK, that gives us a better understanding of the experience of individuals working with Machine Learning and Data Science.  Now let's look at education.

## How educated are the individuals?

In [None]:
#Avg Year of Education BY COMPANY SIZE
education = []
for degree in companyInfo19.Degree:
    if degree == 'No formal education past high school':
        education.append(0)
    elif degree == 'Some college/university study without earning a bachelor’s degree':
        education.append(2)
    elif degree == 'Professional degree':
        education.append(7)
    elif degree == 'Bachelor’s degree':
        education.append(4)
    elif degree == 'Master’s degree':
        education.append(6)
    elif degree == 'Doctoral degree':
        education.append(11)
    else:
        education.append(np.nan)
        
companyInfo19['Education'] = education
meanYearsEducation=[companyInfo19[companyInfo19['Size of Company'] == companySize].Education.mean() for companySize in companyInfo19['Size of Company'].unique()]

companySizes = companyInfo19['Size of Company'].unique()
correctOrder = [5, 1, 2, 4, 0]
# Keep sizes and means in separate lists: np.vstack would coerce the numeric means to strings
orderedSizes = [companySizes[i] for i in correctOrder]
orderedMeans = [meanYearsEducation[i] for i in correctOrder]

fig = go.Figure([go.Bar(x=orderedSizes, y=orderedMeans, hovertemplate = '<i>Company Size: %{x} </i> <br> Mean Years of Education: %{y:.1f} <extra></extra>')])
fig.update_xaxes(title_text='Company Size')
fig.update_yaxes(title_text='Mean Years of Post-Secondary Education')
fig.update_layout(
    hoverlabel_align = 'right', 
    title = "Data Science Employee Education by Company Size")

fig.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
cdata = pd.DataFrame(sp[k19mr_usa])
company_size_category = pd.api.types.CategoricalDtype(categories=orders[6][::-1], ordered=True)
education_category = pd.api.types.CategoricalDtype(categories=orders[4], ordered=True)
cdata['Company Size'] = cdata['Q6'].astype(company_size_category)
cdata['Education'] = cdata['Q4'].astype(education_category)
del cdata['Q6']
del cdata['Q4']
df2 = cdata.groupby(['Company Size', 'Education'])['Company Size'].count().unstack('Education').fillna(0)
df2.plot(kind='barh', stacked=True, figsize=(14,5));

## How much are individuals being paid?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
cdata = pd.DataFrame(sp[k19mr_usa])
company_size_category = pd.api.types.CategoricalDtype(categories=orders[6][::-1], ordered=True)
salary_category = pd.api.types.CategoricalDtype(categories=orders[10], ordered=True)
cdata['Company Size'] = cdata['Q6'].astype(company_size_category)
cdata['Salary'] = cdata['Q10'].astype(salary_category)
del cdata['Q6']
del cdata['Q10']
df2 = cdata.groupby(['Company Size', 'Salary'])['Company Size'].count().unstack('Salary').fillna(0)
df2.plot(kind='barh', stacked=True, figsize=(14,5)).legend(loc='center left', bbox_to_anchor=(1.0, 0.5));
#plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))