# Comparing small and big companies 

In this notebook, I will be using mainly the 2021 Kaggle Data Science and Machine Learning survey dataset **to explore the difference between small and big businesses.**

For this analysis I follow the [OECD Glossary](https://stats.oecd.org/glossary/detail.asp?ID=3123) definition: small businesses have fewer than 50 employees, medium-sized companies have 50-249 employees, and large companies have 250 or more employees.
* 0-49 employees = small company
* 50-249 employees = medium-sized company
* 250 or more employees = large company

The main comparison will be between businesses classed as 'small' and 'big'.

I expect to find that small companies, which would account for the majority of startups, have a higher representation of women and younger staff members.

I expect that employees of small companies will be "Jacks of all trades", needing to master a wider range of skills than equivalent employees in bigger companies with more siloed structures.

In [None]:
# Data processing
import numpy as np
import pandas as pd

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

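# Setting custom style
import matplotlib.font_manager as fm
# Register the font files so that rcParams['font.family'] can resolve them by name
fm.fontManager.addfont('../input/crimsonpro/CrimsonPro-Regular.ttf')
fm.fontManager.addfont('../input/crimsonpro/CrimsonPro-Bold.ttf')
font = fm.FontProperties(fname='../input/crimsonpro/CrimsonPro-Regular.ttf')
bold = fm.FontProperties(fname='../input/crimsonpro/CrimsonPro-Bold.ttf')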

In [None]:
# rcParams
my_colours = ["#14213d", "#fca311", "#ef233c"] # navy, yellow, red
greys = ["#595959", "#7f7f7f","#a5a5a5", "#cccccc", "#f2f2f2"] # Grey dark to light
pinks = ["#fff0f3", "#ffccd5", "#ffb3c1", "#ff8fa3", "#ff758f", "#ff4d6d", "#c9184a", "#800f2f"]

plt.rcParams['figure.dpi'] = 300
plt.rcParams['font.family'] = font.get_name()
plt.rcParams["figure.facecolor"] = "white"
plt.rcParams['axes.facecolor']= "white"

plt.rcParams['axes.linewidth'] = 1.0
plt.rcParams['axes.edgecolor'] = greys[3]
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.xmargin'] = 0.05

plt.rcParams['xtick.color']= greys[3]
plt.rcParams['ytick.color']= greys[3]
plt.rcParams['xtick.labelsize']= 7
plt.rcParams['xtick.labelcolor']= greys[1]
plt.rcParams['ytick.labelsize']= 7
plt.rcParams['ytick.labelcolor']= greys[1]

In [None]:
# Import data
data=pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv", low_memory=False)
data.shape #(25974, 369)

# Previous year surveys
df17=pd.read_csv("../input/kaggle-survey-2017/multipleChoiceResponses.csv", low_memory=False, encoding='latin-1')
df18=pd.read_csv("../input/kaggle-survey-2018/multipleChoiceResponses.csv", low_memory=False, encoding='latin-1')
df19=pd.read_csv("../input/kaggle-survey-2019/multiple_choice_responses.csv", low_memory=False, encoding='latin-1')
df20=pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", low_memory=False, encoding='latin-1')

# Dividing the dataframe by number of employees
dfs = [pd.DataFrame(y) for x, y in data.groupby('Q21', as_index=False)]

# Saving the questions as a separate dictionary
Q = dfs[5].to_dict('index')[0]

# Separate dataframes per company size
small = dfs[0]
small['size'] = 'small'

medium = dfs[4]
medium['size'] = 'medium'

big = pd.concat([dfs[1], dfs[2], dfs[3]])
big['size'] = 'big'

# Combined df
df = pd.concat([small, medium, big])

print('Small', small.shape, 'Medium-sized', medium.shape, 'Big', big.shape, 'Combined', df.shape)
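
The split above relies on the position of each group in the sorted `groupby` output. As a hedged alternative, the sketch below maps each answer label to a size class explicitly. The label strings are assumed to match the survey answers exactly, and `SIZE_BY_LABEL`, `add_size_column` and the toy frame are hypothetical names used only for illustration.

```python
import pandas as pd

# Assumed answer labels for Q21; adjust if the survey strings differ.
SIZE_BY_LABEL = {
    "0-49 employees": "small",
    "50-249 employees": "medium",
    "250-999 employees": "big",
    "1000-9,999 employees": "big",
    "10,000 or more employees": "big",
}

def add_size_column(responses: pd.DataFrame, column: str = "Q21") -> pd.DataFrame:
    """Return a copy with a 'size' column; rows with any other answer are dropped."""
    out = responses.copy()
    out["size"] = out[column].map(SIZE_BY_LABEL)
    return out.dropna(subset=["size"])

# Tiny made-up frame standing in for the survey responses.
toy = pd.DataFrame({"Q21": ["0-49 employees", "250-999 employees", None,
                            "50-249 employees", "10,000 or more employees"]})
sized = add_size_column(toy)
size_counts = sized["size"].value_counts().to_dict()
```

An explicit mapping keeps working if the set of answers or their sort order changes, and it also drops the question row and unanswered rows in one step.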

## Demographics of small versus big companies

In [None]:
# Line plot 
years = ['2017', '2018', '2019', '2020', '2021']
no_of_respondents = [len(df17), len(df18), len(df19), len(df20), len(data)]

# Grouped bar chart
grp_df = pd.DataFrame([['Small', '2019', 4025], ['Big', '2019', 7648], ['Small', '2020', 4208], ['Big', '2020', 5524], ['Small', '2021', 5055], ['Big', '2021', 8629]], columns=['Size', 'Year', 'Count'])

# Pie chart
respndnts = [len(small), len(medium), len(big)]
lab = ["Small", "Medium-sized", "Big"]
wed={"width": 0.4}

# Subplots
fig, axs = plt.subplots(1, 3, figsize=(15, 6))
axs[0].plot(years, no_of_respondents, '-o', linewidth=2, color=greys[2], markersize=5)
axs[0].set_title("Number of Respondents per year", font=font, fontsize=12)
for i,j in zip(years,no_of_respondents):
    axs[0].annotate(str(j),xy=(i,j+200), fontsize=8)

sns.barplot(ax=axs[1], data=grp_df, x='Year', y='Count', hue='Size', palette=[greys[2], greys[4]])
for container in axs[1].containers:
    axs[1].bar_label(container, fontsize=7)
axs[1].set_title("Number of Small versus Big companies", font=font, fontsize=12)
axs[1].tick_params(tick1On=False)
axs[1].axes.yaxis.set_visible(False)
axs[1].set_xlabel(' ')
axs[1].legend(prop=font, fontsize=8)

axs[2].pie(respndnts, labels=lab, autopct='%.f%%', pctdistance=.8, colors=[greys[2], greys[3], greys[4]], wedgeprops=wed, normalize=True)
axs[2].set_title("Proportion of respondents by company size", font=font, fontsize=12)

Responses from employees working at big companies constituted 53% of the 2021 responses that reported a company size.
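
As a quick check of that figure, the shares can be recomputed from the group sizes quoted elsewhere in this notebook (5,055 small, 2,567 medium-sized and 8,629 big). This is a minimal sketch that assumes those counts are the lengths of the three dataframes.

```python
# Respondent counts per size class, as quoted in this notebook.
counts = {"small": 5055, "medium": 2567, "big": 8629}

total = sum(counts.values())
# Share of each size class among respondents who reported a company size.
shares = {size: round(100 * n / total, 1) for size, n in counts.items()}
```

Note that the denominator is the number of respondents who answered the company-size question, not the full survey.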

In [None]:
# Gender (Q2)
all_Q2 = data["Q2"].value_counts().to_frame().drop(index="What is your gender? - Selected Choice")
all_Q2.iloc[2] = all_Q2.iloc[2:].sum()
all_Q2 = all_Q2.iloc[:3].transform(lambda x: x/x.sum()).rename(index={'Prefer not to say':'Other'}).reset_index()

def combine_data(data):
    x = data["Q2"].value_counts().to_frame()
    x.iloc[2] = x.iloc[2:].sum()
    return x.iloc[:3].transform(lambda x: x/x.sum()).rename(index={'Prefer not to say':'Other'}).reset_index()

Q2_s = combine_data(small)
Q2_m = combine_data(medium)
Q2_b = combine_data(big)

# Bar plots
def bar_plot(ax, data, title, x, y, color):
    sns.barplot(ax=ax, data=data, x=x, y=y, color=color)
    ax.set_title(title, fontsize=12, font=font, color=greys[0])
    ax.bar_label(ax.containers[0], fmt='%.2f', fontsize=8, color=greys[0])
    ax.set_xlabel(" ")
    ax.set_ylabel(" ")
    
fig, (ax0, ax1, ax2, ax3) = plt.subplots(1,4, figsize=(16,6))
fig.supxlabel(" ")

bar_plot(ax0, all_Q2, "All Kaggle","index", "Q2",  greys[2])
ax0.set_ylabel('Proportion', font=font, color=greys[1])

bar_plot(ax1, Q2_s, "Small Company","index", "Q2", greys[3])
bar_plot(ax2, Q2_m, "Medium-sized Company", "index", "Q2",greys[3])
bar_plot(ax3, Q2_b, "Big Company","index", "Q2", greys[3])

Contrary to my expectations, the gender disparity is consistent across all sizes of businesses, with female employees accounting for ~18% of the workforce.
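
The proportions in these charts pool the less common answers into a single "Other" bar. A minimal sketch of that idea on made-up answers follows; `top_n_shares` and the toy series are hypothetical and not part of the analysis itself.

```python
import pandas as pd

def top_n_shares(answers: pd.Series, n: int = 2, other_label: str = "Other") -> pd.Series:
    """Share of the n most common answers, with all remaining answers pooled."""
    shares = answers.value_counts(normalize=True)  # sorted from most to least common
    pooled = pd.Series({other_label: shares.iloc[n:].sum()})
    return pd.concat([shares.iloc[:n], pooled])

# Made-up answers, for illustration only.
toy_gender = pd.Series(["Man"] * 8 + ["Woman"] * 2 + ["Nonbinary", "Prefer not to say"])
gender_shares = top_n_shares(toy_gender)
```

Using `value_counts(normalize=True)` before pooling means the three shares always sum to one.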

In [None]:
# Age (Q1)
Q1_s = small["Q1"].value_counts().to_frame().sort_index(ascending=True).reset_index()
Q1_m = medium["Q1"].value_counts().to_frame().sort_index(ascending=True).reset_index()
Q1_b = big["Q1"].value_counts().to_frame().sort_index(ascending=True).reset_index()

# Median age group
Q1_s["cumsum"] = Q1_s["Q1"].cumsum() # (5055+1)/2 Median age group: 25-29
Q1_m["cumsum"] = Q1_m["Q1"].cumsum() #(2567+1)/2 Median age group: 30-34
Q1_b["cumsum"] = Q1_b["Q1"].cumsum() # (8629+1)/2 Median age group: 30-34

def bar_plot_age(ax, data, title, x, y, median, text, x_t, y_t):
    sns.barplot(ax=ax, data=data, x=x, y=y, palette=[my_colours[2] if group == median else greys[2] for group in data[x]])
    ax.set_title(title, fontsize=12, font=font, color=greys[0])
    ax.bar_label(ax.containers[0], fmt='%.f', fontsize=8, color=greys[0])
    ax.set_xlabel(" ")
    ax.set_ylabel(" ")
    ax.xaxis.set_tick_params(labelsize=8)
    ax.yaxis.set_tick_params(labelsize=8)
    ax.text(x_t, y_t, text, font=font, c=my_colours[2])
    
# Bar plots
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(18, 6))

bar_plot_age(ax1, Q1_s, "Small Company", "index", "Q1", '25-29', "Median Age Group: 25-29", 2.8, 800)
bar_plot_age(ax2, Q1_m, "Medium-sized Company", "index", "Q1", '30-34', "Median Age Group: 30-34", 4, 450)
bar_plot_age(ax3, Q1_b, "Big Company", "index", "Q1", '30-34', "Median Age Group: 30-34", 4, 1350)

In [None]:
# Proportion of age
Q1_s['proportion'] = Q1_s["Q1"] / Q1_s["Q1"].sum()*100
Q1_m['proportion'] = Q1_m["Q1"] / Q1_m["Q1"].sum()*100
Q1_b['proportion'] = Q1_b["Q1"] / Q1_b["Q1"].sum()*100

fig = plt.figure(figsize=(16, 8))
plt.plot(Q1_s['index'], Q1_s["proportion"], linewidth=3, color=greys[2])
plt.bar(Q1_s['index'], Q1_s["proportion"], alpha=0.3, color=greys[2])

plt.plot(Q1_b['index'], Q1_b["proportion"], linewidth=3, color=my_colours[2])
plt.bar(Q1_b['index'], Q1_b["proportion"], alpha=0.3, color=my_colours[2])

plt.title("Age distribution of small and big businesses", font=bold, fontsize=17, color=greys[1])
plt.text(1, 20, "Small", font=bold, fontsize=16, color=greys[2])
plt.text(3.5, 18, "Big", font=bold, fontsize=16, color=my_colours[2])
plt.xlabel(' ')
plt.ylabel('Percentage of total workforce', font=bold, fontsize=12, color=greys[2])
plt.show()

Employees from small companies are noticeably younger. The biggest difference can be seen concerning employees aged 18-21 and 22-24, with these age groups accounting for ~13% and ~18% of the total small business workforce, compared with only ~3% and ~11% of big businesses respectively. 
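
The median age group highlighted in the bar charts is the bracket containing respondent number (n+1)/2 once the brackets are sorted. A small sketch of that lookup is shown here; `median_bracket` is a hypothetical helper and the counts are made up.

```python
from itertools import accumulate

def median_bracket(brackets, counts):
    """Return the ordered bracket that contains the middle respondent."""
    middle = (sum(counts) + 1) / 2
    for bracket, running_total in zip(brackets, accumulate(counts)):
        if running_total >= middle:
            return bracket
    raise ValueError("counts must contain at least one respondent")

# Made-up counts, for illustration only.
age_groups = ["18-21", "22-24", "25-29", "30-34", "35-39"]
toy_counts = [10, 20, 30, 25, 15]
toy_median = median_bracket(age_groups, toy_counts)
```

Here the running totals are 10, 30, 60, and the middle position is 50.5, so the third bracket is returned.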

In [None]:
# Level of Education
edu = df.groupby(["size", "Q4"]).size().groupby(level = 0).transform(lambda x: round(x/x.sum(),3)).to_frame("proportion").reset_index()
labels = ['No formal education past high school', "Some college/university study without earning a bachelor’s degree", "Bachelor’s degree", "Master’s degree", "Doctoral degree", "Professional doctorate", "I prefer not to answer"]
x = np.arange(len(labels))
width=0.2

s_edu = edu[edu["size"]=="small"].set_index("Q4").reindex(labels)["proportion"].to_list()
m_edu = edu[edu["size"]=="medium"].set_index("Q4").reindex(labels)["proportion"].to_list()
b_edu = edu[edu["size"]=="big"].set_index("Q4").reindex(labels)["proportion"].to_list()
all_edu = data.groupby(["Q4"]).size().transform(lambda x: round(x/x.sum(),3)).to_frame("proportion").reindex(labels)

fig, ax = plt.subplots(figsize=(16, 8)) 
rects1 = ax.bar(x - width, s_edu, width, label='Small', color=greys[0])
rects2 = ax.bar(x,  m_edu, width, label='Medium-sized', color=greys[1])
rects3 = ax.bar(x + width, b_edu, width, label="Big", color=greys[2])

ax.set_title("Education levels", font=bold, fontsize=17, color=greys[1])
ax.set_xlabel(' ')
ax.set_ylabel('Proportion', font=font, fontsize=15, color=greys[1])
labels[0] = "No formal education"
labels[1] = "Some college study w/o bachelor's"
ax.set_xticks(x)
ax.set_xticklabels(labels, font=font, fontsize=12)
ax.legend(prop=font)
ax.bar_label(rects1, padding=3, fontsize=9, font=font)
ax.bar_label(rects2, padding=3, fontsize=9, font=font)
ax.bar_label(rects3, padding=3, fontsize=9, font=font)

fig.tight_layout()
plt.show()

In [None]:
colors = [greys[4], greys[4], greys[3], "#ef233c", "#d90429",  "#d90429", greys[4]]
explode = [0, 0, 0, .04, .04, .04, 0]
wedgeprops={"edgecolor": greys[0],"width": 0.6}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

ax1.pie(s_edu, autopct='%.f%%', pctdistance=.8, wedgeprops=wedgeprops, explode=explode, colors=colors, normalize=True, textprops={'fontsize': 10})
ax2.pie(b_edu, autopct='%.f%%', pctdistance=.8, wedgeprops=wedgeprops, explode=explode, colors=colors, normalize=True, textprops={'fontsize': 10})

ax1.set_title('Small Companies', font=font, fontsize=14)
ax2.set_title('Big Companies', font=font, fontsize=14)

fig.legend(labels, prop=font, loc="lower right")

The darker the tone, the higher the education.
Education from master's level up is noticeably more prevalent in big companies (65% for big companies versus 51% for small businesses).

In [None]:
# Industries
g=list(reversed(greys))
ind = df.groupby(["Q20"]).size().to_frame('proportion').transform(lambda x: x/x.sum()).reset_index()

fig = px.treemap(ind, path=["Q20"], values="proportion", color="proportion", color_continuous_scale=pinks)
fig.update_layout(margin = dict(t=0, l=0, r=0, b=0))
fig.show()

When looking at all respondents' industry areas, Computers/Technology and Academics/Education dominate the field.

In [None]:
# Industries
Q20=df.groupby(["Q20", "size"]).size().groupby(level = 0).transform(lambda x: round(x/x.sum(), 2)).to_frame("Proportion").unstack()
plt.figure(figsize = (9,9))
ax = sns.heatmap(Q20, cmap=g, annot=True)
plt.xticks([0.5, 1.5, 2.5], ['Big', 'Medium-sized', "Small"], font=font)
plt.title('Proportion of Company Size by Industries', font=bold, fontsize=15)
ax.set_xlabel('Company Size', font=font)
ax.set_ylabel('Industries', font=font)

However, when viewing industries through the lens of big vs small, it becomes apparent that Computers/Technology is more dominant in big businesses than in small ones, while Non-profit/Service takes a much higher proportion in small businesses.
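
The heatmap values are normalised within each industry, so every row sums to one and shows the mix of company sizes in that industry. To ask the complementary question, which industries dominate within a size class, the columns would be normalised instead. The sketch below shows both on made-up data using `pd.crosstab`; the column names are illustrative only.

```python
import pandas as pd

# Made-up respondents, for illustration only.
toy = pd.DataFrame({
    "industry": ["Tech", "Tech", "Tech", "Tech", "Non-profit", "Non-profit"],
    "size": ["big", "big", "big", "small", "small", "small"],
})

# Each industry row sums to 1: the size mix within an industry.
size_mix = pd.crosstab(toy["industry"], toy["size"], normalize="index")

# Each size column sums to 1: the industry mix within a size class.
industry_mix = pd.crosstab(toy["industry"], toy["size"], normalize="columns")
```

Which normalisation is appropriate depends on the claim being made, so it is worth stating it alongside the chart.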

In [None]:
small_ds = small[small['Q5']=="Data Scientist"] # Data Scientists in small companies
big_ds = big[big['Q5']=="Data Scientist"] # Data Scientists in big companies
print(small_ds.shape, big_ds.shape)


salary_order = ['$0-999', '1,000-1,999', '2,000-2,999','3,000-3,999','4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999','15,000-19,999', '20,000-24,999', '25,000-29,999', "30,000-39,999", "40,000-49,999", "50,000-59,999", "60,000-69,999", "70,000-79,999", "80,000-89,999", "90,000-99,999", "100,000-124,999", "125,000-149,999", "150,000-199,999", "200,000-249,999", "250,000-299,999", "300,000-499,999", "$500,000-999,999", ">$1,000,000"]
Q25_s = small_ds.groupby(["Q25"]).size().transform(lambda x: round(x/x.sum()*100, 2)).to_frame('proportion').reindex(salary_order)
Q25_b = big_ds.groupby(["Q25"]).size().transform(lambda x: round(x/x.sum()*100, 2)).to_frame('proportion').reindex(salary_order)

fig = plt.figure(figsize=(16, 8))
plt.plot(Q25_s.index, Q25_s["proportion"], linewidth=3, color=greys[2])
plt.bar(Q25_s.index, Q25_s["proportion"], alpha=0.3, color=greys[2])

plt.plot(Q25_b.index, Q25_b["proportion"], linewidth=3, color=my_colours[2])
plt.bar(Q25_b.index, Q25_b["proportion"], alpha=0.3, color=my_colours[2])

plt.title("Yearly compensation distribution", font=font, fontsize=16, color=greys[2])
plt.text(1, 15, "Small", font=bold, fontsize=16, color=greys[2])
plt.text(6, 8, "Big", font=bold, fontsize=16, color=my_colours[2])
plt.xticks(rotation=90)
plt.xlabel(' ')
plt.ylabel(' ')
plt.show()

In [None]:
salary_order2 = ['1,000-1,999','2,000-2,999','3,000-3,999','4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999','15,000-19,999', '20,000-24,999', '25,000-29,999', "30,000-39,999", "40,000-49,999", "50,000-59,999", "60,000-69,999", "70,000-79,999", "80,000-89,999", "90,000-99,999", "100,000-124,999", "125,000-149,999", "150,000-199,999", "200,000-249,999", "250,000-299,999", "300,000-499,999", "$500,000-999,999", ">$1,000,000"]

Q25_s = small_ds.groupby(["Q25"]).size().to_frame("proportion").iloc[1:, :].transform(lambda x: round(x/x.sum()*100, 2)).reindex(salary_order2)
Q25_b = big_ds.groupby(["Q25"]).size().to_frame("proportion").iloc[1:, :].transform(lambda x: round(x/x.sum()*100, 2)).reindex(salary_order2)

fig = plt.figure(figsize=(16, 8))
plt.plot(Q25_s.index, Q25_s["proportion"], linewidth=3, color=greys[2])
plt.bar(Q25_s.index, Q25_s["proportion"], alpha=0.3, color=greys[2])

plt.plot(Q25_b.index, Q25_b["proportion"], linewidth=3, color=my_colours[2])
plt.bar(Q25_b.index, Q25_b["proportion"], alpha=0.3, color=my_colours[2])

plt.title('Yearly compensation of big and small companies with $0-999 bracket removed', font=font, fontsize=16, color=greys[2])
plt.text(1, 14, "Small", font=bold, fontsize=16, color=greys[2])
plt.text(10, 8, "Big", font=bold, fontsize=16, color=my_colours[2])
plt.xticks(rotation=90)
plt.xlabel(' ')
plt.ylabel(' ')
plt.show()

The most marked difference in salary is that 40% of data scientists at small businesses are paid between 0 and 999 dollars per annum. This may be because small businesses hire substantially more unpaid interns.
The second graph shows the distribution again but with the lowest bracket removed. Employees paid up to the 10,000-14,999 bracket are more commonly found in small businesses, with employees of big businesses being paid higher in all other categories. A notable exception is that more of the most highly paid respondents (\$300,000+) are from small businesses. This may be due to these highly paid respondents owning such businesses.
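
Removing a bracket and recomputing percentages rescales every remaining bracket, which is why the second chart looks different from a simple crop of the first. A minimal sketch with made-up percentages follows; `renormalise_without` is a hypothetical helper.

```python
def renormalise_without(shares, excluded):
    """Drop the excluded brackets and rescale the rest to sum to 100%."""
    kept = {k: v for k, v in shares.items() if k not in excluded}
    total = sum(kept.values())
    return {k: round(100 * v / total, 1) for k, v in kept.items()}

# Made-up percentages, for illustration only.
toy_shares = {"$0-999": 40.0, "1,000-1,999": 15.0, "2,000-2,999": 45.0}
rescaled = renormalise_without(toy_shares, {"$0-999"})
```

In this toy case the two remaining brackets move from 15% and 45% to 25% and 75%, so a group with a large lowest bracket is stretched more than one without.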

## Work environment in small versus big companies

In [None]:
# Jobs of all participants
Q5_cols = [greys[3]]*9 + [my_colours[2],my_colours[2],greys[3],my_colours[2]]
ax = df['Q5'].value_counts().sort_values().plot.barh(color=Q5_cols, figsize=(10,4), width=0.7)

Data scientists, software engineers, and data analysts are the top three most common roles.

In [None]:
# Distribution of respondents
Q5_s = small['Q5'].value_counts().sort_values(ascending=False)
roles_order = ['Data Scientist', 'Software Engineer', 'Data Analyst', 'Machine Learning Engineer', 'Research Scientist', 'Business Analyst', 'Program/Project Manager', 'Data Engineer', 'Statistician', 'Product Manager', 'DBA/Database Engineer', 'Developer Relations/Advocacy', 'Other']
Q5_b = big['Q5'].value_counts().reindex(roles_order)

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 6))
Q5_s.plot.bar(ax=ax1, color=greys[2], width=.8)
Q5_b.plot.bar(ax=ax2, color=greys[2], width=.8)

fig.suptitle('Job distribution', font=font, fontsize=16, color=greys[1])
ax1.set_xlabel('Small Company', font=font, fontsize=15, color=greys[1])
ax2.set_xlabel('Big Company', font=font, fontsize=15, color=greys[1])

Overall, respondents primarily work as data scientists, software engineers, and data analysts, regardless of the size of their employer. Further analysis of what roles are contained in the big business "Other" category would be insightful.

In [None]:
# Preference of Cloud Provider
Q27 = ['Q27_A_Part_1','Q27_A_Part_2','Q27_A_Part_3','Q27_A_Part_4','Q27_A_Part_5','Q27_A_Part_6','Q27_A_Part_7','Q27_A_Part_8','Q27_A_Part_9','Q27_A_Part_10','Q27_A_Part_11','Q27_A_OTHER']
cloud = ["AWS", "Azure", "GCP", "IBM", "Oracle", "SAP", "Salesforce", "VMware", "Alibaba", "Tencent", "None", "Other"]

big_Q27 = big[Q27]
small_Q27 = small[Q27]
big_Q27.columns = cloud
small_Q27.columns = cloud

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 6))
cloud_cols = [my_colours[2]]*3 + [greys[3]]*9

small_Q27.count().plot.bar(ax=ax1, color=cloud_cols, width=.8)
big_Q27.count().plot.bar(ax=ax2, color=cloud_cols, width=.8)

fig.suptitle('Cloud providers', font=font, fontsize=16)
ax1.set_xlabel('Small Company', font=font, fontsize=15, color=greys[1])
ax2.set_xlabel('Big Company', font=font, fontsize=15, color=greys[1])

The "big three" cloud providers are popular in both big and small businesses. Azure's market share is noticeably lagging behind in small businesses.
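
Because the two panels show raw counts and the groups differ in size, one way to make them directly comparable is to convert each option to the share of respondents who selected it. The sketch below assumes the multi-select layout used in the survey (one column per option, missing when not selected) and uses made-up answers.

```python
import pandas as pd

# Made-up multi-select answers: one column per option, None when not selected.
toy = pd.DataFrame({
    "AWS": ["AWS", None, "AWS", "AWS"],
    "Azure": [None, "Azure", None, "Azure"],
    "GCP": ["GCP", None, None, None],
})

# notna().mean() is the fraction of respondents who ticked each option.
provider_share = toy.notna().mean().mul(100).round(1)
```

Since respondents can tick several options, these shares are not expected to sum to 100%.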

In [None]:
idx = ['I have never written code','< 1 years', '1-3 years', '3-5 years',  '5-10 years', '10-20 years', '20+ years']

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 6))

s = small_ds['Q6'].value_counts().reindex(idx).plot.bar(ax=ax1, width=0.8, color=pinks)
b = big_ds['Q6'].value_counts().reindex(idx).plot.bar(ax=ax2, width=0.8, color=pinks)

fig.suptitle('Coding experience', font=font, fontsize=16)
ax1.set_xlabel('Small Company', font=font, fontsize=15, color=greys[1])
ax2.set_xlabel('Big Company', font=font, fontsize=15, color=greys[1])

Consistent with the younger workforce at small businesses, the majority of experienced workers are employed by larger employers, with a particular disparity in the 10-20 year experience bracket.

In [None]:
team_size = ['0', '1-2', '3-4','5-9','10-14', '15-19', '20+']

Q22_small = small_ds['Q22'].value_counts().transform(lambda x: round(x/x.sum()*100, 2)).to_frame('proportion').reindex(team_size)
Q22_big = big_ds['Q22'].value_counts().transform(lambda x: round(x/x.sum()*100, 2)).to_frame('proportion').reindex(team_size)

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 6))
Q22_small.plot.bar(ax=ax1, width=0.8, color=greys[2], legend=None)
Q22_big.plot.bar(ax=ax2, width=0.8, color=greys[2], legend=None)

fig.suptitle('Team size', font=font, fontsize=16)
ax1.set_xlabel('Small Company', font=font, fontsize=15, color=greys[1])
ax2.set_xlabel('Big Company', font=font, fontsize=15, color=greys[1])

As expected, employees of small businesses work alone or in small groups, whereas employees of big companies often work in groups of more than 20.

In [None]:
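# Computing platform used
# Note: value_counts() sorts by frequency, so the labels below assume the same ranking in both groups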
label = ['A laptop', 'A personal computer/desktop', 'A cloud computing platform', 'A deep learning workstation', 'None', 'Other']
c = [greys[4], greys[3], greys[2], greys[1], greys[0], greys[0]]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

ax1.pie(small['Q11'].value_counts(normalize=True), labels=label, autopct='%.f%%', pctdistance=.8, wedgeprops=wedgeprops, colors=c, textprops={'fontsize': 10})
ax2.pie(big['Q11'].value_counts(normalize=True), labels=label, autopct='%.f%%', pctdistance=.8, wedgeprops=wedgeprops, colors=c, textprops={'fontsize': 10})

ax1.set_title('Small Companies', font=font, fontsize=14)
ax2.set_title('Big Companies', font=font, fontsize=14)

fig.suptitle('Computing Platform used', font=font, fontsize=16)
fig.legend(label, prop=font, loc="lower right")

In [None]:
# Languages used by Data Scientists in Small vs Big Companies
Q7 = ['Q7_Part_1','Q7_Part_2','Q7_Part_3','Q7_Part_4','Q7_Part_5','Q7_Part_6','Q7_Part_7','Q7_Part_8','Q7_Part_9','Q7_Part_10','Q7_Part_11','Q7_Part_12','Q7_OTHER']
languages = ['Python', 'R', 'SQL', 'C', 'C++', 'Java', 'JavaScript', 'Julia', 'Swift', 'Bash', 'MATLAB', 'None', 'Other']

big_Q7 = big_ds[Q7].count().transform(lambda x: round(x/x.sum()*100, 2)).to_frame('proportion')
small_Q7 = small_ds[Q7].count().transform(lambda x: round(x/x.sum()*100, 2)).to_frame('proportion')

small_Q7.index, big_Q7.index = languages, languages

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 6))

small_Q7.plot.bar(ax=ax1, color=greys[3], width=.8, legend=None)
big_Q7.plot.bar(ax=ax2, color=greys[3], width=.8, legend=None)

fig.suptitle('Languages', font=font, fontsize=16)
ax1.set_xlabel('Small Company', font=font, fontsize=15, color=greys[1])
ax2.set_xlabel('Big Company', font=font, fontsize=15, color=greys[1])

In [None]:
small_ds.columns.to_list()
Q14 = ['Q14_Part_1','Q14_Part_2','Q14_Part_3','Q14_Part_4','Q14_Part_5','Q14_Part_6','Q14_Part_7','Q14_Part_8','Q14_Part_9','Q14_Part_10','Q14_Part_11','Q14_OTHER']
viz = [small_ds[Q14][i].dropna().unique()[0] for i in small_ds[Q14].columns]

big_Q14, small_Q14 = big_ds[Q14], small_ds[Q14]
big_Q14.columns, small_Q14.columns = viz, viz

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 6))

small_Q14.count().plot.bar(ax=ax1, color=greys[3], width=.8)
big_Q14.count().plot.bar(ax=ax2, color=greys[3], width=.8)

fig.suptitle('Data Visualization Tools', font=font, fontsize=16)
ax1.set_xlabel('Small Company', font=font, fontsize=15, color=greys[1])
ax2.set_xlabel('Big Company', font=font, fontsize=15, color=greys[1])

There are no marked differences in languages or data visualization tools used by employees of small and big businesses.

## Machine Learning at Startups versus Big Companies

In [None]:
Q23 = df.groupby(["size", "Q23"]).size().to_frame("percentage")
# Convert counts to percentages within each company size
Q23 = (100 * Q23 / Q23.groupby(level=0).transform("sum")).round(1).reset_index()

labels = ['I do not know', 'No', 'No<br>Exploring Methods', 'Yes<br>Well<br>established', 'Yes<br>Started<br>recently', 'Yes<br>For insights']
# Six answer options per company size, in the sorted order produced by groupby
x_data = [Q23['percentage'].tolist()[i*6:(i+1)*6] for i in range(3)]
y_data = ['big', 'medium', 'small']
colours = [greys[4], greys[2], greys[2], my_colours[2], my_colours[2], my_colours[2]]

fig = go.Figure()

for i in range(0, len(x_data[0])):
    for xd, yd in zip(x_data, y_data):
        fig.add_trace(go.Bar(
            x=[xd[i]], y=[yd],
            orientation='h',marker=dict(color=colours[i],line=dict(color=greys[4], width=1))))

fig.update_layout(
    xaxis=dict(showgrid=False,showline=False,showticklabels=False,zeroline=False,domain=[0.15, 1]),
    yaxis=dict(showgrid=False,showline=False,showticklabels=False,zeroline=False,),barmode='stack',
    paper_bgcolor="white",plot_bgcolor="white",margin=dict(l=60, r=60, t=140, b=60), showlegend=False)

annotations = []

for yd, xd in zip(y_data, x_data):
    annotations.append(dict(xref='paper', yref='y',x=0.14, y=yd,xanchor='right',text=str(yd),
                            font=dict(family='Open Sans', size=14,color=greys[1]),showarrow=False, align='right'))

    annotations.append(dict(xref='x', yref='y', x=xd[0] / 2, y=yd, text=str(xd[0]) + '%',
                            font=dict(family='Open Sans', size=18, color=greys[1]),showarrow=False))

    if yd == y_data[-1]:
        annotations.append(dict(xref='x', yref='paper',x=xd[0] / 2, y=1.1,text=labels[0],
                                font=dict(family='Open Sans', size=10, color=greys[1]), showarrow=False))
    space = xd[0]
    for i in range(1, len(xd)):
            annotations.append(dict(xref='x', yref='y',x=space + (xd[i]/2), y=yd,text=str(xd[i]) + '%',
                                    font=dict(family='Open Sans', size=18,color='rgb(248, 248, 255)'), showarrow=False))

            if yd == y_data[-1]:
                annotations.append(dict(xref='x', yref='paper',
                                        x=space + (xd[i]/2), y=1.1, text=labels[i],font=dict(family='Open Sans', size=10,
                                                  color='rgb(67, 67, 67)'), showarrow=False))
            space += xd[i]

fig.update_layout(annotations=annotations)
fig.show()

Substantially more big businesses use machine learning methods at work. This could be explained by the lack of senior machine learning personnel to promote the use of these methods in younger small businesses, or that growing businesses do not prioritise these methods.

Reference: [Plotly Documentation](https://plotly.com/python/horizontal-bar-charts/)

In [None]:
# Number of years of using ML methods
yrs = ['I do not use machine learning methods', 'Under 1 year', '1-2 years',  '2-3 years', '3-4 years',  '4-5 years', '5-10 years','10-20 years', '20 or more years']
cm = ["#f8f9fa", "#e9ecef","#dee2e6","#ced4da","#adb5bd","#6c757d","#495057","#343a40","#212529"]

# Reindex so that each count stays attached to its own answer label
Q15_s = small_ds['Q15'].value_counts().reindex(yrs, fill_value=0).to_frame("count")
Q15_b = big_ds['Q15'].value_counts().reindex(yrs, fill_value=0).to_frame("count")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

ax1.pie(Q15_s['count'], labels=yrs, autopct='%.f%%', pctdistance=.8, wedgeprops=wedgeprops, colors=cm, textprops={'fontsize': 10})
ax2.pie(Q15_b['count'], labels=yrs, autopct='%.f%%', pctdistance=.8, wedgeprops=wedgeprops, colors=cm, textprops={'fontsize': 10})

ax1.set_title('Small Companies', font=font, fontsize=14)
ax2.set_title('Big Companies', font=font, fontsize=14)

fig.suptitle('Years of experience using machine learning methods', font=font, fontsize=16)
fig.legend(yrs, prop=font, loc="lower right")


As the pie charts show, 72% of small business employees either do not use machine learning methods or have done so for less than two years. The equivalent percentage for big businesses is 51%.
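
A combined figure like this can be computed directly rather than read off the pie wedges. The sketch below assumes the three low-experience answers are the ones listed in `LOW_EXPERIENCE`; the helper name and the toy answers are hypothetical.

```python
import pandas as pd

# Answers treated as "little or no machine learning experience" (assumed grouping).
LOW_EXPERIENCE = ["I do not use machine learning methods", "Under 1 year", "1-2 years"]

def low_experience_share(answers: pd.Series) -> float:
    """Percentage of non-missing answers that fall in LOW_EXPERIENCE."""
    answered = answers.dropna()
    return round(100 * answered.isin(LOW_EXPERIENCE).mean(), 1)

# Made-up answers, for illustration only.
toy_answers = pd.Series(["Under 1 year", "1-2 years", "5-10 years", None, "2-3 years"])
toy_share = low_experience_share(toy_answers)
```

Applied to the `Q15` column of each group, this would give the two percentages being compared without depending on the order of the pie labels.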

In [None]:
# ML Frameworks and algorithms
def get_unq(df, cols):
    lst = []
    for i in range(len(cols)):
        lst.append(df[cols[i]].dropna().unique()[0])
    return lst

Q16= ['Q16_Part_1','Q16_Part_2','Q16_Part_3','Q16_Part_4','Q16_Part_5','Q16_Part_6','Q16_Part_7','Q16_Part_8','Q16_Part_9','Q16_Part_10','Q16_Part_11','Q16_Part_12','Q16_Part_13','Q16_Part_14','Q16_Part_15','Q16_Part_16','Q16_Part_17','Q16_OTHER']
Q17= ['Q17_Part_1','Q17_Part_2','Q17_Part_3','Q17_Part_4','Q17_Part_5','Q17_Part_6','Q17_Part_7','Q17_Part_8','Q17_Part_9','Q17_Part_10','Q17_Part_11','Q17_OTHER']
Q16_labels=get_unq(small_ds[Q16], Q16)
Q17_labels=get_unq(small_ds[Q17], Q17)

Q16_s = small_ds[Q16].count().to_frame('small').transform(lambda x: round(x/x.sum()*100,1))
Q16_b = big_ds[Q16].count().to_frame('big').transform(lambda x: round(x/x.sum()*100,1))
Q16_s.index, Q16_b.index = Q16_labels, Q16_labels
Q16_df = pd.concat([Q16_s, Q16_b], axis=1)

Q17_s = small_ds[Q17].count().to_frame('small').transform(lambda x: round(x/x.sum()*100,1))
Q17_b = big_ds[Q17].count().to_frame('big').transform(lambda x: round(x/x.sum()*100,1))
Q17_s.index, Q17_b.index = Q17_labels, Q17_labels
Q17_df = pd.concat([Q17_s, Q17_b], axis=1)

ML_cols ={"small": greys[2], "big":my_colours[2]}

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 8))

Q16_df.plot.barh(ax=ax1, width=.9, color=ML_cols, alpha=.7)
ax1.invert_yaxis()
ax1.get_legend().remove()
ax1.tick_params(axis=u'both', which=u'both',length=0)
ax1.spines['left'].set_visible(False)

Q17_df.plot.barh(ax=ax2, width=.8, color=ML_cols, alpha=.7)
ax2.invert_yaxis()
ax2.tick_params(axis=u'both', which=u'both',length=0)
ax2.spines['left'].set_visible(False)

fig.suptitle('Machine Learning Libraries and Algorithms', font=font, fontsize=16)
ax1.set_xlabel('Machine Learning Frameworks', font=font, fontsize=15, color=greys[1])
ax2.set_xlabel('Machine Learning Algorithms', font=font, fontsize=15, color=greys[1])

The top machine learning libraries and algorithms used by big and small businesses are similar. The XGBoost framework is more popular in big companies. Consistent with this, Gradient Boosting algorithms, for which XGBoost is one of several common implementations (LightGBM and CatBoost are others), are also more commonly used in big businesses.
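
Differences like this are easier to spot when the two columns are reduced to a percentage-point gap and sorted. A minimal sketch on made-up shares follows; the numbers and the `usage` frame are illustrative only and do not come from the survey.

```python
import pandas as pd

# Made-up usage shares (percent), for illustration only.
usage = pd.DataFrame(
    {"small": [30.0, 20.0, 10.0], "big": [29.0, 21.0, 16.0]},
    index=["Scikit-learn", "TensorFlow", "Xgboost"],
)

# Positive gap: more common in big companies; negative: more common in small ones.
usage["gap"] = usage["big"] - usage["small"]

# Order rows by the absolute size of the gap, largest first.
order = usage["gap"].abs().sort_values(ascending=False).index
largest_gaps = usage.loc[order]
```

The same two lines would apply to a frame shaped like `Q16_df` or `Q17_df`, which already hold one `small` and one `big` column.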

In [None]:
# Activities that make up an important part of your role at work
Q24 = ['Q24_Part_1','Q24_Part_2','Q24_Part_3','Q24_Part_4','Q24_Part_5','Q24_Part_6','Q24_Part_7','Q24_OTHER']
Q24_labels = ['Analyze and understand data', 'Build and/or run \n the data infrastructure', 'Build prototypes', 'Build and/or run \n a machine learning service', 'Improve existing ML models', 'Research', 'None of the options', 'Other']

Q24_s = small_ds[Q24].count().to_frame('small').transform(lambda x: round(x/x.sum()*100,1))
Q24_b = big_ds[Q24].count().to_frame('big').transform(lambda x: round(x/x.sum()*100,1))
Q24_s.index, Q24_b.index = Q24_labels, Q24_labels

width=0.4
x = np.arange(len(Q24_labels))

fig, ax = plt.subplots(figsize=(16, 8)) 
rects1 = ax.bar(x - width/2, Q24_s['small'], width, label='Small', color=greys[2])
rects2 = ax.bar(x + width/2,  Q24_b['big'], width, label='Big', color=my_colours[2])

ax.set_title('Activities for Data Scientists', font=font, fontsize=17, color=greys[1])
ax.set_xlabel(' ')
ax.set_ylabel('Proportion', font=font, fontsize=15, color=greys[1])
ax.set_xticks(x)
ax.set_xticklabels(Q24_labels, font=font, fontsize=10)
ax.legend(prop=font)
ax.bar_label(rects1, padding=3, fontsize=9, font=font)
ax.bar_label(rects2, padding=3, fontsize=9, font=font)

fig.tight_layout()
plt.show()

I expected to see that employees at small companies would undertake more activities than big ones, to account for the lack of siloed personnel. This was not reflected in the results of the survey, which indicate that both small and big business employees could be considered "Jacks of all trades".
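
The chart shows each activity's share of all selections, which does not directly measure how many activities one person performs. A more direct, hedged check of the "Jack of all trades" idea is to count the activities ticked by each respondent and compare the averages. The sketch below uses made-up data and hypothetical column names; with the real survey, the "None of the options" column would need to be left out of the count.

```python
import pandas as pd

# Made-up multi-select answers: one column per activity, None when not selected.
toy = pd.DataFrame({
    "size": ["small", "small", "big", "big"],
    "analyse_data": ["x", "x", "x", None],
    "build_prototypes": ["x", None, "x", None],
    "research": ["x", None, None, "x"],
})
activity_cols = ["analyse_data", "build_prototypes", "research"]

# Count ticked activities per respondent, then average within each size class.
toy["n_activities"] = toy[activity_cols].notna().sum(axis=1)
mean_activities = toy.groupby("size")["n_activities"].mean()
```

A clearly higher mean for small companies would support the original expectation, while similar means would match the conclusion drawn above.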

***


The insights drawn in this analysis reflect the Kaggle community, since that is the source of the data. If you work in a small or big business and would like to share your insights or alternative explanations for the results of this analysis, please leave a comment! 😍