In [None]:
# Importing required libraries & themes 
! pip install seaborn==0.11.0 --upgrade pip
import pandas as pd
import numpy as np
from pandas import DataFrame
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style(style="whitegrid")
plt.show()

# **Kaggle Survey Analysis | Employed Subset**

In [None]:
# Importing the data set
data = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")

# Dropping the question-text row and the duration column
df = data.iloc[1:,:]
drop = ['Time from Start to Finish (seconds)']
df = df.drop(drop, axis=1)

The 2020 Kaggle Survey has many hidden gems and findings that one can analyze and deduce about the Kaggle community of Data Scientists. As someone who has recently gotten into learning data science (taking the MOOC and self-taught route), I was really curious to learn more about the employed subset of the survey.  

In analyzing the responses from the employed subset, there are many insights that can help students or beginners understand a little more about which coding language to start with or what the optimal environment looks like. By following this route, I also realized that there is an equally interesting finding on the business side of data science: how fierce the competition is between some of the most well-recognized brands like Google, Amazon, SAP, and Microsoft. 

I sectioned this notebook to cover the general survey demographics of the employed subset first, followed by a deeper dive into their occupation. I also included an environment section to identify which platforms or products are being used most regularly.

 **Sections**
- [Demographics](#demographics)
- [Occupation](#occupation)
- [Environment](#environment)

In [None]:
#Occupation 
sns.countplot(y=df.Q5, palette='crest', order = ['Student','Data Scientist','Software Engineer','Other',
                                                    'Currently not employed','Data Analyst','Research Scientist',
                                                   'Machine Learning Engineer','Business Analyst','Product/Project Manager'
                                                   ,'Data Engineer','Statistician','DBA/Database Engineer']).set(ylabel='',
                                                                                                        title='Occupation (Total)')
plt.show()

The largest portion of respondents were students. Because this analysis focuses on the employed subset, the responses from students and from those who are not currently employed are dropped. The following sections and graphs include only the respondents who said that they are currently employed. 
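The same filtering can also be written as a single boolean mask. This is a minimal sketch on a made-up `Q5` column (the rows below are illustrative, not survey data), and `toy`, `excluded`, and `employed` are names introduced only for this example:

```python
import pandas as pd

# Toy stand-in for the survey frame; values are illustrative only.
toy = pd.DataFrame(
    {"Q5": ["Student", "Data Scientist", "Currently not employed", "Data Analyst"]}
)

# Categories to leave out of the employed subset.
excluded = ["Student", "Currently not employed"]

# isin() builds a True/False mask; ~ inverts it so only other rows remain.
# .copy() gives an independent frame that is safe to modify later.
employed = toy[~toy["Q5"].isin(excluded)].copy()
print(employed)
```

Rows with a missing `Q5` answer are kept by this mask, which matches what dropping by index does.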

In [None]:
# Dropping unwanted rows  
students = df[df['Q5']=='Student'].index
df.drop(students, inplace=True)
unemployed = df[df['Q5']=='Currently not employed'].index
df.drop(unemployed,inplace=True)


# Survey Demographics (Employed) <a id='demographics'></a>

This section breaks down the demographics of our employed subset, giving us a better understanding of who they are. 

In [None]:
#Gender Distribution_Employed
sns.countplot(y=df.Q2,palette='crest',order=['Man','Woman','Prefer not to say','Prefer to self-describe',
                                             'Nonbinary']).set(ylabel='',title='Gender Distribution')
plt.show()

There's a considerably large discrepancy in gender, with the majority being men. If the Kaggle survey is a representative sample of the data science field or the Kaggle community, this might warrant a deeper dive (with data not available in this survey) to understand the level of inclusivity and accessibility of the Kaggle platform or the field as a whole. 

However, to their credit, Kaggle is actively addressing minority group accessibility through their <a href="https://www.kaggle.com/bipoc-grant-application"> BIPOC Grant </a> initiative. 
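To put a number on the gap rather than reading it off the bars, the counts can be turned into percentage shares. This is a minimal sketch on a made-up `Q2` column (the five rows are illustrative and say nothing about the real survey):

```python
import pandas as pd

# Toy stand-in for df.Q2; values are illustrative only.
toy_q2 = pd.Series(
    ["Man", "Man", "Man", "Woman", "Prefer not to say"], name="Q2"
)

# normalize=True returns fractions instead of raw counts;
# multiplying by 100 and rounding gives readable percentages.
shares = toy_q2.value_counts(normalize=True).mul(100).round(1)
print(shares)
```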

In [None]:
sns.countplot(y=df.Q1,palette='crest',order=['25-29','30-34','22-24','35-39','40-44','45-49','18-21','50-54','55-59',
                                             '60-69','70+']).set(ylabel='',title='Age Distribution')
plt.show()

Given that this is the age distribution of the employed subset, the majority are considerably young. This could hint that most of the subset are at the entry-level/junior or middle-management stage of their careers.  
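The plot above sorts the brackets by count. If reading along the age axis is preferred, the counts can be reindexed into bracket order instead. This is a minimal sketch with a made-up `Q1` column and a shortened bracket list, both assumptions for illustration:

```python
import pandas as pd

# Natural order of the brackets (shortened list for the example).
age_order = ["18-21", "22-24", "25-29", "30-34"]

# Toy stand-in for df.Q1; values are illustrative only.
toy_q1 = pd.Series(["25-29", "18-21", "25-29", "30-34", "22-24", "25-29"])

# value_counts() sorts by frequency; reindex() puts the brackets back
# in age order, and fill_value=0 covers brackets nobody selected.
counts = toy_q1.value_counts().reindex(age_order, fill_value=0)
print(counts)
```

The same `age_order` list could be passed as `order=` to `sns.countplot` to get the bars in that sequence.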

In [None]:
#Residence 
sns.countplot(y=df.Q3,palette='crest',order=['India','United States of America','Other','Brazil',
                                           'Japan','Russia','United Kingdom of Great Britain and Northern Ireland',
                                         'Germany','Nigeria','Spain']).set(ylabel='',title='Residence(Top 10)')
plt.show()


The majority of the employed subset lives in India, the US, or an under-represented country: 'Other' is a group created by Kaggle to mask, for anonymity, the names of countries that received fewer than 50 responses. Even though these countries can be considered under-represented, it is still interesting to see that the Kaggle community reaches many countries across different continents. 

This is also important context for readers: the companies in the environment section may have been chosen as a platform/product of choice because of their prominence in a certain region. 
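The top-10 list in the residence plot is typed out by hand. This is a minimal sketch, on a made-up `Q3` column, of how a top-N ordering could be derived from the data instead (`top_n` and `toy_q3` are names introduced only for this example):

```python
import pandas as pd

# Toy stand-in for df.Q3; values are illustrative only.
toy_q3 = pd.Series(["India", "India", "India", "Japan", "Japan", "Brazil"])

top_n = 2  # the notebook uses 10; kept small for the toy data

# value_counts() sorts by frequency (descending) by default,
# so head(top_n) keeps the most common categories in plot order.
order = toy_q3.value_counts().head(top_n).index.tolist()
print(order)
```

Passing this list as `order=order` to `sns.countplot` would keep the chart in sync with the data if the subset changes.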

In [None]:
#Formal Education 
sns.countplot(y=df.Q4,palette='crest',order=['Master’s degree','Bachelor’s degree','Doctoral degree',
                                             'Professional degree','Some college/university study without earning a bachelor’s degree',
                                             'I prefer not to answer','No formal education past high school']).set(ylabel='',
                                                                                                                   title='Formal Education') 
plt.show()

It is interesting to see that the majority of the respondents have completed their formal education. Most of the employed subset have completed their Master's. All three phases of the higher education route (including Doctoral degree) are present in the top three. 

However, this can be misleading, as Data Science is a field that does not necessarily require formal education to get into. Since there is no follow-up question asking what the respondents majored in, there is no way of figuring out how many of those currently employed formally studied data science in university, or at what stage in their higher-education journey they decided to switch majors or careers. 

In [None]:
#Coding Experience
sns.countplot(y=df.Q6,palette='crest',order=['3-5 years','1-2 years','5-10 years','10-20 years','< 1 years',
                                             '20+ years','I have never written code']).set(ylabel='',title='Coding Experience')
plt.show()


It is interesting to see how many of the respondents have been coding for between one and ten years, the most common answers being three to five years and one to two years. It would be interesting to see whether this is because of the surge in demand for coding knowledge and data scientists, given the unprecedented speed of technological advancement, or because of the younger audience within the Kaggle community. 

In [None]:
#The most suggested language for aspiring data scientists to start up 
sns.countplot(y=df.Q8,palette='crest',order=['Python','R','SQL','C++','C','MATLAB','Java','Other','Julia','Javascript','None',
                                           'Bash','Swift']).set(ylabel='',title='First language for aspiring data scientists')
plt.show()

It's no surprise that Python was the most suggested language for aspiring data scientists given the plethora of free information available online and the ease of entry. What's interesting is the discrepancy between Python and all of the other languages suggested. 

Given that the focus is on the employed subset, it would have been interesting to understand why Python is so highly suggested. Is it because of the reasons speculated above, because it was the first language they learned before becoming data scientists, or because Python is highly sought after by employers? 

In [None]:
#Concatenate the responses for the Media Outlets/Channels
concat_mc = np.concatenate([df.Q39_Part_1.values,df.Q39_Part_2.values,df.Q39_Part_3.values,df.Q39_Part_4.values,df.Q39_Part_5.values,
                             df.Q39_Part_6.values,df.Q39_Part_7.values,df.Q39_Part_8.values,df.Q39_Part_9.values,df.Q39_Part_10.values,
                             df.Q39_Part_11.values,df.Q39_OTHER.values])
dfmc = pd.concat([df,pd.DataFrame(concat_mc)],ignore_index=True,axis=1)
dfmc.columns = np.append(df.columns.values,"Media")

sns.countplot(y=dfmc.Media, palette='crest', order=["Kaggle (notebooks, forums, etc)","YouTube (Kaggle YouTube, Cloud AI Adventures, etc)",
                                                   "Blogs (Towards Data Science, Analytics Vidhya, etc)","Twitter (data science influencers)",
                                                   "Journal Publications (peer-reviewed journals, conference proceedings, etc)",
                                                   "Email newsletters (Data Elixir, O'Reilly Data & AI, etc)",
                                                   "Course Forums (forums.fast.ai, Coursera forums, etc)","Reddit (r/machinelearning, etc)",
                                                   "Slack Communities (ods.ai, kagglenoobs, etc)","Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)",
                                                   "None","Other"]).set(ylabel='',title='Favourite Media Sources for DS')
plt.show()

Looks like a lot of the employed respondents stay up to date on data science topics through a myriad of channels, mainly Kaggle, YouTube, and blogs. This could be due to the ever-evolving nature of the field. 

This is helpful for students to understand which media sources the employed community consumes the most, so that they can follow similar channels or use them for networking. 

This can also be insightful for companies trying to get more exposure to the community, whether through paid advertisements, sponsorships, or guest appearances on those channels. 
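The media question is a multi-select question stored as one column per option (`Q39_Part_1` through `Q39_OTHER`), and the same layout recurs for several later questions. This is a minimal sketch of a reusable counting helper, assuming a made-up three-column frame. `multiselect_counts` is a hypothetical helper written for this example, not part of pandas or the notebook:

```python
import pandas as pd

# Toy frame mimicking the one-column-per-option layout; values are made up.
toy = pd.DataFrame({
    "Q39_Part_1": ["Kaggle", None, "Kaggle"],
    "Q39_Part_2": ["YouTube", "YouTube", None],
    "Q39_OTHER": [None, "Other", None],
})


def multiselect_counts(frame, prefix):
    """Count selections across every column whose name starts with `prefix`."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    # melt() stacks the option columns into one long "choice" column;
    # value_counts() then ignores the missing (unselected) entries.
    long = frame[cols].melt(value_name="choice")
    return long["choice"].value_counts()


counts = multiselect_counts(toy, "Q39_")
print(counts)
```

Keeping the trailing underscore in the prefix (`"Q7_"`, `"Q9_"`, and so on) matters, since a bare `"Q2"` would also match columns such as `Q20`. The index of the result is already sorted by frequency, so it could serve as the `order=` argument of a count plot.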

# Occupation <a id='occupation'></a>

In [None]:
# Updated employment graph  
sns.countplot(y=df.Q5, palette='crest', order = ['Data Scientist','Software Engineer','Other','Data Analyst',
                                                 'Research Scientist','Machine Learning Engineer','Business Analyst',
                                                 'Product/Project Manager','Data Engineer','Statistician',
                                                 'DBA/Database Engineer']).set(ylabel='',title='Occupation (Employed)')
plt.show()

With the unemployed and student subsets dropped from the dataset, the breakdown of occupation is clearer to visualize. The majority of the respondents are Data Scientists, followed by Software Engineers. 

It would have been helpful if those who responded Other were asked a follow-up question on whether or not they were in the data science field at all as I would have dropped the respondents who were not in the field, providing clearer insight to the analysis. 

In [None]:
#Company size employed Kagglers currently work in 
df.Q20.value_counts()
sns.countplot(y=df.Q20,palette='crest',order=['0-49 employees','10,000 or more employees','1000-9,999 employees',
                                             '50-249 employees','250-999 employees']).set(ylabel='',title='Company Size')
plt.show()

Most of this subset work in a small enterprise with fewer than 50 employees. This is useful insight towards understanding their environment, expenditure, and products/platforms of choice. The second most represented company size is 10,000 or more employees, the opposite end of the scale from the most represented size; even so, the gap between the top two is large. 

In [None]:
#Data Science Team 
df.Q21.value_counts()
sns.countplot(y=df.Q21,palette='crest',order=['1-2','0','20+','3-4','5-9','10-14','15-19']).set(ylabel='',title='Size of Data Science Team')

plt.show()

Taking the company size into consideration, it is not such a stretch to see that most of the employed subset have one to two, or no, members in the data science department. However, with 20+ being the third most common answer, it seems likely that the larger companies have significantly larger data science teams. 
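The link between company size and team size is not visible in either chart on its own. This is a minimal sketch, using made-up rows in place of the real `Q20` and `Q21` columns, of how a cross-tabulation could check it:

```python
import pandas as pd

# Toy stand-in for the two survey columns; values are illustrative only.
toy = pd.DataFrame({
    "Q20": ["0-49 employees", "0-49 employees", "10,000 or more employees",
            "10,000 or more employees", "0-49 employees"],
    "Q21": ["1-2", "0", "20+", "20+", "1-2"],
})

# normalize="index" converts each row to shares that sum to 1, so team-size
# distributions can be compared across company sizes of different counts.
table = pd.crosstab(toy["Q20"], toy["Q21"], normalize="index")
print(table.round(2))
```

On the real data, a row for the largest companies that is concentrated in the `20+` column would support the reading above.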

In [None]:
#Concatenate the responses for important job responsibilities
concat_jr = np.concatenate([df.Q23_Part_1.values,df.Q23_Part_2.values,df.Q23_Part_3.values,df.Q23_Part_4.values,df.Q23_Part_5.values,
                             df.Q23_Part_6.values,df.Q23_Part_7.values,df.Q23_OTHER.values])
dfjr = pd.concat([df,pd.DataFrame(concat_jr)],ignore_index=True,axis=1)
dfjr.columns = np.append(df.columns.values,"Job_Responsibility")


sns.countplot(y=dfjr.Job_Responsibility, palette='crest', order=['Analyze and understand data to influence product or business decisions',
                                                                'Build prototypes to explore applying machine learning to new areas',
                                                                 'Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
                                                                'Experimentation and iteration to improve existing ML models',
                                                                'Build and/or run a machine learning service that operationally improves my product or workflows',
                                                                'Do research that advances the state of the art of machine learning',
                                                                'None of these activities are an important part of my role at work',
                                                                'Other']).set(ylabel='',title='Prominent Job Responsibilities')
plt.show()

This graph depicts an interesting finding: the employed subset (regardless of their title) have a variety of prominent responsibilities, not just one that clearly defines their role. Judging from publicly available documents (job descriptions, blogs, etc.), it is not only customary for smaller companies to bundle a variety of prominent responsibilities, but it is also becoming customary for larger companies to adopt this approach. 

# Environment <a id='environment'></a>

In [None]:
#Concatenate the responses for the most commonly used languages
concatpl = np.concatenate([df.Q7_Part_1.values,df.Q7_Part_2.values,df.Q7_Part_3.values,df.Q7_Part_4.values,df.Q7_Part_5.values,
                          df.Q7_Part_6.values,df.Q7_Part_7.values,df.Q7_Part_8.values,df.Q7_Part_9.values,df.Q7_Part_10.values,
                          df.Q7_Part_11.values,df.Q7_Part_12.values,df.Q7_OTHER.values])
dfpl = pd.concat([df,pd.DataFrame(concatpl)],ignore_index=True,axis=1)
dfpl.columns = np.append(df.columns.values,"pl")
dfpl.pl.value_counts()

sns.countplot(y=dfpl.pl, palette='crest', order=['Python','SQL','R','Javascript','Java','C++','C','Other'
                                                 ,'Bash','MATLAB','Julia','Swift','None']).set(ylabel=''
                                                                                                       , title='Most Commonly Used Programming Languages')
plt.show()


It's interesting to see that many of our employed subset commonly use more than one language. Python and SQL are the two most common, followed by R. This isn't a huge shock, as the most common prominent job responsibility was to analyze and understand data to influence business decisions. 
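The chart counts selections, not people, so it cannot show directly how many respondents use several languages. This is a minimal sketch of a per-respondent count, assuming a made-up three-column version of the `Q7_Part_*` layout. In the real data a "None" option also occupies a column and would need to be excluded first:

```python
import pandas as pd

# Toy stand-in for the Q7 option columns; values are illustrative only.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None],
    "Q7_Part_2": ["R", None, None],
    "Q7_Part_3": ["SQL", "SQL", "SQL"],
})

lang_cols = [c for c in toy.columns if c.startswith("Q7_")]

# notna() marks each selected option; summing across columns (axis=1)
# gives the number of languages chosen by each respondent.
langs_per_person = toy[lang_cols].notna().sum(axis=1)

# Share of respondents who selected more than one language.
share_multi = (langs_per_person > 1).mean()
print(langs_per_person.tolist(), round(share_multi, 2))
```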

In [None]:
#Most common Computing Platform
df.Q11.value_counts()
sns.countplot(y=df.Q11, palette='crest', order=['A personal computer or laptop','A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)',
                                               'A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)','None',
                                               'Other']).set(ylabel='',title='Most Common Computing Platform')
plt.show()

Given what we know on the most common company sizes and the number of data scientists in a team, it's expected that the majority of the respondents would use laptops/computers as their computing platforms of choice. 

In [None]:
#Concatenate the responses for the most commonly used Hosted Notebooks
concat_nb = np.concatenate([df.Q10_Part_1.values,df.Q10_Part_2.values,df.Q10_Part_3.values,df.Q10_Part_4.values,df.Q10_Part_5.values,
                             df.Q10_Part_6.values,df.Q10_Part_7.values,df.Q10_Part_8.values,df.Q10_Part_9.values,df.Q10_Part_10.values,
                             df.Q10_Part_11.values,df.Q10_Part_12.values,df.Q10_Part_13.values,df.Q10_OTHER.values])
dfnb = pd.concat([df,pd.DataFrame(concat_nb)],ignore_index=True,axis=1)
dfnb.columns = np.append(df.columns.values,"Hosted_Notebooks")
dfnb.Hosted_Notebooks.value_counts()

sns.countplot(y=dfnb.Hosted_Notebooks, palette='crest',order=['Colab Notebooks',' Kaggle Notebooks','None',
                                                             ' Binder / JupyterHub ','Google Cloud AI Platform Notebooks ',
                                                             'Google Cloud Datalab Notebooks','Azure Notebooks',' IBM Watson Studio ',
                                                             ' Amazon Sagemaker Studio ',' Databricks Collaborative Notebooks ',
                                                             'Other',' Amazon EMR Notebooks ',' Paperspace / Gradient ',
                                                              ' Code Ocean ']).set(ylabel='',title='Most Commonly Used Hosted Notebooks')
plt.show()

The top three most commonly used hosted notebooks are Colab Notebooks, Kaggle Notebooks and None. It's surprising to see that over 3,500 respondents don't use any hosted notebooks. 

In [None]:
#Concatenate the responses for the most commonly used IDE's
concat_ide = np.concatenate([df.Q9_Part_1.values,df.Q9_Part_2.values,df.Q9_Part_3.values,df.Q9_Part_4.values,df.Q9_Part_5.values,
                             df.Q9_Part_6.values,df.Q9_Part_7.values,df.Q9_Part_8.values,df.Q9_Part_9.values,df.Q9_Part_10.values,
                             df.Q9_Part_11.values,df.Q9_OTHER.values])
dfide = pd.concat([df,pd.DataFrame(concat_ide)],ignore_index=True,axis=1)
dfide.columns = np.append(df.columns.values,"IDE")

sns.countplot(y=dfide.IDE, palette='crest',order=['Jupyter (JupyterLab, Jupyter Notebooks, etc) ',
                                                 'Visual Studio Code (VSCode)',' PyCharm ',' RStudio ',
                                                 '  Notepad++  ','  Spyder  ','Visual Studio','  Sublime Text  ',
                                                 '  Vim / Emacs  ',' MATLAB ','Other','None']).set(ylabel='',title='Most Commonly Used IDEs')
plt.show()

The most common IDE is Jupyter. It's not a surprise given its popularity and user-friendly interface. However, there is a large discrepancy between Jupyter and the other options chosen, highlighting its popularity even further. 

In [None]:
#Concatenate the responses for the most commonly used Data Visualization Libraries
concat_dvl = np.concatenate([df.Q14_Part_1.values,df.Q14_Part_2.values,df.Q14_Part_3.values,df.Q14_Part_4.values,df.Q14_Part_5.values,
                             df.Q14_Part_6.values,df.Q14_Part_7.values,df.Q14_Part_8.values,df.Q14_Part_9.values,df.Q14_Part_10.values,
                             df.Q14_Part_11.values,df.Q14_OTHER.values])
dfdvl = pd.concat([df,pd.DataFrame(concat_dvl)],ignore_index=True,axis=1)
dfdvl.columns = np.append(df.columns.values,"Data_Vis")

sns.countplot(y=dfdvl.Data_Vis, palette='crest',order=[' Matplotlib ',' Seaborn ',
                                                       ' Plotly / Plotly Express ',' Ggplot / ggplot2 ',
                                                      'None',' Shiny ',' Bokeh ',' D3 js ',' Geoplotlib ',
                                                      ' Leaflet / Folium ','Other',' Altair ']).set(ylabel='',title='Most Commonly Used Data Visualization Libraries')
plt.show()

Given Python's popularity with the employed subset and the ease of use of these two libraries, it's no surprise that Matplotlib and Seaborn are the two most common data visualization libraries. 

In [None]:
#ML Experience
df.Q15.value_counts()
sns.countplot(y=df.Q15,palette='crest',order=['Under 1 year','1-2 years','2-3 years','I do not use machine learning methods',
                                             '3-4 years','5-10 years','4-5 years','10-20 years',
                                              '20 or more years']).set(ylabel='',title='Machine Learning Experience')
plt.show()

This graph, along with the next few graphs, relates to machine learning. Most of the respondents have less than one year of experience with machine learning, and the second most common response was one to two years of experience. This might be because machine learning is considered relatively new and was propelled by the digital transformation efforts that became a necessity for business continuity during the COVID-19 pandemic. 

In [None]:
sns.countplot(y=df.Q22,palette='crest',order=['We are exploring ML methods (and may one day put a model into production)',
                                             'No (we do not use ML methods)',
                                              'We have well established ML methods (i.e., models in production for more than 2 years)',
                                             'We recently started using ML methods (i.e., models in production for less than 2 years)',
                                              'I do not know','We use ML methods for generating insights (but do not put working models into production)']).set(ylabel='',title='Machine Learning Established in Workplace')
plt.show()

The subset's responses on their experience with machine learning are in line with their respective employer's stance on machine learning implementation. Most companies are now exploring machine learning methods and may one day put a model into production. 

In [None]:
#Concatenate the responses for the most commonly used Machine Learning Frameworks 
concat_mlf = np.concatenate([df.Q16_Part_1.values,df.Q16_Part_2.values,df.Q16_Part_3.values,df.Q16_Part_4.values,df.Q16_Part_5.values,
                             df.Q16_Part_6.values,df.Q16_Part_7.values,df.Q16_Part_8.values,df.Q16_Part_9.values,df.Q16_Part_10.values,
                             df.Q16_Part_11.values,df.Q16_Part_12.values,df.Q16_Part_13.values,df.Q16_Part_14.values,
                             df.Q16_Part_15.values,df.Q16_OTHER.values])
dfmlf = pd.concat([df,pd.DataFrame(concat_mlf)],ignore_index=True,axis=1)
dfmlf.columns = np.append(df.columns.values,"ML_Frameworks")

sns.countplot(y=dfmlf.ML_Frameworks, palette='crest',order=['  Scikit-learn ','  TensorFlow ',' Keras ',
                                                           ' Xgboost ',' PyTorch ',' LightGBM ','None',' Caret ',
                                                           ' CatBoost ',' Fast.ai ',' Prophet ',' Tidymodels ',
                                                           'Other',' H2O 3 ',' MXNet ',' JAX ']).set(ylabel='',title='Most Commonly Used Machine Learning Frameworks')
plt.show()

The most commonly used machine learning framework is Scikit-learn, which is interesting to know for aspiring data scientists or those who want to learn more about machine learning frameworks/implementation. TensorFlow and Keras are also in the top three; however, they are not nearly as popular as Scikit-learn. 

In [None]:
#Concatenate the responses for the most commonly used ML Products
concat_mlpd = np.concatenate([df.Q28_A_Part_1.values,df.Q28_A_Part_2.values,df.Q28_A_Part_3.values,df.Q28_A_Part_4.values,df.Q28_A_Part_5.values,
                             df.Q28_A_Part_6.values,df.Q28_A_Part_7.values,df.Q28_A_Part_8.values,df.Q28_A_Part_9.values,df.Q28_A_Part_10.values,
                             df.Q28_A_OTHER.values])
dfmlpd = pd.concat([df,pd.DataFrame(concat_mlpd)],ignore_index=True,axis=1)
dfmlpd.columns = np.append(df.columns.values,"ML_Products")

sns.countplot(y=dfmlpd.ML_Products, palette='crest',order=['No / None',' Google Cloud AI Platform / Google Cloud ML Engine',
                                                          ' Amazon SageMaker ',' Azure Machine Learning Studio ',' Google Cloud Vision AI ',
                                                          ' Google Cloud Natural Language ',' Azure Cognitive Services ',' Google Cloud Video AI ',
                                                          ' Amazon Rekognition ',' Amazon Forecast ','Other']).set(ylabel='',title='Machine Learning Products Used Regularly')
plt.show()

This graph shows an interesting and stark difference between the subset who don't use any machine learning products regularly and those that do. The lack of machine learning product usage could be due to the overall early stages of machine learning with both the experience and company implementation. 

With those that do use machine learning products regularly - Google, Amazon, and Microsoft were ranked in the top three. 

Another interesting fact to note is that the remaining options are all variations of services provided by Amazon, Google, and Microsoft. On the surface, this looks a lot like business cannibalization, in that the same company offers different products that serve a similar purpose. 

In [None]:
#Concatenate the responses for the Automated Machine Learning Tools
concat_aml = np.concatenate([df.Q34_A_Part_1.values,df.Q34_A_Part_2.values,df.Q34_A_Part_3.values,df.Q34_A_Part_4.values,df.Q34_A_Part_5.values,
                             df.Q34_A_Part_6.values,df.Q34_A_Part_7.values,df.Q34_A_Part_8.values,df.Q34_A_Part_9.values,df.Q34_A_Part_10.values,
                             df.Q34_A_Part_11.values,df.Q34_A_OTHER.values])
dfaml = pd.concat([df,pd.DataFrame(concat_aml)],ignore_index=True,axis=1)
dfaml.columns = np.append(df.columns.values,"Auto_ML_Tools")

sns.countplot(y=dfaml.Auto_ML_Tools, palette='crest',order=['No / None','  Auto-Sklearn ','  Auto-Keras ',
                                                           ' Google Cloud AutoML ','  Auto_ml ',' H20 Driverless AI  ','Other',
                                                           ' DataRobot AutoML ',' Databricks AutoML ','  Tpot ','  MLbox ',
                                                           '  Xcessiv ']).set(ylabel='',title='AutoML Tools Used Regularly')
plt.show()

This graph is also in line with the previous findings: automated machine learning tools are not used regularly, and those who do use them favor Auto-Sklearn or Auto-Keras, both of which build on machine learning frameworks that are popular with the subset.

In [None]:
#Approximate expenditure on ML/Cloud in the last 5 years (USD)
sns.countplot(y=df.Q25,palette='crest', order=['$0 ($USD)','$1000-$9,999','$100-$999',
                                              '$1-$99','$10,000-$99,999','$100,000 or more ($USD)']).set(ylabel='',title='ML / Cloud expenditure in the last 5 years (USD)')
plt.show()

Also in line with previous findings, not much is being spent on machine learning implementation. However, it is surprising that, even with cloud expenditure bundled in with machine learning, most of the respondents still claimed zero expenditure on ML or Cloud. 

Seeing as a lot of the software service providers and SaaS companies provide freemium models (starting free and moving up to a paid model for more usage, storage or features), could the current needs of our employed subset be fulfilled in the free tier?

In [None]:
#Concatenate the responses for the most commonly used Cloud Computing Platforms 
concat_ccp = np.concatenate([df.Q26_A_Part_1.values,df.Q26_A_Part_2.values,df.Q26_A_Part_3.values,df.Q26_A_Part_4.values,df.Q26_A_Part_5.values,
                             df.Q26_A_Part_6.values,df.Q26_A_Part_7.values,df.Q26_A_Part_8.values,df.Q26_A_Part_9.values,df.Q26_A_Part_10.values,
                             df.Q26_A_Part_11.values,df.Q26_A_OTHER.values])
dfccp = pd.concat([df,pd.DataFrame(concat_ccp)],ignore_index=True,axis=1)
dfccp.columns = np.append(df.columns.values,"Cloud_Computing_Platforms")

sns.countplot(y=dfccp.Cloud_Computing_Platforms, palette='crest',order=[' Amazon Web Services (AWS) ',' Google Cloud Platform (GCP) ',
                                                                       'None',' Microsoft Azure ',' IBM Cloud / Red Hat ',' Oracle Cloud ',
                                                                       ' VMware Cloud ','Other',' Salesforce Cloud ',' SAP Cloud ',
                                                                        ' Alibaba Cloud ',' Tencent Cloud ']).set(ylabel='',title='Most Commonly Used Cloud Computing Platform')
plt.show()

This graph highlights a very interesting finding for companies, given that almost all of the largest names in cloud computing are present. The findings here show that Amazon's AWS is the most commonly used cloud computing platform, followed by Google's GCP. Compared to other prominent names such as SAP, Oracle, VMware and Salesforce, Microsoft Azure is still quite popular with this subset. 

This could reflect the decision of the subset's companies to work with these providers in their region. It could also reflect personal preferences in the most represented regions (India and the US) for certain providers over others. 

In [None]:
#Concatenate the responses for the most commonly used Cloud Computing Products 
concat_ccpd = np.concatenate([df.Q27_A_Part_1.values,df.Q27_A_Part_2.values,df.Q27_A_Part_3.values,df.Q27_A_Part_4.values,df.Q27_A_Part_5.values,
                             df.Q27_A_Part_6.values,df.Q27_A_Part_7.values,df.Q27_A_Part_8.values,df.Q27_A_Part_9.values,df.Q27_A_Part_10.values,
                             df.Q27_A_Part_11.values,df.Q27_A_OTHER.values])
dfccpd = pd.concat([df,pd.DataFrame(concat_ccpd)],ignore_index=True,axis=1)
dfccpd.columns = np.append(df.columns.values,"Cloud_Computing_Products")

sns.countplot(y=dfccpd.Cloud_Computing_Products, palette='crest',order=[' Amazon EC2 ','No / None',' AWS Lambda ',
                                                                        ' Azure Cloud Services ',' Google Cloud Functions ',
                                                                        ' Google Cloud App Engine ',' Amazon Elastic Container Service ',
                                                                       ' Microsoft Azure Container Instances ',' Azure Functions ',
                                                                        ' Google Cloud Run ','Other']).set(ylabel='',title='Most Commonly Used Cloud Computing Products')
plt.show()

Similar to machine learning products, the cloud computing products also have various services provided by the same company. Amazon's EC2 is by far the most popular, followed by no regular usage of cloud computing products and Amazon's AWS Lambda coming in as third most commonly used. 

In [None]:
#Concatenate the responses for the most commonly used Big Data Products 
concat_bdpd = np.concatenate([df.Q29_A_Part_1.values,df.Q29_A_Part_2.values,df.Q29_A_Part_3.values,df.Q29_A_Part_4.values,df.Q29_A_Part_5.values,
                              df.Q29_A_Part_6.values,df.Q29_A_Part_7.values,df.Q29_A_Part_8.values,df.Q29_A_Part_9.values,df.Q29_A_Part_10.values,
                              df.Q29_A_Part_11.values,df.Q29_A_Part_12.values,df.Q29_A_Part_13.values,df.Q29_A_Part_14.values,
                              df.Q29_A_Part_15.values,df.Q29_A_Part_16.values,df.Q29_A_Part_17.values,df.Q29_A_OTHER.values])
dfbdpd = pd.concat([df,pd.DataFrame(concat_bdpd)],ignore_index=True,axis=1)
dfbdpd.columns = np.append(df.columns.values,"Big_Data_Products")

sns.countplot(y=dfbdpd.Big_Data_Products, palette='crest',order=['MySQL ','PostgresSQL ','Microsoft SQL Server ','None',
                                                                'MongoDB ','SQLite ','Oracle Database ','Google Cloud BigQuery ',
                                                                'Microsoft Access ','Amazon Redshift ','Microsoft Azure Data Lake Storage ',
                                                                'Google Cloud SQL ','Amazon DynamoDB ','Other','Amazon Athena ',
                                                                'Snowflake ','IBM Db2 ','Google Cloud Firestore ']).set(ylabel='',title='Most Commonly Used Big Data Products')
plt.show()

Given SQL being the second most common language used by our subset, it's not a surprise that the top three most commonly used big data products are MySQL, PostgresSQL and Microsoft SQL Server. 

In [None]:
#Most common Big Data Tools
sns.countplot(y=df.Q32, palette='crest',order=['Tableau','Microsoft Power BI','Google Data Studio','Qlik',
                                              'Other','Salesforce','Amazon QuickSight','SAP Analytics Cloud ',
                                              'Alteryx ','TIBCO Spotfire','Looker','Sisense ',
                                              'Einstein Analytics','Domo']).set(ylabel='',title='Most Commonly Used Big Data Tool')
plt.show()

Our final graph highlights the big data tools most commonly used by our employed subset, with Tableau being the most popular. Microsoft Power BI came in a close second, followed by a big gap to the third most popular tool, Google Data Studio. Similar to the cloud computing graph, many other prominent providers such as SAP and Amazon were not so popular with the subset.  

Thank you so much for going through my analysis! I'd like to give a special shoutout to the angels who publish their notebooks and provide answers on forums like Stack Overflow - I truly would not have been able to do it without you. 
I look forward to seeing your thoughts in the comments section below <3 