# So you want a career in data science, huh?
Let's face it: choosing a career path is hard! With so many different fields, job titles, and job descriptions out there, it's no wonder people change jobs an average of 12.4 times within their lifetime (source: [Bureau of Labor Statistics](https://www.bls.gov/news.release/pdf/nlsoy.pdf)). Luckily, for those of you reading this notebook, you've likely narrowed your selection of career choices to those in data science... but that still leaves a wide array of careers to choose from! Based on this, the aim of this notebook is to highlight differences in career paths in data science, using the 2021 Kaggle Machine Learning & Data Science Survey results as the primary data source. The set-up of this notebook is as follows:

- <span style="font-family:courier;color:green;">*Defining the Career Paths*:</span> 6 different career groups are defined. The idea is the reader chooses one of these career groups to learn more about, to better understand what is required to pursue any of the careers within their chosen career group
- <span style="font-family:courier;color:green;">*Quick Comparison*:</span> If the reader is unsure which career group to choose, this section gives a quick comparison to help them decide (this is the last chance to choose a career group!)
- <span style="font-family:courier;color:green;"> *Career Group Deep Dive*: </span> At this point, the reader has chosen a career group. This section provides more visuals and insight specific to the reader's chosen career group

The motivation for this notebook is to help students and unemployed individuals understand the skillset they need to get their foot in the door of a particular career path and to help individuals currently in a data science role understand the skillset they need to change their job role to another role within the data science field.

This notebook includes explanations for each of the visuals to ensure the reader is correctly interpreting the visuals. These explanations are in <span style="font-family:courier;color:green;"> green text </span>, so if you feel you understand a visual, you can skip over these segments. I hope this notebook inspires you and excites you about pursuing a career in data science! 😎 Comments and feedback are always welcome.

# Defining the Career Paths
Question 5 of the survey asks users for the job title most similar to their current role and gives 15 options to choose from, one of which is "Other" for cases where none of the job titles match the user's current role. Job titles can be pretty specific and, chances are, the required skillset for some of these roles is similar. Based on this, we can group some of these job titles together to create different career path groups, then look at each career path group in detail. This allows you, the reader, to understand which skills may be beneficial to develop to pursue a career in a particular group, without narrowing your skill development scope to a specific job title.

A [blog post by Udacity](https://www.udacity.com/blog/2018/01/4-types-data-science-jobs.html) did a great job of highlighting the different skills required for different roles within data science. Drawing on my own experience in a data analyst position I held at an AI education research institute, I added onto the Udacity model to illustrate the different skills used in different data science positions:

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud
pd.options.mode.chained_assignment = None

plt.rcParams.update({'font.size': 14})

# create dataframe with skill importance for different job categories
skills_df= pd.DataFrame({'Data Scientist': [3,3,3,3,3,3,2,2], 'Data Analyst': [3,3,2,2,1,1,1,1], 'Data Modeller': [3,2,2,2,3,1,3,1], 'Software engineer':[3,1,2,1,2,2,3,1], 'Manager': [2,3,3,2,1,1,1,1]}, 
                           columns= ['Data Scientist','Data Analyst','Data Modeller', 'Software engineer', 'Manager'], index= ['Programming tools','Data visualization & communication','Data intuition','Statistics', 'Data wrangling','Machine learning','Software engineering','Multivariate calculus & linear algebra'])

fig, ax= plt.subplots(figsize= (12,5))
sns.heatmap(skills_df, cmap= ['indianred','gold','mediumseagreen'], linecolor= 'black', linewidth= 1, cbar= False, square= False)
ax.xaxis.set_ticks_position('top')
plt.figtext(0.92,0.85,'Red', color= 'indianred', fontsize= 16)
plt.figtext(0.96, 0.85, 'indicates low importance', fontsize= 14)
plt.figtext(0.92, 0.8, 'Green', color= 'mediumseagreen', fontsize= 16)
plt.figtext(0.99, 0.8, 'indicates high importance', fontsize= 14)
plt.figtext(0.92, 0.75, 'Yellow', color= 'gold', fontsize= 16)
plt.figtext(0.99, 0.75, 'is somewhere in between', fontsize= 14)
plt.show()

<span style="font-family:courier;color:green;">Just to make sure you understand this figure, let's look at the Data Scientist job grouping. For any job role in this group, the following skills are very important: programming tools, data visualization & communication, data intuition, statistics, data wrangling, and machine learning. Software engineering and multivariate calculus & linear algebra are moderately important. Conversely, for the Manager job grouping, data visualization & communication and data intuition skills are very important, programming tools and statistics are moderately important, and data wrangling, machine learning, software engineering, and multivariate calculus & linear algebra are not important.</span>

Now that we understand that different skills are of different importance across data science job groups, we can assign each of the job titles listed in the Kaggle survey to one of these groups using each person's response to Q5. The mapping of job titles to career groups is shown below as well as the mean number of years of programming experience for all responses to Q5 and each of the career groups (normalized to age):

In [None]:
def convert_range(survey, question, question_num):
    """
    Converts an object column to be numeric.
    Use only on columns that represent ranges using strings (e.g. question 25 that asks yearly compensation range)
    Creates a new column in the survey dataframe that is the numeric representation of the string range
    
    Inputs:
    -------
    survey : pandas DataFrame with survey responses, including the range column to be changed to a numerical representation
    
    question : str specifying the question/column name to be converted from range (string) to numeric. This must match the column name in survey
                The column specified here must have no special characters. It should only contain numbers and a hyphen signifying the range
    
    question_num : str specifying the name of the new column
    
    Returns:
    -------
    survey : same as input pandas DataFrame but with additional column (question_num) that is the numeric representation of question column
    """
    # get two new cols, one for each extreme of the range
    survey[['temp1', 'temp2']] = survey[question].str.split('-',expand= True).astype(float)
    
    # numeric representation of the range is its middle value
    # for ranges with only 1 value (e.g. >1000000), temp2 is NaN; mean() skips NaN by default,
    # so the numeric representation is the single value (no chained assignment needed)
    survey[question_num] = survey[['temp1','temp2']].mean(axis= 1)
    
    # drop the cols that hold the range extremes
    survey= survey.drop(columns= ['temp1', 'temp2'])
    return survey

# load in survey data, add new column specifying career group
survey= pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', low_memory= False)[1:]
groups= [(survey['Q5'].isin(['Student','Currently not employed'])),
         (survey['Q5'].isin(['Business Analyst','Data Analyst','Statistician'])),
         (survey['Q5'].isin(['Data Engineer','DBA/Database Engineer'])),
         (survey['Q5'].isin(['Product Manager','Program/Project Manager'])),
         (survey['Q5'].isin(['Data Scientist','Machine Learning Engineer','Research Scientist'])),
         (survey['Q5'].isin(['Software Engineer'])),
         (survey['Q5'].isin(['Developer Relations/Advocacy','Other']))]
group_vals= ['Student or Unemployed','Data Analyst','Data Modeller','Manager','Data Scientist','Software engineer','Other']
survey['career_group']= np.select(groups, group_vals)

# convert yearly compensation ranges to numeric
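# regex= False treats '$', '>' and ',' as literal characters (as a regex, '$' would only match the end of the string)
survey['Q25'] = survey['Q25'].str.replace('$','', regex= False).str.replace('>','', regex= False).str.replace(',','', regex= False)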
survey= convert_range(survey,'Q25', 'Yearly compensation (USD)')

# convert age to numeric
# '+' is a regex quantifier, so strip it as a literal character with regex= False
survey['Q1'] = survey['Q1'].str.replace('+','', regex= False)
survey= convert_range(survey, 'Q1', 'Q1 num')

# convert years of programming experience to numeric
# '+' is a regex quantifier, so strip it as a literal character with regex= False
survey['Q6'] = survey['Q6'].str.replace('+','',regex= False).replace('<','',regex= True).replace('years','', regex= True).replace('I have never written code','0')
survey= convert_range(survey, 'Q6','Q6 num')

# create col that has normalized # of years programming
survey['Normalized number of years programming'] = survey['Q6 num'].copy(deep=True) / survey['Q1 num'].copy(deep=True)

# visualize mapping of job titles to career groups and average number of years programming for each job title and career group
fig= px.sunburst(survey, path= ['career_group','Q5'], color= 'Normalized number of years programming',
                 color_continuous_scale='thermal',title='Mapping of Job Titles to Career Groups & Avg Years of Programming Experience (Normalized)')
fig.update_layout(margin = dict(t=30))
fig.show()

<span style="font-family:courier;color:green;"> From this chart, based on mean number of years of programming experience (normalized), we can see the mapping of job titles to career groups is pretty good, meaning that individuals in roles that have been mapped to the same career group have similar programming experience. For example, for the Data Modeller career group that includes Data Engineers and DBA/Database Engineers, they have similar programming experience (0.185 and 0.199 respectively). </span>
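
To make "normalized" concrete, here is a minimal sketch of the idea on made-up rows: each string range becomes its midpoint, and the programming midpoint is divided by the age midpoint. The helper `range_midpoint` and the toy values are assumptions for illustration only; this is a plain-Python stand-in, not the notebook's `convert_range` function.

```python
def range_midpoint(text):
    """Midpoint of an 'a-b' range, or the single number when there is no hyphen."""
    parts = [float(p) for p in text.split("-")]
    return sum(parts) / len(parts)

# hypothetical (age range, years-of-programming range) pairs,
# already stripped of characters such as '+', '<' and 'years'
toy_rows = [("22-24", "1-3"), ("25-29", "5-10"), ("70", "20")]

normalized = []
for age_range, code_range in toy_rows:
    age = range_midpoint(age_range)
    years = range_midpoint(code_range)
    normalized.append(years / age)
    print(f"age {age:.1f}, programming {years:.1f} years, normalized {years / age:.3f}")
```

A value near 0.1 therefore reads as roughly one year of programming for every ten years of a person's life.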

Besides confirming that the mapping from job title to career group is acceptable, this chart gives us an idea of which career group has the lowest barrier to entry in terms of years of programming experience. So, if you have no idea which career group you want to pursue, you can choose one based on how much programming experience individuals in each group have. For example, based on this mapping, an average 23-year-old student will have 2.48 years of programming experience (23 years old × 0.108 years of experience per year of a person's life), an average 23-year-old in the data analyst career group will have 2.62 years, and an average 23-year-old in the software engineer career group will have 5.06 years. Therefore, based on years of programming experience alone, the 23-year-old student is closer to becoming a data analyst than a software engineer.
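
If you'd like to redo this comparison with your own numbers, the arithmetic can be sketched in a few lines. The figures below are the ones quoted in the paragraph above; the helper `closest_group` is a hypothetical name introduced just for this example.

```python
# figures quoted in the text: expected years of programming experience at age 23
years_at_23 = {
    "Student or Unemployed": 2.48,
    "Data Analyst": 2.62,
    "Software engineer": 5.06,
}

# the student figure is age multiplied by the normalized ratio given in the text
student_check = round(23 * 0.108, 2)

def closest_group(my_years, reference):
    """Hypothetical helper: the group whose reference value is nearest to my_years."""
    return min(reference, key=lambda group: abs(reference[group] - my_years))

# compare the student with the employed groups only
employed = {g: y for g, y in years_at_23.items() if g != "Student or Unemployed"}
gaps = {g: round(y - years_at_23["Student or Unemployed"], 2) for g, y in employed.items()}
nearest = closest_group(years_at_23["Student or Unemployed"], employed)

print(student_check, gaps, nearest)
```

Swapping in your own years of experience for the student value gives a rough first guess at the nearest career group, on programming experience alone.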

This method of choosing a career group to pursue is all well and good, but there are more factors than just experience and hard skills that people consider when choosing a career. So, before we jump into our career group portfolios, let's do a different quick comparison between the groups to help any of you that are still uncertain.

# Quick Comparison
According to a [survey](https://blog.edx.org/edx-survey-finds-1-3-americans-ages-25-44-completely-changed-fields-since-starting-first-job-post-college) conducted by edX, 39% of working Americans seek a career change for an increase in salary. So, for each of our career groups, let's look at their average yearly compensation and highest level of education achieved. This will give us an idea of the career path with the lowest barrier to entry when it comes to education and approximately how much we could expect to make if we were to pursue such a career path. Note that this treemap visual is interactive, so if some nested rectangles are difficult to read, you can hover your mouse over the rectangles to improve readability.

In [None]:
fig= px.treemap(survey.loc[(survey['career_group'] != 'Student or Unemployed') & (~survey['Yearly compensation (USD)'].isnull())], 
                path=[px.Constant('All professionals that provided a job title and yearly compensation (n= ' + str(len(survey.loc[(survey['career_group'] != 'Student or Unemployed') & (~survey['Yearly compensation (USD)'].isnull())])) + ')'),'career_group','Q4'], 
                color= 'Yearly compensation (USD)', color_continuous_scale= 'thermal',
                title= 'Education level & Avg Yearly Compensation (USD) by Career Group')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25), title_x= 0.5)
fig.show()

<span style="font-family:courier;color:green;"> From this treemap, we can see that, in every career group, most individuals have either a Bachelor's degree or a Master's degree, since the nested rectangles corresponding to these education levels are the largest in each career group. We can also see that individuals with a higher education level tend to make more than others within their career group. For example, looking at the Master's degree, Bachelor's degree, and Doctoral degree nested rectangles within the Data Scientist section, the coloring of the rectangles indicates the mean yearly salary is lowest for those with a Bachelor's degree and greatest for those with a Doctoral degree. This observation applies to all other career groups except for the Other group. Since we don't know all the specific roles within this group, it is not surprising that it shows more variability and deviates from what we would expect. </span>

Before we jump into individual career profiles, let's summarize everything we've seen so far, so that anyone still unsure of which career group to further investigate can decide now:

- The order of the career groups from lowest to highest mean normalized number of years of programming experience is as follows: Data Analyst, Other, Manager, Data Modeller, Data Scientist, and Software engineer
- In all career groups, most individuals have at least a Bachelor's degree. For all career groups except for Software engineer, most individuals have a Master's degree. So, if you're serious about pursuing a career in data science, you might want to consider getting serious about pursuing higher education
    - If you need motivation to pursue further education, for most career groups, individuals with a higher education tend to have a greater yearly compensation relative to others in their career group
- The order of the career groups from lowest to highest mean yearly compensation is as follows: Data Analyst, Software engineer, Data Modeller, Data Scientist, Other, and Manager. The career group-education level combination with the highest mean yearly compensation is Software engineers with a Doctoral degree
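
Rankings like the ones in this summary can be produced with a plain `groupby`. The sketch below uses a tiny made-up table (the compensation values are invented, so the resulting order is not the survey's); only the column names `career_group`, `Q4`, and `Yearly compensation (USD)` come from the notebook.

```python
import pandas as pd

# made-up rows for illustration only; not survey results
toy = pd.DataFrame({
    "career_group": ["Data Analyst", "Data Analyst", "Manager", "Manager", "Data Scientist"],
    "Q4": ["Bachelor's degree", "Master's degree", "Master's degree",
           "Doctoral degree", "Master's degree"],
    "Yearly compensation (USD)": [20000.0, 30000.0, 60000.0, 80000.0, 45000.0],
})

# mean compensation per career group, lowest to highest
by_group = toy.groupby("career_group")["Yearly compensation (USD)"].mean().sort_values()

# mean compensation per (career group, education level) pair, and the best-paid pair
by_group_edu = toy.groupby(["career_group", "Q4"])["Yearly compensation (USD)"].mean()
top_combo = by_group_edu.idxmax()

print(by_group)
print(top_combo)
```

Running the same two `groupby` calls on the filtered `survey` dataframe is one way to check the orderings listed above, assuming the compensation column has already been converted to numeric.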

# Career Group Deep Dive
## Data Scientist (& How to interpret the other career group subsections)
Now, using everything you've read so far in this notebook, choose a career group! The options are data analyst, data scientist, data modeller, software engineer, manager, and other. Once you've chosen your career group, you can navigate to the notebook subsection to view the visuals and analysis for that group via the right-hand pane... but, before you do that, read this section where we do a deep dive on the Data Scientist career group - this section gives guidelines on how to interpret the visuals in each career group subsection.

First we'll visualize the distribution of age groups and gender from survey responders in the Data Scientist career group:

In [None]:
# formatting and functions to create the visuals for each career group
sns.set_style('darkgrid')
sns.set_style(rc= {'figure.facecolor':'ghostwhite'})

def age_gender_plot(df, career_name):
    """
    Function to plot distribution of age group and gender for survey respondents belonging to a specific career group
    Creates and displays two vertical 100% stacked bar charts (one for age/Q1, one for gender/Q2)
    
    Inputs:
    -------
    df : pandas DataFrame with the survey responses of all individuals belonging to a specific career group
    
    career_name : string specifying the name of the career group
                    options: Data Analyst, Data Scientist, Data Modeller, Software engineer, Manager, Other
    
    Returns:
    -------
    None (displays plots)
    
    """
    # visualize percentage of responses by age group and gender
    fig, axes= plt.subplots(1,2, figsize= (12,6))
    fig.suptitle("Distribution of Responses by Age (left) and Gender (right) - "+ career_name + " career group", fontsize='medium')

    df.groupby('career_group')['Q1'].value_counts(normalize= True).unstack('Q1').plot.bar(stacked=True, color= sns.color_palette('cubehelix', n_colors= len(df['Q1'].value_counts())), ax= axes[0])
    axes[0].legend(bbox_to_anchor= (2.6,1.05), title= 'Age groups',fontsize=14)
    axes[0].set_ylabel('% responses', fontsize= 14)
    axes[0].set_xticks([])
    axes[0].set_xlabel('')
    axes[0].tick_params('y', labelsize= 14)

    df.groupby('career_group')['Q2'].value_counts(normalize= True).unstack('Q2').reindex(columns= ['Man','Woman','Prefer to self-describe','Prefer not to say']).plot.bar(stacked= True, color= sns.color_palette('icefire',n_colors= len(df['Q2'].value_counts())), ax= axes[1])
    axes[1].legend(bbox_to_anchor= (2.1,1.05), title= 'Gender',fontsize=14)
    axes[1].set_xticks([])
    axes[1].set_xlabel('')
    axes[1].tick_params('y', labelsize= 14)
    plt.show()
    
    return

In [None]:
career_name= 'Data Scientist'
df= survey.loc[survey['career_group'] == career_name]
age_gender_plot(df, career_name)

<span style="font-family:courier;color:green;">For both the age and gender plots, the thicker the bar segment, the greater proportion of survey responders that belong to that group. These plots were included to give an idea of the demographic and gender of individuals in this career group.</span>
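
For anyone curious where the segment thickness comes from, here is a minimal sketch with made-up answers: `value_counts(normalize=True)` turns raw counts into shares that sum to 1, which is what a 100% stacked bar draws. The toy series below is an assumption for illustration, not survey data.

```python
import pandas as pd

# hypothetical gender answers (Q2) for one career group
toy_q2 = pd.Series(["Man", "Man", "Woman", "Man", "Prefer not to say"], name="Q2")

# normalize=True returns each answer's share of all responses
shares = toy_q2.value_counts(normalize=True)
print(shares)
```

A segment covering 0.6 of the bar means 60% of the responders in that group gave that answer.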

Let's take a look at the programming languages and integrated development environments (IDEs) most frequently used. This gives you an idea of the tools you can expect to use (and might need to brush up your skills on) if you pursue a career in this group.

In [None]:
def langs_ides(df, career_name):
    """
    Function to create stem plot of languages used on a regular basis (Q7) and radar plot of IDEs used on a regular basis (Q9)
    
    Inputs:
    -------
    df : pandas DataFrame with the survey responses of all individuals belonging to a specific career group
    
    career_name : string specifying the name of the career group
                    options: Data Analyst, Data Scientist, Data Modeller, Software engineer, Manager, Other
    
    Returns:
    -------
    None (displays stem plot and radar plot)
    """
    q7_summ= np.unique(df[['Q7_Part_1','Q7_Part_2','Q7_Part_3','Q7_Part_4','Q7_Part_5','Q7_Part_6','Q7_Part_7','Q7_Part_8','Q7_Part_9','Q7_Part_10','Q7_Part_11','Q7_Part_12']].values.flatten().astype(str),return_counts=True)

    # visualize languages most frequently used
    plt.figure(figsize=(20,4))
    plt.stem(np.delete(q7_summ[0], obj= -1), np.delete(q7_summ[1], obj= -1).astype(int))
    plt.title('Languages most frequently used - ' + career_name + ' career group',fontsize= 18)
    plt.ylabel('Count of respondents')
    plt.tick_params('x',labelsize=14);plt.tick_params('y',labelsize= 14)
    plt.show()

    # visualize IDEs most frequently used
    q9_summ= np.unique(df[['Q9_Part_1','Q9_Part_2','Q9_Part_3','Q9_Part_4','Q9_Part_5','Q9_Part_6','Q9_Part_7','Q9_Part_8','Q9_Part_9','Q9_Part_10','Q9_Part_11','Q9_Part_12']].values.flatten().astype(str),return_counts=True)
    fig= px.line_polar(r=np.delete(q9_summ[1], obj= -1).astype(int), 
                        theta=np.delete(q9_summ[0], obj= -1),
                        line_close= True, title='IDEs most frequently used - ' + career_name + ' career group')
    fig.update_traces(fill='toself')
    fig.update_layout(title_x= 0.5)
    fig.show()
    
    return

langs_ides(df, career_name)

<span style="font-family:courier;color:green;">Looking at the stem plot, as an example, if a lot of survey responders in this career group say that they regularly use Python, that's a good indicator that you would also regularly use Python in this career path (so you might want to consider practicing and brushing up your Python skills). The same idea goes for the radar plot illustrating the count of responders that use different IDEs: the farther the shape extends toward a specific IDE, the more survey responders regularly use this IDE (which is a good indicator that you would also regularly use this IDE in this career group). </span>
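
The counts behind these plots come from multi-select questions, where each answer option has its own column holding either the option's name or a missing value. As an alternative sketch (not the `np.unique` approach used in `langs_ides`), the same tally can be made with `melt` and `value_counts`, which skips missing values by default. The toy rows below are made up; only the `Q7_Part_*` naming follows the survey.

```python
import numpy as np
import pandas as pd

# made-up multi-select answers: one column per language option
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan, "Python"],
    "Q7_Part_2": ["R", np.nan, np.nan, "R"],
    "Q7_Part_3": ["SQL", "SQL", "SQL", np.nan],
})

# stack all option columns into one column, then count each language
counts = toy.melt(value_name="language")["language"].value_counts()
print(counts)
```

Because missing values are dropped during counting, there is no need to remove a trailing `'nan'` entry by position.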

These plots give us a good idea of the programming and development tools used in this career group, but we can take it one step further and visualize the data analysis tools and specific software used in this career group:

In [None]:
def software_plot(df, career_name):
    """
    Function to create treemap of data analysis tool categories (Q41), specific big data and business intelligence software regularly used (Q33 & Q35), and mean size of data science team (Q22)
    
    Inputs:
    -------
    df : pandas DataFrame with the survey responses of all individuals belonging to a specific career group
    
    career_name : string specifying the name of the career group
                    options: Data Analyst, Data Scientist, Data Modeller, Software engineer, Manager, Other
    
    Returns:
    -------
    None (displays treemap)
    """
    # get big data and business intelligence software products in the same column for treemap
    data_prods= df[['Data Science Team Size','Q41','Q33','Q35']].copy()
    data_prods_copy= df[['Data Science Team Size','Q41','Q33','Q35']].copy()
    data_prods_copy['Q33']= df['Q35']
    data_prods= pd.concat([data_prods,data_prods_copy])

    # visualize primary data analysis tool and software within those tools. colour by data science team size
    fig= px.treemap(data_prods.loc[~(data_prods['Q41'].isnull()) & (data_prods['Q33'].isin(data_prods['Q33'].value_counts().index[:5])) & (data_prods['Q35'].isin(data_prods['Q33'].value_counts().index[:]))] , 
                    path=['Q41','Q33'], 
                    color= 'Data Science Team Size',
                    color_continuous_scale= 'thermal',
                    title= 'Primary Data Analysis Tools & Specific Software by Avg Data Science Team Size - ' + career_name + ' career group'
                    )
    fig.update_traces(root_color="lightgrey")
    fig.update_layout(margin = dict(t=50, l=25, r=25, b=25), title_x= 0.5)
    fig.show()
    return

# convert Q22 (# of individuals responsible for data science) to numeric
# '+' is a regex quantifier, so strip it as a literal character with regex= False
df['Q22'] = survey['Q22'].str.replace('+','', regex= False)
df= convert_range(df, 'Q22', 'Data Science Team Size')

# create visuals
software_plot(df, career_name)

<span style="font-family:courier;color:green;"> Just like the other treemap at the beginning of this notebook, this treemap is interactive, so you can hover your mouse over any nested rectangle to improve readability. In this treemap, we can see the categories of data analysis tools used in this career group as well as the most-used software (either big data or business intelligence software) in each category. This gives you an idea of whether big data or business intelligence software is used more often in this career group as well as how software is used (e.g. for basic statistical analysis, cloud-based software development, etc.). This treemap is also coloured according to the average size of the data science team, so brighter rectangles correspond to software used by respondents on larger data science teams on average, which may suggest more collaborative work with that software. </span>

Now that we have an idea of the programming, development, and data analysis tools used in this career group, let's look at the available industry options:

In [None]:
def plot_industry(df, career_name):
    """
    Function to create horizontal bar plot of mean company size (Q21), mean data science team size (Q22) by industry (Q20)
    
    Inputs:
    -------
    df : pandas DataFrame with the survey responses of all individuals belonging to a specific career group
    
    career_name : string specifying the name of the career group
                    options: Data Analyst, Data Scientist, Data Modeller, Software engineer, Manager, Other
    
    Returns:
    -------
    None (displays barplot)
    """
    # visualize company size and size of data science team by industry
    # numeric_only= True keeps mean() from failing on the text columns in recent pandas versions
    by_industry= df.loc[(~df['Q20'].isnull()) & (~df['Company Size'].isnull()) & (~df['Data Science Team Size'].isnull())].groupby('Q20').mean(numeric_only= True).sort_values('Data Science Team Size', ascending= True)
    fig= px.bar(by_industry,
                y= by_industry.index,
                x= 'Data Science Team Size',
                color='Company Size',
                labels= {'y': '', 'Data Science Team Size': 'Mean Size of Data Science Team', 'Company Size':'Mean Company Size'},
                title= 'Avg Company Size & Data Science Team Size by Industry - '+ career_name + ' career group')
    fig.update_layout(title_x= 0.5)
    fig.show()
    return
    

# convert Q21 (size of company) to numeric
df['Q21'] = df['Q21'].str.replace(' employees','', regex= True).replace(' or more','', regex=True).replace(',','',regex= True)
df= convert_range(df, 'Q21', 'Company Size')

# create visuals
plot_industry(df,career_name)

<span style="font-family:courier;color:green;"> Here we can see mean company size (via color) and mean size of data science team (via bar length) by industry. Brighter bars indicate industries with larger company sizes and longer bars indicate industries with larger data science teams - bright and long bars may suggest the best opportunities with the lowest barrier to entry, since the company sizes and data science teams are larger on average, so there may be more job postings and opportunity for growth within data science. </span>

At this point, we have a pretty good idea of the programming skills and IDEs we should expect to become comfortable with should we choose to pursue a career in our chosen career group. We also have an idea of the different business intelligence and big data software products used and how they're used in data analysis, as well as how average company size and data science team size vary across industries... We've looked at a lot! You're probably pretty excited to start your job hunt and prep work to help you succeed in this career path. The last visual we'll look at is a WordCloud of the data science courses that survey responders have begun or completed to help them improve on skills used in this career path. <span style="font-family:courier;color:green;"> In the WordCloud, course resources with larger font were used more than those with smaller font - you may find those resources displayed in a larger font more useful than those resources displayed in a smaller font. </span>

In [None]:
def course_wordcloud(df, career_name):
    """
    Function to create WordCloud of data science course resources (Q40) used by survey responders in a specific career group
    
    Inputs:
    -------
    df : pandas DataFrame with the survey responses of all individuals belonging to a specific career group
    
    career_name : string specifying the name of the career group
                    options: Data Analyst, Data Scientist, Data Modeller, Software engineer, Manager, Other
    
    Returns:
    -------
    None (displays WordCloud)
    """

    q40_summ= np.unique(df[['Q40_Part_1','Q40_Part_2','Q40_Part_3','Q40_Part_4','Q40_Part_5','Q40_Part_6','Q40_Part_7','Q40_Part_8','Q40_Part_9','Q40_Part_10','Q40_Part_11']].values.flatten().astype(str), return_counts= True)
    q40_dict= dict(zip(*q40_summ))
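    # drop the missing-value and 'None' entries (pop with a default avoids a KeyError if a key is absent)
    q40_dict.pop('nan', None)
    q40_dict.pop('None', None)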

    # create word cloud to illustrate best sources for courses to get started
    wordcloud = WordCloud(colormap ='Wistia').generate_from_frequencies(q40_dict)
    plt.figure(figsize=(10,10))
    plt.imshow(wordcloud, interpolation= 'bilinear')
    plt.title('Course Resources to Get Started - ' + career_name + ' career group')
    plt.axis("off")
    plt.margins(x=0, y=0)
    plt.show()
    return

course_wordcloud(df, career_name)

This concludes our deep dive on the Data Scientist career group (and the explanation of how to interpret the visuals in this deep dive section of this notebook). The following subsections have the visuals for the other career groups and do not include interpretations of the visuals - use what you learned about visual interpretation in this section to understand what skillset is key to develop to best set yourself up for a career in a specific career group!

## Data Analyst

In [None]:
career_name= 'Data Analyst'
df= survey.loc[survey['career_group'] == career_name]
age_gender_plot(df, career_name)

In [None]:
langs_ides(df, career_name)

In [None]:
df['Q22'] = survey['Q22'].str.replace('+','', regex= False)
df= convert_range(df, 'Q22', 'Data Science Team Size')

software_plot(df, career_name)

In [None]:
df['Q21'] = df['Q21'].str.replace(' employees','', regex= True).replace(' or more','', regex=True).replace(',','',regex= True)
df= convert_range(df, 'Q21', 'Company Size')

plot_industry(df, career_name)

In [None]:
course_wordcloud(df, career_name)

## Data Modeller

In [None]:
career_name= 'Data Modeller'
df= survey.loc[survey['career_group'] == career_name]
age_gender_plot(df, career_name)

In [None]:
langs_ides(df, career_name)

In [None]:
df['Q22'] = survey['Q22'].str.replace('+','', regex= False)
df= convert_range(df, 'Q22', 'Data Science Team Size')

software_plot(df, career_name)

In [None]:
df['Q21'] = df['Q21'].str.replace(' employees','', regex= True).replace(' or more','', regex=True).replace(',','',regex= True)
df= convert_range(df, 'Q21', 'Company Size')

plot_industry(df, career_name)

In [None]:
course_wordcloud(df, career_name)

## Software engineer

In [None]:
career_name= 'Software engineer'
df= survey.loc[survey['career_group'] == career_name]
age_gender_plot(df, career_name)

In [None]:
langs_ides(df, career_name)

In [None]:
df['Q22'] = survey['Q22'].str.replace('+','', regex= False)
df= convert_range(df, 'Q22', 'Data Science Team Size')

software_plot(df, career_name)

In [None]:
df['Q21'] = df['Q21'].str.replace(' employees','', regex= True).replace(' or more','', regex=True).replace(',','',regex= True)
df= convert_range(df, 'Q21', 'Company Size')

plot_industry(df, career_name)

In [None]:
course_wordcloud(df, career_name)

## Manager

In [None]:
career_name= 'Manager'
df= survey.loc[survey['career_group'] == career_name]
age_gender_plot(df, career_name)

In [None]:
langs_ides(df, career_name)

In [None]:
df['Q22'] = survey['Q22'].str.replace('+','', regex= False)
df= convert_range(df, 'Q22', 'Data Science Team Size')

software_plot(df, career_name)

In [None]:
df['Q21'] = df['Q21'].str.replace(' employees','', regex= True).replace(' or more','', regex=True).replace(',','',regex= True)
df= convert_range(df, 'Q21', 'Company Size')

plot_industry(df, career_name)

In [None]:
course_wordcloud(df, career_name)

## Other

In [None]:
career_name= 'Other'
df= survey.loc[survey['career_group'] == career_name]
age_gender_plot(df, career_name)

In [None]:
langs_ides(df, career_name)

In [None]:
df['Q22'] = survey['Q22'].str.replace('+','', regex= False)
df= convert_range(df, 'Q22', 'Data Science Team Size')

software_plot(df, career_name)

In [None]:
df['Q21'] = df['Q21'].str.replace(' employees','', regex= True).replace(' or more','', regex=True).replace(',','',regex= True)
df= convert_range(df, 'Q21', 'Company Size')

plot_industry(df, career_name)

In [None]:
course_wordcloud(df, career_name)

# Summary of All Career Groups

Hopefully at this point you've read through the section on your career group and have an idea of what you can expect should you pursue a career in this group. Across all the career groups, there are some commonalities that are highlighted here:

- There's a greater proportion of men than women + other genders combined - over 70% of survey respondents in each career group are men
- Python and SQL are the two languages regularly used by the most people within each career group 
- Jupyter Notebook was the IDE used by the most people within each career group
- Each career group regularly uses a mix of business intelligence and big data software - there is not one career group that regularly uses only big data products (e.g. MySQL) or only business intelligence products (e.g. Microsoft Power BI)
- The industries with the largest data science teams are Insurance/Risk Assessment and Accounting/Finance - for all career groups, these two industries were within the top 5 industries with the largest data science teams
    - Industries with larger data science teams also tended to have larger company sizes (i.e. data science team size tends to increase with company size)
- The top data science course resources used by individuals across the career groups were Coursera, Kaggle Learn Courses, and Udemy, suggesting that these platforms offer a variety of courses that can enable you to develop a diverse skillset

I hope this notebook was inspiring and informative and will help you jump start the career of your dreams in data science! Comments and feedback are always welcome and good luck in your data science career 😎