# Helping Data scientists to be Competitive

Based on insights from the survey data, the following factors are taken to indicate that a data scientist is less competitive in the field:
1. Working on less important tasks rather than the most important ones.
2. Lacking perseverance.
3. Being less confident about sharing or participating in the community, or being confused.


Under these assumptions, let's try to find the features that influence how competitive data scientists are, and help them with a report on where to develop their skills.

The data science factors behind the statements above are:
1. Less important tasks - the tools and products one uses are in the bottom ranks of those most used by data scientists worldwide.
2. Lack of perseverance - one does not engage in learning or spend money on it.
3. Less confident - one does not participate in the data science community. Confused - one took a very long time to complete this survey.

## Table of Contents

* [Exploratory Data Analysis](#section-one)
    - [Experience in ML Vs Experience in Programming](#subsection-one)
    - [Features affecting Perseverance in a Data Scientist](#subsection-two)
    - [Finding less important tools/products used by Data Scientists](#subsection-three)
        - [In Machine Learning](#subsection-three-subone)
        - [In Big Data](#subsection-three-subtwo)
        - [In Business Intelligence](#subsection-three-subthree)
    - [Features affecting Confidence in a Data Scientist](#subsection-four)
    - [Were you in confusion?](#subsection-five)
    
* [Analysis of Competence in Data Scientists](#section-two)
    - [Data Processing and Categorization](#subsection-oneintwo)
    - [Visualizing the Findings from the above observations](#subsection-twointwo)
    - [Deep Dive into the Less Competitive Data scientists](#subsection-threeintwo)
    - [Let me see What category I fit into?](#subsection-fourintwo)
* [Disclaimer and Conclusion](#section-three)



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
data = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')

data = data.drop(0,axis=0)

data.head()

<a id="section-one"></a>
# Exploratory Data Analysis

<a id="subsection-one"></a>
## Experience in ML Vs Experience in Programming

Analysis performed:
Finding the relationship between coding experience and machine learning experience among the data scientists in the survey.

Questions from the survey under analysis:

1. For how many years have you been writing code and/or programming?
2. For how many years have you used machine learning methods?

In [None]:
fig = make_subplots(rows=1,cols=2,shared_yaxes=True,
                   subplot_titles=['Programming Experience among Data scientists','ML Experience among Data scientists'])

prog_exp = data['Q6'].dropna()
ml_exp = data['Q15'].dropna()

fig.add_trace(go.Bar(x=prog_exp.value_counts().keys(), 
             y=prog_exp.value_counts().values,
             text=prog_exp.value_counts().values,
                    textposition='auto',showlegend=False),1,1)

fig.add_trace(go.Bar(x=ml_exp.value_counts().keys(), 
             y=ml_exp.value_counts().values,
             text=ml_exp.value_counts().values,
                    textposition='auto',showlegend=False),1,2)
fig.update_xaxes(title_text="Programming Experience", row=1, col=1)
fig.update_yaxes(title_text="Number of people", row=1, col=1)

fig.update_xaxes(title_text="ML Experience", row=1, col=2)
fig.show()

The graph above shows that the number of years of programming or ML experience alone does not support any assumption about a data scientist's competence. Hence, the relationship between ML and programming experience is analysed below.

In [None]:
exp = data[['Q6',
            'Q15']].copy()  # copy so the in-place edits below do not touch a view of `data`
exp.columns= ['prog_exp','ml_exp']
exp.dropna(inplace=True)
fig = go.Figure()
for prog_exp in exp['prog_exp'].unique():
    pg_group = exp[exp['prog_exp']==prog_exp]
    dat =pg_group['ml_exp'].value_counts()
    fig.add_trace(go.Bar(x=dat.index, y=dat.values,name='Program Exp: '+prog_exp,
                        text=dat.values,
                        textposition='auto'))
fig.update_xaxes(title_text="ML Experience")
fig.update_yaxes(title_text="Number of People")
fig.update_layout(title='Machine learning Experience Vs Programming Experience')
fig.show()

As you can see, ML experience and programming experience tend to go hand in hand for a competitive data scientist.

**For example**:  
Data scientists with under one year of ML experience (the first group of bars) most often have 1-2 years or under 1 year of programming experience.

Now, let us look at the relationship the other way round (Programming Vs ML).

In [None]:
exp = data[['Q6',
            'Q15']].copy()  # copy so the in-place edits below do not touch a view of `data`
exp.columns= ['prog_exp','ml_exp']
exp.dropna(inplace=True)
fig = go.Figure()
for ml_ex in exp['ml_exp'].unique():
    pg_group = exp[exp['ml_exp']==ml_ex]
    dat =pg_group['prog_exp'].value_counts()
    fig.add_trace(go.Bar(x=dat.index, y=dat.values,name='ML Exp: '+ml_ex,
                        text=dat.values,
                        textposition='auto'))
fig.update_xaxes(title_text="Programming Experience")
fig.update_yaxes(title_text="Number of People")
fig.update_layout(title='Programming Experience Vs ML Experience')
fig.show()

Machine learning experience correlates with programming experience. Naturally, machine learning experience goes together with programming experience; if yours does not, you need to improve your programming skills. And vice versa!
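The relationship read off the bar charts can also be checked numerically. The sketch below is only an illustration on made-up answers: the category labels, the `toy` frame and the ordering lists are assumptions, not the survey's actual values. It builds a contingency table and a Spearman rank correlation (computed as the Pearson correlation of the ranks).

```python
import pandas as pd

# Hypothetical ordered answer categories; the real survey labels may differ.
PROG_ORDER = ['< 1 years', '1-2 years', '3-5 years', '5-10 years']
ML_ORDER = ['Under 1 year', '1-2 years', '2-3 years', '3-4 years']

# Toy responses standing in for data[['Q6', 'Q15']].
toy = pd.DataFrame({
    'prog_exp': ['< 1 years', '1-2 years', '3-5 years',
                 '5-10 years', '1-2 years', '3-5 years'],
    'ml_exp': ['Under 1 year', 'Under 1 year', '1-2 years',
               '3-4 years', '1-2 years', '2-3 years'],
})

# Contingency table: rows are ML experience, columns are programming experience.
table = pd.crosstab(toy['ml_exp'], toy['prog_exp'])

# Map each label to its position in the ordered list.
prog_code = toy['prog_exp'].map({label: i for i, label in enumerate(PROG_ORDER)})
ml_code = toy['ml_exp'].map({label: i for i, label in enumerate(ML_ORDER)})

# Spearman correlation is the Pearson correlation of the ranks.
rho = prog_code.rank().corr(ml_code.rank())

print(table)
print('Spearman rho:', round(rho, 3))
```

A value of `rho` close to 1 would support the reading that the two kinds of experience grow together; on the real data the labels would first need to be put in their proper order.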


<a id="subsection-two"></a>
## Features affecting perseverance in a Data scientist
Analysis performed:
We consider the distribution of money spent by experienced data scientists and the learning path of any data scientist.

Questions from the survey under analysis:
1. Approximately how much money have you (or your team) spent on machine learning and/or cloud computing services at home (or at work) in the past 5 years (approximate USD)?
2. On which platforms have you begun or completed data science courses?
3. What is the highest level of formal education that you have attained or plan to attain within the next 2 years?



In [None]:

map_keys={}
map_keys['$0 ($USD)']= '0'
map_keys['$1-$99'] ='1-99'
map_keys['$10,000-$99,999'] = '10,000-99,999'
map_keys['$100-$999'] = '100-999'
map_keys['$1000-$9,999']= '1000-9,999'
map_keys['$100,000 or more ($USD)'] = '>100,000'


exp = data[['Q25',
            'Q15']].copy()  # copy so the in-place edits below do not touch a view of `data`
exp.columns= ['money_spent','ml_exp']
exp.dropna(inplace=True)

fig = go.Figure()
for ml_ex in exp['ml_exp'].unique():
    pg_group = exp[exp['ml_exp']==ml_ex]
    dat =pg_group['money_spent'].value_counts()
    fig.add_trace(go.Bar(x=[map_keys[i] for i in dat.index], y=dat.values,name='ML Exp: '+ml_ex,
                        text=dat.values,
                        textposition='auto'))
fig.update_xaxes(title_text="Money Spent in USD")
fig.update_yaxes(title_text="Number of People")
fig.update_layout(title='Machine learning Experience Vs Money spent')
fig.show()


The ranking of money spent to gain good experience in data science is taken from the survey. Since "**Perseverance**" is an important aspect of being competitive, respondents in the bottom 3 ranks of money spent for their level of ML experience get the tag "**Need to be more determined**".

Another feature to note is how much you learn in ML.

If learning has not started, or the course platform used is in the bottom three ranks of the survey, this counts against perseverance in becoming a data scientist. As this is an individual choice, its weight towards competitiveness is low. For example, if you are in formal education (data science or not), a separate course is not needed; this is determined by the survey question below.

**What is the highest level of formal education that you have attained or plan to attain within the next 2 years?**

In [None]:

subs = 'Q37' 
cols = [i for i in list(data.columns) if subs in i]
learn =data[cols]
platform={}
for col in cols:
    platform.update(learn[col].value_counts())
platform = dict( sorted(platform.items(),key=lambda item: item[1],
                           reverse=True)) 

course_platforms = list(platform.keys())[-3:]

fig = go.Figure(go.Bar(x=list(platform.values())
                       , y=list(platform.keys()), orientation='h'))
fig.update_layout(showlegend=False,title_text="Competitive Course platforms")
fig.show()
print(course_platforms)

If you are not enrolled in formal education, being involved with one of the top data science course platforms marks you as a determined data scientist.
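The "bottom three" rule used throughout this notebook can be written as a small helper. This is a minimal sketch on toy data: `bottom_k_choices`, the `toy` frame and its column names are hypothetical, and it assumes multi-select questions are stored one option per column, named `<question>_Part_N`.

```python
import pandas as pd

def bottom_k_choices(df, prefix, k=3):
    """Return the k least-selected options of a multi-select question.

    Assumes each option lives in its own column whose name starts with
    `prefix + '_'`, holding the option label when selected and NaN otherwise.
    """
    cols = [c for c in df.columns if c.startswith(prefix + '_')]
    # Pool all option columns into one Series and count each label.
    counts = pd.concat([df[c] for c in cols]).value_counts()
    # value_counts sorts by descending count, so the tail holds the rarest options.
    return list(counts.index[-k:])

# Toy stand-in for the survey frame (hypothetical values).
toy = pd.DataFrame({
    'Q37_Part_1': ['Coursera', 'Coursera', 'Coursera', None],
    'Q37_Part_2': ['edX', None, 'edX', None],
    'Q37_Part_3': [None, None, None, 'Udacity'],
    'Q3': ['India', 'Brazil', 'Japan', 'India'],
})

least_used = bottom_k_choices(toy, 'Q37', k=2)
print(least_used)
```

When several options share the same count, the order among them is not guaranteed, so the cut-off between "bottom k" and the rest can be arbitrary in that case.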

<a id="subsection-three"></a>
## Finding less important tools/products used by Data scientists
Analysis performed:
We consider the ranking of the tools/products most used by data scientists worldwide.

**In Machine Learning**

Questions from the survey under analysis:
1. What programming languages do you use on a regular basis?
2. What programming language would you recommend an aspiring data scientist to learn first?
3. Which of the following integrated development environments (IDE's) do you use on a regular basis?
4. Which of the following hosted notebook products do you use on a regular basis?
5. What type of computing platform do you use most often for your data science projects?
6. Which types of specialized hardware do you use on a regular basis?
7. Approximately how many times have you used a TPU (tensor processing unit)?
8. What data visualization libraries or tools do you use on a regular basis?
9. Which of the following machine learning frameworks do you use on a regular basis?
10. Which of the following ML algorithms do you use on a regular basis?
11. Which categories of computer vision methods do you use on a regular basis?
12. Which of the following natural language processing (NLP) methods do you use on a regular basis?
13. Which of the following cloud computing platforms do you use on a regular basis?
14. Do you use any of the following cloud computing products on a regular basis?
15. Do you use any of the following machine learning products on a regular basis?
16. Do you use any automated machine learning tools (or partial AutoML tools) on a regular basis?
17. Which of the following automated machine learning tools (or partial AutoML tools) do you use on a regular basis?
18. Do you use any tools to help manage machine learning experiments?



**In Big Data**
 
Questions from the survey under analysis:

1. Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis? 
2.	Which of the following big data products (relational database, data warehouse, data lake, or similar) do you use most often? 
3. What is the primary tool that you use at work or school to analyze data? 



**In Business Intelligence**

Questions from the survey under analysis:

1. Which of the following business intelligence tools do you use on a regular basis?
2. Which of the following business intelligence tools do you use most often?


<a id="subsection-three-subone"></a>
### In Machine Learning

In [None]:

subs = 'Q7' 
cols = [i for i in list(data.columns) if subs in i]
learn =data[cols]
platform={}

for col in cols:
    platform.update(learn[col].value_counts())
platform = dict( sorted(platform.items(),key=lambda item: item[1],
                           reverse=True)) 

fig = make_subplots(rows=1, cols=2,
                    subplot_titles=['Programming language in Practice',
                                    'Programming language for Aspiring Data scientist'])
fig.add_trace((go.Bar(x=list(platform.values())
                       , y=list(platform.keys()), orientation='h')),1,1)


prog = data['Q8'].dropna()

fig.add_trace((go.Bar(x=prog.value_counts().keys(), y=prog.value_counts().values)),1,2)
fig.update_layout(showlegend=False,title_text="Programming language in Practice Vs Programming language for Aspiring Data scientist")
fig.show()
dat = dict( sorted(prog.value_counts().items(),key=lambda item: item[1],
                           reverse=True)) 

pr_in_pract =  list(dat.keys())[-3:]
print(pr_in_pract)

If the programming language you use is among the top 3 languages recommended for aspiring data scientists, that makes you competitive.

**Tools and Products in ML**  
The ranking of the top tools and products in practice among data scientists worldwide gives an ML practitioner an idea of where to concentrate.

In [None]:
ml_tools={}
questions ={}
questions['Q9']='IDE for Data scientists'
questions['Q10']= 'Host notebooks for Data scientists'
questions['Q11']= 'Cloud computing for Data scientists'
questions['Q12']='Hardware for Data scientists'
questions['Q13']= 'TPU Usage by Data scientists'
questions['Q14']='Visualization for Data scientists'
questions['Q16']='ML Frameworks for Data scientists'
questions['Q17']='ML algorithms for Data scientists'
questions['Q26_A']='Cloud Computing platforms for Data scientists'
questions['Q27_A']='Cloud computing products for Data scientists'
questions['Q28_A']='ML  Products for Data scientists'
questions['Q33_A']= 'Auto ML usage for Data scientists'
questions['Q34_A']='Auto ML Tools for Data scientists'
questions['Q35_A']='Management tools in ML for Data scientists'
questions['Q19']='NLP for Data scientists'
questions['Q18'] = 'CV for Data scientists'
for subs in questions.keys():
    
    cols = [i for i in list(data.columns) if subs in i]
    if len(cols)>1:
        learn =data[cols]
        platform={}

        for col in cols:
            platform.update(learn[col].value_counts())
        platform = dict( sorted(platform.items(),key=lambda item: item[1],
                           reverse=True)) 
        ml_tools[subs]=list(platform.keys())[-2:]

        fig = go.Figure(data=[go.Pie(labels=list(platform.keys()), values=list(platform.values()),
                                    textinfo='label+percent',
                             insidetextorientation='radial')])
            
        fig.update_layout(showlegend=False,title_text=questions[subs])
        fig.show()
    else:
        
        prog = data[cols[0]].dropna()
        platform=dict( sorted(prog.value_counts().items(),key=lambda item: item[1],
                           reverse=True))
        ml_tools[subs]=list(platform.keys())[-2:]
        fig = go.Figure(data=[go.Pie(labels=prog.value_counts().keys(), 
                                     values=prog.value_counts().values,
                             textinfo='label+percent',
                             insidetextorientation='radial')])
        fig.update_layout(showlegend=False,title_text=questions[subs])
        fig.show()
    


<a id="subsection-three-subtwo"></a>
### In Big Data

In [None]:

subs = 'Q29_A'
cols = [i for i in list(data.columns) if subs in i]
learn =data[cols]
platform={}

for col in cols:
    platform.update(learn[col].value_counts())
platform = dict( sorted(platform.items(),key=lambda item: item[1],
                           reverse=True)) 


fig = make_subplots(rows=1, cols=2,
                    subplot_titles=['Big Data products in Practice',
                                    'Big data often used by Aspiring Data scientist'],
                   horizontal_spacing=0.4)
fig.add_trace((go.Bar(x=list(platform.values())
                       , y=list(platform.keys()), orientation='h')),1,1)


prog = data['Q30'].dropna()

fig.add_trace((go.Bar(x=prog.value_counts().values, y=prog.value_counts().keys(), orientation='h')),1,2)
fig.update_layout(showlegend=False,title_text="Big Data products in Practice Vs Big Data products for Aspiring Data scientist")
fig.show()

platform = dict( sorted(prog.value_counts().items(),key=lambda item: item[1],
                           reverse=True))
bd_tools=list(platform.keys())[-3:]
print(bd_tools)

Similar to programming in ML, if the big data tools/products you use are among the most used ones, you are tagged as a competitive big data practitioner.


In [None]:
prog = data['Q38'].dropna()
fig = go.Figure(data=[go.Pie(labels=prog.value_counts().keys(), 
                             values=prog.value_counts().values,
                     textinfo='label+percent',
                     insidetextorientation='radial')])
fig.update_layout(showlegend=False,title_text='Data analysis tool for Data scientists')
fig.show()
platform = dict( sorted(prog.value_counts().items(),key=lambda item: item[1],
                           reverse=True))
bd_analy=list(platform.keys())[-3:]
print(bd_analy)

For a big data practitioner, the data analysis tool is the most useful one in practice, so it is worth getting familiar with the most used tool.

<a id="subsection-three-subthree"></a>
### In Business Intelligence

In [None]:


subs = 'Q31'
cols = [i for i in list(data.columns) if subs in i]
learn =data[cols]
platform={}

for col in cols:
    platform.update(learn[col].value_counts())
platform = dict( sorted(platform.items(),key=lambda item: item[1],
                           reverse=True)) 

fig = make_subplots(rows=1, cols=2,
                    subplot_titles=['BI tools in Practice',
                                    'BI tools often used by Aspiring Data scientist'],
                   horizontal_spacing=0.4)
fig.add_trace((go.Bar(x=list(platform.values())
                       , y=list(platform.keys()), orientation='h')),1,1)


prog = data['Q32'].dropna()

fig.add_trace((go.Bar(x=prog.value_counts().values, y=prog.value_counts().keys(), orientation='h')),1,2)
fig.update_layout(showlegend=False,title_text="BI tools in Practice Vs BI tools for Aspiring Data scientist")
fig.show()


platform = dict( sorted(prog.value_counts().items(),key=lambda item: item[1],
                           reverse=True))
bi_tools=list(platform.keys())[-3:]
print(bi_tools)

If the business intelligence tools/products you use are among the most used ones, you are tagged as a competitive business intelligence practitioner.


From the above plots, we can analyse one's competence with respect to Machine learning, Big data and Business intelligence.
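The cells above pick a question's columns with a substring test (`subs in i`). A substring test also matches any sibling column that merely contains the same text. The sketch below shows a stricter pattern-based selection; the helper `select_question_columns` and the column names in `columns` are hypothetical, and whether the real survey file contains such sibling columns is an assumption that should be checked against `data.columns`.

```python
import re

def select_question_columns(columns, question):
    """Return columns named `<question>`, `<question>_Part_N` or `<question>_OTHER`."""
    pattern = re.compile(rf'^{re.escape(question)}(_Part_\d+|_OTHER)?$')
    return [c for c in columns if pattern.match(c)]

# Hypothetical column names following a `<question>_Part_N` convention.
columns = ['Q3', 'Q31_A_Part_1', 'Q31_A_Part_2', 'Q31_A_OTHER',
           'Q31_B_Part_1', 'Q32']

# Substring matching, as in the cells above.
loose = [c for c in columns if 'Q31' in c]

# Pattern matching restricted to one question variant.
strict = select_question_columns(columns, 'Q31_A')

print(loose)
print(strict)
```

With the toy names above, the substring test returns four columns while the strict pattern returns only the three `Q31_A` ones.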

<a id="subsection-four"></a>
## Features affecting Confidence in a Data Scientist

Analysis performed: 
We consider the sharing and learning platforms used by data scientists. Furthermore, the media sources data scientists follow are also considered. These factors show how confident a data scientist is about getting involved with the data science community.

Questions from the survey under analysis:

1. Where do you publicly share or deploy your data analysis or machine learning applications?
2. On which platforms have you begun or completed data science courses?
3. Who/what are your favorite media sources that report on data science topics?


In [None]:
conf={}
questions ={}
questions['Q36']='Sharing platforms for Data scientists'
questions['Q37']= 'EdTech for Data scientists'
questions['Q39']= 'Media sources for Data scientists'

for subs in questions.keys():
    cols = [i for i in list(data.columns) if subs in i]
    if len(cols)>1:
        learn =data[cols]
        platform={}

        for col in cols:
            platform.update(learn[col].value_counts())
        platform = dict( sorted(platform.items(),key=lambda item: item[1],
                           reverse=True)) 
        conf[subs]= list(platform.keys())[-3:]

        fig = go.Figure(data=[go.Pie(labels=list(platform.keys()), values=list(platform.values()),
                                    textinfo='label+percent',
                             insidetextorientation='radial')])
            
        fig.update_layout(showlegend=False,title_text=questions[subs])
        fig.show()
    else:
        
        prog = data[cols[0]].dropna()
        platform = dict(sorted(prog.value_counts().items(),key=lambda item: item[1],
                           reverse=True)) 
        conf[subs]= list(platform.keys())[-3:]
        fig = go.Figure(data=[go.Pie(labels=prog.value_counts().keys(), 
                                     values=prog.value_counts().values,
                             textinfo='label+percent',
                             insidetextorientation='radial')])
        fig.update_layout(showlegend=False,title_text=questions[subs])
        fig.show()
print(conf)

From the above, the ranking of sharing platforms, EdTech platforms and media sources used by confident data scientists is noted.
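One way to turn such a ranking into a per-respondent flag is a set intersection between a respondent's selections and the low-ranked options. The sketch below is an illustration only: `uses_any`, the `toy` frame and the `low_ranked` list are hypothetical, and it assumes each respondent's selections are already stored as a list.

```python
import pandas as pd

def uses_any(selected, low_ranked):
    """Return 1 if at least one selected option is in the low-ranked list, else 0."""
    return int(bool(set(selected) & set(low_ranked)))

# Toy stand-in: one list of selected options per respondent.
toy = pd.DataFrame({'Q36': [['GitHub', 'Kaggle'], ['Personal blog'], []]})

# Hypothetical bottom-ranked options for this question.
low_ranked = ['Personal blog', 'Streamlit']

toy['Q36_flag'] = toy['Q36'].apply(uses_any, low_ranked=low_ranked)
print(toy)
```

A respondent who selected nothing gets a flag of 0 under this rule, so non-participation would need a separate check if it should count as well.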

<a id="subsection-five"></a>
## Were you in confusion?

Analysis performed: 
We consider the time taken by the data scientists to complete the survey and group them into three categories:

* Highly confused - if you rushed through the survey or took a lot of time to complete it
* Confused - if you took more time than the normal range for data scientists
* Not confused - if your time is within the normal range for data scientists

Features considered:

* Duration in seconds
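The three categories can be sketched as equal-width bins over the durations, with very fast responses folded into the highest category. The numbers below are made up, and `fast_cutoff` and the labels `no`/`bit`/`yes` are assumptions chosen for the illustration.

```python
import numpy as np
import pandas as pd

# Toy durations in seconds (hypothetical values).
durations = pd.Series([150, 250, 400, 900, 1500, 2600, 3000])

# Assumed threshold below which a response counts as rushed.
fast_cutoff = 200

# Three equal-width bins over the non-rushed durations.
regular = durations[durations >= fast_cutoff]
_, edges = np.histogram(regular, bins=3)

# Label each duration by its bin; values below the first edge become NaN here.
labels = pd.cut(durations, bins=edges, labels=['no', 'bit', 'yes'],
                include_lowest=True)

# Rushed responses are treated like the slowest group.
labels = labels.astype(object).where(durations >= fast_cutoff, 'yes')

print(labels.tolist())
```

Because the bins have equal width rather than equal counts, a few very long durations can stretch the range so that almost everyone lands in the first bin; quantile-based bins would be an alternative design.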


In [None]:

# Convert the survey duration to integers once and summarise it.
duration = data['Time from Start to Finish (seconds)'].astype('int')

print('Stats of the Duration in seconds')
duration.describe()

In [None]:

data['Time from Start to Finish (seconds)']=data['Time from Start to Finish (seconds)'].astype('int')
counts, bin_edges = np.histogram(data[data['Time from Start to Finish (seconds)']>=200]['Time from Start to Finish (seconds)'].values, bins=3)
confused={}
keys = ['no','bit','yes']
for i in range(len(bin_edges)-1):
    confused[keys[i]] =[bin_edges[i] , bin_edges[i+1]]


data['Time from Start to Finish (seconds)'].hist()
print(confused)

<a id="section-two"></a>
# Analysis of Competence in Data Scientists


<a id="subsection-oneintwo"></a>
## Data Processing and Categorization


In [None]:
def combine(x,lst):
    if len(set(x)&set(lst))>=1:
        return 1
    else:
        return 0
    
mult_cols =['Q37','Q9','Q10','Q12','Q14','Q16','Q17','Q26_A','Q27_A','Q28_A',
            'Q29_A','Q31','Q33_A','Q34_A','Q35_A','Q19','Q18', 'Q36','Q37','Q39'
           ]
for subs in mult_cols:
    cols = [i for i in list(data.columns) if subs in i]
    learn =data[cols].astype('str')
    data[subs] = learn.apply(lambda x: '+'.join(x), axis = 1)

    data[subs] = data[subs].str.split('+')


In [None]:
mlvsprgm = {}
exp = data[['Q6',
            'Q15']]
exp.columns= ['prog_exp','ml_exp']
exp.dropna(inplace=True)
fig = make_subplots(rows=3, cols=3,
                    subplot_titles=tuple(exp['ml_exp'].unique()),
                    specs=[[{"type": "table"},{"type": "table"},{"type": "table"}],
                           [{"type": "table"},{"type": "table"},{"type": "table"}],
                           [{"type": "table"},{"type": "table"},{"type": "table"}]])

for ml_ex in exp['ml_exp'].unique():
    pg_group = exp[exp['ml_exp']==ml_ex]['prog_exp']
    dat =pg_group.value_counts().sort_index()
    dat =dict( sorted(dat.items(),key=lambda item: item[1],
                           reverse=True))
    if ml_ex!='I do not use machine learning methods':
        mlvsprgm[ml_ex]=list(dat.keys())[-3:]

mlvsspend={}
exp = data[['Q25',
            'Q15']]
exp.columns= ['money_spent','ml_exp']
exp.dropna(inplace=True)
for ml_ex in exp['ml_exp'].unique():
    pg_group = exp[exp['ml_exp']==ml_ex]['money_spent']
    dat =pg_group.value_counts().sort_index()
    dat =dict( sorted(dat.items(),key=lambda item: item[1],
                           reverse=True))
    mlvsspend[ml_ex]=list(dat.keys())[-3:]

In [None]:
def Anaylse_competitiveness(data):
    programming_improvement=[]
    perseverence_needed=[]
    ML_improvement={}
    Big_data=[]
    BI=[]
    Confidence=[]
    Confused={}
    education=['Doctoral degree', 'Master’s degree', 'Bachelor’s degree',
               'Some college/university study without earning a bachelor’s degree',
           'Professional degree']
    for k, v in mlvsprgm.items():
        ind_k = data[data['Q15']==k].index
        ind_v = data.loc[ind_k][data.loc[ind_k]['Q6'].isin(v)].index
        programming_improvement.extend(ind_v)
        ind_v = data.loc[ind_k][data.loc[ind_k]['Q8'].isin(pr_in_pract)].index
        programming_improvement.extend(ind_v)
    for k, v in mlvsspend.items():
        ind_e = data[~data['Q4'].isin(education)].index  # `~` is the boolean NOT for a mask
        data_ed = data.loc[ind_e]
        ind_k = data_ed[data_ed['Q15']==k].index
        ind_v = data_ed.loc[ind_k][data_ed.loc[ind_k]['Q25'].isin(v)].index
        perseverence_needed.extend(ind_v)
        data['Q37_res'] = data_ed.loc[ind_k]['Q37'].apply(lambda x: combine(x,course_platforms))
        ind_v = data[data['Q37_res']==1].index
        perseverence_needed.extend(ind_v)

    cloud_computing=['Q9','Q11','Q26_A','Q27_A','Q13','Q10','Q13']
    ML_core = ['Q16','Q17','Q28_A','Q33_A','Q34_A','Q35_A']
    nlp=['Q19']
    cv=['Q18']
    cl, ml_c, NLP, CV=[],[],[],[]
    for k,v in ml_tools.items():
        if k in cloud_computing:
            if k in mult_cols:
                col =k+'res'
                data[col] = data[k].apply(lambda x: combine(x,v))
                ind_v = data[data[col]==1].index
                cl.extend(ind_v)
            else:
                ind_v = data[data[k].isin(v)].index
                cl.extend(ind_v)
        elif k in ML_core:
            if k in mult_cols:
                col =k+'res'
                data[col] = data[k].apply(lambda x: combine(x,v))
                ind_v = data[data[col]==1].index
                ml_c.extend(ind_v)
            else:
                ind_v = data[data[k].isin(v)].index
                ml_c.extend(ind_v)
        elif k in nlp:
            if k in mult_cols:
                col =k+'res'
                data[col] = data[k].apply(lambda x: combine(x,v))
                ind_v = data[data[col]==1].index
                NLP.extend(ind_v)
            else:
                ind_v = data[data[k].isin(v)].index
                NLP.extend(ind_v)
        elif k in cv:
            if k in mult_cols:
                col =k+'res'
                data[col] = data[k].apply(lambda x: combine(x,v))
                ind_v = data[data[col]==1].index
                CV.extend(ind_v)
            else:
                ind_v = data[data[k].isin(v)].index
                CV.extend(ind_v)
    ML_improvement['Cloud_computing'] = np.unique(cl)
    ML_improvement['ML_core'] = np.unique(ml_c)
    ML_improvement['NLP'] = np.unique(NLP)
    ML_improvement['CV'] = np.unique(CV)


    data['Q29_A_res'] = data['Q29_A'].apply(lambda x: combine(x,bd_tools))
    ind_v = data[data['Q29_A_res']==1].index
    Big_data.extend(ind_v)
    ind_v =  data[data['Q38'].isin(bd_analy)].index
    Big_data.extend(ind_v)
    data['Q31_res'] = data['Q31'].apply(lambda x: combine(x,bi_tools))
    ind_v = data[data['Q31_res']==1].index
    BI.extend(ind_v)


    for k,v in conf.items():
        if k in mult_cols:
            col =k+'res'
            data[col] = data[k].apply(lambda x: combine(x,v))
            ind_v = data[data[col]==1].index
            Confidence.extend(ind_v)
        else:
            ind_v = data[data[k].isin(v)].index
            Confidence.extend(ind_v)
    data['Time from Start to Finish (seconds)'] = data['Time from Start to Finish (seconds)'].astype('int')
    for k,v in confused.items():
        start =v[0]
        end=v[1]
        ind_v = data[(data['Time from Start to Finish (seconds)']>=start) &
             (data['Time from Start to Finish (seconds)']<=end)].index
        Confused[k]=ind_v
    ind_v = data[data['Time from Start to Finish (seconds)']<=200].index
    Confused['yes'] = list(Confused['yes'])+list(ind_v)

    programming_improvement =set(programming_improvement)    
    perseverence_needed = set(perseverence_needed)
    Big_data = set(Big_data)
    BI = set(BI)
    Confidence = set(Confidence)
    return programming_improvement, perseverence_needed, Big_data, BI, Confidence, ML_improvement,Confused


In [None]:
programming_improvement, perseverence_needed, Big_data, BI, Confidence, ML_improvement,Confused = Anaylse_competitiveness(data)


In [None]:
Comp = []
Comp.extend(list(programming_improvement))
Comp.extend(list(perseverence_needed))
Comp.extend(list(Big_data))
Comp.extend(list(BI))
Comp.extend(list(ML_improvement['Cloud_computing']))
Comp.extend(list(ML_improvement['ML_core']))
Comp.extend(list(ML_improvement['NLP']))
Comp.extend(list(ML_improvement['CV']))
Comp.extend(list(Confidence))
Comp.extend(list(Confused['yes']))
Comp.extend(list(Confused['bit']))
Comp =np.unique(Comp)
fig = go.Figure(data=[go.Pie(labels=['Less Competitive Data scientists','Competitive Data scientists'], 
                             values=[len(Comp),len(data)-len(Comp)],
                                    textinfo='label+percent',
                             insidetextorientation='radial')])
            
fig.update_layout(showlegend=False,title_text='Data scientists found Competitive based on the assumptions')
fig.show()

We find that about 63% of the data scientists in the survey are less competitive. No offense; this is based only on the assumptions above and the survey data.
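The share above comes from taking the union of all flagged respondents across categories. The sketch below shows the arithmetic on made-up index sets; the `flags` dictionary and `total_respondents` are hypothetical.

```python
# Hypothetical sets of flagged respondent indices per category.
flags = {
    'programming': {1, 2, 5},
    'perseverance': {2, 3},
    'confidence': {5, 8},
}
total_respondents = 10

# A respondent flagged in several categories is counted once in the union.
flagged = set().union(*flags.values())
share = len(flagged) / total_respondents

# Summing the per-category sizes counts such respondents more than once.
category_total = sum(len(members) for members in flags.values())

print(f'{share:.0%} flagged, {category_total} category memberships')
```

Since per-category counts overlap, a chart built from them describes category memberships rather than distinct people.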

<a id="subsection-twointwo"></a>
## Visualizing the Findings from the above observations

In [None]:
Analysis={}
Analysis['Need to improve programming'] = len(programming_improvement)
Analysis['Need to be more determined'] = len(perseverence_needed)
Analysis['Need to improve in Cloud Computing'] = len(ML_improvement['Cloud_computing'])
Analysis['Need to improve in Core ML'] = len(ML_improvement['ML_core'])
Analysis['Need to improve in NLP'] = len(ML_improvement['NLP'])
Analysis['Need to improve in CV'] = len(ML_improvement['CV'])
Analysis['Need to improve in Big Data'] = len(Big_data)
Analysis['Need to improve in Business Intelligence'] = len(BI)
Analysis['Need to be more Confident'] = len(Confidence)
Analysis['Need to be more outspoken'] = len(Confused['yes']) + len(Confused['bit'])

fig = go.Figure(data=[go.Pie(labels=list(Analysis.keys()), 
                             values=list(Analysis.values()),
                                    textinfo='label+percent',
                             insidetextorientation='radial')])
            
fig.update_layout(showlegend=False,title_text='Analysis on Less Competitive Data scientists based on the assumptions')
fig.show()


For the 63% of data scientists found to be less competitive under the assumptions, we have analysed and reported advice on their competence in various fields such as cloud computing, programming, core ML, etc.

<a id="subsection-threeintwo"></a>
## Deep Dive into the Less Competitive Data scientists



In [None]:
LessComp_data = data.loc[Comp]

k,v = 'Q1','Age Group'
dat = LessComp_data[k].dropna()
dat = LessComp_data[k].value_counts()
fig = go.Figure(data=[go.Pie(labels=list(dat.keys()), values=list(dat.values),textinfo='none')])
fig.update_layout(showlegend=True,title_text='Analysis on Less Competitive Data scientists - '+v+'(in years)')
fig.show()

Adult data scientists are more competitive than those in the middle age categories. Evidently, growing up teaches the importance of determination!

In [None]:
LessComp_data = data.loc[Comp]

k,v = 'Q2','Gender'
dat = LessComp_data[k].dropna()
dat = LessComp_data[k].value_counts()
fig = go.Figure(data=[go.Pie(labels=list(dat.keys()), values=list(dat.values),textinfo='none')])
fig.update_layout(showlegend=True,title_text='Analysis on Less Competitive Data scientists - '+v)
fig.show()

The percentage of male respondents is high in the survey data, and this is reflected in the competitive split.

In [None]:
LessComp_data = data.loc[Comp]

k,v = 'Q3','Country'
dat = LessComp_data[k].dropna()
dat = LessComp_data[k].value_counts()
fig = go.Figure(data=[go.Pie(labels=list(dat.keys()), values=list(dat.values),textinfo='none')])
fig.update_layout(showlegend=True,title_text='Analysis on Less Competitive Data scientists - '+v)
fig.show()

In [None]:
LessComp_data = data.loc[Comp]

k,v = 'Q5','Job role'
dat = LessComp_data[k].dropna()
dat = LessComp_data[k].value_counts()
fig = go.Figure(data=[go.Pie(labels=list(dat.keys()), values=list(dat.values),textinfo='none')])
fig.update_layout(showlegend=True,title_text='Analysis on Less Competitive Data scientists - '+v)
fig.show()

<a id="subsection-fourintwo"></a>
## Let me see What category I fit into?

In [None]:
mydata = pd.read_csv('../input/my-data/mydata.csv')
header= mydata.loc[0]
mydata = mydata.drop(0,axis=0)
mydata.columns= header
mydata.head()

In [None]:
mult_cols =['Q37','Q9','Q10','Q12','Q14','Q16','Q17','Q26_A','Q27_A','Q28_A',
            'Q29_A','Q31','Q33_A','Q34_A','Q35_A','Q19','Q18', 'Q36','Q37','Q39'
           ]
for subs in mult_cols:
    cols = [i for i in list(mydata.columns) if subs in i]
    learn =mydata[cols].astype('str')
    mydata[subs] = learn.apply(lambda x: '+'.join(x), axis = 1)

    mydata[subs] = mydata[subs].str.split('+')
    

In [None]:
programming_improvement, perseverence_needed, Big_data, BI, Confidence, ML_improvement,Confused = Anaylse_competitiveness(mydata)


print('Report of the Analysis')

print('In Programming:-')
if len(programming_improvement)>0:
    print('Improvement Needed')
else:
    print('So far good')
    
print('In Determination:-')
if len(perseverence_needed)>0:
    print('Improvement Needed')
else:
    print('So far good')    
print('In Machine Learning:-')
print('Improvement needed in the following areas...')
for k,v in ML_improvement.items():
    if len(v)>0:
        print(k)
        
print('In Confidence:-')
for k,v in Confused.items():
    if len(v)>0:
        if k=='no':
            print('You are confident enough')
        if k=='yes' or k=='bit':
            print('You need to boost your confidence more')

<a id="section-three"></a>
# Disclaimer and Conclusion

### Disclaimer:
The complete analysis is based on assumptions and insights gained by analysing the DS survey data, and the results/observations shown come from categorizing the survey data in this notebook. Please do not take any observation literally.


### Conclusion:
The charts give us a report on what we should concentrate on during upcoming challenges in data science. A main drawback is that missing values are dropped. It would be great if this helps. If you have even the slightest positive feeling towards this notebook, don't forget to upvote!

## Thank You!!!

---