**Objective:**
To study the behavior of various generations, namely
* Baby Boomers (age group: 55 to 70+)
* Gen X (age group: 40 to 54)
* Gen Y (age group: 25 to 39)
* Gen Z (age group: 18 to 24)

with respect to

1. Demographics
2. User Profile
3. Usage
4. Data Science and ML Career Goal
5. Organization

**Reference:**
The segmentation of age groups into generation names follows no fixed rule; it is inspired by the article [here](https://www.kasasa.com/articles/generations/gen-x-gen-y-gen-z).
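
As a minimal sketch of the segmentation described above, the age buckets can be mapped to generation names with a lookup table. The names `AGE_TO_GENERATION` and `to_generation` are hypothetical, and the bucket labels for ages 55 and above are assumed to follow the same pattern as the younger buckets used later in this notebook.

```python
import pandas as pd

# Hypothetical lookup table: survey age bucket -> generation label.
# The labels for 55 and above are assumptions for this illustration.
AGE_TO_GENERATION = {
    "18-21": "Gen Z", "22-24": "Gen Z",
    "25-29": "Gen Y", "30-34": "Gen Y", "35-39": "Gen Y",
    "40-44": "Gen X", "45-49": "Gen X", "50-54": "Gen X",
    "55-59": "Baby Boomers", "60-69": "Baby Boomers", "70+": "Baby Boomers",
}

def to_generation(age_buckets: pd.Series) -> pd.Series:
    """Map age buckets to generation names; unknown buckets become NaN."""
    return age_buckets.map(AGE_TO_GENERATION)

sample = pd.Series(["18-21", "30-34", "50-54", "70+"])
segments = to_generation(sample)
print(segments.tolist())
```

Unlike a catch-all default, this lookup leaves unexpected or missing age values as NaN, so they are not silently counted as Baby Boomers.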


In [None]:
# Import libraries
import os
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# Read data from file, skipping the 2nd row, which contains the question descriptions
df = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv",skiprows=lambda x: x in [1]) 

#Store column names
col_df=df.columns

#Creation of Segment variable based on Age group
df['Segment']= np.where(((df['Q1'] =='18-21') | (df['Q1'] =='22-24')), 'Gen Z', 
                        np.where(((df['Q1'] =='25-29') | (df['Q1'] =='30-34') | (df['Q1'] =='35-39')),'Gen Y',
                        np.where(((df['Q1'] =='40-44') | (df['Q1'] =='45-49') | (df['Q1'] =='50-54')),'Gen X','Baby Boomers')))

#Renaming the values of Q25 because the $ symbol is lost when the dataframe is rendered
df['Q25']= np.where(df['Q25'] =='$0 ($USD)', '0 USD',
                          np.where(df['Q25'] =='$1-$99','1 USD to 99 USD',
                                   np.where(df['Q25'] =='$100-$999','100 USD to 999 USD',
                                           np.where(df['Q25'] =='$1000-$9,999','1000 USD to 9999 USD',
                                                   np.where(df['Q25'] =='$10,000-$99,999','10K USD to 99999 USD',
                                                           np.where(df['Q25'] =='$100,000 or more ($USD)','More than 100K USD',None))))))




In [None]:
#Data Preprocessing

#There are two kinds of questions: single-choice questions and multiple-choice questions.

#Single-choice questions are stored as non-underscore columns (non_unscr_cols), e.g. Q1, Q2, etc.
#Multiple-choice questions are stored as underscore columns (unscr_cols), e.g. Q7_Part_1, Q7_Part_2, etc.
#Supplementary questions are stored as supp_cols

unscr_cols=[]
for i in col_df:
    if i.find('_')>0:
        unscr_pos=i.find('_')#Identifying Multiple choice question
        cols=i[:unscr_pos]
        unscr_cols.append(cols)

#Remove duplicates
final_unscr_cols = [] 
for i in unscr_cols: 
    if i not in final_unscr_cols: 
        final_unscr_cols.append(i)

non_unscr_cols=[]
for i in col_df:
    if i.find('_')<0 and len(i)<=4:#Identifying Single choice question where underscore won't be present and negative value be returned
        cols=i
        non_unscr_cols.append(cols) 

#Remove duplicates
final_non_unscr_cols = [] 
for i in non_unscr_cols: 
    if i not in final_non_unscr_cols: 
        final_non_unscr_cols.append(i) 

#All unique columns
all_cols=final_non_unscr_cols+final_unscr_cols

#Sort columns
all_cols.sort(key=len)

supp_cols=[]
for i in col_df:
    if i.find('_B_')>0:
        cols=i
        supp_cols.append(cols)  

supp_unscr_cols=[]
for i in df.columns:
    if i.find('_B')>0:#Identifying supplementary column
        supp_unscr_pos=i.find('_B')
        cols=i[:supp_unscr_pos]
        supp_unscr_cols.append(cols)
        
#Remove Supplementary column duplicates
final_supp_unscr_cols = [] 
for i in supp_unscr_cols: 
    if i not in final_supp_unscr_cols: 
        final_supp_unscr_cols.append(i)

#Non Supplementary
non_supp_cols = [item for item in col_df if item not in supp_cols]

str1="df"
str2="ovr_df"
str3="seg_df"
str4='cols'
all_cols_df = ["{}{}{}".format(i,'_',str1) for i in all_cols]
all_ovr_cols_df = ["{}{}{}".format(i,'_',str2) for i in all_cols]
all_seg_cols_df = ["{}{}{}".format(i,'_',str3) for i in all_cols]

non_unscr_cols_list = ["{}{}{}".format(i,'_',str1) for i in final_non_unscr_cols]
non_unscr_ovr_cols_list = ["{}{}{}".format(i,'_',str2) for i in final_non_unscr_cols]
non_unscr_seg_cols_list = ["{}{}{}".format(i,'_',str3) for i in final_non_unscr_cols]

unscr_cols_list = ["{}{}{}".format(i,'_',str1) for i in final_unscr_cols]
unscr_ovr_cols_list = ["{}{}{}".format(i,'_',str2) for i in final_unscr_cols]
unscr_seg_cols_list = ["{}{}{}".format(i,'_',str3) for i in final_unscr_cols]


unscr_subs_cols_list = ["{}{}".format(i,'_') for i in final_unscr_cols]
unscr_col_names_list = ["{}{}{}".format(i,'_',str4) for i in final_unscr_cols]

supp_unscr_subs_cols_list = ["{}{}".format(i,'_B_') for i in final_supp_unscr_cols]
supp_unscr_col_names_list = ["{}{}{}".format(i,'_B_',str4) for i in final_supp_unscr_cols]

supp_unscr_cols_list = ["{}{}{}".format(i,'_B_',str1) for i in final_supp_unscr_cols]
supp_unscr_ovr_cols_list = ["{}{}{}".format(i,'_B_',str2) for i in final_supp_unscr_cols]
supp_unscr_seg_cols_list = ["{}{}{}".format(i,'_B_',str3) for i in final_supp_unscr_cols]

#Setting caption for the output tables for each question
non_unscr_desc={'Q1':'Age',
                'Q2':'Gender',
                'Q3':'Country',
                'Q4':'Education',
                'Q5':'Current Role',
                'Q6':'Code Experience',
                'Q8':'Recommended Programming Language',
                'Q11':'Current Computing Platform',
                'Q13':'Current TPU Usage Frequency',
                'Q15':'Machine Learning Experience',
                'Q20':'Size of The Company',
                'Q21':'Data Science Team Size',
                'Q22':'Machine Learning Adoption',
                'Q24':'Current Yearly Compensation',
                'Q25':'ML/Cloud Computing Investment Past 5 years',
                'Q30':'Current Big Data Usage',
                'Q32':'Current BI Tool Usage',
                'Q38':'Primary Tool for Data Analysis'
                }

unscr_desc= {   'Q7':'Regular Programming Language',
                'Q9':'Regular IDE',
                'Q10':'Regular Notebook Products',
                'Q12':'Regular Specialized Hardware',
                'Q14':'Regular Visualization Tools/Libraries',
                'Q16':'Regular Machine Learning Frameworks',
                'Q17':'Regular Machine Learning Algorithms',
                'Q18':'Regular Computer Vision Methods',
                'Q19':'Regular Natural Language Processing Methods',
                'Q23':'Work Activities',
                'Q26':'Regular Computing Platforms',
                'Q27':'Regular Cloud Computing Products',
                'Q28':'Regular Machine Learning Products',
                'Q29':'Regular Big Data Products',
                'Q31':'Regular BI Tools',
                'Q33':'Regular AutoML Tool Categories',
                'Q34':'Regular AutoML Tools',
                'Q35':'Tools to help ML Experiments',
                'Q36':'Deploy Platform',
                'Q37':'Datascience Courses Learning',
                'Q39':'Favorite Media Sources'
                }

supp_unscr_desc= { 
                'Q26':'Next 2 years Cloud Computing Platform',
                'Q27':'Next 2 years Cloud Computing Products',
                'Q28':'Next 2 years ML Products',
                'Q29':'Next 2 years Big Data Products',
                'Q31':'Next 2 years BI Tools',
                'Q33':'Next 2 years AutoML Category Tools',
                'Q34':'Next 2 years AutoML Specific Tools',
                'Q35':'Next 2 years Manage ML Experiment Tools'
                }
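
The column-naming convention used in the preprocessing above can be illustrated with a simplified sketch. The column names below are toy examples that mimic the survey layout, and `classify` and `question_id` are hypothetical helpers, not the functions used in this notebook. The notebook's own loops also handle details such as the length check for non-question columns, which this sketch leaves out.

```python
# Toy column names that mimic the survey layout (illustrative, not the full file).
columns = ["Q1", "Q2", "Q7_Part_1", "Q7_Part_2", "Q26_A_Part_1", "Q26_B_Part_1"]

def classify(col: str) -> str:
    """Return 'single', 'multi' or 'supplementary' for a survey column name."""
    if "_B_" in col:
        return "supplementary"
    if "_" in col:
        return "multi"
    return "single"

def question_id(col: str) -> str:
    """Strip everything after the first underscore, e.g. 'Q7_Part_1' -> 'Q7'."""
    return col.split("_", 1)[0]

kinds = {col: classify(col) for col in columns}

# dict.fromkeys drops duplicates while keeping first-seen order.
multi_questions = list(
    dict.fromkeys(question_id(c) for c in columns if classify(c) == "multi")
)
print(kinds)
print(multi_questions)
```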

In [None]:
#Non-Underscore Columns Preprocessing(Single Choice Questions)

#Overall Non Underscore columns
non_unscr_ovr_cols_df=pd.DataFrame(non_unscr_ovr_cols_list ,columns=['col'])
non_unscr_ovr_cols_dict = {elem : pd.DataFrame() for elem in non_unscr_ovr_cols_list}

for key in non_unscr_ovr_cols_dict.keys():
    non_unscr_ovr_cols_dict[key] = non_unscr_ovr_cols_df[:][non_unscr_ovr_cols_df.col == key]

for key,i in zip(non_unscr_ovr_cols_dict.keys(),non_unscr_cols):
    non_unscr_ovr_cols_dict[key]=df[i].value_counts(normalize=True,dropna=True).round(3).reset_index()
    non_unscr_ovr_cols_dict[key].columns = [i,'Overall']       

#Segment Non Underscore columns
non_unscr_seg_cols_df=pd.DataFrame(non_unscr_seg_cols_list,columns=['col'])

non_unscr_seg_cols_dict = {elem : pd.DataFrame() for elem in non_unscr_seg_cols_list}

for key in non_unscr_seg_cols_dict.keys():
    non_unscr_seg_cols_dict[key] = non_unscr_seg_cols_df[:][non_unscr_seg_cols_df.col == key]

for key,i in zip(non_unscr_seg_cols_dict.keys(),non_unscr_cols):
    non_unscr_seg_cols_dict[key]=pd.crosstab(df[i], df.Segment,dropna=True).apply(lambda r: (r/r.sum()), axis=0).round(2).reset_index()

#Combine Overall and Segment Non Underscore columns
non_unscr_cols_df=pd.DataFrame(non_unscr_cols_list,columns=['col'])
non_unscr_cols_dict = {elem : pd.DataFrame() for elem in non_unscr_cols_list}
for key in non_unscr_cols_dict.keys():
    non_unscr_cols_dict[key] = non_unscr_cols_df[:][non_unscr_cols_df.col == key]
    
for key,i,key_ovr,key_seg in zip(non_unscr_cols_dict.keys(),non_unscr_cols,non_unscr_ovr_cols_dict.keys(),non_unscr_seg_cols_dict.keys()):
    non_unscr_cols_dict[key]=pd.merge(non_unscr_ovr_cols_dict[key_ovr], non_unscr_seg_cols_dict[key_seg],how='inner',on=i)

#Final Non Underscore column
non_unscr_col_names_dict = {elem : [] for elem in final_non_unscr_cols}
for key,key2,val in zip(non_unscr_cols_dict.keys(),non_unscr_col_names_dict.keys(),non_unscr_desc.values()):
    non_unscr_col_names_dict[key2]=non_unscr_cols_dict[key].style\
    .format({'Overall': "{:.0%}",'Baby Boomers': "{:.0%}",'Gen X': "{:.0%}",'Gen Y': "{:.0%}",'Gen Z': "{:.0%}"})\
    .background_gradient(cmap='Blues',axis=1)\
    .set_properties(**{'text-align': 'center'})\
    .set_caption(val)\
    .set_table_styles([{
        'selector': 'caption',
        'props': [
        ('color', 'black'),
        ('font-size', '18px')
        ]
        }])\
    .hide_index()

In [None]:
#Underscore Columns Preprocessing(Multiple choice questions)

#Storing Underscore column names in the dict
unscr_subs_col_names_dict = {elem : [] for elem in unscr_col_names_list}
for key,i in zip(unscr_subs_col_names_dict.keys(),unscr_subs_cols_list):
    unscr_subs_col_names_dict[key] = [col for col in non_supp_cols if i in col]

#Overall Underscore columns
unscr_ovr_cols_dict = {elem : pd.DataFrame() for elem in unscr_ovr_cols_list}
for key,value,j in zip(unscr_ovr_cols_dict.keys(),unscr_subs_col_names_dict.values(),final_unscr_cols) :
    for i in value:
        tmp=pd.DataFrame(df[i].value_counts().rename_axis(j).reset_index(name='Overall'))
        unscr_ovr_cols_dict[key]=pd.concat([unscr_ovr_cols_dict[key],tmp],ignore_index=True)
        unscr_ovr_cols_dict[key]=unscr_ovr_cols_dict[key].sort_values(by=['Overall'], ascending=False)
    unscr_ovr_cols_dict[key]['Overall']=(unscr_ovr_cols_dict[key]['Overall']/len(df))
    
#Segment Underscore columns
unscr_seg_cols_dict = {i : pd.DataFrame(columns=[j,'Baby Boomers','Gen X','Gen Y','Gen Z']) for i,j in zip(unscr_seg_cols_list,final_unscr_cols)}

for key,value,j in zip(unscr_seg_cols_dict.keys(),unscr_subs_col_names_dict.values(),final_unscr_cols) :
    for i in value:
        tmp=pd.crosstab(df[i],df['Segment']).rename_axis(j).reset_index()
        unscr_seg_cols_dict[key]=pd.concat([unscr_seg_cols_dict[key],tmp],ignore_index=True)

cols = ['Baby Boomers','Gen X','Gen Y','Gen Z']
for key in unscr_seg_cols_dict.keys():
    for i in cols:
        unscr_seg_cols_dict[key][i]  = (unscr_seg_cols_dict[key][i]/(len(df[df.Segment== i]))).apply(pd.to_numeric, errors='coerce')

#Combine Overall and Segment Underscore columns
unscr_cols_dict = {elem : pd.DataFrame() for elem in unscr_cols_list}
for key,i,key_ovr,key_seg in zip(unscr_cols_dict.keys(),final_unscr_cols,unscr_ovr_cols_dict.keys(),unscr_seg_cols_dict.keys()):
    unscr_cols_dict[key]=pd.merge(unscr_ovr_cols_dict[key_ovr], unscr_seg_cols_dict[key_seg],how='inner',on=i)

#Final Underscore Columns Non Supplementary
unscr_col_names_dict = {elem : [] for elem in final_unscr_cols}
for key,key2,val in zip(unscr_cols_dict.keys(),unscr_col_names_dict.keys(),unscr_desc.values()):
    unscr_col_names_dict[key2]=unscr_cols_dict[key].style\
    .format({'Overall': "{:.0%}",'Baby Boomers': "{:.0%}",'Gen X': "{:.0%}",'Gen Y': "{:.0%}",'Gen Z': "{:.0%}"})\
    .background_gradient(cmap='Blues',axis=1)\
    .set_properties(**{'text-align': 'center'})\
    .set_caption(val)\
    .set_table_styles([{
        'selector': 'caption',
        'props': [
        ('color', 'black'),
        ('font-size', '18px')
        ]
        }])\
    .hide_index()

In [None]:
#Supplementary columns Pre-processing

#Storing Supplementary column names in the dict
supp_unscr_subs_col_names_dict = {elem : [] for elem in supp_unscr_col_names_list}
for key,i in zip(supp_unscr_subs_col_names_dict.keys(),supp_unscr_subs_cols_list):
    #print(key,i)
    supp_unscr_subs_col_names_dict[key] = [col for col in supp_cols if i in col]

#Overall Underscore Supplementary columns
supp_unscr_ovr_cols_dict = {elem : pd.DataFrame() for elem in supp_unscr_ovr_cols_list}

for key,value,j in zip(supp_unscr_ovr_cols_dict.keys(),supp_unscr_subs_col_names_dict.values(),final_supp_unscr_cols) :
    for i in value:
        tmp=pd.DataFrame(df[i].value_counts().rename_axis(j).reset_index(name='Overall'))
        supp_unscr_ovr_cols_dict[key]=pd.concat([supp_unscr_ovr_cols_dict[key],tmp],ignore_index=True)
        supp_unscr_ovr_cols_dict[key]=supp_unscr_ovr_cols_dict[key].sort_values(by=['Overall'], ascending=False)
    supp_unscr_ovr_cols_dict[key]['Overall']=(supp_unscr_ovr_cols_dict[key]['Overall']/len(df))

#Segment Underscore Supplementary columns
supp_unscr_seg_cols_dict = {i : pd.DataFrame(columns=[j,'Baby Boomers','Gen X','Gen Y','Gen Z']) for i,j in zip(supp_unscr_seg_cols_list,final_supp_unscr_cols)}

for key,value,j in zip(supp_unscr_seg_cols_dict.keys(),supp_unscr_subs_col_names_dict.values(),final_supp_unscr_cols) :
    for i in value:
        tmp=pd.crosstab(df[i],df['Segment']).rename_axis(j).reset_index()
        supp_unscr_seg_cols_dict[key]=pd.concat([supp_unscr_seg_cols_dict[key],tmp],ignore_index=True)

cols = ['Baby Boomers','Gen X','Gen Y','Gen Z']
for key in supp_unscr_seg_cols_dict.keys():
    for i in cols:
        supp_unscr_seg_cols_dict[key][i]  = (supp_unscr_seg_cols_dict[key][i]/(len(df[df.Segment== i]))).apply(pd.to_numeric, errors='coerce')

#Combine Overall and Segment Underscore Supplementary columns
supp_unscr_cols_dict = {elem : pd.DataFrame() for elem in supp_unscr_cols_list}
for key,i,key_ovr,key_seg in zip(supp_unscr_cols_dict.keys(),final_supp_unscr_cols,supp_unscr_ovr_cols_dict.keys(),supp_unscr_seg_cols_dict.keys()):
    supp_unscr_cols_dict[key]=pd.merge(supp_unscr_ovr_cols_dict[key_ovr], supp_unscr_seg_cols_dict[key_seg],how='inner',on=i)

#Final Supplementary columns
supp_unscr_col_names_dict = {elem : [] for elem in final_supp_unscr_cols}
for key,key2,val in zip(supp_unscr_cols_dict.keys(),supp_unscr_col_names_dict.keys(),supp_unscr_desc.values()):
    supp_unscr_col_names_dict[key2]=supp_unscr_cols_dict[key].style\
    .format({'Overall': "{:.0%}",'Baby Boomers': "{:.0%}",'Gen X': "{:.0%}",'Gen Y': "{:.0%}",'Gen Z': "{:.0%}"})\
    .background_gradient(cmap='Blues',axis=1)\
    .set_properties(**{'text-align': 'center'})\
    .set_caption(val)\
    .set_table_styles([{
        'selector': 'caption',
        'props': [
        ('color', 'black'),
        ('font-size', '18px')
        ]
        }])\
    .hide_index()

# Generation Distribution

In [None]:
# Set figure size and style before plotting so that they apply to this figure
sns.set(rc={'figure.figsize':(13,5)})
sns.set_style("white")

g=sns.color_palette("Blues_r")
ax = sns.countplot(y="Segment", data=df,order = df['Segment'].value_counts().index,palette=g)

# Annotate each bar with its share of all respondents
total = len(df['Segment'])
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_width()/total)
    x = p.get_x() + p.get_width() + 0.02
    y = p.get_y() + p.get_height()/2
    ax.annotate(percentage, (x, y))
sns.despine()

plt.show()

Gen Y (age group: 25 to 39) makes up the largest share of respondents among all segments.


# Demographics

In [None]:
display(non_unscr_col_names_dict['Q2'])

1. Men are over-indexed in the Baby Boomers segment (a higher share than Overall), whereas women are under-indexed (a much lower share than Overall).
1. Women in Gen Z are over-indexed compared to Overall, which suggests more interest in ML & DS.
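
To make "over-indexed" and "under-indexed" concrete, here is a minimal sketch that divides each segment share by the Overall share and scales it to 100. The shares are made-up numbers for illustration only, not survey results, and `index_vs_overall` is a hypothetical helper. The index-to-100 convention is an assumption; the tables in this notebook show the raw percentages side by side instead.

```python
import pandas as pd

# Made-up shares for illustration only; they are not survey results.
shares = pd.DataFrame(
    {"Overall": [0.80, 0.20], "Gen Z": [0.76, 0.24], "Baby Boomers": [0.88, 0.12]},
    index=["Man", "Woman"],
)

def index_vs_overall(table: pd.DataFrame, baseline: str = "Overall") -> pd.DataFrame:
    """Divide each segment share by the baseline share and scale to 100."""
    segments = table.drop(columns=baseline)
    return segments.div(table[baseline], axis=0).mul(100).round(0)

idx = index_vs_overall(shares)
print(idx)
```

Under this convention a value above 100 means the segment is over-indexed for that answer, and a value below 100 means it is under-indexed.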

In [None]:
non_unscr_col_names_dict['Q3']


1. In India, Gen Z is highly over-indexed and Baby Boomers are under-indexed, which indicates that the young age group in India is highly interested in ML & DS whereas the most senior citizens are less inclined.
2. In the US, unlike India, Baby Boomers are more interested in ML & DS whereas Gen Z is not much inclined.

Hypotheses:
* Gen Z in India is more interested in DS & ML than Gen Z in the US
* Marketing campaigns for Data Science and ML that target Gen Z are more prevalent in India than in the US
* Gen Z in the US has more diversified opportunities than Gen Z in India

In [None]:
non_unscr_col_names_dict['Q4']

1. Master's degrees are over-indexed in Gen Y compared to Overall
2. Gen Z is the youngest group, so it is logical that Bachelor's degrees are more common in that segment
3. Doctoral degrees are over-indexed among Baby Boomers compared to Overall

# User Profile

In [None]:
non_unscr_col_names_dict['Q5']

1. Software Engineers in Gen X are more inclined towards ML & DS when compared to Overall
2. Among Baby Boomers, people with an Other job role and the unemployed are interested in ML & DS

In [None]:
non_unscr_col_names_dict['Q6']

The age group of each generation mostly correlates with the years of coding experience within each segment
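
One way to inspect how experience is distributed within each generation is a column-normalized cross-tabulation. The sketch below uses toy responses, not the survey data, and relies on the `normalize="columns"` option of `pd.crosstab` rather than the `apply`-based normalization used in the preprocessing cells.

```python
import pandas as pd

# Toy responses for illustration only.
toy = pd.DataFrame({
    "Segment": ["Gen Z", "Gen Z", "Gen Z", "Gen X", "Gen X", "Gen X"],
    "Q6": ["< 1 years", "< 1 years", "1-2 years",
           "20+ years", "20+ years", "1-2 years"],
})

# normalize="columns" makes each segment column sum to 1,
# so the rows show the experience mix inside every generation.
dist = pd.crosstab(toy["Q6"], toy["Segment"], normalize="columns").round(2)
print(dist)
```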

In [None]:
non_unscr_col_names_dict['Q24']

1. Respondents earning less than $1000 make up 20%, which is higher than any other income group.
1. Gen Z is the youngest age group, so it is unsurprising that they fall in the lowest income group.

In [None]:
unscr_col_names_dict['Q36']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*
1. GitHub, Kaggle and Colab are the platforms users share on the most
1. Gen X and Gen Y are over-indexed on GitHub and Kaggle
1. Baby Boomers are fairly closed when it comes to sharing
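
The reason multiple-choice columns do not add up to 100% can be seen in a small sketch. The data below is a toy example with hypothetical answers; each option has its own column, and the share of an option is its count of non-missing answers divided by the number of respondents.

```python
import pandas as pd

# Toy multi-select data: one column per option, None when the option was not ticked.
toy = pd.DataFrame({
    "Q36_Part_1": ["GitHub", "GitHub", None, "GitHub"],
    "Q36_Part_2": ["Kaggle", None, "Kaggle", "Kaggle"],
    "Q36_Part_3": [None, None, "Colab", "Colab"],
})

# Share of respondents choosing each option: non-null count / number of respondents.
shares = toy.notna().sum() / len(toy)

# Label each share with the option name stored in its column.
shares.index = [toy[c].dropna().iloc[0] for c in toy.columns]
print(shares)
print(shares.sum())
```

Because one respondent can tick several options, the shares sum to more than 1 in this toy example.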

In [None]:
unscr_col_names_dict['Q37']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*
1. Coursera, Kaggle and Udemy are the top three learning platforms used by respondents
1. Baby Boomers are over-indexed on edX specifically

In [None]:
unscr_col_names_dict['Q39']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*
1. Kaggle, YouTube and blogs are the top three favorite media sources of respondents
1. Baby Boomers are versatile, as they are over-indexed in learning from media sources such as Kaggle, email newsletters, journals and forums


# Usage

In [None]:
unscr_col_names_dict['Q7']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

1. Gen Z is over-indexed in Python, C++, Java and C compared to Overall
2. Gen X is over-indexed in R, SQL and Bash compared to Overall
3. Baby Boomers are over-indexed in R and highly over-indexed in Other. It would be interesting to know which other programming languages they use; SAS and SPSS are not listed as options, which might explain the Other responses

In [None]:
non_unscr_col_names_dict['Q8']

*Q8 is a single-choice question, so each column adds up to about 100% (after rounding).*

1. Python is the most recommended programming language
1. Baby Boomers are more inclined to recommend R

In [None]:
unscr_col_names_dict['Q9']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

1. Gen Z: Jupyter, Visual Studio Code, PyCharm and Spyder are over-indexed compared to Overall, which indicates that Gen Z is more into Python
2. Baby Boomers: RStudio, Notepad++ and Visual Studio are over-indexed compared to Overall

Hypothesis:
The younger generation is inclined towards Python and the older generation towards non-Python languages

In [None]:
unscr_col_names_dict['Q10']

None is highly over-indexed for Baby Boomers. This may be associated with their use of Other programming languages, which might be the reason for choosing None

In [None]:
non_unscr_col_names_dict['Q11']

*Q11 is a single-choice question, so each column adds up to about 100% (after rounding).*

A personal laptop/computer is the most commonly used computing platform

In [None]:
unscr_col_names_dict['Q12']

Gen Z is more interested in GPUs, whereas Baby Boomers are not much inclined towards GPUs

In [None]:
non_unscr_col_names_dict['Q13']

*Q13 is a single-choice question, so each column adds up to about 100% (after rounding).*

TPU usage is still at a nascent stage, as most respondents across the segments have never used it

In [None]:
unscr_col_names_dict['Q14']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

Seaborn and Matplotlib are the most used visualization libraries, followed by Plotly and Ggplot


In [None]:
non_unscr_col_names_dict['Q15']

The age group of each generation mostly correlates with the years of ML experience within each segment

In [None]:
unscr_col_names_dict['Q16']

Scikit-learn is the most regularly used machine learning framework

In [None]:
unscr_col_names_dict['Q17']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

1. Baby Boomers: over-indexed in CNNs, Bayesian approaches and RNNs, which might be because of their engagement in Deep Learning
2. Gen X: over-indexed in Linear/Logistic Regression, Decision Trees/Random Forests and GBMs

In [None]:
unscr_col_names_dict['Q18']

CNN architectures are the most regularly used computer vision methods

In [None]:
unscr_col_names_dict['Q19']

Word embeddings/vectors are the most regularly used natural language processing methods

In [None]:
unscr_col_names_dict['Q26']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

AWS, GCP and Azure are the regularly used computing platforms

In [None]:
unscr_col_names_dict['Q27']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

Gen X over-indexes in most of the cloud computing products.

Hypothesis:
Gen X is cloud-centric

In [None]:
unscr_col_names_dict['Q28']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

No machine learning products are used regularly by the segments

In [None]:
unscr_col_names_dict['Q29']

MySQL, PostgreSQL and Microsoft SQL Server are the most used big data products

Hive SQL and Teradata are not present in the options?

In [None]:
non_unscr_col_names_dict['Q30']

1. Gen Z: over-indexed in MySQL and MongoDB
2. Gen X: over-indexed in PostgreSQL and Oracle Database
3. Baby Boomers: over-indexed in Microsoft SQL Server and DB2

In [None]:
unscr_col_names_dict['Q31']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

Tableau is the most regularly used BI tool

In [None]:
non_unscr_col_names_dict['Q32']

*Q32 is a single-choice question, so each column adds up to about 100% (after rounding).*

Tableau and Microsoft Power BI are the BI tools currently used the most

In [None]:
unscr_col_names_dict['Q33']

AutoML tools are not widely used

In [None]:
unscr_col_names_dict['Q34']

In [None]:
unscr_col_names_dict['Q35']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

TensorBoard is one of the major tools used to help ML experiments

In [None]:
non_unscr_col_names_dict['Q38']

1. Local development environments and basic statistical software are the primary tools most often used for data analysis
1. Baby Boomers are over-indexed in statistical software (SPSS, SAS) and under-indexed in local development environments

# Data Science and Machine Learning Career Goal

In [None]:
supp_unscr_col_names_dict['Q26']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

1. AWS, GCP and Microsoft Azure are the career goals of non-professionals for the next 2 years
1. Gen Z is over-indexed here, as they are hungry to learn everything

In [None]:
supp_unscr_col_names_dict['Q27']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

Google cloud computing products are the career goals of non-professionals for the next 2 years.

Hypothesis:
Skills are already up to date with AWS, so respondents plan to upgrade their skills to Google's products


In [None]:
supp_unscr_col_names_dict['Q28']

In [None]:
supp_unscr_col_names_dict['Q29']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

MySQL, MongoDB and PostgreSQL are the career goals of non-professionals for the next 2 years in big data products.

In [None]:
supp_unscr_col_names_dict['Q31']

*Please note that the columns won't add up to 100%, as a respondent can select multiple choices.*

Tableau, Power BI and Google Data Studio are the career goals of non-professionals for the next 2 years in BI tools.

In [None]:
supp_unscr_col_names_dict['Q33']

In [None]:
supp_unscr_col_names_dict['Q34']

In [None]:
supp_unscr_col_names_dict['Q35']

# Organization

In [None]:
non_unscr_col_names_dict['Q20']

1. Gen Z is over-indexed in companies with 0-49 employees
2. Gen X is over-indexed in companies with 10,000 or more employees

In [None]:
non_unscr_col_names_dict['Q21']

Overall, the dominant data science team size is 0, which suggests that most respondents could be independent contributors

In [None]:
non_unscr_col_names_dict['Q22']

Overall ML usage is low (17% + 16% = 33%), and most respondents are involved in non-ML use cases in their work

In [None]:
non_unscr_col_names_dict['Q25']

Overall, 36% of respondents have made no investment in ML or cloud computing

# Summary

**Baby Boomers:**
* A higher share of men is involved in ML & DS, whereas women in this segment are the least involved
* People with an Other job role and the unemployed are interested in ML & DS
* In India, they are under-indexed, which indicates that the most senior citizens are less inclined towards ML & DS
* In the US, unlike India, they are more interested in ML & DS
* Doctoral degrees are over-indexed compared to Overall, which indicates that many of them are highly qualified
* Fairly closed when it comes to sharing
* Over-indexed on edX as their choice of learning platform
* Over-indexed in learning from media sources such as Kaggle, email newsletters, journals and forums
* More inclined to recommend R
* They don’t regularly use notebook products, as they are mainly R users and would work in a local RStudio environment
* Not much inclined towards GPUs
* Over-indexed in engagement in Deep Learning
* Over-indexed in big data products: Microsoft SQL Server and DB2
* Some Baby Boomer respondents are over-indexed in statistical software (SPSS, SAS) and under-indexed in local development environments

**Gen X:**
* Software Engineers in Gen X are more inclined towards ML & DS when compared to Overall
* Over-indexed on GitHub and Kaggle
* Over-indexed in R, SQL and Bash compared to Overall
* Over-indexed in Linear/Logistic Regression, Decision Trees/Random Forests and GBMs
* Over-indexed in most of the cloud computing products, which suggests that they are cloud-centric
* Over-indexed in PostgreSQL and Oracle Database
* Over-indexed in companies with 10,000 or more employees


**Gen Y:**
* The largest segment in the survey; these respondents are **Millennials**
* Master's degrees are over-indexed in Gen Y compared to Overall
* Over-indexed on GitHub and Kaggle
* The response percentages of Gen Y mostly align with the Overall percentages, which is why they are mostly neither over-indexed nor under-indexed

**Gen Z:**
* Women in Gen Z are over-indexed compared to Overall, which suggests more interest in ML & DS
* In India, Gen Z is highly over-indexed, which indicates that the young age group in India is highly interested in ML & DS
* In the US, they are not much inclined towards ML & DS
* It is the youngest group, so it is logical to find more of them either pursuing or having completed a Bachelor's degree
* They are the youngest age group with little experience, so it is unsurprising that they fall in the lowest income group
* Over-indexed in the programming languages Python, C++, Java and C
* Jupyter, Visual Studio Code, PyCharm and Spyder are over-indexed compared to Overall, which indicates that Gen Z is more into Python
* More interested in GPUs
* Over-indexed in MySQL and MongoDB
* Over-indexed in all the technologies present in “Next 2 years Cloud Computing Platform”, as they are hungry to learn everything
* Google cloud computing products are the career goals of non-professionals for the next 2 years
* Over-indexed in companies with 0-49 employees

