<h1 style="text-align: center;">Let us see how Gen-Z, aged 18-24, is taking to DS and ML.</h1>

<div style="float: left; margin-left:50px">
<img  src="https://image.freepik.com/free-vector/virtual-tiny-people-messaging-social-media-flat-vector-illustration-characters-near-huge-smartphone-modern-demography-trend-with-progressive-youth-gen-z-generation-digital-technology-concept_74855-10187.jpg" width="250px">
<a style="font-size:8px" href='https://www.freepik.com/vectors/people'>People vector created by pch.vector - www.freepik.com</a>
</div>
<div style="text-align: left;"><br>
<h3>Generation Z, colloquially known as Zoomers, is the demographic cohort succeeding Millennials and preceding Generation Alpha. Researchers and popular media use the mid-to-late 1990s as starting birth years and the early 2010s as ending birth years. </h3>
</div>

In [None]:
import numpy as np 
import pandas as pd 
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings 
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)
data.info()
# Row 0 holds the question text rather than a response, so drop it
data.drop([0], axis=0, inplace=True)

In [None]:
def countPlot(columnString="Q1", labelName="Values", dataframe = data, width=800, height=500, order="total descending"): 
    fig = px.histogram(dataframe, x=columnString, labels={columnString: labelName}, color_discrete_sequence=px.colors.sequential.haline).update_xaxes(categoryorder=order)
    fig.update_layout(title_font_size=18, width=width, height=height)
    fig.show()
    
def histgramPlot(rowString="rows", columnString="columns", dataframe=data, width=800, height=500, order="total descending"):
    fig = px.histogram(dataframe, x=rowString, y=columnString, color_discrete_sequence=px.colors.sequential.haline).update_xaxes(categoryorder=order)
    fig.update_layout(title_font_size=18, width=width, height=height)
    fig.show()
    
def piePlot(columnString="Q1", dataframe=data, width=500, height=500, textPosition='auto', textinfo='percent+label'): 
    plotLabels, counts = np.unique(dataframe[columnString].values.tolist(), return_counts=True)
    fig = px.pie(dataframe, values=counts, names=plotLabels, color_discrete_sequence=px.colors.sequential.haline)
    fig.update_traces(textposition=textPosition, textinfo=textinfo)
    fig.update_layout(width=width, height=height)
    fig.show()

def donutPlot(columnString="Q1", dataframe=data, width=500, height=500, holeSize=0.3, textPosition='auto', textinfo='percent+label'): 
    plotLabels, counts = np.unique(dataframe[columnString].values.tolist(), return_counts=True)
    fig = px.pie(dataframe, values=counts, names=plotLabels, hole=holeSize, color_discrete_sequence=px.colors.sequential.haline)
    fig.update_traces(textposition=textPosition, textinfo=textinfo)
    fig.update_layout(width=width, height=height)
    fig.show()

    
def visualizeMultipleColumns_vert(columnString="Q1", labelName="values", dataframe=data, height=500, width=800 ):
    df = dataframe[[i for i in dataframe.columns if i.startswith(columnString + "_")]]  # prefix match, so "Q1" does not also pick up "Q10_..."
    df_specific = pd.Series(dtype='int')
    for i in df.columns:
        df_specific[df[i].value_counts().index[0]] = df[i].count()
    df_specific = df_specific.sort_values(ascending=False)
    df_specific = df_specific.to_frame()
    df_specific.reset_index(inplace=True)
    df_specific.rename(columns={'index':labelName, 0:'Counts'}, inplace=True)
    fig = px.bar(df_specific, x=labelName, y='Counts',color_discrete_sequence=px.colors.sequential.haline)
    fig.update_layout(height=height, width=width)
    fig.show()
    
def visualizeMultipleColumns_horz(columnString="Q1", labelName="values", dataframe=data, height=500, width=800 ):
    df = dataframe[[i for i in dataframe.columns if i.startswith(columnString + "_")]]  # prefix match, so "Q1" does not also pick up "Q10_..."
    df_specific = pd.Series(dtype='int')
    for i in df.columns:
        df_specific[df[i].value_counts().index[0]] = df[i].count()
    df_specific = df_specific.sort_values(ascending=True)
    df_specific = df_specific.to_frame()
    df_specific.reset_index(inplace=True)
    df_specific.rename(columns={'index':labelName, 0:'Counts'}, inplace=True)
    fig = px.bar(df_specific, y=labelName, x='Counts',color_discrete_sequence=px.colors.sequential.haline)
    fig.update_layout(height=height, width=width)
    fig.show()

def visualizeMultipleColumns_pie(columnString="Q7", labelName="values", dataframe=data, height=500, width=800, holeSize=0.4 ):
    df = dataframe[[i for i in dataframe.columns if i.startswith(columnString + "_")]]  # prefix match, so "Q1" does not also pick up "Q10_..."
    df_specific = pd.Series(dtype='int')
    for i in df.columns:
        df_specific[df[i].value_counts().index[0]] = df[i].count()
    df_specific = df_specific.sort_values(ascending=True)
    df_specific = df_specific.to_frame()
    df_specific.reset_index(inplace=True)
    df_specific.rename(columns={'index':labelName, 0:'Counts'}, inplace=True)
    fig = px.pie(df_specific, names=labelName, values='Counts', hole=holeSize, color_discrete_sequence=px.colors.sequential.haline)
    fig.update_layout(height=height, width=width)
    fig.show()
    
def treeMap(columnString="Q3", dataframe=data, labelName="Values"):
    percents = np.round((dataframe[columnString].value_counts()/ dataframe[columnString].count()*100), 2).sort_index()
    fig = px.treemap(dataframe[columnString].sort_values(), path= [columnString], height=800, width=800, title=labelName, color_discrete_sequence=px.colors.sequential.haline)
    fig.data[0].textinfo = 'label+text+value'
    fig.update_traces(text = percents, textfont_size=15, texttemplate="%{label}<br>%{value}<br>%{text:.2f}%")
    fig.show()

<h1 id="AgeGroup">
    Age Group
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#AgeGroup">¶</a>
</h1>

In [None]:
#countPlot(dataframe=data, columnString="Q1", labelName="Age Group")
donutPlot(dataframe=data, columnString="Q1", width=800, height=700, holeSize=0.5)

<div class="alert alert-block alert-success">
<h2> 36.2% of users that have taken the survey belong to Gen-Z </h2>
</div>
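A share like the one above can be computed directly rather than read off the chart. The sketch below is a minimal illustration on hypothetical toy data (the `toy` frame and the `share_in_groups` helper are not part of the notebook); on the real survey one would pass `data` instead.

```python
import pandas as pd

# Hypothetical toy answers standing in for the survey's Q1 (age group) column.
toy = pd.DataFrame({"Q1": ["18-21", "22-24", "25-29", "30-34", "18-21",
                           "35-39", "22-24", "25-29", "40-44", "18-21"]})

# The two age brackets this notebook treats as Gen-Z.
GEN_Z_GROUPS = ["18-21", "22-24"]

def share_in_groups(frame, column, groups):
    """Return the percentage of rows whose `column` value is in `groups`."""
    return frame[column].isin(groups).mean() * 100

gen_z_share = share_in_groups(toy, "Q1", GEN_Z_GROUPS)
print(f"Gen-Z share in the toy data: {gen_z_share:.1f}%")
```

Taking the mean of the boolean mask avoids counting rows by hand, and rows with a missing age group count as "not in the groups".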

In [None]:
#Taking data of Zoomers for further analysis! 
Zoomers = data.loc[ (data['Q1'] == '18-21') | (data['Q1'] == '22-24') ]
Zoomers

<h1 id="Gender">
    Gender amongst Zoomers! 
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#Gender">¶</a>
</h1>

In [None]:
#Q2 is the column for gender
countPlot(dataframe=Zoomers, columnString="Q2", labelName="Gender amongst Zoomers")

<div class="alert alert-block alert-success">
<h2> From the survey, 78.8% of Zoomers are men, while 19.4% are women. As an approximation, there is 1 woman for every 4 men who are into DS and ML.</h2>
</div>
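The 4:1 approximation follows from the two percentages quoted above. A quick check, using only those two figures:

```python
# Percentages quoted in the summary above.
men_pct = 78.8
women_pct = 19.4

# Number of men per woman among the Zoomers who answered.
men_per_woman = men_pct / women_pct
print(f"About {men_per_woman:.1f} men for every woman")
```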

<h1 id="Countries"> Spread of Zoomers across Countries  
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#Countries">¶</a>
</h1>

In [None]:
# Zoomers.Q3.nunique()
# Zoomers['Q3'] = Zoomers['Q3'].replace(['United States of America', 'United Kingdom of Great Britain and Northern Ireland'], ['USA', 'UK'])
# Countries, counts = np.unique(Zoomers['Q3'].values.tolist(), return_counts=True)
# Zoomer_Countries = pd.DataFrame({'Countries': Countries, 'Counts': list(counts)}, columns=['Countries', 'Counts'])
# Zoomer_Countries['Percentage'] = (Zoomer_Countries['Counts']/Zoomer_Countries['Counts'].sum())*100
# Zoomer_Countries.sort_values(by="Counts", ascending=False, inplace=True)
# Zoomer_Countries_Trimmed = Zoomer_Countries.loc[Zoomer_Countries['Counts'] >= 50]

# histgramPlot(dataframe=Zoomer_Countries_Trimmed, rowString="Countries", columnString="Counts", order="total descending", width=800, height=500)
# Zoomer_Countries.loc[Zoomer_Countries['Counts'] >= 100]

In [None]:
treeMap(columnString="Q3", dataframe=Zoomers, labelName="Spread of Zoomers amongst countries!")

<div class="alert alert-block alert-success">
<h2> In this dataset, 49.29% of Zoomers are from India. No other country accounts for even 5% of the Zoomers who participated in the survey.</h2>
</div>
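The per-country percentages shown in the treemap can also be tabulated, which makes a threshold check like the 5% one easy to reproduce. This is a sketch on hypothetical toy data: the `toy` frame, the `country_shares` helper and the 15% cut-off are illustrative assumptions, not values from the survey.

```python
import pandas as pd

# Hypothetical toy answers standing in for the survey's Q3 (country) column.
toy = pd.DataFrame({"Q3": ["India"] * 5 + ["USA"] * 2 + ["Brazil", "Japan", "Nigeria"]})

def country_shares(frame, column="Q3"):
    """Percentage of respondents per country, largest first."""
    return (frame[column].value_counts(normalize=True) * 100).round(2)

shares = country_shares(toy)

# Keep only countries above an illustrative threshold (15% for this toy data).
above_threshold = shares[shares >= 15]
print(above_threshold)
```

On the real data the same call would be `country_shares(Zoomers)`, with the threshold set to 5.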

<h1 id="Education"> Current Education or Plan in 2 years
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#Education">¶</a>
</h1>

In [None]:
piePlot(dataframe=Zoomers, columnString="Q4", width=800, height=500, textinfo='percent')

<div class="alert alert-block alert-success">
<h2> Close to 82% hold a bachelor's or master's degree, or expect to hold one within the next 2 years. Note that there are also users with no education beyond high school, or who dropped out of college, and it is good to see them benefiting from these online platforms.</h2>
</div>



<h1 id="Role"> Current Role / Title
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#Role">¶</a>
</h1>

In [None]:
donutPlot(dataframe=Zoomers, columnString="Q5", width=800, height=500, textinfo='percent', holeSize=0.4)
#countPlot(dataframe=Zoomers, columnString="Q5", labelName="Current Role or Title")

<div class="alert alert-block alert-success">
<h2> More than half of all Zoomers are students, pursuing bachelor's, master's or doctoral degrees. Interestingly, 30% of the remainder work in roles related to data science, machine learning or software development. Taking the ratio to be around 2:1, if each employed user connected with at least 2 students, the learning would be much greater and more enjoyable.</h2>
</div>

<h1 id="CodingExp"> Coding Experience
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#CodingExp">¶</a>
</h1>

In [None]:
countPlot(dataframe=Zoomers, columnString="Q6", labelName="Coding Experience", width=800, height=600)
piePlot(dataframe=Zoomers, columnString="Q6", width=800, height=600)

<div class="alert alert-block alert-success">
<h2> Setting aside users who did not answer the question, 5% of Zoomers have no coding experience. 90% have between 1 and 5 years of experience. The remaining 5% are the extreme cases, coding for 5 to 20+ years! </h2>
</div>

<h1 id="ProgrammingLanguages"> Familiar Programming languages 
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#ProgrammingLanguages">¶</a>
</h1>

In [None]:
visualizeMultipleColumns_pie(columnString="Q7", labelName="values", dataframe=Zoomers, holeSize=0.4)

<div class="alert alert-block alert-success">
<h2> Most users are comfortable with Python, the obvious choice for ML and DS. It is followed by SQL, C++ and C. Proficiency in each language is not recorded; that field would have given better insights had it been included.   </h2>
</div>
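Multi-select questions such as Q7 are spread over one column per option, each holding the option label when selected and NaN otherwise, so the tally behind the chart is a non-null count per column. The sketch below shows that idea on hypothetical toy data; `toy` and `tally_multiselect` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the multi-select layout:
# one column per option, with the option label when selected and NaN otherwise.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan, "Python"],
    "Q7_Part_2": [np.nan, "SQL", "SQL", np.nan],
    "Q7_Part_3": ["C++", np.nan, np.nan, np.nan],
})

def tally_multiselect(frame, question):
    """Count selections per option for a multi-select question such as 'Q7'."""
    prefix = question + "_"
    columns = [c for c in frame.columns if c.startswith(prefix)]
    counts = {}
    for c in columns:
        selected = frame[c].dropna()
        if selected.empty:
            continue  # nobody picked this option, so there is no label to read
        counts[selected.iloc[0]] = len(selected)
    return pd.Series(counts, dtype="int64").sort_values(ascending=False)

language_counts = tally_multiselect(toy, "Q7")
print(language_counts)
```

Matching on the `Q7_` prefix rather than on a substring is a deliberate choice in this sketch: a substring test for `Q1` would also match `Q10_Part_1`. Skipping empty columns avoids an index error when an option was never selected.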

<h1 id="RecommendedProgrammingLanguage"> Recommended Programming Language for Aspiring Data Scientists
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#RecommendedProgrammingLanguage">¶</a>
</h1>

In [None]:
#We have null values, and to visualize with better numbers, dropping the Null values on Q8 column! 
Zoomers_DroppedNaN = Zoomers[Zoomers['Q8'].notna()]
treeMap(columnString="Q8", dataframe=Zoomers_DroppedNaN, labelName="Recommended programming Language for Aspiring Data Scientists")

<div class="alert alert-block alert-success">
<h2> 82.04% of users recommended Python as the programming language for aspiring data scientists! It is followed by R, C++ and SQL. Comparing this with the familiar programming languages, Python leads in both, with SQL, C++ and C below it among familiar languages. R can be considered second to Python as a language to learn. </h2>
</div>

<h1 id="IDEsUsed"> IDEs in use
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#IDEsUsed">¶</a>
</h1>

In [None]:
visualizeMultipleColumns_horz(columnString="Q9", dataframe=Zoomers, labelName="IDEs in Use")

<div class="alert alert-block alert-success">
<h2> Zoomers are comfortable with Jupyter-related products, VS Code, PyCharm and Spyder, followed by other IDEs. 28% of Zoomers preferred Jupyter. </h2>
</div>

<h1 id="HostedNotebookProducts"> Hosted Notebook Products in use
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#HostedNotebookProducts">¶</a>
</h1>

In [None]:
visualizeMultipleColumns_horz(columnString="Q10", dataframe=Zoomers, labelName="Hosted Notebook Products in use")

<div class="alert alert-block alert-success">
<h2> More than 50% of users use Google Colab or Kaggle Notebooks as hosted Python notebooks. This could be because they are easy to use and beginner-friendly. </h2>
</div>

<h1 id="ComputingPlatformUsed"> Most often used computing platform
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#ComputingPlatformUsed">¶</a>
</h1>

In [None]:
#Observed 17% of Nan values - dropping them and calibrating the plot
Zoomers_DroppedNaN = Zoomers[Zoomers['Q11'].notna()]
donutPlot(columnString="Q11", dataframe=Zoomers_DroppedNaN, width=900, height=500, textinfo='percent')

<div class="alert alert-block alert-success">
<h2> Most Zoomers are comfortable using their personal computer or laptop for computing. Only 11% choose cloud computing platforms. Cloud providers could look at what they can offer these users to increase cloud usage significantly. </h2>
</div>

<h1 id="SpecializedHardware"> Specialized hardware in use
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#SpecializedHardware">¶</a>
</h1>

In [None]:
visualizeMultipleColumns_vert(columnString="Q12", dataframe=Zoomers, labelName="Specialized hardware in use")

<div class="alert alert-block alert-success">
<h2> Users either prefer GPUs or use no specialized hardware at all!  </h2>
</div>

<h1 id="UseOfTPUs"> Approximate Use of TPUs
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#UseOfTPUs">¶</a>
</h1>

In [None]:
Zoomers_DroppedNaN_TPUs = Zoomers[Zoomers['Q13'].notna()]
donutPlot(columnString="Q13", dataframe=Zoomers_DroppedNaN_TPUs, width=900, height=500, textinfo='percent')

<div class="alert alert-block alert-success">
<h2> 70% of Gen-Z users have never used TPUs. Raising awareness of TPUs and their advantages, and offering them as default hardware so users can experience the processing speed, could help.  </h2>
</div>

<h1 id="VisualizationLibraries"> Visualization libraries or Tools in use
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#VisualizationLibraries">¶</a>
</h1>

In [None]:
visualizeMultipleColumns_horz(columnString="Q14", dataframe=Zoomers, labelName="Visualization Libraries or Tools in use")

<div class="alert alert-block alert-success">
<h2> Matplotlib and Seaborn are the most used visualization libraries amongst Zoomers. These are followed by Plotly, Ggplot and Geoplotlib. </h2>
</div>

<h1 id="MachineLearningMethods"> Experience in use of Machine Learning Methods
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#MachineLearningMethods">¶</a>
</h1>

In [None]:
#Observed null values in Q15 column. Dropping them! 
Zoomers_DroppedNaN_MLMethods = Zoomers[Zoomers['Q15'].notna()]
treeMap(columnString="Q15", dataframe=Zoomers_DroppedNaN_MLMethods, labelName="Experience in use of Machine learning methods")

<div class="alert alert-block alert-success">
<h2> Close to 55% of Gen-Z users have ML experience under 1 year. 23% of the rest have 1-2 years of experience</h2>
</div>

<h1 id="MLFrameworksUsed"> ML Framework used regularly
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#MLFrameworksUsed">¶</a>
</h1>

In [None]:
visualizeMultipleColumns_vert(columnString="Q16", dataframe=Zoomers, labelName="ML Framework used regularly")

<div class="alert alert-block alert-success">
<h2> Scikit-learn, TensorFlow, Keras, PyTorch, Xgboost and LightGBM are the most used ML frameworks amongst Zoomers. Based on this data, CatBoost, Fast.ai, Prophet and other frameworks could showcase more examples and become more user-friendly, so that users start adopting them.  </h2>
</div>

<h1 id="MLAlgorithms"> ML Algorithms used regularly
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#MLAlgorithms">¶</a>
</h1>

In [None]:
visualizeMultipleColumns_vert(columnString="Q17", dataframe=Zoomers, labelName="ML Algorithms used regularly")

<div class="alert alert-block alert-success">
<h2> Regression models are the most used, followed by decision trees. Then come deep networks and boosting algorithms. Comparing these with users from other generations might reveal some unique differences.  </h2>
</div>

<h1 id="YearlyCompensation"> Yearly Compensation 
    <a class="anchor-link" href="https://www.kaggle.com/koushikkumarl/role-of-gen-z-in-ds-and-ml/notebook#YearlyCompensation">¶</a>
</h1>

In [None]:
#Observed null values in Q25 column. The donut keeps them to show the share of missing answers; the treemap drops them.
Zoomers_DroppedNaN_comp = Zoomers[Zoomers['Q25'].notna()]
donutPlot(columnString="Q25", dataframe=Zoomers, width=600, height=500, textinfo='percent')
treeMap(columnString="Q25", dataframe=Zoomers_DroppedNaN_comp, labelName="Yearly Compensation")

<div class="alert alert-block alert-success">
<h2> 75% of Gen-Z users did not report their yearly compensation. Excluding the null values, 47% of the rest are not earning, which could be because students make up the majority of users. </h2>
</div>
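The two figures above come from two different denominators: the missing share is taken over all Zoomers, while the breakdown is taken over those who answered. A minimal sketch of that split on hypothetical toy data (the `toy` frame and its bracket labels are illustrative, not survey values):

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers standing in for the survey's Q25 (compensation) column.
toy = pd.DataFrame({"Q25": [np.nan] * 6 + ["$0-999", "$0-999", "1,000-1,999", "10,000-14,999"]})

# Share of respondents who left the question blank, over all rows.
missing_pct = toy["Q25"].isna().mean() * 100

# Distribution over those who did answer.
answered = toy["Q25"].dropna()
answered_pct = (answered.value_counts(normalize=True) * 100).round(1)

print(f"Missing: {missing_pct:.0f}%")
print(answered_pct)
```

Keeping the two steps separate makes it explicit which percentage refers to which group.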

#### To be Continued! This is my first Notebook on Kaggle! Feel free to give your views and areas of improvement! If you find this Notebook useful please Show your appreciation with an UPVOTE

#### Thanks for reading!