In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.patches import ConnectionPatch
import seaborn as sns
import sys
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **<div style="text-align: center">The Challenge Objective</div>**

**Tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners.**

# **<div style="text-align: center">What Do We Observe?</div>**

**We will explore the responses of data science practitioners: their job roles, experience, favorite tools, and other information related to becoming an expert in data science.**

# **<div style="text-align: center">Read The Dataset</div>**

**We read the Kaggle Survey 2021 response data provided by Kaggle. pandas shows a warning on load because the first row (index 0) has a different data type from the rows below it, so some columns end up with mixed types. We will remove row 0 and use the remaining rows as our main data frame.**

In [None]:
dataframe = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
dataframe.head()

# **<div style="text-align: center">Main Dataframe</div>**

In [None]:
df = dataframe.copy()
df = df[1:]
df = df.reset_index(drop=True) #Reset index
df.head()
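
**The row handling described above can be sketched on a tiny stand-in file. The CSV text, its column names, and the `toy`, `questions`, and `answers` variables are illustrative assumptions, not the real survey file. The idea is to keep row 0 for reference and convert columns back to numbers once it is removed.**

```python
import io
import pandas as pd

# Toy CSV shaped like a survey export: row 0 holds descriptive text, not an answer.
raw = io.StringIO(
    "Time from Start to Finish (seconds),Q1,Q2\n"
    "Duration (in seconds),What is your age (# years)?,What is your gender?\n"
    "910,50-54,Man\n"
    "784,22-24,Woman\n"
)
toy = pd.read_csv(raw)

questions = toy.iloc[0]                         # keep the descriptive row for reference
answers = toy.iloc[1:].reset_index(drop=True)   # the actual responses

# The duration column was read as text because of row 0, so convert it back.
duration_col = 'Time from Start to Finish (seconds)'
answers[duration_col] = pd.to_numeric(answers[duration_col])

print(questions['Q1'])
print(answers.dtypes)
```

**Keeping `questions` around is optional, but it makes it easier to look up what a column such as `Q1` asked.**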

# **<div style="text-align: center">Exploratory Data Analysis</div>**

**At this step, we'll try to understand and find out who is our subject of observation.**

# **A. Respondent's Profile**

In [None]:
def ds_explore(column):
    plt.figure(figsize=(12,6))
    count = sns.countplot(data=df, y=column, order=column.value_counts().iloc[:16].index)
    for i in count.patches:
        count.annotate(format(i.get_width()),((i.get_x() + i.get_width()), i.get_y()), 
                                xytext=(30,-8),fontsize=9, color='#000000', 
                                textcoords='offset points', 
                                horizontalalignment='right')

**1. Age**

In [None]:
ds_explore(df['Q1'])
plt.title('Age Distribution', fontfamily='sans-serif', fontsize=16, fontweight="bold")
plt.ylabel('Age', fontsize=16)

**2. Gender**

In [None]:
ds_explore(df['Q2'])
plt.title('Gender Distribution', fontfamily='sans-serif', fontsize=16, fontweight="bold")
plt.ylabel('Gender', fontsize=16)

**3. Country**

In [None]:
df['Q3'] = df['Q3'].replace({
    'United States of America':'USA',
    'United Kingdom of Great Britain and Northern Ireland':'UK'
})
ds_explore(df['Q3'])
plt.title('Country Distribution', fontfamily='sans-serif', fontsize=16, fontweight="bold")
plt.ylabel('Country', fontsize=16)

**4. Education**

In [None]:
df['Q4'] = df['Q4'].replace({
    'Some college/university study without earning a bachelor’s degree':'Without bachelor’s degree',
    'No formal education past high school':'High school'
})
ds_explore(df['Q4'])
plt.title('Formal Education Distribution', fontfamily='sans-serif', fontsize=16, fontweight="bold")
plt.ylabel('Education', fontsize=16)

**5. Job Title**

In [None]:
ds_explore(df['Q5'])
plt.title('Job Distribution', fontfamily='sans-serif', fontsize=16, fontweight="bold")
plt.ylabel('Job', fontsize=16)

**According to the charts above, the majority of survey respondents are men aged 18 to 29. India has the most respondents, followed by the United States in second place. Most respondents hold a master's degree. Some of them don't call themselves data scientists, but they have used data science in various capacities.**

# **B. Data Science Experience**

**Next we analyze the respondents' experience. This information is important because it will be used as one of the features for clustering, along with other supporting data.**

**1. Programming Experience**

In [None]:
df['Q6'] = df['Q6'].replace({
    'I have never written code':'0',
    '20+ years':'> 20'
})
df['Q6'] = df['Q6'].str.replace('years,?','', regex=True)
ds_explore(df['Q6'])
plt.title('Programming Experience', fontfamily='sans-serif', fontsize=16, fontweight="bold")
plt.ylabel('Years', fontsize=16)

**2. Machine Learning Experience**

In [None]:
df['Q15'] = df['Q15'].replace({
    'Under 1 year':'< 1',
    'I do not use machine learning methods':'0',
    '20 or more years':'> 20'
})
df['Q15'] = df['Q15'].str.replace('years,?','', regex=True)
ds_explore(df['Q15'])
plt.title('Machine Learning Experience', fontfamily='sans-serif', fontsize=16, fontweight="bold")
plt.ylabel('Years', fontsize=16)

# **C. Work as Data Scientist**

**Some of our respondents work in the data field and are paid for it. They also spend money on computing services. This information can be used as features.**

**1. Yearly Compensation**

In [None]:
plt.figure(figsize=(12,6))
count = sns.countplot(data=df, y='Q25', order=df['Q25'].value_counts().index)
for i in count.patches:
    count.annotate(format(i.get_width()),((i.get_x() + i.get_width()), i.get_y()),
                   xytext=(30,-8),fontsize=9, color='#000000', textcoords='offset points',
                   horizontalalignment='right')
plt.title('Yearly Compensation', fontfamily='sans-serif', fontsize=16, fontweight="bold")
plt.ylabel('Compensation (USD)', fontsize=16)

**2. Money Spent on Machine Learning/Cloud Computing Services**

In [None]:
df['Q26'] = df['Q26'].replace({
    '$0 ($USD)':'0',
    '$100,000 or more ($USD)':'> 100,000'
})
plt.figure(figsize=(12,6))
count = sns.countplot(data=df, y='Q26', order=df['Q26'].value_counts().index)
for i in count.patches:
    count.annotate(format(i.get_width()),((i.get_x() + i.get_width()), i.get_y()),
                   xytext=(30,-8),fontsize=9, color='#000000', textcoords='offset points',
                   horizontalalignment='right')
plt.title('Money Spent on Cloud Services', fontfamily='sans-serif', fontsize=16, fontweight="bold")
plt.ylabel('(USD)', fontsize=16)

# **D. Identifying The Data Scientist**

**Now we know the profile, experience, and salary of our respondents. Next, we will group them into categories using the profile, experience, and salary data. The categories will split our respondents by their capability or level.**

**1. Convert from string to numeric variable**

In [None]:
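def to_float(column):
    """Replace each range string (e.g. '1000-1999') with a random float from that range."""
    column = column.copy()  # work on a copy so the caller's Series is not mutated
    for i in column.index:
        value = column.at[i]
        if pd.isnull(value):
            continue
        value = value.strip()
        if '-' in value:
            low, high = (float(part) for part in value.split('-')[:2])
            column.at[i] = np.random.uniform(low, high)
        elif value == '0':  # values are strings, so compare with '0' rather than 0
            column.at[i] = 0.0
        else:
            # single number, e.g. '20' from '> 20': sample from [n, n + 2)
            low = float(value.split()[0])
            column.at[i] = np.random.uniform(low, low + 2.0)
    return pd.to_numeric(column)  # return a numeric dtype instead of object

def to_int(column):
    """Replace each range string (e.g. '18-21') with a random integer from that range."""
    column = column.copy()
    for i in column.index:
        value = column.at[i]
        if pd.isnull(value):
            continue
        value = value.strip()
        if '-' in value:
            low, high = (int(part) for part in value.split('-')[:2])
            column.at[i] = np.random.randint(low, high)  # upper bound is exclusive
        elif value == '0':
            column.at[i] = 0
        else:
            # single number, e.g. '20' from '> 20': draw n or n + 1
            low = int(value.split()[0])
            column.at[i] = np.random.randint(low, low + 2)
    return pd.to_numeric(column)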

In [None]:
# 1. Convert yearly compensation from a range string to a float by sampling a random value within the range
df['yearly_comp'] = df['Q25']
df['yearly_comp'] = df['yearly_comp'].str.replace(r'\$|,|<|>', '', regex=True)
df['yearly_comp'] = to_float(df['yearly_comp'])

# 2. Convert money spent on ML/cloud services from a range string to a float in the same way
df['money_spend'] = df['Q26']
df['money_spend'] = df['money_spend'].str.replace(r'\$|,|<|>', '', regex=True)
df['money_spend'] = to_float(df['money_spend'])

# 3. Convert programming experience from a range string to an int by sampling a random value within the range
df['prog_exp'] = df['Q6']
df['prog_exp'] = df['prog_exp'].str.replace(' |,|<|>', '', regex=True)
df['prog_exp'] = to_int(df['prog_exp'])

# 4. Convert machine learning experience from a range string to an int in the same way
df['ml_exp'] = df['Q15']
df['ml_exp'] = df['ml_exp'].str.replace(' |,|<|>', '', regex=True)
df['ml_exp'] = to_int(df['ml_exp'])

# 5. Convert age from a range string to an int
# '+' is a regex metacharacter, so it is matched literally with regex=False
df['age_y'] = df['Q1']
df['age_y'] = df['age_y'].str.replace('+', '', regex=False)
df['age_y'] = to_int(df['age_y'])
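
**Because the conversion draws random values, every run of the notebook produces slightly different numbers. A minimal sketch of a reproducible variant follows, assuming a seeded NumPy generator. The helper `sample_from_bucket`, its `open_width` parameter, and the sample values are hypothetical and not part of the analysis above.**

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so repeated runs give the same values

def sample_from_bucket(value, open_width=2.0):
    """Turn '1000-1999' into a random float in that range; '0' stays 0."""
    if pd.isnull(value):
        return np.nan
    value = str(value).strip()
    if value == '0':
        return 0.0
    if '-' in value:
        low, high = (float(part) for part in value.split('-')[:2])
    else:
        # open-ended bucket such as '500000': assume a fixed width above it
        low = float(value)
        high = low + open_width
    return rng.uniform(low, high)

buckets = pd.Series(['0', '1000-1999', '500000', None])
numeric = buckets.apply(sample_from_bucket)
print(numeric)
```

**Missing answers stay missing here, so they can still be handled in a separate step.**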

**2. Create a new dataframe for clustering**

In [None]:
newdf = df[['ml_exp', 'prog_exp', 'age_y', 'money_spend', 'yearly_comp']].copy()

**3. Handle the missing value**

In [None]:
# We don't drop rows with NaN values; we fill the missing values instead
# Sort the dataframe by age so that neighbouring rows are respondents of similar age
newdf = newdf.sort_values(by='age_y')
# Forward fill: propagate the last observed non-null value forward until another non-null value is encountered
newdf = newdf.ffill()

# Fill anything still missing (rows with no earlier value to copy) with the column mean
newdf['prog_exp'] = newdf['prog_exp'].fillna(newdf['prog_exp'].mean())
newdf['age_y'] = newdf['age_y'].fillna(newdf['age_y'].mean())
newdf['ml_exp'] = newdf['ml_exp'].fillna(newdf['ml_exp'].mean())
newdf['money_spend'] = newdf['money_spend'].fillna(newdf['money_spend'].mean())
newdf['yearly_comp'] = newdf['yearly_comp'].fillna(newdf['yearly_comp'].mean())
newdf = newdf.sort_index()
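
**The effect of sorting before forward filling can be seen on a small example. The `toy` frame and its values below are made up for illustration. Each missing value is copied from the nearest younger respondent, and only a row with nothing before it falls back to the column mean.**

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age_y': [45, 19, 30, 20],
    'yearly_comp': [90000.0, np.nan, np.nan, 5000.0],
})

filled = (
    toy.sort_values(by='age_y')   # row order by age: 19, 20, 30, 45
       .ffill()                   # copy the previous (younger) row's value forward
)
# The youngest row has nothing before it, so fall back to the column mean.
filled = filled.fillna(filled.mean()).sort_index()
print(filled)
```

**Note that the mean is computed after the forward fill, so copied values count toward it, as in the cell above.**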

**4. Clustering with KMeans method**

In [None]:
# Clustering library
from sklearn.cluster import KMeans

In [None]:
dataset = newdf[['ml_exp', 'prog_exp', 'age_y', 'money_spend', 'yearly_comp']].copy()

# Using elbow method to find the best K value
sse = []
k_rng = range(1, 10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(dataset)
    sse.append(km.inertia_)

plt.plot(k_rng, sse)
plt.xlabel('K')
plt.ylabel('Sum of Squared Error', fontsize=16)
plt.title('Elbow Method', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**Using 3 clusters is acceptable. Depending on the job opening, data scientists are commonly divided into entry-level, mid-level, and expert.**
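
**One caveat: K-Means groups points by Euclidean distance, so a column measured in tens of thousands (such as `yearly_comp`) can outweigh one measured in tens (such as `age_y`). A common remedy, shown here only as an optional sketch and not as part of this analysis, is to standardize each column first. The `toy` values are made up, and scikit-learn's `StandardScaler` applies the same transformation.**

```python
import pandas as pd

toy = pd.DataFrame({
    'age_y': [22, 35, 48, 29],
    'yearly_comp': [1500.0, 60000.0, 150000.0, 20000.0],
})

# z-score each column: subtract the mean, divide by the standard deviation
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(toy.std(ddof=0))      # very different spreads before scaling
print(scaled.std(ddof=0))   # both columns have unit spread afterwards
```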

In [None]:
# Modeling
model = KMeans(n_clusters=3)
y_pred = model.fit_predict(dataset)

# add cluster to new column
newdf['cluster'] = y_pred
newdf.corr()

**It is not easy to visualize every feature used in the clustering at once, so we will plot the clusters against one feature with the highest correlation value.**

In [None]:
df1 = newdf[newdf.cluster==0]
df2 = newdf[newdf.cluster==1]
df3 = newdf[newdf.cluster==2]
plt.figure(figsize=(8, 6))
plt.scatter(df1.age_y, df1['yearly_comp'], color='green')
plt.scatter(df2.age_y, df2['yearly_comp'], color='red')
plt.scatter(df3.age_y, df3['yearly_comp'], color='blue')
plt.xlabel('Age')
plt.ylabel('Yearly Compensation')

**5. Give a name to the cluster**

**We can label the cluster with the highest yearly compensation as Expert, the middle one as Mid, and the one with the lowest yearly compensation as Entry.**

In [None]:
newdf['exp_level'] = newdf['cluster']
newdf['exp_level'] = newdf['exp_level'].replace({
    0:'Entry',
    1:'Expert',
    2:'Mid'
})

# concatenate df with the exp_level column from newdf
df = pd.concat([df, newdf['exp_level']], axis=1)
df['exp_level'].value_counts()
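
**K-Means assigns cluster ids in an arbitrary order, so the id that corresponds to the highest compensation can change between runs. A minimal sketch of deriving the names from the data instead of typing them by hand follows. The `toy` frame is made up, and the approach assumes, as above, that mean yearly compensation is what separates the levels.**

```python
import pandas as pd

toy = pd.DataFrame({
    'cluster':     [2, 0, 1, 0, 2, 1],
    'yearly_comp': [900.0, 250000.0, 60000.0, 300000.0, 1500.0, 75000.0],
})

# Rank the cluster ids from lowest to highest mean compensation.
order = toy.groupby('cluster')['yearly_comp'].mean().sort_values().index.tolist()
names = dict(zip(order, ['Entry', 'Mid', 'Expert']))

toy['exp_level'] = toy['cluster'].map(names)
print(names)
```

**With this mapping the labels stay consistent even if the cluster ids are shuffled on the next run.**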

**6. Who are they?**

**We can use the experience level data to see their profile broken down by experience level.**

In [None]:
def profile_level(col, level, explode, explode_n):
    """Draw a pie of experience levels next to a stacked bar of `col` for one level.

    `explode` offsets the pie wedges; `explode_n` is the position of the wedge
    (in value_counts order) that the bar chart is connected to.
    """
    # make figure and assign axis objects
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    fig.subplots_adjust(wspace=0)

    # pie chart parameters
    ratios = df['exp_level'].value_counts()
    labels = ratios.index.tolist()  # same order as the wedge sizes

    # start the first wedge at the positive x-axis
    angle = 0
    ax1.pie(ratios, autopct='%1.1f%%', startangle=angle, labels=labels, explode=explode)

    # bar chart parameters
    xpos = 0
    bottom = 0
    ratios = df.loc[df['exp_level'] == level, col].value_counts()
    width = .2

    for j in range(len(ratios)):
        height = ratios.iloc[j]  # positional access; the index holds category names
        ax2.bar(xpos, height, width, bottom=bottom)
        ypos = bottom + ax2.patches[j].get_height() / 2
        bottom += height
        ax2.text(xpos, ypos, int(ax2.patches[j].get_height()), ha='center')

    ax2.set_title('Count')
    # legend entries follow value_counts order, matching the order the bars were drawn
    ax2.legend(ratios.index.tolist(), loc='upper right')
    ax2.axis('off')
    ax2.set_xlim(- 2.5 * width, 2.5 * width)

    # use ConnectionPatch to draw lines between the two plots
    # get the wedge data
    theta1, theta2 = ax1.patches[explode_n].theta1, ax1.patches[explode_n].theta2
    center, r = ax1.patches[explode_n].center, ax1.patches[explode_n].r
    bar_height = sum([item.get_height() for item in ax2.patches])

    # draw top connecting line
    x = r * np.cos(np.pi / 180 * theta2) + center[0]
    y = r * np.sin(np.pi / 180 * theta2) + center[1]
    con = ConnectionPatch(xyA=(-width / 2, bar_height), coordsA=ax2.transData, xyB=(x, y), coordsB=ax1.transData)
    con.set_color([0, 0, 0])
    con.set_linewidth(4)
    ax2.add_artist(con)

    # draw bottom connecting line
    x = r * np.cos(np.pi / 180 * theta1) + center[0]
    y = r * np.sin(np.pi / 180 * theta1) + center[1]
    con = ConnectionPatch(xyA=(-width / 2, 0), coordsA=ax2.transData, xyB=(x, y), coordsB=ax1.transData)
    con.set_color([0, 0, 0])
    ax2.add_artist(con)
    con.set_linewidth(4)
    plt.show()

**By Respondent's Age**

**Entry Level by Age**

In [None]:
sns.set_palette('CMRmap_r')
# col, level, explode, explode_n
profile_level('Q1', 'Entry', [0.1, 0.0, 0.0], 0)

**Middle Level by Age**

In [None]:
# col, level, explode, explode_n
profile_level('Q1', 'Mid', [0.0, 0.1, 0.0], 1)

**Expert Level by Age**

In [None]:
# col, level, explode, explode_n
profile_level('Q1', 'Expert', [0.0, 0.0, 0.1], 2)

**By Respondent's Education**

**Entry Level by Education**

In [None]:
# col, level, explode, explode_n
profile_level('Q4', 'Entry', [0.1, 0.0, 0.0], 0)

**Middle Level by Education**

In [None]:
# col, level, explode, explode_n
profile_level('Q4', 'Mid', [0.0, 0.1, 0.0], 1)

**Expert Level by Education**

In [None]:
# col, level, explode, explode_n
profile_level('Q4', 'Expert', [0.0, 0.0, 0.1], 2)

# **E. The Behavior of Data Scientist**

**We'll learn about data scientists' behaviors when working in the field, such as "What programming language do they use?" "What cloud computing services do they use?" "What are their preferred tools?" and so on. We'll use their actions as a blueprint for becoming data scientists.**

In [None]:
def behavior_ds(cols):
    list_Q = [col for col in df.columns if cols in col]
    Q = df[list_Q].copy()
    exp = pd.concat([Q, df['exp_level']], axis=1)
    new_exp = (exp.melt(
        id_vars='exp_level', 
        value_vars = list_Q, 
        value_name=cols).drop('variable', axis=1))
    plt.figure(figsize=(12,10))
    count = sns.countplot(data=new_exp, y=cols, hue='exp_level', order=new_exp[cols].value_counts().iloc[:10].index)
    for i in count.patches:
        count.annotate(int(i.get_width()),((i.get_x() + i.get_width()), i.get_y()), 
                       xytext=(26,-9),fontsize=9, color='#000000', textcoords='offset points', horizontalalignment='right')
        
def behavior_ds_one(col):
    plt.figure(figsize=(12,10))
    count = sns.countplot(data=df, y=col, hue='exp_level', order=df[col].value_counts().iloc[:10].index)
    for i in count.patches:
        count.annotate(format(i.get_width()),((i.get_x() + i.get_width()), i.get_y()), 
                       xytext=(30,-8),fontsize=9, color='#000000', textcoords='offset points', horizontalalignment='right')
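
**Multiple-choice questions are stored as one column per option, which is why `behavior_ds` melts them into a single column before counting. A minimal sketch of that reshaping on made-up data follows. The `toy` frame, its `Q7_Part_*` column names, and the values are illustrative assumptions.**

```python
import pandas as pd

toy = pd.DataFrame({
    'Q7_Part_1': ['Python', 'Python', None],
    'Q7_Part_2': ['R', None, 'R'],
    'exp_level': ['Entry', 'Mid', 'Expert'],
})

part_cols = [c for c in toy.columns if c.startswith('Q7_')]
long_df = (
    toy.melt(id_vars='exp_level', value_vars=part_cols, value_name='Q7')
       .drop(columns='variable')
       .dropna(subset=['Q7'])      # options that were not selected are empty
)
counts = long_df.groupby(['Q7', 'exp_level']).size()
print(counts)
```

**Each selected option becomes its own row, so a respondent who picked two languages is counted once for each.**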

**1. Programming Language**

**What is their favorite programming language?**

In [None]:
behavior_ds('Q7')
plt.ylabel('Programming Language', fontsize=16)
plt.title('Most used programming language on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**They also recommend which programming language you should learn to work in the data field.**

In [None]:
behavior_ds_one('Q8')
plt.ylabel('Programming Language', fontsize=16)
plt.title('Programming language for learning data science, recommended by data scientists', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**2. Integrated Development Environments (IDE)**

**What is their favorite IDE?**

In [None]:
behavior_ds('Q9')
plt.ylabel('IDE', fontsize=16)
plt.title('Most used IDE on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**What is their favorite Hosted Notebook?**

In [None]:
behavior_ds('Q10')
plt.ylabel('Hosted Notebook', fontsize=16)
plt.title('Most used hosted notebook products on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**3. Specialized Hardware**

**What is their favorite specialized hardware?**

In [None]:
behavior_ds('Q12')
plt.ylabel('Hardware', fontsize=16)
plt.title('Most used specialized hardware on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**4. Data Visualization Libraries**

**What is their favorite visualization library?**

In [None]:
behavior_ds('Q14')
plt.ylabel('Data Visualization Libraries', fontsize=16)
plt.title('Most used data visualization libraries on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**5. Machine Learning**

**What is their favorite machine learning framework?**

In [None]:
behavior_ds('Q16')
plt.ylabel('Machine Learning Framework', fontsize=16)
plt.title('Most used machine learning framework on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**What is their favorite machine learning algorithm?**

In [None]:
behavior_ds('Q17')
plt.ylabel('Machine Learning Algorithms', fontsize=16)
plt.title('Most used machine learning algorithms on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**What is their favorite machine learning product?**

In [None]:
behavior_ds('Q31_A')
plt.ylabel('Machine Learning Products', fontsize=16)
plt.title('Most used managed machine learning products', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**What is their favorite automated machine learning tool?**

In [None]:
behavior_ds('Q37_A')
plt.ylabel('Automated Machine Learning Tools', fontsize=16)
plt.title('Most used automated machine learning tools on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**6. Computer Vision**

**What is their favorite computer vision method?**

In [None]:
behavior_ds('Q18')
plt.ylabel('Computer Vision Methods', fontsize=16)
plt.title('Most used computer vision methods on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**7. Cloud Computing Platforms**

**What is their favorite cloud computing platform?**

In [None]:
behavior_ds('Q27_A')
plt.ylabel('Cloud Computing Platforms', fontsize=16)
plt.title('Most used cloud computing platforms on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**8. Data Storage Products**

**What is their favorite data storage product?**

In [None]:
behavior_ds('Q30_A')
plt.ylabel('Data Storage Products', fontsize=16)
plt.title('Most used data storage products on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**What is their favorite big data product?**

In [None]:
behavior_ds('Q32_A')
plt.ylabel('Big Data Products', fontsize=16)
plt.title('Most used big data products on regular basis', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**9. Business Intelligence Tools**

**What is their favorite business intelligence tool?**

In [None]:
behavior_ds_one('Q35')
plt.ylabel('Business Intelligence Tools', fontsize=16)
plt.title('Most used business intelligence tools for data science', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**10. Data Science Platform**

**Favorite platform to learn data science**

In [None]:
behavior_ds('Q40')
plt.ylabel('Learning Platform', fontsize=16)
plt.title('A platform to learn data science', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**Keep up-to-date with data science topics**

In [None]:
behavior_ds('Q42')
plt.ylabel('Media Sources', fontsize=16)
plt.title('Favorite media sources that report on data science topics', fontfamily='sans-serif', fontsize=16, fontweight="bold")

**What is their favorite platform to share their machine learning application?**

In [None]:
behavior_ds('Q39')
plt.ylabel('Publishing Platform', fontsize=16)
plt.title('Platforms for publishing data analysis or machine learning applications', fontfamily='sans-serif', fontsize=16, fontweight="bold")

# **F. Conclusion**

**When using data science tools, the majority of our respondents behaved in a similar way, whether they were beginners, intermediates, or experts.**
1. Python is the most popular programming language among our respondents, and they advise learning it if you want to work as a data scientist.
2. Their favorite IDEs are Jupyter Notebook and VSCode. Their favorite hosted notebooks are Colab and Kaggle.
3. Although the majority of our respondents do not use specialized hardware, NVIDIA GPUs and Google Cloud TPUs are the most popular among those who do.
4. Their favorite data visualization libraries are Matplotlib and Seaborn.
5. The most popular machine learning frameworks are Scikit-Learn and TensorFlow, and the most popular machine learning algorithms are Linear Regression and Decision Trees or Random Forests. However, most respondents do not use machine learning products or automated machine learning tools.
6. Their primary computer vision method is image classification.
7. Amazon's cloud computing and data storage products remain at the top.
8. MySQL is their most used big data product.
9. The most popular business intelligence tool is Microsoft Power BI, although some experts prefer Tableau.
10. Our respondents' preferred learning platforms are Coursera, Kaggle Learn, and Udemy. They use Kaggle to keep up with data science topics and GitHub to share their machine learning apps.

**We can draw some conclusions based on the information we've gathered.**

**Data Science Roadmap**

In [None]:
from datetime import date

dates = [date(2022, 1, 1), date(2022, 1, 15), date(2022, 2, 15), date(2022, 3, 15), 
         date(2022, 4, 15), date(2022, 6, 15), date(2022, 7, 1)]
 
labels = [
    'Statistics and Mathematics\nStd, Mean, Median, Mode, etc',
    'Programming\nPython, R, SQL', 
    'Data Extraction\nPython Libraries (e.g. Pandas, Numpy)', 
    'Exploratory Data Analysis\nSeaborn, Matplotlib, Plotly, etc',
    'Machine Learning\nMachine Learning Frameworks and Algorithms',
    'Data Engineering\nCloud Services, Deploying Machine Learning Model, ETL, etc',
    'Keep Practicing ~'
]

# Add label to dates
labels = ['{0:%d %b %Y}\n{1}'.format(d, l) for l, d in zip (labels, dates)]

fig, ax = plt.subplots(figsize=(16, 4), constrained_layout=True)
ax.set_ylim(-1, 1)

ax.scatter(dates, np.zeros(len(dates)), s=120, zorder=2)
ax.scatter(dates, np.zeros(len(dates)), s=30, zorder=3)

label_offsets = np.zeros(len(dates))
label_offsets[::2] = 0.4 # top label
label_offsets[1::2] = -0.7 # bottom label
for i, (l, d) in enumerate(zip(labels, dates)):
    ax.text(d, label_offsets[i], l, ha='center', fontsize=12)

stems = np.zeros(len(dates))
stems[::2] = 0.35 # top marker
stems[1::2] = -0.35 # bottom marker
markerline, stemline, baseline = ax.stem(dates, stems)
plt.setp(markerline, color='black')
plt.setp(stemline, color='black')

# hide chart border
for spine in ["left", "top", "right", "bottom"]:
    ax.spines[spine].set_visible(False)

ax.set_xticks([])
ax.set_yticks([])    
ax.set_title('Data Science Roadmap', fontweight="bold", fontsize=16)

**We can assume that our respondents' lowest level of education is high school. I believe we all studied basic statistics and mathematics in high school, and those fundamentals are enough to begin the journey to becoming a data scientist. Many data science learning platforms also offer basic statistics lectures as a refresher.**

**Although data science is a high-demand job, there are still few experts in the field. You can become a data scientist regardless of your background. After reading the data presented above and the Data Science Roadmap, I hope you have a better understanding of how to start as a data scientist and grow into an expert. This is only a small portion of what data science has to offer. You can learn more from experts by visiting data science forums or platforms like Kaggle.**

**I hope everyone who reads this notebook finds it useful. Since I'm also at the entry level, please let me know if you see any inaccurate or misleading information.**

# **<div style="text-align: center">Thank You!</div>**