 # <u> Exploratory Data Analysis of Kaggle Data Science Survey 2020 </u>

### This was an exciting project because the data is very rich and has many dimensions. This analysis only scratches the surface; there is much more deep-dive work that can be done with this data. 

### In this analysis, I answer some basic questions. Some of the answers confirmed what we already knew; others were new information. Credit goes to PAUL MOONEY: I used a technique from his notebook, counting the responses in several columns with a dictionary. Thank you. 

### Your comments, criticism and advice are very much welcome. Thank you.


In [None]:
# We'll start by importing the relevant modules

import os
import pandas as pd
import numpy as np
import math
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
pd.set_option('display.max_columns', 5000)
import squarify
import warnings

In [None]:
# Read and preview the data

file = '../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv'
df = pd.read_csv(file, low_memory=False)
df.head()
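
Several cells below slice with `df.iloc[1:]`, which suggests that the first row of the file holds the question text rather than a response. A minimal sketch of separating the two, on a made-up toy CSV (the question wording and the `toy`, `questions`, and `responses` names are assumptions for illustration):

```python
import io
import pandas as pd

# Toy CSV mimicking the assumed layout: a header row of question codes,
# then one row of question text, then the actual responses.
toy_csv = io.StringIO(
    "Q1,Q3\n"
    "What is your age (# years)?,In which country do you currently reside?\n"
    "25-29,India\n"
    "30-34,Brazil\n"
)
toy = pd.read_csv(toy_csv)

questions = toy.iloc[0]                          # question text, keyed by column code
responses = toy.iloc[1:].reset_index(drop=True)  # answers only

print(questions["Q3"])
print(responses.shape)
```

Keeping `questions` around makes it easy to look up what a column code such as `Q3` means while working only with `responses`.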

## Data cleaning & preview of basic descriptive statistics

In [None]:
# check for duplicates

df.duplicated().values.any() 

In [None]:
# Let's see the so-called duplicate rows 

duplicate_rows = df[df.duplicated()]
duplicate_rows

In [None]:
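# We will drop the duplicate rows, as there are too few of them to affect the analysis.
# keep='first' retains one copy of each duplicated row (keep=False would discard every copy).

df.drop_duplicates(keep='first', inplace=True)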
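
The `keep` argument of `drop_duplicates` decides what happens to the first copy of a repeated row. A small sketch on a made-up toy frame (not the survey data) shows the difference:

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical
toy = pd.DataFrame({"Q1": ["25-29", "25-29", "30-34"],
                    "Q3": ["India", "India", "Brazil"]})

kept_first = toy.drop_duplicates(keep="first")  # one copy of each repeated row survives
kept_none = toy.drop_duplicates(keep=False)     # every row that has a twin is removed

print(len(kept_first), len(kept_none))
```

With `keep=False` the repeated respondent disappears entirely, so `keep="first"` is usually the safer choice when the goal is only to remove the extra copies.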

In [None]:
# Run a quick check to confirm that no duplicates remain (expected result: False)

df.duplicated().values.any()

In [None]:
# Some descriptive statistics of selected columns; some patterns emerge.

df.iloc[:,0:7].describe()

In [None]:
# Let's check the dimensions of the data (rows, columns)
df.shape

## Data analysis: Let's begin to answer some questions

In [None]:
# Where are respondents located?

df.iloc[1:].groupby(['Q3'])['Q3'].count().sort_values(ascending=False)

In [None]:
# Top 10 countries of respondents (skipping the question-text row)

df.iloc[1:]['Q3'].value_counts().head(10)

In [None]:
# What is the most popular job title of respondents? (sorted so the most common comes first)

df.iloc[1:].groupby(['Q5'])['Q5'].count().sort_values(ascending=False)

#### Let's drill down to find most common job titles by most common countries

In [None]:
# India

df_india = df[df['Q3']=='India']
df_india.groupby(['Q5'])['Q5'].count()

In [None]:
# United States of America

df_usa = df[df['Q3']=='United States of America']
df_usa.groupby(['Q5'])['Q5'].count()

In [None]:
# Brazil

df_brazil = df[df['Q3']=='Brazil']
df_brazil.groupby(['Q5'])['Q5'].count()


In [None]:
# United Kingdom of Great Britain and Northern Ireland

df_uk = df[df['Q3']=='United Kingdom of Great Britain and Northern Ireland']
df_uk.groupby(['Q5'])['Q5'].count()
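
The four per-country cells above repeat the same filter-then-count pattern. As a sketch of an alternative, a single cross-tabulation gives the job-title counts for every country at once. The toy data below is made up, and `table` and `top_title` are illustrative names:

```python
import pandas as pd

toy = pd.DataFrame({
    "Q3": ["India", "India", "India", "Brazil", "Brazil", "Brazil"],
    "Q5": ["Student", "Student", "Data Scientist",
           "Data Analyst", "Data Analyst", "Student"],
})

# Rows are job titles, columns are countries, cells are respondent counts
table = pd.crosstab(toy["Q5"], toy["Q3"])

# Most common job title in each country
top_title = table.idxmax()
print(table)
print(top_title)
```

On the survey data, the same idea would use `df.iloc[1:]` in place of `toy`, optionally filtered to the countries of interest.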


In [None]:
# Let's import plotly for subsequent data visualization

import plotly.express as px

In [None]:
# What are the age groups of respondents?

age_group_count = df.iloc[1:]['Q1'].value_counts(dropna=False)
age_group_count

In [None]:
# Let's convert the result from a Series to a DataFrame

age_group_count = pd.DataFrame(age_group_count).reset_index()
age_group_count.columns = ['Age Group', 'Count']
age_group_count

In [None]:
# Let's visualize that in a simple table

#define figure and axes
fig, ax = plt.subplots(figsize=(6,6))

#hide the axes
fig.patch.set_visible(False)
ax.axis('off')
ax.axis('tight')

table = ax.table(cellText=age_group_count.values, colLabels=age_group_count.columns, loc='center')
plt.suptitle('Age Group of Kagglers', y= .75)


#display table
fig.tight_layout()
plt.show()

In [None]:
# We can then use the dataframe result to plot a graph

fig = px.bar(age_group_count, x="Age Group",y="Count")
fig.update_layout(
    title={'text': "Age group of Kaggle data scientists",
           'y':0.95,'x':0.5,
           'xanchor': 'center',
           'yanchor': 'top'}, autosize=False, width=700, height=400)
fig.show()

In [None]:
# Age of Kagglers by percentage distribution
# (normalize=True divides by the number of responses, excluding the question-text row)

count_num = df.iloc[1:]['Q1'].value_counts(normalize=True, dropna=False).mul(100).round(1)
count_num
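
The choice of denominator matters when computing percentages from counts. A toy sketch (made-up ages, one missing answer) compares dividing by `count()`, which ignores missing values, with `value_counts(normalize=True, dropna=False)`, which includes them:

```python
import pandas as pd

ages = pd.Series(["25-29", "25-29", "18-21", "30-34", None], name="Q1")

counts = ages.value_counts(dropna=False)

# Denominator excludes the missing answer, so the shares add up to more than 100
pct_nonnull = (counts * 100 / ages.count()).round(1)

# Denominator includes the missing answer, so the shares add up to 100
pct_all = (ages.value_counts(normalize=True, dropna=False) * 100).round(1)

print(pct_nonnull.sum(), pct_all.sum())
```

Whenever missing values are counted as a category, the denominator should count them too.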

In [None]:
# What is the gender distribution of respondents?

gender_of_kagglers = df.iloc[1:].groupby(['Q2'])['Q2'].count()
gender_of_kagglers

In [None]:
# Donut chart of the gender distribution
fig, ax = plt.subplots(figsize=(7, 6), subplot_kw=dict(aspect="equal"))
# groupby sorts its keys, so the index order matches the wedge order
label = list(gender_of_kagglers.index)
colors = ['cornflowerblue', 'purple', 'teal','orange', 'indianred']
# white circle for the donut centre
circle = plt.Circle( (0,0), 0.90, color='white')
ax.add_artist(circle)

wedges, texts = ax.pie(gender_of_kagglers, wedgeprops=dict(width=0.9), startangle=-105, colors = colors)

bbox_props = dict(boxstyle="square,pad=0.5", fc="w", ec="k", lw=0.75)
kw = dict(arrowprops=dict(arrowstyle="-"),
          bbox=bbox_props, zorder=0, va="center")

for i, p in enumerate(wedges):
    ang = (p.theta2 - p.theta1)/2 + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))
    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
    connectionstyle = "angle,angleA=0,angleB={}".format(ang)
    kw["arrowprops"].update({"connectionstyle": connectionstyle})
    ax.annotate(label[i], xy=(x, y), xytext=(1.2*np.sign(x), 1.5*y),
                horizontalalignment=horizontalalignment, **kw)
    
ax.set_title("What gender do respondents identify as?", y=1,fontdict={'fontsize': 16})
plt.tight_layout()
plt.axis('equal')
plt.show()

In [None]:
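# Which programming languages do Kagglers use? Slice the Q7 columns,
# skipping the first row (question text) so it is not counted as a response
fav_lang = df.iloc[1:].loc[:,'Q7_Part_1':'Q7_OTHER']
fav_lang.head()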

dict_count_Q7 = {
    'Python' : (fav_lang['Q7_Part_1'].count()),
    'R': (fav_lang['Q7_Part_2'].count()),
    'SQL' : (fav_lang['Q7_Part_3'].count()),
    'C' : (fav_lang['Q7_Part_4'].count()),
    'C++' : (fav_lang['Q7_Part_5'].count()),
    'Java' : (fav_lang['Q7_Part_6'].count()),
    'Javascript' : (fav_lang['Q7_Part_7'].count()),
    'Julia' : (fav_lang['Q7_Part_8'].count()),
    'Swift' : (fav_lang['Q7_Part_9'].count()),
    'Bash' : (fav_lang['Q7_Part_10'].count()),
    'MATLAB' : (fav_lang['Q7_Part_11'].count()),
    'None' : (fav_lang['Q7_Part_12'].count()),
    'Other' : (fav_lang['Q7_OTHER'].count())
}
dict_count_Q7


dict_count_Q7 = pd.DataFrame(dict_count_Q7.items(), columns=['Language', 'Count'])


fig = px.bar(dict_count_Q7, x="Language",y="Count")
fig.update_layout(
    title={'text': "Favourite Programming Language of Kagglers",
           'y':0.95,'x':0.5,
           'xanchor': 'center',
           'yanchor': 'top'}, autosize=False, width=700, height=400)
fig.show()
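
The dictionary-based counting used in this cell (and repeated in the cells that follow) works because each option of a multi-select question has its own column, which is empty unless the respondent picked that option. A minimal sketch of wrapping the pattern in a reusable function, where `count_selections` is a hypothetical helper and the toy data is made up:

```python
import pandas as pd

def count_selections(frame, labels):
    """Count non-null answers per column; `labels` maps column name -> display label."""
    counts = frame[list(labels)].count().rename(index=labels)
    return counts.rename_axis("Option").reset_index(name="Count")

# Toy multi-select data: one column per option, None where it was not selected
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None],
    "Q7_Part_2": [None, "R", None],
    "Q7_OTHER": [None, None, "Other"],
})
labels = {"Q7_Part_1": "Python", "Q7_Part_2": "R", "Q7_OTHER": "Other"}

lang_counts = count_selections(toy, labels)
print(lang_counts)
```

The resulting two-column DataFrame can be passed to `px.bar` with `x="Option"` and `y="Count"`, in the same way as the DataFrames built from the dictionaries.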


In [None]:
# slice the favourite IDE columns first and pass into a variable
# (skip the first row, which holds the question text rather than a response)
fav_ide = df.iloc[1:].loc[:,'Q9_Part_1':'Q9_OTHER']

# then count the total of each IDE using dictionary
dict_count_Q9 = {
    'JupyterLab' : (fav_ide['Q9_Part_1'].count()),
    'RStudio': (fav_ide['Q9_Part_2'].count()),
    'Visual Studio' : (fav_ide['Q9_Part_3'].count()),
    'Visual Studio Code (VSCode)' : (fav_ide['Q9_Part_4'].count()),
    'PyCharm' : (fav_ide['Q9_Part_5'].count()),
    'Spyder' : (fav_ide['Q9_Part_6'].count()),
    'Notepad++' : (fav_ide['Q9_Part_7'].count()),
    'Sublime Text' : (fav_ide['Q9_Part_8'].count()),
    'Vim, Emacs, or similar' : (fav_ide['Q9_Part_9'].count()),
    'MATLAB' : (fav_ide['Q9_Part_10'].count()),
    'None' : (fav_ide['Q9_Part_11'].count()),
    'Other' : (fav_ide['Q9_OTHER'].count())
}

# convert to Dataframe object
dict_count_Q9 = pd.DataFrame(dict_count_Q9.items(), columns=['IDE', 'Count of IDE'])


# then put all together and plot the chart

fig = px.bar(dict_count_Q9, x="IDE",y="Count of IDE")
fig.update_layout(
    title={'text': "Favourite IDE used by Kagglers",
           'y':0.95,'x':0.5,
           'xanchor': 'center',
           'yanchor': 'top'}, autosize=False, width=700, height=400)
fig.show()

In [None]:
# slice columns of favourite cloud platform and pass into a variable
# (skip the first row, which holds the question text rather than a response)
fav_clound_platform = df.iloc[1:].loc[:,'Q26_A_Part_1':'Q26_A_OTHER']


# count the total of each column of cloud platform using dictionary
dict_count_Q26_A = {
    'Amazon Web Services (AWS)' : (fav_clound_platform['Q26_A_Part_1'].count()),
    'Microsoft Azure': (fav_clound_platform['Q26_A_Part_2'].count()),
    'Google Cloud Platform (GCP)' : (fav_clound_platform['Q26_A_Part_3'].count()),
    'IBM Cloud / Red Hat' : (fav_clound_platform['Q26_A_Part_4'].count()),
    'Oracle Cloud' : (fav_clound_platform['Q26_A_Part_5'].count()),
    'SAP Cloud' : (fav_clound_platform['Q26_A_Part_6'].count()),
    'Salesforce Cloud' : (fav_clound_platform['Q26_A_Part_7'].count()),
    'VMware Cloud' : (fav_clound_platform['Q26_A_Part_8'].count()),
    'Alibaba Cloud' : (fav_clound_platform['Q26_A_Part_9'].count()),
    'Tencent Cloud' : (fav_clound_platform['Q26_A_Part_10'].count()),
    'None' : (fav_clound_platform['Q26_A_Part_11'].count()),
    'Other' : (fav_clound_platform['Q26_A_OTHER'].count())
}

# convert to Dataframe object
dict_count_Q26_A = pd.DataFrame(dict_count_Q26_A.items(), columns=['Cloud Platforms', 'Count of Cloud Platform'])


# then put all together and plot the chart

fig = px.bar(dict_count_Q26_A, x="Cloud Platforms",y="Count of Cloud Platform")
fig.update_layout(
    title={'text': "Favourite Cloud Computing platforms of Kagglers",
           'y':0.95,'x':0.5,
           'xanchor': 'center',
           'yanchor': 'top'}, autosize=False, width=700, height=400)
fig.show()

In [None]:
# slice columns of favourite big data platform and pass into a variable
# (skip the first row, which holds the question text rather than a response)
fav_database_platform = df.iloc[1:].loc[:,'Q29_A_Part_1':'Q29_A_OTHER']


# count the total of each big data platform using dictionary
dict_count_Q29 = {
    'MySQL' : (fav_database_platform['Q29_A_Part_1'].count()),
    'PostgreSQL': (fav_database_platform['Q29_A_Part_2'].count()),
    'SQLite' : (fav_database_platform['Q29_A_Part_3'].count()),
    'Oracle Database' : (fav_database_platform['Q29_A_Part_4'].count()),
    'MongoDB' : (fav_database_platform['Q29_A_Part_5'].count()),
    'Snowflake' : (fav_database_platform['Q29_A_Part_6'].count()),
    'IBM Db2' : (fav_database_platform['Q29_A_Part_7'].count()),
    'Microsoft SQL Server' : (fav_database_platform['Q29_A_Part_8'].count()),
    'Microsoft Access' : (fav_database_platform['Q29_A_Part_9'].count()),
    'Microsoft Azure Data Lake Storage' : (fav_database_platform['Q29_A_Part_10'].count()),
    'Amazon Redshift' : (fav_database_platform['Q29_A_Part_11'].count()),
    'Amazon Athena' : (fav_database_platform['Q29_A_Part_12'].count()),
    'Amazon DynamoDB' : (fav_database_platform['Q29_A_Part_13'].count()),
    'Google Cloud BigQuery' : (fav_database_platform['Q29_A_Part_14'].count()),
    'Google Cloud SQL' : (fav_database_platform['Q29_A_Part_15'].count()),
    'Google Cloud Firestore' : (fav_database_platform['Q29_A_Part_16'].count()),
    'None' : (fav_database_platform['Q29_A_Part_17'].count()),
    'Other' : (fav_database_platform['Q29_A_OTHER'].count())
}


# convert to Dataframe object
dict_count_Q29 = pd.DataFrame(dict_count_Q29.items(), columns=['Database Platforms', 'Count of Database Platform'])


# then put all together and plot the chart

fig = px.bar(dict_count_Q29, x="Database Platforms",y="Count of Database Platform", color="Database Platforms")
fig.update_layout(
    title={'text': "Favourite Big Data platforms of Kagglers",
           'y':0.95,'x':0.5,
           'xanchor': 'center',
           'yanchor': 'top'}, autosize=False, width=850, height=550)
fig.show()


In [None]:
# What is the most used Business intelligence tool?

# slice columns of favourite Business Intelligence tool and pass into a variable
# (skip the first row, which holds the question text rather than a response)
fav_business_intelligence_tool = df.iloc[1:].loc[:,'Q31_A_Part_1':'Q31_A_OTHER']


dict_count_Q31A = {
    'Amazon QuickSight' : (fav_business_intelligence_tool['Q31_A_Part_1'].count()),
    'Microsoft Power BI': (fav_business_intelligence_tool['Q31_A_Part_2'].count()),
    'Google Data Studio' : (fav_business_intelligence_tool['Q31_A_Part_3'].count()),
    'Looker' : (fav_business_intelligence_tool['Q31_A_Part_4'].count()),
    'Tableau' : (fav_business_intelligence_tool['Q31_A_Part_5'].count()),
    'Salesforce' : (fav_business_intelligence_tool['Q31_A_Part_6'].count()),
    'Einstein Analytics' : (fav_business_intelligence_tool['Q31_A_Part_7'].count()),
    'Qlik' : (fav_business_intelligence_tool['Q31_A_Part_8'].count()),
    'Domo' : (fav_business_intelligence_tool['Q31_A_Part_9'].count()),
    'TIBCO Spotfire' : (fav_business_intelligence_tool['Q31_A_Part_10'].count()),
    'Alteryx' : (fav_business_intelligence_tool['Q31_A_Part_11'].count()),
    'Sisense' : (fav_business_intelligence_tool['Q31_A_Part_12'].count()),
    'SAP Analytics Cloud' : (fav_business_intelligence_tool['Q31_A_Part_13'].count()),
    'None' : (fav_business_intelligence_tool['Q31_A_Part_14'].count()),
    'Other' : (fav_business_intelligence_tool['Q31_A_OTHER'].count())
}

# convert to Dataframe object
dict_count_Q31A = pd.DataFrame(dict_count_Q31A.items(), columns=['Favourite BI tool', 'Count of BI tools'])

# Let's visualize this in a simple table

# define figure and axes
fig, ax = plt.subplots(figsize=(6,5))

# hide the axes
fig.patch.set_visible(False)
ax.axis('off')
ax.axis('tight')

table = ax.table(cellText=dict_count_Q31A.values, colLabels=dict_count_Q31A.columns, loc='center')
plt.suptitle('Most Used Business Intelligence Tool of Kagglers', y=0.85)


# display table
fig.tight_layout()
plt.show()


In [None]:
# What programming language would Kagglers recommend to a beginner?

df.iloc[1:].groupby(['Q8'])['Q8'].count().sort_values(ascending=False)

In [None]:
# Let's go ahead and visualize that

# pass the Series from the previous calculation into a variable
recommended_programming_lang = df.iloc[1:].groupby(['Q8'])['Q8'].count().sort_values(ascending=False)

# then plot the Series using Matplotlib; the figure is created inside the
# style context so that the style actually applies to it
with plt.style.context('tableau-colorblind10'):
    fig, ax = plt.subplots(figsize=(7,5), tight_layout=True)
    ax.set_title('Recommended Programming Languages for Newbies')
    recommended_programming_lang.plot.bar(ax=ax)
plt.show()

In [None]:
# What is the yearly salary of Kagglers?

df.iloc[1:].groupby(['Q24'])['Q24'].count().sort_values(ascending=False)

In [None]:
# Let's visualize that in a simple graph

# pass the Series from the previous calculation into a variable
salary_range = df.iloc[1:].groupby(['Q24'])['Q24'].count().sort_values(ascending=False)


# then plot the Series using Matplotlib; axis labels are set on the axes,
# since x= and y= have no effect when plotting a Series
with plt.style.context('tableau-colorblind10'):
    fig, ax = plt.subplots(figsize=(7,6), tight_layout=True)
    ax.set_title('Salary Range of Kagglers')
    salary_range.plot.bar(ax=ax)
    ax.set_xlabel('Salary Range')
    ax.set_ylabel('Count of Salary Range')
plt.show()
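
The bars above are ordered by count, which scatters the salary bands along the axis. A sketch of ordering them by their lower bound instead is shown below. `salary_lower_bound` is a hypothetical helper, and the label formats in the toy data are an assumption about how `Q24` is written:

```python
import re
import pandas as pd

def salary_lower_bound(label):
    """Return the lower bound of a salary-range label as an int (hypothetical helper)."""
    digits = re.sub(r"[^\d\-]", "", label)  # keep only digits and the range dash
    return int(digits.split("-")[0])

# Toy counts keyed by salary-range label (made-up numbers)
toy_counts = pd.Series(
    {"1,000-1,999": 5, "$0-999": 9, "> $500,000": 1, "10,000-14,999": 7}
)

# Reorder the Series so the bands run from lowest to highest
ordered = toy_counts.loc[sorted(toy_counts.index, key=salary_lower_bound)]
print(ordered)
```

Plotting `ordered` instead of the count-sorted Series would show the salary distribution as a left-to-right progression.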

## Findings and Conclusion

<b>The analysis reinforced many things we have always known, such as the popularity of the Python programming language.
But there were other findings as well, which I point out below.</b>


1. India is the most represented country on Kaggle.
2. Nigeria is the most represented African country on Kaggle.
3. Python is by far the most used programming language among Kagglers.
4. Beginners could start with Python, MySQL, and Tableau or Power BI.
5. The largest age group among Kaggle members is 25-29. 
6. Students are by far the most represented group on Kaggle, followed by Data Scientists.
7. We need more women in Data Science.
8. Jupyter is the most commonly used IDE among Kaggle members.
9. AWS is the cloud platform of choice (no surprises here), followed by Google Cloud Platform.
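
A regional finding such as the second one can be checked by filtering the country column against a list of countries before counting. The sketch below uses made-up toy data, and `african_countries` is an illustrative, non-exhaustive list whose spellings are an assumption about how countries appear in `Q3`:

```python
import pandas as pd

# Illustrative, non-exhaustive list; the exact spellings in Q3 are an assumption
african_countries = ["Nigeria", "Egypt", "South Africa", "Kenya", "Morocco"]

toy = pd.DataFrame({"Q3": ["Nigeria", "India", "Egypt", "Nigeria", "Kenya", "Brazil"]})

# Keep only rows whose country is in the list, then count per country
africa_counts = toy.loc[toy["Q3"].isin(african_countries), "Q3"].value_counts()
print(africa_counts.idxmax())
```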



And that brings us to the end of this project. As I said at the beginning, this only scratches the surface; there is so much more that can be done with this data, and with the newly updated survey for 2021. Your comments, criticisms and advice are welcome.

And please, feel free to build on this project if you wish to.
