# Kaggle Data Analysis 

This is my first notebook. The main idea is to apply exploratory data analysis (EDA) techniques and practice data visualization on the 2020 Kaggle survey.

Feel free to comment and drop some nice tips! I'm here to learn.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# pandas and numpy were already imported in the first cell
import plotly.express as px  # high-level plotting API used for every chart below
from pandas_profiling import ProfileReport  # the package is named ydata_profiling in newer releases

In [None]:
df = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
#pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
df.head()

In [None]:
df.info()

In [None]:
report = ProfileReport(df, title='Profiling Kaggle Survey 2020', minimal=True)

In [None]:
report

# Data Cleaning


Let's check how many missing values each column has.

In [None]:
pd.set_option('display.max_rows', None)
df.isna().sum()
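The full listing is long, so a shorter view can help. Here is a minimal sketch, on a made-up frame, that shows the percentage of missing values per column and keeps only the columns that have any. The toy values and the `missing_share` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical answers that mimic the survey layout
toy = pd.DataFrame({
    'Q1': ['25-29', '30-34', '18-21', '22-24'],
    'Q4': ["Master's degree", np.nan, "Bachelor's degree", np.nan],
    'Q7_Part_1': ['Python', np.nan, np.nan, np.nan],
})

# isna().mean() gives the fraction of missing cells per column
missing_share = toy.isna().mean().mul(100).round(1)

# Keep only columns with at least one missing value, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

In the multi-select `_Part_` columns, a missing cell most likely means the option was not selected rather than that the question was skipped.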

As we can see, the only questions without missing values are `Q1` to `Q3`, which correspond to `age`, `gender`, and `country`.


The columns with missing values are:
* `Q4`: Formal education
* `Q5`: Current role
* `Q6`: Years of programming experience
* `Q7`: Which programming languages do you use (divided into 13 parts)
* `Q8`: Which programming language would you recommend
* `Q9`: Which IDEs do you use (divided into 12 parts)
* `Q10`: Which hosted notebook products do you use (divided into 14 parts)
* `Q11`: What kind of computing platform do you use in data science projects
* `Q12`: Which types of hardware do you use (divided into 4 parts)
* `Q13`: How many times have you used a TPU
* `Q14`: Which data visualization libraries do you use (divided into 12 parts)
* `Q15`: For how many years have you used machine learning methods
* `Q16`: Which machine learning frameworks do you use (divided into 16 parts)
* `Q17`: Which ML algorithms do you use (divided into 12 parts)
* `Q18`: Which CV methods do you use (divided into 7 parts)
* `Q19`: Which NLP methods do you use (divided into 6 parts)

Questions about the workplace:

* `Q20`: Size of the company where you work
* `Q21`: How many people work with data science at your workplace
* `Q22`: Does your workplace use ML methods
* `Q23`: Activities that are an important part of your job (divided into 8 parts)
* `Q24`: How much are you paid per year

Questions about services and tools:

* `Q25`: How much have you invested in ML and Cloud Services in the past 5 years
* `Q26`: Which Cloud Computing services do you use (divided into 12 parts)
* `Q27`: Which Cloud Computing products do you use (divided into 12 parts)
* `Q28`: Which ML products do you use (divided into 11 parts)
* `Q29`: Which Big Data products do you use (divided into 18 parts)
* `Q30`: Which Big Data product do you use most
* `Q31`: Which BI tools do you use (divided into 15 parts)
* `Q32`: Which BI tool do you use most
* `Q33`: Do you use any automated ML tools (divided into 8 parts)
* `Q34`: Which automated ML tools do you use regularly (divided into 12 parts)
* `Q35`: Do you use any tools to help manage ML experiments (divided into 11 parts)
* `Q36`: Where do you share your data analysis or ML applications (divided into 10 parts)
* `Q37`: On which platforms have you completed ML courses (divided into 12 parts)
* `Q38`: Which primary tool do you use to analyze data
* `Q39`: Favorite media sources that report on DS topics (divided into 12 parts)

Questions about the future:

* `Q26_B`: Cloud Computing platforms you hope to become more familiar with in the next 2 years (divided into 12 parts)
* `Q27_B`: Cloud Computing products you hope to become more familiar with in the next 2 years (divided into 12 parts)
* `Q28_B`: ML products you hope to become more familiar with in the next 2 years (divided into 11 parts)
* `Q29_B`: Big Data products you hope to become more familiar with in the next 2 years (divided into 18 parts)
* `Q31_B`: BI tools you hope to become more familiar with in the next 2 years (divided into 15 parts)
* `Q33_B`: Categories of automated ML tools you hope to become more familiar with in the next 2 years (divided into 8 parts)
* `Q34_B`: Specific automated ML tools you hope to become more familiar with in the next 2 years (divided into 12 parts)
* `Q35_B`: Tools for managing ML experiments you hope to become more familiar with in the next 2 years (divided into 11 parts)
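
The part counts listed above can be checked programmatically instead of by eye. Here is a minimal sketch, assuming every multi-select column is named `<question>_Part_<n>` or `<question>_OTHER`. The column names are a made-up sample, and `parts_per_question` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample of column names following the survey's naming pattern
columns = pd.Index(['Q1', 'Q7_Part_1', 'Q7_Part_2', 'Q7_OTHER',
                    'Q26_A_Part_1', 'Q26_A_OTHER', 'Q26_B_Part_1'])

# Strip the "_Part_N" or "_OTHER" suffix to recover the question id
question_ids = columns.str.replace(r'_(Part_\d+|OTHER)$', '', regex=True)

# Number of columns (parts) that belong to each question
parts_per_question = pd.Series(question_ids).value_counts(sort=False)
print(parts_per_question)
```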

After identifying which columns have missing values and grouping them by topic, it became easier to choose the columns I want to work with in my analysis:

* Show Q1, Q2, and Q3 and analyze them;

* Compare Q6 and Q8 to show which programming language users recommend at each level of coding experience;
* Compare Q7 and Q8 to see which languages users recommend against the ones they use regularly;

* Use Q9 and Q10 to show which IDEs and hosted notebooks are most popular among users;

* Use Q14 to visualize which libraries appear most often, and follow the same procedure for Q17, Q18, and Q19.


### Drop columns we won't work with

In [None]:
df.columns.values

In [None]:
# working with columns: Q1, Q2, Q3, Q6, Q7, Q8, Q9, Q10, Q14, Q17, Q18, Q19
df = df.drop(['Q4','Q5','Q11', 'Q12_Part_1',
       'Q12_Part_2', 'Q12_Part_3', 'Q12_OTHER', 'Q13','Q15', 
       'Q16_Part_1','Q16_Part_2', 'Q16_Part_3', 'Q16_Part_4', 'Q16_Part_5',
       'Q16_Part_6', 'Q16_Part_7', 'Q16_Part_8', 'Q16_Part_9',
       'Q16_Part_10', 'Q16_Part_11', 'Q16_Part_12', 'Q16_Part_13',
       'Q16_Part_14', 'Q16_Part_15', 'Q16_OTHER','Q20',
       'Q21', 'Q22', 'Q23_Part_1', 'Q23_Part_2', 'Q23_Part_3',
       'Q23_Part_4', 'Q23_Part_5', 'Q23_Part_6', 'Q23_Part_7',
       'Q23_OTHER', 'Q24', 'Q25', 'Q26_A_Part_1', 'Q26_A_Part_2',
       'Q26_A_Part_3', 'Q26_A_Part_4', 'Q26_A_Part_5', 'Q26_A_Part_6',
       'Q26_A_Part_7', 'Q26_A_Part_8', 'Q26_A_Part_9', 'Q26_A_Part_10',
       'Q26_A_Part_11', 'Q26_A_OTHER', 'Q27_A_Part_1', 'Q27_A_Part_2',
       'Q27_A_Part_3', 'Q27_A_Part_4', 'Q27_A_Part_5', 'Q27_A_Part_6',
       'Q27_A_Part_7', 'Q27_A_Part_8', 'Q27_A_Part_9', 'Q27_A_Part_10',
       'Q27_A_Part_11', 'Q27_A_OTHER', 'Q28_A_Part_1', 'Q28_A_Part_2',
       'Q28_A_Part_3', 'Q28_A_Part_4', 'Q28_A_Part_5', 'Q28_A_Part_6',
       'Q28_A_Part_7', 'Q28_A_Part_8', 'Q28_A_Part_9', 'Q28_A_Part_10',
       'Q28_A_OTHER', 'Q29_A_Part_1', 'Q29_A_Part_2', 'Q29_A_Part_3',
       'Q29_A_Part_4', 'Q29_A_Part_5', 'Q29_A_Part_6', 'Q29_A_Part_7',
       'Q29_A_Part_8', 'Q29_A_Part_9', 'Q29_A_Part_10', 'Q29_A_Part_11',
       'Q29_A_Part_12', 'Q29_A_Part_13', 'Q29_A_Part_14', 'Q29_A_Part_15',
       'Q29_A_Part_16', 'Q29_A_Part_17', 'Q29_A_OTHER', 'Q30',
       'Q31_A_Part_1', 'Q31_A_Part_2', 'Q31_A_Part_3', 'Q31_A_Part_4',
       'Q31_A_Part_5', 'Q31_A_Part_6', 'Q31_A_Part_7', 'Q31_A_Part_8',
       'Q31_A_Part_9', 'Q31_A_Part_10', 'Q31_A_Part_11', 'Q31_A_Part_12',
       'Q31_A_Part_13', 'Q31_A_Part_14', 'Q31_A_OTHER', 'Q32',
       'Q33_A_Part_1', 'Q33_A_Part_2', 'Q33_A_Part_3', 'Q33_A_Part_4',
       'Q33_A_Part_5', 'Q33_A_Part_6', 'Q33_A_Part_7', 'Q33_A_OTHER',
       'Q34_A_Part_1', 'Q34_A_Part_2', 'Q34_A_Part_3', 'Q34_A_Part_4',
       'Q34_A_Part_5', 'Q34_A_Part_6', 'Q34_A_Part_7', 'Q34_A_Part_8',
       'Q34_A_Part_9', 'Q34_A_Part_10', 'Q34_A_Part_11', 'Q34_A_OTHER',
       'Q35_A_Part_1', 'Q35_A_Part_2', 'Q35_A_Part_3', 'Q35_A_Part_4',
       'Q35_A_Part_5', 'Q35_A_Part_6', 'Q35_A_Part_7', 'Q35_A_Part_8',
       'Q35_A_Part_9', 'Q35_A_Part_10', 'Q35_A_OTHER', 'Q36_Part_1',
       'Q36_Part_2', 'Q36_Part_3', 'Q36_Part_4', 'Q36_Part_5',
       'Q36_Part_6', 'Q36_Part_7', 'Q36_Part_8', 'Q36_Part_9',
       'Q36_OTHER', 'Q37_Part_1', 'Q37_Part_2', 'Q37_Part_3',
       'Q37_Part_4', 'Q37_Part_5', 'Q37_Part_6', 'Q37_Part_7',
       'Q37_Part_8', 'Q37_Part_9', 'Q37_Part_10', 'Q37_Part_11',
       'Q37_OTHER', 'Q38', 'Q39_Part_1', 'Q39_Part_2', 'Q39_Part_3',
       'Q39_Part_4', 'Q39_Part_5', 'Q39_Part_6', 'Q39_Part_7',
       'Q39_Part_8', 'Q39_Part_9', 'Q39_Part_10', 'Q39_Part_11',
       'Q39_OTHER', 'Q26_B_Part_1', 'Q26_B_Part_2', 'Q26_B_Part_3',
       'Q26_B_Part_4', 'Q26_B_Part_5', 'Q26_B_Part_6', 'Q26_B_Part_7',
       'Q26_B_Part_8', 'Q26_B_Part_9', 'Q26_B_Part_10', 'Q26_B_Part_11',
       'Q26_B_OTHER', 'Q27_B_Part_1', 'Q27_B_Part_2', 'Q27_B_Part_3',
       'Q27_B_Part_4', 'Q27_B_Part_5', 'Q27_B_Part_6', 'Q27_B_Part_7',
       'Q27_B_Part_8', 'Q27_B_Part_9', 'Q27_B_Part_10', 'Q27_B_Part_11',
       'Q27_B_OTHER', 'Q28_B_Part_1', 'Q28_B_Part_2', 'Q28_B_Part_3',
       'Q28_B_Part_4', 'Q28_B_Part_5', 'Q28_B_Part_6', 'Q28_B_Part_7',
       'Q28_B_Part_8', 'Q28_B_Part_9', 'Q28_B_Part_10', 'Q28_B_OTHER',
       'Q29_B_Part_1', 'Q29_B_Part_2', 'Q29_B_Part_3', 'Q29_B_Part_4',
       'Q29_B_Part_5', 'Q29_B_Part_6', 'Q29_B_Part_7', 'Q29_B_Part_8',
       'Q29_B_Part_9', 'Q29_B_Part_10', 'Q29_B_Part_11', 'Q29_B_Part_12',
       'Q29_B_Part_13', 'Q29_B_Part_14', 'Q29_B_Part_15', 'Q29_B_Part_16',
       'Q29_B_Part_17', 'Q29_B_OTHER', 'Q31_B_Part_1', 'Q31_B_Part_2',
       'Q31_B_Part_3', 'Q31_B_Part_4', 'Q31_B_Part_5', 'Q31_B_Part_6',
       'Q31_B_Part_7', 'Q31_B_Part_8', 'Q31_B_Part_9', 'Q31_B_Part_10',
       'Q31_B_Part_11', 'Q31_B_Part_12', 'Q31_B_Part_13', 'Q31_B_Part_14',
       'Q31_B_OTHER', 'Q33_B_Part_1', 'Q33_B_Part_2', 'Q33_B_Part_3',
       'Q33_B_Part_4', 'Q33_B_Part_5', 'Q33_B_Part_6', 'Q33_B_Part_7',
       'Q33_B_OTHER', 'Q34_B_Part_1', 'Q34_B_Part_2', 'Q34_B_Part_3',
       'Q34_B_Part_4', 'Q34_B_Part_5', 'Q34_B_Part_6', 'Q34_B_Part_7',
       'Q34_B_Part_8', 'Q34_B_Part_9', 'Q34_B_Part_10', 'Q34_B_Part_11',
       'Q34_B_OTHER', 'Q35_B_Part_1', 'Q35_B_Part_2', 'Q35_B_Part_3',
       'Q35_B_Part_4', 'Q35_B_Part_5', 'Q35_B_Part_6', 'Q35_B_Part_7',
       'Q35_B_Part_8', 'Q35_B_Part_9', 'Q35_B_Part_10', 'Q35_B_OTHER'], axis=1)
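
The hand-written list above could also be derived from the question ids that are kept. Here is a minimal sketch on a toy frame, assuming every survey column starts with its question id followed by `_`. The toy column names, `keep_questions`, and `question_id` are illustrative, not part of the dataset.

```python
import pandas as pd

# Toy frame with hypothetical columns that mimic the survey's naming pattern
toy = pd.DataFrame(columns=['Time', 'Q1', 'Q4', 'Q7_Part_1', 'Q7_OTHER',
                            'Q10_Part_1', 'Q16_Part_1', 'Q26_A_Part_1'])

# Questions to keep; everything else that starts with "Q" is dropped
keep_questions = ['Q1', 'Q7']


def question_id(column):
    """Return the leading question id, e.g. 'Q7' for 'Q7_Part_1'."""
    return column.split('_')[0]


# Non-question columns (here 'Time') are left untouched
keep = [c for c in toy.columns
        if not c.startswith('Q') or question_id(c) in keep_questions]
toy = toy[keep]
print(list(toy.columns))
```

Comparing whole question ids rather than prefixes avoids keeping `Q10_Part_1` by accident when only `Q1` is wanted.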

#### Columns we are going to work with

In [None]:
df.columns

### Join columns

In [None]:
df.head()

#### Drop 1st row and 1st column

In [None]:
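df = df.drop(df.index[0])            # the first row holds the question text, not a response
df = df.drop(columns=df.columns[0])  # the first column is the survey duration, not used here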
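
Before dropping the first row, it can be handy to keep the question wording for chart labels. Here is a minimal sketch on a toy frame, assuming (as in the survey file) that the first row holds the question text. The toy values and the `questions` name are made up for illustration.

```python
import pandas as pd

# Toy frame: row 0 holds hypothetical question text, the rest are responses
toy = pd.DataFrame({
    'Time': ['Duration (in seconds)', '120', '95'],
    'Q1': ['What is your age (# years)?', '25-29', '30-34'],
})

questions = toy.iloc[0]                  # keep the wording, indexed by column name
toy = toy.drop(toy.index[0])             # remove the question-text row
toy = toy.drop(columns=toy.columns[0])   # remove the first column
print(questions['Q1'])
```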

#### Join the columns

In [None]:
c = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5','Q7_Part_6', 'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9', 'Q7_Part_10','Q7_Part_11', 'Q7_Part_12', 'Q7_OTHER']
df['Q7'] = df[c].apply(lambda x : '_'.join(x.dropna().astype(str)), axis=1)
       
c = ['Q9_Part_1', 'Q9_Part_2','Q9_Part_3', 'Q9_Part_4', 'Q9_Part_5', 'Q9_Part_6', 'Q9_Part_7','Q9_Part_8', 'Q9_Part_9', 'Q9_Part_10', 'Q9_Part_11', 'Q9_OTHER']
df['Q9'] = df[c].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1)

c = ['Q10_Part_1', 'Q10_Part_2', 'Q10_Part_3', 'Q10_Part_4', 'Q10_Part_5','Q10_Part_6', 'Q10_Part_7', 'Q10_Part_8', 'Q10_Part_9', 'Q10_Part_10','Q10_Part_11', 'Q10_Part_12', 'Q10_Part_13', 'Q10_OTHER']
df['Q10'] = df[c].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1)
       
c = ['Q14_Part_1','Q14_Part_2', 'Q14_Part_3', 'Q14_Part_4', 'Q14_Part_5', 'Q14_Part_6','Q14_Part_7', 'Q14_Part_8', 'Q14_Part_9', 'Q14_Part_10', 'Q14_Part_11','Q14_OTHER']
df['Q14'] = df[c].apply(lambda x : '_'.join(x.dropna().astype(str)), axis=1)

c = ['Q17_Part_1', 'Q17_Part_2', 'Q17_Part_3', 'Q17_Part_4','Q17_Part_5', 'Q17_Part_6', 'Q17_Part_7', 'Q17_Part_8', 'Q17_Part_9','Q17_Part_10', 'Q17_Part_11', 'Q17_OTHER']
df['Q17'] = df[c].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1)
                
c = ['Q18_Part_1', 'Q18_Part_2','Q18_Part_3', 'Q18_Part_4', 'Q18_Part_5', 'Q18_Part_6', 'Q18_OTHER']
df['Q18'] = df[c].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1)

c =['Q19_Part_1', 'Q19_Part_2', 'Q19_Part_3', 'Q19_Part_4', 'Q19_Part_5','Q19_OTHER']
df['Q19'] = df[c].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1)
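
The seven near-identical blocks above can be written as one loop. Here is a minimal sketch on toy data, assuming the part columns of a question all start with `<question>_`. The toy answers are made up, and this version also drops the part columns as it goes.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical multi-select answers
toy = pd.DataFrame({
    'Q7_Part_1': ['Python', np.nan, 'Python'],
    'Q7_Part_2': ['R', np.nan, np.nan],
    'Q7_OTHER': [np.nan, np.nan, 'Other'],
    'Q18_Part_1': [np.nan, 'Image classification', np.nan],
    'Q18_OTHER': [np.nan, np.nan, np.nan],
})

for question in ['Q7', 'Q18']:
    # All part columns of this question, e.g. Q7_Part_1 ... Q7_OTHER
    part_cols = [c for c in toy.columns if c.startswith(question + '_')]
    # Join the selected options of each respondent with "_"
    toy[question] = toy[part_cols].apply(
        lambda row: '_'.join(row.dropna().astype(str)), axis=1)
    toy = toy.drop(columns=part_cols)

print(toy)
```

Respondents who selected nothing end up with an empty string, which is why a later step turns empty cells into NaN.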

In [None]:
#Drop the old columns
df = df.drop(['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5',
       'Q7_Part_6', 'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9', 'Q7_Part_10',
       'Q7_Part_11', 'Q7_Part_12', 'Q7_OTHER','Q9_Part_1', 'Q9_Part_2',
       'Q9_Part_3', 'Q9_Part_4', 'Q9_Part_5', 'Q9_Part_6', 'Q9_Part_7',
       'Q9_Part_8', 'Q9_Part_9', 'Q9_Part_10', 'Q9_Part_11', 'Q9_OTHER',
        'Q10_Part_1', 'Q10_Part_2', 'Q10_Part_3', 'Q10_Part_4', 'Q10_Part_5',
       'Q10_Part_6', 'Q10_Part_7', 'Q10_Part_8', 'Q10_Part_9', 'Q10_Part_10',
       'Q10_Part_11', 'Q10_Part_12', 'Q10_Part_13', 'Q10_OTHER','Q14_Part_1',
       'Q14_Part_2', 'Q14_Part_3', 'Q14_Part_4', 'Q14_Part_5', 'Q14_Part_6',
       'Q14_Part_7', 'Q14_Part_8', 'Q14_Part_9', 'Q14_Part_10', 'Q14_Part_11',
       'Q14_OTHER','Q17_Part_1', 'Q17_Part_2', 'Q17_Part_3', 'Q17_Part_4',
       'Q17_Part_5', 'Q17_Part_6', 'Q17_Part_7', 'Q17_Part_8', 'Q17_Part_9',
       'Q17_Part_10', 'Q17_Part_11', 'Q17_OTHER','Q18_Part_1', 'Q18_Part_2',
       'Q18_Part_3', 'Q18_Part_4', 'Q18_Part_5', 'Q18_Part_6', 'Q18_OTHER',
       'Q19_Part_1', 'Q19_Part_2', 'Q19_Part_3', 'Q19_Part_4', 'Q19_Part_5',
       'Q19_OTHER'],axis=1)

#### Explode values from new columns

In [None]:
# An exploded Series has one row per selected answer, so it is longer than df and
# cannot be assigned back to a df column without mixing up respondents.
# Keep the joined strings in df and explode into long form only when counting.
def explode_answers(frame, col):
    """Return a copy of `frame` with one row per answer selected in column `col`."""
    out = frame.dropna(subset=[col]).copy()
    out[col] = out[col].str.split('_')
    return out.explode(col)


def count_answers(frame, col, label):
    """Count how often each answer in multi-select column `col` was selected."""
    counts = explode_answers(frame[[col]], col)[col].value_counts()
    return counts.rename_axis(label).reset_index(name='Quantity')

#### Replace empty cells with NaN

In [None]:
# Respondents who selected nothing end up with an empty joined string
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)

#### Fix Q6 and Q8 missing values

In [None]:
# Fill missing answers with the most frequent value (the mode)
df['Q6'] = df['Q6'].fillna(df['Q6'].mode()[0])
df['Q8'] = df['Q8'].fillna(df['Q8'].mode()[0])

# Data visualization

In [None]:
# Stacked bar chart of respondents by age and gender

q = df.groupby(['Q1', 'Q2']).size().reset_index(name='Quantity')
q = q.rename(columns={'Q1': 'Age', 'Q2': 'Gender'})

fig = px.bar(q, x='Age', y='Quantity', color='Gender', title='Respondents by age and gender', height=500)
fig.show()

In [None]:
# Gender and country (Q2 and Q3)
q = df.groupby(['Q2', 'Q3']).agg({'Q1': 'count'}).reset_index()
q.columns = ['Q2', 'Q3', 'counts']
fig = px.sunburst(q, path=['Q2', 'Q3'], values='counts', title='Respondents by gender and country', height=600,
                  color='counts', color_continuous_scale=px.colors.sequential.Blues)
fig.show()

In [None]:
# Q6 - Years of coding experience
q = df['Q6'].value_counts().rename_axis('Experience').reset_index(name='Quantity')

fig = px.bar(q, x='Experience', y='Quantity', title='Coding experience', height=500)
fig.show()

In [None]:
# Q6 and Q8 - recommended language at each level of experience
q = df.groupby(['Q6', 'Q8']).agg({'Q1': 'count'}).reset_index()
q.columns = ['Q6', 'Q8', 'counts']
q = q.rename(columns={'Q6': 'Experience', 'Q8': 'Language'})

fig = px.bar(q, x='Experience', y='counts', color='Language', title='Recommended language by experience', height=500)
fig.update_layout(barmode='group')
fig.show()

In [None]:
# Compare Q7 and Q8
# Q7 - programming languages used on a regular basis
# Q8 - programming language recommended

# Explode Q7 together with Q8 so each language stays on its respondent's row
q = explode_answers(df[['Q7', 'Q8']], 'Q7').groupby(['Q7', 'Q8']).size().reset_index(name='counts')
q = q.rename(columns={'Q7': 'Regular use', 'Q8': 'Recommended'})

fig = px.bar(q, x='Recommended', y='counts', color='Regular use', title='Recommended language by language in regular use')
fig.update_layout(barmode='group')
fig.show()

In [None]:
# Q9 - Which IDEs do you use
q = count_answers(df, 'Q9', 'IDE')
fig = px.bar(q, x='IDE', y='Quantity', title='Most popular IDEs')
fig.show()

In [None]:
# Q10 - Which hosted notebook products do you use
q = count_answers(df, 'Q10', 'Hosted notebook')
fig = px.bar(q, x='Hosted notebook', y='Quantity', title='Most used hosted notebooks')
fig.show()

In [None]:
# Q14 - Data visualization libraries
q = count_answers(df, 'Q14', 'Library')
fig = px.pie(q, values='Quantity', names='Library', title='Most used data visualization libraries')
fig.update_traces(textposition='inside')
fig.show()

In [None]:
# Q17 - ML algorithms
q = count_answers(df, 'Q17', 'Algorithm')
fig = px.pie(q, values='Quantity', names='Algorithm', title='Most used ML algorithms')
fig.update_traces(textposition='inside')
fig.show()

In [None]:
# Q18 - CV methods
q = count_answers(df, 'Q18', 'Method')
fig = px.pie(q, values='Quantity', names='Method', title='Most used CV methods', height=600)
fig.update_traces(textposition='inside')
fig.show()

In [None]:
# Q19 - NLP methods
q = count_answers(df, 'Q19', 'Method')
fig = px.pie(q, values='Quantity', names='Method', title='Most used NLP methods')
fig.update_traces(textposition='inside')
fig.show()
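
Because these are multi-select questions, raw counts can be easier to read as the share of respondents who picked each option. Here is a minimal sketch on made-up answers, using `DataFrame.explode` so that every selected option stays on its respondent's row. The toy values and the `long`, `counts`, and `share` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: Q7 holds "_"-joined selections, Q8 a single recommendation
toy = pd.DataFrame({
    'Q8': ['Python', 'R', 'Python'],
    'Q7': ['Python_SQL', np.nan, 'Python'],
})

# One row per selected option; the original index and Q8 are carried along
long = toy.dropna(subset=['Q7']).copy()
long['Q7'] = long['Q7'].str.split('_')
long = long.explode('Q7')

# Divide by the number of respondents who answered Q7, not by the number of selections
counts = long['Q7'].value_counts()
share = counts / toy['Q7'].notna().sum()
print(share)
```

The shares of a multi-select question can add up to more than 1, which is one reason a bar chart may suit them better than a pie chart.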
