## 2019 Kaggle ML & DS Survey

### The challenge objective
Tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

### My entry for this challenge presents 'The Story of Indian Kagglers'.

Kernel credits: Thanks to Paul Mooney, developer advocate at Kaggle, for his kernel: https://www.kaggle.com/paultimothymooney/2018-kaggle-machine-learning-data-science-survey

### The following few code blocks set up the kernel.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
mcr = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')
qs = pd.read_csv('/kaggle/input/kaggle-survey-2019/questions_only.csv')

In [None]:
print(mcr.shape, qs.shape)

In [None]:
mcr.head()

In [None]:
for i in qs.columns:
    print(i,' : ',qs[i][0])

There were 34 questions asked in the survey.

This notebook is limited to Indian respondents.

In [None]:
mcr_indian = mcr[mcr['Q3']=='India']

In [None]:
mcr_indian.shape

In [None]:
mcr_indian.shape[0]/mcr.shape[0]

There are 4,786 respondents from India, which is 24% of all respondents.
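
The share above can be cross-checked with a small helper. This is only a sketch on toy data: the helper name `country_share` is hypothetical, and it assumes that row 0 of the responses table holds the question text rather than an answer (the plotting helper later in this notebook reads the question text from that row), so that row is left out of the denominator.

```python
import pandas as pd

def country_share(responses: pd.DataFrame, country: str, column: str = "Q3") -> float:
    """Return the fraction of respondents whose `column` equals `country`.

    Assumption: row 0 contains the question text, not an answer, so it is skipped.
    """
    answers = responses[column].iloc[1:]
    return float((answers == country).mean())

# Toy stand-in for the survey data: one question-text row plus four answers
toy = pd.DataFrame({
    "Q3": ["In which country do you currently reside?", "India", "Brazil", "India", "Japan"]
})
share = country_share(toy, "India")
print(f"{share:.0%} of toy respondents are from India")
```

On the real data, something like `country_share(mcr, 'India')` should land close to the figure quoted above, since a single extra row barely moves the ratio.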

In [None]:
mcr_indian.head()

In [None]:
# Reset the index after filtering and discard the old one
mcr_indian = mcr_indian.reset_index(drop=True)

In [None]:
mcr_indian.head()

In [None]:
for i in mcr_indian.isna().sum().index:
    print(i,' : ', mcr_indian[i].isna().sum())

Each question has a different number of responses, which suggests that different questions were shown to different respondents based on their experience level and previous answers.
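
The per-column counts are easier to compare when they are sorted and expressed as percentages. The following is a minimal sketch on toy data, not the notebook's own method; the helper name `missing_summary` and the toy column contents are assumptions made for illustration.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most-missing first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Toy frame: the later, conditional question (Q6) has the most gaps
toy = pd.DataFrame({
    "Q1": ["18-21", "22-24", "25-29", "22-24"],
    "Q5": ["Student", "Student", None, "Data Scientist"],
    "Q6": ["0-49 employees", None, None, None],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(mcr_indian).head(20)` would then list the questions that the fewest Indian respondents were asked or chose to answer.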

In [None]:
def question(i):
    # Accept positions 0 through 34 only; negative values would otherwise index from the end
    if 0 <= i <= 34:
        print('Question ', i, ' : ')
        return qs[qs.columns[i]][0]
    else:
        print('No such question exists')

In [None]:
question(1)

This analysis follows the order of the questions in the survey.

Let us start with the time each Indian respondent took to complete the survey.

In [None]:
question(0)

In [None]:
duration_only = mcr_indian[mcr_indian.columns[0]]

In [None]:
duration_only.isna().sum()

There are no missing values in the duration attribute of the Kaggle survey data for Indian respondents.

In [None]:
max(pd.to_numeric(duration_only)/60)

In [None]:
import matplotlib
matplotlib.rcParams.update({'font.size': 24})

In [None]:
import seaborn as sns
lst = pd.to_numeric(duration_only)/60
lst = pd.DataFrame(lst)
lst.columns = ['Time from Start to Finish (minutes)']
lst = lst['Time from Start to Finish (minutes)']
sns.distplot(lst, bins = 5000).set(xlim=(0, 60))

Most Indian respondents took roughly 5 to 10 minutes to complete the survey.
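
Because a few respondents leave the survey open for hours, the mean duration can be far from the typical one, and the median is a more robust summary. A small sketch on made-up durations (the helper name `duration_summary` and the toy values are assumptions, not survey data):

```python
import pandas as pd

def duration_summary(seconds: pd.Series) -> pd.Series:
    """Summarise survey durations in minutes, including quartiles."""
    minutes = pd.to_numeric(seconds, errors="coerce") / 60
    return minutes.describe(percentiles=[0.25, 0.5, 0.75])

# Toy durations in seconds, stored as strings; the last one is a 24-hour outlier
toy_seconds = pd.Series(["300", "420", "480", "600", "86400"])
stats = duration_summary(toy_seconds)
print(stats[["mean", "50%", "max"]])
```

In this toy case the single outlier pulls the mean to 294 minutes while the median stays at 8 minutes, which is why `duration_summary(duration_only)` may be a useful complement to the histogram.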

In [None]:
print('There are {} Indian respondents in total, of whom {} spent more than 10 minutes on this survey and {} took more than an hour.'.format(
    lst.shape[0], lst[lst>10].shape[0], lst[lst>60].shape[0]))

In [None]:

def plotMcqHist(i):
    """Plot a bar chart of answer counts for column i of mcr_indian."""
    col = mcr_indian.columns[i]
    # Row 0 of mcr holds the question text for each column
    print(col, '  : ', mcr[col][0])
    # value_counts() skips missing answers, so no NaN handling is needed here
    temp = mcr_indian[col].value_counts()
    plt.figure(figsize=(20,5))
    plt.xlabel(col)
    plt.bar(list(temp.index), list(temp))
    plt.xticks(rotation=90)
    plt.show()
    print('='*50)

In [None]:
mcr_indian.head()

In [None]:
for i in enumerate(mcr_indian.columns):
    print(i)

In [None]:
lst1 = [1,2,5,6,8,9,10,20,21,48,55,95,116,117]

In [None]:
plotMcqHist(lst1[0])

In [None]:
# Share of Indian respondents in each of the three youngest age groups
age = mcr_indian[mcr_indian.columns[lst1[0]]]
for group in ['18-21', '22-24', '25-29']:
    print("{} % of Indian respondents belong to the {} age group".format(round(((age==group).sum()/len(mcr_indian))*100,2), group))


Indian Kagglers come predominantly from the younger age groups.
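
The percentage calculations in this notebook all follow the same pattern: count the rows matching one or more answers and divide by the number of respondents. A hedged sketch of a reusable helper is shown below; `pct_share` is a hypothetical name, the data is a toy frame, and `Q1` is assumed to be the age column.

```python
import pandas as pd

def pct_share(df: pd.DataFrame, column: str, values) -> float:
    """Percentage of rows in `df` whose `column` is one of `values`.

    Missing answers stay in the denominator, matching the len(df)-based
    calculations used elsewhere in this notebook.
    """
    if isinstance(values, str):
        values = [values]
    return round(float(df[column].isin(values).mean() * 100), 2)

# Toy age column with eight respondents
toy = pd.DataFrame({
    "Q1": ["18-21", "22-24", "25-29", "30-34", "18-21", "22-24", "40-44", "18-21"]
})
for group in ["18-21", "22-24", "25-29"]:
    print(f"{pct_share(toy, 'Q1', group)} % of toy respondents are aged {group}")

under_30 = pct_share(toy, "Q1", ["18-21", "22-24", "25-29"])
print(f"{under_30} % of toy respondents are aged 18 to 29")
```

Passing a list of answers gives the combined share directly, which avoids chaining several `|` conditions.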

In [None]:
plotMcqHist(lst1[1])

In [None]:
print("{} % of Indian respondents are Male".format(round((((mcr_indian[mcr_indian.columns[lst1[1]]]=='Male')).sum()/len(mcr_indian))*100,2)))
print("{} % of Indian respondents are Female".format(round((((mcr_indian[mcr_indian.columns[lst1[1]]]=='Female')).sum()/len(mcr_indian))*100,2)))

In [None]:
print("{} % of Indian respondents are Female and belong to the 18 to 21 age group".format(round((((mcr_indian[mcr_indian.columns[lst1[0]]]=='18-21') & (mcr_indian[mcr_indian.columns[lst1[1]]]=='Female')).sum()/len(mcr_indian))*100,2)))

In [None]:
# Combine the three youngest age groups, then split by gender
age = mcr_indian[mcr_indian.columns[lst1[0]]]
gender = mcr_indian[mcr_indian.columns[lst1[1]]]
under_30 = age.isin(['18-21', '22-24', '25-29'])
print("{}% of Indian respondents are Male and belong to the 18 to 29 age group".format(round(((under_30 & (gender=='Male')).sum()/len(mcr_indian))*100,2)))
print("{}% of Indian respondents are Female and belong to the 18 to 29 age group".format(round(((under_30 & (gender=='Female')).sum()/len(mcr_indian))*100,2)))

Even though machine learning and data science are accessible and attractive fields of study, Indian women are still heavily underrepresented on Kaggle. Indian men in the 18 to 29 age group show a strong interest in machine learning, with many starting to participate in their early 20s.
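
The age and gender breakdown can also be viewed as a single table instead of one print statement per combination. The sketch below uses `pd.crosstab` on toy data; `Q1` and `Q2` are assumed to be the age and gender columns, and the values are invented for illustration.

```python
import pandas as pd

# Toy respondents: age group (Q1) and gender (Q2)
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "22-24", "22-24", "25-29", "30-34", "18-21", "22-24"],
    "Q2": ["Male", "Female", "Male", "Male", "Male", "Female", "Male", "Male"],
})

# normalize="all" divides every cell by the total number of rows,
# so each cell is that age/gender combination's share of all respondents
table = (pd.crosstab(toy["Q1"], toy["Q2"], normalize="all") * 100).round(2)
print(table)
```

Using `normalize="index"` instead would show the gender split within each age group, which makes it easier to see whether the gap narrows or widens with age.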

In [None]:
plotMcqHist(lst1[2])

In [None]:
print("{} % of Indian respondents have a Bachelor's degree as their highest level of formal education".format(round((((mcr_indian[mcr_indian.columns[lst1[2]]]=='Bachelor’s degree')).sum()/len(mcr_indian))*100,2)))
print("{} % of Indian respondents have a Master's degree as their highest level of formal education".format(round((((mcr_indian[mcr_indian.columns[lst1[2]]]=='Master’s degree')).sum()/len(mcr_indian))*100,2)))
print("{} % of Indian respondents have a Doctoral degree as their highest level of formal education".format(round((((mcr_indian[mcr_indian.columns[lst1[2]]]=='Doctoral degree')).sum()/len(mcr_indian))*100,2)))

The Indian data science community is well educated. Over 88% of Indian respondents hold, or plan to obtain within two years, a bachelor's degree or higher. Over 40% are at master's level or above, and 5% have a doctoral degree.

In [None]:
plotMcqHist(lst1[3])

In [None]:
print("{} % of Indian respondents are students".format(round((((mcr_indian[mcr_indian.columns[lst1[3]]]=='Student')).sum()/len(mcr_indian))*100,2)))
print("While {} % of Indian respondents are not employed".format(round((((mcr_indian[mcr_indian.columns[lst1[3]]]=='Not employed')).sum()/len(mcr_indian))*100,2)))

It is interesting to note that some Indian respondents are Kaggling while working as data scientists and software engineers, some while pursuing a degree, some to earn a better role, and some to find employment. This diversity suggests that Indians are cautious about the emerging technology: rather than leaving their current roles, they are waiting for the right time to make the move.

While most of them tend to move into machine learning roles quickly, it is interesting to note that students are leading India towards machine learning and data science. This suggests that India could become a second home for machine learning if these students succeed in the field.

In [None]:
plotMcqHist(lst1[4])

In [None]:
size = mcr_indian[mcr_indian.columns[lst1[4]]]
print("{} % of Indian respondents are employed and belong to a company size <=49 employees".format(round(((size=='0-49 employees').sum()/len(mcr_indian))*100,2)))
print("{} % of Indian respondents are employed and belong to a company size >10,000 employees".format(round(((size=='> 10,000 employees').sum()/len(mcr_indian))*100,2)))
# Anyone who reported a company size is counted as employed
sizes = ['0-49 employees', '50-249 employees', '250-999 employees', '1000-9,999 employees', '> 10,000 employees']
print("{} % of Indian respondents are employed".format(round((size.isin(sizes).sum()/len(mcr_indian))*100,2)))

In [None]:
plotMcqHist(lst1[5])

In [None]:
plotMcqHist(lst1[6])

This idea and analysis feel boring so far. Any ideas on how to make them more interesting?