# How are Kagglers Learning?

Kaggle conducted an industry-wide survey to establish a comprehensive view of the state of data science and machine learning. The survey received over 16,000 responses with over 6 full months of aggregated time spent completing it (an average response time of more than 16 minutes). In this notebook I will look at this data to understand how these survey respondents are learning so I can learn from that.
 
First, let us load the data. What does it tell us? The majority of respondents are men. This is not too surprising given our earlier kernel on the [gender gap](https://www.kaggle.com/sureshsrinivas/my-5-day-joy-of-data-challenge-day-1). So let us split the data by gender and look at each group separately.
* Total Respondents =  16716
* Women Respondents =  2778
* Men Respondents =  13610

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
#print(check_output(["ls", "../input"]).decode("utf8"))

cvRates = pd.read_csv('../input/conversionRates.csv', encoding="ISO-8859-1")
freeForm = pd.read_csv('../input/freeformResponses.csv', encoding="ISO-8859-1")
data = pd.read_csv('../input/multipleChoiceResponses.csv', encoding="ISO-8859-1")
schema = pd.read_csv('../input/schema.csv', encoding="ISO-8859-1")


women_data = data.loc[data['GenderSelect'] == 'Female']
men_data = data.loc[data['GenderSelect'] == 'Male']
print('Total Respondents = ', len(data.index))
print('Women Respondents = ', len(women_data.index))
print('Men Respondents = ', len(men_data.index))

women_data['FirstTrainingSelect'].head(3)
# Any results you write to the current directory are saved as output.

## How are Kagglers Learning?

Let us plot this with `sns.countplot`. A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable. We will use the `FirstTrainingSelect` column.

From the plot we see
* Online courses are the most popular
* University Courses and Self-Taught are very close
* Work and Kaggle competitions round out the bottom.

> Thank you to the [answer on StackOverflow](https://stackoverflow.com/questions/42528921/how-to-prevent-overlapping-x-axis-labels-in-sns-countplot) that helped me fix overlapping x-axis labels by rotating the xticklabels.
> The [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) has an excellent chapter on [Visualization with Seaborn](https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html).

In [None]:

# import seaborn and alias it as sns
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()
with sns.axes_style('white'):
    ax = sns.countplot( x="FirstTrainingSelect",  data=data, color='steelblue')
    ax.set_title("First Training Selection")
    ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
    for key,spine in ax.spines.items():
        spine.set_visible(False)
plt.show()


## What is their FirstTraining Selection?

Let us see how the Kagglers (both men and women) are learning. We look at the `FirstTrainingSelect` column in the data. Interestingly, the results for women and men show both contrasts and similarities.
* The top learning choice for women is 'University Courses' at **39%**, but for men it is 'Online Courses'.
* The second choice for men is 'Self Taught' at **27%**, but for women it is a distant third at **17%**.
* Neither women nor men are learning much data science at work: only **6-9%**.
* Both women and men learn a lot from courses, whether university or online: **63-70%**.
* Kaggle competitions do not rank high.
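
The percentages above can also be computed as a table instead of being read off a chart. Here is a minimal sketch, assuming a frame with the survey's `GenderSelect` and `FirstTrainingSelect` columns. The small `sample` frame is made-up illustrative data, not survey results.

```python
import pandas as pd

# Made-up rows standing in for multipleChoiceResponses.csv (illustration only).
sample = pd.DataFrame({
    "GenderSelect": ["Female", "Female", "Female", "Female",
                     "Male", "Male", "Male", "Male"],
    "FirstTrainingSelect": [
        "University courses", "University courses", "Online courses", "Self-taught",
        "Online courses", "Online courses", "Self-taught", "Work",
    ],
})

# Row-normalised cross-tabulation: each row (gender) sums to 100%.
share = (
    pd.crosstab(sample["GenderSelect"], sample["FirstTrainingSelect"], normalize="index")
    .mul(100)
    .round(1)
)
print(share)
```

Swapping `sample` for the real `data` frame should give one row per `GenderSelect` value, which makes the men and women columns easy to compare side by side.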

From my own learning perspective, a few more sources are not covered by the survey, and it would be interesting to know about them.
* Googling it. This may sound obvious, but I learn from Stack Overflow answers, blogs, and API documentation.
* Meetups and learning from experts. There are very popular meetups in many cities. I enjoy interacting with experts outside a class setting.
* Missions. Maybe this belongs in the self-taught category: taking specific problems and solving them. The company [DataQuest](http://dataquest.io) offers some very hands-on missions to encourage fast learning.

> I learned to visualize better through the Medium article [Better Visualization of Pie Charts](https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f) and highly recommend it.

Forking and Learning Suggestions
* The two pie charts are separate. How would we present them as one?
* What else can we do to improve the "data-ink ratio"? The term was coined by Edward Rolf Tufte, whose [One Day Course](https://www.edwardtufte.com/tufte/courses) I was fortunate to take when he visited Portland. It is very inspiring.
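
One possible answer to the first suggestion is to replace the two pies with a single grouped bar chart, so that both groups share one axis. This is only a sketch under stated assumptions: `plot_share_by_group` is a hypothetical helper, not part of the kernel, and the `sample` rows are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_share_by_group(frame, group_col, value_col):
    """Hypothetical helper: one grouped bar chart of value_col shares (%) per group."""
    # Column-normalised table: each group (column) sums to 100%.
    share = pd.crosstab(frame[value_col], frame[group_col], normalize="columns") * 100
    ax = share.plot(kind="bar", figsize=(8, 4), width=0.8)
    ax.set_ylabel("Share of respondents (%)")
    ax.set_xlabel("")
    ax.set_title(f"{value_col} by {group_col}")
    plt.setp(ax.get_xticklabels(), rotation=40, ha="right")
    # Drop the top and right spines to reduce non-data ink.
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    return ax, share


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "GenderSelect": ["Female", "Female", "Female", "Female",
                     "Male", "Male", "Male", "Male"],
    "FirstTrainingSelect": [
        "University courses", "University courses", "Online courses", "Self-taught",
        "Online courses", "Online courses", "Self-taught", "Work",
    ],
})

ax, share = plot_share_by_group(sample, "GenderSelect", "FirstTrainingSelect")
plt.tight_layout()
plt.show()
```

Bars on a common baseline are usually easier to compare than angles in two separate pies, which also speaks to the data-ink question.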

In [None]:

import matplotlib.pyplot as plt

temp_women = women_data['FirstTrainingSelect'].value_counts()
temp_men = men_data['FirstTrainingSelect'].value_counts()
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']

fig = plt.figure(figsize=(12, 12))
# Draw one pie per gender, stacked vertically in the same figure.
for sp, (title, counts) in enumerate([('Women', temp_women), ('Men', temp_men)]):
    ax = fig.add_subplot(2, 1, sp + 1)
    # "Explode" only the first (largest) slice; size the list to the data.
    explode = [0.1] + [0] * (len(counts) - 1)
    patches, texts, autotexts = ax.pie(counts.values, explode=explode, colors=colors,
                                       labels=counts.index, autopct='%1.1f%%',
                                       shadow=True, labeldistance=1.05)
    ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    for pie_wedge in patches:
        pie_wedge.set_edgecolor('white')
    for text in texts + autotexts:
        text.set_color('grey')

    ax.set_title(title)
    # Use booleans; string values such as "off" are deprecated in matplotlib.
    ax.tick_params(bottom=False, top=False, left=False, right=False)

plt.show()


## How does learning change with Age?
![Ford On Learning](https://pbs.twimg.com/media/Cl7uS49UkAE8OHQ.jpg)
Next, let us look at how `FirstTrainingSelect` changes with age.
* The first learning platform for older learners is 'Self Taught', but for young learners it is 'Online Courses'.
* The percentage of learning at work steadily increases from young learners to old.
* University courses steadily drop from young learners to old. It looks like 10% of older learners still rely on university courses.
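
The age comparison can also be tabulated in one step with `pd.cut`. Here is a minimal sketch, assuming `Age` is numeric. The bin edges and labels are my own assumption, chosen so that no age from 18 upward falls through a gap, and the `sample` rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Age": [22, 25, 29, 30, 41, 45, 50, 63],
    "FirstTrainingSelect": [
        "Online courses", "Online courses", "University courses", "Online courses",
        "Self-taught", "Work", "Self-taught", "Self-taught",
    ],
})

# Right-closed bins: (17, 29], (29, 40], (40, 50], (50, 120].
bins = [17, 29, 40, 50, 120]
labels = ["18-29", "30-40", "41-50", "51+"]
sample["AgeGroup"] = pd.cut(sample["Age"], bins=bins, labels=labels)

# Share (%) of each first-training choice within each age group.
share = (
    pd.crosstab(sample["AgeGroup"], sample["FirstTrainingSelect"], normalize="index")
    .mul(100)
    .round(1)
)
print(share)
```

Because every age lands in exactly one bin, the rows of `share` can be compared directly across groups.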



In [None]:
def plotPie(labels, sizes, title):
    """Draw one donut chart. `title` is a one-element list, e.g. ['Young Learners']."""
    colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
    # "Explode" only the first slice; size the list to the data.
    explode = [0.1] + [0] * (len(sizes) - 1)
    fig, ax = plt.subplots(figsize=(12, 12))
    patches, texts, autotexts = ax.pie(sizes, colors=colors, labels=labels, explode=explode,
                                       autopct='%1.1f%%', pctdistance=0.85)
    for pie_wedge in patches:
        pie_wedge.set_edgecolor('white')
    for text in texts + autotexts:
        text.set_color('grey')
    ax.set_title(title[0])
    # Draw a white circle over the centre to turn the pie into a donut.
    centre_circle = plt.Circle((0, 0), 0.70, fc='white')
    ax.add_artist(centre_circle)
    # Equal aspect ratio ensures that pie is drawn as a circle.
    ax.axis('equal')
    # Use booleans; string values such as "off" are deprecated in matplotlib.
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    plt.tight_layout()
    plt.show()

def selectPlot(temp_data, temp_title):
    temp=temp_data['FirstTrainingSelect'].value_counts()
    labels = temp.index
    sizes = temp.values
    plotPie(labels, sizes, temp_title )
    
young_data = data[(data['Age']>=18) & (data['Age']<=29)]
mid_data = data[(data['Age'] > 29) & (data['Age']<=40)]
senior_data = data[(data['Age'] > 40) & (data['Age']<=50)]  # > 40 so that age 41 is not dropped
old_data = data[(data['Age'] > 50)]


selectPlot(young_data, ['Young Learners'])
selectPlot(mid_data, ['Middle Age Learners'])
selectPlot(senior_data, ['Senior Age Learners'])
selectPlot(old_data, ['Old Age Learners'])                                        


I hope you enjoyed this kernel. A few suggestions for further learning:
* How can you combine the pie charts into one figure?