### [Introduction](#intro)
* #### [Package Installation & Data Load](#dl)
* #### [Data Cleaning](#dc)

### [Data Analysis](#das)
* #### [Age](#age)


<a id='intro'></a>
### **Introduction**
Kaggle conducted an **ML and Data Science Survey** in 2017. The survey received over 16,000 responses from 171 countries and territories. This analysis focuses on the following points:
* Gender and age of the respondents using the platform
* Languages and tools preferred by respondents
* Tools that helped the respondents learn

<a id='dl'></a>
#### **Package Installation & Data Load**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.
MCR = pd.read_csv('../input/multipleChoiceResponses.csv', encoding="ISO-8859-1")
FFR = pd.read_csv('../input/freeformResponses.csv')


A first look at the data: the number of rows and columns, and the type of each column.

In [None]:
# info() prints its summary itself and returns None, so no print() is needed.
MCR.info()
FFR.info()

In [None]:
print(MCR.columns)
print(MCR.isnull().sum().sort_values(ascending=False))

<a id='dc'></a>
#### **Data Cleaning**
As you can see, the data contains a lot of null values. I am going to clean and impute only the variables that are important for the analysis:
* Replacing null values of age with the mean age
* Consolidating "Republic of China" and "People's Republic of China" into "China"
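
To decide which columns are worth cleaning, it can help to look at the share of missing values rather than the raw count. Below is a minimal sketch on a toy frame; the `toy` data and the `null_share` name are made up for illustration and stand in for `MCR`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for MCR; the real frame comes from multipleChoiceResponses.csv.
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0, 40.0],
    "Country": ["India", "United States", None, "Russia"],
    "CareerSwitcher": [None, None, "Yes", None],
})

# Fraction of missing values per column, highest first.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

The same expression should work on `MCR` directly, since `isnull().mean()` only relies on the frame's columns.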


In [None]:
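# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about.
MCR['Age'] = MCR['Age'].fillna(int(MCR['Age'].mean()))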
print(MCR['Age'].value_counts(bins=20))

Some of the values are probably inaccurate: 18 respondents report an age under 5, and it is highly unlikely that someone aged 95-100 uses Kaggle. We will keep these discrepancies in the analysis.
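
A quick way to see how many ages look implausible is to count the values outside a chosen range. This is only a sketch on toy values; the bounds `LOW` and `HIGH` are hypothetical and are not applied anywhere in this notebook.

```python
import pandas as pd

ages = pd.Series([3, 19, 24, 27, 35, 42, 61, 99], name="Age")  # toy values

# Hypothetical plausibility bounds; the notebook itself applies no such filter.
LOW, HIGH = 16, 80

# between() is inclusive on both ends, so negate it to flag the outliers.
outside = ~ages.between(LOW, HIGH)
print(f"{outside.sum()} of {len(ages)} ages fall outside {LOW}-{HIGH}")
```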

In [None]:
print(MCR['Country'].value_counts())

There are two different country names for China, so we change "Republic of China" and "People's Republic of China" to "China".

In [None]:
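# Assign the result back rather than calling replace(..., inplace=True) on a column selection.
MCR['Country'] = MCR['Country'].replace('Republic of China', 'China')
MCR['Country'] = MCR['Country'].replace("People 's Republic of China", 'China')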
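
The two replacements can also be written as a single mapping. Here is a small sketch on a toy series; the names `countries`, `to_china` and `cleaned` are made up for illustration.

```python
import pandas as pd

countries = pd.Series(
    ["Republic of China", "People 's Republic of China", "India", "United States"]
)  # toy values

# One mapping covers both spellings; values not in the mapping pass through unchanged.
to_china = {"Republic of China": "China", "People 's Republic of China": "China"}
cleaned = countries.replace(to_china)
print(cleaned.value_counts())
```

A mapping keeps all the spellings in one place, which is handy if more variants turn up later.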

<a id='das'></a>
### **Data Analysis**

* #### **Age**
Analysing the age distribution with a histogram.

In [None]:

#plt.hist(MCR['Age'],bins=5)
f1=plt.figure()
f1.set_size_inches(20, 10)
plt.hist([MCR['Age']], color=['r'], alpha=0.5)


It is no surprise that most respondents are between 20 and 40 years old. Over 7,000 respondents fall in the 20-30 bin, which shows that Kaggle is popular among young people.
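
The bin counts can also be read as numbers instead of from the plot. The sketch below uses `pd.cut` with decade-wide bins on toy values; the bin edges are my assumption and may not match the edges matplotlib picks for the histogram above.

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 29, 31, 34, 38, 45, 52])  # toy values

# Left-closed decade bins: [10, 20), [20, 30), ..., [50, 60)
edges = list(range(10, 70, 10))
per_decade = pd.cut(ages, bins=edges, right=False).value_counts().sort_index()
print(per_decade)
```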

In [None]:
#sns.violinplot(y='Age',data=MCR,x='GenderSelect',split=True)
#plt.legend(loc=7)
# The plot builds its own figure, so no separate plt.figure() call is needed.
# Note: factorplot was renamed catplot (and size became height) in seaborn 0.9.
g = sns.catplot(x="Country", y="Age", row="GenderSelect", data=MCR, kind="violin",
                height=20, aspect=0.8, row_order=['Male', 'Female'])
g.set(ylim=(10, 50))



#### **Country**

In [None]:


f1=plt.figure()
f1.set_size_inches(20, 10)
#sns.countplot('Country',data=MCR)
MCR['Country'].value_counts().plot(kind='bar',color=['g'],alpha=0.6)
plt.xticks( rotation='vertical')

The top three countries among respondents are:
1. US
2. India
3. Russia
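
The ranking can be checked numerically with `value_counts`. This is a sketch on toy values, so the counts below are not the survey's; `top3` and `share` are names made up for illustration.

```python
import pandas as pd

country = pd.Series(
    ["United States"] * 4 + ["India"] * 3 + ["Russia"] * 2 + ["Brazil"]
)  # toy values

# value_counts() sorts by frequency, so head(3) gives the three largest groups.
top3 = country.value_counts().head(3)
share = (top3 / len(country)).round(2)
print(pd.DataFrame({"count": top3, "share": share}))
```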

In [None]:

f1=plt.figure()
f1.set_size_inches(10, 10)
sns.countplot(y='Country',data=MCR,hue='GenderSelect')
plt.legend(loc=7)

Above is the gender count by country. The number of male respondents from the US exceeds that of every other country.

In [None]:

# Note: factorplot was renamed catplot (and size became height) in seaborn 0.9.
g = sns.catplot(y="Country", x="Age", col='GenderSelect', data=MCR, kind="bar",
                height=10, aspect=1, col_wrap=2, ci=None)
#sns.violinplot(y='Age',x='GenderSelect',data=MCR)
#plt.xticks(fontsize=20)                    
#plt.yticks(fontsize=14)

An interesting observation here is that, for Male and Female respondents, the plotted ages stay within the 20 to 40 range for all countries, whereas for the other two gender categories they go above 40.
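
With `kind="bar"`, seaborn draws the mean of `Age` for each group by default, so the bar lengths can be reproduced with a `groupby`. Here is a minimal sketch on a toy frame; the column names follow `MCR`, while the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["India", "India", "India",
                "United States", "United States", "United States"],
    "GenderSelect": ["Male", "Male", "Female", "Male", "Female", "Female"],
    "Age": [24, 28, 26, 35, 30, 34],
})  # toy values

# Mean age per (country, gender), reshaped so that each gender becomes a column.
mean_age = toy.groupby(["Country", "GenderSelect"])["Age"].mean().unstack()
print(mean_age)
```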

Let's see the relationship between employment status and data science learning.

In [None]:

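f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
sns.countplot(x="CareerSwitcher", data=MCR, ax=ax1)
sns.countplot(x="EmploymentStatus", data=MCR, ax=ax2)
f.set_size_inches(20, 10)
# plt.xticks only affects the last active axes, so rotate the labels on both.
for ax in (ax1, ax2):
    plt.setp(ax.get_xticklabels(), rotation='vertical')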

In [None]:
f1=plt.figure()
f1.set_size_inches(8, 8)
sns.countplot(x="EmploymentStatus", hue="CareerSwitcher", data=MCR)
plt.xticks( rotation='vertical')

One interesting observation here is that survey respondents who are out of work and looking for a job do not want to switch careers.
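
A count plot leaves out rows where the answer is missing, so a group without bars may have skipped the question rather than answered "No". A cross-tabulation that keeps missing answers as their own column makes this visible. The sketch below uses a toy frame; whether the same pattern holds in the real survey data is not checked here, and the "No answer" label is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "EmploymentStatus": [
        "Employed full-time", "Employed full-time", "Employed full-time",
        "Not employed, but looking for work", "Not employed, but looking for work",
    ],
    "CareerSwitcher": ["Yes", "No", "Yes", None, None],
})  # toy values

# Fill missing answers first so that they get their own column in the table.
table = pd.crosstab(
    toy["EmploymentStatus"],
    toy["CareerSwitcher"].fillna("No answer"),
)
print(table)
```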

In [None]:
f1=plt.figure()
f1.set_size_inches(15, 10)
ax = sns.countplot(x="CurrentJobTitleSelect", hue="EmploymentStatus", data=MCR)
plt.xticks( rotation='vertical')
#print(MCR.groupby('EmploymentStatus',as_index=False)['CareerSwitcher'].sum())

In [None]:
f1=plt.figure()
f1.set_size_inches(8, 8)
sns.countplot( x="LearningDataScience", hue='EmploymentStatus',data=MCR)
plt.xticks( rotation='vertical')

In [None]:
f1=plt.figure()
f1.set_size_inches(8, 8)
sns.countplot(x="Country", hue="CareerSwitcher", data=MCR)
plt.xticks( rotation='vertical')

India has the largest number of career switchers, surpassing the US.
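
Raw counts favour countries with many respondents, so it may be worth looking at the share of switchers next to the count. This is a sketch on toy values, not a result from the survey; the `switcher` column and the `summary` name are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["India"] * 4 + ["United States"] * 5,
    "CareerSwitcher": ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "No"],
})  # toy values

# Flag the "Yes" answers, then take the sum (count) and mean (share) per country.
summary = (
    toy.assign(switcher=toy["CareerSwitcher"].eq("Yes"))
       .groupby("Country")["switcher"]
       .agg(count="sum", share="mean")
)
print(summary)
```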

In [None]:
#sns.countplot('CareerSwitcher',data=MCR)
f1=plt.figure()
f1.set_size_inches(8,20)
Job_colors=['#78C850','#F08030', '#6890F0', '#9b59b6','#95a5a6','#2ecc71','#F8D030','#E0C068', '#EE99AC',
                    '#C03028','#F85888','#B8A038','#705898', '#98D8D8', '#34495e' ]
sns.countplot(y="MLToolNextYearSelect", hue="CurrentJobTitleSelect", data=MCR,palette=Job_colors, linewidth=5,saturation=1)
plt.xticks( rotation='vertical')
plt.legend(loc=7)

