# Kaggle Journey 2017-2018

![](https://cdn-images-1.medium.com/max/920/1*T5oDltDFi8FQJ8kZdFUMoQ.png)

Exactly 2 years ago I joined Kaggle while searching for some small datasets for learning Excel and Pandas. Being a novice and completely new to Data Science, I found Kaggle Kernels pretty overwhelming at the start. However, with the help of the Kaggle Community, every day is a great learning experience. Kaggle not only helped me refine my Data Science skills, but also helped me connect with some great people. Cheers to such a great and ever-growing community.

Now coming to the dataset: it is the result of the second worldwide survey conducted by Kaggle, and it provides meaningful insights about the people working in the field of Data Science. Kaggle released its survey data for the first time last year. Following is my notebook using that dataset: **[Novice To Grandmaster](https://www.kaggle.com/ash316/novice-to-grandmaster)**. With the help of this year's dataset, we will see how the Kaggle Community has grown and changed in the past year and try to find some interesting patterns and insights that are hidden in this trove of data.

I hope you like this notebook. If you like it and find it useful, **DO UPVOTE**.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import networkx as nx
import warnings
warnings.filterwarnings('ignore')
import folium 
from folium import plugins
from highcharts import Highchart
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
df_2017=pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv',encoding='ISO-8859-1')
df_2018=pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
df_2018.columns=df_2018.iloc[0]
df_2018=df_2018.drop([0])

Let's check how many people responded to the survey!

In [None]:
print('Total respondents in 2018:',df_2018.shape[0],'with a growth of:',(df_2018.shape[0]-df_2017.shape[0])/df_2017.shape[0]*100,'%')

So about **24k** people responded to the survey, which is almost **43%** more than the number of respondents in 2017. This shows how actively the Data Science Community is growing.

## Response Time

In [None]:
print('Maximum Response Time: ',(df_2018['Duration (in seconds)'].astype('int')/60).max(),'mins')
print('Minimum Response Time: ',(df_2018['Duration (in seconds)'].astype('int')/60).min(),'mins')
print('Mean Response Time: ',(df_2018['Duration (in seconds)'].astype('int')/60).mean(),'mins')

The maximum response time is just crazy...lmao!! Similarly, the minimum response time is also funny. The mean response time is on the high side because of outliers like the maximum response time. As far as I remember, I took somewhere between 12-15 mins to complete the survey, and I believe the majority of the respondents took about that long too. Let's check it and let the data speak for itself!!
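The effect of outliers on the mean can be illustrated with a small sketch. The durations below are made-up values, not survey data; the only point is that the median stays close to a typical response while the mean is pulled upward by a single extreme value.

```python
import pandas as pd

# Hypothetical response times in minutes (illustrative only, not taken from the survey)
mins = pd.Series([12, 14, 15, 13, 16, 11, 9000])

mean_mins = mins.mean()      # pulled up by the single extreme value
median_mins = mins.median()  # barely affected by the outlier

print('Mean Response Time:', round(mean_mins, 2), 'mins')
print('Median Response Time:', median_mins, 'mins')
```

On the real data, the same idea would amount to calling `.median()` on the duration column instead of `.mean()`.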

In [None]:
df_2018['mins']=df_2018['Duration (in seconds)'].astype('int')/60
plt.hist(df_2018[df_2018['mins']<100].mins,bins=30,edgecolor='black', linewidth=1.2)
plt.title('Response Time in Minutes')
plt.gcf().set_size_inches(10,5)

After removing response times of 100 minutes or more, we can see that the majority of the respondents finished the survey in less than 15-20 minutes.

## Gender Split

In [None]:
ax=df_2018['What is your gender? - Selected Choice'].value_counts().plot.barh(width=0.9,color='#ffd700')
for i, v in enumerate(df_2018['What is your gender? - Selected Choice'].value_counts().values): 
    ax.text(200, i, v,fontsize=12,color='red',weight='bold')
plt.title('Gender Distribution')
plt.gcf().set_size_inches(8,5)
plt.show()

The number of female respondents is still very low compared to males. There is a huge gender imbalance, as almost **80%** of the respondents are male. However, let's check the rise in the numbers compared to 2017 to get a better idea.

In [None]:
gen_2018=df_2018['What is your gender? - Selected Choice'].value_counts().to_frame()
gen_2017=df_2017['GenderSelect'].value_counts().to_frame()
gen_2017=gen_2017.merge(gen_2018,left_index=True,right_index=True,how='left').dropna()
gen_2017.columns=['2017','2018']
H = Highchart(width=800, height=400)

options = {
    'title': {
        'text': 'Male vs Female Respondents (2017-2018)'
    },
    'xAxis': {
        'categories': ['Male', 'Female',],
        'title': {
            'text': None
        }
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': 'Respondents'
        },
        'labels': {
            'overflow': 'justify'
        }
    },
    'legend': {
        'layout': 'vertical',
        'align': 'right',
        'verticalAlign': 'top',
        'x': -40,
        'y': 80,
        'floating': True,
        'borderWidth': 1,
        'shadow': True
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True
            }
        }
    }
}

H.set_dict_options(options)




data1 = list(gen_2017['2017'])
data2 = list(gen_2017['2018'])
H.add_data_set(data1, 'bar', '2017')
H.add_data_set(data2, 'bar', '2018')

H

Due to a wrong calculation, I had previously mentioned that the growth is low for female coders. Thanks to [Heads and Tails](https://www.kaggle.com/headsortails) for correcting me on that point. The growth for both sexes is almost the same, i.e. around 44%. Still, the number of females is pretty low compared to the males.
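For reference, the growth per category is just the relative change between the two years. Here is a minimal sketch of that calculation; the counts are hypothetical placeholders, not the actual survey numbers, and the frame name `counts` is only for illustration.

```python
import pandas as pd

# Hypothetical respondent counts (illustrative only, not the survey numbers)
counts = pd.DataFrame(
    {'2017': [1000, 200], '2018': [1500, 290]},
    index=['Male', 'Female'],
)

# Percentage growth per category: (new - old) / old * 100
counts['Growth %'] = ((counts['2018'] - counts['2017']) / counts['2017'] * 100).round(2)
print(counts)
```

Comparing the `Growth %` column rather than the raw difference keeps a small group from looking like it grew less simply because it started smaller.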

## Respondents By Country

In [None]:
names=['Afghanistan','Albania','Algeria','American Samoa','Andorra','Angola','Anguilla','Antarctica','Antigua & Barbuda','Argentina','Armenia','Aruba','Australia','Austria','Azerbaijan','Bahamas','Bahrain','Bangladesh','Barbados','Belarus','Belgium','Belize','Benin','Bermuda','Bhutan','Bolivia','Bosnia','Botswana','Bouvet Island','Brazil','British Indian Ocean Territory','British Virgin Islands','Brunei','Bulgaria','Burkina Faso','Burundi','Cambodia','Cameroon','Canada','Cape Verde','Caribbean Netherlands','Cayman Islands','Central African Republic','Chad','Chile',"People 's Republic of China",'Christmas Island','Cocos (Keeling) Islands','Colombia','Comoros','Congo - Brazzaville','Congo - Kinshasa','Cook Islands','Costa Rica','Croatia','Cuba','Curaçao','Cyprus',
 'Czech Republic','Côte d’Ivoire','Denmark','Djibouti','Dominica','Dominican Republic','Ecuador','Egypt','El Salvador','Equatorial Guinea','Eritrea','Estonia','Ethiopia','Falkland Islands','Faroe Islands','Fiji','Finland','France','French Guiana','French Polynesia','French Southern Territories','Gabon','Gambia','Georgia','Germany','Ghana','Gibraltar','Greece','Greenland','Grenada','Guadeloupe','Guam','Guatemala','Guernsey','Guinea','Guinea-Bissau','Guyana','Haiti','Heard & McDonald Islands','Honduras','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Isle of Man','Israel','Italy','Jamaica','Japan','Jersey','Jordan',
 'Kazakhstan','Kenya','Kiribati','Kuwait','Kyrgyzstan','Laos','Latvia','Lebanon','Lesotho','Liberia','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Madagascar','Malawi','Malaysia','Maldives','Mali','Malta','Marshall Islands','Martinique','Mauritania','Mauritius','Mayotte','Mexico','Micronesia','Moldova','Monaco','Mongolia','Montenegro','Montserrat','Morocco','Mozambique','Myanmar','Namibia','Nauru','Nepal','Netherlands','New Caledonia','New Zealand','Nicaragua','Niger','Nigeria','Niue','Norfolk Island','North Korea','Northern Mariana Islands','Norway','Oman','Pakistan','Palau','Palestine','Panama','Papua New Guinea','Paraguay','Peru','Philippines','Pitcairn Islands','Poland','Portugal','Puerto Rico','Qatar','Romania','Russia','Rwanda','Réunion','Samoa','San Marino','Saudi Arabia','Senegal','Serbia','Seychelles','Sierra Leone','Singapore','Sint Maarten','Slovakia','Slovenia','Solomon Islands','Somalia','South Africa','South Georgia & South Sandwich Islands','South Korea','South Sudan','Spain','Sri Lanka','St. Barthélemy','St. Helena','St. Kitts & Nevis','St. Lucia','St. Martin','St. Pierre & Miquelon','St. Vincent & Grenadines','Sudan','Suriname','Svalbard & Jan Mayen','Swaziland','Sweden','Switzerland','Syria','São Tomé & Príncipe','Taiwan','Tajikistan','Tanzania','Thailand','Timor-Leste','Togo','Tokelau','Tonga','Trinidad & Tobago','Tunisia','Turkey','Turkmenistan','Turks & Caicos Islands','Tuvalu','U.S. Outlying Islands','U.S. Virgin Islands','United Kingdom','United States','Uganda','Ukraine','United Arab Emirates','Uruguay','Uzbekistan','Vanuatu','Vatican City','Venezuela','Vietnam','Wallis & Futuna','Western Sahara','Yemen','Zambia','Zimbabwe','Åland Islands']
long=[33.93911,41.153332,28.033886,-14.270972,42.506285,-11.202692,18.220554,-82.862752,17.060816,-38.416097,40.069099,12.52111,-25.274398,47.516231,40.143105,25.03428,26.0667,23.684994,13.193887,53.709807,50.503887,17.189877,9.30769,32.3078,27.514162,-16.290154,43.915886,-22.328474,-54.4207915,-14.235004,-6.343194,18.420695,4.535277,42.733883,12.238333,-3.373056,12.565679,7.369722,56.130366,16.5388,12.1783611,19.3133,6.611111,15.454166,-35.675147,35.86166,-10.447525,-12.164165,4.570868,-11.6455,-0.228021,-4.038333,-21.236736,9.748917,45.1,21.521757,12.16957,35.126413,49.817492,7.539989,56.26392,11.825138,15.414999,18.735693,-1.831239,26.820553,13.794185,1.650801,15.179384,58.595272,9.145,-51.796253,61.892635,-17.713371,61.92411,46.227638,3.933889,-17.679742,-49.280366,-0.803689,13.443182,32.1656221,51.165691,7.946527,36.140751,39.074208,71.706936,12.1165,16.265,13.444304,15.783471,49.465691,9.945587,11.803749,4.860416,18.971187,-53.08181,15.199999,22.396428,47.162494,64.963051,20.593684,-0.789275,32.427908,33.223191,53.1423672,
54.236107,31.046051,41.87194,18.109581,36.204824,49.214439,30.585164,48.019573,-0.023559,-3.370417,29.31166,41.20438,19.85627,56.879635,33.854721,-29.609988,6.428055,26.3351,47.166,55.169438,49.815273,22.198745,41.608635,-18.766947,-13.254308,4.210484,3.202778,17.570692,35.937496,7.131474,14.641528,
21.00789,-20.348404,-12.8275,23.634501,7.425554,47.411631,43.7384176,46.862496,42.708678,16.742498,31.791702,-18.665695,21.916221,-22.95764,-0.522778,28.394857,52.132633,-20.904305,-40.900557,12.865416,17.607789,9.081999,-19.054445,-29.040835,40.339852,15.0979,60.472024,21.4735329,30.375321,7.51498,31.952162,8.537981,-6.314993,-23.442503,-9.189967,12.879721,-24.3767537,51.919438,39.399872,18.220833,25.354826,45.943161,61.52401,-1.940278,-21.115141,-13.759029,43.94236,23.885942,14.497401,44.016521,-4.679574,8.460555,1.352083,18.04248,48.669026,46.151241,-9.64571,5.152149,-30.559482,-54.429579,35.907757,6.8769919,40.463667,7.873054,17.9,-15.9650104,17.357822,13.909444,18.0708298,46.8852,12.984305,12.862807,3.919305,77.553604,-26.522503,60.128161,46.818188,34.802075,0.18636,23.69781,38.861034,-6.369028,15.870032,-8.874217,8.619543,-9.2002,-21.178986,10.691803,33.886917,38.963745,38.969719,21.694025,-7.109535,19.2823192,18.335765,55.378051,40.7605367,1.373333,48.379433,23.424076,-32.522779,41.377491,-15.376706,41.902916,6.42375,14.058324,-14.2938,24.215527,15.552727,-13.133897,-19.015438,60.1785247,]
lat= [67.709953 ,   20.168331 ,1.659626 , -170.132217 ,    1.521801 ,   17.873887 ,-63.068615 ,  135. ,  -61.796428 ,  -63.616672 ,45.038189 ,  -69.968338 ,  133.775136 ,   14.550072 ,47.576927 ,  -77.39628  ,   50.5577   ,   90.356331 ,
-59.543198 ,   27.953389 ,    4.469936 ,  -88.49765  ,2.315834 ,  -64.7505   ,   90.433601 ,  -63.588653 ,17.679076 ,   24.684866 ,    3.3464497,  -51.92528  ,71.876519 ,  -64.639968 ,  114.727669 ,   25.48583  ,-1.561593 ,   29.918886 ,  104.990963 ,   12.354722 ,-106.346771 ,  -23.0418   ,  -68.2385339,  -81.2546   ,20.939444 ,   18.732207 ,  -71.542969 ,  104.195397 ,105.690449 ,   96.870956 ,  -74.297333 ,   43.3333   ,15.827659 ,   21.758664 , -159.777671 ,  -83.753428 ,15.2000001,  -77.781167 ,  -68.99002  ,   33.429859 ,15.472962 ,   -5.54708  ,    9.501785 ,   42.590275 ,-61.370976 ,  -70.162651 ,  -78.183406 ,   30.802498 ,-88.89653  ,   10.267895 ,   39.782334 ,   25.0136071,40.489673 ,  -59.523613 ,   -6.9118061,  178.065032 ,
25.7481511,    2.213749 ,  -53.125782 , -149.406843 ,69.3485571,   11.609444 ,  -15.310139 ,  -82.9000751,10.451526 ,   -1.023194 ,   -5.353585 ,   21.824312 ,-42.604303 ,  -61.679    ,  -61.551    ,  144.793731 ,-90.230759 ,   -2.585278 ,   -9.696645 ,  -15.180413 ,-58.93018  ,  -72.285215 ,   73.504158 ,  -86.241905 ,114.109497 ,   19.5033041,  -19.020835 ,   78.96288  ,113.921327 ,   53.688046 ,   43.679291 ,   -7.6920536,-4.548056 ,   34.851612 ,   12.56738  ,  -77.297508 ,138.252924 ,   -2.13125  ,   36.238414 ,   66.923684 ,37.906193 , -168.734039 ,   47.481766 ,   74.766098 ,102.495496 ,   24.603189 ,   35.862285 ,   28.233608 ,-9.429499 ,   17.228331 ,    9.555373 ,   23.881275 ,6.129583 ,  113.543873 ,   21.745275 ,   46.869107 ,34.301525 ,  101.975766 ,   73.22068  ,   -3.996166 ,14.375416 ,  171.184478 ,  -61.024174 ,  -10.940835 ,57.552152 ,   45.166244 , -102.552784 ,  150.550812 ,28.369885 ,    7.4246158,  103.846656 ,   19.37439  ,-62.187366 ,   -7.09262  ,   35.529562 ,   95.955974 ,18.49041  ,  166.931503 ,   84.124008 ,    5.291266 ,165.618042 ,  174.885971 ,  -85.207229 ,    8.081666 ,8.675277 , -169.867233 ,  167.954712 ,  127.510093 ,145.6739   ,    8.468946 ,   55.975413 ,   69.345116 ,134.58252  ,   35.233154 ,  -80.782127 ,  143.95555  ,-58.443832 ,  -75.015152 ,  121.774017 , -128.3242376,19.145136 ,   -8.224454 ,  -66.590149 ,   51.183884 ,24.96676  ,  105.318756 ,   29.873888 ,   55.536384 ,-172.104629 ,   12.457777 ,   45.079162 ,  -14.452362 ,21.005859 ,   55.491977 ,  -11.779889 ,  103.819836 ,-63.05483  ,   19.699024 ,   14.995463 ,  160.156194 ,46.199616 ,   22.937506 ,  -36.587909 ,  127.766922 ,31.3069788,   -3.74922  ,   80.771797 ,  -62.833333 ,-5.7089241,  -62.782998 ,  -60.978893 ,  -63.0500809,-56.3159   ,  -61.287228 ,   30.217636 ,  -56.027783 ,23.6702719,   31.465866 ,   18.643501 ,    8.227512 ,38.996815 ,    6.613081 ,  120.960515 ,   71.276093 ,34.888822 ,  100.992541 ,  125.727539 ,    0.824782 ,-171.8484   , 
-175.198242 ,  -61.222503 ,    9.537499 ,35.243322 ,   59.556278 ,  -71.797928 ,  177.64933  ,166.647047 ,  -64.896335 ,   -3.435973 ,  -73.9788903,32.290275 ,   31.1655799,   53.847818 ,  -55.765835 ,64.585262 ,  166.959158 ,   12.453389 ,  -66.58973  ,108.277199 , -178.1165   ,  -12.885834 ,   48.516388 ,27.849332 ,   29.154857 ,   19.9156105]

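# Note: the list named `long` above actually holds latitudes and `lat` holds longitudes,
# so the 'longitude' column here contains latitudes. folium still receives [lat, lon] below.
coun_dat=pd.DataFrame({'name':names,'longitude':long,'latitude':lat})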
dd1=df_2017['Country'].value_counts().to_frame()
df_2018['In which country do you currently reside?'].replace({'United States of America':'United States','Viet Nam':'Vietnam','China':"People 's Republic of China","United Kingdom of Great Britain and Northern Ireland":'United Kingdom',"Hong Kong (S.A.R.)":"Hong Kong"},inplace=True)
dd2=df_2018['In which country do you currently reside?'].value_counts().to_frame()
#dd2=dd2.rename(index={'United States of America':'United States','Viet Nam':'Vietnam','China':"People 's Republic of China","United Kingdom of Great Britain and Northern Ireland":'United Kingdom',"Hong Kong (S.A.R.)":"Hong Kong"})
dd1=dd1.merge(dd2,left_index=True,right_index=True,how='left')
dd1.columns=['2017','2018']
dd1['Growth']=(dd1['2018']-dd1['2017'])/dd1['2017']*100
decimals = 2   
dd1['Growth']=dd1['Growth'].apply(lambda x: round(x, decimals))
dd1.dropna(inplace=True)
coun_fin=coun_dat.merge(dd1,left_on='name',right_index=True,how='left').dropna()
def growth_col(value):
    if value < 0:
        return 'red'
    elif value > 0 and value < 50:
        return 'yellow'
    else:
        return 'green'
locate=coun_fin[['longitude','latitude']]
coun=coun_fin['name']
resp_2017=coun_fin['2017']
resp_2018=coun_fin['2018']
growth=coun_fin['Growth']
map1 = folium.Map(location=[48.85, 2.35], tiles="Mapbox Control Room", zoom_start=2)
for point in coun_fin.index:
    info='<font color="red" >Country: </font>'+str(coun.loc[point])+'<br><font color="red"> Total Respondents 2017: </font>'+str(resp_2017.loc[point])+'<br><font color="red"> Total Respondents 2018: </font>'+str(resp_2018.loc[point])+'<br><font color="red"> Growth: </font>'+str(growth.loc[point])+' %'
    iframe = folium.IFrame(html=info, width=250, height=250)
    folium.CircleMarker(list(locate.loc[point]),popup=folium.Popup(iframe),radius=resp_2018.loc[point]*0.01,color=growth_col(growth.loc[point]),fill_color=growth_col(growth.loc[point]),fill=True).add_to(map1)
map1

I have tried to put a lot of information into the map :p. Click on any bubble to get more information about it. Let's look into it in detail.

The circle size is proportional to the number of respondents, while the color represents the growth % in the number of respondents (red for negative growth, yellow for growth below 50%, green otherwise). Looking at the map, we can clearly say that the USA and India have the highest number of respondents. However, India has a 52% growth in respondents, whereas the USA has just 12% growth. This growth is evident, as we Indians are the highest producer of Engineers...:p!!

It's not just India: other Asian countries like **China, Russia and Vietnam** have also shown good growth. The European continent also has an average growth somewhere between **30-50%**. The only 3 countries that have shown negative growth in the number of respondents are **Australia, Philippines and South Korea!**

## Age

In [None]:
H = Highchart(width=800, height=400)

options = {
    'title': {
        'text': 'Age Ranges'
    },
    'xAxis': {
        'categories': list(df_2018['What is your age (# years)?'].value_counts().index),
        'title': {
            'text': None
        }
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': 'Respondents'
        },
        'labels': {
            'overflow': 'justify'
        }
    },
    'legend': {
        'layout': 'vertical',
        'align': 'right',
        'verticalAlign': 'top',
        'x': -40,
        'y': 80,
        'floating': True,
        'borderWidth': 1,
        'shadow': True
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True
            }
        },
    }
}

H.set_dict_options(options)




data1 = list(df_2018['What is your age (# years)?'].value_counts())
H.add_data_set(data1, 'bar')

H

Similar to last year's results, the majority of the respondents are young, with ages between 18 and 29 years. It will be interesting to see which age group the respondents belong to by country. For this, let's split the age groups into 3 distinct parts:

 - Young (18-34 years)
 - Middle (35-54 years)
 - Old (55-80+ years)
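One way to express this grouping is a lookup from each survey age bracket to its group. The sketch below is an alternative formulation under the assumption that the brackets are exactly the ones used in this notebook; the sample answers in `ages` are hypothetical.

```python
import pandas as pd

# Age brackets used in this notebook, grouped into the three parts listed above
age_groups = {
    'Young': ['18-21', '22-24', '25-29', '30-34'],
    'Middle': ['35-39', '40-44', '45-49', '50-54'],
    'Old': ['55-59', '60-69', '70-79', '80+'],
}

# Invert to a bracket -> group lookup
bracket_to_group = {b: g for g, brackets in age_groups.items() for b in brackets}

# Hypothetical answers (illustrative only)
ages = pd.Series(['22-24', '40-44', '80+', '30-34'])
groups = ages.map(bracket_to_group)
print(groups.tolist())
```

A single `map` call keeps the bracket definitions in one place, which makes it easier to adjust the boundaries later.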

### Age Groups By Country

In [None]:
df_2018['Age Grp']=np.where(df_2018['What is your age (# years)?'].isin(['25-29','22-24','30-34','18-21']),'Young','')
df_2018['Age Grp']=np.where(df_2018['What is your age (# years)?'].isin(['35-39', '40-44', '45-49', '50-54']),'Middle',df_2018['Age Grp'])
df_2018['Age Grp']=np.where(df_2018['What is your age (# years)?'].isin(['55-59', '60-69', '70-79', '80+']),'Old',df_2018['Age Grp'])
coun_age=df_2018.groupby(['In which country do you currently reside?','Age Grp'])['What is your age (# years)?'].count().reset_index()
coun_age.columns=['Country','Age Grp','Count']
coun_age=coun_age[coun_age['Country'].isin(df_2018['In which country do you currently reside?'].value_counts()[:6].index)]
coun_age=coun_age[coun_age['Country']!='Other']
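# Keyword arguments are used because newer pandas versions no longer accept positional arguments in pivot
coun_age.pivot(index='Country',columns='Age Grp',values='Count').plot.bar(stacked=True,width=0.8)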
plt.gcf().set_size_inches(16,8)
plt.show()

Looks like people in the United States are getting old early :p!

## UnderGraduate Major

In [None]:
H = Highchart(width=800, height=500)

options = {
    'title': {
        'text': 'UnderGraduate Major'
    },
    'xAxis': {
        'categories': list(df_2018['Which best describes your undergraduate major? - Selected Choice'].value_counts().index),
        'title': {
            'text': None
        }
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': 'Count'
        },
        'labels': {
            'overflow': 'justify'
        }
    },
    'legend': {
        'layout': 'vertical',
        'align': 'right',
        'verticalAlign': 'top',
        'x': -40,
        'y': 80,
        'floating': True,
        'borderWidth': 1,
        'shadow': True
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True
            }
        }
    }
}

H.set_dict_options(options)




data1 = list(df_2018['Which best describes your undergraduate major? - Selected Choice'].value_counts())
H.add_data_set(data1, 'bar')

H

The results are similar to last year's. The majority of the respondents are from a CS background. However, Data Science being a diverse field, people from non-CS backgrounds make up a large proportion too.

## Current Role

In [None]:
plt.figure(figsize=(8,8))
ax=df_2018['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'].value_counts()[:10].plot.barh(width=0.9,color=sns.color_palette('inferno_r',25))
for i, v in enumerate(df_2018['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'].value_counts()[:10].values): 
    ax.text(200, i, v,fontsize=12,color='blue',weight='bold')
plt.gca().invert_yaxis()
plt.title('Current Roles')
plt.show()

As expected, Kaggle has a huge student community. I too joined Kaggle as a student, and now I am a working professional, but I still love going through Kaggle.

## Data Scientists By Countries

In [None]:
lol1=df_2017[(df_2017['CurrentJobTitleSelect']=='Data Scientist')&(df_2017['StudentStatus']!='Yes')].Country.value_counts()[:10].to_frame()
lol2=df_2018[(df_2018['Select the title most similar to your current role (or most recent title if retired): - Selected Choice']=='Data Scientist')&(df_2018['In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice']!='I am a student')]['In which country do you currently reside?'].value_counts()[:10].to_frame()
coun_10=lol1.merge(lol2,left_index=True,right_index=True,how='outer')
coun_10.columns=['2017','2018']
H = Highchart(width=800, height=500)

options = {
    'title': {
        'text': 'No of Data Scientists Respondents By Top-10 Countries (2017-2018)'
    },
    'xAxis': {
        'categories': list(coun_10.index),
        'title': {
            'text': None
        }
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': 'Respondents'
        },
        'labels': {
            'overflow': 'justify'
        }
    },
    'legend': {
        'layout': 'vertical',
        'align': 'right',
        'verticalAlign': 'top',
        'x': -80,
        'y': 20,
        'floating': True,
        'borderWidth': 1,
        'shadow': True
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True
            }
        }
    }
}

H.set_dict_options(options)




data1 = list(coun_10['2017'])
data2 = list(coun_10['2018'])
H.add_data_set(data1, 'bar', '2017')
H.add_data_set(data2, 'bar', '2018')

H

The above graph shows the number of Data Scientist respondents by country in 2017 and 2018 respectively. Again we can see that **India** has shown great growth in the number of Data Scientists, as the demand for Data Science professionals has soared in India. **[Read through this article](https://economictimes.indiatimes.com/jobs/indias-demand-for-data-scientists-grows-over-400-report/articleshow/64930355.cms)** to get a real idea about the demand for Data Scientists in India.

## Favorite IDE's

In [None]:
l1=[col for col in df_2018 if col.startswith("Which of the following integrated development environments (IDE's) have you used at work or school in the last 5 years?")]
col1=[]
col2=[]
l2=df_2018[l1[:-1]]
for i in l2.columns:
    col1.append(df_2018[i].value_counts().index.values[0])
    col2.append(df_2018[i].value_counts().values[0])
ide=pd.DataFrame({'IDE':col1,'Count':col2})
H = Highchart(width=800, height=500)

options = {
    'title': {
        'text': "Most Used IDE's"
    },
    'xAxis': {
        'categories': list(ide.sort_values('Count',ascending=False)['IDE']),
        'title': {
            'text': None
        }
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': 'Count'
        },
        'labels': {
            'overflow': 'justify'
        }
    },
    'legend': {
        'layout': 'vertical',
        'align': 'right',
        'verticalAlign': 'top',
        'x': -40,
        'y': 80,
        'floating': True,
        'borderWidth': 1,
        'shadow': True
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True
            }
        }
    }
}

H.set_dict_options(options)




data1 = list(ide.sort_values('Count',ascending=False)['Count'])
H.add_data_set(data1, 'bar')

H

Jupyter Notebook leads the list by a huge margin. The reason might be the wide range of languages supported by Jupyter.
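A note on how such multi-select questions can be counted: each option gets its own column, holding the option name where it was ticked and a missing value otherwise. The sketch below assumes that layout; the column names and answers are hypothetical and much shorter than the real survey headers.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-select answers: one column per option (illustrative only)
answers = pd.DataFrame({
    'IDE - Jupyter/IPython': ['Jupyter/IPython', 'Jupyter/IPython', np.nan, 'Jupyter/IPython'],
    'IDE - RStudio': [np.nan, 'RStudio', 'RStudio', np.nan],
    'IDE - PyCharm': [np.nan, np.nan, np.nan, 'PyCharm'],
})

# count() ignores NaN, so it gives the number of respondents who ticked each option
ide_counts = answers.count().sort_values(ascending=False)
print(ide_counts)
```

This gives the same kind of per-option totals as taking the first entry of `value_counts()` for every column, without the explicit loop.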

## First Recommended Language

In [None]:
plt.figure(figsize=(8,10))
ax=df_2018['What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice'].value_counts()[:10].plot.barh(width=0.9,color=sns.color_palette('viridis_r',25))
for i, v in enumerate(df_2018['What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice'].value_counts()[:10]): 
    ax.text(200, i, v,fontsize=12,color='blue',weight='bold')
plt.gca().invert_yaxis()
plt.title('First Language Recommended')
plt.show()

And the clear winner is Pythonnnnnn!! The obvious reason for this is the ease and flexibility of Python. It is so easy to learn and can be used in almost any technology domain. However, let's compare it with last year's results to get a better idea of how the community thinks.

In [None]:
# Counts per year; the merged frame keeps the name lang_2017 because the cells below refer to it
lang_18=df_2018['What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice'].value_counts().to_frame()
lang_17=df_2017['LanguageRecommendationSelect'].value_counts().to_frame()
lang_2017=lang_18.merge(lang_17,left_index=True,right_index=True,how='left')
lang_2017.columns=['2018','2017']
H = Highchart(width=900, height=400)

options = {
    'title': {
        'text': 'Top Recommended First Language (2017-2018)'
    },
    'xAxis': {
        'categories': list(lang_2017.index),
        'title': {
            'text': None
        }
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': 'Count'
        },
        'labels': {
            'overflow': 'justify'
        }
    },
    'legend': {
        'layout': 'vertical',
        'align': 'right',
        'verticalAlign': 'top',
        'x': -80,
        'y': 20,
        'floating': True,
        'borderWidth': 1,
        'shadow': True
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True
            }
        }
    }
}

H.set_dict_options(options)




data1 = list(lang_2017['2017'])
data2 = list(lang_2017['2018'])
H.add_data_set(data1, 'bar', '2017')
H.add_data_set(data2, 'bar', '2018')

H

The recommendation for Python as the first language has shot up sharply, with more than a 100% rise compared to 2017. On the contrary, the recommendation for R has gone down. The reason may be the steep learning curve of R.
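Since the number of respondents also changed between the two surveys, raw counts and shares can tell slightly different stories. The following sketch compares each language's share of the answers per year; the counts are hypothetical placeholders, not the survey results.

```python
import pandas as pd

# Hypothetical recommendation counts (illustrative only, not the survey numbers)
rec = pd.DataFrame(
    {'2017': [600, 300, 100], '2018': [1200, 250, 150]},
    index=['Python', 'R', 'SQL'],
)

# Share of each year's answers, in percent (each column sums to 100)
share = rec / rec.sum() * 100

# Change in share between the two years, in percentage points
share['Change (pp)'] = share['2018'] - share['2017']
print(share.round(2))
```

Looking at the change in share separates a real shift in preference from the effect of simply having more respondents.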

## DataSet Source

In [None]:
l1=[col for col in df_2018 if col.startswith("Where do you find public datasets? (Select all that apply) - Selected Choice - ")]
col1=[]
col2=[]
l2=df_2018[l1[:-2]]
for i in l2.columns:
    col1.append(df_2018[i].value_counts().index.values[0])
    col2.append(df_2018[i].value_counts().values[0])
source=pd.DataFrame({'Source':col1,'Count':col2})
H = Highchart(width=650, height=500)

options = {
    'chart': {
        'type': 'pie',
        'options3d': {
            'enabled': True,
            'alpha': 45
        }
    },
    'title': {
        'text': "Finding Public DataSets From?"
    },
    'plotOptions': {
        'pie': {
            'innerSize': 100,
            'depth': 45
        }
    },
}

data = source.values.tolist()

H.set_dict_options(options)
H.add_data_set(data, 'pie', 'Count')

H

Kaggle is still the major source for finding datasets. 

## Time Spent in Various Data Science Tasks

In [None]:
time_spent=[col for col in df_2018 if col.startswith("During a typical data science project at work or school, approximately what proportion of your time is devoted to the following? ")]
time_spent=time_spent[:-1]
import itertools
plt.figure(figsize=(20,20))
length=len(time_spent)
for i,j in itertools.zip_longest(time_spent,range(length)):
    plt.subplot((length+1)//2,2,j+1)  # integer row count; rounds up when the number of plots is odd
    plt.subplots_adjust(wspace=0.2,hspace=0.5)
    df_2018[i].astype('float').hist(bins=10,edgecolor='black')
    plt.axvline(df_2018[i].astype('float').mean(),linestyle='dashed',color='r')
    plt.title(i[161:],size=20)
    plt.xlabel('% Time')
plt.show()

Lets go through the Data Science Pipeline stepwise and understand why each phase takes lesser time or more time:

 - **Gathering Data**: Undoubtedly one of the most time-consuming parts of the entire pipeline. Getting the data can be painstaking, and it really depends on where we fetch it from. If it comes from a publicly hosted site like Kaggle, it is very easy and takes almost no time. However, that is not the case in the real world, where we need to select the right data in the right format and build a secure way of transferring the data to our ML models or applications.
 - **Cleaning Data**: The most time-consuming part, and I am sure no one will disagree on this!! Transforming the data into the correct format for the application, detecting and correcting corrupt or inaccurate data, etc. all come under data cleaning. The main challenge is to correct duplicate and invalid entries rather than simply delete them, as deleting data can lead to information loss. Hence critical decisions and careful thinking are required, which makes it a time-consuming process.
 - **Visualizing Data**: It is probably the least time-consuming process (and probably the most enjoyable one..:p), and the time reduces even further if we use enterprise tools like Tableau, Qlik, Tibco, etc., which help in building graphs and dashboards with simple drag-and-drop features.
 - **Model Building**: This is where data scientists decide on a suitable algorithm, build predictive models, tune these models, etc. It is the second most time-consuming process after data gathering.
 - **Putting Model into Production**: Simply put, it means encapsulating the ML model in an application or hosting it somewhere so that others can use the model with their own data. Nowadays it has become very easy to build APIs that directly query the ML model and return results based on the user's input data. The process has become easier due to the great integration of ML models and cloud deployment, which has in fact reduced the time required for production deployment.
 - **Finding Insights and Communicating with Stakeholders**: Finding insights means retrieving all the important patterns and facts from the trove of data and effectively communicating them to the clients with minimum cognitive load. Clear communication and simple but effective visualizations play a very important role in this phase.
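As a small illustration of the cleaning step described above, here is a minimal sketch, assuming a toy DataFrame with hypothetical columns `age` and `country` (not taken from the survey data). It drops exact duplicate rows and marks an invalid value as missing instead of deleting the whole row, so the remaining information is kept.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one duplicated row and one impossible age
raw = pd.DataFrame({
    'age': [25, 25, -3, 41],
    'country': ['India', 'India', 'United States', 'Germany'],
})

def clean(df):
    """Drop exact duplicates and mask out-of-range ages as missing."""
    out = df.drop_duplicates().copy()
    # Keep the row but blank the invalid value, so the other columns survive
    out['age'] = out['age'].where(out['age'].between(0, 120), np.nan)
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Whether to mask, impute or drop an invalid entry is exactly the kind of judgment call that makes this phase slow; the 0 to 120 range here is only an assumed example rule.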

## Individual Projects or Academic Achievements

In [None]:
projects=df_2018['Which better demonstrates expertise in data science: academic achievements or independent projects? - Your views:'].value_counts().to_frame()
projects.columns=['Count']
H = Highchart(width=1000, height=500)

options = {
    'title': {
        'text': "Independent Projects Good or Not?",
        'style':{
            'color': 'white',
        }
    },
    'xAxis': {
        'categories': list(projects.index),
        'title': {
            'text': None
        },'labels': {
            'overflow': 'justify',
            'style': {
            'color': 'white',
         }
        },
        
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': None
        },
        'labels': {
            'overflow': 'justify',
            'style': {
            'color': 'white',
         }
        }
    },
    'credits': {
        'enabled': False
    },
    'chart': {
            'backgroundColor':'black',
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True,
                'style': {
            'color': 'white',
         }
            }
        }
    }
}

H.set_dict_options(options)




data1 = list(projects['Count'])
H.add_data_set(data1, 'bar')

H

People often ask me, "How do I showcase my knowledge in a certain domain?" My simple answer is: build projects and upload them to your GitHub profile. The above results also show that independent projects are indeed a greater asset than academic achievements.

## Data Science And Cloud

Machine learning was once out of the reach of most enterprise budgets, but today, public cloud providers’ ability to offer machine-learning services makes this technology affordable. ML/DL requires GPU-powered machines, which are costly to own, but with the advent of powerful virtual instances the costs have come down. Being a Cloud Engineer, I know how easily we can run and scale ML models/applications on virtual instances. Let us now see how the top 3 cloud providers, viz. AWS, GCP and Azure, compete against each other.

In [None]:
cloud=[col for col in df_2018 if col.startswith("Which of the following cloud computing services have you used at work or school in the last 5 years? (Select all that apply) - Selected Choice -")][:-4]
col1,col2=[],[]
for i in df_2018[cloud].columns:
    col1.append(df_2018[i].value_counts().index.values[0])
    col2.append(df_2018[i].value_counts().values[0])
cloud1=pd.DataFrame({'Cloud Provider':col1,'Count':col2})

cloud_ml=df_2018[df_2018['Select the title most similar to your current role (or most recent title if retired): - Selected Choice']=='Data Scientist']
cloud=[col for col in cloud_ml if col.startswith("Which of the following cloud computing services have you used at work or school in the last 5 years? (Select all that apply) - Selected Choice -")][:-4]
col1,col2=[],[]
for i in cloud_ml[cloud].columns:
    col1.append(cloud_ml[i].value_counts().index.values[0])
    col2.append(cloud_ml[i].value_counts().values[0])
cloud2=pd.DataFrame({'Cloud Provider':col1,'Count':col2})

cloud=cloud1.merge(cloud2,left_on='Cloud Provider',right_on='Cloud Provider',how='left')
cloud.columns=['Cloud Provider','Cloud N','Cloud ML']

H = Highchart(width=800, height=400)

options = {
    'title': {
        'text': 'Top Cloud Providers'
    },
    'xAxis': {
        'categories': list(cloud['Cloud Provider']),
        'title': {
            'text': None
        }
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': 'Count'
        },
        'labels': {
            'overflow': 'justify'
        }
    },
    'legend': {
        'layout': 'vertical',
        'align': 'right',
        'verticalAlign': 'top',
        'x': -80,
        'y': 20,
        'floating': True,
        'borderWidth': 1,
        'shadow': True
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True
            }
        }
    }
}

H.set_dict_options(options)




data1 = list(cloud['Cloud N'])
data2 = list(cloud['Cloud ML'])
H.add_data_set(data1, 'bar', 'All Uses')
H.add_data_set(data2, 'bar', 'ML')

H

AWS is a clear winner among the cloud providers. The likely reason is the variety of services available on AWS. However, being a Cloud Engineer, I think I can add a point here. As far as I have seen, GCP beats AWS in Machine Learning capabilities. Also, from my own small research, I did find that many small ML/AI ventures go for GCP rather than AWS.

Let's dig a bit further and check which services are used the most in AWS and GCP!!

## AWS vs GCP (Fight For ML)

In [None]:
aws=[col for col in df_2018 if col.startswith('Which of the following cloud computing products have you used at work or school in the last 5 years (Select all that apply)? - Selected Choice - AWS') or col.startswith('Which of the following machine learning products have you used at work or school in the last 5 years? (Select all that apply) - Selected Choice - Amazon ')]
google=[col for col in df_2018 if col.startswith('Which of the following cloud computing products have you used at work or school in the last 5 years (Select all that apply)? - Selected Choice - Google') or col.startswith('Which of the following machine learning products have you used at work or school in the last 5 years? (Select all that apply) - Selected Choice - Google ')]
col1=[]
col2=[]
for i in df_2018[aws].columns:
    col1.append(df_2018[i].value_counts().index.values[0])
    col2.append(df_2018[i].value_counts().values[0])
aws=pd.DataFrame({'Service':col1,'Count':col2})

col1=[]
col2=[]
for i in df_2018[google].columns:
    col1.append(df_2018[i].value_counts().index.values[0])
    col2.append(df_2018[i].value_counts().values[0])
google=pd.DataFrame({'Service':col1,'Count':col2})
f,ax=plt.subplots(1,2,figsize=(25,15))
google.set_index('Service').plot.barh(width=0.8,ax=ax[0])
aws.set_index('Service').plot.barh(width=0.8,ax=ax[1])
plt.subplots_adjust(wspace=0.5)
ax[0].set_title('GCP Services')
ax[1].set_title('AWS Services')
ax[0].set_xlim(0,4000)  # set_xlim takes (left, right); a third positional "step" argument is not supported
plt.show()

We can clearly see that GCP offers far more ML services than AWS, and these services are also used more than AWS's ML services!!
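The counting loop in the cells above relies on each multi-select column holding either one fixed label or NaN. A minimal sketch of that idea as a reusable helper is shown below; `count_selected` and the toy column names are hypothetical and not part of the original notebook.

```python
import pandas as pd

def count_selected(df, columns, label='Service'):
    """Count non-null answers per multi-select column.

    Assumes each column contains a single distinct label (or NaN),
    as in the survey's "Select all that apply" questions.
    """
    rows = []
    for col in columns:
        answers = df[col].dropna()
        if answers.empty:
            continue  # nobody picked this option; skip instead of failing
        rows.append({label: answers.iloc[0], 'Count': len(answers)})
    return pd.DataFrame(rows, columns=[label, 'Count'])

# Toy data standing in for df_2018 (hypothetical values)
toy = pd.DataFrame({
    'Q - AWS EC2': ['AWS EC2', None, 'AWS EC2'],
    'Q - AWS Lambda': [None, 'AWS Lambda', None],
    'Q - AWS Batch': [None, None, None],
})
services = count_selected(toy, toy.columns)
print(services)
```

Unlike `value_counts().index.values[0]`, this version skips a column that nobody selected rather than raising an `IndexError`.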

## Most Used Machine Learning Libraries

In [None]:
frame=[col for col in df_2018 if col.startswith('What machine learning frameworks have you used in the past 5 years? (Select all that apply) - Selected Choice - ')]
frame=frame[:-2]
col1=[]
col2=[]
for i in df_2018[frame].columns:
    col1.append(df_2018[i].value_counts().index.values[0])
    col2.append(df_2018[i].value_counts().values[0])
frame=pd.DataFrame({'FrameWork':col1,'Count':col2})
frame=frame.sort_values('Count',ascending=False)
H = Highchart(width=850, height=500)

options = {
    'title': {
        'text': "Most commonly used ML FrameWorks",
    },
    'xAxis': {
        'categories': list(frame['FrameWork']),
        'title': {
            'text': None
        },'labels': {
            'overflow': 'justify',
        },
        
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': None
        },
        'labels': {
            'overflow': 'justify',
        }
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True,
            }
        }
    }
}

H.set_dict_options(options)

data1 = list(frame['Count'])
H.add_data_set(data1, 'bar')

H

We saw earlier that, according to the survey data, there are far more Python users than R users. Thus Sklearn being the most commonly used ML framework was somewhat expected.

## Most Commonly Used Visualization Libraries

In [None]:
viz=[col for col in df_2018 if col.startswith('What data visualization libraries or tools have you used in the past 5 years? (Select all that apply) - Selected Choice - ')]
viz=viz[:-2]
col1=[]
col2=[]
for i in df_2018[viz].columns:
    col1.append(df_2018[i].value_counts().index.values[0])
    col2.append(df_2018[i].value_counts().values[0])
viz=pd.DataFrame({'Viz Lib':col1,'Count':col2})
viz=viz.sort_values('Count',ascending=False)

H = Highchart(width=850, height=500)

options = {
    'title': {
        'text': "Most commonly used Visualization FrameWorks",
    },
    'xAxis': {
        'categories': list(viz['Viz Lib']),
        'title': {
            'text': None
        },'labels': {
            'overflow': 'justify',
        },
        
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': None
        },
        'labels': {
            'overflow': 'justify',
        }
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True,
            }
        }
    }
}

H.set_dict_options(options)

data1 = list(viz['Count'])
H.add_data_set(data1, 'bar')

H

Matplotlib wins by a huge margin. This was expected, because it is the first visualization library most Python users pick up when starting their Data Science journey!!

## Most Used Programming Languages

In [None]:
lang=[col for col in df_2018 if col.startswith('What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice -')]
lang=lang[:-2]
df_lang=df_2018[lang]
df_lang.columns=[i[102:] for i in lang]
col1=[]
col2=[]
for i in df_2018[lang].columns:
    col1.append(df_2018[i].value_counts().index.values[0])
    col2.append(df_2018[i].value_counts().values[0])
languages=pd.DataFrame({'Lang':col1,'Count':col2})
languages=languages.sort_values('Count',ascending=False)

H = Highchart(width=850, height=500)

options = {
    'title': {
        'text': "Most commonly used Languages",
    },
    'xAxis': {
        'categories': list(languages['Lang']),
        'title': {
            'text': None
        },'labels': {
            'overflow': 'justify',
        },
        
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': None
        },
        'labels': {
            'overflow': 'justify',
        }
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True,
            }
        }
    }
}

H.set_dict_options(options)

data1 = list(languages['Count'])
H.add_data_set(data1, 'bar')

H

As usual, Python leads by a huge margin. A notable difference compared to 2017 is that SQL surpasses R by a pretty decent margin. A likely reason is that SQL is the bread and butter of databases, and with the increase in data and databases, the use of SQL will indeed increase.

Let's try to make a simple network that shows which languages are used together.

## A Simple Language Network

In [None]:
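import networkx as nx
# Count how often each exact combination of languages is selected
c = df_lang.stack().groupby(level=0).apply(tuple).value_counts()
out = [i + (j,) for i, j in c.items()]
# Keep only combinations of exactly two languages: (Lang 1, Lang 2, Count)
out=[word for word in out if len(word)==3]
lang_net=pd.DataFrame(out)
lang_net.columns=['Lang 1','Lang 2','Count']
g = nx.from_pandas_edgelist(lang_net,source='Lang 1',target='Lang 2',edge_attr='Count')
cmap = plt.cm.RdYlGn
colors = [n for n in range(len(g.nodes()))]
k = 0.35
pos=nx.spring_layout(g, k=k)
# Look up sizes and widths per node/edge so they follow the graph's own ordering
lang_count=languages.set_index('Lang')['Count']
node_sizes=[lang_count.get(n,0) for n in g.nodes()]
edge_widths=[d['Count']*0.01 for _,_,d in g.edges(data=True)]
nx.draw_networkx(g,pos=pos,node_size=node_sizes, cmap = cmap, node_color=colors, edge_color='grey', font_size=22, width=edge_widths)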
plt.title('Languages Network',size=50)
plt.gcf().set_size_inches(25,20)

We can clearly see that Python is used with pretty much every other language. This shows the versatility of Python. However, the major chunk of the network is with R and SQL, because these data munging and analysis tools are the most frequently used tool-set.
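The network above only uses respondents who selected exactly two languages. As an alternative (not what the notebook does), a minimal sketch that counts every pair of languages chosen by the same respondent could look like this; the toy `selections` list is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy data (hypothetical): the languages each respondent selected
selections = [
    ['Python', 'SQL'],
    ['Python', 'R', 'SQL'],
    ['R'],
    ['Python', 'R'],
]

pair_counts = Counter()
for langs in selections:
    # Sort so that (Python, R) and (R, Python) are counted as the same edge
    for pair in combinations(sorted(set(langs)), 2):
        pair_counts[pair] += 1

for (a, b), n in pair_counts.most_common():
    print(a, b, n)
```

Each pair and its count could then be used as a weighted edge in the same way as the rows of `lang_net`.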

## What Type of Data do Industries Deal With?

Not every industry needs to deal with time series data, and similarly not every industry deals with audio or video data. Let's check what types of data the various industries deal with..!

In [None]:
ind=[col for col in df_2018 if col.startswith('Which types of data do you currently interact with most often at work or school? (Select all that apply) - Selected Choice - ') or col.startswith('In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice') or col.startswith('Duration (in seconds)')]
ind=ind[:-2]
lulz=df_2018[ind]
frames=[]
for i in lulz.columns[2:]:
    abc=lulz.groupby(['In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice',i])['Duration (in seconds)'].count().reset_index()
    abc.columns=['Industry','Data','Count']
    frames.append(abc)
# DataFrame.append was removed in pandas 2.0; concatenate the pieces in one go instead
dataframe=pd.concat(frames)

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le=le.fit(dataframe['Industry'])
dataframe['Industry_enc']=le.transform(dataframe['Industry'])
le=le.fit(dataframe['Data'])
dataframe['Data_enc']=le.transform(dataframe['Data'])
dataframe.sort_values(['Industry_enc','Data_enc'],inplace=True)

H = Highchart(width=800, height=800)
data= dataframe[['Data_enc','Industry_enc','Count']].values.tolist()
H.add_data_set(data, series_type='heatmap', borderWidth=1, dataLabels={
    'enabled': True,
})

H.set_options('chart', {
    'style': {
            'fontFamily': '\'Unica One\', sans-serif'
        },
    'type': 'heatmap',
    'plotBorderWidth': 1
})
H.set_options('xAxis', {
    'categories': list(dataframe['Data'].unique())
})
H.set_options('yAxis', {
    'categories': list(dataframe['Industry'].unique())
})
H.set_options('title', {
    'text': "Industry vs Type of Data Faced"
})
H.set_options('colorAxis', {
    'min': 0,
    'minColor': '#FFFFFF',
    'maxColor': '#00ff00'
})
H.set_options('legend', {
    'align': 'right',
    'layout': 'vertical',
    'margin': 0,
    'verticalAlign': 'top',
    'y': 25,
    'symbolHeight': 280
})
H.set_options('tooltip', {
    'formatter': "function () {" + 
                "return 'Industry: '+'<b>' + this.series.yAxis.categories[this.point.y] + '</b><br>Frequency: <b>' +" +
                    "this.point.value + '</b><br> Data: <b>' + this.series.xAxis.categories[this.point.x] + '</b>';" +
            "}"
})
H

## How Did You Learn Data Science?

In [None]:
train=[col for col in df_2018 if col.startswith('What percentage of your current machine learning/data science training falls under each category? ')]
train=train[:-2]
import math
import itertools  # needed for itertools.zip_longest below
plt.figure(figsize=(20,20))
length=len(train)
for i,j in itertools.zip_longest(train,range(length)):
    plt.subplot(math.ceil((length/2)),2,j+1)
    plt.subplots_adjust(wspace=0.2,hspace=0.5)
    df_2018[i].astype('float').hist(bins=10,edgecolor='black')
    plt.axvline(df_2018[i].astype('float').mean(),linestyle='dashed',color='r')
    plt.title(i[130:],size=20)
    plt.xlabel('% Time')
plt.show()

**"Formal education will make you a living; self-education will make you a fortune." - Jim Rohn**

 - It is evident that many respondents learn Data Science on their own. I personally haven't taken any courses so far and have learnt everything on the fly with the help of Kaggle Kernels.
 - With the increased number of online courses on various MOOC platforms, newcomers get a good start in Data Science, and thus online courses also have a good share.
 - However, the low percentages for **Work and Kaggle Competitions** were a bit surprising. I have often read that people learn a lot of new things by competing in Kaggle Competitions. Similarly, I think we get to work on real DS problems in the industry. However, the lower percentages don't reflect this.
 
Let's break this down and check if we get similar results for the top 2 countries, viz. the United States and India.

## Learn Data Science (USA vs India)

In [None]:
train=[col for col in df_2018 if col.startswith('What percentage of your current machine learning/data science training falls under each category? ')]
train=train[:-2]
df_ind=df_2018[df_2018['In which country do you currently reside?']=='India']
df_usa=df_2018[df_2018['In which country do you currently reside?']=='United States']
plt.figure(figsize=(20,20))
length=len(train)
for i,j in itertools.zip_longest(train,range(length)):
    plt.subplot(math.ceil((length/2)),2,j+1)
    plt.subplots_adjust(wspace=0.2,hspace=0.5)
    df_ind[i].astype('float').hist(bins=10,color='tomato',alpha=0.7,label='India')
    df_usa[i].astype('float').hist(bins=10,color='lightgreen',alpha=0.7,label='United States')
    plt.axvline(df_ind[i].astype('float').mean(),linestyle='dashed',color='r')
    plt.axvline(df_usa[i].astype('float').mean(),linestyle='dashed',color='g')
    plt.title(i[130:],size=20)
    plt.xlabel('% Time')
    plt.legend()
plt.show()


I was indeed expecting something like this :p. Just look at the University histogram!! It clearly shows how much better the education system in the United States is compared to India's.

MOOCs are very popular in India; I have seen so many students go through these extensive online courses. One major reason, I think, is the **Education System** in India. Students can easily learn things online that are not taught in the universities.

## Online Platforms

In [None]:
online=[col for col in df_2018 if col.startswith('On which online platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - ')]
online=online[:-2]
col1=[]
col2=[]
l2=df_2018[online]
for i in l2.columns:
    col1.append(df_2018[i].value_counts().index.values[0])
    col2.append(df_2018[i].value_counts().values[0])
online=pd.DataFrame({'Platform':col1,'Count':col2})
H = Highchart(width=650, height=500)
options = {
    'chart': {
        'style': {
            'fontFamily': '\'Unica One\', sans-serif'
        },
        'type': 'pie',
        'options3d': {
            'enabled': True,
            'alpha': 45
        }
    },
    'title': {
        'text': "Online Platforms"
    },
    'plotOptions': {
        'pie': {
            'innerSize': 100,
            'depth': 45
        }
    },
}

data = online.values.tolist()

H.set_dict_options(options)
H.add_data_set(data, 'pie', 'Count')

H

## ML Maturity in Business

In [None]:
ind_ml=df_2018[df_2018['In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice']!='I am a student']
ind_ml['Does your current employer incorporate machine learning methods into their business?'].value_counts().plot.barh(width=0.8,color=sns.color_palette('inferno_r',10))
plt.gcf().set_size_inches(6,8)
plt.gca().invert_yaxis()
plt.title('ML Maturity')

A majority of the respondents (about 55-60%) said that their employer either doesn't use ML, is still exploring it, or that they don't know. Since the ML/DS boom is a recent thing, I was expecting similar results. Many industries are still researching ML techniques to implement and integrate into their existing systems. Let's dig in and check how this is distributed across industries.
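Shares like the 55-60% quoted above can be computed directly rather than read off the bar chart. A minimal sketch, assuming a toy Series of answers (the labels below are hypothetical shorthand, not the survey's exact wording):

```python
import pandas as pd

# Toy answers (hypothetical labels)
answers = pd.Series([
    'No ML', 'Exploring', 'Exploring', 'In production',
    'I do not know', 'In production', 'Exploring', 'No ML',
])

# normalize=True returns fractions instead of raw counts
share = answers.value_counts(normalize=True).mul(100).round(1)
print(share)

# Combined share of respondents whose employer is not yet using ML in production
not_mature = share[['No ML', 'Exploring', 'I do not know']].sum()
print(not_mature)
```

On the real data, the same call on the `ind_ml` column would give the percentage behind each bar.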

## ML Maturity By Industry

In [None]:
ind_ml=ind_ml.groupby(['In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice','Does your current employer incorporate machine learning methods into their business?'])['Duration (in seconds)'].count().reset_index()
ind_ml.columns=['Industry','ML','Count']
le=le.fit(ind_ml['Industry'])
ind_ml['Industry_enc']=le.transform(ind_ml['Industry'])
le=le.fit(ind_ml['ML'])
ind_ml['ML_enc']=le.transform(ind_ml['ML'])
ind_ml.sort_values(['Industry_enc','ML_enc'],inplace=True)

H = Highchart(width=800, height=800)
data= ind_ml[['ML_enc','Industry_enc','Count']].values.tolist()
H.add_data_set(data, series_type='heatmap', borderWidth=1, dataLabels={
    'enabled': True,
})

H.set_options('chart', {
    'style': {
            'fontFamily': '\'Unica One\', sans-serif'
        },
    'type': 'heatmap',
    'plotBorderWidth': 1
})
H.set_options('xAxis', {
    'categories': list(ind_ml['ML'].unique())
})
H.set_options('yAxis', {
    'categories': list(ind_ml['Industry'].unique())
})
H.set_options('title', {
    'text': "Industry vs ML Maturity"
})
H.set_options('colorAxis', {
    'min': 0,
    'minColor': '#FFFFFF',
    'maxColor': '#00ff00'
})
H.set_options('legend', {
    'align': 'right',
    'layout': 'vertical',
    'margin': 0,
    'verticalAlign': 'top',
    'y': 25,
    'symbolHeight': 280
})
H.set_options('tooltip', {
    'formatter': "function () {" + 
                "return 'Industry: '+'<b>' + this.series.yAxis.categories[this.point.y] + '</b><br>Frequency: <b>' +" +
                    "this.point.value + '</b><br> Maturity: <b>' + this.series.xAxis.categories[this.point.x] + '</b>';" +
            "}"
})
H

It looks like only the Computers/Technology industry has adopted ML/DS well in its business. The next industry that looks promising is Education. Nowadays, solutions built by reputed universities like Stanford, MIT, etc. compete on equal terms with those built by some top CS companies.

## ML Maturity (India vs USA)

In [None]:
ind=df_ind['Does your current employer incorporate machine learning methods into their business?'].value_counts().to_frame()
usa=df_usa['Does your current employer incorporate machine learning methods into their business?'].value_counts().to_frame()
df_c=ind.merge(usa,left_index=True,right_index=True,how='left')
df_c.columns=['India','United States']
H = Highchart(width=800, height=400)

options = {
    'title': {
        'text': 'ML in Business (India vs USA)'
    },
    'xAxis': {
        'categories': list(df_c.index),
        'title': {
            'text': None
        }
    },
    'yAxis': {
        'min': 0,
        'title': {
            'text': 'Count'
        },
        'labels': {
            'overflow': 'justify'
        }
    },
    'legend': {
        'layout': 'vertical',
        'align': 'right',
        'verticalAlign': 'top',
        'x': -30,
        'y': 20,
        'floating': True,
        'borderWidth': 1,
        'shadow': True
    },
    'credits': {
        'enabled': False
    },
    'plotOptions': {
        'bar': {
            'dataLabels': {
                'enabled': True
            }
        }
    }
}

H.set_dict_options(options)




data1 = list(df_c['India'])
data2 = list(df_c['United States'])
H.add_data_set(data1, 'bar', 'India')
H.add_data_set(data2, 'bar', 'United States')

H

The 5th bar-pair shows that the United States does offer great opportunities for ML engineers and Data Scientists. This is one of the main reasons why so many Asians flock to the United States for better work opportunities or more relevant work experience.

However, the 1st bar-pair does show that India also has great potential to become an ML/AI/DS hotspot, and with the increase in demand for Data Scientists in India over the past few years, I am sure it will catch up with the United States soon. Have a look at **[this article](https://www.analyticsindiamag.com/top-countries-hiring-most-number-of-artificial-intelligence-machine-learning-experts/)**, which mentions the top countries with great potential for ML/DS.
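Because the two countries have different numbers of respondents, raw counts can be hard to compare. A minimal sketch, assuming a toy DataFrame with hypothetical `country` and `ml_use` columns, that turns the counts into within-country percentages:

```python
import pandas as pd

# Toy data (hypothetical): one row per respondent
toy = pd.DataFrame({
    'country': ['India', 'India', 'India', 'India',
                'United States', 'United States'],
    'ml_use': ['Exploring', 'Exploring', 'Exploring', 'In production',
               'In production', 'Exploring'],
})

# normalize='index' makes each row (country) sum to 1
shares = pd.crosstab(toy['country'], toy['ml_use'], normalize='index') * 100
print(shares.round(1))
```

Plotting these percentages instead of counts would show the relative maturity in each country independent of how many respondents each one has.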

# Stay Tuned... More To Come!!
### Just short of time...too many things on the plate..:p