# Overview
As more people take up working with data, an analysis of the skills and techniques that professionals use can help beginners decide what to keep in their own toolkit.

This notebook analyzes data collected from a survey on Kaggle, in which users entered the tools and languages they used most for their tasks.

# About Data
The data consists of four columns, as below:

1. WorkToolsSelect - Tools used for analysis and processing
2. LanguageRecommendationSelect - Language that the respondent is using
3. EmployerIndustry - Industry/Domain where the respondent is working
4. WorkAlgorithmsSelect - Algorithms that are widely used

In [None]:
#importing required libraries
import numpy as np
import pandas as pd

#reading the file from input path
dataset = pd.read_csv('../input/datascience-survey-on-kaggle/kagglesurvey.csv')

#first glance at the data
dataset.head()

The dataset we are using consists of string values, so for missing entries we can't use the imputation we could have used with numerical values. So, it's better to drop the rows that contain them.

In [None]:
#dataset before dropping missing values
print("Dataset Before:")
dataset.info()

dataset = dataset.dropna()

#dataset after dropping values
print("Dataset After:")
dataset.info()
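
To make the effect of `dropna()` concrete, here is a minimal sketch on a small made-up frame. The column names mirror the survey, but the values are hypothetical and are not taken from the real file.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration only.
toy = pd.DataFrame({
    "WorkToolsSelect": ["Python,R", None, "SQL"],
    "EmployerIndustry": ["Technology", "Academic", None],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# With default arguments, dropna() removes every row that has
# at least one missing value in any column.
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Checking `isna().sum()` first shows how many rows are about to be lost, which is worth knowing before dropping anything from a survey dataset.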

In [None]:
def combineText(Column):
    """
    Takes a column as parameter and returns one string containing all the
    values that were separated by ',', joined by single spaces. Spaces inside
    a value are removed, so 'Amazon Web Services' becomes 'AmazonWebServices'
    and is treated as a single word.
    """
    result = ""
    for entry in Column:
        #each entry can hold several comma-separated values
        for skill in str(entry).split(','):
            result += skill.replace(" ", "") + " "

    return result

**1. Let's take a quick look at the most used tools**

In [None]:
from wordcloud import WordCloud
#from PIL import Image
import matplotlib.pyplot as plt

tools=combineText(dataset["WorkToolsSelect"])

wordcloud = WordCloud(width = 1000, height = 800,
                     background_color ="white",
                     min_font_size = 15).generate(tools)

#plotting the word cloud for tools used
plt.figure(figsize = [10, 10])
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()

In [None]:
"""
This method takes a string as input and returns a dictionary sorted by frequency in descending order.
input- "Python Python R Python"
output-{'Python':3, 'R':1}
"""
def countFrequency(text):
    frequency = {}
    text_list = text.split()

    for item in text_list:
        if item in frequency:
            frequency[item] += 1
        else:
            frequency[item] = 1

    sorted_dict = {}
    sorted_keys = sorted(frequency, key = frequency.get, reverse = True)

    for w in sorted_keys:
        sorted_dict[w] = frequency[w]

    return(sorted_dict)

In [None]:
""" 
Takes a sorted dictionary as input and returns its values and keys as a list of two lists (counts first, then keys)
input-{'Python':3, 'R':1}
output-[[3,1],['Python','R']]
"""
def CreateListForPlotting(sorted_dict):
    count = []
    skill = []
    for key, value in sorted_dict.items():
        skill.append(key)
        count.append(value)

    return [count,skill]  
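
As an alternative sketch, the same counting can be done with pandas alone by splitting each answer and counting the pieces. This is not the approach the notebook uses, and the `answers` series below is made-up data for illustration.

```python
import pandas as pd

# Hypothetical multi-select answers, comma-separated like the survey columns.
answers = pd.Series(["Python,R", "Python,Amazon Web services", "Python,SQL,R"])

tool_counts = (
    answers.str.split(",")   # each entry becomes a list of tools
           .explode()        # one row per individual tool
           .str.strip()      # remove stray whitespace around names
           .value_counts()   # frequencies, sorted in descending order
)
print(tool_counts)

# Labels and counts as plain lists, ready for plotting.
labels = tool_counts.index.tolist()
counts = tool_counts.tolist()
```

One difference from `combineText` is that multi-word names keep their spaces here, because the values are never merged into a single whitespace-separated string.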

1.1 For a better look at the tools used, let's plot a pie chart of the column values

In [None]:
#counting the frequency of values and getting the values in dictionary
frequency_dict = countFrequency(tools)

#getting separate list of keys and values for above dictionary so that we can plot it
result_list = CreateListForPlotting(frequency_dict)

frequency = result_list[0]
skills = result_list[1]


In [None]:
#getting top 20 tools used by the respondents
y = np.array(frequency[:20])
mylabels = skills[:20]

#creating pie chart with final lists
plt.figure(figsize = [10, 10])
plt.pie(y, labels = mylabels,autopct='%1.1f%%',labeldistance=1.07)
plt.show() 

**1.2** From the above plot we can clearly see that programming languages are the most used tools (shares among the top 20 tools plotted):
    1. Python - 15.5%
    2. R - 12.1%
    3. SQL - 10.7%

Followed by other tools that are being used by the respondents.

**2. Speaking of languages, let's plot the recommended language that respondents entered.**

In [None]:
#quick look at the most used language
print(dataset["LanguageRecommendationSelect"].value_counts())

In [None]:
#getting value and its frequency in list to create pie chart
language = dataset['LanguageRecommendationSelect'].value_counts().keys().tolist()
counts = dataset['LanguageRecommendationSelect'].value_counts().tolist()

#top 10 languages
y = np.array(counts[:10])
mylabels = language[:10]

#plots pie chart for the 'LanguageRecommendationSelect' column
plt.figure(figsize=[10, 10])
plt.pie(y, labels = mylabels,autopct='%1.1f%%',labeldistance=1.07)
plt.show() 

**2.1 From the above plot we can see that Python accounts for about 62.5% of the language answers plotted**

The remaining top languages are:
    1. R - 26.1%
    2. SQL - 3.7%
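
Percentages like these can also be computed directly rather than read off the chart. Below is a minimal sketch on made-up answers; the `recommended` series is hypothetical. Note that the `autopct` labels in a pie chart are relative to the slices that are plotted (here the top 10), so they can differ slightly from shares computed over all answers.

```python
import pandas as pd

# Hypothetical answers; made up for illustration only.
recommended = pd.Series(
    ["Python", "R", "Python", "SQL", "R", "Python", "R", "Python"]
)

# normalize=True returns relative frequencies instead of raw counts.
shares = recommended.value_counts(normalize=True).mul(100).round(1)
print(shares)
```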

**3. Now that we know about the tools and languages, let's see in which industries these skills are mostly used**

In [None]:
#first glance at the industries
print(dataset["EmployerIndustry"].value_counts())

In [None]:
#separating keys and values from value_counts
Industry = dataset['EmployerIndustry'].value_counts().keys().tolist()
Industry_count = dataset['EmployerIndustry'].value_counts().tolist()

#creating pie chart for the above lists
y = np.array(Industry_count)
mylabels = Industry
plt.figure(figsize=[10, 10])
plt.pie(y, labels = mylabels,autopct='%1.1f%%',labeldistance=1.07)
plt.show() 

It can be seen that the skills are most widely used in the technology industry, for tasks such as extracting meaningful information and predicting future patterns and behavior.

We can also see that Academics is not far behind Technology, likely because a large number of the respondents are students who are practising the skills to get better at them.

**4. Languages that are used in Technology**

In [None]:
#filtering the columns
technology= dataset["LanguageRecommendationSelect"][dataset["EmployerIndustry"]=="Technology"]
print(technology.head())

In [None]:
#language count for technology domain
print(technology.value_counts())

In [None]:
#extracting result of value_counts in list for plotting
LanguageUsed = technology.value_counts().keys().tolist()
Language_count = technology.value_counts().tolist()

#top 10 languages
y = np.array(Language_count[:10])
mylabels = LanguageUsed[:10]

#pie chart for the above data
plt.figure(figsize=[10, 10])
plt.pie(y, labels = mylabels,autopct='%1.1f%%',labeldistance=1.07)
plt.show() 

**Python accounts for a major 71.6% of the answers from the technology industry, followed by R at 18.2%**

**5. Languages that are used in Academics**

In [None]:
#filtering the column for academic industry
academics= dataset["LanguageRecommendationSelect"][dataset["EmployerIndustry"]=="Academic"]
print(academics.head())

In [None]:
#language and its frequency
print(academics.value_counts())

In [None]:
##extracting result of value_counts in list for plotting
LanguageUsed = academics.value_counts().keys().tolist()
Language_count = academics.value_counts().tolist()

#top 10 languages for academic domain
y = np.array(Language_count[:10])
mylabels = LanguageUsed[:10]

#creating pie chart for languages used
plt.figure(figsize=[10, 10])
plt.pie(y, labels = mylabels,autopct='%1.1f%%',labeldistance=1.07)
plt.show()

**Here too Python is leading with 58.5%, followed by R at 27.8%. R has a larger share in academics than in technology; one possible reason is that students are more inclined to learn new languages, which would explain the bump.**

**5. Let's see the languages most used in each domain**

In [None]:
#creating a new column by concatenating the columns 'LanguageRecommendationSelect' and 'EmployerIndustry'
dataset["IndustryLanguage"]=dataset["LanguageRecommendationSelect"]+ "-" +dataset["EmployerIndustry"]
print(dataset["IndustryLanguage"].head())

In [None]:
#frequency of languages used in different domains
print(dataset["IndustryLanguage"].value_counts())

In [None]:
#creating list of value_counts() values
IndustryLanguage = dataset['IndustryLanguage'].value_counts().keys().tolist()
IndustryLanguage_count = dataset['IndustryLanguage'].value_counts().tolist()

#selecting top 20 rows
y = np.array(IndustryLanguage_count[:20])
mylabels = IndustryLanguage[:20]

#creating pie chart for the values obtained
plt.figure(figsize=[10, 10])
plt.pie(y, labels = mylabels,autopct='%1.1f%%',labeldistance=1.07)
plt.show() 
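
A cross-tabulation is another way to look at language by industry without building a concatenated column. This is an alternative sketch rather than the notebook's method, and the `toy` frame holds made-up values.

```python
import pandas as pd

# Hypothetical toy data using the survey's column names; values are made up.
toy = pd.DataFrame({
    "LanguageRecommendationSelect": ["Python", "R", "Python", "Python", "R"],
    "EmployerIndustry": ["Technology", "Academic", "Academic",
                         "Technology", "Technology"],
})

# Rows are industries, columns are languages, cells are respondent counts.
table = pd.crosstab(toy["EmployerIndustry"], toy["LanguageRecommendationSelect"])
print(table)

# normalize="index" turns each row into shares that sum to 1 within an industry.
row_shares = pd.crosstab(
    toy["EmployerIndustry"],
    toy["LanguageRecommendationSelect"],
    normalize="index",
)
print(row_shares.round(2))
```

The row-normalized table makes industries of different sizes comparable, which a single pie chart over all language-industry pairs does not.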

**5.1 Python dominates the other languages, with its largest shares coming from:**
    1. Technology
    2. Academics
    3. Financial

**6. Next up are the algorithms respondents used most widely.**

This measure does not reflect the efficiency of the algorithms; it indicates which algorithms helped respondents complete their tasks.

In [None]:
algorithms = combineText(dataset["WorkAlgorithmsSelect"])

#creating wordcloud for the algorithms string we got from above method
wordcloud = WordCloud(width = 1000, height = 800,
                     background_color = "white",
                     min_font_size = 15).generate(algorithms)

plt.figure(figsize=[10, 10])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
#getting list of values and their frequency
algorithms_dict = countFrequency(algorithms)
result_list = CreateListForPlotting(algorithms_dict)
algorithmsFrequency = result_list[0]
algorithms = result_list[1]

In [None]:
y = np.array(algorithmsFrequency)
mylabels = algorithms

#pie chart for most helpful algorithms
plt.figure(figsize=[10, 10])
plt.pie(y, labels = mylabels)
plt.show()

**7. Outcome**

7.1 We got a basic idea of the technical skills required to analyze data and apply models to it. Some of the tools are:
1. JupyterNotebook
2. Tensorflow
3. Unix/awk
4. Amazon Web Services
5. Tableau

7.2 As for languages, we have:
1. Python
2. R
3. SQL

In most cases Python outruns other languages in terms of usage, but it's good to have variety in your arsenal, so give R a try as well. SQL comes into play when you have to deal with relational databases.

7.3 Algorithms mostly used
1. Regression/Logistic Regression
2. Decision Trees
3. Random Forest
4. Neural Networks
5. Bayesian Techniques
6. Ensemble Methods
7. SVM

However, the efficiency of algorithms depends on the quality and type of data. The above list gives only an idea of widely used algorithms.




**This is my first notebook. Let me know if you like it; I am also open to suggestions.**

**Happy Kaggling!**