# Data Science Jobs Exploration

## PART 0. Introduction
This project explores key insights into the data scientist job market. It poses several questions and answers them using data from <a href = "http://www.indeed.com">indeed</a>, one of the largest job sites globally. The questions cover:

1. Total number of data scientist jobs available, and their geographical distribution.
2. Companies that recruit the most data scientists.
3. Degree and skill requirements for data scientist jobs.
4. Key skillsets for data scientist, data engineer, data analyst and machine learning roles.

The first two questions are answered by querying indeed's RESTful API; the last two are answered through a sequence of web crawling, text cleansing, filtering and LDA clustering.

## PART I. Basic Stats about Data Scientist Jobs

In this section, we answer the first two questions from our list, which break down into:

* Total number of data scientist jobs available, compared to SE and other data jobs
* Hot states that recruit data scientists
* Hot companies that recruit data scientists

The main approach is to query the indeed.com API.

<br>
Note: The code here **will not run** unless <font color = "red">publisher_key is replaced</font> at the beginning of the first code block. A publisher account can be registered at http://www.indeed.com/jsp/apiinfo.jsp.

In [1]:
import urllib2
import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt
from collections import Counter 
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import  LogisticRegression
from random import sample
from bs4 import BeautifulSoup
import nltk
import lda
import pyLDAvis
import pyLDAvis.sklearn
from operator import itemgetter

publisher_key = "#######"

### 1.1 Number of Data Scientist Jobs Available in the Market

The indeed API can be queried by following the instructions for the <a href = "https://ads.indeed.com/jobroll/xmlfeed"> XML Job Search Feed</a>. Its XML response begins with the header shown below, so the first question can be answered directly from the content of the "totalresults" tag.

\begin{align}
& \quad <response version="2">\\ 
& \quad <query>"data+scientist"</query>\\ 
& \quad ... \\ 
& \quad <pageNumber>0</pageNumber>\\ 
& \quad <totalresults>2503</totalresults>\\ 
& \quad ... \\ 
& \quad</response>\\ 
\end{align}
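
As a minimal offline sketch of reading that tag, the snippet below parses a hand-written stand-in for the response header with `xml.etree.ElementTree`. The sample string and the helper name `parse_total_results` are illustrative assumptions, not part of the indeed API.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the response header shown above (not a live API reply).
sample_response = """<response version="2">
  <query>"data+scientist"</query>
  <pageNumber>0</pageNumber>
  <totalresults>2503</totalresults>
</response>"""


def parse_total_results(xml_text):
    """Return the integer stored in the <totalresults> tag (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    node = root.find("totalresults")
    if node is None or node.text is None:
        raise ValueError("response has no <totalresults> element")
    return int(node.text)


print(parse_total_results(sample_response))
```

Checking for a missing tag makes a malformed or error response fail loudly instead of raising an opaque `AttributeError`.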


In [2]:
# given a query, return the total result from the XML
def getTotalResults(query):
    """Obtain total number of jobs given a query
    Inputs:
        string: query, separated by +
    Outputs:
        int: indicating no. of total jobs of the query
    """
        
    #form url
    query = "\"" + query + "\""   #double quotes mean it's querying exact title
    url = "http://api.indeed.com/ads/apisearch?publisher=" + publisher_key + "&v=2&q="+query +"&l=&sort=&radius=&st=&jt=fulltime&start=0&limit=26&fromage=365&highlight=0&filter=&latlong=1&co=us&chnl=&userip=45.56.94.21&useragent=&v=2"

    #read website
    response = urllib2.urlopen(url)
    content = response.read()
    
    #parse XML
    root = ET.fromstring(content)
    num = int(root.find('totalresults').text)
    return num

print "###########Data Scientist Jobs############"
print "No. of Jobs: ", getTotalResults("data+scientist")
print "##########Software Engineer Jobs##########"
print "No. of Jobs: ", getTotalResults("software+engineer")
print "#############Other Data Jobs##############"
for i in ["data+engineer", "machine+learning", "data+analyst", "business+analyst"]:
    print  i,": ", getTotalResults(i)

###########Data Scientist Jobs############
No. of Jobs:  2473
##########Software Engineer Jobs##########
No. of Jobs:  20625
#############Other Data Jobs##############
data+engineer :  1301
machine+learning :  8210
data+analyst :  4186
business+analyst :  10813


** <font color = 'red'>Analysis: </font>**

The output above shows that the number of data scientist jobs is only about 12% of the number of software engineer jobs (2,473 vs. 20,625). The market for other data jobs is demanding talent as well: machine learning, data analyst and business analyst postings all outnumber data scientist postings, while data engineer postings are about half as numerous.
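
The comparison can be made explicit with a little arithmetic on the counts printed above. This is only a sketch: the dictionary copies the numbers from the output, and the variable names are illustrative.

```python
# Job counts copied from the output above.
counts = {
    "data+scientist": 2473,
    "software+engineer": 20625,
    "data+engineer": 1301,
    "machine+learning": 8210,
    "data+analyst": 4186,
    "business+analyst": 10813,
}

baseline = counts["data+scientist"]

# Ratio of each title's postings to data scientist postings, rounded to 2 decimals.
ratios = {title: round(n / float(baseline), 2) for title, n in counts.items()}

for title, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(title, ratio)
```

A ratio above 1 means the title has more postings than "data+scientist"; software engineer postings come out at roughly eight times as many.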

### 1.2  Hot Locations 

The second part of the first question can be answered by examining the geographical fields of the XML response:

\begin{align}
& <result> \\
& <jobtitle>Data Scientist / Quantitative Analyst, Engineering</jobtitle> \\
& <company>Google</company> \\
& <city>Mountain View</city> \\
& <state>CA</state> \\
& <country>US</country> \\
& <formattedLocation>Mountain View, CA</formattedLocation> \\
& <source>Google</source> \\
& <date>Thu, 03 Nov 2016 10:27:06 GMT</date> \\
& <latitude>37.384617</latitude> \\
& <longitude>-122.08242</longitude> \\
& ... \\
& </result> \\
\end{align}

For each result (one per job posting), the "state" and "formattedLocation" tags help answer the question.
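
A minimal sketch of counting a tag across results, run on a hand-written sample page rather than a live response. The sample data (including "Example Corp") and the helper name `count_tag` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample in the shape of the <result> block shown above.
sample_page = """<response version="2">
  <results>
    <result><company>Google</company><state>CA</state>
      <formattedLocation>Mountain View, CA</formattedLocation></result>
    <result><company>Example Corp</company><state>CA</state>
      <formattedLocation>San Francisco, CA</formattedLocation></result>
    <result><company>Example Corp</company><state>NY</state>
      <formattedLocation>New York, NY</formattedLocation></result>
  </results>
</response>"""


def count_tag(xml_text, tag):
    """Count the text values of `tag` over all results (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    results = root.find("results")
    if results is None:
        return Counter()
    return Counter(node.text for node in results.iter(tag) if node.text)


print(count_tag(sample_page, "state").most_common(2))
print(count_tag(sample_page, "formattedLocation").most_common(1))
```

Because only the tag name changes, the same helper covers states, formatted locations and companies.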

However, there are two tricky problems to note:
* The API returns at most 1025 results per query.
* Each request returns only 25 results. "Turning the page" is done by changing the start parameter in the query URL.
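
The two constraints together determine which start offsets need to be requested. A small sketch, where `page_starts` is a hypothetical helper and the defaults 25 and 1025 are taken from the limits stated above:

```python
def page_starts(total_results, page_size=25, max_results=1025):
    """Start offsets to request, given the per-page and overall caps."""
    return list(range(0, min(total_results, max_results), page_size))


starts = page_starts(2473)  # 2473 is the data scientist count reported earlier
print(len(starts), starts[:3], starts[-1])
```

For a query with more than 1025 hits this gives 41 requests, the last one starting at offset 1000, so only a capped sample of the postings is ever seen.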

In [3]:
def getTopStates(query, topK):
    """query top k states that recruits a given query title
    
    Inputs:
        query: string, indicating query of job title, separated by +
        topK: int, indicating top number of results to return
    
    Outputs:
        top states that recruit the querying title: list of tuples with state and count
    """

    #query maximum number records to be obtained
    total = min(getTotalResults(query), 1025) #getTotalResults adds the double quotes itself
    query = "\"" + query + "\""
    
    states = []  
    start = 0 # as we could only see 25 results per page 
    while (start < total):
        #form url
        url = "http://api.indeed.com/ads/apisearch?publisher=" + publisher_key + "&v=2&q="+query +"&l=&sort=&radius=&st=&jt=fulltime&start="+str(start)+"&limit=26&fromage=365&highlight=0&filter=&latlong=1&co=us&chnl=&userip=45.56.94.21&useragent=&v=2"

        #read website
        response = urllib2.urlopen(url)
        content = response.read()
        
        # parse XML
        root = ET.fromstring(content) 
        states.extend([i.text for i in root.find("results").findall('.//state')])  #find states and put all to the list states
        start += 25
    
    #construct counter
    states_counter = Counter(states)
    result = states_counter.most_common(topK)
    return result

print getTopStates("data+scientist", 5)

[('CA', 310), ('NY', 109), ('VA', 74), ('MA', 68), ('WA', 58)]


In [4]:
def getTopLocs(query, topK):
    """query top k locations that recruits a given query title
    
    Inputs:
        query: string, indicating query of job title, separated by +
        topK: int, indicating top number of results to return
    
    Outputs:
        top locations that recruit the querying title: list of tuples with locations and count
    """

    #query maximum number records to be obtained
    total = min(getTotalResults(query), 1025) #getTotalResults adds the double quotes itself
    query = "\"" + query + "\""
    
    locs = []  
    start = 0 # as we could only see 25 results per page 
    while (start < total):
        #form url
        url = "http://api.indeed.com/ads/apisearch?publisher=" + publisher_key + "&v=2&q="+query +"&l=&sort=&radius=&st=&jt=fulltime&start="+str(start)+"&limit=26&fromage=365&highlight=0&filter=&latlong=1&co=us&chnl=&userip=45.56.94.21&useragent=&v=2"

        #read website
        response = urllib2.urlopen(url)
        content = response.read()
        
        # parse XML
        root = ET.fromstring(content) 
        locs.extend([i.text for i in root.find("results").findall('.//formattedLocation')])  #find formatted locations and put all to the list locs
        start += 25
    
    #construct counter
    locs_counter = Counter(locs)
    result = locs_counter.most_common(topK)
    
    return result


print getTopLocs("data+scientist", 10)

[('New York, NY', 93), ('San Francisco, CA', 73), ('Chicago, IL', 38), ('Seattle, WA', 35), ('Washington, DC', 27), ('Palo Alto, CA', 27), ('Boston, MA', 26), ('San Jose, CA', 25), ('Mountain View, CA', 22), ('Santa Clara, CA', 21)]


** <font color = 'red'>Analysis: </font>**

The two queries above show that California is still the most popular place for data scientist hires. NY, which ranks second, has only about 1/3 of the job opportunities available in CA. The top cities recruiting data scientists are known for their IT, finance or consultancy industries.

### 1.3 Hot Companies

Similarly, the top companies recruiting data scientists can be queried with the "company" tag. Let's look at the 50 companies that recruit the most data scientists and observe the industry patterns among them.

In [5]:
def getTopCompanies(query, topK):
    """query top k companies that recruit a given query title
    
    Inputs:
        query: string, indicating query of job title, separated by +
        topK: int, indicating top number of results to return
    
    Outputs:
        top companies that recruit the querying title: list of tuples with company and count
    """

    #query maximum number records to be obtained
    total = min(getTotalResults(query), 1025) #getTotalResults adds the double quotes itself
    query = "\"" + query + "\""
    
    companies = []  
    start = 0 # as we could only see 25 results per page 
    while (start < total):
        #form url
        url = "http://api.indeed.com/ads/apisearch?publisher=" + publisher_key + "&v=2&q="+query +"&l=&sort=&radius=&st=&jt=fulltime&start="+str(start)+"&limit=26&fromage=365&highlight=0&filter=&latlong=1&co=us&chnl=&userip=45.56.94.21&useragent=&v=2"

        #read website
        response = urllib2.urlopen(url)
        content = response.read()
        
        # parse XML
        root = ET.fromstring(content) 
        companies.extend([i.text for i in root.find("results").findall('.//company')])  #find companies and put all to the list companies
        start += 25
    
    #construct counter
    companies_counter = Counter(companies)
    result = companies_counter.most_common(topK)
    
    return result

topcompanies = getTopCompanies("data+scientist", 50)
print topcompanies

[('KPMG', 26), ('Booz Allen Hamilton', 21), ('Amazon Corporate LLC', 13), ('Verizon', 12), ('Microsoft', 12), ('Leidos', 9), ('Facebook', 9), ('Workbridge Associates', 8), ('Capital One', 8), ('CACI', 7), ('Google', 7), ('Walmart eCommerce', 7), ('Indeed Prime', 6), ('Uber', 6), ('UnitedHealth Group', 6), ('Oracle', 6), ('Cisco Systems, Inc.', 6), ('Netflix', 6), ('IBM', 5), ('NVIDIA', 5), ('The Nielsen Company', 5), ('Morgan Stanley', 5), ('Faraday Future, Inc.', 4), ('Mitre Corporation', 4), ('Intel', 4), ('SAP', 4), ('Selby Jennings', 4), ('Predictive Science', 4), ('Walmart', 4), ('eBay', 4), ('Oscar Technology', 4), ('Seagate', 4), ('CGI', 4), ('Teradata', 4), ('Aspen Technology', 3), ('Aetna', 3), ('SAIC', 3), ('OnDeck', 3), ('Dropbox', 3), ('Central Intelligence Agency', 3), ('NCR', 3), ('Xerox', 3), ('Illinois Technology Association', 3), ('Adobe', 3), ('Accenture', 3), ('DataRobot', 3), ('TARGET', 3), ('Home Depot', 3), ('GE Healthcare', 3), ('Magento', 3)]


** <font color = 'red'>Analysis: </font>**


The following industries are making a big move into the "Data Science Era":
1. Consulting. Representative firms include: 
    * Booz Allen Hamilton 
    * KPMG
    * The Nielsen Company
    * IBM
    * CACI
    * Accenture
    * Apogee Integration LLC
    * CGI
    * McKinsey & Company
    * Aspen Technology
    * DataRobot
2. Technology. Representative firms include: 
    * Microsoft
    * Leidos
    * Teradata
    * Oracle
    * NVIDIA
    * Cisco Systems, Inc
    * Intel
    * SAP
    * Adobe
    * Xerox
    * Illinois Technology Association
3. Internet. Representative firms include:
    * Facebook
    * Google
    * Netflix
    * Indeed Prime
    * Dropbox
4. eCommerce. Representative firms include:
    * Amazon Corporate LLC
    * Walmart eCommerce
    * eBay
    * Groupon
    * TARGET
    * Magento
5. Healthcare. Representative firms include:
    * UnitedHealth Group
    * Aetna
    * Preventice Services, LLC
    * UnityPoint Health
    * GE Healthcare
6. Finance. Representative firms include:
    * Capital One
    * Morgan Stanley
    * OnDeck

Some other top industries include Telecommunications (Verizon), Government Service (Mitre Corporation, SAIC, Central Intelligence Agency) and Transportation Network (Uber).

    

## PART II. Information Retrieval

Information Retrieval takes two steps.

1. Download information from the indeed API;
2. Follow the url links to crawl the web pages for job descriptions.

### 2.1 Download from indeed API

To support the analysis that follows, the API results are downloaded and saved for further use. The XML format of the results is shown in the previous section, so the information can be collected by iterating over the XML. Note that the **query is stored** in the dataframe as well, since later steps depend on it; the dataframe is saved as dataJobs.csv.
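
The download code itself is not shown here, so the following is only a sketch of the idea under stated assumptions: it uses the standard library instead of pandas, runs on a hand-written sample (including "Example Corp"), and the helper names `results_to_rows` and `rows_to_csv` are hypothetical. It is written for Python 3.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hand-written sample page; a real page would come from the API.
sample_page = """<response version="2">
  <results>
    <result><jobtitle>Data Scientist</jobtitle><company>Google</company>
      <state>CA</state></result>
    <result><jobtitle>Data Scientist</jobtitle><company>Example Corp</company>
      <state>NY</state></result>
  </results>
</response>"""


def results_to_rows(xml_text, query):
    """Turn each <result> into a dict and record the query that produced it."""
    rows = []
    for result in ET.fromstring(xml_text).iter("result"):
        row = {child.tag: (child.text or "").strip() for child in result}
        row["query"] = query
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with one column per tag seen."""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = results_to_rows(sample_page, "data+scientist")
csv_text = rows_to_csv(rows)
print(csv_text)
```

Storing the query alongside each row is what later allows postings from different titles to be labeled and compared.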


In [6]:
dataJobs = pd.read_csv("dataJobs.csv");
dataJobs.head()

Unnamed: 0.1,Unnamed: 0,city,company,country,date,expired,formattedLocation,formattedLocationFull,formattedRelativeTime,indeedApply,...,latitude,longitude,onmousedown,query,snippet,source,sponsored,state,stations,url
0,0,San Francisco,Doximity,US,"Tue, 11 Oct 2016 03:29:16 GMT",False,"San Francisco, CA","San Francisco, CA 94107",30+ days ago,False,...,37.76923,-122.39011,"indeed_clk(this,'2266');",data+scientist,"As a Data Scientist, you'll work closely with ...",Doximity,False,CA,,http://www.indeed.com/viewjob?jk=788fdeb848461...
1,1,San Ramon Village,GE Digital,US,"Tue, 15 Nov 2016 02:00:47 GMT",False,"San Ramon Village, CA","San Ramon Village, CA",3 days ago,True,...,37.71978,-121.92857,"indeed_clk(this,'2266');",data+scientist,Qualifications/Requirements for a Principal Da...,Indeed,False,CA,,http://www.indeed.com/viewjob?jk=3169483be454b...
2,2,Santa Clara,Intel,US,"Wed, 09 Nov 2016 05:21:34 GMT",False,"Santa Clara, CA","Santa Clara, CA 95052",9 days ago,False,...,37.354397,-121.95055,"indeed_clk(this,'2266');",data+scientist,"Background in deep learning, machine learning ...",Intel,False,CA,,http://www.indeed.com/viewjob?jk=e938e7fdb4f1b...
3,3,New York,Morgan Stanley,US,"Mon, 10 Oct 2016 17:51:58 GMT",False,"New York, NY","New York, NY 10032",30+ days ago,False,...,40.839622,-73.941025,"indeed_clk(this,'2266');",data+scientist,"Proficiency working with large datasets, data ...",Morgan Stanley,False,NY,,http://www.indeed.com/viewjob?jk=fe65822ddca4a...
4,4,Research Triangle Park,Sciome LLC,US,"Fri, 18 Nov 2016 18:39:35 GMT",False,"Research Triangle Park, NC","Research Triangle Park, NC",3 hours ago,True,...,35.895603,-78.85714,"indeed_clk(this,'2266');",data+scientist,Data Scientist – Text-mining and Natural Langu...,Indeed,False,NC,,http://www.indeed.com/viewjob?jk=92ed8fb391d39...


### 2.2 Web Crawling

The traditional way of web crawling is to find the relevant section in the HTML source code. The Indeed website stores job descriptions in `<span id="job_summary" class="summary">`. However, this section cannot be crawled reliably, so we use an alternative method to extract the useful information.

Reasons for this and the alternative methods will be explained using a walkthrough example with <a href = "http://www.indeed.com/viewjob?jk=fe65822ddca4a793&qd=LsVW8c0iEXRzkb9K4S0ffamBYu_x_hMniBdnXt78vi0_wHkwQrCdt91k9_FLudNYMlkHegqyCpRRHufrX2C9KLeJkvxhLXTS2xRABv6u61aptQI94s5DsPOzRyYMc52A&indpubnum=9207766499679789&atk=1b1sovehd18j37av">a data scientist job link</a>.


#### 2.2.0 Web Crawling: Problem Identification and Solutions

In [7]:
joburl = dataJobs["url"][3]

def webCrawl(url):
    """Obtain job summary section text given an indeed url
    Input:
        url: String
    Output:
        text: String
    """
    try:
        html = urllib2.urlopen(url).read() # Connect to the job posting
    except:
        print "failed to open"
        return ""

    soup = BeautifulSoup(html, "html.parser")
    return soup.find("span", id = "job_summary").getText()
    
print webCrawl(dataJobs["url"][3])

Morgan Stanley's business around the world is supported by groups and teams with a wide variety of specialized skills. They provide information and strategic thinking to the Management Committee; help to ensure the long-term growth and efficient day-to-day functioning of our business; and serve the well-being of our shareholders, clients and employees.


Morgan Stanley Strats & Modeling (MSSM) provides revenue-generating activities that are centered on financial analytics. Embedded Desk Strategist (Strat) teams reside within our Sales & Trading businesses including Equity, Fixed Income, and Commodities as well as our Banking businesses including Investment Banking and Global Capital Markets. The Modeling team and other MSSM project based teams such as Core Analytics and Core Electronic Trading provide quantitative solutions to multiple


businesses.


We are looking for a self-motivated, innovative, hard-working individual who can handle changing priorities and multiple tasks in a time

The output above shows that, although the page is opened and read successfully, the text extracted from the `<span id="job_summary" class="summary">` section is incomplete: the qualifications part is <font color = 'red'>missing</font>. This appears to be caused by spurious closing span tags, which seem intended to hinder web crawling. Trials on other postings show the same pattern, though the length of the missing part varies.

The alternative solution is to extract all text from the HTML page and then pick out the useful information. This takes three steps:
1. Exclude any script or style related elements.
2. Find where the job requirements start and end inside the job description.
3. Use job descriptions from accountant jobs to screen out irrelevant wording.

#### 2.2.1 Exclude Script or Style Related Elements

The code block below shows the first step.

In [8]:
def webCrawl(url):
    """Given an indeed job url, return the whole text excluding script and style
    Input:
        url: String
    Output:
        content: String
    """
    try:
        html = urllib2.urlopen(url).read() # Connect to the job posting
    except:
        return ""
    
    
    soup = BeautifulSoup(html, "html.parser")
    
    # Reference for this step: https://jessesw.com/Data-Science-Skills/ 
    for script in soup(["script", "style"]):
        script.extract() # Remove these two elements from the BS4 object to get clean text
    content = soup.getText().lower()
    return content

content = webCrawl(dataJobs["url"][3])
print content



data scientist / modeler job - morgan stanley - new york, ny | indeed.com

















skip to job description, searchclose













find jobsfind resumesemployers / post job







upload your resume

sign in







:







what
where





advanced job search


 

 



 



job title, keywords or company


city, state, or zip












data scientist / modeler

morgan stanley

1,559 reviews
 -
new york, ny


morgan stanley's business around the world is supported by groups and teams with a wide variety of specialized skills. they provide information and strategic thinking to the management committee; help to ensure the long-term growth and efficient day-to-day functioning of our business; and serve the well-being of our shareholders, clients and employees.


morgan stanley strats & modeling (mssm) provides revenue-generating activities that are centered on financial analytics. embedded desk strategist (strat) teams reside within our sales & trading businesses including eq

#### 2.2.2 Extract Substring for Job Requirement

A recurring problem is that the job details always consist of a company introduction, then the job description, and then the job requirements. In our case, the details start by introducing Morgan Stanley, then describe the data scientist's duties, and finally list the qualifications/requirements. The first two parts, however, are irrelevant. To discard this information, it is necessary to find the start and end positions of the part we want.

After much trial and error, we determined that:

* Start keywords include "qualification", "responsibilit", "require", "skill", "role", "experience", "demonstrat" - some of them are truncated to a stem so that inflected forms are not missed. Since these words can occur several times in a job description, we take the first appearance of any of them as the start position.
* The end position is "days ago". We found that each job description ends with the date it was posted: in this case, "12 days ago". Since every job has a different post date, we only match on "days ago".

In [9]:
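def extractUseful (content):
    """Cut the crawled page text down to the requirement section
    Input:
        content: String (a float, e.g. NaN, is treated as nothing to extract)
    Output:
        String: text from the first start keyword up to "days ago", or "notok"
    """
    if type(content) == float: #non-string input, nothing to extract
        return "notok"
    content = content.replace("\r"," ").replace("\n", " ")
    startwords = ["qualification", "responsibilit", "require", "skill", "role", "experience", "demonstrat"]
    start = set([content.find(i) for i in startwords])
    start.discard(-1) #find returns -1 when a keyword is absent
    if (len(start) == 0): #none of the start words is found
        return "notok"
    start_pos = min(start)
    end_pos = content.find("days ago")
    if (end_pos == -1): #no end marker: keep everything after the start
        return content[start_pos:]
    return content[start_pos:end_pos-3] #-3 is because we want to eliminate the number if possible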
        
content = extractUseful (content)
print content

skills. they provide information and strategic thinking to the management committee; help to ensure the long-term growth and efficient day-to-day functioning of our business; and serve the well-being of our shareholders, clients and employees.   morgan stanley strats & modeling (mssm) provides revenue-generating activities that are centered on financial analytics. embedded desk strategist (strat) teams reside within our sales & trading businesses including equity, fixed income, and commodities as well as our banking businesses including investment banking and global capital markets. the modeling team and other mssm project based teams such as core analytics and core electronic trading provide quantitative solutions to multiple   businesses.   we are looking for a self-motivated, innovative, hard-working individual who can handle changing priorities and multiple tasks in a timely fashion.   as a quant developer and modeler, you will build tools to query, clean, and analyze raw data by f

With this rule, only 30 of the 5425 crawled pages could not be cleaned successfully.
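
The start/end rule can be checked on a small synthetic string. This is a self-contained sketch, not the notebook's function: `extract_requirements` is a hypothetical name, the sample sentence is made up, and it returns `None` rather than "notok" when no start keyword is found.

```python
START_WORDS = ["qualification", "responsibilit", "require", "skill",
               "role", "experience", "demonstrat"]


def extract_requirements(text, end_marker="days ago"):
    """Slice from the first start keyword to just before the posting age."""
    text = text.lower().replace("\r", " ").replace("\n", " ")
    positions = [text.find(word) for word in START_WORDS]
    positions = [p for p in positions if p != -1]
    if not positions:
        return None  # no start keyword at all
    start = min(positions)
    end = text.find(end_marker, start)
    if end == -1:
        return text[start:]  # no end marker: keep the rest of the page
    # Step back 3 characters to drop a short number such as "12 ".
    return text[start:max(start, end - 3)]


sample = ("About us: we sell widgets. Requirements: python and sql. "
          "12 days ago - save job")
print(extract_requirements(sample))
```

The fixed 3-character step back only removes one- or two-digit ages cleanly; a string such as "30+ days ago" would leave a stray character, which is one reason the later filtering step is still needed.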

#### 2.2.3 Filter Out Irrelevant Wording.

However, the extraction is clearly not yet perfect. Some start keywords may appear before the job requirement section, so irrelevant text remains and creates barriers for further analysis. To clean the content further, some accountant job descriptions are downloaded and used as a "filter" on the remaining descriptions: any word that also appears in the accountant postings is removed.
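
The filtering idea can be sketched on toy text. To stay self-contained, the snippet below splits on whitespace instead of using nltk, and the two sentences and the tiny stopword set are made-up stand-ins for accountant.txt and the nltk stopword list.

```python
# Made-up stand-ins: one accountant posting and one data scientist posting.
accountant_text = ("we are looking for a team player with strong "
                   "communication skills and experience")
job_text = ("we are looking for a data scientist with experience "
            "in python and hadoop")

# Tiny stand-in for the nltk English stopword list.
STOPWORDS = {"in", "the", "of"}

# Every word seen in the accountant posting is treated as generic job wording.
filters = set(accountant_text.split()) | STOPWORDS

kept = [token for token in job_text.split() if token not in filters]
print(" ".join(kept))
```

Words shared with an unrelated profession are likely boilerplate, so what survives is closer to role-specific vocabulary. Keeping the filter as a set makes each membership test constant time.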


In [10]:
### Construct Filters from Accountant Job Descriptions ###
file = open("accountant.txt").read().lower()
filters = set(nltk.word_tokenize(file))
filters.update(nltk.corpus.stopwords.words('english'))
filters = list(filters)

In [11]:
def process(text,  filters = nltk.corpus.stopwords.words('english')):
    """ Tokenizes the text, drops filtered words and lemmatizes the rest
    Inputs:
        text: str: raw text
        filters: list(str): words to drop
                    (the default argument is the nltk English stopword list)
    Outputs:
        str: remaining lemmatized tokens joined by spaces
    """
    lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
    word_list = nltk.word_tokenize(text);
    
    lemma_list = [];
    for i in word_list:
        if i not in filters:
            try:
                lemma = lemmatizer.lemmatize(i);
                lemma_list.append(str(lemma));
            except:
                pass
    return " ".join(lemma_list)

processed_content = process(content, filters)
print processed_content



strategic thinking committee help long-term growth efficient day-to-day functioning serve well-being shareholder client employee morgan stanley strats modeling mssm provides revenue-generating centered analytics embedded desk strategist strat team reside trading equity income commodity well banking investment banking market modeling mssm team core analytics core trading quantitative solution multiple self-motivated innovative hard-working handle changing priority multiple fashion quant developer modeler build tool query clean analyze raw data filtering database develop model pricing hedging risk securitized product product q- shell-based script update database model kdb+ database designing fitting debugging sophisticated econometric mortgage model team-oriented significant growth advanced engineering mathematics statistic quantitative programming programming language q r c/c++ java python linux using unix command-line tool quantitative solving datasets data mining method numerical meth

In the code above, the text is further processed and roughly half of the irrelevant words are filtered out.

The crawled and cleaned text is saved as webcrawled.csv for further processing. dataJobs.csv is extended with a column "jd" containing the cleaned web content and saved as dataJobs_v2_crawled.csv. Rows with jd = "notok" (only 30 of the 5400+ rows) are dropped, since they contain no useful information.

In [12]:
webContent = pd.read_csv("webcrawled.csv")
webContent = webContent.drop(webContent[webContent["Cleaned"] == "notok"].index)
print "Total Length:",len(webContent.index)
webContent.head()

Total Length: 5395


Unnamed: 0.1,Unnamed: 0,Cleaned,Content
0,0,role in the product development process by unc...,\r\r\r\r\r\n\r\r\r\r\r\ndata scientist - growt...
1,1,"responsibilities:at ge software, we are creati...",\r\r\r\r\r\n\r\r\r\r\r\ndata scientist job - g...
2,2,qualifications what you will be worki...,\r\r\r\r\r\n\r\r\r\r\r\ndata scientist job - i...
3,3,skills. they provide information and strategic...,\r\r\r\r\r\n\r\r\r\r\r\ndata scientist / model...
4,4,skillset: the ideal candidate will have experi...,\r\r\r\r\r\n\r\r\r\r\r\ndata scientist – text-...


In [13]:
dataJobs = pd.read_csv("dataJobs_v2_crawled.csv")
print "Total Length:",len(dataJobs.index)
dataJobs[["jd"]].head()

Total Length: 5395


Unnamed: 0,jd
0,role in the product development process by unc...
1,"responsibilities:at ge software, we are creati..."
2,qualifications what you will be worki...
3,skills. they provide information and strategic...
4,skillset: the ideal candidate will have experi...


## PART III. Feature Selection & Construction

The data processing could have stopped here if the project were about text classification; in fact, the next step shows classification tasks with high accuracy. But since the aim of the project is to understand data science job requirements, feature extraction for data science (and related) skills has to continue. The feature extraction takes three steps:

1. Find important features that distinguish data science jobs from other jobs with lasso-regularized logistic regression
2. Manually select high-frequency skills that are common across all these titles
3. Combine bigram features and screen out irrelevant skills

### 3.1 Features that Distinguish

Some features with lower frequency may still help to distinguish data science job requirements from those of other data jobs. Logistic regression with lasso (L1) regularization provides automatic feature selection for this step. It complements the manual selection of high-frequency skills to produce a complete skillset.

Some key settings in feature construction / lasso logistic regression with scikit-learn:
* Minimum document frequency is set to 10
* Accuracy is estimated on a random hold-out set; the code below holds out 20% of the rows
* Features whose coefficients are nonzero are printed out for skillset construction
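
The function in this section estimates accuracy on a random hold-out portion (`split = int(N*0.8)`). A standalone sketch of that split, where the `holdout_split` name and the fixed seed are assumptions added for reproducibility:

```python
from random import Random


def holdout_split(n, train_fraction=0.8, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train and test parts."""
    rng = Random(seed)  # fixed seed (an assumption) so the split is repeatable
    permutation = rng.sample(range(n), n)
    cut = int(n * train_fraction)
    return permutation[:cut], permutation[cut:]


train_index, test_index = holdout_split(10)
print(len(train_index), len(test_index))
```

A single split gives a noisier accuracy estimate than k-fold cross validation, so repeated runs without a fixed seed can report slightly different accuracies.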

Since "data+scientist" is labeled 0 and the alternative query is labeled 1, a negative coefficient pushes the classification toward "data+scientist" and a positive one toward the alternative. The function outputs the most influential coefficients separately for "data+scientist" and for the alternative, each sorted by magnitude.
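
The sign convention can be illustrated without fitting a model. In the sketch below, `split_by_sign` is a hypothetical helper; the two negative values are rounded from the data engineer output in this notebook, while the positive values and the zero entry are invented purely for illustration.

```python
from operator import itemgetter


def split_by_sign(coefs):
    """Separate nonzero coefficients by sign, strongest first in each group."""
    toward_data_scientist = sorted(
        ((term, c) for term, c in coefs.items() if c < 0), key=itemgetter(1))
    toward_alternative = sorted(
        ((term, c) for term, c in coefs.items() if c > 0),
        key=itemgetter(1), reverse=True)
    return toward_data_scientist, toward_alternative


toy_coefs = {
    "regression": -1.19,  # rounded from the notebook output
    "phd": -0.87,         # rounded from the notebook output
    "pipeline": 1.30,     # invented value
    "kafka": 0.90,        # invented value
    "the": 0.0,           # zeroed out by the L1 penalty, so dropped
}

data_scientist_terms, alternative_terms = split_by_sign(toy_coefs)
print(data_scientist_terms)
print(alternative_terms)
```

Terms with a coefficient of exactly zero are the ones the lasso penalty removed, which is what makes the regression act as a feature selector.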

In [14]:
def keyDistinguish( alternativeTitle):
    """ Construct a lasso-regularized logistic regression model and output important features
    Inputs:
        alternativeTitle: query that is different from "data+scientist" and exists in dataJobs
    Outputs:
        accuracy: float, indicating the accuracy on the test portion
        dataScienceCoef: list of tuples, n-gram and its coefficient denoting classifying towards "data+scientist"
        alternativeCoef: list of tuples, n-gram and its coefficient denoting classifying towards alternative query
    """
    ### df as subset of dataJobs
    df = dataJobs[ (dataJobs["query"] == "data+scientist")|(dataJobs["query"] == alternativeTitle)][["query", "jd"]]

    # construct features
    stopwords = filters #stopwords to filtered out
    vectorizer = sklearn.feature_extraction.text.CountVectorizer(stop_words = stopwords, ngram_range=(1,2), min_df=10) #vectorizer
    X = vectorizer.fit_transform(df['jd']) #transform the df['jd'] into matrix
    y = np.array(df[['query']].replace(["data+scientist", alternativeTitle], [0,1]) )[:,0] #construct labels
    #map each column index (key) back to its unigram or bigram (value)
    words_index = dict(zip(vectorizer.vocabulary_.values(), vectorizer.vocabulary_.keys())) 
     
    #left out proportion
    N = len(y)
    permutation = sample(range(N), N)
    split = int(N*0.8)
    train_index = permutation[:split]
    test_index =  permutation[split:]
    
    #construct logistic Regression Model
    logistic = LogisticRegression(penalty = 'l1')
    logistic= logistic.fit(X[train_index,:], y[train_index])
    y_pred = logistic.predict(X[test_index,:]) # predict on the test part
    accuracy = sum(y_pred == y[test_index])/float(len(y[test_index])) #obtain accuracy of prediction
    
    #put all the coefs that not equal to 0 into dict
    dataScienceCoef = {}
    alternativeCoef = {}
    coef = logistic.coef_.T[:,0]
    for i in range(len(coef)):
        if (coef[i]!=0):
            if coef[i] < 0:
                dataScienceCoef[words_index[i]] =  coef[i]
            else:
                alternativeCoef[words_index[i]] =  coef[i]
    
    dataScienceCoef = sorted(dataScienceCoef.items(), key=itemgetter(1))
    alternativeCoef = sorted(alternativeCoef.items(), key=itemgetter(1), reverse=True)
            
    return accuracy, dataScienceCoef, alternativeCoef

In [15]:
keyDistinguish("data+engineer")

(0.90617283950617289,
 [(u'cloud technologies', -1.5510855140718205),
  (u'java scala', -1.2862865064222013),
  (u'development teams', -1.2690493004210897),
  (u'effective', -1.227766304109452),
  (u'regression', -1.1920871422244668),
  (u'part data', -1.0901586152540299),
  (u'explain', -1.0763183052997198),
  (u'parts', -1.0141058272037557),
  (u'clearance', -0.99325690636608877),
  (u'15', -0.98750119263814751),
  (u'discovery', -0.90925463773849446),
  (u'methods', -0.89560488325646714),
  (u'experiments', -0.87533623345166978),
  (u'models', -0.87478069907674161),
  (u'master srequired', -0.8690853005611342),
  (u'phd', -0.86838048227290632),
  (u'mining', -0.83173471553089395),
  (u'components', -0.81983474577034376),
  (u'uber', -0.81031702995231392),
  (u'srequired data', -0.7834588158199477),
  (u'actions', -0.78179772719068152),
  (u'campaign', -0.73856004307019862),
  (u'allen', -0.72950389435273311),
  (u'sci', -0.68899614413683785),
  (u'strategy', -0.64978010690803123),
 

<font color = 'red'>Analysis:</font>

Key skills that could be used for further analysis:

* Data Scientist Skills

    sci, sas, statistics, phd, regression, pandas, fraud, bioinformatics, automation, ml, d3, learning models, mining, classification, components, analytics, detection, predictive modelling, natural, shiny, data analysis, matlab, javascript, numpy, graph
    

* Data Engineer Skills

    Configuration, scale data, transition, kafka, data visualization, metadata, pipeline, maintenance, distributed, apache, stack, design implement, hadoop, warehousing, scala, actuarial, apis, scalable, data integrity, data engineering, 

In [17]:
keyDistinguish("machine+learning")

(0.77917981072555209,
 [(u'statistics engineering', -1.2573029809500829),
  (u'statistics mathematics', -1.2411588452165494),
  (u'presenting', -1.2148772905835292),
  (u'spark hadoop', -1.1863656179420667),
  (u'developments', -1.1778008034319059),
  (u'compensation', -1.1507616937287073),
  (u'hypotheses', -1.0147379490302819),
  (u'uber', -0.97262626862187185),
  (u'facility', -0.88119025001026841),
  (u'medicine', -0.87493936628423818),
  (u'offices', -0.87073660597657765),
  (u'family', -0.86936514244656959),
  (u'disciplines', -0.84703556038048333),
  (u'others', -0.83848641720131223),
  (u'pig hive', -0.83127841954639459),
  (u'requires', -0.8254410681201596),
  (u'online', -0.79180589704035054),
  (u'matplotlib', -0.76644019430588139),
  (u'religion gender', -0.76235112138673577),
  (u'thought', -0.74472476131293652),
  (u'using data', -0.71942453285858943),
  (u'neural', -0.7077951767376528),
  (u'genetics', -0.70554148162396968),
  (u'cognitive', -0.69276136444721814),
  (u'r

<font color = 'red'>Analysis:</font>

Key skills that could be used for further analysis:

* Data Scientist Skills

	pig, math statistics, phd, hypothesis, curiosity, spark, hadoop, semantic, ruby, hive, learning deep, matplotlib, python, java, sci, predict, data storage, learning data, performance computing, recommendations, agile development, deep learning, engineering bioinformatics, 

* Machine Learner Skills

	physics applied, vector machines, quantitative statistics, architecture, sensor, intelligent, econometrics, torch, perl, data structures, matlab, machine learning, graphical

Note that some skills that fell on the data scientist side in the previous comparison (data scientist vs. data engineer), such as matlab and graphical, now appear in the machine learning skillset instead. This suggests that the machine learning and data scientist roles overlap, which is consistent with this task having the lowest accuracy of the three (about 0.78). These skills are relevant to data scientists, but they appear more frequently in machine learning postings.

In [18]:
keyDistinguish("data+analyst")

(0.87128712871287128,
 [(u'qlik', -2.8566684797936017),
  (u'ts', -1.4253367786778253),
  (u'ml', -1.385729499757834),
  (u'acquisition', -1.3397678767487717),
  (u'phd', -1.2837029590749374),
  (u'committed', -1.2778795756953838),
  (u'algorithms', -1.2754223665047024),
  (u'natural language', -1.266972640455255),
  (u'internal external', -1.2514951325604866),
  (u'spark', -1.0536503196800999),
  (u'data engineering', -1.0453749911443846),
  (u'minitab', -1.0374125788851101),
  (u'scientists', -1.0286457487739931),
  (u'python', -0.98158867113865667),
  (u'java', -0.97262913755350133),
  (u'week', -0.93692132923699722),
  (u'pandas', -0.93613685288217885),
  (u'harassment', -0.88447652487823047),
  (u'curiosity', -0.86066791932419495),
  (u'applied', -0.85648238685923939),
  (u'actionable', -0.85526545205683857),
  (u'math', -0.8402137197590428),
  (u'acceptance', -0.7853868064313384),
  (u'intended', -0.78465192392103045),
  (u'sensing', -0.76187438898391957),
  (u'expertise', -0.706

<font color = 'red'>Analysis:</font>

Key skills that could be used for further analysis:

* Data Scientist Skills

	predictive analytics, qlik, algorithms, python, applied, ml, visualizations, cause analysis, spark, aws, matlab, gis, sas, big data, intelligence, recommendations, exploratory, modelling, nosql, math, spatial
    

* Data Analyst Skills

	statistics, pivot, modelling machine, javascript, genomics, powerpoint, cloud, vba, dashboards, tableau, 

### 3.2 Manual Selection with Frequency >= 100
Because the previous section may miss high-frequency skills that are shared across all of these titles, a manual selection step is necessary. This is feasible if we only look at N-grams with frequency >= 100: the threshold reduces the number of N-grams to screen from 584,751 to 3,904, as the following code output shows.

In [19]:
### Construct Counter for Unigrams and Bigrams ###
words = []; #list to put all the N-grams

for i in webContent["Cleaned"]:
    tokens = i.split();
    words.extend(tokens) #put all unigrams into words list
    
    #Construct bigrams
    for j in range( len(tokens) - 1):
        words.append(" ".join([tokens[j], tokens[j+1]])) #put all the bigrams into words list
        
c = Counter(words);

print "Total Number of Bigram and Unigrams:", len(c)
print "Number of N-grams with frequency >= 100:", sum ( np.array(c.values()) >= 100)
print c.most_common(3904)

Total Number of Bigram and Unigrams: 584751
Number of N-grams with frequency >= 100: 3904
[('and', 114899), ('to', 57808), ('the', 43485), ('of', 42949), ('in', 34737), ('with', 31043), ('a', 28793), ('data', 27924), ('experience', 19862), ('for', 19032), ('or', 18899), ('business', 13273), ('as', 12233), ('is', 11205), ('work', 9599), ('on', 9334), ('be', 8131), ('will', 8111), ('our', 8053), ('you', 7693), ('ability', 7403), ('an', 7136), ('skills', 7107), ('ability to', 7075), ('that', 7055), ('are', 6838), ('we', 6517), ('years', 6034), ('\xe2\x80\xa2', 5848), ('team', 5766), ('experience with', 5723), ('other', 5402), ('of the', 5386), ('strong', 5043), ('analysis', 5013), ('at', 4935), ('knowledge', 4889), ('this', 4746), ('working', 4713), ('marketing', 4686), ('development', 4644), ('-', 4554), ('from', 4533), ('experience in', 4444), ('in a', 4380), ('by', 4357), ('management', 4328), ('degree', 4326), ('in the', 4306), ('requirements', 4279), ('new', 4254), ('all', 4204), ('h
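
The counting and thresholding step above can be reproduced on a small scale. The following is a hedged sketch rather than the notebook's code: the three-document corpus, the helper name `ngram_counter` and the lowered threshold are assumptions chosen so that the result is easy to check by hand.

```python
from collections import Counter

def ngram_counter(documents):
    """Count unigrams and adjacent-word bigrams over whitespace-tokenized texts."""
    counts = Counter()
    for text in documents:
        tokens = text.split()
        counts.update(tokens)                                              # unigrams
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))  # bigrams
    return counts

# Toy corpus (made up); the notebook iterates over webContent["Cleaned"] instead.
docs = ["machine learning with python", "python and machine learning", "python sql"]
counts = ngram_counter(docs)

min_freq = 2  # the notebook uses 100; lowered here for the tiny corpus
frequent = {gram: n for gram, n in counts.items() if n >= min_freq}
```

Only `python`, `machine`, `learning` and the bigram `machine learning` survive the threshold here, which mirrors how the cut-off of 100 shrinks the list that has to be screened manually.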

Based on the output of section 3.1 and the roughly 3,900 most frequent N-grams from the counter in section 3.2, the key skills were selected and stored in wantedskills.txt. The key skills and their corresponding counts are printed by the following code block.

In [20]:
f_skill = open("wantedskills.txt")  #containing all the wanted skills

for line in f_skill:
    skill = line.strip()
    print skill, c[skill]

c++ 129
marketing 4686
marketing data 76
communication 2890
database 1690
machine learning 1900
statistical 2298
python 865
computer science 738
statistics 672
programming 1773
modelling 49
r 511
big data 1504
algorithm 185
data analysis 1053
hadoop 769
testing 1104
c 51
java 477
bachelor 135
microsoft 1359
quantitative 1261
financial 1229
data science 991
problem solving 706
security 925
mathematics 235
code 776
metric 54
social 809
ad 528
optimization 585
predictive 972
data mining 520
business process 402
architecture 475
bioinformatics 62
agile 746
masters 241
infrastructure 550
tableau 354
project management 600
business requirement 32
business intelligence 625
interpersonal 607
written communication 575
distributed 663
hive 151
large scale 306
server 390
dashboard 108
phd 452
visualization 750
visualizations 171
oracle 326
powerpoint 252
analytical skill 6
digital marketing 459
data warehouse 312
data management 419
statistical analysis 307
large data 500
optimize 469
business an

### 3.3 Feature Constructions

The final step of the feature transformation is to construct features from the key skills found in the previous section. Because CountVectorizer is employed in the final analysis, skills containing punctuation (such as "c++") and bigrams (such as "machine learning") have to be rewritten as single tokens without special characters. Words with similar meanings are merged (such as "visualizations" and "visualization").

In [21]:
f_skill = open("wantedskills.txt")  #containing all the wanted skills
f_skill_replacement = open("wantedskillsBigramReplaced.txt") #containing the replacement wording for each wanted skill, line by line

skills = []
skills_replacement = []
for line in f_skill:
    skills.append(line.strip())  #store all the skills in skills list
for line in f_skill_replacement:
    skills_replacement.append(line.strip()) #store all the replacement skills wording in skills_replacement

skills_rep = dict(zip(skills, skills_replacement))  #construct dict


def featureConstruct(text): 
    """Given a text input, construct a new text with only wanted skills 
    Input:
        text, String
    Output:
        text, String, a new text with only wanted skills 
    """

    text_skills = [] #skills list
    
    #construct tokens
    tokens = text.split()
    
    #unigram: if the unigram is wanted skill, then append its **replacement skill word** to text_skills
    for i in tokens:
        if i in skills:
            text_skills.append(skills_rep[i])
    #bigram: if the bigram is a wanted skill, then append its **replacement skill word** to text_skills
    for j in range( len(tokens) - 1):
        bigram = " ".join([tokens[j], tokens[j+1]]) #construct bigram
        if bigram in skills:
            text_skills.append(skills_rep[bigram])
            
    text_skills = " ".join(list(set(text_skills)))#join tokens from text_skills with a space in between
    return text_skills
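
To make the replacement step concrete, here is a small self-contained sketch of the same idea. The `skills_rep` mapping below is hypothetical: the real pairs live in wantedskills.txt and wantedskillsBigramReplaced.txt, which are not shown, although forms such as `cplusplus` and `machinelearning` do appear in later outputs. The function name `feature_construct` is likewise only illustrative, and the result is sorted purely to make it deterministic.

```python
# Hypothetical skill -> replacement mapping (assumed for illustration).
skills_rep = {
    "c++": "cplusplus",
    "machine learning": "machinelearning",
    "visualization": "visualization",
    "visualizations": "visualization",
    "python": "python",
}

def feature_construct(text, skills_rep):
    """Return the distinct replacement tokens for wanted unigrams and bigrams."""
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    found = set(skills_rep[g] for g in tokens + bigrams if g in skills_rep)
    return " ".join(sorted(found))  # sorted only for a reproducible order

result = feature_construct(
    "python c++ machine learning visualizations visualization", skills_rep
)
```

Both spellings of "visualization" collapse into one token, and each skill appears at most once per text, so later counts reflect the number of postings mentioning a skill rather than the number of mentions.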

A walk-through example is provided below, based on the processed content from the web crawling in section two. As you may notice, some words with punctuation (such as q- and shell-based) cannot be reduced to the correct feature form (q, shell). This is because some punctuation has to be retained for skills such as c/c++ or ab-testing. Only a few skills are missed this way, so this is not a big concern for the project.

In [22]:
print "###################### Original text: #################\n", processed_content
print "######################## New text: #################### \n", featureConstruct(processed_content)

###################### Original text: #################
strategic thinking committee help long-term growth efficient day-to-day functioning serve well-being shareholder client employee morgan stanley strats modeling mssm provides revenue-generating centered analytics embedded desk strategist strat team reside trading equity income commodity well banking investment banking market modeling mssm team core analytics core trading quantitative solution multiple self-motivated innovative hard-working handle changing priority multiple fashion quant developer modeler build tool query clean analyze raw data filtering database develop model pricing hedging risk securitized product product q- shell-based script update database model kdb+ database designing fitting debugging sophisticated econometric mortgage model team-oriented significant growth advanced engineering mathematics statistic quantitative programming programming language q r c/c++ java python linux using unix command-line tool quantit

In [23]:
processor = lambda x: featureConstruct(x);
dataJobs[['parsedJd']]= dataJobs.copy()[["jd"]].applymap(processor);
dataJobs = dataJobs.drop(dataJobs[dataJobs["parsedJd"] == ""].index).reset_index(drop = True)
dataJobs.to_csv("dataJobs_v3_featured.csv")
dataJobs[["parsedJd"]].head()

| | parsedJd |
|---|---|
| 0 | largedata machinelearning datamining python ma... |
| 1 | masters infrastructure largedata machinelearni... |
| 2 | nlp machinelearning unsupervised ai ml cpluspl... |
| 3 | financial algorithm problemsolving communicati... |
| 4 | clustering communication nlp unsupervised |


## PART IV. Job Requirement Analysis

The final part of the project conducts some analysis on the prepared data. To recap, in section 1 we obtained the job market size and industry trends, and in section 3 we obtained the key skills that distinguish data scientists from other data jobs. The remaining questions are:

* Degree requirement for data scientist and other data jobs
* Top skills required as data scientist
* Types of data jobs applicable under different titles

The first two questions will be answered with Counters; the last question will be explored with LDA.

In [24]:
### Construct Dictionary, query as key and Counter of the query as value
counterByQuery = {}

#data means data jobs to explore
data = ['data+scientist', 'machine+learning', 'data+engineer', 'data+analyst']
for i in data:
    #construct pdSeries parsedJd with subset of dataJobs dataframe by its query;
    parsedJd = dataJobs[dataJobs["query"] == i]["parsedJd"] 
    
    #construct skill lists
    parsedSkills = []
    for j in parsedJd:
        parsedSkills.extend(j.split())
    
    #construct counter
    counterByQuery[i] = Counter(parsedSkills)
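
The grouping logic of this cell can be sketched without pandas. The rows below are made up and stand in for the `dataJobs` dataframe; the variable names are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Made-up (query, parsedJd) rows standing in for the dataJobs dataframe.
rows = [
    ("data+scientist", "python phd machinelearning"),
    ("data+scientist", "python sas"),
    ("data+engineer", "hadoop python"),
]

# One Counter per query; each posting contributes its skills once.
counter_by_query = defaultdict(Counter)
for query, parsed_jd in rows:
    counter_by_query[query].update(parsed_jd.split())
```

Because `featureConstruct` de-duplicates skills within a posting, a count such as `counter_by_query["data+scientist"]["python"]` can be read as the number of data scientist postings that mention Python.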

### 4.1 Degree Requirement for Data Jobs

**<font color = 'red'> Analysis:</font>**

The output below shows a strong demand for PhDs in both data scientist and machine learning jobs, and a smaller demand for master's degrees. For data scientists, quantitative backgrounds are preferred, while for machine learning a CS background is emphasized. Few data jobs ask for Kaggle experience.

In [25]:
def querySkillsFreq(data, skilllists):
    print data
    print "#############################################################"
    
    
    for j in skilllists:
        print j, "\t", 
        for i in data:
            print  counterByQuery[i][j],"\t",
        print "\n"


degrees = ['bachelor', 'masters','phd','computerscience','mathematics','quantitative','kaggle']
querySkillsFreq(data, degrees)

['data+scientist', 'machine+learning', 'data+engineer', 'data+analyst']
#############################################################
bachelor 	24 	7 	20 	25 	

masters 	114 	38 	31 	30 	

phd 	236 	161 	28 	8 	

computerscience 	118 	123 	251 	80 	

mathematics 	78 	38 	38 	26 	

quantitative 	344 	86 	78 	192 	

kaggle 	8 	2 	4 	0 	
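
The raw counts above are not directly comparable across titles, since the number of postings per title differs. As a rough worked example, the counts can be turned into shares. This sketch assumes that the number of postings per query can be recovered by summing the "Dominating Topic" counts printed in section 4.3, and that each posting mentions a degree at most once (which follows from the de-duplication in `featureConstruct`).

```python
# Postings per query, recovered by summing the dominating-topic counts
# from section 4.3 (an assumption about how to obtain the totals).
postings = {
    "data+scientist": 465 + 361 + 98 + 61 + 26,
    "machine+learning": 349 + 104 + 81 + 16 + 11,
    "data+engineer": 701 + 114 + 69 + 58 + 53,
    "data+analyst": 352 + 252 + 242 + 78 + 52,
}

# Raw counts copied from the table above.
degree_counts = {
    "masters": {"data+scientist": 114, "machine+learning": 38,
                "data+engineer": 31, "data+analyst": 30},
    "phd": {"data+scientist": 236, "machine+learning": 161,
            "data+engineer": 28, "data+analyst": 8},
}

# Share of postings per title that mention each degree.
shares = {
    degree: {q: n / float(postings[q]) for q, n in by_query.items()}
    for degree, by_query in degree_counts.items()
}
```

Under these assumptions, a PhD is mentioned in roughly 23% of data scientist postings and roughly 29% of machine learning postings, so the demand is relatively stronger for machine learning even though the raw count is lower.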



### 4.2 Skillset Requirement for Data Scientist Jobs

This part outputs top 10 skills for each category from data scientist entries.

In [26]:
machine_learning = ['nlp','anomalydetection','predictive','linear','genomics', 'advancedml','learningalgorithm',
                    'forecasting', 'classification','clustering','fraud','tuning','unsupervised', 'tree', 'cluster', 'optimization', 'deeplearning',
                    'modelling', 'ai','artificialintelligence', 'survivalanalysis', 'testing','multivariate', 'linear','hypothesistesting','simulation']
programming = ['python', 'scipy','scala','r','corcplusplus','javascript','matplotlib','c', 'weka','sas','d3', 'cplusplus','java','perl', 'scikit',
               'pandas','shell','weka']
database = ['nosql',  'sqlserver','mysql','hbase','ssrs','redshift','cassandra','mongodb','oracle']
big_data = ['hadoop','apache','kafka','hbase', 'hive','mapreduce','mahout','spark','julia', 'flume','zookeeper', 'pig']

def querySkillsFreq(skilllists):
    
    c_skills = {}
    for j in skilllists:
        c_skills[j] = counterByQuery["data+scientist"][j]
    return Counter(c_skills).most_common(10)

In [27]:
print "machine learning", "\n", querySkillsFreq(machine_learning), "\n"
print "programming", "\n",querySkillsFreq(programming), "\n"
print "big data", "\n", querySkillsFreq(big_data), "\n"
print "database", "\n",querySkillsFreq(database)

machine learning 
[('predictive', 334), ('nlp', 154), ('optimization', 150), ('testing', 127), ('multivariate', 68), ('deeplearning', 63), ('classification', 56), ('cluster', 54), ('linear', 51), ('unsupervised', 43)] 

programming 
[('python', 282), ('r', 220), ('sas', 107), ('java', 81), ('scala', 38), ('cplusplus', 30), ('corcplusplus', 19), ('shell', 18), ('c', 17), ('javascript', 13)] 

big data 
[('hadoop', 152), ('spark', 101), ('hive', 50), ('apache', 35), ('pig', 21), ('mapreduce', 19), ('hbase', 17), ('kafka', 7), ('mahout', 6), ('julia', 6)] 

database 
[('nosql', 63), ('oracle', 33), ('hbase', 17), ('sqlserver', 14), ('redshift', 11), ('mongodb', 11), ('mysql', 10), ('cassandra', 8), ('ssrs', 2)]


### 4.3 Skillset and Data Jobs

The skills are correlated with one another: some relate to big data, some to databases, and some to machine learning. In other words, there are "topics" within these skills. For example, the combination of machine learning and big data could be called "scalable machine learning", and the combination of database and visualization "data retrieval and visualization". The purpose of this analysis is to identify the major topics across all of these job titles and to find, for each data title:

* What are the dominant topics that people with this title mainly deal with?
* How is their time and energy distributed across the topics?

The analysis is carried out with LDA and some visualization.

In [None]:
#code reference https://pypi.python.org/pypi/lda 
dtm = sklearn.feature_extraction.text.CountVectorizer()  #build counter vectorizer
dtm_fit = dtm.fit_transform(dataJobs['parsedJd'])        #perform fit transformation

k = 5    

model = lda.LDA(n_topics=k, n_iter=150, random_state=0) # create model
dtm_tf = model.fit_transform(dtm_fit)                   # train model

Note that the topic numbering in the LDA visualization is not consistent with the numbering in the model output (most likely because pyLDAvis orders topics by prevalence by default, rather than because of a bug). Matching the two numberings therefore involves a manual step: comparing the top words of each topic.

In [29]:
topic_word = model.topic_word_                           # get words that have a high probability in a given topic
vocab = dtm.get_feature_names()                          # feature names

w_t = 10                                                # the number of words to display from each topic
print "number of topics:",k
print "number of words to display per topic:",w_t, "\n"

#print out topics with the words that compose the topics
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(w_t+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

number of topics: 5
number of words to display per topic: 10 

Topic 0: statistical quantitative programming communication predictive visualization machinelearning statistics dataanalysis sas
Topic 1: bigdata hadoop database programming computerscience code architecture cloud java nosql
Topic 2: machinelearning programming statistical communication python datascience phd bigdata nlp computerscience
Topic 3: communication microsoft testing projectmanagement financial interpersonal writtencommunication businessprocess problemsolving agile
Topic 4: marketing communication microsoft database recommendations optimization socialmedia ad financial dataanalysis


By comparing topic words, topics 1-5 in the visualization correspond to topics 1, 0, 4, 2, 3 of the model output, respectively.
We will identify the dominating topics and the topic distributions using the new index (the numbering used in the visualization, not the one printed by the previous cell).
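
The re-indexing described here can be followed on a toy example. The document-topic matrix below is made up; only `new_index` is taken from the mapping stated above. Plain lists are used instead of NumPy so that the sketch stays self-contained.

```python
# Made-up document-topic matrix: 3 documents x 5 model topics, rows sum to 1.
doc_topic = [
    [0.6, 0.1, 0.1, 0.1, 0.1],   # dominated by model topic 0
    [0.1, 0.5, 0.2, 0.1, 0.1],   # dominated by model topic 1
    [0.1, 0.1, 0.1, 0.1, 0.6],   # dominated by model topic 4
]

new_index = [1, 0, 4, 2, 3]  # visualization topic k+1 <- model topic new_index[k]

# Reorder the columns so that column k holds visualization topic k+1.
reordered = [[row[j] for j in new_index] for row in doc_topic]

# Dominating topic per document, numbered 1-5 as in the visualization.
dominating = [row.index(max(row)) + 1 for row in reordered]
```

A document dominated by model topic 0 ends up under visualization topic 2, one dominated by model topic 1 under visualization topic 1, and one dominated by model topic 4 under visualization topic 3.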

In [31]:
new_index = [1, 0, 4, 2, 3]                              

topics = model.doc_topic_[:,new_index] #revise the indexing
for i in data:
    index= dataJobs[dataJobs["query"] == i].index.tolist()       #create subset index
    distributions = np.mean(topics[index,:], axis = 0)           #compute mean distribution
    dominated = (np.argmax(topics[index,:],axis = 1)+1).tolist() #compute dominating topics
    count_topic = Counter(dominated)                             #construct counter for dominating topics
    print  i
    print  "Dominating Topic"
    print  count_topic.most_common() 
    print  "Distributions of Topics from 1-5"
    print [round(d,4) for d in distributions ]  #use d, not i, so the loop variable is not overwritten in Python 2
    print 

data+scientist
Dominating Topic
[(2L, 465), (4L, 361), (1L, 98), (3L, 61), (5L, 26)]
Distributions of Topics from 1-5
[0.1129, 0.4141, 0.0825, 0.3304, 0.0602]

machine+learning
Dominating Topic
[(4L, 349), (2L, 104), (1L, 81), (5L, 16), (3L, 11)]
Distributions of Topics from 1-5
[0.145, 0.1857, 0.0494, 0.5545, 0.0654]

data+engineer
Dominating Topic
[(1L, 701), (4L, 114), (2L, 69), (5L, 58), (3L, 53)]
Distributions of Topics from 1-5
[0.6114, 0.0871, 0.0757, 0.1383, 0.0876]

data+analyst
Dominating Topic
[(3L, 352), (2L, 252), (5L, 242), (1L, 78), (4L, 52)]
Distributions of Topics from 1-5
[0.0968, 0.2393, 0.3377, 0.0784, 0.2478]
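
The dominating-topic counts are easier to compare once they are converted into fractions of each title's postings. The snippet below is a small sketch that simply re-enters the counts printed above; the helper name `topic_shares` is an assumption for illustration.

```python
# Dominating-topic counts copied from the output above (visualization numbering).
dominating_counts = {
    "data+scientist": {2: 465, 4: 361, 1: 98, 3: 61, 5: 26},
    "machine+learning": {4: 349, 2: 104, 1: 81, 5: 16, 3: 11},
    "data+engineer": {1: 701, 4: 114, 2: 69, 5: 58, 3: 53},
    "data+analyst": {3: 352, 2: 252, 5: 242, 1: 78, 4: 52},
}

def topic_shares(counts):
    """Convert raw dominating-topic counts into fractions of a title's postings."""
    total = float(sum(counts.values()))
    return dict((topic, n / total) for topic, n in counts.items())

shares = dict((title, topic_shares(c)) for title, c in dominating_counts.items())
```

For instance, `shares["data+scientist"][2]` comes out at about 0.46 and `shares["data+engineer"][1]` at about 0.70.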



In [30]:
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(model, dtm_fit, dtm)

**<font color = 'red'>Analysis:</font>**

* **Topic 1: The Big Data Guy**

This guy is an expert in big data. He probably has a CS background and has mastered Java and Python. He knows everything from big data architecture and infrastructure, cloud computing and data warehousing to data processing. He might know a bit about machine learning and data visualization, but these are not as strongly required as the previously mentioned skills.

He is most likely titled with "data engineer". 

* **Topic 2: The Analyst**

The key words for the analyst are statistics, analysis and communication. She is from a quantitative discipline, with a master's or PhD degree and well-rounded skills in programming, databases, statistics, machine learning, visualization and some domain knowledge such as marketing or finance. She has mastered SAS and Python, and knows how to make recommendations based on the data.

Going by the dominating-topic counts above, about 46% of data scientist postings fall into the analyst role, as do about 26% of data analyst postings and about 19% of machine learning postings.

* **Topic 3: The Business Analyst**

The business analyst is a bit like the analyst, with an emphasis on database and business applications, rather than statistics and programming. He knows digital marketing, social media, google analytics, project management, and has strong interpersonal skills. He cares less about programming, instead, he uses data analysis or visualization software such as tableau and excel, and makes good PPT.

He is titled with "data analyst".

* **Topic 4: The Data Modeler**

The data modeler is an expert in machine learning and statistics. He most likely has a CS or math background and a PhD degree, and he uses Java, Python or C++. He has mastered NLP, deep learning, optimization and predictive analysis, and he knows cutting-edge theories.

He is most likely titled with "machine learner". About 40% of data scientists are in this role as well.


## PART V: Conclusion

This project explores data scientist job market and job requirements. 

**In section 1**, by querying Indeed API, we found that 
* The total number of data scientist jobs is 2492. Though this is small compared with software engineer jobs (20746), other data jobs make up the difference: machine learners and data analysts are in strong demand in the job market.
* The top 3 states for data scientist jobs are CA, NY and VA.
* Top industries that recruit data scientists are consulting, technology, Internet, eCommerce, healthcare, finance and Government services.

**In section 2**, we tackled the problem of fictitious closing tags used as an anti-web-crawling measure.

**In section 3**, while preparing the features for further analysis, we found the key skills that distinguish data scientists from other data jobs.
* Data Scientist and Data Engineer: 
    Top Data Scientist Features include sci, sas, statistics, phd, regression, pandas, fraud, bioinformatics, automation, ml, d3, learning models, mining, classification;
    Top Data Engineer Features include Configuration, scale data, transition, kafka, data visualization, metadata, pipeline, maintenance, distributed, apache, stack, design implement, hadoop, warehousing, scala.

* Data Scientist and Machine Learner:
    Top Data Scientist Features include pig, math statistics, phd, hypothesis, curiosity, spark, hadoop, semantic, ruby, hive, learning deep, matplotlib, python, java;
    Top Machine Learner Features include physics applied, vector machines, quantitative statistics, architecture, sensor, intelligent, econometrics, torch, perl, data structures, matlab.

* Data Scientist and Data Analyst:
    Top Data Scientist Features include predictive analytics, qlik, algorithms, python, applied, ml, visualizations, cause analysis, spark, aws, matlab, gis, sas, big data, nosql;
    Top Data Analyst Features include pivot, modelling machine, javascript, genomics, powerpoint, cloud, vba, dashboards, tableau.

**In section 4**, we explored degree requirements, skill requirements and different data job topics. We found:

* A strong demand for PhDs, followed by master's degrees, for data scientists and machine learners.
* Python, R and SAS as the most in-demand data scientist programming skills; hadoop, spark and hive as the most in-demand big data skills; nosql, oracle and hbase as the most in-demand database skills.
* 4 types of data topics - the big data guy, the analyst, the business analyst and the data modeler - and their corresponding job titles.


The main techniques employed include web crawling, data processing, L1-regularized (lasso) logistic regression, and LDA.