
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [2]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy and full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [44]:
# BeautifulSoup and pandas were already imported in the first cell.
df = pd.read_csv("data/job_listings.csv")

print(df.shape)
df.head()

(426, 3)


| | Unnamed: 0 | description | title |
|---|---|---|---|
| 0 | 0 | `b"<div><div>Job Requirements:</div><ul><li><p>...` | Data scientist |
| 1 | 1 | `b'<div>Job Description<br/>\n<br/>\n<p>As a Da...` | Data Scientist I |
| 2 | 2 | `b'<div><p>As a Data Scientist you will be work...` | Data Scientist - Entry Level |
| 3 | 3 | `b'<div class="jobsearch-JobMetadataHeader icl-...` | Data Scientist |
| 4 | 4 | `b'<ul><li>Location: USA \xe2\x80\x93 multiple ...` | Data Scientist |


In [45]:
# Drop the column that only duplicates the row index.
df = df.drop(columns=["Unnamed: 0"])

print(df.shape)
df.head()

(426, 2)


| | description | title |
|---|---|---|
| 0 | `b"<div><div>Job Requirements:</div><ul><li><p>...` | Data scientist |
| 1 | `b'<div>Job Description<br/>\n<br/>\n<p>As a Da...` | Data Scientist I |
| 2 | `b'<div><p>As a Data Scientist you will be work...` | Data Scientist - Entry Level |
| 3 | `b'<div class="jobsearch-JobMetadataHeader icl-...` | Data Scientist |
| 4 | `b'<ul><li>Location: USA \xe2\x80\x93 multiple ...` | Data Scientist |


In [46]:
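description_text = []

# Strip the HTML tags from each listing and keep only the visible text.
# Naming "html.parser" explicitly avoids depending on whichever parser
# happens to be installed (features="html" picks the best one available).
for row in df["description"]:
    soup = BeautifulSoup(row, features="html.parser")
    text = soup.get_text()
    description_text.append(text)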

print(len(description_text))

426
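
The `description` values shown above are the text form of Python bytes literals: they start with `b'` or `b"` and contain escape sequences such as `\n` and `\xe2\x80\x93`. `get_text()` removes the tags but leaves that prefix and those escapes in place. The sketch below shows one possible way to decode such a value before parsing it. It assumes every value is a valid bytes literal encoded as UTF-8, and `decode_bytes_literal` is a hypothetical helper name, not part of the assignment or of any library.

```python
import ast


def decode_bytes_literal(raw: str) -> str:
    """Turn a string such as "b'caf\\xc3\\xa9'" into the text it encodes.

    Hypothetical helper: assumes `raw` is the repr of a UTF-8 bytes object.
    Anything that does not parse as a bytes literal is returned unchanged.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return raw


# A made-up sample in the same shape as the description column.
sample = r"b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li></ul>'"
decoded = decode_bytes_literal(sample)
print(decoded)
```

The decoded string could then be passed to `BeautifulSoup(decoded, "html.parser").get_text()` in place of the raw value.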


## 2) Use spaCy to tokenize the listings

In [48]:
nlp = spacy.load("en_core_web_lg")

# Adapted from the lecture notes.
# Number-like tokens are dropped (token.like_num) because soup.get_text()
# also pulls out numeric data, which is not useful for this analysis.

def tokenize(document):
    """Return lowercased lemmas, skipping stop words, punctuation, and numbers."""
    doc = nlp(document)
    return [
        token.lemma_.lower().strip()
        for token in doc
        if not token.is_stop and not token.is_punct and not token.like_num
    ]

In [49]:
tokenized_description = []

for entry in description_text:
    tokenized_entry = tokenize(entry)
    tokenized_description.append(tokenized_entry)
    
print(len(tokenized_description))

426


In [50]:
df["tokenized_description"] = tokenized_description
df = df[["title", "description", "tokenized_description"]]

print(df.shape)
df.head()

(426, 3)


| | title | description | tokenized_description |
|---|---|---|---|
| 0 | Data scientist | `b"<div><div>Job Requirements:</div><ul><li><p>...` | `[b"job, requirements:\nconceptual, understandi...` |
| 1 | Data Scientist I | `b'<div>Job Description<br/>\n<br/>\n<p>As a Da...` | `[b'job, description\n\na, data, scientist, hel...` |
| 2 | Data Scientist - Entry Level | `b'<div><p>As a Data Scientist you will be work...` | `[b'as, data, scientist, work, consult, busines...` |
| 3 | Data Scientist | `b'<div class="jobsearch-JobMetadataHeader icl-...` | `[b'$4,969, $, monthcontractunder, general, sup...` |
| 4 | Data Scientist | `b'<ul><li>Location: USA \xe2\x80\x93 multiple ...` | `[b'location, usa, \xe2\x80\x93, multiple, loca...` |


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [56]:
df["description_text"] = description_text
df = df[["title", "description", "description_text", "tokenized_description"]]

print(df.shape)
df.head()

(426, 4)


| | title | description | description_text | tokenized_description |
|---|---|---|---|---|
| 0 | Data scientist | `b"<div><div>Job Requirements:</div><ul><li><p>...` | `b"Job Requirements:\nConceptual understanding ...` | `[b"job, requirements:\nconceptual, understandi...` |
| 1 | Data Scientist I | `b'<div>Job Description<br/>\n<br/>\n<p>As a Da...` | `b'Job Description\n\nAs a Data Scientist 1, yo...` | `[b'job, description\n\na, data, scientist, hel...` |
| 2 | Data Scientist - Entry Level | `b'<div><p>As a Data Scientist you will be work...` | `b'As a Data Scientist you will be working on c...` | `[b'as, data, scientist, work, consult, busines...` |
| 3 | Data Scientist | `b'<div class="jobsearch-JobMetadataHeader icl-...` | `b'$4,969 - $6,756 a monthContractUnder the gen...` | `[b'$4,969, $, monthcontractunder, general, sup...` |
| 4 | Data Scientist | `b'<ul><li>Location: USA \xe2\x80\x93 multiple ...` | `b'Location: USA \xe2\x80\x93 multiple location...` | `[b'location, usa, \xe2\x80\x93, multiple, loca...` |


In [72]:
vectorizer = CountVectorizer(stop_words='english', max_features=1000)
dtm = vectorizer.fit_transform(df["description_text"])

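# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement (available since 1.0).
dtm = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())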
dtm.head()

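| | 000 | 10 | 100 | 2019 | 40 | abilities | ability | able | academic | access | ... | xa6 | xae | xb7 | xbb | xc2 | xe2 | xef | year | years | york |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 2 | 0 | 0 | 0 | 0 | 8 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |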


## 4) Visualize the most common word counts

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")

## 6) Create a NearestNeighbors model. Write a description of your ideal data science job and use it to query your job listings.

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")
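
This task is also still a placeholder. The sketch below shows one possible shape for the query step: fit `NearestNeighbors` on TF-IDF vectors, transform a free-text description with the same vectorizer, and ask for the closest listings. The `listings` and `ideal_job` strings are made up, and the choice of cosine distance is an assumption for this example rather than a requirement of the assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the cleaned job descriptions; these strings are made up.
listings = [
    "build machine learning models in python",
    "create dashboards and reports in excel",
    "manage cloud infrastructure and deployment pipelines",
]

tfidf = TfidfVectorizer(stop_words="english")
listing_vectors = tfidf.fit_transform(listings)

# Cosine distance compares word proportions rather than document length.
nn = NearestNeighbors(n_neighbors=2, metric="cosine")
nn.fit(listing_vectors)

# The query must go through the same fitted vectorizer as the listings.
ideal_job = ["I want to train machine learning models with python"]
query_vector = tfidf.transform(ideal_job)

distances, indices = nn.kneighbors(query_vector)
for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {listings[idx]}")
```

Words in the query that never appeared in the listings are ignored by `transform`, so a query with no overlapping vocabulary would be equally far from every listing.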

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
   - **Hint:** K-means might not be the best algorithm for this. Do a little research to see what might work well. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :)
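
For the clustering stretch goal, the sketch below shows one alternative to Euclidean K-means: hierarchical clustering over cosine distances between TF-IDF vectors. It is only one option among several worth researching, not a recommendation from the assignment. It uses SciPy, which the notebook does not otherwise import, and the `docs` strings and the choice of two clusters are made up for the example.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus with two obvious themes; these strings are made up.
docs = [
    "python machine learning models",
    "machine learning python pipelines",
    "excel dashboards reports",
    "reports dashboards tableau excel",
]

# linkage() expects a dense array with one row per document.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Average-linkage hierarchical clustering on pairwise cosine distances.
tree = linkage(vectors, method="average", metric="cosine")

# Cut the tree so that at most two clusters remain.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Cosine distance requires every document to have at least one non-zero term, so empty descriptions would need to be dropped first.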