
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the description column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.
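As a starting point, here is a minimal sketch of stripping tags with BeautifulSoup. The `raw_html` string is a made-up example shaped like the descriptions, not a row from the dataset. Passing `separator=" "` to `get_text` is one way to keep words from adjacent tags apart, so that text like "Job Requirements:Conceptual" is not fused together.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like a raw job description (not taken from the CSV)
raw_html = (
    "<div><p>As a <b>Data Scientist</b> you will build models.</p>"
    "<ul><li>Python</li><li>SQL</li></ul></div>"
)

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(raw_html, "html.parser")

# separator=" " inserts a space between text fragments from different tags;
# strip=True trims whitespace around each fragment
plain_text = soup.get_text(separator=" ", strip=True)
print(plain_text)
```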

In [40]:
# Load the job listings and drop the index column that was saved with the CSV
df = pd.read_csv('data/job_listings.csv')
df = df.drop(columns=['Unnamed: 0'])
df.head()

Unnamed: 0,description,title
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [41]:
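def clean(text):
    """Strip HTML tags and bytes-literal residue from a raw description."""
    # Name the parser explicitly so results do not depend on what is installed
    soup = BeautifulSoup(text, 'html.parser')
    # Each description was saved as the repr of a bytes object; drop the leading b' or b"
    x = soup.get_text()[2:]
    # Remove literal "\n" sequences and escaped bytes such as "\xe2"
    x = re.sub(r'\\n', '', x)
    x = re.sub(r'\\x..', '', x)
    return x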

In [42]:
df.description = df['description'].apply(clean)

In [43]:
df.head()

Unnamed: 0,description,title
0,Job Requirements:Conceptual understanding in M...,Data scientist
1,"Job DescriptionAs a Data Scientist 1, you will...",Data Scientist I
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level
3,"$4,969 - $6,756 a monthContractUnder the gener...",Data Scientist
4,Location: USA multiple locations2+ years of A...,Data Scientist


## 2) Use spaCy to tokenize the listings

In [47]:
from tqdm import tqdm
tqdm.pandas()
nlp = spacy.load("en_core_web_lg")

In [55]:
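def get_lemmas(text):
    """Return lemmas for tokens that are not stop words, punctuation, or pronouns."""
    # Keep only ASCII letters and spaces; digits and punctuation are dropped
    text = re.sub('[^a-zA-Z ]', '', text)
    doc = nlp(text)

    lemmas = []
    for token in doc:
        if not token.is_stop and not token.is_punct and token.pos_ != 'PRON':
            lemmas.append(token.lemma_)

    return lemmas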

In [56]:
words = df['description'].apply(get_lemmas)
words


0      [job, RequirementsConceptual, understanding, M...
1      [job, DescriptionAs, Data, scientist,  , help,...
2      [Data, scientist, work, consult, business, res...
3      [   , monthContractUnder, general, supervision...
4      [location, USA,  , multiple, location, year, A...
                             ...                        
421    [UsWant, fantastic, fun, startup, s, revolutio...
422    [InternshipAt, uber, ignite, opportunity, set,...
423    [   , yearA, million, people, year, die, car, ...
424    [senior, data, SCIENTISTJOB, DESCRIPTIONABOUT,...
425    [Cerner, Intelligence, new, innovative, organi...
Name: description, Length: 426, dtype: object

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [59]:
df['lemmas'] = df['description'].progress_apply(get_lemmas)
df.head()

100%|████████████████████████████████████████████████████████████████████████████████| 426/426 [00:18<00:00, 23.54it/s]


Unnamed: 0,description,title,lemmas
0,Job Requirements:Conceptual understanding in M...,Data scientist,"[job, RequirementsConceptual, understanding, M..."
1,"Job DescriptionAs a Data Scientist 1, you will...",Data Scientist I,"[job, DescriptionAs, Data, scientist, , help,..."
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level,"[Data, scientist, work, consult, business, res..."
3,"$4,969 - $6,756 a monthContractUnder the gener...",Data Scientist,"[ , monthContractUnder, general, supervision..."
4,Location: USA multiple locations2+ years of A...,Data Scientist,"[location, USA, , multiple, location, year, A..."


In [61]:
# Word count
from collections import Counter
# `Counter` takes an iterable, but you can also instantiate an empty one and update it.
word_counts = Counter()

# Update it with the lemmas from each document
for lemmas in df['lemmas']:
    word_counts.update(lemmas)

# Print out the 10 most common words
word_counts.most_common(10)

[('datum', 2611),
 (' ', 1663),
 ('work', 1359),
 ('team', 1188),
 ('experience', 1135),
 ('business', 1081),
 ('model', 805),
 ('Data', 760),
 ('product', 742),
 ('data', 692)]

In [77]:
vect = CountVectorizer(max_features=500,
                       tokenizer=get_lemmas
#                        ngram_range=(1,2)
#                        max_df=.97,
#                        min_df=3,
#                        tokenizer=tokenize
                      )

# Create a vocabulary and get word counts per document
dtm = vect.fit_transform(df.description)  # Similar to fit_predict

# Get feature names to use as dataframe column headers
# (get_feature_names_out() needs scikit-learn 1.0+; older releases use get_feature_names())
dtm = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())

# View Feature Matrix as DataFrame
dtm.head()

Unnamed: 0,Unnamed: 1,ab,ability,able,access,accommodation,achieve,action,actionable,activity,...,want,way,well,wide,will,work,workplace,world,write,year
0,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,4,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,4,0,1,2,1
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,2,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


## 4) Visualize the most common word counts

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")
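
One possible approach to this task, sketched on toy data: count lemmas with `Counter`, take the most common ones, and draw a horizontal bar chart. The `sample_lemmas` list is a hypothetical stand-in for the `df['lemmas']` column built earlier, and the chart styling is only a suggestion.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical stand-in for df['lemmas']: one list of lemmas per listing
sample_lemmas = [
    ["data", "team", "model", "data"],
    ["data", "experience", "team"],
    ["model", "data", "business"],
]

# Accumulate counts across all documents
word_counts = Counter()
for lemmas in sample_lemmas:
    word_counts.update(lemmas)

# Keep the top 3 here; with the real data something like 10 to 20 may read better
top_words = word_counts.most_common(3)
labels = [word for word, _ in top_words]
counts = [count for _, count in top_words]

# Reverse the order so the most frequent word is drawn at the top of the chart
fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(labels[::-1], counts[::-1])
ax.set_xlabel("Count")
ax.set_title("Most common lemmas")
fig.tight_layout()
```

With the real data, replacing `sample_lemmas` with `df['lemmas']` should be enough, assuming that column holds lists of strings.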

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")

## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")
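
The query step might be sketched as follows. The listings and the `ideal_job` text are invented for illustration, and cosine distance is one reasonable choice for TF-IDF vectors rather than a requirement of the assignment. The key point is that the query must be transformed with the same fitted vectorizer as the listings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for df['description']
sample_listings = [
    "Build machine learning models in Python and deploy them to production",
    "Create dashboards and reports in Tableau for business stakeholders",
    "Research deep learning methods for natural language processing",
    "Manage SQL databases and write ETL pipelines",
]

# Hypothetical description of an ideal job
ideal_job = "I want to research natural language processing with deep learning"

# Fit the vectorizer on the listings only
tfidf = TfidfVectorizer(stop_words="english")
listing_vectors = tfidf.fit_transform(sample_listings)

# Cosine distance compares direction rather than magnitude of the vectors
nn = NearestNeighbors(n_neighbors=2, metric="cosine")
nn.fit(listing_vectors)

# Reuse the fitted vocabulary for the query; do not call fit again
query_vector = tfidf.transform([ideal_job])
distances, indices = nn.kneighbors(query_vector)

best_match = sample_listings[indices[0][0]]
print(best_match)
```

On the real data, `indices[0]` can be passed to `df.iloc` to pull up the matching rows, assuming the vectorizer was fitted on the DataFrame in its current row order.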

## Stretch Goals

 - Try different visualizations for words and frequencies. What story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify the experience requirements for specific technologies that the job listings ask for. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
   - **Hint:** K-means might not be the best algorithm for this. Do a little research to see what might work well. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset: which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :)
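
For the clustering stretch goal, one option among several is to reduce the TF-IDF matrix to a few dimensions with truncated SVD (latent semantic analysis), re-normalize, and only then cluster. The sketch below uses four made-up documents, and the choices of `n_components=2` and `n_clusters=2` are assumptions that fit this toy data, not recommendations for the real listings.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Hypothetical documents: two about deep learning, two about reporting
sample_docs = [
    "deep learning neural network research",
    "neural network deep learning models",
    "sql dashboard reporting business",
    "business reporting sql tableau",
]

# High-dimensional sparse TF-IDF vectors
tfidf_matrix = TfidfVectorizer().fit_transform(sample_docs)

# Project onto a small number of latent dimensions to ease the
# high-dimensionality problem mentioned in the hint
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf_matrix)

# Re-normalize so Euclidean distance between rows tracks cosine similarity
reduced = Normalizer().fit_transform(reduced)

# Cluster in the reduced space
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)
print(labels)
```

Other approaches worth researching include hierarchical clustering with a cosine metric or topic models. Which works best on the real listings is an open question.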