<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [47]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text data in the description column is still messy and full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task. 

`Tip:` You will need to install the `bs4` library inside your conda environment. 

In [48]:
from bs4 import BeautifulSoup
import requests


##### My Code Goes Here #####

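with open( './data/job_listings.csv') as nn:
    # Keep the BeautifulSoup object (not the get_text() string) so .prettify() is available;
    # from_encoding only applies to bytes input, so it is dropped for this text-mode file
    borscht = BeautifulSoup( nn, "html.parser")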

In [None]:
print( borscht.prettify())

#### Hmm. That's... messy....

In [76]:
df = pd.read_csv( './data/job_listings.csv')

In [33]:
print( df.shape)
df.head()

(426, 3)


| | Unnamed: 0 | description | title |
|---|---|---|---|
| 0 | 0 | `b"<div><div>Job Requirements:</div><ul><li><p>...` | Data scientist |
| 1 | 1 | `b'<div>Job Description<br/>\n<br/>\n<p>As a Da...` | Data Scientist I |
| 2 | 2 | `b'<div><p>As a Data Scientist you will be work...` | Data Scientist - Entry Level |
| 3 | 3 | `b'<div class="jobsearch-JobMetadataHeader icl-...` | Data Scientist |
| 4 | 4 | `b'<ul><li>Location: USA \xe2\x80\x93 multiple ...` | Data Scientist |


In [45]:
# checking number of observations of literal "Data Scientist" in title (what we were instructed to check for)
df[ df[ 'title'].str.contains( "Data Scientist")].shape

(399, 3)

In [77]:
# but let's try being a bit more inclusive, seeing as entry [0] uses a lowercase S (and people aren't always perfect)
df = df[ df[ 'title'].str.lower().str.contains( "data scientist")]

print( df.shape)
df.head()

(406, 3)


| | Unnamed: 0 | description | title |
|---|---|---|---|
| 0 | 0 | `b"<div><div>Job Requirements:</div><ul><li><p>...` | Data scientist |
| 1 | 1 | `b'<div>Job Description<br/>\n<br/>\n<p>As a Da...` | Data Scientist I |
| 2 | 2 | `b'<div><p>As a Data Scientist you will be work...` | Data Scientist - Entry Level |
| 3 | 3 | `b'<div class="jobsearch-JobMetadataHeader icl-...` | Data Scientist |
| 4 | 4 | `b'<ul><li>Location: USA \xe2\x80\x93 multiple ...` | Data Scientist |


#### Really? ~Only 27 don't have "Data Scientist" (case-sensitive) in the title?~
#### ...Only 20 don't have "data scientist" (case-insensitive) in the title??

##### Allllrighty then....

In [114]:
df.keys()

Index(['Unnamed: 0', 'description', 'title'], dtype='object')

In [116]:
df[ 'description'][0]

'b"<div><div>Job Requirements:</div><ul><li><p>\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them</p>\\n</li><li><p>Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)</p>\\n</li><li><p>Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R</p>\\n</li><li><p>Ability to communicate Model findings to both Technical and Non-Technical stake holders</p>\\n</li><li><p>Hands on experience in SQL/Hive or similar programming language</p>\\n</li><li><p>Must show past work via GitHub, Kaggle or any other published article</p>\\n</li><li><p>Master\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.</p></li></ul><div><div><div><div><div><d

#### So, BeautifulSoup seems cool, but it's a nightmare for what's needed here. Let's try a little regex stripping instead and see where that gets us first....

In [51]:
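def strip( stuff):
    """Crude cleanup: drop a few escaped byte sequences, then replace tags and backslash escapes with '.'."""

    def stripUnicodeBytes( stuff):
        # Remove three escaped UTF-8 sequences that appear literally in the text:
        # \xc2\xa8 (diaeresis), \xe2\x80\xa6 (ellipsis), \xe2\x80\x99 (right single quote).
        # Any other \x.. sequence is left alone here; stripHTML turns its backslashes into "."
        p = re.compile( r"(\\xc2\\xa8)|(\\xe2\\x80\\xa6)|(\\xe2\\x80\\x99)")
        return p.sub( "", stuff)
        
    def stripHTML( stuff):
        # Replace HTML tags, literal "\n" sequences, and runs of backslashes with "."
        # after the byte-sequence removal above
        p = re.compile( r"(<.*?>)|(\\n)|(\\+)")
        return p.sub( ".", stripUnicodeBytes( stuff))

    return stripHTML( stuff)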

In [229]:
df[ 'description'][9]
strip( df[ 'description'][9])

"b'..Slack is hiring experienced data scientists to join our Lifecycle team. The Lifecycle teams mission is to build product experiences that help companies of all sizes adopt and scale Slack within their organization. These product experiences include team and user onboarding, payments and billing systems, and the marketing and packaging of paid plans.....You will use data to help the team discover opportunities and define appropriate solutions to user problems by performing research into user behavior, defining and applying experimentation standards, and understanding the drivers of business performance.....Slack has a positive, diverse, and supportive culture.xe2.x80.x94we look for people who are curious, inventive, and work to be a little better every single day. In our work together we aim to be smart, humble, hardworking and, above all, collaborative. If this sounds like a good fit for you, why not say hello?.....What you will be doing......Use data to influence the direction of 

In [None]:
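# After the title filter the index labels are no longer 0..n-1, so writing by a running
# counter would hit the wrong rows; apply() cleans every remaining row instead
df[ 'description'] = df[ 'description'].apply( strip)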

In [80]:
df['description'][99]

'b"About the Company....Civis Analytics is a technology and strategic services company that works with organizations seeking to make more efficient, accurate, and data-driven decisions. Civis was founded in 2013 by former members of the analytics team on the 2012 Obama campaign and today works with some of the nation\'s leading progressive political and advocacy organizations, Fortune 500 companies, and major nonprofits......Civis\' objective this year and next is simple:. we want every political client organization to be data-driven and digital first. Here\'s how you\'d help:....First, the Civis platform is the technology of record for hundreds of in-house data scientists and analysts across the progressive movement. We seek to extend the feature set and capabilities to help these users quickly and self-reliantly accomplish their missions through data.....Second, Civis is building new digital technologies to help campaigns make the transition to digital outreach. It\'s 2019 after all.

In [81]:
df.head()

| | Unnamed: 0 | description | title |
|---|---|---|---|
| 0 | 0 | `b"..Job Requirements:.....Conceptual understan...` | Data scientist |
| 1 | 1 | `b'.Job Description.....As a Data Scientist 1, ...` | Data Scientist I |
| 2 | 2 | `b'..As a Data Scientist you will be working on...` | Data Scientist - Entry Level |
| 3 | 3 | `b'...$4,969 - $6,756 a month...Contract...Unde...` | Data Scientist |
| 4 | 4 | `b'..Location: USA .xe2.x80.x93 multiple locati...` | Data Scientist |


#### While the instructions above "recommend" that we use Beautiful Soup to clean the descriptions, I honestly found it too clunky and messy and, overall, needlessly convoluted. Additionally, trying to make sense of its documentation is, seemingly like most documentation in this industry, an exercise in futility... and migraines....

#### So, I made the judgment call, after fussing with soup for ~8+ hours, that using a comparatively simple regex script or two to strip out the HTML tags (easy to do because they're all bookended by <>s) would be much more time- and energy-efficient, both to write and, probably, for third parties to understand.

##### It's almost like beautiful soup was created as a solution in search of a problem....
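The `.xe2.x80.x94` fragments left in the cleaned text above are escaped UTF-8 bytes whose backslashes were replaced with dots. One possible way around that, sketched here with only the standard library, is to treat each description as what it appears to be: the printed form of a Python `bytes` object. This assumes every value in the column is a valid bytes literal, which has not been checked against the full dataset; `clean_description` and the sample string are made up for illustration.

```python
import ast
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything bookended by < and >
WS_RE = re.compile(r"\s+")        # newlines and runs of spaces

def clean_description(raw):
    """Turn a stringified bytes literal of HTML into plain text (hypothetical helper)."""
    # Assumption: raw looks like "b'<div>...\\xe2\\x80\\x94...</div>'".
    # literal_eval rebuilds the bytes object; it raises on malformed input.
    data = ast.literal_eval(raw)
    text = data.decode("utf-8", errors="replace") if isinstance(data, bytes) else str(data)
    text = TAG_RE.sub(" ", text)           # drop tags, keep a space between blocks
    text = html.unescape(text)             # e.g. &amp; becomes &
    return WS_RE.sub(" ", text).strip()    # collapse whitespace

# Toy sample modeled on the listings above, not a real row
sample = r"b'<div><p>Slack has a supportive culture\xe2\x80\x94we look for curious people.</p>\n<p>R&amp;D</p></div>'"
print(clean_description(sample))
```

On the dataframe this could be applied with `df['description'].apply(clean_description)`, in place of the regex-based `strip`.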

## 2) Use spaCy to tokenize the listings 

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")
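One possible starting point for this task, as a sketch rather than the intended solution: `spacy.blank("en")` builds a tokenizer-only English pipeline, so no model download is assumed. The `tokenize` helper and the two sample sentences are illustrative; a full pipeline such as `en_core_web_sm` would be needed for lemmas.

```python
import spacy

# Tokenizer-only English pipeline: no tagger or lemmatizer, but stop-word
# and punctuation flags are still available on each token.
nlp = spacy.blank("en")

def tokenize(text):
    """Lowercased tokens, skipping stop words, punctuation and whitespace."""
    return [
        tok.lower_
        for tok in nlp(text)
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]

# Toy sentences standing in for cleaned descriptions
docs = [
    "Slack is hiring experienced data scientists.",
    "Must show past work via GitHub.",
]
tokens = [tokenize(d) for d in docs]
print(tokens)
```

For the real listings, something like `df['tokens'] = df['description'].apply(tokenize)` would store one token list per row.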

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")

## 4) Visualize the most common word counts

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")
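One way the visualization could look, sketched on toy token lists: count tokens with `collections.Counter` and draw the most frequent ones as a horizontal bar chart. The token lists and the choice of three bars are placeholders.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Toy token lists standing in for tokenized listings
token_lists = [
    ["data", "scientist", "python"],
    ["data", "python", "sql"],
    ["data", "analyst"],
]

word_counts = Counter(tok for toks in token_lists for tok in toks)
top = word_counts.most_common(3)      # list of (word, count), most frequent first
words, freqs = zip(*top)

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(words[::-1], freqs[::-1])     # reversed so the most frequent word is on top
ax.set_xlabel("count")
ax.set_title("Most common tokens")
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, `plt.show()` would be needed.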

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")

## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings. 

In [None]:
##### Your Code Here #####
raise Exception("\nThis task is not complete. \nReplace this line with your code for the task.")
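A hedged sketch of the query step: fit `NearestNeighbors` on a TF-IDF matrix, then transform an "ideal job" description with the same fitted vectorizer and look up its closest listings. The listings, the query text, and the choice of cosine distance are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy listings standing in for the cleaned descriptions
listings = [
    "data scientist building machine learning models in python",
    "data analyst creating sql reports and dashboards",
    "research scientist working on deep learning for computer vision",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(listings)

# Cosine distance is a common choice for sparse text vectors
nn = NearestNeighbors(n_neighbors=2, metric="cosine")
nn.fit(X)

# The query must go through the already fitted vectorizer (transform, not fit)
ideal = ["I want to build machine learning models in python"]
distances, indices = nn.kneighbors(tfidf.transform(ideal))

best = listings[indices[0][0]]
print(best)
```

`indices[0]` lists the row positions of the nearest listings, closest first, and `distances[0]` holds the matching cosine distances.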

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
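The clustering hint above says that Euclidean distance breaks down in high dimensions. A quick way to see the effect, sketched with random uniform points rather than real TF-IDF vectors: as the number of dimensions grows, the nearest and farthest points from a query end up at almost the same distance, so "closest" carries little information. The function name and the sizes used are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Smallest Euclidean distance from one query point divided by the largest."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

low = nearest_to_farthest_ratio(2)       # well below 1: neighbors are distinguishable
high = nearest_to_farthest_ratio(5000)   # close to 1: all points look equally far
print(f"2 dims: {low:.3f}   5000 dims: {high:.3f}")
```

This is one reason approaches based on cosine similarity, or on reducing dimensionality before clustering, are often tried for text data.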