<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

import requests
from bs4 import BeautifulSoup
from gensim.summarization import summarize  # requires gensim < 4.0; the summarization module was removed in 4.0
import textwrap

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the description column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.
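As a minimal sketch of the column-cleaning step described above, the snippet below strips tags from each description with `BeautifulSoup.get_text()`. The two-row DataFrame, the `strip_html` helper, and the `clean_description` column name are hypothetical stand-ins for illustration; only the `description` column name comes from the assignment text.

```python
import pandas as pd
from bs4 import BeautifulSoup


def strip_html(raw):
    """Return the visible text of an HTML fragment with whitespace collapsed.

    Hypothetical helper, not part of the assignment starter code.
    """
    # get_text(separator=" ") keeps words from adjacent tags from running together.
    text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
    # Collapse runs of spaces and newlines into single spaces.
    return " ".join(text.split())


# Hypothetical stand-in for the real job_listings.csv data.
df = pd.DataFrame({
    "description": [
        "<div><p>Build <b>ML</b> models.</p><ul><li>Python</li><li>SQL</li></ul></div>",
        "<p>Analyze data</p><br/><p>Present findings</p>",
    ]
})

# Apply the cleaner row by row and keep the result in a new column.
df["clean_description"] = df["description"].apply(strip_html)
print(df["clean_description"].tolist())
```

With the real file, the same `apply` call would presumably run on the DataFrame returned by `pd.read_csv`, assuming the descriptions are stored as plain HTML strings.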

In [2]:
# Retrieve the raw CSV text; the descriptions inside it still contain HTML.
# `requests` is already imported in the first cell.
url = 'https://raw.githubusercontent.com/techthumb1/DS-Unit-4-Sprint-1-NLP/main/module2-vector-representations/data/job_listings.csv'
page = requests.get(url).text

In [3]:
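# Turn the page into a BeautifulSoup object to access its HTML tags.
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup(page, 'html.parser')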

In [4]:
# Get the text of the first <p> tag, treated here as the headline
headline = soup.find('p').get_text()
print(f'The headline for this article is: \n{headline}')

The headline for this article is: 
\nConceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them


In [5]:
# Get text from all <p> tags.
p_tags = soup.find_all('p')
# Get the text from each of the “p” tags and strip surrounding whitespace.
p_tags_text = [tag.get_text().strip() for tag in p_tags]

In [6]:
# Keep only sentences that contain a period and no newline characters.
sentence_list = [
    sentence for sentence in p_tags_text
    if '\n' not in sentence and '.' in sentence
]

In [7]:
# Combine list items into string.
article = ' '.join(sentence_list)
print(f'This is the original article: \n{textwrap.fill(article[:200], 120)}')

This is the original article: 
Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data
manipulation are mandatory for this role) Master's degree in Statistics/Mathematics/Comput


In [8]:
# Make a summary of the article
summary = summarize(article, ratio=0.3)
print(f'This is the summary of the article: \n{textwrap.fill(summary[:200], 20)}')

This is the summary of the article: 
As a Data Scientist
1, you will help us
build machine
learning models,
data pipelines, and
micro-services to
help our clients
navigate their
healthcare journey.
You would be joining
Spotify on the Pre


## 2) Use spaCy to tokenize the listings

In [9]:
nlp = spacy.load("en_core_web_lg")

In [10]:
doc = nlp(summary)
print([token.lemma_ for token in doc if not token.is_stop and not token.is_punct][:200])

['Data', 'scientist', '1', 'help', 'build', 'machine', 'learning', 'model', 'data', 'pipeline', 'micro', 'service', 'help', 'client', 'navigate', 'healthcare', 'journey', '\n', 'join', 'Spotify', 'Premium', 'Analytics', 'team', 'core', 'business', 'strategy', 'insight', 'team', 'Associate', 'Data', 'Scientist', '\n', 'join', 'Spotify', 'Premium', 'Analytics', 'team', 'core', 'business', 'strategy', 'insight', 'team', 'Associate', 'Data', 'Scientist', '\n', 'unique', 'position', 'work', 'essential', 'shape', 'Spotify', 'able', 'grow', 'data', 'drive', 'recommendation', 'new', 'product', 'offering', 'innovative', 'marketing', 'effort', '\n', 'unique', 'position', 'work', 'essential', 'shape', 'Spotify', 'able', 'grow', 'data', 'drive', 'recommendation', 'new', 'product', 'offering', 'innovative', 'marketing', 'effort', '\n', '\\nYou', 'work', 'global', 'team', 'world', 'class', 'analyst', 'data', 'scientist', 'business', 'manager', 'marketer', 'engineer', '\n', '\\nYou', 'work', 'global'

In [11]:
import os

def gather_data(filefolder):
    """Produce a list of documents from a directory.

    filefolder (str): path to a folder containing .txt files

    Returns a list of bytes objects, one per file.
    """
    data = []

    # os.listdir returns entries in arbitrary order, so the document
    # order can vary across machines.
    files = os.listdir(filefolder)

    for article in files:
        path = os.path.join(filefolder, article)

        if path.endswith('.txt'):
            # Files are opened in binary mode, so each document is bytes.
            with open(path, 'rb') as f:
                data.append(f.read())

    return data

In [12]:
data = gather_data('./data')
data[0]

b'Ink helps drive democracy in Asia\n\nThe Kyrgyz Republic, a small, mountainous state of the former Soviet republic, is using invisible ink and ultraviolet readers in the country\'s elections as part of a drive to prevent multiple voting.\n\nThis new technology is causing both worries and guarded optimism among different sectors of the population. In an effort to live up to its reputation in the 1990s as "an island of democracy", the Kyrgyz President, Askar Akaev, pushed through the law requiring the use of ink during the upcoming Parliamentary and Presidential elections. The US government agreed to fund all expenses associated with this decision.\n\nThe Kyrgyz Republic is seen by many experts as backsliding from the high point it reached in the mid-1990s with a hastily pushed through referendum in 2003, reducing the legislative branch to one chamber with 75 deputies. The use of ink is only one part of a general effort to show commitment towards more open elections - the German Embass

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vect.fit(data)
dtm = vect.transform(data)

## 4) Visualize the most common word counts


## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [18]:
# Tokenize a document with spaCy: lemmatize, then drop stop words and punctuation.
def tokenize(document):

    doc = nlp(document)

    return [token.lemma_.strip() for token in doc if not token.is_stop and not token.is_punct]

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', 
                        ngram_range=(1,2),
                        max_df=.95,
                        min_df=2,
                        tokenizer=tokenize)

dtm = tfidf.fit_transform(data)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
dtm = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())
dtm.head()

Unnamed: 0,.post,.travel,1,10,12,15,1991,"2,000",2004,2025,...,£ 5,£ 50,£ 52,£ 53,£ 6,£ 6.30,£ 69,"£ 7,455",£ 8.3,£ 99
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.050341,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 6) Create a NearestNeighbors model. Write a description of your ideal data science job and query your job listings.
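A minimal sketch of this step, assuming a TF-IDF matrix as the feature space and cosine distance as the similarity measure. The four listings and the ideal-job text below are hypothetical; with the real data you would fit on the cleaned job descriptions instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical listings standing in for the cleaned job descriptions.
listings = [
    "Data scientist building machine learning models in Python",
    "Registered nurse for a pediatric hospital ward",
    "Data analyst creating dashboards and SQL reports",
    "Truck driver for regional freight routes",
]

# Fit TF-IDF on the listings so queries are projected into the same vocabulary.
tfidf = TfidfVectorizer(stop_words="english")
dtm = tfidf.fit_transform(listings)

# Brute-force search with cosine distance works directly on sparse TF-IDF rows.
nn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="cosine")
nn.fit(dtm)

# Hypothetical description of an ideal job; it must be transformed, not re-fit.
ideal_job = ["I want to build machine learning models with Python"]
distances, indices = nn.kneighbors(tfidf.transform(ideal_job))

for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {listings[idx]}")
```

The query is passed through `tfidf.transform` rather than `fit_transform` so that it shares the vocabulary learned from the listings. Words in the query that never appear in any listing are ignored.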

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Research which algorithms are better suited to the task. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
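
For the clustering stretch goal, one possible starting point (an assumption, not necessarily the algorithm the hint has in mind) is to reduce the TF-IDF matrix with truncated SVD and normalize the rows to unit length before clustering. On unit-length vectors, Euclidean distance is a monotonic function of cosine similarity, which softens the high-dimensionality problem. The six short documents below are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical documents: three about modeling, three about reporting.
docs = [
    "python machine learning models",
    "machine learning models in python and spark",
    "deep learning models python",
    "sql dashboards reporting excel",
    "excel reporting and sql dashboards for finance",
    "dashboards sql excel",
]

# TF-IDF -> low-rank projection -> unit-length rows.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
reduced = lsa.fit_transform(docs)

# K-means on the normalized, reduced vectors; n_init is set explicitly
# because its default changed across scikit-learn versions.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(reduced)

for label, doc in zip(labels, docs):
    print(label, doc)
```

The number of SVD components and clusters here are illustrative choices for a toy corpus; real listings would likely need more of both, and other algorithms may suit the data better.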