<a href="https://colab.research.google.com/github/wel51x/DS-Unit-4-Sprint-2-NLP/blob/master/My_LS_DS_423_Document_Classification_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 4 Sprint 2 Assignment 3*

# Document Classification

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to tune your model's hyperparameters by using different `ngram_range` values, `max_features` settings, and data cleaning methods
- Try and get the highest accuracy possible!
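
One way to approach the tuning requirement is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` and search over `ngram_range` and `max_features` with `GridSearchCV`. The sketch below is only an illustration: the tiny corpus, the labels, and the grid values are made up so the block runs on its own, and they are not taken from the job listings data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up stand-in corpus (assumption): 0 = Data Analyst, 1 = Data Scientist
texts = [
    "build dashboards and weekly reports in excel",
    "write sql queries for business reporting",
    "create tableau dashboards for stakeholders",
    "analyze sales data and prepare reports",
    "maintain excel spreadsheets and sql reports",
    "present business metrics to stakeholders",
    "train machine learning models in python",
    "deploy predictive models to production",
    "research deep learning methods for nlp",
    "build neural networks with tensorflow",
    "design experiments and statistical models",
    "develop machine learning pipelines in python",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Vectorizer and classifier are tuned together, so each CV fold
# fits its own vocabulary on the training part of that fold only.
pipeline = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Illustrative grid (assumption): the values are placeholders, not recommendations
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [None, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

In the notebook itself, the same `search.fit(...)` call could be given `X_train` and `y_train` once they are defined, with larger `max_features` values in the grid.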

In [0]:
import pandas as pd
import numpy as np
pd.set_option('display.width', 1000)
pd.set_option('max_colwidth', 160)


In [0]:
url = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv"

df = pd.read_csv(url)

In [64]:
df.describe()

Unnamed: 0,description,title,job
count,499,499,500
unique,423,232,2
top,"b'<p></p><div><h3 class=""jobSectionHeader""><b>Description:\n</b></h3><p></p><div>As part of a small, passionate and accomplished team of experts, you will w...",Data Scientist,Data Analyst
freq,3,84,250


In [65]:
df = df.dropna()
df.count()

description    499
title          499
job            499
dtype: int64

In [66]:
df.job.value_counts(normalize=True)


Data Scientist    0.501002
Data Analyst      0.498998
Name: job, dtype: float64

In [67]:
df['job_num'] = df.job.map({'Data Analyst': 0, 'Data Scientist': 1})

df.sample(11)

Unnamed: 0,description,title,job,job_num
300,"b'<div class=""jobsearch-JobMetadataHeader icl-u-xs-mb--md""><div class=""jobsearch-JobMetadataHeader-item ""><span class=""icl-u-xs-mr--xs"">$70,000 - $100,000 a...",Junior Data Analyst,Data Analyst,0
445,"b""<div><div><div><div><p>Summary</p><div><div>\nPosted: <b>Jan 14, 2019</b>\n</div><div>Weekly Hours: <b>40</b>\n</div><div>Role Number: <b>114182628</b></d...",Maps Data Analyst - Maps Evaluation,Data Analyst,0
475,"b'<div class=""jobsearch-JobMetadataHeader icl-u-xs-mb--md""><div class=""jobsearch-JobMetadataHeader-item icl-u-xs-mt--xs"">Internship</div></div><div><h1 clas...",Reporting/Data Analyst Intern,Data Analyst,0
483,b'<p></p><div><div><div><div><p><b>Characteristics of position</b><b>: </b>Provide research activities and the completion of deliverables specified in the r...,Volunteer-Data Analyst,Data Analyst,0
25,"b""As a Data Scientist for Ads Measurement in the Partners org you will work on projects to demonstrate and create a generalizable value of the Pinterest pla...",Measurement Data Scientist,Data Scientist,1
38,"b'<div class=""jobsearch-JobMetadataHeader icl-u-xs-mb--md""><div class=""jobsearch-JobMetadataHeader-item icl-u-xs-mt--xs"">Temporary, Internship</div></div><d...",Data Scientist Summer Intern,Data Scientist,1
228,"b'<div class=""jobsearch-JobMetadataHeader icl-u-xs-mb--md""><div class=""jobsearch-JobMetadataHeader-item icl-u-xs-mt--xs"">Internship</div></div>About the rol...",Associate Data Scientist Internship - Customer Data & Analytics,Data Scientist,1
45,"b""<div><div>You will collaborate with the brightest technical minds in building futuristic products, researching groundbreaking concepts, and influencing ne...",Junior Data Scientist,Data Scientist,1
36,"b'<div><p><b>Position Summary\n</b></p><p></p>In today\xe2\x80\x99s fast evolving technology world, one aspect remains common \xe2\x80\x93 reliance on data ...",Data Scientist,Data Scientist,1
246,"b""<p></p><div><p><b>Sourcing Analyst\n</b></p><p><b>J.C. Penney Company, Inc.</b></p><p><b>\nPlano, Texas</b></p><p>\nThe Sourcing Analyst will develop, dep...","Data Scientist, Sourcing Analytics",Data Scientist,1


In [0]:
from sklearn.model_selection import train_test_split

X = df.description
y = df.job_num

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [69]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(399,)
(100,)
(399,)
(100,)


## Count Vectorizer

Today we're just going to let Scikit-Learn do our text cleaning and preprocessing for us.

Let's run our vectorizer on the job descriptions and take a peek at the resulting vocabulary.

In [70]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')
# vectorizer = CountVectorizer(max_features=1000, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

# vocabulary_ maps each term to its column index, so its length is the feature count
print(len(vectorizer.vocabulary_))

9366
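
To see what "letting Scikit-Learn do the cleaning" amounts to, here is a small sketch on two made-up sentences (not from the dataset). With default settings plus `stop_words='english'`, `CountVectorizer` lowercases the text, keeps only tokens of two or more word characters, and drops English stop words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences (assumption), used only to illustrate default preprocessing
toy_docs = [
    "The Data Scientist builds a model in Python.",
    "A Data Analyst builds reports in SQL and Excel.",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps term -> column index; sort terms by index to label the columns
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_terms)
print(toy_counts.toarray())
```

Each row of the printed array is one document and each column is the count of one term, which is the same layout as `X_train_vectorized` in the cells that follow.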


### Transform Train

In [71]:
train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(399, 9366)


Unnamed: 0,00,000,00079,00805,00pm,02115,03,0356,04,062,...,zeta,zetahub,zeus,zheng,zogsports,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Transform Test

In [72]:
test_word_counts = vectorizer.transform(X_test)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(100, 9366)


Unnamed: 0,00,000,00079,00805,00pm,02115,03,0356,04,062,...,zeta,zetahub,zeus,zheng,zogsports,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
results = pd.DataFrame(columns = ["model", "acc_train", "acc_test", "vect_type"])

In [0]:
# Helper that builds one row of the results table
from sklearn.metrics import accuracy_score

def format_results(model, y_train, train_predictions, y_test, test_predictions, vect):
    """Return the model name, train/test accuracy, and vectorizer type as a dict."""
    return {
        "model": model,
        "acc_train": accuracy_score(y_train, train_predictions),
        "acc_test": accuracy_score(y_test, test_predictions),
        "vect_type": vect,
    }

## Logistic Regression

In [75]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = lr.predict(X_train_vectorized)
test_predictions = lr.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.93




In [0]:
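# DataFrame.append was removed in pandas 2.0, so build the new row and concatenate it
results = pd.concat([results, pd.DataFrame([format_results("Logistic Regression", y_train, train_predictions, y_test, test_predictions, "Count")])], ignore_index=True)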

## Multinomial Naive Bayes

In [77]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = mnb.predict(X_train_vectorized)
test_predictions = mnb.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9699248120300752
Test Accuracy: 0.88


In [0]:
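# DataFrame.append was removed in pandas 2.0, so build the new row and concatenate it
results = pd.concat([results, pd.DataFrame([format_results("Multinomial Naive Bayes", y_train, train_predictions, y_test, test_predictions, "Count")])], ignore_index=True)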

## Random Forest Classifier

In [79]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = rfc.predict(X_train_vectorized)
test_predictions = rfc.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9874686716791979
Test Accuracy: 0.82




In [0]:
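# DataFrame.append was removed in pandas 2.0, so build the new row and concatenate it
results = pd.concat([results, pd.DataFrame([format_results("Random Forest Classifier", y_train, train_predictions, y_test, test_predictions, "Count")])], ignore_index=True)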

In [51]:
results

Unnamed: 0,model,acc_train,acc_test,vect_type
0,Logistic Regression,0.997494,0.93,Count
1,Multinomial Naive Bayes,0.969925,0.88,Count
2,Random Forest Classifier,0.987469,0.82,Count


## TF-IDF Vectorization Method

In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)

{'div': 2464, 'hrs': 3797, 'looking': 4592, 'add': 385, 'data': 2099, 'scientist': 7626, 'rapidly': 7091, 'growing': 3572, 'engineering': 2777, 'team': 8420, 'things': 8506, 'individual': 3977, 'responsible': 7395, 'using': 8878, 'answers': 637, 'questions': 7049, 'healthcare': 3674, 'professionals': 6853, 'managing': 4677, 'patient': 6435, 'care': 1274, 'including': 3936, 'areas': 709, 'risk': 7471, 'analysis': 593, 'statistical': 8122, 'anonymization': 631, 'predictive': 6735, 'analytics': 599, 'role': 7491, 'involve': 4217, 'working': 9184, 'primarily': 6789, 'python': 7004, 'jupiter': 4316, 'labs': 4381, 'framework': 3302, 'small': 7896, 'member': 4810, 'able': 273, 'learn': 4440, 'quickly': 7055, 'self': 7710, 'sufficient': 8264, 'flexible': 3207, 'motivated': 4998, 'work': 9174, 'different': 2333, 'technologies': 8438, 'applications': 666, 'needed': 5405, 'exciting': 2940, 'opportunity': 6259, 'huge': 3809, 'impact': 3889, 'early': 2581, 'stages': 8075, 'department': 2222, 'devel
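
The vocabulary above starts with `'div'`, which suggests that HTML tag names from the scraped descriptions are ending up as features. One possible data cleaning step is sketched below using only the standard library. The helper name `clean_description` and the sample string are hypothetical, and the sketch assumes each description is stored as the text of a Python bytes literal (`b'...'`), as the sample rows suggest.

```python
import ast
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches anything that looks like an HTML tag

def clean_description(raw):
    """Hypothetical helper: decode a stringified bytes literal and strip HTML."""
    text = raw
    if raw.startswith(("b'", 'b"')):
        try:
            # Turn the text "b'...'" back into bytes, then decode as UTF-8
            text = ast.literal_eval(raw).decode("utf-8", errors="replace")
        except (ValueError, SyntaxError):
            text = raw[2:-1]  # fall back to dropping the b'...' wrapper
    text = TAG_RE.sub(" ", text)   # replace tags with spaces
    text = html.unescape(text)     # convert entities such as &amp;
    return " ".join(text.split())  # collapse runs of whitespace

# Made-up sample (assumption) in the same shape as the scraped descriptions
sample = "b'<div><p><b>Position Summary\\n</b></p>In today\\xe2\\x80\\x99s world, data &amp; analytics matter.</div>'"
print(clean_description(sample))
```

Applied to the notebook's data, this could look like `df['description'].apply(clean_description)` before the train/test split. Whether it improves accuracy is something to check rather than assume.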

### Vectorize training data

In [82]:
train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(399, 9366)


Unnamed: 0,00,000,00079,00805,00pm,02115,03,0356,04,062,...,zeta,zetahub,zeus,zheng,zogsports,zoho,zone,zones,zoom,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.09339,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Vectorize test data

In [83]:
test_word_counts = vectorizer.transform(X_test)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(100, 9366)


Unnamed: 0,00,000,00079,00805,00pm,02115,03,0356,04,062,...,zeta,zetahub,zeus,zheng,zogsports,zoho,zone,zones,zoom,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.033325,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.030632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
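
The fractional values in these tables replace the integer counts from the Count Vectorizer section. As a rough illustration of where they come from, the sketch below computes TF-IDF by hand on a made-up corpus, following the formula that `TfidfVectorizer` uses with its default settings (`smooth_idf=True`, `norm='l2'`): `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by scaling each row to unit length. Tokenization here is a plain whitespace split, which is a simplification.

```python
import math
from collections import Counter

# Made-up corpus (assumption), already lowercased and free of punctuation
docs = [
    "python sql statistics",
    "sql excel reporting",
    "python machine learning python",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Raw count times idf for each vocabulary term, then L2-normalized."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(weight * weight for weight in raw))
    return [weight / norm for weight in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]

for doc, row in zip(docs, rows):
    top_term, top_weight = max(zip(vocab, row), key=lambda pair: pair[1])
    print(f"{doc!r}: top term = {top_term} ({top_weight:.3f})")
```

Terms that appear in fewer documents get a larger `idf`, so a rare term can outweigh a common one even when both occur once in a description.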


## Logistic Regression

In [84]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9674185463659147
Test Accuracy: 0.91




In [0]:
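# DataFrame.append was removed in pandas 2.0, so build the new row and concatenate it
results = pd.concat([results, pd.DataFrame([format_results("Logistic Regression", y_train, train_predictions, y_test, test_predictions, "TF-IDF")])], ignore_index=True)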

## Multinomial Naive Bayes

In [86]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9674185463659147
Test Accuracy: 0.87


In [0]:
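# DataFrame.append was removed in pandas 2.0, so build the new row and concatenate it
results = pd.concat([results, pd.DataFrame([format_results("Multinomial Naive Bayes", y_train, train_predictions, y_test, test_predictions, "TF-IDF")])], ignore_index=True)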

## Random Forest Classifier

In [88]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9874686716791979
Test Accuracy: 0.86




In [0]:
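# DataFrame.append was removed in pandas 2.0, so build the new row and concatenate it
results = pd.concat([results, pd.DataFrame([format_results("Random Forest Classifier", y_train, train_predictions, y_test, test_predictions, "TF-IDF")])], ignore_index=True)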

In [90]:
results

Unnamed: 0,model,acc_train,acc_test,vect_type
0,Logistic Regression,0.997494,0.93,Count
1,Multinomial Naive Bayes,0.969925,0.88,Count
2,Random Forest Classifier,0.987469,0.82,Count
3,Logistic Regression,0.967419,0.91,TF-IDF
4,Multinomial Naive Bayes,0.967419,0.87,TF-IDF
5,Random Forest Classifier,0.987469,0.86,TF-IDF


# Stretch Goals

- Try agglomerative clustering with cosine distance, which works better in high-dimensional spaces. A linkage method such as Ward would also be worth trying. Try to create a dendrogram of the most important terms in the dataset.

- Resources for the clustering stretch goals:
 - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
 - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset: 
 - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
 - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
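
As a starting point for the clustering stretch goal, the sketch below clusters TF-IDF term vectors with SciPy using cosine distance and builds the dendrogram structure without plotting. The corpus and the `max_features` value are made up for illustration. It uses average linkage rather than Ward, because SciPy documents Ward, centroid, and median linkage as correctly defined only for Euclidean distances.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus (assumption) standing in for the job descriptions
docs = [
    "build dashboards and reports in excel and sql",
    "sql queries and excel reports for business dashboards",
    "business reports and dashboards for stakeholders",
    "train machine learning models in python",
    "python machine learning models for prediction",
    "deep learning models and python pipelines",
]

# Keep only the most frequent terms so the tree stays readable
vectorizer = TfidfVectorizer(stop_words="english", max_features=12)
tfidf = vectorizer.fit_transform(docs)

# Term labels in column order (vocabulary_ maps term -> column index)
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Transpose so each row is one term described by its weights across documents
term_vectors = tfidf.T.toarray()

# Condensed pairwise cosine distances, then average-linkage clustering
distances = pdist(term_vectors, metric="cosine")
Z = linkage(distances, method="average")

# no_plot=True returns the leaf order without needing matplotlib
tree = dendrogram(Z, labels=terms, no_plot=True)
print(tree["ivl"])

# Cut the tree into at most two flat clusters and list the members
cluster_ids = fcluster(Z, t=2, criterion="maxclust")
for cluster in sorted(set(cluster_ids)):
    print(cluster, [t for t, c in zip(terms, cluster_ids) if c == cluster])
```

Dropping `no_plot=True` (with matplotlib installed) draws the dendrogram. On the real data, `docs` would be the job descriptions and `max_features` would likely need to be larger.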
