## *AIB / SECTION 4 / SPRINT 2 / NOTE 1*

# 📝 Assignment

---


# Count-based_Representation

This assignment uses job descriptions scraped from indeed.com by searching for the keyword Data Scientist.

The [Data_Scientist.csv](https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/indeed/Data_Scientist.csv) file contains roughly 1,300 Data Scientist job descriptions.

## 1. Text preprocessing

In [1]:
import re
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 0) Prepare the data before text analysis.

- Load the file and keep only the title, company, and description columns.
- Remove duplicate rows.

In [2]:
df = pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/indeed/Data_Scientist.csv')

In [3]:
df = df[['title', 'company', 'description']]
df.duplicated().sum()

543

In [4]:
df = df.drop_duplicates()
df.isna().sum()

title          0
company        0
description    0
dtype: int64

In [5]:
df.shape

(757, 3)

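### 1) Clean the tokens.

- Convert all text to lowercase
- Remove information that is irrelevant to the analysis
- For this assignment, load `"en_core_web_sm"` from `spacy`.

- **Question 1) Enter the function that converts uppercase letters to lowercase.**
- **Question 2) Using the re library, write a regular expression that keeps only lowercase letters and digits.**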

In [6]:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = Tokenizer(nlp.vocab)

In [7]:
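def lower_and_regex(sentence):
    """
    Convert the text to lowercase, then use a regular expression to delete
    every character that is not a lowercase letter, a digit, or a space.
    Returns the whitespace-separated tokens.
    """
    sentence = sentence.replace('\n', ' ')
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z0-9 ]", "", sentence)
    tokens = sentence.split()
    return tokens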

In [8]:
df['tokens'] = df['description'].apply(lower_and_regex)
df.tail()

| | title | company | description | tokens |
|---|---|---|---|---|
| 1288 | Senior Data Analyst | Intuit | Our Expert Delivery & Business Intelligence te... | [our, expert, delivery, business, intelligence... |
| 1294 | Senior / Data Scientist, Advertising Business | Spotify | Music for everyone, no credit card needed. It’... | [music, for, everyone, no, credit, card, neede... |
| 1295 | Senior Data & Applied Scientist | Microsoft | Senior Data & Applied Scientist\nDo you have a... | [senior, data, applied, scientist, do, you, ha... |
| 1297 | Senior Data Scientist | eBay Inc. | eBay is a global commerce leader that allows y... | [ebay, is, a, global, commerce, leader, that, ... |
| 1299 | Senior Data Scientist | Spring Discovery | tl;dr\nSpring is accelerating the discovery of... | [tldr, spring, is, accelerating, the, discover... |
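
The last row shows `tl;dr` becoming the single token `tldr`, because the regular expression deletes punctuation rather than replacing it. The sketch below reproduces that behavior on a made-up sentence and contrasts it with a hypothetical variant (`clean_space`, not part of the assignment) that substitutes a space instead, which keeps hyphenated words apart.

```python
import re

def clean_delete(text):
    """Same steps as lower_and_regex: lowercase, then delete disallowed characters."""
    text = text.replace("\n", " ").lower()
    return re.sub(r"[^a-z0-9 ]", "", text).split()

def clean_space(text):
    """Hypothetical variant: replace disallowed characters with a space."""
    text = text.replace("\n", " ").lower()
    return re.sub(r"[^a-z0-9 ]", " ", text).split()

# Made-up sample text, used only for illustration.
sample = "tl;dr\nSpring needs state-of-the-art ML."

deleted = clean_delete(sample)
spaced = clean_space(sample)

print(deleted)  # punctuation removed, so neighbors are glued together
print(spaced)   # punctuation acts as a separator
```

Which variant is preferable depends on the analysis: deletion keeps abbreviations intact, while substitution avoids merged tokens such as `stateoftheart`.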


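### 2) Visualize the cleaned tokens.

- Print the top 10 tokens.
- Compute each token's count, frequency rank, number of documents it appears in, and its share of all tokens.
- Plot the cumulative percentage distribution by token rank.

- **Question 3) Enter the 10 highest-ranked tokens.**

In [9]:
# Print the top 10 tokens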
from collections import Counter

word_counts = Counter()
df['tokens'].apply(lambda x: word_counts.update(x))
word_counts.most_common(10)

[('and', 21863),
 ('to', 12694),
 ('the', 10538),
 ('of', 8839),
 ('data', 7425),
 ('in', 6769),
 ('a', 6436),
 ('with', 5727),
 ('for', 4132),
 ('or', 3812)]

In [10]:
# Compute token counts, frequency rank, document counts, percentages, etc.
# word_counts

In [11]:
# wc = word_count(df['tokens'])
# wc.head()

In [12]:
# import seaborn as sns

# sns.lineplot(x='rank', y='cul_percent', data=wc);
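
The commented-out cells above call a `word_count` helper that is not defined in this notebook. The following is a minimal sketch of what such a helper might compute, assuming it receives an iterable of token lists. The `rank` and `cul_percent` keys follow the names used in the commented plotting call; `appears_in` and `appears_in_pct` are hypothetical names chosen for illustration.

```python
from collections import Counter

def word_count(docs):
    """Return one row per token: count, rank, share, cumulative share, document frequency."""
    docs = list(docs)
    total_counts = Counter()   # how often each token occurs overall
    doc_counts = Counter()     # in how many documents each token occurs

    for tokens in docs:
        total_counts.update(tokens)
        doc_counts.update(set(tokens))

    n_tokens = sum(total_counts.values())
    n_docs = len(docs)

    rows = []
    cumulative = 0.0
    for rank, (word, count) in enumerate(total_counts.most_common(), start=1):
        percent = count / n_tokens
        cumulative += percent
        rows.append({
            "word": word,
            "count": count,
            "rank": rank,
            "percent": percent,
            "cul_percent": cumulative,
            "appears_in": doc_counts[word],
            "appears_in_pct": doc_counts[word] / n_docs,
        })
    return rows

# Toy token lists, used only for illustration.
toy_docs = [["data", "science", "team"], ["data", "team", "data"], ["business", "data"]]
rows = word_count(toy_docs)
for row in rows:
    print(row)
```

Passing the result to `pd.DataFrame(rows)` would give a table that can be handed to a plotting call like the one commented out above.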

### 4) Clean the tokens with an extended stop-word list.


- **Question 4) Write the code that adds two words (`"data", "work"`) to the default stop-word list.**
- **Question 5) Enter the top 10 tokens after removing stop words.**

In [13]:
# Question 4)
print(nlp.Defaults.stop_words)
STOP_WORDS = nlp.Defaults.stop_words.union(["data", "work"])

{'became', 'whose', 'also', 'neither', 'only', 'fifteen', "'ll", '’ve', 'if', 'along', 'together', 'thus', 'being', 'of', 'hundred', 'upon', 'more', 'out', 'meanwhile', 'his', 'latterly', 'they', 'made', 'myself', 'eight', 'now', 'serious', 'be', 'when', "'s", 'whether', '‘ll', 're', 'yourself', 'me', 'four', 'make', 'many', 'against', 'to', 'i', 'the', 'which', 'everyone', 'once', 'behind', 'least', 'into', '’d', 'but', 'whereupon', 'latter', 'some', 'at', 'around', 'between', 'sometimes', 'anyway', 'show', 'amongst', 'before', 'own', 'these', 'seem', 'both', 'empty', 'always', 'call', 'across', 'otherwise', 'twenty', 'front', 'over', 'while', 'can', "'d", 'besides', 'must', 'again', 'all', 'you', 'except', 'first', 'however', 'yours', 'will', 'itself', 'herein', 'five', 'with', 'anywhere', 'elsewhere', '‘ve', 'top', 'for', 'too', 'above', 'toward', 'noone', 'are', 'may', 'could', 'thru', 'give', 'without', 'have', 'becomes', 'whoever', 'cannot', 'although', 'in', 'herself', 'whereas'

In [14]:
tokens = []

for doc in tokenizer.pipe(df['description']):
    doc_tokens = []

    for token in doc:
        # Lowercase each token and strip everything except a-z, 0-9, and spaces.
        txt = re.sub(r"[^a-z0-9 ]", "", token.text.lower()).strip()
        # Keep the token only if it is non-empty and not a stop word.
        if txt and txt not in STOP_WORDS:
            doc_tokens.append(txt)

    tokens.append(doc_tokens)

df['tokens'] = tokens
df.head()

| | title | company | description | tokens |
|---|---|---|---|---|
| 0 | Data Scientist (Structured Products) | EquiTrust Life Insurance Company | Job Details\nDescription\nEssential Duties and... | [job, details, description, essential, duties,... |
| 2 | Specialist, Data Science | Nationwide | As a team member in the Finance and Internal A... | [team, member, finance, internal, audit, depar... |
| 4 | Sr. Data Scientist (Remote) | American Credit Acceptance | Overview:\nAmerican Credit Acceptance seeks a ... | [overview, american, credit, acceptance, seeks... |
| 5 | Data Scientist Associate Sr (DADS06) BTB - LEG... | JPMorgan Chase Bank, N.A. | J.P. Morgan's Corporate & Investment Bank (CIB... | [jp, morgans, corporate, investment, bank, cib... |
| 6 | Data Scientist | VyStar Credit Union | At VyStar, we offer competitive pay, an excell... | [vystar, offer, competitive, pay, excellent, b... |


In [15]:
# Question 5) Enter the top 10 tokens after removing stop words.
# Print the top 10 tokens
# from collections import Counter

word_counts = Counter()
df['tokens'].apply(lambda x: word_counts.update(x))  # updating the earlier Counter did not work as intended; a fresh object does
most_10_word = word_counts.most_common(10)
most_10_word

[('experience', 3450),
 ('business', 2064),
 ('science', 1648),
 ('team', 1625),
 ('learning', 1596),
 ('analysis', 1349),
 ('skills', 1251),
 ('machine', 1152),
 ('analytics', 1136),
 ('models', 1034)]

### 5) Analyze the effect of lemmatization.



- **Question 6) Enter the top 10 words after lemmatization.**

In [16]:
def get_lemmas(txt):
    """Clean the text, then return lemmas of tokens that are not stop words, punctuation, or pronouns."""
    txt = txt.replace('\n', ' ')
    txt = re.sub(r"[^a-z0-9 ]", "", txt.lower()).strip()

    lemmas = []

    doc = nlp(txt)

    for token in doc:
        if not token.is_stop and not token.is_punct and token.pos_ != 'PRON':
            if token.text not in ['', ' ']:
                lemmas.append(token.lemma_)

    return lemmas

# print([token.lemma_ for token in doc if (token.is_stop != True) and (token.is_punct != True)])

In [17]:
df['lemmas'] = df['description'].apply(get_lemmas)
df['lemmas'].head()

0    [job, detail, description, essential, duty, re...
2    [team, member, finance, internal, audit, depar...
4    [overview, american, credit, acceptance, seek,...
5    [jp, morgans, corporate, investment, bank, cib...
6    [vystar, offer, competitive, pay, excellent, b...
Name: lemmas, dtype: object

In [18]:
lemmas_word_counts = Counter()

df['lemmas'].apply(lambda x: lemmas_word_counts.update(x))
most_10__lemmas_word = lemmas_word_counts.most_common(10)
most_10__lemmas_word

[('datum', 5067),
 ('experience', 3622),
 ('work', 2861),
 ('data', 2358),
 ('team', 2298),
 ('business', 2157),
 ('science', 1704),
 ('analysis', 1580),
 ('model', 1554),
 ('analytic', 1379)]
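
The list above still contains `datum`, `data`, and `work`. A likely reason is that `STOP_WORDS` was built with `.union(...)`, which returns a new set, so `token.is_stop` inside `get_lemmas` still reflects only the default list. One possible follow-up, sketched here on toy lemma lists, is to filter the lemmas against an extra set afterwards. `EXTRA_STOP_LEMMAS` and `drop_extra_stop_lemmas` are hypothetical names, and adding `datum` is an assumption based on the output above.

```python
from collections import Counter

# Hypothetical extra stop list; "datum" is included because the lemmatizer
# maps many occurrences of "data" to it in the output above.
EXTRA_STOP_LEMMAS = {"data", "datum", "work"}

def drop_extra_stop_lemmas(lemmas, extra=EXTRA_STOP_LEMMAS):
    """Return the lemmas that are not in the extra stop list."""
    return [lemma for lemma in lemmas if lemma not in extra]

# Toy lemma lists, used only for illustration.
toy_lemmas = [
    ["datum", "experience", "work", "team"],
    ["data", "model", "experience"],
]

counts = Counter()
for lemmas in toy_lemmas:
    counts.update(drop_extra_stop_lemmas(lemmas))

print(counts.most_common(3))
```

On the real data this would be applied as `df['lemmas'].apply(drop_extra_stop_lemmas)` before counting.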

In [19]:
# !pip install squarify

## 2. Finding similar documents

### 1) Vectorize each document with `TfidfVectorizer`, build a KNN model, <br/> then query it with a chosen `job description` and analyze the closest results.

- **Question 9) Enter the indices of the 5 `job description`s most similar to the `job description` at index 88.**
    - The answer includes index 88.
    - Set `max_features = 3000`.
    - Enter the answer in the form [88, 90, 91, 93, 94].
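
To make the vectorization step concrete, here is a small pure-Python sketch of the weighting that scikit-learn documents as the `TfidfVectorizer` default: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name `tfidf_vectors` and the toy documents are made up for illustration, and tokenization is assumed to have been done already.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute L2-normalized TF-IDF rows for a list of token lists."""
    n_docs = len(docs)
    vocab = sorted({token for doc in docs for token in doc})
    # Document frequency: number of documents containing each token.
    df_counts = Counter(token for doc in docs for token in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Toy documents, used only for illustration.
toy_docs = [["data", "science", "data"], ["data", "team"], ["business", "team"]]
vocab, vectors = tfidf_vectors(toy_docs)

print(vocab)
for vec in vectors:
    print([round(v, 3) for v in vec])
```

Because every row has unit length, Euclidean distance between rows is a monotonic function of their cosine similarity, which is one reason a Euclidean nearest-neighbor search is a reasonable fit for such vectors.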

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA

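# TF-IDF vectorizer. max_features is limited to 15 here to keep the table small
# (question 9 asks for max_features=3000; the outputs below come from the 15-feature setting).
tfidf = TfidfVectorizer(stop_words='english', max_features=15)

# Fit, then build the DTM (a TF-IDF value for every document and term).
dtm_tfidf = tfidf.fit_transform(df['description'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
dtm_tfidf = pd.DataFrame(dtm_tfidf.toarray(), columns=tfidf.get_feature_names_out())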
dtm_tfidf



| | ability | analysis | analytics | business | data | experience | learning | machine | models | science | skills | statistical | team | work | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.301373 | 0.417446 | 0.000000 | 0.000000 | 0.514613 | 0.214731 | 0.000000 | 0.000000 | 0.300937 | 0.122172 | 0.394946 | 0.000000 | 0.125346 | 0.358723 | 0.134582 |
| 1 | 0.219865 | 0.058009 | 0.130702 | 0.507044 | 0.386160 | 0.469967 | 0.199107 | 0.091949 | 0.125455 | 0.025466 | 0.164646 | 0.330165 | 0.104509 | 0.299091 | 0.056105 |
| 2 | 0.202192 | 0.062237 | 0.280457 | 0.483558 | 0.184137 | 0.096043 | 0.305171 | 0.197302 | 0.471100 | 0.054644 | 0.000000 | 0.425077 | 0.112127 | 0.213928 | 0.060194 |
| 3 | 0.179847 | 0.000000 | 0.187096 | 0.645174 | 0.614198 | 0.042714 | 0.217155 | 0.175497 | 0.119724 | 0.145814 | 0.104750 | 0.000000 | 0.099735 | 0.000000 | 0.053542 |
| 4 | 0.029973 | 0.332141 | 0.280635 | 0.161288 | 0.614178 | 0.363058 | 0.081431 | 0.087746 | 0.059860 | 0.097206 | 0.183306 | 0.346579 | 0.199463 | 0.190279 | 0.133850 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 752 | 0.000000 | 0.000000 | 0.000000 | 0.811929 | 0.206119 | 0.215016 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.502050 | 0.000000 | 0.000000 |
| 753 | 0.069705 | 0.128735 | 0.362572 | 0.375083 | 0.618930 | 0.248326 | 0.063124 | 0.000000 | 0.000000 | 0.226057 | 0.060898 | 0.073271 | 0.347895 | 0.276564 | 0.000000 |
| 754 | 0.092069 | 0.085020 | 0.000000 | 0.495429 | 0.377314 | 0.328001 | 0.416884 | 0.359370 | 0.000000 | 0.149294 | 0.000000 | 0.000000 | 0.153172 | 0.365300 | 0.082230 |
| 755 | 0.276875 | 0.191756 | 0.072009 | 0.620780 | 0.330946 | 0.295913 | 0.000000 | 0.000000 | 0.000000 | 0.056120 | 0.241894 | 0.218281 | 0.230313 | 0.274636 | 0.247284 |


In [21]:
# Fit a NearestNeighbors model on the DTM (5 nearest neighbors, the default).
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm_tfidf)

# Query with a one-row DataFrame so the feature names match those seen during fit.
nn.kneighbors(dtm_tfidf.iloc[[88]])

(array([[0.        , 0.2847889 , 0.30134951, 0.34527149, 0.35129336]]),
 array([[ 88,  72, 717, 597, 427]]))

In [22]:
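# Caution: kneighbors returns row positions in dtm_tfidf, whereas df['description'][88]
# looks up index label 88. After drop_duplicates() the two can differ;
# df['description'].iloc[88] would select by position instead.
print(df['description'][88][0:300], '\n---------------')
print(df['description'][72][0:300])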

We're looking for Data Scientists to work on our core and business products to help shape the future of what we build at Hyperspace Ventures. You will enjoy working with various data sets, cutting edge technology, and the ability to see your insights turned into real products on a regular basis. The 
---------------
As a part of the Data Science team, you will run the development of next-generation platforms for customers micro-segmentation, clusterization, behavior analysis, and prediction as well as building up new improved recommendation systems for the customers.
You will use your expertise to design data s
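
One point to watch when reading the neighbor indices: `drop_duplicates()` keeps the original index labels, so a row's position and its label can diverge, while the neighbor search reports positions. The toy frame below is made up to show the difference; resetting the index is one way (an assumption, not something the notebook does) to make labels and positions coincide.

```python
import pandas as pd

# Toy frame with duplicates, used only for illustration.
toy = pd.DataFrame({"description": ["a", "a", "b", "c", "c", "d"]})
toy = toy.drop_duplicates()

labels = list(toy.index)                 # labels survive from the original frame
by_position = toy["description"].iloc[2] # third remaining row
by_label = toy["description"].loc[2]     # row whose label is 2

print(labels)
print(by_position, by_label)

# After reset_index(drop=True), label 2 and position 2 refer to the same row.
realigned = toy.reset_index(drop=True)
print(realigned["description"][2])
```

Applied to this notebook, calling `df = df.reset_index(drop=True)` right after `drop_duplicates()` would keep `df` aligned with the rows of `dtm_tfidf`.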


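## 3. Text classification with TF-IDF

Once sentences or documents have been vectorized with TF-IDF, the resulting vectors can be used for a document classification task.

The dataset has no labels, so we classify whether a posting is for a senior role based on whether the title column contains "Senior".

### 1) Create a new column named "Senior" that is 1 if the title column contains the string "Senior" and 0 otherwise.

Question 7) How many rows have the value 1 (Senior) in the new Senior column?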

In [23]:
def findWord(txt) :
  if 'Senior' in txt:
    return 1
  return 0

In [24]:
df['senior'] = df['title'].apply(findWord)

In [25]:
df['senior'].sum()

95

Question 8) Split the data into train and validation sets with sklearn's `train_test_split`, then classify with `sklearn`'s `DecisionTreeClassifier`.

Use the dtm_tfidf built above as the x values without modification. Fix random_state to 42 for both train_test_split and DecisionTreeClassifier, and set test_size to 0.1.

After training, predict on the test data and report the precision and recall for label 1.
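
Since the question asks specifically about label 1, it helps to recall how per-class precision and recall are defined. The sketch below computes both by hand from true positives, false positives, and false negatives; the function name `precision_recall` and the toy labels are made up for illustration. As far as scikit-learn's documentation describes it, `precision_score(..., average=None)` returns one score per class in label order, so index 0 corresponds to label 0 and index 1 to label 1.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels, used only for illustration (1 = Senior, 0 = not Senior).
y_true_toy = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred_toy = [0, 0, 0, 1, 1, 0, 0, 0]

p1, r1 = precision_recall(y_true_toy, y_pred_toy, positive=1)
p0, r0 = precision_recall(y_true_toy, y_pred_toy, positive=0)

print("label 1:", p1, r1)
print("label 0:", p0, r0)
```

With an imbalanced target like this one, the label-0 scores can look strong even when the model rarely identifies label 1 correctly, so the two should be read separately.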

In [26]:
from sklearn.model_selection import train_test_split

# Split features and labels in a single call so the rows stay paired.
X_train, X_test, y_train, y_test = train_test_split(
    dtm_tfidf, df['senior'], test_size=0.1, random_state=42)

# Carve a validation set out of the training portion.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42)
X_train.head()

| | ability | analysis | analytics | business | data | experience | learning | machine | models | science | skills | statistical | team | work | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 359 | 0.0 | 0.0 | 0.0 | 0.319673 | 0.182595 | 0.317461 | 0.403488 | 0.521734 | 0.177964 | 0.144496 | 0.233558 | 0.093671 | 0.222376 | 0.070712 | 0.397936 |
| 501 | 0.0 | 0.164947 | 0.247766 | 0.0 | 0.854033 | 0.0 | 0.21568 | 0.232406 | 0.118911 | 0.0 | 0.0 | 0.250352 | 0.049528 | 0.047248 | 0.0 |
| 169 | 0.124556 | 0.460074 | 0.259152 | 0.0 | 0.680596 | 0.354987 | 0.0 | 0.0 | 0.0 | 0.100986 | 0.108819 | 0.0 | 0.207218 | 0.197678 | 0.111244 |
| 627 | 0.0 | 0.0 | 0.0 | 0.0 | 0.813125 | 0.121175 | 0.0 | 0.0 | 0.169822 | 0.0 | 0.297163 | 0.0 | 0.141468 | 0.404863 | 0.151892 |
| 269 | 0.0 | 0.114801 | 0.129331 | 0.668971 | 0.594396 | 0.088579 | 0.112583 | 0.242627 | 0.12414 | 0.100795 | 0.108613 | 0.0 | 0.206827 | 0.098652 | 0.0 |


In [27]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

model = DecisionTreeClassifier(max_depth=7, random_state=42, criterion='entropy')
model.fit(X_train, y_train)
print(model.score(X_train, y_train))

# Evaluate on the validation set.
# average=None returns one score per class; index [0] is label 0 (not Senior).
# Index [1] would give the label-1 scores that question 8 asks for.
y_pred_val = model.predict(X_val)
print('validation accuracy : ', model.score(X_val, y_val))
print('validation f1 score (label 1) : ', f1_score(y_val, y_pred_val))
print('validation precision (label 0) : ', precision_score(y_val, y_pred_val, average=None)[0])
print('validation recall (label 0) : ', recall_score(y_val, y_pred_val, average=None)[0])

y_pred = model.predict(X_test)
print('test accuracy : ', model.score(X_test, y_test))
print('test precision (label 0) : ', precision_score(y_test, y_pred, average=None)[0])
print('test recall (label 0) : ', recall_score(y_test, y_pred, average=None)[0])

0.9379084967320261
validation accuracy :  0.782608695652174
validation f1 score (label 1) :  0.0
validation precision (label 0) :  0.8307692307692308
validation recall (label 0) :  0.9310344827586207
test accuracy :  0.8421052631578947
test precision (label 0) :  0.8888888888888888
test recall (label 0) :  0.9411764705882353
