# **Dataset Info.**

**apply_train.csv**

List of (resume, job posting) pairs in which the resume actually applied to the posting (application history)

**Resume-related data**

resume.csv

resume_certificate.csv

resume_education.csv

resume_language.csv

**Job-posting-related data**

recruitment.csv
company.csv

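sample_submission.csv [file]

- Submission format

resume_seq : unique ID of the resume to generate recommendations for

recruitment_seq : unique ID of the job posting recommended for that resume

Five job postings must be recommended for every resume_seq in resume.csv.

Job postings that the resume has actually applied to must not be recommended.

A submission is rejected if it recommends a duplicate job posting or recommends any number of postings other than five for a resume.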
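The three submission rules can be checked mechanically before uploading. The following is a minimal sketch in plain Python, not part of the competition tooling; the function name `validate_submission` and its arguments are hypothetical, and the IDs are made up.

```python
from collections import Counter


def validate_submission(rows, all_resumes, applied_pairs, k=5):
    """Return a list of human-readable problems; an empty list means the rows pass.

    rows          : iterable of (resume_seq, recruitment_seq) recommendations
    all_resumes   : every resume_seq that must receive recommendations
    applied_pairs : set of (resume_seq, recruitment_seq) already applied to
    """
    rows = list(rows)
    problems = []

    # Rule: no duplicated posting for the same resume.
    for (resume, job), n in Counter(rows).items():
        if n > 1:
            problems.append(f"{resume}: {job} recommended {n} times")

    # Rule: exactly k postings for every resume.
    per_resume = Counter(resume for resume, _ in rows)
    for resume in all_resumes:
        count = per_resume.get(resume, 0)
        if count != k:
            problems.append(f"{resume}: {count} postings, expected {k}")

    # Rule: never recommend a posting the resume already applied to.
    for resume, job in rows:
        if (resume, job) in applied_pairs:
            problems.append(f"{resume}: {job} was already applied to")

    return problems


# Toy data (made up for illustration).
applied = {("U00001", "R00001")}
good = [("U00001", f"R{i:05d}") for i in range(2, 7)]
bad = [("U00001", "R00001")] + good[:3]

print(validate_submission(good, ["U00001"], applied))
print(validate_submission(bad, ["U00001"], applied))
```

With DataFrames, `rows` could be `submission.itertuples(index=False, name=None)` and `applied_pairs` could be `set(zip(df_apply_train['resume_seq'], df_apply_train['recruitment_seq']))`, assuming the submission frame holds only the two ID columns.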

# **Literature worth consulting**

**A customized insurance recommendation system using similarity (유사도를 활용한 맞춤형 보험 추천 시스템)**
 - https://eds.s.ebscohost.com/abstract?site=eds&scope=site&jrnl=22344772&AN=160571615&h=nBWJ8MZ1VBxMHQMYz4s9qQHgOjuLMkVqjCvDvX%2fto3azA7BD2spqKM7vicp7gS4GyrqWjkVPK3G9ZoRdyNUaKA%3d%3d&crl=c&resultLocal=ErrCrlNoResults&resultNs=Ehost&crlhashurl=login.aspx%3fdirect%3dtrue%26profile%3dehost%26scope%3dsite%26authtype%3dcrawler%26jrnl%3d22344772%26AN%3d160571615
 > -> decision tree and random forest classifier, K-means clustering algorithm and manually operated algorithm

**Papers with more than 1,000 Google Scholar citations**

**Recommender systems**
 - https://www.sciencedirect.com/science/article/pii/S0370157312000828
 - 1,212 citations

**Recommender Systems in E-Commerce**
 - https://dl.acm.org/doi/pdf/10.1145/336992.337035
 - 2,736 citations

**Recommender system application developments: A survey**
 - https://www.sciencedirect.com/science/article/pii/S0167923615000627
 - 1,717 citations

**Research-paper recommender systems: a literature survey**
 - https://link.springer.com/article/10.1007/s00799-015-0156-0
 - 1,087 citations

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from tqdm.auto import tqdm
from collections import defaultdict
from sklearn.decomposition import TruncatedSVD, NMF, SparsePCA
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec

In [56]:
df_apply_train = pd.read_csv('/content/apply_train.csv')
df_company = pd.read_csv('/content/company.csv')
df_recruitment = pd.read_csv('/content/recruitment.csv')
df_resume = pd.read_csv('/content/resume.csv')
df_resume_certificate = pd.read_csv('/content/resume_certificate.csv')
df_resume_education = pd.read_csv('/content/resume_education.csv')
df_resume_language = pd.read_csv('/content/resume_language.csv')

In [57]:
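print(df_apply_train.shape)
print("Number of resume IDs (number of applicants):", len(df_apply_train['resume_seq'].unique()))
df_apply_train.head(1)

# resume_seq : unique ID of the resume to recommend for (the job seeker's ID)
# -> training uses 8,482 applicants -> these applicants applied to multiple postings -> 57,946 application records in total

# recruitment_seq : unique ID of the job posting recommended for the resume (the posting's ID, not the company's)

(57946, 2)
Number of resume IDs (number of applicants): 8482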


Unnamed: 0,resume_seq,recruitment_seq
0,U05833,R03838


In [58]:
print(df_company.shape)
print("Unique postings:", len(df_company['recruitment_seq'].unique()))
# Unique postings: 2377
print("Company types:", len(df_company['company_type_seq'].unique()))
# Company types: 6
print("Primary industry types:", len(df_company['supply_kind'].unique()))
# Primary industry types: 17
df_company.head(1)

(2377, 4)
Unique postings: 2377
Company types: 6
Primary industry types: 17


Unnamed: 0,recruitment_seq,company_type_seq,supply_kind,employee
0,R02073,2,514,20


In [59]:
print(df_recruitment.shape)
df_recruitment.head(1)

(6695, 11)


Unnamed: 0,recruitment_seq,address_seq1,address_seq2,address_seq3,career_end,career_start,check_box_keyword,education,major_task,qualifications,text_keyword
0,R02264,3.0,,,0,0,2507;2707;2810,4,8,1,


In [60]:
print(df_resume.shape)
df_resume.head(1)

(8482, 13)


Unnamed: 0,resume_seq,reg_date,updated_date,degree,graduate_date,hope_salary,last_salary,text_keyword,job_code_seq1,job_code_seq2,job_code_seq3,career_month,career_job_code
0,U00606,2020-03-04,2020-05-22,4,2008,3500.0,3500.0,스타일디자이너;우븐디자이너,재료·화학·섬유·의복,,,67,


In [61]:
print(df_resume_certificate.shape)
df_resume_certificate.head(1)

(12975, 2)


Unnamed: 0,resume_seq,certificate_contents
0,U06421,손해보험사


In [62]:
print(df_resume_education.shape)
df_resume_education.head(1)

(8482, 14)


Unnamed: 0,resume_seq,hischool_type_seq,hischool_special_type,hischool_nation,hischool_gender,hischool_location_seq,univ_type_seq1,univ_type_seq2,univ_transfer,univ_location,univ_major,univ_sub_major,univ_major_type,univ_score
0,U01419,21,일반고,사립,남자학교,3,5,5,0,3,,,9,60.0


In [63]:
print(df_resume_language.shape)
df_resume_language.head(1)

(869, 4)


Unnamed: 0,resume_seq,language,exam_name,score
0,U01774,2,4,742.42


In [64]:
df_recruitment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6695 entries, 0 to 6694
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   recruitment_seq    6695 non-null   object 
 1   address_seq1       6694 non-null   float64
 2   address_seq2       100 non-null    float64
 3   address_seq3       9 non-null      float64
 4   career_end         6695 non-null   int64  
 5   career_start       6695 non-null   int64  
 6   check_box_keyword  6695 non-null   object 
 7   education          6695 non-null   int64  
 8   major_task         6695 non-null   int64  
 9   qualifications     6695 non-null   int64  
 10  text_keyword       707 non-null    object 
dtypes: float64(3), int64(5), object(3)
memory usage: 575.5+ KB


In [65]:
df_recruitment = df_recruitment[~df_recruitment['address_seq1'].isna()]

In [66]:
df_recruitment = df_recruitment.fillna(value=0)  # fillna returns a new DataFrame, so assign it back
df_recruitment

Unnamed: 0,recruitment_seq,address_seq1,address_seq2,address_seq3,career_end,career_start,check_box_keyword,education,major_task,qualifications,text_keyword
0,R02264,3.0,0.0,0.0,0,0,2507;2707;2810,4,8,1,0
1,R06317,3.0,0.0,0.0,0,0,2204;2205;2707,3,2,1,0
2,R04017,3.0,0.0,0.0,0,0,2101;2108;2201;2707,3,2,1,0
3,R02865,3.0,0.0,0.0,0,0,2201;2204;2205;2707,2,2,1,0
4,R04890,3.0,0.0,0.0,0,0,2201;2204;2205;2707,2,2,2,0
...,...,...,...,...,...,...,...,...,...,...,...
6690,R03678,3.0,0.0,0.0,0,0,2101;2108;2201;2204;2205;2707,3,2,1,0
6691,R04593,3.0,0.0,0.0,0,0,2201;2204;2205;2707,4,2,1,0
6692,R03252,3.0,0.0,0.0,0,0,2109,3,2,1,0
6693,R05130,3.0,0.0,0.0,0,0,2201;2204;2205;2707,2,2,2,0


In [67]:
### DACON baseline code (collaborative filtering) ###

# Build the user-item matrix: 1 if the job seeker applied to the posting, 0 otherwise
user_item_matrix = df_apply_train.groupby(['resume_seq', 'recruitment_seq']).size().unstack(fill_value=0)
user_item_matrix[user_item_matrix > 1] = 1

# Compute user-to-user similarity
user_similarity = cosine_similarity(user_item_matrix)

# Compute recommendation scores (similarity-weighted average of all users' applications)
user_predicted_scores = user_similarity.dot(user_item_matrix) / np.array([np.abs(user_similarity).sum(axis=1)]).T

# Recommend postings, excluding those already applied to
recommendations = []
for idx, user in enumerate(user_item_matrix.index):
    # Postings this user has applied to
    applied_jobs = set(user_item_matrix.loc[user][user_item_matrix.loc[user] == 1].index)

    # This user's recommendation scores (sorted from highest to lowest)
    sorted_job_indices = user_predicted_scores[idx].argsort()[::-1]
    recommended_jobs = [job for job in user_item_matrix.columns[sorted_job_indices] if job not in applied_jobs][:5]

    for job in recommended_jobs:
        recommendations.append([user, job])
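To see what the scoring step does, the same user-based collaborative filtering computation can be traced on a tiny matrix. This is a sketch on made-up data using only NumPy; the names `M`, `sim`, `scores` and `top_k` are illustrative and not taken from the baseline. The row-normalised dot product below is the cosine similarity between rows.

```python
import numpy as np

# Toy user-item matrix (made up): rows are resumes, columns are postings,
# 1 means the resume applied to the posting.
users = ["U1", "U2", "U3"]
items = ["R1", "R2", "R3", "R4"]
M = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# Cosine similarity between users: normalise each row, then take dot products.
norms = np.linalg.norm(M, axis=1, keepdims=True)
unit = M / norms
sim = unit @ unit.T

# Predicted score = similarity-weighted average of all users' rows
# (the user's own row is included with weight 1).
scores = sim @ M / np.abs(sim).sum(axis=1, keepdims=True)


def top_k(user_idx, k=2):
    """Highest-scoring postings the user has not applied to yet."""
    order = np.argsort(-scores[user_idx], kind="stable")
    unseen = [j for j in order if M[user_idx, j] == 0]
    return [items[j] for j in unseen[:k]]


for idx, user in enumerate(users):
    print(user, top_k(idx))
```

U1 and U2 share two postings, so U2's extra posting R3 ranks first for U1. When a user has fewer than `k` unseen postings, the list comes back shorter, which on the real data would violate the five-per-resume rule and needs a fallback.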

In [68]:
col_recommend = pd.DataFrame(recommendations)
col_recommend.head(10)

Unnamed: 0,0,1
0,U00001,R01528
1,U00001,R03811
2,U00001,R06276
3,U00001,R00165
4,U00001,R02888
5,U00002,R02412
6,U00002,R04074
7,U00002,R01081
8,U00002,R05574
9,U00002,R04070


In [69]:
### Content-based recommendation modeling ###
## DataFrames used for feature extraction: df_resume, df_resume_certificate, df_resume_education, df_resume_language
##

In [70]:
df_resume.head()

# Use every column; how should text_keyword and job_code_seq1 be handled?

Unnamed: 0,resume_seq,reg_date,updated_date,degree,graduate_date,hope_salary,last_salary,text_keyword,job_code_seq1,job_code_seq2,job_code_seq3,career_month,career_job_code
0,U00606,2020-03-04,2020-05-22,4,2008,3500.0,3500.0,스타일디자이너;우븐디자이너,재료·화학·섬유·의복,,,67,
1,U00509,2019-08-25,2020-09-02,2,0,0.0,3700.0,상품기획;MD;기획;머천다이저;머천다이징,재료·화학·섬유·의복,,,84,섬유;봉제;가방;의류
2,U02012,2017-11-20,2020-01-26,5,1979,3500.0,3100.0,니트디자인,재료·화학·섬유·의복,,,121,학교;학원;직원훈련(교육서비스)
3,U04599,2020-05-13,2020-05-28,4,2012,0.0,2500.0,MD;기획MD,재료·화학·섬유·의복,,,24,섬유;봉제;가방;의류
4,U07573,2019-07-23,2020-03-08,4,2010,1900.0,0.0,디자이너;남성복;스포츠웨어;편집디자인;코디네이터;일러스트레이터;VMD;MD,재료·화학·섬유·의복,,,0,
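One possible way to handle `text_keyword` is to split each `;`-separated string into a list of tokens, which is the input shape that embedding models such as Word2Vec expect. The helper below is a sketch; `split_keywords` is a hypothetical name, and the sample values are copied from the output above.

```python
def split_keywords(value, sep=";"):
    """Turn a ';'-separated keyword string into a list of tokens.

    Missing values (None, NaN, or anything that is not a string) become an
    empty list so that every resume still yields one "sentence".
    """
    if not isinstance(value, str):
        return []
    return [token.strip() for token in value.split(sep) if token.strip()]


# Sample values from df_resume.head(); None stands in for a missing value.
raw = ["스타일디자이너;우븐디자이너", "MD;기획MD", None]
sentences = [split_keywords(v) for v in raw]
print(sentences)
```

On the real column this would be `df_resume['text_keyword'].apply(split_keywords).tolist()`. Whether empty lists should be dropped before training is a separate choice.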


In [71]:
df_resume_certificate.head(1)

# certificate_contents
# Tokenize into words; BERT is probably unnecessary, word2vec should be enough?

Unnamed: 0,resume_seq,certificate_contents
0,U06421,손해보험사


In [72]:
df_resume_education.head()
# hischool_type_seq, hischool_special_type, univ_type_seq1, univ_type_seq2, univ_location, univ_major, univ_sub_major, univ_major_type, univ_score
# Why is univ_major NaN? -> if univ_major_type is also NaN, keep the row as 0; if univ_major_type is present, drop univ_major
# Very few rows have major information -> keep major in the features anyway? -> compare performance

Unnamed: 0,resume_seq,hischool_type_seq,hischool_special_type,hischool_nation,hischool_gender,hischool_location_seq,univ_type_seq1,univ_type_seq2,univ_transfer,univ_location,univ_major,univ_sub_major,univ_major_type,univ_score
0,U01419,21,일반고,사립,남자학교,3,5,5,0,3,,,9,60.0
1,U03375,21,일반고,사립,여자학교,3,5,5,0,3,,,4,80.0
2,U06523,21,일반고,사립,남여공학,3,5,5,0,3,,,8,70.0
3,U06619,21,일반고,사립,남여공학,5,5,5,0,5,,,8,80.0
4,U05015,16,특성화고,공립,남여공학,3,5,5,0,3,,,9,80.0


In [73]:
resume_education_use = df_resume_education[['resume_seq','hischool_type_seq', 'hischool_special_type', 'univ_type_seq1', 'univ_type_seq2', 'univ_location', 'univ_major', 'univ_sub_major', 'univ_major_type', 'univ_score']]

In [74]:
resume_education_use.head(1)

Unnamed: 0,resume_seq,hischool_type_seq,hischool_special_type,univ_type_seq1,univ_type_seq2,univ_location,univ_major,univ_sub_major,univ_major_type,univ_score
0,U01419,21,일반고,5,5,3,,,9,60.0


In [75]:
df_resume_language.head(1)

# language, exam_name, score

Unnamed: 0,resume_seq,language,exam_name,score
0,U01774,2,4,742.42


In [76]:
# resume_use -> the columns in resume_need_token need tokenizing; resume_certificate_use -> needs tokenizing
# resume_education_use, resume_language_use

In [77]:
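# Assumption: resume_use is df_resume with all columns kept (see the earlier note on using every column)
resume_use = df_resume.copy()
resume_need_token = ['text_keyword','job_code_seq1','job_code_seq2','job_code_seq3']
resume_use.head(1)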

In [None]:
df_resume_certificate = df_resume_certificate.dropna()

In [None]:
result_sum = df_resume_certificate.groupby('resume_seq')['certificate_contents'].agg(', '.join).reset_index()

In [None]:
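## Encoding

## model = Word2Vec(sentences=data, vector_size=100, window=5, min_count=1, sg=0)

## df_resume : text_keyword, job_code_seq1, job_code_seq2, job_code_seq3, career_job_code
# Word2Vec expects an iterable of token lists, so split the ';'-separated keywords and skip missing values
keyword_sentences = df_resume['text_keyword'].dropna().str.split(';').tolist()
model = Word2Vec(sentences=keyword_sentences, vector_size=100, window=5, min_count=1, sg=0)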


## df_resume_certificate : certificate_contents



## df_resume_education : hischool_special_type, univ_major, univ_sub_major




In [None]:
result_1 = pd.merge(df_apply_train, result_sum, on='resume_seq', how='left')
result_1.head(1)
## State after merging in the certificates

In [None]:
result_1.shape

In [None]:
result_1.fillna(0, inplace = True)
result_1.info()

In [None]:
## Want to merge the remaining DataFrames as well

In [None]:
df_resume = df_resume.dropna(subset=['text_keyword'])

In [None]:
df_resume.fillna(0, inplace = True)

In [None]:
# df_resume['job_code_seq2'] = df_resume['job_code_seq2'].fillna(0, inplace=True)
# df_resume['job_code_seq3'] = df_resume['job_code_seq3'].fillna(0, inplace=True)
df_resume.info()

In [None]:
result_2 = pd.merge(result_1, df_resume, on='resume_seq', how='left')
result_2.head(2)

In [None]:
result_3 = pd.merge(result_2, resume_education_use, on='resume_seq', how='left')
result_3.head(2)

In [None]:
result_3.fillna(0, inplace = True)

In [None]:
df_resume_language.head(1)

In [None]:
result_4 = pd.merge(result_3, df_resume_language, on='resume_seq', how='left')
result_4.head(2)

In [None]:
result_4.fillna(0, inplace = True)
result_4.info()
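A left merge repeats a left-hand row once for every matching right-hand row, so a resume with several language-test rows would appear several times per application. The sketch below shows the effect on made-up frames and one way to guard against it. Keeping only the best-scoring row per resume is an assumption for illustration, since scores from different exams are not directly comparable.

```python
import pandas as pd

# Toy frames (made up): one resume has two language-test rows.
applications = pd.DataFrame({
    "resume_seq": ["U1", "U1", "U2"],
    "recruitment_seq": ["R1", "R2", "R3"],
})
language = pd.DataFrame({
    "resume_seq": ["U1", "U1"],
    "language": [2, 3],
    "score": [742.42, 650.0],
})

# A plain left merge repeats each U1 application once per language row.
naive = applications.merge(language, on="resume_seq", how="left")

# Assumption: keep only the best-scoring test per resume before merging.
best = (language.sort_values("score", ascending=False)
                .drop_duplicates("resume_seq"))

# validate="many_to_one" raises MergeError if the right side still has
# repeated keys, so the row count cannot grow silently.
safe = applications.merge(best, on="resume_seq", how="left",
                          validate="many_to_one")

print(len(applications), len(naive), len(safe))
```

Comparing `len()` before and after each merge is a cheap check that the number of application rows is unchanged.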

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re

In [None]:
preprocess = ['text_keyword', 'job_code_seq1', 'job_code_seq2', 'job_code_seq3', 'career_job_code', 'certificate_contents', 'hischool_special_type', 'univ_major', 'univ_sub_major']

In [None]:
result_4[preprocess].head(2)

In [None]:
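# Keep spaces, Hangul, Latin letters and digits; '|' is literal inside [...], so it is left out of the class
special_characters = re.compile('[^ ㄱ-ㅎㅏ-ㅣ가-힣a-zA-Z0-9]+')

for col in preprocess:
    # Replace removed runs (e.g. ';') with a space so adjacent keywords do not merge
    result_4[col] = result_4[col].apply(lambda x: special_characters.sub(' ', str(x)).strip())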
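Cleaning patterns for keyword columns are easy to get subtly wrong, so it is worth checking one on a single sample value. The sketch below compares two variants on a made-up string; the variable names are illustrative, and neither pattern is claimed to be the right choice for this dataset.

```python
import re

sample = "상품기획;MD;기획MD"

# Variant 1: keep only spaces, Hangul and digits, deleting everything else.
# Inside a character class '|' is a literal pipe, not an "or".
strip_pattern = re.compile('[^ ㄱ-ㅎ|ㅏ-ㅣ|가-힣|0-9]+')
stripped = strip_pattern.sub('', sample)

# Variant 2: also keep Latin letters, and replace removed runs with a space
# so that keyword boundaries survive.
keep_pattern = re.compile('[^ ㄱ-ㅎㅏ-ㅣ가-힣a-zA-Z0-9]+')
spaced = keep_pattern.sub(' ', sample).strip()

print(repr(stripped))
print(repr(spaced))
print(spaced.split())
```

The first variant drops Latin keywords such as `MD` and fuses the neighbouring Hangul keywords into one token. The second keeps three separate tokens.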

In [None]:
time_df = ['reg_date' ,'updated_date']
result_4[time_df].head(1)

In [None]:
import datetime

In [None]:
result_4['reg_year'] = pd.to_datetime(result_4['reg_date']).dt.year
result_4['reg_month'] = pd.to_datetime(result_4['reg_date']).dt.month
result_4['reg_day'] = pd.to_datetime(result_4['reg_date']).dt.day

result_4['updated_year'] = pd.to_datetime(result_4['updated_date']).dt.year
result_4['updated_month'] = pd.to_datetime(result_4['updated_date']).dt.month
result_4['updated_day'] = pd.to_datetime(result_4['updated_date']).dt.day

result_4 = result_4.drop(columns = ['reg_date','updated_date'])

In [None]:
result_4.info()

In [None]:
tfidf = TfidfVectorizer()
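One way a TF-IDF vectorizer could feed a content-based recommender is to turn each resume's keyword text into a vector and compare the vectors with cosine similarity. The sketch below uses made-up, space-separated keyword strings and scikit-learn's default tokenization. It illustrates the idea only and is not the notebook's final pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy keyword strings (made up, one per resume), already space-separated.
docs = [
    "스타일디자이너 우븐디자이너",
    "우븐디자이너 니트디자인",
    "상품기획 머천다이징",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)   # sparse matrix, one row per resume
sim = cosine_similarity(X)      # dense (n_resumes, n_resumes) array

print(sim.round(2))
```

The first two resumes share a keyword and get a positive similarity, while the third shares none with the first and scores zero. The default tokenizer ignores single-character tokens and lowercases Latin text, which may or may not suit these columns.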

In [None]:
result_4[preprocess].head()

In [None]:
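# Assumption: `data` (not defined earlier) is one token list per row, built from the cleaned text columns
data = result_4[preprocess].astype(str).agg(' '.join, axis=1).str.split().tolist()
model = Word2Vec(sentences=data, vector_size=100, window=5, min_count=1, sg=0)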