CS544 NLP project

Introduction of project

In this project, we have a set of users' information and search histories from a Chinese search engine. We develop a system that predicts a user's information (age, gender, and education) from their search history.

The following is an example record from the data set.

E36710951AD5D8379F85FFBFCF46780E 1 1 5 藏御堂28泡大丈夫功名未成小米概念机藏秘28泡贾汪功夫少林纪录片华为商城秦时明月之君临天下 l是多大号的衣服小米note2概念机健胃消食片张书省海鲜菇的做法搞笑动态图片秦时明月之诸子百家全集优酷下载苹果4s 扣扣下载好男儿志在四方重做手机系统如何才能成就一番事业为什么秋冬进补 cf陈子豪藏御堂28泡怎么样 oppo a617 丛大夫补气血的中药移动网上营业厅红米手机重做系统木头城子到朝阳客车血未冷健脾丸康爱多如何成就一番事业 28泡脚章子怡发誓再也不拍功夫片龙泉宝剑血未冷,梦还滚烫换个喇叭多少钱小米2es 古代形容当兵的诗句健脾丸的功效与作用高山流水古筝曲杨角沟到喀左大枣吃多了会上火吗苹果5s怎么样红米note重做系统 cf官网乱斗西游那个英雄厉害华为5a 360手机照身份证多长时间能下来腾讯手机管家红米1s m4黑龙大丈夫事业未成,何以为家无极剑圣无锋剑 vivo 功夫少林 vivo官网党参的功效与作用手机换个喇叭多少钱关于当兵的诗句吃大枣上火怎么办东皇太一龙眼肉的功效与作用好男儿就是要当兵苹果5s oppoa31 移动卡号码选号中关村在线 oppoa33 黑龙搜狗地图藏御堂28泡是假的吗换个充电接口多少钱 oppoa37 唐刀治疗神经衰弱的中草药成就事业的句子 m4a1 移动卡号码选号辽宁功夫少林纪录片的背景音乐 oppoa59 旷修大枣的功效与作用办理移动卡龙眼肉 oppo a90 藏秘二十八泡状况反应 cf刷枪有谁成功喀左有网吧吗治疗神经衰弱的中药藏密28泡跑酷视频朝阳县丛大夫秋冬进补的道理 m4a1黑龙如何才能成就大事治疗神经衰弱的药物 l是多大号的衣服男快照身份证需要多长时间华为4a

The hash string at the start of the record is the user's ID.

The first integer label is the user's age label. It has 7 possible values, listed below with their meanings.

Age label value	Meaning
0	unknown
1	0-18 years old
2	19-23 years old
3	24-30 years old
4	31-40 years old
5	41-50 years old
6	51-999 years old

The second integer label is the user's gender label. It has 3 possible values, listed below with their meanings.

Gender label value	Meaning
0	unknown
1	male
2	female

The third integer label is the user's education label. It has 7 possible values, listed below with their meanings.

Education label value	Meaning
0	unknown
1	PhD
2	Master
3	Bachelor
4	High School
5	Middle School
6	Primary School

The rest of the record is the user's search history. Queries are separated by "\t".
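As a rough sketch of how one record could be read, the snippet below splits a line into its ID, labels, and queries and decodes the labels using the tables above. It assumes that the ID and the three labels are tab-separated just like the queries; the helper name `parse_record` and the dictionaries are illustrative and not taken from the project code.

```python
AGE = {0: "unknown", 1: "0-18", 2: "19-23", 3: "24-30",
       4: "31-40", 5: "41-50", 6: "51-999"}
GENDER = {0: "unknown", 1: "male", 2: "female"}
EDUCATION = {0: "unknown", 1: "PhD", 2: "Master", 3: "Bachelor",
             4: "High School", 5: "Middle School", 6: "Primary School"}


def parse_record(line):
    """Split one raw line into (user_id, labels, queries).

    Assumption: ID, age, gender, education and every query are tab-separated.
    """
    fields = line.rstrip("\n").split("\t")
    user_id = fields[0]
    age, gender, education = (int(x) for x in fields[1:4])
    queries = [q for q in fields[4:] if q]  # skip empty fields
    labels = {"age": age, "gender": gender, "education": education}
    return user_id, labels, queries


# Shortened version of the example record; the query boundaries are guessed.
sample = "E36710951AD5D8379F85FFBFCF46780E\t1\t1\t5\t藏御堂28泡\t大丈夫功名未成\t小米概念机"
user_id, labels, queries = parse_record(sample)
print(user_id, AGE[labels["age"]], GENDER[labels["gender"]],
      EDUCATION[labels["education"]], len(queries))
```

If the real files separate the labels with spaces instead of tabs, only the splitting of the first four fields would need to change.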

I separated the data into training and testing parts: training is 75% of the total (15000 lines), and testing is 25% (5000 lines).
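A 75/25 split of this kind can be sketched as follows. This is only an illustration: it keeps the original line order, and whether the project shuffled the lines before splitting is not stated. The function name `split_train_test` is hypothetical.

```python
def split_train_test(lines, train_fraction=0.75):
    """Return (train, test), keeping the original line order."""
    cut = int(len(lines) * train_fraction)
    return lines[:cut], lines[cut:]


# Stand-in for the 20000 raw lines of the data set.
records = ["record %d" % i for i in range(20000)]
train, test = split_train_test(records)
print(len(train), len(test))
```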

Tools we use

jieba (https://github.com/fxsjy/jieba) is the word segmentation tool I used.

WordSegmentation.py writes its output to wordprocessed.txt. The output file has two lines: the first holds the user IDs and user information, and the second holds the user IDs and user queries. You can use a snippet like the following to retrieve the data and store it as a two-dimensional array.

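# Read the two lines written by WordSegmentation.py.
with open("wordprocessed.txt") as file:
    fileget = file.read().split("\n")
# eval executes the file's contents, so only load files you generated yourself.
userInfo = eval(fileget[0])   # line 1: user ID and user information
queryList = eval(fileget[1])  # line 2: user ID and user queries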

zh.json is a stop word list and sougou.dic is a Chinese word tag dictionary. Both are used to improve word segmentation accuracy.
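To illustrate how a stop word list can be applied after segmentation, here is a minimal sketch. It assumes the stop words are stored as a flat JSON array of strings, which may not match the actual layout of zh.json. The function names are hypothetical and the token list is a made-up example.

```python
import json


def load_stop_words(json_text):
    """Parse a stop-word list; assumes a flat JSON array of strings."""
    return set(json.loads(json_text))


def remove_stop_words(tokens, stop_words):
    """Drop stop words and whitespace-only tokens from a segmented query."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# In the project this text would be read from zh.json; a tiny inline
# stand-in is used here so the example runs on its own.
stop_words = load_stop_words('["的", "是", "吗"]')
tokens = ["大枣", "的", "功效", "与", "作用"]
print(remove_stop_words(tokens, stop_words))
```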

sklearn is the library we use for tf-idf feature extraction and SVM classification. http://scikit-learn.org/stable/modules/svm.html#svm
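The combination of tf-idf features and an SVM can be sketched with scikit-learn as below. This is not the project's actual code: the choice of `LinearSVC`, the toy documents, and the labels are all assumptions made for illustration. The input is assumed to be already segmented, with tokens joined by spaces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, already-segmented histories: one space-joined string per user.
train_docs = ["小米 手机 系统", "华为 手机 商城", "大枣 功效 作用", "党参 功效 作用"]
train_labels = [1, 1, 2, 2]  # made-up labels for illustration only

# token_pattern keeps every whitespace-delimited token, including
# single-character words that the default pattern would drop.
model = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\S+"), LinearSVC())
model.fit(train_docs, train_labels)

predicted = model.predict(["红米 手机", "龙眼肉 功效 作用"])
print(predicted)
```

Since age, gender, and education are separate labels, one such model would presumably be trained per attribute.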

Results we get

--SVM With Word Segmentation Accuracy-----------
age: 0.5714
gender: 0.8062
education: 0.55
average: 0.642533333333
--Naive Bayes With Word Segmentation Accuracy---
age: 0.2474
gender: 0.485
education: 0.2328
average: 0.321733333333
--SVM Without Word Segmentation Accuracy--------
age: 0.438
gender: 0.6932
education: 0.4324
average: 0.5212
--Neural Network Accuracy-----------------------
age: 0.4756
gender: 0.785
education: 0.4176
average: 0.5594
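In each block above, the average equals the mean of the three per-attribute accuracies. A minimal sketch of that calculation follows; the labels are made up, the function name `accuracy` is hypothetical, and it is not stated here whether users with an unknown (0) label are excluded in the project's own evaluation.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Made-up labels for four users, only to show the calculation.
gold = {"age": [1, 2, 3, 4], "gender": [1, 2, 1, 1], "education": [5, 3, 3, 4]}
pred = {"age": [1, 2, 2, 4], "gender": [1, 2, 1, 2], "education": [5, 3, 4, 6]}

scores = {name: accuracy(gold[name], pred[name]) for name in gold}
# The right-hand side is evaluated before "average" is added to the dict.
scores["average"] = sum(scores.values()) / len(scores)
for name, value in scores.items():
    print("%s: %.4f" % (name, value))
```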

