An attempt at the Kaggle competition "Personalized Web Search Challenge" (hosted by Yandex)

URL: http://www.kaggle.com/c/yandex-personalized-web-search-challenge

Deadline: Friday, January 10, 2014

Team Members

Yosuke Sugishita
David Kim
Possibly Brendan and David Hsiao

Idea: Should we open this up to people in the Data Science Club and local data meetups on Meetup.com? If too many people join (I think 4-5 people per team is the limit), we can form multiple teams and still work together.

Ideas on our team name

Asian Revolution
West Coasters
Canadian Kimchi Roll

File structure

script
- file_manupulation
- analysis
lib
- Functions / classes to use in other scripts.
test
- Scripts to test functions.
data
- Contains all the data, such as the train and test files. Not committed because the files are large; download them directly from Kaggle.

About branches / pull requests

All code must be reviewed by at least one other person before it is merged into master. Make a branch, write code, test it, and send a pull request. Use short, descriptive names for branches.

Never work directly on master.

Tools

Version Control
- Git with scm_breeze (https://github.com/ndbroadbent/scm_breeze)
  - Note, scm_breeze is a must. It's a huge productivity booster.
Language(s)
- Python
Editors
- Vim
- PyCharm might be good. The same company's Ruby IDE is awesome.
- Any other editors you like?
Database?
- Looks like some people on Kaggle tried to use databases, but it didn't work out very well:
- http://www.kaggle.com/c/yandex-personalized-web-search-challenge/forums/t/6183/handling-703-000-000-urls/
- http://www.kaggle.com/c/yandex-personalized-web-search-challenge/forums/t/6353/someone-else-using-r-mysql-as-database-need-some-feedback

Notes on possible strategies (more on the wiki)

Two ways to look at this problem:

1. Collaborative filtering (recommender) problem. See the Netflix Prize winners' solution: http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf
2. We can also look at the past clicks a given user has made. The user is probably more (or less) likely to click pages they have already clicked and liked. => Need to test this.

Our first strategy is based on 2. (Low-hanging fruit! Yay!)

https://github.com/yosukesugishita/personalized_search_challenge/wiki/Initial-Model:-Take-advantage-of-multiple-visits
Here is the paper that inspired this strategy: http://people.csail.mit.edu/teevan/work/publications/papers/wsdm11-pnav.pdf
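To make the idea concrete, here is a minimal sketch of re-ranking by past clicks. It is not the model described on the wiki; `rerank_by_past_clicks` and its arguments are hypothetical names, and it assumes we already have, for one user, the list of URL IDs they clicked before and the list of URL IDs returned for the current query.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def rerank_by_past_clicks(results: Sequence[Hashable],
                          past_clicks: Sequence[Hashable]) -> List[Hashable]:
    """Move results the user clicked before toward the top.

    Results are ordered by how often the user clicked them in the past.
    sorted() is stable, so results with the same click count (including
    never-clicked ones) keep their original relative order.
    """
    counts = Counter(past_clicks)  # missing keys count as 0
    return sorted(results, key=lambda url: -counts[url])


# Hypothetical example: URL 30 was clicked twice before, URL 20 once.
reranked = rerank_by_past_clicks([10, 20, 30, 40], [30, 30, 20])
```

Whether previously clicked pages really are more likely to be clicked again is exactly the thing we still need to test; this only shows what the re-ranking step could look like if they are.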


Some notes on the data

The train file is big (16GB when uncompressed)

We need to think about how to handle this. Perhaps use a database, such as SQLite or MySQL? I (Yosuke) suspect we can try our first strategies on a randomly sampled subset of the data. How would we go about sampling it?
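One possible way to sample without loading the whole file into memory is to stream it line by line and keep a fixed fraction of sessions. The sketch below is only a starting point. It assumes each line is tab-separated and begins with a session ID (an assumption about the log format; check the Kaggle data page), and `sample_sessions` is a hypothetical helper, not something already in lib.

```python
import zlib
from typing import Iterable, Iterator


def sample_sessions(lines: Iterable[str], rate: float,
                    seed: int = 0) -> Iterator[str]:
    """Yield roughly `rate` of the sessions, keeping each session whole.

    ASSUMPTION: the first tab-separated field of every line is the
    session ID. Hashing that ID (instead of drawing a random number per
    line) means all lines of a session are kept or dropped together,
    and the same seed always selects the same sessions.
    """
    threshold = int(rate * 2 ** 32)
    for line in lines:
        session_id = line.split("\t", 1)[0]
        # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
        h = zlib.crc32(f"{seed}:{session_id}".encode("utf-8"))
        if h < threshold:
            yield line


# Usage sketch (file names are placeholders):
# with open("data/train") as src, open("data/train_sample", "w") as dst:
#     dst.writelines(sample_sessions(src, rate=0.01))
```

Sampling by session keeps queries and their clicks together. If the first strategy needs a user's full history, sampling by user ID instead would probably be the better choice.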

Train and test

In the competition, the first 27 days are used as train data, and the last 3 days as test data. (http://www.kaggle.com/c/yandex-personalized-web-search-challenge/data)

Perhaps we can test our model locally by using the first 24 days of the train data for training and the following 3 days (days 25-27) as a local test set.
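A rough sketch of such a local split is below. It rests on assumptions that should be checked against the data description: that every session starts with a tab-separated metadata line whose second field is `M` and whose third field is the day number, that days are numbered from 1, and that the lines following it belong to that session. `label_by_day` is a hypothetical name.

```python
from typing import Iterable, Iterator, Optional, Tuple


def label_by_day(lines: Iterable[str], last_train_day: int = 24,
                 last_valid_day: int = 27) -> Iterator[Tuple[str, str]]:
    """Yield ("train" | "valid", line) pairs based on the session's day.

    ASSUMPTION: a metadata line looks like "SessionID<TAB>M<TAB>Day<TAB>...",
    and all lines up to the next metadata line belong to the same session.
    Sessions after `last_valid_day` are dropped.
    """
    bucket: Optional[str] = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[1] == "M":
            day = int(fields[2])
            if day <= last_train_day:
                bucket = "train"
            elif day <= last_valid_day:
                bucket = "valid"
            else:
                bucket = None
        if bucket is not None:
            yield bucket, line


# Usage sketch (file names are placeholders):
# with open("data/train") as src, \
#         open("data/local_train", "w") as tr, \
#         open("data/local_valid", "w") as va:
#     for bucket, line in label_by_day(src):
#         (tr if bucket == "train" else va).write(line)
```

Because it is a generator, it can run over the full train file without holding it in memory.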
