yanxht/TripAdvisorData

Processed TripAdvisor Data

This repo shares the TripAdvisor review and rating data used in Yan, X. and Bien, J. (2018), "Rare Feature Selection in High Dimensions". It includes training data (Xtrain, ytrain), testing data (Xtest, ytest), adjectives from the reviews, and a binary matrix that encodes a hierarchical tree over the terms. These structured data were processed from raw data crawled by Wang et al. (2010). The hierarchical tree relating the terms was generated from 100-dimensional embeddings pre-trained by GloVe (Pennington et al., 2014) on the Gigaword5 and Wikipedia2014 corpora. In constructing the tree, we also leveraged the NRC Emotion Lexicon (Mohammad and Turney, 2013) to separate positive and negative sentiments. Please refer to Section 6 of Yan and Bien (2018) for details on how the data were processed.
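To make the idea of a binary matrix encoding a tree concrete, here is a minimal sketch on a toy tree. It assumes one common convention, namely that entry (j, k) is 1 when tree node k is term j itself or one of its ancestors; the README does not spell out the exact layout of the shipped matrix, so check Section 6 of the paper before relying on this. The function name `tree_to_binary_matrix`, the `parent` array, and the toy data are hypothetical and not part of this repo.

```python
import numpy as np


def tree_to_binary_matrix(parent, n_leaves):
    """Build A with A[j, k] = 1 if node k is leaf j or an ancestor of leaf j.

    parent[k] is the index of node k's parent, or -1 for the root.
    Nodes 0..n_leaves-1 are assumed to be the leaves (the terms).
    """
    n_nodes = len(parent)
    A = np.zeros((n_leaves, n_nodes), dtype=np.int8)
    for leaf in range(n_leaves):
        node = leaf
        while node != -1:  # walk from the leaf up to the root
            A[leaf, node] = 1
            node = parent[node]
    return A


# Toy tree (assumption): leaves 0, 1, 2; node 3 joins leaves 0 and 1;
# node 4 is the root and joins node 3 and leaf 2.
parent = [3, 3, 4, 4, -1]
A_toy = tree_to_binary_matrix(parent, n_leaves=3)

# One review with term counts for the three leaves.
X_toy = np.array([[1, 0, 2]])

# Multiplying by A sums the counts of all leaves under each tree node.
aggregated = X_toy @ A_toy
print(A_toy)
print(aggregated)
```

Under this convention, the columns of `X_toy @ A_toy` are counts aggregated over each subtree, which is one way rare terms can be pooled with related terms.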

Load the Data

  • In R
load("tripadvisor.RData")
  • In Python (requires numpy and scipy)
from loadtxtdata import LoadTripAdvisorData
term, A, Xtrain, Xtest, ytrain, ytest = LoadTripAdvisorData()
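After loading, a quick consistency check can catch a mismatched or truncated file. The sketch below is hypothetical and not part of this repo: the helper `check_shapes` and the toy stand-in data are illustrative, and it assumes that the columns of Xtrain and Xtest and the rows of A each correspond to one entry of term, and that the matrices expose a `.shape` attribute (NumPy arrays or SciPy sparse matrices).

```python
import numpy as np
from scipy import sparse


def check_shapes(term, A, Xtrain, Xtest, ytrain, ytest):
    """Verify that dimensions agree under the ASSUMED layout; return a summary."""
    p = len(term)
    checks = {
        "Xtrain rows vs ytrain": Xtrain.shape[0] == len(ytrain),
        "Xtest rows vs ytest": Xtest.shape[0] == len(ytest),
        "Xtrain columns vs terms": Xtrain.shape[1] == p,
        "Xtest columns vs terms": Xtest.shape[1] == p,
        "A rows vs terms": A.shape[0] == p,
    }
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise ValueError("dimension mismatch: " + ", ".join(failed))
    return {
        "n_train": Xtrain.shape[0],
        "n_test": Xtest.shape[0],
        "n_terms": p,
        "n_tree_nodes": A.shape[1],
    }


# Toy stand-ins for the real data (assumption: made-up values, 3 terms).
toy_term = ["good", "bad", "clean"]
toy_A = sparse.csr_matrix(np.array([[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]))
toy_Xtrain = sparse.csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
toy_Xtest = sparse.csr_matrix(np.array([[0, 0, 1]]))
toy_ytrain = np.array([5.0, 1.0])
toy_ytest = np.array([4.0])

summary = check_shapes(toy_term, toy_A, toy_Xtrain, toy_Xtest, toy_ytrain, toy_ytest)
print(summary)
```

With the real data, the same call would take the six objects returned by LoadTripAdvisorData() in place of the toy values.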

If you use this dataset in your research, please consider citing the following paper:

@article{yan2018rare,
  title={Rare feature selection in high dimensions},
  author={Yan, Xiaohan and Bien, Jacob},
  journal={arXiv preprint arXiv:1803.06675},
  year={2018}
}
