* Data downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/
 and saved at `/project/mstephens/zihao/text_dataset/`.
* I read the datasets and save them in MatrixMarket format (`*.mtx`).

* In Python, the files can be read with `scipy.io.mmread` (https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.mmread.html).

* In R, they can be read with
```{r} 
Matrix::readMM(filename) 
```
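
As a minimal sketch of the Python side, the round trip below writes a small sparse count matrix with `scipy.io.mmwrite` and reads it back with `scipy.io.mmread`. The toy matrix and the file name `toy_docword.mtx` are made up for illustration and are not part of the datasets above.

```python
import os
import tempfile

from scipy import sparse
from scipy.io import mmread, mmwrite

# Toy stand-in for a document-by-word count matrix (3 documents, 4 words).
counts = sparse.coo_matrix(
    ([2, 1, 5], ([0, 1, 2], [0, 3, 1])), shape=(3, 4), dtype=int
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this example.
    path = os.path.join(tmp_dir, "toy_docword.mtx")
    mmwrite(path, counts)
    # mmread returns a COO-format sparse matrix; convert to CSR for row slicing.
    loaded = mmread(path).tocsr()

print(loaded.shape)
print(loaded.toarray())
```

Converting to CSR is optional; it is convenient when documents (rows) are accessed one at a time.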

In [1]:
from scipy import sparse
from scipy.io import mmwrite
import numpy as np
import gzip
import time


In [2]:
def read_BOW(data_dir, data_name):
    print("read data: {}".format(data_name))
    row_idx = []
    col_idx = []
    val = []
    summary_names = ["Number of documents", "Vocab size","Number of nonzero", "percentage of nonzero"]
    summary = []
    with gzip.open("{}{}".format(data_dir, data_name),'rt') as f:
        idx = 0 
        for line in f:
            line_split = line.split()
            if idx <= 2: ## the first 3 lines give the number of documents, vocab size, and number of nonzeros
                summary.append(int(line_split[0]))
            else:
                ## indices in the file are 1-based; shift them to 0-based for python
                row_idx.append(int(line_split[0]) - 1)
                col_idx.append(int(line_split[1]) - 1)
                val.append(int(line_split[2]))
            idx += 1
    ## note: this is the fraction of nonzero entries (between 0 and 1), not a percentage
    summary.append(summary[2]/(summary[0]*summary[1]))
    for name, suma in zip(summary_names, summary):
        print("{}:{}".format(name, suma))
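    ## pass the shape explicitly so trailing empty documents or unused words are not dropped
    mat_coo = sparse.coo_matrix((val, (row_idx, col_idx)),
                                shape=(summary[0], summary[1]), dtype=int)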
    return mat_coo
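
For reference, the file layout that `read_BOW` expects can be illustrated on a toy input: three header lines (number of documents, vocab size, number of nonzeros) followed by 1-based `docID wordID count` triples. The sketch below is not the function above; `parse_docword` and the toy text are hypothetical and exist only to show the layout on an in-memory string.

```python
import io

from scipy import sparse

# Toy input in the docword layout: D, W, NNZ, then "docID wordID count" triples.
toy_text = """3
4
3
1 1 2
2 4 1
3 2 5
"""


def parse_docword(handle):
    """Parse a docword-style text stream into a sparse COO count matrix."""
    n_docs = int(handle.readline())
    n_words = int(handle.readline())
    n_nonzero = int(handle.readline())
    rows, cols, vals = [], [], []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        doc_id, word_id, count = line.split()
        rows.append(int(doc_id) - 1)   # 1-based -> 0-based
        cols.append(int(word_id) - 1)  # 1-based -> 0-based
        vals.append(int(count))
    # The header should agree with the number of triples actually read.
    assert len(vals) == n_nonzero
    return sparse.coo_matrix(
        (vals, (rows, cols)), shape=(n_docs, n_words), dtype=int
    )


toy = parse_docword(io.StringIO(toy_text))
print(toy.toarray())
```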

In [3]:
data_dir = "./"
data_names = ["docword.enron","docword.kos","docword.nips","docword.nytimes","docword.pubmed"]
for name in data_names:
    data_name = "{}.txt.gz".format(name)
    start = time.time()
    out = read_BOW(data_dir, data_name)
    print("save sparse matrix to MatrixMarket file")
    mmwrite(name, out, comment='', field=None, precision=None, symmetry=None)
    runtime = time.time() - start
    print("read and save data takes {}".format(runtime))
    print("\n")

read data: docword.enron.txt.gz
Number of documents:39861
Vocab size:28102
Number of nonzero:3710420
percentage of nonzero:0.0033123609274989824
save sparse matrix to MatrixMarket file
read and save data takes 11.63464903831482


read data: docword.kos.txt.gz
Number of documents:3430
Vocab size:6906
Number of nonzero:353160
percentage of nonzero:0.014909078935036842
save sparse matrix to MatrixMarket file
read and save data takes 1.0832927227020264


read data: docword.nips.txt.gz
Number of documents:1500
Vocab size:12419
Number of nonzero:746316
percentage of nonzero:0.04006312907641517
save sparse matrix to MatrixMarket file
read and save data takes 2.3032643795013428


read data: docword.nytimes.txt.gz
Number of documents:300000
Vocab size:102660
Number of nonzero:69679427
percentage of nonzero:0.0022624659718163517
save sparse matrix to MatrixMarket file
read and save data takes 211.67759895324707


read data: docword.pubmed.txt.gz
Number of documents:8200000
Vocab size:141043
Numb