# reddit word2vec with gensim

#### Simon Lindgren 190502

A straightforward word2vec model trained on all Reddit comments from 2018.

The data was downloaded from [pushshift.io](https://files.pushshift.io/reddit/submissions/).



**Data preparation**: The `body` field was extracted from the downloaded JSON data and split by sentence into one large file with one sentence per line. Punctuation and leading/trailing whitespace were removed.

The resulting sentences were saved in `sentences/reddit-sentences-2018`.
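
The preparation steps above can be sketched roughly as follows. This is not the original preprocessing code. It assumes the dump is newline-delimited JSON (one comment object per line) and uses a naive regex sentence splitter, since the splitter actually used is not documented here. The names `split_sentences`, `clean_sentence`, and `extract_sentences` are hypothetical.

```python
import json
import re
import string

# Naive sentence boundary: whitespace that follows ., ! or ?
# (assumption: the splitter used for the real data is not documented).
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')
# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def split_sentences(text):
    """Split a comment body into rough sentences."""
    return [s for s in _SENTENCE_END.split(text) if s.strip()]


def clean_sentence(sentence):
    """Remove ASCII punctuation and collapse whitespace (which also strips both ends)."""
    return ' '.join(sentence.translate(_PUNCT_TABLE).split())


def extract_sentences(json_lines):
    """Yield cleaned sentences from an iterable of JSON strings, one comment per string."""
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        body = json.loads(raw).get('body', '')
        for sentence in split_sentences(body):
            cleaned = clean_sentence(sentence)
            if cleaned:
                yield cleaned
```

Writing each yielded sentence to a file under `sentences/`, followed by a newline, would give the one-sentence-per-line layout that the training cell expects.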

The model was trained with the gensim package, using the code in the cell below (based on [this tutorial](https://rare-technologies.com/word2vec-tutorial/)).

In [None]:
# Import libraries and set up logging
import logging
import os

import gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

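# Prepare a memory-friendly iterator that won't try to keep the entire sentence list in RAM
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Stream every file in the directory, yielding one tokenized sentence at a time
        for fname in os.listdir(self.dirname):
            # The context manager closes each file after reading (assumes UTF-8 encoded files)
            with open(os.path.join(self.dirname, fname), encoding='utf-8') as f:
                for line in f:
                    yield line.split()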
 
 
# Point it to a directory containing one or several files with one sentence per line 
sentences = MySentences('sentences')

# Train the w2v model
model = gensim.models.Word2Vec(sentences, min_count=40)  # ignore words occurring fewer than 40 times

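# If you don't plan to train the model any further, calling
# init_sims(replace=True) keeps only the L2-normalized vectors, which saves memory.
# Note: init_sims is deprecated as of gensim 4.0, where this call is no longer needed.
model.init_sims(replace=True)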

# Save the model to disk
model.save('reddit-w2v.model')
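
Once saved, the model can be loaded back and queried for similar words. The sketch below is not part of the original notebook. It assumes the file `reddit-w2v.model` written by the cell above, and the helper names `load_vectors` and `nearest_words` are hypothetical. `Word2Vec.load` and `most_similar` are the gensim calls doing the work.

```python
def load_vectors(path='reddit-w2v.model'):
    """Load the saved model and return its word vectors (a gensim KeyedVectors object)."""
    import gensim  # imported lazily so nearest_words can be used without loading gensim
    return gensim.models.Word2Vec.load(path).wv


def nearest_words(wv, word, topn=10):
    """Return (word, cosine similarity) pairs, or [] if the word is out of vocabulary."""
    # Words that fell below the min_count threshold during training are not in the vocabulary.
    if word not in wv:
        return []
    return wv.most_similar(positive=[word], topn=topn)
```

A call such as `nearest_words(load_vectors(), 'python')` would then return up to ten neighbours of `python`, or an empty list if that word occurred fewer than 40 times in the corpus.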