Create Term Document Matrix or TF-IDF

The script takes a CSV file with a (preprocessed) text column and outputs a term-document matrix (TDM) and, optionally, a TF-IDF matrix. It also prints a summary that includes the most frequent terms, the most infrequent (sparse) terms, and so on. See below for more detail.
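To make the term-document matrix concrete, here is a minimal standard-library sketch that counts unigrams and bigrams per document, matching the default n-gram range of 1 to 2 listed in the options. It is an illustration only, not the code in tdm.py (which relies on sklearn); the helper names `ngrams` and `build_tdm` are hypothetical, and whitespace tokenization is an assumption.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Yield space-joined n-grams for every n in [n_min, n_max]."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_tdm(docs, n_min=1, n_max=2):
    """Return (sorted vocabulary, one row of term counts per document)."""
    # Assumption: the text is already preprocessed, so splitting on
    # whitespace is enough to get tokens.
    counts = [Counter(ngrams(doc.split(), n_min, n_max)) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


docs = ["the cat sat", "the cat ran"]
vocab, rows = build_tdm(docs)

# Total count of each term over all documents, used for a frequency summary.
totals = Counter(dict(zip(vocab, map(sum, zip(*rows)))))
for term, count in totals.most_common(3):
    print(term, count)
```

Each row corresponds to one input document and each column to one term, which is the shape of the matrix the script saves to CSV.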

The script depends on sklearn (scikit-learn), which in turn depends on numpy and scipy. To install the dependencies:

sudo apt-get install -qq python-numpy python-scipy
pip install -U scikit-learn

To run the script:

python tdm.py [options] <CSV input file>

The script options and their default values are:

  -h, --help            show this help message and exit
                        Text column name (default: 'Body')
                        Label column(s) name (default: 'Online Section')
  -d DELIMITER, --delimiter=DELIMITER
                        Delimeter use to split option's value if multiple
                        values (default: ';')
                        Minimum ngram(s) (default: 1)
                        Maximum ngram(s) (default: 2)
                        Maximum features (default: 2**16)
  --n-freq=N_FREQ       Report most frequent terms (default: 20)
  --n-sparse=N_SPARSE   Report most sparse terms (default: 20)
  -r REMOVE_TERMS_FILE, --remove-terms-file=REMOVE_TERMS_FILE
                        File name contains terms to be removed (default: None)
                        Top most of frequent term(s) to be removed (default:
                        Top most of sparse term(s) to be removed (default: 0)
                        Save output TDM to CSV filename (default: None)
  --use-tfidf           Use TF-IDF (default: False)
                        Save output TF-IDF to CSV filename (default: None)
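The options for removing the most frequent and most sparse terms can be pictured with the sketch below. It assumes that "frequent" means the highest total count across all documents and that "sparse" means appearing in the fewest documents; tdm.py may define these differently. The function `prune_terms` and its parameter names are hypothetical.

```python
def prune_terms(vocab, rows, remove_n_freq=0, remove_n_sparse=0):
    """Drop the top-N most frequent and top-N most sparse term columns."""
    columns = list(zip(*rows))
    totals = [sum(col) for col in columns]
    # Document frequency: number of documents in which the term occurs.
    doc_freq = [sum(1 for v in col if v > 0) for col in columns]

    idx = range(len(vocab))
    most_frequent = sorted(idx, key=lambda j: -totals[j])[:remove_n_freq]
    most_sparse = sorted(idx, key=lambda j: doc_freq[j])[:remove_n_sparse]
    drop = set(most_frequent) | set(most_sparse)

    keep = [j for j in idx if j not in drop]
    kept_vocab = [vocab[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_vocab, kept_rows


vocab = ["cat", "dog", "the", "zebra"]
rows = [
    [1, 0, 3, 0],
    [1, 1, 2, 0],
    [0, 1, 4, 1],
]
kept_vocab, kept_rows = prune_terms(vocab, rows, remove_n_freq=1, remove_n_sparse=1)
print(kept_vocab)
```

Here "the" is dropped as the most frequent term and "zebra" as the sparsest, leaving only the "cat" and "dog" columns.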


For example:

python tdm.py --max-features=100 --n-freq=10 --n-sparse=20 --remove-n-freq=10 --remove-n-sparse=40 -t speaking -l speaker_party --out-tdm-file=tdm.csv --use-tfidf --out-tfidf-file=tfidf.csv sample_in.csv

This command reads the text from the speaking column of sample_in.csv and produces both a TDM (saved to tdm.csv) and a TF-IDF matrix (saved to tfidf.csv).
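As a rough picture of how TF-IDF weights relate to the raw counts in the TDM, the sketch below applies the smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization. These are the documented defaults of scikit-learn's TfidfTransformer; whether tdm.py keeps those defaults is an assumption, and the function name `tfidf` is hypothetical.

```python
import math


def tfidf(rows):
    """Turn rows of raw term counts into L2-normalized TF-IDF weights."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency of each term (number of rows with a nonzero count).
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]

    weighted_rows = []
    for row in rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        weighted_rows.append([v / norm if norm else 0.0 for v in weighted])
    return weighted_rows


rows = [
    [1, 1, 0],  # term 0 occurs in both documents, term 1 only here
    [1, 0, 2],
]
weights = tfidf(rows)
print(weights[0])
```

A term that occurs in every document (column 0) gets a lower weight than an equally frequent term that occurs in only one document (column 1).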

Note that the TDM/TF-IDF output CSV file can be large and take a long time to save when there are many terms (columns).

An index column is also added to the output CSV file as a unique reference ID (the row index in the input CSV file).
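The index column makes it possible to match each output row back to its input row. The sketch below shows one way to do that with the standard csv module. The layout it assumes (index as the first output column, 0-based row numbering, a header named "index") is a guess rather than something confirmed by the script, and the inline CSV data is made up for illustration.

```python
import csv
import io

# Stand-ins for the input CSV and an output TDM CSV (hypothetical contents).
input_csv = io.StringIO(
    "speaker_party,speaking\n"
    "A,the cat sat\n"
    "B,the dog ran\n"
)
output_csv = io.StringIO(
    "index,cat,dog\n"
    "0,1,0\n"
    "1,0,1\n"
)

input_rows = list(csv.DictReader(input_csv))

joined = []
reader = csv.reader(output_csv)
header = next(reader)
for record in reader:
    # Assumption: the first column holds the 0-based input row index.
    row_id = int(record[0])
    counts = dict(zip(header[1:], map(int, record[1:])))
    joined.append((input_rows[row_id]["speaker_party"], counts))

for label, counts in joined:
    print(label, counts)
```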