A general pipeline for analyzing text data: acquire the text, preprocess it, then process and analyze it.
You can get text data from scraping, APIs, searchable PDFs, images of paper documents, etc. Some examples:
- Get text from searchable PDFs, e.g., get data from Wisconsin Ads storyboards using Python
- Get text from images of text using Tesseract from Python
- Get text from images of text using Abbyy FineReader Cloud OCR from R
- Get text from images of text using Captricity OCR from R
- Get Congressional Speech Data using Capitol Words API from the Sunlight Foundation
Preprocess text for text-as-data analysis.
Depending on the need, remove stop words, punctuation, and special characters, convert the text to lowercase, and stem the words.
- preprocess_csv takes a CSV with 'raw' text and outputs a CSV with processed text.
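The preprocessing steps listed above can be sketched in a few lines of Python. This is only an illustration, not the code in preprocess_csv: the `preprocess` and `naive_stem` functions, the tiny `STOP_WORDS` set, and the suffix-stripping rule are all assumptions made for the example. A real pipeline would likely use a full stop-word list and a proper stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "are"}


def naive_stem(word):
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, remove_stop_words=True, stem=True):
    """Lowercase, drop punctuation and special characters, then filter and stem."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("The Senators are voting on the bills!"))
```

Each step is behind a flag because, as noted above, which steps you want depends on the analysis.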
Output a simple or stratified random sample of a CSV, keeping only the columns you need, and get a summary of the key aspects of the data. Takes a CSV as input.
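As a rough sketch of the sampling step, the following uses only the Python standard library. The function names `sample_rows` and `keep_columns`, the column names, and the choice to draw a fixed number of rows per stratum are assumptions for illustration; the script in this repository may work differently.

```python
import csv
import io
import random
from collections import defaultdict


def sample_rows(rows, n, strata_col=None, seed=None):
    """Return a simple random sample of n rows, or n rows per stratum."""
    rng = random.Random(seed)
    if strata_col is None:
        return rng.sample(rows, min(n, len(rows)))
    # Group rows by the value of the stratification column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_col]].append(row)
    sampled = []
    for key in sorted(groups):
        group = groups[key]
        sampled.extend(rng.sample(group, min(n, len(group))))
    return sampled


def keep_columns(rows, columns):
    """Keep only the named columns in each row."""
    return [{c: row[c] for c in columns} for row in rows]


if __name__ == "__main__":
    # Hypothetical data standing in for a CSV file on disk.
    raw = "id,party,text\n1,D,tax\n2,D,health\n3,R,trade\n4,R,jobs\n5,R,debt\n"
    rows = list(csv.DictReader(io.StringIO(raw)))
    subset = keep_columns(sample_rows(rows, 1, strata_col="party", seed=1), ["id", "party"])
    writer = csv.DictWriter(io.StringIO(), fieldnames=["id", "party"])
    writer.writeheader()
    writer.writerows(subset)
    print(subset)
```

Passing a `seed` makes the sample reproducible, which helps when the sample is later hand-coded or shared.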
Create a term-document matrix and get some information about the matrix, including frequent and infrequent terms. Options are available for removing sparse terms, among others.
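To make the idea concrete, here is a minimal sketch of a term-document matrix built with the Python standard library. The function names and the definition of sparsity used here (the share of documents in which a term does not appear) are assumptions for the example, not necessarily what the script in this repository does.

```python
from collections import Counter


def term_document_matrix(docs):
    """Map each term to a list of its counts, one entry per document."""
    counts = [Counter(doc.split()) for doc in docs]
    terms = sorted(set().union(*counts))
    # Counter returns 0 for terms that a document does not contain.
    return {term: [c[term] for c in counts] for term in terms}


def remove_sparse_terms(matrix, max_sparsity):
    """Drop terms that are absent from more than max_sparsity of the documents."""
    kept = {}
    for term, row in matrix.items():
        sparsity = sum(1 for n in row if n == 0) / len(row)
        if sparsity <= max_sparsity:
            kept[term] = row
    return kept


def term_frequencies(matrix):
    """Return (term, total count) pairs, most frequent first."""
    totals = [(term, sum(row)) for term, row in matrix.items()]
    return sorted(totals, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    docs = ["tax cut tax", "tax reform", "health reform"]
    tdm = term_document_matrix(docs)
    print(term_frequencies(tdm))
    print(remove_sparse_terms(tdm, 0.5))
```

The head of the `term_frequencies` list gives the frequent terms, and the tail gives the infrequent ones.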
- Basic sentiment analysis using AFINN
- Classify text in R using SVM or Lasso. See Basic Text Classifier
- Worked-out example of how to model words as a function of ideology using Congressional speech data. See Speech Learn
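Lexicon-based sentiment scoring of the kind AFINN supports can be sketched as follows. AFINN assigns each listed word an integer valence between -5 and +5. The sketch assumes the lexicon is available as tab-separated "term, score" lines; the three entries and their scores below are illustrative placeholders rather than values copied from AFINN, and the function names are hypothetical. Multi-word lexicon entries, if any, are ignored by this simple tokenizer.

```python
import io
import re


def load_lexicon(handle):
    """Read tab-separated 'term<TAB>score' lines into a dict."""
    lexicon = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        term, score = line.rsplit("\t", 1)
        lexicon[term] = int(score)
    return lexicon


def sentiment_score(text, lexicon):
    """Sum the scores of all tokens found in the lexicon; unknown words count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


if __name__ == "__main__":
    # Placeholder entries; swap in the real lexicon file when available.
    sample = io.StringIO("good\t3\nbad\t-3\nterrible\t-3\n")
    lexicon = load_lexicon(sample)
    print(sentiment_score("A good bill with one terrible amendment", lexicon))
```

A raw sum grows with document length, so dividing by the number of tokens may be a sensible variation when comparing texts of different sizes.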
Scripts are released under the MIT License.