GitHub - gojiplus/text-as-data: Pipeline for Analyzing Text Data: Acquire, Preprocess, Analyze

Text as Data

A general pipeline for analyzing text data: acquire, preprocess, and analyze.

Get Text

You can get text data by scraping, from APIs, from searchable PDFs, from images of paper documents, etc. Some examples:

Get text from searchable PDFs, e.g., get data from Wisconsin Ads storyboards using Python
Get text from images of text using Tesseract from Python
Get text from images of text using Abbyy FineReader Cloud OCR from R
Get text from images of text using Captricity OCR from R
Get Congressional Speech Data using Capitol Words API from the Sunlight Foundation

Preprocess Text

Preprocess text for text-as-data analysis.

Depending on the need, remove stop words, punctuation, and special characters, convert text to lowercase, and stem words.

preprocess_csv takes a csv with 'raw' text and outputs a csv with processed text.
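
The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code in preprocess_csv: the stop word list, the `naive_stem` helper, and the `text` column name are assumptions made for the example, and a real pipeline would likely use a fuller stop word list and a proper stemmer.

```python
import csv
import io
import re

# Tiny illustrative stop word list (an assumption, not the repository's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "was"}


def naive_stem(word):
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, remove_stop_words=True, stem=True):
    """Lowercase, drop punctuation and special characters, then filter and stem."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return " ".join(tokens)


def preprocess_csv_text(csv_text, column="text"):
    """Read CSV content from a string and return CSV content with `column` processed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = preprocess(row[column])
        writer.writerow(row)
    return out.getvalue()


print(preprocess("The Senators are VOTING, again!"))
```

Each step is behind a flag so that it can be switched off when the analysis needs the original word forms or the stop words.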

Get Summary of the Data, Subset Data

Takes a CSV and outputs a simple or stratified random sample of its rows, keeping only the columns you need. Also reports a summary of crucial aspects of the data.

Summarize and Subset.
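
As a rough sketch of the idea, stratified sampling draws the same fraction of rows from each group. The example below uses only the standard library; the function names, the `party` column, and the rule of taking at least one row per stratum are assumptions for illustration, not the behavior of the script in this repository.

```python
import csv
import io
import random
from collections import Counter, defaultdict


def stratified_sample(rows, strata_column, fraction, seed=0):
    """Sample roughly the same fraction of rows from each stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_column]].append(row)
    sample = []
    for key in sorted(groups):
        group = groups[key]
        k = max(1, round(len(group) * fraction))  # at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample


def keep_columns(rows, columns):
    """Return rows restricted to the requested columns."""
    return [{c: row[c] for c in columns} for row in rows]


def summarize(rows, column):
    """Report the number of rows and the count of each value in `column`."""
    return {"n_rows": len(rows), "counts": dict(Counter(r[column] for r in rows))}


# Usage with a small in-memory CSV (hypothetical data).
csv_text = "id,party,text\n1,D,a\n2,D,b\n3,R,c\n4,R,d\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
subset = keep_columns(stratified_sample(rows, "party", 0.5), ["id", "party"])
print(summarize(rows, "party"))
print(subset)
```

A simple random sample is the special case where every row falls in the same stratum.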

Get TDM

Create a term-document matrix and get some information about it, including frequent and infrequent terms. Options are available for removing sparse terms, among other things.

Get TDM, TF-IDF, Summary.
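
To make the terms concrete, here is a minimal standard-library sketch of a term-document matrix, sparse-term removal, and TF-IDF weighting. It is not the repository's script. The function names are hypothetical, sparsity is taken to mean the share of documents in which a term does not appear, and the TF-IDF formula (term frequency divided by document length, times the log of documents over document frequency) is one common variant among several.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def remove_sparse_terms(terms, matrix, max_sparsity):
    """Drop terms absent from more than `max_sparsity` of the documents."""
    kept = [
        (t, row)
        for t, row in zip(terms, matrix)
        if sum(1 for c in row if c == 0) / len(row) <= max_sparsity
    ]
    return [t for t, _ in kept], [row for _, row in kept]


def tf_idf(matrix):
    """Weight counts by (count / document length) * log(n_docs / document frequency)."""
    n_docs = len(matrix[0]) if matrix else 0
    doc_lengths = [sum(row[j] for row in matrix) for j in range(n_docs)]
    weighted = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append(
            [
                (count / doc_lengths[j]) * idf if doc_lengths[j] else 0.0
                for j, count in enumerate(row)
            ]
        )
    return weighted


docs = ["tax cut tax", "health care", "tax health"]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```

Terms that occur in every document get a weight of zero under this variant, since the log of one is zero.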

Sentiment Analysis in Python

Basic sentiment analysis using AFINN
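
AFINN is a word list that assigns each word an integer valence score, and a text is scored by summing the scores of the words it contains. The sketch below shows that scoring rule with a toy lexicon; the entries in `TOY_LEXICON` are illustrative placeholders, not the published AFINN list, and the function name is hypothetical.

```python
import re

# Toy lexicon in the AFINN style (integer valence scores).
# These entries are placeholders for illustration, not the real AFINN list.
TOY_LEXICON = {"good": 3, "great": 3, "happy": 3, "bad": -3, "terrible": -3}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every word in the text; unknown words score zero."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


print(sentiment_score("Great, great news"))
print(sentiment_score("A good bill, but a terrible process"))
```

Because the score is a plain sum, longer texts tend to have larger absolute scores; dividing by the number of words is one way to compare texts of different lengths.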

Analyze Text in R

Classify text in R using SVM or Lasso. See Basic Text Classifier
A worked example of how to model words as a function of ideology using Congressional speech data. See Speech Learn

License

Scripts are released under the MIT License.

