Today we're going to talk a bit about text scraping, manipulation, and analysis in Python.
Workshop by Phil, Riley, and Yeli.
If you don't have a favorite text editor already, download Sublime Text. You can use Xcode for these exercises if you're used to it, but we recommend Sublime since it's simpler and less clunky.
First we need pip, Python's package installer. Open up a terminal and run this command to download the pip install script:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
Then, to install pip, run:
sudo python get-pip.py
Once we've got pip set up, we can install Beautiful Soup with
$ sudo pip install beautifulsoup4
Beautiful Soup helps us scrape text from the internet. Muahahaha! 👹
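To get a feel for what that looks like, here is a minimal sketch that parses a small HTML snippet and pulls out its text and links. The HTML string, the "intro" class name, and the example.com URLs are made up for illustration; in a real scrape you would fetch the page first and pass its HTML to Beautiful Soup instead.

```python
from bs4 import BeautifulSoup

# A made-up page, standing in for HTML you would normally download.
html = """
<html><body>
<h1>Workshop</h1>
<p class="intro">Hello, <b>text</b> scraping!</p>
<a href="https://example.com/a">First</a>
<a href="https://example.com/b">Second</a>
</body></html>
"""

# "html.parser" is Python's built-in parser, so nothing extra to install.
soup = BeautifulSoup(html, "html.parser")

# Grab the text of the first <h1> tag.
title = soup.h1.get_text()

# Find a tag by name and CSS class; get_text() drops the nested <b> markup.
intro = soup.find("p", class_="intro").get_text()

# Collect the href attribute of every <a> tag on the page.
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(intro)
print(links)
```

The pattern is usually the same: parse once, then use find and find_all to pick out the tags you care about.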
We can install NLTK the same way:
$ sudo pip install nltk
NLTK is a suite of text processing libraries for Python that lets us tokenize, tag, count, and otherwise analyze text in some really interesting and powerful ways. For the intro exercises, we'll work through part of the NLTK Book. It's a great resource, so check it out!
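As a small taste, here is a rough sketch of counting words with NLTK. The sample sentence is made up, and it is split on whitespace with plain Python rather than NLTK's tokenizers, so this particular snippet should run without any extra data downloads.

```python
import nltk

# A made-up sentence, split on whitespace (no NLTK tokenizer data needed).
raw = "the quick brown fox jumps over the lazy dog and the fox runs away"
tokens = raw.split()

# nltk.Text wraps a list of tokens and adds simple analysis helpers.
text = nltk.Text(tokens)

# FreqDist counts how often each token appears.
freq = nltk.FreqDist(tokens)

# How many times does "the" show up?
the_count = text.count("the")

# The two most frequent tokens, with their counts.
top_two = freq.most_common(2)

print(the_count)
print(top_two)
```

The NLTK Book uses the same Text and FreqDist objects, just on much larger texts.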
NLTK offers a bunch of corpora and trained models, but they have to be downloaded separately. We're going to use some of them, so in your Python REPL type:
import nltk
nltk.download()
If it looks like nothing happened, check whether a new window popped open in the background. We want to download "book" under the "Collections" tab.
- Python scripts for getting stuff from social media: https://github.com/lamthuyvo/social-media-data-scripts