# Welcome to Intro to NLTK!

**This workshop will walk you through the common functions of NLTK in Python!**

<ul>
    <li> Run each chunk in top-to-bottom order and follow along with the code. </li>
    <li> Occasionally there will be opportunities for you to write new code, so you will have the chance to apply what you learn right away. </li>
</ul>

I am assuming minimal experience with Python 3 and programming in general.

### *Packaged Texts.*
There are two texts packaged with this workshop that provide different levels of challenge: **(1)** Walden by Henry David Thoreau and **(2)** Complete Works of William Shakespeare.

These were downloaded from Project Gutenberg (https://www.gutenberg.org), a website that hosts a library of over 60,000 free books. Many of these books are available in plaintext (.txt) format, making them easily parsable by packages like NLTK. That said, the skills from this workshop are broadly applicable to any type and format of machine-readable text, not just books from Project Gutenberg and not just plaintext.


## Installing Required Packages
First we will load NLTK into the environment and make sure we can parse text files.

### *Installing NLTK.* 
NLTK is preinstalled on this virtual server. If you follow this workshop on your own machine, you will have to install the package yourself. In most cases, this takes only one command and a few minutes. You can find instructions here: https://www.nltk.org/install.html

### *Installing other packages.*
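Other packages required in this workshop are **numpy**, **pickle**, **os**, **re**, **string**, **scipy**, and **pyplot** (which is part of **matplotlib**). Like NLTK, they are preinstalled on this server. If you want to run these scripts on a different machine, you will need to install numpy, scipy, and matplotlib yourself; pickle, os, re, and string are part of Python's standard library and need no installation.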
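
If you are working on your own machine, a minimal sketch like the one below can report which of these packages Python cannot find before you start. It assumes the import names listed in the code (for example, `matplotlib` is the import name that provides pyplot), and the helper `find_missing` is a hypothetical name introduced here for illustration.

```python
import importlib.util

# Import names assumed for the packages used in this workshop.
# Note that pyplot lives inside matplotlib.
required = ["nltk", "numpy", "scipy", "matplotlib", "pickle", "os", "re", "string"]

def find_missing(module_names):
    """Return the names that Python cannot locate in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = find_missing(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any name that shows up as "Not installed" needs to be installed before the later chunks will run.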


## First Code Chunk: Load NLTK and download popular NLTK packages
The following box is an example of a code chunk! Run it to import (= load) NLTK into the environment. It will take a few minutes and you will see an asterisk (`*`) to the left of the block while it runs.

To run a chunk, you can press the "Run" button above, or hold Shift and press Return/Enter.

In [None]:
#################
## Set up NLTK ##
#################

# Import all functions from NLTK
# This allows us to call nltk functions without specifying the package
from nltk import *

# Download or update popular NLTK packages
# This will let us filter out punctuation and very common words (i.e. stopwords)
download('popular')

# This will allow us to tag tokens in our texts in different ways, including by part of speech
download('tagsets')


## Test our connection to a text file

This code chunk loads the os module, which helps us handle file paths for external files, including our texts.

Specifically, we will load the famous American novel 'Walden' into the variable 'file' and use a "for loop" to print the first 20 lines of the novel. This is not a necessary step, but it is a good idea to test the connection to your text file, since problems with file paths are common.
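
Because file path problems are so common, it can help to check where Python is looking before you try to open anything. The sketch below is one way to do that with the os module. It assumes the same relative path used in this workshop, and the variable name `path_to_file` is chosen here only for illustration.

```python
import os

# The same relative path used in this workshop
path_to_file = 'texts/walden.txt'

# Relative paths are resolved from the current working directory
print("Working directory:", os.getcwd())
print("Full path:", os.path.abspath(path_to_file))

# isfile() is True only if the path exists and points to a regular file
if os.path.isfile(path_to_file):
    print("Found the file.")
else:
    print("No file at that path. Check the spelling and the folder.")
```

If the file is not found, compare the printed full path with where the text file actually lives on your machine.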

In [None]:
# The OS package is for handling file paths
import os

# The relative path to our text file.
# In other words, where the file is relative to this Jupyter notebook.
pathToFile = 'texts/walden.txt'

# Open Walden for reading (hence the 'r') and store the file object in the variable 'file'
file = open( pathToFile, 'r')

# This "for loop" will repeat 20 times -- that is what the range() function is doing
# Inside the loop, readline() returns the *next* line, relative to the previous loop.
# So the two lines do a lot. The loop counts from 0 to 19 and for each count...
for x in range(20):
    
    # ...it prints the next line from our text file
    print( file.readline() )
    

## Clean up

In [None]:
# Close the Walden text file to free up the resources it was using
#      We don't need the original text anymore.
file.close()
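
As an aside, Python can also close a file for you. The sketch below shows one common alternative to calling close() yourself: a `with` block, which closes the file automatically when the block ends. The helper name `first_lines` and the small throwaway demo file are assumptions made for illustration; they are not part of the workshop files.

```python
import os
import tempfile
from itertools import islice

def first_lines(path, n=20):
    """Return the first n lines of a text file, without trailing newlines."""
    # The file is closed automatically when the 'with' block ends
    with open(path, 'r', encoding='utf-8') as f:
        # islice() stops after n lines, or earlier if the file is shorter
        return [line.rstrip('\n') for line in islice(f, n)]

# Demonstration on a small throwaway file (hypothetical content, not Walden)
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, 'demo.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('line one\nline two\nline three\n')
    for line in first_lines(demo_path, n=2):
        print(line)
```

To try it on the packaged text, you could call `first_lines('texts/walden.txt')`, assuming that file is saved as UTF-8.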


# Some notes about using Jupyter notebook


## Where's the code?
This workshop is **browser based** and the code you will read and run is not on your computer. While that means you do not have to take time to set up the server yourself, it also means that to recreate the workshop, you will have to download a copy of the files, which are here:

<ul>
    <li>https://github.com/turnerdan/nltk_tutorial/</li>
</ul>

Simply download (or, even better, *clone*) the code repository and you will be 90% of the way there.


### This Jupyter notebook will disappear

Shortly after the end of this workshop the virtual server running this program will be shut down. If you want to save any of your work, you might want to copy and paste your code into a text file and email it to yourself. (No need to copy and paste my code -- it's available above.)


## What is a Jupyter notebook?

It is a way to run Python code (and code in other languages, such as R) in *chunks*, so that I can have explanations embedded along with code, which you can edit directly if you want to do so. To run a code chunk, make sure it is selected, then click "Run" at the top of the screen.

You can add more code blocks to try different formulations of functions as we go by clicking the "+" button above.



In [None]:
# What kind of computer is this notebook running on?
# Note: os.uname() is only available on Unix-like systems, such as this server.
os.uname()
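
If you run this notebook on your own machine, a portable way to ask the same question is the standard library's platform module, which also works on Windows. This is a minimal sketch rather than part of the original workshop code.

```python
import platform

# Name of the operating system, e.g. 'Linux', 'Darwin' (macOS), or 'Windows'
print("Operating system:", platform.system())

# Hardware type, e.g. 'x86_64' or 'arm64'
print("Machine type:", platform.machine())

# Version of Python that is running this code
print("Python version:", platform.python_version())
```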


### Where to get more information on NLTK

Parts of this workshop follow the exercises and topics of existing tutorials, classes, and books:

<ul>
    <li>https://www.nltk.org/book/</li>
    <li>http://www.ling.helsinki.fi/kit/2009s/clt231/NLTK/book/ch01-LanguageProcessingAndPython.html#sec-computing-with-language-simple-statistics</li>
    <li>https://library.nd.edu/event/introduction-to-python-and-the-nltk-2019-11-19</li>
    
</ul>

You are encouraged to check out these resources and explore the many others that have been posted online.


# Next: Python Refresher

In the next part, we will review some Python basics...

**...but first:**

# Code it: Hello World


**It is tradition to test a new environment with a *Hello world*, which is just a simple script that prints "Hello world" to the console.**

Type the following code into the code block below to print "Hello world":

>Line 1: ``my_message = "Hello world!"`` This line saves the STRING "Hello world!" to the variable "my_message".

>Line 2: ``print( my_message )`` This line prints the variable "my_message".


In [None]:
# <- This symbol at the beginning of the line means this is a COMMENT.
# Use comments to explain what your code is doing.
# To get the most out of this workshop, read all the comments to follow along with the details of what we do.
# We need comments here because this is a CODE BLOCK.
# Whatever Python script is in this box will run when you press the "Run" button at the top of the screen.

#################################################
## ## ## ## > Code it < ## ## ## ##             #
################################### Hello_world #
## Sample answer in /answer_keys ##             #
#################################################

# Type the Hello World script below and click Run:
