# Introduction to Data Science with Python 
## General Assembly
## Natural Language Processing (NLP)

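Make sure you have installed spaCy (`!pip install spacy`) and successfully run `!python -m spacy download en`. In spaCy 3 and later the `en` shortcut no longer exists; download and load `en_core_web_sm` instead.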



## Lab Part 1

### Tokenization

What:  Separate text into units such as sentences or words

Why:   Gives structure to previously unstructured text

Notes: Relatively easy with English-language text; harder with languages that do not separate words with spaces (e.g. Chinese or Japanese)


"corpus" = collection of documents

"corpora" = plural form of corpus
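To make the idea concrete, here is a minimal sketch of tokenization using only the standard library. The toy text and the regular expressions are assumptions made for illustration; this is not how spaCy tokenizes, and a naive rule like this breaks on abbreviations such as "Dr. Smith", which is one reason the lab uses spaCy instead.

```python
import re

# Toy text, invented for this illustration.
text = "The cat sat on the mat. It didn't move! Did anyone notice?"

# Naive sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Naive word split: runs of letters, digits, apostrophes and hyphens.
words_per_sentence = [re.findall(r"[A-Za-z0-9'-]+", s) for s in sentences]

for sentence, words in zip(sentences, words_per_sentence):
    print(len(words), words)
```

Each sentence becomes a list of word tokens, which is the structure the later counting exercises rely on.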


In [None]:
import spacy
import codecs

In [None]:
nlp = spacy.load('en')

Jane Austen's Emma is in the `data/` directory as `emma.txt`.
If you are using an Azure Notebook, upload it somewhere
convenient (e.g. `~/library`) and adjust the path in the next cell.

In [None]:
# open it as a UTF-8 file.
emma_file = codecs.open('../../data/emma.txt', encoding='utf-8')

In [None]:
# Create an NLP analysis of the first 5000 characters of Emma:
emma = nlp(emma_file.read()[:5000])

In [None]:
# Do the same for Alice in Wonderland, which is in the data/ directory
# as alice-in-wonderland.txt

In [None]:
# Count the number of sentences in each novel.


In [None]:
# Count the number of words in each sentence


In [None]:
# Which novel has more words per sentence on average?
# Given their target audience, is this what you would expect?

In [None]:
# For each novel, construct a set of all the distinct words used


In [None]:
# Calculate the lexical diversity of each novel (distinct words / word count)

In [None]:
# (Optional, only for the very keen)
# The nltk library includes a sample of Project Gutenberg books in nltk.corpus.gutenberg
# Create a dataframe with the names of the novels, when they were written,
# whether they were for children, the lexical diversity and the average sentence length.
# Can you use logistic regression to predict the audience, based on the content?

## Lab Part 2

Words in context: looking at the text that surrounds each occurrence of a word. The textacy library has some utility functions that make this easy.
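As a rough picture of what a keyword-in-context search does, here is a standard-library sketch. The function name `keyword_in_context`, its `window` parameter, and the sample text are hypothetical and written for this illustration only; textacy provides its own helper, whose name and import path depend on the installed version.

```python
import re

def keyword_in_context(text, keyword, window=20):
    """Return (left, match, right) triples for each whole-word occurrence.

    Hypothetical helper for illustration; not textacy's API.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    results = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        results.append((left, match.group(), right))
    return results

# Toy text, invented for this illustration.
sample = "The young lady met a young man. Youngsters were playing nearby."
hits = keyword_in_context(sample, "young", window=10)
for left, word, right in hits:
    print(f"{left:>10} | {word} | {right}")
```

The word-boundary match means "Youngsters" is not reported as an occurrence of "young"; drop the `\b` anchors if substring matches are wanted.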

In [None]:
!pip install textacy
import textacy

In [None]:
# Does Jane Austen ever mention the word 'young' in Emma? What about Lewis Carroll?
# In what context does it appear?



In [None]:
# Where does the word 'cat' appear in Alice in Wonderland?

## Lab Part 3 - Part of speech tagging

What:  Determine the part of speech of a word
    
Why:   This can inform other methods and models such as Named Entity Recognition
    

In [None]:
# Jane Austen is very long-winded. The sentence at index 25 of Emma isn't
# too bad though: "She dearly loved her father, but he was no companion for her."
#
emma25 = list(emma.sents)[25]
print(emma25.text)

In [None]:
# Print out each word in this sentence and its part of speech

In [None]:
# (Optional extra)
# Who are the most commonly named characters in Emma? What about Alice in Wonderland?

In [None]:
# (Optional extra; harder)
# What are the most commonly referenced nouns that are not characters?

## Lab Part 4 - Vectorisation

What: Turn sentences and documents into arrays of numbers

Why:  Then we can use machine learning techniques
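A minimal sketch of the simplest vectorisation, a bag-of-words count, using only the standard library. The two toy sentences and the helper name `count_vector` are assumptions for illustration; a library count vectorizer does the same job with better tokenization and sparse storage.

```python
from collections import Counter

# Toy sentences, invented for this illustration.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Split on whitespace and collect the sorted vocabulary across all sentences.
tokenized = [s.split() for s in sentences]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every sentence maps to a vector of the same length (one entry per vocabulary word), which is what lets a classifier compare them.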

In [None]:
# Use a count vectorizer (or tfidf vectorizer) to vectorise all the sentences
# in both Emma and Alice in Wonderland (in that order)

In [None]:
# How many sentences are there in Emma? Create a list of this length filled with the number zero.

In [None]:
# How many sentences are there in Alice in Wonderland? Create a list of this length filled with the number one.

In [None]:
# Create a list called 'which_document' by concatenating the first list and the second (first + second)

In [None]:
# Create a NearestCentroid classifier (sklearn.neighbors.NearestCentroid)

In [None]:
# Train your NearestCentroid classifier. Let X be the vectorisation of sentences, and let y
# be the 'which_document' list

In [None]:
# Make up a sentence about love, marriage and any other Jane Austen topics. 
# Transform it using your vectorizer, and then get your classifier to predict what sort of document
# it is

FWIW, a TF-IDF vectoriser combined with a NearestCentroid classifier is known as a Rocchio classifier.
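To show the mechanics behind that combination, here is a hand-rolled sketch in plain Python: TF-IDF weighting followed by a nearest-centroid decision. The training sentences, the labels, and the helper names (`tfidf`, `centroid`, `predict`) are all invented for illustration. It is a stand-in for the scikit-learn classes used in the lab, not a replacement for them, and words outside the training vocabulary are simply ignored.

```python
import math
from collections import Counter

# Toy training data, invented for this illustration.
train_sentences = [
    "she hoped for a happy marriage",
    "marriage and love were her concern",
    "the rabbit ran down the hole",
    "the cat grinned at the rabbit",
]
labels = [0, 0, 1, 1]  # 0 = first document, 1 = second document

docs = [s.split() for s in train_sentences]
vocabulary = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {
    w: math.log((1 + n_docs) / (1 + sum(w in d for d in docs))) + 1
    for w in vocabulary
}

def tfidf(tokens):
    """L2-normalised TF-IDF vector over the training vocabulary."""
    counts = Counter(tokens)
    vec = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def centroid(rows):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

X = [tfidf(d) for d in docs]
centroids = {
    label: centroid([x for x, y in zip(X, labels) if y == label])
    for label in sorted(set(labels))
}

def predict(sentence):
    """Return the label whose centroid is closest in Euclidean distance."""
    vec = tfidf(sentence.split())
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
    return min(centroids, key=squared_distance)

print(predict("love and marriage"))
print(predict("the rabbit and the cat"))
```

Each class is summarised by a single average vector, so prediction only needs one distance computation per class.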

## Lab Part 5

In [None]:
# Repeat Lab Part 4 using spaCy's built-in word vectors (e.g. each sentence's .vector) as the vectoriser