
# [NLTK Chapter 1](https://www.nltk.org/book/ch01) Python Basics

We will briefly go through key parts of Sections 1.4 to 3.2 of NLTK Chapter 1 together. Then you'll work on the exercises below on your own.

If you like, you can work through these NLTK sections in a more leisurely way using [this notebook](https://colab.research.google.com/github/BetoBob/NLTK-Book-Resource/blob/master/01/01_notes.ipynb#scrollTo=GgpIL8YyLnab).

## NLTK necessities

In [None]:
# load NLTK necessities
import nltk
nltk.download('book')
from nltk.book import *

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/dependency_treebank.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Common Python data types



First, we'll note the types of the most common Python objects and values.

In [None]:
# numbers such as int(egers)
my_num = 123
type(my_num)

int

In [None]:
# str(ings): a sequence of characters
my_str = "Hello, world!"
type(my_str)

str

In [None]:
# lists: an ordered sequence of objects or values, written with commas between the items
my_list = ["Hello", ",", "world", "!"]
type(my_list)

list

In [None]:
# sets: a collection of unique values (in no particular order)
s = set(['a', 'a', 'b', 'b', 'c'])
print(s)
s2 = {'a', 'b', 'c'}
print(s2)
type(s2)

{'b', 'c', 'a'}
{'b', 'c', 'a'}


set

In [None]:
# tuples: a fixed-length, immutable sequence such as a pair or a triple
my_pair = ('a', 'b')
type(my_pair)

tuple

In [None]:
# sequences such as strings, lists and tuples allow retrieving elements by their index
# for example, we can retrieve the item at index 0 (the first item) in a pair
my_pair[0]

'a'

In [None]:
# dict(ionaries): a collection of unique keys paired with values
d = dict([('a', 1), ('b', 2), ('c', 2)])
print(d)
d2 = {'a': 1, 'b': 2, 'c': 3}
print(d2)
type(d2)

{'a': 1, 'b': 2, 'c': 2}
{'a': 1, 'b': 2, 'c': 3}


dict

In [None]:
# dictionaries allow values to be retrieved by their keys
# for example, we can retrieve the value paired with the key 'b'
d2['b']

2

With these basic types in mind, note that an NLTK `Text` behaves much like a list of tokens with some extra operations, and an NLTK `FreqDist` is a dictionary that maps each item to its count, again with some extra operations.
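As a rough illustration of that analogy, the sketch below uses a plain token list and the standard library's `collections.Counter` as stand-ins for `Text` and `FreqDist` (NLTK's `FreqDist` is built on `Counter`). The `tokens` list is made up for the example.

```python
from collections import Counter

# a made-up token list standing in for an NLTK Text
tokens = ["the", "whale", "and", "the", "sea"]

# list-like operations: length, indexing, slicing
print(len(tokens))    # number of tokens
print(tokens[1])      # token at index 1
print(tokens[:2])     # first two tokens

# dictionary-like operations: look up a count by its key
counts = Counter(tokens)
print(counts["the"])   # how often 'the' occurs
print(counts["ship"])  # a key that never occurs has count 0
```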

## Indexing and Slicing

Strings, lists and tuples are ordered collections that allow their elements to be retrieved by their numeric position, or *index*.  A range, or *slice*, of elements can also be retrieved using a pair of indices.  Square brackets following the ordered collection are used to provide the index or range.

In [None]:
# The first element is actually indexed by zero.
print(my_str)
my_str[0]

Hello, world!


'H'

In [None]:
# Negative indices count from the end, so -1 indexes the last item!
my_str[-1]

'!'

In [None]:
# Slicing the first 5 letters can be done with the range [0:5], where the second index is exclusive (i.e., up to but not including the second index).
my_str[0:5]

'Hello'

In [None]:
# If the first index is zero, it can be left out for conciseness.
my_str[:5]

'Hello'

In [None]:
# Similarly, if the range goes all the way to the end, the second index can be left out.
my_str[-6:]

'world!'
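The cells above slice a string, but the same notation should carry over to lists and tuples. A minimal sketch, reusing the `my_list` and `my_pair` values defined earlier in this notebook (the other variable names are only for illustration):

```python
# the same list and pair used earlier in this notebook
my_list = ["Hello", ",", "world", "!"]
my_pair = ('a', 'b')

first_two = my_list[:2]     # items at index 0 and 1
last_two = my_list[-2:]     # the final two items
last_of_pair = my_pair[-1]  # negative indices work on tuples too

print(first_two)
print(last_two)
print(last_of_pair)
```

Note that slicing a list gives back a list, while indexing gives back a single element.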

## List Comprehensions



A list comprehension is a convenient way of constructing lists using a notation reminiscent of [set builder notation](https://en.wikipedia.org/wiki/Set-builder_notation), which is hopefully more familiar.

In set-builder notation, one specifies a set of elements x in some domain E that satisfy a property p as follows:

{ x | x ∈ E ∧ p(x) }

For example, here's how to specify the strictly positive real numbers:

{ x | x ∈ ${\Bbb R}$ ∧ x > 0 }

Similarly, you can construct lists in Python where you use square brackets instead of curly ones and use the keywords `for` and `if` to specify the domain and predicate, respectively.  (You can actually construct sets and dictionaries too, though that's less common.)

In [None]:
# alphabetic letters in my_str
[l for l in my_str if l.isalpha()]

['H', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
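Sets and dictionaries can be built with the same `for`/`if` notation by switching to curly brackets. The following is a small sketch on the same string; the names `letters` and `letter_counts` are made up for the example.

```python
my_str = "Hello, world!"

# set comprehension: the distinct alphabetic letters, lowercased
letters = {ch.lower() for ch in my_str if ch.isalpha()}

# dictionary comprehension: map each distinct letter to its count
letter_counts = {ch: my_str.lower().count(ch) for ch in letters}

print(sorted(letters))
print(letter_counts["l"])
```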

In set-builder notation, you can also have a formula on the left hand side.  For example, here are the even natural numbers:

{ 2n | n ∈ ${\Bbb N}$ }

Similarly, in Python you can perform an operation on each element in the domain that satisfies the predicate.

In [None]:
# which words in Moby Dick are longer than 16 characters, and how long are they?
[(word, len(word)) for word in text1 if len(word) > 16]

[('uncomfortableness', 17),
 ('cannibalistically', 17),
 ('circumnavigations', 17),
 ('superstitiousness', 17),
 ('superstitiousness', 17),
 ('comprehensiveness', 17),
 ('preternaturalness', 17),
 ('indispensableness', 17),
 ('characteristically', 18),
 ('comprehensiveness', 17),
 ('comprehensiveness', 17),
 ('uncompromisedness', 17),
 ('uninterpenetratingly', 20),
 ('subterraneousness', 17)]
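To make the parts of such a comprehension explicit, here is a small sketch that builds the same kind of list twice, once as a comprehension and once as an ordinary `for` loop. The `words` list is made up so that the example runs without NLTK, and the last line mirrors the even-numbers formula for the first few values of n (taking 0 as a natural number).

```python
# a made-up word list for the example
words = ["whale", "sea", "harpooneer", "ship"]

# comprehension: operation on the left, domain after `for`, predicate after `if`
pairs = [(word, len(word)) for word in words if len(word) > 4]

# the same result built step by step with a for loop
pairs_loop = []
for word in words:
    if len(word) > 4:
        pairs_loop.append((word, len(word)))

# the first five even natural numbers, mirroring { 2n | n ∈ N }
evens = [2 * n for n in range(5)]

print(pairs)
print(pairs == pairs_loop)
print(evens)
```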

## Frequency Distributions

In NLTK, a frequency distribution, or `FreqDist`, allows you to examine the counts of words in a text.

In [None]:
# make a frequency distribution for Moby Dick
fdist1 = FreqDist(text1)
# how often does 'whale' occur?
fdist1['whale']

906

In [None]:
# are any of the 40 most common words open class words, or are they all function words?
fdist1.most_common(40)

[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982),
 ("'", 2684),
 ('-', 2552),
 ('his', 2459),
 ('it', 2209),
 ('I', 2124),
 ('s', 1739),
 ('is', 1695),
 ('he', 1661),
 ('with', 1659),
 ('was', 1632),
 ('as', 1620),
 ('"', 1478),
 ('all', 1462),
 ('for', 1414),
 ('this', 1280),
 ('!', 1269),
 ('at', 1231),
 ('by', 1137),
 ('but', 1113),
 ('not', 1103),
 ('--', 1070),
 ('him', 1058),
 ('from', 1052),
 ('be', 1030),
 ('on', 1005),
 ('so', 918),
 ('whale', 906),
 ('one', 889),
 ('you', 841),
 ('had', 767)]

In [None]:
# what long words occur more than once?
[word for word in set(text1) if len(word) > 15 and fdist1[word] > 1]

['circumnavigating',
 'apprehensiveness',
 'circumnavigation',
 'physiognomically',
 'simultaneousness',
 'indiscriminately',
 'superstitiousness',
 'comprehensiveness']
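To see roughly what a frequency distribution does under the hood, the sketch below counts tokens with a plain dictionary and then sorts the counts, which is the idea behind `most_common`. This is an illustration only, not NLTK's actual implementation, and the `tokens` list is made up.

```python
# a made-up token list for the example
tokens = ["the", "whale", ",", "the", "sea", ",", "the", "ship"]

# count each token with a plain dictionary
counts = {}
for token in tokens:
    counts[token] = counts.get(token, 0) + 1

# sort (token, count) pairs by descending count and keep the top two
top_two = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:2]

print(counts["the"])
print(top_two)
```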

# Exercises (from NLTK)

4. How many words are there in `text2` (Sense and Sensibility)?  How many distinct words are there?

In [None]:
# total words in text2
print(len(text2))

# total distinct words in text2
print(len(set(text2)))

141576
6833


27. Define a function called vocab_size(text) that has a single parameter for the text, and which returns the vocabulary size of the text.  Test it with *Sense and Sensibility* (text2).

In [None]:
def vocab_size(text):
  return len(set(text))

vocab_size(text2)

6833

19. What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?


1.   `sorted(set(w.lower() for w in text1))`
2.   `sorted(w.lower() for w in set(text1))`


ANSWER:



In [None]:
sorted(w.lower() for w in set(text1))

['!',
 '!"',
 '!"--',
 "!'",
 '!\'"',
 '!)',
 '!)"',
 '!*',
 '!--',
 '!--"',
 "!--'",
 '"',
 '"\'',
 '"--',
 '"...',
 '";',
 '$',
 '&',
 "'",
 "',",
 "',--",
 "'-",
 "'--",
 "';",
 '(',
 ')',
 '),',
 ')--',
 ').',
 ').--',
 '):',
 ');',
 ');--',
 '*',
 ',',
 ',"',
 ',"--',
 ",'",
 ",'--",
 ',)',
 ',*',
 ',--',
 ',--"',
 ",--'",
 '-',
 '--',
 '--"',
 "--'",
 '--\'"',
 '--(',
 '---"',
 '---,',
 '.',
 '."',
 '."*',
 '."--',
 ".'",
 '.\'"',
 '.)',
 '.*',
 '.*--',
 '.,',
 '.--',
 '.--"',
 '...',
 '....',
 '.]',
 '000',
 '1',
 '10',
 '100',
 '101',
 '102',
 '103',
 '104',
 '105',
 '106',
 '107',
 '108',
 '109',
 '11',
 '110',
 '111',
 '112',
 '113',
 '114',
 '115',
 '116',
 '117',
 '118',
 '119',
 '12',
 '120',
 '121',
 '122',
 '123',
 '124',
 '125',
 '126',
 '127',
 '128',
 '129',
 '13',
 '130',
 '131',
 '132',
 '133',
 '134',
 '135',
 '14',
 '144',
 '1492',
 '15',
 '150',
 '15th',
 '16',
 '1652',
 '1668',
 '1671',
 '1690',
 '1695',
 '16th',
 '17',
 '1726',
 '1729',
 '1750',
 '1772',
 '1775

In [None]:
sorted(set(w.lower() for w in text1))

['!',
 '!"',
 '!"--',
 "!'",
 '!\'"',
 '!)',
 '!)"',
 '!*',
 '!--',
 '!--"',
 "!--'",
 '"',
 '"\'',
 '"--',
 '"...',
 '";',
 '$',
 '&',
 "'",
 "',",
 "',--",
 "'-",
 "'--",
 "';",
 '(',
 ')',
 '),',
 ')--',
 ').',
 ').--',
 '):',
 ');',
 ');--',
 '*',
 ',',
 ',"',
 ',"--',
 ",'",
 ",'--",
 ',)',
 ',*',
 ',--',
 ',--"',
 ",--'",
 '-',
 '--',
 '--"',
 "--'",
 '--\'"',
 '--(',
 '---"',
 '---,',
 '.',
 '."',
 '."*',
 '."--',
 ".'",
 '.\'"',
 '.)',
 '.*',
 '.*--',
 '.,',
 '.--',
 '.--"',
 '...',
 '....',
 '.]',
 '000',
 '1',
 '10',
 '100',
 '101',
 '102',
 '103',
 '104',
 '105',
 '106',
 '107',
 '108',
 '109',
 '11',
 '110',
 '111',
 '112',
 '113',
 '114',
 '115',
 '116',
 '117',
 '118',
 '119',
 '12',
 '120',
 '121',
 '122',
 '123',
 '124',
 '125',
 '126',
 '127',
 '128',
 '129',
 '13',
 '130',
 '131',
 '132',
 '133',
 '134',
 '135',
 '14',
 '144',
 '1492',
 '15',
 '150',
 '15th',
 '16',
 '1652',
 '1668',
 '1671',
 '1690',
 '1695',
 '16th',
 '17',
 '1726',
 '1729',
 '1750',
 '1772',
 '1775

10. Define a variable `my_sent` to be a string of words, using the syntax `my_sent = "This is my sentence."` (but with your own words, or a favorite saying; be sure to include some punctuation).


1. Use `my_list = my_sent.split()` to split the string into a list of tokens.  Did this work fully as desired?
2. Edit `my_list` to be the correct list of tokens, then use `' '.join(my_list)` to convert this back into a string.  Did this step work fully as desired?

ANSWER:

In [None]:
my_sent = "This is my sentence, with some punctuation."
my_list = my_sent.split()

17. Use `text9.index()` to find the index of the word *sunset*. You'll need to insert this word as an argument between the parentheses. By a process of trial and error, find the slice for the complete sentence that contains this word.






In [None]:
text9.index('sunset')

629

22. Find all the four-letter words in the Chat Corpus (`text5`). With the help of a frequency distribution (`FreqDist`), show the first 50 of these words in decreasing order of frequency.  Are any of them swear words?

ANSWER:

In [None]:
text5_four_letter_words = [word for word in text5 if len(word) == 4]

# Solutions

## Exercise 4

4. How many words are there in `text2` (Sense and Sensibility)?  How many distinct words are there?

In [None]:
# total words in text2
len(text2)

141576

So there are 141,576 words (or tokens) in text2.

In [None]:
# total distinct words
len(set(text2))

6833

## Exercise 27

27. Define a function called vocab_size(text) that has a single parameter for the text, and which returns the vocabulary size of the text.  Test it with *Sense and Sensibility* (text2).

In [None]:
# define vocab_size
def vocab_size(text):
  return len(set(text))

# test on text2
vocab_size(text2)

6833

## Exercise 19

19. What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?



1.   `sorted(set(w.lower() for w in text1))`
2.   `sorted(w.lower() for w in set(text1))`






In [None]:
# this one is never longer (and usually shorter), as it
# lowercases before finding unique forms
len(sorted(set(w.lower() for w in text1)))

17231

In [None]:
# this one is at least as long, as it finds unique
# forms before lowercasing, e.g. it will keep
# both "This" and "this" in the set, then
# lowercase them afterwards, yielding some
# duplicates
len(sorted(w.lower() for w in set(text1)))

19317
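The difference is easier to see on a tiny example. The sketch below applies both expressions to a made-up token list; the variable names are only for illustration.

```python
# a made-up token list with case variants
tokens = ["This", "this", "is", "This", "IS"]

# lowercase first, then remove duplicates
lower_then_unique = sorted(set(w.lower() for w in tokens))

# remove duplicates first, then lowercase what is left
unique_then_lower = sorted(w.lower() for w in set(tokens))

print(lower_then_unique)
print(unique_then_lower)
```

The second list keeps one entry per distinct original spelling, so the two lengths match only for a text in which no word appears with more than one capitalization.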

## Exercise 10

10. Define a variable `my_sent` to be a string of words, using the syntax `my_sent = "This is my sentence."` (but with your own words, or a favorite saying; be sure to include some punctuation).


1. Use `my_list = my_sent.split()` to split the string into a list of tokens.  Did this work fully as desired?
2. Edit `my_list` to be the correct list of tokens, then use `' '.join(my_list)` to convert this back into a string.  Did this step work fully as desired?


In [None]:
my_sent = "This is my sentence."
# split on white space
my_list = my_sent.split()
# note how the sentence-final period is kept with the last word;
# we'll see later how to do more accurate tokenization
print(my_list)

['This', 'is', 'my', 'sentence.']


In [None]:
# correct my_list
my_list = ['This', 'is', 'my', 'sentence', '.']
# join together with spaces
my_joined_str = ' '.join(my_list)
# note extra space before full stop!
print(my_joined_str)

This is my sentence .
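Both steps can be approximated with regular expressions from the standard library. This is a rough sketch rather than a proper tokenizer: the patterns are assumptions chosen for this example, and they would mishandle contractions such as "don't" as well as opening quotes or brackets.

```python
import re

my_sent = "This is my sentence, with some punctuation."

# a token is a run of word characters, or a single non-space, non-word character
tokens = re.findall(r"\w+|[^\w\s]", my_sent)
print(tokens)

# join with spaces, then delete the space before common punctuation marks
joined = " ".join(tokens)
restored = re.sub(r"\s+([.,!?;:])", r"\1", joined)
print(restored)
```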


## Exercise 17

17. Use `text9.index()` to find the index of the word *sunset*. You'll need to insert this word as an argument between the parentheses. By a process of trial and error, find the slice for the complete sentence that contains this word.






In [None]:
# find "sunset" in text9
index = text9.index('sunset')
print('index is', index)
# verify that word appears there
text9[index]

index is 629


'sunset'

In [None]:
# show words around the index
text9[index-20:index+20]

['K',
 '.',
 'C',
 '.',
 'CHAPTER',
 'I',
 'THE',
 'TWO',
 'POETS',
 'OF',
 'SAFFRON',
 'PARK',
 'THE',
 'suburb',
 'of',
 'Saffron',
 'Park',
 'lay',
 'on',
 'the',
 'sunset',
 'side',
 'of',
 'London',
 ',',
 'as',
 'red',
 'and',
 'ragged',
 'as',
 'a',
 'cloud',
 'of',
 'sunset',
 '.',
 'It',
 'was',
 'built',
 'of',
 'a']

In [None]:
# show just the sentence after counting backwards and forwards
# (later we will discuss automatic sentence segmentation!)
text9[index-8:index+15]

['THE',
 'suburb',
 'of',
 'Saffron',
 'Park',
 'lay',
 'on',
 'the',
 'sunset',
 'side',
 'of',
 'London',
 ',',
 'as',
 'red',
 'and',
 'ragged',
 'as',
 'a',
 'cloud',
 'of',
 'sunset',
 '.']
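The trial-and-error search can also be sketched as a scan outwards from the index until a sentence-ending token is found on each side. The function name `sentence_slice` and its `enders` parameter are hypothetical, and the token list is made up so the example runs without NLTK.

```python
def sentence_slice(tokens, index, enders=(".", "?", "!")):
    """Return (start, end) so that tokens[start:end] is the sentence around index."""
    # walk left until the token just before `start` ends a sentence
    start = index
    while start > 0 and tokens[start - 1] not in enders:
        start -= 1
    # walk right until the token at `end` ends a sentence
    end = index
    while end < len(tokens) - 1 and tokens[end] not in enders:
        end += 1
    return start, end + 1

# a made-up token list for the example
tokens = ["It", "rained", ".", "The", "sunset", "was", "red", ".", "We", "left", "."]
start, end = sentence_slice(tokens, tokens.index("sunset"))
print(tokens[start:end])
```

This naive rule has clear limits: a heading with no final punctuation, like the chapter title shown in the output above, would be merged into the sentence that follows it, and periods inside initials such as "G . K ." would be treated as sentence ends.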

## Exercise 22

22. Find all the four-letter words in the Chat Corpus (`text5`). With the help of a frequency distribution (`FreqDist`), show the first 50 of these words in decreasing order of frequency.  Are any of them swear words?

In [None]:
# make a frequency distribution of the four-letter words in text5
fdist5 = FreqDist(word for word in text5 if len(word) == 4)


In [None]:
# list the 50 most frequent ones
# only 'lmao' is marginally a swear word
fdist5.most_common(50)

[('JOIN', 1021),
 ('PART', 1016),
 ('that', 274),
 ('what', 183),
 ('here', 181),
 ('....', 170),
 ('have', 164),
 ('like', 156),
 ('with', 152),
 ('chat', 142),
 ('your', 137),
 ('good', 130),
 ('just', 125),
 ('lmao', 107),
 ('know', 103),
 ('room', 98),
 ('from', 92),
 ('this', 86),
 ('well', 81),
 ('back', 78),
 ('hiya', 78),
 ('they', 77),
 ('dont', 75),
 ('yeah', 75),
 ('want', 71),
 ('love', 60),
 ('guys', 58),
 ('some', 58),
 ('been', 57),
 ('talk', 56),
 ('nice', 52),
 ('time', 50),
 ('when', 48),
 ('haha', 44),
 ('make', 44),
 ('girl', 43),
 ('need', 43),
 ('U122', 42),
 ('MODE', 41),
 ('will', 40),
 ('much', 40),
 ('then', 40),
 ('over', 39),
 ('work', 38),
 ('were', 38),
 ('take', 37),
 ('U121', 36),
 ('U115', 36),
 ('song', 36),
 ('even', 35)]

In [None]:
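# keep only the purely alphabetic four-letter words
# (reuses text5_four_letter_words from the Exercises section above)
alphabetic_four_letter_words = [word for word in text5_four_letter_words if word.isalpha()]
fdist_alphabetic_four_letter_words = FreqDist(alphabetic_four_letter_words)
# list the 20 most frequent ones
fdist_alphabetic_four_letter_words.most_common(20)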

[('JOIN', 1021),
 ('PART', 1016),
 ('that', 274),
 ('what', 183),
 ('here', 181),
 ('have', 164),
 ('like', 156),
 ('with', 152),
 ('chat', 142),
 ('your', 137),
 ('good', 130),
 ('just', 125),
 ('lmao', 107),
 ('know', 103),
 ('room', 98),
 ('from', 92),
 ('this', 86),
 ('well', 81),
 ('back', 78),
 ('hiya', 78)]