**Task 1: Preprocessing**

Given the text file (grail.txt) from the Web Text Corpus of NLTK, perform the following tasks:
1. Report the number of sentences and tokens contained in the file given as input.
2. Convert the whole text to lower case and report the number of unique tokens present before and after lower casing in the input file.
3. Report the number of stopwords in the file. Report the number of tokens left after stopword removal.
4. Perform stemming after removing stopwords and report the number of unique tokens left in the text.
5. Report the number of words starting with a consonant and the number of words starting with a vowel in the file given after performing steps 1, 2, 3, and 4.
6. Given a word and a file as input, return the number of sentences starting with that word in the input file after performing steps 1, 2, 3, and 4.
7. Given a word and a file as input, return the number of sentences ending with that word in the input file after performing steps 1, 2, 3, and 4.
8. Given a word and a file as input, return the count of that word in the input file after performing steps 1, 2, 3, and 4.

In [None]:
!pip install nltk



**Read the file**

In [None]:
with open("grail.txt") as f:
    text = f.read()
print(text)

SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop clop clop] 
SOLDIER #1: Halt!  Who goes there?
ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!
SOLDIER #1: Pull the other one!
ARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.
SOLDIER #1: What?  Ridden on a horse?
ARTHUR: Yes!
SOLDIER #1: You're using coconuts!
ARTHUR: What?
SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.
ARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--
SOLDIER #1: Where'd you get the coconuts?
ARTHUR: We found them.
SOLDIER #1: Found them?  In Mercea?  The coconut's tropical!
ARTHUR: What do you mean?
SOLDIER #1: Well, this is a temperate zone.
AR

**1. Report the number of sentences and tokens contained in the file given as input.**

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
tokens = word_tokenize(text) # NLTK function
sentences = sent_tokenize(text) # NLTK function
print(tokens)
#print(set(tokens))
print("\n")
print("Number of Sentences: " ,len(sentences))
print("Number of Tokens: " ,len(tokens))
#print("Number of Unique Tokens: " ,len(set(tokens)))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Number of Sentences:  1881
Number of Tokens:  16450


**2. Convert the whole text to lower case and report the number of unique tokens present before and after lower casing in the input file.**

**Make everything lower case**

In [None]:
text1 = text.lower()
print(text1)

scene 1: [wind] [clop clop clop] 
king arthur: whoa there!  [clop clop clop] 
soldier #1: halt!  who goes there?
arthur: it is i, arthur, son of uther pendragon, from the castle of camelot.  king of the britons, defeator of the saxons, sovereign of all england!
soldier #1: pull the other one!
arthur: i am, ...  and this is my trusty servant patsy.  we have ridden the length and breadth of the land in search of knights who will join me in my court at camelot.  i must speak with your lord and master.
soldier #1: what?  ridden on a horse?
arthur: yes!
soldier #1: you're using coconuts!
arthur: what?
soldier #1: you've got two empty halves of coconut and you're bangin' 'em together.
arthur: so?  we have ridden since the snows of winter covered this land, through the kingdom of mercea, through--
soldier #1: where'd you get the coconuts?
arthur: we found them.
soldier #1: found them?  in mercea?  the coconut's tropical!
arthur: what do you mean?
soldier #1: well, this is a temperate zone.
ar

In [None]:
# Remove punctuation
import string
puncs = string.punctuation
print(puncs)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


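I will use the str.translate() method to remove the punctuation. str.translate() takes a translation table that describes the mapping, and str.maketrans() builds that table. With two empty strings as the first two arguments, every character listed in the third argument is mapped to None, that is, deleted.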

In [None]:
mapping = str.maketrans("","",puncs)
text1 = text1.translate(mapping)
print(text1)

scene 1 wind clop clop clop 
king arthur whoa there  clop clop clop 
soldier 1 halt  who goes there
arthur it is i arthur son of uther pendragon from the castle of camelot  king of the britons defeator of the saxons sovereign of all england
soldier 1 pull the other one
arthur i am   and this is my trusty servant patsy  we have ridden the length and breadth of the land in search of knights who will join me in my court at camelot  i must speak with your lord and master
soldier 1 what  ridden on a horse
arthur yes
soldier 1 youre using coconuts
arthur what
soldier 1 youve got two empty halves of coconut and youre bangin em together
arthur so  we have ridden since the snows of winter covered this land through the kingdom of mercea through
soldier 1 whered you get the coconuts
arthur we found them
soldier 1 found them  in mercea  the coconuts tropical
arthur what do you mean
soldier 1 well this is a temperate zone
arthur the swallow may fly south with the sun or the house martin or the plov
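
As a small illustration of the translation table used above, the sketch below applies the same `str.maketrans("", "", ...)` call to one short line. The sample string and the `strip_punctuation` helper are assumptions made for this example and are not part of the notebook. Note that deleting apostrophes fuses contractions (`you're` becomes `youre`), which matters later when tokens are compared against the stopword list.

```python
import string

def strip_punctuation(s: str) -> str:
    """Delete every ASCII punctuation character from s (hypothetical helper)."""
    # Third argument of maketrans: characters to delete
    table = str.maketrans("", "", string.punctuation)
    return s.translate(table)

sample = "soldier #1: you're using coconuts!"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Whitespace is left untouched, so runs of spaces in the original text survive this step and are only collapsed when the text is tokenized again.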

**Convert numbers to words**

In [None]:
!pip install inflect



In [None]:
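import inflect
p = inflect.engine()

# Tokenize the cleaned text first so that tokens1 exists before the loop
tokens1 = word_tokenize(text1)

# Spell out purely numeric tokens (e.g. "1" -> "one"); keep all others as they are
new_tokens = []
for t in tokens1:
    if t.isdigit():
        new_tokens.append(p.number_to_words(t))
    else:
        new_tokens.append(t)

print(new_tokens)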



**Correcting the tokens**

In [None]:
text1 = " ".join(new_tokens)
print(text1)



In [None]:
# inflect hyphenates compound numbers (e.g. "twenty-one"); split them into separate words
text1 = text1.replace("-"," ")
print(text1)



In [None]:
tokens1 = word_tokenize(text1)
print(set(tokens1))



In [None]:
from nltk.tokenize import word_tokenize
tokens1 = word_tokenize(text1) # NLTK function
#print(tokens1)
print("Original number of tokens: ",len(tokens))
print("Number of Unique tokens after converting to lower case: " ,len(set(tokens1)))

Original number of tokens:  16450
Number of Unique tokens after converting to lower case:  1830
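
The task asks for the number of unique tokens both before and after lower-casing, while the cell above prints the total token count for the original text. A minimal sketch of that comparison follows, assuming a made-up list of sample tokens and a hypothetical `unique_counts` helper. In the notebook, `len(set(tokens))` would play the role of the "before" count.

```python
def unique_counts(tokens):
    """Return (unique tokens as given, unique tokens after lower-casing)."""
    before = len(set(tokens))
    after = len({t.lower() for t in tokens})
    return before, after

# Made-up sample; the notebook's tokens come from word_tokenize instead
sample_tokens = "Arthur : What ? SOLDIER : what ? arthur".split()
before, after = unique_counts(sample_tokens)
print("Unique tokens before lower-casing:", before)
print("Unique tokens after lower-casing:", after)
```

Lower-casing can only merge tokens, so the second count is never larger than the first.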


**3. Report the number of stopwords in the file. Report the number of tokens left after stopword removal.**

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
{'them', 'they', 'more', 'mightn', 've', 'have', "she's", 'both', 'as', 'that', 'the', 'o', 'be', 'been', 'doing', 'which', 'through', 'here', 'after', 'herself', 'ourselves', 'nor', 'about', 'each', 'isn', 'wouldn', 'i', "wasn't", "hadn't", "wouldn't", 'too', 'needn', 'do', 'their', 'him', 'this', 'until', "didn't", 'was', "mustn't", "shan't", 'those', 'didn', "you've", 'where', 'hers', 'above', 'how', 'weren', 'but', 'all', 'against', 'before', 'some', 'an', 'does', 'its', "needn't", 'wasn', "hasn't", 'other', 'me', 'between', 'very', 'ma', 'mustn', 'she', 'with', 'only', 'to', 'over', 'itself', 'just', 'once', "you'd", 'y', 'while', "you're", 'it', 'these', 'such', "that'll", 'out', "aren't", 'down', 'most', "weren't", 'theirs', 'why', 'in', 'are', "don't", 'now', "doesn't", 'yours', 'own', 's', 'yourselves', "shouldn't", 'then', 'were', 'he', 'there', "won't", 'or', 'off',

In [None]:
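# Punctuation was stripped from the text, so strip it from the stopwords as well
# (e.g. "you're" -> "youre") to keep the two comparable
new_stop_words = {st.translate(mapping) for st in stop_words}
print(new_stop_words)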

{'isnt', 'them', 'they', 'mightn', 'more', 'hasnt', 've', 'have', 'both', 'as', 'wasnt', 'that', 'youre', 'o', 'the', 'be', 'been', 'doing', 'arent', 'which', 'shouldnt', 'through', 'here', 'after', 'mustnt', 'herself', 'ourselves', 'youve', 'youll', 'nor', 'about', 'each', 'isn', 'wouldn', 'i', 'too', 'needn', 'do', 'their', 'him', 'this', 'until', 'was', 'those', 'didn', 'where', 'hers', 'above', 'how', 'weren', 'but', 'all', 'werent', 'against', 'before', 'some', 'an', 'does', 'its', 'doesnt', 'shouldve', 'wasn', 'other', 'me', 'between', 'very', 'ma', 'mustn', 'couldnt', 'she', 'with', 'only', 'to', 'dont', 'over', 'itself', 'just', 'once', 'y', 'wont', 'while', 'shant', 'it', 'these', 'such', 'thatll', 'out', 'down', 'most', 'theirs', 'why', 'in', 'are', 'now', 'yours', 'own', 's', 'didnt', 'yourselves', 'then', 'were', 'he', 'there', 'or', 'off', 'because', 'won', 'being', 'again', 'has', 'ours', 'll', 'youd', 'we', 'by', 'on', 'd', 'my', 'am', 'hasn', 'any', 'doesn', 'your', 'hi

In [None]:
imp_tokens = [t for t in tokens1 if t not in new_stop_words]
print(imp_tokens)



In [None]:
print("Original number of tokens: ",len(tokens))
# Note: tokens still includes punctuation tokens, so this difference also counts
# the punctuation dropped earlier, not only stopwords
print("Number of stopwords in the file: ",len(tokens)-len(imp_tokens), " or ",round((len(tokens)-len(imp_tokens))*100/len(tokens),2),"%")
print("Number of tokens left after removing stopwords: ",len(imp_tokens))

Original number of tokens:  16450
Number of stopwords in the file:  9514  or  57.84 %
Number of tokens left after removing stopwords:  6936
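
The stopword figure above is obtained by subtraction from the original token list. An alternative is to count the stopword tokens directly on the cleaned token list, so that nothing but stopwords enters the figure. The sketch below shows the idea on a tiny example. The `split_stopwords` helper, the miniature stopword set, and the sample sentence are all assumptions made for illustration. In the notebook, `tokens1` and `new_stop_words` would be the inputs.

```python
def split_stopwords(tokens, stop_words):
    """Return (stopword tokens, remaining tokens), both in original order."""
    removed = [t for t in tokens if t in stop_words]
    kept = [t for t in tokens if t not in stop_words]
    return removed, kept

# Hypothetical miniature stopword set, standing in for the NLTK list
sample_stop_words = {"we", "have", "the", "of", "and"}
sample_tokens = "we have ridden the length and breadth of the land".split()

removed, kept = split_stopwords(sample_tokens, sample_stop_words)
print("Number of stopwords:", len(removed))
print("Number of tokens left:", len(kept))
```

By construction the two counts add up to the length of the cleaned token list, which gives a quick consistency check.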


**4. Perform stemming after removing stopwords and report the number of unique tokens left in the text.**

In [None]:
from nltk.stem import PorterStemmer

pr = PorterStemmer()
stemmed_tokens = [pr.stem(t) for t in imp_tokens]

unique_tokens = list(set(stemmed_tokens))
unique_tokens.sort()
#print(unique_tokens)
print("Number of tokens after removing stopwords: ",len(imp_tokens))
print("Number of unique tokens left after stemming: ",len(unique_tokens))

Number of tokens after removing stopwords:  6936
Number of unique tokens left after stemming:  1511


**5. Report the number of words starting with a consonant and the number of words starting with a vowel in the file given after performing steps 1, 2, 3, and 4.**

In [None]:
import re
vowels = 'aeiou'
consonants = 'bcdfghjklmnpqrstvwxyz'
# Count over the stemmed tokens (the result of steps 1-4), checking each token's first letter
n_consonants = sum(1 for t in stemmed_tokens if t.startswith(tuple(consonants)))
n_vowels = sum(1 for t in stemmed_tokens if t.startswith(tuple(vowels)))
print("Number of words starting with a consonant: ", n_consonants)
print("Number of words starting with a vowel: ", n_vowels)


**6. Given a word and a file as input, return the number of sentences starting with that word in the input file after performing steps 1, 2, 3, and 4.**

In [None]:
word = input()
# Scratch check: list every match of a non-"b" character followed by "og" in the word
re.findall("[^b]og",word)

sabyasa


[]
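
One possible way to approach tasks 6 to 8 is sketched below. It is not the notebook's own solution, and it rests on several assumptions: sentences are split with a naive regular expression rather than `sent_tokenize`, `STOP_WORDS` is a hypothetical stand-in for the NLTK stopword list, and stemming is left out to keep the example self-contained. After stopword removal, "starting with" and "ending with" refer to the first and last remaining word of each sentence.

```python
import re

# Hypothetical stand-in for the NLTK stopword list
STOP_WORDS = {"the", "is", "a", "of", "this", "it", "i"}

def preprocess(sentence):
    """Lower-case, strip punctuation, and drop stopwords (no stemming here)."""
    words = re.sub(r"[^a-z0-9\s]", "", sentence.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

def sentence_word_stats(text, word):
    """Return (sentences starting with word, sentences ending with word,
    total occurrences of word) after preprocessing."""
    word = word.lower()
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Discard sentences that are empty once preprocessed
    processed = [p for p in (preprocess(s) for s in sentences) if p]
    starts = sum(1 for p in processed if p[0] == word)
    ends = sum(1 for p in processed if p[-1] == word)
    total = sum(p.count(word) for p in processed)
    return starts, ends, total

sample = ("Halt! Who goes there? Arthur found the coconuts. "
          "Coconuts migrate? Found them.")
print(sentence_word_stats(sample, "coconuts"))
```

To apply this to the file, `text` would be the contents of grail.txt, and the query word would need the same preprocessing (including stemming, if it is added) as the sentences.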