## Reading data using PlaintextCorpusReader

NLTK provides several corpus readers, each suited to a different type of data source. Details are available at http://www.nltk.org/howto/corpus.html  
  
The PlaintextCorpusReader reads a set of plain text files under a directory.

In [1]:
import os
import nltk
# nltk.download('punkt')
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Read the file into a corpus. The second argument is a regular expression
# (or a list of file ids), so the same call can read an entire directory
corpus = PlaintextCorpusReader(os.getcwd(), "data.txt")

# Print raw contents of the corpus
print(corpus.raw())


You'll learn NLP.

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.

NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

Natural Language Processing with Python pro

Each file in the directory becomes a single file id, and the contents of all the files together constitute the corpus.  
The reader splits the data into paragraphs, sentences, and tokens automatically as the corpus is read.

In [2]:
# Extract file ids, paragraphs, sentences and words from the corpus
print(f'Files: {corpus.fileids()}\n')

paragraphs = corpus.paras()
print(f'Total paragraphs: {len(paragraphs)}\n')

sentences = corpus.sents()
print(f'Total sentences: {len(sentences)}\n')
print(f'The first sentence: {sentences[0]}\n')

print(f'Words: {corpus.words()}')


Files: ['data.txt']

Total paragraphs: 5

Total sentences: 9

The first sentence: ['You', "'", 'll', 'learn', 'NLP', '.']

Words: ['You', "'", 'll', 'learn', 'NLP', '.', 'NLTK', 'is', ...]
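
To illustrate how each file in a directory becomes its own file id, here is a minimal sketch that reads several files at once. The temporary directory, the file names `first.txt` and `second.txt`, and their contents are made up for this example and are not part of the notebook's data.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Build a throwaway directory with two small text files (hypothetical data).
corpus_dir = tempfile.mkdtemp()
samples = {
    "first.txt": "NLTK reads plain text files.",
    "second.txt": "Each file gets its own file id.",
}
for name, text in samples.items():
    with open(os.path.join(corpus_dir, name), "w", encoding="utf-8") as handle:
        handle.write(text)

# The second argument is a regular expression matched against file names,
# so this pattern picks up every .txt file in the directory.
multi_corpus = PlaintextCorpusReader(corpus_dir, r".*\.txt")

file_ids = multi_corpus.fileids()
print(file_ids)

# Accessors such as words() also accept a file id to restrict the result.
first_words = list(multi_corpus.words("first.txt"))
print(first_words)
```

Calling `words()` without an argument would return the tokens of both files as one corpus.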


## Analyze the Corpus

NLTK provides several functions for analyzing distributions and aggregate statistics over the data in a corpus.

In [3]:
# Find the frequency distribution of words in the corpus
course_freq_dist = nltk.FreqDist(corpus.words())

# Print the most commonly used words
print("Top 10 words in the corpus : ", course_freq_dist.most_common(10))

# Look up the count for a specific word (.get returns None if the word is absent)
print("\n Distribution for \"Spark\" : ", course_freq_dist.get("Spark"))

Top 10 words in the corpus :  [(',', 27), ('.', 8), ('and', 8), ('for', 7), ('NLTK', 6), ('a', 6), ('to', 6), ('with', 5), ('-', 5), ('is', 4)]

 Distribution for "Spark" :  None
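
The counts above are dominated by punctuation, and tokens are counted case-sensitively. One possible refinement, sketched here under the assumption that only alphabetic, lowercased tokens are of interest, is to normalize the tokens before counting. The `tokens` list is a hypothetical stand-in for `corpus.words()`, so the example runs without `data.txt`.

```python
import nltk

# Hypothetical token list standing in for corpus.words().
tokens = ["NLTK", "is", "free", ",", "and", "nltk", "is", "open", "source", "."]

# Keep alphabetic tokens only and lowercase them before counting.
normalized = [token.lower() for token in tokens if token.isalpha()]
freq_dist = nltk.FreqDist(normalized)

print(freq_dist.most_common(2))

# FreqDist is a Counter subclass: indexing yields 0 for an unseen word,
# whereas .get() yields None.
print(freq_dist["spark"], freq_dist.get("spark"))
```

Indexing with `freq_dist[word]` may be more convenient than `.get(word)` when the result feeds into arithmetic, since a missing word counts as zero instead of `None`.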
