Importing corpus

In [1]:
import nltk

1. Listing the available corpus names

In [2]:
print(dir(nltk.corpus))

['_LazyModule__lazymodule_globals', '_LazyModule__lazymodule_import', '_LazyModule__lazymodule_init', '_LazyModule__lazymodule_loaded', '_LazyModule__lazymodule_locals', '_LazyModule__lazymodule_name', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__']
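
The `dir()` output above lists only the lazy loader's internal attributes, not corpus names. One way to see which corpora are actually installed is to scan the `corpora` subfolder of an NLTK data directory. The following is a minimal sketch, assuming the `<data_dir>/corpora/<name>` layout implied by the download logs in this notebook; `installed_corpora` is a hypothetical helper, not an NLTK function.

```python
import os

def installed_corpora(data_dir):
    """Return sorted names of corpora found under <data_dir>/corpora.

    Hypothetical helper (not part of NLTK). Assumes each corpus is a
    folder or a .zip archive directly inside the corpora folder.
    """
    corpora_dir = os.path.join(data_dir, "corpora")
    if not os.path.isdir(corpora_dir):
        return []
    names = set()
    for entry in os.listdir(corpora_dir):
        name, ext = os.path.splitext(entry)
        # "wordnet.zip" and the unzipped "wordnet" folder count once.
        names.add(name if ext == ".zip" else entry)
    return sorted(names)

# Example: scan the default per-user data directory.
print(installed_corpora(os.path.expanduser("~/nltk_data")))
```

The list is empty until something has been downloaded, which matches the order of the cells in this notebook.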


In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [4]:
from nltk.corpus import wordnet

In [5]:
statement = "NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum."

In [6]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [7]:
words = word_tokenize(statement)

In [8]:
len(words)
nltk.download('stopwords')
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
only_words = [w for w in words if w not in string.punctuation]

In [10]:
from nltk.corpus import stopwords

# Build the stop-word set once instead of once per token
stop_words = set(stopwords.words("english"))
words = [w for w in only_words if w not in stop_words]

In [11]:
len(words)

38

In [12]:
print(words)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', 'It', 'provides', 'easy-to-use', 'interfaces', '50', 'corpora', 'lexical', 'resources', 'WordNet', 'along', 'suite', 'text', 'processing', 'libraries', 'classification', 'tokenization', 'stemming', 'tagging', 'parsing', 'semantic', 'reasoning', 'wrappers', 'industrial-strength', 'NLP', 'libraries', 'active', 'discussion', 'forum']
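
Note that 'It' survives the filter above because the comparison is case-sensitive and the stop-word list is lowercase. The sketch below shows the same two filtering steps with a case-insensitive stop-word check. `STOPWORDS` is a tiny hand-written stand-in for `stopwords.words("english")` and `content_words` is a hypothetical helper, both used only for illustration.

```python
import string

# Tiny stand-in for stopwords.words("english"); an assumption for illustration.
STOPWORDS = {"is", "a", "for", "to", "with", "it", "and"}

def content_words(tokens, stopwords=STOPWORDS):
    """Drop punctuation tokens and stop words, keeping the original order."""
    punctuation = set(string.punctuation)
    return [t for t in tokens
            if t not in punctuation and t.lower() not in stopwords]

tokens = ["NLTK", "is", "a", "leading", "platform", ",", "and", "It", "works", "."]
print(content_words(tokens))  # ['NLTK', 'leading', 'platform', 'works']
```

Whether to lowercase before the check is a design choice: it removes sentence-initial stop words such as 'It', at the cost of also removing capitalized tokens that happen to match a stop word.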


2. Defining some words

In [13]:
for word in words:
    syns = wordnet.synsets(word)
    if syns:
        # Show only the first sense that WordNet returns for the word
        print("{} : {}".format(word, syns[0].definition()))
    else:
        print("{} has no definition".format(word))

NLTK has no definition
leading : thin strip of metal used to separate lines of type in printing
platform : a raised horizontal surface
building : a structure that has a roof and walls and stands more or less permanently in one place
Python : large Old World boas
programs : a series of steps to be carried out or goals to be accomplished
work : activity directed toward making or doing something
human : any living or extinct member of the family Hominidae characterized by superior intelligence, articulate speech, and erect carriage
language : a systematic means of communicating by the use of sounds or conventional symbols
data : a collection of facts from which conclusions may be drawn
It : the branch of engineering that deals with the use of computers and telecommunications to retrieve and store and transmit information
provides : give something useful or necessary to
easy-to-use has no definition
interfaces : (chemistry) a surface forming a common boundary between two things (two object

3. Building a custom corpus

In [14]:
import os, os.path
path = os.path.expanduser('~/nltk_data')

if not os.path.exists(path):
    os.mkdir(path)

os.path.exists(path)

True

In [15]:
import nltk.data
path in nltk.data.path

True

In [18]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
from nltk.corpus.reader import WordListCorpusReader

In [20]:
rc = WordListCorpusReader('.', ['/content/drive/My Drive/corpus.txt'])

In [21]:
rc.words()

['NLTK',
 'leading',
 'platform',
 'building',
 'Python',
 'programs',
 'work',
 'human',
 'language',
 'data',
 'It',
 'provides',
 'easy-to-use',
 'interfaces',
 '50',
 'corpora',
 'lexical',
 'resources',
 'WordNet',
 'along',
 'suite',
 'text',
 'processing',
 'libraries',
 'classification',
 'tokenization',
 'stemming',
 'tagging',
 'parsing',
 'semantic',
 'reasoning',
 'wrappers',
 'industrial-strength',
 'NLP',
 'libraries',
 'active',
 'discussion',
 'forum']

In [22]:
len(words)

38

The `words` list built earlier holds the same 38 words as the corpus file read above.
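
A word-list corpus file is plain text with one word per line. The sketch below writes such a file and reads it back with plain Python, so the layout is visible without NLTK. The temporary directory and both helper names are assumptions for illustration; in this notebook the file lives on Google Drive instead.

```python
import os
import tempfile

def write_word_list(words, path):
    """Write one word per line, the layout a word-list corpus file uses."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")

def read_word_list(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

sample = ["NLTK", "leading", "platform"]
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")  # hypothetical location
write_word_list(sample, corpus_path)
print(read_word_list(corpus_path) == sample)  # True
```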

In [23]:
import nltk
nltk.download('tagsets')


[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

4. Overview of tagset

In [24]:
# brown_tagset() prints the listing itself and returns None, so print() is not needed
nltk.help.brown_tagset()

(: opening parenthesis
    (
): closing parenthesis
    )
*: negator
    not n't
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ? ; ! :
:: colon
    :
ABL: determiner/pronoun, pre-qualifier
    quite such rather
ABN: determiner/pronoun, pre-quantifier
    all half many nary
ABX: determiner/pronoun, double conjunction or pre-quantifier
    both
AP: determiner/pronoun, post-determiner
    many other next more last former little several enough most least only
    very few fewer past same Last latter less single plenty 'nough lesser
    certain various manye next-to-last particular final previous present
    nuf
AP$: determiner/pronoun, post-determiner, genitive
    other's
AP+AP: determiner/pronoun, post-determiner, hyphenated pair
    many-much
AT: article
    the an no a every th' ever' ye
BE: verb 'to be', infinitive or imperative
    be
BED: verb 'to be', past tense, 2nd person singular or all persons plural
    were
BED*: verb 'to be', past tense, 2nd person singular or 

In [25]:
## Details of a single tag (the helper prints the entry and returns None)
nltk.help.brown_tagset(r'VB')

VB: verb, base: uninflected present, imperative or infinitive
    investigate find act follow inure achieve reduce take remedy re-set
    distribute realize disable feel receive continue place protect
    eliminate elaborate work permit run enter force ...


In [26]:
## Details of all tags matching a pattern
nltk.help.brown_tagset(r'VB*')

VB: verb, base: uninflected present, imperative or infinitive
    investigate find act follow inure achieve reduce take remedy re-set
    distribute realize disable feel receive continue place protect
    eliminate elaborate work permit run enter force ...
VB+AT: verb, base: uninflected present or infinitive + article
    wanna
VB+IN: verb, base: uninflected present, imperative or infinitive + preposition
    lookit
VB+JJ: verb, base: uninflected present, imperative or infinitive + adjective
    die-dead
VB+PPO: verb, uninflected present tense + pronoun, personal, accusative
    let's lemme gimme
VB+RP: verb, imperative + adverbial particle
    g'ahn c'mon
VB+TO: verb, base: uninflected present, imperative or infinitive + infinitival to
    wanta wanna
VB+VB: verb, base: uninflected present, imperative or infinitive; hypenated pair
    say-speak
VBD: verb, past tense
    said produced took recommended commented urged found added praised
    charged listed became announced brought atten

5. Let's look at some names available in the names corpus!

In [27]:
nltk.download('names')
from nltk.corpus import names

print("Number of male names : ", len(names.words('male.txt')))
print("Number of female names : ", len(names.words('female.txt')))

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.
Number of male names :  2943
Number of female names :  5001


In [28]:
malename = names.words('male.txt')
femalename = names.words('female.txt')

In [29]:
malename[:10]

['Aamir',
 'Aaron',
 'Abbey',
 'Abbie',
 'Abbot',
 'Abbott',
 'Abby',
 'Abdel',
 'Abdul',
 'Abdulkarim']

In [30]:
femalename[:10]

['Abagael',
 'Abagail',
 'Abbe',
 'Abbey',
 'Abbi',
 'Abbie',
 'Abby',
 'Abigael',
 'Abigail',
 'Abigale']

In [31]:
male_name = [(name, 'male') for name in malename]
female_name = [(name, 'female') for name in femalename]

In [32]:
male_name

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male'),
 ('Abdullah', 'male'),
 ('Abe', 'male'),
 ('Abel', 'male'),
 ('Abelard', 'male'),
 ('Abner', 'male'),
 ('Abraham', 'male'),
 ('Abram', 'male'),
 ('Ace', 'male'),
 ('Adair', 'male'),
 ('Adam', 'male'),
 ('Adams', 'male'),
 ('Addie', 'male'),
 ('Adger', 'male'),
 ('Aditya', 'male'),
 ('Adlai', 'male'),
 ('Adnan', 'male'),
 ('Adolf', 'male'),
 ('Adolfo', 'male'),
 ('Adolph', 'male'),
 ('Adolphe', 'male'),
 ('Adolpho', 'male'),
 ('Adolphus', 'male'),
 ('Adrian', 'male'),
 ('Adrick', 'male'),
 ('Adrien', 'male'),
 ('Agamemnon', 'male'),
 ('Aguinaldo', 'male'),
 ('Aguste', 'male'),
 ('Agustin', 'male'),
 ('Aharon', 'male'),
 ('Ahmad', 'male'),
 ('Ahmed', 'male'),
 ('Ahmet', 'male'),
 ('Ajai', 'male'),
 ('Ajay', 'male'),
 ('Al', 'male'),
 ('Alaa', 'male'),
 ('Alain', 'male'),
 ('Alan', 'male

In [33]:
combined_name = male_name + female_name

6. Random male and female names

In [35]:
import random
for _ in range(15):
    # Use the list length rather than hard-coding the total number of names (7944)
    i = random.randrange(len(combined_name))
    print(combined_name[i])

('Dietrich', 'male')
('Faythe', 'female')
('Spence', 'male')
('Neale', 'male')
('Hedy', 'female')
('Koo', 'female')
('Zane', 'male')
('Karalee', 'female')
('Lorne', 'male')
('Alton', 'male')
('Kee', 'female')
('Anjanette', 'female')
('Elijah', 'male')
('Ivie', 'female')
('Miles', 'male')
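
Drawing by random index can return the same name twice. If distinct names are wanted, `random.sample` is an alternative. A minimal sketch on a small hand-copied subset of the labeled names (the subset and the seed value are assumptions for illustration):

```python
import random

# Small subset of (name, gender) pairs copied from the lists above.
labeled = [("Aamir", "male"), ("Aaron", "male"), ("Abbey", "male"),
           ("Abagael", "female"), ("Abagail", "female"), ("Abbe", "female")]

rng = random.Random(42)          # fixed seed so the draw is repeatable
picked = rng.sample(labeled, 3)  # three distinct entries, no index arithmetic
for name, gender in picked:
    print(name, gender)
```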


7. Building a list of the letters that appear at the end of each name

In [36]:
lastletter = []

In [37]:
for name in malename:
    lastletter.append(name[-1])

In [38]:
for name in femalename:
    lastletter.append(name[-1])

In [39]:
lastletter

['r',
 'n',
 'y',
 'e',
 't',
 't',
 'y',
 'l',
 'l',
 'm',
 'h',
 'e',
 'l',
 'd',
 'r',
 'm',
 'm',
 'e',
 'r',
 'm',
 's',
 'e',
 'r',
 'a',
 'i',
 'n',
 'f',
 'o',
 'h',
 'e',
 'o',
 's',
 'n',
 'k',
 'n',
 'n',
 'o',
 'e',
 'n',
 'n',
 'd',
 'd',
 't',
 'i',
 'y',
 'l',
 'a',
 'n',
 'n',
 'r',
 'r',
 's',
 't',
 'o',
 't',
 'n',
 's',
 'o',
 'c',
 'h',
 's',
 'n',
 'c',
 'k',
 'o',
 's',
 's',
 'o',
 'x',
 'r',
 'i',
 's',
 'f',
 'e',
 'e',
 'o',
 'o',
 'd',
 'd',
 'o',
 'n',
 'i',
 'c',
 'r',
 'x',
 'h',
 'n',
 'n',
 'y',
 'e',
 'n',
 'n',
 'o',
 'o',
 's',
 'e',
 'o',
 'n',
 'n',
 'n',
 'n',
 's',
 's',
 'e',
 'i',
 'o',
 's',
 'y',
 'y',
 's',
 'l',
 'e',
 'o',
 'y',
 's',
 'e',
 's',
 'e',
 'a',
 's',
 'j',
 's',
 'w',
 'y',
 'i',
 's',
 's',
 'j',
 'y',
 'l',
 'o',
 'o',
 'e',
 's',
 'l',
 'l',
 'm',
 'n',
 'y',
 'n',
 'e',
 'n',
 'e',
 'i',
 'n',
 'o',
 'o',
 's',
 'y',
 'g',
 'o',
 's',
 'm',
 'd',
 'd',
 'e',
 'n',
 'y',
 'l',
 'i',
 'e',
 'l',
 'e',
 'o',
 'd',
 'o',
 'd'

In [40]:
len(lastletter)

7944

7944 letters in total, one for each name.
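
A flat list of 7944 letters is hard to read. A natural next step is to count how often each final letter occurs, separately for male and female names. The sketch below does this with `collections.Counter` on the first five names of each list shown earlier; `last_letter_counts` is a hypothetical helper, and the tiny sample says nothing about the full corpus.

```python
from collections import Counter

# First five entries of each list, copied from the outputs above.
male = ["Aamir", "Aaron", "Abbey", "Abbie", "Abbot"]
female = ["Abagael", "Abagail", "Abbe", "Abbey", "Abbi"]

def last_letter_counts(names):
    """Count the final letter of each name, ignoring case."""
    return Counter(name[-1].lower() for name in names)

male_counts = last_letter_counts(male)
female_counts = last_letter_counts(female)
print(male_counts)
print(female_counts)
```

With the full corpus, the same call would be `last_letter_counts(malename)` and `last_letter_counts(femalename)`.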