<a href="https://colab.research.google.com/github/xpdlaldam/nlp/blob/master/nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
### Stemming & Lemmatization
## Stemming is very crude as it just chops off the end of the word => the result is not necessarily a real word ex) better -> better
## Lemmatization is more sophisticated as it uses actual rules of language => the true root word will be returned ex) better -> good
## Part-of-Speech (POS) matters: "going" as a noun keeps the lemma "going", but as a verb its lemma is "go"

### Bag-of-Words (BoW)
## counts word frequencies
## unordered
## simpler model than TF-IDF but useful if computational speed is important

### TF-IDF
## considers both word frequency and rarity across all documents

### Tokenization
## the process of breaking down text into smaller units, called tokens
## text.split(): ex) I like cats => ["I", "like", "cats"]
## punctuation like . or ? can matter depending on your task ex) I like cats? => ["I", "like", "cats", "?"]
## note: scikit-learn's CountVectorizer ignores punctuation by default
## casing: cat and Cat should usually be treated as identical => CountVectorizer(lowercase=True), which is the default
## accents ex) résumé => resume with CountVectorizer(strip_accents="unicode"); valid values are "ascii", "unicode", or None

## Types of tokenization
# 1. word-based tokenization: takes up too much space but contains more information => CountVectorizer(analyzer="word")
# 2. character-based tokenization: takes up less space but contains less information => CountVectorizer(analyzer="char")
# 3. subword-based tokenization: middle ground b/w word and character-based
# ex) we want "walking" to be split into "walk" + "ing" o.w. "walking" and "walk" will be recognized as
# different words
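
A rough illustration of the tokenization choices above, using only the standard library. The regular expression is the default `token_pattern` documented for scikit-learn's `CountVectorizer`; the helper names (`strip_accents`, `word_tokens`, `char_tokens`) are made up for this sketch and are not part of any library.

```python
import re
import unicodedata

def strip_accents(text):
    # decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def word_tokens(text):
    # lowercase, then keep runs of 2+ word characters (punctuation is dropped)
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def char_tokens(text):
    # every character (including spaces) becomes a token
    return list(text.lower())

print("I like cats?".split())       # whitespace split keeps "cats?" as one token
print(word_tokens("I like cats?"))  # the one-letter "I" and the "?" are both dropped
print(char_tokens("Cat"))
print(strip_accents("résumé"))
```

Note that this pattern also discards one-character words such as "I", which is easy to miss when comparing against `text.split()`.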

### Steps of a typical NLP analysis
## 1. get the text
## 2. tokenize the text
## 3. remove stopwords, apply stemming / lemmatization
## 4. map tokens to integers
## 5. convert text into count vectors / TF-IDF
## 6. Do ML task

### Markov Property
## x_t only depends on x_t-1 or
## p(x_t|x_t-1, x_t-2, ...) = p(x_t|x_t-1) or
## x_t is independent of x_t-2, x_t-3, ...
## use case in NLP:
  # from 2000 most common English words
  # we want to predict the 10th word
  # from the previous 9 words in a sentence
  # our model is p(x_10|x_9, x_8, ..., x_1)
  # each x_i has 2_000 possible words, hence
  # total probabilities to estimate is 2000^10
## we assume the Markov property holds,
## even when it does not as the next word
## doesn't always depend on the previous word
  # ex) I like eggs and ham
  # Does "ham" only depend on "and"? no
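
To make the size argument above concrete, here is a small arithmetic sketch comparing the number of probabilities to estimate with and without the Markov assumption. The variable names are only illustrative.

```python
V = 2000      # vocabulary size from the example above
context = 9   # number of previous words used to predict the 10th word

# full model p(x_10 | x_9, ..., x_1): one probability per (context, next word) pair
full_model = V ** (context + 1)

# first-order Markov model p(x_t | x_t-1): a V x V transition table
markov_model = V ** 2

print(f"full model:   {full_model:.3e} probabilities")
print(f"Markov model: {markov_model:,} probabilities")
```

The Markov assumption is wrong for real language, but it shrinks the table from 2000^10 entries to 2000^2, which is what makes counting-based estimation feasible.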

### Markov Model
## state distribution p(s_t): a length-M vector holding the probability of each of the M states at time t
# ex) prob it will rain on Sunday = p(s_sunday=rainy)

## state transition matrix:
# A_ij = p(s_t = j|s_t-1 = i), for all i = 1,...,M and j = 1,...,M
# if we apply Markov model to NLP, for the first word in a sentence
# there is no previous state, hence to quantify the probability
# of the first state in a sequence, we use the initial state distribution
# pi_i = p(s_1 = i) for i=1,...,M

## How to estimate A and pi
# A hat sub ij = count(i => j) / count(i)
# ex) prob of seeing the word "cat" following the word "the"
# i.e. "the" => "cat"
# i.e. count("the cat") / count("the")

## we can use a Markov model for text classification
# note that text classification is supervised learning
# but Markov models are unsupervised
# solution is to apply Bayes' rule
# ex) train a Markov model by feeding poems by poet A
# train a Markov model by feeding poems by poet B
# for a new unknown text we want to know which
# poet wrote the poem i.e.
# p(x|poet=A)
# p(x|poet=B)
# we choose the highest probability
# in other words, we have p(poem|poet)
# but we want p(poet|poem)
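# we want k* = argmax_k p(poet=k|poem)
# by Bayes' rule: p(poet=k|poem) = p(poem|poet=k)p(poet=k) / p(poem)
# k* = argmax_k p(poem|poet=k)p(poet=k) / p(poem)
# p(poem) does not depend on k, so dropping it leaves the argmax unchanged
# = argmax_k p(poem|poet=k)p(poet=k)
# log is monotonic, so taking logs also leaves the argmax unchanged
# = argmax_k logp(poem|poet=k) + logp(poet=k)
# if p(poet) is uniform, logp(poet=k) is the same for every k and can be dropped
# = argmax_k logp(poem|poet=k)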
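
The counting estimates for A and pi and the Bayes-rule decision described above can be sketched in plain Python as follows. This is a minimal illustration, not a definitive implementation: the two tiny "poet" corpora are invented for the example, the function names are hypothetical, and add-one smoothing is an assumption added here so that unseen transitions do not produce log(0).

```python
import math
from collections import Counter, defaultdict

def train_markov(lines, vocab_size):
    """Count-based estimates of log pi and log A, with add-one smoothing (an assumption)."""
    first_counts = Counter()
    transition_counts = defaultdict(Counter)
    for tokens in lines:
        first_counts[tokens[0]] += 1                 # counts for pi
        for prev, cur in zip(tokens, tokens[1:]):
            transition_counts[prev][cur] += 1        # counts for A: count(i => j)

    def log_pi(word):
        return math.log((first_counts[word] + 1) / (len(lines) + vocab_size))

    def log_A(prev, cur):
        row = transition_counts.get(prev, Counter())
        return math.log((row[cur] + 1) / (sum(row.values()) + vocab_size))

    return log_pi, log_A

def log_likelihood(tokens, model):
    # log p(x) = log pi(x_1) + sum_t log A(x_t-1, x_t)
    log_pi, log_A = model
    total = log_pi(tokens[0])
    for prev, cur in zip(tokens, tokens[1:]):
        total += log_A(prev, cur)
    return total

def classify(tokens, models, log_priors):
    # k* = argmax_k log p(x | class=k) + log p(class=k)
    scores = {k: log_likelihood(tokens, models[k]) + log_priors[k] for k in models}
    return max(scores, key=scores.get)

# invented toy corpora, one list of tokenized lines per class
corpus = {
    "A": [line.split() for line in ["the raven sat alone", "the night was dark"]],
    "B": [line.split() for line in ["the road was long", "two roads in a wood"]],
}
vocab = {w for lines in corpus.values() for tokens in lines for w in tokens}
models = {k: train_markov(lines, len(vocab)) for k, lines in corpus.items()}
n_total = sum(len(lines) for lines in corpus.values())
log_priors = {k: math.log(len(lines) / n_total) for k, lines in corpus.items()}

print(classify("the night was dark".split(), models, log_priors))
print(classify("the road was long".split(), models, log_priors))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.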

In [None]:
lemmatizer.lemmatize("mice") # returns "mouse"

lemmatizer.lemmatize("going", pos=wordnet.NOUN) # returns "going"
lemmatizer.lemmatize("going", pos=wordnet.VERB) # returns 'go'

# 14.Stemming & Lemmatization Demo 6/16/2024

In [None]:
import nltk
from nltk.stem import PorterStemmer

from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
from nltk.corpus import wordnet

In [None]:
porter = PorterStemmer()

In [None]:
porter.stem("walking")

In [None]:
porter.stem("walked")

In [None]:
porter.stem("ran") # returns "ran": the stemmer only strips suffixes, so irregular forms are not mapped to "run"

In [None]:
sentence = "Lemmatization is more sophisticated than stemming".split()
for token in sentence:
  # print(porter.stem(token), end=" ") # end = " " is to print the result in a single line
  print(porter.stem(token))

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
lemmatizer.lemmatize("walking") # default: pos=wordnet.NOUN

In [None]:
lemmatizer.lemmatize("walking", pos=wordnet.VERB)

In [None]:
lemmatizer.lemmatize("ran", pos=wordnet.VERB)

In [None]:
lemmatizer.lemmatize("was", pos=wordnet.VERB)

In [None]:
lemmatizer.lemmatize("better", pos=wordnet.ADJ)

In [None]:
def get_wordnet_pos(treebank_tag):
  if treebank_tag.startswith("J"):
    return wordnet.ADJ
  elif treebank_tag.startswith("V"):
    return wordnet.VERB
  elif treebank_tag.startswith("N"):
    return wordnet.NOUN
  elif treebank_tag.startswith("R"):
    return wordnet.ADV
  else:
    return wordnet.NOUN

In [None]:
nltk.download("averaged_perceptron_tagger")

In [None]:
sentence = "Donald Trump has a devoted following".split()
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

In [None]:
for word, tag in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
  print(lemma, end=" ")

In [None]:
sentence = "The cat was following the bird as it flew by".split()
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

In [None]:
for word, tag in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
  print(lemma, end=" ")

# 15. Count Vectorizer

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import wordnet

In [None]:
## download data for nltk
nltk.download("wordnet")
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

In [None]:
# https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
!gsutil cp gs://dataset-uploader/bbc/bbc-text.csv .

In [None]:
df = pd.read_csv("bbc-text.csv")

In [None]:
df

In [None]:
inputs = df['text']
labels = df['category']

In [None]:
labels.hist(figsize=(10, 5))

In [None]:
inputs_train, inputs_test, Ytrain, Ytest = train_test_split(
    inputs, labels, random_state=123)

In [None]:
vectorizer = CountVectorizer()

In [None]:
Xtrain = vectorizer.fit_transform(inputs_train)
Xtest = vectorizer.transform(inputs_test)

In [None]:
Xtrain

In [None]:
(Xtrain != 0).sum()

In [None]:
# what percentage of values are non-zero?
# < 1% => most values are zeros
# thus, it makes sense to use a sparse matrix
(Xtrain != 0).sum() / np.prod(Xtrain.shape)

In [None]:
model = MultinomialNB()
model.fit(Xtrain, Ytrain)
print("train score:", model.score(Xtrain, Ytrain))
print("test score:", model.score(Xtest, Ytest))

In [None]:
# using stopwords (default: stopwords are not removed)
vectorizer = CountVectorizer(stop_words='english')
Xtrain = vectorizer.fit_transform(inputs_train)
Xtest = vectorizer.transform(inputs_test)
model = MultinomialNB()
model.fit(Xtrain, Ytrain)
print("train score:", model.score(Xtrain, Ytrain))
print("test score:", model.score(Xtest, Ytest))

In [None]:
def get_wordnet_pos(treebank_tag):
  if treebank_tag.startswith('J'):
    return wordnet.ADJ
  elif treebank_tag.startswith('V'):
    return wordnet.VERB
  elif treebank_tag.startswith('N'):
    return wordnet.NOUN
  elif treebank_tag.startswith('R'):
    return wordnet.ADV
  else:
    return wordnet.NOUN

In [None]:
class LemmaTokenizer:
  def __init__(self):
    self.wnl = WordNetLemmatizer()

  # call an object as if it were a function
  def __call__(self, doc):
    tokens = word_tokenize(doc)
    words_and_tags = nltk.pos_tag(tokens)
    return [self.wnl.lemmatize(word, pos=get_wordnet_pos(tag))
            for word, tag in words_and_tags]

In [None]:
# with lemmatization
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())
Xtrain = vectorizer.fit_transform(inputs_train)
Xtest = vectorizer.transform(inputs_test)
model = MultinomialNB()
model.fit(Xtrain, Ytrain)
print("train score:", model.score(Xtrain, Ytrain))
print("test score:", model.score(Xtest, Ytest))

In [None]:
class StemTokenizer:
  def __init__(self):
    self.porter = PorterStemmer()
  def __call__(self, doc):
    tokens = word_tokenize(doc)
    return [self.porter.stem(t) for t in tokens]

In [None]:
# with stemming
vectorizer = CountVectorizer(tokenizer=StemTokenizer())
Xtrain = vectorizer.fit_transform(inputs_train)
Xtest = vectorizer.transform(inputs_test)
model = MultinomialNB()
model.fit(Xtrain, Ytrain)
print("train score:", model.score(Xtrain, Ytrain))
print("test score:", model.score(Xtest, Ytest))

In [None]:
def simple_tokenizer(s):
  return s.split()

In [None]:
# string split tokenizer
vectorizer = CountVectorizer(tokenizer=simple_tokenizer)
Xtrain = vectorizer.fit_transform(inputs_train)
Xtest = vectorizer.transform(inputs_test)
model = MultinomialNB()
model.fit(Xtrain, Ytrain)
print("train score:", model.score(Xtrain, Ytrain))
print("test score:", model.score(Xtest, Ytest))

## TF-IDF Theory

In [None]:
### concept
# Term Frequency - Inverse Document Frequency
# down-weights very common words (e.g. stopwords) so that they don't over-represent the document
# intuition: term frequency / document frequency

### tfidf(t, d) = tf(t, d) * idf(t)
## tf(t, d)
# number of times term t appears in document d
# t is the term and d is the document ex) if the word "hi" appears 3 times in document 1, then tf("hi", doc1) = 3; if "bye" appears 4 times in document 2, then tf("bye", doc2) = 4
# idf takes only t as an argument because the document frequency depends only on the term t
# tf(t, d) gives the same result as CountVectorizer; with V unique terms and N documents the matrix is N by V

## idf(t) = log(N/N(t)), where N = total # of documents; N(t) = # of documents term t appears in
# a measure of how rare a term is across all documents
# TF-IDF decreases if t appears in more documents
# why log?
# monotonic: if N/N(t) increases, so does its log
# squashes big numbers ex) if N = 1M and N(t) = 1, then N/N(t) = 1M but log(N/N(t)) is only about 13.8
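
As a worked example of the formulas above, the following sketch computes tf, idf and tf-idf by hand on three tiny documents using only the standard library. The documents and the helper name `tfidf` are chosen for illustration.

```python
import math
from collections import Counter

docs = ["i like cats", "i love cats", "i love dogs"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# tf(t, d): raw count of term t in document d
tf = [Counter(tokens) for tokens in tokenized]

# N(t): number of documents that contain term t
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

# idf(t) = log(N / N(t))
idf = {t: math.log(N / n_t) for t, n_t in doc_freq.items()}

def tfidf(t, d):
    return tf[d][t] * idf.get(t, 0.0)

for t in sorted(idf):
    print(f"{t:5s} idf={idf[t]:.3f}")
print("tf-idf of 'like' in doc 0:", round(tfidf("like", 0), 3))
```

The word "i" appears in every document, so its idf is log(1) = 0 and it contributes nothing. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from this plain formula.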

# TF-IDF Recommender System

In [None]:
# https://www.kaggle.com/tmdb/tmdb-movie-metadata
!wget https://lazyprogrammer.me/course_files/nlp/tmdb_5000_movies.csv

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

In [None]:
df = pd.read_csv('tmdb_5000_movies.csv')
df

In [None]:
df['genres'].unique()

In [None]:
df['keywords'].unique()

In [None]:
j = json.loads(df.iloc[0]['genres'])
j

In [None]:
[jj['name'] for jj in j]

In [None]:
[jj['name'].split() for jj in j] # splits into words w/ each of them stored in a list of lists

In [None]:
[''.join(jj['name'].split()) for jj in j] # remove white space

In [None]:
' '.join(''.join(jj['name'].split()) for jj in j) # make all genres in a single string

In [None]:
# convert the relevant data for each movie into a single string
# to be ingested by TfidfVectorizer
def genres_and_keywords_to_string(row):
  genres = json.loads(row['genres'])
  genres = ' '.join(''.join(j['name'].split()) for j in genres)

  keywords = json.loads(row['keywords'])
  keywords = ' '.join(''.join(j['name'].split()) for j in keywords)
  return "%s %s" % (genres, keywords)

In [None]:
df['string'] = df.apply(genres_and_keywords_to_string, axis=1)
df.head()

In [None]:
# create a tf-idf vectorizer object
tfidf = TfidfVectorizer(max_features=2000) # max_features: limits the number of columns in the final matrix i.e. discards less frequent terms

In [None]:
# create a data matrix from the genres + keywords strings
X = tfidf.fit_transform(df['string'])
X

In [None]:
# generate a mapping from movie title -> index (in df)
# why are we doing this? b/c our X matrix doesn't tell you which movie. It's just numbers so we need to have a mapping that tells which movie it is
movie2idx = pd.Series(df.index, index=df['title'])
movie2idx

In [None]:
idx = movie2idx['Avatar']
idx

In [None]:
query = X[idx]
query

In [None]:
query.toarray()

In [None]:
# compute similarity between query and every vector in X
# shape will be 1 (query.shape[0]) by 4_803 (X.shape[0])
# 0 means there are no common terms b/w two movies => dot product = 0
scores = cosine_similarity(query, X)
scores
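
For intuition about what `cosine_similarity` measures, here is a minimal from-scratch sketch on plain Python lists. The `cosine` helper and the toy vectors are illustrative only, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

query = [1.0, 2.0, 0.0]
print(cosine(query, query))            # identical vectors
print(cosine(query, [0.0, 0.0, 3.0]))  # no shared non-zero terms, so the dot product is 0
print(cosine(query, [2.0, 4.0, 0.0]))  # same direction, different length
```

Because the score depends only on direction, a movie with the same terms but larger weights scores the same as an exact copy.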

In [None]:
# currently the array is 1 x N, make it just a 1-D array
scores = scores.flatten()

In [None]:
plt.plot(scores) # the spike at 1 is the query movie, as the cosine similarity b/w two identical vectors is 1

In [None]:
# reason for doing argsort: we don't care about the scores we only care about the ranks
# -scores: sort by descending order
(-scores).argsort()

In [None]:
scores[(-scores).argsort()]

In [None]:
# interpretation: most movies are unrelated to the query movie
plt.plot(scores[(-scores).argsort()])

In [None]:
# get top 5 matches
# exclude self (similarity between query and itself yields max score)
recommended_idx = (-scores).argsort()[1:6] # 0 is the query movie itself so start from 1

# convert indices back to titles
df['title'].iloc[recommended_idx]
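
Slicing with `[1:6]` relies on the query movie being ranked first. If another movie has an identical vector, the tie could put that movie at position 0 instead. A slightly more defensive variant, sketched here on a plain list with a hypothetical helper `top_k_excluding`, removes the query by index rather than by position.

```python
def top_k_excluding(scores, query_idx, k=5):
    # rank all indices by descending score, then drop the query itself
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [i for i in order if i != query_idx][:k]

# toy scores: index 2 is the query, index 1 is an exact duplicate of it
toy_scores = [0.2, 1.0, 1.0, 0.0, 0.7]
print(top_k_excluding(toy_scores, query_idx=2, k=3))
```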

In [None]:
# create a function that generates recommendations
def recommend(title):
  # get the row in the dataframe for this movie
  idx = movie2idx[title]
  if isinstance(idx, pd.Series): # several movies share this title, so take the first match
    idx = idx.iloc[0]

  # calculate the pairwise similarities for this movie
  query = X[idx]
  scores = cosine_similarity(query, X)

  # currently the array is 1 x N, make it just a 1-D array
  scores = scores.flatten()

  # get the indexes of the highest scoring movies
  # get the first K recommendations
  # don't return itself!
  recommended_idx = (-scores).argsort()[1:6]

  # return the titles of the recommendations
  return df['title'].iloc[recommended_idx]

In [None]:
recommend(title='The Dark Knight Rises')

# Word-to-Index mapping

In [None]:
import pandas as pd

In [None]:
# a word-to-index mapping is needed to build TF-IDF from scratch; sklearn builds it internally, but you have to
# build it yourself for more advanced NLP such as deep NLP with PyTorch, word2vec, RNNs, transformers etc

# when converting documents to vectors, the data format is a document-term matrix, where
# row = document and column = term (words)
# ex) document 1: I like cats; document 2: I love cats; document 3: I love dogs
data = {
    'document': ['doc1', 'doc2', 'doc3'],
    'I': [1, 1, 1],
    'like': [1, 0, 0],
    'cats': [1, 1, 0],
    'love': [0, 1, 1],
    'dogs': [0, 0, 1],
}
pd.DataFrame(data).set_index('document')
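
The table above was typed in by hand. As a sketch of how a word-to-index mapping produces the same document-term matrix, the following uses plain Python lists and the same three example documents.

```python
docs = ["I like cats", "I love cats", "I love dogs"]

# assign each new word the next free column index
word2idx = {}
for doc in docs:
    for word in doc.split():
        if word not in word2idx:
            word2idx[word] = len(word2idx)

# one row per document, one column per word
matrix = [[0] * len(word2idx) for _ in docs]
for row, doc in enumerate(docs):
    for word in doc.split():
        matrix[row][word2idx[word]] += 1

print(word2idx)
for row in matrix:
    print(row)
```

The column order follows the order in which words are first seen, which is why the mapping has to be stored alongside the matrix.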

# TF-IDF from scratch

In [None]:
# https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk import word_tokenize

In [None]:
nltk.download('punkt_tab')

In [None]:
df = pd.read_csv('bbc_text_cls.csv')
df.head()

In [None]:
idx = 0
word2idx = {} # dictionary to populate w/ word: index
tokenized_docs = []

# loops through rows in the 'text' column
# (every document is indexed here, since the TF-IDF steps that follow use all N rows)
for doc in df['text']:

  # doc is a single string ex) "Ad sales boost Time Warner profit..."
  # doc.lower() lower-cases the whole string
  # word_tokenize() splits it into tokens, handling punctuation better than str.split()
  words = word_tokenize(doc.lower()) # ex) ['ad', 'sales', 'boost', 'time', 'warner', 'profit']
  # print(words)

  doc_as_int = []
  for word in words: # loops through every token in this document
    if word not in word2idx: # if it's a new word
      word2idx[word] = idx # then add the new word to the dictionary word2idx
      idx += 1 # increase by 1 so that each new word gets its own idx

    # this line is outside the "if", so it runs for every token, new or already seen
    # doc_as_int therefore holds the index of each token in order of appearance
    # so it can have duplicates if a word repeats ex) if "to" was mapped to 16, there will be multiple 16's
    doc_as_int.append(word2idx[word])
  tokenized_docs.append(doc_as_int)

word2idx

In [None]:
doc_as_int

In [None]:
tokenized_docs

In [None]:
# reverse word2idx
idx2word = {v:k for k, v in word2idx.items()}
idx2word

In [None]:
# number of documents
N = len(df['text'])
N

In [None]:
# number of words
V = len(word2idx)
V

In [None]:
# instantiate term-frequency matrix
# note: could have also used count vectorizer
tf = np.zeros((N, V)) # number of documents by number of words
tf.shape # (N, V)

In [None]:
tokenized_docs

In [None]:
# populate term-frequency counts
for i, doc_as_int in enumerate(tokenized_docs):
  # print(i)
  # print(i, doc_as_int)
  for j in doc_as_int:
    tf[i, j] += 1 # in document i count how many times the jth word appears

In [None]:
tf

In [None]:
df['text'][0]

In [None]:
idx2word

In [None]:
doc_as_int

In [None]:
# compute IDF
# tf > 0 marks which documents contain each word; axis=0 sums over documents, giving one count per word
document_freq = np.sum(tf > 0, axis=0) # document frequency (shape = (V,))
idf = np.log(N / document_freq)
idf

In [None]:
tf_idf = tf * idf
tf_idf

In [None]:
# pick a random document, show the top 5 terms (in terms of tf_idf score)
i = np.random.choice(N) # random document index
row = df.iloc[i] # the row of the chosen document
print("Label:", row['labels'])
# print("Text:", row['text'].split("\n", 1)[0])
print("Text:", row['text'])
print("Top 5 terms:")

scores = tf_idf[i]

# -scores orders in descending order
# argsort: we don't care about the actual score instead we need the order
indices = (-scores).argsort()

for j in indices[:5]:
  print(idx2word[j])

# Word Embeddings

In [None]:
### download pretrained models but this one takes too long
## method 1
# !wget -nc https://archive.org/download/google-news-vectors-negative-300.bin_202311/GoogleNews-vectors-negative300.bin.gz

## method 2
import gdown
!gdown https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM

In [None]:
## decompress the .gz file
!gunzip GoogleNews-vectors-negative300.bin.gz

In [None]:
## KeyedVectors has the necessary APIs for word embeddings
from gensim.models import KeyedVectors

In [None]:
## Call the function load_word2vec_format passing in the pretrained word embeddings
word_vectors = KeyedVectors.load_word2vec_format(
  'GoogleNews-vectors-negative300.bin',
  binary=True
)

In [None]:
def find_analogies(w1, w2, w3):
  # w1 - w2 = ? - w3
  # e.g. king - man = ? - woman
  #      ? = +king +woman -man
  r = word_vectors.most_similar(positive=[w1, w3], negative=[w2])
  # print("%s - %s = %s - %s" % (w1, w2, r[0][0], w3))
  print("%s (%s) = %s - %s + %s" % (r[0][0], round(r[0][1], 2), w1, w2, w3))
  # print(r)

In [None]:
find_analogies('king', 'man', 'woman')

In [None]:
find_analogies('france', 'paris', 'seoul')
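
To show the vector arithmetic behind these analogies without downloading the pretrained model, here is a toy sketch. The 2-D vectors are invented for illustration (they are not real word2vec values), and `toy_analogy` only mimics the idea of `w1 - w2 + w3`; it is not how gensim implements `most_similar` internally.

```python
import math

# hypothetical 2-D embeddings: (royalty, gender)
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_analogy(w1, w2, w3):
    # target = w1 - w2 + w3, e.g. king - man + woman
    target = [a - b + c for a, b, c in
              zip(toy_vectors[w1], toy_vectors[w2], toy_vectors[w3])]
    # search every word except the three inputs
    candidates = [w for w in toy_vectors if w not in (w1, w2, w3)]
    return max(candidates, key=lambda w: cosine(toy_vectors[w], target))

print(toy_analogy("king", "man", "woman"))
```

Excluding the input words from the candidates matters: without it, the nearest vector to the target is often one of the inputs themselves.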

In [None]:
def nearest_neighbors(w):
  r = word_vectors.most_similar(positive=[w])
  print("neighbors of: %s" % w)
  for word, score in r:
    print("\t%s" % word)

In [None]:
nearest_neighbors('king')

In [None]:
nearest_neighbors('seoul')

In [None]:
# Exercise: download pretrained GloVe vectors from the Stanford NLP site
!wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
!unzip glove*.zip

In [None]:
import numpy as np

def load_glove_model(File):
    print("Loading Glove Model")
    glove_model = {}
    with open(File, 'r', encoding='utf-8') as f:  # explicit encoding: the vocabulary contains non-ASCII tokens
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            glove_model[word] = embedding
    print(f"{len(glove_model)} words loaded!")
    return glove_model

In [None]:
glove_model = load_glove_model('glove.6B.50d.txt') # keep the result instead of echoing the whole dict

# Text Classification

In [2]:
!wget -nc https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/edgar_allan_poe.txt
!wget -nc https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/robert_frost.txt

--2025-02-07 02:33:13--  https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/edgar_allan_poe.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26622 (26K) [text/plain]
Saving to: ‘edgar_allan_poe.txt’


2025-02-07 02:33:14 (22.4 MB/s) - ‘edgar_allan_poe.txt’ saved [26622/26622]

--2025-02-07 02:33:14--  https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/robert_frost.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56286 (55K) [text/plain]
Saving 

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.model_selection import train_test_split

In [5]:
!head edgar_allan_poe.txt

LO! Death hath rear'd himself a throne
In a strange city, all alone,
Far down within the dim west
Where the good, and the bad, and the worst, and the best,
Have gone to their eternal rest.
 
There shrines, and palaces, and towers
Are not like any thing of ours
Oh no! O no! ours never loom
To heaven with that ungodly gloom!


In [10]:
!tail edgar_allan_poe.txt

Of the old time entombed.

And travellers now within that valley,
Through the red-litten windows, see
Vast forms that move fantastically
To a discordant melody;
While, like a rapid ghastly river,
Through the pale door,
A hideous throng rush out forever,
And laugh --but smile no more.

In [6]:
## view entire file
# !less edgar_allan_poe.txt

##
!wc edgar_allan_poe.txt # number of lines / number of words / number of characters

  796  4901 26622 edgar_allan_poe.txt


In [7]:
!head robert_frost.txt

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth; 

Then took the other, as just as fair,
And having perhaps the better claim
Because it was grassy and wanted wear,
Though as for that the passing there


In [None]:
input_files = [
  'edgar_allan_poe.txt',
  'robert_frost.txt',
]

In [None]:
from itertools import islice

for label, file in enumerate(input_files):
  # print(file)
  for line in open(file):
  # for line in islice(file, 5):
    # print(line)
    # print(line.split())
    # print(line.split()[:2])
    print(" ".join(line.split()[:2]))

In [None]:
# Python3 code to demonstrate
# translations using
# maketrans() and translate()

# specify to translate chars
str1 = "wy"

# specify to replace with
str2 = "gf"

# delete chars
str3 = "u"

# target string
trg = "weeksyourweeks"

# using maketrans() to
# construct translate
# table
table = trg.maketrans(str1, str2, str3)

# Printing original string
print ("The string before translating is : ", end ="")
print (trg)

# using translate() to make translations.
print ("The string after translating is : ", end ="")
print (trg.translate(table))


In [13]:
# collect data into lists
input_texts = []
labels = []

for label, file in enumerate(input_files):
  print(f"{file} corresponds to label {label}")

  for line in open(file):
    line = line.rstrip().lower() # rstrip() removes white spaces from the end
    if line: # b/c some lines are empty

      # remove punctuation
      # str.maketrans('', '', string.punctuation) builds a table that deletes every punctuation character
      line = line.translate(str.maketrans('', '', string.punctuation))

      input_texts.append(line)
      labels.append(label)

edgar_allan_poe.txt corresponds to label 0
robert_frost.txt corresponds to label 1


In [15]:
input_texts[:5]

['lo death hath reard himself a throne',
 'in a strange city all alone',
 'far down within the dim west',
 'where the good and the bad and the worst and the best',
 'have gone to their eternal rest']

In [17]:
labels[:5]

[0, 0, 0, 0, 0]

In [19]:
train_text, test_text, Ytrain, Ytest = train_test_split(input_texts, labels)
len(train_text)
train_text[:5]

['of here and there a tent in grove and orchard',
 'the chisel work of an enormous glacier',
 'no doubt its grown up some to woods around it',
 'had spilled them near the window toward the light',
 'when shoeing home across the white']