Exercise 1. Text Generation

• Install markovify

• Import pandas and markovify


In [None]:
import pandas as pd
import markovify

# Loading the dataset
data = pd.read_csv('abcnews-date-text.csv')
print(data.head(3))

# Concatenating all the text from the 'headline_text' column
text_corpus = ' '.join(data['headline_text'].astype(str))

text_corpus = text_corpus.replace('.', ' .___END___. ___BEGIN___')

# Building a Markov model
text_model = markovify.Text(text_corpus)

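# Printing up to ten randomly generated sentences
for _ in range(10):
    generated_sentence = text_model.make_sentence()
    # make_sentence() returns None when it cannot build a sufficiently novel sentence
    if generated_sentence is None:
        continue
    generated_sentence = generated_sentence.replace(' .___END___. ___BEGIN___', '')
    print(generated_sentence)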


   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
___BEGIN___5 million police investigate string of terrorist suspects arrested in moscow pool tragedy disendorsed liberal candidate to give spain the lead britain rules out inquest judge extends storm stay top of a burnie family safe return home aussie held over gangland shooting police scour remains of sick days cowboys confident captain will play boomers at fiba basketball world cup mcgrath calls it quits kurds danced for joy on riptide taylor swift single indigenous affairs man charged over three ways housing corp govt urged to byo bags coalition mps to meet increased demand as skills shortage african child smuggling ring gundagai mayor to push wto on cheap grog bans mixed impact sunbus drivers to stay in nt mens shed un
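
The generated text above runs far longer than a headline, most likely because the headlines were joined into one long string, so the model rarely sees a sentence boundary. A minimal, dependency-free sketch of the alternative idea is shown below: each headline is treated as its own sentence in a first-order Markov chain. The helper names `build_chain` and `generate`, the boundary markers, and the three sample headlines are illustrative assumptions, not part of markovify. The library itself offers `markovify.NewlineText` for newline-delimited corpora of this kind.

```python
import random
from collections import defaultdict

BEGIN, END = "<BEGIN>", "<END>"  # hypothetical boundary markers for this sketch


def build_chain(headlines):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for headline in headlines:
        words = [BEGIN] + headline.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_words=20):
    """Walk the chain from BEGIN until END is drawn or max_words is reached."""
    word, output = BEGIN, []
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)


# Made-up sample headlines standing in for the 'headline_text' column
headlines = [
    "police investigate car crash",
    "police seek witnesses after crash",
    "council considers new car park",
]
chain = build_chain(headlines)
rng = random.Random(0)
for _ in range(3):
    print(generate(chain, rng))
```

Because every headline ends with its own end marker, the walk stops at a natural headline boundary instead of continuing across unrelated headlines.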


Exercise 2. Text Summarization

• Use sumy to summarize the ‘alice.txt’ file

• Download the NLTK ‘punkt’ tokenizer data, which provides the 'tokenizers/punkt/PY3/english.pickle' model.

In [5]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

alice_file_path = 'alice.txt'

# Initializing the parser and tokenizer
parser = PlaintextParser.from_file(alice_file_path, Tokenizer('english'))

# Initializing the Latent Semantic Analysis summarizer
summarizer = LsaSummarizer()

# Extracting a three-sentence summary
summary = summarizer(parser.document, 3)
for sentence in summary:
    print(sentence)


The Mouse did not answer, so Alice went on eagerly:  `There is such a nice little dog near our house I should like to show you!
He sent them word I had not gone (We know it to be true): If she should push the matter on, What would become of you?
`If there's no meaning in it,' said the King, `that saves a world of trouble, you know, as we needn't try to find any.
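
The three lines above are sentences taken from the text itself, since this kind of summarizer is extractive: it scores the existing sentences and returns the top-ranked ones instead of writing new text. The sketch below illustrates that idea with a much simpler word-frequency score in place of Latent Semantic Analysis, so it does not reproduce how `LsaSummarizer` ranks sentences. The function `summarize_by_frequency`, the regex-based sentence splitter, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def summarize_by_frequency(text, sentence_count):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically
        return sum(word_counts[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentence_count])  # restore document order
    return [sentences[i] for i in chosen]


# Made-up sample text standing in for 'alice.txt'
sample_text = (
    "The cat sat on the mat. The cat chased the mouse. "
    "Rain is expected tomorrow. The mouse hid under the mat."
)
for sentence in summarize_by_frequency(sample_text, 2):
    print(sentence)
```

A frequency score is easy to follow but ignores stop words and meaning; it is only meant to show the select-and-return structure of an extractive summary.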


Exercise 3. Topic Modeling

• Determine the top 20 topics using Non-Negative Matrix Factorization (NMF), imported with ‘from sklearn.decomposition import NMF’

• Vectorize the words after cleaning up the text

• Use ‘print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))’ to list the topics
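
The expression `topic.argsort()[-10:]` in the last bullet selects the indices of the ten largest topic weights in ascending order, so the strongest word is printed last; reversing the slice puts it first. A plain-Python sketch of that selection is given below. The helper `top_word_indices` and the toy vocabulary and weights are hypothetical and only stand in for the NumPy arrays produced by the vectorizer and the NMF model.

```python
def top_word_indices(weights, k):
    """Indices of the k largest weights, largest first.

    Plain-Python counterpart of weights.argsort()[-k:][::-1] on a NumPy array.
    """
    if k <= 0:
        return []
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    return ascending[-k:][::-1]


# Made-up vocabulary and topic weights for illustration
idx_to_word = ["court", "rain", "police", "crash", "council"]
topic = [0.1, 0.0, 0.7, 0.4, 0.2]

top = top_word_indices(topic, 3)
print("Topic {}: {}".format(1, ",".join(idx_to_word[i] for i in top)))
```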


In [7]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

# Loading the dataset
df = pd.read_csv('abcnews-date-text.csv')
documents = df['headline_text'].dropna().tolist()

# Vectorizing the words
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

# Applying NMF
n_topics = 20
nmf_model = NMF(n_components=n_topics)
nmf_matrix = nmf_model.fit_transform(tfidf_matrix)

# Normalizing each document's topic weights to unit length
nmf_matrix = normalize(nmf_matrix, axis=1)

feature_names = vectorizer.get_feature_names_out()

# Printing the ten highest-weighted words for each topic
for i, topic in enumerate(nmf_model.components_):
    top_words_idx = topic.argsort()[-10:][::-1]
    top_words = [feature_names[idx] for idx in top_words_idx]
    print("Topic {}: {}".format(i + 1, ", ".join(top_words)))


Topic 1: man, jailed, dies, arrested, guilty, attack, stabbing, pleads, child, sex
Topic 2: police, investigate, probe, hunt, officer, shooting, seek, arrest, say, assault
Topic 3: coast, gold, north, qld, south, west, korea, east, mid, queensland
Topic 4: rural, news, national, reporter, exchange, park, sa, friday, qld, thursday
Topic 5: interview, extended, michael, john, david, nrl, smith, james, ben, scott
Topic 6: new, zealand, laws, year, york, home, centre, deal, years, opens
Topic 7: abc, weather, sport, business, news, entertainment, analysis, market, stories, speaks
Topic 8: crash, car, killed, fatal, dies, road, driver, plane, woman, injured
Topic 9: court, accused, face, charges, faces, told, high, sex, case, murder
Topic 10: australia, day, world, cup, test, south, vs, win, highlights, india
Topic 11: council, plan, considers, land, plans, seeks, rise, backs, rates, mayor
Topic 12: nsw, country, hour, wa, 2015, 2014, tas, 2013, vic, sa
Topic 13: says, labor, mp, pm, minist