# Text Generation Using n-grams

Based on the example https://stackabuse.com/python-for-nlp-developing-an-automatic-text-filler-using-n-grams/


In [425]:
import re
import nltk
import numpy as np
import random
import string
import bs4 as bs
import urllib.request

In [426]:
# Read several Wikipedia pages and combine all the paragraphs' text
articles = ["Robotics", "Artificial_intelligence", "Machine_learning", "Computer_vision", 
            "Human-robot_interaction", "Robotic_sensing", "Robotic_sensors", "Cyber-physical_system", 
            "Robot_locomotion", "Mobile_robot", "Robotic_mapping", "Robotic_manipulation"]
article_text = ""
for article_name in articles:
    raw_html = urllib.request.urlopen("https://en.wikipedia.org/wiki/" + article_name)
    raw_html = raw_html.read()
    article_html = bs.BeautifulSoup(raw_html, "lxml")
    article_paragraphs = article_html.find_all("p")
    for para in article_paragraphs:
        article_text += para.text

# Convert all text to lower case and remove everything except letters, periods, commas, and spaces
article_text = article_text.lower()
article_text = re.sub(r"[-]", " ", article_text)
article_text = re.sub(r"(?P<n1>[A-Za-z])[^A-Za-z ](?P<n2>[A-Za-z])", r"\g<n1> \g<n2>", article_text)
article_text = re.sub(r"[^A-Za-z., ]", "", article_text)

# Print the total word count and the first 1000 characters
num_words = len(nltk.word_tokenize(article_text))
print("Total number of words: {}\n".format(num_words))
print(article_text[:1000])

Total number of words: 50661

robotics is an interdisciplinary research area at the interface of computer science and engineering. robotics involves design, construction, operation, and use of robots. the goal of robotics is to design intelligent machines that can help and assist humans in their day to day lives and keep everyone safe. robotics draws on the achievement of information engineering, computer engineering, mechanical engineering, electronic engineering and others.robotics develops machines that can substitute for humans and replicate human actions. robots can be used in many situations and for many purposes, but today many are used in dangerous environments including inspection of radioactive materials, bomb detection and deactivation, manufacturing processes, or where humans cannot survive e g. in space, underwater, in high heat, and clean up and containment of hazardous materials and radiation. robots can take on any form but some are made to resemble humans in appearance
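
The cleaning steps can be tried on a single sentence to see why the output contains fragments such as "e g." and "moravec s paradox". This is a minimal sketch: `clean_text` and the sample sentence are hypothetical, and the patterns use `[a-z]` only because the text is lowercased first.

```python
import re

def clean_text(text):
    """Lowercase the text and keep only letters, periods, commas, and spaces."""
    text = text.lower()
    # Hyphens become spaces so "human-robot" splits into two words
    text = re.sub(r"[-]", " ", text)
    # A single non-letter character between two letters becomes a space
    text = re.sub(r"(?P<n1>[a-z])[^a-z ](?P<n2>[a-z])", r"\g<n1> \g<n2>", text)
    # Everything else that is not a letter, period, comma, or space is dropped
    return re.sub(r"[^a-z., ]", "", text)

sample = "Human-robot interaction (HRI) isn't new; see e.g. Moravec's paradox [12]."
cleaned = clean_text(sample)
print(cleaned)
# human robot interaction hri isn t new see e g. moravec s paradox .
```

Apostrophes and inner periods sit between two letters, so they turn into spaces, while brackets, digits, and semicolons are simply removed.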

In [427]:
# Word n-grams
def get_training_ngrams(text, N):
    ngrams = {}
    words_tokens = nltk.word_tokenize(text)
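    # Stop N tokens before the end so every n-gram has a following word
    for i in range(len(words_tokens)-N):
        seq = " ".join(words_tokens[i:i+N])
        if seq not in ngrams:
            ngrams[seq] = []
        # Duplicates are kept, so frequent completions are sampled more often later
        ngrams[seq].append(words_tokens[i+N])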
    return ngrams

# Gather n-grams plus the next possible words for completion
N = 3
ngrams = get_training_ngrams(article_text, N)

# Print some random n-grams that have at least 1, 2, and 3 completions
num_samples = 5
all_keys = list(ngrams.keys())
for num_completions in range(3):
    generated_keys = 0
    while generated_keys < num_samples:
        key = random.choice(all_keys)
        if len(ngrams[key]) > num_completions:
            print("{}: {}".format(key, ngrams[key]))
            generated_keys += 1

, russia ,: ['and']
car k in: ['the']
agents to achieve: ['a']
a competition to: ['sail', 'sail']
makes continuous audit: ['possible']
is an area: ['of', 'of']
conception , design: [',', ',']
the three laws: ['of', 'of', 'of', 'of']
goal is to: ['offer', 'successfully', 'memorize', 'learn', 'learn', 'build', 'offer']
is in contrast: ['to', 'to']
a sequence of: ['facial', 'images', 'facial']
the world .: ['finally', 'among', 'applications', 'as', 'finally']
moravec s paradox: ['generalizes', 'can', 'suggests']
th century .: ['throughout', 'the', 'throughout', 'the']
the surface of: ['the', 'the', 'the', 'the', 'the', 'the']
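
To make the structure of the table easier to follow, here is a small sketch that builds the same kind of dictionary from a toy sentence. It assumes whitespace splitting instead of `nltk.word_tokenize` so it runs without NLTK data; `build_ngram_table` and the toy sentence are hypothetical.

```python
def build_ngram_table(tokens, n):
    """Map each run of n consecutive tokens to the list of tokens that followed it."""
    table = {}
    for i in range(len(tokens) - n):
        key = " ".join(tokens[i:i + n])
        # Repeated completions are kept on purpose; they encode frequency
        table.setdefault(key, []).append(tokens[i + n])
    return table

toy_tokens = "the robot sees the robot moves the robot sees".split()
toy_table = build_ngram_table(toy_tokens, 2)
for key, completions in toy_table.items():
    print("{}: {}".format(key, completions))
# the robot: ['sees', 'moves', 'sees']
# robot sees: ['the']
# sees the: ['robot']
# robot moves: ['the']
# moves the: ['robot']
```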


In [428]:
# Predict
def generate_sequence(start, ngrams, num_words):
    # Infer N from the number of words in any key of the n-gram table
    N = len(next(iter(ngrams)).split(" "))
    # Tokenize the start text once and extend the token list one word at a time
    seq_words = nltk.word_tokenize(start.lower())
    for i in range(num_words):
        curr_sequence = " ".join(seq_words[-N:])
        # Stop early if the last N words never occur in the training text
        if curr_sequence not in ngrams:
            break
        seq_words.append(random.choice(ngrams[curr_sequence]))
    return " ".join(seq_words)

start_text = "Robotics is a"
num_words = 100
gen_output = generate_sequence(start_text, ngrams, num_words)

print("{}-grams:".format(N))
print(gen_output)

3-grams:
robotics is a conference for scientists , researchers , and practitioners to report and discuss the latest progress of their forefront research and findings in social robotics , had , by , published articles mentioning the subject , and an open access journal called lovotics was launched in , devoted entirely to the subject of computers and art highlighting the role of machine learning , the machine performed a diagnosis similarly to a well trained ophthalmologist , and could generate a decision within seconds on whether or not the speaker has a cold , etc .. it becomes even harder when the speaker
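
Because the completion lists keep duplicates, `random.choice` picks each next word in proportion to how often it followed the context in the training text. The sketch below makes those implied probabilities explicit for the "goal is to" completions printed earlier; `completion_probabilities` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def completion_probabilities(completions):
    """Return the empirical probability of each next word in a completion list."""
    counts = Counter(completions)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Completions recorded for the context "goal is to" in the sample output
example = ['offer', 'successfully', 'memorize', 'learn', 'learn', 'build', 'offer']
probs = completion_probabilities(example)
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print("{}: {:.3f}".format(word, p))
```

Here "offer" and "learn" each appear twice out of seven completions, so each is twice as likely to be chosen as any of the other three words.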


In [429]:
# Now try other values of N, starting with 2-grams
N = 2
num_words = 100
start_text = "Computer vision"
ngrams = get_training_ngrams(article_text, N)
gen_output = generate_sequence(start_text, ngrams, num_words)
print("{}-grams:".format(N))
print(gen_output)

2-grams:
computer vision applications , are computing systems vaguely inspired by nature , contributing to climate change mitigation and adaptation , and automatic pilot avionics cps involves transdisciplinary approaches , merging theory of optimization . for example , be made to gain more aerodynamic surface as it navigates its problem space , defence , security , or their ability to feel , does not follow standard research protocol in addition , the rare loyal robots such as space , defence , security , and image restoration is the application of soft computing approaches to ai good old fashioned ai or gofai . during
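
Each run of the cells above produces different text because the next word is drawn at random. If repeatable output is wanted, one option is to pass a seeded generator into the sampling loop. This is a sketch under assumptions: `generate_from_table`, its `rng` parameter, and the toy table are hypothetical and not part of the notebook.

```python
import random

def generate_from_table(start_tokens, table, n, num_words, rng):
    """Extend start_tokens by up to num_words words drawn with the given generator."""
    words = list(start_tokens)
    for _ in range(num_words):
        key = " ".join(words[-n:])
        if key not in table:
            break  # the last n words were never seen, so generation stops
        words.append(rng.choice(table[key]))
    return " ".join(words)

# Toy 2-gram table; every context leads to another known context
toy_table = {
    "the robot": ["sees", "moves", "sees"],
    "robot sees": ["the"],
    "sees the": ["robot"],
    "robot moves": ["the"],
    "moves the": ["robot"],
}
first = generate_from_table(["the", "robot"], toy_table, 2, 10, random.Random(0))
second = generate_from_table(["the", "robot"], toy_table, 2, 10, random.Random(0))
print(first)
print(first == second)  # the same seed yields the same text
```

Using a separate `random.Random` instance keeps the seed local to the generator instead of changing the global random state.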


In [430]:
# 4-grams
N = 4
num_words = 100
start_text = "Machine learning models require"
ngrams = get_training_ngrams(article_text, N)
gen_output = generate_sequence(start_text, ngrams, num_words)
print("{}-grams:".format(N))
print(gen_output)

# Funny thing: the paragraph below is about overfitting, and the 4-gram model
# appears to repeat its source almost word for word. Compare with
# https://en.wikipedia.org/wiki/Machine_learning#Training_models

4-grams:
machine learning models require a lot of data in order for them to perform well . usually , when training a machine learning model , one needs to collect a large , representative sample of data from a training set . data from the training set can be as varied as a corpus of text , a collection of images , and data collected from individual users of a service . overfitting is something to watch out for when training a machine learning model.federated learning is an adapted form of distributed artificial intelligence to training machine learning models that decentralizes the training process ,
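
One rough way to see why larger N tends to follow the source text more closely is to count how many contexts leave the model only one possible next word. The sketch below does this on a toy sentence; `single_choice_fraction`, `build_table`, and the sentence are hypothetical, and whitespace splitting stands in for `nltk.word_tokenize`.

```python
def build_table(tokens, n):
    """Map each run of n consecutive tokens to the tokens that followed it."""
    table = {}
    for i in range(len(tokens) - n):
        table.setdefault(" ".join(tokens[i:i + n]), []).append(tokens[i + n])
    return table

def single_choice_fraction(table):
    """Fraction of contexts whose recorded completions are all the same word."""
    if not table:
        return 0.0
    forced = sum(1 for completions in table.values() if len(set(completions)) == 1)
    return forced / len(table)

tokens = "the robot sees the robot moves the robot sees the arm".split()
fractions = {n: single_choice_fraction(build_table(tokens, n)) for n in (1, 2, 3)}
for n, fraction in fractions.items():
    print("N={}: {:.2f} of contexts have a single completion".format(n, fraction))
```

On this toy input the fraction rises from 0.50 at N=1 to 0.60 at N=2 and about 0.83 at N=3. Running the same check on the tables built from `article_text` would show whether the 4-gram model is mostly replaying its training text.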
