# Text Generation Using n-grams

Based on the example https://stackabuse.com/python-for-nlp-developing-an-automatic-text-filler-using-n-grams/


In [259]:
import re
import nltk
import numpy as np
import random
import string
import bs4 as bs
import urllib.request

In [260]:
# Read several Wikipedia pages and combine the text of all their paragraphs
articles = ["Robot", "Robotics", "History_of_robots"]
article_text = ""
for article_name in articles:
    raw_html = urllib.request.urlopen("https://en.wikipedia.org/wiki/" + article_name)
    raw_html = raw_html.read()
    article_html = bs.BeautifulSoup(raw_html, "lxml")
    article_paragraphs = article_html.find_all("p")
    for para in article_paragraphs:
        article_text += para.text

# Convert all text to lower case and remove everything except letters, periods, commas, and spaces
article_text = article_text.lower()
article_text = re.sub(r"[^A-Za-z., ]", "", article_text)
# Replace any period that sits between letters or spaces with a space
article_text = re.sub(r"(?P<n1>[A-Za-z ])[.](?P<n2>[A-Za-z ])", r"\g<n1> \g<n2>", article_text)

# Print the token count and the first 1000 characters
num_words = len(nltk.word_tokenize(article_text))
print("Total number of words: {}\n".format(num_words))
print(article_text[:1000])

Total number of words: 22415

a robot is a machineespecially one programmable by a computer capable of carrying out a complex series of actions automatically  robots can be guided by an external control device or the control may be embedded within  robots may be constructed on the lines of human form, but most robots are machines designed to perform a task with no regard to their aesthetics robots can be autonomous or semiautonomous and range from humanoids such as hondas advanced step in innovative mobility asimo and tosys tosy ping pong playing robot topio to industrial robots, medical operating robots, patient assist robots, dog therapy robots, collectively programmed swarm robots, uav drones such as general atomics mq predator, and even microscopic nano robots  by mimicking a lifelike appearance or automating movements, a robot may convey a sense of intelligence or thought of its own  autonomous things are expected to proliferate in the coming decade, with home robotics and the aut
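
The run-together word "machineespecially" in the output above is consistent with the first substitution deleting a character outside the allowed set, most likely a dash, without leaving a space behind. A minimal sketch on a made-up sentence shows both substitutions at work. The `clean_text` helper and the sample string are illustrative assumptions, not part of the original notebook.

```python
import re

def clean_text(text):
    """Lower-case the text and apply the same two substitutions as the cell above."""
    text = text.lower()
    # Drop everything except letters, periods, commas, and spaces
    text = re.sub(r"[^A-Za-z., ]", "", text)
    # Replace a period that sits between letters or spaces with a space
    text = re.sub(r"(?P<n1>[A-Za-z ])[.](?P<n2>[A-Za-z ])", r"\g<n1> \g<n2>", text)
    return text

# Made-up sample: \u2014 is a dash, and "[2]" stands in for a citation marker
sample = "A robot is a machine\u2014especially one programmable by a computer.[2] Robots can be guided."
cleaned = clean_text(sample)
print(cleaned)
# a robot is a machineespecially one programmable by a computer  robots can be guided.
```

The period before the stripped "[2]" turns into a second space, while the sentence-final period survives because no character follows it.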

In [261]:
# Word n-grams
def get_training_ngrams(text, N):
    """Map each N-word sequence in text to the list of words that follow it."""
    ngrams = {}
    words_tokens = nltk.word_tokenize(text)
    for i in range(len(words_tokens) - N):
        seq = " ".join(words_tokens[i:i+N])
        # Repeated completions are kept so frequent continuations are sampled more often
        ngrams.setdefault(seq, []).append(words_tokens[i+N])
    return ngrams

# Gather n-grams plus the next possible words for completion
N = 3
ngrams = get_training_ngrams(article_text, N)

# Print some random n-grams that have at least 1, 2, and 3 completions
num_samples = 5
all_keys = list(ngrams)  # build the key list once instead of on every draw
for num_completions in range(3):
    generated_keys = 0
    while generated_keys < num_samples:
        key = random.choice(all_keys)
        if len(ngrams[key]) > num_completions:
            print("{}: {}".format(key, ngrams[key]))
            generated_keys += 1

regular worker could: ['program']
actuator forces necessary: ['to']
tohoku gakuin universitys: ['ballip']
build friendly ai: [',']
image sensors even: ['require']
the construction of: ['mechanical', 'such', 'humanoid', 'a']
exploration , surgery: [',', ',', ',']
humans and robots: ['a', 'robotic']
international federation of: ['robotics', 'robotics']
robotics focuses not: ['on', 'on']
the first humanoid: ['robots', 'robots', 'robot']
a small number: ['of', 'of', 'of', 'of']
have been used: ['for', 'in', 'to']
, as well: ['as', 'as', 'as', 'as', 'as', 'as', 'as', 'as']
consumer and industrial: ['goods', 'goods', 'goods']
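
Completion lists such as `['robots', 'robots', 'robot']` keep duplicates, so drawing from them with `random.choice` picks each word in proportion to how often it followed the n-gram. The sketch below makes those proportions explicit on a tiny made-up corpus. The `build_ngrams` helper mirrors `get_training_ngrams` but, as an assumption to stay self-contained, tokenises with `str.split` instead of `nltk.word_tokenize`.

```python
from collections import Counter

def build_ngrams(tokens, n):
    """Map each n-word context to the list of words that follow it."""
    table = {}
    for i in range(len(tokens) - n):
        context = " ".join(tokens[i:i + n])
        table.setdefault(context, []).append(tokens[i + n])
    return table

# Tiny made-up corpus, split on whitespace
tokens = "the robot can walk and the robot can talk and the robot can walk".split()
table = build_ngrams(tokens, 3)

# Turn the raw completion list into empirical next-word probabilities
counts = Counter(table["the robot can"])
total = sum(counts.values())
probs = {word: count / total for word, count in counts.items()}

print(table["the robot can"])  # ['walk', 'talk', 'walk']
print(probs)
```

Here "walk" would be drawn about two times out of three and "talk" about one time out of three.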


In [262]:
# Generate text by repeatedly sampling a next word for the last N words
def generate_sequence(start, ngrams, num_words):
    # Infer N from the first key; all keys have the same number of words
    N = len(next(iter(ngrams)).split(" "))
    # Tokenize the prompt the same way as the training text
    seq_words = nltk.word_tokenize(start.lower())
    for _ in range(num_words):
        curr_sequence = " ".join(seq_words[-N:])
        if curr_sequence not in ngrams:
            # No recorded continuation for this context, so stop early
            break
        seq_words.append(random.choice(ngrams[curr_sequence]))
    return " ".join(seq_words)

start_text = "Robotics is a"
num_words = 100
gen_output = generate_sequence(start_text, ngrams, num_words)

print("{}-grams:".format(N))
print(gen_output)

3-grams:
robotics is a rapidly growing field , as technological advances continue researching , designing , and building new robots serve various practical purposes , whether domestically , commercially , or militarily many robots are designed to solve a number of robots , based on the tablet when the robot is equipped with solar panels , the robot could move its hands and head and could be controlled through remote control or voice control both eric and his brother george toured the world westinghouse electric corporation built televox in it was a cardboard cutout connected to various devices which users could turn on and
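
Because each new word is drawn from the completions of the last N words, every run of N + 1 consecutive words in the generated text must also occur somewhere in the training text. The sketch below checks that property on a made-up corpus. The names `build_ngrams` and `generate` are hypothetical, tokenisation uses `str.split` rather than `nltk.word_tokenize`, and a seeded `random.Random` instance is an added assumption to make runs repeatable.

```python
import random

def build_ngrams(tokens, n):
    """Map each n-word context (as a tuple) to the list of words that follow it."""
    table = {}
    for i in range(len(tokens) - n):
        table.setdefault(tuple(tokens[i:i + n]), []).append(tokens[i + n])
    return table

def generate(start_tokens, table, n, max_words, rng):
    """Extend start_tokens one word at a time, keyed on the last n words."""
    out = list(start_tokens)
    for _ in range(max_words):
        context = tuple(out[-n:])
        if context not in table:
            # Dead end: this context was never followed by a word in training
            break
        out.append(rng.choice(table[context]))
    return out

tokens = "the robot can walk and the robot can talk and the arm can move".split()
n = 2
table = build_ngrams(tokens, n)
rng = random.Random(0)  # fixed seed so repeated runs give the same text
generated = generate(["the", "robot"], table, n, 20, rng)
print(" ".join(generated))

# Every window of n + 1 generated words also appears in the training tokens
seen = {tuple(tokens[i:i + n + 1]) for i in range(len(tokens) - n)}
windows = [tuple(generated[i:i + n + 1]) for i in range(len(generated) - n)]
print(all(w in seen for w in windows))  # True
```

Generation stops early whenever it reaches a context with no recorded successor, such as "can move" at the end of this corpus.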


In [263]:
# Repeat with 4-grams
N = 4
num_words = 100
start_text = "Many types of robots"
ngrams = get_training_ngrams(article_text, N)
gen_output = generate_sequence(start_text, ngrams, num_words)
print("{}-grams:".format(N))
print(gen_output)

4-grams:
many types of robots they are used in many different environments and for many different uses although being very diverse in application and form , they all share three basic similarities when it comes to their constructionas more and more robots are designed for specific tasks this method of classification becomes more relevant for example , many robots are designed for specific tasks this method of classification becomes more relevant for example , many robots are designed for assembly work , which may not be readily adaptable for other applications they are termed as assembly robots for seam welding , some suppliers provide complete


In [264]:
# 5-grams
N = 5
num_words = 100
start_text = "Robots need to manipulate objects"
ngrams = get_training_ngrams(article_text, N)
gen_output = generate_sequence(start_text, ngrams, num_words)
print("{}-grams:".format(N))
print(gen_output)

5-grams:
robots need to manipulate objects pick up , modify , destroy , or otherwise have an effect thus the functional end of a robot arm intended to make the effect whether a hand , or tool are often referred to as end effectors , while the arm is referred to as a manipulator most robot arms have replaceable endeffectors , each allowing them to perform some small range of tasks some have a fixed manipulator which can not be replaced , while a few have one very general purpose manipulator , for example , a humanoid hand one of the most common type of end
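
The 4-gram sample above visibly repeats a phrase, which suggests the sampler often has only one word to choose from. One rough way to quantify that is the fraction of contexts whose completions are all the same word. For large N this fraction tends toward 1, since long contexts rarely recur with different continuations. The sketch below computes it on a made-up corpus. `deterministic_fraction` is a hypothetical helper, tokenisation uses `str.split`, and the toy numbers say nothing about the Wikipedia text itself.

```python
def build_ngrams(tokens, n):
    """Map each n-word context (as a tuple) to the list of words that follow it."""
    table = {}
    for i in range(len(tokens) - n):
        table.setdefault(tuple(tokens[i:i + n]), []).append(tokens[i + n])
    return table

def deterministic_fraction(tokens, n):
    """Fraction of contexts whose recorded completions are all the same word."""
    table = build_ngrams(tokens, n)
    forced = sum(1 for words in table.values() if len(set(words)) == 1)
    return forced / len(table)

# Made-up corpus for illustration only
tokens = ("the robot can walk and the robot can talk and "
          "the arm can move and the arm can lift").split()

for n in (1, 2, 3):
    print(n, round(deterministic_fraction(tokens, n), 2))
```

Running the same function on `nltk.word_tokenize(article_text)` for N = 3, 4, and 5 would show how much choice each model in this notebook actually has.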
