
The corpus comes from the [University of Rochester](https://cs.rochester.edu/nlp/rocstories/). The dataset contains five-sentence commonsense stories.

In [3]:
# pip install markovify



In [26]:
import nltk
import numpy as np
import pandas as pd
import random
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import re
import spacy
import warnings
import gensim
import markovify
from collections import Counter

warnings.filterwarnings('ignore')

In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/zhgjenny93/datasets/main/ROCStories_winter2017.csv')

In [6]:
df.head()

Unnamed: 0,storyid,storytitle,sentence1,sentence2,sentence3,sentence4,sentence5
0,8bbe6d11-1e2e-413c-bf81-eaea05f4f1bd,David Drops the Weight,David noticed he had put on a lot of weight re...,He examined his habits to try and figure out t...,He realized he'd been eating too much fast foo...,He stopped going to burger places and started ...,"After a few weeks, he started to feel much bet..."
1,0beabab2-fb49-460e-a6e6-f35a202e3348,Frustration,Tom had a very short temper.,One day a guest made him very angry.,He punched a hole in the wall of his house.,Tom's guest became afraid and left quickly.,Tom sat on his couch filled with regret about ...
2,87da1a22-df0b-410c-b186-439700b70ba6,Marcus Buys Khakis,Marcus needed clothing for a business casual e...,All of his clothes were either too formal or t...,He decided to buy a pair of khakis.,The pair he bought fit him perfectly.,Marcus was happy to have the right clothes for...
3,2d16bcd6-692a-4fc0-8e7c-4a6f81d9efa9,Different Opinions,Bobby thought Bill should buy a trailer and ha...,Bill thought a truck would be better for what ...,Bobby pointed out two vehicles were much more ...,Bill was set in his ways with conventional thi...,He ended up buying the truck he wanted despite...
4,c71bb23b-7731-4233-8298-76ba6886cee1,Overcoming shortcomings,John was a pastor with a very bad memory.,He tried to memorize his sermons many days in ...,He decided to learn to sing to overcome his ha...,He then made all his sermons into music and sa...,His congregation was delighted and so was he.


In [7]:
df['story'] = df[['sentence1', 'sentence2', 'sentence3', 'sentence4', 'sentence5']].agg(' '.join, axis=1)

In [8]:
df['story'].iloc[0]

"David noticed he had put on a lot of weight recently. He examined his habits to try and figure out the reason. He realized he'd been eating too much fast food lately. He stopped going to burger places and started a vegetarian diet. After a few weeks, he started to feel much better."

In [9]:
story_str = df.story.sample(frac=0.35, random_state=123)
story_str = ' '.join(story_str)
story_str = ' '.join(story_str.split())

In [10]:
story_str[:200]

'Jill started her YouTube channel in 2006 and loved the process. She uploaded random videos at random times for the first few years. In 2010 Jill realized that she may be able to make a living doing it'

In [11]:
len(story_str)

4121040

In [12]:
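# Load the small English pipeline by its full name; the 'en' shortcut is not available in spaCy 3
# (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
# The sampled text is about 4.1 million characters, above spaCy's default limit of 1,000,000
nlp.max_length = 5000000
story_doc = nlp(story_str)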

In [23]:
# Explore the objects that you've built
print("The story_doc object is a {} object.".format(type(story_doc)))
print("It is {} tokens long".format(len(story_doc)))
print("The first three tokens are '{}'".format(story_doc[:3]))
print("The type of each token is {}".format(type(story_doc[0])))

The story_doc object is a <class 'spacy.tokens.doc.Doc'> object.
It is 898006 tokens long
The first three tokens are 'Jill started her'
The type of each token is <class 'spacy.tokens.token.Token'>


In [24]:
# Remove stopwords
story_without_stopwords = [token for token in story_doc if not token.is_stop]

In [27]:
# Utility function to calculate how frequently words appear in the text
def word_frequencies(text):

  # Build a list of words
  # Strip out punctuation
  words = []
  for token in text:
    if not token.is_punct:
      words.append(token.text)
  # Build and return a `Counter` object containing word counts
  return Counter(words)

# Instantiate your list of the most common words
story_word_freq = word_frequencies(story_without_stopwords).most_common(10)
print('\nWord Frequencies:', story_word_freq)


Word Frequencies: [('day', 3782), ('went', 3780), ('got', 3677), ('decided', 3078), ('wanted', 2689), ('new', 2464), ('Tom', 2416), ('home', 1960), ('time', 1951), ('friends', 1761)]


In [28]:
# Utility function to calculate how frequently each lemma appears in the text
def lemma_frequencies(text):

  # Build a list of lemmas
  # Strip out punctuation
  lemmas = []
  for token in text:
    if not token.is_punct:
      lemmas.append(token.lemma_)

  # Build and return a `Counter` object containing lemma counts
  return Counter(lemmas)

# Instantiate your list of most common lemmas
story_lemma_freq = lemma_frequencies(story_without_stopwords).most_common(10)
print('\nLemma Frequencies:', story_lemma_freq)


Lemma Frequencies: [('go', 5052), ('get', 4477), ('day', 4135), ('decide', 3281), ('want', 3198), ('friend', 3122), ('new', 2483), ('find', 2441), ('Tom', 2416), ('work', 2414)]


In [13]:
story_sents = [sent.text for sent in story_doc.sents if len(sent.text) > 1]

## Text Generation
Tag each word with its part of speech, then train a Markov chain model on the story sentences with the Markovify package.
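To make the idea of a Markov chain with a given state size concrete, here is a minimal pure-Python sketch. It is not how Markovify is implemented; the helper names `build_chain` and `generate`, the `<s>`/`</s>` boundary markers, and the toy sentences are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_chain(sentences, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        # Pad with hypothetical start/end markers so sentence boundaries are learned too
        words = ["<s>"] * state_size + sentence.split() + ["</s>"]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, rng=random, max_words=50):
    """Walk the chain from the start state until the end marker is drawn."""
    state = ("<s>",) * state_size
    out = []
    while len(out) < max_words:
        next_word = rng.choice(chain[state])
        if next_word == "</s>":
            break
        out.append(next_word)
        # Slide the window: drop the oldest word, append the new one
        state = state[1:] + (next_word,)
    return " ".join(out)

# Toy corpus (made up for this sketch, not taken from ROCStories)
toy_sentences = ["Tom went home .", "Tom went to work .", "Jill went home early ."]
chain = build_chain(toy_sentences, state_size=2)
print(generate(chain, state_size=2, rng=random.Random(0)))
```

A larger `state_size` makes each next word depend on more context, so output tends to be more coherent but closer to the training sentences.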

In [80]:
class POSifiedText(markovify.Text):
  
  def word_split(self, sentence):
    return ['::'.join((word.orth_, word.pos_)) for word in nlp(sentence)]

  def word_join(self, words):
    sentence = ' '.join(word.split('::')[0] for word in words)
    sentence = re.sub(r'\s+([?.!,\'"])', r'\1', sentence)
    return sentence
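The encoding used by `word_split` and `word_join` can be checked without loading spaCy. The sketch below uses hand-written part-of-speech tags (an assumption; in the notebook the tags come from `nlp(sentence)`), and the helper names `encode` and `decode` are hypothetical stand-ins for the two methods.

```python
import re

def encode(tagged_tokens):
    """Join each (text, pos) pair into a single 'text::POS' state, as word_split does."""
    return ["::".join((text, pos)) for text, pos in tagged_tokens]

def decode(states):
    """Drop the tags and re-attach punctuation to the preceding word, as word_join does."""
    sentence = " ".join(state.split("::")[0] for state in states)
    return re.sub(r'\s+([?.!,\'"])', r"\1", sentence)

# Hand-tagged example; the tags are illustrative, not produced by a tagger here
tagged = [("She", "PRON"), ("left", "VERB"), ("early", "ADV"), (".", "PUNCT")]
states = encode(tagged)
print(states)          # ['She::PRON', 'left::VERB', 'early::ADV', '.::PUNCT']
print(decode(states))  # She left early.
```

Because the tag is part of the state, `play::VERB` and `play::NOUN` are distinct entries in the chain, which keeps the model from following a noun context with a continuation learned from the verb.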

In [None]:
# Build model to generate text by looking at the 3 previous words (state_size=3)
story_generator = POSifiedText(story_sents, state_size=3)

In [22]:
# Generate some sentences
for i in range(3):
  print(story_generator.make_sentence())

# Generate some sentences with 100 character limit
for i in range(3):
  print(story_generator.make_short_sentence(100))

so she put me in the corner .
Ari was walking home from school to avoid failure .
Tom was out all day yesterday .
Alicia is now very careful about every part of the night .
She could n't decide which game to play in the sand .
The firemen walked into the store with only five dollars .
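Markovify's `make_sentence` and `make_short_sentence` return `None` when no acceptable sentence is found within the allowed tries, so a loop like the one above can print `None`. One possible way to guard against that is a small retry wrapper; `first_sentence` and its `max_attempts` parameter are hypothetical names for this sketch, and the stand-in generator below only simulates the model.

```python
def first_sentence(make, max_attempts=20):
    """Call `make()` until it returns a non-None value or the attempts run out."""
    for _ in range(max_attempts):
        sentence = make()
        if sentence is not None:
            return sentence
    return None

# Stand-in for a generator that fails twice before succeeding
results = iter([None, None, "Tom went home."])
print(first_sentence(lambda: next(results)))
```

With the model from this notebook, the calls would look like `first_sentence(story_generator.make_sentence)` or `first_sentence(lambda: story_generator.make_short_sentence(100))`.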


## Chatbot

In [17]:
# pip install chatterbot

In [18]:
# # Import libraries
# from chatterbot import ChatBot
# from chatterbot.trainers import ListTrainer, ChatterBotCorpusTrainer
# from chatterbot.conversation import Statement

In [19]:
# # Create a chatbot
# chatbot = ChatBot('Story')

# # This is to remove the accumulated knowledge base
# chatbot.storage.drop()

# # Create a new trainer for the chatbot
# trainer = ListTrainer(chatbot)

# # Train the chatbot based on dialogs
# trainer.train(story_sents)

In [20]:
# print("StoryBot: I will try to respond to you reasonably. If you want to exit, type bye")

# # Chat loop. Run the cell that defines greeting() before this one.
# while True:
#   user_input = input("User: ").lower()

#   if user_input != 'bye':
#     if user_input in ('thanks', 'thank you'):
#       print("StoryBot: You're welcome.")
#       break
#     else:
#       if greeting(user_input) is not None:
#         print("StoryBot: " + greeting(user_input))
#       else:
#         print("StoryBot: ", end="")
#         print(chatbot.get_response(user_input))
#   else:
#     print("StoryBot: Bye! It was a great chat.")
#     break

In [21]:
# GREETING_INPUTS = ['hello', 'hi', 'greetings', "what's up", 'hey', 'yo','heya','hiya']
# GREETING_RESPONSES = ['hello', 'hi', 'hey', 'hi there']
# def greeting(sentence):
#   for word in sentence.split():
#     if word.lower() in GREETING_INPUTS:
#       return random.choice(GREETING_RESPONSES)