<a href="https://colab.research.google.com/github/spags093/text_generation/blob/main/text_generation_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation Project

Several cells took far too long to run in Jupyter, so we're moving this project over to Colab to take advantage of the high-RAM runtime. Let's see how this goes.

## Abstract
This is where I will write an abstract one day.  Someday.  Probably.  

## Introduction

In this project, we'll build a text generator that produces fake Amazon product reviews that will, hopefully, be indistinguishable from everyday reviews left by actual human beings. This is by no means meant for fraudulent use. It is an exercise in generating believable text, and we were fortunate to have a large dataset of Amazon reviews to experiment with.

### Environment Check

In [2]:
# Checking how much RAM this runtime has

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('This runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('This notebook is using a high level of RAM.')

This runtime has 38.0 gigabytes of available RAM

This notebook is using a high level of RAM.


### Imports

In [3]:
# The usuals
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os
import sys
import re

# NLTK stuff
import nltk
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Tensorflow stuff
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras import utils  # public API instead of the private tensorflow.python path
from tensorflow.keras.callbacks import ModelCheckpoint

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Functions

In [4]:
# File Path Functions
def print_dir_contents(fpath = None):
    '''Prints the contents of a provided filepath.  Default is current directory.
  
    Args:
        fpath (str): File path of directory.
    
    Returns: 
        Print of contents of folder
    '''

    if fpath is None:
        fpath = os.path.abspath(os.curdir)

    print(f"Contents of Folder: '{fpath}':")
    files = sorted(os.listdir(fpath))
    tab = '\n\t'
    print('\t' + tab.join(files))

# NLP Functions
def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r"", text)


def remove_n(text):
    n = re.compile(r'\n')
    return n.sub(r'', text)


def remove_emoji(string):
    emoji_pattern = re.compile(pattern = 
    "["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+",
    flags = re.UNICODE
    )
    return emoji_pattern.sub(r"", string)


def clean_text(text_column):
    text_column = text_column.apply(lambda x: remove_url(x))
    text_column = text_column.apply(lambda x: remove_n(x))
    text_column = text_column.apply(lambda x: remove_emoji(x))
    return text_column
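
One thing to watch with `remove_n`: it deletes newlines outright, so the last word of one line can end up fused to the first word of the next. The sketch below is a hypothetical variant (the name `clean_review` and the sample string are made up for illustration, not part of the notebook) that reuses the same URL and emoji patterns but collapses whitespace to a single space instead.

```python
import re

# Same patterns as remove_url and remove_emoji above.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+",
    flags=re.UNICODE,
)


def clean_review(text):
    """Hypothetical variant of the cleaning steps: strips URLs and emoji,
    then turns runs of whitespace (including newlines) into one space."""
    text = URL_PATTERN.sub("", text)
    text = EMOJI_PATTERN.sub("", text)
    # Replacing with a space keeps the words on either side of a newline apart.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample review with an emoji, a URL, and two newlines.
sample = "Works great \U0001F600\nSee https://example.com for details.\nWould buy again."
cleaned = clean_review(sample)
print(cleaned)
```

Whether this is preferable depends on how much sentence boundaries matter for the generator later on.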

## Obtain

### Importing Dataset From Google Drive

In [5]:
# Mount google drive
from google.colab import drive
drive.mount('/gdrive', force_remount = True)

Mounted at /gdrive


In [6]:
# Getting current directory and contents
print(os.getcwd())
print(os.listdir())

/content
['.config', 'sample_data']


In [7]:
# Changing the directory to the parent folder
os.chdir('../')
print(os.getcwd())
print(os.listdir())

/
['run', 'bin', 'root', 'lib', 'sys', 'etc', 'media', 'dev', 'proc', 'srv', 'sbin', 'home', 'boot', 'mnt', 'usr', 'lib64', 'tmp', 'var', 'opt', 'gdrive', '.dockerenv', 'tools', 'datalab', 'swift', 'tensorflow-1.15.2', 'content', 'lib32']


In [8]:
# Function to print the file path
def print_path(return_ = False):
  '''Prints the current directory.'''
  path = os.path.abspath(os.curdir)
  print('Current Directory = ', path)
  if return_:
    return path

print_path(return_ = True)

Current Directory =  /


'/'

In [9]:
# Checking out the contents of the directory.

print_dir_contents()

Contents of Folder: '/':
	.dockerenv
	bin
	boot
	content
	datalab
	dev
	etc
	gdrive
	home
	lib
	lib32
	lib64
	media
	mnt
	opt
	proc
	root
	run
	sbin
	srv
	swift
	sys
	tensorflow-1.15.2
	tmp
	tools
	usr
	var


In [10]:
# Getting to the source folder

source_folder = r'/gdrive/My Drive/Datasets-2'
print_dir_contents(source_folder)

Contents of Folder: '/gdrive/My Drive/Datasets-2':
	Appliances.json
	Deepfake-images V2.zip
	allfake.png
	deepfake-video.jpg


### Importing Dataset Into Pandas

In [11]:
# Setting the JSON file path and filename

file_path = '/gdrive/My Drive/Datasets-2/Appliances.json'
fname = 'Appliances.json'

In [12]:
# Importing the json into pandas

appliances_df = pd.read_json(file_path, lines = True)
print(appliances_df.shape)
appliances_df.head()

(602777, 12)


Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,image
0,5,2.0,False,"11 27, 2013",A3NHUQ33CFH3VM,1118461304,{'Format:': ' Hardcover'},Greeny,Not one thing in this book seemed an obvious o...,Clear on what leads to innovation,1385510400,
1,5,,False,"11 1, 2013",A3SK6VNBQDNBJE,1118461304,{'Format:': ' Kindle Edition'},Leif C. Ulstrup,I have enjoyed Dr. Alan Gregerman's weekly blo...,Becoming more innovative by opening yourself t...,1383264000,
2,5,,False,"10 10, 2013",A3SOFHUR27FO3K,1118461304,{'Format:': ' Hardcover'},Harry Gilbert Miller III,Alan Gregerman believes that innovation comes ...,The World from Different Perspectives,1381363200,
3,5,,False,"10 9, 2013",A1HOG1PYCAE157,1118461304,{'Format:': ' Hardcover'},Rebecca Ripley,"Alan Gregerman is a smart, funny, entertaining...",Strangers are Your New Best Friends,1381276800,
4,5,10.0,False,"09 7, 2013",A26JGAM6GZMM4V,1118461304,{'Format:': ' Hardcover'},Robert Morris,"As I began to read this book, I was again remi...","How and why it is imperative to engage, learn ...",1378512000,
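
Since only `reviewText` and `summary` end up being used, one option (a sketch, not what this notebook does) is to read the line-delimited JSON in chunks and keep just those columns as each chunk arrives. The in-memory sample and the names `sample_json`, `keep_columns`, and `reviews_df` are made up for illustration, and the chunk size of 2 is only there to keep the example small.

```python
import io

import pandas as pd

# Tiny stand-in for Appliances.json: one JSON object per line.
sample_json = io.StringIO(
    '{"overall": 5, "reviewText": "Works well.", "summary": "Good"}\n'
    '{"overall": 1, "reviewText": "Broke in a week.", "summary": "Bad"}\n'
    '{"overall": 4, "reviewText": "Fits my dryer.", "summary": "Fits"}\n'
)

keep_columns = ['reviewText', 'summary']

# With lines=True and chunksize set, read_json returns an iterator of DataFrames.
reader = pd.read_json(sample_json, lines=True, chunksize=2)

# Keep only the needed columns from each chunk, then stitch the chunks together.
chunks = [chunk[keep_columns] for chunk in reader]
reviews_df = pd.concat(chunks, ignore_index=True)

print(reviews_df.shape)
```

On the real file a much larger chunk size would make sense; the point is only that the unused columns never have to sit in memory all at once.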


In [13]:
# Checking for null values, since we can already see them in several columns

appliances_df.isna().sum()

overall                0
vote              537515
verified               0
reviewTime             0
reviewerID             0
asin                   0
style             464804
reviewerName          15
reviewText           324
summary              128
unixReviewTime         0
image             593519
dtype: int64

In [14]:
# We can get rid of most of these columns since they aren't relevant to this project

drop_columns = ['overall', 'vote', 'verified', 'reviewTime', 'reviewerID',
                'asin', 'style', 'unixReviewTime', 'image', 'reviewerName']

appliances_df.drop(drop_columns, axis = 1, inplace = True)

print(appliances_df.shape)
appliances_df.head()

(602777, 2)


Unnamed: 0,reviewText,summary
0,Not one thing in this book seemed an obvious o...,Clear on what leads to innovation
1,I have enjoyed Dr. Alan Gregerman's weekly blo...,Becoming more innovative by opening yourself t...
2,Alan Gregerman believes that innovation comes ...,The World from Different Perspectives
3,"Alan Gregerman is a smart, funny, entertaining...",Strangers are Your New Best Friends
4,"As I began to read this book, I was again remi...","How and why it is imperative to engage, learn ..."


#### Summary Column

> Not really sure what to do with the summary column just yet, so we'll leave this as-is for now. 

In [15]:
# for i in appliances_df['summary']:
#     print(i)

## Scrubbing

### Null Values

In [16]:
# Let's check the null values again

appliances_df.isna().sum()

reviewText    324
summary       128
dtype: int64

In [17]:
# We can drop the rows with null values since they make up a tiny share of the data
# (at most 324 + 128 = 452 of 602,777 rows, under 0.1%).
# Also, we can't do much with rows that don't have reviews.

appliances_df.dropna(inplace = True)
appliances_df.isna().sum()

reviewText    0
summary       0
dtype: int64
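
A slightly different way to handle the nulls, shown here only as a sketch on a made-up four-row frame: check the share of missing values per column, drop only the rows that lack review text, and renumber the index so that label lookups such as `['reviewText'][0]` still work even if the original row 0 is removed. The names `df`, `null_pct`, and `df_clean` are hypothetical.

```python
import pandas as pd

# Small stand-in for appliances_df with the same two columns.
df = pd.DataFrame({
    'reviewText': ['Great dryer vent.', None, 'Stopped working.', 'Fits fine.'],
    'summary': ['Five stars', 'No text', None, 'OK'],
})

# Share of missing values per column, as a percentage.
null_pct = df.isna().mean() * 100
print(null_pct)

# Drop only rows with no review text, then renumber the index from 0.
df_clean = df.dropna(subset=['reviewText']).reset_index(drop=True)
print(df_clean.shape)
```

This keeps reviews whose summary is missing, which may or may not be wanted depending on what happens with the summary column.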

### Cleaning Text

In [18]:
# for i in appliances_df['reviewText']:
#     print(i)

In [19]:
# Running several functions for cleaning text

# clean_text applies remove_url, remove_n, and remove_emoji in turn
appliances_df['reviewText'] = clean_text(appliances_df['reviewText'])

In [20]:
# Checking to make sure this worked

appliances_df['reviewText'][0]

"Not one thing in this book seemed an obvious original thought. However, the clarity with which this author explains how innovation happens is remarkable.Alan Gregerman discusses the meaning of human interactions and the kinds of situations that tend to inspire original and/or clear thinking that leads to innovation. These things include how people communicate in certain situations such as when they are outside of their normal patterns.Gregerman identifies the ingredients that make innovation more likely. This includes people being compelled to interact when they normally wouldn't, leading to serendipity. Sometimes the phenomenon will occur through collaboration, and sometimes by chance such as when an individual is away from home on travel.I recommend this book for its common sense, its truth and the apparent mastery of the subject by the author."

In [21]:
appliances_df['summary'] = clean_text(appliances_df['summary'])

In [22]:
appliances_df['summary'][0]

'Clear on what leads to innovation'

In [23]:
appliances_df.head()

Unnamed: 0,reviewText,summary
0,Not one thing in this book seemed an obvious o...,Clear on what leads to innovation
1,I have enjoyed Dr. Alan Gregerman's weekly blo...,Becoming more innovative by opening yourself t...
2,Alan Gregerman believes that innovation comes ...,The World from Different Perspectives
3,"Alan Gregerman is a smart, funny, entertaining...",Strangers are Your New Best Friends
4,"As I began to read this book, I was again remi...","How and why it is imperative to engage, learn ..."


### Text Preprocessing

#### Tokenizing

In [24]:
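# Create a function to tokenize

# Build the stop word set and the tokenizer once. Calling stopwords.words('english')
# for every token reloads the word list each time and is very slow.
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')

def tokenize_words(text):
    # Make everything lowercase
    text = text.lower()

    # Tokenize
    tokens = tokenizer.tokenize(text)

    # Filter out stop words for now
    filtered_words = [token for token in tokens if token not in stop_words]

    return " ".join(filtered_words)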

In [25]:
# Testing function

tokenize_words(appliances_df['reviewText'][0])

'one thing book seemed obvious original thought however clarity author explains innovation happens remarkable alan gregerman discusses meaning human interactions kinds situations tend inspire original clear thinking leads innovation things include people communicate certain situations outside normal patterns gregerman identifies ingredients make innovation likely includes people compelled interact normally leading serendipity sometimes phenomenon occur collaboration sometimes chance individual away home travel recommend book common sense truth apparent mastery subject author'

> We might also drop the summary column for now so we're not iterating over both columns, which would cut down on runtime.

In [27]:
appliances_df['reviewText'] = appliances_df['reviewText'].apply(lambda x: tokenize_words(x))
appliances_df.head()

KeyboardInterrupt: ignored
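
Before running a tokenizer over roughly 600,000 reviews, it can help to time it on a small sample and scale the result up. The sketch below does that with the standard library only. The stop word set is a tiny made-up stand-in for NLTK's list, `re.findall(r'\w+', ...)` is used to mirror the `\w+` tokenizer pattern, and `tokenize_sample` and `estimate_total_seconds` are hypothetical helper names.

```python
import re
import time

# Hypothetical, tiny stop word set for illustration; the notebook uses NLTK's list.
STOP_WORDS = {'the', 'a', 'is', 'it', 'this', 'and', 'to'}


def tokenize_sample(text):
    """Lowercase, split on word characters, and drop stop words."""
    tokens = re.findall(r'\w+', text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def estimate_total_seconds(texts, total_rows):
    """Time the tokenizer on `texts` and scale up linearly to `total_rows`."""
    start = time.perf_counter()
    for text in texts:
        tokenize_sample(text)
    elapsed = time.perf_counter() - start
    return elapsed / len(texts) * total_rows


# 1,000 made-up reviews; 602_777 is the row count before nulls were dropped.
sample_reviews = ["This is the best filter.", "It broke and leaked."] * 500
estimate = estimate_total_seconds(sample_reviews, total_rows=602_777)
print(f"Estimated time for the full column: {estimate:.1f} s")
```

The estimate is rough, since real reviews vary a lot in length, but it gives an idea of whether a cell is worth waiting for before committing to a full run.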

## Data Exploration

### Basic NLP

## Modeling