In [1]:
# Load in some data for use in some of the lessons
library(RSQLite)
library(tidyverse)

sub <- "MachineLearning" 
db <- src_sqlite('../input/reddit-comments-may-2015/reddit-comments-may-2015/database.sqlite', create = FALSE)

# Load the desired subset of data from the database
db_subset <- db %>% 
             tbl('May2015') %>% 
             filter(subreddit == sub)
          
data <- data.frame(db_subset)[, c("author","score","body")]

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──

✔ ggplot2 3.2.1.9000     ✔ purrr   0.3.3
✔ tibble  2.1.3          ✔ dplyr   0.8.3
✔ tidyr   1.0.0          ✔ stringr 1.4.0
✔ readr   1.3.1          ✔ forcats 0.4.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()




# stringr: Basic String Manipulation

In [2]:
library(tidyverse)

chr_data <- c("Data", "Daft", "YouTube", "channel",
             "learn", "and", "have", "FUN!")

In [3]:
# Check the length of a string
str_length("awefon8g-gn951nksjdg")
str_length(chr_data)

In [4]:
# Convert string letters to uppercase
str_to_upper(chr_data)

In [5]:
# Convert string letters to lowercase
str_to_lower(chr_data)

In [6]:
# Convert string to title (first letter uppercase)
str_to_title(chr_data)

In [7]:
# Convert string to sentence (only first letter of first word uppercase)
str_to_sentence("make me into a SENTENCE!")

In [8]:
# Trim whitespace
str_trim("  Trim Me!   ")
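
`str_trim()` strips both ends by default. The sketch below shows its `side` argument alongside `str_squish()`, which also collapses runs of internal whitespace. The `messy` string is made up for illustration.

```r
library(stringr)

# Illustrative input (not part of the lesson data)
messy <- "  Trim   Me!   "

# Trim one side only; side = "both" is the default
left_trimmed  <- str_trim(messy, side = "left")   # "Trim   Me!   "
right_trimmed <- str_trim(messy, side = "right")  # "  Trim   Me!"

# Trim both ends and collapse repeated internal whitespace
squished <- str_squish(messy)                     # "Trim Me!"
```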

In [9]:
# Pad strings with whitespace
str_pad("Pad Me!", width = 15, side="both")
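
Padding is not limited to spaces. As a small sketch, the `pad` argument can zero-pad identifiers, and strings already longer than `width` come back unchanged. The `ids` vector is a hypothetical example.

```r
library(stringr)

# Hypothetical identifiers to align
ids <- c("7", "42", "365")

# side = "left" is the default, so padding is added in front
padded <- str_pad(ids, width = 3, pad = "0")      # "007" "042" "365"

# str_pad() never truncates: longer strings are returned as-is
unchanged <- str_pad("already long", width = 5)   # "already long"
```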

In [10]:
# Truncate strings to a given length
str_trunc("If you have a long string, you might want to truncate it!", 
          width = 50)
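
The `width` passed to `str_trunc()` includes the ellipsis itself. A short sketch of the `side` and `ellipsis` arguments, using a made-up sentence:

```r
library(stringr)

sentence <- "If you have a long string"

# Keep the start (default) or the end of the string
keep_start <- str_trunc(sentence, width = 10)                  # "If you ..."
keep_end   <- str_trunc(sentence, width = 10, side = "left")   # "... string"

# Use a custom marker instead of "..."
custom <- str_trunc(sentence, width = 10, ellipsis = "~")      # "If you ha~"
```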


# stringr: Split and Join Strings

In [11]:
library(tidyverse)

# Split strings
str_split("Split Me!", pattern = " ")
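
`str_split()` returns a list with one element per input string, because each string can yield a different number of pieces. The sketch below, on a hypothetical `paths` vector, shows that shape together with the `simplify` and `n` arguments.

```r
library(stringr)

# Hypothetical input with a different number of pieces per string
paths <- c("a-b-c", "d-e")

# Default: a list of character vectors
as_list <- str_split(paths, pattern = "-")

# simplify = TRUE: a character matrix, short rows padded with ""
as_matrix <- str_split(paths, pattern = "-", simplify = TRUE)

# n limits the number of pieces; the remainder stays in the last piece
limited <- str_split("a-b-c", pattern = "-", n = 2)   # list(c("a", "b-c"))
```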

In [12]:
# Join strings (similar to base R paste(), but the default separator is "")
str_c("Join", "Me!", sep="_")

# Join two character vectors element-wise
str_c(c("Join", "vectors"), c("Me!", "too!"), sep="_")

In [13]:
# Collapse a vector of strings into a single string
str_c(c("Turn", "me", "into", "one", "string!"), collapse= " ")

In [14]:
# Convert NA values in character vector to string "NA"
str_replace_na(c("Make", NA, "strings!"))
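
The conversion matters because `str_c()` propagates missing values, whereas base R's `paste0()` silently turns them into the text "NA". A minimal sketch of the difference:

```r
library(stringr)

x <- c("Make", NA, "strings!")

# str_c() keeps NA as NA
joined_stringr <- str_c(x, "!")                     # "Make!" NA "strings!!"

# paste0() converts NA to the string "NA"
joined_base <- paste0(x, "!")                       # "Make!" "NA!" "strings!!"

# Replace NA explicitly before joining, optionally with custom text
joined_safe <- str_c(str_replace_na(x, replacement = "missing"), "!")
```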


# stringr: Sorting Strings

In [15]:
library(tidyverse)

sort_data <- c("sort", "me", "please!")

# Get the vector of indices that would sort the strings alphabetically
str_order(sort_data)

In [16]:
# Use discovered ordering to extract data in sorted order
sort_data[str_order(sort_data)]

In [17]:
# Directly extract sorted strings
str_sort(sort_data)

In [18]:
# Extract in reverse sorted order
str_sort(sort_data, decreasing = TRUE)
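
Alphabetical order can surprise when strings contain numbers. As a sketch with hypothetical file names, the `numeric` argument of `str_sort()` compares embedded digits as numbers instead of character by character:

```r
library(stringr)

# Hypothetical file names
files <- c("file10", "file2", "file1")

# Character-by-character order puts "file10" before "file2"
alphabetical <- str_sort(files)                  # "file1" "file10" "file2"

# numeric = TRUE sorts the embedded digits as numbers
natural <- str_sort(files, numeric = TRUE)       # "file1" "file2" "file10"
```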


# stringr: String Interpolation

In [19]:
library(tidyverse)

first <- c("Luke", "Han", "Jean-Luc")
last <- c("Skywalker", "Solo", "Picard")

# Interpolate (insert variable values) into strings with str_glue()
str_glue("My name is {first}. {first} {last}.")

In [20]:
minimum_age <- 18
over_minimum <- c(5, 17, 33)

# Interpolate the result of an R expression into a string
str_glue("{first} {last} is {minimum_age + over_minimum} years old.")

In [21]:
num <- 1:5

# Interpolate the result of function calls
str_glue("The square root of {num} is {round(sqrt(num), 3)}.")

In [22]:
fuel_efficiency <- 30

# Interpolate strings using data from a data frame
mtcars %>% rownames_to_column("Model") %>%
         filter(mpg > fuel_efficiency) %>%
         str_glue_data("The {Model} gets {mpg} mpg.")
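
Values do not have to exist as variables in the environment. As a small sketch, `str_glue()` also accepts named arguments, and doubled braces produce literal braces. The names and values here are invented for illustration.

```r
library(stringr)

# Supply values as named arguments instead of existing variables
greeting <- str_glue("Hello, {name}! You are {age} years old.",
                     name = "Ada", age = 36)

# Double the braces to print them literally
literal <- str_glue("Use {{braces}} to show {x}.", x = "placeholders")

greeting
literal
```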


# stringr: String Matching

In [23]:
library(tidyverse)

head(data,8)

,author,score,body
,<chr>,<int>,<chr>
1,benanne,3,"I would advise against using RBMs nowadays. If you want to pre-train a deep autoencoder, you can just train shallow autoencoders to do that. It will work just as well as training an RBM and it's much simpler conceptually and in terms of implementation. But depending on the problem, pre-training may be unnecessary anyway. Just build a deep autoencoder with ReLUs and possibly dropout and you should be able to train that end-to-end in many cases."
2,butt_ghost,3,"Hdf5. It's structured, it's easy to get data in and out, and it's fast. Plus it will scale if you ever get up there in dataset size."
3,buntaro_pup,1,"yep, good point."
4,iidealized,2,Google must have done (and is doing) serious internal research in ranking. I've heard they're pretty good at that and they've even made some money doing it :P
5,[deleted],1,[deleted]
6,stathibus,6,"Sebastian Thrun's book, Probabilistic Robotics, goes through this in great detail. Get it, read it, make it your bible."
7,soulslicer0,2,"This. Such a fucking legendary book. Kalman filters, particle filters, recursive Bayesian filters and a whole bunch of other stuff. I learnt so much. Read these 3 for starts from the book, then come back and ask the questions"
8,swiftsecond,1,Do you still need help?


In [24]:
# Detecting the presence of a pattern in strings
str_detect(data$body[1:100], pattern="deep")
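
Patterns are treated as case-sensitive regular expressions by default. The sketch below, on a few made-up comments, shows how `regex()` and `fixed()` change that behaviour.

```r
library(stringr)

# Made-up comments for illustration
comments <- c("Deep nets are fun.", "deep learning", "deeper models")

# Default: case-sensitive regular expression
plain <- str_detect(comments, "deep")                               # FALSE TRUE TRUE

# Ignore case with regex()
any_case <- str_detect(comments, regex("deep", ignore_case = TRUE)) # TRUE TRUE TRUE

# Match whole words only with word boundaries
whole_word <- str_detect(comments, "\\bdeep\\b")                    # FALSE TRUE FALSE

# fixed() matches literal text, so "." is a period and not "any character"
has_period <- str_detect(comments, fixed("."))                      # TRUE FALSE FALSE
```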

In [25]:
# Get the indices of matched strings
str_inds <- str_which(data$body[1:100], pattern="deep")
str_inds

In [26]:
# Extract matched strings using detected indices
data$body[str_inds]

In [27]:
# Count the number of matches
str_count(data$body[1:100], "deep")

In [28]:
# Get the position of matches
str_locate_all(data$body[1], "deep")

start,end
72,75
338,341
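
The start and end positions can be fed straight back into `str_sub()`. A minimal sketch on a made-up string:

```r
library(stringr)

txt <- "a deep net is deeper"

# One matrix of positions per input string; take the first (and only) one
loc <- str_locate_all(txt, "deep")[[1]]

# Pull out each match by position
matches <- str_sub(txt, start = loc[, "start"], end = loc[, "end"])  # "deep" "deep"
```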


In [29]:
# Get the first match found in each string as a character vector
str_extract(data$body[1:3], "deep|the|and")

In [30]:
# Get the first match found in each string as a character matrix
str_match(data$body[1:3], "deep|the|and")

0
deep
and
""


In [31]:
# Get all matches found in each string as a list of matrices
str_match_all(data$body[1:3], "deep|the|and")

0
deep
and
and
the
deep
and
and

0
and
and
the
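
The matrix form becomes more useful once the pattern contains capture groups: the first column holds the full match and each later column holds one group. A sketch with made-up input:

```r
library(stringr)

# Made-up strings; the last one does not match
x <- c("score: 12", "score: 7", "no score here")

# Two capture groups: the label and the number
m <- str_match(x, "(\\w+): (\\d+)")

full_match <- m[, 1]              # "score: 12" "score: 7" NA
labels     <- m[, 2]              # "score" "score" NA
numbers    <- as.integer(m[, 3])  # 12 7 NA
```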



# stringr: Subset and Replace Strings

In [32]:
library(tidyverse)

head(data,8)

,author,score,body
,<chr>,<int>,<chr>
1,benanne,3,"I would advise against using RBMs nowadays. If you want to pre-train a deep autoencoder, you can just train shallow autoencoders to do that. It will work just as well as training an RBM and it's much simpler conceptually and in terms of implementation. But depending on the problem, pre-training may be unnecessary anyway. Just build a deep autoencoder with ReLUs and possibly dropout and you should be able to train that end-to-end in many cases."
2,butt_ghost,3,"Hdf5. It's structured, it's easy to get data in and out, and it's fast. Plus it will scale if you ever get up there in dataset size."
3,buntaro_pup,1,"yep, good point."
4,iidealized,2,Google must have done (and is doing) serious internal research in ranking. I've heard they're pretty good at that and they've even made some money doing it :P
5,[deleted],1,[deleted]
6,stathibus,6,"Sebastian Thrun's book, Probabilistic Robotics, goes through this in great detail. Get it, read it, make it your bible."
7,soulslicer0,2,"This. Such a fucking legendary book. Kalman filters, particle filters, recursive Bayesian filters and a whole bunch of other stuff. I learnt so much. Read these 3 for starts from the book, then come back and ask the questions"
8,swiftsecond,1,Do you still need help?


In [33]:
# Get a string subset based on character position
str_sub(data$body[1], start=1, end=100)

In [34]:
# Get a string subset based on words
word(data$body[1], start=1, end=10)
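
`word()` also accepts negative positions, which count from the end of the string. A short sketch on a made-up sentence:

```r
library(stringr)

sentence <- "The quick brown fox"

last_word  <- word(sentence, -1)                    # "fox"
all_but_first <- word(sentence, start = 2, end = -1) # "quick brown fox"
```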

In [35]:
# Get the strings that contain a certain pattern
str_subset(data$body[1:100], pattern="deep")

In [36]:
# Replace a substring with a new string by substring position
str_sub(data$body[1], start=1, end=100) <- str_to_upper(str_sub(data$body[1], 
                                                                start=1, 
                                                                end=100))
str_sub(data$body[1], start=1, end=100)

In [37]:
# Replace first occurrence of a substring with a new string by matching
str_replace(data$body[1], pattern="deep|DEEP", replacement="multi-layer")

In [38]:
# Replace all occurrences of a substring with a new string by matching
str_replace_all(data$body[1], pattern="deep|DEEP", replacement="multi-layer")
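
Instead of listing both spellings in the pattern, the replacement functions offer a few alternatives. The sketch below uses made-up strings to show case-insensitive matching, a named vector of replacements, and backreferences to capture groups.

```r
library(stringr)

# Ignore case rather than spelling out "deep|DEEP"
any_case <- str_replace_all("deep DEEP Deep",
                            regex("deep", ignore_case = TRUE),
                            "multi-layer")
# "multi-layer multi-layer multi-layer"

# A named vector applies several pattern = replacement pairs
multiple <- str_replace_all("deep nets and DEEP models",
                            c("deep" = "multi-layer", "DEEP" = "MULTI-LAYER"))
# "multi-layer nets and MULTI-LAYER models"

# \\1, \\2, ... refer back to capture groups in the pattern
reordered <- str_replace("2015-05-01", "(\\d+)-(\\d+)-(\\d+)", "\\3/\\2/\\1")
# "01/05/2015"
```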


# stringr: Viewing Strings

In [39]:
library(tidyverse)

In [40]:
# Basic printing
print(data$body[1:10])

 [1] "I WOULD ADVISE AGAINST USING RBMS NOWADAYS. IF YOU WANT TO PRE-TRAIN A DEEP AUTOENCODER, YOU CAN JUSt train shallow autoencoders to do that. It will work just as well as training an RBM and it's much simpler conceptually and in terms of implementation.\n\nBut depending on the problem, pre-training may be unnecessary anyway. Just build a deep autoencoder with ReLUs and possibly dropout and you should be able to train that end-to-end in many cases."
 [2] "Hdf5. It's structured, it's easy to get data in and out, and it's fast. Plus it will scale if you ever get up there in dataset size. "                                                                                                                                                                                                                                                                                                                             
 [3] "yep, good point."                                                             

In [41]:
deep_learning_posts <- data$body[str_which(data$body, "deep learning")]

# View strings in HTML format with the first occurrence of a pattern highlighted
str_view(deep_learning_posts, pattern="deep")

In [42]:
# View strings in HTML format with all occurrences of a pattern highlighted
str_view_all(deep_learning_posts, pattern="deep")

In [43]:
# Format strings into paragraphs of a given width with str_wrap()
wrapped <- str_wrap(data$body[str_which(data$body, "deep learning")][1], 
                    width = 50)
wrapped 

In [44]:
# Print wrapped string with output obeying newlines
wrapped %>% cat()

Word2vec isn't deep learning. Its explicitly
and deliberately as shallow as possible - one of
Mikolov's central realisations was that a very
simple model trained on vast amounts of data
can be superior to a complicated model trained
on less data. There's another algorithm called
Paragraph2Vec which builds on Word2Vec and *does*
incorporate 'deep learning' (in so far as it uses
a neural network). Its purpose is to amalgamate
the Word2Vec vectors for the words in a chunk of
text - kinda like you do by averaging them. P2V
tries to find a paragraph vector which is a good
predictor of the word vectors, with the idea being
that such a vector effectively represents the
content of the paragraph.
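
`str_wrap()` can also indent the result. As a sketch, `indent` sets the leading spaces of the first line and `exdent` those of the following lines. The `text` value is a sentence shortened from the post above.

```r
library(stringr)

text <- "A very simple model trained on vast amounts of data can be superior to a complicated model trained on less data."

# First line indented by 2 spaces, continuation lines by 4
wrapped_indented <- str_wrap(text, width = 30, indent = 2, exdent = 4)

# cat() prints the embedded newlines as real line breaks
cat(wrapped_indented, "\n")
```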

In [45]:
# Display wrapped paragraph as HTML, inserting paragraph breaks
str_wrap(data$body[str_which(data$body, "deep learning")][1], width = 50) %>%
  str_replace_all("\n", "<br>") %>%
  str_view_all(pattern = "deep")


# The End