<a href="https://colab.research.google.com/github/scveatch/Buddhabrot/blob/main/SparkNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's set up Spark NLP.

## Hello world -- Part 2

In [None]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2024-07-09 22:36:30--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 3.86.22.73
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|3.86.22.73|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2024-07-09 22:36:30--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’


2024-07-09 22:36:30 (12.8 MB/s) - written to stdout [1191/1191]

Installing PySpark 3.2.3 and Spark NLP 5.4.0
setup Colab for PySpark 3.2.3 and Spark NLP 5.4.0


In [None]:
# Access Data
!curl "https://raw.githubusercontent.com/WillA656/NLP_Project/main/Gutendex_JSON" -o book_list
!wget -i book_list

In [None]:
import sparknlp
spark = sparknlp.start()

from sparknlp.pretrained import PretrainedPipeline



In [None]:
pipeline = PretrainedPipeline("explain_document_ml")

explain_document_ml download started this may take some time.
Approx size to download 9 MB
[OK!]


We can use a few short sentences in place of recent headlines.

In [None]:
hls = [  # was headlines
    "She ran",
    "He ran",
    "I saw her",
    "I saw him",
    "I know her name",
    "I know his name",
    "That is hers",
    "That is his",
]

Let's use Spark NLP to analyze these sentences.

In [None]:
# Use DataFrames, or...
# data = spark.createDataFrame([(hl,) for hl in hls], ["text"])
# dfs = pipeline.transform(data)
# ... use a list comprehension
dfs = [pipeline.annotate(hl) for hl in hls]  # one dict of annotations per sentence

In [None]:
# it's big
dfs

[{'document': ['She ran'],
  'spell': ['She', 'ran'],
  'pos': ['PRP', 'VBD'],
  'lemmas': ['She', 'run'],
  'token': ['She', 'ran'],
  'stems': ['she', 'ran'],
  'sentence': ['She ran']},
 {'document': ['He ran'],
  'spell': ['He', 'ran'],
  'pos': ['PRP', 'VBD'],
  'lemmas': ['He', 'run'],
  'token': ['He', 'ran'],
  'stems': ['he', 'ran'],
  'sentence': ['He ran']},
 {'document': ['I saw her'],
  'spell': ['I', 'saw', 'her'],
  'pos': ['PRP', 'VBD', 'PRP$'],
  'lemmas': ['I', 'see', 'she'],
  'token': ['I', 'saw', 'her'],
  'stems': ['i', 'saw', 'her'],
  'sentence': ['I saw her']},
 {'document': ['I saw him'],
  'spell': ['I', 'saw', 'him'],
  'pos': ['PRP', 'VBD', 'PRP'],
  'lemmas': ['I', 'see', 'he'],
  'token': ['I', 'saw', 'him'],
  'stems': ['i', 'saw', 'him'],
  'sentence': ['I saw him']},
 {'document': ['I know her name'],
  'spell': ['I', 'know', 'her', 'name'],
  'pos': ['PRP', 'VBP', 'PRP', 'NN'],
  'lemmas': ['I', 'know', 'she', 'name'],
  'token': ['I', 'know', 'her', 

Let's say we want to fuse part-of-speech tags to words, so that a word tagged two different ways (for example, 'her' as PRP versus PRP$) becomes two distinct tokens.

In [None]:
# Extract words and parts of speech from each annotation dict
tok_tag = [(df['token'], df['pos']) for df in dfs]

In [None]:
# Still big
tok_tag

[(['She', 'ran'], ['PRP', 'VBD']),
 (['He', 'ran'], ['PRP', 'VBD']),
 (['I', 'saw', 'her'], ['PRP', 'VBD', 'PRP$']),
 (['I', 'saw', 'him'], ['PRP', 'VBD', 'PRP']),
 (['I', 'know', 'her', 'name'], ['PRP', 'VBP', 'PRP', 'NN']),
 (['I', 'know', 'his', 'name'], ['PRP', 'VBP', 'PRP$', 'NN']),
 (['That', 'is', 'hers'], ['DT', 'VBZ', 'NNS']),
 (['That', 'is', 'his'], ['DT', 'VBZ', 'PRP$'])]

In [None]:
# Pair each word with its part-of-speech tag
zips = [list(zip(tokens, tags)) for tokens, tags in tok_tag]

In [None]:
# not too big
zips

[[('She', 'PRP'), ('ran', 'VBD')],
 [('He', 'PRP'), ('ran', 'VBD')],
 [('I', 'PRP'), ('saw', 'VBD'), ('her', 'PRP$')],
 [('I', 'PRP'), ('saw', 'VBD'), ('him', 'PRP')],
 [('I', 'PRP'), ('know', 'VBP'), ('her', 'PRP'), ('name', 'NN')],
 [('I', 'PRP'), ('know', 'VBP'), ('his', 'PRP$'), ('name', 'NN')],
 [('That', 'DT'), ('is', 'VBZ'), ('hers', 'NNS')],
 [('That', 'DT'), ('is', 'VBZ'), ('his', 'PRP$')]]

In [None]:
# Concatenate each (word, tag) pair, then rejoin the pairs into one string per sentence
tagged = [" ".join(["".join(word) for word in hl]) for hl in zips]

In [None]:
tagged

['ShePRP ranVBD',
 'HePRP ranVBD',
 'IPRP sawVBD herPRP$',
 'IPRP sawVBD himPRP',
 'IPRP knowVBP herPRP nameNN',
 'IPRP knowVBP hisPRP$ nameNN',
 'ThatDT isVBZ hersNNS',
 'ThatDT isVBZ hisPRP$']
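
Plain concatenation loses the boundary between word and tag, so a fused token such as `ShePRP` cannot be split back reliably. The following is a minimal sketch of the same fusing step with a separator, assuming an underscore is acceptable to whatever consumes the tokens. `fuse_tags` and its `sep` parameter are hypothetical names for illustration, not part of Spark NLP.

```python
def fuse_tags(tokens, tags, sep="_"):
    """Join each token to its POS tag with a separator and return one string."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return " ".join(f"{tok}{sep}{tag}" for tok, tag in zip(tokens, tags))


# Tokens and tags copied from the annotation output shown earlier.
example = fuse_tags(["I", "saw", "her"], ["PRP", "VBD", "PRP$"])
print(example)  # I_PRP saw_VBD her_PRP$
```

With a separator in place, `token.rsplit("_", 1)` recovers the word and the tag from each fused token.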

What about ebooks?

In [None]:
!curl "https://raw.githubusercontent.com/cd-public/books/main/pg1342.txt" -o austen.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  739k  100  739k    0     0  3198k      0 --:--:-- --:--:-- --:--:-- 3202k


In [None]:
with open('austen.txt', encoding='utf-8') as f:
    austen = f.read()

In [None]:
print(austen[:1000])

﻿The Project Gutenberg eBook of Pride and Prejudice
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Pride and Prejudice

Author: Jane Austen

Release date: June 1, 1998 [eBook #1342]
                Most recently updated: April 14, 2023

Language: English

Credits: Chuck Greif and the Online Distributed Proofreading Team at http://www.pgdp.net (This file was produced from images available at The Internet Archive)


*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***





                            [Illustration:

                             GEORGE ALLEN
                 

In [None]:
pipeline.annotate(austen[:100])['pos']

['DT',
 'NNP',
 'NNP',
 'NN',
 'IN',
 'NNP',
 'CC',
 'NNP',
 'DT',
 'NN',
 'VBZ',
 'IN',
 'DT',
 'NN',
 'IN',
 'NN',
 'RB']
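
The first 100 characters are Project Gutenberg front matter rather than Austen's prose, so these tags describe the license notice. The following is a minimal sketch of skipping the header before annotating, assuming the `*** START OF THE PROJECT GUTENBERG EBOOK` line seen in the printout above marks where the header ends. `strip_header` is a hypothetical helper, and the sample text is made up for illustration.

```python
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"


def strip_header(text, marker=START_MARKER):
    """Return the text after the marker line, or the text unchanged if no marker."""
    idx = text.find(marker)
    if idx == -1:
        return text
    # Skip to the end of the line that contains the marker.
    newline = text.find("\n", idx)
    return text[newline + 1:] if newline != -1 else ""


# Made-up sample that imitates the layout of the real file.
sample = (
    "Title: Pride and Prejudice\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n"
    "Chapter I.\n"
)
print(strip_header(sample))
```

The result of `strip_header(austen)` could then be sliced and passed to `pipeline.annotate` in place of `austen[:100]`.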

Previously with ebooks, we conducted word counts. We can do that here as well, with Spark.

In [None]:
from pyspark.sql import SparkSession

# getOrCreate() reuses a running session if one exists (here, the one from sparknlp.start())
spark = SparkSession.builder.appName("demo").getOrCreate()

In [None]:
# Rebind 'austen' from a Python string to a Spark RDD of lines
austen = spark.sparkContext.textFile("austen.txt")

counts = (
    austen.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

In [None]:
# take(10) returns the first ten pairs without collecting the whole RDD
counts.take(10)

[('The', 285),
 ('Project', 79),
 ('of', 3897),
 ('Pride', 7),
 ('', 10603),
 ('ebook', 2),
 ('is', 861),
 ('use', 23),
 ('anyone', 20),
 ('anywhere', 3)]
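
The `('', 10603)` entry appears because `split(" ")` produces an empty string for every blank line and for each extra space in a run of spaces. As a plain-Python sketch of what the `flatMap`/`map`/`reduceByKey` chain computes, here is a `collections.Counter` version that splits on whitespace with `str.split()` and so drops those empty tokens. `count_words` is a hypothetical helper, and the sample lines are made up for illustration.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated words across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # str.split() with no argument splits on runs of whitespace and
        # never returns empty strings, unlike split(" ").
        counts.update(line.split())
    return counts


# Made-up sample lines, including a blank line and a double space.
sample_lines = ["The use of  anyone", "", "The ebook"]
word_counts = count_words(sample_lines)
print(word_counts["The"], "" in word_counts)
```

In the Spark version, passing `lambda line: line.split()` to `flatMap` should have the same effect on the empty-string entry.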