# Download Data

In [None]:
!gdown 1ejm645rRkuIdt1VETpICmJbLGyxMCf7F
!gdown 1nGfw8MOq_gYV55uJuFQ_1wgjpSIsvVw1

Downloading...
From: https://drive.google.com/uc?id=1ejm645rRkuIdt1VETpICmJbLGyxMCf7F
To: /content/Ukraine.json
100% 33.5k/33.5k [00:00<00:00, 43.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1nGfw8MOq_gYV55uJuFQ_1wgjpSIsvVw1
To: /content/London.json
100% 30.6k/30.6k [00:00<00:00, 29.5MB/s]


# Install Spark

In [None]:
# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
# note that this is grabbing from the archive
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# unzip the spark file to the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

# set your spark folder to your system path environment. 
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"


# install findspark using pip
!pip install -q findspark

# findspark
import findspark
findspark.init()

# Start session

We start the builder pattern `SparkSession.builder` and then chain a configuration parameter that defines the application name.

Providing a useful `appName` helps you identify which programs are running on your Spark cluster.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .appName("Exploring_JSON_actors")\
        .getOrCreate()

In [None]:
import pyspark.sql.functions as F

# Q1.  Inspect the .html and .json files

Imagine that someone just sent you the search results for these two queries. You can see that someone has been searching for 'top actors' in two cities (Kyiv, Ukraine and London, England). The first geographic name refers to the 'top actors', and while the second set of geographic names looks duplicative, it is actually where you are telling Google to search from!

If you dig through the two JSON files, you can actually click through and see the HTML result! It is buried in the JSON file but you can find it after looking around. Just lop off the **%7C** at the end and you can see the raw webpage you grabbed.
* https://serpapi.com/searches/2667d9f1ec89c315/63f12abf33b236a2bb4e555d.html%7C

Here are the clean HTML files:
* https://serpapi.com/searches/0fb523f94d3aeeac/63f12abda3f4ef6286fc8dfd.html
* https://serpapi.com/searches/2667d9f1ec89c315/63f12abf33b236a2bb4e555d.html

You can also replace the .html and just get the .json right away, too.
* https://serpapi.com/searches/0fb523f94d3aeeac/63f12abda3f4ef6286fc8dfd.json
* https://serpapi.com/searches/2667d9f1ec89c315/63f12abf33b236a2bb4e555d.json
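The trailing `%7C` is the percent-encoded form of the pipe character `|`, so it is not part of the file name. Here is a minimal sketch of the clean-up and the `.html` to `.json` swap described above, using only the standard library. The helper name `clean_serpapi_url` is made up for illustration.

```python
from urllib.parse import unquote


def clean_serpapi_url(raw_url: str, extension: str = "html") -> str:
    """Drop a trailing percent-encoded pipe and attach the requested extension."""
    # %7C decodes to "|"; decode first, then strip any trailing pipe characters.
    url = unquote(raw_url).rstrip("|")
    # Everything before the last "." is the stem; the rest is the old extension.
    stem, _, _ = url.rpartition(".")
    return f"{stem}.{extension}"


raw = "https://serpapi.com/searches/2667d9f1ec89c315/63f12abf33b236a2bb4e555d.html%7C"
html_url = clean_serpapi_url(raw)
json_url = clean_serpapi_url(raw, extension="json")
print(html_url)
print(json_url)
```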


Now that you have examined both the .html and .json files for these two search queries, **write down three interesting observations that you see**. Pay attention to how the structure of the .html website is efficiently captured in the .json file. No code here - just your qualitative observations.

## Interesting Thing 1

Each result link on the HTML page maps to its own block in the JSON file, and that block holds the data for that result.

## Interesting Thing 2

The JSON file states explicitly how the results were matched ("organic_results_state": "Results for exact spelling") and lists the results themselves under "organic_results".

## Interesting Thing 3

HTML files are user-friendly, while JSON files are computer-friendly.

# 🔴 Ukraine

I think this is the easier file to work with, so let's start here.

# Q2. Read the Ukraine .json file and printSchema()

Describe what you see in the schema.

In [None]:
# multiLine=True parses the whole file as one JSON document instead of one document per line
ukraine_df = spark.read.json("/content/Ukraine.json", multiLine=True)

In [None]:
ukraine_df.printSchema()
ukraine_df.count()

root
 |-- inline_images: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- link: string (nullable = true)
 |    |    |-- original: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |    |    |-- source_name: string (nullable = true)
 |    |    |-- thumbnail: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |-- organic_results: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- about_page_link: string (nullable = true)
 |    |    |-- about_page_serpapi_link: string (nullable = true)
 |    |    |-- about_this_result: struct (nullable = true)
 |    |    |    |-- source: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- icon: string (nullable = true)
 |    |    |    |    |-- security: string (nullable = true)
 |    |    |    |    |-- source_info_link: string (nullable = true)
 |    |    |-- cached_page_link: 

1

In [None]:
ukraine_df.show(10)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       inline_images|     organic_results|          pagination|    related_searches|  search_information|     search_metadata|   search_parameters|  serpapi_pagination|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|[[https://www.goo...|[[https://www.goo...|[1, https://www.g...|[[https://www.goo...|[[[, 1,, All], [h...|[2023-02-18 19:45...|[desktop, google,...|[1, https://serpa...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



# Q3. Extract the `organic_results.snippet` from the Ukraine .json
Are there any actors listed here? What does the output look like?

In [None]:
organic_results_snippet = ukraine_df.select("organic_results.snippet")

In [None]:
organic_results_snippet.select(F.explode("snippet").alias("snippet")).show(100,False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|snippet                                                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|Alec Utgoff is a British actor best known for playing 'Alexei' in the Netflix hit show, ... Ivanna Sakhno was born on November 14, 1997 in Kiev, Ukraine.|
|From Ben Stiller to Jessica Chastain, celebrities have embraced Ukraine's president and offered support to the country's war effort.                     |
|Zelenskyy wrote on his social media page that Sean Penn is in Ukraine filming a movie about the war. He also said that Penn was in Ukraine ...           |
|Ukrainian actor Roman Matsiyta was ready to fight once the war 

There are actors listed in `organic_results.snippet`. Their names are embedded in sentences that describe them or report news about them.

# Q4. Extract the `organic_results.snippet_highlighted_words` from the Ukraine .json
Are there any actors listed here? What does the output look like?

In [None]:
organic_results_snippet_highlighted = ukraine_df.select("organic_results.snippet_highlighted_words")

In [None]:
organic_results_snippet_highlighted.select(F.explode("snippet_highlighted_words").alias("highlighted_words")).show(100,False)

+------------------------------+
|highlighted_words             |
+------------------------------+
|[actor best, Kiev, Ukraine]   |
|[Ukraine's]                   |
|null                          |
|[Ukrainian actor]             |
|[Ukrainian, actor]            |
|[actor, city, Ukrainian, Kyiv]|
|[Top Kyiv, Kyiv, Ukraine]     |
|[city, Kyiv, Ukrainian, city] |
+------------------------------+



There are no actor names in `organic_results.snippet_highlighted_words`. Each row holds the words highlighted in one search result's snippet, and a result with no highlighted words shows as null.

# Q5. Extract the `search_parameters.q` to get the name of the search query
Does the search you extracted match the search on the HTML page?



In [None]:
search_parameters = ukraine_df.select(F.col("search_parameters.q").alias("search_parameter_q"))

In [None]:
search_parameters.show(10, False)

+------------------------------------+
|search_parameter_q                  |
+------------------------------------+
|top actors in Kyiv,Kyiv city,Ukraine|
+------------------------------------+



Yes, `search_parameters.q` matches the search query shown on the HTML page.

# Q6. Extract the list of 12 names listed as the first SERP result from the Ukraine .json
Where was this information hiding in the .json?

In [None]:
rich_snippet_list = ukraine_df.select("organic_results.rich_snippet_list")

In [None]:
rich_snippet_list.select(F.explode("rich_snippet_list").alias("list")).select(F.explode("list").alias("Names")).show(15, False)

+----------------+
|Names           |
+----------------+
|Ivanna Sakhno   |
|Natalie Burn    |
|Gene Stupnitsky |
|Ilia Volok      |
|Oleg Zagorodnii |
|Aleksey Gorbunov|
|Ana Layevska    |
|Larisa Polonsky |
|Anna Sten       |
|Vadim Perelman  |
|Anna Sedokova   |
|Alex Feldman    |
+----------------+



The information is in organic_results -> rich_snippet_list.
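Selecting a field through an array of structs yields an array of arrays, which is why two `explode` calls are chained above. The same two-level flattening can be sketched in plain Python. The `doc` dictionary below is a hypothetical miniature of the SERP JSON, not the real file.

```python
# Hypothetical miniature: only the first organic result carries a rich_snippet_list.
doc = {
    "organic_results": [
        {"position": 1, "rich_snippet_list": ["Ivanna Sakhno", "Natalie Burn", "Gene Stupnitsky"]},
        {"position": 2},
        {"position": 3},
    ]
}

# Step 1: one entry per organic result (None where the field is absent),
# comparable to select("organic_results.rich_snippet_list").
per_result = [result.get("rich_snippet_list") for result in doc["organic_results"]]

# Step 2: flatten the nested lists and skip the missing ones,
# comparable to the two chained explode() calls.
names = [name for names_list in per_result if names_list is not None for name in names_list]
print(names)
```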

# Q7. Extract a list of the 9 websites listed on the Ukraine .json
Sometimes Google (SERP API) does not return 10 related searches; it will give you fewer! To be clear, I would like all of the links from the 9 positions within the organic results. Names should include imdb.com, theguardian.com, euronews.com etc.

In [None]:
source_info_links = ukraine_df.select("organic_results.about_this_result.source.source_info_link")

In [None]:
source_info_links.select(F.explode("source_info_link").alias("links")).show(100, False)

+------------------------------------------------------------------------------------------------------------------------------------+
|links                                                                                                                               |
+------------------------------------------------------------------------------------------------------------------------------------+
|https://www.imdb.com/search/name/?birth_place=Kiev,+Ukraine                                                                         |
|https://www.theguardian.com/world/2023/jan/08/ukraine-how-zelenskiy-hollywood-man-of-the-hour                                       |
|https://www.euronews.com/video/2022/06/28/us-actor-penn-meets-zelenskyy-in-kyiv                                                     |
|https://nypost.com/2022/03/06/ukrainian-actor-who-played-soldier-takes-up-arms-vs-russians/                                         |
|https://www.newsweek.com/meet-zelensky-actor-ben-still

SERP API returned 8 links, which are listed above. This information is in `organic_results.about_this_result.source.source_info_link`.
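The question asks for site names such as imdb.com, while the column holds full URLs. One way to reduce a URL to its site name is sketched below with the standard library, using four of the links from the output above. The helper `site_name` is a made-up name. Inside Spark, the SQL function `parse_url(link, 'HOST')` could play a similar role, though that is not what the notebook does.

```python
from urllib.parse import urlparse

links = [
    "https://www.imdb.com/search/name/?birth_place=Kiev,+Ukraine",
    "https://www.theguardian.com/world/2023/jan/08/ukraine-how-zelenskiy-hollywood-man-of-the-hour",
    "https://www.euronews.com/video/2022/06/28/us-actor-penn-meets-zelenskyy-in-kyiv",
    "https://nypost.com/2022/03/06/ukrainian-actor-who-played-soldier-takes-up-arms-vs-russians/",
]


def site_name(url: str) -> str:
    """Return the host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc
    return host[4:] if host.startswith("www.") else host


sites = [site_name(url) for url in links]
print(sites)
```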

# 🔴 London

# Q8. Extract the list of famous actors from the first result on the London .json

```
Richard Foreman. Christian Bale. ...
Nick Briggs. Sean Bean. ...
Kate Beckinsale. ...
Dirk Bogarde. ...
Michael Caine. ...
John Cleese. ...
Sacha Baron Cohen. ..
```

This one is a bit tough to extract because it is 'hiding' in an answer box (notice how this first search result is prominent and specially formatted by Google.) It is NOT the first result (from Position 1) of the organic results.

While it is easy for you to just retrieve the answer (once you find it in the JSON), why not practice your PySpark and also clean up the information - I see 9 names in the answer box, please make me a table with one column called actors and 9 rows, one for each actor. **Hint:** You can use the `.` as a delimiter and replace all `...` with nothing...
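The hint can be tried out on plain strings before writing the PySpark version. This is a small sketch using three of the entries shown above. It removes the `...` placeholder first so that its dots are not treated as delimiters, then splits on the remaining periods.

```python
# Sample entries copied from the answer box preview above.
answer_list = [
    "Richard Foreman. Christian Bale. ...",
    "Nick Briggs. Sean Bean. ...",
    "Kate Beckinsale. ...",
]

actors = []
for entry in answer_list:
    # Drop the "..." placeholder before splitting on ".".
    entry = entry.replace("...", "")
    # Split on the remaining periods; keep only non-empty, trimmed pieces.
    actors.extend(part.strip() for part in entry.split(".") if part.strip())

print(actors)
```

Filtering out the empty pieces matters: splitting `"Kate Beckinsale. "` on `.` leaves a blank trailing fragment that would otherwise become an extra row.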

In [None]:
london_df = spark.read.json("London.json")
london_df.printSchema()

root
 |-- answer_box: struct (nullable = true)
 |    |-- about_this_result: struct (nullable = true)
 |    |    |-- source: struct (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- icon: string (nullable = true)
 |    |    |    |-- security: string (nullable = true)
 |    |    |    |-- source_info_link: string (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- displayed_link: string (nullable = true)
 |    |-- images: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- link: string (nullable = true)
 |    |-- list: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- snippet: string (nullable = true)
 |    |-- snippet_highlighted_words: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- title: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- organic_results: array (nullable = true)
 |    |-- elemen

In [None]:
list_actors = london_df.select("answer_box.list")
# Build the cleaned DataFrame first; ending the chain with .show() would assign None.
list_actors_cleaned = list_actors.select(F.explode("list").alias("Names"))\
                                  .withColumn("Names", F.regexp_replace("Names", r"\. \.\.\.", ""))\
                                  .withColumn("Names", F.split("Names", r"\."))\
                                  .select(F.explode("Names").alias("Names"))\
                                  .select(F.ltrim("Names"))
list_actors_cleaned.show(9, False)

+------------------+
|ltrim(Names)      |
+------------------+
|Richard Foreman   |
|Christian Bale    |
|Nick Briggs       |
|Sean Bean         |
|Kate Beckinsale   |
|Dirk Bogarde      |
|Michael Caine     |
|John Cleese       |
|Sacha Baron Cohen |
+------------------+
only showing top 9 rows



The result above shows the 9 actors from the famous-actors list for London.

# Q9. Examine 'Orang juga bertanya'/'People Also Ask' in the London .json
For some reason, this London search result came from the Indonesian Google search engine. 'Orang juga bertanya' is Indonesian for 'People Also Ask'. 

Click through the HTML and find out 'Who is No 1 actor in the world?', then use code to extract the answer from the .json. Print the text of the answer to get full credit.
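The lookup amounts to finding the one entry whose `question` matches and reading its `snippet`. Here is a plain-Python sketch of that idea. The list below is a hypothetical stand-in with placeholder snippet text, not the contents of the real file.

```python
# Hypothetical stand-in for the related_questions array; snippets are placeholders.
related_questions = [
    {"question": "A different question?", "snippet": "<another answer>"},
    {"question": "Who is No 1 actor in the world?", "snippet": "<answer text from the JSON>"},
]

target = "Who is No 1 actor in the world?"

# Take the snippet of the first matching entry, or None if nothing matches.
answer = next((q["snippet"] for q in related_questions if q["question"] == target), None)
print(answer)
```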

In [None]:
related_questions = london_df.select("related_questions")
related_questions.printSchema()
related_questions_ = related_questions.select(F.explode(related_questions.related_questions).alias("struct_cols"))
question = related_questions_.filter(related_questions_.struct_cols.question == "Who is No 1 actor in the world?")


root
 |-- related_questions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- displayed_link: string (nullable = true)
 |    |    |-- link: string (nullable = true)
 |    |    |-- list: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- next_page_token: string (nullable = true)
 |    |    |-- question: string (nullable = true)
 |    |    |-- serpapi_link: string (nullable = true)
 |    |    |-- snippet: string (nullable = true)
 |    |    |-- thumbnail: string (nullable = true)
 |    |    |-- title: string (nullable = true)



In [None]:
answer = question.select(F.col("struct_cols.snippet").alias("Answer"))
answer.show(1, False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Answer                                                                                                                                                                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|American veteran actor Dwayne Johnson is the most popular actor in the worl

# Q10. Extract a list of the 10 websites listed on the London .json
Similar to what you did for Ukraine. These come from the organic results. Should include names like timeout.com, imdb.com, etc.

In [None]:
source_info_links = london_df.select("organic_results.about_this_result.source.source_info_link")

In [None]:
source_info_links.select(F.explode("source_info_link").alias("links")).show(100, False)

+------------------------------------------------------------------------------------------------------+
|links                                                                                                 |
+------------------------------------------------------------------------------------------------------+
|https://www.imdb.com/search/name/?birth_place=London,+England,+UK                                     |
|https://www.mrdustbin.com/us/famous-british-actors/                                                   |
|https://www.glamourmagazine.co.uk/gallery/young-hot-british-actors                                    |
|https://www.entoin.com/entertainment/british-actors                                                   |
|https://londranews.com/english/london-born-actors-many-of-very-different-origins/                     |
|https://www.youtube.com/watch?v=9rduOCd8FWU                                                           |
|https://www.youtube.com/watch?v=igyp8sxZI_E           

The above result shows the 8 links in search results.

# 🔴 Q11. Comments
Make three good bullets that describe what you learned in this assignment. Talk about how even JSON files that are semi-similar can still be difficult to work with if they are heterogeneous!

* JSON files with semi-structured data types are hard to flatten and convert to a normal table of rows and columns.
* To work with JSON files, it is almost always necessary to preview the data before running queries and analysis.
* Large datasets with uneven structures can be handled easily in JSON format and are easily transferred between databases.
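One quick way to preview heterogeneity is to compare the top-level keys of the two documents. The sketch below uses reduced stand-ins that contain only a few of the fields seen in the schemas above (`answer_box` and `related_questions` appear only for London, `inline_images` only for Ukraine). The real files have more keys.

```python
# Reduced stand-ins for the two parsed JSON documents (values omitted).
ukraine_doc = {"search_parameters": {}, "organic_results": [], "inline_images": []}
london_doc = {"search_parameters": {}, "organic_results": [], "answer_box": {}, "related_questions": []}

ukraine_keys = set(ukraine_doc)
london_keys = set(london_doc)

shared = sorted(ukraine_keys & london_keys)        # fields safe to query on both
only_ukraine = sorted(ukraine_keys - london_keys)  # fields missing from London
only_london = sorted(london_keys - ukraine_keys)   # fields missing from Ukraine

print("shared:", shared)
print("only Ukraine:", only_ukraine)
print("only London:", only_london)
```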

# 🔵 Extra Credit (5 pts)
This one is tough because Ukraine only has 9 links and London has 10 links!

Read both .json files at once using a wildcard, and make a dataframe with two rows (one for Ukraine and one for London) and 21 columns. The first column has the search query, and the next 10 columns are the (up to) 10 websites and the 10 columns after that are the 10 snippets. Good luck!

Max grade is 100 points on this assignment, if you still have a 105 score, it will be recoded as 100.

In [None]:
! mkdir JSONdata
! mv London.json Ukraine.json  "/content/JSONdata/"


mkdir: cannot create directory ‘JSONdata’: File exists


In [None]:
df = spark.read.json("/content/JSONdata/*")

In [None]:
df.printSchema()

root
 |-- answer_box: struct (nullable = true)
 |    |-- about_this_result: struct (nullable = true)
 |    |    |-- source: struct (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- icon: string (nullable = true)
 |    |    |    |-- security: string (nullable = true)
 |    |    |    |-- source_info_link: string (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- displayed_link: string (nullable = true)
 |    |-- images: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- link: string (nullable = true)
 |    |-- list: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- snippet: string (nullable = true)
 |    |-- snippet_highlighted_words: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- title: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- inline_images: array (nullable = true)
 |    |-- element:

In [None]:
# Extracting Search Parameters from loaded JSON DF
search_parameters = df.select("search_parameters.q")
search_parameters.show(10, False)

+---------------------------------------------------+
|q                                                  |
+---------------------------------------------------+
|top actors in Kyiv,Kyiv city,Ukraine               |
|top actors in Greater London,England,United Kingdom|
+---------------------------------------------------+



In [None]:
#Extracting links from loaded JSON DF
links = df.select("organic_results.link")
links.show(10, False)


+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|link                                                                                                                                                                                                                                        

In [None]:
# Define window function to add row number column
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# No partitionBy: all rows are ranked in a single partition, which is fine for two rows
windowSpec  = Window.orderBy("link")

In [None]:
#Add row number column to links
links = links.withColumn("rownum", row_number().over(windowSpec))

In [None]:
links.show(2, False)
#Here first row shows Ukraine links and second row shows London links

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|link                                                                                                                                                                                                                                 

In [None]:
#Selecting Ukraine Search Links
uk_links =links.where(F.col("rownum") == 1).select(F.explode("Link"))
uk_links = uk_links.toPandas()
uk_links = uk_links.transpose()
uk_links.head()

Unnamed: 0,0,1,2,3,4,5,6,7
col,https://www.imdb.com/search/name/?birth_place=...,https://www.theguardian.com/world/2023/jan/08/...,https://www.euronews.com/video/2022/06/28/us-a...,https://nypost.com/2022/03/06/ukrainian-actor-...,https://www.newsweek.com/meet-zelensky-actor-b...,https://www.tribuneindia.com/news/nation/three...,https://www.tripadvisor.com/Attractions-g29447...,https://www.latimes.com/entertainment-arts/bus...


In [None]:
# Selecting London Search Links
ld_links =links.where(F.col("rownum") == 2).select(F.explode("Link"))
ld_links = ld_links.toPandas()
ld_links = ld_links.transpose()
ld_links.head()

Unnamed: 0,0,1,2,3,4,5,6,7
col,https://www.imdb.com/search/name/?birth_place=...,https://www.mrdustbin.com/us/famous-british-ac...,https://www.glamourmagazine.co.uk/gallery/youn...,https://www.entoin.com/entertainment/british-a...,https://londranews.com/english/london-born-act...,https://www.youtube.com/watch?v=9rduOCd8FWU,https://www.youtube.com/watch?v=igyp8sxZI_E,https://www.thegentlemansjournal.com/article/t...


In [None]:
# Stack the two single-row pandas DataFrames into one table
import pandas as pd
total_links = pd.concat([uk_links, ld_links])
total_links.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, col to col
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       2 non-null      object
 1   1       2 non-null      object
 2   2       2 non-null      object
 3   3       2 non-null      object
 4   4       2 non-null      object
 5   5       2 non-null      object
 6   6       2 non-null      object
 7   7       2 non-null      object
dtypes: object(8)
memory usage: 144.0+ bytes


In [None]:
total_links.head()

Unnamed: 0,0,1,2,3,4,5,6,7
col,https://www.imdb.com/search/name/?birth_place=...,https://www.theguardian.com/world/2023/jan/08/...,https://www.euronews.com/video/2022/06/28/us-a...,https://nypost.com/2022/03/06/ukrainian-actor-...,https://www.newsweek.com/meet-zelensky-actor-b...,https://www.tribuneindia.com/news/nation/three...,https://www.tripadvisor.com/Attractions-g29447...,https://www.latimes.com/entertainment-arts/bus...
col,https://www.imdb.com/search/name/?birth_place=...,https://www.mrdustbin.com/us/famous-british-ac...,https://www.glamourmagazine.co.uk/gallery/youn...,https://www.entoin.com/entertainment/british-a...,https://londranews.com/english/london-born-act...,https://www.youtube.com/watch?v=9rduOCd8FWU,https://www.youtube.com/watch?v=igyp8sxZI_E,https://www.thegentlemansjournal.com/article/t...


In [None]:
#convert search_parameters to pandas DF
search_parameters = search_parameters.toPandas()

In [None]:
search_parameters.head()

# reset index to concatenate
total_links = total_links.reset_index(drop= True)
search_parameters = search_parameters.reset_index(drop = True)

In [None]:
#concat links and search parameter
total_links = pd.concat([search_parameters, total_links], axis = 1)
total_links = total_links.rename(columns = {"q":"Search Query"})
total_links.head()

Unnamed: 0,Search Query,0,1,2,3,4,5,6,7
0,"top actors in Kyiv,Kyiv city,Ukraine",https://www.imdb.com/search/name/?birth_place=...,https://www.theguardian.com/world/2023/jan/08/...,https://www.euronews.com/video/2022/06/28/us-a...,https://nypost.com/2022/03/06/ukrainian-actor-...,https://www.newsweek.com/meet-zelensky-actor-b...,https://www.tribuneindia.com/news/nation/three...,https://www.tripadvisor.com/Attractions-g29447...,https://www.latimes.com/entertainment-arts/bus...
1,"top actors in Greater London,England,United Ki...",https://www.imdb.com/search/name/?birth_place=...,https://www.mrdustbin.com/us/famous-british-ac...,https://www.glamourmagazine.co.uk/gallery/youn...,https://www.entoin.com/entertainment/british-a...,https://londranews.com/english/london-born-act...,https://www.youtube.com/watch?v=9rduOCd8FWU,https://www.youtube.com/watch?v=igyp8sxZI_E,https://www.thegentlemansjournal.com/article/t...


I did not run into the inconsistent-schema problem for the London and Ukraine link tables because SERP API returned only 8 links in each JSON file.
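If the two files had returned different numbers of links, the rows could still be aligned by padding each list to a fixed width of 10. The sketch below shows one possible approach in plain Python. The helper names `pad` and `build_row`, the column names, and the example.com inputs are all hypothetical.

```python
MAX_RESULTS = 10


def pad(values, size=MAX_RESULTS):
    """Truncate or right-pad a list with None so it has exactly `size` items."""
    return (list(values) + [None] * size)[:size]


def build_row(query, links, snippets):
    """Return one flat record: the query, 10 link columns, then 10 snippet columns."""
    row = {"search_query": query}
    for i, link in enumerate(pad(links), start=1):
        row[f"link_{i}"] = link
    for i, snippet in enumerate(pad(snippets), start=1):
        row[f"snippet_{i}"] = snippet
    return row


# Hypothetical inputs with different lengths (9 vs 10), as the task anticipates.
row_a = build_row(
    "query A",
    [f"https://example.com/a{i}" for i in range(9)],
    [f"snippet a{i}" for i in range(9)],
)
row_b = build_row(
    "query B",
    [f"https://example.com/b{i}" for i in range(10)],
    [f"snippet b{i}" for i in range(10)],
)

print(len(row_a), len(row_b))
print(row_a["link_10"], row_b["link_10"])
```

Because both records share the same 21 keys in the same order, they can be stacked into a two-row table without any schema mismatch.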