In [None]:
import os
import json
import requests
import pandas as pd
import numpy as np
from IPython.display import HTML, Image, display

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)


# Lecture 11 - Parsing HTML

## DSC 80, Fall 2022

## Today, in DSC 80...

- We can *scrape* data from the internet.
- What we get back is a mess of HTML.
- How do we extract information from this HTML?

## Recall: APIs vs. Scraping

### Programmatic requests

* We learned how to use the Python `requests` package to exchange data via HTTP.
    - `GET` requests are used to request data **from** a server.
    - `POST` requests are used to **send** data to a server. 
* There are two ways of collecting data via requests:
    * By using a published API (application programming interface).
    * By scraping a webpage to collect its HTML source code.

### APIs

* An API is a service that makes data directly available to the user in a convenient fashion.

* Advantages:
    - The data are usually clean, up-to-date, and ready to use.
    - The presence of an API signals that the data provider is okay with you using their data.
    - The data provider can plan and regulate data usage.
        - Some APIs require you to create an API "key", which is like an account for using the API.
        - APIs can also give you access to data that isn't publicly available on a webpage.

* Disadvantages:
    - APIs don't always exist for the data you want!

### Scraping

* Scraping is the act of programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

* Advantages:
    * You can always do it!
        - e.g. Google scrapes webpages in order to make them searchable.

* Disadvantages:
    - It is often difficult to parse and clean scraped data.
        - Source code often includes a lot of content unrelated to the data you're trying to find (e.g. formatting, advertisements, other text).
    - Websites can change often, so scraping code can get outdated quickly.
    - Websites may not want you to scrape their data!

- In general, we prefer APIs.

### Accessing HTML

Let's make a `GET` request to the HDSI Faculty page and see what the resulting HTML looks like. 

In [None]:
url = 'https://datascience.ucsd.edu/about/faculty/faculty/'
r = requests.get(url)
r

In [None]:
urlText = r.text
len(urlText)

In [None]:
print(urlText[:1000])

Wow, that is gross looking! 😰 

- It is **raw** HTML, which web browsers use to display websites.
- The information we are looking for – faculty information – is in there somewhere, but we have to search for it and extract it, which we wouldn't have to do if we had an API.

## The anatomy of HTML documents

### What is HTML?

* HTML (HyperText Markup Language) is **the** basic building block of the web.
* It defines the content and layout of a webpage, and as such, it is what you get back when you scrape a webpage.
* See [this tutorial](http://fab.academany.org/2018/labs/fablaboshanghai/students/bob-wu/Fabclass/week2_project_management/HTML.html) for more details.

In [None]:
!cat data/lec15_ex1.html

### The anatomy of HTML documents

* **HTML document**: The totality of markup that makes up a webpage.

* **Document Object Model (DOM)**: The internal representation of an HTML document as a hierarchical **tree** structure.

* **HTML element**: An object in the DOM, such as a paragraph, header, or title.
* **HTML tags**: Markers that denote the **start** and **end** of an element, such as `<p>` and `</p>`.

<center><img src='imgs/dom.jpg'></center>

<center><a href='https://simplesnippets.tech/what-is-document-object-modeldom-how-js-interacts-with-dom/'>(source)</a></center>

### Useful tags to know


|Element|Description|
|:---|:---|
|`<html>`|the document|
|`<head>`|the head (metadata such as the page title)|
|`<body>`|the body|
|`<div>` |a logical division of the document|
|`<span>`|an *in-line* logical division|
|`<p>`|a paragraph|
| `<a>`| an anchor (hyper-link)|
|`<h1>, <h2>, ...`| heading(s) |
|`<img>`| an image |

There are many, many more. See [this article](https://en.wikipedia.org/wiki/HTML_element) for examples.

### Example: images and hyperlinks

Tags can have **attributes**, which further specify how to display information on a webpage.

For instance, `<img>` tags have `src` and `alt` attributes (among others):

```html
<img src="billy-selfie.png" alt="A photograph of Billy." width=500>
```

Hyperlinks have `href` attributes: 

```html
Click <a href="https://dsc80.com/project3">this link</a> to access Project 3.
```
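
To make the idea of attributes concrete, here is a minimal sketch that reads them programmatically. It assumes only Python's built-in `html.parser` module; the class name `AttributeCollector` is illustrative and not part of the lecture.

```python
from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    """Collects (tag, attributes) pairs for every opening tag it sees."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; a dict is easier to read.
        self.seen.append((tag, dict(attrs)))

snippet = '''
<img src="billy-selfie.png" alt="A photograph of Billy." width=500>
Click <a href="https://dsc80.com/project3">this link</a> to access Project 3.
'''

collector = AttributeCollector()
collector.feed(snippet)
for tag, attrs in collector.seen:
    print(tag, attrs)
```

Note that every attribute value comes back as a string, including the unquoted `width`.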

What do you think this webpage looks like?

In [None]:
!cat data/lec15_ex2.html

### The `<div>` tag

```html
<div style="background-color:lightblue">
  <h3>This is a heading</h3>
  <p>This is a paragraph.</p>
</div>
```

* The `<div>` tag defines a division or a "section" of an HTML document.
    * Think of a `<div>` as a "cell" in a Jupyter Notebook.

* The `<div>` element is often used as a container for other HTML elements to style them with CSS or to perform operations involving them using JavaScript.

* `<div>` elements often have attributes, which are important when scraping!

### Document trees

In [None]:
!cat data/lec15_ex1.html

Under the document object model (DOM), HTML documents are trees. In DOM trees, child nodes are **ordered**.

<center>

<img src="imgs/webpage_anatomy.png" width="50%">

</center>    

What does the DOM tree look like for this document?

<center><img src="imgs/dom_tree.png" width="50%"></center>
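
One way to see the tree structure without a picture is to print each element indented by its depth. The sketch below assumes a small hypothetical document (not the contents of the file shown above) in which every element has an explicit closing tag, and uses only the built-in `html.parser` module; `OutlineBuilder` is an illustrative name.

```python
from html.parser import HTMLParser

class OutlineBuilder(HTMLParser):
    """Records each element's tag name together with its depth in the tree."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # everything until the matching end tag is a descendant

    def handle_endtag(self, tag):
        self.depth -= 1

# Hypothetical document; every element here is explicitly closed.
doc = ('<html><head><title>Page title</title></head>'
       '<body><h1>Heading</h1><p>Text</p></body></html>')

builder = OutlineBuilder()
builder.feed(doc)
for depth, tag in builder.outline:
    print('  ' * depth + tag)
```

Children appear in the same order as in the source, which is what "ordered" means for DOM trees.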

### Example: Quote scraping

Consider the following webpage.

<center><img src="imgs/quotes2scrape.png" width=60%></center>

- What do you think the DOM tree looks like?
- If you had to store the data on this page in a DataFrame, what would the rows and columns represent?

<center><img src="imgs/quote_dom.png" width="50%"></center>

## Parsing HTML via Beautiful Soup

### Beautiful Soup 🍜

* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python HTML parser.
    - To "parse" means to "extract meaning from a sequence of symbols".
* **Warning:** Beautiful Soup 4 and Beautiful Soup 3 work differently, so make sure you are using and looking at documentation for Beautiful Soup 4.

### Example HTML document

To start, let's instantiate a `BeautifulSoup` object, using the source code for an HTML page with the DOM tree shown below:

<center><img src="imgs/dom_tree_1.png" width="60%"></center>

The string `html_string` contains an HTML "document".

In [None]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>
'''.strip()

Using the `HTML` function in the `IPython.display` module, we can render an HTML document from within our Jupyter Notebook:

In [None]:
HTML(html_string)

### `BeautifulSoup` objects

`bs4.BeautifulSoup` takes in a string or file-like object representing HTML (`markup`) and returns a **parsed** document.

In [None]:
import bs4

In [None]:
bs4.BeautifulSoup?

Normally, we pass the text of the response to a `GET` request (e.g. `r.text`) to `bs4.BeautifulSoup`, but here we will pass our hand-crafted `html_string`.

In [None]:
soup = bs4.BeautifulSoup(html_string)
soup

In [None]:
type(soup)

`BeautifulSoup` objects have several useful attributes, e.g. `text`:

In [None]:
print(soup.text)
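
When `bs4.BeautifulSoup` is called with only the markup, it picks whichever parser happens to be installed and may emit a warning about it. A minimal sketch, assuming we want the same behavior on every machine, names Python's built-in `'html.parser'` explicitly as the second argument. The names `small_html` and `small_soup` are illustrative.

```python
import bs4

small_html = '<html><body><div id="content"><p>My First paragraph</p></div></body></html>'

# The second argument selects the parser; 'html.parser' ships with Python.
small_soup = bs4.BeautifulSoup(small_html, 'html.parser')

print(type(small_soup).__name__)
print(small_soup.text)
```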

### Child nodes

- Recall, HTML documents are represented as trees.
    - Each page element becomes a node in this tree.
- A `BeautifulSoup` object represents a **node** in the tree.
    - Each `BeautifulSoup` object has 0 or more child nodes.
    - To access the children of a node, use the `children` attribute.

In [None]:
soup

In [None]:
soup.children

### Aside: iterators

On the previous slide, we saw that `soup.children` isn't another `BeautifulSoup` object, but rather something of the form `<list_iterator at 0x7f7b0ab8c370>`.

What are [iterators](https://www.w3schools.com/python/python_iterators.asp), again?

In [None]:
nums = [1, 2, 3, 4]
double = map(lambda x: x * 2, nums)
double

In [None]:
next(double)

In [None]:
list(double)
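
A property worth keeping in mind is that an iterator can only be walked through once. The sketch below (with illustrative names `values` and `doubled`) shows that each call consumes elements, so a second pass comes back empty.

```python
values = [1, 2, 3, 4]
doubled = map(lambda x: x * 2, values)

first = next(doubled)       # consumes the first element
rest = list(doubled)        # consumes whatever is left
leftover = list(doubled)    # the iterator is now exhausted

print(first, rest, leftover)
```

If the elements are needed more than once, convert the iterator to a list right away.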

### Child nodes

The `children` attribute returns an iterator rather than a list, so child nodes are produced one at a time as we loop over them. Wrap it in `list` to see them all at once.

In [None]:
soup

In [None]:
soup.children

In [None]:
len(list(soup.children))

In [None]:
root = next(soup.children)
root

In [None]:
list(root.children)

In [None]:
list(list(root.children)[1].children)

In [None]:
list(list(list(root.children)[1].children)[3].children)
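
The odd-looking indices above are likely a result of whitespace: the newlines between tags are text nodes, and they count as children too. A minimal sketch, assuming the built-in `'html.parser'` and a tiny hypothetical list, keeps only the children that are actual tags.

```python
import bs4

tiny = bs4.BeautifulSoup('<ul>\n<li>item 1</li>\n<li>item 2</li>\n</ul>', 'html.parser')
ul = tiny.find('ul')

all_children = list(ul.children)
# bs4.Tag is the class for elements; whitespace strings are not Tags.
tag_children = [c for c in ul.children if isinstance(c, bs4.Tag)]

print(len(all_children), len(tag_children))
print([c.name for c in tag_children])
```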

### Depth-first traversal through `descendants`

- While we could use the `children` attribute to navigate to any node in a `BeautifulSoup` tree, there are easier ways of navigating the tree.

- The `descendants` attribute traverses a `BeautifulSoup` tree using **depth-first traversal**.
    - Why depth-first? Elements closer to one another on a page are more likely to be related than elements further away.
    - Question: What type of depth-first traversal does this use – preorder, inorder, or postorder traversal?

<center><img src="imgs/dom_tree_1.png" width="60%"></center>

In [None]:
for child in soup.descendants:
    # print(child) # What would happen if we ran this instead?
    if isinstance(child, str):
        continue
    print(child.name)
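
For intuition about the visiting order, here is a pure-Python sketch of a traversal that yields each parent before any of its children, which is the order in which `descendants` walks the document. The nested-tuple `tree` is a hand-written copy of the tags in `html_string`, and `preorder` is an illustrative function, not part of Beautiful Soup.

```python
# Each node is (tag name, list of child nodes); text nodes are left out.
tree = ('html', [
    ('body', [
        ('div', [('h1', []), ('p', []), ('p', [('em', [])]), ('hr', [])]),
        ('div', [('ul', [('li', []), ('li', []), ('li', [])])]),
    ]),
])

def preorder(node):
    """Yields a node's name first, then recurses into its children in order."""
    name, children = node
    yield name
    for child in children:
        yield from preorder(child)

order = list(preorder(tree))
print(order)
```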

### Finding elements in a tree

Practically speaking, you will not use the `children` or `descendants` attributes directly very often. Instead, you will use the following methods:

- `soup.find(tag)`, which finds the **first** instance of a tag (the first one on the page, i.e. the first one that DFS sees).
    - More general: `soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)`.
- `soup.find_all(tag)` will find **all** instances of a tag.
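
The difference between the two methods matters most when nothing matches. A minimal sketch on a hypothetical two-`<div>` document (parsed with the built-in `'html.parser'`):

```python
import bs4

demo = bs4.BeautifulSoup(
    '<div id="a"><p>one</p></div><div id="b"><p>two</p></div>',
    'html.parser',
)

first_p = demo.find('p')          # the first match only
all_p = demo.find_all('p')        # every match, in document order
missing = demo.find('table')      # no match: find returns None
none_found = demo.find_all('table')  # no match: find_all returns an empty list

print(first_p.text, [p.text for p in all_p], missing, len(none_found))
```

Because `find` can return `None`, chaining `.text` directly onto it raises an `AttributeError` when the tag is absent.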


### Using `find`

Let's try and extract the first `<div>` subtree.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>  

In [None]:
soup

In [None]:
div = soup.find('div')
div

<center><img src="imgs/dom_subtree_1.png" width="30%"></center>  

Let's try and find the `<div>` element that has an `id` attribute equal to `'nav'`.

In [None]:
soup.find('div', attrs={'id': 'nav'})

`find` will return the first occurrence of a tag, regardless of what depth it is in the tree.

In [None]:
soup.find('ul')

In [None]:
soup.find('li')

### Using `find_all`

`find_all` returns a list of all matches.

In [None]:
soup.find_all('div')

In [None]:
soup.find_all('li')

In [None]:
[x.text for x in soup.find_all('li')]

`text` is a node attribute.

### Node attributes
* The `text` attribute of a tag element gets the text between the opening and closing tags.
* The `attrs` attribute lists all attributes of a tag.
* The `get(key)` method gets the value of a tag attribute.

In [None]:
soup.find('p')

In [None]:
soup.find('p').text

In [None]:
soup.find('div')

In [None]:
soup.find('div').attrs

In [None]:
soup.find('div').get('id')

You can access tags using attribute notation, too.

In [None]:
soup

In [None]:
soup.html.div.h1

In [None]:
soup.html.div.h1.text

In [None]:
soup.html.div.next_sibling.next_sibling.attrs
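
`next_sibling` is needed twice above because the first sibling of a tag is usually the whitespace between it and the next tag. A sketch of an alternative, assuming a small hypothetical document, uses the `find_next_sibling` method to skip straight to the next matching tag.

```python
import bs4

page = bs4.BeautifulSoup(
    '<body>\n<div id="content"></div>\n<div id="nav"></div>\n</body>',
    'html.parser',
)
first_div = page.find('div')

whitespace = first_div.next_sibling                  # the newline between the two divs
by_hand = first_div.next_sibling.next_sibling.attrs  # hop over the newline manually
by_name = first_div.find_next_sibling('div').attrs   # let Beautiful Soup skip it

print(repr(whitespace), by_hand, by_name)
```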

## Example: Scraping the HDSI Faculty page

### Example

Let's try and extract a list of HDSI Faculty from https://datascience.ucsd.edu/about/faculty/faculty/.

A good first step is to use the "inspect element" tool in our web browser.

In [None]:
fac_response = requests.get('https://datascience.ucsd.edu/about/faculty/faculty/')
fac_response

In [None]:
soup = bs4.BeautifulSoup(fac_response.text)

It seems like the relevant `<div>`s for faculty are the ones where the `data-entry-type` attribute is equal to `'individual'`. Let's find all of those.

In [None]:
divs = soup.find_all('div', attrs={'data-entry-type': 'individual'})

In [None]:
divs[0]

Within here, we need to extract each faculty member's name. It seems like names are stored in the `title` attribute within an `<a>` tag.

In [None]:
divs[0].find('a').get('title')

We can also extract job titles:

In [None]:
divs[0].find('h4').text

And bios:

In [None]:
divs[0].find('div', attrs={'class': 'cn-bio'}).text.strip()

Let's create a DataFrame consisting of names, titles, and bios for each faculty member.

In [None]:
names = [div.find('a').get('title') for div in divs]
names[:5]

In [None]:
titles = [div.find('h4').text if div.find('h4') else '' for div in divs]

In [None]:
bios = [div.find('div', attrs={'class': 'cn-bio'}).text.strip() for div in divs]
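
The guard used for titles (checking that `find` returned something before asking for `.text`) can be wrapped in a small helper and reused for every field. This is a sketch under stated assumptions: the markup below is hypothetical and only imitates the attribute names seen on the page, the people are made up, and `text_or_default` is an illustrative helper.

```python
import bs4

# Hypothetical markup: the second entry has no <h4> and no bio <div>.
sample = '''
<div data-entry-type="individual"><a title="Ada Example"></a><h4>Lecturer</h4><div class="cn-bio"> Teaches DSC 80. </div></div>
<div data-entry-type="individual"><a title="Bo Example"></a></div>
'''

def text_or_default(node, default=''):
    """Returns the stripped text of a node, or `default` when find() returned None."""
    return node.text.strip() if node is not None else default

entries = bs4.BeautifulSoup(sample, 'html.parser').find_all(
    'div', attrs={'data-entry-type': 'individual'}
)
rows = [
    {
        'name': d.find('a').get('title'),
        'title': text_or_default(d.find('h4')),
        'bio': text_or_default(d.find('div', attrs={'class': 'cn-bio'})),
    }
    for d in entries
]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`.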

In [None]:
faculty = pd.DataFrame().assign(name=names, title=titles, bio=bios)
faculty.head()

Now we have a DataFrame!

In [None]:
faculty[faculty['title'] == 'Lecturer']

What if we want to get faculty members' pictures? It seems like we should look at the attributes of an `<img>` tag.

In [None]:
divs[0].find('img')

In [None]:
def show_picture(name):
    idx = names.index(name)
    url = divs[idx].find('img').get('srcset')
    url = 'https://' + url.strip('/').strip(' 1x')
    display(Image(url))

In [None]:
# no longer works : (
# (the webpage has changed)
# this is a downside of scraping!
show_picture('Suraj Rampure')

In [None]:
display(Image(url))

## Example: Scraping quotes

### Example: Scraping quotes

Let's scrape quotes from https://quotes.toscrape.com/.

<center><img src="imgs/quotes2scrape.png" width=60%></center>

Specifically, let's try to make a DataFrame that looks like the one below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>quote</th>
      <th>author</th>
      <th>author_url</th>
      <th>tags</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>change,deep-thoughts,thinking,world</td>
    </tr>
    <tr>
      <th>1</th>
      <td>“It is our choices, Harry, that show what we truly are, far more than our abilities.”</td>
      <td>J.K. Rowling</td>
      <td>https://quotes.toscrape.com/author/J-K-Rowling</td>
      <td>abilities,choices</td>
    </tr>
    <tr>
      <th>2</th>
      <td>“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>inspirational,life,live,miracle,miracles</td>
    </tr>
  </tbody>
</table>

### The plan

Eventually, we will create a single function – `quote_df` – which takes in an integer `n` and returns a **DataFrame** with the quotes on the **first `n` pages** of https://quotes.toscrape.com/.

To do this, we will define several helper functions:
- `download_page(i)`, which downloads a **single page** (page `i`) and returns a `BeautifulSoup` object of the response.
- `process_quote(div)`, which takes in a `<div>` tree corresponding to a **single quote** and returns a Series containing all of the relevant information for that quote.
- `process_page(divs)`, which takes in a list of `<div>` trees corresponding to a **single page** and returns a DataFrame containing all of the relevant information for all quotes on that page.

Key principle: some of our helper functions will make **requests**, and others will **parse**, but none will do both! 
- Easier to debug and catch errors.
- Avoids unnecessary requests.
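
A minimal sketch of why this separation helps: if the parsing function only ever sees a string, it can be checked offline against a canned page. Everything here is illustrative: `fetch_page` takes the "getter" as an argument so a fake can stand in for the network, and `count_quotes` is a deliberately crude stand-in for real parsing.

```python
def fetch_page(i, get):
    """Request step: `get` is any callable that maps a URL to HTML text."""
    return get(f'https://quotes.toscrape.com/page/{i}')

def count_quotes(html):
    """Parse step: works on a plain string, so it never touches the network."""
    return html.count('class="quote"')

def fake_get(url):
    # Stand-in for the network; returns a canned page with two quotes.
    return '<div class="quote">a</div><div class="quote">b</div>'

page_html = fetch_page(1, fake_get)
n_quotes = count_quotes(page_html)
print(n_quotes)
```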

### Aside: f-strings in Python

- f-strings in Python provide a convenient way to format strings.
- To create an f-string, create a string with the character `f` **right before** the opening quote. Then, anything in the subsequent string that is inside `{curly brackets}` will be evaluated. 

In [2]:
f'2 + 3 = {2 + 3}'

'2 + 3 = 5'

In [3]:
def make_greeting(name):
    return f"Hi {name}! 👋 Your name has {len(name)} characters, the first of which is {name[0]}."

In [4]:
make_greeting('Billy')

'Hi Billy! 👋 Your name has 5 characters, the first of which is B.'

### Downloading a single page

In [5]:
def download_page(i):
    url = f'https://quotes.toscrape.com/page/{i}'
    request = requests.get(url)
    return bs4.BeautifulSoup(request.text)

In `quote_df`, we will call `download_page` repeatedly: once for each of `i=1`, `i=2`, ..., `i=n`. For now, we will work with just page 5 (chosen arbitrarily).

In [6]:
soup = download_page(5)

### Parsing a single page

Let's look at the page's source code (via "inspect element") to find where the quotes in the page are located.

In [7]:
divs = soup.find_all('div', attrs={'class': 'quote'})

In [8]:
divs[0]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”</span>
<span>by <small class="author" itemprop="author">George R.R. Martin</small>
<a href="/author/George-R-R-Martin">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="read,readers,reading,reading-books" itemprop="keywords"/>
<a class="tag" href="/tag/read/page/1/">read</a>
<a class="tag" href="/tag/readers/page/1/">readers</a>
<a class="tag" href="/tag/reading/page/1/">reading</a>
<a class="tag" href="/tag/reading-books/page/1/">reading-books</a>
</div>
</div>

From this `<div>`, we can extract the quote, author name, author's URL, and tags.

In [9]:
divs[0].find('span', attrs={'class': 'text'}).text

'“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”'

In [10]:
divs[0].find('small', attrs={'class': 'author'}).text

'George R.R. Martin'

In [11]:
divs[0].find('a').get('href')

'/author/George-R-R-Martin'

In [12]:
divs[0].find('meta', attrs={'class': 'keywords'}).get('content')

'read,readers,reading,reading-books'

Let's write an intermediate function, `process_quote`, which takes in a `<div>` corresponding to a single quote and returns a **Series** containing the quote's information.

Note that this approach is different from the approach taken in the HDSI Faculty page example – there, we created each column of our final DataFrame separately, while here we are creating one **row** of our final DataFrame at a time.

In [13]:
def process_quote(div):
    quote = div.find('span', attrs={'class': 'text'}).text
    author = div.find('small', attrs={'class': 'author'}).text
    author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')
    tags = div.find('meta', attrs={'class': 'keywords'}).get('content')
    
    return pd.Series({'quote': quote, 'author': author, 'author_url': author_url, 'tags': tags})

In [14]:
process_quote(divs[3])

quote         “If you can make a woman laugh, you can make h...
author                                           Marilyn Monroe
author_url    https://quotes.toscrape.com/author/Marilyn-Monroe
tags                                                 girls,love
dtype: object

Next, we can write a function that takes in a list of `<div>`s, calls the above function on each `<div>` in the list, and returns a **DataFrame**.

In [15]:
def process_page(divs):
    return pd.DataFrame([process_quote(div) for div in divs])

In [16]:
process_page(divs)

| |quote|author|author_url|tags|
|:---|:---|:---|:---|:---|
|0|“A reader lives a thousand lives before he die...|George R.R. Martin|https://quotes.toscrape.com/author/George-R-R-...|read,readers,reading,reading-books|
|1|“You can never get a cup of tea large enough o...|C.S. Lewis|https://quotes.toscrape.com/author/C-S-Lewis|books,inspirational,reading,tea|
|2|“You believe lies so you eventually learn to t...|Marilyn Monroe|https://quotes.toscrape.com/author/Marilyn-Monroe| |
|3|“If you can make a woman laugh, you can make h...|Marilyn Monroe|https://quotes.toscrape.com/author/Marilyn-Monroe|girls,love|
|4|“Life is like riding a bicycle. To keep your b...|Albert Einstein|https://quotes.toscrape.com/author/Albert-Eins...|life,simile|
|5|“The real lover is the man who can thrill you ...|Marilyn Monroe|https://quotes.toscrape.com/author/Marilyn-Monroe|love|
|6|“A wise girl kisses but doesn't love, listens ...|Marilyn Monroe|https://quotes.toscrape.com/author/Marilyn-Monroe|attributed-no-source|
|7|“Only in the darkness can you see the stars.”|Martin Luther King Jr.|https://quotes.toscrape.com/author/Martin-Luth...|hope,inspirational|
|8|“It matters not what someone is born, but what...|J.K. Rowling|https://quotes.toscrape.com/author/J-K-Rowling|dumbledore|
|9|“Love does not begin and end the way we seem t...|James Baldwin|https://quotes.toscrape.com/author/James-Baldwin|love|


### Putting it all together

In [17]:
def quote_df(n):
    '''Returns a DataFrame containing the quotes on the first n pages of https://quotes.toscrape.com/.'''
    dfs = []
    for i in range(1, n + 1):
        # Download page i and create a BeautifulSoup object
        soup = download_page(i)
        
        # Create DataFrame using the information in that page
        divs = soup.find_all('div', attrs={'class': 'quote'})
        df = process_page(divs)
        
        # Append DataFrame to dfs
        dfs.append(df)
        
    # Stitch all DataFrames together
    return pd.concat(dfs).reset_index(drop=True)

In [18]:
first_three_pages = quote_df(3)
first_three_pages.head()

| |quote|author|author_url|tags|
|:---|:---|:---|:---|:---|
|0|“The world as we have created it is a process ...|Albert Einstein|https://quotes.toscrape.com/author/Albert-Eins...|change,deep-thoughts,thinking,world|
|1|“It is our choices, Harry, that show what we t...|J.K. Rowling|https://quotes.toscrape.com/author/J-K-Rowling|abilities,choices|
|2|“There are only two ways to live your life. On...|Albert Einstein|https://quotes.toscrape.com/author/Albert-Eins...|inspirational,life,live,miracle,miracles|
|3|“The person, be it gentleman or lady, who has ...|Jane Austen|https://quotes.toscrape.com/author/Jane-Austen|aliteracy,books,classic,humor|
|4|“Imperfection is beauty, madness is genius and...|Marilyn Monroe|https://quotes.toscrape.com/author/Marilyn-Monroe|be-yourself,inspirational|


The elements in the `'tags'` column are all strings, but they look like lists. This is not ideal, as we will see shortly.
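
One way to recover real lists from these comma-separated strings is sketched below, assuming that no tag itself contains a comma. `split_tags` is a hypothetical helper; the special case is there because some quotes have no tags at all, and splitting an empty string does not give an empty list.

```python
def split_tags(tags):
    """Turns 'a,b' into ['a', 'b'] and '' into []."""
    return tags.split(',') if tags else []

with_tags = split_tags('read,readers,reading,reading-books')
without_tags = split_tags('')
naive = ''.split(',')   # the pitfall: a list holding one empty string

print(with_tags, without_tags, naive)
```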

### An extension

We could:
- Request information about each of the **authors** in the DataFrame.
    - See https://quotes.toscrape.com/author/Albert-Einstein/ for an example.
- Create a DataFrame of author information.
- Merge that DataFrame with `first_three_pages`.
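
The merge step of this plan could look like the sketch below. The two small DataFrames are hypothetical stand-ins (placeholder quotes and descriptions, not scraped data); the point is that author information is stored once per unique `author_url` and attached to every matching quote.

```python
import pandas as pd

quotes_small = pd.DataFrame({
    'quote': ['q1', 'q2', 'q3'],
    'author_url': ['https://quotes.toscrape.com/author/Albert-Einstein',
                   'https://quotes.toscrape.com/author/J-K-Rowling',
                   'https://quotes.toscrape.com/author/Albert-Einstein'],
})

# One row per unique author, so each author page would be requested only once.
authors_small = pd.DataFrame({
    'author_url': ['https://quotes.toscrape.com/author/Albert-Einstein',
                   'https://quotes.toscrape.com/author/J-K-Rowling'],
    'description': ['(Einstein bio placeholder)', '(Rowling bio placeholder)'],
})

# A left merge keeps every quote, in its original order.
merged = quotes_small.merge(authors_small, on='author_url', how='left')
print(merged[['quote', 'description']])
```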

In [19]:
np.unique(first_three_pages['author_url'])

array(['https://quotes.toscrape.com/author/Albert-Einstein',
       'https://quotes.toscrape.com/author/Allen-Saunders',
       'https://quotes.toscrape.com/author/Andre-Gide',
       'https://quotes.toscrape.com/author/Bob-Marley',
       'https://quotes.toscrape.com/author/Douglas-Adams',
       'https://quotes.toscrape.com/author/Dr-Seuss',
       'https://quotes.toscrape.com/author/Eleanor-Roosevelt',
       'https://quotes.toscrape.com/author/Elie-Wiesel',
       'https://quotes.toscrape.com/author/Friedrich-Nietzsche',
       'https://quotes.toscrape.com/author/Garrison-Keillor',
       'https://quotes.toscrape.com/author/J-K-Rowling',
       'https://quotes.toscrape.com/author/Jane-Austen',
       'https://quotes.toscrape.com/author/Jim-Henson',
       'https://quotes.toscrape.com/author/Marilyn-Monroe',
       'https://quotes.toscrape.com/author/Mark-Twain',
       'https://quotes.toscrape.com/author/Mother-Teresa',
       'https://quotes.toscrape.com/author/Pablo-Neruda',
    

In [20]:
einstein = bs4.BeautifulSoup(requests.get('https://quotes.toscrape.com/author/Albert-Einstein').text)

In [21]:
einstein.find('div', attrs={'class': 'author-description'}).text[:1000]

'\n        In 1879, Albert Einstein was born in Ulm, Germany. He completed his Ph.D. at the University of Zurich by 1909. His 1905 paper explaining the photoelectric effect, the basis of electronics, earned him the Nobel Prize in 1921. His first paper on Special Relativity Theory, also published in 1905, changed the world. After the rise of the Nazi party, Einstein made Princeton his permanent home, becoming a U.S. citizen in 1940. Einstein, a pacifist during World War I, stayed a firm proponent of social justice and responsibility. He chaired the Emergency Committee of Atomic Scientists, which organized to alert the public to the dangers of atomic warfare.At a symposium, he advised: "In their struggle for the ethical good, teachers of religion must have the stature to give up the doctrine of a personal God, that is, give up that source of fear and hope which in the past placed such vast power in the hands of priests. In their labors they will have to avail themselves of those forces w

### Key takeaways

* Make as few requests as possible.
* Create a request and parsing plan **beforehand**.
* Create your output schema **beforehand**.
* Make requests and parse in **separate functions**!
* See Lab 6, Question 2 for a related example.
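
One possible way to act on the first takeaway is to cache responses so that asking for the same URL twice costs only one request. This is a sketch, not the lecture's method: it uses `functools.lru_cache` from the standard library, and `slow_fetch` is a stand-in that records calls instead of touching the network.

```python
from functools import lru_cache

calls = []

def slow_fetch(url):
    # Stand-in for requests.get(url).text; records each "request" instead.
    calls.append(url)
    return f'<html>{url}</html>'

@lru_cache(maxsize=None)
def cached_fetch(url):
    """Only calls slow_fetch the first time a given URL is requested."""
    return slow_fetch(url)

for _ in range(3):
    cached_fetch('https://quotes.toscrape.com/page/1')

print(len(calls))
```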

## Nested vs. flat data formats

### Nested vs. flat data formats

- **Nested** data formats, like HTML, JSON, and XML, allow us to represent hierarchical relationships between variables.

* **Flat** (i.e. tabular) data formats, like CSV, do not.

<center><img src="imgs/hierarchy.png" width=40%></center>

### Example: Scraping quotes, again

- Suppose we obtained the quotes data via an API and saved it to the file `data/quotes2scrape.json`.
- `quotes2scrape.json` is a **JSON records** file; each line is a valid JSON object, **but the entire document is not**.
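
To see what "each line is valid, but the whole document is not" means, here is a minimal sketch on a two-line in-memory string. The key names mirror those in the file, but the values are made up for illustration.

```python
import io
import json

raw = '{"quote_auth": "A", "tags": ["x", "y"]}\n{"quote_auth": "B", "tags": []}\n'

# Parsing the whole string at once fails: two objects in a row are not one JSON value.
try:
    json.loads(raw)
    whole_ok = True
except json.JSONDecodeError:
    whole_ok = False

# Parsing line by line works, since each line is a complete object.
records = [json.loads(line) for line in io.StringIO(raw)]

print(whole_ok, len(records), records[0]['tags'])
```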

In [None]:
f = open(os.path.join('data', 'quotes2scrape.json'))

In [None]:
json.loads(f.readline())

Note that for a single quote, we have keys for `'auth_url'`, `'quote_auth'`, `'quote_text'`, `'bio'`, `'dob'`, and `'tags'`.

Since each line is a separate JSON object, let's read in each line one at a time.

In [None]:
L = [json.loads(x) for x in open(os.path.join('data', 'quotes2scrape.json'))]

Let's convert the result to a DataFrame.

In [None]:
df = pd.DataFrame(L)
df.head()

What data type is the `'tags'` column?

In [None]:
df['tags'].iloc[0]

Let's save `df` to a CSV and read it back in.

In [None]:
df.to_csv('out.csv', index=False)  # index=False avoids an extra 'Unnamed: 0' column when reading back

In [None]:
df_again = pd.read_csv('out.csv')
df_again.head()

What data type is the `'tags'` column now?

In [None]:
df_again['tags'].iloc[0]
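
If a CSV has already flattened lists into strings, one way to recover them is `ast.literal_eval`, which safely evaluates a string containing a Python literal. A sketch on a tiny hypothetical DataFrame, using an in-memory buffer in place of a file:

```python
import ast
import io
import pandas as pd

small = pd.DataFrame({'quote_auth': ['A', 'B'], 'tags': [['x', 'y'], []]})

# Round trip through CSV text, using a buffer instead of a file on disk.
buffer = io.StringIO()
small.to_csv(buffer, index=False)
buffer.seek(0)
small_again = pd.read_csv(buffer)

as_read = small_again['tags'].iloc[0]                    # a string that looks like a list
restored = small_again['tags'].apply(ast.literal_eval)   # actual lists again

print(type(as_read).__name__, repr(as_read))
print(restored.tolist())
```

This only works because the strings happen to be valid Python literals; a nested format like JSON avoids the problem in the first place.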

### One-hot encoding

- So that we don't have to deal with lists within Series, we can **flatten** lists of tags so that there is **one column per tag**.
    - For example, consider the tag `'inspirational'`.
    - If a quote has a 1 in the `'inspirational'` column, it **was** tagged `'inspirational'`.
    - If a quote has a 0 in the `'inspirational'` column, it **was not** tagged `'inspirational'`.
- This process – of converting categorical variables into columns of 1s and 0s – is called **one-hot encoding**. We will revisit it in a few weeks.
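
Before doing this with pandas, it may help to see the idea in plain Python. The sketch below uses three hypothetical quotes' tag lists and builds one row of 1s and 0s per quote, with one position per distinct tag.

```python
quotes_tags = [['inspirational', 'life'], ['love'], []]

# One column per distinct tag, in sorted order.
distinct = sorted({tag for tags in quotes_tags for tag in tags})

# For each quote, 1 if it has the tag in that column and 0 otherwise.
rows = [[int(tag in tags) for tag in distinct] for tags in quotes_tags]

print(distinct)
for row in rows:
    print(row)
```

A quote with no tags becomes a row of all 0s.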

In [None]:
distinct_tags = np.unique(df['tags'].sum())
distinct_tags

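Let's write a function that takes in the list of tags (`taglist`) for a given quote and returns a Series with a 1 for each tag the quote has. When we `apply` it to every quote, tags that a quote lacks show up as missing values, which we fill with 0 to get the one-hot-encoded rows.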

In [None]:
def flatten_tags(taglist):
    return pd.Series({k:1 for k in taglist}, dtype=float)

tags = df['tags'].apply(flatten_tags).fillna(0).astype(int)
tags.head()

Let's combine this one-hot-encoded DataFrame with `df`.

In [None]:
df_full = pd.concat([df, tags], axis=1).drop(columns='tags')
df_full.head()

If we want all quotes tagged `'inspirational'`, we can simply query:

In [None]:
df_full[df_full['inspirational'] == 1].head()

Note that this DataFrame representation of the response JSON takes up much more space than the original JSON. Why is that?