# Data Analysis with JSON Files

In [None]:
import json
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from collections import defaultdict

A good data scientist can work with data in many different forms, and a great deal of data online is stored as JSON. Let's see how to conduct a descriptive analysis of data delivered in that format.

## Loading the Data

In [None]:
with open('plants.json', 'r') as f:
    plants = json.load(f)

This dataset was gathered with the [Trefle API](https://trefle.io). It contains information about 1000 plants in JSON form.

In [None]:
len(plants)

## What Information Do We Have?

Let's take a look at the first plant in our list to see what information is available.

In [None]:
plants[0]

You can find more information about these fields [here](https://docs.trefle.io/docs/advanced/plants-fields).
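Looking at a single record does not show whether every record carries the same fields, or how often a field is empty. The following is a minimal sketch of one way to check, using a made-up `sample_plants` list in place of the real `plants` data (the field values here are illustrative only):

```python
from collections import Counter

# Hypothetical stand-in for the loaded `plants` list; real records have many more fields.
sample_plants = [
    {"common_name": "Plant A", "year": 1753, "synonyms": ["Synonym A1"]},
    {"common_name": None, "year": 1785, "synonyms": []},
    {"common_name": "Plant C", "year": 1753},
]

# Count how many records carry each key.
field_counts = Counter(key for plant in sample_plants for key in plant)

# Count how many records hold a missing (None) value for each key.
missing_counts = Counter(
    key for plant in sample_plants for key, value in plant.items() if value is None
)

print(dict(field_counts))
print(dict(missing_counts))
```

Running the same two comprehensions over `plants` should reveal which fields are safe to index directly and which need a guard against `None`.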

## Plant Families

Let's start by seeing what botanical families are represented in our data. We'll build a dictionary that counts the plants by family. We'll use the common names for this purpose.

In [None]:
plant_dict = defaultdict(int)
for plant in plants:
    plant_dict[plant['family_common_name']] += 1

In [None]:
plant_dict['Beech family']

Are there any missing values?

In [None]:
plant_dict[None]

Let's remove these by using a dictionary comprehension:

In [None]:
fams_clean = {fam: num for fam, num in plant_dict.items() if fam is not None}

Let's also remove the families whose counts are less than ten:

In [None]:
fams_clean = {fam: num for fam, num in fams_clean.items() if num >= 10}
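The counting and filtering steps can also be written with `collections.Counter`, which is designed for tallies. This is a sketch on a small invented list of family names, not the real data, and it uses a threshold of 2 only because the toy list is short:

```python
from collections import Counter

# Hypothetical family names standing in for plant['family_common_name'] values.
family_names = ["Beech family", "Rose family", None, "Beech family",
                "Rose family", "Beech family", "Pea family"]

# Counter tallies hashable items; missing names are skipped while counting.
family_counts = Counter(name for name in family_names if name is not None)

# Keep only families that meet a minimum count (the notebook uses 10).
min_count = 2
common_families = {
    fam: num for fam, num in family_counts.most_common() if num >= min_count
}

print(common_families)
```

Because `most_common()` yields families from most to least frequent, a dictionary built this way is already ordered for a sorted bar chart.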

Now we can make a bar chart of the numbers:

In [None]:
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6

In [None]:
fig, ax = plt.subplots(figsize=(20, 12))

ax.bar(list(fams_clean.keys()), list(fams_clean.values()))
ax.set_title('Families by Number', fontsize=30)
plt.xticks(rotation=80, fontsize=20)
plt.yticks(fontsize=20);

## Synonyms

Suppose we wanted to know how many synonymous names were listed for each plant. We could grab this number with a list comprehension:

In [None]:
num_syn = [len(plant['synonyms']) for plant in plants]

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(range(len(num_syn)), num_syn)
ax.set_title('Numbers of Synonyms');

### Exercise

There's an outlier here. Find the plant with almost 350 synonyms!

<details>
<summary>One of many possible answers here - No peeking!
    </summary>
    <code>[plant for plant in plants if len(plant['synonyms']) > 300]</code>
    </details>
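A threshold read off the plot is one way to find the outlier. Another is to ask directly for the record with the longest synonym list. The sketch below assumes records shaped like ours, with a `synonyms` list. The `scientific_name` values are placeholders:

```python
# Hypothetical records; only the 'synonyms' field matters here.
sample_plants = [
    {"scientific_name": "Plant A", "synonyms": ["a1", "a2"]},
    {"scientific_name": "Plant B", "synonyms": ["b1", "b2", "b3", "b4"]},
    {"scientific_name": "Plant C", "synonyms": []},
]

# max() with a key function returns the record whose synonym list is longest.
most_synonyms = max(sample_plants, key=lambda plant: len(plant["synonyms"]))

# enumerate() also gives the list position, which matches the scatter plot's x-axis.
position, _ = max(enumerate(sample_plants), key=lambda pair: len(pair[1]["synonyms"]))

print(position, most_synonyms["scientific_name"])
```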

## Bibliography

In [152]:
biblios = [plant['bibliography'] for plant in plants]

In [149]:
biblios[0]

'Encycl. 1: 723 (1785)'

Let's check the bibliographies for any explicit mention of Linnaeus, the godfather of biological taxonomy. First we'll remove the plants with no bibliography:

In [172]:
hasbiblio = [plant['bibliography'] for plant in plants if plant['bibliography'] is not None]

In [173]:
len(hasbiblio)

994

In [175]:
linnaeans = [biblio for biblio in hasbiblio if 'Linnaeus' in biblio]
linnaeans

['Linnaeus, C. (1753). Species plantarum, exhibentes plantas rite cognitas ad genera relatas cum differentiis specificis, nominibus trivialibus, synonymis selectis, locis natalibus, secundum systema sexuale digestas. Stockholm.']

### Exercise

What plant is this?

<details>
    <summary>
        Answer
    </summary>
<code>[plant for plant in plants if plant['bibliography'] == linnaeans[0]]</code>
    </details>

## Matching Author and Bibliography

_Species Plantarum_ (1753) was written by Linnaeus, and "Sp. Pl." in these bibliographies refers to that work. Let's see who is listed as the author for the records whose bibliographies start with "Sp. Pl.":

In [177]:
sps = []
for plant in plants:
    try:
        if plant['bibliography'].startswith('Sp. Pl.'):
            sps.append(plant)
    except AttributeError:  # bibliography is None for some records
        continue
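The `try` block above exists because some records have no bibliography, and `None` has no `startswith` method. An explicit check is an alternative that avoids exception handling altogether. Here is a sketch with a hypothetical helper, `bibliography_startswith`, and made-up sample records:

```python
def bibliography_startswith(plant, prefix):
    """Return True if the record has a bibliography string starting with `prefix`."""
    biblio = plant.get("bibliography")
    # A missing or None bibliography simply does not match.
    return isinstance(biblio, str) and biblio.startswith(prefix)


# Hypothetical records covering the cases we care about.
sample_plants = [
    {"bibliography": "Sp. Pl.: 1 (1753)", "author": "Author X"},
    {"bibliography": None, "author": "Author Y"},
    {"bibliography": "Encycl. 1: 723 (1785)", "author": "Author Z"},
    {"author": "Author W"},  # key absent entirely
]

sps_sample = [p for p in sample_plants if bibliography_startswith(p, "Sp. Pl.")]
print(len(sps_sample))
```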

In [181]:
sp_authors = [sp['author'] for sp in sps]

In [191]:
uniq_auth = list(set(sp_authors))

In [192]:
counts = []
for author in uniq_auth:
    counts.append(sp_authors.count(author))

In [193]:
dict(zip(uniq_auth, counts))

{'L.': 439,
 None: 1,
 '(J.Presl & C.Presl) Parl.': 1,
 '(L.) L.': 2,
 'A.Haines (Linnaeus)': 1,
 'Willd.': 1}

### Exercise

Change the author to 'L.' (for 'Linnaeus') for all of these records.

<details>
    <summary>
        Answer
    </summary>
<code>for plant in plants:
    try:
        if plant['bibliography'].startswith('Sp. Pl.'):
            plant['author'] = 'L.'  # assignment (=), not comparison (==)
    except AttributeError:  # bibliography is None
        continue</code>
    </details>

## Bringing in `pandas`: Back to Synonyms

Let's take advantage of the DataFrame tools from `pandas`:

In [None]:
plants_df = pd.DataFrame(plants)

In [None]:
plants_df.head()

### Exercises

1. Add the number of synonyms as a new column called "num_syn".
2. Sort the DataFrame by number of synonyms in descending order.
3. Grab the Image URL of the plant that has the fourth-highest number of synonyms.
4. Paste it into your browser and take a look!

<details>
    <summary>Answer here
    </summary>
    <code>plants_df['num_syn'] = plants_df['synonyms'].map(len)
by_syn = plants_df.sort_values('num_syn', ascending=False)
by_syn.iloc[3]['image_url']</code>
    </details>
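A subtlety in step 3: after sorting, a row's position and its index label are no longer the same thing. `.iloc` selects by position in the sorted frame, while `.loc` selects by the original label. The sketch below shows the difference on a toy frame, where the `name` column and its values are invented for illustration:

```python
import pandas as pd

# Hypothetical miniature of plants_df.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "synonyms": [["x"], ["x"] * 5, [], ["x"] * 3, ["x"] * 4],
})
df["num_syn"] = df["synonyms"].map(len)
by_syn = df.sort_values("num_syn", ascending=False)

# Position 3 in the sorted frame is the fourth-highest synonym count.
fourth_by_position = by_syn.iloc[3]["name"]

# Label 3 is whichever row was fourth in the ORIGINAL order; sorting does not change labels.
row_labelled_3 = by_syn.loc[3, "name"]

print(fourth_by_position, row_labelled_3)
```

Calling `reset_index(drop=True)` after sorting would make labels and positions agree again, if label-based access is preferred.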

## Year

The year is an indication of when a valid name for the plant first appeared in print. Let's build a simple histogram of these years:

In [None]:
fig, ax = plt.subplots()

ax.hist(plants_df['year'].dropna(), color='darkgreen');  # dropna() guards against missing years
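A histogram is a count of values per bin. To see the numbers behind the bars, the binning can be done by hand. This sketch groups an invented list of years by century and skips any missing value. It assumes nothing about the real distribution:

```python
from collections import Counter

# Hypothetical 'year' values, including one missing entry.
years = [1753, 1753, 1785, 1810, 2004, None, 2011]

# Bucket each known year by the start of its century (e.g. 1753 -> 1700).
century_counts = Counter(
    (year // 100) * 100 for year in years if year is not None
)

for century in sorted(century_counts):
    print(century, century_counts[century])
```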

### Exercise

How many plants were first written about within the last 20 years (2001 or later)? Make a bar chart or pie chart that shows the distribution of these plants by family.

<details>
    <summary>
        One answer here
    </summary>
<code>recent = plants_df[plants_df['year'] >= 2001]
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
ax[0].bar(recent['family_common_name'].value_counts().index,
       recent['family_common_name'].value_counts(),
      color='darkgreen')
ax[1].pie(recent['family_common_name'].value_counts(),
      labels=recent['family_common_name'].value_counts().index,
      radius=1.2, labeldistance=0.4, rotatelabels=True);</code>
    </details>