
# Module 2. Interacting with Data



---
We'll be analyzing text data from Shakespeare's *THE TEMPEST*.

Full script: http://shakespeare.mit.edu/tempest/full.html

# 1. Import data

_How the data is loaded is beyond the scope of this module. Just know that the final object `data` is a dictionary._

In [0]:
import requests

url = "https://storage.googleapis.com/yb-intro-to-python/the%20tempest%20(scenes%201-2).json"
r = requests.get(url)
data = r.json()

# 2. Navigating our dictionary

## 2.1 Understanding nested structure

A useful way to store lots of dependent data is to use nested dictionaries. The dictionary at hand, `data`, is structured as follows:
```
ACTS
>> SCENES
  >> Setting
  >> Character
    >> Dialogue
    >> Action
```
* Each `act` contains `scenes`.
* Each `scene` contains `setting` and `characters`.
* Each `character` contains `dialogue` and `action`.
* `setting` is a string value.
* `dialogue` is a list of strings.
* `action` is a list of strings.
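
To make the layout above concrete, here is a miniature dictionary with the same shape. It is only a sketch: `toy_data` and its text values are invented placeholders, not the real contents of `data`.

```python
# Hypothetical miniature of `data`; every text value here is a placeholder.
toy_data = {
    "ACT 1": {
        "SCENE I": {
            "setting (at start)": "A placeholder setting.",
            "Master": {
                "dialogue": ["A placeholder line."],
                "action": [],
            },
            "Gonzalo": {
                "dialogue": ["Another placeholder line."],
                "action": ["A placeholder stage direction."],
            },
        }
    }
}

# Each pair of square brackets steps one level deeper into the nesting.
scene = toy_data["ACT 1"]["SCENE I"]
print(type(scene["setting (at start)"]))  # the setting is a string
print(type(scene["Master"]["dialogue"]))  # dialogue is a list of strings
```

Notice that the setting sits at the same level as the characters. That quirk is what makes looping over a scene a little tricky later on.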


In [0]:
# Let's begin by printing all acts
data.keys()

In [0]:
# Print all scenes in ACT 1
data["ACT 1"].keys()

In [0]:
# What are the keys for scene 1?
data["ACT 1"]["SCENE I"].keys()

In [0]:
# What are the keys for a character in scene 1?
data["ACT 1"]["SCENE I"]["Master"].keys()

In [0]:
# Print the nested data object
data["ACT 1"]["SCENE I"]["Master"]

## 2.2 Looping through nested objects

Non-symmetrical structure can be a challenge when looping through nested objects.<br>*(Things aren't how they're supposed to be!)*

In [0]:
# Let's loop through and choose only character keys

# Method 1: String match
for key in data["ACT 1"]["SCENE I"].keys():
  if key != "setting (at start)":
    print(key)

In [0]:
# Method 2: String list match
for key in data["ACT 1"]["SCENE I"].keys():
  if key not in ["setting (at start)"]:
    print(key)

In [0]:
# Method 3: String case match
for key in data["ACT 1"]["SCENE I"].keys():
  if key == key.title():
    print(key)

In [0]:
# Method 4: Value matching (Best approach)
for key, value in data["ACT 1"]["SCENE I"].items():
  if isinstance(value, dict):
    print(key)

# 3. Parsing textual data _(words and stuff)_

## 3.1 Get all characters per scene

In [0]:
for key, value in data["ACT 1"].items():
  
  # more meaningful variable name
  scene = key

  # value is a dictionary
  for k, v in value.items():
  
    # more meaningful variable name
    char = k

    if isinstance(v, dict):
      print(char)



In [0]:
# Instead of printing them, let's store them in a dictionary:

all_chars = {}

for key, value in data["ACT 1"].items():
  
  # more meaningful variable name
  scene = key

  # Create an empty list
  all_chars[scene] = []

  # value is a dictionary
  for k, v in value.items():
  
    # more meaningful variable name
    char = k

    if isinstance(v, dict):
      # If it's a character, add to the scene
      all_chars[scene].append(char)



# Now that you're done, print
all_chars
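
The nested loop above can also be written as a dictionary comprehension. This is just an alternative sketch, not a replacement. `toy_act` is a made-up stand-in for `data["ACT 1"]`, and which characters appear in which scene is invented for illustration.

```python
# Hypothetical stand-in for data["ACT 1"]; contents are placeholders.
toy_act = {
    "SCENE I": {
        "setting (at start)": "A placeholder setting.",
        "Master": {"dialogue": ["placeholder"], "action": []},
        "Gonzalo": {"dialogue": ["placeholder"], "action": []},
    },
    "SCENE II": {
        "setting (at start)": "Another placeholder setting.",
        "Prospero": {"dialogue": ["placeholder"], "action": []},
    },
}

# Outer part: one entry per scene.
# Inner part: keep only the keys whose value is a dictionary (the characters).
all_chars = {
    scene: [name for name, value in contents.items() if isinstance(value, dict)]
    for scene, contents in toy_act.items()
}

print(all_chars)
```

The explicit loop is easier to read when you're starting out. The comprehension is handy once the pattern feels familiar.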

## 3.2 Get all dialogue from a character

In [0]:
# If the data is clean, every character should have the same structure
char_dict = data["ACT 1"]["SCENE I"]["Master"]

print("Keys are:", list(char_dict.keys()))

# What is the data type for each?
for key in char_dict.keys():
  value = char_dict[key]
  print(type(value))

In [0]:
# Alternatively
for key, value in char_dict.items():
  print(type(value))

In [0]:
# Let's print all dialogue
char_dict["dialogue"]

In [0]:
# How many lines does he speak?
len(char_dict["dialogue"])

In [0]:
# How many words does he speak?
for x in char_dict["dialogue"]:
  # By default, split() breaks the string on any run of whitespace
  print(len(x.split()))
  print(x.split())
  

In [0]:
# Do it again, but keep a total
n_words = 0
for x in char_dict["dialogue"]:
  n = len(x.split())

  n_words = n_words + n

print(n_words)

In [0]:
# alternatively
n_words = 0
for x in char_dict["dialogue"]:
  n = len(x.split())

  # += is short notation for "add this to what you already are"
  n_words += n

print(n_words)

In [0]:
# How many characters does he speak?
n_chars = 0
for x in char_dict["dialogue"]:
  n = len(x)
  n_chars += n

print(n_chars)
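
The running-total loops above can be shortened with the built-in `sum()`. Here's a small sketch, where the `dialogue` list is a made-up stand-in for `char_dict["dialogue"]`.

```python
# Hypothetical dialogue list; stands in for char_dict["dialogue"].
dialogue = ["Good morning to you", "Farewell"]

# sum() adds up one number per line, so no counter variable is needed.
n_words = sum(len(line.split()) for line in dialogue)
n_chars = sum(len(line) for line in dialogue)  # spaces are counted too

print(n_words, n_chars)
```

Both versions do the same work. The loop makes each step visible, while `sum()` keeps the cell short.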

## 3.3 Parse action from a single character

In [0]:
# Essentially, I have a list of what each character said and what they've done
char = "Gonzalo"

# If I look up a character from the wrong scene, I don't want my code to break
char_dict = data["ACT 1"]["SCENE I"].get(char, {"dialogue":[],"action":[]})

# Pull out this character's list of actions
char_action = char_dict["action"]

# How much action does this character have in act 1, scene 1?
n_words = 0
n_characters = 0
n_instances = len(char_action)

for x in char_action:
  # split for words
  x_words = x.split()
  # Use this to print the words
  #print(x_words)

  n_words += len(x_words)
  n_characters+=len(x)
  

# I can store all this information in a dictionary!
all_char_action = {}

all_char_action[char] = {
    "n_words":n_words,
    "n_characters":n_characters,
    "n_instances":n_instances   
}

all_char_action
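
The `.get()` call in the cell above is what keeps the code from breaking when a character isn't in the scene. The sketch below shows the difference between `.get()` with a default and plain square brackets. `toy_scene` and its contents are invented for illustration.

```python
# Hypothetical scene with a single character; contents are placeholders.
toy_scene = {
    "setting (at start)": "A placeholder setting.",
    "Gonzalo": {"dialogue": ["A placeholder line."], "action": ["Exit"]},
}
empty_char = {"dialogue": [], "action": []}

found = toy_scene.get("Gonzalo", empty_char)
missing = toy_scene.get("Prospero", empty_char)  # not in this toy scene

print(len(found["action"]))    # one action entry
print(len(missing["action"]))  # falls back to the empty default

# Square brackets raise a KeyError for the same missing key.
try:
    toy_scene["Prospero"]
except KeyError as err:
    print("KeyError:", err)
```

With the default in place, the counting loop simply runs zero times for a missing character and every total stays at 0.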

## 3.4 Parse action from all characters

In [0]:
# Create an empty dictionary
all_data = {}
x_lookup = "action"

for key, value in data["ACT 1"]["SCENE I"].items():

  # easier name
  char = key

  # If it's a string, skip it (we only want characters)
  if isinstance(value, str):
    continue

  # Instead of creating 3 variables, I can create a dictionary
  char_dict = {
      "n_words":0,
      "n_characters":0,
      "n_instances":len(value[x_lookup])
  }

  # I want to count the details of each character's action
  for x in value[x_lookup]:

    # Count the words
    char_dict["n_words"] += len(x.split())

    # Count the characters
    # We won't get rid of spaces
    char_dict["n_characters"] += len(x)


  # Add my char_dict to the main dict
  all_data[char] = char_dict

  #break # Stop when you get here


# Done with the loop
all_data

## 3.5 Parse dialogue from all characters

In [0]:
# Create an empty dictionary
all_data = {}
x_lookup = "dialogue"

for key, value in data["ACT 1"]["SCENE I"].items():

  # easier name
  char = key

  # If it's a string, skip it (we only want characters)
  if isinstance(value, str):
    continue

  # Instead of creating 3 variables, I can create a dictionary
  char_dict = {
      "n_words":0,
      "n_characters":0,
      "n_instances":len(value[x_lookup])
  }

  # I want to count the details of each character's dialogue
  for x in value[x_lookup]:

    # Count the words
    char_dict["n_words"] += len(x.split())

    # Count the characters
    # We won't get rid of spaces
    char_dict["n_characters"] += len(x)


  # Add my char_dict to the main dict
  all_data[char] = char_dict

  #break # Stop when you get here


# Done with the loop
all_data
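
The cells in 3.4 and 3.5 are identical except for the value of `x_lookup`. One way to avoid the copy-paste is to wrap the loop in a function. This is only a sketch: `summarize_scene` is a hypothetical helper name, and `toy_scene` is made-up data with the same shape as a real scene.

```python
def summarize_scene(scene, field):
    """Count words, characters, and entries of `field` for each character."""
    summary = {}
    for name, value in scene.items():
        # Skip anything that isn't a character (for example, the setting).
        if not isinstance(value, dict):
            continue
        lines = value.get(field, [])
        summary[name] = {
            "n_words": sum(len(line.split()) for line in lines),
            "n_characters": sum(len(line) for line in lines),  # spaces included
            "n_instances": len(lines),
        }
    return summary


# Hypothetical scene; text values are placeholders.
toy_scene = {
    "setting (at start)": "A placeholder setting.",
    "Master": {"dialogue": ["one two three", "four"], "action": ["Exit"]},
}

dialogue_stats = summarize_scene(toy_scene, "dialogue")
action_stats = summarize_scene(toy_scene, "action")
print(dialogue_stats)
print(action_stats)
```

With the real data, the equivalent calls would be `summarize_scene(data["ACT 1"]["SCENE I"], "dialogue")` and the same with `"action"`.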

# 4. Analyze

## 4.1 Another reason to love `Pandas`

In [0]:
import pandas as pd
import matplotlib.pyplot as plt # This is a plotting library, ignore it

In [0]:
# A dataframe is a table of data with labeled rows and columns
df = pd.DataFrame.from_dict(all_data)

# Pretty print (only in notebooks)
df

In [0]:
# Switch rows and columns (Transpose)
df.T

In [0]:
# Keep steps that depend on each other in the same cell:
df = pd.DataFrame.from_dict(all_data)
df = df.T

# Now print
df
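
As an alternative to transposing, `DataFrame.from_dict` accepts `orient="index"`, which treats each outer key as a row label instead of a column name. The numbers in `toy_all_data` below are made up. They only mimic the shape of `all_data`.

```python
import pandas as pd

# Hypothetical counts; same shape as all_data, values invented.
toy_all_data = {
    "Master": {"n_words": 8, "n_characters": 40, "n_instances": 2},
    "Gonzalo": {"n_words": 30, "n_characters": 150, "n_instances": 3},
}

# Outer keys become rows, inner keys become columns.
df_index = pd.DataFrame.from_dict(toy_all_data, orient="index")

print(df_index)
```

This should give the same table as building the dataframe and then calling `.T`, in a single step.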

## 4.2 Summarizing data in pandas

In [0]:
# What's the type of a dataframe?
type(df)

In [0]:
# What's the type of a column?
type(df["n_words"])

In [0]:
# What's the type of data *in* the columns?
df.dtypes

In [0]:
# Some other cool stuff
df.info()

In [0]:
df.describe()

In [0]:
df.corr()

In [0]:
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

## 4.3 Calculate using columns

Dataframes let you calculate with entire columns at once, without writing a loop.

In [0]:
# Get the average characters per word
df["avg char per word"] = df["n_characters"] / df["n_words"]

df

In [0]:
# What is the highest average word length?
# (n_characters includes spaces and punctuation, so this is a rough measure)
n_max = df["avg char per word"].max()

# Which character has it?
top_char = df["avg char per word"].idxmax()

print(top_char, n_max)
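
`idxmax()` only tells you who comes first. To see the full ranking, you can sort the column instead. The sketch below uses a made-up `toy_df` with invented names and numbers.

```python
import pandas as pd

# Hypothetical data; names and counts are invented for illustration.
toy_df = pd.DataFrame(
    {"n_words": [10, 20, 5], "n_characters": [60, 100, 35]},
    index=["Char A", "Char B", "Char C"],
)
toy_df["avg char per word"] = toy_df["n_characters"] / toy_df["n_words"]

# Largest value first; the first label matches what idxmax() returns.
ranking = toy_df["avg char per word"].sort_values(ascending=False)

print(ranking)
```

Sorting makes it easy to compare a character against everyone else rather than only against the maximum.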

## 4.4 Visualize – Explore

In [0]:
# Here's what it looks like
df["avg char per word"].plot(kind="bar")

In [0]:
# You can visualize the whole thing
df.plot(kind="bar")

In [0]:
# The same plot with a logarithmic y-axis, so small values stay visible
df.plot(kind="bar", logy=True)

In [0]:
# Box plots of each column, also on a logarithmic y-axis
df.plot(kind="box",logy=True)

In [0]:
# Do all of it!

color_dict = {
    "n_words":"#7a0099",
    "n_characters":"pink",
    "n_instances":"blue",
    "avg char per word":"orange"
    }

for col in df.columns:
  df[col].plot(kind="bar", color=color_dict[col])
  plt.title(col, weight="bold", color= color_dict[col])

  # If you wanted to save each graph:
  # plt.savefig(f"the tempest – act 1 scene 1 – dialogue – {col}.png", dpi=150)

  # You can't show it and then save it. First save, then show.
  plt.show()

In [0]:
plt.scatter(
    x = "n_words",
    y = "n_characters",
    data = df
    )
plt.show()

In [0]:
plt.scatter(
    x = "n_words",
    y = "n_instances",
    data = df
    )

## 4.5 Visualize – Focused

**What is our project's goal?**
* Who is the most important character?
* Do _evil_ characters have specific patterns?

In [0]:
# Plot n words per character
deep_purple = "#7a0099"
df["n_words"].plot(kind="bar", color=deep_purple)

In [0]:
# Do some more calculations
df["words_per_instance"] = df["n_words"]/df["n_instances"]
df["chars_per_instance"] = df["n_characters"]/df["n_instances"]
df

In [0]:
df["words_per_instance"].plot(kind="bar", color=deep_purple)

In [0]:
df["chars_per_instance"].plot(kind="bar", color=deep_purple)

In [0]:
df[["chars_per_instance","words_per_instance"]].plot(kind="bar", color=[deep_purple,"blue"])

plt.title("THE TEMPEST – Act 1. Scene 1.", weight="bold", size=14)

# If you want to save, do it before plt.show()
# plt.savefig("n_words.png", dpi=150,  bbox_inches = "tight")

plt.show()

# 5. Conclusion

* Gonzalo has:
  * the most words per instance: `21.857143`
  * the most characters per instance: `115.142857`


* He averages few characters per word: `5.267974`<br>
* This is more than Boatswain: `5.196262`<br>
* But less than Master: `6.375000`

It's hard to say if Gonzalo is evil or not, but he's certainly important. ***How intriguing!***