<a href="https://colab.research.google.com/github/ybressler/intro-to-python/blob/master/Module%203%20%E2%80%93%20Analyzing%20Song%20Lyrics%20%E2%80%93%20Cardi%20B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 3 – Analyzing Song Lyrics – Cardi B

*We'll be structuring, parsing, and analyzing data from Cardi B's discography.*


---

**Outline:**
* load data from json
* clean and structure data
* visualize and explore

In [0]:
import re
import json
import requests
import pandas as pd
import datetime
import matplotlib.pyplot as plt

In [0]:
# All of Cardi B's music!
data_path = "https://storage.googleapis.com/yb-intro-to-python/cardi_b_all.json"
data = requests.get(data_path).json()

# 1. Introduction to manipulating data

## 1.1 Datetimes

In [0]:
# a computer doesn't know that this string corresponds to a date in April
dt = '04/16/2017'

In [0]:
# Let's get the current time
now = datetime.datetime.now()
print(now)

# reformat to the string we want
# abbrev for --> str + f[ormat] + time
now = now.strftime("%m/%d/%Y")
print(now)

In [0]:
# If I can format with strftime --> I can parse with strptime, using the same format codes
dt = '04/16/2017'
dt = datetime.datetime.strptime(dt, "%m/%d/%Y")
print(dt)
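
A quick round trip can make the relationship between the two methods concrete. This is a minimal sketch using only the standard library; the names `fmt`, `release`, and `later`, and the second date, are illustrative assumptions and not part of the dataset.

```python
import datetime

fmt = "%m/%d/%Y"  # month/day/4-digit year, same codes in both directions

# parse: string -> datetime
release = datetime.datetime.strptime("04/16/2017", fmt)

# format: datetime -> string (the round trip gives back the original text)
as_text = release.strftime(fmt)

# once parsed, dates support arithmetic and comparisons
later = datetime.datetime.strptime("05/01/2017", fmt)
days_between = (later - release).days

# an impossible date such as April 31 raises ValueError when parsed
try:
    datetime.datetime.strptime("04/31/2017", fmt)
    is_valid = True
except ValueError:
    is_valid = False

print(as_text, days_between, is_valid)
```

The last step is the practical payoff: a plain string accepts any text, while parsing checks that the date actually exists.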

## 1.2 Loading into pandas

In [0]:
my_dict = {
    "Yaakov":{"color":"purple","lunch":"sushi"},
    "Selethal":{"color":"red", "lunch":"waffles"}
}

# this is what the dataframe looks like
pd.DataFrame.from_dict(my_dict).T

In [0]:
my_dict = {
    "Yaakov":{"color":"purple","lunch":"sushi"},
    "Selethal":{"color":"red", "lunch":"waffles"},
    "Stephanie":{"color":"blue", "lunch":"noodles"}
}

all_records = []
for key, value in my_dict.items():
  rec = {"name":key}
  # Add the color and the lunch
  rec.update(value)

  all_records.append(rec)


all_records

In [0]:
df = pd.DataFrame.from_records(all_records)
df
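
The loop is handy when records need reshaping first, but for a plain dict of dicts there may be a shorter route. The sketch below assumes the same toy `my_dict` and uses `orient="index"` so that each outer key becomes a row; the `df_people` name and the `name` column label are our own choices.

```python
import pandas as pd

my_dict = {
    "Yaakov": {"color": "purple", "lunch": "sushi"},
    "Selethal": {"color": "red", "lunch": "waffles"},
}

# orient="index" treats the outer keys as row labels (no .T needed)
df_people = pd.DataFrame.from_dict(my_dict, orient="index")

# move the row labels into a regular column called "name"
df_people = df_people.rename_axis("name").reset_index()

print(df_people)
```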

# 2. Manipulate data

## 2.1 Load to pandas

In [0]:
# Make an empty list of records
all_records = []

for key, value in data.items():

  # save the song name
  rec = {"song_name":key}
  
  # save the others
  rec.update(value)

  # save for real now
  all_records.append(rec)

# Load to pandas
df = pd.DataFrame.from_records(all_records)
df

In [0]:
# get summary
df.describe()

In [0]:
df.dtypes

## 2.4 Clean dates

**Here's how it works:**
* select a single column: `df["date_release"]`
* perform an operation on it...
* `pd.to_datetime` works for strings, lists, and the like

In [0]:
# Update the date!
df["date_release"] = df["date_release"].apply(pd.to_datetime)
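
Real data often contains values that are not dates at all. The sketch below shows one way to handle that, assuming a toy column with invented values: `pd.to_datetime` is called on the whole Series, and `errors="coerce"` turns anything unparseable into `NaT` instead of raising.

```python
import pandas as pd

# toy column; the values are made up for illustration
dates = pd.Series(["2017-06-16", "2018-04-06", "not a date"])

# errors="coerce" replaces unparseable values with NaT (the datetime "missing" marker)
parsed = pd.to_datetime(dates, errors="coerce")

print(parsed)
print(parsed.isna().tolist())
```

Checking `parsed.isna()` afterwards is a cheap way to find the rows that need manual cleanup.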

## 2.5 Extract lines / breaks

In [0]:
def count_line_breaks(lyrics):
  """
  Count the non-empty lines in the lyrics, skipping section
  markers such as [Chorus] or [Verse 1]
  """

  # n_no_clean = len(lyrics.split("\n"))
  # print(f"no cleaning at all: {n_no_clean}")
  
  n_lines = 0

  for x in lyrics.split("\n"):
    x = x.strip() # get rid of extra whitespace at both ends
    if x=="":
      continue # I don't want to even see you. Next please!
      
    # what do these lines look like?
    if "[" in x:
      continue

    n_lines+=1 # keep counting until I say you stop goddamit

  return n_lines


# # Use my function – for testing
# lyrics = df["lyrics"][2]
# count_line_breaks(lyrics)

# actually use
df["lyrics_n_lines"] = df["lyrics"].apply(count_line_breaks)
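
To see what the counter keeps and what it skips, here is a compact restatement of the same logic run on a tiny sample. The function name `count_lyric_lines` and the sample text are made up for this illustration; they are not real lyrics and not part of the dataset.

```python
def count_lyric_lines(lyrics):
    """Count non-empty lines, skipping section markers such as [Chorus]."""
    n_lines = 0
    for line in lyrics.split("\n"):
        line = line.strip()
        # skip blank lines and anything that looks like a section marker
        if line == "" or "[" in line:
            continue
        n_lines += 1
    return n_lines


sample = "[Verse 1]\nfirst line\nsecond line\n\n[Chorus]\nhook line\n"
print(count_lyric_lines(sample))  # 3
```

Note the trade-off: any lyric line that happens to contain a `[` is skipped too, which is probably acceptable for a rough count.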

## 2.6 Extract word counts

In [0]:
# Let's count how many times Cardi B says "money" in each song

def count_words(string, word="money"):
  """
  This will let you count all sorts of words
  """

  # an okay method
  # n = string.lower().count(word)


  # a more powerful method
  # Lets you search all sorts of stuff
  pattern = re.compile(word, re.I)
  n = len(pattern.findall(string))

  return n
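
Why bother with a regular expression? A small sketch, using an invented sentence, shows the difference between a plain substring search and a pattern anchored with `\b` (word boundary):

```python
import re

text = "Pass the class, sass! My ass is on the grass."

# a plain pattern also counts matches inside longer words
n_substring = len(re.findall("ass", text, re.I))

# \b anchors the match to word boundaries, so only the standalone word counts
n_whole_word = len(re.findall(r"\bass\b", text, re.I))

print(n_substring, n_whole_word)  # 5 1
```

The raw string `r"\bass\b"` is equivalent to writing `"\\bass\\b"`; without one of the two, Python reads `\b` as a backspace character.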


In [0]:
all_words = ["money","bitch","fuck","\\bass\\b","\\bshit\\b","hoes","ain't", "mama","dance","\\bI\\b", "\\byou\\b", "\\bhe\\b", "\\bshe\\b"]

for word in all_words:
  col_name = word.replace("\\b","") # column name without the regex word-boundary markers
  df[col_name] = df["lyrics"].apply(count_words, word=word) # extra keyword args are passed on to count_words

In [0]:
df["n_curse_words"] = df[["bitch","fuck","ass","shit","hoes"]].sum(axis=1) # get the sum, rowwise
df
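
If `axis=1` feels arbitrary, a toy example may help. The sketch below uses invented counts and a hypothetical `counts` frame to show that `axis=1` adds across the columns of each row.

```python
import pandas as pd

# toy counts; the numbers are invented for illustration
counts = pd.DataFrame({"money": [3, 0], "dance": [1, 2], "mama": [0, 4]})

# axis=1 adds across the columns of each row; axis=0 would add down each column
counts["total"] = counts[["money", "dance", "mama"]].sum(axis=1)

print(counts)
```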

# 3. Visualize

## 3.1 Explore

What are we curious about?
* How does word usage change over time?
* When are her songs shorter instead of longer?
* When are her choruses more complex?
* How often does she repeat the same word?
* Gender parity in her lyrics?
* Female power in her lyrics?

### 3.1.1 Time

Set axis to time – and boom. *(This is why we love pandas)*

In [0]:
df_new = df.set_index("date_release")

In [0]:
df_new.plot(figsize=(14,5))

In [0]:
df_new.plot(figsize=(14,5))
plt.yscale("log")

In [0]:
df_new["n_curse_words"].plot(figsize=(14,4))
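
A datetime index does more than label the x-axis. As a rough sketch, assuming a toy frame with invented dates and counts, the index can be grouped by year to compare averages over time:

```python
import pandas as pd

# toy data; dates and counts are invented for illustration
toy = pd.DataFrame({
    "date_release": pd.to_datetime(["2017-06-16", "2018-04-06", "2018-09-20"]),
    "n_curse_words": [10, 4, 6],
})

# with a datetime index, .year is available directly on the index
by_date = toy.set_index("date_release").sort_index()
per_year = by_date.groupby(by_date.index.year)["n_curse_words"].mean()

print(per_year)
```

Sorting the index first also keeps line plots from zig-zagging when the rows are not in chronological order.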

### 3.1.2 Song length

In [0]:
df.head(1)

In [0]:
x_col = "lyrics_n_lines"
y_col = "n_curse_words"

plt.figure(figsize=(6,2))
plt.scatter(x=df[x_col], y=df[y_col])

plt.xlabel(x_col)
plt.ylabel(y_col)

plt.show()

In [0]:
x_col = "lyrics_n_lines"

for y_col in df.drop(columns=["lyrics"]).columns:
  plt.figure(figsize=(6,2))
  plt.scatter(x=df[x_col], y=df[y_col])

  plt.xlabel(x_col)
  plt.ylabel(y_col)

  plt.show()

### 3.1.3 Repeat words

In [0]:
use_cols = ['lyrics_n_lines', 'money',
       'bitch', 'fuck', 'hoes', 'ain\'t', 'mama', 'dance', 'I', 'you',
       'he', 'she', 'ass', 'shit', 'n_curse_words']
       

In [0]:
corr = df[df.columns.drop(["lyrics","song_name","date_release"])].corr()

size = 12
fig, ax = plt.subplots(figsize=(size, size))
ax.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns);
plt.yticks(range(len(corr.columns)), corr.columns);
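
Reading the matrix is easier with a reference point. In this minimal sketch (toy columns `a`, `b`, `c` with made-up values), `b` rises with `a` and `c` falls as `a` rises, so the correlations land at the two extremes:

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],   # moves with a
    "c": [4, 3, 2, 1],   # moves against a
})

# pairwise Pearson correlation between all numeric columns
toy_corr = toy.corr().round(6)

print(toy_corr)
```

Values near 1 mean two counts rise together, values near -1 mean one rises as the other falls, and values near 0 suggest no linear relationship.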

### 3.1.4 Gender Parity

In [0]:
plt.scatter(
    x = df["he"],
    y = df["she"]
)

plt.xlabel("HE", color="blue", weight="bold")
plt.ylabel("SHE", color="purple", weight="bold")

plt.show()

## 3.2 Exact

What is a specific research question?
_You get back to us!_

# Next steps!

In [0]:
# to be continued