# DS 3000 Quiz 1

Due by: Tuesday Oct 9 @ 11:59 PM EST

Time Limit: You have 2 hours to complete the assignment once started

## Instructions

This quiz has 100 points total.

- You are welcome to post a private note on Piazza, but to keep a consistent testing environment for all students we are unlikely to provide assistance.
- You may not contact other students with information about this quiz
    - even saying "it was easy/hard" in a general sense can introduce a bias in favor of students who take the quiz earlier or later
- Under no circumstances should you share a copy of this quiz with anyone who isn't a member of the course staff.
- Take this quiz with open notes and feel free to access any online resource / documentation you'd like.  

### Submission Instructions
After completing the quiz, please follow the steps below to submit:
1. "Kernel" -> "Restart & Run All"
1. save your quiz file so that it reflects this latest version
1. upload the `.ipynb` to Gradescope **before** clicking submit
1. ensure that you can see your Jupyter notebook in the Gradescope interface after clicking "submit"

We emphasize the last step above because Gradescope has allowed students to "submit" without uploading a file.  It is your responsibility to ensure that you've actually submitted a file.

### Academic Integrity Pledge

Input your name below to sign the Academic Integrity Pledge before continuing with the quiz. Failure to do so will result in a score of **0**.

In [1]:
name = 'Sean Ayoub'
print(f'I, {name}, declare that the following work is entirely my own, and that I did not copy or seek help from any students who have currently or previously taken this course, nor from any online source other than private messages between myself and the professor on Piazza/via email.')

I, Sean Ayoub, declare that the following work is entirely my own, and that I did not copy or seek help from any students who have currently or previously taken this course, nor from any online source other than private messages between myself and the professor on Piazza/via email.


In [2]:
# the following modules will be necessary to complete the quiz
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

# Part 1: Dictionary API (50 points)

Using [this dictionary API](https://dictionaryapi.dev/), create the following dataframe by searching for the words `hello`, `data` and `science`.

Note that your searches may return multiple words, multiple definitions or multiple pronunciations.  Where necessary, always select the first.


|   |    word |                                                          url_pronounce |                                                                                                                                                                                              definition |
|--:|--------:|-----------------------------------------------------------------------:|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| 0 |   hello |     https://api.dictionaryapi.dev/media/pronunciations/en/hello-au.mp3 |                                                                                                                                                                     "Hello!" or an equivalent greeting. |
| 1 |    data |   https://api.dictionaryapi.dev/media/pronunciations/en/data-au-nz.mp3 | (plural: data) A measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device). The scale is arbitrarily defined, such as from ... |
| 2 | science | https://api.dictionaryapi.dev/media/pronunciations/en/science-1-ca.mp3 |                                                A particular discipline or branch of learning, especially one dealing with measurable or systematic principles rather than intuition or natural ability. |

**Note:** Because each row of the pandas dataframe contains so many characters, you may find that:

    pd.options.display.max_colwidth = 200
    
allows you to see the whole thing.
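As a rough illustration of what this option changes, the sketch below renders the same frame under two column-width limits. The frame `df_demo` and its contents are hypothetical; `pd.option_context` is used here only so the setting is restored automatically afterwards.

```python
import pandas as pd

# Hypothetical one-row frame holding a 120-character string
df_demo = pd.DataFrame({"definition": ["x" * 120]})

# With a 50-character limit, the long cell is cut off and ends in "..."
with pd.option_context("display.max_colwidth", 50):
    narrow = repr(df_demo)

# With a 200-character limit, the whole string is displayed
with pd.option_context("display.max_colwidth", 200):
    wide = repr(df_demo)

print(narrow)
print(wide)
```

Setting `pd.options.display.max_colwidth = 200` directly has the same effect on display, but it stays in force for the rest of the session.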

**Note Also:** Your response need not build any functions, but be sure to name variables appropriately and document your process.

In [3]:
def call_dict_api(word):
    """ calls dictionary api
    
    Args:
        word (str): word in the dictionary to look up
    
    Returns:
        response_dict (list): parsed api response, a list of entry
            dictionaries (one per matching word)
    """
    url = f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}"
    response = requests.get(url)
    # stop early on an HTTP error instead of parsing an error payload
    response.raise_for_status()
    response_dict = json.loads(response.text)
    return response_dict

# create lists
word_list = ["hello", "data", "science"]
pronounce_list = []
definition_list = []

# retrieve info for each word
for item in word_list:
    response_dict = call_dict_api(item)
    pronounce_list.append(response_dict[0]["phonetics"][0]["audio"])
    definition_list.append(response_dict[0]["meanings"][0]["definitions"][0]["definition"])

# create dataframe
word_dict = {"word": word_list,
             "url_pronounce": pronounce_list,
             "definition": definition_list}
df_dictionary = pd.DataFrame.from_dict(word_dict)
pd.options.display.max_colwidth = 200

df_dictionary

Unnamed: 0,word,url_pronounce,definition
0,hello,https://api.dictionaryapi.dev/media/pronunciations/en/hello-au.mp3,"""Hello!"" or an equivalent greeting."
1,data,https://api.dictionaryapi.dev/media/pronunciations/en/data-au-nz.mp3,"(plural: data) A measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device). The scale is arbitrarily defined, such as from ..."
2,science,https://api.dictionaryapi.dev/media/pronunciations/en/science-1-ca.mp3,"A particular discipline or branch of learning, especially one dealing with measurable or systematic principles rather than intuition or natural ability."
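The chained lookups in the cell above assume that every level of the response is non-empty. A more defensive version of the "always select the first" rule could look like the sketch below. The `sample_response` is a hand-written stand-in that only mirrors the keys indexed above (`phonetics`, `audio`, `meanings`, `definitions`, `definition`); it is not a real API response, and `extract_first` is a hypothetical helper name.

```python
# Hypothetical sample shaped like the structure indexed in the cell above
sample_response = [
    {
        "word": "example",
        "phonetics": [
            {"audio": "https://example.com/example-1.mp3"},
            {"audio": "https://example.com/example-2.mp3"},
        ],
        "meanings": [
            {"definitions": [{"definition": "first definition"},
                             {"definition": "second definition"}]},
        ],
    }
]


def extract_first(response):
    """Return (audio_url, definition) from the first entry, or None where missing."""
    entry = response[0]

    # first pronunciation, if any pronunciations are listed
    phonetics = entry.get("phonetics", [])
    audio = phonetics[0].get("audio") if phonetics else None

    # first definition of the first meaning, if present
    meanings = entry.get("meanings", [])
    definitions = meanings[0].get("definitions", []) if meanings else []
    definition = definitions[0].get("definition") if definitions else None

    return audio, definition


audio_url, definition = extract_first(sample_response)
print(audio_url)
print(definition)
```

Returning `None` for a missing field keeps the three lists the same length, so the dataframe can still be built even if one word lacks a pronunciation or a definition.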


# Part 2: Web Scraping Korean Dramas (50 points)

Your goal is to build a data frame that includes two columns: `category` and `movie` based on the 50 best Korean Dramas according to [this website](https://www.marieclaire.com/culture/a26895105/best-korean-dramas/). To help you, the actual web scraping part of this problem is done in the first code cell below, along with the first step of cleaning the data. The result is a list of headers from the website. Note:

- certain elements in the list are the categories
- all elements trailing a category belong to that category until a new category appears

**Note:** the directions below describe one way to accomplish the task. If you can think of a faster way to do it, please do so!

Create two empty lists, `kdramas_cats` and `kdramas_movs`. Then, loop through the `headers` and build out the lists such that the `kdramas_cats` contains the category corresponding to each movie and `kdramas_movs` contains all the movies. Then, use these lists to create a data frame with a column called `category` (where the `kdramas_cats` data are stored) and a column called `movie` (where the `kdramas_movs` data are stored). Make sure you clean the data so that:

- the categories do not have the `' Korean Dramas'` part of their string
- the last element of `headers` is not included (since it is not a movie, but rather an advertisement that happened to share the `<h2>` tag)

When you are done, print the entire data frame to ensure it all worked.

**Note:** Your response need not build any functions, but be sure to name variables appropriately and document your process.

In [4]:
# the url, scraper, and soup object
url = 'https://www.marieclaire.com/culture/a26895105/best-korean-dramas/'
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

# parsing the data to get only the headers
headers = soup.find_all('h2')
headers

[<h2 class="article-body__section" id="section-action-thriller-korean-dramas"><span>Action/Thriller Korean Dramas</span></h2>,
 <h2 id="apos-squid-game-apos-2">'Squid Game'</h2>,
 <h2 id="apos-vincenzo-apos-2">'Vincenzo'</h2>,
 <h2 id="apos-happiness-apos-2">'Happiness'</h2>,
 <h2 id="apos-all-of-us-are-dead-apos-2">'All of Us Are Dead'</h2>,
 <h2 id="apos-my-name-apos-2">'My Name'</h2>,
 <h2 id="apos-d-p-apos-2">'D.P'</h2>,
 <h2 id="apos-weak-hero-class-1-apos-2">'Weak Hero Class 1'</h2>,
 <h2 id="apos-bloodhounds-apos-2">'Bloodhounds'</h2>,
 <h2 class="article-body__section" id="section-romance-korean-dramas"><span>Romance Korean Dramas</span></h2>,
 <h2 id="apos-crash-landing-on-you-apos-2">'Crash Landing on You'</h2>,
 <h2 id="apos-business-proposal-apos-2">'Business Proposal'</h2>,
 <h2 id="apos-hometown-cha-cha-cha-apos-2">'Hometown Cha-Cha-Cha'</h2>,
 <h2 id="apos-coffee-prince-apos-2">'Coffee Prince'</h2>,
 <h2 id="apos-boys-over-flowers-apos-2">'Boys Over Flowers'</h2>,
 <h2 i

In [5]:
kdramas_cats = []
kdramas_movs = []
current_cat = None

# walk the headers in order, skipping the last one (an advertisement, not a movie)
for item in headers[:-1]:
    if "Korean Dramas" in item.text:
        # category header: remember it for the movies that follow
        current_cat = item.text.replace(" Korean Dramas", "")
    else:
        # movie header: record the movie together with its category
        kdramas_cats.append(current_cat)
        kdramas_movs.append(item.text)

# creates dataframe
kdrama_dict = {"category": kdramas_cats,
               "movie": kdramas_movs}
df_kdrama = pd.DataFrame.from_dict(kdrama_dict)
df_kdrama

Unnamed: 0,category,movie
0,Action/Thriller,'Squid Game'
1,Action/Thriller,'Vincenzo'
2,Action/Thriller,'Happiness'
3,Action/Thriller,'All of Us Are Dead'
4,Action/Thriller,'My Name'
5,Action/Thriller,'D.P'
6,Action/Thriller,'Weak Hero Class 1'
7,Action/Thriller,'Bloodhounds'
8,Romance,'Crash Landing on You'
9,Romance,'Business Proposal'
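
As one possible "faster way" in the sense of the note in the directions, the category can be carried down to each movie with a pandas forward fill instead of an explicit loop. The sketch below is an assumption-laden illustration: `header_texts` is a short hand-written stand-in for the text of `headers` (a few titles taken from the output above, plus the trailing advertisement), and `df_sketch` is a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-in for [h.text for h in headers], in page order
header_texts = [
    "Action/Thriller Korean Dramas",
    "'Squid Game'",
    "'Vincenzo'",
    "Romance Korean Dramas",
    "'Crash Landing on You'",
    "Marie Claire Newsletter",  # trailing advertisement, dropped below
]

# drop the last element, which is not a movie
texts = pd.Series(header_texts[:-1])

# True on category rows, False on movie rows
is_cat = texts.str.contains("Korean Dramas", regex=False)

# keep text only on category rows, then carry each category forward
category = (
    texts.where(is_cat)
    .ffill()
    .str.replace(" Korean Dramas", "", regex=False)
)

# keep only the movie rows and renumber them from 0
df_sketch = pd.DataFrame({"category": category, "movie": texts})[~is_cat]
df_sketch = df_sketch.reset_index(drop=True)
print(df_sketch)
```

The forward fill relies on the same property the directions state: every header after a category belongs to that category until the next category appears.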
