In [58]:
import otter
# nb_name should be the name of your notebook without the .ipynb extension
nb_name = "p8"
py_filename = nb_name + ".py"
grader = otter.Notebook(nb_name + ".ipynb")

In [59]:
import p8_test

In [60]:
# PLEASE FILL IN THE DETAILS
# Enter none if you don't have a project partner
# You will have to add your partner as a group member on Gradescope even after you fill this

# project: p8
# submitter: 9082531048
# partner: None

# Project 8: Going to the Movies

## Learning Objectives:

In this project, you will demonstrate how to:

* integrate relevant information from various sources (e.g. multiple csv files),
* build appropriate data structures for organized and informative presentation (e.g. list of dictionaries),
* practice good coding style

Please go through [lab-p8](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-f22-projects/-/tree/main/lab-p8) before working on this project. The lab introduces some useful techniques related to this project.

## Note on Academic Misconduct:

**IMPORTANT**: p8 and p9 are two parts of the same data analysis. You **cannot** switch project partners between these two projects. That is, if you partner up with someone for p8, you must sustain that partnership until the end of p9. Now may be a good time to review [our course policies](https://cs220.cs.wisc.edu/f22/syllabus.html).

## Testing your code:

Along with this notebook, you must have downloaded the file `p8_test.py`. If you are curious about how we test your code, you can explore this file, and specifically the value of the variable `expected_json`, to understand the expected answers to the questions.

## Introduction:

In this project and the next, we will be working on the [IMDb Movies Dataset](https://www.imdb.com/interfaces/). We will use Python to discover some cool facts about our favorite movies, cast, and directors.

In this project, you will combine the data from the movie and mapping files into a more useful format.
Start by downloading the following files: `p8_test.py`, `small_mapping.csv`, `small_movies.csv`, `mapping.csv`, and `movies.csv`.

## The Data:

Open `movies.csv` and `mapping.csv` in any spreadsheet viewer, and see what the data looks like.
`movies.csv` has ~100,000 rows and `mapping.csv` has ~350,000 rows. These files store information about **every** movie in the IMDb dataset that was released in the US. These datasets are **very** large when compared to `small_movies.csv` and `small_mapping.csv` from [lab-p8](https://github.com/msyamkumar/cs220-f22-projects/tree/main/lab-p8), but the data is stored in the **same format**. For a description of the datasets, please refer back to [lab-p8](https://github.com/msyamkumar/cs220-f22-projects/tree/main/lab-p8).

Before we start working with these very large datasets, let us start with the much smaller datasets, `small_movies.csv` and `small_mapping.csv` from lab-p8. In the latter half of p8 and in p9, you will be working with `movies.csv` and `mapping.csv`. Since the files `movies.csv` and `mapping.csv` are large, some of the functions you write in p8 and p9 **may take a while to execute**. You do not have to panic if a single cell takes between 5 and 10 seconds to run. If any cell takes significantly longer, follow the recommendations below:

- **Do not** call **slow functions** multiple times within a loop.
- **Do not** call functions that **iterate over the entire dataset within a loop**; instead, call the function before the loop and store the result in a variable.
- **Do not** compute quantities **inside a loop** if they can be computed outside the loop; for example, if you want to calculate the average of a list, you should use the loop to find the numerator and denominator but divide **once** after the loop ends instead of inside the loop.
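
The recommendations above can be sketched on a tiny example. The `ratings` list and the `slow_total` helper below are hypothetical stand-ins for illustration only; they are not part of the project files.

```python
# Hypothetical data standing in for one column of the real dataset.
ratings = [6.9, 7.8, 5.2, 5.4]

def slow_total(values):
    # Stand-in for any function that iterates over the entire dataset.
    total = 0
    for value in values:
        total += value
    return total

# Call the slow function once, before the loop, and reuse the stored result.
total_rating = slow_total(ratings)
shares = []
for rating in ratings:
    shares.append(rating / total_rating)

# Accumulate the numerator and denominator inside the loop,
# then divide once after the loop ends.
numerator = 0
denominator = 0
for rating in ratings:
    numerator += rating
    denominator += 1
average_rating = numerator / denominator
```

Calling `slow_total(ratings)` inside the first loop would give the same `shares`, but it would repeat the full pass over the data once per element.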

## Project Requirements:

You **may not** hardcode indices in your code, unless the question explicitly says otherwise. If you open your `.csv` files with Excel, manually count through the rows, and use this number to loop through the dataset, this is also considered hardcoding. We'll **manually deduct** points from your autograder score on Gradescope during code review.

**Store** your final answer for each question in the **variable specified for each question**. This step is important because Otter grades your work by comparing the value of this variable against the correct answer.

For some of the questions, we'll ask you to write (then use) a function to compute the answer. If you compute the answer **without** creating the function we ask you to write, we'll **manually deduct** points from your autograder score on Gradescope, even if the way you did it produced the correct answer.

Required Functions:
- `get_mapping`
- `get_raw_movies`
- `get_movies`
- `find_specific_movies`
- `bucketize_by_genre`

In this project, you will also be required to define certain **data structures**. If you do not create these data structures exactly as specified, we'll **manually deduct** points from your autograder score on Gradescope, even if the way you did it produced the correct answer.

Required Data Structures:
- `small_movies`
- `movies`
- `genre_dict`

You are only allowed to define these data structures **once** and we'll **manually deduct** points from your autograder score on Gradescope if you redefine the values of these variables.

In this project (and the next), you will be asked to create **lists** of movies. For all such questions, **unless it is explicitly mentioned otherwise**, the movies should be in the **same order** as in the `movies.csv` (or `small_movies.csv`) file. Similarly, for each movie, the **list** of `genres`, `directors`, and `cast` members should always be in the **same order** as in the `movies.csv` (or `small_movies.csv`) file.

Students are only allowed to use Python commands and concepts that have been taught in the course prior to the release of p8. Therefore, you should not use the pandas module.  We will **manually deduct** points from your autograder score on Gradescope otherwise.

In addition, you are also **required** to follow the requirements below:
- **Do not use the method `csv.DictReader` for p8**. Although the required output can be obtained using this method, one of the learning outcomes of this project is to demonstrate your ability to build dictionaries with your own code.  
- Additional import statements beyond those that are stated in the directions are not allowed. For this project, we allow you to use `csv` and `copy` packages (that is, you can use the `import csv` and `import copy` statements in your submission). You should not use concepts / modules that are yet to be covered in this course; for example: you should not use modules like `pandas`. **We'll manually deduct points** accordingly, if you don't follow the provided directions.
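
As a rough illustration of building dictionaries with your own code instead of `csv.DictReader`, here is a minimal sketch. The `lines` list is a hypothetical stand-in for the contents of a small csv file, and the names `records` and `record` are chosen only for this example.

```python
import csv

# Hypothetical rows standing in for the lines of a small csv file.
# csv.reader accepts any iterable of strings, so no file is needed here.
lines = ["title,year,rating", "tt3104988,2018,6.9", "tt4846340,2016,7.8"]
rows = list(csv.reader(lines))

header = rows[0]
records = []
for row in rows[1:]:
    record = {}
    for idx in range(len(header)):
        # pair each column name with the value in the same position
        record[header[idx]] = row[idx]
    records.append(record)
```

Every value read this way is still a string, so any conversion to `int` or `float` has to happen separately. Looping over `range(len(header))` also avoids hardcoding column indices.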

For more details on what will cause you to lose points during code review and specific requirements, please take a look at the [Grading rubric](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-f22-projects/-/blob/main/p8/rubric.md).


## Questions and Functions:

Let us start by importing all the modules we will need for this project.



In [61]:
# it is considered a good coding practice to place all import statements at the top of the notebook
# please place all your import statements in this cell if you need to import any more modules for this project
import csv
import math
import copy

### Function 1: `get_mapping(path)`

We require you to complete the below function to answer the next several questions (this is a **requirement**, and you will **lose points** if you do not implement this function). You may copy/paste code from your lab-p8 notebook to finish this function.

In [62]:
def get_mapping(path):
    
    example_file = open(path, encoding="utf-8")
    example_reader = csv.reader(example_file)
    example_data = list(example_reader)
    example_file.close()
    
    mapping_dict={}
    for i in range(len(example_data)):
        mapping_dict[example_data[i][0]]=example_data[i][1]
    return mapping_dict


**Question 1:** What is returned by `get_mapping("small_mapping.csv")`?

Your output **must** be a **dictionary** which maps the *IDs* in `small_mapping.csv` to *names*.

In [63]:
# compute and store the answer in the variable 'small_mapping', then display it
small_mapping=get_mapping("small_mapping.csv")
small_mapping

{'tt3104988': 'Crazy Rich Asians',
 'nm0160840': 'Jon M. Chu',
 'nm2090422': 'Constance Wu',
 'nm6525901': 'Henry Golding',
 'nm0000706': 'Michelle Yeoh',
 'nm2110418': 'Gemma Chan',
 'nm0523734': 'Lisa Lu',
 'tt4846340': 'Hidden Figures',
 'nm0577647': 'Theodore Melfi',
 'nm0378245': 'Taraji P. Henson',
 'nm0818055': 'Octavia Spencer',
 'nm1847117': 'Janelle Monáe'}

In [64]:
grader.check("q1")

**Question 2:** What is the **value** associated with the **key** *nm2110418*?

Your output **must** be a **string**.

In [65]:
# access and store the answer in the variable 'nm2110418_value', then display it
nm2110418_value=small_mapping["nm2110418"]
nm2110418_value

'Gemma Chan'

In [66]:
grader.check("q2")

**Question 3:** What are the **values** associated with **keys** that **begin** with *nm*?

Your output **must** be a **list** of **strings**.

In [67]:
# compute and store the answer in the variable 'nm_values', then display it
nm_values=[]
for keysq3 in small_mapping:
    if keysq3.startswith("nm"): # the key must begin with "nm", not merely contain it
        nm_values.append(small_mapping[keysq3])
nm_values

['Jon M. Chu',
 'Constance Wu',
 'Henry Golding',
 'Michelle Yeoh',
 'Gemma Chan',
 'Lisa Lu',
 'Theodore Melfi',
 'Taraji P. Henson',
 'Octavia Spencer',
 'Janelle Monáe']

In [68]:
grader.check("q3")

**Question 4:** Find the **keys** of the people (keys **beginning** with *nm*) whose **last name** is *Spencer*.

Your output **must** be a **list** of **string(s)**.

**Requirements:** Your **code** must be robust and satisfy all the requirements, even if you were to run this on a larger dataset (such as `mapping.csv`). In particular:
1. You will **lose points** if your code would find people whose **first** name or **middle** name is *Spencer* (e.g. *Spencer Garrett* or *Charlie Spencer Clark*).
2. You will **lose points** if your code would find people whose **last** name contains *Spencer* as a **substring** (e.g. *Tara Spencer-Nairn*). The name should be **exactly** *Spencer*. 
3. You will **lose points** if your code would find any **movie titles** (e.g. *Meeting Spencer*).
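
The three requirements above can be checked against a small made-up mapping. The IDs in `sample_mapping` are hypothetical; the names are the examples quoted in the requirements, and the sketch assumes (as in the sample data) that keys for people begin with *nm*.

```python
# Hypothetical mapping entries, using the example names from the requirements.
sample_mapping = {
    "nm0000001": "Octavia Spencer",
    "nm0000002": "Spencer Garrett",
    "nm0000003": "Charlie Spencer Clark",
    "nm0000004": "Tara Spencer-Nairn",
    "tt0000005": "Meeting Spencer",
}

matching_keys = []
for key in sample_mapping:
    # only people: keys that begin with "nm"
    if key.startswith("nm"):
        name_parts = sample_mapping[key].split(" ")
        # the last word of the name must be exactly "Spencer"
        if name_parts[-1] == "Spencer":
            matching_keys.append(key)
```

Only the key for *Octavia Spencer* ends up in `matching_keys`: the first-name, middle-name, and hyphenated cases fail the last-word comparison, and the movie title is skipped by the key prefix check.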

In [69]:
# compute and store the answer in the variable 'nm_spencer', then display it
nm_spencer=[]
for keysq4 in small_mapping:
    if keysq4.startswith("nm"): # the key must begin with "nm", not merely contain it
        nil=small_mapping[keysq4].split(" ") # nil is short for "name in list"
        if nil[-1]=="Spencer":
            nm_spencer.append(keysq4)
        
        
nm_spencer

['nm0818055']

In [70]:
grader.check("q4")

#### Now, let's move on to reading the movie files!

### Function 2: `get_raw_movies(path)`

We require you to complete the below function to answer the next several questions (this is a **requirement**, and you will **lose points** if you do not implement this function).

This function **must** return a **list** of **dictionaries**, where each **dictionary** is of the following format:

```python
   {
        'title': <title-id>,
        'year': <the year as an integer>,
        'duration': <the duration as an integer>,
        'genres': [<genre1>, <genre2>, ...],
        'rating': <the rating as a float>,
        'directors': [<director-id1>, <director-id2>, ...],
        'cast': [<actor-id1>, <actor-id2>, ....]
    }
```

Here is an example:

```python
    {
        'title': 'tt0033313',
        'year': 1941,
        'duration': 59,
        'genres': ['Western'],
        'rating': 5.2,
        'directors': ['nm0496505'],
        'cast': ['nm0193318', 'nm0254381', 'nm0279961', 'nm0910294', 'nm0852305']
    }
```

You may copy/paste code from your lab-p8 notebook to finish this function.

In [71]:
def get_raw_movies(path):
    
    example_file = open(path, encoding="utf-8")
    example_reader = csv.reader(example_file)
    example_data = list(example_reader)
    example_file.close()
    
    header=example_data[0]
    content=example_data[1:]

    lodf2=[] # list of dictionaries in function 2

    for listf2 in content:
        innerdict={}
        for type_index in range(len(header)):
            if header[type_index] in ["genres","directors","cast"]:
                spec_cnt_str=listf2[type_index]
                spec_cnt_list=spec_cnt_str.split(", ")
                innerdict[header[type_index]]=spec_cnt_list
            elif header[type_index] in ["year","duration"]:
                innerdict[header[type_index]]=int(listf2[type_index])
            elif header[type_index] == "rating":
                innerdict[header[type_index]]=float(listf2[type_index])
            else:
                innerdict[header[type_index]]=listf2[type_index]
        lodf2.append(innerdict)

    return lodf2

**Question 5:** What is returned by `get_raw_movies("small_movies.csv")`?

Your output **must** be a **list** of **dictionaries** where each dictionary contains information about a movie.

In [72]:
# compute and store the answer in the variable 'raw_small_movies', then display it
raw_small_movies=get_raw_movies("small_movies.csv")
raw_small_movies

[{'title': 'tt3104988',
  'year': 2018,
  'duration': 120,
  'genres': ['Comedy', 'Drama', 'Romance'],
  'rating': 6.9,
  'directors': ['nm0160840'],
  'cast': ['nm2090422', 'nm6525901', 'nm0000706', 'nm2110418', 'nm0523734']},
 {'title': 'tt4846340',
  'year': 2016,
  'duration': 127,
  'genres': ['Biography', 'Drama', 'History'],
  'rating': 7.8,
  'directors': ['nm0577647'],
  'cast': ['nm0378245', 'nm0818055', 'nm1847117']}]

In [73]:
grader.check("q5")

If your answer looks correct, but does not pass `grader.check`, make sure that the **datatypes** are all correct. Also make sure that the **directors** and **cast**  are in the **same order** as in `small_movies.csv`.
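
One possible way to inspect the datatypes is sketched below, using the example dictionary from the description of `get_raw_movies`. The variable names `movie` and `field_types` are chosen only for this illustration.

```python
# Example dictionary in the format described for get_raw_movies.
movie = {
    'title': 'tt0033313',
    'year': 1941,
    'duration': 59,
    'genres': ['Western'],
    'rating': 5.2,
    'directors': ['nm0496505'],
    'cast': ['nm0193318', 'nm0254381', 'nm0279961', 'nm0910294', 'nm0852305']
}

# Collect the name of each value's type so mismatches are easy to spot.
field_types = {}
for key in movie:
    field_types[key] = type(movie[key]).__name__
```

If, for instance, `field_types['year']` turned out to be `'str'` instead of `'int'`, the conversion step was missed for that column.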

**Question 6:** How **many** cast members does the **first** movie have?

Your output **must** be an **int**.

In [74]:
# compute and store the answer in the variable 'num_cast_first_movie', then display it
num_cast_first_movie=len(raw_small_movies[0]["cast"]) # reuse the list computed in q5
num_cast_first_movie

5

In [75]:
grader.check("q6")

**Question 7:** What is the *ID* of the **first** cast member listed for the **first** movie of the dataset?

Your output **must** be a **string**.

In [76]:
# compute and store the answer in the variable 'first_actor_id_first_movie', then display it
first_actor_id_first_movie=raw_small_movies[0]["cast"][0] # reuse the list computed in q5
first_actor_id_first_movie

'nm2090422'

In [77]:
grader.check("q7")

### Function 3: `get_movies(movies_path, mapping_path)`

We require you to complete the below function to answer the next several questions (this is a **requirement**, and you will **lose points** if you do not implement this function).


This function **must** return a **list** of **dictionaries**, where each **dictionary** is of the following format:

```python
   {
        'title': "the movie name",
        'year': <the year as an integer>,
        'duration': <the duration as an integer>,
        'genres': [<genre1>, <genre2>, ...],
        'rating': <the rating as a float>,
        'directors': ["director-name1", "director-name2", ...],
        'cast': ["actor-name1", "actor-name2", ....]
    }
```

Here is an example:

```python
    {
        'title': 'Across the Sierras',
        'year': 1941,
        'duration': 59,
        'genres': ['Western'],
        'rating': 5.2,
        'directors': ['D. Ross Lederman'],
        'cast': ['Dick Curtis', 'Bill Elliott', 'Richard Fiske', 'Luana Walters', 'Dub Taylor']
    }
```

You may copy/paste code from your lab-p8 notebook to finish this function.

In [78]:
def get_movies(movies_path, mapping_path):
    lic=get_raw_movies(movies_path) #list in code
    docn=get_mapping(mapping_path) #dictionaries of corresponding names

    for everydict in lic:
        for keyf3 in everydict:
            if keyf3 in ["directors","cast"]:
                for codenameidx in range(len(everydict[keyf3])):
                    everydict[keyf3][codenameidx]=docn[everydict[keyf3][codenameidx]]
            elif keyf3 == "title":
                everydict[keyf3]=docn[everydict[keyf3]]

    return lic


**Question 8:** What is returned by `get_movies("small_movies.csv", "small_mapping.csv")`?

Your output **must** be a **list** of **dictionaries** where each dictionary contains information about a movie.

In [79]:
# compute and store the answer in the variable 'small_movies_data', then display it
small_movies_data=get_movies("small_movies.csv", "small_mapping.csv")
small_movies_data

[{'title': 'Crazy Rich Asians',
  'year': 2018,
  'duration': 120,
  'genres': ['Comedy', 'Drama', 'Romance'],
  'rating': 6.9,
  'directors': ['Jon M. Chu'],
  'cast': ['Constance Wu',
   'Henry Golding',
   'Michelle Yeoh',
   'Gemma Chan',
   'Lisa Lu']},
 {'title': 'Hidden Figures',
  'year': 2016,
  'duration': 127,
  'genres': ['Biography', 'Drama', 'History'],
  'rating': 7.8,
  'directors': ['Theodore Melfi'],
  'cast': ['Taraji P. Henson', 'Octavia Spencer', 'Janelle Monáe']}]

In [80]:
grader.check("q8")

**Question 9:** What is the `title` of the **second** movie in `small_movies_data`?

Your output **must** be a **string**.

In [81]:
# compute and store the answer in the variable 'second_movie_title_small_movies', then display it
second_movie_title_small_movies=small_movies_data[1]["title"]
second_movie_title_small_movies

'Hidden Figures'

In [82]:
grader.check("q9")

**Question 10:** Who are the `cast` members of the **second** movie in `small_movies_data`?

Your output **must** be a **list** of **string(s)**.

In [83]:
# compute and store the answer in the variable 'second_movie_cast_small_movies', then display it
second_movie_cast_small_movies=small_movies_data[1]["cast"]
second_movie_cast_small_movies

['Taraji P. Henson', 'Octavia Spencer', 'Janelle Monáe']

In [84]:
grader.check("q10")

**Question 11:** Who are the `directors` of the **last** movie in `small_movies_data`?

Your output **must** be a **list** of **string(s)**.

In [85]:
# compute and store the answer in the variable 'last_movie_directors_small_movies', then display it
last_movie_directors_small_movies=small_movies_data[-1]["directors"]
last_movie_directors_small_movies

['Theodore Melfi']

In [86]:
grader.check("q11")

#### Now that you’ve made it this far, your functions must be working pretty well with small datasets. Next, let's try a much bigger dataset!

Run the following code to open the full dataset:

In [87]:
movies = get_movies("movies.csv", "mapping.csv")
len(movies)

102668

As the files are very large, this cell is expected to take around ten seconds to run. If it takes much longer (say, around a minute), then you will **need** to **optimize** your `get_movies` function so it runs faster.

**Warning**: You are **not** allowed to call `get_movies` more than once on the full dataset (`movies.csv` and `mapping.csv`) in your notebook. Instead, reuse the `movies` variable, which is more efficient. You will **lose points** during manual review if you call `get_movies` again on these files.

**Warning:** Do **not** display the value of the variable `movies` **anywhere** in your notebook. It will take up a **lot** of space, and your **Gradescope code will not be displayed** for grading. So, you will receive **zero points** for p8. Instead you should verify `movies` has the correct value by looking at a small *slice* of the **list** as in the question below. 

**Question 12:** What are the movies in `movies[20200:20220]`?

Your answer should be a *list* of *dictionaries* that follows the format below:

```python
[{'title': 'Aliens in the Attic',
  'year': 2009,
  'duration': 86,
  'genres': ['Adventure', 'Comedy', 'Family'],
  'rating': 5.4,
  'directors': ['John Schultz'],
  'cast': ['Ashley Tisdale',
   'Robert Hoffman',
   'Carter Jenkins',
   'Austin Butler']},
 {'title': 'Dark Buenos Aires',
  'year': 2010,
  'duration': 90,
  'genres': ['Thriller'],
  'rating': 4.8,
  'directors': ['Ramon Térmens'],
  'cast': ['Francesc Garrido',
   'Daniel Faraldo',
   'Natasha Yarovenko',
   'Julieta Díaz']},
 ...
]
```

In [88]:
# compute and store the answer in the variable 'movies_20200_20220', then display it
movies_20200_20220=movies[20200:20220]
movies_20200_20220

[{'title': 'Aliens in the Attic',
  'year': 2009,
  'duration': 86,
  'genres': ['Adventure', 'Comedy', 'Family'],
  'rating': 5.4,
  'directors': ['John Schultz'],
  'cast': ['Ashley Tisdale',
   'Robert Hoffman',
   'Carter Jenkins',
   'Austin Butler']},
 {'title': 'Dark Buenos Aires',
  'year': 2010,
  'duration': 90,
  'genres': ['Thriller'],
  'rating': 4.8,
  'directors': ['Ramon Térmens'],
  'cast': ['Francesc Garrido',
   'Daniel Faraldo',
   'Natasha Yarovenko',
   'Julieta Díaz']},
 {'title': 'The Bank Shot',
  'year': 1974,
  'duration': 83,
  'genres': ['Comedy', 'Crime'],
  'rating': 5.4,
  'directors': ['Gower Champion'],
  'cast': ['George C. Scott', 'Joanna Cassidy', 'Sorrell Booke', 'G. Wood']},
 {'title': 'Complicity',
  'year': 2013,
  'duration': 81,
  'genres': ['Drama', 'Thriller'],
  'rating': 4.1,
  'directors': ['C.B. Harding'],
  'cast': ['Sean Young', 'Jenna Boyd', 'Heather Hemmens', 'Haley Ramm']},
 {'title': "Russia's Toughest Prisons",
  'year': 2011,
  '

In [89]:
grader.check("q12")

**Question 13:** What is the **number** of movies released in the `year` *2018*?

Your output **must** be an **int**.

In [90]:
# compute and store the answer in the variable 'num_movies_2018', then display it
num_movies_2018=0

for eachmov in movies:
    # compare only the 'year' field, so that other fields equal to 2018 are never counted
    if eachmov["year"]==2018:
        num_movies_2018+=1

num_movies_2018

4262

In [91]:
grader.check("q13")

### Function 4: `find_specific_movies(movies, keyword)`

Now that we have created this data structure `movies`, we can start doing some fun things with the data!
We will continue working on this data structure for the next project (p9) as well.

Let us now use this data structure `movies` to create a **search bar** like the one in Netflix!
**Do not change the below function in any way**.
This function takes in a keyword like a substring of a title, a genre, or the name of a person, and returns a list of relevant movies with that title, genre, or cast member/director.

**Warning:** As `movies` is very large, the function `find_specific_movies` may take five to ten seconds to run. This is normal and you should not panic if it takes a while to run.

In [92]:
# DO NOT EDIT OR REDEFINE THIS FUNCTION
def find_specific_movies(movies, keyword):
    idx = 0
    while idx < len(movies):
        movie = movies[idx]
        # note: \ enables you split a long line of code into two lines
        if (keyword not in movie['title']) and (keyword not in movie["genres"]) \
        and (keyword not in movie["directors"]) and (keyword not in movie["cast"]):
            movies.pop(idx)
        else:
            idx += 1
    return movies

In [93]:
# movies[1:3]

**Important:** While it might look as if we are making it easy for you by providing `find_specific_movies`, there is a catch! There is a subtle flaw with the way the function is defined, that will cause you issues in the next two questions. If you can spot this flaw by just observing the definition of `find_specific_movies`, congratulations! Since you are **not** allowed to modify the function definition, you will have to be a little clever with your function arguments to sidestep the flaw with the function definition.

If you don't see anything wrong with the function just yet, don't worry about it. Solve q14 and q15 as you normally would, and see if you notice anything suspicious about your answers.

**Question 14:** List all the movies that *Katharine Hepburn* acted in.

Your answer **must** be a **list** of **dictionaries**.

You **must** answer this question by calling `find_specific_movies` with the keyword `"Katharine Hepburn"`.

The `find_specific_movies` function is expected to take around 5 seconds to run, so do not panic if it takes a while.

In [94]:
# compute and store the answer in the variable 'hepburn_films', then display it
movq14=copy.deepcopy(movies)
hepburn_films=find_specific_movies(movq14,"Katharine Hepburn")
hepburn_films

[{'title': 'Little Women',
  'year': 1933,
  'duration': 115,
  'genres': ['Drama', 'Family', 'Romance'],
  'rating': 7.2,
  'directors': ['George Cukor'],
  'cast': ['Katharine Hepburn',
   'Joan Bennett',
   'Paul Lukas',
   'Edna May Oliver']},
 {'title': 'Desk Set',
  'year': 1957,
  'duration': 103,
  'genres': ['Comedy', 'Romance'],
  'rating': 7.2,
  'directors': ['Walter Lang'],
  'cast': ['Spencer Tracy',
   'Katharine Hepburn',
   'Gig Young',
   'Joan Blondell']},
 {'title': 'Woman of the Year',
  'year': 1942,
  'duration': 114,
  'genres': ['Comedy', 'Drama', 'Romance'],
  'rating': 7.2,
  'directors': ['George Stevens'],
  'cast': ['Spencer Tracy',
   'Katharine Hepburn',
   'Fay Bainter',
   'Reginald Owen']},
 {'title': 'Quality Street',
  'year': 1937,
  'duration': 83,
  'genres': ['Comedy', 'Drama', 'Romance'],
  'rating': 6.2,
  'directors': ['George Stevens'],
  'cast': ['Katharine Hepburn', 'Franchot Tone', 'Eric Blore', 'Fay Bainter']},
 {'title': 'Love Affair',


In [95]:
grader.check("q14")

**Question 15:** List all the movies that contain the string *Wisconsin* in their `title`.

Your answer **must** be a **list** of **dictionaries**.

You **must** answer this question by calling `find_specific_movies` with the keyword `"Wisconsin"`.

**Important Hint:** If you did not notice the flaw with the definition of `find_specific_movies` before, you are likely to have run into an issue with this question. It is likely that you will see that your output for this question is an empty list. To see why this happened, find the value of `len(movies)` and see if it is equal to the value you found earlier.

Remember that you are **not** allowed to modify the definition of `find_specific_movies`. You will need to cleverly pass arguments to `find_specific_movies` (in both q14 and q15) to ensure that `movies` does not get modified by the function calls. Take a look at the [lecture slides](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/f22/meena_lec_notes/lec-21) from October 26 for more hints. You will have to Restart and Run all your cells to see the correct output after you fix your answer for q14 (and q15).
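
The effect described above can be reproduced on a tiny list. The function `drop_short_titles` below is a hypothetical stand-in that, like `find_specific_movies`, removes items from the very list it receives; it is not part of the project.

```python
import copy

def drop_short_titles(titles):
    # Hypothetical function that pops items from the list passed to it.
    idx = 0
    while idx < len(titles):
        if len(titles[idx]) < 5:
            titles.pop(idx)
        else:
            idx += 1
    return titles

original = ["Up", "Hidden Figures", "Jaws", "Crazy Rich Asians"]

# Passing a copy keeps the original list intact.
safe_result = drop_short_titles(copy.copy(original))
length_after_copy_call = len(original)

# Passing the list itself modifies it in place.
unsafe_result = drop_short_titles(original)
length_after_direct_call = len(original)
```

In this sketch a shallow copy (`copy.copy`) is enough, because the function only removes elements from the list and never changes the elements themselves. `copy.deepcopy` would also protect the original, at the cost of copying every element as well.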

In [96]:
#len(movies)

In [97]:
# compute and store the answer in the variable 'wisconsin_movies', then display it
movq15=copy.deepcopy(movies)
wisconsin_movies=find_specific_movies(movq15,"Wisconsin")
wisconsin_movies

[{'title': 'Wisconsin Death Trip',
  'year': 1999,
  'duration': 76,
  'genres': ['Biography', 'Crime', 'Drama'],
  'rating': 6.6,
  'directors': ['James Marsh'],
  'cast': ['Ian Holm', 'Jeffrey Golden', 'Jo Vukelich', 'Marcus Monroe']},
 {'title': 'Bootleg Wisconsin',
  'year': 2008,
  'duration': 73,
  'genres': ['Drama'],
  'rating': 7.7,
  'directors': ['Brandon Linden'],
  'cast': ['Lepolion Henderson',
   'Angela Harris',
   'Alissa Bailey',
   'Joyce Porter']},
 {'title': 'Wisconsin Supper Clubs: An Old Fashioned Experience',
  'year': 2011,
  'duration': 55,
  'genres': ['Documentary', 'History'],
  'rating': 6.7,
  'directors': ['Ron Faiola'],
  'cast': ['Bun E. Carlos']},
 {'title': 'Small Town Wisconsin',
  'year': 2020,
  'duration': 109,
  'genres': ['Comedy', 'Drama'],
  'rating': 7.3,
  'directors': ['Niels Mueller'],
  'cast': ['David Sullivan',
   'Bill Heck',
   'Kristen Johnston',
   'Cooper J. Friedman']}]

In [98]:
grader.check("q15")

### Function 5: `bucketize_by_genre(movies)`

We require you to complete the below function to answer the next several questions (this is a **requirement**, and you will **lose points** if you do not implement this function).

In [99]:
def bucketize_by_genre(movies):
    # maps each genre to the list of movies (in their original order) with that genre
    genre_buckets={}

    # a single pass over the movies avoids looping over the whole dataset once per genre
    for movie in movies:
        for genre in movie["genres"]:
            if genre not in genre_buckets:
                genre_buckets[genre]=[]
            genre_buckets[genre].append(movie)

    return genre_buckets

In [100]:
# call the function bucketize_by_genre on 'movies' and store it in the variable 'genre_dict'
# do NOT display the output directly
movfunc5=copy.deepcopy(movies)
genre_dict = bucketize_by_genre(movfunc5)


**Warning:** You are **not** allowed to call `bucketize_by_genre` more than once on the full list of movies (`movies`) in your notebook. You will **lose points** during manual review if you call `bucketize_by_genre` again on `movies`.

**Question 16:** How many **unique** movie `genres` are present in the dataset?

In [101]:
# compute and store the answer in the variable 'num_genres', then display it
num_genres=len(genre_dict)
num_genres

26

In [102]:
grader.check("q16")

**Question 17:** How many *Music* movies (i.e. movies with *Music* as one of their `genres`) do we have in the dataset released **after** the `year` *2019*?

Your output **must** be an **int**. You **must** use the `genre_dict` data structure to answer this question.

In [103]:
# compute and store the answer in the variable 'music_after_2019', then display it
music_after_2019=0
# look up the 'Music' bucket directly instead of looping over every genre
for items in genre_dict["Music"]:
    if items["year"]>2019:
        music_after_2019+=1

music_after_2019

153

In [104]:
grader.check("q17")

**Question 18:** List the `title` of all *Horror* movies (i.e. movies with *Horror* as one of their `genres`) with `rating` **larger** than *9.0* in the dataset.

Your output **must** be a **list** of **strings**. You **must** use the `genre_dict` data structure to answer this question.

In [105]:
# compute and store the answer in the variable 'horror_movies_above_9', then display it
horror_movies_above_9=[]
# look up the 'Horror' bucket directly instead of looping over every genre
for items in genre_dict["Horror"]:
    if items["rating"]>9.0:
        horror_movies_above_9.append(items["title"])

horror_movies_above_9

['American Barbarian',
 'The Children Under the House',
 'Santhoshathil Kalavaram',
 'La Bruja',
 'Laurence',
 'Muttnik',
 'Cold Calm',
 'Heavy Makeup',
 'Girls on a Boat',
 'Phantom Summer']

In [106]:
grader.check("q18")

**Question 19:** In which movie `genre` has *Jennifer Aniston* acted the most?

There is a **unique** `genre` in which *Jennifer Aniston* has acted the most. You do **not** have to worry about breaking ties.

**Hint:** You can combine the *two* functions above to bucketize the movies that *Jennifer Aniston* has acted in by their `genres`. Then, you can loop through each genre to find the one with the largest number of movies in it.

In [107]:
# compute and store the answer in the variable 'jen_aniston_genre', then display it
movq19=copy.deepcopy(movies)
jamov=[]
for items in movq19:
    if "Jennifer Aniston" in items["cast"]:
        jamov.append(items)

jamovbygenre=bucketize_by_genre(jamov)
jen_aniston_genre=""
jatimes=0
for keyq19 in jamovbygenre:
    if len(jamovbygenre[keyq19])>jatimes:
        jatimes=len(jamovbygenre[keyq19])
        jen_aniston_genre=keyq19

jen_aniston_genre

'Comedy'

In [108]:
grader.check("q19")

**Question 20:** Who are the `directors` of the *Documentary* movies with the **highest** `rating` in the movies dataset?

There are **multiple** *Documentary* movies in the dataset with the joint highest rating. You **must** output a **list** of **strings** containing the **names** of **all** the `directors` of **all** these movies.

**Hint:** If you are unsure how to efficiently add the elements of one list to another, take a look at the [lecture slides](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/f22/meena_lec_notes/lec-14) from October 10.
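
As a quick reminder of the difference between the two list methods, here is a minimal sketch. The director names are placeholders invented for this example.

```python
# Hypothetical director lists for two movies.
first_directors = ["Director A"]
second_directors = ["Director B", "Director C"]

# append adds its argument as a single element, producing a nested list
nested = []
nested.append(first_directors)
nested.append(second_directors)

# extend adds each element of its argument, producing a flat list
flat = []
flat.extend(first_directors)
flat.extend(second_directors)
```

Here `nested` is a list of lists, while `flat` is a list of strings, which is the shape this question asks for.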

In [109]:
# compute and store the answer in the variable 'max_docu_rating_directors', then display it
max_docu_rating_directors=[]
maxrating=None
# reuse 'genre_dict' instead of calling bucketize_by_genre on the full dataset again
documov=genre_dict["Documentary"]
# get the highest rating:
for eachmov in documov:
    if maxrating is None or eachmov["rating"]>maxrating:
        maxrating=eachmov["rating"]

# collect the directors of every movie that has the highest rating
for eachmov in documov:
    if eachmov["rating"]==maxrating:
        max_docu_rating_directors.extend(eachmov["directors"])

max_docu_rating_directors



['Michael Kirk',
 'A.J. Martinson',
 'Anthony Moffat',
 'Jason Harney',
 'Thomas A. Morgan']

In [110]:
grader.check("q20")

## Submission
Make sure you have run all cells in your notebook in order before running the following cells, so that all images/graphs appear in the output.
It is recommended that at this stage, you Restart and Run all Cells in your notebook.
That will automatically save your work and generate a zip file for you to submit.

**SUBMISSION INSTRUCTIONS**:
1. **Upload** the zipfile to Gradescope.
2. Check **Gradescope otter** results as soon as the auto-grader execution gets completed. Don't worry about the score showing up as -/100.0. You only need to check that the test cases passed.

In [111]:
# running this cell will create a new save checkpoint for your notebook
from IPython.display import display, Javascript
display(Javascript('IPython.notebook.save_checkpoint();'))

<IPython.core.display.Javascript object>

In [112]:
!jupytext --to py p8.ipynb

[jupytext] Reading p8.ipynb in format ipynb
[jupytext] Updating the timestamp of p8.py


In [113]:
p8_test.check_file_size("p8.ipynb")
grader.export(pdf=False, run_tests=True, files=[py_filename])

AssertionError: Your file is too big to be processed by Gradescope; please delete unnecessary output cells so your file size is < 300 KB