In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("practice.ipynb")

In [2]:
import practice_test

# Lab-P6: Real-world Datasets (Airbnb)

## Segment 2: Loading Data from CSVs

### About the dataset

Now would be a good time to open `airbnb.csv` with Microsoft Excel (or some other Spreadsheet viewing software) and have a look at the data. The first few rows of the dataset are reproduced here:

room_id|name|host_id|host_name|neighborhood_group|neighborhood|latitude|longitude|room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365
------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
2539|Clean & quiet apt home by the park|2787|John|Brooklyn|Kensington|40.64749000000001|-73.97237|Private room|149|1|9|2018-10-19|0.21|6|365
2595|Skylit Midtown Castle|2845|Jennifer|Manhattan|Midtown|40.75362|-73.98376999999998|Entire home/apt|225|1|45|2019-05-21|0.38|2|355
3647|THE VILLAGE OF HARLEM....NEW YORK !|4632|Elisabeth|Manhattan|Harlem|40.80902|-73.9419|Private room|150|3|0|||1|365
3831|Cozy Entire Floor of Brownstone|4869|LisaRoxanne|Brooklyn|Clinton Hill|40.68514|-73.95976|Entire home/apt|89|1|270|2019-07-05|4.64|1|194
5022|Entire Apt: Spacious Studio/Loft by central park|7192|Laura|Manhattan|East Harlem|40.79851|-73.94399|Entire home/apt|80|10|9|2018-11-19|0.1|1|0
5099|Large Cozy 1 BR Apartment In Midtown East|7322|Chris|Manhattan|Murray Hill|40.74767|-73.975|Entire home/apt|200|3|74|2019-06-22|0.59|1|129

The `airbnb.csv` file has data about nearly 50,000 listings on Airbnb from New York City, NY from the year 2019. Each row in the file contains data about a *single listing*. The columns contain the following data about each listing (along with the correct data type you **must** represent it as):

1. `room_id` - the ID of the room listing (`str`)
2. `name` - the name of the room listing (`str`)
3. `host_id` - the ID of the host for the room listing (`str`)
4. `host_name` - the name of the host for the room listing (`str`)
5. `neighborhood_group` - the group of neighborhoods the room is in (`str`)
6. `neighborhood` - the neighborhood the room is in (`str`)
7. `latitude` - the latitude where the room is located (`float`)
8. `longitude` - the longitude where the room is located (`float`)
9. `room_type` - the type of room (`str`)
10. `price` - the price per night for the room in US dollars (`int`)
11. `minimum_nights` - the minimum amount of nights the room must be booked for (`int`)
12. `number_of_reviews` - the total number of reviews the room has received (`int`)
13. `last_review` - the date of the most recent review in the form yyyy-mm-dd (`str`)
14. `reviews_per_month` - how many reviews per month the room receives (`float`)
15. `calculated_host_listings_count` - how many listings the host of the room has (`int`)
16. `availability_365` - how many days per year the listing is available for (`int`)

**Warning:** While writing your project, keep in mind that some entries may be **missing data** for specific columns. Real-world data is often messy, and in p6 we will have to deal with missing data.
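To see what a missing entry looks like once it has been read by Python, here is a minimal sketch that parses a tiny in-memory CSV with the `csv` module (the same module used in Task 2.1 below). The sample text and the names `sample_text` and `sample_rows` are made up for illustration; they are not part of `airbnb.csv`.

```python
import csv
import io

# two made-up rows in the style of airbnb.csv; the second row has no review data
sample_text = (
    "room_id,number_of_reviews,last_review,reviews_per_month\n"
    "2539,9,2018-10-19,0.21\n"
    "3647,0,,\n"
)

sample_rows = list(csv.reader(io.StringIO(sample_text)))

# a missing value is read as an empty string, not as None or 0
print(sample_rows[2])
```

The last row should come back with two empty strings at the end, which is why code that converts values to `int` or `float` has to check for `""` first.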

### Task 2.1: Processing the CSV file

You will now read this dataset with Python. [Chapter 14](https://automatetheboringstuff.com/chapter14/) of Automate the Boring Stuff introduces CSV files and provides a code snippet you can reuse. You can use the same code snippet for p6. Run the next few cells and see their outputs.

In [3]:
# it is considered a good coding practice to place all import statements at the top of the notebook
# please place all your import statements in this cell if you need to import any more modules for this lab

import csv # we have imported this module for you; it is required by the process_csv function below

In [6]:
# modified from https://automatetheboringstuff.com/chapter14/
def process_csv(filename):
    example_file = open(filename, encoding="utf-8")
    example_reader = csv.reader(example_file)
    example_data = list(example_reader)
    example_file.close()
    return example_data
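As an optional aside, the same reading logic can be written with a `with` statement, which closes the file automatically even if an error occurs while reading. This is only a sketch: the name `process_csv_with` is hypothetical, and you should keep using the `process_csv` function above for the lab and for p6.

```python
import csv

def process_csv_with(filename):
    """hypothetical variant of process_csv that closes the file automatically"""
    # newline="" is the setting the csv module documentation recommends for file objects
    with open(filename, encoding="utf-8", newline="") as csv_file:
        return list(csv.reader(csv_file))
```

Calling `process_csv_with("airbnb.csv")` should produce the same list of lists as `process_csv("airbnb.csv")`.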

In [5]:
# this call to process_csv reads the data in "airbnb.csv"
csv_data = process_csv("airbnb.csv")

# this will display the first three items in the list `csv_data`
csv_data[:3]

[['room_id',
  'name',
  'host_id',
  'host_name',
  'neighborhood_group',
  'neighborhood',
  'latitude',
  'longitude',
  'room_type',
  'price',
  'minimum_nights',
  'number_of_reviews',
  'last_review',
  'reviews_per_month',
  'calculated_host_listings_count',
  'availability_365'],
 ['2539',
  'Clean & quiet apt home by the park',
  '2787',
  'John',
  'Brooklyn',
  'Kensington',
  '40.64749000000001',
  '-73.97237',
  'Private room',
  '149',
  '1',
  '9',
  '2018-10-19',
  '0.21',
  '6',
  '365'],
 ['2595',
  'Skylit Midtown Castle',
  '2845',
  'Jennifer',
  'Manhattan',
  'Midtown',
  '40.75362',
  '-73.98376999999998',
  'Entire home/apt',
  '225',
  '1',
  '45',
  '2019-05-21',
  '0.38',
  '2',
  '355']]

The variable `csv_data` stores the contents of the file `airbnb.csv` as a **list of lists** (i.e., `csv_data` is a **list**, and the elements of this list are **lists** themselves). In the next subsection, you will learn to access data stored within this data structure.

### Task 2.2: Accessing the contents of the dataset

You will now index the data to extract the correct answers for the questions listed below. Some have been done for you. To understand the results better, locate the values in the `airbnb.csv` file.

**Question 1:** What are the `names` of the **columns** in the dataset?

**Hint:** Take a look at the output of the cell above and see where the header data is stored. Use **indexing** to extract the csv header.

In [8]:
# replace the ... with your code
csv_header = csv_data[0] # A list of the column headers

csv_header

['room_id',
 'name',
 'host_id',
 'host_name',
 'neighborhood_group',
 'neighborhood',
 'latitude',
 'longitude',
 'room_type',
 'price',
 'minimum_nights',
 'number_of_reviews',
 'last_review',
 'reviews_per_month',
 'calculated_host_listings_count',
 'availability_365']

In [9]:
grader.check("q1")

It would be a good idea to use **slicing** to extract all the inner lists (except the header), and store them in a variable you can use later.

In [10]:
csv_rows = csv_data[1:] # extract the entire CSV data set besides the header
# do NOT attempt to display csv_data (i.e., do not print out the variable
# by adding the variable name as an additional line at the end of this cell);
# it has nearly 50,000 lists, and will take up unnecessary space
# you can confirm that you extracted the data correctly by answering the next few questions

**Question 2.1:** How many **rows** are in the dataset (excluding the **header**)?

In [80]:
# we have done this one for you
num_rows = len(csv_rows)

num_rows

48895

In [68]:
grader.check("q2-1")

**Question 2.2:** What are the **first** ten listings in the dataset?

In [13]:
# we have done this one for you
first_ten_rows = csv_rows[:10]

first_ten_rows

[['2539',
  'Clean & quiet apt home by the park',
  '2787',
  'John',
  'Brooklyn',
  'Kensington',
  '40.64749000000001',
  '-73.97237',
  'Private room',
  '149',
  '1',
  '9',
  '2018-10-19',
  '0.21',
  '6',
  '365'],
 ['2595',
  'Skylit Midtown Castle',
  '2845',
  'Jennifer',
  'Manhattan',
  'Midtown',
  '40.75362',
  '-73.98376999999998',
  'Entire home/apt',
  '225',
  '1',
  '45',
  '2019-05-21',
  '0.38',
  '2',
  '355'],
 ['3647',
  'THE VILLAGE OF HARLEM....NEW YORK !',
  '4632',
  'Elisabeth',
  'Manhattan',
  'Harlem',
  '40.80902',
  '-73.9419',
  'Private room',
  '150',
  '3',
  '0',
  '',
  '',
  '1',
  '365'],
 ['3831',
  'Cozy Entire Floor of Brownstone',
  '4869',
  'LisaRoxanne',
  'Brooklyn',
  'Clinton Hill',
  '40.68514',
  '-73.95976',
  'Entire home/apt',
  '89',
  '1',
  '270',
  '2019-07-05',
  '4.64',
  '1',
  '194'],
 ['5022',
  'Entire Apt: Spacious Studio/Loft by central park',
  '7192',
  'Laura',
  'Manhattan',
  'East Harlem',
  '40.79851',
  '-73.9

In [14]:
grader.check("q2-2")

**Question 2.3:** What are the **last** ten listings in the dataset?

In [15]:
# we have done this one for you
last_ten_rows = csv_rows[-10:]

last_ten_rows

[['36482809',
  'Stunning Bedroom NYC! Walking to Central Park!!',
  '131529729',
  'Kendall',
  'Manhattan',
  'East Harlem',
  '40.79633',
  '-73.93605',
  'Private room',
  '75',
  '2',
  '0',
  '',
  '',
  '2',
  '353'],
 ['36483010',
  'Comfy 1 Bedroom in Midtown East',
  '274311461',
  'Scott',
  'Manhattan',
  'Midtown',
  '40.75561',
  '-73.96723',
  'Entire home/apt',
  '200',
  '6',
  '0',
  '',
  '',
  '1',
  '176'],
 ['36483152',
  'Garden Jewel Apartment in Williamsburg New York',
  '208514239',
  'Melki',
  'Brooklyn',
  'Williamsburg',
  '40.71232',
  '-73.9422',
  'Entire home/apt',
  '170',
  '1',
  '0',
  '',
  '',
  '3',
  '365'],
 ['36484087',
  'Spacious Room w/ Private Rooftop, Central location',
  '274321313',
  'Kat',
  'Manhattan',
  "Hell's Kitchen",
  '40.76392',
  '-73.99183000000002',
  'Private room',
  '125',
  '4',
  '0',
  '',
  '',
  '1',
  '31'],
 ['36484363',
  'QUIT PRIVATE HOUSE',
  '107716952',
  'Michael',
  'Queens',
  'Jamaica',
  '40.69137',
 

In [16]:
grader.check("q2-3")

**In general, when you want to confirm that you are reading a very large file correctly, it is a good idea to check that you have the correct number of rows, and that the first and last few rows are correct. Here, you were given access to `practice_test.py`, which has the correct answers, so it was easy for you to check. Otherwise, you would have to manually open `airbnb.csv` and confirm that you have not made any mistakes. It is recommended that you manually open `airbnb.csv` in any case to verify that the data matches your answers for the previous three questions.**

**Question 3:** What values are present in the **first row** of the dataset?

In [17]:
# replace the ... with your code
first_row = csv_rows[0]

first_row

['2539',
 'Clean & quiet apt home by the park',
 '2787',
 'John',
 'Brooklyn',
 'Kensington',
 '40.64749000000001',
 '-73.97237',
 'Private room',
 '149',
 '1',
 '9',
 '2018-10-19',
 '0.21',
 '6',
 '365']

In [18]:
grader.check("q3")

You now know how to extract a single **row** (or a list of rows) from the file `airbnb.csv`. You will now extract data from a single **cell** of the file.

To extract data from a single cell of the csv file, we need two things:
  1. row index
  2. column index
    
You already know how to extract a row of data with `csv_rows[row_idx]`. Given this list, can you now extract the data in a particular cell using the **column index**?
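The two-step idea can be sketched on a made-up miniature dataset. The names `toy_rows`, `second_row`, and `second_host` are assumptions for illustration and do not come from `airbnb.csv`.

```python
# a made-up miniature of the list-of-lists structure (not real airbnb.csv data)
toy_rows = [
    ["101", "Ada", "Queens"],
    ["102", "Grace", "Bronx"],
]

second_row = toy_rows[1]     # step 1: pick the row with the row index
second_host = second_row[1]  # step 2: pick the cell with the column index

# both steps can also be chained in a single expression
print(second_host, toy_rows[1][1])
```

Both expressions refer to the same cell, so the two printed values should be identical.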

**Question 4:** What is the `host_name` of the **first** listing?

**Hint:** The **column index** for the `host_name` column is *3*. You may **hardcode** the **column index** as *3* **just for this question**.

In [19]:
# replace the ... with your code
first_row = csv_rows[0] # extract the first row of the dataset
first_host_name = first_row[3] # extract the host name

first_host_name

'John'

In [20]:
grader.check("q4")

You solved the previous question by **hardcoding** the **column index**, when you were just given the **column name**. This is however a **bad practice**, and you **must not** do it in your project. It would be much safer to somehow **extract** the **column index** **from** the **column name**, and then use the **column index**. The following (built-in) list method helps us with that:

### List method: `index`

**Syntax:** `list.index(column_name)`

This method returns the index of the first occurrence of the item `column_name` in the `list`. You can see this method in action in the question below.
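Here is a small sketch of how `index` behaves, including what happens when the item is not in the list. The list `toy_header` is made up for illustration.

```python
toy_header = ["room_id", "name", "price"]  # made-up header for illustration

print(toy_header.index("price"))  # position of the first match

# asking for an item that is not in the list raises a ValueError
try:
    toy_header.index("Price")  # the comparison is case-sensitive
except ValueError as error:
    print("not found:", error)
```

A misspelled or wrongly capitalized column name therefore fails loudly instead of silently returning a wrong index.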

**Question 5:** What is the **index** of the column `neighborhood_group`?

In [21]:
# we have done this one for you
nbhd_group_index = csv_header.index('neighborhood_group')

nbhd_group_index

4

In [22]:
grader.check("q5")

Can you make sense of the code above? If not, please request your TA/PM to help you before proceeding further.

**Question 6:** What is the **value** in the `neighborhood_group` column of the **first** row?

You **must** use the `index` method to extract the **column index** from the given **column name**.

In [23]:
# replace the ... with your code
nbh_group_row1 = csv_rows[0][nbhd_group_index] # do NOT hardcode the number; use the index method to find the relevant column index

nbh_group_row1

'Brooklyn'

In [24]:
grader.check("q6")

### Task 2.3: Build a helper function for quick data access

It is quite cumbersome to extract data from `airbnb.csv` by indexing `csv_rows` and using the `index` method each time. To save yourself some time and effort, fill in the details of the following helper function.

After finishing the definition of the `cell` function below, you **must only** use this function to extract data from `airbnb.csv`. In p6, you will **lose points** if you attempt to extract data from `airbnb.csv` without **explicitly** calling the `cell` function.

In [25]:
def cell(row_idx, col_name):
    """
    cell(row_idx, col_name) returns the data value (cell) 
    corresponding to the row index `row_idx`
    and the column name `col_name` of a CSV file
    """
    # replace the ... with your code
    col_idx = csv_header.index(col_name)
    val = csv_rows[row_idx][col_idx]
    if val == "": # when we come across missing data, we return None instead
        return None
    return val

**Question 7.1:** What is the `neighborhood` of the **first** listing?

You **must** answer this question by calling the `cell` function.

In [26]:
# replace the ... with your code
first_neighborhood = cell(0, "neighborhood")

first_neighborhood

'Kensington'

In [27]:
grader.check("q7-1")

**Question 7.2:** What is the `name` of the **second** listing?

You **must** answer this question by calling the `cell` function.

In [28]:
# replace the ... with your code
second_name = cell(1, "name")

second_name

'Skylit Midtown Castle'

In [29]:
grader.check("q7-2")

**Question 7.3:** What is the `price` of the **third** listing?

You **must** answer this question by calling the `cell` function.

In [30]:
# replace the ... with your code
third_price = cell(2, "price")

third_price

'150'

In [31]:
grader.check("q7-3")

**Question 8:** How many rooms are listed in the `neighborhood_group` *Bronx*?

You **must** use `cell` if you want to extract any data from the csv file.

**Hint:** You must loop through the entire dataset. Use `cell` to extract the `neighborhood_group` of each room.

In [32]:
# replace the ... with your code
bronx_num_rooms = 0 # initialize with the correct value
for idx in range(num_rows): # loop through all the indices 
    if cell(idx, "neighborhood_group") == "Bronx": # use `cell` to determine if the listing at `idx` is from the correct neighborhood_group
        bronx_num_rooms += 1 # update the variable appropriately

bronx_num_rooms

1091

In [33]:
grader.check("q8")

**Question 9:** List the `host_names` of all the rooms in the `neighborhood` *University Heights*.

Your output **must** be a *list*. You **must** use `cell` if you want to extract any data from the csv file.

**Hint:** Loop through the entire dataset and use `cell` to determine if each room is in the correct `neighborhood`. Use `cell` once again to extract the `host_name` of each listing, and use the `append` list method to add the `host_name` to your list. 

In [34]:
# replace the ... with your code
univ_heights_names = [] # initialize as an empty list
for idx in range(num_rows): # loop through all the indices 
    if cell(idx, "neighborhood") == "University Heights": # use `cell` to determine if the listing at `idx` is from the correct neighborhood
        univ_heights_names.append(cell(idx, "host_name")) # use `cell` to append the host_name to the list

univ_heights_names

['Vanessa',
 'Rossy,  Carmen And Juan',
 'Rossy,  Carmen And Juan',
 'Rossy,  Carmen And Juan',
 'Maris',
 'Brais',
 'Justine',
 'Rossy,  Carmen And Juan',
 'Monica A',
 'Justine',
 'Justine',
 'Pedro',
 'Justine',
 'Mary',
 'Elizabeth',
 'Christophe',
 'Pierpaolo',
 'Oscar',
 'Henry',
 'Mel',
 'Justine']

In [35]:
grader.check("q9")

**Important**: Raise your hand and confirm your implementation with a TA. We'll use the `cell` function for all remaining tasks in this lab and throughout the project.

## Segment 3: Sorting Data

There are two major ways to sort lists in Python: (1) with the `sorted` function and (2) with the `.sort` method. For each method, let's examine (a) how it modifies existing structures, and (b) what new values it returns, if any.

The default sorting order is ascending. You can change it to descending by passing the keyword argument `reverse=True`. The same keyword argument works for both the `sort` method and the `sorted` function.

### Task 3.1: Sort lists using `.sort()`

**Question 10.1:** What is the list of `neighborhood` names of the **first three** listings in the dataset?

In [36]:
# fetch the neighborhood names for the first three rows in the dataset
neighborhood1 = cell(0,"neighborhood") # we have done this one for you
# replace the ... with your code
neighborhood2 = cell(1,"neighborhood")
neighborhood3 = cell(2,"neighborhood")
# initialize a list with the three neighborhood names as elements
first_three_nbhds = [neighborhood1, neighborhood2, neighborhood3]

first_three_nbhds

['Kensington', 'Midtown', 'Harlem']

In [37]:
grader.check("q10-1")

**Question 10.2:** What does the function call `first_three_nbhds.sort()` do?

Sort the three names in alphabetical order

In [38]:
# we have done this one for you
result = first_three_nbhds.sort()

print("Returned value:", result)
print("List after sorting:", first_three_nbhds)

Returned value: None
List after sorting: ['Harlem', 'Kensington', 'Midtown']


In [39]:
grader.check("q10.2")

Now run the code below. Can you explain the output?

In [40]:
# sort in descending order
first_three_nbhds.sort(reverse=True)

first_three_nbhds

['Midtown', 'Kensington', 'Harlem']

### Task 3.2: Sort lists using `sorted()`

Now, use the `sorted` function to complete the same task as above. That is, fetch the names of the neighborhoods in the **first three** rows of the dataset. This time, use the `append` list method and a `for` loop to add entries into the list.

**Question 11.1:** What is the list of `neighborhood` names of the **first three** listings in the dataset?

In [41]:
first_three_nbhds = [] # this creates an empty list
# replace the ... with your code
for row_idx in range(3): # iterate over the indices of the first 3 rows in the dataset
    first_three_nbhds.append(cell(row_idx, "neighborhood")) # the for loop advances row_idx on its own
first_three_nbhds

['Kensington', 'Midtown', 'Harlem']

In [42]:
grader.check("q11-1")

**Question 11.2:** What does the function call `sorted(first_three_nbhds)` do?

In [43]:
# sort the list `first_three_nbhds` and assign the sorted list to the variable `sorted_nbhds`.
# replace the ... with your code
sorted_nbhds = sorted(first_three_nbhds)

print("Returned value:", sorted_nbhds)
print("List after sorting:", first_three_nbhds)

Returned value: ['Harlem', 'Kensington', 'Midtown']
List after sorting: ['Kensington', 'Midtown', 'Harlem']


In [44]:
grader.check("q11-2")

Now run the code below to sort the list in reverse (descending) order using the `sorted()` function.

In [45]:
# sort in descending order
reverse_sorted_nbhds = sorted(first_three_nbhds, reverse=True) 
reverse_sorted_nbhds 

['Midtown', 'Kensington', 'Harlem']

**Can you compare the outcome of calling `list.sort()` and `sorted(list)`**? What is returned by these functions, and what happens to the original lists? If you are not able to spot the difference, reach out to a TA/PM.
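One common slip that follows from this difference is assigning the result of `.sort()` back to a variable. The sketch below uses made-up numbers (`nums`) to show both calls side by side.

```python
nums = [3, 1, 2]  # made-up values for illustration

new_list = sorted(nums)  # returns a new sorted list
print(new_list, nums)    # the original list is unchanged

returned = nums.sort()   # sorts the list in place
print(returned, nums)    # the method itself returns None
```

Writing `nums = nums.sort()` would therefore replace the list with `None`, which usually shows up later as a `TypeError`.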

### Task 3.3: Sorting to find the median

Now, let's try using sorting to solve a common problem - that of finding the median of a given distribution of values. Recall that the median is the **middle number** in a sorted (ascending or descending) list of numbers.   
  
In a sorted list, if the list has an **odd** number of elements, the median is the middle number:

For example, for the list `[10, 20, 30, 40, 50]` --> median is `30`.

If a sorted list has an **even** number of elements, the median is the **average** of the **two middle numbers**:

For example, for the list `[10, 20, 30, 40]` --> median is `25`.

In [1]:
def median(items):
    """
    median(items) returns the median of the list `items`
    """
    # sort the list
    sorted_list = sorted(items)
    # determine the length of the list
    list_len = len(sorted_list)
    if list_len % 2 != 0: # determine whether length of the list is odd
        # return item in the middle using indexing
        return sorted_list[int(((list_len)-1)/2)]
    else:
        first_middle = sorted_list[int(((list_len)-2)/2)] # use appropriate indexing (clumsy; list_len // 2 - 1 is simpler)
        second_middle = sorted_list[int(((list_len)-2)/2 + 1)] # use appropriate indexing (clumsy; list_len // 2 is simpler)
        return (first_middle + second_middle) / 2

# cleaner version of the same logic; prefer this one

def better_median(items):
    """
    better_median(items) returns the median of the list `items`
    """
    n = len(items)
    items = sorted(items)
    if n % 2 == 0:
        first_middle = items[n // 2 - 1]
        second_middle = items[n // 2] # index into the sorted list
        result = (first_middle + second_middle) / 2
    else:
        result = items[n // 2]

    return result
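If you want an independent check of a median implementation, Python's standard library has `statistics.median`. The sketch below restates the rule from the text in a function with the hypothetical name `simple_median` and compares it with the library on the two example lists given above.

```python
import statistics

def simple_median(items):
    """illustrative re-statement of the median rule described above"""
    ordered = sorted(items)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]  # odd length: the single middle item
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # even length: average of the two middle items

for sample in ([10, 20, 30, 40, 50], [10, 20, 30, 40]):
    # the two values on each printed line should match
    print(simple_median(sample), statistics.median(sample))
```

This is only a way to double-check your work; the questions below still expect you to call the `median` function you defined in this lab.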

**Question 12.1:** What is the median of the list `list1 = [5, 3, 1, 2, 4]`?

In [2]:
list1 = [5, 3, 1, 2, 4]
# replace the ... with your code
median1 = median(list1)

median1

3

In [48]:
grader.check("q12-1")

**Question 12.2:** What is the median of the `list2 = [5, 3, 1, 2, 4, 6]`?

In [49]:
list2 = [5, 3, 1, 2, 4, 6]
# replace the ... with your code
median2 = median(list2)

median2

3.5

In [50]:
grader.check("q12-2")

After that short detour, let us dive back into the Airbnb dataset.

**Question 13:** What is the median `price` of all rooms in the `neighborhood` *Harlem*?

**Hint:** First create a *list* of all the `prices` of all the rooms in the given `neighborhood`, and then use the `median` function to find the **median** of that list.

In [51]:
# create a list of prices of all rooms in Harlem and store it in a suitably named variable
neighborhood_Harlem_price = []
for idx in range(num_rows): # loop through all the indices 
    if cell(idx, "neighborhood") == "Harlem": # use `cell` to determine if the listing at `idx` is from the correct neighborhood
        neighborhood_Harlem_price.append(int(cell(idx, "price"))) # use `cell` to append the price (converted to int) to the list

# find the median of the list of prices and store it in the variable 'harlem_median_price', then display it
# (no separate sort is needed; `median` sorts a copy of the list itself)
harlem_median_price = median(neighborhood_Harlem_price)

harlem_median_price

**Troubleshooting your function:**

Beware of type errors.

We expect the price to be an `int` value, but what type does the `cell` function actually return? Use the `type` function to find out. Think about how to solve this and test your result below.
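The sketch below shows why the type matters, using made-up price strings (`raw_prices`) rather than values read from the file. Strings sort character by character, so the middle element of a sorted list of strings is not necessarily the numeric median.

```python
raw_prices = ["89", "150", "80"]  # made-up values, stored as str like the CSV data

print(type(raw_prices[0]))
print(sorted(raw_prices))  # strings are compared character by character

int_prices = []
for price in raw_prices:
    int_prices.append(int(price))  # convert before sorting or doing arithmetic

print(sorted(int_prices))
```

Notice that `"150"` sorts before `"80"` as a string, while `150` sorts after `80` as an `int`.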

In [52]:
grader.check("q13")

## Segment 4: Building a better helper function

Our helper function `cell` could use some improvement. As you have seen, we had to **manually** convert the type returned by the function to represent the `price` as an `int` instead of a `str`. Let us ensure that the function returns the required type on its own.

We will define a new function `cell_v2` to test our new implementation. Once the function is tested and works correctly, you **must** replace the original function with the new version.

First, define `cell_v2` by running the code below.

In [53]:
# we did this one for you
def cell_v2(row_idx, col_name):
    col_idx = csv_header.index(col_name)
    val = csv_rows[row_idx][col_idx]
    if val == "":
        return None
    elif col_name in ['price', 'minimum_nights', 'number_of_reviews', 'calculated_host_listings_count', 'availability_365']:
        val = int(val)
    elif col_name in ['latitude', 'longitude', 'reviews_per_month']:
        val = float(val)
    return val

### Task 4.1: Return the correct data type for price


**Question 14**: What is the `price` of the **fifth** listing?

Your output **must** be an `int`. You **must** call the `cell_v2` function to answer this question.

In [54]:
# we have done this for you
fifth_price = cell_v2(4, 'price')

fifth_price

80

In [55]:
grader.check("q14")

### Task 4.2: Return the correct data type for `minimum_nights` column

Update `cell_v2` so it can handle the column `minimum_nights` as well. Your function **must** return an `int` when the `col_name` is `minimum_nights`.

**Question 15**: How many `minimum_nights` does the **last** listing require?

You **must** answer this question by calling the **`cell_v2`** function.

In [56]:
# replace the ... with your code
last_minimum_nights = cell_v2(-1,"minimum_nights")

last_minimum_nights

7

In [57]:
grader.check("q15")

### Task 4.3: Return the correct data types for all other columns

- Recall that the **correct** datatypes for each of the columns are as follows:
    1. `room_id` - the ID of the room listing (`str`)
    2. `name` - the name of the room listing (`str`)
    3. `host_id` - the ID of the host for the room listing (`str`)
    4. `host_name` - the name of the host for the room listing (`str`)
    5. `neighborhood_group` - the group of neighborhoods the room is in (`str`)
    6. `neighborhood` - the neighborhood the room is in (`str`)
    7. `latitude` - the latitude where the room is located (`float`)
    8. `longitude` - the longitude where the room is located (`float`)
    9. `room_type` - the type of room (`str`)
    10. `price` - the price per night for the room in US dollars (`int`)
    11. `minimum_nights` - the minimum amount of nights the room must be booked for (`int`)
    12. `number_of_reviews` - the total number of reviews the room has received (`int`)
    13. `last_review` - the date of the most recent review in the form yyyy-mm-dd (`str`)
    14. `reviews_per_month` - how many reviews per month the room receives (`float`)
    15. `calculated_host_listings_count` - how many listings the host of the room has (`int`)
    16. `availability_365` - how many days per year the listing is available for (`int`)
    
- Update your `cell_v2` function so it can handle **all** columns.
- The `if` condition will become very long if you keep using `or` to separate each column comparison operation.
- It is easier to make a list of all the column names whose values require `int` conversion and use the `in` operator to check:
```python
if col_name in [..., ..., ...]:
```

**Important:** Using the `cell_v2` function is recommended but **optional** for p6. You may choose to use the basic version and convert the return types manually when needed. Even if you use the `cell_v2` function defined here in p6, you **must** name that function `cell` in your p6 notebook.
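For the curious, here is one alternative way to organize the same conversions. It is only a sketch: it uses a dictionary, which this lab does not cover, and the names `COLUMN_TYPES` and `convert_value` are hypothetical. The list-and-`in` approach described above is all you need for this lab.

```python
# hypothetical lookup table: column name -> conversion function
# (columns that are not listed stay as str)
COLUMN_TYPES = {
    "latitude": float,
    "longitude": float,
    "price": int,
    "minimum_nights": int,
    "number_of_reviews": int,
    "reviews_per_month": float,
    "calculated_host_listings_count": int,
    "availability_365": int,
}

def convert_value(col_name, raw_value):
    """hypothetical helper: convert one raw CSV string for the given column"""
    if raw_value == "":
        return None  # missing data
    converter = COLUMN_TYPES.get(col_name, str)  # default to str for unlisted columns
    return converter(raw_value)

print(convert_value("price", "80"))
print(convert_value("latitude", "40.79851"))
print(convert_value("last_review", ""))
```

The table keeps the column-to-type mapping in one place, so adding a column means adding one entry rather than editing a condition.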

## Segment 5: Sets
In class, we learned about the Python `list` sequence. Another simpler structure you'll sometimes find useful is the `set`. A set is **not** a sequence because it does not keep all the values in any particular order. 

### Task 5.1: Create a set
You can create sets the same way as lists, just **replacing** the *square brackets*(`[]`) with *curly braces*(`{}`). In the cell below, create a set with the same elements as the example list provided.
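One caveat with the curly-brace syntax is worth a quick sketch: empty braces do not create an empty set. The variable names below are made up for illustration.

```python
empty_braces = {}   # this is an empty dict, not a set
empty_set = set()   # this is how to create an empty set

print(type(empty_braces), type(empty_set))

# duplicates written inside the braces are kept only once
print(len({"Harlem", "Harlem", "Midtown"}))
```

So when you need to start from an empty set and add to it later, use `set()`.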

**Question 16:** What is the set having the same contents as `example_list` below?

In [58]:
example_list = ["Kensington", "Harlem", "Midtown"]
print(example_list)
# replace the ... with your code
example_set = {"Kensington", "Harlem", "Midtown"}

example_set

['Kensington', 'Harlem', 'Midtown']


{'Harlem', 'Kensington', 'Midtown'}

In [59]:
grader.check("q16")

### Task 5.2: Check if an element is present in a list or set

The `in` operator is used to check if an element is present in a list or set. Try it below:

In [60]:
"Harlem" in example_list

True

**Question 17:** Check if `neighborhood` *Midtown* is present in the set `example_set`.


In [61]:
# replace the ... with your code
midtown_check = "Midtown" in example_set

midtown_check

True

In [62]:
grader.check("q17")

### Task 5.3: Check the ordering of elements in a list or set

Sets have no inherent ordering, so they don't support indexing. Try the code in the cells below.

In [63]:
example_list[0]  # this works

'Kensington'

In [64]:
example_set[1] # but this does not work

TypeError: 'set' object is not subscriptable

The lack of order also matters for comparisons. Try evaluating this boolean expression:

In [65]:
["Harlem", "Midtown", "Kensington"] == ["Kensington", "Harlem", "Midtown"]

False

And now try this:

In [66]:
{"Harlem", "Midtown", "Kensington"} == {"Kensington", "Harlem", "Midtown"}

True

### Task 5.4: Convert between lists and sets
You can switch back and forth between lists and sets with ease. Let's try it.

**Question 18**: What is the **list** of all `host_names` in the `neighborhood` *Throgs Neck*?

In [69]:
# compute and store the answer in the variable 'throgs_neck_hosts', then display it
throgs_neck_hosts = []
for idx in range(num_rows):
    if cell_v2(idx, "neighborhood") == "Throgs Neck":
        throgs_neck_hosts.append(cell_v2(idx, "host_name"))

In [70]:
grader.check("q18")

Now, let us convert the *list* `throgs_neck_hosts` into a *set*. Compare the number of elements in the list and the set.

**Question 19**: What is the **set** of all `host_names` in the `neighborhood` *Throgs Neck*?

**Hint:** You can convert a *list* into a *set* by typecasting. For example, to convert a *list* `example_list` into a *set*, you can use `set(example_list)`.

In [71]:
# replace the ... with your code
throgs_neck_hosts_set = set(throgs_neck_hosts)


print('Length of list:',  len(throgs_neck_hosts))
print('Length of set:', len(throgs_neck_hosts_set))

Length of list: 24
Length of set: 17


In [72]:
grader.check("q19")

As you can see, the number of elements is different! This is because a set is a collection of **unique** elements. Therefore, there can be no duplicates in a **set**.

**Be careful!** When going from a set to a list, Python has to choose how to order the previously unordered values. If you run the same code, there's no guarantee Python will always choose the same way to order the set values in the new list.
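If you need a predictable order when turning a set into a list, one option is to use `sorted`, which accepts a set and always returns a list. The set `hosts` below is made up for illustration.

```python
hosts = {"Maris", "Justine", "Pedro"}  # made-up set for illustration

# sorted accepts any collection of comparable items and returns a new list
ordered_hosts = sorted(hosts)
print(ordered_hosts)
```

Unlike `list(hosts)`, the result here is in alphabetical order on every run.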

### Task 5.5: Remove Duplicates
Let's use the uniqueness property of sets above to remove duplicates from a list by converting from a list to a set and back to a list again.

**Question 20**: Convert `list_1 = ["Brooklyn", "Brooklyn", "Manhattan", "Midtown", "Kensington", "Kensington", "Manhattan"]` to a set and then back to a list.

**Hint:** Just as you can convert a *list* into a *set* by typecasting, you can convert a *set* into a *list*. For example, to convert a *set* `example_set` into a *list*, you can use `list(example_set)`.

In [75]:
# try playing with different values here
# the backslash enables us to split a long line of code into two lines
list_1 = ["Brooklyn", "Brooklyn", "Manhattan", "Midtown", \
          "Kensington", "Kensington", "Manhattan"] 
# convert list_1 to a set and back to a list
# replace the ... with your code
set_1 = set(list_1)
list_2 = list(set_1)

list_2

['Brooklyn', 'Kensington', 'Midtown', 'Manhattan']

In [76]:
grader.check("q20")
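Converting to a set and back removes duplicates but does not promise to keep the original order. If the order of first appearance matters, a loop with `append` is one way to keep it. This is a sketch with the hypothetical names `names` and `unique_in_order`, not a replacement for the answer to Question 20.

```python
names = ["Brooklyn", "Brooklyn", "Manhattan", "Midtown",
         "Kensington", "Kensington", "Manhattan"]

unique_in_order = []
for name in names:
    if name not in unique_in_order:  # skip names that were already added
        unique_in_order.append(name)

print(unique_in_order)
```

Each name appears once, in the order it was first seen in `names`.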

## Great work! You are now ready to start [p6](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-f22-projects/-/tree/main/p6).