# SLU2 - Subsetting data: Exercise notebook

In this notebook you'll practice the following:

- Setting the pandas DataFrame index
- Selecting columns (brackets and dot notation)
- Selecting rows (loc and iloc)
- Chained indexing (not good) vs multi-axis indexing (good)
- Masks
- Where
- Subsetting on conditions
- Removing and adding columns

In [None]:
import pandas as pd

pd.options.display.max_rows = 10

For these exercises we will be using a Zomato dataset containing the description and ratings of several restaurants.

In each exercise, you'll be asked to implement a function. To test it before you submit the assignment, add a new cell and call the function to inspect its output.

In [None]:
# Read restaurants dataset, set the restaurant id column as index and sort by it
restaurants = pd.read_csv('data/zomato_restaurants.csv', index_col='restaurant id').sort_index()

# Show first 5 lines
restaurants.head(5)

## Exercise 1

Selecting columns

Select the column __*city*__.
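
As a reminder of the two notations, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its column names are hypothetical and have nothing to do with the restaurants data.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the restaurants dataset
toy = pd.DataFrame(
    {"fruit": ["apple", "pear", "kiwi"], "price": [1.2, 0.8, 2.5]},
    index=[10, 20, 30],
)

# Bracket notation works for any column name and returns a Series
prices_brackets = toy["price"]

# Dot notation only works when the name is a valid Python identifier
# (no spaces) and does not clash with a DataFrame attribute or method
prices_dot = toy.price

print(prices_brackets.equals(prices_dot))
```

Bracket notation is usually the safer choice, since column names with spaces cannot be reached with a dot.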

In [None]:
def exercise_1(df):
    """ 
    Select the column city of the DataFrame
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.Series): city column

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()
    

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_1(restaurants)
assert isinstance(df_test, pd.Series)
assert df_test.name == 'city'
assert df_test.shape[0] == restaurants.shape[0]
#pd.testing.assert_series_equal(df_test, df_true)

## Exercise 2

Selecting columns.

Select columns __*aggregate_rating*__ and __*average_cost_for_two*__.

In [None]:
def exercise_2(df):
    """ 
    Select columns aggregate_rating and average_cost_for_two
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): aggregate_rating and average_cost_for_two columns

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_2(restaurants)
assert isinstance(exercise_2(restaurants), pd.DataFrame)
assert df_test.columns.tolist() == ['aggregate_rating', 'average_cost_for_two']
assert df_test.shape[0] == restaurants.shape[0]

## Exercise 3
Selecting rows.

Select the **78th**, the **156th** and the **390th** rows.
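
If you need a refresher on position-based versus label-based selection, the sketch below contrasts `iloc` and `loc` on a made-up DataFrame (the `toy` frame is hypothetical). Keep in mind that `iloc` positions start at zero.

```python
import pandas as pd

# Hypothetical toy data; the index labels are deliberately not 0, 1, 2, ...
toy = pd.DataFrame(
    {"fruit": ["apple", "pear", "kiwi", "plum"], "price": [1.2, 0.8, 2.5, 1.9]},
    index=[10, 20, 30, 40],
)

# iloc selects by integer position (zero-based), ignoring the index labels
by_position = toy.iloc[[0, 2]]

# loc selects by index label instead
by_label = toy.loc[[10, 30]]

print(by_position.equals(by_label))
```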

In [None]:
def exercise_3(df):
    """ 
    Select the 78th, the 156th and the 390th rows
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): subsetted df

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_3(restaurants)
assert isinstance(df_test, pd.DataFrame)
assert df_test.shape[1] == restaurants.shape[1]
assert df_test.index.values.tolist() == [6114650, 8302994, 17836438]

## Exercise 4
Selecting rows and columns

Select columns __*aggregate_rating*__  and __*restaurant name*__  for restaurants whose **id** is __8202867__ or __16553285__.
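
The sketch below illustrates multi-axis indexing on a made-up DataFrame (the `toy` frame and its columns are hypothetical): rows and columns are chosen in a single `loc` call rather than in two chained steps.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the restaurants dataset
toy = pd.DataFrame(
    {
        "fruit": ["apple", "pear", "kiwi", "plum"],
        "price": [1.2, 0.8, 2.5, 1.9],
        "stock": [5, 0, 12, 7],
    },
    index=[10, 20, 30, 40],
)

# Multi-axis indexing: row labels first, column labels second, one call
subset = toy.loc[[20, 40], ["price", "fruit"]]

# Chained indexing gives the same values here, but it performs two separate
# selections and is best avoided, especially when assigning values
chained = toy[["price", "fruit"]].loc[[20, 40]]

print(subset)
```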

In [None]:
def exercise_4(df):
    """ 
    Select columns aggregate_rating and restaurant name for restaurants with id 8202867 and 16553285
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): subsetted df

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_4(restaurants)
assert isinstance(df_test, pd.DataFrame)
assert df_test.index.values.tolist() == [8202867, 16553285]
assert df_test.columns.tolist() == ['aggregate_rating', 'restaurant name']

## Exercise 5
Using the __mask__ function

Use the mask function to hide all the restaurants in London.
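
As a small illustration of what `mask` does, the sketch below hides rows of a made-up DataFrame (the `toy` frame is hypothetical). Rows where the condition is `True` are replaced with NaN, and the shape stays the same.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the restaurants dataset
toy = pd.DataFrame(
    {"fruit": ["apple", "pear", "kiwi"], "price": [1.2, 0.8, 2.5]},
    index=[10, 20, 30],
)

# mask replaces the rows where the condition is True with NaN
masked = toy.mask(toy["fruit"] == "pear")

print(masked)
```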

In [None]:
def exercise_5(df):
    """ 
    Use the mask function to hide all the restaurants in London
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): df with hidden (NaN) rows

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_5(restaurants)
assert isinstance(df_test, pd.DataFrame)
assert df_test.shape == restaurants.shape
assert sum(df_test.city=='London') == 0

## Exercise 6
Using the __where__ function

Use the where function to hide all the restaurants that **do not have table booking**.
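
`where` is the mirror image of `mask`: it keeps the rows where the condition is `True` and replaces the others with NaN. A minimal sketch on a made-up DataFrame (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, unrelated to the restaurants dataset
toy = pd.DataFrame(
    {"fruit": ["apple", "pear", "kiwi"], "price": [1.2, 0.8, 2.5]},
    index=[10, 20, 30],
)

cond = toy["price"] > 1.0

# where keeps rows where cond is True; the remaining rows become NaN
kept = toy.where(cond)

# Masking the negated condition produces the same result
same_as_mask = kept.equals(toy.mask(~cond))

print(kept)
```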

In [None]:
def exercise_6(df):
    """ 
    Use the where function to hide all the restaurants that do not have table booking
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): df with hidden (NaN) rows

    """
    
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_6(restaurants)
assert isinstance(df_test, pd.DataFrame)
assert df_test.shape == restaurants.shape
assert sum(df_test.has_table_booking==0) == 0

## Exercise 7
Using slice operation.

Use the slice operation to pick the **restaurants** whose **id** is between **6122300** and **6130000**.
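
A quick reminder of how label slices behave, sketched on a made-up DataFrame (the `toy` frame is hypothetical). With `loc`, both ends of the slice are included, and on a sorted index the bounds do not need to be existing labels.

```python
import pandas as pd

# Hypothetical toy data with a sorted integer index
toy = pd.DataFrame(
    {"fruit": ["apple", "pear", "kiwi", "plum"], "price": [1.2, 0.8, 2.5, 1.9]},
    index=[10, 20, 30, 40],
)

# Label slice: both endpoints are included
inclusive = toy.loc[20:30]

# On a sorted index the bounds may be labels that do not exist (15 here)
between = toy.loc[15:30]

print(between)
```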


In [None]:
def exercise_7(df):
    """ 
    Use the slice operation to pick all restaurants whose id is between 6122300 and 6130000.
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): df subset of rows

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_7(restaurants)
assert isinstance(df_test, pd.DataFrame)
assert df_test.index.values.tolist() == [6122355, 6124820, 6125672, 6127163]
assert df_test.columns.tolist() == restaurants.columns.tolist()

## Exercise 8
Picking a tasty ice cream for Maria!

Maria has been working in Rome (Roma in Portuguese) for two months now. Summer is just starting and she would like to try one of the famous ice creams that Italy is known for. She wants the best that Rome has to offer, so she will only consider ice cream shops with a rating higher than 4.5.

Find the options that fulfill these criteria. Select only the following columns:
**neighbourhood**,
**aggregate_rating**,
**votes**,
**average_cost_for_two**
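
Combining several conditions follows the pattern sketched below on a made-up DataFrame (the `toy` frame and its columns are hypothetical). Each condition goes in its own parentheses or variable, and the conditions are joined with `&` (and) or `|` (or).

```python
import pandas as pd

# Hypothetical toy data, unrelated to the restaurants dataset
toy = pd.DataFrame(
    {
        "fruit": ["apple", "lemon", "orange", "lime"],
        "kind": ["pome", "citrus", "citrus", "citrus"],
        "price": [1.2, 0.8, 2.5, 1.9],
    },
    index=[10, 20, 30, 40],
)

# Build one boolean Series per condition
is_citrus = toy["kind"] == "citrus"
is_pricey = toy["price"] > 1.0

# Combine them with & and select the columns in the same loc call
selected = toy.loc[is_citrus & is_pricey, ["fruit", "price"]]

print(selected)
```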

In [None]:
def exercise_8(df):
    """ 
    Pick an Ice Cream Shop for Maria.
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): subsetted df

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_8(restaurants)
assert isinstance(df_test, pd.DataFrame)
assert len(df_test.columns) == 4
assert len(df_test.index) == 3
assert sum(df_test.votes) == 483
assert round(sum(df_test.aggregate_rating),1) == 14.5
assert sum(df_test.average_cost_for_two) == 52
assert 'Veneto' in df_test.neighbourhood.values

## Exercise 9

__Find a restaurant for Toni!__

Toni is visiting Porto with a friend and they really want to get some nice food next Saturday night. They have a total budget of 50€ and they want a restaurant with a rating higher than 4.5. As their hotel is near the Baixa neighbourhood, they would prefer to pick a restaurant close by.

Find the options that fulfill Toni's criteria.

In the end, present only the **restaurant name**, **neighbourhood**, **aggregate_rating**, **average_cost_for_two** and the **has_table_booking**.

In [None]:
def exercise_9(df):
    """ 
    Pick a restaurant for Toni and his friend
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): subsetted df

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_9(restaurants)
assert isinstance(df_test, pd.DataFrame)
assert len(df_test.columns) == 5
assert len(df_test.index) == 6
assert 'Miss Pavlova' in df_test['restaurant name'].values
assert round(sum(df_test.aggregate_rating),1) == 28.3
assert sum(df_test.has_table_booking) == 2
assert sum(df_test.average_cost_for_two) == 195
assert sum(df_test.neighbourhood == 'Baixa') == 6

## Exercise 10
Working with dataframe index.

Set column __*restaurant name*__ as index and sort it.
Make sure you keep the old index as a column in the resulting dataframe.
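
One possible way to swap the index while keeping the old one is sketched below on a made-up DataFrame. The `toy` frame and its `code` index name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a named index
toy = pd.DataFrame(
    {"fruit": ["pear", "kiwi", "apple"], "price": [0.8, 2.5, 1.2]},
    index=pd.Index([10, 20, 30], name="code"),
)

# reset_index turns the current index into a regular column,
# set_index promotes another column, sort_index orders the rows by it
reindexed = toy.reset_index().set_index("fruit").sort_index()

print(reindexed)
```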

In [None]:
def exercise_10(df):
    """ 
    Set column restaurant name as index. Keep old index as column in the new dataframe
    Sort the index.
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): the transformed DataFrame

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()


In [None]:
df_test= exercise_10(restaurants)
assert isinstance(df_test, pd.DataFrame)
assert df_test.index.name == 'restaurant name'
assert 'restaurant id' in df_test.columns.tolist()
assert df_test.index.values[0] == '100 Montaditos' and df_test.index.values[-1] == 'daTerra'

## Exercise 11

Adding columns

Add a column with the average cost for 10 people and name it __*average_cost_for_ten*__

Add a column with the photo_count per vote and name it __*photo_count_per_vote*__
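
New columns can be derived from existing ones with ordinary arithmetic, as in the sketch below on a made-up DataFrame (the `toy` frame and its columns are hypothetical). Working on a copy is one way to leave the input untouched.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the restaurants dataset
toy = pd.DataFrame(
    {"price": [1.2, 0.8, 2.5], "quantity": [10, 5, 2]},
    index=[10, 20, 30],
)

# Work on a copy so the original DataFrame is not modified
extended = toy.copy()

# Arithmetic between columns is applied row by row
extended["total"] = extended["price"] * extended["quantity"]
extended["price_per_unit_sold"] = extended["price"] / extended["quantity"]

print(extended)
```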

In [None]:
def exercise_11(df):
    """ 
    Add a column average_cost_for_ten and column photo_count_per_vote
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): DataFrame with 2 extra columns

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
df_test = exercise_11(restaurants)
assert isinstance(df_test, pd.DataFrame)
# Build the expected result on a copy so the test does not modify restaurants
df_true = restaurants.copy()
df_true['average_cost_for_ten'] = df_true['average_cost_for_two'] * 5
df_true['photo_count_per_vote'] = df_true['photo_count'] / df_true['votes']
pd.testing.assert_frame_equal(df_test, df_true)

## Exercise 12
Dropping columns

Drop the columns __*timings*__ and __*has_online_delivery*__.
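
For reference, `drop` with the `columns` argument returns a new DataFrame without the listed columns. The sketch below uses a made-up DataFrame (the `toy` frame is hypothetical).

```python
import pandas as pd

# Hypothetical toy data, unrelated to the restaurants dataset
toy = pd.DataFrame(
    {"fruit": ["apple", "pear"], "price": [1.2, 0.8], "quantity": [10, 5]},
    index=[10, 20],
)

# drop returns a new DataFrame; toy itself keeps all three columns
trimmed = toy.drop(columns=["price", "quantity"])

print(trimmed.columns.tolist())
```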

In [None]:
def exercise_12(df):
    """ 
    Drop columns timings and has_online_delivery
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): the transformed DataFrame

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
df_test = exercise_12(restaurants)
assert isinstance(df_test, pd.DataFrame)
assert not any(col in df_test.columns for col in ['timings', 'has_online_delivery'])
assert df_test.shape[0] == restaurants.shape[0]
assert df_test.shape[1] == restaurants.shape[1]-2