# Subsetting data: Exercise notebook

In this notebook you'll practice the following:
- Setting the index of a pandas DataFrame
- Selecting columns (brackets and dot notation)
- Selecting rows (loc and iloc)
- Subsetting on conditions
- Removing and Adding columns

In [None]:
import pandas as pd

For these exercises we will be using the U.S. Chronic Disease Indicators dataset, which contains health indicators (topic, question, location, year and measured value) for several locations.

In each exercise, you'll be asked to implement a function. To test it before you submit the assignment, add a new cell and call the function to inspect its output.

In [None]:
# Read the indicators dataset and set the indicator_id column as index
indicators = pd.read_csv('data/U.S._Chronic_Disease_Indicators.csv', index_col='indicator_id')

# Show the shape and the first 5 rows
print(indicators.shape)
indicators.head(5)

## Exercise 1

Selecting columns

Select the column __*Topic*__.
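
As a reminder of the two notations, here is a minimal sketch on a made-up toy table. The names `toy`, `fruit` and `price` are invented for illustration and are not part of the indicators dataset.

```python
import pandas as pd

# Toy data, invented for illustration (not the indicators dataset)
toy = pd.DataFrame(
    {"fruit": ["apple", "pear", "plum"], "price": [1.2, 0.8, 2.5]},
    index=[10, 20, 30],
)

# Bracket notation works for any column name
by_brackets = toy["fruit"]

# Dot notation works when the name is a valid Python identifier
# and does not clash with an existing DataFrame attribute or method
by_dot = toy.fruit

# Both return the same pd.Series, named after the column
print(type(by_brackets).__name__, by_brackets.name)
print(by_brackets.equals(by_dot))
```

Bracket notation is usually the safer default, since it also works for column names that contain spaces or shadow a method name.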

In [None]:
def exercise_1(df):
    """ 
    Select the column Topic of the DataFrame
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.Series): Topic column

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()
    

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_1(indicators)
assert isinstance(df_test, pd.Series)
assert df_test.name == 'Topic'
assert df_test.shape[0] == indicators.shape[0]
#pd.testing.assert_series_equal(df_test, df_true)

## Exercise 2

Selecting columns.

Select columns __*LocationDesc*__ and __*Question*__.
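
Selecting several columns takes a list of names inside the brackets. The sketch below uses a made-up toy table (`toy`, `fruit`, `price` and `stock` are invented names) to show that the result is a DataFrame, even when the list has a single name.

```python
import pandas as pd

# Toy data, invented for illustration (not the indicators dataset)
toy = pd.DataFrame(
    {
        "fruit": ["apple", "pear", "plum"],
        "price": [1.2, 0.8, 2.5],
        "stock": [30, 12, 7],
    }
)

# A list of names returns a DataFrame with the columns in the listed order
two_columns = toy[["fruit", "stock"]]

# A one-element list still returns a DataFrame, not a Series
one_column = toy[["fruit"]]

print(two_columns.columns.tolist())
print(type(one_column).__name__)
```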

In [None]:
def exercise_2(df):
    """ 
    Select columns LocationDesc and Question
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): LocationDesc and Question columns

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_2(indicators)
assert isinstance(df_test, pd.DataFrame)
assert df_test.columns.tolist() == ['LocationDesc', 'Question']
assert df_test.shape[0] == indicators.shape[0]

## Exercise 3
Selecting rows.

Select the **78th**, the **156th** and the **390th** rows.
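
Rows can be picked by position with `iloc` and a list of integers. The sketch below uses a made-up toy table (`toy` and `letter` are invented names) whose index labels differ from the row positions, to make the distinction visible. Note that `iloc` positions start at 0.

```python
import pandas as pd

# Toy data, invented for illustration; index labels are NOT the positions
toy = pd.DataFrame(
    {"letter": list("abcdef")},
    index=[60, 50, 40, 30, 20, 10],
)

# iloc selects by zero-based position, regardless of the index labels
picked = toy.iloc[[0, 2, 5]]

# The result keeps the original index labels of the selected rows
print(picked.index.tolist())
print(picked["letter"].tolist())
```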

In [None]:
indicators.head(3)

In [None]:
def exercise_3(df):
    """ 
    Select the 78th, the 156th and the 390th rows
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): subsetted df

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_3(indicators)
assert isinstance(df_test, pd.DataFrame)
assert df_test.shape[1] == indicators.shape[1]
assert df_test.index.tolist() == [9152, 9115, 1215]

In [None]:
indicators.head(2)

## Exercise 4
Selecting rows and columns

Select columns __*Question*__, __*DataValueUnit*__, and __*DataValue*__,  for indicators whose **id** is __1143__ or __1910__.
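
With `loc`, rows and columns can be selected by label in a single call. A minimal sketch on a made-up toy table (`toy`, `fruit`, `price` and `stock` are invented names, as are the index labels):

```python
import pandas as pd

# Toy data, invented for illustration (not the indicators dataset)
toy = pd.DataFrame(
    {
        "fruit": ["apple", "pear", "plum"],
        "price": [1.2, 0.8, 2.5],
        "stock": [30, 12, 7],
    },
    index=[101, 102, 103],
)

# First argument: row labels from the index; second argument: column names
picked = toy.loc[[101, 103], ["fruit", "price"]]

print(picked.index.tolist())
print(picked.columns.tolist())
```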

In [None]:
def exercise_4(df):
    """ 
    Select columns Question, DataValueUnit and DataValue for indicators whose id is 1143 or 1910
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): subsetted df

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_4(indicators)
assert isinstance(df_test, pd.DataFrame)
assert df_test.index.tolist() == [1143, 1910]
assert df_test.columns.tolist() == ['Question','DataValueUnit', 'DataValue']

## Exercise 5
Using slice operation.

Use the slice operation to pick the **indicators** rows from **2100** (included) to **3100** (excluded).
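
Slices behave differently by position and by label. The sketch below, on a made-up toy table (`toy` and `value` are invented names), shows that an `iloc` slice excludes its stop position while a `loc` slice includes its end label.

```python
import pandas as pd

# Toy data, invented for illustration (not the indicators dataset)
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]}, index=list("abcdef"))

# Positional slice: start included, stop excluded (positions 1, 2, 3)
by_position = toy.iloc[1:4]

# Label slice: both ends included (labels "b", "c", "d")
by_label = toy.loc["b":"d"]

print(by_position.index.tolist())
print(by_label.index.tolist())
```

Here both slices return the same three rows, because position 4 is excluded while label `"d"` is included.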

In [None]:
def exercise_5(df):
    """ 
    Use the slice operation to pick the rows from 2100 (included) to 3100 (excluded).
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): df subset of rows

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_5(indicators)
assert isinstance(df_test, pd.DataFrame)
assert df_test.shape == (1000,7)
assert df_test.index[-1] == 8905
assert df_test.columns.tolist() == indicators.columns.tolist()

## Exercise 6

Help Sofia find the data she needs to complete her study!

Sofia has been researching the impact of Arthritis. For her final remarks on this subject, she needs to look at all the indicators that have been measured for this topic.

Find the indicators that fulfill these criteria. In the end, present only the **YearStart**, **LocationDesc**, **Question**, and **DataValue**.
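
Subsetting on a condition means building a boolean mask and passing it to `loc`, optionally together with the columns to keep. A minimal sketch on a made-up toy table (`toy`, `topic`, `year` and `value` are invented names):

```python
import pandas as pd

# Toy data, invented for illustration (not the indicators dataset)
toy = pd.DataFrame(
    {
        "topic": ["A", "B", "A", "C"],
        "year": [2015, 2016, 2017, 2018],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# The comparison returns a boolean Series, one entry per row
mask = toy["topic"] == "A"

# loc keeps the rows where the mask is True and only the listed columns
result = toy.loc[mask, ["year", "value"]]

print(mask.tolist())
print(result)
```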

In [None]:
def exercise_6(df):
    """ 
    Find all the indicators involving Arthritis.
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): subsetted df

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_6(indicators)
assert isinstance(df_test, pd.DataFrame)
assert len(df_test.columns) == 4
assert len(df_test.index) == 681
assert sum(df_test.DataValue.isna()) == 229
assert round(df_test.DataValue.astype(float).mean(),1) == 36.7
assert len("".join(df_test.Question)) == 45972
assert 'California' in df_test.LocationDesc.tolist()

## Exercise 7

Help Marco find the study he needs.

Marco is currently working on mental health issues in the general population. He is looking for recent indicators to help him analyze the situation in Arkansas, so he asks you whether any indicators that started in 2018 may be helpful to him.

Find the indicators that fulfill these criteria. You must show the **YearStart**, **Question**, **DataValueUnit**, and **DataValue**.

Hint: Be aware of the columns' data types. You might think some values are strings when in fact they are integers, or vice versa.
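
To illustrate the hint, here is a minimal sketch on a made-up toy table (`toy`, `state`, `year` and `value` are invented names). It checks the column types with `dtypes`, combines two conditions with `&`, and shows that comparing an integer column with a string matches nothing.

```python
import pandas as pd

# Toy data, invented for illustration: year holds integers, value holds strings
toy = pd.DataFrame(
    {
        "state": ["X", "X", "Y"],
        "year": [2017, 2018, 2018],
        "value": ["1.5", "2.0", "3.1"],
    }
)

# Inspect the type of each column before writing conditions
print(toy.dtypes)

# Each condition needs its own parentheses when combined with & (and) or | (or)
mask = (toy["state"] == "X") & (toy["year"] == 2018)
result = toy.loc[mask, ["year", "value"]]

# Comparing the integer column with a string is False for every row
wrong_mask = toy["year"] == "2018"

print(result)
print(wrong_mask.sum())
```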

In [None]:
def exercise_7(df):
    """ 
    Find all Mental health indicators for Arkansas in the year 2018.
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        (pd.DataFrame): subsetted df

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
pd.DataFrame().reset_index()

In [None]:
work = indicators.copy()
work = work.reset_index().groupby('LocationDesc').agg(lambda df: df.iloc[0]).reset_index().sample(frac=1)
work.to_csv('data/U.S._Chronic_Disease_Indicators_subset.csv',index=False)
print(work.shape)
work.head(2)

In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_7(indicators)
assert isinstance(df_test, pd.DataFrame)
assert len(df_test.columns) == 4
assert len(df_test.index) == 1
assert 'recent' in df_test['Question'].tolist()[0]
assert df_test.DataValue.get(8144) == '19.6'

## Exercise 8
Working with the DataFrame index.

You now have a subset of the data called `U.S._Chronic_Disease_Indicators_subset.csv`. The difference between this and the original is that now you only have one indicator per location.

Set the column **LocationDesc** as index and sort it.

Make sure you keep the old index as a column in the resulting DataFrame.
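
One possible way to chain these index operations is sketched below on a made-up toy table (`toy`, `name`, `price` and `item_id` are invented names). It is an illustration of `reset_index`, `set_index` and `sort_index`, not necessarily the intended solution.

```python
import pandas as pd

# Toy data, invented for illustration, with a named index
toy = pd.DataFrame(
    {"name": ["pear", "apple", "plum"], "price": [0.8, 1.2, 2.5]},
    index=pd.Index([3, 1, 2], name="item_id"),
)

# reset_index turns the current index into a regular column,
# set_index promotes a column to the index,
# sort_index orders the rows by the new index
result = toy.reset_index().set_index("name").sort_index()

print(result.index.tolist())
print(result.columns.tolist())
```

Calling `set_index` directly, without `reset_index` first, would discard the old index instead of keeping it as a column.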

In [None]:
subset_indicators = pd.read_csv('data/U.S._Chronic_Disease_Indicators_subset.csv', index_col='indicator_id')
subset_indicators.head(3)

In [None]:
def exercise_8(df):
    """ 
    Set column LocationDesc as index. Keep old index as column in the new dataframe
    Sort the index.
    
    Args:
        df (pd.DataFrame): the input DataFrame

    Returns:
        new_df (pd.DataFrame): the transformed DataFrame

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()


In [None]:
# This cell is what will test your code, please ignore it!
df_test = exercise_8(subset_indicators)
assert isinstance(df_test, pd.DataFrame)
assert df_test.index.name == 'LocationDesc'
assert 'indicator_id' in df_test.columns.tolist()
assert df_test.index.values[0] == 'Alabama' and df_test.index.values[-1] == 'Wyoming'