# Combining Data With Pandas

This notebook contains exercises for 'Combining Data With Pandas'.

<!-- **At the end of each exercise there are cells containing assert statements that you can use to check your answers.** -->

In [None]:
import pandas as pd
import numpy as np
%autosave 30

## Exercise 3: Appending Features


The two columns that have inconsistent values (as shown in the previous exercise) are `popularity` and `vote_count`.

This exercise looks at one way of "fixing" this.
When merging, you can simply duplicate the affected columns in the resulting dataset, so that each conflicting value gets a column of its own.

This does not resolve the inconsistency, but it lets you distinguish between the values without losing any information.
If you are doing any learning or data analysis on top of the merged data, you will need a more sophisticated approach.

In [None]:
movies_df_pt1 = pd.read_csv('../data/movies_dataset/movies_pt1.csv', index_col=0)
movies_df_pt2 = pd.read_csv('../data/movies_dataset/movies_pt2.csv', index_col=0)
movies_df_pt3 = pd.read_csv('../data/movies_dataset/movies_pt3.csv', index_col=0)

all_movies_df = pd.concat([
    movies_df_pt1,
    movies_df_pt2,
    movies_df_pt3
], ignore_index=True)

## Question 3.1

The function `extract_duplicate_features_df` extracts duplicate features from the dataset and creates a new `DataFrame`.

For example, let's say the movie with id `3` has two different values for `popularity` in the original dataset:


| | .. _other features_ .. | popularity |
| --- | --- | --- |
| **id** | | |
| 3 | ... | 1.5 |
| 3 | ... | 3.8 |

Then `extract_duplicate_features_df` will return a new `DataFrame` with a single row for this movie `id`:

| | popularity_0 | popularity_1 |
| --- | --- | --- |
| **id** | | |
| 3 | 1.5 | 3.8 |


where `1.5` and `3.8` are the two values for popularity. The `DataFrame` is also indexed by the movie's `id`.

This `DataFrame` will not contain any other columns.

* Run this function with `col_name='popularity'` (the default) and inspect the `DataFrame`.
* Run `df.loc[~pd.isna(df['popularity_1'])]`, where `df` is the `DataFrame` returned by the function, and make sure you understand its output. This selects all rows that had duplicate `popularity` scores.
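To make the expected shape concrete, here is a minimal sketch on made-up data. It does not call `extract_duplicate_features_df`; it builds a similar wide table with `cumcount` and `pivot` (a substitute technique chosen for readability) and then applies the same `~pd.isna(...)` filter. The names `toy_df`, `wide_df` and `has_duplicates_df` are hypothetical.

```python
import pandas as pd

# Toy data, made up for illustration: movie 3 has two popularity values.
toy_df = pd.DataFrame({
    'id': [1, 3, 3],
    'popularity': [2.0, 1.5, 3.8],
})

# Number each distinct value within its movie id: 0, 1, ...
deduped = toy_df.drop_duplicates(subset=['id', 'popularity']).copy()
deduped['n'] = deduped.groupby('id').cumcount()

# One row per id, one column per numbered value.
wide_df = deduped.pivot(index='id', columns='n', values='popularity')
wide_df.columns = [f'popularity_{n}' for n in wide_df.columns]

# Rows with a non-missing second value are the ids that had duplicates.
has_duplicates_df = wide_df.loc[~pd.isna(wide_df['popularity_1'])]
print(has_duplicates_df)
```

Movies with a single value get `NaN` in `popularity_1`, which is why the filter keeps only the ids with conflicting values.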

In [None]:
def extract_duplicate_features_df(all_movies_df, col_name='popularity'):
    """Convert duplicate values for `col_name` into multiple columns."""
    # You do not need to understand how this function works.
    duplicate_cols_df = all_movies_df.drop_duplicates(subset=['id', col_name]).groupby('id').apply(
        lambda df: pd.Series({
            f'{col_name}_{a_id}': row[col_name]
            for a_id, (r_id, row) in enumerate(df.iterrows())
        })
    ).unstack()

    return duplicate_cols_df

In [None]:
duplicate_popularity_df = ...

## Question 3.2

Concatenate the `DataFrame` you obtained from `extract_duplicate_features_df` together with the `all_movies_df` `DataFrame`. Make sure the resulting `DataFrame` contains no duplicate movie `id`s.

* You will need to use `drop_duplicates` on `all_movies_df` with an appropriate `subset` argument.
* Make sure to also set the index on `all_movies_df` appropriately.
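The mechanics behind these hints can be sketched on toy data. This is an illustration under assumed column names (`title` and all values are made up), not the reference solution for the movies dataset: once both frames are indexed by `id`, `pd.concat(..., axis=1)` places them side by side and aligns rows by index label.

```python
import pandas as pd

# Toy stand-in for the full dataset: movie 3 appears twice.
toy_movies_df = pd.DataFrame({
    'id': [1, 3, 3],
    'title': ['A', 'B', 'B'],
    'popularity': [2.0, 1.5, 3.8],
})

# Toy stand-in for the wide frame of duplicate values, indexed by id.
toy_wide_df = pd.DataFrame(
    {'popularity_0': [2.0, 1.5], 'popularity_1': [float('nan'), 3.8]},
    index=pd.Index([1, 3], name='id'),
)

# Keep one row per id, then index by id so both frames share the same labels.
unique_movies_df = toy_movies_df.drop_duplicates(subset=['id']).set_index('id')

# axis=1 concatenates column-wise, matching rows on the index.
combined_df = pd.concat([unique_movies_df, toy_wide_df], axis=1)
print(combined_df)
```

Without the `set_index('id')` step, the two frames would be aligned on unrelated row labels and the values would end up next to the wrong movies.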

In [None]:
...

## Question 3.3

* Do the same thing for the column `vote_count` (you will need to call `extract_duplicate_features_df` with an appropriate argument).
* Then reset the index of the resulting `DataFrame`, making sure to keep the old index, as it contains the movie `id`s.

* **[Optional]** Finally, remove the now redundant `popularity` and `vote_count` columns.
You can do this by calling `df.drop(<col_names>, axis=1)` on the final `DataFrame`, passing the appropriate column names as the argument.
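As a small sketch of the two calls mentioned above, on made-up data with hypothetical variable names: `reset_index()` moves the `id` index into an ordinary column (its default, `drop=False`, keeps the old index), and `drop(..., axis=1)` returns a new `DataFrame` without the listed columns.

```python
import pandas as pd

# Toy frame indexed by movie id, made up for illustration.
toy_combined_df = pd.DataFrame(
    {
        'popularity': [2.0, 1.5],
        'popularity_0': [2.0, 1.5],
        'popularity_1': [float('nan'), 3.8],
    },
    index=pd.Index([1, 3], name='id'),
)

# The old index is kept as a regular `id` column.
flat_df = toy_combined_df.reset_index()

# Remove the redundant column; the original frame is left unchanged.
trimmed_df = flat_df.drop(['popularity'], axis=1)
print(trimmed_df)
```

Passing `drop=True` to `reset_index` would discard the movie `id`s instead, which is not what this question asks for.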

In [None]:
duplicate_vote_count_df = ...

In [None]:
# Optional - remove the `popularity` and `vote_count` columns
...