# Task 1: Instructions

Load the required libraries and the Nobel Prize dataset.

- Import the `pandas` library as `pd`.
- Import the `seaborn` library as `sns`.
- Import the `numpy` library as `np`.
- Use `pd.read_csv` to read in `datasets/nobel.csv` and save it into `nobel`.
- Show at least the first six entries of `nobel` using the `head()` method, setting `n=6` or greater.
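
As a rough illustration of how `pd.read_csv` and `head()` fit together, the sketch below reads a tiny in-memory CSV instead of `datasets/nobel.csv`. The `csv_text` contents, the `toy` name, and the rows are made up for the example and are not part of the project data.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (not the real Nobel dataset).
csv_text = """year,category,full_name
1901,Physics,Person A
1901,Chemistry,Person B
1902,Physics,Person C
1902,Medicine,Person D
1903,Physics,Person E
1903,Chemistry,Person F
1904,Literature,Person G
1904,Peace,Person H
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default; n=6 asks for six.
first_six = toy.head(n=6)
print(first_six)
```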

## Good to know

This project assumes you are familiar with the `pandas` and `seaborn` libraries. Before taking it on, we recommend that you complete the courses [Data Manipulation with pandas](https://www.datacamp.com/courses/data-manipulation-with-pandas) and [Intermediate Data Visualization with Seaborn](https://www.datacamp.com/courses/intermediate-data-visualization-with-seaborn).

Two cheat sheets will be useful throughout this project: DataCamp's [Seaborn cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf) and the [Data Wrangling with pandas](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) cheat sheet. We recommend keeping them open in separate tabs so they are easy to refer to.

# Task 2: Instructions

Count the Nobel Prizes, both in total and split by `sex` and `birth_country`.

- Count the number of rows/prizes using the `len()` function. Use the `display()` function to display the result.
- Count and `display` the number of prizes for each `sex` using the `value_counts()` method.
- Count the number of prizes for each `birth_country` using `value_counts()` and show the top 10 using `head()`. **Do not use `display()`.**

By default, a Jupyter Notebook (which is where you are working right now) will only show the final output in a cell. If you want to show intermediate results, you will have to use the `display()` function. See [here](https://cmdlinetips.com/2018/02/how-to-get-frequency-counts-of-a-column-in-pandas-dataframe/) for an example of how to use `value_counts()`.

Why `display()` over `print()`? Try them both out for yourself. You'll find that the output of `display()` is prettier. :)
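
To make the counting steps concrete, here is a minimal sketch on a made-up DataFrame (the `toy` frame and its rows are assumptions for illustration, not the Nobel data). It uses `print()` so that it also runs as a plain script; in a notebook you would wrap the intermediate results in `display()` instead.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male", None],
    "birth_country": ["France", "Poland", "France", "Germany", "France"],
})

# len() on a DataFrame gives the number of rows.
n_rows = len(toy)

# value_counts() sorts by frequency and skips missing values by default.
by_sex = toy["sex"].value_counts()

# Chaining head() keeps only the most frequent entries.
top_countries = toy["birth_country"].value_counts().head(2)

print(n_rows)
print(by_sex)
print(top_countries)
```

Note that the counts per `sex` add up to 4 rather than 5 here, because the row with a missing value is left out.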

# Task 3: Instructions

Create a DataFrame with two columns: the decade and the proportion of USA-born Nobel Prize winners in that decade.

- Add a `usa_born_winner` column to `nobel`, where the value is `True` when `birth_country` is `"United States of America"`.
- Add a `decade` column to `nobel` for the decade each prize was awarded. Here, `np.floor()` will come in handy. Ensure the decade column is of type `int64`.
- Use `groupby` to group by `decade`, setting `as_index=False`. Then isolate the `usa_born_winner` column and take the `mean()`. Assign the resulting DataFrame to `prop_usa_winners`.
- Display `prop_usa_winners`.

For the `decade` column, 1953 should become 1950, for example. Calculating this column is a bit tricky, but try to see if you can solve it using the `np.floor` function. If not, check the hint!

By setting `as_index=False`, you make sure that both the grouping variable and the calculated mean are included in the resulting DataFrame.
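
The floor-to-decade idea and the effect of `as_index=False` can be sketched on a small invented DataFrame. The `toy` frame, the `usa_born` column, and the `prop` name are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    "year": [1953, 1957, 1961, 1969, 1970],
    "usa_born": [True, False, True, True, False],
})

# Divide by 10, round down, multiply back: 1953 -> 195.3 -> 195.0 -> 1950.0.
# np.floor returns floats, so cast to int64 afterwards.
toy["decade"] = (np.floor(toy["year"] / 10) * 10).astype("int64")

# The mean of a boolean column is the proportion of True values.
# as_index=False keeps `decade` as a regular column next to the mean.
prop = toy.groupby("decade", as_index=False)["usa_born"].mean()
print(prop)
```

Without `as_index=False`, the same call would return a Series indexed by `decade` rather than a two-column DataFrame.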

# Task 4: Instructions

Plot the proportion of USA-born winners per decade.

- Use seaborn to plot `prop_usa_winners` with `decade` on the x-axis and `usa_born_winner` on the y-axis as an `sns.lineplot`. Assign the plot to `ax`.
- Fix the y-scale so that it shows percentages using `PercentFormatter`.

See [here](https://stackoverflow.com/a/36319915/1001848) for a Stack Overflow answer on how `PercentFormatter` works and [here](https://seaborn.pydata.org/generated/seaborn.lineplot.html) for the documentation of `lineplot`.
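
As a hedged sketch of the axis formatting step, the example below applies `PercentFormatter` to a plain matplotlib plot with made-up proportions. Since `sns.lineplot` returns a matplotlib `Axes`, the same `set_major_formatter` call should apply to it. The `"Agg"` backend line is only there so the sketch runs as a script without a display; it is not needed in a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook

import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# Toy proportions between 0 and 1, invented for this example.
fig, ax = plt.subplots()
ax.plot([1950, 1960, 1970], [0.25, 0.5, 0.75])

# xmax=1.0 tells the formatter that a value of 1.0 corresponds to 100%.
formatter = PercentFormatter(xmax=1.0, decimals=0)
ax.yaxis.set_major_formatter(formatter)

# Draw the figure so the tick labels are populated, then inspect them.
fig.canvas.draw()
labels = [tick.get_text() for tick in ax.get_yticklabels()]
print(labels)
```

If the plotted values were already on a 0 to 100 scale, `xmax=100` would be the matching choice.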

# Task 5: Instructions

Plot the proportion of female laureates by decade split by prize category.

- Add the `female_winner` column to `nobel`, where the value is `True` when `sex` is `"Female"`.
- Use `groupby` to group by both `decade` and `category`, setting `as_index=False`. Then isolate the `female_winner` column and take the `mean()`. Assign the resulting DataFrame to `prop_female_winners`.
- Copy and paste your `seaborn` plot from task 4 (including axis formatting code), but plot `prop_female_winners` and map the `category` variable to the `hue` parameter.

This task can be solved by copying and modifying the code from tasks 3 and 4.

# Task 6: Instructions

Extract and display the row showing the first woman to win a Nobel Prize.

- Select only the rows of `'Female'` winners in `nobel`.
- Using the `nsmallest()` method with its `n` and `columns` parameters, pick out the first woman to get a Nobel Prize.

See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nsmallest.html) for the documentation of `nsmallest()`.
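
The filter-then-`nsmallest()` pattern can be sketched on a small invented DataFrame. The `toy` frame, its `name` column, and the rows are assumptions for illustration, not the Nobel data.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    "year": [1911, 1903, 1905, 1901],
    "sex": ["Female", "Female", "Male", "Male"],
    "name": ["C", "A", "B", "D"],
})

# Boolean indexing keeps only the rows where the condition is True.
women = toy[toy["sex"] == "Female"]

# nsmallest returns the n rows with the smallest values in the given column.
first = women.nsmallest(n=1, columns="year")
print(first)
```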

# Task 7: Instructions

Extract and display the rows of repeat Nobel Prize winners.

- Use `groupby` to group `nobel` by `'full_name'`.
- Use the `filter` method to keep only those rows in `nobel` with winners with 2 or more prizes.

See [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html#filtration) for how to use the `filter` method.
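
As a minimal sketch of group filtration, the example below keeps only the rows that belong to groups of a certain size. The `toy` frame and its names are made up for the example.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    "full_name": ["Ann", "Bob", "Ann", "Cy", "Bob", "Ann"],
    "year": [1901, 1902, 1903, 1904, 1905, 1906],
})

# filter() calls the function once per group and keeps every row of the
# groups for which it returns True. Here: groups with two or more rows.
repeat = toy.groupby("full_name").filter(lambda group: len(group) >= 2)
print(repeat)
```

The result contains the original rows (not one row per group), so all of a repeat winner's entries stay visible.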

# Task 8: Instructions

Calculate and plot the age of each winner when they won their Nobel Prize.

- Convert the `nobel['birth_date']` column to datetime using `pd.to_datetime`.
- Add a new column `nobel['age']` that contains the age of each winner when they got the prize. That is, year of prize win minus birth year.
- Use `sns.lmplot` (**not** `sns.lineplot`) to make a plot with `year` on the x-axis and `age` on the y-axis.

To get the year from a datetime column, you need to access its `dt.year` attribute. Here is an example:

```
a_data_frame['a_datetime_column'].dt.year
```
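
Putting the conversion and the year subtraction together, a rough sketch on invented data might look like this (the `toy` frame and its dates are assumptions for the example):

```python
import pandas as pd

# Toy data, invented for this example; dates start out as plain strings.
toy = pd.DataFrame({
    "year": [1950, 1980],
    "birth_date": ["1900-06-15", "1925-01-02"],
})

# Convert the string column to datetime so that the .dt accessor is available.
toy["birth_date"] = pd.to_datetime(toy["birth_date"])

# Age here is simply the award year minus the birth year.
toy["age"] = toy["year"] - toy["birth_date"].dt.year
print(toy)
```

Because only the years are compared, this age can be off by one for someone whose birthday falls after the award date.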

Seaborn's `lmplot` is a 2D scatterplot with an optional overlaid regression line. This type of plot is useful for [visualizing linear relationships](https://seaborn.pydata.org/tutorial/regression.html).

To make the plot prettier, add the arguments `lowess=True`, `aspect=2`, and `line_kws={'color':'black'}`.

# Task 9: Instructions

Plot how old winners are within the different prize categories.

- As before, use `sns.lmplot` to make a plot with `year` on the x-axis and `age` on the y-axis. But this time, make one plot per prize category by setting the `row` argument to `'category'`.

This is the same plot as in task 8, except with the added `row=` argument (examples in the official Seaborn documentation [here](https://seaborn.pydata.org/generated/seaborn.lmplot.html)).

# Task 10: Instructions

Pick out the rows of the oldest and the youngest winner of a Nobel Prize.

- Use `nlargest()` to pick out and display the row of the oldest winner.
- Use `nsmallest()` to pick out and display the row of the youngest winner.

As before, you will need to use `display()` to display more than the last output of the cell. Here is the documentation for [`nsmallest`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nsmallest.html) and [`nlargest`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nlargest.html).

# Task 11: Instructions

- Assign the name of the youngest winner of a Nobel Prize to `youngest_winner`. The first name will suffice.

## If you want to know more

The Nobel Prize dataset is rich, and this project just scratched the surface -- there is much more to explore! After you have completed this project, you can download it and continue exploring on your own! To do that you will have to install Jupyter Notebooks. Here are instructions for [how to install the Jupyter Notebook interface](http://jupyter.org/install.html). Good luck!