# %title%

**_Author: Favio Vázquez_**

**_Reviewer: Jessica Cervi_**

**Expected time = %expected_time%**

**Total points = 110 points**


## Assignment Overview

In this assignment you will manipulate and analyze data with Pandas. You will begin by reviewing how to read data into a series or dataframe, then you will use additional functionalities of Pandas objects. After that, you will index, select and edit data inside dataframes. In the final parts of the assignment you will be combining, grouping and aggregating dataframes.

This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question. 
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions.


### Learning Objectives

- Use Pandas to build, extract, filter, and transform DataFrames.
- Describe Pandas data structures: DataFrames and Series.  
- Use Pandas objects for analyses. 

## Index:

#### %title%

- [Question 1](#Question-1)
- [Question 2](#Question-2)
- [Question 3](#Question-3)
- [Question 4](#Question-4)
- [Question 5](#Question-5)
- [Question 6](#Question-6)
- [Question 7](#Question-7)
- [Question 8](#Question-8)
- [Question 9](#Question-9)
- [Question 10](#Question-10)
- [Question 11](#Question-11)
- [Question 12](#Question-12)
- [Question 13](#Question-13)
- [Question 14](#Question-14)

## %title%

In [1]:
# Let's start by importing Pandas
import pandas as pd

# Avoid warnings
import warnings
warnings.filterwarnings("ignore")

### Importing data

We will begin this assignment with a review of how to import data with Pandas. For several parts of this assignment we will use two files from the Kaggle dataset *120 years of Olympic history: athletes and results*. The files are saved inside the folder `data/`; more information about the dataset can be found on Kaggle at this link:

https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. 

The file `athlete_events.csv` contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:

- ID - Unique number for each athlete
- Name - Athlete's name
- Sex - M or F
- Age - Athlete's age
- Height - In centimeters
- Weight - In kilograms
- Team - Team name
- NOC - National Olympic Committee 3-letter code
- Games - Year and season
- Year - Year of game
- Season - Summer or Winter
- City - Host city
- Sport - Sport
- Event - Event
- Medal - Gold, Silver, Bronze, or NA

The file `noc_regions.csv` contains 230 rows and 3 columns. Each row contains information about one National Olympic Committee (NOC). The columns are:

- NOC - National Olympic Committee abbreviation
- region - Name of country in NOC
- notes - Notes about the region and NOC
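As a small illustration of how `read_csv` handles missing entries such as the `NA` medals described above, here is a sketch that reads a hypothetical three-row sample from an in-memory string instead of a file. The sample rows and the names `csv_text` and `sample` are made up for this example and are not part of the assignment data.

```python
import io
import pandas as pd

# Hypothetical sample that mimics a few columns of athlete_events.csv
csv_text = """ID,Name,Sex,Age,Medal
1,Athlete A,M,24,NA
2,Athlete B,F,21,Gold
3,Athlete C,M,,NA
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# By default, the string "NA" and empty fields are read as missing values
print(sample.shape)
print(sample["Medal"].isna().sum())
print(sample["Age"].isna().sum())
```

Missing values of this kind are what you see as empty or `NaN` entries when you display a dataframe.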


[Back to top](#Index:) 

### Question 1
*5 points*

Read the CSV file named `"athlete_events.csv"` contained in the `data/` folder and assign it to a dataframe called `df`.

In [2]:
### GRADED

### YOUR SOLUTION HERE
df = None

### BEGIN SOLUTION
df = pd.read_csv("data/athlete_events.csv")
### END SOLUTION

In [3]:
### BEGIN HIDDEN TESTS (5)
from pandas.testing import assert_frame_equal
df_ = pd.read_csv("data/athlete_events.csv")
#
#
#
assert df.equals(df_), "Did you import the dataframe correctly?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


In [4]:
# Let's take a look at our dataframe df
df.head()

| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |


In [5]:
# Let's see the shape of our dataframe df
print("Number of rows: {}, number of columns: {}".format(df.shape[0],df.shape[1]))

Number of rows: 271116, number of columns: 15


[Back to top](#Index:) 

### Question 2
*5 points*

Read the CSV file named `"noc_regions.csv"` in the `data/` folder and assign it to a dataframe called `regions`.

In [6]:
### GRADED

### YOUR SOLUTION HERE
regions = None

### BEGIN SOLUTION
regions = pd.read_csv("data/noc_regions.csv")
### END SOLUTION

In [7]:
### BEGIN HIDDEN TESTS (5)
from pandas.testing import assert_frame_equal
regions_ = pd.read_csv("data/noc_regions.csv")
#
#
#
assert regions_.equals(regions),  "Did you import the dataframe correctly?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


In [8]:
# Let's take a look at our dataframe regions
regions.head()

| | NOC | region | notes |
|---|---|---|---|
| 0 | AFG | Afghanistan | NaN |
| 1 | AHO | Curacao | Netherlands Antilles |
| 2 | ALB | Albania | NaN |
| 3 | ALG | Algeria | NaN |
| 4 | AND | Andorra | NaN |


In [9]:
# Let's see the shape of our dataframe regions
print("Number of rows: {}, number of columns: {}".format(regions.shape[0],regions.shape[1]))

Number of rows: 230, number of columns: 3


### Pandas Objects

In this part of the assignment we will begin studying the two most important objects exposed by Pandas: Series and DataFrames. As you may remember:
- **Series** is a 1-dimensional data structure in Pandas.
- **DataFrame** is a 2-dimensional data structure in Pandas, made up of columns and rows.
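To make the difference concrete, the following sketch builds one object of each kind from toy values and checks its number of dimensions. The values and the names `ages` and `toy` are invented for illustration.

```python
import pandas as pd

# A Series: one column of values with an index of labels
ages = pd.Series([24.0, 23.0, 34.0], name="Age")

# A DataFrame: several columns that share the same index
toy = pd.DataFrame({"Age": [24.0, 23.0, 34.0], "Sex": ["M", "M", "F"]})

print(type(ages).__name__, ages.ndim, ages.shape)
print(type(toy).__name__, toy.ndim, toy.shape)
```

Each column of a DataFrame is itself a Series, which is why selecting a single column in the next questions returns a Series.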

[Back to top](#Index:) 

### Question 3
*5 points*

Select a series from the dataframe `df` with the contents of the column `Height` and store it in a variable called `height`. 

In [10]:
### GRADED

### YOUR SOLUTION HERE
height = None

### BEGIN SOLUTION
height = df["Height"]
### END SOLUTION

In [11]:
### BEGIN HIDDEN TESTS (5)
from pandas.testing import assert_series_equal
height_ = df["Height"]
#
#
#
assert height.equals(height_), "Did you extract the correct column?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


[Back to top](#Index:) 

### Question 4
*10 points*

In the videos, you have seen how you can use the function `map` to transform the entries of a `pandas` `series`.
In a similar way, you can use the function `rename` to change the index labels of a series. Like `map`, `rename` accepts a function (for example, a lambda function) that performs the desired transformation; `rename` applies it to each label rather than to each entry.

The syntax is as follows:
```Python
new_series = series.rename(lambda x: your function)
```

Use the function `rename` to rename the index (or labels) of the series `height` by raising each old label to the power of two, like so:

$$
0 \rightarrow 0 \\
1 \rightarrow 1 \\
2 \rightarrow 4 \\
3 \rightarrow 9 \\
\vdots
$$

Save this new series in a variable called `height_new`.
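The difference between `rename` and `map` is easy to mix up, so here is a minimal sketch on a toy series (the values and the label offset are arbitrary choices for this example, not the transformation asked for in the question).

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])  # default labels are 0, 1, 2

# rename with a function: labels change, values stay the same
relabeled = s.rename(lambda label: label + 100)

# map with a function: values change, labels stay the same
transformed = s.map(lambda value: value / 10)

print(list(relabeled.index), list(relabeled.values))
print(list(transformed.index), list(transformed.values))
```

Neither call modifies `s` itself; both return a new series.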

In [12]:
### GRADED

### YOUR SOLUTION HERE
height_new = None

### BEGIN SOLUTION
height_new = height.rename(lambda x: x ** 2)
### END SOLUTION

In [13]:
### BEGIN HIDDEN TESTS (10)
from pandas.testing import assert_series_equal
height_new_ = height.rename(lambda x: x ** 2)
#
#
#
assert height_new.equals(height_new_), "Did you define the lambda function correctly?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


[Back to top](#Index:) 

### Question 5
*5 points*

Select a series from the dataframe `regions` with the contents of the column `region` and store it in a variable called `reg`.

In [14]:
### GRADED

### YOUR SOLUTION HERE
reg = None

### BEGIN SOLUTION
reg = regions["region"]
### END SOLUTION

In [15]:
### BEGIN HIDDEN TESTS (5)
from pandas.testing import assert_series_equal
reg_ = regions["region"]
#
#
#
assert reg.equals(reg_),"Did you extract the correct column?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


[Back to top](#Index:) 

### Question 6
*5 points*
    
You can also select multiple columns at once from a dataframe to create a new dataframe.


Create a new dataframe from the dataframe `df` that contains only the columns `ID`, `Age`, `Height`, `Weight` and `Sex`, in this specific order. Name this new dataframe `df_subset`.
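As a quick reminder of the bracket notation, this sketch uses a toy dataframe with made-up columns to show that a single label returns a Series, while a list of labels returns a DataFrame whose columns follow the order of the list.

```python
import pandas as pd

toy = pd.DataFrame({"ID": [1, 2], "Name": ["A", "B"], "Age": [24.0, 23.0]})

one_col = toy["Age"]           # single label -> Series
two_cols = toy[["Age", "ID"]]  # list of labels -> DataFrame, in the listed order

print(type(one_col).__name__)
print(list(two_cols.columns))
```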

In [16]:
### GRADED

### YOUR SOLUTION HERE
df_subset = None

### BEGIN SOLUTION
df_subset = df[["ID", "Age", "Height", "Weight", "Sex"]]
### END SOLUTION

In [17]:
### BEGIN HIDDEN TESTS (5)
from pandas.testing import assert_frame_equal
df_subset_ = df[["ID", "Age", "Height", "Weight", "Sex"]]
#
#
#
assert df_subset.equals(df_subset_), "Did you select all the required columns?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


Let's have a look at the dataframe `df_subset` by using the command `.head()`

In [18]:
df_subset.head()

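| | ID | Age | Height | Weight | Sex |
|---|---|---|---|---|---|
| 0 | 1 | 24.0 | 180.0 | 80.0 | M |
| 1 | 2 | 23.0 | 170.0 | 60.0 | M |
| 2 | 3 | 24.0 | NaN | NaN | M |
| 3 | 4 | 34.0 | NaN | NaN | M |
| 4 | 5 | 21.0 | 185.0 | 82.0 | F |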


[Back to top](#Index:) 

### Question 7
*10 points*
    
Observe the dataframe `df_subset`, above. You see that the column `Sex` contains entries `M` and `F` based on whether an athlete was a male or a female, respectively.

Create a new column, `New sex`, in the dataframe `df_subset`. Fill this column by using the function `map` to change the entries of the column `Sex` from `M` to `male` and from `F` to `female`.

The syntax is as follows:
```Python
dataframe['column'] = dataframe.column.map(lambda x: your function)
```

**HINT: Notice that we are adding a new column, not replacing an existing one!**
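The same pattern can be tried on a different column first. The sketch below uses a toy dataframe with a made-up `Season` column and adds a new column in two ways: with a lambda function, as in the syntax above, and with a dictionary, which `map` also accepts. The column names `Code` and `Code2` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Season": ["Summer", "Winter", "Summer"]})

# Lambda form: the function is applied to every entry of the column
toy["Code"] = toy.Season.map(lambda s: "S" if s == "Summer" else "W")

# Dictionary form: each entry is looked up in the dictionary;
# entries without a matching key would become NaN
toy["Code2"] = toy.Season.map({"Summer": "S", "Winter": "W"})

print(toy)
```

In both cases the original `Season` column is left untouched, because the result is assigned to a new column name.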


In [19]:
### GRADED

### YOUR SOLUTION HERE

### BEGIN SOLUTION
df_subset["New sex"] = df_subset.Sex.map(lambda x: 'male' if x == 'M' else 'female')
### END SOLUTION

In [20]:
### BEGIN HIDDEN TESTS (10)
from pandas.testing import assert_frame_equal
df_subset_ = df[["ID", "Age", "Height", "Weight", "Sex"]]
df_subset_["New sex"] = df_subset_.Sex.map(lambda x: 'male' if x == 'M' else 'female')
#
#
#
assert df_subset.equals(df_subset_), "Did you define the entries correctly?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


### Indexing and selecting data from Dataframes

In this part of the assignment we will work with the dataframes from above and select specific data using different Pandas methods and attributes. You have learned to use `loc[]` and `iloc[]` to do this.
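One detail worth keeping in mind is that `loc[]` selects by label and includes both endpoints of a slice, while `iloc[]` selects by integer position and excludes the end of a slice. The sketch below shows this on a toy dataframe whose index labels were chosen, for illustration only, to differ from the row positions.

```python
import pandas as pd

toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[20:30, "a":"b"]  # labels 20 through 30, columns a through b (inclusive)
by_position = toy.iloc[1:3, 0:2]    # positions 1 and 2, columns 0 and 1 (end excluded)

print(by_label)
print(by_label.equals(by_position))
```

When the index is the default 0, 1, 2, ... (as in `df`), labels and positions coincide, but the inclusive and exclusive slice endpoints still differ.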

[Back to top](#Index:) 

### Question 8
*5 points*

Create a new dataframe called `df_1` by selecting the following from the dataframe `df`:
- the rows with labels from 3 through 11.
- the columns from `ID` to `Height`.

In [21]:
### GRADED

### YOUR SOLUTION HERE
df_1 = None

### BEGIN SOLUTION
df_1 = df.loc[3:11, "ID": "Height"]
### END SOLUTION

In [22]:
### BEGIN HIDDEN TESTS (5)
from pandas.testing import assert_frame_equal
df_1_ = df.loc[3:11, "ID":"Height"]
#
#
#
assert df_1.equals(df_1_), "Did you select ALL the correct entries?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


[Back to top](#Index:) 

### Question 9
*10 points*

Select all the rows from the dataframe `df` where `Year` is greater than 1980. Assign this dataframe to `df_year`.

Next, select all the rows in `df_year` where `Team` is equal to "China", "United States", "Italy" or "Spain". Save your results in a dataframe called `df_country`.

**HINT**: To select only the desired countries, create a list containing the countries and use the function `isin` like so:

```Python
df.Team.isin([list with countries])
```
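Both steps rely on boolean masks: a comparison or an `isin` call returns a Series of `True`/`False` values, and `loc[]` keeps the rows where the mask is `True`. Here is a minimal sketch on toy data (the teams and years are invented and the threshold is the only thing borrowed from the question).

```python
import pandas as pd

toy = pd.DataFrame({
    "Team": ["Brazil", "Japan", "Kenya", "Japan"],
    "Year": [1976, 1984, 1992, 1972],
})

# Step 1: keep the rows that satisfy a comparison
recent = toy.loc[toy.Year > 1980]

# Step 2: from those rows, keep the ones whose Team is in a given list
picked = recent.loc[recent.Team.isin(["Japan", "Brazil"])]

print(recent)
print(picked)
```

Note that the second mask is built from `recent`, not from `toy`, so that the mask and the dataframe it filters share the same index.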

In [23]:
### GRADED

### YOUR SOLUTION HERE
df_year = None
df_country = None

### BEGIN SOLUTION
df_year = df.loc[df.Year > 1980]
df_country = df_year.loc[df_year.Team.isin(["China", "United States", "Italy", "Spain"])]
### END SOLUTION

In [24]:
### BEGIN HIDDEN TESTS (10)
from pandas.testing import assert_frame_equal
df_year_ = df.loc[df.Year > 1980]
df_country_ = df_year_.loc[df_year_.Team.isin(["China", "United States", "Italy", "Spain"])]
#
#
#
assert df_year.equals(df_year_), "Did you select all the required entries?"
assert df_country.equals(df_country_), "Did you select all the required entries?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


[Back to top](#Index:) 

### Question 10
*5 points*

Using the indexer `iloc[]`, select the rows at positions 0, 10, 20, 40, 43, 66 and the columns at positions 0, 3, 5 from the dataframe `df`. Store your results in a dataframe called `df_3`.

In [25]:
### GRADED

### YOUR SOLUTION HERE
df_3 = None

### BEGIN SOLUTION
df_3 = df.iloc[[0, 10, 20, 40, 43, 66],[0, 3, 5]]
### END SOLUTION

In [26]:
### BEGIN HIDDEN TESTS (5)
from pandas.testing import assert_frame_equal
df_3_ = df.iloc[[0, 10, 20, 40, 43, 66],[0, 3, 5]]
#
#
#
assert df_3.equals(df_3_), "Did you use the indexer iloc[] correctly?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


### Editing data in DataFrames and Combining DataFrames

In this section we will modify the internal structure and data of dataframes, deleting some of their columns and transforming others. We will also combine our dataframes `df` and `regions` and learn different ways of working with them.

[Back to top](#Index:) 

### Question 11
*10 points*
    
Use a `left` join to combine the dataframes `df` and `regions`, in this particular order, into a new dataframe called `merged`. Set the column `NOC` as the key column.

**HINT**: Use the `pandas` function `merge`.
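In a left join, every row of the first (left) dataframe is kept, and columns from the second dataframe are filled with `NaN` where the key has no match. The following sketch illustrates this on two toy dataframes; the codes `AAA`, `BBB`, `ZZZ` and the region names are placeholders, not real NOC values.

```python
import pandas as pd

left = pd.DataFrame({"NOC": ["AAA", "BBB", "ZZZ"], "Name": ["x", "y", "z"]})
right = pd.DataFrame({"NOC": ["AAA", "BBB"], "region": ["Region A", "Region B"]})

# Keep all rows of `left`; look up matching rows of `right` by the key column
joined = pd.merge(left, right, on="NOC", how="left")

print(joined)
```

The row with key `ZZZ` survives the join even though `right` has no entry for it, which is the behavior that distinguishes `how='left'` from the default inner join.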

In [27]:
### GRADED

# Let's read our data again to have the original datasets
df = pd.read_csv("data/athlete_events.csv")
regions = pd.read_csv("data/noc_regions.csv")

### YOUR SOLUTION HERE
merged = None

### BEGIN SOLUTION
merged = pd.merge(df, regions, on='NOC', how='left')
### END SOLUTION

In [28]:
### BEGIN HIDDEN TESTS (10)
from pandas.testing import assert_frame_equal
df_ = pd.read_csv("data/athlete_events.csv")
regions_ = pd.read_csv("data/noc_regions.csv")
merged_ = pd.merge(df_, regions_, on='NOC', how='left')
#
#
#
assert merged.equals(merged_), "Did you join the dataframes as required?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


### Grouping and Aggregating DataFrames

In this final section we will group and perform aggregations on our dataframes.

In [29]:
# Let's read our data again to have the original datasets
df = pd.read_csv("data/athlete_events.csv")
regions = pd.read_csv("data/noc_regions.csv")
regions.head()

| | NOC | region | notes |
|---|---|---|---|
| 0 | AFG | Afghanistan | NaN |
| 1 | AHO | Curacao | Netherlands Antilles |
| 2 | ALB | Albania | NaN |
| 3 | ALG | Algeria | NaN |
| 4 | AND | Andorra | NaN |


[Back to top](#Index:) 

### Question 12
*10 points*

Reshape the `regions` dataframe by setting:
- index = `region`
- columns = `NOC`
- values = `notes`

Assign the new dataframe to `regions_stacked`.
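Reshaping of this kind turns a "long" table into a "wide" one: the distinct values of one column become the row labels, the distinct values of another become the column labels, and a third column supplies the cell values. Here is a sketch using the `pivot` function on a toy table with invented `city`, `year` and `temp` columns.

```python
import pandas as pd

long_table = pd.DataFrame({
    "city": ["X", "X", "Y"],
    "year": [2000, 2004, 2000],
    "temp": [10, 12, 20],
})

wide_table = long_table.pivot(index="city", columns="year", values="temp")

print(wide_table)
```

Combinations that do not occur in the long table, such as city `Y` in year 2004 here, appear as `NaN` in the reshaped dataframe.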

In [30]:
### GRADED

### YOUR SOLUTION HERE

regions_stacked = pd.DataFrame()
### BEGIN SOLUTION
regions_stacked = regions.pivot(index = 'region', columns = 'NOC', values = 'notes')
### END SOLUTION

In [31]:
### BEGIN HIDDEN TESTS (10)
regions_stacked_ = regions.pivot(index = 'region', columns = 'NOC', values = 'notes')
#
#
#
assert regions_stacked.equals(regions_stacked_), "Did you use the function `pivot` correctly?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


[Back to top](#Index:) 

### Question 13
*10 points*

Group the entries of the dataframe `df` by the columns `"Season"` and `"Medal"`. Assign the resulting grouped object to `season_medal`.

Also, answer the following question: In which Olympic games did the first group of athletes participate?
- a) 2014 Winter
- b) 1920 Summer
- c) 1994 Summer
- d) 1920 Winter

Assign the letter of the correct answer, as a string, to `ans13`.

For instance, if you believe the correct answer is *a) 2014 Winter*, your answer would be `ans13 = 'a'`.
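Keep in mind that `groupby` does not return a dataframe directly; it returns a grouped object on which you then call a method such as `first()`. The sketch below shows this on toy data with made-up `Games` labels. It also shows that, by default, rows with a missing value in a grouping column are left out and the groups are sorted by key.

```python
import pandas as pd

toy = pd.DataFrame({
    "Season": ["Winter", "Summer", "Summer", "Winter"],
    "Medal": ["Gold", "Gold", None, "Gold"],
    "Games": ["G1", "G2", "G3", "G4"],
})

grouped = toy.groupby(["Season", "Medal"])  # a GroupBy object, not a DataFrame
firsts = grouped.first()                    # first non-missing entry per column in each group

print(type(grouped).__name__)
print(firsts)
```

Because the groups are sorted, the first row of `firsts` belongs to the group whose keys come first in sorted order, not to the first row of the original dataframe.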

In [32]:
### GRADED

### YOUR SOLUTION HERE
season_medal = None
ans13 =  None

### BEGIN SOLUTION
season_medal = df.groupby(["Season", "Medal"])
season_medal.first()
ans13 = 'b'
### END SOLUTION

In [33]:
### BEGIN HIDDEN TESTS (10)
season_medal_ = df.groupby(["Season", "Medal"])
season_medal_.first()
ans13_ = 'b'
#
#
#
assert ans13 == ans13_, "Are you sure?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!


[Back to top](#Index:) 

### Question 14
*15 points*

Perform the following aggregation operations on `df`:

- Compute the `max` and the `min` on the column `Age`.
- Compute the `mean` on the column `Weight`.
- Compute the `max`, `min` and `mean` on the column `Height`.

Assign the new dataframe to `df_aggr`.
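One way to request different statistics for different columns is to pass a dictionary to `aggregate`, mapping each column name to a list of function names. The sketch below tries this on a toy dataframe with invented columns `x` and `y`.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})

# Each key is a column; each value lists the statistics wanted for that column
summary = toy.aggregate({"x": ["max", "min"], "y": ["mean"]})

print(summary)
```

The result has one row per statistic and one column per original column; a statistic that was not requested for a given column shows up as `NaN`.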


In [34]:
### GRADED

### YOUR SOLUTION HERE
df_aggr = None

### BEGIN SOLUTION
df_aggr = df.aggregate({ "Age":['max', 'min'], 
              "Weight":['mean'],  
              "Height":['max', 'min', 'mean']}) 
### END SOLUTION

In [35]:
### BEGIN HIDDEN TESTS (15)
df_aggr_ = df.aggregate({ "Age":['max', 'min'], 
              "Weight":['mean'],  
              "Height":['max', 'min', 'mean']}) 
#
#
#
assert df_aggr.equals(df_aggr_),  "Did you compute all the required aggregation operations?"
print("That's correct!")
### END HIDDEN TESTS

That's correct!
