# Assignment 2

The variable `data` holds the base path where the data is located. Modify it as needed.

In [1]:
data = "gs://<BUCKET-NAME>/notebooks/jupyter/data/"

## Data

This is a historical dataset on the modern Olympic Games, covering all the Games from Athens 1896 to Rio 2016. The data was taken from Kaggle. The `athlete_events` dataset contains $271,116$ rows and $15$ columns, and the `noc_regions` dataset contains $230$ rows and $3$ columns. The two will be merged on the National Olympic Committee (NOC) code. Both files are comma-separated.

**Source:**

Griffin, R. H. (2018). 120 years of Olympic history: athletes and results, athlete_events. Found at: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results#athlete_events.csv

Griffin, R. H. (2018). 120 years of Olympic history: athletes and results, noc_regions. Found at: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results#noc_regions.csv

**ATTRIBUTES:**

**athlete_events.csv**

| Column Name | Data Type | Description/Notes |
|:----:|:----:|:----|
| ID |  integer | Unique number for each athlete |
| Name | string | Athlete’s name |
| Sex | string | M or F |
| Age | integer |  |
| Height | integer | In centimeters |
| Weight | integer | In kilograms |
| Team | string | Team name |
| NOC | string | National Olympic Committee, 3 letter code (Matches with `NOC` from noc_regions.csv) |
| Games | string | Year and season |
| Year | integer |  |
| Season | string | Summer or Winter |
| City | string | Host city |
| Sport | string |  |
| Event | string |  |
| Medal | string | Gold, Silver, Bronze, or NA |

**noc_regions.csv**

| Column Name | Data Type | Description/Notes |
|:--|--|:--|
| NOC | string | National Olympic Committee, 3 letter code (Matches with `NOC` from athlete_events.csv) |
| region | string |  |
| notes | string |  |

## Upload the data into Google Cloud Storage

Use the source links above to download the two files, then upload them to your Google Cloud Storage bucket. For consistency, use the following path:

`gs://<BUCKET-NAME>/notebooks/jupyter/data/olympics-analysis`

and upload the files into the *olympics-analysis* directory.

Confirm that the files were uploaded successfully and are accessible from the notebook by running the following `gsutil` command:

In [21]:
!gsutil ls {data + "olympics-analysis"}

gs://qst843/notebooks/jupyter/data/olympics-analysis/
gs://qst843/notebooks/jupyter/data/olympics-analysis/athlete_events.csv
gs://qst843/notebooks/jupyter/data/olympics-analysis/noc_regions.csv


## Load the data into Spark

As seen in the [class notes](https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/06-Basic-DF-Operations/01-Basic-Structured-Operations.ipynb), we can either ask Spark to infer the schema or specify it explicitly ourselves. For this example, we need to specify the schema explicitly, since schema inference will not convert every column to the type we want.

As a reminder, here is how we can define a schema containing two columns, one long integer and one string:

```python
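from pyspark.sql.types import StructField, StructType, StringType, LongType

# Each StructField takes a column name, a data type, and a nullable flag
myManualSchema = StructType([
    StructField("ID", LongType(), True),
    StructField("name", StringType(), True)
])

# The path below is a placeholder; `spark` is the notebook's SparkSession
df = spark.read.csv(path=data + "gs/path/to/file",
                    header=True, quote='"',
                    schema=myManualSchema)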
```

Modify this code to load athlete_events.csv. Call this DataFrame `athlete_events`:

**Note 1:** Our data contains "NA" values, which can cause issues when loading, for example in the integer columns. To overcome this, we need to tell Spark which string represents `null` in the data. Use the option/parameter `nullValue` and set it to "NA". You will need to modify the code snippet to adjust for this.

**Note 2:** Text fields in CSV files can contain commas, which the parser would otherwise mistake for delimiters and split into separate cells. The standard remedy is to enclose such fields in quotes, which is what `quote='"'` in the previous snippet declares. This creates a second problem when the text itself contains quote characters, since a quote could terminate the field prematurely. By default Spark expects such quotes to be escaped with a backslash; setting `escape='"'` tells the parser that a quote inside a quoted field is escaped by another quote (`""`). Our `athlete_events` file needs this setting to parse correctly, so modify the provided code snippet accordingly.
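The two notes above can be illustrated outside Spark. The following is a minimal sketch using Python's standard `csv` module on made-up rows (the names and values are hypothetical, not taken from `athlete_events.csv`). It shows a quoted field containing a comma, a quoted field containing doubled quotes, and an "NA" placeholder mapped to a null value. Spark's CSV reader is a different parser, so treat this only as an illustration of the file convention, not of Spark's behavior.

```python
import csv
import io

# Toy CSV text (hypothetical rows, not taken from athlete_events.csv).
# Row 1 has a comma inside a quoted field; row 2 has embedded quotes,
# written as doubled quotes inside a quoted field.
raw = (
    'ID,Name,Medal\n'
    '1,"Doe, Jane",Gold\n'
    '2,"John ""Johnny"" Smith",NA\n'
)

# doublequote=True means "" inside a quoted field is read as one literal quote
reader = csv.reader(io.StringIO(raw), quotechar='"', doublequote=True)
header, *rows = list(reader)

# Map the placeholder string "NA" to None, similar in spirit to a null-value option
cleaned = [[None if cell == "NA" else cell for cell in row] for row in rows]

for row in cleaned:
    print(row)
```

Each row keeps exactly three cells: the comma and the embedded quotes stay inside the `Name` field, and the "NA" medal becomes `None`.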

In [33]:
# Your answer goes here


Print the schema of this DataFrame:

In [23]:
# Your answer goes here


root
 |-- ID: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Height: long (nullable = true)
 |-- Weight: long (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: long (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)



Print the first 5 rows:

In [24]:
# Your answer goes here


+---+--------------------+---+---+------+------+--------------+---+-----------+----+------+---------+-------------+--------------------+-----+
| ID|                Name|Sex|Age|Height|Weight|          Team|NOC|      Games|Year|Season|     City|        Sport|               Event|Medal|
+---+--------------------+---+---+------+------+--------------+---+-----------+----+------+---------+-------------+--------------------+-----+
|  1|           A Dijiang|  M| 24|   180|    80|         China|CHN|1992 Summer|1992|Summer|Barcelona|   Basketball|Basketball Men's ...| null|
|  2|            A Lamusi|  M| 23|   170|    60|         China|CHN|2012 Summer|2012|Summer|   London|         Judo|Judo Men's Extra-...| null|
|  3| Gunnar Nielsen Aaby|  M| 24|  null|  null|       Denmark|DEN|1920 Summer|1920|Summer|Antwerpen|     Football|Football Men's Fo...| null|
|  4|Edgar Lindenau Aabye|  M| 34|  null|  null|Denmark/Sweden|DEN|1900 Summer|1900|Summer|    Paris|   Tug-Of-War|Tug-Of-War Men's ...| Gold|

Load noc_regions.csv. Call this DataFrame `noc_regions`:

In [25]:
# Your answer goes here


In [26]:
noc_regions.show(5)

+---+-----------+--------------------+
|NOC|     region|               notes|
+---+-----------+--------------------+
|AFG|Afghanistan|                null|
|AHO|    Curacao|Netherlands Antilles|
|ALB|    Albania|                null|
|ALG|    Algeria|                null|
|AND|    Andorra|                null|
+---+-----------+--------------------+
only showing top 5 rows



### Caching

Since we will be using these two DataFrames a lot in this notebook, let's `cache()` them to speed up execution. Caching lets a DataFrame be loaded once and persist in memory. Without it, every action we execute reloads the DataFrame from Cloud Storage, which adds to our execution time:

**Note:** Caching is lazy. The data is stored in memory the first time you execute an action against the DataFrame, not at the moment you call `cache()`.

In [37]:
athlete_events.cache()

24/03/11 03:28:42 WARN CacheManager: Asked to cache already cached data.


DataFrame[ID: bigint, Name: string, Sex: string, Age: bigint, Height: bigint, Weight: bigint, Team: string, NOC: string, Games: string, Year: bigint, Season: string, City: string, Sport: string, Event: string, Medal: string]

In [38]:
noc_regions.cache()

24/03/11 03:28:44 WARN CacheManager: Asked to cache already cached data.


DataFrame[NOC: string, region: string, notes: string]

Let's do a couple of quick sanity checks: 

What are the minimum and maximum values of `Year`? Do they match the description of the dataset?

In [2]:
# Your answer goes here


What is the total number of rows in the `athlete_events` DataFrame?

In [3]:
# Your answer goes here


## Question 1&2
Not every athlete receives a medal in the Olympics. How many records have a non-null value for the `Medal` field? In other words, how many medals were given according to this dataset?

**Note:** It is okay to double-count the medals for the team sports for this question.
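To make the double-counting remark concrete, here is a small plain-Python sketch on made-up rows (the names, events, and tuple layout are hypothetical). In a team event every team member has their own record, so counting medal records counts one medal per member rather than one per event.

```python
# Hypothetical records as (Name, Event, Medal); None stands for a null Medal
rows = [
    ("Player A", "Team Event", "Gold"),
    ("Player B", "Team Event", "Gold"),  # teammate of Player A, same medal
    ("Player C", "Solo Event", "Gold"),
    ("Player D", "Solo Event", None),    # no medal
]

# Counting records with a non-null Medal counts each team member separately
medal_rows = [r for r in rows if r[2] is not None]

# Counting distinct (Event, Medal) pairs would count the team medal once
distinct_event_medals = {(event, medal) for _, event, medal in medal_rows}

print(len(medal_rows), len(distinct_event_medals))
```

The first count is the one this question asks for; the second is shown only to explain where the difference comes from.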

In [4]:
# Your answer goes here


## Question 3&4

Which of the statements below are correct, according to this dataset as of 2016?

A. "Rugby Sevens" is the latest sport that was added to the Olympic Games in 2016.

B. "Trampolining" is an existing Olympic sport and is considered one of the latest additions to the Olympic sports.

C. "Football", "Fencing", "Volleyball" and "Wrestling" are among the very first sports in the Olympics that are still present in the games.

D. "Snowboarding" is a new addition to the Winter Olympics, added in 1998.

E. "Baseball" is one of the latest sports that was removed from the games.

In [5]:
# Your answer goes here
# Statement A: "Rugby Sevens" is the latest sport that was added to the Olympic Games in 2016.


In [6]:
# Your answer goes here
# Statement B: "Trampolining" is an existing Olympic sport and is considered one of the latest additions to the Olympic sports.


In [7]:
# Your answer goes here
# Statement C: "Football", "Fencing", "Volleyball" and "Wrestling" are among the very first sports in the Olympics that are still present in the games.


In [8]:
# Your answer goes here
# Statement D: "Snowboarding" is a new addition to the Winter Olympics, added in 1998.


In [9]:
# Your answer goes here
# Statement E: "Baseball" is one of the latest sports that was removed from the games.


## Question 5&6

True or False?

> The average age of female athletes who attended the Olympic games after 1990 has risen compared to the era before.

**Note:** Possible double counting is okay when answering this question (i.e., if an athlete competed in more than one event in a given year, it's okay to count her multiple times when computing this average).

In [10]:
# Your answer goes here


In [11]:
# Your answer goes here


## Question 7&8

How many Gold medals were given to men from 1970 to 2000 (including both years)?


**Note:** It is okay to double-count the medals for the team sports for the purpose of this question.

In [12]:
# Your answer goes here


## Question 9&10

Can you help us identify how many athletes attended the Olympic Games in 2016? We are trying to identify the hotels that could have handled such numbers.

**Note 1:** You can use the method `.distinct()` to get the unique values.

**Note 2:** Watch out for athletes with similar names.
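Note 2 can be illustrated with a small plain-Python sketch on made-up records (the IDs, names, and events are hypothetical). Two different athletes may share a name, and one athlete may appear in several events, so the number of records, the number of distinct names, and the number of distinct IDs can all differ.

```python
# Hypothetical records as (ID, Name, Event)
records = [
    (101, "Alex Kim", "Event A"),
    (101, "Alex Kim", "Event B"),  # same athlete, second event
    (202, "Alex Kim", "Event C"),  # different athlete with the same name
    (303, "Sam Lee", "Event A"),
]

distinct_names = {name for _, name, _ in records}
distinct_ids = {athlete_id for athlete_id, _, _ in records}

# Records, distinct names, distinct IDs
print(len(records), len(distinct_names), len(distinct_ids))
```

Here the four records cover three athletes but only two distinct names, which is why the choice of column passed to `.distinct()` matters.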

In [13]:
# Your answer goes here


## Question 11&12

Who won the event "Swimming Men's 100 metres Breaststroke" in 2004? Please note that the Event description uses the British spelling "metres".

In [14]:
# Your answer goes here


## Question 13&14

In which city was the largest number of games played? (By "games" in this question we refer to individual races.)

In [15]:
# Your answer goes here


## Question 15&16

For the city where the maximum number of games were played, find the number of female player(s) whose age is greater than the average age of all the players in the dataset.

In [16]:
# Your answer goes here


## Question 17

Make a new column in the DataFrame titled `BMI` using the columns `Height` and `Weight`. Impute the missing values in the `Height` and `Weight` columns with the average of the remaining values in the respective column. Round the values in the `BMI` column to 1 decimal place.

Note that $BMI=\frac{Weight_{kg}}{Height_m^2}$. Since `Height` is recorded in centimeters, convert it to meters first.
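As a worked example of the arithmetic only, here is a plain-Python sketch on made-up values (the helper names `mean_ignoring_missing` and `bmi` are hypothetical, and this is not the Spark solution). It fills missing heights and weights with the mean of the available values, converts centimeters to meters, and rounds the result to 1 decimal place.

```python
def mean_ignoring_missing(values):
    """Average of the values that are present (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def bmi(weight_kg, height_cm):
    """BMI = weight in kg divided by the square of height in meters."""
    height_m = height_cm / 100  # centimeters to meters
    return round(weight_kg / height_m ** 2, 1)


# Made-up heights (cm) and weights (kg); None marks a missing value
heights = [180, None, 170]
weights = [80, 60, None]

mean_height = mean_ignoring_missing(heights)
mean_weight = mean_ignoring_missing(weights)

# Replace each missing value with the mean of its own column
filled = [
    (w if w is not None else mean_weight, h if h is not None else mean_height)
    for w, h in zip(weights, heights)
]

bmis = [bmi(w, h) for w, h in filled]
print(bmis)
```

For the first row, 80 kg at 180 cm gives $80 / 1.8^2 \approx 24.7$.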

In [17]:
# Your answer goes here


## Question 18&19

Create two DataFrames, one for the Winter Games and one for the Summer Games. Each DataFrame should list all NOCs that have won gold medals in the Olympics, along with their gold medal count. Sort these DataFrames by the count in descending order. Call these DataFrames `winter_gold_count` and `summer_gold_count` respectively. Using these two, answer the following questions:

Which country has the highest gold medal count in the Winter Olympics? How about the Summer Olympics?

In [18]:
# Your answer goes here


## Question 20&21

Using the common field `NOC`, merge the `summer_gold_count` and `noc_regions` DataFrames.

Which region takes the 10th place? This is based on the number of gold medals in all of the Summer Olympics in our dataset.

In [19]:
# Your answer goes here
