In [None]:
from datascience import *
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from notebook.services.config import ConfigManager

cm = ConfigManager()
cm.update(
    "livereveal", {
        'width': 1500,
        'height': 700,
        "scroll": True,
})

# DSC 10 Discussion Week 4
---
Kyle Vigil

# Practice With Join

In [None]:
people = Table().with_columns("name",["kyle","jill","cole","alex"],"age",[24,22,21,24], "city", ["San Diego","LA","San Francisco","Irvine"])
people

In [None]:
cities = Table().with_columns("name", ["San Diego", "LA", "San Francisco","Denver","New York"], "Popular Food", ["California Burrito", "Tacos", "Sourdough", "Denver Omelette", "Cheesecake"])
cities

In [None]:
important_birthdays = Table().with_columns("age", [21,21,22,24], "importance", ["Legal Drinking Age", "Officially an Adult", "Taylor Swift Song", "Kyle's Age"])
important_birthdays

## How to join people with cities? How many rows will there be?

## How to join people with important_birthdays? How many rows will there be?

## How to join all three? How many rows? Does order matter?
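One way to check your answers to the three questions above is to mimic the join by hand. The sketch below is plain Python, not the `datascience` library: `inner_join` is a hypothetical helper, and it assumes a join keeps one output row for every pair of rows whose key values match and drops rows with no match.

```python
# Plain-Python stand-ins for the three tables above (lists of row dicts).
people = [
    {"name": "kyle", "age": 24, "city": "San Diego"},
    {"name": "jill", "age": 22, "city": "LA"},
    {"name": "cole", "age": 21, "city": "San Francisco"},
    {"name": "alex", "age": 24, "city": "Irvine"},
]
cities = [
    {"name": "San Diego", "Popular Food": "California Burrito"},
    {"name": "LA", "Popular Food": "Tacos"},
    {"name": "San Francisco", "Popular Food": "Sourdough"},
    {"name": "Denver", "Popular Food": "Denver Omelette"},
    {"name": "New York", "Popular Food": "Cheesecake"},
]
important_birthdays = [
    {"age": 21, "importance": "Legal Drinking Age"},
    {"age": 21, "importance": "Officially an Adult"},
    {"age": 22, "importance": "Taylor Swift Song"},
    {"age": 24, "importance": "Kyle's Age"},
]


def inner_join(left, left_key, right, right_key):
    """Hypothetical helper: one merged row per (left, right) pair with equal keys."""
    joined = []
    for left_row in left:
        for right_row in right:
            if left_row[left_key] == right_row[right_key]:
                merged = dict(left_row)
                # Copy every right-hand column except its key, which would be a duplicate.
                for label, value in right_row.items():
                    if label != right_key:
                        merged[label] = value
                joined.append(merged)
    return joined


people_cities = inner_join(people, "city", cities, "name")
people_birthdays = inner_join(people, "age", important_birthdays, "age")

# Join all three in both orders to see whether the order changes the row count.
cities_then_birthdays = inner_join(people_cities, "age", important_birthdays, "age")
birthdays_then_cities = inner_join(people_birthdays, "city", cities, "name")

for label, rows in [
    ("people + cities", people_cities),
    ("people + important_birthdays", people_birthdays),
    ("all three, cities first", cities_then_birthdays),
    ("all three, birthdays first", birthdays_then_cities),
]:
    print(label, len(rows))
```

Compare the printed counts with your predictions. The rows to watch are the ones with a repeated key (age 21 appears twice in `important_birthdays`) and the ones with no match (Irvine is not in `cities`).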

# Olympic Athletes
---

From Kaggle user Randi H Griffin:
>This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.
>
>Note that the Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake people make when analyzing this data is to assume that the Summer and Winter Games have always been staggered.
>
>Content
>
>The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:
>
>1. ID - Unique number for each athlete  
>2. Name - Athlete's name  
>3. Sex - M or F  
>4. Age - Integer  
>5. Height - In centimeters  
>6. Weight - In kilograms  
>7. Team - Team name  
>8. NOC - National Olympic Committee 3-letter code  
>9. Games - Year and season  
>10. Year - Integer  
>11. Season - Summer or Winter  
>12. City - Host city  
>13. Sport - Sport  
>14. Event - Event  
>15. Medal - Gold, Silver, Bronze, or NA  


In [None]:
data = Table.read_table("data/athlete_events.csv")
data

# Let's assign points to each country
---

Let's say we're assigning points to each country based on the number of Golds, Silvers, and Bronzes they've won.

Medals are worth the following number of points:

<pre>
  Gold    +5 pts
  Silver  +3 pts
  Bronze  +2 pts
  nan     0 pts
</pre>

In [None]:
# How are we going to do this?

In [None]:
def medal_to_points(medal):
    if medal == "Gold":
        return 5
    elif medal == "Silver":
        return 3
    elif medal == "Bronze":
        return 2
    else:
        return 0

Okay, now we need to apply that function to our table.

What does `apply` return again?  And how will we use what it returns?

In [None]:
data_with_points = ...
data_with_points
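As a reminder, `apply` hands back an array with one value per row, not a new table, so the result still has to be attached as a column (for example with `with_column`). Here is a plain-Python sketch of the same idea on a made-up list of medals; the sample values are an assumption for illustration, not rows from the dataset.

```python
def medal_to_points(medal):
    # Same scoring rule as the function defined above.
    if medal == "Gold":
        return 5
    elif medal == "Silver":
        return 3
    elif medal == "Bronze":
        return 2
    else:
        return 0


# Made-up sample of a "Medal" column; missing medals show up as nan.
sample_medals = ["Gold", float("nan"), "Bronze", "Silver", float("nan")]

# Calling the function once per entry is what applying it to a column amounts to.
sample_points = [medal_to_points(medal) for medal in sample_medals]
print(sample_points)  # [5, 0, 2, 3, 0]
```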

If we only care about the country and the points, do we need to work with this entire table?

In [None]:
# Select relevant columns
country_points = data_with_points.select("NOC", "Points")

Now, how do we find the total number of points scored by each country?

In [None]:
# Group by country
scores = ...

scores.sort("Points", descending=True)
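Grouping by country and summing means collecting every row with the same NOC and adding up its points. A minimal plain-Python sketch with made-up rows (the values are assumptions, not real results):

```python
from collections import defaultdict

# Made-up (NOC, Points) rows standing in for country_points.
sample_rows = [("USA", 5), ("CHN", 3), ("USA", 2), ("GBR", 0), ("CHN", 5)]

# Add each row's points to the running total for its country.
totals = defaultdict(int)
for noc, pts in sample_rows:
    totals[noc] += pts

# Sort countries from most to fewest points, like sort(..., descending=True).
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)  # [('CHN', 8), ('USA', 7), ('GBR', 0)]
```

If you use `group` with `sum` in the notebook, check the label of the resulting column before sorting: the library may name it `Points sum` rather than `Points`, in which case it needs relabeling first.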

Cool!  Looks like we're at the top :)

What happens if we change our function to weight the medals differently?
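To get a feel for this, here is a small sketch with two hypothetical teams and two weightings. The medal counts and the `gold_heavy` weights are made up for illustration; only the 5/3/2 weighting comes from above.

```python
def total_points(medal_counts, weights):
    """Weighted sum of a team's medal counts."""
    return sum(weights[medal] * count for medal, count in medal_counts.items())


# Made-up medal counts for two hypothetical teams.
team_a = {"Gold": 1, "Silver": 0, "Bronze": 0}
team_b = {"Gold": 0, "Silver": 2, "Bronze": 0}

original = {"Gold": 5, "Silver": 3, "Bronze": 2}
gold_heavy = {"Gold": 10, "Silver": 3, "Bronze": 2}  # hypothetical alternative weighting

print(total_points(team_a, original), total_points(team_b, original))      # 5 6
print(total_points(team_a, gold_heavy), total_points(team_b, gold_heavy))  # 10 6
```

With the original weights team B is ahead; with the gold-heavy weights team A is. The ranking depends on the weights we choose, not just on the medals.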

# What are the points of the top 5 countries over time?
---

This one might be a doozy, so let's work through it together.

First, let's start by choosing 5 countries and only working with their data.  This will make things a bit more manageable.  Just as we found out before, we should use NOC.

In [None]:
included_countries = ["USA", "CHN", "RUS", "GBR", "GER"]

We have already added points to the entire dataset based on the Medal placement, so let's just get our countries from that `data_with_points` table.

In [None]:
# Solution #1
countries = data_with_points.where("NOC", are.contained_in(included_countries))
countries

In [None]:
# Solution #2 with join
inc_countries = ...
inc_countries

In [None]:
countries = ...
countries
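The idea behind the join solution: put the five codes in their own one-column table, then join the full table to it on NOC. Rows whose NOC has no match in the small table drop out, which should leave the same rows as the `where` filter. A plain-Python sketch with made-up rows (the sample data is an assumption):

```python
included_countries = ["USA", "CHN", "RUS", "GBR", "GER"]

# Made-up rows standing in for data_with_points.
sample_rows = [
    {"NOC": "USA", "Year": 2016, "Points": 5},
    {"NOC": "FRA", "Year": 2016, "Points": 3},
    {"NOC": "GBR", "Year": 2012, "Points": 2},
    {"NOC": "JPN", "Year": 2012, "Points": 0},
]

# Solution #1 style: keep rows whose NOC is in the list.
filtered = [row for row in sample_rows if row["NOC"] in included_countries]

# Solution #2 style: a one-column "table" of codes, matched against every row.
inc_countries = [{"NOC": code} for code in included_countries]
joined = [
    row
    for row in sample_rows
    for inc in inc_countries
    if row["NOC"] == inc["NOC"]
]

print(filtered == joined)  # True
```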

Since our data is time-specific, we should make sure that we're keeping it sorted by date.

In [None]:
countries = countries.sort("Year")
countries

We should also limit our data to just what we want.

In [None]:
countries = countries.select("Year", "NOC", "Points")
countries

Now we get to try out a handy-dandy new method that we learned recently: `.groups`.

This takes a list of column names and gives us one row for every unique combination of values in those columns.

For example, let's try out `.groups` on a simple table first.

In [None]:
tbl = Table().with_columns(
    "Alph", ["A", "A", "A", "B", "B", "C"],
    "Numb", [1, 2, 3, 4, 4, 1],
    "Data", [5.8, 2.6, 4.4, 9.8, 10.2, 4.3]
)

tbl

In [None]:
tbl.groups(["Alph", "Numb"])
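With no collection function, `.groups` counts how many rows share each combination of values. A plain-Python sketch of that count on the same `Alph` and `Numb` columns, using `collections.Counter` as a stand-in for the table method:

```python
from collections import Counter

alph = ["A", "A", "A", "B", "B", "C"]
numb = [1, 2, 3, 4, 4, 1]

# Count how many rows share each (Alph, Numb) pair.
pair_counts = Counter(zip(alph, numb))
for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

Six rows go in, but only five unique pairs come out, because `("B", 4)` appears twice.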

Alrighty, back to our Olympics data!

For every year we want every NOC.  So, the columns that we pass into `.groups` should probably be those.

For every year and NOC, we probably want the total number of points that country earned that year.  What collection function should we use?

In [None]:
points = ...
points

Is there a different/better way to view this table?

In [None]:
# What if we want a "Year" column, and then a column for every NOC?

# Then, we want the values to be the total points for that year for that NOC.
points = countries.pivot("NOC", "Year", "Points", sum)
points
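A pivot can be pictured as a two-level tally: one row per Year, one column per NOC, and each cell holding the sum of the matching Points. A plain-Python sketch on made-up rows (the values are assumptions, and filling empty cells with 0 is a choice made in this sketch):

```python
# Made-up (Year, NOC, Points) rows standing in for the countries table.
sample_rows = [
    (2012, "USA", 5), (2012, "USA", 3), (2012, "CHN", 5),
    (2016, "USA", 2), (2016, "GBR", 3),
]

years = sorted({year for year, _, _ in sample_rows})
nocs = sorted({noc for _, noc, _ in sample_rows})

# Start every (year, NOC) cell at 0, then add each row's points to its cell.
pivoted = {year: {noc: 0 for noc in nocs} for year in years}
for year, noc, pts in sample_rows:
    pivoted[year][noc] += pts

for year in years:
    print(year, pivoted[year])
```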

Right, now let's plot this data! We want to plot the score for each country over time.

What type of plot would work best here, for time-based data?

In [None]:
points.plot("Year", width=10, height=8)
plt.title("Olympic Medal Points Over Time by Country")
plt.ylabel("Points");

This graph is awfully confusing. A more intuitive way to visualize this is through the cumulative number of points each country has. Now, how do we get the total points *so far* of each country?  Let's move outside of the table for now, and work with numpy a bit.

In [None]:
# Let's get the total points so far for each year for China.
np.cumsum(points.column("CHN"))
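A cumulative sum replaces each entry with the total of everything up to and including it. A small sketch with made-up numbers, using `itertools.accumulate` from the standard library as a stand-in for `np.cumsum`:

```python
from itertools import accumulate

points_per_games = [2, 0, 5, 3]  # made-up points, one entry per Olympic year

# Each entry becomes the sum of itself and everything before it.
running_total = list(accumulate(points_per_games))
print(running_total)  # [2, 2, 7, 10]
```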

So, let's replace each column in our table with the cumulative sum data like we just calculated!

In [None]:
chn = np.cumsum(points.column("CHN"))
usa = np.cumsum(points.column("USA"))
rus = np.cumsum(points.column("RUS"))
gbr = np.cumsum(points.column("GBR"))
ger = np.cumsum(points.column("GER"))

In [None]:
cumulative_points = points.with_columns([
    "CHN", chn,
    "USA", usa,
    "RUS", rus,
    "GBR", gbr,
    "GER", ger
])

cumulative_points

## Can we rewrite this with a for loop? (Hint: yes)
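One possible shape for the loop: go through the country codes and replace each column with its running total. The sketch below uses a plain dict of lists in place of the table, with made-up numbers, and `itertools.accumulate` in place of `np.cumsum`; both substitutions are assumptions made to keep the example self-contained.

```python
from itertools import accumulate

included_countries = ["USA", "CHN", "RUS", "GBR", "GER"]

# Made-up per-year points, one list per country (stand-in for the pivoted table).
sample_points = {
    "USA": [5, 3, 2],
    "CHN": [0, 5, 5],
    "RUS": [3, 0, 2],
    "GBR": [2, 2, 0],
    "GER": [0, 3, 3],
}

cumulative = {}
for noc in included_countries:
    # Same step as the five separate cumsum lines above, done once per country.
    cumulative[noc] = list(accumulate(sample_points[noc]))

print(cumulative["USA"])  # [5, 8, 10]
```

In the notebook, the loop body would instead compute `np.cumsum(points.column(noc))` and attach the result to the table with `with_column`.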

This brings us to the final plot of the notebook:

In [None]:
cumulative_points.plot("Year", width=10, height=8)
plt.title("Cumulative Olympic Medal Points Over Time by Country")
plt.ylabel("Points");