In [None]:
from datascience import *
import numpy as np
from notebook.services.config import ConfigManager
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

cm = ConfigManager()
cm.update(
    "livereveal", {
        "width": "90%",
        "height": "70%",
        "scroll": True,
})

# DSC 10 Discussion Week 3
---
Kyle Vigil

# What have we learned so far?
---

1. Python
  - Assigning variables
  - Working with data types
  - Calling functions
  - Defining functions
  - If... elif... else...
  - For i in np.arange...
2. Arrays
  - Creating arrays (same type!)
  - Operations between arrays (add, mult)
  - Adding elements to arrays (using np.append)
3. Tables
  - Reading tables
  - Creating new tables
  - Grabbing data from columns
  - Creating a copy with additional columns
  - Creating a copy with only certain columns
  - Creating a copy with only certain rows
  - Creating a copy with rows grouped on a column and a collection function
4. Plots
  - Creating bar charts from 'category' and 'y' columns
  - Creating scatter plots and line plots from 'x' and 'y' columns
  - Creating histograms to count occurrences/density
5. Functions
  - Calling functions
  - Creating our own functions

# Any questions about recent things in class?
---

We don't necessarily need to get through this entire notebook; treat it more as a practice test that we're doing together.

What's most important is that we're all solid on the material covered in lecture, so that we can apply it to various datasets and find answers to questions!

If anything is unclear, ask about it now! No need to be shy; others are very likely wondering about the same thing.

# Ultimate Halloween Candy Showdown
---
269,000 user-submitted winners of head-to-head candy matchups

In [None]:
candy = Table.read_table("data/candy.csv")
for col in ["chocolate", "fruity", "caramel", "peanutyalmondy", "nougat", "crispedricewafer", "hard", "bar", "pluribus"]:
    candy = candy.with_column(col, candy.column(col).astype(bool))
candy

In [None]:
candy = candy.sort("winpercent", descending=True)
plt.figure(figsize=(20,8))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.barplot(x=candy.column("competitorname"), y=candy.column("winpercent"))
plt.xticks(rotation="vertical");

## Let's use group and plotting to analyze the data

* Grouping shows us aggregated data about groups of rows (e.g., the average win percentage of chocolate candy vs. non-chocolate candy)
* Plotting gives us visual insight into the data
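
To make the idea of grouping concrete, here is a minimal plain-Python sketch of roughly what grouping with an averaging function computes for a single value column: bucket the rows by a key, then aggregate each bucket. The `toy_rows` list and the `group_mean` helper are made up for illustration; they are not part of the `datascience` library or the candy dataset.

```python
from collections import defaultdict

# Hypothetical toy rows: (is_chocolate, winpercent). Not real candy data.
toy_rows = [(True, 80.0), (True, 60.0), (False, 40.0), (False, 50.0), (False, 30.0)]

def group_mean(rows):
    """Bucket values by key, then average each bucket (illustrative helper)."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key].append(value)
    # One output entry per distinct key, like one row per group
    return {key: sum(values) / len(values) for key, values in buckets.items()}

means = group_mean(toy_rows)
print(means)
```

Grouping on several columns follows the same pattern with a tuple of values as the key, so the result has one row per combination that actually appears in the data, which may be fewer than every possible combination.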

### Grouping

In [None]:
# Interpret the meaning of the output generated from:
candy.group("chocolate")

In [None]:
# Interpret the meaning of the output generated from:
candy.group("chocolate", np.mean)

In [None]:
# How does that compare to this output? What changes?
candy.group("chocolate", max)

In [None]:
# What is different about this statement? How many rows do you expect?
candy.group(["chocolate", "caramel","fruity","nougat"], np.mean)

### Plotting

In [None]:
# Create a histogram of the win percentages

In [None]:
# See if price percentage has an effect on win percentage. What type of graph can we use for this?

In [None]:
# What type of graph should we use for plotting fruity vs. non-fruity average win percentage?

# Functions and Apply
---

Review: How do you make a function to lowercase a string and then add " <- this is now lowercase" to the end? 
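
One possible answer, as a minimal sketch: `str.lower()` returns a lowercased copy of a string, and `+` joins two strings. The name `lowercase_sketch` is hypothetical and only keeps this example separate from the exercise cell.

```python
def lowercase_sketch(string):
    """Return the lowercased string with a note appended."""
    return string.lower() + " <- this is now lowercase"

result = lowercase_sketch("sTrInG")
print(result)
```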

In [None]:
def lowercase(string):
    pass

lowercase("sTrInG")

We can use functions to clean up messy data. A good example of messy data is a column of user-entered responses.

In [None]:
survey_responses = Table().with_columns({
    "name": ["Kyle", "Meghan", "Taylor", "Tyler", "Josh", " Chad"],
    "fav_food": ["Pizza", "pizza", "turkey sandwich", "pepperoni pizza", "Pepperoni Pizza", "ham sandwich"], 
    "phone": ["123-456-7890", "(123) 456-7890", "1234567890", "(123)4567890", "123 456 7890", "(123)456-7890"]
})
survey_responses

The data in this table is difficult to access and analyze as is. We can use our own functions and `apply` to clean it up.

1. Clean up fav_food by making it lowercase and keeping only the last word of the input.

2. Clean up the phone number by making it a string containing only its 10 digits (what about storing it as an int?).
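
A minimal sketch of both cleaning steps in plain Python, assuming every entry is a non-empty string. The `_sketch` names are hypothetical, chosen so they don't overwrite the exercise functions, and the list comprehensions stand in for calling a function on each entry of a column.

```python
def clean_fav_food_sketch(entry):
    """Lowercase the entry and keep only its last word."""
    return entry.lower().split()[-1]

def clean_phone_sketch(phone):
    """Keep only the digit characters, returned as a string."""
    return "".join(ch for ch in phone if ch.isdigit())

# Apply each function to every entry of a small made-up list
foods = [clean_fav_food_sketch(f) for f in ["Pizza", "Pepperoni Pizza", "ham sandwich"]]
phones = [clean_phone_sketch(p) for p in ["(123) 456-7890", "123 456 7890"]]
print(foods)
print(phones)
```

On the int question: keeping the phone number as a string preserves any leading zeros, which converting to an int would drop.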

In [None]:
# Make a function to clean up one entry of fav_food (lowercase and only last word)
def clean_fav_food(string):
    pass
    
clean_fav_food("this is a test")

In [None]:
survey_responses

In [None]:
# apply this function to the fav_food column
survey_responses = survey_responses.with_column("fav_food", survey_responses.apply(clean_fav_food, "fav_food"))
survey_responses

In [None]:
# Make a function to clean up one entry of phone (10 digit string)
def clean_phone(phone):
    pass
    
clean_phone("(123)456-7890")

In [None]:
# apply this function to the phone column
survey_responses = survey_responses.with_column("phone", survey_responses.apply(clean_phone, "phone"))
survey_responses

## Our dataset is now clean and ready for analysis!

# More Practice! Today's Dataset:
---

From Kaggle user Randi H Griffin:
>This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.
>
>Note that the Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake people make when analyzing this data is to assume that the Summer and Winter Games have always been staggered.
>
>The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:
>
>1. ID - Unique number for each athlete  
>2. Name - Athlete's name  
>3. Sex - M or F  
>4. Age - Integer  
>5. Height - In centimeters  
>6. Weight - In kilograms  
>7. Team - Team name  
>8. NOC - National Olympic Committee 3-letter code  
>9. Games - Year and season  
>10. Year - Integer  
>11. Season - Summer or Winter  
>12. City - Host city  
>13. Sport - Sport  
>14. Event - Event  
>15. Medal - Gold, Silver, Bronze, or NA  


In [None]:
data = Table.read_table("data/athlete_events.csv")
data

Since the Winter and Summer Games were held in the same year through 1992, let's look only at years after 1992.

In [None]:
data = data.where("Year", are.above(1992))
data

Alright.  We got rid of a lot of data (and notice that we got rid of many of Christine Jacoba Aaftink's games), but it should be more manageable now.

Also, while we're at it, let's ignore rows with missing values (shown as `nan`) for age, height, or weight.  You don't *need* to know how this next code block works, but feel free to follow along.
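
The idea behind the next block, sketched in plain Python on made-up values: find the positions where any measurement is missing (NaN), then keep only the other rows. The `toy_rows` list is hypothetical; the notebook itself does this with NumPy on the real table.

```python
import math

# Hypothetical (age, height, weight) rows; float("nan") marks a missing value
toy_rows = [
    (24.0, 180.0, 75.0),
    (float("nan"), 170.0, 60.0),
    (31.0, 165.0, float("nan")),
    (19.0, 158.0, 52.0),
]

# Positions of rows with at least one missing measurement
bad_positions = [i for i, row in enumerate(toy_rows) if any(math.isnan(v) for v in row)]

# Keep every row whose position is not in bad_positions
clean_rows = [row for i, row in enumerate(toy_rows) if i not in bad_positions]

print(bad_positions)
print(clean_rows)
```

NaN is not equal to itself, so an equality check cannot find it; that is why a dedicated `isnan` test is used.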

In [None]:
# Flat array of row indices where Age, Height, or Weight is missing (NaN)
nans = np.flatnonzero(np.isnan(data.column("Age")) | np.isnan(data.column("Height")) | np.isnan(data.column("Weight")))
data.take(nans[:3]).show() # This line is just to show us what these rows with missing values look like
print("We're getting rid of rows like these!")
data = data.exclude(nans)

# Interesting Analysis
---
1. Average age of medal winners.
    - Does this change for males and females?
    - min/max age of medal winners?
        - find their entries
2. Sports with the tallest/heaviest/oldest participants?
3. Count of sport season by team country.
4. Plot age by year, possibly split by sex.

In [None]:
# Warmup: Find the events with the heaviest average competitor weight. 
# Extra credit: can you guess what events these would be before we find them?

# Repeat Participants vs One-Timers
---

Who showed up multiple times?  Who only showed up once?

Can we get a table of just those people and their data?
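
One way to think about the steps that follow, sketched in plain Python with `collections.Counter` on made-up names: count how often each name appears, keep the names with a count above 1, then filter the rows by those names. All data and variable names here are hypothetical.

```python
from collections import Counter

# Hypothetical (name, age) rows, one per athlete-event
toy_rows = [("Ana", 22), ("Ben", 30), ("Ana", 26), ("Cy", 19), ("Ben", 34), ("Dee", 27)]

# Step 1: how many times does each name appear?
name_counts = Counter(name for name, _ in toy_rows)

# Step 2: names that appear more than once
repeat_name_set = {name for name, n in name_counts.items() if n > 1}

# Step 3: rows belonging to repeat participants, and the rest
repeat_rows = [row for row in toy_rows if row[0] in repeat_name_set]
onetime_rows = [row for row in toy_rows if row[0] not in repeat_name_set]

print(sorted(repeat_name_set))
print(repeat_rows)
print(onetime_rows)
```

Because each row of the Olympic dataset is one athlete in one event, counting rows per name also treats someone who entered several events at a single Games as a repeat. Two different athletes who share a name would be merged as well.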

In [None]:
# Let's get how many times each person was in the dataset
counts = ...
counts

In [None]:
# How to get just the names that have count > 1?
repeat_names = ...

And now we want just the data of people who were repeat participants.

In [None]:
# Now as simple as using a .where() statement
repeats = ...
repeats

# Beginning Age --> Repeat or One-Timer?
---

Can we tell if you'll participate in the Olympics again based on your age?
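
A plain-Python sketch of the comparison: take each repeat participant's youngest recorded age, average those, and set the result next to the average age of one-timers. The names and ages are made up.

```python
# Hypothetical (name, age) rows for repeat participants and one-timers
repeat_rows = [("Ana", 22), ("Ana", 26), ("Ben", 30), ("Ben", 34)]
onetime_rows = [("Cy", 19), ("Dee", 27)]

# Youngest recorded age per repeat participant
min_age = {}
for name, age in repeat_rows:
    min_age[name] = min(age, min_age.get(name, age))

# Average of those minimum ages vs. plain average for one-timers
avg_min_age_repeat = sum(min_age.values()) / len(min_age)
avg_age_onetime = sum(age for _, age in onetime_rows) / len(onetime_rows)

print(min_age)
print(avg_min_age_repeat, avg_age_onetime)
```

One-timers have a single row each in this toy data, so there is no minimum to take for them.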

In [None]:
# If someone showed up multiple times they were probably pretty young at their first Olympics, right?
# Let's find the avg min age of these repeat participants, and compare it to avg age of one-time participants
onetimes = data.where("Name", are.not_contained_in(repeat_names))

In [None]:
# How will we take avg of min ages for all participants?
# Well, by grouping of course!
# Let's only select the columns we need
min_ages_repeats = repeats.select("Name", "Age").group("Name", min)

# Why don't we need to do any of that here?
ages_onetimes = onetimes.select("Name", "Age")

min_ages_repeats

In [None]:
# Now as simple as comparing the means of the Age columns in the two tables
avg_min_age_repeats = np.mean(min_ages_repeats.column("Age min"))
avg_ages_onetimes = np.mean(ages_onetimes.column("Age"))

print("Average Beginning Age of Repeat Participants:", avg_min_age_repeats)
print("Average Age of One-Time Participants:", avg_ages_onetimes)

# Age distributions
---

In [None]:
# Hmm... not as different as I would have expected.

# Let's plot the distributions to look into this further

min_ages_repeats.hist("Age min")

In [None]:
ages_onetimes.hist("Age")

In [None]:
# Don't worry about these next code blocks, just focus on the histograms they create
import matplotlib.pyplot as plt

In [None]:
plt.hist(
    [min_ages_repeats.column("Age min"), ages_onetimes.column("Age")],
    bins=np.arange(10, 70, 5),
    density=True,
    histtype="stepfilled", alpha=0.6,
    # Labels are listed in the same order as the datasets above
    label=["repeat", "one time"]
)
plt.legend();

What percent of repeats/one-timers are 20-25?

Let's recall the axis label *"percent per unit"*

**Unit = Bin Size**

Then, there's some math
$$\text{percent per unit} = \frac{\text{percent}}{\text{bin size}}$$

$$\text{percent per unit} \times \text{bin size} = \text{percent}$$
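
As a worked example of that arithmetic with made-up numbers, the sketch below builds a density histogram by hand and checks that a bar's height times the bin size gives back the percent of values in that bin. The ages and bin edges are hypothetical.

```python
# Hypothetical ages and 5-year bins [15, 20), [20, 25), [25, 30)
ages = [16, 21, 22, 23, 24, 26, 27, 28, 29, 19]
bin_edges = [15, 20, 25, 30]
bin_size = 5

heights = []  # percent per unit (here: percent per year) for each bin
for left, right in zip(bin_edges[:-1], bin_edges[1:]):
    count = sum(1 for a in ages if left <= a < right)
    percent = 100 * count / len(ages)
    heights.append(percent / bin_size)

# Multiplying a bar's height by the bin size recovers the percent in that bin
percent_20_25 = heights[1] * bin_size
print(heights)
print(percent_20_25)
```

Note that `plt.hist` with `density=True` reports proportions (fractions of 1) per unit rather than percents, so multiply by 100 when reading a percent off that plot.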

In [None]:
# So, the percent of repeats that were 20-25 at their first game is:



In [None]:
# And the percent of one-timers that were 20-25 is:



But what are those percents out of?

In [None]:
# The percent of repeats is out of:
"""

"""

# The percent of one-timers is out of:
"""

"""

So how would we find the *number* of repeats between 20-25 at their first game versus the *number* of one-timers between 20-25?
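
A minimal sketch with made-up numbers: a percent only becomes a count once it is multiplied by the size of the group it is a percent of. The group sizes and percents below are hypothetical, not values from the Olympic data.

```python
# Hypothetical group sizes and percents aged 20-25
num_repeats = 2_000
num_onetimes = 10_000
percent_repeats_20_25 = 40
percent_onetimes_20_25 = 30

# count = percent * group size / 100
count_repeats_20_25 = percent_repeats_20_25 * num_repeats / 100
count_onetimes_20_25 = percent_onetimes_20_25 * num_onetimes / 100

print(count_repeats_20_25, count_onetimes_20_25)
```

In this toy case the repeats have the larger percent but the smaller count, because their group is much smaller.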

In [None]:
# To turn percent into a count we:
"""

"""

In [None]:
# Some code to get the "population sizes" of `min_ages_repeats` and `ages_onetimes`
# and multiply by the proportion
# Replace each ... with the matching proportion
num_repeats_2025 = min_ages_repeats.num_rows * ...
num_onetimes_2025 = ages_onetimes.num_rows * ...

print(num_repeats_2025, num_onetimes_2025)

Let's instead take a look at the count histogram.

In [None]:
plt.hist(
    [min_ages_repeats.column("Age min"), ages_onetimes.column("Age")],
    bins=np.arange(10, 70, 5),
    density=False,
    histtype="stepfilled", alpha=0.6
);

So, if you're 20-25, are you more likely to be a repeat or a one-timer?

How do you know this, and why might it be the case?
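
One way to frame the question, sketched with made-up counts: among everyone aged 20-25, compare the share who are repeats with the share who are one-timers. The two counts are hypothetical; the real ones come from the count histogram.

```python
# Hypothetical counts of athletes aged 20-25 in each group
count_repeats_20_25 = 800
count_onetimes_20_25 = 3000

# Share of each group among all athletes aged 20-25
total_20_25 = count_repeats_20_25 + count_onetimes_20_25
share_repeat = count_repeats_20_25 / total_20_25
share_onetime = count_onetimes_20_25 / total_20_25

more_likely = "repeat" if share_repeat > share_onetime else "one-timer"
print(round(share_repeat, 3), round(share_onetime, 3), more_likely)
```

The comparison uses counts rather than densities, since each density is a share of its own group and the two groups can differ a lot in size.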

In [None]:
# Our answer goes here
"""

"""