# Pandas for Data Analysis 1 – sorting, filtering, subsetting

## Import libraries

In [1]:
## import libraries
import pandas as pd

## Upload Olympics datasets to this notebook.

### Import ```summer.csv``` and store in a dataframe called ```oly_s```.

### Import ```winter.csv``` and store in a dataframe called ```oly_w```.

In [2]:
## import summer data
oly_s = pd.read_csv("data/summer.csv")
oly_s.head()

Unnamed: 0,Year,City,Sport,Discipline,Athlete,Country,Gender,Event,Medal
0,1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100M Freestyle,Gold
1,1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100M Freestyle,Silver
2,1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100M Freestyle For Sailors,Bronze
3,1896,Athens,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100M Freestyle For Sailors,Gold
4,1896,Athens,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100M Freestyle For Sailors,Silver


In [3]:
## import winter data
oly_w = pd.read_csv("data/winter.csv")
oly_w.head()

Unnamed: 0,Year,City,Sport,Discipline,Athlete,Country,Gender,Event,Medal
0,1924,Chamonix,Biathlon,Biathlon,"BERTHET, G.",FRA,Men,Military Patrol,Bronze
1,1924,Chamonix,Biathlon,Biathlon,"MANDRILLON, C.",FRA,Men,Military Patrol,Bronze
2,1924,Chamonix,Biathlon,Biathlon,"MANDRILLON, Maurice",FRA,Men,Military Patrol,Bronze
3,1924,Chamonix,Biathlon,Biathlon,"VANDELLE, André",FRA,Men,Military Patrol,Bronze
4,1924,Chamonix,Biathlon,Biathlon,"AUFDENBLATTEN, Adolf",SUI,Men,Military Patrol,Gold


## Get a sense of the data

In [4]:
## SUMMER
## what exactly do we have: columns, datatypes, etc.
oly_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31165 entries, 0 to 31164
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Year        31165 non-null  int64 
 1   City        31165 non-null  object
 2   Sport       31165 non-null  object
 3   Discipline  31165 non-null  object
 4   Athlete     31165 non-null  object
 5   Country     31161 non-null  object
 6   Gender      31165 non-null  object
 7   Event       31165 non-null  object
 8   Medal       31165 non-null  object
dtypes: int64(1), object(8)
memory usage: 2.1+ MB


In [5]:
## WINTER
## what exactly do we have: columns, datatypes, etc.
oly_w.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5770 entries, 0 to 5769
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Year        5770 non-null   int64 
 1   City        5770 non-null   object
 2   Sport       5770 non-null   object
 3   Discipline  5770 non-null   object
 4   Athlete     5770 non-null   object
 5   Country     5770 non-null   object
 6   Gender      5770 non-null   object
 7   Event       5770 non-null   object
 8   Medal       5770 non-null   object
dtypes: int64(1), object(8)
memory usage: 405.8+ KB


### ```Series``` v. ```DataFrame```

A single column of data in a ```DataFrame``` is known as a ```Series```.

A ```Series``` is like a Python ```list```, but every value also carries an index label and the whole column has a single dtype.


#### Why does this matter?

#### Different methods work on ```series``` and ```dataframes```.
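As a rough illustration of why the distinction matters, the sketch below builds a tiny dataframe from a few values seen in the winter preview (the name `demo` is made up for this example) and shows that single brackets give a Series while double brackets give a DataFrame. It also shows one method, `.unique()`, that exists on a Series but not on a DataFrame.

```python
import pandas as pd

# Toy rows based on the winter data preview; "demo" is a name made up for this sketch
demo = pd.DataFrame({
    "Year": [1924, 1924, 2014],
    "City": ["Chamonix", "Chamonix", "Sochi"],
})

one_col = demo["Year"]     # single brackets -> Series
two_d = demo[["Year"]]     # double brackets -> DataFrame with one column

print(type(one_col).__name__)
print(type(two_d).__name__)

# .unique() is defined on Series but not on DataFrame
print(one_col.unique())
print(hasattr(two_d, "unique"))
```

If a method call fails with an `AttributeError`, checking `type(...)` of the object is a quick first step.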

In [6]:
## METHOD 1 to return a series
## df.column_name returns a Pandas Series
oly_w.Year

0       1924
1       1924
2       1924
3       1924
4       1924
        ... 
5765    2014
5766    2014
5767    2014
5768    2014
5769    2014
Name: Year, Length: 5770, dtype: int64

In [7]:
## METHOD 2 to return a series
## df['ColumnName'] also returns a Pandas Series
oly_w["Year"]

0       1924
1       1924
2       1924
3       1924
4       1924
        ... 
5765    2014
5766    2014
5767    2014
5768    2014
5769    2014
Name: Year, Length: 5770, dtype: int64

### Method 2 is preferred because you will often work with data with multiword column headers:

This will break:

```df.Multi Word Header```

This will NOT break:

```df["Multi Word Header"]```
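A small sketch of the point above, using a hypothetical `"Host City"` column that is not part of the Olympics files. Bracket notation handles the space. Dot notation only works once the column has been renamed to a valid Python identifier.

```python
import pandas as pd

# "Host City" is a hypothetical multiword header used only for illustration
demo = pd.DataFrame({"Host City": ["Chamonix", "Sochi"], "Year": [1924, 2014]})

# Bracket notation works with any column name
print(demo["Host City"])

# Dot notation works only for names that are valid Python identifiers
print(demo.Year)

# demo.Host City  <- this line would be a SyntaxError, so it stays commented out

# One workaround: rename the column once, then dot notation is available
clean = demo.rename(columns={"Host City": "host_city"})
print(clean.host_city)
```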


### If you pass a list of column headers, it returns a dataframe with only those columns

####  ```df[["Column 1", "Column 2"]]```

In [44]:
## df[["Column1", "Column2"]] returns a dataframe
df = oly_w[["Year", "City"]]
df.sample(10)

Unnamed: 0,Year,City
2783,1992,Albertville
4267,2006,Turin
1127,1964,Innsbruck
5117,2010,Vancouver
3303,1998,Nagano
543,1948,St.Moritz
208,1932,Lake Placid
1909,1980,Lake Placid
4208,2006,Turin
1120,1964,Innsbruck


## Sorting

```df_name.sort_values(by=["Value 1", "Value 2"], ascending = True)``` for A-Z or smaller numbers to bigger numbers

```ascending = False``` for Z-A or bigger to smaller numbers.
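`ascending` can also take a list with one value per sort column, which allows mixed directions. The sketch below uses a few Year/City pairs from the sample shown earlier (the name `demo` is made up) and sorts cities A-Z while putting the most recent year first within each city.

```python
import pandas as pd

# A handful of Year/City pairs taken from the sample output above
demo = pd.DataFrame({
    "Year": [1992, 2006, 1964, 2010, 1932, 1980],
    "City": ["Albertville", "Turin", "Innsbruck", "Vancouver",
             "Lake Placid", "Lake Placid"],
})

# City ascending (A-Z), Year descending within each city
mixed = demo.sort_values(by=["City", "Year"], ascending=[True, False])
print(mixed)

# sort_values returns a new dataframe; demo itself keeps its original order
print(demo)
```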


In [9]:
## Sort by values (City and Year)

city_yr_sort = df.sort_values(by=["City", "Year"], ascending=True)
city_yr_sort

Unnamed: 0,Year,City
2502,1992,Albertville
2503,1992,Albertville
2504,1992,Albertville
2505,1992,Albertville
2506,1992,Albertville
...,...,...
5153,2010,Vancouver
5154,2010,Vancouver
5155,2010,Vancouver
5156,2010,Vancouver


In [10]:
## sort by city and year but descending
city_yr_sort = df.sort_values(by=["City","Year"], ascending=False)
city_yr_sort

Unnamed: 0,Year,City
4629,2010,Vancouver
4630,2010,Vancouver
4631,2010,Vancouver
4632,2010,Vancouver
4633,2010,Vancouver
...,...,...
2822,1992,Albertville
2823,1992,Albertville
2824,1992,Albertville
2825,1992,Albertville


## Should Winter and Summer data remain separate?


### What if we want to know total gold medals for any country?

### But if we combine them, how do we easily distinguish between summer and winter?

### Creating a new column with a default value

In [11]:
## In oly_s, add "Season" column with the value "Summer"
oly_s["Season"] = "Summer"
oly_s.head()

Unnamed: 0,Year,City,Sport,Discipline,Athlete,Country,Gender,Event,Medal,Season
0,1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100M Freestyle,Gold,Summer
1,1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100M Freestyle,Silver,Summer
2,1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100M Freestyle For Sailors,Bronze,Summer
3,1896,Athens,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100M Freestyle For Sailors,Gold,Summer
4,1896,Athens,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100M Freestyle For Sailors,Silver,Summer


In [12]:
## In oly_w, add "Season" column with the value "Winter"
oly_w["Season"] = "Winter"
oly_w

Unnamed: 0,Year,City,Sport,Discipline,Athlete,Country,Gender,Event,Medal,Season
0,1924,Chamonix,Biathlon,Biathlon,"BERTHET, G.",FRA,Men,Military Patrol,Bronze,Winter
1,1924,Chamonix,Biathlon,Biathlon,"MANDRILLON, C.",FRA,Men,Military Patrol,Bronze,Winter
2,1924,Chamonix,Biathlon,Biathlon,"MANDRILLON, Maurice",FRA,Men,Military Patrol,Bronze,Winter
3,1924,Chamonix,Biathlon,Biathlon,"VANDELLE, André",FRA,Men,Military Patrol,Bronze,Winter
4,1924,Chamonix,Biathlon,Biathlon,"AUFDENBLATTEN, Adolf",SUI,Men,Military Patrol,Gold,Winter
...,...,...,...,...,...,...,...,...,...,...
5765,2014,Sochi,Skiing,Snowboard,"JONES, Jenny",GBR,Women,Slopestyle,Bronze,Winter
5766,2014,Sochi,Skiing,Snowboard,"ANDERSON, Jamie",USA,Women,Slopestyle,Gold,Winter
5767,2014,Sochi,Skiing,Snowboard,"MALTAIS, Dominique",CAN,Women,Snowboard Cross,Silver,Winter
5768,2014,Sochi,Skiing,Snowboard,"SAMKOVA, Eva",CZE,Women,Snowboard Cross,Gold,Winter


In [45]:
oly_w.columns

Index(['Year', 'City', 'Sport', 'Discipline', 'Athlete', 'Country', 'Gender',
       'Event', 'Medal', 'Season'],
      dtype='object')

In [13]:
## are the two column headers identical?
list(oly_w.columns) == list(oly_s.columns)

True

In [14]:
## join winter and summer dataframes together into a dataframe called oly
join_list = [oly_s, oly_w]
oly = pd.concat(join_list, sort = True)
oly

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
0,"HAJOS, Alfred",Athens,HUN,Swimming,100M Freestyle,Men,Gold,Summer,Aquatics,1896
1,"HERSCHMANN, Otto",Athens,AUT,Swimming,100M Freestyle,Men,Silver,Summer,Aquatics,1896
2,"DRIVAS, Dimitrios",Athens,GRE,Swimming,100M Freestyle For Sailors,Men,Bronze,Summer,Aquatics,1896
3,"MALOKINIS, Ioannis",Athens,GRE,Swimming,100M Freestyle For Sailors,Men,Gold,Summer,Aquatics,1896
4,"CHASAPIS, Spiridon",Athens,GRE,Swimming,100M Freestyle For Sailors,Men,Silver,Summer,Aquatics,1896
...,...,...,...,...,...,...,...,...,...,...
5765,"JONES, Jenny",Sochi,GBR,Snowboard,Slopestyle,Women,Bronze,Winter,Skiing,2014
5766,"ANDERSON, Jamie",Sochi,USA,Snowboard,Slopestyle,Women,Gold,Winter,Skiing,2014
5767,"MALTAIS, Dominique",Sochi,CAN,Snowboard,Snowboard Cross,Women,Silver,Winter,Skiing,2014
5768,"SAMKOVA, Eva",Sochi,CZE,Snowboard,Snowboard Cross,Women,Gold,Winter,Skiing,2014


In [15]:
### confirm that you have the correct number of rows after the join.
### you can scroll up to count the number of entries in oly_w and oly_s. 
### The totals should be equal to the value generated below
oly.shape

(36935, 10)

In [16]:
oly_w.shape[0]

5770

In [17]:
## confirm programmatically that the rows add up
oly.shape[0] == oly_w.shape[0] + oly_s.shape[0]


True
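One detail worth knowing: `pd.concat` keeps each dataframe's original index by default, so the combined dataframe contains repeated index labels (the last rows of `oly` are still labelled 5765 and up). A minimal sketch with two toy dataframes (the names `summer`, `winter`, `kept` and `fresh` are made up here) shows the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Two toy dataframes with athlete names from the previews above
summer = pd.DataFrame({"Athlete": ["HAJOS, Alfred", "HERSCHMANN, Otto"],
                       "Season": "Summer"})
winter = pd.DataFrame({"Athlete": ["BERTHET, G.", "MANDRILLON, C."],
                       "Season": "Winter"})

# Default behaviour: original index labels are kept, so they repeat
kept = pd.concat([summer, winter])
print(kept.index.tolist())
print(kept.index.is_unique)

# ignore_index=True builds a fresh 0..n-1 index instead
fresh = pd.concat([summer, winter], ignore_index=True)
print(fresh.index.tolist())
```

Repeated labels are harmless for the filtering in this notebook, but they matter if you later select rows by label with `.loc`.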

In [18]:
### generate a random sample of 20 rows
### if you see only "summer" or only "winter" in the "Season" column, just run the cell again to
## confirm both seasons are in oly.
oly.sample(20)

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
821,"KOLCHIN, Pavel",Cortina d'Ampezzo,URS,Cross Country Skiing,15KM,Men,Bronze,Winter,Skiing,1956
8981,"MARKAROV, Boris",Melbourne / Stockholm,URS,Water polo,Water Polo,Men,Bronze,Summer,Aquatics,1956
11988,"MATSON, James Randel",Mexico,USA,Athletics,Shot Put,Men,Gold,Summer,Athletics,1968
17215,"HOMFELD, Conrad E.",Los Angeles,USA,Jumping,Individual,Men,Silver,Summer,Equestrian,1984
20178,"WANG, Fang",Barcelona,CHN,Basketball,Basketball,Women,Silver,Summer,Basketball,1992
1731,"PRIESTNER, Cathy",Innsbruck,CAN,Speed skating,500M,Women,Silver,Winter,Skating,1976
226,"SERRURIER, Auguste",Paris,FRA,Archery,Sur La Perche À La Herse,Men,Silver,Summer,Archery,1900
2214,"MATIKAINEN, Marjo",Sarajevo,FIN,Cross Country Skiing,4X5KM Relay,Women,Bronze,Winter,Skiing,1984
29785,"ZHAO, Yunlei",London,CHN,Badminton,Doubles,Women,Gold,Summer,Badminton,2012
13593,"KETELÄ, Martti Einari",Munich,FIN,Modern Pentath.,Team Competition,Men,Bronze,Summer,Modern Pentathlon,1972


In [46]:
## export to csv
oly.to_csv("data/olympics.csv", encoding = "UTF-8", index = False)

## Improving memory allocation

If we were working with a massive dataset, we'd want to find ways to reduce how much memory it uses.

For example, the following columns hold only a few values that repeat again and again as strings. This is highly inefficient. 

- The "Season" column only has "Winter" and "Summer",
- The "Medal" column only has "Gold", "Silver" and "Bronze",
- The "Gender" column in this dataset only has "Men" and "Women".

One way to improve memory allocation is to take columns that contain only a few distinct values and turn them into categories. 



In [47]:
## let's first get info() on oly.
## you should see how "Season", "Medal" and "Gender" are all string objects.
## Note that memory usage: 3.1+ MB
oly.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36935 entries, 0 to 5769
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Athlete     36935 non-null  object
 1   City        36935 non-null  object
 2   Country     36931 non-null  object
 3   Discipline  36935 non-null  object
 4   Event       36935 non-null  object
 5   Gender      36935 non-null  object
 6   Medal       36935 non-null  object
 7   Season      36935 non-null  object
 8   Sport       36935 non-null  object
 9   Year        36935 non-null  int64 
dtypes: int64(1), object(9)
memory usage: 3.1+ MB


In [48]:
## convert gender, medal and season to category
oly[["Gender", "Medal", "Season"]] = oly[["Gender", "Medal", "Season"]].astype("category")

In [49]:
## See what the memory allocation is now using info()

oly.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36935 entries, 0 to 5769
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Athlete     36935 non-null  object  
 1   City        36935 non-null  object  
 2   Country     36931 non-null  object  
 3   Discipline  36935 non-null  object  
 4   Event       36935 non-null  object  
 5   Gender      36935 non-null  category
 6   Medal       36935 non-null  category
 7   Season      36935 non-null  category
 8   Sport       36935 non-null  object  
 9   Year        36935 non-null  int64   
dtypes: category(3), int64(1), object(6)
memory usage: 2.4+ MB


It should have dropped to about 2.4 MB.
That's a big decrease for a dataset of only about 37,000 rows. 

This would have an even bigger impact on larger datasets.
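The `+` in the `info()` memory figure means the number is a lower bound, because the contents of string columns are not fully measured by default. A rough way to see the effect more precisely is `memory_usage(deep=True)`. The sketch below uses a synthetic column of 30,000 repeated medal names, not the real dataset, and the variable names are made up.

```python
import pandas as pd

# Synthetic column: three medal names repeated 10,000 times each
medals = pd.Series(["Gold", "Silver", "Bronze"] * 10_000)

# deep=True measures the strings themselves, not just the references to them
as_object = medals.memory_usage(deep=True)
as_category = medals.astype("category").memory_usage(deep=True)

print(f"strings:  {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

On a full dataframe, `oly.info(memory_usage="deep")` reports the same kind of deep measurement.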

# Filter/Subset 



## Create a subset that holds the results from the Sochi Olympics

```df[df["column_name"] == "value_in_column"]```
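It can help to look at the inner expression on its own. The comparison produces a Series of `True`/`False` values (often called a mask), and the outer brackets keep only the rows where the mask is `True`. A minimal sketch on toy rows, with made-up names `demo` and `mask`:

```python
import pandas as pd

# Toy rows; the cities and years appear in the winter data
demo = pd.DataFrame({"City": ["Sochi", "Turin", "Sochi"],
                     "Year": [2014, 2006, 2014]})

# Step 1: the comparison returns a Boolean Series, one value per row
mask = demo["City"] == "Sochi"
print(mask.tolist())

# Step 2: passing the mask to the dataframe keeps only the True rows
print(demo[mask])

# True counts as 1, so summing the mask counts the matching rows
print(mask.sum())
```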

In [23]:
## Sochi subset
oly[oly["City"] == "Sochi"].sample(20)


Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
5679,"MIDOL, Jonathan",Sochi,FRA,Freestyle Skiing,Ski Cross,Men,Bronze,Winter,Skiing,2014
5494,"SHI, Jingnan",Sochi,CHN,Short Track Speed Skating,5000M Relay,Men,Bronze,Winter,Skating,2014
5376,"ANTHAMATTEN, Sophie",Sochi,SUI,Ice Hockey,Ice Hockey,Women,Bronze,Winter,Ice Hockey,2014
5370,"WAKEFIELD, Jennifer",Sochi,CAN,Ice Hockey,Ice Hockey,Women,Gold,Winter,Ice Hockey,2014
5240,"MEYERS, Elana",Sochi,USA,Bobsleigh,Two-Woman,Women,Silver,Winter,Bobsleigh,2014
5309,"JOKINEN, Olli",Sochi,FIN,Ice Hockey,Ice Hockey,Men,Bronze,Winter,Ice Hockey,2014
5608,"FENNINGER, Anna",Sochi,AUT,Alpine Skiing,Super-G,Women,Gold,Winter,Skiing,2014
5373,"WICKENHEISER, Hayley",Sochi,CAN,Ice Hockey,Ice Hockey,Women,Gold,Winter,Ice Hockey,2014
5549,"VERWEIJ, Koen",Sochi,NED,Speed skating,Team Pursuit,Men,Gold,Winter,Skating,2014
5664,"FABJAN, Vesna",Sochi,SLO,Cross Country Skiing,Sprint 1.5KM,Women,Bronze,Winter,Skiing,2014


In [51]:
## Store subset into dataframe called df_sochi
df_sochi = oly[oly["City"] == "Sochi"]
df_sochi

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
5158,"LANDERTINGER, Dominik",Sochi,AUT,Biathlon,10KM,Men,Silver,Winter,Biathlon,2014
5159,"SOUKUP, Jaroslav",Sochi,CZE,Biathlon,10KM,Men,Bronze,Winter,Biathlon,2014
5160,"BJOERNDALEN, Ole Einar",Sochi,NOR,Biathlon,10KM,Men,Gold,Winter,Biathlon,2014
5161,"MORAVEC, Ondrej",Sochi,CZE,Biathlon,12.5Km Pursuit,Men,Silver,Winter,Biathlon,2014
5162,"BEATRIX, Jean Guillaume",Sochi,FRA,Biathlon,12.5Km Pursuit,Men,Bronze,Winter,Biathlon,2014
...,...,...,...,...,...,...,...,...,...,...
5765,"JONES, Jenny",Sochi,GBR,Snowboard,Slopestyle,Women,Bronze,Winter,Skiing,2014
5766,"ANDERSON, Jamie",Sochi,USA,Snowboard,Slopestyle,Women,Gold,Winter,Skiing,2014
5767,"MALTAIS, Dominique",Sochi,CAN,Snowboard,Snowboard Cross,Women,Silver,Winter,Skiing,2014
5768,"SAMKOVA, Eva",Sochi,CZE,Snowboard,Snowboard Cross,Women,Gold,Winter,Skiing,2014


## How many values in a column?

```df_name["column_name"].value_counts()```

In [25]:
## How many men and women won medals at Sochi?
# df_sochi.Gender.value_counts()

df_sochi["Gender"].value_counts()

Men      340
Women    272
Name: Gender, dtype: int64

In [26]:
## in all olympics how many gold, silver and bronze medals were given out?
oly["Medal"].value_counts()

Gold      12407
Bronze    12288
Silver    12240
Name: Medal, dtype: int64
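If shares are more useful than raw counts, `value_counts` accepts `normalize=True` and returns fractions that sum to 1. A small sketch on a toy Series (four made-up medal entries, not the real totals):

```python
import pandas as pd

# Toy data: four medal entries, purely illustrative
medals = pd.Series(["Gold", "Gold", "Silver", "Bronze"], name="Medal")

# normalize=True returns the fraction of rows for each value
shares = medals.value_counts(normalize=True)
print(shares)
```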

## Filtering by dates

You'll learn more about date and time in the coming week. For now, note that the ```Year``` column has the ```int64``` dtype and is NOT a ```datetime``` column. We can still filter on it with ordinary numeric comparisons.

In [52]:
## Find all the 1900 competitions
oly[oly["Year"] == 1900]

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
151,"HALMAY, Zoltan",Paris,HUN,Swimming,1500M Freestyle,Men,Bronze,Summer,Aquatics,1900
152,"JARVIS, John Arthur",Paris,GBR,Swimming,1500M Freestyle,Men,Gold,Summer,Aquatics,1900
153,"WAHLE, Otto",Paris,AUT,Swimming,1500M Freestyle,Men,Silver,Summer,Aquatics,1900
154,"DROST, Johannes",Paris,NED,Swimming,200M Backstroke,Men,Bronze,Summer,Aquatics,1900
155,"HOPPENBERG, Ernst",Paris,GER,Swimming,200M Backstroke,Men,Gold,Summer,Aquatics,1900
...,...,...,...,...,...,...,...,...,...,...
658,"COLLAS, Jean",Paris,FRA,Tug of War,Tug Of War,Men,Silver,Summer,Tug of War,1900
659,"GONDOUIN, Charles",Paris,FRA,Tug of War,Tug Of War,Men,Silver,Summer,Tug of War,1900
660,"HENRIQUEZ DE ZUBIERRA, Constantin",Paris,FRA,Tug of War,Tug Of War,Men,Silver,Summer,Tug of War,1900
661,"ROFFO, Joseph",Paris,FRA,Tug of War,Tug Of War,Men,Silver,Summer,Tug of War,1900


In [54]:
### return all the competitions between 1920 and 1950 (inclusive).
## METHOD 1
oly[(oly["Year"] >= 1920) & (oly["Year"] <= 1950)]

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
4120,"PINKSTON, Clarence",Paris,USA,Diving,10M Platform,Men,Bronze,Summer,Aquatics,1924
4121,"WHITE, Albert",Paris,USA,Diving,10M Platform,Men,Gold,Summer,Aquatics,1924
4122,"FALL, David",Paris,USA,Diving,10M Platform,Men,Silver,Summer,Aquatics,1924
4123,"TÖPEL, Hjördis",Paris,SWE,Diving,10M Platform,Women,Bronze,Summer,Aquatics,1924
4124,"SMITH, Caroline",Paris,USA,Diving,10M Platform,Women,Gold,Summer,Aquatics,1924
...,...,...,...,...,...,...,...,...,...,...
566,"HASU, Heikki",St.Moritz,FIN,Nordic Combined,Individual,Men,Gold,Winter,Skiing,1948
567,"HUHTALA, Martti",St.Moritz,FIN,Nordic Combined,Individual,Men,Silver,Winter,Skiing,1948
568,"SCHJELDERUP, Thorleif",St.Moritz,NOR,Ski Jumping,K90 Individual (70M),Men,Bronze,Winter,Skiing,1948
569,"HUGSTED, Petter",St.Moritz,NOR,Ski Jumping,K90 Individual (70M),Men,Gold,Winter,Skiing,1948


In [55]:
### return all the competitions between 1920 and 1950.
### METHOD 2

## Declare my filters
my_filter_1 = oly["Year"] >= 1920
my_filter_2 = oly["Year"] <= 1950

## Filter
oly[my_filter_1 & my_filter_2]



Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
2822,"PRIESTE, Harry",Antwerp,USA,Diving,10M Platform,Men,Bronze,Summer,Aquatics,1920
2823,"PINKSTON, Clarence",Antwerp,USA,Diving,10M Platform,Men,Gold,Summer,Aquatics,1920
2824,"ADLERZ, Erik",Antwerp,SWE,Diving,10M Platform,Men,Silver,Summer,Aquatics,1920
2825,"OLLIVIER, Eva",Antwerp,SWE,Diving,10M Platform,Women,Bronze,Summer,Aquatics,1920
2826,"FRYLAND CLAUSEN, Stefani",Antwerp,DEN,Diving,10M Platform,Women,Gold,Summer,Aquatics,1920
...,...,...,...,...,...,...,...,...,...,...
566,"HASU, Heikki",St.Moritz,FIN,Nordic Combined,Individual,Men,Gold,Winter,Skiing,1948
567,"HUHTALA, Martti",St.Moritz,FIN,Nordic Combined,Individual,Men,Silver,Winter,Skiing,1948
568,"SCHJELDERUP, Thorleif",St.Moritz,NOR,Ski Jumping,K90 Individual (70M),Men,Bronze,Winter,Skiing,1948
569,"HUGSTED, Petter",St.Moritz,NOR,Ski Jumping,K90 Individual (70M),Men,Gold,Winter,Skiing,1948


In [30]:
## return only Tennis competitions between 1920 and 1950 (inclusive).
## METHOD 1


In [57]:
## return only Tennis competitions between 1920 and 1950.

### METHOD 2

## Declare my filters
my_filter_1 = oly["Year"] >= 1920
my_filter_2 = oly["Year"] <= 1950
my_filter_3 = oly["Sport"] == "Tennis"

## Filter
oly[my_filter_1 & my_filter_2 & my_filter_3]

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
4026,"ALBARRAN, Pierre",Antwerp,FRA,Tennis,Doubles,Men,Bronze,Summer,Tennis,1920
4027,"DECUGIS, Max",Antwerp,FRA,Tennis,Doubles,Men,Bronze,Summer,Tennis,1920
4028,"TURNBULL, Oswald Graham Noel",Antwerp,GBR,Tennis,Doubles,Men,Gold,Summer,Tennis,1920
4029,"WOOSNAM, Maxwell",Antwerp,GBR,Tennis,Doubles,Men,Gold,Summer,Tennis,1920
4030,"KASHIO, Seiichiro",Antwerp,JPN,Tennis,Doubles,Men,Silver,Summer,Tennis,1920
4031,"KUMAGAE, Ichiya",Antwerp,JPN,Tennis,Doubles,Men,Silver,Summer,Tennis,1920
4032,"D'AYEN, Elisabeth",Antwerp,FRA,Tennis,Doubles,Women,Bronze,Summer,Tennis,1920
4033,"LENGLEN, Suzanne",Antwerp,FRA,Tennis,Doubles,Women,Bronze,Summer,Tennis,1920
4034,"MCKANE, Kathleen",Antwerp,GBR,Tennis,Doubles,Women,Gold,Summer,Tennis,1920
4035,"MCNAIR, Winifred Margaret",Antwerp,GBR,Tennis,Doubles,Women,Gold,Summer,Tennis,1920


In [58]:
## return only Tennis competitions between 1920 and 1950.

### METHOD 3

## Declare my filters with meaningful names
fil_yr1920 = oly["Year"] >= 1920
fil_yr1950 = oly["Year"] <= 1950
fil_tennis = oly["Sport"] == "Tennis"


## Filter
oly[fil_yr1920 & fil_yr1950 & fil_tennis]

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
4026,"ALBARRAN, Pierre",Antwerp,FRA,Tennis,Doubles,Men,Bronze,Summer,Tennis,1920
4027,"DECUGIS, Max",Antwerp,FRA,Tennis,Doubles,Men,Bronze,Summer,Tennis,1920
4028,"TURNBULL, Oswald Graham Noel",Antwerp,GBR,Tennis,Doubles,Men,Gold,Summer,Tennis,1920
4029,"WOOSNAM, Maxwell",Antwerp,GBR,Tennis,Doubles,Men,Gold,Summer,Tennis,1920
4030,"KASHIO, Seiichiro",Antwerp,JPN,Tennis,Doubles,Men,Silver,Summer,Tennis,1920
4031,"KUMAGAE, Ichiya",Antwerp,JPN,Tennis,Doubles,Men,Silver,Summer,Tennis,1920
4032,"D'AYEN, Elisabeth",Antwerp,FRA,Tennis,Doubles,Women,Bronze,Summer,Tennis,1920
4033,"LENGLEN, Suzanne",Antwerp,FRA,Tennis,Doubles,Women,Bronze,Summer,Tennis,1920
4034,"MCKANE, Kathleen",Antwerp,GBR,Tennis,Doubles,Women,Gold,Summer,Tennis,1920
4035,"MCNAIR, Winifred Margaret",Antwerp,GBR,Tennis,Doubles,Women,Gold,Summer,Tennis,1920


In [62]:
## return only Women's Tennis competitions between 1920 and 2012.

### METHOD 3

## Declare my filters with meaningful names
fil_yr1920 = oly["Year"] >= 1920
fil_yr2012 = oly["Year"] <= 2012
fil_tennis = oly["Sport"] == "Tennis"
fil_women = oly["Gender"] == "Women"

## Filter
oly_tennis = oly[fil_yr1920 & fil_yr2012 & fil_tennis & fil_women]
oly_tennis

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
4032,"D'AYEN, Elisabeth",Antwerp,FRA,Tennis,Doubles,Women,Bronze,Summer,Tennis,1920
4033,"LENGLEN, Suzanne",Antwerp,FRA,Tennis,Doubles,Women,Bronze,Summer,Tennis,1920
4034,"MCKANE, Kathleen",Antwerp,GBR,Tennis,Doubles,Women,Gold,Summer,Tennis,1920
4035,"MCNAIR, Winifred Margaret",Antwerp,GBR,Tennis,Doubles,Women,Gold,Summer,Tennis,1920
4036,"BEAMISH, Winifred Geraldine",Antwerp,GBR,Tennis,Doubles,Women,Silver,Summer,Tennis,1920
...,...,...,...,...,...,...,...,...,...,...
30949,"ROBSON, Laura",London,GBR,Tennis,Mixed Doubles,Women,Silver,Summer,Tennis,2012
30951,"RAYMOND, Lisa",London,USA,Tennis,Mixed Doubles,Women,Bronze,Summer,Tennis,2012
30955,"WILLIAMS, Serena",London,USA,Tennis,Singles,Women,Gold,Summer,Tennis,2012
30956,"SHARAPOVA, Maria",London,RUS,Tennis,Singles,Women,Silver,Summer,Tennis,2012


#### Which method did you prefer? Method 1 or 2 or 3? Why?
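Two further options are worth comparing, shown here as a sketch on toy rows (the names `oly_demo`, `with_between` and `with_query` are made up). `Series.between` covers an inclusive range in one call, and `DataFrame.query` lets the whole filter be written as a string.

```python
import pandas as pd

# Toy rows with athletes that appear in the outputs of this notebook
oly_demo = pd.DataFrame({
    "Athlete": ["LENGLEN, Suzanne", "PRIESTE, Harry", "WILLIAMS, Serena"],
    "Sport": ["Tennis", "Aquatics", "Tennis"],
    "Year": [1920, 1920, 2012],
})

# between() includes both endpoints by default
with_between = oly_demo[
    oly_demo["Year"].between(1920, 1950) & (oly_demo["Sport"] == "Tennis")
]

# query() takes the conditions as a single string expression
with_query = oly_demo.query("Year >= 1920 and Year <= 1950 and Sport == 'Tennis'")

print(with_between)
print(with_between.equals(with_query))
```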

## What year was women's boxing introduced to the Olympics?

In [66]:
## find all the names of sports first.
## how is boxing listed?
# list(oly["Sport"].unique())
oly["Sport"].unique()

array(['Aquatics', 'Athletics', 'Cycling', 'Fencing', 'Gymnastics',
       'Shooting', 'Tennis', 'Weightlifting', 'Wrestling', 'Archery',
       'Basque Pelota', 'Cricket', 'Croquet', 'Equestrian', 'Football',
       'Golf', 'Polo', 'Rowing', 'Rugby', 'Sailing', 'Tug of War',
       'Boxing', 'Lacrosse', 'Roque', 'Hockey', 'Jeu de paume', 'Rackets',
       'Skating', 'Water Motorsports', 'Modern Pentathlon', 'Ice Hockey',
       'Basketball', 'Canoe / Kayak', 'Handball', 'Judo', 'Volleyball',
       'Table Tennis', 'Badminton', 'Baseball', 'Softball', 'Taekwondo',
       'Triathlon', 'Canoe', 'Biathlon', 'Bobsleigh', 'Curling', 'Skiing',
       'Luge'], dtype=object)

In [67]:
## create a filter for boxing and women

fil_boxing = oly["Sport"] == "Boxing"
fil_women = oly["Gender"] == "Women"
## Filter
oly[fil_boxing & fil_women]

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
29876,"ADAMS, Nicola",London,GBR,Boxing,51 KG,Women,Gold,Summer,Boxing,2012
29877,"REN, Cancan",London,CHN,Boxing,51 KG,Women,Silver,Summer,Boxing,2012
29878,"ESPARZA, Marlen",London,USA,Boxing,51 KG,Women,Bronze,Summer,Boxing,2012
29879,"KOM, Mary",London,IND,Boxing,51 KG,Women,Bronze,Summer,Boxing,2012
29896,"TAYLOR, Katie",London,IRL,Boxing,60 KG,Women,Gold,Summer,Boxing,2012
29897,"OCHIGAVA, Sofya",London,RUS,Boxing,60 KG,Women,Silver,Summer,Boxing,2012
29898,"ARAUJO, Adriana",London,BRA,Boxing,60 KG,Women,Bronze,Summer,Boxing,2012
29899,"CHORIEVA, Mavzuna",London,TJK,Boxing,60 KG,Women,Bronze,Summer,Boxing,2012
29912,"SHIELDS, Claressa",London,USA,Boxing,75 KG,Women,Gold,Summer,Boxing,2012
29913,"TORLOPOVA, Nadezda",London,RUS,Boxing,75 KG,Women,Silver,Summer,Boxing,2012
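Rather than reading the year off the table, the earliest year can be pulled out directly with `.min()`. The sketch below uses toy rows. Two come from the output above and the third, a men's row from 1904, is a hypothetical placeholder added only so that the filter has something to exclude.

```python
import pandas as pd

demo = pd.DataFrame({
    # The last athlete is a hypothetical placeholder, not a real record
    "Athlete": ["ADAMS, Nicola", "TAYLOR, Katie", "PLACEHOLDER, Boxer"],
    "Sport": ["Boxing", "Boxing", "Boxing"],
    "Gender": ["Women", "Women", "Men"],
    "Year": [2012, 2012, 1904],
})

fil_boxing = demo["Sport"] == "Boxing"
fil_women = demo["Gender"] == "Women"

# .loc[rows, column] selects the Year column for matching rows; min() is the first year
first_year = demo.loc[fil_boxing & fil_women, "Year"].min()
print(first_year)
```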


## Summer v. Winter

It's likely that the Summer Olympics have more medal winners than Winter Olympics. How can we check?

In [68]:
## what is the total medal count for winter v. summer?
oly["Season"].value_counts()

Summer    31165
Winter     5770
Name: Season, dtype: int64

## How many medals were handed out at each Olympics between 1896 and 2014?

Show the result from highest to lowest.

In [75]:
## how many medals were handed out at each olympics?
## show the result in descending order
totals_by_year = oly["Year"].value_counts(ascending = False)
totals_by_year

2008    2042
1992    2030
2000    2015
2004    1998
2012    1949
1996    1859
1988    1810
1984    1681
1980    1605
1976    1515
1972    1385
1920    1298
1968    1230
1964    1195
1956    1035
1960    1029
1952    1025
1924    1002
1936     983
1948     954
1912     885
1908     804
1928     799
1932     731
2014     612
2006     531
2010     529
1900     512
2002     481
1904     470
1998     447
1994     343
1896     151
Name: Year, dtype: int64

In [40]:
## how many medals were handed out at each olympics each year from 2014 down to 1896.
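One possible approach for this cell: `value_counts` puts the years in the index, so `sort_index(ascending=False)` orders the result by year instead of by count. A sketch on a toy Series of six made-up entries (the names `years` and `by_year` are illustrative):

```python
import pandas as pd

# Toy Year column: six entries across three years, purely illustrative
years = pd.Series([2014, 2014, 1896, 2008, 2008, 2008], name="Year")

# Count per year, then order by the index (the year) from latest to earliest
by_year = years.value_counts().sort_index(ascending=False)
print(by_year)
```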


## Filter and count

What are the top 5 sports in which France has won the most gold medals?

(HINT: different answers based on columns you decide to count)

In [76]:
## first you need to filter by country and gold
fil_france = oly["Country"] == "FRA"
fil_gold = oly["Medal"] == "Gold"
france_gold = oly[fil_france & fil_gold]
france_gold

Unnamed: 0,Athlete,City,Country,Discipline,Event,Gender,Medal,Season,Sport,Year
51,"FLAMENG, Léon",Athens,FRA,Cycling Track,100KM,Men,Gold,Summer,Cycling,1896
54,"MASSON, Paul",Athens,FRA,Cycling Track,10KM,Men,Gold,Summer,Cycling,1896
59,"MASSON, Paul",Athens,FRA,Cycling Track,1KM Time Trial,Men,Gold,Summer,Cycling,1896
62,"MASSON, Paul",Athens,FRA,Cycling Track,Sprint Indivual,Men,Gold,Summer,Cycling,1896
65,"GRAVELOTTE, Eugène-Henri",Athens,FRA,Fencing,Foil Individual,Men,Gold,Summer,Fencing,1896
...,...,...,...,...,...,...,...,...,...,...
5105,"LAMY CHAPPUIS, Jason",Vancouver,FRA,Nordic Combined,"Individual, Ski Jumping K90 (70M)",Men,Gold,Winter,Skiing,2010
5163,"FOURCADE, Martin",Sochi,FRA,Biathlon,12.5Km Pursuit,Men,Gold,Winter,Biathlon,2014
5167,"FOURCADE, Martin",Sochi,FRA,Biathlon,20KM,Men,Gold,Winter,Biathlon,2014
5678,"CHAPUIS, Jean Frederic",Sochi,FRA,Freestyle Skiing,Ski Cross,Men,Gold,Winter,Skiing,2014


In [78]:
## answer if you count gold wins by "event"
france_gold["Event"].value_counts().head()

Foil Team    45
Épée Team    43
Handball     30
Football     17
Rugby        17
Name: Event, dtype: int64

In [79]:
## answer if you count gold wins by "sport"
france_gold["Sport"].value_counts().head()

Fencing       116
Cycling        66
Handball       30
Aquatics       24
Equestrian     23
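Both counts above are per athlete, because the data holds one row per medal-winning athlete, so a single team gold appears once for every team member. If the goal is to count events won instead, one option is to drop duplicate rows for the same event before counting. The sketch below uses hypothetical rows and placeholder athlete names. The choice of columns that identify an event is an assumption and may need adjusting for the real data.

```python
import pandas as pd

# Hypothetical gold-medal rows: one team event with three players, one individual event
gold = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2008],
    "Sport": ["Handball", "Handball", "Handball", "Fencing"],
    "Event": ["Handball", "Handball", "Handball", "Foil Individual"],
    "Gender": ["Men", "Men", "Men", "Men"],
    "Athlete": ["Player A", "Player B", "Player C", "Fencer D"],
})

# Counting rows counts athletes
per_athlete = gold["Sport"].value_counts()

# Assumption: Year + Sport + Event + Gender identifies one medal event
events = gold.drop_duplicates(subset=["Year", "Sport", "Event", "Gender"])
per_event = events["Sport"].value_counts()

print(per_athlete)
print(per_event)
```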
Name: Sport, dtype: int64