In [1]:
import pandas as pd

### value_counts()
- value_counts() is one of the most useful methods in pandas.
- It returns a Series that counts how often each unique value occurs.
- There are two things in particular to be aware of with value_counts():
    - By default, results are in descending order of count: the first entry is the most frequent value, the second is the next most frequent, and so on. Set ascending=True to reverse the order.
    - dropna is True by default, so NA (null) values are not counted. If the data set has a significant number of NA values, the result can be misleading; set dropna=False to include them.
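
The two parameters described above can be tried on a small Series before touching the real data. This is a minimal sketch: the `medals` Series is hypothetical toy data, not part of the Olympics file, and it includes one missing value so that the effect of `dropna` is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration): five entries, one missing.
medals = pd.Series(['Gold', 'Silver', 'Gold', np.nan, 'Gold'])

default_counts = medals.value_counts()              # descending order, NaN ignored
with_missing = medals.value_counts(dropna=False)    # NaN counted as its own entry
least_first = medals.value_counts(ascending=True)   # least frequent value first

print(default_counts)
print(with_missing)
print(least_first)
```

Comparing the totals of `default_counts` and `with_missing` is a quick way to see how many rows the default call leaves out.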

In [2]:
oo = pd.read_csv('data/olympics.csv', skiprows=4)
oo.head()

Unnamed: 0,City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
0,Athens,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1,Athens,1896,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
2,Athens,1896,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
3,Athens,1896,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
4,Athens,1896,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100m freestyle for sailors,M,Silver


In [6]:
oo['Edition'].value_counts()
# Each medal awarded takes one row, so counting the values in Edition gives the
# number of medals presented at each Olympics.
# value_counts() sorts from the edition with the most medals to the one with the fewest.
# Here the most medals were presented at the 2008 Games, and more medals were
# presented in 2000 than in 2004.

2008    2042
2000    2015
2004    1998
1996    1859
1992    1705
1988    1546
1984    1459
1980    1387
1976    1305
1920    1298
1972    1185
1968    1031
1964    1010
1952     889
1912     885
1956     885
1924     884
1960     882
1936     875
1948     814
1908     804
1928     710
1932     615
1900     512
1904     470
1896     151
Name: Edition, dtype: int64

In [7]:
oo.Gender.value_counts()
# so this gives the total number of medals given to men vs women

Men      21721
Women     7495
Name: Gender, dtype: int64

In [9]:
oo.Gender.value_counts(ascending=True)
# this will make the display ascending, with least frequent value on top 

Women     7495
Men      21721
Name: Gender, dtype: int64

In [10]:
# If the dataset has many NA values, the default count can be misleading. Pass dropna=False to count them too.
oo.Gender.value_counts(ascending=True, dropna=False)
# Gender has no missing values in this dataset, so the counts are unchanged.

Women     7495
Men      21721
Name: Gender, dtype: int64

### sort_values()
- sort_values() sorts the values in a Series.
- By default it sorts along axis=0 and in ascending order. If you picture a Series as a single column, you are sorting the contents of that column in ascending order.
- A Series has only one axis, so the axis parameter matters only for DataFrames, where axis=1 sorts the columns instead of the rows.
- By default, NaNs (missing data) are placed at the end. Set na_position='first' to put them at the beginning instead.
- sort_values() is particularly useful with a DataFrame, because you can sort by several columns at once, each in ascending or descending order.
- You can choose the sorting algorithm with kind; the default is 'quicksort', and 'mergesort', 'heapsort', and 'stable' are the alternatives.
- inplace is False by default, which means a new Series is returned and the original is left unchanged; assign the result to a variable to keep it. Set inplace=True to sort the original in place.
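
The parameters listed above are easier to compare side by side on a small Series. This is a minimal sketch: the `names` Series and the athlete names in it are made up for illustration, and one entry is missing so that `na_position` has something to move.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration): three names, one missing.
names = pd.Series(['CHARLIE, Cy', np.nan, 'ALPHA, Ann', 'BRAVO, Bob'])

nan_last = names.sort_values()                      # default: ascending, NaN at the end
nan_first = names.sort_values(na_position='first')  # NaN moved to the beginning
descending = names.sort_values(ascending=False)     # Z to A, NaN still at the end
stable = names.sort_values(kind='mergesort')        # same result, stable algorithm

# sort_values() returned new Series each time; names itself keeps its original order.
print(nan_last)
print(names)
```

Note that the index labels travel with the values, which is why the sorted output in the cells that follow shows row numbers out of order.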

In [11]:
# oo is the DataFrame, Athlete is the column of athlete names, and sort_values()
# returns those names sorted alphabetically.
oo.Athlete.sort_values().head()


651                 AABYE, Edgar
2849       AALTONEN, Arvo Ossian
2852       AALTONEN, Arvo Ossian
7716    AALTONEN, Paavo Johannes
7730    AALTONEN, Paavo Johannes
Name: Athlete, dtype: object

In [13]:
ath = oo.Athlete.sort_values()

In [15]:
# sort_values() is particularly useful with DataFrames.
# Let's sort by the edition of the Olympics and then by the athletes' names.
# Since we are sorting by more than one column, we pass them as a list.
oo.sort_values(by=['Edition','Athlete']).head()
# Rows are ordered by Edition first; within each edition they are ordered by Athlete.

Unnamed: 0,City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
7,Athens,1896,Aquatics,Swimming,"ANDREOU, Joannis",GRE,Men,1200m freestyle,M,Silver
82,Athens,1896,Gymnastics,Artistic G.,"ANDRIAKOPOULOS, Nicolaos",GRE,Men,rope climbing,M,Gold
110,Athens,1896,Gymnastics,Artistic G.,"ANDRIAKOPOULOS, Nicolaos",GRE,Men,"team, parallel bars",M,Silver
111,Athens,1896,Gymnastics,Artistic G.,"ATHANASOPOULOS, Spyros",GRE,Men,"team, parallel bars",M,Silver
48,Athens,1896,Cycling,Cycling Road,"BATTEL, Edward",GBR,Men,individual road race,M,Bronze
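
The sort above uses ascending order for both columns. To mix directions, `ascending` also accepts a list with one flag per column in `by`. This is a minimal sketch on a hypothetical four-row DataFrame; it reuses the `Edition` and `Athlete` column names, but the rows are made up.

```python
import pandas as pd

# Hypothetical toy rows (an assumption for illustration), not taken from the Olympics file.
df = pd.DataFrame({
    'Edition': [2004, 2008, 2004, 2008],
    'Athlete': ['DELTA, Dee', 'BRAVO, Bob', 'ALPHA, Ann', 'CHARLIE, Cy'],
})

# Newest edition first, then athletes A to Z within each edition.
ordered = df.sort_values(by=['Edition', 'Athlete'], ascending=[False, True])
print(ordered)
```

The list passed to `ascending` must have the same length as the list passed to `by`.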


### Boolean Indexing
- Boolean vectors, or conditions, can be used to filter data.
- A condition produces a Series of True and False values; pass it to a Series or DataFrame to select the rows where it is True.
- If you combine more than one condition, wrap each one in parentheses. Operators such as & and | bind more tightly than comparisons like ==, so without parentheses the order of operations is wrong.
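
The points above can be checked on a tiny DataFrame where the expected rows are easy to see by eye. This is a minimal sketch: the four rows are hypothetical, and only the `Gender` and `Medal` column names are borrowed from the Olympics data.

```python
import pandas as pd

# Hypothetical toy rows (an assumption for illustration), not taken from the Olympics file.
df = pd.DataFrame({
    'Gender': ['Men', 'Women', 'Women', 'Men'],
    'Medal': ['Gold', 'Gold', 'Silver', 'Bronze'],
})

# Each comparison yields a Boolean Series; & combines them element by element.
# The parentheses are needed because & binds more tightly than ==.
women_gold = df[(df.Gender == 'Women') & (df.Medal == 'Gold')]

# | is the element-wise "or", and ~ negates a Boolean Series.
gold_or_silver = df[(df.Medal == 'Gold') | (df.Medal == 'Silver')]
not_gold = df[~(df.Medal == 'Gold')]

print(women_gold)
print(gold_or_silver)
print(not_gold)
```

The Python keywords `and`, `or`, and `not` do not work here, because they try to reduce a whole Series to a single True or False.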

In [17]:
oo.head()

Unnamed: 0,City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
0,Athens,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1,Athens,1896,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
2,Athens,1896,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
3,Athens,1896,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
4,Athens,1896,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100m freestyle for sailors,M,Silver


In [25]:
# Let's say we want to list all athletes who have won a gold medal.
oo.Medal.head() == 'Gold'
# The output is a Series of Boolean values.

0     True
1    False
2    False
3     True
4    False
Name: Medal, dtype: bool

In [27]:
# Using that Boolean Series to index the DataFrame (always with square brackets) returns only the rows where it is True.
oo[oo.Medal == 'Gold'].head()

Unnamed: 0,City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
0,Athens,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
3,Athens,1896,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
6,Athens,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,1200m freestyle,M,Gold
9,Athens,1896,Aquatics,Swimming,"NEUMANN, Paul",AUT,Men,400m freestyle,M,Gold
13,Athens,1896,Athletics,Athletics,"BURKE, Thomas",USA,Men,100m,M,Gold


### Multiple Boolean Indexing

In [29]:
# Let's list all women athletes who have won a gold medal.
oo[(oo.Gender == 'Women') & (oo.Medal == 'Gold')].head()
# NOTE: to check which values appear in Gender before running the above, use
# oo.Gender.value_counts(); it lists every unique value along with its count.

Unnamed: 0,City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
417,Paris,1900,Golf,Golf,"ABBOTT, Margaret Ives",USA,Women,individual,W,Gold
641,Paris,1900,Tennis,Tennis,"COOPER, Charlotte",GBR,Women,mixed doubles,X,Gold
649,Paris,1900,Tennis,Tennis,"COOPER, Charlotte",GBR,Women,singles,W,Gold
710,St Louis,1904,Archery,Archery,"HOWELL, Matilda Scott",USA,Women,double columbia round (50y - 40y - 30y),W,Gold
713,St Louis,1904,Archery,Archery,"HOWELL, Matilda Scott",USA,Women,double national round (60y - 50y),W,Gold


### String Handling
- The string handling methods generally have names matching the equivalent scalar built-in Python string methods (for example .startswith() and .isnumeric()). .contains() is the exception: Python strings have no such method, and it corresponds to the in operator.
- These methods are available under the str attribute of a Series, for example oo.Athlete.str.contains(...).
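
A few of these methods are shown here on a small Series. This is a minimal sketch: the `athletes` Series and its names are hypothetical, and one entry is missing. The `na=False` argument is used so that missing entries become False rather than NaN, which keeps the result usable as a Boolean index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration): three names, one missing.
athletes = pd.Series(['ALPHA, Florence', 'BRAVO, Bob', np.nan, 'FLORENCE, Cy'])

# contains() is case-sensitive by default and treats the pattern as a regular expression.
has_florence = athletes.str.contains('Florence', na=False)

# case=False ignores case, so the upper-case surname matches as well.
any_case = athletes.str.contains('florence', case=False, na=False)

# startswith() works like the Python str method, applied to every element.
starts_with_b = athletes.str.startswith('B', na=False)

print(athletes[has_florence])
print(athletes[any_case])
print(athletes[starts_with_b])
```

Because the pattern is a regular expression by default, a search string containing characters such as `.` or `(` may need `regex=False` to be matched literally.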

In [34]:
# For example, let's look up the famous female athlete Flo-Jo. We only know her first name, Florence, and that she
# competed for the USA (NOC == 'USA').
# Let's look her up using string handling. Remember, all the string handling methods are under str.
oo[(oo.Athlete.str.contains('Florence')) & (oo.NOC == 'USA')]

Unnamed: 0,City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
16817,Los Angeles,1984,Athletics,Athletics,"GRIFFITH-JOYNER, Florence",USA,Women,200m,W,Silver
18287,Seoul,1988,Athletics,Athletics,"GRIFFITH-JOYNER, Florence",USA,Women,100m,W,Gold
18305,Seoul,1988,Athletics,Athletics,"GRIFFITH-JOYNER, Florence",USA,Women,200m,W,Gold
18347,Seoul,1988,Athletics,Athletics,"GRIFFITH-JOYNER, Florence",USA,Women,4x100m relay,W,Gold
18374,Seoul,1988,Athletics,Athletics,"GRIFFITH-JOYNER, Florence",USA,Women,4x400m relay,W,Silver
