### This dataset consists of competitors at the Rio Olympics in 2016.

### This dataset supports several prediction tasks.  Our first pass will classify an athlete's country, using only the athletes who earned a medal.  Which features might be important?

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv('athletes.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()

### There are nulls, but not many.  In the interest of time, let's drop the rows that contain them.

In [None]:
df.dropna(inplace=True)
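
Before dropping rows it can help to confirm how much data is lost. The sketch below uses a small made-up frame (the `toy` data and its column values are hypothetical stand-ins, not rows from `athletes.csv`) to show one way of measuring the share of rows that `dropna` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for athletes.csv; the values are made up.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'height': [1.80, np.nan, 1.65, 1.72],
    'weight': [75.0, 60.0, np.nan, 68.0],
})

rows_before = len(toy)
toy_clean = toy.dropna()  # drops every row that has at least one null
rows_after = len(toy_clean)

# Share of rows lost to the drop
dropped_share = (rows_before - rows_after) / rows_before
print(f"dropped {rows_before - rows_after} of {rows_before} rows ({dropped_share:.0%})")
```

If the share were large, imputing values or dropping only on the columns used later might be a better choice than a blanket `dropna`.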

In [None]:
df.shape

In [None]:
df.dtypes

### Let's engineer a feature that holds the number of medals each athlete won.  Some athletes won more than one medal, and this may be a useful indicator of nationality.

In [None]:
df['medal_or_nm'] = df['gold'] + df['silver'] + df['bronze']
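
As a quick illustration of the medal-count feature, here is a minimal sketch on a made-up frame (the `toy` rows are hypothetical). It builds the same kind of total with a row-wise `sum` over the three medal columns and then keeps only the medalists.

```python
import pandas as pd

# Hypothetical athletes; the medal columns mirror gold/silver/bronze in the notebook.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'gold': [1, 0, 0],
    'silver': [0, 2, 0],
    'bronze': [1, 0, 0],
})

# Row-wise total of the three medal columns
toy['medal_or_nm'] = toy[['gold', 'silver', 'bronze']].sum(axis=1)

# Keep only athletes with at least one medal
medalists = toy[toy['medal_or_nm'] >= 1]
print(medalists[['name', 'medal_or_nm']])
```

One difference worth knowing: `sum(axis=1)` skips nulls by default, while adding the columns with `+` propagates them. With the nulls already dropped, the two approaches should agree.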

In [None]:
df_medals = df[df.medal_or_nm >= 1]
df_medals.shape

In [None]:
df_medals.head()

In [None]:
df_medals.groupby('nationality')['medal_or_nm'].count()
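
Note that `count()` here returns the number of medal-winning athletes per country, not the number of medals. The sketch below, on hypothetical data, contrasts `count()` with `sum()` on the same grouping.

```python
import pandas as pd

# Hypothetical medalists: two from one country, one from another.
toy = pd.DataFrame({
    'nationality': ['USA', 'USA', 'GBR'],
    'medal_or_nm': [3, 1, 2],
})

grouped = toy.groupby('nationality')['medal_or_nm']

athletes_per_country = grouped.count()  # rows (athletes) per country
medals_per_country = grouped.sum()      # total medals summed over athletes

print(pd.DataFrame({'athletes': athletes_per_country, 'medals': medals_per_country}))
```

Because the total is summed over athletes, a single team medal would be counted once for every team member who received it.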

### This is good to see.  Restricting to athletes who won at least one medal still leaves us with 1,753 observations.

### A thought: for classification, we will keep only countries above a medal threshold.  For now, we'll require that a country's athletes' medals sum to more than 50.  Let's see how many countries that leaves us.

In [None]:
# Total medals per country (summed over athletes), as a one-column frame indexed by nationality
country_count = df_medals.groupby('nationality')['medal_or_nm'].sum().to_frame('country_count')

In [None]:
df_medals = df_medals.merge(country_count, on='nationality')

In [None]:
df_medals.head(10)

In [None]:
df_medals = df_medals[df_medals.country_count > 50]
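
The aggregate, merge, and filter steps can also be collapsed with `groupby(...).transform`, which broadcasts each country's total back onto its rows. This is a sketch of an alternative, not the approach used in this notebook; the `toy` data and the `MIN_MEDALS` name and value are hypothetical.

```python
import pandas as pd

# Hypothetical medalists from three made-up countries.
toy = pd.DataFrame({
    'nationality': ['AAA', 'AAA', 'BBB', 'CCC', 'CCC'],
    'medal_or_nm': [2, 2, 1, 1, 1],
})

MIN_MEDALS = 3  # hypothetical threshold for the toy data; the notebook uses 50

# transform('sum') returns one value per row: the total for that row's country
toy['country_count'] = toy.groupby('nationality')['medal_or_nm'].transform('sum')

# Same strict comparison as the notebook: keep countries with more than the threshold
kept = toy[toy['country_count'] > MIN_MEDALS]
print(kept)
```

The result has the same shape as the merged frame, with the country total available as a column on every athlete's row.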

In [None]:
df_medals.shape

In [None]:
df_medals.nationality.nunique()

### We now have 11 countries to use as class labels for our athletes.  Let's split the data and then explore the training set.

In [None]:
# Stratify on nationality so each country keeps roughly the same share in train and test
train, test = train_test_split(
    df_medals, test_size=0.3, random_state=123, stratify=df_medals['nationality']
)
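
To see what `stratify` buys us, the following sketch splits a made-up two-class frame (the `toy` data is hypothetical) and compares the class proportions in each part. With stratification, both parts should mirror the overall 60/40 mix closely.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 60 athletes from one country, 40 from another.
toy = pd.DataFrame({
    'nationality': ['AAA'] * 60 + ['BBB'] * 40,
    'medal_or_nm': [1] * 100,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=123, stratify=toy['nationality']
)

# Class proportions in each part of the split
train_share = toy_train['nationality'].value_counts(normalize=True)
test_share = toy_test['nationality'].value_counts(normalize=True)

print(train_share)
print(test_share)
```

The same check on the real `train` and `test` frames is a cheap way to confirm that no country is badly under-represented in the test set.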