# American Ninja Warrior: Creating A New Data Subset From Obstacle History

The following notebook will teach you how to filter a data set into a new subset of data using some of the basic processes of coding in Python.

First things first, we will go over how to set up your Python notebook on your computer and how to launch the Jupyter Lab environment you'll need to begin your coding adventure of creating a new subset of data.

### Overview of Tutorial

This step-by-step tutorial on filtering data covers the following:

*Technical Instructions:*
1. Importing the Pandas Package.
2. Importing your chosen set of data from a publicly available data set.
3. Ways of exploring and getting to know your data.
4. Using computational techniques to navigate, create, and export your newly filtered data sub-set.

### Acknowledgements

This Python tutorial has been taught and assigned by Professor Gotzler of the UNC-Chapel Hill English Department.

For more detailed examples and exercises, see the Davis Library Research Hub's Python lessons at [Python: Intro to Data lessons](https://unc-libraries-data.github.io/Python/Intro/Introduction_CrashCourse.html).

### Step #1: Importing the Pandas Package

By importing packages into your session, you are able to use additional tools and functions not available in base Python.

Once you've installed a package, you can load it into your current Python session with the `import` statement.

The package that you will be importing today is the Pandas Package. Pandas makes managing, manipulating, filtering, and cleaning data much easier than using base Python alone. It also provides a range of useful tools for working with data once it has been loaded into the session.

You'll begin by importing the Pandas Package with the following command:

In [1]:
import pandas as pd
import numpy as np

Pay close attention to the way we imported the Pandas Package. You load pandas with the usual `import pandas` plus an extra `as pd`. This lets us write `pd.` instead of `pandas.` in our cells, which is easier to type. Including `as pd` is **not** necessary to load the Pandas Package.

Another extra you can add, **seen above**, for a better experience while exploring your data is `numpy`. Pandas relies on `numpy` for much of its math, and importing it yourself lets you use `numpy` functions directly as `np.` while you explore your data.
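As a quick check that both imports worked, the short sketch below uses each alias once. The numbers and the names `orders` and `mean_order` are made up for illustration and are not part of the obstacle data.

```python
import pandas as pd
import numpy as np

# A tiny made-up Series, only to confirm that the pd alias works.
orders = pd.Series([1.0, 2.0, 3.0])

# np is the alias for numpy; here it computes the mean of the Series values.
mean_order = np.mean(orders)

print(pd.__name__, np.__name__)
print(mean_order)
```

If either import had failed, the first two lines would raise an error before anything is printed.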

### Step #2: Importing Your Chosen Set of Data From a Publicly Available Data Set

If you haven't already, download your chosen .csv file of data and save it to a folder on your computer. Make sure the folder is organized well so you can find your data easily when you're navigating to it. If you don't know where to look, you can browse a great collection of interesting data collected by various GitHub users at [GitHub's Awesome Data Collections](https://github.com/awesomedata/awesome-public-datasets), or if you can't choose, you can ask Professor Gotzler and he'll give you some data sets that he's chosen himself.

For this tutorial we will be exploring data from a .csv file of every single American Ninja Warrior Obstacle used in competition.

If you saved your .csv file correctly, you should see it in your folder in Jupyter Lab. The relative path you will use to import the data into your session is the file name of the data-set in your folder. For this example, your relative path will be `AmericanNinjaWarrior.csv`.

You'll import the data-set into your session with the following function: `pd.read_csv()`. Remember to include `df=` so the result is stored in an object named `df`. You'll put the relative path, in other words the data-set file name, in quotation marks inside the parentheses of the function.

In [2]:
df=pd.read_csv("AmericanNinjaWarrior.csv")
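If you'd like to try `pd.read_csv()` before your own file is in place, here is a minimal sketch that reads a few lines of text instead of a file on disk. The names `csv_text` and `sample_df` are assumptions made for this example, and the rows simply follow the same layout as the obstacle data.

```python
import io
import pandas as pd

# A few rows in the same layout as the obstacle data.
# io.StringIO lets pd.read_csv read from text instead of a file on disk.
csv_text = """Season,Location,Round/Stage,Obstacle Name,Obstacle Order
1,Venice,Qualifying,Quintuple Steps,1
1,Venice,Qualifying,Rope Swing,2
1,Venice,Qualifying,Rolling Barrel,3
"""

sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

With a real file, you would pass the file name in quotation marks instead of the `io.StringIO(...)` object.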

### Step #3: Ways of Exploring and Getting to Know Your Data

There are many different attributes and methods you can use to explore your data. They give you numbers, information, and values from your dataframe so you can get to know your data better.

- `.columns` provides all of the column names in your data.
- `.shape` provides the number of rows and columns in your dataset, in that order.
- `.size` provides the total number of cells in your data (rows * columns).
- `.head()` provides the first 5 rows in your data set.
- `.tail()` provides the last 5 rows in your data set.
- `.sample(n=10)` provides 10 random rows from your dataset.

Remember to put `df` in front of each one so Python knows which data to look at. They should look like so:

In [3]:
df.columns

Index(['Season', 'Location', 'Round/Stage', 'Obstacle Name', 'Obstacle Order',
       'Unnamed: 5', 'Unnamed: 6'],
      dtype='object')

In [6]:
df.shape

(904, 7)

In [7]:
df.size

6328
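The numbers above fit together: 904 rows times 7 columns gives the 6328 cells reported by `.size`. The short sketch below shows the same relationship on a small made-up table. The name `small_df` and its values are only for illustration.

```python
import pandas as pd

# A small made-up frame with 3 rows and 2 columns.
small_df = pd.DataFrame({
    "Location": ["Venice", "Miami", "Atlanta"],
    "Obstacle Order": [1, 2, 3],
})

n_rows, n_cols = small_df.shape   # .shape is a (rows, columns) tuple
total_cells = small_df.size       # .size is rows * columns

print(n_rows, n_cols, total_cells)
```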

In [8]:
df.head()

Unnamed: 0,Season,Location,Round/Stage,Obstacle Name,Obstacle Order,Unnamed: 5,Unnamed: 6
0,1.0,Venice,Qualifying,Quintuple Steps,1.0,,
1,1.0,Venice,Qualifying,Rope Swing,2.0,,
2,1.0,Venice,Qualifying,Rolling Barrel,3.0,,
3,1.0,Venice,Qualifying,Jumping Spider,4.0,,
4,1.0,Venice,Qualifying,Pipe Slider,5.0,,


In [9]:
df.tail()

Unnamed: 0,Season,Location,Round/Stage,Obstacle Name,Obstacle Order,Unnamed: 5,Unnamed: 6
899,,,,,,,
900,,,,,,,
901,,,,,,,
902,,,,,,,
903,,,,,,,
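The last rows shown above are completely empty. This tutorial leaves them in place, but if you ever want to set such rows aside, one possible approach is `dropna(how="all")`. The sketch below uses a made-up table, and the names `messy_df` and `cleaned_df` are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Made-up frame: two real rows followed by one completely empty row.
messy_df = pd.DataFrame({
    "Location": ["Venice", "Miami", np.nan],
    "Obstacle Order": [1.0, 2.0, np.nan],
})

# how="all" drops a row only when every value in it is missing.
cleaned_df = messy_df.dropna(how="all")

print(len(messy_df), len(cleaned_df))
```

Rows that are only partly empty are kept, so no real obstacle information is lost this way.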


In [10]:
df.sample(n=10)

Unnamed: 0,Season,Location,Round/Stage,Obstacle Name,Obstacle Order,Unnamed: 5,Unnamed: 6
99,3.0,Sasuke 27 (Japan),National Finals - Stage 1,Spinning Bridge,7.0,,
385,6.0,Miami,Finals (Regional/City),Spider Climb,10.0,,
538,7.0,Las Vegas,National Finals - Stage 3,Doorknob Grasper,2.0,,
26,1.0,Sasuke 23 (Japan),National Finals - Stage 2,Stick Slider,3.0,,
771,10.0,Los Angeles,Qualifying (Regional/City),Spinning Bridge,3.0,,
209,4.0,Southeast,Finals (Regional/City),Cargo Climb,10.0,,
878,10.0,Las Vegas,National Finals - Stage 2,Wingnut Alley,5.0,,
541,7.0,Las Vegas,National Finals - Stage 3,Pole Grasper,5.0,,
811,10.0,Miami,Finals (Regional/City),Slippery Summit,5.0,,
285,5.0,Denver,Qualifying (Regional/City),Jump Hang Kai,4.0,,


Or, if you wanted a larger sample, say 15 or 20 rows, it would look like this:

In [11]:
df.sample(n=15)

Unnamed: 0,Season,Location,Round/Stage,Obstacle Name,Obstacle Order,Unnamed: 5,Unnamed: 6
861,10.0,Minneapolis,Finals (Regional/City),Salmon Ladder,7.0,,
679,9.0,San Antonio,Finals (Regional/City),Spinball Wizard,9.0,,
327,6.0,Venice,Qualifying (Regional/City),Warped Wall,6.0,,
695,9.0,Daytona Beach,Finals (Regional/City),Circuit Board,9.0,,
313,5.0,Las Vegas,National Finals - Stage 3,Roulette Cylinder,1.0,,
285,5.0,Denver,Qualifying (Regional/City),Jump Hang Kai,4.0,,
377,6.0,Miami,Finals (Regional/City),Downhill Pipe Drop,2.0,,
140,4.0,Midwest,Finals (Regional/City),Jump Hang,4.0,,
214,4.0,Las Vegas,National Finals - Stage 1,Half-Pipe Attack,5.0,,
501,7.0,Houston,Finals (Regional/City),Warped Wall,6.0,,


In [12]:
df.sample(n=20)

Unnamed: 0,Season,Location,Round/Stage,Obstacle Name,Obstacle Order,Unnamed: 5,Unnamed: 6
801,10.0,Miami,Qualifying (Regional/City),Floating Steps,1.0,,
308,5.0,Las Vegas,National Finals - Stage 2,Double Salmon Ladder,2.0,,
365,6.0,St. Louis,Finals (Regional/City),Warped Wall,6.0,,
381,6.0,Miami,Finals (Regional/City),Warped Wall,6.0,,
526,7.0,Las Vegas,National Finals - Stage 1,Sonic Curve,5.0,,
194,4.0,Southeast,Qualifying (Regional/City),Log Grip,2.0,,
302,5.0,Las Vegas,National Finals - Stage 1,Half-Pipe Attack,5.0,,
591,8.0,Indianapolis,Finals (Regional/City),Hourglass Drop,8.0,,
300,5.0,Las Vegas,National Finals - Stage 1,Rope Glider,3.0,,
796,10.0,Dallas,Finals (Regional/City),Warped Wall,6.0,,
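Because `.sample()` picks rows at random, you will usually see different rows each time you run it. If you want the same rows every time, for example to share your notebook, one option is the `random_state` argument. The sketch below uses a made-up table, and the value `0` is an arbitrary choice.

```python
import pandas as pd

# A small made-up table of obstacle names.
small_df = pd.DataFrame({
    "Obstacle Name": ["Warped Wall", "Salmon Ladder", "Log Grip", "Jump Hang", "Rope Swing"],
})

# With the same random_state, both calls return the same rows.
first_draw = small_df.sample(n=2, random_state=0)
second_draw = small_df.sample(n=2, random_state=0)

print(first_draw.equals(second_draw))
```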


### Step #4: Using Computational Techniques to Navigate, Create, and Export Your Newly Filtered Data Sub-Set

With so much data, it's often hard to narrow things down and actually find what you're looking for. Despite the challenge, with Python, Pandas, and Numpy you can take some "short cuts" to find what you're looking for quickly and easily. That's where the following commands come in handy when looking through a big .csv file.

In this tutorial I will show you how to narrow down your big data-set into a subset of specific information.

Say you wanted to see all of the obstacle information in your data-set for the city of Atlanta.

- The inner command you would write is `df["Location"] == "Atlanta"`. It checks every row and marks it `True` when the location is Atlanta and `False` otherwise.
- The outer command encompassing the inner command is `df[...]`. It keeps only the rows marked `True`.
- When combined, these commands return all of the information for the city of Atlanta in your data set.

In [13]:
df[df["Location"] == "Atlanta"]

Unnamed: 0,Season,Location,Round/Stage,Obstacle Name,Obstacle Order,Unnamed: 5,Unnamed: 6
562,8.0,Atlanta,Qualifying (Regional/City),Floating Steps,1.0,,
563,8.0,Atlanta,Qualifying (Regional/City),Big Dipper,2.0,,
564,8.0,Atlanta,Qualifying (Regional/City),Block Run,3.0,,
565,8.0,Atlanta,Qualifying (Regional/City),Spin Cycle,4.0,,
566,8.0,Atlanta,Qualifying (Regional/City),Pipe Fitter,5.0,,
567,8.0,Atlanta,Qualifying (Regional/City),Warped Wall,6.0,,
568,8.0,Atlanta,Finals (Regional/City),Floating Steps,1.0,,
569,8.0,Atlanta,Finals (Regional/City),Big Dipper,2.0,,
570,8.0,Atlanta,Finals (Regional/City),Block Run,3.0,,
571,8.0,Atlanta,Finals (Regional/City),Spin Cycle,4.0,,
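To see the inner and outer commands working separately, here is a minimal sketch on a made-up four-row table. The names `demo_df`, `is_atlanta`, and `atlanta_rows` are assumptions for this example.

```python
import pandas as pd

# A small made-up table with a Location column.
demo_df = pd.DataFrame({
    "Location": ["Atlanta", "Miami", "Atlanta", "Venice"],
    "Obstacle Name": ["Floating Steps", "Quintuple Steps", "Big Dipper", "Rope Swing"],
})

# Inner command: one True/False value per row.
is_atlanta = demo_df["Location"] == "Atlanta"

# Outer command: keep only the rows where the value is True.
atlanta_rows = demo_df[is_atlanta]

print(is_atlanta.tolist())
print(atlanta_rows)
```

Writing `demo_df[demo_df["Location"] == "Atlanta"]` on one line does the same thing without the extra variable.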


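You can also count how many times each obstacle appears in the data. `.value_counts()` tallies each obstacle name and sorts the results from most to least common, and `.head(20)` keeps the top 20:

In [23]:
df["Obstacle Name"].value_counts().head(20)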

Warped Wall             86
Salmon Ladder           41
Quintuple Steps         32
Floating Steps          28
Log Grip                21
Jump Hang               18
Quad Steps              16
Jumping Spider          14
Invisible Ladder        11
Wall Lift               11
Rolling Log             11
Rope Ladder             10
Bridge of Blades        10
Cargo Climb              9
Rope Climb               9
Spider Climb             9
Ultimate Cliffhanger     9
Spinning Log             8
Spinning Bridge          8
Jumping Bars             8
Name: Obstacle Name, dtype: int64
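The same counting idea can be sketched on a handful of made-up values. The names `obstacles`, `counts`, and `top_two` are assumptions for this example.

```python
import pandas as pd

# Six made-up obstacle entries with some repeats.
obstacles = pd.Series(
    ["Warped Wall", "Salmon Ladder", "Warped Wall", "Log Grip", "Warped Wall", "Salmon Ladder"],
    name="Obstacle Name",
)

# value_counts() tallies each distinct value, most frequent first.
counts = obstacles.value_counts()

# head(2) keeps only the two most frequent values.
top_two = counts.head(2)
print(top_two)
```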

In [25]:
ValueTable = df["Obstacle Name"].value_counts().head(20)

In [26]:
ValueTable

Warped Wall             86
Salmon Ladder           41
Quintuple Steps         32
Floating Steps          28
Log Grip                21
Jump Hang               18
Quad Steps              16
Jumping Spider          14
Invisible Ladder        11
Wall Lift               11
Rolling Log             11
Rope Ladder             10
Bridge of Blades        10
Cargo Climb              9
Rope Climb               9
Spider Climb             9
Ultimate Cliffhanger     9
Spinning Log             8
Spinning Bridge          8
Jumping Bars             8
Name: Obstacle Name, dtype: int64

In [27]:
# The obstacle names are stored in the index, so keep the index when exporting.
ValueTable.to_csv("ValueTable.csv")
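In a value-count table, the obstacle names live in the index, so whether the index is written to the file matters. The sketch below compares both settings on a made-up table. It calls `.to_csv()` without a file name, which returns the text instead of writing a file. The names `counts`, `with_names`, and `without_names` are assumptions for this example.

```python
import pandas as pd

# A made-up count table: names in the index, counts as the values.
counts = pd.Series([3, 2], index=["Warped Wall", "Salmon Ladder"], name="count")

# With no file name, to_csv returns the CSV text instead of writing a file.
with_names = counts.to_csv()
without_names = counts.to_csv(index=False)

print(with_names)
print(without_names)
```

Only the first version keeps the obstacle names next to their counts.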

Say you wanted to see all of the obstacle information for the city of Miami's competitions.

You would use the exact same combined commands as above but replace `"Atlanta"` with `"Miami"`, resulting in this:

In [14]:
df[df["Location"] == "Miami"]

Unnamed: 0,Season,Location,Round/Stage,Obstacle Name,Obstacle Order,Unnamed: 5,Unnamed: 6
266,5.0,Miami,Qualifying (Regional/City),Quintuple Steps,1.0,,
267,5.0,Miami,Qualifying (Regional/City),Utility Pole Slider,2.0,,
268,5.0,Miami,Qualifying (Regional/City),Balance Bridge,3.0,,
269,5.0,Miami,Qualifying (Regional/City),Slider Jump,4.0,,
270,5.0,Miami,Qualifying (Regional/City),Monkey Pegs,5.0,,
271,5.0,Miami,Qualifying (Regional/City),Warped Wall,6.0,,
272,5.0,Miami,Finals (Regional/City),Quintuple Steps,1.0,,
273,5.0,Miami,Finals (Regional/City),Utility Pole Slider,2.0,,
274,5.0,Miami,Finals (Regional/City),Balance Bridge,3.0,,
275,5.0,Miami,Finals (Regional/City),Slider Jump,4.0,,
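Since only the city name changes between the two filters, one way to avoid retyping the command is to wrap it in a small function. The helper `rows_for_city` below is hypothetical, written for this sketch only, and the table is made up.

```python
import pandas as pd

def rows_for_city(frame, city):
    """Return the rows of frame whose Location equals city (hypothetical helper)."""
    return frame[frame["Location"] == city]

# A small made-up table with a Location column.
demo_df = pd.DataFrame({
    "Location": ["Atlanta", "Miami", "Venice", "Miami"],
    "Obstacle Name": ["Floating Steps", "Quintuple Steps", "Rope Swing", "Warped Wall"],
})

miami_rows = rows_for_city(demo_df, "Miami")
atlanta_rows = rows_for_city(demo_df, "Atlanta")

print(miami_rows)
print(atlanta_rows)
```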


Awesome job at your first time filtering data, you did great!

Now, you need to export your new subset as a .csv file. In order to do this, you first need to give a name to your new subset. By using `.copy()`, you take a "snapshot" of the filtered rows that is independent of the original data you filtered from. For this tutorial, you are only going to export the information for the city of Atlanta.

To use your Atlanta data subset later, run the following command:

In [18]:
ATL_subset = df[df["Location"] == "Atlanta"].copy()
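To see what a "snapshot" means in practice, the sketch below changes a copied subset and then checks that the original table is untouched. The table and the names `demo_df` and `atl_demo` are made up for this example.

```python
import pandas as pd

# A small made-up table.
demo_df = pd.DataFrame({
    "Location": ["Atlanta", "Miami", "Atlanta"],
    "Obstacle Order": [1, 2, 3],
})

# .copy() makes the subset its own independent table.
atl_demo = demo_df[demo_df["Location"] == "Atlanta"].copy()

# Changing the copy does not touch the original data.
atl_demo["Obstacle Order"] = atl_demo["Obstacle Order"] * 10

print(demo_df["Obstacle Order"].tolist())
print(atl_demo["Obstacle Order"].tolist())
```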

Now that you've given a name to your newly filtered data, you need to export it as a .csv file.

In order to do this, you use the command `.to_csv()`. Put the name you want for the new file, in quotation marks, within the parentheses at the end.

So, for example, for your filtered subset we would run `ATL_subset.to_csv("ATL_subset.csv")`. This will export a `.csv` file to our working directory.

By default, this newly created .csv file will include the column of index numbers that the Pandas Package created when we read the original file into our session using `pd.read_csv()`.

In order to leave them out, we can add `index=False` to our statement, which tells it not to write those extra index numbers that are not needed.

It will look like so:

In [19]:
ATL_subset.to_csv("ATL_subset.csv", index=False)
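To see the difference `index=False` makes, here is a minimal sketch that writes a made-up two-row subset both ways into a temporary folder and reads each file back. The file names, the folder, and the names `demo_subset`, `with_index_cols`, and `no_index_cols` are assumptions for this example.

```python
import os
import tempfile

import pandas as pd

# A made-up two-row subset with row labels like those in the Atlanta output.
demo_subset = pd.DataFrame(
    {"Location": ["Atlanta", "Atlanta"], "Obstacle Name": ["Floating Steps", "Big Dipper"]},
    index=[562, 563],
)

with tempfile.TemporaryDirectory() as folder:
    with_index_path = os.path.join(folder, "with_index.csv")
    no_index_path = os.path.join(folder, "no_index.csv")

    demo_subset.to_csv(with_index_path)             # default: index is written
    demo_subset.to_csv(no_index_path, index=False)  # index is left out

    # Read both files back and compare their column names.
    with_index_cols = list(pd.read_csv(with_index_path).columns)
    no_index_cols = list(pd.read_csv(no_index_path).columns)

print(with_index_cols)
print(no_index_cols)
```

The file written with the default setting comes back with an extra first column named `Unnamed: 0`, while the `index=False` file contains only the original columns.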

Amazing job! You have now successfully finished the tutorial and should see your new data in the form of a .csv file in your folder. And if you don't see it, no problem, just carefully check over your work and go through the tutorial step-by-step to figure out where you went wrong.

If you want, you can create a new session in Jupyter Lab and load the file you just created with `pd.read_csv()` to start manipulating your very own data!

### Thank You and Further Resources!

This tutorial was created and brought to you by Reid Miller. A big thank you to Prof. Gotzler for teaching and assigning this very interesting and cool assignment.

If you're interested in more information and tutorials on Pandas, check out these sources provided by the Davis Library Research Hub on [Pandas](https://unc-libraries-data.github.io/Python/Jupyter/Pandas.html) and [Pandas: Extra_Topics](https://unc-libraries-data.github.io/Python/Jupyter/Extra_Topics.html).