# Five-Step Process for Data Exploration

Beginners often run into problems when they write too many lines of code in a single notebook cell. It's important to get feedback on every line of code that you write and verify that it is in fact correct. Only once you have verified the result should you move on to the next line of code. To help you explore data more effectively in Jupyter Notebooks, I recommend the following five-step process:

1. Write and execute a single line of code to explore your data
1. Verify that this line of code works by inspecting the output
1. Assign the result to a variable
1. Within the same cell, in a second line, output the head of the DataFrame or Series
1. Continue to the next cell. Do not add more lines of code to the cell

### Apply to every part of the analysis
You can apply this five-step process to every part of your data analysis. Let's begin by reading in the bikes dataset and applying the five-step process for setting the index of our DataFrame as the `trip_id` column.

In [1]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')
bikes.head(3)

|  | trip_id | usertype | gender | starttime | stoptime | tripduration | from_station_name | latitude_start | longitude_start | dpcapacity_start | to_station_name | latitude_end | longitude_end | dpcapacity_end | temperature | visibility | wind_speed | precipitation | events |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7147 | Subscriber | Male | 2013-06-28 19:01:00 | 2013-06-28 19:17:00 | 993 | Lake Shore Dr & Monroe St | 41.88105 | -87.61697 | 11.0 | Michigan Ave & Oak St | 41.90096 | -87.623777 | 15.0 | 73.9 | 10.0 | 12.7 | -9999.0 | mostlycloudy |
| 1 | 7524 | Subscriber | Male | 2013-06-28 22:53:00 | 2013-06-28 23:03:00 | 623 | Clinton St & Washington Blvd | 41.88338 | -87.64117 | 31.0 | Wells St & Walton St | 41.89993 | -87.63443 | 19.0 | 69.1 | 10.0 | 6.9 | -9999.0 | partlycloudy |
| 2 | 10927 | Subscriber | Male | 2013-06-30 14:43:00 | 2013-06-30 15:01:00 | 1040 | Sheffield Ave & Kingsbury St | 41.909592 | -87.653497 | 15.0 | Dearborn St & Monroe St | 41.88132 | -87.629521 | 23.0 | 73.0 | 10.0 | 16.1 | -9999.0 | mostlycloudy |


### Step 1: Write and execute a single line of code to explore your data

In this step, we call the `set_index` method to set the `trip_id` column as the index.

In [2]:
bikes.set_index('trip_id').head(3)

| trip_id | usertype | gender | starttime | stoptime | tripduration | from_station_name | latitude_start | longitude_start | dpcapacity_start | to_station_name | latitude_end | longitude_end | dpcapacity_end | temperature | visibility | wind_speed | precipitation | events |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7147 | Subscriber | Male | 2013-06-28 19:01:00 | 2013-06-28 19:17:00 | 993 | Lake Shore Dr & Monroe St | 41.88105 | -87.61697 | 11.0 | Michigan Ave & Oak St | 41.90096 | -87.623777 | 15.0 | 73.9 | 10.0 | 12.7 | -9999.0 | mostlycloudy |
| 7524 | Subscriber | Male | 2013-06-28 22:53:00 | 2013-06-28 23:03:00 | 623 | Clinton St & Washington Blvd | 41.88338 | -87.64117 | 31.0 | Wells St & Walton St | 41.89993 | -87.63443 | 19.0 | 69.1 | 10.0 | 6.9 | -9999.0 | partlycloudy |
| 10927 | Subscriber | Male | 2013-06-30 14:43:00 | 2013-06-30 15:01:00 | 1040 | Sheffield Ave & Kingsbury St | 41.909592 | -87.653497 | 15.0 | Dearborn St & Monroe St | 41.88132 | -87.629521 | 23.0 | 73.0 | 10.0 | 16.1 | -9999.0 | mostlycloudy |


### Step 2: Verify that this line of code works by inspecting the output

Looking above, the output appears to be correct. The `trip_id` column has been set as the index and is no longer a column.
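Visual inspection is usually enough here, but the same check can also be written as code. The following is a minimal sketch, assuming a small stand-in DataFrame (`bikes_sample`, a hypothetical name) built from a few of the values shown above rather than the full CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the bikes data, using values from the rows shown above
bikes_sample = pd.DataFrame({
    'trip_id': [7147, 7524, 10927],
    'usertype': ['Subscriber', 'Subscriber', 'Subscriber'],
    'tripduration': [993, 623, 1040],
})

result = bikes_sample.set_index('trip_id')

# The index should now be named trip_id
index_is_trip_id = result.index.name == 'trip_id'
# trip_id should no longer appear among the columns
trip_id_not_a_column = 'trip_id' not in result.columns
# Setting the index should not change the number of rows
same_row_count = len(result) == len(bikes_sample)

print(index_is_trip_id, trip_id_not_a_column, same_row_count)
```

Checks like these complement looking at the output; they do not replace it.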

### Step 3: Assign the result to a variable

You would normally do this step in the same cell, but for this demonstration, we will place it in the cell below.

In [3]:
bikes2 = bikes.set_index('trip_id')

### Step 4: Within the same cell, in a second line, output the head of the DataFrame or Series

Again, all these steps would be combined in the same cell.

In [4]:
bikes2.head(3)

| trip_id | usertype | gender | starttime | stoptime | tripduration | from_station_name | latitude_start | longitude_start | dpcapacity_start | to_station_name | latitude_end | longitude_end | dpcapacity_end | temperature | visibility | wind_speed | precipitation | events |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7147 | Subscriber | Male | 2013-06-28 19:01:00 | 2013-06-28 19:17:00 | 993 | Lake Shore Dr & Monroe St | 41.88105 | -87.61697 | 11.0 | Michigan Ave & Oak St | 41.90096 | -87.623777 | 15.0 | 73.9 | 10.0 | 12.7 | -9999.0 | mostlycloudy |
| 7524 | Subscriber | Male | 2013-06-28 22:53:00 | 2013-06-28 23:03:00 | 623 | Clinton St & Washington Blvd | 41.88338 | -87.64117 | 31.0 | Wells St & Walton St | 41.89993 | -87.63443 | 19.0 | 69.1 | 10.0 | 6.9 | -9999.0 | partlycloudy |
| 10927 | Subscriber | Male | 2013-06-30 14:43:00 | 2013-06-30 15:01:00 | 1040 | Sheffield Ave & Kingsbury St | 41.909592 | -87.653497 | 15.0 | Dearborn St & Monroe St | 41.88132 | -87.629521 | 23.0 | 73.0 | 10.0 | 16.1 | -9999.0 | mostlycloudy |


### Step 5: Continue to the next cell. Do not add more lines of code to the cell

It is tempting to do more analysis in a single cell. I advise against doing so when you are a beginner. By limiting your analysis to a single main line of code per cell, and outputting that result, you can easily trace your work from one step to the next. Most lines of code in a notebook will apply some operation to the data. It is vital that you can see exactly what this operation is doing. If you put multiple lines of code in a single cell, you lose track of what is happening and can't easily determine the correctness of each operation.

### All steps in one cell
The five-step process was shown above one step at a time in different cells. When you actually explore data with this process, you would complete it in a single cell and end up with the following result.

In [5]:
bikes2 = bikes.set_index('trip_id')
bikes2.head(3)

| trip_id | usertype | gender | starttime | stoptime | tripduration | from_station_name | latitude_start | longitude_start | dpcapacity_start | to_station_name | latitude_end | longitude_end | dpcapacity_end | temperature | visibility | wind_speed | precipitation | events |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7147 | Subscriber | Male | 2013-06-28 19:01:00 | 2013-06-28 19:17:00 | 993 | Lake Shore Dr & Monroe St | 41.88105 | -87.61697 | 11.0 | Michigan Ave & Oak St | 41.90096 | -87.623777 | 15.0 | 73.9 | 10.0 | 12.7 | -9999.0 | mostlycloudy |
| 7524 | Subscriber | Male | 2013-06-28 22:53:00 | 2013-06-28 23:03:00 | 623 | Clinton St & Washington Blvd | 41.88338 | -87.64117 | 31.0 | Wells St & Walton St | 41.89993 | -87.63443 | 19.0 | 69.1 | 10.0 | 6.9 | -9999.0 | partlycloudy |
| 10927 | Subscriber | Male | 2013-06-30 14:43:00 | 2013-06-30 15:01:00 | 1040 | Sheffield Ave & Kingsbury St | 41.909592 | -87.653497 | 15.0 | Dearborn St & Monroe St | 41.88132 | -87.629521 | 23.0 | 73.0 | 10.0 | 16.1 | -9999.0 | mostlycloudy |


### More examples

Let's see a more complex example of the five-step process. Let's find the `from_station_name` that has the longest average trip duration. This example will be completed with two rounds of the five-step process. First we will find the average trip duration for each station and then we will sort it. This example uses the `groupby` method which is covered in the **Grouping Data** part of the book.

In [6]:
avg_td = bikes.groupby('from_station_name').agg({'tripduration':'mean'})
avg_td.head(3)

| from_station_name | tripduration |
| --- | --- |
| 2112 W Peterson Ave | 911.625 |
| 63rd St Beach | 1027.666667 |
| 900 W Harrison | 495.5 |


After grouping, we can sort from greatest to least.

In [7]:
top_stations = avg_td.sort_values('tripduration', ascending=False)
top_stations.head(3)

| from_station_name | tripduration |
| --- | --- |
| Western Blvd & 48th Pl | 7902.0 |
| Kedzie Ave & Lake St | 5474.823529 |
| Ridge Blvd & Howard St | 4839.666667 |


While it is possible to complete this exercise in a single cell, I recommend executing only a single main line of code that explores the data.
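For comparison, here is a sketch of what the single-expression version might look like next to the two-round version. The `toy` DataFrame and its values are made up for illustration and are not taken from the bikes dataset.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the bikes dataset)
toy = pd.DataFrame({
    'from_station_name': ['A', 'A', 'B', 'C'],
    'tripduration': [100, 300, 500, 50],
})

# Two rounds of the process: each intermediate result has a name and can be inspected
avg_td = toy.groupby('from_station_name').agg({'tripduration': 'mean'})
top_stations = avg_td.sort_values('tripduration', ascending=False)

# Single chained expression: the grouped averages are never displayed on their own
chained = (
    toy.groupby('from_station_name')
       .agg({'tripduration': 'mean'})
       .sort_values('tripduration', ascending=False)
)

# Both approaches produce the same DataFrame
print(chained.equals(top_stations))
```

The two approaches give the same final result. The difference is that the two-round version lets you look at `avg_td` before sorting it.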

### No strict requirement for one line of code
The above examples each had a single main line of code followed by outputting the head of the DataFrame. Oftentimes, there will be a few more simple lines of code that can be written in the same cell. You should not strictly adhere to writing a single line of code, but instead, think about keeping the amount of code written in a single cell to a minimum.

For instance, the following block is used to select a subset of the data with three lines of code. The first is simple and creates a list of column names as strings. This is an instance where multiple lines of code are easily interpreted.

In [8]:
cols = ['gender', 'tripduration']
bikes_gt = bikes[cols]
bikes_gt.head(3)

|  | gender | tripduration |
| --- | --- | --- |
| 0 | Male | 993 |
| 1 | Male | 623 |
| 2 | Male | 1040 |


### When to assign the result to a variable
Not all operations on our data will need to be assigned to a variable. We might just be interested in seeing the results. But, for many operations, you will want to continue with the new transformed data. By assigning the result to a variable, you will have immediate access to the result.

### When to create a new variable name
During step 3 of the first example, the result of our new dataset was assigned to `bikes2`. We could have reassigned the result back to `bikes` and continued on with our analysis. When first exploring data, I recommend creating a new variable for each major result. By doing so, you will have preserved each step of your work and be able to inspect it later on. Creating new variables makes it much easier to find errors at different places in your analysis.
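As a small sketch of why a new name preserves your work: `set_index` returns a new DataFrame by default, so the original stays available under its old name. The stand-in `bikes_sample` below is hypothetical and uses only a few values from the rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for the bikes data
bikes_sample = pd.DataFrame({
    'trip_id': [7147, 7524, 10927],
    'tripduration': [993, 623, 1040],
})

# Assign the result to a new name; the original is left untouched
bikes2 = bikes_sample.set_index('trip_id')

original_columns = list(bikes_sample.columns)
new_columns = list(bikes2.columns)

print(original_columns)
print(new_columns)
```

Because `bikes_sample` still has `trip_id` as a column, you can return to it if a later step turns out to be wrong.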

### When to reuse variable names
The downside to using new variable names is that each variable can hold a copy of the data, and if your dataset is large, you might run out of memory. By reassigning a result to the same variable name, you'll reduce the memory used.

Another time to reuse variable names is when you are confident that the analysis you have produced is correct and no longer need to preserve all the previous results.
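One way to gauge whether memory is a concern is to measure a DataFrame before deciding. The sketch below uses the `memory_usage` method on a hypothetical stand-in (`bikes_sample`) and then reassigns the result to the same name. Once no variable refers to the old DataFrame, Python is free to reclaim its memory.

```python
import pandas as pd

# Hypothetical stand-in for the bikes data
bikes_sample = pd.DataFrame({
    'trip_id': [7147, 7524, 10927],
    'usertype': ['Subscriber', 'Subscriber', 'Subscriber'],
    'tripduration': [993, 623, 1040],
})

# Approximate size in bytes; deep=True also counts the contents of string columns
bytes_before = bikes_sample.memory_usage(deep=True).sum()

# Reuse the same name: the old DataFrame is no longer referenced by any variable
bikes_sample = bikes_sample.set_index('trip_id')
bytes_after = bikes_sample.memory_usage(deep=True).sum()

print(bytes_before, bytes_after)
```

On a three-row sample the numbers are tiny, but the same two calls work on a full-sized DataFrame.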

### Continuously verifying results
Regardless of how adept you become at doing data explorations, it is good practice to verify each line of code. Data science is difficult and it is easy to make mistakes. Data is also messy and it is good to be skeptical while proceeding through an analysis. Getting visual verification that each line of code is producing the desired result is important. Doing this also provides feedback to help you think about what avenues to explore next.
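As an example of healthy skepticism, every row displayed above has a `precipitation` value of -9999.0, which is not a plausible measurement. A quick check like the sketch below can reveal how widespread such values are. The stand-in `weather_sample` is hypothetical and contains only the three rows shown earlier. Treating negative precipitation as suspicious is my assumption, not something the dataset documents.

```python
import pandas as pd

# Hypothetical stand-in using the weather values from the three rows shown earlier
weather_sample = pd.DataFrame({
    'trip_id': [7147, 7524, 10927],
    'temperature': [73.9, 69.1, 73.0],
    'precipitation': [-9999.0, -9999.0, -9999.0],
})

# The minimum is a quick way to spot implausible values
precip_min = weather_sample['precipitation'].min()

# Count rows with negative precipitation (assumed here to be suspicious)
n_suspicious = (weather_sample['precipitation'] < 0).sum()

print(precip_min, n_suspicious)
```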