[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/treehouse-projects/python-introducing-pandas/master?filepath=s2n1-exploration-methods.ipynb)

# Exploration Methods
When you get a CSV file from an external source, you'll want to get a feel for the data.

Let's import some data and then learn how to explore it! 

In [1]:
# Setup
import os
import pandas as pd

from utils import render

# We use os.path.join because Windows uses a back slash (\) to separate directories
#  while others use a forward slash (/)
users_file_name = os.path.join('data', 'users.csv')
users_file_name

'data/users.csv'
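As an aside, the standard library's `pathlib` module handles the same separator problem. This is only a sketch of an alternative, not what the rest of this notebook uses, and the `users_path` name is made up for illustration.

```python
from pathlib import Path

# The / operator joins path segments using the right separator for the OS
users_path = Path("data") / "users.csv"

# as_posix() always renders forward slashes, which is handy for display
print(users_path.as_posix())
```

`pd.read_csv` accepts a `Path` object as well as a string, so either style should work for the import below.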

## CSV File Exploration
If you want to take a peek at your CSV file, you could open it in an editor. 

Let's just use some standard Python to see the first few lines of the file.

In [2]:
# Open the file and print out the first 5 lines
with open(users_file_name) as lines:
    for _ in range(5):
        # The `file` object is an iterator, so just get the next line 
        render(next(lines))

,first_name,last_name,email,email_verified,signup_date,referral_count,balance


aalexander,Adam,Alexander,alexander@gmail.com,True,2018-08-22,0,31.68


aaron.grimesaaron3543,Aaron,Grimes,aaron@gmail.com,True,2018-06-18,7,53.3


aaron.vargasaaron772,Aaron,Vargas,aaron@hotmail.com,False,2018-04-26,7,95.8


adam,Adam,Roberts,adam@gmail.com,True,2018-07-13,5,35.07
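One caveat with the loop above: calling `next` five times raises `StopIteration` if the file has fewer than five lines. A minimal sketch of a more forgiving variant uses `itertools.islice`. Here an `io.StringIO` object with made-up contents stands in for the open file so the example runs on its own.

```python
import io
from itertools import islice

# io.StringIO stands in for an open file; the three lines are made up
fake_file = io.StringIO("header\nrow1\nrow2\n")

# islice stops quietly when the lines run out instead of raising
first_lines = [line.rstrip("\n") for line in islice(fake_file, 5)]
print(first_lines)
```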


Notice how the first line is a header row. It has column names in it. By default, `pandas.read_csv` assumes that the first row is your header row.

Also note how the first column of that header row is empty. The values below it in that first column appear to be usernames. They are what we want for the index.

We can use the `index_col` parameter of the `pandas.read_csv` function to say which column should become the index.

In [3]:
# Create a new `DataFrame` and set the index to the first column
users = pd.read_csv(users_file_name, index_col=0)
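To see what `index_col` changes, here is a small sketch that reads the same CSV text twice. The three-line CSV is made up to mimic the shape of `users.csv`, and `io.StringIO` lets us read it without touching the disk.

```python
import io
import pandas as pd

# Made-up CSV text: empty first header cell, usernames in the first column
csv_text = (
    ",first_name,balance\n"
    "aalexander,Adam,31.68\n"
    "adam,Adam,35.07\n"
)

# Without index_col, pandas builds a numeric index and names the
# header-less column "Unnamed: 0"
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.index))
```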

## Explore your imported DataFrame

A quick way to check that your CSV file was read correctly is to use the [`DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method. It gives you the first **n** rows, where **n** defaults to 5.

In [4]:
users.head()

| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| aalexander | Adam | Alexander | alexander@gmail.com | True | 2018-08-22 | 0 | 31.68 |
| aaron.grimesaaron3543 | Aaron | Grimes | aaron@gmail.com | True | 2018-06-18 | 7 | 53.3 |
| aaron.vargasaaron772 | Aaron | Vargas | aaron@hotmail.com | False | 2018-04-26 | 7 | 95.8 |
| adam | Adam | Roberts | adam@gmail.com | True | 2018-07-13 | 5 | 35.07 |
| agregory | Anthony | Gregory | gregory4121@yahoo.com | True | 2018-01-16 | 2 | 97.0 |
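If five rows is not what you want, `head` takes the number of rows as its argument, and `tail` does the same from the other end. A quick sketch on a made-up seven-row frame:

```python
import pandas as pd

# Made-up stand-in for the users DataFrame
df = pd.DataFrame({"balance": [31.68, 53.3, 95.8, 35.07, 97.0, 12.5, 60.0]})

print(len(df.head()))   # default number of rows
print(df.head(2))       # just the first two rows
print(df.tail(3))       # the last three rows
```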


Nice! We got it.  So let's see how many rows we have.  There are a couple of ways.

In [5]:
# Pythonic approach still works
len(users)

453

*Side note*: This length call is quick. Under the covers, `DataFrame.__len__` returns `len(df.index)`, so it counts the rows by using the index. You might see older code that calls `len(df.index)` directly to get a count of rows. As of pandas version 0.11, `len(df)` is the same as `len(df.index)`.

The `DataFrame.shape` property works just like `np.ndarray.shape` does. It is a tuple holding the length of each axis of your data frame: rows first, then columns.

In [6]:
users.shape

(453, 7)
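Because `shape` is a plain tuple, you can unpack it, and the row count agrees with both `len` calls from the side note. A small sketch on a made-up frame:

```python
import pandas as pd

# Made-up three-row frame with usernames as the index
df = pd.DataFrame(
    {"first_name": ["Adam", "Aaron", "Molly"], "balance": [31.68, 53.3, 99.98]},
    index=["aalexander", "aaron", "molly"],
)

n_rows, n_cols = df.shape
print(n_rows, n_cols)

# All three ways of counting rows agree
print(len(df) == len(df.index) == n_rows)
```

Note that the index is not counted as a column in `shape`.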

### Explore from a bird's eye view

The `DataFrame.count` method counts the non-missing values in each column. Most of our columns have a value in every row, but **last_name** has some `np.nan` values.

The `count` method is missing-data aware.

In [7]:
users.count()

first_name        453
last_name         415
email             453
email_verified    453
signup_date       453
referral_count    453
balance           453
dtype: int64
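If you would rather see how many values are missing than how many are present, one common approach is `isna().sum()`. The sketch below uses a made-up frame to show that the two counts add up to the number of rows.

```python
import numpy as np
import pandas as pd

# Made-up frame where two of three last names are missing
df = pd.DataFrame({
    "first_name": ["Molly", "Glenn", "Kevin"],
    "last_name": [np.nan, "Warren", np.nan],
})

non_missing = df.count()      # values present per column
missing = df.isna().sum()     # values missing per column

print(non_missing)
print(missing)
```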

Remember that a `DataFrame` can contain multiple data types, or [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes).

You can use the `DataFrame.dtypes` attribute to see the `dtype` of each column.

In [8]:
users.dtypes

first_name         object
last_name          object
email              object
email_verified       bool
signup_date        object
referral_count      int64
balance           float64
dtype: object

As you can see, most of the data types of these columns were inferred, or assumed, correctly. See how **`email_verified`** is automatically a `bool`, **`referral_count`** is an integer, and **`balance`** is a float. This happened when we used `pd.read_csv`.

One thing to note, though, is that the **`signup_date`** field is an `object` and not a `datetime`. You can convert these during or after import if you need to, and we'll do some of that later in this course.
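As a preview, here is a minimal sketch of both options on made-up CSV text: the `parse_dates` parameter of `pd.read_csv` converts during import, and `pd.to_datetime` converts afterwards. The course may approach this differently later on.

```python
import io
import pandas as pd

# Made-up CSV text with the same date format as users.csv
csv_text = (
    ",signup_date,balance\n"
    "aalexander,2018-08-22,31.68\n"
    "adam,2018-07-13,35.07\n"
)

# Option 1: convert while importing
during = pd.read_csv(
    io.StringIO(csv_text), index_col=0, parse_dates=["signup_date"]
)

# Option 2: import as text, then convert the column
after = pd.read_csv(io.StringIO(csv_text), index_col=0)
after["signup_date"] = pd.to_datetime(after["signup_date"])

print(during.dtypes)
print(after.dtypes)
```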

#### Describe your data
The `DataFrame.describe` method is a great way to get a vibe for all numeric data in your data frame. You'll see lots of different aggregations.

In [9]:
users.describe()

| | referral_count | balance |
|---|---|---|
| count | 453.0 | 453.0 |
| mean | 3.551876 | 52.051545 |
| std | 2.352757 | 29.378042 |
| min | 0.0 | 0.81 |
| 25% | 1.0 | 26.15 |
| 50% | 4.0 | 52.08 |
| 75% | 6.0 | 78.43 |
| max | 7.0 | 99.98 |
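Text columns are left out of that summary, but `describe` can still summarize them when a frame holds no numeric columns. In that case it reports `count`, `unique`, `top`, and `freq`. A sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame with one text column and one numeric column
df = pd.DataFrame({
    "first_name": ["Adam", "Aaron", "Adam"],
    "referral_count": [0, 7, 5],
})

# Only the numeric column shows up by default
numeric_summary = df.describe()
print(numeric_summary)

# Selecting just the text column gives count/unique/top/freq
text_summary = df[["first_name"]].describe()
print(text_summary)
```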


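Most of these aggregations are available by themselves as well.

In [10]:
# The mean or average
# numeric_only=True skips the text columns; newer pandas versions
#  raise a TypeError on them otherwise
users.mean(numeric_only=True)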

email_verified     0.818985
referral_count     3.551876
balance           52.051545
dtype: float64
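The **`email_verified`** value may look odd for a `bool` column. When averaging, `True` counts as 1 and `False` as 0, so the mean is the fraction of rows that are `True`. That means roughly 82% of our users have verified their email. A tiny sketch with made-up values:

```python
import pandas as pd

# Made-up boolean column: three of four values are True
verified = pd.Series([True, True, False, True])

print(verified.mean())  # fraction that are True: 0.75
print(verified.sum())   # how many are True: 3
```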

In [11]:
# Standard deviation (numeric_only=True skips the text columns)
users.std(numeric_only=True)

email_verified     0.385457
referral_count     2.352757
balance           29.378042
dtype: float64

In [12]:
# The minimum of each column
users.min()

first_name                  Aaron
email             aaron@gmail.com
email_verified              False
signup_date            2018-01-01
referral_count                  0
balance                      0.81
dtype: object

In [13]:
# The maximum of each column
users.max()

first_name                    Zachary
email             zachary@hotmail.com
email_verified                   True
signup_date                2018-09-13
referral_count                      7
balance                         99.98
dtype: object

### Rearrange your data
You can create a new `DataFrame` that is sorted by using the [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method.

Let's sort the DataFrame so that the user with the highest **`balance`** is at the top. By default, ascending order is assumed. You can change that by setting the `ascending` keyword argument to `False`.

In [14]:
users.sort_values(by='balance', ascending=False).head()

| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| molly | Molly | NaN | molly@gmail.com | True | 2018-07-24 | 6 | 99.98 |
| gwarren | Glenn | Warren | glenn@hotmail.com | True | 2018-02-03 | 2 | 99.79 |
| andre.andersonandre5554 | Andre | Anderson | andre.andersonandre1731@yahoo.com | True | 2018-01-24 | 1 | 99.66 |
| kevin | Kevin | NaN | kevin@gmail.com | True | 2018-02-17 | 7 | 99.28 |
| jesse | Jesse | Rodriguez | jesse.rodriguezjesse7790@yahoo.com | True | 2018-07-14 | 4 | 99.04 |


You'll notice that the `sort_values` call actually created a new `DataFrame` and left `users` untouched. If you want to permanently change the sort from the default (sorted by index), you can set the `inplace` keyword argument to `True`.

In [15]:
# Sort first by last_name and then first_name. By default, np.nan values show up at the end
users.sort_values(by=['last_name', 'first_name'], inplace=True)
# Sort order is now changed
users.head()

| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| kenneth.adamskenneth6481 | Kenneth | Adams | adams4272@yahoo.com | True | 2018-07-20 | 0 | 77.97 |
| aguilar | Caitlin | Aguilar | aguilar9306@hotmail.com | False | 2018-02-22 | 4 | 41.58 |
| christina.aguilarchristina5698 | Christina | Aguilar | christina.aguilarchristina3925@yahoo.com | False | 2018-04-29 | 0 | 94.11 |
| daguilar | Daniel | Aguilar | daguilar@hotmail.com | True | 2018-01-08 | 3 | 76.78 |
| jaguirre | John | Aguirre | jaguirre@hotmail.com | True | 2018-03-10 | 0 | 73.86 |
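Here is a small sketch of the two behaviors just described, on a made-up frame: a plain `sort_values` call leaves the original order alone, and missing values land at the end unless you pass `na_position="first"`.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing last name
df = pd.DataFrame(
    {"last_name": ["Warren", np.nan, "Adams"], "balance": [99.79, 99.98, 77.97]},
    index=["gwarren", "molly", "kadams"],
)

# Returns a new, sorted DataFrame; df keeps its original order
sorted_copy = df.sort_values(by="last_name")
print(list(df.index))
print(list(sorted_copy.index))

# Put the missing values at the top instead of the bottom
nan_first = df.sort_values(by="last_name", na_position="first")
print(list(nan_first.index))
```

Reassigning the result, as in `df = df.sort_values(...)`, is an alternative to `inplace=True` that some people find easier to follow.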


And if you want to sort by the index, like it was originally, you can use the `sort_index` method.

In [16]:
users.sort_index(inplace=True)
users.head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aalexander,Adam,Alexander,alexander@gmail.com,True,2018-08-22,0,31.68
aaron.grimesaaron3543,Aaron,Grimes,aaron@gmail.com,True,2018-06-18,7,53.3
aaron.vargasaaron772,Aaron,Vargas,aaron@hotmail.com,False,2018-04-26,7,95.8
adam,Adam,Roberts,adam@gmail.com,True,2018-07-13,5,35.07
agregory,Anthony,Gregory,gregory4121@yahoo.com,True,2018-01-16,2,97.0
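Like `sort_values`, `sort_index` returns a new `DataFrame` unless you ask for `inplace=True`, and it accepts the same `ascending` keyword argument. A short sketch with made-up usernames:

```python
import pandas as pd

# Made-up frame whose index is out of alphabetical order
df = pd.DataFrame(
    {"balance": [35.07, 31.68, 97.0]},
    index=["adam", "aalexander", "agregory"],
)

by_label = df.sort_index()
by_label_desc = df.sort_index(ascending=False)

print(list(by_label.index))
print(list(by_label_desc.index))
```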
