[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/treehouse-projects/python-introducing-pandas/master?filepath=s2n1-exploration-methods.ipynb)

# Exploration Methods
When you get a CSV file from an external source, you'll want to get a feel for the data.

Let's import some data and then learn how to explore it! 

In [1]:
# Setup
import os
import pandas as pd

from utils import render

# We use os.path.join because Windows uses a backslash (\) to separate directories,
#  while other operating systems use a forward slash (/)
users_file_name = os.path.join('data', 'users.csv')
users_file_name

'data/users.csv'
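As an aside, the standard library's `pathlib` module solves the same separator problem. The snippet below is only a sketch of that alternative, not something the rest of this notebook depends on. The `Pure*` classes are used just to show what each platform's style would look like.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Path uses the separator of whatever operating system runs the code
users_path = Path('data') / 'users.csv'

# The "pure" flavors show each platform's style without touching the file system
posix_style = str(PurePosixPath('data') / 'users.csv')
windows_style = str(PureWindowsPath('data') / 'users.csv')

print(posix_style)    # data/users.csv
print(windows_style)  # data\users.csv
```

`pd.read_csv` accepts a `Path` object as well as a string, so either style works for the import later on.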

## CSV File Exploration
If you want to take a peek at your CSV file, you could open it in an editor. 

Let's just use some standard Python to see the first couple lines of the file.

In [2]:
# Open the file and print out the first 5 lines
with open(users_file_name) as lines:
    for _ in range(5):
        # The `file` object is an iterator, so just get the next line 
        render(next(lines))

,first_name,last_name,email,email_verified,signup_date,referral_count,balance


aadams,Andrew,Adams,andrew.adamsandrew3882@yahoo.com,True,2018-02-10,4,60.2


abigail,Abigail,,abigail@gmail.com,True,2018-07-25,6,35.23


acosta5081,Bonnie,Acosta,bonnie@hotmail.com,False,2018-02-23,3,84.0


adrian,Adrian,Sullivan,asullivan@yahoo.com,False,2018-02-15,3,70.26


Notice how the first line is a header row. It has column names in it. By default, `pandas.read_csv` assumes that the first row is your header row.

Also note how the first column of that header row is empty. The values below it in that first column appear to be usernames, and they are what we want for the index.

We can use the `index_col` parameter of the `pandas.read_csv` function.

In [3]:
# Create a new `DataFrame` and set the index to the first column
users = pd.read_csv(users_file_name, index_col=0)
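To see what `index_col=0` changes, here is a small sketch that reads the same CSV text both ways. The three-line `csv_text` is a made-up sample in the shape of `users.csv`, and `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical sample shaped like users.csv: note the blank first header cell
csv_text = (
    ",first_name,balance\n"
    "aadams,Andrew,60.2\n"
    "abigail,Abigail,35.23\n"
)

# Without index_col, the blank header becomes a column named "Unnamed: 0"
# and pandas adds its own 0..n-1 integer index
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))  # ['Unnamed: 0', 'first_name', 'balance']
print(list(indexed.index))  # ['aadams', 'abigail']
```

With the usernames as the index, a row can be looked up by label, for example `indexed.loc['aadams']`.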

## Explore your imported DataFrame

A quick way to check that your CSV file was read correctly is to use the [`DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method. It returns the first **n** rows, where **n** defaults to 5.

In [4]:
users.head()

| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| aadams | Andrew | Adams | andrew.adamsandrew3882@yahoo.com | True | 2018-02-10 | 4 | 60.2 |
| abigail | Abigail | NaN | abigail@gmail.com | True | 2018-07-25 | 6 | 35.23 |
| acosta5081 | Bonnie | Acosta | bonnie@hotmail.com | False | 2018-02-23 | 3 | 84.0 |
| adrian | Adrian | Sullivan | asullivan@yahoo.com | False | 2018-02-15 | 3 | 70.26 |
| aguirre7290 | James | Aguirre | james@yahoo.com | True | 2018-07-19 | 1 | 12.25 |


Nice! We got it.  So let's see how many rows we have.  There are a couple of ways.

In [5]:
# Pythonic approach still works
len(users)

463

*Side note*: This length call is quick. Under the covers this `DataFrame.__len__` call is actually performing a `len(df.index)`, counting the rows by using the index. You might see older code that uses the style of `len(df.index)` to get a count of rows. As of pandas version 0.11, `len(df)` is the same as `len(df.index)`.  

The `DataFrame.shape` property works just like `np.ndarray.shape` does. It is a tuple holding the length of each axis of your data frame: rows first, then columns.

In [6]:
users.shape

(463, 7)
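The relationship between `len` and `shape` can be checked on a tiny frame. The `sample` data below is invented for illustration and is not the real `users` data.

```python
import pandas as pd

# Hypothetical three-row sample, indexed by username like the users frame
sample = pd.DataFrame(
    {'referral_count': [4, 6, 3], 'balance': [60.2, 35.23, 84.0]},
    index=['aadams', 'abigail', 'acosta5081'],
)

# shape is a (rows, columns) tuple, so it unpacks cleanly
n_rows, n_cols = sample.shape

print(len(sample), len(sample.index))  # 3 3
print(sample.shape)                    # (3, 2)
```

Note that the index is not counted as a column, which is why `users.shape` reports 7 columns rather than 8.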

### Explore from a bird's eye view

The `DataFrame.count` method counts the non-missing values in each column. Most of our columns have a value in every row, but **last_name** has some `np.nan` values.

The `count` method is missing-data aware: it skips `np.nan` values rather than counting them.

In [7]:
users.count()

first_name        463
last_name         421
email             463
email_verified    463
signup_date       463
referral_count    463
balance           463
dtype: int64
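If you would rather see the number of missing values directly, one option is to combine `isna` with `sum`. This is a minimal sketch on an invented `sample` frame, not output from the `users` data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing last name
sample = pd.DataFrame({
    'first_name': ['Andrew', 'Abigail', 'Bonnie'],
    'last_name': ['Adams', np.nan, 'Acosta'],
})

present = sample.count()       # non-missing values per column
missing = sample.isna().sum()  # missing values per column

print(present['last_name'], missing['last_name'])  # 2 1
```

For every column, the two counts add up to the number of rows.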

Remember that a `DataFrame` can contain multiple data types, or [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes).

You can use the `DataFrame.dtypes` attribute to see the `dtype` of each column.

In [8]:
users.dtypes

first_name         object
last_name          object
email              object
email_verified       bool
signup_date        object
referral_count      int64
balance           float64
dtype: object

As you can see, most of the data types of these columns were inferred, or assumed, correctly. See how **`email_verified`** is automatically a `bool`, **`referral_count`** is an integer, and **`balance`** is a float. This happened when we used `pd.read_csv`.

One thing to note, though, is that the **`signup_date`** field is an `object` and not a `datetime`. You can convert it during or after import if you need to, and we'll do some of that later in this course.
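As a preview, here is a rough sketch of both conversion routes, using a made-up two-row CSV rather than the real file. It uses `parse_dates` during import and `pd.to_datetime` afterwards. The course may take a different approach later.

```python
import io

import pandas as pd

# Hypothetical sample in the shape of users.csv
csv_text = (
    ",signup_date,balance\n"
    "aadams,2018-02-10,60.2\n"
    "abigail,2018-07-25,35.23\n"
)

# Option 1: convert during import by naming the date columns
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=['signup_date'])

# Option 2: import as text first, then convert the column afterwards
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)
raw['signup_date'] = pd.to_datetime(raw['signup_date'])

# Both routes end up with a datetime column
print(parsed.dtypes['signup_date'])
print(raw.dtypes['signup_date'])
```

Once the column is a `datetime`, date-specific tools such as the `.dt` accessor become available, for example `parsed['signup_date'].dt.month`.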

#### Describe your data
The `DataFrame.describe` method is a great way to get a feel for all of the numeric data in your data frame. You'll see lots of different aggregations.

In [9]:
users.describe()

| | referral_count | balance |
|---|---|---|
| count | 463.0 | 463.0 |
| mean | 3.544276 | 49.260756 |
| std | 2.272955 | 29.620366 |
| min | 0.0 | 0.03 |
| 25% | 2.0 | 23.15 |
| 50% | 3.0 | 50.87 |
| 75% | 5.0 | 74.01 |
| max | 7.0 | 99.23 |


Most of these aggregations are also available as individual methods.

In [10]:
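# The mean or average
# (numeric_only=True skips the text columns, which recent pandas versions no longer drop automatically)
users.mean(numeric_only=True)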

email_verified     0.827214
referral_count     3.544276
balance           49.260756
dtype: float64

In [11]:
# Standard deviation (again limited to the numeric and boolean columns)
users.std(numeric_only=True)

email_verified     0.378471
referral_count     2.272955
balance           29.620366
dtype: float64

In [12]:
# The minimum of each column
users.min()

first_name                    Aaron
email             abigail@gmail.com
email_verified                False
signup_date              2018-01-02
referral_count                    0
balance                        0.03
dtype: object

In [13]:
# The maximum of each column
users.max()

first_name                          Yesenia
email             yang899@james-nichols.com
email_verified                         True
signup_date                      2018-09-11
referral_count                            7
balance                               99.23
dtype: object
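The aggregations above can also be requested together. The sketch below uses an invented `sample` frame. It shows `numeric_only=True` leaving out a text column, and `DataFrame.agg` computing several aggregations in one call.

```python
import pandas as pd

# Hypothetical sample mixing text, boolean, integer, and float columns
sample = pd.DataFrame({
    'first_name': ['Andrew', 'Abigail', 'Bonnie'],
    'email_verified': [True, True, False],
    'referral_count': [4, 6, 3],
    'balance': [60.0, 35.0, 85.0],
})

# Text columns are left out. Booleans count as numeric (True is 1, False is 0)
means = sample.mean(numeric_only=True)

# agg accepts a list of aggregation names and returns one row per aggregation
summary = sample[['referral_count', 'balance']].agg(['min', 'max', 'mean'])

print(means)
print(summary)
```

The mean of a boolean column is the fraction of `True` values, which is why **`email_verified`** shows up in the `mean` output with a value between 0 and 1.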

### Rearrange your data
You can create a new `DataFrame` that is sorted by using the [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method.

Let's sort the DataFrame so that the user with the highest **`balance`** is at the top. By default, ascending order is assumed. You can change that by setting the `ascending` keyword argument to `False`.

In [14]:
users.sort_values(by='balance', ascending=False).head()

| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| hannah.murphyhannah706 | Hannah | Murphy | hannah.murphyhannah6472@yahoo.com | False | 2018-06-01 | 7 | 99.23 |
| anthony.riceanthony8736 | Anthony | Rice | anthony.riceanthony5517@yahoo.com | True | 2018-01-04 | 7 | 99.21 |
| shea | Miguel | Shea | miguel@hotmail.com | True | 2018-07-28 | 7 | 99.17 |
| frank | Kristi | Frank | kristi.frankkristi5878@yahoo.com | True | 2018-01-03 | 5 | 98.9 |
| kmcdowell | Kayla | Mcdowell | kmcdowell@yahoo.com | True | 2018-08-01 | 4 | 98.4 |


You'll notice that the `sort_values` call actually created a new `DataFrame` and left `users` untouched. If you want to permanently change the sort order from the default (sorted by index), you can pass `True` as the argument to the `inplace` keyword parameter.

In [15]:
# Sort first by last_name and then by first_name. By default, np.nan values show up at the end
users.sort_values(by=['last_name', 'first_name'], inplace=True)
# Sort order is now changed
users.head()

| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| acosta5081 | Bonnie | Acosta | bonnie@hotmail.com | False | 2018-02-23 | 3 | 84.0 |
| aadams | Andrew | Adams | andrew.adamsandrew3882@yahoo.com | True | 2018-02-10 | 4 | 60.2 |
| tammy | Tammy | Adams | adams@gmail.com | True | 2018-04-08 | 3 | 22.54 |
| aguirre7290 | James | Aguirre | james@yahoo.com | True | 2018-07-19 | 1 | 12.25 |
| alexander | Alec | Alexander | alec.alexanderalec1250@hotmail.com | True | 2018-03-24 | 6 | 55.49 |
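To make the "new `DataFrame` versus in place" distinction concrete, here is a small sketch on an invented `sample` frame. It reassigns the sorted result to the same name, which is a common alternative to `inplace=True`, and it spells out the `na_position` default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two users share a last name and one has none
sample = pd.DataFrame(
    {
        'first_name': ['Tammy', 'Abigail', 'Andrew', 'Bonnie'],
        'last_name': ['Adams', np.nan, 'Adams', 'Acosta'],
        'balance': [22.54, 35.23, 60.2, 84.0],
    },
    index=['tammy', 'abigail', 'aadams', 'acosta5081'],
)
original_order = list(sample.index)

# sort_values returns a new, sorted DataFrame...
by_balance = sample.sort_values(by='balance', ascending=False)
# ...and leaves the original row order alone
unchanged = list(sample.index) == original_order

# Reassigning keeps the new order. na_position='last' is the default, shown here for clarity
sample = sample.sort_values(by=['last_name', 'first_name'], na_position='last')

print(list(by_balance.index))  # ['acosta5081', 'aadams', 'abigail', 'tammy']
print(list(sample.index))      # ['acosta5081', 'aadams', 'tammy', 'abigail']
```

Ties on **`last_name`** (the two Adams rows) are broken by **`first_name`**, and the row with the missing last name lands at the end.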
