# Pandas: Tabular Data in Python

## Objectives

* Create `Series` and `DataFrame`s from Python data types.
* Create `DataFrame`s from on-disk data.
* Index and slice `pandas` objects.
* Aggregate data in `DataFrame`s.
* Join multiple `DataFrame`s.

## What is Pandas?

A Python library providing data structures and data analysis tools for tabular data of many types.

## Benefits

* Efficient storage and processing of data.
* Includes many built-in functions for data transformation, aggregation, and plotting.
* Great for exploratory work.

## Not so greats

* Does not scale terribly well to large datasets.

## Documentation:

* http://pandas.pydata.org/pandas-docs/stable/index.html

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')



## Numpy: A Quick Primer

`pandas` is built out of data types from `numpy`, a lower-level library we will be studying soon.

For now, it is sufficient to know that `numpy`'s main feature is a very efficient data structure, the `array`.

In [2]:
x = np.array([0, 1, 2, 3, 4, 5])
x

array([0, 1, 2, 3, 4, 5])

Arrays can be processed very efficiently.

In [3]:
x.sum()  # <-- As efficient as possible way to sum these numbers in python.

15

Arrays can be multi-dimensional.  A **two-dimensional array** is called a **matrix**.

In [4]:
M = np.array([
    [0, 1, 2],
    [1, 2, 3],
    [2, 3, 4]
])

M

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

In [5]:
print(x.shape)
print(M.shape)

(6,)
(3, 3)


### Drawbacks of Numpy as a Data Analysis Tool

#### Numpy arrays can only store homogeneous data

In [6]:
x = np.array([
    [2.0, 3.4, "Jack"],
    [1.0, 0.4, "Matt"],
    [5.0, 9.4, "Miles"]
])

That seemed to work...

In [7]:
x.dtype

dtype('<U32')

What?

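Numpy has chosen to store every element of our array as a **string**: `<U32` means Unicode text of up to 32 characters.  An array has a single dtype, so the floats were converted to match the names.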

In [8]:
x[0, 0]  # <- It is a string! It is a string!

'2.0'

Arithmetic operations that should work do not.  Here is an attempt at a column sum!

In [9]:
# Column sum: axis=0 sums down the rows, giving one total per column.
x.sum(axis=0)

TypeError: cannot perform reduce with flexible type
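
One way around this, sketched below under the assumption that we already know which columns are numeric, is to slice those columns out and convert them back to floats. The names `mixed` and `numeric` are illustrative only.

```python
import numpy as np

# Mixed floats and strings are coerced to a single string dtype.
mixed = np.array([
    [2.0, 3.4, "Jack"],
    [1.0, 0.4, "Matt"],
    [5.0, 9.4, "Miles"]
])

# Slice out the first two columns and convert them back to floats.
numeric = mixed[:, :2].astype(float)

# With a numeric dtype, the column sum (axis=0) works.
column_sums = numeric.sum(axis=0)
print(column_sums)
```

This works, but we have to carry the names around separately, which is exactly the bookkeeping pandas takes over.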

#### Numpy Arrays only Accept Integer Indexes

You cannot assign column or row names to numpy arrays, so you must remember which integer position holds which variable.  This makes programming a nightmare.

## Getting Data into Pandas

### Creating DataFrames from Python Objects

You can think of DataFrames as labeled (columns) and indexed (rows) matrices. 

We can create DataFrames from numpy arrays and lists of lists, supplying the column labels and row indices ourselves.

In [10]:
pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])

Unnamed: 0,a,b,c
foo,1,2,3
bar,4,5,6


Alternatively, you can think of DataFrames as a combination of column vectors, so we can create DataFrames from a dictionary of column vectors.  The keys are the column labels, and the values are the vectors.

In [13]:
frame_dict = {'column_1': [1, 2, 3], 'column_2': [10, 11, 12]}
pd.DataFrame(frame_dict, index=['3', '2', '1'])

Unnamed: 0,column_1,column_2
3,1,10
2,2,11
1,3,12


#### Exercise:

Create a data frame with two columns, `decreasing` and `increasing`, that hold the numbers 1-100 in decreasing and increasing order, respectively.

### Series

If DataFrames are labeled and indexed matrices, then Series are labeled and indexed vectors.

In [8]:
pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='Numbers')

a    1
b    2
c    3
Name: Numbers, dtype: int64

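If you create a Series from a dictionary, the keys become the index.  Note that the order of the result depends on the pandas version: older versions, like the one that produced the output below, sort the keys, while pandas 0.23 and later (on Python 3.6+) keep the dictionary's insertion order.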

In [9]:
pd.Series({'Star': 'Wars', 'Is': 'Boring', 'Please': 'Stop'})

Is        Boring
Please      Stop
Star        Wars
dtype: object
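
If you need a predictable order regardless of the pandas version, one option is to sort by the index explicitly. A minimal sketch (the name `words` is illustrative):

```python
import pandas as pd

words = pd.Series({'Star': 'Wars', 'Is': 'Boring', 'Please': 'Stop'})

# sort_index returns a new Series ordered by its index labels.
sorted_words = words.sort_index()
print(sorted_words)
```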

You can take out a Series from a DataFrame.

In [15]:
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])

print("Data Frame")
print(df)
print()
print("Column 'a'")
print(df['a'])

Data Frame
     a  b  c
foo  1  2  3
bar  4  5  6

Column 'a'
foo    1
bar    4
Name: a, dtype: int64


In [16]:
type(df['c'])

pandas.core.series.Series

... or put a Series into a DataFrame, as long as it has a matching index.

In [17]:
df['d'] = pd.Series([4, 5], index=['foo', 'bar'])
df

Unnamed: 0,a,b,c,d
foo,1,2,3,4
bar,4,5,6,5


The elements in the assignment above are matched **by index**, which is a common pattern in Pandas.

In [18]:
# Index flipped from previous example.
#                           v
df['d'] = pd.Series([4, 5], index=['bar', 'foo'])
df

Unnamed: 0,a,b,c,d
foo,1,2,3,5
bar,4,5,6,4


Where the indices do not match, missing values are filled into the unmatched rows, and Series entries with no matching row label (here `baz`) are dropped.

In [19]:
df['d'] = pd.Series([4, 5], index=['bar', 'baz'])
df

Unnamed: 0,a,b,c,d
foo,1,2,3,
bar,4,5,6,4.0
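
To find out which rows were left unmatched after an assignment like this, one option is `isna`. A small sketch that rebuilds the same frame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=['a', 'b', 'c'],
    index=['foo', 'bar'])

# 'baz' is not a row label of df, so its value is dropped;
# 'foo' has no matching entry, so it receives NaN.
df['d'] = pd.Series([4, 5], index=['bar', 'baz'])

# isna() marks the rows that were left unmatched.
unmatched = df['d'].isna()
print(unmatched)
```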


We can also put a list/vector into a DataFrame, and here there is no index, so the column is inserted in order.

In [20]:
df['e'] = [1, 2]
df

Unnamed: 0,a,b,c,d,e
foo,1,2,3,,1
bar,4,5,6,4.0,2


#### Exercise:

Create a data frame that has two columns `increasing` and `evens`.  The `increasing` column contains the numbers 1-100 in increasing order, and the `evens` column has the even numbers in increasing order at the same locations as in `increasing`, but with missing values in the other locations.

### Load data from csv

CSV (comma-separated values) is a file format used to store tabular data, with the fields on each line separated by a **delimiter**.

A delimiter is a **single character** that marks the boundaries between data elements in a file.  A comma is the traditional choice of delimiter, but a relatively poor one, since commas often appear inside the data itself.  Better choices are the pipe, `|`, or the tab, `\t`.

In [21]:
# Pipe separated file.
!head 'playgolf.csv'

Date|Outlook|Temperature|Humidity|Windy|Result
07-01-2014|sunny|85|85|false|Don't Play
07-02-2014|sunny|80|90|true|Don't Play
07-03-2014|overcast|83|78|false|Play
07-04-2014|rain|70|96|false|Play
07-05-2014|rain|68|80|false|Play
07-06-2014|rain|65|70|true|Don't Play
07-07-2014|overcast|64|65|true|Play
07-08-2014|sunny|72|95|false|Don't Play
07-09-2014|sunny|69|70|false|Play


In a bizarre twist of history, comma separated files are often separated by different characters than commas.  There is no consistent convention of using a different file extension, but some people use `.psv` or `.tsv`.

Pandas has a `read_csv` function that loads a delimited file into a `DataFrame`.  The resulting object **must fit in memory**.

In [22]:
golf_df = pd.read_csv('playgolf.csv', delimiter='|')

`DataFrame.head` can be used to view a portion of our new dataframe.

In [23]:
golf_df.head()

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
0,07-01-2014,sunny,85,85,False,Don't Play
1,07-02-2014,sunny,80,90,True,Don't Play
2,07-03-2014,overcast,83,78,False,Play
3,07-04-2014,rain,70,96,False,Play
4,07-05-2014,rain,68,80,False,Play
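
When a file is too large to load in one go, `read_csv` can also hand the data back in pieces through its `chunksize` parameter. The sketch below uses an in-memory string, `raw`, as an illustrative stand-in for a file on disk; it holds the first few lines of the golf data shown above.

```python
import io
import pandas as pd

# Stand-in for a pipe-delimited file on disk.
raw = """Date|Outlook|Temperature|Humidity|Windy|Result
07-01-2014|sunny|85|85|false|Don't Play
07-02-2014|sunny|80|90|true|Don't Play
07-03-2014|overcast|83|78|false|Play
07-04-2014|rain|70|96|false|Play
"""

# Read everything at once.
sample_df = pd.read_csv(io.StringIO(raw), sep='|')

# Read two rows at a time; each chunk is an ordinary DataFrame.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(raw), sep='|', chunksize=2):
    total_rows += len(chunk)

print(sample_df.dtypes)
print(total_rows)
```

Processing chunk by chunk only helps if each step needs a small part of the data at a time, such as a running count or sum.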


## Extracting information from DataFrames

### Basic Row and Column Indexing

As we have seen, individual columns may be extracted from a `DataFrame` with the usual square-bracket (`__getitem__`) indexing and the name of the column.

This is similar to how we index a dictionary.

In [24]:
golf_df['Temperature']

0     85
1     80
2     83
3     70
4     68
5     65
6     64
7     72
8     69
9     75
10    75
11    72
12    81
13    71
Name: Temperature, dtype: int64

We can extract individual values by taking the series out of the DataFrame, then treating it like a list.

In [27]:
golf_df['Temperature'][0]

85

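If you index a DataFrame with a slice, as you would a list, the slice operates on the rows.  (A single integer such as `golf_df[0]` is instead looked up as a column label, which raises a `KeyError` here.)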

In [29]:
short_df = golf_df[0:5]
short_df

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
0,07-01-2014,sunny,85,85,False,Don't Play
1,07-02-2014,sunny,80,90,True,Don't Play
2,07-03-2014,overcast,83,78,False,Play
3,07-04-2014,rain,70,96,False,Play
4,07-05-2014,rain,68,80,False,Play


### Boolean / Logical Indexing

Interestingly, we can also index into a dataframe using a list of **booleans** (i.e. `True` and `False` values), one per row.

In [30]:
# Takes rows 0, 2, and 4.
short_df[[True, False, True, False, True]]

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
0,07-01-2014,sunny,85,85,False,Don't Play
2,07-03-2014,overcast,83,78,False,Play
4,07-05-2014,rain,68,80,False,Play


We can also create a Boolean Series by applying a comparison to a Series.

In [31]:
# A series of booleans.
golf_df['Temperature'] > 70

0      True
1      True
2      True
3     False
4     False
5     False
6     False
7      True
8     False
9      True
10     True
11     True
12     True
13     True
Name: Temperature, dtype: bool

And then use the result to grab rows of the dataframe.

In [34]:
golf_df[golf_df['Temperature'] > 70][["Date", "Windy"]]

Unnamed: 0,Date,Windy
0,07-01-2014,False
1,07-02-2014,True
2,07-03-2014,False
7,07-08-2014,False
9,07-10-2014,False
10,07-11-2014,True
11,07-12-2014,True
12,07-13-2014,False
13,07-14-2014,True


This is essentially applying a logical condition to select rows from a `DataFrame`.  This is one of the most common patterns in Pandas.
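
Several conditions can be combined into one selection. The sketch below uses a small made-up frame, `weather` (not the golf data itself), to show the pattern.

```python
import pandas as pd

# A small illustrative frame, not the golf data itself.
weather = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain'],
    'Temperature': [85, 69, 83, 70],
})

# Combine conditions with & (and) or | (or); each condition needs
# its own parentheses because & binds tighter than comparisons.
hot_and_sunny = weather[(weather['Outlook'] == 'sunny')
                        & (weather['Temperature'] > 75)]
print(hot_and_sunny)
```

Python's `and` and `or` keywords do not work here, because they try to reduce a whole Series to a single truth value.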

#### Exercise

Select all of the rainy days in which the humidity is larger than 90 from this data frame.

### Other Indexers

There are a few other indexing objects in pandas:

  - `df.iloc` is **positionally based**.  This indexer accepts integers and integer slices, and essentially treats the data frame as if it was a simple matrix.
  - `df.loc` is **label based**.  This indexer works with row and column indices / labels.
  
There used to be another one, and you will still encounter it in older code:

  - `df.ix` is **mixed**: it works with row numbers (integers) and column labels (names).
  
**The `ix` indexer was deprecated in pandas 0.20 and removed in pandas 1.0, where using it raises an `AttributeError`.  Don't write code that uses `ix`!**

In [35]:
df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)

In [36]:
df.iloc[2:4, 0:2]

Unnamed: 0,some_booleans,some_integers
c,1,1
d,0,1


In [37]:
df.loc['b':'e', ['some_integers', 'some_booleans']]

Unnamed: 0,some_integers,some_booleans
b,0,0
c,1,1
d,1,0
e,2,1
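
The two outputs above differ in how they treat the end of a slice: `iloc` excludes the end position, as Python lists do, while `loc` includes the end label. A minimal sketch on a one-column frame:

```python
import pandas as pd

df = pd.DataFrame(
    {'some_integers': [0, 0, 1, 1, 2, 2]},
    index=['a', 'b', 'c', 'd', 'e', 'f'])

by_position = df.iloc[1:4]   # positions 1, 2, 3 (end excluded)
by_label = df.loc['b':'e']   # labels b, c, d, e (end included)

print(list(by_position.index))
print(list(by_label.index))
```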


**Deprecation Warning!!!!**

In [40]:
df.ix[2:4, ['some_integers', 'some_booleans']]

Unnamed: 0,some_integers,some_booleans
c,1,1
d,1,0


### Mixed Indexing

So what do we do if we want to get the rows by position, and get the columns by label?  I.e. if we have a use for **mixed indexing**.

In [42]:
# Mixed indexing with iloc: will not work.
df.iloc[2:4, ['some_integers', 'some_booleans']]

TypeError: cannot perform reduce with flexible type

Mixed indexing in modern pandas is more explicit and less magical.  You need to use the `df.index` and `df.columns` attributes to explicitly turn positions into labels.

In [46]:
df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)

#### Rows by position, Columns by name

In [49]:
df.index[2:4]

Index(['c', 'd'], dtype='object')

In [50]:
df.loc[df.index[2:4], ['some_integers', 'some_booleans']]

Unnamed: 0,some_integers,some_booleans
c,1,1
d,1,0


#### Rows by name, Columns by position

In [51]:
df.columns[[0, 2]]

Index(['some_booleans', 'some_strings'], dtype='object')

In [52]:
df.loc[['c', 'd'], df.columns[[0, 2]]]

Unnamed: 0,some_booleans,some_strings
c,1,z
d,0,x
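
The conversion can also go the other way: turn labels into positions and do everything through `iloc`. The sketch below uses `Index.get_indexer` for that; the names `wanted` and `col_positions` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)

wanted = ['some_integers', 'some_booleans']

# get_indexer maps each label to its integer position.
col_positions = df.columns.get_indexer(wanted)

# Rows by position, columns (originally) by name, all through iloc.
via_iloc = df.iloc[2:4, col_positions]

# The same selection done through loc, for comparison.
via_loc = df.loc[df.index[2:4], wanted]
print(via_iloc)
```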


### Transforming data

Arithmetic operations apply to `Series` element by element.

In [53]:
# Yes, this makes no sense.
golf_df["TempHumid"] = golf_df['Temperature'] + golf_df['Humidity']

In [54]:
golf_df.head()

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid
0,07-01-2014,sunny,85,85,False,Don't Play,170
1,07-02-2014,sunny,80,90,True,Don't Play,170
2,07-03-2014,overcast,83,78,False,Play,161
3,07-04-2014,rain,70,96,False,Play,166
4,07-05-2014,rain,68,80,False,Play,148


In [58]:
# More Usefully

# Heat index formula taken from wikipedia: 
#    https://en.wikipedia.org/wiki/Heat_index
temp = golf_df['Temperature']
humid = golf_df['Humidity']
golf_df['HeatIndex'] = (-42.37 + 2.05*temp + 10.14*humid
                        - 0.225*temp*humid
                        - 6.84e-3*temp**2 
                        - 5.482e-2*humid**2
                        + 1.23e-3*temp**2*humid
                        + 8.53e-4*temp*humid**2
                        - 1.99e-6*temp**2*humid**2
)
golf_df[['Temperature', 'Humidity', 'HeatIndex']]

Unnamed: 0,Temperature,Humidity,HeatIndex
0,85,85,98.004631
1,80,90,84.4744
2,83,78,89.669911
3,70,96,62.847024
4,68,80,69.089776
5,65,70,73.668025
6,64,65,75.987116
7,72,95,66.247396
8,69,70,72.843649
9,75,80,74.557


We can create a new Series by applying a function to each element of an existing Series.

In [59]:
# Create an indicator variable out of a column.
golf_df['Result'].apply(lambda x: 1 if x == 'Play' else 0)

0     0
1     0
2     1
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    1
11    1
12    1
13    0
Name: Result, dtype: int64

Though the previous result is better executed as

In [60]:
(golf_df['Result'] == 'Play').astype(int)

0     0
1     0
2     1
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    1
11    1
12    1
13    0
Name: Result, dtype: int64

We can check that these give the same things

In [61]:
golf_df['Result'].apply(lambda x: 1 if x == 'Play' else 0) == (golf_df['Result'] == 'Play').astype(int)

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
Name: Result, dtype: bool

Or, to get a single answer

In [63]:
np.all(
    golf_df['Result'].apply(lambda x: 1 if x == 'Play' else 0) 
    == (golf_df['Result'] == 'Play').astype(int))

True

We can also apply a function to each row of the DataFrame by passing `axis=1`; the function then receives one row at a time and can look up columns by name.  This is rarely the best choice, because the equivalent arithmetic on whole columns is more efficient.

In [64]:
golf_df.apply(lambda x: x['Temperature'] + x['Humidity'], axis=1)

0     170
1     170
2     161
3     166
4     148
5     135
6     129
7     167
8     139
9     155
10    145
11    162
12    156
13    151
dtype: int64

In general, `.apply` is useful for mapping complex functions across your data.  Be wary of using it in simple cases like this one; there is probably a better way.
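
For a plain value-to-value recoding, `Series.map` with a dictionary is one such alternative. A minimal sketch, using a short stand-in Series rather than `golf_df['Result']` itself:

```python
import pandas as pd

# Illustrative stand-in for golf_df['Result'].
result = pd.Series(['Play', "Don't Play", 'Play'], name='Result')

# map looks each value up in the dictionary.
indicator = result.map({'Play': 1, "Don't Play": 0})
print(indicator)
```

Any value missing from the dictionary becomes `NaN`, which makes unexpected categories easy to spot.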

### Aggregating data

We can do something like the group by statement in SQL.

In [65]:
groups = golf_df.groupby('Outlook')

In [66]:
type(groups)

pandas.core.groupby.DataFrameGroupBy

Iterating over the result of `groupby` yields one tuple per Outlook value: the group name, followed by a DataFrame holding that group's rows.

In [68]:
for group in groups:
    print('Group Name: ', group[0])
    print('Group Data:\n', group[1])
    print('\n')

Group Name:  overcast
Group Data:
           Date   Outlook  Temperature  Humidity  Windy Result  TempHumid  \
2   07-03-2014  overcast           83        78  False   Play        161   
6   07-07-2014  overcast           64        65   True   Play        129   
11  07-12-2014  overcast           72        90   True   Play        162   
12  07-13-2014  overcast           81        75  False   Play        156   

    HeatIndex  
2   89.669911  
6   75.987116  
11  68.106944  
12  84.523441  


Group Name:  rain
Group Data:
           Date Outlook  Temperature  Humidity  Windy      Result  TempHumid  \
3   07-04-2014    rain           70        96  False        Play        166   
4   07-05-2014    rain           68        80  False        Play        148   
5   07-06-2014    rain           65        70   True  Don't Play        135   
9   07-10-2014    rain           75        80  False        Play        155   
13  07-14-2014    rain           71        80   True  Don't Play        151 

We can then apply some sort of aggregation to each subset of the data.

In [69]:
groups.count()

Unnamed: 0_level_0,Date,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex
Outlook,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
overcast,4,4,4,4,4,4,4
rain,5,5,5,5,5,5,5
sunny,5,5,5,5,5,5,5


In [70]:
groups.sum()

Unnamed: 0_level_0,Temperature,Humidity,Windy,TempHumid,HeatIndex
Outlook,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
overcast,300,308,2.0,608,318.287412
rain,349,406,2.0,755,350.648809
sunny,381,410,2.0,791,397.347701


In [71]:
groups.mean()

Unnamed: 0_level_0,Temperature,Humidity,Windy,TempHumid,HeatIndex
Outlook,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
overcast,75.0,77.0,0.5,152.0,79.571853
rain,69.8,81.2,0.4,151.0,70.129762
sunny,76.2,82.0,0.4,158.2,79.46954


You can apply your own custom aggregation functions with `aggregate`.

In [72]:
# Get the minimum Temperature within each group.
# Note: This is an awful way to accomplish this, it's just for illustration.
groups.aggregate(lambda df: sorted(df['Temperature'])[0])

Unnamed: 0_level_0,Date,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex
Outlook,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
overcast,64,64,64,64,64,64,64
rain,65,65,65,65,65,65,65
sunny,69,69,69,69,69,69,69


You should investigate a better way to accomplish the task in the previous example.
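
As a starting point, here is one possible approach: select the column first, then use a built-in aggregation. The frame `golf` below is a two-column stand-in built from the Outlook and Temperature values shown earlier.

```python
import pandas as pd

# Two columns of the golf data, retyped so the sketch is self-contained.
golf = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain',
                'overcast', 'sunny', 'sunny', 'rain', 'sunny',
                'overcast', 'overcast', 'rain'],
    'Temperature': [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
})

# Minimum temperature within each group.
min_temps = golf.groupby('Outlook')['Temperature'].min()

# Several built-in aggregations at once.
summary = golf.groupby('Outlook')['Temperature'].agg(['min', 'max', 'mean'])

print(min_temps)
print(summary)
```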

### Joining DataFrames

We can join DataFrames much as we join tables in SQL.  In fact, left, right, outer, and inner joins work the same way here.

Let's create a small DataFrame to join with first.

In [79]:
mood_df = pd.DataFrame([['overcast', 'sad'], ['rainy', 'sad'], ['sunny', 'happy']],
                       columns=['Outlook', 'Mood'])

mood_df

Unnamed: 0,Outlook,Mood
0,overcast,sad
1,rainy,sad
2,sunny,happy


We can do joins using the merge command.

In [80]:
golf_df.merge(mood_df, how='inner', on='Outlook')

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex,Mood
0,07-01-2014,sunny,85,85,False,Don't Play,170,98.004631,happy
1,07-02-2014,sunny,80,90,True,Don't Play,170,84.4744,happy
2,07-08-2014,sunny,72,95,False,Don't Play,167,66.247396,happy
3,07-09-2014,sunny,69,70,False,Play,139,72.843649,happy
4,07-11-2014,sunny,75,70,True,Play,145,75.777625,happy
5,07-03-2014,overcast,83,78,False,Play,161,89.669911,sad
6,07-07-2014,overcast,64,65,True,Play,129,75.987116,sad
7,07-12-2014,overcast,72,90,True,Play,162,68.106944,sad
8,07-13-2014,overcast,81,75,False,Play,156,84.523441,sad


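Note that the `rain` rows are missing from the result: `mood_df` spells its key `rainy`, which matches nothing in `golf_df`, and an inner join keeps only keys present in both frames.  There are, of course, other options besides `inner` (`left`, `right`, and `outer`), which you can find in the documentation.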
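
A left join with `indicator=True` is one way to spot keys that fail to match, such as `rain` versus `rainy`. The frame `days` below is a small illustrative stand-in for `golf_df`.

```python
import pandas as pd

# Stand-in for golf_df, keeping only the join key.
days = pd.DataFrame({'Outlook': ['sunny', 'rain', 'overcast']})
mood_df = pd.DataFrame(
    [['overcast', 'sad'], ['rainy', 'sad'], ['sunny', 'happy']],
    columns=['Outlook', 'Mood'])

# how='left' keeps every row of the left frame; indicator=True adds a
# _merge column saying where each row's key was found.
merged = days.merge(mood_df, how='left', on='Outlook', indicator=True)
print(merged)
```

Rows marked `left_only` in the `_merge` column had no partner in `mood_df`, so their `Mood` is missing.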

## Concatenating dataframes

This is the equivalent of Unions in SQL, but a little more flexible.

In [81]:
df1 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col3': range(5)})
df2 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col4': range(5)},
    index=range(5, 10))

In [82]:
df1

Unnamed: 0,Col1,Col2,Col3
0,0,0,0
1,1,1,1
2,2,2,2
3,3,3,3
4,4,4,4


In [83]:
df2

Unnamed: 0,Col1,Col2,Col4
5,0,0,0
6,1,1,1
7,2,2,2
8,3,3,3
9,4,4,4


#### Vertically

This is like a Union All.

In [84]:
pd.concat([df1, df2], axis=0)

Unnamed: 0,Col1,Col2,Col3,Col4
0,0,0,0.0,
1,1,1,1.0,
2,2,2,2.0,
3,3,3,3.0,
4,4,4,4.0,
5,0,0,,0.0
6,1,1,,1.0
7,2,2,,2.0
8,3,3,,3.0
9,4,4,,4.0
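
If the missing values are unwanted, `concat` can keep only the columns that every frame shares. A sketch with the same two frames, rebuilt here so it runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col3': range(5)})
df2 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col4': range(5)},
    index=range(5, 10))

# join='inner' keeps only the columns present in every frame;
# ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([df1, df2], join='inner', ignore_index=True)
print(stacked)
```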


#### Horizontally

This is essentially a simple join on indices.  While `concat` is capable of doing joins, it is far less flexible than `merge`.

In [85]:
pd.concat([df1, df2], join='outer', axis=1)

Unnamed: 0,Col1,Col2,Col3,Col1,Col2,Col4
0,0.0,0.0,0.0,,,
1,1.0,1.0,1.0,,,
2,2.0,2.0,2.0,,,
3,3.0,3.0,3.0,,,
4,4.0,4.0,4.0,,,
5,,,,0.0,0.0,0.0
6,,,,1.0,1.0,1.0
7,,,,2.0,2.0,2.0
8,,,,3.0,3.0,3.0
9,,,,4.0,4.0,4.0


## Some Extra, Useful Stuff

### Various Summaries

The `info` method is useful for checking column types and quickly seeing whether any column contains missing values.

In [75]:
golf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 8 columns):
Date           14 non-null object
Outlook        14 non-null object
Temperature    14 non-null int64
Humidity       14 non-null int64
Windy          14 non-null bool
Result         14 non-null object
TempHumid      14 non-null int64
HeatIndex      14 non-null float64
dtypes: bool(1), float64(1), int64(3), object(3)
memory usage: 878.0+ bytes


The describe method will give you a quick sense of the quartiles and distribution.


In [86]:
golf_df.describe()

Unnamed: 0,Temperature,Humidity,TempHumid,HeatIndex
count,14.0,14.0,14.0,14.0
mean,73.571429,80.285714,153.857143,76.163137
std,6.571667,9.840486,13.242456,9.77144
min,64.0,65.0,129.0,62.847024
25%,69.25,71.25,145.75,69.439078
50%,72.0,80.0,155.5,74.112513
75%,78.75,88.75,165.0,82.352579
max,85.0,96.0,170.0,98.004631


### Frequency Tables

The `crosstab` function will allow us to quickly take a look at the frequency count between two columns.

In [87]:
pd.crosstab(golf_df['Outlook'], golf_df['Result'])

Result,Don't Play,Play
Outlook,Unnamed: 1_level_1,Unnamed: 2_level_1
overcast,0,4
rain,2,3
sunny,3,2
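
Counts can be turned into proportions with the `normalize` parameter of `crosstab`. The sketch below rebuilds the two relevant columns from the values shown earlier, under the illustrative name `golf`.

```python
import pandas as pd

# Outlook and Result columns of the golf data, retyped for a
# self-contained example.
golf = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain',
                'overcast', 'sunny', 'sunny', 'rain', 'sunny',
                'overcast', 'overcast', 'rain'],
    'Result': ["Don't Play", "Don't Play", 'Play', 'Play', 'Play',
               "Don't Play", 'Play', "Don't Play", 'Play', 'Play',
               'Play', 'Play', 'Play', "Don't Play"],
})

# normalize='index' divides each row by its total, so every row sums to 1.
shares = pd.crosstab(golf['Outlook'], golf['Result'], normalize='index')
print(shares)
```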


### DateTimes

We can turn strings of dates into datetime types by using Pandas' `to_datetime` function.

In [88]:
golf_df['DateTime'] = pd.to_datetime(golf_df['Date'])

In [89]:
golf_df.head()

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex,DateTime
0,07-01-2014,sunny,85,85,False,Don't Play,170,98.004631,2014-07-01
1,07-02-2014,sunny,80,90,True,Don't Play,170,84.4744,2014-07-02
2,07-03-2014,overcast,83,78,False,Play,161,89.669911,2014-07-03
3,07-04-2014,rain,70,96,False,Play,166,62.847024,2014-07-04
4,07-05-2014,rain,68,80,False,Play,148,69.089776,2014-07-05
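
Strings like `07-01-2014` are ambiguous between month-first and day-first. Passing an explicit `format` removes the guesswork; the sketch below assumes, as the data above suggests, that the month comes first.

```python
import pandas as pd

# A few date strings in the same layout as the Date column.
dates = pd.Series(['07-01-2014', '07-02-2014', '07-14-2014'])

# %m-%d-%Y means month-day-year, with a four-digit year.
parsed = pd.to_datetime(dates, format='%m-%d-%Y')

# The .dt accessor exposes datetime parts of each element.
print(parsed.dt.day.tolist())
```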


In [90]:
golf_df['DateTime'].describe()

count                      14
unique                     14
top       2014-07-01 00:00:00
freq                        1
first     2014-07-01 00:00:00
last      2014-07-14 00:00:00
Name: DateTime, dtype: object

### Creating a New Row Index

We can also set the index to be one or more existing columns.

In [91]:
date_df = golf_df.set_index('DateTime')
date_df

Unnamed: 0_level_0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex
DateTime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2014-07-01,07-01-2014,sunny,85,85,False,Don't Play,170,98.004631
2014-07-02,07-02-2014,sunny,80,90,True,Don't Play,170,84.4744
2014-07-03,07-03-2014,overcast,83,78,False,Play,161,89.669911
2014-07-04,07-04-2014,rain,70,96,False,Play,166,62.847024
2014-07-05,07-05-2014,rain,68,80,False,Play,148,69.089776
2014-07-06,07-06-2014,rain,65,70,True,Don't Play,135,73.668025
2014-07-07,07-07-2014,overcast,64,65,True,Play,129,75.987116
2014-07-08,07-08-2014,sunny,72,95,False,Don't Play,167,66.247396
2014-07-09,07-09-2014,sunny,69,70,False,Play,139,72.843649
2014-07-10,07-10-2014,rain,75,80,False,Play,155,74.557


In [92]:
date_df.index

DatetimeIndex(['2014-07-01', '2014-07-02', '2014-07-03', '2014-07-04',
               '2014-07-05', '2014-07-06', '2014-07-07', '2014-07-08',
               '2014-07-09', '2014-07-10', '2014-07-11', '2014-07-12',
               '2014-07-13', '2014-07-14'],
              dtype='datetime64[ns]', name='DateTime', freq=None)

If we have an index of datetime types, we can use `resample` to quickly look at time-based aggregations.

In [93]:
# Weekly means.
date_df.resample('W').mean()

Unnamed: 0_level_0,Temperature,Humidity,Windy,TempHumid,HeatIndex
DateTime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2014-07-06,75.166667,83.166667,0.333333,158.333333,79.625628
2014-07-13,72.571429,77.857143,0.428571,150.428571,74.006167
2014-07-20,71.0,80.0,1.0,151.0,70.486984


This will be especially useful when we work with time series.
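
To see how the weekly bins are formed, here is a sketch on the temperature values alone, held in an illustrative Series named `temps`. With `'W'`, each bin ends on a Sunday and is labeled by that date.

```python
import pandas as pd

# Daily temperatures from the golf data, indexed by date.
temps = pd.Series(
    [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    index=pd.date_range('2014-07-01', periods=14, freq='D'),
    name='Temperature')

# One mean per week; the first and last weeks are only partly covered.
weekly = temps.resample('W').mean()
print(weekly)
```

The last bin holds a single day, so its "mean" is just that day's value; it is worth checking bin sizes before reading much into such numbers.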

## Writing Data

We can write a DataFrame to a CSV file with `to_csv`.

In [None]:
golf_df.to_csv('new_playgolf.csv', index=False)
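
`to_csv` also accepts a `sep` argument, so we can write pipe-delimited files like the one we read earlier. The sketch below writes to an in-memory buffer instead of a file and reads the result back; `small_df` is an illustrative stand-in for `golf_df`.

```python
import io
import pandas as pd

small_df = pd.DataFrame(
    {'Outlook': ['sunny', 'rain'], 'Temperature': [85, 70]})

# Write pipe-delimited text into a buffer rather than onto disk.
buffer = io.StringIO()
small_df.to_csv(buffer, sep='|', index=False)

# Rewind the buffer and read the text back in.
buffer.seek(0)
restored = pd.read_csv(buffer, sep='|')
print(restored)
```

Without `index=False`, the row index would be written as an extra unnamed column and would reappear as data on the next read.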