# Pandas: Tabular Data in Python

## Objectives

* Create `Series` and `DataFrame` objects from Python data types. 
* Create `DataFrame` objects from files.
* Index and slice `pandas` objects.
* Aggregate data in `DataFrame`s.
* Join multiple `DataFrame` objects.

## What is Pandas?

Pandas is a Python library that provides data structures and data analysis tools for tabular data of many types. Think of a `DataFrame` like a table in SQL.

## Benefits

  * Efficient storage and processing of data.
  * Includes many built-in functions for data transformation, aggregations, and plotting.
  * Great for exploratory work.

## Not so greats

  * Does not scale terribly well to large datasets.

## Documentation:

The documentation for pandas is here:

  * http://pandas.pydata.org/pandas-docs/stable/index.html
  
Particularly important reads (eventually) are:

  * [Indexing and Selecting](https://pandas.pydata.org/pandas-docs/stable/indexing.html)
  * [Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-mi-slicers)
  * [Group-by](https://pandas.pydata.org/pandas-docs/stable/groupby.html)

## Standard Imports

In [5]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

## Numpy: A Quick Primer

`pandas` is built on data types from `numpy`, a lower-level library.

The basic object in `numpy` is an `array`.

In [6]:
x = np.array([0, 1, 2, 3, 4, 5])
x

array([0, 1, 2, 3, 4, 5])

Arrays can be processed very efficiently.

In [7]:
x.sum()  # <-- About as efficient a way to sum these numbers as you will find in Python.

15
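
As a small sketch of what "processed efficiently" means in practice: arithmetic and reductions apply to the whole array at once, with no explicit Python loop. The variable names below are illustrative only.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5])

# Elementwise arithmetic: each operation applies to every entry at once.
doubled = x * 2
squares = x ** 2

# Reductions collapse the array to a single number.
total = x.sum()
mean = x.mean()

print(doubled, squares, total, mean)
```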

Arrays can be multi-dimensional.  A **two-dimensional array** is called a **matrix**.

In [8]:
M = np.array([
    [0, 1, 2],
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7]
])

M

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [5, 6, 7]])

In [9]:
print(x.shape)
print(M.shape)

(6,)
(4, 3)


### Drawbacks of Numpy as a General Data Analysis Tool

#### Numpy arrays can only store homogeneous data

In [11]:
x = np.array([
    [2.0, 3.4, "Jack"],
    [1.0, 0.4, "Matt"],
    [5.0, 9.4, "Miles"]
])

That seemed to work...

In [12]:
x.dtype

dtype('<U32')

What?

Numpy has chosen to store our array as **unicode strings**.  **Even the numbers are now strings!**

In [13]:
x[0, 0]  # <- It is a string! It is a string!

'2.0'

Arithmetic operations that should work do not.  Here is an attempt at a column sum!

In [14]:
# Column sum: axis=0 adds down each column.
x.sum(axis=0)

TypeError: cannot perform reduce with flexible type

This happens because numpy arrays are **homogeneous**.  All the data in an array (even in different columns) must be of the same datatype!
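
A short sketch of this behaviour, using made-up values: numpy picks one dtype that can hold every element, while a `DataFrame` (introduced below) keeps a separate dtype per column, so the numeric column can still be summed.

```python
import numpy as np
import pandas as pd

ints = np.array([1, 2, 3])
mixed_numeric = np.array([1, 2.5])      # the integer is promoted to a float
mixed_text = np.array([1.0, "Jack"])    # everything is converted to a string

# dtype.kind is 'i' for integers, 'f' for floats, 'U' for unicode strings.
kinds = (ints.dtype.kind, mixed_numeric.dtype.kind, mixed_text.dtype.kind)
print(kinds)

# A DataFrame stores each column with its own dtype.
frame = pd.DataFrame({"score": [2.0, 1.0, 5.0], "name": ["Jack", "Matt", "Miles"]})
col_sum = frame["score"].sum()
print(frame.dtypes)
print(col_sum)
```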

#### Numpy Arrays only Accept Integer Indexes

You cannot assign column or row names to numpy arrays. This can make it harder to program.

## Getting Data into Pandas

### Creating DataFrames from Python Objects

You can think of DataFrames as labeled (columns) and indexed (rows) matrices. 

We can create DataFrames from numpy arrays and lists of lists by providing labels and indices. The `columns=` parameter specifies the names of the columns; the `index=` parameter specifies the names of the rows.

In [16]:
pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])

Unnamed: 0,a,b,c
foo,1,2,3
bar,4,5,6
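
The same call works with a numpy array in place of the list of lists. This is a minimal sketch that reuses the matrix `M` from the primer; the row labels are invented for illustration.

```python
import numpy as np
import pandas as pd

M = np.array([
    [0, 1, 2],
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7]
])

# Each row of the array becomes a row of the DataFrame.
frame = pd.DataFrame(M, columns=['a', 'b', 'c'], index=['w', 'x', 'y', 'z'])
print(frame)
```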


Alternatively, you can think of DataFrames as a combination of column vectors, so we can create DataFrames from a dictionary of column vectors.  The keys are the column labels, and the values are the vectors.

In [17]:
frame_dict = {'column_1': [1, 2, 3], 'column_2': [10, 11, 12]}
pd.DataFrame(frame_dict, index=['3', '2', '1'])

Unnamed: 0,column_1,column_2
3,1,10
2,2,11
1,3,12


#### Exercise:

Create a data frame with two columns, `increasing` and `decreasing`, that contain the numbers 1-100 in increasing and decreasing order, respectively.

### Series

If DataFrames are labeled and indexed matrices, then Series are labeled and indexed vectors.

In [18]:
s = pd.Series([1, '2', 3], index=['a', 'b', 'c'], name='Numbers')
s

a    1
b    2
c    3
Name: Numbers, dtype: object
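
The `dtype: object` above appears because the Series mixes integers with the string `'2'`. The sketch below shows label and positional access on the same Series, plus one possible way to get a numeric Series back (`pd.to_numeric`); whether that conversion is appropriate depends on your data.

```python
import pandas as pd

s = pd.Series([1, '2', 3], index=['a', 'b', 'c'], name='Numbers')

by_label = s['b']          # look up by index label; this is the string '2'
by_position = s.iloc[0]    # look up by position; this is the integer 1

# Convert the mixed values to a proper integer Series.
numeric = pd.to_numeric(s)
print(by_label, by_position)
print(numeric)
```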

If you create a Series using a dictionary, the keys are used as the index.  Note that in older versions of Python and pandas, the order of the elements might not match the order in the dictionary.

In [20]:
pd.Series({'Star': 'Wars', 'Is': 'Boring', 'Please': 'Stop'})

Star        Wars
Is        Boring
Please      Stop
dtype: object

You can extract a Series from a DataFrame.

In [21]:
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])

print("Data Frame")
print(df)
print()
print("Column 'a'")
print(df['a'])

Data Frame
     a  b  c
foo  1  2  3
bar  4  5  6

Column 'a'
foo    1
bar    4
Name: a, dtype: int64


In [22]:
type(df['c'])

pandas.core.series.Series

... or put a Series into a DataFrame, as long as the indices match.

In [23]:
df['d'] = pd.Series([4, 5], index=['foo', 'bar'])
df

Unnamed: 0,a,b,c,d
foo,1,2,3,4
bar,4,5,6,5


The elements in the assignment above are matched **by index**, which is a common pattern in Pandas.

In [24]:
# Index flipped from previous example.
#                           v
df['d'] = pd.Series([4, 5], index=['bar', 'foo'])
df

Unnamed: 0,a,b,c,d
foo,1,2,3,5
bar,4,5,6,4


Rows of the DataFrame whose index label has no match in the Series are filled with missing values; Series entries with no matching row (here `baz`) are dropped.

In [25]:
df['d'] = pd.Series([4, 5], index=['bar', 'baz'])
df

Unnamed: 0,a,b,c,d
foo,1,2,3,
bar,4,5,6,4.0
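
The same matching by index applies to arithmetic between two Series. A minimal sketch with made-up labels: values are paired by label, not by position, and labels present on only one side produce missing values.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# 'b' and 'c' appear on both sides; 'a' and 'd' have no partner.
total = left + right
print(total)
```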


We can also put a list/vector into a DataFrame, and here there is no index, so the column is inserted in order.

In [26]:
df['e'] = [1, 2]
df

Unnamed: 0,a,b,c,d,e
foo,1,2,3,,1
bar,4,5,6,4.0,2


#### Exercise:

Create a data frame that has two columns `increasing` and `evens`.  The `increasing` column contains the numbers 1-100 in increasing order, and the `evens` column has the even numbers in increasing order at the same locations as in `increasing`, but with missing values in the other locations.

### Load data from csv

A csv (comma separated values) is a file format used to store data separated by a **delimiter**.

A delimiter is a **single character** that marks the boundaries between data elements in a file.  A comma is the traditional choice of delimiter but a relatively poor one, because commas often appear inside the data elements themselves.  Better choices are pipe (`|`) and tab (`\t`).

In [28]:
# Pipe separated file.
!head 'playgolf.csv'

Date|Outlook|Temperature|Humidity|Windy|Result
07-01-2014|sunny|85|85|false|Don't Play
07-02-2014|sunny|80|90|true|Don't Play
07-03-2014|overcast|83|78|false|Play
07-04-2014|rain|70|96|false|Play
07-05-2014|rain|68|80|false|Play
07-06-2014|rain|65|70|true|Don't Play
07-07-2014|overcast|64|65|true|Play
07-08-2014|sunny|72|95|false|Don't Play
07-09-2014|sunny|69|70|false|Play


In a bizarre twist of history, comma separated files are often separated by different characters than commas.  There is no consistent convention of using a different file extension, but some people use `.psv` or `.tsv`.

Pandas has a `read_csv` function that loads a delimited file into a `DataFrame`.  The resulting object **must fit in memory**.

In [29]:
golf_df = pd.read_csv('playgolf.csv', delimiter='|')
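
If you do not have `playgolf.csv` at hand, the same call can be tried on an in-memory string. This sketch copies the first few lines of the file shown above into a string and wraps it in `io.StringIO`; the name `sample_df` is just for illustration.

```python
import io
import pandas as pd

raw = """Date|Outlook|Temperature|Humidity|Windy|Result
07-01-2014|sunny|85|85|false|Don't Play
07-02-2014|sunny|80|90|true|Don't Play
07-03-2014|overcast|83|78|false|Play
"""

# read_csv accepts any file-like object, not only a path.
sample_df = pd.read_csv(io.StringIO(raw), delimiter='|')
print(sample_df.dtypes)
```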

`DataFrame.head` can be used to view a portion of our new dataframe.

In [30]:
golf_df.head()

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
0,07-01-2014,sunny,85,85,False,Don't Play
1,07-02-2014,sunny,80,90,True,Don't Play
2,07-03-2014,overcast,83,78,False,Play
3,07-04-2014,rain,70,96,False,Play
4,07-05-2014,rain,68,80,False,Play


## Extracting information from DataFrames

### Basic Row and Column Indexing

As we have seen, individual columns may be extracted from a `DataFrame` as a `Series` with the usual `__getitem__` (square bracket) indexing, using the name of the column.

This is similar to how we index a dictionary.

In [31]:
golf_df['Temperature']

0     85
1     80
2     83
3     70
4     68
5     65
6     64
7     72
8     69
9     75
10    75
11    72
12    81
13    71
Name: Temperature, dtype: int64

We can extract individual values by taking the Series out of the DataFrame, then indexing it like a list.

In [32]:
golf_df['Temperature'][0]

85

We can extract multiple columns at once by passing a list of column names.

In [33]:
golf_df[['Temperature', 'Humidity']]

Unnamed: 0,Temperature,Humidity
0,85,85
1,80,90
2,83,78
3,70,96
4,68,80
5,65,70
6,64,65
7,72,95
8,69,70
9,75,80


If you try to index with a slice, however, it will only operate on the rows.

In [34]:
short_df = golf_df[0:5]
short_df

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
0,07-01-2014,sunny,85,85,False,Don't Play
1,07-02-2014,sunny,80,90,True,Don't Play
2,07-03-2014,overcast,83,78,False,Play
3,07-04-2014,rain,70,96,False,Play
4,07-05-2014,rain,68,80,False,Play


### Boolean / Logical Indexing

We can also index into a `DataFrame` using a list of **booleans** (i.e. `True` and `False` values). This will also operate on the rows.

In [35]:
# Takes rows 0, 2, and 4.
short_df[[True, False, True, False, True]]

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
0,07-01-2014,sunny,85,85,False,Don't Play
2,07-03-2014,overcast,83,78,False,Play
4,07-05-2014,rain,68,80,False,Play


That doesn't seem very useful... except that we can create a boolean `Series` by applying a comparison to a `Series`.

In [36]:
# A series of booleans.
golf_df['Temperature'] > 70

0      True
1      True
2      True
3     False
4     False
5     False
6     False
7      True
8     False
9      True
10     True
11     True
12     True
13     True
Name: Temperature, dtype: bool

And then use the result to grab rows of the dataframe.

In [37]:
golf_df[golf_df['Temperature'] > 70][["Date", "Windy"]]

Unnamed: 0,Date,Windy
0,07-01-2014,False
1,07-02-2014,True
2,07-03-2014,False
7,07-08-2014,False
9,07-10-2014,False
10,07-11-2014,True
11,07-12-2014,True
12,07-13-2014,False
13,07-14-2014,True


This is essentially applying a logical condition to select rows from a `DataFrame`.  This is one of the most common patterns in Pandas.
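
Conditions can also be combined. The sketch below uses a small made-up frame (not the golf data): `&` means "and", `|` means "or", `~` negates, and each comparison needs its own parentheses because of operator precedence.

```python
import pandas as pd

weather = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain'],
    'Temperature': [85, 80, 83, 70, 68],
    'Windy': [False, True, False, False, False],
})

# Rows that are warm AND not windy.
warm_and_calm = weather[(weather['Temperature'] > 75) & (~weather['Windy'])]

# Rows whose Outlook is one of several values.
sunny_or_overcast = weather[weather['Outlook'].isin(['sunny', 'overcast'])]

print(warm_and_calm)
print(sunny_or_overcast)
```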

#### Exercise

Select all of the rainy days in which the humidity is larger than 90 from this data frame.

To review: if you index a `DataFrame` with a **single value** or a **list of values**, it selects the **columns**.

If you use a **slice** or **sequence of booleans**, it selects the **rows**. 

### Double Indexing

Suppose we want to set the value of the `Windy` column where `Temperature > 70` to True (because, um, science).

In [38]:
golf_df[golf_df['Temperature'] > 70]["Windy"] = True

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


What?

In [39]:
golf_df[golf_df['Temperature'] > 70]["Windy"]

0     False
1      True
2     False
7     False
9     False
10     True
11     True
12    False
13     True
Name: Windy, dtype: bool

Apparently that warning actually meant something: the `Windy` column is unchanged.

This pattern is called double indexing (also known as chained indexing), and it is an antipattern!  Pandas cannot guarantee that assignments will hold when you index twice!

To fix these issues, we need to study the other indexing options that Pandas provides.

### Other Indexers: .loc and .iloc

There are two other indexing objects in pandas, both of which take a value to choose rows and a value to choose columns.

  - `df.iloc` is **positionally based**.  This indexer accepts integers and integer slices, and essentially treats the data frame as if it were a simple matrix.
  - `df.loc` is **label based**.  This indexer works with row and column indices / labels.
  
There used to be another one, and you will encounter it sometimes.

  - `df.ix` is **mixed**, it works with row numbers (integers) and column labels (names).
  
**The `ix` indexer is deprecated.  Older versions of pandas emit a warning if you use it, and it was removed entirely in pandas 1.0.  Don't write code that uses `ix`!**

In [43]:
df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)
df

Unnamed: 0,some_integers,some_strings,some_booleans
a,0,x,0
b,0,y,0
c,1,z,1
d,1,x,0
e,2,y,1
f,2,z,1


In [44]:
df.iloc[2:4, 0:2]

Unnamed: 0,some_integers,some_strings
c,1,z
d,1,x


In [45]:
df.loc['b':'e', ['some_integers', 'some_booleans']]

Unnamed: 0,some_integers,some_booleans
b,0,0
c,1,1
d,1,0
e,2,1
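
Because `.loc` takes the row selector and the column selector in a single indexing step, it avoids the double-indexing problem from the previous section. Here is a minimal sketch of that kind of assignment on a small stand-in frame (`toy_df` is made up; the same pattern should apply to `golf_df`).

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Temperature': [85, 70, 83, 68],
    'Windy': [False, False, True, False],
})

# Rows are chosen by a boolean Series, the column by its label,
# and the assignment happens on the original frame.
toy_df.loc[toy_df['Temperature'] > 70, 'Windy'] = True
print(toy_df)
```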


**Deprecation Warning!!!!**

In [48]:
df.ix[2:4, ['some_integers', 'some_booleans']]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


Unnamed: 0,some_integers,some_booleans
c,1,1
d,1,0


### Mixed Indexing

So what do we do if we want to get the rows by position and the columns by label?  That is, what if we need **mixed indexing**?

In [49]:
# Mixed indexing with iloc: will not work.
df.iloc[2:4, ['some_integers', 'some_booleans']]

TypeError: cannot perform reduce with flexible type

Mixed indexing in modern pandas is more explicit and less magical.  You need to use the `df.index` and `df.columns` attributes to explicitly turn positions into labels.

In [51]:
df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)
df

Unnamed: 0,some_integers,some_strings,some_booleans
a,0,x,0
b,0,y,0
c,1,z,1
d,1,x,0
e,2,y,1
f,2,z,1


#### Rows by position, Columns by name

In [52]:
df.index[2:4]

Index(['c', 'd'], dtype='object')

In [53]:
df.loc[df.index[2:4], ['some_integers', 'some_booleans']]

Unnamed: 0,some_integers,some_booleans
c,1,1
d,1,0


#### Rows by name, Columns by position

In [54]:
df.columns[[0, 2]]

Index(['some_integers', 'some_booleans'], dtype='object')

In [55]:
df.loc[['c', 'd'], df.columns[[0, 2]]]

Unnamed: 0,some_integers,some_booleans
c,1,1
d,1,0
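
The translation can also go the other way: turn labels into positions and then use `.iloc`. This sketch uses `Index.get_indexer`, which returns the integer position of each label; it is one possible approach, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)

# Convert column labels into integer positions.
col_positions = df.columns.get_indexer(['some_integers', 'some_booleans'])

# Rows by position, columns by (converted) position.
subset = df.iloc[2:4, col_positions]
print(col_positions)
print(subset)
```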


### Transforming data

Arithmetic operations apply to `Series` element by element.

In [56]:
# Yes, this makes no sense.
golf_df["TempHumid"] = golf_df['Temperature'] + golf_df['Humidity']

In [57]:
golf_df.head()

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid
0,07-01-2014,sunny,85,85,False,Don't Play,170
1,07-02-2014,sunny,80,90,True,Don't Play,170
2,07-03-2014,overcast,83,78,False,Play,161
3,07-04-2014,rain,70,96,False,Play,166
4,07-05-2014,rain,68,80,False,Play,148


In [58]:
# More Usefully

# Heat index formula taken from wikipedia: 
#    https://en.wikipedia.org/wiki/Heat_index
temp = golf_df['Temperature']
humid = golf_df['Humidity']
golf_df['HeatIndex'] = (-42.37 + 2.05*temp + 10.14*humid
                        - 0.225*temp*humid
                        - 6.84e-3*temp**2 
                        - 5.482e-2*humid**2
                        + 1.23e-3*temp**2*humid
                        + 8.53e-4*temp*humid**2
                        - 1.99e-6*temp**2*humid**2
)
golf_df[['Temperature', 'Humidity', 'HeatIndex']]

Unnamed: 0,Temperature,Humidity,HeatIndex
0,85,85,98.004631
1,80,90,84.4744
2,83,78,89.669911
3,70,96,62.847024
4,68,80,69.089776
5,65,70,73.668025
6,64,65,75.987116
7,72,95,66.247396
8,69,70,72.843649
9,75,80,74.557


We can create a new Series by applying functions to an existing Series.

In [60]:
# Create an indicator variable out of a column.
golf_df['Result'].apply(lambda x: 1 if x == 'Play' else 0)

0     0
1     0
2     1
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    1
11    1
12    1
13    0
Name: Result, dtype: int64

The previous result, though, is better computed as:

In [61]:
(golf_df['Result'] == 'Play').astype(int)

0     0
1     0
2     1
3     1
4     1
5     0
6     1
7     0
8     1
9     1
10    1
11    1
12    1
13    0
Name: Result, dtype: int64

We can check that these give the same result:

In [62]:
golf_df['Result'].apply(lambda x: 1 if x == 'Play' else 0) == (golf_df['Result'] == 'Play').astype(int)

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
Name: Result, dtype: bool

Or, to get a single answer:

In [63]:
np.all(
    golf_df['Result'].apply(lambda x: 1 if x == 'Play' else 0) 
    == (golf_df['Result'] == 'Play').astype(int))

True
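
For comparison, here is a sketch of a few ways to build the same indicator column on a small made-up Series. `Series.map` with a dictionary and `np.where` are two further options beyond the ones above; which reads best is partly a matter of taste.

```python
import numpy as np
import pandas as pd

result = pd.Series(["Don't Play", 'Play', 'Play', "Don't Play"], name='Result')

via_apply = result.apply(lambda x: 1 if x == 'Play' else 0)
via_compare = (result == 'Play').astype(int)
via_map = result.map({'Play': 1, "Don't Play": 0})
via_where = pd.Series(np.where(result == 'Play', 1, 0), index=result.index)

print(via_apply.tolist(), via_compare.tolist(), via_map.tolist(), via_where.tolist())
```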

We can also apply a function to each row of the DataFrame by passing `axis=1`, though this is rarely the best choice because the vectorized arithmetic operations are more efficient.

In [64]:
golf_df.apply(lambda x: x['Temperature'] + x['Humidity'], axis=1)

0     170
1     170
2     161
3     166
4     148
5     135
6     129
7     167
8     139
9     155
10    145
11    162
12    156
13    151
dtype: int64

In general, `.apply` is useful for mapping complex functions across your data.  You should be wary of using it in simple cases like this; there is probably a better way.

### Aggregating data

We can do something like the group by statement in SQL.

In [66]:
groups = golf_df.groupby('Outlook')

Iterating over the result of `groupby` yields a tuple for each Outlook value: the group name and the subset of the DataFrame belonging to that group.

In [None]:
for name, group_df in groups:
    print('Group Name: ', name)
    print('Group Data:\n', group_df)
    print('\n')

We can then apply some sort of aggregation to each subset of the data.

In [67]:
groups.count()

          Date  Temperature  Humidity  Windy  Result  TempHumid  HeatIndex
Outlook
overcast     4            4         4      4       4          4          4
rain         5            5         5      5       5          5          5
sunny        5            5         5      5       5          5          5


In [68]:
groups.sum()

          Temperature  Humidity  Windy  TempHumid   HeatIndex
Outlook
overcast          300       308    2.0        608  318.287412
rain              349       406    2.0        755  350.648809
sunny             381       410    2.0        791  397.347701


In [69]:
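# numeric_only=True skips the text columns, which newer pandas versions refuse to average.
groups.mean(numeric_only=True)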

          Temperature  Humidity  Windy  TempHumid  HeatIndex
Outlook
overcast         75.0      77.0    0.5      152.0  79.571853
rain             69.8      81.2    0.4      151.0  70.129762
sunny            76.2      82.0    0.4      158.2  79.469540


You can apply your own custom aggregation functions with `aggregate`.

In [70]:
groups.aggregate(min)

                Date  Temperature  Humidity  Windy      Result  TempHumid  HeatIndex
Outlook
overcast  07-03-2014           64        65  False        Play        129  68.106944
rain      07-04-2014           65        70  False  Don't Play        135  62.847024
sunny     07-01-2014           69        70  False  Don't Play        139  66.247396


In [71]:
# Get the minimum Temperature within each group.
# Note: This is an awful way to accomplish this, it's just for illustration.
groups.aggregate(lambda df: sorted(df['Temperature'])[0])

          Date  Temperature  Humidity  Windy  Result  TempHumid  HeatIndex
Outlook
overcast    64           64        64     64      64         64         64
rain        65           65        65     65      65         65         65
sunny       69           69        69     69      69         69         69


You should investigate a better way to accomplish the task in the previous example.
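
One possible direction, sketched on a small made-up frame: select the column of interest before aggregating, or use named aggregation (available in reasonably recent versions of pandas) to compute several summaries at once. The names `coldest`, `min_temp` and `mean_humidity` are illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'overcast'],
    'Temperature': [85, 80, 83, 70, 68, 64],
    'Humidity': [85, 90, 78, 96, 80, 65],
})

# Select one column of each group, then aggregate it.
coldest = toy_df.groupby('Outlook')['Temperature'].min()

# Named aggregation: output_column=(input_column, aggregation).
summary = toy_df.groupby('Outlook').agg(
    min_temp=('Temperature', 'min'),
    mean_humidity=('Humidity', 'mean'),
)

print(coldest)
print(summary)
```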

Note that groupby is a big topic; more documentation is at http://pandas.pydata.org/pandas-docs/stable/groupby.html



### Joining DataFrames

We can join DataFrames in much the same way that we join tables in SQL.  In fact, left, right, outer, and inner joins work the same way here.

Let's create a fake DataFrame to join with first.

In [72]:
mood_df = pd.DataFrame([['overcast', 'sad'], ['rainy', 'sad'], ['sunny', 'happy']],
                       columns=['Weather', 'Mood'])

mood_df

Unnamed: 0,Weather,Mood
0,overcast,sad
1,rainy,sad
2,sunny,happy


We can do joins using the merge command.

In [None]:
golf_df.merge(mood_df, how='inner', left_on='Outlook', right_on='Weather')

There are, of course, other options besides `inner`, which you can find in the documentation.
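
As a sketch of how the `how=` option changes the result, the toy frames below mirror the situation above, where one side says `rain` and the other says `rainy`. Keys must match exactly, so an inner join drops those rows, while a left join keeps them with missing values. The `indicator=True` option adds a `_merge` column recording where each row came from.

```python
import pandas as pd

days = pd.DataFrame({'Outlook': ['sunny', 'rain', 'overcast', 'rain']})
moods = pd.DataFrame({
    'Weather': ['overcast', 'rainy', 'sunny'],
    'Mood': ['sad', 'sad', 'happy'],
})

# Inner join: only keys present in both frames survive.
inner = days.merge(moods, how='inner', left_on='Outlook', right_on='Weather')

# Left join: every row of `days` is kept; unmatched rows get NaN.
left = days.merge(moods, how='left', left_on='Outlook', right_on='Weather',
                  indicator=True)

print(inner)
print(left)
```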

## Concatenating dataframes

This is the equivalent of Unions in SQL, but a little more flexible.

In [103]:
df1 = pd.DataFrame(
    {'Col3': range(5), 'Col2': range(5), 'Col1': range(5)},
    index=range(0, 5))
df2 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col4': range(5)},
    index=range(3, 8))

In [104]:
df1

Unnamed: 0,Col3,Col2,Col1
0,0,0,0
1,1,1,1
2,2,2,2
3,3,3,3
4,4,4,4


In [105]:
df2

Unnamed: 0,Col1,Col2,Col4
3,0,0,0
4,1,1,1
5,2,2,2
6,3,3,3
7,4,4,4


#### Vertically

This is like a Union All. The `sort` parameter controls the order of the columns in the output.

In [109]:
pd.concat([df1, df2], axis=0, join='outer', sort=True)

Unnamed: 0,Col1,Col2,Col3,Col4
0,0,0,0.0,
1,1,1,1.0,
2,2,2,2.0,
3,3,3,3.0,
4,4,4,4.0,
3,0,0,,0.0
4,1,1,,1.0
5,2,2,,2.0
6,3,3,,3.0
7,4,4,,4.0


Passing `join='inner'` limits the columns to those present in all the inputs.

In [110]:
pd.concat([df1, df2], axis=0, join='inner', sort=True)

Unnamed: 0,Col1,Col2
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4
3,0,0
4,1,1
5,2,2
6,3,3
7,4,4
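
Notice that the row labels 3 and 4 now appear twice in the result. If the original row labels carry no meaning, one option is `ignore_index=True`, which discards them and renumbers the rows. A minimal sketch with simplified versions of the two frames:

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': range(5), 'Col2': range(5)}, index=range(0, 5))
df2 = pd.DataFrame({'Col1': range(5), 'Col2': range(5)}, index=range(3, 8))

# Default: the original labels are kept, so 3 and 4 are duplicated.
stacked = pd.concat([df1, df2], axis=0)

# ignore_index=True: rows are relabelled 0..n-1.
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)

print(stacked.index.is_unique)
print(renumbered.index.tolist())
```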


#### Horizontally

This is pretty much a simple join on indices.  While `concat` is capable of doing joins, it is far less flexible than `merge`.

In [111]:
pd.concat([df1, df2], axis=1)

   Col3  Col2  Col1  Col1  Col2  Col4
0   0.0   0.0   0.0   NaN   NaN   NaN
1   1.0   1.0   1.0   NaN   NaN   NaN
2   2.0   2.0   2.0   NaN   NaN   NaN
3   3.0   3.0   3.0   0.0   0.0   0.0
4   4.0   4.0   4.0   1.0   1.0   1.0
5   NaN   NaN   NaN   2.0   2.0   2.0
6   NaN   NaN   NaN   3.0   3.0   3.0
7   NaN   NaN   NaN   4.0   4.0   4.0


**Question:** why do some numbers show up as floats? Why do some numbers not?

For more on joining DataFrames, read https://pandas.pydata.org/pandas-docs/stable/merging.html

## Some Extra, Useful Stuff

### Various Summaries

The `info` method is useful for checking column types and quickly seeing if you have NaN in the data.

In [112]:
golf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 8 columns):
Date           14 non-null object
Outlook        14 non-null object
Temperature    14 non-null int64
Humidity       14 non-null int64
Windy          14 non-null bool
Result         14 non-null object
TempHumid      14 non-null int64
HeatIndex      14 non-null float64
dtypes: bool(1), float64(1), int64(3), object(3)
memory usage: 878.0+ bytes


The `describe` method will give you a quick sense of the quartiles and distribution.

In [113]:
golf_df.describe()

Unnamed: 0,Temperature,Humidity,TempHumid,HeatIndex
count,14.0,14.0,14.0,14.0
mean,73.571429,80.285714,153.857143,76.163137
std,6.571667,9.840486,13.242456,9.77144
min,64.0,65.0,129.0,62.847024
25%,69.25,71.25,145.75,69.439078
50%,72.0,80.0,155.5,74.112513
75%,78.75,88.75,165.0,82.352579
max,85.0,96.0,170.0,98.004631


### Frequency Tables

The `crosstab` function will allow us to quickly take a look at the frequency count between two columns.

In [114]:
pd.crosstab(golf_df['Outlook'], golf_df['Result'])

Result    Don't Play  Play
Outlook
overcast           0     4
rain               2     3
sunny              3     2
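
Counts can also be turned into proportions with the `normalize` argument. The sketch below uses a small made-up frame; `normalize='index'` divides each row by its row total.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'rain', 'rain', 'overcast'],
    'Result': ["Don't Play", 'Play', 'Play', 'Play', 'Play'],
})

# Raw counts for each (Outlook, Result) pair.
counts = pd.crosstab(toy_df['Outlook'], toy_df['Result'])

# Share of each Result within each Outlook (rows sum to 1).
row_share = pd.crosstab(toy_df['Outlook'], toy_df['Result'], normalize='index')

print(counts)
print(row_share)
```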


### DateTimes

We can turn strings of dates into datetime types by using Pandas' `to_datetime` function.

In [115]:
golf_df['DateTime'] = pd.to_datetime(golf_df['Date'])

In [116]:
golf_df.head()

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex,DateTime
0,07-01-2014,sunny,85,85,False,Don't Play,170,98.004631,2014-07-01
1,07-02-2014,sunny,80,90,True,Don't Play,170,84.4744,2014-07-02
2,07-03-2014,overcast,83,78,False,Play,161,89.669911,2014-07-03
3,07-04-2014,rain,70,96,False,Play,166,62.847024,2014-07-04
4,07-05-2014,rain,68,80,False,Play,148,69.089776,2014-07-05


In [117]:
golf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 9 columns):
Date           14 non-null object
Outlook        14 non-null object
Temperature    14 non-null int64
Humidity       14 non-null int64
Windy          14 non-null bool
Result         14 non-null object
TempHumid      14 non-null int64
HeatIndex      14 non-null float64
DateTime       14 non-null datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(1), int64(3), object(3)
memory usage: 990.0+ bytes


In [118]:
golf_df['DateTime'].describe()

count                      14
unique                     14
top       2014-07-01 00:00:00
freq                        1
first     2014-07-01 00:00:00
last      2014-07-14 00:00:00
Name: DateTime, dtype: object
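
Two things that often help with dates, sketched on a few strings in the same `MM-DD-YYYY` layout as the golf data: passing an explicit `format` removes any ambiguity between month-first and day-first dates, and the `.dt` accessor exposes the parts of each date.

```python
import pandas as pd

dates = pd.Series(['07-01-2014', '07-02-2014', '07-14-2014'])

# Spell out the layout: month, day, four-digit year.
parsed = pd.to_datetime(dates, format='%m-%d-%Y')

# The .dt accessor works element by element on datetime Series.
months = parsed.dt.month
days = parsed.dt.day
weekday_names = parsed.dt.day_name()

print(parsed)
print(weekday_names)
```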

### Creating a New Row Index

We can also set the index to one or more existing columns.

In [119]:
date_df = golf_df.set_index('DateTime')
date_df

                  Date   Outlook  Temperature  Humidity  Windy      Result  TempHumid  HeatIndex
DateTime
2014-07-01  07-01-2014     sunny           85        85  False  Don't Play        170  98.004631
2014-07-02  07-02-2014     sunny           80        90   True  Don't Play        170  84.474400
2014-07-03  07-03-2014  overcast           83        78  False        Play        161  89.669911
2014-07-04  07-04-2014      rain           70        96  False        Play        166  62.847024
2014-07-05  07-05-2014      rain           68        80  False        Play        148  69.089776
2014-07-06  07-06-2014      rain           65        70   True  Don't Play        135  73.668025
2014-07-07  07-07-2014  overcast           64        65   True        Play        129  75.987116
2014-07-08  07-08-2014     sunny           72        95  False  Don't Play        167  66.247396
2014-07-09  07-09-2014     sunny           69        70  False        Play        139  72.843649
2014-07-10  07-10-2014      rain           75        80  False        Play        155  74.557000


In [120]:
date_df.index

DatetimeIndex(['2014-07-01', '2014-07-02', '2014-07-03', '2014-07-04',
               '2014-07-05', '2014-07-06', '2014-07-07', '2014-07-08',
               '2014-07-09', '2014-07-10', '2014-07-11', '2014-07-12',
               '2014-07-13', '2014-07-14'],
              dtype='datetime64[ns]', name='DateTime', freq=None)

If we have an index of datetime types, we can use the `resample` method to quickly look at time based aggregations.

In [121]:
# Weekly means.
date_df.resample('W').mean()

            Temperature   Humidity     Windy   TempHumid  HeatIndex
DateTime
2014-07-06    75.166667  83.166667  0.333333  158.333333  79.625628
2014-07-13    72.571429  77.857143  0.428571  150.428571  74.006167
2014-07-20    71.000000  80.000000  1.000000  151.000000  70.486984


This will be especially useful when we work with time series.
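
To see how the weekly bins are formed, here is a sketch that rebuilds just the temperature column as a standalone Series with a daily date index. With `'W'`, each bin ends on a Sunday, so the 14 days split into bins of 6, 7, and 1 days.

```python
import pandas as pd

idx = pd.date_range('2014-07-01', periods=14, freq='D')
temps = pd.Series(
    [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    index=idx, name='Temperature')

# Each bin is labelled by the Sunday that ends the week.
weekly_mean = temps.resample('W').mean()
weekly_count = temps.resample('W').count()

print(weekly_mean)
print(weekly_count)
```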

## Writing Data

We can write data into a csv file.

In [122]:
golf_df.to_csv('new_playgolf.csv', index=False)
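
`to_csv` also accepts a `sep` argument, so the pipe delimiter recommended earlier can be used for output as well. The sketch below writes a tiny made-up frame to an in-memory buffer rather than a file and reads it back to check the round trip.

```python
import io
import pandas as pd

frame = pd.DataFrame({'Outlook': ['sunny', 'rain'], 'Temperature': [85, 70]})

# Write pipe-separated text into a buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, sep='|', index=False)
text = buffer.getvalue()

# Read the text back with the same delimiter.
restored = pd.read_csv(io.StringIO(text), sep='|')

print(text)
print(restored)
```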

In [123]:
!cat new_playgolf.csv

Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex,DateTime
07-01-2014,sunny,85,85,False,Don't Play,170,98.0046312500001,2014-07-01
07-02-2014,sunny,80,90,True,Don't Play,170,84.47439999999996,2014-07-02
07-03-2014,overcast,83,78,False,Play,161,89.66991075999985,2014-07-03
07-04-2014,rain,70,96,False,Play,166,62.84702400000019,2014-07-04
07-05-2014,rain,68,80,False,Play,148,69.08977600000006,2014-07-05
07-06-2014,rain,65,70,True,Don't Play,135,73.66802499999999,2014-07-06
07-07-2014,overcast,64,65,True,Play,129,75.9871160000001,2014-07-07
07-08-2014,sunny,72,95,False,Don't Play,167,66.2473960000001,2014-07-08
07-09-2014,sunny,69,70,False,Play,139,72.84364900000003,2014-07-09
07-10-2014,rain,75,80,False,Play,155,74.55700000000012,2014-07-10
07-11-2014,sunny,75,70,True,Play,145,75.777625,2014-07-11
07-12-2014,overcast,72,90,True,Play,162,68.10694399999996,2014-07-12
07-13-2014,overcast,81,75,False,Play,156,84.52344124999973,2014-07-13
07-14-2014,rain,71,80,T