# Welcome to the Dark Art of Coding:
## Introduction to Python
Data Handling

<img src='../images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Merge DataFrames effectively
* Unstack your data
* Replace unwanted data with better versions

# Merging data
---

In [None]:
import pandas as pd
from pandas import DataFrame, Series

We start off with two data sets. One is shorter than the other, but otherwise they are similar: both have a column of names and a column of countries.

In [None]:
readers1 = pd.read_csv('reader_stats.csv')
readers2 = pd.read_csv('reader_stats_short.csv')

print("readers1 data:", '\n', readers1)
print('-' * 40)
print("readers2 data:", '\n', readers2)

pandas has built-in functions that let us merge two DataFrames in several different ways. These merges are similar to the joins you might see in a SQL database.

In [None]:
readerso = pd.merge(readers1, readers2, how='outer')
readersi = pd.merge(readers1, readers2, how='inner')
readersl = pd.merge(readers1, readers2, how='left')
readersr = pd.merge(readers1, readers2, how='right')

<img src='Base.jpg' width='600' style='float:center'>

In [None]:
print('Outer Join\n')
readerso

<img src='Outer.jpg' width='600' style='float:center'>

In [None]:
print('Inner Join\n')
readersi

<img src='Inner.jpg' width='600' style='float:center'>

In [None]:
print('Left Join\n')
readersl

<img src='Left.jpg' width='600' style='float:center'>

In [None]:
print('Right Join\n')
readersr

<img src='Right.jpg' width='600' style='float:center'>

NOTE: Please be aware that, unless you specify otherwise, these joins are based on every column the two DataFrames have in common (here, the contents of the entire row). In many cases, we simply want to join based on the contents of one or more columns.
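As a rough illustration of that note, the sketch below contrasts a merge with no `on=` argument against a merge on a single column. The two small DataFrames are made up for this example (they are not the contents of the reader CSV files).

```python
import pandas as pd

# Hypothetical stand-ins for the reader files; the real CSV contents may differ.
left = pd.DataFrame({'name': ['bruce', 'diana', 'hal'],
                     'country': ['US', 'GR', 'US']})
right = pd.DataFrame({'name': ['bruce', 'diana', 'hal'],
                      'country': ['US', 'US', 'US']})

# No `on=`: pandas joins on every shared column (name AND country),
# so diana's rows do not match because the countries differ.
all_columns = pd.merge(left, right, how='inner')

# With on='name': only the name has to match. The two country columns
# are kept side by side with the default _x / _y suffixes.
name_only = pd.merge(left, right, on='name', how='inner')

print(all_columns)
print(name_only)
```

The second form is usually what you want when the two sources may disagree on the non-key columns.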

# Key columns
Remember, DataFrames can be built from dictionaries: each key becomes a column name, and the sequence of values stored under that key becomes the elements of that column.

Here, we are creating some **key** columns that we can use to create joins...

In [None]:
dfa = DataFrame({'key':     ['bruce', 'bruce', 'diana', 'bruce', 'hal', 'diana', 'kara'],
                 'emails_left': [112, 111, 201, 109, 113, 203, 204]}) 

dfb = DataFrame({'key':        ['hal', 'bruce', 'selina', 'diana'],
                 'ages_right': [36, 37, 33, 34]})

In [None]:
# Imagine that, using the previous data, we wanted to analyze emails versus
# age (i.e. whether age impacts the number of emails someone receives over time).
# Let's start with a Left Join:

dfl = pd.merge(dfa, dfb, on='key', how='left')

In [None]:
# Now, let's look at an Inner Join:

dfi = pd.merge(dfa, dfb, on='key', how='inner')

In [None]:
dfi
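To make the difference between the two joins concrete, here is a small sketch that rebuilds the same `dfa` and `dfb` and compares row counts. The variable name `unmatched` is introduced here purely for illustration.

```python
import pandas as pd

dfa = pd.DataFrame({'key': ['bruce', 'bruce', 'diana', 'bruce', 'hal', 'diana', 'kara'],
                    'emails_left': [112, 111, 201, 109, 113, 203, 204]})
dfb = pd.DataFrame({'key': ['hal', 'bruce', 'selina', 'diana'],
                    'ages_right': [36, 37, 33, 34]})

dfl = pd.merge(dfa, dfb, on='key', how='left')
dfi = pd.merge(dfa, dfb, on='key', how='inner')

# The left join keeps every row of dfa; keys missing from dfb get NaN ages.
unmatched = dfl[dfl['ages_right'].isna()]
print(unmatched)

# The inner join drops those unmatched rows instead.
print(len(dfa), len(dfl), len(dfi))
```

Note that `selina` appears only in `dfb`, so she shows up in neither result; a right or outer join would be needed to keep her.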

## Multiple key columns

In [None]:
# Here, again, we create a set of DataFrames based on dictionaries.
# This time we choose to use more than one column that will be used as keys to
# match data in each of the DataFrames.


dfa = DataFrame({'fname_key': ['bruce', 'bruce', 'hal', 'selina', 'hal'],
                 'lname_key': ['wayne', 'jordan', 'wayne', 'kyle', 'jordan'],
                 'ages_left': [37, 53, 54, 33, 36]})

dfb = DataFrame({'fname_key': ['hal', 'bruce', 'hal', 'kara', 'hal'],
                 'lname_key': ['jordan', 'wayne', 'jordan', 'zor-el', 'jordan'],
                 'emails_right': [189, 111, 193, 253, 187]})

# Outer Join 
dfo = pd.merge(dfa, dfb, on=['fname_key', 'lname_key'], how='outer')

In [None]:
# Inner Join
dfi = pd.merge(dfa, dfb, on=['fname_key', 'lname_key'], how='inner')
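When keys repeat on one side, a merge produces one output row per matching pair, which can be surprising. The sketch below reruns the outer join with the `indicator=True` option of `pd.merge()`, which adds a `_merge` column recording where each row came from. The name `counts` is just for illustration.

```python
import pandas as pd

dfa = pd.DataFrame({'fname_key': ['bruce', 'bruce', 'hal', 'selina', 'hal'],
                    'lname_key': ['wayne', 'jordan', 'wayne', 'kyle', 'jordan'],
                    'ages_left': [37, 53, 54, 33, 36]})
dfb = pd.DataFrame({'fname_key': ['hal', 'bruce', 'hal', 'kara', 'hal'],
                    'lname_key': ['jordan', 'wayne', 'jordan', 'zor-el', 'jordan'],
                    'emails_right': [189, 111, 193, 253, 187]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
dfo = pd.merge(dfa, dfb, on=['fname_key', 'lname_key'], how='outer', indicator=True)

# ('hal', 'jordan') occurs once in dfa and three times in dfb,
# so it contributes three 'both' rows to the result.
counts = dfo['_merge'].value_counts()
print(dfo)
print(counts)
```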

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_merge_01.py
```

Execute your script in the **IPython interpreter** using the command:

```bash
run my_merge_01.py
```

Your script should do the following:
* Read in two csv files (don't worry about column names; the files have a header row that is turned into column names for you):
    * `left_file.csv`
    * `right_file.csv`
* Merge the two DataFrames using the `name` column as the key and using an inner join
* Create a new column called `matchip` of True/False values where the `toip` column and the `fromip` column match
* Output just the rows where `matchip` is True

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
dfl = pd.read_csv('left_file.csv')
dfr = pd.read_csv('right_file.csv')
comb = pd.merge(dfl, dfr, on='name', how='inner')
comb['matchip'] = comb.fmip == comb.toip
comb[comb.matchip]

# Concatenation
---

pandas Series and DataFrames (like some of the other data we've handled) can be concatenated. However, instead of using `+` as you would with lists or strings, you have to use the pandas built-in function `pd.concat()`. The default behaviour is to stack the data end to end.

In [None]:
names1 = Series(['wayne', 'jordan'], index=[1, 2])
names2 = Series(['dinah', 'kent'], index=[4, 5])
names3 = Series(['rayner', 'gordon', 'grayson'], index=[6, 7, 8])

pd.concat([names1, names3, names2], axis=0)

# An alternate method is to stack columns side by side
# pd.concat([names1, names3, names2], axis= 1)
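It may help to see why `+` is not the right tool here. In the sketch below (using made-up numeric Series rather than the name data), `+` on two lists joins them, while `+` on two Series adds values whose index labels line up and leaves `NaN` everywhere else.

```python
import pandas as pd

# Plain lists: + joins them end to end
joined_list = [1, 2] + [10, 20]

# Hypothetical numeric Series with partially overlapping index labels
a = pd.Series([1, 2], index=[1, 2])
b = pd.Series([10, 20], index=[2, 3])

# Series: + is element-wise addition aligned on the index,
# so only label 2 (present in both) gets a number.
added = a + b

# pd.concat() stacks the two Series end to end, keeping both index labels
stacked = pd.concat([a, b])

print(joined_list)
print(added)
print(stacked)
```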

In [None]:
names4 = pd.concat([names1, names3])
pd.concat([names1, names4], axis=1)

In [None]:
output = pd.concat([names1, names3, names3], keys=['rho', 'sigma', 'tau'])
output

In [None]:
output = pd.concat([names1, names3, names3], axis=1, keys=['rho', 'sigma', 'tau'])
output
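The `keys` argument is easiest to understand by looking at how you get data back out. This sketch (with two of the Series from above, and the illustrative names `stacked` and `side`) shows that the keys become the outer index level when stacking rows, and the column names when stacking side by side.

```python
import pandas as pd

names1 = pd.Series(['wayne', 'jordan'], index=[1, 2])
names3 = pd.Series(['rayner', 'gordon', 'grayson'], index=[6, 7, 8])

# axis=0 (default): keys become the outer level of a hierarchical index
stacked = pd.concat([names1, names3], keys=['rho', 'sigma'])
print(stacked.loc['sigma'])       # everything that came from names3
print(stacked.loc[('rho', 2)])    # one element, addressed by (key, original label)

# axis=1: keys become column names; labels missing from a Series are NaN
side = pd.concat([names1, names3], axis=1, keys=['rho', 'sigma'])
print(side)
```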

In [None]:
# To prep our next data set, we'll use yet another way to generate DataFrames...
# These nested lists will become the rows in our DataFrame.
# As a reminder, you can assign columns when you generate the Frame.
# If you don't have any need for the original indexes, you can ignore
# them and pandas will auto-generate a brand-new index on the fly when you do a
# concatenation.

dfa = DataFrame([[11, 21, 31, 41],
                 [13, 25, 32, 49],
                 [11, 21, 31, 41],
                 [11, 21, 31, 42]], columns=['iota', 'kappa', 'lambda', 'mu'])

dfb = DataFrame([[55, 66, 77],
                 [53, 63, 73]], columns=['kappa', 'lambda', 'mu'])

print(dfa)
print(dfb)

pd.concat([dfa, dfb], ignore_index=True)
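Because `dfb` has no `iota` column, the concatenation above fills that column with `NaN` for the rows that came from `dfb`. If you would rather keep only the columns both frames share, `pd.concat()` accepts `join='inner'`. A minimal sketch with shortened versions of the two frames:

```python
import pandas as pd

dfa = pd.DataFrame([[11, 21, 31, 41],
                    [13, 25, 32, 49]], columns=['iota', 'kappa', 'lambda', 'mu'])
dfb = pd.DataFrame([[55, 66, 77]], columns=['kappa', 'lambda', 'mu'])

# Default (join='outer'): union of the columns, NaN where a frame has no value
outer = pd.concat([dfa, dfb], ignore_index=True)

# join='inner': only the columns present in every frame
inner = pd.concat([dfa, dfb], ignore_index=True, join='inner')

print(outer)
print(inner)
```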

# Unstacking
---

In [None]:
# When generating DataFrames, another common method, especially with ranges of
# data OR with randomized data is to use functions in numpy to seed
# the Frame with ranges and/or randomized values. 
# Here, we are creating a Frame with the numbers 100 to 114 and shaping it to be a 
# three by five table.

import numpy as np

df = DataFrame(np.arange(100, 115).reshape((3, 5)),
               index=pd.Index(['kara', 'dinah', 'selina'], name='justiceleague'),
               columns=pd.Index(['wed', 'thu', 'fri', 'sat', 'sun'], name='day'))
df

In [None]:
# The default level to unstack is the innermost
df.unstack()

In [None]:
s = df.unstack()
s['wed']['kara']
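Unstacking a DataFrame that has a single-level index gives back a Series with a two-level index, with the column labels as the outer level. The sketch below rebuilds the same frame and shows a couple of equivalent ways to reach one value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100, 115).reshape((3, 5)),
                  index=pd.Index(['kara', 'dinah', 'selina'], name='justiceleague'),
                  columns=pd.Index(['wed', 'thu', 'fri', 'sat', 'sun'], name='day'))

s = df.unstack()

# The result is a Series indexed by (day, justiceleague)
print(s.index.names)

# Chained lookup and tuple lookup reach the same element
print(s['wed']['kara'])
print(s.loc[('wed', 'kara')])
print(s.loc[('sun', 'selina')])
```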

In [None]:
# You can refer to the level to unstack by an integer number, with the
# outermost (leftmost) level numbered 0. By default, pandas unstacks from the
# innermost level of a multi-level hierarchical index.

# The following code comes directly from the pandas documentation:
# http://pandas.pydata.org/pandas-docs/stable/advanced.html
# Several take-aways for this code... 
#     * use the documentation > plenty of great examples are in there.
#     * This MultiIndex dataframe is a nice setup for demoing multilevel unstacking
 
'''
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...: 
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [5]: index
Out[5]: 
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]: 
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64
'''

In [None]:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))

In [None]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index

In [None]:
s = pd.Series(np.random.randn(8), index=index)

In [None]:
# Using the example above, it is possible to demonstrate several levels of unstacking.
# As noted, the default level of unstacking is to unstack from the innermost level
# of a MultiIndex. Levels are numbered starting with the outermost level as '0' and
# incrementing as they move inward.

s

In [None]:
s.unstack(1)

In [None]:
s.unstack(0)

In [None]:
# NOTE: You can refer to the level to unstack by the name of the Index.

s.unstack('second')
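Since the Series above is filled with random numbers, it can be hard to tell which value went where. Here is a smaller sketch with fixed, made-up values so the effect of each level choice is easy to follow; `by_second` and `by_first` are illustrative names.

```python
import pandas as pd

tuples = [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series([1, 2, 3, 4], index=index)

# Unstacking 'second' (level 1, the default) turns one/two into columns
by_second = s.unstack('second')

# Unstacking 'first' (level 0) turns bar/baz into columns instead
by_first = s.unstack('first')

print(by_second)
print(by_first)

# The same value, reached through either layout
print(by_second.loc['baz', 'one'], by_first.loc['one', 'baz'])
```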

# Pivot table
---

In [None]:
# Another great tool for looking at your data in more convenient ways is to use a 
# pivot table. Let's start with a DataFrame that has three columns based on 
# this list of lists. A timestamp, a Justice League hero and the number of 
# Tweets they received on a given day.

league = DataFrame([['2016-03-10T00:00:00', 'jordan', 221],
                    ['2016-03-10T00:00:00', 'wayne', 222],
                    ['2016-03-10T00:00:00', 'kyle', 345],
                    ['2016-03-11T00:00:00', 'jordan', 222],
                    ['2016-03-11T00:00:00', 'wayne', 223],
                    ['2016-03-11T00:00:00', 'kyle', 323],
                    ['2016-03-12T00:00:00', 'jordan', 201],
                    ['2016-03-12T00:00:00', 'wayne', 209],
                    ['2016-03-12T00:00:00', 'kyle', 340],
                    ['2016-03-13T00:00:00', 'jordan', 220],
                    ['2016-03-13T00:00:00', 'wayne', 223],
                    ['2016-03-13T00:00:00', 'kyle', 339],
                    ['2016-03-14T00:00:00', 'jordan', 201],
                    ['2016-03-14T00:00:00', 'wayne', 219],
                    ['2016-03-14T00:00:00', 'kyle', 345]],
                    columns=['timestamp', 'jleague', 'tweets'])

In [None]:
# From the league DataFrame, we can create a pivot table using the pivot() method.
# Newer pandas releases require keyword arguments here.

tweet_view = league.pivot(index='timestamp', columns='jleague', values='tweets')
tweet_view

In [None]:
league['fan_index'] = abs(np.random.randn(len(league)))
league

In [None]:
tweet_view2 = league.pivot(index='timestamp', columns='jleague')
tweet_view2

In [None]:
tweet_view2['fan_index']
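One thing to watch for: `pivot()` only reshapes, so it needs each index/column pair to appear at most once. When a pair repeats, `pivot_table()` can aggregate the repeats instead. The sketch below uses a shortened, made-up version of the league data with one deliberately duplicated pair.

```python
import pandas as pd

league = pd.DataFrame([['2016-03-10', 'jordan', 221],
                       ['2016-03-10', 'wayne', 222],
                       ['2016-03-10', 'jordan', 9],    # repeated (timestamp, jleague) pair
                       ['2016-03-11', 'jordan', 222],
                       ['2016-03-11', 'wayne', 223]],
                      columns=['timestamp', 'jleague', 'tweets'])

# pivot() cannot decide which of the two jordan values to keep
try:
    league.pivot(index='timestamp', columns='jleague', values='tweets')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot() failed:', err)

# pivot_table() combines the repeats with an aggregation function
totals = league.pivot_table(index='timestamp', columns='jleague',
                            values='tweets', aggfunc='sum')
print(totals)
```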

# Removing duplicates and replacing values
---

In [None]:
# Dropping duplicates
# Work on a copy so that adding a column does not also modify dfa
dfd = dfa.copy()
dfd['zeta'] = [4, 1, 4, 1]
dfd

In [None]:
dfd.duplicated()

In [None]:
dfd.duplicated(['iota', 'kappa'])

In [None]:
dfd.drop_duplicates()
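Like `duplicated()`, `drop_duplicates()` can be limited to a subset of columns, and its `keep` argument controls which of the repeated rows survives. A small sketch that rebuilds the same `dfd` directly; `last_kept` is an illustrative name.

```python
import pandas as pd

dfd = pd.DataFrame([[11, 21, 31, 41, 4],
                    [13, 25, 32, 49, 1],
                    [11, 21, 31, 41, 4],
                    [11, 21, 31, 42, 1]],
                   columns=['iota', 'kappa', 'lambda', 'mu', 'zeta'])

# Whole-row comparison: only row 2 repeats an earlier row
print(dfd.duplicated())

# Compare on two columns only, and keep the LAST row of each group
last_kept = dfd.drop_duplicates(subset=['iota', 'kappa'], keep='last')
print(last_kept)
```

Neither call changes `dfd` itself; each returns a new object.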

In [None]:
# Using .map()

# legend:
# 0 = 'm'
# 1 = 'f'

genders = {'selina kyle': '1',
           'bruce wayne': '0',
           'dinah lance': '1',
           'hal jordan': '0',
           'clark kent': '0',
           'barry allen': '0',
           'arthur curry': '0',
           'billy batson': '0',
           'barbara gordon': '1',
           'kara zor-el': '1',
           'john jones': '0',
           'diana prince': '1',
           'dick grayson': '0',
           'victor stone': '0',
           'ray palmer': '0',
           'john constantine': '0',
           'kyle rayner': '0',
           'wally west': '0'}


it = pd.read_csv('ig_tweets.csv')
it

In [None]:
# Uses a dictionary to map keys to values

it['gender'] = it['jleague'].map(genders)

In [None]:
it

In [None]:
# Run a function on every element of the Series using apply

it['jleagueLower'] = it['jleague'].apply(str.lower)

In [None]:
it['gender'] = it['jleagueLower'].map(genders)
it

In [None]:
def gen_conv(name):
    gen = genders[name.lower()]
    if gen == '0':
        return 'm'
    elif gen == '1':
        return 'f'

In [None]:
it['gender'] = it['jleague'].apply(gen_conv)
it
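One difference between the two approaches is how they treat names that are not in the dictionary: `.map()` with a dictionary quietly produces `NaN`, while a plain dictionary lookup inside a function passed to `.apply()` raises a `KeyError`. The sketch below chains two `.map()` calls as one possible alternative. The small `it` frame, the `codes` dictionary and the extra name are all made up for this example.

```python
import pandas as pd

genders = {'selina kyle': '1', 'bruce wayne': '0'}
codes = {'0': 'm', '1': 'f'}   # hypothetical lookup mirroring the legend above

# Hypothetical stand-in for the CSV data; 'Oliver Queen' is not in genders
it = pd.DataFrame({'jleague': ['Selina Kyle', 'Bruce Wayne', 'Oliver Queen'],
                   'tweets': [3, 7, 12]})

# Lowercase the names, map name -> code, then code -> letter.
# Names missing from the dictionary end up as NaN instead of raising.
it['gender'] = it['jleague'].str.lower().map(genders).map(codes)
print(it)
```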

In [None]:
# You can also replace certain values wholesale, if desired, using the .replace() method.
# Note that .replace() returns a new Series; it.gender itself is not changed.
it.gender.replace('f', 'Female')

In [None]:
it.gender.replace(['f', 'm'], ['Female', 'Male'])
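`.replace()` also accepts a dictionary, which keeps each old value next to its replacement. The sketch below uses a made-up Series in place of `it.gender` and shows that the original is untouched unless you assign the result back.

```python
import pandas as pd

# Hypothetical stand-in for it.gender
gender = pd.Series(['f', 'm', 'f'])

# Dictionary form: {old_value: new_value}
spelled = gender.replace({'f': 'Female', 'm': 'Male'})

print(spelled)
print(gender)   # still the original single letters
```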

## Bins
---

In [None]:
msgs = it.tweets
bins = [2, 5, 9, 15]

categories = pd.cut(msgs, bins)

categories

In [None]:
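# Interval notation: '(' is an open (exclusive) end, ']' is a closed (inclusive) end.
# For example, (2, 5] means 2 < x <= 5.
# By default pd.cut() uses right=True, so each bin includes its right edge;
# pass right=False to include the left edge instead.

categories.value_counts()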
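To see what the `right` argument changes, the sketch below bins a few made-up values that sit exactly on the bin edges. Values that fall outside every bin come back as `NaN`.

```python
import pandas as pd

# Hypothetical message counts, chosen to land on the bin edges
msgs = pd.Series([2, 5, 9, 15, 20])
bins = [2, 5, 9, 15]

# right=True (default): intervals are (2, 5], (5, 9], (9, 15]
right_closed = pd.cut(msgs, bins)

# right=False: intervals are [2, 5), [5, 9), [9, 15)
left_closed = pd.cut(msgs, bins, right=False)

print(pd.DataFrame({'msgs': msgs,
                    'right=True': right_closed,
                    'right=False': left_closed}))
```

With the default, the value 2 is excluded (the first bin is open on the left); with `right=False`, the value 15 is excluded instead. The value 20 is outside the bins either way.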

In [None]:
labels = ['few', 'medium', "aren't there bad guys to catch"]
it['workload'] = pd.cut(it.tweets, bins, labels=labels)
it

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_bin_01.py
```

Execute your script in the **IPython interpreter** using the command:

```bash
run my_bin_01.py
```

Your script should do the following:

* Bring out your merged DataFrame from the last exercise
* Bin the `payload` column in increments of 100_000, up to AND INCLUDING 1_000_000, using spelled-out labels, e.g.:
    * `One hundred thousand`
    * `Two hundred thousand`
    * ...
    * `Nine hundred thousand`
    * `One million`
* Store the binned data in a new column called `bins`
* Create a pivot table using:
    * `lat` column as index
    * `long` column as column names
    * `bins` column as values

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
comb.lat = comb.lat.apply(int)
comb.long = comb.long.apply(int)

# Ten bins of width 100,000, from 0 up to and including 1,000,000
labels = ['one hundred thousand',
          'two hundred thousand',
          'three hundred thousand',
          'four hundred thousand',
          'five hundred thousand',
          'six hundred thousand',
          'seven hundred thousand',
          'eight hundred thousand',
          'nine hundred thousand',
          'one million']

comb['bins'] = pd.cut(comb.payload, range(0, 1000001, 100000), labels=labels)

comb.pivot(index='lat', columns='long', values='bins')