# Welcome to the Dark Art of Coding:
## Introduction to Python
pandas: Series & DataFrames

<img src='../images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* understand the purpose and application of pandas to data analysis problems
* understand how to create and use a Series
* understand how to create and use a DataFrame
* explore various simple examples of pandas usage


# `pandas` basics
---

`pandas` is one of the premier data analysis libraries in the Python ecosystem. It offers high-performance, easy-to-use data structures and data analysis tools enabling you to carry out your entire data analysis workflow.

`pandas` is used for:

* data analysis/science
* financial analysis
* data manipulation
* data cleansing
* data transformation

`pandas` has tools to read and write data to and from multiple data formats.

It also includes tools that simplify:

* grouping data
* applying transformations to columns, rows and individual cells
* working with dates and times
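As a rough sketch of how these tools fit together, the example below reads a small CSV, groups the rows, parses a date column, and writes the result back out. The CSV content is made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file.
csv_text = """hero,date,emails
billy,2016-01-10,111
billy,2016-01-11,121
selina,2016-01-10,211
"""

# Reading: parse the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Grouping: total emails per hero.
totals = df.groupby('hero')['emails'].sum()

# Dates and times: convert the text column to real timestamps.
df['date'] = pd.to_datetime(df['date'])

# Writing: to_csv() with no path returns the CSV as a string.
csv_out = df.to_csv(index=False)

print(totals)
```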

# List vs. Dict vs. Series vs. DataFrame
---

## list 
```python
# list
mylist = ['A', 'B', 'C']
```

**indexable**: 

`mylist[0]` by integer            

**sliceable**: 

`mylist[0:2]` by integer
 

## dict
```python
# dictionary
mydict = {'alpha': 1,
          'beta': 2,
          'gamma': 2}
```

**indexable**: 

`mydict['alpha']` by key


## Series
```
myseries = Series(['bruce', 'selina', 'kara', 'clark'])
          column
rows
0         'bruce'
1         'selina'
2         'kara'
'three'   'clark'
```

**indexable**: 

`myseries[0]` by integer

`myseries['three']` by row name

**sliceable**:

`myseries[0:3]`                  
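When a Series has row names, it can be clearer to say explicitly whether a lookup is by label or by position. A minimal sketch, assuming an all-string index (the labels `'zero'` through `'three'` are chosen here for illustration):

```python
import pandas as pd

# String labels keep label-based and position-based access distinct.
myseries = pd.Series(['bruce', 'selina', 'kara', 'clark'],
                     index=['zero', 'one', 'two', 'three'])

by_label = myseries.loc['three']          # look up by row name
by_position = myseries.iloc[0]            # look up by integer position
first_three = myseries.iloc[0:3]          # positional slice, end excluded
label_slice = myseries.loc['zero':'two']  # label slice, end included
```

Note that a positional slice stops before its end point, while a label slice includes it.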

## DataFrame
```
mydataframe = DataFrame(lots of data...)
        col1      col2        col3      age
rows
0       'bruce'   'wayne'     'M'       42
1       'selina'  'kyle'      'F'       34
'two'   'kara'    'zor-el'    'F'       27
3       'clark'   'kent'      'M'       35
```

**indexable**

by either row(s) or column(s) 

`mydataframe['col1']`

`mydataframe[['col1', 'age']]`
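The column selections above might look like this in practice. This sketch rebuilds the table with pandas' default integer row labels rather than the mixed labels in the diagram:

```python
import pandas as pd

mydataframe = pd.DataFrame({'col1': ['bruce', 'selina', 'kara', 'clark'],
                            'col2': ['wayne', 'kyle', 'zor-el', 'kent'],
                            'col3': ['M', 'F', 'F', 'M'],
                            'age': [42, 34, 27, 35]})

one_column = mydataframe['col1']            # a single name returns a Series
two_columns = mydataframe[['col1', 'age']]  # a list of names returns a DataFrame
one_row = mydataframe.loc[2]                # the row labelled 2, as a Series
```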



# Series
---

In [8]:
# Let's start by making a simple Series.
# It is customary to import pandas by the alias: pd

import pandas as pd
from pandas import Series, DataFrame

s = Series([33, 37, 27, 42])

# pandas will assign an index automatically starting at "0"

s

0    33
1    37
2    27
3    42
dtype: int64

In [9]:
# We can see that the object is a Series object

print(type(s))

<class 'pandas.core.series.Series'>


In [18]:
# Series objects can be assigned a name 
# The index can also be assigned directly.

s.name = 'Justice League ages'
s.index = ['bruce', 'selina', 'kara', 'clark']

s

bruce     33
selina    37
kara      27
clark     42
Name: Justice League ages, dtype: int64

In [33]:
# The Series constructor lets you set attributes
#     such as the index and the name directly.

s1 = Series([37, 36, 10, 36],
            index=['hal', 'victor', 'diana', 'billy'],
            name='More Justice League ages')
s1

hal       37
victor    36
diana     10
billy     36
Name: More Justice League ages, dtype: int64

In [34]:
# Accessing a row directly uses brackets and the 
#     name of the row.

s1['billy']

36

In [36]:
# Similarly, assignment of a value to a row
# uses bracket indexing

s1['diana'] = 32
s1

hal       37
victor    36
diana     32
billy     36
Name: More Justice League ages, dtype: int64

In [38]:
# Rows can be filtered using comparison operators
#     such as ==, <=, >=

s1[s1 >= 35]

hal       37
victor    36
billy     36
Name: More Justice League ages, dtype: int64

In [39]:
# Much like numpy, pandas Series (and DataFrames)
#     offer vector mathematics whereby you can add to
#     or multiply against all rows or cells
#     WITHOUT using a for loop.

s1*2

hal       74
victor    72
diana     64
billy     72
Name: More Justice League ages, dtype: int64

In [27]:
s1

hal       37
victor    36
diana     32
billy     36
Name: More Justice League ages, dtype: int64

In [28]:
s1[['diana', 'billy']]*20

diana    640
billy    720
Name: More Justice League ages, dtype: int64

In [31]:
'diana' in s1

True

In [32]:
'lex' in s1

False

In [None]:
names = {'bruce wayne': 'bwayne@jleague.org',
         'hal jordan': 'hjordan@jleague.org',
         'clark kent': 'ckent@jleague.org',
         'barry allen': 'ballen@jleague.org',
         'diana prince': 'dprince@jleague.org',
         'arthur curry': 'acurry@jleague.org',
         'billy batson': 'bbatson@jleague.org',
         'john jones': 'jjones@jleague.org',
         'victor stone': 'vstone@jleague.org',
         'dick grayson': 'dgrayson@jleague.org',
         'ray palmer': 'rpalmer@jleague.org',
         'dinah lance': 'dlance@jleague.org',
         'kara zor-el': 'kzor-el@jleague.org',
         'john constantine': 'jconstantine@jleague.org',
         'barbara gordon': 'bgordon@jleague.org',
         'kyle rayner': 'krayner@jleague.org',
         'selina kyle': 'skyle@jleague.org',
         'wally west': 'wwest@jleague.org'
         }

emails = Series(names)
# emails.index
# emails.values
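To illustrate what building a Series from a dictionary produces, here is a shortened sketch using three of the entries above: the keys become the index and the values become the data.

```python
import pandas as pd

names = {'bruce wayne': 'bwayne@jleague.org',
         'hal jordan': 'hjordan@jleague.org',
         'clark kent': 'ckent@jleague.org'}

emails = pd.Series(names)

labels = emails.index    # the dictionary keys
data = emails.values     # the dictionary values, as an array
```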

In [None]:
s1 = Series(range(10, 16), index=['a', 'b', 'c', 'd', 'e', 'f'])
s2 = Series(range(16, 22), index=['a', 'b', 'c', 'x', 'y', 'z'])

# s1
# s2

In [None]:
s3 = s1 + s2
s1 + s2

# type(s3)
# pd.isnull(s3)
# s3.isnull()
# s3.<tab>
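Adding two Series matches values by index label, not by position. The sketch below repeats the cell above and shows that labels present in only one Series come out as `NaN`; the `add(..., fill_value=0)` call is one possible alternative when missing entries should be treated as zero.

```python
import pandas as pd

s1 = pd.Series(range(10, 16), index=['a', 'b', 'c', 'd', 'e', 'f'])
s2 = pd.Series(range(16, 22), index=['a', 'b', 'c', 'x', 'y', 'z'])

s3 = s1 + s2          # only 'a', 'b' and 'c' exist in both Series
missing = s3.isnull() # True wherever a label had no partner

# Treat a missing partner as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)

print(s3)
```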

In [None]:
# How do I learn more?
# s3.<method_name>?        # just ask by typing the method name (sans parentheses) and 
#                          # adding a question mark to see the built-in help docs
# 
# s3.value_counts?
# s3.value_counts(dropna=False)

In [None]:
s3.dropna()

# s3
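One detail worth checking here: `dropna()` returns a new Series and leaves the original untouched. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

s3 = pd.Series([26.0, np.nan, 30.0], index=['a', 'd', 'c'])

cleaned = s3.dropna()   # new Series without the NaN row

# s3 itself still contains the NaN unless the result is assigned back,
# for example: s3 = s3.dropna()
```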

In [None]:
s4 = Series([42, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6])

# s4.unique()
# s4.value_counts()
# s4.max()
# s4 + 2
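The commented-out calls above can be tried as follows; this sketch simply stores each result in a variable so it can be inspected.

```python
import pandas as pd

s4 = pd.Series([42, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6])

distinct = s4.unique()       # distinct values, in order of first appearance
counts = s4.value_counts()   # how often each value occurs, most frequent first
largest = s4.max()
shifted = s4 + 2             # adds 2 to every element, no loop needed
```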

In [None]:
def transmogrifier(x):
    '''hat tip to Calvin and Hobbes for introducing me to this 
    truly fantastic word. thanks, bill watterson.

    "transform, especially in a surprising or magical manner."
    '''
    new_val = '- ' + str(x ** 3) + ' -'
    return new_val

s4.apply(transmogrifier)
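`apply` calls the function once per element. For simple cases like this one, the same result can often be built from vectorized operations instead; the sketch below compares the two on a shorter Series (the values are chosen for illustration).

```python
import pandas as pd

s4 = pd.Series([1, 2, 3])


def transmogrifier(x):
    """Cube x and wrap the result in dashes."""
    return '- ' + str(x ** 3) + ' -'


# Element by element, via apply.
applied = s4.apply(transmogrifier)

# The same transformation using vectorized operations.
vectorized = '- ' + (s4 ** 3).astype(str) + ' -'
```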


# DataFrames
---

In [None]:
# Making a DataFrame # 1
# Using a dictionary:

data = {'hero': ['billy', 'billy', 'billy', 'selina', 'selina'],
        'date': ['Jan 10', 'Jan 11', 'Jan 12', 'Jan 10', 'Jan 11'],
        'emails': [111, 121, 93, 211, 210]}

df = DataFrame(data)

In [None]:
df = DataFrame(data, columns=['date', 'hero', 'emails'])

In [None]:
df = DataFrame(data, columns=['date', 'hero', 'emails', 'instagrams'])

df.index = [1, 2, 3, 4, 5]

# df
# df.columns

In [None]:
# df['hero']
# df.hero

# .ix was removed in pandas 1.0; use .loc for label-based row access
# df.loc[3]
# df.loc[3:4]
# df.loc[3:5]
# df.loc[1:5:2]
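Row selection in current pandas goes through `.loc` (by label) and `.iloc` (by position). A minimal sketch, using a cut-down version of the DataFrame above with the same index of 1 to 5:

```python
import pandas as pd

df = pd.DataFrame({'hero': ['billy', 'billy', 'billy', 'selina', 'selina'],
                   'emails': [111, 121, 93, 211, 210]},
                  index=[1, 2, 3, 4, 5])

row = df.loc[3]               # the row labelled 3
by_label = df.loc[3:5]        # labels 3, 4 and 5 (end included)
by_position = df.iloc[3:5]    # positions 3 and 4 (end excluded)
every_other = df.loc[1:5:2]   # labels 1, 3 and 5
```

Because the index starts at 1, label 3 and position 3 refer to different rows, which is why the two slices differ.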

In [None]:
df.instagrams = 50
ins = Series([10, 20, 30], index=[1, 3, 5])

# df.instagrams = ins
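Assigning a Series to a column aligns on the index, so rows without a matching label receive `NaN`. The sketch below uses bracket assignment, which works whether or not the column already exists; attribute-style assignment such as `df.instagrams = 50` only updates a column that is already present.

```python
import pandas as pd

df = pd.DataFrame({'emails': [111, 121, 93, 211, 210]},
                  index=[1, 2, 3, 4, 5])

# A scalar is broadcast to every row.
df['instagrams'] = 50

# A Series is matched by index label; rows 2 and 4 have no match.
ins = pd.Series([10, 20, 30], index=[1, 3, 5])
df['instagrams'] = ins

print(df)
```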

In [None]:
df['overworked'] = df['emails'] >= 120

In [None]:
# Making a DataFrame # 2
# Using a dictionary with nested dictionaries...

data = {'billy': {'Jan 10': 202, 'Jan 11': 220, 'Jan 12': 198},
        'selina': {'Jan 09': 246, 'Jan 10': 235, 'Jan 11': 243}}

In [None]:
df2 = DataFrame(data)
# df2.T
dft = df2.T

In [None]:
dft.columns.name = 'date'
dft.index.name = 'hero'
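With nested dictionaries, the outer keys become columns and the inner keys become row labels; dates that appear for only one hero are filled with `NaN`. A sketch repeating the cells above:

```python
import pandas as pd

data = {'billy': {'Jan 10': 202, 'Jan 11': 220, 'Jan 12': 198},
        'selina': {'Jan 09': 246, 'Jan 10': 235, 'Jan 11': 243}}

df2 = pd.DataFrame(data)   # columns: heroes, rows: dates
dft = df2.T                # transpose: rows become heroes, columns become dates

# Optional labels for the two axes.
dft.columns.name = 'date'
dft.index.name = 'hero'

print(dft)
```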

In [None]:
# using indexes
nums = Series(range(10, 16), index=['t', 'u', 'v', 'x', 'y', 'z'])
i = nums.index
# i
# i[2:4]
# i[::2]
# i[::3]
# i[4]

In [None]:
logs = pd.read_csv('../../log_file_1000.csv', names=['name',
                                                     'email',
                                                     'fm_ip',
                                                     'to_ip',
                                                     'date_time',
                                                     'lat',
                                                     'long',
                                                     'payload_size'])

In [None]:
logs.fm_ip.unique()

# logs.name.value_counts()
# logs.name.head()

In [None]:
g = logs.groupby(logs.fm_ip)

g.ngroups

# g.first()
# g.get_group('106.152.115.161')
# g.get_group('106.152.115.161').head(3)
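Since the log file may not be at hand, here is a self-contained sketch of the same `groupby` calls on a tiny, made-up DataFrame. The names, IP addresses and sizes are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the log data.
logs = pd.DataFrame({'name': ['bruce wayne', 'hal jordan',
                              'bruce wayne', 'kara zor-el'],
                     'fm_ip': ['10.0.0.1', '10.0.0.2',
                               '10.0.0.1', '10.0.0.3'],
                     'payload_size': [100, 250, 300, 50]})

g = logs.groupby('fm_ip')

n_groups = g.ngroups                  # number of distinct fm_ip values
first_rows = g.first()                # first row of each group
one_group = g.get_group('10.0.0.1')   # all rows for a single fm_ip
sizes = g['payload_size'].sum()       # total payload per fm_ip
```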

In [None]:
logs.date_time.head()

In [None]:
def date_only(dt):
    day = dt.split('T')[0]
    return day

In [None]:
logs['date'] = logs.date_time.apply(date_only)
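The sketch below runs `date_only` on two made-up timestamps and shows two possible alternatives: the vectorized `.str` accessor and `pd.to_datetime`. It assumes the `date_time` values are ISO-style strings with a `T` between date and time, which is what `date_only` expects; the real file may differ.

```python
import pandas as pd

# Hypothetical timestamps in a 'dateTtime' shape.
logs = pd.DataFrame({'date_time': ['2016-01-10T08:15:00',
                                   '2016-01-11T23:59:59']})


def date_only(dt):
    """Return the part of the string before the 'T'."""
    return dt.split('T')[0]


# Element by element, via apply.
logs['date'] = logs.date_time.apply(date_only)

# The same split using the vectorized string accessor.
logs['date_str'] = logs['date_time'].str.split('T').str[0]

# Or parse into real timestamps, which enables date arithmetic.
logs['parsed'] = pd.to_datetime(logs['date_time'])
```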

In [None]:
logs.columns

# tf = logs.fm_ip == logs.to_ip
# tf
# tf.unique()
# tf.value_counts()
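Comparing two columns yields a Boolean Series, which can then be counted or used as a filter. A small sketch with hypothetical IP addresses:

```python
import pandas as pd

# Hypothetical stand-in for the log data.
logs = pd.DataFrame({'fm_ip': ['10.0.0.1', '10.0.0.2', '10.0.0.3'],
                     'to_ip': ['10.0.0.9', '10.0.0.2', '10.0.0.7']})

tf = logs.fm_ip == logs.to_ip   # True where sender and receiver match

n_same = tf.sum()               # True counts as 1, False as 0
same_rows = logs[tf]            # keep only the matching rows

print(tf.value_counts())
```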

# Experience Points!
---

## Sample 01

In your **text editor** create a simple script called `my_lesson_name_01.py` to do the following:

Execute your script in the **IPython interpreter** using the command `run my_lesson_name_01.py`.

Create a function called `me()` that prints out 3 things:

* Your name
* Your favorite food
* Your favorite color

Lastly, call the function so that it executes when the script is run.

## Sample 02

On the **IPython interpreter** do each of the following:

Task | Sample Object(s)
:---|:---
Compare two items using `and` | 'Bruce', 0
Compare two items using `or` | '', 42
Use the `not` operator to make an object False | 'Selina' 
Compare two numbers using comparison operators | `>, <, >=, !=, ==`
Create a more complex/nested comparison using parentheses and Boolean operators| `('kara' _ 'clark') _ (0 _ 0.0)`

## Sample 03

In your **text editor** create a simple script called `my_lesson_name.py` to do the following:

Execute your script in the **IPython interpreter** using the command `run my_lesson_name.py`.

I suggest running your script each time you add a feature, so that you test it incrementally.

1. Create a variable with your first name as a string AND save it with the label: `myfname`.
1. Create a variable with your age as an integer AND save it with the label: `myage`.

1. Use `input()` to prompt for your first name AND save it with the label: `fname`.
1. Create an `if` statement to test whether `fname` is equivalent to `myfname`. 
1. In the `if` code block: 
   1. Use `input()` to prompt for your age AND save it with the label: `age` 
   1. NOTE: don't forget to convert the value to an integer.
   1. Create a nested `if` statement to test whether `myage` and `age` are equivalent.
1. If both tests pass, have the script print: `Your identity has been verified`

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../images/green_sticky.300px.png' width='200' style='float:left'>