# Python Basics
From the [python.org executive summary](https://www.python.org/doc/essays/blurb/):
> Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

A Jupyter Notebook is a great way to write data science scripts because it combines code cells with markdown cells, making it possible to create reports that are dynamic, styled, and interactive.  Jupyter Notebooks (formerly IPython Notebooks) come as part of the Anaconda distribution, which includes lots of other great packages.  The Anaconda distribution is the easiest way to get started with data science in Python, and its base distribution includes `pandas`, `scikit-learn`, and `numpy`.

## Basic variable types and operations
You can assign names to several kinds of objects, including strings (`str`), integers (`int`), floats (`float`), and collections such as lists and dictionaries (which we'll get into a little later).  All of these variables are objects that have methods associated with them out of the box.
As with any programming language, Python comes with certain functions baked in.  The way a function works (or even whether it works at all) depends on the type of object you apply it to.  For instance, below I'll define a few variables and then try to do some stuff with them.

In [52]:
x = 2 # Note that Python will infer the type of object
y = 3
my_string = 'Test'
your_string = 'Yours'
z = 3.5
print(type(x))
print(type(my_string))
print(type(z))

<type 'int'>
<type 'str'>
<type 'float'>


In [53]:
x + y
my_string + your_string
x + z

5.5

Only the last result showed up because we didn't use `print()`: when a cell ends with a bare expression, the Notebook assumes you are working interactively and displays the value of that last expression and nothing else.  You can tell that by the `Out[]`.

In [54]:
print(x + y)
print(my_string + your_string)
print(x + z)

5
TestYours
5.5


Now let's break something...

In [55]:
x + my_string

TypeError: unsupported operand type(s) for +: 'int' and 'str'

There are a few skills that will serve you very well when programming for data science.
* Reading documentation
* Searching StackOverflow
* Parsing error messages

Error messages are not always as understandable as the one above, so Google is your friend.  What's going on with this error?  Basically, `+` means different things for different object types (i.e., addition for numeric types and concatenation for strings), and Python refuses to guess which one you want when you mix an `int` and a `str`.  Now for some more unexpected results.  First guess what you think the result will be, and then type it in and execute.
```Python
print(x/z)
print(x/y)
```

In [None]:
### Type the commands in this cell.
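
Coming back to the `TypeError` from earlier: one way to resolve it is to tell Python explicitly which meaning of `+` you want by converting one side first.  The snippet below is a minimal sketch that reuses the `x` and `my_string` values defined above; the `try`/`except` part is an addition for illustration, showing that the error can also be caught rather than stopping the script.

```python
x = 2
my_string = 'Test'

# Mixing an int and a str with + raises TypeError, so convert one side first.
as_text = str(x) + my_string   # concatenation: both operands are strings
as_number = x + int('3')       # addition: both operands are integers

print(as_text)
print(as_number)

# The error can also be caught instead of letting it stop the script.
try:
    x + my_string
except TypeError as err:
    print("Caught:", err)
```

Which conversion is right depends on what you meant, which is exactly why Python won't pick one for you.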

## Collections
The three main collections that are part of base Python are `list`, `tuple`, and `dict`.  A `list` is a holder for generic objects that allows you to iterate through the objects.  Its elements can be of any type, are ordered, and can repeat.  A `tuple` is like a list but cannot be updated once created: lists are mutable, tuples are immutable.  A dictionary (`dict`) is a key-value collection that lets you look up a value by its key rather than by position (treat it as unordered; only Python 3.7 and later guarantee that insertion order is kept).  You can have nested lists and dictionaries.

In [None]:
# The common collections can be defined with their own special brackets for convenience

# List example

my_list = ['a', 'b', 'c']

# Tuple examples

my_tup = ('a', 1)

# Dictionary examples

my_dict = {'a': 1, 'b': 2}
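
To make the mutable/immutable distinction concrete, here is a short sketch that reuses the three collections defined above.  The `nested` dictionary and the `tuple_changed` flag are made up for illustration.

```python
my_list = ['a', 'b', 'c']
my_tup = ('a', 1)
my_dict = {'a': 1, 'b': 2}

# Lists are mutable: we can change an element or add a new one.
my_list[0] = 'z'
my_list.append('d')
print(my_list)

# Tuples are immutable: assigning to an element raises TypeError.
try:
    my_tup[0] = 'z'
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(tuple_changed)

# Dictionaries are looked up by key, not by position.
print(my_dict['b'])

# Collections can be nested inside one another.
nested = {'letters': ['a', 'b'], 'point': (1, 2)}
print(nested['letters'][1])
```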

You can type cast objects: if Python knows how to __*convert*__ from one type to the other, it will.  Below, try experimenting with type casting.  To get started, try:
```python
print(float(x))
print(list(my_tup))
print(list(my_dict))
```

In [None]:
### Experiment with type casting here

Before we move on from collections it's worth noting that there are lots (and lots and lots) of convenience functions available in Python.  For instance, let's say you have two lists, one that contains keys and another that contains values, and you need to get them into a dictionary object.  You could loop through (as we'll do below), but Python anticipates this sort of thing, and gives us the `zip` function.

In [None]:
keys = ['a', 'b', 'c']
vals = [1, 2, 3]
zipped = zip(keys, vals)
kv_dict = dict(zipped)
print(kv_dict)
# Of course we could do that all in one line
print(dict(zip(keys, vals)))
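
For comparison, this is roughly what the loop-based version mentioned above might look like.  It is only a sketch; `loop_dict` is a name introduced here for illustration.

```python
keys = ['a', 'b', 'c']
vals = [1, 2, 3]

# Build the dictionary one pair at a time, using the position to match them up.
loop_dict = {}
for i in range(len(keys)):
    loop_dict[keys[i]] = vals[i]

print(loop_dict)
print(loop_dict == dict(zip(keys, vals)))
```

One difference worth knowing: if the two lists have different lengths, `zip` quietly stops at the shorter one, while the loop above raises an `IndexError` when `vals` runs out.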

## Control flow basics
By combining relatively simple logic and iteration through collections we can do some powerful things with our data.  We'll go over `if-then` and `for` loops today.  
### `if - then` syntax
An `if-then` block is merely a way to tell the script to execute some code only if a certain condition is met.  Python's syntax for this is pretty simple, and requires you to learn two things about its interpreter.  First, your conditional statement should end with a colon (:) character.  Second, all of the code that you want to execute should be indented on the lines that follow that colon; the block ends when you stop indenting.
```python
if x < y:
    print("The value of y is:")
    print(y)
```
You could also easily tell it to do something `else` if the condition is false.
```python
if x > y:
    print("The value of y is:")
    print(y)
    y = y - 1
else:
    print("The value of y is: " + str(y))
```
Conditions can be combined with the `and`, `or`, and `not` operators.  Finally, you can chain conditions together with `elif` in between the `if` and the concluding `else`.
```python
if x < y:
    print("The value of y is:")
    print(y)
    y = y - 1
elif y < z and x <= z:
    print("%f is the value of z." % z)
    y = y + 1
else:
    print("The value of y is: " + str(y))
```
Note that at most one of those blocks of code will execute: the first one whose condition is `True`.  If Python gets to the end and none of the conditions is `True`, it executes the `else` portion.  If there is no `else` block, none of that code executes.

| Operator      | Description                                      |
|:-------------:|:------------------------------------------------ |
| <, >, <=, >=  | Less/Greater than, Less/Greater than or equal to |
| ==, !=        | Equal to, Not equal to                           |
| not           | Logical not (reverses the truth value)           |
| and, or       | Logical and, Logical or                          |

### `for` loop syntax
Loops are a way to run a block of code repeatedly.  A `for` loop executes the prescribed code once for each item in a collection.  By contrast, a `while` loop executes code until a condition is no longer `True`.  A `for` loop also uses the colon-and-indent syntax.  For instance:
```python
for m in my_list:
    print(m)
```
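
For contrast, here is a minimal sketch of the `while` loop mentioned above.  The stopping rule (a running total of at least 10) is made up for illustration.

```python
count = 0
total = 0

# Keep adding the next whole number until the running total reaches at least 10.
while total < 10:
    count += 1
    total += count

print(count, total)
```

Unlike a `for` loop, nothing here guarantees the loop will end; if the condition never becomes `False`, the loop runs forever.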
In this case the loop isn't doing anything interesting.  Let's say we wanted to take a list of numbers and replace each element with its value times 2.  Type this code into the block below and let's talk about what's going on.
```python
numbers = list(range(5))  # list() gives a mutable list in Python 3 as well as Python 2
print(numbers)
for i, n in enumerate(numbers):
    numbers[i] = n * 2
print(numbers)
```

In [None]:
### Recreate what's above here

## Reading and writing data basics
There are a ton of ways to get data in and out of Python.  The most basic are a file object's `read` and `write` methods.  If we're in the directory where our file exists we just need to open it and then read its contents into an appropriate data structure.

In [None]:
with open("simple_data.txt") as simp:
    lines = simp.readlines()
print(lines)


Those '\n' characters are problematic.  All `readlines()` is doing is making the (more or less) raw contents of each line into an item in a list.  By using something called a list comprehension I can turn my text file into a list of numbers.  Let's talk about this line by line.

In [None]:
with open("simple_data.txt", 'r') as simp:
    data = [int(i.strip()) for i in simp.readlines()]
print(data)
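
To unpack the list comprehension in the cell above, the sketch below writes the same logic as an explicit loop.  The `lines` list is a made-up stand-in for what `simp.readlines()` returns, so the example runs without the file.

```python
# Stand-in for the result of simp.readlines(); the values are made up.
lines = ['4\n', '8\n', '15\n']

# The list comprehension from the cell above...
data = [int(i.strip()) for i in lines]

# ...is shorthand for this explicit loop.
data_loop = []
for i in lines:
    cleaned = i.strip()      # remove the trailing '\n'
    number = int(cleaned)    # convert the text to an integer
    data_loop.append(number)

print(data)
print(data_loop)
```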

Now let's say that I wanted to do something with those numbers, and write the results to a new file.

In [None]:
with open("simple_output.txt", 'w') as out:
    for d in data:
        result = d ** 2
        out.write(str(result) + '\n') # note that we could also collect the strings in a list and use writelines()


Reading and writing custom data formats is beyond our basic introduction, but check out the `csv` and `json` modules, which can help.  Today we'll be using `pandas`, so I won't get into the others here.
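
For a quick taste of those two modules, here is a minimal sketch.  The `rows` records are made up, and the example writes to an in-memory buffer (`io.StringIO`) rather than a real file so that it runs anywhere; this assumes Python 3.

```python
import csv
import io
import json

# Made-up records, used only for illustration.
rows = [{'name': 'a', 'value': 1}, {'name': 'b', 'value': 2}]

# csv: write to an in-memory buffer (a file opened with open() works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'value'])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back)  # note that csv hands every value back as a string

# json: convert to text and back, keeping numbers as numbers.
text = json.dumps(rows)
print(json.loads(text) == rows)
```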
### Problem #1
Using the file `big_data.txt`, count the numbers that are greater than 3 in that data set. (hint: `count += 1` will increment your counter by 1).

In [None]:
count = 0 


# Pandas
`pandas` is a library for handling structured row-column data for the purpose of analysis and modeling.  It's part of the base installation of the Anaconda distribution.  The main object is a `pandas` `DataFrame()`, which is essentially a row-column data store.  Each column is a `Series()` object, which can be treated as a stand-alone object.
To use a non-base package we have to import it, which we do like so:
```python 
import pandas as pd
```
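The `as` clause there is optional, but helps lessen keystrokes.  Think of it as a nickname for the package.  We can also import just part of a package, or give a submodule its own nickname, so that the names we type are shorter (Python still loads the whole module, so this doesn't save memory):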
```python
from random import randint
import matplotlib.pyplot as plt
```
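
To make the `DataFrame`/`Series` relationship described above concrete, here is a small sketch on a made-up table.  The column names and values are invented for illustration.

```python
import pandas as pd

# A tiny made-up table: two columns, three rows.
df = pd.DataFrame({'country': ['A', 'B', 'C'], 'gdp': [1.0, 2.5, 4.0]})

gdp = df['gdp']      # selecting one column gives a Series
print(type(df))
print(type(gdp))
print(gdp.mean())    # a Series has its own methods
print(df.shape)      # (rows, columns)
```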
We'll take a look at a simple data set that's based on the World Bank's World Development Indicators that I pulled a year and a half ago (read: it's not up to date, and I did a lot of munging on it already).  

In [56]:
import pandas as pd
bDF = pd.read_csv(r'builtDF.csv', encoding = "ISO-8859-1") # A dataset that I've curated
descriptions = pd.read_csv(r'wdiSeries.csv', encoding = "ISO-8859-1") # Descriptions of each field name
desc_dict = descriptions.set_index('Code')['Name'].to_dict()

The first step in anything data related is to munge the data.  That's a general term for preparing the data for further analysis.  `pandas` can be a great tool for simple munging, though it's not recommended if your dataset is really big and/or your munging is computationally complex.  How big or complex depends on the resources you have.  The problem is that `pandas` is robust, and thus not necessarily memory efficient (though it's getting more so all the time).  To munge this data I'm going to define a simple function using the `def` and `return` keywords.  Let's talk about what this does.

In [57]:
import numpy as np
def check_complete(ser):
    l = len(ser)
    isNaN = np.sum(pd.isnull(ser))
    if isNaN/float(l) > 0.1:
        return False
    else:
        return True
    
keep = bDF.apply(check_complete, axis = 0)

print(bDF.shape)
bDF_keep = bDF.iloc[:,keep.tolist()]
print(bDF_keep.shape)

### Let's go ahead and print the columns and descriptions that are left for future reference.

for code in bDF_keep.columns:
    if code in desc_dict.keys():
        print("%s : %s" %(code, desc_dict[code]))

(190, 53)
(190, 27)
AG.LND.AGRI.ZS : Agricultural land (% of land area)
AG.LND.FRST.ZS : Forest area (% of land area)
BG.GSR.NFSV.GD.ZS : Trade in services (% of GDP)
BM.GSR.TRVL.ZS : Travel services (% of service imports, BoP)
BM.KLT.DINV.WD.GD.ZS : Foreign direct investment, net outflows (% of GDP)
BN.GSR.MRCH.CD : Net trade in goods (BoP, current US$)
BN.KLT.DINV.CD : Foreign direct investment, net (BoP, current US$)
BX.TRF.PWKR.DT.GD.ZS : Personal remittances, received (% of GDP)
EG.ELC.RNEW.ZS : Renewable electricity output (% of total electricity output)
EN.ATM.GHGO.ZG : Other greenhouse gas emissions (% change from 1990)
EN.POP.DNST : Population density (people per sq. km of land area)
IT.CEL.SETS.P2 : Mobile cellular subscriptions (per 100 people)
IT.NET.USER.P2 : Internet users (per 100 people)
NE.TRD.GNFS.ZS : Trade (% of GDP)
NV.AGR.TOTL.ZS : Agriculture, value added (% of GDP)
NV.IND.TOTL.ZS : Industry, value added (% of GDP)
NV.SRV.TETC.ZS : Services, etc., value added (% 
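
To see what `check_complete` and `apply` are doing in the cell above, here is the same idea on a tiny made-up frame.  This is a condensed variant rather than the exact function: it uses `ser.isnull().mean()` to get the fraction of missing values, and selects columns with `.loc` instead of `.iloc`.  The `toy` data is invented for illustration.

```python
import numpy as np
import pandas as pd

def check_complete(ser):
    # Keep a column only if at most 10% of its values are missing.
    return ser.isnull().mean() <= 0.1

# Made-up data: 'full' has no gaps, 'gappy' is half missing.
toy = pd.DataFrame({
    'full': [1.0, 2.0, 3.0, 4.0],
    'gappy': [1.0, np.nan, np.nan, 4.0],
})

keep = toy.apply(check_complete, axis=0)  # one True/False per column
print(keep)

toy_keep = toy.loc[:, keep]               # the boolean mask selects columns
print(toy_keep.columns.tolist())
```

With `axis=0`, `apply` hands the function one column at a time, so `keep` ends up with one entry per column.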

We could also define a new column based on existing columns.  If it's simple, a `pandas Series()` object will likely support it in the way you would expect.  If it's more complex you can define it yourself.

In [58]:
bDF_keep['new_column'] = bDF_keep['BG.GSR.NFSV.GD.ZS'] - bDF_keep['BM.KLT.DINV.WD.GD.ZS']

print(bDF_keep.new_column.head())

0          NaN
1    34.425394
2    23.256564
3    37.245493
4          NaN
Name: new_column, dtype: float64




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
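
The warning above is `pandas` saying that it can't tell whether `bDF_keep` is an independent table or just a window onto `bDF`, so it isn't sure where the new column should go.  The column is still created here, but one common way to avoid the ambiguity is to take an explicit copy when making the subset.  The sketch below shows that on a made-up frame; `full` and `subset` are hypothetical stand-ins for `bDF` and `bDF_keep`.

```python
import pandas as pd

# Made-up frame standing in for bDF.
full = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# .copy() makes the subset an independent DataFrame rather than a slice of `full`.
subset = full.iloc[:, [True, True, False]].copy()
subset['new_column'] = subset['a'] - subset['b']

print(subset)
print(full.columns.tolist())  # the original frame is untouched
```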



In [59]:
def ser_avg(row):
    ret = (row['BG.GSR.NFSV.GD.ZS'] - row['BM.KLT.DINV.WD.GD.ZS']) / 2
    if np.isnan(ret):
        ret = 0
    return ret

# This is black magic...
bDF_keep['another_new'] = bDF.apply(lambda x: ser_avg(x), axis = 1)
print(bDF_keep.another_new.head())

0     0.000000
1    17.212697
2    11.628282
3    18.622746
4     0.000000
Name: another_new, dtype: float64




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In this context a `lambda` function is an anonymous function defined on the fly, right in the line of code that uses it.  They can be very useful.  In fact, we could have written everything we did in `ser_avg()` inside the `lambda`, but it would have been messy looking.  So our on-the-fly lambda function simply calls another function (namely `ser_avg`), and `apply` calls it on each row (we could have applied it to each column instead by changing the `axis` argument to `0`).
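
Here is a small sketch of the three ways to write that row-wise calculation.  The `toy` frame and the `half_diff` function are made up for illustration and leave out the `NaN` handling that `ser_avg` does.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({'a': [10.0, 20.0], 'b': [4.0, 6.0]})

def half_diff(row):
    return (row['a'] - row['b']) / 2

# Pass the named function directly.
with_def = toy.apply(half_diff, axis=1)
# Wrap the named function in a lambda, as in the cell above.
with_wrapper = toy.apply(lambda row: half_diff(row), axis=1)
# Put the logic itself inside the lambda.
with_inline = toy.apply(lambda row: (row['a'] - row['b']) / 2, axis=1)

print(with_inline.tolist())
```

All three give the same result, which also shows that the `lambda` wrapper around a named function isn't strictly needed.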

### Problem #2
Create a new column that is a binary (0 or 1) value indicating whether or not the Mobile cellular subscriptions (per 100 people) value is greater than or equal to the world average.  Hint: define the average value in the global name space. 

In [60]:
### Put your code in this cell.

# Plotly
There are lots of plotting libraries available for Python. `matplotlib` comes with Anaconda, `seaborn` is cool, and there are others.  The `plotly` library gives you d3.js style charts, and you can make them interactive, which is cool.  

In [61]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import Scatter, Layout, Bar
init_notebook_mode(connected = True) # Allows us to use plotly in offline mode

The base object of a plotly graph is a figure, and we'll usually name it `fig`.  A figure takes `data` and a `layout`.  `data` is a list object that contains one or more `trace` objects.  The `trace` object is where we define the stuff that we want plotted.  Think of it as a series in Excel.  Just like you can have more than one Excel series plotted on the same chart, you can have more than one `trace` plotted on a `plotly` chart.  Let's go through the below code line by line.

In [62]:
trace_ind = Scatter(
    x = bDF_keep['NV.IND.TOTL.ZS'], 
    y = bDF_keep['NY.GDP.MKTP.KD'],
    mode = 'markers',
    name = 'Industry',
    marker = dict(
        color = '#cc7c2c'
        )
    )

trace_agr = Scatter(
    x = bDF_keep['NV.AGR.TOTL.ZS'], 
    y = bDF_keep['NY.GDP.MKTP.KD'],
    mode = 'markers',
    name = 'Agriculture',
    marker = dict(
        color = '#636b77'
        )
    )

trace_srv = Scatter(
    x = bDF_keep['NV.SRV.TETC.ZS'], 
    y = bDF_keep['NY.GDP.MKTP.KD'],
    mode = 'markers',
    name = 'Services',
    marker = dict(
        color = '#3270d3'
        )
    )

data = [trace_ind, trace_agr, trace_srv]

layout = dict(
    title = 'Relationship between the percentage of type of value add and overall GDP',
    hovermode = 'closest',
    yaxis = dict(zeroline = False,
        title = 'GDP in constant 2010 US$',
        type = 'log'),
    xaxis = dict(zeroline = False,
        title = 'Percent value add',
        )
    )

fig = dict(data = data, layout = layout)
iplot(fig)
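
Stripped of the styling, the figure built above has a simple shape.  The sketch below spells it out with plain dictionaries and made-up numbers; it assumes (as the `fig = dict(...)` line above suggests) that `plotly` accepts plain dictionaries, with each trace's kind given by a `type` key.  It doesn't import `plotly` or draw anything, it only shows the structure.

```python
# Made-up values; each trace is one series to plot.
trace_a = dict(type='scatter', x=[1, 2, 3], y=[10, 20, 30], mode='markers', name='A')
trace_b = dict(type='scatter', x=[1, 2, 3], y=[5, 15, 25], mode='markers', name='B')

data = [trace_a, trace_b]                       # a list of traces
layout = dict(title='Two traces on one chart')  # chart-level settings
fig = dict(data=data, layout=layout)            # the figure ties them together

print(len(fig['data']))
print(fig['layout']['title'])
# In the notebook, iplot(fig) would be the call that draws this figure.
```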

We can get a lot fancier than mere dot plots though.  For instance, let's say I wanted a bubble chart.  In this case, I'm creating a trace for each region of the world.  I don't want to manually write each trace... that's prone to errors, and I'm lazy anyway.  Instead I'll create a function that returns a tuple of my `x` value, `y` value, `sz` size (that I want my bubble to be), and `tx` text (that I want to appear when I hover over the bubble).  Note that the division of labor between the `for` loop below and the `create_trace` function is kind of arbitrary.  It would have been cleaner, for instance, if my `create_trace` actually returned a `trace` object, and then all my `for` loop would do is append each `trace` to the `data` list.

In [63]:
def create_trace(resp, x_name, y_name, sz_name, tx_name):
    x = bDF_keep[x_name][bDF_keep.region == resp] # Let's talk about what's happening here
    y = bDF_keep[y_name][bDF_keep.region == resp]
    sz = bDF_keep[sz_name][bDF_keep.region == resp]
    tx = bDF_keep[tx_name][bDF_keep.region == resp]
    return (x, y, sz, tx)

regions = bDF_keep.region.unique()
data = []
# Let's talk about where I get colors from
colors = ['#839cc4', '#ad83c4', '#c48383', '#93c483', '#c4a283', '#83c4b8', '#7f8187']
for i, r in enumerate(regions):
    x_val, y_val, sz_val, tx_val = create_trace(r, 'NE.TRD.GNFS.ZS', 'NY.GDP.PCAP.CD', 
                                        'IT.NET.USER.P2', 'country_name')
    trace_val = Scatter(
        x = x_val,
        y = y_val,
        text = tx_val,
        name = r,
        mode = 'markers',
        marker = dict(
            color = colors[i],
            size = sz_val,
            sizemode = 'area',
            sizemin = 1
            ),
        line = dict(
            width = 2,
            color = 'rgb(0,0,0)')
        )
    data.append(trace_val)

layout = dict(
    title = 'GDP, internet users per 100 ppl (size), and trade for various regions (color)',
    hovermode = 'closest',
    yaxis = dict(zeroline = False,
        title = 'GDP per capita in current US$',

        ),
    xaxis = dict(zeroline = False,
        title = 'Trade as a percentage of GDP',

        )
    )

fig = dict(data = data, layout = layout)
iplot(fig)
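
As a sketch of the cleaner split suggested above, here is a version where `create_trace` returns a whole trace and the loop only collects them.  Everything in it is illustrative: `toy` is a made-up stand-in for `bDF_keep` with hypothetical column names, and the trace is a plain dictionary with a `type` key rather than a `Scatter` object so the example runs without `plotly`.

```python
import pandas as pd

# Made-up stand-in for bDF_keep; the column names here are hypothetical.
toy = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'trade': [50.0, 80.0, 30.0],
    'gdp': [1000.0, 2000.0, 500.0],
    'users': [40.0, 60.0, 10.0],
    'country_name': ['A', 'B', 'C'],
})

def create_trace(df, region, x_name, y_name, sz_name, tx_name, color):
    rows = df[df.region == region]  # boolean mask keeps only this region's rows
    return dict(
        type='scatter',
        x=rows[x_name].tolist(),
        y=rows[y_name].tolist(),
        text=rows[tx_name].tolist(),
        name=region,
        mode='markers',
        marker=dict(color=color, size=rows[sz_name].tolist(),
                    sizemode='area', sizemin=1),
    )

colors = ['#839cc4', '#ad83c4']
data = [
    create_trace(toy, r, 'trade', 'gdp', 'users', 'country_name', colors[i])
    for i, r in enumerate(toy.region.unique())
]
print([t['name'] for t in data])
```

The expression `df.region == region` produces one `True`/`False` per row, and indexing with it keeps only the rows marked `True`; that is the same trick the original `create_trace` applies to each column separately.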

### Problem #3
Visit [the plotly examples page](https://plot.ly/python/) and find a plot that you like.  Use our data set to recreate something.  Note that you'll have to infer where to do `pandas Series` things in place of `numpy` arrays and the like.  But that's part of the fun!

In [None]:
### Code here.