# From Rows to Lists, Arrays, and DataFrames

## Iterating Over Rows to Analyze Data

So far we have learned how to iterate through the rows of a file in order to process each row, one at a time.  That approach has its limitations.  In this session we compare iterating over rows with two alternatives: generating lists from the rows, and generating arrays using Numpy.  We close with an introduction to the pandas package, which we will begin using extensively.

But first, let's create and then load a simple csv file with one column, containing 50 random numbers between 0 and 1.
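The cell that creates the file is not shown in this text.  Here is a minimal sketch of one way to do it, assuming the standard library's random and csv modules and a fixed seed (an arbitrary choice, only so the numbers are repeatable).  The values will differ from the original file, so the plots later on may look different.

```python
import csv
import random

random.seed(42)  # assumption: arbitrary seed, only for reproducibility

# Write 50 uniform random numbers in [0, 1), one per row
with open('random.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for _ in range(50):
        writer.writerow([random.random()])

# Read the file back to confirm what was written
with open('random.csv', 'r', newline='') as csvfile:
    values = [float(row[0]) for row in csv.reader(csvfile)]

print(len(values))
```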

The first task is to calculate a mean value of these numbers.  How can we do that?

In [None]:
import csv
with open('random.csv', 'r') as csvfile:
    i = 0
    cumsum = 0
    itemreader = csv.reader(csvfile)
    for row in itemreader:
        i = i+1
        cumsum = cumsum+float(row[0])
    mean = cumsum/i
    print(mean)

How about computing the min and max values?

In [None]:
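import csv
with open('random.csv', 'r') as csvfile:
    i = 0
    cumsum = 0
    # Avoid the names min and max, which would shadow Python's built-in functions
    min_val = float('inf')
    max_val = float('-inf')
    itemreader = csv.reader(csvfile)
    for row in itemreader:
        value = float(row[0])  # convert the text to a number once per row
        i = i+1
        cumsum = cumsum+value
        if value > max_val:
            max_val = value
        if value < min_val:
            min_val = value
    mean = cumsum/i
    print(mean, min_val, max_val)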

Now let's say we need to compute the median of these numbers.  Remember that a median is the value for which half of the observations are less and half are greater.  Any suggestions on how to code this using our iteration approach, and without resorting to creating a list?

## A Note on Documentation

Often we forget how to do something.  In that case, the Python docs are our friend.  They can be found here: https://docs.python.org/3.5/

In particular, note the language reference and the library reference.  Let's stop and take a look at each.  

## Using Lists to Analyze Data

Some things that are hard in the row iteration approach above become easier if we can keep all of the values in one object, like a list.

Let's revisit the problems above, by assembling the values from each row into a single list.

In [None]:
import csv
with open('random.csv', 'r') as csvfile:
    x = []
    itemreader = csv.reader(csvfile)
    for row in itemreader:
        x.append(float(row[0]))
x

The list x is now a single object holding all the values in the file.  In the iteration approach, csv.reader gave us each row as its own one-element list, which was then replaced by the next row's list on the following pass through the loop.  Because only one row was available at a time, we could not easily do calculations like a median.

Using the list (x), we should now have an easier time with mean and median calculations, using built-in functions like sum and len, and the list method sort:

In [None]:
mean = sum(x)/len(x)
print(mean)

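We can solve the median problem by sorting the list and taking the middle value.  When the list has an even number of values, as ours does with 50, the median is the average of the two middle values:

In [None]:
x.sort()
mid = len(x) // 2
# Odd length: the single middle value; even length: average of the two middle values
median = x[mid] if len(x) % 2 else (x[mid - 1] + x[mid]) / 2
print(median)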
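As a cross-check, the standard library's statistics module has a median function that handles the even and odd cases for us.  A small sketch on a made-up list (the values are illustrative, not taken from random.csv):

```python
import statistics

values = [0.9, 0.1, 0.5, 0.3]  # hypothetical data with an even number of items

# Manual approach: sort a copy, then average the two middle values
ordered = sorted(values)
mid = len(ordered) // 2
manual_median = (ordered[mid - 1] + ordered[mid]) / 2

# Library approach
library_median = statistics.median(values)

print(manual_median, library_median)
```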

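Because the list is now sorted, it is also simple to get the min and max: they are the first and last elements.  (The built-in functions min(x) and max(x) would also work, on a sorted or unsorted list, which is why we avoid using min and max as variable names.)

In [None]:
min_val = x[0]
min_val

In [None]:
max_val = x[-1]
max_val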

OK, this is progress.  Can we now do other math on the data, like multiply each value by 5?

In [None]:
y = x*5
y

Not what we wanted. That just concatenated 5 copies of the list together!

Let's take a different approach.

In [None]:
y = []
for item in x:
    y.append(item*5)
y
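The same loop can be written more compactly as a list comprehension.  A short sketch, with a made-up list standing in for x:

```python
x = [0.25, 0.5, 0.75]  # hypothetical stand-in for the values read from the file

# Build a new list by applying the same operation to each element
y = [item * 5 for item in x]
print(y)
```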

Alright, this is a big improvement over the iteration over rows approach, but it is still a bit tedious.

## Introducing Numpy

Now let's look at the use of arrays, the basic data structure added by Numpy, a library that creates, manipulates, and analyzes n-dimensional array objects, or ndarrays.  We will refer to this datatype as array for short.

When we're working with a library, it is nice to keep the documentation handy.  We can find the Numpy docs here: http://www.numpy.org/.  If you're unfamiliar with a new package, you might want to start from the tutorial to cover the basics.  

Arrays can be created from any sequence-like object, such as a list of integers or floats, for example.  We start by creating one from a list of floats.  

First, we have to import Numpy, which by convention we assign the shorter name np.

In [None]:
import numpy as np

data1 = [3.2, 4.5, 8.0, 9.9]
arr1 = np.array(data1)
arr1

Notice that if we pass a nested list of equal-length lists, we create a two-dimensional array:

In [None]:
data2 = [[1.1, 2.2, 3.3], [1.2, 2.3, 3.4]]
arr2 = np.array(data2)
arr2

We can see its type, number of dimensions, size (number of elements), and shape (length of each dimension) as follows:

In [None]:
type(arr2)

In [None]:
arr2.ndim

In [None]:
arr2.size

In [None]:
arr2.shape

We can check the data type of the contents of an array, or 'cast' it to a specified data type, like this:

In [None]:
arr2.dtype

In [None]:
arr3 = arr2.astype(np.int8)
arr3.dtype

The elements in an array are all required to have the same data type.  Among other types, they can be int8, int16, int32, int64, float16, float32, float64, bool. They can also be strings of fixed length, but Numpy is mostly used for arrays of numeric data.

Recall the range function to generate a range of integers?  Numpy has an analogous function called arange, which creates a one-dimensional array.  Combined with reshape, we can use it to create a multi-dimensional array:
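To see the single-type rule in action, here is a small sketch with made-up values.  It shows Numpy choosing one type when the input mixes integers and floats, and what a cast to an integer type does to fractional values:

```python
import numpy as np

# Mixing ints and floats: Numpy picks one type that can hold every value
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# Casting floats to integers drops the fractional part (no rounding)
floats = np.array([1.1, 2.9, 3.5])
ints = floats.astype(np.int8)
print(ints)
```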

In [None]:
arr = np.arange(15).reshape(3,5)
arr

Below, we load the same csv file with random numbers that we loaded earlier, using a built-in Numpy function, genfromtxt.

In [None]:
a = np.genfromtxt('random.csv')
a
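Our file has a single column, so no delimiter is needed.  For a file with several columns, genfromtxt takes a delimiter argument.  A small sketch, using made-up csv text in a StringIO object in place of a file on disk:

```python
import io
import numpy as np

# Hypothetical two-column csv text, used here instead of a file on disk
text = "0.1,1.5\n0.2,2.5\n0.3,3.5\n"

# genfromtxt accepts file-like objects; delimiter tells it how to split columns
table = np.genfromtxt(io.StringIO(text), delimiter=',')
print(table.shape)
print(table[:, 0])  # first column
```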

Numpy arrays are objects of type ndarray, often just called arrays. They are a special data type that supports fast, vectorized calculations.  Numpy calculations are carried out in low-level compiled C code behind the scenes, avoiding the need for explicit loops in Python, which are much slower than loops in compiled languages like C, C++ or Fortran.

Let's have a look at some of the built-in methods to explore Numpy arrays.

First, in place of len, Numpy has a size function that tells us the total number of elements in the array (for a multi-dimensional array, len would only give the length of the first dimension).

In [None]:
np.size(a)

Next, we can explore the shape of an array.  A one-dimensional array like this one shows as (length,).  A two-dimensional array with a single column shows as (length, 1), and one with a single row shows as (1, length).

In [None]:
np.shape(a)
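A quick sketch of how the shape changes between a one-dimensional array, a single row, and a single column (the arrays here are made up with arange and reshape):

```python
import numpy as np

flat = np.arange(5)            # one-dimensional
row = flat.reshape(1, 5)       # two-dimensional, one row
column = flat.reshape(5, 1)    # two-dimensional, one column

print(flat.shape, row.shape, column.shape)
```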

A lot of mathematical functions are readily built in to Numpy.  Here are a few for mean, standard deviation, variance, median, min, max, sum.

In [None]:
np.mean(a)

In [None]:
np.std(a)

In [None]:
np.var(a)

In [None]:
np.median(a)

In [None]:
np.min(a)

In [None]:
np.max(a)

In [None]:
np.sum(a)

Much easier than coding for loops and counters, don't you think?  And a lot faster on large datasets, too.
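To get a feel for the speed difference, here is a rough sketch that times a Python loop against np.sum on a made-up array of 100,000 values.  The timings will vary from machine to machine, so treat them as indicative only.

```python
import timeit
import numpy as np

values = np.random.uniform(0, 1, 100_000)  # hypothetical larger dataset

def loop_sum(data):
    """Sum the values one at a time, as in the row-iteration approach."""
    total = 0.0
    for v in data:
        total += v
    return total

# Time a few repetitions of each approach
loop_time = timeit.timeit(lambda: loop_sum(values), number=5)
numpy_time = timeit.timeit(lambda: np.sum(values), number=5)

print(f"loop: {loop_time:.4f}s   numpy: {numpy_time:.4f}s")
```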

Numpy also integrates very well with Matplotlib to generate charts.  Let's plot the elements of the array in sorted order.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(np.sort(a))

In [None]:
plt.hist(a)

The distribution above doesn't look very random, does it?  It seems to be intended as a uniform distribution from 0 to 1, but has a disproportionate sampling from above 0.8.  With only 50 values, that kind of unevenness can happen by chance.  Let's generate a new sample of 50 and see if it looks more uniform; try increasing the sample size as well.

In [None]:
b = np.random.uniform(0,1,50)
plt.hist(b)

And here is a sample from a standard normal distribution.

In [None]:
b = np.random.randn(50)
plt.hist(b)
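Random samples change every time the cell is run.  If we want repeatable numbers, one option is a seeded generator from np.random.default_rng.  The sketch below uses an arbitrary seed of 0, and also counts how a much larger uniform sample falls into 10 bins, without plotting:

```python
import numpy as np

# Assumption: seed 0 is arbitrary; any integer gives a repeatable sequence
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)

sample_a = rng_a.uniform(0, 1, 50)
sample_b = rng_b.uniform(0, 1, 50)

print(np.array_equal(sample_a, sample_b))  # same seed, same numbers

# Count how a larger uniform sample is spread across 10 bins
large = rng_a.uniform(0, 1, 100_000)
counts, edges = np.histogram(large, bins=10)
print(counts)
```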

The efficiency and productivity of coding using arrays becomes even more obvious when we consider element by element calculations -- multiplying each element by 5, for example.  No loops are needed when the calculations are vectorized.  If we write a*5 and a is an array, Numpy will perform an element by element multiplication, using fast C code to do the math.

In [None]:
b = a*5
b

Similarly, multiplying two arrays together produces an element by element multiplication.  In this case, instead of each element of a being multiplied by a constant (5), it is multiplied by the corresponding value (same index position) in array b.  In other words, the first elements of the two arrays are multiplied together, then the second elements, the third, and so on.

This works as long as the arrays have the same shape, or shapes that Numpy can make compatible through a process called broadcasting.  Dimensions are compared starting from the last one, and each pair must either be equal or one of them must be 1, in which case that array is stretched along that dimension to match.  Arrays of different lengths, such as 50 and 25, cannot be multiplied this way.  Broadcasting is how we were able to multiply a initially by a constant: the constant was broadcast to the length of a.

In [None]:
c = a*b
c
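Here is a small sketch of broadcasting with made-up arrays.  It shows a two-dimensional array multiplied by a row and by a column, and what happens when two lengths do not match:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
column = np.array([[1], [2]])       # shape (2, 1)

print(grid * row)      # row is applied to each of the two rows
print(grid * column)   # column is applied across each of the three columns

# Lengths 4 and 2 are not compatible, even though 4 is a multiple of 2
try:
    np.arange(4) * np.arange(2)
    compatible = True
except ValueError:
    compatible = False
print(compatible)
```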

### More Advanced Uses of Numpy

Numpy is a very powerful library for mathematical and scientific computing, well beyond the limited functionality above.  It forms the foundation for Scipy, a large library for scientific computing, and many other numeric libraries, including Pandas, which we will get to shortly.

But first, a glimpse of some more advanced uses of numpy.  To get a sense of the types of things Numpy can do, we can check out the list of Numpy functions by category from their docs.  Those are here: https://docs.scipy.org/doc/numpy/reference/routines.html.  Let's take a look...

Now let's look at two examples.  

#### Linear Algebra

Did you notice that when we did array calculations in Numpy, they were elementwise?  Those of you who have done any linear algebra and worked with matrix (array) libraries might be perplexed that standard matrix operations do not result from a * b, for example.  Not to worry.  Numpy can do linear algebra quite well - it just uses separate syntax for it.

For example, matrix multiplication is done with np.dot, which computes the matrix product of two two-dimensional arrays.

In [None]:
d = np.array([[1.0, 2.0], [3.0, 4.0]])
c = np.dot(d,d)
c

In [None]:
np.transpose(c)
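To make the contrast concrete, here is a short sketch comparing elementwise multiplication with the matrix product, using the same 2x2 array d.  It also uses the @ operator, which gives the same result as np.dot for two-dimensional arrays:

```python
import numpy as np

d = np.array([[1.0, 2.0], [3.0, 4.0]])

elementwise = d * d        # multiplies matching positions
matrix_product = d @ d     # same result as np.dot(d, d) for 2-D arrays

print(elementwise)
print(matrix_product)
print(np.array_equal(matrix_product, np.dot(d, d)))
```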

There is way more to Numpy than we have covered here, but if you need to do serious number crunching and advanced computation in Python, you will generally find Numpy to be a core part of the solution.

#### Generating Fractals

Here is a fun example of using Numpy to generate a Mandelbrot set from a Scipy tutorial (don't ask me to explain it).  If you get curious about the theory and math behind it, Wikipedia has a good page on it: https://en.wikipedia.org/wiki/Mandelbrot_set

In [None]:
import numpy as np
import matplotlib.pyplot as plt
def mandelbrot( h,w, maxit=20 ):
    """Returns an image of the Mandelbrot fractal of size (h,w)."""

    y,x = np.ogrid[ -1.4:1.4:h*1j, -2:0.8:w*1j ]
    c = x+y*1j
    z = c
    divtime = maxit + np.zeros(z.shape, dtype=int)

    for i in range(maxit):
        z = z**2 + c
        diverge = np.abs(z) > 2               # who is diverging (|z| > 2)
        div_now = diverge & (divtime==maxit)  # who is diverging now
        divtime[div_now] = i                  # note when
        z[diverge] = 2                        # avoid diverging too much

    return divtime

plt.imshow(mandelbrot(400,400))
plt.show()

## Introducing Pandas

We have moved from processing a file with iterators over rows, to lists with iterators, to using Numpy arrays to do vectorized operations that are much faster and are also less complex to read, understand and code. We moved from operating on one item in a list at a time, to operating on a whole array at one time. Now we want to be able to move to handling the whole table of data at once.

One of the problems we did not attempt to deal with using arrays, or lists, was how to keep rows of data together so that if we skip a missing value in one variable like price, it does not cause the other variables to be mis-aligned due to changes in the length of the array or list for one entry. This is one of many things that the pandas library does for us.

We skip forward a bit to Chapter 6 in Python for Data Analysis in order to learn how to load our data using pandas.
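As a preview of the alignment problem described above, here is a minimal sketch with a made-up table (the column names and values are hypothetical).  Dropping the row with a missing price removes the whole row, so the remaining columns stay lined up:

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing price
homes = pd.DataFrame({
    'price': [250000.0, np.nan, 410000.0],
    'sqft': [1200, 1500, 2000],
})

# Dropping the row with the missing price removes its sqft value too,
# so the remaining prices and sizes still line up row by row
complete = homes.dropna(subset=['price'])
print(complete)
```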

We also want to keep the Pandas docs handy here: http://pandas.pydata.org/

Here is how we start with loading data:

In [None]:
import pandas as pd
df = pd.read_csv('ca_tracts_pop_cleaned.csv')
df[:5]

Notice a few things that have happened here.

1. pd.read_csv has enough built-in smarts to read the first row of the file and get the variable names from it.
1. It then read all rows in the file, and used them to create a pandas DataFrame, which is like a set of Numpy arrays we can treat as a table.
1. It created an automatic unique index, beginning with zero.
1. It inferred the type of each variable from the data.  Columns that contain only numbers, like Population, are read as numeric types, while columns containing text, like geodisplay, are read as strings.
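The same behavior can be seen on a tiny example.  This sketch uses made-up csv text in a StringIO object rather than the census file, so the column names and values are hypothetical:

```python
import io
import pandas as pd

# Hypothetical csv text; the column names and values are made up
text = "name,population\nTract A,4200\nTract B,3100\n"

sample = pd.read_csv(io.StringIO(text))

print(sample.columns.tolist())  # names taken from the first row
print(sample.index.tolist())    # automatic index starting at zero
print(sample.dtypes)            # one inferred type per column
```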

Let's explore this pandas DataFrame to learn some of its features.  Note that a pandas Series is like one column of this DataFrame, coupled with its own index column.  So the main difference between a Series and a DataFrame is that the latter has multiple columns. Columns can be of different data types, but within a column, must be consistent.
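A short sketch of the difference between a Series and a DataFrame, using a made-up two-column table (the values are illustrative only):

```python
import pandas as pd

table = pd.DataFrame({'city': ['Oakland', 'Berkeley'], 'population': [400, 120]})

one_column = table['population']      # single brackets: a Series
sub_table = table[['population']]     # double brackets: a one-column DataFrame

print(type(one_column).__name__, type(sub_table).__name__)
print(one_column.index.equals(table.index))  # the Series shares the table's index
```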

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.dtypes

We can select subsets of the rows by indexing, and select specific columns by their name:

In [None]:
df['Population'][:10]

We can get all of our statistics on the Population column in one short command:

In [None]:
df['Population'].describe()

Or get these values 'a la carte'.  You might recognize that these are essentially Numpy functions, but that in pandas we can now deal with multiple data types.

In [None]:
df['Population'].min()

In [None]:
df['Population'].max()

In [None]:
df['Population'].median()

We can also do some plotting of the data without much effort:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# df.hist creates its own figure, so a separate plt.figure() call is not needed
df.hist(column='Population', bins=50)

Note that we can use string operations we have already learned, and apply those to columns that are of type string.  The syntax is a bit different when used with pandas DataFrames.  Here, we create a State column and add it to df by splitting geodisplay and getting its third element.  Note that these string methods are reviewed in Python for Data Analysis, page 206.  More advanced regex methods and other vectorized string methods are covered in pages 207-212.

In [None]:
# The final .str.strip() removes any whitespace left over after splitting on the comma
df['State'] = df['geodisplay'].str.split(',').str[2].str.strip()
df[:5]

Next let's create a Tract column with just the tract number, and remove the 'Census Tract ' from it.

In [None]:
# Caution: str.strip removes any of the listed characters from both ends, not the exact phrase
df['Tract']=df['geodisplay'].str.split(',').str[0].str.strip('Census Tract ')
df[:5]

Here is another way to do the last step, using replace, which removes the exact phrase rather than a set of characters.

In [None]:
df['Tract']=df['geodisplay'].str.split(',').str[0].str.replace('Census Tract ', '')
df[:5]
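The two approaches can give different results.  In this sketch the labels are made up, and the second one ends in a letter that also appears in the phrase 'Census Tract ', which is enough to trip up strip:

```python
import pandas as pd

# Hypothetical labels; the second one ends in a letter that also appears in the phrase
labels = pd.Series(['Census Tract 4001', 'Census Tract 22a'])

stripped = labels.str.strip('Census Tract ')
replaced = labels.str.replace('Census Tract ', '', regex=False)

print(stripped.tolist())
print(replaced.tolist())
```

With strip, the trailing 'a' is removed because it is one of the characters in the argument; replace leaves it alone.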