# Lecture 12: Pandas, Matplotlib and Numpy

__Reading Material:__
- [Pandas Tutorial](https://pandas.pydata.org/pandas-docs/stable/tutorials.html)
- Pandas Basics Cheat Sheet (on CCLE)

## Pandas

We use the modules pandas and matplotlib to import a dataset and create a visualization. The pandas text readers expect a data file whose rows are separated by newlines and whose columns are separated by a delimiter that you specify. By default they also treat the first row as the names of your columns. We start with the dataset on LSD and math scores from
[this page](http://stat.ufl.edu/~winner/datasets.html).

In [1]:
with open('lsd.txt') as infile:
    f = infile.read()
print(f)

drug	math
1.17	78.93
2.97	58.20
3.26	67.47
4.69	37.47
5.83	45.65
6.00	32.92
6.41	29.97



We first import the data from the .txt file into pandas as follows:

In [None]:
import pandas as pd
lsd = pd.read_table('lsd.txt', sep='\t')  # columns are separated by tabs
print(lsd)
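To see the separator and header assumptions in isolation, here is a minimal sketch that parses a small in-memory string instead of a file. The string `text` is a hypothetical copy of the first three rows of `lsd.txt`, and `pd.read_csv` with `sep="\t"` is used as a stand-in for `pd.read_table`.

```python
import io
import pandas as pd

# Hypothetical in-memory copy of the first rows of lsd.txt (tab-separated).
text = "drug\tmath\n1.17\t78.93\n2.97\t58.20\n3.26\t67.47\n"

# sep names the column separator; the first row supplies the column names.
sample = pd.read_csv(io.StringIO(text), sep="\t")

print(sample.columns.tolist())  # ['drug', 'math']
print(sample.shape)             # (3, 2)
```

Changing `sep` to a character that does not occur in the text would leave each row as a single column, which is a quick way to spot a wrong separator.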

In [None]:
lsd['math']

Pandas lets us select individual columns by name using __lsd['math']__ and __lsd['drug']__.
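As a small illustration of column access, the sketch below rebuilds the table by hand so that it runs without the data file. The name `lsd_demo` is hypothetical; the values are copied from the listing of `lsd.txt` above.

```python
import pandas as pd

# Values copied from the lsd.txt listing above.
lsd_demo = pd.DataFrame({
    "drug": [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41],
    "math": [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97],
})

math_col = lsd_demo["math"]       # selecting one column gives a pandas Series
print(type(math_col).__name__)    # Series
print(len(math_col))              # 7
print(math_col.max())             # 78.93
print(math_col.tolist()[:2])      # [78.93, 58.2]
```

A Series supports indexing, `len`, and iteration much like a list, and `tolist()` converts it to a plain list when one is needed.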

We can pass these columns directly to pyplot, just as we would pass lists, to create a scatter plot.

In [None]:
import matplotlib.pyplot as plt 
plt.plot(lsd['drug'],lsd['math'],'o')
plt.axis([0, 7, 0, 100])
plt.ylabel('Math Score')
plt.xlabel('LSD Tissue Concentration') 
plt.title('Math Scores and Drug Concentrations') 
plt.show()

We now do the same for a data set on "Comparison of 6 Lengths of Chop Sticks of Feeding Efficiency" on the [dataset page](http://stat.ufl.edu/~winner/datasets.html) and use a bar plot to visualize the data set.

In [None]:
cs = pd.read_table('chopstick2_rcb.dat', sep=r'\s+')  # columns are separated by whitespace
print(cs)

In [None]:
import numpy as np
cs = pd.read_table('chopstick2_rcb.dat', sep=r'\s+')
type1 = cs['type']==1 
type2 = cs['type']==2
type3 = cs['type']==3
type4 = cs['type']==4
type5 = cs['type']==5
type6 = cs['type']==6
good = cs['eff'] >= 27
fair = (cs['eff']>= 22) & (cs['eff'] < 27)
poor = cs['eff']<22 
barwidth=.2
plt.axis([-1, 6, 0, 20])
plt.ylabel('Frequency')
plt.xlabel('Chopsticks Type')
plt.title('Comparison of 6 Lengths of Chopsticks of Feeding Efficiency') 
# Place each tick under the middle bar of its group of three
plt.xticks(np.arange(6)+barwidth,['Type1','Type2','Type3','Type4','Type5','Type6']) 

plt.bar(np.arange(6),[len(cs[type1 & good].index),
                      len(cs[type2 & good].index),
                      len(cs[type3 & good].index), 
                      len(cs[type4 & good].index), 
                      len(cs[type5 & good].index), 
                      len(cs[type6 & good].index)], 
        barwidth, color='r', label='Good')
plt.bar(np.arange(6)+barwidth,[len(cs[type1 & fair].index),
                               len(cs[type2 & fair].index),
                               len(cs[type3 & fair].index), 
                               len(cs[type4 & fair].index), 
                               len(cs[type5 & fair].index), 
                               len(cs[type6 & fair].index)], 
        barwidth, color='b', label='fair')
plt.bar(np.arange(6)+2*barwidth,[len(cs[type1 & poor].index),
                                 len(cs[type2 & poor].index),
                                 len(cs[type3 & poor].index), 
                                 len(cs[type4 & poor].index), 
                                 len(cs[type5 & poor].index), 
                                 len(cs[type6 & poor].index)], 
        barwidth, color='g', label='poor')

plt.legend()
plt.show()
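The counts passed to the three `plt.bar` calls can also be computed in one step. The sketch below is one possible alternative, not the method used in the cell above: it bins the efficiency scores with `pd.cut` and tabulates them with `pd.crosstab`. The frame `demo` is a hypothetical stand-in with made-up scores for two chopstick types; only the column names `type` and `eff` and the thresholds 22 and 27 come from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the chopsticks data: two types, made-up scores.
demo = pd.DataFrame({
    "type": [1, 1, 1, 2, 2, 2],
    "eff":  [21.0, 22.0, 27.5, 26.9, 30.1, 28.0],
})

# Same thresholds as the masks above: poor < 22 <= fair < 27 <= good.
# right=False makes each bin include its left edge and exclude its right edge.
rating = pd.cut(demo["eff"], bins=[-np.inf, 22, 27, np.inf],
                right=False, labels=["poor", "fair", "good"])

# Rows are chopstick types, columns are ratings, entries are counts.
counts = pd.crosstab(demo["type"], rating)
print(counts)
```

Each column of `counts` then holds the bar heights for one rating, so `counts["good"]` could be passed to `plt.bar` in place of the six `len(...)` expressions.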

## Networks

Networks are sets of nodes that may pairwise be connected by links. Links may be directed or weighted, and the network might contain other information such as categories of nodes or links. We can store a network in a 2-dimensional array (a list of lists) such that the value at index i, j is 1 if nodes i and j are connected by a link, and 0 otherwise.

For example, here is a network with nodes 0, 1, 2, such that node 1 is connected to nodes 0 and 2:

In [None]:
N=[[0,1,0],[1,0,1],[0,1,0]]
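A few quick checks make the storage convention concrete. This is a minimal sketch on the same three-node network; the names `degrees` and `symmetric` are introduced here for illustration only.

```python
N = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

# N[i][j] == 1 means that nodes i and j are linked.
print(N[1][0] == 1)   # True: node 1 is linked to node 0
print(N[0][2] == 1)   # False: nodes 0 and 2 are not linked

# The number of links at a node (its degree) is the sum of its row.
degrees = [sum(row) for row in N]
print(degrees)        # [1, 2, 1]

# For an undirected network the matrix equals its own transpose.
n = len(N)
symmetric = all(N[i][j] == N[j][i] for i in range(n) for j in range(n))
print(symmetric)      # True
```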

Let's write a function that takes any network in this form, and plots it using __matplotlib__. It creates the x and y coordinates by placing the nodes equally spaced around a circle:

In [None]:
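import numpy as np
import matplotlib.pyplot as plt

def network_plot_circle(N):
    n = len(N)
    # Place node i at angle 2*pi*i/n on the unit circle
    x = [np.cos(2*np.pi*i/n) for i in range(n)]
    y = [np.sin(2*np.pi*i/n) for i in range(n)]
    # Visit each unordered pair (i, j) with j < i once and draw its link
    for i in range(n):
        for j in range(i):
            if N[i][j] == 1:
                plt.plot([x[i], x[j]], [y[i], y[j]], 'b')
    # Draw the nodes on top of the links
    plt.plot(x, y, 'ro')
    plt.show()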

In [None]:
network_plot_circle(N)

Our next example is the Zachary Karate Club social network. This is a well-known social network of friendships between 34 members of a karate club at a US university in the 1970s.

In [None]:
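with open("karate_edgeList.txt") as infile:
    karate = infile.read()
# Each line holds one link as two tab-separated node numbers
pairs = [s.split('\t') for s in karate.splitlines()]
pairs = [[int(i) for i in j] for j in pairs]
# The largest node number over all pairs gives the number of nodes
n = max(max(p) for p in pairs)
adjMatrix = [[0]*n for _ in range(n)]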
for p in pairs:
    adjMatrix[p[0]-1][p[1]-1]=1
    adjMatrix[p[1]-1][p[0]-1]=1

network_plot_circle(adjMatrix)
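The conversion from an edge list to an adjacency matrix can be tried without the data file. The sketch below wraps the same steps in a hypothetical helper, `edges_to_adjacency`, and runs it on a made-up four-node edge list. It assumes nodes are numbered from 1 and, unlike the cell above, splits on any whitespace rather than only on tabs.

```python
def edges_to_adjacency(text):
    """Build a symmetric 0/1 adjacency matrix from lines of 1-based node pairs."""
    pairs = [[int(tok) for tok in line.split()]
             for line in text.splitlines() if line.strip()]
    # The largest node number over all pairs gives the number of nodes.
    n = max(max(p) for p in pairs)
    adj = [[0] * n for _ in range(n)]
    for a, b in pairs:
        # Shift to 0-based indices and record the link in both directions.
        adj[a - 1][b - 1] = 1
        adj[b - 1][a - 1] = 1
    return adj

# Made-up edge list: links 1-2, 2-3 and 1-4.
sample_edges = "1\t2\n2\t3\n1\t4\n"
adj = edges_to_adjacency(sample_edges)
for row in adj:
    print(row)
# [0, 1, 0, 1]
# [1, 0, 1, 0]
# [0, 1, 0, 0]
# [1, 0, 0, 0]
```

The result can be passed to `network_plot_circle` in the same way as `adjMatrix`.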

### Exercises

- Create visualizations for several other datasets from the toy dataset page.

- Adapt the network plotting code so that it plots the nodes at uniform randomly chosen coordinates.

- Adapt the network plotting code so that it plots edges of two different colors, which the user can indicate by recording edges as 1s or 2s in their data.

- Adapt the network plotting code so that it takes as input a network and a list containing a subset of the nodes. It should plot those nodes in a different color from the rest, and place them next to each other on the circle.

- Adapt the network plotting code so that it plots edges of different thickness, depending on their value in the data.