# Lab 2: Working with Data

- **Author:** Niall Keleher ([nkeleher@uw.edu](mailto:nkeleher@uw.edu))
- **Date:** 04 April 2016
- **Course:** INFO 371: Core Methods in Data Science

### Learning Objectives:
By the end of the lab, you will be able to:
* read files into Python
* work with Series and Data Frames using the Pandas library
* produce basic graphs using the Matplotlib library

### To do before lab:
* Watch 10-minute tour of pandas: https://vimeo.com/59324550
* Read Chapters 3-5 and Chapter 9 of McKinney (2013): Python for Data Analysis. O’Reilly Media, Inc.
* Read and complete lessons 1-7 of Learn Pandas (https://bitbucket.org/hrojas/learn-pandas)

### Topics:
1. Numpy
2. Getting data into Python
3. Pandas
4. Matplotlib
    * line graph, bar chart, histogram, scatterplot
    * working with axes, legend, text
5. Basics of functions

### References: 
 * [Scipy Lecture Notes](http://www.scipy-lectures.org/)
 * [Numpy](http://docs.scipy.org/doc/numpy-1.10.0/reference/)
 * [Pandas](http://pandas.pydata.org/pandas-docs/stable/)
 * [Matplotlib Gallery](http://matplotlib.org/gallery.html)
 * [Learn Pandas, by Hernan Rojas](https://bitbucket.org/hrojas/learn-pandas)
 * [Introduction to Statistical Learning, Lab #1](http://www-bcf.usc.edu/~gareth/ISL/Chapter%202%20Lab.txt)
 * Additional Notebooks:
     * [UW EScience Institute, Python Seminar 2015](https://github.com/uwescience/python-seminar-2015/)
     * [Introduction to Simulation and Modeling, by Skipper Seabold](https://github.com/jseabold/csc432-notebooks/)
     * [YHat](https://github.com/yhat/DataGotham2013/tree/master/notebooks)


### 1. Numpy

***array oriented computing***
* Provides high-performance scientific computing and data analysis
* Multi-dimensional arrays
* Highly efficient

http://www.scipy-lectures.org/intro/numpy/index.html


#### Import module

In [None]:
import numpy

#### Explore module documentation

In [None]:
numpy?

#### Import module (using shorthand)

In [None]:
import numpy as np

#### Create an array

In [None]:
ages_arr = np.array([43, 23, 32, 45, 26, 23])

In [None]:
ages_arr

In [None]:
ages_arr_float = np.array([43, 23, 32, 45, 26, 23], dtype=float)

#### Dimensions of the array

In [None]:
ages_arr.ndim

#### Shape of the array

In [None]:
ages_arr.shape

#### Mean value of the array

In [None]:
ages_arr.mean()

#### Minimum and maximum of the array

In [None]:
ages_arr.min()

In [None]:
ages_arr.max()

#### Total of the array

In [None]:
ages_arr.sum()

#### Putting it all together

In [None]:
print('Data type                :', ages_arr.dtype)
print('Total number of elements :', ages_arr.size)
print('Number of dimensions     :', ages_arr.ndim)
print('Shape (dimensionality)   :', ages_arr.shape)
print('Memory used (in bytes)   :', ages_arr.nbytes)

In [None]:
print('Minimum and maximum             :', ages_arr.min(), ages_arr.max())
print('Sum and product of all elements :', ages_arr.sum(), ages_arr.prod())
print('Mean and standard deviation     :', ages_arr.mean(), ages_arr.std())

### Other ways to create an array

In [None]:
a = np.arange(10)

In [None]:
a

In [None]:
b = np.arange(1, 9, 2) # start, end (exclusive), step

In [None]:
b

In [None]:
c = np.linspace(0, 1, 6)

In [None]:
c

In [None]:
d = np.zeros(3, dtype=int)

In [None]:
d

#### Why use arrays?

In [None]:
plain_list = range(1000)

%timeit [i**2 for i in plain_list]

In [None]:
arr = np.arange(1000)
%timeit arr**2
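
The `%timeit` magic is only available inside IPython/Jupyter. As a rough sketch of the same comparison in plain Python, the standard-library `timeit` module can be used instead. The repetition count `n_runs` is an arbitrary choice, and absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np

plain_list = list(range(1000))
arr = np.arange(1000)

# Both approaches compute the same squares ...
squares_list = [i**2 for i in plain_list]
squares_arr = arr**2

# ... but the array version runs one vectorized operation
# instead of a Python-level loop over every element.
n_runs = 1000  # arbitrary number of repetitions
t_list = timeit.timeit(lambda: [i**2 for i in plain_list], number=n_runs)
t_arr = timeit.timeit(lambda: arr**2, number=n_runs)

print('list comprehension :', t_list, 'seconds for', n_runs, 'runs')
print('numpy array        :', t_arr, 'seconds for', n_runs, 'runs')
```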

### Multidimensional Arrays

In [None]:
multi_lst = [[1, 2], [3, 4]]
multi_arr = np.array([[1, 2], [3, 4]])

In [None]:
multi_arr

#### Basic Operations

In [None]:
x = np.arange(20, 101, 20)
x

In [None]:
y = np.arange(100, 19, -20)
y

In [None]:
y - x

In [None]:
x ** 2

In [None]:
x < 50

In [None]:
x / 2 # be careful: `/` is true division in Python 3 (returns floats); use `//` for integer division

In [None]:
A = np.array([[1, 2, 3],
              [4, 2, 3],
              [1, 3, 6]])

In [None]:
B = np.array([[1, 2, 3],
              [4, 1, 2],
              [2, 3, 5]])

In [None]:
A * B # element-wise

In [None]:
np.dot(A, B) # dot product (aka matrix multiplication)
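
To make the difference between the two products concrete, here is a small worked sketch on 2x2 arrays. The names `P` and `Q` are illustrative and not part of the lab data.

```python
import numpy as np

P = np.array([[1, 2],
              [3, 4]])
Q = np.array([[5, 6],
              [7, 8]])

# Element-wise: multiply entries that sit in the same position
elementwise = P * Q          # [[1*5, 2*6], [3*7, 4*8]] = [[5, 12], [21, 32]]

# Matrix product: each entry is a row of P times a column of Q
matrix_prod = np.dot(P, Q)   # [[1*5+2*7, 1*6+2*8], [3*5+4*7, 3*6+4*8]] = [[19, 22], [43, 50]]

print(elementwise)
print(matrix_prod)
```

In Python 3.5 and later, `P @ Q` is an equivalent way to write `np.dot(P, Q)` for 2-D arrays.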

#### Indexing and slicing

In [None]:
multi_lst[0]

In [None]:
multi_lst[1]

In [None]:
multi_lst[1][1]

In [None]:
np.random.seed(1234)
a = np.random.randint(0, 20, 15)
a
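
Arrays support the same `start:stop:step` slicing as lists, plus comma-separated indices for multiple dimensions. The sketch below uses small illustrative arrays (`v` and `m` are names chosen for this example) rather than the random array above, so the results are easy to verify by eye.

```python
import numpy as np

v = np.arange(10, 20)            # the integers 10 through 19
m = np.array([[1, 2], [3, 4]])

first_three = v[:3]    # elements at positions 0, 1, 2
every_other = v[::2]   # every second element
last_two = v[-2:]      # negative indices count from the end

one_cell = m[1, 1]     # row 1, column 1 (same value as multi_lst[1][1])
first_col = m[:, 0]    # all rows, column 0: a nested list has no direct equivalent

print(first_three, every_other, last_two)
print(one_cell, first_col)
```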

#### Boolean Indexing

In [None]:
idx = a % 3 == 0
idx

In [None]:
a[idx]

In [None]:
a[idx] += 1
a

#### Indexing with Integer Arrays

In [None]:
a[[0, 2, 5, 10]]

In [None]:
a[[2,0,2,0,2,0]] # repeat elements

***

### 2. Getting data into Python

#### Load a csv file

In [None]:
datafilepath = 'data/Auto.csv'

##### Check data from shell

In [None]:
!cat 'data/Auto.csv' | awk 'NR == 1'

In [None]:
!head -4 'data/Auto.csv' 

#### Load a csv while reading column names from the header

In [None]:
np.loadtxt(datafilepath, delimiter=",") # Be careful: raises an error because of the header row and the text column

In [None]:
auto_ndarray = np.genfromtxt(datafilepath, dtype=None, delimiter=',', names=True)

In [None]:
auto_ndarray['mpg'][0:11]
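
The contrast between the two loaders can be sketched without the data file by reading from an in-memory string. The rows below are made up for illustration and are not the contents of `data/Auto.csv`; the explicit `encoding` argument is an assumption that a reasonably recent NumPy is installed.

```python
from io import StringIO

import numpy as np

# Made-up rows for illustration; NOT the contents of data/Auto.csv
csv_text = "mpg,cylinders,name\n18.0,8,car_a\n15.0,8,car_b\n32.5,4,car_c\n"

# np.loadtxt tries to convert every field to a float by default,
# so the header row and the text column raise a ValueError
try:
    np.loadtxt(StringIO(csv_text), delimiter=",")
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

# np.genfromtxt can take field names from the header (names=True)
# and infer a separate type for each column (dtype=None)
records = np.genfromtxt(StringIO(csv_text), delimiter=",", names=True,
                        dtype=None, encoding="utf-8")

print(loadtxt_failed)
print(records.dtype.names)
print(records["mpg"])
```

The result is a structured array, which is why columns are selected by name, as in `auto_ndarray['mpg']`.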

***

### 3. Pandas

http://www.scipy-lectures.org/packages/statistics/index.html


#### Import module

In [None]:
import pandas as pd

#### Create a Series

In [None]:
a_series = pd.Series([1,2,3,4])

In [None]:
a_series

In [None]:
b_series = pd.Series(np.arange(10))

In [None]:
b_series

#### Create DataFrame

In [None]:
df = pd.DataFrame({'a': [10,20,30],
                   'b': [40,50,60]})

In [None]:
df

#### Mean value of a series

In [None]:
b_series.mean()

In [None]:
df['b'].mean()

In [None]:
df.b.mean()

#### Minimum and maximum value of a series

In [None]:
b_series.min()

In [None]:
b_series.max()

#### Full summary data

In [None]:
df.b.describe()

***

### Loading data from csv to pandas DataFrame

In [None]:
# loading a csv
auto_df = pd.read_csv('data/Auto.csv')

In [None]:
auto_df.head()

Load data from Excel files:
* `excel_df = pd.read_excel('data.xls')`

Load data from JSON files:
* `json_df = pd.read_json('data.json')`

In [None]:
len(auto_df)

In [None]:
auto_df = auto_df.set_index(['name'])

In [None]:
auto_df.head()

In [None]:
byOrigin = auto_df.groupby('origin')

In [None]:
byOrigin.horsepower.describe()
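
As a minimal sketch of what `groupby` does, the example below splits a tiny DataFrame by a key column and summarizes each group separately. The values are invented for illustration and are not taken from `data/Auto.csv`.

```python
import pandas as pd

# Made-up values for illustration; not taken from data/Auto.csv
toy_df = pd.DataFrame({'origin': [1, 1, 2, 2, 3],
                       'horsepower': [130, 150, 90, 70, 65]})

by_origin = toy_df.groupby('origin')

group_sizes = by_origin.size()                 # number of rows in each group
group_means = by_origin['horsepower'].mean()   # one mean per group
group_stats = by_origin['horsepower'].agg(['min', 'max', 'mean'])

print(group_sizes)
print(group_stats)
```

Calling `.describe()` on a grouped column, as above, returns a fuller set of summary statistics for every group in one table.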

#### Exercise (adapted from Introduction to Statistical Learning, James et al. (2013), Chapter 2, Exercise 9)

http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf (page 56)

a) Which predictors are quantitative and which are qualitative?

b) What is the *range* of each quantitative predictor?

c) What are the mean and standard deviation of each quantitative predictor?

d) Now remove the 10th through 85th observations. What are the range, mean, and standard deviation of **mpg**?

e) What is the mean and standard deviation of **mpg** for US-made cars (origin = 1)?

f) What are the mean and standard deviation of **mpg** for European-made cars (origin = 2)?

### 4. Matplotlib

http://www.scipy-lectures.org/intro/matplotlib/matplotlib.html


#### Enable inline display of matplotlib plots

In [None]:
%matplotlib inline

#### Import module

In [None]:
import matplotlib.pyplot as plt

#### Boxplots

In [None]:
plt.figure()
plt.boxplot(auto_df['mpg'])

plt.show()

#### Histograms

In [None]:
plt.figure()
plt.hist(auto_df['cylinders'])

plt.show()

#### Scatter Plots

In [None]:
plt.figure()
plt.scatter(auto_df['mpg'], auto_df['weight'])

plt.show()
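
The topic list also mentions line graphs and working with axes, legends, and text. The sketch below shows one way to do this with Matplotlib's axes methods. It uses synthetic data rather than `auto_df`, and the labels, limits, and annotation position are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only
xs = np.linspace(0, 10, 50)
trend = 2 * xs + 1
noisy = trend + np.random.RandomState(0).normal(0, 2, size=xs.size)

fig, ax = plt.subplots()
ax.plot(xs, trend, label='trend')                       # line graph
ax.scatter(xs, noisy, alpha=0.5, label='observations')  # scatter on the same axes

ax.set_xlabel('x value')
ax.set_ylabel('y value')
ax.set_title('Line and scatter on shared axes')
ax.set_xlim(0, 10)
ax.legend(loc='upper left')     # uses the label= arguments above
ax.text(6, 5, 'y = 2x + 1')     # text placed at data coordinates (6, 5)

plt.show()
```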

In [None]:
from pandas.plotting import scatter_matrix  # located in pandas.tools.plotting before pandas 0.20

In [None]:
plt.figure()
scatter_matrix(auto_df, alpha=0.2, figsize=(12, 12), diagonal='kde')
plt.show()

***

### 5. Basics of functions

http://www.scipy-lectures.org/intro/language/functions.html

https://www.python.org/dev/peps/pep-0257/


In [None]:
def printMax(x, y):
    # Create the docstring
    '''Prints out the maximum of two values'''
    # if x is larger than y
    if x > y:
        # then print this
        print(x, 'is maximum')
    # if x is equal to y
    elif x == y:
        # print this
        print(x, 'is equal to', y)
    # otherwise
    else:
        # print this
        print(y, 'is maximum')

In [None]:
printMax?

In [None]:
print(printMax.__doc__)

In [None]:
printMax(3,4)
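
`printMax` prints its result, so it returns `None` and its output cannot be reused. A function that returns a value can be assigned to a variable and passed on to other code. The sketch below illustrates this with a hypothetical helper, `value_range`, which also shows an optional argument with a default value; neither the name nor the signature comes from the lab materials.

```python
def value_range(values, round_to=None):
    '''Return the difference between the largest and smallest value.

    round_to is an optional number of decimal places for the result.
    '''
    result = max(values) - min(values)
    # Only round when the caller asks for it
    if round_to is not None:
        result = round(result, round_to)
    return result


# The return value can be stored and reused
spread = value_range([43, 23, 32, 45, 26, 23])
rounded = value_range([1.234, 3.5], round_to=1)

print(spread, rounded)
```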

***

#### Exercise (adapted from Introduction to Statistical Learning, James et al. (2013), Chapter 2, Exercise 8)

http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf (pages 54-55)

This exercise relates to the **College** data set, which can be found in the file **College.csv**. It contains a number of variables for 777 different universities and colleges in the US.

The variables are:
* **Private** : Public/private indicator
* **Apps** : Number of applications received
* **Accept** : Number of applications accepted
* **Enroll** : Number of new students enrolled
* **Top10perc** : New students from top 10% of high school class
* **Top25perc** : New students from top 25% of high school class
* **F.Undergrad** : Number of full-time undergraduates
* **P.Undergrad** : Number of part-time undergraduates
* **Outstate** : Out-of-state tuition
* **Room.Board** : Room and board costs
* **Books** : Estimated book costs
* **Personal** : Estimated personal spending
* **PhD** : Percent of faculty with Ph.D.'s
* **Terminal** : Percent of faculty with terminal degree
* **S.F.Ratio** : Student/faculty ratio
* **perc.alumni** : Percent of alumni who donate
* **Expend** : Instructional expenditure per student
* **Grad.Rate** : Graduation rate

a) Using pandas, read the data into Python. Call the loaded data **college**. Make sure that you have the directory set to the correct location for the data.

b) Review the data set. Set the index to the college name.

c) Summarize the data (mean, standard deviation, range)

d) Create a scatterplot matrix of the first ten variables.

e) Create side-by-side boxplots of the **Outstate** for public and private schools.

f) Create a new qualitative variable, called **Elite**, by *binning* the **Top10perc** variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Next, create a boxplot that compares **Outstate** for the two categories of **Elite**.