# Intro to Data Analysis in Python

    Stephen McLaughlin
    PhD Student, UT Austin School of Information

#### If you haven't already done so, open a terminal window and use the following command to install the libraries we'll use below.

    pip install -U pandas scipy matplotlib
    
If you don't already have Python, install it [using Anaconda](https://www.continuum.io/downloads) for OS X / Windows / Linux or follow [this walkthrough](https://github.com/stevemclaugh/Python-Data-Workshop_ASIST-AWIT_Sept-2016/blob/master/README.md) for getting set up on OS X.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats

## Python Syntax Basics

In [None]:
print("Hello Jupyter!")
print("Hello" + " Jupyter!")

In [None]:
# Integer arithmetic

int1=10000
int2=150

print(int1 + int2)   # add
print(int1 - int2)   # subtract
print(int1 * int2)   # multiply
print(int1 // int2)  # divide (floor division keeps the result an integer; / returns a float)
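
As a small supplementary sketch (not part of the original workshop cells), the snippet below compares Python 3's division operators on the same two integers. The variable names `true_quotient`, `floor_quotient`, and `remainder` are introduced here only for illustration.

```python
int1 = 10000
int2 = 150

true_quotient = int1 / int2    # true division always returns a float
floor_quotient = int1 // int2  # floor division discards the fractional part
remainder = int1 % int2        # modulo gives what is left over

print(true_quotient)
print(floor_quotient)
print(remainder)

# The floor quotient and remainder together reconstruct the original integer
print(floor_quotient * int2 + remainder == int1)
```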

In [None]:
# Floating point arithmetic

float1=10000.0
float2=150.5

print(float1 + float2)  # add
print(float1 - float2)  # subtract
print(float1 * float2)  # multiply
print(float1 / float2)  # divide

In [None]:
## Turning a sentence string into a list of words

sentence="A green hunting cap squeezed the top of a fleshy balloon of a head."

word_list=sentence.split(" ")

print(word_list)
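
Once a sentence is a list of words, ordinary list tools apply to it. The following sketch, which assumes the same example sentence, counts the words, finds the longest one, and joins the list back into a string. The names `word_count`, `longest_word`, and `rejoined` are illustrative additions.

```python
sentence = "A green hunting cap squeezed the top of a fleshy balloon of a head."

# split() with no argument splits on any run of whitespace
word_list = sentence.split()

# Count the words and pick the longest one (the first one wins on a tie)
word_count = len(word_list)
longest_word = max(word_list, key=len)

# Join the words back together with single spaces
rejoined = " ".join(word_list)

print(word_count)
print(longest_word)
print(rejoined == sentence)
```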

## Working with Lists
#### Data: average weight of chickens in 8 population blocks

Snee, R.D. (1985) Graphical display of results of three-treatment randomized block experiments. _Applied Statistics_, 34, 71-7.

In [None]:
## Creating lists and assigning them to variables

control=[3.93, 3.78, 3.88, 3.93, 3.84, 3.75, 3.98, 3.84]

low_dose=[3.99, 3.96, 3.96, 4.03, 4.10, 4.02, 4.06, 3.92]

high_dose=[3.96, 3.94, 4.02, 4.06, 3.94, 4.09, 4.17, 4.12]

In [None]:
## Select an item from a list by its index

control[3]

In [None]:
## Split a list into sub-lists based on index ranges

print(control[2:5])

print(control[2:])

print(control[:4])
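
Slices also accept negative indices and a step value. This is a brief sketch using the same `control` list; the names `last_item`, `last_three`, and `every_other` are introduced only for this example.

```python
control = [3.93, 3.78, 3.88, 3.93, 3.84, 3.75, 3.98, 3.84]

last_item = control[-1]     # negative indices count back from the end
last_three = control[-3:]   # the final three items
every_other = control[::2]  # every second item, starting at index 0

print(last_item)
print(last_three)
print(every_other)
print(len(control))         # number of items in the list
```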

## Descriptive Statistics with NumPy and scipy.stats

In [None]:
## Use NumPy to calculate mean values for lists of numbers

import numpy as np

print(np.mean(control))

print(np.mean(low_dose))

print(np.mean(high_dose))
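
To see what `np.mean` is doing, the sketch below computes the same average by hand and then asks NumPy for two other common summaries. The sample standard deviation uses `ddof=1`, which is a choice made here for illustration; NumPy's default is `ddof=0`.

```python
import numpy as np

control = [3.93, 3.78, 3.88, 3.93, 3.84, 3.75, 3.98, 3.84]

# The mean is the sum of the values divided by how many there are
manual_mean = sum(control) / len(control)
numpy_mean = np.mean(control)

control_median = np.median(control)       # middle value of the sorted data
control_sample_sd = np.std(control, ddof=1)  # sample standard deviation (n - 1 denominator)

print(manual_mean)
print(numpy_mean)
print(control_median)
print(control_sample_sd)
```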

In [None]:
## Use scipy.stats to calculate several descriptive statistics in one step

import scipy.stats

print(scipy.stats.describe(control))

## Hypothesis Testing with scipy.stats

In [None]:
## Running t-tests (assuming data is distributed normally)

import scipy.stats

print(scipy.stats.ttest_ind(control, low_dose))

print(scipy.stats.ttest_ind(control, high_dose))

print(scipy.stats.ttest_ind(low_dose, high_dose))

# Returns t-statistic and two-tailed p-value
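
Because `ttest_ind` returns the statistic and the p-value together, the two can be unpacked into separate variables and compared with a significance threshold. The sketch below assumes a threshold of 0.05 (the name `alpha` and its value are illustrative choices, not part of the original analysis) and also runs Welch's variant via `equal_var=False`, which does not assume equal variances.

```python
import scipy.stats

control = [3.93, 3.78, 3.88, 3.93, 3.84, 3.75, 3.98, 3.84]
low_dose = [3.99, 3.96, 3.96, 4.03, 4.10, 4.02, 4.06, 3.92]

# Unpack the t-statistic and the two-tailed p-value
t_stat, p_value = scipy.stats.ttest_ind(control, low_dose)

alpha = 0.05  # assumed significance threshold, for illustration only
print(t_stat, p_value)
print("significant at alpha" if p_value < alpha else "not significant at alpha")

# Welch's t-test: same function, without the equal-variance assumption
welch_t, welch_p = scipy.stats.ttest_ind(control, low_dose, equal_var=False)
print(welch_t, welch_p)
```

A negative t-statistic here simply means the first sample's mean is lower than the second's.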

In [None]:
## Running Mann-Whitney U tests (a nonparametric test; no normality assumption)

print(scipy.stats.mannwhitneyu(control, low_dose))

print(scipy.stats.mannwhitneyu(control, high_dose))

print(scipy.stats.mannwhitneyu(low_dose, high_dose))
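
The default alternative hypothesis for `mannwhitneyu` has differed between SciPy releases, so it can be safer to state it explicitly. A minimal sketch, assuming a two-sided test is what is wanted:

```python
import scipy.stats

control = [3.93, 3.78, 3.88, 3.93, 3.84, 3.75, 3.98, 3.84]
low_dose = [3.99, 3.96, 3.96, 4.03, 4.10, 4.02, 4.06, 3.92]

# Request a two-sided test explicitly rather than relying on the default
u_stat, p_value = scipy.stats.mannwhitneyu(control, low_dose, alternative='two-sided')

print(u_stat)
print(p_value)
```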

## Working with Tabular Data in pandas

Example adapted from http://nbviewer.jupyter.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/01%20-%20Lesson.ipynb

In [None]:
## Create a table with 5 rows and 2 columns

import pandas as pd

names = ['Mel','Jessica','Mary','John','Bob']

births = [973, 155, 77, 578, 968]

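## zip() returns a one-time iterator in Python 3, so wrap it in list() to display and reuse it
baby_data = list(zip(names, births))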

baby_data

In [None]:
## Print each row in our table on a separate line

for row in baby_data:
    print(row)

In [None]:
## Load table into pandas DataFrame object

baby_df = pd.DataFrame(data = baby_data, columns=['Names', 'Births'])

baby_df
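
An alternative way to build the same table, sketched here as an optional variation, is to pass pandas a dictionary that maps column names to lists. This skips the `zip` step entirely.

```python
import pandas as pd

names = ['Mel', 'Jessica', 'Mary', 'John', 'Bob']
births = [973, 155, 77, 578, 968]

# Each dictionary key becomes a column name; each list becomes that column's values
baby_df = pd.DataFrame({'Names': names, 'Births': births})

print(baby_df.shape)          # (rows, columns)
print(list(baby_df.columns))
print(baby_df['Births'].sum())
```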

In [None]:
## Select first 3 rows in DataFrame using index range

baby_df[:3]

In [None]:
## Sort DataFrame by numerical values in 'Births' column

baby_df.sort_values(['Births'], ascending=False)

In [None]:
## Sort DataFrame alphabetically by string values in 'Names' column

baby_df.sort_values(['Names'], ascending=True)
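
Note that `sort_values` returns a new DataFrame and leaves the original untouched. The sketch below stores the sorted result under a new name (`sorted_df`, introduced here for illustration) and renumbers its rows with `reset_index(drop=True)`.

```python
import pandas as pd

names = ['Mel', 'Jessica', 'Mary', 'John', 'Bob']
births = [973, 155, 77, 578, 968]
baby_df = pd.DataFrame({'Names': names, 'Births': births})

# Sort by births, largest first, and renumber the rows from 0
sorted_df = baby_df.sort_values('Births', ascending=False).reset_index(drop=True)

print(sorted_df)

# The original DataFrame keeps its original row order
print(list(baby_df['Names']))
```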

In [None]:
## Calculate mean for values in 'Births' column

baby_df['Births'].mean()

In [None]:
## Same calculation as above, using NumPy directly

np.mean(baby_df['Births'])

In [None]:
## Return the max value in 'Births' column

baby_df['Births'].max()

In [None]:
## Return a full column as a pandas Series object

baby_df['Births']

In [None]:
## ...which we can convert to a plain Python list for convenient use outside pandas

list(baby_df['Births'])

## Importing Data from a CSV

#### Data: girth, height, and volume for black cherry trees

Data from https://vincentarelbundock.github.io/Rdatasets/datasets.html

In [None]:
import pandas as pd

cherry_tree_data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/trees.csv")

In [None]:
## Select first 5 rows in DataFrame

cherry_tree_data.head()

In [None]:
## Select last 5 rows in DataFrame

cherry_tree_data.tail()

In [None]:
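## Remove the unnecessary row-index column
## (pandas labels it 'Unnamed: 0' when its CSV header is blank; some copies of the file may label it 'rownames')

cherry_tree_data = cherry_tree_data.drop(columns=['Unnamed: 0', 'rownames'], errors='ignore')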

In [None]:
## Select first 5 rows in updated DataFrame

cherry_tree_data.head()

In [None]:
## Select rows with 'Height' values greater than 80

cherry_tree_data[cherry_tree_data['Height']>80]
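
Row filters can be combined with `&` (and) or `|` (or), as long as each condition is wrapped in parentheses. The sketch below uses a small stand-in table called `trees` with the same column names; its values are illustrative and are not the full dataset loaded above.

```python
import pandas as pd

# Stand-in table with illustrative values (not the full cherry tree dataset)
trees = pd.DataFrame({
    'Girth':  [8.3, 10.7, 11.0, 13.7, 17.3, 20.6],
    'Height': [70, 81, 75, 71, 81, 87],
})

# Each comparison produces a column of True/False values;
# & keeps only the rows where both conditions are True
tall_and_wide = trees[(trees['Height'] > 80) & (trees['Girth'] > 12)]

print(tall_and_wide)
```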

In [None]:
## Select 2 columns in DataFrame and display in different order

cherry_tree_data[['Height','Girth']]

In [None]:
## Calculate mean cherry tree height

cherry_tree_data['Height'].mean()

In [None]:
## Display descriptive statistics for values in 'Height' column

cherry_tree_data['Height'].describe()

## Creating Graphs with matplotlib

In [None]:
import matplotlib.pyplot as plt

## The following line tells Jupyter to display graphs within our Jupyter notebook:
%matplotlib inline

In [None]:
## Create a histogram for values in 'Height' column

cherry_tree_data['Height'].hist()

In [None]:
## Change matplotlib style and display histogram for same data

plt.style.use('ggplot')

plt.figure(figsize=(18,8))

cherry_tree_data['Height'].hist()
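
Histograms are easier to read with axis labels and a title. This is a hedged sketch using matplotlib's own `hist` on a short list of illustrative height values (the list `heights` is a stand-in, not the real column), with the bin count set to 5 as an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a plain script; omit inside Jupyter
import matplotlib.pyplot as plt

# Illustrative stand-in values, not the full 'Height' column
heights = [70, 65, 63, 72, 81, 83, 66, 75, 80, 75]

fig, ax = plt.subplots(figsize=(8, 4))

# hist() returns the count in each bin, the bin edges, and the drawn bars
counts, bin_edges, patches = ax.hist(heights, bins=5)

ax.set_xlabel("Height")
ax.set_ylabel("Number of trees")
ax.set_title("Distribution of tree heights")

print(counts)
print(bin_edges)
```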

In [None]:
## View list of built-in matplotlib styles

plt.style.available

In [None]:
## Plot 'Height' values as if they were a time series

cherry_tree_data['Height'].plot()

In [None]:
## Create a scatter plot for 'Height' vs. 'Girth' values

plt.scatter(cherry_tree_data['Height'], cherry_tree_data['Girth'])

In [None]:
## Test correlation for 'Height' vs. 'Girth' using scipy.stats function for Pearson's r

scipy.stats.pearsonr(cherry_tree_data['Height'],cherry_tree_data['Girth'])

# returns correlation coefficient and p-value
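
For readers curious about what Pearson's r measures, the sketch below computes it by hand with NumPy and checks the result against `scipy.stats.pearsonr`. The arrays `height` and `girth` hold illustrative stand-in values rather than the full dataset.

```python
import numpy as np
import scipy.stats

# Illustrative stand-in values, not the full dataset
height = np.array([70, 81, 75, 71, 81, 87], dtype=float)
girth = np.array([8.3, 10.7, 11.0, 13.7, 17.3, 20.6])

# Deviations of each value from its column mean
h_dev = height - height.mean()
g_dev = girth - girth.mean()

# Pearson's r: sum of paired deviation products, scaled by both spreads
manual_r = (h_dev * g_dev).sum() / np.sqrt((h_dev ** 2).sum() * (g_dev ** 2).sum())

scipy_r, p_value = scipy.stats.pearsonr(height, girth)

print(manual_r)
print(scipy_r)
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0 indicates little or none.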

## Further reading

#### General Python data analysis overviews
- http://www.scipy.org/getting-started.html
- http://www.scipy-lectures.org

#### Video courses on Lynda.com
To access videos on [Lynda](http://lynda.com), click “Log in,” then choose “Sign in with your organization portal.”
Type “utexas.edu” and enter your UT EID and password.
- “Introduction to Data Analysis with Python,” taught by Michele Vallisneri ([link](https://www.lynda.com/Numpy-tutorials/Introduction-Data-Analysis-Python/419162-2.html))
- “Up and Running with Python,” taught by Joe Marini ([link](http://www.lynda.com/Python-tutorials/Welcome/122467/142550-4.html))

#### NumPy quickstart
- http://docs.scipy.org/doc/numpy-dev/user/quickstart.html

#### Lessons for new pandas users
- http://pandas.pydata.org/pandas-docs/stable/tutorials.html#lessons-for-new-pandas-users

#### matplotlib graph gallery
- http://matplotlib.org/gallery.html

#### A collection of handy code snippets
- http://chrisalbon.com


<a rel="license"
     href="http://creativecommons.org/publicdomain/zero/1.0/">
    <img src="http://i.creativecommons.org/p/zero/1.0/88x31.png" style="border-style: none;" alt="CC0" />
  </a>