# Data Analysis With Python

## Yale Center for Research Computing

Kaylea Nelson

research.computing@yale.edu

# Data Analysis With Python

Python is a more general-purpose programming language than R or Matlab. It has become increasingly popular for data analysis and scientific computing, but these tasks require additional modules. Some of the more popular modules are:

* **NumPy** - N-dimensional array
* **SciPy** - Scientific computing (linear algebra, numerical integration, optimization, etc.)
* **Matplotlib** - 2D Plotting (similar to Matlab)
* **IPython** - Enhanced Interactive Console
* **SymPy** - Symbolic mathematics
* **Pandas** - Data analysis (provides a data frame structure similar to R)

NumPy, Pandas and Matplotlib are used in this presentation.


# How to get these packages?

## Download Anaconda

https://www.continuum.io/downloads

An easy installer for all three major operating systems (Windows, macOS and Linux). It includes NumPy, Matplotlib and Pandas out of the box, and additional packages are easy to add with the `conda` command.


## Brief Refresher on Python Data Structures

### Lists

Lists are the most commonly used data structure in Python. A list is an ordered collection of values; the values can be of any data type or object, and they don't all have to be the same type.

In [2]:
my_int_list = [199, 3, 3, 4]
my_int_list

[199, 3, 3, 4]

In [3]:
my_float_list = [1.1, 2.3, 3.8, 4.1]
my_float_list

[1.1, 2.3, 3.8, 4.1]

In [11]:
my_string_list = ['apple', 'orange', 'grape']
my_string_list

['apple', 'orange', 'grape']

In [4]:
my_mixed_list = ['apple', 1, 4.3]
my_mixed_list

['apple', 1, 4.3]

Accessing a list:

In [5]:
my_int_list[2]

3

In [6]:
my_int_list[1:3]

[3, 3]

In [7]:
my_int_list[-1]

4
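As a brief sketch of the indexing rules used above (the list name `values` is hypothetical, chosen only for illustration): indices start at 0, a slice `[start:stop]` includes `start` but excludes `stop`, and negative indices count from the end of the list.

```python
# Hypothetical example list, not part of the original presentation
values = [199, 3, 3, 4]

first = values[0]          # indices start at 0
middle = values[1:3]       # slice includes index 1, excludes index 3
last_two = values[-2:]     # negative indices count from the end
every_other = values[::2]  # a third slice argument sets the step

print(first)        # 199
print(middle)       # [3, 3]
print(last_two)     # [3, 4]
print(every_other)  # [199, 3]
```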

Adding to a list:

In [8]:
my_int_list.append(10)
my_int_list

[199, 3, 3, 4, 10]

Joining ('concatenating') lists:

In [13]:
additions_to_list = ['pineapple', 'mango']
my_string_list + additions_to_list

['apple', 'orange', 'grape', 'pineapple', 'mango']
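Note that `+` builds a new list and leaves both originals unchanged, whereas `extend` modifies a list in place. A minimal sketch, using hypothetical names `fruits` and `extras`:

```python
# Hypothetical lists for illustration
fruits = ['apple', 'orange', 'grape']
extras = ['pineapple', 'mango']

combined = fruits + extras   # builds a new list
print(len(fruits))           # 3, the original is unchanged

fruits.extend(extras)        # modifies fruits in place
print(fruits == combined)    # True
```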

### Dictionaries

A dictionary is a data structure made up of 'key' and 'value' pairs. Keys are usually strings, but they can be any hashable (immutable) object, such as a number or a tuple. Values can be any data type and don't all need to be the same. Items in a dictionary are accessed by passing the dictionary a 'key', for which it returns the paired 'value'.

In [2]:
from datetime import date

bootcamp = {'name': 'Data Analysis with Python',
            'date': date.today(),
            'num_attendees': 11,
            'topics': ['numpy', 'pandas', 'matplotlib']}

In [16]:
bootcamp['date']

datetime.date(2017, 8, 23)

In [17]:
bootcamp['topics']

['numpy', 'pandas', 'matplotlib']

In [18]:
bootcamp.keys()

['date', 'topics', 'name', 'num_attendees']

In [19]:
bootcamp.values()

[datetime.date(2017, 8, 23),
 ['numpy', 'pandas', 'matplotlib'],
 'Data Analysis with Python',
 11]

Add a new value:

In [20]:
bootcamp['location'] = '160 St. Ronan Street'
bootcamp['location']

'160 St. Ronan Street'
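A minimal sketch of a few other common dictionary operations, using a hypothetical `workshop` dictionary: testing for a key with `in`, looking up a key with a fallback via `.get()`, and using non-string keys (any hashable object, such as an integer or a tuple, can serve as a key).

```python
# Hypothetical dictionary for illustration
workshop = {'name': 'Data Analysis with Python', 'num_attendees': 11}

# Keys can be any hashable object, e.g. integers or tuples
workshop[2017] = 'year as an integer key'
workshop[('room', 1)] = 'tuple key'

has_name = 'name' in workshop                        # membership test on keys
instructor = workshop.get('instructor', 'unknown')   # default instead of a KeyError

print(has_name)    # True
print(instructor)  # unknown
```

Using `.get()` is a reasonable choice when a missing key is expected; plain `workshop['instructor']` would raise a `KeyError` instead.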

### Other Common Data Structures

**Tuples** are 'immutable' lists, meaning that once they are created they **cannot** be changed.

In [24]:
my_tuple = (1, 4.5, 'orange')
my_tuple[2]

'orange'
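To illustrate what immutability means in practice, here is a small sketch (the name `point` is hypothetical): assigning to an element of a tuple raises a `TypeError`, while reading and unpacking work as with a list.

```python
# Hypothetical tuple for illustration
point = (1, 4.5, 'orange')

try:
    point[0] = 2          # item assignment is not supported on tuples
except TypeError as err:
    message = str(err)
    print(message)

# Tuples can still be unpacked into separate names
number, value, fruit = point
print(fruit)  # orange
```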

**Sets** can only contain **unique** values; duplicates are dropped when the set is built.

In [25]:
list_one = [1, 5, 7, 3, 9]
list_two = [3, 6, 5, 8, 2]
my_set = set(list_one + list_two)
my_set

{1, 2, 3, 5, 6, 7, 8, 9}
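Sets also support the usual set operations. A short sketch, reusing the values above under the hypothetical names `set_one` and `set_two`:

```python
set_one = set([1, 5, 7, 3, 9])
set_two = set([3, 6, 5, 8, 2])

union = set_one | set_two        # values in either set
common = set_one & set_two       # values in both sets
only_in_one = set_one - set_two  # values in set_one but not set_two

# sorted() returns a list, which gives a predictable print order
print(sorted(union))        # [1, 2, 3, 5, 6, 7, 8, 9]
print(sorted(common))       # [3, 5]
print(sorted(only_in_one))  # [1, 7, 9]
```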

# For loops in Python

In [27]:
for i in range(4):
    print(i+3)

3
4
5
6
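A `for` loop can also iterate directly over the items of a list, without `range`. The sketch below (with a hypothetical `fruits` list) shows this, along with the built-in `enumerate` for cases where the position of each item is needed as well.

```python
# Hypothetical list for illustration
fruits = ['apple', 'orange', 'grape']

# Loop directly over the items of a list
for fruit in fruits:
    print(fruit)

# enumerate yields (index, item) pairs when the position is needed
labels = []
for index, fruit in enumerate(fruits):
    labels.append(str(index) + ': ' + fruit)

print(labels)  # ['0: apple', '1: orange', '2: grape']
```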


## List Comprehension

Explicit for loops in Python are comparatively slow. When possible (which is definitely not always), try to find a more 'Pythonic' way to traverse your lists, such as a list comprehension. Your code will usually be both faster and easier to read.

In [28]:
my_list = [2., 4., 100., 5., 0.]

In [29]:
# with a for loop
my_new_list = []
for number in my_list:
    my_new_list.append(number * 3)

print(my_new_list)

[6.0, 12.0, 300.0, 15.0, 0.0]


In [30]:
# with list comprehension
my_new_list = [3*x for x in my_list]
my_new_list

[6.0, 12.0, 300.0, 15.0, 0.0]

So much better!
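One way to check the speed difference on your own machine is the standard library's `timeit` module. This is only a sketch: the function names and the test data are hypothetical, and the measured times will vary with the machine and the Python version.

```python
import timeit

data = [float(n) for n in range(1000)]  # hypothetical test data

def triple_with_loop(values):
    """Triple each value using an explicit for loop."""
    result = []
    for number in values:
        result.append(number * 3)
    return result

def triple_with_comprehension(values):
    """Triple each value using a list comprehension."""
    return [3 * x for x in values]

# Both approaches produce the same list
same_result = triple_with_loop(data) == triple_with_comprehension(data)
print(same_result)  # True

# Time 1000 calls of each; absolute numbers depend on the machine
loop_seconds = timeit.timeit(lambda: triple_with_loop(data), number=1000)
comp_seconds = timeit.timeit(lambda: triple_with_comprehension(data), number=1000)
print(loop_seconds)
print(comp_seconds)
```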

Dictionaries have comprehensions too.

In [31]:
cities = ['Seattle', 'San Francisco', 'Dallas']
num_of_people = [704352, 864816, 1317929]

population = {city: people for (city, people) in zip(cities, num_of_people)}
population

{'Dallas': 1317929, 'San Francisco': 864816, 'Seattle': 704352}

In [32]:
population['Dallas']

1317929
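Two related patterns, sketched with the same data: `dict(zip(...))` builds the same mapping without a comprehension, and a comprehension can include an `if` clause to filter entries. The 800000 threshold and the name `large_cities` are hypothetical, chosen only for illustration.

```python
cities = ['Seattle', 'San Francisco', 'Dallas']
num_of_people = [704352, 864816, 1317929]

# dict(zip(...)) pairs each city with its population
population = dict(zip(cities, num_of_people))

# Keep only the cities above a hypothetical population threshold
large_cities = {city: people
                for city, people in population.items()
                if people > 800000}

print(sorted(large_cities))  # ['Dallas', 'San Francisco']
```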