# Data Analysis With Python

## Yale Center for Research Computing

Kaylea Nelson & Ben Evans

research.computing@yale.edu

# What We'll Cover Today

- Intro / Python refresher
- N-Dimensional Matrices using Numpy
- Tabular data with pandas
- Plotting with Matplotlib

# Data Analysis With Python

Python is a more general-purpose programming language than R or MATLAB. It has become increasingly popular for data analysis and scientific computing, but those tasks rely on additional modules. Some of the more popular modules are:

- **Numpy*** - N-dimensional array (https://docs.scipy.org/doc/)
- **Scipy** - Scientific computing (https://docs.scipy.org/doc/)
- **Sympy** - Symbolic mathematics (https://docs.sympy.org/latest/index.html)
- **Scikit-learn** - Machine learning (http://scikit-learn.org/stable/documentation.html)
- **Pandas*** - Data frames & time series (http://pandas.pydata.org/pandas-docs/stable/)


# Data Visualization With Python

Matplotlib has become the standard underlying plotting package in Python. There is an ecosystem of libraries that all make visualizing, prettifying, and sharing your work easier:

- **Matplotlib*** - Plotting, with interface similar to MATLAB (https://matplotlib.org/contents.html)
- **Seaborn** - Statistical graphics (http://seaborn.pydata.org/tutorial.html)
- **Bokeh** - Interactive plots (http://bokeh.pydata.org/en/latest/docs/user_guide.html)
- **IPython** - Enhanced Interactive Console (https://ipython.org/documentation.html)
- **Jupyter** - Interactive notebooks, like this one! (http://jupyter.org/documentation)


# How to get these packages?

## Download Anaconda

https://www.continuum.io/downloads

Anaconda provides an easy installer for all three major operating systems. It includes numerous scientific computing packages out of the box, and additional packages are easy to add with the `conda` and `pip` commands.


# Brief Refresher on Python

- Lists
- Dictionaries
- Tuples & Sets
- Loops and list comprehensions


# Lists

A Python list is an ordered collection of objects, and can be heterogeneous (the items do not all need to be the same type). Lists are a good choice for storing values or strings when order matters and you might want to change individual items.

In [1]:
my_float_list = [1.1, 2.3, 3.8, 4.1]
my_float_list

[1.1, 2.3, 3.8, 4.1]

In [2]:
my_string_list = ['apple', 'orange', 'grape']
my_string_list

['apple', 'orange', 'grape']

In [3]:
my_mixed_list = ['apple', 1, 4.3]
my_mixed_list

['apple', 1, 4.3]

# Accessing a list: indices and slices

In [4]:
my_int_list = [199, 3, 10, 4]
my_int_list[2]

10

In [5]:
my_int_list[1:3]

[3, 10]

In [6]:
my_int_list[::2]

[199, 10]

In [7]:
my_int_list[-1]

4
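
The slice syntax used above has the general form `start:stop:step`, where the item at `stop` is excluded and negative indices count from the end. A minimal sketch of a few more cases, assuming an illustrative list called `values` with the same contents as `my_int_list`:

```python
# Illustrative list (same values as my_int_list above)
values = [199, 3, 10, 4]

first_two = values[:2]          # start defaults to 0; the item at index 2 is excluded
last_two = values[-2:]          # negative indices count from the end of the list
reversed_values = values[::-1]  # a step of -1 walks the list backwards

print(first_two)        # [199, 3]
print(last_two)         # [10, 4]
print(reversed_values)  # [4, 10, 3, 199]
```

Slicing always returns a new list, so `values` itself is unchanged by any of these lines.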

# Modifying a list

Direct assignment

In [8]:
my_int_list[2] = 16
my_int_list

[199, 3, 16, 4]

Adding elements

In [9]:
my_int_list.append(11)
my_int_list

[199, 3, 16, 4, 11]

Concatenating lists

In [10]:
additions_to_list = ['pineapple', 'mango']
my_string_list + additions_to_list

['apple', 'orange', 'grape', 'pineapple', 'mango']
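
Note that `+` builds a new list and leaves both originals untouched, whereas the `extend()` method changes a list in place. A small sketch of the difference, using illustrative names (`fruits`, `extras`) rather than the variables from the cells above:

```python
fruits = ['apple', 'orange', 'grape']
extras = ['pineapple', 'mango']

# "+" returns a brand new list; fruits is not modified
combined = fruits + extras
print(len(fruits))    # 3
print(len(combined))  # 5

# extend() appends every item of extras to fruits in place
fruits.extend(extras)
print(fruits == combined)  # True
```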

# Dictionaries

A **dictionary** is a data structure composed of **key -> value pairs**. Keys are usually strings; values can be any data type (and don't all need to be the same type). Items in a dictionary are accessed by passing the dictionary a key, which returns the paired value.

In [11]:
from datetime import date

bootcamp = {'name': 'Data Analysis with Python',
            'date': date.today(),
            'num_attendees': 11,
            'topics': ['numpy', 'pandas', 'matplotlib']}

In [12]:
bootcamp['date']

datetime.date(2018, 10, 25)

In [13]:
bootcamp['topics']

['numpy', 'pandas', 'matplotlib']

In [14]:
bootcamp.keys()

dict_keys(['name', 'date', 'num_attendees', 'topics'])

In [15]:
bootcamp.values()

dict_values(['Data Analysis with Python', datetime.date(2018, 10, 25), 11, ['numpy', 'pandas', 'matplotlib']])

Add a new value:

In [16]:
bootcamp['location'] = '160 St. Ronan Street'
bootcamp['location']

'160 St. Ronan Street'
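
Asking for a key that does not exist raises a `KeyError`, so it can help to check for the key first or to supply a default. A minimal sketch, assuming a made-up dictionary called `workshop` that deliberately has no `'location'` key:

```python
# Hypothetical stand-in for the bootcamp dictionary, without a 'location' key
workshop = {'name': 'Data Analysis with Python', 'num_attendees': 11}

# "in" tests whether a key is present
has_location = 'location' in workshop

# get() returns a default instead of raising KeyError for a missing key
location = workshop.get('location', 'unknown')

# items() yields (key, value) pairs for looping
for key, value in workshop.items():
    print(key, '->', value)

print(has_location)  # False
print(location)      # unknown
```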

# Other Common Data Structures

**Tuples** are immutable ordered sequences of objects, meaning that once they are created they **cannot be changed**. They are usually more efficient than lists.

In [17]:
my_tuple = (1, 4.5, 'orange')
my_tuple[2]

'orange'
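
To see what "cannot be changed" means in practice, here is a short sketch that tries to assign to a tuple element and catches the resulting error. The flag `assignment_worked` and the unpacked names are illustrative only:

```python
my_tuple = (1, 4.5, 'orange')

# Item assignment is not supported on tuples and raises TypeError
try:
    my_tuple[2] = 'apple'
    assignment_worked = True
except TypeError as err:
    assignment_worked = False
    print('TypeError:', err)

# Tuples can still be unpacked into separate variables
number, decimal, fruit = my_tuple
print(fruit)  # orange
```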

**Sets** can only hold **unique values**, and have methods that make comparing and combining them easy.

In [18]:
list_one = [1, 5, 7, 3, 9]
list_two = [3, 6, 5, 8, 2]
my_set = set(list_one + list_two)
my_set

{1, 2, 3, 5, 6, 7, 8, 9}

In [19]:
my_set.difference(list_one)

{2, 6, 8}
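
Besides `difference()`, sets support the other usual set operations, either as methods or as operators. A brief sketch, assuming two illustrative sets built from the same numbers as `list_one` and `list_two`:

```python
set_one = {1, 5, 7, 3, 9}
set_two = {3, 6, 5, 8, 2}

in_both = set_one & set_two      # intersection: values present in both sets
in_either = set_one | set_two    # union: values present in at least one set
only_in_one = set_one ^ set_two  # symmetric difference: values in exactly one set

print(in_both)                       # {3, 5}
print(in_either == set(range(1, 10)) - {4})  # True
print(sorted(only_in_one))           # [1, 2, 6, 7, 8, 9]
```

The method forms (`intersection()`, `union()`, `symmetric_difference()`) behave the same way and, like `difference()` above, also accept lists as arguments.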

# For loops in Python

In [20]:
for i in range(4):
    print(i+3)

3
4
5
6
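
A `for` loop can also walk over a list directly instead of over `range()`. A small sketch, assuming an illustrative list called `fruits`; `enumerate()` supplies the index alongside each item when you need both:

```python
fruits = ['apple', 'orange', 'grape']

labels = []
for index, fruit in enumerate(fruits):
    # index starts at 0, fruit is the item at that position
    labels.append(f'{index}: {fruit}')

print(labels)  # ['0: apple', '1: orange', '2: grape']
```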


## List Comprehension

When building a new list, an explicit `for` loop with `append()` can be verbose and somewhat slow. When possible, try using `map()` or a list comprehension. Your code will be more concise and often faster.

In [21]:
my_list = [2.0, 4.0, 100.5, 5.5, 0.1]

In [22]:
# with a for loop
my_new_list = []
for number in my_list:
    my_new_list.append(number * 3)
    
my_new_list

[6.0, 12.0, 301.5, 16.5, 0.30000000000000004]

In [23]:
# with list comprehension
my_new_list = [3*x for x in my_list]
my_new_list

[6.0, 12.0, 301.5, 16.5, 0.30000000000000004]

In [24]:
my_new_list = map(round, my_list)
my_new_list

<map at 0x7fc620686048>
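
In Python 3, `map()` returns a lazy iterator rather than a list, which is why the cell above displays a `map` object instead of the rounded numbers. A minimal sketch of how to get the values out, using the illustrative names `rounded` and `rounded_again`:

```python
my_list = [2.0, 4.0, 100.5, 5.5, 0.1]

# Wrapping map() in list() forces it to compute every value
rounded = list(map(round, my_list))
print(rounded)  # [2, 4, 100, 6, 0]

# The equivalent list comprehension produces the same result
rounded_again = [round(x) for x in my_list]
print(rounded == rounded_again)  # True
```

The built-in `round()` rounds exact halves to the nearest even integer, which is why 100.5 becomes 100 while 5.5 becomes 6.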

Dictionaries have comprehensions too.

In [25]:
cities = ['Seattle', 'San Francisco', 'Dallas']
num_of_people = [704352, 864816, 1317929]

population = {city: people for (city, people) in zip(cities, num_of_people)}
population

{'Seattle': 704352, 'San Francisco': 864816, 'Dallas': 1317929}

In [26]:
population['Dallas']

1317929
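
Comprehensions can also filter as they build. A short sketch that keeps only the larger cities from the same data, where the 800000 cutoff and the name `large_cities` are arbitrary choices made for illustration:

```python
cities = ['Seattle', 'San Francisco', 'Dallas']
num_of_people = [704352, 864816, 1317929]

# The trailing "if" keeps only the pairs that pass the test
large_cities = {city: people
                for city, people in zip(cities, num_of_people)
                if people > 800000}

print(large_cities)  # {'San Francisco': 864816, 'Dallas': 1317929}
```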