## Data analysis in Python

## UTM Coders April 23, 2019

#### **Authors:** James Santangelo and Ahmed Hasan, borrowing from [UofT Coders lesson by Derek Howard](https://github.com/UofTCoders/studyGroup/blob/gh-pages/lessons/python/pandas2/UofT-pandas.ipynb)

## What we're assuming you know

- the interpreter
- variables
- lists
- indexing/slicing
- if statements
- for loops (and loop syntax in general)
- functions

We covered the topics above in a previous lesson. While not essential, being familiar with them will increase your understanding of today's material. However, we'll provide reminders of key Python concepts as they come up.

## Class methods and attributes

In a previous lesson, we learned how to store data in the form of:

- **Variables:** `x = 5`
- **Strings:** `my_string = 'This is a string'`
- **Lists:** `my_list = [3, 6.7, 'three']`
- **Dictionaries:** `my_dict = {'Apple': 'Red', 'Banana': 'Yellow'}`

We also learned how to create functions to perform specific operations. For example:

```
def add_nums(x, y):
    return x + y
```

All of these are examples of creating *objects* in Python, which can be reused throughout a Python program once they are defined.

Certain types of objects may have specific operations associated with them. For instance, we may want to know how many times a certain value appears in a list. Say we have the following list:

In [None]:
my_list = [1, 5, 1, 1, 3]

How can we compute how many times the element `1` appears in the list?

This is where **methods** and **attributes** come in. Methods are essentially functions that 'live within' an object type, whereas attributes instead store information about that object. In our case, there is a list method called `count` that we can use for this operation.

In [None]:
my_list.count(1)

However, the set of methods available to one object type is often quite different from the set available to another. Consider string objects: when dealing with text information, we may want to convert all letters to uppercase. The string method `upper` allows us to do just that:

In [None]:
my_agency = 'nasa'
print(my_agency.upper())

However, list objects do not have an `upper` method. This makes intuitive sense - what is the uppercase of one or more numbers?
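The difference between the methods available to lists and to strings can be checked directly. The sketch below uses only the built-in `hasattr` function; the variable names are chosen purely for illustration.

```python
my_list = [1, 5, 1, 1, 3]
my_agency = 'nasa'

# hasattr returns True if the object has a method or attribute with that name
list_has_count = hasattr(my_list, 'count')
list_has_upper = hasattr(my_list, 'upper')
str_has_upper = hasattr(my_agency, 'upper')

print(list_has_count, list_has_upper, str_has_upper)

# Calling a method that an object does not have raises an AttributeError
try:
    my_list.upper()
except AttributeError as err:
    print('Error:', err)
```

The built-in `dir` function (for example `dir(my_list)`) lists every method and attribute name an object provides.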

In Python, **classes** allow related data and functions to be logically grouped together and reused throughout a program. If you're not used to the idea of classes, they can be a bit tricky and take some time to get the hang of. However, they are incredibly powerful and versatile, and much of the functionality in the modules that users import is provided through classes.

Let's walk through a (very) brief example. Understanding the basics of class structures will help us when using some of the data analysis modules later on.

In [None]:
class Vehicle:
    """Basic class describing a vehicle"""
    
    num_wheels = 4
    color = ""
    
    def description(self):
        print("The vehicle is {0} and has {1} wheels".format(self.color, self.num_wheels))
    

Remember, we can ask Python to provide some details about existing objects in case we've forgotten:

In [None]:
help(Vehicle)

The `help` function is telling us about the class *methods* and *attributes* for our vehicle class. Let's look at these in more detail. 

In [None]:
# Create an instance of the vehicle class
sedan = Vehicle()

In [None]:
# Access the num_wheels class attribute
print(sedan.num_wheels)

In [None]:
# Change the sedan's color to black by modifying the color attribute
sedan.color = "black"
print(sedan.color)

In [None]:
# Get a description of the sedan using the class's "description" method
sedan.description()

As you can see, attributes store data about an object, while methods perform operations using that data (or data from outside the class). The advantage here is that we can create as many instances of this class as we want, each with different attribute values. This style of programming is known as **object-oriented programming**.
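To make the idea of multiple instances concrete, here is a minimal sketch that repeats the `Vehicle` class from above and creates two instances with different attribute values. The `motorcycle` instance is a made-up example, not part of the original lesson.

```python
class Vehicle:
    """Basic class describing a vehicle"""

    num_wheels = 4
    color = ""

    def description(self):
        print("The vehicle is {0} and has {1} wheels".format(self.color, self.num_wheels))


# Two independent instances of the same class
sedan = Vehicle()
sedan.color = "black"

motorcycle = Vehicle()  # hypothetical second vehicle
motorcycle.color = "red"
motorcycle.num_wheels = 2  # overrides the default for this instance only

sedan.description()
motorcycle.description()
```

Changing `motorcycle.num_wheels` leaves both `sedan.num_wheels` and the class default `Vehicle.num_wheels` at 4.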

## The numpy module

So far, we've looked at some of the ways Python's functionality can be extended, either by creating custom functions or by creating classes. Naturally, this has led to programmers all over the world working to write code that can perform all sorts of useful operations.

These codebases are then packaged together and disseminated to the community for free in the form of _Python libraries_ (also referred to as packages or modules). We will be looking at some of the most popular libraries for data analysis in Python for the remainder of this lesson. To begin with, we will look at `numpy`, or Numerical Python, a Python library for quick and efficient mathematical operations that actually forms the groundwork for many other Python data analysis libraries.

To use any given Python library, it has to first be imported into the workspace using the `import` keyword. When importing libraries, they can be given an abbreviated name for ease of typing by using the `as` keyword.

In [None]:
import numpy as np

Now that `numpy` has been imported, any function or object that 'belongs' to `numpy` has to be prefaced with the name of the package. Since we used the `import numpy as np` syntax, we can simply type out `np` instead of `numpy`.
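As a small illustration of the prefix in use, the sketch below calls two `numpy` functions, `np.sqrt` and `np.mean`, on arbitrary example values.

```python
import numpy as np

# Functions that belong to numpy are reached through the np prefix
root = np.sqrt(16)
average = np.mean([2, 4, 6])

print(root, average)
```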

The most fundamental `numpy` data structure is known as an `array`, which allows for computationally efficient, vectorized operations. We can create an instance of an array using the `np.array` function:

In [None]:
my_array = np.array([1, 7, 12, 6])
my_array

In [None]:
# arrays can be indexed/sliced like lists
print(my_array[2])
print(my_array[1:3])

In [None]:
# operations on arrays are vectorized - much like R vectors
print(my_array * 2)

# operations on Python lists are not
print([1, 7, 12, 6] * 2)
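For comparison, the sketch below shows three outcomes side by side: multiplying a list repeats it, a list comprehension doubles each element, and multiplying an array doubles each element directly. The variable names are illustrative.

```python
import numpy as np

values = [1, 7, 12, 6]

repeated = values * 2                    # list repetition, not arithmetic
doubled_list = [v * 2 for v in values]   # element-wise, using a list comprehension
doubled_array = np.array(values) * 2     # element-wise, vectorized

print(repeated)
print(doubled_list)
print(doubled_array)
```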

In [None]:
# arrays have a single fixed type: assigning a float to an integer array drops the decimal part
my_array[1] = 7.15
print(my_array)
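The type of an array is stored in its `dtype` attribute. The following sketch inspects it and uses `astype` to make a float copy that can hold decimal values; the names `int_array` and `float_array` are illustrative.

```python
import numpy as np

int_array = np.array([1, 7, 12, 6])
print(int_array.dtype)   # an integer type; the exact width depends on the platform

int_array[1] = 7.15      # the decimal part is dropped to fit the integer type
print(int_array)

float_array = int_array.astype(float)  # a new array with a float type
float_array[1] = 7.15
print(float_array)
```

Passing `dtype=float` to `np.array` when creating the array is another way to get a float array from integer input.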

There exist several helper functions to quickly make certain arrays instead of having to type out values:

In [None]:
print(np.arange(1, 10)) # values 1-9
print(np.linspace(0, 10, 5)) # evenly spaces 0-10 into 5 values
print(np.zeros(8)) # create array of 8 zeroes

Arrays can also be two-dimensional:

In [None]:
# feed list of lists
my_matrix = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])

print(my_matrix)

In [None]:
# slicing
print(my_matrix[1, :]) # row w/ index 1

print(my_matrix[:,2]) # col w/ index 2

# indexing - row, column
print(my_matrix[1, 2])
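Two-dimensional arrays also report their dimensions through the `shape` attribute, and many methods accept an `axis` argument to work along rows or columns. A brief sketch using the same matrix:

```python
import numpy as np

my_matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(my_matrix.shape)            # (number of rows, number of columns)

col_sums = my_matrix.sum(axis=0)  # sum down each column
row_sums = my_matrix.sum(axis=1)  # sum across each row
print(col_sums)
print(row_sums)
```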

`numpy` also has helpful functions to generate random values:

In [None]:
# random float from 0 (inclusive) up to 1 (exclusive)
for i in range(0, 10):
    print(i, np.random.random())

In [None]:
# integers within a range (the upper bound is exclusive)
for i in range(0, 10):
    print(i, np.random.randint(1, 10, 3)) # 3 ints from 1 to 9
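Note that the upper bound passed to `np.random.randint` is exclusive, and that `np.random.random` returns values from 0 up to, but not including, 1. The sketch below checks both on a larger sample. The seed value is arbitrary and only there so that repeated runs draw the same numbers.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, chosen only for this illustration

draws = np.random.randint(1, 10, 1000)  # low is inclusive, high is exclusive
print(draws.min(), draws.max())

floats = np.random.random(1000)         # floats in the interval [0, 1)
print(floats.min() >= 0, floats.max() < 1)
```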

While we won't be covering `numpy` in much more detail, it dramatically extends the number of numerical operations that can be done in Python, in addition to providing data types relied upon by many other libraries. One of these libraries just so happens to be...

## The Pandas module

The Pandas module provides tools for handling dataframes in Python and is commonly used among data scientists who program in Python. Here, we'll cover some of its basic functionality.

In [None]:
import pandas as pd  # Import using as alias to reduce typing
import seaborn as sns

In [None]:
iris_data = sns.load_dataset("iris")
# data = pd.read_table('/path/to/file')
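Reading your own data works much the same way as loading a built-in dataset. As a rough sketch, the example below uses `pd.read_csv` on an in-memory string so that it runs without any file on disk. The three rows are made-up values; in a real analysis you would pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical file contents, standing in for a CSV file on disk
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""

small_df = pd.read_csv(io.StringIO(csv_text))
print(small_df)
print(small_df.shape)
```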

In [None]:
# Note this is an instance of the DataFrame class from Pandas
type(iris_data)

In [None]:
iris_data

In [None]:
# Show only first 10 rows
iris_data.head(n = 10)  # Note this is a method of the DataFrame class

In [None]:
# How many rows and columns?
iris_data.shape  # Note this is an attribute of the dataframe

In [None]:
iris_data.columns

In [None]:
iris_data.info()

In [None]:
iris_data.dtypes

In [None]:
iris_data.describe()  # Show some summary stats

### Selecting and filtering data

In [None]:
# Select columns by passing a list of column names
cols_to_keep = ['sepal_length', 'species']
iris_data[cols_to_keep].head()

In [None]:
# You can retrieve individual columns using dot notation
iris_data.sepal_length

In [None]:
# We can select rows by their integer position in the dataframe
rows = [0, 5, 10]
iris_data.iloc[rows]
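`iloc` selects by integer position, while its counterpart `loc` selects by row label. The two coincide for the default 0, 1, 2, ... index but differ once rows carry other labels. A minimal sketch with a small made-up dataframe:

```python
import pandas as pd

# Small made-up dataframe with non-default row labels
df = pd.DataFrame(
    {'sepal_length': [5.1, 7.0, 6.3],
     'species': ['setosa', 'versicolor', 'virginica']},
    index=['a', 'b', 'c']
)

by_position = df.iloc[[0, 2]]   # first and third rows, by position
by_label = df.loc[['a', 'c']]   # the same rows, by label

print(by_position)
print(by_position.equals(by_label))
```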

In [None]:
# We can also just slice the dataframe directly
iris_data[0:10]

In [None]:
# Filter based on a boolean condition
iris_data['species'] == 'virginica'  # Returns a boolean Series (True/False for each row)
iris_data[iris_data['species'] == 'virginica']  # Filter where true
# iris_data[iris_data.species == 'virginica']

In [None]:
# Filter by multiple conditions. Note parentheses around each condition
iris_data[(iris_data['sepal_length'] > 6) & (iris_data['sepal_length'] < 7)]
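The parentheses matter because `&` binds more tightly than comparison operators such as `>` and `<`, so leaving them out changes how the expression is parsed and typically raises an error. The sketch below applies the same filter to a small made-up dataframe and shows `DataFrame.query` as an alternative spelling.

```python
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, 6.4, 6.9, 7.2]})  # made-up values

# & combines two boolean Series element-wise
mask = (df['sepal_length'] > 6) & (df['sepal_length'] < 7)
filtered = df[mask]

# The same filter written as a query string
filtered_query = df.query('sepal_length > 6 and sepal_length < 7')

print(filtered)
print(filtered.equals(filtered_query))
```

Use `|` for "or" and `~` for "not" when combining conditions in the same way.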

In [None]:
species_list = ['setosa', 'versicolor']
iris_data[iris_data.species.isin(species_list)]

### Groupby

In [None]:
# Groupby species
species = iris_data.groupby('species')
species  # Note the class name has changed to DataFrameGroupBy

In [None]:
len(species)

In [None]:
species.first()  # First row for each group.

In [None]:
species['sepal_length'].mean()  # Mean sepal length for each species

In [None]:
# Summarize a particular group
species.get_group('setosa').describe()
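Several summary statistics can be computed per group in one step with the `agg` method. This is a small sketch on made-up measurements rather than the full iris dataset:

```python
import pandas as pd

# Made-up measurements for two species
df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.0, 5.2, 6.0, 7.0],
})

# One row per species, one column per statistic
summary = df.groupby('species')['sepal_length'].agg(['mean', 'min', 'max'])
print(summary)
```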

In [None]:
sorted_by_sepalLength = iris_data.sort_values('sepal_length', ascending=False)
sorted_by_sepalLength.head()

## Visualisation

In [None]:
# Display plots inline, below the cells that create them
%matplotlib inline

In [None]:
# Histogram of sepal length
iris_data['sepal_length'].plot.hist(bins=100)

In [None]:
# Scatterplot 
iris_data.plot(x='sepal_length',y='sepal_width',kind='scatter')

In [None]:
# Seaborn is a wrapper around matplotlib that makes plotting a bit easier.
sns.relplot(data = iris_data, x='petal_length', y='petal_width')

In [None]:
# It's trivial to color by species using the 'hue' argument
sns.relplot(data = iris_data, x='petal_length', y='petal_width', hue = 'species')

In [None]:
# All species and response variables
sns.pairplot(iris_data, hue='species', height=2.5)

In [None]:
sns.jointplot(data = iris_data, x='sepal_length', y='sepal_width', kind='reg');