# Python for Data Science Teaching Session 1: Data Manipulation

## Introduction

### Course Prerequisites

It is advised that course participants have completed WDSS's [Beginner's Python](http://education.wdss.io/beginner-python) course or an equivalent, including most of the additional notes on Pythonic programming. You should be able to get by if this is not the case, but you may want to brush up on the following topics:

- [Lists](https://education.wdss.io/beginners-python/session-four/) and [dictionaries](https://education.wdss.io/beginners-python/session-six/)
- [List comprehension](https://colab.research.google.com/github/warwickdatasciencesociety/beginners-python/blob/master/session-four/session_four_additional_content.ipynb) and [dictionary comprehension](https://colab.research.google.com/github/warwickdatasciencesociety/beginners-python/blob/master/session-four/session_six_additional_content.ipynb)
- [Truthiness and if-expressions](https://colab.research.google.com/github/warwickdatasciencesociety/beginners-python/blob/master/session-three/session_three_additional_content.ipynb)
- [String methods](https://colab.research.google.com/github/warwickdatasciencesociety/beginners-python/blob/master/session-three/session_two_additional_content.ipynb)
- [Importing modules and packages](https://education.wdss.io/beginners-python/session-eight/)

### Session Objectives

- Reading/writing data from/to files
- Exploring the structure and contents of a dataset
- Subsetting and filtering
- Mutating and summarising datasets

### Recommendations and Advice

Check out [PEP8](https://www.python.org/dev/peps/pep-0008/) and use a [linter](https://jupyterlab-code-formatter.readthedocs.io/en/latest/index.html) if needed.

Google, Google, Google. Use [Stack Overflow](https://stackoverflow.com/) and the [pandas reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) to find the answer you're after.

A warning: data-scientific Python is the Wild West. There are often many ways to achieve the same thing. Although this provides flexibility, it can cause confusion when learning. Don't be put off if it's not clear when and why to use one method over another. There might be no reason beyond personal preference.

## Getting Started with pandas

### What is pandas?

Let's ask the team:

> pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

In short, pandas allows you to:
- Read/write data using a wide variety of formats
- Manipulate and transform data
- Combine data sources together (session 4)
- Perform basic analysis and visualisation

It is part of the [SciPy stack](https://www.scipy.org/stackspec.html), a collection of Python packages for scientific programming (closely related to data science).

Once installed (see [bonus session one](https://education.wdss.io/python-for-data-science/bonus-one)), you can import it (using its usual alias of `pd`).

In [None]:
# Import pandas


### Importing CSVs

In this session, we'll be looking at the [wine quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). You can download this directly from the site (we'll only be using the data for red wine), or find it on this session's [materials](https://education.wdss.io/python-for-data-science/session-one) on the course website.

CSV stands for comma-separated values. CSV files are plain text files used to store tabular data, with one observation per line and values separated by commas. E.g.

```
"Numeric Column", "Boolean Column", "Text Column"
4, True, "Cat"
7, False, "Dog"
6, True, "Elephant"
```

CSV files can also be separated by semi-colons. This is common in Europe, where a comma is often used as the decimal separator.

We read CSV files using the `read_csv` function from pandas. When the separating character is not a comma, we have to specify it using the `sep` parameter.

In [None]:
wine = pd.read_csv('data/winequality-red.csv', sep=';')

The `read_csv` function has a ridiculous number of possible parameters. Read the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to learn more.
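As a small illustration of the `sep` parameter, the sketch below reads semi-colon-separated text from an in-memory string rather than a file. The data and column names are made up for the example. The `decimal` parameter (not needed for the wine data) tells pandas to treat commas as decimal separators.

```python
import io

import pandas as pd

# Hypothetical semi-colon-separated data that uses decimal commas
raw = "name;price\nMerlot;7,5\nRioja;9,25\n"

# sep sets the field separator; decimal sets the decimal mark
prices = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

print(prices)
print(prices.dtypes)
```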

### Viewing a Dataframe's Structure

We can view the whole dataframe by typing it in a code cell.

In [None]:
# Print entire dataframe
wine

There are various attributes of a pandas dataframe.
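Here is a minimal sketch of the most common attributes, using a small made-up dataframe (`pets` is hypothetical and not part of the wine data). The same attributes can be used on `wine` in the cells that follow.

```python
import pandas as pd

# Small made-up dataframe for illustration
pets = pd.DataFrame({
    'name': ['Rex', 'Tom', 'Dumbo'],
    'legs': [4, 4, 4],
    'weight_kg': [30.5, 4.2, 5400.0]
})

print(pets.shape)    # (number of rows, number of columns)
print(pets.columns)  # column labels, stored as an Index
print(pets.index)    # row labels, a RangeIndex by default
print(pets.dtypes)   # data type of each column
```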

In [None]:
# Dimensions


In [None]:
# Number of columns


In [None]:
# Column names


We're not going to worry about what an `Index` is in this course. It often works the same as a list but can be converted if needed.

In [None]:
# Column names as list
wine.columns.values.tolist()

In fact, there are many ways to do this (search [Stack Overflow](https://stackoverflow.com/questions/19482970/get-list-from-pandas-dataframe-column-headers) to find out), as there are for many tasks involving the SciPy stack, but this is one of the fastest.

In [None]:
# Row names (indexes)


In [None]:
# Column types


### Exploring Dataframe Contents

We can obtain simple or more substantial summaries of the dataframe using a variety of methods.

In [None]:
# Top 5 rows


In [None]:
# Bottom 3 rows


In [None]:
# Random sample of 4 rows


As with `read_csv`, the `sample` method has many optional arguments for replacement, weights, and random state. We will only ever go through the most critical parameters in this course, so it is your job to read the documentation when you want to go further.
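As a rough sketch of how these methods behave, the example below uses a made-up dataframe (`scores` is hypothetical). It shows `head`, `tail` and two of the optional `sample` arguments mentioned above, `random_state` and `replace`.

```python
import pandas as pd

# Made-up data for illustration
scores = pd.DataFrame({
    'player': list('ABCDEF'),
    'score': [3, 8, 5, 9, 1, 7]
})

first_two = scores.head(2)  # first 2 rows
last_two = scores.tail(2)   # last 2 rows

# random_state fixes the seed, so the same rows are drawn on every run
sample_a = scores.sample(n=3, random_state=42)
sample_b = scores.sample(n=3, random_state=42)

# replace=True allows the same row to be drawn more than once
with_replacement = scores.sample(n=10, replace=True, random_state=0)
```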

In [None]:
# First 2 rows of random sample of 3 columns


In [None]:
# Numerical summaries of columns


Also see `.info()` and `.count()` for similar functionality.

## Subsetting and Filtering

### Subsetting Rows and Columns

Pandas has two main methods of subsetting a dataframe:

- `.loc`: label-based
- `.iloc`: integer-based (using zero-based indexing)

These both accept single values, lists/arrays and slices (and a few more; [read the docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)!).

For data frames we follow `.loc` and `.iloc` with a pair of square brackets, containing either one or two inputs. If one input is used, this subsets the rows. If two are used, they subset the rows and columns respectively.

A colon (`:`) can be used to include all entries along that dimension (all rows or all columns).
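The sketch below illustrates one-input and two-input subsetting on a made-up dataframe with string row labels (`stock` is hypothetical), so that the difference between labels and positions is visible.

```python
import pandas as pd

# Made-up dataframe with string row labels
stock = pd.DataFrame(
    {'price': [1.2, 0.5, 3.0], 'count': [10, 25, 4]},
    index=['apple', 'banana', 'cherry']
)

row_by_label = stock.loc['banana']    # one input: selects a row by label
row_by_position = stock.iloc[1]       # the same row, by integer position
value = stock.loc['cherry', 'price']  # two inputs: row, then column
column = stock.iloc[:, 0]             # colon keeps every row
```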

In [None]:
# 10th row of the dataset


In [None]:
# 2nd to last column


In [None]:
# 4th row, 7th column


In [None]:
# Column means


In [None]:
# Acidity markers for every 10th row
wine.iloc[::10].loc[:, ['fixed acidity', 'volatile acidity', 'citric acid', 'pH']]

Notice that when using a list to subset columns we obtain a dataframe in return. This holds even if the list has one element.
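To make the difference concrete, here is a short sketch on a made-up dataframe (`df_demo` is hypothetical) comparing a single label with a one-element list.

```python
import pandas as pd

# Made-up data for illustration
df_demo = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

as_series = df_demo.loc[:, 'x']   # single label: returns a Series
as_frame = df_demo.loc[:, ['x']]  # one-element list: returns a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```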

In [None]:
# 4th row as dataframe


In [None]:
# (4, 2) element as dataframe


Be careful: unlike integer slices, label slices include the final value.
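A minimal comparison on a made-up dataframe (`letters` is hypothetical) shows the two slicing conventions side by side.

```python
import pandas as pd

# Made-up data: four columns labelled 'a' to 'd'
letters = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})

by_label = letters.loc[:, 'b':'c']  # label slice: includes 'c'
by_position = letters.iloc[:, 1:2]  # integer slice: stops before position 2

print(list(by_label.columns))
print(list(by_position.columns))
```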

In [None]:
wine.loc[:, 'density':'quality']

We can also use these methods for setting values.
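As a sketch of the idea, the example below assigns to a single cell and to a whole column of a made-up dataframe (`grid` is hypothetical). The exercises that follow use the same pattern.

```python
import pandas as pd

# Made-up data for illustration
grid = pd.DataFrame({'a': [10, 20, 30], 'b': [1.0, 2.0, 3.0]})

grid.loc[0, 'a'] = 99                    # set one cell by label
grid.iloc[2, 1] = 0.5                    # set one cell by position
grid.loc[:, 'b'] = grid.loc[:, 'b'] + 1  # replace a whole column

print(grid)
```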

In [None]:
df = pd.DataFrame({
    'x': [1, 2, 3],
    'y': [4, 5, 6]
})

In [None]:
# Change 1 to -1


In [None]:
# Double second row


We can also extract columns using regular square brackets (just like a list or dictionary) using label-based indexing.

In [None]:
# pH column


### Series

Unless we force a dataframe to be returned using one-element lists, pandas will return either a single value, a series or a new dataframe depending on whether our result is 0, 1, or 2-dimensional.

In [None]:
type(wine['pH'])

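A series is a one-dimensional array with axis labels. We can use `.loc` and `.iloc` on series but only need to specify a single input. We can also use standard square brackets with labels; using positional integers in square brackets is deprecated in recent pandas versions when the index is not integer-based, so prefer `.iloc` for positions.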
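The following sketch shows the three access styles on a made-up series with string labels (`s` is hypothetical).

```python
import pandas as pd

# Made-up series with string labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s.loc['b']    # label-based
by_position = s.iloc[1]  # integer-based
by_bracket = s['b']      # square brackets with a label
first_two = s.iloc[0:2]  # integer slice, excludes position 2
```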

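It is important to note that subsetting in pandas can return a view of the original data rather than a copy, so changing the subset may also change the original dataframe. Use the `.copy()` method when you need an independent copy. The exact behaviour depends on the pandas version: with Copy-on-Write enabled (the default from pandas 3.0), subsets always behave like copies.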

In [None]:
df = pd.DataFrame({
    'x': [1, 2, 3],
    'y': [4, 5, 6]
})

x = df['x']
y = df['y'].copy()

x[1] = 0
y[1] = 0

df

> See also: `.at` and `.iat` in [the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html)

### Filtering

`.loc`, `.iloc` and `[]` can also accept Boolean vectors, returning only the rows/columns that correspond to a `True` value.
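Here is a minimal sketch of Boolean filtering on a made-up dataframe (`temps` is hypothetical). A comparison on a column produces a Boolean series with one value per row, and a comparison on `dtypes` produces one value per column.

```python
import pandas as pd

# Made-up data for illustration
temps = pd.DataFrame({
    'city': ['Oslo', 'Cairo', 'Lima'],
    'temp_c': [4.5, 31.0, 19.5],
    'days': [3, 5, 2]
})

is_warm = temps['temp_c'] > 15  # Boolean series, one value per row
warm = temps[is_warm]           # keep rows where the mask is True
warm_loc = temps.loc[is_warm]   # equivalent, using .loc

# Boolean vector over columns: keep only the float columns
float_cols = temps.loc[:, temps.dtypes == 'float64']
```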

In [None]:
# Select rows with pH greater than 2.9


In [None]:
# Select only decimal columns


A useful helper is the `isin()` series method.

In [None]:
# Select wines of quality 3, 5, 6


You can combine Boolean vectors using Boolean operators. The notation in pandas differs from base Python, however:
- Use `&` for `and`
- Use `|` for `or`
- Use `~` for `not`
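The operators above, together with `isin()`, can be sketched on a made-up dataframe (`orders` is hypothetical). Note that `&` and `|` bind more tightly than comparisons such as `>`, so each comparison needs its own parentheses.

```python
import pandas as pd

# Made-up data for illustration
orders = pd.DataFrame({
    'item': ['tea', 'coffee', 'juice', 'tea'],
    'qty': [2, 1, 5, 7],
    'paid': [True, False, True, True]
})

# isin gives True where the value appears in the list
hot_drinks = orders['item'].isin(['tea', 'coffee'])

# and: hot drinks ordered more than once
hot_and_large = orders[hot_drinks & (orders['qty'] > 1)]

# or / not: large orders, or orders that are not paid
large_or_unpaid = orders[(orders['qty'] > 4) | ~orders['paid']]
```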

We could use this to drop columns with certain names (I'll leave this as a puzzle), but there is a better way using the `.drop` method.

In [None]:
wine.drop(['citric acid', 'residual sugar'], axis=1)

## Data Manipulation

### Sorting

We can sort a pandas dataframe using the `sort_values` method. This sorts either columns or rows depending on the specified axis. If a single label is provided, the dataframe is sorted using that row/column. If a list is provided, the later labels are used to break ties.
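As an illustration, the sketch below sorts a made-up dataframe (`results` is hypothetical) by two columns and then in descending order using the `ascending` parameter.

```python
import pandas as pd

# Made-up data for illustration
results = pd.DataFrame({
    'team': ['A', 'B', 'C', 'D'],
    'points': [6, 9, 6, 3],
    'goals': [5, 7, 8, 2]
})

# Sort by points; ties on points are broken by goals
by_points_then_goals = results.sort_values(['points', 'goals'])

# Highest points first
descending = results.sort_values('points', ascending=False)

print(by_points_then_goals)
print(descending)
```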

In [None]:
# Sort first by quality then by alcohol


In [None]:
# Sort by descending density


> See the `key` parameter in [the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) for more flexible sorts

> Most of the methods we've used come with an `inplace` parameter, which when set to `True` will perform the operation directly on the data rather than returning a modified dataframe. This is useful in some cases but prevents you from chaining together methods.
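To show what chaining looks like, here is a small sketch on a made-up dataframe (`runs` is hypothetical). Each method returns a new dataframe, so the next method can be called directly on the result.

```python
import pandas as pd

# Made-up data for illustration
runs = pd.DataFrame({
    'runner': ['Ann', 'Bo', 'Cy'],
    'time': [31.2, 29.8, 30.5],
    'note': ['', 'pb', '']
})

# Drop a column, sort by time, then keep the top two rows
fastest_two = (
    runs
    .drop('note', axis=1)
    .sort_values('time')
    .head(2)
)

print(fastest_two)
```

The original `runs` dataframe is left unchanged because none of the calls use `inplace=True`.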

### Creating and Overwriting Columns

We can create new columns using square brackets, providing a column name that doesn't already exist. If the column does exist, its values will be overwritten.

Operations are _vectorised_, meaning they act on an element-by-element basis.
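A minimal sketch of both cases, using a made-up dataframe (`box` is hypothetical): the multiplication is applied to each pair of elements, row by row.

```python
import pandas as pd

# Made-up data for illustration
box = pd.DataFrame({'width': [2.0, 3.0], 'height': [4.0, 5.0]})

# New column: element-wise product of two existing columns
box['area'] = box['width'] * box['height']

# Existing column: values are overwritten (here, metres to centimetres)
box['width'] = box['width'] * 100

print(box)
```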

In [None]:
# Calculate non-free sulfur dioxide


In [None]:
# Replace density with grams/litre


If a single value is used, it will fill the entire column.

In [None]:
# Add column of zeros


### Creating Summaries

Pandas allows you to create summaries of rows, columns or series. Some common methods for this are `mean`, `min`, `max`, `median`, `mode`, `std`, `var`, `sum`. These are more useful when we have grouped data, which we will introduce in the project session.
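The sketch below applies a few of these methods to a made-up dataframe (`marks` is hypothetical). By default the summaries run down each column; passing `axis=1` summarises across each row instead.

```python
import pandas as pd

# Made-up data for illustration
marks = pd.DataFrame({'maths': [70, 80, 90], 'art': [60, 65, 100]})

col_means = marks.mean()      # one value per column
row_max = marks.max(axis=1)   # one value per row
art_std = marks['art'].std()  # summary of a single series

print(col_means)
print(row_max)
print(art_std)
```

Note that `std` and `var` compute the sample versions by default (`ddof=1`).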

In [None]:
# Average of all columns


In [None]:
# Maximum value of each row


In [None]:
# Standard deviation of pH


Pandas also offers two useful Boolean reduction functions, `all` and `any`, which return `True` if all or any of the values in the series they are applied to are `True`, respectively. They can also be applied to dataframes, in which case they act on each column independently.
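As a short sketch on a made-up dataframe (`readings` is hypothetical), the example below applies `any` and `all` to a series and to a whole dataframe.

```python
import pandas as pd

# Made-up data for illustration
readings = pd.DataFrame({'a': [1, 2, 3], 'b': [-1, 5, 6]})

any_negative_b = (readings['b'] < 0).any()  # is at least one value negative?
all_positive_a = (readings['a'] > 0).all()  # is every value positive?

# On a dataframe, the reduction is applied to each column
per_column = (readings > 0).all()

# Applying it twice reduces the dataframe to a single value
overall = (readings > 0).all().all()
```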

In [None]:
# Are any pH values less than 3?


In [None]:
# Are all values in the dataset non-negative?


Recall from Beginner's Python that `True`/`False` convert to 1/0 when cast as integers. We can use this to count true values and to obtain the proportion of values that are true.
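For instance, summing a Boolean series counts its `True` values and taking the mean gives their proportion. The series below is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration
colours = pd.Series(['red', 'blue', 'red', 'green'])

is_red = colours == 'red'  # Boolean series
n_red = is_red.sum()       # True counts as 1, False as 0
prop_red = is_red.mean()   # count divided by the number of values

print(n_red, prop_red)
```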

In [None]:
# How many 5-quality wines are there?


In [None]:
# What proportion of wines are 5-quality?


## Wrapping Up

### Writing to CSVs

We can write a dataframe to a CSV using the `to_csv` method, passing in a file path. There are many parameters found in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html), but the most commonly used is `index=False` to avoid saving row numbers (which can make it harder for other programs to read).

In [None]:
wine.sample(5).to_csv('wine_sample.csv', index=False)

Note, this will overwrite any existing file without warning.
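To see the effect of `index=False` without creating a file, the sketch below relies on the fact that `to_csv` returns the CSV text as a string when no path is given. The dataframe `small` is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration
small = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# With no file path, to_csv returns the CSV text instead of writing a file
with_index = small.to_csv()
without_index = small.to_csv(index=False)

print(with_index)     # first column holds the row labels
print(without_index)  # only the data columns
```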

### Other IO Tools

Pandas is capable of reading from and writing to a large number of file types. A list and corresponding documentation can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).