<img src="https://github.com/simonwiles/colab_workshops/raw/master/cidr-logo.no-text.240x140.png" align="center" alt="Center for Interdisciplinary Digital Research @ Stanford"/>
<H1 align="center">Digital Tools and Methods for the Humanities and Social Sciences</H1>
<H2 align="center">Data Manipulation with Python</H2>

### Instructors
- Scott Bailey (CIDR), <em>scottbailey@stanford.edu</em>
- Simon Wiles (CIDR), <em>simon.wiles@stanford.edu</em>

### Goal
By the end of this workshop, we hope you'll be able to load data into a Pandas `DataFrame`, perform basic cleaning and analysis, and create visualizations of some relevant aspects of a dataset. For most of this workshop we will work with a dataset prepared from the [IMDb Datasets](https://www.imdb.com/interfaces/) and the [OMDb API](https://www.omdbapi.com/).

### Topics
- Pandas Series and DataFrame
- Loading data in, null and missing data
- Describing data
- Column manipulation
- String manipulation
- Split-Apply-Combine
- Plotting:
  - Basic charts (line, bar, pie)
  - Histograms
  - Scatter plots
  - Boxplots, violinplots

### Jupyter Notebooks and Google Colaboratory

Jupyter notebooks are a way to write and run Python code interactively. They're quickly becoming a standard way of putting together data, code, and written explanations or visualizations into a single document and sharing it. There are a lot of ways to run Jupyter notebooks, including locally on your own computer, but we've decided to use Google's Colaboratory notebook platform for this workshop. Colaboratory is a cloud-based platform that lets you create, run, and share notebooks from your browser. It comes with a number of popular libraries pre-installed, and allows you to install other libraries as needed.

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we have some instructions (with gifs!) on installing Python through the Anaconda distribution, which will also help you handle virtual environments: https://github.com/sul-cidr/python_workshops/blob/master/setup.ipynb <mark> ← TODO: migrate this to a wiki page on the CIDR Workshops repo</mark>

If you run into problems, or would like to look into other ways of installing Python or handling virtual environments, feel free to send us an email (contact-cidr@stanford.edu) or visit us during our [consulting hours](https://library.stanford.edu/research/cidr/consulting).

~For now, go ahead to https://notebooks.azure.com and login with your Stanford ID and password.~

### Environment
If you would prefer to use Anaconda or your own local installation of Python or Jupyter Notebooks, for this workshop you will need an environment with the following packages installed and available:
- `pandas`
- `matplotlib`
- `requests`
- `sqlalchemy`
- `seaborn` (available in the `conda-forge` channel)

Please note that we will likely not have time during the workshop to support you with problems related to a local environment, and we do recommend using the Colaboratory notebooks if you are at all unsure.

###  Copying this notebook
~Go to https://notebooks.azure.com/versae/libraries/cidr-data-manipulation~
    
~From there, click "Clone" to create a full copy of this library.~

## 1. What is Pandas?

Pandas is a high-level data manipulation tool first created in 2008 by Wes McKinney.  The name is derived from the term “panel data,” an econometrics term for data sets that include observations over multiple time periods for the same individuals.<sup>[[wikipedia](https://en.wikipedia.org/wiki/Pandas_(software))]</sup>

From Jake Vanderplas’ book [**Python Data Science Handbook**](http://shop.oreilly.com/product/0636920034919.do) (from which some code excerpts are used in this workshop):

> Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a `DataFrame`. `DataFrame`s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

### 1.1. What does Pandas *do*?

<mark>TODO: wip</mark>
* Reading and writing data from persistent storage
* Cleaning, filtering, and otherwise preparing data
* Calculating statistics and analyzing data


... but perhaps we should let Pandas introduce itself:

In [None]:
import pandas as pd

pd?

### 1.2. Where can I get more help with Pandas?

The [Pandas website](https://pandas.pydata.org/) and [online documentation](http://pandas.pydata.org/pandas-docs/stable/) are useful resources. The indispensable [Stack Overflow](https://stackoverflow.com/questions/tagged/pandas) has a "pandas" tag, and there is also a (much younger, much smaller) sister [site dedicated to Data Science questions](https://datascience.stackexchange.com/questions/tagged/pandas) that has a "pandas" tag too. Inside a notebook, appending `?` to any object or function shows its built-in documentation:

In [None]:
pd.isnull?

## 2. Introduction to `DataFrame`s and `Series`

The main data structure that Pandas implements is the `DataFrame`, and a `DataFrame` is composed of one or more `Series` that share a common `Index`.

A `DataFrame` is a two-dimensional array with flexible row indices and flexible column names. It can be thought of as a generalization of a two-dimensional NumPy array, or a specialization of a dictionary in which each column name maps to a `Series` of column data.

A `Series` is a one-dimensional array of indexed data. It can be thought of as a specialized dictionary or a generalized NumPy array.
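
As a minimal sketch of both views of a `Series`, the example below builds one from a plain `dict` (using three of the county populations that appear later in this notebook) and then reads it back first like a dictionary and then like an array. The variable names are our own and purely illustrative.

```python
import pandas as pd

# Build a Series from a dict: the keys become the index labels.
population = pd.Series({"Marin": 250666, "Napa": 135377, "Solano": 411620})

# Dictionary-style access: look a value up by its label.
napa_population = population["Napa"]

# Array-style access: look a value up by its position...
first_value = population.iloc[0]

# ...and apply arithmetic to every element at once.
in_thousands = population / 1000
```

Note that the result of the arithmetic is itself a `Series` carrying the same index labels.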

A `DataFrame` is made up of `Series` in much the same way that a table is made up of columns. The only restriction is that each column has a single data type for all of its values, although different columns can have different types. Many of the operations that can be performed on a `DataFrame` can also be performed on an individual `Series`.
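
To make the relationship concrete, here is a small sketch (with a made-up two-row frame called `counties`) showing that pulling a single column out of a `DataFrame` gives us a `Series`, and that each column carries its own data type.

```python
import pandas as pd

# A tiny illustrative DataFrame with text, integer, and float columns.
counties = pd.DataFrame({
    "county": ["Marin", "Napa"],
    "population": [250666, 135377],
    "area": [2140.0, 2040.0],
})

# Selecting one column returns a Series.
column = counties["population"]
print(type(column))

# Each column has its own dtype; they need not match one another.
print(counties.dtypes)

# Series operations work on the extracted column.
largest = column.max()
```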


<mark>**GRAPHIC HERE**</mark>

## 3. Creating `DataFrame`s and loading data

There are a great many ways to create a Pandas `DataFrame` -- we can build one ourselves from lower-level Python datatypes, of course, but Pandas also provides methods to load data in from common storage and serialization formats.

<a title="PerryPlanet [Public domain], via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Bayarea_map.svg" style="float:right"><img width="256" alt="Bayarea map" src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/Bayarea_map.svg/512px-Bayarea_map.svg.png"></a>
### 3.1. Introduction to `DataFrame`s

The simplest way to generate a `DataFrame` is to create it directly from a `dict` of `list`s:

In [None]:
data = {
    "county": ["Alameda", "Contra Costa", "Marin", "Napa", "San Francisco", "San Mateo", "Santa Clara", "Solano", "Sonoma"],
    "county seat": ["Oakland", "Martinez", "San Rafael", "Napa", "San Francisco", "Redwood City", "San Jose", "Fairfield", "Santa Rosa"],
    "population": [1494876, 1037817, 250666, 135377, 870887, 711622, 1762754, 411620, 478551],
    "area": [2130, 2080, 2140, 2040, 600.59, 1930, 3380, 2350, 4580]
}
bay_area_counties = pd.DataFrame(data)
bay_area_counties

Pandas has automatically created an `Index` on this `DataFrame` ([0..8]), but we can also specify our own `Index` when we instantiate the frame ourselves:

In [None]:
bay_area_counties = pd.DataFrame(data, index=["Ala", "Con", "Mar", "Nap", "SF", "SM", "SC", "Sol", "Son"])
bay_area_counties

This allows us to `loc`ate a specific row using its key in the `Index`:

In [None]:
bay_area_counties.loc['Ala']

We can also set an `Index` at any time after the `DataFrame` has been created, either by adding a new index:

In [None]:
bay_area_counties = pd.DataFrame(data)
bay_area_counties.index = ["Ala", "Con", "Mar", "Nap", "SF", "SM", "SC", "Sol", "Son"]
bay_area_counties

or by choosing one of the existing columns to become the index: <mark>(note the use of `inplace=True`)</mark>

In [None]:
bay_area_counties = pd.DataFrame(data)
bay_area_counties.set_index('county', inplace=True)
bay_area_counties

In [None]:
bay_area_counties.loc['Santa Clara']

### 3.2. Reading data from external sources

However, most of the time we're more likely to be reading data in from an external source of some kind, and Pandas has us well covered here.

First, let's grab some data into our Colaboratory Notebook environment so that we can work with it locally:

In [None]:
!wget 

#### 3.2.1. CSV files
Reading in data from CSV files is as simple as:

In [None]:
data_frame = pd.read_csv('sample_data/imdb_top_1000.csv')
data_frame

Notice again that Pandas has created a default `Index` for this `DataFrame` -- we probably want the `imdbID` column to be the `Index`, and we can set that after import, as above, or we can specify it when loading the CSV initially:

In [None]:
data_frame = pd.read_csv('sample_data/imdb_top_1000.csv', index_col='imdbID')
data_frame

#### 3.2.2. Reading data from JSON Files

Notice here that the nature of JSON as a file format is such that the `Index` is explicit, and Pandas will set it correctly for us initially.

In [None]:
pd.read_json('sample_data/imdb_top_1000.json')
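
To see what "explicit" means here, the sketch below parses a small hand-written JSON string rather than the workshop file. It assumes the JSON uses Pandas' default "columns" orientation, in which each column maps index labels to values; the labels and titles are invented for illustration, and the real file may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical JSON in the default orientation: {column: {index label: value}}.
raw = (
    '{"title": {"tt001": "Film A", "tt002": "Film B"},'
    ' "year": {"tt001": 1994, "tt002": 1972}}'
)

# Wrap the string in a file-like object, as read_json expects a path or buffer.
frame = pd.read_json(io.StringIO(raw))

# The inner keys ("tt001", "tt002") have become the Index.
print(frame.index)
```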

#### 3.2.3. Reading data from a SQL database
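
As a rough sketch of the idea, the example below creates a throwaway in-memory SQLite database using Python's built-in `sqlite3` module and reads a query result into a `DataFrame` with `pd.read_sql`. The table name, column names, and rows are all invented for illustration; a real workshop database would be reached through its own connection (for example one created with `sqlalchemy`).

```python
import sqlite3

import pandas as pd

# Create an in-memory database with a small, made-up table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE movies (imdb_id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
)
connection.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("id1", "Film A", 1994), ("id2", "Film B", 1972)],
)
connection.commit()

# Run a query and load the result, using one column as the Index.
movies = pd.read_sql("SELECT * FROM movies", connection, index_col="imdb_id")
connection.close()
```

As with `read_csv`, the `index_col` argument lets us choose the `Index` at load time rather than setting it afterwards.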

## 4. Writing `DataFrame`s back out to persistent storage
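
A minimal sketch of what writing might look like: most of the `read_*` functions have a matching `to_*` method on the `DataFrame`. To avoid touching the file system, this example writes CSV into an in-memory text buffer; passing a filename instead would write a file. The frame and its contents are made up for illustration.

```python
import io

import pandas as pd

# A small illustrative frame with a named Index.
frame = pd.DataFrame(
    {"title": ["Film A", "Film B"], "year": [1994, 1972]},
    index=pd.Index(["id1", "id2"], name="id"),
)

# Write CSV to an in-memory buffer (a path such as "movies.csv" also works).
buffer = io.StringIO()
frame.to_csv(buffer)
csv_text = buffer.getvalue()

# Read it back, restoring the Index from the "id" column.
restored = pd.read_csv(io.StringIO(csv_text), index_col="id")

# The same frame serialized as JSON, keyed by index label.
json_text = frame.to_json(orient="index")
```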

## 5. Working with `DataFrame`s

Accessing columns can be done using the dot notation, `df.column_name`, or the dictionary notation, `df["column_name"]`.
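
Here is a short sketch of both notations on a small made-up frame. One practical difference worth knowing: the dot notation only works when the column name is a valid Python identifier, so a column such as `county seat` (which contains a space) has to be accessed with the dictionary notation.

```python
import pandas as pd

counties = pd.DataFrame({
    "county": ["Marin", "Napa"],
    "county seat": ["San Rafael", "Napa"],
    "population": [250666, 135377],
})

# Dot notation and dictionary notation return the same Series.
by_attribute = counties.population
by_key = counties["population"]

# A column name containing a space can only be reached with brackets.
seats = counties["county seat"]
```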

`DataFrame`s can be sliced to extract just a set of the columns you are interested in. We just pass in a list of the columns we need to the slice and get a `DataFrame` back.
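
For example, assuming a small illustrative frame called `counties`, passing a list of column names inside the brackets returns a new `DataFrame` containing just those columns, in the order we asked for them:

```python
import pandas as pd

counties = pd.DataFrame({
    "county": ["Marin", "Napa"],
    "county seat": ["San Rafael", "Napa"],
    "population": [250666, 135377],
})

# Note the double brackets: the outer pair indexes, the inner pair is a list.
subset = counties[["population", "county"]]

# A single name (no list) returns a Series instead of a DataFrame.
single = counties["county"]
```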

All `DataFrame`s are indexed. If an index is not explicitly provided, Pandas will assign one, giving each row a consecutive number. `Series` and slices keep these indices, which makes further operations such as merging or column manipulation possible.
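
The sketch below illustrates why this matters, using a made-up frame with short labels as its index. A filtered `Series` keeps the labels of the rows that survive, and when we assign a new column from a `Series` whose entries are in a different order, Pandas lines the values up by label rather than by position.

```python
import pandas as pd

counties = pd.DataFrame(
    {"population": [250666, 135377, 411620], "area": [2140, 2040, 2350]},
    index=["Mar", "Nap", "Sol"],
)

# Filtering keeps the original index labels of the remaining rows.
population = counties["population"]
large = population[population > 200000]

# Arithmetic between columns is matched up row by row via the index.
density = counties["population"] / counties["area"]

# A new column given in a different order is aligned by label.
counties["seat"] = pd.Series({"Sol": "Fairfield", "Mar": "San Rafael", "Nap": "Napa"})
```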

`DataFrame`s are designed to operate at the column level, not at the row level. However, a subset of rows can be returned easily using a slice, just as with any Python list.
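
As a brief sketch on a made-up frame, a slice inside the brackets selects rows by position. The `iloc` indexer and the `head` method are two more explicit ways of asking for the same thing.

```python
import pandas as pd

counties = pd.DataFrame({
    "county": ["Marin", "Napa", "Solano", "Sonoma"],
    "population": [250666, 135377, 411620, 478551],
})

# A slice in brackets selects rows by position, like slicing a list.
first_two = counties[0:2]

# Equivalent, more explicit alternatives.
first_two_iloc = counties.iloc[0:2]
first_two_head = counties.head(2)
```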

<mark>TODO: fill all these out with suitable examples!</mark>

### 5.x. Activity

Given the `DataFrame` defined above, write an expression to extract a `DataFrame` with the columns `text`, `user_screen_name`, `user_name`, `user_lang`, and `hashtags`. Show only the first 5 rows of it.

## 6. Indexing and Expressions
