## Jupyter Notebook Tutorial

(Adapted from geopandas tutorial by Ryan Maas and Scott Henderson: https://github.com/uwescience/dssg2018-geopandasSQL-tutorial)

We will start today with the interactive environment that we will be using in the tutorial: the [Jupyter Notebook](http://jupyter.org).

Walk through the following steps:

1. Download this notebook, `180-biglecture4.ipynb`, and the data file, `Places_Full.csv`.

2. Type `jupyter notebook` or `jupyter lab` in the terminal to start the notebook server:

   ```
   $ jupyter notebook
   ```
   
   If everything has worked correctly, it should automatically launch your default browser.
   
3. Click on ``180-biglecture4.ipynb`` to open the notebook containing the content for this lecture.


## About Jupyter

- Combines the features of an IPython terminal with a fancy web interface
- Every notebook has one Python instance (its kernel) running with it
- Also supports Markdown and HTML embedding
- Easy to insert $\LaTeX$ formulas
- Beware of Vim-style keyboard shortcuts

## Python refresher
- Components with the same capabilities are of the same *type*.
  - For example, the numbers 2 and 200 are both integers.
  - A related idea is "duck typing": Python cares about what an object can do rather than what it is declared to be (if it looks like a duck, quacks like a duck...)
  
- A type can be defined recursively, in terms of other types. Some examples:
  - A list is a collection of objects that can be indexed by position.
  - A list of integers contains an integer at each position.
  
- A type has a set of supported operations. For example:
  - Integers can be added
  - Strings can be concatenated
  - A table can report the names of its columns
    
- In Python, members (components and operations) are indicated by a '.'
  - If `a` is a list, then `a.append(1)` adds `1` to the list.
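
The points above can be checked directly in a notebook cell. The following is a minimal sketch that uses only built-in types; the variable names are illustrative and not part of the original notebook.

```python
# Two integers share a type, so they support the same operations.
same_type = type(2) is type(200)
total = 2 + 200                      # integers can be added

# Strings support concatenation with the same '+' operator.
phrase = 'data' + ' ' + 'science'

# A list is indexed by position; a list of integers holds an int at each position.
numbers = [10, 20, 30]
first = numbers[0]

# Members are reached with '.', for example the list method append.
numbers.append(40)

print(same_type, total, phrase, first, numbers)
```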

In [8]:
# Python lists can store data of different types at the same time
a_list = [1, 'a', [1,2]]

In [None]:
a_list.append(2)

In [None]:
# What is the full set of methods available for our list?
# Type `a_list.` and press Tab to see them, or call dir():
dir(a_list)

In [None]:
# Count how many times the integer 1 occurs
a_list.count(1)

## Python's Data Science Ecosystem

Many widely used third-party modules are core tools for doing data science with Python.
Some of the most important ones are:

#### [``numpy``](http://numpy.org/): Numerical Python

NumPy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data.
If you have used other computational tools like IDL or MATLAB, NumPy should feel very familiar.
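
As a rough illustration of what "efficient manipulation of arrays" looks like in practice, here is a minimal sketch. The array contents and variable names are made up for the example.

```python
import numpy as np

# Arithmetic applies to every element at once, with no explicit loop.
values = np.array([1.0, 2.0, 3.0, 4.0])
doubled = values * 2
mean_value = values.mean()

# A boolean mask selects the elements that satisfy a condition.
large = values[values > 2.5]

print(doubled, mean_value, large)
```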

#### [``scipy``](http://scipy.org/): Scientific Python

SciPy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more.
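
Two of the tasks mentioned, numerical integration and minimization, can be sketched in a few lines. The functions being integrated and minimized are arbitrary choices for illustration.

```python
from scipy import integrate, optimize

# Numerically integrate x**2 from 0 to 3; the exact answer is 9.
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)

# Find the minimum of (x - 2)**2, which lies at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)

print(area, result.x)
```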

#### [``pandas``](http://pandas.pydata.org/): Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for more advanced manipulation of labeled data in Python, in particular a columnar data structure called a *Data Frame*.
If you've used the [R](http://rstats.org) statistical language, much of the functionality in Pandas should feel very familiar.
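
To give a feel for labeled, columnar data, here is a minimal sketch that builds a small Data Frame by hand. The three rows are copied from `Places_Full.csv`; the variable names are illustrative.

```python
import pandas as pd

# A small Data Frame with three labeled columns.
places = pd.DataFrame({
    'name': ["Trader Joe's", 'Hillcrest Market', 'Uwajimaya'],
    'city': ['Seattle', 'Seattle', 'Seattle'],
    'rating': [4.5, 3.5, 4.5],
})

# Operations refer to columns by label rather than by position.
mean_rating = places['rating'].mean()
top_rated = places[places['rating'] >= 4.5]['name'].tolist()

print(mean_rating, top_rated)
```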

#### [``matplotlib``](http://matplotlib.org): Visualization in Python

Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is currently the most popular data visualization tool in the Python data world (though other recent packages are starting to encroach on its monopoly).

# Basic Pandas



In [32]:
# import all the libraries we are going to use
import pandas as pd
import shapely
import matplotlib.pyplot as plt
%matplotlib inline

In [34]:
print('Pandas version: ', pd.__version__)

Pandas version:  0.23.0


In [36]:
# Pandas has convenient methods for reading tabular data; in this case we have one CSV file:
!ls -lh ./*csv

-rw-r--r--  1 ryanmaas  staff   181K Apr  7 19:47 ./Places_Full.csv


In [39]:
# Note that the 'places' information has 9 columns with labels in the first row
!head Places_Full.csv

name,address,city,lat,lng,place_id,rating,class,type
Trader Joe's,1700 East Madison Street,Seattle,47.6158665,-122.3099133,ChIJx0M1ztNqkFQRtgspEllQxk8,4.5,supermarket,
Hillcrest Market,110 Summit Avenue East,Seattle,47.6188496,-122.3250047,ChIJt5emCTMVkFQR-BCgFDTvu9o,3.5,supermarket,
Uwajimaya,600 5th Avenue South,Seattle,47.596843,-122.326929,ChIJq9nX27xqkFQRu05rxkrN7f4,4.5,supermarket,
Kress IGA Supermarket,1427 3rd Avenue,Seattle,47.6093957,-122.3378216,ChIJbdIhoLNqkFQRMTiHCN6W4nU,3.9,supermarket,
Double Dorjee,1501 Pike Street # 511,Seattle,47.608822,-122.3395702,ChIJkX3s-LJqkFQRQRexqCN0jQY,5,supermarket,
Whole Foods Market,2210 Westlake Avenue,Seattle,47.6183442,-122.3380965,ChIJyz9g9EkVkFQRutvXu56_BLk,4.3,supermarket,
Grocery Outlet Bargain Market,1702 4th Avenue South,Seattle,47.5878704,-122.3286255,ChIJIWM_pp5qkFQRsh-E4oUVcIM,4.3,supermarket,
Metropolitan Market Uptown,100 Mercer Street,Seattle,47.624805,-122.354842,ChIJsY2q3UMVkFQRBAbqlwfTqdE,4.5,supermarket,
Trader Joe's,1916

# Pandas review

In [40]:
# Let's work with 'Places_Full.csv'.
# Pandas functions are accessed through the 'pd' alias imported above.
# Since the file is well-formatted, it is easily read into memory:
filePath = './Places_Full.csv'
df = pd.read_csv(filePath)
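
To experiment with `pd.read_csv` without the data file at hand, the same call also accepts a file-like object. The following is a minimal sketch, assuming an in-memory string that holds the header and the first two rows of `Places_Full.csv`; `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

# Header plus the first two data rows of Places_Full.csv, held in memory.
csv_text = """name,address,city,lat,lng,place_id,rating,class,type
Trader Joe's,1700 East Madison Street,Seattle,47.6158665,-122.3099133,ChIJx0M1ztNqkFQRtgspEllQxk8,4.5,supermarket,
Hillcrest Market,110 Summit Avenue East,Seattle,47.6188496,-122.3250047,ChIJt5emCTMVkFQR-BCgFDTvu9o,3.5,supermarket,
"""

# read_csv takes a path or a file-like object; the first row becomes the column labels.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(list(sample.columns))
```

The trailing comma on each data row leaves the last field empty, so the `type` column is read as missing for both rows.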

In [41]:
# 'df' stands for 'Data Frame'. It is essentially a spreadsheet:
df.head()

| | name | address | city | lat | lng | place_id | rating | class | type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Trader Joe's | 1700 East Madison Street | Seattle | 47.615866 | -122.309913 | ChIJx0M1ztNqkFQRtgspEllQxk8 | 4.5 | supermarket | NaN |
| 1 | Hillcrest Market | 110 Summit Avenue East | Seattle | 47.61885 | -122.325005 | ChIJt5emCTMVkFQR-BCgFDTvu9o | 3.5 | supermarket | NaN |
| 2 | Uwajimaya | 600 5th Avenue South | Seattle | 47.596843 | -122.326929 | ChIJq9nX27xqkFQRu05rxkrN7f4 | 4.5 | supermarket | NaN |
| 3 | Kress IGA Supermarket | 1427 3rd Avenue | Seattle | 47.609396 | -122.337822 | ChIJbdIhoLNqkFQRMTiHCN6W4nU | 3.9 | supermarket | NaN |
| 4 | Double Dorjee | 1501 Pike Street # 511 | Seattle | 47.608822 | -122.33957 | ChIJkX3s-LJqkFQRQRexqCN0jQY | 5.0 | supermarket | NaN |


In [42]:
# The dataframe has a lot of convenient methods for fast data exploration
# Start with info to confirm that things were read in correctly
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1616 entries, 0 to 1615
Data columns (total 9 columns):
name        1616 non-null object
address     1520 non-null object
city        1540 non-null object
lat         1616 non-null float64
lng         1616 non-null float64
place_id    1616 non-null object
rating      1449 non-null float64
class       1616 non-null object
type        15 non-null object
dtypes: float64(3), object(6)
memory usage: 113.7+ KB
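
The "non-null" counts reported by `info()` show that some columns have gaps (for example, `address` has 1520 non-null entries out of 1616). One way to count the gaps directly is `isna().sum()`. A minimal sketch on a small hand-built frame, with values taken from the data above; `small` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Two complete rows and two rows with missing city and rating.
small = pd.DataFrame({
    'name': ["Trader Joe's", 'Uwajimaya', 'Admiral', 'Ballard'],
    'city': ['Seattle', 'Seattle', np.nan, np.nan],
    'rating': [4.5, 4.5, np.nan, np.nan],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = small.isna().sum()

# notna() gives the complement, matching the "non-null" counts from info().
non_null_per_column = small.notna().sum()

print(missing_per_column)
print(non_null_per_column)
```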


In [26]:
# Simple statistics are obtained for numerical columns
df.describe()

| | lat | lng | rating |
|---|---|---|---|
| count | 1616.0 | 1616.0 | 1449.0 |
| mean | 47.619907 | -122.321132 | 4.194134 |
| std | 0.056363 | 0.049129 | 0.638952 |
| min | 47.455556 | -122.411493 | 1.0 |
| 25% | 47.586566 | -122.347183 | 3.9 |
| 50% | 47.614611 | -122.328568 | 4.3 |
| 75% | 47.661334 | -122.30948 | 4.6 |
| max | 47.790207 | -122.114213 | 5.0 |


In [43]:
# Columns can be accessed as dictionary items:
print(df['city'].unique())
# Or accessed as attributes for faster typing:
print(df.city.unique())

['Seattle' 'Medina' 'Mercer Island' 'Shoreline' 'Edmonds' 'Tukwila'
 'Kenmore' 'Kirkland' 'Lake Forest Park' 'Bothell' 'Renton' 'Bellevue'
 'Newcastle' 'SeaTac' 'Burien' nan]
['Seattle' 'Medina' 'Mercer Island' 'Shoreline' 'Edmonds' 'Tukwila'
 'Kenmore' 'Kirkland' 'Lake Forest Park' 'Bothell' 'Renton' 'Bellevue'
 'Newcastle' 'SeaTac' 'Burien' nan]


In [44]:
# NOTE: 'class' is a reserved keyword in Python, so attribute access
# does not work for this column, even though it is a valid column heading
print(df['class'].unique())
#print(df.class.unique()) # This raises a SyntaxError

['supermarket' 'library' 'hospital' 'pharmacy' 'post_office' 'school'
 'cafe' 'urban village' 'destination park' 'citywide']
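
The attribute form fails because `class` is a reserved word in Python, so the parser rejects `df.class` before pandas is involved at all. A minimal sketch that checks this with the standard `keyword` module; `frame` is a hypothetical two-row frame built for the example.

```python
import keyword
import pandas as pd

frame = pd.DataFrame({'class': ['supermarket', 'library'],
                      'city': ['Seattle', 'Seattle']})

# 'class' is a Python keyword; 'city' is not.
class_is_keyword = keyword.iskeyword('class')
city_is_keyword = keyword.iskeyword('city')

# Compiling the attribute form fails at parse time with a SyntaxError.
try:
    compile('frame.class', '<example>', 'eval')
    raised_syntax_error = False
except SyntaxError:
    raised_syntax_error = True

# Bracket access works for any column label.
classes = frame['class'].tolist()

print(class_is_keyword, city_is_keyword, raised_syntax_error, classes)
```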


In [45]:
# let's change the name of class to avoid that error
df.rename(columns={'class':'place_class'}, inplace=True)
df.place_class.unique()

array(['supermarket', 'library', 'hospital', 'pharmacy', 'post_office',
       'school', 'cafe', 'urban village', 'destination park', 'citywide'],
      dtype=object)

In [46]:
# Another common issue with tabular data:
# some measurements don't fit into the defined columns
# or are missing, and are therefore filled with 'not-a-number' (NaN).
# For example, some entries don't have a listed city:
print(df.city.unique())


['Seattle' 'Medina' 'Mercer Island' 'Shoreline' 'Edmonds' 'Tukwila'
 'Kenmore' 'Kirkland' 'Lake Forest Park' 'Bothell' 'Renton' 'Bellevue'
 'Newcastle' 'SeaTac' 'Burien' nan]


In [31]:
# Extract data entries without a city
dfNan = df[df.city.isna()]
dfNan

| | name | address | city | lat | lng | place_id | rating | place_class | type |
|---|---|---|---|---|---|---|---|---|---|
| 1540 | 12th Avenue | NaN | NaN | 47.608315 | -122.317345 | 12th-Avenue | NaN | urban village | NaN |
| 1541 | 23rd & Union-Jackson | NaN | NaN | 47.603145 | -122.306682 | 23rd-&-Union-Jackson | NaN | urban village | NaN |
| 1542 | Admiral | NaN | NaN | 47.582350 | -122.386420 | Admiral | NaN | urban village | NaN |
| 1543 | Aurora-Licton Springs | NaN | NaN | 47.696854 | -122.345977 | Aurora-Licton-Springs | NaN | urban village | NaN |
| 1544 | Ballard | NaN | NaN | 47.670593 | -122.382603 | Ballard | NaN | urban village | NaN |
| 1545 | Ballard-Interbay-Northend | NaN | NaN | 47.659726 | -122.372020 | Ballard-Interbay-Northend | NaN | urban village | NaN |
| 1546 | Belltown | NaN | NaN | 47.614435 | -122.347341 | Belltown | NaN | urban village | NaN |
| 1547 | Bitter Lake Village | NaN | NaN | 47.728560 | -122.350653 | Bitter-Lake-Village | NaN | urban village | NaN |
| 1548 | Capitol Hill | NaN | NaN | 47.620316 | -122.319866 | Capitol-Hill | NaN | urban village | NaN |
| 1549 | Chinatown-International District | NaN | NaN | 47.597980 | -122.325308 | Chinatown-International-District | NaN | urban village | NaN |
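
The selection `df[df.city.isna()]` is an instance of boolean masking: a column of True/False values picks which rows to keep. The sketch below shows the mask, its complement, and one possible way to fill the gaps, on a small hand-built frame. The frame and the placeholder value `'unknown'` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

places = pd.DataFrame({
    'name': ["Trader Joe's", 'Admiral', 'Ballard'],
    'city': ['Seattle', np.nan, np.nan],
    'place_class': ['supermarket', 'urban village', 'urban village'],
})

# True where the city is missing.
mask = places['city'].isna()

without_city = places[mask]      # rows lacking a city
with_city = places[~mask]        # '~' inverts the mask

# Which classes of place lack a city?
classes_without_city = without_city['place_class'].value_counts()

# fillna returns a new frame; here only the 'city' column is filled.
filled = places.fillna({'city': 'unknown'})

print(classes_without_city)
print(filled)
```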
