# Behavioral Lab: Getting started with Python and Jupyter Notebook

**Welcome to S&DS 177/577 -- YData: Covid-19 Behavioral Impacts** 

Most class meeting times will include a lecture portion and a lab portion. This is an example of a lab for behavioral analysis using public data. The purpose of the labs is to give you hands-on experience using the skills from the main YData course (S&DS 123/523) while working within a particular domain. For this course, the focus is on analyzing possible behavioral impacts under the current pandemic, COVID-19!

Before we get started, there are some details for the class:

* I highly recommend going through the [Practice 01 questions](http://ydata123.org/sp19/calendar.html) on the main YData website first, which explain how to set up Jupyter Notebook.
* This lab covers parts of [Chapter 3](http://www.inferentialthinking.com/chapters/03/programming-in-python.html) of the online textbook. If you are not familiar with any expression below, this can be a good reference!
* I encourage group collaboration on the course labs. If you get stuck for a while on a question, feel free to ask your classmates (or email me for help if needed). However, please take some time to think by yourself first, and do not simply share answers with each other.

In today's exercise, we will go through the ways to:

* Navigate Jupyter Notebook
* Write basic expressions in Python
* Import and manipulate tables

# 1. Jupyter Notebook

This portion is copied from the [Practice 01 questions](http://ydata123.org/sp19/calendar.html). If you have already gone through the basic concepts there, you can skip this section. If the different platforms ever confuse you, remember the following: *GitHub* lets us share code, but its contents are static; the code we share is usually in `.ipynb` files, which contain Python code written in *Jupyter Notebook*; and *Binder* lets us try and run each cell interactively.

When you work in Jupyter Notebook on future homework assignments or exams, set the cell type in the drop-down box to *Markdown* to type text like this cell (the shortcut is 'm' when the cell is highlighted), or to *Code* to type computer code.

........................................................................................................................................................................

This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results. 

## 1.1. Text cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.)

**Question 1.1.1.** This paragraph is in its own text cell.  Try editing it so that this sentence is the last sentence in the paragraph, and then click the "run cell" ▶| button.  This sentence, for example, should be deleted.  So should this one.

## 1.2. Code cells
Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press ▶| or hold down the `shift` key and press `return` or `enter`.

Try running this cell:

In [6]:
print("Hello, World!")

Hello, World!


And this one:

In [7]:
print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")

👋, 🌏!


The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [8]:
print("First this line is printed,")
print("and then this one.")

First this line is printed,
and then this one.


**Question 1.2.1.** Change the cell above so that it prints out:

    First this line,
    then the whole 🌏,
    and then this one.

*Hint:* If you're stuck on the Earth symbol for more than a few minutes, try talking to a neighbor or a TF.  That's a good idea for any exercise.

## 1.3. Writing Jupyter notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar.  It'll start out as a text cell.  You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".

**Question 1.3.1.** Add a code cell below this one.  Write code in it that prints out:
   
    A whole new cell! ♪🌏♪

(That musical note symbol is like the Earth symbol.  Its long-form name is `\N{EIGHTH NOTE}`.)

Run your cell to verify that it works.
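
The `\N{...}` escapes used in this section refer to official Unicode character names. As a small sketch (not part of the original exercise), Python's standard `unicodedata` module can look up the name of any symbol, which may help if you want to print a different one. The list name `symbols` is just an illustrative choice.

```python
import unicodedata

# The three symbols that appear in this section, written with \N{...} escapes.
symbols = [
    "\N{WAVING HAND SIGN}",
    "\N{EARTH GLOBE ASIA-AUSTRALIA}",
    "\N{EIGHTH NOTE}",
]

# unicodedata.name() goes the other way: from a symbol to its long-form name.
for symbol in symbols:
    print(symbol, unicodedata.name(symbol))
```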

## 1.4. Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Run it and see what happens.

print("This line is missing something."

You should see something like this (minus our annotations):

<img src="error.jpg"/>

The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  "`EOF`" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it.  (Of course, if you're frustrated, ask a neighbor or a TF for help.)

Try to fix the code above so that you can run the cell and see the intended message instead of an error.
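
The exact wording of the error depends on your Python version: older versions mention `EOF`, while Python 3.10 and later report that the parenthesis was never closed. The sketch below is one way to see the message your own Python produces without crashing the cell. It uses the built-in `compile` function on the broken line; the names `broken_source` and `error_message` are illustrative only.

```python
# The same broken line as above, stored as text so we can inspect the error.
broken_source = 'print("This line is missing something."'

try:
    # compile() checks the syntax of the text without running it.
    compile(broken_source, "<cell>", "exec")
    error_message = None
except SyntaxError as err:
    # err.msg holds the short description shown on the last line of the error.
    error_message = err.msg
    print("SyntaxError:", error_message)
```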

## 1.5. The Kernel
The kernel is a program that executes the code inside your notebook and outputs the results. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (⚪), the kernel is idle and ready to execute code. If the circle is filled in (⚫), the kernel is busy running some code. 

You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. If this happens, try the following steps:
1. At the top of your screen, click **Kernel**, then **Interrupt**.
2. If that doesn't help, click **Kernel**, then **Restart**. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work.
3. If that doesn't help, restart your server. First, save your work by clicking **File** at the top left of your screen, then **Save and Checkpoint**. Next, click **Control Panel** at the top right. Choose **Stop My Server** to shut it down, then **My Server** to start it back up. Then, navigate back to the notebook you were working on.

# 2. Import a Local Dataset

Now that we are familiar with Jupyter Notebook, remember that it is a platform for doing complicated data analysis. The first thing we need to know is how to import a dataset for analysis.

In the main YData class, you will learn a lot more about the `Table` object. This object is part of the `datascience` package, which was created by the creators of the Berkeley Data 8 class, on which our YData classes are based. We will use `Table` objects extensively for our analyses, so you will become very familiar with how to manipulate tables. For today, let's start with some basic functionality, including loading tables and selecting rows and columns from them.

## 2.1 Loading data tables

Remember to download the course datasets from the [Google Drive link](https://drive.google.com/drive/folders/1SLPLMpgdzcn79iMBOYhh9CmOZaw7odpE). Again, 
please do not share the datasets with others outside this class unless you contact me and receive my permission.  

After downloading the datasets, put them in a folder (I named mine "YData_SDS177", as you can see below) inside your course directory.

The cell below imports the functions related to tables and loads the data into a table.

In [1]:
from datascience import *
# weather.csv is a locally saved table.
# YData_SDS177 is the folder I put all the datasets in.

weather = Table.read_table('YData_SDS177/weather.csv')  ## change the table to a more informative one

In [4]:
weather

| geoid | date | precip | rmax | rmin | srad | tmin | tmax | wind_speed |
|---|---|---|---|---|---|---|---|---|
| 1001 | 01jan2020 | 1.53333 | 33.2828 | 21.9805 | 125.57 | 56.4397 | 69.5507 | 4.8467 |
| 1001 | 02jan2020 | 3.72874 | 79.1391 | 64.1092 | 41.7069 | 54.3438 | 63.1245 | 14.7124 |
| 1001 | 03jan2020 | 46.0931 | 90.3954 | 65.0517 | 36.4356 | 54.2569 | 66.3355 | 7.82156 |
| 1001 | 04jan2020 | 0.477011 | 82.2195 | 40.4034 | 106.547 | 44.1003 | 65.7893 | 12.388 |
| 1001 | 05jan2020 | 0.0 | 61.6414 | 22.2126 | 139.33 | 37.2334 | 65.7397 | 4.61015 |
| 1001 | 06jan2020 | 0.0 | 93.1793 | 25.8115 | 136.917 | 36.77 | 73.0679 | 5.89574 |
| 1001 | 07jan2020 | 0.0 | 78.7816 | 24.1483 | 141.071 | 40.2252 | 73.9183 | 10.3105 |
| 1001 | 08jan2020 | 0.0 | 44.8644 | 14.2172 | 142.807 | 44.3693 | 75.8983 | 5.08839 |
| 1001 | 09jan2020 | 0.0 | 99.946 | 42.8897 | 135.445 | 34.2541 | 60.0169 | 11.3698 |
| 1001 | 10jan2020 | 0.877011 | 100.0 | 98.654 | 110.045 | 29.0776 | 55.5955 | 15.3963 |


Notice also that there is a note next to the final line of code that begins with #. Python ignores anything following the #, so it is a great way to add documentation and notes to your code (these notes are called comments).

How large is this table? We can get the number of columns and rows in the full table using the syntax `my_table.num_columns` and `my_table.num_rows`.

**Question 2.1.** What does each row represent, and what do the columns represent? Also, how many rows and columns are there in the weather table?


**Answers:**

In [38]:
# Type your answer here


## 2.2 Selecting a subset of columns

We can select a subset of columns using the `my_table.select()` method.  For example, to select columns 0, 1, and 5 and print them out we can use the syntax: `my_table.select(0,1,5)`. Notice that the indexing starts at 0, not 1! 

We can also select columns by name using the syntax: `my_table.select('col_name_1', 'col_name_2', 'col_name_3')`

**Question 2.2** Select the columns `precip` and `tmax` from the weather table. What is a reason why it might be useful to select a subset of columns?

**Answer:** 

In [39]:
# Type your answer here


## 2.3 Selecting a subset of rows from a table

To select rows from a table, we can use the `take()` method. For example, we can select rows 2, 5, and 6 using the syntax: `my_table.take((2,5,6))`. 


**Question 2.3:** Select the columns date, precip, tmin, and tmax, and the rows 65470, 77657, 74271, and 88498 from the weather table. Do this with **a single line of Python code**! Is there anything similar about the weather measurements in these rows?

**Answer**: 

In [40]:
# Type your answer here



## 2.4 Using methods on values in a column of data in a table

`Table` objects in the `datascience` package also have methods that operate on the values in a column. For example, we can sum the values in a column using the `my_table.sum()` method. **Note**: if you use this method on a table that contains values that are not numbers, you will get an error message.


**Question 2.4**. What is the total precipitation from Jan 1st to May 26th of this year? 

**Answer:** 

In [41]:
# Type your answer here


## 2.5 Sorting values in a column of data in a table

We can also sort the rows of a table by the values in a column using the `my_table.sort('col_name')` method. To sort from the largest value to the smallest, set the additional argument `descending = True` (i.e., `my_table.sort('col_name', descending = True)`).


**Question 2.5**. Use the sort function to find the minimum temperature in the weather dataset. When (date) and where (geoid) was this minimum temperature reached? 


**Answer:**

In [None]:
# Type your answer here


# 3. Import an Online Dataset: COVID-19 CASE/DEATH REPORTS in New York Times

We sometimes need to import a library (a collection of modules written in Python that provide standardized solutions for many everyday programming problems) before *calling* a function. For instance, to read data from a web link, I need to run the following cell first.

In [2]:
from urllib.request import urlopen 
import re
def read_url(url): 
    return re.sub('\\s+', ' ', urlopen(url).read().decode())

We first use the code that was used in [Chapter 1](https://www.inferentialthinking.com/chapters/01/3/Plotting_the_Classics.html) with `read_url`.

In [3]:
nyt_url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
nyt_data = read_url(nyt_url)

If we read the URL directly, instead of getting a nice table, we get a pile of words and numbers mixed together. If you don't believe this, run the following cell (remove the #) to see for yourself, although it will take a long time, and even longer if you then export the file as a .pdf or .ipynb file (so do not do this for future homework assignments). Text in this form would be hard for us to visualize and analyze directly.

In [1]:
# nyt_data 
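
To see why the text comes out as one jumbled pile, note that `read_url` replaces every run of whitespace, including the line breaks between rows, with a single space. The sketch below applies the same `re.sub` call to a tiny hand-typed stand-in for the file (`raw_text` is illustrative, not the real download), so it runs instantly.

```python
import re

# A tiny stand-in for the downloaded text: a header line and two data rows.
raw_text = (
    "date,county,state,fips,cases,deaths\n"
    "2020-01-21,Snohomish,Washington,53061,1,0\n"
    "2020-01-24,Cook,Illinois,17031,1,0\n"
)

# The same substitution that read_url performs: line breaks become spaces.
flattened = re.sub('\\s+', ' ', raw_text)

print(flattened)              # everything on one line, rows run together
print(raw_text.splitlines())  # the original text still had one row per line
```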

In [4]:
# Let's try a new tool named pandas
import pandas as pd

# sep=',' tells pandas that the values in each row are separated by commas.
nyt_pddata = pd.read_csv(nyt_url,sep=",")

In [5]:
nyt_pddata

|  | date | county | state | fips | cases | deaths |
|---|---|---|---|---|---|---|
| 0 | 2020-01-21 | Snohomish | Washington | 53061.0 | 1 | 0 |
| 1 | 2020-01-22 | Snohomish | Washington | 53061.0 | 1 | 0 |
| 2 | 2020-01-23 | Snohomish | Washington | 53061.0 | 1 | 0 |
| 3 | 2020-01-24 | Cook | Illinois | 17031.0 | 1 | 0 |
| 4 | 2020-01-24 | Snohomish | Washington | 53061.0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 670631 | 2020-10-26 | Sweetwater | Wyoming | 56037.0 | 476 | 2 |
| 670632 | 2020-10-26 | Teton | Wyoming | 56039.0 | 756 | 1 |
| 670633 | 2020-10-26 | Uinta | Wyoming | 56041.0 | 442 | 3 |
| 670634 | 2020-10-26 | Washakie | Wyoming | 56043.0 | 145 | 7 |


This dataset shows the daily COVID-19 cases and deaths in each US county. The records for each county begin on the date its first case was reported, so you may notice that different counties start on different dates.
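
As a hedged sketch of how you might check those start dates with pandas, the cell below reads a tiny hand-typed sample (three rows modeled on the output above, not the real file) and finds the earliest date for each county. The names `csv_text`, `sample`, and `first_dates` are illustrative.

```python
import io
import pandas as pd

# A tiny hand-typed sample with the same columns as the NYT file.
csv_text = """date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
2020-01-24,Cook,Illinois,17031,1,0
"""

# io.StringIO lets read_csv treat the text as if it were a file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)  # (number of rows, number of columns)

# Earliest date recorded for each county; dates written as YYYY-MM-DD
# sort correctly even when they are stored as text.
first_dates = sample.groupby('county')['date'].min()
print(first_dates)
```

On the real data, the same two steps could be applied to `nyt_pddata` in place of `sample`.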

This is a major dataset that we will focus on in our next lab! 

Continue to explore the data on your own and report three findings you think are interesting. Also describe any additional capabilities you wish you knew how to do that would enable you to answer additional interesting questions. 

Great job!  You have finished the first lab. 

Now, let's recall the way to save the file (in both .pdf and .ipynb format), so any changes we made here can be saved and printed.
   

To produce the .pdf, please do the following in order to preserve the cell structure of the notebook:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "HTML (.html)"
3.  After the .html has downloaded, open it and then select "File" and "Print" (note you will not actually be printing)
4.  From the print window, select the option to save as a .pdf

To produce the .ipynb, please do the following:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "Notebook (.ipynb)"