# Lab 1: Basic Data Analysis

This lab gets you started with running Jupyter notebooks and familiarizes you with loading and analyzing data in a notebook. You can also use the notebooks from the lectures as reference points.

## Learning Objectives

This lab gives you a gentle introduction to Python notebooks, to loading a dataset into a notebook, and to performing some basic queries on that dataset.

After completing this lab, you should be able to: 

1. Start a Jupyter notebook and write/execute basic Python code in the notebook.
2. Understand basic Python operations, including how to write a function, and how to load data.
3. Understand the purpose of Pandas and plotting libraries (e.g., matplotlib) and gain some initial, basic experience with them.

## 1. Python and Notebooks: Hello World

If you've gotten this far, you presumably have the notebook running. You should have started it from the command line (using the instructions on the course page) or via Anaconda.

Elements in a notebook are divided into cells. A cell is either a markdown cell (such as this one), which contains text, or a code cell. The cell below contains code. You can execute a cell by clicking on it and pressing "Shift + Enter". Try your first "Hello World" Python program below.

In [1]:
# YOUR CODE HERE

You can also define and call functions in cells. In the cell below, write a simple function that takes a string as an argument and prints that string as an output.
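As a rough sketch of the syntax only: the function below is defined with `def`, takes one argument, and is then called on the last line. The name `shout` is made up for illustration, and unlike the function asked for above, it returns a value rather than printing inside the function.

```python
def shout(message):
    """Return the message in upper case with an exclamation mark appended."""
    return message.upper() + "!"


# Call the function and print the value it returns.
print(shout("notebooks are fun"))
```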

In [2]:
# YOUR CODE HERE

## 2. Basic Operations in Python: Loading Data

You should familiarize yourself with loading data into a Python notebook in various formats. Some of the notebooks we have provided include examples of loading files.

Much of the data we will use is in formats such as comma-separated value (CSV), but you may also find the need to load data that is in other formats (e.g., JSON, SQL databases). The Python Pandas library provides easy ways to load these types of files into Pandas "data frames".
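A minimal sketch of the idea, assuming tiny in-memory strings stand in for real files: `pd.read_csv` and `pd.read_json` each return a data frame. The `station` and `docks` columns are invented for this illustration. With a real file you would pass a file path instead of the `io.StringIO` wrapper.

```python
import io

import pandas as pd

# Made-up data in two formats; both describe the same two rows.
csv_text = "station,docks\nA,15\nB,23\n"
json_text = '[{"station": "A", "docks": 15}, {"station": "B", "docks": 23}]'

# io.StringIO makes a string behave like an open file.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```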

### 2.1 Loading a CSV File into Pandas

Load the [Divvy Trip data](https://data.cityofchicago.org/Transportation/Divvy-Trips/fg6s-gzvg) from the City of Chicago data portal into a Pandas data frame.

**Note**: The file is large (5 GB), so downloading and loading it may take a while, possibly 5-10 minutes. Be patient! If you have problems, the course staff has a smaller, truncated version of the file to work with; let us know.
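When a file is large, it can help to look at a small piece first. The sketch below uses `pd.read_csv` with `nrows` and `usecols` to read only a preview, and `chunksize` to walk through a file in pieces. The data and column names (`trip_id`, `start_time`, `duration_s`, `notes`) are made up, and a small in-memory string stands in for the real file.

```python
import io

import pandas as pd

# A made-up five-row "file" standing in for a much larger CSV.
csv_text = (
    "trip_id,start_time,duration_s,notes\n"
    "1,2019-01-01 08:00:00,600,a\n"
    "2,2019-01-01 09:30:00,420,b\n"
    "3,2019-01-02 07:45:00,1500,c\n"
    "4,2019-01-02 18:10:00,300,d\n"
    "5,2019-01-03 12:00:00,900,e\n"
)

# Read only the first two rows and only two of the columns.
preview = pd.read_csv(
    io.StringIO(csv_text), nrows=2, usecols=["trip_id", "duration_s"]
)
print(preview)

# Iterate over the file two rows at a time instead of loading it all at once.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
print(total_rows)
```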

In [3]:
# YOUR CODE HERE

### 2.2 Basic Data Analysis in Pandas 

Now that you have the data loaded into a basic dataframe, you can ask some questions about the data, using Pandas. Each of these questions is intended to give you practice manipulating and analyzing a data frame.

#### 2.2.1 What is the number of rows in the data frame? 

This question is intended to help you understand one of the most basic questions about your data: How many data points does it have?

Call a single Pandas dataframe function to figure out how large this dataset is. Your answer should produce a single integer.
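For orientation, here is a sketch on a made-up three-row frame (the column names are hypothetical). The `shape` attribute reports rows and columns together, and Python's built-in `len` reports the number of rows.

```python
import pandas as pd

# Toy data frame with three rows and two columns.
toy = pd.DataFrame({"trip_id": [1, 2, 3], "duration_s": [600, 420, 1500]})

print(toy.shape)  # a (rows, columns) tuple
print(len(toy))   # number of rows only
```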

In [4]:
# YOUR CODE HERE

#### 2.2.2 What are the start and end dates of the rides in the data set?

It is typically important to understand basic information about the data, such as when it starts and ends. 

This is also an example of _looking for outliers_. The Divvy program started in Chicago somewhat recently (find out when!), so if the earliest ride in the dataset predates that, you know the dataset has a problem! Performing these kinds of basic sanity checks on the data is critical and something you should always be doing.
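One possible approach, sketched on made-up timestamps in a hypothetical `start_time` column: convert the text to real datetimes with `pd.to_datetime`, then take the minimum and maximum. Converting first matters because date strings in some formats do not sort chronologically as plain text.

```python
import pandas as pd

# Toy data: timestamps stored as text, deliberately out of order.
toy = pd.DataFrame(
    {
        "start_time": [
            "2019-03-02 08:15:00",
            "2019-01-15 17:40:00",
            "2019-02-20 12:05:00",
        ]
    }
)

# Parse the strings into datetime values.
toy["start_time"] = pd.to_datetime(toy["start_time"])

earliest = toy["start_time"].min()
latest = toy["start_time"].max()
print(earliest, latest)
```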

In [5]:
# YOUR CODE HERE

#### 2.2.3 What is the mean duration of all trips?

This question is intended to give you exposure to (1) selecting a column from a Pandas dataframe; (2) applying an aggregate function (e.g., a mean) to a column of the dataframe. Of course, it is possible to apply other aggregate functions, and you should familiarize yourself with those.
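A small sketch of both steps on made-up numbers, assuming a hypothetical `duration_s` column: square brackets select the column as a Series, and aggregate methods such as `mean`, `median`, or `agg` reduce it to summary values.

```python
import pandas as pd

toy = pd.DataFrame({"duration_s": [600, 420, 1500, 300]})

# (1) Select one column; the result is a pandas Series.
durations = toy["duration_s"]

# (2) Apply aggregate functions to that column.
mean_duration = durations.mean()
median_duration = durations.median()
summary = durations.agg(["min", "max", "sum"])

print(mean_duration, median_duration)
print(summary)
```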

In [6]:
# YOUR CODE HERE

#### 2.2.4 Do men or women take longer trips on average?

The goal of this question is to give you experience with the `groupby` function in Pandas, as well as how to combine `groupby` with an aggregation operation. There are a couple of ways to answer this question; you could also do it with a conditional select. Try it both ways!
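To illustrate the two techniques without using the lab's data, the sketch below compares mean durations across a hypothetical `user_type` column in a made-up frame. The first version groups and aggregates in one step; the second filters rows with a boolean condition and aggregates each subset separately.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "user_type": ["Subscriber", "Customer", "Subscriber", "Customer"],
        "duration_s": [600, 1800, 400, 2200],
    }
)

# groupby: one mean per distinct value of user_type.
by_group = toy.groupby("user_type")["duration_s"].mean()
print(by_group)

# Conditional selection: build a boolean mask, keep matching rows, aggregate.
subscriber_mean = toy[toy["user_type"] == "Subscriber"]["duration_s"].mean()
customer_mean = toy.loc[toy["user_type"] == "Customer", "duration_s"].mean()
print(subscriber_mean, customer_mean)
```

Both routes should agree; `groupby` scales better when the column has many distinct values.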

**Using Groupby**

In [7]:
# YOUR CODE HERE

**Using Conditional Selection**

In [8]:
# YOUR CODE HERE

#### 2.2.5 Sanity Checks

We just performed the above operations without checking how many rides were taken by males and females.  Quickly do that below.  Also compute the sum of male and female rides.
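One way to count rows per category is `value_counts`, sketched here on a made-up `rider_group` column. Note that `value_counts` skips missing values by default; passing `dropna=False` includes them, which is worth keeping in mind when comparing against the total row count.

```python
import pandas as pd

# Toy column with two missing entries.
toy = pd.DataFrame({"rider_group": ["a", "b", None, "a", None]})

counts = toy["rider_group"].value_counts()
counts_with_missing = toy["rider_group"].value_counts(dropna=False)

print(counts)
print(counts.sum(), counts_with_missing.sum(), len(toy))
```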

In [9]:
# YOUR CODE HERE

Compare the number you just computed to the total number of data points.  Do they match? Why or why not?

In [None]:
# ENTER YOUR ANSWER AS A COMMENT

#### 2.2.6 Other checks

We know anecdotally that the birth year column (`BIRTH YEAR`) has several missing values. How many rows exactly are missing a birth year? 

In [10]:
# YOUR CODE HERE

What proportion of rows in the dataset have a missing birth year?
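Both the count and the proportion of missing values can be derived from `isna`, which marks each entry as `True` or `False`. The sketch below uses made-up values in a `BIRTH YEAR` column: summing the booleans gives a count, and averaging them gives a proportion.

```python
import pandas as pd

# Toy column; None becomes NaN in a numeric column.
toy = pd.DataFrame({"BIRTH YEAR": [1985, None, 1992, None, 2001]})

is_missing = toy["BIRTH YEAR"].isna()

n_missing = is_missing.sum()       # True counts as 1, False as 0
share_missing = is_missing.mean()  # fraction of rows that are missing

print(n_missing, share_missing)
```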

In [11]:
# YOUR CODE HERE

## 3. Basic Plotting and Visualization

The goal of this part of the assignment is to give you basic experience with plotting data from datasets. You can use the same types of operations as were demonstrated in lecture to perform this part of the lab.

### 3.1 Plotting Ride Volumes Over Time

The exercises below give you some experience with plotting rides over time.

#### 3.1.1 Setting an index in the data frame

Recall that the first steps include importing the plotting libraries and setting one of the columns in the data frame as the index.
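A sketch of the indexing step only, on made-up data with a hypothetical `start_time` column: parse the column as datetimes, make it the index with `set_index`, and sort so the rows are in time order. Which column to use in the real dataset is left to you.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "start_time": ["2019-01-02 09:15", "2019-01-01 08:00", "2019-01-01 17:30"],
        "duration_s": [1500, 600, 420],
    }
)

# Parse the timestamps, then use them as the (sorted) index.
toy["start_time"] = pd.to_datetime(toy["start_time"])
toy = toy.set_index("start_time").sort_index()

print(type(toy.index).__name__)
print(toy)
```

A datetime index is what makes time-based operations such as `resample` available.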

In [12]:
# YOUR CODE HERE

#### 3.1.2 Plotting the total trip duration by day

Plot the total trip duration by day. While there are a number of ways to perform this operation, you may find the `resample` function in Pandas useful.
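As a hedged illustration of `resample`, the sketch below sums a made-up `duration_s` column per calendar day (`"D"`) and plots the result. It assumes the frame already has a datetime index and that matplotlib is installed, since pandas uses it for `.plot()`.

```python
import pandas as pd

# Toy frame indexed by made-up trip start times.
toy = pd.DataFrame(
    {"duration_s": [600, 420, 1500, 300]},
    index=pd.to_datetime(
        [
            "2019-01-01 08:00",
            "2019-01-01 17:30",
            "2019-01-02 09:15",
            "2019-01-03 12:00",
        ]
    ),
)

# Group rows into calendar days and sum the durations within each day.
daily_total = toy["duration_s"].resample("D").sum()
print(daily_total)

# Series.plot draws with matplotlib and returns the Axes object.
ax = daily_total.plot(title="Total trip duration per day (toy data)")
ax.set_ylabel("seconds")
```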

In [13]:
# YOUR CODE HERE

#### 3.1.3 Plot the total number of trips by day

This calculation is a little bit trickier, since there's no number in the data frame to plot: each row is what you're trying to count. You may find `resample` useful, but you may have to add some data to the data frame to make it work. `groupby` and `count` may also work.
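Two possible ways to count rows per day, sketched on made-up data with a datetime index: `resample(...).size()` counts rows in each bin directly, while adding a helper column of ones (here given the hypothetical name `n`) lets you reuse `sum`. Plotting is omitted; either resulting Series could be passed to `.plot()`.

```python
import pandas as pd

# Toy frame with a gap on 2019-01-03.
toy = pd.DataFrame(
    {"duration_s": [600, 420, 1500, 300]},
    index=pd.to_datetime(
        [
            "2019-01-01 08:00",
            "2019-01-01 17:30",
            "2019-01-02 09:15",
            "2019-01-04 12:00",
        ]
    ),
)

# Option 1: count the rows that fall into each daily bin.
trips_per_day = toy.resample("D").size()

# Option 2: add a column of ones and sum it per day.
toy["n"] = 1
trips_per_day_alt = toy["n"].resample("D").sum()

print(trips_per_day)
```

Days without any rows show up with a count of zero in both versions.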

In [14]:
# YOUR CODE HERE

### 3.2 Data Exploration on Your Own

Pick a question or hypothesis, justify **why** you picked that question (i.e., why it might be an interesting question to some audience, such as city officials), and present a simple analysis. 

Optionally, you may pick a related dataset from the Chicago data portal and join it onto the Divvy trips data. As above, be sure to justify your choice and explain what analysis you can now perform (as well as the analysis itself).
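If you do join a second dataset, `DataFrame.merge` is one option. The sketch below is purely illustrative: both frames, their column names, and the station-to-neighborhood mapping are made up. A left join keeps every trip even when no matching station row exists.

```python
import pandas as pd

# Made-up trips and a made-up lookup table of stations.
trips = pd.DataFrame({"trip_id": [1, 2, 3], "from_station_id": [10, 11, 10]})
stations = pd.DataFrame(
    {"station_id": [10, 11], "neighborhood": ["Loop", "Hyde Park"]}
)

# Match each trip's starting station to its row in the lookup table.
joined = trips.merge(
    stations, left_on="from_station_id", right_on="station_id", how="left"
)
print(joined)
```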

Some example questions might include:
* Adjusting for seasons, is ridership increasing? (You could use conditional selection on dates or months.)
* Are rides getting longer? (on average? max?)
* Do ride characteristics differ by user type?
* Are certain trip routes (e.g. pairs of start and end stations) more popular than others? Does this change during peak and non-peak "rush" hours (defined loosely)?
* Which neighborhoods are seeing the most ridership? (More difficult! Requires spatial analysis!)
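As a starting point for a question like route popularity, grouping by two columns at once counts each distinct pair. This is only a sketch on made-up data; `from_station` and `to_station` are hypothetical column names.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "from_station": ["A", "A", "B", "A"],
        "to_station": ["B", "B", "A", "C"],
    }
)

# Count rows for every (start, end) pair, most frequent first.
route_counts = (
    toy.groupby(["from_station", "to_station"]).size().sort_values(ascending=False)
)
print(route_counts)
```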

In [15]:
# YOUR CODE HERE