# PSTAT 100 Project


You'll work with other students in the class to complete a group project in which you'll practice the data science lifecycle in the context of a topic of your choosing. As a group, you'll be responsible for:

* choosing a topic and obtaining and tidying data;
* conducting exploratory analysis to hone/identify questions;
* carrying out a focused analysis addressing one or more of your questions;
* preparing a written summary of your work.

Project work will be divided into two stages.

**Stage 1**: *Preparation and planning.* In this stage you'll lay the groundwork for your project, make some initial progress, and come up with a vision for what your analysis might look like. The goals for preparation are to get organized as a group (discuss interests, assign roles, and make a communication plan), select a topic and dataset(s) to work with, gather background information on your data, and tidy it up. The goals for planning are to conduct some simple explorations and formulate a shortlist of tentative questions that you could potentially address using the data you've obtained.

**Stage 2**: *Analysis and reporting.* In this stage you'll follow through on one or more of the questions you identified in the preparation and planning stage and report your findings. The main goals in this stage are to carry out one or more analyses of your data, prepare presentation graphics and tables, and put together a project report conveying the substance of your work.

You'll have approximately three weeks to work on each stage, and each stage has an associated deliverable. That's plenty of time if you spread out the effort -- the experience will be more rewarding and less stressful if your group plans effectively to make a little progress each week.

### Goal: *Use what you know to learn about something you're interested in.*

This project presents you with two opportunities: a chance to practice skills you've acquired in the course; and a chance to dig into a subject you're curious about through the lens of data.

*Using what you know* means applying skills you're comfortable with. You'll have more fun and your work will be better if you're not struggling to do something too close to the edge of your skill level. So if complicated dataframe manipulations and joins are not your thing, it will pay off to avoid choosing a dataset that's especially messy, or to let someone else in the group handle data tidying and volunteer to contribute elsewhere.

*Learning about something you're interested in* means following your curiosity. This could mean curiosity about the topic you choose *or* curiosity about a specific skill/method/technique that you apply. Again, you'll have more fun and your work will be better if you're engaged. So if you don't love the topic your group chooses but you do like making plots, consider trying to learn a new-to-you visualization technique.

### Expectations

Our expectations are pretty basic. We're expecting you to:
* work collaboratively;
* demonstrate at least one or two specific techniques from the class (visualization, KDE, PCA, regression, etc.);
* communicate your findings clearly.

We are *not* expecting you to produce novel discoveries or use sophisticated methods. Remember, the goal is to use what you know to learn about something you're interested in. Keep it simple, and have fun. 

---
## Stage 1: Preparation and planning

Your main goal in this stage is, in short, to figure out what you're going to do. This should involve the following steps.

1. Get organized as a group. 
    * Make **introductions**.
    * Discuss **interests** and areas of comfort.
    * Assign **roles**. At the very least, designate someone to coordinate meetings and communication and someone to coordinate preparing deliverables.
    * Make a **communication** plan. Checking in once a week is recommended.


2. Choose a dataset or datasets.
    * Get everyone involved in an **initial search.** Agree on a general area or areas based on group interests and identify promising datasets.
    * **Discuss** the pros and cons of each possibility.
    * **Decide** as a group which one(s) you will work with. 


3. Acquaint and tidy.
    * Gather **background** information on data collection and measurement and review any data documentation.
    * Put the data in **tidy** format.


4. Make a plan for next steps.
    * Conduct some **initial explorations** of the data.
    * Generate a shortlist of **questions** that you think would be interesting to explore.
    * Discuss **possible approaches** to your questions.
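
As a rough illustration of the tidying step above, here is a minimal sketch using pandas. The table, the column names (`country`, `year`, `gdp_growth`), and the values are all made up for illustration; your data will look different and may need different reshaping.

```python
import pandas as pd

# Hypothetical untidy table: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [1.0, 2.0],
    "2020": [1.5, 2.5],
})

# Reshape so that each row is one observation (a country in a year)
# and each column is one variable.
tidy = wide.melt(id_vars="country", var_name="year", value_name="gdp_growth")

# Year labels started out as column names (strings); store them as integers.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

A few rows printed this way are also the kind of "example rows in tidy format" the interim report asks for.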
    

### Data requirements

While the project is pretty open-ended, the dataset(s) you choose must meet a few minimum requirements.

* Raw data files should not exceed 100 MB.
* The data should come from a source identifiable by citation or link.
* The data source should provide some basic information about how it was collected.
* Data files should be in .csv format.
* After tidying, the data should comprise at least 100 observations and at least 4 variables.
* After tidying, the data should not contain more than 100,000 observations or more than 100 variables. (More is okay in the raw data file.)
* If metadata is not provided with the raw data, you should make a metadata .csv file with variable names, descriptions, units of measurement, and a URL or citation indicating the data source.
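
If it helps, the size-related requirements above can be checked with a few lines of Python. This is only a sketch: the function name `check_requirements` and its arguments are hypothetical, and it assumes the file size limit means decimal megabytes, which the requirement does not specify. It does not check the citation, documentation, or metadata requirements.

```python
MB = 1_000_000  # assumption: decimal megabytes

def check_requirements(n_obs, n_vars, raw_bytes, raw_name):
    """Return a list of unmet requirements (empty if all checks pass)."""
    problems = []
    if raw_bytes > 100 * MB:
        problems.append("raw file exceeds 100 MB")
    if not raw_name.lower().endswith(".csv"):
        problems.append("data file is not .csv")
    if n_obs < 100:
        problems.append("fewer than 100 observations after tidying")
    if n_vars < 4:
        problems.append("fewer than 4 variables after tidying")
    if n_obs > 100_000:
        problems.append("more than 100,000 observations after tidying")
    if n_vars > 100:
        problems.append("more than 100 variables after tidying")
    return problems

# Hypothetical example values. With pandas, n_obs and n_vars would come
# from df.shape, and raw_bytes from os.path.getsize(path).
print(check_requirements(n_obs=2500, n_vars=12, raw_bytes=8 * MB, raw_name="raw.csv"))
print(check_requirements(n_obs=60, n_vars=3, raw_bytes=8 * MB, raw_name="raw.xlsx"))
```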

### Possible data sources

You are welcome and encouraged to use data from any source, including data used in class, provided it meets the above requirements (though if you use class data, you should take your analysis in a different direction).

Here are a few trustworthy sources with interesting data:

* [World Bank Open Data](https://data.worldbank.org/) contains a wealth of usually well-organized datasets comprising country indicators. I've used a variety of data from this source in class examples and labs.

* [NOAA NCEI](https://www.ncdc.noaa.gov/) and the [EPA](https://www.epa.gov/data) publish open data related to climate and environment.

* The [CDC Data & Statistics portal](https://www.cdc.gov/datastatistics/index.html) links to a variety of health-related datasets. 

* The [National Center for Education Statistics](https://nces.ed.gov/) links to Department of Education data on secondary and higher education in the U.S. 

* The state of California has an [open data portal](https://data.ca.gov/) with data provided by state agencies on energy, health, transportation, etc.

* The [Census Bureau](https://data.census.gov/cedsci/) publishes many data products, although -- fair warning -- they are not the easiest to navigate and often contain overwhelming amounts of information.

You might also find inspiration and potential data sources from [Our World in Data](https://ourworldindata.org/), which contains a number of neat articles with nice visualizations organized by topic. 

Lastly, there are some repositories out there like the [UCI machine learning repository](https://archive.ics.uci.edu/ml/index.php), though these often provide pretty minimal background about available data.

### Deliverable

Your deliverable is a brief interim report (2-4 pages). The objective is simply to communicate what you're working on so that we can offer feedback.

Your report should comprise: 
* background information; 
* a description of the data, including collection methods, sampling design (if any), variable descriptions and other semantics;
* some example rows of the data in tidy format; 
* selected initial explorations, if any; 
* and a list of at least 2 focused questions and possible approaches.

A template will be provided separately. The most important parts are the data description, tidy rows, and the questions/approaches items, as these will enable us to offer constructive feedback. 

### Evaluation

Your report will be evaluated based on:
* meeting minimum data requirements;
* clarity of background and data description;
* successful tidying of the data;
* clarity and relevance of questions.

---
## Stage 2: Analysis and reporting

In this stage you'll follow through on the analysis plan your group made in the preparation and planning stage and prepare a report.

Your **analysis** should consist of a thorough exploratory analysis, possibly a focusing or reframing of questions, and an effort to address those questions through either visualizations and descriptive statistics or statistical modeling, whichever is more appropriate.

Your **report** should detail the substance of your project in a form that is presentable to a general audience with a basic knowledge of statistics and data science.

Work in this stage should involve the following steps.

1. Exploratory analysis.
    * If not completed in stage 1, do a general investigation of your data: explore variable distributions and relationships between variables through visual and descriptive summaries.
    * Choose and set aside one or two exploratory plots or tables that convey general properties of the dataset. 
    * Choose and set aside one or two exploratory plots or tables that convey the aspect of the data you intend to focus on.


2. Focused analysis.
    * Develop a graphic or table that directly addresses at least one of the questions you posed.
    * If applicable, fit a model to the data, check model assumptions, and interpret results.


3. Prepare your report.
    * Decide on which graphics and summaries should be included in the report.
    * Develop 'clean' versions of figures and tables that you intend to include in your report.
    * Assign writing tasks to each group member and draft report sections.
    * Review, edit, and finalize the report.
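
For the exploratory step above, a general investigation might begin with a few descriptive summaries. The sketch below uses pandas on a small made-up dataframe; the column names (`region`, `income`, `life_exp`) and values are hypothetical, and the summaries that make sense for your data may differ.

```python
import pandas as pd

# Hypothetical tidy data: one row per observation.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "income": [40.0, 55.0, 38.0, 61.0, 47.0],
    "life_exp": [71.0, 76.0, 70.0, 79.0, 74.0],
})

# Variable distributions: count, mean, spread, and quartiles of numeric columns.
summary = df.describe()

# Group comparisons: mean of each numeric variable within each region.
group_means = df.groupby("region")[["income", "life_exp"]].mean()

# Relationships between variables: pairwise correlations.
corr = df[["income", "life_exp"]].corr()

print(summary)
print(group_means)
print(corr)
```

Tables like these are candidates for the one or two exploratory summaries you set aside for the report, alongside plots.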

### Deliverable

The objective of your final report is to provide a thorough overview of your project. It should be 3-5 pages; if you feel that further material needs to be included, you can add an appendix with supplementary information, tables, and plots.

The report should include the following sections:

* Abstract
    + One-paragraph summary of the report contents.
* Background
    + Adapt your interim report for this section.
* Data
    + Adapt your interim report for this section.
    + Include select results of your exploratory analysis.
* Aims
    + State the specific questions you address in your analysis.
* Methods and results
    + Describe, in general, how you address each question.
    + Summarize the results of your analyses.
* Discussion
    + Highlight your main findings and takeaways.
    + Discuss any concerns, caveats, or limitations of your analysis.
    + Suggest at least one further step that would extend or complement your work.

A template will be provided with additional detail that you can follow.

**Guidelines**
* Aim to include a minimum of code. Prepare your results separately and save images and tables for direct import into your report notebook, so that you do not need to include long code cells to generate results.
* Focus on communicating background, data, analysis, and takeaways as concisely as possible. Provide just enough detail so that the reader can understand your process but focus on your message.
* Avoid the temptation to document every step in your work or explain methods in full mathematical detail.
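
One way to keep code in the report to a minimum, as the first guideline suggests, is to write results to files in an analysis notebook and read them back in the report notebook. The sketch below shows this for a table using pandas. The folder, file name, and table contents are hypothetical, and a temporary directory stands in for what would normally be a folder in your project. Figures can be handled the same way using your plotting library's own save function.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical results folder; a temporary directory is used here so the
# example runs anywhere.
results_dir = Path(tempfile.mkdtemp()) / "results"
results_dir.mkdir()

# In the analysis notebook: compute a table and save it.
table = pd.DataFrame({"group": ["a", "b"], "mean_value": [1.25, 2.5]})
table_path = results_dir / "group_means.csv"
table.to_csv(table_path, index=False)

# In the report notebook: a single short cell loads the finished table.
report_table = pd.read_csv(table_path)
print(report_table)
```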

### Evaluation

Your report will be evaluated based on:
* adherence to structure and guidelines;
* clarity and thoughtfulness;
* apparent accuracy of quantitative results;
* successful use of one or more techniques from the course.