# UCSB DS100 FINAL PROJECT

## Due Date:  June 11, midnight

<span style="color:red">Carefully read and follow <b>all guidelines</b> provided below.</span>

## About the Project
The final project is an opportunity to apply everything you have learned in this course. Think of it as a step toward building a portfolio that showcases your work, skills, and interests.

The following two resources provide helpful guidelines for how to approach the project and structure your write-up:
* *A Data Science Project Style Guide from Dataquest*: https://www.dataquest.io/blog/data-science-project-style-guide/
* *Building a data science portfolio: Storytelling with data*: https://www.dataquest.io/blog/data-science-portfolio-project/

Your task is to conduct a start-to-finish analysis of a real dataset. This will require many of the skills we have learned over the quarter: manipulating data, exploratory analysis and dimension reduction, translating scientific questions into models and algorithms, basic regression and classification, and building carefully crafted visualizations. You should present your results in a clear and concise manner.

You should submit the following files: 
1. your completed writeup as a PDF (note the page limit listed below)
1. a complete Jupyter notebook with all of your analyses
1. all data files necessary to run your notebook (note the file type and size limit listed below)

You can work individually or in a pair.

<span style="color:red">
    <h1> PROJECT TITLE GOES HERE </h1>
</span>

<span style="color:blue">
<h2> Names: Partner 1 and Partner 2 </h2>
</span>

## Dataset, Questions of Interest, and Exploratory Analysis

We have provided a handful of vetted datasets that might make for interesting projects: you can find the links and descriptions in the [datasets.md](./datasets.md) file, provided in this directory (you can use [this link](https://github.com/ucsb-ds/ds100-s20-content/blob/master/final-project/datasets.md) if you have trouble opening it). 
Use this list as a starting point. If you can find additional data related to your project, you should feel free to use it, **making sure to cite and link to the source of the data**. 

Look over these datasets and decide which one is most interesting to you (and your project partner). Identify some relevant preliminary questions to pursue with this dataset.

You can also use your own dataset, provided that there are no obvious ethical issues with the dataset or with your usage of it.

**IMPORTANT**: When selecting the dataset, make sure that **the total file size does not exceed 500 MB**. The dataset should be in xls/xlsx, csv, or json format; do not use SQL or pickle files.
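
As a quick sanity check on these limits, the sketch below totals the size of every file in a data folder and lists any file whose extension is not allowed. The helper name `check_data_files` and the two constants are hypothetical; the sketch assumes that 500 MB means 500 × 10^6 bytes and that Excel files use the `.xls`/`.xlsx` extensions.

```python
from pathlib import Path

# Limits taken from the guideline above; 1 MB is assumed to be 10**6 bytes.
MAX_TOTAL_BYTES = 500 * 10**6
ALLOWED_SUFFIXES = {".xls", ".xlsx", ".csv", ".json"}


def check_data_files(data_dir):
    """Return (total_bytes, disallowed_names) for every file under data_dir."""
    files = [p for p in Path(data_dir).rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    # Compare extensions case-insensitively so that "DATA.CSV" is accepted.
    disallowed = sorted(
        p.name for p in files if p.suffix.lower() not in ALLOWED_SUFFIXES
    )
    return total_bytes, disallowed
```

For example, `total, bad = check_data_files("data")` followed by `total <= MAX_TOTAL_BYTES and not bad` would tell you whether a (hypothetical) `data` folder satisfies both constraints.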

Before starting data analysis, answer the following questions:

* Are there missing values in the dataset? How do you know?
* Are there any missing values that were filled in? 
* Which parts of the data were entered by a human? Are there any other potential sources of error?
* What are the ethical considerations regarding this dataset?

Take a look at the Quartz "bad data guide" for a more complete list of checks: https://github.com/Quartz/bad-data-guide
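
One possible way to start on the missing-value questions is sketched below with pandas. The toy table, the column names, and the sentinel value `-999` are all made up for illustration; your dataset may encode missing or filled-in values differently, so check its documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; replace it with your own DataFrame.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [52000, 61000, -999, 48000],  # -999 is a made-up sentinel code
    "city": ["Goleta", "", "Ventura", None],
})

# Explicit missing values (NaN or None) per column.
explicit_missing = df.isna().sum()

# Disguised missing values: sentinel codes and empty strings become NaN.
# The list of sentinels is an assumption, not a property of any real dataset.
cleaned = df.replace([-999, ""], np.nan)
total_missing = cleaned.isna().sum()
```

Comparing `explicit_missing` with `total_missing` shows how many values were recorded with a placeholder rather than left blank, which is one sign that missing values were filled in.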

## The Final Project Report Structure

Each group submits **one** report -- if you are working in a pair, **make sure to add your partner to your group on Gradescope**.

The maximum page limit (_excluding_ the title page and appendices) for the project report is **8 pages**. The report should be prepared using 11pt or 12pt font size with 1-inch margins, on US letter paper. **Do not simply convert your analysis notebook into a PDF**: format your report as you would a scientific paper/report. We recommend (but don't require) using [Overleaf](https://www.overleaf.com/), which makes it easy to create well-formatted documents and makes cross-referencing faster than in, for example, Google Docs.

The final project report must be structured as follows:

**1. Title Page:**

Must include the title of your project, and the names of the group members. 

**2. Abstract (100 words or more):**

One paragraph that briefly summarizes the problem you addressed, questions of interest, the data, the techniques used, findings and key results, and conclusions. 

**3. Introduction (200 words or more):**

In this section you should describe 
1. the primary goal(s) or question(s) that your project addresses,
1. the motivation for your project, i.e., why your readers should be interested,
1. the relevant background of your topic, including a brief literature review (a paragraph with 1-3 references) describing any prior related work. 
1. the dataset you are using to answer your question. You should address why this data is appropriate for answering your question. 


**4. Questions of Interest:**

State clearly and concisely your questions of interest.

**5. Data and Methods:**
* **Data.** Cite and state the source of your data. (Is there a license that tells others how to use and attribute the authors?) 
* *Briefly* describe the relevant variables that you will use in your project. 
    * Did you do any preprocessing or transformations on the original variables? 
    * Are there missing data, and if so, what values are missing?  
* Discuss which "**principles of measurement**" might be particularly relevant to your questions (distortion, relevance, precision, cost).  

* Think about the **ethical considerations** behind the dataset.
    + Who collected this dataset and why?
    + Can the analysis of this dataset cause any harm to those represented in the dataset? To others?
    + Who or what is represented in the data? Is someone or something over-represented? Who or what is not represented in the data?
    + Are there other potential ethical issues?


* **Methods.** For each question of interest, state how you will attempt to answer that question. There is no need to describe the mathematical or statistical details of the methods.
* **Exploratory Analysis.** Exploratory plots of the data and numerical summaries are essential in beginning any analysis. At this stage, scatterplots, added variable plots, box-plots, etc. can give you a sense of relationships that exist between relevant variables. 
    * Will transformations be needed/useful for any of these variables? 
    * Can you use dimension reduction (PCA or an alternative) to identify interesting patterns? 
    * Are outliers and/or measurement error a concern? 
    * You should comment on your findings, particularly if there are interesting or counterintuitive observations to be made.
* What **inferential or predictive methods / models** are you using (linear regression, classification methods, etc.)? Consider using the `statsmodels` and `scikit-learn` libraries.
* You are free to use methods not discussed in this class if you are comfortable doing so, but **you should clearly state the purpose of the method, what it does, and how it fits into the ideas we've discussed in class** (e.g., exploratory analysis, dimension reduction, regression, classification, visualization).
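
To illustrate the dimension-reduction step mentioned under Exploratory Analysis, here is a minimal sketch of PCA with `scikit-learn`. The data are synthetic and generated only for this example; standardizing before PCA is a common choice, not a requirement of this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: 200 rows and 4 numeric columns,
# where the last two columns are noisy copies of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 2))])

# Standardize so that no variable dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Project the standardized data onto the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each component.
explained = pca.explained_variance_ratio_
```

Plotting the two columns of `scores` against each other, and reporting `explained`, is one way to check whether a few components summarize the data well.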


**6. Analysis, Results and Interpretation**

In separate subsections, you should answer each of your questions of interest. Your narrative should include:

* Important Details of the Analysis. Perform the analysis -- depending on the questions you want to answer, this might include various items from the following list: 
     
    * Assumptions:
         * Were your assumptions plausible? Why?
         * Are there outliers? Missing values?

    * Interpretation:
         * What do your results mean for the questions you were trying to answer?
             * What were two or three of the most interesting features you came across? Describe the process of finding those features.
             * Describe one feature you thought would be useful, but turned out to be ineffective.

* Relevant plots **in Altair** with proper title, variable names, axis labels, legends, etc. must be included within the body of the text. 
    + All visualizations should make a clear point and/or tell a story.
    + Plots should not be too small or too large. 
    + Make sure to number all figures and tables (e.g., Figure 1, Table 1)  and include informative captions. 
    + **All** figures/tables should be referenced within the text.

* After conducting the analysis, you should give concrete (i.e., data-specific), accurate, and complete interpretations of your results. These interpretations should involve a mix of terminology, variable names, and appropriate scientific units.

* Discuss any ethical issues surrounding your data or analysis (1-2 paragraphs).  
    * What effects (positive/negative) could your results have? 
    * Who is it likely to affect? 
    * How is it likely to affect them? 
    
    
**7. Conclusions and Future Work (about 200 words)**

In this section you should summarize your findings based on your final model in clearly understandable, non-statistical terms. What is the main message produced by your analysis? There may also be additional questions that arise, problems you encounter, or possible extensions of your analysis that could be addressed here. 

Include any final comments and thoughts about your project. For example, do you trust your results? How general are your results, to what situations do they apply? Add any other comments that are relevant.

## Overall Grading Criteria

Your grade on the project will be based on the following criteria:

**1. Compatibility of Scientific Question and Analysis**

Is the scientific question being addressed actually of interest, and were suitable tools employed?

**2. Coherent Thought Process and Presentation of Results**

* Does the analysis indicate a sound understanding of methods discussed in class? This is often best judged by the preliminary comments on the questions of interest as well as the conclusions made after the analysis. 
* Is the analysis presented in a clear, consistent, coherent style with the appropriately labeled requested components and visualizations?
* Does the conclusion clearly follow from the presented analysis?

**3. Scope of the analysis and methods used**

Did the analysis demonstrate a wide understanding of methodology and ideas presented throughout the course? 

**4. Reproducible results**

Is it possible to reproduce the analysis and the visuals by executing the provided code? We will run your notebook to confirm that your visuals and results are reproducible.

# Final Project Rubric

- **Title page (1 pt)**
  + A separate page with the names of the authors and an informative project title.


- **Abstract (5 pts)**
  + One paragraph that briefly summarizes the problem you addressed, questions of interest, the data, the techniques used, findings and key results, and conclusions. 


- **Introduction (20 pts)**
 + **(10 pts)** Clearly introduce the primary goal(s) or question(s) that your project addresses, and the motivation for your project.
 + **(5 pts)** Include a brief literature review (a paragraph with 1-3 references) describing any prior related work. What is the relevant background readers need to understand your project? 
 + **(5 pts)** Briefly describe the dataset you are using to answer your question. You should address why this data is appropriate for answering your question. 


- **Data (25 pts)**
  + **(10 pts)** Where did the data come from? Are you using additional data to augment the original that we provided? What are the relevant "principles of measurement"?
  + **(5 pts)** Ethical Considerations. What concerns do you have about the data?
  + **(10 pts)** Describe the relevant variables in the dataset. 
      * Did you do any preprocessing or transformations on the original variables? 
      * Are there missing data, and if so, what values are missing?  


- **Methods (15 pts)**
  + What methods did you use for your analysis? Why do you think they help you answer your question(s)? 
  
  
- **Results (30 pts)**
  + **(15 pts)** Visualizations, tables and figures
      * are clear (not too small or too large), readable, well-labeled, and appropriate; 
      * make a clear point and/or tell a story; 
      * are numbered (e.g., Figure 1, Table 1) and have an informative short caption;
      * are **clearly referred to and described** in the paper;
      * are created using Altair (charts).
  + **(15 pts)** Appropriate interpretation of the results (e.g., correct interpretation of PCA, regression, etc.). Results are put into context. What conclusions can be drawn from your results?
  
  
- **Discussion (10 pts)**
  + Did you achieve your goal? If not, why? What were some challenges and lessons you learned from them? 
  + What were your primary conclusions and how do your results support these conclusions? 
  + What extensions or future work would you recommend? 
  
  
- **Miscellaneous (15 pts)**
  + **(5 pts)** General coherence, clarity, sufficient referencing and overall effort. 
  + **(5 pts)** Include references for **any and all resources** used for your project: data sources, software tools used, books, journal papers, online resources, etc.  _Failure to cite your references can be grounds for an academic integrity violation case._ Make sure that you read and follow the policies regarding Academic Integrity: https://studentconduct.sa.ucsb.edu/academic-integrity. The Office of Student Conduct lists actions considered to be plagiarism, cheating, or other forms of academic theft, and students’ responsibilities. You are required to read the policies and to abide by them.
  + **(5 pts)** Python notebook runs without errors.  _More points can be deducted if the notebook is missing, if there is a lack of evidence of sufficient work, or if the results presented in the report cannot be reproduced using the notebook._

## Submission

You are required to submit the following files to Gradescope:

1. Submit your completed writeup as a PDF.  You should address all of the components described above, adhering to the page limit, and include any figures and tables that are necessary. (Make sure that all figures and tables are numbered, have informative captions, and are referenced within the text.)

2.  Submit a complete Jupyter notebook with all of your analyses.  (You can use this Jupyter notebook as a template: remove the instructions and replace them with your analysis.)  **We should be able to reproduce _all_ your results and visualizations by running your notebook.**

3. Include all data files necessary to run your notebook (ensuring that the files are in the correct file type and are below the requested file size limit).

Before you submit the notebook, make sure that you save it and then select from the top menu `Kernel -> Restart & Clear Output` followed by `Cell -> Run All`. Verify that all computations execute correctly. **There should be no errors when we run your notebook.**

<span style="color:red">Carefully read and follow <b>all guidelines</b> provided above.</span>