<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# What is Data Science?
 
_Authors: Alexander Egorenkov (DC), Amy Roberts (NYC)_
 
---


# Learning Objectives

After this lesson, you will...

- Have a working terminal and Anaconda environment.
- Be familiar with course policies and procedures.

and be able to...

- Define "data science."
- Explain what is involved in each step of the GA data science workflow.
- Distinguish between supervised and unsupervised learning problems.
- Distinguish between regression and classification problems.

# Setting Up Your Environment

## Git Bash (Windows only)

**Windows users:** [Download the Git Bash shell](https://gitforwindows.org/), install it, and confirm that you can open it.

The Git Bash shell emulates many of the common functions and commands that are available in Linux and Linux-like operating systems.

## Anaconda

[Download Anaconda](https://docs.anaconda.com/anaconda/install/) and follow the installation instructions for your operating system. Make sure that you're downloading the latest stable version for Python 3!

To confirm successful installation, run the following command in your command line application (Git Bash for Windows users, Terminal for Mac users):

```bash
which conda
```

The output should be something like this:

```bash
/Users/USERNAME/anaconda3/bin/conda
```

Also run `python -V`; you should get `Python 3.6.x :: Anaconda, Inc.`, where `x` can be any number. Make sure that you do not get `Python 2.7`!

Run the following command to make sure that some frequently used libraries are installed. Anaconda may also update your packages at this time (which is OK!).

```bash
conda install jupyter notebook python matplotlib nltk numpy pip setuptools scikit-learn scipy statsmodels
```
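
If you want to double-check the result from Python itself, the sketch below reports which of those libraries cannot be found in the active environment. It is only an illustration: the helper name `find_missing` is made up for this example, and the list uses import names, which can differ from the conda package names (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages in the conda command above.
# Assumption: scikit-learn is checked under its import name, "sklearn".
packages = ["jupyter", "notebook", "matplotlib", "nltk", "numpy",
            "pip", "setuptools", "sklearn", "scipy", "statsmodels"]


def find_missing(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing(packages)
print("Missing packages:", missing if missing else "none")
```

An empty result suggests the install worked; any names that are printed can be passed to `conda install` again.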

## Git

Download git ([Mac](https://git-scm.com/downloads), [Windows](https://gitforwindows.org/)) and install it.

To check if your git installation was successful, open a new terminal window and try to run git from the command line:

```bash
git --version
```

The output should be something like this:

```bash
git version 2.5.0
```

Use the following commands to provide git with your name and email. Make sure to use the same email address that you registered at [https://git.generalassemb.ly](https://git.generalassemb.ly):

```bash
git config --global user.name "Your Name"
git config --global user.email your.name@example.com
```

These identifiers will be added to your commits and show up when you push your changes to [GitHub](https://git.generalassemb.ly) from the command line!

# Course Policies and Procedures

https://git.generalassemb.ly/chi-ds-8/course_info

# What is Data Science?

## Data Science Skills

Data scientists use **data modeling** and **programming** skills to answer questions.

They also need **domain knowledge** and a variety of "soft skills." Developing a solution is often easier than identifying a good problem and getting the solution deployed!

## Example Data Science Questions

**Machine learning questions:**

- Does X predict y? (Where X is a set of data and y is an outcome.)
- Are there any distinct groups in our data?
- What are the key components of our data?
- Is one of our observations “weird”?

**Business questions:**

- What is the likelihood that a customer will buy this product?
- Is this a good or bad review?
- How much demand will there be for my service tomorrow?
- Is this the cheapest way to deliver my goods?
- Is there a better way to segment my marketing strategies?
- What groups of products are customers purchasing together?
- Can we automate this simple yes/no decision?

**Exercise.**

List five products or services that you think utilize data science.

# The Data Science Workflow

![](./assets/Data-Framework-White-BG.png)

- This process is often **iterative**.
- Talking with subject-matter experts early and often greatly increases your chances of producing a useful result.

# Application: Data Science Workflow Through Ames Data

## Frame

---

Identify:

- High-level business objectives
- Deliverables
- Success criteria
- Relevant data sets

### High-Level Business Objectives

Suppose a real estate company wants to predict prices for houses so that it can more reliably buy them at a discount, make cost-effective improvements, and sell them for a large profit.

### Deliverables

E.g.

* Presentation to the real estate team
* Business report discussing results, procedures used, and rationales
* API that provides estimated returns

### Success Criteria

This project will be considered a success if the estimated returns provided by the API are at least as accurate as the estimates that the company currently produces manually (while saving time).

**Note:** It can be difficult to predict what level of performance a data science model will be able to achieve before you dig into the data and start building models. **Keep your criteria for success minimal** and **figure out as quickly as possible whether you are going to fail**.
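
One way to make a success criterion like this checkable is to compare the error of the model's estimates with the error of the manual estimates. The sketch below uses mean absolute error as the accuracy measure, which is an assumption (the criterion above does not name a metric), and all of the numbers are invented for illustration.

```python
def mean_absolute_error(actual, estimated):
    """Average absolute difference between actual and estimated values."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


# Hypothetical returns (in dollars) on four past house flips.
actual_returns = [20000, 35000, 15000, 40000]
manual_estimates = [25000, 30000, 10000, 50000]  # produced by hand
model_estimates = [22000, 33000, 18000, 44000]   # produced by the API

manual_mae = mean_absolute_error(actual_returns, manual_estimates)
model_mae = mean_absolute_error(actual_returns, model_estimates)

# The project "succeeds" if the model is at least as accurate as the manual process.
success = model_mae <= manual_mae
print(manual_mae, model_mae, success)
```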

### Relevant Data Sets

**Key questions:**

- What data would be ideal?
- What data is available?
- What can we do to close that gap?
- Is it plausible that we can succeed with the data we can get?

**Subsidiary questions:**

- Where is the data set coming from? How was it collected? Can it be trusted?
- What variables does it contain?
- If the data is spread across multiple pieces, how do those pieces fit together?
- Do our data appropriately align with the question/problem statement?
- Is this data set aggregated? Can we use the aggregation, or do we need to obtain it pre-aggregation?
- Is there enough data?
- Does the data cover all of the types of situations (times, places, etc.) to which we want to apply our model?
- Is the data representative?
- How can we access it (e.g. file, database, web API, web scraping)?
- What are the most appropriate tools for working with the data, given its size and format?

**Exercise**

Answer the following questions about the [Ames housing data set](./assets/ames_data_documentation.txt).

- How closely does the set match the ideal data that you envisioned?
- Would it be sufficient for our purposes?
- What limitations does it have?

## Prepare

---

Data scientists often work with data that they did not collect ("secondary data"), so they have to use *data dictionaries* and other documentation to learn how the set was gathered.

Here's an example of a data dictionary:

Variable | Description | Type of Variable
---| ---| ---
Square Footage | Floating Point | Continuous
Street Type | 1 - Gravel, 2 - Paved | Categorical
Neighborhood | String, e.g., 'Lake View' | Categorical
Number of Bedrooms | Integer | Discrete

**Common data preparation steps**:

- Addressing missing values
- Addressing outliers
- Restructuring
- Reformatting
- Aggregating
- Transforming

![](./assets/clean_data_borat.png)

## Analyze

---

### Descriptive Modeling

Data scientists use statistics such as frequencies, means, and standard deviations to give compact descriptions of their data sets.

Variable | Mean or Frequency (%)
---| ---
Square Footage | 2201.3
Street Type - Gravel | 8%
Street Type - Paved | 92%
Number of Bedrooms | 1.8
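
A summary like the one in the table can be computed in a few lines. The sketch below uses Python's standard library on a small made-up sample, so its numbers will not match the table.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical sample of four homes.
sq_ft = [1500.0, 2100.0, 1800.0, 1650.0]
street = ["Paved", "Paved", "Gravel", "Paved"]

# Continuous variable: summarize with a mean and a standard deviation.
mean_sq_ft = mean(sq_ft)
sd_sq_ft = stdev(sq_ft)

# Categorical variable: summarize with the percentage in each category.
street_pct = {kind: 100 * count / len(street)
              for kind, count in Counter(street).items()}

print(mean_sq_ft, round(sd_sq_ft, 1), street_pct)
```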

### Predictive Modeling

Data scientists build models to predict either discrete outcomes (e.g. this house will / will not sell in the next month) or continuous values (e.g. this house will sell for $358,000).

**Predictive modeling will be a major focus of this course.**

## Interpret

---

- Check your model for correctness.
- Determine what your model is really telling you, keeping in mind the limitations of your data and modeling techniques.
- Determine what one-off recommendations your model supports and/or what kinds of ongoing decisions it can support.
- Get input from subject-matter experts!

## Communicate

---

Without effective communication, your work will not be used.

- Identify your goals.
- Put the bottom line front and center: _"Kitchen renovations have a positive return on investment, while other renovations do not."_
- Speak the language of your audience (often $$$).
- Practice, ideally with a real audience that can give useful feedback.

**Iterate, iterate, iterate.**

<a id="summary1"></a>
# Summary

---

Use the data science workflow to develop solutions.
  - **Frame** a hypothesis.
  - **Prepare** your data.
  - **Analyze** your data.
  - **Interpret** the results of your analysis in terms of your business.
  - **Communicate** your results to different audiences.

<a id="ML"></a>

# Introduction: Machine Learning

---


- **Statistics** is about using data to draw *scientifically valid conclusions*.
- **Machine learning** is about using data to get a computer to exhibit *intelligent behavior*.

## Example of Machine Learning

[Google Quick Draw](https://quickdraw.withgoogle.com/)

<a id="common-ml-defs"> </a>
## Kinds of Machine Learning

### Supervised Learning (a.k.a., “predictive modeling”):

Given a bunch of examples with input features and an output label, predict the output label for new examples.

**Examples:**

- Predict the price of a house based on its neighborhood, number of bedrooms, etc.
- Predict whether an email is spam or "ham" based on its contents.

Predicting a *continuous value* such as house price is called **regression**.

Predicting a *discrete category* such as spam or ham is called **classification**.
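
To make the distinction concrete, the sketch below applies the same very simple technique, one-nearest-neighbor prediction, to both kinds of problem. This technique is chosen only because it fits in a few lines; the data and the single feature used in each case are invented for illustration.

```python
def nearest_label(examples, new_x):
    """Return the label of the labeled example whose feature is closest to new_x."""
    closest = min(examples, key=lambda ex: abs(ex[0] - new_x))
    return closest[1]


# Regression: the label is a continuous value (sale price in dollars).
# Each example is (floor area, sale price).
price_examples = [(1000, 120000), (1500, 170000), (2200, 260000)]

# Classification: the label is a discrete category.
# Each example is (number of exclamation marks in the email, label).
email_examples = [(0, "ham"), (1, "ham"), (7, "spam"), (9, "spam")]

predicted_price = nearest_label(price_examples, 1600)
predicted_class = nearest_label(email_examples, 6)
print(predicted_price, predicted_class)
```

In both cases the labeled examples are what make the problem supervised; only the type of the label changes.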

**Major challenges:**

- Getting good labeled data.
- Keeping focus on business value rather than just building the most accurate model.

### Unsupervised Learning

Given a bunch of examples with features, find some kind of structure.

**Examples:**

- Put coins into groups that are similar to one another in terms of weight, composition, etc.
- Identify five traits that capture a large proportion of the personality variation among people.
- Flag unusual-looking credit card transactions.

Representing objects as members of *groups* is called **clustering**.

Representing objects in terms of a smaller number of features than you started with is called **dimensionality reduction**.

Identifying unusual objects is called **anomaly detection**.

**Major challenge:** Evaluating performance in the absence of labels.
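
As a small illustration of anomaly detection, the sketch below flags transaction amounts that are far from the average, without using any labels. The amounts are made up, and the rule used here (more than two standard deviations from the mean) is just one simple assumption among many possible approaches.

```python
from statistics import mean, stdev

# Hypothetical credit card transaction amounts in dollars.
amounts = [12.5, 30.0, 8.75, 22.0, 15.0, 18.25, 950.0, 27.5]


def flag_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]


flagged = flag_anomalies(amounts)
print(flagged)
```

Note that nothing in the data says which transactions really are fraudulent, which is why evaluating a method like this is hard.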

**Exercise.**

Apply two of the following labels to each task below. For instance, the following task would get the labels "supervised learning" and "regression": "Given data on prior home sales that includes home features (e.g. number of bedrooms) and sales price, predict sales prices for a new set of homes described by the same features."

**Labels:**

- Supervised learning
- Unsupervised learning
- Regression
- Classification
- Clustering
- Dimensionality reduction
- Anomaly detection

**Tasks:**

1. Given a set of music audio files, group those files by music style.
1. Derive "musical fingerprints" that allow an algorithm running on a remote server to identify what song a phone user is hearing with as little data transmission as possible.
1. Given sensor data from a locomotive and times within the data set in which the engine failed, predict engine failure from new sensor data.
1. Given sensor data from a locomotive, identify periods of time in which the engine is behaving abnormally.
1. Given a set of chest X-rays with physician's diagnoses, identify which patients in a new set of chest X-rays have pneumonia.
1. Given sensor data from a locomotive that includes GPS and fuel consumption information, predict how much fuel a locomotive will consume on a trip between two specified points.

## Machine Learning Algorithms

A machine learning **algorithm** is a procedure for training a **model**.

For instance, suppose we think that home prices are on average some fixed multiple of floor area. Of course, this relationship is not exact; there is some "noise" due to variables that are not included in our simple model (number of bedrooms, neighborhood, etc.). We acknowledge these variables by including an "error term" $\epsilon$ ("epsilon") in our model:

$$price = m * floor\ area + \epsilon$$

To use this model to make predictions, we need to choose a value for the multiplier $m$. Given many examples of floor area and price, we can use a machine learning *algorithm* to choose the value for $m$ that gives the best results on those examples. The result is a **fitted model**, e.g.

$$ price = 114 * floor\ area + \epsilon  $$

The process of choosing the general relationship $price = m * floor\ area + \epsilon$ is called **model building**.

The process of applying an algorithm to data to choose a value for the parameter $m$ is called **model training.**
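
Here is a minimal sketch of model training for the model above. It assumes that "best results" means the smallest sum of squared errors, in which case the best multiplier has the closed form $m = \sum x_i y_i / \sum x_i^2$. The lesson does not say which algorithm is used, so treat least squares as an assumption, and note that the three data points are invented.

```python
# Hypothetical training examples: floor areas (sq ft) and sale prices (dollars).
floor_areas = [1000.0, 1500.0, 2000.0]
prices = [115000.0, 170000.0, 229000.0]


def fit_multiplier(xs, ys):
    """Least-squares estimate of m in: price = m * floor area + error."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def predict(m, floor_area):
    """Use the fitted model to predict a price (the error term is unknown)."""
    return m * floor_area


m = fit_multiplier(floor_areas, prices)
print(round(m, 2))                  # fitted multiplier
print(round(predict(m, 1800.0)))    # predicted price for an 1,800 sq ft home
```

Choosing the form `price = m * floor area` was model building; running `fit_multiplier` on the examples is model training.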

**Exercise.**

Use the [Ames Data Set documentation](./assets/ames_data_documentation.txt) to answer the following questions:

- What is a potential target in your data for a regression model?
- What is a potential target in your data for a classification model?
- Could unsupervised learning be used within this data? How so?

Then pick one of your targets and sketch out what a data science workflow would look like for that question. For each stage in the workflow, identify one or two steps that you think would be particularly important for this data set.

You might also consult the [Ames Data Set Introduction PDF](./assets/ames.pdf).

<a id="conclusion2"></a>
## Conclusion

---

Check to see if you can answer the following questions easily:

- What is data science?
- What is the data science workflow?
- What is the difference between supervised and unsupervised learning?
- What is the difference between regression and classification? 
- What is an algorithm?