<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# What is Data Science? `#DAT25`

# Ice-breaker

Have you thought about the TFL problem from the last lesson again while using any TFL services (if you used any)? Did you think of other data that could be used or collected?

Please discuss with your neighbours.

# Learning Objectives

- Define the Data Science Workflow and common Machine Learning concepts.
- Identify which type of predictive problem applies to a given task.
- Use basic Git commands to version your work.

# A brief re-cap from last lesson

- [A data scientist can be](#ds_types)
- [A data scientist must have](#ds_skills)
- [The data science workflow is based on](#ds_workflow)


<a id="ds_types"> </a>

## A data scientist can be:

- type **A**: academic background, solves high-level analytical problems, strong theoretical expertise
- type **B**: hands-on experience, good at coding, works in small organisations, has to produce more tangible outcomes

In other words, type **A** data scientists work mostly in research, while type **B** data scientists work mostly in development.

<a id="ds_skills"> </a>

## A data scientist must have:

- hacking skills
- math & stats knowledge
- substantive expertise


<a id="ds_workflow"> </a>

## The Data Science workflow is based on:

- Frame
- Prepare
- Analyse
- Interpret
- Communicate

### [Part 1: What are the Data Science characteristics?](#part1)

- [Data Science problems](#ds-problems)
- [Activity: Data Science problems](#ds-problems-activity)
- [Supervised vs. Unsupervised Learning](#supervised)


### [Part 2: A brief introduction to git](#part2)

### [Exercise](#lab)

### [What next?](#what-next)

<a id="part1"> </a>
# Part 1: What are the Data Science characteristics?


<a id="ds-problems"> </a>

## Data Science Problems

Example questions for Data Science:

- how many products will we sell tomorrow?
- is this picture a hot dog or not a hot dog?
- based on this user's purchase history, which other users should we target with similar ads?
- is there something suspicious about this credit card transaction?
- should we launch this ad in summer or in winter?
- if you listen to this artist, which other artists are you likely to enjoy?

### How many products will we sell tomorrow?

This is a <strong style="color:green">regression</strong> problem, because the answer is a number on a **continuous** range.
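
As a rough illustration, regression can be sketched as fitting a straight line through past sales and extending it by one day. The numbers and the helper `fit_line` below are invented for this example, and real sales data would rarely be this tidy.

```python
# Toy data (invented for illustration): day number and units sold on that day.
days = [1, 2, 3, 4, 5]
units_sold = [10.0, 12.0, 14.0, 16.0, 18.0]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(days, units_sold)
forecast = slope * 6 + intercept  # day 6 plays the role of "tomorrow"
print(forecast)  # a number on a continuous range
```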

### Is this picture a hot dog or not a hot dog?

This is a <strong style="color:green">classification</strong> problem, because the answer is one of a **discrete set** of answers.
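
A minimal sketch of classification, assuming each picture has already been reduced to two made-up numeric features (how elongated the object is and how red it is). The classifier simply copies the label of the most similar labelled example (nearest neighbour); the feature names and values are hypothetical.

```python
import math

# Labelled examples (invented): (elongation, redness) -> label.
labelled = [
    ((3.0, 0.8), "hot dog"),
    ((2.8, 0.7), "hot dog"),
    ((1.0, 0.2), "not hot dog"),
    ((1.2, 0.9), "not hot dog"),
]


def classify(features):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    nearest = min(labelled, key=lambda example: math.dist(example[0], features))
    return nearest[1]


print(classify((2.9, 0.75)))
print(classify((1.1, 0.3)))
```

The answer is always one of the labels seen in the examples, which is what makes this a discrete problem.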

### Based on this user's purchase history, which other users should we target with similar ads?

This is a <strong style="color:green">clustering</strong> problem, because we are **grouping together** users **without knowing the groups in advance**.
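
One possible way to picture clustering is a tiny one-dimensional k-means: no user carries a label, and the groups emerge from the data. The spend figures and the helper `k_means_1d` are assumptions made for this sketch.

```python
# Unlabelled data (invented): average monthly spend per user.
spend = [5.0, 7.0, 6.0, 50.0, 55.0, 52.0]


def k_means_1d(values, centres, rounds=10):
    """Assign each value to its nearest centre, then move each centre to its group's mean."""
    groups = [[] for _ in centres]
    for _ in range(rounds):
        groups = [[] for _ in centres]
        for value in values:
            nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
            groups[nearest].append(value)
        # Keep the old centre if a group ends up empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


# Start from the smallest and largest values as initial guesses for two groups.
centres, groups = k_means_1d(spend, centres=[min(spend), max(spend)])
print(groups)
```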

### Is there something suspicious about this credit card transaction?

This is an <strong style="color:green">anomaly detection</strong> problem, because we are looking for things that are **outside some definition of "normal"**.
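
A simple sketch of one definition of "normal": an amount is flagged when it lies more than a chosen number of standard deviations away from the card's past amounts. The history, the threshold of 3 and the function name `is_suspicious` are assumptions for illustration only.

```python
import statistics

# Past transaction amounts for one card (invented values).
history = [12.0, 15.0, 9.0, 14.0, 11.0, 13.0, 10.0, 12.0]


def is_suspicious(amount, past, threshold=3.0):
    """Flag an amount more than `threshold` standard deviations from the past mean."""
    mean = statistics.mean(past)
    spread = statistics.stdev(past)
    return abs(amount - mean) / spread > threshold


print(is_suspicious(13.0, history))
print(is_suspicious(950.0, history))
```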

### Should we launch this ad in summer or in winter?

This is a <strong style="color:green">decision making</strong> problem, because we are looking to predict the best outcome out of several scenarios.
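
In its simplest form, this could be sketched as estimating the outcome of each scenario from past campaigns and picking the best one. The uplift figures below are invented, and averaging past results is only one of many ways to estimate an outcome.

```python
from statistics import mean

# Sales uplift (%) observed in past campaigns (invented numbers).
past_campaigns = {
    "summer": [4.0, 6.0, 5.0],
    "winter": [2.0, 3.0, 1.0],
}

# Estimate each scenario's outcome from past data, then pick the best scenario.
expected_uplift = {season: mean(results) for season, results in past_campaigns.items()}
best_season = max(expected_uplift, key=expected_uplift.get)
print(best_season)
```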

### If you listen to this artist, which other artists are you likely to enjoy?

This question requires a <strong style="color:green">recommendation algorithm</strong>, because we want to suggest several suitable options rather than find a single correct answer.
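
A minimal sketch of one recommendation approach, assuming we know which users listen to which artists: rank the other artists by how much their audience overlaps with the chosen one (Jaccard similarity). All artist and user names are hypothetical.

```python
# Which (hypothetical) users listen to which (hypothetical) artists.
listeners = {
    "Artist A": {"ann", "bob", "cat"},
    "Artist B": {"ann", "bob", "dan"},
    "Artist C": {"eve"},
    "Artist D": {"cat", "eve"},
}


def jaccard(a, b):
    """Overlap between two sets: size of the intersection / size of the union."""
    return len(a & b) / len(a | b)


def recommend(artist, top_n=2):
    """Rank the other artists by audience overlap with `artist`."""
    others = [name for name in listeners if name != artist]
    ranked = sorted(
        others,
        key=lambda name: jaccard(listeners[artist], listeners[name]),
        reverse=True,
    )
    return ranked[:top_n]


print(recommend("Artist A"))
```

The result is a ranked list of suggestions, not one right answer.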

<a id="ds-problems-activity"> </a>

## Activity: Data Science Problems

In pairs, think of 2 examples each of a(n):

- regression task
- classification task
- clustering task
- anomaly detection task
- decision making task
- recommendation algorithm task

Remember:

- Regression = **predicting a continuous outcome**
    - e.g. predicting tomorrow's sales
- Classification = **telling the difference between discrete outcomes**
    - e.g. is this a picture of a hot dog or not?
- Clustering = **finding similar things without a "true" answer**
    - e.g. finding similar users based on purchases
- Anomaly detection = **finding "strange" things**
    - e.g. identifying if a credit card transaction is suspicious (fraudulent)
- Decision making = **finding different solutions and ranking them**
    - e.g. deciding which season is the best time to launch an ad
- Recommendation algorithm = **finding different solutions which can satisfy the requirements**
    - e.g. suggesting artists similar to one a listener already likes


<a id="supervised"> </a>
## Supervised vs. Unsupervised

**Supervised** learning means **learning from examples i.e. past data**.

**Unsupervised** learning means **we don't really know any answers ahead of time**.

**Regression**, **classification** and **decision making** are all <strong style="color:green">supervised</strong>.

**Clustering**, **anomaly detection** and **recommendation algorithms** are all <strong style="color:green">unsupervised</strong>.

<a id="part2"> </a>

# Part 2: Introduction to git

# One Solution: Git

![](assets/git-xkcd.png)

from [https://xkcd.com/1597](https://xkcd.com/1597)

## What's the difference between Git and GitHub?

### Git

- the underlying source control system
- allows repositories, commits, branches
- open source

### GitHub

- a company that lets you host Git repositories
- free (if your material is public)
- alternatives include **Bitbucket** (which allows free private repositories)
- GitHub Enterprise is a paid version to have a separate, private GitHub
- additional features e.g. wikis and issue tracker

# Step 1: Create a repository

A **repository** (or "repo") is a self-contained "folder" of files. Think of it as a **project**.

Let's create a repository for you to store your own work.

*Note: if you've forked `dat25` you can use that fork for your work, but remember it is **public***

![](assets/create_new_repo.png)

![](assets/new_repo.png)

### What's that `.gitignore` file?

It is a hidden file that tells Git which files **not to version**.

Why?

- generated files that you don't need
    - e.g. Jupyter creates "checkpoints" that you may not care about
- code artefacts (Python creates some)
- it's nice to have the option!

### Give the teaching team access

Go to Settings -> Collaborators

![](assets/collaborators.png)

# Step 2: Get a local copy


Your repository only exists in GitHub Enterprise. It is a **remote** repository.

To work on it, you need a copy on your machine.


To get a local copy, you **clone**.

In a terminal/command prompt navigate to a folder where you want a copy of your repository and type:

`git clone <your_url_here>`


# Step 3: Make a change

Open your automatically created `README.md` (which will get displayed by default in GitHub) and add some text.

Maybe write who you are and what the purpose of this repository is!

# Step 4: "Saving" your changes

In Git, there are **three** stages to saving your work.


1 - "Stage" your work = prepare it for a commit. This step will look like it's done nothing.

2 -  "Commit" your work. Your changes are now part of your **local** repository.


3 - "Push" your work. Your commit will be sent to the **remote** repository.


2.5 - If you are collaborating with someone, or have copies of your work on multiple machines, it's good practice to "pull" before you "push".

This brings down any changes made to the **remote** repository since you **cloned** (or last pulled) into your **local** one.
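
To picture where a change lives after each step, here is a toy model in Python. It is only a mental model under simplifying assumptions (one user, no branches), not how Git is implemented, and the class `ToyRepo` is invented for this sketch.

```python
class ToyRepo:
    """Toy model of the areas a change moves through. Not how Git really works."""

    def __init__(self):
        self.staged = []  # changes prepared for the next commit
        self.local = []   # commits in the local repository
        self.remote = []  # commits in the remote repository

    def add(self, change):
        """Stage a change: nothing visible happens, it is only queued."""
        self.staged.append(change)

    def commit(self, message):
        """Turn everything staged into one commit in the local repository."""
        self.local.append((message, list(self.staged)))
        self.staged.clear()

    def push(self):
        """Send commits the remote does not have yet.

        Assumes nobody else has pushed, so the remote is simply behind the local copy.
        """
        self.remote.extend(self.local[len(self.remote):])


repo = ToyRepo()
repo.add("README.md")
repo.commit("Describe the project")
print(len(repo.local), len(repo.remote))  # committed locally, not yet on the remote
repo.push()
print(len(repo.local), len(repo.remote))  # the remote has caught up
```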


### The Three Steps

#### 1 - "Stage"

To see any pending changes, type:

`git status`


To see what those changes actually are:

`git diff`

Now type

`git add README.md`

to "stage" the changes you've made to that file.

If you start typing the name of `README.md` you can press `TAB` to autocomplete it (great for long Jupyter notebook filenames).

*Note: if you **remove** a file, you still need to `add` its removal as a change!*

Check

`git status`

again. Notice your file is green and it says "Changes to be committed".

#### 2 - "Commit"

You're ready to commit!

`git commit -m "Add a commit message"`

#### 3 - "Push"

I said it's good practice to pull, so let's do that.

`git pull`

***Note: this is also how you should get the latest course materials!***

Now:

`git push`

Technically you should specify:

- which remote repository to push to (you can have multiple at once)
- which **branch** to push to

Then the command becomes:

`git push <remote> <branch>`

e.g. `git push origin master`


### Branching


![](assets/branching.svg)

# Summary of commands

`clone`: get a local copy of a **remote** repository<br>
`status`: show pending local changes<br>
`add`: stage a local change<br>
`commit`: commit a set of staged changes<br>
`push`: push your commits up to the remote repository<br>
`pull`: pull updates from the remote repository<br>

# Useful links

## Git GUI

If you don't want to use the command line, you can use a nice desktop application to manage your work:

- [GitHub Desktop](https://desktop.github.com/)
- [SourceTree](https://www.sourcetreeapp.com/)
- [GitKraken](https://www.gitkraken.com/)

## Git cheat sheet

We all need one of these.

[https://www.git-tower.com/blog/git-cheat-sheet](https://www.git-tower.com/blog/git-cheat-sheet)

## When things go wrong...

[Oh, s***, git!](http://ohshitgit.com/)

## Miscellaneous

A way to practise Git in the browser: [https://learngitbranching.js.org](https://learngitbranching.js.org)

### BREAK

<a id="lab"> </a>
# Exercise

For each of the following use cases, identify the type of Data Science problem it represents. If the problem can be solved with Machine Learning, decide whether it is a supervised or an unsupervised problem.


## Use case 1

A hospital specialising in skin diseases wants to **increase** the number of patients it can treat. It would like a fast and accurate way of knowing whether or not moles need to be removed surgically, in order to optimise its surgical procedures.
Currently, a team of doctors sees the patients one by one, examining the moles by eye to decide whether or not they need to be treated.

A consultancy firm proposes to take pictures of the moles and send them to doctors to collect their feedback.
Once the feedback has been matched with the visual characteristics of the pictures, the firm claims that an algorithm would be able to automatically separate moles that need treatment from those that don't.

Main points:
- having an algorithm sort the moles would speed up monitoring and so increase the number of patients seen per day
- the cost would be lower than hiring new doctors
- if the algorithm's result is inconclusive, a doctor will always be available to see the patient

Questions

1) What kind(s) of Data Science Problem is this use case?

2) If Machine Learning can be applied, would it be a Supervised or Unsupervised Problem?

## Use case 2

A car factory is designing a new line of luxury cars and plans to install a number of optional extras that could boost sales. These include **bluetooth connectivity**, **adaptive cruise control**, **voice control** and **heated seats**.
However, the engineers and designers explain that production costs will increase drastically if ALL of these extras are installed. For this reason, they have to choose 2 of the 4 extras.

The same consultancy firm proposes a study to determine which extras should be chosen to maximise the sales.
They will use sales data from other car factories which produce similar vehicles.


Main points:
- to find out which extras customers prefer, the company could run a marketing survey, but this would normally take time and would not collect much feedback
- installing 2 different pairs of extras in different cars (A/B test) may result in one model selling very well and the others selling very poorly
- installing all of the extras would force the company to increase the car price, risking poor sales

Questions

1) What kind(s) of Data Science Problem is this use case?

2) If Machine Learning can be applied, would it be a Supervised or Unsupervised Problem?

<a id="what-next"> </a>
# What next?

- Introduction to Python