# Introduction to Data Science & ML

Some of the pictures in this presentation were drawn by hand in Affinity Designer. You can find the vector file (.afdesign) in the DSBOOK/images folder, where all images for this lecture are stored.

<div class="slide-title">
    
# Introduction to 
# Data Science & ML 
    
</div> 

## Understanding concepts for Data Science and Machine Learning

* What is Data?
* What is (not) Machine Learning?
* What is AI, and what is it good for?
* Who does what in data?
* Becoming a Data Scientist


## What is data?

### Once upon a time

<img src="../images/intro_to_data_science_ml/once_upon_a_time.png" width="1100" style="display:block; margin:auto">

Data was generated when humans interacted with each other or with the situations they faced.

### Now

<img src="../images/intro_to_data_science_ml/now.png" width="1100" style="display:block; margin:auto">

Everything can be data:

a click, walking with your phone, opening Zoom, accessing a website, the weather, buying something online, paying by card.

We both produce data and are clients of the data systems that collect our data.

### Data Lifecycle
interaction - collection - transformation - enriching - modeling - getting insights - improving the application

<img src="../images/intro_to_data_science_ml/data_lifecycle_ds_ml.png" width="1100" style="display:block; margin:auto">

## What is "NOT machine learning"?

Notes: Building on the experience of tic-tac-toe, introduce the idea of deterministic and probabilistic systems.

### Which problems can be solved by “NOT machine learning”

<div class="group">
  <div class="images">      
    <img src="../images/intro_to_data_science_ml/tic-tac-toe.png" >
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/hang-man.png" width="350">
  </div>
</div>

### Which problems can be solved by “NOT machine learning”

<img src="../images/intro_to_data_science_ml/cookie_monster_2.png" width="1100">

Demystify the term “Machine Learning”. Many problems don’t require ML, but can be solved by writing a simple algorithm. Let’s look at 3 variations of the cookie monster problem.

### 1. Cookie Monster gains weight

<div class="group">
  <div class="text_70">
      
Cookie Monster eats **10 kg** of cookies each **day**. For every **10 kg** that it eats, it gains **5 kg**.
(It spends a lot of energy hunting for cookies.)

How many **kg** does Cookie Monster **weigh today** if its **initial** weight was **100 kg** and it has been eating cookies for **5 days**?

  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/img_p2_1.png">
  </div>
</div>
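
Problem 1 is fully deterministic, so a few lines of plain code solve it. A minimal sketch in Python; the function name `weight_after` and its parameter names are illustrative and not part of the lecture material:

```python
def weight_after(initial_kg, days, cookies_per_day_kg=10, gain_per_10kg=5):
    """Apply the fixed rule from the slide: every 10 kg of cookies adds 5 kg."""
    daily_gain_kg = cookies_per_day_kg / 10 * gain_per_10kg
    return initial_kg + days * daily_gain_kg


print(weight_after(initial_kg=100, days=5))  # 125.0
```

Nothing is learned here: the rule is written down by a human and the machine only executes it.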


### Algorithm definition

<div class="group">
  <div class="text_70">

“A **finite** set of **unambiguous instructions** that, given some set of **initial conditions**, can be performed in a **prescribed sequence** to achieve a certain goal and that has a recognizable set of **end conditions**.”

      


  </div>
  <div class="text">

  </div>
</div>

In the examples above the Machine is not learning, it’s doing what you told it to. So who’s doing the “learning”?

<div class="group">
  <div class="text_70">

“Learning” - the act, process, or experience of **gaining knowledge or skill**.      


  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/img_p3_1.png" width="500">
  </div>
</div>

The algorithms we discussed so far were prescriptive and didn’t exhibit learning.

### 2. Cookie Monster’s grandma

<div class="group">
  <div class="text_70">


    
Grandma Cookie Monster brings **15 kg** of cookies when she visits. 
She visits when she’s in a **good mood**, but **no more than twice a week**.

The following is known about Grandma Cookie Monster's mood:

🌞 She likes it when it's sunny outside but not too warm (<28℃). <br>
👀 She doesn't like it when her neighbor is looking out of the window as she leaves the house. <br>
🚃 She likes to take tram number 1 and not tram number 3.

Grandma is in a **good mood** if the number of her **likes outweighs the number of her dislikes** on any given day. 
On average, she is in a good mood **3 times a week**.

How much does Cookie Monster weigh today, if:
* Its initial weight is 100 kg, and it has been eating cookies for 5 days.
* Its grandma came to visit once this week already.
* It's been a nice week with 25℃.
* Tram number 1 is not working.

      
  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/cookie_monster_grany.png">
  </div>
</div>
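
The mood rule itself can still be written as code. The sketch below makes two assumptions that the slide does not state: each of the three factors counts as one like when it goes Grandma's way and as one dislike otherwise, and "a nice week with 25℃" means sunny. The function and argument names are hypothetical.

```python
def is_good_mood(sunny_and_mild, neighbor_watching, takes_tram_1):
    """Return True if Grandma's likes outnumber her dislikes.

    Assumption: every factor is either one like or one dislike.
    """
    likes = [sunny_and_mild, not neighbor_watching, takes_tram_1]
    n_likes = sum(likes)
    n_dislikes = len(likes) - n_likes
    return n_likes > n_dislikes


# This week: 25 degrees and sunny (like), tram 1 not running (dislike).
# We do not know what the neighbor did, so we try both cases.
for neighbor_watching in (True, False):
    mood = is_good_mood(True, neighbor_watching, False)
    print(f"neighbor watching: {neighbor_watching} -> good mood: {mood}")
```

The rule is deterministic, but one input is unknown. The answer flips with the neighbor, so we can no longer compute a single exact weight.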

Notes: Keep the system deterministic, but introduce some uncertainty. A deterministic algorithm can no longer predict exactly what will happen.

### Uncertainty

**A deterministic system** is one in which the occurrence of all events is known with certainty. If the description of the system state at a particular point of time of its operation is given, the next state can be perfectly predicted.


**A probabilistic system** is one in which the occurrence of events cannot be perfectly predicted. Though the behavior of such a system can be described in terms of probability, a certain degree of error is always attached to the prediction of the behavior of the system.
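
To make the difference concrete, here is a small simulation sketch. It rests on assumptions that are not given in the problem: Grandma visits on any day with probability 0.3, at most twice a week, and her 15 kg of cookies add 7.5 kg at the same rate as in problem 1. All names are illustrative.

```python
import random
from collections import Counter


def simulate_week(rng, initial_kg=100.0, p_visit=0.3):
    """Simulate one week; p_visit is an assumed daily visit probability."""
    weight = initial_kg
    visits = 0
    for _ in range(7):
        weight += 5.0  # deterministic part: 10 kg of cookies per day -> +5 kg
        if visits < 2 and rng.random() < p_visit:
            visits += 1
            weight += 7.5  # assumed: Grandma's 15 kg of cookies -> +7.5 kg
    return weight


rng = random.Random(0)  # fixed seed so the run is repeatable
n_runs = 10_000
outcomes = Counter(simulate_week(rng) for _ in range(n_runs))
for weight, count in sorted(outcomes.items()):
    print(f"{weight:6.1f} kg in {count / n_runs:.1%} of simulated weeks")
```

The deterministic part always adds the same 35 kg. The probabilistic part means we can only say how likely each final weight is, not which one will occur.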

### Heuristics / base model

**A heuristic** is an approach to problem-solving or self-discovery using 'a calculated guess' derived from previous experiences. 

Heuristics are mental shortcuts that ease the cognitive load of making a decision. 

The usual counterpart to a heuristic is an algorithm: an algorithm calculates the answer and eliminates guesswork.


Notes: Heuristics are a legitimate way to make decisions if they are kept under control. ML allows us to go beyond heuristics, but there are overheads involved. In our case, we could assume that Grandma Cookie Monster comes once a week. It’s not 100% right, but it’s not entirely wrong either.
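
The "once a week" heuristic can serve as a base model. The sketch below assumes a full week of 7 days and that Grandma's 15 kg of cookies add 7.5 kg (the rate from problem 1); both are our assumptions, and the names are hypothetical.

```python
DAILY_GAIN_KG = 5.0  # from problem 1: 10 kg of cookies per day -> +5 kg
VISIT_GAIN_KG = 7.5  # assumption: Grandma's 15 kg are converted at the same rate


def weight_with_visits(initial_kg, days, visits):
    """Weight after `days` days with a given number of Grandma visits."""
    return initial_kg + days * DAILY_GAIN_KG + visits * VISIT_GAIN_KG


# Heuristic: assume exactly one visit per week.
guess = weight_with_visits(100, days=7, visits=1)

# Compare the guess with every outcome the rules allow (0, 1 or 2 visits).
for actual_visits in (0, 1, 2):
    actual = weight_with_visits(100, days=7, visits=actual_visits)
    print(f"{actual_visits} visits: heuristic is off by {abs(guess - actual)} kg")
```

A base model like this gives a reference point: any ML model we build later should beat it to justify its extra cost.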

### 3. Cookie Monster gains weight mysteriously

<div class="group">
  <div class="text_70">

Cookie Monster has a birthday in **2 weeks** and the local municipality would like to give it a
postcard with its **exact weight** written on it. Can you accurately predict it?
      
Very little is known about how Cookie Monster gains weight. However, the following observations are available:

  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/cookie_monster_kermit.png" width="400">
  </div>
</div>


| Day | Cookies consumed (kg) | Neighbor looking out of the window | Temperature outside (℃) | Tram 1 working | Lake water temperature (℃) | Evgeny teaching ML class | Weight at beginning of day (kg) | Weight at end of day (kg) |
|---|---|---|---|---|---|---|---|---|
| 1 | 15 | Yes | 25 | 1 | 15 | 1 | 100 | 114.3 |
| 2 | 10 | No | 23 | 0 | 15.5 | 0 | 114.3 | 120.7 |
| 3 | 40 | Yes | 29 | 1 | 15.3 | 1 | 120.7 | 135.4 |



Notes: Last definition of the problem - deterministic rules are removed and we are in the world of uncertainty

#### Humans or Machine Learning?
### What if the system is non-deterministic and also highly complex?

<div class="group">
  <div class="text_70">
      
* It is difficult to understand what the rules are.
* The rules are too complex to write down.
* There are too many rules.
* Rules sometimes apply and sometimes don’t, and you don’t know when or why.
* You have tried heuristics and they don’t work well.

  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/img_p8_1.png">
  </div>
</div>

### Perhaps the machine can figure it out?

If it’s too much for you to figure out, perhaps the machine could?

<div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p3_1.png">
  </div>
</div>
<div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p9_1.png">
  </div>
</div>

### Machine Learning is a tool to deal with  <u>uncertainty</u> in <u>probabilistic</u> systems

Use it when you have exhausted all other options and **not** because you don’t yet have the solution for a deterministic problem.


<img src="../images/intro_to_data_science_ml/img_p10_1.png" style="display:block; margin:auto">

### Do <u>NOT</u> solve <u>deterministic</u> problems with Machine Learning 

Using Machine Learning introduces complexity and overheads that can only be justified **if** they are absolutely necessary.

Notes: Do not solve “Tic Tac Toe” or “Rock paper scissors” with ML

## What is AI?

### What is AI?

<div class="group">
  <div class="images">
    <img src="../images/intro_to_data_science_ml/what_is_ai_1_1.png">
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/what_is_ai_2.png">
  </div>
</div>

### What is AI?

<div class="group">
  <div class="images">
    <img src="../images/intro_to_data_science_ml/what_is_ai_1_2.png">
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/what_is_ai_2.png">
  </div>
</div>

It’s not the Terminator.

It is a branch of computer science, with subdomains.

### What is AI?

**Narrow AI:** real AI. Math and computational statistics on steroids; solves one task.

**General AI:** imaginary AI. Killer robots, or the paperclip machine (which decides to build paper clips and drowns all of mankind).

**Technochauvinism:** the belief that all problems can be solved by tech.



<sub><sub>[Meredith Broussard, Artificial Unintelligence](https://www.c-span.org/video/?457638-2/artificial-unintelligence)</sub></sub>

We work with narrow AI. Movies are about general AI.

### How can Machine learning help?

<div class="group">
  <div class="text_70">
      
* Smarter weather prediction and agriculture
* Energy optimization
* Self-driving cars
* AI in healthcare / Drug discovery
* Finance / Fraud detection
* On-demand language translation

  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/img_p27_1.png">
  </div>
</div>

### What can we do with it?

* Predict if a product will sell or not
* Demand prediction for a service
* Traffic prediction
* Predicting when a large system (a ship, a train, and so on) will break
* Winning a game of chess

### Hot topics: GPT-3

<div class="group">
  <div class="text">
      
[https://openai.com/blog/openai-api/](https://openai.com/blog/openai-api/)

  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/chat_gpt.png">
  </div>
</div>

[https://www.theverge.com/21346343/gpt-3-explainer-openai-examples-errors-agi-potential](https://www.theverge.com/21346343/gpt-3-explainer-openai-examples-errors-agi-potential)

Generative Pre-trained Transformer 3 is an autoregressive language model released in 2020 that uses deep learning to produce human-like text. Given an initial text as prompt, it will produce text that continues the prompt.

### How can AI be dangerous?
<div class="group">
  <div class="text_70">
      
* Autonomous weapons
* Social manipulation
* Invasion of privacy and social grading
* Recruiting
* Amplifies discrimination

*check out Coded Bias on Netflix*

  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/coded_bias.png" width="350">
  </div>
</div>

When MIT Media Lab researcher Joy Buolamwini discovers that facial recognition does not see dark-skinned faces accurately, she embarks on a journey to push for the first-ever U.S. legislation against bias in algorithms that impact us all.

### AI - Effect on Society

<div class="group">
  <div class="images">
    <img src="../images/intro_to_data_science_ml/Race-After-Technology.png" width="350">
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/Weapons-of-Math-Destruction.png" width="365">
  </div>
</div>

### AI and discrimination - PULSE AI

<div class="group">
  <div class="images">      
    <img src="../images/intro_to_data_science_ml/img_p31_2.png">
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p31_1.png">
    <img src="../images/intro_to_data_science_ml/img_p31_3.png" width="600">
  </div>
</div>



### Awful AI

<div class="group">
  <div class="text">
      
[https://github.com/daviddao/awful-ai](https://github.com/daviddao/awful-ai)

  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p33_2.png">
  </div>
</div>

## Where does bias come from?

### Who is contributing to the data?

<img src="../images/intro_to_data_science_ml/img_p35_1.png" width="900">

### What do we do with the data?

<div class="group">
  <div class="text">
      
Algorithms can also be biased.

Examples:
* Do they care about the average?
* Is the target of the model really what the system should optimise for?

  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p36_3.png">
  </div>
  <div class="text">

<sub><sub>[today in the Markup newsletter](https://www.wsj.com/articles/facebook-algorithm-change-zuckerberg-11631654215)</sub></sub>

   </div>
</div>

## Machine Learning

### Birds-Eye View

<img src="../images/intro_to_data_science_ml/img_p40_1.png" width="725">

<sub><sub>Sources:[https://datute.net/bigdata.html](https://datute.net/bigdata.html)</sub></sub>

### Supervised Learning

<div class="group">
  <div class="text">
      

  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/supervised_learning_ml.png">
  </div>
</div>

Notes: Training data (known data) includes the desired output (response) as well 

Example: Predicting house prices based on given features like: number of rooms, bathrooms, garage space, year it was built, location, etc.

### Supervised Learning

<div class="group">
  <div class="text">
      
Training data (known data) includes the desired output (response) as well.

Example:

Predicting house prices based on given features like: number of rooms, bathrooms, garage space, year it was
built, location, etc.

<sub><sub> Sources:
[Apple](https://www.flaticon.com/free-icon/apple_415682?term=apple&page=1&position=12), [Machine Learning](https://www.flaticon.com/free-icon/machine-learning_2464316?term=machine%20learning&page=2&position=5), [Computer](https://www.flaticon.com/free-icon/pc-monitor_81793?term=computer%20screen&page=6&position=14) </sub></sub>
      
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/supervised_learning_ml.png">
  </div>
</div>
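
As a minimal sketch of supervised learning, the example below fits a straight line to labeled examples and then predicts the price of an unseen house. The data is invented for illustration (rooms and price in thousands), only one feature is used, and the helper `fit_line` is our own least-squares implementation rather than a library call.

```python
# Hypothetical training data: each input (rooms) comes with its known output (price).
rooms = [2, 3, 4, 5, 6]
prices = [155, 195, 255, 295, 350]  # in thousands, made up for this example


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(rooms, prices)


def predict(n_rooms):
    return slope * n_rooms + intercept


print(f"price = {slope:.1f} * rooms + {intercept:.1f}")
print(f"predicted price for 7 rooms: {predict(7):.1f}")
```

The key point is that the prices (the desired output) are part of the training data, which is what makes this supervised.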

### Unsupervised Learning

<div class="group">
  <div class="text">
      

  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/unsupervised_learning_ml.png">
  </div>
</div>

Notes: The training data (known data) does NOT include the desired output (response) 

Example: Grouping customers by purchasing behavior

### Unsupervised Learning

<div class="group">
  <div class="text">
      
The training data (known data) does NOT include the desired output (response).

Example:

Grouping customers by purchasing behavior

<sub><sub> Sources:
[Apple/Banana/And Pearple](https://www.flaticon.com/packs/summer-food-drink), [Machine Learning](https://www.flaticon.com/free-icon/machine-learning_2464316?term=machine%20learning&page=2&position=5), [Computer](https://www.flaticon.com/free-icon/pc-monitor_81793?term=computer%20screen&page=6&position=14), [Thinking Bubble](https://www.flaticon.com/free-icon/thinking_522938?term=thinking%20bubble&page=1&position=17) </sub></sub>
      
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/unsupervised_learning_ml.png">
  </div>
</div>
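
A minimal sketch of unsupervised learning: grouping customers by monthly spend with a tiny one-dimensional k-means. The spend values are invented, the starting centroids are simply the smallest and largest value, and `kmeans_1d` is our own illustrative helper.

```python
def kmeans_1d(values, centroids, n_iter=10):
    """Very small k-means for one feature; returns (centroids, clusters)."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for value in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical monthly spend per customer; note that there are no labels.
monthly_spend = [12, 15, 14, 90, 95, 88, 13, 92]
centroids, clusters = kmeans_1d(monthly_spend,
                                [min(monthly_spend), max(monthly_spend)])
print(centroids)
print(clusters)
```

Nobody told the algorithm which customers are "low spenders" or "high spenders". It finds the two groups from the structure of the data alone.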

### Semi-supervised Learning
<div class="group">
  <div class="text">
      
Training data includes SOME of the
desired output
      
Example:
      
Photo archive where only some images are labeled (e.g. dog, cat, person) and the majority are unlabeled.

  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p45_1.png">
  </div>
</div>

### Reinforcement Learning
<div class="group">
  <div class="text">
      
The model learns through a feedback loop: it acts, observes a reward or penalty, and adjusts.
      
Example:
      
autonomous video game player
<img src="../images/intro_to_data_science_ml/img_p46_2.png" width="500">     

      
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p46_1.png">

<sub><sub>Sources: [https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html](https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html)</sub></sub>

  </div>
</div>
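
The feedback loop can be sketched with one of the simplest reinforcement learning setups, an epsilon-greedy agent choosing between two actions. The reward probabilities, step count and function name below are invented for illustration; a video game agent works on the same principle with far more states and actions.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-known action, sometimes explore."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was tried
    values = [0.0] * len(reward_probs)  # running estimate of each action's reward
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploit
        # Feedback from the environment: reward 1 or 0.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Update the estimate as an incremental average.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts


values, counts = run_bandit([0.2, 0.8])
print("estimated rewards:", [round(v, 2) for v in values])
print("times chosen:", counts)
```

The agent is never told which action is better. It discovers this from the rewards it receives and ends up choosing the better action most of the time.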

### Regression vs. Classification

<img src="../images/intro_to_data_science_ml/img_p40_1.png" width="725">

<sub><sub>Sources:[https://datute.net/bigdata.html](https://datute.net/bigdata.html)</sub></sub>

### Classification vs. Regression

<div class="group">
  <div class="images">
    <img src="../images/intro_to_data_science_ml/MLIntro_classification_example.png" width="550">
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/MLIntro_regression_example.png" width="550">
  </div>
</div>
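
The difference lies in the kind of output: regression predicts a number, classification predicts a category. The sketch below uses the same nearest-neighbor idea for both, on invented data; the data, the choice of k = 3 and the function names are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical homes: (size in square meters, price in thousands, type)
homes = [
    (30, 120, "flat"), (45, 170, "flat"), (60, 230, "flat"),
    (110, 400, "house"), (130, 470, "house"), (150, 540, "house"),
]


def nearest(size, k=3):
    """Return the k homes whose size is closest to the query."""
    return sorted(homes, key=lambda home: abs(home[0] - size))[:k]


def predict_price(size):
    """Regression: average the numeric target of the neighbors."""
    neighbors = nearest(size)
    return sum(price for _, price, _ in neighbors) / len(neighbors)


def predict_type(size):
    """Classification: take the most common label among the neighbors."""
    labels = [label for _, _, label in nearest(size)]
    return Counter(labels).most_common(1)[0][0]


print(predict_price(120))  # 470.0
print(predict_type(120))   # house
```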

### Unsupervised learning: Dimensionality reduction

<img src="../images/intro_to_data_science_ml/img_p49_1.png" width="800" style="display:block; margin:auto">

<sub><sub>Sources: Hands-on Machine Learning, Géron</sub></sub>

### Unsupervised learning: Clustering

<img src="../images/intro_to_data_science_ml/img_p50_1.png" width="800" style="display:block; margin:auto">

<sub><sub>Sources: sklearn data set, own visualization</sub></sub>

### Deep Learning

<div class="group">
  <div class="text_70">

**Deep Learning** is a class of ML algorithms that <u>uses multiple layers to progressively extract higher level features from the raw input.</u>

For example, in image processing, lower layers may identify edges, while higher layers may identify the concepts relevant to a human such as digits or letters or faces.

  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/MLIntro_deep_learning.png" width="600">
  </div>
</div>
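
To illustrate how stacked layers build higher-level features from simpler ones, here is a toy two-layer network that computes XOR. The weights are picked by hand for illustration and are not learned; in real deep learning they are learned from data, and the networks are far larger.

```python
def step(x):
    """Threshold activation: fire (1) if the input is positive."""
    return 1 if x > 0 else 0


def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of the inputs plus a bias."""
    return [step(sum(w * i for w, i in zip(neuron_weights, inputs)) + bias)
            for neuron_weights, bias in zip(weights, biases)]


def xor_net(a, b):
    # Layer 1 extracts two simple features: OR(a, b) and AND(a, b).
    hidden = layer([a, b], weights=[[1, 1], [1, 1]], biases=[-0.5, -1.5])
    # Layer 2 combines them into a higher-level feature: OR and not AND.
    (output,) = layer(hidden, weights=[[1, -1]], biases=[-0.5])
    return output


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

No single neuron of this kind can compute XOR. It only becomes possible because the second layer works on features produced by the first.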

### Time Series Forecasting

<div class="group">
  <div class="text">

**A Time Series** is a series of data points indexed in time order. Most commonly the data points are taken at equal intervals.

  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p52_1.png">
  </div>
</div>
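
A very simple forecasting baseline is to predict the next value as the average of the most recent ones. The sketch below uses an invented series of daily website visits; the function name and the window size of 3 are assumptions.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next point as the mean of the last `window` points."""
    recent = series[-window:]
    return sum(recent) / len(recent)


# Hypothetical series: one observation per day, in time order.
daily_visits = [120, 132, 128, 141, 150, 147]
forecast = moving_average_forecast(daily_visits)
print(forecast)  # 146.0
```

The order of the points matters here: shuffling the series would change the forecast, which is what sets time series apart from ordinary tabular data.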

### Natural Language Processing

<div class="group">
  <div class="text_70">
        
**NLP** is the field dealing with how to program computers to process and analyze large amounts of natural language data.
  
  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/img_p53_1.png">
  </div>
</div>
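
A common first step in NLP is to turn raw text into something countable. The sketch below builds a simple bag of words with the Python standard library; the example sentence and the tokenization rule (lowercase letters and apostrophes) are our own simplifications.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = bag_of_words("Cookie Monster eats cookies. Cookie Monster loves cookies!")
print(counts)
```

Note that "cookie" and "cookies" are counted as different words. Handling such cases (stemming, lemmatization) is part of what makes natural language harder than it looks.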

### Generative AI vs Supervised Learning
<div class="group">
  <div class="text_70">
      
> Generative AI is mind-blowing, but remember that Supervised learning is the most profitable Machine Learning technique today. The attached image is from @AndrewYNg's talk at Stanford University in July 2023. Supervised learning is massive, and he predicts it should double in the next few years. Generative AI should more than double, but it won't catch up. Don't let online hype lead you astray. Learning the fundamentals is as important as it's always been. <br> [Santiago Valdarrama](https://www.linkedin.com/posts/svpino_generative-ai-is-mind-blowing-but-remember-activity-7124020850646749184-Mc_X?utm_source=share&utm_medium=member_desktop)

Also the link to Andrew Ng's talk: https://youtu.be/5p248yoa3oE?si=DmffehuDLWa2IAFB
</div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/andrew-ng-supervised-vs-generative.jpeg" width=850>
  </div>
</div>

## Who does what in data?

There are many roles. Which ones have you heard of?

Data journalist, analytics engineer, ML engineer (MLE), data scientist (DS), data analyst (DA), data engineer (DE), data visualization, data product.

Let’s focus on 4 common ones (though dae is on the rise).

### Data Engineer

**Tasks:**

data warehouse, data lake, data infrastructure, data pipeline, data transformation and enriching, ETL, automation, software engineering

<div class="group">
  <div class="text">

**Closely related roles:**
      
Data ops, ML ops

**Key skills:** 
      
engineering, data modeling, communication

<img src="../images/intro_to_data_science_ml/MLIntro_data_engineer_skills.png" width="500">
      
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p18_1.png" width="500">
  </div>
</div>


### Data Analyst

**Tasks:**

data warehouse, data pipeline, data transformation and enriching, ETL, data analysis, EDA, KPIs, statistics, data exploration, dashboards, visualization, communicating, assessing data products

<div class="group">
  <div class="text">

**Closely related roles:**
      
Product Analyst, Data Scientist, Data Visualizer, (Growth Hacker...)
      
**Key skills:** 

statistics, domain knowledge, communication
      
<img src="../images/intro_to_data_science_ml/img_p16_4.png" width="200">
      
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p19_1.png" width="500">
  </div>
</div>

### Data Scientist

**Tasks:**

data pipeline, data analysis, KPIs, statistics, data exploration, visualization, EDA, communicating, data modeling, predicting, building data products, deep learning

<div class="group">
  <div class="text">

**Closely related roles:**

Product Analyst, ML Engineer, Data Visualizer
      
**Key skills:** 

algorithms, domain knowledge, communication
      
<img src="../images/intro_to_data_science_ml/img_p16_5.png" width="200">
      
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p20_1.png" width="500">
  </div>
</div>

### Machine Learning Engineer
**Tasks:**

data pipeline, data analysis, data modeling, predicting, building data products, automation, software engineering

<div class="group">
  <div class="text">

      

**Closely related roles:**
      
Data Scientist, Data Engineer
      
**Key skills:** 
      
engineering, algorithms, communication

<img src="../images/intro_to_data_science_ml/MLIntro_ml_engineer_skills.png" width="500">
      
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p21_1.png" width="550">
  </div>
</div>

## Becoming a Data Scientist

### Data Scientist


<div class="group">
  <div class="text">
      
      
**Tasks:**

data pipeline, data analysis, KPIs, statistics, data exploration, visualization, EDA, communicating, data modeling, predicting, building data products, deep learning


**Closely related roles:**

Product Analyst, ML Engineer, Data Visualizer
      
**Key skills:** 

algorithms, domain knowledge, communication
    
<img src="../images/intro_to_data_science_ml/img_p16_5.png" width="200">
      
  </div>
  <div class="images">
    <img src="../images/intro_to_data_science_ml/img_p20_1.png" width="500">
  </div>
</div>

### Learn about the subject and where your **past experience** fits in

<div class="group">
  <div class="text_70">
        
book:
[https://www.manning.com/books/build-a-career-in-data-science](https://www.manning.com/books/build-a-career-in-data-science)

podcast:
[https://open.spotify.com/show/78Nft51TuU3X2urEKfCuys?si=-f7cN3v2Sgu0pyDelBc-Yg&dl_branch=1](https://open.spotify.com/show/78Nft51TuU3X2urEKfCuys?si=-f7cN3v2Sgu0pyDelBc-Yg&dl_branch=1)

  </div>
  <div class="images_30">
    <img src="../images/intro_to_data_science_ml/img_p56_6.png">
  </div>
</div>

### Try it out: Kaggle, Zindi, and more

<div class="group">
    <div class="images">       
        <img src="../images/intro_to_data_science_ml/img_p57_1.png" width="550">  
    </div>
    <div class="images">
        <img src="../images/intro_to_data_science_ml/img_p57_2.png">
    </div>
</div>

<sub><sub>Sources:[kaggle](https://www.kaggle.com/datasets), [zindi](https://zindi.africa/competitions)</sub></sub>

### First 3 Weeks: Getting started with data


|  | |
|:---:|:---|
| **1** | <ul><li>Working with IDEs and Python scripts</li></ul>  |
| **2** | <ul><li>pandas & NumPy</li><li>Data Visualization</li></ul> |
| **3** | <ul><li>Data Visualization</li><li>Data Cleaning</li><li>EDA</li></ul> |