# Data Science Course 2022: Introduction (18.10.2022)

By Thomas Jurczyk (Dr. Eberle Zentrum, Universität Tübingen).

# Overview

1. Personal Introduction

2. Data Science Introduction

3. Requirements & Course Structure

4. Goals

5. Tasks for the next session

6. Resources

# 1. Personal Introduction

**Name**: Thomas Jurczyk, PhD

**Mail**: thomas.jurczyk-q88@rub.de

## Academic Background

History/Study of Religion (PhD)

(partially) Computer Science

## Academic Interests

1\. Application of computational methods in the Humanities

1. Computational text analysis
2. Machine Learning
3. Data Science

2\. Ancient Religions / early Christianity

3\. The meaning and application of religious notions in historical and contemporary religions

# 2. Data Science (Introduction)

## What is data? What kinds of data do you know?

> In a conceptual model, data (US: /ˈdætə/; UK: /ˈdeɪtə/) is a **collection of discrete values** that convey **information**, describing **quantity, quality, fact, statistics, other basic units of meaning**, or simply **sequences of symbols** that may be further interpreted. ([Wiki article "Data"](https://en.wikipedia.org/wiki/Data))

1. Images
2. Text
3. Tables / Databases
4. Time series
5. Sensor data

Overall, these different data types can be categorized using the following classes:

1. **Structured data**
    1. Numerical
    2. Categorical
2. **Unstructured data**
3. **Semi-structured data**

We will use and analyse different types of data during this course.
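
To make the three classes above more concrete, here is a minimal sketch using only the Python standard library. All values (painting titles, years, the note text, the JSON string) are invented for illustration and do not come from a real dataset.

```python
import json
from collections import Counter
from statistics import mean

# Structured data: every record has the same fields (think of a table).
# The values below are invented purely for illustration.
paintings = [
    {"title": "Painting A", "year": 1632, "genre": "portrait"},
    {"title": "Painting B", "year": 1658, "genre": "landscape"},
    {"title": "Painting C", "year": 1665, "genre": "portrait"},
]

# Numerical column: arithmetic is meaningful.
mean_year = mean(row["year"] for row in paintings)

# Categorical column: counting categories is meaningful, averaging is not.
genre_counts = Counter(row["genre"] for row in paintings)

# Unstructured data: free text has no predefined fields.
note = "The portrait shows a merchant. The merchant holds a letter."
word_counts = Counter(note.lower().replace(".", "").split())

# Semi-structured data: JSON has labelled fields, but records may differ in shape.
raw = '[{"title": "Painting A", "tags": ["oil", "canvas"]}, {"title": "Painting B"}]'
records = json.loads(raw)
tags_per_record = [len(record.get("tags", [])) for record in records]

print(mean_year, genre_counts, word_counts.most_common(2), tags_per_record)
```

Note how the second JSON record has no `tags` field at all. This is typical for semi-structured data and is the reason for using `record.get("tags", [])` instead of `record["tags"]`.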

## What is "Data Science"?

> Data science is a "concept to **unify statistics, data analysis, informatics, and their related methods**" in order to "**understand** and analyse **actual phenomena**" with data (Hayashi 1998). It uses techniques and theories drawn from many fields within the context of **mathematics, statistics, computer science, information science, and domain knowledge**. (quote from this [Wikipedia article](https://en.wikipedia.org/wiki/Data_science), my highlights)

![Google ngram "Data Science" 1990–2019.](images/ngram.png)

<sup>Google Ngram viewer "Data Science" 1990–2019.</sup>

![Google Trends "Data Science" 2004–2022.](images/trends.png)

<sup>Google Trends "Data Science" 2004–2022.</sup>

"Data science" is often thought of as "applied statistics" or "big data analysis with computers". However, I hope this course will show that sophisticated statistics and big data, although important at times, are not mandatory for doing data science.

Instead, I suggest focusing on **applying computational methods to analyze data in order to better understand real-world phenomena**.

The working definition of "data science" for this course is:
    
> Answering a research question by analyzing data with Python.  
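
As a rough illustration of this working definition, the following sketch walks from a question to an answer using only the standard library. The research question, the column names `year` and `word_count`, the helper `mean_by_decade`, and the CSV content are all hypothetical and invented for this example. In the course itself, libraries such as pandas will usually take over this kind of work.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical research question:
# "Are the letters in my corpus longer in later decades?"
# The CSV below is invented toy data, not a real dataset.
RAW_CSV = """year,word_count
1701,220
1705,180
1712,310
1718,290
1723,400
"""


def mean_by_decade(csv_text):
    """Return {decade: mean word count} for rows with 'year' and 'word_count' columns."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        decade = int(row["year"]) // 10 * 10  # e.g. 1705 -> 1700
        groups[decade].append(int(row["word_count"]))
    return {decade: mean(values) for decade, values in sorted(groups.items())}


result = mean_by_decade(RAW_CSV)
for decade, average in result.items():
    print(f"{decade}s: {average:.1f} words on average")
```

The point is the workflow rather than the method: formulate a question, get data into Python, compute something that bears on the question, and interpret the result.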

# 3. Requirements

This course has both formal and technical requirements.

## Technical Requirements

1\. First of all, you need **decent Python knowledge**. This does not mean that you need to know every single library we will be working with, but I expect you to feel comfortable using Python. Don't hesitate to ask questions, though.

2\. You should be comfortable using **Jupyter Notebooks** (if not, please see the course resources).

3\. You should have, bring and use your **own laptop/notebook** during this course.

4\. (optional) Using **git**/**GitHub** might be a good idea, and I highly recommend you start using git during your projects (if you haven't done so already).

## Formal Requirements

You will receive **3 CP** for this course. What I expect from you:

1. You **take part in the classes** on a regular basis.
2. You conduct a **full-fledged data analysis** using Python and Jupyter notebooks on a dataset that you are interested in.
3. You **present and discuss your research** towards the end of the semester.

## Course Structure

Starting with the third session (1 November, APIs), the typical class structure should look like this:

1. I will provide **literature** and/or **videos** in preparation for the session. There will be **no tests**, and I will not check if you've read the literature or watched the videos. But I will assume that you have a basic idea of the session's topic.
2. Each session will start with a **brief input** (max. 20 minutes), followed by a **hands-on** part where I will give you tasks, you can start working on your material, and/or where we discuss questions and problems (depending on the session's topic).

There is a plethora of literature, websites, and video tutorials on advanced Python techniques and data science. I will refer to different sources throughout this course (see also the bibliography in the corresponding ILIAS folder).

Yet, a book that I personally found very useful and that I will often refer to in this course is *Practical Data Science with Python* by Nathan George (O'Reilly, 2021). You should have free access to the online version of this book via the UB, which is why I will mostly indicate the corresponding page numbers instead of uploading scans of individual chapters.

![](images/practical_data_science.png)

1. Session (18 Oct 2022, 10:00 - 12:00) **General Introduction and Semester Planning**
2. Session (25 Oct 2022, 10:00 - 12:00) **Data Science: An Introduction**
3. Session (01 Nov 2022, 10:00 - 12:00) **Data Acquisition I: APIs**
4. Session (08 Nov 2022, 10:00 - 12:00) **Data Acquisition II: Web Scraping**
5. Session (15 Nov 2022, 10:00 - 12:00) **Data Cleaning & Data Preparation**
6. Session (22 Nov 2022, 10:00 - 12:00) **Exploratory Data Analysis (EDA)**
7. Session (29 Nov 2022, 10:00 - 12:00) **Machine Learning I**
8. Session (06 Dec 2022, 10:00 - 12:00) **Machine Learning II**
9. Session (13 Dec 2022, 10:00 - 12:00) **Data Analysis I**
10. Session (20 Dec 2022, 10:00 - 12:00) **Data Analysis II**
11. Session (10 Jan 2023, 10:00 - 12:00) **Visualization**
12. Session (17 Jan 2023, 10:00 - 12:00) **Publication**
13. Session (24 Jan 2023, 10:00 - 12:00) **Presentation**
14. Session (31 Jan 2023, 10:00 - 12:00) **Final Session**


# 4. Goals

1\. Deepen your **Python knowledge**.

2\. Get familiar with **important Python libraries** (pandas, scikit-learn, seaborn etc.).

3\. Work with **Jupyter Notebooks** and **virtual environments**.

4\. Work with **git**, **Markdown**, **HTML/CSS** etc.

5\. Acquire some basic **statistical skills**.

6\. Know how to **conduct data-driven research** with Python (**!!!**).

It is **crucial that you find a dataset and a topic that interests you** during the first half of the semester! You may work with dummy data for a while, for instance to learn certain methods. However, why would you want to analyze sensor data from an industrial machine if your actual work focusses on Dutch oil paintings from early modern times?

# 5. Tasks for the Next Session

If you encounter any problems, please do not hesitate to contact me. We will have enough time during the next session to solve technical problems. Still, it would be great if you let me know in advance (for instance, via email).

1\. **Set up a folder and a Python environment** (I recommend using Python 3.8, 3.9 or 3.10) for this course. Please do not use Anaconda/Conda; preferably use `venv` instead.

2\. Get familiar with **Jupyter notebooks** (and install them in your virtual environment).

3\. Do some **brainstorming**: Is there a (research) question you are interested in and that you would like to approach using a computational data analysis? What kind of data do you need? Do you already have an idea where you could get it from?


4\. (**optional**) Use git for version control.

5\. (**optional**) Read through some of the articles in the `Resources` (ILIAS/GitHub) folder about "Data Science".


# 6. Resources

## Jupyter Notebooks

How to use Jupyter notebooks (by Jeremy Howard on [Kaggle](https://www.kaggle.com)): [Jupyter 101](https://www.kaggle.com/code/jhoward/jupyter-notebook-101)

## Virtual Environments (Python)
Creating a virtual environment using `venv`: [Official Python venv docs and tutorial](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#creating-a-virtual-environment)
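
The setup described in the linked guide can also be scripted with the standard library's `venv` module. The following is a minimal sketch, not a required workflow: the function name `create_course_env` and the environment folder name `.venv` are assumptions chosen for illustration.

```python
from pathlib import Path
import venv


def create_course_env(project_dir, with_pip=True):
    """Create project_dir (if needed) and a virtual environment in project_dir/.venv.

    The names used here are illustrative; pick whatever fits your own setup.
    """
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)

    env_dir = project_dir / ".venv"
    # with_pip=True bootstraps pip into the new environment,
    # so that packages can be installed into it afterwards.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir
```

Calling `create_course_env("data-science-course")` (the folder name is again only an example) should have roughly the same effect as running `python -m venv data-science-course/.venv` in a terminal. Activating the environment and installing Jupyter into it are still done on the command line, as described in the linked guide.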

## Markdown

A [Medium](https://medium.com/) article introducing Markdown with Jupyter notebooks: [Link](https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd)

## Git/GitHub
W3Schools git tutorial: [Link](https://www.w3schools.com/git/default.asp?remote=github)