GitHub - ypeleg/data_hack_haifa: Datahack Haifa

Hands-On Introduction to Data Science

Yam Peleg

Syllabus and Class Notes

Total time (including breaks): 3 Hours

Each section is about 45-55 minutes to allow for a small break between sections. The tutorial will be a live-coding lecture, with a break for exercises and questions every 45-55 minutes.

Important: This tutorial is extremely Hands-On! Bring a computer with you so you won't miss out!

:00 - :45 Chapter 1: Pandas DataFrame Basics

Running Python
- Anaconda, Python, IPython, and Jupyter notebooks
- Installing packages
- conda environments
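
As a rough illustration of the setup topics above, the sketch below reports which Python interpreter is running and whether a conda environment is active. The helper name `describe_environment` is made up for this example; it relies on the `CONDA_DEFAULT_ENV` variable that conda sets when an environment is activated.

```python
import os
import sys


def describe_environment():
    """Return a few facts about the Python interpreter that is currently running."""
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,  # full path of the running interpreter
        "prefix": sys.prefix,          # root folder of the active installation
        # Set by conda when an environment is activated; None otherwise.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `conda_env` prints `None`, Python is probably not running from an activated conda environment.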

Before we start cleaning data, let's begin by covering the basics of the Pandas library. We'll cover importing libraries in Python, and how to load your own datasets into Pandas. From there, you'll typically want to look around your data, so we'll cover various ways we can filter and look at our data, calculate simple aggregate statistics and visualize them. This section will end with how to save our data into files we can share with others.

Loading your first dataset
Looking at columns, rows, and cells
Subsetting columns
Subsetting rows
Subsetting both columns and rows
Boolean subsetting
Grouped and aggregated calculations
Export/save data
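
The topics above can be sketched in a few lines of pandas. This is a minimal example, not the notebook used in class: the data is a made-up toy table loaded from an in-memory string rather than a real file, and the column names (`name`, `group`, `year`, `score`) are invented for illustration.

```python
import io

import pandas as pd

# Toy data standing in for a real CSV file (all values are invented).
csv_text = """name,group,year,score
a,x,2001,1.0
a,x,2002,2.0
b,x,2001,3.0
b,x,2002,4.0
c,y,2001,5.0
c,y,2002,6.0
"""

# Loading a dataset: read_csv accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Looking at the data.
print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names
print(df.head())            # first few rows

# Subsetting columns, rows, and both at once.
names = df["name"]                 # one column -> Series
name_year = df[["name", "year"]]   # several columns -> DataFrame
first_row = df.iloc[0]             # row by position
one_cell = df.loc[0, "score"]      # row label and column name

# Boolean subsetting: keep rows where the condition is True.
recent = df.loc[df["year"] == 2002, ["name", "score"]]

# Grouped and aggregated calculation: mean score per year.
mean_by_year = df.groupby("year")["score"].mean()
print(mean_by_year)

# Export: written to an in-memory buffer here; pass a filename to save to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
```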

:45 - 1:30 Chapter 2: Applying Functions

Sometimes we need a more complex method to tidy our data. Other times, we need to perform more complex tasks on our data. Here we'll cover how to write functions in Python and how to apply them to our data. This way, if a method does not exist to perform the task we want, or if we want to combine multiple tasks together, we can write our own custom functions to process our data.

Writing a Python function
Applying functions
Vectorized functions
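
A small sketch of the three ideas above, using an invented two-column table. The function names (`square`, `average_of_two`, `avg_or_nan`) are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd


def square(x):
    """Return x squared."""
    return x ** 2


def average_of_two(x, y):
    """Return the mean of two numbers."""
    return (x + y) / 2


df = pd.DataFrame({"a": [10, 20, 30], "b": [20, 30, 40]})

# Apply a function to every value of one column.
df["a_sq"] = df["a"].apply(square)

# Apply a function row by row (axis=1 passes each row to the function).
df["row_avg"] = df.apply(lambda row: average_of_two(row["a"], row["b"]), axis=1)

# Vectorized version: plain arithmetic already works on whole columns.
df["vec_avg"] = average_of_two(df["a"], df["b"])


def avg_or_nan(x, y):
    """Scalar-only function: the `if` does not work on a whole column."""
    if x == 20:
        return np.nan
    return (x + y) / 2


# np.vectorize wraps a scalar function so it can be called on arrays.
# It is a convenience wrapper, not a speed-up.
avg_or_nan_vec = np.vectorize(avg_or_nan)
df["cond_avg"] = avg_or_nan_vec(df["a"], df["b"])

print(df)
```

When plain column arithmetic can express the calculation, it is usually the simplest and fastest choice; `apply` and `np.vectorize` are fallbacks for logic that only works one value at a time.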

Exercise: Use the Ebola dataset from the tidy section and, instead of using the .str accessor, write a function to parse out the string.

1:30 - 2:15 Visualizations with Seaborn

Getting ready for feature engineering

We visualize the data to understand it before we transform it. This section walks through the basics of Seaborn and its main plotting functions, which are especially useful for exploratory data analysis. By the end you should be comfortable enough with Seaborn to explore its more advanced features on your own.

Plotting
Box plotting / Scatter Plotting
Basic Statistics
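
As a minimal sketch of the plotting topics above, the example below draws a box plot and a scatter plot with Seaborn and prints basic statistics per group. The data and column names are invented, and the `Agg` backend is selected only so the code also runs without a display; in a Jupyter notebook that line can be dropped and the figure appears inline.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen so no display is needed

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented toy data: two groups with two numeric measurements each.
df = pd.DataFrame({
    "group": ["x", "x", "x", "y", "y", "y"],
    "height": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "weight": [2.0, 4.0, 6.0, 3.0, 5.0, 7.0],
})

# Basic statistics: count, mean, std, min, quartiles, and max per group.
summary = df.groupby("group")["height"].describe()
print(summary)

# Two plots side by side on one figure.
fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Box plot: distribution of a numeric column within each category.
sns.boxplot(data=df, x="group", y="height", ax=ax_box)
ax_box.set_title("Height by group")

# Scatter plot: relationship between two numeric columns, colored by category.
sns.scatterplot(data=df, x="height", y="weight", hue="group", ax=ax_scatter)
ax_scatter.set_title("Weight vs. height")

fig.tight_layout()
```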

2:15 - 3:00 Feature Engineering

After exploring the data, it is time to work with it. A common task is to fit a statistical model to our data. One last processing task is to convert our categorical variables into "dummy variables" for a model. The goal of this last section is to show how pandas fits into the larger data science ecosystem.

Dummy variables
Linear regression in sklearn
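
The two steps above might look like the sketch below. It uses an invented table, not the Titanic data from the exercise; the column names are hypothetical, and the prices were constructed to follow `price = 2 * size + 1`, plus 10 for kind "b", so the fitted coefficients are easy to check.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data: one numeric feature, one categorical feature, one target.
df = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "kind": ["a", "a", "a", "b", "b", "b"],
    "price": [3.0, 5.0, 7.0, 13.0, 15.0, 17.0],
})

# Dummy variables: turn the categorical column into 0/1 indicator columns.
# drop_first=True drops one level ("a") to avoid a redundant column;
# dtype=float keeps every feature numeric.
X = pd.get_dummies(df[["size", "kind"]], columns=["kind"], drop_first=True, dtype=float)
y = df["price"]
print(X.columns.tolist())

# Fit an ordinary least squares linear regression.
model = LinearRegression()
model.fit(X, y)

# One coefficient per feature column, plus the intercept.
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {model.intercept_:.2f}")

predictions = model.predict(X)
```

scikit-learn expects numeric features, which is why the string column is converted with `pd.get_dummies` before fitting.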

Exercise: Fit a model on the Titanic dataset.

Pre-readings

Setup

Python

Anaconda, an all-in-one installer, is recommended.

Regardless of how you choose to install it, please make sure you install Python version 3.x (e.g., 3.7 is fine).

When using the IPython notebook, a programming environment that runs in a web browser, you will need a reasonably up-to-date browser. The current versions of the Chrome, Safari and Firefox browsers are all supported (some older browsers, including Internet Explorer version 9 and below, are not).

Windows

Video Tutorial

Open http://continuum.io/downloads with your web browser.
Download the Python 3 installer for Windows.
Install Python 3 using all of the installation defaults, but make sure to check "Make Anaconda the default Python".

Mac OS X

Video Tutorial

Open http://continuum.io/downloads with your web browser.
Download the Python 3 installer for OS X.
Install Python 3 using all of the defaults for installation.

Linux

Open http://continuum.io/downloads with your web browser.
Download the Python 3 installer for Linux.
(Installation requires using the shell. If you aren't comfortable doing the installation yourself, stop here and request help at the workshop.)
Open a terminal window.
Type
```
bash Anaconda3-
```
and then press tab. The name of the file you just downloaded should appear. If it does not, navigate to the folder where you downloaded the file, for example with:
```
cd Downloads
```
Then, try again.
Press enter. You will follow the text-only prompts. To move through the text, press the space key. Type yes and press enter to approve the license. Press enter to approve the default location for the files. Type yes and press enter to prepend Anaconda to your PATH (this makes the Anaconda distribution the default Python).
Close the terminal window.

Testing if Anaconda was installed

Open up the Anaconda Command Prompt
Type "ipython" into the prompt
You should see Python open up with Python 3.7.x and using the Anaconda distribution
Type "quit()" to exit
Type "jupyter notebook" to launch the notebook (this may take a while if it is the first time you are launching it)
Note the URL (with the token) and paste it into your browser
Close the Anaconda prompt when you're done
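
As an optional extra check, the sketch below can be pasted into the `ipython` prompt or a notebook cell to confirm that the Python version and the libraries named in the syllabus are available. The package list and the helper name `find_missing` are assumptions made for this example, not an official requirements list.

```python
import importlib.util
import sys

# Assumed list of top-level packages used in the tutorial (sklearn = scikit-learn).
REQUIRED_PACKAGES = ["pandas", "seaborn", "sklearn"]


def find_missing(packages):
    """Return the names in `packages` that cannot be found by the import system."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


print("Python version:", sys.version.split()[0])
if sys.version_info < (3,):
    print("Python 3 is required for this tutorial.")

missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

Anaconda normally ships with these libraries; if any are reported missing, they can be installed with `conda install` followed by the package name (`scikit-learn` for `sklearn`).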

Many of the slides and notebooks in this repository are based on other repositories and tutorials.

References:
