STAT 451: Machine Learning (Fall 2021)  
Instructor: Sebastian Raschka (sraschka@wisc.edu)  

# L05 - Data Preprocessing and Machine Learning with Scikit-Learn

# 5.1 Reading a Dataset from a Tabular Text File

In [1]:
%load_ext watermark
%watermark -v -a 'Sebastian Raschka' -p pandas

Author: Sebastian Raschka

Python implementation: CPython
Python version       : 3.9.6
IPython version      : 7.27.0

pandas: 1.3.2



## Overview

In this lecture, we close the "Computational Foundation" section by introducing another Python library, pandas, which is extremely handy for data (pre)processing. The second focus of this lecture is the [Scikit-learn](http://scikit-learn.org) machine learning library, which is widely considered the most mature and well-designed general machine learning library.

## Pandas -- A Python Library for Working with Data Frames

- Pandas is probably the most popular and convenient data wrangling library for Python (official website: https://pandas.pydata.org) 
- The name pandas is derived from "panel data" (PANel-DAta-S).
- Pandas data frames are relatively similar to data frames in R.
- How is it different from NumPy arrays? 
    - Allows for heterogeneous data (columns can have different data types)
    - Adds some more convenient functions on top that are handy for data processing
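
To make the difference to NumPy arrays concrete, here is a minimal sketch. The toy data frame `df_toy` and its values are made up for illustration; only the column names are borrowed from the Iris file used later in this lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column and one string column.
data = {
    "SepalLength[cm]": [5.1, 4.9, 4.7],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
}
df_toy = pd.DataFrame(data)

# Each DataFrame column keeps its own data type.
print(df_toy.dtypes)

# A NumPy array has a single dtype for all of its elements, so converting
# the mixed-type data frame falls back to the generic `object` dtype.
arr = df_toy.to_numpy()
print(arr.dtype)
```

The numeric column stays `float64` inside the data frame, whereas the converted array has to use one common type for both columns.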

### Loading Tabular Datasets from Text Files

- Here, we are working with structured data, that is, data organized similarly to a "design matrix" (see lecture 1), with examples as rows and features as columns (in contrast to unstructured data such as text or images).
- CSV stands for "comma-separated values" (also common: TSV, tab-separated values).
- The `head` command is a Linux/Unix command that shows the first 10 lines of a file by default; the `!` denotes that Jupyter/the IPython kernel should execute it as a shell command (`!`-commands may not work if you are on Windows, but this is not essential for following along).
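
If `!head` is not available (for example, on Windows), the first lines of a file can also be read in plain Python. The following is a minimal sketch, not part of the original notebook: the helper name `head` and the temporary demo file are assumptions for illustration. In the notebook, the path would be `data/iris.csv`.

```python
import os
import tempfile
from itertools import islice


def head(path, n=10):
    """Return the first n lines of a text file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a small temporary file so that the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Id,Species\n1,Iris-setosa\n2,Iris-setosa\n")
    first_lines = head(demo_path, n=2)

print(first_lines)
```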

In [2]:
!head data/iris.csv

Id,SepalLength[cm],SepalWidth[cm],PetalLength[cm],PetalWidth[cm],Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
6,5.4,3.9,1.7,0.4,Iris-setosa
7,4.6,3.4,1.4,0.3,Iris-setosa
8,5.0,3.4,1.5,0.2,Iris-setosa
9,4.4,2.9,1.4,0.2,Iris-setosa


- We use the `read_csv` function to load the CSV file into a pandas data frame object `df` of the class `DataFrame`.
- Data frames also have a `head` method; by default, it shows the first 5 rows.

In [3]:
import pandas as pd


df = pd.read_csv('data/iris.csv')
df.head()

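|   | Id | SepalLength[cm] | SepalWidth[cm] | PetalLength[cm] | PetalWidth[cm] | Species |
|---|----|-----------------|----------------|-----------------|----------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |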
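
As a self-contained variant of the cell above, the following sketch reads the same kind of data from an in-memory string instead of a file, so it runs without `data/iris.csv`. It also shows how `read_csv` can be pointed at tab-separated text via its `sep` parameter and how `head` accepts the number of rows to show. The variable names (`csv_text`, `tsv_text`, `df_csv`, `df_tsv`) are made up for illustration.

```python
import io

import pandas as pd

# The first few lines of the Iris file, held as in-memory text.
csv_text = """Id,SepalLength[cm],SepalWidth[cm],PetalLength[cm],PetalWidth[cm],Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# read_csv accepts file-like objects as well as file paths.
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same content with tabs as separators (TSV); `sep` selects the delimiter.
tsv_text = csv_text.replace(",", "\t")
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# head(n) returns the first n rows instead of the default 5.
print(df_csv.head(2))

# Both data frames hold identical content.
print(df_csv.equals(df_tsv))
```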


In [4]:
type(df)

pandas.core.frame.DataFrame

- It is always good to double-check the dimensions and see if they are what we expect.
- The `DataFrame` `shape` attribute works the same way as the NumPy array `shape` attribute (Lecture 04).

In [5]:
df.shape

(150, 6)
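
The dimension check can also be written as an explicit assertion. This is a minimal sketch on a made-up data frame `df_small` (an assumption for illustration); for the Iris data frame above, the expected shape would be `(150, 6)`.

```python
import pandas as pd

# Hypothetical small data frame with 3 rows and 2 columns.
df_small = pd.DataFrame({
    "Id": [1, 2, 3],
    "PetalWidth[cm]": [0.2, 0.2, 0.2],
})

# shape is a (rows, columns) tuple, just like for a 2D NumPy array.
n_rows, n_cols = df_small.shape
print(n_rows, n_cols)

# Fail early if the dimensions are not what we expect.
assert df_small.shape == (3, 2), "unexpected data frame dimensions"

# The underlying NumPy array reports the same shape.
same_shape = df_small.to_numpy().shape == df_small.shape
print(same_shape)
```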