## Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

It can be seen as a Python version of R's `data.frame` type, combined with some of the functionality of `dplyr` and `tidyr`.

* Provides the `DataFrame` object for data manipulation with integrated indexing.
* Provides tools for reading and writing data between in-memory data structures and different file formats (CSV and text files, Microsoft Excel, SQL databases, etc.).
* Builds on NumPy for fast array computation and on Matplotlib for plotting.
* Supports the most common data manipulation tasks: subsetting, filtering, merging, sorting, grouping, joining, reshaping, and more.

Refer to the [documentation](https://pandas.pydata.org/docs/reference/index.html#api) for more details.
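
As a minimal sketch of the first two points, the snippet below builds a small `DataFrame` and round-trips it through CSV. The column names and values are hypothetical, and an in-memory `io.StringIO` buffer stands in for a file on disk so the example needs no download.

```python
import io

import pandas as pd

# Hypothetical toy table; the values are made up for illustration
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "value": [1.5, 2.0, 3.5],
})

# Write to an in-memory CSV buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Passing `index=False` keeps the row index out of the CSV, so the restored table has the same two columns as the original.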

In [None]:
# Download the data from the UCI Machine Learning Repository to the local ./data/ directory
# https://archive.ics.uci.edu/ml/datasets/Iris
!wget -P ./data/ https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

In [None]:
import pandas as pd

# Iris data columns names
names = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'species']

# Load the dataset from CSV
data = pd.read_csv('./data/iris.data', names=names)

# What datatype is data
type(data) # pandas DataFrame

# First 5 rows of the dataset
# data.head()

# Summary statistics
# data.describe()

# Number of rows and columns and types
# data.info()

# Convert a column to a NumPy array
# subset = data['sepal length (cm)'].values
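
The conversion to NumPy can also be written with `to_numpy()`, which the pandas documentation recommends over `.values`. The sketch below uses a hypothetical three-row frame rather than the Iris file so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Iris measurements
df = pd.DataFrame({"length": [5.1, 4.9, 4.7], "width": [3.5, 3.0, 3.2]})

# A single column becomes a 1-D array
column_array = df["length"].to_numpy()

# The whole frame becomes a 2-D array (rows x columns)
table_array = df.to_numpy()

print(column_array.shape, table_array.shape)
```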



In [None]:
# Indexing and slicing

# Index by integer position using iloc
# Same as numpy array indexing

subset = data.iloc[0] # First row
# subset = data.iloc[0, 0] # First row, first column
# subset = data.iloc[0:3] # First 3 rows
# subset = data.iloc[:,-1] # Last column all rows

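# Index by row label and column name using loc
# Label-based lookup, similar to Python dictionary indexing
subset = data.loc[0, 'sepal length (cm)'] # Row labeled 0, 'sepal length (cm)' column
subset = data.loc[0]['sepal length (cm)'] # Same value via chained indexing (the single loc call above is preferred)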


# Index by column name
# Same as python dictionary indexing

# subset = data['sepal length (cm)'] # First column
# subset = data[['sepal length (cm)', 'sepal width (cm)']] # First and second column

# Index by boolean mask
# subset = data[data['sepal length (cm)'] > 5.0] # All rows where sepal length > 5.0

# Index by boolean mask and column name
# subset = data[data['sepal length (cm)'] > 5.0]['sepal length (cm)'] # sepal length > 5.0 and only sepal length column

subset
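
The difference between `iloc` (position) and `loc` (label) only shows up when the index is not the default `0, 1, 2, ...`. The sketch below assumes a hypothetical frame with index labels 10, 20, 30, and also shows a boolean mask and a column name combined in a single `loc` call instead of two chained brackets.

```python
import pandas as pd

# Hypothetical toy frame with a non-default integer index
df = pd.DataFrame(
    {"length": [5.1, 4.9, 6.3], "width": [3.5, 3.0, 3.3]},
    index=[10, 20, 30],
)

# Position-based: the row at position 0 (its label happens to be 10)
first_by_position = df.iloc[0]

# Label-based: the row whose index label is 10
first_by_label = df.loc[10]

# Row filter and column selection in one loc call
long_widths = df.loc[df["length"] > 5.0, "width"]

print(long_widths)
```

Here `df.loc[0]` would raise a `KeyError`, because no row carries the label 0.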

In [None]:
# Aggregation functions

# Mean of each column (excluding non-numeric columns)
data.iloc[:,:-1].mean()
data.iloc[:,:-1].mean(axis=1) # Mean of each row

# Unique values in a column
data['species'].unique()

# Mean of each column by species
data.groupby('species').mean()
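
`groupby` can also compute several statistics at once through `agg`. This is a small sketch on a hypothetical four-row frame; `numeric_only=True` is passed so the mean is restricted to numeric columns.

```python
import pandas as pd

# Hypothetical toy frame with two groups
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 6.0],
    "width": [0.5, 1.5, 1.0, 1.0],
})

# Mean of every numeric column within each group
per_species_mean = df.groupby("species").mean(numeric_only=True)

# Several statistics for one column within each group
summary = df.groupby("species")["length"].agg(["mean", "min", "max"])

print(per_species_mean)
print(summary)
```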

In [None]:
# Plotting

# Plot histogram of each column
data.iloc[:,:-1].hist(figsize=(15,10))

# Scatter plot of sepal length vs sepal width
# data.plot.scatter(x='sepal length (cm)', y='sepal width (cm)', figsize=(15,10))

# Scatter plot of sepal length vs sepal width by species

# Map each species name to an integer ID, used to color the points
for i, species in enumerate(data['species'].unique()):
    data.loc[data['species'] == species, 'species_id'] = int(i)

data.plot.scatter(x='sepal length (cm)', y='sepal width (cm)', c='species_id', colormap='viridis', figsize=(15,10))
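
As an alternative to the explicit loop, `pd.factorize` assigns integer codes to the distinct values of a column in one call. The sketch below uses a hypothetical four-element series rather than the Iris data.

```python
import pandas as pd

# Hypothetical species labels
species = pd.Series(["setosa", "versicolor", "setosa", "virginica"])

# codes: one integer per element; uniques: the distinct labels in order of first appearance
codes, uniques = pd.factorize(species)

print(codes)
print(list(uniques))
```

The resulting `codes` array could be stored in a `species_id` column and passed to the scatter plot through `c='species_id'`.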