# Overview
This lecture will introduce some of the foundational tools that data scientists use to work with data in Python. We will focus primarily on `pandas`, a popular package for loading data, saving it, and all the manipulation that happens in between. Specifically, we will cover:
- Importing packages
- The `DataFrame` and `Series` objects in `pandas`
- Built-in helpers for dataframes
- Selecting and slicing data
- Merging and joining data
- Parsing data from various formats

# Package imports
We experimented with package imports a bit last time to unlock `math` functionality. From now on we will rely heavily on Python's various packages to do data science work. Importing a package can follow a number of formats, but we'll stick to the canonical import statements.
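
As a rough illustration of the import formats mentioned above, here is a minimal sketch using the standard-library `math` module. The alias `m` is an arbitrary name picked for this example, not a convention.

```python
# Three common ways to import from the standard-library math module.
import math             # access names as math.<name>
import math as m        # same module under a shorter alias (alias name is arbitrary)
from math import sqrt   # pull a single function into the current namespace

root_plain = math.sqrt(16)
root_alias = m.sqrt(16)
root_direct = sqrt(16)
print(root_plain, root_alias, root_direct)  # 4.0 4.0 4.0
```

All three forms call the same function; they differ only in how much you type and how clearly the origin of the name is visible at the call site.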

To import `pandas`, put the following at the beginning of a Python script (or in a notebook cell):

In [1]:
import pandas as pd

We can now access any functionality in the `pandas` package by doing `pd.function_name`. Some of the other packages we will use extensively are all imported in the cell below:

In [2]:
import numpy as np  # Numpy for most math, including linear algebra
import matplotlib.pyplot as plt  # Matplotlib for plotting

# For scikit-learn functionality, we will usually just import one class or function at a time as-needed
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

If you get an error when trying to import any of those packages, you likely don't have it installed in your Python environment. This can be solved by installing it with `pip`. For example:

In [3]:
!pip install pandas



# Pandas

## DataFrame and Series objects
The best way to learn about pandas DataFrames and Series objects is to see an example. We'll load in a classic dataset of flower characteristics called the `iris` dataset. We'll use the `read_csv` function, which can load a CSV (comma-separated values) file from either a web address or a filepath on your local machine.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
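
If you want to try `read_csv` without a network connection, it also accepts a file-like object. The sketch below feeds it an in-memory string holding a few iris-style rows; the variable names `csv_text` and `small_df` are just for this illustration, and the three rows stand in for the full file.

```python
import io

import pandas as pd

# A tiny CSV held in memory, standing in for a file on disk or a URL.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
"""

# read_csv treats the first line as the header row by default.
small_df = pd.read_csv(io.StringIO(csv_text))
print(small_df.shape)          # (3, 5)
print(list(small_df.columns))
```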

This variable `df` is an instance of the pandas `DataFrame` class, which comes with lots of useful methods for working with data. Let's see what the dataframe looks like if we just print it out:

In [3]:
df

Kind of like an Excel spreadsheet! This makes sense, because we loaded the data from a CSV file. There are 150 rows, numbered by the integers 0...149. This sequence of integers is called the DataFrame *index*, and we'll see shortly that the index doesn't have to be integers: it could be strings, or times, or anything really. But for now it is just a range of integer values, which we can look at explicitly using the `DataFrame.index` attribute.

In [14]:
print(df.index)

RangeIndex(start=0, stop=150, step=1)
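
To make the point about non-integer indices concrete, here is a small sketch with two made-up tables, one indexed by dates and one by strings. The data and names (`temps`, `cities`) are hypothetical and only serve to show the index types.

```python
import pandas as pd

# Hypothetical measurements indexed by calendar day.
temps = pd.DataFrame(
    {"temperature": [20.5, 21.0, 19.8]},
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

# Hypothetical records indexed by name.
cities = pd.DataFrame(
    {"population": [100, 200]},
    index=["springfield", "shelbyville"],
)

print(type(temps.index).__name__)  # DatetimeIndex
print(cities.index.tolist())       # ['springfield', 'shelbyville']
```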


The dataframe also has 5 columns, each with a column name. We can list the names using the `DataFrame.columns` attribute:

print(df.columns)

### DataFrame helpers

There are [tons of functions and attributes](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) tacked onto the DataFrame object. We'll just demonstrate a few of the most common ones here.

The standard way to take a quick look at a dataframe is to use the following methods, which print out the first 5 rows (the head) and the last 5 rows (the tail) of the dataframe:

In [22]:
df.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [23]:
df.tail()

     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica


We can also look at the dataframe shape and its data types:

In [40]:
df.shape  # (rows, cols)

(150, 5)

In [41]:
df.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

We can also look at summary info, which includes most of the things we have printed out individually, plus a count of the non-null entries in each column:

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


And we can calculate statistics to get a baseline sense of what the numerical data looks like. Note that the `species` column is dropped in the output, because it is non-numeric.

In [45]:
df.describe()

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
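
If you do want the non-numeric columns summarized as well, `describe` takes an `include` argument. The following is a minimal sketch on a made-up four-row table (the `flowers` frame is hypothetical, not the iris data):

```python
import pandas as pd

# Hypothetical mini-table with one numeric and one text column.
flowers = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

numeric_summary = flowers.describe()            # numeric columns only
full_summary = flowers.describe(include="all")  # every column

print(list(numeric_summary.columns))          # ['petal_length']
print(full_summary.loc["unique", "species"])  # 2
```

For text columns the summary reports `count`, `unique`, `top`, and `freq` instead of the mean and quantiles.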


In [None]:
# Do one for groupby
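
One possible shape for the `groupby` example noted above, as a hedged sketch: it uses a made-up four-row table (the `flowers` frame and its values are hypothetical) and computes the mean petal length per species.

```python
import pandas as pd

# Hypothetical mini-table; values are illustrative, not taken from iris.
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "petal_length": [1.4, 1.6, 4.5, 4.7],
})

# Split the rows by species, then average petal_length within each group.
mean_by_species = flowers.groupby("species")["petal_length"].mean()
print(mean_by_species)
```

The result is a Series whose index holds the group labels, so `mean_by_species["setosa"]` looks up the value for one species.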

### Accessing and Slicing DataFrames
The most common operation you will do with datasets is slicing them up: extracting a portion you are interested in, getting rid of a portion you aren't interested in, using one part of the data to train a model and another part to test it, and so on. Pandas has a couple of ways of helping you do this.

First, let's say you just want to get a single column. You can access that with square brackets and the column name:

In [16]:
sepal_width = df["sepal_width"]

In [17]:
print(sepal_width)

0      3.5
1      3.0
2      3.2
3      3.1
4      3.6
      ... 
145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: sepal_width, Length: 150, dtype: float64


In [18]:
type(sepal_width)

pandas.core.series.Series

As you can see, accessing the single column returned an object that is no longer a DataFrame. Instead it is a `Series`, which you can think of as a one-dimensional DataFrame. There are lots of methods that apply to Series objects but not to DataFrames (like really a lot -- see [the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)). One of my favorites for demonstration purposes:

In [47]:
species_list = df["species"].unique()

In [48]:
species_list

array(['setosa', 'versicolor', 'virginica'], dtype=object)
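
A couple of related Series methods are worth knowing alongside `unique`. This sketch uses a short hand-made Series (the values are illustrative) to show `nunique` and `value_counts`:

```python
import pandas as pd

# Hypothetical species labels, standing in for df["species"].
species = pd.Series(["setosa", "setosa", "versicolor", "virginica"], name="species")

distinct = species.unique()      # distinct values, in order of first appearance
n_distinct = species.nunique()   # how many distinct values there are
counts = species.value_counts()  # how often each value occurs

print(list(distinct))    # ['setosa', 'versicolor', 'virginica']
print(n_distinct)        # 3
print(counts["setosa"])  # 2
```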

What if we wanted to extract two columns though? Instead of specifying a single string column name, we now pass a list of column names to the square brackets:

In [19]:
sepal_geom = df[["sepal_width", "sepal_length"]]

In [24]:
sepal_geom.head()

   sepal_width  sepal_length
0          3.5           5.1
1          3.0           4.9
2          3.2           4.7
3          3.1           4.6
4          3.6           5.0


In [21]:
type(sepal_geom)

pandas.core.frame.DataFrame

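This time the result is a DataFrame, because we passed a list of column names. (Even a one-element list, like `df[["sepal_width"]]`, returns a DataFrame rather than a Series.) In both cases though, we retained the same index that the DataFrame had originally.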
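
The difference between single and double square brackets can be checked directly. In this minimal sketch on a made-up two-row table, it is the list, rather than the number of columns requested, that determines the return type:

```python
import pandas as pd

# Hypothetical two-row table.
flowers = pd.DataFrame({"sepal_width": [3.5, 3.0], "sepal_length": [5.1, 4.9]})

as_series = flowers["sepal_width"]   # a string returns a Series
as_frame = flowers[["sepal_width"]]  # a list returns a DataFrame, even with one name

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_series.index.equals(flowers.index))  # True: the index is carried along
```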

Much of the time though, we don't want to keep all of the indices. What if we just wanted indices 0...20? In that case we would use `df.loc`:

In [25]:
df20 = df.loc[:20]

In [26]:
df20

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa


This worked a bit like list slicing, but with two differences. The obvious one is that index `20` is included in the output. When slicing a list with `[:20]` you only retain indices 0...19. The more subtle difference, which we can't tell from this example, is that when using `.loc[]`, you need to give it a range of the *DataFrame indices* to slice out, rather than a range of integer row numbers. In this case, those were the same thing, but we could easily have a dataframe where they are different. 
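
The inclusive endpoint is easy to verify by counting rows. Here is a small sketch comparing list slicing with `.loc` slicing on a made-up 30-row frame with the default integer index:

```python
import pandas as pd

values = list(range(30))
frame = pd.DataFrame({"value": values})  # default index is 0...29

list_slice = values[:20]    # list slicing stops before position 20
loc_slice = frame.loc[:20]  # .loc slicing includes the label 20

print(len(list_slice))      # 20
print(len(loc_slice))       # 21
print(loc_slice.index[-1])  # 20
```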

In [28]:
df2 = pd.DataFrame(index=["a", "b", "c"], columns=["red", "green", "blue"], data=np.random.rand(3, 3))

In [29]:
df2

        red     green      blue
a  0.632488  0.656912  0.643256
b  0.652871  0.639123  0.175608
c  0.381724  0.598483  0.302355


What happens if we try to use `.loc[]` to extract the first two rows?

In [30]:
df2_2 = df2.loc[:1]  # This throws an error, because our indices are strings, not integers

TypeError: cannot do slice indexing on Index with these indexers [1] of type int

Instead, we need to give it the string names we want:

In [32]:
df2_2 = df2.loc[["a", "b"]]

In [33]:
df2_2

        red     green      blue
a  0.632488  0.656912  0.643256
b  0.652871  0.639123  0.175608
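
Listing the labels is one option; `.loc` also accepts a slice of labels, which (as with the integer labels earlier) includes both endpoints. A minimal sketch, using a small frame filled with the integers 0...8 instead of random numbers so the result is reproducible:

```python
import numpy as np
import pandas as pd

# Same layout as df2, but with fixed values (an assumption made for reproducibility).
colors = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "b", "c"],
    columns=["red", "green", "blue"],
)

by_list = colors.loc[["a", "b"]]  # explicit list of labels
by_slice = colors.loc["a":"b"]    # label slice, endpoint "b" included

print(by_slice.index.tolist())   # ['a', 'b']
print(by_list.equals(by_slice))  # True
```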


We can also combine row access with column access to pull out certain row-column combinations:

In [34]:
df2_2gb = df2.loc[["a", "b"], ["green", "blue"]]

In [35]:
df2_2gb

      green      blue
a  0.656912  0.643256
b  0.639123  0.175608


Sometimes though, we just want to treat the dataframe like a two-dimensional array or nested list and access its contents purely using integer row and column numbers. Pandas lets you do that, but you need to use `.iloc` instead of `.loc`. Unlike `.loc`, slices passed to `.iloc` exclude the end position, just like list slicing:

In [38]:
df2_i = df2.iloc[:2, 1:]

In [39]:
df2_i

      green      blue
a  0.656912  0.643256
b  0.639123  0.175608


We've produced the same output, but from a different point of view. Instead of telling pandas to return specific index and column *names* (which is what `loc` is for), we told pandas to return specific row and column *numbers* (which is what `iloc` is for). You'll probably find yourself using each in different circumstances; just be careful you don't mix them up, because they expect different inputs and produce different outputs.
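
Mixing the two up is most dangerous when the index holds integers that don't match the row positions, because both calls succeed but return different rows. A small sketch with a made-up frame whose integer index runs in reverse:

```python
import pandas as pd

# Hypothetical scores; the integer index labels are deliberately out of order.
scores = pd.DataFrame({"score": [90, 80, 70]}, index=[2, 1, 0])

by_label = scores.loc[0, "score"]  # the row whose index *label* is 0 (the last row)
by_position = scores.iloc[0, 0]    # the row at *position* 0 (the first row)

print(by_label)     # 70
print(by_position)  # 90
```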

### Breakout: Loading and Slicing Dataframes (need to create)

### Merging DataFrames

### Saving DataFrames

### Breakout: Merging DataFrames (need to create)

# Parsing Data

## Text and CSVs

## Web Data and the JSON format

## Breakout: Tricky I/O (need to create)