# Overview
This lecture will introduce some of the foundational tools that data scientists use to work with data in Python. We will focus primarily on `pandas`, a popular package for loading data, saving it, and all the manipulation that happens in between. Specifically, we will cover:
- Importing packages
- The `DataFrame` and `Series` objects in `pandas`
- Built-in helpers for dataframes
- Selecting and slicing data
- Merging data
- Parsing data from various formats
- Pandas alternatives

# Package imports
We experimented with package imports a bit last time to unlock `math` functionality. From now on we will rely heavily on Python's various packages to do data science work. Importing a package can follow a number of formats, but we'll stick to the canonical import statements.

To import `pandas`, put the following at the beginning of a Python script (or in a notebook cell):

In [None]:
import pandas as pd

We can now access any functionality in the `pandas` package by doing `pd.function_name`. Some of the other packages we will use extensively are all imported in the cell below:

In [None]:
import numpy as np  # Numpy for most math, including linear algebra
import matplotlib.pyplot as plt  # Matplotlib for plotting

# For scikit-learn functionality, we will usually just import one class or function at a time as-needed
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
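
The cells above use three different import forms. As a minimal sketch, the standard-library `math` module (chosen here only because it needs no installation) shows that all three reach the same function:

```python
import math            # plain import: names are accessed as math.<name>
import math as m       # aliased import: same module under a shorter name
from math import sqrt  # import a single function directly into the namespace

print(math.sqrt(16.0))
print(m.sqrt(16.0))
print(sqrt(16.0))

# All three names refer to the very same function object
same_function = (math.sqrt is m.sqrt) and (m.sqrt is sqrt)
print(same_function)
```

The alias form is what `import pandas as pd` uses; the `from ... import ...` form is what the scikit-learn lines use.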

If you get an error when trying to import any of those packages, you likely don't have it installed in your python environment. This can be solved by installing it with `pip`. For example:

In [None]:
!pip install pandas

# Pandas

## DataFrame and Series objects
The best way to learn about pandas DataFrames and Series objects is to see an example. We'll load in a classic dataset of flower characteristics called the `iris` dataset. We'll use the `read_csv` functionality, which can load a CSV (comma separated value) file from either a web address or a filepath on your local machine. 

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

This variable `df` is an instance of a pandas DataFrame, which you should think of as an object or class that contains lots of useful methods for working with data. Let's see what the dataframe looks like if we just print it out:

In [None]:
df

Kind of like an Excel spreadsheet!  This makes sense, because we loaded the data from a csv file. There are 150 rows, numbered by the integers 0...149. This array of integers is called the DataFrame *index*, and we'll see shortly that the index doesn't have to be integers: it could be strings, or times, or anything really. But for now it is just a range of integer values, which we can look at explicitly using the `DataFrame.index` attribute.

In [None]:
print(df.index)
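
To make the point about non-integer indices concrete, here is a small sketch with made-up temperature readings (the data and the names `by_city` and `by_date` are hypothetical). The same values are labelled once by strings and once by dates:

```python
import pandas as pd

# Hypothetical readings labelled by city name (a string index)
by_city = pd.DataFrame(
    {"temp_c": [21.5, 18.0, 25.2]},
    index=["austin", "boston", "miami"],
)

# The same readings labelled by date (a DatetimeIndex)
by_date = pd.DataFrame(
    {"temp_c": [21.5, 18.0, 25.2]},
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

print(by_city.index)
print(by_date.index)
```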

The dataframe also has 5 columns, each with a column name. We can list them using the `DataFrame.columns` attribute:

In [None]:
print(df.columns)

### DataFrame helpers

There are [tons of functions and attributes](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) tacked onto the DataFrame object. We'll just demonstrate a few of the most common ones here.

The standard way to take a quick look at a dataframe is to use the following methods, which print out the first 5 rows (the head) and the last 5 rows (the tail) of the dataframe:

In [None]:
df.head()

In [None]:
df.tail()

We can also look at the dataframe shape and its data types:

In [None]:
df.shape  # (rows, cols)

In [None]:
df.dtypes

We can look at summary info, which includes most of the things we have printed out individually, plus a count of the non-null entries:

In [None]:
df.info()

And we can calculate statistics to get a baseline sense of what the numerical data looks like. Note that the `species` column is dropped in the output, because it is non-numeric.

In [None]:
df.describe()
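
If you do want the non-numeric columns summarized too, `describe` accepts an `include` argument. A minimal sketch on a small made-up frame (the `toy` data is hypothetical, not the iris dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 5.1],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

# Default behaviour: only numeric columns are summarized
numeric_summary = toy.describe()

# include="all" also reports count/unique/top/freq for non-numeric columns
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```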

The `groupby()` method allows you to group the dataframe by the values of one of its columns, so that statistics are computed separately for each group:

In [None]:
df.groupby("species").describe()
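
Grouping is usually followed by some aggregation. The sketch below, on a small made-up frame (hypothetical values), computes one statistic per group and then several at once with `agg`:

```python
import pandas as pd

toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 6.0, 5.0],
})

# One value per species: the mean petal length
mean_by_species = toy.groupby("species")["petal_length"].mean()

# Several statistics per species in a single call
summary = toy.groupby("species")["petal_length"].agg(["mean", "max"])

print(mean_by_species)
print(summary)
```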

### Accessing and Slicing DataFrames
The most common operation you will do with datasets is slice them up: extract some portion you are interested in, get rid of a portion you aren't interested in, use part of the data for training a model and another part for testing it, and so on. Pandas has a couple of ways of helping you do this.

First, let's say you just want to get a single column. You can access that with square brackets and the column name:

In [None]:
sepal_width = df["sepal_width"]

In [None]:
print(sepal_width)

In [None]:
type(sepal_width)

As you can see, accessing the single column returned an object that is no longer a DataFrame. Instead it is a `Series`, which you can think of as a one-dimensional DataFrame. There are lots of methods that apply to Series objects but not to DataFrames (like really a lot -- see [the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)). One of my favorites for demonstration purposes:

In [None]:
species_list = df["species"].unique()

In [None]:
species_list

What if we wanted to extract two columns though? Instead of specifying a single string column name, we now pass a list of column names to the square brackets:

In [None]:
sepal_geom = df[["sepal_width", "sepal_length"]]

In [None]:
sepal_geom.head()

In [None]:
type(sepal_geom)

This time it remained a DataFrame because we passed a list of column names rather than a single name. In both cases though, we retained the same index that the DataFrame had originally.
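
A quick way to see that the return type depends on the brackets rather than on the number of columns is to request a single column both ways. This is a sketch on a made-up two-row frame:

```python
import pandas as pd

toy = pd.DataFrame({"sepal_width": [3.5, 3.0], "sepal_length": [5.1, 4.9]})

as_series = toy["sepal_width"]   # a single name returns a Series
as_frame = toy[["sepal_width"]]  # a list of names returns a DataFrame, even with one name

print(type(as_series))
print(type(as_frame))
print(as_frame.shape)
```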

Much of the time though, we don't want to keep all of the indices. What if we just wanted indices 0...20? In that case we would use `df.loc`:

In [None]:
df20 = df.loc[:20]

In [None]:
df20

This worked a bit like list slicing, but with two differences. The obvious one is that index `20` is included in the output. When slicing a list with `[:20]` you only retain indices 0...19. The more subtle difference, which we can't tell from this example, is that when using `.loc[]`, you need to give it a range of the *DataFrame indices* to slice out, rather than a range of integer row numbers. In this case, those were the same thing, but we could easily have a dataframe where they are different. 

In [None]:
df2 = pd.DataFrame(index=["a", "b", "c"], columns=["red", "green", "blue"], data=np.random.rand(3, 3))

In [None]:
df2

What happens if we try to use `.loc[]` to extract the first two rows?

In [None]:
df2_2 = df2.loc[:1]  # This throws an error, because our indices are strings, not integers

Instead, we need to give it the string names we want:

In [None]:
df2_2 = df2.loc[["a", "b"]]

In [None]:
df2_2
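
Slices also work with `.loc[]` on a string index, as long as the slice endpoints are labels. A minimal sketch, using a made-up frame named `letters` so as not to overwrite `df2`:

```python
import numpy as np
import pandas as pd

letters = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "b", "c"],
    columns=["red", "green", "blue"],
)

# A label slice includes BOTH endpoints, just like .loc[:20] did earlier
first_two = letters.loc["a":"b"]

# Equivalent selection with an explicit list of labels
same_rows = letters.loc[["a", "b"]]

print(first_two)
```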

We can also combine row access with column access to pull out certain row-column combinations:

In [None]:
df2_2gb = df2.loc[["a", "b"], ["green", "blue"]]

In [None]:
df2_2gb

Sometimes though, we just want to treat the dataframe like a two-dimensional array or nested list and access its contents purely using integer row and column numbers. Pandas lets you do that, but you need to use `.iloc` instead of `.loc`:

In [None]:
df2_i = df2.iloc[:2, 1:]

In [None]:
df2_i

We've produced the same output, but from a different point of view. Instead of telling pandas to return specific index and column *names* (which is what `loc` is for), we told pandas to return specific row and column *numbers* (which is what `iloc` is for). You'll probably find yourself using each in different circumstances; just be careful not to mix them up, because they expect different inputs and can produce different outputs.
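
The difference is easiest to see when the index holds integers that do not match the row positions. In this sketch (a made-up frame named `scores`), the same slice `[:1]` means two different things:

```python
import pandas as pd

# Integer labels in reverse order, so labels and positions disagree
scores = pd.DataFrame({"score": [90, 80, 70, 60]}, index=[3, 2, 1, 0])

by_label = scores.loc[:1]      # everything up to and including the LABEL 1
by_position = scores.iloc[:1]  # everything before POSITION 1, i.e. the first row only

print(by_label)
print(by_position)
```

Here `by_label` contains the rows labelled 3, 2, and 1, while `by_position` contains only the row labelled 3.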

### Breakout: Slicing Dataframes

### Merging DataFrames
Oftentimes you will want to combine data from two or more dataframes into a single dataframe. This operation is often called "merging" or "joining" datasets. In pandas, the methods for this rely on logic originally developed for relational database languages like SQL, so if you are familiar with database queries then this section should look familiar. 

In [None]:
# Dataframes of measurements on certain days
df1 = pd.DataFrame({'day':[1, 2, 3, 4, 5], 'meas':[0.1, 0.2, 0.3, 0.4, 0.5 ]})
df2 = pd.DataFrame({'day':[1, 3, 5, 7], 'meas':[1.1, 1.3, 1.5, 1.7]})

In [None]:
df1

In [None]:
df2

When we join two dataframes, there are two basic parameters we need to specify:
- The key, which is specified with the `on` parameter. This should be a shared column or index name (or list of names) between the two dataframes, and is often the independent variable (e.g., time in a time series dataset so that you can match different measurements to the same time basis). If the key you want is named something different in the left and right dataframes, you can use `left_on` and `right_on` for their respective names. 
- The join method, which is specified with the `how` parameter. See [the docs](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) for all your options here, but basically this specifies which dataframe(s) we use in order to create our new merged key. 
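
As a side sketch of `left_on` and `right_on`, suppose the key is called `day` in one dataframe and `day_number` in the other. The frames `sensor_a` and `sensor_b` and their values are made up for illustration:

```python
import pandas as pd

sensor_a = pd.DataFrame({"day": [1, 2, 3], "temp": [20.1, 21.4, 19.8]})
sensor_b = pd.DataFrame({"day_number": [2, 3, 4], "humidity": [0.40, 0.55, 0.61]})

# The key has a different name on each side, so we name both explicitly
merged = sensor_a.merge(
    sensor_b, left_on="day", right_on="day_number", how="left"
)

print(merged)
```

Both key columns are kept in the output, and the row for day 1 has `NaN` in the columns that come from `sensor_b`.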

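For the `df1` and `df2` dataframes above, we will merge on the `day` key and specify a `left` join, which means that `df1['day']` will end up being the day column in the output dataframe (if we do `df2.merge(df1, ...)` then `df2` becomes the left dataframe). The measurements from both dataframes get attached to those `df1.day` values, with `NaN` wherever the source dataframe had no measurement for that day. Because both dataframes have a column named `meas`, pandas appends its default suffixes, so the output columns are named `meas_x` (from `df1`) and `meas_y` (from `df2`).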

In [None]:
df_out = df1.merge(df2, on = 'day', how = 'left')

In [None]:
df_out

Conversely, we could do a right join, which makes the `day` column from `df2` our output day column. 

In [None]:
df_out = df1.merge(df2, on = 'day', how = 'right')

In [None]:
df_out

In each of the above cases, we lost some information. If we wanted to preserve all of the days from either dataframe, we could do an `outer` join:

In [None]:
df_out = df1.merge(df2, on = 'day', how = 'outer')

In [None]:
df_out
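
Two optional `merge` arguments can make a merged dataframe easier to read: `suffixes` names the overlapping columns, and `indicator=True` adds a `_merge` column recording where each row came from. A sketch using the same values as above, under the new names `left` and `right` so the earlier variables are untouched:

```python
import pandas as pd

left = pd.DataFrame({"day": [1, 2, 3, 4, 5], "meas": [0.1, 0.2, 0.3, 0.4, 0.5]})
right = pd.DataFrame({"day": [1, 3, 5, 7], "meas": [1.1, 1.3, 1.5, 1.7]})

labeled = left.merge(
    right,
    on="day",
    how="outer",
    suffixes=("_df1", "_df2"),  # replaces the default _x / _y suffixes
    indicator=True,             # adds a _merge column: left_only, right_only, or both
)

print(labeled)
```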

### Saving DataFrames
Once we have a dataframe that we might want to use for future analysis, we usually want to save it somewhere. This can be done with the following statement.

In [None]:
# We don't usually want to save the dataframe index as a column; remove index=False if you do
df_out.to_csv("filepath.csv", index=False)
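
To see what `index=False` changes without touching the filesystem, the sketch below writes to an in-memory text buffer instead of a file. The use of `io.StringIO` and the small `original` frame are choices made for this illustration:

```python
import io
import pandas as pd

original = pd.DataFrame({"day": [1, 2, 3], "meas": [0.1, 0.2, 0.3]})

# Write without the index, then read the text straight back
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# With no path given, to_csv returns the CSV text; the index becomes an unnamed first column
with_index = original.to_csv()

print(restored.equals(original))
print(with_index.splitlines()[0])
```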

### Breakout: Merging DataFrames

# Parsing Data
In a perfect world, all the data we want to use is stored in a format that can be loaded into `pandas` with one simple line of code like `pd.read_csv("filename.csv")`. Unfortunately, loading data is usually more complicated than that. Datasets can have all sorts of issues: inconsistent labels, spacing, and separators are all common, and oftentimes you need to combine data from different formats. This section will cover some of the common "gotchas" with loading data from both local and web-based sources. 

## Text and CSVs
If you are reading text data (e.g., from a `.txt` file) or CSV data (e.g., from a `.csv` file), `pd.read_csv` is probably the function you want to use. Please spend 5 minutes [reading the function documentation page](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to gain an appreciation for how complicated and flexible this function is. Once you are done, I will highlight just a few of the common problems you might encounter while parsing files and the `read_csv` arguments you need to fix them.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("sample_dataset.txt")

In [None]:
df

### Misplaced or missing header
Looking at the data file, we see there is a title at the top and then some whitespace. We don't want any of that:

In [None]:
df = pd.read_csv("sample_dataset.txt", skiprows=2)

In [None]:
df

### Non-comma delimiter
It looks a bit better now because at least the column names are not in the DataFrame values. But the tab delimiter/separator (`\t`) is not being recognized as a delimiter. Let's fix that:

In [None]:
df = pd.read_csv("sample_dataset.txt", skiprows=2, sep="\t")

### Extra fields or general failure on one row
Whoops! The error message is telling us that we have an extra field in line 11 of the file. Let's just skip that bad line:

In [None]:
df = pd.read_csv("sample_dataset.txt", skiprows=2, sep="\t", on_bad_lines="skip")

In [None]:
df

### Inconsistent delimiter
One of the rows actually uses spaces as a delimiter rather than tabs, so with `sep="\t"` pandas finds no delimiter in that row. To handle this more general case, we use a more general separator, the [regex](https://en.wikipedia.org/wiki/Regular_expression) string `\s+`, which matches any run of whitespace:

In [None]:
df = pd.read_csv("sample_dataset.txt", skiprows=2, sep=r"\s+", on_bad_lines="skip")  # raw string avoids an invalid-escape warning

In [None]:
df

### Weird null values
One of the values is being read in as the string `NotANumber`, which is a non-standard way of representing a NaN value. If we want to encode this as a numpy `NaN` so that, for example, we can still treat that column as an array of floats, we need to tell pandas about this NaN indicator:

In [None]:
df = pd.read_csv("sample_dataset.txt", skiprows=2, sep=r"\s+", on_bad_lines="skip", na_values=["NotANumber"])

In [None]:
df
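
If you don't have `sample_dataset.txt` at hand, the same sequence of fixes can be tried on an in-memory stand-in. The text below is invented to reproduce the problems described above (a title line, a blank line, one row with an extra field, one space-delimited row, and a `NotANumber` value); it is not the actual contents of the file. Note that `on_bad_lines` assumes pandas 1.3 or newer.

```python
import io
import pandas as pd

# Hypothetical messy file contents, mimicking the issues discussed above
raw = (
    "Sample dataset title\n"
    "\n"
    "name\tvalue\tcount\n"
    "a\t1.5\t10\n"
    "b\tNotANumber\t20\n"
    "c\t3.5\t30\t99\n"  # one field too many: this row gets skipped
    "d 4.5 40\n"        # spaces instead of tabs
)

parsed = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,                 # drop the title line and the blank line
    sep=r"\s+",                 # any run of whitespace is a delimiter
    on_bad_lines="skip",        # silently drop rows with too many fields
    na_values=["NotANumber"],   # treat this string as NaN
)

print(parsed)
print(parsed.dtypes)
```

Because the `NotANumber` entry is converted to `NaN`, the `value` column can be stored as floats rather than strings.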

## Web Data and the JSON format
Not all data comes from `.csv` files. Most likely, much of the data you will end up working with as data scientists will come from the internet -- either the public internet, or via API calls to internal endpoints that the software engineers at your company set up. The easiest way to bring this data into Python is via the `requests` module, which allows you to load [json data](https://en.wikipedia.org/wiki/JSON) into a Python dictionary, which can then be used as-is or further converted into a DataFrame. 

The `requests` module allows us to send HTTP requests from Python. There are [different types of HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). Some allow you to `GET` data, others allow you to `POST` data. We'll just demonstrate the `GET` method by requesting data from a free Pokemon API.

In [76]:
import requests
pokemon_name = "pikachu"
url = f"https://pokeapi.co/api/v2/pokemon/{pokemon_name}"
response = requests.get(url)

What does the response look like?

In [77]:
response

<Response [200]>

This is just a response code (see [list here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)), indicating that we successfully received the data. But where is the actual data?

In [78]:
data = response.json()

In [79]:
data

{'abilities': [{'ability': {'name': 'static',
    'url': 'https://pokeapi.co/api/v2/ability/9/'},
   'is_hidden': False,
   'slot': 1},
  {'ability': {'name': 'lightning-rod',
    'url': 'https://pokeapi.co/api/v2/ability/31/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 112,
 'cries': {'latest': 'https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/latest/25.ogg',
  'legacy': 'https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/legacy/25.ogg'},
 'forms': [{'name': 'pikachu',
   'url': 'https://pokeapi.co/api/v2/pokemon-form/25/'}],
 'game_indices': [{'game_index': 84,
   'version': {'name': 'red', 'url': 'https://pokeapi.co/api/v2/version/1/'}},
  {'game_index': 84,
   'version': {'name': 'blue', 'url': 'https://pokeapi.co/api/v2/version/2/'}},
  {'game_index': 84,
   'version': {'name': 'yellow',
    'url': 'https://pokeapi.co/api/v2/version/3/'}},
  {'game_index': 25,
   'version': {'name': 'gold', 'url': 'https://pokeapi.co/api/v2/version

Lots of data stored as a python dictionary. We can take a look at the keys for a more compact view:

In [None]:
data.keys()

In [None]:
data["types"]

We can't turn the whole thing into a pandas dataframe without further manipulation, but we can convert some of the lists of dictionaries it contains into dataframes:

In [None]:
data["stats"]

In [None]:
df = pd.DataFrame(data=data["stats"])

In [None]:
df
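
When the dictionaries in such a list contain nested dictionaries, `pd.DataFrame` leaves the nested part as a single column of dicts. One option is `pd.json_normalize`, which flattens nested keys into dotted column names. The sketch below works offline on a small hand-written JSON string; its shape (entries with `base_stat`, `effort`, and a nested `stat`) is an assumption made for illustration, not output copied from the API:

```python
import json
import pandas as pd

# Hypothetical JSON text with one level of nesting inside each list entry
sample_text = """
{"name": "examplemon",
 "stats": [
   {"base_stat": 35, "effort": 0,
    "stat": {"name": "hp", "url": "https://example.com/stat/1/"}},
   {"base_stat": 55, "effort": 0,
    "stat": {"name": "attack", "url": "https://example.com/stat/2/"}}
 ]}
"""

# json.loads turns JSON text into Python dicts and lists
sample = json.loads(sample_text)

nested = pd.DataFrame(sample["stats"])    # the "stat" column holds dicts
flat = pd.json_normalize(sample["stats"])  # nested keys become "stat.name", "stat.url"

print(nested)
print(flat)
```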

# Pandas Alternatives
In this class we will use `pandas` exclusively, because it is the most widely used package for data manipulation. But there are alternatives that you should be aware of.

## R
`pandas` is basically Python's answer to the R `data.frame` type. If you don't like Python, or you work at a company that predominantly uses R, then you will find that it has all of the functionality of `pandas` and more. It is especially helpful if you are doing statistics-focused work, because it has built-in functionality that either doesn't exist in Python or requires tedious Python package management. On the downside, R is less of a general-purpose programming language than Python -- it would either be a pain or impossible to build your full software stack in R, because that is not what it's designed for. You probably don't want Python for your full stack either, but you could at least do it and it would be less painful.

## Polars
`polars` is a Python package that is becoming increasingly popular. It is similar to `pandas` in a lot of ways (note the bear theme in the name), but it boasts significant efficiency gains, especially when dealing with large amounts of data. You are welcome to explore it and even use it for assignments in this class, but we won't be learning any of the syntax. 

## Breakout: Web Requests
Find a free API and construct a `GET` request to it. What data does it return? 