# 4. Python data science modules

Now that we've got a grasp on the fundamentals of Python, we're ready to use it for data science.

Unlike some more math-oriented programming languages (like R or MATLAB), Python relies on external packages to provide most data science functionality.

The Python community has largely standardized on the following packages, all of which we'll be covering in this module, and will be using extensively for the rest of the course:

* pandas - working with tabular data
* numpy - math tools, plus working with 1D & 2D arrays
* matplotlib - low-level plotting
* seaborn - high-level plotting 


## Pandas

Pandas is a package for working with tabular data.

Tabular data is anything in a table form! 

Common analytical examples include spreadsheets, CSV files, and database tables.

Tabular data consists of rows and columns:

* Each row represents an item, and each column represents a common feature of all the items.
* Each row has the same columns as the other rows, in the same order.
* A single column holds data of the same type, but different columns can have different types. 
* The order of rows sometimes matters, while the order of columns doesn't matter.


Tabular data isn't just work spreadsheets either: for example, a music playlist is tabular data (for each song, we know the title, genre, etc)

![Music playlist UI](img/4-itunes-ui.png)

and your text message inbox is tabular data (for each conversation, we know the participants, unread status, date of most recent message, etc)


![Messages UI](img/4-messages-ui.png)


### Loading data


Pandas comes with functions for reading many different kinds of data. We'll cover all of these throughout the course, but here's a list of the main ones for later reference:

* CSVs ([read_csv docs](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), [py4wrds example](module-4-read-csv-example))
* SQL databases
* Excel files
* Google sheets
* Parquet files
* Any of the above, as a URL
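Most of these formats have a matching reader function, such as `pd.read_csv`, `pd.read_excel`, `pd.read_sql`, and `pd.read_parquet`. As a rough sketch that runs without any files on disk, the example below reads CSV text held in memory. The `stations_demo` name and the rows are invented for illustration only.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file or URL.
csv_text = """station_id,county,well_depth
1,Yolo,120.5
2,Kern,88.0
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
stations_demo = pd.read_csv(io.StringIO(csv_text))
print(stations_demo)
```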



To start off, we'll load this CSV file of ground water stations. 

<!-- TODO: after module deployment, replace r4wrds url with python one -->
<!-- TODO: add screenshot to get raw csv -->
(module-4-read-csv-example)=


In [None]:
import pandas as pd

# url = "https://github.com/r4wrds/r4wrds/blob/main/intro/data/gwl/stations.csv?raw=true"
url = "data/gwl/stations.csv"
df = pd.read_csv(url)

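Here we load the file from a local path, but `read_csv` also accepts a URL (like the commented-out GitHub link above), so data can be loaded directly from the web. This won't work for private files: you'll have to download the file and use its path instead of the URL.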

### DataFrame inspection

The code above (`df = pd.read_csv(url)`) has loaded our tabular data into an object called a DataFrame:



In [None]:
type(df)

The DataFrame is one of the two core classes that pandas gives us (the other is Series which represents a column).
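As a minimal sketch of how these two classes relate, the snippet below builds a tiny DataFrame by hand and pulls one column out as a Series. The `playlist` name and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each dict key becomes a column, each list position a row.
playlist = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C"],
        "genre": ["rock", "jazz", "rock"],
        "length_s": [215, 187, 242],
    }
)

print(type(playlist))           # the whole table is a DataFrame
print(type(playlist["genre"]))  # a single column is a Series
print(playlist.shape)           # (rows, columns)
```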

DataFrame has a number of attributes (variables) and methods (functions) for **inspecting** our data, which is the first step in any analysis!

`head()` shows us the first few rows of data

In [None]:
df.head()

Because we've got lots of columns, `.columns` shows all the names on one screen

In [None]:
df.columns

How much data are we working with? `.shape` gives us both the row and column count, which is always in row, col order

In [None]:
print(df.shape)

but it's clearer to take the length directly:

In [None]:
print("n_rows: {}".format(len(df)))
print("n_cols: {}".format(len(df.columns)))

`dtypes` shows the data type of each column.

This is important to check! Pandas makes some guesses about what data type to use, and often gets things wrong. Common pitfalls to be wary of include

* Dates might be loaded as strings instead of rich datetime objects.
* Numerical columns like `$145` or `78%` might be loaded as strings instead of as numbers
* A single row with a typo (`32.111!`) or a non-numeric placeholder (using `Unknown` instead of `NaN`) will turn an otherwise numeric column into a string type.
* Pandas defaults to using 64 bit integers and floats. If your dataset is maxing out your memory, you can specify 32 bit (or smaller) dtypes to reduce the size once loaded.
* ZIP codes should be parsed as strings not integers, to avoid stripping ZIPs that begin with zero.


In [None]:
print(df.dtypes)

We can see pandas has done a pretty good job here! (the `object` type is what pandas uses to represent strings).

Fixing some dtype issues can involve more complex analysis. But for simple cases, we can simply tell pandas what to do when loading the data:

In [None]:
df = pd.read_csv(url, dtype={"ZIP_CODE": str})
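To make a few of the pitfalls above concrete, here is a small sketch on invented data (the `site`, `zip`, `depth`, and `cost` columns are hypothetical, not part of the stations file). It shows one possible way to handle a leading-zero ZIP code, an `Unknown` placeholder, and a currency column.

```python
import io

import pandas as pd

csv_text = """site,zip,depth,cost
A,01234,120.5,$145
B,94607,Unknown,$80
C,02139,33.0,$12
"""

# Default loading: zip becomes an integer (leading zero lost),
# and depth becomes a string column because of the "Unknown" placeholder.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Tell pandas how to read the problem columns.
fixed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"zip": str},      # keep ZIP codes as text
    na_values=["Unknown"],   # treat the placeholder as missing
)

# Strip the currency symbol, then convert to a number.
fixed["cost"] = fixed["cost"].str.replace("$", "", regex=False).astype(float)
print(fixed.dtypes)
```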

`describe()` gives a summary of our numerical columns. With the reloaded dataframe, `ZIP_CODE` is no longer considered numeric!

In [None]:
df.describe()

Finally, we're not always interested in the whole dataset for every analysis. You can load a subset of the columns to speed up loading, reduce memory pressure, and to just keep your workspace tidier:

In [None]:
# Load only the station ID and well depth columns.
df_depth = pd.read_csv(url, usecols=["STN_ID", "WELL_DEPTH"])

df_depth.head()

### Columns

A column of a dataframe is a Series object. You can access a column by using its name in `[]` brackets, just like a dictionary:


In [None]:
type(df["BASIN_NAME"])

In [None]:
df["COUNTY_NAME"]

A Series is conceptually very similar to a Python list, except all the values in a Series are the same data type. We can convert a Series to a list with `to_list()`

In [None]:
df["COUNTY_NAME"].to_list()[:5]

Pandas columns come with a huge range of statistical methods.


There are methods for descriptive statistics like `min`, `max`, `mean`, `mode`, `median`, `quantile`, and `sum`.

In [None]:
print(df["WELL_DEPTH"].quantile(0.95))
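A quick sketch on a hand-made Series (the `depths` values are invented) shows how several of these methods behave on numbers that are easy to check by eye:

```python
import pandas as pd

depths = pd.Series([10, 20, 30, 40, 50])

print(depths.min(), depths.max())  # smallest and largest values
print(depths.mean())               # arithmetic mean
print(depths.median())             # middle value
print(depths.quantile(0.5))        # the 0.5 quantile is the median
print(depths.sum())                # total
```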

There are methods for working with unique values. If every value in a column is the same, that could signify a data issue, or perhaps mean we don't need to load that column.

In [None]:
assert df["STN_ID"].is_unique, "There should be no duplicated IDs"
assert df["COUNTY_NAME"].nunique() > 1,  "Ensure we're not using a single-county subset"

The `value_counts` method is great for summarizing non-numerical columns, showing the count of each unique value.

By default, `NaN` values aren't included, but for data exploration it's really important to know where our NaNs are so we add `dropna=False`!

In [None]:
df["WELL_USE"].value_counts(dropna=False)
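The effect of `dropna=False` is easiest to see on a small example. In this sketch the `uses` Series is made up, with one missing value:

```python
import pandas as pd

uses = pd.Series(["Irrigation", "Observation", None, "Irrigation"])

# Default: the missing value is silently left out of the counts.
counts_default = uses.value_counts()
print(counts_default)

# With dropna=False the missing value gets its own row.
counts_with_nan = uses.value_counts(dropna=False)
print(counts_with_nan)
```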

Pandas' method library is large and still growing, so no one can keep track of every method! It's common during pandas development to ask Google or ChatGPT questions like "how to round numbers in a pandas series".

The pandas [official documentation](https://pandas.pydata.org/docs/reference/series.html) is another great resource: each method has multiple examples and detailed descriptions of the parameters and statistical algorithms. 

In [None]:
df["LATITUDE"].round(decimals=2)

As well as methods, Series also support arithmetic (unlike a Python list).

For example, we can convert well depth to meters and save it as a new column on our data frame: 

In [None]:
df["WELL_DEPTH_METERS"] = df["WELL_DEPTH"] * 0.3048
df.tail()
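The difference from a plain list is worth seeing once. In this sketch (values invented), multiplying a list repeats it, while multiplying a Series scales every element:

```python
import pandas as pd

depths_ft_list = [100.0, 250.0]
depths_ft = pd.Series(depths_ft_list)

# A Python list is repeated, not scaled.
repeated = depths_ft_list * 2
print(repeated)

# A Series applies the arithmetic to each element.
depths_m = depths_ft * 0.3048
print(depths_m)
```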

### Filtering and slicing

We've already seen the `head()` method, which shows the first n (5 by default) rows.

A similar function is the `sample()` method, which shows n random rows. This can give a better sense of the data, in case the first few rows aren't representative of the rest.

In [None]:
df.sample(n=7)

Technically, we aren't just printing some rows of our dataset: we're creating a new DataFrame with rows sliced from the old one, then printing that new frame. `iloc` is similar to `head`, but lets you specify the row range to slice:

In [None]:
# The slice excludes its end position, so 10:22 selects 12 rows.
df_dozen_rows = df.iloc[10:22]
len(df_dozen_rows)
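`iloc` slices follow the same half-open convention as Python list slices: the start position is included and the end position is not. A small sketch on a made-up frame (`toy` is a hypothetical name):

```python
import pandas as pd

toy = pd.DataFrame({"x": range(100)})

# Positions 10 up to, but not including, 22.
subset = toy.iloc[10:22]
print(len(subset))
print(subset["x"].iloc[0], subset["x"].iloc[-1])
```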

There are some other DataFrame methods that return a new dataframe with a subset of rows. `drop_duplicates()` returns a DataFrame with repeated rows removed. `dropna` returns a dataframe with only rows that don't have any NaN values:

In [None]:
df_unique = df.drop_duplicates()
df_clean = df_unique.dropna()
len(df_clean) 

Because these are both DataFrame methods that return another DataFrame, we can **chain** them together to save space

In [None]:
df_clean = df.drop_duplicates().dropna()
len(df_clean) 

Most pandas methods return a new DataFrame rather than modifying the original one. We can see that our original still has the same number of rows:

In [None]:
len(df)
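The same behaviour can be checked on a tiny hand-made frame, where the duplicate and the missing value are easy to spot. The `readings` data here is invented for illustration:

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame(
    {
        "station": [1, 1, 2, 3],
        "depth": [10.0, 10.0, np.nan, 30.0],
    }
)

deduplicated = readings.drop_duplicates()  # drops the repeated (1, 10.0) row
cleaned = deduplicated.dropna()            # drops the row with the missing depth

print(len(readings), len(deduplicated), len(cleaned))
```

The original `readings` frame keeps all of its rows, because each method returned a new DataFrame.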

We can also define our own filters!

In [None]:
df_USGS = df[df.WCR_NO == "USGS"]
df_USGS.head()
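Under the hood, the comparison inside the brackets produces a Series of `True`/`False` values (often called a mask), and the DataFrame keeps only the rows where the mask is `True`. A sketch on invented data (`wells` is a hypothetical frame):

```python
import pandas as pd

wells = pd.DataFrame(
    {
        "county": ["Yolo", "Kern", "Yolo", "Butte"],
        "depth": [120.0, 88.0, 45.0, 300.0],
    }
)

# The comparison yields one boolean per row.
mask = wells["depth"] > 100
print(mask.tolist())

# Indexing with the mask keeps only the True rows.
deep = wells[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
deep_yolo = wells[(wells["county"] == "Yolo") & (wells["depth"] > 100)]
print(deep_yolo)
```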

### Slicing by column

By passing a list of column names, we can slice all rows for only the specified columns

In [None]:
df[["BASIN_NAME", "COUNTY_NAME"]]

We can also delete columns using the `del` keyword

In [None]:
del df["ZIP_CODE"]

print("ZIP_CODE" in df)
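Note that `del` changes the DataFrame in place. If you would rather keep the original intact, one alternative is the `drop` method, which returns a new frame. A sketch on made-up data (`frame` is a hypothetical name):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new DataFrame; the original still has column "b".
without_b = frame.drop(columns=["b"])
print("b" in without_b, "b" in frame)

# del removes the column from the original frame itself.
del frame["c"]
print(list(frame.columns))
```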

### String columns

As well as numerical methods, the pandas Series class has methods for working with strings.

We'll demo this with a dataset that has a few more strings: [CIWQS NPDES Permits](https://ciwqs.waterboards.ca.gov/ciwqs/readOnly/NpdesReportServlet).

In [None]:
df_npdes = pd.read_excel("./data/npdes_data.xlsx", nrows=1000, dtype={"ZIP CODE": str})
df_npdes.head()

Say we want to pull out all the permits related to AT&T.

The problem is that there's inconsistent naming of the facilities (this is often the case with user-entered data)

In [None]:
df_npdes.iloc[27:38]["FACILIITY NAME"]

To address this, let's tidy up the name field. We'll do this in a new column so we don't lose our original data.


In [None]:
df_npdes["tidy_name"] = df_npdes["FACILIITY NAME"].copy()

Most of Python's built-in string methods have equivalent pandas Series methods under the `.str` prefix. The pandas versions apply to a whole column in one call and pass missing values through, and for advanced users, many can be used with [regular expressions](https://en.wikipedia.org/wiki/Regular_expression).


In [None]:

# Replace numeric NaN values with empty strings.
df_npdes["tidy_name"] = df_npdes["tidy_name"].fillna("")

# Remove leading/trailing whitespace.
df_npdes["tidy_name"] = df_npdes["tidy_name"].str.strip()

# Convert to uppercase.
df_npdes["tidy_name"] = df_npdes["tidy_name"].str.upper()  

# Fix spacing.
df_npdes["tidy_name"] = df_npdes["tidy_name"].str.replace("AT & T", "AT&T")

The new `tidy_name` column can now be used for filtering:

In [None]:
df_att = df_npdes[df_npdes["tidy_name"].str.contains("AT&T")]

df_att[["FACILIITY NAME", "tidy_name"]]
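Because each `.str` method returns a new Series, the tidy-up steps can also be chained. The sketch below repeats the same steps on a few invented names, so the result can be checked by eye. The `names` values are hypothetical and do not come from the NPDES file.

```python
import pandas as pd

names = pd.Series(["  at & t corp", "AT&T Mobility ", None, "Chevron"])

tidy = (
    names.fillna("")                               # missing values -> empty string
    .str.strip()                                   # trim surrounding whitespace
    .str.upper()                                   # normalize case
    .str.replace("AT & T", "AT&T", regex=False)    # fix spacing, literal match
)
print(tidy.tolist())

# regex=False treats the search text literally.
is_att = tidy.str.contains("AT&T", regex=False)
print(is_att.tolist())
```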

We can also normalize to 5-digit zip codes by splitting on the dash, then taking the first group using the `str[]` indexing tool pandas provides:

In [None]:
df_npdes['tidy_zip_code'] = df_npdes["ZIP CODE"].str.split('-').str[0]

df_npdes[["ZIP CODE", "tidy_zip_code"]].head()
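To see what the split and `str[0]` steps each do, here is a sketch on a few invented ZIP strings (the `zips` values are hypothetical), including one missing value:

```python
import pandas as pd

zips = pd.Series(["94607-1234", "01234", None])

# Each string becomes a list of parts; missing values stay missing.
parts = zips.str.split("-")
print(parts.tolist())

# .str[0] takes the first element of each list.
zip5 = parts.str[0]
print(zip5.tolist())
```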

The pandas documentation has a [Working with text data](https://pandas.pydata.org/docs/user_guide/text.html) guide that goes into more details about regular expressions as well as splitting/joining strings, and has a list of all the string methods.

In [None]:
df_npdes.dtypes

### Datetime columns

Just like pandas groups string functions with a `.str` prefix, there is also a `.dt` prefix that contains functions for working with dates, times, and datetimes (timestamps).

Let's have a look at some of our date columns:


In [None]:
date_cols = ["ADOPTION DATE", "EFFECTIVE DATE", "EXPIRATION DATE"]
df_npdes[date_cols].head()

In [None]:
df_npdes[date_cols].dtypes

Our dates were just loaded as strings: we also have a numeric `NaN` mixed in there.

To fix this we're going to have to go back to the data loading. In this case it's enough to tell pandas which columns to treat as dates with the `parse_dates` argument.

(For more complex cases, you can specify a `date_format` argument, or use the `pd.to_datetime` function).
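As a rough sketch of the `pd.to_datetime` route, the example below converts a made-up Series of strings after loading. The `raw_dates` values and the `%Y-%m-%d` format are assumptions for illustration; the real file may use a different layout.

```python
import pandas as pd

raw_dates = pd.Series(["2021-03-15", "not a date", None])

# errors="coerce" turns anything unparseable into NaT (the datetime NaN)
# instead of raising an exception.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")
print(parsed)
```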




In [None]:
df_npdes = pd.read_excel("./data/npdes_data.xlsx", nrows=1000, parse_dates=date_cols) 
df_npdes[date_cols].head()

In [None]:
df_npdes[date_cols].dtypes

Now that our dates have the correct type, we can use pandas date/time functionality!

The `.dt` prefix has functions for accessing different parts of the timestamp

In [None]:
# English day of week name. Then replace any None or NaNs with an empty string.
df_npdes["ADOPTION DATE"].dt.day_name().fillna("")

as well as functions for manipulating timestamps

In [None]:
# Round to the nearest hour (looks like our data is already rounded!).
df_npdes["ADOPTION DATE"].dt.round("h")
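A small sketch on two hand-picked dates (not from the NPDES data) shows a few of the parts that `.dt` exposes:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-03-01", "2024-12-25"]))

years = dates.dt.year             # calendar year of each date
months = dates.dt.month           # month number, 1 to 12
weekdays = dates.dt.day_name()    # English weekday name

print(years.tolist(), months.tolist(), weekdays.tolist())
```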

In addition to timestamps, pandas also has the concept of differences between two timestamps.

A `Timedelta` is a fixed difference:

In [None]:
# Shift dates 7 days forward into the future.
df_npdes["ADOPTION DATE"] + pd.Timedelta(days=7)

while an `offset` can vary in length depending on context.

In [None]:
# 10 working days later.
df_npdes["ADOPTION DATE"] + pd.offsets.BusinessDay(n=10)
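The distinction is easier to see on a single date. In this sketch the `adopted` date is an arbitrary Friday chosen for illustration: seven days is always seven days, but ten business days spans two weekends, so it covers fourteen calendar days.

```python
import pandas as pd

adopted = pd.Timestamp("2024-03-01")  # a Friday
print(adopted.day_name())

# Fixed difference: always exactly 7 days.
one_week_later = adopted + pd.Timedelta(days=7)

# Context-dependent difference: weekends are skipped.
ten_business_days_later = adopted + pd.offsets.BusinessDay(n=10)

# Subtracting two timestamps gives a Timedelta.
gap = ten_business_days_later - adopted
print(one_week_later.date(), ten_business_days_later.date(), gap)
```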






## Numpy

Numpy is a Python package for representing **array data**, and comes with a large library of tools and mathematical functions that operate efficiently on arrays.

If you're familiar with more math-oriented programming languages like R or MATLAB, numpy brings much of the built-in math and data functionality from those languages into Python.

Numpy is one of the most popular Python packages for data science, and one of the [most-downloaded](https://pypistats.org/top) Python packages overall. It's so useful and reliable that much of the mathematical functionality of the other packages covered in this module (pandas, seaborn, matplotlib) is provided by numpy under the hood.

Because numpy is used a lot, it's convention to import it with the `np` abbreviation:

In [None]:
import numpy as np


### Why numpy?

A numpy array is similar to a Python list: they can both serve as containers for numbers.

In [None]:
python_list = [0, 2, 4, 6]
print(python_list)

In [None]:
numpy_array = np.array([0, 2, 4, 6])
print(numpy_array)

So why use numpy instead of lists?

* Speed
    * Although numpy is a Python package, most of the functionality is written in fast C or Fortran code.
* Memory efficient
    * Numpy uses less memory to store numbers than a Python list does, so you can work on larger datasets.
* Functionality
    * Numpy comes with a huge range of modules with fast and thoroughly validated algorithms, from interpolation to Fourier transforms.
* Manipulation syntax
    * Numpy's syntax makes it clear and easy to perform common array operations, like slicing, filtering, and summarization.


But there are some use cases where lists make more sense:

* Storing different kinds of data together
    * Numpy arrays are homogeneous: all the elements must be the same type.
* Working with non-numerical data
    * Some numpy functionality works with strings and other types, but performance can suffer
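A short sketch of two of the points above, using made-up values: arithmetic on an array applies to every element without a loop, and mixing types in one array forces a single common type.

```python
import numpy as np

values = [0, 2, 4, 6]

# With a list, elementwise math needs a loop or comprehension.
doubled_list = [v * 2 for v in values]

# With an array, the arithmetic applies to every element at once.
doubled_array = np.array(values) * 2
print(doubled_list, doubled_array)

# Arrays are homogeneous: the integer 1 is stored as a float here.
mixed = np.array([1, 2.5])
print(mixed.dtype)
```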




### Array slicing and indexing


One way to create an array is from a Python sequence like a list


In [None]:
days_per_month = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
days_per_month

Like the original Python list, arrays can be sliced

In [None]:
days_per_month[0:3]

and individual elements can be indexed out

In [None]:
print(days_per_month[1])

Two-dimensional (and higher dimensional, there's no limit in numpy!) arrays can be created from nested Python sequences:



In [None]:
array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
array_2d

You index a 2D array using the same notation: first your row slicing, then a comma `,`, then your column slicing.

For example the first element of the second row:

In [None]:
print(array_2d[1, 0])

or the top three values of the last column

In [None]:
print(array_2d[0:3, -1])

Note how our column has lost its "verticalness": once we've sliced it out, it's just a regular 1D array.
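If you do want to keep the column shape, one option is to slice the column axis instead of indexing it with a single number. A sketch, rebuilding the same 3 x 4 array under the hypothetical name `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# A single index on the column axis drops that dimension.
last_col_1d = grid[:, -1]
print(last_col_1d.shape)

# A slice on the column axis keeps it, giving a 3 x 1 array.
last_col_2d = grid[:, -1:]
print(last_col_2d.shape)
```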

### Creating arrays

As well as converting Python lists to arrays, numpy can create its own arrays!

You can create an array that's filled with zeros

In [None]:
np.zeros(5)

or ones (remember numpy uses the row, col ordering)

In [None]:
np.ones((2, 5))

Numpy has its own version of the Python `range` function:

In [None]:
np.arange(2, 9, 2)

and a related `linspace` function to create an array with elements evenly spaced

In [None]:
np.linspace(0, 10, num=5)
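The two functions differ in what you specify and in how they treat the end value. `arange` takes a step size and stops before the end value, while `linspace` takes the number of points and includes the end value by default. A short sketch:

```python
import numpy as np

# Step of 2, starting at 2, stopping before 9.
stepped = np.arange(2, 9, 2)
print(stepped)

# Five evenly spaced points from 0 to 10, both ends included.
spaced = np.linspace(0, 10, num=5)
print(spaced)

# The same spacing with arange leaves out the end value.
stepped_float = np.arange(0, 10, 2.5)
print(stepped_float)
```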

### Array attributes

The shape attribute gives the rows and cols (in that order!) of an array

In [None]:
array_2d.shape
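Arrays have a few other attributes that are handy alongside `shape`. This sketch rebuilds the same 3 x 4 array under the hypothetical name `grid` so it can run on its own:

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(grid.shape)  # (rows, cols)
print(grid.ndim)   # number of dimensions
print(grid.size)   # total number of elements
print(grid.dtype)  # data type shared by every element
```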






1. Standard library  
2. Numpy  
   1. Why numpy?   
   2. Creating Arrays  
   3. Array Dimensions  
   4. Array Operations   
   5. Slicing, Indexing, and Broadcasting  
   6. Dot product, cross product, matrix multiplication  
   7. Exporting and loading arrays   
3. Pandas  
   1. Dataframes  
   2. DataFrame structure  
      1. Columns  
      2. Index, datetime index, datetime module   
   3. Loading dataframes from .csv and .xls files   
      1. Dealing with messy data  
      2. Example dataset  
      3. Cleaning real messy data  
   4. Selecting columns  
   5. Filtering by conditionals  
   6. Helpful dataframe functions   
      1. Convert a dict to a dataframe   
   7. Advanced dataframe topics  
      1. Multiindex   
      2. .apply   
      3. .groupby   
4. Matplotlib  
   1. Line plot  
   2. Scatterplot  
   3. Plotting 2d arrays with imshow  
   4. Formatting plots   
      1. Title  
      2. Axis labels  
      3. Legend  
5. Seaborn  
   1. Relplot  
   2. Distplot   
   3. Catplot   
6. Practical Example \- scatterplot and linear regression   
   1. Load two datasets  
   2. Create pandas dataframe with each as a column  
   3. Do a linear regression between two columns  
   4. Plot scatterplot and linear regression using matplotlib  
   5. Add axis labels, legend, title, regression equation   
   6. Save plot to .png 
