# Appendix A


## Regex

[Regular expressions](https://docs.python.org/3/library/re.html), also known as regex or regexp, are a powerful tool for manipulating text. They are widely used in various programming languages, including Python. Regular expressions match patterns in strings, allowing for advanced data extraction, data manipulation, and data validation.

### Pros of Using Regular Expressions:

**Pattern Matching:** Regular expressions let you select text by pattern rather than by exact match. For instance, you can pick out email addresses or telephone numbers from a document.

**Compactness:** Regular expressions can handle a complex text-manipulation task in a single line of code.

**Flexibility:** With regular expressions, you can handle many different text scenarios due to a wide range of special characters and operators.

**Powerful Tools:** Many programming languages, text editors, and database systems support regular expressions. They are indispensable for advanced text processing tasks.

Here's an example of a regular expression in Python that matches an email address:

In [1]:
import re
pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
email = "example@example.com"
match = re.fullmatch(pattern, email)  # the whole string must match, not just its start
if match:
    print("Valid email!")
else:
    print("Invalid email!")

Valid email!


### Cons of Using Regular Expressions:

**Complexity:** Regular expressions can become extremely complex and hard to read, especially for those who are new to programming or unfamiliar with regex syntax.

**Difficulty in Debugging:** If a regular expression has a small mistake, it could lead to major problems, and these problems can be hard to debug due to the cryptic nature of regular expression syntax.

**Performance:** Regular expressions can be slower than other methods of string manipulation, especially for more complex expressions or large pieces of text.
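
One way to soften the complexity and debugging concerns is Python's `re.VERBOSE` flag, which lets a pattern span several lines and carry comments. The sketch below rewrites the email pattern from the earlier example in that style; the group names `user` and `domain` are illustrative choices, not part of the original pattern.

```python
import re

# Same email pattern as above, written with re.VERBOSE so each part
# can carry a comment. Whitespace inside the pattern is ignored.
EMAIL_PATTERN = re.compile(
    r"""
    (?P<user>[a-zA-Z0-9_.+-]+)     # local part before the @
    @
    (?P<domain>[a-zA-Z0-9-]+       # first label of the domain
        \.[a-zA-Z0-9-.]+)          # a dot followed by the rest
    """,
    re.VERBOSE,
)

match = EMAIL_PATTERN.fullmatch("example@example.com")
if match:
    print(match.group("user"), match.group("domain"))
```

Named groups also make the later extraction code easier to read than positional group numbers.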

### Pandas Regex

Pandas is a popular data manipulation library for Python. It provides vectorized string functions, available through the `.str` accessor, that operate on an entire Series at once. These functions accept regular expressions and also handle missing values (NaN).

To get started with regular expressions in pandas, you need to import the pandas library first:

In [2]:
import pandas as pd

A common use case of regular expressions in pandas is when you want to filter data in your dataframe. For example, let's suppose we have a DataFrame with some email data:

In [3]:
data = {'name': ['John', 'Anna', 'Peter', 'Linda'],
        'email': ['john@abc.com', 'anna@xyz.com', 'peter@123', 'linda@lmnop.com']}
df = pd.DataFrame(data)
df

Unnamed: 0,name,email
0,John,john@abc.com
1,Anna,anna@xyz.com
2,Peter,peter@123
3,Linda,linda@lmnop.com


If you want to keep only the rows whose email column contains a valid email address, you can use the .str.contains method with a regular expression pattern:

In [4]:
pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
valid_emails_df = df[df['email'].str.contains(pattern)]
valid_emails_df

Unnamed: 0,name,email
0,John,john@abc.com
1,Anna,anna@xyz.com
3,Linda,linda@lmnop.com
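
Two details are worth a quick sketch here. `.str.contains` succeeds if the pattern matches anywhere in the string, and missing values produce a missing result unless `na=` is given. The example below uses made-up data (not the DataFrame above) and `.str.fullmatch`, which is available in pandas 1.1 and later, to require that the whole string match.

```python
import pandas as pd

pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

# Hypothetical data: one clean address, one with trailing junk, one missing.
emails = pd.Series(["john@abc.com", "anna@xyz.com ???", None])

# contains() searches anywhere in the string; na=False treats missing as "no match".
found_anywhere = emails.str.contains(pattern, na=False)

# fullmatch() requires the entire string to match the pattern.
whole_string = emails.str.fullmatch(pattern, na=False)

print(found_anywhere.tolist())
print(whole_string.tolist())
```

Which of the two is appropriate depends on whether the column is expected to hold only an address or free text that merely includes one.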


Another useful method is .str.extract, which extracts a part of the string that matches the regular expression pattern. For example, if you want to extract the domain from the email addresses, you can do:

In [5]:
df['domain'] = df['email'].str.extract(r'@([a-zA-Z0-9-.]+)')
df

Unnamed: 0,name,email,domain
0,John,john@abc.com,abc.com
1,Anna,anna@xyz.com,xyz.com
2,Peter,peter@123,123
3,Linda,linda@lmnop.com,lmnop.com


The .str.replace method replaces the parts of a string that match a regular expression with another string. For instance, to change the domain of every email to "newdomain.com", you can do:

In [6]:
df['email'] = df['email'].str.replace(r'@[a-zA-Z0-9-.]+', '@newdomain.com', regex=True)
df

Unnamed: 0,name,email,domain
0,John,john@newdomain.com,abc.com
1,Anna,anna@newdomain.com,xyz.com
2,Peter,peter@newdomain.com,123
3,Linda,linda@newdomain.com,lmnop.com


Above, we used the column-level (Series) replace function. Note that a similar replacement can be applied to an entire dataframe with DataFrame.replace and regex=True.
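
As a rough illustration of the dataframe-wide form, the sketch below uses `DataFrame.replace` with `regex=True` on a small made-up frame; the frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with email addresses in more than one column.
contacts = pd.DataFrame({
    "primary": ["john@abc.com", "anna@xyz.com"],
    "backup": ["john@home.net", "anna@work.org"],
})

# regex=True makes replace() treat the first argument as a pattern
# and apply it to every string cell in the frame.
updated = contacts.replace(r"@[a-zA-Z0-9-.]+", "@newdomain.com", regex=True)
print(updated)
```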

Keep in mind that while these vectorized functions greatly simplify data preprocessing, they can be a little difficult to debug because of the complexity of regular expression syntax. It's therefore important to test your regular expressions thoroughly to ensure they match the patterns you need.

Also note that while regex syntax is broadly similar between implementations, it is not identical. Perl regex and Python regex differ slightly in what they can do and in how patterns are written. Mixing syntaxes can be a pain to debug, so be careful.
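
A small, concrete instance of such a difference: Python's `re` module spells named groups `(?P<name>...)`, whereas the `(?<name>...)` spelling common in other engines is rejected by `re`. The sketch below only demonstrates the Python side, and the group names are made up.

```python
import re

# Python spelling of a named group.
python_style = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
year = python_style.match("2024-05").group("year")
print(year)

# The (?<name>...) spelling used by some other engines is not accepted by re.
rejected = False
try:
    re.compile(r"(?<year>\d{4})")
except re.error as exc:
    rejected = True
    print("re rejected the pattern:", exc)
```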

## Function Definitions

In Python, a function is a block of reusable code that is used to perform a specific related group of actions. Functions help us break our program into smaller and modular blocks for better organization and code reusability.

To define a function, Python provides the def keyword. Here's the syntax for defining a function in Python:

```python
def function_name(parameters):
    """docstring"""
    statement(s)
```

* **function_name:** The name of the function. It should be unique and describe what the function does.
* **parameters:** Optional values that the function uses to perform its operations. If a function has parameters, you can pass different arguments each time you call it.
* **docstring:** Optional text that describes what the function does.
* **statement(s):** The code that the function executes when it's called.

For example:

In [7]:
def greet(name):
    """This function greets the person passed in as argument"""
    print("Hello, " + name + ". Good morning!")

Global variables in Python are those variables that are declared outside of a function. They can be accessed by any function in the program, making them useful for storing values that multiple functions might need to access.

Here's a simple example of a global variable:

In [8]:
x = 10  # This is a global variable

def print_global():
    print(x)  # We can access the global variable x inside this function

print_global()  # Prints: 10

10


While global variables can be very useful, they should be used sparingly because they can make it harder to understand how your program works. If you have a lot of functions that change the global state, it can be hard to keep track of all the changes.

If you really need to use a global variable inside of a function, you can do so with the global keyword:

In [9]:
x = 10

def modify_global():
    global x  # This tells Python we're going to work with the global x
    x = 20  # This changes the global x, not a local one

modify_global()
print(x)  # Prints: 20, because we changed the global x inside the function

20


Remember that using global variables can make your code harder to understand and maintain, so use them sparingly and only when necessary. It's usually better to pass values into a function as parameters and get results back as a return value.
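
The parameter-and-return-value style recommended here could look like the sketch below, which reworks the `modify_global` example. The function name `doubled` is made up for illustration.

```python
def doubled(value):
    """Return twice the given value without touching any global state."""
    return value * 2

x = 10
x = doubled(x)  # the caller decides what to do with the result
print(x)
```

Because the function reads only its argument, its result cannot change when some unrelated cell reassigns a global.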

Functions must be defined before they are called. So, keeping with our linear flow idea of how to organize a notebook, you will generally want to define your functions in cells that come before the cells that use them.
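
A minimal sketch of what happens when the order is wrong: calling a name before its `def` has run raises a `NameError`. The function `say_hello` is a made-up example.

```python
too_early = False
try:
    say_hello()  # the def below has not run yet
except NameError as exc:
    too_early = True
    print("Too early:", exc)

def say_hello():
    return "hello"

print(say_hello())  # works now that the def has run
```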

In practice, I tend to just write code in cells until I find that I need to use the code more than once, and then I define a function, extract and generalize the code into the new function, only then deleting the original cells.  The cells are a perfect place to experiment with the code and its behavior.

As mentioned above, I try to avoid global variables and make sure to pass all the needed data into my functions. When migrating code from a cell to a function, the function body often still refers to names that were global variables in the original cell. It is very easy to miss one of those variables and not turn it into a function parameter. The code will still work, because the global variable it references still exists. I don't have any tricks for avoiding this, except to suggest being careful; these kinds of errors are hard to detect.

A very contrived example: 

In [10]:
my_list = [8,2,3,4,5]
lngth = 2

# original cell contents
my_list2 = sorted(my_list[:lngth])
my_list2
           

[2, 8]

When I try to turn that into a function, I might do the following:

In [11]:
my_list = [8,2,3,4,5]
lngth = 2

def get_first_N_sorted(thelist):
    my_list2 = sorted(thelist[:lngth])
    return my_list2

get_first_N_sorted(my_list)

[2, 8]

Note how I forgot to pass in the `lngth` parameter.  My function above will be using the global lngth, and should I change it, my function will begin to return different elements than before.  This can take a while to find in the real world.

In [12]:
lngth=3
get_first_N_sorted(my_list)

[2, 3, 8]

What we should have written was:

In [13]:
def get_first_N_sorted(thelist, thelen):
    return sorted(thelist[:thelen])

my_list = [8,2,3,4,5]

get_first_N_sorted(my_list, 2)

[2, 8]
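
One possible after-the-fact check, offered as an assumption rather than a technique from the text above, is to ask Python which global names a function refers to. The sketch uses the standard library's `inspect.getclosurevars` on the buggy version of the function. It is a rough aid: it only reports names that currently exist as globals, and it is no substitute for care.

```python
import inspect

lngth = 2

def get_first_N_sorted(thelist):
    # Bug kept on purpose: lngth is read from the enclosing module.
    return sorted(thelist[:lngth])

# getclosurevars reports which global names the function body refers to.
leaked = inspect.getclosurevars(get_first_N_sorted).globals
print(leaked)
```

An empty result for a function that is supposed to be self-contained is a reasonable sign that nothing was missed.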

## Type Conversion

Pandas provides data structures such as Series and DataFrame. These data structures work with a variety of data types. Pandas automatically assigns data types when reading in data or converting data. However, sometimes you may need to explicitly convert data types. This process is known as Type Conversion.

Pandas supports several data types. The most commonly used are:

* **object:** Used for string or mixed data types.
* **int64:** 64-bit integer data type.
* **float64:** Floating point number.
* **bool:** Used for Boolean values True/False.
* **datetime64:** Used for date and time, without timezone.
* **timedelta64:** Used for representing differences in dates and times (seconds, minutes, etc.).
* **category:** Used for categorical variables.

### Data Type Conversion

You can change the type of a specific column using the astype() function. For example:

```python
df['column_name'] = df['column_name'].astype('data_type')
```

Here 'column_name' is the column you intend to convert and 'data_type' is the desired data type.

Sometimes conversion fails because the column contains inappropriate data (e.g., converting a column of strings to integers when some strings don't represent integers). In this case, a ValueError is raised.

```python
# Convert object type to integer.
# If the Age column contained words such as 'One', 'Two', 'Three',
# this call would raise a ValueError.
df['Age'] = df['Age'].astype('int')
```
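
When some values cannot be converted, one common alternative to letting `astype` fail is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into missing values. The sketch below uses a made-up `Age` column to show both behaviors.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a word.
ages = pd.Series(["31", "45", "Two"], name="Age")

try:
    ages.astype("int")
except ValueError as exc:
    print("astype failed:", exc)

# errors="coerce" replaces anything unparseable with a missing value instead of raising.
cleaned = pd.to_numeric(ages, errors="coerce")
print(cleaned.tolist())
```

Coercion hides bad input rather than fixing it, so it is worth counting the resulting missing values before moving on.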

Note that in addition to the per column `astype` method above, there is an `astype` method that applies to an entire dataframe.  The parameter to the function is a dictionary, where you specify as the key the column name and the value is the type you want to convert to.  This allows you to change many columns at once.
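
A short sketch of the dictionary form, using a made-up frame whose columns were all read in as text:

```python
import pandas as pd

# Hypothetical frame where every column was read in as text.
raw = pd.DataFrame({
    "id": ["1", "2", "3"],
    "score": ["9.5", "7.25", "8.0"],
    "group": ["a", "b", "a"],
})

# One call converts several columns; keys are column names, values are target dtypes.
typed = raw.astype({"id": "int64", "score": "float64", "group": "category"})
print(typed.dtypes)
```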

### convert_dtypes Method

Pandas has a convert_dtypes method that converts columns to the best possible dtypes that support pd.NA, that is, the nullable extension dtypes added in recent pandas versions.

Here is how it can be used:

```python
df = df.convert_dtypes()
```

By default it attempts these conversions:

* integer columns from int64 to the nullable Int64 dtype (pd.Int64Dtype()); it does not downcast to a smaller integer size
* float columns to the nullable Float64 dtype (pd.Float64Dtype()), or to a nullable integer dtype when every value is a whole number
* boolean columns to the nullable boolean dtype (pd.BooleanDtype())
* object columns to a better-fitting dtype where one can be inferred, for example the string dtype (pd.StringDtype()) when every value is a string
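
To see what `convert_dtypes` does in practice, the sketch below compares dtypes before and after on a small made-up frame. The column names are hypothetical, and the behavior shown assumes a reasonably recent pandas (1.2 or later for the nullable float type).

```python
import pandas as pd

# Hypothetical frame using the default NumPy-backed dtypes.
df_default = pd.DataFrame({
    "count": [1, 2, 3],          # int64
    "ratio": [0.5, None, 1.5],   # float64 with a missing value
    "whole": [1.0, None, 3.0],   # float64 holding only whole numbers
})

converted = df_default.convert_dtypes()
print(df_default.dtypes)
print(converted.dtypes)
```

Note that the `whole` column ends up as a nullable integer column because every non-missing value is a whole number.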

### Nullable Types

One important feature to note is the difference between traditional ('non-nullable') types and the new 'nullable' types introduced in pandas. These new types can have an optional Boolean mask that indicates if a value is missing, which allows for both missing and non-missing data to be accommodated in the same data type.

For example, a traditional (NumPy-backed) integer column cannot hold null values. Pandas represents missing values in such data with np.nan, which is a float, so an integer column that gains a missing value is silently converted to float64. Data manipulation operations like groupbys, joins, and reshaping can introduce missing values in exactly this way. With the nullable integer data types, integers can have true missing values (pd.NA) and stay integers. This makes the nullable types (Int64, Float64, boolean) more flexible.

Most of pandas' NumPy-backed data types have a corresponding nullable version, for example 'Int32' (pd.Int32Dtype()) instead of int32; these are the types that the convert_dtypes method produces. They track missing values with a separate mask, and a missing value generally propagates through operations: the result of arithmetic or a comparison involving pd.NA is itself pd.NA.

Note: Pandas spells nullable types with a capital letter (for example, 'Int64') to differentiate them from the NumPy-backed types, which don't support null values. A column of Python 'int' values, for example, is stored as 'int64' by default.
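
The difference is easiest to see side by side. This minimal sketch builds the same three values twice, once with the default dtype and once with the nullable `Int64` dtype:

```python
import pandas as pd

# With the default dtype, a missing value forces the integers to become floats.
traditional = pd.Series([1, None, 3])
print(traditional.dtype)

# With the nullable Int64 dtype, the values stay integers and the gap is pd.NA.
nullable = pd.Series([1, None, 3], dtype="Int64")
print(nullable.dtype)
print(nullable.sum())  # missing values are skipped by default
```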

### Why Type Conversion

When working with pandas DataFrames, data types play an important role not only in data manipulation and analysis, but also in optimizing performance and memory usage. Converting the data types in your DataFrame to their most appropriate or 'least-cost' types can bring a number of significant benefits.

#### Efficiency in Memory Usage:

Different data types in pandas use different amounts of memory. For example, an int64 type uses 64 bits of memory, while an int8 uses only 8 bits. If you know that the integer data you're working with fits within the range of an int8 (-128 to 127), using int8 instead of int64 can lead to substantial memory savings. This can be particularly beneficial when working with large datasets, where memory limitations can cause a bottleneck.

The same principle applies to other data types. For example, converting objects (which are often used to store strings) to categorical data (when the number of distinct values is limited) can significantly reduce the memory footprint.
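
A rough way to check the integer savings yourself is `Series.memory_usage`. The data below is made up for the purpose: 100,000 values that all fit in an int8.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 small integers that all fit in int8.
values = pd.Series(np.arange(100_000) % 100)
as_int64 = values.astype("int64")
as_int8 = values.astype("int8")

# index=False reports only the bytes used by the values themselves.
print(as_int64.memory_usage(index=False))
print(as_int8.memory_usage(index=False))
```

The int8 version needs one byte per value instead of eight.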

#### Performance Improvement:

Utilizing less memory also tends to speed up computations and operations on the data. Certain operations are also optimized for certain data types. For example, operations on numerical data types are usually faster than those on object (string) types. Thus, converting columns to the correct data type can result in better performance.

#### Data Consistency and Accuracy:

Using the appropriate data type can also ensure data consistency and reduce the risk of errors. For instance, storing True/False values in a boolean column, rather than as 'True'/'False' strings or as 0/1 integers, makes the intent explicit and lets the column be used directly as a filter mask.

#### Enabling DataFrame functionalities:

Certain pandas functionalities only work with specific data types. For example, the .mean() method will not work on a column of strings - the column must be numeric data type (such as int or float). By ensuring the correct data type for each column, you'll be able to make full use of pandas functionalities.

Although data type conversion might seem like an extra step, it is an important aspect of pre-processing your data. It helps in efficient memory utilization, speed improvement, ensuring data accuracy and unlocking additional DataFrame functionalities. Hence, it is a good practice to always check and convert data to appropriate types when working with pandas DataFrames.


## Styling Tables

Using Pandas and Jupyter Notebook, you can apply [conditional formatting and styling](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) to your dataframes that can be very useful for data analysis and visual representation. One of these features is to color cells based on their values or display a bar chart in cell background.

**Note that this really only works in the web Jupyter interface, and in other tools that display the raw HTML the dataframe generates. IntelliJ, for instance, takes over all table rendering and will ignore any styles set.**

Here is how you can do it:

In [14]:
df = pd.DataFrame({
    'A': [75, 85, 90, 80, 70, 80],
    'B': [65, 85, 80, 75, 95, 85],
    'C': [90, 80, 80, 85, 90, 80]
})
df

Unnamed: 0,A,B,C
0,75,65,90
1,85,85,80
2,90,80,80
3,80,75,85
4,70,95,90
5,80,85,80


### Color Cells Based on Their Values

To color the cells based on their value, use the Styler.background_gradient method (it relies on matplotlib for its color maps). This function applies a gradient coloring to the DataFrame values:

In [15]:
df.style.background_gradient(cmap='Blues')

Unnamed: 0,A,B,C
0,75,65,90
1,85,85,80
2,90,80,80
3,80,75,85
4,70,95,90
5,80,85,80


In [16]:
df.style.bar(color='blue', align='zero')

Unnamed: 0,A,B,C
0,75,65,90
1,85,85,80
2,90,80,80
3,80,75,85
4,70,95,90
5,80,85,80


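The Styler.bar method draws a bar in the background of each cell, with a length proportional to the cell's value. The align='zero' setting starts every bar at the zero value of its column.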

Please note that these visualizations don't modify your data, and are just ways to visualize the data for any sort of analysis or debugging.

Both of these methods return a Styler object, which has useful methods for formatting and displaying DataFrames. The styles are built up incrementally, which means you can chain several style methods together and create a good-looking dataframe!

Remember to make sure that your Jupyter Notebook is capable of displaying HTML output for these styles to render.

## Advanced Usage of Parquet Files

### A discussion of categorical data in pandas

Categorical data is a feature provided by pandas for managing categorical variables, which are variables that can take on one of a limited set of values, called categories. An example of a categorical feature could be the color of a car, which can be red, blue, black, etc.

#### Reasons for Using Categorical Data:

**Memory Efficiency:** Categorical data type utilizes significantly less memory by assigning numerical identifiers to distinct categories. This can be especially beneficial when dealing with large datasets with numerous repeating categories.

**Performance Enhancement:** Because categorical data uses less memory, computations and operations on categorical data are often faster than the same operations on, for instance, the object data type.

**Maintaining Ordinal Relationship:** The categorical data type in pandas also supports ordered categories. This means you can express an inherent order or hierarchy within your categories, such as 'low', 'medium', 'high' for a response variable.

**Beneficial for Certain Statistical Methods:** Certain statistical methods and machine learning algorithms require categorical data to function correctly. Using pandas' Categorical data type ensures compatibility with these methods.

Usage Code Example:

```python
import pandas as pd
s = pd.Series(["aaaaa","bbbbb","cccccc","aaaaa"], dtype="category")
```

In this case, the series uses less memory and computation on this series will be faster compared to a non-categorical series with "object" dtype.

#### Memory Benefit:

The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, a regular Series with data type "object" has a memory footprint proportional to the length of the data times the length of the strings. So when the categories are a small fraction of the total number of values, memory usage will be significantly less.
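
The memory difference can be measured directly. This sketch uses made-up data, 100,000 rows containing only four distinct IP-address strings, and compares the object and category representations:

```python
import pandas as pd

# Hypothetical column: 100,000 rows but only four distinct addresses.
addresses = ["10.0.0.1", "10.0.0.2", "192.168.1.10", "192.168.1.20"]
as_object = pd.Series(addresses * 25_000, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)
print(object_bytes, category_bytes)
```

The exact byte counts depend on the pandas and Python versions, but the categorical column should be far smaller.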

#### Potential Negatives:

**Inflexibility:** Once you've set your categorical data, it's a bit more complex to change the categories compared to manipulating a regular Series. For example, a new category must be added explicitly (for instance with .cat.add_categories) before a value from it can be assigned.

**Complexity:** Understanding the categorical data type and using it correctly adds a layer of complexity to your code, making it potentially harder to write and understand.

**Compatibility Issues:** Not all pandas operations are compatible with the categorical data type. Some functions might not work, or might convert the categorical data back to a regular series.
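
The sketch below illustrates two of the points above on made-up data: ordered categories, and the extra step needed to introduce a new category. The level names are hypothetical.

```python
import pandas as pd

# Ordered categories: the order comes from the categories list, not the alphabet.
levels = pd.Series(
    pd.Categorical(
        ["medium", "low", "high"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)
sorted_levels = levels.sort_values().tolist()
above_low = (levels > "low").tolist()
print(sorted_levels)
print(above_low)

# A new value must be registered as a category before it can be assigned.
levels = levels.cat.add_categories(["critical"])
levels.iloc[0] = "critical"
print(levels.tolist())
```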

### Parquet Files

Many IO operations do not know what a categorical data series is, so their output is expressed with fully expanded values. While a dataframe may internally store small integer codes for a set of IP addresses, when they are displayed or saved as text, the full IP address is shown or saved. This is perfect for interchange, but if you want to save small versions of the files, it is not ideal.

The parquet data file format does understand how to save categorical data correctly, though, so if you're trying to save data for personal use and you want to take as little space as you can, parquet should be your go-to file format.

Parquet can save categorical data correctly, but it helps if you give the save operation a few hints as to which columns are categorical and which aren't.

```python
df.to_parquet(
    filename,  # the file to save to
    version="2.6",  # Parquet format version; pyarrow expects a string here.
                    # Make sure it is new enough to handle everything.
    compression="brotli",  # the compression codec. See the docs for details; there
                           # are many, but I find brotli makes the smallest files.
    use_dictionary=list(df.select_dtypes(include=["category"]).columns),
    # The line above tells the parquet writer to use dictionary encoding for
    # every column of type category. As with a dataframe, instead of saving
    # 100,000 IP addresses (of which only 4 are distinct), the saved data
    # contains a dictionary of just the 4 unique addresses, and the column
    # itself holds the much smaller index values. This will likely also
    # compress better, in addition to naturally taking up less space.
)
```


## Caching

Caching is something you will find us doing from time to time, purely as a quality-of-life improvement to our software.

In short, I'm willing to wait 20 minutes to download a lot of data from the server once; I am not willing to do it hundreds of times a day while I'm developing my notebooks.

To that end, sometimes it is useful to either:

* Create one notebook to download, clean, and save data and then write a second notebook to process the data.
* Write your notebook to download, clean, and save data only if it can't find a copy of that data on local disk. If a stored version exists, use that instead.

The idea is, do the hard work to download and clean data just once, save it off, and work with that from there on. It will be much faster than repeatedly fetching data remotely.

Caching Pandas dataframes in local storage essentially means saving the dataframe to a file (like CSV, Parquet, etc.) for future use. This operation can have several benefits as well as potential drawbacks:

### Benefits of Caching Pandas DataFrames in Local Storage:

**Reuse:** Data preprocessing can be time-consuming. If you have a large raw dataset, then once you've cleaned it and gotten it into the right shape, saving the processed DataFrame to a file lets you reuse it without repeating the entire preprocessing step.

**Sharing DataFrames:** If you need to share your data with colleagues, or move it between different environments (like development, testing, and production), saving it to a file is a very convenient way to do that.

**Saving Processed Data:** If you're working with data that's collected in real-time and you're maintaining a running DataFrame of processed data, regularly saving to a file can ensure you don't lose your data if the process is interrupted.

**Frequency of Access:** Especially for large dataframes, if you read the same dataframe multiple times in your program, it could be beneficial to store it in your local cache to speed up the subsequent reads.

### Drawbacks of Caching Pandas DataFrames in Local Storage:

**Storage Space:** Large DataFrames can take up a lot of storage space. If you're working with multiple large dataframes and you save all of them to files, you might fill up your storage space quickly.

**I/O Operations:** Reading from and writing to a file involves disk I/O operations, which are significantly slower than operations in memory. So, if your DataFrame fits comfortably in memory and you don't need the benefits listed above, it might be faster not to save it to a file. That being said, it's almost always faster to read from truly local storage than from a remote database or API server, so keep that in mind as well.

**Data Secrecy:** Storing data in local storage exposes it to the risk of being accessed by unauthorized users, especially when dealing with sensitive data.

**Data Consistency:** Frequent read/write operations can lead to data inconsistency issues, especially in multi-threaded/multi-user environments.

The choice of whether to cache pandas DataFrames in local storage or not depends on the specifics of your project, such as the size of your data, the cost of preprocessing, the frequency of data access, and storage capacity.

Most of our caching is done by writing dataframes to storage as parquet files, as shown above.  

There are a few things to think about if you decide to cache data, aside from what is above.

It is very easy to mess up with cached data and read in old data when you really wanted new data. If I write a function that accepts parameters, looks for a local file, and uses it if it's there, I need to make sure that the file selection takes those parameters into consideration. An example follows.


```python
import os
import pandas as pd

# `server` stands in for whatever client object runs the remote query.
def get_data(start_date, end_date, filterstring):
    if os.path.exists('cachefile.pq'):
        return pd.read_parquet('cachefile.pq')

    df = pd.DataFrame(server.query(start_date, end_date, filterstring))
    df.to_parquet('cachefile.pq')
    return df
```

The above function will use the cache file if it exists; otherwise, it will run the query, save the result as a parquet file, and return the data.

Note, however, that the `get_data` function accepts parameters: a start date, an end date, and a filter string. Those will be used if the cache file doesn't exist, but from then on they will be ignored and the old data will be returned. This is not what we want. We want new parameters to cause new queries to be run. Re-using old parameters should get us the saved data from past runs with exactly those parameters, nothing else.

The easiest way to accomplish this is to make sure you include the parameters in the file name.

```python
def get_data(start_date, end_date, filterstring):
    cache_file = f"cachefile_{start_date}_{end_date}_{filterstring}.pq"
    if os.path.exists(cache_file):
        return pd.read_parquet(cache_file)
    
    df = pd.DataFrame(server.query(start_date, end_date, filterstring))
    df.to_parquet(cache_file)
    return df    
```

Now, every time we change the parameters, we will check, not find a cache file, and run the query.  We will end up with one cache file for each set of parameters.

Note that above, we really should have done something to make sure the cache filename was valid, for example that the date string didn't corrupt it. To do this, I sometimes choose not to use the real variable contents in the file name, but to concatenate them all together and calculate a hash of them.


In [17]:
import hashlib

# Define the strings 
arguments = ['Hello', ' ', 'World', '!']

# Initialize a new md5 hash object
hash_object = hashlib.md5()

# Iterate over the strings and add each one to the hash object
for s in arguments:
    hash_object.update(s.encode()) # Ensure the string is in bytes

# Extract the hexdigest
hex_dig = hash_object.hexdigest()

print(hex_dig)
    

ed076287532e86365e841e92bfc50d8c


With code like the above, I can take all the arguments to my query system and create a unique hash of them, and make that part of the cache filename.  This will guarantee that if I change any of the input parameters, I will also change the cache filename, meaning that there is little to no chance of me reading in old data meant for a different invocation of the function.
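
Putting the pieces together, a minimal sketch of such a filename helper might look like the following. The function name `cache_filename` and its `prefix` and `suffix` parameters are hypothetical. It serializes the arguments with `json.dumps` before hashing, rather than concatenating them directly, so that argument boundaries are preserved.

```python
import hashlib
import json

def cache_filename(*args, prefix="cachefile", suffix=".pq"):
    """Build a cache file name from a hash of the query arguments."""
    # json.dumps keeps argument boundaries, so ("ab", "c") and ("a", "bc")
    # hash differently; default=str covers dates and similar objects.
    payload = json.dumps(args, default=str).encode()
    digest = hashlib.md5(payload).hexdigest()
    return f"{prefix}_{digest}{suffix}"

print(cache_filename("2024-01-01", "2024-01-31", "status=active"))
```

The resulting name contains only the prefix, hexadecimal digits, and the suffix, so unusual characters in a filter string cannot produce an invalid file name.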

It's important to note that MD5 (hashlib.md5()) is not recommended for cryptographic uses because of known vulnerabilities. Nonetheless, it remains popular for checksums and data integrity verification in non-security-critical applications such as this one.
