# 01.2 Data Types and String Handling

### Understanding Data Types

An important, but these days often overlooked, aspect of learning a new scientific programming language is how that language treats different data types. A data type is an attribute that specifies the values a variable can take and, ultimately, the operations that can be performed on that variable. Generally speaking, data types include integers, floating point numbers, and character strings. In Python, we create a variable simply using a statement like:

`x = 24`

This statement effectively instructs Python to create a variable we will refer to as `x` equal to a numerical value of 24. Note that we haven't specified its data type. This contrasts with languages like C, in which we would define a variable using something like the following:

`int x = 24;`

So, then how does Python figure out what the data type is? Let's explore:

In [None]:
x = 24 # This is an integer
y = 48 # This is an integer
z = 4.2 # This is a double precision floating point

char = 'Cappucina Ballerina' # This is a character string


We can use the `type()` function to query what the data type of each of these defined variables is:

In [None]:
type(x)

In [None]:
type(y)

In [None]:
type(z)

In [None]:
type(char)

Oftentimes, we want to perform mathematical or other operations on variables. As mentioned above, the data type influences what operations we can and cannot do with our variables. Let's start by adding `x` and `y`:

In [None]:
x + y

Now let's examine the data type that was returned by this operation:

In [None]:
type(x+y)

So far, so good. Now what happens when we add two variables that have different types?

In [None]:
x+z

In [None]:
type(x+z)

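Wait, what? We added an integer to a float and got a float. Python promoted the integer `x` to a float before performing the addition, because a float can hold the result without discarding the fractional part of `z`. (Compiled languages like C and Fortran apply similar promotion rules in mixed arithmetic, but there the result ends up in a variable whose type you declared ahead of time.) Addition seems straightforward, but what about division?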

In [None]:
x/y

Ok, the answer is _numerically_ correct, but what data type did Python return? 

In [None]:
type(x/y)

Huh? In languages like C and Fortran, dividing one integer by another performs integer division, so `24/48` would evaluate to `0`. In Python 3, the `/` operator always performs true division and returns a float, even when both operands are integers. (If you want the integer behavior, Python provides the floor-division operator `//`.)
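As a minimal sketch of the two division operators, assuming Python 3 (the names `true_quotient` and `floor_quotient` are just for illustration):

```python
x = 24
y = 48

true_quotient = x / y    # true division: always returns a float in Python 3
floor_quotient = x // y  # floor division: stays an int when both operands are ints

print(true_quotient, type(true_quotient))    # 0.5 <class 'float'>
print(floor_quotient, type(floor_quotient))  # 0 <class 'int'>
```

Which operator you want depends on whether the fractional part of the quotient matters for your calculation.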

Let's try something... what if we defined `xfloat` = 24.0 (a floating point number by definition) and then compare it to `x`:

In [None]:
xfloat = 24.0

(x==xfloat)

Hmmmm... curious. Can you think of any potential issues with this? What if `xfloat = 24 + 1E-12`?
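One way to explore the question above is sketched here. It assumes the standard-library function `math.isclose` with its default tolerances, and the name `xfloat_perturbed` is hypothetical:

```python
import math

x = 24
xfloat = 24.0
xfloat_perturbed = 24 + 1e-12  # differs from 24 by a tiny amount

exact_match = (x == xfloat)                       # int and float with the same value
perturbed_match = (x == xfloat_perturbed)         # exact comparison is unforgiving
close_enough = math.isclose(x, xfloat_perturbed)  # comparison within a tolerance

print(exact_match)      # True
print(perturbed_match)  # False
print(close_enough)     # True
```

Because floating point calculations accumulate small round-off errors, comparing floats within a tolerance is usually safer than testing for exact equality.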

So far, Python has let us play pretty fast and loose with data types, effectively guessing what we're trying to do and then adjusting "on the fly." Let's illustrate one way that the data type *does*, in fact, matter:

In [None]:
for i in range(x):
    print(i)

In [None]:
for i in range(xfloat):
    print(i)
    

Why is this? Think about the mathematical concept of a whole number.

Ok, let's try something weird...

In [None]:
x + char

In [None]:
str(x) + char

Why does the first one not work, but the second one does?

### What does all of this mean?!? And why does it matter?

Python is two things:
* __Dynamically typed__: The language determines the data type of a variable at runtime, not at compile time (this differs from languages like C and Fortran). It can interpret some operations (like `int` + `float` = `float`) but will throw errors with others that cannot be interpreted (like `int` + `str`). 
* __Strongly typed__: The language *does* have a strict set of rules regarding type compatibility during certain operations (e.g., the `range()` call in our `for` loop required an integer). Some important built-in functions (`range()` for one) restrict the data types that can be used as inputs. 
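The two properties above can be sketched in a few lines. This is only an illustration: the names `value` and `errors` are hypothetical, and `try`/`except` is used here simply to catch the errors so the cell keeps running.

```python
# Dynamic typing: the same name can refer to values of different types over time
value = 24
first_type = type(value)    # int
value = 'twenty-four'
second_type = type(value)   # str

# Strong typing: incompatible operations raise a TypeError instead of guessing
errors = []

try:
    24 + 'Cappucina Ballerina'   # int + str is not defined
except TypeError as err:
    errors.append(str(err))

try:
    range(24.0)                  # range() needs an integer, not a float
except TypeError as err:
    errors.append(str(err))

print(first_type, second_type)
for message in errors:
    print(f'TypeError: {message}')
```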

### Printing Variables and Calculations to the Screen

We will be performing many, many mathematical operations on variables that are integers and floats throughout this class. However, there are many times when we want to report the value of a variable to the screen or user. We do this when we're debugging code (trying to find an error), when reporting progress through a long workflow (to reassure ourselves or the user that progress is being made), or just to report a value to the screen when, for example, we're reporting output from a sensor. Fortunately, Python has a built-in function `print()` that is very flexible and that suffices for many uses.

For example, we can print the result of a mathematical operation without casting to a character string type. 

In [None]:
print(x+y)

We can also add useful text strings and cast the value of a mathematical operation to a string to create output that is more human readable. 

In [None]:
print('x + y = ' + str(x+y))

This even works when performing legal operations on different data types:

In [None]:
print(x+z)

In [None]:
print('x + z = ' + str(x + z))

And, of course, we can print character strings...

In [None]:
print(char)

And concatenate those strings with other strings. The ability to concatenate strings like this is not limited to the `print()` function and will be very useful in plotting and other operations.

In [None]:
print(char+' is a meme character in the Italian brainrot genre')

### F-Strings

It may not seem very oriented toward scientific computing, but handling strings is an incredibly important skill to have. In fact, it's so important that your first problem set will involve string handling in a very realistic scenario. 

A relatively new addition to Python (version 3.6 and later) is the so-called f-string. You denote an f-string by preceding the opening quote with the letter `f`, like this:

`f'this is an f-string'`

F-strings do provide us with some shortcuts to help format output when we print it to the screen.  

In [None]:
print(f'x={x}, y={y}, sum={x+y}')

In [None]:
print(f'{x=}, {y=}, {x+y=}')

However, that's not their superpower. What makes f-strings especially useful is the ability to **format** the way that numbers are displayed when we print them. 

For example, we can control the number of decimal places:

In [None]:
decimals = 10 # Number of digits to show after the decimal point
air_temp = 21.31

print(f'{air_temp:.{decimals}f}')

We can add commas to large numbers:

In [None]:
salary = 421540599

print(f'Salary is... ${salary:,}')

We can display numbers as percentages without converting:

In [None]:
probability = 0.5565

print(f'probability = {probability:.2%}')
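The same format specification can also set a field width, an alignment, and scientific notation, which helps when lining up columns of output. The sketch below is illustrative only: the variables `station` and `pressure` and their values are made up for the example.

```python
station = 'BOI'        # hypothetical station code
air_temp = 21.31       # air temperature in °C
pressure = 101325.0    # hypothetical air pressure in Pa

# '<6'     : left-align in a field 6 characters wide
# '>8.1f'  : right-align in 8 characters, 1 digit after the decimal point
# '>12.3e' : right-align in 12 characters, scientific notation with 3 decimals
row = f'{station:<6}{air_temp:>8.1f}{pressure:>12.3e}'

print(row)
```

Fixed-width fields like these keep successive rows aligned even when the numbers have different magnitudes.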

And, we can create more complicated, formatted text reports to print using three single quotes to start and end our f-string:

In [None]:
name = 'Stu'
age = 46
nickname = 'Disco'

info = f'''
Name: {name}
Age: {age}
Nickname: {nickname}
'''

print(info)

### Practical Use Cases

What are some practical uses of f-strings? Below are two realistic use cases.

__Example 1__: I have written code that computes evapotranspiration using the Penman-Monteith equation. One function (more on this later) in this code takes the air temperature (in °C) and the relative humidity (in %) as input and returns the vapor pressure deficit (VPD). My code keeps returning unrealistic numbers (evapotranspiration values of 3 **meters** per day), and I suspect that there might be an issue with the way that VPD is calculated. To debug my code, I would like to print out the values of air temperature and relative humidity that the function is receiving, and the value of VPD it is returning as output. I can identify three potential outcomes of this debugging test:
1. All of the values are realistic, so the problem lies somewhere else in the code,
2. Air temperature and/or relative humidity are not realistic, so the function is somehow not getting the right input,
3. The air temperature and relative humidity are realistic but the VPD is not, so there is an error in my computation.

The cell below is example code of what this debugging test might look like. For now we'll assume values of air temperature and relative humidity, which excludes outcome 2 above.

In [3]:
import numpy as np # Don't worry about what this is for now

# Assume these are the values passed to my code 
airT = 22.3 # Air temperature in °C
RH = 27.33  # Relative humidity in %

# Don't worry about these equations
esat = 0.6108*np.exp((17.27*airT)/(airT + 237.3)) # Saturation vapor pressure in kPa
ea = (RH/100.0)*esat # Actual vapor pressure in kPa

VPD = esat - ea # Vapor pressure deficit in kPa

# Create output string:
vpd_info = f'''
VPD calculation debug:
airT = {airT} °C
RH = {RH} %
VPD = {VPD:.3f} kPa
'''

print(vpd_info)



VPD calculation debug:
airT = 22.3 °C
RH = 27.33 %
VPD = 1.957 kPa
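Since the scenario describes VPD being computed inside a function, here is a minimal sketch of how the same debug report might be built into one. The function name `vapor_pressure_deficit` and its `debug` flag are assumptions made for illustration, not the actual code described above, and `math.exp` stands in for `np.exp` to keep the sketch self-contained.

```python
import math

def vapor_pressure_deficit(air_temp_c, rel_humidity_pct, debug=False):
    """Return VPD in kPa from air temperature (°C) and relative humidity (%)."""
    # Same equations as the cell above
    esat = 0.6108 * math.exp((17.27 * air_temp_c) / (air_temp_c + 237.3))  # kPa
    ea = (rel_humidity_pct / 100.0) * esat                                 # kPa
    vpd = esat - ea                                                        # kPa

    if debug:
        # Report the inputs the function received and the output it computed
        print(f'airT = {air_temp_c} °C, RH = {rel_humidity_pct} %, VPD = {vpd:.3f} kPa')

    return vpd

vpd = vapor_pressure_deficit(22.3, 27.33, debug=True)
```

Leaving `debug=False` as the default means the report appears only when it is explicitly requested.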



#### Dealing with Many File Names

__Example 2__: I need to get gridded precipitation data for the Upper Boise River Basin for the years 2020-2025. The originators of the data are at Oregon State and provide their data freely via an online web service. Precipitation data are available daily, but are lumped into monthly files. Each file contains the entire contiguous US (CONUS). As such, I want to download only those files that I need and process them on my computer. I know the file names have the following format:

`precip-YYYY-MM.nc`

Rather than pointing and clicking to download 72 files (6 years × 12 months), I want to use the Linux `wget` command to download the data files. As such, I need to write some code to generate the file names (note that the month is a double digit, so I'll need to pad months 1-9 with a leading zero) and create my `wget` command to automate downloading the data. 

In [None]:
file_base = 'precip'
file_ext = '.nc'

months = np.arange(1, 13)     # Months 1 through 12
years = np.arange(2020, 2026) # Years 2020 through 2025, inclusive

for yr in years:
    for mo in months:
        file_name = f'{file_base}-{yr}-{mo:02d}{file_ext}'
        print(file_name)
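The cell above generates the file names; building the matching `wget` commands might look like the sketch below. The `base_url` value is a made-up placeholder, not the real address of the data service, and the commands are only assembled as strings here, not executed.

```python
# Hypothetical placeholder URL: substitute the real data server address
base_url = 'https://example.com/precip'

commands = []
for yr in range(2020, 2026):      # 2020 through 2025, inclusive
    for mo in range(1, 13):       # months 1 through 12
        file_name = f'precip-{yr}-{mo:02d}.nc'
        commands.append(f'wget {base_url}/{file_name}')

print(len(commands))   # number of commands generated
print(commands[0])     # first command
print(commands[-1])    # last command
```

Writing these strings to a text file, one per line, would give a simple shell script for running the downloads.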


#### Challenge: Updating Progress

Once I have downloaded all of the data files I need, I need to do my data processing. That will involve a sequence of steps:

1. Subsetting the data spatially to identify only the Upper Boise River Basin
2. Collecting the monthly data into annuals
3. Getting mean annual, seasonal, and monthly precipitation totals
4. Getting extreme low and extreme high precipitation
5. Correlating precipitation to topography and land cover

I will do steps 1-4 in a single script. However, given the size of the files and the complexity of the work, it will take a significant amount of processing time. To reassure myself that the script is running and not "hung" because of memory limits, an error in my code, etc., I want to print and report on the progress of the code. 

Can you write a block of code below that reports on each file name that is being processed and the percent of months that have been processed? You effectively have the first part of the code from the cell above. You may need to introduce a "counting" variable to report on the latter. __Suggestion:__ Talk with your table mates about your strategy before you begin coding. 


In [4]:
file_base = 'precip'
file_ext = '.nc'

months = np.arange(1, 13)     # Months 1 through 12
years = np.arange(2020, 2026) # Years 2020 through 2025, inclusive

total_files = len(years)*len(months) # 6 years x 12 months = 72 files
file_count = 0

for yr in years:
    for mo in months:
        file_name = f'{file_base}-{yr}-{mo:02d}{file_ext}'

        # Process the data (placeholder for the real processing call)
        # process_my_data(file_name)
        file_count += 1
        print(f'I have processed {file_count/total_files:.2%}. Last file: {file_name}.')

I have processed 1.39%. Last file: precip-2020-01.nc.
I have processed 2.78%. Last file: precip-2020-02.nc.
I have processed 4.17%. Last file: precip-2020-03.nc.
I have processed 5.56%. Last file: precip-2020-04.nc.
I have processed 6.94%. Last file: precip-2020-05.nc.
I have processed 8.33%. Last file: precip-2020-06.nc.
I have processed 9.72%. Last file: precip-2020-07.nc.
I have processed 11.11%. Last file: precip-2020-08.nc.
I have processed 12.50%. Last file: precip-2020-09.nc.
I have processed 13.89%. Last file: precip-2020-10.nc.
I have processed 15.28%. Last file: precip-2020-11.nc.
I have processed 16.67%. Last file: precip-2020-12.nc.
I have processed 18.06%. Last file: precip-2021-01.nc.
I have processed 19.44%. Last file: precip-2021-02.nc.
I have processed 20.83%. Last file: precip-2021-03.nc.
I have processed 22.22%. Last file: precip-2021-04.nc.
I have processed 23.61%. Last file: precip-2021-05.nc.
I have processed 25.00%. Last file: precip-2021-06.nc.
I have processed 