# Lecture 1

## Outline of topics for this lecture:

1. Brief history of Python
2. Python's place in the family of programming languages
3. Basic variables (integer, float, string)
4. Conversions between the basic variable types
5. Boolean variables
6. "Collection" data types
    a. lists
    b. tuples
    c. sets
    d. dictionaries
7. Operators
8. Control statements
9. Functions

## In lab later on the assigned work is:

1. Watch tutorials on GitHub, sign up for a GitHub account, and make a copy of (a.k.a. fork) a repo that will allow you to run the necessary Python code in a Jupyter notebook.
2. Write a script to read a file from disk, rename it, and write it back out with an extra line at the end.
3. Write a function to test whether a particular word is present in a disk file or not and return `True` or `False` depending on the case.
4. Read a csv file from disk and find the historical corn yield data in preparation for plotting the corn yield trend in the United States.

## Some useful background material:

The <a href="https://the-examples-book.com/book/introduction" target="_blank">Purdue Data Mine Examples Book</a> contains many useful chapters on data science. While they have not been directly designed for this class, they may be useful. You will not need to use scholar to perform the exercises of this class so don't worry about that part. Here is a direct link to the <a href="https://the-examples-book.com/book/python/introduction" target="_blank">Python chapter.</a>


Additional useful links for Python include:

<a href="https://docs.python.org/3/" target="_blank">Python 3 documentation</a>

The <a href="https://pypi.org/" target="_blank">Python Package Index</a> (This contains many of the useful Python "add-on" packages such as NumPy and Matplotlib)

The <a href="https://numpy.org/" target="_blank">NumPy Package</a> (This contains specialized array (vector and matrix) routines. NumPy stands for "Numerical Python")

United States Department of Agriculture: <a href="https://quickstats.nass.usda.gov/" target="_blank">Quick Stats</a> (The USDA's National Ag Statistics Service -- go here and familiarize yourself with the available data)

You will need to use Git and Github to get the example code for the class. Some useful tutorial links are:

<a href="https://www.youtube.com/watch?v=USjZcfj8yxE" target="_blank">Learn Git in 15 Minutes</a> (Colt Steele)

<a href="https://www.youtube.com/watch?v=USjZcfj8yxE" target="_blank">Learn Github in 20 Minutes</a> (Colt Steele)

It makes sense to create a free GitHub account for yourself: <a href="https://github.com/" target="_blank">GitHub</a>

## History of Python, Etc.

- Conceived by Guido van Rossum in December 1989 at the Centrum Wiskunde & Informatica (CWI, the Dutch national research institute for mathematics and computer science).
- Python version 1.0 in January 1994.
- Open source license; releases since version 1.6.1 were relicensed with the aim of compatibility with the GNU General Public License.
- Python version 2.0 in October 2000.
- Python Software Foundation formed in 2001, along with a new open source license.
- Python version 2.7 was the last release in the version 2 series. Support ended January 2020.
- Python version 3.0 released December 2008. It broke backward compatibility with much of the version 2 code.
- Latest version at the time of writing is 3.10 (October 2021).

## According to the Stack Overflow survey of professional software developers ...

<img align="left" src='Figs/DeveloperSurvey2021.png' width="500"/>

## As concerns languages for data science ...

The contenders are Python and R. For **Python**

* Most popular among data scientists.
* Very useful in machine learning and artificial intelligence because of the availability of popular libraries such as scikit-learn, matplotlib, and tensorflow.

For **R**

* A scripting language.
* Very good support for statistical computation and visualization.

This course is focused on Python instead of R because Python is a better fit for the work of my research group. For an independent comparison of the two: <a href="https://www.ibm.com/cloud/blog/python-vs-r" target="_blank">Python vs. R: What's the Difference?</a>

## Basic variables

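Python uses something called **dynamic typing**, which means that a variable is created when a value is assigned to it, and its type is determined by that value rather than declared in advance. The type can change later if a value of a different type is assigned to the same name. There are a few rules on variable names: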

* Must start with a letter or underscore
* Names are case-sensitive

A python variable is more than just its value. It must also contain information about the type of the value. There is overhead associated with such flexibility. The code below illustrates three of the variable types: **integer, float, and string**.

In [1]:
# Integer, i.e., whole numbers both positive and negative. Later on 
# we will illustrate formatting the print command.
x = 4
print('The type of x is:')
print(type(x))
print() # Just to give a space.
print('The value of x is:')
print(x)

The type of x is:
<class 'int'>

The value of x is:
4


In [2]:
# Floating point, i.e., computer representation of real numbers.
x = 4.0
print('The type of x is:')
print(type(x))
print() # Just to give a space.
print('The value of x is:')
print(x)

The type of x is:
<class 'float'>

The value of x is:
4.0


In [3]:
# Strings. A string is a sequence of characters. They can be delimited
# by single quotes ('blah') or double quotes ("blah blah")
x = "four"
print('The type of x is:')
print(type(x))
print() # Just to give a space.
print('The value of x is:')
print(x)

The type of x is:
<class 'str'>

The value of x is:
four


## Conversions
Python has a built-in function `float()` that can convert integers and certain strings to floating point numbers.

In [4]:
# Start with an int.
x = 4

print('The type of x is:')
print(type(x))
print()
print('The value of x is:')
print(x)

The type of x is:
<class 'int'>

The value of x is:
4


In [5]:
# Convert to float
x = float(x)
print('The type of x is:')
print(type(x))
print()
print('The value of x is:')
print(x)

The type of x is:
<class 'float'>

The value of x is:
4.0


In [6]:
# Can also convert strings representing floating point numbers to
# float.
x = '1.67'
print('The type of x is:')
print(type(x))
print()
print('The value of x is:')
print(x)

The type of x is:
<class 'str'>

The value of x is:
1.67


In [7]:
# Convert the string to a float. Note that the printed value looks the
# same as it did for the string version of x, i.e., we can't tell the
# type from the printed value alone.
x = float(x)
print('The type of x is:')
print(type(x))
print()
print('The value of x is:')
print(x)

The type of x is:
<class 'float'>

The value of x is:
1.67


There is also a Python function `int()`, which can convert floats and certain strings to integers, and a function `str()`, which converts numbers to strings.
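As a brief sketch of these two conversions (the values are arbitrary and chosen only for illustration), note that `int()` truncates toward zero rather than rounding, and that it rejects a string such as `'1.67'` that does not represent a whole number:

```python
# int() truncates a float toward zero; it does not round.
a = int(4.7)
print(a, type(a))

# int() also accepts strings that represent whole numbers.
b = int('42')
print(b, type(b))

# str() turns a number into its text representation.
c = str(1.67)
print(c, type(c))

# A string holding a decimal number cannot go straight to int.
try:
    int('1.67')
except ValueError as err:
    print('int() failed:', err)
    d = int(float('1.67'))  # convert to float first, then truncate
    print(d)
```

Going through `float()` first is one simple way to handle such strings when truncation is acceptable.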

## Boolean type
A Boolean value has a python type **bool**. The possible values a Boolean variable can take are: **True** and **False**. These are typically used to hold the results of logical tests, which in turn can be used to control the flow of a python program.

In [8]:
x = True
print('The type of x is:')
print(type(x))
print()
print('The value of x is:')
print(x)

The type of x is:
<class 'bool'>

The value of x is:
True


## Collection data types
There are four **collection** data types: **lists**, **tuples**, **sets**, and **dictionaries**. (Some say that a **string** is also a collection data type since it is an ordered sequence of characters.) For now we will only consider lists and sets.

### <u>Lists</u> are ordered, changeable, and allow duplicate members:

In [9]:
# Create a list with 5 elements.
Coloradothings = ["wheat", "corn", "sugar beets", "pinto beans", 1959]
print('The type of Coloradothings is:')
print(type(Coloradothings))
print()
print('The length of Coloradothings is:')
print(len(Coloradothings))
print()
print('The value of Coloradothings is:')
print(Coloradothings)

The type of Coloradothings is:
<class 'list'>

The length of Coloradothings is:
5

The value of Coloradothings is:
['wheat', 'corn', 'sugar beets', 'pinto beans', 1959]


In [10]:
# The elements inside of Coloradothings may be of differing
# types ...
print('For Coloradothings[3] ...')
print(Coloradothings[3])
print(type(Coloradothings[3]))
print()
print('For Coloradothings[4] ...')
print(Coloradothings[4])
print(type(Coloradothings[4]))

For Coloradothings[3] ...
pinto beans
<class 'str'>

For Coloradothings[4] ...
1959
<class 'int'>


In [11]:
# We can append to a list and insert in a list

Coloradothings.append("Amherst")
print(Coloradothings)
print()
Coloradothings.insert(2, "sunflowers")
print(Coloradothings)

['wheat', 'corn', 'sugar beets', 'pinto beans', 1959, 'Amherst']

['wheat', 'corn', 'sunflowers', 'sugar beets', 'pinto beans', 1959, 'Amherst']
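The other properties named in the heading, changeable and duplicates allowed, can be sketched the same way. The list `crops` below is a made-up example, separate from `Coloradothings`:

```python
crops = ["wheat", "corn", "wheat"]   # duplicates are allowed
print(crops.count("wheat"))          # how many times "wheat" appears

crops[1] = "soybeans"                # changeable: replace an element in place
print(crops)

crops.remove("wheat")                # removes only the first matching element
print(crops)

print(crops[-1])                     # ordered: negative indices count from the end
```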


### <u>Sets</u> are unordered, changeable (items can be added and removed), and do not allow duplicate members:

In [12]:
# Make a set.
Purduethings = {"Ag and Bio Engineering", "Ross-Ade Stadium", "students", "professors", "Gene Keady", "study sessions"}
print(type(Purduethings))
print(Purduethings) # Note the order it prints
print()

for x in Purduethings: # Note the order with which the for loop executes
    print(x)

print()
print("Ag and Bio Engineering" in Purduethings)
print("Medical School" in Purduethings)

<class 'set'>
{'Ag and Bio Engineering', 'Gene Keady', 'students', 'Ross-Ade Stadium', 'professors', 'study sessions'}

Ag and Bio Engineering
Gene Keady
students
Ross-Ade Stadium
professors
study sessions

True
False


From the code output above we note:
1. The order in which we listed the items when defining the set is not the order in which Python enumerated them when printing. Sets are unordered, so the order may differ from run to run.
2. The expression in the second-to-last print command, `"Ag and Bio Engineering" in Purduethings`, is a Boolean expression: it evaluates to `True` or `False`.

We can perform classical set operations (**union**, **intersection**, **difference**, **test subset**):

In [13]:
# Make another set ...
IUthings = {"Hoosiers", "Bobby Knight", "students", "professors", "parties"}
print('Purduethings union IUthings equals:')
print(Purduethings.union(IUthings))
print()
print('Purduethings intersection IUthings equals:')
print(Purduethings.intersection(IUthings))
print()
print('IUthings not also in Purduethings equals:')
print(IUthings.difference(Purduethings))
print()
print({"Gene Cernan",}.issubset(Purduethings))
Purduethings.add("Gene Cernan")
print({"Gene Cernan",}.issubset(Purduethings))

#print(Purduethings)

Purduethings union IUthings equals:
{'Ag and Bio Engineering', 'Bobby Knight', 'Gene Keady', 'Hoosiers', 'students', 'Ross-Ade Stadium', 'professors', 'study sessions', 'parties'}

Purduethings intersection IUthings equals:
{'students', 'professors'}

IUthings not also in Purduethings equals:
{'Bobby Knight', 'parties', 'Hoosiers'}

False
True
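The same set operations can also be written with operators instead of method calls. The short sketch below uses two small made-up sets, `A` and `B`, rather than the sets defined above, and checks that each operator gives the same result as the corresponding method:

```python
A = {"students", "professors", "Ross-Ade Stadium"}
B = {"students", "professors", "Hoosiers"}

print(A | B == A.union(B))          # | is union
print(A & B == A.intersection(B))   # & is intersection
print(B - A == B.difference(A))     # - is difference
print({"students"} <= A)            # <= tests for subset

# sorted() returns a list, which gives a repeatable printing order.
print(sorted(A & B))
```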


## Operators

### Arithmetic operators: +, -, *, /, %, **

In [14]:
# Arithmetic operators: +, -, *, /, %, **

print(7 + 5)  # addition
print(7 - 5)  # subtraction
print(7 * 5)  # multiplication
print(7 / 5)  # division
print(7 % 5)  # remainder upon integer division
print(7 ** 5) # exponentiation

12
2
35
1.4
2
16807
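The remainder operator `%` has a companion that is not in the list above: `//`, floor division, which gives the whole-number quotient. A minimal sketch of how the two fit together:

```python
q = 7 // 5             # floor division: whole-number quotient
r = 7 % 5              # remainder
print(q, r)
print(divmod(7, 5))    # built-in that returns (quotient, remainder) together
print(q * 5 + r == 7)  # quotient and remainder reconstruct the original number

# Floor division rounds toward negative infinity, not toward zero.
print(-7 // 5, -7 % 5)
```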


### Assignment operators: =, +=, -=, *=, /=, **=

In [15]:
# Assignment operators: =, +=, -=, *=, /=, **=

b = 5
a = b
print(a)
a += b # shorthand for a = a + b
print(a)
a -= b # shorthand for a = a - b
print(a)
a *= b # shorthand for a = a*b
print(a)
a /= b # shorthand for a = a/b
print(a)
a **= b # shorthand for a = a**b
print(a)

5
10
5
25
5.0
3125.0


### Comparison operators: ==, !=, <, <=, >, >=

In [16]:
# Comparison operators: ==, !=, <, <=, >, >=

a = 3
b = 2

print(a == b)
print(a != b)
print(a < b)
print(a <= b)
print(a > b)
print(a >= b)


False
True
False
False
True
True


### Logical operators: and, or, not

In [17]:
# Logical operators: and, or, not

x = (a == b) # The expression a == b is a Boolean value (either True or False).
             # The assignment creates a Boolean variable x
print(type(x))
print(x)

print()

y = not x
print(type(y))
print(y)

print()

z = True

print(x or z)
print(x and z)


<class 'bool'>
False

<class 'bool'>
True

True
False


## Control statements
There are three methods of program control that we consider here:
1. If/else statement
2. For loops
3. While loops

In [18]:
# Example if/else statement

a = 5
b = 3
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")

a is greater than b


In [19]:
# While loop: Execute while condition is true.

i = 1
while i < 6:
    print(i)
    i += 1

1
2
3
4
5


In [20]:
# For loop: Iterate over a sequence. Also, have break (stop a loop where it is 
# and exit) and continue (move to the next iteration of loop).

for x in "banana":
    print(x)
    
print("\n")    
print("Try continue command")
print("\n")

for x in "banana":
    if x == "n":
        continue
    print(x)

b
a
n
a
n
a


Try continue command


b
a
a
a
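The `break` command mentioned in the comment above is not exercised in that cell. A minimal sketch, reusing the same string (the list `seen` is introduced here only to record what was printed):

```python
seen = []
for x in "banana":
    if x == "n":
        break          # leave the loop entirely at the first "n"
    seen.append(x)
    print(x)

print(seen)
```

Where `continue` skips only the current letter, `break` stops the loop, so nothing after the first "n" is printed.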


## Functions
   
Functions are blocks of code that run when called.

- Can pass parameters to a function. 
- A function can return a value.

Functions allow code to be more readable by allowing the hiding of details of an operation that may not be central to the understanding of the overall algorithm. Sometimes, this is called encapsulation. For example, perhaps we want to solve some sort of geometric problem, such as finding the height of a tree from the angle of the sun and the length of the shadow cast on the ground. The height calculation will involve intermediate calculations of trigonometric functions of the angle (e.g., sine, cosine, tangent). These sorts of intermediate calculations are naturally left to functions in python and other programming languages.

In addition, functions ...

- Assist in divide and conquer problem solving.
- Allow the function code to be reused in other parts of a larger program.
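The tree-height example described above can be sketched as a small function. This is an illustration only: the name `tree_height` and its parameters are hypothetical, the ground is assumed to be level, and the sun angle is taken to be the elevation above the horizon in degrees, so that height = shadow length × tan(angle).

```python
import math

def tree_height(shadow_length, sun_elevation_deg):
    """Estimate the height of a tree from its shadow (assumes level ground)."""
    angle = math.radians(sun_elevation_deg)   # math.tan expects radians
    return shadow_length * math.tan(angle)

# With the sun 45 degrees above the horizon, the shadow is as long as the tree.
h = tree_height(20.0, 45.0)
print(f'{h:.2f}')
```

The caller only needs to know what goes in and what comes out; the trigonometry stays hidden inside the function.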

## A Useful Function Example

According to Wikipedia, this algorithm is often called the Babylonian method (it is also known as Heron's method, after the first-century AD description by Hero of Alexandria) and is widely used for computing square roots by hand. The idea is this. If we want to find the square root of a positive number, say $Z$, we first start with a guess $x$, hoping $x^2 \approx Z$. Now if the original guess is too large, i.e., $x^2 > Z$, then $x > Z/x$, and so we can move in the correct direction (towards smaller values of $x$) by making a new guess equal to the average of $x$ and $Z/x$, i.e.,

New guess = $(x + Z/x)/2$.

If, on the other hand, the original guess was too small, i.e., $x^2 < Z$, then $x < Z/x$ and using the above formula for the new guess would move in the correct direction of larger values. The algorithm is implemented in Python in the function code below.

### Hand calculation example ...

Say Z = 10 and guess x = 3 for the square root. Then the next guess is the average of 3 and 3.3333..., which is approximately 3.16666... The next step in the algorithm gives an estimate of

3.1622
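The two steps above can be written out exactly. The labels $x_0$, $x_1$, $x_2$ for the successive guesses are notation introduced here for illustration:

$$x_1 = \frac{1}{2}\left(x_0 + \frac{Z}{x_0}\right) = \frac{1}{2}\left(3 + \frac{10}{3}\right) = \frac{19}{6} \approx 3.16667$$

$$x_2 = \frac{1}{2}\left(x_1 + \frac{Z}{x_1}\right) = \frac{1}{2}\left(\frac{19}{6} + \frac{60}{19}\right) = \frac{721}{228} \approx 3.16228$$

For comparison, $\sqrt{10} \approx 3.16228$, so after only two steps the estimate differs from the true value by about $3 \times 10^{-6}$.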

In [21]:
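Z = .89                       # number whose square root we want
x = [1, 1, 1, 1, 1, 1, 1, 1]  # x[0] is the initial guess; the rest get overwritten
N = len(x)
print(x[0])
i = 1
while i < N:
    x[i] = (x[i-1] + Z/x[i-1])/2  # average of the previous guess and Z/guess
    print(x[i])
    i = i+1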

1
0.9450000000000001
0.9433994708994708
0.9433981132066374
0.9433981132056604
0.9433981132056604
0.9433981132056604
0.9433981132056604


In [None]:
# Z is the positive number for which we want the square root. epsilon
# is the tolerance in the accuracy of the result.

def Newtroot(Z,epsilon):
    x = 1
    xp = (x + Z/x)/2
    e = (xp - x)/x
    while (e > epsilon) or (-epsilon > e):
        x = xp
        xp = (x + Z/x)/2
        e = (xp - x)/x
    return xp

In [None]:
# Example of the square root algorithm

z = 10
epsilon = 1e-12

print(Newtroot(z,epsilon))
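Because the cells above are shown without output, one way to check the result is to compare it against `math.sqrt`. The sketch below repeats the same iteration in a self-contained form; the name `babylonian_sqrt` is hypothetical, and `abs(...) > epsilon` is just a compact way of writing the two-sided tolerance test.

```python
import math

def babylonian_sqrt(Z, epsilon):
    """Approximate the square root of a positive number Z."""
    x = 1.0
    xp = (x + Z/x)/2
    # Keep iterating until the relative change between guesses is tiny.
    while abs((xp - x)/x) > epsilon:
        x = xp
        xp = (x + Z/x)/2
    return xp

approx = babylonian_sqrt(10, 1e-12)
print(approx, math.sqrt(10))
print(math.isclose(approx, math.sqrt(10), rel_tol=1e-12))
```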

## The math package
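Python's standard library provides many modules designed for specialized programming tasks, and many more third-party packages are available from <a href="https://pypi.org" target="_blank">The Python Package Index</a>.

The math package (strictly speaking, a module in Python's standard library, so it needs no separate installation) contains trigonometric, exponential, logarithmic, hyperbolic, and special functions. It also contains a number of useful constants such as `pi` and `e`.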

In [None]:
import math
P = math.pi
print(P)

Let's use the math package to create something interesting. 

In [None]:
# This short program will print a table with 50 rows where
# each row contains an argument and the sine of the argument.
# The sine function will trace one complete period of length
# 2*pi.
N = 50
for k in range(N):
    t = 2*P*k/N
    print(t, math.sin(t))

The table above is difficult to read because it is not well formatted. Python has tools to simplify formatting with the print command. There is a nice explanation in the Data Mine notes mentioned in the introduction. One preferred method uses `f-strings`, which is short for **formatted string literals**. They are indicated by preceding the string with `f` or `F`.

In [None]:
# Compare two ways of printing pi
print('The value of pi is approximately', P)
print()
print(f'The value of pi is approximately {math.pi:.3f}.')

Let's make a prettier (more readable) sinewave table.

In [None]:
# Prettier sine wave table.
N = 50
for k in range(N):
    t = 2*P*k/N
    print(f'{t: 1.2f}', '  ', f'{math.sin(t): .3f}')


## The NumPy Package

All data manipulated by a computer is represented in binary. In other words, by one method or another, all data -- temperature sensor readings, hourly barometric pressure from your Davis weather station, an audio file, images from your Bushnell game camera, a yield map -- are represented as arrays of numbers.

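**NumPy (Numerical Python)** provides an efficient interface to store and compute on dense data buffers. For large numerical data sets, NumPy arrays are much more efficient in both memory and speed than Python's built-in list data type, because every element of an array has the same type and the elements are stored together in one block of memory.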

See: <a href="http://www.numpy.org" target="_blank">The Numpy Package</a>.
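As a small sketch of what computing on an array looks like, the example below converts a few made-up Fahrenheit readings to Celsius, once with a plain list and once with a NumPy array. The variable names and values are illustrative only.

```python
import numpy as np

temps_f = [68.0, 71.6, 75.2]   # made-up Fahrenheit readings

# With a list, the arithmetic must be applied element by element.
temps_c_list = [(t - 32.0) * 5.0 / 9.0 for t in temps_f]

# With a NumPy array, the same arithmetic applies to every element at once.
temps_c = (np.array(temps_f) - 32.0) * 5.0 / 9.0

print(temps_c)
print(temps_c.dtype)                       # all elements share one type
print(np.allclose(temps_c, temps_c_list))  # both approaches agree
```

The array version is shorter to write, and for large arrays the loop over elements runs inside NumPy's compiled code rather than in Python.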

In [None]:
# Import the numpy package. This command allows us to refer to numpy
# commands using the shorthand "np".
import numpy as np

## Plotting examples with Matplotlib
A better way to plot. Who says there has been no progress in the world since 1970?

In [None]:
# Import matplotlib and define a shorthand
import matplotlib as mpl
import matplotlib.pyplot as plt

In [None]:
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([3, 1, 6, 5, 4, 11, -1, 1, 2, 6])

y

In [None]:
fig = plt.figure()
plt.style.use('classic')
plt.plot(x, y)
plt.title("Example Plot")
plt.xlabel("x")
plt.ylabel("y")
plt.grid()

In [None]:
x = np.linspace(0, 10, 100)
fig2 = plt.figure()
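# Matplotlib 3.6+ names this style 'seaborn-v0_8-dark-palette'.
new_name = 'seaborn-v0_8-dark-palette'
plt.style.use(new_name if new_name in plt.style.available else 'seaborn-dark-palette')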
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))


## United States wheat yield trends as a plotting example ...

Data was obtained from the United States Department of Agriculture: <a href="https://quickstats.nass.usda.gov/" target="_blank">Quick Stats</a> (The USDA's National Ag Statistics Service -- we will make more use of this in the lab later)

In [None]:
# The years for which we have average wheat yield data.
dates = np.array([2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000,1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970, 1969, 1968, 1967,1966, 1965, 1964, 1963, 1962, 1961, 1960, 1959, 1958, 1957, 1956, 1955, 1954, 1953, 1952, 1951, 1950, 1949, 1948, 1947, 1946, 1945,1944, 1943, 1942, 1941, 1940, 1939, 1938, 1937, 1936, 1935, 1934, 1933, 1932, 1931, 1930, 1929, 1928, 1927, 1926, 1925, 1924, 1923,1922, 1921, 1920, 1919, 1918, 1917, 1916, 1915, 1914, 1913, 1912,1911, 1910, 1909, 1908, 1907, 1906, 1905, 1904, 1903, 1902, 1901,1900, 1899, 1898, 1897, 1896, 1895, 1894, 1893, 1892, 1891, 1890,1889, 1888, 1887, 1886, 1885, 1884, 1883, 1882, 1881, 1880, 1879,1878, 1877, 1876, 1875, 1874, 1873, 1872, 1871, 1870, 1869, 1868, 1867, 1866])

In [None]:
dates

In [None]:
# The average wheat yield data, ordered to line up with the corresponding years.
yields = np.array([44.5, 49.7, 51.7, 47.6, 46.4, 52.7, 43.6, 43.7, 47.1, 46.2, 43.6, 46.1, 44.3, 44.8, 40.2, 38.6, 42. , 43.2, 44.2, 35. , 40.2, 42. , 42.7, 43.2, 39.5, 36.3, 35.8, 37.6, 38.2, 39.3, 34.3, 39.5, 32.7, 34.1, 37.7, 34.4, 37.5, 38.8, 39.4, 35.5, 34.5, 33.5, 34.2, 31.4, 30.7, 30.3, 30.6, 27.3, 31.6, 32.7, 33.9, 31. , 30.6, 28.4, 25.8, 26.3, 26.5, 25.8, 25.2, 25. , 23.9, 26.1, 21.6, 27.5, 21.8, 20.2, 19.8, 18.1, 17.3, 18.4, 16. , 16.5, 14.5, 17.9, 18.2, 17.2, 17. , 17.7, 16.4, 19.5, 16.8, 15.3, 14.1, 13.3, 13.6, 12.8, 12.2, 12.1, 11.2, 13.1, 16.3, 14.2, 13. , 15.4, 14.7, 14.7, 12.8, 16. , 13.3, 13.8, 12.7, 13.5, 12.9, 14.8, 13.2, 11.9, 16.7, 16.1, 14.4, 15.1, 12.4, 13.7, 15.5, 14.3, 14.2, 16. , 15.2, 12.9, 13.7, 14.9, 15. , 12.2, 12.5, 15.2, 14. , 12.8, 13.9, 13.5, 12.4, 14.2, 16.5, 12.2, 14. , 12.1, 13.3, 14.1, 11.4, 14.8, 12.3, 15.1, 11. , 13.2, 13. , 13.5, 14.1, 10.9, 11.1, 13. , 12.9, 11.8, 12.2, 12.1, 13.7, 12.9, 12.6, 11. ])

In [None]:
yields

In [None]:
# Plotting the wheat yield trend
fig = plt.figure()
plt.style.use('classic')
plt.plot(dates, yields)
plt.title("United States Average Wheat Yield by Year")
plt.xlabel("Year")
plt.ylabel("Yield in bushels per acre")
plt.grid()