# Jupyter Notebook structure

Jupyter notebooks have two types of cells: one for notes/markdown (like this one) and one for code (like the next one). To run any cell and proceed to the next cell, use Shift-Enter. To run a cell and keep the focus in that cell, use Ctrl-Enter. There are two modes of each cell: Command mode (highlighted in blue on the left) and Edit mode (highlighted in green). To get a list of keyboard shortcuts for each mode, go to "Keyboard Shortcuts" in the above "Help" pulldown menu.

Please note that there are <i>many</i> helpful tutorials for Python on the web. With a little Googling you should be able to find one that suits your tastes and skill level. For example, for a focus on machine learning you could look [here](https://www.quora.com/What-is-the-best-Python-tutorial-for-machine-learning). If you'd like any suggestions, let us know. 


In [None]:
6*9**2 - 486 + 54

Code cells can be run in any order. The number shown next to each code cell indicates the order in which the cells were run, and a cell's output is printed directly below it.

# Python Syntax

In [None]:
#The symbol for comments in code is the same as in R: the '#' symbol
euler = 2.71828     #to assign a value to a variable use the '=' symbol
euler            #to view the value of a variable, put it as the last line of the cell

One of the main characteristics of Python is its intentionally easy-to-read-and-write syntax. Many programming languages, such as C, C++, and Java, require a semicolon at the end of each statement. Others, including R and Python, interpret the end of a line as the end of a statement. Most languages do not require indentation in conditional statements and loops, and instead use brackets and parentheses to mark blocks of code. In Python, brackets and parentheses are kept to a minimum, and indentation is required.

In [None]:
for i in range(5,10):    #range(k,j) is like k:(j-1) in R (the endpoint j is excluded) and is meant for loops.
    q = i + 6 - 8  #addition and subtraction
    q = (q*10)/4   #multiplication and division
    print((i,q**2))    #exponentiation

In [None]:
h = 0
while h > -15 and h <= 15:
    h = h**2 +1
    print(h)

In [None]:
for j in range(10):
    if j%2 == 0 or j%7 == 0:  #if statements and modular arithmetic
        print(j)
    else:
        print("So sad")    

# Lists

Some of the most well-known features of Python are related to its use of lists.

In [None]:
h = []    #to define a list use square brackets. This is kind of like c() in R, but is a list instead of a vector/array
type(h)

In [None]:
h = [(i**2 +3)%5 for i in range(10)]  #this way of building a list with a loop inside the brackets is called "list comprehension"
h

The above kind of expression is very handy in Python. It's called a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
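For comparison, here is a minimal sketch of the same list built first with an explicit loop and then with a comprehension. The names `h_loop`, `h_comp`, and `even_squares` are illustrative only and do not appear elsewhere in this notebook.

```python
# Build the list with an explicit for loop
h_loop = []
for i in range(10):
    h_loop.append((i**2 + 3) % 5)

# Build the same list with a list comprehension
h_comp = [(i**2 + 3) % 5 for i in range(10)]

# A comprehension can also filter its elements with a trailing "if"
even_squares = [i**2 for i in range(10) if i % 2 == 0]

print(h_loop == h_comp)
print(even_squares)
```

The comprehension is usually preferred when the loop body is a single expression; longer logic tends to read better as a regular loop.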

In [None]:
len(h)    #get the length of a list with the len() function, which is equivalent to R's length() function

In [None]:
h[0]    #index a list with square brackets, with 0 (instead of 1 as in R) representing the first element

In [None]:
h[0:2]   #index a range of elements with the ':' symbol; the range starts at the first index and stops just before
         #the second index (the end is exclusive)

In [None]:
h[:2]    #the first number can even be left off if we start from the beginning

In [None]:
h[5:]    #and of course the last number can be left off if we go until the end

In [None]:
#What if we want to get the last element? In R we would need to do h[length(h)]. Python is a lot simpler
print(h[len(h)-1])
h[-1]

In [None]:
h[-3:]    #get the last 3 elements of a list
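Slices also accept an optional third number, the step. The sketch below uses a small made-up list (`letters` is not part of the notebook) so the results are easy to check by eye.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

every_other = letters[::2]        # start at the beginning, take every second element
reversed_letters = letters[::-1]  # a step of -1 walks the list backwards
middle = letters[1:-1]            # everything except the first and last elements

print(every_other)
print(reversed_letters)
print(middle)
```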

So what math operations can be applied to lists?

In [None]:
a = [1,2,3]
b = [7,8,9]
a + b     #Adding lists concatenates the two lists in the respective order

In [None]:
a = ['hello', 2, "u"]   #and anything can go into a list
b = ["bye", 874]
a + b

In [None]:
#So does subtraction remove elements from a list?
l1 = ['a', 'b', 'c']
l2 = ['c']
l1 - l2

In [None]:
#Nope, that raises a TypeError. One needs to use the .remove() method to do this
l1.remove('c')
l1

In [None]:
g = [1,2]
6*g     #multiplication by an integer n concatenates n copies of that list together

In [None]:
#In order to do math operations on the elements of a list, we need to convert it to an array
import numpy as np    #import packages with the import command, kind of like library() in R
g2 = [1,2,3,4]
3.14*g2    #multiplying a list by a float raises a TypeError

In [None]:
3.14*np.array(g2)

In [None]:
g1 = np.array([1,2,3,4])
g2 = np.array([5,10,15,20])
g1 + g2

In [None]:
g1 - g2

In [None]:
g1*g2

In [None]:
g2/g1

In [None]:
g3 = np.array([1, 2, 1, 2])
g1**g3

## Warning:
While lists are very flexible and easy to work with in Python, they come with a major caution. Assigning an existing list to a new variable does not copy it: both names refer to the same list object. If you then modify the list in place through the new name, the change is also visible through the original name.

In [None]:
list1 = [(2*j)%3 for j in range(2,5)]
list1

In [None]:
list2 = list1
list2

In [None]:
list2[0] = -600
list1

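To get around this issue, you can normally use the list() function. This constructs a new list rather than creating another reference to the original one. Note that the copy is shallow: the new list holds the same elements, so mutable elements such as nested lists are still shared.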

In [None]:
list1 = [(2*j)%3 for j in range(2,5)]
list2 = list(list1)
list2[0] = 600
list1
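One caveat worth a quick sketch: `list()` copies only the outer list. If the elements are themselves lists, they are still shared, and `copy.deepcopy` from the standard library is one way to copy them too. The variable names below are illustrative and not part of the notebook.

```python
import copy

nested = [[1, 2], [3, 4]]
shallow = list(nested)          # new outer list, but the same inner lists
deep = copy.deepcopy(nested)    # new outer list and new inner lists

shallow[0][0] = -600            # modifies an inner list that `nested` also holds
deep[1][1] = 999                # modifies only the deep copy

print(nested)
print(shallow is nested)        # the outer lists are different objects
print(shallow[0] is nested[0])  # but the inner lists are the same object
```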

# Strings

Strings are often easier to work with in Python than in R, simply because of the many built-in operations that work on them.

In [None]:
#Just like in R, you can use either single quotes 'string' or double quotes "string" in defining a string
s1 = 'look at-this_string!'
s2 = "Look at this 1 2!!!"

In [None]:
s1[:4]   #index a range of characters in a string

In [None]:
splitted = s2.split("this")    #split a string at a given sequence of characters to create a list
splitted

In [None]:
splitted2 = s2.split(' ')
splitted2

In [None]:
s1 + s2        #adding two strings concatenates them together in the respective order

In [None]:
3*s2          #multiplying a string by a positive integer n concatenates n copies of that string

In [None]:
integer1 = -9
float1 = 3.1415
string1 = 'Yale'
"How %d cool can st%frings get at %s !!"%(integer1, float1, string1)   #insert the values of variables into a string. Kind of
                                                                         #like sprintf() in R.
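The `%` style above still works, but the same string can also be built with `str.format` or, in Python 3.6 and later, with an f-string. The sketch below shows all three side by side; the names `old_style`, `with_format`, and `with_fstring` are illustrative.

```python
integer1 = -9
float1 = 3.1415
string1 = 'Yale'

# The %-style formatting used above
old_style = "How %d cool can st%frings get at %s !!" % (integer1, float1, string1)

# The same string with str.format
with_format = "How {:d} cool can st{:f}rings get at {} !!".format(integer1, float1, string1)

# The same string with an f-string (Python 3.6+)
with_fstring = f"How {integer1:d} cool can st{float1:f}rings get at {string1} !!"

print(with_fstring)
print(f"{float1:.2f}")    # a format spec can also limit the number of decimals
```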

There's a plethora of additional string operations, based on regular expressions, available in the "re" module of the standard library. Check it out!
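As a small taste, here is a sketch of three common `re` functions applied to the two strings defined earlier. The patterns and the result names (`words`, `digits`, `calmer`) are just examples chosen for illustration.

```python
import re

s1 = 'look at-this_string!'
s2 = "Look at this 1 2!!!"

# Split on a space, a hyphen, or an underscore in one call
words = re.split(r"[ \-_]", s1)

# Find every run of digits
digits = re.findall(r"\d+", s2)

# Replace every run of one or more "!" with a single period
calmer = re.sub(r"!+", ".", s2)

print(words)
print(digits)
print(calmer)
```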

# Functions

Other than the way they are defined, functions work essentially the same in Python as they do in R. Parentheses enclose the arguments of the function, and remember that indentation is required for the body of the function.

In [None]:
def myfunction(a, b = "great"):    #define a function with 'def', the same as 'myfunction <- function(a, b = "great"){}' in R
    h = [i*a for i in range(5)]
    for j in range(len(h)):
        h[j] = str(h[j]) + b
    return h

In [None]:
myfunction(1)

In [None]:
myfunction(5, b = " bulldogs")
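As a sketch of how the pieces covered so far fit together, the same function could also be written with a list comprehension. The name `myfunction_short` is hypothetical; it is not defined anywhere else in this notebook.

```python
def myfunction_short(a, b="great"):
    """Return i*a for i = 0..4, each converted to a string with b appended."""
    return [str(i * a) + b for i in range(5)]

print(myfunction_short(1))

# Arguments passed by name can be given in any order
print(myfunction_short(b=" bulldogs", a=5))
```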

# Dataframes

Dataframes are the most standard way of organizing data. They are similar to a matrix with each row representing a single observation and each column representing a variable. Dataframes in python work almost exactly the same as in R. The <i>pandas</i> package is what almost everyone uses for this.

In [None]:
import pandas as pd

In [None]:
iris = pd.read_csv("iris.csv")    #read in the dataset in the file "iris.csv" as a dataframe
iris                #view the dataframe

In [None]:
iris["Sepal.Width"].values            #select the column for Sepal.Width as an array

In [None]:
iris_virginica = iris[iris["Species"] == "virginica"]
iris_virginica

There is a whole plethora of things you can do with pandas dataframes. For more details on them see [this link](https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96) or just search <i>pandas dataframes</i> online.
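To try a few of these operations without needing a file on disk, here is a sketch that builds a tiny dataframe in memory. The six rows are made-up numbers, not values from iris.csv; only the column names are borrowed from that dataset.

```python
import pandas as pd

# A tiny made-up table reusing two column names from the iris data
flowers = pd.DataFrame({
    "Sepal.Width": [3.5, 3.0, 3.2, 2.8, 3.3, 2.7],
    "Species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

wide = flowers[flowers["Sepal.Width"] > 3.0]                     # keep rows that meet a condition
mean_width = flowers.groupby("Species")["Sepal.Width"].mean()    # mean of a column within each group
first_two = flowers.iloc[:2]                                     # select rows by position

print(wide)
print(mean_width)
```

Here `groupby(...).mean()` plays roughly the role of `aggregate()` or `tapply()` in R.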

# Plotting Data

In [None]:
import matplotlib.pyplot as plt  #to plot data, one would need to import a plotting package. matplotlib.pyplot is my favorite

In [None]:
x = np.linspace(-5, 5, 100)
y = 0.5*x**2
plt.plot(x, y)    #plot the data
plt.xlabel("X-Axis", fontsize=14)   #add a label to the x axis
plt.ylabel("Y-Axis", fontsize=14)    #add a label to the y axis
plt.title("Python Plot", fontsize=16)   #add a plot title
plt.show()    #Show us the full plot

In [None]:
randomx = np.random.normal(2, 1, 1000)    #does the same thing as 'rnorm(1000, 2, 1)' in R
randomx2 = np.random.uniform(0, 4, 1000)   #same as 'runif(1000, 0, 4)' in R
plt.hist(randomx, color='blue')       #plot the histogram of a single variable
plt.hist(randomx2, color='red', histtype='step', lw=3)      #plot another histogram on top of the first but don't fill in bins
plt.show()

In [None]:
x = np.random.uniform(-10, 10, 1000)         #simulate data based on a true model with normal noise
y = -5 + 1.2*x + 0.3*x**2 + np.random.normal(0, 3, 1000)

X = np.array([np.ones(1000), x, x**2]).T               #calculate the least-squares estimates of the coefficients
beta_hat = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)
Y_hat = np.dot(X, beta_hat)

plt.scatter(x, y, alpha=0.3)    #plot the data
plt.plot(x, Y_hat, '.', c='r')   #show the least-squares fitted curve
plt.show()

beta_hat     #print the estimates of the coefficients
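The cell above computes the estimates from the normal equations by inverting X.T X explicitly. As an alternative sketch, `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. The fixed seed and the names `rng`, `beta_normal`, and `beta_lstsq` are assumptions made here so the comparison is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed so the sketch is repeatable
x = rng.uniform(-10, 10, 1000)
y = -5 + 1.2*x + 0.3*x**2 + rng.normal(0, 3, 1000)

# Design matrix with columns 1, x, x^2
X = np.column_stack((np.ones(1000), x, x**2))

# Normal equations, as in the cell above
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver; also returns the residual sum of squares, the rank of X,
# and the singular values of X
beta_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
print(np.allclose(beta_normal, beta_lstsq))
```

For a small, well-conditioned problem like this one the two approaches should agree closely; the solver tends to be the safer choice when the columns of X are nearly collinear.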

In [None]:
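%matplotlib inline
#scipy.misc.imread was removed in SciPy 1.2, so read the image with matplotlib instead

In [None]:
image = plt.imread("yalelogo.jpeg")    #returns a numpy array (reading JPEG files requires the Pillow package)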
type(image)

In [None]:
np.shape(image)

In [None]:
plt.imshow(image[:,:,2], cmap=plt.cm.Blues)    #show the blue channel; an RGB image has channels 0, 1, and 2

In [None]:
image_matrix = np.array([[0.4*(i - 150)**2 + (j - 75)**2 for i in range(300)] for j in range(150)])
print(np.shape(image_matrix))
print(type(image_matrix))
plt.imshow(image_matrix, cmap = plt.cm.afmhot)

# Python Equivalents of Common R Commands

### Statistics


In [None]:
#Python                                                        R
x1 = np.random.normal(0, 2, 100)                        # x1 <- rnorm(100, 0, 2)
x2 = np.random.uniform(-3, 8, 100)                      # x2 <- runif(100, -3, 8)
x3 = np.random.gamma(2, 3, 100)                         # x3 <- rgamma(100, 2, scale = 3)

In [None]:
np.mean(x1)                                             # mean(x1)

In [None]:
np.std(x1, ddof=1)                                      # sd(x1); ddof=1 divides by n-1 as R does
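One detail to watch when translating `sd()`: by default `np.std` divides by n (the population formula), while R's `sd()` divides by n - 1. The sketch below uses a small made-up array (`data` is not part of the notebook) to show the difference.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_sd = np.std(data)              # default ddof=0: divides by n
sample_sd = np.std(data, ddof=1)   # ddof=1: divides by n - 1, matching R's sd()

print(pop_sd)
print(sample_sd)
```

The same `ddof` argument applies to `np.var`.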

In [None]:
np.median(x2)                                           # median(x2)

In [None]:
np.max(x2)                                              # max(x2)

In [None]:
np.min(x2)                                              # min(x2)

In [None]:
np.percentile(x3, 90)                                   # quantile(x3, 0.9)

### Vectors and matrices

In [None]:
#Python                                                         R
vec1 = np.linspace(-6, 10, 10)                         # vec1 <- seq(-6, 10, length.out = 10)
vec2 = np.arange(10)                                   # vec2 <- 0:9

In [None]:
np.sort(vec1**2)                                       # sort(vec1^2)

In [None]:
np.argsort(vec1**2)                                    # order(vec1^2) - 1, since Python indices start at 0

In [None]:
np.where((vec1 >= 3) & (vec1 < 9))[0]                  # which((vec1 >= 3) & (vec1 < 9)) - 1

In [None]:
np.vstack((vec1, vec2))                                # rbind(vec1, vec2)

In [None]:
np.vstack((vec1, vec2)).T                             # cbind(vec1, vec2)

In [None]:
np.dot(vec1, vec2)                                    # t(vec1) %*% vec2

### Math Operations

In [None]:
# Python                                                      R
x = np.pi                                             # x = pi
np.sin(x)                                             # sin(x)

In [None]:
np.cos(x)                                             # cos(x)

In [None]:
np.tan(x)                                             # tan(x)

In [None]:
np.exp(x)                                             # exp(x)

In [None]:
np.log(x)                                             # log(x)

# Example: Linear Regression

To perform linear regression in Python we need to import a package such as <i>statsmodels, scipy, sklearn,</i> etc. There are also ways to do it directly with pandas dataframes, <i>TensorFlow</i>, etc.

$\text{Sepal.Length}_{i} = \beta_{0} + \beta_{1} \text{Sepal.Width}_{i} + \beta_{2} \text{Petal.Width}_{i} + \beta_{3} \text{Petal.Length}_{i} + \varepsilon_{i}$

In [None]:
import statsmodels.api as sm
model = sm.OLS(iris["Sepal.Length"], sm.add_constant(iris[["Sepal.Width", "Petal.Width", "Petal.Length"]])).fit()

In [None]:
print(model.summary())