# lambda functions

In one of the later examples, I created a lambda function.

A lambda function allows you to create and use a new short function without having to formally define it.

In [None]:
import re
states = ['   Alabama ', 'Georgia!', 'Georgia', 'georgia', 
          'FlOrIda', 'south  carolina##', 'West virginia?']

In [None]:
# I could define a function that replaces two spaces with one space:
def replace_space(x):
    return re.sub('  ', ' ', x)

In [None]:
# and then apply it to the strings:
list(map(replace_space, states))

In [None]:
# however, because the code for the function is so short, it might be easier to just create
# a quick function without a formal name. These 'anonymous' functions are also known as lambda functions

list(map(lambda x: re.sub('  ',' ', x), states))

In [None]:
list(map(lambda x: x.title(), states))

In [None]:
list(map(lambda x: re.sub('[?#!]','', x.title().strip()) , states))

lambda functions are written in the form:

`lambda argument1, argument2, etc: expression to return`

In [None]:
# lambda functions can accept multiple arguments
# if you use it with map, you'll need to provide a list for each argument
list(map(lambda x, y: x + y, [1,2,3], [100,200,300]))
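A minimal sketch of the form above, assuming nothing beyond the standard library. The names `add`, `add_lambda`, and `names` are illustrative and not part of the original notebook. It shows that a lambda behaves like a one-line `def`, and that lambdas are also convenient as the `key` argument of `sorted`.

```python
# A named function and an equivalent lambda (both names are hypothetical).
def add(x, y):
    return x + y

add_lambda = lambda x, y: x + y

# map pairs the lists element by element, one list per argument.
sums_def = list(map(add, [1, 2, 3], [100, 200, 300]))
sums_lambda = list(map(add_lambda, [1, 2, 3], [100, 200, 300]))

# A lambda used as a sort key: order names ignoring case and surrounding spaces.
names = ['  georgia', 'Alabama ', 'FlOrIda']
ordered = sorted(names, key=lambda s: s.strip().lower())

print(sums_def)
print(sums_lambda)
print(ordered)
```

Assigning a lambda to a name, as with `add_lambda`, is done here only for comparison; when a function needs a name, a regular `def` is usually the clearer choice.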

# Linear Algebra with NumPy

In [None]:
import numpy as np

In [None]:
x = np.array([[1,2],[3,4]])
print(x)

In [None]:
y = np.arange(1,5).reshape(2,2)
print(y)

In [None]:
x * x  # asterisk does elementwise multiplication (similar to R)

In [None]:
x @ x # @ sign does matrix multiplication, equivalent to R's %*%

In [None]:
np.dot(x, x)  # matrix multiplication can also be done via np.dot()

In [None]:
x @ x.T
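To make the difference between `*` and `@` concrete, here is a small self-contained check on the same 2x2 values. The variable names (`m`, `elementwise`, `matrix_prod`, `gram`) are illustrative only.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

elementwise = m * m       # multiplies matching entries: each entry is squared
matrix_prod = m @ m       # row-by-column products (true matrix multiplication)
via_dot = np.dot(m, m)    # same result as m @ m for 2-D arrays
gram = m @ m.T            # a matrix times its transpose is always symmetric

print(elementwise)
print(matrix_prod)
print(gram)
```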

## simple linear regression example

If we want to estimate the coefficients of a linear regression fit 

$$\hat{y} = \beta_0 + \beta_1 x$$

This can be achieved via linear algebra.

We present x as a matrix with one row for each observation: the first column is all 1s (to go with $\beta_0$) and the second column holds the values of x.

Y is a column matrix of values.

The coefficient estimates that minimize the sum of squares for linear regression are

$$\hat{\beta} = (x^Tx)^{-1} x^T y$$

In [None]:
x = np.array([[1,1,1,1],[1,2,3,4]]).T
y = np.array([2,6,4,8]).reshape(4,1)

In [None]:
x

In [None]:
y

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.scatter(x[:,1],y)
plt.show()

The coefficient estimates that minimize the sum of squares for linear regression are

$$\hat{\beta} = (x^Tx)^{-1} x^T y$$

In [None]:
np.linalg.inv(x.T @ x) @ x.T @ y

(matches the results from R)
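As a hedged cross-check, the same estimates can be obtained without forming an explicit inverse. The sketch below compares the formula above with `np.linalg.solve` and `np.linalg.lstsq`, which are generally preferred numerically. The names `X`, `beta_normal`, `beta_solve`, and `beta_lstsq` are illustrative.

```python
import numpy as np

# Same design matrix and response as in the example above.
X = np.array([[1, 1, 1, 1], [1, 2, 3, 4]]).T
y = np.array([2, 6, 4, 8]).reshape(4, 1)

# 1. The textbook formula with an explicit inverse.
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# 2. Solve the normal equations (X^T X) beta = X^T y without inverting.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3. Let NumPy solve the least-squares problem directly.
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal.ravel())
print(beta_solve.ravel())
print(beta_lstsq.ravel())
```

All three approaches should agree on an intercept of 1 and a slope of 1.6 for this data.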

## other linear algebra functions

In [None]:
xtx = x.T @ x
print(xtx)

In [None]:
np.linalg.inv(xtx)

In [None]:
a = np.linalg.cholesky(xtx)  # cholesky decomposition of a square matrix produces a lower triangular matrix
print(a)

In [None]:
a @ a.T  # recreate the original matrix

In [None]:
q,r = np.linalg.qr(xtx)  # qr decomposition

In [None]:
q # q is orthogonal, shown later

In [None]:
r # r is upper triangular

In [None]:
q @ r  #q times r is the original matrix

In [None]:
q @ q.T  # q is orthogonal, so q times its transpose gives the identity matrix
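One place the QR decomposition is useful is least squares. The sketch below is an assumption-laden illustration rather than part of the original notebook: it factors the design matrix `X` itself (not `xtx`) and solves $R\beta = Q^T y$. The names `X`, `Q`, `R`, and `beta_qr` are illustrative.

```python
import numpy as np

X = np.array([[1, 1, 1, 1], [1, 2, 3, 4]]).T   # 4 x 2 design matrix
y = np.array([2, 6, 4, 8]).reshape(4, 1)

# Reduced QR: Q is 4 x 2 with orthonormal columns, R is 2 x 2 upper triangular.
Q, R = np.linalg.qr(X)

# Since X = Q R and Q^T Q = I, the normal equations reduce to R beta = Q^T y.
beta_qr = np.linalg.solve(R, Q.T @ y)

print(beta_qr.ravel())
print(np.round(Q.T @ Q, 10))   # identity, up to rounding
```

Because `Q` is not square here, only `Q.T @ Q` gives the identity; `Q @ Q.T` does not.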

In [None]:
val, vec = np.linalg.eig(xtx)  # eigenvalues and eigenvectors of the matrix

In [None]:
print(val)

In [None]:
print(vec)

In [None]:
xtx @ vec[:,0]  # the matrix times its first eigenvector produces a vector that is...

In [None]:
vec[:,0] * val[0]  # ...equal to that eigenvector multiplied by its eigenvalue (a scalar)
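The eigenvector relationship can be checked for all eigenpairs at once. This is a small sketch with illustrative names (`A`, `recon`); the matrix `A` has the same values as `x.T @ x` in the regression example.

```python
import numpy as np

A = np.array([[4.0, 10.0], [10.0, 30.0]])   # same values as x.T @ x above

val, vec = np.linalg.eig(A)

# Each column of vec is an eigenvector; broadcasting vec * val scales
# column i by eigenvalue val[i], so this checks A v = lambda v for every pair.
lhs = A @ vec
rhs = vec * val
print(np.allclose(lhs, rhs))

# The matrix can be rebuilt from its eigenvalues and eigenvectors.
recon = vec @ np.diag(val) @ np.linalg.inv(vec)
print(np.allclose(recon, A))
```

For a symmetric matrix such as this one, `np.linalg.eigh` is an alternative that returns real eigenvalues in ascending order.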

# Pandas

NumPy creates ndarrays that must contain values that are of the same data type.

Pandas creates dataframes. Each column in a dataframe is a Series, which is typically backed by an ndarray. This allows us to have traditional tables of data where each column can be a different data type.

In [None]:
import numpy as np
import pandas as pd

The basic data structure in pandas is the *series*. You can construct it in a similar fashion to making a numpy array.

The command to make a Series object is

`pd.Series(data, index=index)`

The `index` argument is optional.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
print(type(data))

The series is printed out in a table form.

The type is a pandas Series.

In [None]:
print(data.values)
print(type(data.values))

The values attribute of the series is a numpy array.

In [None]:
print(data.index)
print(type(data.index))  # the row names are known as the index

You can subset a pandas series like other python objects

In [None]:
print(data[1])
print(type(data[1]))  # when you select only one value, it simplifies the object

In [None]:
print(data[1:3])
print(type(data[1:3]))  # selecting multiple values returns a series

In [None]:
print(data[np.array([1,0,2])])  # fancy indexing using a numpy array

In [None]:
# specifying the index values
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)

In [None]:
data.iloc[1]  # subset with index position (plain data[1] on a non-integer index is deprecated in recent pandas)

In [None]:
data["a"]  # subset with index names

In [None]:
data["a":"c"] # using names includes the last value

In [None]:
data[0:2]  # integer slices are still positional and exclude the end position
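To keep label-based and position-based access unambiguous, pandas provides the `.loc` and `.iloc` accessors. A minimal sketch on the same values follows; the variable names `s`, `by_label`, `by_position`, and `second` are illustrative.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']      # label slice: the end label 'c' is included
by_position = s.iloc[0:2]      # position slice: position 2 is excluded
second = s.iloc[1]             # single value by position

print(by_label)
print(by_position)
print(second)
```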

In [None]:
# creating a series from a python dictionary
# remember, dictionary construction uses curly braces {}
samp_dict = {'Archie': 71,
             'Betty': 66,
             'Veronica': 62,
             'Jughead': 72,
             'Cheryl': 66}
samp_series = pd.Series(samp_dict)
samp_series  # current pandas keeps the dictionary's insertion order (older versions alphabetized the index)

In [None]:
print(samp_series.index)

In [None]:
print(type(samp_dict))
print(type(samp_series))

In [None]:
actor_dict = {'Archie': "KJ",
              'Jughead': "Cole",
              'Betty': "Lili",
              'Veronica': "Camila",
              'Cheryl': "Madelaine"}  # note that the dictionary order is not the same here
actor = pd.Series(actor_dict)  # keeps this dictionary's order in current pandas (older versions alphabetized the index)
print(actor)
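If a particular ordering is wanted, it is safer to sort explicitly than to rely on how the Series was constructed. A short sketch, assuming a reasonably recent pandas; the names `heights`, `sorted_by_name`, and `sorted_by_value` are illustrative.

```python
import pandas as pd

heights = pd.Series({'Archie': 71, 'Betty': 66, 'Veronica': 62,
                     'Jughead': 72, 'Cheryl': 66})

sorted_by_name = heights.sort_index()                    # alphabetical by index label
sorted_by_value = heights.sort_values(ascending=False)   # tallest first

print(sorted_by_name)
print(sorted_by_value)
```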

# Creating a DataFrame

In [None]:
# we create a dataframe by providing a dictionary of series objects
riverdale = pd.DataFrame({'height': samp_series,
                          'actor': actor})

print(riverdale)

In [None]:
print(type(riverdale))  # this is a DataFrame object
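When a DataFrame is built from a dictionary of Series, the rows are matched by index label, not by position. The sketch below uses deliberately mismatched, hypothetical Series (`height`, `actor_partial`) to show that labels missing from one Series are filled with `NaN`.

```python
import pandas as pd

# Two Series with different orders and partly different labels (illustrative data).
height = pd.Series({'Archie': 71, 'Betty': 66, 'Veronica': 62})
actor_partial = pd.Series({'Betty': 'Lili', 'Archie': 'KJ', 'Jughead': 'Cole'})

df = pd.DataFrame({'height': height, 'actor': actor_partial})
print(df)

# Values line up by label: Betty's height sits next to Betty's actor.
print(df.loc['Betty'])
```

A side effect worth knowing is that the `height` column becomes floating point, because `NaN` cannot be stored in a plain integer column.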

In [None]:
data = [{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 5}]  # data is a list of dictionaries
data

In [None]:
print(pd.DataFrame(data, index = [1,2,3]))

In [None]:
data = [{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'c': 5}]  # data is a list of dictionaries
data

In [None]:
print(pd.DataFrame(data))

In [None]:
data = [{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'c': "5"}]  # data is a list of dictionaries
data

In [None]:
print(pd.DataFrame(data))
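When the dictionaries do not all share the same keys, pandas fills the gaps with `NaN`. The sketch below, using illustrative names (`records`, `missing_per_column`, `filled`), counts the missing entries and shows one way to fill them.

```python
import pandas as pd

records = [{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'c': 5}]
df = pd.DataFrame(records)

# Count missing values per column: 'b' lacks one entry, 'c' lacks two.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Columns containing NaN are stored as floats; 'a' stays integer.
print(df.dtypes)

# Replace missing values with 0 (only sensible if 0 is a meaningful default).
filled = df.fillna(0)
print(filled)
```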

In [None]:
data = np.random.randint(10, size = 10).reshape([5,2])
print(data)

In [None]:
print(pd.DataFrame(data, columns = ["x","y"], index = ['a','b','c','d','e']))

# Subsetting the DataFrame

In [None]:
print(riverdale)

In [None]:
print(riverdale.keys())

In [None]:
riverdale['actor']  # extracting the column

In [None]:
riverdale.actor

In [None]:
riverdale.actor.iloc[1]  # select by position (plain [1] on a non-integer index is deprecated in recent pandas)

In [None]:
riverdale.actor['Jughead']

In [None]:
print(riverdale.T)  # prints a copy of the transpose

In [None]:
print(riverdale.loc['Jughead'])  # subset by index label to get a row
print(type(riverdale.loc['Jughead']))
print(type(riverdale.loc['Jughead'].values))  # the values are of mixed type but are still a numpy array.
# this is possible because the array has dtype object, so each element can be any Python object.
# (structured numpy arrays, covered in "Python for Data Science" chapter 2, are a different mechanism)

In [None]:
print(riverdale.loc[:,'height'])  # subset by label to get a column
print(type(riverdale.loc[:,'height']))  #the object is a pandas series
print(type(riverdale.loc[:,'height'].values))
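The difference between a row and a column shows up in the `dtype` of the underlying array. A small sketch on a two-row stand-in for the dataframe above (the names `df`, `row_values`, and `col_values` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'height': [71, 72], 'actor': ['KJ', 'Cole']},
                  index=['Archie', 'Jughead'])

row_values = df.loc['Jughead'].values      # mixes an integer and a string
col_values = df.loc[:, 'height'].values    # a single column, one data type

print(row_values, row_values.dtype)   # generic object dtype
print(col_values, col_values.dtype)   # integer dtype
```

Object arrays are flexible but lose the speed of typed arrays, which is one reason to work column-wise where possible.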

In [None]:
riverdale.loc['Archie','height']  # you can provide a pair of 'coordinates' to get a particular value

In [None]:
riverdale.iloc[3]  # subset a row by integer position

In [None]:
riverdale.iloc[0, 1] # pair of coordinates
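Both accessors also take lists, slices, and boolean masks. The sketch below rebuilds a comparable dataframe from scratch so that it runs on its own; the names `df`, `tall`, `first_two`, and `picked` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'height': [71, 66, 62, 72, 66],
     'actor': ['KJ', 'Lili', 'Camila', 'Cole', 'Madelaine']},
    index=['Archie', 'Betty', 'Veronica', 'Jughead', 'Cheryl'])

# Boolean mask on rows, label on columns: actors of characters taller than 70.
tall = df.loc[df['height'] > 70, 'actor']

# Position slices: the first two rows, all columns.
first_two = df.iloc[:2, :]

# Lists of labels return a DataFrame, even for a single column.
picked = df.loc[['Betty', 'Cheryl'], ['height']]

print(tall)
print(first_two)
print(picked)
```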