# AM120 HW05
## Zachary Miller
### 1a

In [35]:
import numpy as np
import scipy.linalg

In [40]:
A = np.array([[4,0,4],
              [0,-3,-4],
              [8,-3,4],
              [20,-6,12]])
b = np.array([-4,1,-3,0]).T

We are asked to solve the above system using least squares, which means we want to solve the normal equations $A^TA\vec{x}=A^T\vec{b}$. However, as we can see below, $A^TA$ has a determinant of approximately zero and is of rank 2, so there are infinitely many solutions for $\vec{x}$.

In [41]:
A_tA = A.T@A
print("A_tA: \n", A_tA)
print("Determinant of A_tA: \n", np.linalg.det(A_tA))
print("Rank of A_tA: \n", np.linalg.matrix_rank(A_tA))

A_tA: 
 [[ 480 -144  288]
 [-144   54  -72]
 [ 288  -72  192]]
Determinant of A_tA: 
 1.9645085558295291e-10
Rank of A_tA: 
 2


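In order to find a unique solution, we have to add an additional constraint to our original least-squares formulation. We now require that, among all $\vec{x}$ that minimize the error, the solution also has the smallest norm. This solution can be found using the pseudoinverse: $\vec{x}=(A^TA)^{\dagger}A^T\vec{b}$, which simplifies to $\vec{x}=A^{\dagger}\vec{b}$. Writing the SVD as $A=U\Sigma V^T$, the pseudoinverse is $A^{\dagger}=V\Sigma^{\dagger}U^T$, where $\Sigma^{\dagger}$ is the transpose of $\Sigma$ with its nonzero singular values inverted. Solving below...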

In [54]:
# Get the SVD of A
U, Sigma, V_t = np.linalg.svd(A, full_matrices=True)

# Make the Sigma array from the list of singular values
Sigma_diag = np.diag(Sigma)
Sigma_mat = np.zeros(A.shape)
Sigma_mat[:Sigma_diag.shape[0],:Sigma_diag.shape[1]] = Sigma_diag

# Calculate the pseudo inverse of Sigma_mat
Sigma_mat_pinv = np.copy(Sigma_mat)
Sigma_mat_pinv[Sigma_mat_pinv<1e-14]=0
Sigma_mat_pinv[Sigma_mat_pinv != 0] = 1/Sigma_mat_pinv[Sigma_mat_pinv != 0]
Sigma_mat_pinv = Sigma_mat_pinv.T

# Calculate the pseudo inverse of A using the formula above
A_pinv = V_t.T@Sigma_mat_pinv@U.T
my_x = A_pinv@b
py_x = np.linalg.lstsq(A,b,rcond=None)[0]

print("Pseudo Inverse of Sigma:\n", Sigma_mat_pinv)
print("\nPseudo Inverse of A:\n", A_pinv)
print("\nMy Solution:\n", my_x)
print("\nSolution using np.linalg.lstsq:\n", py_x)


Pseudo Inverse of Sigma:
 [[0.03785218 0.         0.         0.        ]
 [0.         0.18878107 0.         0.        ]
 [0.         0.         0.         0.        ]]

Pseudo Inverse of A:
 [[-0.00857843  0.03676471  0.01960784  0.03063725]
 [ 0.04411765 -0.11764706 -0.02941176 -0.01470588]
 [ 0.0502451  -0.12009804 -0.01960784  0.01102941]]

My Solution:
 [ 0.0122549  -0.20588235 -0.2622549 ]

Solution using np.linalg.lstsq:
 [ 0.0122549  -0.20588235 -0.2622549 ]


Comparing our answer to the one obtained using np.linalg.lstsq, we can see that they are the same. From this, we can infer that when faced with infinitely many possible least-squares solutions, np.linalg.lstsq returns the one with the smallest norm.
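As a rough check of the minimum-norm claim, the sketch below adds a multiple of a null-space vector to the solution and compares residuals and norms. The names `x_min`, `x_other`, and `null_vec` are illustrative and not part of the original work; the step size 0.5 is arbitrary.

```python
import numpy as np

A = np.array([[4, 0, 4],
              [0, -3, -4],
              [8, -3, 4],
              [20, -6, 12]], dtype=float)
b = np.array([-4, 1, -3, 0], dtype=float)

# Minimum-norm least-squares solution, as returned by np.linalg.lstsq
x_min = np.linalg.lstsq(A, b, rcond=None)[0]

# A has rank 2 and 3 columns, so the last right singular vector spans its null space
_, _, V_t = np.linalg.svd(A)
null_vec = V_t[-1]

# Moving along the null space leaves the residual unchanged but increases the norm
x_other = x_min + 0.5 * null_vec
res_min = np.linalg.norm(A @ x_min - b)
res_other = np.linalg.norm(A @ x_other - b)

print("Residual norms:", res_min, res_other)
print("Solution norms:", np.linalg.norm(x_min), np.linalg.norm(x_other))
```

Both candidates fit the data equally well, so the norm is the only thing that distinguishes them.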

### 1b

In [58]:
A = np.array([[-3,-3,2],[-9,-9,6]])
b = np.array([2,4]).T

print("A:\n", A)

A:
 [[-3 -3  2]
 [-9 -9  6]]


Just by looking at the printout of $A$ above, we can see that the second column is the same as the first, and the third column is $-2/3$ times the first column. Therefore, we can immediately tell that the columns are all linearly dependent, and the rank of $A$ is 1. This implies that there are two free variables. Note also that the second row of $A$ is three times the first while $b_2 \neq 3b_1$, so the system has no exact solution, but it does have infinitely many least-squares solutions. Again, in order to find a unique solution, we will use the pseudoinverse of $A$ (which we will again find via SVD) to compute the minimum-norm least-squares solution $\vec{x}=A^{\dagger}\vec{b}$. Doing so below...

In [59]:
# Get the SVD of A
U, Sigma, V_t = np.linalg.svd(A, full_matrices=True)

# Make the Sigma array from the list of singular values
Sigma_diag = np.diag(Sigma)
Sigma_mat = np.zeros(A.shape)
Sigma_mat[:Sigma_diag.shape[0],:Sigma_diag.shape[1]] = Sigma_diag

# Calculate the pseudo inverse of Sigma_mat
Sigma_mat_pinv = np.copy(Sigma_mat)
Sigma_mat_pinv[Sigma_mat_pinv<1e-14]=0
Sigma_mat_pinv[Sigma_mat_pinv != 0] = 1/Sigma_mat_pinv[Sigma_mat_pinv != 0]
Sigma_mat_pinv = Sigma_mat_pinv.T

# Calculate the pseudo inverse of A using the formula above
A_pinv = V_t.T@Sigma_mat_pinv@U.T
my_x = A_pinv@b
py_x = np.linalg.lstsq(A,b,rcond=None)[0]

print("Pseudo Inverse of Sigma:\n", Sigma_mat_pinv)
print("\nPseudo Inverse of A:\n", A_pinv)
print("\nMy Solution:\n", my_x)
print("\nSolution using np.linalg.lstsq:\n", py_x)


Pseudo Inverse of Sigma:
 [[0.06741999 0.        ]
 [0.         0.        ]
 [0.         0.        ]]

Pseudo Inverse of A:
 [[-0.01363636 -0.04090909]
 [-0.01363636 -0.04090909]
 [ 0.00909091  0.02727273]]

My Solution:
 [-0.19090909 -0.19090909  0.12727273]

Solution using np.linalg.lstsq:
 [-0.19090909 -0.19090909  0.12727273]
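The following sketch checks two properties of this solution under the setup above: the residual is nonzero (so no exact solution exists), and the solution is parallel to the single independent row of $A$, which is what a minimum-norm solution of a rank-1 system should look like. The variable names `residual`, `row`, and `cosine` are illustrative.

```python
import numpy as np

A = np.array([[-3, -3, 2],
              [-9, -9, 6]], dtype=float)
b = np.array([2, 4], dtype=float)

x = np.linalg.lstsq(A, b, rcond=None)[0]

# A nonzero residual means b is not in the column space of A
residual = A @ x - b
print("Residual:", residual)

# For a rank-1 matrix the row space is spanned by the first row;
# the minimum-norm solution has no component outside the row space
row = A[0]
cosine = row @ x / (np.linalg.norm(row) * np.linalg.norm(x))
print("Cosine of angle between x and the first row:", cosine)
```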


### 2a

In [275]:
import numpy as np
import scipy.linalg

In [276]:
A = np.array([[ 6, 8, 6],
              [-2, 4, 10],
              [0, -3, 2],
              [4, 3, 2],
              [6, -0, 7]])
b = np.array([5,-4,-8,2,-2]).T

The least-squares solution to the above overdetermined system $A\vec{x}=\vec{b}$ is the $\vec{x}$ that minimizes the squared norm of the error $\vec{e}=A\vec{x}-\vec{b}$, which is given by $||\vec{e}||^2=\vec{e}^T\vec{e}$. To do this, we take the derivative of $\vec{e}^T\vec{e}$ with respect to $\vec{x}$ and set it to zero. After some calculus, this comes out to be 

$$2(A^TA\vec{x}-A^T\vec{b})=0$$ $$A^TA\vec{x}=A^T\vec{b}$$ $$\vec{x}=(A^TA)^{-1}A^T\vec{b}$$ 

Calculating below...

In [277]:
x_sol = np.linalg.inv(A.T@A)@A.T@b
print("Least squares solution:\n", x_sol)

Least squares solution:
 [ 0.32153484  1.09494635 -0.79573356]
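Forming the explicit inverse works for this small problem, but the same normal equations can also be solved without inverting $A^TA$. A minimal sketch, with `x_solve` and `x_lstsq` as illustrative names, compares `np.linalg.solve` on the normal equations against `np.linalg.lstsq`:

```python
import numpy as np

A = np.array([[6, 8, 6],
              [-2, 4, 10],
              [0, -3, 2],
              [4, 3, 2],
              [6, 0, 7]], dtype=float)
b = np.array([5, -4, -8, 2, -2], dtype=float)

# Solve A^T A x = A^T b directly, without forming the inverse
x_solve = np.linalg.solve(A.T @ A, A.T @ b)

# Reference answer from NumPy's least-squares routine
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print("Normal equations:", x_solve)
print("np.linalg.lstsq: ", x_lstsq)
```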


### 2b

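Now we are asked to find the economical QR decomposition of $A$. To do this, we first apply the Gram-Schmidt process to the columns of $A$ to find $Q$. We then use the orthogonality of the columns of $Q$ to find $R$: if the columns of $Q$ are orthonormal, then $A=QR \implies Q^TA=R$ (note that $Q$ is not square, so $Q^{-1}$ does not exist). The columns computed below are orthogonal but not normalized to unit length, so $Q^TQ=D$ is diagonal rather than the identity, and the $R=Q^TA$ printed here satisfies $A=QD^{-1}R$. Doing this below...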

In [300]:
# Apply Gram-Schmidt to columns of A to find Q
q0 = A[:,0]
q1 = A[:,1] - (np.dot(A[:,1],q0)/np.dot(q0,q0))*q0
q2 = (A[:,2] - (np.dot(A[:,2],q0)/np.dot(q0,q0))*q0 - 
      (np.dot(A[:,2],q1)/np.dot(q1,q1))*q1)
Q = np.array([q0, q1, q2]).T

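# Use Q^T to find R, zeroing round-off entries below the diagonal
# (compare magnitudes so that legitimate negative entries are not discarded)
R = Q.T@A
R[np.abs(R)<1e-10] = 0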

print("Q:\n",Q)
print("R:\n",R)

Q:
 [[ 6.          4.60869565 -1.70975919]
 [-2.          5.13043478  7.64385298]
 [ 0.         -3.          4.21673004]
 [ 4.          0.73913043 -1.4157161 ]
 [ 6.         -3.39130435  5.20152091]]
R:
 [[ 92.          52.          66.        ]
 [  0.          68.60869565  50.69565217]
 [  0.           0.         108.19264892]]
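For comparison, here is a sketch of Gram-Schmidt with the normalization step included, so that $Q$ has orthonormal columns and $A=QR$ holds exactly (up to round-off). The function name `gram_schmidt_qr` and the variables `Q_hat` and `R_hat` are illustrative and not part of the original solution; the sketch assumes $A$ has full column rank.

```python
import numpy as np

A = np.array([[6, 8, 6],
              [-2, 4, 10],
              [0, -3, 2],
              [4, 3, 2],
              [6, 0, 7]], dtype=float)

def gram_schmidt_qr(M):
    """Classical Gram-Schmidt with normalization; assumes full column rank."""
    n_rows, n_cols = M.shape
    Q = np.zeros((n_rows, n_cols))
    for j in range(n_cols):
        v = M[:, j].copy()
        # Remove the components along the previously computed unit vectors
        for i in range(j):
            v -= (Q[:, i] @ M[:, j]) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    # With orthonormal columns, R = Q^T M; triu drops round-off below the diagonal
    R = np.triu(Q.T @ M)
    return Q, R

Q_hat, R_hat = gram_schmidt_qr(A)
print("Q^T Q = I:", np.allclose(Q_hat.T @ Q_hat, np.eye(3)))
print("Q R = A:  ", np.allclose(Q_hat @ R_hat, A))
```

The diagonal of `R_hat` holds the norms of the unnormalized Gram-Schmidt vectors, so its first entry is $\sqrt{92}$.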


### 2c
Now, we can use the QR decomposition of $A$ to solve for $\vec{x}$ in the least-squares sense. Multiplying both sides by $Q^T$ and using $R=Q^TA$ from part b:
$$A\vec{x} = \vec{b}$$
$$Q^TA\vec{x} = Q^T\vec{b}$$
$$R\vec{x} = Q^T \vec{b}$$
Let $\vec{b}^*=Q^T \vec{b}$. Calculating $\vec{b}^*$ below, we find...

In [304]:
b_star = Q.T@b
print("b_star = \n", b_star)

b_star = 
 [ 34.          34.7826087  -86.09252218]


Now we can easily solve the equation $R\vec{x}=\vec{b}^*$ using back substitution, since $R$ is upper triangular. The handwritten work is attached at the bottom of this file. The answer comes out to be $\vec{x} = [0.322;1.095;-0.796]$.
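The back substitution can also be sketched in code. The values of `R` and `b_star` below are copied from the printed output above (rounded to the digits shown), and `back_substitute` is an illustrative helper, not part of the original solution.

```python
import numpy as np

# Values copied from the printed R and b_star above
R = np.array([[92.0, 52.0, 66.0],
              [0.0, 68.60869565, 50.69565217],
              [0.0, 0.0, 108.19264892]])
b_star = np.array([34.0, 34.7826087, -86.09252218])

def back_substitute(U, y):
    """Solve U x = y for an upper-triangular U with nonzero diagonal."""
    n = len(y)
    x = np.zeros(n)
    # Start from the last row, where only one unknown remains
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

x = back_substitute(R, b_star)
print("x =", x)
```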

### 2d
The answer obtained in part c is the same as the answer from part a (to within round-off error).

### 3a

In [67]:
import numpy as np
import scipy.linalg

Recall that the process we learned previously for doing PCA was to first calculate the normalized covariance matrix $C = FF^T/N$; the principal components (PCs) of $F$ are then the eigenvectors of $C$. When doing PCA via SVD, we first calculate $F = U\Sigma V^T$. Recall that the columns of $U$ are the normalized eigenvectors of $FF^T$, which are the same as the eigenvectors of $C$ because dividing by $N$ only rescales the eigenvalues. Therefore, we can see that the columns of $U$ from SVD are the same as the PCs of $F$. 

When the covariance matrix was used to calculate the PCs, we obtained the time series $T$ by calculating $T=U^TF$. If we plug in the SVD of $F$ and use $U^TU=I$, we get
$$T = U^T(U\Sigma V^T)$$
$$T = \Sigma V^T$$

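Finally, since the singular values $\sigma_i$ along the diagonal of $\Sigma$ are the square roots of the eigenvalues of $FF^T$ corresponding to the eigenvectors that make up the columns of $U$, the eigenvalues of $C$ are $\lambda_i=\sigma_i^2/N$, and we can also get back the fraction of variance explained by each PC as $\sigma_i^2/\sum_j\sigma_j^2$. 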

Therefore, we can see that from the SVD of a data matrix $F$, we can get the PCs, their explained variance, and the PC time series. We can also see that doing PCA using SVD gives the same results as doing it using the covariance matrix. 
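These relationships can be checked numerically. The sketch below uses a small random matrix `F_demo` as stand-in data (an assumption for illustration, not the homework data) and verifies that $\sigma_i^2/N$ matches the eigenvalues of $C$ and that $U^TF=\Sigma V^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
F_demo = rng.standard_normal((3, 9))   # hypothetical data: 3 variables, 9 samples
N = F_demo.shape[1]

U, s, V_t = np.linalg.svd(F_demo, full_matrices=False)

# Eigenvalues of C = F F^T / N, largest first (eigh returns them in ascending order)
C = F_demo @ F_demo.T / N
eigvals = np.linalg.eigh(C)[0][::-1]

print("sigma^2 / N:     ", s**2 / N)
print("eigenvalues of C:", eigvals)

# Time series computed two ways: projection U^T F and Sigma V^T
T_proj = U.T @ F_demo
T_svd = np.diag(s) @ V_t
print("U^T F equals Sigma V^T:", np.allclose(T_proj, T_svd))
```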

### 3b

In [79]:
F = np.array([[0,-18,2.5,-10,5,-2.5,7.5,5,10],
             [10,25,2.5,12,-5,0,-12,-12,-20],
             [-30,12,-20,12,-10,12,0,12,10]])

We are asked to calculate the PCs and expansion coefficients using both the SVD and the covariance matrix methods. First let's do this via SVD using the method described in part a, and then via the covariance matrix.

In [92]:
### First do PCA using SVD ###

# Get the SVD of F
U_svd, Sigma, V_t = np.linalg.svd(F, full_matrices=True)

# Get list of singular values into matrix form
Sigma_diag = np.diag(Sigma)
Sigma_mat = np.zeros(F.shape)
Sigma_mat[:Sigma_diag.shape[0],:Sigma_diag.shape[1]] = Sigma_diag

# Calculate the expansion coefficients
T_svd = Sigma_mat@V_t

### Now do PCA using covariance matrix ###

# Calculate the covariance matrix
C = F@F.T/F.shape[1]

# Calculate the eigenvectors and eigenvalues of C
eigvals, eigvecs = np.linalg.eig(C)

# Sort the eigenvectors so that the eigenvectors corresponding to the largest eigenvalues
# are first
inds = (-np.abs(eigvals)).argsort()
U_cov = eigvecs[:,inds]
eigvals = eigvals[inds]

# Calculate the expansion coefficients
T_cov = U_cov.T@F

### Compare Results ###
print("PCs (columns) using SVD:\n", U_svd)
print("PCs (columns) using Covariance Matrix:\n", U_cov)
print("\nExpansion Coefficients using SVD:\n", T_svd)
print("Expansion Coefficients using Covariance Matrix:\n", T_cov)

PCs (columns) using SVD:
 [[-0.45308868 -0.30380604  0.83810055]
 [ 0.83700581  0.17858356  0.51723224]
 [-0.30680926  0.9358471   0.17337323]]
PCs (columns) using Covariance Matrix:
 [[-0.45308868 -0.30380604 -0.83810055]
 [ 0.83700581  0.17858356 -0.51723224]
 [-0.30680926  0.9358471  -0.17337323]]

Expansion Coefficients using SVD:
 [[ 1.75743358e+01  2.53990304e+01  7.09597798e+00  1.08932454e+01
  -3.38237988e+00 -2.54898939e+00 -1.34422348e+01 -1.59912242e+01
  -2.43390956e+01]
 [-2.62895775e+01  2.11632630e+01 -1.90299983e+01  1.64112284e+01
  -1.17704190e+01  1.19896803e+01 -4.42154802e+00  7.56813232e+00
   2.74873943e+00]
 [-2.88746561e-02 -7.45252743e-02 -7.91327010e-02 -9.37399063e-02
  -1.29390746e-01 -1.47725798e-02  7.89673266e-02  6.41947468e-02
  -2.29906836e-01]]
Expansion Coefficients using Covariance Matrix:
 [[ 1.75743358e+01  2.53990304e+01  7.09597798e+00  1.08932454e+01
  -3.38237988e+00 -2.54898939e+00 -1.34422348e+01 -1.59912242e+01
  -2.43390956e+01]
 [-2.628

Looking at the results above, we can see that the numbers are all the same, save for some differences in sign. As we have seen in previous problems, since eigenvectors are only uniquely determined up to a minus sign, principal components may come out with different signs depending on how they are calculated. Importantly, however, the interpretation of each principal component remains the same. 
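One way to make the two sets of PCs directly comparable is to fix a sign convention. The sketch below flips each column so that its largest-magnitude entry is positive; this particular convention and the helper name `align_signs` are assumptions for illustration. The matrices are copied from the printed output above. If a PC is flipped, the corresponding row of expansion coefficients must be flipped as well.

```python
import numpy as np

# PCs copied from the printed output above
U_svd = np.array([[-0.45308868, -0.30380604, 0.83810055],
                  [0.83700581, 0.17858356, 0.51723224],
                  [-0.30680926, 0.9358471, 0.17337323]])
U_cov = np.array([[-0.45308868, -0.30380604, -0.83810055],
                  [0.83700581, 0.17858356, -0.51723224],
                  [-0.30680926, 0.9358471, -0.17337323]])

def align_signs(U):
    """Flip each column so that its largest-magnitude entry is positive."""
    idx = np.argmax(np.abs(U), axis=0)
    signs = np.sign(U[idx, np.arange(U.shape[1])])
    return U * signs

same = np.allclose(align_signs(U_svd), align_signs(U_cov))
print("PCs agree after sign alignment:", same)
```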

### 4a

In [189]:
import numpy as np
np.set_printoptions(linewidth=132)

In [190]:
X = np.array([[4,2.6,2.2,3.8,3.4,2,3,4,2.6,2.2,3.8],
             [1.9,3.3,3.9,2.2,2.7,4.1,3.1,2.1,3.3,3.8,2.3],
             [-1,4.6,6.2,-0.1,1.4,7,3.1,-0.9,4.6,6.2,-0.1],
             [-0.8,4.9,6,-0.2,1.1,6.7,2.9,-0.9,4.9,6.1,-0.2]])
Y = np.array([[-2.6,-5.9,-7.3,-3.1,-4,-7.7,-4.9,-2.4,-6.1,-7.2,-2.9],
             [-7.7,-4,-2.5,-7.2,-6.2,-2,-5.1,-7.9,-3.8,-2.5,-7.4],
             [-7.3,-4,-3.1,-6.9,-6,-2.7,-5,-7.3,-4,-3.1,-6.9]])

In order to perform multivariate PCA on the above data, we must first subtract the mean of each row. Then, we need to normalize each value in $X$ by the standard deviation of all the $X$ values, and then do the same separately for the $Y$ data. Note that this just means we want to divide each entry in $X$ by the standard deviation of all entries in $X$, and likewise for $Y$.

In [191]:
# Remove mean from each row
X_norm = X-np.mean(X, axis=1).reshape(X.shape[0],1)
Y_norm = Y-np.mean(Y, axis=1).reshape(Y.shape[0],1)

# Normalize by std of each group
X_norm = X_norm/np.std(X)
Y_norm = Y_norm/np.std(Y)

print("X Data (before normalization):\n", X)
print("Y Data (before normalization):\n", Y)

X Data (before normalization):
 [[ 4.   2.6  2.2  3.8  3.4  2.   3.   4.   2.6  2.2  3.8]
 [ 1.9  3.3  3.9  2.2  2.7  4.1  3.1  2.1  3.3  3.8  2.3]
 [-1.   4.6  6.2 -0.1  1.4  7.   3.1 -0.9  4.6  6.2 -0.1]
 [-0.8  4.9  6.  -0.2  1.1  6.7  2.9 -0.9  4.9  6.1 -0.2]]
Y Data (before normalization):
 [[-2.6 -5.9 -7.3 -3.1 -4.  -7.7 -4.9 -2.4 -6.1 -7.2 -2.9]
 [-7.7 -4.  -2.5 -7.2 -6.2 -2.  -5.1 -7.9 -3.8 -2.5 -7.4]
 [-7.3 -4.  -3.1 -6.9 -6.  -2.7 -5.  -7.3 -4.  -3.1 -6.9]]


### 4b

In [192]:
# Combine the normalized X and Y data into one dataset
F = np.vstack((X_norm,Y_norm))

# Get the SVD of F
U, Sigma, V_t = np.linalg.svd(F, full_matrices=True)

# Make the Sigma array from the list of singular values
Sigma_diag = np.diag(Sigma)
Sigma_mat = np.zeros(F.shape)
Sigma_mat[:Sigma_diag.shape[0],:Sigma_diag.shape[1]] = Sigma_diag

#print("PCs (columns):\n", U)

total_var = np.sum([np.square(i) for i in Sigma])
running_sum = 0
num_comps = -1
for i, val in enumerate(Sigma):
    running_sum += np.square(val)
    print(running_sum/total_var)
    if running_sum/total_var >= 0.5:
        num_comps = i+1
        break

print("PCs (columns):\n", U)
print("\nNumber of PCs needed to explain at least 50% of total variance:", num_comps)
print("Variance explained by", num_comps, "PCs:", running_sum/total_var)

0.9978911409180963
PCs (columns):
 [[-0.13142226 -0.02092185 -0.09924493 -0.09320572 -0.01404402  0.33223223 -0.92367641]
 [ 0.13111547  0.28772149  0.53764473 -0.48628455 -0.45008344  0.39802939  0.11613808]
 [ 0.5204419   0.09070272  0.57052637  0.50443552  0.3268824   0.03136549 -0.18199399]
 [ 0.51557648 -0.74527634 -0.18488136 -0.0489759  -0.19951146  0.30910502  0.08254438]
 [-0.3735996  -0.42537059  0.38875174 -0.34856701  0.61339404  0.14437205  0.09879669]
 [ 0.41941733  0.40403955 -0.42205465 -0.34736188  0.52375119  0.2775528   0.10344021]
 [ 0.33670533 -0.09448128  0.11022293 -0.5057934  -0.00177971 -0.73265667 -0.27006995]]

Number of PCs needed to explain at least 50% of total variance: 1
Variance explained by 1 PCs: 0.9978911409180963


Looking at the results above, we can see that almost all of the variance in the dataset is contained within the first PC. Looking at this first PC, we can see that the first and fifth elements (corresponding to $X_1$ and $Y_1$) are negative, while the rest are all positive. Therefore, we can conclude that the prices of the first and fifth products vary together in one direction, whereas the prices of all the other products vary together in the other direction. 
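The explained-variance calculation can also be written without an explicit loop. This is a sketch of an alternative, not the method used above; the names `frac`, `cum`, and the 0.5 threshold lookup via `np.searchsorted` are illustrative choices.

```python
import numpy as np

X = np.array([[4, 2.6, 2.2, 3.8, 3.4, 2, 3, 4, 2.6, 2.2, 3.8],
              [1.9, 3.3, 3.9, 2.2, 2.7, 4.1, 3.1, 2.1, 3.3, 3.8, 2.3],
              [-1, 4.6, 6.2, -0.1, 1.4, 7, 3.1, -0.9, 4.6, 6.2, -0.1],
              [-0.8, 4.9, 6, -0.2, 1.1, 6.7, 2.9, -0.9, 4.9, 6.1, -0.2]])
Y = np.array([[-2.6, -5.9, -7.3, -3.1, -4, -7.7, -4.9, -2.4, -6.1, -7.2, -2.9],
              [-7.7, -4, -2.5, -7.2, -6.2, -2, -5.1, -7.9, -3.8, -2.5, -7.4],
              [-7.3, -4, -3.1, -6.9, -6, -2.7, -5, -7.3, -4, -3.1, -6.9]])

# Remove row means, then scale each group by the std of all its raw entries
X_norm = (X - X.mean(axis=1, keepdims=True)) / X.std()
Y_norm = (Y - Y.mean(axis=1, keepdims=True)) / Y.std()
F = np.vstack((X_norm, Y_norm))

# Fraction of variance per PC and its running total
s = np.linalg.svd(F, compute_uv=False)
frac = s**2 / np.sum(s**2)
cum = np.cumsum(frac)

# First index where the running total reaches 0.5, converted to a count
num_comps = int(np.searchsorted(cum, 0.5) + 1)
print("Variance fraction per PC:", frac)
print("PCs needed for at least 50%:", num_comps)
```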

### 4c
We know that PCA requires each PC to be orthogonal, and for this reason can sometimes miss relationships in the data. Therefore, it is also prudent to investigate the covariance between the two datasets in case PCA was dominated by large variances within the datasets that obscured our ability to identify the presence of smaller covariances between the datasets. We can check this by looking at the covariance matrix between $X$ and $Y$, given by $C=XY^T/N$ after the mean has been removed from the rows of $X$ and $Y$. Looking at this below...

In [193]:
X_nm = (X-np.mean(X, axis=1).reshape(X.shape[0],1))
Y_nm = (Y-np.mean(Y, axis=1).reshape(Y.shape[0],1))

C = (X_nm@Y_nm.T)/X_nm.shape[1]
print("C:\n", C)
print("\nTotal covariance:", np.sum(np.square(C)))

C:
 [[ 1.42280992 -1.59719008 -1.28264463]
 [-1.42140496  1.59495868  1.27950413]
 [-5.63330579  6.32396694  5.07942149]
 [-5.57049587  6.25586777  5.03404959]]

Total covariance: 205.45717312342055


The rows of $C$ represent the four products in $X$ while the columns represent the three products in $Y$. The elements represent the covariance between the products corresponding to each row and column. Looking closely at this covariance matrix, we can see the same features that we saw in the first PC of our PCA analysis. We can see that $X_1$ and $Y_1$ are positively correlated, which we saw from those terms both being negative in the first PC, and we can also see that $X_1$ and $Y_1$ are both negatively correlated with all the other products in the opposite dataset (which are in turn all positively correlated with each other).

### 4d
Now that we have calculated $C$, we can use SVD on $C$ to perform MCA on this data. Doing this below...

In [194]:
# Get the SVD of C
U, Sigma, V_T = np.linalg.svd(C, full_matrices=True)

print("Singular values of C:\n",Sigma)
print("\nU:\n",U)
print("\nV:\n",V_T.T)

Singular values of C:
 [1.43337757e+01 6.97365562e-03 1.68405371e-04]

U:
 [[-0.17400224  0.17349134  0.09290314  0.96487978]
 [ 0.17373391 -0.37598545  0.9101229   0.01130403]
 [ 0.68897311 -0.56646087 -0.36878125  0.26160757]
 [ 0.68180335  0.71250107  0.16445579 -0.0209932 ]]

V:
 [[-0.57024035  0.47834665 -0.66784012]
 [ 0.64025893 -0.25055083 -0.72614929]
 [ 0.51467898  0.84167022  0.16339151]]


Looking at the SVD of $C$, we can see that the singular values are dominated by one value which is orders of magnitude greater than the other two. Since the squared singular values divided by their sum represent the fraction of the total squared covariance explained by each SVD mode, we can tell that we only need the first SVD mode to explain most of the covariance between the two datasets. The column vectors of $U$ and $V$ represent the structures in $X$ and $Y$ (respectively) that vary together. Notice that the structure of the first columns of $U$ and $V$ above is very similar to that of our first PC. Since we see that almost all of the total covariance is explained by the first SVD mode, and since this has similar structure to the only PC we decided to keep, we can tell that PCA did not miss any significant covariances in the data.

### 4e
We know that the total squared covariance (the sum of the squared elements of $C$) is equal to the sum of the squared singular values of $C$. Checking that this is true below...

In [195]:
# Calculate the total covariance two different ways
total_cov = np.sum(np.square(C))
total_cov_sv = np.sum(np.square(Sigma))

print("Total squared sum of elements of C:",total_cov)
print("Total squared sum of elements of Singular Values of C:",total_cov_sv)

Total squared sum of elements of C: 205.45717312342055
Total squared sum of elements of Singular Values of C: 205.45717312342057


We can see from above that the total squared covariance in $C$ is equal to the sum of the squared singular values of $C$. We don't even need to do any calculation to see that only one SVD mode explains well over 50% of the total covariance given our observations in part d (the portion of the covariance explained by the first SVD mode is printed below anyway). In general, the difference between total variance and total covariance is that total variance represents the total deviation from the mean for each variable of the dataset, which can also be thought of as the covariance of that variable with itself. Total covariance, on the other hand, represents the total joint variability of each possible pair of variables. However, in the case above when we are performing MCA, we only consider the covariance of pairs of $X$ and $Y$ variables, not the covariance of variable pairs within $X$ or $Y$.

In [197]:
np.square(Sigma[0])/np.sum(np.square(Sigma))

0.9999997631611869
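The identity checked above is the statement that the squared Frobenius norm of a matrix equals the sum of its squared singular values. As a sketch, the snippet below repeats the check using `np.linalg.norm` and also reports the squared covariance fraction of every mode; the entries of `C` are copied from the printed output in part c (rounded to the digits shown), and `frob_sq` and `scf` are illustrative names.

```python
import numpy as np

# Covariance matrix copied from the printed output in part c
C = np.array([[1.42280992, -1.59719008, -1.28264463],
              [-1.42140496, 1.59495868, 1.27950413],
              [-5.63330579, 6.32396694, 5.07942149],
              [-5.57049587, 6.25586777, 5.03404959]])

s = np.linalg.svd(C, compute_uv=False)

# Squared Frobenius norm: the sum of the squared elements of C
frob_sq = np.linalg.norm(C, 'fro')**2

# Squared covariance fraction explained by each SVD mode
scf = s**2 / np.sum(s**2)

print("Squared Frobenius norm:       ", frob_sq)
print("Sum of squared singular values:", np.sum(s**2))
print("Squared covariance fractions:  ", scf)
```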

### 5a

In [272]:
import numpy as np
import scipy.linalg
import re

The Jaccard similarity of two sets $A$ and $B$ is defined as
$$J(A,B) = \frac{|A\cap B|}{|A\cup B|}$$
Below, we calculate the Jaccard similarity of a few different sets.
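The definition translates directly into a small helper. This is a sketch; the function name `jaccard` and the convention of returning 1.0 for two empty sets are assumptions made here, not part of the assignment.

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty sets count as identical
        return 1.0
    return len(a & b) / len(union)

# Intersection {2, 3} has 2 elements, union {1, 2, 3, 4} has 4
print(jaccard([1, 2, 3], [2, 3, 4]))
```

Duplicates in the inputs do not matter because both arguments are converted to sets first.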

#### $\text{i}$

In [257]:
# Define the sets
a=np.array([1,1,0,0,0,0,0,0,1,1])
b=np.array([0,0,1,0,1,1,1,0,1,0])

# Find the Jaccard Similarity
jac_sim = np.sum(a&b)/np.sum(a|b)
print("Jaccard Similarity:", jac_sim)

Jaccard Similarity: 0.125


#### $\text{ii}$

In [260]:
# Define the sets
a=set([2, 2, 1, 3, 4, 6, 2, 2, 8, 2])
b=set([2, 1, 8, 7, 9, 4, 7, 6, 4, 8])

# Find the Jaccard Similarity
jac_sim = len(a&b)/len(a|b)
print("Jaccard Similarity:", jac_sim)

Jaccard Similarity: 0.625


#### $\text{iii}$

In [268]:
# Define the sets
a = set(['what','are','the','roots','that','clutch','what','branches','grow','out','of','this'
    ,'stony','rubbish','son','of','man','you','cannot','say','or','guess','for','you','know'
    ,'only','a','heap','of','broken','images','where','the','sun','beats','and'])

b = set(['cricket','dry','what','no','sound','man','out','know','sun','stone','images','water'
    ,'no','what','grow','tree','you','that','cannot','this','guess','say','the','the','of'
    ,'the','roots','and','broken','heap','you','gives','only','dead','rubbish','clutch'])

# Find the Jaccard Similarity
jac_sim = len(a&b)/len(a|b)
print("Jaccard Similarity:", jac_sim)

Jaccard Similarity: 0.55


#### $\text{iv}$

In [274]:
## Code adapted from website ##
# Read in the text files, splitting each line into words on whitespace and punctuation
pattern = "[ \r\n\t“”’.,-]+"
with open('words1.txt','r') as fid1:
    carray1 = [re.split(pattern, line) for line in fid1.readlines()]
with open('words2.txt','r') as fid2:
    carray2 = [re.split(pattern, line) for line in fid2.readlines()]
## words come as a list of sublists, each corresponding to a line.
## next, need to "flatten" the lists, that is, create a single list of
## words for each file:
a= set([item for sublist in carray1 for item in sublist])
b= set([item for sublist in carray2 for item in sublist])

# Find the Jaccard Similarity
jac_sim = len(a&b)/len(a|b)
print("Jaccard Similarity:", jac_sim)

Jaccard Similarity: 0.16071428571428573
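One detail worth noting about the tokenization: `re.split` returns an empty string when a line begins or ends with a delimiter, so the empty string can end up as a member of both sets. The sketch below shows this on a made-up line (`sample_line` is hypothetical, not taken from the text files) and one way to filter it out. Applying such a filter could change the similarity value reported above, so this is an observation rather than a correction.

```python
import re

pattern = "[ \r\n\t“”’.,-]+"
sample_line = "the sun beats.\n"   # hypothetical line for illustration

# The trailing ".\n" matches the delimiter pattern, leaving an empty final token
tokens = re.split(pattern, sample_line)
print(tokens)

# Drop empty strings before building the set
cleaned = [word for word in tokens if word]
print(cleaned)
```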
