## Objectives:
- define a vector and calculate a vector length and dot product
- define a matrix and calculate a matrix dot product, transpose, and inverse
- explain cosine similarity and compute the similarity between two vectors
- use linear algebra to solve for linear regression coefficients

# Use the following information to answer the assignment questions 1) - 11).

### Is head size related to brain weight in healthy adult humans?

The Brainhead.csv dataset provides information on 237 individuals who were subject to post-mortem examination at the Middlesex Hospital in London around the turn of the 20th century. The study authors used cadavers to see if a relationship between brain weight and other more easily measured physiological characteristics such as age, sex, and head size could be determined. The end goal was to develop a way to estimate a person’s brain size while they were still alive (as the living aren’t keen on having their brains taken out and weighed).

**We wish to determine if we can improve on our model of the linear relationship between head size and brain weight in healthy human adults.**

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp. 105-123.

In [1]:
#Import the Brainhead.csv dataset from a URL and print the first few rows

import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Brainhead/Brainhead.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)

df.head()

Unnamed: 0,Gender,Age,Head,Brain
0,1,1,4512,1530
1,1,1,3738,1297
2,1,1,4261,1335
3,1,1,3777,1282
4,1,1,4177,1590


1) Store the response variable - brain weight - as a matrix called Y.

In [2]:
### YOUR CODE HERE ###
# Convert the column to a NumPy array and use reshape(-1, 1) to get a column vector
Y = np.array(df['Brain']).reshape(-1,1)

2) Store the explanatory variable - head size - as a matrix called X.  Don't forget to include the column of 1s for the intercept term.

In [4]:

### YOUR CODE HERE ###
Head = np.array(df['Head']).reshape(-1,1)
ones = np.repeat(1, len(df)).reshape(-1,1)
X = np.concatenate((ones, Head), axis = 1)


3) Calculate $X^T$.  Explain what the transpose of a matrix is.

In [5]:
### YOUR CODE HERE ###
X_T = np.transpose(X)


Answer: -->
**The transpose of a matrix $A$, written $A^T$, is the matrix whose rows are the columns of $A$ and whose columns are the rows of $A$. An $m \times n$ matrix therefore has an $n \times m$ transpose.**
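
As a quick illustration of the transpose, here is a minimal sketch on a small matrix. The values in `A` are made up for the example and are not taken from the Brainhead data.

```python
import numpy as np

# A small 3x2 matrix with made-up values (not the assignment data)
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])

# .T is shorthand for np.transpose(A)
A_T = A.T

print(A.shape)    # (3, 2)
print(A_T.shape)  # (2, 3)
print(A_T)
# [[1 3 5]
#  [2 4 6]]
```

The first row of `A_T` is the first column of `A`, and so on, which is why the shape flips from 3x2 to 2x3.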

4) Use matrix multiplication to calculate $X^TX$

In [6]:
### YOUR CODE HERE ###
X_T_X = np.dot(X_T, X)
X_T_X

array([[       237,     861256],
       [    861256, 3161283190]])

5) Calculate $(X^TX)^{-1}$.  Explain what the inverse of a matrix is.

In [7]:
### YOUR CODE HERE ###
X_T_X_inv = np.linalg.inv(X_T_X)
X_T_X_inv

array([[ 4.23638519e-01, -1.15415543e-04],
       [-1.15415543e-04,  3.17599920e-08]])

Answer: -->

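**The inverse of a square matrix $A$ is the matrix $A^{-1}$ that satisfies $AA^{-1} = A^{-1}A = I$, where $I$ is the identity matrix. It plays the role that the reciprocal plays for ordinary numbers, and it exists only when the determinant of $A$ is nonzero. For a 2x2 matrix, the inverse is found by swapping the diagonal entries, negating the off-diagonal entries, and dividing by the determinant $ad - bc$:**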

\begin{align}
A = \begin{bmatrix}
a & b \\
c & d
\end{bmatrix}
\qquad
A^{-1} = \frac{1}{ad-bc}\begin{bmatrix}
d & -b\\
-c & a
\end{bmatrix}
\end{align}

$\qquad$
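
The 2x2 formula can be checked numerically. The sketch below assumes a hypothetical helper `inverse_2x2` (not part of NumPy or the assignment) and a made-up matrix, and compares the result with `np.linalg.inv`.

```python
import numpy as np

def inverse_2x2(A):
    """Invert a 2x2 matrix with the closed-form formula (hypothetical helper)."""
    a, b = A[0]
    c, d = A[1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("Matrix is singular (determinant is 0); no inverse exists.")
    # Swap the diagonal, negate the off-diagonal, divide by the determinant
    return (1.0 / det) * np.array([[d, -b], [-c, a]])

# Made-up example matrix with determinant 4*6 - 7*2 = 10
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = inverse_2x2(A)

# The formula agrees with NumPy, and A @ A_inv is the identity matrix
print(np.allclose(A_inv, np.linalg.inv(A)))  # True
print(np.allclose(A @ A_inv, np.eye(2)))     # True
```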

6) Use matrix multiplication to calculate $X^TY$.

In [8]:
### YOUR CODE HERE ###

X_T_Y = np.matmul(X_T, Y)
X_T_Y

array([[    304041],
       [1113176805]])

7) Use your previous results to calculate the values of the slope and intercept using the formula $$ B = (X^TX)^{-1}X^TY$$

In [9]:
### YOUR CODE HERE ###
B = np.matmul(X_T_X_inv, X_T_Y)
print (B)

[[3.25573421e+02]
 [2.63429339e-01]]
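
The formula $B = (X^TX)^{-1}X^TY$ can also be evaluated without forming the inverse explicitly. The sketch below uses small synthetic data (chosen so that the true intercept is 2 and the true slope is 3, not the Brainhead data) and compares three routes: the explicit inverse, `np.linalg.solve`, and `np.linalg.lstsq`. Avoiding the explicit inverse is generally considered the more numerically stable option, although for a well-behaved 2x2 system like this one the difference is negligible.

```python
import numpy as np

# Synthetic data (illustrative only, not the Brainhead data): y = 2 + 3x exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix: a column of 1s for the intercept, then x
X = np.column_stack((np.ones_like(x), x))
Y = y.reshape(-1, 1)

# Normal equations with an explicit inverse
B_normal = np.linalg.inv(X.T @ X) @ X.T @ Y

# Same linear system, solved without forming the inverse
B_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver that works directly on X and Y
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(B_normal, [[2.0], [3.0]]))  # True
print(np.allclose(B_solve, B_normal))         # True
print(np.allclose(B_lstsq, B_normal))         # True
```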


8) Use the OLS function to calculate the slope and intercept and compare your answers.

In [10]:
### YOUR CODE HERE ###
from statsmodels.formula.api import ols
# Model format Y ~ X or Brain ~ Head
model = ols('Brain ~ Head', data = df).fit()
model.params


Intercept    325.573421
Head           0.263429
dtype: float64

Answer-->

**The results from the matrix calculation are in line with the results provided by the OLS model.**

9) Create a new X matrix that includes columns for both head size and age group.

In [11]:
### YOUR CODE HERE ###
Head = np.array(df['Head']).reshape(-1,1)
Age = np.array(df['Age']).reshape(-1,1)
ones = np.repeat(1, len(df)).reshape(-1,1)
X = np.concatenate((ones, Head, Age), axis = 1)


10) Calculate the values of the intercept and slope terms for head size and age using the formula $$ B = (X^TX)^{-1}X^TY$$

In [12]:
### YOUR CODE HERE ###
X_T = np.transpose(X)
X_T_X = np.dot(X_T, X)
X_T_X_inv = np.linalg.inv(X_T_X)
X_T_Y = np.matmul(X_T, Y)
B = np.matmul(X_T_X_inv, X_T_Y)
print (B)

[[ 3.68282145e+02]
 [ 2.60438766e-01]
 [-2.07316446e+01]]


11) Use the OLS function to confirm your answer in 10).

In [13]:
### YOUR CODE HERE ###
from statsmodels.formula.api import ols
# Model format: Y ~ X1 + X2, here Brain ~ Head + Age
model = ols('Brain ~ Head + Age', data = df).fit()
model.params

Intercept    368.282145
Head           0.260439
Age          -20.731645
dtype: float64

**The results from the matrix calculation are in line with the results provided by the OLS model.**

# Use the following information to answer the assignment questions 12) - 16).

The songwriting collaboration between John Lennon and Paul McCartney was one of the most productive in music history.  Unlike many other partnerships where one individual wrote lyrics and one wrote music, Lennon and McCartney composed both, and it was decided that any song that was written would be credited to both.  In the beginning of their relationship, many of their songs were truly collaborative.  However, later on, they often worked separately with little to no input from the other.

Because of extensive reporting on the Beatles over the years, it is generally known if a Lennon-McCartney song was a true collaboration, primarily (or totally) written by Lennon, or primarily (or totally) written by McCartney.

However, there are several disputed songs where both Lennon and McCartney at times claimed to be the sole (or primary) composer.

We will now use cosine similarity to determine if *Ticket to Ride* (disputed) is most similar to *From Me to You* (collaborative, not disputed) or *Strawberry Fields* (Lennon, not disputed).

From the Wikipedia article on the Lennon-McCartney Partnership: Lennon said that McCartney's contribution was limited to "the way Ringo played the drums". In *Many Years from Now*, McCartney said "we sat down and wrote it together ... give him 60 percent of it."
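
Cosine similarity treats each song as a vector of word counts and measures the angle between two such vectors: $\cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|}$. The sketch below shows the idea on short made-up strings rather than the song lyrics. The helper `cosine_similarity_text` is hypothetical, introduced only for illustration, and it splits on whitespace and is case-sensitive, as the cells below are.

```python
from collections import Counter
import numpy as np

def cosine_similarity_text(text_a, text_b):
    """Cosine similarity between word-count vectors of two strings (hypothetical helper)."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Shared vocabulary; a word missing from one text gets a count of 0
    vocab = sorted(set(counts_a) | set(counts_b))
    a = np.array([counts_a[w] for w in vocab], dtype=float)
    b = np.array([counts_b[w] for w in vocab], dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Identical texts give a similarity of 1 (up to floating-point rounding)
print(cosine_similarity_text("love me do", "love me do"))

# Texts with no words in common give a similarity of 0
print(cosine_similarity_text("love me do", "yellow submarine"))
```

Values closer to 1 mean the two texts use words in more similar proportions; values closer to 0 mean they share little vocabulary.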

12) Import the text of Strawberry Fields and calculate the frequency of song lyrics using the code below.

In [14]:
### YOUR CODE HERE ###

Lennon_Straw = "let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever living is easy with eyes closed misunderstanding all you see its getting hard to be someone but it all works out it doesnt matter much to me let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever no one I think is in my tree I mean it must be high or low that is you cant you know tune in but its all right that is I think its not too bad let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever always no sometimes think but you know I know when it's a dream I think er no I mean er yes but its all wrong that is I think I disagree let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever Strawberry Fields forever Strawberry Fields forever"
Lennon_Straw_df = pd.DataFrame({'Words': Lennon_Straw.split()})

Lennon_Straw_df.head()

# Calculate the frequency of each word in Lennon_Straw_df

Lennon_Straw_df_freq = pd.DataFrame(pd.crosstab(index= Lennon_Straw_df['Words'], columns= 'count'))

Lennon_Straw_df_freq[0:50]

col_0,count
Words,Unnamed: 1_level_1
Fields,10
I,8
Im,4
Strawberry,10
a,1
about,4
all,4
always,1
and,4
bad,1


13) Import the text of From Me to You and calculate the frequency of song lyrics using the code below.

In [15]:
### YOUR CODE HERE ###
#From Me to You - Lennon and McCartney (not disputed)

Coll_Me2U = "if there's anything that you want if there's anything I can do just call on me and Ill send it along with love from me to you Ive got everything that you want like a heart thats oh so true just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you from me to you just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you to you to you to you"

Coll_Me2U_df = pd.DataFrame({'Words': Coll_Me2U.split()})
Coll_Me2U_df_freq = pd.DataFrame(pd.crosstab(index = Coll_Me2U_df['Words'], columns= 'count'))
Coll_Me2U_df_freq

col_0,count
Words,Unnamed: 1_level_1
I,3
Ill,5
Ive,5
a,1
along,5
and,9
anything,6
arms,2
by,2
call,5


13) Import the text of Ticket to Ride using the code below.

In [16]:
### YOUR CODE HERE ###
Dis_Ticket2Ride = "I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care she said that living with me is bringing her down yeah for she would never be free when I was around shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away yeah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me she said that living with me is bringing her down yeah for she would never be free when I was around ah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care"
Dis_Ticket2Ride_df = pd.DataFrame({'Words': Dis_Ticket2Ride.split()})
Dis_Ticket2Ride_df_freq = pd.DataFrame(pd.crosstab(index = Dis_Ticket2Ride_df['Words'], columns= 'count'))
Dis_Ticket2Ride_df_freq

col_0,count
Words,Unnamed: 1_level_1
I,8
Im,2
a,12
ah,1
around,2
away,2
baby,6
be,4
before,2
bringing,2


14) Concatenate Ticket to Ride and Strawberry Fields and calculate the cosine similarity.

In [17]:
### YOUR CODE HERE ###

from numpy import dot
from numpy.linalg import norm

dfs = [Dis_Ticket2Ride_df_freq, Lennon_Straw_df_freq]

# Align the two word-count tables on the union of their words;
# a word missing from one song gets a count of 0
all_words = pd.concat(dfs, axis = 1)
all_words = all_words.fillna(0)
# Column order must match the order of dfs above
all_words.columns = ['Ticket to Ride - Disputed', 'Strawberry Fields - Lennon']

cos_sim = dot(all_words['Strawberry Fields - Lennon'], all_words['Ticket to Ride - Disputed']) / ((norm(all_words['Strawberry Fields - Lennon'])) * (norm(all_words['Ticket to Ride - Disputed'])))
print('The cosine similarity between Strawberry Fields - Lennon and Ticket to Ride - Disputed is', cos_sim)

The cosine similarity between Strawberry Fields - Lennon and Ticket to Ride - Disputed is 0.324035859004908


15) Concatenate Ticket to Ride and From Me to You and calculate the cosine similarity.

In [18]:
### YOUR CODE HERE ###

dfs = [Dis_Ticket2Ride_df_freq, Coll_Me2U_df_freq]

all_words = pd.concat(dfs, axis = 1)
all_words = all_words.fillna(0)
all_words.columns = ['Ticket to Ride - Disputed', 'From Me to You - Collaborative']
all_words
cos_sin = dot(all_words['From Me to You - Collaborative'], all_words['Ticket to Ride - Disputed']) / ((norm(all_words['From Me to You - Collaborative'])) * (norm(all_words['Ticket to Ride - Disputed'])))
print ('The cosine similarity between From Me to You - Collaborative and Ticket to Ride - Disputed is', cos_sin)

The cosine similarity between From Me to You - Collaborative and Ticket to Ride - Disputed is 0.2882268853551227


16) What is your conclusion about Ticket to Ride?  Does it appear most similar to Strawberry Fields (Lennon) or From Me to You (collaborative)?

Answer: -->

**The cosine similarity between Ticket to Ride (disputed) and Strawberry Fields (Lennon) is about 0.324, higher than the 0.288 between Ticket to Ride and From Me to You (collaborative). By this word-frequency measure, Ticket to Ride is more similar to the Lennon song, which is consistent with Lennon being its primary composer. The gap is small and rests on a single comparison song per category, so the result is suggestive rather than conclusive.**