#### **Problem Statement:**
In this experiment we will perform ordinary least squares (multiple) regression to predict graduate admissions from an Indian/Bangladeshi perspective. The dataset can be obtained from
[Piazza](https://piazza.com/class_profile/get_resource/ku1fdd7zhev3r2/kwaz7m8lx5a52k) and [Kaggle](https://www.kaggle.com/mohansacharya/graduate-admissions). The dataset (Admission_Predict.csv) contains seven features arranged into columns in a CSV file. There are 400 sample datapoints. The features are as follows:
1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose and Letter of Recommendation Strength (out of 5)
5. Undergraduate GPA (out of 10)
6. Research Experience (either 0 or 1)
7. Chance of Admit (ranging from 0 to 1)

The first column of the dataset contains a serial number, and the final column provides the probability of getting admission, i.e. the target output for each datapoint. We will use the dataset to build a linear regression model that estimates the chance of admission for a new sample student, and to assess how well the model works in making a useful forecast.

#### **1. Import necessary packages:**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### **2. Upload and load dataset:**
First we have to upload the dataset to Google Colab to start working with it. Please download the **"Admission_Predict.csv"** dataset from the Piazza resources or [click here](https://piazza.com/class_profile/get_resource/ku1fdd7zhev3r2/kwaz7m8lx5a52k) to download it. Then click on Files in the sidebar and drag and drop your file onto the sidebar to upload the dataset.

Now, use `data = pd.read_csv("Admission_Predict.csv")` to load the data.

In [None]:
data = pd.read_csv("Admission_Predict.csv")  # file name as given in the instructions above
#data = data.to_numpy()
print(data)

     Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  \
0             1        337          118                  4  4.5   4.5  9.65   
1             2        324          107                  4  4.0   4.5  8.87   
2             3        316          104                  3  3.0   3.5  8.00   
3             4        322          110                  3  3.5   2.5  8.67   
4             5        314          103                  2  2.0   3.0  8.21   
..          ...        ...          ...                ...  ...   ...   ...   
395         396        324          110                  3  3.5   3.5  9.04   
396         397        325          107                  3  3.0   3.5  9.11   
397         398        330          116                  4  5.0   4.5  9.45   
398         399        312          103                  3  3.5   4.0  8.78   
399         400        333          117                  4  5.0   4.0  9.66   

     Research  Chance of Admit   
0           1    

#### **3. Preprocess the Data:**
* To inspect the loaded data use `print(data.head())`.
* After inspecting the data, did you notice that we have an extra column named `Serial No.`?
* This is certainly not a feature, so we will drop this column. Use `data.drop('Serial No.', axis=1, inplace=True)` to drop the column.
* The column `'Chance of Admit '` is also not a feature; it is our target.
  * We will store it in a separate variable `y` using `y = data['Chance of Admit ']`.
  * Convert `y` to a NumPy array using `y = y.values`.
  * Drop the column from `data` using `data.drop('Chance of Admit ', axis=1, inplace=True)`.
* In `data` we are left with all 7 features. Convert it to a NumPy array and store it in a new variable `X` using `X = data.values`. `X` is the feature matrix: each column of `X` is one feature vector.

☢ Note: Be careful about the space after the column name `'Chance of Admit '`.
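
The trailing space in the header is easy to get wrong. One possible way to sidestep it, sketched here on a tiny made-up DataFrame (`demo`, `X_demo`, and `y_demo` are hypothetical names, and only a few of the real columns are included), is to strip whitespace from all column names before selecting anything:

```python
import pandas as pd

# Tiny stand-in for the real file; values copied from the first rows shown above.
demo = pd.DataFrame({
    "Serial No.": [1, 2, 3],
    "GRE Score": [337, 324, 316],
    "CGPA": [9.65, 8.87, 8.00],
    "Chance of Admit ": [0.92, 0.76, 0.72],  # note the trailing space
})

# Remove leading/trailing whitespace from every column name.
demo.columns = demo.columns.str.strip()

# The target can now be selected without the trailing space.
y_demo = demo["Chance of Admit"].to_numpy()

# Drop the serial number and the target to leave only feature columns.
X_demo = demo.drop(columns=["Serial No.", "Chance of Admit"]).to_numpy()

print(X_demo.shape, y_demo.shape)
```

This variant returns a new DataFrame from `drop` instead of using `inplace=True`, so re-running the cell does not fail on an already-dropped column.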

In [None]:
print(data.head())
data.drop('Serial No.', axis=1, inplace=True)


   Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  \
0           1        337          118                  4  4.5   4.5  9.65   
1           2        324          107                  4  4.0   4.5  8.87   
2           3        316          104                  3  3.0   3.5  8.00   
3           4        322          110                  3  3.5   2.5  8.67   
4           5        314          103                  2  2.0   3.0  8.21   

   Research  Chance of Admit   
0         1              0.92  
1         1              0.76  
2         1              0.72  
3         1              0.80  
4         0              0.65  


In [None]:
print(data)
y = data['Chance of Admit ']
y = y.values
data.drop('Chance of Admit ', axis=1, inplace=True)  # with inplace=True the DataFrame is modified directly, so no reassignment is needed
print(y)
print(data)

     GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  Research  \
0          337          118                  4  4.5   4.5  9.65         1   
1          324          107                  4  4.0   4.5  8.87         1   
2          316          104                  3  3.0   3.5  8.00         1   
3          322          110                  3  3.5   2.5  8.67         1   
4          314          103                  2  2.0   3.0  8.21         0   
..         ...          ...                ...  ...   ...   ...       ...   
395        324          110                  3  3.5   3.5  9.04         1   
396        325          107                  3  3.0   3.5  9.11         1   
397        330          116                  4  5.0   4.5  9.45         1   
398        312          103                  3  3.5   4.0  8.78         0   
399        333          117                  4  5.0   4.0  9.66         1   

     Chance of Admit   
0                0.92  
1                0.76  
2  

In [None]:
#x
x = data.values
print(x)
print(x.shape)

[[337.   118.     4.   ...   4.5    9.65   1.  ]
 [324.   107.     4.   ...   4.5    8.87   1.  ]
 [316.   104.     3.   ...   3.5    8.     1.  ]
 ...
 [330.   116.     4.   ...   4.5    9.45   1.  ]
 [312.   103.     3.   ...   4.     8.78   0.  ]
 [333.   117.     4.   ...   4.     9.66   1.  ]]
(400, 7)


#### **4. Add a ones column vector to X:**
Add a new column consisting of ones as the $0^{th}$ column of X. See the [numpy documentation](https://numpy.org/doc/stable/reference/generated/numpy.c_.html) for more details. Divide the data X and y into x_train, x_test, y_train and y_test. The training set will contain 300 datapoints and the test set will contain 100 datapoints.
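
The linked documentation describes `np.c_`. A minimal sketch of prepending a ones column with it and then slicing rows into two parts, using a small made-up array (`X_small`, `X_aug`, `X_tr`, and `X_te` are illustrative names, and the 3/1 split only stands in for the 300/100 split):

```python
import numpy as np

# Made-up 4x3 feature matrix standing in for the real 400x7 one.
X_small = np.arange(12, dtype=float).reshape(4, 3)

# np.c_ concatenates along the second axis, so a 1-D vector of ones
# becomes the new column 0.
X_aug = np.c_[np.ones(X_small.shape[0]), X_small]

# Row slicing splits the data: first 3 rows for training, the rest for testing.
X_tr, X_te = X_aug[:3], X_aug[3:]

print(X_aug.shape, X_tr.shape, X_te.shape)
```

Using `X_small.shape[0]` rather than a hard-coded row count keeps the snippet valid if the number of samples changes.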

In [None]:
# Prepend a column of ones (the intercept term) as column 0 of x
x = np.column_stack((np.ones(x.shape[0]), x))
print(x)


[[  1.   337.   118.   ...   4.5    9.65   1.  ]
 [  1.   324.   107.   ...   4.5    8.87   1.  ]
 [  1.   316.   104.   ...   3.5    8.     1.  ]
 ...
 [  1.   330.   116.   ...   4.5    9.45   1.  ]
 [  1.   312.   103.   ...   4.     8.78   0.  ]
 [  1.   333.   117.   ...   4.     9.66   1.  ]]


In [None]:
#xtrain

x_train, y_train = x[:300], y[:300]

#print("x_train:",x_train)
#print("y_train:",y_train)

x_test, y_test= x[300:], y[300:]
print(x_train.shape, y_train.shape)

x_train: [[  1.   337.   118.   ...   4.5    9.65   1.  ]
 [  1.   324.   107.   ...   4.5    8.87   1.  ]
 [  1.   316.   104.   ...   3.5    8.     1.  ]
 ...
 [  1.   320.   120.   ...   4.5    9.11   0.  ]
 [  1.   330.   114.   ...   4.5    9.24   1.  ]
 [  1.   305.   112.   ...   3.5    8.65   0.  ]]
y_train: [0.92 0.76 0.72 0.8  0.65 0.9  0.75 0.68 0.5  0.45 0.52 0.84 0.78 0.62
 0.61 0.54 0.66 0.65 0.63 0.62 0.64 0.7  0.94 0.95 0.97 0.94 0.76 0.44
 0.46 0.54 0.65 0.74 0.91 0.9  0.94 0.88 0.64 0.58 0.52 0.48 0.46 0.49
 0.53 0.87 0.91 0.88 0.86 0.89 0.82 0.78 0.76 0.56 0.78 0.72 0.7  0.64
 0.64 0.46 0.36 0.42 0.48 0.47 0.54 0.56 0.52 0.55 0.61 0.57 0.68 0.78
 0.94 0.96 0.93 0.84 0.74 0.72 0.74 0.64 0.44 0.46 0.5  0.96 0.92 0.92
 0.94 0.76 0.72 0.66 0.64 0.74 0.64 0.38 0.34 0.44 0.36 0.42 0.48 0.86
 0.9  0.79 0.71 0.64 0.62 0.57 0.74 0.69 0.87 0.91 0.93 0.68 0.61 0.69
 0.62 0.72 0.59 0.66 0.56 0.45 0.47 0.71 0.94 0.94 0.57 0.61 0.57 0.64
 0.85 0.78 0.84 0.92 0.96 0.77 0.71 0.79 0.

#### **5. Solve the system of equations:**
Solve the system of equations $(Xβ = y)$ in the least-squares sense to find the values of the $β$ vector $(β_0, β_1, β_2, \ldots, β_n)$. You can find $β$ using $β = X^† y = (X^T X)^{-1} X^T y = R^{-1} Q^T y$, where $X = QR$ is the QR factorization of $X$. There is also a numpy function to calculate the pseudo-inverse: `np.linalg.pinv()`; see the [numpy documentation](https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html) for more details. Use `x_train` and `y_train` as the dataset.
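
The three expressions for $β$ above should agree whenever $X$ has full column rank. A minimal sketch that compares them on synthetic data (the names `X_demo`, `y_demo`, and `beta_true` are made up for illustration and are not part of the assignment):

```python
import numpy as np

# Synthetic stand-in for x_train / y_train: intercept column plus 3 random features.
rng = np.random.default_rng(0)
X_demo = np.column_stack((np.ones(50), rng.normal(size=(50, 3))))
beta_true = np.array([0.5, 1.0, -2.0, 0.25])
y_demo = X_demo @ beta_true + 0.01 * rng.normal(size=50)

# 1. Pseudo-inverse: beta = X^+ y
beta_pinv = np.linalg.pinv(X_demo) @ y_demo

# 2. Normal equations: solve (X^T X) beta = X^T y without forming an explicit inverse
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# 3. Reduced QR factorization: solve R beta = Q^T y
Q_demo, R_demo = np.linalg.qr(X_demo)
beta_qr = np.linalg.solve(R_demo, Q_demo.T @ y_demo)

print(np.allclose(beta_pinv, beta_normal), np.allclose(beta_pinv, beta_qr))
```

Solving the linear systems with `np.linalg.solve` instead of calling `np.linalg.inv` is a design choice in this sketch; both give the same $β$ here, but avoiding the explicit inverse is generally less sensitive to rounding.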

In [None]:
Q,R = np.linalg.qr(x_train)
R_inv = np.linalg.inv(R)
beta = np.matmul(R_inv, Q.T).dot(y_train)

print(beta.shape)
print(beta)

(8,)
[-1.27488267  0.00176755  0.00290322  0.00771492 -0.00531496  0.02828439
  0.11736896  0.01922889]


#### **6. Find predicted chance of admit:**
Find the predicted chance of admit, $\hat y$, by computing the matrix-vector product $Xβ$. Use `x_test` as the dataset for prediction.

In [None]:
y_pred = np.dot(x_test, beta)
#print(y_pred)
print(y_pred.shape)

(100,)


#### **7. Find the error vector e:**
Find the error vector, $e$, by subtracting $\hat y$ from `y_test`.

In [None]:
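# Note: this computes y_pred - y_test, the negative of the error vector described above;
# the sign does not affect the sum of squared errors used later.
e = y_pred - y_test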
print(e)

[-0.0291627   0.05766186  0.04444189  0.00295242  0.01423397  0.02456599
  0.01355752  0.02301928 -0.0093059  -0.00463892 -0.01610232 -0.004822
  0.01524217 -0.08948552 -0.02410907 -0.04437867 -0.00838076 -0.023962
 -0.05143653 -0.03255367 -0.03908204 -0.00718311 -0.04330553 -0.02406189
 -0.03057463  0.03777392 -0.07567995 -0.16899838 -0.01988532  0.06939951
 -0.04378967 -0.12201732 -0.11465392  0.02026922  0.01178839  0.01716884
 -0.00209538 -0.00134078 -0.0249297  -0.0327974  -0.05853858 -0.01838779
  0.05643793  0.02859978 -0.01923137  0.00654195  0.03479051 -0.00130943
 -0.11908446 -0.01005671 -0.04609117  0.05113857 -0.0252894  -0.04433086
 -0.06818865 -0.07999614 -0.00420757 -0.07982183 -0.15962318 -0.16903342
 -0.05977317 -0.03022282  0.00504381 -0.05678    -0.01401787 -0.03332566
 -0.01742299 -0.06197507 -0.01253476 -0.08469416 -0.10793626 -0.08207536
 -0.01710383 -0.06743465  0.17011774  0.15023548  0.11955198  0.00164999
 -0.04913035 -0.05982532 -0.0093456   0.00681752  0.010

#### **8. Compute the $r^2$ value:**
Recall that $r^2 = 1 - SSE / SST$, where $SSE = e^T e$ is the sum of squared errors and $SST = (y_{\text{test}} - \bar y_{\text{test}})^T (y_{\text{test}} - \bar y_{\text{test}})$ is the total sum of squares, with $\bar y_{\text{test}}$ the mean of `y_test`.

In [None]:
# Write appropriate code
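
One possible way to fill in this cell, sketched as a standalone helper and checked on toy values (the function name `r_squared` and the toy arrays are illustrative, not part of the assignment):

```python
import numpy as np

def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SSE / SST."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    e = y_true - y_hat                  # error vector
    sse = e @ e                         # sum of squared errors, e^T e
    centered = y_true - y_true.mean()   # deviations from the mean of the targets
    sst = centered @ centered           # total sum of squares
    return 1.0 - sse / sst

# Toy values, not taken from the dataset.
y_true_demo = np.array([0.9, 0.7, 0.5, 0.3])
y_hat_demo = np.array([0.8, 0.7, 0.6, 0.3])
print(r_squared(y_true_demo, y_hat_demo))
```

In the notebook the call would be `r_squared(y_test, y_pred)`. Because the errors are squared, the result is the same whether `e` is defined as `y_test - y_pred` or `y_pred - y_test`.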

#### **9. Plot the vectors $y$, $\hat y$, and $e$:**
Plot the vectors `y_test`, $\hat y$, and $e$, and make suitable observations. Use a different color for each of the three vectors when plotting.

In [None]:
# Write appropriate code
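
A minimal plotting sketch, assuming the three vectors are plotted against the test sample index on a single axis (the helper name `plot_fit`, the colors, and the toy inputs are illustrative choices, not requirements of the assignment):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_fit(y_true, y_hat):
    """Plot actual values, predictions, and errors against the sample index."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    e = y_true - y_hat
    idx = np.arange(len(y_true))

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(idx, y_true, color="tab:blue", label="y_test")
    ax.plot(idx, y_hat, color="tab:orange", label="y_pred")
    ax.plot(idx, e, color="tab:green", label="e")
    ax.axhline(0.0, color="gray", linewidth=0.5)  # zero reference line for the errors
    ax.set_xlabel("test sample index")
    ax.set_ylabel("chance of admit / error")
    ax.legend()
    return fig, ax

# Toy values, not taken from the dataset.
fig, ax = plot_fit([0.9, 0.7, 0.5, 0.3], [0.8, 0.7, 0.6, 0.3])
```

In the notebook the call would be `plot_fit(y_test, y_pred)`. Things worth looking for are whether the errors scatter evenly around zero and whether any test samples have much larger errors than the rest.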

#### **10. Test with new data:**
Introduce a new sample student with your own data, and find their predicted chance of admit.

In [None]:
# Write appropriate code
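
A minimal sketch of scoring one new student, assuming the feature order matches the columns of `X` and reusing the coefficients printed in step 5 (the helper `predict_chance` and the student's values are hypothetical):

```python
import numpy as np

def predict_chance(features, beta):
    """Prepend the intercept term to the features and take the dot product with beta."""
    row = np.concatenate(([1.0], np.asarray(features, dtype=float)))
    return float(row @ beta)

# Coefficients as printed in step 5 (intercept first).
beta_demo = np.array([-1.27488267, 0.00176755, 0.00290322, 0.00771492,
                      -0.00531496, 0.02828439, 0.11736896, 0.01922889])

# Hypothetical student: GRE, TOEFL, University Rating, SOP, LOR, CGPA, Research
new_student = [320, 110, 3, 3.5, 3.5, 8.5, 1]

print(predict_chance(new_student, beta_demo))
```

In the notebook, `beta` from step 5 can be passed directly instead of `beta_demo`. Since the model is linear, nothing constrains the prediction to the interval from 0 to 1, so an unusually strong or weak profile can yield a value outside that range.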