# Graded: 20 of 20 correct
1. Part 1
- [x] Create 1000 x 1000 matrix
- [x] Row sum
- [x] Column sum
- [x] Time sums
- [x] Loop row sum
- [x] Loop col sum
- [x] Time loop sums
- [x] Compare numpy to loop

2. Part 2
- [x] X array
- [x] Y array
- [x] Print out vehicle count
- [x] Train test split
- [x] Y test/train histograms
- [x] Linear regression fit
- [x] Print the regression coefficients
- [x] Print the mean squared error
- [x] Print R^2
- [x] Correct scatterplots
- [x] Correct layout
- [x] Axes/labels

Comments: 


##### <img src="../SDSS-Logo.png" style="display:inline; width:500px" />


# The objective of this programming exercise is two-fold:
* To run an example that shows the computational efficiency of the NumPy numerical library;
* To get experience with scikit-learn by building a predictive model from data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time

## Part 1
### For this part, you are going to compare the time it takes to compute row sums and column sums for a large matrix using NumPy functions vs. `for` loops.

#### NumPy method
* Create a large matrix of size `1000 x 1000` filled with random values.
* Use the NumPy `np.sum()` function to calculate the row sums and column sums of this matrix.
* Track the amount of time it takes to calculate the row sums and the column sums.
    * You can use the `time.time()` function to track time.

In [2]:
m1k = np.random.randint(0,10, (1000,1000))

In [3]:
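def sum_col_row_np(np_matrix):
    """Time NumPy's column and row sums; returns the elapsed seconds."""
    time_begin = time.time()
    sum_col = np.sum(np_matrix, 0)  # axis 0: add down the rows, one sum per column
    sum_row = np.sum(np_matrix, 1)  # axis 1: add across the columns, one sum per row
    time_end = time.time()
    return time_end - time_begin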

print(f"""Processing time to sum col and row: {sum_col_row_np(m1k)} s""")

Processing time to sum col and row: 0.001009225845336914 s


#### for loop method
* Now use `for` loops to compute the row sums and column sums for the same matrix.
* Again, track the time using the `time.time()` function.

In [4]:
def sum_col_row_loop(loop_matrix):
    """Time row and column sums computed with nested for loops; returns the elapsed seconds."""
    loop_begin = time.time()
    sum_loop_row = np.zeros(loop_matrix.shape[0])
    sum_loop_col = np.zeros(loop_matrix.shape[1])
    for i in range(loop_matrix.shape[0]):  # iterate over the rows
        for j in range(loop_matrix.shape[1]):  # iterate over the columns
            sum_loop_row[i] += loop_matrix[i][j]
            sum_loop_col[j] += loop_matrix[i][j]
    loop_end = time.time()

    return loop_end - loop_begin

print(f"""Processing time to sum col and row: {sum_col_row_loop(m1k)} s""")

Processing time to sum col and row: 0.5457260608673096 s
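Before comparing timings, it can be worth confirming that the two approaches produce the same sums. The sketch below does this on a small, hypothetical 2 x 3 matrix (not the assignment's matrix) whose sums are easy to verify by hand: the row sums are 6 and 15, and the column sums are 5, 7 and 9.

```python
import numpy as np

# Small hypothetical matrix so the result is easy to check by hand
m = np.array([[1, 2, 3],
              [4, 5, 6]])

# Loop-based sums, mirroring the nested-loop approach
row_sums = np.zeros(m.shape[0])
col_sums = np.zeros(m.shape[1])
for i in range(m.shape[0]):
    for j in range(m.shape[1]):
        row_sums[i] += m[i, j]
        col_sums[j] += m[i, j]

# axis=1 yields one sum per row; axis=0 yields one sum per column
assert np.array_equal(row_sums, np.sum(m, axis=1))
assert np.array_equal(col_sums, np.sum(m, axis=0))
print(row_sums, col_sums)
```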


#### Compare the time taken by the two methods.
* What is your conclusion?
* Try other array computations and compare the difference.

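The `for` loop version is far slower: about 0.55 s versus about 0.001 s for `np.sum()` on the same 1000 x 1000 matrix, a difference of roughly 540 times. The loop visits the elements one at a time in the Python interpreter, so every element access and every addition pays interpreter overhead. `np.sum()` performs the same arithmetic in optimized, compiled code that operates on the whole array at once, which is what is meant by a vectorized operation.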

In [None]:
m5k = np.random.randint(0, 10, (5000, 5000))

print(f"""Processing time to sum col and row Np.sum : {sum_col_row_np(m5k)} s""")
print(f"""Processing time to sum col and row loop	: {sum_col_row_loop(m5k)} s""")

Processing time to sum col and row Np.sum : 0.03491568565368652 s
Processing time to sum col and row loop	: 12.809828519821167 s
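As one more array computation to compare, the sketch below times a dot product of two random vectors with an explicit loop and with `np.dot()`. The vector length of 200,000 and the helper name `dot_loop` are arbitrary choices for illustration, and `time.perf_counter()` is swapped in for `time.time()` because it is intended for measuring short intervals. Timings will vary by machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(200_000)  # hypothetical vector length
b = rng.random(200_000)

def dot_loop(x, y):
    """Dot product computed with an explicit Python loop."""
    total = 0.0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total

start = time.perf_counter()
loop_result = dot_loop(a, b)
loop_time = time.perf_counter() - start

start = time.perf_counter()
np_result = np.dot(a, b)
np_time = time.perf_counter() - start

# Both methods should agree up to floating-point rounding
assert np.isclose(loop_result, np_result)
print(f"loop: {loop_time:.4f} s, np.dot: {np_time:.6f} s")
```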



## Part 2
### For this part, you will create a predictive model using scikit-learn to predict vehicle MPG from vehicle characteristic data.
* The data for this part is from the [EPA fuel economy](https://www.fueleconomy.gov/) website.
* For the purpose of this programming exercise, we have downloaded this data and modified it to create a reasonably clean analytical data set.
* The data set includes the model years 2020-2023.

### Load the vehicles data set for 2020-2023.
* The data is in the CSV file `carDataMPG2023.csv`
* The code below uses the pandas `read_csv()` function to read the data into a pandas data frame.
    * You will learn about pandas and data frames in the next unit, but we are using them here to simplify setting up the problem.

In [21]:
# Read the car MPG data
carDataRead = pd.read_csv("carDataMPG2023.csv")
carDataRead #.head(10)

| Unnamed: 0 | id | make | model | year | fuelType1 | drive | trany | VClass | cylinders | displ | speeds | drive_number | avgMpg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41213 | Toyota | Corolla | 2020 | Regular Gasoline | Front-Wheel Drive | Automatic (AV-S10) | Compact Cars | 4.0 | 2.0 | 10.0 | 2 | 46.31000 |
| 1 | 41215 | Toyota | Corolla | 2020 | Regular Gasoline | Front-Wheel Drive | Manual 6-spd | Compact Cars | 4.0 | 2.0 | 6.0 | 2 | 42.54910 |
| 2 | 41216 | Toyota | Corolla XSE | 2020 | Regular Gasoline | Front-Wheel Drive | Automatic (AV-S10) | Compact Cars | 4.0 | 2.0 | 10.0 | 2 | 45.39342 |
| 3 | 41218 | Toyota | Corolla | 2020 | Regular Gasoline | Front-Wheel Drive | Manual 6-spd | Compact Cars | 4.0 | 1.8 | 6.0 | 2 | 44.54000 |
| 4 | 41222 | Kia | Soul | 2020 | Regular Gasoline | Front-Wheel Drive | Manual 6-spd | Small Station Wagons | 4.0 | 2.0 | 6.0 | 2 | 35.47000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4531 | 47517 | Mitsubishi | Outlander Sport 2WD | 2023 | Regular Gasoline | Front-Wheel Drive | Automatic (AV-S6) | Small Sport Utility Vehicle 2WD | 4.0 | 2.4 | 6.0 | 2 | 33.59417 |
| 4532 | 47518 | Mercedes-Benz | GLC300 4matic | 2023 | Premium Gasoline | All-Wheel Drive | Automatic 9-spd | Small Sport Utility Vehicle 4WD | 4.0 | 2.0 | 9.0 | 4 | 34.07000 |
| 4533 | 47519 | Toyota | Corolla Cross Hybrid AWD | 2023 | Regular Gasoline | All-Wheel Drive | Automatic (AV-S6) | Small Sport Utility Vehicle 4WD | 4.0 | 2.0 | 6.0 | 4 | 64.02492 |
| 4534 | 47520 | INEOS Automotive | Grenadier | 2023 | Premium Gasoline | 4-Wheel Drive | Automatic (S8) | Standard Sport Utility Vehicle 4WD | 6.0 | 3.0 | 8.0 | 4 | 19.75000 |


### Next, create NumPy arrays of the predictor and target variables.
* The predictor variables will be the columns `cylinders`, `displ`, `speeds` and `drive_number`
    * Call the array of predictor variables `X`
* The target variable will be the column `avgMpg`
    * Call the array of the target variable `Y`
* The NumPy `np.array()` function can convert a homogeneous section of a data frame to a NumPy array.

In [24]:
X = np.array(carDataRead[['cylinders', 'displ', 'speeds', 'drive_number']])  # shape (n_vehicles, 4)
Y = np.array(carDataRead['avgMpg'])  # shape (n_vehicles,)
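scikit-learn estimators expect the predictor array to have shape `(n_samples, n_features)`, that is, one row per vehicle. The sketch below uses a hypothetical three-row data frame (made-up values, not the EPA data) to show that selecting a list of columns gives that layout, while stacking the individual columns in a list gives its transpose.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the vehicle data frame
df = pd.DataFrame({
    "cylinders": [4.0, 6.0, 8.0],
    "displ": [2.0, 3.0, 5.0],
    "speeds": [6.0, 8.0, 10.0],
    "drive_number": [2, 4, 4],
    "avgMpg": [40.0, 28.0, 20.0],
})
predictors = ["cylinders", "displ", "speeds", "drive_number"]

X_rows = np.array(df[predictors])               # shape (3, 4): one row per vehicle
X_cols = np.array([df[c] for c in predictors])  # shape (4, 3): one row per variable
Y = np.array(df["avgMpg"])                      # shape (3,)

print(X_rows.shape, X_cols.shape, Y.shape)
# The two layouts hold the same values, transposed
assert np.array_equal(X_rows, X_cols.T)
```

If an array is already in the `(n_features, n_samples)` layout, taking its transpose with `.T` restores the layout scikit-learn expects.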

### Print out the number of vehicles in the dataset

In [8]:
# Your code here
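One possible way to get the vehicle count is to read the number of rows from the predictor array, assuming it has one row per vehicle. The arrays below are hypothetical placeholders; in the notebook, `X` and `Y` would come from `carDataRead`.

```python
import numpy as np

# Hypothetical placeholder arrays: 5 vehicles, 4 predictors
X = np.zeros((5, 4))
Y = np.zeros(5)

n_vehicles = X.shape[0]  # number of rows = number of vehicles
assert n_vehicles == len(Y)  # predictors and target should line up
print(f"Number of vehicles in the dataset: {n_vehicles}")
```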

### Split the X and Y arrays into training and testing datasets using `train_test_split()`.
* Keep 80% of the data for training and 20% for testing. 
* Call the training and testing splits `X_train`, `Y_train`, `X_test` and `Y_test`.

In [9]:
# Your code here
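A minimal sketch of an 80/20 split with `train_test_split()` follows. The random arrays are hypothetical stand-ins for `X` and `Y`, and `random_state=42` is an arbitrary choice that only serves to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 4 predictors
rng = np.random.default_rng(0)
X = rng.random((100, 4))
Y = rng.random(100)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```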

### Plot the histograms of `Y_test` and `Y_train` to make sure they are comparable.
* With a dataset of this size, this is less of a problem than with a small data set.
* Nonetheless, it is worth checking that the training-test split has not biased one of these subsets in one way or another.

In [10]:
# Your code here
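One way to compare the two target distributions is to overlay their histograms on shared bins. The sketch below uses hypothetical, normally distributed MPG-like values rather than the real data, and sets `density=True` so that the subsets can be compared even though the training set is four times larger.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with MPG-like target values
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
Y = rng.normal(30, 8, size=1000)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Shared bin edges so the two histograms are directly comparable
bins = np.linspace(Y.min(), Y.max(), 31)

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(Y_train, bins=bins, density=True, alpha=0.5, label="Y_train")
ax.hist(Y_test, bins=bins, density=True, alpha=0.5, label="Y_test")
ax.set_xlabel("avgMpg")
ax.set_ylabel("Density")
ax.set_title("Target distribution: training vs. testing")
ax.legend()
```

In a notebook the figure renders when the cell finishes; in a script, a call to `plt.show()` would be needed.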

### Linear regression fit
* Use the `LinearRegression` model in scikit-learn to fit a regression model with `X_train` as the predictor variables and `Y_train` as the target.
* Use the `predict()` method from the fitted model to predict `avgMpg` from `X_test`, and call the predicted value `Y_pred`.

In [11]:
# Your code here
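The fit and predict steps could look like the sketch below. It runs on hypothetical synthetic data generated from made-up coefficients (`true_coef`), so the numbers say nothing about the real vehicle data. Only the sequence of calls carries over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.random((200, 4))
true_coef = np.array([-3.0, -2.0, 0.5, -1.0])  # made-up coefficients
Y = 50.0 + X @ true_coef + rng.normal(0, 0.1, size=200)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, Y_train)     # estimate coefficients from the training set
Y_pred = model.predict(X_test)  # predict the target for the held-out rows
print(Y_pred[:5])
```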

### Outputs
* Print the regression coefficients
* Print the mean squared error between `Y_test` and `Y_pred`
* Print the R^2, also called the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination), between `Y_test` and `Y_pred`

In [12]:
# Your code here
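The three outputs can be obtained from the fitted model's `coef_` attribute and from the `mean_squared_error()` and `r2_score()` functions in `sklearn.metrics`. The sketch below again uses hypothetical synthetic data with made-up coefficients, so the printed values are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.random((200, 4))
Y = 50.0 + X @ np.array([-3.0, -2.0, 0.5, -1.0]) + rng.normal(0, 0.1, size=200)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, Y_train)
Y_pred = model.predict(X_test)

mse = mean_squared_error(Y_test, Y_pred)  # average squared prediction error
r2 = r2_score(Y_test, Y_pred)             # fraction of variance explained

print("Regression coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean squared error:", mse)
print("R^2:", r2)
```

For a regressor, `model.score(X_test, Y_test)` returns the same R^2 value as `r2_score(Y_test, Y_pred)`.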

### Scatter plots
* Create scatter plots of `Y_test` and `Y_pred` against each of the 4 predictor variables `cylinders`, `displ`, `speeds`, `drive_number`.
* Do this as subplots in a 2 X 2 grid. 
* Label the subplots appropriately.
* Include axis labels also.

In [None]:
# Your code here
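A possible layout for the 2 x 2 grid is sketched below: each subplot shows `Y_test` and `Y_pred` against one predictor, with a title, axis labels and a legend. The data are hypothetical synthetic values, and the column order in `predictors` is assumed to match the column order of `X`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data; columns assumed to follow the order in `predictors`
predictors = ["cylinders", "displ", "speeds", "drive_number"]
rng = np.random.default_rng(0)
X = rng.random((200, 4))
Y = 50.0 + X @ np.array([-3.0, -2.0, 0.5, -1.0]) + rng.normal(0, 0.1, size=200)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
Y_pred = LinearRegression().fit(X_train, Y_train).predict(X_test)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for k, (ax, name) in enumerate(zip(axes.flat, predictors)):
    # Observed and predicted targets against the k-th predictor
    ax.scatter(X_test[:, k], Y_test, s=10, alpha=0.6, label="Y_test")
    ax.scatter(X_test[:, k], Y_pred, s=10, alpha=0.6, label="Y_pred")
    ax.set_xlabel(name)
    ax.set_ylabel("avgMpg")
    ax.set_title(f"avgMpg vs. {name}")
    ax.legend()
fig.tight_layout()  # keep titles and labels from overlapping
```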