
In [1]:
# Imports section
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

## Part 1. Loading the dataset

In [2]:
# Using pandas load the dataset (load remotely, not locally)
df = pd.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")

In [3]:
# Output the first 15 rows of the data
df.head(15)

Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
0,469,647,624474.3
1,403,694,577961.0
2,302,975,619684.7
3,779,916,1460449.0
4,901,18,43257.26
5,545,637,712463.4
6,660,519,700696.0
7,143,869,271826.0
8,89,461,89198.03
9,294,776,477021.0


In [4]:
# Display a summary of the table information (number of datapoints, etc.)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


In [5]:
df.describe()

Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
count,1000.0,1000.0,1000.0
mean,500.5,471.53,508611.1
std,288.819436,288.482872,447483.8
min,1.0,1.0,16.11429
25%,250.75,226.75,129826.7
50%,500.5,459.5,382718.2
75%,750.25,710.25,760321.1
max,1000.0,1000.0,1972127.0


## Part 2. Splitting the dataset

In [6]:
# Take the pandas dataset and split it into our features (X) and label (y)
X = df[["Temperature °C", "Mols KCL"]]
y = df["Size nm^3"]

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

In [7]:
# validate
print(f"X_train shape is {X_train.shape}")
print(f"X_test shape is {X_test.shape}")

X_train shape is (900, 2)
X_test shape is (100, 2)
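The split above is unseeded, so every run produces different training rows and therefore slightly different scores and coefficients. A minimal sketch of a reproducible split, assuming a fixed `random_state` is acceptable for the assignment; `X_demo`, `y_demo` and the seed value 42 are hypothetical stand-ins, not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 rows, 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Passing the same random_state twice yields the same partition
split_a = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

same_split = all(np.array_equal(p, q) for p, q in zip(split_a, split_b))
print(same_split, split_a[0].shape, split_a[1].shape)
```

With a seed in place, numbers quoted in markdown cells stay in sync with the printed outputs when the notebook is re-run.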


## Part 3. Perform a Linear Regression

In [8]:
# Use sklearn to train a model on the training set
model = LinearRegression().fit(X_train, y_train)

In [9]:
# Create a sample datapoint and predict the output of that sample with the trained model
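# Build the sample with the training column names so predict() sees matching features
sample_x = pd.DataFrame([[450, 650]], columns=X.columns)
sample_y = model.predict(sample_x)[0]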
print(f"Predict output is: {sample_y:.5f}")

Predict output is: 648476.58069




In [10]:
# Report on the score for that model, in your own words (markdown, not code) explain what the score means
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Train score is {train_score:.5f} and Test Score is {test_score:.5f}.")

Train score is 0.86214 and Test Score is 0.84754.


The score is the coefficient of determination ($R^2$) of the prediction: the proportion of the variance in the dependent variable that the model explains from the independent variable(s). The best possible score is 1.0, a model that always predicts the mean of $y$ scores 0, and the score can be negative when the model fits worse than that baseline.  

In the above example, the train score is about 0.86 and the test score is about 0.85, so the linear model explains roughly 86% of the variance in the training labels and 85% in the test labels. The two scores are close, which suggests the model is not overfitting the training set.
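To make the definition of the score concrete, here is a small sketch that computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand. The helper `r2_score_manual` and the toy labels are hypothetical and are not taken from the dataset:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical toy labels
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_score_manual(y_toy, y_toy)                  # exact predictions
r2_mean = r2_score_manual(y_toy, np.full(4, y_toy.mean()))  # always predict the mean
r2_bad = r2_score_manual(y_toy, y_toy[::-1])                # predictions in reverse order

print(r2_perfect, r2_mean, r2_bad)
```

The three cases illustrate the range of the score: 1 for exact predictions, 0 for the mean baseline, and a negative value for predictions that are worse than the baseline.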

In [11]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
coef1 = round(model.coef_[0], 5)
coef2 = round(model.coef_[1], 5)
intercept = round(model.intercept_, 5)
print(f"Coefficients are {coef1}, {coef2}, and intercept is {intercept}")

Coefficients are 859.88701, 1025.0278, and intercept is -404740.64235


**Equation**  
$y = 859.88701x_1 + 1025.0278x_2 - 404740.64235$,  
where $x_1$ is the Temperature feature of the sample and $x_2$ is the Mols KCL feature of the sample
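As a sanity check, the linear hypothesis can be evaluated by hand with the rounded coefficients and intercept printed by the cell above, then compared with the prediction reported earlier for the sample (450, 650). The function name `h` and the constant names are illustrative only:

```python
# Rounded values copied from the printed output of the coefficient cell
COEF_TEMPERATURE = 859.88701
COEF_MOLS_KCL = 1025.0278
INTERCEPT = -404740.64235

def h(temperature, mols_kcl):
    """Linear hypothesis h(x) = w1 * x1 + w2 * x2 + b."""
    return COEF_TEMPERATURE * temperature + COEF_MOLS_KCL * mols_kcl + INTERCEPT

manual_prediction = h(450, 650)
reported_prediction = 648476.58069  # value printed by model.predict for the same sample

print(f"manual: {manual_prediction:.5f}, reported: {reported_prediction:.5f}")
```

The two values agree up to the rounding of the coefficients, which confirms that the equation and the fitted model describe the same function.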

## Part 4. Use Cross Validation

In [12]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
k_folds = cross_val_score(model, X_train, y_train, cv = 5)
print(k_folds)
print(f"The average is: {k_folds.mean()}")

[0.88637348 0.84975116 0.82665296 0.8799896  0.83873166]
The average is: 0.85629977352959


**Report on their finding and their significance:**  
`cross_val_score` returns an array with one score per fold. In the above case, `cv = 5`, so the training data is divided into 5 parts; each part is held out once for scoring while the model is fit on the other 4, and the returned array contains 5 $R^2$ scores. The scores range from about 0.83 to 0.89 with a mean of about 0.856, which is close to the train score of 0.862, so the model's performance does not depend heavily on which rows it is trained on.  

Cross validation is significant because it does not need a separate validation set; instead, it splits the training set into k smaller sets and uses each in turn for validation. It therefore gives us a way to estimate how well the model generalizes using only the training set. This approach is especially useful when the sample size is small, because every row is used for both fitting and validation.
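The fold-by-fold procedure can be sketched explicitly with `KFold`. The data below is synthetic and hypothetical (`X_demo`, `y_demo` stand in for the training set); the point is only that a manual loop reproduces what `cross_val_score` returns for an integer `cv` with a regressor:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 1000, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + rng.normal(0, 50, size=200)

# For a regressor, cv=5 corresponds to an unshuffled 5-fold split
kfold = KFold(n_splits=5)
manual_scores = []
for fit_idx, val_idx in kfold.split(X_demo):
    # Fit on 4 folds, score (R^2) on the held-out fold
    fold_model = LinearRegression().fit(X_demo[fit_idx], y_demo[fit_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
print(manual_scores)
print(library_scores)
```

Note that the folds are consecutive blocks of rows; to score on shuffled folds, a `KFold(n_splits=5, shuffle=True)` object can be passed as `cv` instead of an integer.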

## Part 5. Using Polynomial Regression

In [13]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(X_train)
x_poly_test = poly.transform(X_test)  # reuse the transformer fitted on the training set
new_model = LinearRegression().fit(x_poly, y_train)

In [14]:
# Report on the metrics and output the resultant equation as you did in Part 3.
train_score = new_model.score(x_poly, y_train)
test_score = new_model.score(x_poly_test, y_test)
print(f"Train score is {train_score:.5f} and Test Score is {test_score:.5f}.")

Train score is 1.00000 and Test Score is 1.00000.


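The score using polynomial regression has increased compared to linear regression. Both the train and test scores round to 1.00000, which means that the degree-2 model explains essentially all of the variance in $y$; this suggests the labels are (almost) exactly a quadratic function of the two features.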

In [15]:
coef = new_model.coef_
rounded_coef = np.round(coef, decimals=5)
intercept = new_model.intercept_
print(f"Coefficients are {rounded_coef}\nIntercept is {intercept:.5f}")

Coefficients are [ 0.      12.      -0.      -0.       2.       0.02857]
Intercept is 0.00000


**Equation**  
Since degree-2 polynomial features are of the form $[1, a, b, a^2, ab, b^2]$, with $a = x_1$ and $b = x_2$, the coefficients for $1$, $x_2$ and $x_1^2$ round to zero and the equation is as follows:  

$y = 12x_1 + 2x_1x_2 + 0.02857x_2^2$,  
where $x_1$ is the Temperature feature of the sample and $x_2$ is the Mols KCL feature of the sample. The intercept rounds to 0.
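The nonzero coefficients printed above (12 for Temperature, 2 for the Temperature × Mols KCL product, 0.02857 for Mols KCL squared) can be checked against rows shown in the `df.head(15)` output. This is a minimal sketch; the function name `size_nm3` is illustrative:

```python
def size_nm3(temperature, mols_kcl):
    """Degree-2 model using the rounded printed coefficients (intercept prints as 0.00000)."""
    return 12 * temperature + 2 * temperature * mols_kcl + 0.02857 * mols_kcl ** 2

# (Temperature °C, Mols KCL, Size nm^3) copied from rows 0, 4 and 8 of the head output
rows = [
    (469, 647, 624474.3),
    (901, 18, 43257.26),
    (89, 461, 89198.03),
]

errors = [abs(size_nm3(t, m) - size) for t, m, size in rows]
max_abs_error = max(errors)
print(f"largest absolute error: {max_abs_error:.3f}")
```

The remaining error is below 1 nm^3 on values in the tens to hundreds of thousands, which is consistent with the coefficient 0.02857 being a rounded value.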