[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/samckoy/Assignment-3/blob/main/Assignment%20%233.ipynb)

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# Part 1 : Loading the Dataset

Using pandas, we load the CSV file directly from its remote URL.

In [2]:
slime = pd.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")

The head( ) method returns the first rows of the DataFrame; here we ask for the first 15.

In [3]:
slime.head(15)

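| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| 0 | 469 | 647 | 624474.3 |
| 1 | 403 | 694 | 577961.0 |
| 2 | 302 | 975 | 619684.7 |
| 3 | 779 | 916 | 1460449.0 |
| 4 | 901 | 18 | 43257.26 |
| 5 | 545 | 637 | 712463.4 |
| 6 | 660 | 519 | 700696.0 |
| 7 | 143 | 869 | 271826.0 |
| 8 | 89 | 461 | 89198.03 |
| 9 | 294 | 776 | 477021.0 |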


The info( ) method provides a technical summary of the data: the column names, non-null counts, data types, and memory usage.

In [4]:
slime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


# Part 2 : Splitting the Dataset

In the case of this dataset, the temperature and mols of KCL (the independent variables) determine the size of the slime (the dependent variable). Therefore, **Temperature °C** and **Mols KCL** are our *X values*, and **Size nm^3** is our *y value*.

In [5]:
X = slime[["Temperature °C","Mols KCL"]]
y = slime["Size nm^3"]

From here, we split this data into a training set and a test set, where the training set is 90% of the data, and the test set is 10% of the data. I am setting the *random_state* parameter to 10 so that my results are consistent.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.1, random_state = 10)
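As a quick sanity check on the 90/10 proportions, the sketch below applies the same `train_test_split` settings to a synthetic stand-in with 1000 rows. The arrays `X_demo` and `y_demo` are made up for illustration; only their shapes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the slime data: 1000 rows, 2 features (values are arbitrary).
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 1000, size=(1000, 2))
y_demo = rng.random(1000)

# Same settings as the split above: 10% held out, fixed random_state.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=10
)

print(X_tr.shape, X_te.shape)  # (900, 2) (100, 2)
```

With 1000 rows, a `test_size` of 0.1 leaves 900 rows for training and 100 for testing.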

# Part 3 : Perform a Linear Regression

Next, I train my model on the training set by calling the *fit( )* method.

In [7]:
lr = LinearRegression()
lr.fit(X_train.values,y_train)

LinearRegression()

I now want to see what size my model predicts for slime exposed to a temperature of 500°C and 600 mols of KCL.

In [8]:
lr.predict([[500,600]])

array([637379.74448775])

If the slime is exposed to a temperature of 500°C and 600 mols of KCL, the model predicts a slime size of: $$ 6.3738 \times 10^5 \text{ nm}^3 $$

The *score( )* method returns the coefficient of determination, R^2, of h(x) on the data it is given. R^2 measures how much of the variation in our y-values the model explains.

In [9]:
lr.score(X_test.values,y_test)

0.8838673863856773

Our R^2 coefficient is about **0.88**, which means that our model explains about 88% of the variance in slime size on the test set. This is a fairly good fit, but not a perfect one: roughly 12% of the variation is left unexplained by a purely linear model.
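For reference, R^2 is defined as $$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$ The sketch below computes it by hand on a few made-up numbers (not taken from the slime data) and compares the result with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, purely for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: how far the predictions are from the true values.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: how far the true values are from their own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

An R^2 of 1 means the residuals are all zero, while an R^2 of 0 means the model does no better than always predicting the mean.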

To determine our values for h(x), we use *coef_* to find the coefficients, and *intercept_* to find the y-intercept. 

In [10]:
lr.coef_

array([ 876.56037994, 1032.21149885])

In [11]:
lr.intercept_

-420227.34479107795

Plugging these values into h(x), where $x_1$ is the temperature in °C and $x_2$ is the mols of KCL: $$ h(x) = -4.2023 \times 10^5 + 876.56x_1 + 1032.2x_2 $$
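To check that this equation matches what `lr.predict` returned earlier, the sketch below evaluates h(x) by hand at 500°C and 600 mols of KCL. The numbers are copied from the `coef_` and `intercept_` outputs; the function name `h` and the variable names are my own.

```python
# Values copied from the lr.intercept_ and lr.coef_ outputs above.
intercept = -420227.34479107795
coef_temperature = 876.56037994
coef_mols_kcl = 1032.21149885


def h(temperature_c, mols_kcl):
    """Evaluate the fitted linear model by hand."""
    return intercept + coef_temperature * temperature_c + coef_mols_kcl * mols_kcl


manual_prediction = h(500, 600)
print(manual_prediction)
```

The result agrees with the 637379.74 predicted by the model, up to the rounding of the printed coefficients.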

# Part 4 : Use Cross Validation

We now use *cross_val_score* to compute an R^2 score for each of 10 folds of our dataset. For each fold, the model is trained on the other nine folds and scored on the one held out.

In [12]:
scores = cross_val_score(lr,X,y,cv=10)
scores

array([0.81123596, 0.86440978, 0.87808742, 0.86561069, 0.87495621,
       0.84484397, 0.87941022, 0.86349411, 0.78353682, 0.88686516])

In [13]:
scores.mean(),scores.std()

(0.8552450341984701, 0.0315287629653424)

Looking at the mean and standard deviation of our cross-validation scores, we can see that our model has an average R^2 of about 0.86 with a standard deviation of about 0.03, so its performance is consistent from fold to fold. This supports our finding in Part 3 that the linear model is a fairly good, but not perfect, predictor of the size of the slime.
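To make it clearer what `cross_val_score` does with `cv=10`, here is a minimal sketch that reproduces it with an explicit `KFold` loop. It runs on synthetic data (`X_demo`, `y_demo` are made up), and it relies on my understanding that an integer `cv` for a regressor corresponds to an unshuffled `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, roughly linear data; not the slime dataset.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 50, size=200)

# Manual 10-fold loop: fit on nine folds, score R^2 on the held-out fold.
manual_scores = []
for train_idx, test_idx in KFold(n_splits=10).split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The library helper should give the same ten scores.
library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=10)
print(np.allclose(manual_scores, library_scores))
```

Because every row is held out exactly once, the mean of these scores is a less split-dependent estimate than a single train/test score.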

# Part 5 : Using Polynomial Regression

We use the *PolynomialFeatures* transformer to expand our current data matrix into a new data matrix containing every polynomial term up to degree 2: a constant column, $x_1$, $x_2$, $x_1^2$, $x_1 x_2$, and $x_2^2$.

In [14]:
pr = PolynomialFeatures(degree=2)
poly = pr.fit_transform(X)
poly

array([[1.00000e+00, 4.69000e+02, 6.47000e+02, 2.19961e+05, 3.03443e+05,
        4.18609e+05],
       [1.00000e+00, 4.03000e+02, 6.94000e+02, 1.62409e+05, 2.79682e+05,
        4.81636e+05],
       [1.00000e+00, 3.02000e+02, 9.75000e+02, 9.12040e+04, 2.94450e+05,
        9.50625e+05],
       ...,
       [1.00000e+00, 7.91000e+02, 2.13000e+02, 6.25681e+05, 1.68483e+05,
        4.53690e+04],
       [1.00000e+00, 7.69000e+02, 5.53000e+02, 5.91361e+05, 4.25257e+05,
        3.05809e+05],
       [1.00000e+00, 9.19000e+02, 4.52000e+02, 8.44561e+05, 4.15388e+05,
        2.04304e+05]])

We then fit a linear regression on the new data matrix to obtain the new coefficients for our equation. Note that this fit uses the full dataset rather than only the training set.

In [15]:
lin = LinearRegression().fit(poly,y)
lin.coef_

array([ 0.00000000e+00,  1.20000000e+01, -1.23112040e-07, -1.05619484e-11,
        2.00000000e+00,  2.85714287e-02])

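Our new h(x) is: $$ h(x) = 12x_1 - 1.2311 \times 10^{-7}x_2 - 1.0562 \times 10^{-11}x_1^2 + 2x_1 x_2 + 2.8571 \times 10^{-2}x_2^2 $$

The coefficients on $x_2$ and $x_1^2$ are so small that they are effectively zero, which leaves $h(x) \approx 12x_1 + 2x_1 x_2 + 0.028571x_2^2$. The intercept, stored in *intercept_*, is not displayed here.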
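As a worked check, the sketch below plugs the first row of the dataset (469°C, 647 mols of KCL) into this equation and compares the result with the observed size of 624474.3. The coefficients are copied from the `coef_` output; the intercept is assumed to be zero, since it is not shown in the notebook.

```python
# First row of the dataset, as shown in the head() output.
temperature_c, mols_kcl, observed_size = 469, 647, 624474.3

# Coefficients copied from lin.coef_ (the constant-column coefficient is 0).
# Assumption: the intercept is treated as 0 because lin.intercept_ is not shown.
c_x1, c_x2 = 12.0, -1.23112040e-07
c_x1_sq, c_x1_x2, c_x2_sq = -1.05619484e-11, 2.0, 2.85714287e-02

predicted_size = (
    c_x1 * temperature_c
    + c_x2 * mols_kcl
    + c_x1_sq * temperature_c**2
    + c_x1_x2 * temperature_c * mols_kcl
    + c_x2_sq * mols_kcl**2
)

print(predicted_size, observed_size)
```

The two values agree to within the rounding of the displayed size, which is consistent with the tiny coefficients contributing essentially nothing.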

In [16]:
lin.score(poly,y)

1.0

In [17]:
poly_scores = cross_val_score(lr,poly,y,cv=10)
poly_scores

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [18]:
poly_scores.mean(), poly_scores.std()

(1.0, 0.0)

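Our R^2 coefficient is now 1.0, both on the full dataset and in every cross-validation fold, which means that the degree-2 model reproduces the size of the slime essentially exactly. This strongly suggests that the dataset was generated from a degree-2 polynomial in temperature and mols of KCL.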
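One way to package the polynomial expansion and the regression together is a scikit-learn pipeline, so that both steps are refit inside each cross-validation fold. This is only a sketch of that pattern, not the approach used above. It runs on synthetic data whose generating formula is my own assumption, chosen to resemble the fitted coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from an assumed degree-2 formula; not the real slime dataset.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 1000, size=(500, 2)).astype(float)
t, m = X_demo[:, 0], X_demo[:, 1]
y_demo = 12 * t + 2 * t * m + m**2 / 35

# Expansion and regression are chained, so each fold fits both steps together.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
pipe_scores = cross_val_score(poly_model, X_demo, y_demo, cv=10)

print(pipe_scores.mean(), pipe_scores.std())
```

When the target really is a degree-2 polynomial of the inputs with no noise, every fold scores an R^2 of essentially 1, which is the same pattern seen in the results above.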