<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(^ Be sure to update this button to point to your notebook instead of the sample notebook.)

In [20]:
# Imports section
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

## Part 1. Loading the dataset

In [2]:
# Using pandas load the dataset (load remotely, not locally)
# Output the first 15 rows of the data
# Display a summary of the table information (number of datapoints, etc.)
df = pd.read_csv("science_data_large.csv")
print('First 15 rows of the data \n', df.head(15))
print()  # blank line
df.info()  # table summary; info() prints directly and returns None, so it is not wrapped in print()

First 15 rows of the data 
     Temperature °C  Mols KCL     Size nm^3
0              469       647  6.244743e+05
1              403       694  5.779610e+05
2              302       975  6.196847e+05
3              779       916  1.460449e+06
4              901        18  4.325726e+04
5              545       637  7.124634e+05
6              660       519  7.006960e+05
7              143       869  2.718260e+05
8               89       461  8.919803e+04
9              294       776  4.770210e+05
10             991       117  2.441771e+05
11             307       781  5.006455e+05
12             206        70  3.145200e+04
13             437       599  5.390215e+05
14             566        75  9.185271e+04

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64 
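
Note that `DataFrame.info()` writes its summary directly to standard output (or to a buffer passed as `buf`) and returns `None`; passing its result to `print()` would therefore print an extra `None`. The sketch below is only an illustration: it uses a three-row frame (`demo`, a hypothetical name) copied from the first rows shown above and captures the summary as a string.

```python
import io

import pandas as pd

# Three rows copied from the table output above; `demo` is an illustrative name.
demo = pd.DataFrame({"Temperature °C": [469, 403, 302], "Mols KCL": [647, 694, 975]})

buffer = io.StringIO()
returned = demo.info(buf=buffer)   # the summary is written to the buffer, not stdout
summary_text = buffer.getvalue()

print(returned)      # info() has no return value
print(summary_text)  # the text info() would otherwise print itself
```

Capturing the text this way is only needed when the summary has to be stored or logged; in a notebook cell a bare `df.info()` call is enough.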

## Part 2. Splitting the dataset

In [3]:
# Take the pandas dataset and split it into our features (X) and label (y)
X = df[['Temperature °C', 'Mols KCL']].to_numpy()
Y = df['Size nm^3'].to_numpy()
# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=.90, random_state=0)
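
A quick way to check the 90/10 split is to look at the resulting shapes. The sketch below uses placeholder arrays (`X_demo`, `y_demo`, both assumptions) with the same dimensions as the dataset, 1000 rows and 2 features, rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: row i holds [2i, 2i + 1] and its label is i,
# so feature/label pairing can be verified after shuffling.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.90, random_state=0
)

print(X_tr.shape, X_te.shape)  # training rows vs. test rows
# Each shuffled feature row still lines up with its own label.
print(bool((X_tr[:, 0] // 2 == y_tr).all()))
```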

## Part 3. Perform a Linear Regression

In [4]:
# Use sklearn to train a model on the training set
reg = linear_model.LinearRegression()
reg.fit(X_train, Y_train)
# Create a sample datapoint and predict the output of that sample with the trained model
print(f"Prediction: {reg.predict(X_test)}")
# Report the score for that model and, in your own words (markdown, not code), explain what the score means
print(f"Score: {reg.score(X_train, Y_train)}")
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
print(f"coefficients: {reg.coef_}")
print(f"intercept: {reg.intercept_}")

Prediction: [ 134891.57109752  566848.1770532  1241137.60975794  -24367.1696409
  -70657.3340248   384956.44708746  467392.47066236  960873.69862144
    8407.39619923   51611.35989487  838231.36519831 1232720.46674085
  901563.3931183  -118835.20674597  188176.20244545 1158467.7057509
  665263.26735435  560560.92590561  163940.53886357  469055.93661764
 1419180.64745835  -84401.29716064  555956.53417346 1401489.68980634
  450621.8221688  1161403.05700777   45784.96014151  408948.47482721
  421868.97874553  645969.69402768  467893.44392697  208517.41869172
 1438187.73449241  708330.20055877 1400047.62029314  711132.70221441
  942380.06884206 1116186.8384737  1257396.19462681  189369.12027644
   71374.08771584  870102.22808014   94087.37936383  407174.28131093
  271473.01984826  388791.32041263  237873.60116061  790013.37060675
  112125.67358907  512139.47602221  905252.99153208 -163237.6627926
 1056768.59266956  757865.71816238  422587.22624213  838109.54727724
 1168808.55676556  420476

Equation: $Size\ nm^3 = m_1 \cdot Temperature\ °C + m_2 \cdot Mols\ KCL + intercept$

$Size\ nm^3 = 863.58108791 \cdot Temperature\ °C + 1006.12741921 \cdot Mols\ KCL - 400305.9133335327$
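
To make the equation concrete, the sketch below evaluates it for single datapoints. The coefficients and intercept are copied from the equation above; the helper name `h` and the choice of rows 0 and 12 from the table in Part 1 are illustrative assumptions.

```python
import numpy as np

# Values copied from the fitted equation above.
coef = np.array([863.58108791, 1006.12741921])
intercept = -400305.9133335327


def h(temperature_c, mols_kcl):
    """Evaluate the fitted linear hypothesis for one datapoint."""
    return float(coef @ np.array([temperature_c, mols_kcl]) + intercept)


# Row 0 of the table in Part 1: 469 °C, 647 mol KCl (measured size 6.244743e+05).
print(h(469, 647))
# Row 12: 206 °C, 70 mol KCl (measured size 3.145200e+04).
print(h(206, 70))
```

For the second point the linear fit returns a negative size, which is not physically meaningful and matches the negative values in the prediction output above. With the trained model in scope, `reg.predict(np.array([[469, 647]]))` should give the same number as `h(469, 647)`.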

## Part 4. Use Cross Validation

In [18]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
scores = cross_val_score(reg, X_train, Y_train, cv = 50)
print("%0.2f mean R^2 with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
# Report on the findings and their significance

0.83 mean R^2 with a standard deviation of 0.09
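
Two details of `cross_val_score` matter when interpreting this number. For a regressor the default score is R², not classification accuracy, and an integer `cv` produces K-fold splits without shuffling. The sketch below shows one way to shuffle explicitly by passing a `KFold` object. It runs on synthetic data; the linear relationship and the names `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: a made-up linear relationship with a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=200)

# shuffle=True randomizes which rows land in each fold; random_state fixes the shuffle.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    linear_model.LinearRegression(), X_demo, y_demo, cv=folds, scoring="r2"
)

print("%0.3f mean R^2 with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
```

With 900 training rows, a smaller number of folds than 50 leaves more rows in each validation fold, which tends to make the per-fold R² less noisy.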


## Part 5. Using Polynomial Regression

In [23]:
# Using the PolynomialFeatures class, perform another regression on an augmented dataset of degree 2
poly = PolynomialFeatures(2)
X_train = poly.fit_transform(X_train)
# Reuse the expansion fitted on the training set; do not refit on the test set
X_test = poly.transform(X_test)
model = linear_model.LinearRegression()
model.fit(X_train, Y_train)
print(f"Score: {model.score(X_train, Y_train)}")
print(f"coefficients: {model.coef_}")
print(f"intercept: {model.intercept_}")
# Report on the metrics and output the resultant equation as you did in Part 3.

Score: 1.0
coefficients: [-1.67092717e-06 -4.59112302e-07  5.99999995e+00 -6.85973218e-07
  2.15546316e-08  6.66666650e-01  9.52381068e-03 -6.11407661e-11
  5.99999940e+00 -1.16888293e-06 -9.97037660e-09  6.66666675e-01
  9.52381067e-03 -9.96993115e-09  6.66666675e-01 -2.28827814e-12
 -1.26892972e-11  1.10935228e-10  9.52381068e-03  1.40704083e-11
 -1.12939135e-10 -4.21785059e-12  9.72290744e-16 -9.28401615e-16
  2.38291600e-11 -2.38291617e-11  1.80924214e-15  1.70218563e-15]
intercept: 0.00045260752085596323
