<a href="https://colab.research.google.com/github/saifulislamdev/artificial-intelligence/blob/main/equation_of_a_slime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [285]:
# Imports section

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn import svm

from sklearn.preprocessing import PolynomialFeatures

import io
import requests

## Part 1. Loading the dataset

In [286]:
# Using pandas, load the dataset (load remotely, not locally)
url = "https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv"
response = requests.get(url)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
df = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

In [287]:
# Output the first 15 rows of the data
print(df.head(15), '\n')

    Temperature °C  Mols KCL     Size nm^3
0              469       647  6.244743e+05
1              403       694  5.779610e+05
2              302       975  6.196847e+05
3              779       916  1.460449e+06
4              901        18  4.325726e+04
5              545       637  7.124634e+05
6              660       519  7.006960e+05
7              143       869  2.718260e+05
8               89       461  8.919803e+04
9              294       776  4.770210e+05
10             991       117  2.441771e+05
11             307       781  5.006455e+05
12             206        70  3.145200e+04
13             437       599  5.390215e+05
14             566        75  9.185271e+04 



In [288]:
# Display a summary of the table information (number of datapoints, etc.)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB
None


## Part 2. Splitting the dataset

In [289]:
# Take the pandas dataset and split it into our features (X) and label (y)
X = df[['Temperature °C', 'Mols KCL']].values # features (X)
y = df['Size nm^3'].values # label (y)

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
# random_state is fixed so the split, and therefore the fitted coefficients and intercept, stays the same across reruns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=1)

## Part 3. Perform a Linear Regression

In [290]:
# Use sklearn to train a model on the training set
reg = LinearRegression().fit(X_train, y_train)

# Create a sample datapoint and predict the output of that sample with the trained model
sample_datapoint = [[89, 461]]
print("Sample datapoint output:", reg.predict(sample_datapoint)[0])

# Report the score for the model (computed here on the training set); the markdown cell below explains what it means
print("Score:", reg.score(X_train, y_train))

Sample datapoint output: 141562.21554556227
Score: 0.8608840241280852


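The score is the coefficient of determination, $R^2$, of the model's predictions. It compares the model's squared error against that of a baseline that always predicts the mean of the labels: 1.0 is a perfect fit, 0 is no better than the mean, and negative values are worse than the mean. A score of about 0.86 therefore means the linear model accounts for roughly 86% of the variance in slime size. Note that this score is computed on the training set, so it describes how well the model fits data it has already seen rather than how well it generalizes to the held-out test set.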
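
To make the definition concrete, here is a minimal sketch that computes $R^2$ by hand as $1 - SS_{res}/SS_{tot}$. The helper name `r_squared` and the toy numbers are illustrative assumptions, not values from the dataset; for a fitted regressor, `reg.score(X, y)` reports the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a baseline that always predicts the mean label
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy labels and predictions (hypothetical, for illustration only)
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]
close = [1.1, 1.9, 3.2, 3.8]

print(r_squared(y_true, perfect))    # a perfect fit scores 1.0
print(r_squared(y_true, mean_only))  # predicting the mean scores 0.0
print(r_squared(y_true, close))      # small errors score close to 1
```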

In [291]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
print("Coefficients:", reg.coef_)
print("Intercept:", reg.intercept_)

Coefficients: [ 861.67891651 1036.69847878]
Intercept: -413045.2067400077


$h(x) = -413045.20674 + 861.67892x_1 + 1036.69848x_2$

where $x_1$ is the temperature in °C and $x_2$ is the mols of KCl.
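
As a quick sanity check, the equation above can be evaluated at the sample datapoint from the previous cell (temperature 89, 461 mols KCl). This is a small sketch in which the function name `h` is an assumption; the coefficients are copied from the printed output, so the result agrees with `reg.predict` only up to the rounding of those printed values.

```python
# Coefficients and intercept as printed by the model above
INTERCEPT = -413045.2067400077
COEF_TEMPERATURE = 861.67891651
COEF_MOLS_KCL = 1036.69847878

def h(temperature_c, mols_kcl):
    """Evaluate the fitted linear hypothesis h(x) = b + w1*x1 + w2*x2."""
    return INTERCEPT + COEF_TEMPERATURE * temperature_c + COEF_MOLS_KCL * mols_kcl

prediction = h(89, 461)
print(round(prediction, 2))  # about 141562.22, in line with the model's sample output
```

For comparison, row 8 of the table in Part 1 lists a size of about 89,198 for these same inputs, so the linear model overshoots noticeably at this point.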

## Part 4. Use Cross Validation

In [292]:
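# Use the cross_val_score function to repeat the experiment across many shuffles of the data
# (100 random splits, each holding out 10% of the rows for scoring)
cv = ShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
# Note: the estimator here is a linear-kernel support vector regressor, not the LinearRegression from Part 3
clf = svm.SVR(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=cv)

# Report on the findings and their significance
scores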


array([0.8744549 , 0.86605835, 0.83270803, 0.86666097, 0.84651432,
       0.85759979, 0.83017233, 0.84737449, 0.86513442, 0.76675157,
       0.85713635, 0.85830875, 0.87327884, 0.86868118, 0.85829925,
       0.84173315, 0.88884666, 0.86097308, 0.83668964, 0.88030161,
       0.85636607, 0.87230811, 0.84885618, 0.88827511, 0.86956894,
       0.84093495, 0.8710812 , 0.86711555, 0.84878514, 0.88316784,
       0.88535583, 0.85284702, 0.87802948, 0.84533243, 0.84716785,
       0.86069718, 0.84407959, 0.84008055, 0.83574209, 0.85909357,
       0.85707895, 0.83251992, 0.8369089 , 0.86778104, 0.81400614,
       0.88458404, 0.88151102, 0.84037551, 0.79916302, 0.86615135,
       0.87401657, 0.85757245, 0.87385828, 0.87446182, 0.86359295,
       0.81150386, 0.85316269, 0.86733136, 0.88007221, 0.85370635,
       0.84962353, 0.85789815, 0.85318883, 0.83597441, 0.85231969,
       0.86606607, 0.83875678, 0.85175107, 0.87868122, 0.84157886,
       0.82573742, 0.86537851, 0.8751695 , 0.85789074, 0.83677

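Across the shuffles shown, the score ranges from about 0.77 to about 0.89, with most values falling between 0.83 and 0.89. Some shuffles therefore score slightly above or roughly equal to the 0.86 from Part 3, while the worst is about 0.10 lower. The spread shows that the score from any single train/test split depends noticeably on which rows land in the test set, so the distribution over many shuffles is a more reliable estimate than one number. The two results are not strictly comparable, though: the scores here come from a linear-kernel `SVR` evaluated on held-out data, whereas Part 3 scored `LinearRegression` on its own training set.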
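
Rather than reading the scores one by one, a mean and spread summarize them more usefully. The sketch below uses only the first ten scores from the output above as an illustration; in the notebook, the same summary would normally come from `scores.mean()` and `scores.std()` on the full array.

```python
import statistics

# First ten cross-validation scores, copied from the output above (a subset, for illustration)
first_scores = [0.8744549, 0.86605835, 0.83270803, 0.86666097, 0.84651432,
                0.85759979, 0.83017233, 0.84737449, 0.86513442, 0.76675157]

mean_score = statistics.mean(first_scores)
spread = statistics.pstdev(first_scores)  # population std, like NumPy's default

print(f"mean R^2: {mean_score:.3f}")
print(f"std:      {spread:.3f}")
print(f"range:    {min(first_scores):.3f} to {max(first_scores):.3f}")
```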

## Part 5. Using Polynomial Regression

In [293]:
# Using PolynomialFeatures, perform another regression on an augmented dataset of degree 2
# Degree-2 features for inputs (x1, x2) are ordered as [1, x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(2)
X_aug = poly.fit_transform(X)
# Same test_size and random_state as Part 2, so the same rows land in the training and test sets
X_train_aug, X_test_aug, y_train_aug, y_test_aug = train_test_split(X_aug, y, test_size=0.1, random_state=1)
poly_reg = LinearRegression().fit(X_train_aug, y_train_aug)

# Report the score (computed on the training set); it is explained in the next markdown cell
print("Score:", poly_reg.score(X_train_aug, y_train_aug))

# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
print("Coefficients:", poly_reg.coef_)
print("Intercept:", poly_reg.intercept_)

Score: 1.0
Coefficients: [ 0.00000000e+00  1.20000000e+01 -1.12640975e-07 -1.78967952e-11
  2.00000000e+00  2.85714287e-02]
Intercept: 1.470447750762105e-05


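This time the score is 1.0, the best possible value: the degree-2 model reproduces the training labels exactly, to within the printed precision. That is consistent with slime size being an exact quadratic function of temperature and mols of KCl, which the linear model in Part 3 could only approximate. As in Part 3, the score is computed on the training set.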
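
Because the score above is measured on the training set, it is worth checking the held-out rows as well. The following is a minimal sketch on synthetic data, assuming a hypothetical noise-free quadratic target; the generating function and the `_demo` variable names are made up for illustration and are not the notebook's dataset. It chains `PolynomialFeatures` and `LinearRegression` in a pipeline and scores both splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data: two features and a made-up quadratic target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 0] * X_demo[:, 1] + 0.1 * X_demo[:, 1] ** 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=1)

# The pipeline expands the features to degree 2, then fits ordinary least squares
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_tr, y_tr)

train_score = model.score(X_tr, y_tr)
test_score = model.score(X_te, y_te)
print("Train R^2:", train_score)
print("Test R^2:", test_score)
```

In the notebook itself, the equivalent check would be `poly_reg.score(X_test_aug, y_test_aug)`.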

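$h(x) = 1.47045 \times 10^{-5} + 12x_1 - 1.12640975 \times 10^{-7} x_2 - 1.78967952 \times 10^{-11} x_1^2 + 2x_1x_2 + 2.85714287 \times 10^{-2} x_2^2$

The intercept and the $x_2$ and $x_1^2$ coefficients are numerically negligible at the scale of this data, so the fit is effectively $h(x) \approx 12x_1 + 2x_1x_2 + 0.0285714x_2^2$.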
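
One way to check the equation is to evaluate it on a few rows from the table printed in Part 1. The sketch below does that in plain Python; the function name `h_poly` is an assumption, and the coefficients and rows are copied from the printed outputs, so agreement is only expected up to their displayed precision.

```python
import math

# Intercept and coefficients as printed above, in the order [1, x1, x2, x1^2, x1*x2, x2^2]
INTERCEPT = 1.470447750762105e-05
COEF = [0.0, 12.0, -1.12640975e-07, -1.78967952e-11, 2.0, 2.85714287e-02]

def h_poly(x1, x2):
    """Evaluate the degree-2 hypothesis for temperature x1 and mols KCl x2."""
    features = [1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2]
    return INTERCEPT + sum(c * f for c, f in zip(COEF, features))

# (temperature, mols KCl, size) for rows 0, 4, and 8 of the table in Part 1
rows = [(469, 647, 6.244743e+05), (901, 18, 4.325726e+04), (89, 461, 8.919803e+04)]

for x1, x2, size in rows:
    predicted = h_poly(x1, x2)
    # Compare against the tabulated size, allowing for its 7-digit display precision
    print(x1, x2, round(predicted, 2), math.isclose(predicted, size, rel_tol=1e-6))
```

Each of the three rows agrees with the tabulated size to within that tolerance, which matches the training score of 1.0.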