<div style="text-align: right">INFO 6105 Data Sci Engineering Methods and Tools, Lecture 4</div>
<div style="text-align: right">Prof. Dino Konstantopoulos, 27 January 2020</div>

# Math and the Scientific Method


<br />
<center>
<img src="ipynb.images/john-stuart-mill.jpg" width=400 />
    John Stuart Mill (1806 - 1873)
</center>

# Fibonacci numbers

What's the *interesting* feature of Fibonacci numbers? The *future* ***is*** encoded in the *past*!
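
In symbols, assuming the usual convention that the sequence starts at 0 and 1, each term is fully determined by the two terms before it:

$$F_0 = 0, \qquad F_1 = 1, \qquad F_n = F_{n-1} + F_{n-2} \quad \text{for } n \ge 2.$$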

Please complete the Fibonacci generator below:

<div style="visibility: hidden">
    def fib(n):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b
</div>

In [None]:
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b  # TODO: complete this assignment (as written, it raises a TypeError)

my_fibs = list(fib(80))
print(my_fibs)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.Series(my_fibs)
data

In [None]:
data.values

In [None]:
data2 = pd.DataFrame(my_fibs, columns = ['Fibonacci'])
data2

In [None]:
data2.values

In [None]:
plt.figure(figsize=(17, 8))
plt.plot(data2.values)
plt.title('Fibonacci numbers up to fib(80)')
plt.ylabel('Fib')
plt.xlabel('Integers')
plt.grid(True)

Let's *normalize*: numerical algorithms tend to *dislike* huge numbers (our largest value is around $10^{16}$), and they are much happier with values rescaled to the range $[0, 1]$:

In [None]:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
data3 = sc.fit_transform(data2)
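
For reference, min-max scaling with the default target range of 0 to 1 subtracts the column minimum and divides by the column range. The sketch below reproduces that arithmetic by hand with NumPy on a few made-up values (the array `values` is illustrative, not the notebook's data):

```python
import numpy as np

# Made-up illustrative values (not the notebook's data)
values = np.array([0.0, 1.0, 1.0, 2.0, 3.0, 5.0, 8.0])

# Min-max scaling: subtract the minimum, then divide by the range
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled)  # the smallest value maps to 0, the largest to 1
```

The ordering and the relative spacing of the values are preserved; only the scale changes.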

In [None]:
plt.figure(figsize=(17, 8))
plt.plot(data3)
plt.title('Fibonacci numbers up to fib(80), normalized')
plt.ylabel('Fib')
plt.xlabel('Integers')
plt.grid(True)

Now that our data is normalized, let's return it to a pandas DataFrame:

In [None]:
data4 = pd.DataFrame(data3, columns=data2.columns, index=data2.index)
data4

For convenience, the next section starts by rebuilding `data2` from `my_fibs`, so it can be run on its own.

# Using a random training/test split

In [None]:
data2 = pd.DataFrame(my_fibs, columns = ['Fibonacci'])
data2

In [None]:
data2.plot()

Run ***only one*** of the following two cells:

Run this cell to work with potentially ***huge*** numbers:

In [None]:
data5 = data2.copy()  # copy, so that adding columns to data5 leaves data2 untouched

Run this cell to work with ***small*** numbers:

In [None]:
data5 = data3

In [None]:
data5

If you picked the 2nd option, `data5` is a NumPy array, so convert it back to a pandas **dataframe**:

In [None]:
data5 = pd.DataFrame(data5, columns=['Fibonacci'], index=data2.index)

In [None]:
data5

I only need the previous ***two*** values to evaluate the next value, but just for kicks, to see if the model correctly picks the right columns, I'll add the previous ***three*** values:

In [None]:
#for s in range(1,10):
for s in range(1,4):
    data5['Fibonacci_{}'.format(s)] = data5['Fibonacci'].shift(s)
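
To see what `shift` does, here is a small sketch on a short made-up series (the names `s` and `lagged` are illustrative only). Each lag column holds the value from one or two rows earlier, and the rows that have no predecessor are filled with `NaN`:

```python
import pandas as pd

# A short, hand-written stretch of the sequence
s = pd.Series([0, 1, 1, 2, 3, 5], name='Fibonacci')

lagged = pd.DataFrame({
    'Fibonacci': s,
    'Fibonacci_1': s.shift(1),  # value one row earlier
    'Fibonacci_2': s.shift(2),  # value two rows earlier
})

print(lagged)
print(lagged.dtypes)  # the shifted columns become floats because of the NaN entries
```

Note that the shifted columns are stored as floats, which matters once the integers become very large.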

In [None]:
data5

Verify:

In [None]:
data5['Fibonacci'] - data5['Fibonacci_1'] - data5['Fibonacci_2']

Oops! What's that *last* row saying??!
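
One plausible explanation, assuming you picked the huge-numbers option: the shifted columns are floats, and a 64-bit float can only represent every integer exactly up to $2^{53}$. The sketch below checks the largest value in our series, hard-coded here as `last` for illustration:

```python
# The largest value in the series (the last element of list(fib(80)))
last = 14472334024676221

# Is it beyond the range where a 64-bit float holds every integer exactly?
print(last > 2**53)

# Does it survive a round trip through a float unchanged?
print(int(float(last)) == last)
```

If the round trip changes the value, subtracting the two previous terms can no longer give exactly zero.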

Let's get rid of it!

In [None]:
data5 = data5[:-1]
data5

In [None]:
data5['Fibonacci'] - data5['Fibonacci_1'] - data5['Fibonacci_2']

In [None]:
X = data5.dropna().drop('Fibonacci', axis=1)
y = data5.dropna().drop(['Fibonacci_'+str(i) for i in range(1,4)], axis=1)

In [None]:
X

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
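
A quick sketch of what this call does with its default settings, using made-up arrays (`X_demo` and `y_demo` are illustrative names): the rows are shuffled, and when no size is given a quarter of them is held out for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)  # 100 made-up rows, 2 features
y_demo = np.arange(100)                  # one made-up target per row

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)  # sizes of the training and test parts
print(y_te[:10])               # test rows are drawn from all over the data
```

Fixing `random_state` makes the shuffle reproducible from run to run.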

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

In [None]:
# Create a model 
rf_model = RandomForestRegressor()

In [None]:
# Train the model
rf_model.fit(X_train, y_train)

In [None]:
# Get the R2 score (1 is perfect, 0 is no better than always predicting the mean, negative is worse)
rf_model.score(X_test, y_test)
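
For a regressor, `score` returns the coefficient of determination $R^2$. As a sketch of what that number means, here it is computed by hand on a few made-up values (the helper `r2_score_manual` is a hypothetical name, not a library function):

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_score_manual(y_true, y_true))                # perfect predictions
print(r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than predicting the mean
```

So 1 is the best possible score, 0 matches a model that always predicts the mean, and the score can drop below 0.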

Pretty good!

Which columns did the algorithm pick to determine the target variable?

In [None]:
sorted(zip(X.columns, rf_model.feature_importances_),
        key=lambda x: x[1], reverse=True)

The algorithm correctly determined that the previous and next-to-previous columns are the right ones to prioritize!

In [None]:
y_pred = rf_model.predict(X_test)

In [None]:
type(y_pred)

In [None]:
y_pred

In [None]:
type(y_test)

In [None]:
y_test

In [None]:
y_pred_df = pd.DataFrame(y_pred, columns=['Fibonacci'], index=y_test.index)

In [None]:
y_pred_df

In [None]:
plt.plot(y_test.values)
plt.plot(y_pred)

# ML is Function Approximation Theory

Sometimes **linear**, sometimes **non-linear**, depending on the algorithm!

Let's check this empirically.

In [None]:
import numpy as np
import pandas as pd
x = np.random.uniform(low=0.5, high=20, size=(1000,))
y = np.random.uniform(low=0.5, high=20, size=(1000,))
df = pd.DataFrame({'x':x, 'y':y})
df.plot('x', 'y', kind='scatter')

In [None]:
df.head()

In [None]:
df['z'] = 5.* df['x'] + 0.2 * df['y']
df.head()

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

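# add_subplot is used because gca(projection=...) no longer works in recent matplotlib releases
threedee = plt.figure().add_subplot(111, projection='3d')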
threedee.scatter(df['x'], df['y'], df['z'])
threedee.set_xlabel('x')
threedee.set_ylabel('y')
threedee.set_zlabel('z')
plt.show()

In [None]:
X = df.dropna().drop('z', axis=1)
y = df.dropna().drop(['x', 'y'], axis=1)

In [None]:
X

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

# Create a model 
rf_model = RandomForestRegressor()

# Train the model
rf_model.fit(X_train, y_train)

# Get the R2 score (1 is perfect, 0 is no better than always predicting the mean, negative is worse)
rf_model.score(X_test, y_test)

In [None]:
y_pred = rf_model.predict(X_test)

In [None]:
y_pred.shape

In [None]:
y_test.shape

In [None]:
dfv = pd.DataFrame({'y_test':np.squeeze(y_test).values, 'y_pred':y_pred})
dfv.plot('y_test', 'y_pred', kind='scatter')

Convinced?
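
As a cross-check that is not part of the original notebook, ordinary least squares should recover the two coefficients of this linear `z` exactly. The sketch below regenerates similar random data with its own seed (the names `rng`, `A` and `coeffs` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
x = rng.uniform(0.5, 20, size=1000)
y = rng.uniform(0.5, 20, size=1000)
z = 5.0 * x + 0.2 * y           # same linear relationship as above

# Solve the least-squares problem A @ coeffs ~ z
A = np.column_stack([x, y])
coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)

print(coeffs)  # fitted weights for x and y
```

A linear model fits a linear relationship by construction; the interesting question is what happens when the relationship stops being linear.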

Now let's try something (mildly) non-linear!

In [None]:
df['z'] = 5.* df['x']**2 + 0.2 * df['y']**3 
df.head()

In [None]:
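# add_subplot is used because gca(projection=...) no longer works in recent matplotlib releases
threedee = plt.figure().add_subplot(111, projection='3d')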
threedee.scatter(df['x'], df['y'], df['z'])
threedee.set_xlabel('x')
threedee.set_ylabel('y')
threedee.set_zlabel('z')
plt.show()

In [None]:
X = df.dropna().drop('z', axis=1)
y = df.dropna().drop(['x', 'y'], axis=1)

In [None]:
X

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

# Create a model 
rf_model = RandomForestRegressor()

# Train the model
rf_model.fit(X_train, y_train)

# Get the R2 score (1 is perfect, 0 is no better than always predicting the mean, negative is worse)
rf_model.score(X_test, y_test)

In [None]:
y_pred = rf_model.predict(X_test)
dfv = pd.DataFrame({'y_test':np.squeeze(y_test).values, 'y_pred':y_pred})
dfv.plot('y_test', 'y_pred', kind='scatter')

Now let's try something *highly* non-linear!

In [None]:
#x = np.linspace(-10,10,1000)
#y = np.linspace(-10,10,1000)
x = np.random.uniform(low=-10, high=10, size=(1000,))
y = np.random.uniform(low=-10, high=10, size=(1000,))
df = pd.DataFrame({'x':x, 'y':y})
df.plot('x', 'y', kind='scatter')

In [None]:
df['z'] = 5.* df['x']**2 + 0.2 * df['y']**3 
df.head()

In [None]:
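# add_subplot is used because gca(projection=...) no longer works in recent matplotlib releases
threedee = plt.figure().add_subplot(111, projection='3d')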
threedee.scatter(df['x'], df['y'], df['z'])
threedee.set_xlabel('x')
threedee.set_ylabel('y')
threedee.set_zlabel('z')
plt.show()

In [None]:
X = df.dropna().drop('z', axis=1)
y = df.dropna().drop(['x', 'y'], axis=1)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

# Create a model 
rf_model = RandomForestRegressor()

# Train the model
rf_model.fit(X_train, y_train)

# Get the R2 score (1 is perfect, 0 is no better than always predicting the mean, negative is worse)
rf_model.score(X_test, y_test)

In [None]:
y_pred = rf_model.predict(X_test)
dfv = pd.DataFrame({'y_test':np.squeeze(y_test).values, 'y_pred':y_pred})
dfv.plot('y_test', 'y_pred', kind='scatter')

Still good!

So random forests (ensembles of decision trees) pick up on **non-linear** relationships, too!

Or do they?

# A different split

What if we do not use `sklearn`'s train/test split, and instead use our own to predict **intervals** instead of isolated datapoints (e.g. the *future* from the *past*)?

In [None]:
import numpy as np
import pandas as pd

x = np.random.uniform(low=-10, high=10, size=(1000,))
y = np.random.uniform(low=-10, high=10, size=(1000,))
df = pd.DataFrame({'x':x, 'y':y})
df.plot('x', 'y', kind='scatter')

In [None]:
df['z'] = (5.* df['x']**2 + 0.2 * df['y']**3)
df.head()

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

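# add_subplot is used because gca(projection=...) no longer works in recent matplotlib releases
threedee = plt.figure().add_subplot(111, projection='3d')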
threedee.scatter(df['x'], df['y'], df['z'])
threedee.set_xlabel('x')
threedee.set_ylabel('y')
threedee.set_zlabel('z')
plt.show()

In [None]:
X = df.dropna().drop('z', axis=1)
y = df.dropna().drop(['x', 'y'], axis=1)

In [None]:
X.shape

Let's use the interval \[0, 800\] as the *past* (to train with), and \[800, 1000\] as the *future* (to predict or test the training):

In [None]:
y_train = y[:800]
y_test = y[800:]

In [None]:
y_train

In [None]:
y_test

In [None]:
X_train = X.drop(X.index[800:])
X_test = X.drop(X.index[0:800])
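
The `drop` calls above keep the first 800 rows for training and the remaining rows for testing. As a sketch on a small made-up frame (`demo` is an illustrative name), the same split can be written with positional slicing:

```python
import pandas as pd

demo = pd.DataFrame({'x': range(10), 'y': range(10, 20)})

# The style used above: drop the rows we do not want
train_a = demo.drop(demo.index[8:])
test_a = demo.drop(demo.index[0:8])

# Equivalent positional slices: keep the rows we do want
train_b = demo.iloc[:8]
test_b = demo.iloc[8:]

print(train_a.equals(train_b), test_a.equals(test_b))
```

Either form works; slicing with `iloc` states the intent (first rows versus last rows) a little more directly.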

In [None]:
X_train

In [None]:
X_test

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

# Create a model 
rf_model = RandomForestRegressor()

# Train the model
rf_model.fit(X_train, y_train)

# Get the R2 score (1 is perfect, 0 is no better than always predicting the mean, negative is worse)
rf_model.score(X_test, y_test)

In [None]:
y_pred = rf_model.predict(X_test)
dfv = pd.DataFrame({'y_test':np.squeeze(y_test).values, 'y_pred':y_pred})
dfv.plot('y_test', 'y_pred', kind='scatter')

Yahhhh! Still a great prediction! So it's not like we're predicting individual datapoints because of smoothness and good linear approximations. We are predicting entire intervals (\[800, 1000\])!

# More wildly non-linear

Really, can ML algorithms pick up *all* non-linearities? How about we use the same non-linear `z = f(x,y)` as right above, except let's *wiggle* the hell out of it with trigonometric functions!

<br />
<center>
<img src="ipynb.images/shar-pei.jpg" width=400 />
Shar-Pei breed
</center>

In [None]:
df['z'] = (5.* df['x']**2 + 0.2 * df['y']**3) * np.sin(df['x']) * np.cos(df['y'])
df.head()

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

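# add_subplot is used because gca(projection=...) no longer works in recent matplotlib releases
threedee = plt.figure().add_subplot(111, projection='3d')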
threedee.scatter(df['x'], df['y'], df['z'])
threedee.set_xlabel('x')
threedee.set_ylabel('y')
threedee.set_zlabel('z')
plt.show()

In [None]:
X = df.dropna().drop('z', axis=1)
y = df.dropna().drop(['x', 'y'], axis=1)

Let's try *both* types of splits! Run one ***or the other*** cell below, *not both*! Then come back, and run the other one.

In [None]:
y_train = y[:800]
y_test = y[800:]
X_train = X.drop(X.index[800:])
X_test = X.drop(X.index[0:800])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

# Create a model 
rf_model = RandomForestRegressor()

# Train the model
rf_model.fit(X_train, y_train)

# Get the R2 score (1 is perfect, 0 is no better than always predicting the mean, negative is worse)
rf_model.score(X_test, y_test)

In [None]:
y_pred = rf_model.predict(X_test)
dfv = pd.DataFrame({'y_test':np.squeeze(y_test).values, 'y_pred':y_pred})
dfv.plot('y_test', 'y_pred', kind='scatter')

Ohhh. Not that good anymore, right?

<br />
<center>
<img src="ipynb.images/home-alone.jpg" width=400 />
</center>

# Conclusion

ML algorithms are *guaranteed* to work when there is a *linear relationship* between the independent variables ($X$) and the dependent variable ($y$), and they may even work on *some* non-linear relationships between $X$ and $y$. But if the non-linearity is *too strong*, they may fail quite dramatically, and you need to work hard to be able to model it. And when there *is no relationship*, as in attempting to predict the future from the past, they are *guaranteed to fail*.

In other words, it's what your professor told you:

>Machine Learning is (linear) function approximation theory

And that is what your brain does as well! Your brain's reasoning nucleus is made out of high-dimensional *surfaces* that capture the *models* that you've built to subsume your life's *experiences*. But like our [Shar-Pei](https://en.wikipedia.org/wiki/Shar_Pei) data above, your models may *fail* you when the experience becomes highly non-linear!

So now that you know how *easy* it is to make mistakes, my dear students, you can officially consider yourselves *indoctrinated* into Western civilization's [scientific approach](https://partiallyexaminedlife.com/2015/03/06/science-technology-and-society-ii-j-s-mill-on-scientific-method/), best described by [John Stuart Mill](https://en.wikipedia.org/wiki/John_Stuart_Mill).

>**The Scientific Approach**: In A System of Logic (1843) Mill proposed what has since become the standard description of a scientific explanation, called the Covering Law Model. According to Mill, science is concerned with the discovery of regular patterns in experience (laws), and a scientific explanation of a fact is one that fixes its relationship to such laws. As we gain experience in detecting these laws, we observe that certain features of investigation are more conducive to discovery than others. We might, in other words, propose a law about the discovery of laws – the Scientific Method. This method is, simply, to use inference and inductive reason to create a set of hypotheses, and then to use deductive reason to derive from them likely consequences. We then perform an experiment, and on that basis we eliminate or revise our theories until we arrive at the true explanation.

# Quiz

Can you change one thing from the problem above to make the prediction *successful*, assuming ***you cannot change the `z` equation***?

<br />
<center>
<img src="ipynb.images/funny-fish.gif" width=400 />
    The End
</center>