# Background information

Old Faithful is a cone geyser in Yellowstone National Park in Wyoming, United States. It is a highly predictable geothermal feature and has erupted every 44 minutes to two hours since 2000. Eruptions can shoot 3,700 to 8,400 US gallons (14,000 to 32,000 L) of boiling water to a height of 106 to 185 feet (32 to 56 m) and last from 1.5 to 5 minutes. The average height of an eruption is 145 feet (44 m). Intervals between eruptions can range from 60 to 110 minutes. The average interval was 66.5 minutes in 1939 and has slowly increased to about 90 minutes today, which may be the result of earthquakes affecting subterranean water levels.

This notebook has two goals:
1. Build a model to predict the waiting time between eruptions.
2. Run a cluster analysis of the eruptions.

![Old Faithful](https://upload.wikimedia.org/wikipedia/commons/thumb/8/80/OldFaithful1948.jpg/250px-OldFaithful1948.jpg)

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

pd.options.display.max_columns = None

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# First look at the data:

The dataset has two columns: the duration of each eruption (`eruptions`) and the interval between eruptions (`waiting`), both in minutes.

In [None]:
data = pd.read_csv('/kaggle/input/old-faithful/faithful.csv')

In [None]:
data.drop(axis=1, labels='Unnamed: 0',inplace=True)
print('The dataset has',data.shape[0],'rows and',data.shape[1],'columns.')
print('-------------')
print('Sample data:')
print(data.head(5))
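
The `Unnamed: 0` column dropped above is usually a row index that was written out when the CSV was saved. Assuming that is the case here, the sketch below shows how `index_col=0` avoids the extra column at read time. The three-row CSV text is a hypothetical stand-in, not the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for faithful.csv. The first column has
# no header, which is what a saved row index looks like.
csv_text = """,eruptions,waiting
1,3.600,79
2,1.800,54
3,3.333,74
"""

# Default read: the headerless first column is named 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses that column as the index, so nothing has to be dropped.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

Either approach leaves the same two data columns; reading with `index_col=0` just saves the separate `drop` call.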

Eruption durations range from 1.6 to 5.1 minutes, and intervals range from 43 to 96 minutes.

In [None]:
data.describe()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(15, 9))
sns.scatterplot(data=data, x='eruptions', y='waiting', color='blue')
ax.grid(axis='y', color='blue', linewidth=0.5, alpha=0.1)
ax.set(xlabel='Duration of eruptions in minutes')
ax.set(ylabel='Interval between eruptions in minutes')

plt.title('Duration of eruption vs intervals', fontsize = 20, c='black')
plt.show()

# Generating clusters
The scatter plot above shows two clear clusters. K-Means does not pick the number of clusters itself, so we plot the SSE (inertia) for several values of k and look for the elbow, which points to the same number: two.

In [None]:
from sklearn.cluster import KMeans

sse = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42, init='random', n_init=10, max_iter=10)
    kmeans.fit(data)
    sse.append(kmeans.inertia_)

f, ax = plt.subplots(1,1,figsize=(15,9))
plt.plot(range(1, 10), sse)
plt.xticks(range(1, 10))
ax.annotate('Optimal number of clusters', xy=(2.05,9000))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.title('SSE for different number of clusters', fontsize = 20, c='black')
plt.show()
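
The elbow in the SSE plot is read by eye, so a second check can help. A minimal sketch, assuming the silhouette score is an acceptable complementary criterion: it is computed for several values of k and the highest value wins. The data here is a synthetic two-blob stand-in (an assumption, not the real dataset), and `X_demo`, `scores` and `best_k` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data (an assumption): two Gaussian blobs,
# one of short eruptions with short waits, one of long eruptions with long waits.
rng = np.random.default_rng(0)
short = np.column_stack([rng.normal(2.0, 0.3, 100), rng.normal(55, 5, 100)])
long_ = np.column_stack([rng.normal(4.3, 0.4, 170), rng.normal(80, 5, 170)])
X_demo = np.vstack([short, long_])

# The silhouette score is undefined for a single cluster, so k starts at 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('k with the highest silhouette score:', best_k)
```

On the real data, `X_demo` would be replaced by the `eruptions` and `waiting` columns.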

# A look at the clusters

In [None]:
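# Cluster on the two measured columns only, so re-running this cell
# does not feed the 'Cluster' column back into K-Means.
features = data[['eruptions', 'waiting']]
kmeans = KMeans(n_clusters=2, random_state=0, init='random', n_init=10, max_iter=10)
kmeans.fit(features)
data['Cluster'] = kmeans.predict(features)
fig, ax = plt.subplots(1, 1, figsize=(15, 9))
sns.scatterplot(data=data, x='eruptions', y='waiting', hue='Cluster')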
ax.grid(axis='y', color='blue', linewidth=0.5, alpha=0.1)
ax.set(xlabel='Duration of eruptions in minutes')
ax.set(ylabel='Interval between eruptions in minutes')

plt.title('Duration of eruption vs intervals', fontsize = 20, c='black')
plt.show()
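
K-Means uses Euclidean distance, and `waiting` spans roughly 43 to 96 while `eruptions` spans about 1.6 to 5.1, so the unscaled fit above is driven mostly by `waiting`. A minimal sketch, assuming standardisation is wanted (the notebook itself does not scale): the features are standardised before clustering and the centres are mapped back to minutes. The data is a synthetic stand-in and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic two-blob stand-in for the real data (an assumption).
rng = np.random.default_rng(1)
demo = np.vstack([
    np.column_stack([rng.normal(2.0, 0.3, 100), rng.normal(55, 5, 100)]),
    np.column_stack([rng.normal(4.3, 0.4, 170), rng.normal(80, 5, 170)]),
])

# Give both columns zero mean and unit variance so neither dominates.
scaler = StandardScaler()
scaled = scaler.fit_transform(demo)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Map the cluster centres back to minutes and sort them by eruption duration.
centers = scaler.inverse_transform(km.cluster_centers_)
centers = centers[np.argsort(centers[:, 0])]
print(centers)
```

With two groups this well separated, scaling is unlikely to change the labels much, but it makes the result less dependent on the units of each column.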

# Regression model

## Regression model on the data without clustering.

In [None]:
from sklearn import metrics
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Predict the waiting time from the eruption duration alone.
y = data['waiting']
X = data['eruptions'].to_numpy().reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor(), LinearRegression(), Ridge(), Lasso()]

# Evenly spaced durations used to draw each fitted curve.
x_values = np.arange(0, 6, 0.01)
x_grid = x_values.reshape(-1, 1)

for model in models:
    # Fit on the training split and score on the held-out split.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score(y_test, y_pred))
    print('-------------------------------------')

    fig, ax = plt.subplots(1, 1, figsize=(15, 9))
    sns.scatterplot(data=data, x='eruptions', y='waiting', color='blue')
    # Refit on the full dataset; this fit is used for the plot only.
    model.fit(X, y)
    pred = model.predict(x_grid)
    sns.lineplot(x=x_values, y=pred)
    plt.title(str(model) + ' - Duration of eruption vs intervals', fontsize=20, c='black')
    plt.show()
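
To make the four printed metrics concrete, here is a small worked example that computes them by hand with the standard library only. The observed and predicted values are hypothetical numbers chosen for easy arithmetic, not results from the models above.

```python
import math

# Hypothetical observed and predicted waiting times, in minutes.
y_true = [79.0, 54.0, 74.0, 62.0, 85.0]
y_hat = [77.0, 56.0, 71.0, 60.0, 88.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]  # 2, -2, 3, 2, -3

# Mean absolute error: average size of the miss, in minutes.
mae = sum(abs(e) for e in errors) / n   # 12 / 5 = 2.4

# Mean squared error: penalises large misses more heavily.
mse = sum(e * e for e in errors) / n    # 30 / 5 = 6.0

# Root mean squared error: MSE brought back to minutes.
rmse = math.sqrt(mse)

# R^2: share of the variance in y_true explained by the predictions.
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse, 'R2:', r2)
```

MAE and RMSE are in the same unit as the target (minutes), which makes them easier to interpret than MSE when comparing the models.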

## Regression model on the data with clustering.

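This was an experiment: I added the cluster labels from the previous section as a second feature. The linear regression and ridge models actually gave better results. Note that the clusters were fit on both `eruptions` and `waiting`, so the `Cluster` label already carries information about the target, which may explain part of the improvement.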
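
Because K-Means was fit earlier on both columns, the `Cluster` feature is partly derived from `waiting`, the value being predicted. One way to avoid that, sketched here under assumptions, is to cluster on `eruptions` alone and fit the clustering on the training rows only. The data is synthetic (the slope, intercept and noise level are arbitrary choices), and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): bimodal durations and a noisy
# linear relationship with arbitrary coefficients.
rng = np.random.default_rng(2)
eruptions = np.concatenate([rng.normal(2.0, 0.3, 100), rng.normal(4.3, 0.4, 170)])
waiting = 33.0 + 10.7 * eruptions + rng.normal(0, 6, eruptions.size)

X_all = eruptions.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, waiting, test_size=0.33, random_state=42)

# Cluster on the predictor only, using training rows only,
# so the cluster label never sees the target or the test rows.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_tr)

# Append the cluster label as a second feature.
X_tr_aug = np.column_stack([X_tr, km.predict(X_tr)])
X_te_aug = np.column_stack([X_te, km.predict(X_te)])

model = LinearRegression().fit(X_tr_aug, y_tr)
print('r2 on held-out rows:', model.score(X_te_aug, y_te))
```

A cluster label built this way is only a coarse version of `eruptions`, so any gain it brings comes from the model, not from knowledge of the target.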

In [None]:
from sklearn import metrics
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same target as before, now with the cluster label as a second feature.
y = data['waiting']
X = data[['eruptions', 'Cluster']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor(), LinearRegression(), Ridge(), Lasso()]

for model in models:
    # Fit on the training split and score on the held-out split.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score(y_test, y_pred))
    print('-------------------------------------')

    fig, ax = plt.subplots(1, 1, figsize=(15, 9))
    sns.scatterplot(data=data, x='eruptions', y='waiting', hue='Cluster')

    # Refit on the full dataset; this fit is used for the plot only.
    model.fit(X, y)
    pred = model.predict(X)
    sns.lineplot(x=X['eruptions'], y=pred)
    plt.title(str(model) + ' - Duration of eruption vs intervals', fontsize=20, c='black')
    plt.show()
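
Both comparisons above rest on a single train/test split, and with a dataset this small the scores can shift noticeably from one split to another. A minimal sketch, assuming k-fold cross-validation is an acceptable alternative: the model is scored on five different splits and the scores are averaged. The data is a synthetic stand-in with arbitrary coefficients, and the names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption): bimodal durations and a noisy
# linear relationship with arbitrary coefficients.
rng = np.random.default_rng(3)
eruptions = np.concatenate([rng.normal(2.0, 0.3, 100), rng.normal(4.3, 0.4, 170)])
waiting = 33.0 + 10.7 * eruptions + rng.normal(0, 6, eruptions.size)
X_demo = eruptions.reshape(-1, 1)

# Five shuffled folds: each row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, waiting, cv=cv, scoring='r2')

print('r2 per fold:', scores)
print('mean r2:', scores.mean(), 'std:', scores.std())
```

The spread across folds gives a rough idea of how much of the difference between two models could be down to the split alone.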

# Some final words.

I came across this dataset by chance and started the notebook without much of a plan, but it turned out better than expected, touching on data visualisation, clustering and regression.

Comments and suggestions for improvement are welcome, and if you found the notebook interesting, please give it an upvote.