## Train | Test Split
#### Train | Test Split Procedure 
```
0: Clean and adjust data as necessary for X and y
1: Split data into train/test sets for both X and y
2: Fit the scaler on the X training data only
3: Scale (transform) both the X training data and the X test data
4: Create the model
5: Fit the model on the X training data and y_train
6: Evaluate the model on the X test data (by creating predictions and comparing them to y_test)
7: Adjust parameters as necessary and repeat steps 5 and 6


In [1]:
import numpy as np 
import pandas as pd  
import matplotlib.pyplot as plt 
import seaborn as sns 


In [3]:
df = pd.read_csv("../DATA/Advertising.csv")

In [4]:
df

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,9.7
197,177.0,9.3,6.4,12.8
198,283.6,42.0,66.2,25.5


In [5]:
X = df.drop('sales',axis=1)
X

Unnamed: 0,TV,radio,newspaper
0,230.1,37.8,69.2
1,44.5,39.3,45.1
2,17.2,45.9,69.3
3,151.5,41.3,58.5
4,180.8,10.8,58.4
...,...,...,...
195,38.2,3.7,13.8
196,94.2,4.9,8.1
197,177.0,9.3,6.4
198,283.6,42.0,66.2


In [10]:
y = df['sales']
y

0      22.1
1      10.4
2       9.3
3      18.5
4      12.9
       ... 
195     7.6
196     9.7
197    12.8
198    25.5
199    13.4
Name: sales, Length: 200, dtype: float64

In [7]:
from sklearn.model_selection import train_test_split

In [9]:
help(train_test_split)

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
    Split arrays or matrices into random train and test subsets.

    Quick utility that wraps input validation,
    ``next(ShuffleSplit().split(X, y))``, and application to input data
    into a single call for splitting (and optionally subsampling) data into a
    one-liner.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        com

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)

In [12]:
# Scale (standardise) the data before fitting the model

from sklearn.preprocessing import StandardScaler 



In [13]:
scaler = StandardScaler()

In [14]:
scaler.fit(X_train)

In [15]:
X_train = scaler.transform(X_train)

In [16]:
X_test = scaler.transform(X_test)

In [18]:
# Create the model and set its hyperparameters

In [19]:
from sklearn.linear_model import Ridge

In [20]:
model = Ridge(alpha=100)

In [21]:
model.fit(X_train,y_train)

In [22]:
y_pred = model.predict(X_test)

In [None]:
# Now Evaluate the model 


In [23]:
from sklearn.metrics import mean_squared_error

In [24]:
mean_squared_error(y_test,y_pred)

7.341775789034126

In [26]:
# Now try a different alpha value

model_two = Ridge(alpha=1)

In [27]:
model_two.fit(X_train,y_train)

In [31]:
y_pred_2 = model_two.predict(X_test)

In [32]:
mean_squared_error(y_test,y_pred_2)

2.3190215794287514

In [33]:
# Note a disadvantage of this approach: alpha has to be re-tuned by hand, re-fitting and re-scoring against the same test set each time

# Cross Validation
Notes

```    
So far we have seen that the Train | Test split method has a disadvantage: it leaves no portion of the data that can report a performance metric on truly "unseen" data.

While adjusting hyperparameters against the test data is a fair technique and is not typically referred to as "data leakage", it is a potential issue when it comes to reporting.
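
The tuning pattern described above can be sketched as a loop. This is only an illustrative sketch: it uses synthetic data from `make_regression` as a stand-in for the Advertising data (an assumption, so the numbers will not match the notebook), and the candidate `alphas` list is hypothetical. The point to notice is that the same test set both picks the winning alpha and supplies the score we would report.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data (200 rows, 3 features), not Advertising.csv.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=101)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Fit the scaler on the training data only, then transform both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hypothetical candidate values for the Ridge penalty.
alphas = [0.1, 1, 10, 100]

test_mse = {}
for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    test_mse[alpha] = mean_squared_error(y_test, model.predict(X_test))

# The test set is used here to choose alpha...
best_alpha = min(test_mse, key=test_mse.get)

# ...and the very same test set produces the number we would report.
print(test_mse)
print("best alpha on this test set:", best_alpha)
```

Because `best_alpha` was chosen by looking at `y_test`, the matching entry in `test_mse` is no longer a score on data the modelling process has never seen.
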
```

If we want a truly fair and final set of performance metrics, we should get them from a final test set that we do not allow ourselves to adjust on.

Recall that the entire reason not to adjust after seeing the final test set is to get the **fairest evaluation of the model**.

The model was not fitted to the final test data, and the model hyperparameters were not adjusted based on the final test data.

## This is truly never-before-seen data!
```
To achieve this in Python with Scikit-Learn, we simply call the **train_test_split()** function twice:
    -- Once to split off the larger training set.
    -- A second time to split the remaining data into a validation set and a test set.
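
The two calls, followed by selection on the validation set and a single final evaluation, might look like the sketch below. It assumes synthetic data from `make_regression` in place of the Advertising data, and the `alphas` list and the `_s` suffix for scaled arrays are hypothetical names chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data (200 rows, 3 features), not Advertising.csv.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=101)

# First call: keep 70% for training, hold out the other 30%.
X_train, X_other, y_train, y_other = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Second call: split the held-out 30% in half (15% validation, 15% final test).
X_eval, X_test, y_eval, y_test = train_test_split(
    X_other, y_other, test_size=0.5, random_state=101
)
print(len(X_train), len(X_eval), len(X_test))  # 140 30 30

# Fit the scaler on the training data only, then transform all three sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_eval_s = scaler.transform(X_eval)
X_test_s = scaler.transform(X_test)

# Hypothetical candidate values; each model is scored on the validation set only.
alphas = [0.1, 1, 10, 100]
models = {}
eval_mse = {}
for alpha in alphas:
    models[alpha] = Ridge(alpha=alpha).fit(X_train_s, y_train)
    eval_mse[alpha] = mean_squared_error(y_eval, models[alpha].predict(X_eval_s))

best_alpha = min(eval_mse, key=eval_mse.get)

# The final test set is touched exactly once, after alpha has been chosen.
final_test_mse = mean_squared_error(y_test, models[best_alpha].predict(X_test_s))
print("chosen alpha:", best_alpha)
print("final test MSE:", final_test_mse)
```

Keeping the choice of alpha on `X_eval_s` means `final_test_mse` comes from data that influenced neither the fit nor the hyperparameter.
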
```     



In [46]:
df = pd.read_csv("../Data/Advertising.csv")

In [47]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [48]:
X = df.drop('sales',axis=1)
y = df['sales']

In [49]:
from sklearn.model_selection import train_test_split

In [53]:
X_train, X_other, y_train, y_other = train_test_split(X,y,test_size=0.3,random_state=101)

In [55]:
# test_size=0.5 takes 50% of the 30% held out, so the test set is 15% of all the data
X_eval , X_test, y_eval,y_test = train_test_split(X_other,y_other,test_size=0.5,random_state=101)

In [56]:
len(df)

200

In [57]:
len(X_train)

140

In [58]:
len(X_eval)

30

In [59]:
len(X_test)

30

In [60]:
# Now scale
from sklearn.preprocessing import StandardScaler

In [61]:
scaler = StandardScaler()

In [62]:
scaler.fit(X_train)


In [63]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_eval = scaler.transform(X_eval)

In [64]:
from sklearn.linear_model import Ridge

In [65]:
model_one = Ridge(alpha=100)

In [66]:
model_one.fit(X_train,y_train)

In [67]:
y_eval_pred = model_one.predict(X_eval)

In [68]:
from sklearn.metrics import mean_squared_error

In [74]:
mean_squared_error(y_eval,y_eval_pred)

7.320101458823866

In [75]:
model_twoo = Ridge(alpha=1)

In [76]:
model_twoo.fit(X_train,y_train)

In [77]:
new_pred_eval = model_twoo.predict(X_eval)

In [78]:
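# Evaluate model_twoo's own validation predictions (new_pred_eval).
# The original cell passed y_eval_pred (model_one's predictions) by mistake,
# so it simply reprinted model_one's score of 7.320101458823866.
mean_squared_error(y_eval,new_pred_eval)
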

In [79]:
y_final_test_pred = model_twoo.predict(X_test)

In [80]:
mean_squared_error(y_test,y_final_test_pred)

2.254260083800517

In [44]:
# Bubble Sort 

a = [2,5,3,8,4,1,9]

n = len(a)

for i in range(n):
    for j in range(0,n-i-1):
        if a[j] > a[j+1]:
            a[j] , a[j+1] = a[j+1] , a[j]

print(a)


[1, 2, 3, 4, 5, 8, 9]


In [45]:
# Insertion sort 

a = [2,8,5,3,9,4]

n = len(a)

for i in range(1,n):
    j = i 

    while j > 0 and a[j-1] > a[j]:
        a[j-1] , a[j] = a[j] , a[j-1]
        j = j-1 

print(a)            


[2, 3, 4, 5, 8, 9]
