In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### AIML Machine Learning
#### Agenda
#### -----------------------
#### Cross Validation
#### Hyperparameter Tuning
#### Class Assignments

Preprocessing -> Splitting -> Train Model -> Test Model -> Approve to deploy if quality is good

#### If you use inferential statistics or statistical modeling, the dataset you have is a sample of the entire population.
#### STEPS:
##### First step is preprocessing, to make your dataset compatible with the inferential statistics/modeling algorithm
##### Second step is splitting the data (Train Split/Test Split); the Train Split is used to develop/train the model, the Test Split to validate the trained model, i.e. to check its quality
##### Third step is to train the model
##### Fourth step is to check the quality of the model
##### If the quality is good, approve and deploy the model in the application
##### If the quality is bad, go back to the first step, i.e. preprocessing.
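
The steps above can be sketched end to end as follows. This is only a minimal sketch on synthetic data: the array shapes, the injected missing values, and the `MIN_SCORE` approval threshold are assumptions made for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption: 3 numeric features, linear target)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)
X[::25, 0] = np.nan  # inject a few missing values so there is something to preprocess

# Step 1: preprocessing (here: drop rows that contain missing values)
mask = ~np.isnan(X).any(axis=1)
X_clean, y_clean = X[mask], y[mask]

# Step 2: split into Train Split and Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y_clean, test_size=0.2, random_state=0)

# Step 3: train the model on the Train Split
model = LinearRegression().fit(X_train, y_train)

# Step 4: check the quality (R^2) on both splits
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# Step 5: approve only if the quality is good enough (hypothetical threshold)
MIN_SCORE = 0.8
approved = test_score >= MIN_SCORE
print(f"train={train_score:.3f} test={test_score:.3f} approved={approved}")
```

If `approved` comes out `False`, the workflow loops back to preprocessing, as described in the last step.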


In [8]:
data = pd.read_csv('50_Startups.csv')

In [9]:
data.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [10]:
features = data.iloc[:, [0, 1, 2]].values  # R&D Spend, Administration, Marketing Spend
label = data.iloc[:, [4]].values           # Profit

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=15)

In [13]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [14]:
model.fit(X_train, y_train)

LinearRegression()

In [15]:
model.score(X_train, y_train)

0.9541516914911967

In [16]:
model.score(X_test, y_test)

0.8841631276365532

#### Overfitted Model
#### TrainScore > TestScore (here 0.954 on the train split vs 0.884 on the test split)
#### The biggest secret to a good model is having the best possible sample set
#### A sample set is good only when the sample represents the entire population
#### Overfitted means that the model memorized the training data.
#### Instead, I want the model to be a GENERALIZED model
#### How to get a good sample set? Answer: Exploration
##### SL - Significance Level  | SL = 0.05
##### CL - Confidence Level | CL = 1 - SL
##### SL signifies how much error my project can tolerate without the quality of the project being affected.
##### CL signifies how much accuracy I need to achieve so that the quality of the project is not affected.

##### CL = 0.95 

#### The Desi Jugaad Method
##### 1. Concept of Exploration --- I make my own Dr. Strange ---- I create my own software explorer
##### 2. Ideal expectation is
#####     TestScore > TrainScore and TestScore >= CL and TrainScore >= CL


In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Requires `features` and `label` from the earlier cells; after a kernel restart,
# re-run them first, otherwise this cell raises a NameError.
CL = 0.95

# Try random_state values 1..99 and report the splits that meet the ideal expectation
for sampleNumber in range(1, 100):
    X_train, X_test, y_train, y_test = train_test_split(
        features, label, test_size=0.2, random_state=sampleNumber)

    model = LinearRegression()
    model.fit(X_train, y_train)
    testScore = model.score(X_test, y_test)
    trainScore = model.score(X_train, y_train)

    if testScore >= CL and testScore > trainScore and trainScore >= CL:
        print("Test Score is {} and Train Score is {} and RS Value is {}".format(testScore, trainScore, sampleNumber))

NameError: name 'features' is not defined

##### Before you take on the project, you need to understand the minimum accuracy expected from the model
##### This expected minimum accuracy is also called the DRAWDOWN
##### As a customer who doesn't know Machine Learning, I will tell you that I want 100% accuracy.
##### What should the answer be? It is not possible, because the sample is not equal to the population


### Cross Validation
#### Goals of Cross Validation
##### 1. To get the minimum threshold of the score with respect to the dataset
##### 2. To get the maximum score that you can tell your ML engineers to expect (you have a statistical reason behind the numbers)
##### 3. To extract the best sample, i.e. the training data that gives the best score

#### How does cross validation work?
#### If K = 5, both the number of iterations and the number of sample splits are 5
#### Dataset -> S1 | S2 | S3 | S4 | S5
#### Iteration 1: S1 is the test dataset, S2 through S5 are the train dataset (train set -> model algo -> trained model -> Score)
#### Iteration 2: S2 is the test dataset, and S1 and S3 through S5 are the train dataset
#### ....and so on
#### In the end, you get 5 scores

#### The goal is to get the minimum threshold of the score and maximum score, and get the best sample
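
The K = 5 procedure described above can be written out as an explicit loop. This is a rough sketch, not the notebook's own code: it assumes the iris data bundled with scikit-learn (`load_iris`) instead of `iris.csv`, and the choice of `LogisticRegression`, shuffling, and `random_state=0` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Assumption: the bundled iris data stands in for iris.csv
X, y = load_iris(return_X_y=True)

K = 5
# Shuffle before splitting because the iris rows are ordered by species
kf = KFold(n_splits=K, shuffle=True, random_state=0)

scores = []
for iteration, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # One split is the test dataset, the remaining K-1 splits are the train dataset
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])
    scores.append(score)
    print(f"Iteration {iteration}: score = {score:.3f}")

min_score = min(scores)                      # minimum threshold of the score
max_score = max(scores)                      # maximum score seen
best_iteration = int(np.argmax(scores)) + 1  # iteration whose train data scored best
print(f"min={min_score:.3f} max={max_score:.3f} best iteration={best_iteration}")
```

`cross_val_score` wraps the same loop in a single call and returns the K scores as an array.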



In [12]:
data = pd.read_csv('iris.csv')

In [13]:
feature = data.iloc[:, :-1].values
label = data.iloc[:, -1].values

In [18]:
data.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


### Hyperparameter Tuning

Parameters are simply the values you supply to a function; they are also called arguments.

Hyperparameters are the parameters of a Machine Learning algorithm's implementation, set before training rather than learned from the data. They are called hyperparameters because changing their values changes the behavior of the model.
Hyperparameter tuning means discovering the best values for them:
identify the best parameters for the given dataset and model.

Is Hyperparameter Tuning Mandatory?
No, not if you achieve the best model using the default configuration.

Hyperparameter tuning is all about finding the best possible parameters' VALUES by trying all valid combinations. 

GOAL: GET THE BEST MODEL BY IDENTIFYING THE BEST POSSIBLE VALUES FOR EACH APPLICABLE PARAMETER
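
Trying all valid combinations can be sketched with scikit-learn's `GridSearchCV`, which cross-validates every combination in a grid and keeps the best one. The estimator (`KNeighborsClassifier`), the grid values, and the bundled iris data are assumptions chosen for illustration; they are not taken from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data as a small example dataset
X, y = load_iris(return_X_y=True)

# Hypothetical grid: 4 x 2 = 8 combinations of hyperparameter values
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}

# Each combination is scored with 5-fold cross validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best parameter values:", search.best_params_)
print("Best mean CV score:", round(search.best_score_, 3))
```

If the best score is no better than what the default configuration gives, tuning adds nothing for that dataset, which matches the point that it is not mandatory.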

In [4]:
import pandas as pd
import numpy as np

In [5]:
data = pd.read_csv('50_Startups.csv')

In [6]:
data.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [7]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [8]:
data['State'] = le.fit_transform(data['State'])

In [9]:
data['State'].unique()

array([2, 0, 1])
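
To see which state each integer stands for, the fitted encoder's `classes_` attribute can be inspected. A small sketch, assuming a hand-typed series holding the five `State` values shown by `data.head()` rather than the full CSV:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumption: the first five State values from data.head(), typed in by hand
states = pd.Series(["New York", "California", "Florida", "New York", "Florida"])

le = LabelEncoder()
codes = le.fit_transform(states)

# classes_ is sorted, and each value's position in it is its integer code
mapping = {name: code for code, name in enumerate(le.classes_)}
print(mapping)         # {'California': 0, 'Florida': 1, 'New York': 2}
print(codes.tolist())  # [2, 0, 1, 2, 1]
```

The codes follow alphabetical order only, so a linear model will read them as if Florida lay between California and New York; scikit-learn documents `LabelEncoder` for target labels and offers `OneHotEncoder` for feature columns.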

In [10]:
data.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,2,192261.83
1,162597.7,151377.59,443898.53,0,191792.06
2,153441.51,101145.55,407934.54,1,191050.39
3,144372.41,118671.85,383199.62,2,182901.99
4,142107.34,91391.77,366168.42,1,166187.94


In [11]:
#data = data.astype({'R&D Spend':'int', 'Adminsitration':'int', 'Marketing Spend':'int', 'Profit':'int'})
#data['R&D Spend'] = data['R&D Spend'].apply(int)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     int64  
 4   Profit           50 non-null     float64
dtypes: float64(4), int64(1)
memory usage: 2.1 KB


In [12]:
data['Administration'] = data['Administration'].astype(int)
data['Marketing Spend'] = data['Marketing Spend'].astype(int)
data['Profit'] = data['Profit'].astype(int)

In [13]:
data.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897,471784,2,192261
1,162597.7,151377,443898,0,191792
2,153441.51,101145,407934,1,191050
3,144372.41,118671,383199,2,182901
4,142107.34,91391,366168,1,166187


In [17]:
# R&D Spend is already float64, so it can be cast to int directly
data['R&D Spend'] = data['R&D Spend'].astype(int)

In [18]:
data.head(2)

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349,136897,471784,2,192261
1,162597,151377,443898,0,191792


In [19]:
features = data.iloc[:,:-1].values
label = data.iloc[:,-1].values

In [22]:
# Demonstrate the score threshold with LogisticRegression
from sklearn.linear_model import LogisticRegression
modelAlgo = LogisticRegression()

In [23]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(modelAlgo,
                        features,
                        label,
                        cv = 5) #5 or 10

ValueError: n_splits=5 cannot be greater than the number of members in each class.
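
The ValueError above is what you would expect from pairing a classifier with this label. `LogisticRegression` is a classifier, and for classifiers `cross_val_score` with an integer `cv` uses stratified folds. `Profit` is a continuous value cast to int, so almost every value becomes its own class with a single member, and no class can be spread over 5 folds. One possible fix, sketched here under assumptions, is to use a regressor such as the `LinearRegression` used earlier. The data below is synthetic and only mimics the shape of the notebook's `features` and `label` (50 rows, 4 columns); the coefficients and noise level are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for features/label from 50_Startups.csv (assumption)
rng = np.random.default_rng(0)
features = rng.uniform(0, 200_000, size=(50, 4))
label = (0.8 * features[:, 0]
         + 0.05 * features[:, 2]
         + rng.normal(scale=5_000, size=50)).astype(int)

# A regressor is cross-validated with plain (unstratified) K folds,
# so a continuous target is not a problem
modelAlgo = LinearRegression()
scores = cross_val_score(modelAlgo, features, label, cv=5)  # 5 or 10

print("Scores per fold:", scores)
print("Minimum threshold:", scores.min())
print("Maximum score:", scores.max())
```

The minimum and maximum of `scores` are the two numbers the cross validation goals ask for; for a regressor each score is an R^2 value, not a classification accuracy.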